A multiclass cascade of artificial neural network for network intrusion detection

Abstract

This paper presents a cascade of ensemble-based artificial neural networks for multi-class intrusion detection (CANID) in computer network traffic. The proposed system learns a number of neural networks connected as a cascade, with each network trained using a small sample of training examples. The cascade structure uses each trained neural network as a filter to partition the training data, so that a relatively small sample of training examples, together with a boosting-based learning algorithm, is used to learn the neural network parameters for each successive partition. The performance of the proposed approach is evaluated and compared on the standard KDD CUP 1999 dataset as well as a more recent dataset, UNSW-NB15, composed of contemporary synthesized attack activities. Experimental results show that the proposed approach can efficiently detect various types of cyber attacks in computer networks.

Keywords

Intrusion detection, artificial neural network, cascading classifiers, ensemble learning, AdaBoost

1 Introduction

With the exponential growth in the number of online services, the number of computing devices and users connected to computer networks and the Internet has increased significantly. In parallel, a myriad of associated security vulnerabilities and intrusive behaviors has emerged, causing disclosure of sensitive information, disruption of online services and systems, and unauthorized access to restricted resources. Some common forms of attack include port scanning, probing for information, injecting viruses and worms, and denial of service (DoS) [1].

The ever-increasing variety of attacks demands a highly effective intrusion detection system that can not only detect well-documented forms of intrusion but also learn to detect new forms of attack. Classical intrusion detection systems are commonly divided into signature-based systems, anomaly-based systems, or a hybrid of the two. Signature-based systems detect intrusion by monitoring system usage patterns and comparing these patterns against well-known signatures of misuse. Such systems require frequent updates of the intrusion signature database to remain effective. Anomaly-based systems use the notion of normal behavior versus intrusion, and a classification algorithm is used to mark any activity as normal or intrusion. Hybrid approaches combine signature-based and behavior-based detection. Recently, such systems have received much attention due to their generalization characteristics [3]. Examples of these techniques include, amongst others, artificial neural networks (ANNs), principal component analysis (PCA), support vector machines (SVMs), and k-nearest neighbor (k-NN) classifiers [4–8].

Although supervised learning techniques such as SVMs, ANNs, decision trees, and ensemble learning have been used in the past to build IDSs, they either fail to detect infrequently occurring attacks or have a very high false positive rate. Moreover, because of the large volume of available training data, the learning time and memory requirements of these algorithms are very large. To overcome these difficulties, a novel filtering mechanism is introduced that, when used along with boosting-based neural network learning, results in an effective intrusion detection system without requiring the processing of very large sparse matrices.

Our main observation regarding the performance of multi-class learning algorithms is that, for skewed datasets, such algorithms ignore the sparsely represented classes and learn a solution that favors only the most dominant classes at the expense of the others. Such algorithms still attain a high overall accuracy because a few classes dominate the overall distribution. The well-cited network intrusion dataset from KDD-CUP 99 [9] is an example of such a learning problem: three classes constitute about 98% of the training data, whereas the remaining classes, representing the less common intrusions, constitute only about 2% of the total data.

The remainder of this paper is organized as follows. Section 2 describes the boosting-based neural network learning algorithm, the novel example filtering mechanism and the cascade structure for detecting intrusion. A detailed description of the adopted datasets, experimental settings to evaluate the proposed approach, and the corresponding results is provided in Section 3. Finally, Section 4 concludes the paper with discussions and highlights for some future directions.

2 Proposed method

This section begins with a brief review of boosting-based ANN learning [10, 11] that can be used to learn weights of a given neural network using AdaBoost. This review is followed by the presentation of a cascade structure and an associated example filtering mechanism used to learn an effective multi-class classifier by combining several binary classifiers connected as a decision tree or cascade. The cascade structure is a generalization of one-vs-remaining encoding strategy of building a multi-class classifier by combining several binary classifiers in the form of a tree structure. The cascade is designed greedily by selecting the most discriminative partition of the set of classes and a classifier is then trained using a binary encoding of the partition. Once the classifier has been trained, the training data is divided into two parts by using the labels assigned to the examples by the trained classifier. Each partition, hopefully, contains examples belonging to a subset of the total classes and hence the learning problem is divided into two smaller multi-class learning problems. This procedure is repeated for the two sub-problems and hence a multi-class decision cascade structure is learned.

2.1 AdaBoost based neural network learning

AdaBoost [12] is one of the most successful boosting algorithms that constructs a highly accurate classifier ensemble of moderately accurate classifier instances called base learners. It takes n labeled training examples, $(\bar{x}_{i}, y_{i}), i = 1, \dots, n$, as input and iteratively selects T classifiers, h_t, by maintaining an adaptive weight distribution over the training examples. AdaBoost constructs the final ensemble by taking a linear combination of the selected classifiers using:

$H(\bar{x}) = \operatorname{sign}\left(\sum_{t = 1}^{T} \alpha_{t} \, h_{t}(\bar{x})\right)$ (1)

where α_t is the weight of the classifier instance h_t and is computed using an error rate of h_t.
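As an illustration of Eq. (1), the following is a minimal sketch of discrete AdaBoost with exhaustively searched decision stumps as base learners. It is not the boosting-based ANN learner used in this paper. The function names (`fit_stump`, `adaboost`, `H`) and the toy data are hypothetical, and the classifier weight is computed with the textbook rule α_t = ½ ln((1 − ε_t)/ε_t), where ε_t is the weighted error of h_t.

```python
import math

def stump_predict(x, feature, threshold, polarity):
    """Decision stump: returns +polarity if x[feature] >= threshold, else -polarity."""
    return polarity if x[feature] >= threshold else -polarity

def fit_stump(X, y, w):
    """Exhaustively pick the stump with the smallest weighted error."""
    best, best_err = None, float("inf")
    for feature in range(len(X[0])):
        for threshold in sorted({x[feature] for x in X}):
            for polarity in (1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump_predict(xi, feature, threshold, polarity) != yi)
                if err < best_err:
                    best, best_err = (feature, threshold, polarity), err
    return best, best_err

def adaboost(X, y, T=5):
    """Return a list of (alpha_t, stump_t) pairs for labels y_i in {-1, +1}."""
    n = len(X)
    w = [1.0 / n] * n                           # uniform initial distribution
    ensemble = []
    for _ in range(T):
        stump, err = fit_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)   # guard against log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified examples, then renormalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(xi, *stump))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def H(ensemble, x):
    """Final classifier of Eq. (1): sign of the weighted vote."""
    score = sum(alpha * stump_predict(x, *stump) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1
```

For example, `adaboost([[0.0], [1.0], [2.0], [3.0]], [-1, -1, 1, 1], T=3)` returns three weighted stumps whose combined vote `H` separates the two classes at the threshold 2.0.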

Single-node decision trees, commonly referred to as stumps, have frequently been used as base classifiers in AdaBoost [13, 14]. Baig et al. [10] introduced a new representation of a decision stump using homogeneous coordinates to learn the weights of a single-layer perceptron. Their method, called Boostron, represents a decision stump as the product of a weight vector $\bar{W}$ and an extended feature vector $\bar{X}$, given as $s(x_{i}) = \bar{W} \cdot \bar{X}_{i}^{T}$. This representation of stumps, when used with the confidence-rated version of AdaBoost [13], learns the weights of a linear classifier equivalent to a perceptron.
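One way to read this representation is sketched below, under stated assumptions: a threshold stump on a single feature is written as a dot product with a feature vector extended by a constant 1, so that a weighted sum of stumps collapses into a single weight vector, i.e. a linear classifier. The helper names, and the choice to append the constant at the end of the vector, are illustrative assumptions and are not taken from [10].

```python
def extend(x):
    """Extended (homogeneous) feature vector: the features followed by a constant 1."""
    return list(x) + [1.0]

def stump_as_weights(n_features, feature, threshold):
    """Weight vector W such that W . X^T = x[feature] - threshold."""
    W = [0.0] * (n_features + 1)
    W[feature] = 1.0
    W[-1] = -threshold
    return W

def s(W, X):
    """Real-valued stump output: dot product of weights and extended features."""
    return sum(w * v for w, v in zip(W, X))

def combine(weighted_stumps):
    """Collapse a list of (alpha_t, W_t) pairs into the single vector sum_t alpha_t * W_t."""
    dim = len(weighted_stumps[0][1])
    return [sum(alpha * W[k] for alpha, W in weighted_stumps) for k in range(dim)]
```

Because each stump is linear in the extended vector, the sign of `s(combine(...), extend(x))` gives the same decision as the sign of the weighted sum of the individual stump outputs, which is why the boosted ensemble behaves like a perceptron.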

The Boostron algorithm has been extended further to learn parameters of an ANN with a single hidden layer and a single output neuron [11] such as the one shown in Fig. 1. The extension of Boostron uses a transformed set of examples and a layer-wise iterative traversal of neurons in the network to fine tune its parameters. It uses two simple problem reductions:

Learning an output neuron is reduced to that of Perceptron learning, and

Learning a hidden neuron is reduced to that of Perceptron learning,

to learn weights of the hidden neurons and the output neuron by the use of Boostron algorithm during an iterative traversal of the neurons in a given feed-forward ANN. The main algorithm by Baig et al. [11] that uses these reductions along with the Boostron algorithm is shown as Algorithm 1.

Algorithm 1 Learning a linear feed-forward ANN using AdaBoost [11]

Require: Training examples $({\bar{x}}_{1}, y_{1}) \dots ({\bar{x}}_{N}, y_{N})$ where

$\bar{x_{i}}$ is a training instance and y_i ∈ {-1, + 1} the corresponding class label

P is the number of iterations over ANN layers

Randomly initialize all weights in the range (0, 1)

Randomly assign features to each hidden neuron.

for j = 1 to P do

Compute transformed training examples $(\bar{X_{i}}, y_{i}), i = 1, 2, \dots, N$ where $\bar{X_{i}}$ = $[X_{0}^{1}, X_{1}^{1}, \dots, X_{m_{0}}^{1}]$ and $X_{i}^{1}$ = $f^{2} (f^{1} (x_{i}^{1}))$

for each hidden layer neuron H_j do

Use the Boostron algorithm and the transformed training examples, $(\bar{X_{i}}, y_{i})$, to learn the weights $w_{j_{k}}^{1}$ of H_j where k = 0, 1, …, m₀

Obtain the weights $w_{1 i}^{1} = \frac{W_{i}}{w_{1 j}^{2}}$ of hidden neuron H_j

end for

Compute transformed training examples $(\bar{X_{i}^{2}}, y_{i}), i = 1, 2, \dots, N$

where $\bar{X_{i}^{2}}$ = $[x_{0}^{2}, x_{1}^{2}, \dots, x_{m_{1}}^{2}]$

Use the Boostron algorithm and the training examples $(\bar{X_{i}^{2}}, y_{i})$ to learn the weights $w_{0}^{2}, w_{1}^{2}, \dots, w_{m_{1}}^{2}$ of the output neuron O₁

end for

Output the learned ANN weights

The boosting-based ANN learning method can learn effective network parameters for a given problem but, like many other supervised learning algorithms, it suffers from overfitting, especially when the distribution of training examples is very skewed, as is typical of an IDS with a large number of classes. To overcome these limitations, a novel cascade structure and an example filtering method are introduced, which can handle both a large number of classes and a skewed distribution of training examples.

2.2 The cascade structure

The construction of the proposed cascade is based on the observation that it may be possible to partition the classes so that the training examples belonging to classes in one part of the partition can be accurately discriminated from those belonging to classes in the other part. Once such a partition has been found, the procedure can be applied recursively to divide the resulting problems into even smaller subproblems containing fewer classes, and the solutions can finally be combined to form a classifier. For the 23-class KDD-cup intrusion detection dataset used in our experiments, an accurate classifier can be obtained to separate the smurf class from the remaining classes. Such a classifier can be used to derive two smaller problems, one containing a majority of examples belonging to the smurf class and the other containing a negligible fraction of smurf examples. Each of these problems can be solved recursively until each part has instances of a single class. During experiments, it has also been observed that an accurate classifier can be learned by selecting a small sample of training examples from each part of the partition in such a way that the distribution of training examples in the reduced problem is not very skewed. Therefore, we have devised a simple approach that converts a K-class problem into that of learning a binary classifier, which in turn reduces the bigger problem to two subproblems, each generally having a smaller set of classes. A detailed algorithm based on the proposed method is given as Algorithm 2. It reduces a K-class problem to a binary classification problem by partitioning the classes into two sets, uses the boosting-based ANN learning algorithm [11] to fit a binary classifier on the relabeled training dataset, and independently repeats the process for the subproblems so formed. This partitioning process is stopped if

instances of a single dominant class remain in a partition, or

the number of examples reaching a partition is less than a predefined threshold.
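The observation above about drawing a small, not very skewed sample for each binary subproblem could be sketched as follows. The paper does not specify the sampling scheme, so this is only one plausible reading: the function name `balanced_binary_sample` and the `per_side` cap are hypothetical, and the class counts in the usage note are made up for illustration.

```python
import random

def balanced_binary_sample(y, target, per_side, seed=0):
    """Return indices of at most `per_side` examples of class `target`
    and at most `per_side` examples drawn from all remaining classes."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == target]
    neg = [i for i, label in enumerate(y) if label != target]
    chosen = (rng.sample(pos, min(per_side, len(pos)))
              + rng.sample(neg, min(per_side, len(neg))))
    rng.shuffle(chosen)                      # avoid ordering the sample by class
    return chosen
```

With labels `["smurf"] * 1000 + ["normal"] * 50 + ["rare"] * 5` and `per_side=40`, the returned sample contains 40 smurf examples and 40 examples from the remaining classes, regardless of how dominant smurf is in the full data.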

Algorithm 2 Build Cascade

Require: Examples $(\bar{x_{1}}, y_{1}) \dots (\bar{x_{n}}, y_{n})$ where

$\bar{x_{i}}$ is a training instance and y_i ∈ {1, 2, …, K} are labels, and l is the number of partitions to use

1: if K = 1 or number of training examples is less than a threshold then

2: Label the leaf node with the dominating class in the training data

3: return

4: end if

5: for each possible partition P of the K classes into two sets P₁ and P₂ do

6: Create a binary classification problem by relabeling each y_i as +1 if y_i ∈ P₁ and -1 if y_i ∈ P₂.

7: Learn a binary classifier B using Boosting based ANN learning algorithm.

8: Choose this partition if it results in the most accurate classifier B amongst all such classifiers

9: end for

10: Partition the training data D into two parts D₁ and D₂ using the predictions of the best classifier B

11: Recursively repeat the above steps for each partition

Algorithm 2 takes as input a K-class learning problem and uses a partitioning mechanism to construct a binary classification problem, whose learned classifier is then used to partition the training data. For a problem involving K classes, the number of possible partitions into two sets grows exponentially with K (there are 2^(K-1) - 1 ways to split K classes into two non-empty sets), and hence finding an optimal partition becomes intractable even for a moderate number of classes. Therefore, in our experiments we considered only K different partitions of the K classes, each obtained by dividing the classes into two sets such that one set contains a single class while the second set contains all remaining classes. Such partitioning is equivalent to encoding the classes using the one-vs-remaining strategy along with an example filtering step. The class best discriminated from the remaining classes is used to divide the learning problem into smaller subproblems at each cascade stage. Figure 2a shows the structure of such a cascade.
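The one-vs-remaining variant of Algorithm 2 described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the binary learner is a hypothetical nearest-centroid stand-in (`train_centroid`) rather than the boosting-based ANN, accuracy is measured on the training examples themselves, and the dictionary-based tree layout and the `min_examples` threshold are assumptions made for brevity.

```python
from collections import Counter

def train_centroid(X, y):
    """Hypothetical stand-in for the boosted ANN: nearest-centroid binary classifier."""
    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    pos = mean([x for x, t in zip(X, y) if t == 1])
    neg = mean([x for x, t in zip(X, y) if t == -1])
    def predict(x):
        d_pos = sum((a - b) ** 2 for a, b in zip(x, pos))
        d_neg = sum((a - b) ** 2 for a, b in zip(x, neg))
        return 1 if d_pos <= d_neg else -1
    return predict

def majority(y):
    """Most frequent label, used to name a leaf."""
    return Counter(y).most_common(1)[0][0]

def build_cascade(X, y, min_examples=2):
    classes = sorted(set(y))
    # Stopping rules: a single class remains, or too few examples reach this node.
    if len(classes) == 1 or len(X) < min_examples:
        return {"leaf": majority(y)}
    best_acc, best_clf = -1.0, None
    for c in classes:                        # the K one-vs-remaining partitions
        binary = [1 if t == c else -1 for t in y]
        clf = train_centroid(X, binary)
        acc = sum(clf(x) == b for x, b in zip(X, binary)) / len(X)
        if acc > best_acc:
            best_acc, best_clf = acc, clf
    # Example filtering: split the data by the predictions of the best classifier.
    pos = [(x, t) for x, t in zip(X, y) if best_clf(x) == 1]
    neg = [(x, t) for x, t in zip(X, y) if best_clf(x) == -1]
    if not pos or not neg:                   # the classifier failed to split the data
        return {"leaf": majority(y)}
    return {
        "clf": best_clf,
        "pos": build_cascade([x for x, _ in pos], [t for _, t in pos], min_examples),
        "neg": build_cascade([x for x, _ in neg], [t for _, t in neg], min_examples),
    }
```

Each recursive call receives strictly fewer examples than its parent, so the construction terminates even when the binary classifiers are imperfect.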

A traversal of the cascade that starts at the root and then follows the subtree corresponding to the predicted class of an instance $\bar{x}$ can be used to assign a label to $\bar{x}$. The recursive traversal used to assign a label to an instance $\bar{x}$ is shown as Algorithm 3.

Algorithm 3 Compute Label of $\bar{x}$

Require: Instance $\bar{x}$ to be labeled, and

Cascaded classifier C

1: if C does not have descendants then

2: set the class label of the node as the predicted label of $\bar{x}$.

3: return

4: end if

5: Use the classifier at the root of C to compute the label y of $\bar{x}$

6: Recursively compute label of $\bar{x}$ by using subtree corresponding to the computed label.
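A compact sketch of Algorithm 3 is given below. The `Node` layout, the hand-written threshold rules standing in for trained classifiers, and the leaf labels are all hypothetical and serve only to show the traversal.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

@dataclass
class Node:
    label: Optional[str] = None                                  # set on leaves only
    classifier: Optional[Callable[[Sequence[float]], int]] = None
    positive: Optional["Node"] = None                            # subtree for prediction +1
    negative: Optional["Node"] = None                            # subtree for prediction -1

def compute_label(node, x):
    """Follow the cascade from the root until a leaf is reached."""
    if node.classifier is None:              # no descendants: return the leaf label
        return node.label
    y = node.classifier(x)                   # binary decision at this stage
    return compute_label(node.positive if y == 1 else node.negative, x)

# Hand-built two-stage cascade with made-up threshold rules.
cascade = Node(
    classifier=lambda x: 1 if x[0] > 100 else -1,
    positive=Node(label="smurf"),
    negative=Node(
        classifier=lambda x: 1 if x[1] < 0.5 else -1,
        positive=Node(label="normal"),
        negative=Node(label="other attack"),
    ),
)
```

Here `compute_label(cascade, [500, 0.1])` stops at the first stage with the label "smurf", while `compute_label(cascade, [10, 0.9])` passes through both stages and is labeled "other attack".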

3 Experimental settings and results

A detailed description of the adopted datasets, the experimental settings and the obtained results are presented in this section. Three experiments have been performed to evaluate the proposed cascade structure for two intrusion detection datasets: KDD Cup 99 and UNSW-NB15. A comparison of the proposed cascade with the standard feed-forward ANN trained with sigmoid activation function in the hidden layer is also provided for the recent UNSW-NB15 dataset.

3.1 Datasets description

A subset of the KDD Cup 99 (KDD’99) dataset [9] has been used to empirically evaluate the proposed method. Since its first use in the International Knowledge Discovery and Data Mining Tools Competition in 1999, it has been a gold-standard intrusion detection dataset used by a large number of researchers in their experimental work [15–19]. A detailed description of the dataset can be found in [] and a summary of the example distribution is given in Table 1. The dataset is dominated by very few classes, which makes it an interesting optimization problem: a large class of algorithms converge to a suboptimal solution that ignores the sparse classes while still attaining high overall accuracy.

A recent network intrusion detection dataset, UNSW-NB15 [20], comprising contemporary attacks, has also been used to evaluate the proposed network intrusion detection system. This dataset contains a hybrid of modern normal and attack behaviors represented using 49 features and containing nine attack categories. A partition of the overall dataset into training/testing datasets is also provided. The training partition consists of 175,341 instances whereas the testing dataset contains 82,332 instances. Table 2 lists the overall class distribution in the test and training data.

3.2 Experimental settings

In our first set of experiments, five iterations of 2-fold cross-validation have been used to evaluate the learned classifier on the KDD-cup intrusion detection dataset. In each iteration, the dataset has been randomly split into two non-overlapping partitions. A small sample of training examples from one of the partitions has been used for training, while the examples in the other partition have been used for testing. Average values of the performance measures are reported in the following section on results. In the second set of experiments, with the UNSW-NB15 dataset, a small randomly selected fraction (about 2%) of the dataset has been used for training and the whole testing dataset has been used to evaluate the performance of the proposed method.
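The evaluation protocol for the first set of experiments can be sketched as an index generator. This is an assumed reading of the description above: the name `five_by_two_splits`, the default `train_fraction` of 1% (taken relative to the training partition), and the fixed seed are illustrative choices rather than details confirmed by the paper.

```python
import random

def five_by_two_splits(n_examples, train_fraction=0.01, iterations=5, seed=0):
    """Yield (train_idx, test_idx) pairs: `iterations` random halvings, 2 folds each."""
    rng = random.Random(seed)
    indices = list(range(n_examples))
    for _ in range(iterations):
        rng.shuffle(indices)
        half = n_examples // 2
        folds = (indices[:half], indices[half:])   # two non-overlapping partitions
        for k in (0, 1):
            source, test = folds[k], folds[1 - k]
            # Train on a small random sample of one partition, test on the whole other one.
            size = max(1, round(train_fraction * len(source)))
            yield rng.sample(source, size), list(test)
```

Averaging a performance measure over the ten (train, test) pairs produced by this generator corresponds to the fold-wise averaging reported in the results.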

While building the classifiers, classes have always been partitioned into two sets, one containing a single class and the second containing all the remaining classes. For example, a classifier discriminating Smurf traffic (i.e. class no. 19 in KDD’99) from the remaining classes has been placed at the root, followed by Normal versus the remaining attacks, and so on.

A boosting-based ANN with twenty hidden neurons and one output neuron has been used as the classifier at each stage of the proposed cascade structure. The resulting cascade is similar to the one shown in Fig. 2b, with an ANN used as the classifier in each internal node. The example filtering process, which uses the ANN classifier at a node, eliminates one of the classes at each stage of the cascade, and hence the corresponding examples are also eliminated from successive stages.

3.3 Results

The proposed system has been evaluated using the accuracy, precision, recall and F1-score measures over the KDD’99 and UNSW-NB15 intrusion detection datasets. Since a major objective of any network intrusion detection system is to discriminate normal network traffic from intrusion, the first set of results presents the performance of the proposed system in discriminating normal traffic from traffic representing some form of intrusion.
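For reference, the four measures, together with the false positive and false negative rates, can be computed from a binary confusion matrix as sketched below. Intrusion is assumed to be the positive class, so a false positive is normal traffic marked as intrusion. The function name is hypothetical, and the counts in the usage note are made up for illustration and are not taken from the tables of this paper.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard measures from confusion-matrix counts (positive class = intrusion)."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,  # normal marked as intrusion
        "false_negative_rate": fn / (fn + tp) if fn + tp else 0.0,  # intrusion marked as normal
    }
```

With the made-up counts `binary_metrics(tp=90, fp=10, fn=10, tn=890)`, precision, recall and F1-score all equal 0.9 while accuracy is 0.98, which illustrates how accuracy can exceed the other measures when normal traffic dominates.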

For the two datasets, confusion matrices along with the four performance measures for detecting intrusion without marking the actual intrusion type are given in Table 3. For the KDD’99 dataset these performance measures have been computed using the results of a single fold, whereas the test results of a complete run are reported for the UNSW-NB15 dataset. For the KDD’99 dataset the trained cascade has a low false positive rate (i.e. normal traffic marked as intrusion) of 3.77% and a low false negative rate (i.e. intrusion detected as normal traffic) of 1.26%. The values of accuracy, precision, recall, and F1-score for this single experiment are also reasonable. For the UNSW-NB15 dataset the false positive and false negative rates are poorer than the corresponding values for the KDD’99 dataset.

The next set of results presents the overall performance of the system using five runs of the two-fold cross-validation scheme for the KDD’99 dataset, as described above. Table 4 reports the fold-wise and average test performance of the trained system for the entire testing dataset. The reported results show that the proposed learning strategy has resulted in an intrusion detection system with an overall accuracy of 99.36%, with both precision and recall above 0.97 and an F1-score greater than 0.96. These values have been obtained on a large testing dataset consisting of 50% of the overall data, whereas only a very small fraction of the training data (about 1%) has been used for training the classifier.

Table 5 provides further insight into the results by reporting class-wise average values of the four performance measures for eight dominant classes. These results have been obtained by computing the corresponding values for each of the five two-fold runs and averaging them. The classifier trained for intrusion detection has high accuracy for fifteen classes but very low values for the remaining measures. Because these classes are sparsely represented in the overall training and testing datasets, the system has been able to achieve high overall values of the performance measures even without having high values for these classes.

A similar set of results for the UNSW-NB15 dataset is summarized in Table 6. These results reveal that the proposed system can detect intrusion successfully, but the type of intrusion is identified poorly in a number of cases. The average values of accuracy, precision, and recall are 86.40, 53.19 and 60.71, respectively. It is also important to note that, unlike a typical intrusion detection system, the proposed scheme marks the less frequently occurring classes as intrusion because of the cascade structure; however, the actual label assigned to such instances might be incorrect.

The last set of results compares the proposed cascade-based approach with a two-layer neural network having twenty hidden neurons with sigmoid activation function. In the previous experiments, only a small fraction (about 5%) of the training data has been used for learning a classifier, whereas in this experiment a larger subset (about 30%) of randomly chosen training examples has been used for comparing the two algorithms. Each experiment has been performed several times and the average performance values for detecting intrusion are reported in Table 7. The proposed approach outperforms the standard feed-forward neural network for detecting intrusion. Because of the cascade structure and the filtering mechanism used, each filtered example contributes to the error accumulation only once. The overall changes in the four performance measures, i.e. Proposed − ANN, are reported in Table 8 and show that the improvement obtained by using the proposed approach is significant. By comparing the results presented in Tables 3 and 7, we can also make the interesting observation that the results obtained with a smaller fraction (5%) of the training dataset are better than those obtained when a larger fraction (30%) of the training data is used to build the classifier.

4 Conclusion and discussion

An effective method of learning a multi-class classifier, along with the results obtained for two intrusion detection datasets, has been presented. The proposed method uses a cascade of boosting-based ANNs to create an effective multi-class classifier and is similar, in principle, to the one-vs-remaining strategy with an additional example filtering step. The intrusion detection system trained using the proposed method has very high overall accuracy, precision, recall, and F1-score for the KDD’99 dataset, while these measures are relatively lower for the UNSW-NB15 intrusion detection dataset. The reported results also revealed that the trained classifier had high performance for most of the well-represented classes. Although the intrusion detection rate of the classifier trained using the proposed structure has been very high, for extremely sparse classes the proposed intrusion detection system has been unable to discriminate between the various types of intrusion.

Two orthogonal research directions can be taken from this point onwards: (i) the proposed structure for learning a classifier can be tested on more classification tasks involving a large number of classes, and (ii) the proposed intrusion detection system can be further refined using a filtering and example weighting strategy that favors the sparse classes a little more than the remaining classes. A theoretical analysis of the proposed cascade for learning an effective classifier that handles a large number of classes is another possible future direction.

Acknowledgments

The authors acknowledge the support of Lahore University of Management Sciences (LUMS) and the Higher Education Commission of Pakistan (HEC) for conducting this research work. The third author would like to acknowledge the support provided by King Abdulaziz City for Science and Technology (KACST) through the Science and Technology Unit at King Fahd University of Petroleum and Minerals (KFUPM) during this work through project No. 11-INF1658-04 as part of the National Science, Technology and Innovation Plan.

References

1. Simmonds, Sandilands and van Ekert, An ontology for network security attacks. In Applied Computing, Springer, 2004, pp. 317–323.

2. Axelsson, Intrusion detection systems: A survey and taxonomy. Technical report, 2000.

3. Tsai C.-F., Hsu Y.-F., Lin C.-Y. and Wei-Yang, Intrusion detection by machine learning: A review, Expert Systems with Applications 36(10) (2009), 11994–12000.

4. Mukkamala, Janoski and Sung, Intrusion detection using neural networks and support vector machines. In Neural Networks, 2002 IJCNN’02 Proceedings of the 2002 International Joint Conference on, volume 2, 2002, pp. 1702–1707. IEEE.

5. Zhang, Jiang and Kamel, Intrusion detection using hierarchical neural networks, Pattern Recognition Letters 26(6) (2005), 779–791.

6. Kim D.S. and Park J.S., Network-based intrusion detection with support vector machines. In Information Networking, 2003, pp. 747–756. Springer.

7. and Wang, An adaptive network intrusion detection method based on PCA and support vector machines. In Advanced Data Mining and Applications, 2005, pp. 696–703. Springer.

8. Liao and Vemuri V.R., Use of k-nearest neighbor classifier for intrusion detection, Computers & Security 21(5) (2002), 439–448.

9. KDD Cup 1999 dataset for network-based intrusion detection systems. Available on: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.

10. Baig M.M., Awais M.M. and El-Alfy E.-S.M., Boostron: Boosting based perceptron learning. In Neural Information Processing, volume 8834 of Lecture Notes in Computer Science, 2014, pp. 199–206. Springer.

11. Baig M.M., El-Alfy E.-S.M. and Awais M.M., Learning rule for linear multilayer feedforward ANN by boosted decision stumps. In Neural Information Processing, 2015, pp. 345–353. Springer.

12. Freund and Schapire R.E., A decision-theoretic generalization of on-line learning and an application to boosting, Journal of Computer and System Sciences 55(1) (1997), 119–139.

13. Schapire R.E. and Singer, Improved boosting algorithms using confidence-rated predictions, Machine Learning 37(3) (1999), 297–336.

14. Viola and Jones, Rapid object detection using a boosted cascade of simple features. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, 2001, pp. 313–321.

15. Feng, Zhang, Hu and Huang J.X., Mining network data for intrusion detection through combining SVMs with ant colony networks, Future Generation Computer Systems (2013).

16. and Liu, A method of SVM with normalization in intrusion detection, Procedia Environmental Sciences 11, Part A (2011), 256–262.

17. Altwaijry and Algarny, Bayesian based intrusion detection system, Journal of King Saud University – Computer and Information Sciences 24(1) (2012), 1–6.

18. Amiri, Yousefi M.M.R., Lucas, Shakery and Yazdani, Mutual information-based feature selection for intrusion detection systems, Journal of Network and Computer Applications 34(4) (2011), 1184–1199.

19. Bolón-Canedo, Sánchez-Maroño and Alonso-Betanzos, Feature selection and classification in multiple class datasets: An application to KDD Cup 99 dataset, Expert Systems with Applications 38(5) (2011), 5947–5957.

20. Moustafa and Jill, The evaluation of network anomaly detection systems: Statistical analysis of the UNSW-NB15 data set and the comparison with the KDD99 data set, Information Security Journal: A Global Perspective (2016).