Abstract
This paper presents the development and performance evaluation of a host-based misuse intrusion detection system. The misuse detection system employs an ensemble design for classification and an N-gram feature extraction methodology for preprocessing the raw ADFA-LD system call trace data. Inputs to the ensemble classifier are fixed-size patterns whose attributes are N-grams derived from operating system call traces. The ADFA-LD data set comprises system call traces generated by a number of concurrently executing applications in the Linux operating system environment. In addition to the normal or attack-free operating mode, six different attacks are modeled in the data set. Two filters are used to capture the unique signatures of each class and also to reduce the dimensionality of input patterns. The most frequent unique features in the form of N-grams from each class are extracted from the raw trace data. Then, to eliminate the effect of noise, the most frequent patterns regardless of uniqueness are extracted based on the statistical attributes of their occurrence frequencies. The SMOTE algorithm is used to balance the classes in terms of pattern counts. The classifier design is based on a majority voting ensemble with base classifiers including naïve Bayes, support vector machine, PART, decision tree and random forest. The study considers both binary and multiclass problems. In the binary class problem, the two classes are “normal” and “attack”, where the latter is formed by merging the trace data for all six attack signatures into a single class. In the multiclass case, the normal class and the six attack classes are each considered separately. A simulation study was performed to evaluate the performance of the proposed host-based misuse detection system. Multiple performance indicators and metrics including confusion matrix, true positive/negative, and false positive/negative rates were recorded.
The proposed host-based misuse detection system demonstrated very high performance in detecting attacks for the binary class problem, although its performance for the multiclass case would benefit from further improvement.
Keywords
Introduction
It is difficult to imagine the functioning of a society without computing technology in modern times. Because computing, operating and networking systems are highly complex, they possess countless security flaws and are plagued with security vulnerabilities. Exploitation of any of these flaws or vulnerabilities by hackers may result in catastrophic outcomes. Consequently, security in computer-based systems is a critical issue to address. Any unauthorized access to a computing infrastructure constitutes an intrusion, which must be detected without delay.
Intrusions, which may lead to actual attacks, are detected through two broad approaches: misuse (or signature-based) detection and anomaly detection. If a certain attack has occurred in the past and its signature is therefore known to the intrusion detection system, its repeat occurrence is handled by misuse detection. An attack that has never been seen before must instead be identified through anomaly detection.
An Intrusion Detection System (IDS) monitors computer-based systems, and gathers and analyzes information to identify malicious activities or policy violations [2, 20, 43]. IDSs may operate in two different modes based on the data source. One mode is the so-called host-based, which relates to tracking the system’s configuration files to find adverse settings or call traces for processes, threads or applications as well as inspection of stored files, such as those for passwords, to determine if they are compromised [21, 45]. The second operation mode for an IDS is network-based. It checks for known or unknown behaviors of attack code execution as visible through networking communications.
IDSs can monitor system behavior at various levels. One such level entails monitoring calls to operating system services by the application processes. System calls are the primary interfaces between an application and the kernel. Each operating system consists of a number of system processes for performing various low-level operations. An application typically interacts with the operating system through system calls, which result in execution of one or more related system processes. System calls offer information relating to the interaction between an application and the kernel which has the potential to help distinguish the abnormal behaviors. The ordering, type, length and other attributes of system calls made by an application process can provide a unique signature or trace. These signatures can be used to identify known and unknown applications or discriminate between legitimate and illegitimate behavior [1, 41].
A number of benchmark data sets, including DARPA [3] and KDD’99 [4], have served the research community well for network-based intrusion detection studies over the past two decades. However, there is a perceived void in addressing the intrusion detection problem using up-to-date, host-based data that models current-day attack scenarios with sufficient coverage. A recent host-based dataset, ADFA-LD, offers a wealth of simulated present-day attacks to help researchers develop data-driven intrusion detection systems using machine learning techniques [39, 40]. This dataset was originally conceived for the development of anomaly detection systems.
Background and literature survey
Sequence-based datasets are a collection of traces where each trace is a series of system calls, which are collected during the execution of an application process. A first step towards using such data for intrusion detection system development through a data-driven approach such as machine learning is to extract features, which is accomplished in the preprocessing phase. Extracted features can then be used for classification purposes. There are several sequence-based data pre-processing methods such as STIDE [7], PCA [8], and
Pre-processing methods such as
Machine learning and IDS development
Machine learning, in conjunction with other artificial intelligence tools and techniques, offers a fundamental methodology to develop data-driven inductive models that can serve as intrusion detection systems for both misuse and anomaly detection. There are two main approaches to machine learning: symbolic approaches, examples of which are inductive learning of symbolic descriptions such as decision rules and decision trees [13]; and statistical approaches or pattern recognition methods, which include, but are not limited to, k nearest neighbors (kNN) classifiers, Bayesian belief networks (BBN), artificial neural networks (ANN), and support vector machines (SVM) [14]. This categorization is only approximate, and many methods cross the boundary between the two categories.
Machine learning algorithms have specific strengths and weaknesses as individual classifiers. One commonly employed technique to enhance the robustness of classification is to use multiple machine learning classifiers and fuse their outputs through a mathematical procedure. Ensemble methods [3, 25, 27, 42, 46, 47] such as bagging, boosting, stacking, voting and others combine the predictions of multiple classifier models on a sound statistical basis. As collective decision-making algorithms, ensemble methods offer a powerful classification framework because they can address the limitations of individual classifiers. Ensemble learning of a target function occurs by training a number of independent classification algorithms and combining their outputs. Ensembles are a well-established method for realizing highly efficient and accurate classifiers by combining less precise ones. An ensemble classifier can be more reliable and more effective than any one individual classifier because more accurate mappings can be attained by combining the outputs of several base classifiers whose strengths are likely to be complementary.
Benchmark datasets for IDS development
Many studies reported in the literature employ existing benchmark datasets for misuse intrusion detection system (IDS) development and performance evaluation using machine learning and associated techniques [44]. The following paragraphs present misuse or hybrid (both misuse and anomaly) intrusion detection system studies on various benchmark datasets including KDD Cup 1999, DARPA 1998/99, UNM, ADFA-LD and ADFA-WD.
A number of recently published papers applied the SVM classifier to the ADFA-LD dataset [6, 21, 23, 29, 30]. Creech and Hu [23] presented a semantic model for anomaly detection using a short sequence framework for ADFA-LD dataset. They have created a dictionary of words and phrases from the dataset and investigated it with a number of machine learning algorithms including the one-class SVM. They reported an accuracy of 80% with 15% FPR. Xie et al. [30] reported an accuracy of 80% with nearly 15% FPR by applying one-class SVM for anomaly detection. In another study, Xie et al. [6] reported an accuracy of 60% with nearly 20% FPR using a short sequence based approach for the ADFA-LD data set for anomaly detection. Borisaniya and Patel [29] presented another study for misuse detection on the ADFA-LD and ADFA-WD datasets. They used the modified vector space along with the
In a study by Depren et al. [34], authors discussed development of a hybrid IDS architecture on the network-based KDD Cup 1999 dataset which employs both misuse and anomaly detection. They used the C4.5 (J48 in Weka) decision tree algorithm to classify different attacks within the misuse detection framework. For anomaly detection, their IDS used a self-organizing map to model the normal behavior and the deviation from that normal was considered as abnormal. For misuse detection, they stated 99% accuracy and 0.2% FPR for the best performing design. In another study, Kuchimanchi et al. [38] proposed a dimension reduction method using neural network-based feature extraction methods for real-time binary class network-based misuse detection system on the KDD 1999 dataset. They applied neural network based principal component analysis and nonlinear component analysis for feature extraction and dimension reduction. The reported performance using the CART decision tree was 99% accuracy with 0.2% FPR. Yao et al. [18] proposed an enhanced SVM model with weighted kernel function for anomaly intrusion detection using the KDD Cup 1999 dataset. They reported 99.8% precision and 6% FPR for their experiments.
Assem et al. [35] proposed a misuse detection system for binary class problems using the UNM datasets. Their design defined conditional probabilities using the Markov modeling of long sequences of system calls in four different UNM datasets. They computed the class-conditional probabilities of a sequence of system calls using the Markov chain model. Authors reported the performance in terms of accuracy, TPR and FPR on the four UNM data sets for naïve Bayes multinomial (NBm), the C4.5 decision tree, Repeated Incremental Pruning to Produce Error Reduction (RIPPER), support vector machine (SVM), and logistic regression (LR) classification algorithms. Accuracy of their proposed IDS design reportedly ranged from 97% to 99% with 3% to 0.3% FPR across four different UNM datasets. In another study that employed the UNM datasets, Kang et al. [36] proposed a classifier model for misuse detection. They generated fixed-length sequences using the STIDE windowing technique. Authors applied five different classifiers, namely naïve Bayes, C4.5, RIPPER, SVM and Logistic Regression, for misuse detection through 10-fold cross-validation. They reported 100% accuracy with 0% FPR using the support vector machine as the best result.
Mukkamala et al. [19] proposed a model for misuse detection in which they applied the SVM to the DARPA network-level data set for binary classification. They reported accuracy values of 99.7% for normal traces and 99.9% for attack traces. Khan et al. [22] proposed an anomaly detection system design using a combination of SVM and the Dynamically Growing Self-organizing Tree method on the network-based DARPA 1998 dataset. They considered five different classes: a normal class and four attack classes, namely DoS, U2R, R2L and Probe. For the normal and DoS classes, they reported 95% and 97% accuracies, 3% and 2% false positive rates (FPR), and 5% and 3% false negative rates (FNR), respectively.
Laskov et al. [37] compared the performances of supervised and unsupervised learning algorithms in terms of detecting both network-based known and unknown attacks within the framework of binary classification. Their data source consisted of the combination of instances from KDD Cup 1999, DARPA 1998 and DARPA 1999. For pre-processing, the attack data was split into disjoint subsets containing only one attack type while the normal class data were split into disjoint subsets containing only one service type. These subsets were merged, and used to generate three equal-length datasets for training, validation, and testing. They performed metric embedding to transform the data into a metric space since KDD Cup dataset contains categorical and numerical features of various sources and scales. They experimented with various supervised and unsupervised algorithms. The best performing classifier for supervised misuse detection was C4.5 with 95% TPR and 1% FPR. In terms of unsupervised learning, Single Linkage Clustering algorithm performed the best by scoring 70% TPR with 6% FPR. They also reported the performance for anomaly detection using two different classifiers: the SVM as a supervised method scored 80% TPR with 8% FPR while the Gamma algorithm generated 75% TPR with 7% FPR.
Although many studies report high performance for anomaly and misuse intrusion detection systems on several benchmark datasets, the applicability and relevance of these designs are limited because nearly all benchmark datasets, with the exception of ADFA-LD and ADFA-WD, are out of date and do not model many modern-day attacks. ADFA-LD and ADFA-WD do model many modern-day attacks, but their use has mainly been constrained to anomaly detection, with few studies leveraging them for misuse detection [40]. This study aims to address that gap.
ADFA-LD dataset
The study presented in this paper aims to address the need to develop signature-based or misuse intrusion detection systems responsive to the most recent attacks by leveraging the ADFA-LD dataset. The research study employs a combination of pattern recognition, text mining and statistics based approaches for host-based misuse detection using the ADFA-LD dataset of system call traces. The classification algorithm is an ensemble design with naïve Bayes [15], support vector machine (SVM) [16, 17], tree based classifiers [10, 24], and rule based classifiers such as RIPPER [25] or PART [26] as base learners. The ensemble classifier employs the voting algorithm as a combiner for the outputs of base classifiers.
Data set and preprocessing
In this section, we present the ADFA-LD data sets and describe the application of
ADFA-LD dataset
The characteristics of modern computer and network systems have evolved dramatically over the past couple of decades, which mandates updating the benchmark datasets, as older datasets cannot meet the security requirements of modern computers [39]. ADFA-LD [5] is a recently published sequence-based data set of system calls collected on a modern Linux operating system. ADFA-LD contains thousands of traces of system calls, which were collected under a variety of scenarios to mimic real-life circumstances. It consists of a comprehensive collection of system call traces representing recent system-level vulnerabilities and attacks. ADFA-LD was generated on a local Linux server configured to represent a contemporary computer system. This server provides several services, such as a database, remote access, a web server and file sharing, on a fully patched Ubuntu 11.04 operating system (Linux kernel 2.6.38). The SSH, FTP and MySQL 14.14 services are provided on their default ports. Apache 2.2.17 and PHP 5.3.5 are installed for web-based services. In addition, TikiWiki 8.1 is installed as a web-based collaborative tool [6].
Attack types and trace counts
The ADFA-LD dataset consists of three parts; their names as reported by their originators are TRAINING_DATA_MASTER (TDM), VALIDATION_DATA_MASTER (VDM) and ATTACK_DATA_MASTER (ADM) [5]. Training and validation traces were collected during normal operation of the host, ranging from web browsing to LaTeX document preparation. The traces were generated by the UNIX program Audit. Table 1 presents the structure of ADFA-LD including the number and type of traces for each data subset category.
The ADM data subset consists of six types of common cyber-attacks including Adduser, Hydra-FTP, Hydra-SSH, Java-Meterpreter, Meterpreter, and Web-Shell. Each one is provided in 10 different files, and each file contains 8 to 20 traces. Table 2 shows the number of collected traces for each type of attack.
Compared with previous benchmark datasets, the ADFA-LD dataset is much more representative of current cyber-attacks, and is positioned to provide a more convincing and appropriate framework for intrusion detection system (IDS) development and performance evaluation [28].
To classify a process behavior as either “normal” or a specific type of “attack” using raw system call traces, which is a sequence of integers in a range specific to the operating system design, it is necessary, as the initial step, to extract the features from the associated dataset [25, 27]. Pre-processing techniques can be used to extract feature vectors from system call traces. As the data for system call traces have only sequences of numbers corresponding to those system calls made by the applications, there is a resemblance to the text classification domain. Accordingly, a well-known text preprocessing technique called the
Generation of 1:4-grams using a sequence of system calls.
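The N-gram generation step named in the caption above can be illustrated with a minimal sketch. It assumes that "1:4-grams" means all contiguous N-grams for N = 1 through 4; the function name `extract_ngrams` and the sample trace are hypothetical and are not taken from the ADFA-LD data.

```python
from collections import Counter

def extract_ngrams(trace, n_min=1, n_max=4):
    """Count every contiguous N-gram (n_min <= N <= n_max) in one trace.

    `trace` is a list of integer system call numbers. The range 1..4 is an
    assumption based on the "1:4-grams" wording of the caption.
    """
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the trace.
        for i in range(len(trace) - n + 1):
            counts[tuple(trace[i:i + n])] += 1
    return counts

# Hypothetical trace of system call numbers (illustration only).
trace = [6, 11, 45, 33, 192, 33]
ngrams = extract_ngrams(trace)
```

Each key of `ngrams` is a tuple of system call numbers and each value is its occurrence count in the trace; summing such counters over all traces of a class would give per-class occurrence frequencies.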
Number of unique N-grams with their occurrence frequencies in training set for N 
The
The order of
Average values for occurrence of attributes prior to and after duplicate removal.
We also employ the count of occurrence of each unique
The
System call traces of the six attacks have similar patterns, and this similarity carries over to the extracted attributes and patterns, which makes signature-based pattern recognition challenging. To generate unique attributes for each class, attributes that exist in two or more classes are removed from the preprocessed data set, which holds the 250 most frequent attributes for the normal class and the 100 most frequent attributes for each attack class. Specifically, for each of the seven classes, we compared each attribute with all of the attributes in the other classes to see whether it exists in at least one other class; if it does, it is eliminated from the attributes of that class. After this removal, each of the seven classes has a set of unique attributes, with no overlap among the attribute sets of different classes. Through this procedure, the total number of unique attributes across all seven classes was reduced from a high of 850 to 345.
The overall preprocessing approach described up to this point has a potential weakness: a number of attributes that could play a pivotal role in the classification might have been eliminated. For example, the call trace sequence “3 3 3” occurs more than 2500 times across different normal traces, and it is one of the most frequent attributes. Even though it occurs very few times in other classes, it is eliminated due to overlap. It is likely that this and similar sequences are highly correlated with one of the classes and therefore should not be removed from further consideration during the overlap detection and removal phase.
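The overlap removal described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the study: it assumes that shared attributes are dropped from all classes simultaneously, and the class labels and toy attributes are hypothetical.

```python
from collections import Counter

def remove_shared_attributes(class_attributes):
    """Keep, for each class, only the attributes that appear in no other class.

    `class_attributes` maps a class label to the set of its most frequent
    attributes (N-grams represented as tuples).
    """
    # Count in how many classes each attribute appears.
    membership = Counter()
    for attrs in class_attributes.values():
        membership.update(set(attrs))
    # An attribute is unique if exactly one class contains it.
    return {
        label: {a for a in attrs if membership[a] == 1}
        for label, attrs in class_attributes.items()
    }

# Toy example with three classes (illustration only).
top_attributes = {
    "normal": {(3, 3, 3), (4, 5), (6,)},
    "adduser": {(3, 3, 3), (7, 8)},
    "web_shell": {(6,), (9, 9)},
}
unique_attributes = remove_shared_attributes(top_attributes)
```

In this toy example the attribute `(3, 3, 3)` is dropped from both classes that contain it, which mirrors the behaviour of the filter on attributes shared by two or more classes.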
Occurrence counts for all attributes across all instances of seven classes
Note that
Standard deviation values for occurrence of attributes prior to and after duplicate removal.
Analysis of occurrence frequencies of attributes
Ratio values for attributes which are in top 2% (38 attributes). The ratio is given by maximum number of attribute occurrences in one class over total number of occurrences of that particular attribute across all seven classes.
To address this shortcoming of the preprocessing, the following procedure is formulated. It identifies those attributes (N-grams) that occur very frequently in one class while occurring rarely in other classes, yet are tagged for removal due to overlap. Such attributes are not removed even though they may exist in the training data set of more than one class. Occurrence counts for all attributes for each of the seven classes are computed as illustrated in Table 3. The approach initially entails computing statistical properties of each attribute in terms of mean and standard deviation values. Figures 3a and 3b show the average value for the occurrence count of each attribute of the original training set and the preprocessed training set, respectively. The original training set consists of all of TDM and 60% of ADM with
Training and testing data set compositions
We also considered the standard deviation of each attribute in both the original and preprocessed training sets as presented in Fig. 4. There are several attributes in Fig. 4a whose occurrence frequencies have high standard deviation values. Figure 4b shows that only a small number of these attributes are included in the preprocessed training set following duplicate removal.
Detailed performance values for voting ensemble classifier on multiclass validation data set
Confusion matrix for voting ensemble classifier on multiclass validation data set
Performance of voting ensemble classifier on binary class validation data set
As Fig. 3a shows, almost 98% of attribute occurrence frequencies have a mean value that is lower than 10. One objective is to identify those attributes whose occurrence count mean values are considerably higher than those of the rest of the attributes. For this purpose, the top 2% of attributes (N-grams) with the highest standard deviation value in the original training set are selected and included as columns as shown in Table 4. Each row of the same table is allocated to one class. Therefore, this table has seven rows and the values appearing in column
Confusion matrix for voting ensemble classifier on binary class validation data set
“Ratio of the maximum value of a given attribute across all 7 classes to the sum of occurrence counts of the same attribute for all 7 classes”
If the ratio value is close to 1.0, then the associated attribute must have mostly occurred in one class and very few times in other classes.
Performance of voting ensemble classifier on multiclass testing data set
Confusion matrix for voting ensemble classifier on multiclass testing data set
Figure 5 shows the ratio values for all the attributes in the original training set. The mean value of the ratios is 0.73, the median value is 0.80 and the standard deviation is 0.25. Attributes with ratios above 0.90 are added back to the preprocessed training set regardless of whether they appear in duplicate across multiple classes. This is how the multiclass training data set is generated. To create the binary class data set for training, all six attack classes are merged into a single attack class prior to the processing described herein.
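The ratio-based filter can be illustrated with a short sketch. The ratio follows the definition quoted above (maximum per-class occurrence count divided by the total count across all classes) and the 0.90 threshold comes from the text; the function names and the occurrence counts below are hypothetical and chosen only for illustration.

```python
def dominance_ratio(counts_per_class):
    """Maximum per-class occurrence count divided by the total across classes."""
    total = sum(counts_per_class.values())
    return max(counts_per_class.values()) / total if total else 0.0

def attributes_to_restore(occurrence_table, threshold=0.90):
    """Return attributes whose ratio exceeds the threshold.

    `occurrence_table` maps an attribute to a dict of per-class counts.
    """
    return {
        attr for attr, counts in occurrence_table.items()
        if dominance_ratio(counts) > threshold
    }

# Hypothetical counts (not taken from the paper's Table 3).
occurrence_table = {
    (3, 3, 3): {"normal": 2500, "adduser": 10, "web_shell": 20},
    (4, 5): {"normal": 40, "adduser": 35, "web_shell": 25},
}
restored = attributes_to_restore(occurrence_table)
```

Here `(3, 3, 3)` has a ratio of 2500/2530, which is above 0.90, so it would be added back; `(4, 5)` is spread across classes with a ratio of 0.40 and stays removed.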
The ADFA-LD dataset was originally structured to facilitate development and performance evaluation of anomaly intrusion detection systems. Consequently, we reorganized the three ADFA-LD data subsets, namely TDM, VDM and ADM, to facilitate development and performance evaluation of misuse intrusion detection systems. The training set for classifier model generation comprised the entire TDM and 60% of the ADM data set, while the testing set comprised the entire VDM and the remaining 40% of the ADM data set, without any overlap, as shown in Table 5. The ADM data set was split randomly into the two subsets at a 60% to 40% ratio.
For classifier development, the training data set is further split into two data subsets at a ratio of 67% to 33%. The larger data subset is used to generate the classifier model and the smaller subset, the validation data set, is used to prevent overfitting. Performance evaluation is done using the testing data set, which is not used in the model generation phase. For the multiclass case, because the number of instances of each type of attack is considerably lower than the normal class instance count, we equalized the number of instances for each class using the SMOTE algorithm [33].
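A random 60%/40% split of this kind can be sketched as below. The text does not state whether the split was applied to ADM as a whole or per attack type, nor which random seed was used, so the helper `split_traces`, its seed, and the placeholder trace identifiers are all assumptions.

```python
import random

def split_traces(traces, train_fraction=0.6, seed=0):
    """Randomly split traces into two disjoint subsets.

    The seed is fixed only to make this illustration reproducible.
    """
    shuffled = list(traces)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Placeholder identifiers standing in for ADM traces.
adm_traces = list(range(10))
adm_train, adm_test = split_traces(adm_traces)
```

The same helper with `train_fraction=0.67` would produce the further model-building/validation split of the training data described in the text.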
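The core idea of SMOTE, generating synthetic minority instances by interpolating between a minority instance and one of its nearest minority neighbours, can be sketched as follows. This is a simplified illustration in pure Python and not the implementation used in the study; the function name, the random sampling of seed instances, the default `k`, and the toy feature vectors are assumptions.

```python
import random

def smote_like_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points from a list of minority feature vectors.

    Requires at least two minority points. Each synthetic point lies on the
    line segment between a minority point and one of its k nearest neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to x.
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic

# Toy minority class with two features (illustration only).
minority = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
new_points = smote_like_oversample(minority, n_new=4, k=2)
```

Because every synthetic point is a convex combination of two existing minority points, it stays inside the region already occupied by the minority class.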
We used the Weka machine learning workbench on Windows™ 7 (64-bit) desktop computers [Weka 3.9.0]. The simulation study was performed on a computer with the following specifications: Intel® Core™ i5-4690 CPU @ 3.50 GHz, and 16 GB of DRAM.
Classifier development
We initially evaluated many candidate base classifiers individually and identified promising ones with respect to a set of performance metrics including precision, accuracy, false positive/negative rates, and true positive/negative rates. Five winning base classifiers, namely naïve Bayes [15], SVM [17, 31], PART [26], J48 [24] and Random Forest [32], were then incorporated into a voting ensemble classification framework. We employ the voting combiner for the ensemble classifier as it is computationally efficient and simple to implement: voting can combine base classifier outputs using the average or maximum of probability estimates or numeric predictions. For this study, the combination rule for the voting ensemble classifier is “majority voting” on base learner class predictions. Table 6 presents the performance results in terms of true positive rate (TPR), false positive rate (FPR), true negative rate (TNR), false negative rate (FNR), precision, F-Measure, and ROC area for the voting ensemble classifier on the multiclass validation data set. Results indicate that the true positive rate for each class is high, while the false positive rate is low with an overall accuracy rate approximating 96%.
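The majority voting rule can be illustrated with a minimal sketch. The base classifier outputs below are hypothetical, and the tie-breaking behaviour (the label encountered first among tied labels wins) is an assumption of this sketch, not a description of the workbench used in the study.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the class label predicted by the most base classifiers.

    `predictions` holds one predicted label per base classifier.
    Ties are resolved in favour of the label encountered first.
    """
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical outputs of the five base classifiers for one trace.
base_predictions = {
    "naive_bayes": "Web-Shell",
    "svm": "Hydra-SSH",
    "part": "Web-Shell",
    "j48": "Web-Shell",
    "random_forest": "Hydra-SSH",
}
ensemble_label = majority_vote(list(base_predictions.values()))
```

With five base classifiers and two classes a strict majority always exists; with seven classes, ties are possible and a tie-breaking policy has to be chosen.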
Table 7 shows the confusion matrix for the same ensemble classifier. Based on this matrix, all attack behaviors are detected although there is a low level of confusion as to membership to the correct attack class. It is significant to observe that none of the attack traces are classified as normal, which indicates that the missed alarm rate is zero, if only two classes are considered, one being the normal class and the other attack class. For the multiclass case, there are a number of missed alarms for detecting the type of attack. Results indicate that the voting ensemble classifier accurately detects most attacks for their attack class membership and does not miss detecting any attack instances.
Performance of voting ensemble classifier on binary class for testing data set
Confusion matrix of voting ensemble classifier for binary class on testing data set
Performance values of the voting ensemble classifier on the validation data set for the binary classification problem are presented in Table 8. The results show that the voting ensemble classifier performs nearly perfectly for the binary class problem. The true positive rate for the attack traces is 0.993, which means that all but one of the attack instances are detected. According to Table 9, the missed alarm rate is 0.007 due to misclassifying 1 out of 142 attack instances as normal. The overall accuracy is 99.77%.
Table 10 shows the performance results for the multiclass model on the testing data set. The detection rate for the normal class is nearly 100%, as only 2 out of 4372 normal traces are misclassified as Web-Shell attacks. It is relevant to note from the confusion matrix in Table 11 that a missed detection is overwhelmingly an attack trace being misclassified as another attack class rather than as a normal trace. Conversely, only 2 normal traces are misclassified as attack traces (false alarms).
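The reported true positive and missed alarm rates follow directly from the attack counts given in the text (141 of 142 attack instances detected). A short sketch of the arithmetic, with a hypothetical helper name:

```python
def detection_rates(detected, missed):
    """Return (true positive rate, false negative rate) for the attack class."""
    total = detected + missed
    return detected / total, missed / total

# Counts from the text: 1 of 142 attack instances classified as normal.
tpr, fnr = detection_rates(detected=141, missed=1)
```

Rounded to three decimals, `tpr` is 0.993 and `fnr` is 0.007, matching the values quoted for the binary validation set.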
Table 12 presents the performance of voting ensemble classifier applied to the binary class problem. Its performance considerably improves compared to that of the multiclass case. All 301 attacks are detected except two, and there is no false alarm among a total of 4372 normal traces which means all the normal traces have been classified correctly as presented in Table 13. The confusion matrix in Table 13 shows that only 2 traces are misclassified as normal out of 311 attack traces.
The misuse IDS design proposed in this study delivered performance that improved upon that of the design by Borisaniya and Patel [29] on the ADFA-LD dataset, which is the only other misuse detection study in the literature for this dataset.
Conclusions
In this study, we presented the design and evaluated the performance of a host-based misuse intrusion detection system for Linux operating system through the ADFA-LD benchmark dataset. The design employed a voting ensemble classifier for signature-based intrusion detection through
