Abstract
This paper presents the design and performance evaluation of a host-based misuse intrusion detection system for the Linux operating system. The proposed system employs a feature extraction technique based on principal component analysis of operating system call trace data, called Eigentraces, and the k-nearest-neighbor algorithm for classification. The design is evaluated on the ADFA-LD dataset, which comprises one normal class and six attack classes. Feature vectors are formed from raw system call traces by cutting them into fixed-size windows and applying principal component analysis; these vectors serve as templates for the training phase and are classified with the k-nearest-neighbor algorithm. Two variants of the design were evaluated through a simulation study on the ADFA-LD dataset: one considered two classes, namely normal and attack, while the second considered seven classes, namely one normal and six attack classes. In both cases the proposed design achieved an accuracy of 99.98% or higher on the test data. Overall, the misuse intrusion detection system was able to detect the attacks and predict their types.
Keywords
Introduction
Internet and computer networks are inseparable aspects of our daily routines. Managing banking transactions online, exchanging sensitive private and confidential data over networks, and many other online activities are all ubiquitous and need preservation of privacy, confidentiality and protection against all kinds of exploitation. There does not appear to be a shortage of criminals attempting to access online data, accounts and sensitive information repositories with malicious intentions. Consequently, detection of and protection against such intrusions are critical for online infrastructure to play its valuable role in our lives.
Intrusion Detection Systems (IDS) are built using a combination of hardware and software and, as their name implies, are employed to detect intrusions into computing systems. IDS designs are classified into three major categories, namely Signature-based Detection (SD), Anomaly-based Detection (AD), and Stateful Protocol Analysis (SPA) [1]. An SD type IDS detects attacks and intruders by monitoring the activities throughout a network or a system and identifying the pre-determined signatures of abnormal activities. As such, it is a useful method to detect known attacks on a specific system. An AD type IDS generates a statistical model of normal behavior by monitoring the process during an attack-free period. It considers any activity outside the normal range as an anomaly. It is an efficient method to detect zero-day attacks and it is not heavily dependent on the type of computing system. It may suffer from false alarms if the normal behavior is not captured comprehensively, as it will then declare some normal behavior anomalous. An SPA type IDS, which is also known as deep packet inspection, is a resource-intensive technique that relies on vendor-developed universal profiles. These profiles specify how a well-defined protocol should and should not be used. For example, an SPA type IDS can check the length of a command, the number of times a command is repeated, the minimum and maximum values of attributes, and other misbehavior. Unusual or unexpected activity outside the established profile is considered an intrusion threat [2].
The signature-based intrusion detection technique is useful when a proper and up-to-date dataset exists, because the types of threats change over time and the detection process should be able to detect the most recent instances of malicious activity. Furthermore, for effective detection, attack signatures need to be extracted accurately from raw data. There are several techniques to extract signatures from raw data. One prominent technique is Principal Component Analysis (PCA) [3]. The Eigenfaces methodology, which is based on PCA, is a signature-based algorithm for face image analysis. It has demonstrated applicability and superior performance for pattern recognition through dimension reduction and feature extraction [4]. Eigenfaces extracts the major features of a face image and encodes them as templates for training a classification algorithm [5]. Numerous studies of Eigenfaces report high performance for face recognition applications [5, 6, 7, 8].
In the Eigenfaces-based approach, face images are represented as sequences of pixels, each of which is a number. Pixel values are correlated according to the distance between pixels as measured through an appropriate distance metric. If two pixels are neighbors, then their gray-scale or color-coded values are likely to be close to each other unless they are on opposite sides of a very sharp boundary. Traces of system calls made by applications executing concurrently in an operating system environment are somewhat similar to the pixelated representation of images. Such a system call trace will have subsequences of calls that repeat throughout the period of capture for a given session. As system call traces are long sequences of non-negative integers with repeating subsequences, they resemble the raw pixelated representation of face images in some respects.
Studies on the design and development of misuse intrusion detection systems using network-level data abound in the literature, as cited in the next section, while similar studies for detecting misuse intrusions at the operating system level are somewhat scarce. There is a need for studies that address this gap by designing and developing misuse intrusion detection systems based on system call trace data at the operating system level. This paper presents the design and performance evaluation of such a system, which uses a principal component analysis based technique, named Eigentraces, for feature extraction and the kNN algorithm for classifier model generation, operating on system call traces of concurrently executing applications.
Related work on misuse detection
We next present a survey of recent works for misuse intrusion detection system (IDS) design and performance evaluation on several benchmark datasets in the associated domains.
Bhavesh Borisaniya [21] presented a misuse detection study on ADFA-LD and ADFA-WD datasets. They used the modified vector space along with the N-gram feature extraction technique and generated classifier models for both binary and multiple classes. They defined feature vectors using a multitude of N-grams, namely 1-gram, 2-gram, 3-gram and 5-gram. Among the several classifiers which they applied to the binary and multi class data sets, they reported an accuracy of 92% with 20% false positive rate (FPR) by using IBk and J48 classifiers (in Weka) as their best results for the multiclass model, and an accuracy of 96% with 19% FPR for the binary class problem.
In a study by Depren et al., authors reported on the development of a hybrid IDS architecture on the network-based KDD Cup 99 dataset [22]. The IDS design employs both misuse and anomaly detection. They used the C4.5 (J48 in Weka) decision tree algorithm to classify different attacks in datasets for misuse detection. For anomaly detection, the IDS design used a self-organizing map to model the normal mode of operation. They reported 99% accuracy and 0.2% FPR for the misuse detection system.
Assem et al. proposed a so-called SC2.2 misuse detection system for binary class problems using the UNM datasets [23]. Their design defines conditional probabilities using the Markov chain model of long sequences of system calls in four different UNM datasets. They reported the performance of the SC2.2 in terms of accuracy, true positive rate (TPR), and false positive rate (FPR) for naïve Bayes multinomial (NBm), C4.5 decision tree, Repeated Incremental Pruning to Produce Error Reduction (RIPPER), support vector machine (SVM), and logistic regression (LR) classifiers. Classifier accuracies are reported to be ranging from 97% to 99% with 0.3% to 3% FPR for all UNM datasets.
In another study that employed the UNM datasets, Kang et al. proposed a classifier model for binary class misuse detection [24]. They formulated fixed-length system call sequences using the STIDE windowing technique. They applied five different classifiers such as naïve Bayes, C4.5, RIPPER, SVM and LR for misuse detection through 10-fold cross-validation. They reported 100% accuracy with 0% FPR using the support vector machine as the best result.
Laskov et al. compared the performance of supervised and unsupervised learning algorithms in detecting both known and unknown network-based attacks within the framework of binary classification [25]. Their data source consisted of a combination of instances from the KDD Cup 1999, DARPA 1998 and DARPA 1999 data sets. For pre-processing, they split the attack data into disjoint subsets containing only one attack type and the normal data into disjoint subsets containing only one service type. They next merged these subsets to generate three equal-length datasets for training, validation and testing. They performed metric embedding to transform the data into a metric space, since the KDD Cup dataset contains categorical and numerical features of various sources and scales. They experimented with many different supervised and unsupervised algorithms. The best result for supervised misuse detection belongs to C4.5 with 95% TPR and 1% FPR. Among the unsupervised methods, the Single Linkage Clustering algorithm performed better than the others, scoring 70% TPR with 6% FPR. They also reported the performance for anomaly detection using two different classifiers: the SVM as a supervised method scored 80% TPR and 8% FPR, while the Gamma algorithm produced 75% TPR and 7% FPR.
In a study reported by Kuchimanchi et al., the authors proposed a dimension reduction approach using neural network-based feature extraction and a CART classifier for a real-time, binary class, network-based misuse detection system on the KDD 1999 dataset [26]. They performed neural network principal component analysis and nonlinear component analysis for feature detection and dimension reduction. The reported performance using the CART decision tree was 99% accuracy with 0.2% FPR.
As the presentation in this section indicates, many researchers have developed high-performance misuse detection algorithms on a variety of datasets, including several benchmarks. There are fundamental differences among these datasets. The KDD dataset and its derivatives contain network traffic data and are not intended for host-based intrusion detection systems. The UNM trace data are sequences of system calls captured during the execution of a single process. On the other hand, the ADFA-LD dataset represents operating system call traces based on observations of a modern Linux local server offering file sharing, database services, remote access, and web server functionality, among others.
Most designs are for the binary class case while a few designs for the multi-class case reportedly did not demonstrate high level performance. Among those listed, only one study reports a misuse intrusion detection system design on the ADFA-LD dataset which is also used in our study. Their misuse intrusion detection system’s performance is amenable to further improvement as they report relatively high false alarm rates of 20% for classification of both binary and multiple classes.
Data set and methodology
In this section we present the ADFA-LD dataset along with the data preprocessing, application of the PCA based Eigentraces methodology and machine learning classifiers employed in the IDS design.
ADFA-LD
Monitoring a process in a computer system using system call trace sequences is a promising approach to detect malicious activities. ADFA-LD is a recent dataset, a collection of system call sequences intended to help with the development of host-based intrusion detection systems [9]. The Ubuntu Linux operating system, version 11.04, was the host for generating the ADFA-LD dataset. The operating system was reported to be fully patched and provided FTP, SSH and MySQL version 14.14 services on their default ports. Apache version 2.2.17 running PHP version 5.3.5 was enabled to allow for web-based attacks. Also, TikiWiki version 8.1 was installed as a web-based collaborative tool. The ADFA-LD dataset consists of seven different types of system call traces, including normal and six web-based attacks, namely Adduser, Hydra-FTP, Hydra-SSH, Java-Meterpreter, Meterpreter and Web-Shell. Table 1 shows the attack types, effects and vectors.
ADFA-LD covers a pool of Linux system call traces under different conditions, which are compiled by the UNIX server security audit. It is categorized into three different groups, namely TRAINING_DATA_MASTER (TDM), VALIDATION_DATA_MASTER (VDM) and ATTACK_DATA_MASTER (ADM). TDM and VDM are captured during normal processes ranging from web browsing to LaTeX document preparation [10]. Table 2 presents these three ADFA-LD datasets with the type and count of traces for each one. Neither the originators of the data nor our own analysis indicated the presence of noise. None of the three data sets has any missing values.
Intrusion detection studies on ADFA-LD dataset in literature
The ADFA-LD dataset was employed mainly for developing anomaly detection systems as reported by several studies in the literature [11, 12, 13, 14, 15]. Given that our work proposes development of a misuse intrusion detection system, it must be noted that a performance comparison between misuse and anomaly detection systems is not meaningful. Nevertheless, this section presents an overview of anomaly detection studies conducted on the ADFA-LD dataset for the sake of completeness.
Attacks modeled in ADFA-LD dataset
ADFA-LD data subsets and compositions
Xie et al. used a short sequence method for feature extraction and formulation on the ADFA-LD dataset and applied a one-class support vector machine (SVM) for multiclass classification [11]. They reported 80% accuracy with a false positive rate (FPR) of 15%. In another study, Xie et al. reduced the dimensions of the ADFA-LD dataset using a frequency-based multiclass model [12]. They applied kNN and k-Means Clustering (kMC) algorithms for detection of anomalies. Their main conclusion was that the kNN and kMC algorithms were not promising for detecting attacks, as their kMC implementation resulted in 60% accuracy with 20% FPR. In another study, Doyle III created a frequency-based binary class model using the N-gram method and employed the kNN and SVM classifiers, which reportedly delivered 60% accuracy [13].
Haider et al. employed the zero-watermark algorithm in two different studies. In their first study, they proposed a character data zero-watermark inspired statistical based feature extraction strategy for integer data [14]. They evaluated the performance of RBF kernel, SVM and kNN classifiers on the binary class version of the anomaly detection problem. They used several metrics to report their results including the detection rate (DR), which was defined as the ratio of the number of detected anomalous activities to all the anomalous instances in the testing set, and the false alarm rate (FAR), which they defined as the average of the false positive and false negative rates, (FPR + FNR)/2.
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, it accounts for as much of the variability in the data as possible), and each succeeding component, in turn, has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors form an uncorrelated orthogonal basis set. The principal components are the eigenvectors of the covariance matrix, which is symmetric. Collecting these orthogonal vectors as the columns of a matrix defines the linear space into which the data are transformed. Each original data instance vector can then be represented by a smaller number of variables, which makes PCA a useful tool for analysis in high-dimensional spaces.
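As an illustration of the transformation described above, the following minimal Python sketch (assuming NumPy is available) centers the data, takes the eigenvectors of the symmetric covariance matrix, and keeps those with the largest eigenvalues. The function names `pca_fit` and `pca_transform` and the synthetic data are hypothetical and are not taken from the study.

```python
import numpy as np


def pca_fit(X, n_components):
    """Return (mean, components, variances) for data X of shape (samples, features)."""
    mean = X.mean(axis=0)
    centered = X - mean
    # The covariance matrix is symmetric, so eigh is appropriate.
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]  # largest variance first
    return mean, eigvecs[:, order], eigvals[order]


def pca_transform(X, mean, components):
    """Project data onto the retained principal components."""
    return (X - mean) @ components


# Hypothetical data: 200 samples of 5 correlated variables.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(200, 5))

mean, components, variances = pca_fit(X, n_components=2)
Z = pca_transform(X, mean, components)
```

Here `Z` represents each 5-variable instance with 2 variables, and its columns are uncorrelated because the retained eigenvectors are orthogonal.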
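To apply the Eigenfaces method, which we renamed Eigentraces, to the ADFA-LD system call trace data set, we need to consider seven sets of system call traces, one set for each of the seven classes (one normal and six attack classes). The trace vectors of a set are denoted
$$\Gamma_1, \Gamma_2, \ldots, \Gamma_M$$
where $M$ is the number of trace vectors in the set and each $\Gamma_i$ is a column vector of system call values.
The average trace vector is calculated to normalize the trace vectors. The average trace vector $\Psi$ is given by
$$\Psi = \frac{1}{M} \sum_{i=1}^{M} \Gamma_i$$
The matrix $A$ holds the mean-subtracted trace vectors $\Phi_i = \Gamma_i - \Psi$ as its columns:
$$A = [\Phi_1 \; \Phi_2 \; \cdots \; \Phi_M]$$
Given the matrix $A$, the covariance matrix $C$ is computed as
$$C = \frac{1}{M} A A^{T}$$
In the next step, eigenvectors $u_k$ and eigenvalues $\lambda_k$ of $C$ are computed such that
$$C u_k = \lambda_k u_k$$
where the eigenvectors are ordered by decreasing eigenvalue.
To generate templates for storage, compute the set of vectors
$$\Omega_i = [\omega_{i,1}, \omega_{i,2}, \ldots, \omega_{i,M'}]^{T}, \qquad \omega_{i,k} = u_k^{T} \Phi_i$$
where the value of $M'$ is the number of eigenvectors retained.
To determine the class membership of a newly encountered trace, $\Gamma$, the same projection is applied to obtain $\Omega$ with elements $\omega_k = u_k^{T}(\Gamma - \Psi)$.
If only the nearest neighbor is considered for class association of the test trace, then the minimum Euclidean distance
$$\epsilon = \min_{i} \lVert \Omega - \Omega_i \rVert$$
where $\Omega_i$ is a stored template, identifies the template, and hence the class, closest to the test trace.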
K-nearest-neighbors classification algorithm
The k-nearest neighbor (kNN) algorithm is a nonparametric technique for classification. Many studies report it to be competitive for clustering, classification, pattern recognition, and categorization [16]. Training examples for kNN are feature vectors in a multi-dimensional space, each of which is assigned a class label. The basic idea behind the classification phase of this instance-based learning method is to calculate the distance between an unlabeled test vector and every training vector, and then to assign the test vector the class label that is most frequent among its k nearest training vectors.
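The classification idea can be sketched in a few lines of Python. This is only an illustrative majority-vote kNN with Euclidean distance; it does not reproduce the Weka implementation used in the study, and the function name `knn_predict`, the value of `k`, and the toy vectors are assumptions made for the example.

```python
import math
from collections import Counter


def knn_predict(train_vectors, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training vectors."""
    # Pair every training vector's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(vector, query), label)
        for vector, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-dimensional feature vectors with class labels.
train_vectors = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
                 (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
train_labels = ["Normal", "Normal", "Normal", "Attack", "Attack", "Attack"]

near_normal = knn_predict(train_vectors, train_labels, (0.3, 0.3), k=3)
near_attack = knn_predict(train_vectors, train_labels, (4.5, 4.7), k=3)
```

With k = 1 the procedure reduces to assigning the label of the single closest training vector.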
The kNN classifier was applied to a variety of computer security datasets with notable success [17]. Yihua et al. used the kNN classifier on DARPA computer security dataset with different
Number of traces in training and testing data sets
Training and testing data
There are seven different trace types (one normal and six attack types) in ADFA-LD, which are distributed across three folders named TDM, VDM and ADM. To generate the training data set for misuse detection, we used the entire TDM and 60% of the ADM. The ADM data set was randomly split into two subsets of 60% and 40%, and the 60% split was included in the training data set. Consequently, the training data set contains 833 normal traces and 445 attack traces. For the testing data set, we used the entire VDM with 4372 normal traces and the remaining 40% of the ADM with 301 attack traces. Table 3 presents detailed information about how the attack and normal instances in the ADFA-LD data set were split into training and testing data sets.
The length of traces in the ADFA-LD data set varies from 75 to 4494 system calls. To apply the Eigentraces methodology on this dataset, all traces must be equalized in length. One option is to define a window of fixed size and slide it over each trace with a fixed shift, so that each window position yields one fixed-length vector.
Assume a window size of 6 and a shift of 5. The first window covers the first six system calls of the trace.
Next, shifting the window to the right over this sequence by 5 positions or elements yields a window that starts at the sixth system call, so consecutive windows overlap by one element. If the trace ends before the window is full, the remaining positions are blank.
A number that is outside the range of valid system call values can be used for any blank position so as not to affect the outcome. For instance, if valid system call numbers lie between 1 and 100, any number outside this range can be placed in the blank position.
For ADFA-LD, we assigned 76 as the window size and 10 as the shift length or size since the minimum trace size is 75. The window size is implied by the shortest trace length. The choice of a value of 10 for the shift size is a compromise to generate a manageable number of input patterns. This procedure is performed to convert the traces with different lengths into a set of vectors with 76 elements. These trace vectors are stored in the columns of a matrix associated with each of seven classes. Therefore, there are seven different matrices where each one is associated with data for one of six attack classes and the normal class. There is a separate matrix for the training data set and a second matrix for the testing data set. Table 4 shows the dimensions of training and testing data matrices for each of six attack classes and the normal class.
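The windowing procedure can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code used in the study: the function name `window_trace` is hypothetical, the padding value -1 is an assumed out-of-range number, and padding the final partial window is one possible way of handling the end of a trace.

```python
def window_trace(trace, window_size=76, shift=10, pad_value=-1):
    """Cut one variable-length trace into fixed-length windows.

    pad_value is an assumed filler outside the range of valid system call
    numbers; it fills the blank positions of the last window.
    """
    windows = []
    start = 0
    while True:
        window = list(trace[start:start + window_size])
        window += [pad_value] * (window_size - len(window))  # fill blanks
        windows.append(window)
        if start + window_size >= len(trace):  # the trace is fully covered
            break
        start += shift
    return windows


# Small example with a window of 6 and a shift of 5 on a 10-call trace.
small = window_trace(list(range(1, 11)), window_size=6, shift=5)

# A 75-call trace (the shortest length) yields one padded 76-element vector.
shortest = window_trace(list(range(1, 76)))
```

Each returned window has exactly `window_size` elements, so the windows of all traces in a class can be stacked as the columns of that class's matrix.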
Eigentraces methodology
Training and testing matrices
MATLAB
Eigentraces methodology for training set.
Classifier performance evaluation for binary classification on training data split
Classifier performance evaluation for multiclass case on training data split
For classifier development, class instances in the training data set are further split into two parts at a ratio of 67% to 33%. The larger split is used to train the classifier and the smaller split is used to validate its performance. The kNN algorithm implemented in Weka Version 3.8 with
Classifier performance for binary class on test data
Eigentraces methodology for testing set.
Classifier performance for multiclass on test data
Two labels, namely “Normal” and “Attack”, are used for the binary classification. The multiclass problem consists of the “Normal” label and six attack labels. Tables 5a and 5b present the performance metric values and the confusion matrix, respectively, for the kNN classifier applied to the binary class problem. In Table 5a, the TPR for the normal class is 1.0, which means all normal traces are detected. The TPR for the attack class is 0.995, which indicates that 5416 of 5441 attack traces are detected; only 25 of the 13937 normal and attack traces are misclassified. The overall accuracy of this model is 99.82%.
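The reported binary class figures can be checked with a short calculation. The sketch below treats “Attack” as the positive class and assumes, from the normal-class TPR of 1.0, that no normal trace was labeled as an attack; the function name `binary_metrics` is hypothetical.

```python
def binary_metrics(tp, fn, fp, tn):
    """Accuracy, true positive rate and false positive rate from counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "tpr": tp / (tp + fn),
        "fpr": fp / (fp + tn),
    }


# Counts taken from the text: 5416 of 5441 attack traces detected,
# 13937 traces in total, and all normal traces classified correctly.
attack_total = 5441
normal_total = 13937 - attack_total
metrics = binary_metrics(tp=5416, fn=attack_total - 5416, fp=0, tn=normal_total)
accuracy_percent = round(metrics["accuracy"] * 100, 2)
```

Under these assumptions the accuracy rounds to 99.82% and the attack-class TPR rounds to 0.995, in agreement with the values quoted above.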
The results for the multiclass case also indicate that kNN delivers nearly perfect prediction for each of the seven classes. As shown in Table 6a, the TPR for all seven classes is 0.988 or higher, and 99.8% of all traces are classified correctly (Table 6b). The accuracy of this model is 99.78%.
Tables 7 and 8 show the performance evaluation results for the binary and multiclass models, respectively, on the testing data. In those tables, part a presents the performance metrics and part b shows the confusion matrix for the same model.
Data in both tables indicate highly accurate prediction of the class membership of traces. For the binary class problem, 23 out of 195890 traces are misclassified: all of the 185196 normal traces and 10671 out of 10694 attack traces are correctly classified. For the multiclass model, 42 traces are misclassified. In this model, referring to Table 8, all of the 185196 normal traces, 1018 Meterpreter attacks, and 2175 Web-Shell attacks are classified correctly. The numbers of misclassified traces for Adduser, Hydra_FTP, Hydra_SSH, and Java_Meterpreter are 18, 6, 2, and 16, respectively, which are less than 0.05% of total traces for each class. Furthermore, 26 attack traces are misclassified as “Normal” while 16 attack traces are classified as belonging to another attack class.
Overall, the accuracy of both models (binary and multiclass classifiers) is 99.98% or higher. This suggests that Eigentraces in combination with kNN is a very promising approach for misuse detection in operating system call traces in the ADFA-LD context.
Conclusions
This paper presented the development of a host-based misuse intrusion detection system (IDS). The proposed design employs the Eigentraces methodology for feature extraction and pattern formation, and the kNN algorithm for classification on the ADFA-LD data set. The ADFA-LD dataset entails seven different classes, one for normal and six for different attacks, where the data are sequences of system calls (traces) collected on the Linux operating system. The attacks include Adduser, Hydra_FTP, Hydra_SSH, Java_Meterpreter, Meterpreter, and Web_Shell. The IDS performance was evaluated for two different scenarios. In the first scenario, two classes were created, one for the normal traces and a second class for all six attacks combined. The original ADFA-LD dataset was split into two subsets, one for training and another for testing, where all attacks as well as the normal class patterns were included in both the training and testing datasets. The second scenario addressed the problem as a seven-class case. Both the training and testing data had samples from all seven classes.
Performance of the IDS was evaluated through a simulation study. For the case of binary classification, only 23 attack traces were misclassified as normal, while all 185196 normal traces and the remaining 10671 attack traces were correctly classified with no false alarms. For the multiclass case, all 185196 normal traces were correctly classified. In addition, all 1018 Meterpreter attacks and all 2175 Web_Shell attacks were classified correctly. Misclassifications occurred for a small fraction of all traces for the other four attacks: 18 Adduser, 6 Hydra_FTP, 2 Hydra_SSH, and 16 Java_Meterpreter attack traces were misclassified, and the misclassified traces constitute less than 0.05% of total traces for each attack class. In conclusion, the proposed IDS design performance was very high for identifying attacks and predicting the type of attack.
