Mutual clustered redundancy assisted feature selection for an intrusion detection system

Abstract

Intrusion detection is very important in computer networks because the widespread use of the internet makes computers more prone to cyber-attacks. With this motivation, the Intrusion Detection System (IDS) has emerged as a paradigm and attracted considerable research interest. However, a major challenge in IDS is the presence of redundant and duplicate information, which causes a serious computational burden in network traffic classification. To solve this problem, we propose a novel IDS model based on statistical processing techniques and machine learning algorithms. The machine learning algorithms include Fuzzy C-Means (FCM) and Support Vector Machine (SVM), while the statistical processing techniques include correlation and joint entropy. FCM clusters the training data and SVM classifies the traffic connections. Correlation is used to discover and remove duplicate connections from every cluster, while joint entropy is applied to discover and remove duplicate features from every connection. For experimental validation, three standard datasets, namely KDD Cup 99, NSL-KDD and Kyoto2006+, are considered, and performance is measured through Detection Rate, Precision, F-Score and accuracy. Five-fold cross validation is performed on every dataset by changing the traffic, and the obtained average performance is compared with existing methods.

Keywords

Intrusion Detection System; Normalization; Entropy; Correlation; FCM; SVM; accuracy

1. Introduction

In recent years, the internet has become a routine and important necessity for people. Owing to its widespread use in fields such as entertainment, teaching and electronic communication, the demand for the internet has risen very quickly. However, this widespread use also exposes people to many problems, because different kinds of cyber-attacks can be injected into computer networks. In general, an attack or malicious intrusion that enters computer information servers may break the security policies, i.e., availability, integrity and confidentiality (AIC). Hence there is a serious need to develop novel cyber-security mechanisms, and this has attracted the attention of researchers from both academia and industry. Even though many security mechanisms such as firewalls, user authentication and data encryption are available, many organizations still suffer from different kinds of cyber-attacks [1]. Furthermore, Intrusion Detection Systems (IDSs) and Intrusion Prevention Systems (IPSs) can be used as security measures in the network that take responsibility for detecting and preventing malicious activities when the traditional firewall cannot provide effective protection against cyber-attacks. These systems can be developed as software applications or as hardware appliances; they analyze the network traffic automatically and notify the management office of security concerns if required [18].

In general, according to the methodology followed for intrusion detection, IDSs are broadly divided into two categories: pattern matching methods and statistical anomaly methods. The former, also called misuse-based or signature-based IDS, detects known attacks based on the patterns stored in the IDS database [17]. In these models, the audit logs are checked for patterns that are interpreted as signs of previous attacks. If the IDS finds that the pattern of a newly incoming record matches any of the patterns stored in the IDS database, it generates an alarm to notify the security manager. However, the major drawback of pattern matching IDS models is their inability to identify unknown attacks whose patterns are not stored in the IDS database. The second category, called anomaly-based IDS, searches for abnormal behaviors. In these models, the patterns of normal behavior are stored. If the IDS finds that a new incoming record deviates from the profile of normal patterns, the record is considered an attack and the IDS generates an alarm to notify the security manager. Even though this model can identify unknown attacks, its main drawback is a higher false positive rate [31,45].

Even though current IDS models have attained promising results in the detection of several kinds of attacks, the major challenge is the size of current network traffic data [2]. Such huge volumes of traffic data slow down the process and may lead to unsatisfactory detection performance because of the difficulty of computing over them. Classifying data of this size generally incurs heavy computational complexity. For example, consider the standard intrusion dataset KDD Cup99. This dataset has almost five million training connections and two million testing connections. Data at this scale can cause the IDS to fail or to perform poorly. Moreover, large-scale datasets contain duplicate, redundant and noisy information, which poses critical challenges to data modelling and knowledge discovery. Further, the datasets are labeled in such a manner that connections differing only by small changes in their features are treated as different connections. This is one more reason why the size of the dataset increases and places a heavy burden on the system.

To solve these problems, in this paper we propose a new IDS model that concentrates mainly on the removal of duplicate connections in intrusion datasets. Our focus is on the training data, because better performance requires more training data, yet as the training data grows the system suffers from huge computational complexity. Hence, we propose a novel connection selection mechanism followed by a feature selection mechanism, both based on linear and non-linear statistics of the training data. Furthermore, we employ a clustering mechanism through the Fuzzy C-Means algorithm over the training data before subjecting it to feature selection. A simple and adaptive normalization process is also proposed to make the data uniform in nature. For connection selection we employ correlation, and for feature selection we employ Mutual Information between the traffic connections of the training data. At testing, we apply feature selection through the Sliding Window assisted Mutual Redundancy method, and the obtained features are fed to an SVM classifier for classification.

The rest of the paper is structured as follows. The literature survey is outlined in Section 2. Section 3 presents the full-fledged details of the proposed IDS model. Simulation experiments are discussed in Section 4 and conclusions are provided in Section 5.

2. Literature survey

Due to the possibility of different kinds of network-compromising attacks, the system needs to be updated continually, and each update adds new features to the database. In misuse-based IDS, the new features belong to the characteristics of different attacks, while in anomaly-based IDS, the new features belong to the characteristics of non-attacks (normal traffic). Because of this continuous updating process, the database grows quickly, and detection takes much more time for each newly arriving record. Feature selection is therefore important: instead of training on the entire connection, only the few features that have similar semantics to the features already trained are updated. Hence feature selection, and the reduction in computational complexity it brings, is a main research direction in IDS, and many researchers have developed methods for it [15,22,25,33].

Amiri et al. [6] proposed two methods for feature selection based on the linear and non-linear properties of intrusion data. This approach proposed a new measure called feature goodness for feature selection. For linear and non-linear analysis, they employed the linear correlation coefficient and mutual information, respectively. For classification, they employed an improved version of the Support Vector Machine (SVM) called Least Squares SVM (LS-SVM). Experimental analysis is done on the KDD Cup 99 dataset and performance is measured through classification accuracy.

Similarly, Ambusaidi et al. [3–5] also focused on feature selection based on the statistical properties of network traffic connections in IDS. They proposed a new method called Flexible Mutual Information based Feature Selection (FMIFS) to extract only a subset of features that are discriminative and efficient in representing intrusion data. They also employed the Linear Correlation Coefficient (LCC) for linearity analysis between traffic connections. They successfully extracted a small set of features that are independent and contribute more towards the class. For classification, they applied LS-SVM, and the simulation experiments are conducted on three standard intrusion datasets: KDD Cup99, NSL-KDD and Kyoto2006+.

Zhao et al. [48] proposed a new feature selection algorithm called Redundant Penalty between Features based on Mutual Information (RPFMI) to select optimal features. RPFMI considers three factors during feature selection: the redundancy between features, the effect between selected features and classes, and the relationship between classes and candidate features. Two datasets, KDD Cup99 and Kyoto 2006+, are employed for experimental validation, and performance is measured through accuracy.

To solve the feature selection problem and to select only optimal features for network traffic connections in IDS, Mohammadi et al. [28] proposed four feature selection algorithms: Feature Grouping based on Pairwise MI, Feature Grouping based on LCC (FGLCC), Multivariate LCC based Feature Selection (MLCFS) and Feature Grouping based on Multivariate Mutual Information (FGMMI). For classification, they applied LS-SVM, and the simulation experiments are conducted on the KDD Cup99 and NSL-KDD datasets. Because both linear and non-linear features are considered, they assumed that the approach can be implemented on any kind of IDS. However, the major problem is the presence of duplicate connections at the training phase, which introduces a computational burden on the system.

Song et al. [40,41] proposed a Modified Mutual Information based Feature Selection (MMIFS) method for intrusion detection. After selecting features through MMIFS, they employed the C4.5 classifier for classification. For simulation, they used the KDD Cup99 dataset, and performance is measured through accuracy.

Farahani [12] proposed a new method called Cross-Correlation based Feature Selection (CCFS) and employed four classifiers for classification, namely K-Nearest Neighbor (KNN), Decision Tree (DT), Naïve Bayes (NB) and SVM. The main intention of CCFS is dimensionality reduction and thereby a reduction of the computational burden. For simulation, four datasets are considered, namely KDD Cup99, NSL-KDD, AWID and CIC-IDS2017, and performance is measured through accuracy, recall and precision.

Unlike the methods that employ statistical measures for feature selection, Ji et al. [21] considered a signal processing method. They carried out the work in three phases: feature selection, visual analysis and classification. For feature selection they employed the Multi-level Discrete Wavelet Transform (MDWT) [10], for visual analysis Principal Component Analysis (PCA), and for classification SVM. The NSL-KDD dataset is used to validate the developed IDS model. However, for connections describing network traffic, high and low frequencies do not carry any particular significance.

Some authors focused on clustering and employed different clustering methods for intrusion detection [20,29]. Sandosh et al. [37] employed a modified K-means clustering algorithm for data segmentation. In this approach, the authors proposed an enhanced IDS via agent clustering and classification based on outlier detection (EIDS-ACC-OD). At the start, pre-processing is employed to eliminate the unnecessary spaces through outlier detection, and then the modified K-means is employed for segmentation. Finally, they employed the KNN algorithm for the classification of attacks.

Yang et al. [46] employed a fuzzy aggregation model to reduce the size of the training data by proposing a Modified Density Peak Clustering algorithm (MDPCA) [34]. This approach tries to reduce the size of the training data and to address the imbalance of the samples by dividing the training data into several blocks with similar semantics. Every subset is trained with the help of Deep Belief Networks (DBNs) [16]. For every subset, one sub-DBN is used for training, through which the system can learn high-level abstract features and reduce the data dimensions. For simulation, they employed two datasets, NSL-KDD and UNSW-NB15, and performance is measured with accuracy, recall, precision and F-score.

Jackins and Punithavathani [19] proposed an unsupervised method for anomaly detection based on a hybrid clustering algorithm that combines the Fuzzy C-Means algorithm and the Incremental Support Vector Machine (ISVM). After FCM and ISVM, the processed data is fed to the DBSCAN algorithm for further anomaly detection. For simulation, they employed two datasets, KDD Cup99 and the Gure KDD Cup database [32], and performance is measured with the true positive rate.

Hajisalem and Babaie [14] proposed a hybrid method for intrusion detection that combines two algorithms, Artificial Bee Colony (ABC) [13] and Artificial Fish Swarm (AFS) [7,26]. To remove redundant information from the intrusion dataset, they employed FCM and correlation-based feature selection algorithms. Additionally, they employed If-Then rules generated through the CART [9] method to detect normal and anomalous records based on the selected features. The simulation is done on two datasets, NSL-KDD and UNSW-NB15, and performance is measured through detection rate, false positive rate, computational complexity and time cost.

Elbasiony et al. [11] proposed a hybrid IDS algorithm for the detection of both anomalies and misuses. For misuse detection, they employed the Random Forests classifier to train on the intrusion patterns and then matched test patterns against the trained patterns. For anomaly detection, they used weighted K-means clustering to cluster the network connections. The KDD Cup99 dataset is used for the simulation experiments. Recently, one more method based on K-means has been proposed for intrusion detection: Meng et al. [27] proposed an improved version of the K-means algorithm for intrusion detection in computer networks. Initially, the PCA algorithm is applied to reduce the dimensionality of the dataset, and then outlier detection is used to eliminate outliers that have a great impact on the final clustering results. The initial clustering centers are chosen with the help of distance such that an optimal local solution can be obtained, and then K-means is used to get the final cluster centers. Simulation is done on KDD Cup99, and performance is measured through detection rate and false positive rate.

3. Proposed approach

3.1. Overview

In this paper, we propose a new intrusion detection mechanism based on statistical processing and machine learning algorithms. Our main intention is to remove duplicate information from the intrusion detection process, at both training and testing. To detect intrusions, an IDS must first be trained on them. At the training phase, detection performance increases with the size of the data, but so does the computational complexity. Hence our proposed model focuses on reducing duplicate or redundant data connections and also the redundant features of every connection. For this purpose, we propose a new feature selection mechanism called Mutual Clustered Redundancy based Feature Selection (MCRFS), which combines two methods based on the linear and non-linear statistics of the data. We employ the linear method to eliminate duplicate connections and the non-linear method to remove duplicate features. The linear method uses the Pearson Correlation Coefficient (PCC) and the non-linear method uses Mutual Information (MI). In our model, we first apply a data pre-processing approach based on normalization. Next, the normalized data is clustered with the Fuzzy C-Means (FCM) algorithm. Clustering is applied only to the training data; once it is completed, the data is subjected to feature selection and the final features are used to train the SVM. At testing, we apply normalization followed by feature selection and classification. The complete block diagram of the proposed IDS model is shown in Fig. 1.

Fig. 1.

Block diagram of proposed IDS model.

3.2. Data preprocessing

Data pre-processing is an important step in IDS. Even though it consumes considerable processing time, it is necessary because the raw data comes from heterogeneous environments and can be inconsistent, incomplete, redundant and noisy [24]. Hence the raw data must be transformed into a form suitable for analysis and knowledge discovery. For example, in the NSL-KDD dataset, every traffic connection is represented with a set of features that are not all in the same format: some features are symbolic, some are numerical and some are binary. To process this dataset, all the features need to be in a uniform format. Similarly, in the CIC-IDS2017 dataset, the feature called “Fwd Header Length” appears twice in every traffic connection, and another feature called “Flow Packets/s” contains abnormal values like ‘NaN’ and ‘Infinity’. In some other intrusion datasets, the traffic connections are incomplete and have missing features. To sort out all these problems, data pre-processing is needed, and it varies from dataset to dataset. For the NSL-KDD dataset, the pre-processing follows a normalization process, while for the CIC-IDS2017 [38] dataset, it includes the removal of duplicate features and the replacement of incomplete or missing features with zeros.

Consider the NSL-KDD dataset, in which every connection has 41 features. Among these 41 features, features 2, 3 and 4 are symbolic and the remaining features are numerical. Here feature 2 is protocol, feature 3 is flag and feature 4 is service. Apart from these three, the remaining 38 features are continuous with numerical values. The details of the three symbolic features are shown in Table 1.

Table 1
Details of symbolic features of NSL-KDD dataset

Protocol type	Flag	Service
TCP, UDP, ICMP, ARP	OTH, REJ, RSTO, RSTOS0, RSTR, RSTRH, SHR, SF, S0, S1, S2, S3, SH	Aol, http_443, http_8001, http_2784, domain_u, ftp_data, auth, bgp, courier, tftp_u, uucp_path, csnet_ns, ctf, daytime, time, discard, domain, echo, eco_i, ecr_i, efs, exec, finger, gopher, harvest, hostnames, http, imap4, IRC, iso_tsap, klogin, kshell, ldap, link, login, smtp, mtp, name, netbios_dgm, netbios_ns, netbios_ssn, netstat, nnsp, nntp, ntp_u, other, pm_dump, pop_2, pop_3, printer, private, red_i, remote_job, rje, shell, sql_net, ssh, sunrpc, supdup, systat, telnet, tim_i, urh_i, urp_i, uucp, ftp, vmnet, whois, X11, Z39_50
Total = 4	13	70

As shown in Table 1, there are four different protocols (TCP, UDP, ICMP and ARP), 13 different flag types (OTH, REJ, RSTO, RSTOS0, RSTR, RSTRH, SHR, SF, S0, S1, S2, S3, SH) and 70 different service types in total. These three features are subjected to normalization. For normalization, we consider the probability of occurrence of each feature value. For example, consider the Protocol column in the NSL-KDD dataset. In this column, we first count the total number of occurrences of the TCP protocol and divide the obtained count by the total length of the column. The step-by-step process of normalization is as follows:

Step 1: Consider the dataset X with size $M \times N$ , where M is the total number of connections and N is the total number of features used to represent each connection.

Step 2: Fetch the columns of Protocol, Flag and Service as $\begin{array}{l} F_{P} = X (:, 2); \\ F_{F} = X (:, 3); \\ F_{S} = X (:, 4); \end{array}$

Where $F_{P}$ , $F_{F}$ and $F_{S}$ denote the Protocol, Flag and Service feature columns, respectively.

Step 3: Count the occurrences of every sub-feature (distinct value) in the feature column as $\begin{matrix} (1) & M_{i} = \sum_{l = 1}^{L} strcmp (Feature (i), Feature_X (l)) \end{matrix}$

Where $Feature_X$ represents an individual feature column with L entries, $Feature_X (l)$ is its lth entry, and $Feature (i)$ represents the ith sub-feature of that column; strcmp is taken to return 1 when its two arguments match and 0 otherwise. For example, if we consider $Feature_X$ as $F_{P}$ , then $Feature (i)$ will be TCP, UDP, ICMP or ARP.

Step 4: Measure the probability of each feature as $\begin{matrix} (2) & P F_{i} = \frac{M_{i}}{Length (Feature_X)} \end{matrix}$

Where $M_{i}$ is the total number of occurrences of feature i and $Length (Feature_X)$ denotes the total size of respective column.

Step 5: Replace every occurrence of the ith feature in the dataset X with its probability value $P F_{i}$ .

For other datasets, incomplete connections are completed by padding them with a sufficient number of zeros. Similarly, for datasets whose connections contain abnormal values such as NaN and Infinity, these values are replaced with 0.
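The normalization steps above, together with the zero replacement for abnormal values, can be sketched in Python as follows. This is a minimal illustration on toy data rather than the authors' implementation; the helper names `probability_encode` and `clean_numeric` and the sample columns are hypothetical.

```python
import math
from collections import Counter


def probability_encode(column):
    """Replace each symbolic value by its relative frequency in the column.

    Counter gives M_i (Eq. 1); dividing by the column length gives PF_i (Eq. 2).
    """
    counts = Counter(column)
    total = len(column)
    return [counts[value] / total for value in column]


def clean_numeric(value):
    """Map missing, NaN, or infinite entries to 0.0; pass other numbers through."""
    if value is None:
        return 0.0
    number = float(value)
    return number if math.isfinite(number) else 0.0


# Toy Protocol column (hypothetical values, not taken from NSL-KDD).
protocol = ["tcp", "udp", "tcp", "icmp", "tcp", "udp", "tcp", "tcp"]
encoded = probability_encode(protocol)

# Toy numeric column with the abnormal values mentioned in the text.
flow_rate = ["12.5", "NaN", "Infinity", None, "3"]
cleaned = [clean_numeric(v) for v in flow_rate]

print(encoded)
print(cleaned)
```

Here every `tcp` entry becomes 5/8, every `udp` entry 2/8 and the single `icmp` entry 1/8, so the symbolic column ends up in the same numeric range as the other features.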

3.3. Clustering

Clustering is applied here only to the training data. Since the training data is very large, the classifier suffers from huge computational complexity at classification when it matches test data against training data. An IDS needs a large number of traffic connections in the training phase to perform well; however, as the size of the training data increases, the complexity also rises. Hence we focus on making the training data smaller yet more informative, that is, on training with large data in a compact manner. Clustering is one possible way to do this, since it groups data with similar semantics and represents the data in the form of clusters. Consider the NSL-KDD dataset, whose original training data has 125,973 connections. Training on all these connections places a huge burden on the system. Hence we apply clustering to the training data and partition it into several clusters. The number of clusters is entirely user dependent. For instance, we use five clusters for the NSL-KDD dataset.

In our method, we use the popular Fuzzy C-Means (FCM) clustering algorithm. FCM was initially proposed by Bezdek [8]. It is used here because of properties such as simplicity of implementation, output validity, heterogeneity between subsets, homogeneity within subsets and concerning data in the same subset [23,47]. Before applying FCM clustering to the training data, we compute the entropy of every connection in the dataset. Consider the dataset X with size $M \times N$ , where M is the total number of connections and N is the total number of features used to represent each connection; the entropy of each connection is measured as $\begin{matrix} (3) & H (X_{m}) = - \sum_{i = 1}^{N} p_{i} \log p_{i}, \end{matrix}$

Where $H (X_{m})$ is the entropy of the mth connection in dataset X, with m varying from 1 to M, and $p_{i}$ is the probability of occurrence of the ith feature in the mth connection. After computing the entropy of each connection in dataset X, we have M entropy values in total. FCM then clusters the entire dataset into C clusters based on these M entropies. According to the standard definition of FCM, the objective function is modeled as $\begin{matrix} (4) & J_{m} = \sum_{i = 1}^{M} \sum_{j = 1}^{C} u_{i j}^{m} ‖ x_{i} - c_{j} ‖^{2} \end{matrix}$

Where m is a real number greater than 1 (the fuzzifier, not the connection index used in Eq. (3)), $x_{i}$ is the ith entropy value, $c_{j}$ is the jth cluster center, $u_{i j}$ is the membership of the ith entropy value in the jth cluster, M is the total number of connections and C is the total number of clusters. $‖ . ‖$ denotes the norm, which measures the distance between a measured entropy value and a cluster center. The cluster centers and memberships are updated iteratively and are calculated as $\begin{matrix} (5) & u_{i j} = \frac{1}{\sum_{k = 1}^{C} {\left(\frac{‖ x_{i} - c_{j} ‖}{‖ x_{i} - c_{k} ‖}\right)}^{\frac{2}{m - 1}}} \end{matrix}$

And $\begin{matrix} (6) & c_{j} = \frac{\sum_{i = 1}^{M} u_{i j}^{m} . x_{i}}{\sum_{i = 1}^{M} u_{i j}^{m}} \end{matrix}$

The iteration process is terminated when the following condition is met $\begin{matrix} (7) & max_{i j} {| u_{i j}^{(q + 1)} - u_{i j}^{(q)} |} < ε \end{matrix}$

Where ε is the termination threshold, which lies between 0 and 1, and q indicates the iteration number. At the start of the iteration, i.e., q = 1, C cluster centers are randomly selected and an initial fuzzy membership is computed between all entropy values and the cluster centers. Based on the obtained values, the cluster centers are then updated. This process is repeated until the condition in Eq. (7) is met. Once the termination condition is met, the connections are grouped according to the indices of the entropy values present in each cluster. Clustering connections with similar semantics in this way reduces the unnecessary computational burden on the system. Moreover, we observed that different traffic connections of the same class have only small deviations in their feature values, so clustering such connections improves efficiency and lessens the computational burden on the system. After clustering is complete, we focus on the removal of duplicate connections in every cluster, which is accomplished through the correlation statistics of the data connections in every cluster.
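The entropy computation and the FCM iteration of Eqs. (3) to (7) can be illustrated with the short Python sketch below. It is written under stated assumptions and is not the authors' code: the probability $p_{i}$ is taken as the share of feature i in the total magnitude of the connection (the text does not fix how $p_{i}$ is estimated), the entropy uses the conventional Shannon form with a leading minus sign, and the names `row_entropy`, `fcm_1d` and the toy rows are hypothetical.

```python
import math
import random


def row_entropy(row):
    """Shannon entropy of one connection (Eq. 3, with the usual minus sign).

    Assumption: p_i is the share of feature i in the row's total magnitude.
    """
    total = sum(abs(v) for v in row)
    if total == 0:
        return 0.0
    probs = [abs(v) / total for v in row]
    return -sum(p * math.log(p) for p in probs if p > 0)


def fcm_1d(values, n_clusters, m=2.0, eps=1e-5, max_iter=300, seed=0):
    """Fuzzy C-Means on scalar values, following Eqs. (5)-(7)."""
    rng = random.Random(seed)
    centers = rng.sample(values, n_clusters)  # random initial centers
    exponent = 2.0 / (m - 1.0)
    u = None
    for _ in range(max_iter):
        new_u = []
        for x in values:
            dist = [abs(x - c) for c in centers]
            if min(dist) == 0.0:
                # x sits exactly on a center: give it crisp membership there.
                hits = [1.0 if d == 0.0 else 0.0 for d in dist]
                new_u.append([h / sum(hits) for h in hits])
            else:
                # Membership update, Eq. (5).
                new_u.append(
                    [1.0 / sum((dj / dk) ** exponent for dk in dist) for dj in dist]
                )
        # Center update, Eq. (6).
        centers = [
            sum(row[j] ** m * x for row, x in zip(new_u, values))
            / sum(row[j] ** m for row in new_u)
            for j in range(n_clusters)
        ]
        # Termination test, Eq. (7).
        converged = u is not None and max(
            abs(a - b) for r1, r2 in zip(new_u, u) for a, b in zip(r1, r2)
        ) < eps
        u = new_u
        if converged:
            break
    labels = [max(range(n_clusters), key=lambda j: row[j]) for row in u]
    return centers, u, labels


# Toy connections: three dominated by one feature, three spread evenly.
rows = [
    [9, 0, 0, 0], [9, 0.1, 0, 0], [8, 0, 0.1, 0],
    [1, 1, 1, 1], [1, 2, 1, 1], [2, 1, 1, 2],
]
entropies = [row_entropy(r) for r in rows]
centers, memberships, labels = fcm_1d(entropies, n_clusters=2)
print(entropies)
print(labels)
```

Because clustering runs on one scalar per connection, the cost of each FCM iteration grows with the number of connections and clusters only, not with the number of features; the connections are then grouped by the label assigned to their entropy value.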

3.4. Correlation based connections selection

Once the entire training data is grouped into clusters, the next task is to find the duplicate connections and remove them from every cluster. Here, duplicate connections are connections with similar attributes. For example, in the NSL-KDD dataset, the number of normal connections in the training set is 67,343, which is very large. Processing all these connections places a huge computational burden on the system. Hence we focus on the detection and removal of duplicate connections in every cluster based on correlation properties. For this purpose, we use the standard Pearson Correlation Coefficient (PCC), which measures the linear relation between two variables [30]. Consider two connections X and Y from any cluster; the PCC is calculated as $\begin{matrix} (8) & r (X, Y) = \frac{cov (X, Y)}{σ_{X} σ_{Y}} \end{matrix}$

Where $\begin{matrix} (9) & cov (X, Y) = \sum_{i = 1}^{N} (x_{i} - \overline{X}) (y_{i} - \overline{Y}) \end{matrix}$

And $\begin{matrix} (10) & σ_{X} = \sqrt{\sum_{i = 1}^{N} {(x_{i} - \overline{X})}^{2}} \text{ and } σ_{Y} = \sqrt{\sum_{i = 1}^{N} {(y_{i} - \overline{Y})}^{2}} \end{matrix}$

Where $\begin{matrix} (11) & \overline{X} = \frac{1}{N} \sum_{i = 1}^{N} x_{i} \text{ and } \overline{Y} = \frac{1}{N} \sum_{i = 1}^{N} y_{i} \end{matrix}$

Here $\overline{X}$ and $\overline{Y}$ are the means of the two connections X and Y, respectively. The PCC is an effective measure of the linear relation between two connections. Its value falls in the closed interval [−1, 1]: values close to −1 or 1 denote that the two connections are strongly correlated, while values close to 0 indicate that they are weakly correlated. Based on the value of the PCC, we decide whether a connection is a duplicate or not. For this purpose, we fix a threshold, and if the PCC between two connections (say X and Y) is greater than the threshold, we consider the second connection (i.e., Y) a duplicate and remove it from the cluster. The threshold is computed by averaging the PCC values of every connection with respect to every connection in the cluster. For instance, if there are P connections in the cluster, we obtain a correlation matrix Q of size $P \times P$ . The diagonal values of this matrix are 1, since the correlation of a connection with itself is 1. $\begin{matrix} (12) & Q = \left[\begin{matrix} r_{11} & r_{12} & \dots & r_{1 P} \\ r_{21} & r_{22} & \dots & r_{2 P} \\ \vdots & \vdots & \ddots & \vdots \\ r_{P 1} & r_{P 2} & \dots & r_{P P} \end{matrix}\right], \end{matrix}$ where $r_{i j}$ is the PCC between connections i and j. The first row of Q holds the correlations of the first connection with the remaining connections, the second row holds the correlations of the second connection with the remaining connections, and so on. Based on the Q matrix, we compute the threshold value, which is then compared with every $r_{i j}$ . The threshold is computed as follows: $\begin{matrix} (13) & T = \frac{1}{length (Q)} \sum_{i = 1}^{P} \sum_{j = 1}^{P} r_{i j} \end{matrix}$

After the threshold is computed, every value of Q is compared with the threshold T to find the duplicate connections. If $r_{i j}$ is greater than the threshold, the jth connection is marked as a duplicate and placed in the removal list. When the values of the first row are compared with the threshold, the first connection is the reference connection and the remaining connections are the current connections. A connection showing heavy correlation with the first connection is a duplicate with respect to the first connection only, but not necessarily with respect to other connections; hence it is only placed in the removal list at this stage. This process is repeated for the remaining rows (i.e., for the remaining connections), so that a removal list is prepared for every connection. After completion, only the connections that appear most often in the removal lists are removed from the cluster. In this manner, we eliminate the duplicate connections from every cluster.
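The correlation-based selection of Eqs. (8) to (13) can be sketched in Python as below. Several details are assumptions made only for this illustration: `length(Q)` is read as the number of entries $P \times P$ so that T is a plain average, each reference connection i is compared only with later connections j > i so that the first member of a correlated pair is kept, and constant rows are treated as uncorrelated. The names `pearson`, `duplicate_indices` and the toy cluster are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two connections, Eqs. (8)-(11)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sigma_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sigma_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sigma_x == 0 or sigma_y == 0:
        return 0.0  # assumption: a constant row is treated as uncorrelated
    return cov / (sigma_x * sigma_y)


def duplicate_indices(cluster):
    """Return (Q, T, indices of connections flagged for removal)."""
    p = len(cluster)
    q = [[pearson(cluster[i], cluster[j]) for j in range(p)] for i in range(p)]
    # Threshold, Eq. (13); assumption: length(Q) is the number of entries P*P.
    threshold = sum(sum(row) for row in q) / (p * p)
    # One removal list per reference connection i, stored as appearance counts.
    appearances = [0] * p
    for i in range(p):
        for j in range(i + 1, p):  # assumption: only later connections are candidates
            if q[i][j] > threshold:
                appearances[j] += 1
    most = max(appearances)
    removed = [j for j in range(p) if most > 0 and appearances[j] == most]
    return q, threshold, removed


# Toy cluster: rows 0, 1 and 3 are nearly proportional, row 2 is different.
cluster = [
    [1, 2, 3, 4],
    [2, 4, 6, 8.1],
    [4, 1, 3, 2],
    [1.1, 2, 3.1, 4],
]
q, threshold, removed = duplicate_indices(cluster)
print(round(threshold, 3), removed)
```

In this toy cluster the last connection appears in the removal lists of both connection 0 and connection 1, so it has the maximum number of appearances and is the one removed; connection 1 appears only once and is kept.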

3.5. Feature selection through mutual redundancy

Once the connections are finalized for every cluster, we apply a feature selection method to find the set of features that contributes most towards the representation. For feature selection, we apply a sliding window assisted mutual redundancy method, which seeks the non-linear relation between the connections of the same cluster. Within a cluster, the connections are linearly related but the features are non-linearly related. Hence we need to find the non-linear relation between features, and only the few features that contribute most towards the class representation are selected from every connection. For the feature selection process, we follow our earlier contribution [44], in which a sliding window is applied as pre-processing and Duplicate Mutual Information (DMI) is used for feature selection. In this method, we initially divide every connection into several blocks of equal size through a sliding window of size 5. Let $B_{1}^{U}, B_{2}^{U}, \dots, B_{Z}^{U}$ be the Z blocks obtained after partitioning the connection U through the sliding window. After windowing, we compute the MI [36] between connections as well as within connections; these are called Inter_MI and Intra_MI, respectively. Consider two connections U and V with respective blocks $B_{1}^{U}, B_{2}^{U}, \dots, B_{Z}^{U}$ and $B_{1}^{V}, B_{2}^{V}, \dots, B_{Z}^{V}$ ; then the Inter_MI is computed as follows: $\begin{matrix} (14) & I (B_{i}^{U}, B_{i}^{V}) = \sum_{f_{k} \in B_{i}^{U}} \sum_{f_{l} \in B_{i}^{V}} p (f_{k}, f_{l}) \log \left(\frac{p (f_{k}, f_{l})}{p (f_{k}) p (f_{l})}\right) \end{matrix}$

where $B_{i}^{U}$ and $B_{i}^{V}$ are the i-th blocks of connection U and connection V, respectively, and $f_{k}$ and $f_{l}$ are the features of the i-th block of U and of the i-th block of the different connection V, respectively. Next, Intra_MI is computed as follows: $\begin{matrix} (15) & I (B_{i}^{U}, B_{j}^{U}) = \sum_{f_{k} \in B_{i}^{U}} \sum_{f_{l} \in B_{j}^{U}} p (f_{k}, f_{l}) \log \left( \frac{p (f_{k}, f_{l})}{p (f_{k}) p (f_{l})} \right) \end{matrix}$

where $B_{i}^{U}$ and $B_{j}^{U}$ are two blocks of the connection U, and $f_{k}$ and $f_{l}$ are the features of the i-th block and the j-th block, respectively. The above expression measures the mutual dependency between the blocks of the same connection, and based on the obtained MI values we decide which blocks are mutually dependent and which are mutually independent. The two blocks with the strongest mutual dependency within the connection are selected, and the features of those blocks form the required subset of features. However, this process may eliminate some features that are significant. To recover such features, we measure the duplication between the eliminated features and the selected subset of features through the following DMI: $\begin{matrix} (16) & DMI = \frac{I (f_{i}; f_{s})}{I (C; f_{i})} \end{matrix}$

where $I (f_{i}; f_{s})$ is the mutual information between the feature $f_{i}$ left in the connection and the feature $f_{s}$ selected in the subset, and $I (C; f_{i})$ is the MI between the class and the feature $f_{i}$ . With Eq. (16), the features that are selected are those that contribute more information with respect to both the class and the neighboring features. The final set of features is obtained from the following expression [21]: $\begin{matrix} (17) & I_{M I} = \arg \max_{f_{i} \in U} \left( I (C; f_{i}) - \frac{1}{| B |} \sum_{f_{s} \in B} DMI \right) \end{matrix}$
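The block partitioning and the block-wise MI of Eqs. (14) and (15) can be sketched as follows. The sketch rests on several assumptions that the text does not fix: a trailing remainder shorter than the window size is dropped, the probabilities are estimated with a plain two-dimensional histogram (a plug-in estimate), and the k-th feature of one block is paired with the k-th feature of the other block. The function names and the `bins` parameter are hypothetical.

```python
import numpy as np

def split_into_blocks(connection, size=5):
    """Split a feature vector into consecutive blocks of `size` features.

    Assumption: features left over after the last full block are dropped.
    """
    v = np.asarray(connection, dtype=float)
    n_blocks = len(v) // size
    return v[: n_blocks * size].reshape(n_blocks, size)

def mutual_information(x, y, bins=3):
    """Plug-in (histogram) estimate of I(x; y) in nats from paired values."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal of x, shape (bins, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal of y, shape (1, bins)
    nz = p_xy > 0                           # skip empty cells (0 * log 0 = 0)
    return float(np.sum(p_xy[nz] * np.log(p_xy[nz] / (p_x @ p_y)[nz])))

def inter_mi(u, v, size=5, bins=3):
    """MI between the i-th block of u and the i-th block of v, for every i."""
    bu, bv = split_into_blocks(u, size), split_into_blocks(v, size)
    return np.array([mutual_information(a, b, bins) for a, b in zip(bu, bv)])

def intra_mi(u, size=5, bins=3):
    """Z x Z matrix of MI values between all pairs of blocks of u."""
    bu = split_into_blocks(u, size)
    z = len(bu)
    m = np.zeros((z, z))
    for i in range(z):
        for j in range(z):
            m[i, j] = mutual_information(bu[i], bu[j], bins)
    return m
```

With only five values per block, a histogram estimate is coarse; the number of bins is a tuning choice of this sketch rather than something specified in the method.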

If $I (C; f_{i}) = 0$ , the corresponding feature $f_{i}$ is eliminated permanently. On the other hand, if the features $f_{i}$ and $f_{s}$ are highly related, then the feature $f_{i}$ contributes to redundancy. For this purpose, we set a threshold φ and compare the obtained $I_{M I}$ with it as follows:

If $I_{M I} < φ$ , then the feature $f_{i}$ is redundant with respect to class C, because it may reduce the MI between the selected features $f_{s}$ and class C.

If $I_{M I} = φ$ , then the feature $f_{i}$ is redundant with respect to class C, because it does not carry any additional information about class C.

If $I_{M I} > φ$ , then the feature $f_{i}$ is relevant and contributes more to class C, because it provides additional information about class C; hence it is added to the selected feature subset.
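The recovery step built on Eqs. (16) and (17) and the threshold φ can be illustrated with precomputed MI values. The sketch below assumes that B in Eq. (17) denotes the already selected subset, that this subset stays fixed while the eliminated features are scored, and that every feature whose score exceeds φ is added back. The names `selection_score` and `recover_features` and the example numbers are hypothetical.

```python
def selection_score(mi_class_i, mi_with_selected):
    """Bracketed term of Eq. (17) for one candidate feature f_i.

    mi_class_i       : I(C; f_i)
    mi_with_selected : list of I(f_i; f_s) for every selected feature f_s
    Returns None when I(C; f_i) = 0, i.e. the feature is eliminated.
    """
    if not mi_with_selected:
        raise ValueError("the selected subset must not be empty")
    if mi_class_i == 0:
        return None
    # DMI of Eq. (16), averaged over the selected subset.
    mean_dmi = sum(m / mi_class_i for m in mi_with_selected) / len(mi_with_selected)
    return mi_class_i - mean_dmi

def recover_features(mi_class, mi_pair, selected, candidates, phi):
    """Return the candidate features whose score is strictly above phi."""
    kept = []
    for i in candidates:
        score = selection_score(mi_class[i], [mi_pair[i][s] for s in selected])
        if score is not None and score > phi:
            kept.append(i)
    return kept

# Illustrative values only: features 3 and 4 are already selected.
mi_class_demo = {0: 0.8, 1: 0.0, 2: 0.5}
mi_pair_demo = {
    0: {3: 0.08, 4: 0.08},   # weakly related to the selected features
    1: {3: 0.30, 4: 0.20},   # carries no class information
    2: {3: 0.40, 4: 0.50},   # strongly related to the selected features
}
kept_demo = recover_features(mi_class_demo, mi_pair_demo, [3, 4], [0, 1, 2], phi=0.2)
```

In this toy example feature 0 is recovered, feature 1 is eliminated because its MI with the class is zero, and feature 2 falls below the threshold because it largely duplicates the selected features.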

4. Simulation experiments

In this section, we discuss the simulation experiments conducted on the developed IDS model. For the simulations, we used MATLAB on a personal computer with a 1 TB hard disk and 8 GB of RAM. To show the effectiveness of the developed model, we simulate it on several datasets with different characteristics. We first describe the datasets and the settings applied before simulation. Next we present the results obtained, and finally a brief comparison demonstrates the effectiveness of our method.

4.1. Datasets

During the evaluation of an IDS, a major challenge faced by researchers is finding an appropriate intrusion dataset. Capturing a real-time intrusion dataset is difficult because it requires many components that researchers consider critical. For this reason, many researchers use simulated datasets such as the well-known KDD Cup 99 dataset [35], the NSL-KDD dataset [42] and the Kyoto 2006+ dataset [39]. According to Tsai et al. [43], most researchers have used the KDD Cup99 dataset for experimental validation. Moreover, these three datasets differ in size and are represented with different sets of features. Hence, to ensure a fair and rational comparison with existing methods, we selected these three datasets to assess the performance of our IDS model. The details of every dataset are given below.

4.1.1. KDD Cup99 dataset

The KDD Cup 99 dataset is one of the most popular datasets and is used by many researchers. It was constructed in the year 2000 from the data captured in an IDS evaluation program. The dataset consists of five classes: Normal, Denial of Service (DoS), Probe, User-to-Root (U2R) and Remote-to-Login (R2L). The first is normal traffic and the remaining four are attacks. The dataset contains approximately five million training connections and two million test connections. Each connection is represented with 41 features, which are grouped into three categories: basic features, traffic features, and content features. The traffic features are further divided into same-host features and same-service features. The main reason for using the KDD Cup 99 dataset for simulation is the availability of a large number of connections for every class. The training and testing connections of the KDD Cup 99 dataset are summarized in Table 2.

Table 2

Statistics of the KDD Cup99 dataset

Class/Set	Training	Testing
Normal	97,278	60,593
Attack: DoS	391,458	229,853
Attack: U2R	52	228
Attack: R2L	1126	16,189
Attack: Probe	4107	4166
Total attacks	396,743	250,436
Total	494,021	311,029

4.1.2. NSL-KDD dataset

The NSL-KDD dataset is a revised version of the KDD Cup 99 dataset constructed by Tavallaee et al. [42]. It has fewer traffic connections than the KDD Cup 99 dataset. As in KDD Cup99, each connection in NSL-KDD is represented with 41 features, and there are five classes (Normal, DoS, Probe, U2R and R2L). Further, NSL-KDD comprises three sets: training (10% of KDDCup99 and ${KDDTrain}^{+}$ ), testing (the KDDCup test data and ${KDDTest}^{+}$ ) and a set of additional samples ( ${KDDTest}^{- 21}$ ) that represent new attacks not covered in the training data. The details of every set are given in Table 3.

Table 3

Statistics of the NSL-KDD dataset

Class/Set	KDDTrain⁺	KDDTest⁺	KDDTest⁻²¹
Normal	67343	9711	2152
Attack: DoS	45927	7458	4342
Attack: U2R	52	200	200
Attack: R2L	995	2754	2754
Attack: Probe	11656	2421	2402
Total attacks	58630	12833	9698
Total	125973	22544	11850

4.1.3. Kyoto 2006+ dataset

Kyoto2006+ is another standard intrusion dataset, constructed by Song et al. [39]. In this dataset, each connection is represented with 24 features, which fall into two categories: conventional features and additional features. The first 14 features are derived from the KDD Cup99 dataset: of its 41 features, only the 14 most significant are retained, and they were acquired from the honeypots deployed at Kyoto University. In addition, 10 more features were extracted during the observation at Kyoto University, which may enable users to investigate what is happening in the network. For experimental analysis, we considered the data acquired on 12, 13, 14, 15, and 16 November 2006. From these dates we procured 93240 connections in total, of which 71885 are normal and 21355 are attacks. This categorization is based on the label present in the dataset. Of the 93240 connections, we used 70% for training and the remaining 30% for testing. That is, of the 71885 normal connections, 50320 are used for training and the remaining 21565 for testing; of the 21355 attack connections, 14950 are used for training and 6405 for testing.

4.2. Results

The performance of the developed IDS model is assessed with respect to its ability to correctly identify every class. For this assessment, we conduct a five-fold cross validation on every dataset by changing the traffic connections. In every validation round, the total number of connections used for training and testing is kept constant, but different connections are used for training and testing. After the simulation, the complete detection results are represented in a confusion matrix. Based on these confusion matrices, we measure the performance with several metrics: Detection Rate (DR), Precision (PPV), F-Score, False Negative Rate (FNR), False Positive Rate (FPR), False Alarm Rate (FAR) and Accuracy. For the simulation of the KDD Cup99 dataset, we used the available 10% KDD Train and KDD Test data. For the simulation of NSL-KDD, we used the connections in ${KDDTrain}^{+}$ , ${KDDTest}^{+}$ and ${KDDTest}^{- 21}$ . For Kyoto2006+, no such predefined split is available, so we procured connections from different dates in November 2006.
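The validation protocol described above (fixed numbers of training and testing connections, but different connections in each of the five rounds) can be sketched as repeated random sub-sampling. This is only one reading of the text, which does not spell out how the connections are exchanged between rounds; the function name `resampled_splits`, the seed and the use of a single shared pool are assumptions of this sketch. The sizes in the example are the Kyoto2006+ figures given in Section 4.1.3.

```python
import numpy as np

def resampled_splits(n_total, n_train, n_test, rounds=5, seed=0):
    """Yield (train_idx, test_idx) pairs with fixed sizes for each round.

    Assumption: every round draws a fresh random permutation of one pool of
    n_total connections, so the sizes stay constant while the members change.
    """
    if n_train + n_test > n_total:
        raise ValueError("train and test sizes exceed the pool size")
    rng = np.random.default_rng(seed)
    for _ in range(rounds):
        perm = rng.permutation(n_total)
        yield perm[:n_train], perm[n_train:n_train + n_test]

# Kyoto2006+ example: 93240 connections, 65270 for training, 27970 for testing.
splits = list(resampled_splits(93240, 65270, 27970))
```

A textbook five-fold partition would instead test on each fifth of the data exactly once; the sub-sampling variant is used here because the text keeps the training and testing sizes at a fixed 70/30 ratio.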

Table 4

Confusion matrix of results from the simulation of KDDTest of KDDCup99 dataset

	Normal	DoS	U2R	R2L	Probe	Total
Normal	58169	856	124	964	480	60,593
DoS	7210	216063	584	1949	4047	229,853
U2R	25	15	168	7	13	228
R2L	1103	825	175	13436	650	16,189
Probe	216	162	54	110	3624	4166
Total	66,723	217,921	1105	16466	8814	311,029

Table 5

Performance metrics of proposed method for KDDTest of KDDCup99 dataset

Class/Metric	DR (%)	PPV (%)	FNR (%)	FPR (%)	FAR (%)	F-Score (%)
Normal	96.3212	87.1845	3.6788	12.8155	8.2471	91.5254
DoS	94.1524	99.1554	5.8476	0.8446	3.3461	96.5892
U2R	73.6154	15.2314	26.3846	84.7686	55.5766	25.2404
R2L	82.9932	81.6024	17.0068	18.3976	17.7022	82.2919
Probe	86.9964	41.1247	13.0036	58.8735	35.9385	55.8497

The values shown in Table 4 are the results obtained after simulating the developed IDS model on the KDD Cup 99 dataset. For this simulation we consider 311,029 traffic connections in total. The values reported here are the best values obtained in our five-fold cross-validated simulation. Because DoS attacks are the most frequent, the DoS category contains the largest number of connections. The two attacks DoS and Probe are called major attacks, and the remaining R2L and U2R attacks are called minor attacks. For this reason, fewer connections of these minor attacks are found in every dataset. Further, we compute the performance metrics from the values in Table 4, and the resulting values are shown in Table 5. From this table, we can see that the maximum DR (96.3212%) is observed for Normal and the minimum DR (73.6154%) for U2R. Similarly, the maximum precision (99.1554%) is observed for DoS and the minimum precision (15.2314%) for U2R. Since U2R is a minor attack with little information available in the training phase, we obtain a lower DR and precision for it. Next, FNR and FPR are the complements of DR and PPV, respectively, and FAR is the average of FPR and FNR. Finally, we measured the F-score for every class; the maximum value is observed for the DoS class and the minimum for U2R. The main reason for using the KDD Cup 99 dataset for simulation is that it has a huge number of records, which helps in realizing our concept, i.e., correlation-based connection selection. At this phase, we remove the duplicate connections and train only on the informative connections. In general, reducing the number of training connections may cause the system to lose some information and consequently to perform worse. However, the obtained results are satisfactory, and the main reason is that information is preserved through the correlation calculation.
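The relations among the metrics described above can be written out as a short computation on the confusion matrix of Table 4 (rows are the actual classes, columns the predicted classes). The sketch follows the conventions used in this paper, in which FPR is the complement of PPV (a quantity that is elsewhere often called the false discovery rate) and FAR is the mean of FNR and FPR; the F-score is assumed to be the harmonic mean of DR and PPV, which agrees with the values in Table 5. The function name `per_class_metrics` is hypothetical, and values recomputed from this single matrix need not reproduce Table 5 exactly.

```python
import numpy as np

CLASSES = ["Normal", "DoS", "U2R", "R2L", "Probe"]

# Table 4: rows = actual class, columns = predicted class.
CM = np.array([
    [58169,    856, 124,   964,  480],
    [ 7210, 216063, 584,  1949, 4047],
    [   25,     15, 168,     7,   13],
    [ 1103,    825, 175, 13436,  650],
    [  216,    162,  54,   110, 3624],
])

def per_class_metrics(cm):
    """Per-class metrics in percent, using the conventions of this paper."""
    cm = np.asarray(cm, dtype=float)
    correct = np.diag(cm)
    dr = 100.0 * correct / cm.sum(axis=1)    # detection rate (recall)
    ppv = 100.0 * correct / cm.sum(axis=0)   # precision
    fnr = 100.0 - dr                         # complement of DR
    fpr = 100.0 - ppv                        # complement of PPV (paper's FPR)
    far = (fnr + fpr) / 2.0                  # average of FNR and FPR
    f_score = 2.0 * dr * ppv / (dr + ppv)    # harmonic mean of DR and PPV
    return {"DR": dr, "PPV": ppv, "FNR": fnr, "FPR": fpr,
            "FAR": far, "F-Score": f_score}

metrics = per_class_metrics(CM)
for k, name in enumerate(CLASSES):
    print(name, {key: round(float(val[k]), 4) for key, val in metrics.items()})
```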

Table 6

Confusion matrix of results from the simulation of ${KDDTest}^{+}$ of NSL-KDD dataset

	Normal	DoS	U2R	R2L	Probe	Total
Normal	7064	115	15	35	54	7283
DoS	62	5313	21	29	168	5593
U2R	10	4	119	4	13	150
R2L	66	64	16	1775	144	2065
Probe	57	62	11	16	1669	1815
Total	7259	5558	182	1859	2048	16906

Table 7

Performance metrics of proposed method for ${KDDTest}^{+}$ of NSL-KDD dataset

Class/Metric	DR (%)	PPV (%)	FNR (%)	FPR (%)	FAR (%)	F-Score (%)
Normal	96.9964	97.3121	3.0125	2.6956	2.8541	97.1540
DoS	94.9987	95.5956	5.0142	4.4114	4.7128	95.2965
U2R	79.3314	65.3484	20.6754	34.6251	27.6503	71.6642
R2L	85.9645	95.4875	14.0412	4.5212	9.2812	90.4761
Probe	91.9674	81.4923	8.0448	18.5123	13.2785	86.4136

The results obtained after simulating the developed IDS model on the NSL-KDD dataset are given as a confusion matrix in Table 6. For this simulation, we used 75% of the traffic connections from KDDTest+ and, as usual, 100% of the connections from KDDTrain+. Because our proposed method (correlation-based connection selection) can reduce the number of training connections, we used the entire connection set for training. Even though the NSL-KDD dataset is a selective subset retrieved from the large KDD Cup 99 dataset, it still contains some redundant and duplicate connections, particularly in the Normal class. For the Normal class alone, KDDTrain+ has 67343 connections, which is a very large number. Training on this many connections places a heavy burden on the system. Hence the data were subjected to clustering followed by correlation to remove the connections with duplicate semantics. From the results shown in Table 6, we can observe that our new model classifies all classes effectively even though less data is used for training. Since we select the connections and the respective features based on the linear and non-linear properties of the data, the loss of significant information is small. From the performance analysis shown in Table 7, we can see that the maximum DR (96.9964%) and maximum PPV (97.3121%) are observed for the Normal class, while the minimum DR (79.3314%) and minimum PPV (65.3484%) are observed for the U2R class. Further, FPR and FNR are the complements of PPV and DR, respectively.
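As a quick consistency check, the row totals of Table 6 agree with taking 75% of every class of KDDTest+ in Table 3 and rounding down. The rounding rule is inferred from the numbers rather than stated in the text, and the variable names are illustrative.

```python
# Per-class counts of KDDTest+ from Table 3.
KDDTEST_PLUS = {"Normal": 9711, "DoS": 7458, "U2R": 200, "R2L": 2754, "Probe": 2421}

# 75% of each class, rounded down (assumed rule; integer arithmetic avoids
# floating-point rounding issues).
subset = {name: (3 * count) // 4 for name, count in KDDTEST_PLUS.items()}
subset_total = sum(subset.values())
print(subset, subset_total)
```

The per-class results coincide with the Total column of Table 6, which suggests that the 75% sample was drawn class by class.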

For further simulation, we used a simpler dataset, Kyoto2006+. This dataset has only two classes: normal and attack. From the results shown in Table 8 and Table 9, it can be seen that both classes are detected well, with a DR above 93% for each. Because there are only two classes, slightly higher FPR values are observed, which means that the number of false positives is high. The main reason is that even a small deviation in the features of an incoming traffic connection leads the system to misclassify it. Hence the PPV of the Attack class is only 80.8854%, which is relatively low. On average, the overall performance is observed to be high.

Table 8

Confusion matrix of results from the simulation of Kyoto2006+ dataset

	Normal	Attack	Total
Normal	20123	1442	21565
Attack	307	6098	6405
Total	20430	7540	27970

Table 9

Performance metrics of proposed method for Kyoto2006+

Class/Metric	DR (%)	PPV (%)	FNR (%)	FPR (%)	FAR (%)	F-Score (%)
Normal	93.3124	98.4973	6.6876	1.5027	4.0951	95.8348
Attack	95.2121	80.8854	4.7879	19.1212	11.9544	87.4301

Table 10

Performance comparison of proposed method with recent methods

Method	Average accuracy (%)	Average FAR (%)
RPFMI [48]	93.1145	1.7880
FGLCC [28]	90.2145	1.6345
FSVM [4]	93.2963	0.9963
MDWT [21]	91.2353	2.3023
MMIFS [40,41]	92.9421	4.2157
CCFS [12]	87.1254	9.4457
MDPCA-DBN [46]	82.0825	2.6225
IK-Means [27]	83.5421	4.1128
MRFS-MC-SVM [44]	95.6345	0.7664
MCRFS-MC-SVM	97.6695	0.7012

4.3. Comparison

Table 10 shows a comparative analysis between the proposed model and several existing IDS models. In this comparison, we consider the maximum performance measures reported in the respective articles. Since our MCR-FS is built on two new contributions, the comparison is made with methods that follow a similar methodology. For instance, the methods RPFMI [48], FGLCC [28], FSVM [4] and MMIFS [40,41] concentrate mainly on feature selection and all rely on mutual information. Each of these methods develops a revised version of MI that measures non-linear dependency and selects the features with higher independence. After feature selection, they use different machine learning algorithms for classification. Unlike these methods, CCFS [12] uses a correlation measure for feature selection and considers the linear relations between intrusion traffic connections. However, intrusion data is not restricted to a single type of relation: the data connections are both linearly and non-linearly related. For such data, the feature selection method should capture both relations and select only the optimal features that satisfy both constraints. MDWT [21], in contrast, applies the DWT for feature selection. However, textual connection records have no meaningful frequency content, so even though this approach achieves good accuracy, it is not well suited to intrusion detection. Next, methods such as MDPCA-DBN [46] and IK-Means [27] apply clustering techniques to reduce the computational burden by removing excess data from the training data. However, they do not check the relations between the connections of the same cluster. Since the connections belonging to a single cluster may have similar semantics, removing such connections lessens the computational burden, but random removal may lead to poor classification performance. A suitable measure is therefore required to estimate the exact relation between the connections of a single cluster. In the present contribution, we concentrate on this point and employ a correlation-based connection selection in every cluster. At this phase, the linear relations between the connections are found and the connections that are too similar are removed. Further, for feature selection, we employ a non-linear measure and remove the duplicate features according to our earlier contribution, MRFS-MC-SVM [44]. Hence our method achieves the highest accuracy among the compared methods.

5. Conclusion

In this paper, we developed a new IDS model to detect intrusions in computer networks based on statistical processing and machine learning. The statistical processing techniques include correlation and mutual information, and the machine learning algorithms include FCM and SVM. The main responsibility of the correlation measure is to discover the duplicate connections in the training data, while the main responsibility of mutual information is to find the duplicate features in every connection. FCM is employed to cluster the training data into different clusters so that the system suffers less computational burden. Finally, the SVM performs the classification. For experimental analysis, three standard datasets, KDD Cup 99, NSL-KDD and Kyoto2006+, are used and subjected to a five-fold cross validation by changing the training and testing connections. On every dataset, we attained satisfactory results in the classification of attacks and normal traffic. On average, the obtained accuracy is 97.6695% and the average false alarm rate is 0.7012%, which is better than the compared state-of-the-art methods.

Conflict of interest

None to report.

References

1. O.Y. Al-Jarrah, Alhussein, P.D. Yoo, Muhaidat, Taha and Kim, Data randomization and cluster-based partitioning for botnet intrusion detection, IEEE Trans. Cybern. 46(8) (2015), 1796–1806. doi:10.1109/TCYB.2015.2490802.

2. A.M. Ambusaidi, He and Nanda, Unsupervised feature selection method for intrusion detection system, in: International Conference on Trust, Security and Privacy in Computing and Communications, IEEE, 2015.

3. A.M. Ambusaidi, He and Nanda, Unsupervised feature selection method for intrusion detection system, in: International Conference on Trust, Security and Privacy in Computing and Communications, IEEE, 2015.

4. M.A. Ambusaidi, He, Nanda and Tan, Building an intrusion detection system using a filter-based feature selection algorithm, IEEE Trans. Comput. 65 (2016), 2986–2998. doi:10.1109/TC.2016.2519914.

5. M.A. Ambusaidi, Tan, He, Nanda, Fu Lu and Jamdagni, Intrusion detection method based on non-linear correlation measure, Int. J. Internet Protocol Technology 8(2/3) (2014), 77–86. doi:10.1504/IJIPT.2014.066377.

6. Amiri, RezaeiYousefi, Lucas, Shakery and Yazdani, Mutual information-based feature selection for intrusion detection systems, Journal of Network and Computer Applications 34(4) (2011), 1184–1199. doi:10.1016/j.jnca.2011.01.002.

7. Azizi, Empirical study of artificial fish swarm algorithm, Int. J. Comput. Commun. Netw. 3 (2014), 1–7. doi:10.7763/IJCCE.2014.V3.281.

8. J.C. Bezdek, Ehrlich and Full, FCM: The fuzzy c-means clustering algorithm, Comput. Geosci. 10 (1984), 191–203. doi:10.1016/0098-3004(84)90020-7.

9. Breiman, Friedman, Stone and R.A. Olshen, Classification and Regression Trees, CRC Press, Inc., 1984.

10. Donghong, Zhibiao, Wu, Ping and Jian-Ping, Analysis of network security data using wavelet transforms, J. Algorithm. Comput. Technol. 8 (2014), 59–70. doi:10.1260/1748-3018.8.1.59.

11. R.M. Elbasiony, E.A. Sallam, T.E. Eltobely and M.M. Fahmy, A hybrid network intrusion detection framework based on random forests and weighted k-means, Ain Shams Engineering Journal 4 (2013), 753–762. doi:10.1016/j.asej.2013.01.003.

12. Farahani, Feature selection based on cross-correlation for the intrusion detection system, Hindawi Security and Communication Networks 2020 (2020), Article ID 8875404.

13. Gong, Gao, Gong, Li, Gao and Engineering, An optimized artificial bee colony algorithm for clustering, Int. J. Control Autom. 9 (2016), 107–116. doi:10.14257/ijca.2016.9.4.11.

14. Hajisalem and Babaie, A hybrid intrusion detection system based on ABC-AFS algorithm for misuse and anomaly detection, Computer Networks 136 (2018), 37–50. doi:10.1016/j.comnet.2018.02.028.

15. Hamed, J.B. Ernst and S.C. Kremer, A survey and taxonomy of classifiers of intrusion detection systems, in: Computer and Network Security Essentials, Ser. Computer and Network Security Essentials, Daimi, ed., Springer International Publishing, Cham, 2018, pp. 21–39. doi:10.1007/978-3-319-58424-9_2.

16. G.E. Hinton and R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science 313 (2006), 504–507. doi:10.1126/science.1127647.

17. Hubballi and Suryanarayanan, False alarm minimization techniques in signature-based intrusion detection systems: A survey, Comput. Commun. 49 (2014), 1–17. doi:10.1016/j.comcom.2014.04.012.

18. Inayat, Gani, N.B. Anuar, M.K.K. Khan and Anwar, Intrusion response systems: Foundations, design, and challenges, J. Netw. Comput. Appl. 62 (2016), 53–74. doi:10.1016/j.jnca.2015.12.006.

19. Jackins and D.S. Punithavathani, An anomaly-based network intrusion detection system using ensemble clustering, International Journal of Enterprise Network Management 9(3/4) (2018), 251–260. doi:10.1504/IJENM.2018.094664.

20. Jecheva and Nikolova, Some clustering-based methodology applications to anomaly intrusion detection systems, International Journal of Security and Its Applications 10(1) (2016), 215–228. doi:10.14257/ijsia.2016.10.1.20.

21. S.Y. Ji, B.K. Jeong, Choi and D.H. Jeong, A multi-level intrusion detection method for abnormal network behaviors, J. Netw. Comput. Appl. 62 (2016), 9–17. doi:10.1016/j.jnca.2015.12.004.

22. Khraisat, Gondal, Vamplew and Kamruzzaman, Survey of intrusion detection systems: Techniques, datasets and challenges, Cybersecurity (SpringerOpen) 2 (2020), 20.

23. Kumar Abhaya, An efficient network intrusion detection system based on fuzzy C-means and support vector machine, in: 2016 Int. Conf. Comput. Electr. Commun. Eng., IEEE, 2016, pp. 1–6. doi:10.1109/ICCECE.2016.8009581.

24. Li, Cheng, Wang, Morstatter, R.P. Trevino, Tang and Liu, Feature selection: A data perspective, ACM Comput. Surv. (CSUR) 50(6) (2018), 94. doi:10.1145/3136625.

25. Liu and Lang, Machine learning and deep learning methods for intrusion detection systems: A survey, Appl. Sci. 9(20) (2019), 4396. doi:10.3390/app9204396.

26. Liu, Bin Hou, A.L. Qi and X.T. Chang, Feature optimization based on artificial fish-swarm algorithm in intrusion detections, in: Proc. – Int. Conf. Networks Secur. Wirel. Commun. Trust. Comput. NSWCTC, Vol. 1, 2009, pp. 542–545. doi:10.1109/NSWCTC.2009.57.

27. Meng, Lv, You and Yue, Intrusion detection method based on improved K-means algorithm, IOP Conf. Series: Journal of Physics: Conf. Series 1302 (2019), 032011. doi:10.1088/1742-6596/1302/3/032011.

28. Mohammadi and Mirvazari, Cyber intrusion detection by combined feature selection algorithm, Journal of Information Security and Applications 44 (2019), 80–88. doi:10.1016/j.jisa.2018.11.007.

29. Muda, Yassin, M.N. Suliaman and N.I. Udzir, Intrusion detection based on K-means clustering and naïve Bayes classification, in: 7th International Conference on Information Technology in Asia, Sarawak, Malaysia, 2011. doi:10.1109/CITA.2011.5999520.

30. Nguyen, Franke and Petrović, Improving effectiveness of intrusion detection by correlation feature selection, in: 5th Int. Conf. Availability, Reliab. Secur., Krakow, Poland, 2010, pp. 17–24.

31. Ni, He, Ahmad and Farooq, Practical network anomaly detection using data mining techniques, VFAST Trans. Softw. Eng. 9 (2016), 1–6. doi:10.21015/vtse.v9i2.403.

32. Perona, Gurrutxaga, Arbelaitz, J.I. Martín, Muguerza and J.M. Pérez, Service-independent payload analysis to improve intrusion detection in network traffic, in: Proceedings of the 7th Australasian Data Mining Conference (AusDM08), Adelaide, Australia, 2008, pp. 171–178.

33. Ring, Wunderlich, Scheuring, Landes and Hotho, A survey of network-based intrusion detection data sets, 2019, pp. 1–15, arXiv preprint arXiv:1903.02460.

34. Rodriguez and Laio, Clustering by fast search and find of density peaks, Science 344 (2014), 1492–1496. doi:10.1126/science.1242072.

35. Rosset and Inger, Kdd-cup 99: Knowledge discovery in a charitable organization’s donor database, SIGKDD Explorat. 1(2) (2000), 85–90. doi:10.1145/846183.846204.

36. M.S. Roulston, Estimating the errors on measured entropy and mutual information, Physica D: Nonlinear Phenomena 125(3) (1999), 285–294. doi:10.1016/S0167-2789(98)00269-3.

37. Sandosh, Govindasamy and Akila, Enhanced intrusion detection system via agent clustering and classification based on outlier detection, Peer-to-Peer Netw. Appl. 13 (2020), 1038–1045.

38. Sharafaldin, Habibi Lashkari and A.A. Ghorbani, Toward generating a new intrusion detection dataset and intrusion traffic characterization, in: 4th International Conference on Information Systems Security and Privacy (ICISSP), Portugal, 2018.

39. Song, Takakura, Okabe, Eto, Inoue and Nakao, Statistical analysis of honeypot data and building of Kyoto 2006+ dataset for NIDS evaluation, in: Proceedings of the First Workshop on Building Analysis Datasets and Gathering Experience Returns for Security, ACM, 2011, pp. 29–36. doi:10.1145/1978672.1978676.

40. Song, Zhu and Price, Feature grouping for intrusion detection based on mutual information, Journal of Communications 9(12) (2014), 987–993.

41. Song, Zhu, Scully and Price, Modified mutual information-based feature selection for intrusion detection systems in decision tree learning, Journal of Computers 9(7) (2014), 1542–1546.

42. Tavallaee, Bagheri, Lu and A.A. Ghorbani, A detailed analysis of the KDD cup 99 data set, in: 2009 IEEE Symposium on Computational Intelligence for Security and Defense Applications, IEEE, 2009, pp. 1–6. doi:10.1109/CISDA.2009.5356528.

43. C.F. Tsai, Y.F. Hsu, C.Y. Lin and W.Y. Lin, Intrusion detection by machine learning: A review, Expert Systems with Applications 36(10) (2009), 11994–12000. doi:10.1016/j.eswa.2009.05.029.

44. Veeranna and K.K. Reddy, Sliding window assisted mutual redundancy based feature selection for intrusion detection system, International Journal of Ad Hoc and Ubiquitous Computing, Accepted in.

45. L.J.G. Villalba, A.L.S. Orozco and J.M. Vidal, Anomaly-based network intrusion detection system, IEEE Lat. Am. Trans. 13 (2015), 850–855. doi:10.1109/TLA.2015.7069114.

46. Yang, Zheng, Wu, Niu and Yang, Building an effective intrusion detection system using the modified density peak clustering algorithm and deep belief networks, Appl. Sci. 9 (2019), 238. doi:10.3390/app9020238.

47. Zhang and Gu, Intrusion detection network based on fuzzy c-means and particle swarm optimization, in: Proc. 6th Int. Asia Conf. Ind. Eng. Manag. Innov., Atlantis Press, Paris, 2016, pp. 111–119. doi:10.2991/978-94-6239-145-1_12.

48. Zhao, Niu, Luo and Xin, A filter feature selection algorithm based on mutual information for intrusion detection, Appl. Sci. 8 (2018), 1535. doi:10.3390/app8091535.
