A classification model based on SVM and fuzzy rough set for network intrusion detection

Abstract

Intrusion Detection Systems (IDSs) are designed to provide security in computer networks. Different classification models, such as the Support Vector Machine (SVM), have been successfully applied to network data. However, extending or improving current models by performing prototype selection alongside their training phase is crucial because of the serious inefficiencies during training (i.e. learning overhead). This paper introduces an improved model for prototype selection. Applying the proposed prototype selection together with the SVM classification model increases the attack discovery rate. In this article, we use fuzzy rough set theory (FRST) for prototype selection to enhance SVM in intrusion detection. Testing and evaluation of the proposed IDS have mainly been performed on the NSL-KDD dataset, a refined version of KDD-CUP99. Experiments indicate that the proposed IDS outperforms both basic and modern IDSs in terms of precision, recall, and accuracy.

Keywords

SVM; data selection; feature selection; fuzzy rough set theory; IDS

1 Introduction

Today, most vital infrastructures, e.g. telecommunications, transportation, trade and banking systems, are managed by computer networks. Therefore, the security of these systems against planned attacks is of high importance. Different glossaries give a variety of meanings for the term “intrusion”, and in computer science there is much debate over its meaning. Many define intrusions to be unsuccessful attacks, while others present separate definitions of intrusion and attack. Intrusion can be defined as [1]: “an active series of events related to each other, with aim of illegally accessing to data, illegally changing data or damaging the system in a way that it makes the system unavailable”. Such a definition covers both successful and unsuccessful attempts.

IDSs are designed to provide security in computer networks. They are systems that strive to detect attacks on a network. These systems collect network traffic data from the nodes of the network or from computer systems and use this information to secure the network. The chief purpose of an IDS is to identify Unauthorized Users (UUs) and prevent their access. When a protected system or network is under attack, the IDS generates warnings about the potential attack, even if the system is not vulnerable to the reported attack. Therefore, the purpose of intrusion detection is to discover unauthorized use, abuse and damage of computer systems and networks by both internal users and external attackers. In a complete security system, an IDS is used as an auxiliary tool to enhance system security, along with firewalls, encryption and authentication methods. To sum up, there are three chief reasons to employ IDSs [1]:

Regulatory and legal requirements: using IDSs is necessary because of national laws that require compliance with security principles and monitoring of potential violators, the need to protect national infrastructures, the need for surveillance systems able to record and investigate suspicious accesses to information, and the existence of different standards.

Specification and definition of attack types: by identifying attacks on the system, IDSs give the system administrator the opportunity to plan how to manage and deal with them. IDSs can create a profile of the attacks that have occurred against the system. This profile can be used to investigate appropriate solutions to cope with those attacks. These systems are also capable of collecting information about the attackers, which can be used in judicial cases.

Overall defense-in-depth implementation: IDSs are a critical part of defense-in-depth systems, specifically for organizations for which dealing with intruders is of high significance. IDSs provide a suitable solution for protecting against vulnerabilities of the network and application layers. They also contribute to the integrity and authentication of information across different devices, such as firewalls, anti-viruses and routers.

An IDS can be considered to be a set of tools, methods and evidence that help to detect, identify and report unauthorized or unapproved activities in a network [1]. The term “intrusion detection” is not entirely accurate for these systems, since they do not, in fact, recognize the intrusion itself; rather, they flag an activity on the network as an intrusion, and that activity may not actually be one. Indeed, an IDS is a small part of the protective system installed in a system and cannot be considered a stand-alone and complete solution.

In general, anything that poses a potential risk or threat to global security is malware. Malware is growing in both variety and number [2]. Besides the possible damage to users’ computers, malware can gain access to users’ private data, so that their confidential information can be stolen. Unfortunately, because of application vulnerabilities, many legitimate software packages launch malware unintentionally. This leads to system intrusion; therefore, the privacy of users can be endangered even when they use legitimate software.

The security of modern computer systems relies heavily on keeping anti-malware and security products updated. Therefore, timely identification of malware and finding ways to counter its malicious effects are the most important concerns of programmers and IT security specialists. A malware detection system is, in fact, an implementation of intrusion detection techniques that help to protect the system by identifying malicious behaviors. As attackers diversify and expand their efforts to access users’ private and important information and use various up-to-date techniques to bypass anti-malware and security products, system programmers need up-to-date and efficient detection techniques that overcome the attackers’ new and complex methods and detect the malware.

Until now, the methods that have been used to detect malware have succeeded at different rates. However, the crucial problem these days is to detect malware that hides itself in different ways, such as compaction, polymorphism, and so on. Therefore, traditional methods no longer meet the requirements for dealing with these new and complex species, and intelligent methods (such as clustering [3–9], fuzzy clustering [9–12], classification [13–14], domain adaptation [15, 16], deep learning [17–18], optimization [19–22], dynamic optimization [23–24], preprocessing [25], and so on) are required to detect this malware in a short time and with a low error rate. One of the main challenges in reaching an appropriate model for these systems is the huge size of their data. Huge data occurs in many applications [26–29]. This challenge has to be dealt with as a preprocessing task [25] prior to model creation. While there are many approaches to building a detection model [30–33], the main aim of this paper is to handle this problem using a method suited to our case, i.e. fuzzy rough set theory.

A rough set, introduced by Pawlak [34], defines a set by two approximations: its lower approximation (LA) and its upper approximation (UA). The LA and UA of a set are also sets. Rough set theory, as an extension of traditional set theory, and its derivatives have been used in various applications [35, 36]. It has been especially applicable to the feature selection problem [37–39] and the sample selection problem [40, 41]. One of its best-known extensions is the fuzzy rough set [42, 43], where the LA and the UA of a set are defined as two fuzzy sets. Although rough sets differ from fuzzy sets, they complement each other: the former can manage indiscernibility [34] and the latter can manage vagueness [44–46]. Therefore, both are suitable for managing uncertain systems. Fuzzy rough set theory has been especially applicable to feature selection [47], missing value handling [48], imbalanced learning [49], and instance selection [50].

The main objective of this study is to obtain more accurate results in the detection process. To reach this objective, the following goals are pursued:

Proposing a data reduction technique.

Increasing attack discovery rate using data reduction technique.

Since approaches based on fuzzy rough set theory are appropriate solutions to the feature and instance selection problems, and since one of the main challenges of our problem is its huge data size, we aim to introduce a model based on an innovative fuzzy rough set sample and feature selection mechanism.

The rest of this paper is organized as follows. The second section reviews related works. The third section presents the materials needed for the proposed method. The fourth section provides the details of the proposed method. The fifth section gives the details of the implementation and the experimental results. Finally, the last section presents a general conclusion and reviews new horizons and suggestions for future research on the subject.

2 Related works

According to the authors of [51], in an intelligent IDS the attraction criteria are calculated from the nearest neighbor and the nearest rival (the closest pattern belonging to another class) of each pattern. Let U be the universal set. In general, a pattern or data point p ∈ U can be suitably classified when |p - NN_U (p) | - |p - NN_{{x∈U|m_l(x)≠m_l(p)}} (p) | > Δ, where m_l (x) is the label of pattern x, NN_θ (x) is the nearest neighbor of pattern x in the dataset θ, and Δ, a positive real number, is a threshold.

An intrusion detection method has been proposed that selects patterns with the edited nearest neighbor rule [52]. This method focuses on eliminating noisy patterns from the training set. The edited nearest neighbor removes a pattern from U when its class differs from the most frequent class among its k nearest neighbors. An extension of this algorithm is the redundant edited nearest neighbor [53], where the edited nearest neighbor is applied repeatedly until the class of every pattern in U matches the most frequent class among its k nearest neighbors. Another method based on the edited nearest neighbor is called the multiple edited nearest neighbor. In this procedure, U is divided into r blocks {B₁, B₂, … , B_r }, the edited nearest neighbor is applied to each block B_j, and the neighbors of its patterns are sought in the next block, that is B_mod(j+1,r). The procedure continues as long as at least one deletion occurs in an iteration [54]. A further variant of the edited nearest neighbor selects patterns using the probability of membership in a class instead of the k nearest neighbors.
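The basic edited nearest neighbor rule described above can be sketched as follows. This is a minimal illustration and not the implementation of [52]: the Euclidean distance, the value k = 3, the function name `edited_nearest_neighbor` and the toy data are all assumptions made for the example.

```python
import math
from collections import Counter

def edited_nearest_neighbor(points, labels, k=3):
    """Return the indices of the patterns kept by one pass of the ENN rule."""
    kept = []
    for i, p in enumerate(points):
        # Distances from p to every other pattern in U (Euclidean, assumed).
        others = sorted((math.dist(p, q), j)
                        for j, q in enumerate(points) if j != i)
        neighbor_labels = [labels[j] for _, j in others[:k]]
        majority = Counter(neighbor_labels).most_common(1)[0][0]
        # Keep p only if its class agrees with the majority of its k neighbors.
        if labels[i] == majority:
            kept.append(i)
    return kept

# Toy data: index 3 is an "attack" label sitting inside the "normal" cluster.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.05, 0.05),
          (1.0, 1.0), (1.1, 1.0), (1.0, 1.1)]
labels = ["normal", "normal", "normal", "attack",
          "attack", "attack", "attack"]
kept = edited_nearest_neighbor(points, labels, k=3)
print(kept)
```

Applying the function repeatedly until nothing is removed would give the iterated variant mentioned in the text.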

A framework based on IB2 and IB3, both of which are additive (incremental) methods, has been introduced [55]. IB2 selects the patterns that are not classified correctly by 1-NN; IB3 is an improved version of IB2 in which a set of records is kept to identify the patterns that need to be preserved.

Among the other methods based on k nearest neighbors, five are worth mentioning [56]: DROP1, DROP2, DROP3, DROP4, and DROP5. These methods are based on the concept of dependency. The dependencies of a pattern p are those patterns that have p as one of their nearest neighbors. DROP1 removes the pattern p if its dependencies in U are classified correctly without p. Since the dependencies of a noisy pattern can be classified correctly without that pattern, this rule lets DROP1 eliminate noisy patterns. However, in DROP1, if the neighbors of a noisy pattern are eliminated first, the noisy pattern will not be deleted. To solve this problem, DROP2 considers the dependencies of a pattern in the whole training set; that is, pattern p is deleted only when its dependencies in the whole training set are classified correctly without p. DROP3 and DROP4 first remove noisy patterns with a filter similar to the edited nearest neighbor and then apply DROP2. DROP5 works like DROP2, except that it first removes the closest rivals (the nearest patterns with a different class).

Another method that operates based on dependency is called iterative filtering algorithm [57]. The iterative filtering algorithm works based on the reachable (denoted by Reachable (p)) and coverage (denoted by Coverage (p)) sets of pattern p that are the locality and dependency sets, respectively. If |Reachable (p) | > |Coverage (p) |, then the iterative filtering algorithm will remove pattern p. This means that some patterns can classify patterns similar to pattern p, regardless of their classification in the training set. In the first step, this algorithm runs edited nearest neighbor.

Evolutionary algorithms are widely used in pattern selection. Their main idea is as follows. An initial population of individuals (for example, chromosomes in a genetic algorithm) is given; each individual can be considered a solution. The individuals of a population are evaluated with an objective function (for example, the accuracy of a classifier trained on the patterns selected by an individual), and the best individuals are selected for the exploration operators (for example, the mutation and recombination operators of a genetic algorithm) in order to generate the individuals of a new population that are expected to improve the objective function. The algorithm is repeated for a user-specified number of iterations (for example, generations in a genetic algorithm), and the best individual of the last iteration is selected as the final solution. Generally, an individual I can be represented as a binary vector, I = [1, 0, 0, 1, …, 0], with |I| = |U|, which encodes a certain subset of U. Every I_j represents the jth pattern in U. If I_j is one, the individual I selects the jth pattern; otherwise, the individual I removes the jth pattern.
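To make the binary encoding concrete, the sketch below evaluates an individual with one possible objective function, the accuracy on U of a 1-NN classifier that only sees the selected patterns, and applies a bit-flip mutation. The function names, the mutation rate and the toy data are assumptions for illustration; the surveyed works may use different objectives and operators.

```python
import math
import random

def fitness(individual, points, labels):
    """Accuracy on U of a 1-NN classifier restricted to the selected patterns."""
    selected = [j for j, bit in enumerate(individual) if bit == 1]
    if not selected:
        return 0.0  # assumption: an empty selection is the worst individual
    correct = 0
    for p, label in zip(points, labels):
        nearest = min(selected, key=lambda j: math.dist(p, points[j]))
        correct += int(labels[nearest] == label)
    return correct / len(points)

def mutate(individual, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in individual]

points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0)]
labels = [0, 0, 1, 1]
good = [1, 0, 0, 1]  # keeps one pattern per class
poor = [1, 0, 0, 0]  # keeps a single pattern of class 0
good_fitness = fitness(good, points, labels)
poor_fitness = fitness(poor, points, labels)
child = mutate(good, rate=0.25, rng=random.Random(0))
print(good_fitness, poor_fitness, child)
```

The individual that keeps one representative per class classifies all of U correctly, while the one that keeps a single pattern can only be right for that pattern's class.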

Some researchers have employed the tabu search algorithm for pattern selection [58]. Tabu search starts from an initial candidate solution containing a set of randomly selected patterns S_j ∈ U. During the search, some solutions are marked as tabu (which means that such a solution should not be changed); the other (non-tabu) solutions are evaluated using a classifier in order to find the best solution to replace the current one. The next solution is found by assessing the neighboring solutions, each of which differs from the current solution in only one pattern. When this process is completed, the current solution is considered to be the best solution found so far.

Another category of studies computes the decision surface directly, based on maximum margin separation of two-class data. These studies attempt to obtain a decision surface that classifies the data while creating the maximum margin of separation between the two classes [59, 60]. In one method of this type [59], the aim was to determine a distance-based separation function with a maximum distance from each data class in a Banach space. For this purpose, Lipschitz functions have been used, where the Lipschitz constant of a function indicates the separation margin value. Finding a suitable Lipschitz function for real data is a difficult task, and the current solutions perform poorly, so the results of its implementation are not satisfactory. In another method of this type, SVM [60], a linear decision surface in a Hilbert space was proposed with the aim of finding the maximum-margin separator between two classes of data. In this method, the problem is transformed into a quadratic programming problem. Solving this problem for big-data applications such as intrusion detection is a challenging task [61].

Although SVMs can potentially classify the data in intrusion detection, they cannot be used alone in IDSs. Based on the conducted studies, several factors affect the efficiency of these systems. SVMs work more accurately when the number of normal records in the training data is approximately equal to the number of attack records; however, this is not always possible. In addition, the number of features in network traffic data has a big influence on both training time and classification performance. In many training datasets (such as KDDCUP99 [62]), it is possible to select some key features out of the 41 and use only them in the learning phase of the model. In [64], a combination of SVM, simulated annealing and a decision tree has been used to detect network attacks. At the beginning, SVM and simulated annealing are used to select the best features. Then, the decision tree and simulated annealing are used to generate decision rules. These phases are performed in a loop that continues until the termination condition is satisfied. The outputs of the algorithm are the selected features, the best accuracy on the evaluation dataset, and the decision rules.

Kuang et al. [64] proposed a new hybrid SVM model that combines kernel principal component analysis with a genetic algorithm for intrusion detection. In their model, kernel principal component analysis is used to reduce the problem dimensionality and the runtime. Furthermore, to reduce the noise caused by different features and improve the SVM performance, an improved kernel function denoted by N-RBF has been presented. The genetic algorithm is employed to optimize the penalty factor c, the kernel parameter σ, and ɛ for SVM. The results indicate an increase in prediction accuracy, faster convergence and better generalization in comparison with other similar algorithms. Similarly, in another work [65], the combination of kernel principal component analysis and an improved chaotic particle swarm optimization algorithm has been used for intrusion detection. In this model, the improved chaotic particle swarm optimization algorithm is used to optimize the above parameters. This method improved on similar methods on the NSL-KDD dataset [66].

3 Preliminary materials

This section provides definitions and the necessary notation. Then, we introduce SVM. Prototype selection is introduced in the next subsection. The following subsection discusses rough set theory. After that, fuzzy rough set theory is discussed. Feature selection based on fuzzy rough set theory is presented in the last subsection of this section.

Data classification has long been one of the most common subjects across different sciences. Many methods have been presented in the data classification area, each designed for a particular range of applications. In data classification, the main goal is to learn the different classes in a dataset so that labels can be determined for new, unseen data points. Classification [67–69] is one of the most important tasks in pattern recognition and has received a lot of attention due to its extensive use. Numerous methods have been developed so far, including: 1) decision tree, 2) Bayesian classifier, 3) artificial neural networks, 4) k-nearest neighbors, and 5) SVM.

Our main purpose is to propose a classification system to be used as a large-scale IDS that improves SVM in terms of time and accuracy and increases the ability to detect various new and unknown intrusions. Some advantages of the proposed method include (but are not limited to) (1) no need for parameter tuning and (2) the use of prototype selection methods (here, fuzzy rough set theory based prototype selection) together with SVM training for classification.

3.1 Fuzzy rough set theory

A fuzzy rough set can be introduced by the pair of the fuzzy LA (FLA) and the fuzzy UA (FUA) of a fuzzy subset (denoted by $\tilde{X}$ ) of the universal set U on which a fuzzy relation (denoted by $\tilde{R}$ ) has been defined. The FLA of $\tilde{X}$ with respect to the fuzzy relation $\tilde{R}$ is defined according to Equation 1. $μ_{\tilde{R} ↓ \tilde{X}} (p_{1}) = min_{p_{2} \in U} I (μ_{\tilde{R}} (p_{1}, p_{2}), μ_{\tilde{X}} (p_{2}))$ (1)

where $μ_{\tilde{X}} (p_{1})$ indicates the membership degree of point p₁ in the fuzzy set $\tilde{X}$ , $μ_{\tilde{R}} (p_{1}, p_{2})$ indicates the degree to which the points p₁ and p₂ are related by the fuzzy relation $\tilde{R}$ , and $I (x, y)$ indicates the fuzzy implication operator. The FUA of $\tilde{X}$ with respect to the fuzzy relation $\tilde{R}$ is defined according to Equation 2.

$μ_{\tilde{R} ↑ \tilde{X}} (p_{1}) = max_{p_{2} \in U} T (μ_{\tilde{R}} (p_{1}, p_{2}), μ_{\tilde{X}} (p_{2}))$ (2)

where $T (x, y)$ is a fuzzy t-norm operator.
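Equations 1 and 2 can be evaluated directly on a small example. The text does not fix the implication operator I or the t-norm T, so the sketch below assumes the Łukasiewicz implicator I(a, b) = min(1, 1 - a + b) and the minimum t-norm; the relation matrix and membership vector are made-up values.

```python
def lukasiewicz_implicator(a, b):
    """Assumed implication operator I(a, b) = min(1, 1 - a + b)."""
    return min(1.0, 1.0 - a + b)

def minimum_t_norm(a, b):
    """Assumed t-norm T(a, b) = min(a, b)."""
    return min(a, b)

def fuzzy_lower(relation, membership, p1):
    """Equation 1: min over p2 of I(R(p1, p2), X(p2))."""
    return min(lukasiewicz_implicator(relation[p1][p2], membership[p2])
               for p2 in range(len(membership)))

def fuzzy_upper(relation, membership, p1):
    """Equation 2: max over p2 of T(R(p1, p2), X(p2))."""
    return max(minimum_t_norm(relation[p1][p2], membership[p2])
               for p2 in range(len(membership)))

# A reflexive, symmetric fuzzy relation on three points and a fuzzy set X.
relation = [[1.0, 0.8, 0.1],
            [0.8, 1.0, 0.3],
            [0.1, 0.3, 1.0]]
membership = [1.0, 0.6, 0.0]

lower = [fuzzy_lower(relation, membership, p) for p in range(3)]
upper = [fuzzy_upper(relation, membership, p) for p in range(3)]
print(lower, upper)
```

Because the assumed relation is reflexive, each membership degree lies between the corresponding lower and upper approximation values.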

The positive region of a decision table D with respect to a relation R is extended to the fuzzy positive region of the fuzzy relation $\tilde{R}$ , denoted by ${PoR}_{\tilde{R}}$ , according to Equation 3. $μ_{{PoR}_{\tilde{R}}} (p) = max_{\dot{p} \in U} min_{\ddot{p} \in U} I (μ_{\tilde{R}} (p, \ddot{p}), μ_{\{\dddot{p} \in U | m_{l} (\dddot{p}) = m_{l} (\dot{p})\}} (\ddot{p}))$ (3)

Assuming the two following conditions: (1) “ $I (β, 0)$ is zero for any arbitrary β value”, and (2) “ $μ_{\{\dddot{p} \in U | m_{l} (\dddot{p}) = m_{l} (\dot{p})\}} (p)$ is one if p is a member of the set $\{\dddot{p} \in U | m_{l} (\dddot{p}) = m_{l} (\dot{p})\}$ ; otherwise, it is zero”, we can simplify Equation 3 into Equation 4. $μ_{{PoR}_{\tilde{R}}} (p) = min_{\dot{p} \in U} I (μ_{\tilde{R}} (p, \dot{p}), μ_{\{\ddot{p} \in U | m_{l} (\ddot{p}) = m_{l} (p)\}} (\dot{p}))$ (4)

Each data point p in U that has a very high value $μ_{{PoR}_{\tilde{R}}} (p)$ can be considered to be an inner (or simple) data point; each data point p in U that has a very low value $μ_{{PoR}_{\tilde{R}}} (p)$ can be considered to be either a misclassified data point or an overlapped data point; each data point p in U that has an intermediate value $μ_{{PoR}_{\tilde{R}}} (p)$ can be considered to be a borderline (or suitable-to-selection) data point. This can be observed in an example presented in Fig. 1. A toy dataset is depicted by Fig. 1a. The data points with high values of $μ_{{PoR}_{\tilde{R}}} (p)$ are the ones with the gray color in Fig. 1b and Fig. 1c. The data points with middle values of $μ_{{PoR}_{\tilde{R}}} (p)$ are the ones with the gray color in Fig. 1d.

Fig. 1

(a) An exemplary U with two classes. (b) Data points with low membership value to positive region. (c) Data points with very low membership value to positive region. (d) Data points with average membership value to positive region.
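The membership to the fuzzy positive region in Equation 4 can be computed for a toy one-dimensional dataset as sketched below. Two assumptions are made that the text does not state here: the fuzzy relation is taken to be 1 minus the absolute difference of a single feature already scaled to [0, 1], and the implication operator is the Łukasiewicz implicator. The data values are invented for illustration.

```python
def implicator(a, b):
    """Assumed Lukasiewicz implicator I(a, b) = min(1, 1 - a + b)."""
    return min(1.0, 1.0 - a + b)

def similarity(x, y):
    """Assumed fuzzy relation on one feature scaled to [0, 1]."""
    return 1.0 - abs(x - y)

def positive_region(values, labels):
    """Equation 4: for each p, min over p' of I(R(p, p'), [p' has p's class])."""
    result = []
    for p, label_p in zip(values, labels):
        degrees = [implicator(similarity(p, q), 1.0 if label_q == label_p else 0.0)
                   for q, label_q in zip(values, labels)]
        result.append(min(degrees))
    return result

# Two classes on a line; the points at 0.45 and 0.55 sit on the class border.
values = [0.0, 0.1, 0.45, 0.55, 0.9, 1.0]
labels = [0, 0, 0, 1, 1, 1]
por = positive_region(values, labels)
print([round(v, 2) for v in por])
```

Under these assumptions the value for a point equals its distance to the nearest point of another class, so points deep inside a class receive larger values than points near the border.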

3.2 Fuzzy rough set theory for feature selection

The feature selection problem arises in the context of machine learning and statistical pattern recognition and is important in many applications, such as classification. While many applications have a huge number of features, only a fraction of them carry considerable information and the remaining features are almost useless. Although keeping those almost useless features theoretically seems good, or at least free of side effects (since it does not lead to loss of information), it can reduce model performance: the computational cost of the application increases, and a lot of useless (sometimes misleading) information is stored. By removing these attributes from the dataset, the efficiency of learning models can be significantly improved.

The aim of feature selection is to find the smallest subset of input features with the highest classification quality. Unlike other methods of dimensionality reduction (feature extraction methods), feature selection methods preserve the original meaning of the features after reduction. These methods are widely used on datasets that have many features, which make processing hard.

In IDSs, analysis and classification are difficult because the amount of data that the system needs for network monitoring is very large. Thus, an IDS should reduce the amount of data to be processed so that the complexity of the relationships between some features is removed or reduced [70, 71]. One of the data mining concepts on which feature extraction and selection can be based is rough set theory (RST), which is discussed next in its fuzzy form.

As mentioned, a fuzzy decision system [42] can be considered to be a pair (U, Q), where Q = C ∪ { l }, C is the set of conditional features and l is the decision feature. Assume again that for every q ∈ Q there is a mapping (here a function) m_q : U → V_q, where V_q is called the range of m_q. Also, consider a fuzzy equivalence relation ${\tilde{R}}_{q}$ defined on U according to Equation 5 [72]. $μ_{{\tilde{R}}_{q}} (p_{1}, p_{2}) = \begin{cases} 1 - | m_{q} (p_{1}) - m_{q} (p_{2}) | & q \neq l \\ ω (m_{q} (p_{1}) = m_{q} (p_{2})) & q = l \end{cases}$ (5)

where ω (x) is one if x is true; otherwise, it is zero. We also define a fuzzy equivalence relation defined on U for any subset $\dot{Q}$ of the features Q, denoted by ${\tilde{R}}_{\dot{Q}}$ , according to Equation 6. $μ_{{\tilde{R}}_{\dot{Q}}} (p_{1}, p_{2}) = min_{\dot{q} \in \dot{Q}} μ_{{\tilde{R}}_{{\dot{q}}}} (p_{1}, p_{2})$ (6)

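We define λ_p according to Equation 7. $\begin{matrix} λ_{p} = μ_{{\tilde{R}}_{Q} ↓ \{\dot{p} \in U | m_{l} (\dot{p}) = m_{l} (p)\}} (p) \\ = min_{p_{2} \in U} I (μ_{{\tilde{R}}_{Q}} (p, p_{2}), μ_{\{\dot{p} \in U | m_{l} (\dot{p}) = m_{l} (p)\}} (p_{2})) \end{matrix}$ (7)

For an implication operator that satisfies I (a, 1) = 1 and I (a, 0) = 1 - a, only the data points whose class differs from that of p contribute to the minimum, so λ_p can be simplified as presented in Equation 8. $λ_{p} = 1 - max_{p_{2} \in U, m_{l} (p_{2}) \neq m_{l} (p)} μ_{{\tilde{R}}_{Q}} (p, p_{2})$ (8)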
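The relations of Equations 5 and 6 and the quantity λ_p of Equation 7 can be sketched as follows. The sketch assumes that all conditional features are already scaled to [0, 1], that the relation in Equation 7 is built from the conditional features only, and that I is the Łukasiewicz implicator; the function names and the four samples are illustrative.

```python
def feature_similarity(a, b):
    """Equation 5 for a conditional feature scaled to [0, 1]."""
    return 1.0 - abs(a - b)

def subset_similarity(x, y, subset):
    """Equation 6: minimum of the per-feature similarities over the subset."""
    return min(feature_similarity(x[q], y[q]) for q in subset)

def implicator(a, b):
    """Assumed Lukasiewicz implicator I(a, b) = min(1, 1 - a + b)."""
    return min(1.0, 1.0 - a + b)

def lambda_value(i, samples, labels, subset):
    """Equation 7: lower approximation of p's own class, evaluated at p."""
    return min(
        implicator(subset_similarity(samples[i], samples[j], subset),
                   1.0 if labels[j] == labels[i] else 0.0)
        for j in range(len(samples)))

samples = [(0.0, 0.2), (0.1, 0.9), (0.8, 0.3), (1.0, 1.0)]
labels = ["normal", "normal", "attack", "attack"]
features = [0, 1]  # indices of the conditional features
lambdas = [lambda_value(i, samples, labels, features)
           for i in range(len(samples))]
print([round(v, 2) for v in lambdas])
```

With this implicator, each value equals 1 minus the largest similarity between the sample and any sample of a different class, which is the simplified form discussed in the text.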

Now, the λ-conditional entropy of the decision feature l compared with the conditional feature subset $\dot{Q}$ , which is denoted by $E_{λ} (l | \dot{Q})$ , is defined [72] according to Equation 9. $E_{λ} (l | \dot{Q}) = - \frac{1}{| U |} \sum_{p \in U} F_{1} (p, \dot{Q}) log \frac{F_{1} (p, \dot{Q})}{F_{2} (p, \dot{Q})}$ (9)

where $F_{1} (p, \dot{Q})$ and $F_{2} (p, \dot{Q})$ are defined based on Equations 10 and 11 respectively. $F_{1} (p, \dot{Q}) = \sum_{\dot{p} \in U} λ_{p} \times ω (1 - λ_{p} < μ_{{\tilde{R}}_{\dot{Q}}} (p, \dot{p}))$ (10)

$F_{2} (p, \dot{Q}) = \sum_{\dot{p} \in U} λ_{p} \times ω ((1 - λ_{p} < μ_{{\tilde{R}}_{\dot{Q}}} (p, \dot{p})) \land (m_{l} (p) = m_{l} (\dot{p})))$ (11)

It is worth mentioning that 0 × log 0 is taken to be 0 in Equation 9 and throughout the paper. The significance of a feature q with respect to $\dot{Q}$ and l, denoted by $Sig (q, \dot{Q}, l)$ , is then defined by Equation 12. $Sig (q, \dot{Q}, l) = E_{λ} (l | \dot{Q}) - E_{λ} (l | \dot{Q} \cup \{q\})$ (12)

4 Proposed method

This section presents the proposed framework for constructing an IDS. The proposed IDS contains four main components: normalization, feature selection, sample selection, and an SVM classifier. Figure 2 shows the structure of the proposed IDS. In the first step, normalization ensures that all features of the dataset are numeric values mapped into the interval [0, 1] by Equation 13.

Fig. 2

The proposed IDS framework.

$m_{q} (p) = \frac{{\dot{m}}_{q} (p) - min_{\dot{p}} {\dot{m}}_{q} (\dot{p})}{max_{\dot{p}} {\dot{m}}_{q} (\dot{p}) - min_{\dot{p}} {\dot{m}}_{q} (\dot{p})}$ (13)

where ${\dot{m}}_{q} (p)$ stands for the qth feature of sample p before normalization and q ≠ l.
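The min-max normalization of Equation 13 is applied to each conditional feature separately. A minimal sketch is given below; mapping a constant feature to 0 is an assumption, since Equation 13 is undefined when the maximum equals the minimum, and the function names and data are illustrative.

```python
def min_max_normalize(column):
    """Equation 13 for one feature: scale its values into [0, 1]."""
    low, high = min(column), max(column)
    if high == low:
        # Assumption: a constant feature carries no information; map it to 0.
        return [0.0] * len(column)
    return [(value - low) / (high - low) for value in column]

def normalize_dataset(rows):
    """Normalize every feature (column) of a list of numeric samples."""
    columns = [min_max_normalize(list(column)) for column in zip(*rows)]
    return [list(row) for row in zip(*columns)]

raw = [[2.0, 100.0],
       [4.0, 300.0],
       [10.0, 200.0]]
normalized = normalize_dataset(raw)
print(normalized)
```

The label feature l is excluded from this step, in line with the condition q ≠ l.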

Fuzzy rough set theory based sample selection (FRSTSS) (presented in section 4.1) is the second step to reduce the dataset size. Fuzzy rough set theory based feature selection (FRSTFS) (presented in section 4.2) is the third step to reduce the feature space size. The SVM is employed in the subsequent step to learn the data subsample in the projected selected features according to the proposed IDS framework depicted in Fig. 2.

4.1 FRSTSS

The pseudo code of the proposed method for sample selection is presented in Fig. 3. It is based on fuzzy rough set theory; therefore, we call it fuzzy rough set theory based sample selection or FRSTSS.

Fig. 3

The pseudo code of the proposed FRSTSS.

We want to select only borderline data points. Therefore, we define $μ_{{PoR}_{\tilde{R}}} (p)$ as in section 3.1 and use it as our selection metric. Each data point p in U that has an intermediate value $μ_{{PoR}_{\tilde{R}}} (p)$ can be considered to be a borderline (or suitable-for-selection) data point. To find those data points, the set of all distinct values of $μ_{{PoR}_{\tilde{R}}} (p)$ is first sorted into a list T. Then, for a candidate subset S, the accuracy of the nearest neighbor classifier on U is computed with the leave-one-out mechanism using S as the training set, where S is a subset of data points with consecutive values of $μ_{{PoR}_{\tilde{R}}} (p)$ (taken from the sorted list T) whose size is slightly larger than ρ. Note that only the first ς such subsets S are considered. Finally, FRSTSS returns the subset S with the best accuracy.
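The selection loop described above can be sketched as follows. This is one possible reading of the description and not a transcription of the pseudo code in Fig. 3: each candidate S starts at one of the first ς distinct values in T and is extended until it holds ρ points, with ties at the last value included (which is why S can be slightly larger than ρ). The function names, the Euclidean distance and the toy data are assumptions.

```python
import math

def loo_1nn_accuracy(subset, points, labels):
    """1-NN accuracy on all of U with S as training set; a point never votes for itself."""
    correct = 0
    for i, p in enumerate(points):
        candidates = [j for j in subset if j != i]
        if not candidates:
            continue  # no usable neighbor: counted as a miss
        nearest = min(candidates, key=lambda j: math.dist(p, points[j]))
        correct += int(labels[nearest] == labels[i])
    return correct / len(points)

def frstss(points, labels, por, rho, sigma):
    """Return the best window of consecutive positive-region values and its accuracy."""
    order = sorted(range(len(points)), key=lambda i: por[i])
    distinct_values = sorted(set(por))  # the sorted list T
    best_subset, best_accuracy = [], -1.0
    for start in distinct_values[:sigma]:
        window = [i for i in order if por[i] >= start]
        if len(window) < rho:
            break
        cutoff = por[window[rho - 1]]
        subset = [i for i in window if por[i] <= cutoff]  # rho points plus ties
        accuracy = loo_1nn_accuracy(subset, points, labels)
        if accuracy > best_accuracy:
            best_subset, best_accuracy = subset, accuracy
    return best_subset, best_accuracy

points = [(0.0,), (0.1,), (0.45,), (0.55,), (0.9,), (1.0,)]
labels = [0, 0, 0, 1, 1, 1]
por = [0.55, 0.45, 0.10, 0.10, 0.45, 0.55]  # assumed precomputed memberships
subset, accuracy = frstss(points, labels, por, rho=2, sigma=2)
print(subset, round(accuracy, 3))
```

In this toy run both candidate windows reach the same accuracy, so the first one is kept; on real data the parameters ρ and ς control the trade-off between the subset size and the search cost.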

4.2 FRSTFS

Using all available features of a given dataset to evaluate and detect attack patterns increases the processing overhead, prolongs the diagnosis process and, consequently, reduces IDS efficiency. As mentioned before, the aim of feature selection is to find the feature subset that maximizes classification metrics such as the classification accuracy. Since removing worthless input features simplifies the problem and accelerates the classification task, feature selection is a very important step in designing an accurate and fast IDS.

In the presented model, fuzzy rough set theory is used for feature selection. In [73], rough set theory is used to propose an IDS; the results indicate that the rough set theory based IDS has a high classification accuracy and performs the classification task faster than a traditional IDS. In [74], rough set theory is used as the feature selection step of an IDS. In this paper, we employ fuzzy rough set theory as the feature selection step. Similar to the IDS presented in [74], we use the SVM algorithm as the classifier, but we also use fuzzy rough set theory based sample selection and feature selection to create a new, improved and scalable IDS.

The fuzzy rough set theory based feature selection (FRSTFS) algorithm starts with an empty $\dot{Q}$ . At each iteration, it adds to $\dot{Q}$ the best feature q in $Q ∖ \dot{Q}$ , i.e. the feature with the maximum Sig value, $q = arg max_{\dot{q}} Sig (\dot{q}, \dot{Q}, l)$ (see section 3.2 for more detail about $Sig (\dot{q}, \dot{Q}, l)$ and the other notation). The loop continues until $max_{\dot{q}} Sig (\dot{q}, \dot{Q}, l)$ is less than a threshold ɛ.
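The greedy loop of FRSTFS can be sketched independently of how Sig is computed. In the sketch below, `significance` is a placeholder callable standing in for Sig of Equation 12; the toy significance function, its gain values and the feature names are hypothetical and only serve to exercise the loop.

```python
def greedy_forward_selection(features, significance, epsilon):
    """Add the most significant remaining feature until the best gain drops below epsilon."""
    selected = []
    remaining = list(features)
    while remaining:
        best = max(remaining, key=lambda q: significance(q, selected))
        if significance(best, selected) < epsilon:
            break  # stopping rule: maximum Sig is less than the threshold
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical stand-in for Sig: a fixed gain per feature that halves
# each time another feature has already been selected.
toy_gain = {"duration": 0.5, "src_bytes": 0.3, "flag": 0.05}

def toy_significance(feature, selected):
    return toy_gain[feature] * 0.5 ** len(selected)

chosen = greedy_forward_selection(toy_gain, toy_significance, epsilon=0.1)
print(chosen)
```

Passing the significance measure as a callable keeps the loop unchanged if the entropy of section 3.2 is replaced by another measure.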

4.3 Overall implementation complexity

The time complexity of FRSTSS is at most of the order O (ς|U|ρ log ρ). For an input of n samples, the time complexity of FRSTFS is of the order O (n²), and the time complexity of SVM is of the order O (n²) or O (n³) [75]. Since FRSTFS and SVM operate on the roughly ρ samples selected by FRSTSS, FRSTSS+FRSTFS+SVM is at most of the order O (ς|U|ρ log ρ + ρ² + ρ³). Assuming ρ ⪡ |U|, the overall time complexity of the proposed IDS framework is at most of the order O (ς|U|ρ log ρ). The space complexity is of the order of the dataset, i.e. O (|U|).

5 Experimental study

In this section, the NSL-KDD dataset, a refined subset of the KDD-CUP99 dataset, is used as the main benchmark to determine the strengths and weaknesses of our IDS and to analyze its performance. The datasets are reviewed first, then the tests and evaluation criteria are presented, and finally the experimental results are reported.

5.1 Benchmark dataset

To evaluate the proposed method, the NSL-KDD dataset [66] is used. This dataset contains selected records of KDD-CUP99 [52], from which problems such as duplicate records have been eliminated. Since 1999, KDD-CUP99 has been the most widely used dataset for evaluating anomaly detection methods and IDSs. It was developed by Stolfo et al. [53] based on DARPA’98 [76], a program for the evaluation of IDSs.

DARPA’98 is a series of nearly 5 million records, each describing one network connection and each approximately 100 bytes in size. Its training set consists of approximately 4,900,000 connection vectors, each of which contains 41 features and a binary class tag indicating whether it is a normal or an attack record. Each attack record belongs to one of the following general subclasses:

Denial of Service (DoS) attack: a subclass of the attack class in which the attacker saturates or occupies a processing or storage resource of the victim so that the victim cannot serve legitimate requests.

User to Root (U2R) attack: a subclass of the attack class in which the attacker first gains access to a regular user’s account on the system (usually through password eavesdropping, dictionary attacks, or social engineering) and then obtains root access to the system by exploiting its vulnerabilities.

Remote to Local (R2L) attack: this attack occurs when an attacker who can send packets to a machine over the network, but possesses no account on that machine, exploits its vulnerabilities to gain access as a local user.

Probing attack: an attempt to gather information about a computer network with the aim of compromising network security.

It is worth mentioning that the test set does not follow the same probability distribution as the training set and includes specific attack types that do not exist in the training set, which makes the evaluation more realistic. More precisely, the training set includes 24 attack types, while the test set contains 14 additional attack types. However, some experts in intrusion detection believe that most new attacks are variants of old ones and that the signatures of known attacks are adequate to recognize them.

In Table 1, general types (or subclasses) of the attack class in NSL-KDD dataset can be seen. The approximate distribution of various subclasses in the training and testing datasets is also shown in Table 1. It should be noted that the total does not sum up to 100% due to the rounding of ratios.

Table 1
Approximate distribution of training and test data in NSL-KDD dataset

Class	Training	Test
NORMAL	48%	19%
PRB	20%	1%
DOS	26%	73%
U2R	0.2%	0.07%
R2L	5%	5%

The main aim of data preprocessing is to convert nominal features into numerical ones. Table 2 lists the features of the NSL-KDD dataset. Some of them are nominal, and since SVM can be applied only to numerical features, they have to be converted. As can be seen in Table 2, some features, such as Protocol_type (B), Service (C) and Flag (D), have nominal values. For example, the values tcp, udp, and icmp of Protocol_type are mapped to 1, 2, and 3, respectively. As our work assumes that every attribute ranges over the interval [0, 1], the input data must be represented in either binary or continuous normalized form. Therefore, after the three nominal features are converted to numerical format, the remaining task is to normalize the data into binary or continuous form. In our IDS, the continuous normalized form is used for the features.

Table 2

Different features of the NSL-KDD dataset

Label	Feature name	Label	Feature name	Label	Feature name
A	Duration	O	Su_attempted	AC	Same_srv_rate
B	Protocol_type	P	Num_root	AD	Diff_srv_rate
C	Service	Q	Num_file_creations	AE	Srv_diff_host_rate
D	Flag	R	Num_shells	AF	Dst_host_count
E	Src_byte	S	Num_access_files	AG	Dst_host_srv_count
F	Dst_byte	T	Num_outbound_cmds	AH	Dst_host_same_srv_rate
G	Land	U	Is_host_login	AI	Dst_host_diff_srv_rate
H	Wrong_fragment	V	Is_guest_login	AJ	Dst_host_same_src_port_rate
I	Urgent	W	Count	AK	Dst_host_srv_diff_host_rate
J	Hot	X	Srv_count	AL	Dst_host_serror_rate
K	Num_failed_login	Y	Serror_rate	AM	Dst_host_srv_serror_rate
L	Logged_in	Z	Srv_serror_rate	AN	Dst_host_rerror_rate
M	Num_compromised	AA	Rerror_rate	AO	Dst_host_srv_rerror_rate
N	Root_shell	AB	Srv_rerror_rate

To normalize the features, each feature is first analyzed statistically over the available NSL-KDD data to determine its maximum and minimum values. Normalization is then carried out according to Equation 13.
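The two preprocessing steps can be sketched as follows. Equation 13 is not reproduced in this section, so the sketch assumes the common min-max form (x - min) / (max - min), which is consistent with the use of per-feature minimum and maximum values and the [0, 1] range described above. The function names are hypothetical; only the tcp/udp/icmp codes come from the text.

```python
from typing import Dict, List, Sequence

# Integer codes for Protocol_type as given in the text.
PROTOCOL_CODES: Dict[str, int] = {"tcp": 1, "udp": 2, "icmp": 3}


def encode_nominal(values: Sequence[str], codes: Dict[str, int]) -> List[float]:
    """Replace each nominal value with its integer code."""
    return [float(codes[v]) for v in values]


def min_max_normalize(column: Sequence[float]) -> List[float]:
    """Scale a numeric column into [0, 1] using its observed min and max.

    Assumes the min-max form (x - min) / (max - min); a constant column is
    mapped to 0.0 to avoid division by zero.
    """
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


protocols = ["tcp", "icmp", "udp", "tcp"]
encoded = encode_nominal(protocols, PROTOCOL_CODES)
normalized = min_max_normalize(encoded)
print(encoded)      # [1.0, 3.0, 2.0, 1.0]
print(normalized)   # [0.0, 1.0, 0.5, 0.0]
```

In practice the minimum and maximum would be taken from the training data and reused for the test data, which is a design choice the text does not spell out.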

5.2 Evaluation criteria

Our ultimate purpose is to improve the intrusion detection rate. In this study, the main focus is first to detect an intrusion and then to identify its correct subclass. In fact, if attacks are properly detected, the appropriate attack subclass can be decided under the supervision of the network administrator. Therefore, the two main metrics studied in this section are precision and recall, which are well known and important in assessing data mining and machine learning algorithms. Precision is formally defined in Equation 14. $precision = \frac{TP}{TP + FP}$ (14)

Recall is formally defined in Equation 15. $recall = \frac{TP}{TP + FN}$ (15)

In Equations 14 and 15, TP is the number of data points correctly assigned to the positive (attack) class, FP is the number of data points incorrectly assigned to the positive class, and FN is the number of data points mistakenly assigned to the negative (non-attack) class. Any multi-class problem can be converted into a set of two-class problems: each class in turn is treated as positive and all others as negative, after which precision and recall can be calculated. Precision is the fraction of data points labeled positive by the algorithm that truly belong to the positive class. Recall is the fraction of positive-class data points that the algorithm correctly assigns to the positive (attack) class.

A third metric, the F-measure, is defined in terms of precision and recall. Its general definition is given in Equation 16. $F_{β} = \frac{(1 + β^{2}) \times precision \times recall}{β^{2} \times precision + recall}$ (16) Here β weights recall relative to precision; we only use F₁ (β = 1).
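Equations 14 to 16 and the one-vs-rest conversion can be sketched in a few lines. The function names and the label sequences below are hypothetical and serve only to illustrate the definitions.

```python
from typing import Sequence, Tuple


def one_vs_rest_counts(
    y_true: Sequence[str], y_pred: Sequence[str], positive: str
) -> Tuple[int, int, int]:
    """Return (TP, FP, FN) when `positive` is treated as the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, fn


def precision(tp: int, fp: int) -> float:
    """Equation 14; returns 0.0 when nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Equation 15; returns 0.0 when there are no positive samples."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_beta(p: float, r: float, beta: float = 1.0) -> float:
    """Equation 16; beta = 1 gives the F1 score used in the paper."""
    denom = beta ** 2 * p + r
    return (1 + beta ** 2) * p * r / denom if denom else 0.0


# Hypothetical labels, for illustration only.
y_true = ["DoS", "DoS", "Normal", "R2L", "DoS", "Normal"]
y_pred = ["DoS", "Normal", "Normal", "DoS", "DoS", "Normal"]
tp, fp, fn = one_vs_rest_counts(y_true, y_pred, positive="DoS")
p, r = precision(tp, fp), recall(tp, fn)
print(tp, fp, fn, round(p, 4), round(r, 4), round(f_beta(p, r), 4))
```

As a consistency check, applying `f_beta` to the precision and recall that Tables 4 and 6 report for plain SVM on R2L (63.39 and 53.55) gives 58.06, the value listed in Table 7.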

5.3 Experimental setting

After the important samples and features are selected by the FRSTSS and FRSTFS algorithms, respectively, the SVM is trained on the selected, projected data. SVM is the main classifier in our IDS: the training data are used to train it, and the test data are used to detect potential attacks in the test phase. FRSTSS and FRSTFS are applied only to the training data. The parameters ρ and ς determine the computational cost of our method. We chose their values by trial and error so that the computational cost remains acceptable. Throughout the experiments, the parameter ς is set to one percent of |Λ| (i.e. 0.01 × |Λ|) and the parameter ρ is set to 0.1 percent of the total dataset size (i.e. 0.001 × |U|).

In recent years, various SVM implementations have been developed. Most were designed and implemented in academic environments, although commercial products also exist. The LIBSVM software [77] is used in our IDS. It is a library, written in C++, for machine learning tasks with SVM. Different kernels can be used depending on the nature of the data; the most serious disadvantage of the SVM method is the lack of a mechanism for selecting a suitable kernel function. By default, LIBSVM supports four kernels: linear, polynomial, radial basis function (RBF), and sigmoid. In the proposed system, the RBF kernel is selected due to its higher efficiency. This kernel has only one parameter, γ, which is set to its default value [77].
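For reference, the RBF kernel has the form K(x, y) = exp(−γ‖x − y‖²). The sketch below evaluates it in plain Python; it does not call LIBSVM. It assumes that the default γ is 1 / (number of features), which is the default documented for LIBSVM. The function name and the two sample records are hypothetical.

```python
import math
from typing import Optional, Sequence


def rbf_kernel(
    x: Sequence[float], y: Sequence[float], gamma: Optional[float] = None
) -> float:
    """RBF kernel exp(-gamma * ||x - y||^2).

    When gamma is None, 1 / len(x) is used, mirroring LIBSVM's documented
    default of 1 / num_features.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if gamma is None:
        gamma = 1.0 / len(x)
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Two hypothetical records with 8 normalized features each.
a = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
b = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.9]
print(rbf_kernel(a, a))   # identical points give 1.0
print(rbf_kernel(a, b))
```

Because all features are normalized to [0, 1], squared distances stay bounded by the number of features, so the default γ keeps kernel values away from zero.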

5.4 Compared methods

To investigate the performance of the proposed algorithm, we compare it with the following state-of-the-art methods: (1) Aburomman and Ibne Reaz [78], (2) Singh et al. [79], (3) Sreenath and Udhayan [80], (4) Gaikwad and Thool [63], (5) Masarat et al. [81], (6) Elbasiony et al. [82], (7) Aslahi-Shahri et al. [83], and (8) Rastegari et al. [84]. In addition, we compare the proposed method with other “strong” base algorithms: (1) Fisher Classifier (Fisher), (2) Quadratic Classifier (Quadratic), (3) UDC classifier (UDC), (4) Statistical Decision Tree Classifier (StatsDTC), (5) Decision Tree Classifier (DTC), (6) Naïve Bayesian Classifier (NaiveBC), (7) Bagging of Ensemble of Naïve Bayesian Classifier (BagEnsNaiveBC), (8) Bagging of Ensemble of Rule-based Classifier (BagEnsWeakC), (9) Bagging of Ensemble of Decision Tree Classifier (BagEnsDT), (10) Adaptive Boosting Naïve Bayesian Classifier (ADABOOSTC), (11) Boosting ensemble of Decision Tree Classifiers (BoostEnsDT), (12) SVM, (13) multi-layer perceptron (MLP), and (14) FRSTFS+SVM.

5.5 Experimental results

To train our IDS, 2,000,000 records of the KDDTrain+ set are randomly selected without replacement such that they include the 24 known attack types. To test it, 700,000 records of the KDDTest+ set are selected such that they include 14 new, unknown attack types along with the 24 known ones. After the important features are selected, the input data are reduced by about 80 percent.

5.5.1 Traditional methods

An advantage of our method is its low computational cost and therefore its low usage of computer resources such as memory and processor time, which are important for intrusion detection. For this reason, computational costs are measured in four modes (with sample selection, with feature selection, with both, and with neither) and presented in Table 3. All times in Table 3 are in milliseconds (ms). Running FRSTFS and constructing an SVM model on the selected features together take 70.09% less time than constructing an SVM model on the whole dataset. Running FRSTSS and constructing an SVM model on the selected samples together take 91.73% less time. Running both FRSTSS and FRSTFS and constructing an SVM model on the selected samples and features together take 98.67% less time.

Table 3
Training time of different SVM models without and with feature selection and without and with sample selection in terms of milliseconds

	All Features	FRSTFS	Time Reduction
All Samples	28,384,654	8,489,850	70.09%
FRSTSS	2,348,220	377,124	83.94%
Time Reduction	91.73%	95.56%	98.67%
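The percentages in Table 3 follow directly from the listed training times. The short check below recomputes them; the function and variable names are hypothetical, while the times are copied from Table 3.

```python
def time_reduction(baseline_ms: float, reduced_ms: float) -> float:
    """Percentage of training time saved relative to the baseline."""
    return 100.0 * (1.0 - reduced_ms / baseline_ms)


# Training times in milliseconds, copied from Table 3.
all_samples_all_features = 28_384_654
all_samples_frstfs = 8_489_850
frstss_all_features = 2_348_220
frstss_frstfs = 377_124

# Reductions relative to plain SVM on all samples and all features.
print(round(time_reduction(all_samples_all_features, all_samples_frstfs), 2))   # 70.09
print(round(time_reduction(all_samples_all_features, frstss_all_features), 2))  # 91.73
print(round(time_reduction(all_samples_all_features, frstss_frstfs), 2))        # 98.67

# Row-wise and column-wise reductions shown in the table margins.
print(round(time_reduction(frstss_all_features, frstss_frstfs), 2))             # 83.94
print(round(time_reduction(all_samples_frstfs, frstss_frstfs), 2))              # 95.56
```

The bottom-right cell (98.67%) compares the fully reduced model with the unreduced baseline, which is why it exceeds both marginal reductions.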

The performance of these methods is also reported in Table 4 in terms of precision.

Table 4

Precision of different SVM models without and with feature selection and without and with sample selection

	R2L	U2R	PRB	DoS	Normal
SVM	63.39%	53.49%	99.86%	97.72%	76.71%
FRSTFS+SVM	99.04%	55.86%	99.87%	97.29%	96.65%
FRSTSS+SVM	98.88%	55.18%	99.77%	97.58%	95.85%
FRSTSS+FRSTFS+SVM	99.21%	57.07%	99.94%	97.98%	97.02%

In Table 4, “FRSTSS+FRSTFS+SVM” denotes the proposed method with both sample selection and feature selection; “FRSTSS+SVM” denotes the proposed method with sample selection only; “FRSTFS+SVM” denotes the proposed method with feature selection only; and “SVM” denotes the method with neither. The features presented in Table 5 are those selected by the “FRSTSS+FRSTFS+SVM” method. The “FRSTFS+SVM” method selects the same features plus one more: num_file_creations, labeled “Q” in Table 2.

Table 5

The selected features by “FRSTSS+FRSTFS+SVM”

Label	Feature
C	Service
E	Src_byte
F	Dst_byte
W	Count
AF	Dst_host_count
AG	Dst_host_srv_count
AI	Dst_host_diff_srv_rate
AK	Dst_host_srv_diff_host_rate

The “FRSTSS+FRSTFS+SVM” method achieves the highest precision in Table 4 for all classes. “FRSTFS+SVM” is second best, “FRSTSS+SVM” third, and plain SVM worst for the R2L, U2R, and Normal classes; the exceptions are PRB, where plain SVM exceeds “FRSTSS+SVM”, and DoS, where plain SVM ranks second. Table 6 reports the recall of the same methods, and Table 7 reports their F-measure.

Table 6

Recall of different SVM models without and with feature selection and without and with sample selection

	R2L	U2R	PRB	DoS	Normal
SVM	53.55%	40.07%	97.65%	86.27%	97.99%
FRSTFS+SVM	65.40%	46.40%	98.50%	84.10%	98.39%
FRSTSS+SVM	63.59%	40.07%	98.41%	85.96%	98.26%
FRSTSS+FRSTFS+SVM	66.17%	46.37%	98.77%	86.83%	98.79%

Table 7

F-measure (F₁) of different SVM models without and with feature selection and without and with sample selection

	R2L	U2R	PRB	DoS	Normal
SVM	58.06%	45.82%	98.74%	91.64%	86.05%
FRSTFS+SVM	78.78%	50.69%	99.18%	90.22%	97.51%
FRSTSS+SVM	77.40%	46.43%	99.09%	91.40%	97.04%
FRSTSS+FRSTFS+SVM	79.39%	51.17%	99.35%	92.07%	97.90%

Comparison of precision, recall and F-score for each class is shown in Table 4, Table 6, and Table 7, respectively, for the different modes of our method. The F-score results in Table 7 confirm the precision results in Table 4: the “FRSTSS+FRSTFS+SVM” method is the best method in terms of F-score for all classes. In the recall results in Table 6, the “FRSTSS+FRSTFS+SVM” method is also the best for all classes except U2R, where it is second to the “FRSTFS+SVM” method.

What is evident in Tables 4, 6, and 7 is that the recall and F-score of the IDS also increase with sample and feature selection. In all five classes, the detection rate is higher when FRSTSS and FRSTFS are used than when all features and samples are used. The improvement is particularly large for the U2R and R2L subclasses. This indicates that the system learns from the most important features and prototypes.

Next, the performance of our method is compared with a multilayer perceptron (MLP) artificial neural network (denoted by MLP), a nearest neighbor classifier (denoted by 1NN), and a decision tree (denoted by DT). The MLP is trained with the back-propagation learning algorithm, which is based on steepest descent: the network parameters are adjusted according to error signals computed as each pattern is presented to the network. MATLAB is used to implement these methods. The Gini measure is used for the decision tree classifier and the Euclidean distance for the nearest neighbor classifier.

The comparison of precision, recall and F-score by attack subclass is shown in Figs. 4, 5, and 6, respectively. The same figures also compare the classifiers trained on all data points and all features with those trained on the data selected by FRSTSS and FRSTFS. As shown, the SVM outperforms the other methods in almost all subclasses. Also, applying FRSTSS and FRSTFS as a preprocessing step almost always improves the results for all classifiers and all subclasses.

Fig. 4

Comparison of attack precision in the different subclasses for different methods.

Fig. 5

Comparison of attack recall in the different subclasses for different methods.

Fig. 6

Comparison of attack F₁ in the different subclasses for different methods.

5.5.2 Modern methods

In Tables 8 and 9, the performance of the proposed classification method is compared with that of other classification methods in terms of F-measure and accuracy, respectively. The results indicate that the proposed method outperforms the other classification methods on the NSL-KDD dataset. Each reported result is obtained from 50 independent runs: each run yields an independent F-measure, and the average of the 50 values is reported as the F-measure of the method. It should also be mentioned that the results reported in Tables 8 and 9 are obtained on all 24 types of attacks.
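The aggregation over repeated runs can be sketched as follows. The text states only that the average over 50 runs is reported, so reading the “±” term in Tables 8 and 9 as the sample standard deviation is an assumption. The function name and the five run values are hypothetical.

```python
import statistics
from typing import Sequence


def summarize_runs(scores: Sequence[float]) -> str:
    """Format repeated-run scores as 'mean±std' with two decimals."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample standard deviation (assumed)
    return f"{mean:.2f}±{std:.2f}"


# Hypothetical F1 values from five runs (the paper uses 50).
runs = [99.7, 99.8, 99.9, 99.8, 99.8]
print(summarize_runs(runs))
```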

Table 8
F₁ rates of different classification methods

	NSL-KDD
FRSTSS+FRSTFS+SVM	99.80±0.10
Fisher	92.53±0.01
Quadratic	70.99±6.76
UDC	82.57±3.15
StatsDTC	99.79±0.01
DTC	99.79±0.01
NaiveBC	93.75±0.08
BagEnsNaiveBC	93.72±0.12
BagEnsWeakC	53.30±6.49
BagEnsDT	95.04±0.27
ADABOOSTC	95.30±0.36
BoostEnsDT	98.77±0.15
SVM	93.73±0.39

Table 9

Accuracy rates of different classification methods

	NSL-KDD
FRSTSS+FRSTFS+SVM	99.84±0.10
Fisher	92.52±0.01
Quadratic	67.14±8.19
UDC	81.55±3.05
StatsDTC	99.79±0.01
DTC	99.79±0.01
NaiveBC	93.75±0.08
BagEnsNaiveBC	93.73±0.12
BagEnsWeakC	53.48±0.08
BagEnsDT	95.03±0.27
ADABOOSTC	95.30±0.36
BoostEnsDT	98.77±0.16
SVM	93.55±0.35

Table 10 summarizes the results of state-of-the-art classification methods in the intrusion detection field. For each method, we tried to use the parameters indicated in its corresponding paper. To be fair, all other settings (for example, the training and test sets) were kept the same across methods. Three datasets are used: (a) a real-world, user-created dataset, (b) DARPA, and (c) NSL-KDD. While our method is the third best on the DARPA dataset in terms of precision, it is the best on the other two datasets. Note that even a small improvement on the NSL-KDD dataset is significant due to its huge size. The real-world dataset is of moderate size, with 20,000 records, and the DARPA dataset is small. One reason our method falls short on the DARPA dataset is its small size, as our method works better on large datasets.

Table 10

Precisions of different classification methods vs. the proposed classification method on three datasets

Method	Real Traffic	DARPA	NSL-KDD
Aburomman and Ibne Reaz [78]	92.85	99.99	92.59
Singh et al. [79]	86.20	92.83	98.66
Sreenath and Udhayan [80]	96.37	98.91	97.85
Gaikwad and Thool [63]	91.08	98.09	81.29
Masarat et al. [81]	75.68	81.50	93.00
Elbasiony et al. [82]	86.57	93.23	98.15
Aslahi-Shahri et al. [83]	91.36	98.39	97.20
Rastegari et al. [84]	90.48	97.44	98.40
Binbusayyis and Vaiyapuri [85]	95.91	99.78	99.58
Proposed	97.25	99.23	99.84

6 Conclusions and future works

Nowadays, IDSs have become very important tools for ensuring the security of computer networks. Detection methods in IDSs are divided into two categories: misuse detection and anomaly detection. In this work, a network-based IDS is built on an anomaly detection method using fuzzy rough set theory and SVM. Fuzzy rough set theory is one of the most widely used theories for knowledge discovery in information systems. Using all features in the detection process reduces IDS performance, so feature selection is beneficial: it reduces computational time and can even increase detection performance. By the same reasoning, sample subset selection is beneficial. In this article, we reviewed some basic concepts of fuzzy rough set theory and explained how it can be applied to reduce the number of features and samples. The proposed model was evaluated on the NSL-KDD dataset, and the results indicate that the proposed framework outperforms the compared traditional and state-of-the-art methods.

As future work, fuzzy rough set theory can be utilized to improve SVM efficiency through data reduction in other application areas. For the proposed sample selection algorithm, FRSTSS, other subset selection approaches can be investigated. Another direction is to use several classifiers and aggregate their votes for the classification task.

Compliance with ethical standards

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

References

1. Endorf, Eugene and Mellander, Intrusion Detection & Prevention, McGraw-Hill (2004).

2. Santos, Sanz, Laorden, Brezo and Bringas P.G., Opcode-Sequence-Based Semi-Supervised Unknown Malware Detection, Computational Intelligence in Security for Information Systems 6694 (2011), 50–57.

3. Niu, Khozouie, Parvin, Alinejad-Rokny, Beheshti and Mahmoudi M.R., An Ensemble of Locally Reliable Cluster Solutions, Appl Sci 10 (2020), 1891.

4. Mojarad, Parvin, Nejatian and Rezaie, Consensus Function Based on Clusters Clustering and Iterative Fusion of Base Clusters, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 27(1) (2019), 97–120.

5. Najafi, Parvin, Mirzaie, Nejatian and Rezaie, Dependability-based cluster weighting in clustering ensemble, Statistical Analysis and Data Mining 13(2) (2020), 151–164.

6. Parvin, Beigi and Mozayani, A clustering ensemble learning method based on the ant colony clustering algorithm, Int J Appl Comput Math 11(2) (2012), 286–302.

7. Rashidi, Nejatian, Parvin and Rezaie, Diversity based cluster weighting in cluster ensemble: an information theory approach, Artificial Intelligence Review (2019), 1–28.

8. Abbasi, Nejatian, Parvin, Rezaie and Bagherifard, Clustering ensemble selection considering quality and diversity, Artificial Intelligence Review 52(2) (2019), 1311–1340.

9. Nazari, Dehghan, Nejatian, Rezaie and Parvin, A comprehensive study of clustering ensemble weighting based on cluster quality and diversity, Pattern Analysis and Applications 22(1) (2019), 133–145.

10. Bagherinia, Minaei-Bidgoli, Hossinzadeh and Parvin, Elite fuzzy clustering ensemble based on clustering diversity and quality measures, Applied Intelligence 49(5) (2019), 1724–1747.

11. Mojarad, Nejatian, Parvin and Mohammadpoor, A fuzzy clustering ensemble based on cluster clustering and iterative Fusion of base clusters, Applied Intelligence 49(7) (2019), 2567–2581.

12. Bagherinia, Minaei-Bidgoli, Hosseinzadeh and Parvin, Reliability-based fuzzy clustering ensemble, Fuzzy Sets and Systems, DOI:10.1016/j.fss.2020.03.008.

13. Nejatian, Parvin and Faraji, Using sub-sampling and ensemble clustering techniques to improve performance of imbalanced classification, Neurocomputing 276 (2018), 55–66.

14. Jamalinia, Khalouei, Rezaie, Nejatian, Bagheri-Fard and Parvin, Diverse classifier ensemble creation based on heuristic dataset modification, Journal of Applied Statistics 45(7) (2018), 1209–1226.

15. Pirbonyeh, Rezaie, Parvin, Nejatian and Mehrabi, A linear unsupervised transfer learning by preservation of cluster-and-neighborhood data organization, Pattern Analysis and Applications 22(3) (2019), 1149–1160.

16. Nejatian, Rezaie, Parvin, Pirbonyeh, Bagherifard and Yusof S.K.S., An innovative linear unsupervised space adjustment by keeping low-level spatial data structure, Knowledge and Information Systems 59(2) (2019), 437–464.

17. Niu, Xu, Akbarzadeh, Parvin, Beheshti and Alinejad-Rokny, Deep feature learnt by conventional deep neural network, Computers & Electrical Engineering 84, 106656.

18. […], Parvin and Izadparast, Deep Learning Neural Network for Unconventional Images Classification, Neural Process Lett (2020). https://doi.org/10.1007/s11063-020-10238-3.

19. Yasrebi, Eskandar-Baghban, Parvin and Mohammadpour, Optimisation inspiring from behaviour of raining in nature: droplet optimisation algorithm, International Journal of Bio-Inspired Computation 12(3) (2018), 152–163.

20. Nejatian, Omidvar, Mohamadi, Eskandar-Baghban, Rezaie and Parvin, An optimization algorithm based on behavior of see-see partridge chicks, Journal of Intelligent & Fuzzy Systems 33(6) (2017), 3227–3240.

21. Omidvar M.N., Nejatian, Parvin and Rezaie, A new natural-inspired continuous optimization approach, Journal of Intelligent & Fuzzy Systems (2018), 1–17.

22. Alishvandi, Gouraki G.H. and Parvin, An enhanced dynamic detection of possible invariants based on best permutation of test cases, Computer Systems Science and Engineering 31(1) (2016), 53–61.

23. Parvin, Nejatian and Mohamadpour, Explicit memory based ABC with a clustering strategy for updating and retrieval of memory in dynamic environments, Applied Intelligence 48(11) (2018), 4317–4337.

24. Moradi, Nejatian, Parvin and Rezaie, CMCABC: Clustering and memory-based chaotic artificial bee colony dynamic optimization algorithm, International Journal of Information Technology & Decision Making 17(04) (2018), 1007–1046.

25. Jenghara M.M., Ebrahimpour-Komleh, Rezaie, Nejatian, Parvin and Yusof S.K.S., Imputing missing value through ensemble concept based on statistical measures, Knowledge and Information Systems 56(1) (2018), 123–139.

26. Jenghara M.M., Ebrahimpour-Komleh and Parvin, Dynamic protein–protein interaction networks construction using firefly algorithm, Pattern Analysis and Applications 21(4) (2018), 1067–1081.

27. Bahrani, Minaei-Bidgoli, Parvin, Mirzarezaee, Keshavarz and Alinejad-Rokny, User and item profile expansion for dealing with cold start problem, J Intell Fuzzy Syst 38(4) (2020), 4471–4483.

28. Yasrebi, Rafe, Parvin and Nejatian, An efficient approach to state space management in model checking of complex software systems using machine learning techniques, J Intell Fuzzy Syst 38(2) (2020), 1761–1773.

29. Partabian, Rafe, Parvin and Nejatian, An approach based on knowledge exploration for state space management in checking reachability of complex software systems, Soft Comput 24(10) (2020), 7181–7196.

30. Tavana, Parvin and Rezazadeh, Parkinson detection: an image processing approach, Journal of Medical Imaging and Health Informatics 7(2) (2017), 464–472.

31. Aminsharifi, Irani, Pooyesh, Parvin, Dehghani, Yousofi, Fazel and Zibaie, Artificial neural network system to predict the postoperative outcome of percutaneous nephrolithotomy, Journal of Endourology 31(5) (2017), 461–467.

32. Hosseinpoor M.J., Parvin, Nejatian and Rezaie, Gene Regulatory Elements Extraction in Breast Cancer by Hi-C Data Using a Meta-Heuristic Method, Russian Journal of Genetics 55(9) (2019), 1152–1164.

33. Shabaniyan, Parsaei, Aminsharifi, Movahedi M.M., Jahromi A.T., Pouyesh and Parvin, An artificial intelligence-based clinical decision support system for large kidney stone treatment, Australasian Physical & Engineering Sciences in Medicine 42(3) (2019), 771–779.

34. Pawlak, Rough sets, International Journal of Computer and Information Science 11(5) (1982), 341–356.

35. Pawlak and Skowron, Rough sets: some extensions, Information Sciences 177(1) (2007), 28–40.

36. Pawlak and Skowron, Rough sets and Boolean reasoning, Information Sciences 177(1) (2007), 41–73.

37. Wang, Yang, Jensen and Liu, Rough set feature selection and rule induction for prediction of malignancy degree in brain glioma, Comput Methods Programs Biomed 83(2) (2006), 147–156.

38. Parthaláin N.M., Shen and Jensen, Distance Measure Assisted Rough Set Feature Selection, in Proceedings of 2007 FUZZ-IEEE (2007), 1–6.

39. Parthaláin M.N., Shen and Jensen, A Distance Measure Approach to Exploring the Rough Set Boundary Region for Attribute Reduction, IEEE Trans Knowl Data Eng 22(3) (2010), 305–317.

40. Jensen, Data Reduction with Rough Sets, Encyclopedia of Data Warehousing and Mining (2009), 556–560.

41. Jensen and Cornelis, Fuzzy-rough instance selection, in Proceedings of 2009 FUZZ-IEEE (2009), 1–7.

42. Chen D.G., Theory and Methods of Fuzzy Rough Sets, Science Press: Beijing, China (2013).

43. Nanda and Majumdar, Fuzzy rough sets, Fuzzy Sets and Systems 45(2) (1992), 157–160.

44. Sun, Mou, Qiu, Wang and Gao, Adaptive fuzzy control for non-triangular structural stochastic switched nonlinear systems with full state constraints, IEEE Transactions on Fuzzy Systems 27(8) (2019), 1587–1601.

45. Qiu, Sun, Wang and Gao, Observer-based fuzzy adaptive event-triggered control for pure-feedback nonlinear systems with prescribed performance, IEEE Transactions on Fuzzy Systems 27(11) (2019), […]–2162.

46. Qiu, Sun, Rudas I.J. and Gao, Command filter-based adaptive NN control for MIMO nonlinear systems with full state constraints and actuator hysteresis, IEEE Transactions on Cybernetics, in press, doi: 10.1109/TCYB.2019.2944761.

47. Kuncheva L.I., Fuzzy rough sets: application to feature selection, Fuzzy Sets and Systems 51(2) (1992), 147–153.

48. Amiri and Jensen, Missing data imputation using fuzzy-rough methods, Neurocomputing 205 (2016), 152–164.

49. Jensen, Amiri and Parthaláin N.M., Effective instance selection using the fuzzy-rough lower approximation, in Proceedings of 2019 FUZZ-IEEE (2019), 1–6.

50. Ramentol, Gondres, Lajes, Bello, Caballero, Cornelis and Herrera, Fuzzy-rough imbalanced learning for the diagnosis of High Voltage Circuit Breaker maintenance: The SMOTE-FRST-2T algorithm, Eng Appl Artif Intell 48 (2016), 134–139.

51. Chandrasekhar A.M. and Raghuveer, Intrusion detection technique by using k-means, fuzzy neural network and SVM classifiers, in The 2013 International Conference on Computer Communication and Informatics (ICCCI), Jan. 2013, 1–7.

52. Tavallaee, Bagheri, Lu and Ghorbani A.A., A Detailed Analysis of the KDD CUP 99 Data Set, in Proceedings of the 2009 IEEE Symposium on Computational Intelligence in Security and Defense Applications (CISDA) (2009).

53. Stolfo S.J., Fan, Prodromidis, Chan P.K. and Lee, Cost-sensitive modeling for fraud and intrusion detection: Results from the JAM project, in Proceedings of the 2000 DARPA Information Survivability Conference and Exposition (2000).

54. Abduvaliyev, Pathan A.-S.K., Zhou, Roman and Wong W-C., On the Vital Areas of Intrusion Detection Systems in Wireless Sensor Networks, IEEE Communications Surveys & Tutorials 15(3) (2013), 1223–1237.

55. Aljarah and Ludwig S.A., MapReduce intrusion detection system based on a particle swarm optimization clustering algorithm, in The 2013 IEEE Congress on Evolutionary Computation (CEC), June 2013, 955–962.

56. Viterbo and Ohrn, Minimal approximate hitting sets and rule templates, International Journal of Approximate Reasoning 25 (2000), 123–143.

57. Butun, Morgera S.D. and Sankar, A Survey of Intrusion Detection Systems in Wireless Sensor Networks, IEEE Communications Surveys & Tutorials 16(1) (2014), 266–282.

58. Bhuyan M.H., Bhattacharyya D.K. and Kalita J.K., Network Anomaly Detection: Methods, Systems and Tools, IEEE Communications Surveys & Tutorials 16(1) (2014), 303–336.

59. Luxburg U.V. and Bousquet, Distance-based classification with Lipschitz functions, Journal of Machine Learning Research 5 (2004), 669–695.

60. Cortes and Vapnik, Support-vector networks, Machine Learning 20 (1995), 273–297.

61. Zhang, Perdisci, Lee, Luo and Sarfraz, Building a Scalable System for Stealthy P2P-Botnet Detection, IEEE Transactions on Information Forensics and Security 9(1) (2014), 27–38.

62. KDD Cup 1999. Available on: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html, October (2007).

63. Gaikwad D.P. and Thool R.C., Intrusion Detection System Using Bagging Ensemble Method of Machine Learning, in Proceedings of 2015 International Conference Computing Communication Control and Automation (ICCUBEA) (2015), 291–295.

64. Kuang, Zhang and Xu, A novel hybrid KPCA and SVM with GA model for intrusion detection, Applied Soft Computing 18 (2014), 178–184.

65. Kuang, Zhang, Jin and Xu, A novel SVM by combining kernel principal component analysis and improved chaotic particle swarm optimization for intrusion detection, Soft Computing 19 (2015), 1187–1199.

66. NSL-KDD data set for network-based intrusion detection systems. Available on: http://nsl.cs.unb.ca/NSL-KDD/ (2009).

67.

Keshavarz

, Ghassemian

and Dehghani

, “Hierarchical Classification of Hyperspectral Images by Using SVMs and Neighborhood Class Property", In IEEE IGARSS2005, (2005), 3219–3222.

68.

Woniakeyot

, Graña

and Corchado

, “A survey of multiple classifier systems as hybrid systems”, Inf Fusion 16 (2014), 45–90.

69.

Bijani

and Robertson

, “A Review of Attacks and Security Approaches in Open Multi-Agent Systems”, Artif Intell Rev 42 (2014), 607–636.

70.

Chebrolu

, Abraham

and Thomas

J.P.

, Feature deduction and ensemble design of intrusion detection systems, Computers & Security 24 (2005), 295–307.

71.

Dastfal

, Nejatian

, Parvin

, Rezaie

Lecture Notes in Computer Science 10632 (2019), 54–66 Introducing a Classification Model Based on SVM for Network Intrusion Detection, Castro F., Miranda-Jiménez S., González-Mendoza M. (eds) Advances in Soft Comuting. MICAI.

72.

Zhang

, Mei

C.L.

, Chen

D.G.

and Li

J.H.

, Feature selection in mixed data: A method using a novel fuzzy rough set-based information entropy, Pattern Recognition 56 (2016), 1–15.

73.

Zhang

, Zhang

, Yu

and Bai

, Intrusion Detection Using Rough Set Classification, Journal of Zhejiang University Science 5(9) (2004), 1076–1086.

74.

Chen

R.C.

, Cheng

and Hsieh

C.F.

, Using Rough Set and Support Vector Machine for Network Intrusion Detection System, in Proceedings of the 1st Asian Conference on Intelligent Information and Database Systems, Washington, DC, USA, (2009).

75.

Bottou

and Lin

C.J.

, Support vector machine solvers, In: Large Scale Kernel Machines (2007), 301–320.

76.

Lippmann

, Haines

, Fried

, Korba

and Das

, The DARPA off-line intrusion detection evaluation, Computer Networks 34 (2000), 579–595.

77.

LIBSVM – A Library for Support Vector Machines: http://www.csie.ntu.edu.tw/cjlin/libsvm/.

78.

Aburomman

A.A.

and IbneReaz

M.B.

, A novel SVM-kNN-PSO ensemble method for intrusion detection system, Applied Soft Computing 38 (2016), 360–372.

79.

Singh

, Kumar

and Singl

R.K.

, An intrusion detection system using network traffic profiling and online sequential extreme learning machine, Expert Systems with Applications 42 (2015), 8609–8624.

80.

Sreenath

and Udhayan

, Intrusion detection system using Bagging Ensemble Selection, in Proceedings of 2015 International Conference Engineering and Technology (ICETECH), (2015), 1–4.

81.

Masarat

, Taheri

and Sharifian

, A novel framework, based on fuzzy ensemble of classifiers for intrusion detection systems, in Proceedings of 2014 International Conference Computer and Knowledge Engineering (ICCKE), (2014), 165–170.

82.

Elbasiony

R.M.

, Sallam

E.A.

, Eltobely

T.E.

and Fahmy

M.M.

, A hybrid network intrusion detection framework based on random forests and weighted k-means, Ain Shams Eng J 4 (2013), 753–762.

83.

Aslahi-Shahri

B.M.

, Rahmani

, Chizari

, Maralani

, Eslami

, Golkar

M.J.

and Ebrahimi

, A hybrid method consisting of GA and SVM for intrusion detection system”, Neural Computing and Applications 27 (2016), 1669–1676.

84.

Rastegari

, Hingston

and Lam

C.P.

, Evolving statistical rulesets for network intrusion detection, Applied Soft Computing 33 (2015), 348–359.

85.

Binbusayyis

and Vaiyapuri

, Identifying and Benchmarking Key Features for Cyber Intrusion Detection: An Ensemble Approach, IEEE Access 7 (2019), 106495–106513.

Table 1
Approximate distribution of training and test data in the NSL-KDD dataset

Class     Training   Test
NORMAL    48%        19%
PRB       20%        1%
DOS       26%        73%
U2R       0.2%       0.07%
R2L       5%         5%

Table 3
Training time, in milliseconds, of SVM models with and without feature selection (FRSTFS) and with and without sample selection (FRSTSS)

Sample set       All Features   FRSTFS      Time Reduction
All Samples      28,384,654     8,489,850   70.09%
FRSTSS           2,348,220      377,124     83.94%
Time Reduction   91.73%         95.56%      98.67%
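
The reduction percentages in Table 3 can be re-derived from the four raw training times. The following is a minimal sketch, assuming each reduction is computed as 1 - (time after) / (time before), and that the bottom-right cell compares the model trained with both selections against the model trained with neither. The dictionary keys and the function name are labels chosen for this illustration, not names used by the authors.

```python
# Training times in milliseconds, copied from Table 3.
# Keys are (sample set, feature set); these labels are illustrative only.
times_ms = {
    ("all_samples", "all_features"): 28_384_654,
    ("all_samples", "frstfs"): 8_489_850,
    ("frstss", "all_features"): 2_348_220,
    ("frstss", "frstfs"): 377_124,
}


def reduction_pct(before_ms, after_ms):
    """Percentage of training time saved when going from before_ms to after_ms."""
    return round(100.0 * (1.0 - after_ms / before_ms), 2)


baseline = times_ms[("all_samples", "all_features")]
both = times_ms[("frstss", "frstfs")]

reductions = {
    # Last column: effect of feature selection within each row.
    "features_on_all_samples": reduction_pct(baseline, times_ms[("all_samples", "frstfs")]),
    "features_on_frstss": reduction_pct(times_ms[("frstss", "all_features")], both),
    # Last row: effect of sample selection within each column.
    "samples_on_all_features": reduction_pct(baseline, times_ms[("frstss", "all_features")]),
    "samples_on_frstfs": reduction_pct(times_ms[("all_samples", "frstfs")], both),
    # Bottom-right cell: both selections versus neither (assumed interpretation).
    "both_selections": reduction_pct(baseline, both),
}

for name, value in reductions.items():
    print(f"{name}: {value:.2f}%")
```

Under these assumptions the computed values agree with the percentages reported in the table, which suggests the last row and last column are simple pairwise comparisons of the raw times.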