Abstract
Intrusion Detection Systems (IDS) are designed to provide security into computer networks. Different classification models such as Support Vector Machine (SVM) has been successfully applied on the network data. Meanwhile, the extension or improvement of the current models using prototype selection simultaneous with their training phase is crucial due to the serious inefficacies during training (i.e. learning overhead). This paper introduces an improved model for prototype selection. Applying proposed prototype selection along with SVM classification model increases attack discovery rate. In this article, we use fuzzy rough sets theory (FRST) for prototype selection to enhance SVM in intrusion detection. Testing and evaluation of the proposed IDS have been mainly performed on NSL-KDD dataset as a refined version of KDD-CUP99. Experimentations indicate that the proposed IDS outperforms the basic and simple IDSs and modern IDSs in terms of precision, recall, and accuracy rate.
Introduction
Today, most vital infrastructures i.e. telecommunications, transportations, trade and banking systems, are managed by computer networks. Therefore, the security of these systems against planned attacks, is of a high importance. One can find a variety of meanings or concepts for the term, “intrusion”, in different glossaries. In computer science, there is much debate over the meaning or meanings of intrusion. Many define intrusion to be unsuccessful attacks, while others present separate definitions of intrusion and attack. Intrusion can be defined as [1]: “an active series of events related to each other, with aim of illegally accessing to data, illegally changing data or damaging the system in a way that it makes the system unavailable”. Such a definition covers both successful and unsuccessful attempts.
IDSs are designed to provide security in computer networks setting. They are the systems that strive to detect attacks to a network. These systems collect network traffic data from nodes of the network or computer systems and use these pieces of information to provide the security of network. The chief purpose of an IDS is to characterize and prevent accessibility of users known as Unauthorized Users (UUs). When a protected system or a computerized network is under attack, the IDS generates warnings for the potential attack occurrence, even if the system is not vulnerable to the reported attack. Therefore, the purpose of intrusion detection or IDS is to discover unauthorized use, abusing and damaging computer systems and networks by both domestic users and external attackers. In a perfect security system, IDS is used as an auxiliary tool to enhance system security, along with using firewalls, encryption and authentication methods. To sum up, three chief reasons exist to employ IDSs [1]:
Controlling and legal problems: using IDSs is necessary due to presence of national laws to comply with security principles and monitor their potential violators, protecting national infrastructures, the need to develop surveillance systems to have the ability to record and investigate suspicious accesses to information and the existence of different standards.
Specification and definition for types of attacks: IDSs provide the system administrator with the opportunity to think of managing and dealing with the attacks by identifying attacks to system. IDSs have the ability to create a profile for the occurred attacks against the system. This profile can be used to investigate some appropriate solutions to cope with the attacks against the system. These systems, also, are capable of collecting information about the attackers, which we can use them for judicial cases.
Overall in-depth-defense implementation: IDSs are a critical part of in-depth-defense systems, specifically for organizations to which dealing with intruders is of a high significance. IT IDSs provide a suitable solution to protect the network and application layers’ vulnerabilities. Also, they contribute to the solidarity and authentication of information across different devices, such as firewalls, anti-viruses and routers.
An IDS can be considered to be a set of tools, methods and evidences that contribute to detect, identify and report unauthorized or unapproved activities access in a network [1]. The title, “Intrusion Detection", is not an appropriate heading for intrusion detection systems, since these systems, in fact, do not recognize the intrusion; rather they detect an activity on the network as an intrusion that may not be substantially an intrusion. Indeed, IDS is a small part of the protective system installed in a system and it cannot be considered to be a stand-alone and absolute system.
In general, anything that is potential risk and threat to global security is a malware. Malwares are scattering concerning to their variety and number as well [2]. Besides the possible losses to the computers of users, malwares are able to own an accessibility to the privacy of users and thus, their confidential information can be stolen. Unluckily, because of the application vulnerability, many authentic packages of software launch the malwares unwantedly. This leads to system intrusion; therefore, the privacy of users even via using a legal package of software, can be endangered.
Of course modern computer systems’ security relies heavily on the ability of keeping malware and security products updated. Therefore, timely identification of malwares and finding some ways to encounter the malicious effects of malwares, are the most important concerns of programmers and IT security specialists. A malware detection system, in fact, is the implementation of intrusion detection techniques that contribute to protect the system by identifying malicious behaviors. Regarding to diversifying and expanding the endeavors of the attackers to access private and important information of users and their attempts to utilize various updated techniques to bypass malware and security products, system programmers need up-to-date and efficient detection techniques that overcome the new and complex methods of the attackers and detect the malwares.
Until now, methods that have been used to detect malwares, have succeeded with different rates. However, the important problem, which is very crucial these days, is indeed to detect malwares that hide themselves in different ways, such as compaction, polymorphism, and so on. Therefore, traditional methods no longer meet the requirements to deal with these new and complex species, and intelligent methods (such as clustering [3–9], fuzzy clustering [9–12], classification [13–14], domain adaptation [15, 16], deep learning [17–18], optimization [19–22], dynamic optimization [23–24], preprocessing [25], and so on) are required to detect these malwares in a short time and with a low error rate. One of the main challenges in reaching an appropriate model for these systems can be the huge size of their data. Huge data occurs in many applications [26–29]. This is an important challenge that has to be dealt with as a preprocessing task [25] prior to model creation. While there are many approaches to build a detection model [30–33], due to our special case, in this paper, it is the main aim to handle this problem using an appropriate method, i.e. fuzzy rough set theory.
A rough set, introduced by Pawlak [34], defines a set by two approximations: its lower approximation (LA) and its upper approximation (UA). LA and UA of a set are also sets. The rough set theory, as the extension of traditional set theory, and its derivatives have been used in various applications [35, 36]. It has been especially applicable to feature selection problem [37–39] or sample selection problem [40, 41]. One of its most well-known extensions is fuzzy rough set [42, 43] where the LA and the UA of a set is defined as two fuzzy sets. Although rough set is different from fuzzy set, they can complete each other. The first one is able to manage indiscernibility [34] and the second one is capable of managing vagueness [44–46]. Therefore, both of them are suitable for managing uncertain systems. The fuzzy rough set theory has been especially applicable to feature selection [47], missing value handling [48], imbalanced learning [49], and instance selection [50].
The main objective of this study is to gain more accurate results in the detection process. To reach the mentioned objective, the following objectives will be pursued to be achieved: Proposing a data reduction technique. Increasing attack discovery rate using data reduction technique.
As fuzzy rough set theory based approaches are appropriate solutions to the feature and instance selection problems, and also as one of the main challenges to our problems is their huge data size, we target to introduce a model based on an innovative fuzzy rough set sample and feature selection mechanism.
The rest of this paper is organized as follows. In the second section, the related works will be mentioned. The materials needed for the proposed method are presented in third section. In the fourth section, details of the proposed method will be provided. In the fifth section, details of implementation and experimental results are given. Eventually in the last section, a general conclusion will be presented about the contents of this article and there will be a review of new horizons and suggestions for future research on the subject.
Related works
Generally, according to the authors [51], in the intelligent IDS, the attraction criteria are calculated by the nearest neighbor and rival (the closest pattern belonged to the other class) for each model. Let’s assume U to be the universal set. In general, for a pattern or data point p, where p ∈ U, it can be suitably classified when |p - NN U (p) | - |p - NN{x∈U|m l (x)≠m l (p)} (p) | > Δ, where m l (x) is the label of pattern x, NN θ (x) is the nearest neighbor of pattern x in the dataset θ and Δ, a positive real number, is a threshold.
An intrusion detection method proposed where it is aimed to choose a pattern based on an edited nearest neighbor [52]. This method emphasizes on eliminating noisy patterns in the training set. The edited nearest neighbor ignores a pattern in U, when its class is different from the class which is the most frequent class among its k nearest neighbors. An extension to this algorithm is the redundant edited nearest neighbor [53], where edited nearest neighbor is repetitively applied until the class of any pattern in U matches with the class which is the most frequent class among its k nearest neighbors. Another method, which is based on edited nearest neighbor, is called multiple edited nearest neighbor. During the procedure of multiple edited nearest neighbor, U will be divided into r blocks {B1, B2, … , B r } and it applies edited nearest neighbor on each block B j and its neighbors will be detected on the next block, that is Bmod(j+1,r). As long as at least one deletion occurs on tth iteration, this procedure will be continued [54]. Instead of k nearest neighbors, edited nearest neighbor provides a method for pattern selection by which it utilizes dependency probability to a class.
A framework based on IB2 and IB3 has been introduced (IB2 and IB3 are additive methods) [55]. IB2 selects patterns that are not classified by 1-NN correctly; IB3 is the improved version of IB2 in which a set of records is used to identify the patterns that are needed to be preserved.
From other methods, which are based on k nearest neighbors, five methods are worthy to be mentioned [56]: DROP1, DROP2, DROP3, DROP4, and DROP5; these methods are based on the concept of dependency. The dependencies of pattern p are those patterns that pattern p is considered to be one of their nearest neighbors. DROP1 removes the pattern p if its dependencies in U are classified accurately without any consideration of pattern p. Through this rule, since dependencies of a noisy pattern can be correctly classified without that pattern, DROP1 can eliminate the noisy patterns. However, in DROP1, if neighbors of a noisy pattern are eliminated at first, then, the noisy pattern will not be deleted. In order to solve this problem, DROP2 has been introduced where it considers dependencies of a pattern in the whole training set. That is, pattern p will be deleted only when its dependencies are classified without pattern p. DROP3 and DROP4 firstly remove noisy patterns by considering a filter similar to edited nearest neighbor, then apply DROP2. DROP5 works similar to DROP2 with the difference where it firstly removes the closest rivals (the nearest patterns with a different class).
Another method that operates based on dependency is called iterative filtering algorithm [57]. The iterative filtering algorithm works based on the reachable (denoted by Reachable (p)) and coverage (denoted by Coverage (p)) sets of pattern p that are the locality and dependency sets, respectively. If |Reachable (p) | > |Coverage (p) |, then the iterative filtering algorithm will remove pattern p. This means that some patterns can classify patterns similar to pattern p, regardless of their classification in the training set. In the first step, this algorithm runs edited nearest neighbor.
Evolutionary algorithms are widely used in pattern selection. Their main idea is as follows: an initial population of individuals (for example chromosomes in genetic algorithm) are given; each individual can be considered to be a solution. Based on an objective function (an objective function can be accuracy of a classifier trained on the selected patterns by an individual), individuals of a population are evaluated and best individuals are selected for exploration operators (for example the mutation and recombination operators in genetic algorithm) in order to generate new population individuals (in a way that they will maximize objective function). Such algorithm will be repeated a specific number of times (for example generations in genetic algorithm) determined by a user, and the best individual of the last iteration will be selected as the final solution. Generally, individual I can be displayed as a binary coded presentation, I = [n1, 0, 0, 1, …, 0], so that it holds |I| = |U|, which means a certain subset of U is selected. Every I j is the representative of jth pattern in T. If I j is one, then it means the individual I selects the jth pattern; otherwise, the individual I removes the jth pattern.
Some researchers employed tabu search algorithm for pattern selection [58]. The tabu search algorithm is applied on an initial potential solution containing a set of primitive randomly selected patterns S j ∈ U, and during the search process, some solutions are identified as tabu (which means that solution should not be changed); other solutions (non-taboo solutions) will be evaluated using a classifier in order to find the best solution to replace the current solution. Procedure of finding the next best solution from the current solution is the assessment of neighboring solution sets, each of which differs in only one pattern with the current solution. When this process is completed, the current solution is considered to be the best solution found so far.
According to direct computation of decision level separation, another category of studies that is based on the maximum margin separation of data with two classes emerged. In this category of researches, it is attempted to gain a decision level in order to classify the data in which such a decision level creates the maximum margin of separation between two classes [59, 60]. In a method of this type [59], the aim was to determine the distance-based separation function with a maximum distance of each data class in Banach space. For this purpose, Lipschitz functions have been used where a fixed Lipschitz function indicates the separation margin value. Finding the suitable Lipschitz function for real data is a difficult task, and the current solutions are weak in terms of performance. In this regard, the results of its implementation are not satisfactory. In another method of this type, SVM [60], a linear decision level in Hilbert space, was proposed with the aim of finding maximum margin of separator between two classes of data. In this method, the problem transformed into a quadratic programming problem. Solving this problem for applications with big data such as intrusion detection application is a challenging task [61].
Although SVM can potentially classify the data in intrusion detection, they cannot be used solely in IDSs. Based on the conducted studies, some cases affect the efficiency of these systems. When the number of normal data inputs in the training time is approximately equal to attack data, SVMs will be able to work more accurately. However, it is not always possible. On one hand, the number of the features in network traffic data has a big influence on both training time and the classification performance. In many training datasets (such as KDDCUP99 [62]), of 41 features, it is possible to select some of the key features and employ them in learning phase of model. In [64], a combination of SVM, the simulated annealing and the decision tree has been used to detect the network attacks. At the beginning, SVM and simulated annealing are used to select the best features. Then, the decision tree and simulated annealing have been used to generate decision rules. These phases will be performed in a loop and will be continued until the termination condition is not satisfied. The outputs of the algorithm are the selected features, the best accuracy in the evaluation dataset, and the decision rules.
Kuang et al. [64] proposed a new hybrid model for SVM consisting of kernel principal component analysis in combination with genetic algorithm for intrusion detection. In their model, kernel principal component analysis is used to reduce the problem dimensionality and runtime. Furthermore, to reduce noises caused by different features and improve the SVM performance, an improved kernel function denoted by N-RBF has been presented. Genetic algorithm is employed to optimize the penalty factor, c, kernel parameters, σ, and ɛ for SVM. The results of applying this method indicate an increase in prediction accuracy, acceleration of convergence speed and a better generalization in comparison to other similar algorithms. Similarly, in another work [65], the combination of kernel principal component analysis and the improved chaotic particle swarm optimization algorithm has been used to deal with the intrusion detection problem. In this model, improved chaotic particle swarm optimization algorithm is used to optimize above parameters. Using this method has led to the improvement in comparison with similar methods in dataset of NSL-KDD [66].
Preliminary materials
This section provides definitions and necessary notifications. Then, we introduce SVM. The prototype selection will be introduced in the next subsection. Next subsection discusses rough set theory. After that, fuzzy rough set theory will be discussed. The feature selection based on fuzzy rough set theory is presented in the last subsection of this section.
Data classification has always been one of the most common subjects for different sciences and researchers. So many methods have been presented in data classification area. Each of them is designed to be used into a wide special range of applications. In data classification, the main goal is to detect different classes in a dataset such that it is able to determine the labels for new unseen data points. Classification [67–69] is one of the most important tasks in the pattern recognition that has received a lot of attention due to its extensive use. Until recently, numerous methods have been developed. We mention some of them in the following: 1) decision tree, 2) Bayesian classifier, 3) artificial neural networks, 4) k-nearest neighbors, and 5) SVM.
Our main purpose is to propose a classification system in order to be used as a large scale IDS, so that it can improve SVM in terms of time and accuracy, and it increases the potency of detecting various new and unknown intrusions. Some advantages of the proposed method are including (but not limited to) (1) lack of need to parameters’ tuning and (2) the usage of prototype selection methods (here fuzzy rough set theory based prototype selection) simultaneous with SVM training for classification.
Fuzzy rough set theory
A fuzzy rough set can be introduced by a duality of fuzzy LA (FLA) and fuzzy UA (FUA) of a fuzzy subset (denoted by
where
where
Extending positive region of decision table D considering relation R to fuzzy positive region of fuzzy relation
Assuming two following conditions: (1) “
Each data point p in U that has a very high value

(a) An exemplary U with two classes. (b) Data points with low membership value to positive region. (c) Data points with very low membership value to positive region. (d) Data points with average membership value to positive region.
The feature selection problem is one of the problems that have been raised in the context of machine learning and statistical pattern recognition. This problem is important in many applications, such as classification. While there are a huge number of the features in many applications, only a fraction of them has considerable information and the rest features are almost useless. Although maintenance of those almost useless features theatrically seems good or at least without side effect (it means it does not lead to loss of information), it can cause reduction in model performance. It can be due to the fact that the computational rate of this application will be increased and moreover it causes a lot of useless (sometimes misleading) information to be saved. By removing these attributes from the dataset, the efficiency of learning models can be significantly improved.
The aim of feature selection is to find the smallest subset of input features with the highest classification quality. Unlike other methods of dimensionality reduction (feature extraction methods), feature selection methods maintain the main concept of features after reduction. These methods are widely used in the datasets that involve in lots of features and make the processing hard.
In IDSs, it is difficult to analyze and classify due to the fact that the amount of data that the system needs for network monitoring is too much. Thus, an IDS should reduce the amount of data to be processed so that complexity of the relationships between some features is removed or reduced [70, 71]. One of the desired data mining concepts based on which we can extract and select features is RST which is discussed in the next section.
As mentioned a fuzzy decision system [42] can be considered to be a pair (U, Q) where Q = C∪ { l }. Again assume for every q ∈ Q, there is a mapping (here a function) q, m
q
: U → V
q
where V
q
is called the range of m
q
. Also, consider a fuzzy equivalence relation
where ω (x) is one if x is true; otherwise, it is zero. We also define a fuzzy equivalence relation defined on U for any subset
By defining λ
p
according to Equation 7.
We can simplify λ
p
as presented in Equation 8.
Now, the λ-conditional entropy of the decision feature l compared with the conditional feature subset
where
It is worthy to be mentioned that 0 × log 0 is considered to be 0 in Equation 9 and throughout the paper. After that,
This section provides proposed framework for constructing an IDS. The proposed IDS contains four main components including normalization, feature selection, sample selection, and SVM classifier. Figure 2 shows structure of the proposed IDS. In the first step, normalization is done so as to assure that all features of dataset are numeric values transformed into interval [0, 1] by Equation 13.

The proposed IDS framework.
where
Fuzzy rough set theory based sample selection (FRSTSS) (presented in section 4.1) is the second step to reduce the dataset size. Fuzzy rough set theory based feature selection (FRSTFS) (presented in section 4.2) is the third step to reduce the feature space size. The SVM is employed in the subsequent step to learn the data subsample in the projected selected features according to the proposed IDS framework depicted in Fig. 2.
The pseudo code of the proposed method for sample selection is presented in Fig. 3. It is based on fuzzy rough set theory; therefore, we call it fuzzy rough set theory based sample selection or FRSTSS.

The pseudo code of the proposed FRSTSS.
We want to select only borderline data points. Therefore, we define
The use of all available features of a given dataset for evaluating and detecting of attack patterns increases processing overhead, prolongs diagnosis process and, consequently reduces IDS efficiency. As mentioned before, the aim of feature selection is to find the features’ subset that will maximizes the classification metrics such as the classification accuracy. Since removing worthless and useless input data features simplifies the problem and accelerates classification task, the feature selection is a very important step in designing an accurate and fast IDS.
In our model presented, the fuzzy rough set theory is used for feature selection. In [73], rough set theory is used to propose an IDS. The results have indicated that the rough set theory based IDS has a high classification accuracy and it performs classification task faster than traditional IDS. In [74], the rough set theory is used in designing an IDS. The rough set theory has been used as the feature selection step in the presented IDS [74]. In this paper, we employ the fuzzy rough set theory as feature selection step. Similar to the presented IDS in [74], we use SVM algorithm as classifier. But we also use the fuzzy rough set theory based sample selection and the fuzzy rough set theory based feature selection to create a new improved and scalable IDS.
The fuzzy rough set theory based feature selection (FRSTFS) algorithm starts with an empty
Overall implementation complexity
Time complexity of FRSTSS is of the order O (ςnρ log ρ) at most. Time complexity of FRSTFS is of the order O (n2). Time complexity of SVM can be either of the order O (n2) or of the order O (n3) [75]. Therefore, FRSTSS+FRSTFS+SVM is of the order O (ς|U|ρ log ρ + ρ2 + ρ3) at most. Assuming ρ ⪡ |U|, the overall time complexity of the proposed IDS framework is of the order O (ς|U|ρ log ρ) at most. The space complexity is of dataset order, i.e. O (|U|).
Experimental study
In this section, in order to determine strengths and weaknesses of our IDS and to represent an analysis of its performance, NSL-KDD dataset, a subset of KDD-CUP99 dataset, is used as the main benchmark. All of the experimental results are presented in this section. The used datasets will be initially reviewed. Then, the tests and the evaluation parameters will be presented and eventually, the experimental results will be offered.
Benchmark dataset
To evaluate the proposed method, NSL-KDD [66] dataset is used. This dataset contains selected records of KDD-CUP99 [52]. The problems such as duplicate records have been eliminated in the KDD-CUP99 dataset. Since 1999, KDD-CUP99 dataset was the most beneficiary dataset to evaluate the anomaly detection methods and IDSs. This dataset was developed by Stolfo et al. [53], based on an application for evaluation of IDSs, called DARPA’98 [76].
DARPA’98 is a series of nearly 5 million records collected from different links. Each record in DARPA’98 is approximately 100 bytes. Its training set is consisted of approximately 4,900,000 different links in which each vector constitutes 41 features and a binary class tag indicating whether it is a normal or an attack record. Each attack record belongs to one of the following general subclasses: Denial of Service (DoS) attack: it is a subclass in attack class where the attacker saturates and or occupies one source of processing or storage of the victim in such a way that the victim cannot carry out its legal requests. User to Root (U2R) attack: it is a subclass in attack class in which an attacker begins its attack by accessing to a regular user’s account on the system (usually achieved through password eavesdropping, dictionary attacks, social engineering) and gets the thematic access to the system by finding vulnerable ports. Remote to Local (R2L) attack: this attack occurs when an attacker can send packets over the network, however, but attacker processes no account on one of the network systems and gets the access to the system as a system user, by examining its vulnerable ports. Probing attack: it is an attempt to obtain information about computer network with the aim of threatening network security.
It is worthy to be mentioned that “the test set has no identical probability distribution with the training set, yet encompasses specific types of attacks which they are not existed in the training set, and this makes the evaluations more actual”. More accurately, it is worth mentioning that the training set includes 24 attack types while the test set contains 14 attack types other than those in the training set. However, some authorities in the area of intrusion problem also believe that the majority of new attacks are originating from the old ones and known attack signatures are adequate to recognize the current attacks.
In Table 1, general types (or subclasses) of the attack class in NSL-KDD dataset can be seen. The approximate distribution of various subclasses in the training and testing datasets is also shown in Table 1. It should be noted that the total does not sum up to 100% due to the rounding of ratios.
Approximate distribution of training and test data in NSL-KDD dataset
Approximate distribution of training and test data in NSL-KDD dataset
The main aim of data preprocessing is to convert features of nominal values into numerical ones. In Table 2, the characteristics of NSL-KDD dataset are shown. Some features in the NSL-KDD dataset are as nominal and as the SVM can be applied only to numerical features, therefore, the nominal features have to be converted into numerical ones. As it can be seen in Table 2, some of features have nominal values such as Protocol_type (B), Service (C) and Flag (D). For example, we consider 1, 2, and 3 respectively for the values tcp, udp, and icmp of attribute Protocol_type. As our work assumes the range of any attribute to be in interval [0, 1], it is necessary for the input data to be represented by two forms of binary and continuous. Therefore, in data normalizing, after transforming the three above-mentioned features of the nominal format into numerical format, the main problem is to convert the data into a normalized binary or continuous form. In our IDS, the normal continuous form is used for features.
Different features of the NSL-KDD dataset
To normalize features, first, a statistical analysis is done on any individual feature based on the available data in NSL-KDD and then maximum and minimum values for that feature are determined. Then, according to Equation 13, normalization is done.
Our ultimate purpose is to improve intrusion detection rate. In this study, the main focus is to detect the intrusion at first and after that, to signify the correct subclass of the intrusion. In fact, if the attacks are properly detected, its appropriate subclass of attack class can be decided with the supervisory of the network admin. Therefore, the main two metrics that are studied in this section are including precision and recall; the two metrics that are well-known and important in assessing data mining and machine learning algorithms. A formal definition of “precision” is as follows in Equation 14.
A formal definition of “recall” is as follows in Equation 15.
In Equations 14 and 15, TP represents the number of data points that have been properly assigned to the positive (or attack) class, FP represents the number of data that are incorrectly assigned to the positive class and FN represents the number of data that are mistakenly assigned to a negative (or non-attack) class. It is easy to indicate that each multi-class problem can be easily converted into a set of bi-class problems. That is, each time we assume one class as positive and others as negative; after that, it is possible to calculate the precision and recall metrics. However, it should be mentioned that when we interpret precision, it is the precision that determines how much ratio of the data points is correctly diagnosed as positive by the algorithm. The purpose of recall is to determine how much ratio of positive class data points is correctly allocated to the target (or attack) class by the algorithm.
There is a third metric known as F-measure which is defined based on precision and recall metrics. A general definition of this metric is shown in Equation 16.
After selecting important samples and features respectively by FRSTSS and FRSTFS algorithms, the SVM is employed to learn the selected projected data. SVM are used as the main classifier in our IDS. In SVM training phase, training data is used to train SVM and test data is used to detect the potential attacks in the test phase. FRSTSS and FRSTFS algorithms are done only on training data. The ρ and ς are effective parameters in computational cost of our method. We chose the mentioned values by trial and error in such a way that the computational cost of our method should be admissible. The parameter ς is always set to one percent of size |Λ| (i.e. 0.01 × |Λ|) throughout the experiments. The parameter ρ is always set to 0.1 percent of total dataset size (i.e. 0.001 × |U|) throughout the experiments.
In recent years, various implementations have been developed for SVM. Most of them are designed and implemented in academic environments. However, at the same time, commercial products can be discovered. The LIBSVM software [77] is used in our IDS. This software, in fact, is designed with the aim of developing a library to do the machine learning tasks using SVM. Such a library is designed using C++ programming language. It is possible to utilize different kernels depending on the nature of the data. The most serious disadvantage of SVM method is lack of a mechanism to select the suitable kernel function. By default, the LIBSVM library supports four different kernels including linear, polynomial, radial basis function, and sigmoid ones. In the proposed kernel system, radial basis function kernels are selected due to its higher efficiency. This function has only one parameter called γ that is set to its default value [77].
Experimental settings
In order to investigate the performance of the proposed algorithm, we compare its performance with those of the state of the art methods including: (1) Aburomman and Ibne Reaz [78], (2) Singh et al. [79], (3) Sreenath and Udhayan [80], (4) Gaikwad and Thool [63], (5) Masarat et al. [81], (6) Elbasiony et al. [82], (7) Aslahi-Shahri et al. [83], and (8) Rastegari et al. [84]. In addition, we compared the proposed method with other “strong” base algorithms including: (1) Fisher Classifier (Fisher), (2) Quadratic Classifier (Quadratic), (3) UDC classifier (UDC), (4) Statistical Decision Tree Classifier (StatsDTC), (5) Decision Tree Classifier (DTC), (6) Naïve Bayesian Classifier (NaiveBC), (7) Bagging of Ensemble of Naïve Bayesian Classifier (BagEnsNaiveBC), (8) Bagging of Ensemble of Rule-based Classifier (BagEnsWeakC), (9) Bagging of Ensemble of Decision Tree Classifier (BagEnsDT), (10) Adaptive Boosting Naïve Bayesian Classifier (ADABOOSTC), (11) Boosting ensemble of Decision Tree Classifiers (BoostEnsDT), (12) SVM, (13) multi-layer perceptron (MLP), and (14) FRSTFS+SVM.
Experimental results
To train our IDS, 2,000,000 records of KDDTrain+set are randomly selected without-replacement in such a way that they include 24 types of detected attacks, and to test it, also 700,000 records of KDDTest+set are selected in such a way that they include 14 unknown new attacks along with the previous 24 known attacks. After selecting the important features, the input data are decreased by about 80 percent.
Traditional methods
An advantage of our method is its low computational cost and therefore low usage of computer sources such as memory and processor time which are important for intrusion detection. For this reason, training and test computational costs are computed in four modes (with sample selection, with feature selection, with sample and feature selection, and without any of them) and they are presented in Table 3. Any consumed time presented in Table 3 are in terms of milliseconds (ms). Construction of an SVM model on selected features and FRSTFS, together, need 70.09% less time than construction of an SVM model on all dataset. Construction of an SVM model on selected samples and FRSTSS, together, need 91.73% less time than construction of an SVM model on all dataset. Construction of an SVM model on selected features and samples, FRSTSS and FRSTFS, together, need 98.67% less time than construction of an SVM model on all dataset.
Training time of different SVM models without and with feature selection and without and with sample selection in terms of milliseconds
Training time of different SVM models without and with feature selection and without and with sample selection in terms of milliseconds
The performance of these methods are also represented in Table 4 in terms of precision.
Precision of different SVM models without and with feature selection and without and with sample selection
In Table 4, “FRSTSS+FRSTFS+SVM” stands for the proposed method with both sample selection and feature selection; “FRSTSS+SVM” stands for the proposed method with only sample selection; “FRSTFS+SVM” stands for the proposed method with only feature selection; and finally, “SVM” stands for the proposed method without both of sample selection and feature selection. The features presented in Table 5 are selected by “FRSTSS+FRSTFS+SVM” method. The same features with one change have been selected by “FRSTFS+SVM” method (the feature num_file_creations that is labeled “Q” in Table 2 is added to features presented in Table 5).
The selected features by “FRSTSS+FRSTFS+SVM”
The “FRSTSS+FRSTFS+SVM” method is the best method in Table 4 for all classes. After that, the “FRSTFS+SVM” is the second best method. The third best is the “FRSTSS+SVM” and the worst is SVM. The same results of Table 4 are replicated in Table 6 for recall. Table 7 also represent the F-measure of the same methods.
Recall of different SVM models without and with feature selection and without and with sample selection
F-measure (F1) of different SVM models without and with feature selection and without and with sample selection
Comparison of precision, recall and f-score for each given subclass is shown in Table 5, Table 6, and Table 7 respectively for different modes of our method. The f-score results presented in Table 7 completely verify the precision results presented in Table 4, i.e. the “FRSTSS+FRSTFS+SVM” method is the best method in terms of f-score for all classes. But in the recall results presented in Table 6, the “FRSTSS+FRSTFS+SVM” method is the best method for all classes, with the exception of U2R class where the “FRSTSS+FRSTFS+SVM” method is the second best to the “FRSTFS+SVM” method.
What is evident in Tables 5–7 is the fact that recall and f-score of IDS will be also increased with sample and feature selection. In all 5 subclasses of attack, detection rate is higher when we use FRSTSS and FRSTFS compared to all features and samples. This will be specifically higher for detection rate in U2R and R2L subclasses. Therefore, the system uses the most important features and prototypes for learning.
Now, the performance of our method is compared to the multilayer perceptron (MLP) artificial neural network (ANN) (denoted by MLP), nearest neighbor (denoted by 1NN), and decision tree (denoted by DT). The back propagation learning algorithm is based on steepest descent algorithm. Setting the network parameters is carried out in accordance with the error signals which will be calculated based on presentation of each pattern to the network. MATLAB is used to implement these methods. Gini measure is used for decision tree classifier and Euclidian distance is used for nearest neighbor classifier.
The comparison of precision, recall and f-score, is shown respectively in Fig. 4, Figs. 5, and 6 based on attack subclasses. Also, the results of these classifiers trained on all data points and all features are compared with the results of them trained on the selected projected data points by FRSTSS and FRSTFS in Fig. 4, Figs. 5, and 6. As it is shown, the SVM outperforms the other methods in almost all subclasses. Also, applying FRSTSS and FRSTFS as a preprocessing step is almost always better for all classifiers and all subclasses.

Comparison of attack precision in the different subclasses for different methods.

Comparison of attack recall in the different subclasses for different methods.

Comparison of attack F1 in the different subclasses for different methods.
In Table 8 (and Table 9), the performance of the proposed classification method has been compared with the ones of other classification methods in terms of the f-measure (and accuracy). Indeed, in Table 8, the various classification methods have been compared in terms of f-measure. The results indicate that the proposed classification method outperforms the other classification methods on the KDD dataset. It is worthy to be mentioned that each of these reported results has been independently run 50 times. In each run, we obtain an independent f-measure; and finally the average of the 50 f-measure values has been reported as the f-measure of a method. It is also should be mentioned that these results (reported in Table 8 and Table 9) are obtained on all 24 types of attacks.
F1 rates of different classification methods
F1 rates of different classification methods
Accuracy rates of different classification methods
Table 10 summarizes results of the state-of-the-art classification methods on intrusion detection field. In all of these methods, it is tried to use all of the parameters indicated in their corresponding papers. To be fair, in the other parameters we tried to use the same parameters with the other methods (for example the same training set and test set). Three datasets have been used here: (a) a real-world created-by-user dataset, (b) DARPA and (c) NSL-KDD. While our method is the third best method in DARPA dataset (in terms of precision), it is the best method in the other two datasets. Note that a little improvement in the NSL-KDD dataset is really significant due to its huge size. The real-world created-by-user dataset is a moderate dataset with 20,000 records. The DARPA dataset is a small dataset. One reason for failing our method on the DARPA dataset is its small size, as our method works better in datasets with huge sizes.
Precisions of different classification methods vs. the proposed classification method on three datasets
Nowadays, IDSs have become very important tools to ensure the security of computer networks. Detection methods in IDSs are divided into two categories, abuse detection and anomaly detection. In this work, the network-based IDS is evaluated based on an anomaly detection method with the use of fuzzy rough set theory and SVM. Fuzzy rough set theory is known as one of the most widely used theories in the discovery of knowledge in information systems. Utilizing all of the features in the detection process reduces IDS performance. The use of feature selection methods is beneficial because of managing and reducing computational time and even increasing detection performance. With the same analysis, the use of sample subset selection methods is beneficial. In this article, we have reviewed some of the basic concepts of the fuzzy rough set theory and explained its applications how to reduce feature and data sizes. To evaluate the proposed model, NSL-KDD dataset is used. The results indicated that the proposed framework outperforms them.
For future works to this study, fuzzy rough set theory can be utilized to improve the SVM efficiency for data reduction in other applicable areas. In the proposed sample selection algorithm, named FRSTSS, we can investigate other subsets approaches. Another study that can be done in future is to use several classifiers and use their aggregated vote for classification task.
Compliance with ethical standards
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
