Abstract
This study addresses the poor classification ability of current intrusion detection systems on small sample sets. A new intrusion detection model based on coordinative immune and random antibody forest (CIRAFID) is proposed. The vaccination mechanism of the coordinative immune algorithm is designed to increase the fitness of poor antibodies, and a random antibody detection forest model is used to detect anomalies and classify attacks. The experimental results show that the proposed model achieves a higher detection rate, classification accuracy and classification ability, and a lower false positive rate.
Introduction
An intrusion detection system (IDS) detects unwanted attempts to access, manipulate or disable computer systems by either authorized users or external perpetrators, mainly through a network such as the Internet [2]. Recently, many techniques have been proposed for IDS to reduce the false positive rate and increase the true positive rate, such as neural networks, random forests, fuzzy theory, genetic algorithms, rough set algorithms (RSA) and artificial immune theory [1,8].
In real networks, the percentage of intrusions is low, so the true positive detection rate can always be maintained at a high level. Besides, unknown attack behaviors evolve quickly, which demands a model with high training speed. Random forest and artificial immune methods have been popular in recent years. A random forest consists of multiple decision trees and has strong representation learning ability. The artificial immune (AI) approach is derived from the biological immune system, and its adaptability makes it suitable for IDS.
To improve classification performance on small sample sets, to increase detection precision and accuracy, and to reduce the false positive rate, we combine the mechanisms of random forests and artificial immune systems to design a collaborative intrusion detection model (CIRAFID) in this paper. The main contributions are as follows:
Using the collaborative immune algorithm, we obtain the optimal antibodies. To get a wide diversity of antibodies, we apply crossover and mutation to selected antibodies to acquire new antibody sets, and we also generate some antibodies at random. Excellent antibodies are extracted as vaccines, with which the weaker antibodies are vaccinated, and the resulting antibodies are selected into a new set. Through the collaborative immune algorithm, we get more and better antibodies for small sample sets.
A random antibody forest model is designed. Antibodies are randomly selected to generate antibody decision trees, whose outputs are combined by majority voting. Random antibody forests are used to classify characteristic data, obtain new antibodies and update the antibody sets. The random antibody forest detection algorithm is used to improve the intrusion detection rate, accuracy and precision, and to reduce the false positive rate.
Finally, the NSL-KDD and CICIDS2017 datasets are used to evaluate the detection performance of CIRAFID and to validate its adaptability and its detection and classification performance.
Related works
Artificial intelligence methods such as support vector machines (SVM), fuzzy sets, random forests (RF) and AI have been introduced into intrusion detection research, and many breakthroughs have been achieved.
Kishor Kumar applied SVM to anomaly detection, but the algorithm had poor classification ability [8]. Alyaseen established an intrusion detection model based on SVM and extreme learning machine, which raised the anomaly detection rate but had lower detection rates on small sample data sets [1]. Yang designed an intrusion detection system using the modified density peak clustering algorithm and deep belief networks (MDPCA-DBN). The system removed redundant data from the training set and improved testing speed, but suffered from a high false positive rate [19]. Song used a hidden Markov model to design an intrusion detection model (AA-HMM). The model improved online learning ability, but its test results were strongly influenced by the system parameters [17].
To improve the detection rate, precision and accuracy and to reduce the false positive rate, the RF classification algorithm has become a hot spot in current research. Inspired by the bagging algorithm and random split selection, the random forest classification algorithm was proposed by Leo Breiman [3]. Lee and Park proposed an intrusion detection system based on autoencoder-conditional generative adversarial networks and random forest (AE-CGAN-RF), in which RF was used to classify characteristic data [9]. Because attacks of the same type have similar network traffic, Ren built a multi-level random forest model and used it to detect abnormal behaviors [15].
According to the similarity of decision trees, redundant attribute values were reduced [10]. The RF algorithm can effectively handle imbalanced feature data and improve intrusion detection performance, but its detection performance on small-sample attacks is poorer.
Kim et al. proposed a dynamic clonal selection algorithm [7]. Yin et al. gave an improved clonal selection algorithm that selects and clones the best individuals, which enhanced the accuracy and reduced the false positive rate of intrusion detection [20].
With artificial immune methods, an IDS builds a dynamic and adaptive information defense system. Clonal selection algorithms are widely applied to optimize detection rules for small samples, which otherwise have a low detection rate, by cloning and selecting excellent antibodies; however, because of their high false positive rate, they cannot be applied in real intrusion detection. Combining AI and RF in an IDS could solve these problems.

CIRAFID model.
The CIRAFID model mainly includes three submodules: preprocessing, collaborative immune and random forest detection. The CIRAFID model is shown in Fig. 1.
Preprocessing module: this module normalizes the data set S and stores the result in the training data set.
Collaborative immune module: this module inoculates antibodies and obtains optimal antibodies. Antibodies of the training set are divided into sets by type: normal, Probe attack, DoS attack, U2R attack, R2L attack, Infiltration attack, etc.
RF detection module: this module detects attack behaviors, such as DoS attacks. New classification sets are transformed into decision trees, and the decision trees are combined into a random forest with the random forest algorithm. The RF is used to detect abnormalities and to identify the type of attack. The new rules are stored in the training set.
Preprocessing module
The data preprocessing module of CIRAFID model is to complete the pretreatment of data sets S, according to the Definition 1,
Antigens and antibodies are normalized with formula (1), the attribute values are normalized to
Here, m and n are respectively the minimum and maximum within a data field. The data sets are preprocessed according to the definitions of antigen and antibody.
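The normalization step can be illustrated with a minimal sketch. It assumes that formula (1) is the usual min-max scaling onto [0, 1], with m and n the minimum and maximum of a data field; the function name `min_max_normalize` and the treatment of constant fields are illustrative choices, not taken from the paper.

```python
def min_max_normalize(values):
    """Scale the numeric values of one data field by min-max scaling.

    Assumption: formula (1) has the form (x - m) / (n - m), which maps
    the field onto the interval [0, 1].
    """
    m = min(values)  # minimum within the data field
    n = max(values)  # maximum within the data field
    if n == m:
        # Constant field: map every value to 0.0 (a choice made for this sketch).
        return [0.0 for _ in values]
    return [(x - m) / (n - m) for x in values]
```

For example, a field with the values 2, 4 and 6 would be mapped to 0.0, 0.5 and 1.0.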
Vaccine extraction, inoculation and vaccine evaluation are the critical steps in antibody optimization algorithm.
Vaccine extraction
Based on prior knowledge of the problem, the gene fitness values of antibodies are calculated and optimal individuals are obtained during population evolution; the vaccines are the more excellent antibodies [18].
The decision attributes are reduced and classified with the decision attribute significance reduction algorithm from rough set theory [4]. The significance of each antibody attribute, such as Destination Port, Flow Duration or Flow Bytes/s, is calculated from the significance of the decision attributes. The significance of an attribute's value is taken as the significance of the corresponding antibody gene. The more often a gene is among those that match an antigen, the more important the gene is.
The decision attribute significance relative to antibody genes, Let’s suppose that
Vaccine
Vaccine extraction operator, the definition of vaccine extraction is in following formula [4]:
The values of parameter α, β are given as:
If the value of
Inoculation is the process of using the best genes of an antibody to replace the corresponding (allelic) genes of another antibody. The vaccine inoculation operator is defined as:
Vaccine inoculation operator, a is the antibody,
The antibodies need to be evaluated after inoculation: excellent antibodies are put into the memory data set, and the poorer antibodies are deleted from the data set.
Vaccine evaluation, Let
In Eq. (5),
The affinity between antigen and antibody is computed from the Euclidean distance in shape space [18].
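A Euclidean-distance affinity could be sketched as follows. The paper's exact affinity function is not reproduced here, so the mapping 1 / (1 + d) from distance to affinity is an assumption made for illustration, as are the function names.

```python
import math


def euclidean_distance(antigen, antibody):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(antigen) != len(antibody):
        raise ValueError("antigen and antibody must have the same length")
    return math.sqrt(sum((g - b) ** 2 for g, b in zip(antigen, antibody)))


def affinity(antigen, antibody):
    """Affinity in (0, 1]; a smaller distance gives a higher affinity.

    The form 1 / (1 + d) is an assumed choice, not the paper's formula.
    """
    return 1.0 / (1.0 + euclidean_distance(antigen, antibody))
```

With this choice, identical vectors have affinity 1, and the affinity falls toward 0 as the distance grows.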
The quantity and quality of the antibodies determine the detection performance of the algorithm. Excellent antibodies are used as vaccines, and the vaccine inoculation strategy enhances the detection ability of the antibodies. Taking DoS attacks as an example, the implementation steps of the vaccination algorithm are shown in Algorithm 1.
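The extraction, inoculation and evaluation steps can be illustrated by the following sketch. It is not Algorithm 1 itself: the significance threshold, the use of a single best antibody as the vaccine, and the rule that an inoculated antibody replaces the original only if its fitness does not drop are all assumptions, and `fitness` is a hypothetical caller-supplied function.

```python
def inoculate(antibody, vaccine, significance, threshold=0.5):
    """Copy vaccine genes into the antibody where gene significance is high.

    `significance` holds one weight per gene; `threshold` is an assumed cutoff.
    """
    return [v if s >= threshold else a
            for a, v, s in zip(antibody, vaccine, significance)]


def vaccinate(antibodies, fitness, significance, threshold=0.5):
    """Extract a vaccine, inoculate every antibody and evaluate the result."""
    # Vaccine extraction: take the fittest antibody as the vaccine.
    vaccine = max(antibodies, key=fitness)
    memory = []
    for antibody in antibodies:
        candidate = inoculate(antibody, vaccine, significance, threshold)
        # Vaccine evaluation: keep the inoculated antibody only if it is
        # at least as fit as the original; otherwise discard it.
        if fitness(candidate) >= fitness(antibody):
            memory.append(candidate)
        else:
            memory.append(antibody)
    return memory
```

In this sketch a weak antibody inherits only the high-significance genes of the vaccine, so low-significance genes keep their diversity.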

Cooperative immune algorithm
New antibodies are generated in three ways: crossover and Gaussian mutation produce new individuals that join the antibody set; further antibodies are generated at random and put into the antibody set; and one-third of the antibodies in the set remain unchanged.
These three ways generate immature antibodies, which avoids local convergence and limits excessive copies of the same antibody; for small antibody sample sets, they also effectively increase the number of antibodies and improve detection accuracy. To obtain excellent antibodies, in each attack antibody set we choose antibodies with high fitness as vaccines and use the significance of each antibody gene as its weight; antibodies with poorer fitness are inoculated to generate more excellent antibodies, which are then used to detect the antigens of the IDS [4].
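The three generation routes can be sketched as below. This is a sketch under stated assumptions, not the paper's procedure: genes are taken to be real values normalized to [0, 1], crossover is single-point, the population is split into roughly equal thirds, and the mutation width `sigma` is a hypothetical parameter.

```python
import random


def crossover(a, b, rng):
    """Single-point crossover of two equal-length antibodies."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def gaussian_mutation(antibody, rng, sigma=0.1):
    """Add Gaussian noise to every gene and clip the result to [0, 1]."""
    return [min(1.0, max(0.0, g + rng.gauss(0.0, sigma))) for g in antibody]


def next_generation(antibodies, rng, sigma=0.1):
    """Build a new antibody set of the same size using the three routes."""
    n = len(antibodies)
    length = len(antibodies[0])
    third = n // 3
    # Route 1: one-third of the antibodies remain unchanged.
    unchanged = [list(ab) for ab in antibodies[:third]]
    # Route 2: crossover followed by Gaussian mutation.
    evolved = []
    for _ in range(third):
        a, b = rng.sample(antibodies, 2)
        evolved.append(gaussian_mutation(crossover(a, b, rng), rng, sigma))
    # Route 3: randomly generated antibodies fill the rest of the set.
    fresh = [[rng.random() for _ in range(length)]
             for _ in range(n - 2 * third)]
    return unchanged + evolved + fresh
```

Passing an explicit `random.Random` instance as `rng` keeps a run reproducible, which helps when results are averaged over repeated runs.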
A random antibody detection forest (RADF) is a set of identically distributed antibody decision trees, each of which depends on an independent random vector.
The principle of RADF is as follows: the antibody set X contains N antibodies, and k antibody samples are drawn from it at random with replacement.
The fewer antibody decision trees the RADF has, the weaker its classification ability. Only with a large number of decision trees are the classification results effective.
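Sampling with replacement and majority voting can be sketched as follows. Tree construction is omitted: each "tree" is represented by any callable that maps an antigen to a class label, and the choice that each of the k samples has size N is an assumption of this sketch.

```python
import random
from collections import Counter


def bootstrap_samples(antibodies, k, rng):
    """Draw k samples of size N from the antibody set, with replacement."""
    n = len(antibodies)
    return [[antibodies[rng.randrange(n)] for _ in range(n)]
            for _ in range(k)]


def majority_vote(labels):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def forest_predict(trees, antigen):
    """Classify an antigen by majority vote over all trees in the forest."""
    return majority_vote([tree(antigen) for tree in trees])
```

Each sample would be used to train one antibody decision tree, so k samples yield a forest of k trees whose votes are then combined.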
Inadequate samples cause a high false detection rate and a low detection rate. To deal with these problems, the collaborative immune algorithm of Section 3.2 is adopted to obtain excellent antibodies and to extend the small sample set until there are enough antibodies.
Based on RF and AI theory, we design a random antibody detection forest model in this paper, which is shown in Fig. 2.
It is crucial to improve the extrapolation prediction ability of random forest classification model for IDS. So we need to generate different antibodies sets to increase the classified abilities of forest classification model. After k rounds of trainings, we get an antibody decision tree set

Random antibody detection model.
Random antibody detection forest algorithm is shown as Algorithm 2.

Random antibody detection forest
The quantity and quality of antibodies determine the detection performance of the algorithm.
If the random antibody detection forest has few random decision trees, its classification ability is weak; only with a sufficient number of decision trees can the detection algorithm produce effective classification results. Combining the coordinative immune algorithm and the random antibody detection forest algorithm, we design the CIRAFID algorithm, which is shown as Algorithm 3.

CIRAFID algorithm
To verify the performance of the CIRAFID model, NSL-KDD and CICIDS2017 are both used in the following experiments. Important parameters of the algorithm are obtained through experiments, and the threshold values of the matching radius between antibody and antigen are set separately for the two data sets. Antigens are detected to obtain the confusion matrices of the IDS, from which the intrusion detection performance indexes are calculated. Finally, the anomaly detection performance of the proposed algorithm is analyzed and compared with other classification algorithms to verify its feasibility and detection performance. NSL-KDD is used to verify the detection performance of the CIRAFID algorithm on classical attacks; CICIDS2017 is used to study its classification performance on new attacks.
Therefore, we use some common performance indicators to evaluate antigen detection and present a comparative analysis on NSL-KDD and CICIDS2017 separately. Both data sets include attack classes with few samples, with which we study the attack classification performance of the CIRAFID algorithm on small samples.
To verify the performance of the vaccination strategy of the cooperative immune algorithm, steps (1.8), (1.9) and (1.10) of Algorithm 1 are omitted while the other steps remain the same as in the CIRAFID algorithm, and we compare the test results of CIRAFID, this variant and other algorithms. For convenience, the variant without these vaccination steps is named intrusion detection based on random antibody forest (RAFID).
Experiments datasets
There are many data sets for the simulations of IDS, such as KDD Cup 99, NSL-KDD, ADFA, Kyoto 2006+, ISCXIDS2012, UNSW-NB15 and CICIDS2017.
KDD Cup 99 and NSL-KDD are both classic data sets used for intrusion detection simulations. KDD Cup 99 contains a large number of redundant samples; NSL-KDD is a reduced data set derived from KDD Cup 99. To compare with classical algorithms, we use NSL-KDD for simulations in this paper. However, because its underlying data were collected in 1999, NSL-KDD lacks new attacks, so we also need newer data sets for experiments.
ADFA, Kyoto 2006+, ISCXIDS2012, UNSW-NB15 and CICIDS2017 are also used for IDS research. ADFA, Kyoto 2006+ and ISCXIDS2012 are outdated, fewer researchers apply them in experiments, and they are not classic data sets. UNSW-NB15 and CICIDS2017 are newer and contain new attack logs. UNSW-NB15 was captured at the Australian Centre for Cyber Security (ACCS) in 2015 with the IXIA PerfectStorm network testing tool and contains nine types of attacks [16]. CICIDS2017 was collected by the Canadian Institute for Cybersecurity, was published in 2017 and contains 15 types of attacks. CICIDS2017 is a more comprehensive data set at present. To verify the detection performance of CIRAFID on new attacks, we adopt CICIDS2017 in the experiments. In conclusion, we adopt NSL-KDD and CICIDS2017 for simulations.
NSL-KDD dataset
NSL-KDD is provided by Lincoln Laboratory for experimental simulations. The training sample set KDDTrain+ includes 125,973 records. The testing data include two subsets, KDDTest+ and KDDTest-21; each subset contains Normal, DoS, Probe, U2R and R2L samples.
In the experiments, all the training samples are used to obtain antibodies, and the testing sample set KDDTest+ is used to test the CIRAFID algorithm. The sample distribution of the data set is shown in Table 1.
Sample distribution of NSL-KDD [13]
CICIDS2017 includes normal samples and 15 types of attacks, and contains 2,830,743 samples in total. In CICIDS2017, 60 percent of the data are used as training samples and the rest as testing samples. The distribution of samples is shown in Table 2.
Sample distribution of CICIDS2017
The CIRAFID algorithm is implemented in C, and all experiments run on a Linux operating system (Intel Pentium Dual CPU E2180, 64 G RAM). There are four types of IDS detection results: true positive (TP), true negative (TN), false positive (FP) and false negative (FN) [21].
The sum of TP and FN is the total number of actual normal samples: TP is the number of normal samples identified as normal antigens, and FN is the number of normal samples wrongly identified as attacks. The sum of FP and TN is the total number of actual abnormal samples: FP is the number of abnormal samples wrongly identified as normal antigens, and TN is the number of abnormal samples identified as attacks. We use the confusion matrix to calculate the intrusion detection evaluation indexes [5,14]. The confusion matrix is shown in Table 3.
Confusion matrix
The main performance evaluation indicators of the CIRAFID algorithm are as follows:
Detection rate (DR) is the ratio of correctly recognized attacks to the antigens that take part in intruding on the immune system; DR is calculated with formula (8). False alarm rate (FAR) is the ratio of wrongly recognized attacks to the antigens that take part in intruding on the immune system; FAR is calculated with formula (9). Precision, accuracy and F1-score are likewise calculated from the confusion matrix.
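As an illustration of how such indicators follow from a confusion matrix, the sketch below uses the standard textbook definitions. The paper's own formulas are not reproduced, so these forms are an assumption; "positive" here denotes whichever class the matrix treats as positive, which may differ from the labelling in Table 3. The function name and dictionary keys are illustrative.

```python
def detection_metrics(tp, tn, fp, fn):
    """Standard indicators from the four confusion-matrix counts.

    Assumes every denominator is non-zero; no guarding is done here.
    """
    dr = tp / (tp + fn)                    # detection rate (recall)
    far = fp / (fp + tn)                   # false alarm rate
    pre = tp / (tp + fp)                   # precision
    acc = (tp + tn) / (tp + tn + fp + fn)  # accuracy
    f1 = 2 * pre * dr / (pre + dr)         # harmonic mean of Pre and DR
    return {"DR": dr, "FAR": far, "Pre": pre, "Acc": acc, "F1": f1}
```

For multi-class results, the same function can be applied per class by treating that class as positive and all other classes as negative.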
The parameters of the CIRAFID algorithm differ between data sets. Some parameters are taken from other studies, such as the binary length of the antigen, the binary length of the antibody, the number of vaccines and the number of antibodies obtained by crossover and mutation. The matching radius between antibodies and antigens is determined by experiment. For NSL-KDD and CICIDS2017, we preprocess the data sets to get antibodies and antigens [22]. In Sections 4.3.1 and 4.3.2, the matching radius parameters for NSL-KDD and CICIDS2017 are given respectively.
The matching radius of NSL-KDD
The antigen is a binary string 92 bits in length, and the antibody is 95 bits in length. 20 percent of the antibodies are used in CIRAFID. The matching radius is determined in the following experiments. The results are shown in Fig. 3.

The detection results for different radius with NSL-KDD.
NSL-KDD contains 41 attributes and 1 label. In the experiments, we set the matching radius to each even number from 30 to 50 and use all the training and testing data for cross experiments. For each experiment, the algorithm runs 10 times and the average value is taken as the final result.
Fig. 3 gives the DR and FAR for different radius values, where the radius is the number of matching bits between antigens and antibodies. The detection rate increases and the false positive rate decreases as the detection radius increases. When the detection radius is 40 bits, the detection rate is 94.73% and the false positive rate is 6.16%. When the detection radius is more than 40 bits, the detection rate and the false positive rate are stable. In the following experiments on NSL-KDD, the matching radius is 40.
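A bit-matching rule of this kind could be sketched as follows. The sketch assumes that a match means the number of identical bit positions reaches the radius, and that only the first 92 positions of the 95-bit antibody are compared with the 92-bit antigen; how the remaining antibody bits are used is not stated in this section, so that part is an assumption.

```python
def matching_bits(antigen, antibody):
    """Count positions where the two bit strings agree.

    zip stops at the shorter string, so extra antibody bits are ignored.
    """
    return sum(1 for g, b in zip(antigen, antibody) if g == b)


def is_detected(antigen, antibody, radius=40):
    """An antibody matches an antigen if enough bit positions agree."""
    return matching_bits(antigen, antibody) >= radius


# Example with a 92-bit antigen and a 95-bit antibody.
antigen = "1" * 92
antibody = "1" * 40 + "0" * 52 + "101"
print(matching_bits(antigen, antibody))    # 40
print(is_detected(antigen, antibody, 40))  # True
print(is_detected(antigen, antibody, 42))  # False
```

Sweeping `radius` over the even numbers from 30 to 50 and recording DR and FAR for each value would reproduce the shape of the experiment described above.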
The antigen is a binary string 119 bits in length, and the antibody is 123 bits in length. As in Section 4.3.1, 20 percent of the antibodies are used in CIRAFID. The matching radius is determined in the following experiments.
CICIDS2017 contains 52 attributes and 1 label. In the experiments, we set the matching radius to each even number from 40 to 90 and use all the training and testing data for cross experiments. In each experiment, the algorithm runs 10 times and the average value is taken as the final result. The results are shown in Fig. 4.

The detection results for different radius with CICIDS2017.
Fig. 4 gives the DR and FAR for different values of the matching radius between antigens and antibodies. The detection rate increases and the false positive rate decreases as the detection radius increases. When the detection radius is 58 bits, the detection rate is 93.82% and the false positive rate is 8.01%. When the detection radius is more than 58 bits, the detection rate and the false positive rate are stable. In the following experiments with CICIDS2017, the matching radius is 58.
In conclusion, the matching radius is 40 for NSL-KDD and 58 for CICIDS2017.
The confusion matrix, an important tool for calculating the performance of an IDS, is introduced in Section 3.2. The performance evaluation indexes of the CIRAFID algorithm are calculated from the confusion matrix. We run the CIRAFID algorithm on the two data sets above and obtain their confusion matrices respectively.
The confusion matrix in NSL-KDD
For the 22,544 testing samples in Table 1, normal samples and DoS, Probe, U2R and R2L attacks are detected respectively. The results are shown in Table 4.
The confusion matrix for NSL-KDD
The 1,132,296 testing samples in Table 2 are detected with the CIRAFID algorithm. The testing set contains 909,239 normal samples, 51,211 DDoS samples, 63,572 PortS samples, 786 Bot samples, 14 Inf attack samples, 872 XSS attack samples, 3,175 FTP attack samples, 2,359 SSH attack samples, 4,117 DosGE attack samples, 92,429 DosHu attack samples, 2,200 attack samples, 2,318 DoSSL attack samples and 4 Heart attack samples. The results are shown in Table 5.
The confusion matrix for CICIDS2017
To verify the proposed algorithm, classification performance comparisons between CIRAFID and other algorithms are given in this section.
Because many researchers have applied NSL-KDD in experiments, while CICIDS2017 is relatively new and has fewer studies to refer to, performance comparisons are given for the two data sets separately. Detection results are reported as percentages. The performance indexes in this section are calculated from the confusion matrices in Section 3.4.
Performances comparisons using NSL-KDD
We run the CIRAFID algorithm on the NSL-KDD data set with the parameters given in Section 3.3. Intrusion detection usually includes anomaly detection (recognizing only normal and abnormal behaviors) and intrusion classification (classifying attacks). The anomaly detection and classification performance of the CIRAFID algorithm are analyzed separately below.
comparison of anomaly detection performances
We compare CIRAFID with the other algorithms in Section 1, including a horizontal comparison with immune algorithms and random forest algorithms and a longitudinal comparison with other machine learning algorithms. All attacks are treated as abnormal samples for testing. For each group of data, we run the CIRAFID algorithm 10 times and take the average value. The anomaly detection performance of the different algorithms is compared in Table 6.
Comparison of anomaly detection performances with NSL-KDD (N/A denotes the results are uncertain)
According to the results in Table 6: compare the CIRAFID algorithm with other algorithms, for detection algorithm in reference [1], the detection rate decreases by 1.07%,
The detection performance of the proposed algorithm is worse than that of the clonal selection algorithm in reference [20] but better than that of the other algorithms, and it balances the detection rate, precision, accuracy, F1-score and false positive rate. Compared with RAFID, the detection rate, accuracy, precision and F1-score increase and the false positive rate decreases.
comparison of classification performances
To test the classification performance of the CIRAFID algorithm, especially on small sample data sets, we compare it with the other algorithms in Table 6. The detection rates for the different attack types are shown in Table 7.
Comparisons 1 of classification performances for NSL-KDD
From Table 7 we can conclude that the detection performance of the CIRAFID algorithm is superior to that of the other algorithms for the small-sample Probe, U2R and R2L attacks. For large samples, the detection rates of the CIRAFID algorithm are almost the same as those of the other algorithms. Meanwhile, the adaptive ability of the CIRAFID algorithm is better than that of the other algorithms.
The study by Yang et al. [19] gives the confusion matrix of the MDPCA-DBN algorithm. From this confusion matrix we compute the DR, Pre and F1-score indexes and compare them with the test results of this paper. The results are shown in Table 8.
Comparisons 2 of classification performances for NSL-KDD
Table 8 shows that the DR, Pre and F1-score of the CIRAFID algorithm are higher than those of the MDPCA-DBN algorithm in ref. [19] and of RAFID. Detection performance is improved especially for the small-sample Probe, U2R and R2L data.
The study by Lee and Park [9] gives the detection performance of Single-RF, AE-RF and AE-CGAN-RF, so we compare the performance of the CIRAFID algorithm with these three algorithms on the CICIDS2017 data set. The results are shown in Table 9.
Comparisons 1 of classification performances for CICIDS2017
The experimental results show that the performance of the CIRAFID algorithm is superior to that of the single random forest algorithm and the RAFID algorithm, and similar to that of the AE-CGAN-RF algorithm. For small sample sets, such as Bot and Infiltration attacks, detection performance is improved. The overall detection performance of the CIRAFID algorithm proposed in this paper is better than that of the Single-RF and AE-RF algorithms.
In conclusion, the CIRAFID algorithm can classify classic intrusions and can also identify new attacks; it is feasible and has good detection performance. The vaccination mechanism of the collaborative immune algorithm can optimize antibodies and improve detection performance. NSL-KDD and CICIDS2017 are obtained from simulated network environments that are similar to actual ones. From the above experiments, it is concluded that the proposed algorithm is practical for actual network intrusion detection.
In this paper, an intrusion detection model is designed with collaborative immune and random forests. In the collaborative immune algorithm, antibodies are optimized in three steps: vaccine extraction, vaccine inoculation based on gene significance, and vaccine evaluation. Using random generation, crossover and mutation, we get diverse antibodies, which avoids local convergence of the CIRAFID algorithm and improves antibody diversity.
By combining the vaccination strategy with a variety of antibody generation methods, the detection rate and precision increase, the false alarm rate decreases, and the balance among the detection performance indicators is improved. Because the antibodies are optimized in each cycle of the CIRAFID algorithm, the adaptability of intrusion detection is improved. The random antibody forest detection algorithm is adopted to detect antigens, classify attacks and update the antibody library, which improves the ability of antibodies to classify attacks with small samples. The benchmark data sets NSL-KDD and CICIDS2017 are used to evaluate the performance of the CIRAFID algorithm. The experimental results show that, compared with other intrusion detection algorithms, the CIRAFID algorithm improves the detection rate, accuracy and classification accuracy, reduces the false positive rate, and improves the adaptability and classification performance of intrusion detection. Our next work is to study dimension reduction algorithms, so as to reduce redundant attributes and increase the speed of intrusion detection.
Acknowledgements
This research was supported by the National Natural Science Foundation of China (grant no. 61502436) and the Project of Science and Technology Tackling Key Problems in Henan Province (grant no. 202102210149).
Conflict of interest
None to report.
