Intrusion detection model based on coordinative immune and random antibody forest

Abstract

This study addresses the poor classification ability of current intrusion detection methods on small sample sets. A new intrusion detection model based on coordinative immune and random antibody forest (CIRAFID) is proposed. The vaccination mechanism of the coordinative immune algorithm is designed to increase the fitness of poor antibodies, and a random antibody detection forest model is used to detect anomalies and classify attacks. The experimental results show that the proposed model achieves a higher detection rate, classification accuracy and classification ability, and a lower false positive rate.

Keywords

Intrusion detection; antibody; vaccine; coordinative immune; random antibody forest

1. Introduction

An intrusion detection system (IDS) detects unwanted attempts at accessing, manipulating and disabling computer systems by either authorized users or external perpetrators, mainly through a network such as the Internet [2]. Recently, many techniques have been proposed for IDS to reduce the false positive rate and increase the true positive rate, such as neural networks, random forests, fuzzy theory, genetic algorithms, the rough set algorithm (RSA) and artificial immune theory [1,8].

In real networks, the percentage of intrusions is low, so the true positive detection rate can always be maintained at a high level. Besides, unknown attack behaviors evolve quickly, which demands models with high training speed. Random forest and artificial immune methods have been popular in recent years. A random forest consists of many decision trees and has strong representation learning ability. Artificial immune (AI) theory is derived from the biological immune system and is adaptive, which is why it is applied in IDS.

To improve classification performance on small sample sets, to increase detection precision and accuracy, and to reduce the false positive rate, we combine the mechanisms of random forests and the artificial immune system to design a collaborative intrusion detection model (CIRAFID) in this paper. The main contributions are as follows:

Using the collaborative immune algorithm, we obtain the optimal antibodies. To get a wide diversity of antibodies, we apply crossover and mutation to selected antibodies to acquire new antibody sets, and we also generate some antibodies randomly. Excellent antibodies are extracted as vaccines, with which the weaker antibodies are vaccinated, and the new antibodies are selected into the new set. Through the collaborative immune algorithm, we get more and better antibodies for small sample sets.

A random antibody forest model is designed. Antibodies are randomly selected to generate antibody decision trees, which are combined by majority voting. Random antibody forests are used to classify characteristic data, to obtain new antibodies and to update the antibody sets. The random antibody forest detection algorithm is used to improve the detection rate, accuracy and precision, and to reduce the false positive rate.

Finally, the NSL-KDD and CICIDS2017 datasets are used to evaluate the detection performance of CIRAFID and to validate its adaptability and its detection and classification performance.

2. Related works

Artificial intelligence methods such as support vector machines (SVM), fuzzy sets, random forests (RF) and AI have been introduced into intrusion detection research, and many breakthroughs have been achieved.

Kishor Kumar applied SVM to anomaly detection, but the algorithm had poor classification ability [8]. Alyaseen established an intrusion detection model based on SVM and extreme learning machine, which raised the anomaly detection rate but achieved lower detection rates on small sample data sets [1]. Yang designed an intrusion detection system using the modified density peak clustering algorithm and deep belief networks (MDPCA-DBN). The system removed redundant data from the training set and improved the testing speed, but suffered from a high false positive rate [19]. Song used a hidden Markov model to design an intrusion detection model (AA-HMM). The model improved the online learning ability, but the system parameters strongly influenced the test results [17].

To improve the detection rate, precision and accuracy, and to reduce the false positive rate, RF classification has become a hot spot in current research. Inspired by the bagging algorithm and random split selection, the random forest classification algorithm was proposed by Leo Breiman [3]. JooHwa proposed an intrusion detection system based on autoencoder-conditional generative adversarial networks and random forest (AE-CGAN-RF), in which RF was used to classify characteristic data [9]. Because attacks of the same type had similar network traffic, Ren built a multi-level random forest model and used it to detect abnormal behaviors [15].

According to the similarity of decision trees, redundant attribute values were reduced [10]. The RF algorithm can effectively handle imbalanced feature data and improve intrusion detection performance, but its detection performance on small-sample attacks is poorer.

Kim et al. proposed a dynamic clonal selection algorithm [7]. Yin et al. gave an improved clonal selection algorithm, which selected the best individuals for cloning, enhancing the accuracy and reducing the false positive rate of intrusion detection [20].

With artificial immune theory, an IDS builds a dynamic and adaptive information defense system. Current clonal selection algorithms are widely applied to optimize the detection rules for small samples by cloning and selecting excellent antibodies, since such samples otherwise have a low detection rate; however, because of their high false positive rate, these algorithms cannot be applied in real intrusion detection. Combining AI and RF in an IDS could solve those problems.

Fig. 1.

CIRAFID model.

3. CIRAFID

The CIRAFID model mainly includes three submodules: preprocessing, collaborative immune and random forest detection. The CIRAFID model is shown in Fig. 1.

Preprocessing module: the preprocessing module normalizes the data set S and stores the records in the training data set $S_{TR}$ and the testing data set $S_{TE}$. Each record in the training data set is converted into a binary string called an “antibody”, and each testing record is converted into a binary string called an “antigen”.

Collaborative immune module: this module inoculates antibodies and obtains optimal antibodies. Antibodies of the training set are divided into sets by type (normal, Probe attack, Dos attack, U2R attack, R2L attack, Infiltration attack, etc.), which are called $S_{Normal}$, $S_{Probe}$, $S_{Dos}$, $S_{U2R}$, $S_{R2L}$, and so on. Antibodies of the same type are optimized together. For example, Dos attack antibodies undergo crossover and mutation, and some of the mutated antibodies, together with some randomly generated antibodies, are put into $S_{Dos}$. The Dos attack antibodies are ordered by fitness, and some of the better antibodies are taken as vaccines, which are used to inoculate the antibodies with poor fitness. The fitness of the Dos attack antibodies is recalculated after inoculation. Finally, the antibodies with poor fitness are deleted.

RF detection module: the RF detection module detects attack behaviors, such as Dos attacks. The new classification sets are transformed into decision trees, which are combined into a random forest with the random forest algorithm. The RF is used to detect abnormalities and determine the type of attack. The new rules are stored in the training set.

3.1. Preprocessing module

The data preprocessing module of the CIRAFID model preprocesses the data set S. According to Definition 1, $S_{TE}$ is turned into the antigen set; according to Definition 2, $S_{TR}$ is converted into the antibody set.

Antigens and antibodies are normalized with formula (1); the attribute values are normalized to $[0, 100]$ [18,22]. $\begin{matrix} (1) & f (x) = \begin{cases} 0, & x \in [0, m) \\ 100 \frac{x}{n - m}, & x \in [m, n] \\ 100, & x \in (n, + \infty) \end{cases} \end{matrix}$

Here, m and n are respectively the minimum and maximum of the data field. The data sets are preprocessed according to the definitions of antigen and antibody.
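
The piecewise normalization of formula (1) can be sketched in Python as follows. This is a minimal illustration rather than the authors' C implementation: the function name `normalize` is hypothetical, and the final `min` clamp is an added assumption that keeps the result inside $[0, 100]$ when m is greater than 0.

```python
def normalize(x: float, m: float, n: float) -> float:
    """Map an attribute value x onto [0, 100] following formula (1).

    m and n are the minimum and maximum of the data field.
    """
    if not 0 <= m < n:
        raise ValueError("expected 0 <= m < n")
    if x < m:
        return 0.0
    if x > n:
        return 100.0
    # Middle branch of formula (1). The min() clamp is an assumption added
    # here so that the value stays inside [0, 100] when m > 0.
    return min(100.0, 100.0 * x / (n - m))


print(normalize(50, 0, 200))   # 25.0
print(normalize(500, 0, 200))  # 100.0
```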

Definition 1.
Antigen $b \in S_{TE}$, $S_{TE} \subset D$, $D = \{0, 1\}^{l}$ ($l \in N$, $l > 0$). $S_{TE}$ is the set of antigens to be detected, D is the set of binary strings of length l, and the value of antigen b is a binary string that represents behavior characteristics [11,12].
Definition 2.
Antibody $a \in S_{TR}$, $S_{TR} = \{\langle d, s, g, c, q \rangle\}$, $d \in D$, $s \in \{000, 001, 010, \dots\}$, $g \in N$. $S_{TR}$ is the antibody set; g denotes the age of the antibody; c is the matched value; q is the weight of the antibody, whose value is an integer; N is the set of positive integers [4,6,16].

3.2. Cooperative immune model

Vaccine extraction, inoculation and vaccine evaluation are the critical steps in the antibody optimization algorithm.

3.2.1. Vaccine extraction

According to prior knowledge of the problem, the gene fitness values of antibodies are calculated and optimal individuals are obtained during population evolution; vaccines are extracted from the more excellent antibodies [18].

The decision attributes are reduced and classified with the decision attribute significance degree reduction algorithm of rough set theory [4]. The significance degrees of the antibodies' attributes, such as Destination Port, Flow Duration and Flow Bytes/s, are calculated from the significance of the decision attributes. The significance of an attribute value is taken as the significance degree of the corresponding antibody gene: the more antigens a gene matches, the more important the gene is.

Definition 3.
The decision attribute significance relative to antibody genes. Suppose that $a_{1}, a_{2}, \dots, a_{s}$ are the more excellent individuals selected from the antibody population $S_{TR} = \{a_{1}, a_{2}, \dots, a_{n}\}$. The decision attribute significance relative to antibody gene $a_{i}^{k}$ is $sig (a_{i}^{k})$. The significance of a gene position relative to the decision attribute is an importance measure used to evaluate the decision attribute, and the gene significance is calculated with formula (2): $\begin{matrix} (2) & sig (a_{i}^{k}) = \frac{\sum_{j = 1}^{m} | fit (a_{i}^{k}, b_{j}^{k}) |}{m} \end{matrix}$

Here $| fit (a_{i}^{k}, b_{j}^{k}) |$ records whether the kth gene of the ith antibody matches the jth antigen, so the sum counts the matches among the m antigens of the same type.
Definition 4.
Vaccine $va \in S$, $S = \{0, 1, *\}^{l}$ ($l \in N$, $l > 0$); $va$ is defined as a string of length l over the symbols 0, 1 and *.

${va}^{k}$ is the code of the kth gene in vaccine $va$. The antibody population is denoted as $S_{TR} = \{a_{1}, a_{2}, \dots, a_{n}\}$, and $a_{i}^{k}$ is the code of the kth gene in the ith antibody [12].
Definition 5.
Vaccine extraction operator. Vaccine extraction is defined by the following formula [4]: $\begin{matrix} (3) & {va}^{k} = \begin{cases} 1, & (1 / \sum_{i = 1}^{s} sig (a_{i}^{k})) \sum_{i = 1}^{s} (sig (a_{i}^{k}) a_{i}^{k}) > α \\ 0, & (1 / \sum_{i = 1}^{s} sig (a_{i}^{k})) \sum_{i = 1}^{s} (sig (a_{i}^{k}) a_{i}^{k}) < β \\ *, & \text{otherwise} \end{cases} \end{matrix}$

The values of parameter α, β are given as: $α ⩾ 0.8$ , $β ⩽ 0.2$ .

If the value of ${va}^{k}$ is ‘0’ or ‘1’, the gene has a fixed value; ‘*’ denotes that the value of the gene is uncertain. When the weighted value of the isotope genes is close to 1, the gene is taken as ‘1’; by the same token, when the weighted value is close to 0, the gene is taken as ‘0’. The others are uncertain genes, whose values are ‘*’.
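
Formulas (2) and (3) can be illustrated with a short Python sketch. It is a sketch under stated assumptions, not the authors' implementation: antibodies and antigens are plain bit strings of equal length, the gene-level match $| fit (a_{i}^{k}, b_{j}^{k}) |$ is assumed to be 1 when the two bits are equal and 0 otherwise, and the names `gene_significance` and `extract_vaccine` are hypothetical.

```python
from typing import List, Sequence


def gene_significance(antibody: str, antigens: Sequence[str], k: int) -> float:
    """Formula (2): share of the m antigens whose k-th gene matches the antibody.

    Assumption: a gene matches when the two bits are equal.
    """
    matches = sum(1 for antigen in antigens if antigen[k] == antibody[k])
    return matches / len(antigens)


def extract_vaccine(elite: Sequence[str], antigens: Sequence[str],
                    alpha: float = 0.8, beta: float = 0.2) -> str:
    """Formula (3): significance-weighted vote of the elite antibodies per gene."""
    vaccine: List[str] = []
    for k in range(len(elite[0])):
        sigs = [gene_significance(a, antigens, k) for a in elite]
        total = sum(sigs)
        if total == 0:
            vaccine.append("*")  # no gene carries any weight: leave uncertain
            continue
        weighted = sum(s * int(a[k]) for s, a in zip(sigs, elite)) / total
        if weighted > alpha:
            vaccine.append("1")
        elif weighted < beta:
            vaccine.append("0")
        else:
            vaccine.append("*")
    return "".join(vaccine)


elite = ["1100", "1101", "1001"]   # toy elite antibodies (illustrative only)
antigens = ["1100", "1000"]        # toy antigens of one attack type
print(extract_vaccine(elite, antigens))  # 1*00
```

In the toy run, the second gene has a weighted value of about 0.67, which lies between β and α, so it stays uncertain and is written as ‘*’.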
3.2.2. Vaccine inoculation

Inoculation is the process of using the best genes of the antibodies to inoculate the isotope genes. The vaccine inoculation operator is defined as follows:

Definition 6.
Vaccine inoculation operator. Let a be the antibody and $va$ the vaccine. The vaccination operator is $\hat{a} = a ⊖ va$, where $\hat{a}$ is the code of the inoculated antibody. The inoculation operation ⊖ acts on each gene k as [4]: $\begin{matrix} (4) & {\hat{a}}^{k} = a^{k} ⊖ {va}^{k} = \begin{cases} {va}^{k}, & {va}^{k} = 0 \text{ or } 1 \\ a^{k}, & {va}^{k} = * \end{cases} \end{matrix}$
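
The gene-wise behavior of the inoculation operator can be sketched as below, assuming antibodies are bit strings and vaccines are strings over 0, 1 and *. The function name `inoculate` is hypothetical.

```python
def inoculate(antibody: str, vaccine: str) -> str:
    """Formula (4): copy fixed vaccine genes, keep the antibody's own gene at '*'."""
    if len(antibody) != len(vaccine):
        raise ValueError("antibody and vaccine must have the same length")
    return "".join(a if v == "*" else v for a, v in zip(antibody, vaccine))


print(inoculate("0111", "1*00"))  # 1100
```

Only the second gene of the antibody survives in this toy case, because it is the only position where the vaccine is uncertain.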

The antibodies need to be evaluated after inoculation: excellent antibodies are put into the memory data set, and the poorer antibodies are deleted from the data set.

Vaccine evaluation. Let $va$ be the vaccine of antibody population A and $a_{i}$ be one antibody of population A. The evaluation formula of the vaccine $va$ is [21]: $\begin{matrix} (5) & E (va) = E^{'} (va) + \sum_{i = 1}^{n} (fit ({\hat{a}}_{i}, b) - fit (a_{i}, b)) \end{matrix}$

In Eq. (5), $E^{'} (va)$ is the fitness value before inoculation, fit is the function that calculates the affinity between antigen and antibody, and ${\hat{a}}_{i}$ is the antibody $a_{i}$ after inoculation.

The affinity between antigen and antibody is measured with the Euclidean distance in shape space [18].
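
A small Python sketch of the evaluation in Eq. (5) follows. The text states only that affinity is based on Euclidean distance, so the concrete form `1 / (1 + distance)` used for `fit` is an assumption chosen so that closer strings score higher; the names `fit`, `inoculate` and `evaluate_vaccine`, and the `prior` argument standing for $E^{'} (va)$, are likewise hypothetical.

```python
import math
from typing import Sequence


def fit(antibody: str, antigen: str) -> float:
    """Assumed affinity: 1 / (1 + Euclidean distance between the bit vectors)."""
    dist = math.sqrt(sum((int(a) - int(b)) ** 2 for a, b in zip(antibody, antigen)))
    return 1.0 / (1.0 + dist)


def inoculate(antibody: str, vaccine: str) -> str:
    """Formula (4): copy fixed vaccine genes, keep the antibody's gene at '*'."""
    return "".join(a if v == "*" else v for a, v in zip(antibody, vaccine))


def evaluate_vaccine(vaccine: str, antibodies: Sequence[str], antigen: str,
                     prior: float = 0.0) -> float:
    """Eq. (5): prior score plus the total affinity gain caused by inoculation."""
    gain = sum(fit(inoculate(a, vaccine), antigen) - fit(a, antigen)
               for a in antibodies)
    return prior + gain


score = evaluate_vaccine("1*00", ["0111"], "1100")
print(score > 0)  # True: the vaccine moves the antibody closer to the antigen
```

A positive score indicates that inoculation raised the total affinity, which is the case in which the vaccine would be kept.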
3.2.3. Cooperative immune algorithm

The quantity and quality of the antibodies determine the detection performance of the detection algorithm. Excellent antibodies are used to obtain vaccines, and the vaccine inoculation strategy is used to enhance the detection ability of the antibodies. Taking Dos attacks as an example, the implementation steps of the vaccination algorithm are shown in Algorithm 1.

Algorithm 1

Cooperative immune algorithm

Immature antibodies are generated in three ways: crossover and Gaussian mutation are used to produce new antibodies, which join the antibody set; antibodies are randomly generated and put into the antibody set; and one-third of the antibodies in the antibody set remain unchanged.

These three ways avoid local convergence and limit excessive copies of the same antibody. At the same time, for small antibody sample sets, they effectively increase the number of antibodies and improve the detection accuracy. To obtain excellent antibodies, in each attack antibody set we choose the antibodies with high fitness as vaccines and calculate the significance of each antibody gene as its weight. The antibodies with poorer fitness are then inoculated to generate more excellent antibodies, which are used to detect the antigens of the IDS [4].

3.3. Random antibody detection forest

Random antibody detection forest (RADF) is a set of multiple identically distributed antibody decision trees, each of which depends on an independent random vector.

The principle of RADF is as follows: the antibody set X contains N antibodies, and k antibody samples are obtained by random sampling with replacement ($0 < k < N$). Each sample of antibodies is used to generate an antibody decision tree $T_{i}$ ($1 ⩽ i ⩽ k$), and the k antibody decision trees are combined into a random forest $F = \{T_{1}, T_{2}, \dots, T_{k}\}$. The antibody decision trees are used to classify antigens, and the classification result is decided by voting [15].

The fewer antibody decision trees the RADF has, the weaker its classification ability. Only when the algorithm has a large number of decision trees will the classification results be effective.

Inadequate samples cause a high false detection rate and a low detection rate. To deal with these problems, the collaborative immune algorithm of Section 3.2 is adopted to obtain excellent antibodies and to extend the small sample set so that there are enough antibodies.

Referring to RF and AI theory, in this paper, we design a random antibody detection forest model, which is shown in Fig. 2.

It is crucial to improve the extrapolation prediction ability of the random forest classification model for IDS, so we need to generate different antibody sets to increase the classification ability of the forest classification model. After k rounds of training, we get an antibody decision tree set $\{h_{1} (a), h_{2} (a), h_{3} (a), \dots, h_{k} (a)\}$. After voting, we get the final antibody detection forest: $\begin{matrix} (6) & H (a) = {argmax}_{Y} \sum_{i = 1}^{k} (I (h_{i} (a)) = ε) \end{matrix}$

$H (a)$ is the antibody detection forest model, $h_{i} (a)$ is the classification result of an individual antibody decision tree, Y is the target variable output of antigen detection, and $I (∙)$ denotes the indicator function of the antibody detection forest. The antibody decision trees are used to detect antigens, and the detection result is $a_{i} . label$, such as Probe. $I (∙)$ is calculated with formula (7). $\begin{matrix} (7) & I (∙) = \frac{count (\sum_{i = 1}^{N} a_{i} . label == 001)}{N} \end{matrix}$
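
The sampling and voting steps of the RADF can be sketched in Python as follows. This is an illustrative sketch, not Algorithm 2 itself: a tree is modeled as any callable that maps an antigen string to a type code, the three lambda "trees" are toy stand-ins for trained antibody decision trees, and the names `bootstrap_samples` and `forest_vote` are hypothetical.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence

Tree = Callable[[str], str]  # maps an antigen bit string to a type code


def bootstrap_samples(antibodies: Sequence[str], k: int, seed: int = 0) -> List[List[str]]:
    """Draw k samples of the antibody set by random sampling with replacement."""
    rng = random.Random(seed)
    return [rng.choices(antibodies, k=len(antibodies)) for _ in range(k)]


def forest_vote(trees: Sequence[Tree], antigen: str) -> str:
    """Majority vote over the tree outputs, in the spirit of formula (6)."""
    votes = Counter(tree(antigen) for tree in trees)
    return votes.most_common(1)[0][0]


# Toy stand-ins for trained antibody decision trees (illustrative only).
trees = [
    lambda a: "001" if a[0] == "1" else "000",
    lambda a: "001" if a[1] == "1" else "000",
    lambda a: "010" if a[2] == "1" else "000",
]
print(forest_vote(trees, "1100"))  # 001
```

Two of the three toy trees output 001 for the antigen 1100, so the forest reports 001.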

Fig. 2.

Random antibody detection model.

Random antibody detection forest algorithm is shown as Algorithm 2.

Algorithm 2

Random antibody detection forest

3.4. CIRAFID algorithm

The quantity and quality of the antibodies determine the detection performance.

If the random antibody detection forest has few decision trees, its classification ability will be weak; only with a large number of decision trees can the detection algorithm get effective classification results. By combining the coordinative immune algorithm and the random antibody detection forest algorithm, the CIRAFID algorithm is designed, as shown in Algorithm 3.

Algorithm 3

CIRAFID algorithm

4. Experiments and results analysis

To verify the performance of the CIRAFID model, both NSL-KDD and CICIDS2017 are used in the following experiments. The important parameters of the algorithm are obtained through experiments, and the threshold of the matching radius between antibody and antigen is set separately for each data set. Antigens are detected to obtain the confusion matrices of the IDS, from which the intrusion detection performance indexes are calculated. Finally, the anomaly detection performance of the proposed algorithm is analyzed and compared with other classification algorithms, and the feasibility and detection performance of the algorithm are verified. The NSL-KDD data set is used to verify the detection performance of the CIRAFID algorithm on classical attacks; the CICIDS2017 data set is used to study its classification performance on new attacks.

Therefore, we use some common performance indicators to evaluate antigen detection and present comparison analyses for NSL-KDD and CICIDS2017 respectively. NSL-KDD and CICIDS2017 both include attack types with few samples, with which we study the attack classification performance of the CIRAFID algorithm on small samples.

To verify the effect of the vaccination strategy of the cooperative immune algorithm, steps (1.8), (1.9) and (1.10) of Algorithm 1 are omitted while the other steps remain the same as in the CIRAFID algorithm, and the testing results of CIRAFID, this reduced variant and other algorithms are compared. For convenience, the variant without the vaccination steps is named intrusion detection based on random antibody forest (RAFID).

4.1. Experiments datasets

There are many data sets for the simulations of IDS, such as KDD Cup 99, NSL-KDD, ADFA, Kyoto 2006+, ISCXIDS2012, UNSW-NB15 and CICIDS2017.

KDD Cup 99 and NSL-KDD are both classic data sets used for intrusion detection simulations. KDD Cup 99 contains a large number of redundant samples, and NSL-KDD is a reduced data set derived from KDD Cup 99. To compare with classical algorithms, we use the NSL-KDD data set for simulations in this paper. However, because the underlying data were collected in 1999, NSL-KDD lacks new attacks, so we also need a newer data set for experiments.

ADFA, Kyoto 2006+, ISCXIDS2012, UNSW-NB15 and CICIDS2017 are also used for IDS. ADFA, Kyoto 2006+ and ISCXIDS2012 are outdated, fewer researchers apply them in experiments, and they are not classic data sets. UNSW-NB15 and CICIDS2017 are newer and contain new attack logs. The UNSW-NB15 data set was captured at the Australian Centre for Cyber Security (ACCS) in 2015 with the IXIA PerfectStorm network testing tool, and contains nine types of attacks [16]. The CICIDS2017 data set was collected by the Canadian Institute for Cybersecurity, was published in 2017, and contains 15 types of attacks. CICIDS2017 is one of the more comprehensive data sets at present. To verify CIRAFID's detection performance on new attacks, we adopt CICIDS2017 in the experiments. In conclusion, we adopt NSL-KDD and CICIDS2017 for simulations.

4.1.1. NSL-KDD dataset

NSL-KDD is provided by Lincoln laboratory for experiment simulations; the training sample set KDDTrain+ includes 125,973 records. The testing data include two subsets, KDDTest+ and KDDTest-21, each of which contains Normal, Dos, Probe, U2R and R2L records.

In the experiments, all the training samples are used to obtain antibodies, and the testing sample set KDDTest+ is used to test the CIRAFID algorithm. The sample distribution of the data set is shown in Table 1.

Table 1
Sample distribution of NSL-KDD [13]

No	Type	KDDTrain+	KDDTest+
1	Normal	67,343	9,710
2	DoS	45,927	7,458
3	Probe	11,656	2,422
4	U2R	52	67
5	R2L	995	2,887
total		125,973	22,544

4.1.2. CICIDS2017 data set

CICIDS2017 includes normal samples and 15 types of attacks, and contains 2,830,743 samples. In CICIDS2017, 60 percent of the data are used as training samples and the rest are used as testing samples. The sample distribution is shown in Table 2.

Table 2
Sample distribution of CICIDS2017

Type	Abbreviation	Number	Percent
Benign	Normal	2,273,097	80.3004
Distributed Denial-of-service (DDos)	Ddos	128,027	4.5227
Port Scan	PortS	158,930	5.6144
Bot	Bot	1,966	0.0695
Infiltration	Inf	36	0.0013
Web Attack: Brute Force, Structured Query Language (SQL) Injection, Cross-site Scripting (XSS)	XSS	2,180	0.077
File Transfer Protocol (FTP) – Patator	FTP	7,938	0.2804
Secure Shell (SSH) – Patator	SSH	5,897	0.2083
Denial-of-service (Dos) GoldenEye	DosGE	10,293	0.3636
DoS Hulk	DoSHu	231,073	8.163
DoS Slowhttptest	DoSSH	5,499	0.1943
DoS Slowloris	DoSSL	5,796	0.2048
Heartbleed	Heart	11	0.0004

4.2. Evaluation parameter

The CIRAFID algorithm is written in C, and all the experiments run on a Linux operating system (Intel Pentium Dual CPU E2180, 64 G RAM). There are four types of IDS detection results: true positive (TP), true negative (TN), false positive (FP) and false negative (FN) [21].

The sum of TP and FN is the total number of actual normal samples: TP is the number of normal samples identified as normal antigens, and FN is the number of normal samples wrongly identified as attacks. The sum of FP and TN is the total number of actual abnormal samples: FP is the number of abnormal samples wrongly identified as normal antigens, and TN is the number of abnormal samples identified as attacks. We use the confusion matrix to calculate the intrusion detection evaluation indexes [5,14]. The confusion matrix is shown in Table 3.

Table 3
Confusion matrix

Type		Actual values
		Normal	Attacks
Predicted results	Normal	TP	FP
	Attacks	FN	TN

The main performance evaluation indicators of the CIRAFID algorithm are as follows:

Detection rate (DR) is the ratio of attacks that are recognized accurately to the antigens that take part in intruding the immune system; DR is calculated with formula (8).

False alarm rate (FAR) is the ratio of attacks that are recognized wrongly to the antigens that take part in intruding the immune system; FAR is calculated with formula (9).

Precision ($Pre$) is the ratio of real attacks to the antigens that are recognized as attacks by the immune system; $Pre$ is calculated with formula (10).

Accuracy ($Acc$) is the ratio of antigens that are recognized correctly to all the antigens in the immune system; $Acc$ is calculated with formula (11).

$F1\text{-}score$ is a comprehensive index that combines the detection rate and the precision; $F1\text{-}score$ is calculated with formula (12).

$\begin{matrix} (8) & DR = \frac{TP}{TP + FN} \\ (9) & FAR = \frac{FP}{TN + FP} \\ (10) & Pre = \frac{TP}{TP + FP} \\ (11) & Acc = \frac{TP + TN}{TP + TN + FP + FN} \\ (12) & F1\text{-}score = \frac{2 \times DR \times Pre}{DR + Pre} \end{matrix}$
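
Formulas (8) to (12) translate directly into a few lines of Python. The function name `ids_metrics` and the example counts are illustrative assumptions and are not taken from the experiments.

```python
from typing import Dict


def ids_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Compute DR, FAR, Pre, Acc and F1-score from the four confusion-matrix cells."""
    dr = tp / (tp + fn)                      # formula (8)
    far = fp / (tn + fp)                     # formula (9)
    pre = tp / (tp + fp)                     # formula (10)
    acc = (tp + tn) / (tp + tn + fp + fn)    # formula (11)
    f1 = 2 * dr * pre / (dr + pre)           # formula (12)
    return {"DR": dr, "FAR": far, "Pre": pre, "Acc": acc, "F1": f1}


# Illustrative counts only: TP=90, FP=5, FN=10, TN=95.
for name, value in ids_metrics(90, 5, 10, 95).items():
    print(f"{name}: {100 * value:.2f}%")
```

The values are returned as fractions and printed as percentages, which is the unit used in the comparison tables.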

4.3. Parameter settings

The parameters of the CIRAFID algorithm differ between data sets. Some of the parameters follow other studies, such as the binary length of antigens, the binary length of antibodies, the number of vaccines, and the number of antibodies obtained by crossover and mutation. The matching radius between antibodies and antigens is determined by experiment. For NSL-KDD and CICIDS2017, we preprocess the data sets to get antibodies and antigens [22]. Sections 4.3.1 and 4.3.2 give the matching radius parameters for NSL-KDD and CICIDS2017 respectively.

4.3.1. The matching radius of NSL-KDD

The antigen is a binary string of 92 bits, and the antibody is a binary string of 95 bits. 20 percent of the antibodies are used in CIRAFID. The matching radius is determined in the following experiments, and the results are shown in Fig. 3.

Fig. 3.

The detection results for different radius with NSL-KDD.

NSL-KDD contains 41 attributes and 1 label. In the experiments, the matching radius is set to each even number from 30 to 50, and all the training and testing data are used for cross experiments. For each setting, the algorithm runs 10 times and the average value is taken as the final result.

Fig. 3 gives DR and FAR for different radius values, where the radius is the number of matching bits between antigens and antibodies. The detection rate increases and the false positive rate decreases as the detection radius increases. When the detection radius is 40 bits, the detection rate is 94.73% and the false positive rate is 6.16%. When the detection radius is more than 40 bits, the detection rate and the false positive rate are stable. In the following experiments on NSL-KDD, the matching radius is 40.

4.3.2. The matching radius of CICIDS2017

The antigen is a binary string of 119 bits, and the antibody is a binary string of 123 bits. As in Section 4.3.1, 20 percent of the antibodies are used in CIRAFID. The matching radius is determined in the following experiments.

CICIDS2017 contains 52 attributes and 1 label. In the experiments, the matching radius is set to each even number from 40 to 90, and all the training and testing data are used for cross experiments. For each setting, the algorithm runs 10 times and the average value is taken as the final result. The results are shown in Fig. 4.

Fig. 4.

The detection results for different radius with CICIDS2017.

Fig. 4 gives DR and FAR for different values of the matching radius between antigens and antibodies. The detection rate increases and the false positive rate decreases as the detection radius increases. When the detection radius is 58 bits, the detection rate is 93.82% and the false positive rate is 8.01%. When the detection radius is more than 58 bits, the detection rate and the false positive rate are stable. In the following experiments on CICIDS2017, the matching radius is 58.

In conclusion, the matching radius is 40 for NSL-KDD and 58 for CICIDS2017.

4.4. The confusion matrix of CIRAFID algorithm

The confusion matrix, an important tool for calculating the performance of an IDS, is introduced in Section 4.2. The performance evaluation indexes of the CIRAFID algorithm are calculated from the confusion matrix. We run the CIRAFID algorithm on the two data sets above and obtain the respective confusion matrices.

4.4.1. The confusion matrix in NSL-KDD

For the 22,544 testing samples in Table 1, normal samples and Dos, Probe, U2R and R2L attacks are detected respectively. The results are shown in Table 4.

Table 4
The confusion matrix for NSL-KDD

		Actual values
		Normal	DoS	Probe	U2R	R2L
Predicted values	Normal	9216	23	82	2	15
	DoS	107	7359	61	9	106
	Probe	97	73	2270	5	1
	U2R	117	0	0	47	4
	R2L	173	3	9	4	2761
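
As a worked example, the five-class matrix of Table 4 can be collapsed into the binary layout of Table 3, in which the Normal class plays the role of the positive class. The sketch below does this in Python; the helper name `collapse_to_binary` is hypothetical, and the row and column orientation (rows are predicted values, columns are actual values) follows the table headers.

```python
from typing import List, Tuple

# Table 4: rows are predicted classes, columns are actual classes,
# both in the order Normal, DoS, Probe, U2R, R2L.
TABLE4: List[List[int]] = [
    [9216, 23, 82, 2, 15],
    [107, 7359, 61, 9, 106],
    [97, 73, 2270, 5, 1],
    [117, 0, 0, 47, 4],
    [173, 3, 9, 4, 2761],
]


def collapse_to_binary(matrix: List[List[int]], normal: int = 0) -> Tuple[int, int, int, int]:
    """Return (TP, FP, FN, TN) in the sense of Table 3."""
    tp = matrix[normal][normal]                       # normal predicted as normal
    fp = sum(matrix[normal]) - tp                     # attacks predicted as normal
    fn = sum(row[normal] for row in matrix) - tp      # normal predicted as some attack
    tn = sum(map(sum, matrix)) - tp - fp - fn         # attacks predicted as some attack
    return tp, fp, fn, tn


tp, fp, fn, tn = collapse_to_binary(TABLE4)
print(tp, fp, fn, tn)                        # 9216 122 494 12712
print(f"DR = {100 * tp / (tp + fn):.2f}%")   # DR = 94.91%
```

The four counts add up to the 22,544 testing samples of KDDTest+, and applying formula (8) to them gives a detection rate of 94.91%.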

4.4.2. The confusion matrix in CICIDS2017

From the data in Table 2, 1,132,296 testing samples are detected with the CIRAFID algorithm. The testing set contains 909,239 normal samples, 51,211 Ddos samples, 63,572 PortS samples, 786 Bot samples, 14 Inf attack samples, 872 XSS attack samples, 3,175 FTP attack samples, 2,359 SSH attack samples, 4,117 DosGE attack samples, 92,429 DoSHu attack samples, 2,200 DoSSH attack samples, 2,318 DoSSL attack samples and 4 Heart attack samples. The results are shown in Table 5.

Table 5
The confusion matrix for CICIDS2017

		Actual values
		Normal	Ddos	PortS	Bot	Inf	XSS	FTP	SSH	DosGE	DoSHu	DoSSH	DoSSL	Heart
Predicted values	Normal	908573	25	5	249	2	15	4	5	6	193	7	3	0
	Ddos	1	51182	0	0	0	0	0	0	0	0	0	0	0
	PortS	286	0	63559	0	0	0	0	0	0	0	1	1	0
	Bot	68	0	0	537	0	0	0	0	0	0	0	0	0
	Inf	0	0	0	0	12	0	0	0	0	0	0	0	0
	XSS	0	0	0	0	0	857	0	0	0	0	0	1	0
	FTP	0	0	0	0	0	0	3171	0	0	0	0	0	0
	SSH	0	0	0	0	0	0	0	2354	0	0	0	0	0
	DosGE	10	1	0	0	0	0	0	0	4101	9	2	1	0
	DoSHu	289	3	8	0	0	0	0	0	8	92227	1	0	0
	DoSSH	11	0	0	0	0	0	0	0	1	0	2184	5	0
	DoSSL	1	0	0	0	0	0	0	0	1	0	5	2307	0
	Heart	0	0	0	0	0	0	0	0	0	0	0	0	4

4.5. Classification performance comparisons between CIRAFID and other algorithms

In order to verify the proposed algorithm, classification performance comparison between CIRAFID and other algorithms are given in this section.

Because there are more researchers who apply NSL-KDD in experiments, and CICIDS2017 data set is relatively newer, there are less studies to refer to. Performance comparisons are given for the two data sets respectively. Utilize percentage as the unit of detection results. The algorithm performance indexes in this section are calculated by confusion matrix in Section 3.4.

4.5.1. Performances comparisons using NSL-KDD

We run the CIRAFID algorithm on the NSL-KDD data set with the parameters given in Section 3.3. Intrusion detection usually includes anomaly detection (distinguishing only normal from abnormal behavior) and intrusion classification (identifying the attack type). The anomaly detection and classification performances of the CIRAFID algorithm are analyzed separately below.

Comparison of anomaly detection performances

We compare CIRAFID with the algorithms introduced in Section 1: a horizontal comparison with immune algorithms and random forest algorithms, and a longitudinal comparison with other machine learning algorithms. All attacks are treated as abnormal samples for testing. For each group of data, we run the CIRAFID algorithm 10 times and take the average value. The anomaly detection results of the different algorithms are compared in Table 6.
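The anomaly detection setting can be illustrated with a short sketch: every attack label is collapsed into a single abnormal class, the binary indicators are computed, and the values are averaged over repeated runs. This is a toy example under stated assumptions, not the authors' evaluation code: the label names, the two hypothetical runs and the helper names `binary_metrics` and `average_runs` are invented for illustration, and the usual textbook definitions of DR, Acc, FAR, Pre and F1 are assumed (the paper's own definitions are given in Section 3.4).

```python
from statistics import mean


def binary_metrics(actual, predicted, normal_label="normal"):
    """Score anomaly detection: any label other than normal_label is abnormal."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        is_attack, flagged = a != normal_label, p != normal_label
        if is_attack and flagged:
            tp += 1          # attack flagged as abnormal (attack type may differ)
        elif is_attack:
            fn += 1          # attack missed
        elif flagged:
            fp += 1          # normal traffic flagged as abnormal
        else:
            tn += 1
    total = tp + fp + tn + fn
    dr = tp / (tp + fn) if tp + fn else 0.0    # detection rate
    far = fp / (fp + tn) if fp + tn else 0.0   # false alarm rate
    acc = (tp + tn) / total if total else 0.0
    pre = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * dr * pre / (dr + pre) if dr + pre else 0.0
    return {"DR": dr, "Acc": acc, "FAR": far, "Pre": pre, "F1": f1}


def average_runs(actual, runs):
    """Average each indicator over repeated runs and express it in percent."""
    scores = [binary_metrics(actual, predicted) for predicted in runs]
    return {name: 100 * mean(s[name] for s in scores) for name in scores[0]}


# Hypothetical labels and two hypothetical runs (the paper averages 10 runs).
actual = ["normal", "dos", "probe", "normal", "r2l", "u2r", "normal", "normal"]
run1 = ["normal", "dos", "dos", "normal", "normal", "u2r", "probe", "normal"]
run2 = ["normal", "dos", "probe", "normal", "r2l", "normal", "normal", "normal"]

avg = average_runs(actual, [run1, run2])
print(avg)
```

Note that in `run1` the probe sample predicted as dos still counts as a detected anomaly; such a sample is penalized only in the classification comparison, where the attack type matters.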

Table 6
Comparison of anomaly detection performances with NSL-KDD (N/A denotes that the result is not available)

Method	DR	Acc	FAR	Pre	F1-score
Hybrid multilayer model [1]	95.17	95.75	1.87	N/A	N/A
MDPCA-DBN [19]	61.57	66.18	13.06	95.51	74.87
AA-HMM [17]	91.06	93.48	N/A	93.63	92.33
Outlier RF [15]	93.55	94.36	2.34	N/A	N/A
Traditional RF [19]	49.84	56.84	11.62	95.08	65.39
RF with weight [10]	N/A	98.36	N/A	N/A	84.00
Improved CSA [20]	99.20	N/A	0.20	N/A	N/A
RAFID	93.36	65.82	4.17	65.82	59.49
CIRAFID	94.91	98.69	0.91	97.34	96.77

According to the results in Table 6, compared with the hybrid multilayer model in reference [1], the detection rate of the CIRAFID algorithm is 0.26 percentage points lower (94.91 vs. 95.17), while $Acc$ is 2.94 percentage points higher (98.69 vs. 95.75) and the false positive rate is 0.96 percentage points lower (0.91 vs. 1.87).

The proposed algorithm has a lower detection rate and a higher false positive rate than the improved clonal selection algorithm in reference [20], but it performs better than the other algorithms on most indicators and balances the detection rate, precision, accuracy, F1-score and false positive rate. Compared with RAFID, the detection rate, accuracy, precision and F1-score increase, and the false positive rate decreases.

Comparison of classification performances

To test the classification performance of the CIRAFID algorithm, especially on small-sample classes, we compare it with the algorithms in Table 6. The detection rates for the different attack types are shown in Table 7.

Table 7

Comparisons 1 of classification performances for NSL-KDD

Method	Norm	Dos	Prb	U2R	R2L
Hybrid multilayer model [1]	98.13	99.54	87.22	21.93	31.39
MDPCA-DBN [19]	97.38	81.09	73.94	17.25	6.50
Traditional RF [19]	88.38	66.08	60.45	0.50	10.42
Outlier RF [15]	97.66	97.32	95.34	21.05	31.96
RAFID	94.40	97.29	87.28	37.31	12.37
CIRAFID	94.91	98.67	93.72	70.15	95.63

From Table 7 we can conclude that the detection performance of the CIRAFID algorithm on the small-sample U2R and R2L attacks is clearly superior to that of the other algorithms, and that on Prb attacks it is second only to Outlier RF [15] (93.72 vs. 95.34). For large-sample classes, the detection rates of the CIRAFID algorithm are almost the same as those of the other algorithms. Meanwhile, the adaptive ability of the CIRAFID algorithm is better than that of the other algorithms.

Yang et al. [19] give the confusion matrix of the MDPCA-DBN algorithm. From this confusion matrix we calculate the DR, Pre and F1-Score indexes and compare them with the test results of this paper; the results are shown in Table 8.

Table 8

Comparisons 2 of classification performances for NSL-KDD

Type	DR			Pre			F1-Score
	MDPCA-DBN [19]	RAFID	CIRAFID	MDPCA-DBN [19]	RAFID	CIRAFID	MDPCA-DBN [19]	RAFID	CIRAFID
Norm	71.42	94.40	94.91	97.38	92.18	98.69	82.40	93.27	96.77
Dos	96.34	97.29	98.67	81.09	86.48	96.30	88.06	91.57	97.47
Probe	85.85	87.28	93.72	73.94	76.87	92.80	79.45	81.75	93.26
U2R	11.82	37.31	70.15	6.50	2.86	27.97	8.39	5.31	40.00
R2L	57.30	12.37	95.63	17.25	61.03	93.59	26.51	20.56	94.60
Ave	64.54	65.73	90.62	55.23	63.88	81.87	56.96	58.49	84.42

Table 8 shows that the DR, Pre and F1-Score of the CIRAFID algorithm are higher than those of the MDPCA-DBN algorithm in ref. [19] and of RAFID. The improvement is especially marked for the small-sample Probe, U2R and R2L classes.
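The Ave row of Table 8 appears to be an unweighted (macro) average over the five classes, so that a small class such as U2R weighs as much as the normal class. The sketch below checks this reading against the CIRAFID columns of Table 8; the dictionary layout and the helper name `macro_average` are illustrative assumptions.

```python
from statistics import mean

# CIRAFID columns of Table 8 (percent), one entry per class.
CIRAFID = {
    "Normal": {"DR": 94.91, "Pre": 98.69, "F1": 96.77},
    "Dos":    {"DR": 98.67, "Pre": 96.30, "F1": 97.47},
    "Probe":  {"DR": 93.72, "Pre": 92.80, "F1": 93.26},
    "U2R":    {"DR": 70.15, "Pre": 27.97, "F1": 40.00},
    "R2L":    {"DR": 95.63, "Pre": 93.59, "F1": 94.60},
}


def macro_average(per_class, metric):
    """Unweighted mean of one indicator over all classes."""
    return mean(row[metric] for row in per_class.values())


ave = {m: round(macro_average(CIRAFID, m), 2) for m in ("DR", "Pre", "F1")}
print(ave)  # {'DR': 90.62, 'Pre': 81.87, 'F1': 84.42}
```

The result reproduces the Ave entries for CIRAFID in Table 8. Because every class has equal weight, the low precision on U2R (27.97) pulls the average precision well below the values of the large classes.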

4.5.2. Performance comparisons using CICIDS2017

Lee and Park [9] report the detection performance of the Single-RF, AE-RF and AE-CGAN-RF algorithms, so we compare the performance of the CIRAFID algorithm with these algorithms on the CICIDS2017 data set. The results are shown in Table 9.

Table 9
Comparisons 1 of classification performances for CICIDS2017

Type	DR				Pre				F1-Score
	Single-RF [9]	AE-CGAN-RF [9]	RAFID	CIRAFID	Single-RF [9]	AE-CGAN-RF [9]	RAFID	CIRAFID	Single-RF [9]	AE-CGAN-RF [9]	RAFID	CIRAFID
Benign	99.52	99.91	99.87	99.92	99.62	99.92	99.79	99.94	99.57	99.92	99.83	99.93
Ddos	99.86	99.92	99.78	99.94	99.98	99.99	99.99	99.99	99.92	99.96	99.89	99.97
PortS	99.89	99.96	98.42	99.97	98.52	99.38	98.85	99.54	99.20	99.67	98.63	99.76
Bot	20.69	54.41	51.15	68.32	100.00	83.69	81.71	88.76	34.29	65.94	62.91	77.21
Inf	40.00	66.67	64.29	85.71	100.00	100.00	100.00	100.00	57.14	80.00	78.26	92.30
XSS	91.40	94.84	84.29	98.27	96.62	99.40	95.33	99.88	95.33	97.07	89.47	99.07
FTP	99.50	99.84	96.19	99.87	100.00	100.00	100.00	100.00	99.75	99.92	98.06	99.93
SSH	99.81	99.75	94.62	99.78	100.00	100.00	99.96	100.00	99.40	99.87	97.21	99.89
DosGE	97.55	99.44	97.98	99.61	95.73	99.42	96.62	99.44	99.63	99.43	97.30	99.52
DoSHu	96.73	99.73	99.47	99.78	95.33	99.63	99.16	99.66	96.03	99.68	99.31	99.72
DoSSH	79.00	89.95	95.73	99.27	88.22	99.00	96.92	99.22	83.36	98.98	96.32	99.25
DoSSL	85.38	99.31	97.20	99.52	99.55	99.61	97.62	99.69	91.92	99.46	97.41	99.61
Heart	80.00	100.00	75.00	100.00	100.00	100.00	75.00	100.00	88.78	100.00	75.00	100.00
Ave	83.79	93.29	82.43	96.15	98.20	98.46	88.64	98.93	87.79	95.38	84.97	97.40

The experimental results show that the performance of the CIRAFID algorithm is superior to that of the single random forest algorithm and the RAFID algorithm, and similar to that of the AE-CGAN-RF algorithm. For small-sample classes, such as Bot and Infiltration attacks, the detection performance is improved. The overall detection performance of the CIRAFID algorithm proposed in this paper is better than that of the Single-RF and AE-RF algorithms.

In conclusion, the CIRAFID algorithm can classify classic intrusions and can also identify new attacks; it is feasible and has good detection performance. The vaccination mechanism of the coordinative immune algorithm optimizes the antibodies and improves the detection performance. NSL-KDD and CICIDS2017 were obtained from simulated network environments that are similar to real ones. The above experiments therefore suggest that the proposed algorithm is practical for intrusion detection in real networks.

5. Conclusions

In this paper, an intrusion detection model is designed by combining coordinative immune and random forests. In the coordinative immune algorithm, antibodies are optimized in three steps: vaccine extraction, vaccination based on the gene significance degree, and vaccine evaluation. Random generation, crossover and mutation are used to obtain diverse antibodies, which improves antibody diversity and helps the CIRAFID algorithm avoid local convergence.

By combining vaccination strategies with a variety of antibody generation methods, the detection rate and precision increase, the false alarm rate decreases, and the balance among the detection performance indicators is improved. Because the antibodies are optimized in each cycle of the CIRAFID algorithm, the adaptability of intrusion detection is improved. The random antibody forest detection algorithm is adopted to detect antigens, classify attacks and update the antibody library, which improves the ability of the antibodies to classify attacks with small samples. The benchmark data sets NSL-KDD and CICIDS2017 are used to evaluate the performance of the CIRAFID algorithm. The experimental results show that, compared with other intrusion detection algorithms, the CIRAFID algorithm improves the detection rate, accuracy and classification accuracy while reducing the false positive rate, and improves the adaptability and classification performance of intrusion detection. Our future work is to study dimension reduction algorithms that remove redundant attributes and increase the speed of intrusion detection.

Acknowledgements

This research was supported by the National Natural Science Foundation of China (grant no. 61502436) and the Project of Science and Technology Tackling Key Problems in Henan Province (grant no. 202102210149).

Conflict of interest

None to report.

References

1. W.L. Alyaseen, Z.A. Othman and M.Z.A. Nazri, Multi-level hybrid support vector machine and extreme learning machine based on modified K-means for intrusion detection system, Expert Systems with Applications 67(1) (2017), 296–303.

2. J.P. Anderson, Computer security threat monitoring and surveillance, Technical report, Pennsylvania, 1980.

3. Breiman, Random forests, Machine Learning 45(1) (2001), 5–32. doi:10.1023/A:1010933404324.

4. F.I. Chou, W.H. Ho, Y.J. Chen et al., Detecting mixed-type intrusion in high adaptability using artificial immune system and parallelized automata, Applied Sciences 10 (2020), 1566. doi:10.3390/app10051566.

5. D’Angelo, Palmieri, Ficco and Rampone, An uncertainty-managing batch relevance-based approach to network anomaly detection, Applied Soft Computing Journal 36 (2015), 408–418. doi:10.1016/j.asoc.2015.07.029.

6. Ehsan, Hossein and Nowroozi, A novel sophisticated hybrid method for intrusion detection using the artificial immune system, Journal of Information Security and Applications 58 (2021). doi:10.1016/j.jisa.2020.102721.

7. Kim, The artificial immune model for network intrusion detection, in: The 7th EUFIT’99, Aachen, Germany, 1999.

8. Kishor Kumar, Raja Kumar, Suleman Basha et al., Intrusion detection using an ensemble of support vector machines. Advances in engineering, Management and Sciences 3(s) (2019), 266–275. doi:10.26782/jmcms.spl.3/2019.09.00020.

9. Lee and Park, AE-CGAN model based high performance network intrusion detection system, Applied Sciences 20(9) (2019), 1–14. doi:10.3390/app9204221.

10. Liu, Zhao, Liu, Network traffic classification based on Spark frame, Journal on Communications 39(Z1) (2018), 30–35 (in Chinese). doi:10.26939/d.cnki.gbhgu.2019.000833.

11. D.Q. Miao and D.G. Li, Rough Sets Theory Algorithms and Applications, Tsinghua University Press, Beijing, 2008, pp. 175–222 (in Chinese).

12. B.A. Naila, Mohamed and Abdelouahid, NSNAD: Negative selection-based network anomaly detection approach with relevant feature subset, Neural Computing and Applications 32 (2020), 3475–3501. doi:10.1007/s00521-019-04396-2.

13. Nour, Designing an online and reliable statistical anomaly detection framework for dealing with large high-speed network traffic, Diss, University of New South Wales, Canberra, Australia, 2017.

14. Palmieri, Network anomaly detection based on logistic regression of nonlinear chaotic invariants, Journal of Network and Computer Applications 148 (2019). doi:10.1016/j.jnca.2019.102460.

15. Ren, Liu, Wang et al., An multi-level intrusion detection method based on KNN outlier detection and random forests, Journal of Computer Research and Development 56(3) (2019), 566–575 (in Chinese). doi:10.7544/issn1000-1239.2019.20180063.

16. Sahar, Daniyal and Li, DeepDCA: Novel network-based detection of IoT attacks using artificial immune system, Applied Sciences 10 (2020), 1909. doi:10.3390/app10061909.

17. C.Y. Song, Pons and Yen, AA-HMM: An anti-adversarial hidden Markov model for network-based intrusion detection, Applied Sciences 12(8) (2018), 1–25. doi:10.3390/app8122421.

18. Wang and Liu, A study on coordinative immune-computing model, Acta Electronica Sinica 8(37) (2009), 1739–1745 (in Chinese).

19. Yang, Zheng, Wu, Niu and Yang, Building an effective intrusion detection system using the modified density peak clustering algorithm and deep belief networks, Applied Sciences 2(9) (2019), 238–262. doi:10.3390/app9020238.

20. Yin, Ma and Feng, Towards accurate intrusion detection based on improved clonal selection algorithm, Multimedia Tools Appl. 19(76) (2017), 19397–19410. doi:10.1007/s11042-015-3117-0.

21. Zhang, Research on Intrusion Detection Model Based on Rough Set and Artificial Immunity, Bei Jing University of Post and Communication, Bei Jing, 2014 (in Chinese).

22. Zhang, Bai, Luo et al., Integrated intrusion detection model based on rough set and artificial immune, Journal on Communications 34(9) (2013), 166–176 (in Chinese). doi:10.3969/j.issn.1000-436x.2013.09.020.
