Intrusion detection algorithm based on transfer extreme learning machine

Abstract

Intrusion detection can effectively detect malicious attacks in computer networks and has long been a research hotspot in the field of network security. At present, most existing intrusion detection methods are based on traditional machine learning algorithms. These methods require a sufficient number of available intrusion detection training samples and assume that the training and test data are independent and identically distributed. They also suffer from low detection accuracy for small-sample and newly emerging attacks, slow model construction and high cost. To solve these problems, this paper proposes TrELM, an intrusion detection algorithm based on transfer learning and the extreme learning machine. TrELM is no longer limited by the assumptions of traditional machine learning. It uses the idea of transfer learning to transfer knowledge from a large number of historical intrusion detection samples related to the target domain to a target domain that contains only a small number of intrusion detection samples. Using this existing historical knowledge, it quickly builds a high-quality target learning model, which improves the detection accuracy and efficiency for small-sample and newly emerging intrusion behaviors. Experiments are carried out on the NSL-KDD, KDD99 and ISCX2012 datasets. The experimental results show that the algorithm improves the detection accuracy, especially for unknown attacks and small samples.

Keywords

Transfer learning, intrusion detection, extreme learning machine

1. Introduction

With the rapid development of networks, they play a crucial role in national life and people’s daily activities [1]. However, network attacks are emerging at a high rate, which seriously threatens the security of cyberspace. Therefore, protecting networks from attacks has become an important research field. To deal with different network attacks, network security technologies such as firewalls, cryptography, network isolation and identity authentication have gradually been developed. However, these are passive defense technologies, which cannot efficiently deal with the current complex and changeable network threats. The intrusion detection system (IDS) is a security management system used to detect intrusions on the network, and it is an indispensable part of the modern network security system [2]. It not only makes up for the shortcomings of the traditional network security technologies, but also independently and efficiently detects attacks and takes corresponding protective measures. However, most intrusion detection systems are built on the traditional matching method, which leads to several problems: the system is slow to establish, the false positive and false negative rates are high, only existing attack behaviors can be detected, the detection rate for new attack behaviors and massive attack behaviors is low, and the detection rates of different attack behaviors are unbalanced.

In recent years, with the development of machine learning, several machine learning algorithms have been applied to intrusion detection. For instance, Lv et al. [3] propose an accurate and efficient misuse intrusion detection system based on the hybrid kernel extreme learning machine, which relies on specific attack characteristics to distinguish normal from malicious activities. Wang et al. [4] propose an efficient intrusion detection framework based on the support vector machine, which applies a logarithm marginal density ratio transformation to the original features in order to obtain new and better transformed features. This framework greatly improves the detection ability of the model based on the support vector machine. Gu et al. [5] propose a support vector machine intrusion detection framework based on naive Bayes feature embedding. This framework applies the naive Bayes feature transformation to the original features in order to generate high-quality new data. It then uses the transformed data to train the support vector machine classifier that forms the intrusion detection model. Al-Turaiki et al. [6] propose two models based on deep learning to solve the binary and multi-class classification problems of network attacks. A hybrid hierarchical intrusion detection system, which combines different machine learning and feature selection technologies to provide high-performance intrusion detection for different attack types, is proposed in [7]. Aburomman et al. [8] propose a new ensemble construction method, which uses the weights generated by particle swarm optimization (PSO) to create an ensemble of intrusion detection classifiers with higher accuracy. This method also uses local unimodal sampling (LUS) as the meta-optimizer in order to find better behavior parameters for PSO. These studies indicate that intrusion detection methods based on machine learning improve the accuracy and efficiency of intrusion detection and reduce the false alarm rate [9, 10].

Although the traditional machine learning methods achieved satisfactory results in the intrusion detection, their shortcomings restrict further development:

1) A large number of well-labeled training samples are required for model training, and the training and test data must be independent and identically distributed for the detection model to reach a high detection accuracy [11, 12]. However, in practical applications, the distributions of the test and training data are rarely consistent. When the data distribution changes, it is necessary to recollect and label the training data, and to train the model from scratch.
2) The uneven number of attack behavior samples in the intrusion detection data leads to an imbalance in the detection success rate of each attack behavior. Consequently, data resources are wasted, and more costs are spent to label samples and reconstruct models. In addition, the detection rate is relatively high only for attacks with sufficient samples, and new network attacks with scarce samples cannot be detected efficiently.

In recent years, transfer learning, which is a new machine learning method, has been widely used [13]. Compared with the traditional machine learning algorithms, it can use the knowledge in existing historical (source domain) data, save cost of data collection and model construction, and improve the speed of model training. Moreover, the data distribution of training and test data does not have to meet the independent and identical distribution. At present, it has been widely used in image recognition [14], machine fault diagnosis [15], system filtering [16] and computer vision [17], where it achieved satisfactory results. In order to solve the problems of current intrusion detection algorithms, this paper proposes an intrusion detection algorithm (TrELM) based on transfer learning and extreme learning machine.

TrELM is applied to the scenario where the source domain contains a large number of labeled intrusion detection samples, while the training samples in the target domain are too scarce to train a reliable classifier. Combined with the advantages of the extreme learning machine, the weight vector of the source domain samples is first calculated using the maximum mean discrepancy (MMD). Afterwards, using the weight information, a learning model is trained on the source domain samples that carry a large amount of information and are similar to the target domain. Finally, the objective function of TrELM is constructed using the source domain model knowledge and the data of the target domain. At the same time, a domain similarity distance term and constraints are constructed to further emphasize the source domain knowledge that is similar to the target domain, in order to obtain a strong target learning model, avoid negative transfer and improve the detection effect.

The main contributions of this paper are summarized as follows:

1) An intrusion detection algorithm based on transfer learning and the extreme learning machine, TrELM, is proposed. This algorithm integrates the idea of transfer learning into the extreme learning machine, transfers the knowledge of the source domain through MMD, a similarity distance term and objective function constraints, and helps create a high-quality target learning model in the target domain.
2) A large number of experiments on the intrusion detection datasets NSL-KDD, KDD99 and ISCX2012 demonstrate the efficiency and accuracy of the proposed algorithm. The results of the experiments also show that the detection accuracy of the proposed algorithm is better than or at least comparable to that of the benchmark algorithms.

The remainder of this paper is organized as follows. Section 2 reviews the related works of transfer learning, extreme learning machine and maximum mean discrepancy. The proposed intrusion detection algorithm based on transfer learning and extreme learning machine is detailed in Section 3. In Section 4, the efficiency of the proposed algorithm is verified using the NSL-KDD, KDD99 and ISCX2012 datasets. Finally, the conclusions are drawn in Section 5.
2. Related works

2.1 Transfer learning

Transfer learning is a machine learning method that uses the knowledge in existing historical data to solve learning problems in different but related fields. It relaxes the two basic assumptions of traditional machine learning, and transfers existing knowledge in order to solve learning problems in which the target domain contains only a small amount of labeled data or even no labeled data [18, 19, 20]. Transfer learning has been widely studied under different names, such as learning to learn, lifelong learning, knowledge transfer, inductive transfer, multi-task learning, knowledge consolidation, context-sensitive learning, knowledge-based inductive bias, meta learning and incremental/cumulative learning [12]. In transfer learning, the main research questions include what to transfer, how to transfer, when to transfer and how to avoid negative transfer.

The transfer learning is defined in [21] as follows.

Given a source domain $D_{S}=\{X_{S},P(X_{S})\}$ and a learning task $T_{S}=\{y_{S},f_{S}(\cdot)\}$ , a target domain $D_{T}=\{X_{T},P(X_{T})\}$ and a learning task $T_{T}=\{y_{T},f_{T}(\cdot)\}$ , transfer learning aims at using the relevant knowledge in $D_{S}$ and $T_{S}$ to help improve the target prediction function $f_{T}(\cdot)$ in the target domain under the condition of $D_{S}\neq D_{T}$ or $T_{S}\neq T_{T}$ .

According to whether the samples in the source domain are labeled, and whether the learning tasks are the same, transfer learning tasks can be divided into inductive transfer learning, transductive transfer learning and unsupervised transfer learning. According to the technology used, transfer learning methods can be divided into feature selection based, feature mapping based and instance-based transfer learning algorithms. These two taxonomies classify transfer learning from different points of view. In general, the two classification methods are not essentially different. In special cases, the transfer learning problem with multiple source domains can be extended according to this definition of single-source transfer learning [22]. In the practical application of transfer learning, if data that are unrelated to the target domain are forcibly transferred from the source domain, this will not help the learning in the target domain. On the contrary, the result will be worse than the learning effect without transfer, which is known as the negative transfer effect. Negative transfer has been widely studied since transfer learning was developed. In order to avoid negative transfer and better assist the learning tasks in the target domain, it is crucial to select the source domain samples that are highly similar to the samples of the target domain [23].

2.2 Extreme learning machine

The extreme learning machine (ELM) is a learning algorithm [24] used to train the single hidden layer feedforward neural network (SLFN). The input layer weights and hidden layer biases are selected randomly, and the output layer weights are then calculated analytically according to the Moore-Penrose (MP) generalized inverse matrix theory, by minimizing a loss function composed of the training error term and a regularization term on the norm of the output layer weights. As a nonlinear model, ELM has good generalization and nonlinear mapping abilities. In addition, it can be used to alleviate the curse of dimensionality. Compared with other machine learning algorithms, such as the BP neural network and SVM, it has the advantages of fast learning speed, less human intervention and high computational scalability [25].

Given a training data set $\{X_{i},t_{i}|X_{i}\in R^{n},t_{i}\in R^{m},i=1,2,\ldots,N\}$ with $X_{i}=[x_{i1},x_{i2},\ldots,x_{in}]^{T}$ and $t_{i}=[t_{i1},t_{i2},\ldots,t_{im}]^{T}$ , $w$ is the weight vector connecting the input layer to the hidden neurons, $b_{i}$ represents the bias of the $i$ -th hidden neuron, $\beta$ denotes the output weight of the hidden layer, $L$ is the number of hidden layer nodes of the extreme learning machine, and $t_{i}$ represents the label corresponding to the $i$ -th data example. The network structure of ELM is shown in Fig. 1.

Figure 1.

Framework of the extreme learning machine.

It can be seen from Fig. 1 that the output matrix $H({x})$ of the hidden layer is expressed as:

$\displaystyle H(x)=[h_{1}(x),\ldots,h_{L}(x)]$ (1)

where $h_{j}(x)$ is given by:

$\displaystyle h_{j}(x)=g(w_{j},b_{j},x)$ (2)

where $g(\cdot)$ represents the activation function [26], which is a nonlinear piecewise continuous function. The commonly used functions include the Sigmoid function and the Gaussian function. In this paper, the Sigmoid function is used. After passing through the hidden layer, the signal enters the output layer. According to Fig. 1 and Eq. (1), the output of the “generalized” single hidden layer feedforward neural network ELM is expressed as:

$\displaystyle f_{L}(x_{i})=\sum\limits_{j=1}^{L}\beta_{j}h_{j}(x_{i})=H(x_{i})\beta=t_{i},\quad i=1,2,\ldots,N$ (3)

where $\beta=[\beta_{1},\ldots,\beta_{L}]^{T}$ .

The unknown quantities in Eqs (2) and (3) are $w$ , $b$ and $\beta$ . Training consists in determining these weights and biases from the training data, so that the learned knowledge is contained in the connection weights and biases. Stacking Eq. (3) over all the training samples, it can be written as:

$\displaystyle H\beta=T,\quad T=[t_{1},\ldots,t_{N}]^{T}$ (4)

ELM aims at minimizing the training error and obtaining the parameter $\beta$ with the best learning effect on the training dataset. In the ideal state, Eq. (4) holds. However, in practice, the output $H\beta$ can only be as close as possible to the sample label $T$ . In this paper, the least squares method is used to solve for the weight $\beta$ connecting the hidden layer and the output layer. The objective function is expressed as:

$\displaystyle\min||H\beta-T||^{2},\beta\in R^{L\times m}$ (5)

According to [27], the Moore-Penrose generalized inverse matrix $H^{{\dagger}}=H^{T}(HH^{T})^{-1}$ , which is valid when $HH^{T}$ is nonsingular, is calculated by the orthogonal projection method. The optimal solution is given by:

$\displaystyle\beta^{\ast}=H^{{\dagger}}T$ (6)
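Equations (1) to (6) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in the experiments: the function names, the Gaussian initialization of $w$ and $b$, the toy data and the number of hidden nodes are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_train(X, T, num_hidden, rng):
    """Fit the ELM output weights with the pseudoinverse, as in Eq. (6)."""
    W = rng.normal(size=(X.shape[1], num_hidden))  # random input weights, never updated (assumed Gaussian)
    b = rng.normal(size=num_hidden)                # random hidden biases, never updated
    H = sigmoid(X @ W + b)                         # hidden-layer output matrix, N x L
    beta = np.linalg.pinv(H) @ T                   # least-squares solution of H beta = T
    return W, b, beta

def elm_predict(X, W, b, beta):
    return sigmoid(X @ W + b) @ beta

# Toy binary problem with labels in {-1, +1} (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
T = np.sign(X[:, :1] + 0.5 * X[:, 1:2])
W, b, beta = elm_train(X, T, num_hidden=40, rng=rng)
train_acc = np.mean(np.sign(elm_predict(X, W, b, beta)) == T)
```

Only $\beta$ is learned here; $w$ and $b$ stay at their random initial values, which is what makes the training a single linear least-squares solve.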

Although ELM has made some achievements, there is still room for improvement, and several researchers have proposed algorithms to optimize it [28, 29]. At present, the traditional extreme learning machine is limited by the requirements that the training and test data be independent and identically distributed and that sufficient training data be available. In practice, it is often desirable to learn an accurate model from only a small amount of new data together with a large amount of historical data. The emergence of transfer learning makes this further development of ELM possible.

2.3 Maximum mean discrepancy

Transfer learning requires transferring the knowledge in the source domain that is similar to the target domain. Otherwise, negative transfer will occur, which results in a worse learning effect. Therefore, it is necessary to choose a method to calculate the distribution difference between the domains, in order to transfer knowledge with high similarity between them. In this paper, the maximum mean discrepancy (MMD) [30, 31] is used to reduce the distribution difference between domains.

Given a labeled data set $D_{s}=\{(x_{1},y_{1}),\ldots,(x_{n},y_{n})\}$ in the source domain and an unlabeled data set $D_{t}=\{z_{1},\ldots,z_{m}\}$ in the target domain, the nonlinear mapping function into the reproducing kernel Hilbert space $H$ is denoted by $\phi$ . The square form of the MMD is given by:

$\displaystyle\textit{MMD}_{H}^{2}=\left\|\frac{1}{n}\sum\nolimits_{i=1}^{n}\phi(x_{i})-\frac{1}{m}\sum\nolimits_{i=1}^{m}\phi(z_{i})\right\|^{2}$ (7)

In Eq. (7), the smaller the MMD value, the closer the two domains. If the value is 0, the distribution is completely consistent. At present, the MMD measurement method has been widely used in the algorithm research of transfer learning [32, 33]. In fact, it can make the features learned from different domains more similar by constructing regularization terms.
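Because Eq. (7) only involves inner products of $\phi$, it can be evaluated with a kernel function without ever forming $\phi$ explicitly. The sketch below assumes a Gaussian (RBF) kernel with an arbitrary bandwidth `gamma` and synthetic data; the kernel choice, the function names and the data are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian kernel matrix with entries exp(-gamma * ||a - b||^2)."""
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def mmd2(Xs, Xt, gamma=0.5):
    """Squared MMD of Eq. (7), expanded into three kernel sums."""
    n, m = len(Xs), len(Xt)
    return (rbf_kernel(Xs, Xs, gamma).sum() / n**2
            + rbf_kernel(Xt, Xt, gamma).sum() / m**2
            - 2.0 * rbf_kernel(Xs, Xt, gamma).sum() / (n * m))

rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=(300, 4))
near_target = rng.normal(0.0, 1.0, size=(200, 4))  # same distribution as the source
far_target = rng.normal(2.0, 1.0, size=(200, 4))   # shifted distribution
mmd_near = mmd2(source, near_target)
mmd_far = mmd2(source, far_target)
```

In this toy setting `mmd_far` comes out larger than `mmd_near`, which matches the reading of Eq. (7) that a smaller value means closer domains.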

3. Transfer extreme learning machine algorithm

This paper addresses the application scenario in which the target domain contains only a small number of labeled intrusion detection data samples, by using a large number of labeled intrusion detection data samples in the source domain. The intrusion detection algorithm based on the transfer extreme learning machine is constructed using transfer learning, the maximum mean discrepancy and the extreme learning machine. The framework of the algorithm is presented in Fig. 2.

Figure 2.

Framework of TrELM.

It can be seen from Fig. 2 that TrELM first uses o-MMD to calculate the similarity weight between the source and target domain, and trains ELM in order to obtain the source domain model knowledge. The objective function of TrELM is then constructed using the samples of target domain and the model knowledge of the source domain. Finally, TrELM is trained to obtain the intrusion detection classifier. In the algorithm, o-MMD is used to select the knowledge with great similarity between the source and target domain, which reduces the size of the samples, and uses the weight information to reduce the negative transfer. The constraints of objective function and similarity distance term limit the negative transfer again. Therefore, TrELM improves both the training efficiency and learning effect.

3.1 Problem definition

Consider a source domain $D_{S}$ with a large amount of labeled data, and a target domain $D_{T}$ with only a small amount of labeled data. For convenience and generality, this paper only considers the binary classification problem $Y=\{-1,1\}$ , while the multi-class problem can be handled as an extension of the binary classification problem.

Definition 1. Testing dataset

$\displaystyle TE=\{(x_{i}^{t},y_{i}^{t})\},x_{i}^{t}\in D_{T},y_{i}^{t}\in Y,i=1,2,3,\ldots,k$

Definition 2. Training dataset

$\displaystyle TR_{a}=\{(x_{i}^{s},y_{i}^{s})\},x_{i}^{s}\in D_{S},y_{i}^{s}\in Y,i=1,2,3,\ldots,n$

$\displaystyle TR_{b}=\{(x_{i}^{t},y_{i}^{t})\},x_{i}^{t}\in D_{T},y_{i}^{t}\in Y,i=1,2,3,\ldots,m$

where $TR_{a}$ is the dataset in source domain $D_{S}$ , $TR_{b}$ is the dataset in target domain $D_{T}$ , $n$ and $m$ are respectively the numbers of samples in $TR_{a}$ and $TR_{b}$ , and $TR=TR_{a}\cup TR_{b}$ .

Therefore, the training dataset is expressed as:

$\displaystyle TR=\{TR_{a},TR_{b}\}=\{x_{i},i=1,\ldots,n+m\}=\left\{\begin{array}{l}x_{i}^{a},\quad i=1,2,\ldots,n\\ x_{i}^{b},\quad i=n+1,n+2,\ldots,n+m\end{array}\right.$

Definition 3. Problem solving

Given a target domain dataset $TR_{b}$ with a small number of labeled samples, a source domain dataset $TR_{a}$ with a large number of labeled samples, and a test dataset $TE$ , the learning aims at training a classifier which minimizes the classification error on the target domain and improves the prediction accuracy.

3.2 Implementation of TrELM

It can be observed from Fig. 2 that the extreme learning machine, MMD and transfer learning are fused. In addition, the construction process is divided into: calculation process of source domain weight and model training, TrELM objective function construction and related theorem proof, and intrusion detection classifier training.

(1) Calculating the similarity weights of the samples in the source domain

MMD is transformed to obtain its weighted form o-MMD, from which the similarity weight vector $o$ of the samples in the source domain can be calculated.

According to Eq. (7), o-MMD is obtained on the target and source domains:

$\displaystyle\textit{o-MMD}=\left\|\frac{1}{m}\sum\limits_{x_{i}\in D_{T}}\phi(x_{i})-\frac{1}{n}\sum\limits_{x_{j}\in D_{S}}o_{j}\phi(x_{j})\right\|^{2}=\frac{1}{m^{2}}\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{m}\phi(x_{i})^{T}\phi(x_{j})+\frac{1}{n^{2}}\sum\limits_{i=1}^{n}\sum\limits_{j=1}^{n}o_{i}o_{j}\phi(x_{i})^{T}\phi(x_{j})-\frac{2}{mn}\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n}o_{j}\phi(x_{i})^{T}\phi(x_{j})$ (8)

In Eq. (8), the first term is a constant which can be ignored, and only the last two terms should be minimized. Therefore:

$\displaystyle f(o)=\frac{1}{n^{2}}\sum\limits_{i=1}^{n}\sum\limits_{j=1}^{n}o_{i}o_{j}\phi(x_{i})^{T}\phi(x_{j})-\frac{2}{mn}\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n}o_{j}\phi(x_{i})^{T}\phi(x_{j})$ (9)

By simplifying Eq. (9), the objective function can be obtained:

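$\displaystyle\mathop{\min}\limits_{o}\frac{1}{2}o^{T}Ko-k^{T}o$ (10)

$\displaystyle\textit{s.t. }0\leqslant o_{i}\leqslant 1,\quad i=1,2,\ldots,n$

where $K$ is the $n\times n$ kernel matrix over the source domain samples with entries $K_{ij}=K(x_{i},x_{j})=\phi(x_{i})^{T}\phi(x_{j})$ , and $k$ is the $n$ -dimensional vector with entries $k_{j}=\frac{n}{m}\sum_{i=1}^{m}K(x_{i},x_{j})$ for $x_{i}\in D_{T}$ and $x_{j}\in D_{S}$ . This form follows from multiplying Eq. (9) by $n^{2}/2$ .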

Equation (10) is a standard quadratic programming problem. Therefore, it can be solved using a quadratic programming solver in order to obtain the weight $o_{i}$ of each source domain sample. The source domain samples with weight $0\leqslant o_{i}\leqslant 0.5$ are selected to form a sample set $TR^{\prime}_{a}$ , whose number of samples satisfies $n^{\prime}\leqslant n$ . In transfer learning, the knowledge in the source domain can be transferred through $o_{i}$ , which efficiently avoids the occurrence of negative transfer.
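The box-constrained quadratic program for the weights $o$ can be sketched without a dedicated solver. The example below swaps in projected gradient descent for the quadratic programming solver mentioned above, and assumes a Gaussian kernel, a linear term $k_{j}=\frac{n}{m}\sum_{i}K(x_{i},x_{j})$ obtained by scaling the cross term of Eq. (9), and synthetic two-cluster data. All function names and parameter values are hypothetical, and the selection of $TR^{\prime}_{a}$ from the weights is left out.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian kernel matrix with entries exp(-gamma * ||a - b||^2)."""
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def build_qp(Xs, Xt, gamma=0.5):
    """Return K (n x n) and k (n,) for source samples Xs and target samples Xt."""
    n, m = len(Xs), len(Xt)
    K = rbf_kernel(Xs, Xs, gamma)
    k = (n / m) * rbf_kernel(Xs, Xt, gamma).sum(axis=1)
    return K, k

def qp_objective(o, K, k):
    return 0.5 * o @ K @ o - k @ o

def source_weights(K, k, iters=500):
    """Projected gradient descent on min 0.5 o^T K o - k^T o with 0 <= o_i <= 1."""
    step = 1.0 / np.linalg.eigvalsh(K)[-1]   # 1 / largest eigenvalue of K
    o = np.full(len(k), 0.5)
    for _ in range(iters):
        # Gradient step followed by projection onto the box [0, 1].
        o = np.clip(o - step * (K @ o - k), 0.0, 1.0)
    return o

rng = np.random.default_rng(0)
Xt = rng.normal(0.0, 1.0, size=(50, 2))                  # target domain samples
Xs = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),     # source samples close to the target
                rng.normal(4.0, 1.0, size=(100, 2))])    # source samples far from the target
K, k = build_qp(Xs, Xt)
o = source_weights(K, k)
```

With this data, the first 100 source samples (drawn from the same distribution as the target) receive larger weights on average than the last 100, so the weight vector separates source samples by their similarity to the target domain.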

(2) Construction and solution of the TrELM objective function

An ELM model is trained on the source domain dataset $TR^{\prime}_{a}$ , which gives the optimal solution $\beta_{s}=H^{{\dagger}}T$ . On the target domain dataset $TR_{b}=\{(x_{i}^{t},y_{i}^{t})\}$ , assuming that the parameters of ELM are $\beta_{t}$ , and combining the structural risk minimization theory of transfer learning with the ELM optimization algorithm [34], the objective function of TrELM is developed:

$\displaystyle\arg\mathop{\min}\limits_{\beta_{t},\varepsilon^{t}}\frac{1}{2}\|\beta_{t}\|^{2}+\frac{C_{1}}{2}\|\beta_{t}-\beta_{s}\|^{2}+C_{2}\sum\limits_{i=1}^{m}\varepsilon_{i}^{t}$ (11)

$\displaystyle\textit{s.t. }\varepsilon_{i}^{t}\geqslant 0,$

$\displaystyle y_{i}^{t}(\beta_{t}^{T}\cdot h(x_{i}^{t}))\geqslant 1-\varepsilon_{i}^{t},\quad i=1,2,\ldots,m,$

$\displaystyle y_{i}^{t}(\beta_{t}^{T}\cdot h(x_{i}^{t})-\tilde{\beta}_{t}^{T}\cdot h(x_{i}^{t}))\geqslant-\varepsilon_{i}^{t}.$

where the constants $C_{1}$ and $C_{2}$ respectively represent the penalty parameters of the domain adaptation contribution and the target classification error, $\|\beta_{t}-\beta_{s}\|^{2}$ denotes the difference term between the classifiers of the source and target domains, and $\tilde{\beta}_{t}$ denotes the output weights learned on the target domain dataset alone. The constraint $y_{i}^{t}(\beta_{t}^{T}\cdot h(x_{i}^{t}))\geqslant 1-\varepsilon_{i}^{t}$ requires the target domain classifier to classify the target samples correctly. The constraint on $y_{i}^{t}(\beta_{t}^{T}\cdot h(x_{i}^{t})-\tilde{\beta}_{t}^{T}\cdot h(x_{i}^{t}))$ ensures that the effect of transfer is not worse than the classification effect learned only on the dataset of the target domain, which limits the possibility of negative transfer.

By solving Eq. (11), a classifier is obtained in the target domain, which can correctly classify the samples in the target domain.

Step 1. The Lagrangian of the optimization problem in Eq. (11) is given by:

$\displaystyle L(\beta_{t},\varepsilon^{t},\alpha,\mu,\gamma)=\frac{1}{2}\beta_{t}^{T}\beta_{t}+\frac{C_{1}}{2}\|\beta_{t}-\beta_{s}\|^{2}+C_{2}\sum\limits_{i=1}^{m}\varepsilon_{i}^{t}-\sum\limits_{i=1}^{m}\alpha_{i}\left(y_{i}^{t}\beta_{t}^{T}\cdot h(x_{i}^{t})-1+\varepsilon_{i}^{t}\right)-\sum\limits_{i=1}^{m}\mu_{i}\left(y_{i}^{t}(\beta_{t}^{T}\cdot h(x_{i}^{t})-\tilde{\beta}_{t}^{T}\cdot h(x_{i}^{t}))+\varepsilon_{i}^{t}\right)-\sum\limits_{i=1}^{m}\gamma_{i}\varepsilon_{i}^{t}$ (12)

where ${\alpha}=(\alpha_{1},\alpha_{2},\ldots,\alpha_{m})$ , ${\mu}=(\mu_{1},\mu_{2},\ldots,\mu_{m})$ and ${\gamma}=(\gamma_{1},\gamma_{2},\ldots,\gamma_{m})$ are the Lagrange multipliers with $\alpha_{i}\geqslant 0$ , $\gamma_{i}\geqslant 0$ and $\mu_{i}\geqslant 0$ , and $m$ is the number of samples in the target domain.

Step 2. According to the KKT conditions, the derivatives of the Lagrange function in Eq. (12) with respect to ${\beta}_{t}$ and $\varepsilon_{i}^{t}$ are determined and set to zero, in order to obtain the following formulas:

$\displaystyle\frac{\partial L}{\partial\beta_{t}}=\beta_{t}+C_{1}(\beta_{t}-\beta_{s})-\sum\limits_{i=1}^{m}\alpha_{i}y_{i}^{t}h(x_{i}^{t})-\sum\limits_{i=1}^{m}\mu_{i}y_{i}^{t}h(x_{i}^{t})=0$ (13)

$\displaystyle\frac{\partial L}{\partial\varepsilon_{i}^{t}}=C_{2}-\alpha_{i}-\gamma_{i}-\mu_{i}=0$ (14)

By simplifying Eqs (13) and (14), the following is obtained:

$\displaystyle\beta_{t}=\frac{C_{1}\beta_{s}}{1+C_{1}}+\frac{1}{1+C_{1}}\sum\limits_{i=1}^{m}(\alpha_{i}+\mu_{i})y_{i}^{t}h(x_{i}^{t})$ (15)

$\displaystyle\alpha_{i}+\gamma_{i}+\mu_{i}=C_{2}$ (16)

Step 3. By substituting Eqs (15) and (16) into Eq. (12), solving TrELM becomes equivalent to solving the dual problem expressed as:

$\displaystyle\mathop{\min}\limits_{\beta}\frac{1}{1+C_{1}}\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{m}(\alpha_{i}+\mu_{i})(\alpha_{j}+\mu_{j})y_{i}^{t}y_{j}^{t}(h(x_{i}^{t})\cdot h(x_{j}^{t}))+\sum\limits_{i=1}^{m}\left(\frac{C_{1}y_{i}^{t}(\beta_{s}\cdot h(x_{i}^{t}))}{1+C_{1}}-1\right)\beta_{i}-\frac{1}{1+C_{1}}\|\beta_{s}\|^{2}$ (17)

$\displaystyle\textit{s.t. }0\leqslant\alpha_{i}\leqslant C_{2},\quad i=1,2,\ldots,m,$

$\displaystyle 0\leqslant\mu_{i}\leqslant C_{2},\quad i=1,2,\ldots,m.$

Equation (17) is solved as in [34], and the result is substituted into Eq. (15) in order to obtain the parameter value of the target domain and the decision function of TrELM.
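The role of Eq. (15) can be illustrated numerically: once the multipliers are known, $\beta_{t}$ is a blend of the source weights $\beta_{s}$ and a correction driven by the target samples. The sketch below does not solve the dual problem; the multipliers are random placeholders, and the function name, the dimensions and the values of $C_{1}$ and $C_{2}$ are assumptions made for the example.

```python
import numpy as np

def target_output_weights(beta_s, H_t, y_t, alpha, mu, C1):
    """Eq. (15): combine the source weights with the target-driven correction."""
    correction = H_t.T @ ((alpha + mu) * y_t)   # sum_i (alpha_i + mu_i) y_i h(x_i)
    return (C1 * beta_s + correction) / (1.0 + C1)

rng = np.random.default_rng(0)
num_hidden, m, C1, C2 = 6, 4, 2.0, 1.0
beta_s = rng.normal(size=num_hidden)            # source-domain ELM output weights
H_t = rng.uniform(size=(m, num_hidden))         # rows are h(x_i) for the target samples
y_t = np.array([1.0, -1.0, 1.0, -1.0])          # target labels in {-1, +1}
alpha = rng.uniform(0.0, C2 / 2, size=m)        # placeholder multipliers, not a solved dual
mu = rng.uniform(0.0, C2 / 2, size=m)
beta_t = target_output_weights(beta_s, H_t, y_t, alpha, mu, C1)

# Stationarity residual of Eq. (13); it vanishes when beta_t satisfies Eq. (15).
residual = beta_t + C1 * (beta_t - beta_s) - H_t.T @ ((alpha + mu) * y_t)
```

When all the multipliers are zero, the result reduces to $\frac{C_{1}}{1+C_{1}}\beta_{s}$ , so a larger $C_{1}$ keeps the target classifier closer to the source classifier.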

(3) The intrusion detection classifier is obtained from the TrELM decision function derived in (2).

3.3 Training TrELM

The TrELM training process is presented in Algorithm 1.

Algorithm 1: Training process of the TrELM algorithm
Input: a set $TR_{a}$ of $n$ labeled samples in the source domain, a set $TR_{b}$ of $m$ labeled samples in the target domain, testing dataset $TE$ , training dataset $TR=TR_{a}\cup TR_{b}$ , class label $Y=\{-1,1\}$ , a basic classifier ELM.
Initialize:
Step1. Call o-MMD to calculate the similarity weights of the source domain dataset, and select a labeled sample set $TR^{\prime}_{a}$ of $n^{\prime}$ samples that are similar to the target domain;
Step2. Train the parameters of ELM on $TR^{\prime}_{a}$ to obtain ${\beta}_{s}$ ;
Step3. Choose appropriate parameter values $C_{1}$ and $C_{2}$ , and solve Eq. (17) to obtain the optimal values ${\alpha}^{\ast}=[\alpha_{1},\ldots,\alpha_{m}]^{T}$ and ${\mu}^{\ast}=[\mu_{1},\ldots,\mu_{m}]^{T}$ ;
Step4. Substitute ${\alpha}^{\ast}$ and ${\mu}^{\ast}$ into Eq. (15) to obtain the optimal parameter value ${\beta}_{t}^{\ast}$ in the target domain;
Step5. Train the decision function of the target domain and output the intrusion detection classifier: $\displaystyle h_{f}(x)=\sum\limits_{i=1}^{N}{\alpha_{t}h_{t}(x)}$

In summary, it can be seen that the TrELM algorithm first uses o-MMD to select the intrusion detection data samples from the source domain that are similar to the target domain, and initially reduces the impact of negative transfer. ELM is then trained on the dataset selected from the source domain, in order to obtain the model knowledge. The target transfer model TrELM is constructed with the target domain dataset and model knowledge of the source domain. On the one hand, it uses the existing source domain knowledge to speed up the training process. On the other hand, the objective function constraints and similar distance terms are used to further restrict the negative transfer. Finally, TrELM is trained on the target dataset in order to obtain the intrusion detection classifier.

Compared with the traditional ELM algorithm, TrELM can use a small amount of target intrusion detection data samples and a large number of source domain intrusion detection data samples, in order to construct a high-quality classification model. It fully uses the advantages of ELM and makes up for the shortcoming of inability to use the existing knowledge.

The pseudocode of the testing and evaluation stage is given by:

    import numpy as np

    class ELM_Core(object):
        def __init__(self, X, y, num_hidden):
            # Initialization parameters
            ...

        def sigmoid(self, x):
            return 1.0 / (1 + np.exp(-x))

        def train(self, x_train, y_train, classes):
            mul = np.dot(self.data_x, self.w)   # project the inputs onto the hidden layer
            add = mul + self.b                  # add the hidden-layer biases
            H = self.sigmoid(add)               # hidden-layer output matrix
            H_ = np.linalg.pinv(H)              # Moore-Penrose generalized inverse of H
            ...

        def predict(self, x_test):
            self.t_data = np.atleast_2d(x_test)
            self.num_tdata = len(self.t_data)
            self.pred_Y = np.zeros((x_test.shape[0]))
            ...

    TrELM = ELM_Core(x_train, y_train, L)
    TrELM.train(x_train, y_train, 3)
    TrELM.predict(x_test)
    TrELM.score(y_test)

4. Experimental results

4.1 Experimental setting and evaluation criteria

In order to verify the efficiency of the TrELM, the common datasets NSL-KDD, KDD99 and ISCX2012 are used. The benchmark algorithms used in the experiment are: SVM [38], ELM [24], the algorithm proposed in [6] and TrAdaBoost [39], in which SVM is implemented using the LIBSVM toolkit. The average of all the experiments repeated ten times is considered as the final comparison result.

The commonly used evaluation indicators for intrusion detection include the precision, detection rate, accuracy, false positive rate and miss rate. The accuracy reflects the proportion of correctly classified samples in the total number of samples (the larger the better). The precision denotes the proportion of true positive samples among all the samples classified as positive (the larger the better). The detection rate reflects the proportion of positive samples classified as positive among all the positive samples. The precision and detection rate are a pair of conflicting indicators: the higher the precision, the fewer the false positives, while the higher the detection rate, the fewer the false negatives, and raising one usually lowers the other. In intrusion detection, the false positive rate refers to the proportion of negative samples misclassified as positive among all the negative samples. A smaller value denotes a better performance, since a high false positive rate leads to a “cry wolf” effect. The miss rate is the proportion of positive samples misclassified as negative among all the positive samples.

The formal description of the precision rate, detection rate, accuracy rate, false positive rate and miss rate is given by:

Precision: $CR=\frac{TP}{TP+FP}\times 100\%$

Detection Rate: $DR=\frac{TP}{TP+FN}\times 100\%$

Accuracy: $\textit{ACC}=\frac{TP+TN}{TN+FP+FN+TP}\times 100\%$

False Positive Rate: $FR=\frac{FP}{TN+FP}\times 100\%$

Miss Rate: $MR=\frac{FN}{TP+FN}\times 100\%$

where $TP$ represents the number of positive samples that are correctly classified, $FP$ denotes the number of negative samples that are incorrectly classified as positive samples, $TN$ is the number of negative samples that are correctly classified, and $FN$ represents the number of positive samples that are incorrectly classified as negative samples.
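The five indicators can be computed directly from the confusion counts. The sketch below assumes that the label 1 marks the positive class and that every denominator is nonzero; the function names and the toy labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN and FN for a binary problem."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def detection_metrics(tp, fp, tn, fn):
    """Return the five indicators defined above, in percent."""
    return {
        "CR": 100.0 * tp / (tp + fp),                    # precision
        "DR": 100.0 * tp / (tp + fn),                    # detection rate
        "ACC": 100.0 * (tp + tn) / (tp + tn + fp + fn),  # accuracy
        "FR": 100.0 * fp / (tn + fp),                    # false positive rate
        "MR": 100.0 * fn / (tp + fn),                    # miss rate
    }

y_true = [1, 1, 1, -1, -1, -1, -1, 1]
y_pred = [1, 1, -1, -1, 1, -1, -1, 1]
counts = confusion_counts(y_true, y_pred)
metrics = detection_metrics(*counts)
```

Since $DR$ and $MR$ share the denominator $TP+FN$ , they always sum to 100%, so reporting one of them determines the other.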

In the experiments, the accuracy rate, false positive rate and miss rate are selected as the indicators used to evaluate the algorithms. The experiments are run on an Intel Core i3-4160 processor with 8 GB of memory, the Windows 10 operating system and Python 3.6.3.

4.2 Dataset

This section describes the NSL-KDD, KDD99 and ISCX2012 datasets and preprocesses them.

a. Dataset

ISCX2012 dataset

Several researchers have demonstrated that the attack types considered in the KDD99 intrusion detection dataset are out of date. In 2012, the Information Security Centre of Excellence (ISCX) of the University of New Brunswick (UNB) released an intrusion detection dataset, referred to as ISCX2012 [35]. This dataset comprises seven days of original network traffic data, including normal traffic and four intrusion types: BFSSH, Infiltration, HttpDoS and DDoS (cf. Table 1). In the experiment, 2% of the data are selected from the training dataset. Most of the labeled information is assigned to the source domain dataset and is therefore deleted, the remaining labeled data compose the target domain dataset, and the two datasets together constitute the training dataset. Similarly, 1% of the data are extracted from the test dataset and used as the test data.

Table 1
Distribution of attack types in the ISCX2012 dataset

Dataset	Training dataset		Testing dataset
	Count	Percentage	Count	Percentage
Normal	890,726	97.27%	593,811	97.27%
BFSSH	4,179	0.46%	2,785	0.46%
Infiltration	6,027	0.66%	4,017	0.66%
HttpDoS	2,090	0.23%	1,392	0.23%
DDoS	12,673	1.38%	8,448	1.38%
Total	915,695		610,453

KDD 99

KDD99 is a widely used competition dataset for intrusion detection provided by the Lincoln Laboratory of the Massachusetts Institute of Technology. It is the most influential and credible intrusion detection dataset in academia [36]. The dataset has $5\times 10^{6}$ records, each of which has 41 characteristic attributes and 1 class identifier. It contains a total of 38 attack types, of which 21 attack types appear in the training dataset, while the other 17 unknown attack types only exist in the test dataset. This design aims at testing the generalization ability of the classifier model. The ability to detect unknown attack types is also one of the crucial indicators used to evaluate classifiers in intrusion detection applications.

Most researchers use the 10% KDD99 dataset (including a training dataset and a test dataset), which is a 10% sample of the full KDD99 dataset. This dataset is also used in this paper. The 10% dataset contains one normal class and four major network attack types: DOS, Probing, U2R and R2L. In the two 10% datasets, the four types of cyber attacks contain different numbers of attack behaviors. Table 2 lists the 22 attack behaviors in the training dataset and the 39 attack behaviors in the test dataset, where the normal data are also counted as one type of behavior.

Table 2

KDD 99 dataset

Dataset	10% testing dataset	10% training dataset
Normal	Normal	Normal
DOS	Back, land, neptune, pod, smurf, teardrop, apache2, mailbomb, udpstorm, processtable	Back, land, neptune, pod, smurf, teardrop
Probing	Ipsweep, nmap, portsweep, satan, saint, mscan	Ipsweep, nmap, portsweep, satan
R2L	Ftp_write, guess_password, imap, multihop, phf, spy, warezmaster, warezclient, named, xsnoop, xlock, sendmail, worm, snmpgetattack, snmpguess	Ftp_write, guess_password, imap, multihop, phf, spy, warezmaster
U2R	Buffer_overflow, loadmodule, perl, rootkit, httptunnel, ps, sqlattack, xterm	Buffer_overflow, loadmodule, perl, rootkit

Table 3

Distribution of attack types in the KDD 99 dataset

Dataset	Training dataset		Testing dataset
	Number	Proportion (%)	Number	Proportion (%)
Normal	97278	19.69	60593	19.48
Probe	4107	0.83	4166	1.34
DOS	391458	79.24	229853	73.90
U2R	52	0.01	228	0.073
R2L	1126	0.22	16189	5.20

To test whether the intrusion detection algorithm can recognize new attack behaviors after learning from the training dataset, the test dataset contains more new attack behaviors than the training dataset (cf. Table 2). In Table 3, the proportion of normal records in the two 10% datasets is nearly the same, whereas the proportions of the four attack types differ significantly. Because U2R and R2L account for very small proportions, most of the current detection algorithms have difficulty detecting these two types of attacks.

NSL-KDD

NSL-KDD [37] is an optimized version of the KDD99 dataset, in which duplicate records are deleted. It includes records of different classification difficulty levels, and the class sizes are more balanced, so that it can be used as an efficient benchmark dataset to evaluate the detection ability of a model correctly. The NSL-KDD dataset includes 4 sub-datasets: KDDTrain+, KDDTrain+_20Percent, KDDTest+ and KDDTest-21. In this paper, KDDTrain+ is used for training and KDDTest+ is used for testing. The dataset contains 4 anomaly types, which are subdivided into 39 attack types, of which 17 unknown attack types appear only in the test set. Each record includes 41 characteristics and 1 category identifier. Among the 41 features, there are 9 basic TCP connection features, 13 TCP connection content features, 9 time-based network traffic statistics features and 10 host-based network traffic statistics features. The details of the NSL-KDD dataset are provided in Table 4.

Table 4

Distribution of attack types in the NSL-KDD dataset

Dataset	KDDTrain+		KDDTest+
	Number	Proportion (%)	Number	Proportion (%)
Normal	67345	53.46	9711	43.08
Probe	11655	9.25	2421	10.74
DOS	45926	36.46	7458	33.08
U2R	52	0.04	200	0.89
R2L	995	0.79	2754	12.22

b. Data preprocessing

The intrusion detection datasets contain non-numerical data, and the values of different features differ in scale. These data should therefore be converted into numerical data and brought to a unified scale. The data preprocessing operation includes three steps: character type digitization, data normalization and data cleaning.

1) Character type digitization

The ISCX2012, NSL-KDD and KDD99 datasets are processed in the same way. In each record, the symbolic features are converted into numerical data using 1-to-N encoding. Considering KDD99 as an example, the 3 protocol types, 70 network service types, 23 attack types (including the normal type) and 11 network connection states are converted from character types into numerical types. The converted forms of the 11 network connection states are shown in Table 5, and the other character types are handled similarly.

Table 5

Network connection type conversion

Character	Numerical
OTH	1
REJ	2
RSTO	3
RSTOS0	4
RSTR	5
S0	6
S1	7
S2	8
S3	9
SF	10
SH	11
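The 1-to-N encoding of Table 5 can be sketched in a few lines of Python. This is only an illustration: the helper name `build_encoder` is hypothetical, and assigning the integers in sorted order is an assumption that happens to reproduce the numbering of Table 5.

```python
def build_encoder(symbols):
    """Map each distinct symbol to an integer in 1..N (sorted order)."""
    distinct = sorted(set(symbols))
    return {symbol: index for index, symbol in enumerate(distinct, start=1)}


# The 11 connection states of KDD99 listed in Table 5.
states = ["SF", "S0", "REJ", "RSTR", "RSTO", "SH", "S1", "S2", "S3",
          "RSTOS0", "OTH"]
encoder = build_encoder(states)

# Encode the connection-state column of a few hypothetical records.
column = ["SF", "S0", "SF", "REJ"]
encoded = [encoder[value] for value in column]
print(encoded)
```

The same helper can be applied separately to each symbolic column (protocol type, service type and class label), so that every column receives its own 1..N mapping.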

2) Data normalization

After digitization, the TrELM algorithm uses MMD to select the data, and MMD relies on a distance calculation. The continuous feature attributes in the dataset are measured on different scales, so features with large numerical ranges would dominate the distance between samples and affect the accuracy of the calculation results. In order to avoid this situation, the scale differences between the features should be eliminated. Min-max normalization is applied to the discrete features, which maps the values into the range between 0 and 1, and Z-score standardization is applied to the continuous features, as shown in Eqs (18) and (19).

$\displaystyle x_{mn}=\frac{x-x_{\min}}{x_{\max}-x_{\min}}$ (18)

$\displaystyle x_{ze}=\frac{x-x_{av}}{\sigma}$ (19)

where $x$ is the original sample data, $x_{\max}$ represents the maximum value, $x_{\min}$ denotes the minimum value, $x_{av}$ is the average value, $\sigma$ represents the standard deviation, $x_{mn}$ and $x_{ze}$ are the normalized results of the original data.
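Eqs (18) and (19) can be sketched as follows for a single feature column. The function names are hypothetical, the population standard deviation is assumed for $\sigma$ (the text does not say whether the population or the sample form is used), and returning zeros for a constant column is a convention chosen here to avoid division by zero.

```python
import statistics


def min_max(values):
    """Eq. (18): rescale a feature column into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]


def z_score(values):
    """Eq. (19): centre a column on its mean and divide by its std."""
    mean = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population std (assumption)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(x - mean) / sigma for x in values]


column = [2, 4, 6]  # hypothetical feature values
scaled = min_max(column)
standardized = z_score(column)
print(scaled)
print(standardized)
```

Note that only Eq. (18) bounds the result between 0 and 1. Eq. (19) produces values with zero mean and unit standard deviation, which can be negative or greater than 1.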

3) Data cleaning

Data cleaning is an indispensable step in machine learning. The quality of its results directly affects the learning model and the final conclusion. In practical data preprocessing, data cleaning usually occupies 50%–80% of the time of the data analysis process. In the experiment, missing values are filled with the mean using the fillna() method provided by pandas. Noise is handled by the binning method, which smooths the data by examining the "nearest neighbors" (i.e. the surrounding values) of each value. The ordered data values are distributed into a number of bins. For example, the set of data {2, 3, 4, 15, 21, 24, 28, 34, 13}, sorted and divided into 3 bins, is given by:

{2, 3, 4} {13, 15, 21} {24, 28, 34}

After applying mean smoothing, with each bin mean truncated to an integer, it becomes:

{3, 3, 3} {16, 16, 16} {28, 28, 28}
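The binning example above can be reproduced with a short Python sketch. It assumes equal-depth bins and integer truncation of each bin mean, which is what the smoothed values shown above imply; the function name `smooth_by_bin_means` is hypothetical.

```python
def smooth_by_bin_means(values, n_bins):
    """Sort the values, split them into equal-depth bins and replace
    every value with the (integer-truncated) mean of its bin."""
    if len(values) % n_bins != 0:
        raise ValueError("number of values must be divisible by n_bins")
    ordered = sorted(values)
    size = len(ordered) // n_bins
    bins = [ordered[i * size:(i + 1) * size] for i in range(n_bins)]
    # Integer division truncates the bin mean (assumption, see lead-in).
    return [[sum(b) // len(b)] * len(b) for b in bins]


data = [2, 3, 4, 15, 21, 24, 28, 34, 13]
smoothed = smooth_by_bin_means(data, n_bins=3)
print(smoothed)
```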

A high-quality dataset is obtained by a process of data cleaning to replace, modify or delete the portions of the data that are incorrect, incomplete, irrelevant, inaccurate or questionable (“dirty”).

4.3 Experimental results and analysis

This section analyzes the experimental results of the SVM, ELM, algorithm proposed in [6], TrAdaBoost and the proposed TrELM algorithm using the NSL-KDD, KDD99 and ISCX2012 datasets, in order to verify and compare the efficiencies. In addition, the influence of the adjustable parameter $C_{1}$ on the results of TrELM is evaluated and analyzed.

Table 6
Average accuracy rate, false positive rate (%) and miss rate, for the NSL-KDD dataset

Algorithm	DoS			Normal			Probe			R2L			U2R
	ACC	FR	MR	ACC	FR	MR	ACC	FR	MR	ACC	FR	MR	ACC	FR	MR
TrELM	99.98	2.41	2.03	99.99	3.12	2.28	99.67	3.47	3.37	89.12	4.15	4.96	66.34	4.98	4.85
TrAdaBoost	98.12	2.32	2.25	98.86	3.23	2.43	98.10	3.63	3.42	73.25	9.76	9.13	47.65	9.33	10.12
[6]	97.86	2.28	3.11	98.57	4.15	3.15	97.84	4.05	4.26	69.56	10.45	9.47	35.21	9.74	10.53
ELM	97.23	2.46	3.02	97.46	4.17	3.37	97.12	4.15	4.37	68.17	10.24	9.97	32.34	10.24	11.02
SVM	97.18	2.97	3.21	97.43	4.24	3.47	97.24	4.42	4.46	64.42	11.13	9.86	31.26	10.75	11.21

Table 7

Average accuracy rate, false positive rate (%) and miss rate, for the KDD 99 dataset

Algorithm	DoS			Normal			Probe			R2L			U2R
	ACC	FR	MR	ACC	FR	MR	ACC	FR	MR	ACC	FR	MR	ACC	FR	MR
TrELM	99.45	2.56	2.19	99.99	3.23	2.37	99.34	3.76	3.42	88.33	4.23	4.11	65.25	4.29	4.89
TrAdaBoost	97.23	2.45	2.37	98.45	3.34	2.65	98.73	3.82	3.56	72.34	9.85	9.23	46.65	9.42	10.23
[6]	97.29	2.37	3.12	98.25	4.21	3.21	98.32	4.11	4.32	68.42	10.11	9.56	35.34	9.87	10.65
ELM	96.65	2.53	2.98	97.47	4.11	3.47	97.65	4.26	4.43	67.34	11.23	10.12	31.87	10.35	11.13
SVM	96.53	3.11	3.26	97.21	4.27	3.53	96.56	4.56	4.52	63.56	12.15	10.65	30.12	10.84	11.37

Table 8

Average accuracy rate, false positive rate (%) and miss rate, for the ISCX2012 dataset

Algorithm	Normal			Infiltration			HttpDoS			DDoS			BFSSH
	ACC	FR	MR	ACC	FR	MR	ACC	FR	MR	ACC	FR	MR	ACC	FR	MR
TrELM	99.78	2.32	1.18	97.45	2.54	1.87	91.25	3.67	2.45	97.67	5.45	3.87	96.88	2.27	1.15
TrAdaBoost	96.45	2.98	1.28	93.28	2.87	2.65	86.53	4.26	3.85	96.46	6.02	4.12	92.45	2.98	1.28
[6]	95.28	3.12	1.78	91.34	3.27	2.76	84.56	4.27	3.89	93.79	6.11	4.87	90.23	3.12	1.32
ELM	94.43	3.25	2.54	90.76	3.76	2.85	82.33	4.89	4.23	91.25	6.23	4.65	89.38	3.26	1.87
SVM	93.67	3.35	2.87	90.23	4.02	3.14	81.25	5.12	4.58	90.15	7.12	5.27	88.35	3.35	2.87

Tables 6–8 present the average accuracy, false positive rate and miss rate of the algorithms on the NSL-KDD, KDD99 and ISCX2012 datasets. The obtained conclusions can be summarized as follows:

A sufficient number of available intrusion detection training samples is the basis for training a high-accuracy classifier. In the intrusion detection datasets NSL-KDD and KDD99, the three classes Normal, Probe and DOS contain a large number of samples. All the algorithms have a high accuracy for these three classes, reaching over 96%. Similarly, on the ISCX2012 dataset, the accuracy of all the algorithms on the large classes Normal, Infiltration and DDoS exceeds 90%.

In the intrusion detection datasets NSL-KDD and KDD99, the attack types U2R and R2L have a small number of samples, which is not sufficient for the traditional intrusion detection algorithms to train a high-accuracy detection model. Therefore, these algorithms have a low accuracy on the two types. TrELM and TrAdaBoost are transfer learning algorithms that use knowledge from a large number of well-labeled intrusion detection samples in order to train the detection of U2R and R2L. Therefore, their accuracy on U2R and R2L is improved. It can be seen from Tables 6 and 7 that TrELM and TrAdaBoost have an improved accuracy on U2R and R2L, especially TrELM, whose accuracy is greater than 88% on R2L and greater than 65% on U2R. It can be observed from Table 8 that, on the ISCX2012 dataset, the TrELM and TrAdaBoost transfer learning algorithms are more accurate than ELM, [6] and SVM for the attack types with fewer samples, HttpDoS and BFSSH. Therefore, TrELM considerably improves the detection accuracy of attack types that contain a small number of samples, such as U2R and R2L.

Tables 6 and 7 show that, on the intrusion detection datasets NSL-KDD and KDD99, the false alarm rate of all the algorithms for the three classes Normal, Probe and DOS does not exceed 5%. TrELM has the lowest false alarm rate for Normal and Probe, and its false alarm rate is less than 4% for all three classes. For the attack types U2R and R2L, the three non-transfer benchmark algorithms perform poorly. Their false alarm rates for R2L and U2R on the NSL-KDD dataset are respectively greater than 10% and 9%, while on the KDD99 dataset they are greater than 9%. However, TrELM performs relatively well on these two attack types for both datasets, reaching false alarm rates below 5%. It can be seen from Table 8 that, on the ISCX2012 dataset, TrELM has the lowest false positive rate for all the attack types.

It can be seen from Tables 6–8 that TrELM has the lowest miss rate in the nine attack behaviors.

The experimental results show that using the KDD 99 and NSL-KDD datasets, TrELM has higher accuracy than that of the benchmark algorithms, for the five attack types. In addition, the accuracy of the attack type with a small number of samples has also been significantly improved. Using the ISCX2012 dataset, TrELM has also higher accuracy than that of the benchmark algorithms, for the five attack types.

In order to facilitate the discussion and prove that TrELM has a better effect, only the Wilcoxon test results of the algorithm accuracy on NSL-KDD, KDD99 and ISCX2012, are discussed.

Table 9

Average training accuracy (%) on NSL-KDD, KDD99 and ISCX2012

Algorithm	NSL-KDD	KDD99	ISCX2012
TrELM	91.02	90.47	96.01
TrAdaBoost	83.196	82.68	93.03
[6]	79.808	79.52	91.04
ELM	78.464	78.20	89.63
SVM	77.506	76.80	88.73

It can be seen from Table 9 that, using NSL-KDD, the classification accuracy of TrELM is 7.82% higher than that of TrAdaBoost, and the values of $W^{+}$ and $W^{-}$ are $+58$ and $-12$, respectively. For a two-sided test with $\alpha=0.05$, when $n=50$, by querying the distribution table of the Wilcoxon signed-rank test, $T^{0.025}=2.009$. In addition, $H_{0}$ is accepted since $W^{+}>T^{0.025}$. Moreover, the classification results of the two methods are not significantly different.

Using KDD99, the classification accuracy of TrELM is 7.79% higher than that of TrAdaBoost, and the values of $W^{+}$ and $W^{-}$ are $+49$ and $-9$, respectively. For a two-sided test with $\alpha=0.05$, when $n=50$, by querying the distribution table of the Wilcoxon signed-rank test, $T^{0.025}=2.009$. Furthermore, $H_{0}$ is accepted since $W^{+}>T^{0.025}$. The classification results of the two methods are also not significantly different.

Using ISCX2012, the classification accuracy of TrELM is 2.98% higher than that of TrAdaBoost, and the values of $W^{+}$ and $W^{-}$ are $+25$ and $-36$, respectively. For a two-sided test with $\alpha=0.05$, when $n=50$, by querying the distribution table of the Wilcoxon signed-rank test, $T^{0.025}=2.009$. $H_{0}$ is accepted since $W^{+}>T^{0.025}$. Similarly, the classification results of the two methods are not significantly different.
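As a sanity check on the NSL-KDD accuracy gap quoted above, the following sketch assumes that each NSL-KDD entry in Table 9 is the unweighted mean of the five per-class ACC values in Table 6. This is an inference from the reported numbers rather than something stated in the text, and the variable names are illustrative.

```python
from statistics import mean

# Per-class ACC values (DoS, Normal, Probe, R2L, U2R) copied from Table 6.
table6_acc = {
    "TrELM": [99.98, 99.99, 99.67, 89.12, 66.34],
    "TrAdaBoost": [98.12, 98.86, 98.10, 73.25, 47.65],
}

# Unweighted mean over the five classes (assumption, see lead-in).
averages = {name: mean(values) for name, values in table6_acc.items()}
gap = averages["TrELM"] - averages["TrAdaBoost"]

for name, value in averages.items():
    print(f"{name}: {value:.3f}")
print(f"gap: {gap:.2f}")
```

Under this assumption the two means agree with the NSL-KDD column of Table 9, and their difference agrees with the 7.82% gap reported for NSL-KDD.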

Table 10

Average training time (s) on NSL-KDD, KDD99 and ISCX2012

Algorithm	NSL-KDD	KDD99	ISCX2012
TrELM	5.56	12.76	6.12
TrAdaBoost	11.24	47.56	12.45
[6]	10.25	43.15	11.34
ELM	4.23	12.45	5.87
SVM	2.98	11.45	3.09

Table 10 shows the average training time of the algorithms on the three intrusion detection datasets. Unlike the TrAdaBoost and TrELM transfer learning algorithms, the ELM and SVM non-transfer learning algorithms do not require auxiliary source domain datasets. Therefore, they require less training time. TrELM selectively uses a large amount of data in the source domain that is similar to the target domain, and its training time increases only within an acceptable range.

TrELM improves the detection accuracy for all 9 types of attack behaviors, with a particularly significant effect on R2L attacks, which have sparse samples. No attack behavior is left with an excessively low accuracy, and the gap in accuracy between attack types narrows, which efficiently alleviates the imbalance in the detection of attack types that affects machine learning algorithms. TrELM has significant advantages in both the false positive rate and the false negative rate. The experimental results on the three datasets also show that the TrELM algorithm has a better generalization performance. In addition, its training time is clearly shorter than that of TrAdaBoost and of the algorithm in [6] on all three datasets (cf. Table 10), although it remains higher than that of ELM and SVM.

It can be seen from Eq. (3.2) that the objective function of TrELM includes the parameter $C_{1}$, which affects the learning effect. Therefore, it is necessary to analyze its sensitivity. Using the NSL-KDD, KDD99 and ISCX2012 datasets, different values of $C_{1}$ between 0 and 1 are set, and the average accuracy of the TrELM algorithm is recorded for each value (cf. Figs 3–5). It can be observed from the obtained results that the parameter $C_{1}$ highly affects the learning effect of TrELM.

Figure 3.

Sensitivity analysis of parameter $C_{1}$ using NSL-KDD.

Figure 4.

Sensitivity analysis of parameter $C_{1}$ using KDD99.

Figure 5.

Sensitivity analysis of parameter $C_{1}$ using ISCX2012.

By analyzing the results in Figs 3–5, the following conclusions can be drawn:

Using the parameter grid search method presented in [19], the value of $C_{1}$ is determined, and the experimental results on the real dataset are simultaneously recorded. For different values of $C_{1}$ within a certain range, the classification effect of TrELM is significantly different. It can also be seen that the closer the domain relationship, the greater the value and the higher the accuracy. Therefore, the algorithm is sensitive to the regularization parameter $C_{1}$ within a certain value range. Finally, the parameter value leading to the best classification effect can be obtained on different cross-domain tasks.
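The selection of $C_{1}$ by grid search can be illustrated with a generic sketch. It is not the procedure of [19]: the function `toy_accuracy` is a hypothetical stand-in for training TrELM with a given $C_{1}$ and measuring its average accuracy, and the candidate grid of nine values between 0 and 1 is likewise an assumption.

```python
def grid_search(evaluate, candidates):
    """Return the candidate with the highest score and that score."""
    best_value, best_score = None, float("-inf")
    for value in candidates:
        score = evaluate(value)
        if score > best_score:
            best_value, best_score = value, score
    return best_value, best_score


def toy_accuracy(c1):
    # Hypothetical stand-in for "train with this C1 and return the
    # average accuracy"; it simply peaks at c1 = 0.6.
    return 1.0 - (c1 - 0.6) ** 2


candidates = [i / 10 for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9
best_c1, best_acc = grid_search(toy_accuracy, candidates)
print(best_c1, best_acc)
```

In practice, `evaluate` would be run separately on each cross-domain task, which is how a different best value of $C_{1}$ can be obtained for each dataset.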

5. Conclusions

This paper proposes an intrusion detection algorithm (TrELM) based on transfer extreme learning machine. TrELM first uses o-MMD to select intrusion detection data samples having high similarity with the target domain from the source domain. This initially reduces the impact of negative transfer. The model knowledge trained on the selected source domain and target domain dataset, is then used to construct the target model TrELM. The existing source domain knowledge is used to speed up the training process. In addition, the objective function constraints and similar distance terms are used to further limit the negative transfer. Finally, TrELM is trained on the target dataset in order to obtain the intrusion detection classifier. The experimental results show that TrELM efficiently solves the problems of low detection accuracy, slow model establishment speed and high cost of the traditional intrusion detection algorithms for small samples and emerging attacks. It also improves the classification accuracy and efficiency. Although the proposed TrELM algorithm achieves satisfactory results, it still has some limitations that should be tackled in future work. In fact, it considers the difference of conditional probability between the test and training data. Furthermore, the protection of the data privacy of network users should be tackled.

Acknowledgments

The authors would like to express their gratitude to EditSprings (https://www.editsprings.cn) for the expert linguistic services provided.

This work was partially supported by major special fund projects in Heilongjiang Province, China [Grant No. 2020-230100-54-01-000358].

References

1. J.M., W.F. and Xue, An intrusion detection method based on active transfer learning, Intelligent Data Analysis 24(2) (2020), 363–383.
2. Sharma and Gupta R.K., Intrusion detection system: A review, International Journal of Software Engineering and its Applications 9(5) (2015), 69–76.
3. Wang, Zhang et al., A novel intrusion detection system based on an optimal hybrid kernel extreme learning machine, Knowledge-Based Systems 195 (2020), 105648.
4. Wang and Wang, An effective intrusion detection framework based on SVM with feature augmentation, Knowledge-Based Systems, 2017.
5. and Lu, An effective intrusion detection approach using SVM with naïve Bayes feature embedding, Computers & Security 103(3) (2020), 102158.
6. Al-Turaiki and Altwaijry, A convolutional neural network for improved anomaly-based network intrusion detection, Big Data 9(3) (2021), 233–252.
7. Ünal and Cavusoglu, A new hybrid approach for intrusion detection using machine learning methods, Applied Intelligence, 2019.
8. Aburomman A.A. and Reaz M.B.I., A novel SVM-kNN-PSO ensemble method for intrusion detection system, Applied Soft Computing 38(C) (2016), 360–372.
9. Sohn, Deep belief network based intrusion detection techniques: A survey, Expert Systems with Applications, 2020, 167.
10. Sultana, Chilamkurti, Wei et al., Survey on SDN based network intrusion detection system using machine learning approaches, Peer-to-Peer Networking and Applications (1-2) (2018), 1–9.
11. Zhuang F.Z., Luo et al., Survey on transfer learning research, Journal of Software 26(1) (2015), 26–39 (in Chinese).
12. Pan S.J. and Yang, A survey on transfer learning, IEEE Transactions on Knowledge and Data Engineering 22(10) (2010), 1345–1359.
13. Zhuang et al., A comprehensive survey on transfer learning, Proceedings of the IEEE 109(1) (2021), 43–76.
14. Shaha and Pawar, Transfer Learning for Image Classification, in: 2018 Second International Conference on Electronics, Communication and Aerospace Technology (ICECA), Coimbatore IEEE, 2018, pp. 656–660.
15. Siyu, Stephen M.A., Ruqiang et al., Highly-Accurate Machine Fault Diagnosis Using Deep Transfer Learning, IEEE Transactions on Industrial Informatics, 2018, 1–1.
16. Wang and Ke, Feature subspace transfer for collaborative filtering, Neurocomputing 136(20) (2014), 1–6.
17. Shao, Zhu and Li, Transfer learning for visual categorization: A survey, IEEE Transactions on Neural Networks and Learning Systems 26(5) (2017), 1019–1034.
18. Day and Khoshgoftaar T.M., A survey on heterogeneous transfer learning, Journal of Big Data 4(29) (2017), 1–42.
19. J.M., W.F. and Xue, Research on transfer learning algorithm based on support vector machine, Journal of Intelligent & Fuzzy Systems 38(4) (2020), 4091–4106.
20. Pan and Qiang, Transfer learning in heterogeneous collaborative filtering domains, Artificial Intelligence 197 (2013), 39–55.
21. J.M., W.F. and Xue, Transfer naive bayes algorithm with group probabilities, Applied Intelligence 50(1) (2020), 61–73.
22. J.M. et al., Multi-source Deep Transfer Neural Networks algorithm, Sensors 19(18) (2019).
23. Lei, Zuo and Zhang, LSDT: Latent sparse domain transfer learning for visual adaptation, IEEE Transactions on Image Processing 25(3) (2016), 1177–1191.
24. Huang G.B., Chen and Siew C.K., Universal approximation using incremental constructive feedforward networks with random hidden nodes, IEEE Trans Neural Netw 17(4) (2006), 879–892.
25. Mao and Wei, Extreme learning machine based transfer learning for data classification, Neurocomputing 174 (2015), 203–210.
26. Huang G.B., Zhu Q.Y. and Siew C.K., Extreme learning machine: Theory and applications, Neurocomputing 70(1/3) (2006), 489–501.
27. Huang G.B., Zhou, Ding et al., Extreme learning machine for regression and multiclass classification, IEEE Trans. Syst. Man Cybern. Part B: Cybern 42 (2012), 513–529.
28. Wang, Cao and Yuan, A study on effectiveness of extreme learning machine, Neurocomputing 74(16) (2011), 2483–2490.
29. Zong, Huang G.B. and Chen, Weighted extreme learning machine for imbalance learning, Neurocomputing 101 (2013), 229–242.
30. Pan S.J., Tsang, Kwok et al., Domain adaptation via transfer component analysis, IEEE Transactions on Neural Networks 22(2) (2011), 199–210.
31. Sanodiya R.K. and Yao, Unsupervised transfer learning via relative distance comparisons, IEEE Access pp(99) (2020), 1–1.
32. Zheng, Ming et al., Incomplete multisource transfer learning, IEEE Transactions on Neural Networks and Learning Systems 29(2) (2016), 310–323.
33. Zhang, Lan et al., Maximum mean and covariance discrepancy for unsupervised domain adaptation, Neural Processing Letters 51(1) (2020), 347–366.
34. Huang G.B., Ding and Zhou, Optimization method based extreme learning machine for classification, Neurocomputing 74(1–3) (2010), 155–163.
35. Wang, Sheng, Wang et al., HAST-IDS: Learning hierarchical spatial-temporal features using deep neural networks to improve intrusion detection, IEEE Access 6(99) (2018), 1792–1806.
36. Tavallaee, Bagheri et al., A detailed analysis of the KDD CUP 99 data set, in: IEEE International Conference on Computational Intelligence for Security & Defense Applications, IEEE, 2009.
37. Dhanabal and Shantharajah S.P., A study on NSL-KDD dataset for intrusion detection system based on classification algorithms, International Journal of Advanced Research in Computer and Communication Engineering 4(6) (2015), 446–452.
38. Chang C.C. and Lin C.J., A Library for Support Vector Machines, 2011.
39. Dai, Yang, Xue G.R. et al., Boosting for transfer learning, in: International Conference on Machine Learning, ACM, 2007, 193–200.
