An intrusion detection method based on active transfer learning

Abstract

Intrusion detection plays an important role in network security. To improve the detection rate, intrusion detection algorithms based on traditional machine learning are widely used in this field. These methods generally rest on two assumptions: the training and testing data are independent and identically distributed, and the training samples are sufficient. In practice, these assumptions are difficult to satisfy, which results in poor intrusion detection performance. This paper proposes ACTrAdaBoost, an intrusion detection algorithm based on active transfer learning. ACTrAdaBoost takes advantage of transfer learning and does not need to satisfy the two assumptions of traditional machine learning. In addition, ACTrAdaBoost uses active learning and the maximum mean discrepancy to obtain maximum knowledge at minimum training sample cost and to mitigate negative transfer. ACTrAdaBoost is compared with traditional machine learning methods on the KDDCUP99, DARPA1998 and ISCX2012 datasets. The experimental results show that the detection rate of ACTrAdaBoost is higher than that of the benchmark algorithms, and that training time efficiency improves as well. ACTrAdaBoost thus improves the accuracy of intrusion detection and provides a new research method for intrusion detection.

Keywords

Transfer learning, active learning, machine learning, intrusion detection, network security

1. Introduction

With the rapid development of networks, they play an increasingly important role in national life and people’s daily activities, and the importance of network security technology has become prominent. Network security now faces more and more challenges, such as viruses, system vulnerabilities and hacker attacks, so identifying the various kinds of network attacks is an important means of protecting it. Intrusion detection is one of the core technologies in network security. It can detect, in time, malicious attacks that are happening or have happened [1, 2, 3]. An intrusion detection system, with intrusion detection technology at its core, is an active network security defense system. It not only remedies the deficiencies of firewalls, but also effectively detects attacks and proposes corresponding defense measures. However, traditional intrusion detection systems have many problems, such as high false positive and false negative rates. They can only detect known attacks, and it is increasingly difficult for them to detect new attacks and massive attacks.

In recent years, with the rise of machine learning, intrusion detection methods based on machine learning algorithms have made it possible to detect network attacks intelligently. Compared with traditional intrusion detection methods, they improve the efficiency of intrusion detection on the one hand and reduce the false positive and false negative rates on the other [2]. The emergence of machine learning has therefore pointed out a new direction for the development of intrusion detection technology [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]. Commonly used machine learning algorithms have already been applied to intrusion detection, such as decision trees, neural networks, support vector machines, Bayesian classification and K-means clustering. Yin et al. [1] proposed RNN-IDS, a deep learning model for intrusion detection based on recurrent neural networks. The experimental results show that the RNN-IDS model improves the accuracy of intrusion detection and provides a new research method for intrusion detection. Quamar et al. [5] designed an effective and flexible intrusion detection system with STL, a self-learning technology based on deep learning; the experimental results show that the system outperforms previous systems on the NSL-KDD dataset. Yan et al. [6] built on the transductive support vector machine, introducing smoothing functions to construct a smooth unconstrained optimization problem and thus an optimization model with degenerate solutions, and applied this model to derive a new network intrusion detection method. Tiwari et al. [7] proposed a new network intrusion detection method that combines neural networks, rough sets and a firefly-based intrusion detection scheme. Pichara et al. [8] proposed a new semi-supervised algorithm, which detects relevant anomalies through active learning and expert interaction to obtain semantic information about user preferences. Makanju et al. [9] used genetic programming to apply classifiers and artificial neural networks to denial-of-service attacks. In addition, [10, 11, 12, 13, 14, 15] put forward many intrusion detection models based on machine learning algorithms. These algorithms have greatly promoted the wide application of machine learning in intrusion detection.

Although traditional machine learning is widely used in intrusion detection, most methods treat several different kinds of attacks as a single attack without specific distinction and adopt a single detection algorithm to detect them. This may make the detection success rates of the individual attacks unbalanced. For example, a classifier trained by one machine learning algorithm may have a high detection rate for one type of attack while another type is difficult to detect, especially a type with few samples; in general, the classifier will miss such samples. In addition, traditional machine learning algorithms usually need to satisfy two assumptions: (1) the training and testing samples are independent and identically distributed; (2) a large number of training samples is available to learn a good model. In practical applications, however, the distributions of testing and training data are rarely consistent, and some sample resources are very scarce. For example, data classification in biology often requires a large number of long-term and expensive experiments to label training samples. In text classification, the existing training samples are often far from enough to establish a reliable classification model, and experts must label large numbers of documents, which makes labeled training samples expensive to obtain. In short, a large number of training samples is needed to build a classification model with high accuracy, yet in many practical applications it is almost impossible to obtain that many. To solve the problem of sample scarcity, researchers have proposed transfer learning, a machine learning approach that solves problems in related domains by applying existing knowledge.
Transfer learning relaxes the two basic assumptions of traditional machine learning. Its purpose is to use existing knowledge to solve learning problems in which the target domain has only a few labeled samples, or even none [20, 21, 22, 23]. Transfer learning represents a future development direction of machine learning for intrusion detection. Compared with other machine learning algorithms, it saves the cost of data collection by using the existing knowledge in the source domain, and it allows the distributions of training and testing data to differ. Besides, it can effectively alleviate the problems of sparse attack samples and unbalanced detection. Therefore, transfer learning has great advantages over traditional machine learning algorithms.

Figure 1.

Differences between transfer learning and traditional machine learning process (a) the traditional machine learning process (b) the learning process of transfer learning.

Figure 1 shows the differences between the learning processes of traditional machine learning and transfer learning. Traditional machine learning tries to learn every task from scratch, while transfer learning tries to transfer knowledge from previous tasks to target tasks that have less high-quality training data. Therefore, a transfer learning task requires not only the knowledge in the current domain (target domain) but also the knowledge in the source domain to assist the learning task in the current domain. As a strategy for dealing with a lack of training samples, transfer learning has the following problems, which make it difficult to popularize and apply. Firstly, the training dataset in the source domain is large, so the cost of labeling samples is still high; secondly, when the data in the target and source domains are very different, transferring knowledge may greatly reduce the learning effect and easily lead to ‘negative transfer’.

In this paper, based on the advantages of transfer learning and active learning, and combined with the maximum mean discrepancy, an active transfer learning algorithm, ACTrAdaBoost, is proposed for intrusion detection. ACTrAdaBoost uses the maximum mean discrepancy to measure the difference between samples in the source and target domains, and selects the source-domain samples that are most similar to the target-domain samples to form a new source domain. On this basis, active learning is used to label the training samples. The purpose is to select the most informative source-domain data as training data, so that a high-accuracy classification model is established with the fewest training samples. Both ACTrAdaBoost and semi-supervised learning [35] can address sample sparsity. However, semi-supervised learning adds a large number of unlabeled training samples to a small number of labeled training samples to improve learning performance. It requires that labeled and unlabeled samples come from the same domain, and the training and test samples must satisfy the same assumptions as in traditional machine learning. Unlike semi-supervised learning, ACTrAdaBoost uses a large number of source-domain samples to assist the learning task of the target domain, that is, it solves the learning problem of the target-domain samples based on the existing knowledge of the source domain [22]. It does not require learning from scratch every time, nor does it require that the training and test samples satisfy the same assumptions.

The rest of the paper is organized as follows. Section 2 introduces related works, including the maximum mean discrepancy, active learning and the TrAdaBoost algorithm. In Section 3, we present our algorithm and detail the construction of the framework based on active learning and transfer learning. In Section 4, we analyze the experimental results. Section 5 summarizes the main work of this paper.

2. Related works

2.1 Maximum mean discrepancy

Most transfer learning algorithms assume that all samples in the source domain are usable when constructing the learning model, that is, that all of them are correlated with the target domain. In fact, this is not the case. If the knowledge of all source-domain samples is transferred into the target domain, negative transfer often occurs, resulting in a poor learning effect in the target domain. To avoid negative transfer and better assist the learning task in the target domain, it is particularly important to select from the source domain the samples that are highly similar to the target-domain samples. Maximum mean discrepancy (MMD) is an effective method for measuring the distance between different distributions [24]. Its idea is to map (project) each sample, sum the mapped samples, and use the result to express the difference between the data under the two distributions.

Given a source domain $D_{S}$ containing $n$ samples, $X_{S}=\{x_{1}^{S},x_{2}^{S},\ldots,x_{n}^{S}\}$, $Y_{S}=\{y_{1}^{S},y_{2}^{S},\ldots,y_{n}^{S}\}$, $D_{S}=\{(x_{1}^{S},y_{1}^{S}),(x_{2}^{S},y_{2}^{S}),\ldots,(x_{n}^{S},y_{n}^{S})\}$. Similarly, the target domain is $D_{T}=\{(x_{1}^{T},y_{1}^{T}),(x_{2}^{T},y_{2}^{T}),\ldots,(x_{m}^{T},y_{m}^{T})\}$. For convenience, this paper only considers binary classification. The square of the maximum mean discrepancy between the samples in the source domain and in the target domain is as follows:

$\displaystyle\textit{MMD}^{2}=\left\|\frac{1}{m}\sum\limits_{x_{i}\in D_{T}}\phi(x_{i})-\frac{1}{n}\sum\limits_{x_{j}\in D_{S}}\phi(x_{j})\right\|^{2}$ (1)

$\phi(\cdot)$ is a non-linear mapping function. It can be seen from Eq. (1) that when the MMD method is used to measure the difference between the source domain and the target domain, it uses all samples in the source domain with equal weight. The differing correlation of individual source-domain samples with the target domain is then neglected, which affects the accuracy of the measurement. To avoid this, the importance of each source-domain sample, expressed as a weight $\beta_{j}$, should be taken into account when measuring the difference between the source domain and the target domain.
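
As a minimal illustration of Eq. (1), the sketch below expands the squared norm with the kernel trick, so that $\phi$ never has to be computed explicitly. The RBF kernel, its `gamma` value and the toy samples are assumptions made for illustration only; the source does not state which kernel is used.

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """Assumed kernel: K(x, y) = exp(-gamma * ||x - y||^2) stands in for phi(x)^T phi(y)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def mean_kernel(A, B, kernel):
    """Average of kernel(a, b) over all pairs with a in A and b in B."""
    return sum(kernel(a, b) for a in A for b in B) / (len(A) * len(B))

def mmd_squared(target, source, kernel=rbf_kernel):
    """Squared MMD of Eq. (1): target-target + source-source - 2 * cross term."""
    return (mean_kernel(target, target, kernel)
            + mean_kernel(source, source, kernel)
            - 2.0 * mean_kernel(target, source, kernel))

# Toy data (hypothetical): one source set close to the target, one far away.
target = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2)]
near_source = [(0.1, 0.1), (0.0, 0.2), (0.2, 0.1)]
far_source = [(3.0, 3.1), (2.9, 3.0), (3.2, 2.8)]

print(mmd_squared(target, near_source))  # small value: similar distributions
print(mmd_squared(target, far_source))   # larger value: dissimilar distributions
```

Every source sample contributes equally here, which is the limitation that the weighted variant is meant to address.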

The definition of the measurement method of selective maximum mean discrepancy is as follows:

$\displaystyle\textit{SMMD}=\left\|\frac{1}{m}\sum\limits_{x_{i}\in D_{T}}\phi(x_{i})-\frac{1}{n}\sum\limits_{x_{j}\in D_{S}}\beta_{j}\phi(x_{j})\right\|$ (2)

In Eq. (2), the correlation between the $j^{\text{th}}$ sample in the source domain and the target domain is weak when $\beta_{j}$ tends to 0 and strongest when $\beta_{j}$ tends to 1. For convenience of calculation, the squared form $\textit{SMMD}^{2}$ is expanded as follows:

$\displaystyle\textit{SMMD}^{2}=\left\|\frac{1}{m}\sum\limits_{x_{i}\in D_{T}}\phi(x_{i})-\frac{1}{n}\sum\limits_{x_{j}\in D_{S}}\beta_{j}\phi(x_{j})\right\|^{2}=\frac{1}{m^{2}}\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{m}\phi(x_{i})^{T}\phi(x_{j})+\frac{1}{n^{2}}\sum\limits_{i=1}^{n}\sum\limits_{j=1}^{n}\beta_{i}\beta_{j}\phi(x_{i})^{T}\phi(x_{j})-2\frac{1}{mn}\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n}\beta_{j}\phi(x_{i})^{T}\phi(x_{j})$ (3)

In Eq. (3), the first term on the right-hand side is a constant with respect to $\beta$ and can be omitted, so only the last two terms need to be minimized. This gives Eq. (4):

$\displaystyle f(\beta)=\frac{1}{n^{2}}\sum\limits_{i=1}^{n}\sum\limits_{j=1}^{n}\beta_{i}\beta_{j}\phi(x_{i})^{T}\phi(x_{j})-2\frac{1}{mn}\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n}\beta_{j}\phi(x_{i})^{T}\phi(x_{j})$ (4)

By simplifying Eq. (4), the objective function is as follows:

$\displaystyle\mathop{\min}\limits_{\beta}\frac{1}{2}\beta^{T}K\beta-k^{T}\beta\quad s.t.\,0\leqslant\beta_{i}\leqslant 1,\ i=1,2,\ldots,n$ (5)

In Eq. (5), $K$ is the kernel matrix over the source-domain samples with entries $K_{ij}=K(x_{i},x_{j})=\phi(x_{i})^{T}\phi(x_{j})$, and $k_{i}$ is as follows:

$\displaystyle k_{i}=\frac{n}{m}\sum\limits_{j=1}^{m}\phi(x_{i})^{T}\phi(x_{j})$ (6)
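
One way to read the simplification from Eq. (4) to Eq. (5), given here as a sketch because the intermediate step is not spelled out: multiplying $f(\beta)$ by the positive constant $n^{2}/2$ does not change its minimizer. After swapping the index names in the cross term so that $i$ runs over the source domain and $j$ over the target domain,

$\displaystyle\frac{n^{2}}{2}f(\beta)=\frac{1}{2}\sum\limits_{i=1}^{n}\sum\limits_{j=1}^{n}\beta_{i}\beta_{j}\phi(x_{i})^{T}\phi(x_{j})-\sum\limits_{i=1}^{n}\beta_{i}\left(\frac{n}{m}\sum\limits_{j=1}^{m}\phi(x_{i})^{T}\phi(x_{j})\right)=\frac{1}{2}\beta^{T}K\beta-k^{T}\beta$

The bracketed factor is exactly $k_{i}$ of Eq. (6), since $\frac{n^{2}}{2}\cdot\frac{2}{mn}=\frac{n}{m}$.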

It follows that the objective in Eq. (5) is a standard quadratic programming problem, so a quadratic programming solver can be used to solve it. The selective maximum mean discrepancy thus makes it easy to identify the source-domain samples that are highly similar to the target domain. Transferring knowledge from these samples can effectively avoid negative transfer in transfer learning.
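
The following sketch shows one possible way to obtain the weights $\beta$ from Eq. (5). It uses projected gradient descent as a simple stand-in for a quadratic programming solver; the RBF kernel, the step size rule, the iteration count and the toy data are all assumptions, not details taken from the source.

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """Assumed kernel standing in for phi(x)^T phi(y)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def source_weights(source, target, kernel=rbf_kernel, iters=2000):
    """Minimize 0.5 * b^T K b - k^T b subject to 0 <= b_i <= 1 (Eq. (5))."""
    n, m = len(source), len(target)
    K = [[kernel(xi, xj) for xj in source] for xi in source]
    # Eq. (6): k_i = (n / m) * sum over target samples of K(x_i, x_j)
    k = [n / m * sum(kernel(xi, xt) for xt in target) for xi in source]
    # trace(K) is an upper bound on the largest eigenvalue of a PSD matrix,
    # so 1 / trace(K) is a safe (conservative) gradient step.
    step = 1.0 / sum(K[i][i] for i in range(n))
    beta = [0.5] * n
    for _ in range(iters):
        grad = [sum(K[i][j] * beta[j] for j in range(n)) - k[i] for i in range(n)]
        # Gradient step followed by projection onto the box [0, 1]^n.
        beta = [min(1.0, max(0.0, b - step * g)) for b, g in zip(beta, grad)]
    return beta

# Toy data (hypothetical): the first two source samples lie near the target.
target = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2)]
source = [(0.1, 0.1), (0.0, 0.2), (3.0, 3.0), (3.1, 2.9)]
beta = source_weights(source, target)
print([round(b, 3) for b in beta])
```

In this toy run the two source samples close to the target end up with weights near 1 and the two distant ones with weights near 0, which is the selection behavior described above. A dedicated QP solver would be the natural choice for larger problems.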

2.2 Active learning

When a supervised learning method constructs a classifier, it usually needs a large number of labeled training samples. In theory, the more labeled training samples there are and the higher their quality, the better the learning effect. In practical applications of machine learning, however, obtaining labeled training samples is the most time-consuming and expensive task, and human subjectivity makes labeling error-prone [16]. In this situation, traditional supervised learning methods find it difficult to construct highly accurate classifiers. Active learning can effectively address the problem. In active learning, the learner itself selects the most informative unlabeled samples to be labeled by experts, and then adds the labeled samples to the training dataset. In this way, higher classification accuracy can be reached with a relatively small training set, which effectively reduces the cost of building a classifier [17].

The model of active learning is as follows:

$\displaystyle A=(C,L1,L2,Q,U)$

In the model $A$, $C$ represents a classifier or a group of classifiers; $L1$ represents an existing training sample set containing a small number of labeled samples; $L2$ represents the set of training samples labeled by the expert; $Q$ is a query function for querying samples from a large number of unlabeled samples; $U$ represents an unlabeled dataset containing a large number of samples; $S$ represents an expert, who is responsible for labeling the unlabeled samples selected according to a certain strategy.

Figure 2.

Execution process of the active learning.

Figure 2 is a schematic diagram of the active learning execution process. The diagram shows that the process is a cyclic iteration. Firstly, the algorithm randomly selects a small number of samples from the unlabeled samples $U$, and the initial classifier $C$ is built on the training set, which consists of $L1$ and the samples $L2$ correctly labeled by the expert $S$. Then, according to a query strategy $Q$, the unlabeled samples in $U$ that satisfy the query conditions are selected, labeled by the expert $S$ and added to $L2$; $L2$ and $L1$ are then used to retrain the classifier $C$. Finally, the whole process repeats until a stopping condition is satisfied. During the iteration, the classifier is retrained with the newly labeled samples of each round, which gradually improves the classification accuracy. At present, there are three kinds of active learning algorithms for classification problems: committee-based heuristic methods (query by committee, QBC), margin-based heuristic methods (margin sampling, MS) and posterior-probability-based heuristic methods (PP).
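
The loop described above can be sketched as follows. This is a minimal pool-based illustration under stated assumptions: a nearest-centroid classifier plays the role of $C$, a margin-based rule plays the role of $Q$, and the function `expert` is a hypothetical oracle standing in for the human expert $S$. None of these concrete choices comes from the source.

```python
def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def centroid(points):
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))

def train(labeled):
    """Classifier C (assumed): one centroid per class label."""
    by_class = {}
    for x, y in labeled:
        by_class.setdefault(y, []).append(x)
    return {y: centroid(xs) for y, xs in by_class.items()}

def margin(model, x):
    """Gap between the two nearest centroids; a small gap means low confidence."""
    dists = sorted(sq_dist(x, c) for c in model.values())
    return dists[1] - dists[0]

def active_learning(seed, pool, expert, budget):
    """seed plays the role of L1, pool of U; queried samples form L2."""
    labeled, pool = list(seed), list(pool)
    for _ in range(budget):
        if not pool:
            break
        model = train(labeled)                             # (re)train C
        query = min(pool, key=lambda x: margin(model, x))  # query function Q
        pool.remove(query)
        labeled.append((query, expert(query)))             # expert S labels it
    return train(labeled), labeled

def expert(x):
    """Hypothetical oracle: the true class boundary is x0 + x1 = 1."""
    return 1 if x[0] + x[1] > 1.0 else -1

seed = [((0.0, 0.0), -1), ((1.0, 1.0), 1)]
pool = [(0.1, 0.2), (0.9, 0.8), (0.45, 0.5), (0.5, 0.7), (0.2, 0.1)]
model, labeled = active_learning(seed, pool, expert, budget=3)
print(labeled[2:])  # the three samples chosen for expert labeling
```

The seed set must contain both classes so that a margin can be computed. With this toy data, the first query is the pool sample closest to the current decision boundary rather than one of the easy, far-away samples.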

The purpose of active learning is to select and label samples from a large number of unlabeled samples and then add the labeled samples to the training set, so as to effectively reduce the size of the training set and greatly reduce the complexity of constructing the learning model. Compared with traditional supervised learning, it handles large-scale training sets well, selects high-quality samples, decreases the scale of the training set and reduces the cost of labeling samples manually. Active learning has proved to be very effective in machine learning. [18] combined low-level transformation (LRT) with active learning based on SVM; the appropriately labeled data from active learning improves the performance of LRT and further improves the accuracy of the SVM classifier. [19] proposed an active learning technique using a clustering-hypothesis batch processing model. Similarly, active learning can be applied to transfer learning to improve the classification effect. In transfer learning, a large number of unlabeled samples are required as auxiliary data. When active learning is applied to screen the auxiliary data, the most informative samples can be obtained as the training set, which greatly reduces the number of training samples in transfer learning, thus reducing the scale of the training set and helping to improve the training efficiency of transfer learning.

2.3 TrAdaBoost algorithm

Table 1
Training process of TrAdaBoost algorithm

Input: a set $T_{a}$ of $a$ labeled samples in the source domain; a set $T_{b}$ of $b$ labeled samples in the target domain; the combined training set $T=T_{a}\cup T_{b}$, where for $x_{i}\in T$, $c(x_{i})$ is the real class label of the sample; an unlabeled testing dataset $S$; a basic classifier $C$; the number of iterations $N$.
Initialize:
1. Initial weight vector $w^{1}=(w_{1}^{1},\ldots,w_{a+b}^{1})$, where $\displaystyle w_{i}^{1}=\left\{\begin{array}{ll}\frac{1}{a},&i=1,\ldots,a\\ \frac{1}{b},&i=a+1,\ldots,a+b\end{array}\right.$
2. Set $\beta=1/(1+\sqrt{2\ln a/N})$
Process of iteration:
3. for $t=1,\ldots,N$
(1) Set $p^{t}$ as follows: $\displaystyle p^{t}=\frac{w^{t}}{\sum\nolimits_{i=1}^{a+b}w_{i}^{t}}$
(2) Call classifier $C$ to obtain a weak classifier $h_{t}:X\mapsto Y$ by training on $T$ with the distribution $p^{t}$ and on $S$.
(3) Calculate the error rate on $T_{b}$: $\displaystyle\varepsilon_{t}=\sum\limits_{i=a+1}^{a+b}\frac{w_{i}^{t}\|h_{t}(x_{i})-c(x_{i})\|}{\sum\nolimits_{i=a+1}^{a+b}w_{i}^{t}}$
(4) Set $\beta_{t}=\varepsilon_{t}/(1-\varepsilon_{t})$, and the weight coefficient of the weak classifier $\alpha_{t}=\ln\left(\frac{1}{\beta_{t}}\right)$
(5) Update the new weight vector: $\displaystyle w_{i}^{t+1}=\left\{\begin{array}{ll}w_{i}^{t}\beta^{\|h_{t}(x_{i})-c(x_{i})\|},&i=1,\ldots,a\\ w_{i}^{t}\beta_{t}^{-\|h_{t}(x_{i})-c(x_{i})\|},&i=a+1,\ldots,a+b\end{array}\right.$
End for
Output: Strong classifier $\displaystyle h_{f}(x)=\sum\limits_{t=1}^{N}\alpha_{t}h_{t}(x)$

The TrAdaBoost algorithm is a weight-based transfer learning method proposed by Dai in 2007. The core of the algorithm is to use Boosting to select the source-domain samples that differ least from the target-domain samples [26]. In TrAdaBoost, Boosting establishes an automatic weight adjustment mechanism: the more similar a source-domain sample is to the target domain, the larger its weight; otherwise, its weight decreases. To ensure the accuracy of the classification model, AdaBoost is used for the training samples in the target domain; at the same time, $\textit{Hedge}(\beta)$ is used for the training samples in the source domain to adjust their importance. The TrAdaBoost algorithm is described in Table 1.

In each iteration of TrAdaBoost, a source-domain sample that is misclassified is regarded as contradicting the target-domain samples. Its weight is reduced in the next iteration, so the misclassified training sample has a smaller impact on the classification model than in the previous round. After many iterations, the source-domain samples that are similar to the target-domain samples have higher weights, while the weights of the dissimilar samples decrease. The retained source-domain samples help the learning task in the target domain to obtain a better classification model. In the extreme case, the source-domain samples are completely ignored and TrAdaBoost degenerates into the traditional AdaBoost algorithm. Therefore, the TrAdaBoost algorithm can be seen as an extension of AdaBoost. [27] proposed a multi-source-domain version of the TrAdaBoost algorithm, which greatly reduced the impact of negative transfer and effectively improved the classification effect.
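
The weight update at the heart of Table 1 can be illustrated with a short sketch of a single round. It follows the standard TrAdaBoost formulation, in which misclassified source samples are scaled by the fixed factor $\beta$ and misclassified target samples by $\beta_{t}^{-1}$. The function name, the argument layout and the clamping of $\varepsilon_{t}$ are assumptions added for illustration.

```python
import math

def tradaboost_round(weights, errors, a, n_rounds):
    """One weight update of TrAdaBoost.

    weights: current weights, a source samples followed by the target samples.
    errors:  |h_t(x_i) - c(x_i)| for each sample, 0 if correct and 1 if wrong.
    """
    # Fixed Hedge factor for the source domain (step 2 of Table 1).
    beta = 1.0 / (1.0 + math.sqrt(2.0 * math.log(a) / n_rounds))
    src_w, tgt_w = weights[:a], weights[a:]
    src_e, tgt_e = errors[:a], errors[a:]
    # Weighted error rate measured on the target domain only.
    eps = sum(w * e for w, e in zip(tgt_w, tgt_e)) / sum(tgt_w)
    # Assumed safeguard: keep eps in (0, 0.5) so that beta_t stays in (0, 1).
    eps = min(max(eps, 1e-10), 0.499)
    beta_t = eps / (1.0 - eps)
    # Misclassified source samples shrink; misclassified target samples grow.
    new_w = [w * beta ** e for w, e in zip(src_w, src_e)]
    new_w += [w * beta_t ** (-e) for w, e in zip(tgt_w, tgt_e)]
    return new_w, beta_t

# Toy example (hypothetical): 3 source samples, 4 target samples.
weights = [1 / 3] * 3 + [1 / 4] * 4
errors = [1, 0, 0, 1, 0, 0, 0]  # first source and first target sample are wrong
new_weights, beta_t = tradaboost_round(weights, errors, a=3, n_rounds=10)
print(new_weights, beta_t)
```

In the toy example the misclassified source sample loses weight while the misclassified target sample gains weight, and correctly classified samples are left unchanged.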

3. Active transfer learning algorithm

3.1 Basic ideology of ACTrAdaBoost

This paper proposes ACTrAdaBoost, an active transfer learning algorithm based on the transfer learning algorithm TrAdaBoost. ACTrAdaBoost is formally described as $(M1,M2,Q)$, where $M1$ is a classical SVM classifier, $M2$ is an improved TrAdaBoost classifier, and $Q$ is the query function of active learning, which determines which of the source-domain samples labeled by $M1$ are selected for the training dataset. The process of the ACTrAdaBoost algorithm is as follows. In the first step, samples in the source domain are selected using the maximum mean discrepancy of Section 2.1; keeping only the samples that are similar to the target-domain samples reduces the scale of the source domain. In the second step, the active learning algorithm trains $M1$ on the samples retained by the first step, and the expert labels the samples chosen by the query strategy $Q$ to form a new source domain. In the third step, the improved TrAdaBoost classifier is trained on the new source domain and the target domain to obtain the final classifier. The base classifier of TrAdaBoost is BA_SVM, an SVM based on the bat algorithm [28]. BA_SVM can find the optimal parameter combination of the SVM and avoid local optima.

On the one hand, active learning in ACTrAdaBoost reduces the scale of the usable source domain, which improves training efficiency. On the other hand, the MMD method filters out the samples with low similarity to the target domain, which helps to solve the problem of negative transfer. In addition, the weighting mechanism of TrAdaBoost itself assigns different weights to the labeled training samples and further screens out the source-domain samples that conform to the target domain for knowledge transfer. Therefore, ACTrAdaBoost not only improves training efficiency but also inhibits negative transfer more effectively.

Definition 3.1 Basic symbols

(1) Set $SD$ as the source domain and $TD$ as the target domain;
(2) Set $Y$ as the set of class labels; only the binary classification problem is considered here, and it can easily be extended to the multi-class problem;
(3) Training dataset $T\subseteq\{(X=SD\cup TD)\}$;
(4) Function $f:X\mapsto Y$, mapping a sample to its class label.

Definition 3.2 Testing dataset

$\displaystyle S=\{(x_{i}^{t})\},x_{i}^{t}\in TD,i=1,2,3,\ldots,k$

Definition 3.3 Training dataset

$\displaystyle S_{a}=\{(x_{i}^{a},f(x_{i}^{a}))\},x_{i}^{a}\in SD,i=1,2,3,\ldots,n$
$\displaystyle S_{b}=\{(x_{i}^{b},f(x_{i}^{b}))\},x_{i}^{b}\in TD,i=1,2,3,\ldots,m$

$f(x)$ is the class label; $S_{a}$ is the dataset in the source domain $SD$; $S_{b}$ is the dataset in the target domain $TD$; the numbers of samples in $S_{a}$ and $S_{b}$ are $n$ and $m$, respectively. $T=S_{a}\cup S_{b}$, so the training dataset $T=\{(x_{i},f(x_{i}))\}$ is defined as follows:

$\displaystyle x_{i}=\left\{{\begin{array}[]{l}x_{i}^{a},i=1,\ldots,n;\\ x_{i}^{b},i=n+1,\ldots,n+m.\\ \end{array}}\right.$

At this point, the transfer learning problem can be defined as follows: given a small labeled intrusion detection dataset $S_{b}$ in the target domain and a large unlabeled intrusion detection dataset $S_{a}$ in the source domain, the goal is to train a classifier that minimizes the classification error on the dataset $S$ and improves the prediction accuracy for intrusion behavior.

The input and output of the problem are as follows:

Input: Two training datasets $S_{a}$ and $S_{b}$, a testing dataset $S$, and a basic classifier BA_SVM.
Output: Classifier.

Figure 3.
Framework of the ACTrAdaBoost.

3.2 Description of ACTrAdaBoost

The framework of ACTrAdaBoost algorithm is shown in Fig. 3, and the detailed training process is shown in Table 2.

Table 2
Training process of ACTrAdaBoost algorithm

Input: a set $S_{a}$ of $n$ unlabeled samples in the source domain $SD$; a set $S_{b}$ of $m$ labeled samples in the target domain $TD$; $T=S_{a}\cup S_{b}$, where for $x_{i}\in T$, $f(x_{i})$ is the real class label of the sample; an unlabeled testing dataset $S$; a basic classifier BA_SVM; the number of iterations $N_{1}$.
Initialize:
1. Solve Eq. (5) to obtain the weight vector $\beta$ with components $\beta_{i}$;
2. Run the active learning process and output the new source domain. The query strategy $Q$ is constructed as follows: for the unlabeled sample set $S_{a}$, the weight $\beta_{i}$ of each source-domain sample $x_{i}$, which expresses its importance relative to the target domain, is calculated according to Eq. (5); the sample with the lowest confidence level [32] and $\beta_{i}>0$ is selected. The process of active learning is as follows:
(1) Select a number of samples from the unlabeled sample set $S_{a}$ and label them correctly to construct an initial training dataset $S^{\prime}_{a}$ containing at least one sample with output 1 and one sample with output $-1$;
(2) Train an SVM classifier $C$ on the training dataset $S^{\prime}_{a}$;
(3) Label the remaining samples in $S_{a}$ as $({\rm{\bf x}},{\rm{\bf y}}^{\prime})$, where the label vector ${\rm{\bf y}}^{\prime}$ is obtained by the classifier $C$;
(4) Call the query function $Q$ to select the sample set $B$ to be labeled;
(5) The expert correctly labels $B$, which is added to the training dataset $S^{\prime}_{a}$;
(6) Repeat (3)–(5) until the classification accuracy of $C$ reaches a certain threshold $\Psi$, and the final labeled training dataset $S^{\prime}_{a}$ is obtained. $S^{\prime}_{a}$ is the new dataset in the source domain, with $n^{\prime}$ samples.
3. Initial weight vector $w^{1}=(w_{1}^{1},\ldots,w_{n^{\prime}+m}^{1})$, where $\displaystyle w_{i}^{1}=\left\{\begin{array}{ll}\frac{1}{n^{\prime}},&i=1,\ldots,n^{\prime}\\ \frac{1}{m},&i=n^{\prime}+1,\ldots,n^{\prime}+m\end{array}\right.$
4. Set $\tau=1/(1+\sqrt{2\ln n^{\prime}/N_{1}})$
Process of iteration:
5. for $t=1,\ldots,N_{1}$
(1) Set $p^{t}$ as follows: $\displaystyle p^{t}=\frac{w^{t}}{\sum\nolimits_{i=1}^{n^{\prime}+m}w_{i}^{t}}$
(2) Call classifier BA_SVM to obtain the weak classifier $h_{t}:X\mapsto Y$ by training on $T$ with the distribution $p^{t}$ and on $S$.
(3) Calculate the error rate on $S_{b}$: $\displaystyle\varepsilon_{t}=\sum\limits_{i=n^{\prime}+1}^{n^{\prime}+m}\frac{w_{i}^{t}\|h_{t}(x_{i})-f(x_{i})\|}{\sum\nolimits_{i=n^{\prime}+1}^{n^{\prime}+m}w_{i}^{t}}$
(4) Set $\beta_{t}=\varepsilon_{t}/(1-\varepsilon_{t})$, and the weight coefficient of the weak classifier $\alpha_{t}=\ln\left(\frac{1}{\beta_{t}}\right)$
(5) Update the new weight vector: $\displaystyle w_{i}^{t+1}=\left\{\begin{array}{ll}w_{i}^{t}\tau^{\|h_{t}(x_{i})-f(x_{i})\|},&i=1,\ldots,n^{\prime}\\ w_{i}^{t}\beta_{t}^{-\|h_{t}(x_{i})-f(x_{i})\|},&i=n^{\prime}+1,\ldots,n^{\prime}+m\end{array}\right.$
End for
Output: Classifier $\displaystyle h_{f}(x)=\sum\limits_{t=1}^{N_{1}}\alpha_{t}h_{t}(x)$
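
The query strategy $Q$ used in the active learning stage of Table 2 can be sketched as below. This is only one plausible reading of the rule "lowest confidence level and $\beta_{i}>0$": the function name `query`, the batch size, the tolerance `eps` and the use of a per-sample confidence score (for example the absolute SVM decision value) are assumptions, not details given in the source.

```python
def query(candidates, beta, confidence, batch_size=1, eps=1e-6):
    """Return indices of the least confident candidates whose MMD weight is positive.

    candidates: indices of still-unlabeled source samples.
    beta:       weights from the selective MMD step (one per source sample).
    confidence: classifier confidence per source sample (assumed score).
    """
    # Samples with (numerically) zero weight are unrelated to the target domain.
    eligible = [i for i in candidates if beta[i] > eps]
    # Lowest confidence first: these are the most informative to label.
    eligible.sort(key=lambda i: confidence[i])
    return eligible[:batch_size]

# Toy values (hypothetical) for four unlabeled source samples.
beta = [0.9, 0.0, 0.4, 1.0]
confidence = [0.8, 0.05, 0.1, 0.3]
batch = query(range(4), beta, confidence, batch_size=2)
print(batch)
```

In this toy case sample 1 has the lowest confidence but is skipped because its weight is zero, so the expert would be asked to label samples 2 and 3 instead.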

4. Experimental results

In this section, we analyze and verify the effectiveness of the ACTrAdaBoost algorithm. ACTrAdaBoost is evaluated on intrusion datasets, and the results show that it can assist current intrusion detection learning tasks by transferring knowledge from existing intrusion detection datasets, improving both the detection rate and time efficiency. The experimental settings and the analysis of the experimental results are described in detail below.

4.1 Experimental environment and evaluation criteria

To verify the effectiveness of the ACTrAdaBoost algorithm in intrusion detection, it is validated on the KDD CUP99, DARPA 1998 and ISCX2012 datasets. The experimental environment is configured as follows: an Intel Core (TM) processor at 3.6 GHz, 8 GB of memory, and the Windows 10 operating system. The benchmark algorithms used in the experiments are SVM, SVMt [29] and TrAdaBoost [26], where SVM is implemented with the LIBSVM [30] toolkit. Each experiment is repeated 10 times, and the average of the results is used as the final comparison result.

The detection rate, false positive rate and false negative rate of each algorithm over all attack types are used as evaluation criteria [1, 5]. The detection rate is the proportion of attacks detected by the intrusion detection system to the total number of attacks, indicating the accuracy of intrusion detection. The false positive rate is the proportion of normal behaviors that the intrusion detection system incorrectly detects as attacks to the total number of normal behaviors. The false negative rate is the proportion of attacks that the intrusion detection system incorrectly detects as normal behavior to the total number of attacks. The definitions of the detection rate, false positive rate and false negative rate are as follows:

$\displaystyle\text{Detection rate: }CR=\frac{TP}{TP+FN}\times 100\%$
$\displaystyle\text{False positive rate: }FR=\frac{FP}{TN+FP}\times 100\%$
$\displaystyle\text{False negative rate: }MR=\frac{FN}{TP+FN}\times 100\%$

$TP$ is the number of abnormal samples correctly classified as abnormal, $FP$ is the number of normal samples incorrectly classified as abnormal, $TN$ is the number of normal samples correctly classified as normal, and $FN$ is the number of abnormal samples incorrectly classified as normal.
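The three criteria follow directly from the confusion counts. The short Python sketch below computes them; the function name `detection_metrics` and the example counts are hypothetical and chosen only for illustration.

```python
def detection_metrics(tp, fp, tn, fn):
    """Return (CR, FR, MR) in percent from the four confusion counts."""
    cr = 100.0 * tp / (tp + fn)   # detection rate
    fr = 100.0 * fp / (tn + fp)   # false positive rate
    mr = 100.0 * fn / (tp + fn)   # false negative rate
    return cr, fr, mr

# Hypothetical counts: 100 attacks (90 detected) and 100 normal records (5 flagged)
cr, fr, mr = detection_metrics(tp=90, fp=5, tn=95, fn=10)
```

Because $CR$ and $MR$ share the denominator $TP+FN$, they always sum to 100%.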

4.2 Intrusion detection dataset

4.2.1 KDD CUP99

KDD CUP99 is a widely used intrusion detection dataset provided by the Lincoln Laboratory of the Massachusetts Institute of Technology [31]. The dataset contains about $5\times 10^{6}$ records, each of which has 41 feature attributes and a class identifier. There are about 38 types of attack, 21 of which appear in the training dataset, while 17 unknown attack types appear only in the testing dataset. This design tests the generalization ability of the classifier model. The ability to detect unknown attack types is also one of the important indicators for evaluating how well a classifier works in intrusion detection.

So far, researchers have mostly used the 10% KDDCUP99 dataset (including training and testing datasets), which is a 10% sample of the full KDDCUP99 dataset. The 10% KDDCUP99 dataset is used in this paper. It contains one normal type and four major network attack types: DOS, Probing, U2R and R2L. The number of attacks of each of the four types differs between the training and testing parts of the 10% KDDCUP99 dataset. Table 3 lists the 22 types in the training dataset and the 39 types in the testing dataset, where the normal type is counted as a type in the table.

Table 3
10% KDDCUP99 intrusion detection dataset

Attacks	10% testing dataset	10% training dataset
Normal	Normal	Normal
DOS	back, land, neptune, pod, smurf, teardrop, apache2, mailbomb, udpstorm, processtable	back, land, neptune, pod, smurf, teardrop
Probing	ipsweep, nmap, portsweep, satan, saint, mscan	ipsweep, nmap, portsweep, satan
R2L	ftp_write, guess_password, imap, multihop, phf, spy, warezmaster, warezclient, named, xsnoop, xlock, sendmail, worm, snmpgetattack, snmpguess	ftp_write, guess_password, imap, multihop, phf, spy, warezmaster
U2R	buffer_overflow, loadmodule, perl, rootkit, httptunnel, ps, sqlattack, xterm	buffer_overflow, loadmodule, perl, rootkit

Table 4

Distribution of attack types in 10% KDDCUP99 dataset

Attacks	Training dataset		Testing dataset
Normal	97278	(19.69%)	60593	(19.48%)
DOS	391458	(79.24%)	229853	(73.90%)
Probing	4107	(0.83%)	4166	(1.34%)
R2L	1126	(0.22%)	16198	(5.20%)
U2R	52	(0.01%)	228	(0.073%)

To test whether an intrusion detection algorithm can recognize new attacks after learning from the training dataset, the testing dataset in Table 3 contains attack types that do not appear in the training dataset. In Table 4, the value in parentheses is the proportion of each type in the dataset. The proportion of Normal is basically the same in the two datasets, but the proportions of the four attack types differ noticeably. Because the proportions of U2R and R2L are very small, most existing detection algorithms have difficulty detecting these two types of attacks.
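The percentages in Table 4 can be reproduced from the counts, which also makes the class imbalance explicit. The sketch below uses the training-dataset counts from Table 4; the helper name `proportions` is hypothetical.

```python
# Training-dataset counts taken from Table 4
train_counts = {
    "Normal": 97278,
    "DOS": 391458,
    "Probing": 4107,
    "R2L": 1126,
    "U2R": 52,
}

def proportions(counts):
    """Return the share of each class in percent."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}

shares = proportions(train_counts)
# Ratio between the largest and the smallest class
imbalance = max(train_counts.values()) / min(train_counts.values())
```

The ratio between the DOS and U2R counts is in the thousands, which is why classifiers that rely on abundant samples struggle with U2R.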

Both 10% KDD CUP99 datasets are stored as text and share the same record format. The only difference is that records in the testing dataset have no final attack-type field. Two sample connection records are listed below, with features separated by commas. The first 41 features of each record are the attributes of the connection, and the last field is the class identifier. The first record is a normal record (Normal), and the second is a satan record, which belongs to the Probing category in Table 3. The first record uses the TCP protocol and the second uses the UDP protocol.

0, tcp, http, SF, 290, 3084, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 0.00, 0.00, 0.00, 0.00, 1.00, 0.00, 0.00, 145, 255, 1.00, 0.00, 0.01, 0.02, 0.00, 0.00, 0.00, 0.00, normal.

0, udp, private, SF, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 24, 4, 0.00, 0.00, 0.00, 0.00, 0.17, 0.12, 0.00, 255, 4, 0.02, 0.22, 1.00, 0.00, 0.00, 0.00, 0.00, 0.00, satan.

The data records contain character-type values, and the numerical feature attributes are of two kinds: discrete and continuous. On the one hand, machine learning algorithms work best with numerical data. On the other hand, the different measurement scales of continuous and discrete attributes affect the calculation results, so the dataset must be preprocessed. Preprocessing mainly consists of converting character-type data into numerical data and standardizing the continuous feature attributes.
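Before any conversion, each text record has to be split into its feature fields and its class identifier. The following Python sketch does this for the first sample record quoted above. The function name `parse_record` is hypothetical, and the sketch assumes a labelled record, that is, one whose last field is the class identifier followed by a full stop.

```python
def parse_record(line):
    """Split one comma-separated record into (features, label)."""
    fields = [field.strip() for field in line.strip().split(",")]
    label = fields[-1].rstrip(".")   # e.g. "normal." -> "normal"
    return fields[:-1], label

# First sample record from the text
record = (
    "0, tcp, http, SF, 290, 3084, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, "
    "0, 0, 3, 3, 0.00, 0.00, 0.00, 0.00, 1.00, 0.00, 0.00, 145, 255, 1.00, "
    "0.00, 0.01, 0.02, 0.00, 0.00, 0.00, 0.00, normal."
)

features, label = parse_record(record)
# Character-type attributes: protocol, service and connection state
protocol, service, flag = features[1], features[2], features[3]
```

The three character-type fields extracted at the end are the ones that must be converted into numerical values before training.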

4.2.2 DARPA1998 dataset

DARPA 1998 is an intrusion detection dataset built for DARPA at the Massachusetts Institute of Technology's Lincoln Laboratory [33]. It contains 7 weeks of training data and 2 weeks of test data, covering normal behavior (Normal) and four types of attack: Probe, DoS, R2L and U2R. See Table 5 for details.

Table 5
Distribution of attack types in DARPA dataset

Dataset	Training dataset		Testing dataset
	Count	Percentage	Count	Percentage
Normal	849,991	34.46%	459,547	41.79%
DoS	1,561,231	63.29%	591,619	53.80%
Probe	48,984	1.99%	40,317	3.67%
R2L	6,494	0.26%	8,041	0.73%
U2R	229	0.01%	207	0.02%
Total	2,466,929		1,099,731

In the experiment, 2% of the data is selected from the training dataset and all label information is removed to form the unlabeled training dataset. The same operation is performed on the testing dataset, and the result is used as the testing dataset.

4.2.3 ISCX2012 dataset

In 2012, the Information Security Center of Excellence (ISCX) at the University of New Brunswick (UNB) in Canada released an intrusion detection dataset called ISCX2012 [34]. This dataset contains seven days of raw network traffic, including normal traffic and four types of attack. Some researchers have noted that the attack types considered in KDD99 are now obsolete. In contrast, the ISCX2012 attack types are more modern and closer to reality. See Table 6 for details.

Table 6
Distribution of attack types in ISCX2012 dataset

Dataset	Training dataset		Testing dataset
	Count	Percentage	Count	Percentage
Normal	890,726	97.27%	593,811	97.27%
BFSSH	4,179	0.46%	2,785	0.46%
Infiltration	6,027	0.66%	4,017	0.66%
HttpDoS	2,090	0.23%	1,392	0.23%
DDoS	12,673	1.38%	8,448	1.38%
Total	915,695		610,453

4.3 Dataset processing

To make the intrusion detection datasets suitable for machine learning algorithms, their character-type attributes must be converted into numerical values and their continuous feature attributes must be standardized.

4.3.1 Converting character type to numerical type

For example, in KDD CUP99 the character-type attributes, namely the three network connection types, 70 network service types, 23 attack types (including the normal type) and 11 network connection states, are converted into numerical values, and the results are saved in CSV files. The converted forms of the 11 network connection states are shown in Table 7.

Table 7
Forms of network connection states converted

Character type	Numerical type
OTH	0
REJ	1
RSTO	2
RSTOS0	3
RSTR	4
S0	5
S1	6
S2	7
S3	8
SF	9
SH	10

In much the same way, the three network connection types are mapped to 0–2, the 70 network service types to 0–69, and the 23 attack types to 0–22.
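The conversion amounts to building a lookup table per character-type attribute. The sketch below is one way to do it in Python, assuming that codes are assigned in alphabetical order; this assumption reproduces the connection-state codes of Table 7 but is not stated in the paper for the other attributes. The helper name `build_codes` is hypothetical, and the state written `RSTOS0` follows the KDD CUP99 spelling.

```python
def build_codes(values):
    """Map each distinct string to an integer code in sorted order."""
    return {value: code for code, value in enumerate(sorted(set(values)))}

# The 11 network connection states of Table 7
states = ["OTH", "REJ", "RSTO", "RSTOS0", "RSTR",
          "S0", "S1", "S2", "S3", "SF", "SH"]
state_codes = build_codes(states)

# Encode the connection state of a record, e.g. the "SF" of the sample records
encoded = state_codes["SF"]
```

The same helper can be applied to the connection types, service types and attack types, giving the ranges 0–2, 0–69 and 0–22 respectively.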

4.3.2 Data standardization

ACTrAdaBoost uses maximum mean discrepancy to select data, which involves calculating distances. The continuous feature attributes in the datasets are measured on different scales, so the raw distance between records strongly affects the accuracy of the results. To eliminate the influence of these measurement differences, the continuous feature attributes are standardized, while the discrete feature attributes are not. Assume that there are $N$ connection records, each containing a vector of 20 continuous attributes $X_{ij}(1\leqslant i\leqslant N,11\leqslant j\leqslant 31)$ , and that the standardized value of $X_{ij}$ is $X^{\prime}_{ij}$ .

$\displaystyle X^{\prime}_{ij}=\frac{X_{ij}-\textit{AVG}_{j}}{S_{j}}$ (7)
$\displaystyle\textit{AVG}_{j}=\frac{1}{N}(X_{1j}+X_{2j}+\ldots+X_{Nj})$ (8)
$\displaystyle S_{j}=\frac{1}{N}(|X_{1j}-\textit{AVG}_{j}|+|X_{2j}-\textit{AVG}_{j}|+\ldots+|X_{Nj}-\textit{AVG}_{j}|)$ (9)

Equation (8) denotes the average value, and Eq. (9) represents the average absolute deviation. The following judgments need to be made in the above calculation:

If $X_{ij}=0$ , then $X_{ij}^{\prime}=0$ .

If $S_{j}=0$ , then $X_{ij}^{\prime}=0$ .

After standardization, the values are normalized to [0, 1] for the convenience of calculation. Let $X^{\prime\prime}_{ij}$ denote the normalized value.

$\displaystyle X^{\prime\prime}_{ij}=\frac{X^{\prime}_{ij}-X_{\min}}{X_{\max}-X_{\min}}$ (10)
$\displaystyle X_{\min}=\min(X^{\prime}_{ij})$ (11)
$\displaystyle X_{\max}=\max(X^{\prime}_{ij})$ (12)
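Equations (7)–(12) can be sketched with NumPy as follows. This is an illustration under assumptions rather than the authors' code. The function names `standardize` and `min_max` are hypothetical. The minimum and maximum in Eqs. (11) and (12) are assumed to be taken per attribute (per column), which the text does not specify. A column with zero deviation, or with equal minimum and maximum, is mapped to 0.

```python
import numpy as np

def standardize(X):
    """Eqs. (7)-(9): subtract the mean and divide by the mean absolute deviation."""
    X = np.asarray(X, dtype=float)
    avg = X.mean(axis=0)                      # Eq. (8)
    s = np.abs(X - avg).mean(axis=0)          # Eq. (9)
    safe_s = np.where(s == 0, 1.0, s)         # avoid division by zero
    Z = (X - avg) / safe_s                    # Eq. (7)
    Z[:, s == 0] = 0.0                        # rule: S_j = 0 gives X'_ij = 0
    return Z

def min_max(Z):
    """Eqs. (10)-(12): rescale each column to [0, 1] (per-column min/max assumed)."""
    Z = np.asarray(Z, dtype=float)
    z_min, z_max = Z.min(axis=0), Z.max(axis=0)
    span = np.where(z_max == z_min, 1.0, z_max - z_min)
    return (Z - z_min) / span

# Hypothetical data: three records, two continuous attributes (second is constant)
X = [[1.0, 5.0], [2.0, 5.0], [3.0, 5.0]]
Z = standardize(X)
Z01 = min_max(Z)
```

Dividing by the mean absolute deviation rather than the standard deviation makes the scaling less sensitive to the extreme values that are common in traffic counters.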

4.4 Analysis of experimental results

4.4.1 Detection rate, false positive rate and false negative rate of ACTrAdaBoost

The numbers of samples of the five types in the 10% KDDCUP99 and DARPA 1998 datasets differ. An intrusion detection algorithm achieves high prediction accuracy, a low false positive rate and a low false negative rate for types with many samples, and the opposite for types with few samples. Classifiers trained by traditional machine learning algorithms rely on sufficient training samples. Therefore, their detection accuracy for types such as Normal, Probe and DOS is high, while their detection accuracy for the U2R and R2L attack types is low. Traditional machine learning intrusion detection algorithms treat all attack types as a single attack class and do not distinguish individual attacks. They may therefore achieve higher detection accuracy for one attack type but lower accuracy for others, so it is difficult to balance the prediction accuracy across all attack types. ACTrAdaBoost can be trained for each specific attack type rather than treating all attacks as one. It transfers knowledge from a large number of existing intrusion detection samples into the training process for attack types with sparse samples, which effectively alleviates the low detection rate of U2R and R2L caused by their small sample counts. ACTrAdaBoost can therefore mitigate the low detection rate of sparse-sample attacks, yield good detection rates for all attack types, and balance detection across types. Tables 8 and 9 show the detection rate, false positive rate and false negative rate of ACTrAdaBoost on the intrusion detection datasets.

Table 8
Detection rate, false positive rate and false negative rate of ACTrAdaBoost on KDD CUP99 and DARPA 1998

Intrusion types	KDD CUP99			DARPA 1998
	CR (%)	FR (%)	MR (%)	CR (%)	FR (%)	MR (%)
Normal	99.98 $+$	2.98	1.28	99.99	2.15	1.67
DOS	99.92 $+$	1.82	1.12	99.51	2.26	1.12
Probe	97.78	3.82	2.02	99.36	4.56	2.01
R2L	92.25	15.84	10.68	90.21	11.14	9.25
U2R	75.95	6.82	7.72	74.86	9.25	10.26

Table 9

Detection rate, false positive rate and false negative rate of ACTrAdaBoost on ISCX2012

Intrusion types	ISCX2012
	CR (%)	FR (%)	MR (%)
BFSSH	95.87	2.18	1.02
Normal	99.97	4.56	1.53
Infiltration	96.98	2.18	2.24
HttpDoS	88.75	4.02	3.52
DDoS	97.25	5.84	3.68

As can be seen from Tables 8 and 9, as with traditional machine learning algorithms, the detection rate of ACTrAdaBoost is higher for the Normal, DOS and Probe types, which have sufficient training samples, while the detection rate for the U2R, R2L and HttpDoS attack types, which have few samples, is relatively low. In terms of false positive rate and false negative rate, for attack types with more training samples the false positive rate is higher and the false negative rate is lower, and vice versa.

4.4.2 Comparison of ACTrAdaBoost with SVM, SVMt and TrAdaBoost

In order to better illustrate the application effect of ACTrAdaBoost in intrusion detection, this part compares ACTrAdaBoost with three benchmark algorithms in terms of time cost, detection accuracy, false positive rate and false negative rate.

Table 10
Time consumption of ACTrAdaBoost on DARPA1998 and ISCX2012

Datasets	ACTrAdaBoost	TrAdaBoost	SVMt	SVM
DARPA1998	5.25	7.89	6.89	7.24
ISCX2012	2.35	3.48	2.79	3.09

Table 11

Comparison of detection rate (%), false positive rate (%) and false negative rate (%) of ACTrAdaBoost, TrAdaBoost, SVMt and SVM on DARPA1998 dataset

Algorithm	DoS			Normal			Probe			R2L			U2R
	CR	FR	MR	CR	FR	MR	CR	FR	MR	CR	FR	MR	CR	FR	MR
ACTrAdaBoost	99.51	2.26	1.12	99.99	2.15	1.67	97.78	3.82	2.02	90.21	11.14	9.25	74.86	9.25	10.26
TrAdaBoost	96.24	2.86	1.65	97.76	2.76	1.87	94.25	3.98	2.14	88.47	11.85	9.86	71.26	9.68	10.55
SVMt	91.38	3.14	2.01	93.65	3.11	2.12	90.87	4.21	2.86	85.32	12.01	10.14	68.65	10.02	11.01
SVM	88.57	3.68	2.26	91.87	3.54	2.76	87.97	4.75	3.27	82.85	12.58	11.76	65.26	10.45	11.25

Figure 4.

Comparison of ACTrAdaBoost, TrAdaBoost, SVMt and SVM in time consumption.

Figure 4 and Table 10 show the training time of the ACTrAdaBoost algorithm and the three benchmark algorithms. The training time of ACTrAdaBoost is shorter than that of all three benchmark algorithms. The reason is that ACTrAdaBoost actively labels the most informative samples and so reduces the size of the training set, which lowers the cost of training the classifier and hence the time consumption.

Table 12

Comparison of detection rate (%), false positive rate (%) and false negative rate (%) of ACTrAdaBoost, TrAdaBoost, SVMt and SVM on ISCX2012 dataset

Algorithm	BFSSH			Infiltration			HttpDoS			DDoS
	CR	FR	MR	CR	FR	MR	CR	FR	MR	CR	FR	MR
ACTrAdaBoost	95.87	2.18	1.02	96.98	2.18	2.24	88.75	4.02	3.52	97.25	5.84	3.68
TrAdaBoost	91.45	2.98	1.28	92.67	2.87	2.65	85.53	4.26	3.85	94.46	6.02	4.12
SVMt	88.59	3.01	1.87	89.21	3.25	2.98	82.87	4.98	4.25	91.58	6.45	4.86
SVM	85.35	3.35	2.87	85.88	4.02	3.14	80.25	5.12	4.58	89.15	7.12	5.27

Figure 5.

Comparison of detection rates of ACTrAdaBoost, TrAdaBoost, SVMt and SVM on KDD CUP99 dataset.

Figure 6.

Comparison of false positive rate of ACTrAdaBoost, TrAdaBoost, SVMt and SVM on KDD CUP 99 dataset.

Figure 7.

Comparison of false negative rate of ACTrAdaBoost, TrAdaBoost, SVMt and SVM on KDD CUP 99 dataset.

Figure 8.

Iteration curve of ACTrAdaBoost, TrAdaBoost.

The following conclusions can be drawn from Tables 11 and 12, Figs 5–8:

Adequate training samples are the basis for training a machine learning algorithm to obtain a highly accurate classifier. The Normal, Probe and DOS types have many samples, so the detection rate of ACTrAdaBoost and the benchmark algorithms for these three types is very high, reaching more than 97% on the DARPA 1998 and KDD CUP 99 datasets. Similarly, on the ISCX2012 dataset, the detection rate for the more numerous BFSSH, Infiltration and DDoS attack types reaches more than 95%. Compared with the benchmark algorithms, ACTrAdaBoost achieves a higher detection rate on the DARPA 1998, KDD CUP 99 and ISCX2012 datasets.

Because the attack types U2R and R2L have few samples, the detection rate of traditional intrusion detection algorithms for them is low. ACTrAdaBoost transfers knowledge from a large number of existing intrusion detection samples and applies it to detect attack types with insufficient samples, so its detection rate for U2R and R2L is improved to a certain extent. From Fig. 5, it can be seen that the detection rate of ACTrAdaBoost on U2R and R2L exceeds 75%, a large improvement for both types and for R2L in particular. With the three benchmark algorithms, the detection rate of U2R is below 70% and that of R2L is below 35%. As can be seen from Table 11, the detection rate of ACTrAdaBoost for U2R and R2L is also the best among the algorithms. Therefore, ACTrAdaBoost can improve the detection rate of the U2R and R2L attack types, which have few samples.

Figure 6 and Table 11 show that the false positive rates of Normal, Probe and DOS on the KDD CUP99 and DARPA1998 datasets do not exceed 6.5%, and that ACTrAdaBoost performs best, reducing the false positive rate to less than 4%. For U2R and R2L, the three benchmark algorithms perform poorly: their false positive rates for U2R and R2L reach 8.5% and 18.5% on the KDD CUP 99 dataset and exceed 10% on the DARPA1998 dataset. ACTrAdaBoost, however, performs relatively well, staying below 10% on both datasets.

In terms of false negative rate, as can be seen from Fig. 7 and Tables 11 and 12, ACTrAdaBoost has some advantages over the benchmark algorithms on the five attack types.

ACTrAdaBoost is an iterative algorithm like TrAdaBoost. Convergence is an important indicator. In Fig. 8, the convergence curves of the two algorithms on the intrusion detection dataset are given. We can see that ACTrAdaBoost converges faster than TrAdaBoost, so the ACTrAdaBoost algorithm performs better.

The detection rate of ACTrAdaBoost for the five attack types is higher than that of the three benchmark algorithms, and the detection rate for attack types with fewer samples is significantly improved. The experimental results show that ACTrAdaBoost improves the detection rate for all five kinds of attacks, especially for R2L attacks with sparse samples. No attack type has an excessively low detection rate, and the detection rates no longer differ greatly across types. ACTrAdaBoost thus effectively alleviates the imbalance of attack type detection in machine learning algorithms. It also has significant advantages over the benchmark algorithms in terms of false positive rate and false negative rate. Finally, the results on the three datasets indicate that ACTrAdaBoost generalizes better.

5. Conclusions

This paper proposes ACTrAdaBoost, an intrusion detection algorithm based on TrAdaBoost that combines transfer learning, active learning and maximum mean discrepancy (MMD). ACTrAdaBoost uses active learning to label a large number of intrusion detection samples in the source domain, forming a new source domain whose samples carry abundant information. MMD helps avoid negative transfer because it selects the source-domain samples that are most similar to the target domain. On the one hand, ACTrAdaBoost transfers knowledge from historical intrusion detection samples to the current target-domain learning task, which reduces the cost of collecting samples. On the other hand, active learning labels the most informative samples in the source domain, and MMD screens the source-domain samples that are most similar to the target domain, which reduces the scale of the training set. ACTrAdaBoost thus reduces the cost of building the classifier, mitigates negative transfer in transfer learning, and improves the classification result. To show that ACTrAdaBoost generalizes well, experiments were carried out on three public intrusion detection datasets: KDD CUP99, DARPA1998 and ISCX2012. Experimental results on the 10% KDDCUP99 and DARPA1998 datasets show that the detection rate of ACTrAdaBoost, especially for the U2R and R2L attack types with fewer samples, is much higher than that of the benchmark algorithms. Detection is also better balanced across attack types than with the benchmark algorithms, and the algorithm has a clear advantage in time efficiency. In future work, we plan to study the multi-classification problem of ACTrAdaBoost. In addition, since ACTrAdaBoost can only transfer knowledge from one source domain, how to transfer knowledge from multiple source domains needs further study.

References

1. Yin C.L., Zhu Y.F., Fei J.L. et al., A deep learning approach for intrusion detection using recurrent neural networks, IEEE Access, 2017, 21954–21961.
2. Tsai C.F., Hsu Y.F., Lin C.Y. et al., Intrusion detection by machine learning: A review, Expert Systems with Applications, 2009, 11994–12000.
3. Denning D.E., An Intrusion-Detection Model, IEEE Press, 1987.
4. Stolfo S.J., Lee, Chan P.K. and Fan, Data mining-based intrusion detectors: An overview of the Columbia IDS project, ACM SIGMOD Record, 2001, 5–14.
5. Javaid A.Y., Niyaz, Sun et al., A Deep Learning Approach for Network Intrusion Detection System, in: 9th EAI International Conference on Bio-inspired Information and Communications Technologies, 2015.
6. Yan and Liu, A New Method of Transductive SVM-Based Network Intrusion Detection, in: Computer & Computing Technologies in Agriculture IV - IFIP TC 12 Conference, DBLP, 2017.
7. Tiwari, Roy S.S., Charaborty et al., A novel hybrid model for network intrusion detection, in: International Conference on Green Computing, IEEE, 2014.
8. Pichara K. and Soto, Active learning and subspace clustering for anomaly detection, Intelligent Data Analysis, 2011, 151–171.
9. Makanju, Zincir-Heywood A.N. and Milios E.E., Robust learning intrusion detection for attacks on wireless networks, Intelligent Data Analysis, 2011, 801–823.
10. Roshan, Miche, Akusok et al., Adaptive and online network intrusion detection system using clustering and Extreme Learning Machines, Journal of the Franklin Institute, 2017.
11. Cheng, Tay W.P. and Huang G.B., Extreme learning machines for intrusion detection, in: International Joint Conference on Neural Networks, IEEE, 2012.
12. Teng, Yang et al., Intrusion Detection System Based on Improved SVM Incremental Learning, in: International Conference on Artificial Intelligence & Computational Intelligence, IEEE, 2010.
13. Wang and Ma, Optimization of Neural Networks for Network Intrusion Detection, in: International Workshop on Education Technology & Computer Science, IEEE, 2009.
14. Chang and Yang, Network Intrusion Detection Based on Random Forest and Support Vector Machine, in: IEEE International Conference on Computational Science & Engineering, IEEE, 2017.
15. Xiang and Westerlund, Using extreme learning machine for intrusion detection in a big data environment, in: The Workshop on Artificial Intelligent and Security Workshop, ACM, 2014.
16. Krishnakumar, Active Learning Literature Survey, 2007.
17. Bell B.S. and Kozlowski S.W.J., Active Learning: Effects of Core Training Design Elements on Self-Regulatory Processes, Learning, and Adaptability, Journal of Applied Psychology, 2008, 296–316.
18. Liang, Wenbo and Yupu, Active learning support vector machines with low-rank transformation, Intelligent Data Analysis, 2018, 701–715.
19. C.J. and Yang Y.P., A batch-mode active learning SVM method based on semi-supervised clustering, Intelligent Data Analysis, 2015, 345–358.
20. Zhuang F.Z., Luo and Shi Z.Z., Survey on transfer learning research, Ruan Jian Xue Bao/Journal of Software, 2015, 26–39 (in Chinese). http://www.jos.org.cn/1000-9825/4631.htm.
21. Zhu Z.F. et al., Transfer active learning, 2011, 2169–2172.
22. Weiss, Khoshgoftaar T.M. and Wang D.D., A survey of transfer learning, Journal of Big Data, 2016, 9.
23. Day and Khoshgoftaar T.M., A survey on heterogeneous transfer learning, Journal of Big Data, 2017, 29.
24. Quanz and Huan, Large margin transductive transfer learning, in: ACM Conference on Information and Knowledge Management, ACM, 2009.
25. Pan S.J., Kwok J.T. and Yang, Transfer learning via dimensionality reduction, in: AAAI Conference on Artificial Intelligence, AAAI, 2008.
26. Dai, Yang, Xue G.R. et al., Boosting for transfer learning, in: International Conference on Machine Learning, ACM, 2007.
27. Yao and Doretto, Boosting for transfer learning with multiple sources, in: Computer Vision and Pattern Recognition, IEEE, 2010.
28. Tharwat, Hassanien A.E. and Elnaghi B.E., A BA-based algorithm for parameter optimization of support vector machine, Pattern Recognition Letters, 2016.
29. Boser B.E., Guyon I.M. and Vapnik V.N., A training algorithm for optimal margin classifiers, in: The Workshop on Computational Learning Theory, 1992.
30. Chang C.C. and Lin C.J., A Library for Support Vector Machines, 2011.
31. Tavallaee, Bagher et al., A detailed analysis of the KDD CUP 99 data set, in: IEEE International Conference on Computational Intelligence for Security & Defense Applications, 2009.
32. Dekel, Gentile and Sridharan, Selective sampling and active learning from single and multiple teachers, Journal of Machine Learning Research 13(1) (2016), 2655–2697.
33. Liu, Lang, Liu and Yan H.B., CNN and RNN based payload classification methods for attack detection, Knowledge-Based Systems 163 (1 January 2019), 332–341.
34. Wang, Sheng, Wang et al., HAST-IDS: Learning hierarchical spatial-temporal features using deep neural networks to improve intrusion detection, IEEE Access (99) (2017), 1.
35. Liu J.J., Liu and Luo X.L., Semi-supervised learning method, Chinese Journal of Computers (8) (2015), 1592–1617.
