Abstract
The early detection of and warning about Advanced Persistent Threat (APT) attacks is a major challenge for intrusion monitoring and prevention systems. Current studies and proposals for APT attack detection often combine machine-learning techniques with analysis of APT malware behavior in network traffic. To improve the efficiency of APT attack detection, this paper proposes a new approach based on a combination of deep learning networks and ATTENTION networks. The proposed detection process is as follows. First, all network traffic data is pre-processed and analyzed by the CNN-LSTM deep learning network, a combination of a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) network. Then, instead of being used directly for classification, the output is analyzed and evaluated by the ATTENTION network. Finally, the output of the ATTENTION network is classified to identify APT attacks. To the best of our knowledge, this optimization has not been proposed or applied in previous research. Scenarios comparing the proposed method with other approaches (Section 4.4) show that the proposed approach is more effective. The results indicate that the method has both scientific and practical significance, because combining deep learning with the ATTENTION network improves the efficiency of analyzing and detecting APT malware based on network traffic.
Introduction
Problem
The APT attack is an advanced, persistent, and targeted attack technique [1–3]. Its sophistication is expressed in its characteristics, process, and life cycle [1–3]. Most APT attack campaigns are backed by government organizations with substantial human and material resources. Consequently, key agencies and government organizations around the world are the main targets of this attack technique. According to statistics [4], the APT attack is among the most dangerous current cyber-attack techniques in the world. Moreover, according to assessments [5, 6], this attack technique will continue to grow in both scale and danger. Early detection of and warning about such attack campaigns is therefore essential.
Studies [1, 7–15] listed several approaches for APT attack detection based on its characteristics, process, and life cycle. In particular, to detect signs of an APT attack, these approaches often focus on seeking, analyzing, and evaluating signs and behaviors of APT malware in network datasets using machine learning techniques. According to studies [2, 16], experimental datasets that can be used for this analysis include Web logs, DNS logs, and network traffic. The network traffic dataset is considered well suited to APT attack detection because it records many signs and behaviors of APT attacks [1, 2].
Studies [9, 17] pointed out the differences between the APT attack and other cyber-attack techniques. These differences create many difficulties and challenges for systems that detect and monitor this attack type. In addition, studies [9, 17] presented some difficulties and challenges in detecting APT attacks using machine learning and deep learning, such as the lack of public APT attack data, the imbalance in monitoring data, and the difficulty of identifying APT attack behaviors. To resolve these problems, studies [1, 18] proposed different methods and approaches, including Anomaly Detection, Pattern Matching, and Graph Analysis. These approaches gave good results. However, because monitoring data is increasingly abundant and diverse, processing and analysis must change to accommodate the current APT detection task. Two main aspects of APT attack detection still need to be improved [18]:
The way of identifying abnormal behaviors: APT attacks often use their own technologies and malware to exploit and attack the target. It is therefore difficult to find features and behaviors that distinguish an APT attack from other attacks or from normal access. Previous approaches [19–24] often try to extract APT behaviors from DNS logs and network traffic. However, since each APT attack is designed for a specific purpose and target, using the behaviors of one attack campaign as a basis for detecting other campaigns is ineffective. In addition, extracting features of APT attacks often requires a very large dataset collected over a long period of time, which makes these features very difficult to extract.
Detection method: To detect and classify APT attacks, research directions often apply two main techniques: machine learning and deep learning [1, 25]. Machine learning algorithms have the advantages of short processing time and good accuracy. However, these algorithms often have difficulty when monitoring large datasets, so deep learning algorithms are a possible approach for this problem. Although deep learning algorithms have yielded good results, they still have some disadvantages that need to be resolved, such as their limited ability to remember time series and to focus on important information [25–29]. These disadvantages lower the efficiency of some detection methods.
To address these two problems, this paper proposes combining the CNN-LSTM deep learning network with the ATTENTION network (CNN-LSTM-ATTENTION) to analyze and detect APT attack behaviors based on network traffic. The proposed approach addresses each problem as follows.
Regarding "identifying abnormal behaviors": when selecting and extracting features of APT attacks from network traffic, instead of searching for abnormal behaviors and features specific to APT attacks, this study uses the statistical features of flows and then uses the CNN-LSTM-ATTENTION model to calculate, process, and synthesize them into IP features. This approach avoids the difficulty of defining abnormal behaviors and eases the problem of extracting features of APT attacks.
Regarding the "detection method": although the CNN-LSTM deep learning model partly solves the problem of remembering time series, it often loses important information. When a long time series is processed, the model gives later information priority over earlier information, whereas in reality important information may appear anywhere in the series. The CNN-LSTM-ATTENTION model mitigates this problem by calculating the mean of the hidden states of all time steps (ATTENTION 2) or the mean of hidden states combining multiple inputs (ATTENTION 1). In this way, the model finds and synthesizes more of the important information about IPs in network traffic, which improves the efficiency of APT attack detection and reduces false detections.
Contributions
To optimize the detection of APT attacks, this study makes the following contributions.
Proposing the CNN-LSTM-ATTENTION model for the task of synthesizing, analyzing, evaluating, and detecting APT IPs. In this model, normalization, processing, feature extraction, and classification are carried out through several networks and layers, where the output of each network is the input of the next. Together they form a continuous, synchronized computational system, which requires a complex process of data pre-processing, implementation, and computation. To the best of our knowledge, the CNN-LSTM-ATTENTION model has not been proposed in previous research, so applying it successfully to APT attack detection has both practical and scientific value.
Proposing the architectures of the ATTENTION 1 network and the ATTENTION 2 network for the task of finding and synthesizing important IP information, thereby improving the efficiency of classifying APT IPs as well as normal IPs. The ATTENTION 1 network highlights important IP information by calculating the mean of hidden states combining multiple inputs. The ATTENTION 2 network highlights important information by calculating the mean of the hidden states of all time steps. These two architectures provide alternative ATTENTION networks to choose from when optimizing both APT attack detection and network anomaly detection.
Related work
Chu et al. [28] conducted an experiment to detect APT attacks on the NSL-KDD dataset using the Multilayer Perceptron (MLP) and Support Vector Machine (SVM) algorithms. During the experiment, the authors varied some kernel parameters of the two algorithms. The MLP model yielded better APT attack detection results in some specific cases. Similar to Chu’s approach [28], Nkiruka Eke [30] used the KDD 99 dataset for APT attack detection based on deep learning algorithms such as LSTM, Recurrent Neural Networks (RNN), and Gated Recurrent Units (GRU). Joloudari [31] proposed using the C5.0 Decision Tree, the Bayesian network, and deep learning algorithms to detect APT attacks on the NSL-KDD dataset. The experimental results showed that the 6-layer deep learning algorithm gave better results than the C5.0 Decision Tree and Bayesian network algorithms. The accuracy of these three methods was 98.85%, 95.64%, and 88.37%, respectively. In addition, the 6-layer deep learning algorithm proposed by the author was also highly effective (the error is only 1.13%). Likewise, Cosimo [32] proposed the Autoencoder network for detecting cyber-attacks on the NSL-KDD dataset. In the experimental section, the Autoencoder network outperformed other models and algorithms in the APT attack detection task. However, we found that the NSL-KDD and KDD 99 datasets were standardized and balanced in the amount of clean data and attack data. Had the authors used the Random Forest algorithm, the results would have been higher than those of their proposed deep learning algorithms. Besides, Branka’s research [2] demonstrated that the KDD 99 dataset is no longer appropriate for the actual demands of APT attack detection. In addition, Sai Charan et al. [33] proposed using the LSTM network model to detect APT attacks in banking systems based on Security Information and Event Management systems.
In particular, the team [33] used a big data technology platform to implement the LSTM model for APT attack detection at each stage of the attack's development. The experimental section evaluated the effectiveness of the LSTM model by measuring the time needed to process and detect attacks for different divisions of the experimental dataset.
On the other hand, Yan [19] proposed detecting APT domains with a CNN deep learning model. Specifically, the author proposed 3 abnormal behavior groups of APT domains based on DNS requests: Domain Name-based Features; Feature of the Relationship between DNS Request Behavior and Response Behavior; Feature of the Relationship between DNS Request Behavior and Response Behavior. These were extracted from a dataset consisting of 4,907,147,146 initial records covering 47 days of DNS requests on the Jilin University Education Network. Experimental results showed that this model was relatively effective, with results in the range of 96% to 97%. With the same approach, Zongyuan [21] proposed a method to detect C&C APT domains from DNS logs of mobile devices and PCs, based on 9 abnormal behaviors and using different machine learning and deep learning algorithms.
In addition, Hofer-Schmitz et al. [34] presented an optimized method for APT IP detection based on the CICIDS2017 dataset. Specifically, the authors used the PCA algorithm to reduce the flow feature dimension from 76 to 66 and then divided the dataset into 3 groups, processing and analyzing each group separately.
Bodström et al. [27] proposed a deep learning model combining many different models and algorithms. Each layer aggregates and analyzes features, and its output serves as the input to the following layers. According to the team’s analysis, this combination takes advantage of deep learning algorithms in the task of analyzing and aggregating features.
Cho et al. [16] proposed an APT attack detection method using combined deep learning. The authors proposed a deep learning model that combines Bidirectional Long Short-Term Memory (BiLSTM) and Graph Convolutional Networks (GCN) (BiLSTM-GCN), and also applied basic deep learning models such as MLP and GCN, to monitor and detect APT attacks based on network traffic. Experimental results showed that the BiLSTM-GCN model gave the highest efficiency on all measures compared to the other deep learning models.
Along with the idea of Cho [17], studies [25, 35] proposed applying the CNN-LSTM combined deep learning model to the feature extraction task in order to detect network anomalies on the CICIDS2017 dataset. During the experiment, Pengfei [35] compared the proposed method with several other methods and found that the CNN-LSTM deep learning model gave the best results on all measures.
The proposed model
The model architecture
Figure 1 shows the multi-layer model for APT attack detection based on Network Traffic.

The APT attack detection model using the CNN-LSTM-ATTENTION model.
Figure 1 illustrates the APT detection method based on network flows using the CNN-LSTM-ATTENTION model. The proposed model includes the following components.
Feature Extraction phase. This phase includes the following steps:
Feature extraction flow: At this stage, all network traffic is converted into network flows by the CICFlowMeter tool [37]. The flows are then grouped by pairs of source IP (SrcIP) and destination IP (DstIP) addresses. In the flow feature extraction step, the flows grouped by SrcIP and DstIP pair are analyzed and converted into features that represent the differences between flows within each pair.
Representing IP information: Each SrcIP is represented by a single vector built from the network flows grouped in the feature extraction flow step.
IP feature extraction: The vectors produced in the “Representing IP information” step are fed into the CNN-LSTM-ATTENTION model to extract IP features. Two different ATTENTION networks are proposed to optimize this process.
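The grouping step in the feature extraction flow can be illustrated with a minimal sketch. It assumes each flow is a dictionary with hypothetical keys `src_ip`, `dst_ip`, and `duration`; these names are chosen for illustration and are not the column names produced by CICFlowMeter, and the statistics shown are only an example of the kind of flow statistics that could be computed per IP pair.

```python
from collections import defaultdict
from statistics import mean, pstdev


def group_flows_by_ip_pair(flows):
    """Group flow records by their (source IP, destination IP) pair."""
    groups = defaultdict(list)
    for flow in flows:
        groups[(flow["src_ip"], flow["dst_ip"])].append(flow)
    return dict(groups)


def summarize_pair(pair_flows, field):
    """Return simple statistics of one numeric flow field for an IP pair."""
    values = [flow[field] for flow in pair_flows]
    return {
        "count": len(values),
        "mean": mean(values),
        "std": pstdev(values),  # population standard deviation
        "min": min(values),
        "max": max(values),
    }


# Toy flow records (hypothetical); real ones would come from a flow exporter.
flows = [
    {"src_ip": "10.0.0.5", "dst_ip": "203.0.113.7", "duration": 2.0},
    {"src_ip": "10.0.0.5", "dst_ip": "203.0.113.7", "duration": 4.0},
    {"src_ip": "10.0.0.9", "dst_ip": "198.51.100.2", "duration": 1.0},
]
groups = group_flows_by_ip_pair(flows)
stats = summarize_pair(groups[("10.0.0.5", "203.0.113.7")], "duration")
```

In this sketch, each IP pair yields one small statistics record per flow field; stacking such records over all fields would give a per-pair feature vector of the kind the later steps consume.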
The ATTENTION network is one of the most interesting models in the deep learning field. Shi Feng et al. [39] first proposed using the ATTENTION network for the machine translation problem. Jiachen [40] proposed applying attention to the task of text classification. In studies [39–41], ATTENTION models were proposed as a computational block inside models such as CNN, RNN, and LSTM. This paper instead proposes using ATTENTION models for the task of extracting features and classifying APT attacks. The proposed ATTENTION models operate independently and do not take part in the calculation and synthesis process of the CNN-LSTM model. With this approach, we expect the ATTENTION model to synthesize more important and meaningful features. The next sections present the architecture and process of the CNN-LSTM-ATTENTION models in detail.
Proposing the CNN-LSTM-ATTENTION 1 model
Proposing architecture of the CNN-LSTM-ATTENTION 1 model
Figure 2 shows the architecture of the proposed CNN-LSTM-ATTENTION 1 model for APT attack detection. As Fig. 2 shows, the model is a combination of 3 main components: the CNN deep learning network, the LSTM deep learning network, and the ATTENTION network. This paper combines and standardizes these networks into a unified model for the task of extracting features and classifying IPs based on network traffic. Details of the main components of the CNN-LSTM-ATTENTION 1 model are as follows:

The architecture of the CNN-LSTM-ATTENTION 1 linking model.

The study [44] presented and defined the parameters and operation process of the LSTM network in detail. This paper uses this model to find ways to aggregate and extract IP features based on the flow network.
Where: $W_f$, $W_h$, $W_g$ are the coefficient matrices for linear transformations; $H$ is the output matrix of the CNN-LSTM model. The matrix $H$ is defined as follows (3):
Where: m is the size of the feature extracted from the CNN model, n is the size of the LSTM network.
Here, the softmax function is defined and described in detail in [46], and d is a chosen constant by which the data is divided, as presented in [47]. The softmax function is computed over n, the size of the LSTM network.
Where: matrix C is the weight matrix synthesized according to formula (4). The result of this block is a matrix M that fully represents the features and characteristics of the input data and the weights. Next, the matrix M is fed into the classification model to decide whether an APT attack is present.
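To make the roles of the symbols above concrete, the following is a minimal sketch of one common attention form, scaled dot-product self-attention. It is an assumption, not a reproduction of formulas (1) to (4), which are not shown in this text: it assumes that $W_f$ and $W_g$ produce the two projections whose scaled product is passed through softmax to give the weight matrix C, and that $W_h$ produces the projection that C then weights to give M. The helper names and the toy matrices are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def softmax(row):
    """Numerically stable softmax of one row."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(H, W_f, W_g, W_h, d):
    """Assumed form: C = softmax(F G^T / d) row-wise, then M = C V."""
    F = matmul(H, W_f)  # first projection of H (assumed role of W_f)
    G = matmul(H, W_g)  # second projection of H (assumed role of W_g)
    V = matmul(H, W_h)  # projection that gets weighted (assumed role of W_h)
    scores = matmul(F, transpose(G))
    C = [softmax([s / d for s in row]) for row in scores]
    M = matmul(C, V)
    return C, M


# Toy input: 3 rows of CNN-LSTM output with 2 hidden units each (hypothetical).
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
C, M = self_attention(H, identity, identity, identity, d=math.sqrt(2))
```

Under these assumptions, every row of C sums to 1, so each row of M is a weighted mean of the projected rows of H, which matches the description of ATTENTION 1 as a mean of hidden states combining multiple inputs.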
b) APT attack detection process using the CNN-LSTM-ATTENTION 1 model
As Fig. 2 shows, the CNN-LSTM-ATTENTION 1 model is divided into 3 main stages:
Proposing architecture of the CNN-LSTM-ATTENTION 2 model
As Fig. 4 shows, the architecture of the CNN-LSTM-ATTENTION 2 model has 4 main blocks: the CNN network, the LSTM network, the ATTENTION 2 network, and the classification block. The CNN and LSTM deep learning networks and the classification block work in the same way as in the CNN-LSTM-ATTENTION 1 model. The only difference is in the processing blocks inside the ATTENTION 2 network, which are as follows:

The architecture of the CNN-LSTM-ATTENTION 2 model.
In the above formula, W is a 2-dimensional initial weight matrix of size d × n (d is set arbitrarily, n is the size of the LSTM network); Y is a weight vector of size d; H is the matrix defined in formula (3).
Where: X is a 2D matrix of size m × d; m is the size of the CNN model.
From formula (8), it can be seen that the result of the feature aggregation block is a matrix that fully represents the features of the entire input representation. Next, this matrix is fed into the classification model to determine signs of APT attacks. The processing of the classification block is presented in detail for the ATTENTION 1 model.
b) APT attack detection process using the CNN-LSTM-ATTENTION 2 model
The processing and calculation process to detect APT using CNN-LSTM-ATTENTION 2 model is similar to the process using CNN-LSTM-ATTENTION 1 model. There is only one difference in the processing of the ATTENTION 2 network. Accordingly, instead of using the ATTENTION 1 network to synthesize and highlight important information, the ATTENTION 2 network is used to minimize the loss of important and significant information by calculating the mean of weights of all features after these features have been processed by the CNN-LSTM model. With this approach, features are averaged and aggregated into vectors that contain all the important information and components of IPs in frames. Then, the output vectors of the ATTENTION 2 network are put into the Softmax Classifier to classify APT and normal frames. Finally, the frames containing APT or normal flows are used to classify normal IPs and APT IPs. This whole process is presented in step 4 and step 5 in section 3.2.1 of the paper.
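The idea of averaging over all time steps, rather than relying on the final state only, can be shown with a minimal sketch. It assumes the LSTM output is available as a list of hidden-state vectors, one per time step; the function names and the toy values are hypothetical, and the sketch illustrates plain mean pooling rather than the exact ATTENTION 2 formulas, which are not reproduced in this text.

```python
def last_hidden_state(hidden_states):
    """Summary that keeps only the final time step."""
    return list(hidden_states[-1])


def mean_hidden_state(hidden_states):
    """Summary that averages each hidden unit over all time steps."""
    steps = len(hidden_states)
    return [sum(column) / steps for column in zip(*hidden_states)]


# Toy sequence: 3 time steps, 2 hidden units (hypothetical values).
hidden_states = [
    [0.9, 0.1],  # an early step carrying a strong signal in unit 0
    [0.1, 0.1],
    [0.0, 0.2],
]
last_summary = last_hidden_state(hidden_states)
mean_summary = mean_hidden_state(hidden_states)
```

In this toy sequence the strong early signal in the first hidden unit is absent from the last-step summary but still contributes to the mean summary, which is the kind of information loss the averaging is meant to reduce.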
Experimental dataset
The positive experimental data (attack data) was collected from 29 network traffic files in the Malware Capture CTU-13 dataset, which contains 6 types of malware used in APT attacks: Andromeda, Colbalt, Cridex, Dridex, Emotet, and Gh0stRAT [47].
The negative experimental data (normal data) was collected from E-Government server of Soc Trang province [48] according to the scientific research project N° KC.01.05/16-20 of the Ministry of Science and Technology of Vietnam. This dataset was collected on July 30, 2019.
Table 1 shows the statistics of the experimental data collected and used in this paper.
Components of experimental dataset
For the dataset: The dataset is divided into 2 parts for training and testing. The training dataset accounts for 80% and the testing dataset accounts for 20%.
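An 80/20 division of this kind can be sketched as follows. The shuffling, the fixed seed, and the function name `split_80_20` are assumptions made for illustration; the text does not state how the samples were ordered or whether the split was stratified by class.

```python
import random


def split_80_20(samples, seed=0):
    """Shuffle a copy of the samples and split it into 80% and 20% parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


# Toy samples (hypothetical); real samples would be labeled frames or IPs.
samples = list(range(100))
train_set, test_set = split_80_20(samples)
```

With a strongly imbalanced dataset, a stratified split that preserves the class ratio in both parts is a common alternative to the plain shuffle shown here.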
For the evaluation, we evaluate the model in the following 3 scenarios:
Installation requirements and classification measures
Installation requirements
Software requirements: Python version 3.6, Tensorflow 2.0, Ubuntu 18.04. Hardware requirements: RAM 32 GB; CPU Intel Core i5-7500 @ 3.4 GHz; GPU Tesla K80.
Classification measures
The following measures will be used in this paper to evaluate the accuracy of models:
In which: TP - True positive: The number of malicious samples classified correctly. FN - False negative: The number of malicious samples classified as normal. TN - True negative: The number of normal samples classified correctly. FP - False positive: The number of normal samples classified as malicious.
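As a minimal sketch, the standard definitions of accuracy, precision, recall, and F1 score can be computed from the four counts above. The text names Accuracy and the F1 score explicitly; including precision and recall here is an assumption, as are the function name and the toy counts.

```python
def classification_measures(tp, fn, tn, fp):
    """Compute standard classification measures from confusion-matrix counts."""
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy counts (hypothetical), with malicious samples as the positive class.
measures = classification_measures(tp=8, fn=2, tn=85, fp=5)
```

The guards against zero denominators matter in practice: a model that never predicts the malicious class has no defined precision, and the sketch reports 0 in that case.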
Experimental results of scenario 1
Table 2 below shows the experimental results when the LSTM algorithm [33] is applied to APT attack detection.
Results of detecting APT attacks using LSTM algorithm [33]
The results in Table 2 show that the accuracy of flow and IP classification changed continuously and irregularly on most metrics as the complexity of the LSTM model increased. For frame classification, with an overall Accuracy score from 85.7% to 87.4%, the LSTM model seems to be effective in the flow classification process. However, with an imbalanced dataset, the Accuracy score can be inflated if the model predicts that all IPs belong to the Normal class (because Normal IPs account for a large share of all IPs). Thus, this study evaluates the experimental results of the models with a more stable measure, the F1 score. On the F1 score, the LSTM network gave flow classification results ranging from 47% to 56%. These results are relatively low and have no practical significance. Because the flow classification results were low, the IP classification process was strongly affected. The LSTM model was not actually effective in the task of detecting APT attacks, since its best result was only 58.8%. This outcome is predictable: the model did not work well in flow classification, so the IP classification results were also quite low.
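The effect of class imbalance on Accuracy can be checked with a short worked example. It assumes a hypothetical classifier that labels every IP as normal and uses the 743 normal IPs and 35 APT IPs reported for the test set in the discussion of Fig. 5; the function name is illustrative.

```python
def accuracy_and_f1(tp, fn, tn, fp):
    """Return (accuracy, F1) with APT as the positive class."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1


normal_ips, apt_ips = 743, 35

# Hypothetical baseline: every IP is predicted as normal,
# so all APT IPs are false negatives and there are no positives at all.
accuracy, f1 = accuracy_and_f1(tp=0, fn=apt_ips, tn=normal_ips, fp=0)
print(f"accuracy={accuracy:.3f}, F1={f1:.3f}")  # prints accuracy=0.955, F1=0.000
```

Such a baseline reaches about 95.5% Accuracy without detecting a single APT IP, while its F1 score for the APT class is 0, which is why the F1 score is the more informative measure here.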
Table 3 below shows the experimental results when the CNN-LSTM model [25, 35] is applied to APT attack detection.
After experimenting with various models combining CNN and LSTM with fine-tuned parameters, Table 3 summarizes the models that gave the best results on the test dataset. The experimental results differ clearly between models with different numbers of parameters. The data in Table 3 also shows that good frame classification results lead to good IP classification results. However, stacking more CNN networks with more LSTM networks did not always give the best results. The reasons are that the network structures differ and the parameters are randomly initialized, so each model learns differently and the learned patterns differ, which leads to differences in the classification results. In Table 3, model [4 CNN - 2 LSTM] gave the best results for classifying both APT frames and IPs, with a significant increase in F1-score (about 12% compared to the LSTM model). This result indicates that the combined deep learning model is a suitable approach for anomaly-based attack detection problems in which the dataset is not too heavily imbalanced. Figure 5 below shows the confusion matrix results of the models for APT attack detection in scenario 1.

The confusion matrix results of models in scenario 1. Panels (a) and (b) show the results of the LSTM and CNN-LSTM models, respectively.
The results in Fig. 5 show that the CNN-LSTM model performed better than the LSTM model in detecting both APT IPs and normal IPs. For APT IPs, the LSTM model correctly detected only 15 IPs out of 35, which is 4 IPs fewer than the CNN-LSTM model. For normal IPs, the LSTM model misclassified 3 IPs out of 743, which is 2 IPs more than the CNN-LSTM model.
Experimental results of detecting APT attacks using MLP model
Table 4 below shows the experimental results of detecting APT attacks using the MLP model [28].
Experimental results of detecting APT attacks using MLP model [28]
The experimental results in Table 4 show that the more complex the network architecture is, that is, the more hidden layers and nodes it has, the better the model learns and the more accurate the test results are. In particular, the flow classification results differed significantly between MLP models. With the simplest model, the accuracy was 83.6%. With the most complex model, the accuracy of flow classification was relatively good (87.6%). However, for IP classification, the MLP model was not effective. Specifically, the accuracy of normal IP classification and of APT IP classification were both less than 40%. The reason is that the dataset has a huge discrepancy between the number of APT IPs and normal IPs. Therefore, in practice, the MLP model cannot be used for classification problems when the dataset is heavily imbalanced. Based on the results shown in Tables 2, 3, and 4, the CNN-LSTM combined deep learning model gave better results than individual deep learning models such as MLP [28] and LSTM [33]. The reason is that the combined model uses the CNN network to extract and combine features from neighboring flows (which LSTM cannot do) and the LSTM network to extract temporal and sequential features (which CNN cannot do), and combines them into high-level features. The two networks compensate for each other’s shortcomings while reinforcing each other’s strengths, which significantly improves the learning ability of the model.
b) Experimental results of APT attack detection using Autoencoder model
The experimental results of APT attack detection using the Autoencoder model [32] are presented in Table 5.
Experimental results of APT attack detection using Autoencoder model [32]
The experimental results presented in Table 5 show that the Autoencoder model gave relatively high results for classifying and detecting APT frames as well as APT IPs. By varying the parameters of the Autoencoder model, we found that it gave the best APT IP classification results on all measures with the model [128-256-128]. Comparing the results of Tables 3 and 5, the CNN-LSTM model gave higher results than the Autoencoder model for classifying normal frames and IPs (about 3% for frame classification and 20% for IP classification). However, for classifying APT frames and APT IPs, the Autoencoder model gave higher results than the CNN-LSTM model [25, 35] (about 19% for frame classification and 3% for IP classification). The reason is that the Autoencoder model [32] uses encoder and decoder components to compress the features and then reconstruct them. In addition, the Autoencoder model has a relatively good ability to retain important features, so it filtered out and learned the important features in the data, which improved the APT IP classification results. From the experimental results, we found that the Autoencoder model is more suitable for the APT IP classification problem, while the CNN-LSTM model is more effective at accurately predicting normal flows.
Figure 6 below depicts the confusion matrix results of attack detection models in scenario 2.

The confusion matrix results of models in scenario 2. Panels (a) and (b) show the results of the MLP and Autoencoder models, respectively.
Fig. 6 shows that the experimental models differ in how well they predict APT IPs and normal IPs. Comparing Figs. 5 and 6, the CNN-LSTM model and the Autoencoder model yielded higher results than the other models. Specifically, the CNN-LSTM model was superior in accurately predicting normal IPs, because it misclassified only 1 IP out of 743. However, for the task of detecting APT IPs, this model was not actually effective, because it detected only 19 APT IPs out of 35. The Autoencoder model showed the opposite trend: it gave a higher rate of correct APT IP predictions (20 APT IPs correctly detected, 1 IP more than the CNN-LSTM model) and a lower rate of correct normal IP predictions (6 normal IPs misclassified, 5 IPs more than the CNN-LSTM model).
a) Experimental results of detecting APT attack using the CNN-LSTM-ATTENTION 1 model
This study applies the CNN-LSTM-ATTENTION 1 model proposed in Section 3.2.2 to the APT attack detection task. To assess the effectiveness of the model, it was tested with many different parameter settings. Table 6 below shows the experimental results when using the CNN-LSTM-ATTENTION 1 model.
Experimental results of detecting APT attacks using CNN-LSTM-ATTENTION 1 model
The experimental results in Table 6 show that the CNN-LSTM-ATTENTION 1 model gave very good results: the accuracy ranged from 91.7% to 92.7% for frame classification and from 97.9% to 98.1% for IP classification. In particular, with the CNN-LSTM-ATTENTION combined model, normal frame classification reached up to 99.2% and APT frame classification reached 73%. For IP classification, the results were 95.4% for normal IPs and 60% for APT IPs. Table 6 also shows that the CNN-LSTM-ATTENTION 1 model gave the best frame classification results with model [3-1-1] and the best IP classification results with model [4-1-1]. There are two reasons for this. The first is the imbalance between the number of normal IPs and APT IPs. The second is that, due to the characteristics of APT attacks, many IPs are connected to each other through only one flow, which leads to misclassification. Nevertheless, with the results shown in Table 6, the CNN-LSTM-ATTENTION 1 model not only outperformed individual deep learning models, combined deep learning models, and autoencoder models, but also improved the accuracy of IP classification. Specifically, with the CNN-LSTM-ATTENTION 1 model, the APT IP classification accuracy increased by about 23% compared to the MLP network, 17% compared to the LSTM model, 6% compared to the CNN-LSTM model, and 3% compared to the Autoencoder model. In addition, comparing Table 6 with Table 3 shows that, although CNN-LSTM somewhat improves the ability to retain information while processing long-term data, it did not classify as well as the CNN-LSTM-ATTENTION 1 model on the unbalanced dataset. This result indicates that the ATTENTION 1 network, with its flexible combination of blocks (as shown in Fig. 4), provided the best ability to exploit and highlight important features in the data, compared with using individual blocks for the computation.
On the other hand, although the CNN-LSTM-ATTENTION 1 model gave APT IP classification results between 57%and 60%, overall, this result is very good because the experimental dataset has a difference deviation up to 14 times between the number of normal IPs and APT IPs. Thus, directing the model’s attention to important features in all flows and combining them together by using the attention mechanism has yielded positive results, not only for the task of APT frame detection based on flow networks but also for improving the efficiency of correctly classifying APT IPs.
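To make the effect of the 14:1 imbalance concrete, the following minimal sketch computes the overall accuracy of a trivial classifier that labels every IP as normal. The counts (1400 normal IPs, 100 APT IPs) and the function names are hypothetical and chosen only to reproduce the 14:1 ratio; they are not taken from the experimental dataset.

```python
def overall_accuracy(n_normal, n_apt, normal_acc, apt_acc):
    """Overall accuracy implied by per-class accuracies and class counts."""
    correct = n_normal * normal_acc + n_apt * apt_acc
    return correct / (n_normal + n_apt)


def balanced_accuracy(normal_acc, apt_acc):
    """Mean of the per-class accuracies; insensitive to class imbalance."""
    return (normal_acc + apt_acc) / 2


# Hypothetical counts with a 14:1 normal-to-APT ratio (assumption).
N_NORMAL, N_APT = 1400, 100

# A trivial classifier that predicts "normal" for every IP:
# every normal IP is correct, every APT IP is missed.
baseline_overall = overall_accuracy(N_NORMAL, N_APT, 1.0, 0.0)
baseline_balanced = balanced_accuracy(1.0, 0.0)

print(f"all-normal baseline, overall accuracy:  {baseline_overall:.3f}")
print(f"all-normal baseline, balanced accuracy: {baseline_balanced:.3f}")
```

Under these assumed counts, a classifier that never detects an APT IP still scores above 93% overall accuracy, which is why the per-class APT IP accuracy is the more informative figure on such a dataset.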
b) Experimental results of detecting APT attacks using the CNN-LSTM-ATTENTION 2 model
The experimental results in Table 7 show that the CNN-LSTM-ATTENTION 2 model gave a frame classification accuracy from 89.1% to 92.7% and an IP classification accuracy from 97.7% to 97.9%. In addition, according to the frame prediction results, the system correctly predicted from 98.6% to 99.4% of normal frames and from 59.1% to 72.5% of APT frames. Similarly, for IP classification, the CNN-LSTM-ATTENTION 2 model correctly predicted from 54.2% to 57.1% of the APT IPs and from 86.7% to 95.2% of the normal IPs. Like the CNN-LSTM-ATTENTION 1 model, the CNN-LSTM-ATTENTION 2 model gave different results for detecting APT frames and APT IPs when the parameters of the model architecture were changed. In particular, with the architecture [4-1-1], the CNN-LSTM-ATTENTION 2 model gave the best frame classification results on all measures, whereas the APT IP classification results were best with the architecture [3-2-1]. Clearly, adding the ATTENTION 2 network to the CNN-LSTM combined deep learning model significantly improved the accuracy of the whole classification process. This demonstrates that our proposal to use the ATTENTION 2 network to sum the weights of all time steps, in order to recover and aggregate important features that could be lost in the training phase, is reasonable and meaningful. However, although the results shown in Table 7 are much higher than those of the CNN-LSTM model and other deep learning models, they are lower than the classification results of the CNN-LSTM-ATTENTION 1 model shown in Table 6. Specifically, in detecting APT frames and IPs, the results of the ATTENTION 1 model are higher than those of the ATTENTION 2 model (by about 0.5% for APT frame detection and 3% for APT IP detection).
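The idea of summing over all time steps with attention weights can be illustrated with a generic attention-pooling sketch. This is not the ATTENTION 2 architecture itself; it only assumes that each time step of the CNN-LSTM output is a feature vector, that a score is computed per time step, and that the scores are normalized with a softmax before the weighted sum. The names `attention_pool` and `query`, and the dot-product scoring, are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, query):
    """Attention-weighted sum over time steps.

    hidden_states: list of T vectors (one per time step), each of length d.
    query: vector of length d standing in for a learned parameter (assumption).
    Returns (weights, context): T attention weights and one vector of length d.
    """
    # One scalar score per time step (dot product with the query).
    scores = [sum(h_j * q_j for h_j, q_j in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    # Weighted sum of the time-step vectors, computed per feature dimension.
    context = [sum(w * h[j] for w, h in zip(weights, hidden_states)) for j in range(dim)]
    return weights, context


# Hypothetical sequence of three time steps with two features each.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attention_pool(states, query=[1.0, 1.0])
print(weights)
print(context)
```

Because every time step contributes to the context vector in proportion to its weight, features from early time steps are not discarded, as they would be if only the last LSTM state were passed to the classifier.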
Experimental results of detecting APT attacks using CNN-LSTM-ATTENTION 2 model
Figure 7 depicts the confusion matrix results of our proposed CNN-LSTM-ATTENTION 1 and CNN-LSTM-ATTENTION 2 models.

The confusion matrix results of the models in scenario 2: (a) the CNN-LSTM-ATTENTION 1 model and (b) the CNN-LSTM-ATTENTION 2 model.
The results in Fig. 7 show that the CNN-LSTM-ATTENTION 1 model performed better than the CNN-LSTM-ATTENTION 2 model in the task of accurately detecting APT IPs. For the normal IP classification task, the two models were equally effective. Comparing the results shown in Figs. 5, 6 and 7 shows that the confusion matrix results of our proposed models are better than those of the other models in correctly detecting both APT IPs and normal IPs.
Across the 3 comparison scenarios with different evaluation methods, we have shown the superiority of the CNN-LSTM-ATTENTION model over the other deep learning models considered. The reason is that the CNN-LSTM-ATTENTION model, which combines many different layers and models, not only supports aggregating and extracting many important features, but also supports seeking, aggregating, and associating important IP features and characteristics based on the flow network. The experimental results show that the CNN-LSTM-ATTENTION model was highly effective at detecting both APT IPs and normal IPs, even though the experimental dataset has a large difference between the amount of normal data and attack data (the number of normal IPs is about 14 times the number of APT IPs). From the experimental results, we believe that the CNN-LSTM-ATTENTION model can be applied to a real APT attack monitoring system because it satisfies 2 requirements of such a system: the ability to analyze big data and effectiveness on imbalanced datasets. Besides, by varying the parameters of the proposed models, we aim to provide options for the APT attack monitoring system when there is a trade-off between computation time and detection efficiency. In general, the more layers are used and the more complex the network architecture is, the better the results tend to be, but at the cost of more time and computational resources.
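For readers who want to relate a confusion matrix to the per-class figures discussed above, the following sketch derives them from the four cells of a binary confusion matrix. It assumes that "APT accuracy" means the fraction of true APT IPs that are correctly detected and that "normal accuracy" means the fraction of true normal IPs that are correctly classified. The counts are hypothetical and are not read from Fig. 7.

```python
def per_class_rates(tn, fp, fn, tp):
    """Per-class and overall rates from a binary confusion matrix.

    tn: normal IPs classified as normal
    fp: normal IPs classified as APT (false alarms)
    fn: APT IPs classified as normal (missed attacks)
    tp: APT IPs classified as APT
    """
    total = tn + fp + fn + tp
    return {
        "normal_accuracy": tn / (tn + fp),
        "apt_accuracy": tp / (tp + fn),
        "overall_accuracy": (tn + tp) / total,
    }


# Hypothetical counts for illustration only (not taken from Fig. 7).
rates = per_class_rates(tn=950, fp=50, fn=40, tp=60)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

Reading the matrix this way makes it clear that a model can differ from another mainly in the missed-attack cell while the two remain close in overall accuracy.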
Conclusion and future direction
APT attack techniques are continually evolving, in both their methods and their supporting technologies, which makes the work of cyber-attack monitoring and detection systems more difficult. This study has proposed the CNN-LSTM-ATTENTION model for APT attack detection. Specifically, regarding the process of selecting and extracting IP features in network traffic, with the support of the CNN-LSTM-ATTENTION model, we have succeeded in extracting information about abnormal IP behaviors based on statistical features of the flow network. The experimental results in Tables 6 and 7 showed the superiority of the CNN-LSTM-ATTENTION model over other models, which supports the soundness of our proposal. Regarding the task of detecting APT IPs, Tables 6 and 7 and Figs. 5 and 7 show that the CNN-LSTM-ATTENTION model is effective in finding and synthesizing important information, which helps the APT IP classifier work well despite the imbalance between the number of APT IPs and normal IPs. However, because the two ATTENTION networks differ in processing complexity, the ATTENTION 1 network was more effective than the ATTENTION 2 network. Besides, to evaluate the effectiveness of each model, this study conducted experiments, evaluated the results, and optimized parameters to find a suitable CNN-LSTM-ATTENTION model that gave the best results for the task of detecting APT attacks. Tables 6 and 7 show that the CNN-LSTM-ATTENTION 1 model with parameters [4-2-1] and the CNN-LSTM-ATTENTION 2 model with parameters [4-1-1] performed well not only in accurately detecting APT attacks but also in limiting false detections. In further studies, to improve the efficiency of the APT attack detection process, we think that it is necessary to improve 2 main issues: i) the method of synthesizing and extracting IP features based on network traffic; ii) the ATTENTION network architecture.
Acknowledgments
This work has been sponsored by the Posts and Telecommunications Institute of Technology, Viet Nam.
