Abstract
Advanced Persistent Threat (APT) attacks are malicious attacks that are deliberately aimed at clearly defined targets. This attack technique is growing both in the number of recorded attacks and in the danger it poses to organizations, businesses and governments. Detecting and warning of APT attacks in real systems is therefore an urgent task. One of the most effective approaches to APT attack detection is to apply machine learning or deep learning to network traffic analysis. A number of studies have proposed decomposing network traffic into network flows and then applying classification or clustering methods to look for signs of APT attacks. In particular, recent studies often apply machine learning algorithms to detect the presence of APT attacks based on network flows. In this paper, a new deep learning based method for detecting APT attacks using network flows is proposed. In our approach, network traffic is decomposed into IP-based network flows, the IP information is then reconstructed from the flows, and finally deep learning models are used to extract features that distinguish APT attack IPs from other IPs. Additionally, a combined deep learning model using Bidirectional Long Short-Term Memory (BiLSTM) and Graph Convolutional Networks (GCN) is introduced. In the experiments, the new detection model is evaluated and compared with some traditional machine learning models, i.e. the Multi-layer Perceptron (MLP) and single GCN models. Experimental results show that the BiLSTM-GCN model has the best performance on all evaluation scores. This not only shows that applying deep learning to network flow analysis is an effective way to detect APT attacks but also suggests a new direction for deep learning based network intrusion detection techniques.
Keywords
Introduction
Advanced Persistent Threat attacks
Advanced persistent threat (APT) attacks are concentrated, purposeful attacks designed for a specific target and victim in order to find valuable information and send it to the outside. The term comprises three parts:
Advanced: The attackers use advanced techniques to attack the target system methodically. APT attacks often combine many different techniques in a systematic, logical, and intelligent way. Most recorded attacks around the world differ in their characteristics, attack methods, and data mining techniques, ranging from attack support tools to malware distribution, privilege escalation, and trace deletion.
Persistent: Persistence refers to the way an attacker identifies the target, carries out the attack, hides, and exploits the system in stages. Attackers may use many different techniques and methods against the target until they succeed. Persistence also means that attackers can spend months or even years just collecting information about the victim as a prerequisite for the attack. For example, to attack users, they patiently gather information such as the users’ preferences, personalities, file naming habits, concerns when opening emails, and relationships in the virtual world. Attackers can also spend months repeatedly trying a tool or attack method that exploits a security flaw in the victim’s system, and then wait several more months before triggering the attack. Once inside the system, an attacker can continue to wait for an opportunity to activate and spread the attack so as to evade the surveillance and monitoring of information security systems. Moreover, persistence means that attackers who get into the system do not stop at a one-off data theft; their purpose is to install malicious code in the system to obtain as much data as possible. This data theft may last from several months to several years if undetected.
Threat: The dangers of APT attacks lie in the following aspects: i) Clear purposes and goals: an APT attack is designed for a specific target. Attackers clearly identify the target before attacking and attack only that victim from start to finish, which shows the professionalism of APT attacks; ii) Full financial backing: most APT attackers are backed by large financial institutions that provide financial support, and some of these attacks are supported by government organizations. Because APT attackers are very patient in observing, monitoring and gathering information about their victims, they would not be able to sustain such attacks without support from large organizations or governments.
The authors of [1, 2] discussed the characteristics, procedures and life cycle of APT attacks targeting organizations and offices. These characteristics make the dangers of this kind of attack evident, and it is clear that any organization, individual, business or government can become a victim.
A review on APT attack detection methods
In [1] and [3], the authors presented a number of characteristics of APT attack scenarios that make this kind of attack much more difficult to detect than other threats. The two main difficulties for APT attack detection approaches are:
The evolution of attack methods: APT attackers often study their targets very carefully and may then apply a different attack method to each victim. As a result, conventional detection methods cannot detect this kind of attack.
The lack of published databases on APT attacks: very few organizations, agencies or businesses admit that they have been victims of APT attacks. This makes it difficult for researchers and analysts to identify the characteristics of the attacks. Stojanović et al. [52] found that current open access datasets, including KDD-99 / NSL-KDD, UNSW-NB15, NGIDS-DS, TRAbID, CICIDS2017, and CSE-CIC-IDS2018, are not completely suitable for APT attack detection research. Therefore, researchers tend to rely on two main methods to generate training data: the network infrastructure use-case [3, 47] and the cyber-physical system use-case [53, 54].
However, although APT attacks are sophisticated and use completely new ways of attacking, they generally consist of several main phases [1–3]: spying (information gathering), attack and privilege escalation, information theft, and trace deletion. Therefore, studies and recommendations for detecting APT attacks are often based on the characteristics of these stages.
There are three main approaches to detecting APT attacks based on the life cycle and stages of the attacks [1]: Monitoring Methods, APT Detection Methods, and Deception Methods. Of these, APT Detection Methods currently receive the most research attention. They are further divided in [1, 2] into Anomaly Detection, Pattern Matching, and Graph Analysis.
Pattern matching techniques for anomaly detection often rely on the behavior of a process or an application, using database access logs or honeypot strategies. Yan et al. [29] proposed an early detection method aimed at the first phase of an APT attack, based on the KDD 99 dataset and structured modeling of cyber-attack behavior. In general, the pattern matching approach can achieve high accuracy in detecting network anomalies in general and APT attacks in particular. However, it requires cumbersome computing systems and a reliable and complete training dataset, which has so far limited its applicability to APT attack detection.
The graph analysis approach to network anomaly detection focuses on two main processes: attack graph construction and attack graph analysis. Attack graphs are usually constructed using vulnerability and reachability information [30, 31], time-efficient and cost-effective network hardening [32], logical dependencies [33], or divide and conquer [34]. To detect network anomalies based on attack graphs, several algorithms and methods have been proposed, such as Automated Graph Analysis, Binary Decision Diagrams [31], Approximation Algorithm and Linear Scalability [32], Markov Decision Process [35], Loop Detection [33], and a PageRank-based algorithm [36]. The literature review shows that graph analysis is neither a very popular approach for network anomaly detection systems nor a very accurate tool.
One of the reasons for this is that APT attackers often use advanced techniques to create processes and behaviors that look identical to normal activities. As a result, it is often difficult to analyze and evaluate graphs in order to spot anomalies. Additionally, the availability of APT attack data is very limited, which makes it difficult to construct the attack graphs that serve as a basis for detection.
Anomaly detection appears to be an efficient method for APT attack detection systems. This approach enables network operators to trace and recognize APT attacks at each stage of their life cycle. Anomaly-based APT attack detection approaches can be supervised, semi-supervised, or unsupervised [1, 2], and target, for example, spear phishing [37, 41], malicious DNS domains [4, 7], and anomalous access patterns [42, 43]. In particular, detecting APT attacks through abnormal behavior analysis of network traffic datasets [3, 5] or DNS logs [6, 7] is widely used today. The two main methods for detecting and classifying APT attacks in DNS logs or network traffic data are machine learning and deep learning. Recent studies often focus on applying machine learning algorithms to network traffic because this data provides a lot of important and meaningful information. Some empirical results have shown that machine learning algorithms are suitable for detecting APT attacks based on abnormal behavior analysis.
Machine learning methods in previous research on APT attack detection often have the following issues:
The use of synthetic data: the imbalance between normal data and APT attack data is a frequent problem. As a result, machine learning based APT detection studies often rely on datasets created by the research team itself, which are more balanced between normal data and APT attack data. The methods proposed on such synthetic datasets may therefore work well in specific lab cases but may not be appropriate in real scenarios because of data inconsistencies.
The use of definitions and abnormal behaviors extracted from APT attacks: Stojanović et al. [52] surveyed the field and pointed out that current research often uses feature construction combined with feature selection or dimensionality reduction to find and extract the abnormal behavior of APT attacks from datasets. Because of the lack of public data on this kind of attack, previous studies often proposed methods to calculate and identify the unusual behaviors of APT attacks based on their own synthetic data. This makes them inefficient at detecting the unusual behavior of APT attacks in practice.
In this study, we propose an APT attack detection method based on behavior analysis and extraction from network flows using deep learning algorithms. Our approach differs from current studies in two main respects:
The dataset: instead of using preprocessed and normalized abnormal network datasets as in other research, we use datasets that are directly related to real APT attacks. Our dataset covers almost all current procedures by which APT attacks have been performed. Additionally, the data for normal network activities are collected directly from the Quang Nam e-government server in Vietnam. This can be seen as a raw dataset. It is heavily imbalanced between APT attack samples and normal samples: normal flows outnumber APT flows by a factor of 10, and clean IPs outnumber APT IPs by a factor of 250. This is, however, a natural characteristic of real APT datasets, in which the signs and behaviors of APT attacks are very small compared with the normal data.
APT features in network traffic: we do not pre-define or look for specific abnormal APT behaviors in network traffic. Instead, we construct the behaviors and characteristics of APT in network traffic by calculating and processing the network flows.
Specifically, the proposed method includes the following steps:
Step 1: Analyzing network traffic into flow network according to each IP.
Step 2: Selecting and analyzing flow behaviors.
Step 3: Reconstructing the IP information based on network flows.
Step 4: Feature selection and IP classification based on flow network using deep neural networks, GCN, and the combination of BiLSTM and GCN.
Contributions
Although deep learning algorithms have been widely studied and applied to detect many different types of attacks, their application to APT attack detection based on behavior analysis and extraction from network flows is still limited. The main contributions of this paper are:
A new APT attack detection method based on network flow analysis using deep learning models. Specifically, deep learning models are used to extract and detect abnormal APT behaviors from network flows. The attributes and typical behaviors of APT attacks used in previous studies [4, 7] are not used in this research. Instead, the statistical information available in network flows is exploited to search for abnormal behaviors in network traffic as a basis for detecting APT attacks. Deep learning algorithms are used to reconstruct and extract abnormal IP behaviors in network traffic based on network flows. This approach is useful not only for detecting APT attacks but also for detecting many other anomalies in the network.
A deep learning model combining BiLSTM and GCN to distinguish APT attack IPs from normal IPs based on network flows. In our experiments, several architectures of the BiLSTM-GCN model are investigated to find the most suitable structure for APT attack detection. The empirical results in Section 4.3.2 show the advantage of the BiLSTM-GCN model over traditional models such as MLP and GCN. This also suggests that the proposed deep learning based approach is useful for monitoring and detecting other unauthorized network intrusion activities.
Related work
APT attack detection based on flow network analysis
In [3], authors proposed a method to detect APT attack signals based on analyzing some unusual behavior of flow in network traffic. Accordingly, their method includes the process of extracting, standardizing and calculating anomalous values of the three main groups of signs from the flow, which are numbytes, numflows, and numdst.
Bahtiyar [8] proposed the FD-APT model with 5 Levels to analyze the abnormal behavior of flow in order to detect signs of malicious attacking APT codes.
Milajerdi et al. [9] introduced a method to detect APT attacks based on the process and life cycle of the attacks. To accomplish this task, the authors used a correlation analysis technique between the unusual behaviors of flow during each stage of the APT attack.
Chu et al. [10] proposed an APT attack detection method using machine learning algorithms including Support Vector Machine, Naive Bayes, Decision Tree. They used NSL-KDD data set for evaluation. Experimental results show that the Support Vector Machine algorithm has the best results among those investigated algorithms.
APT attack detection based on flow network analysis using deep learning
Deep learning algorithms have been studied and applied in many different fields of study. However, the application of deep learning algorithms for APT attack detection based on flow network analysis is still new to researchers.
In [10], Multilayer Perceptron networks are used to detect APT attacks. Experimental results in the paper show that Support Vector Machine algorithm has better results than Multilayer Perceptron algorithm. However, it is not appropriate for the authors to use the NSL-KDD data set to detect APT attacks because APT has different characteristics from the types of attacks described in NSL-KDD.
The APT attack detection model proposed by Bodström et al. [11] includes 5 layers, each of which filters one part of the data. The first layer, Known Attacks, detects known attacks in the data stream, saves them to the database along with the timeline, and finally deletes the data from the stream. The remainder of the data stream is input to the second layer, Normal Traffic, which detects normal network traffic and removes it from the stream. The third layer, Historical Appearance, aims to determine whether an exception has appeared earlier in the network. This layer is updated by retraining the machine learning network with cases that have passed the two previous detection layers. The fourth layer, Outlier Classification, classifies the remaining data stream into the following categories: known attacks, predicted attacks, unknown outliers, and normal traffic. The fifth layer uses graphical approaches for simulation purposes. No experimental results are presented in their study; the authors mainly focused on the use of deep learning in APT attack detection.
Pektas and Acarman [12] presented an APT botnet detection process in 3 steps: i) Feature extraction; ii) Model training; iii) Assess the results. They also presented experimental results comparing the performances of some basic algorithms in machine learning and deep learning such as: Decision tree, Naive Bayesian, Random forest, Recurrent neural network. Empirical results show that the Deep neural network is better than the Random forest algorithm on the same data set. This also implies that deep learning methods may also be suitable for the APT attack detection problems.
In addition, Tuor et al. [44] presented an online unsupervised deep learning system that filters system log data for analyst review in APT attack detection. Specifically, the authors apply the RNN and long short-term memory (LSTM) deep learning algorithms to the CERT Insider Threat v6.2 dataset to analyze and detect the behaviors of APT attacks. Their experimental results showed that the RNN and LSTM algorithms perform better than common machine learning algorithms such as Isolation Forest, support vector machine (SVM) and principal component analysis (PCA).
Yan et al. [45] introduced a method using deep learning CNN to detect APT attacks based on DNS Activities. In their research, three main attribute groups are extracted, including domain name-based features, features representing the relationship between DNS request behaviors and response behaviors, features representing the Relationship between DNS request behaviors and response behaviors using the dataset collecting from 4,907,147,146 pieces of initial data of 47 days DNS request records of Jilin University Education Network. Those features are combined with the CNN algorithm to detect the behaviors of APT attacks. In the experimental results, their method achieved an average accuracy of 97.6% in detecting suspicious DNS behaviors, with the false positive (FP) at 2.3% and the recall at 96.8%.
In [46], Eke et al. proposed a method to detect APT attacks based on KDD 99 data set using different models such as LSTM, RNN, and Gated Recurrent Unit (GRU). Experimental results show that deep learning algorithms have an average accuracy of 99.99%, which is higher than other conventional machine learning algorithms such as SVM, k-nearest Neighbors (KNN), Random Forest Classifier, and Logistic Regression.
Some other approaches
Ghafir et al. [47] presented an APT attack detection technique based on network traffic using machine learning correlation analysis. Their real-time APT attack detection model has 6 main steps: 1) Intelligence gathering, which is a passive process and does not contain any detection; 2) Point of entry, which includes disguised “exe” file detection, malicious file hash detection, and malicious domain name detection; 3) C&C communication, which implements malicious IP address detection, malicious SSL certificate detection, and domain flux detection; 4) Lateral movement, which is internal traffic within the target’s network. The system monitors the inbound and outbound traffic, so there are no detection modules in this step; 5) Asset/Data discovery, which executes scanning detection; 6) Data exfiltration, which handles Tor connection detection. Their experimental results showed that the method achieves an APT attack prediction accuracy of 84.4% during the data exfiltration period.
Li et al. [48] proposed a Lyapunov-based, intelligence-driven defense mechanism against APT attacks. In that model, the attack-defense confrontation process is modeled as a dynamic attack graph, and the defense strategy is built on acquired heterogeneous knowledge. In [49], Lu et al. introduced a correlation analysis model to detect APT attacks based on network traffic. Their correlation model is mainly based on the calculation and normalization of flow properties in network traffic. Rubio et al. [50] suggested the use of opinion dynamics for monitoring and detecting APT attacks in SCADA systems. The model correlates different anomalies over time and then looks for signatures of the persistence of threats and the criticality of resources.

The APT attack detection model for flow network based on deep learning.
Vinayakumara et al. [51] proposed a method to detect dangerous domains using some deep learning models such as RNN, LSTM. They also compared their proposed method with some other traditional machine learning models such as Random Forest, Decision Tree, Naive Bayes. Experimental results show that the LSTM algorithm gives the best results in terms of accuracy, precision, recall and F-score metrics.
Machine learning algorithms are also applied in [55] to detect APT attacks based on DNS queries. The authors proposed some in-depth attributes for detecting connections between APT malware and a control server that uses concealment techniques such as line or tunnel encryption. In this paper, we further discuss the problem presented in [55].
The proposed APT detection method for flow network based on deep learning
Figure 1 depicts our APT attack detection method for flow network based on deep learning algorithms. The proposed detection model includes the following components:
Network traffic: the network traffic data used in this paper includes normal network traffic collected from the e-government server within the National project KC.01.05/16-20 sponsored by the Ministry of Science and Technology of Vietnam [13]. The network traffic data for APT attacks are collected from many different real attacks around the world and in Vietnam.
Flow feature extraction: at this stage, all network traffic is decomposed into flows. These flows are grouped into pairs according to their source IP (SrcIP) and destination IP (DstIP). Then, during the flow attribute extraction phase, the flows grouped by (SrcIP, DstIP) pair are analyzed and different properties are extracted from them. These properties represent the differences between the flows in each (SrcIP, DstIP) pair.
IP information reconstruction: at this stage, the information of each SrcIP is represented by a single vector built from the flow network formed in the previous stage. This process depends on the method used in each experimental setup, described in detail later in this paper. The output of this stage is a single vector containing all the information about the SrcIP.
Training phase: this phase includes two main steps. IP feature extraction: in this step, deep learning algorithms are used to extract IP features from the IP representation vectors. The deep learning algorithms investigated in this research include MLP, GCN, and LSTM. IP classification: the IPs are classified based on the feature vectors extracted in the previous step. A Softmax regression network is adopted to calculate the probability that an input IP falls into each class; this output is compared with the ground-truth label to optimize the network parameters using the Adam optimizer. The Softmax regression network is described in Section 3.3.2. After training, the resulting deep learning model takes a vector representing the information of an IP as input and outputs a two-component vector containing the probabilities that the input IP is Normal or APT.
Testing phase: the trained deep learning model is applied to predict whether an IP is normal or APT.
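The grouping of flows by (SrcIP, DstIP) pair in the flow feature extraction stage can be illustrated with a minimal sketch. The record layout and the field names `src_ip`, `dst_ip`, `duration` and `total_bytes` are hypothetical and chosen only for illustration; they are not the column names produced by any particular flow exporter.

```python
from collections import defaultdict

# Hypothetical flow records; the field names are illustrative assumptions.
flows = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.9", "duration": 1.5, "total_bytes": 300},
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.9", "duration": 0.5, "total_bytes": 100},
    {"src_ip": "10.0.0.2", "dst_ip": "10.0.0.9", "duration": 2.0, "total_bytes": 800},
]


def group_flows_by_pair(flow_records):
    """Group flow records by their (SrcIP, DstIP) pair."""
    groups = defaultdict(list)
    for flow in flow_records:
        groups[(flow["src_ip"], flow["dst_ip"])].append(flow)
    return dict(groups)


pairs = group_flows_by_pair(flows)
for (src, dst), members in pairs.items():
    print(src, "->", dst, ":", len(members), "flow(s)")
```

Each group can then be passed to the attribute extraction step, which compares the flows that share the same pair.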
Flow feature extraction and selection
A number of studies have suggested different flow properties for detecting abnormal behavior in network traffic. Specifically, Beigi et al. [14] listed and synthesized some flow properties. However, APT attacks are very different from other attack techniques, so the flow properties need to be analyzed in detail for detecting this kind of attack. In this paper, we use the CICFlowMeter tool [15] to analyze network traffic and extract flow properties. It is a powerful tool for generating network traffic flows and also serves as a statistical tool for network investigation. Some results of applying CICFlowMeter to network traffic analysis have been presented in [16, 17]. CICFlowMeter decomposes network traffic into flows and extracts the flow behavioral properties shown in Table 1.
Features extracted by CIC Flow Meter
Where: Activation time is the period from when the first packet is sent on the network stream until no more packets are transmitted on the stream and the timeout has expired. Idle time starts when the timeout has expired and no packets have been sent on the network stream. The average value of the packet is calculated as:
Deep learning
The concept of Deep Learning was introduced by Dechter in 1986, following the development of neural networks. Since then, deep learning has been widely applied in many fields such as computer vision, voice recognition, natural language processing, and robotics. Deep learning is a sub-field of machine learning whose main component is the deep neural network, i.e. a neural network with many hidden layers and a large number of hidden neurons. Deep neural networks can be combined with other machine learning tools to form different types of deep learning models. Based on the architecture and techniques used, deep learning models can be classified into three classes [18]:
Discriminative deep networks: these networks are usually applied to classification problems. They output a probability distribution over the labels given the observed data. Popular discriminative models are Deep Neural Networks, Recurrent Neural Networks, and Convolutional Neural Networks.
Generative deep networks: these models are usually used in analysis and synthesis problems based on observed data. They describe the joint statistical distributions of the input data and the related classes. Popular generative models include the Deep Belief Network, Deep Autoencoder, and Deep Boltzmann Machine.
Hybrid deep networks: these models are used for classification problems and refer to architectures that make use of both generative and discriminative components.
Deep learning algorithms for APT attack detection on flow networks
3.3.2.1 Multilayer Perceptron
a) The model
The mathematical basis and operating principle of MLP are presented in [19]. Generally, an MLP consists of 3 main components: an input layer and an output layer, each of which usually consists of one layer of neurons, and a hidden part, which may have one or more layers of neurons depending on the specific problem. In deep neural networks, the hidden part of an MLP may consist of tens to hundreds of neuron layers. MLP was originally designed to model how the nervous system works. In an MLP, all neurons of every layer except the input layer are fully connected to the neurons of the previous layer. Each neuron receives the input vector, combines it with its weights, and passes the result through a transfer function to produce its output. The output is calculated by the following formula:
Where

Graph of ReLU function.
The ReLU function is given by the following formula:
$$\mathrm{ReLU}(x) = \max(0, x)$$
The advantages of using ReLU are not only to help deep networks converge faster but also to facilitate the computation process [20].
In the output layer, the Softmax [21] function is used, which is further discussed in section 3.1. The number of output neurons is equal to the number of output classes. The output of each neuron in the output layer is formulated as [21]:
$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}, \quad i = 1, \dots, C$$
where $z_i$ is the input to the Softmax function, i.e. the product of the inputs and the neuron’s weights plus the bias, and $C$ is the number of output classes.
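As an illustration of how the pieces described above fit together, the following is a minimal sketch of a forward pass through an MLP with one ReLU hidden layer and a Softmax output layer. The layer sizes and all weight values are hypothetical toy numbers, and the class order (Normal, APT) is an assumption made for the example; this is not the architecture evaluated in the experiments.

```python
import math


def relu(x):
    """ReLU activation: max(0, x)."""
    return max(0.0, x)


def softmax(z):
    """Softmax over a list of scores, shifted by max(z) for numerical stability."""
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def dense(x, weights, biases):
    """Affine map: one output per row of `weights`, computed as w . x + b."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def mlp_forward(x, hidden_w, hidden_b, out_w, out_b):
    """One ReLU hidden layer followed by a Softmax output layer."""
    hidden = [relu(v) for v in dense(x, hidden_w, hidden_b)]
    return softmax(dense(hidden, out_w, out_b))


# Hypothetical toy parameters: 3 input features, 2 hidden neurons, 2 classes.
hidden_w = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]
hidden_b = [0.0, 0.1]
out_w = [[1.0, -1.0], [-1.0, 1.0]]
out_b = [0.0, 0.0]

probs = mlp_forward([1.0, 2.0, 3.0], hidden_w, hidden_b, out_w, out_b)
print(probs)  # two class probabilities that sum to 1
```

Subtracting the maximum score before exponentiation does not change the Softmax result but avoids overflow for large inputs.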
IP information reconstruction based on flow network
In order to classify APT attack IPs and normal IPs on the flow network using MLP, we need to extract representative characteristics of the IPs as input to the classification model described above. The flow behavioral properties are presented in Section 3.2. These flow properties are extracted quite completely, and they are added together to form the representative feature vector of each IP. The feature extraction algorithm for the IP is presented in Algorithm 1:
Feature extraction for the IP
As presented in Algorithm 1, the feature extraction algorithm for the IP includes the following phases:
Build IP feature table: the available flows are inspected. Among the many possible attributes of a flow, two, SrcIP and DstIP, are used to build the IP feature table G.
Formulate IP feature extraction: for each flow f, if its SrcIP is already present in G, the features of f are added to the feature vector of SrcIP. If the SrcIP is not in G, its feature vector is initialized with the features of f. The procedure is repeated for the DstIP of each flow f.
The resulting feature table G contains every IP in the flows together with its IP feature vector. The length of each vector is equal to the number of flow features.
The output of Algorithm 1 is the input of the MLP model, which predicts whether the IP is an APT attack IP or not.
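The accumulation described for Algorithm 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact algorithm: flows are assumed to be dictionaries, the field names `src_ip`, `dst_ip`, `duration` and `total_bytes` are hypothetical, and the table G is represented as a Python dictionary.

```python
def build_ip_feature_table(flows, feature_names):
    """Accumulate flow features per IP, for both the SrcIP and DstIP of each flow."""
    table = {}
    for flow in flows:
        features = [flow[name] for name in feature_names]
        for ip in (flow["src_ip"], flow["dst_ip"]):
            if ip in table:
                # IP already in the table: add this flow's features element-wise.
                table[ip] = [acc + f for acc, f in zip(table[ip], features)]
            else:
                # First time this IP is seen: initialize with this flow's features.
                table[ip] = list(features)
    return table


# Hypothetical flows with two numeric features each.
flows = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.9", "duration": 1.5, "total_bytes": 300.0},
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.8", "duration": 0.5, "total_bytes": 100.0},
]

table = build_ip_feature_table(flows, ["duration", "total_bytes"])
print(table["10.0.0.1"])  # features of both flows summed
```

Every vector in the table has the same length as the list of flow features, so the rows can be fed directly to a fixed-size input layer.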
3.3.2.2. Graph convolutional networks
a) IP information reconstruction using Graph Convolutional Networks
GCN is a deep network that operates on graphs [22]. It was proposed to reconstruct the information of vertices; for example, it can create representative vectors, or embedding vectors, of vertices based on their neighboring vertices. Kipf and Welling [22] showed that GCN is efficient for classifying vertices when it is combined with another classifier.
In this paper, GCN is combined with another classifier so that the representation of each vertex captures not only its own feature set but also the information of its directly related vertices. This helps provide better classification results. The role of the convolution layers in GCN is to retain the information of related vertices, each layer for each vertex. The number of convolution layers depends on how many neighboring vertices need to be stored. Figure 3 illustrates the structure of the first convolution layer of GCN.

First convolution layer of GCN for APT attack detection.
To model this, consider an undirected graph:
$$G = (V, E)$$
where V and E are the sets of vertices and edges, respectively.
In GCN, the set of edges also includes all self-loops, i.e., (v, v) ∈ E, ∀ v ∈ V. The reconstruction of a vertex by adding the information of its directly related vertices is formulated by (6):
Where $h_v$ is the output of the layer for vertex v, $W \in \mathbb{R}^{m \times m}$ and $b \in \mathbb{R}^{m}$ are the weight matrix and bias vector, respectively, $N(v)$ is the set of neighboring vertices of v, i.e., the vertices on the edges connected to v, and ReLU is the activation function presented in Section 1.
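A single-layer update consistent with the symbols defined here can be sketched as follows. This is a reconstruction under assumptions rather than a verbatim formula: $x_u$, the input feature vector of vertex $u$, is a symbol introduced for illustration, and placing the bias inside the sum is one common convention for this type of layer.

```latex
h_v = \mathrm{ReLU}\Big( \sum_{u \in N(v)} \big( W x_u + b \big) \Big)
```

Because $N(v)$ contains $v$ itself through the self-loop, the vertex's own features contribute to $h_v$ alongside those of its neighbors.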
Equation (6) involves all vertices on the edges connecting to v. In other words, GCN can reconstruct the vertices of the graph at a higher level compared to traditional methods based on normal attributes. The general formula for information reconstruction can be presented as follows.
Where k is the layer index, and
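Stacking such layers, with the output of layer k feeding layer k + 1, can be sketched in plain Python. This is an illustrative sketch, not the authors' code: the helper names (`gcn_layer`, `gcn_forward`), the dictionary-based graph representation, and the choice to start from the raw feature vectors at the first layer are all assumptions.

```python
def relu(vec):
    return [max(0.0, x) for x in vec]


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def gcn_layer(h, neighbors, W, b):
    """One layer: new h_v = ReLU(sum over u in N(v) of (W h_u + b))."""
    out = {}
    for v, nbrs in neighbors.items():
        acc = [0.0] * len(b)
        for u in nbrs:
            msg = matvec(W, h[u])
            acc = [a + m + bi for a, m, bi in zip(acc, msg, b)]
        out[v] = relu(acc)
    return out


def gcn_forward(x, neighbors, layers):
    """Apply the layers in order; layer k consumes the output of layer k - 1."""
    h = x
    for W, b in layers:
        h = gcn_layer(h, neighbors, W, b)
    return h


# Path graph a - b - c; every neighbor list includes the self-loop.
neighbors = {"a": ["a", "b"], "b": ["a", "b", "c"], "c": ["b", "c"]}
x = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
identity = [[1.0, 0.0], [0.0, 1.0]]
zero_bias = [0.0, 0.0]

h1 = gcn_forward(x, neighbors, [(identity, zero_bias)])
h2 = gcn_forward(x, neighbors, [(identity, zero_bias)] * 2)
print(h1["a"], h2["a"])  # [1.0, 1.0] [3.0, 3.0]
```

With identity weights the effect of depth is easy to see: after one layer vertex a only reflects a and b, while after two layers it also carries information that originated at c.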
b) IP information reconstruction on flow
The IP information reconstruction based on flow properties using GCN is graphically presented in Fig. 4.

The model using GCN.
Figure 4 shows that the GCN-based APT attack detection model has 3 main components, corresponding to 3 matrices, as follows.
Adjacency matrix A of size N × N (where N is the number of vertices). The purpose of this matrix is to reconstruct the graph.
Feature matrix X of size N × D (where D is the number of features of a vertex). This matrix is useful for calculating the loss function.
Label matrix T of size N × E, which is a binary matrix (where E is the number of classes). This matrix contains the outputs of the model.
In the APT attack IP detection problem using flow information, each IP is considered a vertex, and an edge connects the two vertices corresponding to the SrcIP and DstIP of each flow. As a result, a graph is created that represents all IPs in the dataset. The procedure for generating the graph is presented in Algorithm 2:
Generating the graph based on GCN
The feature matrix is then used to present the IP information; it is built with the feature extraction algorithm by cumulatively adding the flow features together. This procedure helps minimize the information loss compared to composing the information manually.
The label matrix is formulated such that the i-th row presents the label of IP i, with length equal to the number of labels. For example, for the binary APT attack detection problem, the label vector t of an IP that is an APT attack is t = [0, 1].
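The three matrices can be assembled from a list of flows as in the sketch below. It is a hedged illustration, not Algorithm 2 itself: the record layout, the function name `build_graph_matrices`, the symmetric (undirected) adjacency, and the label [1, 0] for normal IPs are assumptions; only the label [0, 1] for APT attack IPs is taken from the text.

```python
def build_graph_matrices(flows, attack_ips):
    """Build adjacency A (N x N), features X (N x D) and labels T (N x 2)."""
    ips = sorted({ip for f in flows for ip in (f["src_ip"], f["dst_ip"])})
    index = {ip: i for i, ip in enumerate(ips)}
    n = len(ips)
    d = len(flows[0]["features"])

    # Adjacency with self-loops: (v, v) is an edge for every vertex v.
    A = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    X = [[0.0] * d for _ in range(n)]
    for f in flows:
        s, t = index[f["src_ip"]], index[f["dst_ip"]]
        A[s][t] = A[t][s] = 1  # edge between SrcIP and DstIP (assumed undirected)
        for i in (s, t):  # cumulative feature sum for both endpoints
            X[i] = [a + b for a, b in zip(X[i], f["features"])]

    # One-hot labels: [0, 1] = APT attack, [1, 0] = normal (assumed).
    T = [[0, 1] if ip in attack_ips else [1, 0] for ip in ips]
    return ips, A, X, T


flows = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "features": [1.0, 2.0]},
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.3", "features": [3.0, 4.0]},
]
ips, A, X, T = build_graph_matrices(flows, attack_ips={"10.0.0.3"})
print(A)  # [[1, 1, 1], [1, 1, 0], [1, 0, 1]]
```

For a dataset with tens of thousands of IPs a dense N × N list is wasteful, so a sparse representation would be the practical choice; the dense form is used here only to keep the example readable.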
3.3.2.3. Proposed combined model from Long Short Term Memory and Graph Convolutional Networks
a) Reconstruction of IP information on flow
The main disadvantage of composing IP features with MLP and GCN as discussed above is that the time index is not utilized. Since the time index is a valuable source of information for the APT attack detection process, the performance of the detection system may be degraded if that index is set aside. BiLSTM can be used here to take the information from the time index into account. Consequently, the general detection model includes a BiLSTM running on the flows of each IP, combined with GCNs to reconstruct the information of the IP [23]. This IP embedding is used instead of the feature set in the matrix X presented in Section 2.
b) BiLSTM
Long Short-Term Memory (LSTM) has been mathematically defined and described in detail in [24]. According to that work, $\mathrm{LSTM}(x_{1:i})$ receives the input vectors $x_{1:i}$ as a sequence from left to right, then outputs at position i the value $h_i \in \mathbb{R}^{d_h}$. The sequence is thus represented through the segment from the starting position to position i. In other words, the network encodes the information of the sequence at position i based only on its left side. However, the information at any position j is related to its right context as well, not just the left one. BiLSTM can be used to overcome this limitation of the LSTM network. BiLSTM uses 2 LSTM networks, as follows [25, 26]. One LSTM network is used for forward propagation (from left to right). One LSTM network is used for backward propagation (from right to left).
These two networks are denoted LSTMF and LSTMB, respectively. BiLSTM fuses the outputs of the two networks into a single combined vector. As a result, the information from both the left and the right of any position is encoded. This reconstruction process is presented in formula (9) below.
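In the usual formulation of a bidirectional LSTM, which the description above appears to follow, the fused output at position i can be written as shown next. The sequence length $n$ and the concatenation operator $\circ$ are notational assumptions introduced for this sketch.

```latex
\mathrm{BiLSTM}(x_{1:n}, i) = \mathrm{LSTM}_F(x_{1:i}) \circ \mathrm{LSTM}_B(x_{n:i})
```

Here $\mathrm{LSTM}_F$ reads the sequence from position 1 up to i, $\mathrm{LSTM}_B$ reads it from position n back to i, and the resulting vector has twice the dimension of a single LSTM output.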
In general, the proposed model, which combines the GCN discussed in Section 2 with the BiLSTM, is presented in Fig. 5.

The proposed model combining BiLSTM and GCN.
c) IP information reconstruction method on flow
Here, the graph described in part b of Section 2 needs to be reconstructed and presented as an attribute matrix. Empirical results show that the LSTM achieves the best performance when the input sequence is divided into 64 segments. Furthermore, n matrices need to be constructed, corresponding to the n IPs. For each IP there is a 64 × s representation matrix, where s is the vector size of one segment. Overall, the input is a tensor of size n × 64 × s.
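The text does not state how the 64 segments are formed, so the following is only one plausible sketch of producing an n × 64 × s input. It assumes that each IP's flows are ordered by time, that they are split into 64 nearly equal consecutive chunks, and that the flow features in each chunk are summed, so that s equals the number of flow features. The names `segment_flows` and `build_tensor` are hypothetical.

```python
def segment_flows(flow_features, num_segments=64):
    """Turn one IP's time-ordered flow vectors into num_segments rows.

    Assumes flow_features is a non-empty list of equal-length vectors.
    """
    s = len(flow_features[0])
    total = len(flow_features)
    rows = []
    for k in range(num_segments):
        # Integer arithmetic gives consecutive chunks covering every flow once.
        start = k * total // num_segments
        end = (k + 1) * total // num_segments
        row = [0.0] * s
        for vec in flow_features[start:end]:
            row = [a + b for a, b in zip(row, vec)]
        rows.append(row)  # chunks with no flows stay all-zero
    return rows


def build_tensor(flows_by_ip, num_segments=64):
    """Return the IP order and a nested list of shape n x num_segments x s."""
    ips = sorted(flows_by_ip)
    return ips, [segment_flows(flows_by_ip[ip], num_segments) for ip in ips]


flows_by_ip = {
    "10.0.0.1": [[1.0, 2.0]] * 128,  # 128 flows: two per segment
    "10.0.0.2": [[1.0, 0.0]] * 10,   # fewer flows than segments
}
ips, tensor = build_tensor(flows_by_ip)
print(len(tensor), len(tensor[0]), len(tensor[0][0]))  # 2 64 2
```

IPs with fewer than 64 flows produce some all-zero segments under this scheme; padding or repeating flows would be alternative choices.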
Data set
APT attack data
The experimental dataset is collected and analyzed from 29 network traffic files in the Malware Capture CTU-13 data warehouse, which contains 6 malware families from different APT attacks: Andromeda, Colbalt, Cridex, Dridex, Emotet, and Gh0stRAT. In the dataset, the distribution of the network flows of each APT attack type is as follows [27].
Andromeda: 56,134 normal network flows, 22,205 APT attack network flows.
Colbalt: 37,874 normal network flows, 605 APT attack network flows.
Cridex: 5,242 normal network flows, 121,746 APT attack network flows.
Dridex: 501,035 normal network flows, 216,781 APT attack network flows.
Emotet: 190,999 normal network flows, 569,731 APT attack network flows.
Gh0stRAT: 8,288 normal network flows, 8,283 APT attack network flows.
Normal data
The normal network traffic dataset is collected from the e-government server of Quang Nam province [28]. The dataset is provided by the Ministry of Science and Technology of Vietnam via the national project KC.01.05/16-20. The data were collected on June 26, 2019.
Data selection for experiments
The data selected for the experiments comprise 9,958,918 flows, including 931,068 APT attack flows and 9,027,850 normal flows. These flows involve 50,052 normal IPs and 234 APT attack IPs.
Installation requirements and classification measures
Installation requirements
The experiments are implemented on machines with the following configuration.
Software: Python 3.6.7; Tensorflow-gpu 0.13; networkx 2.4; Ubuntu 16.04.6.
Hardware: 32 GB RAM; Intel Core i5-7500 CPU @ 3.4 GHz; GeForce GTX 1080 Ti GPU; 1 TB SSD.
Classification Measures
To evaluate the performance of the APT attack detection system, 4 different measures are used: accuracy, precision, recall and F1-score. These metrics are calculated based on the following components.
True positive (TP) is the number of APT attack IPs correctly classified.
True negative (TN) is the number of normal IPs correctly classified.
False positive (FP) is the number of normal IPs misclassified as "APT".
False negative (FN) is the number of APT attack IPs misclassified as normal.
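For reference, the standard definitions of these four measures in terms of TP, TN, FP and FN are given below; they are presumably the ones used in the experiments, although the formulas are not stated explicitly in the text.

```latex
\begin{aligned}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN}, &
\text{Precision} &= \frac{TP}{TP + FP}, \\
\text{Recall}    &= \frac{TP}{TP + FN}, &
\text{F1-score}  &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{aligned}
```

Because accuracy includes TN, it is dominated by the majority class when normal IPs vastly outnumber attack IPs, which is why precision, recall and F1-score are reported alongside it.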
Experimental data setup
The dataset is divided into two subsets, training and testing, with two main scenarios. The training and testing subsets are selected randomly. 80% of the data, equivalent to 189 APT attack IPs and 40,042 normal IPs, is used for training; the remaining 20%, equivalent to 45 APT attack IPs and 10,010 normal IPs, is used for testing. In our research, the experiments were repeated with different ratios between the training and testing subsets, and the 80/20 ratio gave the best performance. In this paper, only the experimental results from this data division scenario are provided and discussed. Readers are encouraged to try different data division setups with their own datasets.
The performance of the proposed IP classification algorithm using deep learning is evaluated with different parameter setups. The proposed GCN and BiLSTM-GCN models are also compared with other machine learning models, such as the MLP used in [10]. Different structural and parameter setups for the combined BiLSTM-GCN model are also tested to find the best results.
Experimental results
For the MLP algorithm, different numbers of hidden layers are combined with different numbers of neurons in each layer. Table 2 presents the APT attack detection performance of MLP with different setups.
Performance of MLP [10] for APT attack detection
The empirical results in Table 2 show that the single-hidden-layer MLP, even with 100 hidden neurons, has the worst performance. The more hidden layers and neurons are used, the better the classification performance appears to be. Consequently, the model with 4 hidden layers [500, 500, 500, 50] has the best performance on all measures. However, the precision score of the MLP models is very low in all setups. The reason is the severe imbalance of the data: normal IPs far outnumber attack IPs (there are more than 200 times as many normal IPs as APT IPs), and the MLP cannot learn enough information from such a small number of attack samples. Figure 6 shows the confusion matrix for the MLP model with 4 hidden layers [500, 500, 500, 50].

Confusion matrix of MLP with 4 hidden layers [500,500,500,50].
As shown in Fig. 6, the number of FPs, which reaches 162 IPs, is the main factor behind the relatively low precision score shown in Table 2 (16.06%). This is due to the dominance of normal IPs, which confuses the model and generates bias. In fact, even if the model accurately predicted all 45 APT attack IPs, the precision would still be only 45/(45 + 162) = 21.74%. This is a relatively low result for a classification model, caused by the high number of normal IPs misclassified as attacks. However, the overall accuracy of the system is still very high. The model has accurately classified up to 98.24% of normal IPs and 68.89% of APT attack IPs. On the bright side, the APT attack detection results in Table 2 and Fig. 6 show that MLP is still a good model for APT attack detection since it is not over-fitted, even under the dominance of the normal IP data.
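The precision figures quoted above can be checked with a short calculation. The counts below are inferred from the text rather than read from the confusion matrix: 68.89% of the 45 test attack IPs corresponds to 31 true positives (hence 14 false negatives), and 162 is the reported number of false positives. The function name is hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1-score from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts inferred from the text (assumption): 31 of 45 attack IPs detected.
tp, fn, fp = 31, 14, 162
precision, recall, f1 = precision_recall_f1(tp, fp, fn)

# Best case for precision with the same 162 false positives:
# all 45 attack IPs detected.
precision_upper_bound = 45 / (45 + fp)

print(f"precision = {precision:.2%}")  # 16.06%
print(f"recall    = {recall:.2%}")     # 68.89%
print(f"upper bound on precision = {precision_upper_bound:.2%}")  # 21.74%
```

The calculation makes the point of the paragraph concrete: with this many false positives, precision is capped near 22% regardless of how many attack IPs are found, so reducing false positives matters more than raising recall.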
Table 3 presents the experimental results of APT attack detection using network flows based on GCNs. Different numbers of GCN convolution layers are also investigated.
Experiment results of GCN
The results in Table 3 show that the GCN model works better than the MLP used in [10] in terms of accuracy and F1-score. Besides, although the GCN model does not have a much better recall score than MLP, the confusion matrices in Figs. 6 and 7 show that the two models have an identical correct detection rate for APT attack IPs.

Confusion matrix of 3-layer GCN.
Based on Fig. 7, it can be seen that the 3-layer GCN misclassified only 114 IPs as attacks, compared to 162 for the best MLP model structure. This means that a graph-based model is well suited to reconstructing IP information from the flows. In other words, the representation of IP information using GCN is more complete and accurate thanks to the information from adjacent vertices.
The experimental results of our proposed APT attack detection model based on network flows, using the combination of BiLSTM and GCN, are presented in Table 4. Different numbers of convolution layers are also investigated in the experiments.
Experimental results of the proposed BiLSTM-GCN model
The experimental results in Table 4 show that the proposed BiLSTM-GCN model works best when 2 BiLSTM layers are combined with 2 GCN layers. Comparing the results in Tables 3 and 4, the BiLSTM-GCN model is much better than both the GCN and MLP models on all scores. These results demonstrate that the GCN layers can help improve the accuracy of normal IP classification. They also show that the combined BiLSTM-GCN model supports the reconstruction of IP information better than the single models, even with the large number of normal IPs, which is more than 10,000.
The charts in Fig. 8 show that the accuracy of normal IP detection for the BiLSTM-GCN model is higher than that of the single GCN model. Additionally, for APT attack data, the BiLSTM layer efficiently extracts useful flow features to characterize the IP embeddings, i.e., it outputs full information about the IP, although the empirical dataset contains only 189 APT IPs.

Performance summary chart of different models.
In this paper, three main achievements have been accomplished. Firstly, a new APT attack detection method based on network flow analysis of network traffic using deep learning models is proposed. Experimental results show that the proposed model can detect APT attacks not only with a high accuracy score but also with a low false alarm rate, even when there is a huge imbalance between normal and attack samples in the dataset. Secondly, the proposed combined BiLSTM-GCN model is better than the other models considered, i.e., the MLP and GCN models, for APT attack detection on all performance measures. This demonstrates that combining individual deep learning models is efficient in extracting good features for detecting APT attack IPs. Finally, different structural combinations of BiLSTM-GCN have been investigated in the experiments, and the results show that the BiLSTM-GCN model with 2 BiLSTM layers and 2 GCN layers provides the best performance.
The proposed method opens up a new direction for APT attack detection, based on network flow analysis techniques that extract features of the IPs and then help detect APT attack IPs. This is a meaningful contribution since, in practice, the number of data sources for monitoring and detecting cyber-attacks is growing, which traditional methods cannot handle. In our future work, this approach will be further developed for other types of cyber-attack detection, such as DDoS detection, botnet detection, and unauthorized intrusion detection. In addition, the success of the combined BiLSTM-GCN model is a good motivation for deep learning algorithms to be widely applied in cyber-attack detection systems. This will also be the topic of our future work on analysis, information reconstruction, and feature extraction for APT attack IP detection based on network traffic.
