Abstract
Advanced persistent threat (APT) attack campaigns have become a common way for cyber-attackers to compromise and exploit end-user computers (workstations) in recent years. In this study, a combination of deep graph networks and contrastive learning is proposed to improve the effectiveness of APT malware detection. The idea is to combine deep graph networks, namely Graph Convolutional Networks (GCN) and Graph Isomorphism Networks (GIN), with popular contrastive learning losses, namely N-pair Loss, Contrastive Loss, and Triplet Loss, in order to optimize APT malware detection and classification on endpoint workstations. The proposed approach consists of three main phases. First, the behaviors of APT malware are collected and represented as graphs. Second, GIN and GCN networks are used to extract feature vectors from these graphs. Finally, the contrastive learning models (N-pair Loss, Contrastive Loss, and Triplet Loss) are applied to determine which feature vectors belong to APT malware and which belong to normal files. This combination of deep graph networks and contrastive learning is a novel approach that not only improves the accuracy of APT malware detection but also reduces false alarms on normal behaviors. The experimental results show that the proposed model scores between 88% and 94% across all performance metrics, which indicates that it is both scientifically sound and practically useful. In addition, the combination of GIN and N-pair Loss performs better than the other combined models. The result is a baseline malware detection system with flexible parameter and model choices for real-world applications.
Introduction
The problems
APT attacks have long been considered one of the most dangerous attack techniques in cyberspace [1–6]. Any organization or government agency can easily become a victim of this kind of attack [3]. As a result, the detection and early warning of APT attacks has recently been a hot topic for researchers throughout the world. According to the study in [3], an APT attack campaign goes through several common stages and cycles, including reconnaissance, attack, privilege escalation, data exfiltration, and maintenance. This information can be exploited to detect APT attacks early, as presented in [7–9]. One technique often used by attackers during APT campaigns is to distribute small pieces of malicious code to endpoint workstations and then gradually escalate into the system [7, 8]. Apart from technical means, the human factor can also be exploited in APT attacks, since people play the most important but also the weakest role in a cyber system. Several approaches to detecting malware in general, and APT malware in particular, can be found in [10–13]. As discussed in [12, 13], artificial intelligence models that analyze the anomalous behaviors of malware currently yield better results than many traditional approaches. In addition, Cho et al. [16] pointed out the following issues in traditional approaches that need to be addressed to build a better APT attack detection system.
–Anomalous behavior collection issues: in traditional approaches, a sandbox environment is commonly used to collect the behaviors of malware. This is effective for regular types of malware but not for APT malware, which can detect the sandbox and cease its operations when running in that environment. Additionally, APT malware may have a dormant mechanism that lasts for weeks or even several months, making it impractical for the sandbox to capture its behaviors. As a result, using a sandbox to collect the behaviors of APT malware is not very feasible or practical.
–Preprocessing and feature extraction issues: since it is challenging to collect behavior data for APT malware, selecting and extracting specific and meaningful features becomes more difficult. To address this problem, researchers tend to construct a synthetic dataset in a laboratory setting, rely mainly on it, and extract features from it. While this approach can be effective for detecting malware in simulated cases, it may not be useful for real-world APT attack detection because the resulting attributes are difficult to generalize in practice.
Although anomalous behavior analysis is a promising approach to APT malware detection [12, 13] thanks to the combination of static analysis tools, dynamic analysis tools, machine learning, and deep learning models, the two problems above still need to be addressed.
The proposed approach
Based on the analysis above, collecting the behaviors of APT malware, together with constructing, extracting, and calculating the correlations between these behaviors, plays an important role in a successful APT malware detection system. In this paper, a new APT detection approach based on deep graph networks and contrastive learning is proposed. The new method includes three main phases as follows.
The scientific basis of the proposed method
Many different methods and techniques for detecting APT malware have been introduced in the literature. The APT malware detection approach in this study stands out on the basis of two main factors, as follows.
–
–
Contributions of paper
The novelty and scientific contributions of this study are threefold, as follows.
–
–
Related studies
Cho et al. [19] proposed an APT attack detection method based on behavior profile analysis. Specifically, the research team built behavior profiles from the abnormal features and behaviors of IPs and domains in network traffic. In the experimental part, the authors used the Random Forest (RF) algorithm to demonstrate the effectiveness of APT attack detection by behavior profiles. Behavior profile analysis is also used in another APT attack detection model proposed by Cho [8], in which the APT attack behavior profile is built with a multi-layer analysis technique. Specifically, the authors used 2 main layers to build APT attack behavior profiles from network traffic. In the first layer, a machine learning method is used to detect APT IPs. In the second layer, the Suricata tool is used to split network traffic into components such as the DNS log, HTTP log, log file, and TLS log, and the abnormal behaviors of these components are then extracted. Finally, the behavior profiles of the APT attack are built from the results of layers 1 and 2. In the experimental section, the authors demonstrated that APT attack detection based on behavior profiles performs better than detection based on individual components. Following the same idea, studies [20–22] proposed detecting APT IPs based on the behavior profile of each IP in network traffic. Their approach combines a deep learning model with an Attention network to distinguish normal IPs from APT IPs. In addition, Amit Sharma et al. [23] presented a study on the techniques that APT malware uses to hide, erase traces, and recognize sandboxes. In that research, the authors attempted to define a list of sandbox detection behaviors of certain APT malware families.
These results are considered an initial foundation for further studies that define the anomalous behaviors of APT attacks. Additionally, the DMMal model [24] was proposed based on multi-view images of known malware. However, the success of that approach relies mostly on complex computations and large amounts of training data. The studies in [15–17] proposed methods for detecting APT malware on workstations based on anomalous behaviors and graph analysis techniques. Specifically, the authors used 4 main feature groups to extract the abnormal behaviors of APT malware: “Process” Events, Registry Events, Network connection, and File Deletion Events. Deep graph networks are then used to extract behaviors and to detect the malware. In addition, Mohamed et al. [25] proposed the FWA-FS model for detecting malware on the Android operating system. The authors extracted and analyzed the anomalous features of Android malware from APK files. In the experimental section, the authors demonstrated that the model’s malware detection accuracy increased from 6% to 25%. Building on the same idea, research [54] proposed a malware detection method based on attributes in APK files and deep learning algorithms. However, neither method proposes new attributes or develops techniques for representing and synthesizing information. Yong Fang [26] proposed an APT attack detection method based on the LMTracker model, in which a heterogeneous graph is used to analyze and predict APT attack behavior in event logs and traffic. Ullah [27] proposed an approach to malware detection that extracts attributes from API-Call Graphs (ACGs) with byte-level image representation. In the experimental section, the authors used the CIC-InvesAndMal2019 dataset and achieved over 99% accuracy. Studies [28, 29] proposed the Log2vec and Sec2graph methods to detect APT attacks in organizations.
With the same idea, Xiaohu Liu [30] proposed a method to detect APT based on analyzing the graph of HTTP traffic.
In the study [31], the authors proposed a GCN model for the Android malware detection task based on the anomalous behaviors of malware running on applications. Similarly, Jiaqi Yan et al. [32] proposed a malware detection method based on Control Flow Graphs (CFGs) using DGCNN. The research [33] proposed using Deep Eigenspace Learning to embed the CFGs of OpCodes in the executable file. In the experimental section, on a dataset consisting of 1,078 normal files and 128 IoT malware samples, the CNN algorithm reaches an accuracy of up to 99.68%. However, this result holds only for such a limited dataset, which makes the evaluation heavily biased.
The proposed model
Overall architecture
Figure 1 shows the architecture and flow of the proposed model. Accordingly, the main components in the proposed model include:

The architecture of the APT malware detection model that combines deep graph network and contrastive learning.
Event IDs have already been exploited in other studies [15–17], where Cho et al. proposed a method to select and extract APT malware features. The authors used 4 main feature groups: the “Process” Event, Registry Event, File Deletion Event, and Network connection. These are important features for depicting the behaviors of APT malware. In this research, we continue to use these feature groups as the basis for the APT malware detection system. After extracting the APT malware features from these 4 feature groups, a process tree is built for each file. The process tree is built as follows: each node is a malware process, and each edge represents a call from a parent process to a child process.
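The process tree construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `(parent_pid, child_pid)` event shape, the function names, and the sample PIDs are all hypothetical, and real event logs carry many more fields.

```python
from collections import defaultdict


def build_process_tree(events):
    """Group process-creation events into a parent -> children map.

    `events` is an iterable of (parent_pid, child_pid) pairs. This record
    shape is a hypothetical simplification, not the paper's log format.
    """
    children = defaultdict(list)
    nodes = set()
    for parent_pid, child_pid in events:
        children[parent_pid].append(child_pid)  # edge: parent calls child
        nodes.update((parent_pid, child_pid))
    return nodes, dict(children)


def adjacency_matrix(nodes, children):
    """Return the sorted node list and a dense 0/1 adjacency matrix.

    matrix[i][j] == 1 means process order[i] spawned process order[j].
    """
    order = sorted(nodes)
    index = {pid: i for i, pid in enumerate(order)}
    matrix = [[0] * len(order) for _ in order]
    for parent_pid, kids in children.items():
        for child_pid in kids:
            matrix[index[parent_pid]][index[child_pid]] = 1
    return order, matrix


# Hypothetical events: process 100 spawns 200 and 300, and 200 spawns 400.
events = [(100, 200), (100, 300), (200, 400)]
nodes, children = build_process_tree(events)
order, A = adjacency_matrix(nodes, children)
```

The adjacency matrix produced here is the kind of input a deep graph network consumes, together with a per-node feature matrix.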
Feature extraction
As stated above, the purpose of the Feature Extraction phase is to embed each process tree into a uniform feature vector. Each process tree, together with its label, is analyzed and standardized into a fixed-length vector. This phase is very important because a successful and complete standardization improves the efficiency of the APT malware detection process. However, the characteristics of APT malware files and normal files are very different, so the architectures and lengths of the process trees also differ. It is therefore very difficult to standardize and embed process trees of different lengths into vectors of equal length. To achieve this goal, this study proposes to use deep graph networks [34–37]. These networks are discussed further in the next section of this paper.
Graph Convolutional Network (GCN)
Graph Convolutional Network (GCN), introduced by Thomas Kipf and Max Welling, is one variant of Graph Neural Networks [38]. GCN uses local spectral filters on the graph to extract subgroups from the graph, thereby clearly showing the graph structure. The ‘convolution’ in a GCN layer aims to extract features from neighboring components, similar to a CNN layer. However, because CNNs have been shown to be less effective on non-Euclidean data, the GCN architecture instead extracts features from the neighboring nodes of the graph. The following formula (1) shows the feature propagation process of a GCN layer in the GCN model.
Where:
A is the adjacency matrix
X is the feature matrix
I is the unit matrix of the same size as A
f is the activation function
D is the degree matrix of (A + I)
W is the weight matrix
b is the bias matrix
H(i) is the output of layer i, with H(0) = X
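With the symbols defined above, the standard propagation rule of Kipf and Welling can be written as H^{(i+1)} = f(D^{-1/2} (A + I) D^{-1/2} H^{(i)} W^{(i)} + b^{(i)}). The sketch below implements one such layer in plain Python. It assumes this textbook form is what formula (1) denotes, that A is symmetric, that f is ReLU, and that b is a bias vector added to every row; the function names are illustrative.

```python
import math


def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(value):
    return max(value, 0.0)


def gcn_layer(A, H, W, b, activation=relu):
    """One GCN propagation step: f(D^-1/2 (A + I) D^-1/2 H W + b).

    Assumes A is symmetric; b holds one bias per output feature.
    """
    n = len(A)
    # A + I adds a self-loop so each node keeps its own features.
    a_hat = [[A[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    # Degree of each node in A + I (at least 1 because of the self-loop).
    deg = [sum(row) for row in a_hat]
    # Symmetric normalisation D^-1/2 (A + I) D^-1/2.
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    z = matmul(matmul(norm, H), W)
    return [[activation(v + bias) for v, bias in zip(row, b)] for row in z]


# Two connected nodes with one feature each; identity weight, zero bias.
out = gcn_layer([[0, 1], [1, 0]], [[1.0], [3.0]], [[1.0]], [0.0])
```

In this toy graph each node ends up with the average of its own and its neighbor's feature, which is the smoothing effect of the normalized adjacency matrix.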
The GIN network [39] is a GNN variant in the form of Spatial Graph Convolutional Networks. Basically, the GIN also uses a propagation formula similar to that of the GCN, so GIN can be considered as a variant of the GCN. The difference between GIN and GCN is that GCN computes on a normalized adjacency matrix, while GIN uses matrix (c(k)I+A) which helps optimize the computation speed on the network. Research [39] pointed out that although GIN’s propagation formula is somewhat simpler than Spectral Convolution methods, it still worked well in classification tasks, especially graph classification. The reason is that GIN acts as a dual representation of CNN classifier on the graph signal space where the shift operation is defined by the adjacency matrix. The propagation formula used in the GIN network is described in formula (2) below:
Where:
A is the adjacency matrix
X is the feature matrix
I is the unit matrix with the same size as A
σ is the activation function
W is the weight matrix
c = 1 + e (where e is a learnable parameter)
Z(i) is the output of layer i.
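From the symbols listed above, the matrix form of the GIN propagation step is presumably Z^{(i+1)} = σ((cI + A) Z^{(i)} W^{(i)}) with c = 1 + e. The sketch below implements that assumed form; taking Z^{(0)} = X, using ReLU for σ, and the parameter name `eps` for the learnable value e are choices made here for illustration.

```python
def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(value):
    return max(value, 0.0)


def gin_layer(A, Z, W, eps=0.0, activation=relu):
    """One GIN step in matrix form: sigma((c I + A) Z W), with c = 1 + eps."""
    n = len(A)
    c = 1.0 + eps
    # c * I + A weights each node's own features by c and each neighbour by 1.
    mixing = [[A[i][j] + (c if i == j else 0.0) for j in range(n)] for i in range(n)]
    out = matmul(matmul(mixing, Z), W)
    return [[activation(v) for v in row] for row in out]


# Two connected nodes, one feature each, identity weight, eps = 0.5.
Z1 = gin_layer([[0, 1], [1, 0]], [[1.0], [3.0]], [[1.0]], eps=0.5)
```

Unlike the GCN step, no degree normalization is applied, so the two nodes keep distinct values (4.5 and 5.5 here) rather than being averaged.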
The main components of the GIN network include [39]:
Where:
P_v is the feature vector at the v-th node
P_G is the final graph feature
MLP is the multi-layer perceptron
N(v) is the neighborhood of the v-th node
ɛ is a learnable parameter
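In the GIN formulation of [39], these symbols appear in a node update of the form P_v ← MLP((1 + ɛ) P_v + Σ_{u ∈ N(v)} P_u), followed by a readout that aggregates node features per layer and concatenates the results into P_G. The sketch below follows that description. The sum readout, the dictionary-based graph representation, and the identity function standing in for the MLP are all simplifying assumptions.

```python
def gin_update(features, neighbors, mlp, eps=0.0):
    """One GIN step: P_v <- MLP((1 + eps) * P_v + sum of neighbour features)."""
    updated = {}
    for v, p_v in features.items():
        agg = [(1.0 + eps) * x for x in p_v]
        for u in neighbors.get(v, []):
            agg = [a + x for a, x in zip(agg, features[u])]
        updated[v] = mlp(agg)
    return updated


def sum_readout(features):
    """Sum all node vectors into a single graph-level vector."""
    dim = len(next(iter(features.values())))
    total = [0.0] * dim
    for vec in features.values():
        total = [t + x for t, x in zip(total, vec)]
    return total


def graph_embedding(features, neighbors, mlps, eps=0.0):
    """Concatenate the readout of the input features and of every layer."""
    embedding = sum_readout(features)
    for mlp in mlps:
        features = gin_update(features, neighbors, mlp, eps)
        embedding = embedding + sum_readout(features)
    return embedding


identity = lambda vec: vec  # stand-in for a trained MLP

# A three-node tree (0 is the root) and a two-node tree.
P_G_big = graph_embedding({0: [1.0], 1: [2.0], 2: [3.0]},
                          {0: [1, 2], 1: [0], 2: [0]}, [identity])
P_G_small = graph_embedding({0: [1.0], 1: [1.0]}, {0: [1], 1: [0]}, [identity])
```

Both graphs yield a vector of the same length even though they have different numbers of nodes, which is what allows process trees of different sizes to be compared in one feature space.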
In this study, based on the feature vector of each file built in section 3.3, a method to improve the efficiency of the detection process for normal and abnormal files is proposed. Accordingly, a combination between the MLP network and the contrastive learning with suitable loss functions is implemented to support classification.
The MLP network
The MLP network is one of the simplest forms of deep learning model, with 3 types of layers: an input layer, an output layer, and hidden layers [40]. The input layer receives the input signal to be processed. The output layer provides the prediction and classification outcomes. An arbitrary number of hidden layers placed between the input and output layers forms the computational engine of the MLP. As in other feedforward networks, data in an MLP flows in the forward direction from the input layer, through the hidden layers, toward the output layer. The neurons in the MLP are usually trained with the backpropagation algorithm. Specifically, the working principle of the MLP is as follows [40, 41]. The neurons at the input layer receive input signals for processing (multiplying them by the corresponding weights, summing the products, then passing the sum to the transfer function). The results from the transfer functions of all the neurons in the input layer are transmitted to the first array of neurons in the hidden layer. These neurons process the input information, then send their results to the second array of the hidden layer. The process continues until the results reach the neurons in the output layer, which then provide the outputs.
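The forward pass described above (weighted sum, transfer function, then on to the next layer) can be sketched as follows. The layer sizes, weights, and the choice of ReLU and sigmoid as transfer functions are illustrative assumptions, not the configuration used in the experiments.

```python
import math


def relu(value):
    return max(value, 0.0)


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def dense(inputs, weights, biases, activation):
    """One layer: each neuron sums its weighted inputs, adds a bias,
    and passes the result through the transfer function."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + bias)
            for row, bias in zip(weights, biases)]


def mlp_forward(x, layers):
    """Feed x through the layers in order; each layer is a
    (weights, biases, activation) tuple."""
    for weights, biases, activation in layers:
        x = dense(x, weights, biases, activation)
    return x


# Hypothetical network: 2 inputs -> 2 hidden neurons (ReLU) -> 1 output (sigmoid).
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], relu),
    ([[1.0, 1.0]], [0.0], sigmoid),
]
prediction = mlp_forward([2.0, 1.0], layers)
```

Training with backpropagation would adjust the weights and biases; only inference is shown here.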
Contrastive learning
Contrastive learning is a deep learning technique for learning the general features of an unlabeled dataset by teaching the model to distinguish similar from dissimilar data points. Contrastive loss was first introduced in 2005, with initial applications in dimensionality reduction [42]. A contrastive learning model looks for pairs of features in the dataset that are similar (positive) or contrasting (negative). It pulls similar data pairs together and pushes contrasting data pairs away from each other. This requires a similarity metric to calculate the distance between the embedding vectors that represent the data points. For example, the cosine distance between two vectors can be used as the predicted probability that they differ, which can be computed with a typical classifier network. The distance to negative samples can then be treated as an output probability and trained with cross-entropy loss. In practice, in supervised classification, the outputs of the network are usually provided by a softmax function, represented by the following formula [43–45]:
The loss function is calculated using the following formula:
The formula (7) shows that contrastive loss is quite similar to the softmax function. The difference between them is that the values in the denominator are the total cosine distances from the positive samples to the negative samples.
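The relationship to softmax described above can be illustrated with a temperature-scaled, softmax-style contrastive loss (often called InfoNCE or NT-Xent). This is a sketch under stated assumptions: the exact form of formula (7) is not reproduced here, treating τ as a softmax temperature is an assumption, and the function names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_loss(anchor, positive, negatives, tau=0.1):
    """Cross-entropy of picking the positive among the positive and negatives,
    with cosine similarities divided by tau used as logits."""
    logits = [cosine(anchor, positive) / tau]
    logits += [cosine(anchor, n) / tau for n in negatives]
    return -math.log(softmax(logits)[0])


# The loss is small when the positive is aligned with the anchor
# and large when a negative is aligned instead.
good = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]], tau=1.0)
bad = contrastive_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]], tau=1.0)
```

A smaller τ sharpens the softmax, so hard negatives are penalized more strongly.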
Triplet loss was first introduced in 2015 [47] and it has been one of the most popular loss functions for supervised or similar learning since then. Triplet loss assumes that the distance between any different data pairs must be at least a certain threshold away from the distance of any similar pair. Mathematically, the triplet loss value can be calculated as follows.
Where:
a is the anchor sample
p is a sample with the same label as a
n is a sample with a different label from a
d is the function that measures the distance between two samples
m is the margin value that keeps negative samples apart
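With these symbols, the usual form of the triplet loss is L = max(d(a, p) − d(a, n) + m, 0). A minimal sketch follows, assuming Euclidean distance for d and a default margin of 1.0, neither of which is stated in the text.

```python
import math


def euclidean(u, v):
    """Euclidean distance, used here as the distance function d."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0, distance=euclidean):
    """max(d(a, p) - d(a, n) + m, 0): zero once the negative is at least
    `margin` farther from the anchor than the positive."""
    return max(distance(anchor, positive) - distance(anchor, negative) + margin, 0.0)


# Negative far from the anchor: the margin is satisfied and the loss is 0.
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
# Negative almost as close as the positive: the margin is violated.
hard = triplet_loss([0.0, 0.0], [0.0, 1.0], [0.0, 1.5])
```

Only triplets that violate the margin contribute a gradient, which is why triplet selection matters in practice.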
Triplet loss tends to amplify the distance between negative samples, making it larger than the distance between positive samples. As a result, the model may focus only on negative samples, making it less effective at distinguishing a data sample from other positive ones. To improve performance in this situation, N-pair loss [47] was introduced. The N-pair loss function is represented as follows:
The formula (9) shows that N-pair Loss is a generalized version of Triplet loss. If N = 2, it is similar to Triplet loss.
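A common way to write the N-pair loss for one anchor f, its positive f⁺, and N − 1 negatives f_i⁻ is L = log(1 + Σ_i exp(f·f_i⁻ − f·f⁺)). The sketch below uses this form, which is assumed to correspond to formula (9); the inner-product similarity and the function names are assumptions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def n_pair_loss(anchor, positive, negatives):
    """log(1 + sum_i exp(f . f_i^- - f . f^+)) for a single anchor."""
    pos_score = dot(anchor, positive)
    return math.log1p(sum(math.exp(dot(anchor, neg) - pos_score)
                          for neg in negatives))


anchor, positive = [1.0, 0.0], [1.0, 0.0]
# N = 3: one positive and two negatives compared against the anchor at once.
loss_n3 = n_pair_loss(anchor, positive, [[0.0, 1.0], [0.0, -1.0]])
# N = 2: a single negative, giving a smooth (softplus) analogue of the
# hinge in triplet loss.
loss_n2 = n_pair_loss(anchor, positive, [[0.0, 1.0]])
```

Because every negative enters the same sum, one update pushes the anchor away from all of them together, instead of from a single negative as in triplet loss.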
Experimental dataset
The dataset was collected from [48–52]. Table 1 lists the details of the datasets we have collected. The dataset is randomly divided into two subsets: 70% of the samples are used for training, and the remaining 30% are used for evaluation.
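A random 70/30 split of this kind can be sketched as below. The fixed seed and the helper name are illustrative assumptions. The text does not say whether the split is stratified by class, so a plain shuffle is shown.

```python
import random


def split_dataset(samples, train_fraction=0.7, seed=0):
    """Shuffle the samples and cut them into training and evaluation subsets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Ten placeholder sample IDs stand in for the labelled process trees.
train, test = split_dataset(range(10))
```

With an imbalanced dataset, a stratified split that preserves the class ratio in both subsets is a common alternative.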
Statistics on the number of samples of the experimental dataset
The following formulas 10, 11, 12, and 13 are used as the evaluation metrics for the APT malware detection model at endpoint workstations.
Where:
TP (True positive): the number of APT malware samples that are correctly classified as malware.
FN (False negative): the number of APT malware samples that are incorrectly classified as normal.
TN (True negative): the number of normal samples that are correctly classified as normal.
FP (False positive): the number of normal samples that are incorrectly classified as APT malware.
A larger Precision value means that a higher proportion of the samples returned as positive are truly positive.
A high Recall score means a low rate of missed positive samples.
A higher F1-score means better overall performance of the model. The highest possible value of the F1-score is 1.
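The four metrics have the standard definitions Accuracy = (TP + TN) / (TP + TN + FP + FN), Precision = TP / (TP + FP), Recall = TP / (TP + FN), and F1 = 2 · Precision · Recall / (Precision + Recall). The sketch below computes them from confusion-matrix counts. The function name and the sample counts are illustrative and are not results from this study.

```python
def classification_metrics(tp, fn, tn, fp):
    """Compute Accuracy, Precision, Recall and F1-score from confusion-matrix
    counts. Ratios with a zero denominator are reported as 0.0."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts: 10 malware samples (8 caught), 90 normal (5 flagged).
metrics = classification_metrics(tp=8, fn=2, tn=85, fp=5)
```

On an imbalanced dataset such as this example, Accuracy is dominated by the majority (normal) class, which is why Precision, Recall, and F1-score are reported alongside it.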
In this paper, we conduct experiments with some scenarios to evaluate the effectiveness of the proposed method as follows:
–Scenario 1: Conducting experiments to detect APT malware using different combinations of deep graph networks and contrastive learning algorithms. This scenario not only shows the effectiveness of the proposed method, but also shows which combination of deep graph network and contrastive learning algorithm gives the best performance.
–Scenario 2: Conducting experiments using different individual deep graph networks to analyze the data, extract the features, and classify APT malware.
Experimental results
Results of the scenario 1
In this scenario, the performance of models that combine deep graph networks with contrastive learning methods in different ways is investigated. For an objective evaluation, some parameters of each combined model are fine-tuned to obtain the best possible outputs. Specifically, different embedding vector sizes are used in the deep graph network, i.e. 128, 256, and 512. The dropout probability in the contrastive learning techniques is also adjusted.
The experimental results in Table 2 and Fig. 2 show that changing the embedding vector size in GIN and the dropout probability changes the prediction results. However, the difference between the results is small (only about 1%). The best result is obtained by the combined model whose embedding vector parameters in GIN are [3, 128] and whose dropout probability value is τ=0.1. With these settings, the model combining GIN with N-pair Loss yields an Accuracy of 94%, Precision of 88%, Recall of 88%, and F1-score of 88%. Overall, this model is effective at classifying normal files and APT malware. Next, the study tests the predictive ability of this model on the testing dataset. Figure 3 below shows the confusion matrix of the model combining GIN with N-pair Loss, using the best-performing parameters described in Table 2.
Results of the model combining GIN with N-pair Loss

Line plot of the model combining GIN with N-pair Loss.

Confusion Matrix of the model combining GIN with N-pair Loss with parameters for best results.
The results in Fig. 3 show that the model combining GIN with N-pair Loss performs relatively well. Specifically, the detection rate for APT malware is relatively high: the model correctly predicts 7,226 APT malware cases out of a total of 8,142 malware files, a rate of about 89%. For normal samples, the model also performs very well, misclassifying only 1,039 of them.
The results in Table 3 and Fig. 4 show that the model combining GIN with Triplet Loss also gives relatively good results, with the best results obtained with 4 GIN layers, τ of 0.5, and an embedding vector size of 256. Tables 2 and 3 show that the model combining GIN with N-pair Loss gives higher results than the model combining GIN with Triplet Loss on all measures. However, this difference is not large. The reason is that N-pair loss is a generalized version of Triplet loss (as stated above, N-pair loss is similar to Triplet loss when N = 2), so it gives better performance due to its flexibility in the classification problem. In addition, increasing the embedding vector size from 128 to 256 reduces the predictive ability of the model. Figure 5 below shows the confusion matrix obtained when testing the model combining GIN with Triplet Loss.
Results of the model combining GIN with Triplet Loss

Line plot of the model combining GIN with Triplet Loss.

Confusion Matrix of the model combining GIN with Triplet Loss with parameters for best results.
Comparing Figs. 5 and 3, it can be seen that the model combining GIN with N-pair Loss detects approximately 0.71% more APT malware than the model combining GIN with Triplet Loss (specifically 7,226 - 7,175 = 51 files). The model combining GIN with N-pair Loss also predicts normal files more accurately than the model combining GIN with Triplet Loss (by more than 190 files). This result once again shows the superior efficiency of the model combining GIN with N-pair Loss compared to the remaining models.
Table 4 and Fig. 6 show that the model combining GIN with Contrastive loss also performs relatively well. However, when the model’s parameters are changed, there is a relatively high disparity between the results, and the difference between metrics is inconsistent. Specifically, the difference between the best and worst Accuracy results is about 1%, while the differences for Precision, Recall, and F1-score are 8%, 4%, and 3%, respectively.
Results of the model combining GIN with Contrastive loss

Line plot of the model combining GIN with Contrastive loss.
Comparing the results of Table 2 with Table 4, it can be seen that the model combining GIN with Contrastive loss is less effective than the model combining GIN with N-pair Loss, although the difference is not large. This is shown by 3 measures: Accuracy, Recall, and F1-score. An Accuracy that is 1% higher indicates that the model combining GIN with N-pair Loss classifies better. A Recall that is 2% higher shows that the model combining GIN with N-pair Loss detects APT malware more accurately. This further reinforces the hypothesis that using the Contrastive loss function leads to a decrease in the number of detected malware samples. Finally, the F1-score is 1% higher.
Figure 7 below describes the Confusion Matrix results of the model combining GIN with Contrastive loss with best parameters (the number of GIN layers of 2, τ of 0.05, and Embedding vector of 128).
Comparing Figs. 7 and 5, we can see that the model combining GIN with N-pair Loss detects malware better than the model combining GIN with Contrastive loss. The difference is approximately 1.53%, or 109 malware files (7,226 - 7,117). Similarly, with 22,375 normal files correctly predicted, the model combining GIN with Contrastive loss falls short of the model combining GIN with N-pair Loss by about 66 files.

Confusion Matrix of the model combining GIN with Contrastive loss with parameters for best results.
The results of Tables 2 and 5 show that the model combining GIN with N-pair Loss gives higher results than the model combining GCN with N-pair Loss on all performance measures. Specifically, for the Accuracy and Precision metrics, the difference is 4%. For the Recall and F1-score metrics, the differences are 8% and 6%, respectively. This result shows that when GIN is replaced with GCN, feature aggregation and extraction become less efficient than in the combined models using GIN. In addition, Figure 8 presents the experimental results for different parameter setups of GCN and N-pair Loss. The results show that changing the parameters of GCN and N-pair Loss leads to a significant change in the prediction results.
Results of the model combining GCN with N-pair Loss

Line plot of the model combining GCN with N-pair Loss.
The results in Table 6 show that the model using only GIN [17] performs better than the model using only GCN [53]. This result once again confirms that GIN [17] is more effective than GCN [53] in terms of analyzing and extracting APT malware features. Comparing the results of Table 2 with Table 6, it can be seen that the model combining deep graph networks with contrastive learning is consistently better than the individual deep graph networks, GIN [17] and GCN [53]. Figure 9 presents a comparison between our newly proposed method and other approaches. The experimental results in Figure 9 show that our proposed approach yields better results than the other approaches on all performance measures.
The results of the model using only deep graph networks

Bar chart of the model using only deep graph networks.
Based on the two experimental scenarios presented in section 4.4, the scientific and practical contributions of the proposal in this paper have been confirmed. In scenario 1, the paper presented several models combining deep graph networks with contrastive learning methods. This scenario showed that the model combining GIN with N-pair Loss is more effective than the other combined models, and that GIN and N-pair Loss are well suited to the task of extracting features from and classifying APT malware. In scenario 2, the paper compared the proposed approach with approaches using GCN [53] and GIN [17]. By flexibly combining the layers of deep graph networks with contrastive learning algorithms, we have created a complete model that not only improves the efficiency of feature extraction but also supports the classification and prediction of abnormal malware behavior. Finally, the 2 experimental scenarios show why the proposed method outperforms the other methods. The results not only confirm the novelty and scientific significance of the study, which is the first to propose a model combining deep graph networks with contrastive learning for the APT malware detection task, but also open up a new direction for detecting anomalies, malware, insider attacks, etc. on endpoint workstations. In addition, this study suggests a new approach to classification problems on imbalanced datasets.
Conclusion
To improve the ability to detect APT malware on workstations, this study has proposed 3 main scientific contributions: the architecture of a detection model combining a deep graph network with a contrastive learning algorithm, the extraction of APT malware features using a deep graph network, and the optimization of the APT malware detection and classification process based on contrastive learning. The flexible combination of deep graph networks and contrastive learning forms a complete model that analyzes input data, extracts APT features, and detects APT malware on workstations. This is a novel combined model that has not been proposed or used in previous research. In addition, the experimental results of scenarios 1 and 2 show that the combined model proposed in the paper performs significantly better than individual deep graph models and other approaches. During the experiments, we tuned different parameters to see clearly how the detection results change with different model setups. The results of almost all optimal and suitable parameter settings are presented and compared here. The results also show that deep graph networks are very effective at aggregating and extracting APT malware behaviors based on the correlation between the processes collected from workstations. Finally, with the support of contrastive learning techniques, the classification and prediction of APT malware is greatly improved. This result once again supports the correctness and scientific significance of the new model in the paper. Apart from the aforementioned advantages, this research also goes some way toward solving one of the challenging issues in machine learning, namely the imbalance in the experimental dataset.
In practice, to detect APT malware, researchers have to deal with training datasets in which the amounts of clean data and APT malware data differ significantly. There is therefore a need to improve the feature extraction mechanism so that it highlights the information in the data of both classes. This is a crucial issue in this field of study. In this paper, these features are extracted and optimized using deep graph learning. Several future research directions could further improve the accuracy of distinguishing APT malware from normal files: i) developing a better method of extracting APT malware features from malware processes; ii) constructing a better behavior profile of APT malware in imbalanced datasets; iii) improving the loss functions in contrastive learning or using unsupervised contrastive learning.
Acknowledgments
This work has been sponsored by the Ministry of Information and Communications, Vietnam, grant number DT.18/23.
