Abstract
Detecting and warning of Advanced Persistent Threat (APT) malware on Endpoints is essential because APT attacker groups currently tend to spread malware to users and then escalate privileges in the system. In this study, to improve the ability to detect APT malware on Endpoint machines, we propose a novel intelligent cognitive computation method based on a model that combines graph embeddings and Attention, using the processes generated by executable files. The proposed method performs 3 main tasks: i) extracting behaviors of processes; ii) aggregating malware behaviors based on the processes; iii) detecting APT malware based on behavior analysis. For task (i), we propose several data mining techniques: extracting processes from Event IDs in the operating system kernel, and extracting abnormal behaviors of those processes. For task (ii), we propose a graph embedding (GE) model based on Graph Convolutional Networks (GCN). For task (iii), based on the results of task (ii), we propose a combination of a Convolutional Neural Network (CNN) and an Attention network (called CNN-Attention). The novelty of this study is an intelligent cognitive computation method that uses, combines, and synchronizes several data mining techniques to compute, extract, and represent relationships and correlations among APT malware behaviors from processes. With this method, many meaningful anomalous features and behaviors of APT malware have been synthesized and extracted. The proposed data mining methods for extracting malware features, and the list of malware behaviors provided in this paper, have not been published in previous studies.
In the experimental section, to demonstrate the effectiveness of the proposed method in detecting APT malware, we compare and evaluate it against other approaches. The experimental results show that the proposed method achieves 96.6% or more on all metrics, which is 2% to 6% higher than the other approaches. These results indicate that the proposed method is not only scientifically significant but also practically meaningful, because it improves the efficiency of analyzing and detecting APT malware on Endpoint devices. In addition, this research opens up a new approach for detecting other anomalies on Endpoints, such as malware, unauthorized intrusion, and insider threats.
Keywords
Introduction
Problem
In recent years, more and more APT attack campaigns have been recorded worldwide, not only in sensitive and important fields but also in many other fields and corporations. There are two reasons for this situation: i) An APT attack is a targeted and very dangerous attack technique [1, 2]. APT attack campaigns are often supported by government organizations and use advanced technology, so they are very difficult to detect. ii) APT attack campaigns tend to exploit weaknesses in Endpoints, which makes them difficult to detect and prevent. According to statistics in [3], most APT attack campaigns recorded worldwide in 2020 stemmed from vulnerabilities in Endpoint machines. The studies [1, 5] presented the characteristics, process, and life cycle of APT attack campaigns. Based on an overview analysis of APT, studies [6–8] pointed out that breaking a single stage in the life cycle of an attack campaign is enough to make the entire attack fail. This analysis shows that, to detect APT attacks, a solution that detects and prevents the distribution and privilege escalation of APT malware on Endpoint devices is necessary.
Malware in general, and APT malware in particular, is commonly detected either with sets of known signs or by analyzing abnormal behaviors [1, 9]. Among these, behavior analysis based on machine learning or deep learning has been highly effective because it can detect new malware types. However, previous studies [10–15] pointed out that APT malware differs from common malware in many ways, so traditional detection methods are not very effective against it. Three main difficulties in detecting APT malware, listed and analyzed in studies [16–19], are: how to define abnormal features, how to extract them, and at what moment the malware is detected. To solve these problems, in this paper we propose a new intelligent cognitive computation method for analyzing behavior and detecting APT malware on Endpoints, based on a combination of GE and CNN-Attention (called the GECA model). The GECA model not only supports analyzing and extracting the features and characteristics of APT malware but also provides a mechanism to optimize the classification of abnormal and normal behaviors, in order to improve the efficiency of detecting and warning of this malware type.
Solving the problems
The two main methods commonly studied and applied to detect APT malware are the signature-based method, which relies on a rule set, and the anomaly-based method, which relies on behavior analysis to find anomalies [4]. Approaches that combine anomaly-based detection with machine learning or deep learning have been highly effective in identifying new APT malware samples. The studies [2, 8] listed two main methods to extract features and behaviors of malware: static analysis and dynamic analysis. Current machine learning and deep learning approaches to cyber-attack detection often apply dynamic analysis with the support of Sandbox tools to analyze and extract the features and behaviors of APT malware. However, we found that these approaches have some problems [9–15]:
i) The features and behaviors of malware: Virtualization tools such as Sandbox [1] assist in collecting and extracting the features of malware. They usually work well with simple samples but are not very effective for APT malware, because this malware often has identifying functions, anti-Sandbox techniques, hibernation, etc. Therefore, the APT malware features collected and extracted from the sandbox log usually carry little meaning [16].
ii) The malware detection time: We noticed that machine learning or deep learning methods based on features and abnormal behaviors only detect malware at later stages of the attack campaign, that is, when the attackers are already able to steal information from the victim.
iii) Lack of correlation between events [10, 17]: When malware signatures and behaviors are collected with virtualization tools, it is not only difficult to collect them fully, but the system also cannot find and synthesize the correlation among individual malware events. This is because APT malware often uses a variety of exploit and propagation techniques at different times, and each collected sign or behavior looks completely benign on its own. However, if we chain the events together, we can see that they constitute the hiding and concealing behavior of malware.
The studies [1, 19] also showed that one of the problems that made detecting APT attacks much more difficult than other attack techniques was the lack of correlation in the events of the attack.
Contribution
The scientific and practical contributions of this paper to detecting APT attacks are:
i) Proposing the GECA combined model for detecting APT malware on Endpoints. The GECA model is built on a flexible combination of deep learning graph networks and Attention networks. This model uses a new intelligent cognitive computation method, not previously researched or proposed, to analyze, extract, and classify APT malware behaviors.
ii) Proposing a GE model based on GCN for building and extracting malware behaviors from processes. Within this approach, we propose two new methods: a method of extracting the abnormal behaviors of processes, and a method of synthesizing and extracting abnormal behaviors of APT malware based on processes.
iii) Proposing an APT malware detection method based on the CNN-Attention model, which combines CNN and Attention networks. To our knowledge, no previous research has proposed or applied this approach. The combined model helps improve the ability to distinguish normal files from APT malware files.
Related work
Cho et al. [20] proposed a method of detecting APT malware on Endpoints based on anomalous behaviors and graph analysis techniques. The authors used 4 main feature groups (“Process” Event, Registry Event, Network connection, and File Deletion Event) to extract abnormal behaviors of APT malware. Next, they used the Graph Isomorphism Network (GIN), a deep learning graph network, to extract behaviors and detect malware. In the experimental section, the authors compared the GIN model with other deep learning graph models such as GCN and Dynamic Graph CNN (DGCNN). The experimental results showed that the GIN model outperformed the other models. Wajih Ul Hassan [7] proposed an APT attack detection method based on graph analysis using System Logs and Rule Matching. Their research proposed Tactical Provenance Graphs for feature analysis and extraction. The experimental results showed that this method reduced the amount of data that the system needs to store by up to 87% while still ensuring the required accuracy. In addition, in the study [21], Peng Gao proposed a novel stream-based query system that takes as input a real-time event feed aggregated from multiple hosts in an enterprise, and provides an anomaly query engine that queries the event feed to identify abnormal behaviors based on the specified anomalies. The research [22] presented the ATLAS framework, which constructs an end-to-end attack story from off-the-shelf audit logs to detect signs of APT attacks. Specifically, ATLAS combines causality analysis, natural language processing, and machine learning techniques. ATLAS constructs a set of candidate sequences associated with the symptom node, uses a sequence-based model to identify the nodes in a sequence that contribute to the attack, and unifies the identified attack nodes to construct an attack story.
In the experimental section, this team evaluated the APT attack detection ability of the ATLAS model on 10 attack techniques. The average results were 91.06% precision, 97.29% recall, and 93.76% F1-score. In the study [23], Yong Fang proposed an APT attack detection method based on the LMTracker model, which uses a heterogeneous graph to analyze and predict APT attack behaviors in event logs and traffic. In addition, some studies [24, 25] proposed the Log2vec and Sec2graph methods to detect APT attacks in organizations. With the same idea, Xiaohu Liu [26] proposed an APT detection method based on graph analysis of HTTP traffic. The study [27] proposed a method to detect IoT malware using API call graphs and the SAEs deep learning model, which consists of multiple sparse AutoEncoder layers, to extract features. In the experimental part, this study evaluated the SAE-DT, SAE-KNN, SAE-NB, and SAE-SVM algorithms. Among these, SAE-DT had the best results. The dataset, collected and analyzed with the VX Heaven and Cuckoo tools, includes 880 malicious samples and 880 normal samples. This study also pointed out that the behaviors of malware could be a synthesis of clean behaviors. The study [28] proposed using Deep Eigenspace Learning to embed the control flow graphs (CFGs) of OpCodes in the executable file. In the experimental part, with a dataset of 1078 normal files and 128 IoT malware samples, the CNN algorithm gave an accuracy of up to 99.68%. However, we think that such a small and imbalanced dataset limits how reliably the algorithms can be evaluated.
In the study [29], the authors proposed the GCN model for the Android malware detection task based on abnormal behaviors of the malware running on the application. Similarly, Jiaqi Yan et al. [30] proposed a malware detection method based on CFG using DGCNN.
The research [31] collected data by monitoring processes in the Cuckoo Sandbox. The detection model, trained on the extracted data with RNN and CNN deep learning algorithms, reached a highest accuracy of 92%.
Studies [32–35] presented some other approaches based on deep learning graph networks and anomalous behaviors of malware.
Methods
The model architecture
Figure 1 describes the architecture of the proposed model for APT malware detection on Endpoint devices. The main components of the proposed model are as follows:
File: the data used for training and for detecting APT malware. Files come in many different formats, including text files, multimedia files, executable files, etc.
Endpoint: the user’s machine that needs monitoring to detect APT malware. The Sysmon log collection tool is installed and configured on each Endpoint.
Event IDs: the Events collected by the Sysmon tool from the Endpoint. Sysmon collects 23 different Event types from an Endpoint [36].
Process behavior extraction: this block synthesizes and analyzes the abnormal behaviors of processes in Event IDs. This paper uses 4 main and common Event ID groups (Network Connection, Registry Event, “Process” Event, and File Deletion Event) to build the abnormal behaviors of processes.
Process-based malware behavior representation: this block extracts APT malware’s abnormal behaviors based on the process behaviors extracted by the “Process behavior extraction” block. It performs one step in each of the following sub-blocks:
“Behavior-based process labeling” sub-block: classifies each process to determine whether its label is malicious, normal, suspicious, or unknown. This paper proposes to use machine learning and deep learning algorithms for this task.
“Process aggregation by file” sub-block: has 2 main functions: i) grouping and aggregating all processes generated by the same file; ii) building the graph of relationships among the processes generated by the same executable file.
“Process-based file behavior extraction” sub-block: aggregates and extracts APT malware’s abnormal behaviors from the file’s behavior profile, in the graph form built by the “Process aggregation by file” sub-block. To perform this task, the paper proposes a GE model based on the GCN network.
APT malware detection: this block detects APT malware on the Endpoints based on the behaviors built by the “Process-based malware behavior representation” block. For this purpose, the paper proposes a CNN-Attention combined model. The output of this block is the executable file’s label: APT malware or normal.

The process of detecting APT malware on Endpoints.
In the study [20], Cho et al. proposed using 4 main groups of events to build the APT malware behavior profile. In this paper, we use the same 4 groups, but add some new behaviors to each group.
Process event
Documents [20, 36] presented the concept and the role of the “Process” Event in detail. Based on the process characteristics, the study extracts anomalous behaviors of a process as shown in Table 1 below.
List of abnormal behaviors extracted from “Process” Event [20]
The Registry is a database storing important information about hardware, programs, settings, and user account profiles on the computer. It therefore holds valuable information that helps a monitoring system detect malware behaviors such as creating, modifying, and deleting registry entries. The research [37] proposed 7 groups of security-sensitive registry keys for malware detection. From Registry Events, which operate on key-value pairs, the following features can be extracted: the number of successful and unsuccessful writes over all key events and over the security-sensitive key group (since APT malware primarily performs writes for all user accounts to start malicious programs automatically or to establish persistence); and the number of read and delete operations over all key events. In Table 2 below, we propose features extracted from the Registry Event.
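The counting described above can be sketched as follows. This is a minimal illustration, not the feature extractor used in the study: the event dictionary layout (`op`, `key`, `success`), the feature names, and the two autostart key prefixes standing in for a security-sensitive key group are all assumptions made for the example; the 7 groups from [37] are not reproduced here.

```python
from collections import Counter

# Hypothetical stand-in for one security-sensitive key group (autostart keys).
SENSITIVE_PREFIXES = tuple(p.lower() for p in (
    r"HKLM\Software\Microsoft\Windows\CurrentVersion\Run",
    r"HKCU\Software\Microsoft\Windows\CurrentVersion\Run",
))


def registry_features(events):
    """Count registry operations for one process.

    events: iterable of dicts with assumed keys
        'op' ('write' | 'read' | 'delete'), 'key' (str), 'success' (bool).
    """
    counts = Counter()
    for ev in events:
        sensitive = ev["key"].lower().startswith(SENSITIVE_PREFIXES)
        if ev["op"] == "write":
            outcome = "ok" if ev["success"] else "failed"
            # Writes are counted over all keys and over the sensitive group.
            counts["write_" + outcome + "_all"] += 1
            if sensitive:
                counts["write_" + outcome + "_sensitive"] += 1
        elif ev["op"] in ("read", "delete"):
            # Reads and deletes are counted over all keys only.
            counts[ev["op"] + "_all"] += 1
    return dict(counts)
```

Each process would yield one such dictionary, which can then be flattened into a fixed-order feature vector.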
List of features extracted from Registry Event
To avoid being traced and detected after an incident, malware often removes its traces. After it finishes executing, the malware normally copies itself to a folder such as %Windir%, %SystemDir%, ProgramFile, ProgramFile(x86), or %temp%, and then deletes itself. Therefore, we consider file deletions in sensitive locations such as the folders above. Table 3 below lists the file deletion behaviors of processes selected and extracted from the File Deletion Event.
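As an illustration of the check described above, the sketch below flags deletions that fall inside a sensitive folder. The concrete directory paths are assumed expansions of the environment variables named in the text and will differ per host; the function names are hypothetical.

```python
import ntpath

# Assumed expansions of %Windir%, %SystemDir%, Program Files and %temp%.
SENSITIVE_DIRS = [
    r"C:\Windows",
    r"C:\Windows\System32",
    r"C:\Program Files",
    r"C:\Program Files (x86)",
    r"C:\Users\user\AppData\Local\Temp",
]


def _norm(path):
    # Normalise separators, '..' segments and case, Windows-style.
    return ntpath.normcase(ntpath.normpath(path))


def is_sensitive_deletion(deleted_path, sensitive_dirs=SENSITIVE_DIRS):
    """Return True if the deleted file lies under any sensitive directory."""
    target = _norm(deleted_path)
    return any(target.startswith(_norm(d) + "\\") for d in sensitive_dirs)


def count_sensitive_deletions(deleted_paths, sensitive_dirs=SENSITIVE_DIRS):
    """Number of deletions by one process that hit a sensitive location."""
    return sum(is_sensitive_deletion(p, sensitive_dirs) for p in deleted_paths)
```

Appending the separator before the prefix test avoids matching sibling folders such as `C:\WindowsOld`.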
List of abnormal features extracted from File Deletion Event [20]
APT attack campaigns often set up malicious C&C channels for their connections. Ports 80 and 443 are commonly used because, in practice, only these ports are allowed for incoming connections in an appropriately secured corporate or government environment. The total size of packets transmitted and received over the network is also affected, since malware usually sends commands and receives a lot of information. Based on the information collected from the Network connection events, we propose to extract the process behaviors listed in Table 4 below [20]. In Table 4, features marked with an asterisk (*) are the new ones proposed by us in this study.
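A small sketch of the two signals mentioned above (use of ports 80 and 443, and total transferred volume) is given below. It does not reproduce Table 4; the record fields `dst_port`, `bytes_sent`, and `bytes_received` and the output names are assumptions, and the byte counters are assumed to be supplied by whatever collector records packet sizes.

```python
def network_features(connections, web_ports=(80, 443)):
    """Aggregate per-process network behaviour.

    connections: list of dicts with assumed keys
        'dst_port' (int), 'bytes_sent' (int), 'bytes_received' (int).
    """
    return {
        # How many connections the process opened in total.
        "n_connections": len(connections),
        # How many of them target the commonly allowed web ports.
        "n_web_port": sum(1 for c in connections if c["dst_port"] in web_ports),
        # Total volume in each direction.
        "bytes_sent": sum(c["bytes_sent"] for c in connections),
        "bytes_received": sum(c["bytes_received"] for c in connections),
    }
```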
List of behaviors extracted from network connection
As described above, the task of the process-based APT malware behavior representation stage is to find ways to synthesize and represent malware behaviors based on process behaviors. To achieve this purpose, it is necessary to perform three main tasks: i) evaluating and classifying processes, ii) synthesizing and grouping processes by each file, iii) extracting malware’s information according to executable files. The next sections of the paper will describe and analyze these tasks in detail.
The method of labeling processes
The process labeling method is used to classify normal and abnormal processes based on behaviors. In this paper, we propose to use machine learning and deep learning algorithms for the process labeling task. Some algorithms and models used in this study include RF [39], CNN, LSTM [40]. Each algorithm or model has certain advantages and disadvantages, so we will compare and evaluate these algorithms to find the best model and algorithm.
The method of aggregating the processes by file
Figure 2 below describes the method of building and synthesizing processes by file: Fig. 2(a) presents the algorithm for building a file’s process profile in graph form, and Fig. 2(b) is an illustrative example of the graph construction method.

The method of building the file’s process profile based on processes: in which (a) describes the procedure and algorithm for building process profiles; (b) describes an example of a process profile built on the procedure (a).
From Fig. 2, it can be seen that with the graph construction method as shown in the paper, the relationships about vertices and edges are defined very clearly. With this construction method, all behaviors are bound, and show relationships with each other, thereby helping to clarify the correlation between processes of APT malware.
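Since the algorithm of Fig. 2(a) is not reproduced in this text, the following is only one plausible sketch of such a construction: each process generated by a file becomes a vertex, and an undirected edge links a process to its parent when both belong to the same file. The parent-child edge rule and the record fields `pid`, `ppid`, and `features` are assumptions.

```python
def build_process_graph(processes):
    """Build (adjacency matrix, feature rows) for the processes of one file.

    processes: list of dicts with assumed keys
        'pid' (int), 'ppid' (int), 'features' (list of numbers).
    """
    index = {p["pid"]: i for i, p in enumerate(processes)}
    n = len(processes)
    adjacency = [[0] * n for _ in range(n)]
    for p in processes:
        child = index[p["pid"]]
        parent = index.get(p["ppid"])  # None if the parent is outside the file
        if parent is not None and parent != child:
            adjacency[parent][child] = 1
            adjacency[child][parent] = 1
    features = [list(p["features"]) for p in processes]
    return adjacency, features
```

The adjacency matrix and feature rows produced this way have the shapes expected by a graph embedding model that takes an adjacency matrix and a node feature matrix as input.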
To extract the behaviors of APT malware based on the graph built above, this study proposes a GE model based on the GCN network. The GCN network is widely applied in graph normalization tasks [41–43]. Figure 3 below depicts the architecture and operating principle of our proposed model.

The architecture of the GE model based on GCN.
From Fig. 3, it can be seen that the proposed model includes 2 main components: the Training model and the Embedding model. Their blocks are as follows:
Input block: a pair of graphs expressing the relationships among APT malware’s processes, built as described above.
Graph embedding block: this block includes GCN layers that normalize the input graphs. A GCN layer uses localized spectral filters on graphs to extract subgroups on the graph, thereby clearly expressing the graph structure [43]. The purpose of the GCN is to represent all the features of the graph in the embedding space. The proposed model uses 2 GCN blocks, one per input, and computes 2 output vectors. These two GCN blocks have the same structure and share the same weights during training. More than 2 GCN layers could be used in each block; however, the research [24] pointed out that 2 GCN layers already balance efficiency against analysis and processing time. The first GCN layer looks at the neighboring nodes of each node to update that node’s features. The second GCN layer then recursively updates the nodes’ features in the graph. The output of the GCN layers is a hidden feature representation expressing the connections between nodes in the graph. Thus, after input graphs with different structures and characteristics pass through the GCN blocks, we obtain 2 homogeneous embedding vectors corresponding to the 2 input graphs. Figure 3 illustrates that the GCN model includes 2 main layers:
Where:
A: adjacency matrix
X: feature matrix
D: degree matrix
I: unit matrix with the same size as A
W: weight matrix
σ: activation function
Z(i): output of layer i
Z(0) = X
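The symbols listed above match the widely used GCN propagation rule of Kipf and Welling. Assuming that is the rule intended here, and taking D as the degree matrix of A + I, each layer can be written as:

```latex
\tilde{A} = A + I, \qquad \tilde{D}_{jj} = \sum_{k} \tilde{A}_{jk}

Z^{(i+1)} = \sigma\!\left( \tilde{D}^{-1/2}\, \tilde{A}\, \tilde{D}^{-1/2}\, Z^{(i)}\, W^{(i)} \right), \qquad Z^{(0)} = X
```

With two layers, the embedding of a graph before pooling is $Z^{(2)}$, obtained by applying the rule twice with weight matrices $W^{(0)}$ and $W^{(1)}$.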
Mean Pooling layer: this layer aggregates the features computed by the GCN layers and ensures that their output is homogeneous.
Graph distance block: this block calculates the real distance between two graphs, which is used to train the model. The smaller the distance, the more similar the two graphs are. We propose the following steps in this block: i) matching each pair of graphs; ii) calculating the feature matrix of each graph as the Hadamard product of its adjacency matrix and its feature array; iii) normalizing by padding the smaller matrix with 0 values so that the two matrices have the same size; iv) calculating the Frobenius norm of the difference of the 2 matrices. Formula (2) describes the process of calculating the difference between two matrices [44].
Where:
A: adjacency matrix
X: feature array
d: distance
F: Frobenius norm is defined by the formula (3)
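The steps of the graph distance block can be sketched as below, under the reading that the distance is d = ||A1 ∘ X1 − A2 ∘ X2||_F, where ∘ is the Hadamard product and ||M||_F is the square root of the sum of squared entries of M. The sketch assumes that each feature array X has the same square shape as its adjacency matrix A so that the elementwise product is defined; the source does not state the shape of X, and the function names are hypothetical.

```python
import math


def hadamard(a, x):
    """Elementwise product of two equally shaped matrices."""
    return [[a[i][j] * x[i][j] for j in range(len(a[i]))] for i in range(len(a))]


def zero_pad(m, size):
    """Embed square matrix m in the top-left corner of a size x size zero matrix."""
    padded = [[0.0] * size for _ in range(size)]
    for i, row in enumerate(m):
        for j, value in enumerate(row):
            padded[i][j] = value
    return padded


def graph_distance(a1, x1, a2, x2):
    """Frobenius norm of the difference of the two zero-padded Hadamard products."""
    m1, m2 = hadamard(a1, x1), hadamard(a2, x2)
    size = max(len(m1), len(m2))
    p1, p2 = zero_pad(m1, size), zero_pad(m2, size)
    return math.sqrt(sum(
        (p1[i][j] - p2[i][j]) ** 2 for i in range(size) for j in range(size)
    ))
```

Identical graphs give a distance of 0, and the value grows as the structure or features of the two graphs diverge.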
“MAE” block: this block computes the loss between the actual distance of two graphs and the distance of the two vectors representing them in the embedding space. Minimizing this loss helps the weights of the embedding model fit the graph dataset better. Formula (4) below describes the calculation performed by the “MAE” block [45].
where:
$y_i$ is the prediction
$x_i$ is the true value
$n$ is the total number of data points
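With the symbols defined above, the standard mean absolute error, which we assume is the form intended for the “MAE” block, reads:

```latex
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - x_i \right|
```

In this setting, $y_i$ would be the distance between the two embedding vectors of the $i$-th graph pair and $x_i$ the actual distance between the two graphs.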
We found that graph structures differ widely: some graphs have many edges and some have few; some densely linked graphs contain only normal information, while some sparsely linked graphs contain abnormal information. As a result, when the GCN network extracts and normalizes a graph into a uniform vector, it may miss some features. In other words, GE networks do not represent graphs with few edges well, so the values in the extracted vector may not differ clearly from normal values. If this vector is fed directly into a classification model, the model will not be able to represent these behaviors clearly, and the classification accuracy will not be high. To solve this problem, we propose to use a CNN-Attention combined network. Figure 4 below describes the architecture of the CNN-Attention model.

The architecture of the CNN-Attention combined network.
From Fig. 4, it can be seen that the CNN-Attention model architecture includes 3 main parts as follows:
Part 1: CNN network. A CNN consists of a set of basic layers, including convolution layers, nonlinear layers, and fully connected layers. These layers are linked in an order that depends on the designed architecture and the intended use of the model. The detailed structure of CNN and the terms stride, padding, and MaxPooling are described in [46, 47]. In this paper, we apply the CNN model to extract features from the vectors obtained after the GE process. We choose the CNN model at this step because the data obtained after aggregation and representation through the embedding model is sequential data; the convolution and MaxPooling layers of the CNN can therefore extract the important features of these vectors. The CNN network uses the ReLU activation function.
Part 2: Attention network. This paper proposes an Attention model for extracting APT malware’s features. From Fig. 4, it can be seen that the proposed Attention network includes the following processing blocks:
Blocks Q, V, K: matrices created to store and process the output of the CNN deep learning network. To process this output, we use linear transformations to create a variety of data representations. The processing inside blocks Q, V, K is given by formula (5).
Where $f_Q, f_K, f_V \in \mathbb{R}^{h \times d}$; h is the feature size, d is the model size.
Score block: is used to calculate the weights of two vectors Q, K by using the Dot-product operation. The results obtained in this function are used in the Softmax function to recalculate the weights. The processing inside the Score block is presented by the formula (6).
Where: d is the constant for data division (usually chosen = 64, 128); $K^T$ is the transpose matrix of K.
Softmax function: The Softmax function in the Attention network is used to calculate the attention of vector a.
Where:
Compression function: sums the values weighted by the output weights. The result o is the re-weighted vector that is passed to the classification layer. The formula is shown below.
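The Q, K, V, Score, Softmax, and Compression blocks described above can be sketched end to end as follows. The sketch assumes the standard scaled dot-product form, score = Q K^T / sqrt(d), a = softmax(score), o = a V; the text only states that d is a division constant, so the square root is an assumption, as are the function and parameter names.

```python
import math


def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(x, w_q, w_k, w_v):
    """x: (t x h) CNN output; w_q, w_k, w_v: (h x d) linear maps.

    Returns the re-weighted output o (t x d) and the attention weights (t x t).
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d = len(k[0])
    # Score block: dot products of queries and keys, scaled by sqrt(d).
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, transpose(k))]
    # Softmax block: turn each row of scores into weights that sum to 1.
    weights = [softmax(row) for row in scores]
    # Compression block: weighted sum of the values.
    return matmul(weights, v), weights
```

Each row of the returned weights shows how strongly one position of the CNN output attends to every other position.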
Part 3: Classification block:
Fully Connected Layer: this layer is like an MLP network and is responsible for learning the features processed by the CNN-Attention layers. Softmax Layer: this layer calculates the probability of each output label. The file is assigned the class with the highest probability.
From the architecture of the CNN-Attention combined model described in Fig. 4, combined with the analysis of sub-blocks in the model, it can be seen the operating principle of this model in data analysis and processing is as follows. First, the feature vector, which is processed and computed by the GE model, is put into the CNN model. At this step, the CNN network (including Convolutional and Max Pooling layers and the ReLU activation function) is used to extract the features of the vectors. The features extracted by the CNN are passed through the Attention network to retain and highlight the important features through the processing blocks of Q, K, V, score, and compression. In which, blocks Q, K, V are used for the purpose of representing vector information by linear transformation functions. The score block with the Softmax function has the function of recalculating weights of the vector and is summed at the compression function. At this point, it can be seen that there is a big difference between the output data of the GE model and the output of the CNN-Attention model. Specifically, the output of GE is a vector representing behaviors of malware in which the features haven’t been highlighted. The reason is that when modeling data, the GE network focuses only on extracting behaviors with clear differences, and behaviors with small differences are not highlighted or are sometimes ignored. With the CNN-Attention network, the output is a vector containing the highlighted features on the entire data, thereby helping to improve the efficiency of the APT malware detection process.
Experimental dataset
The normal data
Regarding the normal data, this study uses data at source [50]. Table 5 lists the components and the number of collected normal files in detail.
The normal data
The APT attack malware data is collected from the following sources: i) APT attack malware collected and reported by reputable organizations from APT attack campaigns [50], including types such as Andromeda, Cobalt, Cridex, Dridex, Emotet, Gh0stRAT, NjRAT, Lokibot, AgentTesla, etc. ii) APT attack malware collected and monitored by organizations in the Vietnam Ministry of Information and Communications, including: Vietnam Cyberspace Security Technology [51]; Viettel Cyberspace Center [52]; Research Institute of Information Security CyRadar [53]; and the National Cyber Security Center, which belongs to the Authority of Information Security [54].
Experimental scenarios
Scenarios for the experimental dataset
With the experimental dataset listed in Table 6, we will divide the dataset into different components and then conduct experiments and evaluate the effectiveness of the proposed models based on these experimental datasets. The experimental dataset is randomly divided, in which 80% of the dataset is used in the training process, the remaining 20% is used in the testing process.
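A random 80/20 division of this kind can be sketched as follows. The fixed seed and the function name are assumptions added for reproducibility of the example; the source only states that the split is random.

```python
import random


def train_test_split(samples, train_ratio=0.8, seed=0):
    """Shuffle the samples and cut them into training and testing parts."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # seeded so the split is repeatable
    cut = int(round(len(items) * train_ratio))
    return items[:cut], items[cut:]
```

Note that this plain shuffle does not stratify by class, so the malware-to-normal ratio may differ slightly between the two parts.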
The malware data
The following experimental scenarios are proposed to evaluate the effectiveness of the proposed method:
Question 1: How effective is the method proposed in the paper? In this scenario, we conduct experiments to detect APT malware with the proposed GECA model. During the experiment, we fine-tune the model’s parameters to find those giving the best performance.
Question 2: Why use the GECA combined model rather than another model? Can the networks in the proposed model be replaced with other networks? To answer this question, we examine two issues:
Question 2.1: Replacing the GE model with other methods, including G2V and Sequence.
Question 2.2: Replacing the CNN-Attention model with other classification methods, namely the LSTM, MLP, CNN, and RF models. The purpose of this scenario is to compare the proposed CNN-Attention combined model with these 4 methods in the task of classifying APT malware.
Question 3: How is the GECA combined model more effective than other models and approaches? To answer this question, we compare the experimental results of this study with other malware detection studies on the same dataset. Specifically, the 3 approaches used for comparison are GCN [29], DGCNN [55], and GCN+IndRNN [35].
Evaluation criteria
The following measures are used in this paper to evaluate the accuracy of models:
In which:
TP (True positive): the number of APT malware samples classified correctly;
FN (False negative): the number of APT malware samples classified as normal;
TN (True negative): the number of normal samples classified correctly;
FP (False positive): the number of normal samples classified as APT malware.
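The measures reported later in the paper (Accuracy, Precision, Recall, and F1_score) can be computed from these four counts. The sketch below uses the standard binary definitions, which we assume are the ones intended; the paper's own formulas are not reproduced in this text, and the function name is hypothetical.

```python
def classification_metrics(tp, fn, tn, fp):
    """Standard binary metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

The zero-denominator guards return 0.0 instead of raising when a class is never predicted or never present.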
Experimental results of classifying processes with scenario 1
Table 7 below shows the experimental results for process classification. We fine-tuned and configured the parameters of the RF, MLP, CNN, and LSTM algorithms to obtain the best results for each algorithm.
Results of classifying processes
The experimental results in Table 7 show that the RF algorithm gives better results than the other algorithms and models. Specifically, the RF algorithm with 100 decision trees yields an Accuracy of 96.49%, which is higher than that of the MLP, CNN, and LSTM models by about 4.5%, 14%, and 9%, respectively. Similarly, in predicting abnormal processes, the RF algorithm is 4.5% to 14% higher than the other algorithms. The reason is that the processes are extracted and analyzed into a relatively small number of abnormal features and behaviors, a setting in which traditional supervised machine learning algorithms such as RF are often more efficient. Next, we use the abnormal-process classification results of the RF algorithm for the APT malware classification task in the following scenarios.
Answer to question 1: This scenario tests our proposed model. We set the parameters of the models as follows. For the embedding model using GCN, we use 2 layers; the study [24] demonstrated that the number of layers in this model can be varied and that 2 GCN layers still ensure both efficiency and acceptable computation time. For the CNN network, we vary the number of layers from 1 to 3. For the Attention network, we use 1 network.
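To make the role of the 2 GCN layers concrete, the sketch below stacks two graph-convolution layers on a toy graph. It assumes the widely used propagation rule H' = ReLU(D^(-1/2) (A + I) D^(-1/2) H W); the GE model in this paper may differ in its details, and the toy adjacency matrix, node features, fixed weights, and mean pooling are all hypothetical values chosen only for illustration.

```python
import math


def normalize_adjacency(adj):
    """Return D^(-1/2) (A + I) D^(-1/2) for a square adjacency matrix."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]  # self-loops guarantee deg >= 1
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def gcn_layer(a_norm, h, w):
    """One graph-convolution layer: ReLU(A_norm @ H @ W)."""
    z = matmul(matmul(a_norm, h), w)
    return [[max(0.0, v) for v in row] for row in z]


# Hypothetical process graph: node 0 is linked to nodes 1 and 2.
adj = [[0, 1, 1],
       [1, 0, 0],
       [1, 0, 0]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 2 toy features per node
w1 = [[0.5, -0.2], [0.1, 0.3]]                   # fixed toy weights, layer 1
w2 = [[0.4], [0.6]]                              # fixed toy weights, layer 2

a_norm = normalize_adjacency(adj)
h1 = gcn_layer(a_norm, features, w1)
h2 = gcn_layer(a_norm, h1, w2)

# Mean pooling over nodes gives one vector per graph (an assumed readout).
graph_embedding = [sum(col) / len(h2) for col in zip(*h2)]
print(graph_embedding)
```

Each layer mixes a node's features with those of its direct neighbours, so two layers let every node see its two-hop neighbourhood; in a trained model the weights would be learned rather than fixed.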
Table 8 presents the experimental results when using the GECA model.
Experimental results of detecting APT malware using the GECA model
We noticed that the results of training and detecting APT malware vary with the configuration of each model, and that they improve as the number of CNN layers increases. The [2GCN-1CNN-1Attention] model gives the lowest results, with Accuracy, Precision, Recall, and F1_score of only 96.03%, 96.02%, 96.03%, and 95.96%, respectively. In contrast, the [2GCN-3CNN-1Attention] model gives the best results, with measures 0.7% to 1% higher than those of the other models. This can be explained by the different network structures: each model learns differently, so the features it obtains, and therefore its classification results, also differ. However, the change caused by varying the number of CNN layers is not large. This shows that using many CNN layers in the GECA model is unnecessary, because each additional CNN layer makes the architecture more cumbersome and the computation slower. Based on the results in Table 8, we also noticed that the GE network does a very good job of extracting the features of the behavior graph of the APT malware. These features are the basis on which the CNN and Attention networks find and combine important features, thereby improving the ability to accurately classify APT malware. Fig. 5 below shows the confusion matrix of the GECA model with the parameters that give the best results.

Confusion matrix of the GECA model.
The experimental results in Fig. 5 show that the GECA model works effectively in detecting both APT malware and normal files. Specifically, for APT malware detection, the GECA model misclassifies only 550 of the 6,046 APT files, a correct prediction rate of about 91%. For normal file detection, the model misclassifies only 334 of the 21,006 files, a correct prediction rate of 98.41%. These results are a good sign given that the dataset is imbalanced between APT files and normal files.
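The rates quoted for Fig. 5 follow directly from the four counts given in the text. The sketch below redoes that arithmetic, assuming APT malware is the positive class; the function and variable names are illustrative and do not come from the paper.

```python
def confusion_counts(positive_total, positive_missed,
                     negative_total, negative_missed):
    """Turn per-class totals and error counts into (TP, FN, TN, FP)."""
    tp = positive_total - positive_missed
    fn = positive_missed
    tn = negative_total - negative_missed
    fp = negative_missed
    return tp, fn, tn, fp


# Counts reported for the GECA model in Fig. 5.
tp, fn, tn, fp = confusion_counts(6046, 550, 21006, 334)

apt_detection_rate = tp / (tp + fn)        # share of APT files detected
normal_detection_rate = tn / (tn + fp)     # share of normal files kept
accuracy = (tp + tn) / (tp + fn + tn + fp)
imbalance_ratio = (tn + fp) / (tp + fn)    # normal files per APT file

print(f"APT files detected correctly:    {apt_detection_rate:.2%}")
print(f"Normal files detected correctly: {normal_detection_rate:.2%}")
print(f"Overall accuracy:                {accuracy:.2%}")
print(f"Normal-to-APT ratio:             {imbalance_ratio:.2f}")
```

Because there are roughly 3.5 normal files for every APT file, the overall accuracy is dominated by the normal class, which is why the per-class rates are worth reading alongside it.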
The purpose of question 2 is to clarify the role of each network in the entire model. To answer it, we evaluate each network in the combined model by replacing it with other methods.
Answer to question 2.1: Replacing the GE model with other methods, namely G2V and Sequence. In this scenario, two combined models (G2V-CNN-Attention and Sequence-CNN-Attention) are evaluated and compared with the GECA model. During the experiments, we vary the parameters of the models as follows. For G2V, we vary the dimension of the vector from 64 to 256. For CNN, we vary the number of layers from 1 to 3. For the Attention network, we use 1 network and set its Embed size to 64 and 128. For Sequence, the dimension of the vector is fixed at 16.
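The parameter combinations for the G2V-CNN-Attention experiments can be enumerated as a simple grid. This is a sketch under stated assumptions: the text gives only the range 64 to 256 for the G2V vector dimension, so the intermediate value 128 is a guess, and the dictionary keys are hypothetical names.

```python
from itertools import product

# ASSUMPTION: only the end points 64 and 256 appear in the text; 128 is a guess.
G2V_DIMS = (64, 128, 256)
CNN_LAYERS = (1, 2, 3)             # the text varies CNN depth from 1 to 3
ATTENTION_EMBED_SIZES = (64, 128)  # the two Embed sizes named in the text


def g2v_cnn_attention_grid():
    """List every combination of the three varied parameters."""
    return [
        {"g2v_dim": dim, "cnn_layers": layers, "attention_embed_size": embed}
        for dim, layers, embed in product(G2V_DIMS, CNN_LAYERS,
                                          ATTENTION_EMBED_SIZES)
    ]


grid = g2v_cnn_attention_grid()
print(len(grid), "configurations")
```

Each configuration would then be trained and scored on the same train/test split so that the rows of the results table are comparable.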
Table 9 below lists experimental results of the G2V-CNN-Attention model with changed parameters.
Experimental results of detecting APT malware using the G2V-CNN-Attention model
The experimental results in Table 9 show that the APT malware detection results of the G2V-CNN-Attention model change as its parameters change. This model works best with a G2V vector dimension of 64, a CNN with 1 layer, and an Attention network with an Embed size of 64. We noticed that although the architecture of the components in the G2V-CNN-Attention model changes, the experimental results change only slowly. Specifically, the best results (Accuracy of 95%, Precision of 94.98%, Recall of 94.99%, and F1_score of 94.86%) are only nearly 0.5% higher than those of the other configurations. Besides, we found that the results change mainly when the parameters of the G2V model change; changing the parameters of the CNN and Attention networks has little effect. This shows that graph normalization is very important in the task of extracting the features and behaviors of APT malware. Figure 6 below shows the confusion matrix of the G2V-CNN-Attention model with the best results.

Confusion matrix of the G2V-CNN-Attention model.
The experimental results in Fig. 6 show that the G2V-CNN-Attention model works relatively effectively in detecting APT malware and normal files. Specifically, for normal file detection, the model correctly detects 20,727 of the 21,006 normal files and misclassifies only 279. For APT malware detection, the model is also effective, misclassifying only 1,076 of the 6,046 APT files. Based on the results shown in Fig. 6 and Table 9, we consider the G2V-CNN-Attention model a reasonable choice for detecting APT malware on Endpoint devices.
Table 10 below shows experimental results of detecting APT malware using the Sequence-CNN-Attention model.
Experimental results of detecting APT malware using the Sequence-CNN-Attention model
Based on the results in Table 10, the Sequence-CNN-Attention model with a 2-layer CNN and an Attention Embed size of 128 gives the best results, with Accuracy, Precision, Recall, and F1_score of 93.46%, 93.45%, 93.46%, and 93.46%, respectively. As with the G2V-CNN-Attention model, changing the parameters of the Sequence-CNN-Attention model changes the results only slowly and insignificantly. Comparing the results in Tables 9 and 10, it can be seen that the G2V-CNN-Attention model gives better overall results than the Sequence-CNN-Attention model, with a difference of about 1.5%.
The confusion matrix in Fig. 7 shows that the Sequence-CNN-Attention model works well. Specifically, only 883 of the 21,006 normal files and 887 of the 6,046 APT files are misclassified. Comparing Figs. 6 and 7, we see that the Sequence-CNN-Attention model detects APT malware more accurately than the G2V-CNN-Attention model: it misclassifies only 887 APT files, while the G2V-CNN-Attention model misclassifies 1,076. In contrast, the G2V-CNN-Attention model (with only 279 misclassified normal files) classifies normal files more accurately than the Sequence-CNN-Attention model (with 883 misclassified normal files). This suggests that the G2V-CNN-Attention model is better suited to classifying normal files, while the Sequence-CNN-Attention model is better suited to classifying APT malware.

Confusion matrix of the Sequence-CNN-Attention model.
Comparing the experimental results in Tables 8 to 10, it can be seen that the proposed GECA model performs better than the other models on all measures. Specifically, the Accuracy of the GECA model is 2% to 4.3% higher than that of the G2V-CNN-Attention and Sequence-CNN-Attention models. In its ability to correctly classify normal files and APT files, the GECA model is also 2.7% to 3.6% better. Comparing the experimental results in Figs. 5 to 7, it can be seen that the GECA model significantly improves on the G2V-CNN-Attention and Sequence-CNN-Attention models in accurately predicting both APT files and normal files. Specifically, the GECA model correctly detects 526 more APT files than the G2V-CNN-Attention model and 337 more than the Sequence-CNN-Attention model. As for normal files, our proposed model misclassifies only 334 files, which is 549 and 742 fewer than the Sequence-CNN-Attention and G2V-CNN-Attention models, respectively. Thus, the number of correctly detected APT malware files increases and the number of misclassified normal files decreases. Therefore, our proposal has shown its superiority over the other models in the task of detecting APT malware.
Thus, it can be seen that when the GE network is replaced with other networks such as G2V or Sequence, the detection efficiency is greatly reduced. The reason is that the GE network works better than these alternatives in the graph normalization and analysis task. These results support the proposal to use the GE network for the graph normalization task.
Answer to question 2.2: Replacing the CNN-Attention model with other classification methods. In this scenario, we replace the CNN-Attention model with the RF, CNN, and LSTM models and with FC layers (MLP). During the experiments, we varied the parameters of each model to obtain its best results, as shown in Table 11.
Experimental results of detecting APT malware using some other classification methods
From Table 11, it can be seen that the LSTM model, with Accuracy, Precision, Recall, and F1_score of 94.86%, 94.81%, 94.81%, and 94.83%, respectively, gives better results on all metrics than the other algorithms that replace the CNN-Attention model. This result is reasonable because the LSTM model is designed to retain information across long input sequences, which supports learning to distinguish normal files from APT malware. Figure 8 shows the confusion matrix of the LSTM network.

Confusion matrix of the LSTM model.
From the confusion matrix in Fig. 8, it can be seen that the LSTM model, used in place of the CNN-Attention network, works relatively effectively for both normal and malware file detection. In particular, it correctly detects 97.2% of normal files. For APT malware classification, its accuracy reaches about 86.48%. This is an acceptable result for an APT malware detection model.
Comparing the detection results in Tables 8 and 11, it can be seen that the GECA model, with the support of the CNN-Attention combined network, is more effective than the models using LSTM or FC layers by 2% or more on all metrics.
Comparing Figs. 5 and 8, it is clear that the GECA model is more efficient than the alternative models on all measures. Specifically, with the GECA model, the number of correctly detected APT malware files increases by 252 and the number of misclassified normal files decreases by 254.
From the experimental results of APT malware detection in which CNN-Attention is replaced with other classification methods, it can be seen that the CNN-Attention combined network is more effective than those methods. The reason is that the CNN-Attention model flexibly combines CNN and Attention blocks, so it succeeds both in extracting anomalous behaviors and in highlighting the most important abnormal behaviors of malware. This leads to a more accurate classification, which is reflected in the detection of both APT malware and normal executable files.
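How an attention step can "highlight" important behaviors can be illustrated with a small dot-product attention pooling. This is a generic sketch, not the paper's Attention network: the scaled dot-product scoring, the query vector, and the toy feature vectors are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features, query):
    """Weight each feature vector by its scaled dot product with the query,
    then return the weights and the weighted sum of the vectors."""
    dim = len(query)
    scores = [sum(f * q for f, q in zip(vec, query)) / math.sqrt(dim)
              for vec in features]
    weights = softmax(scores)
    pooled = [sum(w * vec[k] for w, vec in zip(weights, features))
              for k in range(dim)]
    return weights, pooled


# Hypothetical CNN outputs for three behaviors; the second is the strongest.
behavior_features = [[0.1, 0.0], [0.9, 0.8], [0.2, 0.1]]
query = [1.0, 1.0]  # stands in for a learned query vector

weights, pooled = attention_pool(behavior_features, query)
print(weights)
print(pooled)
```

The behavior whose features align best with the query receives the largest weight, so it contributes most to the pooled vector passed on to the classifier.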
Answer to question 3: In this scenario, we run other approaches on our dataset. Specifically, we compare with the GCN [29], DGCNN [55], and GCN+IndRNN [35] models. During the experiments, we fine-tune and configure the parameters of these models as described in the respective papers. Table 12 below summarizes the best results of these models.
Experimental results of detecting APT malware with some other approaches
From Table 12, it can be seen that the GCN+IndRNN model gives the best results on all measures. In second place is the improved DGCNN model presented in the study [55], and last is the basic GCN model. We noticed that the models in [55] and [35] improve on other models only at the classifier layer, while the structure of the networks remains largely the same. The results in Table 12 show a relatively large gap, up to about 3%, between the model with the highest results and the model with the lowest results. However, when comparing Table 12 with Table 8, it is clear that the GECA model is more efficient than the models of the other approaches. Even compared with the best of these models, our proposed model is about 2% higher on all measures.
Thus, it can be seen that the GECA model, with its flexible combination of GE, CNN, and Attention blocks, is much more efficient than the other graph-based approaches. Specifically, in the GCN [29] and DGCNN [55] models, the authors use only GCN blocks to normalize the graph, which is not sufficient for the task of detecting APT malware. Similarly, for the GCN+IndRNN combined model [35], although the authors proposed an additional IndRNN network to improve the aggregation of anomalous behaviors, the limitations of the IndRNN network mean that the result of the GCN+IndRNN model is higher only than those of the GCN [29] and DGCNN [55] models and is still lower than that of our proposed model. From these experimental results, we find that additional models and algorithms should be placed after deep learning graph networks to support analyzing, aggregating, and evaluating abnormal behaviors. Therefore, our proposal to place the CNN-Attention combined network after the GE network is well founded and scientifically meaningful.
Contrary opinions
Based on the results of the 3 experimental questions, it is clear that the GECA model is more effective than other approaches. However, a natural question is whether it is necessary to build a GECA model with so many computation steps and stages. In practice, once a process has been classified as normal or abnormal, that classification could be used directly as the basis for detecting APT malware, with no need for the GECA model. In our opinion, this approach is not wrong, but it will not be highly efficient because it faces many difficult problems as follows:
Comparing the results in Tables 13 and 8, it can be seen that the methods that do not use the GECA model give much worse results. This result once again confirms that our assessment is correct.
Experimental results of other approaches for APT malware detection
The results of the experimental scenarios not only demonstrate that the GECA model outperforms the other approaches, but also show the role and importance of each individual network in the GECA combined model. Specifically, when the GE model is compared with other graph analysis methods, including G2V, GCN [29, 35], and DGCNN [55], it gives better results than these networks. This shows that the GE network does a good job of aggregating and extracting graph features. The reason is that, instead of having only one Embedding phase like the other networks, the GE model also has a training model, so a lot of important and meaningful information is extracted. Next, when the CNN-Attention combined model is compared with other classification methods, the GECA model also shows its superiority. This shows that the CNN-Attention combined network synthesizes and highlights a lot of important and meaningful information, which makes the prediction process more accurate. The experimental results in the paper demonstrate that the proposed intelligent cognitive computation method in the GECA model has not only scientific meaning (because it is a new cognitive computation method) but also practical meaning (because it improves the efficiency of the APT malware detection task on Endpoints). From this, we think that the GECA model is not only suitable for APT malware monitoring systems for Endpoints but can also support the detection of many other attack techniques and anomalous behaviors, such as malware, unauthorized intrusion, and insider threats.
Conclusion and future development direction
APT malware will always change and improve. This requires information security systems to continually update and improve their ability to detect and prevent this type of malware. In this paper, with the purpose of improving the efficiency of APT malware detection on Endpoints, we have built the GECA combined model based on the intelligent cognitive computation method. This method improves the ability to detect malware by addressing two problems: i) the method of extracting APT malware's behaviors based on processes; ii) the method of detecting APT malware based on behavior analysis. Specifically, we have proposed a GE network based on the GCN network to synthesize and extract malware behaviors based on processes. The APT malware behaviors extracted by our proposed method carry important and meaningful information because they represent the relationships, correlations, and dependencies among the discrete processes generated by malware. For the behavior-based APT malware detection process, we have proposed the CNN-Attention combined model. This is a novel model that has not been proposed or applied in previous research. In the future, in order to improve the efficiency of the system, we think that 2 main issues should be addressed: i) researching and proposing new cognitive computation methods to improve the efficiency of process classification as the basis for classifying APT malware; ii) improving and applying new methods to better represent the relationships and correlations of processes, thereby improving the ability to accurately detect APT malware as well as reducing false warnings in the system.
Acknowledgments
This work has been sponsored by the Posts and Telecommunications Institute of Technology, Vietnam.
