New approach for APT malware detection on the workstation based on process profile

Abstract

The Advanced Persistent Threat (APT) attack is a dangerous, deliberate, and clearly targeted form of attack. Currently, APT campaigns typically reach a system through its end-users and then escalate privileges by spreading malware, a technique widely used by attackers. Therefore, early detection and warning of APT malware on workstations is an urgent problem. In this paper, we propose a new approach to APT malware detection on workstations based on analyzing and evaluating process profiles. The proposed method works as follows. First, processes are collected and aggregated into process profiles of APT malware. Second, the Graph2Vec graph analysis algorithm extracts the characteristics of each process profile. Finally, to draw conclusions about signs of APT malware, we use the Long Short-Term Memory (LSTM) and bidirectional LSTM (BiLSTM) algorithms. With this approach, we have not only built and synthesized APT malware behavior on workstations as a basis for improving the efficiency of APT malware prediction, but have also opened up a new approach to synthesizing and analyzing anomalous malware behavior.

Keywords

APT; APT malware detection on Workstation; Event ID; deep learning; process profile; Graph2Vec

1 Introduction

1.1 Problem

APT is currently the most dangerous cyber-attack technique in terms of attack scale and the extent of the damage inflicted on organizations. The danger of this technique lies in three main aspects [1, 2]: Advanced, Persistent, and Threat. In addition, the study [3] presented the concept and definition of APT attack characteristics, process, and life cycle. In [3], the authors pointed out that the danger of APT attacks is reflected in two main factors: the attack assist technology and the attack launching method. In particular, the attack assist technology used in attack campaigns is mainly developed by the attacking team itself, including malware, search engines for zero-day vulnerabilities, etc. As regards the launching method, recognized APT attack campaigns often use phishing and spear-phishing techniques aimed at the end-user. Thus, a characteristic of the APT attacker's method is to spread malware to the end-user and then escalate privileges and take control of systems that store sensitive data. The studies [1–3] indicated that, to prevent and detect APT attacks, it is sufficient to focus on one life cycle stage or process of the attack campaign, because if one stage or process is detected, the whole APT attack campaign will collapse. Besides, according to the statistics in [4, 5], most recorded APT attack campaigns spread malware on workstations. For these reasons, early detection and warning of APT malware on workstations is necessary today, as recorded campaigns of this attack type are growing in both number and degree of damage [6].

1.2 Evaluating some approaches for detecting APT malware on Workstations

The two main methods commonly studied and applied to detect APT malware are the signature-based method, which relies on a rule set, and the anomaly-based method, which relies on behavior analysis to find anomalies [4]. Malware detection approaches that combine the anomaly-based method with machine learning or deep learning techniques have been highly effective in identifying new APT malware samples. The studies [2, 8] listed two main methods to extract features and behaviors of malware: static analysis and dynamic analysis. The current trend of detecting cyber-attacks using machine learning or deep learning often applies dynamic analysis with the support of Sandbox tools to analyze and extract the features and behaviors of APT malware. However, we found that these approaches have some problems [9–15]:

The features and behaviors of malware: Virtualization tools such as Sandbox [1] assist in collecting and extracting the features of malware. They usually work well with simple samples but are not very effective for APT malware, because this malware often has identifying functions, anti-Sandbox techniques, hibernation, etc. Therefore, the APT malware features collected and extracted from the sandbox log are usually not very meaningful [16].

The malware detection time: We noticed that applying machine learning or deep learning methods based on features and abnormal behaviors would only detect malware at later stages of the attack campaign. That is when the attackers are able to steal information from the victim.

Lack of correlation between events [10, 17]: When malware signatures and behaviors are collected with virtualization tools, it is not only difficult to collect the features and behaviors of the malware in full, but the system also cannot find and synthesize the correlation among the malware's individual events. This is because APT malware often uses a variety of exploit and propagation techniques at different points in time, and each collected sign or behavior looks completely benign on its own. However, if we concatenate the events, we will see that this is the hiding and concealing behavior of malware.

The studies [1, 19] also showed that one of the problems that made detecting APT attacks much more difficult than other attack techniques was the lack of correlation in the events of the attack.

1.3 The proposed method of this paper

From the above analysis and evaluation, in this paper, we propose the method of detecting APT malware on workstations based on the process profile analysis technique using deep learning method. Our proposal has the following steps:

Step 1: Build the process profile in graph form: This process will include 2 phases:

Phase 1: Data standardization. First of all, the processes generated from the operating system kernel of the workstations will be collected and standardized into a process profile in graph form. To achieve this goal, we propose to use the Sysmon tool [20] on the workstations to gather information about the processes generated by workstations. Then, based on the information about the processes that Sysmon provides, we standardize them into features of each process. Details of the process of collecting and standardizing information about the processes are described in Table 1 in Section 3.2.1.

Phase 2: Building process profiles. Next, based on the features of each process extracted in phase 1, we build process profiles. Our process profiles in graph form have the following characteristics: each process is a node of the graph, and each edge represents a parent process calling a child process. Details of building process profiles in graph form are described in Section 3.2.2.

Table 1
List of features in a process

Group	Feature name	Data type
1	Name	String
	Hash	String
	ProcessID	Number
	ParentPID	Number
	CommandLine	String
	Image	String
	CreationTimestamp	Number
2	TerminationTimestamp	Number
3	MitreAttacks	List of String

Step 2: Analyzing process profiles: At this step, we will find a way to analyze the process profile built in phase 2 to seek signs of abnormal behavior of APT malware. To perform this task, we propose to use the Graph2Vec algorithm [21] to convert the process profile from the graph form into a vector representing that graph. The method of analyzing the process profile based on the Graph2Vec algorithm is the procedure of extracting the attribute of the process profile.

Step 3: Classifying process profiles: Finally, in order to conclude about signs of the APT malware in the process, we use some advanced deep learning algorithms such as LSTM and BiLSTM. With this approach, our proposal will solve the problems that we think need to be optimized as follows:

Regarding the problem “features and behaviors of malware”: In this paper, instead of using Sandbox tools to collect malware behavior, we recommend collecting and extracting malware behavior based on the processes recorded on the operating system kernel. Thus, the approach of this research is to collect and analyze the processes that malware generates instead of defining the behavior of the malware [1, 19]. With this approach, regardless of whether the malware is transformed or hidden, the processes that it generates will be fully recorded by extracting information from the operating system kernel on the workstation [17].

Regarding the problem “When the malware was discovered”: To draw conclusions about signs of APT malware in the system, we collect and extract the processes and then build them into a process profile for each object. Therefore, as soon as the malware adds a new process, that process is immediately recorded and added to its process profile. That process profile is then analyzed and evaluated to draw conclusions about the APT malware. This approach aims at fast and accurate detection of APT malware [12, 22].

Regarding the problem “Lack of correlation among events”: By constructing process profiles from the discrete processes collected and extracted on the workstations as the basis for APT malware detection, we expose the correlation among the events that the malware generates. These events and processes may be judged to be normal by monitoring systems, but when combined, they form process profiles that represent the behavior sequence of the malware. With this approach, our method is able to record and represent a sequence of malware events and processes that occur at different times.

1.4 Contributions

Proposing a new approach for selecting and extracting features and abnormal behaviors of APT attack malware based on process profile analysis using the Graph2Vec model. This is a new proposal for the task of detecting APT malware on workstations. With this approach, we have found and synthesized APT malware’s behavior on the processes that the system generates as evidence to conclude about the process profile. Our proposal not only partially solves the problem of lack of behaviors from APT attack malware but also calculates the impact and correlation among processes, thereby improving the ability to accurately conclude about the APT malware on the workstations.

Proposing a new approach for detecting APT malware on workstations based on process profiling analysis techniques using some advanced deep learning networks such as LSTM, BiLSTM. Accordingly, in this proposal, in addition to using the Graph2Vec deep learning model to extract the features based on behavior of malware from the process tree, we have also proposed to use two different deep learning algorithms to classify process profiles to improve the efficiency of the APT malware classification. The experimental results have proven the effectiveness of LSTM and BiLSTM models compared to other deep learning models.

The rest of the paper is organized as follows: in Section 2, we examine some previous studies on the task of detecting APT malware. The contents related to the proposed method are analyzed and presented in Section 3. The experimental results and evaluations of the effectiveness of the proposed method are presented in Section 4. Finally, conclusions and future development directions are presented in Section 5 of the paper.

2 Related work

2.1 Detecting APT attacks using graph

In the study [23], Cho et al proposed a method of detecting APT malware on Workstations using Graph Isomorphism Network (GIN). In their research, the authors proved that the proposed method is more efficient than some other deep learning graph networks including Graph Convolutional Networks (GCN) and Dynamic Graph Convolutional Neural Network (DGCNN). However, we noticed that this model is relatively cumbersome and complex. Shifu Hou [24] proposed the Deep4MalDroid model for malware detection on Android operating systems based on Kernel System Call Graphs. In the experimental section, the authors used some basic machine learning algorithms to classify malware.

In addition, Zhen Ma [25] also proposed for the first time a method of detecting APT malware connected to the control server via graph-based domain analysis. Specifically, the authors used 257,535,071 DNS requests and 73,136 domain names to evaluate the proposed model. In the study [26], Pektaş et al presented an Android malware detection method using API call graph embeddings. In their research, the authors listed some classical graph embedding models such as DeepWalk, Node2vec, Structural deep network embedding, Higher-order proximity preserved embedding.

The study [27] presented ATLAS which is a framework that constructs an end-to-end attack story from off-the-shelf audit logs for detecting signs of APT attacks. Specifically, in their research, the authors proposed the ATLAS model including causality analysis, natural language processing, and machine learning techniques. The characteristics of the ATLAS model are that ATLAS constructs a set of candidate sequences associated with the symptom node, uses the sequence-based model to identify nodes in a sequence that contribute to the attack, and unifies the identified attack nodes to construct an attack story. Research [18] proposed an APT IP detection method based on the GCN network and the BiLSTM deep learning algorithm.

2.2 Some other methods

In the research [28], the authors proposed a combined deep learning model for APT attack detection based on network traffic. Specifically, the authors combined single deep learning networks, namely Convolutional Neural Network (CNN) and LSTM, into a CNN-LSTM model. Experiments show that this model detects APT attack IPs more accurately than other individual deep learning models. This result has opened up a research direction on information representation of IPs based on network traffic. The study [29] proposed several ATTENTION models to optimize the CNN-LSTM combined deep learning models [28]. Specifically, the authors used the CNN-LSTM network as the base network, and the outputs of this network were further analyzed by the ATTENTION network to highlight important information of IPs, instead of being used directly for the classification process. Experimental results showed that the proposed method has better performance than the combined deep learning models.

The study [30] developed a method to detect traces of APT attacks based on Provenance Graph and Metric Learning. In the experimental section, the authors compared their research with some other approaches and found that the average accuracy increased by 11.3% and the True positive rate increased by 18.3%.

The research [31] listed, compared, and evaluated some tools and solutions for detecting APT malware on Endpoint systems. In that study, some Endpoint Detection and Response (EDR) tools and solutions were proposed such as Carbon Black; Kaspersky Endpoint Detection and Response-KEDR; McAfee Endpoint Protection.

R. Coulter [32] proposed a method of detecting APT malware on Windows machines based on 15,259 APT indicators of compromise (IoC) and basic machine learning algorithms such as Random Forest (RF), Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and Naive Bayes (NB). Similarly, research [33] proposed an integrated EDR model for APT malware detection based on the strategies of MITRE ATT&CK. However, we noticed that the studies [32, 33] face a problem: the rule sets built on MITRE ATT&CK and IoC are very limited, so they cannot cover all of the APT malware's behaviors. The study [34] proposed an APT attack detection model based on the belief rule base and IP packets.

3 The proposed method

3.1 The model architecture

Figure 1 shows our proposed model for detecting APT malware on the workstations. The main components in this model include:

Fig. 1

The model of detecting APT malware based on workstations and analyzing the process profile using deep learning.

Workstations: the workstations that need to be monitored. In this paper, collecting processes on the operating system kernel of a workstation requires a tool to collect, process, and transfer process information to the analysis center. We collect processes on Windows kernels using the Sysmon tool [35], which was introduced by Microsoft.

Event ID: the events collected by the Sysmon tool on the operating system kernel. According to the document [35], Sysmon collects a total of 23 types of events from the kernel, and [35] lists and describes these events in detail. In this paper, we collect these events as the basis for building the APT malware detection system.

Building process profiles: Accordingly, the Events generated from the operating system will be inspected and evaluated to know whether this Event is related to the processes that have been collected previously or not. If this Event is related, it will continue to be attached to the previous Events. If this Event is unrelated to previous Event IDs, it will be built into a new profile.

Analyzing process profiles: at this stage, the process profiles are analyzed to extract the behavior that represents the differences between the APT malware and the benign files. In this paper, we will use the Graph2Vec model to analyze and extract features in the process profile. Details of this process are presented in section 3.3 of the paper.

Classifying process profiles: the process profiles are classified based on the features (represented as a vector) extracted at the previous stage. In this paper, we combine deep learning models (LSTM and BiLSTM) with the Softmax Regression network to predict signs of APT malware in process profiles.

3.2 Building the process profile

3.2.1 Data standardization

As mentioned above, the Sysmon tool will collect 23 different Event IDs on the Windows kernel [35]. In this paper, based on the collected Event IDs, we analyze and evaluate them to build a process profile of the malware. Research [36] pointed out a number of important behaviors that are often used by malware to exploit and attack systems. Therefore, in this paper we choose to use the following 3 main components to analyze and build the process profile of malware:

Process Create: The process creation event provides extended information about a newly created process. In this study, we find a way to analyze this process because of the fact that this process contains a lot of important information that identifies the process. That information will be used to analyze and track subsequent behaviors (Event IDs). Some key components in Process Create are ProcessID, ProcessGuid, parent process code, absolute path of executable file, hash code, etc. In particular, ProcessID is re-granted by the operating system for the purpose of managing and monitoring subsequent behavior. ProcessGuid is a unique value identified for each process. The parent process code is used to define relationships among processes generated from an object.

Process Terminated: The process terminate event reports when a process terminates. After receiving this event, we finalize the information gathered for that process from the remaining 21 event types.

Mitre ATT&CK: MITRE ATT&CK is a globally-accessible knowledge base of adversary tactics and techniques based on real-world observations [35]. The framework focuses on adversary tactics, techniques and procedures instead of typical indicators such as file hashes, IP addresses, domains, etc. Thus, MITRE ATT&CK identifies behaviors in a life cycle, the relationship between one action and another, and their sequence for targets of malware. In addition, MITRE ATT&CK also provides a topology with threat data which gives context for how malware is used [35]. The MITRE ATT&CK matrix describes an overview of the relationship between tactics, techniques, and sub-techniques. Each tactic can be performed by individual techniques. Some techniques may be available in different tactics for different purposes. A technique can be broken down into more specific sub-techniques to achieve the attack purpose. Based on the description of the characteristics and action mechanism of each technique supported by MITRE ATT&CK, we define the set of conditions corresponding to each attack technique. In this study, we use MITRE ATT&CK to evaluate all information of processes, thereby identifying and measuring the correlation between the collected Event IDs and the techniques defined by the MITRE ATT&CK. Each discovered technique will represent an abnormal behavior of malware with File, Network, RemoteThread, Registry, etc. Thus, with this approach, we quickly summarize and generalize malware behavior in each process. Table 1 below lists features of the process used to build a process profile of each object in detail.

Table 1 shows the main components of each Event ID group that we use. The features in group 1 are collected from Event ID 1 (Process creation), and the TerminationTimestamp feature is updated when we receive Event ID 5 (Process terminated). During the lifetime of the process, the remaining 21 Event IDs are evaluated and labeled using the MITRE ATT&CK framework. The result, stored in the MitreAttacks feature, is a set of labels, one for each MITRE ATT&CK technique observed in the process.
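The three feature groups of Table 1 can be illustrated with a small record type. This is only a sketch under our own assumptions: the class name `ProcessRecord`, the helper functions, and the sample values are hypothetical and are not taken from the implementation described in this paper; only the field names follow Table 1.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProcessRecord:
    """One process with the feature groups of Table 1 (illustrative only)."""
    # Group 1: filled from Sysmon Event ID 1 (process creation).
    name: str
    hash: str
    process_id: int
    parent_pid: int
    command_line: str
    image: str
    creation_timestamp: int
    # Group 2: unknown until Event ID 5 (process terminated) arrives.
    termination_timestamp: Optional[int] = None
    # Group 3: technique labels derived from the remaining event types.
    mitre_attacks: List[str] = field(default_factory=list)


def on_terminate(record: ProcessRecord, timestamp: int) -> None:
    """Record the termination time reported by Event ID 5."""
    record.termination_timestamp = timestamp


def add_technique(record: ProcessRecord, technique_id: str) -> None:
    """Attach a technique label once, keeping the order of first observation."""
    if technique_id not in record.mitre_attacks:
        record.mitre_attacks.append(technique_id)


# Hypothetical sample values, chosen only to show the update flow.
rec = ProcessRecord("a.exe", "abc123", 10, 4, "a.exe -x", "C:\\a.exe", 100)
add_technique(rec, "T1055")
on_terminate(rec, 250)
```

Keeping the termination timestamp optional reflects the fact that a profile can be analyzed while some of its processes are still running.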

3.2.2 Building the process profile

A process profile includes a list of processes. The information of each process is standardized according to the principles outlined in Section 3.2.1. The processes have the following characteristics: they are collected in chronological order and are related to other processes, and a process can have one or more child processes. In this paper, to build process profiles that represent the relationships among processes, we propose to construct a directed graph. The proposed graph has the following characteristics: each process is a node, and each edge represents a parent process calling a child process. Thus, with this directed graph, describing malware behavior involves not only looking at each process individually but also observing the correlation between processes, from which hidden behaviors of malware can be clearly shown. Figure 2 below shows an example of a process profile architecture in tree form.

Fig. 2

An example of a process profile architecture in tree form.

As illustrated in Fig. 2, each process is assigned an identification number (ProcessID). We rely on it to determine the relationship between parent processes and newly created processes. Connecting the behaviors of the individual processes in this way reveals suspicious signs of the malware's actions.
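The grouping of processes into profiles and the parent-to-child edges described above can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: the function name `build_profiles` is hypothetical, and the sketch assumes that process identifiers are unique within the observation window (in practice a unique value such as ProcessGuid would be a safer key, since ProcessID values can be reused).

```python
def build_profiles(processes):
    """Group processes into profiles and return {root_pid: [(parent, child), ...]}.

    `processes` is a list of (process_id, parent_pid) pairs in creation order.
    A process whose parent has not been observed starts a new profile.
    """
    root_of = {}  # process_id -> root process of the profile it belongs to
    edges = {}    # root process_id -> directed edges of that profile
    for pid, ppid in processes:
        if ppid in root_of:
            # Related to an existing profile: attach it under its parent.
            root = root_of[ppid]
            edges[root].append((ppid, pid))
        else:
            # Unrelated to anything seen so far: open a new profile.
            root = pid
            edges[root] = []
        root_of[pid] = root
    return edges


events = [(100, 4), (200, 100), (300, 100), (400, 200), (900, 8), (950, 900)]
profiles = build_profiles(events)
# profiles == {100: [(100, 200), (100, 300), (200, 400)], 900: [(900, 950)]}
```

Each value in the result is the edge list of one directed tree, which is the form a graph embedding step would take as input.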

3.3 Detecting malware using the process profile analysis technique

Figure 3 below shows the architecture of the malware detection model based on the process profile analysis technique.

Fig. 3

Model of analyzing and classifying the process profiles.

Figure 3 shows that the model of detecting the APT malware on the workstations based on the process profile analysis technique consists of 2 main components:

“Process profile analysis” block: This block analyzes the process profiles and converts them from graph form into graph embeddings, which serve the APT malware classification process. The details of this analysis are described in Section 3.3.1 of the paper.

“Process profile classification” block: Based on the graph embeddings produced by analyzing the process profiles in graph form, this block performs classification to detect signs of malware in the process profile. Details of this classification process are presented in Section 3.3.2 of the paper.

3.3.1 Analyzing the process profiles

We noticed that the traditional approaches for representing data in graph form are not suitable for the classification task [34]. Specifically, these approaches usually have the following two problems:

The dimension of the adjacency matrix depends on the number of nodes in the graph. In real-time processing problems, the number of nodes can grow rapidly and become very large, which consumes memory and degrades processing performance.

Representations based on the adjacency matrix and similar methods only describe the graph structure, so it is difficult to describe the features of each node and the characteristics of each branch. This can lead to the loss of important information and of the meaning of the graph.

To fix the above problems, in this paper we propose to use the Graph2Vec model to extract features of the process profile. Graph2Vec is the first neural embedding approach that learns representations of whole graphs and an unsupervised learning technique used to embed graph in vector form extended from doc2vec [21, 37]. Based on the idea of doc2vec, Graph2Vec will learn to represent the features of the graph by treating an entire graph as a text and rooted subgraphs as words that form the document. As with data in text form, words near each other will form a sentence containing context, rooted subgraphs represent graph branches representing the association between features at nodes. Figure 3 shows the process of applying the Graph2Vec model for analyzing process profile in our proposal as follows:

Step 1: Extracting rooted subgraphs: The purpose of this process is to generate a vocabulary of subgraphs. A rooted subgraph is a set of nodes around a node up to a certain degree, derived from the Weisfeiler-Lehman algorithm. The formula used to update the rooted subgraphs in the Weisfeiler-Lehman algorithm is as follows [38]:

$l^{i}(v) = \mathrm{relabel}\left(l^{i-1}(v),\ \mathrm{sort}\left(\{\, l^{i-1}(u) \mid u \in N(v) \,\}\right)\right)$ (1)

Where: i > 0

v is the node being considered

$l^{i}(v)$ is the subgraph label updated at loop i at node v

$l^{i-1}(v)$ is the subgraph label updated at loop i-1 at node v

N(v) is the set of nodes around node v

relabel is the function that computes a new label for the subgraph

In this paper, the relabel function uses the MD5 [31] algorithm in order to ensure subgraphs that have different architecture and contain different nodes will be labeled differently. Thus, at the last loop, each subgraph is represented by only a unique hash string. From a graph consisting of many subgraphs, we have processed it into a set of hash strings similar to a word.
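The relabeling rule of formula (1) with an MD5-based relabel function can be sketched as below. This is an illustrative sketch under our own assumptions: the function name `wl_rooted_subgraphs`, the way the node label and the sorted neighbor labels are joined into one string before hashing, and the sample labels are hypothetical choices, not details given in the paper.

```python
import hashlib


def wl_rooted_subgraphs(labels, neighbors, iterations):
    """Return, for every node, its labels for degrees 0..iterations.

    labels: dict mapping node -> initial label string
    neighbors: dict mapping node -> list of nodes in N(v)
    """
    current = dict(labels)
    history = {v: [current[v]] for v in current}
    for _ in range(iterations):
        updated = {}
        for v in current:
            # sort({l^{i-1}(u) | u in N(v)}) from formula (1)
            neigh = sorted(current[u] for u in neighbors.get(v, []))
            # Assumed serialization of (own label, sorted neighbor labels).
            signature = current[v] + "|" + ",".join(neigh)
            # relabel: MD5 digest of the serialized signature.
            updated[v] = hashlib.md5(signature.encode("utf-8")).hexdigest()
        current = updated
        for v in current:
            history[v].append(current[v])
    return history


# Hypothetical three-node profile: one parent with two identical children.
labels = {"a": "cmd.exe", "b": "powershell.exe", "c": "powershell.exe"}
neighbors = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
words = wl_rooted_subgraphs(labels, neighbors, iterations=2)
```

In this example nodes `b` and `c` start with the same label and the same neighborhood, so they receive identical hash strings at every degree, while `a` receives a different one. The collected hash strings of all nodes play the role of the "words" of the graph.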

Step 2: Standardizing the graph: The purpose of this step is to represent the graph into feature vectors. Accordingly, all rooted subgraphs extracted in step 1 will be preprocessed to produce a single vector representing the characteristic of the original graph. To accomplish this task, we use the Distributed Bag of Words version of the Paragraph Vector (PV-DBOW) model that is a version of doc2vec [37] and uses the Skip-gram model [39] to conduct training. Unlike other versions of doc2vec, PV-DBOW does not care about the order of words when training, and the model input is a ParagraphID. After training, this ParagraphID vector is the representative vector of the text. The Skip-gram model architecture usually tries to achieve the reverse of what the Continuous Bag of Words Model does. It tries to predict the source context words (surrounding words) given a target word (the center word). Figure 4 shows the skip-gram architecture of doc2vec.

Fig. 4

The skip-gram training model.

From Fig. 4, it can be seen that the skip-gram model consists of three main components: input layer, hidden layer, output layer. The input layer is a one-hot vector GraphID. The dimensions of the hidden layer are the dimensions of the embedding vector space. The output layer consists of a set of one-hot vectors of rooted subgraphs. The model will learn how to predict rooted subgraphs in vocab from GraphID. During the training process, instead of getting output with the size as vocabulary, the list of rooted subgraphs was sampled by using the Negative Sampling technique to reduce computational complexity at the output layer. Thus, it can be seen that output of Graph2Vec is an N-dimensional embedding vector space in which each graph is represented by a vector in space. Graphs with the same architecture and the same rooted subgraphs are closer together, meaning that the cosine similarity between them is closer to 1 [40]. Cosine similarity is defined by formula (2):

$similarity = \cos(\theta) = \frac{A \cdot B}{\| A \| \, \| B \|} = \frac{\sum_{i=1}^{n} A_{i} B_{i}}{\sqrt{\sum_{i=1}^{n} A_{i}^{2}} \, \sqrt{\sum_{i=1}^{n} B_{i}^{2}}}$ (2)

Where:

A = (A₁, A₂, ..., A_n) and B = (B₁, B₂, ..., B_n) are two vectors in embedding space.

n is the dimension of space. In this paper, we conduct experiments on 3 graph embedding spaces with dimensions of 64, 128, and 256 respectively.
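Formula (2) can be checked with a few lines of code. The sketch below uses plain Python lists; the function name `cosine_similarity` and the choice to reject zero vectors (for which the formula is undefined) are our own assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors, as in formula (2)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])                # 0.0
```

Two embeddings that point in the same direction give a value close to 1 regardless of their lengths, which is why the measure suits comparing graphs of different sizes.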

3.3.2 Detection method

To classify the vectors representing the features of the graphs extracted in section 3.3.1, we propose to use two deep learning algorithms: LSTM and BiLSTM.

a) LSTM

The research [41] introduced LSTM with its ability to retain information over long periods of time. This ability comes from the structure of the gates in each memory cell. A memory cell consists of three main components: the input gate, the forget gate, and the output gate. First, the forget gate determines what information needs to be removed from the cell state. Next, the input gate decides what information to write into the cell state. Finally, the output gate computes the desired outputs. During this process, the cell state is passed along and updated at every time step. Figure 5 describes the architecture of a basic memory cell in the LSTM network in detail.

Fig. 5

Structure model of a cell of LSTM.

At each moment t, we have a hidden state h_t and a cell state C_t with the basic math formulas shown below:

The forget gate is defined by formula (3):

$f_{t} = \sigma (W_{f} \cdot [h_{t-1}, x_{t}] + b_{f})$ (3)

The input gate is defined by formula (4):

$i_{t} = \sigma (W_{i} \cdot [h_{t-1}, x_{t}] + b_{i})$ (4)

The output gate is defined by formula (5):

$o_{t} = \sigma (W_{o} \cdot [h_{t-1}, x_{t}] + b_{o})$ (5)

The candidate memory cell, which decides what information should be written, is:

${\tilde{C}}_{t} = \tanh (W_{C} \cdot [h_{t-1}, x_{t}])$ (6)

Then the cell state C_t and the hidden state h_t are updated as follows:

$C_{t} = f_{t} * C_{t-1} + i_{t} * {\tilde{C}}_{t}$ (7)

$h_{t} = o_{t} * \tanh (C_{t})$ (8)

Where:

W is the weight matrix of each gate,

b is the bias vector of each gate,

* is the element-wise product operator.
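To make the order of operations in formulas (3) to (8) concrete, one time step of a memory cell can be sketched in plain Python. This is an illustration only and not the training code used in the experiments: the function names, the `params` dictionary layout, and the list-based vectors are hypothetical. The sketch also accepts a bias for the candidate cell as an assumption; passing a zero bias reproduces formula (6) exactly as written.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def affine(W, v, b):
    """Return W·v + b, where W is a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step.

    params maps "f", "i", "o", "c" to (W, b); every W has len(h_prev) rows
    and len(h_prev) + len(x_t) columns.
    """
    z = list(h_prev) + list(x_t)  # the concatenation [h_{t-1}, x_t]

    def layer(name, activation):
        W, b = params[name]
        return [activation(a) for a in affine(W, z, b)]

    f_t = layer("f", sigmoid)        # forget gate, formula (3)
    i_t = layer("i", sigmoid)        # input gate, formula (4)
    o_t = layer("o", sigmoid)        # output gate, formula (5)
    c_tilde = layer("c", math.tanh)  # candidate cell, formula (6)
    # Formula (7): element-wise mix of the old state and the candidate.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    # Formula (8): gated output of the new cell state.
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


# Hidden size 2, input size 3, all-zero parameters: every gate equals 0.5.
zero_W = [[0.0] * 5 for _ in range(2)]
zero_b = [0.0, 0.0]
params = {k: (zero_W, zero_b) for k in ("f", "i", "o", "c")}
h_t, c_t = lstm_step([1.0, 2.0, 3.0], [0.0, 0.0], [1.0, -2.0], params)
# c_t == [0.5, -1.0]: half of the previous cell state is kept.
```

With zero parameters the gates are all 0.5 and the candidate is 0, so the new cell state is half of the previous one, which is a convenient sanity check for the update rule.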

b) BiLSTM

Although the LSTM network has overcome some shortcomings in the information remembering ability of traditional recurrent neural networks (RNN), it still has a limitation: it is only able to memorize and learn in one direction. This can cause the LSTM network to lose important information in some cases. To fix this issue, the study [42] proposed using the BiLSTM deep learning network. According to this study, the BiLSTM model has two components: a forward LSTM and a backward LSTM. This allows the model not only to inherit the long-term memory of the LSTM but also to remember information in two directions. The two LSTM layers create two corresponding hidden states: h^f_i from the forward LSTM and h^b_i from the backward LSTM. In particular, h^f_i integrates the preceding information and h^b_i integrates the following information. The final state h_i is determined by combining the two states with formula (9) below:

$h_{i} = h_{i}^{f} \| h_{i}^{b}$ (9)

Where:

h_i is the hidden state at position i (it contains information from both directions)

∥ is the concatenation operation
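The combination in formula (9) can be sketched independently of the cell type. In the hedged example below, `toy_step` is a deliberately simple recurrent update standing in for an LSTM cell (it is not an LSTM, and its weights `w_x` and `w_h` are arbitrary); the function names and the toy sequence are hypothetical. The point is only to show the forward pass, the backward pass over the reversed sequence, and the per-position concatenation.

```python
import math


def toy_step(x_t, h_prev, w_x=0.5, w_h=0.3):
    """Stand-in recurrent cell (NOT an LSTM): h_t = tanh(w_x*x_t + w_h*h_prev).

    The update is element-wise, so the input and hidden sizes must match.
    """
    return [math.tanh(w_x * x + w_h * h) for x, h in zip(x_t, h_prev)]


def run_direction(sequence, step, hidden_size):
    """Run `step` over `sequence` and return the hidden state at each position."""
    h = [0.0] * hidden_size
    states = []
    for x_t in sequence:
        h = step(x_t, h)
        states.append(h)
    return states


def bidirectional(sequence, step, hidden_size):
    """Return h_i = h_i^f || h_i^b for every position i, as in formula (9)."""
    forward = run_direction(sequence, step, hidden_size)
    # Process the reversed sequence, then flip the states back so that
    # backward[i] summarises positions i..end.
    backward = run_direction(sequence[::-1], step, hidden_size)[::-1]
    return [h_f + h_b for h_f, h_b in zip(forward, backward)]  # concatenation


sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
states = bidirectional(sequence, toy_step, hidden_size=2)
print(len(states), len(states[0]))
```

Each combined state is twice as long as a single-direction state, which is why a BiLSTM layer with a given number of units produces an output of twice that width.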

Figure 6 below shows the architecture of the BiLSTM network consisting of two components: forward LSTM and backward LSTM.

Fig. 6

The model that describes two LSTM components in BiLSTM.

From the architecture of the BiLSTM network shown in Fig. 6, it is clear that this model can learn from one more direction of information. This greatly improves its ability to remember during training, thereby increasing the efficiency of the classification process.

By using the BiLSTM network to detect APT attack malware, we expect the network to remember the location of malware groups in the embedding space and so improve its ability to separate APT malware from benign files.

4 Experiments and evaluations

4.1 Experimental dataset

4.1.1 The APT malware data

In this paper, in order to build a dataset of APT malware, we collect malware samples from the following sources:

APT attack malware that was collected and reported by reputable organizations from APT attack campaigns [43], including several types such as Andromeda, Cobalt, Cridex, Dridex, Emotet, Gh0stRAT, NjRAT, Lokibot, AgentTesla, etc.

APT attack malware that was collected and monitored by organizations under the Ministry of Information and Communication of Vietnam, including Vietnam Cyberspace Security Technology [44]; Viettel Cyberspace Center [45]; CyRadar Information Security Institute [46]; and the National Cyber Security Center in the Vietnam Authority of Information Security [47].

Table 2 below lists the number and components of some types of APT malware in detail.

Table 2
Statistics for the number of APT malware samples

Source	[43]	[44]	[45]	[46]	[47]	Total
The number of APT malware samples	62899	1248	1883	1215	2478	69723

4.1.2 The normal data

For normal data, we use data from source [43]. Table 3 lists the number and components of collected benign files in detail.

Table 3
Statistics for components and the number of benign files

File type	The number of normal files	File type	The number of normal files
PE EXE	20781	Microsoft Office	48088
PE DLL	8823	Scripts	26950
JAVA	2014	Email files	20852
HTML Documents	6312	Archive files	45183
Adobe Flash, PDF	48518	Url	60831
Total	288352

4.2 Experimental scenario

With the experimental dataset of 358,075 records listed in Tables 2 and 3, we divide the dataset into different components and then experiment with and evaluate the accuracy of the proposed models on these experimental datasets. Our experimental scenario is as follows:

As regards the dataset: For every scenario, the experimental dataset is divided randomly, with 80% of the records used for training and the remaining 20% used for testing. Besides, to see the effectiveness of process profile analysis clearly, we test the collected dataset with embedding spaces (features) of 64, 128, and 256 dimensions.

As regards the classification algorithm: To see the effectiveness of the proposed method, we experiment with and evaluate some machine learning and deep learning algorithms proposed by other studies. During the experiments, we tune the parameters of the models that the previous studies used. Besides, to calculate the probability that a sample falls into the normal or malware class after the data has passed through the deep learning layers, we use the Softmax regression function, as defined in the study [48].
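The two set-up choices above, a random 80/20 split and a Softmax output, can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names, the fixed seed, and the example scores are assumptions, and the split works on sample indices only. The Softmax used is the standard form, softmax(z)_k = exp(z_k) / Σ_j exp(z_j).

```python
import math
import random


def train_test_split_indices(n_samples, test_ratio=0.2, seed=42):
    """Shuffle sample indices and split them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seed is an arbitrary assumption
    n_test = round(n_samples * test_ratio)
    return indices[n_test:], indices[:n_test]


def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# 358,075 records in total (Tables 2 and 3), split 80% / 20%.
train_idx, test_idx = train_test_split_indices(358075)
print(len(train_idx), len(test_idx))  # 286460 71615

# Hypothetical output scores of the last layer for (normal, malware).
probs = softmax([2.0, -1.0])
print(probs)
```

The 71,615 test records produced by this split match the total number of files in the confusion matrix discussed for Fig. 9 (13,990 APT files plus 57,625 benign files).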

The above two evaluation methods are performed in parallel and integrated into the following experimental scenarios:

•Scenario 1: Compare the method proposed in this paper with some other approaches. For this scenario, we conduct the following comparisons and evaluations:

Replace the LSTM and BiLSTM models with other classification algorithms such as RF, Multilayer Perceptron (MLP), and CNN 1D. Thus, we evaluate the following models for this scenario: Graph2Vec-RF, Graph2Vec-MLP, and Graph2Vec-CNN 1D. We chose RF, MLP, and CNN 1D because they are regarded as among the best classification algorithms to date.

Replace the Graph2Vec model with another embedding model called Sequence. Here, attribute extraction is performed by the Sequence model and APT malware detection uses the LSTM and BiLSTM models. Thus, we experiment with the Sequence-LSTM model proposed in [49] and with the Sequence-BiLSTM model.

Compare with another approach: we experiment with the G2V-CNN model proposed in the study [50].

•Scenario 2: Compare and evaluate the experimental results of the LSTM and BiLSTM models in our proposal.

4.3 Installation requirements and Classification Measures

4.3.1 Installation requirements

Software requirements: Python version 3.6, Tensorflow 2.0, Ubuntu 18.04, gensim 3.6.0.

Hardware requirements: RAM 13GB; Intel(R) Xeon(R) CPU @ 2.20GHz; GPU Tesla T4.

4.3.2 Classification Measures

The following measures are used in this paper to evaluate the accuracy of models:

Accuracy: The ratio between the number of samples classified correctly and the total number of samples. Accuracy is calculated by the following formula: $accuracy = \frac{TP + TN}{TP + TN + FP + FN} \times 100$ (10)

In which:

TP - True positive: The number of malicious samples classified correctly.

FN - False negative: The number of malicious samples classified as normal.

TN - True negative: The number of normal samples classified correctly.

FP - False positive: The number of normal samples classified as malicious.

Precision: The ratio of true positives to the total number of samples classified as positive (TP + FP). High precision means that few normal samples are flagged as malicious. $precision = \frac{TP}{TP + FP} \times 100$ (11)

Recall: The ratio of true positives to the total number of actual positive samples (TP + FN). High recall means that few malicious samples are missed. $recall = \frac{TP}{TP + FN} \times 100$ (12)

F1-score: The harmonic mean of precision and recall. The higher the F1-score, the better the classifier. $F1 = \frac{2 \times precision \times recall}{precision + recall}$ (13)
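As a worked example of formulas (10) to (13), the sketch below computes the four measures from confusion-matrix counts. The function name is hypothetical, and the function assumes non-zero denominators. The counts are derived from the figures reported for Fig. 9 in section 4.4.1 (711 of 13,990 APT files missed, 538 of 57,625 benign files falsely flagged).

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall (in percent) and F1, per (10)-(13)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn) * 100
    precision = tp / (tp + fp) * 100
    recall = tp / (tp + fn) * 100
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Counts derived from the Fig. 9 discussion (Graph2Vec-CNN2D model).
fn = 711            # APT files classified as normal
tp = 13990 - fn     # APT files classified correctly
fp = 538            # benign files classified as malicious
tn = 57625 - fp     # benign files classified correctly

metrics = classification_metrics(tp, tn, fp, fn)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Within rounding, these values agree with the 128-feature [64-96] row of Table 9 (98.26, 96.1, 94.92, 95.5), which suggests that row is the configuration shown in Fig. 9.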

4.4 Results and discussion

4.4.1 Experimental results of scenario 1

a) Experimental results when replacing LSTM and BiLSTM models with other classification algorithms

We experiment with three algorithms and models for this task: RF, MLP, and CNN 1D. Tables 4, 5, and 6 below show the experimental results of the RF algorithm and the MLP and CNN 1D models.

Table 4
Experimental results using Random Forest algorithm

Total trees	64 features						128 features						256 features
	Acc	Pre	Rec	F1	Training time	Test Time	Acc	Pre	Rec	F1	Training time	Test Time	Acc	Pre	Rec	F1	Training time	Test Time
10	96.96	94.52	89.69	92.04	35.58	0.16	97	94.54	89.61	92	49.86	0.19	97.05	94.7	89.92	92.25	74.78	0.27
50	97.4	94.75	91.85	93.27	256.78	0.87	97.4	95.03	91.49	93.2	256.78	0.87	97.5	95.1	91.88	93.5	386	1.11
100	97.58	94.24	92.22	93.22	500.96	1.7	97.58	95.4	91.99	93.67	500.96	1.7	97.55	95.32	91.9	93.58	788	2.32

Table 5

Experimental results using MLP algorithm

Total layer	64 features						128 features						256 features
	Acc	Pre	Rec	F1	Training time	Test Time	Acc	Pre	Rec	F1	Training time	Test Time	Acc	Pre	Rec	F1	Training time	Test Time
2	98.01	95.93	93.78	94.84	380.88	1.23	97.38	92.53	93.35	93.34	842.91	1.7	97.63	93.6	94.32	93.96	1879.8	3.32
4	98.07	96.57	93.45	94.98	631.91	1.56	98.05	96.17	93.75	94.95	1414.4	1.99	97.5	97.54	89.49	93.33	3820.1	6.56

Table 6

Experimental results using CNN 1D model

Total features	Parameters	Accuracy	Precision	Recall	F1-score	Training time	Test time
64	96	98.06	96.74	93.21	94.94	95.61	1.74
	32–64	97.54	94.97	93.6	94.28	1410.6	2.58
128	512	98.32	95.93	95.48	95.70	943.4	1.85
	128–256	98.04	96.27	93.57	94.9	751	2.15
256	512	98.38	96.19	95.47	95.83	2227.2	3.14
	64–128	98.16	95.64	94.9	95.27	1146.1	2.17

Table 4 lists the experimental results of detecting APT malware using the RF algorithm with different numbers of trees. From Table 4, it can be seen that the accuracy of the APT malware detection process varies with the number of features in the dataset. With 100 trees, all three datasets (64, 128, and 256 features) gave the highest accuracy, and the best or near-best results on the other measures. However, the training and testing times clearly increase as the number of decision trees and the number of features increase. In particular, with 10 trees the training times for the 64-, 128-, and 256-feature datasets were 35.58 s, 49.86 s, and 74.78 s, respectively, and the testing times were 0.16 s, 0.19 s, and 0.27 s, respectively. We noticed that although the training and testing times increase rapidly with the number of trees and features, the accuracy of the classification process does not change much. This shows that, on the experimental dataset in this paper, the classification ability of the RF algorithm is relatively good and stable even when the number of features changes. Table 5 shows the experimental results of the MLP model.

The experimental results in Table 5 show the effectiveness of different MLP models when changing the parameters of the Graph2Vec model as well as the number of layers in the MLP. In particular, this model gives the best results with 2 MLP layers and 64 Graph2Vec features. This contrasts with the RF algorithm, whose results improve as the number of decision trees and features increases.

Comparing the experimental results in Table 4 and Table 5, we can see that the MLP model is more effective than the RF algorithm. Specifically, on the Accuracy measure, the MLP model achieved 98.07%, which is 0.49% higher than the RF algorithm. Similarly, the MLP model is higher than RF by 1.07% and 0.2% on the Precision and Recall measures, respectively. In terms of training and testing time, the MLP model takes much longer than the RF algorithm. This is expected because the architecture of the MLP model is more complex than that of the RF algorithm.

Table 6 shows the results of detecting APT malware using the 1D CNN model. We found that these results are relatively high both in terms of classification and detection. Specifically, the overall accuracy of the classification results ranges from 98.08% to 98.34%, while the correct APT malware classification rate ranges from 93.21% to 95.48% and the correct normal file classification rate ranges from 94.47% to 98.74%. Besides, the training and detection time is generally higher than that of using MLP and RF.

In Fig. 7, we present the confusion matrices of the RF, MLP, and CNN 1D models with the parameters giving the highest recall.

Fig. 7

Confusion matrix results of the RF, MLP, and CNN 1D algorithms: (a) the RF algorithm; (b) the MLP algorithm; (c) the CNN 1D algorithm.

From Fig. 7, it can be seen that the MLP model incorrectly detected only 870 files, compared to 1117 files for the RF algorithm (247 fewer false alerts). Similarly, in benign file detection, the MLP model falsely alerted on only 557 files, compared to 615 files for the RF algorithm (58 fewer). This result shows that the MLP model is more effective than the RF algorithm in both APT malware detection and benign file detection. However, the MLP model is not as efficient as the CNN 1D model, which incorrectly detected only 528 benign files and 634 APT malware files. This result is better than those of the RF algorithm and the MLP model.

b) Experimental results when replacing Graph2Vec model with Sequence model

Tables 7 and 8 below show the APT malware detection results using the Sequence-LSTM [49] and Sequence-BiLSTM models.

Table 7

Experimental results using the Sequence-LSTM model [49]

Total features	Parameters	Accuracy	Precision	Recall	F1-score	Training time	Test time
64	512-512-128	96.73	93.97	88.98	91.41	1105	8.2
	256-256-512-128	96.78	94.13	89	91.53	1119	9.4
128	128-512-256	96.88	94.6	89.14	91.8	925.43	8.64
	256-512-256-128	96.85	94.26	89.3	91.71	1141	9.6
256	128-172-256	96.93	94.43	89.57	91.94	1309.8	10.5
	172-512-256-512	96.86	94.57	89	91.74	1826.8	12.67

Table 8

Experimental results using the Sequence-BiLSTM model

Total features	Parameters	Accuracy	Precision	Recall	F1-score	Training time	Test time
64	256	96.73	93.75	89.18	91.41	445.37	5.1
	128-128	96.8	93.53	89.82	91.63	687	11.64
128	128	96.6	93.39	88.9	91.09	417.78	6.03
	256-256	96.78	93.84	89.38	91.55	687.2	7.89
256	128	96.63	93.4	89.04	91.17	480.43	6.15
	172-172	96.91	94.75	89.1	91.8	821.82	9.47

Table 7 shows the experimental results of the Sequence-LSTM model proposed in [49]. These results show that the Sequence-LSTM model gives different classification results when the parameters of both the Sequence and LSTM models change. However, the change is very small even when the number of LSTM layers and the number of Sequence features increase rapidly. Tracking this change, we found that the best results of the Sequence-LSTM model on the Accuracy, Precision, Recall, and F1-score measures were 96.93%, 94.57%, 89.57%, and 91.94%, respectively. At the same time, most of the measures reach their highest level when the models are complex and multi-layered. Based on the experimental results in Table 7, this model is not really effective at classifying either benign files or APT malware.

Table 8 shows the results of the Sequence-BiLSTM model. It can be seen that this model gives its best overall classification results with the parameters [256 Sequence, 2 BiLSTM], where the Accuracy, Precision, Recall, and F1-score values are 96.91%, 94.75%, 89.18%, and 91.8%, respectively. Comparing Table 7 and Table 8, the two models perform similarly without a large difference, although the Sequence-LSTM model is slightly more efficient. A likely cause is that the Sequence-LSTM model uses more LSTM layers, which makes it possible to synthesize and extract more important information and so helps the classification process to be accurate.

Figure 8 shows the confusion matrices of the Sequence-LSTM and Sequence-BiLSTM models.

Fig. 8

Confusion matrix results of the Sequence-LSTM and Sequence-BiLSTM models: (a) the Sequence-LSTM model; (b) the Sequence-BiLSTM model.

From the results in Fig. 8, it can be seen that the Sequence-LSTM model is more effective than the Sequence-BiLSTM model at accurately detecting benign files, but it is less effective at correctly classifying APT malware.

c) Experimental results when comparing other approaches

In this scenario, we conduct experiments to evaluate the approach of using Graph2Vec in combination with the CNN2D model proposed in the study [50] (see Table 9).

Table 9

Experimental results using CNN 2D model [50]

Total features	Parameters	Accuracy	Precision	Recall	F1-score	Training time	Test time
64	128-256	98.05	95.61	94.37	94.99	373.33	1.69
	32-72	97.92	95.15	94.17	94.66	1084.86	1.84
	32-64-128-256	97.88	94.88	94.23	94.55	302	2.12
128	64-96	98.26	96.1	94.92	95.5	1129	1.96
	32-72	98.25	96.5	94.47	95.47	798.79	1.88
	32-64-128-256	98.12	94.5	94.87	95.22	1223.97	2.13
256	192-512	98.36	96.48	95.04	95.76	2250.66	3.79
	32-72	98.4	96.45	95.35	95.89	821.11	1.91
	32-48-64-128	98.16	95.24	95.34	95.29	600.3	2.51

From the experimental results presented in Table 9, it can be seen that the more complex the model structure (that is, the higher the number of filters and layers), the worse the experimental results of the model. However, if we increase the number of features, the learning ability of the model improves and the model gives more accurate test results. Specifically, with 64 features, the CNN algorithm with the [2D 128–256] model gave the best classification results for APT malware and normal files, which were 94.37% and 95.61%, respectively. With the 128-feature dataset, the accuracy of the malware classification process was 94.92% (an increase of 0.55% over the 64-feature dataset) and the normal file classification result was 96.5% (an increase of 0.93% over the 64-feature dataset). Similarly, when using the CNN model with 256 features, the accuracy of classifying normal files and APT malware is higher than that of the 64-feature model by 1.1% and 0.9%, respectively. This result shows that when increasing the number of features, the accuracy of the APT malware detection process increases faster than that of the normal file classification process. The likely cause is that the APT malware dataset used in this paper is not big enough for the CNN to fully exploit its abilities, so increasing the number of features improves the separation in the embedding space and therefore the classification results.

Figure 9 shows the confusion matrix of the Graph2Vec-CNN2D model [50].

Fig. 9

Confusion matrix result of the Graph2Vec-CNN2D model.

From the results in Fig. 9, it can be seen that this model incorrectly detected 711 out of 13,990 APT files. At the same time, it also raised false alarms for 538 of the 57,625 benign files.

4.4.2 Experimental results of scenario 2

Tables 10 and 11 show the experimental results of the LSTM and BiLSTM models proposed in this paper.

Table 10
Experimental results using the LSTM algorithm

Total features	Parameters	Accuracy	Precision	Recall	F1-score	Training time	Test time
64	512-512-128	98.03	94.87	95.05	94.9	1336	7.17
	256-256-512-128	98.07	95.11	95.02	95.06	1675.7	9.26
128	128-512-256	98.23	94.58	96.46	95.51	1276.29	8.25
	256-512-256-128	98.24	94.79	96.27	95.53	1773.75	10.3
256	128-172-256	98.32	95.65	95.74	95.69	2536.1	8.78
	172-512-256-512	98.36	95.88	95.73	95.8	4313	12.09

Table 11

Experimental results using BiLSTM algorithm

Total features	Parameters	Accuracy	Precision	Recall	F1-score	Training time	Test time
64	256	98.02	94.03	95.96	94.99	1246.97	4.02
	128-128	98.11	95.43	94.84	95.13	1737.7	7.17
128	128	98.38	95.93	95.75	95.84	1758.67	4.64
	256-256	98.28	95.03	96.24	95.63	1764.26	7.87
256	128	98.44	96.15	95.86	95.85	2098.11	4.88
	172-172	98.42	96.44	95.41	95.92	3453.17	8.91

With the results reported in Table 10, it can be seen that the training and classification ability of the model hardly changes when we change parameters such as the number of features, units, and layers. For the model with 64 features, the best classification results for APT malware and normal files were 95.05% and 95.11%, respectively. With the 256-feature dataset, the classification result for APT malware increased by 0.69% compared to the 64-feature dataset. The algorithm gave its best result, 96.46% (an increase of 1.41% over the 64-feature dataset), with the 128-feature dataset. Comparing the experimental results in Tables 4–10 on the measures of the classification process for malware and clean files, the LSTM algorithm gives better results than the other methods, with a recall of 96.46%. Specifically, based on Tables 6, 9, and 10, the LSTM model, with its ability to remember information, gave better results than the CNN models: the malware classification result increased by 0.99%. Likewise, comparing the experimental results in Tables 6 and 10, it is clear that the LSTM model gives better results than CNN 1D in APT malware classification. However, its training and detection times are not as good as those of CNN 1D. In addition, the experimental results show that when the number of features and layers increases, the training and detection times increase rapidly (training time by 2 to 2.5 times and detection time by 1.2 to 1.5 times), while measures such as accuracy, precision, and recall do not change much. Therefore, it can be concluded that the LSTM model is stable and classifies both malware and benign files effectively when its parameters are changed.

Similarly, the experimental results in Table 11 show that the BiLSTM model gave its best classification results for APT malware with the 128-feature model with 2 layers and parameters [256-256]. On the other hand, the measures change irregularly when parameters such as the number of features and units change. Specifically, for the 64-feature model, the highest Accuracy was 98.11% with 2 layers [128-128], while for the 128-feature model it was 98.38% with 1 layer [128], and for the 256-feature model it was 98.44% with 1 layer [128]. As regards the accuracy of APT malware classification, the results of the BiLSTM model ranged from 94.84% to 96.24% (up 0.77% compared to the result of the CNN model).

Comparing the results in Table 10 and Table 11, we found the following: regarding Accuracy, the BiLSTM model gave better results, with the highest rate of 98.44% (an increase of 0.08% over the LSTM model); regarding the classification of malware, the LSTM model gave better results than the BiLSTM model by 0.22%; regarding training and testing time, the BiLSTM model consumed approximately 2 times more than the LSTM model. The cause is that the BiLSTM model consists of two LSTM layers that operate in opposite directions: one in the forward direction and the other in the backward direction. This makes calculation and processing slower. From the experimental results using the LSTM and BiLSTM models, we noticed that the LSTM model gives the best results on the dataset used in this paper.

Figure 10 shows the confusion matrix results of the Graph2Vec-LSTM and Graph2Vec-BiLSTM models.

Fig. 10

Confusion matrix results of the Graph2Vec-LSTM and Graph2Vec-BiLSTM models: (a) the LSTM model; (b) the BiLSTM model.

Detailed analysis of Fig. 10 shows that the Graph2Vec-LSTM model (wrongly detecting 495 files) is more effective than the Graph2Vec-BiLSTM model (wrongly detecting 525 files) at accurately detecting APT malware. In contrast, the Graph2Vec-BiLSTM model wrongly detected only 704 benign files, 71 fewer than the Graph2Vec-LSTM model. These results once again support our proposal in this study to combine the Graph2Vec model with the LSTM and BiLSTM models.

4.5 Discussion

The experimental results presented in section 4.4 show that our proposed method is more effective than other approaches to the APT attack detection problem.

First, in the scenario of comparing the LSTM and BiLSTM models with other classification algorithms and models: although many studies have shown that the RF algorithm is superior to other supervised machine learning algorithms in classification tasks, it does not work well for detecting APT malware based on the process profile analysis technique. Similarly, although the CNN 1D and MLP deep learning models have complex architectures and calculations, they do not bring the desired effect. The CNN 1D model is superior to the RF algorithm and the MLP network. However, the dataset is imbalanced (there are about 4 times as many clean samples as APT malware samples), so the model still makes some false classifications. Comparing Tables 4–6 with Tables 10 and 11 shows that the LSTM and BiLSTM models perform better than the other classification algorithms on all measures.

Second, in the scenario of replacing the Graph2Vec model with the Sequence model: comparing Tables 7 and 8 with Tables 10 and 11, it is clear that using the Graph2Vec model for the attribute extraction task gives better results than using the Sequence model. This efficiency is reflected in the accurate detection of both APT malware and normal files. Specifically, the Graph2Vec-LSTM model was more efficient than the Sequence-LSTM model on the Accuracy, Precision, and Recall measures by 2.36%, 4.54%, and 6.89%, respectively. Similarly, the Graph2Vec-BiLSTM model outperforms the Sequence-BiLSTM model by 2 to 7% on all measures.

Third, in the scenario of comparing the proposed method with other approaches: we compared our study with two other approaches, Sequence-LSTM [49] and Graph2Vec-CNN 2D [50]. Comparing the results in Tables 10 and 11 with Tables 7 and 9, we see the superiority of our proposed model over these proposals. In addition, the confusion matrices in Figs. 7–10 once again confirm the advantage of the proposed model over other studies that follow the same approach.

Finally, the two proposals in this study performed their functions and tasks well, because their results are better than those of other techniques and methods. This demonstrates that deep learning models with memory can adapt to classification tasks on imbalanced datasets. In addition, by varying the parameters of the LSTM and BiLSTM models, the paper provides options for the APT malware detection model on workstations when a trade-off between computation time and detection effectiveness is needed. With more complex network layers and architectures, higher results can be obtained, but at higher system and time costs.

5 Conclusions

The trend of APT attacks that spread malware through end-users is causing many difficulties and challenges for information security monitoring systems. In this paper, with the goal of early detection and warning of APT malware on workstations, we have built an APT malware detection model based on a profile analysis technique that uses the Graph2Vec algorithm and deep learning models. This paper addressed 3 issues: i) regarding the process of building process behavior profiles: by flexibly combining process labeling via MITRE ATT&CK with several other behavior features, we have built a process profile in graph form that fully shows the behavior of APT malware. This not only shows the correlation between processes, improving the efficiency of monitoring and detecting APT malware on workstations, but also reduces the storage cost of the system; ii) regarding process profile analysis: the proposal to use the Graph2Vec model has brought remarkable efficiency. We standardized the process profile in graph form into a graph embedding that captures the features of the graph, which makes APT malware detection more efficient; iii) regarding the APT malware detection process: the proposed deep learning models, LSTM and BiLSTM, are highly effective: the experimental results are better than those of other studies and proposals that follow the same approach. With these results, our research has not only contributed to solving some difficulties in the APT malware detection task but also opened up new research directions for detecting other anomalies on workstations, such as malware, intrusion behavior, and insider threats.
In the future, to improve APT malware detection on workstations based on the results of this paper, 2 main issues can be considered: i) the method of building and synthesizing process profiles; ii) the method of analyzing process profiles.

Acknowledgments

This work has been sponsored by the Posts and Telecommunications Institute of Technology, Viet Nam.

References

Lemay

, Calvet

, Menet

and Fernandez

, Survey of publicly available reports on advanced persistent threat actors, Computers & Security 72 (2018), 26–59.

Alshamrani

, Chowdhary

, Myneni

and Huang

, A Survey on Advanced Persistent Threats: Techniques, Solutions, Challenges, and Research Opportunities, IEEE Comm Surveys & Tutorials 21 (2019), 1851–1877.

Code

(2012) Advanced persistent threat: understanding the danger and how to protect your organization. Elsevier, Amsterdam.

Ghafir

, Hammoudeh

, Prenosil

, Han

, Hegarty

, Rabie

and Aparicio-Navarro

F.J.

, Detection of advanced persistent threat using machine-learning correlation analysis, Future Generation Computer Systems 89 (2018), 349–359.

Shean Tan

M.K.

, Goode

and Richardson

, Understanding negotiated anti-malware interruption effects on user decision quality in endpoint security, Behaviour & Information Technology. doi: 10.1080/0144929X.2020.1734087.

Yang

L. -X.

, Li

, Yang

and Tang

Y.Y.

, A Risk Management Approach to Defending Against the Advanced Persistent Threat, IEEE Transactions on Dependable and Secure Computing 17 (2020), 1163–1172. doi: 10.1109/TDSC.2018.2858786.

Bonilla

, Rey

, Ángel, A New Proposal on the Advanced Persistent Threat: A Survey, Applied Sciences (2020), 38–74.

Rubio

J.E.

, Alcaraz

, Roman

and Lopez

, Current cyber-defense trends in industrial control systems, Computers & Security 87 (2019), 101561.

Stojanovi'c

, Hofer-Schmitz

and Kleb

, APT Datasets and Attack Modeling for Automated Detection Methods: A Review, Computers & Security (2020). doi: https://doi.org/10.1016/j.cose.2020.101734.

10.

, Chen

, Zhuo

and Zhang

X.S.

, A temporal correlation and traffic analysis approach for APT attacks detection, Cluster Computing 22 (2019), 7347–7358.

11.

, Dai

, Bai

, Gan

, Wang

and Wang

, An Intelligence-Driven Security-Aware Defense Mechanism for Advanced Persistent Threats, IEEE Transactions on Information Forensics and Security 14(3) (2019), 646–661.

12.

Milajerdi

S.M.

, Gjomemo

, Eshete

and Sekar

, HOLMES: Real-time APT Detection through Correlation of Suspicious Information Flows, In: proceedings of the 2019 IEEE Symposium on Security and Privacy (2019), 1137–1152.

13.

Zimba

, song Chen

, Wang

and Chishimba

, Modeling and detection of the multi-stages of Advanced Persistent Threats attacks based on semi-supervised learning and complex networks characteristics, Future Generation Computer Systems 106 (2020), 501–517.

14. Lajevardi and Amini, A semantic-based correlation approach for detecting hybrid and low-level APTs, Future Generation Computer Systems 96 (2019), 64–88.

15. Myneni, et al., DAPT 2020 - Constructing a Benchmark Dataset for Advanced Persistent Threats, Deployable Machine Learning for Security Defense, MLHat 2020, Communications in Computer and Information Science 1271 (2020). doi: https://doi.org/10.1007/978-3-030-59621-7_8.

16. Manzoor E.A., Momeni, Venkatakrishnan V.N. and Akoglu, Fast Memory-efficient Anomaly Detection in Streaming Heterogeneous Graphs (2016). arXiv:1602.04844.

17. Hassan W.U., Bates and Marino, Tactical Provenance Analysis for Endpoint Detection and Response Systems, 2020 IEEE Symposium on Security and Privacy (SP) (2020), 1172–1189. doi: 10.1109/SP40000.2020.00096.

18. Xuan C.D., Nguyen H.D. and Dao H.M., APT attack detection based on flow network analysis techniques using deep learning, Journal of Intelligent & Fuzzy Systems 39(3) (2020), 4785–4801.

19. Xiang, Guo and Li, Detecting mobile advanced persistent threats based on large-scale DNS logs, Computers & Security 96 (2020). doi: https://doi.org/10.1016/j.cose.2020.101933.

20. Sysmon v12.03. https://docs.microsoft.com/en-us/sysinternals/downloads/Sysmon.

21. Narayanan, Chandramohan, Venkatesan, Chen, Liu and Jaiswal, Graph2Vec: Learning Distributed Representations of Graphs (2017). arXiv:1707.05005.

22. Akoglu, Tong and Koutra, Graph based anomaly detection and description: a survey, Data Mining and Knowledge Discovery 29(3) (2015), 626–688.

23. Do Xuan and Huong, A new approach for APT malware detection based on deep graph network for endpoint systems, Appl Intell (2022). https://doi.org/10.1007/s10489-021-03138-z.

24. Hou, Saas, Chen and Ye, Deep4MalDroid: A Deep Learning Framework for Android Malware Detection Based on Linux Kernel System Call Graphs, 2016 IEEE/WIC/ACM International Conference on Web Intelligence Workshops (WIW) (2016), 104–111. doi: 10.1109/WIW.2016.040.

25. …, Li and Meng, Discovering Suspicious APT Families Through a Large-Scale Domain Graph in Information-Centric IoT, IEEE Access 7 (2019), 13917–13926. doi: 10.1109/ACCESS.2019.2894509.

26. Pektaş and Acarman, Deep learning for effective Android malware detection using API call graph embeddings, Soft Comput 24 (2020), 1027–1043. https://doi.org/10.1007/s00500-019-03940-5.

27. Alsaheel, Nan, Ma, Yu, Walkup, Berkay Celik, Zhang and Xu, A Sequence-based Learning Approach for Attack Investigation, In: Proceedings of the 30th Security Symposium (2021).

28. Do Xuan and Dao M.H., A novel approach for APT attack detection based on combined deep learning model, Neural Comput & Applic 33 (2021), 13251–13264. https://doi.org/10.1007/s00521-021-05952-5.

29. Xuan C.D. and Duong, Optimization of APT Attack Detection Based on a Model Combining ATTENTION and Deep Learning, Journal of Intelligent & Fuzzy Systems, Pre-press (2021), 1–17.

30. Ayoade, et al., Evolving Advanced Persistent Threat Detection using Provenance Graph and Metric Learning, 2020 IEEE Conference on Communications and Network Security (CNS) (2020), 1–9. doi: 10.1109/CNS48642.2020.9162264.

31. Karantzas and Patsakis, An Empirical Assessment of Endpoint Detection and Response Systems against Advanced Persistent Threats Attack Vectors, J Cybersecur Priv 1 (2021), 387–421. https://doi.org/10.3390/jcp1030021.

32. Coulter, Zhang, Pan and Xiang, Unmasking Windows Advanced Persistent Threat Execution, 2020 IEEE 19th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom) (2020), 268–276. doi: 10.1109/TrustCom50675.2020.00046.

33. Park S.-H., et al., Performance Evaluation of Open-Source Endpoint Detection and Response Combining Google Rapid Response and Osquery for Threat Detection, IEEE Access 10 (2022), 20259–20269. doi: 10.1109/ACCESS.2022.3152574.

34. Wang, Cui, Wang, Wu and Hu, A Novel Method for Detecting Advanced Persistent Threat Attack Based on Belief Rule Base, Appl Sci 11 (2021), 9899. https://doi.org/10.3390/app11219899.

35. Blake Strom, Andy Applebaum, Doug Miller, Kathryn Nickels, Adam Pennington and Cody Thomas, MITRE ATT&CK: Design and Philosophy. https://attack.mitre.org/docs/ATTACK_Design_and_Philosophy_March_pdf?fbclid=IwAR3AAczELLv3svk25sy_l3I3yxnuhj6E-LAszibwFi02DBpddhy0qqKrfOE.

36. Gibert, Mateu and Planes, The rise of machine learning for detection and classification of malware: Research developments, trends and challenges, Journal of Network and Computer Applications 153 (2020). doi: 10.1016/j.jnca.2019.102526.

37. Quoc Le and Mikolov, Distributed Representations of Sentences and Documents (2014). arXiv:1405.4053.

38. Morris, Kersting and Mutzel, Global Weisfeiler-Lehman Graph Kernels (2017). arXiv:1703.02379.

39. Mikolov, Sutskever, Chen, Corrado and Dean, Distributed Representations of Words and Phrases and their Compositionality (2013). arXiv:1310.4546.

40. Han, Kamber and Pei, Data Mining: Concepts and Techniques, Third Edition, Elsevier (2012), 744.

41. Siami-Namini, Tavakoli and Siami Namin, A Comparative Analysis of Forecasting Financial Time Series Using ARIMA, LSTM, and BiLSTM (2019). arXiv:1911.09512.

42. Cornegruta, Bakewell, Withey and Montana, Modelling Radiological Language with Bidirectional Long Short-Term Memory Networks (2017). arXiv:1609.08409.

43. Interactive Online Malware Sandbox. https://app.any.run/.

44. Vietnam Cyberspace Security Technology JSC (VNCS). http://www.vncert.gov.vn/index.php.

45. Viettel Cyberspace Center. https://viettelcybersecurity.com/#/home.

46. CyRadar. https://cyradar.com/#.

47. National Cyber Security Center – NCSC. https://khonggianmang.vn/intro.

48. Duan, Sathiya Keerthi, Chu, Krishnaj Shevade and Poo A.N., Multi-category Classification by Soft-Max Combination of Binary Classifiers, In: Proceedings of the 4th International Workshop, MCS 2003 (2003), 125–134.

49. Xiao, Zhang, Mercaldo, et al., Android malware detection based on system call sequences and LSTM, Multimed Tools Appl 78 (2019), 3979–3999.

50. Nguyen H.-T., Ngo Q.-D. and Le V.-H., A novel graph-based approach for IoT botnet detection, International Journal of Information Security 19(9) (2020), 567–577.

Table 1
List of features in a process

| Group | Feature name | Data type |
|---|---|---|
| 1 | Name | String |
| 1 | Hash | String |
| 1 | ProcessID | Number |
| 1 | ParentPID | Number |
| 1 | CommandLine | String |
| 1 | Image | String |
| 1 | CreationTimestamp | Number |
| 2 | TerminationTimestamp | Number |
| 3 | MitreAttacks | List of String |
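
As an illustration only, the features in Table 1 could be held in a single record type such as the one sketched below. The class name `ProcessRecord`, the helper `parent_child_edges`, the use of `None` for a process that has not yet terminated, and all sample values are assumptions made for this example and are not taken from the implementation in this paper. Linking records through ParentPID is likewise only one plausible way to relate processes to each other before a graph is built.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ProcessRecord:
    # Group 1: identity and creation attributes (names follow Table 1)
    name: str
    hash: str
    process_id: int
    parent_pid: int
    command_line: str
    image: str
    creation_timestamp: int
    # Group 2: termination attribute; None for a still-running process (assumption)
    termination_timestamp: Optional[int] = None
    # Group 3: MITRE ATT&CK labels attached to the process
    mitre_attacks: List[str] = field(default_factory=list)


def parent_child_edges(records: List[ProcessRecord]) -> List[Tuple[int, int]]:
    """Return (ParentPID, ProcessID) pairs whose parent is also in the records."""
    known = {r.process_id for r in records}
    return [(r.parent_pid, r.process_id) for r in records if r.parent_pid in known]


# Hypothetical sample values, for illustration only.
records = [
    ProcessRecord("winword.exe", "aaa111", 100, 4, "winword.exe report.docx",
                  "C:\\Office\\winword.exe", 1600000000),
    ProcessRecord("powershell.exe", "bbb222", 200, 100, "powershell.exe -enc ...",
                  "C:\\Windows\\powershell.exe", 1600000005,
                  termination_timestamp=1600000009,
                  mitre_attacks=["T1059.001"]),
]
edges = parent_child_edges(records)
```

Here `edges` contains only the link from the first process to the second, because the parent of the first process (PID 4) is not among the collected records.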

Table 2
Statistics for the number of APT malware samples

| Source | [43] | [44] | [45] | [46] | [47] | Total |
|---|---|---|---|---|---|---|
| The number of APT malware samples | 62899 | 1248 | 1883 | 1215 | 2478 | 69723 |