Log2Graph: A graph convolution neural network based method for insider threat detection

Abstract

With the advancement of network security equipment, insider threats are gradually replacing external threats as a critical contributing factor in cluster security. To detect and combat insider threats, existing methods often concentrate on users’ behavior and analyze the logs that record their operations in an information system. Traditional sequence-based methods consider the temporal relationships among user actions, but cannot adequately represent the complex logical relationships between various entities and different behaviors. Current machine learning-based approaches, such as graph-based methods, can establish connections among log entries, but are limited by their complexity and by their difficulty in identifying the malicious behavior that reflects a user’s inherent intention.

In this paper, we propose Log2Graph, a novel insider threat detection method based on a graph convolution neural network. To achieve efficient anomaly detection, Log2Graph first retrieves logs and their corresponding features from log files through feature extraction. Specifically, we use an auxiliary feature, the anomaly index, to describe the relationship between entities such as users and hosts, instead of establishing complex connections between them. Second, these logs and features are augmented through a combination of oversampling and downsampling to prepare for the next-stage supervised learning process. Third, we use three carefully designed rules to construct a graph for each user by connecting the logs according to chronological and logical relationships. Finally, a purpose-built graph convolution neural network is used to detect insider threats. Our validation and extensive evaluation results confirm that Log2Graph can greatly improve the performance of insider threat detection compared to existing state-of-the-art methods.

Keywords

Insider threat detection; cluster security; advanced persistent threats; graph construction; graph convolution neural network

1. Introduction

Due to the rapid development and adoption of information technologies, enterprises and organizations rely heavily on networks for daily operations. Network border protection, such as using a firewall to prevent attacks from outside, provides an efficient approach for many enterprises and organizations. However, insider threats have gradually become a critical factor threatening the safe operation and services of these systems. Insider attackers are generally employees, contractors, or business partners of the organization. They usually have authorized access to the organization’s systems, network, and data. An insider threat is behavior in which insiders use legally obtained authorization to negatively affect the confidentiality, integrity, and availability of information systems. It can pose a high risk with serious consequences, such as significant financial ramifications or leakage of vital data. With the establishment and improvement of information security mechanisms in enterprises, it is increasingly difficult for external attackers to enter target systems. As a result, insider threats have been increasing and have become a major challenge to enterprise system security.

Different from external attacks, the main challenges for insider threat detection lie in the following three aspects:

Insider threats are caused by insider users whose malicious behavior cannot be easily detected by peripheral security devices, such as firewalls.

Most insider threats are composed of a series of malicious operations. As insiders are also employees within an organization, these malicious operations are scattered and hidden in their normal work operations.

Usually, there are not enough insider threat samples to support the training process of a supervised learning model.

A common approach for insider threat detection is to use log entries for analysis. The log-entry-based method first extracts a large number of features (e.g., time, protocol, and login type) from logs. Then it analyzes the characteristics by using certain anomaly detection algorithms, such as clustering [27] or isolation forest [6,17,26,29,37] algorithms, and outputs anomaly logs as possible malicious behavior. This method can differentiate logs according to the distance or density calculation, but cannot obtain any relationship between logs during feature detection. To achieve more effective results, it often relies on rich, complex feature extraction, which largely limits the effectiveness of threat detection.

With the development of machine learning models and algorithms, the sequence-based method has recently been proposed for anomaly detection. This method extracts normal or malicious patterns by considering temporal dependencies between logs. It links different types of logs and features in chronological order and matches new log sequences against the corresponding patterns to determine whether they are malicious. It often uses long short-term memory (LSTM) or recurrent neural networks (RNN) to model users’ normal behavior and predicts deviations as anomalies [5,7,28,34,36,45,46]. However, the sequence-based method cannot capture complex user behaviors, nor can it process them in parallel.

To further exploit the interactive relationships among user behaviors, more work has been devoted to graph-based methods, which address insider threat detection with graph neural networks (GNN) [3,4,16,19,25]. These methods express the logical relationships between logs by constructing a graph and detect malicious behavior through node classification, subgraph classification, link prediction, etc. According to the training process, graph-based methods can be divided into unsupervised and supervised approaches. The unsupervised approach does not require any data annotation during training, and detects malicious behavior by comparing the similarity between user behaviors. However, it suffers from a high false positive rate because it cannot distinguish between outliers and malicious behavior. The existing supervised approach uses labeled data to detect anomalies, but does not adequately consider how the graph structure influences the performance of graph convolution neural networks, which still limits its use in practice.

In this paper, we introduce Log2Graph, a novel approach based on graph convolution neural network (GCN) [20] for insider threat detection. The overview of Log2Graph is shown in Fig. 1. It first analyzes original log files to obtain log entries and extract log features of all users. Then Log2Graph performs data augmentation to balance the dataset. It randomly samples all logs of users with malicious behavior in the training set to generate oversampled data. Accordingly, it downsamples the oversampled data and original data respectively to reduce normal logs. After that, Log2Graph uses the augmented data as a training set. Graph construction treats each log as a node and connects log nodes in time series and logical relationships to form graphs. At last, the graphs in the training set are fed into the graph neural network for training, and a node classifier is obtained for classification of the logs in the test set. The classifier outputs a label for each node to indicate whether it poses an insider threat, providing prediction for malicious behaviors.

Fig. 1.

The overview of Log2Graph.

The contribution of this study includes:

We propose a Log2Graph detection method by capturing users’ intentions and detecting their behavior to determine whether they belong to insider threats. The Log2Graph method is composed of four components: feature extraction, data augmentation, graph construction, and GCN classifier. It achieved the best performance with AUC 0.997 (LANL dataset)/0.986 (CERT dataset), and the FPR 0.006 (LANL dataset)/0.027 (CERT dataset).

We propose a new feature to measure the anomaly in login behavior, namely anomaly index. It provides an auxiliary attribute together with other log features, which significantly reduces the complex connections for different users in the process of graph construction and largely improves the efficiency.

We propose a data augmentation method based on sampling to provide more information during the training process, so that the supervised learning method can achieve better results for insider threat detection when only a small number of malicious samples is available.

We propose a method to convert each user’s behavior logs into a graph based on chronological and logical relationships, enabling the GCN to capture long-distance information on time series and achieve efficient anomaly detection. It can achieve more efficiency with GCN-based insider threat detection.

2. Motivation

For insider threat detection, the easiest data that can be obtained is logs related to user behavior during daily work. Figure 2(a) depicts the logs we aim to detect. In general, there are three types of logs, including the user authentication log A, the process log P, and the network flow log F. Each type of log has a distinct format to describe user behavior. The authentication log contains the time, user, source computer, and destination computer, as well as many corresponding features, such as the type of authorization protocol. The process log records when the user starts or stops a process. The network flow log records information about the recipient and sender. Additionally, it also has features such as transport protocols, etc. From the user behavior logs, it can be seen that they are simpler than system logs, but contain more features in each log entry.

Fig. 2.

Sequence- and graph-based approaches for insider threat detection. (a) A log file recording a user’s operations; (b) sequence-based approach: data is processed step by step; (c) graph-based approach: in the illustrated GCN model, each node gains information from its 1-hop neighbors in a GCN layer, and can further retrieve rich information from its k-hop neighbors with k layers.

Figure 2(b) and Fig. 2(c) illustrate the sequence-based method and the graph-based method for insider threat detection. The logs in Fig. 2(a) show a typical example of a real scenario, where an employee logs in to a computer and runs a process that generates two network flows for data transmission. As shown in Fig. 2(b), the sequence-based method can identify the simple temporal relationship but has difficulty capturing the complex logical relationships between different log entries. For instance, it ignores the relationship between processes and flows: the second flow log F is actually produced by the process log P and should be connected with it directly. Additionally, the sequence-based method can only process one log entry per step, and cannot run the analysis in parallel.

In contrast, graph-based methods can tackle this challenge by creating edges and passing messages directly between operations. Figure 2(c) illustrates an example of using a graph to model the relationships between operations, where the two flow logs connect with the process log directly. Moreover, all data can be processed in parallel. In graph-based methods, if the behaviors of employees have some logical connection, the relationship between users’ actions can be further measured with the graph structure. However, this leads to a new problem: the computational complexity is very high. For instance, Log2vec [25] establishes edges through a total of 10 rules to connect log entries into a heterogeneous graph, uses node2vec [11] to obtain the node embedding vectors, and then clusters them. Because node2vec is inefficient, its authors need 9 servers to do the job.

Many works have used unsupervised methods to detect unknown types of insider threats, such as zero-day attacks, but those methods only work well in some special scenarios. The AUC (area under the curve) of Log2vec is about 0.91, it recognizes about 10% of normal behavior as malicious, and it needs different hyperparameters for different employees/users. This can be a limitation in actual enterprise practice because the amount of normal data is much larger than that of malicious data, which leads to even more false positives. Thus, there is an obstacle that cannot be overcome when using unsupervised learning to detect insider threats. The reason is that insider threats are detected from the employee’s behavior logs, but the employee’s work content is not constant. Small changes in the work content, such as issuing a new computer to the employee or transferring the employee to another department, introduce noise. These behaviors are logged as unusual actions, but they are not threats. Unsupervised methods cannot accurately distinguish between outliers and threats, so a high false positive rate is unavoidable.

When the types of threats are already known, supervised methods are more efficient and accurate than unsupervised methods. Compared with an expert system based on domain knowledge, an artificial neural network does not require security personnel to have complex background knowledge in order to establish threat detection rules and write the corresponding detection programs. It only needs the security personnel to provide enough malicious samples to train the neural network. Thus, in a real insider threat detection system, both unsupervised and supervised methods are needed for efficiency. Nevertheless, using supervised learning for malicious behavior detection is also difficult. First, in most scenarios, we do not have enough samples of malicious behavior to train supervised learning methods. Compared with normal actions, our core concern is to obtain samples of malicious behavior, which can be relatively clearly defined. However, in real-world scenarios, malicious behavior samples are very sparse. For instance, there are 1,648,275,307 samples in the LANL dataset [18], of which only 749 are malicious. Moreover, the sparseness of malicious samples results in a serious imbalance in the dataset. Thus, it is challenging to employ supervised learning methods for malicious behavior detection.

Overall, there is still room for improvement in the effectiveness and performance of existing methods. First, in the case of long sequences and large amounts of noise, the ability to capture complex logical relationships between different operations can be improved. Specifically, while retaining important information, we can try to eliminate the impact of noise. Second, appropriate data enhancement methods can be conducted for malicious behavior data to reduce the negative impact of data sparsity. Finally, reducing the complexity of methods can be feasible. Our proposed Log2Graph will address the issues in a supervised learning way.

3. Design

Based on the aforementioned discussion, we design a new method for insider threat detection. The architecture of Log2Graph is shown in Fig. 3. It mainly contains four stages: feature extraction, data preprocessing (oversampling and downsampling), graph construction, and graph neural network training. After that, a node classifier is obtained for prediction. The details of each stage are described in the following sections.

Fig. 3.

The architecture of Log2Graph.

3.1. Intentions of adversarial attack

We first summarize scenarios leading to insider threats in enterprises and governments, in which the attack intentions can be roughly divided into four categories:

Advanced persistent threat (APT): A common attack that causes prolonged disclosure or intrusion. For insider personnel, their behaviors should be carried out without attracting the attention of others. Though the attack frequency is low, its behaviors are intentional and not accidental, which are likely to be carried out many times with serious harm.

Employee retaliation: Employees are dissatisfied with the enterprise for some reason, and take a series of malicious actions for their interests or revenge on the enterprise. For instance, disclosing the company’s secrets to the outside, distributing malware, etc.

Curious behavior of employees: A series of behaviors that may endanger the interests of the enterprise to satisfy their curiosity. For instance, saving the company data to the user’s private terminal, peeping at other people’s computers or accounts, etc.

Accident: Employee’s inadvertent action or negligence. This behavior is characterized by a short duration, no strong logical or temporal connection with the surrounding environment, and is a purely accidental event with considerable or minor consequences.

3.2. Feature extraction

In Log2Graph, we transform a log entry into a node in the graph. A log entry is defined as a tuple involving five meta-attributes $⟨ U, H, T, O, F ⟩$ , where U is a collection of users; H is a collection of hosts; T is a collection of time; O is a collection of operations, such as log on, file read, process start, etc; and F is a collection of features, including the type of authorization agreement, the format of the file, the name of startup process, and so on.

Among these attributes, we use U, H, T, and O to find the relationships between log nodes, for instance, two nodes generated by the same user, on the same host, or in chronological order. We further use F to generate the embedding vectors of nodes. Generally, features in F can be divided into two types, label-type and numerical value-type. A label-type feature is transformed into a one-hot vector. We then concatenate the label-type one-hot vectors and the numerical value-type features to form the feature embedding vector of each node. The details of the features used in the experiment can be found in Section 4.1.

As is well known in graph theory, a graph contains a set V of nodes together with a set E of edges, where each edge connects one node to another. Thus, a graph G can be expressed as $G = (V, E)$. A graph neural network (GNN) is a connectionist model that uses the edges between nodes as channels to transmit messages.

We therefore need to determine the content transmitted on these channels, namely the features of each log. We place no fixed requirements on the content of features: anything that helps express the user’s behavioral intention and can be expressed quantitatively can be used as a feature. For instance, in the LANL dataset [18], the format of the logs is as follows:

From the log entries, we extract “source user” or “user” as U, “destination computer” as H, “time” as T, and “auth” as O. Other elements such as “authentication type” are label-type features, each of which is converted to a one-hot vector as one element of F. As “packet count” is a numerical feature, it can be used directly as one element of F. All the features we use are described in Section 4.1. We convert the label-type features into one-hot vectors and combine them with the numerical-type features to form a log feature vector, with null values filled with 0.
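The encoding described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the field names `auth_type` and `packet_count` and the helper names `build_vocab` and `encode_log` are hypothetical, and the one-hot vocabulary is assumed to be collected from the available logs.

```python
def build_vocab(logs, label_fields):
    """Collect the sorted set of observed values for each label-type field."""
    return {
        field: sorted({log[field] for log in logs if log.get(field) is not None})
        for field in label_fields
    }


def encode_log(log, vocab, numeric_fields):
    """Concatenate one-hot label features and numerical features; nulls become 0."""
    vector = []
    for field, categories in vocab.items():
        one_hot = [0.0] * len(categories)
        value = log.get(field)
        if value in categories:
            one_hot[categories.index(value)] = 1.0
        vector.extend(one_hot)
    for field in numeric_fields:
        value = log.get(field)
        vector.append(float(value) if value is not None else 0.0)
    return vector


# Hypothetical entries: an authentication log has no packet count,
# and a flow log has no authentication type.
logs = [
    {"auth_type": "Kerberos", "packet_count": None},
    {"auth_type": None, "packet_count": 12},
    {"auth_type": "NTLM", "packet_count": None},
]
vocab = build_vocab(logs, ["auth_type"])
vectors = [encode_log(log, vocab, ["packet_count"]) for log in logs]
```

Because every log type is encoded against the same vocabulary and the same list of numerical fields, all nodes share one vector dimension regardless of which fields a given log type actually carries.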

Anomaly index. Different from Log2vec, our graph construction does not establish connections between users and hosts or between different users. Instead, we construct an operation graph for each user as described in Section 3.4, and the behavioral differences between users are learned by the GNN. To capture the relationship between users and hosts, we design an additional feature, namely the anomaly index. Specifically, the anomaly index measures the degree of anomaly of a user’s operations, such as logon behavior on a computer or device. It provides an additional attribute alongside the other log features for efficiency, and similar information can be extracted from different datasets. It can also reflect the relationship between different users. The process of generating the anomaly index is described below.

First, we consider counting the hosts commonly used by each user and the corresponding times, and calculate the frequency of users using each host:

$$X_{u,h}^{(1)} = \mathit{Times}_{u,h} / \mathit{Alltimes}_{u}, \tag{1}$$

where $\mathit{Times}_{u,h}$ is the total number of times user $u$ logs on to host $h$, and $\mathit{Alltimes}_{u}$ is the total number of times the user logs on to all hosts. This index is close to 0 when the user logs on to an abnormal host. However, it is also close to 0 in benign scenarios, such as when a user receives a new computer: because the employee has never used this computer before, the index gives misleading information.

To solve this problem, we use another way to calculate the anomaly index. We calculate it by dividing the number of times a user uses a computer by the total number of times the computer is used ($\mathit{Alltimes}_{h}$) and get Eq. (2).

$$X_{u,h}^{(2)} = \mathit{Times}_{u,h} / \mathit{Alltimes}_{h}. \tag{2}$$

In Eq. (2), the index value is close to 0 when the user logs on to an abnormal host, and this simple formula solves the above problem. When a user starts using a new computer, that user is the only one in the computer’s user list, which is equivalent to the computer being monopolized by the new user. Thus, the index value will be close to 1. Now consider another very common situation in the enterprise: there are always some public computers, such as those dedicated to connecting printers or scanners. Employees use these public computers when they need to perform the relevant operations. For these computers, the total number of uses is very high, but each user uses them only a few times when needed. This situation is normal, yet the anomaly index of everyone using the public computer will be close to 0, which is misleading.

To avoid being affected by this situation, we multiply $X_{u,h}^{(2)}$ by the total number of users of the computer $h$ ($\mathit{Usernum}_{h}$), and calculate the anomaly index as Eq. (3).

$$X_{u,h}^{(3)} = \mathit{Usernum}_{h} \times \mathit{Times}_{u,h} / \mathit{Alltimes}_{h}. \tag{3}$$

In Eq. (3), the value for normal behavior approaches 1, while the value for abnormal behavior approaches 0. To make the result more intuitive and to keep the values of abnormal behavior from losing precision (because their absolute values are too small), we take the negative logarithm of Eq. (3), so that normal behavior maps to values near 0 and abnormal behavior maps to large positive values. The complete anomaly index is given in Eq. (4):

$$X_{u,h} = \log \left( \frac{\mathit{Alltimes}_{h} / \mathit{Usernum}_{h}}{\mathit{Times}_{u,h}} \right), \tag{4}$$

where $\mathit{Usernum}_{h}$ is the number of users that access host $h$ within a certain period, $\mathit{Times}_{u,h}$ is the number of times user $u$ logs on to host $h$, and $\mathit{Alltimes}_{h}$ is the total number of logons by all users to host $h$ within the same period. The design of this index assumes that the set of hosts each user logs on to is fixed under normal working conditions. When a user suddenly logs on to a host outside this set, the behavior may be abnormal. As Log2Graph analyzes logs in batches for anomaly detection, we can calculate the anomaly index in Eq. (4) independently during the training process and use it for model construction. We then compute the anomaly index again in the test process using only the testing data, which avoids data leakage in the experiments.
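As a quick sanity check of Eqs. (2)–(4), consider two hypothetical hosts. The counts are illustrative and are not taken from either dataset, and the natural logarithm is assumed because the base is not specified. A public computer $h_1$ is used by 50 users for a total of 500 logons, 10 of which come from user $u$. A personal computer $h_2$ has 100 logons from its owner and a single logon from another user $v$, so $\mathit{Usernum}_{h_2} = 2$ and $\mathit{Alltimes}_{h_2} = 101$:

$$\begin{aligned} X_{u,h_1}^{(2)} &= 10 / 500 = 0.02, & X_{u,h_1}^{(3)} &= 50 \times 10 / 500 = 1, & X_{u,h_1} &= \log \frac{500 / 50}{10} = 0, \\ X_{v,h_2}^{(2)} &= 1 / 101 \approx 0.0099, & X_{v,h_2}^{(3)} &= 2 \times 1 / 101 \approx 0.0198, & X_{v,h_2} &= \log \frac{101 / 2}{1} \approx 3.92. \end{aligned}$$

Under Eq. (2) both cases look almost equally anomalous, whereas Eqs. (3) and (4) separate the ordinary use of a public computer (index 0) from the one-off logon to someone else’s machine (index about 3.92).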

Algorithm 1 describes the process of calculating the anomaly index for the whole organization. In the algorithm, “logs” denotes the logon logs of the organization, and “dict” is a nested dictionary that stores the anomaly index. A dictionary is a key-value data structure, from which we can get each log’s anomaly index by “anomaly index = dict[host][user]”. We first group all logon logs by host (line 2 in Algorithm 1), where “h” represents the hostname and “hg” represents the logon logs of “h”. Then we group “hg” by user (line 4 in Algorithm 1), where “u” represents the username and “ug” represents the logs of “u” logging on to “h”. After that, we calculate the anomaly index by Eq. (4) and put it into “dict” (lines 8–9 in Algorithm 1). When a new user or a new host is added to the system (testing data), “dict” needs to be updated by running Algorithm 1 again.
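The grouping procedure described above can be sketched in a few lines. This is a hedged illustration of the idea and not the authors’ Algorithm 1: the function name `anomaly_index_table` and the `(user, host)` input format are assumptions, and the natural logarithm is used because the base of the logarithm in Eq. (4) is not specified.

```python
import math
from collections import Counter, defaultdict


def anomaly_index_table(logon_logs):
    """Build table[host][user] = anomaly index of Eq. (4).

    logon_logs: iterable of (user, host) pairs, one pair per logon event.
    """
    # Group logons by host, then count logons per user on that host.
    per_host = defaultdict(Counter)
    for user, host in logon_logs:
        per_host[host][user] += 1

    table = {}
    for host, user_counts in per_host.items():
        all_times_h = sum(user_counts.values())  # Alltimes_h
        user_num_h = len(user_counts)            # Usernum_h
        average = all_times_h / user_num_h
        table[host] = {
            user: math.log(average / times)      # Times_{u,h} = times
            for user, times in user_counts.items()
        }
    return table


# Hypothetical logons: alice owns pc1, mallory touches it once,
# bob is the sole user of a newly issued pc2.
events = [("alice", "pc1")] * 99 + [("mallory", "pc1")] + [("bob", "pc2")] * 3
table = anomaly_index_table(events)
```

In this example the sole user of the new computer receives an index of 0, the regular owner of `pc1` receives a negative value, and the one-off visitor receives a large positive value, which matches the behavior Eq. (4) is designed to produce.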

3.3. Data preprocessing

As is widely acknowledged, unsupervised methods are suitable for finding unknown types of insider threats, such as zero-day attacks, but do not work well in analyzing those already known. Instead, supervised methods are more efficient and accurate in detecting known types of insider threats. However, using the original log data to train the insider threat detection model presents two challenges. One is the lack of enough malicious samples, and the other is data imbalance. To solve these problems, we need both oversampling, so that the GNN model can learn more detail about malicious data, and downsampling, to eliminate the effect of data imbalance.

3.3.1. Oversampling

We use oversampling to increase the amount of malicious data. The challenges of oversampling are two-fold. One is how to generate oversampled data that resembles the real employee behaviors in the original dataset. The other is how to avoid oversampled data that is identical to the original data. There are two kinds of oversampling methods. One generates new data according to the probability distribution obtained by analyzing the existing data. The other generates new data by sampling a certain proportion of the original data. The user’s behavior and intention are reflected in the logs as a series of behaviors in a certain order, and the characteristics they contain are relatively complex, so it is very difficult to realize oversampling by generating new data. We therefore choose to generate copies by sampling. The oversampling process is divided into the following steps:

1. Sample all logs of users who have malicious behavior according to a certain probability to obtain a complete copy of user behavior. For instance, if a user has 10 malicious logs and 90 normal logs, then at a sampling rate of 0.8 we randomly sample 80 of these logs as the oversampled samples.

2. Repeat step 1 k times to obtain k different complete copies of user behavior. The performance of the model varies with k (see Section 4.5).

3. Add random noise to the numerical features in the oversampled samples. For instance, we cannot add noise to “authentication type” because it is a label-type feature, and adding noise would change its meaning. But we can add noise to “packet count”, which is a numerical feature, so adding noise only changes its value without changing its meaning.
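The three steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors’ implementation: sampling is done without replacement, the noise is a multiplicative uniform perturbation controlled by a hypothetical `noise_scale` parameter, and the field names are illustrative. The paper does not specify the noise distribution.

```python
import random


def oversample_user(logs, rate, k, numeric_fields, noise_scale=0.05, seed=0):
    """Return k noisy sampled copies of one malicious user's time-ordered logs."""
    rng = random.Random(seed)
    sample_size = round(rate * len(logs))
    copies = []
    for _ in range(k):
        # Step 1: sample a subset of the logs, keeping chronological order.
        chosen = sorted(rng.sample(range(len(logs)), sample_size))
        copy = []
        for index in chosen:
            entry = dict(logs[index])  # shallow copy; the original stays intact
            # Step 3: perturb numerical features only; label features are untouched.
            for field in numeric_fields:
                if entry.get(field) is not None:
                    entry[field] *= 1.0 + rng.uniform(-noise_scale, noise_scale)
            copy.append(entry)
        copies.append(copy)  # Step 2: repeated k times
    return copies


# Hypothetical user with 100 logs, matching the 0.8 sampling-rate example.
user_logs = [
    {"time": t, "auth_type": "Kerberos", "packet_count": 10.0} for t in range(100)
]
copies = oversample_user(user_logs, rate=0.8, k=3, numeric_fields=["packet_count"])
```

Sorting the sampled indices keeps each copy in chronological order, which matters because the later graph construction connects a user’s logs by time.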

3.3.2. Downsampling

Data imbalance means that the number of samples in some categories of the data set differs greatly from that in other categories. In insider threat detection scenarios, most employees or users are benign, so normal data far outnumbers threat data. Under data imbalance, supervised learning may trap the neural network into assigning all data to the most common category. During supervised learning, the model gradually learns by reducing its loss. However, simply classifying all data into the most common category already yields high accuracy and low loss, while detecting no malicious behavior at all. Such a model cannot detect threats. To avoid this situation, imbalanced data must be processed before it is used.

There are three common methods to solve this problem. The first is to modify the loss function so that it assigns different weights to different categories. The advantage of this method is that it is usually very simple: the corresponding weight is added when calculating the loss function, according to the proportion of data in each category. However, this method also has disadvantages. For instance, it cannot reduce the size of the large categories, so an excessive amount of data is used during model training, which costs considerable time. The second method is to downsample the data set, that is, to sample the different categories in corresponding proportions so that their data volumes reach a similar level. The disadvantage is that only a small part of the original data is used, so some information is lost. The third method is oversampling, which artificially increases the amount of data in the minority categories. The advantage is that it ensures sufficient data for training, but the corresponding disadvantage is the complexity of constructing the data. As the number of benign samples is far larger than the number of malicious samples and the dimension of the log vectors is small (see Section 4.1), generating too many samples would be redundant (see Section 4.5). Thus, we use oversampling only to generate a limited number of malicious samples for anomaly detection.

Based on the above analysis, we use a downsampling method to solve the problem of data imbalance. In daily work, the work content of each employee is relatively fixed and does not change drastically. As a result, the logs generated by employees’ normal work are highly repetitive. This property can be used to downsample the log data and reduce the proportion of benign samples. Downsampling also reduces the amount of data the GNN needs to process during training and thus improves training speed.

There are two types of features in each log, label type features and numerical value type features. The label type features mainly reflect the attributes of operations, such as whether an email is sent to the outside world and whether the sent email has attachments. Numerical value type features are often used to describe the detail of operations under current attributes, such as the size of email attachments. We use label type features to measure the degree of rareness of logs. In the process of downsampling, we will reserve rare logs with a higher probability. The purpose of this design is to ensure the diversity of training data, so that the deep learning model can learn more information about the relationship between features and threats. The specific process of downsampling is described in Algorithm 2.
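Since Algorithm 2 is not reproduced in this section, the following is only one possible way to realize rarity-aware downsampling, not the authors’ procedure. It assumes that rarity is measured by how often a combination of label-type values occurs among normal logs, that a normal log is kept with probability inversely proportional to that count, and that malicious logs are always kept. The `malicious` field, the `target_per_pattern` parameter, and the function name are hypothetical.

```python
import random
from collections import Counter


def downsample(logs, label_fields, target_per_pattern=5, seed=0):
    """Keep rare normal logs with high probability and thin out repetitive ones."""
    rng = random.Random(seed)

    def pattern(log):
        # The combination of label-type values identifies a behavior pattern.
        return tuple(log.get(field) for field in label_fields)

    counts = Counter(pattern(log) for log in logs if not log["malicious"])
    kept = []
    for log in logs:
        if log["malicious"]:
            kept.append(log)  # never discard the scarce malicious samples
            continue
        keep_probability = min(1.0, target_per_pattern / counts[pattern(log)])
        if rng.random() < keep_probability:
            kept.append(log)
    return kept


# Hypothetical data: 1000 repetitive normal logs, 3 rare normal logs, 2 malicious logs.
data = (
    [{"to_outside": False, "has_attachment": False, "malicious": False}] * 1000
    + [{"to_outside": True, "has_attachment": True, "malicious": False}] * 3
    + [{"to_outside": True, "has_attachment": True, "malicious": True}] * 2
)
reduced = downsample(data, ["to_outside", "has_attachment"])
```

With these settings, each label pattern contributes roughly `target_per_pattern` normal logs in expectation, so patterns that occur only a few times survive intact while highly repetitive ones are reduced sharply.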

3.4. Graph construction

Through feature extraction and data augmentation, we obtain many log entries in the form of $⟨ U, H, T, O, F ⟩$. As we treat each log entry as a node, we still need to obtain the edges of the graph. In a graph convolution neural network, the convolution layer updates each node’s features by aggregating the information of its neighbor nodes in a certain manner, and the edges pass information in this way. So when constructing edges, we need to consider from which nodes each node should obtain information in order to distinguish the intention behind each behavior.

Fig. 4.

Graph construction for three flow nodes.

With the analysis of malicious behaviors, we can observe that users will exhibit malicious acts in several other cases, except for unintentional incidents. Therefore, our main goal is how to express the user’s intention through information transmission between nodes. Specifically, we propose three rules for constructing edges in graph as shown in Fig. 4.

Rule 1:

All logs for the same user are connected chronologically by undirected edges.

This rule reflects the behavior pattern of an employee, that is, what the employee intends to do. In daily work, malicious actions are usually mixed with normal actions, which leads to a long distance between the logs of malicious behaviors if they are connected only in chronological order. In addition, the same type of operation may be performed many times in a row for a certain purpose. To capture the intent of malicious users, we need to handle such runs of same-type operations specially, so we introduce the following two rules:

Rule 2:

Each log is connected to the next log of a different operation type from the same user.

Rule 3:

Each log is connected to the previous log of a different operation type from the same user.

By jumping over logs that have the same operation type, we enable the model to obtain useful information over longer distances during training. Since the two nodes at the ends of every edge belong to the same user, there are no connections between users, and no messages can pass between them. Therefore, to enable the model to capture the relationship between different users, we design the anomaly index and use it as an auxiliary feature in the GNN.

Algorithm 3 constructs the graph. In the algorithm, “logs” is a list of logs, “node embeddings” is the list of embedding vectors of all log nodes (generated from F), and the edges of the graph are stored in the list “edges”. Each element of “edges” is a pair $⟨$ tail, head $⟩$ , whose components point to one tail node and one head node in “node embeddings”, respectively. We iterate through “logs” sequentially, connecting each user’s actions in chronological order (Rule 1) and storing the user’s consecutive actions of the same type in a temporary list “tailTemp”. When the operation type changes, we connect the nodes in the temporary list to the new node (Rule 2) and use the last node in the temporary list as the node to which subsequent nodes backtrack (Rule 3). When a new user is added to the system, we only need to run lines 3 to 28 in Algorithm 3 to update the graph. When adding a new host, we need to re-run Algorithm 3 to update the graph. The algorithm first groups logs by user, then sorts each user’s logs by time, and finally traverses all logs once. Since logs are usually generated in chronological order, sorting is unnecessary in most cases and the time complexity of Algorithm 3 is $O (n)$ . When logs arrive in arbitrary order, the sort dominates and the time complexity is $O (n \log n)$ .
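The three rules can be sketched in a single pass per user, as below. This is an illustrative reading of Rules 1 to 3, not a transcription of Algorithm 3: the function name `build_edges`, the dictionary keys `user`, `time`, and `op`, and the choice to store undirected edges as sorted index pairs are assumptions.

```python
from collections import defaultdict


def build_edges(logs):
    """Return undirected edges (i, j), i < j, between indices of `logs`."""
    by_user = defaultdict(list)
    for index, log in enumerate(logs):
        by_user[log["user"]].append(index)

    edges = set()

    def connect(a, b):
        edges.add((min(a, b), max(a, b)))

    for nodes in by_user.values():
        nodes.sort(key=lambda i: logs[i]["time"])
        run = []          # current run of consecutive same-type logs
        prev_last = None  # last log of the previous run (a different type)
        for n in nodes:
            if run and logs[run[-1]]["op"] != logs[n]["op"]:
                # Rule 2: n is the next different-type log for the whole run.
                # This also covers Rule 1 for the pair (run[-1], n).
                for m in run:
                    connect(m, n)
                prev_last, run = run[-1], []
            if run:
                # Rule 1: chronological link inside a same-type run.
                connect(run[-1], n)
            if prev_last is not None:
                # Rule 3: link back to the previous different-type log.
                connect(prev_last, n)
            run.append(n)
    return edges
```

For one user with operation types A, A, B, B, A in time order (nodes 0 to 4), the sketch yields the chronological chain plus the jump edges (0, 2), (1, 3), and (2, 4). No edge ever joins logs of different users.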

3.5. Graph neural network

In a graph convolution network, the data gathered for an output node comes from its neighbors in the previous layer. Each of these neighbors, in turn, gathers its input from the layer before it, and so on. Because the intention represented by each log is related to the information around it, we need enough convolution layers to capture the neighborhood information of each log node.

However, the deeper we backtrack, the more multi-hop neighbors support the computation of the root. The number of support nodes (and thus the training time) potentially grows exponentially with the GCN depth. To resolve this conflict, we use the random walk sampler proposed in GraphSAINT [47] to generate mini-batches and thereby limit the scale of the graph. Another problem with too many convolution layers is over-smoothing, the phenomenon in which the outputs of different nodes converge to nearly the same value. One reason for over-smoothing is that a very deep GCN makes each node collect information from too many nodes, and many of the nodes that provide information to different nodes are the same. To avoid over-smoothing caused by stacked convolution layers, we keep the output of each convolution layer.
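The random-walk mini-batch idea can be illustrated with a short sketch. It shows only the subgraph sampling step and omits the normalization that GraphSAINT applies to keep training unbiased. The function name `random_walk_subgraph` and the adjacency-dictionary input are assumptions.

```python
import random


def random_walk_subgraph(adj, num_roots, walk_length, seed=0):
    """Sample a node-induced subgraph from random walks.

    adj maps each node to a list of its neighbours (undirected graph).
    """
    rng = random.Random(seed)
    nodes = sorted(adj)
    # Roots are drawn uniformly, with replacement.
    roots = [rng.choice(nodes) for _ in range(num_roots)]
    visited = set(roots)
    for root in roots:
        current = root
        for _ in range(walk_length):
            neighbours = adj[current]
            if not neighbours:
                break  # isolated node: the walk cannot continue
            current = rng.choice(neighbours)
            visited.add(current)
    # Keep every original edge whose two endpoints were both visited.
    sub_edges = {(u, v) for u in visited for v in adj[u] if v in visited and u < v}
    return visited, sub_edges
```

Each mini-batch then contains at most `num_roots * (walk_length + 1)` nodes, regardless of how many convolution layers are stacked on top of it.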

After that, we use a multi-layer perceptron to classify the obtained information. Based on the above analysis, the input of the first fully connected layer is made up of the outputs of all preceding convolution layers.

Therefore, as shown in Fig. 5, the whole graph neural network consists of n convolutional layers, a concatenate layer, and m fully connected layers. Different numbers of n and m can be used to construct the graph neural network. In our test, we use 6 convolutional layers and 7 fully connected layers. Each convolution layer consists of a GraphConv layer [31], a dropout, and an activation function. Each fully connected layer consists of a dense layer, a dropout, and an activation function. Each GraphConv layer calculates the new features of node i according to the following formula:

$$X_{i}^{'} = W_{1} X_{i} + W_{2} \sum_{j \in \mathcal{N}(i)} e_{i, j} \cdot X_{j}, \tag{5}$$

where $X_{i}^{'}$ denotes the new features of node i, $X_{i}$ denotes the old features of node i, $X_{j}$ denotes the features of a neighbor node j in $\mathcal{N}(i)$, $e_{i, j}$ denotes the edge weight from source node j to target node i, and $W_{1}$ and $W_{2}$ are two weight matrices to be learned.
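To make Eq. (5) and the concatenation of layer outputs concrete, the following is a minimal sketch in plain Python rather than PyTorch Geometric. It omits dropout and any bias term. The helper names and the Leaky ReLU slope of 0.01 are assumptions, and edges are given as hypothetical `(source j, target i, weight)` triples.

```python
def matvec(w, v):
    """Multiply matrix w (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in w]


def graph_conv(x, edges, w1, w2):
    """Eq. (5): new x[i] = W1 x[i] + W2 * sum of e * x[j] over edges (j, i, e)."""
    agg = [[0.0] * len(x[0]) for _ in x]
    for j, i, e in edges:  # message from source j to target i
        for c, value in enumerate(x[j]):
            agg[i][c] += e * value
    return [
        [a + b for a, b in zip(matvec(w1, x[i]), matvec(w2, agg[i]))]
        for i in range(len(x))
    ]


def leaky_relu(v, slope=0.01):
    """Leaky ReLU; the slope value is an assumption."""
    return v if v > 0 else slope * v


def conv_stack(x, edges, weights):
    """Apply the layers in turn and concatenate every layer's output per node."""
    outputs, h = [], x
    for w1, w2 in weights:
        h = [[leaky_relu(v) for v in row] for row in graph_conv(h, edges, w1, w2)]
        outputs.append(h)
    return [[v for layer in outputs for v in layer[i]] for i in range(len(x))]
```

Because `conv_stack` returns the outputs of all layers side by side, the classifier that follows still sees the shallow, less smoothed representations even when deeper layers have mixed in information from distant nodes.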

Fig. 5.

Graph neural network in Log2Graph.

4. Evaluation

4.1. Experimental setup

The experiments were carried out on a computer with a Ryzen 3900X 4.25 GHz CPU (12 cores and 24 threads), 64 GB of memory, and a 2070 Super GPU. We use Python 3.8 and PyTorch Geometric 1.9 [8] for all work, including data cleaning, feature extraction, data augmentation, graph construction with datasets, and anomaly detection with the graph convolution neural network. We selected two datasets to test Log2Graph: the LANL dataset, composed of real records, and the synthetic CERT dataset.

LANL dataset [18]: The LANL dataset comprises over one billion log entries collected over 58 days for 12,425 users and 17,684 computers within LANL’s corporate and internal computer network. It contains 749 malicious logons involving 98 compromised accounts, which is typical of an APT.

In the LANL dataset, we use three files: “auth.txt”, “proc.txt”, and “flows.txt”. Since login operations initiated by a computer have little to do with a user’s intention, we filter them out and retain only the login operations initiated by users. We also add several new features, such as whether the user is trying to log on to another user’s account (logon different user) or to another computer (logon different computer), and we calculate the anomaly index. The attributes used in the LANL dataset are described in Table 1. The “ $⋆$ ” in the table indicates that the feature needs to be extracted from the file through feature engineering. All values of the anomaly index are calculated by Eq. (4). For the feature “logon different user”, the value is true if the source user and the target user in a log are different, and false otherwise. Similarly, for the feature “logon different computer”, the value is true if the source computer and the target computer in a log are different, and false otherwise. In the original training set, there are 43,321,938 benign logs and 32 malicious logs. After oversampling and downsampling, there are 26,059 benign logs and 259 malicious logs. Since the data are still not balanced enough, we set different class weights in the loss function when training Log2Graph and GCN.
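The two derived logon features can be sketched as follows. The sketch assumes the published comma-separated layout of “auth.txt” (time, source user@domain, destination user@domain, source computer, destination computer, authentication type, logon type, authentication orientation, success/failure). It also assumes that computer-initiated logins are those whose source account name ends with “$”, and that users are compared on the full user@domain field. The function name `parse_auth_line` and the sample values are illustrative only.

```python
def parse_auth_line(line):
    """Parse one auth.txt record; return None for computer-initiated logins."""
    (time, src_user, dst_user, src_pc, dst_pc,
     auth_type, logon_type, orientation, outcome) = line.strip().split(",")
    # Assumption: machine accounts are named like "C10$@DOM1".
    if src_user.split("@")[0].endswith("$"):
        return None
    return {
        "time": int(time),
        "user": src_user,
        "host": src_pc,
        "auth_type": auth_type,
        "logon_type": logon_type,
        "orientation": orientation,
        "success": outcome == "Success",
        # Derived label features marked with a star in Table 1.
        "logon_different_user": src_user != dst_user,
        "logon_different_computer": src_pc != dst_pc,
    }
```

For a made-up record such as `1,U1@DOM1,U2@DOM1,C10,C20,NTLM,Network,LogOn,Success`, both derived features are true. The anomaly index of Eq. (4) is not computed here.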

Table 1.
Features used in LANL dataset.

Feature	Type	Source
Authentication type	Label	auth.txt
Logon type	Label	auth.txt
Authentication orientation	Label	auth.txt
Authentication success/failure	Label	auth.txt
Process name	Label	proc.txt
Process start/end	Label	proc.txt
Flows duration	Numerical value	flows.txt
Flows protocol	Label	flows.txt
Packet count	Numerical value	flows.txt
Byte count	Numerical value	flows.txt
Logon different user	Label	auth.txt $⋆$
Logon different computer	Label	auth.txt $⋆$
Anomaly index	Numerical value	auth.txt $⋆$

CERT dataset [10]: The CERT dataset is a synthetic dataset that includes system logs labeled with insider threat activities. We use version r5.2 in our experiment, which includes both benign and malicious activities generated from 1000 simulated users. The attributes used in the CERT dataset are described in Table 2. All values of the anomaly index are calculated by Eq. (4). For the feature “email from out”, if the sender’s email address does not end with “dtaa.com”, we consider the sender to be external and set the feature to true; otherwise we set it to false. For the feature “email to number”, we count the number of email addresses in the “to” field. Other features of the form “email ∗ number” are extracted in a similar way. In the original training set, there are 343,001 benign logs and 590 malicious logs. After oversampling and downsampling, we get 6539 benign logs and 4713 malicious logs. Since the data are balanced enough, we did not set different weights in the loss function during the training process.
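The email features described above can be sketched as below. The sketch assumes that each email record is available as a dictionary with `from`, `to`, `cc`, and `bcc` fields and that multiple addresses in a field are separated by semicolons. The function name `email_features` and the feature key names are hypothetical.

```python
def email_features(row, internal_suffix="dtaa.com"):
    """Derive the starred email features of Table 2 from one email record."""

    def addresses(field):
        # Assumption: addresses in a field are separated by ";".
        return [a.strip() for a in (row.get(field) or "").split(";") if a.strip()]

    def is_outside(address):
        return not address.lower().endswith(internal_suffix)

    features = {"email_from_out": is_outside(row["from"])}
    for field in ("to", "cc", "bcc"):
        found = addresses(field)
        features[f"email_{field}_number"] = len(found)
        features[f"email_{field}_out_number"] = sum(is_outside(a) for a in found)
    return features
```

An email sent from an internal address to one internal and one external recipient yields `email_from_out` false, `email_to_number` 2, and `email_to_out_number` 1.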

Table 2.

Features used in CERT dataset.

Feature	Type	Source
Operation type	Label	Log file name
Non-working hours	Label	Operation occurs
Resign	Label	LDAP
Website	Label	http.csv
File open	Label	file.csv
File write	Label	file.csv
To removable device	Label	file.csv
From removable device	Label	file.csv
Email receive	Label	email.csv
Email from out	Label	email.csv $⋆$
Email to number	Numerical value	email.csv $⋆$
Email to out number	Numerical value	email.csv $⋆$
Email cc number	Numerical value	email.csv $⋆$
Email cc out number	Numerical value	email.csv $⋆$
Email bcc number	Numerical value	email.csv $⋆$
Email bcc out number	Numerical value	email.csv $⋆$
Email attachments number	Numerical value	email.csv $⋆$
Anomaly index	Numerical value	logon.csv $⋆$

Table 3.

Hyperparameters.

Parameters	Values
Batchsize	6000
Epoch	600
Walk length	8
Data augmentation (k)	7
Drop out (Conv)	0.2, 0.2, 0.2, 0.2, 0.2, 0.2
Activation function (Conv)	Leaky ReLU × 6
Drop out (FC)	0.2, 0.2, 0, 0, 0, 0, 0
Activation function (FC)	Leaky ReLU × 6, none
Hidden channels	64
Learning rate	0.0003

For evaluation, we first randomly select half of the malicious users’ logs in both datasets and expand them through data augmentation to build the malicious part of the training set. Then we extract the logs of as many benign users as there are malicious users to form the benign part. Together, the two parts form the training set. Finally, we use the remaining users’ logs as the test set. To reduce GPU memory consumption, we randomly keep 0.1% of the normal logs in the training set and use a random walk sampler [47] to generate mini-batches. All hyperparameters are listed in Table 3.

4.2. Performance metrics

For test metrics, we use the widely used AUC (area under the curve) to measure the performance of insider threat detection, and we use FPR (false positive rate) to measure the additional workload that our method brings to security personnel.

AUC: AUC is the area under the ROC curve (receiver operating characteristic curve). In the ROC curve, the horizontal axis is FPR and the vertical axis is TPR (true positive rate). The AUC therefore reflects the classifier’s behavior with respect to TPR and FPR at the same time. Since the ROC curve does not change with the proportion of each class in the dataset, it is not affected by data imbalance. It can still evaluate a classifier reasonably when samples are unbalanced, which makes it well suited to measuring the effectiveness of insider threat detection.

FPR: FPR measures the degree to which the classifier identifies normal data as threat data. Because traditional anomaly detection algorithms usually cannot balance the detection rate (TPR) of malicious behaviors against the false alarm rate (FPR) of normal behaviors, using AUC together with TPR and FPR allows the model to be evaluated comprehensively and objectively.
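As a small worked example of these metrics, the sketch below computes FPR and TPR from hard predictions, and AUC from scores using its rank interpretation: the probability that a randomly chosen malicious sample scores higher than a randomly chosen benign one, with ties counted as one half. The function names and the toy data are illustrative.

```python
def fpr_tpr(labels, preds):
    """Return (FPR, TPR) for binary labels and predictions (1 = malicious)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    return fp / (fp + tn), tp / (tp + fn)


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]
preds = [1 if s >= 0.5 else 0 for s in scores]
print(fpr_tpr(labels, preds))  # (0.25, 1.0)
print(auc(labels, scores))     # 0.875
```

The pairwise form is quadratic in the number of samples, so it is only suitable for illustration. It does make clear why AUC does not depend on the ratio of benign to malicious samples.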

4.3. Training results

We first examine the training process of Log2Graph, taking the LANL dataset as an example. We trained the model for 600 epochs. As shown in Fig. 6 and Fig. 7, the AUC increased rapidly at the beginning of training and reached 0.9 after about 50 epochs. After that, it fluctuated only slightly and showed a slow upward trend. The FPR decreased rapidly at the beginning of training, increased significantly at about 90 epochs, dropped suddenly at 250 epochs, and then showed a fluctuating downward trend. After 500 epochs, the performance of the model is relatively stable, with AUC above 0.98 and FPR between 0.01 and 0.015. In our evaluation, Log2Graph takes about 2 minutes to complete the entire training process, uses about 4 GB of GPU memory, and takes about 0.04 seconds to complete a pass over the test set.

Fig. 6.

The AUC curve during training.

Fig. 7.

The FPR curve during training.

4.4. Test results

In this section, we evaluate test results by comparing Log2Graph with other methods.

Comparison with baselines. We first compare Log2Graph with three baselines: Kmeans [1], IF (Isolation Forest) [26], and GCN [20]. They are classical algorithms with a wide range of applications. We reproduce these baselines for comparison. For IF and Kmeans, we use the same test set as for Log2Graph, but with the anomaly index removed. Similarly, GCN uses the same training and test sets as Log2Graph, with the anomaly index removed.

Kmeans: Kmeans is a traditional clustering algorithm, which is a method of vector quantization. It aims to partition feature vectors into k clusters in which each vector belongs to the cluster with the nearest mean (cluster centers or cluster centroid), serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.

Isolation Forest (IF): Isolation forest is a classical anomaly detection algorithm, which recursively and randomly divides the dataset until all sample points are isolated. Under the random segmentation strategy, abnormal data usually have a shorter path.

GCN: We trained a node classifier with two GCN layers and two fully connected layers, using the same hyperparameters as Log2Graph. The input of the first layer is the node features, the output of the last layer is the classification result, and the input of every other layer is the output of the previous layer.

The results are shown in Table 4 and Table 5. Log2Graph obtained the best performance, with AUC 0.997 (LANL)/0.986 (CERT) and FPR 0.006 (LANL)/0.027 (CERT). IF performs very well on the LANL dataset but poorly on the CERT dataset. The reason is that most label-type features in the CERT dataset take too many distinct values, such as the URL or subject of website content. When these features are converted into one-hot vectors, the vectors become very sparse, with too many ‘0’ entries. Therefore, IF (and Kmeans) did not work well on the CERT dataset. Because the two-layer GCN cannot obtain enough neighborhood information, it performed worse than Log2Graph on both datasets.

Table 4.
Comparison with other methods on LANL dataset.

Method	Training	AUC	FPR	TPR
Log2Graph	Supervised	0.997	0.003	0.995
IF	Unsupervised	0.980	0.020	0.895
Kmeans	Unsupervised	0.867	0.056	0.790
GCN	Supervised	0.987	0.005	0.980
Log2Vec [25]	Unsupervised	0.910	NA	NA
LEADS [19]	Supervised	0.950	0.050	NA
Le et al. [22]	Unsupervised	0.910	NA	NA
Bowman et al. [3]	Supervised	NA	0.010	NA
∗MLTracer [24]	Supervised	0.999	0.001	1.000

¹ The symbol ∗ means that the method needs additional manual assistance for anomaly detection.

² NA is used if there is no corresponding metric in the original paper.

Table 5.

Comparison with other methods on CERT dataset.

Method	Training	AUC	FPR	TPR
Log2Graph	Supervised	0.967	0.027	0.824
Kmeans	Unsupervised	0.538	0.477	0.552
IF	Unsupervised	0.857	0.093	0.807
GCN	Supervised	0.803	0.194	0.799
Log2Vec [25]	Unsupervised	0.890	0.100	NA
Log2Vec++ [25]	Unsupervised	0.930	NA	NA
Wrongdoing monitor [38]	Supervised	0.820	0.200	0.944
RAP-Net [49]	Supervised	0.967	NA	0.960
Singh et al. [35]	Supervised	0.890	0.020	0.930
HITD [2]	Supervised	0.950	0.030	0.920
Nasir et al. [32]	Supervised	0.910	0.090	NA
Yuan et al. [45]	Supervised	0.940	NA	NA
Yuan et al. [46]	Supervised	0.950	NA	NA
Le et al. [22]	Unsupervised	0.910	NA	NA

Fig. 8.

The AUC curve with different k in Log2Graph.

Comparison with state-of-the-art methods. We also compare Log2Graph with state-of-the-art methods. Hardware performance mainly affects the execution time of a method and does not reflect its effectiveness in detecting malicious behavior. We also assume that these methods reached their best performance in their original papers, so we do not reproduce them and instead compare Log2Graph directly with the results reported in the literature. The discussion of state-of-the-art methods can be found in Section 5. Table 4 and Table 5 summarize the performance of these methods on the LANL and CERT datasets. The results show that Log2Graph achieves the best AUC among the methods that need no additional manual assistance, and obtains a higher TPR and a lower FPR than most methods on both datasets. The reason is that Log2Graph combines feature extraction, data augmentation, and graph construction to detect anomalies efficiently.

4.5. Parameter sensitivity

To verify the impact of the number of oversampled copies k on model performance during data augmentation, we tested Log2Graph with different values of k (from 0 to 8) on the LANL dataset. The changes in AUC and FPR are shown in Fig. 8 and Fig. 9, respectively.

From the figures, we can see that without oversampling ( $k = 0$ ), the performance of Log2Graph is not very good and is similar to that of the IF algorithm. As k increases, the performance of the model improves. At $k = 8$ , the performance degrades. Apart from chance, a possible cause is overfitting: too many sampled copies may lead the model to treat irrelevant details as classification cues during training. Thus, different tasks may require different values of k. When using Log2Graph, k should be selected by considering its impact on model performance.

Fig. 9.

The FPR curve with different k in Log2Graph.

4.6. Case studies

To analyze the stability of the model and its room for improvement, we train Log2Graph nine times with the same parameters and analyze the classification results on the LANL dataset. The performance results are shown in Table 6.

We consider samples that are misclassified by more than half of the classifiers, or misclassified in most runs, to be difficult to classify. By this criterion, 157 normal behaviors in the test data are easily identified as malicious, and one malicious behavior is easily missed. We then examined the normal samples that are misclassified as malicious. Their authentication type is concentrated in the NTLM protocol, and about 70% of the login behaviors are logons from one computer to another. Their anomaly index is mainly distributed near −1 (the normal category), but a few login behaviors have an anomaly index above 4, which is quite high.
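The selection of hard samples can be sketched as a majority count over repeated runs. Only the "more than half of the classifiers" criterion is implemented, and the function name `hard_samples` and the input layout (one list of predictions per run) are assumptions.

```python
def hard_samples(labels, runs):
    """Find samples misclassified by more than half of the runs.

    labels: true labels (1 = malicious); runs: one prediction list per model.
    Returns (benign indices often flagged, malicious indices often missed).
    """
    often_flagged, often_missed = [], []
    for i, y in enumerate(labels):
        wrong = sum(1 for preds in runs if preds[i] != y)
        if 2 * wrong > len(runs):
            (often_missed if y == 1 else often_flagged).append(i)
    return often_flagged, often_missed
```

With nine runs, a sample is reported once at least five of the nine models misclassify it.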

Table 6.
Performance of 9 tests.

Test	AUC	FPR	Test	AUC	FPR
1	0.9907	0.0185	6	0.9912	0.0176
2	0.9906	0.0138	7	0.9933	0.0134
3	0.9872	0.0157	8	0.9919	0.0162
4	0.9916	0.0118	9	0.9955	0.0090
5	0.9840	0.0121	mean	0.9907	0.0142

Therefore, it is very difficult to judge the category of these misclassified samples from the characteristics of the login behavior alone. In particular, the behavior after login may resemble the pattern of malicious behavior, which makes classification very difficult.

The only malicious sample that is easily missed is the login of user U12 to c366 from host c17693 through the NTLM authentication protocol at 1068312 seconds. Its anomaly index is −0.34877, which is close to that of normal behavior. After logging in, user U12 sends a large amount of data to many different hosts. However, because user U12 often logs on to host c366, and other users often log on to host c366 to transmit large amounts of data, this operation is very similar to normal behavior and is difficult to distinguish accurately.

5. Related work

Log entry-based method. The log entry-based method mainly obtains a feature vector for each log through complicated feature extraction and uses algorithms such as K-NN [27] or Isolation Forest [6,17,26] for anomaly detection. In this method, different feature extraction schemes usually need to be formulated for different data to achieve the desired effect. Moreover, feature extraction is usually followed by feature selection to reduce the computational overhead.

Sequence-based method. The sequence-based method mainly extracts normal or malicious patterns from sequences built by linking different types of logs and features in chronological order, and matches new log sequences against these patterns to determine whether they are malicious [9,40,48]. DeepLog [7] is a typical method that uses long short-term memory (LSTM) neural networks to model logs as natural language sequences. It learns the log model from sequences of normal execution and uses the model to detect abnormal log execution paths. At the same time, it trains a parameter value anomaly detection model on parameter value vectors to detect system performance threats reflected by these parameter values. LogCluster [23] represents each log sequence by a vector and calculates the similarity between every pair of log sequences. It uses the agglomerative hierarchical clustering (AHC) algorithm to group similar log sequences into normal and abnormal clusters. Hong et al. [14] use both automatic and manual feature engineering to derive enhanced features, which are subsequently fed into an LSTM to optimize anomaly detection. They [13] also proposed a bi-channel insider threat detection (B-CITD) framework enhanced by graph intelligence to improve the overall performance of insider threat detection methods. Manoharan et al. [30] combine standalone and sequential activities and use an RNN to detect insider threats.

The sequence-based method can detect anomalies according to the temporal order of logs but lacks efficiency in identifying interactive relationships between users, hosts, etc.

Graph-based method. There is a rich body of related work that uses graphs to solve problems such as serving online transactional processing and finding social community structure. With the development of machine learning technologies, the graph neural network (GNN) has attracted great attention. Specifically, the graph convolution neural network (GCN) has been widely used in diverse domains [12,15,31,39,41,42,44]. The primary purpose of GCN is to obtain neighborhood information for each node and aggregate it to update the node’s representation. There are also unsupervised methods that can generate node embeddings for downstream classifiers, such as VGAE [21], GraphSAGE [12], Node2Vec [11], and DeepWalk [33]. To solve node explosion and over-smoothing in GCN, Zeng et al. [47] propose GraphSAINT, which obtains a small subgraph from an unbiased sampler and runs a complete GCN on the subgraph.

For anomaly detection, graph-based methods mainly transform logs or other entities, such as users and hosts, into nodes in the graph and establish connections between them by considering the logical relationships between nodes [3,4,16,19,24,25,43]. However, they still cannot adequately capture the inherent intention behind users’ malicious acts.

MLTracer [24] is a GNN-based method to detect intrusion. It mainly designs some intrusion metapaths, converts logs into heterogeneous graphs, and generates graph embedding vectors to feed them into CNN (convolutional neural network). Then it uses the co-attention mechanism and FCNN (fully convolutional neural network) to classify logs based on login session. But this method needs users to generate dedicated new metapaths to represent new types of intrusion actions.

Log2vec [25] is a typical unsupervised abnormal behavior detection method. In this method, the temporal and logical relationships between log nodes and the similarity between log sequences are modeled in a graph, and the node embedding vectors are then obtained with the node2vec method. Finally, anomaly detection is achieved by clustering.

Bowman et al. [3] establish a graph by analyzing the relationship between the host and the user through logs. They use the continuous-bag-of-words (CBOW) model to obtain the vectorized representation of the graph, and detect the insider threat through the link prediction technology.

LEADS [19] constructs the log sequence as a heterogeneous graph and classifies the graph by combining it with graph embedding representation. It can achieve the real-time detection of abnormal behavior.

Chen et al. [4] obtain the feature vector of each host by constructing the communication logic network between hosts. They use graph embedding and semi-supervised methods to identify the attacked hosts.

Different from these efforts, our proposed Log2Graph provides a more efficient method based on the graph convolution neural network for insider threat detection. Log2Graph comprehensively considers log features, entity relationships, data imbalance, and graph construction complexity to detect anomalies. It uses elaborated rules to construct the graph and builds a dedicated graph convolution neural network, which can better express the chronological and logical relationships between logs and identify the intent of malicious users.

6. Conclusion

To the best of our knowledge, Log2Graph is the first method in the field of insider threat detection to express user intention by using multiple graph convolution layers. We also introduce, for the first time, the anomaly index to measure the degree of anomaly of behaviors between entities, i.e., users and hosts. Combining the anomaly index with other features captures the temporal and logical relationships between operations and reduces the complexity of graph construction. With data augmentation techniques, the number of malicious samples is increased, so that the model can learn more about malicious samples. Through the efficient graph construction method, the chronological and logical relationships are reflected in the graph, and the classification based on GCN achieves the desired effect.

Our evaluation results show that Log2Graph obtains the best performance, with AUC 0.997 (LANL)/0.986 (CERT) and FPR 0.006 (LANL)/0.027 (CERT). On the one hand, our method can detect threats caused by insiders. On the other hand, in a practical environment there is a tradeoff between tracking the operations of actual employees for abnormal behavior detection and anonymizing employees’ information for privacy protection.

Acknowledgment

This work has been partially supported by the National Science Foundation of China No. 62372450.

References

Ahmed

Seraj

Islam

S.M.S.

, The k-means algorithm: A comprehensive survey and performance evaluation, Electronics 9(8) (2020), 1295. doi:10.3390/electronics9081295.

Al-Mhiqani

M.N.

Ahmad

Abidin

Z.Z.

Abdulkareem

K.H.

Mohammed

M.A.

Gupta

Shankar

, A new intelligent multilayer framework for insider threat detection, Computers & Electrical Engineering 97 (2022), 107597. doi:10.1016/j.compeleceng.2021.107597.

Bowman

Laprade

Huang

H.H.

, Detecting lateral movement in enterprise computer networks with unsupervised graph AI, in: The 23rd International Symposium on Research in Attacks, Intrusions and Defenses, 2020, pp. 257–268.

Chen

Yao

Liu

Jiang

, A novel approach for identifying lateral movement attacks based on network embedding, in: ISPA/IUCC/BDCloud/SocialCom/SustainCom , IEEE, 2018, pp. 708–715.

Clausen

Grov

Aspinall

, CBAM: A contextual model for network anomaly detection, Computers 10(6) (2021), 79. doi:10.3390/computers10060079.

Ding

Fei

, An anomaly detection approach based on isolation forest algorithm for streaming data using sliding window, IFAC Proceedings Volumes 46(20) (2013), 12–17. doi:10.3182/20130902-3-CN-3020.00044.

Zheng

Srikumar

, DeepLog: Anomaly detection and diagnosis from system logs through deep learning, in: Proceedings of ACM SIGSAC Conference on Computer and Communications Security , 2017, pp. 1285–1298.

Fey

Lenssen

J.E.

, Fast graph representation learning with PyTorch Geometric, 2019, arXiv preprint arXiv:1903.02428.

Geiger

Liu

Alnegheimish

Cuesta-Infante

Veeramachaneni

, TadGAN: Time series anomaly detection using generative adversarial networks, in: IEEE International Conference on Big Data , 2020, pp. 33–43.

10.

Glasser

Lindauer

, Bridging the gap: A pragmatic approach to generating insider threat data, in: 2013 IEEE Security and Privacy Workshops , 2013, pp. 98–104. doi:10.1109/SPW.2013.37.

11.

Grover

Leskovec

, node2vec: Scalable feature learning for networks, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 855–864. doi:10.1145/2939672.2939754.

12.

Hamilton

W.L.

Ying

Leskovec

, in: Inductive Representation Learning on Large Graphs , 2017, pp. 1025–1035.

13.

Hong

Yin

You

Wang

Cao

Liu

, Graph intelligence enhanced bi-channel insider threat detection, in: International Conference on Network and System Security , Springer, 2022, pp. 86–102. doi:10.1007/978-3-031-23020-2_5.

14.

Hong

Yin

You

Wang

Cao

Liu

Man

, A graph empowered insider threat detection framework based on daily activities, ISA transactions 141 (2023), 84–92. doi:10.1016/j.isatra.2023.06.030.

15.

Huang

Liu

Van Der Maaten

Weinberger

K.Q.

, Densely connected convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , 2017, pp. 4700–4708.

16.

Jiang

Chen

Choo

K.R.

Liu

Huang

Mohapatra

, Anomaly detection with graph convolutional networks for insider threat and fraud detection, in: IEEE Military Communications Conference (MILCOM) , 2019, pp. 109–114.

17.

Karev

McCubbin

Vaulin

, Cyber threat hunting through the use of an isolation forest, in: Proceedings of the 18th International Conference on Computer Systems and Technologies , 2017, pp. 163–170. doi:10.1145/3134302.3134319.

18.

Kent

A.D.

, Cybersecurity data sources for dynamic network research, in: Dynamic Networks in Cybersecurity , Imperial College Press, 2015.

19.

Kiouche

A.E.

Lagraa

Amrouche

Seba

, A simple graph embedding for anomaly detection in a stream of heterogeneous labeled graphs, Pattern Recognition 112 (2021), 107746. doi:10.1016/j.patcog.2020.107746.

20.

Kipf

T.N.

Welling

, Semi-supervised classification with graph convolutional networks, 2016, arXiv preprint arXiv:1609.02907.

21.

Kipf

T.N.

Welling

, Variational graph auto-encoders, 2016, arXiv preprint arXiv:1611.07308.

22.

D.C.

Zincir-Heywood

, Anomaly detection for insider threats using unsupervised ensembles, IEEE Transactions on Network and Service Management 18(2) (2021), 1152–1164. doi:10.1109/TNSM.2021.3071928.

23.

Lin

Zhang

Lou

Zhang

Chen

, Log clustering based problem identification for online service systems, in: The 38th International Conference on Software Engineering Companion , 2016, pp. 102–111.

24. Liu, Wen, Liang, Jiang, Meng, MLTracer: Malicious logins detection system via graph neural network, in: 2020 IEEE 19th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), IEEE, 2020, pp. 715–726. doi:10.1109/TrustCom50675.2020.00099.

25. Liu, Wen, Zhang, Jiang, Xing, Meng, Log2vec: A heterogeneous graph embedding based approach for detecting cyber threats within enterprise, in: Proceedings of ACM SIGSAC Conference on Computer and Communications Security, 2019, pp. 1777–1794.

26. Liu F.T., Ting K.M., Zhou, Isolation-based anomaly detection, ACM Transactions on Knowledge Discovery from Data (TKDD) 6(1) (2012), 1–39. doi:10.1145/2133360.2133363.

27. Liu, Qin, Guan, Jiang, Wang, An integrated method for anomaly detection from massive system logs, IEEE Access 6 (2018), 30602–30611. doi:10.1109/ACCESS.2018.2843336.

28. Wong R.K., Insider threat detection with long short-term memory, in: Proceedings of the Australasian Computer Science Week Multiconference, 2019, pp. 1–10.

29. Ghojogh, Samad M.N., Zheng, Crowley, Isolation Mondrian forest for batch and online anomaly detection, in: 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC), IEEE, 2020, pp. 3051–3058. doi:10.1109/SMC42975.2020.9283073.

30. Manoharan, Hong, Yin, Zhang, Bilateral insider threat detection: Harnessing standalone and sequential activities with recurrent neural networks, in: International Conference on Web Information Systems Engineering, Springer, 2023, pp. 179–188.

31. Morris, Ritzert, Fey, Hamilton W.L., Lenssen J.E., Rattan, Grohe, Weisfeiler and Leman go neural: Higher-order graph neural networks, in: Proceedings of the AAAI Conference, 2019, pp. 4602–4609.

32. Nasir, Afzal, Latif, Iqbal, Behavioral based insider threat detection using deep learning, IEEE Access 9 (2021), 143266–143274. doi:10.1109/ACCESS.2021.3118297.

33. Perozzi, Al-Rfou, Skiena, Deepwalk: Online learning of social representations, in: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014, pp. 701–710. doi:10.1145/2623330.2623732.

34. Shen, Mariconti, Vervier P.A., Stringhini, Tiresias: Predicting security events through deep learning, in: Proceedings of ACM SIGSAC Conference on Computer and Communications Security, 2018, pp. 592–605.

35. Singh, Mehtre, Sangeetha, User behavior based insider threat detection using a multi fuzzy classifier, Multimedia Tools and Applications (2022), 1–31.

36. Zhao, Niu, Liu, Sun, Pei, Robust anomaly detection for multivariate time series through stochastic recurrent neural network, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 2828–2837. doi:10.1145/3292500.3330672.

37. Tao, Peng, Zhao, Wang, A parallel algorithm for network traffic anomaly detection based on Isolation Forest, International Journal of Distributed Sensor Networks 14(11) (2018). doi:10.1177/1550147718814471.

38. Wang, Zhu, Wrongdoing monitor: A graph-based behavioral anomaly detection in cyber security, IEEE Transactions on Information Forensics and Security 17 (2022), 2703–2718. doi:10.1109/TIFS.2022.3191493.

39. Wang, Jiang, Lan, Intrusion detection using few-shot learning based on triplet graph convolutional network, Journal of Web Engineering (2021), 1527–1552.

40. Xia, Yin, LogGAN: A sequence-based generative adversarial network for anomaly detection based on system logs, in: International Conference on Science of Cyber Security, Springer, 2019, pp. 61–76. doi:10.1007/978-3-030-34637-9_5.

41. Cui, Hong, Zhang, Yang, Liu, Graph inference learning for semi-supervised classification, 2020.

42. Tian, Sonobe, Kawarabayashi K.-I., Jegelka, Representation learning on graphs with jumping knowledge networks, in: International Conference on Machine Learning, PMLR, 2018, pp. 5453–5462.

43. Fang, Liu, Xiao, Wen, Meng, Depcomm: Graph summarization on system audit logs for attack investigation, in: 2022 IEEE Symposium on Security and Privacy (SP), IEEE, 2022, pp. 540–557. doi:10.1109/SP46214.2022.9833632.

44. Yang, Liu, Shi, Extract the knowledge of graph neural networks and go beyond it: An effective knowledge distillation framework, in: Proceedings of the Web Conference, 2021, pp. 1227–1237.

45. Yuan, Cao, Shang, Liu, Tan, Fang, Insider threat detection with deep neural network, in: International Conference on Computational Science, Springer, 2018, pp. 43–54.

46. Yuan, Zheng, Insider threat detection via hierarchical neural temporal point processes, in: IEEE International Conference on Big Data, 2019, pp. 1343–1350.

47. Zeng, Zhou, Srivastava, Kannan, Prasanna, Graphsaint: Graph sampling based inductive learning method, 2019.

48. Zhang, Lin, Qiao, Zhang, Dang, Xie, Yang, Cheng et al., Robust log-based anomaly detection on unstable log data, in: Proceedings of the 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2019, pp. 807–817.

49. Zhu, Huang, Sun, Liu, RAP-Net: A resource access pattern network for insider threat detection, in: 2022 International Joint Conference on Neural Networks (IJCNN), IEEE, 2022, pp. 1–8.


Table 6. Performance of 9 tests.

Test    AUC       FPR
1       0.9907    0.0185
2       0.9906    0.0138
3       0.9872    0.0157
4       0.9916    0.0118
5       0.9840    0.0121
6       0.9912    0.0176
7       0.9933    0.0134
8       0.9919    0.0162
9       0.9955    0.0090
mean    0.9907    0.0142
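
The mean row of Table 6 can be checked directly from the nine per-test values. The short sketch below assumes the mean is a plain arithmetic average rounded to four decimal places; the variable names are illustrative and do not come from the paper.

```python
from statistics import fmean

# Per-test values copied from Table 6 (tests 1 to 9).
auc = [0.9907, 0.9906, 0.9872, 0.9916, 0.9840, 0.9912, 0.9933, 0.9919, 0.9955]
fpr = [0.0185, 0.0138, 0.0157, 0.0118, 0.0121, 0.0176, 0.0134, 0.0162, 0.0090]

# Assumption: the reported mean is the unweighted average, rounded to 4 decimals.
mean_auc = round(fmean(auc), 4)
mean_fpr = round(fmean(fpr), 4)

print(mean_auc, mean_fpr)  # prints: 0.9907 0.0142
```

Under that assumption, both recomputed values agree with the reported means of 0.9907 (AUC) and 0.0142 (FPR).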