Abstract
With the advancement of network security equipment, insider threats are gradually replacing external threats and becoming a critical contributing factor in cluster security threats. When detecting and combating insider threats, existing methods often concentrate on users’ behavior and analyze logs recording their operations in an information system. Traditional sequence-based methods consider temporal relationships between user actions, but cannot represent the complex logical relationships between various entities and different behaviors well. Current machine learning-based approaches, such as graph-based methods, can establish connections among log entries but are limited by their complexity and by their ability to identify malicious behavior that reflects a user’s inherent intention.
In this paper, we propose Log2Graph, a novel insider threat detection method based on graph convolution neural network. To achieve efficient anomaly detection, Log2Graph first retrieves logs and corresponding features from log files through feature extraction. Specifically, we use an auxiliary feature, the anomaly index, to describe the relationship between entities, such as users and hosts, instead of establishing complex connections between them. Second, these logs and features are augmented through a combination of oversampling and downsampling, to prepare for the next-stage supervised learning process. Third, we use three carefully designed rules to construct the graph of each user by connecting the logs according to chronological and logical relationships. Finally, a purpose-built graph convolution neural network is used to detect insider threats. Our validation and extensive evaluation results confirm that Log2Graph can greatly improve the performance of insider threat detection compared to existing state-of-the-art methods.
Keywords
Introduction
Due to the rapid development and adoption of information technologies, enterprises and organizations heavily rely on networks for daily operations. Network border protection, such as using a firewall to prevent attacks from outside, provides an efficient approach for many enterprises and organizations. However, insider threats have gradually become a critical factor threatening the safe operation and services of these systems. Insider attackers are generally employees, contractors, or business partners of the organization. They usually have authorized access to the organization’s systems, network, and data. An insider threat is behavior in which insiders use legally obtained authorization to negatively affect the confidentiality, integrity, and availability of information systems. Specifically, it can pose a high risk with serious consequences, such as significant financial ramifications or vital data leakage. With the establishment and improvement of information security mechanisms in enterprises, it is increasingly difficult for external attackers to enter target systems. As a result, insider threats have been increasing and have become a major challenge to enterprise system security.
Different from external attacks, the main challenges for insider threat detection lie in the following three aspects:
Insider threats are caused by insider users whose malicious behavior cannot be easily detected by peripheral security devices, such as firewalls.
Most insider threats are composed of a series of malicious operations. As insiders are also employees within an organization, these malicious operations are scattered and hidden in their normal work operations.
Usually, there are not enough insider threat samples to support the training process of a supervised learning model.
A common approach for insider threat detection is to use log entries for analysis. The log-entry-based method first extracts a large number of features (e.g., time, protocol, and login type) from logs. Then it analyzes the characteristics by using certain anomaly detection algorithms, such as clustering [27] or isolation forest [6,17,26,29,37] algorithms, and outputs anomaly logs as possible malicious behavior. This method can differentiate logs according to the distance or density calculation, but cannot obtain any relationship between logs during feature detection. To achieve more effective results, it often relies on rich, complex feature extraction, which largely limits the effectiveness of threat detection.
With the development of machine learning models and algorithms, the sequence-based method has recently been proposed for anomaly detection. This method extracts normal or malicious patterns by considering temporal dependencies between logs. It links different types of logs and features in chronological order and matches new log sequences to corresponding patterns to determine whether they are malicious or not. It often uses long short-term memory (LSTM) or recurrent neural network (RNN) to model users’ normal behavior and predicts deviations as anomalies [5,7,28,34,36,45,46]. However, the sequence-based method cannot capture complex user behaviors, nor can it process them in parallel.
To further exploit interactive relationships among user behavior, more work has been developed on the graph-based method to solve the problem of insider threat detection through graph neural network (GNN) [3,4,16,19,25]. It mainly expresses the logical relationship between logs in the form of graph construction and detects malicious behavior with node classification, subgraph classification, link prediction, etc. Depending on the training process, the graph-based method can be divided into unsupervised and supervised approaches. The unsupervised approach does not require any data annotation during training, and detects malicious behavior by comparing the similarity between user behavior. However, it produces a high false positive rate because it cannot distinguish between outliers and malicious behavior. Existing supervised approaches use labeled data to detect anomalies, but do not fully consider how the graph structure influences the performance of graph convolution neural networks, which still limits their use in practice.
In this paper, we introduce Log2Graph, a novel approach based on graph convolution neural network (GCN) [20] for insider threat detection. The overview of Log2Graph is shown in Fig. 1. It first analyzes original log files to obtain log entries and extract log features of all users. Then Log2Graph performs data augmentation to balance the dataset. It randomly samples all logs of users with malicious behavior in the training set to generate oversampled data. Accordingly, it downsamples the oversampled data and original data respectively to reduce normal logs. After that, Log2Graph uses the augmented data as a training set. Graph construction treats each log as a node and connects log nodes in time series and logical relationships to form graphs. At last, the graphs in the training set are fed into the graph neural network for training, and a node classifier is obtained for classification of the logs in the test set. The classifier outputs a label for each node to indicate whether it poses an insider threat, providing prediction for malicious behaviors.

The overview of Log2Graph.
The contribution of this study includes:
We propose a Log2Graph detection method by capturing users’ intentions and detecting their behavior to determine whether they belong to insider threats. The Log2Graph method is composed of four components: feature extraction, data augmentation, graph construction, and GCN classifier. It achieved the best performance with AUC 0.997 (LANL dataset)/0.986 (CERT dataset), and the FPR 0.006 (LANL dataset)/0.027 (CERT dataset).
We propose a new feature to measure the anomaly in login behavior, namely anomaly index. It provides an auxiliary attribute together with other log features, which significantly reduces the complex connections for different users in the process of graph construction and largely improves the efficiency.
We propose a data augmentation method based on sampling to provide more information during the training process, so that the supervised learning method can achieve better results for insider threat detection when only a small number of malicious samples are available.
We propose a method to convert each user’s behavior logs into a graph based on chronological and logical relationships, enabling the GCN to capture long-distance information on time series and achieve efficient anomaly detection. This makes GCN-based insider threat detection more efficient.
For insider threat detection, the easiest data that can be obtained is logs related to user behavior during daily work. Figure 2(a) depicts the logs we aim to detect. In general, there are three types of logs, including the user authentication log A, the process log P, and the network flow log F. Each type of log has a distinct format to describe user behavior. The authentication log contains the time, user, source computer, and destination computer, as well as many corresponding features, such as the type of authorization protocol. The process log records when the user starts or stops a process. The network flow log records information about the recipient and sender. Additionally, it also has features such as transport protocols, etc. From the user behavior logs, it can be seen that they are simpler than system logs, but contain more features in each log entry.

Sequence- and graph-based approaches for insider threat detection. (a) A log file recording a user’s operations; (b) sequence-based approach: processes data step by step; (c) graph-based approach: in the illustrated GCN model, each node gains information from its 1-hop neighbors in a GCN layer, and can further retrieve rich information from k-hop neighbors with k layers.
Figure 2(b) and Fig. 2(c) illustrate the sequence-based method and graph-based method for insider threat detection. The logs in Fig. 2(a) show a typical example of a real scenario, where an employee logs in to a computer and runs a process that generates two network flows for data transmission. As shown in Fig. 2(b), the sequence-based method can identify the simple temporal relationship but has difficulty capturing the complex logical relationships between different log entries. For instance, it ignores the relationship between processes and flows, in which the second flow log F is actually produced by the process log P and should be connected with it directly. Additionally, the sequence-based method can only process one data item at each step, and cannot run the analysis in parallel.
Instead, graph-based methods can tackle this challenge by creating edges and passing messages/information between operations directly. Figure 2(c) illustrates an example of using a graph to construct the relationship between operations, where the two flow logs can connect with the process log directly. Moreover, it can process all data in parallel. In graph-based methods, if the behavior of employees has some logical connection, the relationship between users’ acts can be further measured with the graph structure. But this leads to a new problem: the computational complexity is very high. For instance, Log2vec [25] establishes edges through a total of 10 rules to connect log entries into a heterogeneous graph, uses node2vec [11] to get the node embedding vectors, and then clusters them. Because node2vec is inefficient, Log2vec needs 9 servers to do the job.
Many works have used unsupervised methods to detect unknown types of insider threats, such as zero-day attacks, but those methods only work well in some special scenarios. The AUC (area under the curve) of Log2vec is about 0.91; it will recognize the normal behavior of about
When we already know the types of threats, supervised methods are more efficient and accurate than unsupervised methods. Compared with the expert system based on domain knowledge, the artificial neural network does not require security personnel to have the complex background knowledge to establish threat detection rules and write corresponding detection programs. It only needs the security personnel to provide enough malicious samples to train the neural network. Thus, in a real insider threat detection system, both unsupervised methods and supervised methods are needed for efficiency. Nevertheless, using the supervised learning method to deal with malicious behavior detection is also difficult. First, in most scenarios, we do not have enough samples of malicious behavior to train the supervised learning methods. Compared with normal actions, our core concern is to fetch samples with malicious behavior, which can be relatively clearly defined. However, in real-world scenarios, malicious behavior samples are very sparse. For instance, there are
Overall, there is still room for improvement in the effectiveness and performance of existing methods. First, in the case of long sequences and large amounts of noise, the ability to capture complex logical relationships between different operations can be improved. Specifically, while retaining important information, we can try to eliminate the impact of noise. Second, appropriate data enhancement methods can be conducted for malicious behavior data to reduce the negative impact of data sparsity. Finally, reducing the complexity of methods can be feasible. Our proposed Log2Graph will address the issues in a supervised learning way.
Based on the aforementioned discussion, we design a new method for insider threat detection. The architecture of Log2Graph is shown in Fig. 3. It mainly contains four stages: feature extraction, data preprocessing (oversampling and downsampling), graph construction, and graph neural network training. After that, a node classifier is obtained for prediction. The details of each stage are described in the following sections.

The architecture of Log2Graph.
We first summarize scenarios leading to insider threats in enterprises and governments, in which the attack intentions can be roughly divided into four categories:
Feature extraction
In Log2Graph, we transform a log entry into a node in the graph. A log entry is defined as a tuple involving five meta-attributes
Among these attributes, we use U, H, T, and O to find the relationships between log nodes, for instance, two nodes generated by the same user, on the same host, or in chronological order. We further use F to generate the embedding vectors of nodes. Generally, features in F can be divided into two types, label-type and numerical value-type. A label-type feature is transformed into a one-hot vector. Then we concatenate the label-type one-hot vectors and the numerical value-type features to form the feature embedding vector of each node. The details of the features we used in the experiment can be found in Section 4.1.
As well known in the graph theory, a graph contains a set V of nodes together with a set E of edges, where each edge connects one node and another. Thus, a graph G can be expressed with
So we need to determine the content transmitted on these channels, namely the features of each log. We have no fixed requirements for the content of features. If an element is conducive to expressing the user’s behavior intention and can be expressed quantitatively, it can be used as a feature. For instance, in the LANL dataset [18], the format of the logs is as follows:
From the log entries, we extract “source user” or “user” as U, “destination computer” as H, “time” as T, and “auth” as O. Other elements like “authentication type” are label-type features, which can be converted to one-hot vectors as one element of F. As “packet count” is a numerical feature, it can be directly regarded as one element of F. All the features we use are described in Section 4.1. We convert the label-type features into one-hot vectors and combine them with the numerical-type features to form a log feature vector, with null values filled with 0.
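The construction of a log feature vector described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the dictionary keys `auth_type` and `packet_count` and the helper names `build_vocab` and `embed` are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Union

Value = Optional[Union[str, float]]
Log = Dict[str, Value]


def build_vocab(logs: Sequence[Log], label_keys: Sequence[str]) -> Dict[str, List[str]]:
    """Collect the sorted distinct values of every label-type feature."""
    return {
        key: sorted({log[key] for log in logs if log.get(key) is not None})
        for key in label_keys
    }


def embed(log: Log, vocab: Dict[str, List[str]], numeric_keys: Sequence[str]) -> List[float]:
    """Concatenate one-hot label features and numeric features; nulls become 0."""
    vector: List[float] = []
    for key, values in vocab.items():
        one_hot = [0.0] * len(values)
        value = log.get(key)
        if value in values:  # a missing (None) value leaves the block all zeros
            one_hot[values.index(value)] = 1.0
        vector.extend(one_hot)
    for key in numeric_keys:
        value = log.get(key)
        vector.append(float(value) if value is not None else 0.0)
    return vector


# Hypothetical log entries with one label-type and one numerical feature.
logs = [
    {"auth_type": "Kerberos", "packet_count": 12},
    {"auth_type": "NTLM", "packet_count": None},
]
vocab = build_vocab(logs, ["auth_type"])
vectors = [embed(log, vocab, ["packet_count"]) for log in logs]
```

The vector length is the total number of distinct label values plus the number of numerical features, so label-type features with many distinct values lead to long, sparse vectors.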
First, we consider counting the hosts commonly used by each user and the corresponding times, and calculate the frequency of users using each host:
To solve this problem, we use another way to calculate the anomaly index. We calculate it by dividing the number of times a user uses a computer by the total number of times the computer is used (
In Eq. (2), the index value will be close to 0 when the user logs on to an abnormal host. This simple formula solves the above problem. When a new computer is used, there is only one new user in the user list of the computer, which is equivalent to the computer being monopolized by the new user. Thus, the index value will be close to 1. Next, we consider another very common situation in the enterprise. There are always some public computers, such as those specially used to connect printers or scanners. Employees use these public computers when they need to do the relevant operations. For these computers, the total number of uses will be very high. However, each user will only use them a few times when needed. This situation is normal, but in this case, the anomaly index of everyone using the public computer will be close to 0, which is misleading.
To avoid being affected by this situation, we multiply the
In the above formula, the value of normal behavior approaches 1, but the value of abnormal behavior approaches 0. To make the result more intuitive and to keep the value of abnormal behavior from losing accuracy (as its absolute value is too small), we apply a negative logarithm to the above formula to obtain the complete calculation of the anomaly index, and the final result is Eq. (4):
Algorithm 1 describes the process of calculating the anomaly index for the whole organization. In the algorithm, “logs” is the logon logs of the organization, and “dict” is a nested dictionary data that stores anomaly index. The dictionary is a key-value type data structure, in which we can get each log’s anomaly index by “anomaly index = dict[host][user]”. We first group all logs of logon by hosts (line 2 in Algorithm 1), where “h” represents the hostname and “
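The per-host ratio described for Eq. (2) and the negative logarithm applied at the end can be sketched in Python as follows. This is a minimal illustration and not Algorithm 1 itself: the function names `host_share` and `neg_log_index` and the `(user, host)` input format are assumptions, and the correction for shared public computers is deliberately left out of the sketch.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def host_share(logons: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """Return table[host][user] = uses of host by user / total uses of host."""
    per_host: Dict[str, Counter] = defaultdict(Counter)
    for user, host in logons:  # group all logon logs by host
        per_host[host][user] += 1
    table: Dict[str, Dict[str, float]] = {}
    for host, counts in per_host.items():
        total = sum(counts.values())
        table[host] = {user: n / total for user, n in counts.items()}
    return table


def neg_log_index(value: float) -> float:
    """Negative logarithm: values near 1 map to about 0, values near 0 grow large."""
    return -math.log(value)


# Hypothetical logons: alice owns pc1, mallory touches it once, bob uses a new pc2.
logons = [("alice", "pc1")] * 9 + [("mallory", "pc1")] + [("bob", "pc2")]
table = host_share(logons)
usual = neg_log_index(table["pc1"]["alice"])
unusual = neg_log_index(table["pc1"]["mallory"])
new_host = neg_log_index(table["pc2"]["bob"])
```

The nested dictionary mirrors the lookup `dict[host][user]` used in the text. With these inputs the rare logon by `mallory` receives a larger value than the habitual logons by `alice`, and the single user of a new computer receives a value of zero.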
As is widely acknowledged, unsupervised methods are suitable for finding unknown types of insider threats, such as zero-day attacks, but do not work well in analyzing those that are already known. Instead, supervised methods are more efficient and accurate in detecting known types of insider threats. However, utilizing the original log data to train the insider threat detection model presents two challenges. One is the lack of enough malicious samples, and the other is data imbalance. To solve these problems, we need both oversampling, to make sure the GNN model can learn more detail about malicious data, and downsampling, to eliminate the effect of data imbalance.
Oversampling
We use oversampling to increase the amount of malicious data. The challenges of oversampling are two-fold. One is how to generate oversampled data that resembles other real employee behaviors in the original dataset. The other is how to avoid the oversampled data being identical to the original data. There are two kinds of oversampling methods. One is to generate new data according to the probability distribution of existing data. The other is to generate new data by sampling a certain proportion of the original data. The user’s behavior and intention are reflected in logs as a series of behaviors in a certain order, and the characteristics contained in them are relatively complex. So it is very difficult to realize oversampling by generating new data. To address this issue, we choose to generate copies by sampling. The oversampling process is divided into the following steps:
Sample all logs of users who have malicious behavior according to a certain probability to obtain a complete copy of user behavior. For instance, if a user has 10 malicious logs and 90 normal logs, then at a sampling rate of 0.8 we randomly sample 80 logs as oversampled samples.
Repeat the first step k times to obtain k different complete copies of user behavior. The performance of the model varies with k (see Section 4.5).
Add random noise to the numerical features in the oversampled samples. For instance, we cannot add noise to “authentication type” because it is a label-type feature, and adding noise would change its meaning. But we can add noise to “packet count”, which is a numerical feature, so adding noise only changes its value without changing its meaning.
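The three oversampling steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions: sampling is done without replacement, the noise is a multiplicative uniform perturbation of at most 5%, and the function name `oversample` and the log keys are hypothetical. None of these choices is specified in the text.

```python
import random
from typing import Dict, List, Sequence


def oversample(user_logs: Sequence[Dict], rate: float, k: int,
               numeric_keys: Sequence[str], noise: float = 0.05,
               seed: int = 0) -> List[List[Dict]]:
    """Build k sampled, noise-perturbed copies of one malicious user's logs."""
    rng = random.Random(seed)
    size = round(len(user_logs) * rate)
    copies = []
    for _ in range(k):  # step 2: repeat the sampling k times
        # Step 1: sample a fraction of the logs, keeping chronological order.
        picked = sorted(rng.sample(range(len(user_logs)), size))
        copy = []
        for index in picked:
            log = dict(user_logs[index])  # shallow copy; the original is untouched
            for key in numeric_keys:  # step 3: noise on numerical features only
                log[key] = log[key] * (1.0 + rng.uniform(-noise, noise))
            copy.append(log)
        copies.append(copy)
    return copies


# Hypothetical user with 100 logs, as in the example in the text.
user_logs = [
    {"time": t, "auth_type": "Kerberos", "packet_count": float(t + 1)}
    for t in range(100)
]
copies = oversample(user_logs, rate=0.8, k=3, numeric_keys=["packet_count"])
```

Each copy contains 80 of the 100 logs, label-type features are left unchanged, and only `packet_count` is perturbed, so the copies differ from the original data and from each other without altering the meaning of any log.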
Downsampling
Data imbalance means that the number of samples in some categories of the data set is very different from that in other categories. In insider threat detection scenarios, most employees or users are benign, so normal data far outnumbers threat data. In the case of data imbalance, the mechanism of supervised learning may lead the neural network into the trap of assigning all data to the most common category. During supervised learning, the model gradually learns information by reducing its loss. However, if all the data is classified into the most common category, a high accuracy rate and a low loss can be obtained while no malicious behavior is detected. Such a model is unable to detect threats. To avoid this situation, imbalanced data needs to be processed before it is used.
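The trap described above can be made concrete with a small worked example in Python. The class sizes below (9990 benign and 10 malicious logs) are invented purely for illustration and do not come from either dataset.

```python
from typing import Sequence, Tuple


def accuracy_and_recall(labels: Sequence[int], predictions: Sequence[int]) -> Tuple[float, float]:
    """Return (accuracy, recall on the malicious class, label 1)."""
    pairs = list(zip(labels, predictions))
    correct = sum(1 for y, p in pairs if y == p)
    true_positives = sum(1 for y, p in pairs if y == 1 and p == 1)
    positives = sum(1 for y in labels if y == 1)
    recall = true_positives / positives if positives else 0.0
    return correct / len(labels), recall


# Hypothetical imbalanced test set: 0 = benign, 1 = malicious.
labels = [0] * 9990 + [1] * 10
all_benign = [0] * len(labels)  # a degenerate classifier that never raises an alert
accuracy, recall = accuracy_and_recall(labels, all_benign)
```

The degenerate classifier is correct on 9990 of 10000 logs, yet it detects none of the malicious ones, which is why accuracy alone is not a meaningful training signal on such data.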
There are three common methods to solve this problem. The first is to modify the loss function so that it applies different weights to different categories. The advantage of this method is that the processing is usually very simple: the corresponding weight is added when calculating the loss function according to the proportion of data. However, this method also has some disadvantages. For instance, it cannot reduce the size of the large categories, so excessive amounts of data are used during model training, which costs much time. The second method is to downsample the data set. Specifically, different categories of data are sampled according to corresponding proportions so that the data volume of the categories reaches a similar level. The disadvantage is that it only uses a small part of the original data and therefore loses some information. The third method is oversampling. Oversampling can artificially increase the amount of data in minority categories through certain methods. The advantage is that it can ensure sufficient data for training, but the corresponding disadvantage is the complexity of constructing data. As the number of benign samples is too large compared to malicious samples and the dimension of log vectors is small (see Section 4.1), generating too many samples will be redundant (see Section 4.5). Thus, we use oversampling only to generate a certain number of malicious samples for anomaly detection.
Based on the above analysis, we utilize a downsampling method to solve the problem of data imbalance. In daily work, the work content of each employee is relatively fixed, and there are no drastic changes. As a result, the logs generated by employees’ normal work are highly repetitive. This property can be used to downsample log data and reduce the proportion of benign samples. With downsampling for data imbalance, we can reduce the amount of data the GNN needs to process during training and improve the speed.
There are two types of features in each log, label type features and numerical value type features. The label type features mainly reflect the attributes of operations, such as whether an email is sent to the outside world and whether the sent email has attachments. Numerical value type features are often used to describe the detail of operations under current attributes, such as the size of email attachments. We use label type features to measure the degree of rareness of logs. In the process of downsampling, we will reserve rare logs with a higher probability. The purpose of this design is to ensure the diversity of training data, so that the deep learning model can learn more information about the relationship between features and threats. The specific process of downsampling is described in Algorithm 2.
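One possible reading of "reserve rare logs with a higher probability" is sketched below in Python. It is not Algorithm 2: the keep probability `min(1, per_signature / count)`, where `count` is the number of benign logs sharing the same combination of label-type values, is an assumption, as are the function name `downsample` and the log keys.

```python
import random
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def downsample(benign_logs: Sequence[Dict], label_keys: Sequence[str],
               per_signature: int = 2, seed: int = 0) -> List[Dict]:
    """Keep rare label-feature combinations with higher probability."""
    rng = random.Random(seed)

    def signature(log: Dict) -> Tuple:
        # The tuple of label-type values measures how rare a log is.
        return tuple(log.get(key) for key in label_keys)

    counts = Counter(signature(log) for log in benign_logs)
    kept = []
    for log in benign_logs:
        # Common signatures get a small keep probability, rare ones up to 1.
        keep_probability = min(1.0, per_signature / counts[signature(log)])
        if rng.random() < keep_probability:
            kept.append(log)
    return kept


# Hypothetical email logs: 1000 routine internal mails and 2 rare external ones.
routine = [{"external": False, "attachment": False, "size": 10.0} for _ in range(1000)]
rare = [{"external": True, "attachment": True, "size": 900.0} for _ in range(2)]
kept = downsample(routine + rare, label_keys=["external", "attachment"])
```

Under this rule every signature contributes roughly `per_signature` logs on average, so highly repetitive routine logs are thinned out sharply while rare combinations are always retained, which preserves the diversity of the training data.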
Graph construction
Through feature extraction and data augmentation, we get many log entries in the form of

Graph construction for three flow nodes.
With the analysis of malicious behaviors, we can observe that users will exhibit malicious acts in several other cases, except for unintentional incidents. Therefore, our main goal is how to express the user’s intention through information transmission between nodes. Specifically, we propose three rules for constructing edges in graph as shown in Fig. 4.
All logs for the same user are connected chronologically by undirected edges.
It reflects the behavior pattern of an employee, that is, what the employee intends to do. In daily work, malicious acts are usually mixed with normal actions, which leads to a long distance between the logs of malicious behaviors if they are connected only in chronological order. Sometimes, for a certain purpose, the same type of operation may be performed many times. To capture the intent of malicious users, we need to treat such repeated operations of the same type specially, so we introduce the following two rules:
Each log is connected to the next log of other different operation types for the same user.
Each log is connected to the previous log of other different operation types for the same user.
By jumping over the logs that have the same operation type, we enable the model to obtain useful information over longer distances during training. Since the two nodes at the ends of each edge belong to the same user, there is no connection between users, and messages cannot be delivered between them. Therefore, to enable the model to capture the relationship between different users, we design the anomaly index and use it as an auxiliary feature in the GNN.
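The three edge rules can be sketched in Python for the logs of a single user. This is a minimal illustration rather than Algorithm 3: the function name `build_edges` and the representation of a user's logs as a time-ordered list of operation types are assumptions.

```python
from typing import List, Optional, Sequence, Tuple


def build_edges(op_types: Sequence[str]) -> List[Tuple[int, int]]:
    """Undirected edges for ONE user's logs, given in chronological order."""
    n = len(op_types)
    edges = set()
    # Rule 1: connect consecutive logs chronologically.
    for i in range(n - 1):
        edges.add((i, i + 1))
    # Rule 2: index of the next log whose operation type differs.
    next_diff: List[Optional[int]] = [None] * n
    for i in range(n - 2, -1, -1):
        next_diff[i] = i + 1 if op_types[i + 1] != op_types[i] else next_diff[i + 1]
    # Rule 3: index of the previous log whose operation type differs.
    prev_diff: List[Optional[int]] = [None] * n
    for i in range(1, n):
        prev_diff[i] = i - 1 if op_types[i - 1] != op_types[i] else prev_diff[i - 1]
    for i in range(n):
        for j in (next_diff[i], prev_diff[i]):
            if j is not None:
                edges.add((min(i, j), max(i, j)))  # store each undirected edge once
    return sorted(edges)


# Authentication (A), process (P), and two flow logs (F), as in the example of Fig. 2.
edges = build_edges(["A", "P", "F", "F"])
```

For this sequence the result contains the edge between node 1 and node 3 in addition to the chronological chain, so the second flow log is linked directly to the process log that precedes the run of flow logs.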
Algorithm 3 is used to construct the graph. In the algorithm, “logs” is a list of logs, “node embeddings” is the list of all embedding vectors of log nodes (generated from F), and the edges in the graph are stored in the list “edges”. Each element in “edges” is a two meta-attributes
In a graph convolution network, data to be gathered for one output node comes from its neighbors in the previous layer. Each of these neighbors, in turn, gathers its output from the previous layer, and so on. Because the intention represented by each log is related to the information around it, we need enough convolution layers to obtain the neighborhood information of each log node.
But the deeper we backtrack, the more multi-hop neighbors support the computation of the root. The number of support nodes (and thus the training time) potentially grows exponentially with the GCN depth. Therefore, to resolve this contradiction, we use the random walk sampler proposed by GraphSaint [47] to generate mini-batches and limit the scale of the graph. Another problem with too many convolution layers is over-smoothing. Over-smoothing refers to the phenomenon in which the outputs of different nodes approach the same value. One reason for over-smoothing is that a GCN that is too deep causes each node to get information from too many nodes, and many of the nodes that provide information for different nodes are the same. To avoid over-smoothing caused by convolution layers, we need to keep the output of each convolution layer.
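The idea of limiting the graph scale with random walks can be sketched in plain Python as follows. This is a simplified illustration of random-walk subgraph sampling, not the GraphSaint implementation: it only draws the node set and the induced edges, omits the normalization coefficients GraphSaint uses during training, and the function name `random_walk_subgraph` and its parameters are assumptions.

```python
import random
from typing import Dict, List, Sequence, Tuple


def random_walk_subgraph(edges: Sequence[Tuple[int, int]], num_nodes: int,
                         num_roots: int, walk_length: int,
                         seed: int = 0) -> Tuple[List[int], List[Tuple[int, int]]]:
    """Sample a mini-batch subgraph from random walks started at random roots."""
    rng = random.Random(seed)
    neighbors: Dict[int, List[int]] = {i: [] for i in range(num_nodes)}
    for u, v in edges:  # undirected adjacency lists
        neighbors[u].append(v)
        neighbors[v].append(u)
    visited = set()
    for _ in range(num_roots):
        node = rng.randrange(num_nodes)  # pick a root uniformly at random
        visited.add(node)
        for _step in range(walk_length):
            if not neighbors[node]:
                break  # isolated node: the walk cannot continue
            node = rng.choice(neighbors[node])
            visited.add(node)
    # Keep only the edges whose two endpoints were both visited.
    induced = [(u, v) for u, v in edges if u in visited and v in visited]
    return sorted(visited), induced


# Hypothetical chain graph with 100 log nodes.
chain = [(i, i + 1) for i in range(99)]
nodes, sub_edges = random_walk_subgraph(chain, num_nodes=100, num_roots=3, walk_length=4)
```

The size of each mini-batch is bounded by `num_roots * (walk_length + 1)` nodes regardless of how many convolution layers are stacked, which is what keeps the training cost from growing with the number of multi-hop neighbors.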
After that, we use a multi-layer perceptron to classify the obtained information. Based on the above analysis, the input of the first fully connected layer after the convolution layer is made up of the output of the previous convolution layers.
Therefore, as shown in Fig. 5, the whole graph neural network consists of n convolutional layers, a concatenate layer, and m fully connected layers. Different numbers of n and m can be used to construct the graph neural network. In our test, we use 6 convolutional layers and 7 fully connected layers. Each convolution layer consists of a GraphConv layer [31], a dropout, and an activation function. Each fully connected layer consists of a dense layer, a dropout, and an activation function. Each GraphConv layer calculates the new features of node i according to the following formula:

Graph neural network in Log2Graph.
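The way the outputs of all convolution layers are kept and concatenated before classification can be sketched in dependency-free Python. This is only a structural illustration under stated assumptions: each layer uses a sum-aggregation update of the form x'_i = W1 x_i + W2 Σ_j x_j (the unweighted form documented for the GraphConv operator in PyTorch Geometric) followed by ReLU, the weights are random, and dropout and the fully connected layers are omitted. All function names are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple

Matrix = List[List[float]]


def random_weight(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Small random weight matrix, standing in for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(weight: Matrix, vec: Sequence[float]) -> List[float]:
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def graph_conv(features: Matrix, neighbors: Dict[int, List[int]],
               w_self: Matrix, w_neigh: Matrix) -> Matrix:
    """One layer: x'_i = ReLU(W1 x_i + W2 * sum of neighbor features)."""
    out = []
    for i, x in enumerate(features):
        agg = [sum(features[j][d] for j in neighbors[i]) for d in range(len(x))]
        h = [a + b for a, b in zip(matvec(w_self, x), matvec(w_neigh, agg))]
        out.append([max(0.0, v) for v in h])
    return out


def forward(features: Matrix, neighbors: Dict[int, List[int]],
            layers: Sequence[Tuple[Matrix, Matrix]]) -> Matrix:
    """Run all conv layers and concatenate every layer's output per node."""
    per_layer = []
    h = features
    for w_self, w_neigh in layers:
        h = graph_conv(h, neighbors, w_self, w_neigh)
        per_layer.append(h)
    return [sum((layer[i] for layer in per_layer), []) for i in range(len(features))]


rng = random.Random(0)
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 log nodes, 2 input features
neighbors = {0: [1], 1: [0, 2], 2: [1]}  # a path graph 0 - 1 - 2
dims = [2, 4, 4, 4]  # input size followed by three hidden sizes
layers = [(random_weight(dims[k + 1], dims[k], rng), random_weight(dims[k + 1], dims[k], rng))
          for k in range(3)]
node_inputs = forward(features, neighbors, layers)
```

Each node ends up with a vector of length 4 + 4 + 4 = 12, which would be the input of the first fully connected layer. Because shallow-layer outputs are carried forward unchanged, the classifier still sees node-specific information even when the deepest layers have become smoothed.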
Experimental setup
The experiments were carried out on a computer configured with a Ryzen 3900X 4.25 GHz CPU (12 cores and 24 threads), 64 GB of memory, and a 2070 Super GPU. We use Python 3.8 and PyTorch Geometric 1.9 [8] to complete all work, including data cleaning, feature extraction, data augmentation, graph construction with datasets, and anomaly detection with the graph convolution neural network. We selected two datasets composed of real records to test Log2Graph.
In the LANL dataset, we use three files: “auth.txt”, “proc.txt” and “flows.txt”. Since the login operations initiated by a computer have little to do with a user’s intention, we filter out the relevant data and only retain the login operations initiated by the user. We also add several new features, such as whether the user is trying to log on to another user’s account (logon different user), or whether the user is trying to log on to another computer (logon different computer), and calculate the anomaly index. The attributes used in the LANL dataset are described in Table 1. The “
Features used in LANL dataset.
Features used in CERT dataset.
Hyperparameters.
For evaluation, we first randomly select half of the malicious users’ logs in both datasets and expand them through data augmentation to build the malicious part of the training set. Then we extract the logs of as many benign users as there are malicious users to form the benign part. The two parts together make up the entire training set. Finally, we use the other users’ logs as test sets. To reduce the consumption of video memory, we randomly keep 0.1% of normal logs in the training set, and use a random walk sampler [47] to generate mini-batches. All hyperparameters are shown in Table 3.
For test metrics, we use the widely used AUC (area under the curve) to measure the performance of insider threat detection, and we use FPR (false positive rate) to measure the additional workload that our method brings to security personnel.
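For reference, both metrics can be computed with a few lines of Python. This sketch uses the standard definitions: FPR is the fraction of benign logs flagged as malicious, and AUC is computed through its rank interpretation as the probability that a random malicious log scores higher than a random benign one. The function names, the toy scores, and the threshold of 0.35 are assumptions chosen for illustration.

```python
from typing import Sequence


def false_positive_rate(labels: Sequence[int], predictions: Sequence[int]) -> float:
    """FPR = FP / (FP + TN), with 0 = benign and 1 = malicious."""
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    return fp / (fp + tn) if fp + tn else 0.0


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the share of (malicious, benign) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores for three benign and two malicious logs.
labels = [0, 0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.3]
predictions = [1 if s >= 0.35 else 0 for s in scores]
auc_value = auc(labels, scores)  # 4 of 6 pairs are ranked correctly
fpr_value = false_positive_rate(labels, predictions)  # 2 of 3 benign logs are flagged
```

The pairwise AUC computation is quadratic in the number of samples, so it is suitable only for small examples. AUC is independent of the decision threshold, whereas FPR depends on it, which is why the two metrics are reported together.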
Training results
For evaluation, we first conduct the training process of Log2Graph and analyze it, taking the LANL dataset as an example. We trained the model for a total of 600 epochs. As shown in Fig. 6 and Fig. 7, at the beginning of the training process, the AUC value increased rapidly and reached 0.9 after about 50 epochs. After that, there were only some small fluctuations, and the whole curve showed a slow upward trend. The FPR decreased rapidly at the beginning of training but increased significantly at about 90 epochs, suddenly decreased at 250 epochs, and then showed a fluctuating downward trend. After 500 epochs of training, the performance of the model is relatively stable, with AUC above 0.98 and FPR between 0.01 and 0.015. In our evaluation, Log2Graph takes about 2 minutes to complete the entire training process, uses about 4 GB of video memory, and spends about 0.04 seconds to complete a test on the test set.

The AUC curve during training.

The FPR curve during training.
In this section, we evaluate test results by comparing Log2Graph with other methods.
Kmeans: K-means is a traditional clustering algorithm and a method of vector quantization. It partitions feature vectors into k clusters such that each vector belongs to the cluster with the nearest mean (the cluster center or centroid), which serves as the prototype of the cluster. This partitions the data space into Voronoi cells.
Isolation Forest (IF): Isolation Forest is a classical anomaly detection algorithm that recursively and randomly splits the dataset until all sample points are isolated. Under this random splitting strategy, anomalous samples are usually isolated after fewer splits and therefore have shorter paths.
GCN: We trained a node classifier with two GCN layers and two fully connected layers using the same hyperparameters as Log2Graph. The input of the first layer is the node features, the output of the last layer is the classification result, and the input of every other layer is the output of the previous layer.
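The intuition behind the Isolation Forest baseline listed above, that anomalies are isolated after fewer random splits, can be illustrated with a deliberately simplified one-dimensional sketch. The function names and toy data are hypothetical. A real Isolation Forest also picks random features and builds an ensemble of trees on subsamples, which is omitted here.

```python
import random

def isolation_depth(values, target, rng):
    """Count the random splits needed until `target` is alone in its partition."""
    depth = 0
    current = list(values)
    while len(current) > 1 and min(current) < max(current):
        split = rng.uniform(min(current), max(current))
        # Keep only the side of the split that still contains the target.
        if target < split:
            current = [v for v in current if v < split]
        else:
            current = [v for v in current if v >= split]
        depth += 1
    return depth

def mean_isolation_depth(values, target, trials=200, seed=0):
    """Average the isolation depth of `target` over repeated random splits."""
    rng = random.Random(seed)
    return sum(isolation_depth(values, target, rng) for _ in range(trials)) / trials

# Toy data: twenty clustered values plus one distant outlier.
values = [10.0 + 0.1 * i for i in range(20)] + [100.0]
outlier_depth = mean_isolation_depth(values, 100.0)
inlier_depth = mean_isolation_depth(values, values[10])
```

On this toy data the outlier is usually separated by the very first split, whereas a point in the middle of the cluster always needs at least two splits, so `outlier_depth` comes out smaller than `inlier_depth`.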
The results are shown in Table 4 and Table 5. Log2Graph obtained the best performance, with AUC 0.997 (LANL)/0.986 (CERT) and FPR 0.006 (LANL)/0.027 (CERT). IF performed very well on the LANL dataset but did not work on the CERT dataset. The reason is that in the CERT dataset, most label-type features, such as the URL or the subject of the website content, take too many distinct values. When these features are converted into one-hot vectors, the vectors consist almost entirely of ‘0’ entries. Therefore, IF (and Kmeans) did not work well on the CERT dataset. Because the two-layer GCN cannot obtain enough neighborhood information, it performed worse than Log2Graph on both datasets.
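The neighborhood argument for the two-layer GCN can be made concrete with a small sketch. It applies only the standard normalized propagation step of a GCN layer, with learned weights and nonlinearities omitted, on a hypothetical five-node chain. It does not reproduce the architecture of Log2Graph or of the baseline.

```python
def normalized_adjacency(edges, n):
    """Build D^{-1/2} (A + I) D^{-1/2} for an undirected graph with n nodes."""
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for u, v in edges:
        a[u][v] = a[v][u] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / (deg[i] * deg[j]) ** 0.5 for j in range(n)] for i in range(n)]

def propagate(a_hat, h):
    """One propagation step without weights or activation: H' = A_hat @ H."""
    return [sum(a_hat[i][j] * h[j] for j in range(len(h))) for i in range(len(h))]

# Hypothetical chain of five log nodes: 0 - 1 - 2 - 3 - 4.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
a_hat = normalized_adjacency(edges, 5)
signal = [1.0, 0.0, 0.0, 0.0, 0.0]  # only node 0 carries a feature

two_layers = propagate(a_hat, propagate(a_hat, signal))
four_layers = propagate(a_hat, propagate(a_hat, two_layers))
```

After two propagation steps the feature of node 0 has reached nodes 1 and 2 but not nodes 3 and 4. After four steps it has reached every node in the chain. Each additional graph convolution layer therefore extends the reachable neighborhood by one hop.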
Comparison with other methods on the LANL dataset.
1 The symbol ∗ means that the method needs additional manual assistance for anomaly detection.
2 NA is used if there is no corresponding metric in the original paper.
Comparison with other methods on the CERT dataset.

The AUC curve with different k in Log2Graph.
To examine how the number of oversampled copies k used during data augmentation affects model performance, we tested Log2Graph with different k values (from 0 to 8) on the LANL dataset. The resulting AUC and FPR values are shown in Fig. 8 and Fig. 9, respectively.
From the figures, we can see that the performance of Log2Graph is not very good without using oversampling technology to enhance the data (

The FPR curve with different k in Log2Graph.
To analyze the stability of the model and its room for improvement, we trained Log2Graph 9 times with the same parameters and analyzed the classification results on the LANL dataset. The performance results are shown in Table 6.
We regard samples that are misclassified by more than half of the classifiers, i.e., in most of the runs, as difficult to classify. By this criterion, 157 normal behaviors in the test data tend to be identified as malicious, and one malicious behavior tends to be missed. We further examined the normal samples that are misclassified as malicious. Their authentication type is concentrated in the NTLM protocol, and about 70% of the login behaviors are logins from one computer to another. Their anomaly index is mainly distributed near −1 (the normal category), but a few login behaviors have an anomaly index above 4, which is quite high.
Performance of 9 tests.
Therefore, it is very difficult to judge the category of these misclassified samples only from the characteristics of the login behavior itself. In particular, the behavior after login may be similar to the pattern of malicious behavior, which brings great difficulty to the classification work.
The only malicious sample that tends to be missed is the login of user U12 to host c366 from host c17693 through the NTLM authentication protocol at 1068312 seconds. Its anomaly index is −0.34877, which is close to that of normal behavior. After logging in, user U12 sends a large amount of data to multiple hosts. However, because user U12 often logs on to host c366, and other users often log on to host c366 to transmit large amounts of data, this operation closely resembles normal behavior and is difficult to distinguish accurately.
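The criterion used in this analysis, samples misclassified in more than half of the repeated runs, can be sketched as follows. The function name `hard_samples` and the data layout (one list of predicted labels per run) are assumptions made for illustration and do not come from the original experiments.

```python
def hard_samples(y_true, runs):
    """Return indices of samples misclassified in more than half of the runs.

    y_true: list of ground-truth labels (0 = normal, 1 = malicious).
    runs:   list of prediction lists, one per trained model.
    """
    n_runs = len(runs)
    hard_false_positives, hard_false_negatives = [], []
    for i, truth in enumerate(y_true):
        errors = sum(1 for preds in runs if preds[i] != truth)
        if errors * 2 > n_runs:  # misclassified by more than half of the runs
            if truth == 0:
                hard_false_positives.append(i)  # normal, usually flagged
            else:
                hard_false_negatives.append(i)  # malicious, usually missed
    return hard_false_positives, hard_false_negatives
```

With 9 runs, a sample counts as hard once at least 5 of the trained models misclassify it.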
Sequence-based methods can detect anomalies according to the temporal order of logs but are less effective at identifying the interactive relationships among users, hosts, and other entities.
For anomaly detection, graph-based methods mainly transform logs or other entities such as users and hosts into nodes in a graph, and establish connections between them by considering the logical relationships between nodes [3,4,16,19,24,25,43]. However, they still cannot adequately capture malicious acts that stem from a user’s inherent intention.
MLTracer [24] is a GNN-based method for intrusion detection. It designs a set of intrusion metapaths, converts logs into heterogeneous graphs, and generates graph embedding vectors that are fed into a CNN (convolutional neural network). It then uses a co-attention mechanism and an FCNN (fully convolutional neural network) to classify logs based on login sessions. However, this method requires users to design dedicated new metapaths to represent new types of intrusion actions.
Log2vec [25] is a typical unsupervised abnormal behavior detection method. It models the temporal and logical relationships between log nodes and the similarity between log sequences as a graph, and then obtains node embedding vectors with the node2vec method. Finally, anomaly detection is achieved by clustering.
Bowman et al. [3] establish a graph by analyzing the relationship between the host and the user through logs. They use the continuous-bag-of-words (CBOW) model to obtain the vectorized representation of the graph, and detect the insider threat through the link prediction technology.
LEADS [19] constructs the log sequence as a heterogeneous graph and classifies the graph by combining it with graph embedding representation. It can achieve the real-time detection of abnormal behavior.
Chen et al. [4] obtain the feature vector of each host by constructing the communication logic network between hosts. They use graph embedding and semi-supervised methods to identify the attacked hosts.
Different from these efforts, our proposed Log2Graph provides a more efficient method for insider threat detection based on the graph convolution neural network. Log2Graph comprehensively considers log features, entity relationships, data imbalance, and graph construction complexity to detect anomalies. It uses elaborated rules to construct the graph and builds a dedicated graph convolution neural network, which can better express the chronological and logical relationships between logs and identify the intent of malicious users.
Conclusion
To the best of our knowledge, Log2Graph is the first method in the field of insider threat detection to express user intention by using multiple graph convolution layers. We also introduce, for the first time, an anomaly index to measure the degree of anomaly of behaviors between entities, i.e., users and hosts. By combining the anomaly index with other features, the temporal and logical relationships between operations can be represented, which reduces the complexity of graph construction. Data augmentation increases the number of malicious samples, so that the model can learn more about them. Through the efficient graph construction method, the chronological and logical relationships are reflected in the graph, and the GCN-based classification achieves the desired effect.
Our evaluation results show that Log2Graph obtains the best performance, with AUC 0.997 (LANL)/0.986 (CERT) and FPR 0.006 (LANL)/0.027 (CERT). On the one hand, our method can detect threats caused by insiders. On the other hand, in a practical environment there is a tradeoff between tracking the operations of actual employees for abnormal behavior detection and anonymizing employees’ information for privacy protection.
Acknowledgment
This work has been partially supported by the National Science Foundation of China No. 62372450.
