Towards robust log parsing using self-supervised learning for system security analysis

Abstract

Logs play an important role in anomaly detection, fault diagnosis, and trace checking of software and network systems. Log parsing, which converts each raw log line to a constant template and a variable parameter list, is a prerequisite for system security analysis. Traditional parsing methods utilizing specific rules can only parse logs of specific formats, and most parsing methods based on deep learning require labels. However, the existing parsing methods are not applicable to logs of inconsistent formats and insufficient labels. To address these issues, we propose a robust Log parsing method based on Self-supervised Learning (LogSL), which can extract templates from logs of different formats. The essential idea of LogSL is modeling log parsing as a multi-token prediction task, which makes the multi-token prediction model learn the distribution of tokens belonging to the template in raw log lines by self-supervision mode. Furthermore, to accurately predict the tokens of the template without labeled data, we construct a Multi-token Prediction Model (MPM) combining the pre-trained XLNet module, the n-layer stacked Long Short-Term Memory Net module, and the Self-attention module. We validate LogSL on 12 benchmark log datasets, resulting in the average parsing accuracy of our parser being 3.9% higher than that of the best baseline method. Experimental results show that LogSL has superiority in terms of robustness and accuracy. In addition, a case study of anomaly detection is conducted to demonstrate the support of the proposed MPM to system security tasks based on logs.

Keywords

System security, data analysis, log parsing, deep learning, self-supervised learning

1. Introduction

With the increasing scale of software and network systems, security events can happen at any time [1, 2]. Logs play an important role in system security analysis such as anomaly detection, fault diagnosis, and trace checking because they record running status and important events of systems [3, 4, 5, 6]. However, logs, which are often unstructured and very large, cannot be used directly for system security analysis [3]. Thus, it is necessary to make logs structured. Each log message generated by software or network systems is composed of the message header and the message content, called the log line [7]. The message header, which contains fields such as timestamp, verbosity, and component, is determined by the logging framework and can be extracted relatively easily. In contrast, it is often difficult to structure log lines, which are written by developers in natural language [7, 8]. Therefore, this study assumes that log lines have been extracted from the log message and focuses on automatically parsing log lines. Specifically, each unstructured log line is transformed into a log template that consists of constant tokens and a related parameter list that consists of variable tokens, which can be achieved using techniques such as regular expressions or source code analysis [9]. The log template represents the corresponding event type and the parameter list records the runtime information [8]. For example, a log line “Got allocated containers 1” is parsed into a template “Got allocated containers $\langle\ast\rangle$” and a parameter list [“1”], where $\langle\ast\rangle$ represents the location of each variable in the template and is associated with the location of the values in the parameter list.

Automatic log parsing has been a significant topic for system security analysis and available parsers can be divided into two types: traditional and deep learning (DL)-based [10]. Traditional log parsing schemes usually use frequent pattern mining, heuristics, and clustering to extract log templates [11]. The parsers using frequent pattern mining [12, 13, 14, 15, 16] usually assume that the log template is a frequent set of tokens, while the parsers using heuristics [17, 18, 19] and clustering [20, 21, 22, 23, 24, 25, 26, 27] extract log templates according to the unique characteristics of specific logs and the similarity between log lines, respectively. However, these traditional methods can only parse logs of specific formats and are not suitable for log data generated by modern large-scale systems. The main reason why these methods are not workable is that the formats of logs generated by systems are different [10, 19]. The existing log parsing schemes based on DL mainly focus on using log data to train deep parsing models, which can be applied to logs of inconsistent formats, including supervised and unsupervised. Nevertheless, supervised DL methods need labeled logs during the model training phase, which are invalid for unlabeled logs [28, 29, 30, 31]. Duan et al. [10] predicted the subsequent token according to tokens within the sliding window and used the original token as the label, which avoided labeling massive logs. However, this method is not satisfactory as it ignores the token information behind the predicted token. Nedelkoski et al. [32] have explored self-supervised DL log parsing technology by predicting each token of each log line, but this method replaced a token in the original log line with a special $\langle\textit{MASK}\rangle$ token, whose content information affects the embedding of the log line, so the parsing accuracy of NuLog needs to be improved. 
All in all, existing methods cannot accurately parse logs of inconsistent formats without labels.

The design concept of the robust log parsing method based on self-supervised learning (LogSL) proposed in this paper comes from natural language processing, which uses the idea of token prediction to extract log templates. Specifically, multiple tokens in each log line are predicted at the same time consecutively. Tokens that are correctly predicted form part of the template, while incorrectly predicted tokens serve as parameters. Besides, a multi-token prediction model (MPM) trained in a self-supervised way is proposed in this paper, which fuses the XLNet and the n-layer stacked Long Short-Term Memory net with the self-attention mechanism to capture the contextual information of the predicted tokens and the dependencies among the predicted tokens. We conducted extensive experimental evaluations on 12 real benchmark log datasets. Experimental results show that the average parsing accuracy of LogSL is 3.9% higher than the best baseline method, and the difference between the maximum and minimum values of the parsing accuracy of LogSL is 23.4% lower than the best baseline method. Furthermore, we present a case study of how to combine the proposed MPM with an anomaly detection task, where a detection accuracy of 99% can be achieved on the benchmark BGL log dataset.

The main contributions of this paper are as follows:

• We propose a robust log parsing method, which can extract templates from logs of different formats.
• A novel multi-token prediction model is proposed, which can predict tokens of the template more accurately and does not need labels in the training phase.
• LogSL is evaluated on 12 benchmark log datasets, showing that LogSL has good robustness and accuracy.
• We design an anomaly detection case using the proposed MPM, showing that the trained MPM can be used as prior knowledge for system security tasks.

The rest of the paper is organized as follows. Section 2 summarizes the related work. Section 3 introduces the log parsing method LogSL. Section 4 evaluates the robustness and accuracy of LogSL on unlabeled logs of different formats. Section 5 describes how to combine the MPM with the downstream anomaly detection task. Finally, Section 6 concludes the paper and outlines directions for future research.

2. Related work

The purpose of log parsing is to find the log template to match each log line and facilitate the system security analysis based on logs. An excellent parsing method should meet the following desirable standards [10, 11]:

(1) Robustness: Log parsing methods work with all formats of log data from different vendors.
(2) No-supervision: Log parsing methods need to work without any domain knowledge or label data.
(3) Accuracy: Log parsing methods should have high accuracy.

The current log parsing approaches can be roughly divided into traditional and DL-based approaches [33]. We analyzed all of them based on the above standards.

Traditional log parsers are often designed by the technology of frequent pattern mining, heuristic rules, and clustering. Frequent pattern mining assumes that the log template is a frequent set of tokens and builds frequent itemsets based on tokens, token frequency, or token n-grams. The simple log file clustering tool (SLCT) [12] operates by mining frequent tokens from the dataset, constructing cluster candidate sets, subsequently selecting clusters, and extracting log events from these sets based on support thresholds, but the lower threshold may result in the derivation of overly specific log templates. The log file abstraction (LFA) [13] extracted log events based on the frequency of the token in a particular location. LogCluster [14] used frequent tokens that were identified by support threshold to partition the clusters of log lines and extracted the same log template for all log lines in the same cluster. This method is better at handling log messages with flexible parameter lengths. Craftsman [15] is the first framework to implement incremental parsing using the ideas of prefix trees and frequent patterns. The log parsing method based on n-gram and frequent pattern mining (NFPM) [16] used n-grams to divide logs with long identical consecutive strings into the same group and a frequent pattern mining algorithm to extract constant tokens from each group of similar data. Nevertheless, these methods are invalid for logs violating the assumption that frequent tokens always appear at the same location in the log. Algorithms based on heuristic rules use the characteristics of the log to extract log templates, such as the abstracting execution logs (AEL) [17], iterative partitioning log mining (IPLoM) [18], and Drain [19]. AEL used a hard-coded heuristic method based on system knowledge to identify dynamic fields in log lines, and abstracted execution events based on different bins divided according to the numbers of tokens and estimated parameters. 
IPLoM utilized an iterative partitioning strategy to group log lines based on log line length, token location, and mapping relationships. Drain applied a fixed-depth tree structure to represent log lines and efficiently extracted common templates, which avoided constructing a profound and unbalanced tree. This method assumed that log lines with the same template had the same length, and the token at the beginning of the log line was more likely to be a constant token. However, these heuristic-based parsers can only parse logs of the specific format. Since the process of generating log templates is very similar to the clustering problem, the researchers used clustering to extract log templates according to the similarity between log lines. The log key extraction (LKE) [20] and LogMine [21] utilized the hierarchical clustering algorithm to group similar logs based on weighted edit distances, but they need regular expressions based on domain knowledge to detect a set of user-defined types. Instead of conducting clustering on raw log messages directly, LogSig [22] extracted the signature of logs first, then the clustering was performed, which categorized log lines into a set of templates by searching the most representative signatures of the log lines. Kimura et al. [23] proposed the statistical template extraction (STE) to obtain log templates using density-based spatial clustering. The scalable handler for incremental system log (SHISO) [24] and the length matters clustering (LenMa) [25] used the idea of incremental clustering technology to parse logs. The streaming parser for event logs using the longest common subsequence (Spell) [26] utilized prefix tree and subsequence matching for log parsing, which clustered the current longest common subsequence objects that might have the same message type together in the merge procedure. 
The online log template extraction method based on hierarchical clustering (LogOHC) [27] used hierarchical clustering, which required regular expressions for preprocessing. However, the accuracy of clustering methods is not high for logs with a large number of similar templates and clustering methods need domain knowledge.

In summary, since these traditional methods rely on domain knowledge for specific logs to extract log templates, they do not satisfy the applicability of logs of different formats and they need domain expert support and manual intervention. Therefore, this paper uses the idea of multi-token prediction to extract log templates, so that logs in various formats can be parsed.

To adapt to logs of various formats, deep learning has been introduced into log parsing, in both supervised [28, 29, 30, 31] and unsupervised [10, 32] forms. The natural language processing-log template generation (NLP-LTG) [28] considered log parsing as a problem of labeling sequence data with natural language, and classified tokens into constants and variables using Conditional Random Fields (CRF). The neural language model-for signature extraction (NLM-FSE) [29] trained a character-based neural network to extract log templates from log lines. Rücker et al. [30] proposed the flexible parser (FlexParser) using a stateful Long Short-Term Memory network that forced the model to learn parsing event logic rather than direct classification relations. Liu et al. [31] proposed to use the Token Encoder module and Context Encoder module to capture patterns of templates and parameters and use the Context Similarity module to focus on the commonalities of learned patterns. Nevertheless, these log parsing methods based on supervised deep learning need labeled log data, and obtaining labeled log data often requires a lot of time and effort, as well as the support of domain experts. Therefore, they cannot parse unlabeled log data. To avoid labeling massive log data, Duan et al. [10] proposed an unsupervised log parsing method that utilized tokens within the sliding window to predict the next token and used the original token as the label, but this method disregarded the information from tokens following the predicted token. Nedelkoski et al. [32] explored self-supervised learning to identify variables and generate the corresponding event templates, a log parsing method referred to as NuLog. This method used masked-language modeling to randomly mask the input tokens and used a neural network model to predict the masked input tokens. Nevertheless, NuLog replaces a token in the original log line with a special $\langle\textit{MASK}\rangle$ token, whose content information influences the vector representation of the log line, so the parsing accuracy of NuLog still needs to be improved.

In summary, available log parsing methods based on deep learning which require labeled data or have low accuracy do not meet the requirements of No-supervision and high accuracy. Therefore, this paper proposes a multi-token prediction model based on self-supervised learning. In addition, to avoid artificially adding mask tokens to affect the performance of parsing, this paper generates token sequences with noise using a permutation language model.

In our work, we introduce LogSL, a robust log parsing method based on self-supervised learning. Compared with existing work, we model log parsing as a multi-token prediction task, which can accurately extract templates for unlabeled logs of different formats. Specifically, in the training phase, the token sequences with noise are used as inputs to the multi-token prediction model, and the original log tokens are used as labels to supervise model training. In the prediction phase, the trained multi-token prediction model is used to predict each token: if the token is predicted correctly, it is a token of the template; otherwise, it is a variable. We verify the robustness and accuracy of LogSL on benchmark log datasets.

3. LogSL

In view of the fact that existing log parsing methods are not robust to unlabeled logs of different formats, we propose a robust log parsing method based on self-supervised learning (LogSL), which can automatically extract templates from log lines without any domain knowledge and log labels. The core idea of LogSL is modeling log parsing as a multi-token prediction task, including four steps: Tokenization, Permutation And Sampling, Multi-token Prediction Model, and Extraction. As shown in Fig. 1, for the log line “Got allocated containers 1”, since the token “1” is not correctly predicted, while other tokens are correctly predicted, the token “1” is identified as variable and other tokens as constant.

Figure 1.

Parsing the log line “Got allocated containers 1” with LogSL.

3.1 Tokenization

Log parsing first requires extracting each log line as a list of tokens. In different systems, using the same regular expression to extract tokens will affect the accuracy of log parsing. If a regular expression is developed for each system, it will take a lot of human resources, and when the system is updated or upgraded, the accuracy of the parsing will also decrease. Therefore, for general logs, we employ spaces or tabs for segmentation. For logs that resist such straightforward splitting, we utilize regular expressions for specialized processing. For example, the log line “Got allocated containers 1” is split into the token list [“Got”, “allocated”, “containers”, “1”] by spaces. We denote the number of all unique tokens split out of the log as $T_{\max}$ .
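The splitting step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `tokenize` and `count_unique_tokens` and the optional `pattern` argument are hypothetical, and the second log line is made up for the example.

```python
import re


def tokenize(log_line, pattern=None):
    """Split a log line into a list of tokens.

    By default the line is split on runs of whitespace (spaces or tabs).
    `pattern` is a hypothetical optional regular expression for logs that
    cannot be split this way.
    """
    if pattern is None:
        return log_line.split()
    # Drop empty strings produced by adjacent delimiters.
    return [tok for tok in re.split(pattern, log_line) if tok]


def count_unique_tokens(log_lines):
    """Return the number of unique tokens over all log lines (T_max)."""
    vocab = set()
    for line in log_lines:
        vocab.update(tokenize(line))
    return len(vocab)


lines = ["Got allocated containers 1", "Got allocated containers 2"]
tokens = tokenize(lines[0])
t_max = count_unique_tokens(lines)
```

Here `tokens` is the token list of the running example, and `t_max` counts the distinct tokens across the two illustrative lines.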

3.2 Permutation and sampling

Since it takes experts a lot of time and effort to obtain log labels, this paper uses self-supervised learning for log parsing, which takes each log line with noise as input and each original log line as the label. The log lines with noise generated by token permutation do not change the content information of any original token, so the template tokens, which constrain the embedding vector of the log line, can be recognized more accurately. Thus, this paper draws on the Permutation Language Model (PLM) [33, 34] to obtain all permutation sequences of all tokens in the token list of each log line. Assuming that the given token list is $l=$ [“Got”, “allocated”, “containers”, “1”], we fully permute its elements to get all permutation sequences “Got $\rightarrow$ allocated $\rightarrow$ containers $\rightarrow$ 1”, “Got $\rightarrow$ allocated $\rightarrow$ 1 $\rightarrow$ containers”, “Got $\rightarrow$ containers $\rightarrow$ 1 $\rightarrow$ allocated”, and so on.

Given a token list $\mathbf{x}$ of length $T$, $T!$ token permutation sequences will be generated by the above full permutation method. When the token list of a log line is very long, there will be a large number of token sequences with noise for model training, which leads to high time complexity in training the model. For instance, the token list $l$ with only 4 tokens can generate up to 24 token permutation sequences. Token lists of logs in real systems are usually longer; for example, the maximum length of the token list for the BGL dataset used in this study exceeds one hundred. Moreover, when the predicted tokens are at the beginning of the token sequence, the obtained token sequences are not helpful for model training, because little or no context precedes them. Therefore, for each prediction of multiple tokens of the log line during the training phase, we only randomly sample some of the permutation sequences whose predicted tokens are at the end. The number of samples is at least $\lceil\frac{T}{ML}\rceil$, where $ML$ represents the number of predicted tokens.
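The sampling idea can be illustrated with a short sketch. It rests on an assumption not stated in the text: the $T$ positions are split into $\lceil T/ML\rceil$ consecutive chunks and each chunk is used once as the prediction target, which yields exactly the minimum number of samples. The function name `sample_permutations` and the `seed` argument are hypothetical.

```python
import math
import random


def sample_permutations(tokens, ml, seed=0):
    """Sample factorization orders whose last `ml` entries are the predicted positions.

    Assumption (for illustration only): positions are split into
    ceil(T / ml) consecutive chunks, and each chunk is used once as the
    prediction target.
    """
    rng = random.Random(seed)
    t = len(tokens)
    orders = []
    for start in range(0, t, ml):
        targets = list(range(start, min(start + ml, t)))
        context = [i for i in range(t) if i not in targets]
        rng.shuffle(context)              # noise: random order of the visible tokens
        orders.append(context + targets)  # predicted positions come last
    return orders


tokens = ["Got", "allocated", "containers", "1"]
orders = sample_permutations(tokens, ml=2)
n_full = math.factorial(len(tokens))  # number of orders under full permutation
```

With four tokens and two predicted tokens per sample, the sketch draws two orders instead of the 24 produced by full permutation.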

Figure 2.

Mask matrix of token permutation sequence “containers $\rightarrow$ allocated $\rightarrow$ 1 $\rightarrow$ Got”.

This paper utilizes the mask matrix [35] to characterize the sampled token permutation sequence. It is implemented by masking the tokens that play no role in the prediction process, so the original order of the token list is not changed. The mask matrix of the token permutation sequence “containers $\rightarrow$ allocated $\rightarrow$ 1 $\rightarrow$ Got” is shown in Fig. 2. In the mask matrix, the shaded part is the information that can be referenced when predicting. “containers” is predicted first and has no reference information, so the third row of the mask matrix has no shading. “allocated” can be predicted according to the content of “containers”, so the third position of the second row of the mask matrix is shaded, and so on.
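As a small sketch of how such a mask matrix can be derived from a permutation sequence, the code below marks with 1 every position that may be referenced when predicting a given row. The function name `permutation_mask` is hypothetical, and the convention that a token cannot reference itself is an assumption matching the description of the rows above.

```python
def permutation_mask(order):
    """Build a mask matrix from a factorization order.

    order[s] is the original position of the token predicted at step s.
    mask[i][j] == 1 means position j may be referenced when predicting
    position i, i.e. j comes strictly earlier than i in the order.
    """
    t = len(order)
    rank = [0] * t
    for step, pos in enumerate(order):
        rank[pos] = step
    return [[1 if rank[j] < rank[i] else 0 for j in range(t)] for i in range(t)]


tokens = ["Got", "allocated", "containers", "1"]
# Permutation sequence: containers -> allocated -> 1 -> Got
order = [tokens.index(tok) for tok in ["containers", "allocated", "1", "Got"]]
mask = permutation_mask(order)
```

Rows and columns follow the original token order, so the third row (for “containers”) is all zeros and the second row (for “allocated”) has a single 1 in the third column.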

3.3 Multi-token prediction model

To predict tokens of the template more accurately, this paper proposes a multi-token prediction model (MPM) using XLNet [36], the n-layer stacked Long Short-Term Memory (LSTM) [37, 38, 39], and a self-attention mechanism [40, 41, 42], which is a self-supervised model including a training phase and a prediction phase. In the training phase, noise is added to the original log lines using the token permutation. The log lines with noise are used as input. The original log lines are used as the label. The model parameters are adjusted by reducing the loss value through backpropagation, and at the same time, the optimal hyperparameters are selected. In the prediction phase, we predict the multi-token of each new log line through model forward propagation, then generate the corresponding log template. Figure 3 describes the complete architecture of MPM. The log lines with noise are jointly characterized by the random embedding vector of the original log lines and mask matrix, which are the input of MPM. Firstly, the XLNet layer is used to obtain the context information of multiple tokens that are predicted and to capture the dependencies between multiple tokens that are predicted at the same time. Then, the features are further screened by an n-layer stacked LSTM (n-LSTM) layer, in which LSTM uses three gating methods to control the historical information of the sequence to update the current state value. Thirdly, the attention layer is used to assign different weights according to the importance of the features. Finally, the probability distribution of all unique log tokens is obtained by using a Generator, which is a single linear layer with softmax activation of all unique tokens in the dataset.

Figure 3.

The proposed complete architecture of MPM.

3.3.1 XLNet layer

Figure 4.

Principle of two-stream self-attention of the token permutation sequence “containers $\rightarrow$ allocated $\rightarrow$ 1 $\rightarrow$ Got”.

To capture the contextual information of the predicted tokens and the dependencies between tokens that are predicted at the same time, this paper introduces the two-stream self-attention of the pre-trained model XLNet, which includes a content stream and a query stream. The content hidden state of the content stream encodes both the content of the predicted tokens and their contextual information but does not encode their location information. The query hidden state of the query stream encodes only the contextual information of the predicted tokens and their location information but does not encode their content information. Specifically, given the sampled token permutation sequence $z\thicksim Z_{T}$, when predicting the token $x_{i}$, in the content stream, the $i$-th content hidden state $\mathbf{h}_{i}^{(m)}$ of the $m$-th layer encodes both the contextual information of $x_{i}$ from the $(m-1)$-th layer and the information of $x_{i}$ itself from the $(m-1)$-th layer. In the query stream, the $i$-th query hidden state $\mathbf{g}_{i}^{(m)}$ of the $m$-th layer encodes not only the contextual information of $x_{i}$ from the $(m-1)$-th layer but also the position information of $x_{i}$. Therefore, the update formulas of the content hidden state $\mathbf{h}$ and the query hidden state $\mathbf{g}$ are:

$\displaystyle\left\{\begin{array}{l}\mathbf{h}_{z_{t}}^{(m)}\leftarrow\textit{Attention}[\mathbf{Q}=\mathbf{h}_{z_{t}}^{(m-1)},\mathbf{KV}=\mathbf{h}_{z_{\leqslant t}}^{(m-1)};\theta]\\ \mathbf{g}_{z_{t}}^{(m)}\leftarrow\textit{Attention}[\mathbf{Q}=\mathbf{g}_{z_{t}}^{(m-1)},\mathbf{KV}=\mathbf{h}_{z_{<t}}^{(m-1)};\theta]\end{array}\right.,$ (1)

where $m$ indexes the layers of the two-stream self-attention. In the $0$-th layer, the query hidden state $\mathbf{g}^{(0)}$ is usually initialized to a variable $w$ and the content hidden state $\mathbf{h}^{(0)}$ is usually initialized to a random embedding vector $e(x)$ of the token; each subsequent layer is computed from the previous one. $\theta$ denotes the parameters shared by the content stream and the query stream. $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ denote the query, key, and value in the self-attention mechanism. The calculation formula of self-attention is as follows:

$\displaystyle\textit{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V})=\textit{softmax}\left(\frac{\mathbf{Q}\cdot\mathbf{K}^{T}}{\sqrt{d_{k}}}\right)\cdot\mathbf{V}.$ (2)

Here $d_{k}$ is the dimension of the key vectors; dividing by $\sqrt{d_{k}}$ stabilizes the gradient during training.
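The scaled dot-product self-attention can be sketched in a few lines of NumPy. This is a generic illustration of the standard formula with random inputs, not the XLNet two-stream implementation; the shapes (4 tokens, 8 dimensions) and the function names are arbitrary assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = k.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)      # similarity of every query to every key
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))  # 4 tokens, d_k = 8 (illustrative sizes)
k = rng.normal(size=(4, 8))
v = rng.normal(size=(4, 8))
out, weights = attention(q, k, v)
```

The softmax is applied to the scaled scores only, and the resulting weights are then used to average the value vectors.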

Figure 4 shows the working principle of the content stream and the query stream when the token “Got” is predicted in the token permutation sequence “containers $\rightarrow$ allocated $\rightarrow$ 1 $\rightarrow$ Got”. When predicting “Got”, the model can obtain the information of “allocated”, “containers”, and “1”. Therefore, in the content stream shown in Fig. 4a, the predicted token “Got” encodes both the information of “allocated”, “containers”, and “1” (location information and content information) and the information of the predicted token “Got” itself. In the query stream shown in Fig. 4b, the predicted token “Got” encodes the information of “allocated”, “containers”, and “1” but only the location information of “Got” itself. Reading from the bottom to the top, that is, starting from the $0$-th layer, $\mathbf{h}$ and $\mathbf{g}$ are initialized to the embedding vectors of the tokens $e(x_{i})$ and the variable $w$, respectively. The mask matrix is used to generate the content stream mask matrix and the query stream mask matrix. Eq. (1) then yields the first-layer outputs $\mathbf{h}_{i}^{(1)}$ and $\mathbf{g}_{i}^{(1)}$, and so on.

3.3.2 n-LSTM layer

The n-LSTM layer uses an n-layer stacked LSTM to further filter the token sequence features output by the XLNet layer. In a single-layer LSTM, the output of an LSTM unit includes the cell state and the hidden state. The hidden state and the cell state of the previous LSTM unit are passed to the next LSTM unit, and the hidden state is also passed to the stacked upper-layer LSTM as its input. Each LSTM unit in the bottom layer corresponds to the output of the query stream of the XLNet layer. As shown in Fig. 5, $h_{i}^{(n)}$ and $C_{i}^{(n)}$ represent the hidden state and the cell state of the $i$-th LSTM unit in the $n$-th layer, respectively.

Figure 5.

n-LSTM layer architecture.

The LSTM unit can solve the problem of loss of learning ability through a special gate structure, which avoids the loss of information caused by the large distance between the predicted tokens and the relevant tokens. It consists of three gate structures: forget gate, input gate, and output gate. The calculation formula of each gate control unit is as follows:

• The forget gate determines what information from the cell state at the previous moment is retained in the cell state at the current moment,

$\displaystyle f_{t}=\sigma(W_{f}\cdot[h_{t-1},x_{t}]+b_{f}),$ (3)

where $x_{t}$ and $h_{t-1}$ represent the input of the $t$-th LSTM unit and the hidden state of the $(t-1)$-th LSTM unit of the current LSTM layer, respectively, and $W_{f}$ and $b_{f}$ are the weight matrix and bias term of the forget gate, respectively.

• The input gate determines how much information is input to the cell state $C_{t}$ from the input $x_{t}$ of the network at the current moment,

$\displaystyle i_{t}=\sigma(W_{i}\cdot[h_{t-1},x_{t}]+b_{i}),$ (4)

$\displaystyle\tilde{C_{t}}=\tanh(W_{c}\cdot[h_{t-1},x_{t}]+b_{c}),$ (5)

$\displaystyle C_{t}=f_{t}\odot C_{t-1}+i_{t}\odot\tilde{C_{t}},$ (6)

where $\sigma$ represents the sigmoid activation function, $\tanh$ represents the hyperbolic tangent function, $W_{i}$ and $b_{i}$ are the weight matrix and bias term of the input gate, respectively, $W_{c}$ and $b_{c}$ are the weight matrix and bias term of the candidate cell state, respectively, and $\tilde{C_{t}}$ is the candidate vector created by the tanh layer.

• The output gate controls how much information the cell state $C_{t}$ outputs,

$\displaystyle o_{t}=\sigma(W_{o}\cdot[h_{t-1},x_{t}]+b_{o}),$ (7)

$\displaystyle h_{t}=o_{t}\odot\tanh(C_{t}),$ (8)

where $W_{o}$ and $b_{o}$ are the weight matrix and bias term of the output gate, respectively, and $o_{t}$ represents the output of the output gate.
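The three gates can be put together in a single LSTM step, sketched below with NumPy. The sketch follows the gate equations of this subsection, with the candidate state computed by a tanh layer as the text describes. The helper `init_params`, the random initialization, and the sizes (input 3, hidden 4, 5 time steps) are assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM unit update following the forget, input, and output gates."""
    z = np.concatenate([h_prev, x_t])             # [h_{t-1}, x_t]
    f_t = sigmoid(p["W_f"] @ z + p["b_f"])        # forget gate
    i_t = sigmoid(p["W_i"] @ z + p["b_i"])        # input gate
    c_tilde = np.tanh(p["W_c"] @ z + p["b_c"])    # candidate state (tanh layer)
    c_t = f_t * c_prev + i_t * c_tilde            # new cell state
    o_t = sigmoid(p["W_o"] @ z + p["b_o"])        # output gate
    h_t = o_t * np.tanh(c_t)                      # new hidden state
    return h_t, c_t


def init_params(input_size, hidden_size, seed=0):
    """Hypothetical helper: small random weights and zero biases."""
    rng = np.random.default_rng(seed)
    p = {}
    for gate in "fico":
        p[f"W_{gate}"] = rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
        p[f"b_{gate}"] = np.zeros(hidden_size)
    return p


p = init_params(input_size=3, hidden_size=4)
h, c = np.zeros(4), np.zeros(4)
for x_t in np.ones((5, 3)):   # a toy sequence of 5 identical inputs
    h, c = lstm_step(x_t, h, c, p)
```

In a stacked configuration, the hidden state `h` produced at each step of one layer would serve as the input `x_t` of the layer above.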

3.3.3 Attention layer

Each token in the token sequence has a different influence on the predicted tokens. Therefore, the Attention layer uses a self-attention mechanism to extract the important part of the token sequence for the predicted tokens. The self-attention mechanism receives the output of the n-LSTM layer as input. According to the importance of different feature vectors, weights are assigned to them. Finally, the attention layer gives a weighted vector representation through softmax normalization, as shown in Eq. (2).

3.3.4 Generator

The last component of MPM consists of a single linear layer with Softmax that receives the output of the Attention layer. The linear layer maps the output vector of the Attention layer to a vector whose size corresponds to the total number of unique tokens in the log dataset. The subsequent Softmax is used to compute the probability distribution over each unique token of the log dataset. During training, the original tokens at the predicted positions are used as labels for self-supervised learning.

3.4 Extraction

The extraction of all log templates in the log dataset is performed online using the trained model. We take each log line as input and sample the permutation sequences in a way that masks multiple tokens consecutively. We measure the ability of the model to predict each token, thereby deciding whether the token is a constant part of the template or a variable. A token whose prediction confidence is higher than the threshold $\omega$ is treated as a constant part of the template, while a token whose confidence is lower than $\omega$ is interpreted as a variable, where $\omega$ is a hyperparameter. For example, given the log line “Got allocated containers 1”, we can get one predicted variable token and three predicted constant tokens, and the log line is parsed into a template “Got allocated containers $\langle\ast\rangle$” and a parameter list [“1”].
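The thresholding rule can be sketched as below. The confidence values are made up for illustration and stand for the probability the model assigns to each original token; the function name `extract_template`, the default threshold of 0.5, and the treatment of a confidence exactly equal to $\omega$ as a variable are assumptions.

```python
def extract_template(tokens, confidences, omega=0.5, wildcard="<*>"):
    """Split tokens into a template and a parameter list by confidence.

    Tokens predicted with confidence above `omega` are kept as constants;
    all others are replaced by the wildcard and collected as parameters.
    """
    template, parameters = [], []
    for tok, conf in zip(tokens, confidences):
        if conf > omega:
            template.append(tok)
        else:
            template.append(wildcard)
            parameters.append(tok)
    return " ".join(template), parameters


tokens = ["Got", "allocated", "containers", "1"]
confidences = [0.97, 0.95, 0.92, 0.08]  # illustrative values, not model output
template, parameters = extract_template(tokens, confidences)
```

For these illustrative confidences, the first three tokens are kept and the last one becomes the only parameter.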

4. Evaluation

To verify the robustness and accuracy of LogSL, we carried out experiments on benchmark log datasets.

4.1 Experimental environment and datasets

We conducted all experimental evaluations on a machine running Ubuntu 18.04.5 LTS with an Intel(R) Xeon(R) W-2123 3.60 GHz CPU, an NVIDIA TITAN Xp GPU, and 64 GB of RAM.

The datasets we experimented with come from distributed system logs (HDFS, Spark, OpenStack, Hadoop, ZooKeeper), supercomputer logs (BGL, HPC, Thunderbird), and standalone software logs (Windows, Android, HealthApp, Apache) on the loghub data repository [43]. To enable reproducibility, we follow the guidelines in [11] and utilize a random sample of 2000 log messages from each dataset for which ground truth templates are available, as shown in Table 1.

In the experiment, the proposed MPM includes two layers of the two-stream self-attention mechanism with 128 hidden units, two layers of the LSTM with 128 hidden units, and one layer of the self-attention mechanism with 128 hidden units. Since the linear layer in the Generator maps the output vector of the Attention layer to a vector whose size corresponds to the total number of unique tokens in the log dataset, the number of unique tokens in each dataset affects the total parameter volume of the proposed MPM. The total number of model parameters in each dataset is shown in Table 1.

Table 1
Log datasets and model parameters

Datasets	Source	#Templates	#Parameters
HDFS	Hadoop Distributed File System	14	1376170
Spark	Unified Analytics Engine for Big Data Processing	36	1011100
OpenStack	Cloud Operating System	43	880552
Hadoop	Hadoop mapreduce job	140	787156
Zookeeper	ZooKeeper service	50	758002
BGL	BlueGene Supercomputer	120	1047478
HPC	High Performance Cluster	46	644998
Thunderbird	Thunderbird Supercomputer	149	1670032
Windows	Windows 7 Computer Operating System	50	824566
Android	Mobile Operating System	166	752584
HealthApp	Mobile Application for Android Devices	75	903256
Apache	Apache HTTP Server	6	747424
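The dependence of the parameter count on the vocabulary can be checked against Table 1. The sketch below assumes that each additional unique token adds one output row of the Generator's linear layer, that is, 128 weights plus one bias; this reading is an assumption and is not stated in the text. The helper `extra_tokens` and the choice of HPC as the reference dataset are hypothetical.

```python
# Parameter counts copied from Table 1.
PARAMETERS = {
    "HDFS": 1376170, "Spark": 1011100, "OpenStack": 880552,
    "Hadoop": 787156, "Zookeeper": 758002, "BGL": 1047478,
    "HPC": 644998, "Thunderbird": 1670032, "Windows": 824566,
    "Android": 752584, "HealthApp": 903256, "Apache": 747424,
}

HIDDEN_UNITS = 128
# Assumption: one weight per hidden unit plus one bias for each output token.
PARAMETERS_PER_TOKEN = HIDDEN_UNITS + 1


def extra_tokens(dataset: str, reference: str = "HPC") -> tuple:
    """Return (quotient, remainder) of the parameter difference to the
    reference dataset divided by the assumed per-token parameter cost."""
    difference = PARAMETERS[dataset] - PARAMETERS[reference]
    return divmod(difference, PARAMETERS_PER_TOKEN)


for name in PARAMETERS:
    quotient, remainder = extra_tokens(name)
    print(f"{name}: {quotient} more unique tokens than HPC, remainder {remainder}")
```

Under this assumption every difference in Table 1 is an exact multiple of 129, which is consistent with the statement that only the vocabulary-dependent linear layer changes between datasets.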

4.2 Performance parameters

We use the precision [44], recall [45], and F-measure [46] of log parsing on 12 benchmark log datasets to quantify the accuracy of LogSL for logs of different formats. The precision is the proportion of generated log templates that match the true log templates in the ground truth. The recall is the ratio of the number of correctly generated log templates to the total number of true log templates in the ground truth. The F-measure is the harmonic mean of the precision and recall. They are defined as follows:

$\displaystyle\textit{Precision}=\frac{TP}{TP+FP},$ (9)

$\displaystyle\textit{Recall}=\frac{TP}{TP+FN},$ (10)

$\displaystyle\textit{F-measure}=\frac{2\cdot\textit{Recall}\cdot\textit{Precision}}{\textit{Recall}+\textit{Precision}}.$ (11)

The precision, recall, and F-measure are computed from three values: the true positives (TP), false positives (FP), and false negatives (FN). A true positive is counted when a method assigns two log lines with the same template to the same template. A false positive is counted when a method assigns two log lines with different templates to the same template. A false negative is counted when a method assigns two log lines with the same template to different templates.
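These pairwise definitions can be turned into a short sketch that counts TP, FP, and FN over all pairs of log lines and then applies Eqs (9) to (11). The function name `pairwise_scores` and the toy labels are hypothetical; returning 0.0 for an empty denominator is an assumption made only to keep the example total.

```python
from itertools import combinations
from typing import Sequence, Tuple


def pairwise_scores(
    truth: Sequence[str], predicted: Sequence[str]
) -> Tuple[float, float, float]:
    """Pairwise precision, recall, and F-measure of a template assignment."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    tp = fp = fn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_truth = truth[i] == truth[j]
        same_predicted = predicted[i] == predicted[j]
        if same_truth and same_predicted:
            tp += 1  # same template, assigned to the same template
        elif same_predicted:
            fp += 1  # different templates, assigned to the same template
        elif same_truth:
            fn += 1  # same template, assigned to different templates
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * recall * precision / total if total else 0.0
    return precision, recall, f_measure


# Toy example: four log lines, two ground truth templates.
truth = ["A", "A", "B", "B"]
predicted = ["x", "x", "x", "y"]
precision, recall, f_measure = pairwise_scores(truth, predicted)
print(precision, recall, f_measure)
```

In the toy example there is one true positive, two false positives, and one false negative, so the precision is 1/3, the recall is 1/2, and the F-measure is 0.4.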

In addition, we further evaluated the accuracy and robustness of LogSL using the parsing accuracy (PA) [10]. The PA is defined as the ratio of correctly parsed log lines to the total number of log lines. After parsing, each log line is assigned to a log template. A log line is considered correctly parsed if its log template corresponds to the same group of log lines as the ground truth does. For example, if a log line sequence $[e_{1},e_{2},e_{2}]$ is parsed to $[e_{1},e_{4},e_{5}]$ , we get $PA=1/3$ , since the second and third log lines are not grouped together.
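A minimal sketch of the PA computation follows. It assumes the group-based reading of the definition: a log line counts as correct only when the set of lines sharing its predicted template equals the set of lines sharing its ground truth template. The function name `parsing_accuracy` is hypothetical, and the example uses a three-line sequence in which the second and third lines share a ground truth template.

```python
from collections import defaultdict
from typing import Dict, Sequence, Set


def parsing_accuracy(truth: Sequence[str], predicted: Sequence[str]) -> float:
    """Ratio of log lines whose predicted group equals their true group."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    if not truth:
        return 0.0
    truth_groups: Dict[str, Set[int]] = defaultdict(set)
    predicted_groups: Dict[str, Set[int]] = defaultdict(set)
    for index, (true_id, predicted_id) in enumerate(zip(truth, predicted)):
        truth_groups[true_id].add(index)
        predicted_groups[predicted_id].add(index)
    correct = sum(
        1
        for true_id, predicted_id in zip(truth, predicted)
        if truth_groups[true_id] == predicted_groups[predicted_id]
    )
    return correct / len(truth)


pa = parsing_accuracy(["e1", "e2", "e2"], ["e1", "e4", "e5"])
print(pa)
```

Only the first line is grouped as in the ground truth, so the sketch yields a PA of 1/3. Template names do not need to match between the two sequences; only the grouping matters.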

4.3 Experimental results

This section describes the accuracy and robustness of LogSL. We compared the precision, recall, F-measure, and PA of LogSL with those of 12 log parsing methods: SLCT [12], LFA [13], LogCluster [14], AEL [17], Drain [19], LKE [20], LogSig [22], LogMine [21], SHISO [24], LenMa [25], Spell [26], and NuLog [32] on 12 benchmark datasets.

Figure 6.

Comparison of LogSL and the other 12 log parsing methods in terms of precision.

Figure 6 compares the precision of LogSL with that of the other 12 log parsing methods. The parsing precision of our method on distributed system logs (HDFS, Spark, OpenStack, Hadoop, ZooKeeper) is close to 100%. Because the number of ground truth templates for HDFS, Spark, OpenStack, and ZooKeeper is small, and the log lines in Hadoop are relatively simple, the results of our method are similar to those of most methods; nevertheless, our method is significantly superior to SLCT and LogSig on HDFS, to LKE and LogMine on Spark, and to LFA and LKE on OpenStack. The parsing precision of our method on supercomputer logs (BGL, HPC, and Thunderbird) surpasses that of most methods and reaches 100% on BGL and Thunderbird. On HPC, where some of the templates are very similar, the method used to compute similarity affects the parsing precision of clustering-based methods, whereas the similarity between templates does not affect the parsing performance of our method; therefore, our method is superior to most clustering algorithms on HPC. The parsing precision of our method is nearly 100% on standalone software logs (Windows, Android, HealthApp, Apache). Because Windows and Apache have few ground truth templates, the results of our method are similar to those of most methods. Because Android and HealthApp have more ground truth templates, the parsing precision of our method is higher than that of most baseline methods. Our method can capture dependencies among multiple tokens, so it is slightly better than NuLog.

Figure 7.

Comparison of LogSL and the other 12 log parsing methods in terms of recall.

Figure 7 compares the recall of LogSL with that of the other 12 log parsing methods. The recall of LogSL is 100% on eight datasets: HDFS, Spark, OpenStack, Hadoop, BGL, Windows, HealthApp, and Apache. The recall of LogSL exceeds 99% on ZooKeeper, HPC, Thunderbird, and Android, which are complex and have a large number of ground truth templates. Most frequent pattern mining-based log parsing methods set lower thresholds, which yields log templates that are too specific. Some heuristic-based log parsing methods are uncertain when separating similar templates. When parsing complex logs, some clustering-based algorithms suffer from insufficient clustering, which affects the parsing recall. These problems do not occur in our method. Therefore, the parsing recall of our method is similar to that of most methods on distributed system logs (HDFS, Spark, OpenStack, Hadoop, ZooKeeper), and our method is significantly superior to SLCT, LFA, and LogSig on HDFS, to SLCT on Spark, to SLCT, LogCluster, AEL, and Drain on Hadoop, and to SLCT, LogCluster, LKE, and LogMine on ZooKeeper. The parsing recall of our method is slightly better than that of most methods on supercomputer logs (BGL, HPC, Thunderbird), and significantly better than that of some frequent pattern mining-based methods and most cluster-based methods. Our method is significantly superior to most parsing methods in parsing recall on standalone software logs, including a recall 4.7% higher than that of NuLog on Android.

Figure 8.

Comparison of LogSL and the other 12 log parsing methods in terms of F-measure.

Figure 8 compares the F-measure of LogSL with that of the other 12 log parsing methods. The parsing F-measure of LogSL reaches 100% on seven datasets: HDFS, Spark, OpenStack, Hadoop, BGL, Windows, and Apache. It exceeds 99% on five complex datasets, ZooKeeper, HPC, Thunderbird, Android, and HealthApp, with 99.9%, 99.4%, 99.8%, 99.5%, and 99.7%, respectively. Log templates obtained by most frequent pattern mining-based log parsing methods are too specific, which affects the parsing recall. Some heuristic-based log parsing methods are uncertain when separating similar templates. In some clustering-based methods, the method used to compute similarity affects the parsing precision. NuLog uses the special $\langle\textit{MASK}\rangle$ token to replace tokens in the original log lines. The content information of the special $\langle\textit{MASK}\rangle$ token affects the embedding of log lines, resulting in inaccurate prediction of masked tokens. In our method, the noisy log lines generated by token permutation do not change the content information of any original token, so the template tokens that constrain the embedding vector of a log line are recognized more accurately. Therefore, our method, which does not have the above problems, exceeds the maximum parsing F-measure of the twelve compared methods.

In summary, unlike state-of-the-art methods, LogSL leverages the permutation language model to generate token sequences with noise. This strategy avoids the influence of the content information of the special $\langle\textit{MASK}\rangle$ token on the embedding of the log line. Moreover, LogSL effectively captures both the contextual information of the predicted tokens and the dependencies among the predicted tokens via the MPM. LogSL makes the MPM learn a distribution that assigns higher probabilities to tokens belonging to the template and lower probabilities to variable tokens, so that template tokens and variable tokens are easily distinguished. Therefore, the precision and recall of our method are higher in Figs 6 and 7 because log lines with different templates are not assigned to the same template and log lines with the same template are not assigned to different templates. Furthermore, since the F-measure is the harmonic mean of precision and recall, the F-measure of our method is also higher in Fig. 8.

Figure 9.

Comparison of LogSL and the other 12 log parsing methods in terms of PA.

Figure 9 compares the PA of LogSL with that of the other 12 log parsing methods. The PA of our method is nearly 100% on distributed system logs (HDFS, Spark, OpenStack, Hadoop, ZooKeeper), which is better than the baseline methods. The PA of our method on HDFS is significantly better than that of frequent pattern mining-based log parsing methods (SLCT, LFA, LogCluster). The PA of our method on Spark is significantly better than that of most frequent pattern mining-based log parsing methods (SLCT, LogCluster) and all cluster-based methods. The PA of our method on Hadoop is significantly better than that of most frequent pattern mining-based and all cluster-based log parsing methods. Our method achieves a higher PA on supercomputer logs (BGL, HPC, Thunderbird) than most methods. Most cluster-based log parsing methods cannot accurately parse log datasets containing similar ground truth templates; therefore, our method is superior to most cluster-based methods on HPC. The PA of our method on standalone software logs (Windows, Android, HealthApp, Apache) exceeds that of the 12 compared parsing methods. Because Windows and Apache have few ground truth templates, the PA of our method reaches 100%, which is clearly better than that of the frequent pattern mining-based log parsing methods. Because Android and HealthApp have more ground truth templates, the PA of our method is higher than that of most baseline methods. NuLog adds the special $\langle\textit{MASK}\rangle$ token, so the masked token is not predicted accurately, while our method generates token sequences with noise using a permutation language model and captures the contextual information of the predicted tokens and the dependencies among them using the proposed MPM. Therefore, our method is better than NuLog on the Android and Thunderbird datasets.

Figure 10.

Robustness evaluation of log parsing methods.

An excellent parsing method needs to perform robustly on log datasets of different formats. The robustness of log parsing means that a log parsing method has high accuracy on log data of various formats from different vendors. Therefore, when a parser parses different logs, the higher the average parsing accuracy and the smaller the difference between the maximum and minimum parsing accuracy, the better the robustness of the parser. In this paper, the robustness of LogSL is analyzed and compared with related methods. Figure 10 shows box plots of the PA distribution and a line graph of the mean PA for each log parsing method. It can be observed that LogSig has the lowest median and mean PA, and LogSL has the highest median and mean PA. Although most log parsing methods achieve a high PA of 90% on a given log dataset, their PA differs significantly when they are applied to all given log types. LogSL outperforms all baseline methods in terms of PA robustness with a minimum value of 87.8%, a median value of 99.24%, and a mean value of 97.5%, where the minimum, median, and mean values are higher than those of the best baseline method by 23.4%, 0.36%, and 3.9%, respectively. Thus, the difference between the maximum and minimum parsing accuracy of LogSL is 23.4% lower than that of the best baseline method. Our method has the best average PA for the following reasons: the noisy log lines generated by token permutation do not change the content information of any original token, and our multi-token prediction model can capture the contextual information of the predicted tokens and the dependencies between the predicted tokens.
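The robustness criteria described here (minimum, median, mean, and the gap between the maximum and minimum PA across datasets) can be computed as in the following sketch. The function name `robustness_summary` and the PA values are hypothetical placeholders and are not results from this paper.

```python
import statistics
from typing import Dict, Iterable


def robustness_summary(pa_values: Iterable[float]) -> Dict[str, float]:
    """Summarize per-dataset parsing accuracies of a single parser."""
    values = list(pa_values)
    if not values:
        raise ValueError("at least one PA value is required")
    return {
        "min": min(values),
        "median": statistics.median(values),
        "mean": statistics.mean(values),
        # A smaller spread means the parser behaves more uniformly.
        "spread": max(values) - min(values),
    }


# Hypothetical PA values of one parser on four datasets.
summary = robustness_summary([1.0, 0.9, 0.8, 1.0])
for key, value in summary.items():
    print(f"{key}: {value:.3f}")
```

A parser is considered more robust when its mean is higher and its spread is smaller than those of the methods it is compared with.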

To sum up, on individual datasets, LogSL has a higher PA than the baseline methods on the Android and Thunderbird datasets. Across all datasets, the average PA of LogSL is 3.9% higher than that of the best baseline method. Thus, our method is applicable to logs of inconsistent formats and insufficient labels.

Figure 11.

The average time of parsing each log line.

We use the average time taken to parse each log line to evaluate the efficiency of LogSL, as shown in Fig. 11. The average parsing time for each log line in HDFS, Spark, OpenStack, Hadoop, ZooKeeper, BGL, HPC, Thunderbird, Windows, Android, HealthApp, and Apache for LogSL is 10.458 ms, 8.14 ms, 7.19 ms, 7.068 ms, 7.049 ms, 7.166 ms, 86.037 ms, 7.437 ms, 7.433 ms, 7.539 ms, 5.349 ms, and 5.569 ms, respectively. Therefore, the average time for LogSL to parse each log line is 7.21045 ms, which means that our method can process more than 130 log lines per second and can adapt to large-scale applications. For example, according to [11], the HDFS system generated 11175629 log lines within 38.7 hours, an average of about 80 log lines per second. Our method can therefore parse the log lines generated by HDFS in a timely manner.
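The throughput argument can be reproduced with a few lines of arithmetic. The sketch uses only the figures quoted in this section and assumes that log lines are parsed sequentially, one at a time; the function names are hypothetical.

```python
MEAN_PARSE_MS = 7.21045   # reported average parsing time per log line
HDFS_LINES = 11175629     # log lines generated by HDFS, as quoted from [11]
HDFS_HOURS = 38.7         # time span over which they were generated


def lines_per_second(mean_parse_ms: float) -> float:
    """Parsing capacity under sequential processing."""
    return 1000.0 / mean_parse_ms


def arrival_rate(lines: int, hours: float) -> float:
    """Average number of log lines generated per second."""
    return lines / (hours * 3600.0)


capacity = lines_per_second(MEAN_PARSE_MS)
demand = arrival_rate(HDFS_LINES, HDFS_HOURS)
print(f"capacity: {capacity:.1f} lines/s, demand: {demand:.1f} lines/s")
```

The capacity comes out between 138 and 139 lines per second and the HDFS generation rate slightly above 80 lines per second, which is the margin the text relies on.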

We explored the effectiveness of each component in the MPM, and the results are shown in Table 2. When the LSTM layer is combined with the Attention layer in the MPM, the average PA of LogSL reaches 92.2%. When the XLNet layer is combined with the LSTM layer, the average PA of LogSL reaches 97.2%. When the XLNet layer is combined with the Attention layer, the average PA of LogSL reaches 97.1%. When the XLNet layer is combined with both the LSTM layer and the Attention layer, the average PA of LogSL reaches 97.5%. These results show that the XLNet layer, LSTM layer, and Attention layer all contribute to accurately parsing logs in LogSL.

Table 2

Ablation study results

Components	PA
LSTM $+$ Attention	0.922
XLNet $+$ LSTM	0.972
XLNet $+$ Attention	0.971
XLNet $+$ LSTM $+$ Attention	0.975

Figure 12.

Downstream anomaly detection model.

5. Case study of anomaly detection

Our MPM allows the coupling of the parsing method and downstream anomaly detection tasks, as shown in Fig. 12. We use the vector representation of the log line as the input of the anomaly detection model based on supervised learning. Firstly, the MPM is pre-trained in a self-supervised manner, and then we replace the Generator of the proposed MPM with a linear layer with dropout regularization that is fine-tuned by supervised training. Finally, the trained anomaly detection model is used to predict whether a given log line is normal or abnormal, i.e. if the binary cross entropy is greater than the threshold value, it is abnormal; otherwise, it is normal.
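The final decision rule can be sketched as follows. The sketch assumes that the fine-tuned model outputs a probability that a log line is abnormal and that the binary cross entropy is measured against the "normal" label; both points are assumptions, since the text only states that a cross entropy above the threshold indicates an abnormal line. The function names and the threshold value are hypothetical.

```python
import math


def binary_cross_entropy(probability_abnormal: float, label: int) -> float:
    """Binary cross entropy of one prediction (label 1 = abnormal, 0 = normal)."""
    eps = 1e-12  # keeps the logarithm finite at probabilities 0 and 1
    p = min(max(probability_abnormal, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def is_abnormal(probability_abnormal: float, threshold: float) -> bool:
    """Flag a log line when its loss against the 'normal' label is too high."""
    return binary_cross_entropy(probability_abnormal, label=0) > threshold


THRESHOLD = 0.5  # hypothetical; the text selects it on the validation split
print(is_abnormal(0.05, THRESHOLD))
print(is_abnormal(0.90, THRESHOLD))
```

With this illustrative threshold, a line with abnormal probability 0.05 is treated as normal and a line with probability 0.90 is flagged.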

We evaluate the anomaly detection case using BGL, where 80% of the dataset is used for training, 10% is used for validation and selection of the best threshold hyperparameter, and 10% is used for evaluation. The result is shown in Table 3. Our proposed MPM can capture the contextual information of the multiple predicted tokens and the dependencies between tokens that are predicted at the same time, improving the representation ability of log lines. As a result, the accuracy, precision, recall, and F1 of the anomaly detection case on the BGL dataset reach 99%, 99%, 100%, and 99.5%, respectively. This adds further evidence that the vector representation of log lines can be applied to different downstream tasks.
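As a consistency check, the reported F1 follows from the reported precision and recall when they are substituted into the harmonic mean used in Eq. (11):

$\displaystyle F1=\frac{2\cdot 1.0\cdot 0.99}{1.0+0.99}=\frac{1.98}{1.99}\approx 0.995$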

Table 3
Scores for the downstream anomaly detection on BGL

Accuracy	Precision	Recall	F1
0.99	0.99	1.0	0.995

6. Conclusion and future work

In this paper, we propose a log parsing method, namely LogSL, for robustly and accurately extracting log templates. Since LogSL trains the MPM through self-supervision, it does not require expert support and labels. We used twelve benchmark datasets to evaluate and compare the performance of our method with twelve existing log parsing methods. Compared with other methods, LogSL achieved much better robustness and accuracy. In addition, we give a case study of anomaly detection, which shows that the MPM in LogSL can be applied to system security analysis tasks. For future work, we will explore the compression of the token prediction model in LogSL to adapt to logs generated quickly in production. In addition, we will further study system security methods using log templates extracted by LogSL.

Acknowledgments

The work was supported by Jilin Science and Technology Development Plan Project of China No.20230508096RC and Chongqing Municipal Bureau of Science and Technology Project of China No.CSTB2022NSCQ-MSX1434.

References

1. Barik, DeLine, Drucker, Fisher, The Bones of the System: A Case Study of Logging and Telemetry at Microsoft, in: Proceedings of the 38th International Conference on Software Engineering Companion, 2016, pp. 92–101. ISBN 9781450342056. doi: 10.1145/2889160.2889231.

2. Zeng, Jiang, Lin, Fan, Shen, A distributed fault/intrusion-tolerant sensor data storage scheme based on network coding and homomorphic fingerprinting, IEEE Transactions on Parallel and Distributed Systems 23(10) (2012), 1819–1830. doi: 10.1109/TPDS.2011.294.

3. Dai, Chen C.-S., Shang, Chen T.-H., Logram: Efficient log parsing using n-gram dictionaries, IEEE Transactions on Software Engineering 48(3) (2022), 879–892. doi: 10.1109/TSE.2020.3007554.

4. Jin, Gan, Zomaya A.Y., Shadow-Chain: A decentralized storage system for log data, IEEE Network 34(4) (2020), 68–74. doi: 10.1109/MNET.011.1900385.

5. Tan, Wang, Zhao, Attack provenance tracing in cyberspace: Solutions, challenges and future directions, IEEE Network 33(2) (2019), 174–180. doi: 10.1109/MNET.2018.1700469.

6. Chen, Yang, Lyu M.R., A survey on automated log analysis for reliability engineering, ACM Comput. Surv 54(6) (2021). doi: 10.1145/3460345.

7. El-Masri, Petrillo, Guéhéneuc Y.-G., Hamou-Lhadj, Bouziane, A systematic literature review on automated log abstraction techniques, Information and Software Technology 122 (2020), 106276. doi: 10.1016/j.infsof.2020.106276.

8. Zhu, Lyu M.R., Towards automated log parsing for large-scale log data analysis, IEEE Transactions on Dependable and Secure Computing 15(6) (2018), 931–944. doi: 10.1109/TDSC.2017.2762673.

9. Vervaet, Chiky, Callau-Zori, USTEP: Unfixed Search Tree for Efficient Log Parsing, in: 2021 IEEE International Conference on Data Mining (ICDM), 2021, pp. 659–668. ISSN 2374-8486. doi: 10.1109/ICDM51629.2021.00077.

10. Duan, Ying, Cheng, Yuan, Yin, OILog: An online incremental log keyword extraction approach based on MDP-LSTM neural network, Information Systems 95 (2021), 101618. doi: 10.1016/j.is.2020.101618.

11. Zhu, Liu, Xie, Zheng, Lyu M.R., Tools and Benchmarks for Automated Log Parsing, in: 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), 2019, pp. 121–130. doi: 10.1109/ICSE-SEIP.2019.00021.

12. Vaarandi, A data clustering algorithm for mining patterns from event logs, in: Proceedings of the 3rd IEEE Workshop on IP Operations Management (IPOM 2003) (IEEE Cat. No.03EX764), 2003, pp. 119–126. doi: 10.1109/IPOM.2003.1251233.

13. Nagappan, Vouk M.A., Abstracting log lines to log event types for mining software system logs, in: 2010 7th IEEE Working Conference on Mining Software Repositories (MSR 2010), 2010, pp. 114–117. ISSN 2160-1860. doi: 10.1109/MSR.2010.5463281.

14. Vaarandi, Pihelgas, LogCluster – A data clustering and pattern mining algorithm for event logs, in: 2015 11th International Conference on Network and Service Management (CNSM), 2015, pp. 1–7. doi: 10.1109/CNSM.2015.7367331.

15. Zhang, Liu, Meng, Yang, Sun, Pei, Zhang, Song, Zhang, Efficient and robust syslog parsing for network devices in datacenter networks, IEEE Access 8 (2020), 30245–30261. doi: 10.1109/ACCESS..

16. Ying, Wang, Zhao, Shang, Huang, Cheng, Yang, Geng, An Improved KNN-Based Efficient Log Anomaly Detection Method with Automatically Labeled Samples, ACM Trans. Knowl. Discov. Data 15(3) (2021). doi: 10.1145/3441448.

17. Jiang Z.M., Hassan A.E., Flora, Hamann, Abstracting Execution Logs to Execution Events for Enterprise Applications (Short Paper), in: 2008 The Eighth International Conference on Quality Software, 2008, pp. 181–186. ISSN 2332-662X. doi: 10.1109/QSIC.2008.50.

18. Makanju A.A.O., Zincir-Heywood A.N., Milios E.E., Clustering Event Logs Using Iterative Partitioning, in: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009, pp. 1255–1264. ISBN 9781605584959. doi: 10.1145/1557019.1557154.

19. Zhu, Zheng, Lyu M.R., Drain: An Online Log Parsing Approach with Fixed Depth Tree, in: 2017 IEEE International Conference on Web Services (ICWS), 2017, pp. 33–40. doi: 10.1109/ICWS.2017.13.

20. Lou J.-G., Wang, Execution Anomaly Detection in Distributed Systems through Unstructured Log Analysis, in: 2009 Ninth IEEE International Conference on Data Mining, 2009, pp. 149–158. ISSN 2374-8486. doi: 10.1109/ICDM.2009.60.

21. Hamooni, Debnath, Zhang, Jiang, Mueen, LogMine: Fast Pattern Recognition for Log Analytics, in: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, 2016, pp. 1573–1582. ISBN 9781450340731. doi: 10.1145/2983323.2983358.

22. Tang, Perng C.-S., LogSig: Generating System Events from Raw Textual Logs, in: Proceedings of the 20th ACM International Conference on Information and Knowledge Management, 2011, pp. 785–794. ISBN 9781450307178. doi: 10.1145/2063576.2063690.

23. Kimura, Ishibashi, Mori, Sawada, Toyono, Nishimatsu, Watanabe, Shimoda, Shiomoto, Spatio-temporal factorization of log data for understanding network events, in: IEEE INFOCOM 2014 – IEEE Conference on Computer Communications, 2014, pp. 610–618. ISSN 0743-166X. doi: 10.1109/INFOCOM.2014.6847986.

24. Mizutani, Incremental Mining of System Log Format, in: 2013 IEEE International Conference on Services Computing, 2013, pp. 595–602. doi: 10.1109/SCC.2013.73.

25. Shima, Length Matters: Clustering System Log Messages using Length of Words, CoRR, abs/1611.03213, 2016.

26. Spell: Online streaming parsing of large unstructured system logs, IEEE Transactions on Knowledge and Data Engineering 31(11) (2019), 2213–2227. doi: 10.1109/TKDE.2018.2875442.

27. Yang, Qian, Dai, Zhu, An online log template extraction method based on hierarchical clustering, EURASIP Journal on Wireless Communications and Networking 2019(1) (2019), 135. doi: 10.1186/s13638-019-1430-4.

28. Kobayashi, Fukuda, Esaki, Towards an NLP-Based Log Template Generation Algorithm for System Log Analysis, in: Proceedings of The Ninth International Conference on Future Internet Technologies, 2014. ISBN 9781450329422. doi: 10.1145/2619287.2619290.

29. Thaler, Menkonvski, Petkovic, Towards a neural language model for signature extraction from forensic logs, in: 2017 5th International Symposium on Digital Forensic and Security (ISDFS), 2017, pp. 1–6. doi: 10.1109/ISDFS.2017.7916497.

30. Rücker, Maier, FlexParser – The adaptive log file parser for continuous results in a changing world, Journal of Software: Evolution and Process 34(3) (2022), e2426. doi: 10.1002/smr.2426.

31. Liu, Zhang, Kang, Lin, Dang, Rajmohan, Zhang, UniParser: A Unified Log Parser for Heterogeneous Log Data, in: Proceedings of the ACM Web Conference 2022, ACM, 2022. doi: 10.1145/3485447.3511993.

32. Nedelkoski, Bogatinovski, Acker, Cardoso, Kao, Self-Supervised Log Parsing, CoRR, abs/2003.07905, 2020.

33. Devlin, Chang, Lee, Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, CoRR, abs/1810.04805, 2018.

34. Uria, Côté M.-A., Gregor, Murray, Larochelle, Neural autoregressive distribution estimation, J. Mach. Learn. Res 17(1) (2016), 7184–7220.

35. Yang, Dai, Yang, Carbonell J.G., Salakhutdinov, Le Q.V., XLNet: Generalized Autoregressive Pretraining for Language Understanding, CoRR, abs/1906.08237, 2019.

36. Hochreiter, Schmidhuber, Long short-term memory, Neural Computation 9(8) (1997), 1735–1780. doi: 10.1162/neco.1997.9.8.1735.

37. Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez A.N., Kaiser, Polosukhin, Attention Is All You Need, CoRR, abs/1706.03762, 2017.

38. Shen, Zhou, Long, Jiang, Pan, Zhang, DiSAN: Directional Self-Attention Network for RNN/CNN-free Language Understanding, CoRR, abs/1709.04696, 2017.

39. Dowdell, Zhang, Is Attention All What You Need? – An Empirical Investigation on Convolution-Based Active Memory and Self-Attention, CoRR, abs/1912.11959, 2019.

40. Cho, van Merriënboer, Gulcehre, Bahdanau, Bougares, Schwenk, Bengio, Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1724–1734. doi: 10.3115/v1/D14-1179.

41. Zhu, Sobhani, Guo, Long Short-Term Memory over Recursive Structures, in: Proceedings of the 32nd International Conference on International Conference on Machine Learning – Volume 37, 2015, pp. 1604–1612.

42. Khalil, Eldash, Kumar, Bayoumi, Economic LSTM approach for recurrent neural networks, IEEE Transactions on Circuits and Systems II: Express Briefs 66(11) (2019), 1885–1889. doi: 10.1109/TCSII.2019.2924663.

43. Zhu, Lyu M.R., Loghub: A Large Collection of System Log Datasets towards Automated Log Analytics, CoRR, abs/2008.06448, 2020.

44. Zhu, Lyu M.R., An Evaluation Study on Log Parsing and Its Use in Log Mining, in: 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2016, pp. 654–661. ISSN 2158-3927. doi: 10.1109/DSN.2016.66.

45. Chunyong, Meng, Log Parser with One-to-One Markup, in: 2020 3rd International Conference on Information and Computer Technologies (ICICT), 2020, pp. 251–257. doi: 10.1109/ICICT50521.2020.00045.

46. Aussel, Petetin, Chabridon, Improving Performances of Log Mining for Anomaly Prediction Through NLP-Based Log Parsing, in: 2018 IEEE 26th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), 2018, pp. 237–243. ISSN 2375-0227. doi: 10.1109/MASCOTS.2018.00031.
