Abstract
Logs play an important role in anomaly detection, fault diagnosis, and trace checking of software and network systems. Log parsing, which converts each raw log line to a constant template and a variable parameter list, is a prerequisite for system security analysis. Traditional parsing methods utilizing specific rules can only parse logs of specific formats, and most parsing methods based on deep learning require labels. However, the existing parsing methods are not applicable to logs of inconsistent formats and insufficient labels. To address these issues, we propose a robust Log parsing method based on Self-supervised Learning (LogSL), which can extract templates from logs of different formats. The essential idea of LogSL is modeling log parsing as a multi-token prediction task, which makes the multi-token prediction model learn the distribution of tokens belonging to the template in raw log lines by self-supervision mode. Furthermore, to accurately predict the tokens of the template without labeled data, we construct a Multi-token Prediction Model (MPM) combining the pre-trained XLNet module, the n-layer stacked Long Short-Term Memory Net module, and the Self-attention module. We validate LogSL on 12 benchmark log datasets, resulting in the average parsing accuracy of our parser being 3.9% higher than that of the best baseline method. Experimental results show that LogSL has superiority in terms of robustness and accuracy. In addition, a case study of anomaly detection is conducted to demonstrate the support of the proposed MPM to system security tasks based on logs.
Introduction
With the increasing scale of software and network systems, security events will happen at any time [1, 2]. Logs play an important role in system security analysis such as anomaly detection, fault diagnosis, and trace checking because they record running status and important events of systems [3, 4, 5, 6]. However, logs, which are often unstructured and very large, can not be used directly for system security analysis [3]. Thus, it is necessary to make logs structured. Each log message generated by software or network systems is composed of the message header and the message content called the log line [7]. The message header is determined by the logging framework and can be extracted relatively easily, such as timestamp, verbosity, and component. In contrast, it is often difficult to make log lines structured, which were written by developers in natural language [7, 8]. Therefore, this study assumes that log lines have been extracted from the log message and focuses on automatically parsing log lines. Specifically, each unstructured log line is transformed into a log template that consists of constant tokens and a related parameter list that consists of variable tokens, which can be achieved using techniques such as regular expressions or source code analysis [9]. The log template represents the corresponding event type and the parameter list records the runtime information [8]. For example, a log line “Got allocated containers 1” is parsed into a template “Got allocated containers
Automatic log parsing has been a significant topic for system security analysis and available parsers can be divided into two types: traditional and deep learning (DL)-based [10]. Traditional log parsing schemes usually use frequent pattern mining, heuristics, and clustering to extract log templates [11]. The parsers using frequent pattern mining [12, 13, 14, 15, 16] usually assume that the log template is a frequent set of tokens, while the parsers using heuristics [17, 18, 19] and clustering [20, 21, 22, 23, 24, 25, 26, 27] extract log templates according to the unique characteristics of specific logs and the similarity between log lines, respectively. However, these traditional methods can only parse logs of specific formats and are not suitable for log data generated by modern large-scale systems. The main reason why these methods are not workable is that the formats of logs generated by systems are different [10, 19]. The existing log parsing schemes based on DL mainly focus on using log data to train deep parsing models, which can be applied to logs of inconsistent formats, including supervised and unsupervised. Nevertheless, supervised DL methods need labeled logs during the model training phase, which are invalid for unlabeled logs [28, 29, 30, 31]. Duan et al. [10] predicted the subsequent token according to tokens within the sliding window and used the original token as the label, which avoided labeling massive logs. However, this method is not satisfactory as it ignores the token information behind the predicted token. Nedelkoski et al. [32] have explored self-supervised DL log parsing technology by predicting each token of each log line, but this method replaced a token in the original log line with a special
The design concept of the robust log parsing method based on self-supervised learning (LogSL) proposed in this paper comes from natural language processing, which uses the idea of token prediction to extract log templates. Specifically, multiple tokens in each log line are predicted at the same time consecutively. Tokens that are correctly predicted form part of the template, while incorrectly predicted tokens serve as parameters. Besides, a multi-token prediction model (MPM) trained in a self-supervised way is proposed in this paper, which fuses the XLNet and the n-layer stacked Long Short-Term Memory net with the self-attention mechanism to capture the contextual information of the predicted tokens and the dependencies among the predicted tokens. We conducted extensive experimental evaluations on 12 real benchmark log datasets. Experimental results show that the average parsing accuracy of LogSL is 3.9% higher than the best baseline method, and the difference between the maximum and minimum values of the parsing accuracy of LogSL is 23.4% lower than the best baseline method. Furthermore, we present a case study of how to combine the proposed MPM with an anomaly detection task, where a detection accuracy of 99% can be achieved on the benchmark BGL log dataset.
The main contributions of this paper are as follows:
We propose a robust log parsing method, which can extract templates from logs of different formats.
A novel multi-token prediction model is proposed, which can predict tokens of the template more accurately and does not need labels in the training phase.
LogSL is evaluated on 12 benchmark log datasets, showing that LogSL has good robustness and accuracy.
We design an anomaly detection case using the proposed MPM, showing that the trained MPM can be used as prior knowledge for system security tasks.
The rest of the paper is organized as follows. In Section 2, we summarize the related work. Section 3 introduces the log parsing method LogSL. In Section 4, we evaluate the robustness and accuracy of LogSL for unlabeled logs of different formats. In Section 5, we describe how to combine the MPM with the downstream anomaly detection task. Finally, Section 6 summarizes the full paper and puts forward the further research direction.
The purpose of log parsing is to find the log template to match each log line and facilitate the system security analysis based on logs. An excellent parsing method should meet the following desirable standards [10, 11]:
Robustness: Log parsing methods should work with all formats of log data from different vendors.
No-supervision: Log parsing methods need to work without any domain knowledge or label data.
Accuracy: Log parsing methods should have high accuracy.
The current log parsing approaches can be roughly divided into traditional and DL-based approaches [33]. We analyzed all of them based on the above standards.
Traditional log parsers are often designed by the technology of frequent pattern mining, heuristic rules, and clustering. Frequent pattern mining assumes that the log template is a frequent set of tokens and builds frequent itemsets based on tokens, token frequency, or token n-grams. The simple log file clustering tool (SLCT) [12] operates by mining frequent tokens from the dataset, constructing cluster candidate sets, subsequently selecting clusters, and extracting log events from these sets based on support thresholds, but the lower threshold may result in the derivation of overly specific log templates. The log file abstraction (LFA) [13] extracted log events based on the frequency of the token in a particular location. LogCluster [14] used frequent tokens that were identified by support threshold to partition the clusters of log lines and extracted the same log template for all log lines in the same cluster. This method is better at handling log messages with flexible parameter lengths. Craftsman [15] is the first framework to implement incremental parsing using the ideas of prefix trees and frequent patterns. The log parsing method based on n-gram and frequent pattern mining (NFPM) [16] used n-grams to divide logs with long identical consecutive strings into the same group and a frequent pattern mining algorithm to extract constant tokens from each group of similar data. Nevertheless, these methods are invalid for logs violating the assumption that frequent tokens always appear at the same location in the log. Algorithms based on heuristic rules use the characteristics of the log to extract log templates, such as the abstracting execution logs (AEL) [17], iterative partitioning log mining (IPLoM) [18], and Drain [19]. AEL used a hard-coded heuristic method based on system knowledge to identify dynamic fields in log lines, and abstracted execution events based on different bins divided according to the numbers of tokens and estimated parameters. 
IPLoM utilized an iterative partitioning strategy to group log lines based on log line length, token location, and mapping relationships. Drain applied a fixed-depth tree structure to represent log lines and efficiently extracted common templates, which avoided constructing a profound and unbalanced tree. This method assumed that log lines with the same template had the same length, and the token at the beginning of the log line was more likely to be a constant token. However, these heuristic-based parsers can only parse logs of the specific format. Since the process of generating log templates is very similar to the clustering problem, the researchers used clustering to extract log templates according to the similarity between log lines. The log key extraction (LKE) [20] and LogMine [21] utilized the hierarchical clustering algorithm to group similar logs based on weighted edit distances, but they need regular expressions based on domain knowledge to detect a set of user-defined types. Instead of conducting clustering on raw log messages directly, LogSig [22] extracted the signature of logs first, then the clustering was performed, which categorized log lines into a set of templates by searching the most representative signatures of the log lines. Kimura et al. [23] proposed the statistical template extraction (STE) to obtain log templates using density-based spatial clustering. The scalable handler for incremental system log (SHISO) [24] and the length matters clustering (LenMa) [25] used the idea of incremental clustering technology to parse logs. The streaming parser for event logs using the longest common subsequence (Spell) [26] utilized prefix tree and subsequence matching for log parsing, which clustered the current longest common subsequence objects that might have the same message type together in the merge procedure. 
The online log template extraction method based on hierarchical clustering (LogOHC) [27] used hierarchical clustering, which required regular expressions for preprocessing. However, the accuracy of clustering methods is not high for logs with a large number of similar templates and clustering methods need domain knowledge.
In summary, since these traditional methods rely on domain knowledge about specific logs to extract log templates, they do not generalize to logs of different formats, and they require domain expert support and manual intervention. Therefore, this paper uses the idea of multi-token prediction to extract log templates, so that logs in various formats can be parsed.
To adapt to logs of various formats, deep learning has been introduced into log parsing through both supervised [28, 29, 30, 31] and unsupervised methods [10, 32]. The natural language processing-log template generation (NLP-LTG) [28] considered log parsing as a problem of labeling sequence data with natural language, and classified tokens into constants and variables using Conditional Random Fields (CRF). The neural language model-for signature extraction (NLM-FSE) [29] trained a character-based neural network to extract log templates from log lines. Rücker et al. [30] proposed the flexible parser (FlexParser) using a stateful Long Short-Term Memory network that forced the model to learn parsing event logic rather than direct classification relations. Liu et al. [31] proposed to use the Token Encoder module and Context Encoder module to capture patterns of templates and parameters and use the Context Similarity module to focus on the commonalities of learned patterns. Nevertheless, these log parsing methods based on supervised deep learning need labeled log data, and obtaining labeled log data often requires a lot of time and effort, as well as the support of domain experts. Therefore, they cannot parse unlabeled log data. To avoid labeling massive log data, Duan et al. [10] proposed an unsupervised log parsing method that utilized tokens within the sliding window to predict the next token and used the original token as the label, but this method disregarded the information from tokens following the predicted token. Nedelkoski et al. [32] explored self-supervised learning to identify variables and generate the corresponding event templates, a log parsing method referred to as NuLog. This method used masked-language modeling to randomly mask the input tokens and used a neural network model to predict the masked input tokens. Nevertheless, since NuLog replaces a token in the original log line with a special
In summary, existing deep learning-based log parsing methods either require labeled data or have low accuracy, so they do not meet the requirements of no-supervision and high accuracy. Therefore, this paper proposes a multi-token prediction model based on self-supervised learning. In addition, to prevent artificially added mask tokens from affecting parsing performance, this paper generates token sequences with noise using a permutation language model.
In our work, we introduce LogSL, a robust log parsing method based on self-supervised learning. Compared with existing work, we model log parsing as a multi-token prediction task, which can accurately extract templates for unlabeled logs of different formats. Specifically, in the training phase, the token sequences with noise are used as inputs to the multi-token prediction model, and the original log tokens are used as labels to supervise model training. In the prediction phase, the trained multi-token prediction model is used to predict each token: if a token is predicted correctly, it is a token of the template; otherwise, it is a variable. We verify the robustness and accuracy of LogSL on benchmark log datasets.
In view of the fact that existing log parsing methods are not robust to unlabeled logs of different formats, we propose a robust log parsing method based on self-supervised learning (LogSL), which can automatically extract templates from log lines without any domain knowledge and log labels. The core idea of LogSL is modeling log parsing as a multi-token prediction task, including four steps: Tokenization, Permutation And Sampling, Multi-token Prediction Model, and Extraction. As shown in Fig. 1, for the log line “Got allocated containers 1”, since the token “1” is not correctly predicted, while other tokens are correctly predicted, the token “1” is identified as variable and other tokens as constant.
Parsing the log line “Got allocated containers 1” with LogSL.
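The decision rule illustrated above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the function name `extract_template`, the `<*>` wildcard symbol, and the hard-coded `predicted` list (standing in for the output of a trained model) are all assumptions made for illustration.

```python
def extract_template(tokens, predicted, wildcard="<*>"):
    """Keep correctly predicted tokens as constants; treat the rest as variables.

    `wildcard` is an assumed placeholder symbol for variable positions.
    """
    if len(tokens) != len(predicted):
        raise ValueError("tokens and predicted must have the same length")
    template, parameters = [], []
    for original, guess in zip(tokens, predicted):
        if original == guess:
            template.append(original)      # correctly predicted: constant token
        else:
            template.append(wildcard)      # mispredicted: variable token
            parameters.append(original)
    return " ".join(template), parameters


tokens = ["Got", "allocated", "containers", "1"]
predicted = ["Got", "allocated", "containers", "2"]  # hypothetical model output
template, parameters = extract_template(tokens, predicted)
```

Here the only mispredicted position is the last one, so it becomes the single entry of the parameter list while the other three tokens form the template.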
Log parsing first requires extracting each log line as a list of tokens. In different systems, using the same regular expression to extract tokens will affect the accuracy of log parsing. If a regular expression is developed for each system, it will take a lot of human resources, and when the system is updated or upgraded, the accuracy of the parsing will also decrease. Therefore, for general logs, we employ spaces or tabs for segmentation. For logs that resist such straightforward splitting, we utilize regular expressions for specialized processing. For example, the log line “Got allocated containers 1” is split into the token list [“Got”, “allocated”, “containers”, “1”] by spaces. We denote the number of all unique tokens split out of the log as
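A minimal sketch of this tokenization step is shown here, assuming whitespace splitting by default and an optional, caller-supplied regular expression for logs that need special handling. The helper names `tokenize` and `build_vocabulary` and the example pattern are hypothetical and are not taken from the paper.

```python
import re


def tokenize(log_line, pattern=None):
    """Split a log line into tokens.

    By default the line is split on runs of spaces or tabs. `pattern` is an
    optional regular expression (an assumption of this sketch) used as the
    delimiter for logs that resist plain whitespace splitting.
    """
    if pattern is None:
        return log_line.split()
    return [tok for tok in re.split(pattern, log_line) if tok]


def build_vocabulary(log_lines):
    """Map every unique token in the log to an integer index."""
    vocab = {}
    for line in log_lines:
        for tok in tokenize(line):
            vocab.setdefault(tok, len(vocab))
    return vocab


tokens = tokenize("Got allocated containers 1")
vocab = build_vocabulary(["Got allocated containers 1",
                          "Got allocated containers 2"])
```

The size of `vocab` is the number of unique tokens in the log, which later determines the output dimension of the Generator.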
Permutation and sampling
Since it takes experts a lot of time and effort to obtain log labels, this paper uses self-supervised learning for log parsing, which takes each log line with noise as input and each original log line as the label. The log lines with noise generated by token permutation do not change the content information of any original tokens, making the template token constraining the embedding vector of the log lines more accurately recognized. Thus, this paper draws on Permutation Language Model (PLM) [33, 34] to obtain all permutation sequences of all tokens in the token list of each log line. Assuming that the given token list is
Given a token list
Mask matrix of token permutation sequence “containers
This paper utilizes the mask matrix [35] to characterize the sampled token permutation sequence, whose implementation method is to mask the tokens that do not work in the prediction process, and the original order of the token list is not changed. The mask matrix of the token permutation sequence “containers
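To make the role of the mask matrix concrete, the following sketch builds XLNet-style visibility masks from a sampled permutation while leaving the token list in its original order. It is an illustration under stated assumptions: the function name `permutation_masks`, the convention that 1 means "visible", and the example permutation `order` are hypothetical and do not reproduce the paper's own example.

```python
def permutation_masks(order):
    """Build content-stream and query-stream masks for a factorization order.

    order[k] is the original position of the k-th token in the sampled
    permutation. Entry [i][j] is 1 if position i may attend to position j
    (assumed convention: 1 = visible, 0 = masked).
    """
    n = len(order)
    rank = {pos: k for k, pos in enumerate(order)}
    # Query stream: a position sees only tokens that come strictly earlier
    # in the permutation, so it never sees its own content.
    query = [[1 if rank[j] < rank[i] else 0 for j in range(n)]
             for i in range(n)]
    # Content stream: same as the query stream, plus the position itself.
    content = [[1 if rank[j] <= rank[i] else 0 for j in range(n)]
               for i in range(n)]
    return content, query


tokens = ["Got", "allocated", "containers", "1"]
order = [2, 0, 3, 1]  # hypothetical permutation: containers, Got, 1, allocated
content_mask, query_mask = permutation_masks(order)
```

Rows and columns are indexed by original position, so the same token embeddings can be reused for every sampled permutation; only the mask changes.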
To predict tokens of the template more accurately, this paper proposes a multi-token prediction model (MPM) using XLNet [36], the n-layer stacked Long Short-Term Memory (LSTM) [37, 38, 39], and a self-attention mechanism [40, 41, 42], which is a self-supervised model including a training phase and a prediction phase. In the training phase, noise is added to the original log lines using the token permutation. The log lines with noise are used as input. The original log lines are used as the label. The model parameters are adjusted by reducing the loss value through backpropagation, and at the same time, the optimal hyperparameters are selected. In the prediction phase, we predict the multi-token of each new log line through model forward propagation, then generate the corresponding log template. Figure 3 describes the complete architecture of MPM. The log lines with noise are jointly characterized by the random embedding vector of the original log lines and mask matrix, which are the input of MPM. Firstly, the XLNet layer is used to obtain the context information of multiple tokens that are predicted and to capture the dependencies between multiple tokens that are predicted at the same time. Then, the features are further screened by an n-layer stacked LSTM (n-LSTM) layer, in which LSTM uses three gating methods to control the historical information of the sequence to update the current state value. Thirdly, the attention layer is used to assign different weights according to the importance of the features. Finally, the probability distribution of all unique log tokens is obtained by using a Generator, which is a single linear layer with softmax activation of all unique tokens in the dataset.
The proposed complete architecture of MPM.
Principle of two-stream self-attention of the token permutation sequence “containers
To capture the contextual information of multiple tokens predicted and the dependencies between multiple tokens that are predicted at the same time, this paper introduces two-stream self-attention of the pre-trained model XLNet, which includes content stream and query stream. The content hidden state of the content stream encodes both the content of the predicted tokens and their contextual information but does not encode their location information. The query hidden state of the query stream only encodes contextual information of the predicted tokens and their location information but does not encode their content information. Specifically, given the permutation token sampling sequence
where
Dividing by
As shown in Fig. 4, the working principle of the content stream and the query stream when the token permutation sequence “containers
The n-LSTM layer uses n-layer stacked LSTM to further filter the token sequence features output by the XLNet layer. In a single-layer LSTM, the output of an LSTM unit includes the cell state and the hidden state. The hidden state and the cell state of the previous LSTM unit are passed to the next LSTM unit, and the hidden state is also passed to the stacked upper layer LSTM as its input. Each LSTM unit in the bottom layer corresponds to the output of the query stream of the XLNet layer, as shown in Fig. 5,
n-LSTM layer architecture.
The LSTM unit can solve the problem of loss of learning ability through a special gate structure, which avoids the loss of information caused by the large distance between the predicted tokens and the relevant tokens. It consists of three gate structures: forget gate, input gate, and output gate. The calculation formula of each gate control unit is as follows:
The forget gate determines what information is retained in the cell state at the previous moment to the cell state at the current moment,
where The input gate determines how much information is input to the cell state
where The output gate controls how much information the cell state
where
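For reference, the three gates can be written in the standard textbook form of the LSTM. The notation here is an assumption and may differ from the paper's own symbols: $x_t$ is the input at step $t$, $h_{t-1}$ and $c_{t-1}$ are the previous hidden and cell states, $\sigma$ is the sigmoid function, $\odot$ is element-wise multiplication, and $W_\ast$, $b_\ast$ are learned weights and biases.

```latex
\begin{align}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) && \text{(input gate)} \\
\tilde{c}_t &= \tanh\left(W_c\,[h_{t-1}, x_t] + b_c\right) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)} \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) && \text{(output gate)} \\
h_t &= o_t \odot \tanh\left(c_t\right) && \text{(hidden state)}
\end{align}
```

In this form the forget gate $f_t$ scales the previous cell state, the input gate $i_t$ scales the candidate update, and the output gate $o_t$ decides how much of the cell state is exposed as the hidden state.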
Each token in the token sequence has a different influence on the predicted tokens. Therefore, the Attention layer uses a self-attention mechanism to extract the important part of the token sequence for the predicted tokens. The self-attention mechanism receives the output of the n-LSTM layer as input. According to the importance of different feature vectors, weights are assigned to them. Finally, the attention layer gives a weighted vector representation through softmax normalization, as shown in Eq. (2).
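The weighting idea can be illustrated with plain scaled dot-product self-attention over a short sequence of feature vectors. This is a sketch under assumptions: the paper's exact scoring function is not reproduced here, no learned projection matrices are used, and the names `softmax` and `self_attention` are chosen for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(features):
    """Return one weighted combination of all feature vectors per position.

    Assumed scoring: dot product between feature vectors, scaled by the
    square root of the feature dimension.
    """
    dim = len(features[0])
    outputs = []
    for query in features:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in features]
        weights = softmax(scores)  # importance of each position
        outputs.append([sum(w * vec[c] for w, vec in zip(weights, features))
                        for c in range(dim)])
    return outputs


features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy n-LSTM outputs
attended = self_attention(features)
```

Because the weights at each position sum to one, every output vector is a convex combination of the input feature vectors.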
Generator
The last component of MPM consists of a single linear layer with Softmax that receives the output of the Attention layer. The linear layer maps the output vector of the Attention layer to a vector whose size corresponds to the total number of unique tokens in the log dataset. The subsequent Softmax is used to compute the probability distribution over the unique tokens of the log dataset. During training, the original tokens at the predicted positions are used as labels for self-supervised learning.
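A minimal sketch of such a Generator, written in plain Python so that the two steps are explicit. The toy vocabulary, hidden vector, weights, and biases are made-up values for illustration; in practice the weights would be learned and the hidden vector would come from the Attention layer.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generator(hidden, weights, bias):
    """Linear layer (one row of weights per vocabulary token) plus softmax."""
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


vocab = ["Got", "allocated", "containers", "1"]   # toy vocabulary
hidden = [0.5, -1.0, 2.0]                         # toy attention output
weights = [[0.1, 0.0, 0.2],
           [0.0, 0.3, 0.1],
           [0.4, -0.2, 0.9],
           [0.0, 0.1, -0.5]]
bias = [0.0, 0.0, 0.0, 0.0]

probs = generator(hidden, weights, bias)
best = vocab[max(range(len(vocab)), key=probs.__getitem__)]
```

The output has one probability per unique token, which is why the vocabulary size of a dataset drives the number of parameters in this layer.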
Extraction
The extraction of all log templates in the log dataset is performed online using the trained model. We take each log line as input and sample the permutation sequences in a way that masks multiple tokens consecutively. We measure the ability of the model to predict each token, thereby deciding whether the token is a constant part of the template or a variable. Higher than confidence
Evaluation
To verify the robustness and accuracy of LogSL, we carried out experiments on benchmark log datasets.
Experimental environment and datasets
We conducted all experimental evaluations on a machine running Ubuntu 18.04.5 LTS with an Intel(R) Xeon(R) W-2123 3.60 GHz CPU, an NVIDIA TITAN Xp GPU, and 64 GB of RAM.
The datasets we experimented with come from distributed system logs (HDFS, Spark, OpenStack, Hadoop, ZooKeeper), supercomputer logs (BGL, HPC, and Thunderbird), and standalone software logs (Windows, Android, HealthApp, Apache) in the loghub data repository [43]. To enable reproducibility, we follow the guidelines in [11] and use a random sample of 2000 log messages from each dataset for which ground truth templates are available, as shown in Table 1.
In the experiment, the proposed MPM includes two layers of the two-stream self-attention mechanism with 128 hidden units, two layers of the LSTM with 128 hidden units, and one layer of the self-attention mechanism with 128 hidden units. Since the linear layer in the Generator maps the output vector of the Attention layer to a vector whose size corresponds to the total number of unique tokens in the log dataset, the number of unique tokens in each dataset affects the total number of parameters of the proposed MPM. The total number of model parameters for each dataset is shown in Table 1.
log datasets and model parameters
We use the precision [44], recall [45], and F-measure [46] of log parsing on 12 benchmark log datasets to quantify the accuracy of LogSL for logs of different formats. Precision is the fraction of generated log templates that match the true log templates from the ground truth. Recall is the ratio of the number of correctly generated log templates to the total number of true log templates in the ground truth. The F-measure is the harmonic mean of precision and recall. They are defined as follows: Precision = TP / (TP + FP), Recall = TP / (TP + FN), and F-measure = 2 × Precision × Recall / (Precision + Recall).
The precision, recall, and F-measure are computed from the numbers of true positives (TP), false positives (FP), and false negatives (FN), which are counted over pairs of log lines. A true positive (TP) occurs when a method assigns two log lines with the same ground truth template to the same template. A false positive (FP) occurs when a method assigns two log lines with different ground truth templates to the same template. A false negative (FN) occurs when a method assigns two log lines with the same ground truth template to different templates.
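The pairwise counting described above can be sketched as follows. The function name `pairwise_scores` and the toy labels are assumptions for illustration; the formulas are the standard definitions of precision, recall, and the harmonic-mean F-measure.

```python
from itertools import combinations


def pairwise_scores(predicted, truth):
    """Compute pairwise precision, recall, and F-measure.

    predicted[i] and truth[i] are the template ids assigned to log line i
    by the parser and by the ground truth, respectively.
    """
    tp = fp = fn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_pred = predicted[i] == predicted[j]
        same_true = truth[i] == truth[j]
        if same_pred and same_true:
            tp += 1          # same true template, grouped together
        elif same_pred:
            fp += 1          # different true templates, grouped together
        elif same_true:
            fn += 1          # same true template, split apart
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


truth = ["E1", "E1", "E1", "E2", "E2"]       # toy ground truth
predicted = ["A", "A", "B", "B", "B"]        # toy parser output
scores = pairwise_scores(predicted, truth)
```

In this toy case there are two pairs of each kind (TP, FP, FN), so all three scores equal 0.5.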
In addition, we further evaluated the accuracy and robustness of LogSL using the parsing accuracy (PA) [10]. The PA is defined as the ratio of correctly parsed log lines over the total number of log lines. After parsing, each log line is assigned to a log template. A log line is considered correctly parsed if its log template corresponds to the same group of log lines as the ground truth does. For example, if a log line sequence
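The PA criterion is stricter than the pairwise metrics, as the following sketch shows. It assumes that each log line carries a predicted template id and a ground truth template id; the function name `parsing_accuracy` and the toy labels are hypothetical.

```python
from collections import defaultdict


def parsing_accuracy(predicted, truth):
    """Fraction of log lines whose predicted group equals their true group.

    A line counts as correctly parsed only if the set of lines sharing its
    predicted template is exactly the set sharing its ground truth template.
    """
    if not truth or len(predicted) != len(truth):
        raise ValueError("inputs must be non-empty and of equal length")
    pred_groups, true_groups = defaultdict(set), defaultdict(set)
    for idx, (p, t) in enumerate(zip(predicted, truth)):
        pred_groups[p].add(idx)
        true_groups[t].add(idx)
    correct = sum(1 for p, t in zip(predicted, truth)
                  if pred_groups[p] == true_groups[t])
    return correct / len(truth)


truth = ["E1", "E1", "E2", "E2", "E3"]      # toy ground truth
predicted = ["A", "A", "B", "C", "D"]       # toy parser output
pa = parsing_accuracy(predicted, truth)
```

Splitting the two E2 lines into separate groups makes both of them incorrect, so only three of the five lines count as correctly parsed.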
This section describes the accuracy and robustness of LogSL. We compared the precision, recall, F-measure, and PA of LogSL with those of 12 log parsing methods: SLCT [12], LFA [13], LogCluster [14], AEL [17], Drain [19], LKE [20], LogSig [22], LogMine [21], SHISO [24], LenMa [25], Spell [26], and NuLog [32] on 12 benchmark datasets.
Comparisons of LogSL and other 12 log parsing methods in the precision.
Fig. 6 compares the precision of LogSL with that of the other 12 log parsing methods. The parsing precision of our method on distributed system logs (HDFS, Spark, OpenStack, Hadoop, ZooKeeper) is close to 100%. Because the number of ground truth templates for HDFS, Spark, OpenStack, and ZooKeeper is small, and the log lines in Hadoop are relatively simple, the results of our method are similar to those of most methods; nevertheless, our method is significantly superior to SLCT and LogSig on HDFS, to LKE and LogMine on Spark, and to LFA and LKE on OpenStack. The parsing precision of our method on supercomputer logs (BGL, HPC, and Thunderbird) surpasses that of most methods and reaches 100% on BGL and Thunderbird. On HPC, where some of the templates are very similar, the method of computing similarity affects the parsing precision of clustering-based methods, whereas the similarity between templates does not affect the parsing performance of our method; therefore, our parsing method is superior to most clustering algorithms on HPC. The parsing precision of our method reaches nearly 100% on standalone software logs (Windows, Android, HealthApp, Apache). Due to the small number of ground truth templates for Windows and Apache, the results of our method are similar to those of most methods. Because Android and HealthApp have more ground truth templates, the parsing precision of our method is higher than that of most baseline methods. Our method can capture dependencies among multiple tokens, so it is slightly better than NuLog.
Comparisons of LogSL and other 12 log parsing methods in the recall.
Fig. 7 compares the recall of LogSL with that of the other 12 log parsing methods. The recall of LogSL is 100% on eight datasets: HDFS, Spark, OpenStack, Hadoop, BGL, Windows, HealthApp, and Apache. The recall of LogSL exceeds 99% on ZooKeeper, HPC, Thunderbird, and Android, which are complex and have a large number of ground truth templates. Most frequent pattern mining-based log parsing methods set lower thresholds, which leads to log templates that are too specific. Some heuristic-based log parsing methods are uncertain when separating similar templates. When parsing complex logs, some clustering-based algorithms suffer from insufficient clustering, which affects the parsing recall. These problems do not occur in our method. Therefore, the parsing recall of our method is similar to that of most methods on distributed system logs (HDFS, Spark, OpenStack, Hadoop, ZooKeeper), and our method is significantly superior to SLCT, LFA, and LogSig on HDFS, to SLCT on Spark, to SLCT, LogCluster, AEL, and Drain on Hadoop, and to SLCT, LogCluster, LKE, and LogMine on ZooKeeper. The parsing recall of our method is slightly better than that of most methods on supercomputer logs (BGL, HPC, and Thunderbird), and significantly better than that of some frequent pattern mining-based methods and most cluster-based methods. Our method is significantly superior to most parsing methods in parsing recall on standalone software logs, including a recall 4.7% higher than that of NuLog on Android.
Comparisons of LogSL and other 12 log parsing methods in the F-measure.
Fig. 8 compares the F-measure of LogSL with that of the other 12 log parsing methods. The parsing F-measure of LogSL reached 100% on seven datasets: HDFS, Spark, OpenStack, Hadoop, BGL, Windows, and Apache. The parsing F-measure exceeded 99% on five complex datasets, ZooKeeper, HPC, Thunderbird, Android, and HealthApp, with values of 99.9%, 99.4%, 99.8%, 99.5%, and 99.7%, respectively. Log templates obtained by most frequent pattern mining-based log parsing methods are too specific, which affects parsing recall. Some heuristic-based log parsing methods are uncertain when separating similar templates. In some clustering-based methods, the method of computing similarity affects the parsing precision. NuLog used the special
In summary, unlike state-of-the-art methods, LogSL leverages the permutation language model to generate token sequences with noise. This strategy helps circumvent the influence of the content information of the special
Comparisons of LogSL and other 12 log parsing methods in the PA.
Fig. 9 compares the PA of LogSL with that of the other 12 log parsing methods. The PA of our method reaches nearly 100% on distributed system logs (HDFS, Spark, OpenStack, Hadoop, ZooKeeper), which is better than the baseline methods. The PA of our method on HDFS is significantly better than that of frequent pattern mining-based log parsing methods (SLCT, LFA, LogCluster). The PA of our method on Spark is significantly better than that of most frequent pattern mining-based log parsing methods (SLCT, LogCluster) and all cluster-based methods. The PA of our method on Hadoop is significantly better than that of most frequent pattern mining-based and all cluster-based log parsing methods. Our method has a higher PA on supercomputer logs (BGL, HPC, and Thunderbird) than most methods. Most of the cluster-based log parsing methods cannot accurately parse log datasets containing similar ground truth templates; therefore, our parsing method is superior to most cluster-based methods on HPC. The PA of our method on standalone software logs (Windows, Android, HealthApp, Apache) exceeds that of the 12 parsing methods compared. Due to the small number of ground truth templates in Windows and Apache, the PA of our method reaches 100%, which is clearly better than that of the frequent pattern mining-based log parsing methods. Because Android and HealthApp have more ground truth templates, the PA of our method is higher than that of most baseline methods. NuLog added the special
Robustness evaluation of log parsing methods.
A good parsing method must perform robustly on log datasets of different formats. Robustness here means that a log parsing method achieves high accuracy on log data of various formats from different vendors. Therefore, across different logs, the higher the average parsing accuracy and the smaller the difference between the maximum and minimum parsing accuracy, the more robust the parser. In this paper, the robustness of LogSL is analyzed and compared with related methods. Figure 10 shows box plots of the PA distribution and a line graph of the mean PA for each log parsing method. LogSig has the lowest median and mean PA, and LogSL has the highest median and mean PA. Although most log parsing methods achieve high PA values of 90% on a given log dataset, their PA differs significantly across all the given log types. LogSL outperforms all baseline methods in terms of PA robustness, with a minimum of 87.8%, a median of 99.24%, and a mean of 97.5%; these values are higher than those of the best baseline method by 23.4%, 0.36%, and 3.9%, respectively. Thus, the difference between the maximum and minimum parsing accuracy of LogSL is 23.4% lower than that of the best baseline method. Our method has the best average PA for two reasons: the noisy log lines generated by token permutation do not change the content information of any original token, and our multi-token prediction model captures both the contextual information of the predicted tokens and the dependencies between them.
To sum up, on individual datasets, LogSL has a higher PA than the baseline methods on the Android and Thunderbird datasets. Across all datasets, the average PA of LogSL is 3.9% higher than that of the best baseline method. Thus, our method is applicable to logs of inconsistent formats and insufficient labels.
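The robustness criterion described above (a higher average PA together with a smaller gap between the maximum and minimum PA) can be sketched in a few lines of Python. The function names and the PA values below are hypothetical and purely illustrative; they are not the measurements plotted in Fig. 10.

```python
from statistics import mean, median

def robustness_summary(pa_values):
    """Summarize parsing accuracy (PA, in percent) across datasets."""
    lo, hi = min(pa_values), max(pa_values)
    return {
        "mean": mean(pa_values),
        "median": median(pa_values),
        "min": lo,
        "max": hi,
        "spread": hi - lo,  # max minus min; smaller means more consistent
    }

def is_more_robust(a, b):
    """True if parser a has both a higher mean PA and a smaller spread than b."""
    sa, sb = robustness_summary(a), robustness_summary(b)
    return sa["mean"] > sb["mean"] and sa["spread"] < sb["spread"]

# Hypothetical PA values (percent) for two parsers on four datasets.
parser_a = [100.0, 99.0, 95.0, 90.0]
parser_b = [100.0, 98.0, 80.0, 60.0]

print(robustness_summary(parser_a))
print(is_more_robust(parser_a, parser_b))
```

Under this reading, a parser with a few very high scores but one very low score is penalized through the spread term even if its mean remains competitive.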
The average time of parsing each log line.
We use the average time to parse each log line to evaluate the efficiency of LogSL, as shown in Fig. 11. The average parsing time per log line for LogSL on HDFS, Spark, OpenStack, Hadoop, ZooKeeper, BGL, HPC, Thunderbird, Windows, Android, HealthApp, and Apache is 10.458 ms, 8.14 ms, 7.19 ms, 7.068 ms, 7.049 ms, 7.166 ms, 86.037 ms, 7.437 ms, 7.433 ms, 7.539 ms, 5.349 ms, and 5.569 ms, respectively. The average time for LogSL to parse each log line is 7.21045 ms, which means that our method can process more than 130 log lines per second and can adapt to large-scale applications. For example, according to [11], the HDFS system generated 11175629 log lines within 38.7 hours, an average of about 80 log lines per second. Our method can therefore parse the log lines generated by HDFS in a timely manner.
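The throughput argument above is simple arithmetic and can be checked directly. The sketch below uses only the figures stated in the text (7.21045 ms per line, 11175629 HDFS lines in 38.7 hours); the function names are illustrative, and sequential, single-stream parsing is an assumption.

```python
def lines_per_second(ms_per_line):
    """Parsing throughput implied by an average per-line latency."""
    return 1000.0 / ms_per_line

def arrival_rate(total_lines, hours):
    """Average number of log lines generated per second."""
    return total_lines / (hours * 3600.0)

avg_ms = 7.21045                            # average parsing time per line, from the text
throughput = lines_per_second(avg_ms)       # lines parsed per second
hdfs_rate = arrival_rate(11_175_629, 38.7)  # HDFS lines generated per second

print(f"parsing throughput: {throughput:.1f} lines/s")
print(f"HDFS arrival rate:  {hdfs_rate:.1f} lines/s")
print("parser keeps up:", throughput > hdfs_rate)
```

The comparison concerns average rates only; bursts above the average arrival rate would need buffering or parallel parsing.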
We explored the effectiveness of each component of the MPM; the results are shown in Table 2. When the MPM uses only the LSTM layer and the Attention layer, the average PA of LogSL is 92.2%. With the XLNet layer and the LSTM layer, the average PA is 97.2%. With the XLNet layer and the Attention layer, it is 97.1%. With the XLNet layer, the LSTM layer, and the Attention layer together, it reaches 97.5%. The XLNet, LSTM, and Attention layers therefore each contribute to accurate log parsing in LogSL.
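To make the contribution of each component explicit, the ablation figures quoted above can be turned into the drop in average PA observed when one component is removed from the full model. This is only a restatement of the reported numbers; the dictionary layout and function name are illustrative.

```python
# Average PA (%) for each configuration, as reported in the text.
average_pa = {
    ("LSTM", "Attention"): 92.2,
    ("XLNet", "LSTM"): 97.2,
    ("XLNet", "Attention"): 97.1,
    ("XLNet", "LSTM", "Attention"): 97.5,
}
full = ("XLNet", "LSTM", "Attention")

def drop_when_removed(component):
    """Percentage points of average PA lost when one component is removed."""
    reduced = tuple(c for c in full if c != component)
    return round(average_pa[full] - average_pa[reduced], 1)

for component in full:
    print(f"without {component}: -{drop_when_removed(component)} points")
```

Read this way, removing the pre-trained XLNet layer costs by far the most (5.3 points), while the LSTM and Attention layers each account for less than half a point.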
Ablation study results
Downstream anomaly detection model.
Our MPM allows the parsing method to be coupled with downstream anomaly detection tasks, as shown in Fig. 12. We use the vector representation of the log line as the input of a supervised anomaly detection model. First, the MPM is pre-trained in a self-supervised manner. We then replace the Generator of the MPM with a linear layer with dropout regularization, which is fine-tuned by supervised training. Finally, the trained anomaly detection model predicts whether a given log line is normal or abnormal: if the binary cross entropy is greater than the threshold value, the line is abnormal; otherwise, it is normal.
We evaluate the anomaly detection case on BGL: 80% of the dataset is used for training, 10% for validation and selection of the best threshold hyperparameter, and 10% for evaluation. The results are shown in Table 3. Our proposed MPM captures the contextual information of the predicted tokens and the dependencies between tokens that are predicted at the same time, which improves the representation of log lines. As a result, the accuracy, precision, recall, and F1 of anomaly detection on the BGL dataset reached 99%, 99%, 100%, and 99.5%, respectively. This result adds further evidence for the hypothesis that the log line vector representation can be applied to different downstream tasks.
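The final decision rule can be illustrated with a minimal sketch. It assumes, hypothetically, that the fine-tuned model outputs a probability that a log line is normal and that the binary cross entropy is computed against the "normal" label; the text does not specify these details, and the threshold value below is a placeholder rather than the tuned value.

```python
import math

def bce(p, target, eps=1e-12):
    """Binary cross entropy of predicted probability p against a 0/1 target."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1 - target) * math.log(1.0 - p))

def classify(p_normal, threshold):
    """Flag a log line as abnormal when its loss exceeds the threshold."""
    loss = bce(p_normal, target=1)  # assumption: loss is measured against "normal"
    return "abnormal" if loss > threshold else "normal"

THRESHOLD = 0.5  # hypothetical; in practice selected on the validation split

print(classify(0.99, THRESHOLD))  # confident "normal" prediction, small loss
print(classify(0.10, THRESHOLD))  # low "normal" probability, large loss
```

Because the loss grows as the predicted probability of "normal" falls, thresholding the loss is equivalent to thresholding that probability under this assumption.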
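As a quick consistency check, the reported F1 follows from the reported precision and recall, and the 80/10/10 split can be expressed with integer arithmetic. The function names and the example dataset size are illustrative and not taken from the paper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def split_sizes(n):
    """Sizes of an 80% / 10% / 10% train / validation / test split of n items."""
    train = n * 80 // 100
    val = n * 10 // 100
    test = n - train - val  # remainder goes to the test split
    return train, val, test

# Precision 99% and recall 100%, as reported for BGL.
print(round(100 * f1_score(0.99, 1.00), 1))

# Hypothetical dataset size, used only to show the split proportions.
print(split_sizes(1000))
```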
Scores for the downstream anomaly detection on BGL
In this paper, we propose LogSL, a log parsing method for robustly and accurately extracting log templates. Because LogSL trains the MPM through self-supervision, it requires neither expert support nor labels. We used twelve benchmark datasets to evaluate our method and compare it with twelve existing log parsing methods. Compared with these methods, LogSL achieved better robustness and accuracy, with an average PA 3.9% higher than that of the best baseline method. In addition, a case study of anomaly detection shows that the MPM in LogSL can be applied to system security analysis tasks. In future work, we will explore compressing the token prediction model in LogSL so that it can keep pace with logs generated at high rates in production. We will also further study system security methods that use the log templates extracted by LogSL.
Acknowledgments
The work was supported by Jilin Science and Technology Development Plan Project of China No.20230508096RC and Chongqing Municipal Bureau of Science and Technology Project of China No.CSTB2022NSCQ-MSX1434.
