Improving process discovery by filtering noises based on event dependency

Abstract

Process discovery techniques analyze process logs to extract models that characterize the behavior of business processes. In real-life logs, however, noises exist and adversely affect the extraction and thus decrease the understandability of discovered models. In this paper, we propose a novel double granularity filtering method, executed on both the event and trace levels, to detect noises by analyzing the directly-following and parallel relations between events. Based on the probability of an event occurring in a sequence, the infrequent behaviors and redundant events in the logs can be filtered out. In addition, the missing events in parallel blocks are detected to further improve the performance of filtering. Experiments on synthetic logs and five real-life datasets demonstrate that our method significantly outperforms other state-of-the-art methods.

Keywords

Process discovery, process mining, event logs, noise filtering, event dependency, parallel relation

1. Introduction

Process mining [1] is a technique dedicated to extracting process-related knowledge from event logs and providing profound insight for stakeholders. Process discovery, one of the most challenging process mining tasks, captures the behavioral characteristics of processes from event logs and builds process models that reflect the actual process execution from a control-flow perspective [2]. In other words, a process is discovered by analyzing process logs to extract sequential patterns that characterize the behavior of business processes. Such sequential patterns can serve as the building blocks of process models. High-precision process models not only visually demonstrate the actual flow of service execution but also help stakeholders acquire deeper insights into the real business and support their decision-making, such as detecting process deviations and taking proactive actions in advance [3]. One of the greatest challenges that process discovery faces is diagnosing the noises existing in event logs, which cause unnecessary structures in discovered models and impair their precision and understandability [4]. It is generally believed that cleaning event logs to address quality issues is necessary prior to conducting a process discovery analysis [5].

False logs refer to unexpected and incorrect data that are generated during system operations. System malfunction is considered the main cause of false logs; it usually omits events (i.e., some events are not recorded in logs) [6] or adds redundant events (i.e., some events are repeatedly recorded in logs) [7]. Infrequent behaviors are behaviors that are not expected to occur but actually happen with very low frequency in real execution, and their existence also seriously impacts the extraction of process models [8]. The above-mentioned incorrect or infrequent data in real-life logs (collectively called noises in this paper) [9] not only increase the complexity of discovered process models, but also reduce their precision [10]. Thus, in many cases, approximations to the process model can be discovered by using a fraction of the event data [11].

In general, business process activities are performed in accordance with a well-designed process model (i.e., Reference Model), and the execution of these activities is recorded in logs for process discovery. However, noises that are unexpectedly generated in logs during real executions greatly impede the discovery. As shown in Fig. 1, the reference model contains only sequential and exclusive structures, which means the recorded logs should contain only two types of traces (i.e., trace $<\textit{ABCF}>$ and trace $<\textit{ADEF}>$ ) if the system is running normally. In reality, however, abnormal conditions occur occasionally (e.g., system downtime) and lead to exceptional records in event logs (i.e., trace $<\textit{ABDECF}>$ ), which greatly increase the difficulty of process discovery and complicate process models (i.e., Discovered Model).

Figure 1.

Example of negative effects of noise.

To improve the results of process discovery, a range of methods have been proposed to filter noises in event logs. However, some of them cannot detect all types of noise, such as missing events, while others degrade the quality of models discovered from small logs by removing entire traces rather than specific incorrect events [6, 5, 9]. To address these problems, in this paper, we propose a novel method of filtering out noises in event logs, called Double Granularity Filtering (DGF), to ease process discovery based on the concept of dependency. Here, dependency refers to the relationship between events in recorded logs, indicating the probability of one event occurring after another. The higher the dependency value between two events, the stronger their relationship. In our opinion, closely related events are more likely to be normal, while loosely related events are more likely to be infrequent behaviors. More specifically, in our method, we couple local dependency and global dependency as mixed dependency to conduct fine-grained filtering of abnormal events based on statistical probability. Here, we use local dependency to denote the frequency of two sequentially occurring events among all paired ones beginning with the first event or ending with the second one, while global dependency is defined to denote the frequency of two sequentially occurring events in the whole log. In addition, considering that fine-grained filtering cannot detect some specific scenarios such as event loss, we add a coarse-grained filtering step that removes traces through a double-tier penalty mechanism, thereby achieving double granularity filtering. Comparison experiments on synthetic and real-life datasets with two other filtering methods verify the precision and superiority of our method.

The major contributions of our paper are as follows: 1) We propose a novel double granularity filtering method, which considers both the events and traces, to detect noises more precisely. 2) We introduce the definitions of Local and Global Dependency to reflect the relation of two events precisely. 3) We define a concept of parallel block to detect missing events that are supposed to exist in the parallel structure. 4) We conduct extensive experiments to verify the effectiveness of our method on synthetic logs and five real-life datasets by comparing it with other state-of-the-art filtering methods.

The rest of this paper is structured as follows: After discussing the related work on log noise filtering in Section 2, we introduce preliminaries and details of our method in Section 3. Section 4 evaluates our method on multiple datasets against existing methods. Finally, we conclude the paper and discuss future work in Section 5.

2. Related work

Given the growing availability of event logs, automated process discovery has drawn equally growing interest [12]. However, noises in event logs always decrease the understandability of discovered models. To address this problem, various techniques have been put forward during the past decade to remove the impact of noises for better process model discovery. Because noise filtering is mainly implemented in two different stages of process mining, the related research on noise detection is accordingly divided into two categories.

The first category deals with noises during the mining process. Most early process discovery techniques (e.g., the ${\alpha}$ algorithm [13] and the region-based algorithm [14]) require noise-free and complete logs, which greatly limits them in real practice. To address this problem, three noise-tolerant process discovery algorithms [15, 16, 17] have since been proposed, i.e., the Heuristics Miner (HM), the Fuzzy Miner and the IMi algorithm. The HM algorithm [15] proposes a new process modeling language (i.e., Causal Nets) to characterize business process models, and filters noises in event logs through three custom threshold settings (i.e., the dependency threshold, the positive observations threshold and the relative to best threshold). However, since Causal Nets are used for modeling in HM, the discovered model is not guaranteed to be sound and may contain deadlocks. The Fuzzy Miner [16] mainly addresses the problem that models discovered from event logs by traditional process discovery methods are too generalized when the business process is unstructured. This method borrows the concept of road maps to build a process model and provides differently simplified process views through custom configurations. In addition, it defines two fundamental metrics (i.e., correlation and significance) to filter low-frequency behaviors. One main drawback of this algorithm is that the models it mines lack formal semantics. Another common discovery algorithm is the IMi algorithm. The IMi algorithm [17] is an extension of the well-known Inductive Miner (IM) algorithm [18] with the ability to handle incomplete logs. It uses the process tree structure to characterize the process model and improves time efficiency by using a divide-and-conquer strategy.
Unlike the above three noise-tolerant process discovery algorithms, the Integer Linear Programming (ILP) miner [19] integrates noise into the discovered model, rather than filtering it out. However, the algorithm only performs well when the event log contains only frequent behaviors. To compensate for this shortcoming, Zelst et al. [20] propose a revised ILP-based process discovery algorithm and use a sequence encoding filtering technique to cope with the presence of infrequent behaviors based on trace frequency information.

The second category pre-processes logs and removes noises before the mining process. According to the granularity of noise processing, this category of methods is further divided into two types: fine-grained filtering (i.e., removing specific abnormal events from logs) and coarse-grained filtering (i.e., removing entire anomalous traces from logs). Conforti et al. [9] proposed constructing an automaton to filter events in logs. The core idea is to build an Anomaly-Free Automaton by removing infrequent events in the logs. The original logs are then filtered by replaying them on the constructed Anomaly-Free Automaton using an alignment-based technique. Similarly, Zelst et al. [21] proposed a method that is suitable for noise filtering in online environments. This method first constructs a probabilistic automaton using the historical event stream, and then filters the incoming event stream and automatically adjusts the probabilistic automaton to adapt to the new online environment. Tax et al. [22] proposed the Activity Filter (AF) for filtering out events based on information theory and Bayesian statistics. However, this technique may erroneously filter some correct events because the same task can be performed in different ways. In addition, the PROM framework [23] provides many plugins for log preprocessing. In particular, the Filter Log using Prefix-Closed Language (PCL) plugin filters abnormal events through its defined language rules, so that the trace can be expressed via a PCL. Meanwhile, the Filter Log using Simple Heuristics (SLF) plugin filters noisy events based on the frequency with which the events occur or their positions in traces. Moreover, Ghionna et al. [24] proposed a method to distinguish abnormal traces from normal traces. It first uses clustering algorithms to obtain the main behavior patterns in the logs, and then filters traces that contain infrequent behavior patterns. Budalakoti et al. [25] used unsupervised clustering to cluster the sequences of events in logs, and then filtered out the traces that deviate from the centers of the clusters. Florez-Larrahondo et al. [26] used Hidden Markov models (HMMs) to filter abnormal data. Sani et al. [6] proposed a filtering method, the Matrix Filter (MF), that fully considers the relationship between the current activity and the prefix activity sequence, and removes traces containing events whose conditional probabilities between sequences of activities are below a set threshold. Nguyen et al. focused on detecting anomalous values and reconstructing missing values at the level of attributes in event logs [27]. They proposed methods based on autoencoders, which can reconstruct their own input and are particularly suitable for learning a model of the complex relationships among attribute values in an event log. Likewise, by auto-encoding the event-level and state-level features, Ni et al. [28] represented process instances in a comprehensive and compact form so as to predict the remaining execution time of business process instances. Besides, Vidgof et al. [29] put forward an interactive log-delta analysis technique that enables analysts to interactively establish the range for log filtering and further explore manual differentiation between typical and atypical behaviors. BINet, a neural network architecture proposed by Nolle et al. [30], is another work worthy of attention; it can detect anomalies in event logs not only at the case level but also at the event attribute level.

Inspired by previous studies, we propose a novel double granularity filtering method to detect noises while pre-processing business logs. Here, we first couple local dependencies and global dependencies as mixed dependencies to perform fine-grained filtering of noisy events based on statistical probability. Secondly, we employ a penalty mechanism to punish the traces after removing the noisy events. Afterwards, we define a concept of parallel block to detect missing events that are supposed to exist in the parallel structure, and combine it with the penalty strategy to determine the abnormality of traces to complete the second part of coarse-grained filtering. Finally, we test and compare the performance of our method with other filtering methods on both synthetic and real-life datasets.

3. Approach

In this section, we introduce our noise filtering method, i.e., DGF, from three aspects. Firstly, we demonstrate the basic definitions of concepts used in this paper. Secondly, we introduce how our method works to finely filter noises based on mixed dependency. Finally, we explain the strategies used in the coarse-grained filtering stage.

3.1 Preliminaries

In information systems, the executed activities are usually recorded in logs to help the relevant personnel understand and analyze the practical process [31]. Table 1 provides an example event log of a credit application process in a bank, where each row represents an event. Here, CaseID indicates which process execution instance the event belongs to, Activity gives the name of the activity executed by the event, StartingTime (CompletionTime) denotes the starting time (completion time) of the event, and finally Resource shows the execution resource of the event.

Table 1
Example of real-life logs

CaseID	Activity	StartingTime	CompletionTime	Resource
1	Registration	2014/04/02 16:00:48.00	2014/04/02 16:00:48.00	System
1	Acceptance of requests	2014/04/02 16:00:48.00	2014/04/02 16:18:43.00	Group1
1	Collection of documents	2014/04/02 16:18:43.00	2014/04/02 17:47:48.00	Group1
1	Completeness check	2014/04/02 17:47:48.00	2014/04/02 19:05:04.00	Group2
…	…	…	…	…

Definition 1 (Event). An event denotes a running activity in a process instance with its attribute, which can be represented using the tuple $e=(\textit{CaseID},(d_{1},v_{1}),\cdots,(d_{m},v_{m}))$ . Here, CaseID denotes the unique identifier of the process instance to which the event belongs, and $(d_{1},v_{1}),\ldots,(d_{m},v_{m})$ denotes the attributes and their corresponding values of the event.

For example, as indicated by the first row of Table 1, we have $e=(1,(\textit{Activity},\textit{Registration}),(\textit{StartingTime},2014/04/02\ 16\colon 00\colon 48.00),(\textit{CompletionTime},2014/04/02\ 16\colon 00\colon 48.00),(\textit{Resource},\textit{System}))$ .

Definition 2 (Trace). The execution of each process instance is represented by a trace $\sigma=\langle{{e}_{1}},\ldots,{{e}_{n}}\rangle$ , which is an ordered sequence of several events. For the sake of simplicity, we usually use the corresponding name of activity to indicate one specific event. In other words, a trace can be represented by $\sigma=\langle\underline{{{e}_{1}}},\ldots,\underline{{{e}_{n}}}\rangle$ , where $\underline{{{e}_{i}}}$ denotes the activity name of ${{e}_{i}}$ .

For example, as indicated in Fig. 1, we have $\sigma=\langle A,B,C,F\rangle$ . It is noteworthy that, when considering a trace, we only care about the activities it contains and their order. Therefore, traces that contain the same activities in the same order share the same trace representation.

Definition 3 (Log). A log $\mathscr{L}=[\sigma_{1}^{{{f}_{1}}},\ldots,\sigma_{m}^{{{f}_{m}}}]$ is a multiset of traces that contains all records of instances executed in the process, where ${{\sigma}_{i}}$ represents a trace in the log, and ${{f}_{i}}$ is the number of times ${{\sigma}_{i}}$ appears in the log.

For example, as indicated in Fig. 1, we have ${{L}}=[{{\langle\textit{ABCF}\rangle}^{100}},{{\langle\textit{ADEF}\rangle}^{99}},{{\langle\textit{ABDECF}\rangle}^{1}}]$ .

By analyzing event logs, the relationships between events can be calculated and analyzed. Based on the stability assumption of system operation, either the probability of noise occurrence or the proportion of noises in logs is small. Therefore, we can analyze logs from the perspective of probability and statistics to identify noises. In this paper, we focus on two kinds of relations between events, i.e., directly following relation and parallel relation. Figure 2 shows an example of the two relationships.

Figure 2.

Example of directly following relation and parallel relation.

Definition 4 (Directly Following Relation). Let $\mathscr{L}$ be an event log on event set $\mathscr{E}$ . Given two events ${{e}_{i}},{{e}_{j}}\in\mathscr{E}$ , if ${{{e}}_{j}}$ directly follows ${{{e}}_{i}}$ in trace $\sigma$ (or ${{{e}}_{j}}$ occurs immediately after ${{{e}}_{i}}$ with no other events between them), we assume there exists a Directly Following Relation from ${{{e}}_{i}}$ to ${{{e}}_{j}}$ , denoted as ${{e}_{i}}>_{\mathscr{L}}{{e}_{j}}$ . Here, ${{{e}}_{i}}$ is the preceding event of ${{{e}}_{j}}$ , and ${{{e}}_{j}}$ is the succeeding event of ${{{e}}_{i}}$ .

In order to describe the association strength between two events, we use $|{{e}_{i}}>_{\mathscr{L}}{{e}_{j}}|$ to indicate the number of times event ${{{e}}_{j}}$ occurs directly after event ${{{e}}_{i}}$ in the whole log.

For example, given log ${{L}_{1}}=[{{\langle\textit{ABCF}\rangle}^{100}},{{\langle\textit{ADEF}\rangle}^{99}},{{\langle\textit{ABDECF}\rangle}^{1}}]$ shown in Fig. 1, we have $A>_{\mathscr{L}_{1}}B$ and $|A>_{\mathscr{L}_{1}}B|=101$ .
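The count $|{{e}_{i}}>_{\mathscr{L}}{{e}_{j}}|$ can be obtained in a single pass over the log. The following is a minimal Python sketch, assuming the log is encoded as a mapping from trace strings (one character per activity) to frequencies; the function name `directly_follows_counts` and this encoding are illustrative choices, not part of the original method.

```python
from collections import Counter

# Log L1 from Fig. 1, encoded as {trace: frequency} (illustrative encoding).
L1 = {"ABCF": 100, "ADEF": 99, "ABDECF": 1}

def directly_follows_counts(log):
    """Return a Counter mapping (e_i, e_j) to |e_i > e_j| over the whole log."""
    counts = Counter()
    for trace, freq in log.items():
        # zip pairs every event with its immediate successor in the trace
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += freq
    return counts

counts = directly_follows_counts(L1)
print(counts[("A", "B")])  # 101
print(counts[("B", "D")])  # 1
```

Pairs that never occur adjacently simply yield a count of 0, which is what the later definitions rely on.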

Definition 5 (Dependency Degree). Given two events ${{e}_{i}},{{e}_{j}}\in\mathscr{E}$ , we use $|{{e}_{i}}\Rightarrow_{\mathscr{L}}{{e}_{j}}|$ to denote the intensity of the dependency relation, or Dependency Degree, between ${{e}_{i}}$ and ${{e}_{j}}$ , as Eq. (1) indicates.

$\displaystyle|{{e}_{i}}\Rightarrow_{\mathscr{L}}{{e}_{j}}|=\frac{|{{e}_{i}}>_{\mathscr{L}}{{e}_{j}}|-|{{e}_{j}}>_{\mathscr{L}}{{e}_{i}}|}{|{{e}_{i}}>_{\mathscr{L}}{{e}_{j}}|+|{{e}_{j}}>_{\mathscr{L}}{{e}_{i}}|+1}$ (1)

If $|{{e}_{i}}\Rightarrow_{\mathscr{L}}{{e}_{j}}|$ is close to 1, it means there is a strong Directly Following Relation from ${{e}_{i}}$ to ${{e}_{j}}$ . Conversely, if $|{{e}_{i}}\Rightarrow_{\mathscr{L}}{{e}_{j}}|$ is close to $-$ 1, then ${{e}_{j}}$ is often directly followed by ${{e}_{i}}$ . It is noteworthy there may be two events that do not have any directly following relationship between them. Under such circumstance, to avoid dividing by 0, we add 1 to the denominator when calculating Dependency Degree.

For example, as indicated from Fig. 1, since $|A>_{\mathscr{L}}B|=101$ and $|B>_{\mathscr{L}}A|=0$ , we have $|A\Rightarrow_{\mathscr{L}}B|=(101-0)/(101+0+1)=0.99$ .
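Eq. (1) can be checked numerically on the log of Fig. 1. The sketch below is only illustrative: the helper names and the dictionary encoding of the log are assumptions made for this example.

```python
from collections import Counter

L1 = {"ABCF": 100, "ADEF": 99, "ABDECF": 1}  # log of Fig. 1 (illustrative encoding)

def directly_follows_counts(log):
    counts = Counter()
    for trace, freq in log.items():
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += freq
    return counts

def dependency_degree(counts, a, b):
    """Eq. (1): (|a>b| - |b>a|) / (|a>b| + |b>a| + 1)."""
    ab, ba = counts[(a, b)], counts[(b, a)]
    # The +1 in the denominator keeps the value defined when both counts are 0.
    return (ab - ba) / (ab + ba + 1)

counts = directly_follows_counts(L1)
print(round(dependency_degree(counts, "A", "B"), 2))  # 0.99
print(dependency_degree(counts, "D", "B"))            # -0.5
```

The second call shows the sign convention: B is directly followed by D once and never the other way round, so the degree from D to B is negative.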

Definition 6 (Parallel Relation). When $|{{e}_{i}}\Rightarrow_{\mathscr{L}}{{e}_{j}}|=0$ , which implies that no dominant Directly Following Relation exists from $e_{i}$ to $e_{j}$ or from $e_{j}$ to $e_{i}$ , there exists a Parallel Relation between $e_{i}$ and $e_{j}$ .

For example, given log ${{L}_{2}}=[{{\langle\textit{SABCDGI}\rangle}^{32}},{{\langle\textit{SABDCGI}\rangle}^{32}},{{\langle\textit{SACBDGI}\rangle}^{32}},{{\langle\textit{SACDBGI}\rangle}^{32}},{{\langle\textit{SADBCGI}\rangle}^{32}},{{\langle\textit{SADCBGI}\rangle}^{32}},{{\langle\textit{SAEFHI}\rangle}^{80}},{{\langle\textit{SAFEHI}\rangle}^{80}}]$ shown in Fig. 2, we have $|B\Rightarrow_{\mathscr{L}_{2}}C|=|B\Rightarrow_{\mathscr{L}_{2}}D|=|C\Rightarrow_{\mathscr{L}_{2}}D|=0$ and $|E\Rightarrow_{\mathscr{L}_{2}}F|=0$ , indicating that B and C, B and D, C and D, and E and F hold Parallel Relations.

Definition 7 (Parallel Block, Parallel Block Set). A Parallel Block $PB=\{{{e}_{1}},{{e}_{2}},\ldots,{{e}_{k}}\}$ is a set of events in which every pair of events holds a Parallel Relation. A Parallel Block Set $\textit{PBS}=\{P{{B}_{1}},P{{B}_{2}},\ldots,P{{B}_{n}}\}$ is a set of parallel blocks, where events from different parallel blocks are not in parallel.

For example, in Fig. 2, the Parallel Block Set of the log ${{L}_{2}}$ is constructed with two Parallel Blocks $P{{B}_{1}}=\{B,C,D\}$ and $P{{B}_{2}}=\{E,F\}$ , denoted as $\textit{PBS}({{L}_{2}})=[P{{B}_{1}},P{{B}_{2}}]=[\{B,C,D\},\{E,F\}]$ .
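One possible way to derive the Parallel Block Set of a log is sketched below in Python. Two simplifications are assumptions of this sketch rather than statements from the definitions: a pair is treated as parallel only if its Dependency Degree is 0 and the two events were observed adjacently at least once (otherwise events that never meet would also qualify), and blocks are formed as connected components of the resulting parallel graph, which coincides with mutually parallel sets on this example.

```python
from collections import Counter

L2 = {"SABCDGI": 32, "SABDCGI": 32, "SACBDGI": 32, "SACDBGI": 32,
      "SADBCGI": 32, "SADCBGI": 32, "SAEFHI": 80, "SAFEHI": 80}

def directly_follows_counts(log):
    counts = Counter()
    for trace, freq in log.items():
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += freq
    return counts

def parallel_blocks(log):
    counts = directly_follows_counts(log)
    events = sorted({e for trace in log for e in trace})
    neighbours = {e: set() for e in events}
    for i, a in enumerate(events):
        for b in events[i + 1:]:
            ab, ba = counts[(a, b)], counts[(b, a)]
            # Eq. (1) is 0 exactly when ab == ba; ab > 0 is this sketch's extra guard.
            if ab > 0 and ab == ba:
                neighbours[a].add(b)
                neighbours[b].add(a)
    blocks, seen = [], set()
    for e in events:
        if e in seen or not neighbours[e]:
            continue
        block, stack = set(), [e]
        while stack:  # depth-first walk over the parallel graph
            cur = stack.pop()
            if cur not in block:
                block.add(cur)
                stack.extend(neighbours[cur] - block)
        seen |= block
        blocks.append(block)
    return blocks

blocks = parallel_blocks(L2)
print([sorted(b) for b in blocks])  # [['B', 'C', 'D'], ['E', 'F']]
```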

Definition 8 (Preceding Event Set, Succeeding Event Set). For a given event ${{e}_{k}}\in\mathscr{E}$ , its Preceding Event Set is the collection of all its preceding events, denoted by ${{U}_{\textit{pre}}}({{e}_{k}})=\{{{e}_{i}}\in\mathscr{E}\mid{{e}_{i}}>_{\mathscr{L}}{{e}_{k}}\}$ . Similarly, its Succeeding Event Set is the collection of all its succeeding events, denoted by ${{U}_{\textit{suc}}}({{e}_{k}})=\{{{e}_{j}}\in\mathscr{E}\mid{{e}_{k}}>_{\mathscr{L}}{{e}_{j}}\}$ . Further, we use $|{{U}_{\textit{pre}}}({{e}_{k}})|$ and $|{{U}_{\textit{suc}}}({{e}_{k}})|$ to indicate the number of events in these two sets respectively.

Definition 9 (Predecessor Density, Successor Density). For $\forall{{e}_{k}}\in\mathscr{E}$ , we define Predecessor Density ${{D}_{\textit{pre}}}({{e}_{k}})$ and Successor Density ${{D}_{\textit{suc}}}({{e}_{k}})$ according to the Directly Following Relation and the Preceding/Succeeding Event Set as Eqs (2) and (3). If event $e_{k}$ does not have any preceding events (or succeeding events), its $D_{\textit{pre}}(e_{k})$ (or $D_{\textit{suc}}(e_{k})$ ) is set to 0.

$\displaystyle{{D}_{\textit{pre}}}({{e}_{k}})=\frac{\mathop{\sum}_{{{e}_{t}}\in{{U}_{\textit{pre}}}({{e}_{k}})}|{{e}_{t}}>_{\mathscr{L}}{{e}_{k}}|}{|{{U}_{\textit{pre}}}({{e}_{k}})|},\quad|U_{\textit{pre}}(e_{k})|\neq 0$ (2)

$\displaystyle{{D}_{\textit{suc}}}({{e}_{k}})=\frac{\mathop{\sum}_{{{e}_{t}}\in{{U}_{\textit{suc}}}({{e}_{k}})}|{{e}_{k}}>_{\mathscr{L}}{{e}_{t}}|}{|{{U}_{\textit{suc}}}({{e}_{k}})|},\quad|U_{\textit{suc}}(e_{k})|\neq 0$ (3)

For example, in log ${{L}_{1}}$ mentioned above, for event $D$ , we have ${{U}_{\textit{pre}}}(\textit{D})=\{A,B\}$ , ${{U}_{\textit{suc}}}({D})=\{E\}$ , ${{D}_{\textit{pre}}}(D)=\frac{|A>_{\mathscr{L}_{1}}D|+|B>_{\mathscr{L}_{1}}D|}{|{{U}_{\textit{pre}}}({D})|}=\frac{99+1}{2}=50$ , and ${{D}_{\textit{suc}}}(D)=\frac{|D>_{\mathscr{L}_{1}}E|}{|{{U}_{\textit{suc}}}({D})|}=\frac{100}{1}=100$ .
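The two densities of Eqs (2) and (3) are averages of directly-follows counts over the preceding and succeeding event sets. A minimal Python sketch follows; the function names and the log encoding are assumptions made for illustration.

```python
from collections import Counter

L1 = {"ABCF": 100, "ADEF": 99, "ABDECF": 1}  # log of Fig. 1 (illustrative encoding)

def directly_follows_counts(log):
    counts = Counter()
    for trace, freq in log.items():
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += freq
    return counts

def densities(counts, event):
    """Return (D_pre, D_suc) of an event; an empty event set yields 0."""
    pre = [c for (_, b), c in counts.items() if b == event and c > 0]
    suc = [c for (a, _), c in counts.items() if a == event and c > 0]
    d_pre = sum(pre) / len(pre) if pre else 0.0
    d_suc = sum(suc) / len(suc) if suc else 0.0
    return d_pre, d_suc

counts = directly_follows_counts(L1)
print(densities(counts, "D"))  # (50.0, 100.0)
```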

3.2 Fine-grained filtering

This subsection introduces how to comprehensively consider the local and global dependencies of events to complete the Fine-grained filtering of noises, which refers to filtering noises at the event level. We assume that abnormal events occur less frequently than normal events as generally accepted [9]. Therefore, by analyzing the execution context of an event, we determine whether it is a noise or not.

Definition 10 (Local Dependency). Local Dependency denotes the frequency of two sequentially occurring events among all paired ones beginning with the first event or ending with the second one. In other words, given two events ${{e}_{i}},{{e}_{j}}\in\mathscr{E}$ , Local Dependency between ${{{e}}_{i}}$ and ${{{e}}_{j}}$ is defined as:

$\displaystyle{\textit{Dep}_{\textit{local}}}({{e}_{i}},{{e}_{j}})=\frac{1}{2}*\left(\frac{{e}^{2*\frac{|{{e}_{i}}>_{\mathscr{L}}{{e}_{j}}|}{{{D}_{\textit{suc}}}({{e}_{i}})}}-1}{{e}^{2*\frac{|{{e}_{i}}>_{\mathscr{L}}{{e}_{j}}|}{{{D}_{\textit{suc}}}({{e}_{i}})}}+1}+\frac{{e}^{2*\frac{|{{e}_{i}}>_{\mathscr{L}}{{e}_{j}}|}{{{D}_{\textit{pre}}}({{e}_{j}})}}-1}{{e}^{2*\frac{|{{e}_{i}}>_{\mathscr{L}}{{e}_{j}}|}{{{D}_{\textit{pre}}}({{e}_{j}})}}+1}\right)$ (4)

As Eq. (4) indicates, Local Dependency is normalized to [0, 1), and tends to 1 (0) as $|{{e}_{i}}>_{\mathscr{L}}{{e}_{j}}|$ increases (decreases). In other words, the greater the Local Dependency between two events, the more strongly they are related and the less likely they are to be filtered, but not vice versa. It is noteworthy that the local dependency is computed by balancing both the preceding and succeeding dependencies.

In addition to Local Dependency, we also determine noises by the overall correlation, i.e., whether the frequency of current behavior (co-occurring events) is significantly lower than the frequency of other behaviors. A noise global factor $\mathbf{\zeta}$ is applied to distinguish between abnormal behavior and normal behavior through the following definition of Global Dependency.

Definition 11 (Global Dependency). Given two events ${{e}_{i}},{{e}_{j}}\in\mathscr{E}$ , Global Dependency refers to the frequency with which they occur sequentially in the whole log, denoted by:

$\displaystyle\textit{Dep}_{\textit{global}}({{e}_{i}},{{e}_{j}})=\frac{{e}^{2*\frac{|{{e}_{i}}>_{\mathscr{L}}{{e}_{j}}|}{\theta}}-1}{{e}^{2*\frac{|{{e}_{i}}>_{\mathscr{L}}{{e}_{j}}|}{\theta}}+1},\quad\mathbf{\theta}=\textit{Max}(|{{e}_{x}}>_{\mathscr{L}}{{e}_{y}}|),\quad\frac{|{{e}_{x}}>_{\mathscr{L}}{{e}_{y}}|}{\mathop{\sum}_{{{e}_{k}},{{e}_{t}}\in\mathscr{E}}|{{e}_{k}}>_{\mathscr{L}}{{e}_{t}}|}<\mathbf{\zeta}$ (5)

Finally, a trade-off factor $\alpha$ is used to comprehensively consider Local and Global Dependency to obtain a Mixed Dependency.

Definition 12 (Mixed Dependency). Given two events ${{e}_{i}},{{e}_{j}}\in\mathscr{E}$ , the Mixed Dependency between ${{{e}}_{i}}$ and ${{{e}}_{j}}$ is defined as:

$\displaystyle\textit{Dep}_{\textit{mixed}}({{e}_{i}},{{e}_{j}})=\alpha*\textit{Dep}_{\textit{local}}({{e}_{i}},{{e}_{j}})+(1-\alpha)*\textit{Dep}_{\textit{global}}({{e}_{i}},{{e}_{j}})$ (6)

A higher $\textit{Dep}_{\textit{mixed}}({{e}_{i}},{{e}_{j}})$ means that ${{e}_{j}}$ has a greater probability of occurring after ${{e}_{i}}$ , so the pair is less likely to be noise. Thus, by using the mixed dependency proposed above, we can filter out infrequent behaviors (or noises) in the event logs.

Given $\mathbf{\zeta}=0.05$ and $\alpha=0.5$ , according to the above definitions, we calculate the Mixed Dependency of all pairs of events in the original log and construct a Mixed Dependency Matrix, as shown in Fig. 3. Then, the Mixed Dependency of every two adjacent events in the trace is checked one by one. If the Mixed Dependency is below a threshold, the latter event is regarded as a noise and removed from the current trace.

Figure 3.

The construction process of mixed dependency matrix.
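To make Eqs (4) to (6) concrete, the following Python sketch evaluates them on log $L_{1}$ with $\zeta=0.05$ and $\alpha=0.5$ . It uses the identity $(e^{2x}-1)/(e^{2x}+1)=\tanh(x)$ . Two points are interpretations made for this sketch, not statements from the text: $\theta$ is read as the largest directly-follows count among pairs whose share of all directly-follows occurrences is below $\zeta$ , and a pair that never occurs is given a Local Dependency of 0. All function names are hypothetical.

```python
import math
from collections import Counter

L1 = {"ABCF": 100, "ADEF": 99, "ABDECF": 1}  # log of Fig. 1 (illustrative encoding)

def directly_follows_counts(log):
    counts = Counter()
    for trace, freq in log.items():
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += freq
    return counts

def local_dep(counts, a, b):
    n = counts[(a, b)]
    if n == 0:
        return 0.0  # sketch convention: avoids dividing by a zero density
    suc = [c for (x, _), c in counts.items() if x == a and c > 0]
    pre = [c for (_, y), c in counts.items() if y == b and c > 0]
    d_suc, d_pre = sum(suc) / len(suc), sum(pre) / len(pre)
    return 0.5 * (math.tanh(n / d_suc) + math.tanh(n / d_pre))   # Eq. (4)

def global_dep(counts, a, b, zeta):
    total = sum(counts.values())
    rare = [c for c in counts.values() if c / total < zeta]
    if not rare:
        raise ValueError("no pair lies below zeta; theta is undefined in this reading")
    theta = max(rare)
    return math.tanh(counts[(a, b)] / theta)                      # Eq. (5)

def mixed_dep(counts, a, b, zeta=0.05, alpha=0.5):
    return alpha * local_dep(counts, a, b) + (1 - alpha) * global_dep(counts, a, b, zeta)  # Eq. (6)

counts = directly_follows_counts(L1)
frequent = mixed_dep(counts, "A", "B")    # pair seen 101 times
infrequent = mixed_dep(counts, "B", "D")  # pair seen once
print(frequent > 0.5, infrequent < 0.5)   # True True
```

Under these assumptions the single occurrence of D directly after B falls below 0.5, while the frequent pair A, B stays well above it.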

3.3 Coarse-grained filtering

In addition to the Fine-grained filtering, we further use Coarse-grained filtering to detect abnormal traces and filter these anomalies at the trace level. For each trace $\sigma_{i}$ in log $\mathscr{L}$ , we set an abandon factor ${{f}_{\textit{abandon}}}$ with an initial value of 1.0. We sequentially extract two adjacent events ${{{e}}_{i}}$ and ${{{e}}_{i+1}}$ from $\sigma_{i}$ , and obtain the mixed dependency of these two events. If the mixed dependency is below 0.5, ${{{e}}_{i+1}}$ is considered to be a noise and is removed from trace $\sigma_{i}$ . Furthermore, $\sigma_{i}$ is likely to be an abnormal trace because of the event we discarded. In order to confirm whether $\sigma_{i}$ is truly abnormal, we use the penalty function to update the abandon factor of $\sigma_{i}$ according to Eq. (7), where ${{f}_{\textit{punish}}}$ is the punishment factor. When the abandon factor of $\sigma_{i}$ is lower than the given abandon threshold ${{T}_{\textit{abandon}}}$ , $\sigma_{i}$ is removed from the log $\mathscr{L}$ . It is noteworthy that, in this way, the abandon factor is adjusted according to the actual mixed dependency.

$\displaystyle{{f}_{\textit{abandon}}}={{f}_{\textit{abandon}}}*{{f}_{\textit{punish}}}*(1+2*(f_{\textit{punish}}^{-1}-1)*\textit{Dep}_{\textit{mixed}}({{e}_{i}},{{e}_{j}}))$ (7)
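As a quick check of Eq. (7), the multiplier applied to ${{f}_{\textit{abandon}}}$ can be evaluated at the two ends of the range in which the penalty is triggered (mixed dependency between 0 and 0.5), assuming $0<{{f}_{\textit{punish}}}<1$ :

$\displaystyle\textit{Dep}_{\textit{mixed}}=0:\quad f_{\textit{punish}}*(1+0)=f_{\textit{punish}}$

$\displaystyle\textit{Dep}_{\textit{mixed}}=0.5:\quad f_{\textit{punish}}*(1+(f_{\textit{punish}}^{-1}-1))=f_{\textit{punish}}*f_{\textit{punish}}^{-1}=1$

Hence a discarded event with no support at all applies the full punishment, whereas one lying exactly at the 0.5 threshold would leave the abandon factor unchanged; values in between are interpolated linearly.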

However, during our research, we noticed that the above penalty mechanism still fails to detect the loss of events in parallel structures. To solve this problem, we use the parallel block defined above to penalize traces whose parallel structures are missing events.

For trace $\sigma_{i}$ , we look for all parallel events in it and group parallel events that are adjacent to each other into one parallel block. Then we calculate the missing degree of $\sigma_{i}$ from the parallel block set of $\sigma_{i}$ and the parallel block set of the whole log.

Definition 13 (Missing Degree). Missing Degree measures the likelihood that events are missing from the parallel blocks, as given by Eq. (8). The higher the missing degree, the more likely the trace is to be removed.

$\displaystyle{{L}_{tr}}({{\sigma}_{i}})=\sum_{j=1}^{m}(|P{{B}_{j}}\cup P{{B}_{k}}|-|P{{B}_{j}}\cap P{{B}_{k}}|)$ (8)

$\displaystyle s.t.\ (PB_{j}\in\textit{PBS}({{\mathbf{\sigma}}_{i}}))\wedge\left(PB_{k}=\underset{PB_{n}\in\textit{PBS}(L)}{\textit{argmax}}|PB_{j}\cap PB_{n}|\right)$

For example, in log ${{L}_{2}}$ mentioned above, $\textit{PBS}({{L}_{2}})=[\{B,C,D\},\{E,F\}]$ . For trace $\mathbf{\sigma}=\langle\textit{SAEFHI}\rangle$ , $\textit{PBS}(\mathbf{\sigma})=[\{\mathit{E},\mathit{F}\}]$ . According to Eq. (8), ${{L}_{tr}}(\sigma)=|\{E,F\}\cup\{E,F\}|-|\{E,F\}\cap\{E,F\}|=0$ . For a new trace ${{\mathbf{\sigma}}_{\textit{new}}}=\langle\textit{SAEHI}\rangle$ in log ${{L}_{2}}$ , its PBS is $\textit{PBS}({{\mathbf{\sigma}}_{\textit{new}}})=[\{\mathit{E}\}]$ and its missing degree is ${{L}_{tr}}({{\mathbf{\sigma}}_{\textit{new}}})=|\{E\}\cup\{E,F\}|-|\{E\}\cap\{E,F\}|=1$ .

Further, if the missing degree of the trace is greater than 0, which means that there are events missing from the parallel blocks, we use the following formula to punish the current trace.

$\displaystyle{{f}_{\textit{abandon}}}={{f}_{\textit{abandon}}}*{{({{f}_{\textit{punish}}})}^{{{L}_{tr}}({{\mathbf{\sigma}}_{i}})}}$ (9)
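The missing degree of this example can be reproduced with a short Python sketch. The grouping rule used here is an assumption of the sketch: adjacent events of a trace are placed in the same trace-level block only if they belong to the same parallel block of the log. The function names are hypothetical, and the Parallel Block Set of the log is taken as given.

```python
LOG_BLOCKS = [{"B", "C", "D"}, {"E", "F"}]  # PBS(L2) as stated in the text

def trace_blocks(trace, log_blocks):
    """Group adjacent events of a trace that share a log-level parallel block."""
    block_of = {e: i for i, block in enumerate(log_blocks) for e in block}
    blocks, current, current_id = [], set(), None
    for event in trace:
        bid = block_of.get(event)
        if bid is not None and bid == current_id:
            current.add(event)          # extends the block being built
            continue
        if current:
            blocks.append(current)      # the previous block ends here
        if bid is None:
            current, current_id = set(), None
        else:
            current, current_id = {event}, bid
    if current:
        blocks.append(current)
    return blocks

def missing_degree(trace, log_blocks):
    """Eq. (8): symmetric-difference size against the best-matching log block."""
    total = 0
    for pb in trace_blocks(trace, log_blocks):
        best = max(log_blocks, key=lambda ref: len(pb & ref))
        total += len(pb | best) - len(pb & best)
    return total

print(missing_degree("SAEFHI", LOG_BLOCKS))  # 0
print(missing_degree("SAEHI", LOG_BLOCKS))   # 1
```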

If the abandon factor of $\sigma_{i}$ is lower than the abandon threshold ${{T}_{\textit{abandon}}}$ set in fine-grained filtering, $\sigma_{i}$ is removed from log $\mathscr{L}$ as a noise.

Algorithm 1: Log Noise Filtering Method

Input: initial event log ${{\mathscr{L}}_{\textit{initial}}}$ , abandon threshold ${{{T}}_{\textit{abandon}}}$ , punishment factor ${{f}_{\textit{punish}}}$ ;
Output: event log with noise filtered ${{\mathscr{L}}_{\textit{filter}}}$ ;

1: $A_{\textit{MDM}}^{L}={\textit{buildMixedDependencyMatrix}}({{\mathscr{L}}_{\textit{initial}}})$ ;
2: $\textit{PBS}({{\mathscr{L}}_{\textit{initial}}})=\textit{getPBS}({{\mathscr{L}}_{\textit{initial}}})$ ; // Obtain the PBS of ${\mathscr{L}}_{\textit{initial}}$
3: FOR Trace $\sigma$ in ${{\mathscr{L}}_{\textit{initial}}}$ DO
4:     ${{e}_{\textit{start}}}=\sigma.\textit{getFirstEvent}()$ ;
5:     ${\sigma_{\textit{filter}}}.\textit{addEvent}({{e}_{\textit{start}}})$ ;
6:     ${{f}_{\textit{abandon}}}=1.0$ ; // Setting the initial abandon factor
7:     FOR event ${{e}_{i}}$ in $\sigma$ DO
8:         ${{e}_{j}}={{e}_{i}}.\textit{nextEvent}()$ ;
9:         WHILE $A_{\textit{MDM}}^{L}({{e}_{i}},{{e}_{j}})<0.5$ DO
10:             ${{e}_{j}}={{e}_{j}}.\textit{nextEvent}()$ ;
11:             ${{f}_{\textit{abandon}}}=\textit{calcuAF}(\sigma)$ ; // Calculate abandon factor
12:             IF ${{f}_{\textit{abandon}}}<{{{T}}_{\textit{abandon}}}$ THEN
13:                 CONTINUE
14:             END IF
15:         END WHILE
16:         ${\sigma_{\textit{filter}}}.\textit{addEvent}({{e}_{j}})$ ;
17:         ${{e}_{i}}={{e}_{j}}$ ;
18:     END FOR
19:     $\textit{PBS}({\sigma_{\textit{filter}}})=\textit{getPBS}({\sigma_{\textit{filter}}})$ ; // Get the PBS of ${\sigma_{\textit{filter}}}$
20:     ${{L}_{tr}}({\sigma_{\textit{filter}}})=\textit{calcuMD}({\sigma_{\textit{filter}}})$ ; // Calculate missing degree
21:     IF ${{L}_{tr}}({\sigma_{\textit{filter}}})>0$ THEN
22:         ${{f}_{\textit{abandon}}}={{f}_{\textit{abandon}}}*(f_{\textit{punish}})^{{{L}_{tr}}({{\mathbf{\sigma}}_{\textit{filter}}})}$ ;
23:     END IF
24:     IF ${{f}_{\textit{abandon}}}<{{{T}}_{\textit{abandon}}}$ THEN
25:         CONTINUE
26:     ELSE DO
27:         ${{\mathscr{L}}_{\textit{filter}}}.\textit{addTrace}({\sigma_{\textit{filter}}})$ ;
28:     END IF
29: END FOR
30: RETURN ${{\mathscr{L}}_{\textit{filter}}}$

3.4 Double granularity filtering method

Algorithm 1 describes in detail the filtering process of our Double Granularity Filtering method (DGF), which is divided into the following two steps:

Constructing mixed dependency matrix (Lines 1–2)

Firstly, according to the definitions proposed above, statistical analysis is performed on all events in the initial log ${{\mathscr{L}}_{\textit{initial}}}$ , and the mixed dependency between every pair of events is calculated. The mixed dependency matrix of the log is thereby built. Afterwards, we calculate the dependency degree between events and obtain the parallel blocks and the parallel block set of ${{\mathscr{L}}_{\textit{initial}}}$ .

Double granularity filtering (Lines 3–30)

For each trace $\sigma$ in the initial log, its start event is extracted and added to a newly created filter trace ${{\mathbf{\sigma}}_{\textit{filter}}}$ . Afterwards, the double granularity filtering is conducted, which is further divided into the following two parts.

In the first part, the fine-grained filtering is performed together with the first part of coarse-grained filtering (Lines 7–18). Specifically, we traverse pairs of adjacent events in $\sigma$ from the start event to the end event. If the mixed dependency of the two events is smaller than 0.5, the latter event is tagged as noise and removed from the trace (fine-grained filtering). Afterwards, Eq. (7) is used to calculate the abandon factor of $\sigma$ . When the abandon factor of $\sigma$ is lower than ${{T}_{\textit{abandon}}}$ , $\sigma$ is judged as noise and removed from the log (the first part of coarse-grained filtering). The preliminary filtering of $\sigma$ is completed when the last event of $\sigma$ has been added to ${{\mathbf{\sigma}}_{\textit{filter}}}$ .

In the second part of double granularity filtering, the coarse-grained filtering on ${{\mathbf{\sigma}}_{\textit{filter}}}$ is performed based on the parallel relation (Lines 19–28). The initial value of the abandon factor of ${{\mathbf{\sigma}}_{\textit{filter}}}$ is inherited from $\sigma$ . The missing degree of trace ${{\mathbf{\sigma}}_{\textit{filter}}}$ is obtained by Eq. (3.3). If it is greater than 0, Eq. (9) is used to update the abandon factor of ${{\mathbf{\sigma}}_{\textit{filter}}}$ . After the punishment, if the abandon factor is lower than ${{T}_{\textit{abandon}}}$ , ${{\mathbf{\sigma}}_{\textit{filter}}}$ is discarded. Otherwise, ${{\mathbf{\sigma}}_{\textit{filter}}}$ is retained. Finally, the next trace is taken from ${{\mathscr{L}}_{\textit{initial}}}$ and the above process is repeated.

Figure 4.

Examples of noise filtering.

To make the method easier to follow, we walk through it in detail with an example. We select the log shown in Fig. 3 as the initial log, with trace ${\sigma_{1}}=\langle\textit{ABCDEFGH}\rangle$ to be filtered. The punishment factor and the abandon threshold are set to 0.8 and 0.6 respectively. The filtering process is as follows (Example 1 in Fig. 4):

Firstly, extract start event $A$ from $\sigma_{1}$ and add it to $\sigma_{f1}$ , where $\sigma_{f1}$ is initially an empty sequence.

Secondly, take out the succeeding event $B$ of $A$ from $\sigma_{1}$ . According to the mixed dependency matrix in Fig. 3, $\textit{Dep}_{\textit{mixed}}(A,B)=0.88>0.5$ , hence event $B$ is not considered noise and is added to $\sigma_{f1}$ . The abandon factor of $\sigma_{1}$ remains at its initial value of 1.

Similarly, take out the succeeding event $C$ of $B$ from $\sigma_{1}$ . Because the mixed dependency $\textit{Dep}_{\textit{mixed}}(B,C)=0.86>0.5$ , event $C$ is also added to $\sigma_{f1}$ .

Take out the next event $D$ from $\sigma_{1}$ . As the mixed dependency $\textit{Dep}_{\textit{mixed}}(C,D)=0.86>0.5$ , event $D$ is added to $\sigma_{f1}$ .

Take out the next event $E$ from $\sigma_{1}$ . Since the mixed dependency $\textit{Dep}_{\textit{mixed}}(D,E)=0.90>0.5$ , event $E$ is added to $\sigma_{f1}$ .

Take out the next event $F$ from $\sigma_{1}$ . Because $\textit{Dep}_{\textit{mixed}}(E,F)=0.01$ is lower than 0.5, event $F$ is considered noise. It is then removed and the abandon factor of $\sigma_{1}$ is updated according to Eq. (7), i.e., ${{f}_{\textit{abandon}}}=0.80$ . As the calculated value of ${{f}_{\textit{abandon}}}$ is higher than the abandon threshold ${{T}_{\textit{abandon}}}=0.6$ , $\sigma_{1}$ is not discarded.

Take out the next event $G$ in $\sigma_{1}$ . As $\textit{Dep}_{\textit{mixed}}(E,G)=0.95>0.5$ , event $G$ is added to $\sigma_{f1}$ .

Take out the next event $H$ . Since $\textit{Dep}_{\textit{mixed}}(G,H)=0.92>0.5$ , $H$ is added to $\sigma_{f1}$ . Because $H$ is the last event of $\sigma_{1}$ , the first stage of the trace filtering is completed, which obtains a filtered trace ${{\sigma_{f1}}}=\langle\textit{ABCDEGH}\rangle$ .

Obtain the parallel block sets of the original log and of ${\sigma_{f1}}$ : $\textit{PBS}(L)=[\{B,C\}]$ , $\textit{PBS}({\sigma_{f1}})=[\{B,C\}]$ . Then calculate the missing degree of ${\sigma_{f1}}$ , i.e., $\textit{Ltr}({\sigma_{f1}})=(|\{B,C\}\cup\{B,C\}|)-(|\{B,C\}\cap\{B,C\}|)=0$ . As $\textit{Ltr}({\sigma_{f1}})$ is not greater than 0, there is no need to punish the current trace and ${\sigma_{f1}}$ is retained.

In the second example of Fig. 4, the trace ${{\sigma}_{2}}=\langle\textit{ABDGH}\rangle$ is processed in the same way. Because $\textit{Dep}_{\textit{mixed}}(A,B)=0.88$ and $\textit{Dep}_{\textit{mixed}}(B,D)=0.86$ are both greater than 0.5, events $A$ , $B$ , and $D$ are considered normal events. Since $\textit{Dep}_{\textit{mixed}}(D,G)=0.01$ and $\textit{Dep}_{\textit{mixed}}(D,H)=0.00$ are both smaller than 0.5, events $G$ and $H$ are deleted as noise. After event $G$ is removed, ${{f}_{\textit{abandon}}}=0.80>{{T}_{\textit{abandon}}}=0.6$ , so $\sigma_{2}$ is not abandoned. After event $H$ is removed, ${f}_{\textit{abandon}}$ changes to 0.64, which is still greater than 0.6, so $\sigma_{2}$ is again not abandoned and the filtered trace ${{\sigma}_{f2}}=\langle\textit{ABD}\rangle$ is obtained. Then we calculate the missing degree of $\sigma_{f2}$ , i.e., $\textit{Ltr}({\sigma_{f2}})=|\{B\}\cup\{B,C\}|-|\{B\}\cap\{B,C\}|=1$ . As $\textit{Ltr}({\sigma_{f2}})$ is greater than 0, we punish the trace using Eq. (9). After this, ${{f}_{\textit{abandon}}}$ changes to $0.64\times 0.8\approx 0.51$ , which is below the threshold ${{T}_{\textit{abandon}}}=0.6$ , so trace $\sigma_{f2}$ is removed from the log as noise.
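The two walkthroughs above can be replayed with a minimal Python sketch. It is not the released implementation: the dependency values are only those quoted in the examples (all other pairs are assumed to be 0), each removed event is assumed to multiply the abandon factor by the punishment factor (which reproduces the values 0.80 and 0.64 but is only a guess at the form of Eq. (7)), and the names `filter_trace` and `missing_degree` are hypothetical.

```python
THRESHOLD = 0.5  # mixed-dependency cut-off used in Algorithm 1

# Mixed dependencies quoted in the two examples; unlisted pairs default to 0.
DEP = {
    ("A", "B"): 0.88, ("B", "C"): 0.86, ("C", "D"): 0.86, ("D", "E"): 0.90,
    ("E", "F"): 0.01, ("E", "G"): 0.95, ("G", "H"): 0.92,
    ("B", "D"): 0.86, ("D", "G"): 0.01, ("D", "H"): 0.00,
}
LOG_PBS = [{"B", "C"}]  # parallel block set of the log, as in the examples


def missing_degree(events, log_pbs):
    """Size of union minus size of intersection, summed over touched blocks."""
    degree = 0
    for block in log_pbs:
        present = block & set(events)
        if present:  # assumption: blocks the trace never enters are ignored
            degree += len(present | block) - len(present & block)
    return degree


def filter_trace(trace, dep=DEP, log_pbs=LOG_PBS, t_abandon=0.6, f_punish=0.8):
    """Return (filtered trace, or None if discarded, and the abandon factor)."""
    kept = [trace[0]]  # the start event is always kept
    f_abandon = 1.0
    for event in trace[1:]:
        if dep.get((kept[-1], event), 0.0) >= THRESHOLD:
            kept.append(event)
            continue
        # Fine-grained step: drop the event and lower the abandon factor
        # (assumed form of Eq. (7)).
        f_abandon *= f_punish
        if f_abandon < t_abandon:
            return None, f_abandon  # first part of coarse-grained filtering
    # Second part of coarse-grained filtering: punish missing parallel events.
    f_abandon *= f_punish ** missing_degree(kept, log_pbs)
    if f_abandon < t_abandon:
        return None, f_abandon
    return "".join(kept), f_abandon


print(filter_trace("ABCDEFGH"))
print(filter_trace("ABDGH"))
```

Under these assumptions the first call keeps the trace as ABCDEGH with an abandon factor of 0.8, and the second call discards the trace once the punishment lowers the factor to about 0.51.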

4. Experiments and discussion

In order to verify the effectiveness of the log noise filtering method proposed in this paper, we conduct extensive experiments on the ProM platform based on both synthetic and real-life datasets. The source code can be downloaded from https://github.com/HduDBSI/DGF for reference.

4.1 Datasets

For the datasets, we use CPN Tools [32] to synthesize logs with different noise ratios. In addition, we use five real-life business process logs obtained from the 4TU Centre for Research Data (https://data.4tu.nl/).

Synthetic log

Since the simulation log data is synthesized by CPN Tools and is noise-free, we generate log sets with different noise ratios by injecting noise into them. During the execution of a real system, event redundancy, missing events, event sequence exchanges and other types of errors occur in the log. In this paper, we mainly consider the first two types of noise. Based on the Pareto Principle [33], the noise ratio in the generated logs ranges from 5% to 25% with an increasing step of 5%, giving a total of 5 logs with 20,000 events per log on average. In each log, the noise consists of an equal number of missing events and redundant events.

Hospital Billing (HB)

This log is collected from the ERP system of a local hospital. Each case records a series of medical procedures that are billed together. It contains 10,000 process execution instance records, each with an average of 16 events.

Road Traffic Fine Management Process (RTF)

This dataset is obtained from a local police station in Italy and it records 150,370 instances of local road traffic fine repayment processes, with an average of 4 events per instance.

Sepsis Cases (SC)

Similar to HB, this dataset also comes from the local hospital’s ERP system. The difference is that this dataset is generated from the diagnostic module. There are 1,050 cases with an average of 14 events in this log.

BPI Challenge 2020 (BPIC2020)

This dataset contains two years of events related to Requests for Payment, covering 6,886 cases and 36,796 events. The dataset was used for the Business Process Intelligence Challenge 2020.

Help Desk (HD)

This event log concerns the ticketing management process of the help desk of an Italian software company, and covers 4,580 cases and 21,348 events.
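The noise injection described for the synthetic log can be illustrated with a small Python sketch. The details are assumptions rather than the authors' procedure: the noise ratio is taken as a fraction of the total number of events, a missing event is simulated by deleting a random event, a redundant event by inserting a random activity at a random position, and the function name `inject_noise` is hypothetical.

```python
import random


def inject_noise(log, ratio, seed=0):
    """Return a noisy copy of `log` (a list of traces, each a list of events).

    Half of the injected noise removes events (missing events) and the other
    half inserts extra events (redundant events), so both kinds are equal in
    number.
    """
    rng = random.Random(seed)
    noisy = [list(trace) for trace in log]  # leave the input log untouched
    alphabet = sorted({event for trace in log for event in trace})
    total_events = sum(len(trace) for trace in noisy)
    n_each = round(total_events * ratio / 2)

    for _ in range(n_each):  # missing events
        candidates = [t for t in noisy if len(t) > 1]  # never empty a trace
        if not candidates:
            break
        trace = rng.choice(candidates)
        del trace[rng.randrange(len(trace))]

    for _ in range(n_each):  # redundant events
        trace = rng.choice(noisy)
        trace.insert(rng.randrange(len(trace) + 1), rng.choice(alphabet))

    return noisy


clean_log = [list("ABCDEGH") for _ in range(10)]
for ratio in (0.05, 0.10, 0.15, 0.20, 0.25):
    noisy_log = inject_noise(clean_log, ratio)
    print(ratio, ["".join(t) for t in noisy_log[:3]])
```

Because deletions and insertions are equal in number, the total event count of the log is unchanged in this sketch, while individual traces become shorter or longer.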

4.2 Metrics and baselines

Three indicators are selected to measure the quality of the discovered process models: fitness, precision and F-score.

Fitness: Fitness is used to measure how many behaviors in the event log can be reproduced in the discovered models. The range of values for fitness is [0, 1], and the higher the fitness of the discovered models, the stronger their ability to represent the log behavior.

Precision: Precision measures how many behaviors in a model can be supported by event logs. When the precision value is 1.0, it means that only behaviors recorded in the log are allowed in the process model.

F-score: F-score combines the fitness and precision to consider the quality of the discovered model, as Eq. (10) shows,

$\displaystyle F_{\textit{score}}=2\times\frac{\textit{Fitness}\times\textit{Precision}}{\textit{Fitness}+\textit{Precision}}$ (10)
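As a quick illustration of Eq. (10), the following Python sketch computes the F-score from given fitness and precision values. The function name `f_score` and the convention of returning 0 when both inputs are 0 are assumptions made for this example.

```python
def f_score(fitness, precision):
    """Harmonic mean of fitness and precision, as in Eq. (10)."""
    if fitness + precision == 0:
        return 0.0  # assumed convention to avoid division by zero
    return 2 * fitness * precision / (fitness + precision)


# A model with near-perfect fitness but low precision still gets a modest score.
print(f_score(1.0, 0.5))
```

Since the harmonic mean is dominated by the smaller of its two inputs, a model that replays the whole log but allows much extra behavior is penalized heavily by this metric.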

To demonstrate the effectiveness of our proposed DGF method, we compare it with the following methods:

Inductive Miner (IM [18]/IMi [17]): Inductive Miner is a discovery algorithm based on a divide-and-conquer approach. An extension of IM, called Inductive Miner-infrequent (IMi), is able to filter infrequent behaviors while ensuring soundness.

Matrix Filter (MF [6]): MF employs the conditional probability of an activity occurring after a sequence of activities to detect abnormal behaviors.

Activity Filter (AF [22]): AF filters out events based on information theory and Bayesian statistics.

Filter Log using Simple Heuristics (SLF [23]): SLF, a plugin of ProM, removes traces and activities based on the frequency of events.

4.3 Results and discussion

Figure 5.

Performance of filtering methods on synthetic logs with different noise ratios.

In the first experiment, we use the synthetic logs with different proportions of noise to investigate the effect of the noise ratio on the performance of six methods, i.e., IM [18], IMi [17], MF [6], AF [22], SLF [23] and our newly proposed DGF. The experiment uses IM as the benchmark mining method. In other words, we incorporate these noise filters into IM for a fair comparison. Meanwhile, we use the IMi variant with a default noise threshold of 0.2. For our method DGF, we use $\alpha=0.5$ as the trade-off factor, $\zeta=0.20$ as the noise global factor, and $f_{\textit{punish}}=0.8$ as the punishment factor. The abandon factor $f_{\textit{abandon}}$ is set to an initial value of 1.0 and then changed according to Eqs (7) and (9). To measure fitness and precision, we use the ProjectedRecallAndPrecision package in the ProM framework to replay event logs on the discovered process models. Figure 5 shows the result.

Because IM does not have a built-in noise filtering mechanism, its discovered models are complex and cover all behaviors, including false behaviors and infrequent behaviors, which explains why its fitness value is close to 1.0 but its precision is the lowest. Compared to IM, IMi has a certain noise tolerance, but its precision is also relatively low since its noise mechanism mainly aims at filtering low-frequency behaviors rather than all types of noise. In contrast, the DGF method achieves the highest precision together with a high fitness of the discovered model, and its F-score is much higher than that of the other methods.

Figure 6.

Comparison of model indicators for five real-life logs.

To further demonstrate the effectiveness of our proposed method, we perform a second experiment on five real-life datasets. Here, we set the noise thresholds for IMi to 0.1 and 0.2 and take the average as the final results. The IM method is again used as a benchmark. We apply DGF with five noise global factors ranging from 0.10 to 0.30 with a trade-off factor of 0.5. Figure 6 shows the average results.

Figure 7.

The influence of the trade-off factor on the model.

On these five datasets, DGF outperforms all other methods in terms of F-score. Although the precision of DGF is slightly lower than that of the MF method on the SC dataset, it performs very well on the other four datasets. In particular, the precisions of DGF on the HB, RTF, BPIC2020 and HD datasets are all higher than 0.80. The main reason is that the process structure corresponding to these datasets is relatively simple. In addition, DGF takes the missing events in parallel structures into consideration, which also leads to a higher precision than other methods. Since the abnormal behaviors recorded in these logs are filtered out and hence cannot be replayed on the models discovered with DGF, there is a certain decline in the fitness of these models. However, DGF still achieves the highest final F-score among all methods. In terms of fitness, the models discovered by IM remain the highest, for the same reason discussed above: they contain all behaviors recorded in the log. Besides, we also notice that, although the F-score combines the fitness and precision of the model, its variation is basically consistent with that of precision.

In the last experiment, we explore the influence of the trade-off factor $\alpha$ on the model under different noise global factors ( $\mathbf{\zeta}=$ 0.15/0.25). As shown in Fig. 7, as $\alpha$ increases, the fitness of the model gradually increases, while the precision decreases on all datasets. In other words, the dominance of global dependency decreases as $\alpha$ increases. Since global dependency identifies noise with a certain baseline value ( $\zeta$ ), more of the log is filtered out when global dependency dominates, resulting in a lower fitness and a higher precision of the discovered model. Interestingly, we also find that when $\alpha$ is larger than 0.5, the precision curve of the HB dataset decreases dramatically in some places. The most likely reason is that the HB log contains many selection structures. If the number of branches in these structures is large, the predecessor density and successor density of events are relatively low, which leads to a high local dependency according to Eq. (4). Hence, when $\alpha$ grows beyond the middle value of 0.5 and the local dependency becomes dominant, the mixed dependency rises to a relatively high value and much of the noise cannot be filtered. This also explains why the average precision of most curves is higher when the trade-off factor $\alpha$ is small (0–0.5) than when $\alpha$ is large (0.5–1.0), as in Fig. 7. We find that when $\alpha$ is set between 0.4 and 0.6, almost all datasets achieve relatively high F-scores.
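The effect of $\alpha$ described above can be made concrete with a short Python sketch. It assumes that the mixed dependency is a convex combination in which $\alpha$ weights the local dependency and $1-\alpha$ weights the global dependency. This is consistent with the behavior reported here but is an assumption about the exact definition, and the dependency values used are invented for illustration.

```python
def mixed_dependency(local_dep, global_dep, alpha):
    """Assumed convex combination: alpha weights local, 1 - alpha global."""
    return alpha * local_dep + (1.0 - alpha) * global_dep


def is_noise(local_dep, global_dep, alpha, threshold=0.5):
    """An event pair is treated as noise when its mixed dependency is < 0.5."""
    return mixed_dependency(local_dep, global_dep, alpha) < threshold


# Hypothetical pair inside a wide selection structure: high local dependency,
# low global dependency.
for alpha in (0.2, 0.4, 0.6, 0.8):
    print(alpha, round(mixed_dependency(0.9, 0.2, alpha), 2), is_noise(0.9, 0.2, alpha))
```

With these illustrative values the pair is flagged as noise for small $\alpha$ but passes the 0.5 threshold once $\alpha$ exceeds roughly 0.43, which mirrors the drop in precision observed when the local dependency dominates.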

In sum, the results in Figs 5 and 6 show that our method achieves better performance than the compared methods on both synthetic and real-life datasets. However, it pays for the higher precision with a slight decrease in the fitness of the discovered models. The result in Fig. 7 shows that although the variation trend of precision and fitness differs for each dataset when $\alpha$ varies, our method achieves relatively high F-score values on all datasets when $\alpha$ is between 0.4 and 0.6.

5. Conclusion

In this paper, a novel double granularity noise filtering method based on dependency correlation and parallel relation is proposed to mitigate the negative impact of log noise on discovering high-quality process models. Unlike traditional filtering methods that only conduct fine-grained or coarse-grained filtering to remove noise, our method couples these two granularities to remove noisy events and traces simultaneously. In addition, to verify the effectiveness of our method, we carry out comparison experiments in ProM based on one synthetic and five real-life datasets. The results show that our proposed DGF improves the precision and F-score of the process models significantly.

In the future, we will try to use heuristic algorithms to further improve filtering efficiency. In addition, we will employ deep learning technology to recommend a correction scheme for noisy traces under different circumstances. Last but not least, adding our proposed DGF method to ProM as a plugin is also on the list of future work.

References

1. Aalst W.V.D., Process Mining: Data Science in Action, Springer Publishing Company, Incorporated, 2016.

2. Savickas and Vasilecas, Belief network discovery from event logs for business process analysis, Computers in Industry 100 (2018), 258–66.

3. Weerdt J.D., Backer M.D., Vanthienen and Baesens, A multi-dimensional quality assessment of state-of-the-art process discovery algorithms using real-life event logs, Information Systems 37(7) (2012), 654–76.

4. Leno, Polyvyanyy, Dumas, Rosa M.L. and Maggi F.M., Robotic Process Mining: Vision and Challenges, Business & Information Systems Engineering, 2020, 1–14.

5. Suriadi, Andrews, Hofstede A.H.M.T. and Wynn M.T., Event log imperfection patterns for process mining: Towards a systematic approach to cleaning event logs, Information Systems, 2017, 64.

6. Sani M.F., van Zelst S.J. and van der Aalst W.M.P., Improving Process Discovery Results by Filtering Outliers Using Conditional Behavioural Probabilities, in: International Conference on Business Process Management, Springer, 2018. pp. 216–29.

7. Mitsyuk A.A. and Shugurov I.S., On process model synthesis based on event logs with noise, Automatic Control & Computer Sciences 50(7) (2016), 460–70.

8. Chapela-Campa, Mucientes and Lama, Simplification of Complex Process Models by Abstracting Infrequent Behaviour, in: Yangui, Bouassida Rodriguez, Drira and Tari, editors, Service-Oriented Computing, Cham: Springer International Publishing, 2019. pp. 415–30.

9. Conforti, La Rosa and Hofstede A.H.M., Filtering out Infrequent Behavior from Process Event Logs, IEEE Transactions on Knowledge and Data Engineering, 2016, 1–1.

10. and van der Aalst W.M., A framework for detecting deviations in complex event logs, Intelligent Data Analysis 21(4) (2017), 759–79.

11. Sani M.F. et al., Improving the performance of process discovery algorithms by instance selection, Computer Science and Information Systems 17(3) (2020), 927–58.

12. Delias, Lagopoulos, Tsoumakas and Grigori, Using multi-target feature evaluation to discover factors that affect business process behavior, Computers in Industry 99 (2018), 253–61.

13. Van der Aalst, Weijters and Maruster, Workflow mining: Discovering process models from event logs, IEEE Transactions on Knowledge & Data Engineering 16(9) (2004), 1128–42.

14. Carmona, Cortadella and Kishinevsky, A Region-Based Algorithm for Discovering Petri Nets from Event Logs, in: Dumas, Reichert and Shan M.C., editors, Business Process Management, Berlin, Heidelberg: Springer Berlin Heidelberg, 2008. pp. 358–73.

15. Weijters, van Der Aalst W.M. and De Medeiros A.A., Process mining with the heuristics miner-algorithm, Technische Universiteit Eindhoven, Tech Rep WP 166 (2006), 1–34.

16. Günther C.W. and Van Der Aalst W.M., Fuzzy mining-adaptive process simplification based on multi-perspective metrics, in: International Conference on Business Process Management, Springer, 2007. pp. 328–43.

17. Leemans S.J., Fahland and van der Aalst W.M., Discovering block-structured process models from event logs containing infrequent behaviour, in: International Conference on Business Process Management, Springer, 2013. pp. 66–78.

18. Leemans S.J.J., Fahland and van der Aalst W.M.P., Discovering Block-Structured Process Models from Event Logs – A Constructive Approach, in: Colom J.M. and Desel, editors, Application and Theory of Petri Nets and Concurrency, Berlin, Heidelberg: Springer Berlin Heidelberg, 2013. pp. 311–29.

19. van der Werf J.M.E.M., van Dongen B.F., Hurkens C.A.J. and Serebrenik, Process Discovery Using Integer Linear Programming, in: van Hee K.M. and Valk, editors, Applications and Theory of Petri Nets, Berlin, Heidelberg: Springer Berlin Heidelberg, 2008. pp. 368–87.

20. van Zelst S.J., van Dongen B.F., van der Aalst W.M. and Verbeek, Discovering workflow nets using integer linear programming, Computing 100(5) (2018), 529–56.

21. van Zelst S.J., Sani M.F., Ostovar, Conforti and La Rosa, Filtering spurious events from event streams of business processes, in: International Conference on Advanced Information Systems Engineering, Springer, 2018. pp. 35–52.

22. Tax, Sidorova and van der Aalst W.M., Discovering more precise process models from event logs by filtering out chaotic activities, Journal of Intelligent Information Systems 52(1) (2019), 107–39.

23. Van Dongen B.F., de Medeiros A.K.A., Verbeek, Weijters and van Der Aalst W.M., The ProM framework: A new era in process mining tool support, in: International Conference on Application and Theory of Petri Nets, Springer, 2005. pp. 444–54.

24. Ghionna, Greco, Guzzo and Pontieri, Outlier detection techniques for process mining applications, in: International Symposium on Methodologies for Intelligent Systems, Springer, 2008. pp. 150–9.

25. Budalakoti, Srivastava A.N. and Otey M.E., Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 39(1) (2008), 101–13.

26. Florez-Larrahondo, Bridges S.M. and Vaughn, Efficient modeling of discrete events for anomaly detection using hidden markov models, in: International Conference on Information Security, Springer, 2005. pp. 506–14.

27. Nguyen H.T.C., Lee, Kim and Comuzzi, Autoencoders for improving quality of process event logs, Expert Systems with Applications 131 (2019), 132–47.

28. Yan, Liu and Zeng, Predicting remaining execution time of business process instances via auto-encoded transition system, Intelligent Data Analysis 26(2) (2022), 543–62.

29. Vidgof, Djurica, Bala and Mendling, Interactive log-delta analysis using multi-range filtering, Software and Systems Modeling 21(3) (2022), 847–68.

30. Nolle, Luettgen, Seeliger and Mühlhäuser, Binet: Multi-perspective business process anomaly classification, Information Systems 103 (2022), 101458.

31. Aalst W.M.P.V.D., Process Mining: Discovery, Conformance and Enhancement of Business Processes, Springer Publishing Company, Incorporated, 2011.

32. De Medeiros A.A. and Günther C.W., Process mining: Using CPN tools to create test logs for mining algorithms, in: Proceedings of the Sixth Workshop on the Practical Use of Coloured Petri Nets and CPN Tools (CPN 2005), Vol. 576, 2005. pp. 177–90.

33. Goeminne and Mens, Evidence for the pareto principle in open source software activity, in: Csmr Workshop on Software Quality & Maintainability, 2011. pp. 74–82.
