Discovery of dependency relations in sequential data flow

Abstract

The idea of extracting knowledge from log data for both data mining and process mining emphasises data flow and the relations among data items. However, working with data flow and relations poses challenges. One of them is that the representation of the data flow between a pair of elements or tasks is insufficiently simplified and formulated, as it considers only a one-to-one data flow relation. In this paper, we discuss how to effectively represent dependency relations in log data. To this end, we introduce a new representation of the data flow and a dependency formulation using an extracted flow graph. This addresses the inability of existing representations to present other relation types, such as many-to-one and one-to-many relations. As an experiment, a new evaluation framework is applied to the Teleclaim process in order to show how this method can provide more precise results than other definitions.

Keywords

Data mining, process mining, flow graph, data flow, dependency

1. Introduction

Data mining and process mining are related disciplines which can complement one another [21]. Process mining is used to extract knowledge from the event logs stored in the information systems in order to discover, monitor, and improve real processes [3] using an event log as a starting point [24,26]. Process mining can be classified as an extension and a descendant of data mining [20]. The ideas that have been developed in the field of data mining are widely applicable for assessing the outcomes of process mining. In addition, there are certain process-mining techniques that are drawn from classical data-mining techniques, such as the discovery and enhancement approaches that centre on data and resources. Furthermore, the analysis of different types of decisions regarding business processes is directed by data mining [15].

Data mining and process mining share the same concept of emphasising the relations, data flow, and dependencies among elements in the extracted information, where dependency is a measure of the extent of the association or correlation between a pair of tasks. Dependency means that “the occurrence of one event type depends on the occurrence of another event type” [11]. In process mining, many algorithms, such as heuristic-based algorithms [35–37] and α-series algorithms [4,36,39], have been elaborated. Many algorithms in data mining include dependencies among nodes (tasks) in the flow graph, namely rough sets and decision algorithms [12,13]. In process mining, Weijters et al. generated a dependency graph based on the concept of “the all-activities-connected heuristic” [37], whereas in data mining, Pawlak introduced a dependency based on an acyclic flow graph and probability theory, using an event-independent concept [12]. Another related representation of dependencies is introduced in [34], which showed a model of dependency impact analysis for web services evaluation.

The relation types of the data flow in process mining and data mining are significant for several reasons. The data flow relations show and control the structure and relations between tasks. Based on the data flow relations, the process model or flow graph can show the dependency between a pair of tasks. However, because of the bias in most of the proposed methods, the data flow representation tends to be insufficiently simplified and formulated. For example, heuristic-based algorithms in process mining mainly consider the frequencies of the two contrary directions of data flow. Nevertheless, they consider only a one-to-one (1-1) data flow, which is derived from a direct succession relation, the starting point for the other relation types in process mining [21]. Heuristic-based algorithms do not represent other types of data flow, such as many-to-one (M-1) and one-to-many (1-M) flows. In addition, they do not include associations by involving confidences when composing the data flow and dependency. Including the confidences in the data flow and dependency increases the precision, because the frequency can be misleading when changing between low and high frequencies. On the other hand, Pawlak introduced a representation of data flow and dependency in data mining based on confidences [12]. Pawlak uses a flow graph to describe the relations of association rules and dependencies among tasks. Even though Pawlak includes other types of data flow relations and confidences, the two contrary directions of data flow are not included. Moreover, the formulation of the dependency is not composed properly.

Therefore, this paper proposes a new data flow representation to show the possible M-1 and 1-M relations. The new representation simplifies the data flow relations in the flow graph and extends that to propose a new dependency equation, which provides the right data flow in the process and increases the dependency precision. The proposed representation exploits the strengths in both the Weijters et al. [37] and Pawlak [12] representations and avoids their weaknesses. The major modifications and formulations in this new model are summarised as the use of the other relation types of the data flow and bidirectional associations. Even though this representation has the same foundation as in [2], it is considered to be an extension, because the major details in [2] are partially changed.

In order to evaluate and compare the proposed dependency with the other models, the proposed representation that was previously introduced in [2] was amplified and used as an adequate quantifying measure to manage and control the evaluation for the extracted knowledge or model. Using the dependency equation and stages inspires and motivates us to develop an effective technique that establishes a new evaluation property. The primary assumption of the evaluation property is that the forward flow (out-flow) in a stage is very close to the backward flow (in-flow) of the same stage in the perfect situation. When the forward dependencies of the stage are subtracted from the backward dependencies, the result should be zero or close to zero.

The evaluation framework is applied to the Teleclaim process. The data of the Teleclaim process are widely used in process mining groups.¹

¹ More information can be found at http://www.processmining.org.

By using the Teleclaim data, we conducted intensive experiments to investigate the effectiveness and efficiency of the proposed representation. These experiments examined and compared the proposed representation with other baselines, such as the Weijters et al. [37] and Pawlak [12] dependencies. The experiments have provided promising results from the application of data flow representation and the dependency equation.

The remainder of the paper is organised as follows. The related work is explained in Section 2, while the background is given in Section 3. The knowledge extraction and representation are introduced in Section 4, whereas the evaluation principle is defined in Section 5. Section 6 presents the evaluation framework and the experiments, and the conclusion follows in Section 7.

2. Related work

The dependency graph in process mining was originally created by Agrawal et al. [1] using the event log as an input. They defined two kinds of graphs: a dependency graph and a conformal graph. However, their work did not distinguish the different kinds of flow (e.g., the data flow or the control flow) when constructing the graph, whereas Hwang and Yang [7] considered only the control flow in the instances of process models. A dependency graph generated from the concept of “the all-activities-connected heuristic” was introduced in [37]. The main concern in this dependency graph is discovering the process model while keeping in mind the need to reduce noise in event logs. In another study, Lou et al. [11] mined every temporal dependency between a pair of tasks to cover any interleaving patterns in the filtered event log. An initial basic graph is generated from the temporal mined dependency using support and confidence and excluding loop relations. The dependency in [11] was classified into five types (forward dependency, backward dependency, strict forward dependency, strict backward dependency and no dependency).

From a data mining perspective, Pawlak [12] built a flow graph using the data flow based on deterministic dependency. This flow graph represents the relations of associations among items. Subsequently, Pawlak’s flow graphs have been extended and applied in different fields, such as [13,19] and [10]. However, Pawlak’s representation did not explicitly mention M-1 and 1-M flows. The M-1 and 1-M flows are described in data mining in the context of association mining. The starting point of the associations is rules that include the 1-1 flow [6]. However, the M-1 and 1-M flows are introduced as association mappings in [8] and [9]. Although the representation of the dependency in [12] has a totally different foundation from the representation in [37], the proposed dependency exploits the advantages of the process mining and data mining perspectives in one representation. The proposed representation simplifies the data flow in the flow graph; it extends and combines the advantages of the other dependencies while avoiding their defects. In another study on web services, Wang et al. [34] introduced a model of impact analysis based on service dependency instead of task dependency. Wang et al. [34] classified the service dependencies into four kinds (process dependency, semantic dependency, message dependency and non-functional dependency).

In 2004, van der Aalst et al. [30] defined the relations between a pair of tasks and introduced the α-algorithm, which applies these relations in event logs to discover process models. The assumptions and limitations of the α-algorithm (i.e., event logs are assumed to be free of noise, and discovered models should be presented as Petri nets) raised many issues that opened a new trend of algorithms for discovering process models, such as α-series algorithms and heuristic-based algorithms [32].

Using Petri nets and other modeling languages as process model representations over the last decade has built up biases and created limitations. Therefore, a new language of process representation, Causal net (C-net), is highlighted [22,23]. C-net is a new representation, and few discovering techniques utilise it [18]. Even though C-net shares with other discovering model techniques the same ability to identify and model splits/joins from tasks, C-net is a different semantic representation that is concluded in sequences of valid binding [16,17,35].

3. Background

The background discusses dependency in Section 3.1. Preparing the event log by filtering is explained in Section 3.2, whereas Section 3.3 outlines the log-based ordering relations. In Section 3.4, the dependency of the 1-1 flow is described.

3.1. Dependency

The dependency contributes to the structure of a graph. It also provides a clear view of the data, control and resource flow. Hence, dependency can show and control the structure of the graph and the relations between tasks. Moreover, since most of the existing quality measures of the process model discovery involve qualitative measures and not quantitative measures, dependency can be used as a quantifying measure for the discovered process model. Although there are studies that have included quantitative measures, such as [25] and [14], they are few, and most of them are limited to the Petri net modelling language.

The dependency between a pair of tasks can be considered from various viewpoints. Based on the objectives, the viewpoint tends to increase the correctness and precision of the result by using different elements with different formulations. For example, heuristic-based algorithms [35–37] use a dependency graph that is established based on the dependency between a pair of tasks and the 1-1 flow. The activities in this case are tasks, which can be equivalent to the transitions in the Petri net modelling language.

The 1-1 flow can be extended to different types of data flow. The extended flow types can be M-1 or 1-M flows. For instance, consider the dependency between a father and his child in terms of the flow of financial support. We assume that the flow of the child’s finance is A$100. When the father provides the child with A$100, the dependency is now 1-1 and fully dependent on the father because the flow of the child’s finance comes only from the father. When the child has a part-time job and gets A$50, the dependency is now M-1 and partially dependent on the father because the flow of the child’s finance depends on the father and the part-time job to flow A$100. If the child has to pay various expenses for other items, such as a school and a gym, the type of financial flow in this case is 1-M. As a result, the data flow should be extended to M-1 and 1-M flows.

3.2. Filtered event log

In this paper, the filtered event log (FEL) is the main source of data. FEL can be extracted and filtered from different kinds of data sources (e.g., flat files and web or user logs). Extracting and filtering the event log is a refined process [21]. In our case, extracting and filtering consist of eliminating irrelevant and useless data and then generating one set of transactional records that includes only tasks. This set of transactional records corresponds to an event log or a data source, and every record corresponds to a trace in the event log. FEL is constructed to answer the questions that were asked before beginning the extracting and filtering. According to the Process Mining Manifesto [24], the first two guiding principles are “Event Data Should Be Treated as First-Class Citizens” and “Log Extraction Should Be Driven by Questions”. The questions asked are changed based on the outcomes of process mining. Therefore, either the process-mining workflow can be started again or FEL can repeatedly be changed when the questions asked are modified [21]. Definition 1 explains FEL, and Table 1 shows an example of FEL (FEL₁).

Definition 1.
Let FEL be a set of transactional records and let a transactional record be $σ = t_{1} t_{2} t_{3} \dots t_{m}$ where $T [n]$ is a list of unique tasks in FEL and $t_{i} \in T [n]$ ; $t_{i}$ may be repeated in σ and $1 ⩽ i ⩽ m ⩽ n$ .

Table 1
Example of filtered event log (FEL₁)

Sequence	Transactional record
1	$t_{1}, t_{2}, t_{4}, t_{7}, t_{6}$
2	$t_{1}, t_{2}, t_{7}, t_{4}, t_{6}$
3	$t_{1}, t_{3}, t_{5}, t_{3}, t_{5}, t_{6}$
4	$t_{1}, t_{3}, t_{5}, t_{1}, t_{3}, t_{8}$
5	$t_{1}, t_{3}, t_{8}$
6	$t_{1}, t_{2}, t_{2}, t_{4}, t_{7}, t_{6}$

3.3. Log-based ordering relations

The log-based ordering relations are borrowed from [4,30,38], who tried to analyse the causal dependencies between a pair of tasks. These relations are widely used in process model discovery. The starting point of the relation between a pair of tasks is the direct succession relation (DSR) in Definition 2. Three types of loops are also considered in the proposed representation. They are described in Definition 3. The length-one loop relation (LOR) and length-two loop relation (LTR) were introduced in [4]. The long-loop relation (LLR) occurs when a task is repeated in the same transactional record with more than one different task between the repeated tasks. Due to the direction of the loop relations, all the loops in the proposed method are extracted and separated. Figure 1 is an example of DSR and the three types of loops.

Definition 2.
$(t_{i} >_{L} t_{j})$ iff there is a transactional record $σ = t_{1}, t_{2}, t_{3}, \dots, t_{m}$ such that $1 ⩽ i < m$ and $j = i + 1$ , where $σ \in$ FEL.
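
As an illustration of Definition 2, the following minimal Python sketch counts direct successions over the traces of FEL₁ from Table 1. The names `FEL1` and `direct_succession_counts` are illustrative assumptions and do not come from the paper. The sketch also counts a task that directly follows itself, which the paper treats separately as a length-one loop.

```python
from collections import Counter

# FEL1 mirrors Table 1; each inner list is one transactional record (trace).
FEL1 = [
    ["t1", "t2", "t4", "t7", "t6"],
    ["t1", "t2", "t7", "t4", "t6"],
    ["t1", "t3", "t5", "t3", "t5", "t6"],
    ["t1", "t3", "t5", "t1", "t3", "t8"],
    ["t1", "t3", "t8"],
    ["t1", "t2", "t2", "t4", "t7", "t6"],
]


def direct_succession_counts(fel):
    """Count how often task a is directly followed by task b (a >_L b)."""
    counts = Counter()
    for trace in fel:
        # Pair every task with its immediate successor in the same trace.
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += 1
    return counts


dsr = direct_succession_counts(FEL1)
print(dsr[("t1", "t2")], dsr[("t1", "t3")], dsr[("t4", "t7")], dsr[("t7", "t4")])
```

A pair with a non-zero count satisfies the direct succession relation; the counts themselves are what the later frequency-based equations consume.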

Fig. 1.
Examples of direct succession relation and three types of loops.

Definition 3.
Let $σ \in$ FEL be a transactional record and $1 ⩽ i, k ⩽ n$ :
Length-one loop relation. $(t_{i} ↺_{L} t_{(i + 1)})$ iff there is $σ = t_{1} t_{2} t_{3} \dots t_{m}$ and $i \in \{1, 2, \dots, (m - 1)\}$ such that $t_{i} = t_{(i + 1)}$ .

Length-two loop relation. $(t_{i} △_{L} t_{(i + 1)})$ iff there is $σ = t_{1} t_{2} t_{3} \dots t_{m}$ and $i \in \{1, 2, \dots, (m - 2)\}$ such that $t_{i} \neq t_{(i + 1)}$ , but $t_{i} = t_{(i + 2)}$ .

Long loop relation. $(t_{i} ↶_{L} t_{j})$ iff there is $σ = t_{1} t_{2} t_{3} \dots t_{m}$ such that $t_{i} \neq t_{(i + 1)} \neq t_{(i + 2)}$ , but $t_{i} = t_{(i + k)}$ where $3 ⩽ (i + k) ⩽ m$ and $\neg ((t_{i} ↺_{L} t_{(i + 1)}) \text{ and } (t_{i} △_{L} t_{(i + 1)}))$ .
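
The two short-loop relations of Definition 3 can be sketched in Python as follows. This is a minimal illustration, not the paper's algorithm: the function name `short_loops` and the choice of counters are assumptions, and the long loop relation is deliberately left out because the paper discovers it with a separate technique (Section 4.4).

```python
from collections import Counter

# FEL1 mirrors Table 1; each inner list is one transactional record (trace).
FEL1 = [
    ["t1", "t2", "t4", "t7", "t6"],
    ["t1", "t2", "t7", "t4", "t6"],
    ["t1", "t3", "t5", "t3", "t5", "t6"],
    ["t1", "t3", "t5", "t1", "t3", "t8"],
    ["t1", "t3", "t8"],
    ["t1", "t2", "t2", "t4", "t7", "t6"],
]


def short_loops(fel):
    """Return (LOR counts per task, LTR counts per ordered pair)."""
    lor = Counter()  # task a with the pattern "a a"
    ltr = Counter()  # pair (a, b) with the pattern "a b a", a != b
    for trace in fel:
        for a, b in zip(trace, trace[1:]):
            if a == b:
                lor[a] += 1
        for a, b, c in zip(trace, trace[1:], trace[2:]):
            if a != b and a == c:
                ltr[(a, b)] += 1
    return lor, ltr


lor, ltr = short_loops(FEL1)
print(dict(lor))
print(dict(ltr))
```

On FEL₁ the only length-one loop is on $t_{2}$ (record 6), and the only length-two loops involve $t_{3}$ and $t_{5}$ (record 3).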

DSR and the loop relations are fundamental because they are the starting point from which most of the other relation types between a pair of tasks are derived, namely the causality relation (CR) and the parallelism relation (PR) [4], explained in Definition 4 (refer to [21] for more relation types and details).

Definition 4.
Let FEL be a filtered event log over $T [n]$ and let $t_{i}, t_{j} \in T [n]$ :
Causality. $(t_{i} \to_{L} t_{j})$ iff $(t_{i} >_{L} t_{j})$ and $(t_{j} ≯_{L} t_{i})$ or $(t_{i} △_{L} t_{j})$ or $(t_{j} △_{L} t_{i})$

Parallelism. $(t_{i} ∥_{L} t_{j})$ iff $(t_{i} >_{L} t_{j})$ and $(t_{j} >_{L} t_{i})$ and $\neg ((t_{i} △_{L} t_{j}) \text{ or } (t_{j} △_{L} t_{i}))$
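
Definition 4 can be read as two predicates over the direct succession and length-two loop relations. The sketch below is one possible encoding under stated assumptions: the names `relations`, `causal` and `parallel` are hypothetical, the causality condition is evaluated with the usual precedence of "and" over "or", and a task is never reported as parallel to itself because length-one loops are handled separately.

```python
# FEL1 mirrors Table 1; each inner list is one transactional record (trace).
FEL1 = [
    ["t1", "t2", "t4", "t7", "t6"],
    ["t1", "t2", "t7", "t4", "t6"],
    ["t1", "t3", "t5", "t3", "t5", "t6"],
    ["t1", "t3", "t5", "t1", "t3", "t8"],
    ["t1", "t3", "t8"],
    ["t1", "t2", "t2", "t4", "t7", "t6"],
]


def relations(fel):
    """Build causality and parallelism predicates from a filtered event log."""
    succ = set()  # pairs (a, b) with a >_L b
    tri = set()   # pairs (a, b) with the length-two loop pattern "a b a"
    for trace in fel:
        succ.update(zip(trace, trace[1:]))
        for a, b, c in zip(trace, trace[1:], trace[2:]):
            if a != b and a == c:
                tri.add((a, b))

    def causal(a, b):
        one_way = (a, b) in succ and (b, a) not in succ
        return one_way or (a, b) in tri or (b, a) in tri

    def parallel(a, b):
        # Assumption: self-pairs are excluded (they are length-one loops).
        both_ways = (a, b) in succ and (b, a) in succ
        return a != b and both_ways and not ((a, b) in tri or (b, a) in tri)

    return causal, parallel


causal, parallel = relations(FEL1)
print(causal("t1", "t2"), parallel("t4", "t7"), causal("t3", "t5"))
```

In FEL₁, $t_{4}$ and $t_{7}$ follow each other in both orders without forming a length-two loop, so they come out as parallel, whereas $t_{3}$ and $t_{5}$ follow each other in both orders only because of the loop and are therefore treated as causally related.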

3.4. Dependency of 1-1 flow

The formulation of the flow and dependency is seen as a 1-1 flow in some research. When the 1-1 flow is considered in the dependency, the second task is fully dependent on the first task, ignoring the other related tasks. For example, Weijters et al. [37] in the Heuristics-Miner algorithm proposed Eq. (1) (referred to as “Weijters” in this paper), which measures the dependency between a pair of tasks. The dependency in Weijters is formulated based on the frequency of the flow in two contrary directions of a pair of tasks, where $|t_{i} >_{L} t_{j}|$ is the number of flows that contain $t_{i} >_{L} t_{j}$ . The dependency graph of the Heuristics-Miner algorithm is established after applying the dependency threshold that reduces the noise in event logs. The result of Weijters is between $-1$ and 1. The higher and closer to 1 the result is, the stronger the dependency flow between the pair of tasks $(t_{i} \Rightarrow_{L} t_{j})$ . However, this representation does not consider the other related tasks when one of the flow ends includes many flows. In addition, it does not include associations and confidences when formulating the equation, although including them in the dependency increases the precision. The frequency can also be misleading when changing between low and high frequencies, whereas the confidences provide a proportional value.

$$(t_{i} \Rightarrow_{L} t_{j}) = \frac{|t_{i} >_{L} t_{j}| - |t_{j} >_{L} t_{i}|}{|t_{i} >_{L} t_{j}| + |t_{j} >_{L} t_{i}| + 1} \tag{1}$$

Weijters et al. [37] extended Weijters into Eq. (2) for LOR and Eq. (3) for LTR. Whenever Weijters is mentioned in this paper, it also includes Eqs (2) and (3) unless they are specifically differentiated.

$$(t_{i} \Rightarrow_{L} t_{i}) = \frac{|t_{i} >_{L} t_{i}|}{|t_{i} >_{L} t_{i}| + 1} \tag{2}$$

$$(t_{i} \Rightarrow_{2 L} t_{j}) = \frac{|t_{i} △_{L} t_{j}| + |t_{j} △_{L} t_{i}|}{|t_{i} △_{L} t_{j}| + |t_{j} △_{L} t_{i}| + 1} \tag{3}$$
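
To make Eqs (1) to (3) concrete, the following Python sketch evaluates them on frequencies counted by hand from FEL₁ in Table 1. The function names are illustrative assumptions and are not taken from the Heuristics-Miner implementation.

```python
def weijters(freq_ij, freq_ji):
    """Eq. (1): dependency of t_i => t_j from the two directed frequencies."""
    return (freq_ij - freq_ji) / (freq_ij + freq_ji + 1)


def weijters_length_one(freq_ii):
    """Eq. (2): dependency of a length-one loop t_i => t_i."""
    return freq_ii / (freq_ii + 1)


def weijters_length_two(freq_ij, freq_ji):
    """Eq. (3): dependency of a length-two loop between t_i and t_j."""
    return (freq_ij + freq_ji) / (freq_ij + freq_ji + 1)


# Frequencies counted from FEL1 (Table 1):
# |t1 > t2| = 3, |t2 > t1| = 0; |t4 > t7| = 2, |t7 > t4| = 1;
# |t2 > t2| = 1; the patterns "t3 t5 t3" and "t5 t3 t5" each occur once.
d_t1_t2 = weijters(3, 0)
d_t4_t7 = weijters(2, 1)
d_t2_loop = weijters_length_one(1)
d_t3_t5_loop = weijters_length_two(1, 1)
print(d_t1_t2, d_t4_t7, d_t2_loop, d_t3_t5_loop)
```

The "+ 1" in each denominator keeps the value strictly inside the open interval from -1 to 1 and damps pairs that are observed only a few times.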

4. Knowledge extraction and representation

4.1. Abstraction step

The abstraction step extracts DSR, LOR and LTR from FEL. DSR is stored in a direct-succession-relation matrix ( $D S R [n, n]$ ) and both LOR and LTR are stored in the all-loop-relations matrix ( $A L R [n, n]$ ). $D S R [n, n]$ and $A L R [n, n]$ are two-dimensional matrices, which are considered to be raw-data graphs. $D S R [n, n]$ is a matrix of unique tasks and the cell in $D S R [n, n]$ only contains an accumulated counter for a pair of tasks using DSR. The pair of tasks in a relation is the intersection between a row (the first task in the relation) and a column (the second task in the relation) in $D S R [n, n]$ . On the other hand, $A L R [n, n]$ has the same dimensions as $D S R [n, n]$ with a different cell structure. Every cell in $A L R [n, n]$ has a relation-flow pair which retains two values. The first value describes the relation type between a pair of tasks, whereas the second value is a frequency of that relation. $A L R [n, n]$ not only contains LOR and LTR, but also contains the third type of loop relations, LLR (refer to Section 4.4 for the details of discovering LLR). The starting point here is to use FEL and the unique tasks of FEL as inputs. The outputs are $D S R [n, n]$ and $A L R [n, n]$ .

Algorithm 1

Creating $D S R [n, n]$ and $A L R [n, n]$

To create and fill $DSR[n, n]$ and $ALR[n, n]$ , Algorithm 1 applies the concepts of DSR, LOR, and LTR to FEL. Algorithm 1 takes FEL and the unique tasks of FEL as inputs and produces $DSR[n, n]$ and $ALR[n, n]$ as outputs. The first step in Algorithm 1 is to reset $DSR[n, n]$ and $ALR[n, n]$ by using the first two for loops. The next for loop contains the main processing. It starts by storing every trace or transactional record (σ) in a one-dimensional array ( $R [*]$ ), in which every entry is a task. As no data flow exists after the last task in $R [*]$ , the fourth for loop runs over the length of $R [*]$ minus one. The fourth for loop reads $R [*]$ and accumulates the counters in $DSR[n, n]$ and $ALR[n, n]$ , whereas the while loop finds the intended indexes $x, y$ in $DSR[n, n]$ and $ALR[n, n]$ . Applying Algorithm 1 to FEL₁ generates $DSR[8, 8]$ and $ALR[8, 8]$ , as shown in Tables 2 and 3, respectively. Algorithm 2 is a sub-algorithm or procedure that finds the indexes of a matrix. It is called from Algorithm 1 and from the subsequent algorithms.
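
The listing of Algorithm 1 is not reproduced here, so the following Python sketch only approximates the description above. It is written under several assumptions: the function name `build_matrices` and the labels `"LOR"` and `"LTR"` are hypothetical, a dictionary lookup stands in for the index search of Algorithm 2, and every direct succession (including a task following itself) is counted in the DSR matrix.

```python
# FEL1 mirrors Table 1; each inner list is one transactional record (trace).
FEL1 = [
    ["t1", "t2", "t4", "t7", "t6"],
    ["t1", "t2", "t7", "t4", "t6"],
    ["t1", "t3", "t5", "t3", "t5", "t6"],
    ["t1", "t3", "t5", "t1", "t3", "t8"],
    ["t1", "t3", "t8"],
    ["t1", "t2", "t2", "t4", "t7", "t6"],
]


def build_matrices(fel):
    """Return (tasks, DSR, ALR) for a filtered event log."""
    tasks = sorted({task for trace in fel for task in trace})
    index = {task: k for k, task in enumerate(tasks)}  # stands in for Algorithm 2
    n = len(tasks)
    # Reset step: DSR holds counters, ALR holds [relation type, frequency] pairs.
    dsr = [[0] * n for _ in range(n)]
    alr = [[[None, 0] for _ in range(n)] for _ in range(n)]
    for trace in fel:
        # No flow exists after the last task, hence len(trace) - 1 iterations.
        for pos in range(len(trace) - 1):
            x, y = index[trace[pos]], index[trace[pos + 1]]
            dsr[x][y] += 1
            if x == y:
                alr[x][y][0] = "LOR"
                alr[x][y][1] += 1
            elif pos + 2 < len(trace) and trace[pos + 2] == trace[pos]:
                alr[x][y][0] = "LTR"
                alr[x][y][1] += 1
    return tasks, dsr, alr


tasks, dsr, alr = build_matrices(FEL1)
print(tasks)
print(dsr[0][1], dsr[0][2], alr[1][1], alr[2][4])
```

Rows correspond to the first task of a relation and columns to the second, matching the row and column convention described for $DSR[n, n]$ .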

4.2. Induction step

The objective of the induction step is to derive the causality and parallelism relations from $D S R [n, n]$ and $A L R [n, n]$ . To derive CR and PR, a new two-dimensional causality-parallelism-relation matrix ( $C P R [n, n]$ ) is created. $C P R [n, n]$ has the same structure as $A L R [n, n]$ and is only related to causality and parallelism relations. The induction step takes $D S R [n, n]$ and $A L R [n, n]$ as inputs and produces $C P R [n, n]$ as output. In order to clarify this step, the results of Section 4.1 using FEL₁ are used as inputs to produce $C P R [8, 8]$ in Table 4.

Table 2
$D S R [8, 8]$ for FEL₁

Table 3

$A L R [8, 8]$ for FEL₁

Algorithm 2

Identifying the indexes $(x, y)$ of a matrix

Table 4

$C P R [8, 8]$ for FEL₁

4.3. Fusion

Fusion relates to the parallel tasks in the proposed study. In one parallel portion, all parallel tasks are fused into one flow or arc between the start and the end of the parallelism, because including parallel tasks in a sequential situation conflicts with parallelism. The flow and dependency between the start and the end of the parallel portion are usually equal. Put simply, fusion compresses a sub-graph to represent a larger graph or sub-graph. The sub-graph must satisfy a precondition: it should be isolated and have one source and one sink. The source is an AND-split, whereas the sink is an AND-join. The parallel portion must contain two or more parallel tasks, which can be identified in the proposed method by using Definition 4. In order to generate the sub-graph, all parallel tasks are eliminated from the graph and replaced by one flow between the AND-split task and the AND-join task.

4.4. Discovery of LLR

In this section, a new technique is created to address the challenge of identifying LLR (Definition 3). Before explaining the technique, ${FEL}_{a}$ , ${FEL}_{u}$ and ${FEL}_{o}$ need to be clarified. All of them are extracted from FEL. ${FEL}_{a}$ represents all traces in FEL without parallelism. By assuming that all paths are completed in ${FEL}_{a}$ with the exclusion of loops and parallelism, the unique paths in ${FEL}_{a}$ are represented in ${FEL}_{u}$ , whereas the other traces are stored in ${FEL}_{o}$ , which represents the difference ${FEL}_{a} - {FEL}_{u}$ .

Algorithm 3

Finding LLR

Algorithms 3 and 4 are created to find LLR in FEL. The knowledge built up in $D S R [n, n]$ , $A L R [n, n]$ and $C P R [n, n]$ is used as a starting point together with FEL. Initially, fusion (Section 4.3) is used to simplify FEL by removing the parallel tasks, and the resulting traces are stored in ${FEL}_{a}$ . Subsequently, by assuming that all unique paths are completed in FEL without loops and parallelism, the unique paths of all possible traces are found by excluding any trace that includes a repeated task (loop), and they are stored in ${FEL}_{u}$ . The next step is to go through ${FEL}_{o}$ to find LLR by comparing part of ${FEL}_{o}$ to ${FEL}_{u}$ until there is no relation that represents a loop. If a found loop is neither LOR nor LTR, it is LLR. In order to clarify this step, the results of Section 4.2 using FEL₁ are used as inputs to update $A L R [8, 8]$ in Table 5.
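
The split of ${FEL}_{a}$ into ${FEL}_{u}$ and ${FEL}_{o}$ described above can be sketched as follows. This is a hedged illustration rather than Algorithms 3 and 4: the function name `split_unique_paths` is hypothetical, and the input `FEL_A` is a hand-made assumption of what FEL₁ might look like after fusion, with the parallel tasks $t_{4}$ and $t_{7}$ replaced by a single arc from $t_{2}$ to $t_{6}$ .

```python
# Hypothetical FEL_a: FEL1 after fusing the parallel tasks t4 and t7
# into one arc between t2 and t6 (an assumption made for this sketch).
FEL_A = [
    ["t1", "t2", "t6"],
    ["t1", "t2", "t6"],
    ["t1", "t3", "t5", "t3", "t5", "t6"],
    ["t1", "t3", "t5", "t1", "t3", "t8"],
    ["t1", "t3", "t8"],
    ["t1", "t2", "t2", "t6"],
]


def split_unique_paths(fel_a):
    """Split FEL_a into loop-free unique paths (FEL_u) and the rest (FEL_o)."""
    fel_u, fel_o = [], []
    for trace in fel_a:
        has_repeated_task = len(set(trace)) != len(trace)
        if has_repeated_task:
            fel_o.append(trace)      # contains some kind of loop
        elif trace not in fel_u:
            fel_u.append(trace)      # keep each loop-free path only once
    return fel_u, fel_o


fel_u, fel_o = split_unique_paths(FEL_A)
print(fel_u)
print(fel_o)
```

Only the traces in `fel_o` need to be inspected for loops; the comparison against `fel_u` that classifies a loop as LLR is not covered by this sketch.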

4.5. Characteristics of graph

The representation of the flow in FEL is simplified in a graph. The graph is represented by matrices. Therefore, the matrices $D S R [n, n]$ , $A L R [n, n]$ and $C P R [n, n]$ are created based on the details in Definitions 2, 3, and 4, respectively. $C P R [n, n]$ and $A L R [n, n]$ are considered to be basic raw-data graphs with in-flows and out-flows. In order to handle the characteristics of a graph easily, Definition 5 describes the acyclic graph (AG), reverse graph (RG) and cyclic graph (CG).

Algorithm 4

Recursive part of finding LLR

Table 5

$A L R [8, 8]$ for FEL₁

Definition 5.

Let $AG = C P R [n, n]$ , $RG = A L R [n, n]$ , and $CG = (AG \cup RG)$ . $A G_{i j}$ or $R G_{i j}$ represents an edge that consists of two attributes (relation type and flow) where $i, j ⩽ n$ .

4.5.1. 1-M and M-1 flows

The 1-1 flow can be extended to different types of flow. The extended flow types can be M-1 or 1-M flows. In order to represent 1-M and M-1 flows, we need to introduce the in-task and out-task of a graph, whereby the graph is an $[n, n]$ matrix and $t_{i}, t_{j} \in T [n]$ . The M-1 relation shows the in-task, whereas 1-M shows the out-task. The in-task and out-task can be identified in $C P R [n, n]$ and $A L R [n, n]$ with their frequencies. The definitions of the in-task and out-task of AG are given in Eqs (4) and (5), respectively, and the definitions of the in-task and out-task of RG are given in Eqs (6) and (7), respectively. Following the same procedure as for the in-task, the parallelism (Definition 4) is also gathered from the relation-flow of a pair that has the “ $∥_{L}$ ” relation in the $t_{i}$ column and the $t_{j}$ column (refer to Eq. (8)). The in-task and out-task are defined as the mappings

$$\mathit{inTask}, \mathit{outTask} : T [n] \to 2^{T [n]},$$

such that

$$\mathit{inTask}_{AG} (t_{i}) = \{ t_{j} \mid t_{j} \in T [n],\ (AG_{j i}).1st = “\to_{L}” \} \tag{4}$$

$$\mathit{outTask}_{AG} (t_{i}) = \{ t_{j} \mid t_{j} \in T [n],\ (AG_{i j}).1st = “\to_{L}” \} \tag{5}$$

$$\mathit{inTask}_{RG} (t_{i}) = \{ t_{j} \mid t_{j} \in T [n],\ (RG_{j i}).2nd > 0 \} \tag{6}$$

$$\mathit{outTask}_{RG} (t_{i}) = \{ t_{j} \mid t_{j} \in T [n],\ (RG_{i j}).2nd > 0 \} \tag{7}$$

$$\mathit{prlTask}_{AG} (t_{i}) = \{ t_{j} \mid t_{j} \in T [n],\ (AG_{i j}).1st = “∥_{L}” \text{ and } (AG_{j i}).1st = “∥_{L}” \} \tag{8}$$
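
Equations (4), (5) and (8) translate almost directly into set comprehensions. The Python sketch below uses a hand-built graph fragment that is an assumption for illustration only (it is not the paper's Table 4); the dictionary layout, the relation labels `"->"` and `"||"`, and all function names are likewise hypothetical.

```python
ARROW, PARALLEL = "->", "||"

# Hypothetical AG fragment: (first task, second task) -> (relation type, flow).
AG = {
    ("t1", "t2"): (ARROW, 3),
    ("t1", "t3"): (ARROW, 4),
    ("t2", "t4"): (ARROW, 2),
    ("t2", "t7"): (ARROW, 1),
    ("t4", "t7"): (PARALLEL, 2),
    ("t7", "t4"): (PARALLEL, 1),
    ("t4", "t6"): (ARROW, 1),
    ("t7", "t6"): (ARROW, 2),
}
TASKS = ["t1", "t2", "t3", "t4", "t6", "t7"]


def relation(graph, a, b):
    """First attribute (relation type) of the edge a -> b, or None if absent."""
    return graph.get((a, b), (None, 0))[0]


def in_task(graph, tasks, ti):
    """Eq. (4): tasks with a causal edge into ti (the M-1 view)."""
    return {tj for tj in tasks if relation(graph, tj, ti) == ARROW}


def out_task(graph, tasks, ti):
    """Eq. (5): tasks reached from ti by a causal edge (the 1-M view)."""
    return {tj for tj in tasks if relation(graph, ti, tj) == ARROW}


def prl_task(graph, tasks, ti):
    """Eq. (8): tasks marked parallel to ti in both directions."""
    return {
        tj for tj in tasks
        if relation(graph, ti, tj) == PARALLEL
        and relation(graph, tj, ti) == PARALLEL
    }


print(out_task(AG, TASKS, "t1"), in_task(AG, TASKS, "t6"), prl_task(AG, TASKS, "t4"))
```

In this fragment $t_{1}$ has two out-tasks (a 1-M flow) and $t_{6}$ has two in-tasks (an M-1 flow), which is exactly the information the 1-1 view discards.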

4.5.2. Source, internal and sink tasks

There are three types of tasks: source, internal and sink tasks. The source and sink tasks are described in Eqs (9) and (10), respectively. The “one” in the equations means that $t_{i}$ is a true source or sink, whereas “zero” means that $t_{i}$ is not a source or sink. The sources are the starting point of the graph and the sinks are the end of the graph. The internal tasks are the tasks anywhere between the source and sink tasks.

$$\mathit{isSrc}_{AG} (t_{i}) = \begin{cases} 1 & \text{if } \mathit{inTask}_{AG} (t_{i}) = \emptyset \\ 0 & \text{otherwise} \end{cases} \tag{9}$$

$$\mathit{isSnk}_{AG} (t_{i}) = \begin{cases} 1 & \text{if } \mathit{outTask}_{AG} (t_{i}) = \emptyset \\ 0 & \text{otherwise} \end{cases} \tag{10}$$
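
A minimal Python sketch of Eqs (9) and (10) follows. The edge set is a hypothetical fragment of causal edges chosen for illustration, and the names `is_src` and `is_snk` simply mirror the equation names; none of this is taken from the paper's implementation.

```python
# Hypothetical causal edges (t_i ->_L t_j) of a small acyclic graph.
CAUSAL_EDGES = {
    ("t1", "t2"), ("t1", "t3"),
    ("t2", "t4"), ("t2", "t7"),
    ("t4", "t6"), ("t7", "t6"),
    ("t3", "t8"),
}
TASKS = ["t1", "t2", "t3", "t4", "t6", "t7", "t8"]


def is_src(ti, edges=CAUSAL_EDGES):
    """Eq. (9): 1 if no causal edge enters ti, otherwise 0."""
    return 0 if any(target == ti for _, target in edges) else 1


def is_snk(ti, edges=CAUSAL_EDGES):
    """Eq. (10): 1 if no causal edge leaves ti, otherwise 0."""
    return 0 if any(source == ti for source, _ in edges) else 1


sources = [t for t in TASKS if is_src(t)]
sinks = [t for t in TASKS if is_snk(t)]
internal = [t for t in TASKS if not is_src(t) and not is_snk(t)]
print(sources, sinks, internal)
```

Every task that is neither a source nor a sink is an internal task, so the three categories partition the task list.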

4.5.3. Flow detail

The detail of the flow among tasks is a substantial aspect which contributes to the structure and analysis of a process. A major detail and characteristic of flow is the in-flow and out-flow of a task, $t_{i}$ . The in-flow and out-flow of a task explain the task frequencies of the flow. Equations (11) and (12) define the in-flow and out-flow of AG, and Eqs (13) and (14) define the in-flow and out-flow of RG, respectively.

$$\mathit{inFlow}_{AG} (t_{i}) = \sum_{t_{j} \in \mathit{inTask}_{AG} (t_{i})} ((AG_{j i}).1st = “\to_{L}”) \tag{11}$$

$$\mathit{outFlow}_{AG} (t_{i}) = \sum_{t_{j} \in \mathit{outTask}_{AG} (t_{i})} ((AG_{i j}).1st = “\to_{L}”) \tag{12}$$

$$\mathit{inFlow}_{RG} (t_{i}) = \sum_{t_{j} \in \mathit{inTask}_{RG} (t_{i})} ((RG_{j i}).2nd > 0) \tag{13}$$

$$\mathit{outFlow}_{RG} (t_{i}) = \sum_{t_{j} \in \mathit{outTask}_{RG} (t_{i})} ((RG_{i j}).2nd > 0) \tag{14}$$

The in-flow of CG is described in Eq. (15) and the out-flow of CG is described in Eq. (16), which are generated based on the in-flow and out-flow equations of AG and RG.

$$\mathit{inFlow} (t_{i}) = \mathit{inFlow}_{AG} (t_{i}) + \mathit{inFlow}_{RG} (t_{i}) \tag{15}$$

$$\mathit{outFlow} (t_{i}) = \mathit{outFlow}_{AG} (t_{i}) + \mathit{outFlow}_{RG} (t_{i}) \tag{16}$$

The flow between a pair of tasks is another trait of CG. It is described in Eq. (17).

$$\mathit{Flow} (t_{i} ↣_{L} t_{j}) = CG_{i j} \tag{17}$$

Confidence in Eq. (18) is assigned to the flow between a pair of tasks.

$$\mathit{conf} (t_{i} ↣_{L} t_{j}) = \frac{\mathit{Flow} (t_{i} ↣_{L} t_{j})}{\mathit{outFlow} (t_{i})} \tag{18}$$
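
The flow and confidence definitions can be illustrated with a few lines of Python. The sketch rests on explicit assumptions: the graph `CG` is a hypothetical fragment that maps an ordered pair of tasks to its flow frequency, in-flow and out-flow are taken to be sums of those frequencies over the in-tasks and out-tasks, and a confidence of 0.0 is returned when a task has no out-flow.

```python
# Hypothetical CG fragment: (first task, second task) -> flow frequency.
CG = {
    ("t1", "t2"): 3,
    ("t1", "t3"): 4,
    ("t4", "t6"): 1,
    ("t7", "t6"): 2,
    ("t5", "t6"): 1,
}


def out_flow(cg, ti):
    """Total frequency leaving ti (assumed reading of Eq. (16))."""
    return sum(freq for (source, _), freq in cg.items() if source == ti)


def in_flow(cg, ti):
    """Total frequency entering ti (assumed reading of Eq. (15))."""
    return sum(freq for (_, target), freq in cg.items() if target == ti)


def conf(cg, ti, tj):
    """Eq. (18): share of ti's out-flow that goes to tj."""
    total = out_flow(cg, ti)
    # Assumption: a task without out-flow yields a confidence of 0.0.
    return cg.get((ti, tj), 0) / total if total else 0.0


print(out_flow(CG, "t1"), in_flow(CG, "t6"))
print(conf(CG, "t1", "t2"), conf(CG, "t1", "t3"))
```

Because the confidences of all flows leaving one task add up to 1, they stay comparable between a task with few observations and a task with many, which is the proportionality argument made for confidences over raw frequencies.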

4.6. Dependency construction

A new equation (Eq. (19)), called “ $D_{η}$ ”, is introduced in order to increase the accuracy and precision of the dependency. $D_{η}$ is a built-up operation, where the characteristics and features of the flow graph are the main components used to formulate $D_{η}$ . Therefore, the characteristics and features that are involved in $D_{η}$ must be computed before $D_{η}$ can be generated. $D_{η}$ is an extension of other dependency equations. It exploits their strengths and avoids their weaknesses. Since the flow relation between a pair of tasks ( $t_{i} \Rightarrow t_{j}$ ) has two ends, the other types of flow relation, namely M-1 and 1-M, are considered in $D_{η}$ . For example, if the 1-M flow relation is considered, the first end is $t_{i}$ with one task, whereas the second end is $t_{j}$ , which may be part of a group of tasks ( $t_{x}, t_{y}, t_{z}, \dots$ ). The M-1 flow relation, on the other hand, is the opposite situation of the 1-M flow relation. Thus, the confidences of the flow for the contrary directions of the two tasks are included in $D_{η}$ . Equation (19) uses normalised parameters, and the result is between $-1$ and 1.

$$D_{η} (t_{i} \Rightarrow t_{j}) = \frac{\mathit{conf} (t_{i} ↣ t_{j}) - \mathit{conf} (t_{j} ↣ t_{i})}{\mathit{conf} (t_{i} ↣ t_{j}) + \mathit{conf} (t_{j} ↣ t_{i}) + 1} \tag{19}$$

Equation (19) is extended in Eq. (20) for LOR and in Eq. (21) for LTR for two reasons. The first reason is that the natural direction of the flow is forward, because a process progresses towards its end in order to achieve the process objectives. The second reason is that this extension aligns with Eq. (1) as proposed by Weijters et al. [37] in their consideration of the two kinds of short loops (LOR and LTR) as exceptions. However, Eqs (1) and (19) are the main dependency equations, and the other extended equations (Eqs (2), (3), (20), and (21)) are treated as special cases or exceptions when identifying the dependency. $\begin{array}{ll} (20) & D_{η}(t_{i} \Rightarrow t_{i}) = \frac{\mathit{conf}(t_{i} ↣ t_{j})}{\mathit{conf}(t_{i} ↣ t_{j}) + 1} \\ (21) & D_{η}(t_{i} \Rightarrow_{2} t_{j}) = \frac{\mathit{conf}(t_{i} ↣ t_{j}) + \mathit{conf}(t_{j} ↣ t_{i})}{\mathit{conf}(t_{i} ↣ t_{j}) + \mathit{conf}(t_{j} ↣ t_{i}) + 1} \end{array}$
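The three forms of $D_{η}$ can be sketched as plain functions of the two confidences. The function names and input values below are hypothetical; the confidences are assumed to have been computed beforehand with Eq. (18), and the loop variants take the confidence terms exactly as they are passed in.

```python
def d_eta(conf_ij, conf_ji):
    """Eq. (19): dependency between two distinct tasks."""
    return (conf_ij - conf_ji) / (conf_ij + conf_ji + 1)

def d_eta_lor(conf_loop):
    """Eq. (20): dependency for a loop of length one (LOR)."""
    return conf_loop / (conf_loop + 1)

def d_eta_ltr(conf_ij, conf_ji):
    """Eq. (21): dependency for a loop of length two (LTR)."""
    return (conf_ij + conf_ji) / (conf_ij + conf_ji + 1)

# Illustrative confidences (not taken from the Teleclaim log).
print(d_eta(0.6, 0.25))      # mostly forward flow: positive value
print(d_eta(0.25, 0.6))      # mostly contrary flow: negative value
print(d_eta(0.3, 0.3))       # equal flow in both directions: 0.0
print(d_eta_lor(1.0))
print(d_eta_ltr(0.5, 0.5))
```

Because each confidence lies in [0, 1], the values produced by this sketch of Eq. (19) fall within [-0.5, 0.5], which is inside the stated bounds of 1 and $-1$.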

4.7. Stages in process model

The main target of this section is to define the model with unique, sequential and stable stages, $M_{x}$ . The stages in $M_{x}$ can easily express the 1-M and M-1 flow relations. $M_{x}$ is defined and used as a new representation to classify and separate the tasks into stages. Each stage is unique and stable, and it contains one or more tasks. In the normal case and direction, the tasks in one stage are sequentially linked to the tasks in the following stage until the last stage in the model is reached. In the case of a loop, a task may be linked to the previous stage or to the same stage. Definition 6 describes $M_{x}$ .

Definition 6.
Let $T[n] = \{t_{1}, t_{2}, \dots, t_{n}\}$ be a set of unique tasks in FEL, where n is the number of unique tasks. Let $M_{x} = \{S_{1}, S_{2}, \dots, S_{x}\}$ , where $S_{i} \subseteq T[n]$ , $1 ⩽ i, j ⩽ x$ , and $1 ⩽ h, k ⩽ n$ . $M_{x}$ is a set of unique, sequential and stable stages iff:
$S_{1} \cup S_{2} \cup S_{3} \cup \dots \cup S_{x} = T[n]$ ,

$S_{i} \cap S_{j} = \emptyset$ for $i \neq j$ , and

$\forall t_{h} \in S_{i}$ , $\exists (t_{k} \in S_{j} \text{ and } \mathit{Flow}(t_{h} ↣_{L} t_{k}) > 0)$ .
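The conditions of Definition 6 can be checked mechanically. The sketch below is one possible reading, not the authors' implementation: the first two conditions are tested as stated, while the third is interpreted, as an assumption, to mean that every task in a non-final stage has positive flow to at least one task in the following stage. The task names and flow values are hypothetical.

```python
def is_staged_model(stages, tasks, flow):
    """Check a candidate M_x against the conditions of Definition 6.

    stages: ordered list of sets of task names (S_1 .. S_x)
    tasks:  set of all unique task names, T[n]
    flow:   dict mapping (source, target) to a flow frequency
    """
    covered = set().union(*stages)
    # Condition 1: the stages together cover T[n].
    if covered != set(tasks):
        return False
    # Condition 2: the stages are pairwise disjoint
    # (no task is counted twice across stages).
    if sum(len(stage) for stage in stages) != len(covered):
        return False
    # Condition 3 (assumed reading): each task in a non-final stage
    # flows to at least one task in the following stage.
    for current, following in zip(stages, stages[1:]):
        for t_h in current:
            if not any(flow.get((t_h, t_k), 0) > 0 for t_k in following):
                return False
    return True

tasks = {"a", "b", "c", "d"}
flow = {("a", "b"): 5, ("a", "c"): 5, ("b", "d"): 5, ("c", "d"): 5}

print(is_staged_model([{"a"}, {"b", "c"}, {"d"}], tasks, flow))        # valid staging
print(is_staged_model([{"a"}, {"a", "b", "c"}, {"d"}], tasks, flow))   # overlapping stages
```

Here the stage {b, c} shows a 1-M relation from a and an M-1 relation into d.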

By considering loops as exceptions in $M_{x}$ and excluding them from the flow, the actual in-flow and out-flow in $M_{x}$ are given in Eqs (22) and (23), respectively. $\begin{array}{ll} (22) & \mathit{inGph}(M_{x}) = \sum_{\mathit{isSrc}_{AG}(t_{i}) = 1} \bigl( \mathit{outFlow}_{AG}(t_{i}) - \mathit{inFlow}_{RG}(t_{i}) \bigr) \\ (23) & \mathit{outGph}(M_{x}) = \sum_{\mathit{isSnk}_{AG}(t_{i}) = 1} \bigl( \mathit{inFlow}_{AG}(t_{i}) - \mathit{outFlow}_{RG}(t_{i}) \bigr) \end{array}$

When $\mathit{gphFlow}$ is a throughflow of $M_{x}$ , we assume Eq. (24). $\begin{matrix} (24) & \mathit{inGph}(M_{x}) = \mathit{outGph}(M_{x}) = \mathit{gphFlow}(M_{x}) \end{matrix}$

Loop and parallelism in $M_{x}$ . Loops and parallelism are handled in the proposed method through special techniques. A loop conflicts with the characteristics of $M_{x}$ if $M_{x}$ is acyclic, because it breaks the rule of one direction in the stages. For example, the existence of a loop in a source or sink breaches Eqs (9) and (10). Therefore, if there is a loop in a source or sink, we cannot identify the source and sink. However, we can identify loops by using $ALR[n, n]$ or by comparing the in-flow and out-flow of the source or sink. The difference between the in-flow and out-flow in this case must be high unless there is very low flow frequency, high noise, or incompleteness in FEL. Parallelism also conflicts with the construction of stages, because stages are sequential, whereas parallel tasks are concurrent. In addition, the flow and dependency between the start and the end of the parallel portion are usually equal. Therefore, fusion is applied to the parallel portion.
5. Evaluation principle

The concept of the stages and dependency promotes the use of a novel evaluation principle. Hence, this concept is exploited and extended to evaluate and validate the proposed method by using the evaluation principle. The principle is summarised in the proposition that the total out-flow and total in-flow of a task in $M_{x}$ should be balanced, and the source and sink tasks should be excluded. Therefore, some equations used in the evaluation are introduced with proof.

Equations (25) and (26) are introduced for $t_{i}$ , where $t_{i}$ is a task in $M_{x}$ . $\overline{d_{in}}(t_{i})$ and $\overline{d_{out}}(t_{i})$ are the average of the total in-flow dependency and the average of the total out-flow dependency, respectively, where n is the number of unique tasks in $M_{x}$ . $\begin{array}{ll} (25) & \overline{d_{in}}(t_{i}) = \frac{\sum_{t_{j} \in \mathit{inflow}(t_{i})} D_{η}(t_{j} \Rightarrow t_{i})}{n} \\ (26) & \overline{d_{out}}(t_{i}) = \frac{\sum_{t_{j} \in \mathit{outflow}(t_{i})} D_{η}(t_{i} \Rightarrow t_{j})}{n} \end{array}$

Equation (27) is called the dependency difference of a stage, $S$ , where $S = \{t_{1}, t_{2}, t_{3}, \dots, t_{m}\}$ is a set of unique tasks in a stage. Let $t_{i} \in T[n]$ and $Δ d$ be the difference. Based on Property 1, a significant fact of the stage is that $\mathit{outFlow}(S)$ and $\mathit{inFlow}(S)$ should be balanced. $\begin{matrix} (27) & Δ d(S) = \sum_{t_{i} \in S} (\overline{d_{out}}(t_{i}) - \overline{d_{in}}(t_{i})) \end{matrix}$

Property 1.
Let us assume the following conditions in a primary and perfect situation of FEL:
All traces are completed.

No noise is included.

Let $S = \{t_{1}, t_{2}, t_{3}, \dots, t_{m}\}$ , $t_{i} \in T[n]$ , and $S \in M_{x}$ , where $\forall t_{i} \in S$ , $i ⩽ m ⩽ n$ , $\mathit{isSrc}_{AG}(t_{i}) \neq 1$ , and $\mathit{isSnk}_{AG}(t_{i}) \neq 1$ . We have: $\begin{matrix} \mathit{inFlow}(S) = \mathit{outFlow}(S) . \end{matrix}$

Proof.
$\begin{array}{l} \mathit{inFlow}(S) = \sum_{t_{i} \in S} \mathit{inFlow}(t_{i}) \\ \text{and} \\ \mathit{outFlow}(S) = \sum_{t_{i} \in S} \mathit{outFlow}(t_{i}) \\ ∵ \forall t_{i} \in S,\ \mathit{inFlow}(t_{i}) = \mathit{outFlow}(t_{i}) \text{ if } t_{i} \text{ is not a source or sink task,} \\ ∴ \mathit{inFlow}(S) = \mathit{outFlow}(S) . \end{array}$ □
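Property 1 can be illustrated on a toy log. The sketch below counts flow as direct succession between consecutive tasks in complete, noise-free traces; this counting rule, the trace contents and the function names are assumptions made for the example and are simpler than the matrices used in the proposed method.

```python
from collections import Counter

def direct_flows(traces):
    """Count how often each task is directly followed by another task."""
    flows = Counter()
    for trace in traces:
        for src, dst in zip(trace, trace[1:]):
            flows[(src, dst)] += 1
    return flows

def in_flow(flows, task):
    """Total flow entering a task."""
    return sum(count for (_, dst), count in flows.items() if dst == task)

def out_flow(flows, task):
    """Total flow leaving a task."""
    return sum(count for (src, _), count in flows.items() if src == task)

# Hypothetical complete traces: "a" is the source, "d" is the sink.
traces = [["a", "b", "d"], ["a", "c", "d"], ["a", "b", "d"]]
flows = direct_flows(traces)

stage = {"b", "c"}  # a stage without source or sink tasks
stage_in = sum(in_flow(flows, t) for t in stage)
stage_out = sum(out_flow(flows, t) for t in stage)
print(stage_in, stage_out)  # equal when the traces are complete and noise-free
```

The source and sink do not balance in this log (task a has no in-flow and task d has no out-flow), which is why they are excluded from the property.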

FEL in the real world usually breaks the assumed conditions: it includes source and sink tasks, and it may include a degree of noise or incomplete traces that causes slight deviations from accurate results. Therefore, the significant property of the stages is that $\mathit{outFlow}(S)$ and $\mathit{inFlow}(S)$ should be nearly balanced. This property can also be transferred to the dependency in the stage, because the dependencies are generated from the flow that is indicated in $\mathit{outFlow}(S)$ and $\mathit{inFlow}(S)$ . As a result, the averages of the total out-flow and in-flow dependencies in the stage should also be nearly balanced, as follows: $\begin{matrix} \mathit{inFlow}(S) \approx \mathit{outFlow}(S) \Rightarrow \sum_{t_{i} \in S} \overline{d_{in}}(t_{i}) \approx \sum_{t_{i} \in S} \overline{d_{out}}(t_{i}) \end{matrix}$

Based on this derivation, we can measure and compare the performance of the dependency representations and we believe that the difference in Eq. (27) should be close to zero.

Equation (28) is the average used at the end of the evaluation. In Eq. (28), the absolute values of $Δ d(S)$ are added and divided by x, which is the number of stages in $M_{x}$ . $\begin{matrix} (28) & \overline{Δ d}(M_{x}) = \frac{\sum_{S \in M_{x}} | Δ d(S) |}{x} \end{matrix}$
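Equations (25)–(28) can be chained as in the following sketch. It assumes that the dependency values are already available in a dictionary with one entry per pair of tasks that has positive flow, and that the stages passed in already exclude the source and sink tasks; the names and values are hypothetical.

```python
def avg_in(dep, task, n):
    """Eq. (25): summed dependency into `task`, divided by n."""
    return sum(v for (_, dst), v in dep.items() if dst == task) / n

def avg_out(dep, task, n):
    """Eq. (26): summed dependency out of `task`, divided by n."""
    return sum(v for (src, _), v in dep.items() if src == task) / n

def delta_d(stage, dep, n):
    """Eq. (27): dependency difference of one stage."""
    return sum(avg_out(dep, t, n) - avg_in(dep, t, n) for t in stage)

def mean_abs_delta(stages, dep, n):
    """Eq. (28): mean absolute dependency difference over the stages."""
    return sum(abs(delta_d(stage, dep, n)) for stage in stages) / len(stages)

# Hypothetical dependency values for a four-task model a -> {b, c} -> d.
dep = {("a", "b"): 0.5, ("a", "c"): 0.3, ("b", "d"): 0.2, ("c", "d"): 0.3}
n = 4                      # number of unique tasks
stages = [{"b", "c"}]      # assumption: source {a} and sink {d} are excluded

print(delta_d({"b", "c"}, dep, n))
print(mean_abs_delta(stages, dep, n))
```

In this toy case, task c is balanced while task b receives more dependency than it passes on, so the stage difference is negative and its absolute value is what enters Eq. (28).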
6. Evaluation framework and experiments

Evaluation in knowledge discovery is a challenging component, especially in process mining. According to the Process Mining Manifesto [24], some of the challenges in evaluation include finding baselines, avoiding bias, combining with other kinds of analysis, and balancing the measures. Additionally, evaluation in knowledge discovery has encountered several open issues that are out of the scope of the proposed method. However, these issues may be linked to this research in future work. The main focus of this section is to show how effectively the major aim of the proposed method is evaluated and how the proposed method compares with and outperforms other methods. The major aim to be evaluated and validated in this section is $D_{η}$ , which includes the proper dependency representation and comprehensive data flow relations.

One of the important parts of our intensive experiments was the validation. In order to examine its robustness and reliability, the proposed method was validated by interfering with the flow. We intentionally injected loops or exceptions that deviated or corrupted the outcomes of the dependency. Subsequently, the outcomes were observed and discussed in order to identify the variations in the outcomes and the degree of robustness and reliability of the proposed method compared with the baselines.

6.1. Teleclaim process

The Teleclaim process was obtained from the process mining group² at the Eindhoven University of Technology, which is a pioneer in the process mining discipline. The Teleclaim process has been used in several published articles [5,27,28]. Its event log, used in our experiments, is synthetic and without noise.

² More information can be found at http://www.processmining.org.

Teleclaim represents the process of an Australian insurance company and shows how the insurance company handles its claims. The log contains 46,138 events related to 3,512 traces (transactional records or claims) with 15 unique tasks. The process deals with the handling of inbound phone calls, whereby different types of insurance claims (e.g., household and car) are lodged over the phone. The process is supported by two separate call centres operating for two different organizational entities (Brisbane and Sydney). Both centres are similar in terms of incoming call volume and average total call handling time, but differ in the way call centre agents are deployed, and in their underlying IT systems. After the initial steps in the call centre, the remainder of the process is handled by the back-office of the insurance company.

6.2. Baselines

Another view of dependency formulation uses the M-1 flow while disregarding the other flow directions. An example of this is the Pawlak representation in [12]. Pawlak introduced Eq. (29) based on an acyclic flow graph, and uses the flow graph to describe the relations of association rules and dependencies among tasks. For a pair of tasks ( $t_{i}, t_{j}$ ), the key elements in this equation are the confidence of $(t_{i} ↣_{L} t_{j})$ and the normalised flow of the dependent task, $t_{j}$ . The elements of the equation are fraction values after normalisation. As with the Weijters equation, the result of $η(t_{i}, t_{j})$ is between 1 and $-1$ . Although Pawlak considers the M-1 flow, fraction and confidence in the dependency formulation, the dependency is formulated based on one direction, neglecting the contrary direction and the 1-M flow. $\begin{matrix} (29) & η(t_{i}, t_{j}) = \frac{\mathit{conf}(t_{i} ↣_{L} t_{j}) - (\mathit{inFlow}(t_{j}) / \mathit{gphFlow}(M_{x}))}{\mathit{conf}(t_{i} ↣_{L} t_{j}) + (\mathit{inFlow}(t_{j}) / \mathit{gphFlow}(M_{x}))} \end{matrix}$
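For comparison with $D_{η}$ , Eq. (29) can be sketched as a single function. The function name and the numeric inputs are hypothetical; the confidence, the in-flow of the dependent task and the throughflow of the model are assumed to have been computed beforehand.

```python
def eta(conf_ij, in_flow_j, gph_flow):
    """Eq. (29): dependency from the forward confidence and the
    normalised in-flow of the dependent task t_j."""
    normalised = in_flow_j / gph_flow
    return (conf_ij - normalised) / (conf_ij + normalised)

# Illustrative values (not taken from the Teleclaim log).
print(eta(0.6, 30, 100))   # confidence above the normalised in-flow: positive
print(eta(0.5, 50, 100))   # confidence equal to the normalised in-flow: 0.0
print(eta(0.2, 50, 100))   # confidence below the normalised in-flow: negative
```

Only the confidence of the forward direction appears in the function, which reflects the one-directional formulation discussed above. The sketch does not guard against a zero denominator, which occurs when both terms are zero.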

Since $D_{η}$ is proposed as a new dependency in Eq. (19), its outcomes need to be evaluated, measured and compared with the outcomes of other dependency equations. Therefore, the Weijters equation (Eq. (1)) and the η equation (Eq. (29)) are used as baselines when evaluating the $D_{η}$ equation (Eq. (19)).

6.3. Measures

In order to evaluate and discuss most process mining techniques or the quality of the discovered model, four classical quality dimensions are defined: fitness (replay event log), simplicity (avoid complexity), precision (avoid underfitting), and generalisation (avoid overfitting). The four quality dimensions must be carefully balanced to get the best process model. However, balancing them is challenging [24] because of the difficulty in measuring the graph-structured model. Most of the evaluation measures in process mining deal with the quality of the graph-structured model [29]. Moreover, discovering the best process model involves other complex aspects of process mining. However, various measures can be used in process model discovery based on the representational aims of an analyst [21]. Therefore, the dependency is used as a measure to evaluate the main aims of the proposed representation.

Table 6
$DSR[15, 15]$ for FEL₂

Table 7
$CPR[15, 15]$ for FEL₂

6.4. Evaluation procedures

This section lists the sequence of the procedures followed to evaluate the proposed aims. The following consecutive procedures were implemented in this evaluation:

The proposed method was executed using the Teleclaim process.

The α-algorithm and Manual models were discovered for the Teleclaim process in order to create models with their stages.

The results of the dependency equations for the baselines and $D_{η}$ were identified for the α-algorithm and Manual models.

The dependency setup was conducted by executing the proposed equations for the evaluation (Eqs (25), (26), (27) and (28)). After that, the results of these equations were discussed and compared in order to demonstrate the significance of the proposed representation.

The proposed method was validated in terms of the data flow directions.

Fig. 2.

α-algorithm model generated by the ProM framework using the Teleclaim event log in the XES file format.

6.5. Results

6.5.1. Proposed method results

The preprocessing extracted all the tasks to be created, $T[15]$ , and assigned new conventional names to every task. One set of transactional records was generated for the traces, excluding irrelevant and useless data. The final result of the preprocessing was FEL₂. Subsequently, the abstraction step started from the inputs FEL₂ and $T[15]$ . By using these inputs and the concepts of DSR, LOR and LTR, the abstraction step generated $DSR[15, 15]$ in Table 6 and $ALR[15, 15]$ ( $ALR[15, 15]$ was empty, because there were no loops in the Teleclaim process). After that, the induction step was executed to obtain the details in $CPR[15, 15]$ in Table 7. Since no LLR existed in FEL₂, Algorithm 3 did not modify $ALR[15, 15]$ when it was executed. In dependency construction, the outcomes of the three dependency equations (Weijters, η and $D_{η}$ ) were computed for the α-algorithm and Manual models. The stages identification started with finding the source, sink, in-, out- and parallel tasks in $CPR[15, 15]$ and $ALR[15, 15]$ . Equations (9) and (10) obtained the source and sink tasks, respectively. $\{t_{1}\}$ was the only source and $\{t_{11}, t_{12}, t_{13}, t_{14}, t_{15}\}$ were the sink tasks in the Teleclaim process. The parallel tasks, which were obtained by Eq. (8), were $\{t_{8}, t_{9}, t_{10}\}$ ; the rest of the tasks were the internal tasks $\{t_{2}, t_{3}, t_{4}, t_{5}, t_{6}, t_{7}, t_{8}, t_{9}, t_{10}\}$ .

6.5.2. α-Algorithm and Manual models

The outcomes of the α-algorithm and Manual models were represented in the Petri net modelling language. For the purpose of comparing and evaluating the proposed method, the stages were manually added and aligned in the α-algorithm and Manual models. The α-algorithm model in Fig. 2 was created by the ProM framework [31]. The α-algorithm model consists of eight stages. More stages were discovered because it was difficult to identify the non-free-choice situation in the Teleclaim process using the α-algorithm. This issue prevented the α-algorithm from depicting the model correctly. For instance, the α-algorithm did not discover the parallel tasks in the model. Therefore, the parallel tasks were included in the stages.

Fig. 3.

Manual model for the Teleclaim process.

The Manual model for the Teleclaim process in Fig. 3 was derived from the α-algorithm model. A critical issue arises in the α-algorithm model, namely, how to discover and model a non-free-choice construct. The Teleclaim process has a non-free-choice situation, which is a mixture of parallelism and choice [38]. This situation created a deadlock in the model generated by the α-algorithm. The situation arose after transition $t_{7}$ . According to [33], a deadlock is an extremely detrimental circumstance in which a group of tasks wait indefinitely for one another to release resources. The non-free-choice situation also caused the parallelism to be ignored; this parallelism starts from the AND-split in the dummy task and should end at $t_{13}$ , the position of the AND-join. Because of the limitations of the Petri net modelling language, a new dummy transition (task) was added to the Manual model. As a result, the deadlock and the ignored parallelism caused by the non-free-choice situation were both avoided in the Manual model.

The Manual model generated seven stages. Because the dummy task was introduced to solve the non-free-choice issue, it was included in the stages of the Manual model and was treated as a special case. The in-flow or out-flow of the dummy task was ( $\mathit{outflow}(t_{7}) - \mathit{inflow}(t_{14})$ ), because $t_{7}$ had a flow option either to $t_{14}$ or to the parallel tasks. This technique enabled the dummy task to be included in the stages of the Manual model.

6.6. Implementation of the evaluation principle

With the help of stages, dependency can be an adequate and useful measure, since the stages and dependency provide a significant and comprehensive representation that can quantitatively evaluate the extracted knowledge. This section implements the evaluation principle and shows how the dependency was used to evaluate and compare $D_{η}$ with the baselines, as well as to show that the proposed method outperforms the baselines.

6.6.1. Results of dependency

The difference in Eq. (27) was computed for each dependency equation in every model. For FEL₂, the results of that equation are shown in Figs 4 and 5 for the α-algorithm and Manual models, respectively. The results of Eq. (28) for FEL₂, which is the average of the absolute differences, are shown in Table 8 in order to compare the three dependency equations for the two models.

Fig. 4.

Results of Eq. (27) using α-algorithm.

Fig. 5.

Results of Eq. (27) using Manual.

6.6.2. Discussion

The discussion in this section mainly focuses on the dependency aligned by stages. It is based on the evaluation principle that the difference in Eq. (27) should be zero or near to zero. Therefore, the information in Figs 4 and 5 is discussed. The information in the figures is organised by dependency equation and by stage. The vertical axis represents the dependency differences, and the horizontal axis represents the stages. The results of Eq. (27) fluctuate between positive and negative values. The curves in the aforementioned figures describe the outputs of the Weijters, η and $D_{η}$ equations at every stage. Table 8 shows the results of Eq. (28) for all the dependency equations and models. These results are depicted in Fig. 6.

$D_{η}$ is the third dependency equation shown in Figs 4 and 5. Most of the values of $D_{η}$ were the best values, that is, almost zero. $D_{η}$ combines and balances the Weijters and η equations to obtain the best dependency formulation. This formulation avoids the negative aspects found in the other equations and addresses some neglected ones. For example, $D_{η}$ considers the M-1 and 1-M flows, and it uses confidences instead of frequencies when formulating the equation. It also considers bidirectional flow. Hence, the proposed equation provides more precise results.

In summary, Table 8 and Fig. 6 show that $D_{η}$ gives the lowest results among all the models and dependency equations. In Table 8, $D_{η}$ yielded 0.18 and 0.15 for the α-algorithm and Manual models, respectively. These were the lowest results, and the closest to zero, for both models. The results of $D_{η}$ indicate that the representation in $D_{η}$ is significant and comprehensive, since $D_{η}$ considers aspects that are neglected in the dependency baselines.

Table 8
Results of Eq. (28) using all models and dependency equations

Model	Weijters	η	$D_{η}$
α algorithm	0.67	0.22	0.18
Manual	0.61	0.21	0.15

Fig. 6.

Results of Eq. (28) for three dependencies and two models.

6.7. Validation

The proposed representations were validated by disrupting the dataset. FEL₂ was disrupted by injecting it with contrary flow (loops), which can be called exceptions due to their contrary flow direction. Subsequently, the outcomes were compared in order to observe the performance of the proposed representation against the related baselines. The validation highlighted the significance of the proposed representation.

6.7.1. Setup and discussion when disrupting FEL₂

FEL₂ did not contain any kind of loop. In order to validate the proposed representation, its performance was observed while FEL₂ was disrupted by gradually appending diverse kinds of exceptions, 5% per round. New traces that include exceptions were appended in each round until 30% of exceptions had been appended. The percentages are relative to the total number of traces in FEL₂. The comparison is based on the property that the perfect result of Eq. (28) is zero. In addition, for the purpose of generalising the selected exceptions, the tasks to be included in these exceptions were carefully selected. Since middle tasks account for a large part of the dependencies, they were chosen for the appended exceptions. The selected tasks were for LOR ( $t_{7} ↺_{L} t_{7}$ ), for LTR ( $t_{6} △_{L} t_{7}$ ) and for LLR ( $t_{2} ↶_{L} t_{7}$ ).
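The round-based disruption can be sketched as follows. This is only an illustration of the schedule (5% per round up to 30%, relative to the original number of traces); the function name, the way exception traces are cycled, and the contents of the base and exception traces are hypothetical and do not reproduce the actual Teleclaim traces.

```python
from itertools import cycle, islice

def disrupted_logs(traces, exception_traces, step_pct=5, limit_pct=30):
    """Return one disrupted log per round as (percentage, log) pairs.

    Each log is the original traces plus exception traces amounting to
    `pct` percent of the original number of traces.
    """
    logs = []
    for pct in range(step_pct, limit_pct + 1, step_pct):
        extra = len(traces) * pct // 100
        appended = list(islice(cycle(exception_traces), extra))
        logs.append((pct, traces + appended))
    return logs

# Hypothetical base log of 20 identical loop-free traces.
base = [["t1", "t2", "t6", "t7", "t11"] for _ in range(20)]

# Hypothetical exception traces, one per kind of loop.
exceptions = [
    ["t1", "t2", "t6", "t7", "t7", "t11"],              # LOR on t7
    ["t1", "t2", "t6", "t7", "t6", "t7", "t11"],        # LTR between t6 and t7
    ["t1", "t2", "t6", "t7", "t2", "t6", "t7", "t11"],  # LLR from t7 back to t2
]

for pct, log in disrupted_logs(base, exceptions):
    print(pct, len(log) - len(base))  # percentage and number of appended traces
```

Each resulting log would then be passed through the same pipeline, and Eq. (28) would be recomputed per round for each dependency equation.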

Fig. 7.

Results of Eq. (28) using Weijters and gradually appending mixed exceptions.

Fig. 8.

Results of Eq. (28) using η and gradually appending mixed exceptions.

Fig. 9.

Results of Eq. (28) using $D_{η}$ and gradually appending mixed exceptions.

Figures 7, 8, and 9 describe the results of the Weijters, η and $D_{η}$ equations, respectively, when using the mixed exceptions. A brief comparison of Figs 7, 8, and 9 shows that the outcomes of the $D_{η}$ equation in Fig. 9 were closer to zero than the outcomes of the baselines.

7. Conclusion

Data mining and process mining share many common aspects, especially in terms of representing the extracted knowledge in the data flow relations and dependencies between a pair of tasks. This commonality motivated us to cover the gap in the representation of data flow relations and dependencies. We identified a lack of a proper representation that reflects the actual flow relations and dependencies. Our comprehensive representation is achieved by the proposed method, which is executed sequentially until the targeted aims are reached. The proposed method is supported by other concepts, namely, log-based ordering relations [30] and other dependency equations [12,37].

The main contribution of this paper is a new representation of the neglected flow relations and a proper dependency equation, benefiting from both the data mining and process mining perspectives. One example in process mining is the heuristic-based algorithms. These algorithms consider only the 1-1 flow relation between a pair of tasks, although they include other important parts, namely, the contrary flow direction. Additionally, they rely on the frequency, not the confidence, when computing the dependency. Ignoring the other flow relations and confidences could give a misleading or incomplete picture. In data mining, the η dependency is an example that provides a good explanation of the flow between two tasks by including the one-directional confidence. However, η includes only one-directional flow, disregarding the opposite direction. It was also formulated improperly. Therefore, the proposed representation provides comprehensive flow relations, and increases the trust in and precision of the dependency outcomes. Since FEL can be extracted and filtered from various data sources (e.g., web logs and user logs), the proposed data flow and dependency can be reflected in the service dependency of a web context.

In regard to the evaluation and validation, the proposed contributions facilitated and encouraged the introduction of novel evaluation arrangements that manage and control the evaluation and experiment, since the evaluation in process mining is a controversial issue. The experimental evaluation used dependency measures, whereas the baselines of the dependency equation were Weijters and η equations. The measure and baselines were applied to the data of the Teleclaim process, which is commonly used in process mining. The dependency with the help of stable, unique and sequential stages provided us with an equality assessment between the in-flows and out-flows in the task. Hence, the new representation was evaluated based on the distance of the average of the dependency differences between the in-flows and out-flows from zero. The experiments on the data provided us with promising results, and demonstrated that the proposed method obtained more precise results and outperformed other representations.

References

Agrawal,

Gunopulos and

Leymann, Mining process models from workflow logs, in: Advances in Database Technology,

Schek et al., eds, Vol. 1377, Springer, Berlin, Heidelberg, 1998, pp. 467–483.

Aldahami,

Li and

Chan, Using a flow graph to represent data flow and dependency in event logs, in: 2015 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), Vol. 1, 2015, pp. 531–539.

Bose and

van der Aalst, When process mining meets bioinformatics, in: IS Olympics: Information Systems in a Diverse World, 2012, pp. 202–217.

de Medeiros,

van Dongen,

van der Aalst and

Weijters, Process Mining: Extending the α-Algorithm to Mine Short Loops, Vol. 19, Eindhoven University of Technology, Eindhoven, 2004.

García-Banuelos and

Dumas, Towards an open and extensible business process simulation engine, in: CPN Workshop, 2009.

Han,

Kamber and

Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2011.

Hwang and

Yang, On the discovery of process models from their instances, Decision Support Systems 34(1) (2002), 41–57. doi:10.1016/S0167-9236(02)00008-8.

Li and

Wu, Interpretation of association rules in multi-tier structures, International Journal of Approximate Reasoning 55(6) (2014), 1439–1457. doi:10.1016/j.ijar.2014.04.015.

Li,

Yang and

Xu, Multi-tier granule mining for representations of multidimensional association rules, in: Data Mining, 2006. ICDM’06. Sixth International Conference on, IEEE, 2006, pp. 953–958.

10.

Liu,

Sun,

Zhang and

Liu, Extended Pawlak’s flow graphs and information theory, in: Transactions on Computational Science, Springer, 2009, pp. 220–236.

11.

Lou,

Fu,

Yang,

Li and

Wu, Mining program workflow from interleaved traces, in: The 16th SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, USA, 2010, pp. 613–622.

12.

Pawlak, Flow graphs and data mining, in: Transactions on Rough Sets III,

Peters and

Skowron, eds, Lecture Notes in Computer Science, Vol. 3400, Springer, Berlin, Heidelberg, 2005, pp. 1–36.

13.

Pawlak, Decision trees and flow graphs, in: Rough Sets and Current Trends in Computing,

Hata, et al., eds, Lecture Notes in Computer Science, Vol. 4259, Springer, Berlin, Heidelberg, 2006, pp. 1–11.

14.

Rozinat and

van der Aalst, Conformance checking of processes based on monitoring real behavior, Information Systems 33(1) (2008), 64–95. doi:10.1016/j.is.2007.07.001.

15.

Smirnov,

Pashkin,

Levashova,

Shilov and

Kashevnik, Role-based decision mining for multiagent emergency response management, in: Autonomous Intelligent Systems: Multi-Agents and Data Mining, Springer, Berlin, 2007, pp. 178–191. doi:10.1007/978-3-540-72839-9_15.

16.

Solé and

Carmona, A high-level strategy for c-net discovery, in: Application of Concurrency to System Design (ACSD), 2012 12th International Conference on, IEEE, 2012, pp. 102–111.

17.

Solé and

Carmona, An SMT-based discovery algorithm for c-nets, in: Application and Theory of Petri Nets, Springer, 2012, pp. 51–71.

18.

Solé and

Carmona, Amending C-net discovery algorithms, in: Proceedings of the 28th Annual ACM Symposium on Applied Computing, Ser. SAC ’13, ACM, New York, NY, USA, 2013, pp. 1418–1425. doi:10.1145/2480362.2480628.

19.

Sun,

Liu and

Zhang, An extension of Pawlak’s flow graphs, in: Rough Sets and Knowledge Technology,

Wang et al., eds, Vol. 4062, Springer, Berlin, Heidelberg, 2006, pp. 191–199.

20.

Tiwari,

Turner and

Majeed, A review of business process mining: State-of-the-art and future trends, Business Process Management Journal 14(1) (2008), 5–22. doi:10.1108/14637150810849373.

21.

van der Aalst, Process Mining: Discovery, Conformance and Enhancement of Business Processes, Springer-Verlag, 2011.

22.

van der Aalst, Do Petri nets provide the right representational bias for process mining? in: Proceedings of the 2009 International Database Engineering & Applications Symposium, 2011, pp. 85–94.

23.

van der Aalst,

Adriansyah and

van Dongen, Causal nets: A modeling language tailored towards process discovery, in: CONCUR 2011 – Concurrency Theory, Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 6901, 2011, pp. 28–42.

24.

van der Aalst,

Andriansyah,

de Medeiros,

Arcieri,

Baier,

Blickle,

Bose,

van den Brand et al., Process mining manifesto, in: BPM 2011 Workshops Proceedings, Springer-Verlag, 2012, pp. 169–194.

25.

van der Aalst,

de Medeiros and

Weijters, Process equivalence: Comparing two process models based on observed behavior, in: BPM,

Dustdar et al., eds, Vol. 4102, Springer, Berlin, Heidelberg, 2006, pp. 129–144.

26.

van der Aalst and

Dustdar, Process mining put into context, Internet Computing, IEEE 16(1) (2012), 82–86. doi:10.1109/MIC.2012.12.

27.

van der Aalst,

Rosemann and

Dumas, Deadline-based escalation in process-aware information systems, Decision Support Systems 43(2) (2007), 492–511. doi:10.1016/j.dss.2006.11.005.

28.

van der Aalst,

van Dongen,

Günther,

Mans,

de Medeiros,

Rozinat,

Rubin,

Song,

Verbeek and

Weijters, ProM 4.0: Comprehensive support for real process analysis, in: Application and Theory of Petri Nets and Other Models of Concurrency 2007,

Kleijn and

Yakovlev, eds, Lecture Notes in Computer Science, Vol. 4546, Springer-Verlag, Berlin, 2007, pp. 484–494.

29.

van der Aalst and

Weijters, Process mining: A research agenda, Computers in Industry 53(3) (2004), 231–244. doi:10.1016/j.compind.2003.10.001.

30.

van der Aalst,

Weijters and

Maruster, Workflow mining: Discovering process models from event logs, Knowledge and Data Engineering 16(9) (2004), 1128–1142. doi:10.1109/TKDE.2004.47.

31.

van Dongen,

de Medeiros,

Verbeek et al., The ProM framework: A new era in process mining tool support, in: Application and Theory of Petri Nets 2005, Vol. 3536, Springer-Verlag, Berlin, 2005, pp. 444–454.

32. van Dongen, de Medeiros and Wen, Process mining: Overview and outlook of Petri net discovery algorithms, in: Transactions on Petri Nets and Other Models of Concurrency II, Jensen and van der Aalst, eds, Vol. 5460, Springer, Berlin, Heidelberg, 2009, pp. 225–242.

33. Viswanadham, Narahari and Johnson, Deadlock prevention and deadlock avoidance in flexible manufacturing systems using Petri net models, IEEE Transactions on Robotics and Automation 6(6) (1990), 713–723. doi:10.1109/70.63257.

34. Wang and M.A.M. Capretz, A dependency impact analysis model for web services evolution, in: Web Services, 2009. ICWS 2009. IEEE International Conference on, 2009, pp. 359–365. doi:10.1109/ICWS.2009.62.

35. Weijters and Ribeiro, Flexible heuristics miner (FHM), in: Computational Intelligence and Data Mining, IEEE, 2011, pp. 310–317.

36. Weijters and van der Aalst, Rediscovering workflow models from event-based data using Little Thumb, Integrated Computer-Aided Engineering 10(2) (2003), 151–162.

37. Weijters, van der Aalst and de Medeiros, Process Mining with the Heuristics Miner-Algorithm, Vol. WP 166, Eindhoven University of Technology, Eindhoven, 2006.

38. Wen, van der Aalst, Wang and Sun, Mining process models with non-free-choice constructs, Data Mining and Knowledge Discovery 15(2) (2007), 145–180. doi:10.1007/s10618-007-0065-y.

39. Wen, Wang and Sun, Detecting implicit dependencies between tasks from event logs, in: Frontiers of WWW Research and Development, Zhou et al., eds, Vol. 3841, Springer, Berlin, Heidelberg, 2006, pp. 591–603.


¹ More information can be found at http://www.processmining.org.

Table 2
$D S R [8, 8]$ for FEL₁

² More information can be found at http://www.processmining.org.

Table 6
$D S R [15 * 15]$ for FEL₂

Table 8
Results of Eq. (28) using all models and dependency equations

| Model | Weijters | η | $D_{η}$ |
|---|---|---|---|
| α algorithm | 0.67 | 0.22 | 0.18 |
| Manual | 0.61 | 0.21 | 0.15 |
