MantaRay-ProM: An efficient process model discovery algorithm

Abstract

Discovering the business process model from an organisation’s records of its operational processes is an active area of research in process mining. The discovered model may be used either during a new system rollout or to improve an existing system. In this paper, we present a process model discovery approach based on the recently proposed bio-inspired Manta Ray Foraging Optimization algorithm (MRFO). Since MRFO is designed to solve real-valued optimization problems, we adapted a binary version of MRFO to suit the domain of process mining. The proposed approach is compared with state-of-the-art process discovery algorithms on several synthetic and real-life event logs. The results show that compared to other algorithms, the proposed approach exhibits faster convergence and yields superior quality process models.

Keywords

Manta ray foraging optimization; process model discovery; bio-inspired optimization; process mining; event log

1. Introduction

Organizations generate large volumes of data related to their business processes. A wide variety of tools are available nowadays that support automation of the business process (also known as workflow) in different application domains such as healthcare, medicine, industry, manufacturing, finance, logistics, education, information, and communication technology. However, organisations still face the challenge of mining the business data and developing a refined understanding of their processes to improve their work. Process mining generates process models that accurately describe processes by considering only an organisation’s records of its operational processes.

In the context of process mining, a process is understood as a collection of tasks that require coordination among them [36]. These tasks are carried out by a set of actors. For example, in a hospital setting, patients are the actors who participate in the process of treatment of a patient, comprising a sequence of activities such as registration, admission, patient care (assignment of an available doctor, patient medical history, diagnosis, medical tests, treatment, nursing care, counseling, management of patient medical records), and patient discharge. However, there may be deviations from the expected process behaviour. For instance, in the hospital domain, the patient registration process may not be followed for a patient who requires emergency treatment. While many deviations may be acceptable to the system, others, such as non-availability of the X-ray machine causing a delay in the patient’s treatment, may be unacceptable. Identifying the reasons for delays in patient treatment can help the hospital provide better medical care to the patients.

For the domain of process mining, data generated by an organisation is presented as an event log, and the organisation’s process flow is termed as a process model. The discovery of a process model for a given organisation is a key aspect of process mining [21,26]. Process models generated from process mining algorithms are evaluated on four quality dimensions, namely, completeness, preciseness, generalization, and simplicity [3,31,35,36]. A “good” model is expected to enact the minimal behaviour encoded in the log (simplicity) [35,36], echo all the traces in the log (completeness) [3], avoid any spurious behaviour (preciseness) [35,36], and fit well on unseen behaviour (generalization) [31].

In this paper, we propose a process discovery algorithm based on the Manta Ray optimisation technique. The Manta Ray Foraging Optimization (MRFO) algorithm [42] imitates three foraging strategies of manta rays: chain foraging, cyclone foraging, and somersault foraging. While the chain and somersault foraging strategies aid the local search ability, the cyclone foraging strategy enhances the global search ability.

That is, the proposed algorithm benefits from the global as well as the local search ability of the Manta ray foraging optimization approach [42].

The main contributions of this proposal are:

A novel metaheuristic algorithm, Manta ray foraging process miner (MantaRay-ProM), to address the problem of process discovery is proposed.

The proposed approach is based on Manta ray foraging optimization for process model discovery and benefits from the strength of the MRFO approach.

Since the formulation of the problem of process discovery is binary, and MRFO is proposed for real-valued problems, we adapted a binary version of MRFO to suit the domain of process mining.

The proposed algorithm (MantaRay-ProM) is evaluated on ten synthetic and four real-life event logs.

The proposed approach (MantaRay-ProM) is compared with both evolutionary and traditional state-of-the-art algorithms.

This study is organised as follows. Section 2 begins with the process mining terminology used in the paper and discusses the related work in the context of the proposed work. Section 3 describes the Manta ray foraging optimization (MRFO) algorithm. Section 4 describes the proposed Manta ray foraging process mining (MantaRay-ProM) algorithm. Section 5 describes the experimentation and results. Section 6 discusses the conclusion and offers future research directions.

2. Related work

In the literature, various meta-heuristic strategies, such as Particle Swarm Optimization (PSO), Differential Evolution (DE), and Genetic algorithms (GA) have been applied for process discovery in the domain of process mining [3,9].

2.1. Basic constructs of process mining

The data for a business process, stored in the form of an event log, can be represented as a two-dimensional table. A row represents the data corresponding to an event (also called activity). The columns represent the characteristics of an event. Table 1 depicts an example event log for a transportation company involving the activities ‘Customer request’ ( $a_{1}$ ), ‘Check stock’ ( $a_{2}$ ), ‘Accept request’ ( $a_{3}$ ), ‘Decline request’ ( $a_{4}$ ), ‘Pack order’ ( $a_{5}$ ), ‘Arrange transport’ ( $a_{6}$ ), and ‘Ship order’ ( $a_{7}$ ) in a process. Case ID is a unique identifier for an occurrence of a business process instance. A process instance, also known as a trace, is defined as a single execution of a business process. Process instances (traces) can be extracted by ordering the events of a specific Case ID by timestamp. For the example event log, Case ID values 101, 102, and 103 correspond to process instances $\{a_{1}, a_{2}, a_{4}\}$ , $\{a_{1}, a_{2}, a_{3}, a_{6}, a_{5}, a_{7}\}$ , and $\{a_{1}, a_{2}, a_{3}, a_{5}, a_{6}, a_{7}\}$ , respectively. The data in the event log is represented as a process model and can be visualised as a Petri net. Figure 1 depicts a Petri net that conforms to the example event log in Table 1.

Table 1
An example event log

Case ID	Activities	Timestamp
101	Customer request( $a_{1}$ )	15-1-2018@9:12
102	Customer request( $a_{1}$ )	15-1-2018@9:15
103	Customer request( $a_{1}$ )	15-1-2018@9:25
101	Check stock( $a_{2}$ )	15-1-2018@9:50
101	Decline request( $a_{4}$ )	15-1-2018@9:55
103	Check stock( $a_{2}$ )	15-1-2018@9:30
102	Check stock( $a_{2}$ )	15-1-2018@9:20
103	Accept request( $a_{3}$ )	15-1-2018@9:40
103	Pack order( $a_{5}$ )	17-1-2018@8:12
102	Accept request( $a_{3}$ )	17-1-2018@8:45
103	Arrange transport( $a_{6}$ )	17-1-2018@8:12
102	Arrange transport( $a_{6}$ )	17-1-2018@9:15
103	Ship order( $a_{7}$ )	18-1-2018@10:20
102	Pack order( $a_{5}$ )	17-1-2018@9:15
102	Ship order( $a_{7}$ )	18-1-2018@12:00

Fig. 1.

Petri net for the example event log of Table 1.
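The trace extraction described for Table 1 (group the events by Case ID, then order each group by timestamp) can be sketched in a few lines of Python. The tuple layout, the name `extract_traces`, and the timestamp format string are assumptions made for this illustration. Events with identical timestamps keep their order of appearance in the log.

```python
from collections import defaultdict
from datetime import datetime

# Event log of Table 1 as (case_id, activity, timestamp) rows, in log order.
EVENT_LOG = [
    (101, "a1", "15-1-2018@9:12"),
    (102, "a1", "15-1-2018@9:15"),
    (103, "a1", "15-1-2018@9:25"),
    (101, "a2", "15-1-2018@9:50"),
    (101, "a4", "15-1-2018@9:55"),
    (103, "a2", "15-1-2018@9:30"),
    (102, "a2", "15-1-2018@9:20"),
    (103, "a3", "15-1-2018@9:40"),
    (103, "a5", "17-1-2018@8:12"),
    (102, "a3", "17-1-2018@8:45"),
    (103, "a6", "17-1-2018@8:12"),
    (102, "a6", "17-1-2018@9:15"),
    (103, "a7", "18-1-2018@10:20"),
    (102, "a5", "17-1-2018@9:15"),
    (102, "a7", "18-1-2018@12:00"),
]


def extract_traces(event_log):
    """Group events by case ID and order each group by timestamp."""
    by_case = defaultdict(list)
    for case_id, activity, stamp in event_log:
        when = datetime.strptime(stamp, "%d-%m-%Y@%H:%M")
        by_case[case_id].append((when, activity))
    traces = {}
    for case_id, events in by_case.items():
        # sorted() is stable, so ties keep their order of appearance in the log.
        ordered = sorted(events, key=lambda event: event[0])
        traces[case_id] = [activity for _, activity in ordered]
    return traces


traces = extract_traces(EVENT_LOG)
for case_id in sorted(traces):
    print(case_id, traces[case_id])
```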

Definition 2.1 (Causal relation).

For any two activities, say $a_{1}$ and $a_{2}$ , belonging to the given event log L, $a_{1}$ causes $a_{2}$ ( $a_{1} \to a_{2}$ ) if and only if $a_{1} > a_{2}$ and $a_{2} ≯ a_{1}$ . Activity $a_{1}$ is said to be the cause of activity $a_{2}$ . Activity $a_{2}$ is called the direct successor of activity $a_{1}$ .

Definition 2.2 (Direct succession relation (succeeds or follows)).

For any two activities, say $a_{1}$ and $a_{2}$ , belonging to the given event log L, $a_{2}$ succeeds/follows $a_{1}$ ( $a_{1} > a_{2}$ ) if and only if there is a trace $δ = t_{1} t_{2} \dots t_{n}$ in L and an index $i \in \{1, \dots, n - 1\}$ such that $t_{i} = a_{1}$ and $t_{i + 1} = a_{2}$ . The symbol > denotes that the two activities appear in sequence, i.e., one directly following the other.

Definition 2.3 (Exclusive relation (unrelated)).

For any two activities, say $a_{1}$ and $a_{2}$ , belonging to the given event log L, $a_{1}$ is unrelated to $a_{2}$ ( $a_{1} \# a_{2}$ ) if and only if $a_{1} ≯ a_{2}$ and $a_{2} ≯ a_{1}$ , that is, activity $a_{1}$ is never followed by activity $a_{2}$ , and activity $a_{2}$ is never followed by activity $a_{1}$ .

Definition 2.4 (Parallel relation).

For any two activities, say $a_{1}$ and $a_{2}$ , belonging to the given event log L, $a_{1}$ is parallel to $a_{2}$ ( $a_{1} ∥ a_{2}$ ) if $a_{1} > a_{2}$ and $a_{2} > a_{1}$ , that is, activity $a_{1}$ follows activity $a_{2}$ and activity $a_{2}$ also follows activity $a_{1}$ .
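Definitions 2.1 to 2.4 can be checked mechanically on the three traces of the example log. The following minimal sketch first collects the direct succession pairs and then classifies a pair of activities. The function names and the string labels returned by `relation` are illustrative assumptions.

```python
# Traces of the example event log (Case IDs 101, 102, 103).
TRACES = [
    ["a1", "a2", "a4"],
    ["a1", "a2", "a3", "a6", "a5", "a7"],
    ["a1", "a2", "a3", "a5", "a6", "a7"],
]


def direct_succession(traces):
    """Return the set of pairs (x, y) such that x > y in at least one trace."""
    follows = set()
    for trace in traces:
        follows.update(zip(trace, trace[1:]))
    return follows


def relation(x, y, follows):
    """Classify the ordered pair (x, y) using the direct succession pairs."""
    xy = (x, y) in follows
    yx = (y, x) in follows
    if xy and yx:
        return "parallel"        # x > y and y > x
    if xy:
        return "causal"          # x > y and not y > x, i.e. x -> y
    if yx:
        return "reverse causal"  # y -> x
    return "exclusive"           # neither x > y nor y > x


FOLLOWS = direct_succession(TRACES)
print(relation("a1", "a2", FOLLOWS))
print(relation("a5", "a6", FOLLOWS))
print(relation("a3", "a4", FOLLOWS))
```

For the example log, this classifies $a_{1}$ and $a_{2}$ as causal, $a_{5}$ and $a_{6}$ as parallel, and $a_{3}$ and $a_{4}$ as exclusive.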

Definition 2.5 (Short loops (loops of length one and length two)).

For any two activities, say $a_{1}$ and $a_{2}$ , belonging to the given event log L, direct succession relations are also called length-one-loops ( $l_{1} l$ ), and $a_{1} a_{2} a_{1}$ or $a_{2} a_{1} a_{2}$ occurring in a trace are examples of length-two-loops ( $l_{2} l$ ).

Definition 2.6 (Input set).

In the given log L, the set S of all activities $a_{i}$ that are followed by an activity $a_{j}$ , where $i \neq j$ , is said to form an input set for the activity $a_{j}$ .

Definition 2.7 (Non-free choice constructs).

The transformation from the input sets, $S_{1}$ and $S_{2}$ , to the activities $a_{1}$ and $a_{2}$ , respectively, such that $S_{1} \cap S_{2} \neq \emptyset$ and $S_{1} \neq S_{2}$ forms a non-free choice construct.

To illustrate, consider a process model with five tasks $a_{1}$ , $a_{2}$ , $a_{3}$ , $a_{4}$ , $a_{5}$ such that $S_{1} = \{a_{3}, a_{4}\}$ is the input set for $a_{1}$ and $S_{2} = \{a_{3}, a_{5}\}$ is the input set for $a_{2}$ . The input sets of $a_{1}$ and $a_{2}$ overlap but are not identical. This transformation from $S_{1}$ , $S_{2}$ to $a_{1}$ , $a_{2}$ is said to result in a non-free choice construct.
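The condition of Definition 2.7 reduces to a simple set test. A small sketch, in which the function name `is_non_free_choice` is an assumption used only for illustration:

```python
def is_non_free_choice(input_set_1, input_set_2):
    """True when the two input sets overlap but are not identical."""
    overlap = input_set_1 & input_set_2
    return bool(overlap) and input_set_1 != input_set_2


# Input sets from the illustration above.
S1 = {"a3", "a4"}  # input set of a1
S2 = {"a3", "a5"}  # input set of a2
print(is_non_free_choice(S1, S2))
```

Identical input sets and disjoint input sets both fail the test, since the first violates $S_{1} \neq S_{2}$ and the second violates $S_{1} \cap S_{2} \neq \emptyset$ .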

2.2. Traditional algorithms in process mining

The α algorithm [25] is one of the earliest techniques in process mining. It analyses the event log and identifies the relationship between every pair of activities. Subsequently, many extensions of the α algorithm were proposed [2,18,38–40]. Alves de Medeiros et al. [2] proposed the $α^{+}$ -algorithm to deal with short loops. Wen et al. introduced the $α^{+ +}$ -algorithm [39] to handle non-free choice constructs. Later, Wen et al. [40] proposed the $α^{\#}$ -algorithm to deal with hidden (invisible) activities. Wen et al. [41] define an invisible activity as an activity that is present in the process model but absent from the event log. Li et al. proposed the $α^{*}$ -algorithm to handle duplicate activities, a situation where the process model has two or more nodes referring to the same activity [18]. Weijters et al. [38] proposed the Heuristics miner (HM), which discovers short loops and non-free-choice structures from noisy logs (logs containing exceptional or incorrect behaviour). Dongen et al. [29,30] proposed the Multi-phase miner, which builds an instance graph for each individual process instance. The authors suggest that the individual process instance graphs can be used to generate a process model. Goedertier et al. [16] proposed an Inductive logic programming (ILP) approach that uses artificially generated negative events, that is, events that cannot occur at a particular position in an event sequence. Leemans et al. [17] proposed an algorithm called the B’ algorithm, or Inductive miner, for the discovery of process models that are free of deadlocks and other anomalies [13,17]. De Smedt et al. [10] proposed the Declarative process model miner, which extracts rules from the given log to provide additional insights into the model. Also proposed by De Smedt et al. [10], the Multi-paradigm miner integrates the Heuristics miner and Declarative process model miners for process model discovery [10,13]. Van Zelst et al. [32] provided a Hybrid ILP Miner approach that relies on hybrid variable-based regions to enhance the ILP miner. Vanden Broucke and De Weerdt [33] proposed an extension of the Heuristics Miner, namely the Fodina algorithm, which mines duplicate tasks with a focus on robustness. Augusto et al. [4] introduced Split Miner, a method designed to identify the concurrency, conflict, and causal relations among adjacent elements in the directly-follows graph. Recently, Augusto et al. [5] proposed a structured heuristic mining strategy that initially uncovers an unstructured process model before transforming it into a structured representation [6].

2.3. Evolutionary algorithms in the literature

Alves de Medeiros [3] proposed the genetic process miner (GPM), a genetic algorithm for process discovery. GPM is based on random initialization of the population. This technique deals with non-free choices, noise, non-local relationships, and non-trivial constructs in the event log. The authors defined a fitness function to guide the global search for mining a process model that is complete and precise. Buijs et al. [7] proposed an extension of the ETM algorithm [8] that uses NSGA-II algorithm [22] to generate process models. Cheng et al. [9] introduced a hybrid process mining methodology that integrates GPM, particle swarm optimization (PSO), and discrete differential evolution (DE) techniques to extract process models from event logs. Deshmukh et al. [12] proposed a process mining approach based on binary differential evolution.

Table 2
State-of-the-art algorithms for discovering process models

Algorithm type	Algorithm/Reference
Evolutionary	Genetic Process Miner (GPM) [3]
	Evolutionary Tree miner (ETM) [7,8]
	GPM + PSO + Discrete DE [9]
	Binary DE [12]
Non-Evolutionary	α [25]
	$α^{+}$ [2]
	$α^{+ +}$ [39]
	$α^{\#}$ [40]
	$α^{*}$ [18]
	Heuristics Miner [38]
	Multi-phase Miner [29,30]
	ILP [16]
	Inductive miner [17]
	Multi-paradigm miner [10]
	Hybrid ILP Miner [32]
	Structured Heuristic miner [6]
	Fodina [33]
	Split Miner [4]
	Structured Heuristic miner [5]

3. Manta ray foraging optimization algorithm (MRFO): An overview

Manta rays are marine creatures that consume large amounts of plankton each day. They have a unique foraging behaviour that lets them search effectively for plankton dispersed in the ocean. Based on this behaviour, Zhao et al. [42] proposed a meta-heuristic algorithm, the Manta Ray Foraging Optimization (MRFO) algorithm, which imitates three foraging strategies of Manta Rays, namely, chain foraging, cyclone foraging, and somersault foraging. The chain foraging and somersault foraging behaviours are used to enhance the local search ability (exploitation), whereas the cyclone foraging behaviour contributes to the global search ability (exploration). This technique has proven to be effective in a variety of disciplines including energy allocation, geophysics, electric power, and image processing [19].

The MRFO algorithm [42] begins by initializing the population randomly. Subsequently, the population evolves by emulating the foraging behaviour. In chain foraging, Manta Rays form a linear foraging chain, and each individual moves forward according to a linear function of the position of maximum plankton concentration found so far and the position of the Manta Ray in front of it. The mathematical model is as follows: $C (i) (g + 1) = \begin{cases} C (i) (g) + r * (C_{best} - C (i) (g)) + α * (C_{best} - C (i) (g)), & \text{if } i = 1 \\ C (i) (g) + r * (C (i - 1) (g) - C (i) (g)) + α * (C_{best} - C (i) (g)), & \text{if } i = 2, \dots, N \end{cases}$ where $α = 2 * r * \sqrt{| \log (r) |}$ is a weight coefficient, $C_{best}$ is the best individual found so far, $C (i) (g)$ is the position of the ith individual at generation g, and $r \in [0, 1]$ is a random number.

When encountering a dense plankton patch in deep waters, Manta Rays form a foraging chain and move towards the food in a spiral pattern, demonstrating cyclone foraging behavior. The mathematical representation of this phenomenon is as follows: $C (i) (g + 1) = \begin{cases} C_{x} + r * (C_{x} - C (i) (g)) + β * (C_{x} - C (i) (g)), & \text{if } i = 1 \\ C_{x} + r * (C (i - 1) (g) - C (i) (g)) + β * (C_{x} - C (i) (g)), & \text{if } i = 2, \dots, N \end{cases}$ where $β = 2 \times e^{r_{1} \times \frac{g_{max} - g + 1}{g_{max}}} \times \sin (2 π r_{1})$ is the weight coefficient, r and $r_{1}$ are random numbers belonging to $[0, 1]$ , $g_{max}$ is the maximum number of generations, and $C_{x}$ is either $C_{best}$ for exploitation or $C_{rand}$ for exploration.

$C_{rand} = LB + r * (UB - LB)$ where LB and UB denote the lower and upper bounds, respectively. When $g / g_{max} < rand$ , where rand is a random number in $[0, 1]$ , the current best solution serves as the reference point for exploitation. Conversely, when $g / g_{max} > rand$ , a randomly chosen position within the search space is utilized as the reference point for exploration. With $C_{best}$ as the reference, the cyclone foraging behavior emphasizes exploitation within the region containing the best solution identified thus far. By introducing a new random point, $C_{rand}$ , across the entire search area as the reference position, each individual is encouraged to explore new positions that may differ significantly from the current best one. This mechanism prioritizes exploration, enabling the MRFO algorithm to conduct a comprehensive search.

In the somersault strategy, the position of the food acts as the pivot. Manta Rays oscillate around this pivot, adjusting their position in response to the best plankton position discovered thus far. The mathematical model is outlined as follows: $C (i) (g + 1) = C (i) (g) + S * (r_{2} * C_{best} - r_{3} * C (i) (g)), \quad i = 1, \dots, N$ Here, S represents the somersault factor, determining the range of somersaults for the Manta Ray, while $r_{2}$ and $r_{3}$ denote random numbers within the interval $[0, 1]$ . Moreover, the MRFO algorithm switches between chain foraging behavior and cyclone foraging behavior based on a random number. Zhao et al. [42] have demonstrated that MRFO is straightforward to implement with only a few tunable parameters (α and β), and it often outperforms other well-known competitors.
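The three update rules above can be written down almost literally. The following is a minimal real-valued sketch, not the reference implementation of [42]: it draws one scalar random number per individual, takes the individual in front from generation g, and leaves the choice of the reference point $C_{x}$ to the caller. All function names are assumptions made for illustration.

```python
import math
import random


def chain_foraging(pop, best, rng):
    """One chain-foraging step for a list of real-valued position vectors."""
    new_pop = []
    for i, current in enumerate(pop):
        r = 1.0 - rng.random()  # lies in (0, 1], so log(r) is defined
        alpha = 2.0 * r * math.sqrt(abs(math.log(r)))
        # The first individual follows the best; the others follow the one in front.
        leader = best if i == 0 else pop[i - 1]
        new_pop.append([c + r * (l - c) + alpha * (b - c)
                        for c, l, b in zip(current, leader, best)])
    return new_pop


def cyclone_foraging(pop, reference, g, g_max, rng):
    """One cyclone-foraging step around `reference` (C_best or C_rand)."""
    new_pop = []
    for i, current in enumerate(pop):
        r, r1 = rng.random(), rng.random()
        beta = (2.0 * math.exp(r1 * (g_max - g + 1) / g_max)
                * math.sin(2.0 * math.pi * r1))
        leader = reference if i == 0 else pop[i - 1]
        new_pop.append([x + r * (l - c) + beta * (x - c)
                        for c, l, x in zip(current, leader, reference)])
    return new_pop


def somersault_foraging(pop, best, rng, s=2.0):
    """One somersault step around the best position; s is the somersault factor."""
    new_pop = []
    for current in pop:
        r2, r3 = rng.random(), rng.random()
        new_pop.append([c + s * (r2 * b - r3 * c)
                        for c, b in zip(current, best)])
    return new_pop


rng = random.Random(42)
population = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(4)]
best = population[0]
population = chain_foraging(population, best, rng)
population = cyclone_foraging(population, best, g=1, g_max=100, rng=rng)
population = somersault_foraging(population, best, rng)
```

A useful sanity check on any implementation is that a population whose individuals all coincide with the reference point is left unchanged by the chain and cyclone updates, because every difference term vanishes.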

4. Proposed manta ray foraging process miner (MantaRay-ProM) algorithm

Whereas the original proposal of the MRFO algorithm is for continuous search space problems [42], the proposed MantaRay-ProM algorithm (Algorithm 1, flowchart in Fig. 2) is a meta-heuristic algorithm that applies a binary adaptation of the MRFO algorithm for the problem of process mining.

Algorithm 1

The proposed manta-ray foraging process miner (MantaRay-ProM) algorithm

4.1. Initial population

A metaheuristic algorithm begins by initializing a population. From the given event log, say, L comprising n activities, the knowledge contained in the event log is first summarized in the form of a dependency measure matrix D that specifies the strength of dependencies between activities (Algorithm 2) [3]. A dependency exists between activities $a_{1}$ and $a_{2}$ if, in a trace, either $a_{1}$ directly precedes $a_{2}$ or vice versa. This is indicated by the presence of either the strings $a_{1} a_{2}$ or $a_{2} a_{1}$ in a process instance (trace) of the event log. The strength of dependency is proportional to the frequency of occurrence of these strings. In the example log (Table 1), $a_{1}$ directly precedes $a_{2}$ thrice whereas the string $a_{2} a_{1}$ does not occur at all. That is, in the given system, task $a_{1}$ is more likely to be the cause of the task $a_{2}$ than vice versa. Dependency measure is computed by counting the length-one loops (for example, $a_{1} a_{2}$ ), self-loops (for example, $a_{1} a_{1}$ ), length-two loops (for example, $a_{1} a_{2} a_{1}$ ), and parallel tasks (for example, $a_{1} a_{2}$ and $a_{2} a_{1}$ occur an equal number of times). In the example log (Table 1), $a_{5}$ and $a_{6}$ are parallel tasks and $a_{1} a_{2}$ is a length-one-loop.
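As a rough illustration of how directly-follows counts can be turned into a dependency strength, the sketch below uses the dependency formula popularised by the Heuristics Miner, $(| a > b | - | b > a |) / (| a > b | + | b > a | + 1)$ . This formula is an assumption made for illustration only. The dependency measure described above additionally accounts for self-loops and length-two loops, which the sketch omits.

```python
from collections import Counter

# Traces of the example event log in Table 1.
TRACES = [
    ["a1", "a2", "a4"],
    ["a1", "a2", "a3", "a6", "a5", "a7"],
    ["a1", "a2", "a3", "a5", "a6", "a7"],
]


def follows_counts(traces):
    """Count how often each ordered pair (x, y) occurs as the string 'x y'."""
    counts = Counter()
    for trace in traces:
        counts.update(zip(trace, trace[1:]))
    return counts


def dependency(a, b, counts):
    """Simplified dependency strength of a -> b (illustrative assumption)."""
    ab = counts[(a, b)]
    ba = counts[(b, a)]
    return (ab - ba) / (ab + ba + 1)


COUNTS = follows_counts(TRACES)
print(COUNTS[("a1", "a2")])            # a1 directly precedes a2 three times
print(dependency("a1", "a2", COUNTS))  # (3 - 0) / (3 + 0 + 1)
print(dependency("a5", "a6", COUNTS))  # parallel tasks: (1 - 1) / (1 + 1 + 1)
```

Under this simplified measure, the pair $a_{1} a_{2}$ receives a high positive value, while the parallel tasks $a_{5}$ and $a_{6}$ receive zero because each ordering occurs equally often.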

Fig. 2.

The proposed MantaRay-ProM algorithm for process model discovery.

Algorithm 2

Computation of dependency measure matrix D for the event log L

In the proposed approach, each individual in the population is a two-dimensional matrix of order n, representing the causal relations between the n tasks, and is known as the causality relation matrix. The population of N causality relation matrices, denoted by C(:, :, k), k $\in [1, N]$ , is generated as follows (Algorithm 3) [3]: $C (a_{1}, a_{2}, k) = \begin{cases} 1, & \text{if } r < D (a_{1}, a_{2}, L), \; r \in [0, 1) \\ 0, & \text{otherwise} \end{cases}$ where $a_{1}, a_{2} \in [1, n]$ are the tasks in the event log L and r is a random number.
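The sampling rule for the initial population can be sketched as follows, assuming the dependency measure matrix D is available as a nested list of values. The function name `initial_population` and the nested-list representation are illustrative choices, not details taken from Algorithm 3.

```python
import random


def initial_population(dependency, n_individuals, rng):
    """Sample binary causality matrices: C[a1][a2] = 1 when r < D[a1][a2]."""
    n = len(dependency)
    population = []
    for _ in range(n_individuals):
        individual = [
            [1 if rng.random() < dependency[a1][a2] else 0 for a2 in range(n)]
            for a1 in range(n)
        ]
        population.append(individual)
    return population


# Toy dependency matrix for two tasks: task 0 always causes task 1.
D = [[0.0, 1.0],
     [0.0, 0.0]]
population = initial_population(D, n_individuals=5, rng=random.Random(1))
print(population[0])
```

Because r is drawn from $[0, 1)$ , an entry with dependency 1 is always set and an entry with dependency 0 (or a negative value) is never set. Intermediate values yield a causal relation with the corresponding probability, which is what gives the initial population its diversity.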

Algorithm 3

Generation of initial population. Each individual is called a causality relation matrix

4.2. Fitness function

In the present work, process models are evaluated based on four quality dimensions, namely, completeness, preciseness, generalization, and simplicity [3,11,24,31,35,36]. Note that we represent a process model using the causality matrix notation, which informs the definitions of the quality dimensions.

The proposed MantaRay-ProM algorithm optimizes the completeness value of the individuals as the fitness function using the following equation [3]: $\begin{aligned} (1) \quad Completeness (L, C (:, :, k)) & = \frac{allParsedActivities - Punishment}{n} \\ Punishment & = \frac{allMissingTokens}{numTracesLog - numTracesMissingTokens + 1} + \frac{allExtraTokensLeftBehind}{numTracesLog - numTracesExtraTokensLeftBehind + 1} \end{aligned}$

Completeness is a measure that involves the computation of all the parsed activities while replaying the traces of the log in the model. The tokens that are missing in a trace and the extra tokens that were left behind (unconsumed tokens) during parsing contribute to the punishment value. The completeness value of a process model is an essential aspect of its quality because it indicates how well a model captures the behaviour in the event log i.e., for the discovered process model, other quality dimensions will make sense only when completeness is acceptable [14,15].
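To make the arithmetic of Equation (1) concrete, the sketch below evaluates it for hand-picked counts. The counts are hypothetical and are not results of replaying any model. The denominator n is passed in by the caller, and the example uses the 15 events of Table 1 for it, which is an assumption about how n is counted.

```python
def completeness(all_parsed_activities, n, num_traces_log,
                 all_missing_tokens=0, num_traces_missing_tokens=0,
                 all_extra_tokens_left_behind=0,
                 num_traces_extra_tokens_left_behind=0):
    """Evaluate Equation (1) from token-replay counts supplied by the caller."""
    punishment = (
        all_missing_tokens
        / (num_traces_log - num_traces_missing_tokens + 1)
        + all_extra_tokens_left_behind
        / (num_traces_log - num_traces_extra_tokens_left_behind + 1)
    )
    return (all_parsed_activities - punishment) / n


# Perfect replay: every activity parsed, no missing or leftover tokens.
perfect = completeness(all_parsed_activities=15, n=15, num_traces_log=3)

# Hypothetical imperfect replay: 13 activities parsed, 2 missing tokens in one
# trace and 1 leftover token in one trace.
# Punishment = 2 / (3 - 1 + 1) + 1 / (3 - 1 + 1) = 1, so the value is 12 / 15.
imperfect = completeness(all_parsed_activities=13, n=15, num_traces_log=3,
                         all_missing_tokens=2, num_traces_missing_tokens=1,
                         all_extra_tokens_left_behind=1,
                         num_traces_extra_tokens_left_behind=1)
print(perfect, imperfect)
```

The denominators of the punishment terms shrink as more traces are affected, so problems spread over many traces are penalised more heavily than the same number of tokens concentrated in a single trace.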

Additional metrics for evaluating the quality of a process model are preciseness, simplicity, and generalization. Preciseness [35,36] quantifies the additional behavior produced by a model that is not observed in the event log. It is calculated as shown in Equation (2). $\begin{matrix} (2) & Preciseness (L, C (:, :, k)) = \frac{1}{allEnabledActivities (L, C (:, :, k))} \end{matrix}$ In this context, $allEnabledActivities (L, C (:, :, k))$ represents the total number of enabled activities within a mined model, that is, the count of activities that have tokens from all their inputs. A higher count of enabled activities in a model suggests the presence of more potential paths for additional behavior. Simplicity refers to the number of nodes within the mined model. It is determined by computing the cardinality of the input and output subsets within an individual’s causal matrix (as shown in Equation (3)) [35,36]. $\begin{matrix} (3) & Simplicity (C (:, :, k)) = \frac{1}{\sum_{t \in C (:, :, k)} \left( \sum_{ϕ \in I (t)} | ϕ | + \sum_{ψ \in O (t)} | ψ | \right)} \end{matrix}$ Here, $I (t)$ and $O (t)$ denote the input and output subsets of the causal matrix $C (:, :, k)$ , where t represents a task within the causal matrix $C (:, :, k)$ [3]. A model is considered more complex if it includes more duplicate tasks or exhibits a distinct path for every trace.

Generalization, as described in [31], evaluates how effectively the mined model can reproduce future unseen behavior. It is quantified based on the frequency of execution of each node, as denoted in Equation (4). A higher frequency suggests that the path is utilized more often, indicating that the model is more generic. $\begin{matrix} (4) & Generalization (L, C (:, :, k)) = 1 - \frac{\sum_{nodes} \left( \sqrt{\# executions (L, C (:, :, k))} \right)^{- 1}}{\# nodes \; in \; model (C (:, :, k))} \end{matrix}$
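Equation (4) can be evaluated directly once the number of executions of each node is known. In the sketch below the execution counts are hypothetical values chosen for easy arithmetic, and the function name is an assumption. Every node is assumed to be executed at least once, since a count of zero would make the term undefined.

```python
import math


def generalization(executions_per_node):
    """Evaluate Equation (4) from the execution count of every node."""
    penalty = sum(1.0 / math.sqrt(count) for count in executions_per_node)
    return 1.0 - penalty / len(executions_per_node)


# Four nodes executed 4, 4, 1 and 100 times:
# 1 - (1/2 + 1/2 + 1 + 1/10) / 4 = 1 - 2.1 / 4 = 0.475
value = generalization([4, 4, 1, 100])
print(value)
```

The node that is executed only once contributes the largest term to the penalty. This is the intended behaviour: rarely used parts of a model say little about unseen traces and therefore lower the generalization value.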

4.3. Updation

The three foraging approaches, namely, chain foraging (Algorithm 4), cyclone foraging (Algorithm 5), and somersault foraging (Algorithm 6), are used to update the population. The process of chain foraging begins by computing the value of the control parameter (α) that dictates the impact of the distance of an individual from the best individual found so far. Except for the first individual, each individual is updated with respect to its distance from the best individual found so far and from the individual in front of it. The process of cyclone foraging occurs in a spiral fashion. Following a two-pronged approach, the cyclone foraging strategy promotes exploration by selecting a random individual $C_{rand}$ in the search space. For exploitation, the best individual $C_{best}$ is used. The control parameter β decides the impact of the distance of an individual from $C_{best}$ or $C_{rand}$ . The process of somersault foraging mimics the somersault motion, and the position of each individual is updated with respect to the best individual found so far. The range of motion during this strategy is controlled by a somersault factor S, which is chosen as 2 by the authors of the MRFO algorithm [42]. Algorithm 7 explains the binary substitution function. This is essentially the round function for converting real values to binary values.

Algorithm 4

Chain foraging approach to update the population C (generation g) of N individuals, each of order n

Algorithm 5

Cyclone foraging approach to update the population C (generation g) of N individuals, each of order n

Algorithm 6

Somersault foraging approach to update the population C (generation g) of N individuals, each of order n

Algorithm 7

Binary substitution function
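The text describes the binary substitution function as essentially a rounding step. A minimal sketch of how a real-valued somersault update could be mapped back to a binary causality relation matrix is given below. Clamping to $[0, 1]$ before thresholding at 0.5, and the names `to_binary` and `binary_somersault`, are assumptions for illustration and are not taken from Algorithm 7.

```python
import random


def to_binary(value):
    """Clamp to [0, 1] and threshold at 0.5 (clamping is an assumption)."""
    clamped = min(1.0, max(0.0, value))
    # An explicit threshold is used instead of round(), because Python's
    # round() applies banker's rounding and maps 0.5 to 0.
    return 1 if clamped >= 0.5 else 0


def binary_somersault(individual, best, rng, s=2.0):
    """Somersault update of one binary matrix, followed by binary substitution."""
    r2, r3 = rng.random(), rng.random()
    return [
        [to_binary(c + s * (r2 * b - r3 * c)) for c, b in zip(row, best_row)]
        for row, best_row in zip(individual, best)
    ]


rng = random.Random(3)
individual = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
best = [[0, 1, 1], [0, 0, 1], [0, 0, 0]]
updated = binary_somersault(individual, best, rng)
print(updated)
```

Whatever real value the update produces, the result is again a matrix of zeros and ones, so it can be evaluated directly with the completeness fitness function.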

Table 3

Details of the datasets

Dataset type	Event log	Activities	Traces	Events
Synthetic datasets [3,36]	ETM	7	100	790
	g2	22	300	4501
	g3	29	300	14599
	g4	29	300	5975
	g5	20	300	6172
	g6	23	300	5419
	g7	29	300	14451
	g8	30	300	5133
	g9	26	300	5679
	g10	23	300	4117
Real-life datasets	BPI 2012 [28]	23	13,087	262,200
	BPI 2013-incident [23]	13	7,554	65,533
	BPI 2017 [27]	41	21,861	714,198
	Sepsis [20]	16	1,050	15,214

5. Experiments and results

For the proposed MantaRay-ProM algorithm, the hyperparameters are set as suggested in the original MRFO algorithm [42] with population size = 30, maximum iterations = 100, number of runs = 30. It was observed that the proposed algorithm converges before 100 iterations for most of the datasets. The experimental outcomes are determined by the average performance over all the runs.

The proposed algorithm is run on both real-life and synthetic datasets. Over the last decade, BPI challenge event logs have become important benchmarks in the data-driven research area of process mining [1] (Table 3). The proposed algorithm is tested for three BPI event logs, namely, BPI 2012 [28], BPI 2013 [23], and BPI 2017 [27], varying in terms of the number of activities, traces, and their respective domains. BPI 2012 is one of the extensively studied datasets in process mining. It comprises 13,087 traces, and 23 activities, and is derived from a structured real-life loan application procedure released to the community by a Dutch financial institute. The BPI 2013 dataset is sourced from the IT incident management system of Volvo Belgium, consisting of 7554 traces and 13 activities. The BPI 2017 dataset represents a loan application process of another Dutch financial institute, encompassing 21,861 traces and 41 activities. Additionally, the proposed algorithm is evaluated on a real-life medical event log featuring events related to sepsis cases from a hospital [20]. Sepsis, a severe medical condition typically triggered by an infection, is addressed in this dataset, which includes 1,050 traces and 16 tasks.

We also experimented on ten synthetic logs, namely, ETM, and g2-g10 [3,36]. First proposed by Alves de Medeiros [3], they are among the most popular unbalanced logs used in the literature. These logs are unbalanced regarding the frequencies with which the traces occur.

Subject to the availability of the code and datasets, the results of the proposed MantaRay-ProM algorithm for synthetic datasets are compared with those of state-of-the-art algorithms, namely, $α^{+ +}$ [39], Heuristic Miner [38], Genetic Process Miner (GPM) [3], Inductive miner [17], Hybrid ILP Miner [32], Evolutionary Tree miner (ETM) [7], Fodina [33], Split Miner [4], and Structured Heuristic miner [5].

The values of completeness, preciseness, and simplicity for the state-of-the-art algorithms, namely, $α^{+ +}$ [39], Heuristic Miner [38], and Genetic Process Miner (GPM) [3] for the ETM and g2-g10 datasets are taken as reported by Vázquez-Barreiros et al. [36]. Since Vázquez-Barreiros et al. [36] do not report the generalization value for these datasets, it is computed using the Cobefra tool [34]. The values for completeness, preciseness, simplicity, and generalization for the remaining state-of-the-art algorithms are computed using the ProM tool and the Cobefra tool [34,36,37]. In addition, the proposed technique is compared with the binary differential evolution algorithm for process mining [12] on both synthetic and real-world datasets. For the experimentation, we have used the default parameter settings for each of the state-of-the-art algorithms.

5.1. Analysis of the results

The quality dimensions for the process models discovered by the proposed MantaRay-ProM algorithm, the state-of-the-art methods, and the binary differential evolution (DE) algorithm for process mining are shown in Tables 4 and 5. “−” indicates that a certain accuracy or complexity measurement could not be reliably acquired because the discovered model has syntactical or behavioral problems that could be because of a disconnected model or an unsound model. Completeness is used as the fitness function for the binary DE algorithm and the proposed MantaRay-ProM algorithm. With respect to the completeness value, the proposed MantaRay-ProM algorithm achieves the highest possible value of completeness for most of the datasets. That is, MantaRay-ProM is capable of exploiting the search space competitively. For BPI 2012 and BPI 2013 datasets, the value of completeness obtained by Binary DE is marginally better compared to MantaRay-ProM. In order to analyze the other qualities of the discovered process model, we also compute the value of quality dimensions, namely, preciseness, simplicity, and generalization [3,31,35,36]. The results show that the process models generated by the proposed strategy have a higher quality in terms of Generalization, Simplicity, and Preciseness in all the cases. In order to rank the proposed approach and the traditional algorithms, a composite fitness value function of the quality metrics is computed as follows [8]: $\begin{array}{r} Fitness = \frac{1}{13} (10 * Completeness) + (1 * Preciseness) + (1 * Simplicity) + (1 * Generalization) \end{array}$ In the weighted average [8], the weight assigned to completeness is ten times higher than the weight assigned to other quality dimensions. This ensures that the identified model can reproduce the behaviour present in the event log [8]. 
Tables 6 and 7 show that, in terms of the weighted average, the proposed MantaRay-ProM algorithm outperforms the state-of-the-art algorithms and is at least as good as the binary DE algorithm on every dataset except Sepsis, where binary DE scores higher (0.988 vs. 0.969). We also compute the F-score, that is, the harmonic mean of completeness and preciseness [6,11]:

$F\text{-}score = 2 \cdot \frac{Completeness \cdot Preciseness}{Completeness + Preciseness} \qquad (5)$
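The F-score can be checked in the same way. The sketch below is illustrative only: the name `f_score` and the convention of returning 0 when both inputs are 0 are assumptions made here, not details stated in the paper.

```python
def f_score(completeness, preciseness):
    """Harmonic mean of completeness and preciseness (0.0 if both are 0)."""
    total = completeness + preciseness
    if total == 0:
        return 0.0  # assumed convention; avoids division by zero
    return 2 * completeness * preciseness / total


# Heuristic Miner on the ETM log (C = 0.37, P = 0.98 in Table 4).
hm_etm = f_score(0.37, 0.98)

# alpha++ on the g3 log (C = 0, P = 0.18 in Table 4): zero completeness gives zero F-score.
alpha_g3 = f_score(0.0, 0.18)

print(round(hm_etm, 3), alpha_g3)
```

These two values correspond to the entries 0.537 and 0 reported for the same algorithm and log pairs in Table 8.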

Tables 8 and 9 show that, in terms of the F-score, the proposed MantaRay-ProM algorithm matches or outperforms the state-of-the-art algorithms and is at least as good as the binary DE algorithm on every dataset except Sepsis, where binary DE is marginally ahead (0.982 vs. 0.979).

The Wilcoxon signed-rank test is used to assess the statistical significance of the differences between MantaRay-ProM and binary DE. Table 10 shows that the two algorithms are on par on all synthetic datasets, that MantaRay-ProM is significantly better on BPI 2012, BPI 2013, and BPI 2017, and that binary DE is significantly better on Sepsis. Overall, the results indicate that MantaRay-ProM handles real-life as well as synthetic datasets efficiently.
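To make the T+ and T− columns of Table 10 concrete, the following sketch computes the two signed-rank sums for paired results. It is a plain-Python illustration under stated assumptions: the function name, the sample values, and the use of 30 paired runs are hypothetical, and zero differences are dropped while tied absolute differences share their average rank, which is the usual convention for this test.

```python
def signed_rank_sums(first, second):
    """Return (T_plus, T_minus) for paired samples of two algorithms."""
    # Differences are taken as first minus second; zero differences are dropped.
    diffs = [a - b for a, b in zip(first, second) if a != b]
    ordered = sorted(diffs, key=abs)
    t_plus = t_minus = 0.0
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of equal absolute differences (ties).
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of the ranks i+1, ..., j
        for d in ordered[i:j]:
            if d > 0:
                t_plus += average_rank
            else:
                t_minus += average_rank
        i = j
    return t_plus, t_minus


# Hypothetical fitness values for 30 paired runs; the first algorithm always wins.
first = [0.990 + 0.0001 * k for k in range(30)]
second = [0.980] * 30
print(signed_rank_sums(first, second))

# Identical results leave no non-zero differences, so both sums are 0.
print(signed_rank_sums([1.0] * 30, [1.0] * 30))
```

For 30 pairs without zero differences the two sums always total 30 · 31 / 2 = 465, which matches the BPI2012 and Sepsis rows of Table 10; the number of runs is an inference made for this example and is not stated in this section. When every pair is identical, as for the synthetic logs, both sums are 0 and no p-value can be computed. A p-value for the non-degenerate case can be obtained from a statistics library, for example `scipy.stats.wilcoxon`.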

We also compare the convergence rates of the proposed MantaRay-ProM algorithm and the binary differential evolution algorithm. Figures 3 and 4 show that MantaRay-ProM converges faster in all cases, which suggests more effective exploration of the search space. In 13 out of 14 datasets, the proposed algorithm converges at least 45% faster than the binary differential evolution algorithm.
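One way to quantify such a comparison is sketched below. The definition used here is an assumption, since this section does not state how the percentage was measured: convergence is taken as the first iteration at which the best-so-far fitness reaches its final value, and the speed-up is the relative reduction in that iteration count. The two curves are hypothetical and are not read from Figures 3 and 4.

```python
def iterations_to_converge(best_fitness_per_iteration, tolerance=1e-6):
    """First 1-based iteration whose best-so-far fitness is within tolerance of the final value."""
    if not best_fitness_per_iteration:
        raise ValueError("the convergence curve is empty")
    final = best_fitness_per_iteration[-1]
    for iteration, value in enumerate(best_fitness_per_iteration, start=1):
        if abs(final - value) <= tolerance:
            return iteration


def relative_speedup(baseline_iterations, candidate_iterations):
    """Fraction of the baseline's iterations saved by the candidate."""
    return (baseline_iterations - candidate_iterations) / baseline_iterations


# Hypothetical best-so-far completeness curves for one event log.
de_curve = [0.60, 0.75, 0.85, 0.92, 0.97, 0.99, 1.0, 1.0]
manta_curve = [0.70, 0.90, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

de_iterations = iterations_to_converge(de_curve)
manta_iterations = iterations_to_converge(manta_curve)
speedup = relative_speedup(de_iterations, manta_iterations)
print(de_iterations, manta_iterations, round(100 * speedup, 1))
```

Under this definition the hypothetical MantaRay-ProM curve needs 3 iterations against 7 for binary DE, a reduction of roughly 57%.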

Table 4
Quality dimensions for the process models obtained using the state-of-the-art algorithms, binary DE algorithms, and the proposed MantaRay-ProM algorithm for synthetic datasets (C: completeness, P: preciseness, S: simplicity, G: generalization)

Algorithm		ETM	g2	g3	g4	g5	g6	g7	g8	g9	g10
Genetic Process Miner	C	0.3	1	0.31	0.59	1	1	1	0.26	0.48	0.48
	P	0.94	1	0.6	1	1	1	1	0.15	1	1
	S	1	1	1	0.97	1	1	1	0.72	0.96	0.88
	G	0.56	0.91	0.88	0.90	0.92	0.80	0.91	0.88	0.75	0.61
Heuristic Miner	C	0.37	1	1	0.78	1	0.66	1	0.52	0.74	0.78
	P	0.98	1	1	1	1	0.99	1	1	1	1
	S	1	1	1	1	1	0.99	0.98	0.93	0.96	1
	G	0.62	0.913	0.89	0.81	0.92	0.80	0.81	0.90	0.73	0.60
$α^{+ +}$	C	0.89	0.33	0	1	1	0.45	0	0.35	0.48	0.563
	P	1	0.96	0.18	0.97	1	1	0.12	1	1	1
	S	1	0.78	0.79	1	1	0.76	0.93	0.74	0.79	0.76
	G	0.56	0.62	0.74	0.91	0.92	0.84	0.81	0.91	0.59	0.43
Hybrid ILP Miner	C	1	0.99	1	0.99	1	1	0.99	1	0.99	1
	P	1	0.964	0.97	0.99	0.98	0.99	0.97	0.98	0.98	0.95
	S	0.96	0.89	0.94	0.91	1	0.64	0.83	0.62	0.72	0.7
	G	0.72	0.95	0.9	0.88	0.87	0.72	0.83	0.85	0.71	0.58
Inductive Miner	C	0.89	0.958	0.757	0.70	0.80	0.63	0.74	0.79	0.668	0.61
	P	1	0.89	0.73	0.56	0.75	0.41	0.64	0.637	0.423	0.26
	S	1	0.9	0.82	0.81	0.9	0.7	0.85	0.63	0.84	0.65
	G	0.56	0.91	0.94	0.91	0.94	0.91	0.95	0.91	0.88	0.9
Evolutionary Tree Miner	C	0.98	0.7174	0.64	0.69	0.607	0.59	0.68	0.71	0.62	0.58
	P	0.8	0.65	0.62	0.59	0.49	0.51	0.58	0.61	0.41	0.3
	S	1	0.9	0.8	0.84	0.89	0.78	0.8	0.6	0.8	0.6
	G	0.87	0.9	0.91	0.81	0.9	0.89	0.9	0.87	0.81	0.85
Split Miner	C	0.77	0.82	0.82	0.67	0.71	0.641	0.84	0.61	0.648	0.71
	P	0.436	0.77	0.7	0.58	0.62	0.41	0.66	0.46	0.40	0.3
	S	1	0.9	0.8	0.84	0.89	0.78	0.8	0.6	0.8	0.6
	G	0.84	0.92	0.93	0.90	0.94	0.87	0.94	0.89	0.89	0.89
Fodina	C	0.87	0.92	1.0	0.9	0.89	0.92	1.0	0.9	0.92	0.95
	P	0.4	0.57	0.6	0.48	0.41	0.38	0.46	0.26	0.2	0.25
	S	1	0.93	0.92	0.94	0.99	0.88	0.9	0.87	0.89	0.78
	G	0.74	0.72	0.83	0.80	0.92	0.77	0.85	0.85	0.9	0.93
Structured Heuristic Miner	C	0.71	0.85	0.93	0.86	0.91	0.93	0.94	0.91	0.98	0.87
	P	0.44	0.77	0.29	0.53	0.52	0.23	0.56	0.41	0.37	0.35
	S	1	0.85	0.74	0.74	0.79	0.68	0.7	0.56	0.68	0.56
	G	0.67	0.92	0.98	0.90	0.92	0.94	0.91	0.85	0.9	0.82
Binary DE	C	1	1	1	1	1	1	1	1	1	1
	P	1	1	1	1	1	1	1	1	1	1
	S	1	1	1	1	1	1	1	1	1	1
	G	0.88	0.916	0.946	0.905	0.938	0.924	0.947	0.893	0.924	0.908
MantaRay-ProM	C	1	1	1	1	1	1	1	1	1	1
	P	1	1	1	1	1	1	1	1	1	1
	S	1	1	1	1	1	1	1	1	1	1
	G	0.888	0.922	0.949	0.911	0.943	0.924	0.947	0.893	0.924	0.908

Table 5

Quality dimensions for the process models obtained using the state-of-the-art algorithms, binary DE algorithm, and the proposed MantaRay-ProM algorithm for real-world datasets (C: completeness, P: preciseness, S: simplicity, G: generalization)

Algorithm		BPI2012	BPI2013	BPI2017	Sepsis
Genetic Process Miner	C	0.93	0.997	0.97	0.94
	P	0.705	0.845	0.91	0.715
	S	0.99	0.99	1.0	0.99
	G	0.98	0.92	0.91	0.92
Heuristic Miner	C	-	0.91	-	-
	P	-	0.96	-	-
	S	0.05	1.0	0.45	0.17
	G	-	0.91	-	-
$α^{+ +}$	C	-	-	-	-
	P	-	-	-	-
	S	0.34	-	0.21	0.05
	G	-	-	-	-
Hybrid ILP Miner	C	-	-	-	-
	P	-	-	-	-
	S	-	-	-	-
	G	-	-	-	-
Inductive Miner	C	0.98	0.92	0.98	0.99
	P	0.5	0.54	0.7	0.45
	S	1.0	1.0	1.0	1.00
	G	0.98	0.92	0.98	0.96
ETM	C	0.44	1.00	0.76	0.83
	P	0.82	0.51	1.00	0.66
	S	1.0	1.0	1.00	1.0
	G	-	-	-	-
Split Miner	C	0.97	0.98	0.96	0.76
	P	0.72	0.92	0.81	0.77
	S	0.73	1.0	1.0	0.82
	G	0.97	0.98	0.96	0.77
Fodina	C	1.00	0.00	-	0.96
	P	0.07	0.36	-	0.36
	S	1.00	0.88	-	0.33
	G	-	-	-	0.3
Structured Heuristic Miner	C	-	0.91	0.95	0.92
	P	-	0.96	0.62	0.42
	S	0.4	1.0	0.97	1.00
	G	-	0.91	0.94	0.92
Binary DE	C	0.9943	0.9988	0.9934	0.998
	P	0.8398	0.8791	0.8351	0.967
	S	0.9996	0.9995	1.00	0.995
	G	0.9849	0.9245	0.9821	0.91
MantaRay-ProM	C	0.991	0.9975	0.988	0.97
	P	0.9985	0.9985	0.986	0.99
	S	1	0.9999	0.99	1.0
	G	0.9852	0.9245	0.982	0.91

Table 6

Weighted average of the quality dimensions for synthetic datasets for the state-of-the-art algorithms, binary DE algorithm and the proposed MantaRay-ProM algorithm

Algorithm	ETM	g2	g3	g4	g5	g6	g7	g8	g9	g10
GPM	0.4231	0.993	0.4292	0.6746	0.994	0.9846	0.9931	0.3346	0.5777	0.5608
HM	0.4846	0.9938	0.9915	0.8162	0.994	0.7215	0.9838	0.6177	0.7762	0.8
$α^{+ +}$	0.8815	0.4354	0.1315	0.9908	0.994	0.5462	0.1431	0.4731	0.5523	0.6015
Hybrid ILP	0.98	0.987	0.985	0.975	0.988	0.95	0.963	0.957	0.947	0.94
Inductive Miner	0.881	0.94	0.77	0.71	0.81	0.64	0.756	0.795	0.679	0.6
ETM	0.96	0.74	0.67	0.703	0.64	0.62	0.698	0.706	0.63	0.58
Split Miner	0.767	0.83	0.82	0.69	0.734	0.6507	0.83	0.62	0.659	0.684
Fodina	0.83	0.878	0.95	0.863	0.86	0.864	0.939	0.85	0.86	0.8815
Structured HM	0.708	0.85	0.87	0.828	0.8715	0.857	0.89	0.84	0.903	0.8
Binary DE	0.9908	0.9936	0.9958	0.9927	0.9953	0.994	0.996	0.992	0.994	0.993
MantaRay-ProM	0.991	0.994	0.996	0.993	0.996	0.994	0.996	0.992	0.994	0.993

Table 7

Weighted average of the quality dimensions for real-world datasets for the state-of-the-art algorithms, binary DE algorithm, and the proposed MantaRay-ProM algorithm. “-” indicates that not all quality metrics are available

Algorithm	BPI 2012	BPI 2013	BPI 2017	Sepsis
GPM	0.921	0.978	0.963	0.925
HM	-	0.92	-	-
$α^{+ +}$	-	-	-	-
Hybrid ILP	-	-	-	-
Inductive Miner	0.945	0.896	0.96	0.95
ETM	-	-	-	-
Split Miner	0.93	0.977	0.95	0.766
Fodina	-	-	-	0.814
Structured HM	-	0.92	0.925	0.887
Binary DE	0.9821	0.9839	0.9922	0.988
MantaRay-ProM	0.9918	0.9922	0.9936	0.969

Table 8

F-score obtained for synthetic datasets for the state-of-the-art algorithms, binary DE algorithm and the proposed MantaRay-ProM algorithm

Algorithm	ETM	g2	g3	g4	g5	g6	g7	g8	g9	g10
GPM	0.458	1	0.408	0.742	1	1	1	0.19	0.64	0.648
HM	0.537	1	1	0.876	1	0.792	1	0.684	0.851	0.876
$α^{+ +}$	0.94	0.49	0	0.984	1	0.62	0	0.51	0.64	0.717
Hybrid ILP	0.94	0.922	0.74	0.62	0.77	0.49	0.68	0.705	0.51	0.36
Inductive Miner	0.881	0.94	0.77	0.71	0.81	0.64	0.756	0.795	0.679	0.6
ETM	0.88	0.68	0.63	0.64	0.54	0.547	0.63	0.66	0.493	0.395
Split Miner	0.56	0.794	0.755	0.62	0.66	0.5	0.73	0.52	0.49	0.42
Fodina	0.55	0.70	0.75	0.626	0.56	0.537	0.63	0.403	0.328	0.395
Structured HM	0.54	0.80	0.44	0.656	0.66	0.36	0.701	0.565	0.537	0.49
Binary DE	1	1	1	1	1	1	1	1	1	1
MantaRay-ProM	1	1	1	1	1	1	1	1	1	1

Table 9

F-score obtained for real-world datasets for the state-of-the-art algorithms, binary DE algorithm and the proposed MantaRay-ProM algorithm

Algorithm	BPI 2012	BPI 2013	BPI 2017	Sepsis
GPM	0.80	0.91	0.94	0.81
HM	-	0.934	-	-
$α^{+ +}$	-	-	-	-
Hybrid ILP	-	-	-	-
Inductive Miner	0.662	0.68	0.81	0.61
ETM	0.57	0.675	0.86	0.735
Split Miner	0.826	0.949	0.878	0.765
Fodina	0.014	0	-	0.52
Structured HM	-	0.934	0.75	0.576
Binary DE	0.91	0.94	0.91	0.982
MantaRay-ProM	0.995	0.9979	0.987	0.979

Table 10

Wilcoxon signed-rank test for MantaRay-ProM vs. Binary DE. The symbol ‘+’ (‘−’) shows that the first (second) algorithm performs better at the 95% confidence level. ‘=’ denotes no significant difference between the ranks of the algorithms

Event log	p-value	T−	T+	Winner
ETM	-	0	0	=
g2	-	0	0	=
g3	-	0	0	=
g4	-	0	0	=
g5	-	0	0	=
g6	-	0	0	=
g7	-	0	0	=
g8	-	0	0	=
g9	-	0	0	=
g10	-	0	0	=
BPI2012	0.9E-05	0	465	+
BPI2013	0.45E-04	0	423	+
BPI2017	0.0245	0	411	+
Sepsis	<0.00001	464	1	−

6. Conclusion

The discipline of process mining focuses on extracting process models from databases of time-stamped business process data. An accurate process model allows a better understanding of the business workflow and can assist in redesigning and improving the processes. In the present work, process discovery is formulated as a problem in the binary domain. We have proposed a bio-inspired optimization algorithm, the Manta Ray Foraging Process Miner (MantaRay-ProM), for discovering process models from time-stamped workflow data in the form of an event log. The proposed approach searches the solution space by mimicking the foraging strategies of manta rays. Because process mining deals with binary data, we adapted the original MRFO algorithm [42], which was designed for floating-point data, to the binary domain.

Fig. 3.

Convergence rate for the binary DE algorithm for process mining and for the proposed MantaRay-ProM algorithm when run on synthetic datasets (ETM, g2-g10).

Fig. 4.

Convergence rate for the binary DE algorithm for process mining and for the proposed MantaRay-ProM algorithm when run on real-world datasets.

The quality of the process model discovered for a given event log can be measured in terms of completeness, preciseness, simplicity, and generalization. Completeness is the most important of these, as the other quality dimensions become meaningful only for models with a high completeness value. The proposed MantaRay-ProM algorithm therefore optimizes the completeness of the process models. It is tested on synthetic as well as real-life event logs, and its results are compared with those of other state-of-the-art algorithms, namely, $α^{+ +}$ [39], Heuristic Miner [38], Genetic Process Miner (GPM) [3], Inductive Miner [17], Hybrid ILP Miner [32], Evolutionary Tree Miner (ETM) [7], Fodina [33], Split Miner [4], and Structured Heuristic Miner [5], and with a binary differential evolution algorithm for process mining [12]. The discovered process models score well for completeness, reproducing the behaviour expressed in the event log. Preciseness, simplicity, and generalization values are also computed for the obtained solutions, and the discovered models score well on these dimensions too. A composite fitness value for each model is computed as a weighted average of completeness, preciseness, simplicity, and generalization. In terms of the composite fitness value, the proposed MantaRay-ProM algorithm ranks at or above the compared algorithms on all datasets except Sepsis, where binary DE ranks higher.

The experimental results show that the proposed MantaRay-ProM algorithm converges faster than the binary DE algorithm for process mining. On most datasets, the proposed algorithm matches or surpasses the compared algorithms in the quality of the extracted process models, in terms of both completeness and composite fitness value.

For future work, we aim to explore a parallelized MantaRay-ProM for a trade-off between different quality dimensions.

7. Funding

This research did not receive any grant from funding agencies in the public, commercial, or not-for-profit sectors.

References

1. 4TU Research Data, Accessed: 2021-03-13. https://data.4tu.nl/info/en/.

2. Alves de Medeiros, Van Dongen, Van Der Aalst and Weijters, Process mining: Extending the α-algorithm to mine short loops, Technical Report, BETA Working Paper Series, 2004.

3. A.K. Alves de Medeiros, Genetic Process Mining, CIP-DATA LIBRARY TECHNISCHE UNIVERSITEIT EINDHOVEN, printed in 2006.

4. Augusto, Conforti, Dumas and La Rosa, Split miner: Discovering accurate and simple business process models from event logs, in: 2017 IEEE International Conference on Data Mining (ICDM), IEEE, 2017, pp. 1–10.

5. Augusto, Conforti, Dumas, La Rosa and Bruno, Automated discovery of structured process models from event logs: The discover-and-structure approach, Data & Knowledge Engineering 117 (2018), 373–392. doi:10.1016/j.datak.2018.04.007.

6. Augusto, Conforti, Dumas, La Rosa, F.M. Maggi, Marrella, Mecella and Soo, Automated discovery of process models from event logs: Review and benchmark, IEEE Transactions on Knowledge and Data Engineering 31(4) (2018), 686–705. doi:10.1109/TKDE.2018.2841877.

7. J.C. Buijs, B.F. van Dongen and W.M. van der Aalst, Discovering and navigating a collection of process models using multiple quality dimensions, in: International Conference on Business Process Management, Springer, 2013, pp. 3–14.

8. J.C. Buijs, B.F. Van Dongen, W.M. van Der Aalst et al., On the role of fitness, precision, generalization and simplicity in process discovery, in: OTM Conferences (1), Vol. 7565, 2012, pp. 305–322.

9. H.-J. Cheng, Ou-Yang and Y.-C. Juan, A hybrid approach to extract business process models with high fitness and precision, Journal of Industrial and Production Engineering 32(6) (2015), 351–359. doi:10.1080/21681015.2015.1065519.

10. De Smedt, De Weerdt and Vanthienen, Multi-paradigm process mining: Retrieving better models by combining rules and sequences, in: OTM Confederated International Conferences “On the Move to Meaningful Internet Systems”, Springer, 2014, pp. 446–453.

11. De Weerdt, De Backer, Vanthienen and Baesens, A multi-dimensional quality assessment of state-of-the-art process discovery algorithms using real-life event logs, Information Systems 37(7) (2012), 654–676. doi:10.1016/j.is.2012.02.004.

12. Deshmukh, Gupta, Varshney and Kumar, A binary differential evolution approach to extract business process models, in: International conference on soft computing for problem solving-SocProS 2020, Springer, in press.

13. dos Santos Garcia, Meincheim, E.R.F. Junior, M.R. Dallagassa, D.M.V. Sato, D.R. Carvalho, E.A.P. Santos and E.E. Scalabrin, Process mining techniques and applications – a systematic mapping study, Expert Systems with Applications 133 (2019), 260–295. doi:10.1016/j.eswa.2019.05.003.

14. Fahland and W.M. van der Aalst, Repairing process models to reflect reality, in: International Conference on Business Process Management, Springer, 2012, pp. 229–245. doi:10.1007/978-3-642-32885-5_19.

15. M.A. Ghazal, Ghoniemy and M.A. Salama, Multi-objective optimization for automated business process discovery, in: KDIR, 2019, pp. 89–104.

16. Goedertier, Martens, Vanthienen and Baesens, Robust process discovery with artificial negative events, Journal of Machine Learning Research 10 (2009), 1305–1340.

17. S.J. Leemans, Fahland and W.M. van der Aalst, Discovering block-structured process models from event logs-a constructive approach, in: International Conference on Applications and Theory of Petri Nets and Concurrency, Springer, 2013, pp. 311–329. doi:10.1007/978-3-642-38697-8_17.

18. Li, Liu and Yang, Process mining: Extending α-algorithm to mine duplicate tasks in process logs, Advances in Web and Network Technologies, and Information Management (2007), 396–407.

19. Liao, Zhao and Wang, Improved manta ray foraging optimization for parameters identification of magnetorheological dampers, Mathematics 9(18) (2021), 2230. doi:10.3390/math9182230.

20. Mannhardt, Sepsis Cases–Event Log, 2016, https://data.4tu.nl/articles/dataset/Sepsis_Cases_-_Event_Log/12707639.

21. Mining, Discovery, Conformance and Enhancement of Business Processes, Vol. 8, Springer-Verlag, 2011, p. 18.

22. Srinivas and Deb, Muiltiobjective optimization using nondominated sorting in genetic algorithms, Evolutionary Computation 2(3) (1994), 221–248. doi:10.1162/evco.1994.2.3.221.

23. Steeman, BPI Challenge 2013, Incidents, 2013. https://doi.org/10.4121/uuid:500573e6-accc-4b0c-9576-aa5468b10cee.

24. A.F. Syring, Tax and W.M. van der Aalst, Evaluating conformance measures in process mining using conformance propositions, Transactions on Petri Nets and Other Models of Concurrency XIV (2019), 192–221. doi:10.1007/978-3-662-60651-3_8.

25. Van der Aalst, Weijters and Maruster, Workflow mining: Discovering process models from event logs, IEEE Transactions on Knowledge and Data Engineering 16(9) (2004), 1128–1142. doi:10.1109/TKDE.2004.47.

26. W.M. van der Aalst, Process Mining: Data Science in Action, Springer, 2016.

27. van Dongen, BPI Challenge 2017, 2017, 4TU, Centre for Research Data, Dataset.

28. van Dongen and Challenge, Event log of a loan application process, 2012, https://data.4tu.nl/articles/dataset/BPI_Challenge_2012/12689204.

29. B.F. Van Dongen and W.M. Van der Aalst, Multi-phase process mining: Building instance graphs, in: International Conference on Conceptual Modeling, Springer, 2004, pp. 362–376.

30. B.F. van Dongen and W.M. Van der Aalst, Multi-phase process mining: Aggregating instance graphs into EPCs and Petri nets, in: PNCWB 2005 Workshop, 2005, pp. 35–58.

31. van Eck, Alignment-based process model repair and its application to the Evolutionary Tree Miner, PhD thesis, Master’s thesis, Technische Universiteit Eindhoven, 2013.

32. S.J. van Zelst, B.F. van Dongen and W.M. van der Aalst, ILP-based process discovery using hybrid regions, in: ATAED@ petri nets/ACSD, 2015, pp. 47–61.

33. S.K. vanden Broucke and De Weerdt, Fodina: A robust and flexible heuristic process discovery technique, Decision Support Systems 100 (2017), 109–118.

34. S.K. vanden Broucke, De Weerdt, Vanthienen and Baesens, A comprehensive benchmarking framework (CoBeFra) for conformance analysis between procedural process models and event logs in ProM, in: 2013 IEEE Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2013, pp. 254–261. doi:10.1109/CIDM.2013.6597244.

35. Vázquez-Barreiros, Mucientes and Lama, A genetic algorithm for process discovery guided by completeness, precision and simplicity, in: International Conference on Business Process Management, Springer, 2014, pp. 118–133. doi:10.1007/978-3-319-10172-9_8.

36. Vázquez-Barreiros, Mucientes and Lama, ProDiGen: Mining complete, precise and minimal structure process models with a genetic algorithm, Information Sciences 294 (2015), 315–333. doi:10.1016/j.ins.2014.09.057.

37. Verbeek, Buijs, Van Dongen and W.M. van der Aalst, ProM 6: The process mining toolkit, Proc. of BPM Demonstration Track 615 (2010), 34–39.

38. Weijters, W.M. van Der Aalst and A.A. De Medeiros, Process mining with the heuristics miner-algorithm, Technische Universiteit Eindhoven, Tech. Rep. WP 166 (2006), 1–34.

39. Wen, W.M. van der Aalst, Wang and Sun, Mining process models with non-free-choice constructs, Data Mining and Knowledge Discovery 15(2) (2007), 145–180. doi:10.1007/s10618-007-0065-y.

40. Wen, Wang and Sun, Mining invisible tasks from event logs, APWeb/WAIM 4505 (2007), 358–365.

41. Wen, Wang, W.M. van der Aalst, Huang and Sun, Mining process models with prime invisible tasks, Data & Knowledge Engineering 69(10) (2010), 999–1021. doi:10.1016/j.datak.2010.06.001.

42. Zhao, Zhang and Wang, Manta ray foraging optimization: An effective bio-inspired optimizer for engineering applications, Engineering Applications of Artificial Intelligence 87 (2020), 103300. doi:10.1016/j.engappai.2019.103300.