Abstract
Semantics has been a major challenge when applying process mining (PM) techniques to real-time business processes. Several theoretical and practical efforts to bridge this semantic gap have given rise to the notion of semantic-based process mining (SPM). Fundamentally, SPM makes use of existing semantic technologies to support the analysis performed by PM techniques. In principle, the method is applied through the acquisition and representation of abstract knowledge about the domain processes in question. To this effect, this paper demonstrates how semantic concepts and process modelling (reasoning) methods can be used to lift the outcomes of PM techniques from the syntactic to a more conceptual level. To do this, the study proposes an SPM-based framework that exhibits a high level of semantic reasoning capability. Technically, the paper introduces a process mining approach that uses information (semantics) about the different activities found in a given process to make inferences and to generate rules or patterns through annotation, semantic reasoning, and conceptual assertions. In turn, the method is theoretically applied to enrich the informative value of the resultant models. In addition, the study systematically reviews the current tools and methods used to support the outcomes of process mining, and evaluates the results of the different methods to determine their levels of impact and implications for process mining.
Keywords
Introduction
The problem of handling the large volumes of data or event logs derived from real-time (e.g. business) processes and information systems has not only prompted intense study within the process mining field, but has also paved the way for the development of tools and methods capable of automatically or semi-automatically expanding the underlying processes. Perhaps this is linked to the rapid shift from the notion of big data to big data analysis. For instance, data engineering methods such as semantic process mining have consequently emerged. Indeed, the resultant methods focus not just on the collection, organization, and analysis of large volumes of event logs (e.g. to discover patterns and valuable information), but also serve as intelligent tools that can be used to understand and identify abstract information from the readily available datasets [1, 2, 3, 4, 5]. Moreover, the derived (conceptual) information has proved useful for organizational process management and decision-making [6, 7]. In turn, such process and information analysis is largely championed or adopted by IT experts, business process managers, data scientists, artificial intelligence (AI) companies and developers, etc.
One of the many areas in which process-related analysis has been applied is process mining (PM) [8]. Nowadays, the processes or methods that many organizations use for data collection and analysis are becoming more complex in nature. The growing complexity is linked to the need for a richer (intelligent) and more accurate interpretation of the real-time processes, one that allows for an abstract analysis or exploration of the various activities that constitute the models. Given this, one can argue that the usefulness of data (value) has become the central focus of most big data analysis tasks. This is because the value of data, above the other measures of the big v’s notion (e.g. volume, variety, velocity, and veracity), has given rise to a few additional measurements or benchmarks needed to support big data analysis projects and research. This study notes that those additional dimensions of the big v’s include validity, venue, variability, vocabulary, and vagueness of data. They are defined as follows:
Volume – represents the size of the recorded or stored data.
Velocity – measures the speed at which the data is generated.
Variety – describes the several types or kinds of data.
Veracity – measures the accuracy of the data.
Validity – deals with data quality, governance, or large-scale management.
Variability – considers the dynamic or evolving behaviour of the various sources of data.
Venue – represents the heterogeneity and/or distribution of data across multiple platforms.
Vocabulary – expresses the metadata models or semantics that describe the different data structures.
Vagueness – concerns the confusion over the meaning of data and the technologies used to perform the analysis.
Value – means, in effect, control over all the other v’s.
Furthermore, it is worth mentioning that having access to big data is all well and good, but unless we develop (or apply) tools and methods capable of turning the datasets into value, the full potential (benefit) of the data logs will remain unrealized [9, 10]. In fact, one can safely argue that data-value sits at the core of the big v’s, particularly as it concerns the rapid advancement from big data technologies to big data analysis methods. For this purpose, this paper shows that data-value can be realized by creating insights into the best logical ways to make the information derived from big datasets explicable in reality. Certainly, an effective and accurate exploration (analysis) of the readily available data (and the resultant models) can provide useful information to drive operations and offer quality support for organizational processes. For example, the newly discovered models or frameworks can be used to understand the interconnectedness (correlations) that exists between the process instances and the business process operations in general [11, 12]. Moreover, this work also studies the relevant literature in order to explore the technological potential of using the semantic-based approach to support PM methods. Hence, the aim of this paper is to analyze the event data (logs) based on an abstract view (concepts) rather than on the labels (tags) recorded in the event logs about the processes [2, 13].
Accordingly, this study outlines the main challenges with PM techniques and, consequently, proposes a semantic-based process mining (SPM) framework to show how the conceptualization method can be effectively used to resolve real-world (business) problems. For example, de Leoni et al. [14] note that present challenges with PM techniques include scalability (i.e., the ability of process mining algorithms to deal with the unprecedented volume, variability and velocity of the input datasets), particularly during real-time execution or in operational settings. Certainly, PM methods are also expected to make use of event log streams and to address approximation (e.g. balancing computation time with accuracy), understandability and explainability (providing easy-to-understand and explainable analytics), multi-perspective analysis (considering data, resources and time beyond the process control flow), measurability (providing a comprehensive framework for measuring differences between the observed and modelled process behaviours), and the ethical and confidentiality aspects of process mining (e.g. how to ensure that PM practices and results do not violate ethical and privacy principles).
Generally, the common problem with PM methods has been their technical focus on the available data (event logs), whereby the majority of PM techniques rely on the tags (labels) in the event logs to produce the models or mappings [8, 13]. As a result, PM methods appear to be somewhat limited when confronted with unstructured data. Moreover, in theory, one may say that PM approaches are purely syntactic in nature, although this study shows that they can be extended or enhanced to a more conceptual level. This is because the logical understanding or interpretation of the derived process models can be described as directly related to the level of abstract (conceptual) information contained or hidden in the readily available event logs. For example, the process models discovered using traditional PM methods often support only machine-readable systems rather than machine-understandable systems. By machine-understandable systems, we mean that the (abstract) process analysis methods are created not only to present information or knowledge in formats that are easy to understand, but also to provide intelligent applications that are able to process the information they contain or support. Besides, an adequate knowledge-base or process analysis system is one which is considered to be:
understandable by humans, and understandable by machines.
Perhaps the aforementioned assertions indicate that the captured event logs and models are either semantically labelled (annotated) to ease the analysis process, or defined in a structured format (e.g. ontologies) which allows a computer (the reasoning engine) to automatically infer (compute) new or previously undiscovered facts or patterns by making reference to the metadata (process libraries). Having said that, the work done in this paper shows how to effectively overcome these data processing challenges. Thus, this paper is primarily motivated by the following challenges with PM:
the limited availability of PM techniques that support semantic information retrieval, extraction, and analysis, and
how to perform PM at a higher level of conceptualization, as opposed to the syntactic method of analysis displayed by classical PM techniques.
Technically, the current study treats the above-mentioned challenges as a path towards the provision of formal structures for the different datasets used for process mining, as well as the extension (enhancement) of the discovered models through further semantically focused process analysis. This motivates the semantic-based process mining (SPM) approach described in this paper.
The implications of this study show that methods such as SPM (which integrates PM methods with semantic technologies) are capable not just of discovering meaningful and valuable information from the available event logs, but also of supporting an abstract analysis (conceptualization) of the resultant models. Besides, the proposed SPM-based framework benefits from semantic-aware procedures that exploit the knowledge kept in (big) data to reason about the data beyond the possibilities offered by most traditional PM techniques. On the one hand, PM serves as the methodical bridge between data-centric and model-based analysis [8, 15]. On the other hand, when it comes to modelling the information recorded about the different process domains, the semantic-aware methods (e.g. SPM) provide a range of tools that support the processing (analysis) of the (big) datasets in terms of the semantics they express [2, 12, 13].
For all intents and purposes, this paper proposes the SPM-based framework not only to demonstrate how data is analyzed using the semantic rule-based approach or technologies, but also to show how, by referring to the attributes or computation of the underlying concepts (e.g. classes, objects and data properties, etc.), it becomes possible to automatically determine the causal or logical relationships [16, 17, 18, 15, 19] that the process instances share amongst themselves within the knowledge-base.
In summary, this paper provides its main contributions to knowledge from the following aspects:
It defines a semantic-based process mining framework that uses metadata (conceptual information) about the datasets and models to support the PM analysis. In turn, it yields more accurate results that are closer to human understanding.
It provides a set of techniques for semantical annotation and analysis of the process models, towards the development of PM methods that exhibit a high level of semantic reasoning capability.
It provides a systematic mapping study and review of the current methods that are used to support the semantic-based process mining technique.
The remaining sections of this study are structured as follows: Section 2 discusses the different technologies that are applicable to the process mining and semantic modelling techniques. Section 3 provides a thematic analysis of the existing tools, algorithms, and methods which can be used to support semantic-based process mining. Section 4 presents the proposed SPM-based framework, and how the work applies the method to analyse real-time processes (data) based on concepts rather than the event tags or labels that are contained in the event logs about the processes. Section 5 presents a series of experiments using a case study of data about a real-time business process. This includes a practical illustration of the capability of the proposed SPM method to use the semantic schema (ontological concepts) to perform automatic (conceptual) reasoning over the event logs and models, given a dataset that consists of a training set and a test set. Section 6 discusses the results and implications of the SPM-based framework, and Section 7 concludes and points out directions for future work.
Semantic Web Search technology (SWS)
SWS technologies describe methods or tools that integrate the concepts of information extraction (IE) [20] and information retrieval (IR) [21] to find meaningful information (e.g. files, corpora) in large collections of data (e.g. databases, the web, etc.), and then provide the results (outputs) to the users (search initiators) based on some pre-specified information need. Whereas IR systems focus on finding useful materials (documents) in such large collections of data (which are quite often unstructured in nature, e.g. the internet), IE seeks to present the specific information that the users are interested in, in a systematized (structured) form. In theory, SWS simply means finding a set of texts or information that is pertinent to the users’ queries [22].
Interestingly, Cunningham [23] notes that SWS technologies aim to add a machine-tractable and/or repurposable layer of annotations relative to ontologies. For example, in terms of web mining, the method (SWS) can be applied by creating semantically annotated terms that link the resultant web pages to an ontology. Moreover, process or web analysts may make use of the technology to match or complement the omnipresent web of NLP (natural language processing) hypertext [24, 25]. In return, the process turns out to be an automatic or semi-automatic one, owing to the formal method of design, development, and interrelation of the ontologies. Likewise, the main mechanism of the method (SPM) in this paper is grounded on the SWS concept. The paper introduces the method to show how to improve the information value of the different datasets (event logs) and models by creating semantically annotated terms that reference (or link to) the concepts defined in an ontology.
Accordingly, Cunningham [23] notes that a typical illustration of SWS in practice is KIM (knowledge and information management system) [26]. KIM offers an IE-based facility for metadata creation, storage, and semantically enriched web browsing or search. Equally, several other tools that support SWS exist in the literature, for example the SemTag system [27] and Magpie [28], a browser add-on that relies on ontologies to provide specific or tailored perspectives on the web pages that users might be interested in (or wish to browse).
Also, OWL (web ontology language) [25, 29] has emerged as the standard format for defining SWS-based tools and has, over the past decade, been accepted and widely used for the abstract structuring of information (conceptual modelling) and/or knowledge engineering. Characteristically, OWL has proved useful in enriching datasets and in expressing inference rules (see Section 5) to support the process of making assertions or automated reasoning over the semantic models at a more conceptual level. Moreover, as a set of semantically annotated properties or terms, the resultant ontologies are used to support conceptual information extraction, particularly in Ontology-based Information Extraction (OBIE) systems [30].
Ontology-based information extraction systems (OBIE)
OBIE refers to methods which seek to identify and/or extract pieces of information in the form of texts (concepts) and then describe the relations (e.g. property restrictions) that exist or are defined in the resultant ontologies [31]. Indeed, OBIE stems from and is inspired by information extraction (IE). Technically, a typical IE system takes texts, and in some settings speech or audio records, as input and produces fixed-format, explicit data as output [23]. It is worth mentioning that IE systems present only the relevant (specific) information (knowledge), in formats that the users are interested in. This feature is where OBIE draws its incentive, given that an ontology is one of the existing tools able to provide the pieces of information in a taxonomical form (i.e. a structured format). For example, the automatic generation (population) of ontologies in an OBIE application can help identify individuals (process instances) in a text document or model that belong to a specific class. The method also aims to add those learned instances at their correctly inferred locations or reference addresses in the model or knowledge-base. Interestingly, Yankova et al. [31] note that such an approach to information aggregation is advantageous not only for supporting the augmentation of the extracted information, but also for the storage and/or manipulation of the retrieved information. For instance, consider an online learning management system (LMS) in the context of OBIE: if a new course is added to the description of a student within the knowledge-base, it is expected that the course description (properties) will be automatically added under a new identity criterion, without necessarily requiring changes to the existing details or functions of the system.
Along these lines, Yankova et al. [31] observe that one fundamental problem to be addressed when providing structures for the distribution of conceptual knowledge (such as with an OBIE system) is the identification or integration of the different entities or process instances. Basically, the process should aim at identifying newly extracted information or facts (e.g. from texts, models, web pages, etc.) and then connecting the discovered information with its previously mentioned references. Nonetheless, Cunningham [23] notes that OBIE poses two main challenges, particularly as it concerns the development and application of the procedures:
identifying the various concept(s) from the ontologies, and
modelling/classification of the different relations that exist amongst the concepts, e.g. the automatic population of the ontologies with process instances in the knowledge-base.
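The two tasks listed above can be illustrated with a minimal sketch, assuming a toy ontology held in plain Python dictionaries. The class names, the keyword lists, and the helper functions `ancestors` and `populate` are hypothetical and are not taken from any of the systems cited here; keyword matching stands in for a real concept-identification step.

```python
from collections import defaultdict

# Hypothetical toy ontology: each class has a parent and trigger keywords.
ONTOLOGY = {
    "Person": {"parent": None, "keywords": set()},
    "Student": {"parent": "Person", "keywords": {"student", "enrolled"}},
    "Course": {"parent": None, "keywords": {"course", "module"}},
}

def ancestors(cls):
    """Return the superclasses of `cls`, nearest first."""
    chain = []
    parent = ONTOLOGY[cls]["parent"]
    while parent is not None:
        chain.append(parent)
        parent = ONTOLOGY[parent]["parent"]
    return chain

def populate(mentions):
    """Assign each (name, context) mention to matching classes and their superclasses."""
    instances = defaultdict(set)
    for name, context in mentions:
        words = set(context.lower().split())
        for cls, spec in ONTOLOGY.items():
            if spec["keywords"] & words:          # task 1: identify the concept
                instances[cls].add(name)          # task 2: place the instance
                for sup in ancestors(cls):
                    instances[sup].add(name)
    return dict(instances)

mentions = [
    ("alice", "Alice is a student enrolled this term"),
    ("pm101", "PM101 is a course on process mining"),
]
kb = populate(mentions)
```

In this sketch the instance `alice` is placed under `Student` and, through the class hierarchy, under `Person` as well, which is the kind of inferred placement the text describes.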
Furthermore, if the ontologies in question are already populated with instances, the task of an OBIE system may simply be to identify and integrate those process instances, or their mentions in the ontologies, with the original text (data) sources. The resultant frameworks or methodologies tend to be more beneficial than traditional IE systems, because OBIE systems make use of an ontology rather than a flat gazetteer [23]. Likewise, a number of OBIE-supported systems have been developed and applied in different settings or contexts within the existing literature [32, 33, 12, 17, 34, 35, 31, 2, 36]. This study also notes that one common feature of the approaches described in the aforementioned works [32, 33, 12, 17, 34, 35, 31, 2, 36] is that, unlike traditional IE systems (whereby the extracted information is only classed as appropriate for predefined data types), the OBIE-supported methods seek to discover models or data structures that focus on generating reference links or connections between the objects (process instances) in the knowledge-bases, including their mentions within the contextual domain [23]. In other words, ontology-based systems not only consist of representations of the specified process domains of interest, but also provide useful information on the identity of the process elements (entities), including their mentions within the process base or models. Practically, the domain (ontological) representations are built using object and datatype properties. Thus, an adequate OBIE system should consist of a set of well-defined concepts (classes, object properties, individual assertions, etc.) with their full semantic descriptions.
Calvanese et al. [20] study OBIE systems in the context of a process mining framework, highlighting the main challenges that process analysts may encounter when extracting event logs or derived process models. Essentially, the work [20] reveals the importance of suitable methods for extracting event logs from relational databases.
A number of PM tools, such as XESame [37], ProMimport [38, 39], and ProM [37], already support event log extraction. In addition, commercial software such as Disco [40], and other vendors like MinIt, Celonis, MyInveno, etc., make it easier to transform typical Excel or CSV files into the eXtensible Event Stream (XES) [37] or MXML (Mining eXtensible Markup Language) [2] log formats. However, Calvanese et al. [20] observe that none of those tools or process mining platforms, in reality, considers domain-specific information in the loop. As a consequence, the process of mining valuable information from the readily available event logs, or of developing ontologies from the resultant models, is often ad hoc. Perhaps this is because, in real-time (business) settings, datasets might be duplicated for different reasons or interpretations, and the semantic information about the available data cannot be traced back in most cases.
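To make the CSV-to-XES transformation mentioned above concrete, the following is a minimal sketch using only the Python standard library. The column names (`case_id`, `activity`, `timestamp`), the sample rows, and the function `csv_to_xes` are assumptions made for illustration; the sketch does not reproduce how any of the named tools work, and it omits the extension, global, and classifier declarations that a complete XES file would carry.

```python
import csv
import io
import xml.etree.ElementTree as ET

def csv_to_xes(csv_text, case_col="case_id", activity_col="activity",
               time_col="timestamp"):
    """Group CSV rows by case and emit an XES-style <log> element."""
    traces = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        traces.setdefault(row[case_col], []).append(row)

    log = ET.Element("log", {"xes.version": "1.0"})
    for case_id, rows in traces.items():
        trace = ET.SubElement(log, "trace")
        ET.SubElement(trace, "string",
                      {"key": "concept:name", "value": case_id})
        # Assumes ISO 8601 timestamps in one format, so string order is time order.
        for row in sorted(rows, key=lambda r: r[time_col]):
            event = ET.SubElement(trace, "event")
            ET.SubElement(event, "string",
                          {"key": "concept:name", "value": row[activity_col]})
            ET.SubElement(event, "date",
                          {"key": "time:timestamp", "value": row[time_col]})
    return log

# Hypothetical sample data.
SAMPLE = """case_id,activity,timestamp
c1,Register,2020-01-01T09:00:00
c1,Approve,2020-01-01T10:00:00
c2,Register,2020-01-02T09:00:00
"""

log = csv_to_xes(SAMPLE)
xes_text = ET.tostring(log, encoding="unicode")
```

Note that the output carries only the labels found in the CSV file; no domain knowledge enters the conversion, which is the limitation raised in the text.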
Nevertheless, Calvanese et al. [20] note that some work has also been done on the semantical annotation of event logs [41, 42, 22, 2, 11] to help exploit or define the semantic (e.g. ontological) information during data analysis or process modelling tasks [17]. Notably, according to the observations in [20], the majority of the existing works do not give sufficient consideration to the process of capturing the event logs. To overcome such challenges, Calvanese et al. [20] argue that conceptual processes or frameworks can be applied only if realistic datasets (event logs) that follow the accepted standards (e.g. XES) [23, 37] are available. To this effect, [20] proposes a framework that supports process analysts or IT experts in extracting XES logs from relational databases.
Moreover, in order to demonstrate the ontology-based framework and its relevance within the process mining context, Calvanese et al. [20] resort to a well-established OBDA (ontology-based data access) model that allows users to link raw datasets to the underlying domain ontologies (i.e. hierarchical structures or taxonomies). In turn, the system is shown to overcome the impedance mismatch associated with relational databases. Therefore, process analysts and IT experts can focus their attention and technical effort on the taxonomical (ontology) level only, while the associations within the underlying knowledge-base are managed automatically through the OBDA system [32, 43]. In other words, ontology-based systems (such as the SPMaAF framework described in this paper) provide the sought-after theoretical foundation for developing PM tools or algorithms that are able to extract conceptual information from the input data logs, either by explicitly materializing the said (abstract) knowledge or by retrieving the information on demand (e.g. through user queries).
Indeed, the process mining framework (SPMaAF) introduced in this paper is built on the PM and semantic (ontological) modelling techniques. Specifically, the current study provides an ontology-based system that is capable of performing query answering and information retrieval/extraction in a more abstract manner than other standard logical procedures used for information management or knowledge engineering.
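The idea of working at the ontology level while mappings handle the relational data can be sketched as follows, assuming an in-memory SQLite table. The table, the ontology terms (`Order`, `ApprovedOrder`, `handledBy`), and the mapping dictionary are all hypothetical. The sketch shows only the on-demand retrieval route, in which an ontology-level query is unfolded into its mapped SQL; it leaves out the query rewriting against ontology axioms that a full OBDA system would also perform.

```python
import sqlite3

# Hypothetical relational source with order-handling records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, clerk TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "ann", "approved"), (2, "bob", "rejected")],
)

# Hypothetical mappings: ontology term -> SQL query that populates it.
CLASS_MAPPINGS = {
    "Order": "SELECT id FROM orders ORDER BY id",
    "ApprovedOrder": "SELECT id FROM orders WHERE status = 'approved' ORDER BY id",
}
PROPERTY_MAPPINGS = {
    "handledBy": "SELECT id, clerk FROM orders ORDER BY id",
}

def instances_of(cls):
    """Answer 'which individuals belong to this class?' via the mapped SQL."""
    return [row[0] for row in conn.execute(CLASS_MAPPINGS[cls])]

def values_of(prop):
    """Answer 'which (subject, object) pairs hold for this property?'."""
    return list(conn.execute(PROPERTY_MAPPINGS[prop]))

approved = instances_of("ApprovedOrder")
handlers = values_of("handledBy")
```

The caller asks for `ApprovedOrder` without knowing the table layout, which mirrors the separation of concerns the text attributes to OBDA.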
Data integration and linking
Semantic technology has matured during the past few decades. One of the areas in which it has experienced substantial expansion is the Linked Open Data (LOD) cloud [43, 44]. LOD consists of a number of machine-readable datasets (e.g. RDF triples) that are used to describe class elements and their underlying properties. However, Zhao and Ichise [44] note that in LOD applications it is difficult to understand the ontological (taxonomy) alignments between multiple datasets. To resolve this problem, the work [44] introduced a domain-independent framework that decreases heterogeneity in ontologies (considering the linked datasets), automatically retrieves the main entities or objects and, consequently, enriches the underlying ontologies by adding the domain ranges and annotations.
Another problem with LOD is that the datasets are largely categorized into domains and interlinked mainly with owl:sameAs (a built-in OWL property), with limited use of other descriptive properties (such as owl:equivalentClass and owl:equivalentProperty) that have proved to be a more resourceful way of linking equivalent classes and their associated properties. Moreover, the main feature of ontologies (especially OWL) that makes the technology capable of reasoning is the types of relations that exist across the different concepts (e.g. functional, inverse functional, transitive, symmetric, asymmetric, reflexive and irreflexive) [29], as demonstrated in Section 4 of this paper.
Similarly, Pfaff et al. [45] measure different perspectives of similar objects within a specified domain (IT benchmarking) by creating an ontology-based formalization (integration) of all the relevant properties, attributes, and elements. This was done using the expressive functionalities of OWL [25] and a logical reasoner [46, 47].
Likewise, this study uses property description languages such as OWL [25], SWRL (semantic web rule language) [46] and DL (description logic) queries [47] to provide a semantic-based approach to process mining that involves real-time processing, description (e.g. ontology-based) and reformulation (manipulation) of the meaning of, and relationships amongst, the different process instances or entities. This is achieved by allowing the attributes or labels of the entities (process instances) to be enriched by means of metadata creation or data labelling. Hence, the method supports semantic-based annotation and property assertion using the ontological schema/vocabularies.
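How property characteristics give rise to new facts can be shown with a small sketch that saturates a set of triples under symmetry and transitivity. This is a hand-written illustration and not an OWL reasoner; the triples, the property name `precedes`, and the function `infer` are hypothetical, and `sameAs` stands in for owl:sameAs.

```python
def infer(triples, symmetric=(), transitive=()):
    """Saturate (subject, property, object) triples under the given characteristics."""
    facts = set(triples)
    changed = True
    while changed:
        new = set()
        for s, p, o in facts:
            if p in symmetric:
                new.add((o, p, s))                 # p(s, o) implies p(o, s)
            if p in transitive:
                for s2, p2, o2 in facts:
                    if p2 == p and s2 == o:
                        new.add((s, p, o2))        # p(s, o), p(o, o2) imply p(s, o2)
        changed = not new <= facts                 # stop once nothing new appears
        facts |= new
    return facts

# Hypothetical input facts.
triples = {
    ("a", "sameAs", "b"),
    ("b", "sameAs", "c"),
    ("t1", "precedes", "t2"),
    ("t2", "precedes", "t3"),
}
closed = infer(
    triples,
    symmetric={"sameAs"},
    transitive={"sameAs", "precedes"},
)
```

Here `("c", "sameAs", "a")` and `("t1", "precedes", "t3")` are derived although neither was asserted, while `precedes` stays one-directional because it was not declared symmetric.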
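The interplay of rules and queries described above can be approximated in a few lines of Python. The sketch below hard-codes one SWRL-style rule and one class lookup over a set of facts; it is an assumption-laden illustration, not the rule base of this study, and the predicates (`Activity`, `performedBy`, `Manager`, `ManagerialActivity`) and individuals are hypothetical.

```python
# Hypothetical facts: unary (class, individual) and binary (property, subject, object).
facts = {
    ("Activity", "approve_loan"),
    ("Activity", "file_form"),
    ("performedBy", "approve_loan", "dana"),
    ("performedBy", "file_form", "eli"),
    ("Manager", "dana"),
}

def apply_rule(facts):
    """Apply: Activity(?a) ^ performedBy(?a, ?p) ^ Manager(?p) -> ManagerialActivity(?a)."""
    derived = set()
    for fact in facts:
        if fact[0] == "performedBy":
            _, activity, person = fact
            if ("Activity", activity) in facts and ("Manager", person) in facts:
                derived.add(("ManagerialActivity", activity))
    return facts | derived

def members(facts, cls):
    """A lookup in the spirit of a DL query: all individuals asserted or inferred in a class."""
    return sorted(f[1] for f in facts if len(f) == 2 and f[0] == cls)

enriched = apply_rule(facts)
managerial = members(enriched, "ManagerialActivity")
```

The activity `approve_loan` is never labelled as managerial in the input; the classification follows from the rule, which is the sense in which labels are enriched through assertions.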
Semantical annotation and data labelling
A common challenge when performing process mining or big data analysis is to discover the right information and to intuitively comprehend the meaning of the different terms [48, 49, 50]. Considering the focus of this paper, the works in [51, 52, 53, 12, 17, 54] show that semantically annotated logs or models are essential for carrying out conceptual analysis of event logs and, consequently, the model enhancement that follows. Specifically, Okoye et al. [17] note that semantical annotation (i.e. semantic-based data labelling) is an indispensable step towards the development of intelligent methods or tools that support semantic-based process mining by conveying (in an automated manner) the formal structures and meaning of the different entities (or process elements) found in the discovered models [54].
In theory, semantical annotation is defined formally as a function that returns a set of concepts from the ontologies for each node or edge in the resultant graph or model [51, 52]. Similarly, Born et al. [53] state that semantical annotation can be performed either manually or computed automatically, taking word similarity into account. Perhaps such considerations (word similarities) help to generalize the individual entities residing in the domain processes in question. Recently, Jonquet et al. [55] studied ontological schemas and practice by analyzing the metadata (semantical descriptions) of different ontologies and examining the frequently used vocabularies and standards. The work [55] systematically compares several methods used to implement ontology repositories (reference libraries) in order to provide a new metadata model that can be applied to build ontologies.
Following the same principle, this work presents a technique for annotating unlabelled activity sequences through the use of the ontological schema/vocabularies. The paper demonstrates this method using data about a business process described in [49] to determine which traces (i.e. process elements and the underlying activities) fit the original model. Technically, the resulting framework (SPMaAF) is applied to transform the extracted datasets and input models into minable, executable formats to support the discovery and enhancement of the process domains. Moreover, the framework (SPMaAF) includes the technique for semantical annotation of unlabelled activity sequences (workflows) by using the ontology schema (OWL, SWRL, DL queries, restriction properties, etc.) to provide (metadata) assertions that allow useful information or formal expressions to be discovered in the existing knowledge-base.
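A minimal sketch of annotation as a function from labels to sets of concepts, followed by a concept-level check of whether a trace fits a model, is given below. It is not the SPMaAF implementation: the label-to-concept table, the model (expressed as allowed successions between concepts), and the functions `annotate` and `fits` are hypothetical and chosen only to show why differently labelled traces can be judged at the conceptual level.

```python
# Hypothetical annotation table: raw event label -> set of ontology concepts.
ANNOTATIONS = {
    "reg_req": {"Register"},
    "Register request": {"Register"},
    "chk": {"Check"},
    "Examine casually": {"Check"},
    "decide": {"Decide"},
}

def annotate(label):
    """Return the set of concepts attached to a label (empty if the label is unknown)."""
    return ANNOTATIONS.get(label, set())

# Hypothetical model: allowed direct successions between concepts.
MODEL = {("Register", "Check"), ("Check", "Decide")}

def fits(trace):
    """A trace fits if every event is annotated and each consecutive pair is allowed."""
    concepts = [annotate(label) for label in trace]
    if not all(concepts):                      # some event has no concept
        return False
    return all(
        any((a, b) in MODEL for a in left for b in right)
        for left, right in zip(concepts, concepts[1:])
    )

terse = fits(["reg_req", "chk", "decide"])
verbose = fits(["Register request", "Examine casually", "decide"])
reversed_order = fits(["chk", "reg_req"])
```

The first two traces share no labels except `decide`, yet both fit, because the comparison is made on concepts rather than on tags.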
Process-Aware Information system (PAIS)
PAIS comprises process modelling/analysis methods that aim to offer more flexibility and support for task-specific processes [16, 56]. Typically, the traditional WFMs (Workflow Management systems) [15, 8] are standard examples of PAIS. Although PAIS do not necessarily control the processes they are used to support (through a generic workflow engine), they share a common attribute: PAIS-inclined systems are entirely aware of each process they aim to support, and there exists an explicit process view or interpretation [57]. The more flexibility (generalization) a PAIS allows, the greater the diversity of behaviours that will be discovered from the supported processes. We note that only in settings where the processes in question show a greater level of flexibility would the resulting models offer the best value, especially for the purpose of process-related decision-making. Moreover, the flexibility and usefulness of the derived models are more relevant in comparison to other methods that are applied to perform task-specific information processing, such as BAM (business activity monitoring), BPM (business process management), etc. [8, 57].
In principle, process mining (PM) techniques embody PAIS, given that PM aims to mine and analyze event logs at the process level and is entirely sensitive to (aware of) the facts or details about the different activities that underlie the processes in question. The PM methods, for instance, can be used to understand how the various activities that make up the said processes have been performed within the execution environments.
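As a minimal illustration of this process-level view, the following Python sketch counts how often one activity directly follows another in a small event log. The log contents, the case identifiers, and the function name directly_follows are hypothetical and are not taken from the dataset in [49].

```python
from collections import Counter

# Hypothetical event log: case id -> ordered list of activity names.
event_log = {
    "case_1": ["register", "check", "approve"],
    "case_2": ["register", "check", "reject"],
    "case_3": ["register", "approve"],
}

def directly_follows(log):
    """Count how often activity a is immediately followed by activity b."""
    counts = Counter()
    for trace in log.values():
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += 1
    return counts

dfg = directly_follows(event_log)
for (a, b), n in sorted(dfg.items()):
    print(f"{a} -> {b}: {n}")
```

Such a relation only captures the order of the logged labels; it says nothing yet about what the activities mean, which is the gap the semantic annotation is meant to close.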
Accordingly, the process-aware methods or analysis (which form part of the main functions of the SPMaAF framework) can be supported by referring to the semantically annotated labels to add meaning to the discovered models. This enables automatic inferencing (e.g. through process querying) of new and/or abstract information in the underlying knowledge-base. In fact, the semantic-based approach focuses on bringing the process-related information to a conceptual level of human (real-world) understanding. Moreover, according to Varghese et al. [4], hybrid intelligent systems (e.g. recurrent neural networks) have been shown to attain state-of-the-art results/performance in several settings, beyond what humans could logically achieve. Hence, the method proposed in this paper (SPMaAF) is designed to support machine-understandable systems rather than just machine-readable systems.
Process querying
Process querying (PQ) is a method that has emerged over the past decade with the aim of supporting process mining and semantic modelling techniques. The method (PQ) is used to manage (in an automatic manner) real-world or envisioned process models and information repositories. Indeed, the method has proven important, particularly within the BPM field and the allied area of (business) data analysis [7, 58].
According to Polyvyanyy et al. [7], PQ methods are concerned with the automated manipulation (e.g. through filtering) of repositories of process models and their associated relationships in real-world settings. Moreover, the PQ-supported methods are created with the intention of transforming process-related information into decision-making capabilities. In theory, Polyvyanyy et al. [7] note that research within the PQ field covers a number of topics, ranging from the study of PM algorithms, to resolving the limitations found in the PQ tools or techniques, to the application of querying capabilities in software products.
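The filtering of a repository of process models can be pictured with a short Python sketch. It is only an illustration under stated assumptions: the repository structure, the model names, and the function query_models are hypothetical and do not reproduce the PQ framework of [7].

```python
# Hypothetical repository of process models: each model lists its activities.
repository = [
    {"name": "loan_application", "activities": {"register", "check", "approve"}},
    {"name": "claim_handling", "activities": {"register", "assess", "pay"}},
    {"name": "onboarding", "activities": {"invite", "verify"}},
]

def query_models(models, must_contain):
    """Return the names of models that contain every activity in must_contain."""
    required = set(must_contain)
    return [m["name"] for m in models if required <= m["activities"]]

matches = query_models(repository, ["register"])
print(matches)
```

A query of this kind works purely on activity labels; the ontology-based extensions discussed next aim to let the same question be asked in terms of concepts.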
However, Polyvyanyy et al. [58] state that the PQ methods also aim to integrate process models with ontologies (particularly ontologies for process management), and have increasingly gained the attention of software developers and process analysts over the years. Polyvyanyy et al. [58] note that one of the many reasons for this increased attention is that ontologies support the adding of semantics to discovered or pre-existing models by enabling an automated discovery (inference) of conceptual information from the domain processes, as noted earlier in Subsections 2.2 and 2.4. In consequence, the resultant knowledge (semantic information) is utilized to support the said processes (e.g. business processes) at both the design and implementation stages.
Along these lines, Polyvyanyy et al. [7] proposed a PQ framework that enables BI (business intelligence) through query-based process analytics. The proposed framework consists of state-of-the-art components built on generic functions that can be programmed/designed to produce particular types of querying techniques or functionalities. Interestingly, the PQ framework [7] also references some use case studies within the BI and BPM fields, and correspondingly points to gaps in the current literature.
Moreover, the work in [7] notes that typical PQ methods should be designed to address those gaps in the current literature. For example, an organization may fail to transform the high volume of data recorded in its various information systems into a tactical or strategic data processing/intelligent system. This can be owing to a lack of dedicated technologies designed to effectively manage the pieces of information underlying the said business models or data sources. Thus, the application of ontology-based frameworks (such as the PQ method) is deemed beneficial not only for better supporting the transformation of process-related information into decision-making capabilities, but also for the development of the next generation of BI systems for such organizations [7].
Considering the context of this paper, we focus on the methods that are used to perform process modelling, particularly as it concerns PQ with rich annotations [58]. By definition, PQ with rich annotations refers to the application of ontological modelling (annotation) techniques, for example, by using process description languages to represent the process models and to manipulate the different entities. The method proposed in this paper (SPMaAF) integrates well-defined semantic process models with ontologies. Moreover, Montani et al. [59] note that the methods used for trace abstraction (e.g. PQ and SPM) should provide mechanisms that are able to convert the various traces (process elements) into higher-level concepts (or real-world understanding). Thus, the resultant systems or models (otherwise referred to as conceptualization) should technically be linked to, or based on, the domain knowledge about the processes. In fact, the proposed SPMaAF framework (see: Section 4) builds on the related PQ methods and approaches explained in this section. Besides, just like PQ, the SPMaAF approach considers the different phases of PM and its application for real-time processing and/or analysis. This ranges from the initial phase of collecting and transforming the readily available events data logs, to discovering useful process models, and then extends the results of the standard PM techniques with the semantic preparation of the input models for further enhancement and/or analysis at a more abstract level.
Main tools and application areas for process mining and semantic modelling: Systematic mapping study
This section systematically presents (see: Table 1) the related works that are pertinent to the method and proposals of this paper. The systematic analysis covers works and methods that address PM, semantic modelling, or both. Thus, the analysis represented in Table 1 is a thematic report of the different methods, tools, case studies or application domains, including the findings that are closely connected to the semantic-based process mining technique introduced in this paper.
Systematic mapping study and review of the related works considering the scope of the studies, design methods/approaches used, and findings related to PM and semantic modelling techniques
As gathered from the systematic mapping study and the relevant studies (Table 1), this work notes that there has been a substantial advancement from the conventional big data mining techniques and supporting technologies to big data analysis, both in theory and in practice. Essentially, it is worth noting from the analysis (Table 1) the main factors that have stimulated big data analysis, including supporting technologies such as PM [8, 60], its associated theories and real-world applications [62, 6, 63, 64, 18], and its practical impact and analysis [1, 12, 17, 54, 73], which have consequently motivated the SPM-based approach [2, 17] described in this study.
Furthermore, we note that PM [8] builds on computational intelligence by providing methods that integrate data mining with process modelling techniques [61, 62, 33]. This has had a significant effect on how the scientific and data science community (process owners, process analysts, IT experts, researchers, etc.) perceives and analyzes the increasing volumes of data captured from the various IT systems.
Therefore, this study deemed it necessary to explore the PM field and the main areas of its application in practice that are not only closely related to the semantic-based approach, but are also pertinent to, and utilized to support, the design and implementation of the method proposed in this paper (see: Sections 4 and 5).
Moreover, the systematic review reported in this section (Table 1) also highlights the different methods that are entirely aware (i.e. semantic-aware methods) of the processes they aim to support. For instance, we observe that a majority of the studies (see: Table 1) focus on how to extract meaningful patterns from the event logs captured about the various processes. This is done alongside creating effective ways of transforming/analyzing the datasets and models to provide real (semantic) knowledge and understanding about the processes in reality.
Most organizations’ processes are becoming complex due to the scale (or unprecedented amount) of data readily available today in their various information systems or databases. These databases or sources have outgrown both human expectations and the processing abilities of the various IT or computer systems. Yet, the available literature, which also forms part of the systematic analysis of this study (Table 1), conveys that there are solutions, and researchers are working resolutely to meet those expectations (as reported in Table 1).
If the process analysts, software developers or IT experts (i.e. those whose task is to provide the methods, tools, or algorithms needed to handle the increasing complexity and provide efficient organizational structures) can understand the captured datasets with an advanced level of intelligence, then we can start to realise the greater impact and benefits of the SPM-based methods and their supporting technologies. The SPM-based method (see: Section 4) can be applied to support real-time process modelling and analysis and, in turn, help to resolve the different complexities that underlie the processes in question. However, this holds only if the process scientists or analysts take the additional step of providing the real (semantic) information or knowledge that describes the processes. Hence the notion and proposal of the SPM-based method (SPMaAF) in this paper.
This study notes that quality augmentation of the PM methods and the resultant models results from applying data analysis approaches that combine the said systems with three main components or building blocks, namely: Semantic Labelling (annotation), Semantic Representation (ontology), and Semantic Reasoning (reasoner) [13, 17, 72, 2]. To this effect, this paper illustrates in the following Subsections 4.1 to 4.3 the importance of the semantic technologies and their main application for abstract (conceptual) levels of PM tasks and analysis. In consequence, the study introduces the SPMaAF framework and then demonstrates how the resulting method is utilized for the implementation of semantic-based PM techniques.
Semantics
One of the commonly encountered problems when performing process mining tasks (as noted in Section 2) is to discover the right information and to comprehend what it means or represents in real-world settings [48, 50, 75, 49]. According to Rozinat [75], determining the semantic information (e.g. metadata description) from the existing event logs stored in many IT systems/databases can be anything between really easy and very complicated. Moreover, the outcome mainly depends on how distant (or unconnected) the events log is from the actual business logic. For instance, a specific business model might be logged directly in terms of the corresponding activity names as performed, whereas in other settings a process analyst may require a mapping between an actual business activity and some kind of hidden action code to be able to analyze the processes.
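The mapping between hidden action codes and business activities can be sketched in a few lines of Python. The codes, the activity names, and the function relabel below are hypothetical and serve only to illustrate the idea; they do not come from [75] or from the dataset used in this paper.

```python
# Hypothetical mapping from logged action codes to business activity names.
code_to_activity = {
    "A01": "Register request",
    "A02": "Check documents",
    "A07": "Approve request",
}

def relabel(events, mapping):
    """Replace each action code by its business activity; keep unknown codes visible."""
    return [mapping.get(code, f"UNMAPPED({code})") for code in events]

raw_trace = ["A01", "A02", "A99", "A07"]
labelled_trace = relabel(raw_trace, code_to_activity)
print(labelled_trace)
```

Keeping unmapped codes visible, rather than dropping them, makes it easier for an analyst to see where the log is still disconnected from the business logic.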
Therefore, as noted by Rozinat [75], and as is evident from current research related to the field of SPM [2, 12, 17, 36, 70] including the broader area of BPI (business process intelligence) [15, 68, 76], it is perhaps best practice to work alongside process analysts who are capable of mining the correct information and of interpreting the implications of the different components that make up the processes in question. In the context of PM, the guidelines outlined in [75] indicate that it helps not to try to understand everything at once but instead to focus first on the key elements:
Moreover, Rozinat [75] recommends that once the above-mentioned components are identified and addressed, one can look for additional information (metadata) that may help improve the PM outcomes and analysis, especially from a specific domain or perspective. Hence the application and development of the semantic-based process mining approach described in this paper.
Semantic technologies and their application for real-time process analysis have gained attention within the PM field [32, 54, 36, 17]. This increasing attention has given rise to the notion of SPM, which has recently been embraced and applied as a useful method for extending and/or improving the events logs and models derived using the classical PM techniques. In essence, SPM aims to utilize the metadata (semantic information) about the captured events log and the discovered models to develop new techniques for PM, or to improve the existing tools/methods, so as to support humans (e.g. process analysts) in obtaining more detailed and accurate results that are closer to human (real-world) understanding (i.e. machine-understandable systems).
Theoretically, the SPM methods leverage the rich semantics and/or annotations [7, 58, 32] that are embedded in the events log and models about any given process (e.g. business process) by linking the defined properties to concept(s) in ontologies. This is done in order to allow for the extraction of useful patterns through semantic reasoning. The automated computing (reasoning) of the different concepts, or the expression of the different relationships that exist within the derived models, is made possible by the ontologies (formal definitions) and annotations (semantic labels). Hence, with the SPM-supported methods, valuable information (semantics) about the different activities that underlie the existing process bases, and about how they are associated with the discovered models, becomes available, which is essential for extracting models capable of producing new knowledge.
In fact, the SPM-based approaches arise from the limitations associated with most of the PM techniques: the majority of the existing methods do not technically gain from the real (semantic) knowledge that describes the events logs or discovered models. To address such limitations, SPM offers key benefits, including the capacity to describe the semantics beneath the events log and models, which is considered useful for the discovery of new knowledge about any given process.
In summary, the SPM methods are purely grounded on the three basic building blocks [2], namely:
Annotated events log or models
Ontologies
Semantic reasoning (reasoners)
SPMaAF framework (semantic-based process mining and analysis framework).
SPM is a relatively new field within the wider area of PM, and there are not many approaches/methods that support or demonstrate the capabilities of the technique [2, 7, 12, 17, 32]. To this end, the SPM-based framework (SPMaAF) is proposed in this paper to help resolve the problems associated with process mining through semantic modelling and representation. This is done to help determine the presence of different patterns or traces within the different process domains. Practically, the SPMaAF framework described in Section 4.3 focuses on identifying meaningful knowledge (information) about the various process instances that can be found within the events log of real-time processes. This work uses a case study experimentation and analysis of the business process data [49] provided by the IEEE Task Force on Process Mining to illustrate the method. In turn, the method (SPMaAF) is shown to improve the informative value of the resultant models by means of the semantically-driven PM approach.
The proposed SPM-based framework (SPMaAF) [12, 13, 54] is useful for analyzing the different item-sets (e.g. the column fields or variables contained in the events log) based on concepts rather than the event tags or labels about the processes. In fact, the conceptual method of analysis is a considerable benefit, particularly towards formulating a more robust and extendable description of processes in reality. Moreover, the SPMaAF framework is shown to support advanced (automated) reasoning, as well as increased conceptual knowledge awareness and (big) data management.
This work notes that in order to carry out the SPM-based tasks, the process data or different variables (categories) first have to be captured for the process of interest. By category, this study refers to each entity (element) in the process base (e.g. the activity logs, process instances, sub-processes, etc.) that makes up the given process [74]. Second, the different objects and datatypes of the various process elements must be identified and modelled (see: Figs 1 and 2). The different data properties are then modelled and (automatically) manipulated in order to support the analysis of the input dataset and derived models at a more abstract level.
Therefore, the SPMaAF framework (Fig. 1) is designed to show how the input datasets and models are extracted, semantically prepared, and transformed into a minable format that allows the conceptual (abstract) analysis to follow. We note that the architecture of the SPMaAF framework is designed to enrich the attributes (labels) of the various elements (in terms of concepts) by enhancing the information values and metadata descriptions of the derived models. Figure 1 represents the architecture of the SPMaAF framework.
In summary, the different phases (components) of the framework (SPMaAF) are collectively aimed at the discovery, conformance checking, and extension of the discovered models at a more conceptual (abstract) level of analysis through the conceptualization method. Typically, the framework can be applied to analyse any process or domain of interest, provided that an events log recorded about the processes in question is available and meets the minimum requirements for process mining tasks.
As shown in Fig. 1, the SPMaAF framework consists of the following main application components or phases:
Model extraction from the events data logs: discovered models are described as set(s) of annotated terms or labels that link to defined properties in an ontology.
Ontological classifications and representations: permits the association of (inferential) meaning with the labels or attributes defined within the model through the presence or description of class hierarchies (i.e. taxonomies).
Reasoner (inference engine): designed to support the automatic (computed) classification and maintenance of consistency within the resultant models, as well as the cleaning out of inconsistent results. In turn, it presents the inferred (underlying) object/data property assertions or associations.
Conceptual references and information extraction: allows for automated discovery of new knowledge (information) about the process instances and the relationships that exist between the different entities.
To summarize the design framework (SPMaAF), the key step towards an effective application of the method is to integrate the system with the following two core elements:
Events logs/models in which the labels or attributes are designed to make references to the concept(s) defined within the ontologies, and
A reasoner that can be invoked (automatically executed) to make inferences (i.e. reason) over the ontologies.
Indeed, the main benefit/implication of the framework (Fig. 1) is that it allows for semantic annotation of the different labels (attributes) in the events logs/models by linking them to the concepts that are well-defined in the ontologies. This is done in order to help improve the underlying process analysis (as explained in detail in Fig. 2).
This study expands on the steps that allow for the implementation of the proposed framework (Fig. 1). First, the logs/models extracted by the standard process mining techniques are represented as a set of annotated terms that make references to defined ontologies (see: Figs 1 and 2). As a result, the method makes it straightforward to represent the extracted information in a structured and accurate manner.
Second, the method provides means to represent and manipulate the annotated terms or process models in a formal and structured way (taxonomies) by describing the associations (relationships) [19, 77] between the different process elements in the resultant (semantic) model [17]. Technically, the method ensures that the range of tasks (activities) conforms to the event logs as executed in reality. This is achieved by encoding the deployed models in the formal structure of ontologies (i.e. semantic modelling), which also allows for further expansion (or improvement) of the existing model.
Lastly, the Reasoner (the inference engine) is designed to perform semantic reasoning and ontology classification of the different process elements in order to validate the resulting model and clean out inconsistent outputs [78]. Consequently, it presents the inferred (underlying) semantic associations in a systematized manner that is seen as machine-understandable and closer to human understanding.
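The role of the reasoner in classifying elements and flagging inconsistent outputs can be pictured with a toy Python sketch. It is not the Pellet reasoner and does not implement OWL semantics; the taxonomy, the disjointness axiom, and all names below are hypothetical and chosen only to illustrate subsumption-based classification with a simple consistency check.

```python
# Toy taxonomy: each concept maps to its direct parent (None for the root).
parent = {
    "Activity": None,
    "ApprovalActivity": "Activity",
    "RejectionActivity": "Activity",
    "ManagerApproval": "ApprovalActivity",
}
# Pairs of concepts declared disjoint (no individual may belong to both).
disjoint = {frozenset({"ApprovalActivity", "RejectionActivity"})}

def ancestors(concept):
    """Return the concept together with all of its superconcepts."""
    result = set()
    while concept is not None:
        result.add(concept)
        concept = parent[concept]
    return result

def inferred_types(asserted):
    """Close a set of asserted concepts under the subclass hierarchy."""
    types = set()
    for concept in asserted:
        types |= ancestors(concept)
    return types

def is_consistent(asserted):
    """An individual is inconsistent if its inferred types include a disjoint pair."""
    types = inferred_types(asserted)
    return not any(pair <= types for pair in disjoint)

# Hypothetical annotated events: event id -> asserted concepts.
annotations = {
    "event_1": {"ManagerApproval"},
    "event_2": {"ManagerApproval", "RejectionActivity"},
}
consistent = {e: is_consistent(t) for e, t in annotations.items()}
print(consistent)
```

In this sketch event_2 is flagged because its inferred types contain both members of a disjoint pair, which mirrors, in a very reduced form, how a reasoner can clean out inconsistent results.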
Accordingly, the study highlights in the following figure (Fig. 2) the incremental procedure employed for the implementation of the framework (SPMaAF).
Incremental procedure applied for ample implementation of the SPM-based framework (SPMaAF).
Fragment of the main functions used to infer the ontologies in the semantic fuzzy miner using Pellet reasoner and OWL API.
As gathered from the procedure listed above (Fig. 2), this work shows that the first step towards achieving semantic annotation of the derived models should make use of process description languages/assertions (e.g. OWL, SWRL, DL Queries, etc.) [25, 46, 47] to link elements in the models with the concepts that they represent in a well-defined ontology. The purpose of semantic annotation is to seek equivalence between the concepts of the derived models and the concepts defined in the ontologies [77]. Moreover, automated reasoning that references the concepts in the ontologies (i) provides a robust way to answer questions regarding the relationships the process instances share amongst themselves within the model (i.e. machine-understandable), and (ii) enables a more conceptual analysis capable of providing real-world answers that are closer to human understanding.
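The two ideas above, linking labels to concepts and then answering questions at the concept level, can be sketched in Python as follows. This is a simplified stand-in for OWL/SWRL assertions and DL queries; the subclass axioms, the label-to-concept annotation, and the function names are hypothetical.

```python
# Hypothetical subclass axioms: child concept -> parent concept.
subclass_of = {"ManagerApproval": "Approval", "AutoApproval": "Approval"}
# Hypothetical annotation: event label in the log -> ontology concept.
label_to_concept = {
    "approve_mgr": "ManagerApproval",
    "approve_auto": "AutoApproval",
    "reject": "Rejection",
}

def is_a(concept, target):
    """True if concept equals target or is a (transitive) subclass of it."""
    while concept is not None:
        if concept == target:
            return True
        concept = subclass_of.get(concept)
    return False

def traces_with_concept(log, target):
    """Return ids of traces containing at least one event annotated with target."""
    return [
        case for case, trace in log.items()
        if any(is_a(label_to_concept.get(label), target) for label in trace)
    ]

log = {
    "case_1": ["approve_mgr"],
    "case_2": ["reject"],
    "case_3": ["approve_auto", "reject"],
}
print(traces_with_concept(log, "Approval"))
```

The query for "Approval" finds traces whose labels never contain that word, which is the practical difference between analysing labels and analysing concepts.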
To test the main functionalities and components of the SPMaAF framework, this work utilized the events log data from the business process domain provided in Carmona et al. [49] to illustrate the implications of the method [13]. Essentially, the series of experiments and the case study implementation were performed in order to assess the ability of the framework (SPMaAF) to produce a more accurate classification of the individual traces that constitute the derived models. The datasets used for the experiments include a training set and a test set that were used to discover the models. These were also used for the cross-validation of the classification results. Consequently, the resulting approach, otherwise referred to as the semantic-based fuzzy mining approach [13], is shown to allow the meaning of the process elements to be enhanced through the use of property description languages (semantic assertions) and the classification of the discoverable entities. In principle, the method is applied in order to generate inferred knowledge that is then used to discover useful patterns (traces) within the derived (semantic) models in a conceptual (abstract) manner.
Practically, the work implements the semantic fuzzy mining application using the OWL application programming interface (API) [79] to manipulate, extract, and load the inferred concepts/parameters. This is done by employing the semantic reasoner (Pellet) [80] to perform all the logical referencing and classification of the defined concepts. The main properties that this study references to determine the different concepts (i.e. Class_assertions, Object_property_assertions, Data_property_assertions, etc.) are shown in Fig. 3. Besides, the purpose of performing the automated inferencing, otherwise referred to as concept classification, is to match the questions one would like to answer about the different attributes or relationships the process instances share amongst themselves within the knowledge-base. Again, this is done by linking the different entities or properties (classes, object and data properties, etc.) in the model with the concepts that they represent in the ontologies.
The semantic-based fuzzy mining approach and its underlying properties or main functions, as shown in Fig. 3, reference a number of different OWL ontologies (trainingModel ontology, testSet ontology, traceFitnessClassification ontology, etc.) which were defined for the purpose of the experimentation. For each ontology defined in the model, all concepts were in turn processed by the reasoner (Pellet) and checked for consistency by referencing the process parameters. According to the behavioural characteristics of the provided datasets [49], each test log contains 10 traces that are allowed or replayable by the model (i.e. true positives) and 10 other traces that are disallowed or non-replayable by the model (true negatives) [49]. A cross-validation test was conducted with the goal of overcoming the variability in the composition of the different datasets. Technically, the traces were computed and recorded based on the reasoner’s response. Moreover, the classifier (reasoner) was run and tested on the resulting individual traces to assess its performance in determining the correctly classified traces.
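To make the notion of a replayable trace concrete, the following Python sketch classifies traces against a small model. It uses a plain directly-follows replay as a stand-in and is not the Pellet-based classification used in the experiments; the model, the test traces, and the function is_replayable are hypothetical.

```python
# Hypothetical model: allowed start activities, end activities, and transitions.
model = {
    "start": {"register"},
    "end": {"approve", "reject"},
    "allowed": {("register", "check"), ("check", "approve"), ("check", "reject")},
}

def is_replayable(trace, m):
    """A trace fits if it starts and ends correctly and every step is allowed."""
    if not trace or trace[0] not in m["start"] or trace[-1] not in m["end"]:
        return False
    return all(step in m["allowed"] for step in zip(trace, trace[1:]))

test_log = [
    ["register", "check", "approve"],
    ["register", "approve"],
    ["check", "reject"],
]
predictions = [is_replayable(t, model) for t in test_log]
print(predictions)
```

Only the first trace is accepted here: the second skips a required step and the third starts with an activity that the model does not allow as a start.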
For each result of the classification process, the replayable (true positive) and non-replayable (true negative) traces were determined. In the end, the results of the tests in relation to the deployed models, including the classifications of the corresponding individual traces occurring in each test set, were recorded [13, 17, 54]. The tests and the experimental results/observations were noted by considering whether the specified trace was classified as a true positive (TP), false negative (FN), false positive (FP), or true negative (TN). Thus, the following performance metrics were considered to determine the fitness of the different traces in accordance with the classification definitions in Van der Aalst [8, 57], where:
TP denotes the true positive values, i.e. the traces that were correctly classified as positive.
FN signifies the false negatives, i.e. the traces that are predicted to be negative but ought to have been classified as positive.
FP denotes the false positive values, i.e. the traces that are predicted to be positive but ought to have been classified as negative.
TN signifies the true negatives, i.e. the traces that were correctly classified as negative.
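The four counts defined above can be computed with a short Python sketch. The labels below are hypothetical and do not reproduce the experimental results; accuracy is shown only as one common metric derived from the counts, and it is an assumption that it matches the metrics used in the study.

```python
def confusion_counts(actual, predicted):
    """Count TP, FN, FP, TN for boolean labels (True = replayable trace)."""
    tp = sum(a and p for a, p in zip(actual, predicted))
    fn = sum(a and not p for a, p in zip(actual, predicted))
    fp = sum(not a and p for a, p in zip(actual, predicted))
    tn = sum(not a and not p for a, p in zip(actual, predicted))
    return tp, fn, fp, tn

def accuracy(tp, fn, fp, tn):
    """Share of traces that were classified correctly."""
    total = tp + fn + fp + tn
    return (tp + tn) / total if total else 0.0

# Hypothetical test log: 3 replayable and 3 non-replayable traces.
actual = [True, True, True, False, False, False]
predicted = [True, True, False, False, False, True]
tp, fn, fp, tn = confusion_counts(actual, predicted)
print(tp, fn, fp, tn, accuracy(tp, fn, fp, tn))
```

When both FP and FN are zero, every trace is counted as either TP or TN and the accuracy computed this way equals 1.0.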
In summary, the study notes from the experimental results that for every run set of parameters, the commission error, i.e. false positives (FP) and false negatives (FN) values was null, i.e, equal to zero (fp
Process mining techniques consider end-to-end processes rather than narrow (data) patterns, in contrast to the traditional data mining techniques [8, 61]. In other words, the existing data mining methods tend not to be process-centric and do not focus on events data [8], e.g. the rows (instances) and columns (variables) of a standard data file or events log, which often carry formal meanings. Nevertheless, Bogarín et al. [60] note that both data mining and process mining methods apply specific algorithms to data with the purpose of discovering hidden patterns or relationships that can be represented as data models or process models, respectively. Whichever tool or method one chooses to adopt, the key focus should be on achieving the objective for which the technique was adopted.
Likewise, the works in [70, 72, 1] point out that the existing approaches used for extracting the said models are limited to some extent because the resulting models are merely syntactic in nature (i.e., the process models are produced from the labels/tags in the events log). Because the existing approaches depend on the labels to discover the models, the developed systems or methods tend not to gain from the real (semantic) knowledge that describes the processes as performed in reality. We note that the actual semantics describing the events log remain missing, which creates an additional need for domain experts or intelligent systems (e.g. machine-understandable systems; Varghese et al. [4]) that are capable of classifying or interpreting the said models.
Having said that, we note that in practice, the traditional PM methods pose certain issues of semantics that limit their efficacy when handling large volumes of event logs or datasets [1, 2, 12]. Moreover, Cairns et al. [1] argue that the semantic-based methods [2, 12, 13, 17, 54] appear to be the most closely associated and promising area that can be explored to resolve the issues of understanding the different patterns or traces (heterogeneity) within the data. To do this, the envisioned systems or methods are expected to extract streamlined models that fit or represent the actual process(es) as performed in reality. Moreover, Cairns et al. [1] suggest that semantic annotation of the readily available datasets can be effectively used to tackle the challenge of interpreting the said process models. Thus, to benefit from the real semantics behind the event logs and models, semantic-based process mining (SPM), which carries out process mining and its analysis at the conceptual level, has to be employed.
For this reason, this paper introduces the SPMaAF framework that focuses on allowing for an effective discovery and enhancement of the process models. Basically, the paper demonstrates how the events logs from any given process domain are being extracted, semantically prepared, and transformed into minable formats to support the discovery, monitoring, and enhancement of the processes through the semantic-based (conceptual) analysis. In other words, the semantic-based approach shows to support the analysis of the events log and resultant models based on concepts rather than the events tag or label about the processes in question. Functionally, the semantic-based analysis allows for the meaning of the different objects and data types (that constitute the models) to be enhanced by using the properties description languages to support the automatic classification or reasoning of the various entities. Apparently, this is done by generating the inference knowledge (metadata descriptions) which are consequently utilized to determine useful patterns and/or improve analysis of the derived (semantic) models based on the domain concepts. Interestingly, this study has also shown that by understanding and leveraging the real meaning (semantics) of the different process elements which are stored in different variable forms within the information systems or databases; the resulting outcomes can be used to identify patterns that subsequently can be transliterated into actionable plans to support the process-related decision making for the different organizations in general [5, 12, 17, 54, 81].
Theoretically, the experiments and analysis conducted in this paper are, on the one hand, grounded on how the semantic-based method can be utilized to support the automatic generation and representation of the different concepts (e.g. ontologies) within the (semantic) models. On the other hand, the performance evaluation (using the metrics defined in Section 5) is centered on the ability of the system to determine (through the use of a reasoner) the associations or relationships between the concepts that are defined in the ontologies. Thus, because the method (SPMaAF) is built on description logics or property restrictions, the resulting class hierarchies (or taxonomies) allow for conceptual analysis and can be used to measure the similarities of the different concepts. To this end, the series of experiments and cross-validation results show that the method we applied to determine (ascertain) the similarities (and semantic information) amongst the different entities proves to be more efficient than the traditional techniques used for process mining [69]. Besides, the SPMaAF framework provides a more flexible way to analyse such task-specific processes or systems that are believed to be aware of the various processes they are used to support. Thus, we provide a machine-understandable system for conceptual information or data analysis.
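To make the idea of measuring concept similarity over a class hierarchy more concrete, the sketch below computes a Wu-Palmer-style similarity on a small taxonomy. This measure is chosen purely for illustration and is an assumption on our part; it is not claimed to be the metric used in the evaluation (those are defined in Section 5). The taxonomy and all names in the code are hypothetical.

```python
# Hypothetical taxonomy: each concept maps to its parent (None marks the root).
PARENT = {
    "Activity": None,
    "ApplicantActivity": "Activity",
    "ReviewActivity": "Activity",
    "SubmitApplication": "ApplicantActivity",
    "UploadDocuments": "ApplicantActivity",
    "CheckEligibility": "ReviewActivity",
}


def path_to_root(concept):
    """Return the list of concepts from `concept` up to the root, inclusive."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def depth(concept):
    """Depth in the taxonomy, counting the root as depth 1."""
    return len(path_to_root(concept))


def lowest_common_subsumer(a, b):
    """Return the deepest concept that is an ancestor of (or equal to) both."""
    ancestors_a = set(path_to_root(a))
    for candidate in path_to_root(b):  # walks upward from b, so deepest first
        if candidate in ancestors_a:
            return candidate
    raise ValueError("concepts do not share a common ancestor")


def wu_palmer(a, b):
    """Wu-Palmer-style similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    lcs = lowest_common_subsumer(a, b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


SIBLINGS = wu_palmer("SubmitApplication", "UploadDocuments")
COUSINS = wu_palmer("SubmitApplication", "CheckEligibility")
print(SIBLINGS, COUSINS)
```

Concepts that share a more specific superclass score higher than concepts related only through the root, which mirrors how a classified hierarchy can support similarity-based grouping of process elements.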
Conclusion
This study shows that process-related analysis, often allied to the process mining (PM) techniques, calls not only for methods or tools that have the capacity to extract valuable information from the event logs and models, but also for holistic design methods that can be utilized to perform a more abstract (conceptual) analysis and reasoning about the different processes in question.
On the one hand, process mining has become a very useful technique that supports process-related analysis or information exploration, revealing how the different activities depend on each other within the process domains.
On the other hand, there exists the problem of semantics (abstract analysis), which the standard PM methods tend to lack. For this purpose, this paper has shown that a combination of the PM methods with semantic technologies is beneficial and, to a great extent, effective in extracting models capable of producing new and/or previously undiscovered information that allows for abstract analysis and interpretation of the processes. To do this, this paper introduced a framework (SPMaAF) that integrates the main components that motivate the semantic process mining methods (i.e. annotated event logs/models, ontologies and semantic reasoners). The SPM-based method is aimed at the discovery and enhancement of the set(s) of behaviours (or patterns) that can be found within the different process domains. In addition, this paper also studies the current methods, applications and developments within the context of process mining and semantic technologies.
Finally, the paper concludes with the supposition that a system which is formally encoded with semantic labelling (annotation), semantic representations (ontologies) and semantic reasoning (reasoner) has the capacity to lift the PM results and analysis from the syntactic to a more conceptual level.
Future work can adopt the proposed SPMaAF method to analyze data extracted from any given process domain of interest. This may also include refinement of the semantic-based fuzzy mining and reasoning method developed in this paper. For instance, further extensions may uncover a different approach towards integration of the ontology-based schemas or reasoning capabilities defined in this study. This is because semantic-based process mining (SPM) is a new field within the broader context of PM, and there are few algorithms/tools that support the method in the current literature.
