Abstract
Process mining is an emerging research field that deals with discovering, monitoring, and improving business processes by analyzing data in the form of event logs. Event logs can be extracted from most existing enterprise information systems. Predictive business process monitoring is a sub-field of process mining that applies predictive analytics models, typically incorporating Machine Learning (ML) algorithms, to event log data in order to predict various objectives of process instances, such as next activity, remaining time, costs, and risks. Existing research works on predictions about next activities are scarce. At the same time, Automated Machine Learning (AutoML) has not been investigated in the predictive business process monitoring domain. Therefore, based on its promising results in other domains and types of data, we propose an approach for next activity prediction based on AutoML, and specifically on the Tree-Based Pipeline Optimization Tool (TPOT) method for AutoML. The evaluation results demonstrate that automating the design and optimization of ML pipelines without the need for human intervention, apart from making ML accessible to non-ML experts (in this case, the process owners and the business analysts), also provides higher prediction accuracy compared to other approaches in the literature.
Introduction
Process mining is an emerging research field dealing with discovering, monitoring, and improving business processes by extracting knowledge from the event logs that exist in enterprise information systems. Process mining combines approaches from information technology, data analytics, and management sciences in order to analyze the data that are available in the form of event logs and to transform them into process science [2]. An event log contains historical data from process instances and includes at least a case ID for each event, an activity, and a timestamp [1].
Predictive business process monitoring is a sub-field of process mining that applies predictive analytics models, typically incorporating Machine Learning (ML) algorithms, to event log data in order to predict various objectives of process instances, such as next activity, remaining time, costs, and risks [3]. Existing approaches have focused on predictions about the remaining time of process instances rather than about next activities [4]. At the same time, Automated Machine Learning (AutoML) has not been investigated in the predictive business process monitoring domain as a means to automate the time-consuming process of designing and optimizing ML pipelines. Therefore, based on its promising results in other domains [5, 6], we propose an approach for predicting the next activity of a running business process instance based on AutoML, and specifically on the Tree-Based Pipeline Optimization Tool (TPOT) method for AutoML.
The rest of the paper is organized as follows. Section 2 presents a literature review on predictive business process monitoring and on AutoML. Section 3 describes our proposed approach for next activity prediction with TPOT for AutoML. Section 4 presents the implementation of the proposed approach and the results. Section 5 concludes the paper and presents our plans for future work.
Literature review
In this Section, we present the literature review on predictive business process monitoring (Section 2.1) and on AutoML (Section 2.2).
Predictive business process monitoring
Process mining methods aim at extracting knowledge from datasets in the form of event logs in order to investigate the already executed business processes and to enable decision making for their improvement [2, 7]. However, there is also the need for predicting certain objectives related to business processes during the execution of process instances [8, 9]. To this end, predictive business process monitoring, as a sub-field of process mining, aims at developing predictive analytics models, mostly incorporating ML algorithms, in order to provide predictions about several outcomes related to an ongoing (uncompleted) process execution [3, 4, 10, 11, 12, 13, 14, 15, 16].
Typical examples of predictions about the future of an execution trace are the outcome of a process execution, its completion time, and the sequence of its future activities [17]. Such predictions can provide added value in several scenarios and application domains. Instead of acting reactively, detecting a violation or a delay only after it has occurred, predicting the violation helps users and organizations prevent undesired events and act ahead of time, in a proactive manner. To do this, the ML algorithms incorporated in predictive business process monitoring approaches take as input an event log, optionally accompanied by a process model extracted from the data or by contextual attributes [9]. The next activity prediction problem is modelled as a classification problem, while the remaining time prediction problem is modelled as a regression one [4].
The emergence of deep learning inevitably affected the approaches and methods developed for predictive business process monitoring. Several research works take advantage of deep learning algorithms and demonstrate results that outperform traditional ML algorithms [9, 18, 19, 20, 21]. However, despite the promising results of deep learning methods in predictive business process monitoring, existing approaches suffer from the following limitations: (i) they have focused on predictions about the remaining time of process instances rather than on predictions about next activities [4, 9]; (ii) their explainability is still at an early stage and several challenges need to be tackled [11, 22, 23]; (iii) they require large amounts of data that are not usually available in enterprise systems in the form of event logs [23]; (iv) their configuration and training require sophisticated data science knowledge and a considerable number of experiments for achieving optimal performance [5, 24]; (v) their generalization is questionable, since they are trained on trivial datasets representing structured business processes [25].
At the same time, Automated Machine Learning (AutoML) has not been investigated in the predictive business process monitoring domain, on event log datasets, in order to automate the time-consuming process of designing and optimizing ML pipelines, although it has demonstrated promising results in other domains, problems, and datasets [5, 6].
Automated machine learning (AutoML)
ML and deep learning models are manually designed by data scientists after performing extensive experiments, sensitivity analyses, and trial-and-error procedures for specific datasets. Consequently, data scientists need to spend considerable resources and time in order to develop accurate and reliable models [5]. These tasks prevent data analytics from becoming more accessible, flexible, and scalable, and prevent non-ML experts from taking advantage of it for solving the problem at hand [26]. For these reasons, the emergence of AutoML, which aims at automating the procedure of configuring and optimizing ML pipelines, is of utmost importance [6, 27]. AutoML minimizes human intervention, since it automates the steps of the ML pipeline, e.g. data preparation, feature engineering, model selection, hyperparameter optimization, and model evaluation [5].
One of the most widespread AutoML methods is the Tree-Based Pipeline Optimization Tool (TPOT), which searches over feature preprocessors and ML models in order to maximize the accuracy on a supervised classification problem [28]. To do this, it takes advantage of genetic programming [28, 29]. Overall, TPOT has proved to be a successful AutoML method and has been applied to a variety of tasks, application domains, and datasets. For example, in [28], the authors applied TPOT to 150 supervised classification problems and found that it significantly outperforms other well-known ML algorithms in 21 of them, while its accuracy was only slightly lower in 4 of them. It should be noted that TPOT achieved these results without requiring any domain knowledge or human input.
However, the results of AutoML methods, including TPOT, vary according to the dataset, the application domain, and the task, and also according to whether they deal with a classification or a regression problem [30]. Therefore, it is important to evaluate such methods in a variety of domains, taking into account their specificities.
Research methodology for next activity prediction with Tree-Based pipeline optimization tool (TPOT)
The pseudocode of the main function of the proposed research methodology
In this Section, we describe our research methodology for predictive business process monitoring with AutoML, and particularly with TPOT, aiming at predicting the next activity in a running business process instance. The research methodology consists of the following steps, which are described in the following sub-sections: (i) Event log extraction; (ii) Preprocessing of event log dataset; (iii) Process discovery; and, (iv) Prediction of next activity. Its pseudocode at a high level is represented in Table 1.
The pseudocode for the “Event Log Extraction” step of the proposed research methodology
An example of the XES format.
Enterprise systems record events corresponding to the execution of work items and store them in databases, from where they can be extracted and analyzed. An event log is required to have certain mandatory attributes: a case ID for each event, an activity related to each event, and timestamps to order events and measure performance. Optional attributes may also be included. Simple event logs are represented as tables and are available in a Comma-Separated-Values (CSV) format. However, for more complex event logs, where there are several attributes in addition to the mandatory ones (i.e. case ID, activity, and timestamp), the eXtensible Event Stream (XES) format, standardized by the IEEE Task Force on Process Mining, is used [1, 7]. The XES format is a tag-based language for capturing system behaviors through event logs and streams. The XES standard includes a schema for the structure of the event log and its extensions, as well as prototypes for providing semantics to certain attributes. Figure 1 provides an example of the XES format. Table 2 presents the pseudocode of this step of the proposed research methodology.
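As a minimal sketch of how the tag-based structure maps onto the three mandatory attributes, the snippet below parses a tiny hand-written XES fragment with the Python standard library and flattens it into rows. The sample log, its values, and the helper name `xes_to_rows` are illustrative assumptions; real XES files typically declare an XML namespace and extensions, which a process mining library such as PM4Py takes care of.

```python
import xml.etree.ElementTree as ET

# Hand-written XES fragment for illustration only (not taken from a real log).
SAMPLE_XES = """<log>
  <trace>
    <string key="concept:name" value="case-1"/>
    <event>
      <string key="concept:name" value="A_SUBMITTED"/>
      <date key="time:timestamp" value="2011-10-01T08:00:00"/>
    </event>
    <event>
      <string key="concept:name" value="A_Declined"/>
      <date key="time:timestamp" value="2011-10-01T09:30:00"/>
    </event>
  </trace>
</log>"""


def xes_to_rows(xes_text):
    """Flatten a namespace-free XES string into (case ID, activity, timestamp) rows."""
    rows = []
    for trace in ET.fromstring(xes_text).iter("trace"):
        # The case ID is a trace-level attribute, i.e. a direct child of <trace>.
        case_id = trace.find("string[@key='concept:name']").get("value")
        for event in trace.iter("event"):
            attrs = {child.get("key"): child.get("value") for child in event}
            rows.append((case_id, attrs["concept:name"], attrs["time:timestamp"]))
    return rows


rows = xes_to_rows(SAMPLE_XES)
```

Each row carries the case ID of its enclosing trace, so the nested XES structure becomes the flat table layout used by CSV event logs.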
Pseudocode for the “Preprocessing of Event Log Dataset” step of the proposed research methodology
Preprocessing of the event log dataset is an essential step for transforming the raw data into a dataset that can be processed by the process discovery and AutoML algorithms of the proposed research methodology. Its pseudocode is presented in Table 3. First, the categorical, continuous, and blank values are identified, and the categorical features are encoded so that they retain a meaningful interpretation and can be fed into the AutoML algorithms. In addition, the numerical features are scaled in order to standardize their range, to eliminate the potential impact of different scales on the performance of ML models, and to ensure a balanced contribution of the different features to the ML model.
Second, this step includes data cleaning, and particularly the treatment of missing values. It also includes the extraction of the day, month, and time from the timestamps, which facilitates capturing temporal patterns in the dataset. Third, a variance analysis is performed in order to identify and evaluate potential performance fluctuations within the business process, the so-called process variants, and thus to reveal how process instances deviate from each other and how these deviations affect the efficiency, stability, and predictability of the business process.
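The following sketch illustrates three of the operations described above on made-up data: filling a missing value, extracting temporal features from a timestamp, and scaling a numerical feature. The text does not state which imputation or scaling method is used, so mean imputation and min-max scaling are assumptions chosen for illustration, as are the helper names.

```python
from datetime import datetime

# Toy events: (case ID, activity, ISO timestamp, requested amount); values are made up.
events = [
    ("c1", "A_SUBMITTED", "2011-10-01T08:15:00", 5000.0),
    ("c1", "A_Declined", "2011-10-03T17:40:00", 5000.0),
    ("c2", "A_SUBMITTED", "2012-01-12T09:05:00", None),  # missing amount
    ("c3", "A_SUBMITTED", "2012-02-20T14:30:00", 20000.0),
]


def temporal_features(iso_timestamp):
    """Return the day, month and hour extracted from an ISO 8601 timestamp."""
    ts = datetime.fromisoformat(iso_timestamp)
    return {"day": ts.day, "month": ts.month, "hour": ts.hour}


def min_max_scale(values):
    """Scale numbers to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


# Fill missing amounts with the mean of the observed ones (one possible strategy).
observed = [e[3] for e in events if e[3] is not None]
mean_amount = sum(observed) / len(observed)
amounts = [e[3] if e[3] is not None else mean_amount for e in events]

features = [temporal_features(e[2]) for e in events]
scaled_amounts = min_max_scale(amounts)
```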
Pseudocode for the “Process Discovery” step of the proposed research methodology
The Process Discovery phase incorporates the Heuristics Miner algorithm in order to derive a heuristic network. The heuristic network is a graphical representation of the process and provides a visual abstraction of the business process structure. It reveals the flow of activities as recorded in the event log, and helps to identify patterns and to investigate unexpected process behaviours. It is capable of capturing the inherent uncertainty of business processes, handling incomplete or noisy data, and coping with the dynamic nature of event logs. The pseudocode of this step is presented in Table 4.
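The quantity behind a heuristic network is the dependency measure between two activities, commonly defined in the Heuristics Miner literature as (|a>b| − |b>a|) / (|a>b| + |b>a| + 1), where |a>b| counts how often b directly follows a. The sketch below computes it on three made-up traces; it illustrates the standard formula only and says nothing about the thresholds or settings used in this work.

```python
from collections import Counter

# Made-up traces (one list of activities per case), for illustration only.
traces = [
    ["A", "B", "C"],
    ["A", "B", "C"],
    ["A", "C", "B"],
]


def directly_follows(traces):
    """Count how often activity b directly follows activity a."""
    counts = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += 1
    return counts


def dependency(counts, a, b):
    """Dependency measure for a != b; values close to 1 suggest that a leads to b."""
    ab, ba = counts[(a, b)], counts[(b, a)]
    return (ab - ba) / (ab + ba + 1)


df = directly_follows(traces)
dep_ab = dependency(df, "A", "B")  # B follows A twice, never the reverse
dep_bc = dependency(df, "B", "C")  # C follows B twice, B follows C once
```

The "+ 1" in the denominator keeps the measure below 1 for rarely observed pairs, which is one reason the approach is tolerant of noisy logs.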
Pseudocode for the “Prediction of Next Activity” step of the proposed research methodology
In predictive business process monitoring, the selection and configuration of ML models is of utmost importance. In our methodology, this process is facilitated by AutoML, and particularly by the TPOT library, which automates the complex tasks associated with model selection and hyperparameter tuning. TPOT aims at optimizing the ML pipeline by exploring a variety of algorithms, techniques, and configurations for each step of the pipeline, and by selecting the most appropriate ones, in terms of accuracy and reliability, for the specific dataset at hand. In this way, TPOT accelerates the model development process while exploring a variety of candidate ML model architectures and aiming at high accuracy [27, 28, 29]. TPOT can be used for both classification and regression tasks [27]. We implemented a classification pipeline, since a classifier matches the nature of the next activity prediction task. The pseudocode of this step is presented in Table 5.
The implementation of TPOT consists of the following steps:
Raw Data: Data are imported from an XES file, and only specific columns are selected, i.e. timestamp, case ID, and concept name, reducing the dataset to the components that are essential for the subsequent steps.
Data Cleaning: TPOT incorporates data cleansing techniques in order to handle missing values and to ensure the integrity of the dataset. In addition to the automatic data cleaning procedures embedded in TPOT, we also convert the timestamp column to datetime and fill in the values that are missing. We also delete entire columns that are not taken as input for predicting the next activity in the process instance.
Feature Engineering: This step extracts temporal features from the event log data and captures the sequence of events and temporal patterns.
Model Selection: TPOT orchestrates an automated procedure for model selection by using genetic programming in order to explore a variety of ML algorithms and configurations, and to select models that exhibit optimal performance for the specific predictive task. In particular, TPOT Classifier performs a search over ML pipelines that can contain supervised classification models, preprocessors, feature selection techniques, and any other estimator or transformer. It also searches over the hyperparameters of all the objects in the pipeline. TPOT Classifier selects among the following algorithms:
Logistic Regression: Logistic Regression models the probability of a binary outcome by applying a logistic function to a linear combination of input features, providing probabilities that can be transformed into class predictions.
Decision Trees: Decision trees create branches based on features in order to classify instances into different classes. They are capable of capturing complex decision boundaries and interactions within the features space.
Random Forest: Random Forest combines multiple decision trees to form an ensemble, aiming at achieving a higher robustness, accuracy and generalization.
Gradient Boosting: Gradient Boosting builds a series of weak learners that iteratively correct the errors of their predecessors. It is capable of capturing subtle dependencies within the data and can cope with imbalanced classes.
Support Vector Machines (SVM): SVM finds the hyperplane that best separates different classes in the feature space and is particularly effective when the decision boundary is non-linear or complex.
k-Nearest Neighbors (k-NN): k-NN predicts the class of an instance based on the majority vote of its k nearest neighbors. It is particularly suitable when local patterns are important in the classification task.
Parameter Optimization: TPOT also uses genetic programming to optimize hyperparameters. It evolves and refines the hyperparameter configurations of the selected ML models, so that the models are tuned to the specifics of the event log data.
Model Validation: The robustness and accuracy of the predictive models are evaluated through cross-validation techniques. TPOT partitions the dataset, trains the models on training sets, and validates them on test sets corresponding to the remaining part of the dataset, thus providing a reliable performance evaluation and guarding against overfitting.
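The partitioning in the model validation step can be pictured with a plain k-fold split. The sketch below divides sample indices into k contiguous folds and yields train/validation index pairs; it is a simplified, unshuffled and unstratified illustration rather than TPOT's internal implementation, and both the helper name and the value k = 5 are assumptions made for the example.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, validation) index lists for k contiguous folds."""
    # Spread the remainder so that fold sizes differ by at most one.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(n_samples) if not start <= i < start + size]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
```

Every sample is used for validation exactly once, so the averaged score reflects performance on data the candidate pipeline was not trained on.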
Implementation
The research methodology presented in Section 3 was implemented using the following Python libraries:
Pandas and NumPy: Pandas and NumPy were instrumental for data manipulation and numerical operations, providing a robust foundation for handling datasets and performing essential computations.
PM4Py: PM4Py, a process mining library, played a pivotal role in extracting insights from event logs, enabling the application of process discovery and conformance checking techniques.
Seaborn and Matplotlib: Seaborn and Matplotlib served as data visualization tools, facilitating the creation of graphs and plots for a comprehensive analysis of predictive model outputs.
Plotly: Plotly enriched the visual representation of data with interactive plots, enhancing the communicative aspects of the results and supporting the exploration of patterns and trends.
Scikit-learn (sklearn): Scikit-learn, a versatile machine learning library, offered a broad spectrum of tools for model selection, evaluation, and preprocessing, streamlining the implementation of predictive models.
TPOT: TPOT, an automated machine learning framework, played a crucial role in optimizing model selection and hyperparameter tuning, automating the tedious aspects of the machine learning pipeline.
A sample dataset of the event log.
The dataset was taken from the Business Process Intelligence Challenge 2012 (BPIC’12). It is a real-life event log pertaining to the loan application process of a Dutch financial institute. The log contains 262,200 events in 13,087 cases. Apart from some anonymization, the log contains all data as it came from the financial institute. The process represented in the event log is an application process for a personal loan or overdraft within a global financing organization. A sample of the event log is depicted in Fig. 2.
Below, we describe the attributes of the event log:
The ‘org_resource’ column represents the employee or department that is responsible for an activity of a loan request, i.e. it shows who performed an activity.
The ‘lifecycle:transition’ column indicates the lifecycle state of an event, i.e. whether the activity was scheduled, started, or completed.
The ‘concept:name’ column contains the names of all the activities of the loan application process. It describes what happens at a given point in the process and is therefore a central column for the analysis and for next activity prediction on this type of data.
The ‘time:timestamp’ column contains the timestamp of each event, i.e. the moment at which the activity was executed. It defines the chronological order of the events.
The ‘case:REG_TIME’ column contains the time at which the request was initially entered into the system, i.e. it only indicates the initial timestamp of each request.
The ‘case:concept:name’ column contains the ID of each loan case. A set of activities constitutes a request, and each request is identified by the ID number in this column.
The ‘amount_req’ column shows the amount requested by the client in each loan request.
In the exploratory analysis that we performed, essential metrics were computed to characterize the dataset’s nature:
Number of events: 262,200
Number of cases: 13,087
Average Events per case: 20.04
Average Case Length: 20.04
Average Event Duration (hours): 10.87
Max Event Duration (hours): 2,468.41
Average Case Duration (hours): 206.97
Max Case Duration (hours): 3,293.32
Number of Variants: 1,348
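Metrics of this kind can be derived directly from the three mandatory attributes. The sketch below computes a few of them on a made-up log using only the Python standard library; the rows and variable names are illustrative assumptions, and a variant is taken to be a distinct ordered sequence of activities. For the reported figures, 262,200 events over 13,087 cases gives the stated average of about 20.04 events per case.

```python
from collections import defaultdict
from datetime import datetime

# Made-up event log rows: (case ID, activity, ISO timestamp).
log = [
    ("c1", "A_SUBMITTED", "2011-10-01T08:00:00"),
    ("c1", "A_Declined", "2011-10-01T10:00:00"),
    ("c2", "A_SUBMITTED", "2011-10-02T09:00:00"),
    ("c2", "W_Handling leads", "2011-10-02T12:00:00"),
    ("c2", "A_Declined", "2011-10-03T09:00:00"),
    ("c3", "A_SUBMITTED", "2011-10-04T08:00:00"),
    ("c3", "A_Declined", "2011-10-04T14:00:00"),
]

# Group events by case as (timestamp, activity) pairs.
cases = defaultdict(list)
for case_id, activity, ts in log:
    cases[case_id].append((datetime.fromisoformat(ts), activity))

num_events = len(log)
num_cases = len(cases)
avg_events_per_case = num_events / num_cases

# A variant is a distinct ordered sequence of activities.
variants = {tuple(a for _, a in sorted(evts)) for evts in cases.values()}

# Case duration: time between the first and last event of a case, in hours.
durations_h = [
    (max(evts)[0] - min(evts)[0]).total_seconds() / 3600 for evts in cases.values()
]
avg_case_duration_h = sum(durations_h) / num_cases
```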
Moreover, language normalization was applied to the ‘concept:name’ column, converting Dutch values to English for improved clarity. The activities in the dataset span from 01/10/2011 to 01/03/2012, although not all months are represented: only January, February, March, September, and October appear. We then analyzed the start and end activities of all cases. All loan application cases start with the ‘A_SUBMITTED’ activity. They end with one of the following activities: ‘W_Validate request’, ‘W_Edit contract details’, ‘A_Declined’, ‘W_Complete your application’, ‘A_Cancelled’, ‘W_Call incomplete files’, ‘W_Handling leads’, ‘W_Call for quotes’, ‘W_Assess fraud’, ‘O_Cancelled’, ‘A_Approved’. Among these, ‘A_Declined’, ‘W_Validate request’ and ‘W_Handling leads’ occur most frequently. This analysis provides a fine-grained understanding of the temporal distribution, activities, and characteristics of the dataset, forming a foundation for the subsequent predictive modeling. The derived metrics and insights emphasize the heterogeneous nature of the BPIC 2012 dataset.
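A start and end activity analysis of this kind amounts to counting the first and last activity of every case. The sketch below shows the idea on three made-up traces; the case IDs and trace contents are illustrative assumptions, and the traces are assumed to be already ordered by timestamp.

```python
from collections import Counter

# Made-up traces, already ordered by timestamp within each case.
traces = {
    "c1": ["A_SUBMITTED", "A_Declined"],
    "c2": ["A_SUBMITTED", "W_Handling leads"],
    "c3": ["A_SUBMITTED", "W_Validate request", "A_Declined"],
}

start_activities = Counter(trace[0] for trace in traces.values())
end_activities = Counter(trace[-1] for trace in traces.values())
most_common_end = end_activities.most_common(1)[0]
```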
Heuristic network for process discovery.
In the process discovery step, a heuristic network visually captures complex relationships within the dataset, providing a graphical representation of the interaction between different process elements, as shown in Fig. 3.
The implementation begins by importing and preprocessing the event log data using Python libraries: pm4py to import the XES file, pandas to convert the dataset to a dataframe, and numpy to prepare the data for the model. Timestamps are converted to datetime objects. A very important aspect is the encoding of the ‘concept:name’ values, for which a mapping dictionary is created and applied, converting the categorical data into numerical form. Next, we create sequences and labels based on the specified window size. This function works within the constraints of the dataset and the business process context, striking a balance between the amount of historical context considered for prediction and the practicality of the model.
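A mapping dictionary of this kind can be sketched in a few lines. The activity values and variable names below are illustrative assumptions; sorting the distinct names is one way to make the assignment of codes reproducible between runs.

```python
# Made-up activity column, in the order the events appear in the log.
activities = ["A_SUBMITTED", "W_Handling leads", "A_Declined", "A_SUBMITTED"]

# Assign each distinct activity an integer code; sorting keeps the mapping stable.
activity_to_code = {name: code for code, name in enumerate(sorted(set(activities)))}
code_to_activity = {code: name for name, code in activity_to_code.items()}

encoded = [activity_to_code[name] for name in activities]
decoded = [code_to_activity[code] for code in encoded]
```

Keeping the inverse dictionary makes it possible to translate predicted codes back into activity names when the results are reported.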
Next activity prediction is implemented using a window approach. The prediction model predicts the next step based on historical sequences of events and can be adapted to different window sizes. The window size is the number of preceding events that are given to the model in order to predict the next one. Given a window size, a function iterates over all the loan cases and creates sequences of events of that length, each paired with the event that immediately follows it. In this way, the model learns patterns within a fixed-size window of events in order to predict the exact next event each time. The choice of the window size determines how much history each sequence contains and shapes the ability of the model to capture patterns in the data. The experiments include the window sizes 1, 2, 3, 5, 10, 15 and 20. For each window size, sequences and labels are generated, and the dataset is split into training and test sets.
The TPOT library determines the most appropriate classification model and hyperparameters for the next event prediction task. The model is trained on the generated sequences and labels, and predictions are then made on the test set. In all experiments, a standardized set of parameters for TPOT Classifier was used, ensuring consistency and comparability between the different analyses. The parameters used for the TPOT Classifier are:
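The windowing function described above can be sketched as follows. The function name `make_windows` and the toy traces are illustrative assumptions, and so is the treatment of cases with fewer events than the window size plus one: this sketch simply produces no samples for them, whereas padding would be another option.

```python
def make_windows(cases, window_size):
    """Build (sequence, label) pairs: `window_size` consecutive events -> next event."""
    sequences, labels = [], []
    for events in cases.values():
        # A case with fewer than window_size + 1 events yields no samples.
        for i in range(len(events) - window_size):
            sequences.append(events[i:i + window_size])
            labels.append(events[i + window_size])
    return sequences, labels


# Made-up, integer-encoded traces keyed by case ID.
cases = {"c1": [0, 1, 2, 3], "c2": [0, 1, 4]}
X, y = make_windows(cases, window_size=2)
```

Here the same window `[0, 1]` is followed by activity 2 in one case and by activity 4 in another, which shows why very short windows can be ambiguous and why a longer history can improve accuracy.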
Generations: 8
Population Size: 25
Random State: 42
Verbosity: 2
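For orientation, these settings could be passed to TPOT roughly as sketched below. The keyword names follow the classic `TPOTClassifier` interface and may differ in other releases, so treat them as an assumption; the helper `fit_tpot` is hypothetical and imports TPOT lazily so that the settings can be inspected without the library installed.

```python
# Search settings as listed above, using classic TPOT keyword names (an assumption).
TPOT_PARAMS = {
    "generations": 8,
    "population_size": 25,
    "random_state": 42,
    "verbosity": 2,
}


def fit_tpot(X_train, y_train):
    """Run the TPOT search and return the fitted classifier (requires `tpot`)."""
    # Imported here so that the settings above can be read without TPOT installed.
    from tpot import TPOTClassifier

    model = TPOTClassifier(**TPOT_PARAMS)
    model.fit(X_train, y_train)
    return model


# The initial population plus one offspring per population member in each
# generation (the classic TPOT default) gives about 25 * (8 + 1) pipelines.
pipelines_evaluated = TPOT_PARAMS["population_size"] * (TPOT_PARAMS["generations"] + 1)
```

Under these assumptions each experiment evaluates on the order of 225 candidate pipelines, which helps explain why execution time grows noticeably with larger window sizes.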
Distribution of model efficiency across different window sizes.
Accuracy for various window sizes and required execution time.
The distribution of model efficiency across different window sizes is depicted in Fig. 4, which shows the predictive accuracy achieved for the various window sizes. As the window size increases, a distinct upward trend in performance becomes apparent, confirming the link between the amount of history considered and predictive accuracy, with a clear increase in model performance from window size 3 onwards. However, beyond window size 5 there is no considerable improvement in accuracy, while the required computational power and execution time increase significantly, as shown in Fig. 5. It should be noted that the pipeline was executed on a local PC; these times are expected to be significantly lower when executed on the cloud. However, they still provide a measure of comparison.
A detailed analysis of each window size reveals the algorithms that TPOT prefers. For window size 1, TPOT automatically selects the ExtraTreesClassifier algorithm, yielding an accuracy of 66%. For window size 2, TPOT again selects the ExtraTreesClassifier algorithm, with different hyperparameters, and the accuracy rises from 66% to 79%, an increase of 13 percentage points already in the second experiment. A notable transition occurs at window size 3, where TPOT selects the RandomForestClassifier, which, together with its specific hyperparameter configuration, yields an accuracy of 84%. The switch to the RandomForestClassifier algorithm highlights the ability of TPOT to adapt its approach to different window sizes, exploiting the strengths of different algorithms and combining them with appropriate hyperparameters. For window size 5, TPOT again selects the ExtraTreesClassifier algorithm. The maximum accuracy of 88% is reached for a window size of 10, using the RandomForestClassifier algorithm. For the larger window sizes, namely 15 and 20, TPOT selects ExtraTreesClassifier combined with RobustScaler and MinMaxScaler, respectively, which shows that the selected pipelines also differ in their preprocessing. Despite the small drop in accuracy for these window sizes, the findings highlight the adaptability and flexibility of the approach in optimizing predictive accuracy.
Our results not only confirm the correlation between window size and predictive accuracy, but also highlight the dynamic nature of the subsequent event prediction model. The ability to autonomously select different algorithms based on time windows adds a layer of sophistication, offering professionals valuable insights for fitting models to various business process scenarios. From a business perspective, the observed results suggest that, when dealing with predictive modeling for sequences of events, the choice of window size significantly affects both computational resources and model accuracy. In this particular context, the comparison between window sizes 5 and 20 shows that the larger window size does not provide a substantial improvement in prediction accuracy, while imposing a noticeable increase in computational cost.
Turning to the computational cost in relation to the window size, the diagram shows that beyond a window size of 5 the accuracy remains roughly constant and even decreases slightly for some window sizes. This is the case at window size 20, where the accuracy is one percentage point lower although the window is 15 events larger. We therefore compare the computational cost for window sizes 5 and 20: since the accuracy is nearly the same, and slightly lower for the larger window, a large difference in computational cost would indicate that experiments with such large windows are unnecessary, as they cost considerably more without yielding better results. Consequently, in this particular context, the comparison between window sizes 5 and 20 shows that the larger window size does not provide a substantial improvement in prediction accuracy, while it imposes a noticeable increase in computational cost. For window_size
While the accuracy achieved was marginally lower at 86%, the increase in computational cost raises questions about the necessity of using larger window sizes. Given the marginal accuracy difference between the two window sizes and the substantial increase in computation time for window_size
Comparison of our approach with other related research works
We then compared the accuracy of the proposed approach with that of other next activity prediction models in the literature that use the BPIC 2012 event log. The summary of the comparative analysis is presented in Table 6. The primary objective is to evaluate the performance of our model in comparison to the established research methodologies, technical approaches, and algorithms used in previous studies.
The Decision Trees model [31] shows stability and reliability, especially with lower window sizes, with an accuracy rate of 85%. The Adversarial Framework with LSTM [32] stands out with a 94% accuracy, demonstrating superior performance in event label predictions. Multi-Stage Deep Learning [33] achieves an accuracy of 82.70%, and provides information on hyperparameter tuning. T-LSTM Cells with Cost-Sensitive Learning [20] combines techniques, enhancing predictive capabilities with a reported accuracy of 77.80%. Finally, our model using the TPOT library achieves a competitive accuracy of 88%. When analyzing the results, it is important to consider the context and specific requirements of the forecasting task. The reliability of the Adversarial Framework on different datasets highlights its robustness, while our TPOT-based model achieves high accuracy, proving the effectiveness of AutoML.
Recurrent Neural Networks (RNNs), particularly those equipped with a Long-Term Memory (LSTM) architecture, stand out for their ability to model sequential data. LSTMs excel at capturing complex temporal dependencies, making them particularly suitable for predicting the nuanced dynamics of ongoing processes. The ability to preserve contextual information over extended sequences allows LSTMs to discern subtle patterns and dependencies that may escape simpler models. Their success in achieving high accuracy rates in tasks such as predicting the next event is attributed to their sophisticated memory mechanism, which allows the model to distinguish and exploit long-term dependencies within sequential data. However, the efficiency of LSTMs comes at the cost of increased computational requirements and the necessity for meticulous hyperparameter tuning. Achieving optimal performance involves fine-tuning various aspects of the model, including the number of layers, the size of hidden states, and learning rates. This complex tuning process requires considerable expertise and computational resources, and the search for the optimal configuration can be time-consuming.
In contrast, TPOT introduces an alternative paradigm. While LSTMs achieve the highest accuracy, TPOT focuses on automating model selection and hyperparameter optimization. Its accuracy is comparable to that of LSTM-based approaches and higher than that of the other algorithms existing in the literature for next activity prediction in business processes, but its main strength lies in democratizing ML by automating the laborious aspects of model development. TPOT systematically explores a wide range of algorithms and hyperparameters, searching for a model that performs well on the data without requiring deep user involvement in parameter adjustments. The complex hyperparameter tuning that LSTMs require is one area where TPOT can save analysts significant time. An additional aspect is its adaptability and reusability: once an analyst has created a TPOT model for a particular data set, the same model can be used for analogous data sets with similar prediction tasks. TPOT’s automated search for the most suitable algorithms and configurations makes it highly flexible and facilitates the transfer of knowledge to different applications.
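To make the idea of an automated configuration search more concrete, the following is a minimal sketch in plain Python. It is not TPOT and does not use genetic programming; it is a toy stand-in in which the only "hyperparameter" is the prefix window size of a simple frequency-based next activity predictor, and the search exhaustively scores each candidate on held-out traces. The activity names, the toy traces, and all function names are assumptions introduced here for illustration only.

```python
from collections import Counter, defaultdict


def make_examples(traces, window):
    """Turn each trace into (prefix window, next activity) pairs."""
    examples = []
    for trace in traces:
        for i in range(1, len(trace)):
            prefix = tuple(trace[max(0, i - window):i])
            examples.append((prefix, trace[i]))
    return examples


def fit_frequency_model(traces, window):
    """Store the most frequent next activity for each observed prefix window."""
    counts = defaultdict(Counter)
    overall = Counter()
    for prefix, nxt in make_examples(traces, window):
        counts[prefix][nxt] += 1
        overall[nxt] += 1
    # Fallback for unseen prefixes: the globally most frequent next activity.
    fallback = overall.most_common(1)[0][0]
    table = {p: c.most_common(1)[0][0] for p, c in counts.items()}
    return table, fallback


def accuracy(model, traces, window):
    """Fraction of next activities predicted correctly on the given traces."""
    table, fallback = model
    examples = make_examples(traces, window)
    hits = sum(table.get(prefix, fallback) == nxt for prefix, nxt in examples)
    return hits / len(examples)


def search_best_window(train, valid, candidate_windows):
    """Score every candidate window on held-out traces and keep the best.

    Ties are broken in favour of the smaller (simpler) window.
    """
    scored = []
    for window in candidate_windows:
        model = fit_frequency_model(train, window)
        scored.append((accuracy(model, valid, window), window))
    best_acc, best_window = max(scored, key=lambda s: (s[0], -s[1]))
    return best_window, best_acc


# Hypothetical toy event log: what follows "review" depends on the first activity.
approved = ["register", "review", "approve"]
rejected = ["escalate", "review", "reject"]
train = [approved, approved, rejected, rejected]
valid = [approved, rejected]

best_window, best_acc = search_best_window(train, valid, [1, 2, 3])
print(best_window, best_acc)
```

In this toy log a window of one activity cannot distinguish the two variants after "review", whereas a window of two can, so the search selects the larger window without the analyst having to reason about it. TPOT applies the same principle at a much larger scale, searching over whole pipelines of preprocessing steps, algorithms, and hyperparameters.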
The proposed approach, which applies AutoML to predict the next activity in a business process instance, has implications for both researchers and practitioners.
Researchers can leverage AutoML to develop more sophisticated predictive models for various domains without needing deep expertise in ML. This democratizes ML, allowing a broader range of researchers to contribute to the field and a broader range of scientific fields to benefit from AutoML. In addition, since AutoML automates the creation of ML pipelines in terms of model selection, hyperparameter tuning and optimization, researchers can focus more on high-level problems such as understanding process behaviors, improving data quality, and exploring new types of predictive tasks. In this way, process owners gain more insights on which to base informed decisions. Moreover, AutoML provides standardized and reproducible workflows for predictive business process monitoring, thus facilitating benchmarking and comparative studies. Researchers can systematically compare different models and approaches, contributing to a more robust and cumulative knowledge base. Finally, AutoML can serve as a testbed for new methodologies, enabling researchers to validate hypotheses and iterate on new approaches for predictive business process monitoring.
Practitioners can benefit from the proposed approach because it empowers them to create predictive models that anticipate process outcomes, enabling proactive decision making. This can lead to optimized resource allocation, reduced process bottlenecks, and improved operational efficiency. Furthermore, practitioners without extensive expertise in ML can still develop effective predictive models. This lowers the barrier to entry, allowing a wider range of professionals to utilize predictive business process monitoring approaches in their daily operations. In addition, AutoML solutions often come with scalability features, enabling practitioners to apply predictive business process monitoring across different departments and process types within an organization. This adaptability ensures that predictive insights can be leveraged across various business units. Additionally, by automating the model development process, AutoML reduces the time and cost associated with building and deploying predictive models. It should also be noted that AutoML facilitates continuous monitoring and model updating, helping predictive models remain accurate and relevant as business processes and environments evolve. Practitioners can thus maintain high performance of their predictive systems with minimal manual intervention. Finally, many AutoML libraries can be integrated with existing business systems (e.g., ERP, CRM). This enables practitioners to embed predictive insights directly into their workflows, enhancing real-time decision support.
Conclusions and future work
Process mining is an emerging research field which deals with discovering, monitoring and improving business processes by extracting knowledge from event logs readily available in information systems. Predictive business process monitoring is a sub-field of process mining and deals with predictive analytics models on event log data aiming at predicting various values of process instances. Despite the flourishing literature on ML models for predictive business process monitoring, next activity prediction algorithms are underexplored, while AutoML approaches have not been investigated. In this paper, we applied TPOT for AutoML in order to predict the next activity in running process instances. The evaluation results show that automating the design and optimization of ML pipelines without the need for human intervention not only makes predictive business process monitoring accessible to non-ML experts (in this case, the process owners and the business analysts), but also provides higher prediction accuracy compared to other approaches in the literature.
Our future work will follow two main directions. First, we will implement additional AutoML methods in the predictive business process monitoring domain and perform a comparative experimental analysis in order to evaluate them across various objectives, such as next activity prediction, remaining time of business process execution, risks, and costs. Second, we will investigate synergies of Large Language Models (LLMs) with process mining and predictive business process monitoring.
