Abstract
Automated machine learning (AutoML) supports ML engineers and data scientists by automating single tasks like model selection and hyperparameter optimization, or by automatically generating entire ML pipelines. This article presents a survey of 20 state-of-the-art AutoML solutions, both open source and commercial. They differ widely in functionality, targeted user groups, supported ML libraries, and degree of maturity. Depending on the AutoML solution, a user may be locked into one specific ML library or one product ecosystem. Additionally, the user might need some expertise in data science and programming to use the AutoML solution.
We propose a concept called OMA-ML (
Introduction
Machine learning (ML) is an important sub-domain of artificial intelligence that makes it possible to make predictions using models based on previous observations [1]. ML is used to solve a multitude of problems, like classification, clustering, or anomaly detection, in all kinds of business domains, like life sciences [2], manufacturing [3, 4], or the public sector [5, 6]. Engineering ML applications for practical use requires sound experience on the part of ML engineers and data scientists. Tasks to be performed include data analysis, data preparation, feature engineering, model selection, validation, learning curve analysis and hyperparameter optimization. To support data scientists and also enable application domain experts to create ML pipelines, the field of automated ML (AutoML) [7] has emerged. AutoML aims at automating model selection and hyperparameter optimization, leading to higher efficiency and, potentially, better results. More advanced AutoML solutions also perform data preparation, feature engineering and validation, which allows entire ML pipelines to be created automatically [8, 9]. Currently, AutoML is focused on supervised ML [7]. There is a growing number of AutoML solutions available, both academic and commercial. Current state-of-the-art AutoML solutions target one major ML library and compute an ML pipeline for this library only, e.g., Autosklearn [10] for Scikit-learn [11], Auto-Keras [12] for Keras [13], and Google AutoML [14] for Tensorflow [15]. Most AutoML solutions extend this by including secondary ML libraries that each offer support for one ML approach (e.g. Catboost, LightGBM); only AutoGluon [16] supports multiple redundant ML libraries (MXNET, PyTorch).
The targeted user groups of AutoML solutions differ. Commercial solutions like RapidMiner Auto Model [17] or Google AutoML [14] offer a graphical user interface (GUI) usable by application domain experts who may not have programming skills (e.g., biologists), providing a workflow and deployment inside their ecosystem. Auto-WEKA [18] is an open source solution which also provides a GUI. Other open source solutions like Autosklearn [10], Auto-Keras [12], Auto-PyTorch [19], or TPOT [8] offer libraries that require programming skills.
The contribution of this article is three-fold: (a) we present an extended survey of state-of-the-art AutoML solutions; (b) based on the survey results we present a novel concept and (c) an implementation called OMA-ML (
This article is an extended version of [20]. Sections 3, 4.4 and 5 including Figs 1, 3, 5 and 6 are based on [20] and have been extended with current findings where needed. The survey of AutoML solutions, the description of the implementation and its evaluation are new.
This article is structured as follows: Section 2 presents related work. In Section 3, we introduce the basics of AutoML. Section 4 presents our survey on 20 AutoML solutions. Section 5 introduces the concepts of Meta AutoML and OMA-ML. In Section 6 the implementation of OMA-ML is presented, which is evaluated in Section 7. Section 8 concludes the article and indicates future work.
Related work
Six peer-reviewed surveys of AutoML have been published recently: [21, 7, 22, 23, 24] and [25]. The survey [21] provides a detailed overview of the state-of-the-art in AutoML. The authors describe tasks in AutoML and algorithms to solve these tasks. However, they do not analyze specific AutoML solutions. The survey [7] presents AutoML concepts and algorithms and additionally reviews some AutoML solutions. However, the list of solutions is incomplete. It does not cover ADANET, AlphaD3M, AutoCVE, AutoGluon, Auto-Keras, Auto-Pytorch, Auto-WEKA, AWS Sagemaker Autopilot, Azure AutoML, EvalML, FLAML, Google AutoML, MLBOX, MLJAR, RapidMiner Auto Model and TransmogrifAI. The survey [22] gives an overview of AutoML steps, presents 4 AutoML solutions and compares the characteristics of 5 additional AutoML solutions. In [23] the authors propose a new concept to benchmark AutoML solutions; their benchmark is applied to 4 different AutoML solutions using 39 different datasets.
In [26], an extensive AutoML solution classification benchmark suite containing 72 classification datasets is presented. The classification dataset used for the benchmark in our survey is part of this benchmark suite. The survey [24] briefly describes 4 different AutoML solutions and their internal pipeline building concept. Additionally, the authors present benchmark results for the 4 analysed AutoML solutions, collected from different sources including [23]. Finally, [25] presents the progress made in the different areas covered by AutoML and a survey of 9 different AutoML solutions. This survey focuses on the methods used by each AutoML solution, e.g. model selection or whether meta learning is used.
Fig. 1. AutoML input and output.
The concept of Meta AutoML is novel; it was first published in our conference paper [20]. We are only aware of one article [27] which presents a similar concept called Ensemble Squared, but this preprint is not a peer-reviewed publication. Like OMA-ML, Ensemble Squared uses third-party AutoML solutions which are invoked in parallel. In this respect, both OMA-ML and Ensemble Squared are Meta AutoML approaches. What distinguishes our approach is the use of the ML ontology to guide various components of OMA-ML. We see considerable benefits in this approach regarding extensibility.
In this section we briefly introduce AutoML by describing its input/output behaviour and illustrating it by means of an example.
Input and output
Figure 1 shows the input and output of AutoML as a BPMN diagram (Business Process Model and Notation) [28]. AutoML requires the following inputs from the domain expert or the data scientist:
Dataset: the dataset for the ML task, e.g. a CSV file for classification on tabular data;
AutoML configuration:
  ML task: e.g. classification or regression on tabular data, images, videos, or textual data;
  ML target: e.g. label column in classification or regression tasks;
  Optional configuration parameters: e.g. maximum run-time, model performance or hardware restrictions.
AutoML produces the following outputs for the domain expert or the data scientist:
ML pipeline: The ML pipeline generated by AutoML is a piece of source code which can be executed to perform the ML task specified. An ML pipeline implements data preparation (e.g. feature selection, encoding or missing values imputation), the selected ML approach and its hyperparameter configuration [7].
Report: A textual or graphical explanation of the AutoML result, including a listing of ML configurations and their respective performance measures.
AutoML solves the Combined Algorithm Selection and Hyperparameter optimization (CASH) problem [29]. Algorithms that solve the CASH problem search for the best ML approach and hyperparameter setting for a given ML task [7]. Different AutoML solutions use different algorithms to solve the CASH problem.
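To illustrate what solving the CASH problem involves, the following minimal sketch performs a plain random search over the joint space of ML approach and hyperparameters. It is not taken from any of the surveyed AutoML solutions; the toy dataset, the two classifiers and all function names are assumptions made for illustration only.

```python
import random

# Toy 1-D binary classification data (assumption): the label is 1 when x > 0.6
def make_data(n, rng):
    return [(x, int(x > 0.6)) for x in (rng.random() for _ in range(n))]

# Two hypothetical "ML approaches", each with its own hyperparameter
def threshold_classifier(train, threshold):
    # Ignores the training data; the threshold is the only hyperparameter
    return lambda x: int(x > threshold)

def knn_classifier(train, k):
    def predict(x):
        nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
        return int(sum(label for _, label in nearest) * 2 > k)  # majority vote
    return predict

# Joint search space: approach name -> (model builder, hyperparameter sampler)
SEARCH_SPACE = {
    "threshold": (threshold_classifier, lambda rng: {"threshold": rng.random()}),
    "knn": (knn_classifier, lambda rng: {"k": rng.choice([1, 3, 5, 7])}),
}

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

def random_search_cash(train, valid, n_trials=50, seed=0):
    """Sample (approach, hyperparameters) pairs and keep the best on validation data."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        name = rng.choice(sorted(SEARCH_SPACE))
        build, sample = SEARCH_SPACE[name]
        params = sample(rng)
        score = accuracy(build(train, **params), valid)
        if best is None or score > best[0]:
            best = (score, name, params)
    return best

rng = random.Random(42)
train, valid = make_data(200, rng), make_data(100, rng)
best_score, best_name, best_params = random_search_cash(train, valid)
print(best_name, best_params, round(best_score, 3))
```

The point of the sketch is only that the approach and its hyperparameters are chosen jointly against a validation score; practical solutions typically replace the random sampling with more sample-efficient search strategies.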
One of the more popular AutoML solutions by citations is Auto-sklearn [30]. It offers pipeline generation for classification and regression of tabular data. Listing 1 displays a simple Auto-sklearn implementation. In Auto-sklearn, the ML task is specified by the Python class used, e.g., AutoSklearnClassifier for classification of tabular datasets. The AutoML process is triggered by executing the fit function.
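Listing 1: Auto-sklearn example
from autosklearn.classification import AutoSklearnClassifier

cls = AutoSklearnClassifier()      # default configuration
cls.fit(X_train, y_train)          # triggers the AutoML process
predictions = cls.predict(X_test)  # predict with the resulting pipeline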
Without custom parameterization, Auto-sklearn will use its default configuration. Advanced users may customize the Auto-sklearn process [34] with a multitude of parameters, e.g.:
Hardware usage, e.g. memory_limit;
Pipeline size or generation constraints, e.g. ensemble_size;
Pipeline scoring/metrics, e.g. metric;
Pre-processing constraints, e.g. exclude_preprocessors;
Runtime constraints, e.g. time_left_for_this_task;
Meta configuration (logging, save folder location, etc.), e.g. output_folder.
The Auto-sklearn result is a pipeline that can be used to make predictions using the predict function (see Listing 1). The sprint_statistics function displays statistics about the found ML pipelines [34], e.g. metric used, best validation score, and number of target algorithm runs.
In this section we compare all AutoML solutions we are aware of, in total 20 (16 open source and 4 commercial).
Methodology
We evaluate the AutoML solutions using the following criteria:
Type: This includes information about
  (a) the licensing model:
    Open source (OS): The code of the AutoML solution is publicly available under an open source licence.
    Commercial (C): The AutoML solution is available as a commercial solution only.
  (b) the way of accessing the AutoML solution:
    Software library (SL): The AutoML solution is a software library implemented in a programming language like Python.
    Local application (LA): The AutoML solution is a desktop application that can be executed on a computer.
    Web service (WS): The AutoML solution is hosted as a web service and can be accessed via a web browser.
Target user group: The intended user group of the AutoML solution:
  Domain expert: An expert in an application domain (e.g., biology) who may not have programming expertise.
  Data/computer scientist: A person with programming/data science expertise.
ML tasks: An overview of supported dataset types and their ML tasks which the AutoML solution can process and generate models on (e.g. classification or regression for tabular data).
Model result: This includes information about
  (a) the model type returned by the AutoML solution:
    Single model (SM): Only one ML model is returned (e.g. one neural network).
    Multiple models (MM): Several ML models are returned (e.g. a neural network and a decision tree).
    Model ensemble (ME): An ensemble pipeline is returned combining several ML models.
  (b) the export type generated by the AutoML solution:
    Model instance (MI): The model is available as a runtime instance only; the AutoML solution provides no built-in export functionality.
    Model as file (MF): The model is automatically exported or the AutoML solution provides an export functionality to save the ML model as a file.
    Files (script and model) (F (S+M)): The AutoML solution generates an execution script, containing code to import the exported ML model and perform a prediction on a new dataset.
Reporting: A characterization of the reporting functionality:
  Basic: Only a minimum of information is shared with the user, e.g. only the metric of the model.
  Detailed: Various information about the produced model is shared with the user, e.g. pipeline structure, hyperparameter configuration, etc.
ML library: The libraries that are used by the AutoML solution, e.g. Keras, Tensorflow, etc.
Maturity: This includes information about
  (a) the release status of the AutoML solution:
    Released (R): The published version number is at least 1.0.
    Pre-release (PR): The published version number is below 1.0, e.g. 0.1.
    Unknown (UK): No release version number is available.
  (b) the development status:
    Actively developed (AD): On-going development with the last release being less than 6 months old, as can be observed in the release notes.
    Not actively developed (NAD): Sporadic or no on-going development with the most recent release being older than 6 months, as can be observed in the release notes.
    Unknown (UK): No activity or release information is available.
  Additionally, an AutoML solution is classified as could not be executed (NE) if we could not execute any AutoML benchmark due to crashes, errors, or other reasons.
Benchmark: Two benchmarks were performed with each AutoML solution:
  Tabular binary classification using the PhishingWebsites dataset¹. The goal of this dataset is to predict whether a website is malicious for a user. We selected this dataset as it was incorporated into the OpenML Benchmarking Suites [26]. The evaluation metric used is the F1 score.
  Tabular regression using the colleges dataset². The goal is to predict the percentage of students receiving the Pell grant at a university. We chose this dataset for our benchmark, as it is intended to benchmark AutoML solutions. The evaluation metric used is RMSE.
For each benchmark, three experiments were performed. The final benchmark score is the mean value of all three experiments within each task. 70% of the dataset was used to train an ML model and the remaining 30% was used to validate the model produced by the AutoML solution. Each experiment had a time limit of 10 minutes. For AutoML solutions that do not offer a time limit parameter (e.g. AutoKeras), a limitation of the search space or of retries was used (e.g. for AutoKeras: max epoch
AWS Sagemaker Autopilot: No time or training limit parameter can be entered using the Web GUI.
Google AutoML: The shortest time that can be entered is 1 hour.
Azure AutoML: No time or training limit parameter can be entered using the Web GUI.
Each experiment was performed on an AWS EC2 virtual machine with instance type m5dn.xlarge (4 cores and 16 GB RAM). Local applications were tested on a device with similar computation power (4 cores and 16 GB RAM). The web applications did not offer a similar hardware configuration; where a hardware configuration option was offered, the closest available option was selected:
AWS Sagemaker Autopilot: 2 EC2 VMs with instance type ml.m5.4xlarge.
Google AutoML: No hardware configuration can be selected.
Azure AutoML: 1 VM with instance type Standard_DS3_v2.
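The evaluation protocol described in this section can be sketched as follows. The sketch assumes a random 70/30 split with a different seed per experiment, which the text does not specify; the helper names and the majority-label stand-in for an AutoML solution are hypothetical.

```python
import math
import random
import statistics

def f1_score(y_true, y_pred, positive=1):
    """F1 score for the positive class (metric of the classification benchmark)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def rmse(y_true, y_pred):
    """Root mean squared error (metric of the regression benchmark)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def split_70_30(rows, seed):
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]

def benchmark(rows, run_automl, metric, seeds=(0, 1, 2)):
    """Run three experiments and return the mean score."""
    scores = []
    for seed in seeds:
        train, test = split_70_30(rows, seed)
        predict = run_automl(train)  # returns a prediction function
        y_true = [y for _, y in test]
        y_pred = [predict(x) for x, _ in test]
        scores.append(metric(y_true, y_pred))
    return statistics.mean(scores)

# Hypothetical stand-in for an AutoML solution: always predicts the majority label
def majority_baseline(train):
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

rows = [(i, int(i % 3 != 0)) for i in range(100)]  # toy data, not a benchmark dataset
mean_f1 = benchmark(rows, majority_baseline, f1_score)
print(round(mean_f1, 3))
```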
For this survey a total of 20 AutoML solutions have been examined:
ADANET [35]: A Python software library using Tensorflow. It adaptively searches for the best ensemble of neural networks.
AlphaD3M [36]: A Python software library using scikit-learn. Several model types are evaluated.
ATM [33]: A Python software library using scikit-learn. Different machine learning approaches are evaluated to return the best model.
AutoCVE [37]: A Python software library using scikit-learn and XGBoost. The final result is an ensemble model.
AutoGluon [38]: A Python software library using a wide range of different ML libraries. Several ML approaches are computed to finally deliver a wide range of models.
AutoKeras [12]: A Python software library using Keras. The best combination of hyperparameters and neural network architecture is selected.
AutoSklearn [10]: A Python software library using scikit-learn. The best ensemble is selected during the AutoML process.
Auto-Pytorch [19]: A Python software library using several ML libraries; the main libraries are Pytorch and scikit-learn, but Catboost and LightGBM are also supported. At the end of the training the best ensemble is selected.
Auto-WEKA [18]: A local application using WEKA. Several ML approaches are used to find the best model.
AWS Sagemaker Autopilot³: A cloud-based application by Amazon using scikit-learn to generate multiple models.
Azure AutoML⁴: A cloud-based application by Microsoft using Microsoft’s Azure MachineLearning library in Python. Several ML approaches are optimized to find the best model.
EvalML⁵: A Python software library using several ML libraries to train multiple models. The main library is scikit-learn.
FLAML [39]: A Python software library with scikit-learn as the main ML library.
Google AutoML⁶: A cloud-based application by Google, using Tensorflow to compute one ML model.
H2O AutoML [32]: A Python library based on the Java H2O Framework. It searches for the best model ensemble.
MLBOX⁷: A Python software library using Keras and scikit-learn to search for the best model.
MLJAR [40]: A Python software library using scikit-learn. It trains multiple models and an ensemble to find the best solution.
Rapidminer [17]: A desktop application using ML approaches from several major ML libraries (e.g. H2O and WEKA).
TPOT [41]: A Python software library based on scikit-learn and Torch. It returns the best model found.
TransmogrifAI⁸: A Scala library using Spark ML as its base ML library. The best model is selected.
The AutoML solutions vary considerably in target user, maturity and produced ML pipeline. An overview can be found in Fig. 2. For details see [42, 43, 44, 45, 46, 47].
We were unable to execute the local application Auto-WEKA for our benchmark. Any attempt at executing the AutoML process led to an error, using our benchmark datasets as well as the datasets made available by the developers.
Almost every open source AutoML solution requires programming knowledge of the user; only Auto-WEKA is targeted towards domain experts requiring no programming skills. While all evaluated commercial AutoML solutions target domain experts, AWS Sagemaker Autopilot and Azure AutoML offer a programmable interface for computer/data scientists to execute more detailed experiments.
Fig. 2. AutoML survey results.
The most commonly supported ML tasks are classification and regression on tabular data. ATM and AutoCVE are the only AutoML solutions that support classification on tabular data only. Some AutoML solutions (e.g. AlphaD3M, Google AutoML) offer a richer variety of supported tasks and/or input data types, but they are by no means the majority:
7 AutoML solutions support only classification and regression on tabular data.
8 AutoML solutions support up to 5 additional tasks on tabular or other data.
3 AutoML solutions support more than 5 additional tasks on tabular or other data.
Of the three AutoML solutions with the most options on different tasks, one is a commercial solution (Google AutoML) and two are open source libraries (AlphaD3M, AutoGluon).
The majority of the AutoML solutions train various ML approaches to either generate an ensemble or to find the best model during the AutoML process; a few AutoML solutions can only use one ML approach (e.g. neural networks with AutoKeras) or only the winning model is returned (e.g. FLAML).
In Section 3 we identified the output produced by the AutoML process as an ML pipeline and a report. The ML pipeline consists of a model and a script to execute a new prediction using the generated model. Of all surveyed AutoML solutions only 3 (Azure AutoML, AWS Sagemaker Autopilot and TPOT) produce an ML pipeline as previously defined. The majority of the solutions offer functionality for exporting the generated ML model; only 5 AutoML solutions do not have a default way to save the ML model.
Almost all AutoML solutions produce a detailed report after concluding the AutoML process, describing the parametrization of the model found; some even generate graphs with various information about the model and features (e.g. MLJAR).
Regarding maturity, the majority (14) of AutoML solutions are considered pre-release; most of those (8) are still being actively worked on by their development team/community. Of the remaining 6 AutoML solutions, 3 have reached release status; only Auto-WEKA is not being actively worked on, and it is the only AutoML solution that currently could not be executed. All web services are classified as unknown for both release status and development status, since their current version number and development status are not displayed. We assume that Google, Microsoft and AWS only publish new products after having reached maturity; therefore all web service AutoML solutions can be considered released and constantly being worked on until their services are discontinued.
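The maturity criteria from the methodology can be written as two small rules. This is a sketch only: it assumes plain version strings such as 0.15.0 and approximates 6 months as 183 days; the function names are hypothetical.

```python
from datetime import date, timedelta

SIX_MONTHS = timedelta(days=183)  # assumption: "6 months" approximated as 183 days

def release_status(version):
    """R: version >= 1.0, PR: version below 1.0, UK: no version number available."""
    if version is None:
        return "UK"
    major = int(version.split(".")[0])
    return "R" if major >= 1 else "PR"

def development_status(last_release, today):
    """AD: last release less than 6 months old, NAD: older, UK: no information."""
    if last_release is None:
        return "UK"
    return "AD" if today - last_release < SIX_MONTHS else "NAD"

# Example with made-up values: a 0.x library released about 7 weeks earlier
print(release_status("0.15.0"), development_status(date(2022, 1, 10), date(2022, 3, 1)))
# prints "PR AD"
```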
MLJAR generates the best model in both the classification and the regression benchmark experiments.
AutoML solutions are implemented on top of specific ML libraries. They produce pipelines using software from those ML libraries that can be exported and imported into those ML libraries. Deciding on an AutoML solution results in a technology lock-in for the corresponding ML library or libraries. Comparing the performance between different ML libraries is not possible.
ONNX [48] is an open format for artificial neural networks (ANN) to enable interoperability between ML libraries. However, not every ML library supports ONNX. Furthermore, ONNX does not support other ML model types besides ANN.
AutoML solutions target specific user groups. Most open source AutoML solutions target users with programming skills, e.g. in Python. Commercial AutoML solutions provide a GUI which also addresses users without programming skills, e.g., domain experts.
All existing AutoML solutions have their individual features. They all solve the CASH problem, support specific ML tasks, target specific user groups and generate ML pipelines for specific ML libraries.
Meta AutoML allows combining the strengths of individual AutoML solutions, while alleviating their limitations: supporting various ML tasks and user groups while being technology-independent.
In the next section, we introduce OMA-ML, our concept for Meta AutoML.
An ontology-based concept for Meta AutoML
Before describing the concept of OMA-ML in detail, we define the goals we aspire to with OMA-ML.
Goals for OMA-ML
By combining the features of individual AutoML solutions, we pursue the following goals for OMA-ML:
AutoML: OMA-ML shall perform AutoML, i.e. generate an executable ML pipeline and a report based on a configuration and a dataset.
User groups: OMA-ML shall target user groups with and without programming skills. It shall provide a GUI which allows intuitive configuration of AutoML and interactive reporting. Additionally, it shall provide an API to be used by application programmers.
Technology-independent: OMA-ML shall support any number of ML libraries.
ML tasks: A wide range of ML tasks shall be supported.
Meta AutoML is a novel concept of AutoML. Figure 3 shows the concept as a BPMN diagram. Similar to other AutoML solutions, the user enters the required input (dataset and AutoML configuration). The Meta AutoML solution then prepares various AutoML solutions to be executed in parallel. The results of the AutoML solutions are collected and the results of Meta AutoML (ML pipeline and report) are finalized.
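The workflow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the OMA-ML implementation: the three "AutoML solutions" are hypothetical stand-ins that return fixed, made-up scores, and the result format is invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for third-party AutoML solutions (scores are made up)
def automl_a(dataset, config):
    return {"solution": "automl_a", "pipeline": "pipeline_a.py", "score": 0.91}

def automl_b(dataset, config):
    return {"solution": "automl_b", "pipeline": "pipeline_b.py", "score": 0.94}

def automl_c(dataset, config):
    raise RuntimeError("solution crashed")

def meta_automl(dataset, config, solutions):
    """Run all solutions in parallel, then collect results into pipeline and report."""
    with ThreadPoolExecutor(max_workers=len(solutions)) as pool:
        futures = [pool.submit(s, dataset, config) for s in solutions]
    # Leaving the with-block waits until every future has finished
    results, failures = [], []
    for solution, future in zip(solutions, futures):
        error = future.exception()
        if error is None:
            results.append(future.result())
        else:
            failures.append((solution.__name__, str(error)))
    leaderboard = sorted(results, key=lambda r: r["score"], reverse=True)
    return {
        "best_pipeline": leaderboard[0]["pipeline"] if leaderboard else None,
        "report": {"leaderboard": leaderboard, "failures": failures},
    }

outcome = meta_automl(None, {"ml_task": "classification"}, [automl_a, automl_b, automl_c])
print(outcome["best_pipeline"])  # prints "pipeline_b.py"
```

A failing solution is recorded in the report instead of aborting the whole run, which matters when third-party tools are invoked side by side.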
Fig. 3. Meta AutoML workflow.
Fig. 4. Schema of the ML ontology.
An ontology is a formal, explicit specification of a shared conceptualization of a problem domain [49]. We are developing an ontology for the domain of ML [50]. One of the use cases of this ML ontology is to guide the Meta AutoML process. The ML ontology is modelled in RDF [51] using SKOS [52]. It currently consists of over 1800 RDF triples, specifying 104 ML approaches, 42 ML tasks, 55 metrics, 21 AutoML solutions, 16 ML libraries, their configuration items, interrelationships, and more. The ML ontology is open source and can be accessed from the GitHub repository for OMA-ML⁹.
Figure 4 shows the classes and relationships of the ML ontology. In the upper part of the diagram, classes representing general ML concepts are depicted. The class ML area represents the major ML areas, in particular supervised learning, unsupervised learning and reinforcement learning. The class ML task lists problems that can be solved using ML, e.g., classification or regression. Each task belongs to an ML area, e.g. classification belongs to supervised learning. With the class ML approach, algorithmic ML technologies are represented, e.g., neural networks, support vector machines, or decision trees; each ML approach is associated with one or several ML tasks. Finally, the class Metric formalizes prediction performance metrics used in ML, e.g., F1-score for classification tasks or RMSE for regression tasks. Metrics are used for ML tasks.
Fig. 5. OMA-ML software architecture and technology stack.
In the lower part of the diagram, implementations of ML concepts are depicted. The class ML library collects available ML libraries like Tensorflow or scikit-learn. The class AutoML solution contains instances like Autosklearn or Google AutoML. Each AutoML solution is used for one or more ML libraries and can perform one or more ML tasks. Finally the class Configuration item represents the knowledge about what configuration parameters are available for each AutoML solution and which ML approaches and metrics can be parameterized, e.g. the AutoML solution Autosklearn allows configuring the ML task classification.
For the classes of the ML ontology, a broader relationship can be used for representing hierarchies within the class, e.g., the ML task classification is a broader concept than the ML task binary classification (not explicitly depicted in Fig. 4).
The ML ontology is the information backbone of OMA-ML and is used in several components of OMA-ML, as shown in the next section.
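How such an ontology can guide configuration may be illustrated with a handful of triples held in plain Python. This is a sketch under assumptions: apart from skos:broader, the property names (ml:belongsTo, ml:canPerform, ml:usedFor) and the instance identifiers are hypothetical and do not reproduce the actual ML ontology, and only one level of the broader hierarchy is followed.

```python
# (subject, predicate, object) triples; names other than skos:broader are hypothetical
TRIPLES = {
    ("binary_classification", "skos:broader", "classification"),
    ("classification", "ml:belongsTo", "supervised_learning"),
    ("regression", "ml:belongsTo", "supervised_learning"),
    ("Autosklearn", "ml:canPerform", "classification"),
    ("Autosklearn", "ml:canPerform", "regression"),
    ("Autosklearn", "ml:usedFor", "scikit-learn"),
    ("AutoKeras", "ml:canPerform", "classification"),
    ("AutoKeras", "ml:usedFor", "Keras"),
    ("Auto-Pytorch", "ml:canPerform", "classification"),
    ("Auto-Pytorch", "ml:usedFor", "PyTorch"),
}

def subjects(predicate, obj):
    return {s for s, p, o in TRIPLES if p == predicate and o == obj}

def objects(subject, predicate):
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}

def solutions_for(task, library=None):
    """AutoML solutions that can perform a task, optionally limited to one ML library."""
    # Assumption: a solution registered for a broader task also covers the narrower one
    tasks = {task} | objects(task, "skos:broader")
    found = set()
    for t in tasks:
        found |= subjects("ml:canPerform", t)
    if library is not None:
        found &= subjects("ml:usedFor", library)
    return sorted(found)

print(solutions_for("binary_classification"))
print(solutions_for("binary_classification", library="PyTorch"))  # ['Auto-Pytorch']
```

Restricting the query to one ML library is the kind of filtering a configuration wizard needs in order to offer only plausible options.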
Figure 5 shows the software architecture of OMA-ML as a UML (Unified Modeling Language [53]) component diagram.
OMA-ML is designed as a 3-layer-architecture.
Presentation layer: This is the user interface of OMA-ML. A GUI allows interaction and visualization. An ontology-guided wizard supports configuring OMA-ML. Additionally, an API provides batch access to OMA-ML.
Logic layer: This implements the control logic of OMA-ML, designed as a blackboard architecture [54]. The OMA-ML controller invokes individual AutoML libraries via the adapter pattern [55], thus providing a plug-in architecture for multiple AutoML solutions.
Data layer: This layer provides access to the ML ontology (read access), the ML model store and AutoML logs (write access).
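The adapter pattern mentioned for the logic layer can be sketched as follows. All class and method names are hypothetical: the two "libraries" are stand-ins with deliberately incompatible APIs and made-up scores, and the registry is one possible way to obtain a plug-in style architecture.

```python
from abc import ABC, abstractmethod

class AutoMLAdapter(ABC):
    """Uniform interface the controller programs against (hypothetical)."""
    name = "abstract"

    @abstractmethod
    def run(self, dataset, config):
        """Execute the wrapped AutoML solution and return a normalized result."""

# Two stand-ins for third-party AutoML libraries with incompatible APIs
class FakeLibraryA:
    def search(self, rows, label_column):
        return {"acc": 0.90, "model": "model_a"}

class FakeLibraryB:
    def fit_best(self, data, target, minutes):
        return ("model_b", 0.88)

ADAPTERS = {}

def register(adapter_cls):
    """Plug-in registration: supporting a new solution means adding one adapter class."""
    ADAPTERS[adapter_cls.name] = adapter_cls
    return adapter_cls

@register
class FakeLibraryAAdapter(AutoMLAdapter):
    name = "fake_a"

    def run(self, dataset, config):
        raw = FakeLibraryA().search(dataset, config["target"])
        return {"solution": self.name, "model": raw["model"], "score": raw["acc"]}

@register
class FakeLibraryBAdapter(AutoMLAdapter):
    name = "fake_b"

    def run(self, dataset, config):
        model, score = FakeLibraryB().fit_best(
            dataset, config["target"], config["time_limit_min"]
        )
        return {"solution": self.name, "model": model, "score": score}

def run_all(dataset, config):
    """The controller only sees the common run() interface."""
    return [ADAPTERS[name]().run(dataset, config) for name in sorted(ADAPTERS)]

results = run_all(dataset=[], config={"target": "label", "time_limit_min": 10})
print([r["solution"] for r in results])  # prints "['fake_a', 'fake_b']"
```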
In the GUI, a wizard guides the user to enter mandatory and optional AutoML configuration parameters. The wizard is based on the ML ontology, providing plausible configuration options only. For example, if the user selects AutoML solutions that produce ANN pipelines only, the wizard will only display configuration options for ANN.
Mandatory configuration parameters are as follows:
Dataset: The dataset with labeled training data;
ML task: The task the user wants to perform on the dataset, e.g. classification on tabular data (options from the ML ontology);
ML target: The name of the label column in the dataset.
Optional configuration parameters are:
Dataset schema: Schema information on dataset columns including data types (e.g. int, float, string, date) and categories (e.g., numerical, categorical, textual) (options from the ML ontology);
Scoring: The prediction performance measure to be used as optimization target, e.g. accuracy (options from the ML ontology);
AutoML solutions: Usage restrictions on particular AutoML solutions or ML libraries, e.g. AutoSklearn (options from the ML ontology);
ML model constraints: Restrictions on ML approaches and custom configuration of ML approaches, e.g. ANN with maximum 10 hidden layers (options from the ML ontology);
AutoML runtime constraints: General Meta AutoML constraints (monetary, time, hardware restriction) to influence the execution time, e.g. runtime limit 1 hour (options from the ML ontology);
Training type: Training strategy for Meta AutoML, e.g. using a subset of the dataset only (options from the ML ontology).
Fig. 6. OMA-ML control logic.
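The mandatory and optional parameters listed above could be captured in a single configuration object, for example as follows. The class, its field names and the file name are assumptions for illustration and do not reflect the actual OMA-ML configuration format; the sets of allowed values stand in for options that would be read from the ML ontology.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional

@dataclass
class OmaMlConfig:
    # Mandatory parameters
    dataset: str
    ml_task: str
    ml_target: str
    # Optional parameters
    dataset_schema: dict = field(default_factory=dict)
    scoring: Optional[str] = None
    automl_solutions: list = field(default_factory=list)
    ml_model_constraints: dict = field(default_factory=dict)
    runtime_limit_minutes: Optional[int] = None
    training_type: Optional[str] = None

    def validate(self, allowed_tasks, allowed_metrics):
        """Reject values that are not among the allowed options."""
        if self.ml_task not in allowed_tasks:
            raise ValueError(f"unknown ML task: {self.ml_task}")
        if self.scoring is not None and self.scoring not in allowed_metrics:
            raise ValueError(f"unknown metric: {self.scoring}")
        if self.runtime_limit_minutes is not None and self.runtime_limit_minutes <= 0:
            raise ValueError("runtime limit must be positive")

config = OmaMlConfig(
    dataset="phishing.csv",  # hypothetical file name
    ml_task="classification",
    ml_target="label",
    scoring="accuracy",
    runtime_limit_minutes=60,
)
config.validate(
    allowed_tasks={"classification", "regression"},
    allowed_metrics={"accuracy", "f1", "rmse"},
)
config_json = json.dumps(asdict(config), indent=2)  # e.g. for use as a configuration file
print(config_json)
```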
After starting the OMA-ML process, the user interface is updated regularly with the current status of the AutoML processes which are executed in parallel. After termination of the OMA-ML process, the following output is provided:
ML pipeline: The user can download the successfully generated ML pipelines as Python scripts and files specifying the pipeline structure. The Python scripts provide the following functionality:
  Import the file specifying the pipeline structure;
  Make predictions for a new, unlabeled dataset;
  Save the prediction result.
Report:
  Description of the used AutoML solutions, their produced ML pipelines and respective performance evaluations;
  ML pipeline leaderboard with scores.
When using OMA-ML in batch mode, the configuration file, including a link to the dataset, can be passed to an API. The runtime state and output can be pulled from the API. As in the online mode, the output consists of ML pipelines and reports.
The OMA-ML control logic is designed using the blackboard pattern. Figure 6 shows an overview of the OMA-ML control logic as a BPMN diagram. When a new run of OMA-ML is triggered, the dataset is analyzed first, extracting the following metadata:
Number of rows and columns;
Data types of columns;
Missing values.
Those metadata are needed for deciding whether pre-processing of the dataset is necessary for individual AutoML solutions. The OMA-ML strategy selection is based on the ML ontology, taking into account the configuration and the dataset analysis result. It selects AutoML solutions which perform the ML tasks specified in the configuration. The dataset is pre-processed if needed. For example, if an AutoML solution requires numeric features only, but the dataset contains textual features, then the textual features are encoded. If the dataset is very large (e.g. 100 million rows) and a small runtime limit is specified (e.g. 1 hour), then approaches with fast training times are selected or the dataset is downsized.
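The dataset analysis and the resulting pre-processing decision can be sketched with the standard library alone. The column names and sample values are invented, the type inference is deliberately crude (numeric versus text), and the function names are hypothetical.

```python
import csv
import io

def _is_number(value):
    try:
        float(value)
        return True
    except ValueError:
        return False

def analyze_dataset(csv_text):
    """Extract number of rows/columns, column data types and missing values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = reader.fieldnames or []
    meta = {
        "n_rows": len(rows),
        "n_columns": len(columns),
        "column_types": {},
        "missing_values": {},
    }
    for col in columns:
        values = [row[col] for row in rows]
        present = [v for v in values if v not in ("", None)]
        meta["missing_values"][col] = len(values) - len(present)
        is_numeric = all(_is_number(v) for v in present)
        meta["column_types"][col] = "numeric" if is_numeric else "text"
    return meta

def columns_to_encode(meta, target):
    """Textual feature columns that a numeric-only AutoML solution cannot consume."""
    return [c for c, t in meta["column_types"].items() if t == "text" and c != target]

SAMPLE = """url_length,has_ip,registrar,label
54,1,acme,1
,0,globex,0
31,0,,1
"""

meta = analyze_dataset(SAMPLE)
print(meta["n_rows"], meta["n_columns"], columns_to_encode(meta, target="label"))
# prints "3 4 ['registrar']"
```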
The selected AutoML solutions are invoked via their adapters in parallel by the OMA-ML controller. While executing AutoML, they continuously report their progress to the blackboard. The OMA-ML controller monitors the blackboard. After reaching the termination criteria (e.g. required accuracy is met, or run time limit is reached), the OMA-ML controller finalizes the OMA-ML run, saving the best performing executable ML pipelines to the ML pipeline store, generating a report, and storing it in the report store. Otherwise, the strategy may be altered, or alternative AutoML solutions may be triggered.
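The interplay of blackboard and controller can be illustrated with a simplified, sequential simulation. It is a sketch under assumptions: instead of adapters running in parallel, progress updates arrive as a pre-recorded list of (solution, score, elapsed seconds) tuples with made-up values, and the class and function names are hypothetical.

```python
class Blackboard:
    """Shared store to which AutoML adapters report their progress."""

    def __init__(self):
        self.entries = []

    def post(self, solution, score, elapsed_s):
        self.entries.append({"solution": solution, "score": score, "elapsed_s": elapsed_s})

    def best(self):
        return max(self.entries, key=lambda e: e["score"], default=None)

def run_controller(progress_updates, required_score, time_limit_s):
    """Monitor the blackboard and stop once a termination criterion is reached."""
    blackboard = Blackboard()
    for solution, score, elapsed_s in progress_updates:
        if elapsed_s > time_limit_s:
            return blackboard.best(), "run time limit reached"
        blackboard.post(solution, score, elapsed_s)
        if score >= required_score:
            return blackboard.best(), "required score met"
    return blackboard.best(), "all solutions finished"

# Simulated progress stream (made-up values)
updates = [("a", 0.80, 10), ("b", 0.86, 25), ("a", 0.93, 40), ("b", 0.95, 70)]
best, reason = run_controller(updates, required_score=0.9, time_limit_s=60)
print(best["solution"], best["score"], reason)  # prints "a 0.93 required score met"
```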
Fig. 7. OMA-ML component workflow.
All OMA-ML runs are logged in a structured format, including the following data:
AutoML configuration;
Dataset analysis result;
OMA-ML strategy;
Hardware configuration (kernels, memory, processor, etc.);
AutoML actual run time (time spent);
Generated ML pipelines characteristics (accuracy, size, etc.).
With many OMA-ML runs, we expect the log data to be a valuable source of information. Data mining techniques may be used to gain insights to improve the OMA-ML controller’s strategy selection. Using this log data additionally for supervised ML in the OMA-ML controller is subject to future work.
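As an illustration of how such structured logs might be mined, the sketch below writes one JSON record per run and then determines, per ML task, the AutoML solution with the best mean score. The record layout, key names and all values are assumptions and do not reflect the actual OMA-ML log format.

```python
import io
import json
from collections import defaultdict

def append_run(log_file, record):
    """Append one run as a single JSON line."""
    log_file.write(json.dumps(record, sort_keys=True) + "\n")

def best_solution_per_task(log_file):
    """For each ML task, return the solution with the highest mean pipeline score."""
    scores = defaultdict(lambda: defaultdict(list))
    for line in log_file:
        record = json.loads(line)
        task = record["configuration"]["ml_task"]
        for pipeline in record["pipelines"]:
            scores[task][pipeline["solution"]].append(pipeline["score"])
    return {
        task: max(by_solution, key=lambda s: sum(by_solution[s]) / len(by_solution[s]))
        for task, by_solution in scores.items()
    }

def make_record(score_a, score_b):
    # Made-up example record covering the logged data items
    return {
        "configuration": {"ml_task": "classification"},
        "dataset_analysis": {"n_rows": 1000, "n_columns": 30},
        "strategy": ["solution_a", "solution_b"],
        "hardware": {"cores": 4, "memory_gb": 16},
        "runtime_s": 600,
        "pipelines": [
            {"solution": "solution_a", "score": score_a},
            {"solution": "solution_b", "score": score_b},
        ],
    }

log = io.StringIO()  # stands in for a log file
append_run(log, make_record(0.91, 0.94))
append_run(log, make_record(0.93, 0.95))
log.seek(0)
summary = best_solution_per_task(log)
print(summary)  # prints "{'classification': 'solution_b'}"
```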
OMA-ML is developed as an open source project and can be accessed as a GitHub repository¹⁰. OMA-ML is under active development. At the time of writing, a minimum viable product is available with an initial set of AutoML solutions integrated, providing classification and regression tasks for tabular datasets. Additional dataset types and ML tasks are constantly being provided by integrating more AutoML solutions. An overview of the technology used for the implementation of OMA-ML can be seen in Fig. 5.
Component interaction
Figure 7 shows the component interaction between a user and OMA-ML as a UML sequence diagram. When a user interacts with OMA-ML, a dataset must be uploaded first. During the upload, the dataset is sent to the Logic Layer where it is persisted on the OMA-ML server. After having uploaded a dataset, the user can start configuring a new OMA-ML run by selecting a dataset and performing a (minimal) configuration: ML task to perform (e.g. binary classification), target, and time limit. The options available to the user are determined dynamically by querying the ML ontology so that only sensible configurations are offered; e.g., if the user wants to use PyTorch-based AutoML solutions only, AutoKeras will not be displayed as an option.
After finalizing the configuration, the user can start OMA-ML. The controller will perform automatic preprocessing if required, retrieving the preprocessing workflow from the ontology. For example, if an AutoML solution requires numeric features only but the dataset contains textual features, then the controller will adjust the dataset for this AutoML solution. When the dataset is prepared, all AutoML adapters will start their AutoML processes. They will continuously stream process updates to the controller until the AutoML process terminates. These updates are in turn forwarded to the Presentation Layer.
When all AutoML adapters have terminated their execution, a resulting ZIP file containing the ML model and a Python script for local execution will be sent to the controller. The user will be notified that the file is available for download, and can in turn send a download request, which will download the file to their local computer.
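To make the preprocessing example concrete, the sketch below converts textual features into integer codes. Which transformation OMA-ML actually applies is determined by the preprocessing workflow in the ontology and is not specified here; ordinal encoding, the function name `encode_text_columns`, and the sample rows are assumptions chosen for illustration.

```python
def encode_text_columns(rows):
    """Replace string values with integer codes, column by column."""
    codebooks = {}  # column name -> {category: integer code}
    encoded = []
    for row in rows:
        new_row = {}
        for column, value in row.items():
            if isinstance(value, str):
                codes = codebooks.setdefault(column, {})
                # Unseen categories receive the next free integer code.
                new_row[column] = codes.setdefault(value, len(codes))
            else:
                # Numeric values are passed through unchanged.
                new_row[column] = value
        encoded.append(new_row)
    return encoded, codebooks


rows = [
    {"age": 31, "city": "Berlin"},
    {"age": 45, "city": "Frankfurt"},
    {"age": 28, "city": "Berlin"},
]
encoded, codebooks = encode_text_columns(rows)
print(encoded)
```

Returning the codebooks alongside the encoded rows allows the same mapping to be reused when predictions are made on new data.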
Presentation layer
The GUI is implemented as a web application in C# using the Blazor framework (footnote 11), and provides the user with the following pages:
Configuration: Providing an ontology-based wizard to configure a new execution of OMA-ML.
Reporting: Presenting the leaderboard of the OMA-ML run, including the ML models generated, their metric scores and runtimes (see Fig. 8). Additionally, the user can download a ZIP file with the selected ML pipeline to perform predictions on new datasets.
OMA-ML leaderboard.
The GUI is modeled on commercial AutoML solutions like RapidMiner AutoModel, which also provide wizard-based configuration and a leaderboard for solutions.
A simple AutoML controller and seven AutoML adapters have been implemented so far. All components have been realized as Python solutions, containerized using Docker. The controller is based on the OMA-ML control logic (Fig. 6) and offers a gRPC (footnote 12) interface to allow communication between the GUI and the controller. Functionality provided by the gRPC interface can be grouped into the following categories: dataset manipulation (upload, preprocessing, configuration), ontology queries, and AutoML session (start, information, results).
Another gRPC interface is implemented in each AutoML adapter. The controller uses those adapter interfaces to start new AutoML sessions and receive updates from those sessions. The updates sent to the controller during an ongoing AutoML run comprise the console output produced by the underlying AutoML libraries. After an AutoML library successfully concludes its search for the best performing model, the adapter uses the templating language Jinja2 (footnote 13) to generate the Python script as defined in Section 5.5.
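The idea of template-based script generation can be sketched as follows. The adapters use Jinja2; to keep this example free of third-party dependencies, the standard library's `string.Template` stands in for it. The template text, the placeholder names, the use of `pickle` and `pandas` in the generated script, and the function `render_prediction_script` are all assumptions for illustration and do not reproduce the script defined in Section 5.5.

```python
from string import Template

# Hypothetical template for a generated prediction script.
SCRIPT_TEMPLATE = Template('''\
import pickle
import sys

import pandas as pd

# Load the model exported by the $library adapter.
with open("$model_file", "rb") as handle:
    model = pickle.load(handle)

# Read the dataset given on the command line and drop the target column if present.
features = pd.read_csv(sys.argv[1]).drop(columns=["$target"], errors="ignore")
print(model.predict(features))
''')


def render_prediction_script(library, model_file, target):
    """Fill the placeholders with the values of one finished AutoML session."""
    return SCRIPT_TEMPLATE.substitute(
        library=library, model_file=model_file, target=target
    )


script = render_prediction_script("example_automl", "model.pkl", "label")
# Check that the generated text is syntactically valid Python without running it.
compile(script, "predict.py", "exec")
print(script)
```

Keeping the script in a template separates the adapter logic from the text of the generated file, so that changes to the script do not require changes to the adapter code.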
AutoML libraries
At the time of writing, the OMA-ML system supports seven AutoML libraries:
AutoPytorch; AutoSklearn; AutoKeras; FLAML; AutoGluon; AutoCVE; MLJAR.
Currently, only classification and regression on tabular data are supported.
The ML ontology (Section 5.3) is loaded into the AutoML controller using the Python library RDFlib (footnote 14). SPARQL is used for querying the ML ontology.
Evaluation
In this section, we evaluate the concept and implementation of OMA-ML against the goals specified in Section 5.1.
AutoML: OMA-ML performs AutoML. A user can upload datasets; in the current stage of implementation, this is restricted to tabular data. The user may configure the AutoML execution. An ontology-based wizard guides the configuration setup. After executing OMA-ML, a resulting report in the form of a leaderboard is presented. The user may download the resulting ML pipeline of choice in the form of a ZIP file. When applying the phishing dataset used for classification benchmarks in our survey, OMA-ML achieves an F1-score of 0.949 (see an OMA-ML screenshot of the leaderboard in Fig. 8).
User groups: Users with or without programming skills can interact with OMA-ML. A GUI provides access to OMA-ML functionality for users without programming skills. Users with programming skills can access the AutoML controller programmatically via the gRPC API.
Technology-independent: Any AutoML solution can be integrated in OMA-ML. For every additional AutoML solution, a new AutoML adapter must be implemented and the ML ontology must be updated with metadata about the configuration options of the AutoML solution. In the current state of the OMA-ML system, seven different AutoML solutions are integrated. Previous implementation experience indicates that adding a new AutoML solution requires an implementation effort of about 1-4 person weeks, depending on the complexity of the AutoML solution. About 1-2 person days are required to analyze the AutoML solution and extend the ontology accordingly. The integration of additional AutoML solutions is subject to future work.
ML task: Any ML task that is provided by an existing AutoML solution may be offered by OMA-ML by integrating this AutoML solution. In the current state of the OMA-ML system, only the tasks classification and regression on tabular data are supported. Several AutoML solutions (e.g. AutoKeras, Autosklearn, AlphaD3M, etc.) offer a wider range of ML tasks for other dataset types (e.g. texts, images, video, audio, and graphs). The extension of OMA-ML to support those tasks is subject to future work.
OMA-ML offers users the possibility to compute ML pipelines for any ML task supported. However, this may come at a high cost of computing power, and thus energy consumption. This is a potential problem of all AutoML solutions, and OMA-ML multiplies it by executing various AutoML solutions in parallel. In recent publications about sustainability in AI systems, the term red AI is used in contrast to green AI [56]. If certain AutoML solutions can individually be rated red AI, executing them in parallel in OMA-ML could then be rated “deep red AI”. The authors of [57] use the field of Natural Language Processing (NLP) to illustrate the enormous energy consumption required by modern approaches to achieve ever better results. They suggest that instead of focusing on prediction performance only, energy consumption should be considered as well, in order to target more energy-efficient algorithms.
In the OMA-ML concept, we envisage a process step which may deal with this issue: the strategy selection. Using knowledge from the ML ontology, it may be possible to largely reduce the computation power required for OMA-ML execution. Firstly, only the most promising AutoML solutions for a given task may be suggested. Secondly, only the most efficient configurations for those AutoML solutions may be provided. Thirdly, the execution of AutoML solutions which are performing badly compared to others may be terminated early. The more intelligent the strategy selection, the more energy-efficient the solution may be. Applying artificial intelligence for the strategy selection is subject to future work.
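The third measure, early termination of badly performing AutoML solutions, could follow a simple rule such as the one sketched below. This is one possible heuristic, not the OMA-ML strategy selection: the function `select_for_early_stop`, the grace period, the margin, and the scores are hypothetical.

```python
def select_for_early_stop(best_scores, elapsed_s, grace_period_s=120, margin=0.05):
    """Return the solutions whose best score trails the leader by more than margin.

    Nothing is stopped during the grace period, so that slow starters get a chance.
    """
    if elapsed_s < grace_period_s or not best_scores:
        return []
    leader = max(best_scores.values())
    return sorted(
        name for name, score in best_scores.items() if leader - score > margin
    )


# Best scores reported so far by three running solutions (illustrative values).
scores = {"solution_a": 0.94, "solution_b": 0.80, "solution_c": 0.91}
print(select_for_early_stop(scores, elapsed_s=60))   # still within the grace period
print(select_for_early_stop(scores, elapsed_s=300))  # solution_b trails the leader
```

The grace period and margin trade energy savings against the risk of stopping a solution that would have improved later; suitable values would have to be determined empirically.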
Conclusion and future work
AutoML continues to be an active field of research, with many open source as well as commercial solutions being actively developed and released. The contribution of this article is three-fold. Firstly, we presented a survey of 20 existing AutoML solutions, evaluating their functionality, targeted user groups, maturity and performance. The open source AutoML solutions almost exclusively target data/computer scientists. The commercial AutoML solutions, in contrast, are marketed towards domain experts and offer them an easy-to-use GUI. Some of them additionally offer an optional API for data/computer scientists. While the commercial AutoML solutions can be considered released and mature software, most open source AutoML solutions are in a pre-release state and continue to be actively enhanced. Most AutoML solutions support one ML library only. Therefore, choosing an AutoML solution results in a technology lock-in regarding the ML library.
The second contribution of this article addresses this issue. We presented OMA-ML (
Thirdly, we presented an implementation of OMA-ML. A minimum viable product is available, consisting of a GUI with a configuration wizard and a simple controller component implementation which currently integrates 7 third-party AutoML solutions. An ontology is the information backbone of the implementation, guiding the configuration wizard. The OMA-ML system, including the ontology, is available as open source on GitHub.
The implementation of the OMA-ML system is work in progress. We plan the following next development steps:
Integration of more AutoML solutions; Support for additional ML tasks for additional dataset types, e.g., text, image, audio, video and graph; Implementation of an ontology-guided control logic including pre-processing and strategy selection using a blackboard approach; Improvements in user experience; Persistence of OMA-ML execution data.
Future research work includes the use of supervised ML on OMA-ML log data to improve the strategy selection in the OMA-ML controller. Furthermore, we plan to use learning curve analysis and transfer learning in the OMA-ML controller. We also plan to thoroughly analyze the OMA-ML system regarding user experience and the quality of the generated ML pipelines.
Footnotes
OpenML phishing dataset:
OpenML colleges dataset:
AWS Sagemaker Autopilot website:
Azure AutoML product page:
EvalML Github:
Google AutoML website:
MLBOX GitHub:
TransmogrifAI Github:
ML ontology Github:
OMA-ML GitHub repository:
Blazor website:
gRPC website:
Jinja documentation:
RDFlib website:
Acknowledgments
This work is funded by the German Federal Ministry of Education and Research (BMBF) in the program Zukunft der Wertschöpfung (funding code 02L19C157), and supported by Projektträger Karlsruhe (PTKA). The responsibility for the content of this publication lies with the authors.
We thank our graduate students Andre Brücke, Lukas Jansen, Luciano Jung, Daniel Kraft, Shamil Nabiyev, Sven Nawrat, Thanh Loan Nguyen, Tim Pachmann, Patrick Reckeweg, Gerrit Derk Scheppat, and Andre Wohnsland for contributing to the survey and Alex Becker, Dong Hung Pham, David Reyer, Fabio Burillo Ruiz, Lars Stockum, and Jonas Weßner for additionally contributing to the implementation of OMA-ML.
