Abstract
In a GraphQL Web API, a so-called GraphQL schema defines the types of data objects that can be queried, and so-called resolver functions are responsible for fetching the relevant data from underlying data sources. Thus, we can expect to use GraphQL not only for data access but also for data integration, if the GraphQL schema reflects the semantics of data from multiple data sources, and the resolver functions can obtain data from these data sources and structure the data according to the schema. However, there does not exist a semantics-aware approach to employ GraphQL for data integration. Furthermore, there are no formal methods for defining a GraphQL API based on an ontology. In this work, we introduce a framework for using GraphQL in which a global domain ontology informs the generation of a GraphQL server that answers requests by querying heterogeneous data sources. The core of this framework consists of an algorithm to generate a GraphQL schema based on an ontology and a generic resolver function based on semantic mappings. We provide a prototype, OBG-gen, of this framework, and we evaluate our approach over a real-world data integration scenario in the materials design domain and two synthetic benchmark scenarios (Linköping GraphQL Benchmark and GTFS-Madrid-Bench). The experimental results of our evaluation indicate that: (i) our approach can feasibly generate GraphQL servers for data access and integration over heterogeneous data sources, thus avoiding the manual construction of GraphQL servers, and (ii) our data access and integration approach is general and applicable to different domains where data is shared or queried in different ways.
Keywords
Introduction
GraphQL is a conceptual framework to build APIs for Web and mobile applications [29]. It was publicly released by Facebook in 2015 and, since then, the GraphQL ecosystem1
Examples of a GraphQL schema, a resolver function, a GraphQL query and a query response will be described in Sections 2 and 3.
GraphQL could be used to integrate data from different data sources by building a GraphQL server over these sources, in which the GraphQL schema provides a view over data from multiple sources, and the resolver functions contain implementations for accessing multiple sources. However, a semantics-aware approach to employing GraphQL for data integration does not exist. The approaches in [11] and [4] show how to use GraphQL for data federation, but the semantics of the data are not explicit in a machine-processable form, which means that developers need to write program code (i.e., resolver functions) to populate the various elements of a GraphQL schema. Furthermore, there are no formal methods for defining a GraphQL schema, so developers have to define it manually. The aim of this work is to provide a semantics-aware approach to employ GraphQL for data integration, with formal methods to generate the GraphQL server.
All the material related to the prototype implementation (OBG-gen) is available online at
The remainder of the article is organized as follows. We provide the relevant background regarding ontologies, description logics, data integration and GraphQL in Section 2. Then we outline the proposed GraphQL-based framework in Section 3 and elaborate on the implementation of this framework in Sections 4 and 5. Section 6 introduces related work. Section 7 presents an evaluation based on a real-world data integration scenario in the materials design domain and evaluations based on two synthetic benchmark scenarios, the Linköping GraphQL Benchmark (LinGBM) and GTFS-Madrid-Bench. Section 8 discusses the strengths and limitations of our approach and outlines directions for future work. Finally, in Section 9, we present concluding remarks.
This section provides background information on ontologies, description logics, data integration and GraphQL.
Ontologies and description logics
The term ontology originates in philosophy, in which it is the science of what is, of the kinds and structures of objects, properties, and relationships in every area of reality [57,64]. Ontologies can be viewed, intuitively, as defining the terms, relations, and rules that combine these terms and relations in a domain of interest [59]. Through ontologies, people and organizations are able to communicate by establishing a common terminology. They provide the basis for interoperability between systems and are applicable as an index to a repository of information as well as a query model and a navigation model for data sources. Moreover, they are often used as a foundation for integrating data sources, thereby alleviating the heterogeneity issue. The benefits of using ontologies are their improved reusability, share-ability and portability across platforms, as well as their increased maintainability and reliability. On the whole, ontologies allow a field to be better understood and allow information in that field to be handled much more effectively and efficiently (e.g., knowledge representation for bioinformatics discussed in [58]).

Example of an ontology of the university domain including 4 concepts and relationships among them, as well as relationships between concepts and datatypes.
From a knowledge representation point of view, ontologies usually contain four components: (i) concepts that represent sets or classes of entities in a domain, (ii) instances that represent the actual entities, (iii) relations, and (iv) axioms that represent facts that are always true in the topic area of the ontology. Relations represent relationships among concepts. Axioms represent domain restrictions, cardinality restrictions, or disjointness restrictions. Ontologies can be classified according to which of these components they contain and the information related to those components. Figure 1 shows an example ontology for the university domain. The open-headed arrows represent axioms that encode is-a relationships; that is, if A is a B, then all entities belonging to concept A also belong to concept B. We say that A is a sub-concept of B. In this example
To formally define the above concepts and relationships, we need representation languages. Description logics (DL) are a family of knowledge representation languages. There are three basic building blocks of such a language, namely: (i) atomic concepts (unary predicates) such as
Data integration deals with combining data that resides at multiple different sources [14,26,43]. Ideally, a data integration system should enable unified access to a number of data sources [26,43]. Formally, according to [43], a data integration system can be formalized as a triple
Ontology-based data integration (OBDI) is a form of data integration in which an ontology plays the role of a global schema that captures domain knowledge [15]. Usually, in an information system with only one single data source, the formal treatment of OBDI is identical to that of ontology-based data access (OBDA) [15,67]. In this article, we generally refer to both OBDI and OBDA as OBDA. OBDA, as a semantic technology, aims to facilitate access to different underlying data sources [66]. Traditionally, these underlying data sources are considered to be relational databases. Ontologies play the role of global views over multiple data sources. There are different ways to implement an OBDA system. Generally, these systems can be categorized into two types, namely, data warehouse-based approaches and virtual approaches. These two categories of methods both make use of semantic mappings in order to overcome the differences between ontologies and local schemas, but in different ways [21,65]. In a data warehouse-based approach, data from multiple sources are usually loaded or stored in a centralized storage, which is the warehouse [26,63], based on semantic mappings. We refer to the data in such warehouses as materialized data. Depending on the aims or functionalities of a system, the materialized data could be stored in local databases or transformed into RDF graphs. Therefore, queries are evaluated against the materialized data. In a virtual approach, data is retained at the original sources and mediators are used to translate queries defined in terms of a global or mediated schema into queries defined in terms of each data source’s local schema, based on semantic mappings. Therefore, queries are evaluated and executed against each data source. SPARQL queries are widely supported by data integration systems that use ontologies as global schemas.
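The difference between the two categories can be illustrated with a toy sketch. The source records, the field names and the `MAPPINGS` table below are hypothetical stand-ins for real data sources and semantic mappings; the sketch is not meant to reflect how any particular OBDA system is implemented.

```python
"""Toy contrast between warehouse-based and virtual data integration.

All names (sources, fields, MAPPINGS) are hypothetical illustrations.
"""

# Two "sources" with different local schemas for the same kind of entity.
SOURCE_A = [{"uni_id": "u1", "uni_name": "Alpha University"}]
SOURCE_B = [{"id": "u2", "title": "Beta University"}]

# Semantic mappings: global term -> local field name, per source.
MAPPINGS = {
    "A": {"UniversityID": "uni_id", "name": "uni_name"},
    "B": {"UniversityID": "id", "name": "title"},
}
SOURCES = {"A": SOURCE_A, "B": SOURCE_B}


def to_global(record, mapping):
    """Rename local fields to the terms of the global schema."""
    return {term: record[local] for term, local in mapping.items()}


def materialize():
    """Warehouse style: copy all data into one store up front."""
    return [to_global(r, MAPPINGS[s]) for s, rows in SOURCES.items() for r in rows]


def query_warehouse(warehouse, term, value):
    """Evaluate the query against the materialized data."""
    return [r for r in warehouse if r[term] == value]


def query_virtual(term, value):
    """Virtual style: rewrite the query per source, then run it at the source."""
    results = []
    for s, rows in SOURCES.items():
        local_field = MAPPINGS[s][term]  # query rewriting step
        matches = [r for r in rows if r[local_field] == value]
        results.extend(to_global(r, MAPPINGS[s]) for r in matches)
    return results
```

Both query functions return records in the terms of the global schema; they differ only in whether the mappings are applied once, when the warehouse is built, or at query time.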
A number of semantic mapping definition languages have been proposed over the years. One such language is R2RML (RDB to RDF Mapping Language),5
GraphQL schemas and GraphQL resolver functions are basic building blocks in the implementations of GraphQL servers. The former describe how users can retrieve data using GraphQL APIs. The latter contain program code including how to access data sources and structure the obtained data according to the schema. We introduce GraphQL schemas and GraphQL resolver functions in Section 2.3.1 and Section 2.3.2, respectively.
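To make the division of labour between schema and resolver functions concrete, the following library-free sketch may help. The in-memory data, the field names and the tiny `execute` function are assumptions made for illustration only; a real GraphQL server would rely on a GraphQL library for parsing, validation and execution.

```python
"""Library-free sketch of the roles of resolver functions.

The data, the field names and the tiny executor are hypothetical; a real
GraphQL server would rely on a GraphQL library for parsing and execution.
"""

# Stand-ins for an underlying data source.
UNIVERSITIES = {"u1": {"UniversityID": "u1", "departments": ["d1", "d2"]}}
DEPARTMENTS = {"d1": {"DepartmentID": "d1"}, "d2": {"DepartmentID": "d2"}}


def resolve_university(_parent, args):
    """Root resolver: fetch one university by its identifier."""
    return UNIVERSITIES.get(args["UniversityID"])


def resolve_departments(parent, _args):
    """Field resolver: follow the relationship to department objects."""
    return [DEPARTMENTS[d] for d in parent["departments"]]


# (type name, field name) -> resolver; other fields default to a dict lookup.
RESOLVERS = {
    ("Query", "university"): resolve_university,
    ("University", "departments"): resolve_departments,
}
# Which type an object-valued field returns (normally read from the schema).
FIELD_TYPES = {
    ("Query", "university"): "University",
    ("University", "departments"): "Department",
}


def execute(type_name, parent, selection, args=None):
    """Populate each requested field, recursing into nested selections."""
    result = {}
    for field, sub_selection in selection.items():
        resolver = RESOLVERS.get((type_name, field))
        value = resolver(parent, args or {}) if resolver else parent[field]
        if sub_selection and value is not None:
            child_type = FIELD_TYPES[(type_name, field)]
            if isinstance(value, list):
                value = [execute(child_type, v, sub_selection) for v in value]
            else:
                value = execute(child_type, value, sub_selection)
        result[field] = value
    return result
```

A call such as `execute("Query", None, {"university": {"UniversityID": None, "departments": {"DepartmentID": None}}}, {"UniversityID": "u1"})` returns a nested dictionary whose shape mirrors the requested fields, which is the behaviour the schema promises and the resolver functions deliver.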
GraphQL schemas
In a GraphQL API, the GraphQL schema defines types, their fields, and the value types of the fields. Such a schema represents a form of vocabulary supported by a GraphQL API rather than specifying what the data instances of an underlying data source may look like and what constraints have to be guaranteed [33]. There are six different (named) type definitions in GraphQL, which are scalar type, object type, interface type, union type, enum type and input object type. Listing 1 depicts a GraphQL schema example.

Example of a GraphQL schema of the university domain
An object type represents a list of fields and each field has a value of a specific type such as object type or scalar type. A scalar is used to represent a value such as a string. In Listing 1, there are three basic object type definitions, which are
GraphQL allows fields to accept arguments to configure their behavior [29]. These arguments can be defined by input object types. An input object type defines an input object with a set of input fields; the input fields are either scalars, enums, or other input objects. This allows arguments to accept arbitrarily complex structs, which can capture notions of filtering conditions. For instance, according to the definitions of
Additionally, a GraphQL schema supports defining types that represent operations such as query and mutation. The schema presumes the
In a GraphQL API, apart from the GraphQL schema defining types, their fields, and the value types of the fields, resolver functions are responsible for populating the data for fields of types in the GraphQL schema. For instance, for the schema example shown in Listing 1, there are four fields defined in the

Example of a resolver function for the
This section introduces an overview of the GraphQL-based framework for data access and integration and two basic processes in this framework.
Overview of the framework
Figure 2 illustrates the framework for data access and integration based on GraphQL in which an ontology drives the generation of a GraphQL server that provides integrated access to data from heterogeneous data sources. These data sources may be based on different schemas and formats and may be accessed in different ways (e.g., as tabular data accessed via SQL queries or as JSON-formatted data accessed via API requests). To address the heterogeneity, the framework relies on an ontology that provides an integrated view of the data from the different sources, and corresponding semantic mappings that define how the data from the underlying data sources is interpreted or annotated by the ontology (arrows

GraphQL-based framework for data access and integration.
This process includes generating both a GraphQL schema for the API provided by the server (arrow
This process requires users or developers who are familiar with the query mechanisms of the underlying data sources and with domain ontologies that can be used for data access or integration. Consequently, they can define the scope of the ontology that will be used for generating the GraphQL schema for the server, as well as the semantic mappings that will be used for generating the generic resolver function. This type of automatic generation of GraphQL servers based on ontologies and semantic mappings can also benefit general GraphQL application developers, since it can eliminate the need to build GraphQL servers from scratch.
GraphQL query answering process
During this process the query is validated against the GraphQL schema (arrow

Example of a GraphQL query

Example of a GraphQL query response
A GraphQL query example and corresponding query result are shown in Listing 3 and Listing 4, respectively. The example query is: “Get the university including the head of each department where the UniversityID is ‘u1’”. The query takes as an input an argument defined as
As mentioned earlier, domain users are the intended users of GraphQL servers, regardless of whether they have prior knowledge of the Semantic Web or ontologies. In order to write GraphQL queries, they only need to have a basic understanding of GraphQL, which can easily be explored via the GraphQL API provided by the server.
As mentioned in Section 2.3.1, the GraphQL schema represents a form of vocabulary supported by the GraphQL API rather than specifying what the data instances of an underlying data source may look like and what constraints have to be guaranteed. Therefore, we focus on GraphQL language features supporting semantics-aware and integrated data access, namely how data can be queried, rather than reflecting the semantics of a complex knowledge representation language in the context of a GraphQL schema. Section 4.1 introduces how a GraphQL schema is formalized, and Section 4.2 introduces how an ontology is represented via a description logic TBox. Given an ontology represented in a description logic TBox, the concept and role names can be used to generate types and fields in a GraphQL schema. The general concept inclusions in a description logic TBox can be used to specify how to connect generated types and fields in a GraphQL schema. Then, in Section 4.3, we present the core algorithm (Schema Generator) for generating a GraphQL schema based on an ontology. In Section 4.4, we present the intended meaning of GraphQL schemas generated by the Schema Generator.
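The idea that concept names give rise to types and role names give rise to fields can be sketched as follows. The tuple encoding of axioms, the `SCALARS` table and the concept `Professor` are assumptions made for this sketch; it is deliberately simpler than, and does not reproduce, the Schema Generator presented in Section 4.3 (for example, it ignores is-a relationships and interface types).

```python
"""Toy generation of GraphQL type definitions from a simplified TBox.

The tuple encoding of axioms and the datatype table are assumptions made
for this sketch; it does not reproduce the Schema Generator of Section 4.3.
"""

# Hypothetical datatype -> GraphQL scalar table.
SCALARS = {"string": "String", "integer": "Int", "double": "Float",
           "boolean": "Boolean"}

# Simplified axioms: (concept, role, filler) reads "concept has role to filler".
AXIOMS = [
    ("University", "UniversityID", "string"),
    ("University", "departments", "Department"),
    ("Department", "DepartmentID", "string"),
    ("Department", "head", "Professor"),
    ("Professor", "name", "string"),
]


def generate_schema(axioms):
    """Return SDL text with one object type per concept name."""
    fields_by_type = {}
    for concept, role, filler in axioms:
        # Datatype fillers become scalars; concept fillers become list types.
        field_type = SCALARS.get(filler, f"[{filler}]")
        fields_by_type.setdefault(concept, []).append(f"  {role}: {field_type}")
        if filler not in SCALARS:
            # Make sure a type is emitted for the filler concept as well.
            fields_by_type.setdefault(filler, [])
    blocks = []
    for name, fields in fields_by_type.items():
        blocks.append("type " + name + " {\n" + "\n".join(fields) + "\n}")
    return "\n\n".join(blocks)
```

Calling `generate_schema(AXIOMS)` yields one `type` block per concept. A concept that never appears on the left-hand side of an axiom would end up with an empty field list, which is not valid GraphQL, so a fuller generator would need to handle that case.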
GraphQL schema formalization
According to [33,34], a GraphQL schema can be defined over five finite sets. These five sets are

The formalization of the GraphQL schema shown in Listing 1
Listing 5 illustrates a formalized representation of the GraphQL schema shown in Listing 1. In the formalization, we have sets
In this work we assume that the ontology is represented by a TBox in a description logic which is an extension of
The syntax and semantics for the description logic used in our approach
The syntax and semantics of the description logic used in our approach are shown in Table 1. The introduction of datatypes is based on the work presented in [35] and [36]. Let
A TBox over

TBox example representing the example ontology as shown in Fig. 1
Algorithm 1 shows the details to generate a GraphQL schema. The output for the example is the schema shown in Listing 1. First, the algorithm iterates over the concept names in
From line 13 to line 21, the algorithm deals with GCIs containing roles, which can be of the form
We define a function Φ for mapping a datatype that exists in the TBox to a scalar type in GraphQL. Due to the fact that current GraphQL supports five basic scalar types which are
By generating the GraphQL schema based on an ontology, the schema will contain object or interface types corresponding to concepts in the ontology, and field declarations corresponding to relationships in the ontology. When a GraphQL query is sent to the GraphQL server, a resolver function parses the query to determine which type in the schema is requested. It then parses the relevant definitions corresponding to such a type in the semantic mappings to retrieve data. For instance, if a query requests all the entities of the 
The intended meaning of a generated GraphQL schema
In Section 4.3, we present the Schema Generator which takes a TBox representing an ontology as an input, to generate a GraphQL schema. Such a GraphQL schema can describe how to access underlying data sources in which the data can be annotated by the ontology. The underlying data can thus be viewed as an ABox for the TBox. Therefore, a GraphQL query that conforms to this GraphQL schema can be considered as a query over the ABox. To make this intention more formal we consider an ABox

ABox example based on the example ontology as shown in Fig. 1
Let
If
If
For instance, given the query (as shown in Listing 3) and the above ABox,
In general, there are two styles for implementing resolver functions for a GraphQL server. One option is to implement one resolver function per type (object or interface) defined in the GraphQL schema, where such a function states how to fetch the data to populate relevant fields. For instance, since the
GraphQL queries represented by abstract syntax trees

Example abstract syntax trees for the query shown in Listing 3.
In general, a GraphQL query can be represented using a single AST that contains nodes representing the fields requested in the query, and also contains additional nodes for the input arguments that may be used for each of these fields. In our approach, we assume that each query accepts an input argument which captures the notion of a filter condition. Therefore, we specify the query evaluation in two steps: (i) evaluating for a filter condition, which is represented via an input argument that is defined as an input object type in the schema, (ii) evaluating for those fields that are requested in the GraphQL query. For instance, in the query example shown in Listing 3, the field having a filtering condition is different from the requested fields (the former is
In practice, a filter condition is converted into disjunctive normal form (DNF).
A statement is in DNF if it is a disjunction of conjunctions of literals. A disjunction uses the OR (∨) operator. A conjunction uses the AND (∧) operator.
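As an illustration of this conversion, the sketch below rewrites a nested AND/OR expression into a list of conjunctions. The tuple encoding of filter expressions is an assumption made for the example, and negation is left out to keep it short; it is not the conversion routine used in OBG-gen.

```python
"""Sketch: rewrite a nested AND/OR filter into disjunctive normal form.

The tuple encoding of filter expressions is an assumption for this example;
negation is left out to keep the sketch short.
"""
from itertools import product


def to_dnf(expr):
    """Return a list of conjunctions; each conjunction is a list of literals."""
    if isinstance(expr, str):  # a literal such as "UniversityID = 'u1'"
        return [[expr]]
    op, *operands = expr
    parts = [to_dnf(o) for o in operands]
    if op == "or":
        # A disjunction of DNFs is the concatenation of their conjunctions.
        return [conj for part in parts for conj in part]
    if op == "and":
        # Distribute AND over OR: pick one conjunction from every operand.
        return [sum(combo, []) for combo in product(*parts)]
    raise ValueError(f"unknown operator: {op}")
```

For instance, `to_dnf(("and", "a", ("or", "b", "c")))` returns `[["a", "b"], ["a", "c"]]`, that is, (a ∧ b) ∨ (a ∧ c). Each inner list can then be evaluated as an independent conjunctive condition.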

Example of RML mappings transforming university domain data, defined based on the example ontology as shown in Fig. 1
RML is a declarative mapping language for linking data to ontologies [51]. An RML document has one or more

Technical components in the generic resolver function.
We show the basic technical components of the generic resolver function including QueryParser and Evaluator in Fig. 4. In Algorithm 2, we show the generic resolver function. The inputs to the generic resolver function are a GraphQL schema, a GraphQL query and semantic mappings. The GraphQL query and schema are inputs of the QueryParser. The QueryParser parses a query including a filter expression given as an input argument, and outputs the corresponding ASTs (Fig. 3) for the input argument and the query structure, respectively (shown as arrows
As mentioned in Section 4.3, because the GraphQL schema is generated from an ontology, we can find, for each object or interface type and each field declaration, the corresponding concept and relationship in the ontology. Since such concepts and relationships are also used to define the semantic mappings, the generic resolver function can use these mappings, when it retrieves data for a requested type and its fields, to determine how to access the underlying data sources and how to structure the returned data according to the GraphQL schema. Taking the query in Listing 3 represented by the ASTs shown in Fig. 3 as an example, as the requested type is

Evaluator
We present the details of Evaluator in Algorithm 3 and show an example in Fig. 5 of how evaluators work for answering the query in Listing 3. An AST and a number of triples maps from the semantic mappings are essential inputs to the algorithm. For a given AST, we can obtain the object type and fields that are requested in the query based on the root node and child nodes, respectively (line 2). For instance, taking the ASTs in Fig. 3 as examples, the root type and the field for evaluating the filter expression are

Example for answering the query in Listing 3,
In the phase of evaluating a filter expression, local_filter, which represents the rewritten filter expression, is a necessary argument when sending requests to underlying data sources (line 14). While in the phase of evaluating query fields, filter_ids, being a NULL value or having at least one element, is a necessary argument (line 14, arrow
The widely used Semantic Web-based techniques and the recently developed GraphQL have led to a number of works relevant to our GraphQL-based framework for data access and data integration. We extend the summary of approaches presented in [17] by adding several new related approaches and new perspectives on the comparison. Table 2 summarizes these systems and our approach. These systems can be divided into two categories, namely OBDA-based systems and GraphQL-based systems. The former group contains Morph-RDB [49,52], Morph-CSV [19], Ontop [13,50], Squerall [47] and Ontario [28]. The latter group consists of GraphQL-LD [60], HyperGraphQL [55], UltraGraphQL [32,56], Morph-GraphQL [17], Ontology2GraphQL [30] and our OBG-gen. OBG-gen can also be categorized as an OBDA-based system in the first group. In addition to the two groups described above, there is another system that is related to our work. It is OBA [31], which is an ontology-based framework that facilitates the development of REST APIs for knowledge graphs.
Summary of related approaches
As a new perspective to the summary in [17], all the approaches (except for GraphQL-LD) have two processes: (i) the service setup (preparation) process and (ii) the query answering process. During the service setup process, all OBDA-based approaches need semantic mappings as input. In these systems, semantic mappings are used in a similar manner to represent differences between global and local schemas, namely mapping translations as highlighted in [21]. Some approaches take additional resources as inputs. Morph-CSV uses CSVW8
CSVW is used to annotate CSV files with JSON metadata (
Morph-GraphQL requires semantic mappings to generate a GraphQL server intended for data access. It does not consider data integration scenarios where integrated views are required.
For the query answering process, OBDA-based approaches (i.e., Morph-RDB, Morph-CSV, Ontop, Squerall and Ontario) accept SPARQL queries and translate them into source-specific queries. Morph-RDB handles underlying data stored in relational databases, while Morph-CSV deals with data stored in CSV files. Morph-RDB and Morph-CSV translate SPARQL queries into SQL queries. Ontario, Squerall and Ontop support heterogeneous data sources. These three systems can translate SPARQL queries into various queries according to the query languages of the underlying data sources or the queries accepted by data source wrappers. Our approach, OBG-gen, accepts relational data, tabular data and JSON-formatted data as the underlying data. Moreover, OBG-gen can integrate data in different formats from multiple sources, because the generic resolver function can structure the obtained data in the JSON format according to the GraphQL schema. The remaining approaches are based on underlying data in SPARQL endpoints and translate incoming requests into SPARQL queries (GraphQL queries for GraphQL-based approaches, API requests for OBA). GraphQL-LD, HyperGraphQL, and UltraGraphQL require context information expressed in JSON-LD. Such JSON-LD context information contains URIs of classes to which instances in the RDF data belong.
In addition, we study relevant OBDA/OBDI and GraphQL benchmarks to conduct our experiments and evaluation. These benchmarks are Berlin SPARQL Benchmark (BSBM) [12], Norwegian Petroleum Directorate Benchmark (NPD) [42], GTFS-Madrid-Bench [18], ForBackBench [2] and Linköping GraphQL Benchmark (LinGBM) [20]. The OBDA/OBDI-related benchmarks are built on different use cases (BSBM for the e-commerce use case, NPD for the oil industry, GTFS-Madrid-Bench for the transport domain, ForBackBench reusing data from other benchmarks) and focus on testing different abilities of OBDA/OBDI engines. In more detail, the BSBM benchmark aims to test and compare the performance of native RDF stores with engines implementing SPARQL-to-SQL query translation. The NPD benchmark can be used to analyze OBDA system implementations in terms of query rewriting, query unfolding and query execution. The GTFS-Madrid-Bench aims to test engines focusing on virtualized access to heterogeneous data. These three benchmarks all focus on testing engines that contain query rewriting mechanisms, which are usually implemented by OBDA/OBDI engines. The ForBackBench benchmark focuses on both data integration scenarios and data exchange scenarios. In the former, engines usually implement query rewriting mechanisms, while in the latter, engines usually implement forward-chaining algorithms (e.g., [16,48]) to populate a centralized data warehouse. Therefore, in contrast to the previous three benchmarks, ForBackBench focuses on comparing and analyzing systems across both mechanisms (i.e., query rewriting and forward-chaining algorithms). In terms of GraphQL-related benchmarks, the LinGBM benchmark is the first one that can be used to study the behavior of GraphQL server implementations at scale [20]. 
It provides a scalable dataset regarding the University domain and specifies key technical challenges (e.g., relationship traversal) of GraphQL server implementations. For the evaluation of our work (see next section), among these OBDA/OBDI and GraphQL-related benchmarks, we choose GTFS-Madrid-Bench and LinGBM, respectively. The reasons are: (i) for the real case evaluation in the materials design domain, LinGBM can guide us in characterizing the GraphQL queries so that we can better compare and analyze the abilities of GraphQL systems; (ii) by following the scenarios in LinGBM and GTFS-Madrid-Bench, we can test whether our approach works across different domains.
In this section, we present an evaluation of the framework shown in Section 3. We consider a real case application scenario in the materials design domain, and two synthetic benchmark scenarios based on the Linköping GraphQL Benchmark (LinGBM)11
Can the generated GraphQL server provide integrated access to heterogeneous data sources? For instance, in the real case application scenario, data from different sources may follow different models and may be shared or queried in different ways.
How does the generated GraphQL server compare to other OBDA-based systems and other GraphQL-based systems in terms of query performance and its behavior for increasing dataset sizes?
Is the proposed approach, ontology-based GraphQL server generation, a general approach that can work in different domains for data access and integration?
In the first evaluation scenario based on the real case application scenario in the materials design domain, we aim to answer
In the real case evaluation, we focus on a use case in the materials design domain where the task is data integration over two data sources, Materials Project [37] and OQMD (The Open Quantum Materials Database) [54].

Example of searching materials from materials project, OQMD and NOMAD.
Motivation The materials science domain, like many other domains, is at an early stage when it comes to introducing Semantic Web-based technologies into its data-driven workflows. A large number of research groups and communities have thus developed a variety of data-driven workflows, including data repositories [40,41] and data analytics tools. As data-driven techniques become more prevalent, more data is produced by computer programs and is available from various sources, which leads to challenges associated with reproducing, sharing, exchanging, and integrating data among these sources [1,38,39,53,61]. Figure 6 illustrates an example of searching for gallium nitride materials with the reduced chemical formula of
Data We collect data from the Materials Project and OQMD representing five different types of real-world entities (Calculation, Structure, Composition, Band Gap and Formation Energy). We define semantic mappings (for all the systems, see the next paragraph) based on MDO to interpret such data. We collect data in the sizes of 1K, 2K, 4K, 8K, 16K and 32K from each database for populating the five entities. The size 1K means 1000 entities of each entity type. We represent this data in different formats such as tabular data for relational databases and for CSV files, and JSON-formatted data for JSON files. Additionally, for HyperGraphQL and UltraGraphQL in our evaluation, we create an RDF file based on RML mappings and MDO for each dataset setting. We have six dataset settings for the experiments, which are 1K–1K, 2K–2K, 4K–4K, 8K–8K, 16K–16K and 32K–32K. Taking 2K–2K as an example, for each entity type, the test data contains data in the size of 2K from Materials Project and 2K from OQMD, respectively.

Outline of the real case evaluation.
Systems We compare our tool, OBG-gen in two versions (OBG-gen-rdb and OBG-gen-mix) with four systems: Morph-RDB [49], Ontop [50], HyperGraphQL [55], and UltraGraphQL [56]. OBG-gen-rdb represents the case where the generated GraphQL server handles data in relational databases, and OBG-gen-mix represents the case where the generated GraphQL server handles data not only in relational databases but also data in JSON and CSV formats. They take different RML mappings as inputs. Morph-RDB and Ontop are representatives from the group of OBDA-based tools. They can access relational databases as data sources by translating SPARQL queries into SQL queries based on semantic mappings, written in R2RML. As for the group of GraphQL-related tools, we intended to include Morph-GraphQL and Ontology2GraphQL in our evaluation. However, Morph-GraphQL fails to parse mappings; Ontology2GraphQL cannot be run due to a lack of detailed instructions regarding its setup. In the case of GraphQL-LD, since it focuses on querying Linked Data via GraphQL queries and a JSON-LD context using a SPARQL engine instead of a GraphQL interface, we did not consider it in our evaluation. Therefore, HyperGraphQL and its extension UltraGraphQL are the GraphQL engines that are included in our evaluation. They can query Linked Data that may be provided by local RDF files and remote SPARQL endpoints. The semantic mappings for all the systems in the evaluation are based on MDO. OBG-gen generates the GraphQL schema based on MDO. UltraGraphQL and HyperGraphQL use a modified version of the generated schema since they require directive definitions to specify the correspondences between query entries and the data. Figure 7 shows how the systems are configured in the evaluation. HyperGraphQL and UltraGraphQL are provided with the same RDF data for each dataset setting. OBG-gen-rdb is provided with two MySQL database instances hosting data from the Materials Project and OQMD respectively. 
Morph-RDB and Ontop are provided with a single MySQL database instance hosting data from the two sources. Conceptually, OBG-gen-mix is also provided with two database instances. However, each instance contains data in different formats, such as data in a MySQL database or in CSV or JSON files. In more detail, the instance for the Materials Project has Composition data in JSON format and Band Gap data in CSV format. The instance for OQMD has Structure and Band Gap data in JSON format and Formation Energy data in CSV format. The data representing the other entities for each instance is stored in MySQL database instances.
Features of queries without filter conditions
Features of queries with filter conditions
Queries We create queries that cover different features, aiming to evaluate our system both qualitatively (which functionalities the system supports) and quantitatively (how the system performs over different data sizes). Additionally, we use the competency questions stated in the requirements analysis of MDO to create queries of domain interest. The features of queries without and with filter expressions are shown in Table 3 and Table 4, respectively. From the perspective of GraphQL, we consider which choke point a query covers. The details of choke points are introduced in LinGBM.13
Table 5 details the meanings of the filter expressions for Q6–Q12. The filter expressions for Q6 and Q12 are simpler than those for Q7–Q11, whose filter expressions have sub-expressions connected by boolean operators. Query features in terms of DI and the form of the filter expression help us understand the systems qualitatively; Diffs and RS help us understand the systems quantitatively in the scaling analysis over different data sizes. We show Q1 in Listing 10 and Q7 in Listing 12. The results of these two queries are given in Listing 11 and Listing 13, respectively. Q1 requests all structures together with the reduced chemical formula of each structure's composition. Q7 requests all calculations where the ID is in a given list of values and the reduced chemical formula is in a given list of values.
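A filter expression whose sub-expressions are connected by boolean operators can be evaluated recursively. The sketch below illustrates this for a Q7-like condition (ID in a list AND reduced formula in a list). The operator names (`_and`, `_or`, `_not`, `_eq`, `_in`), the leaf layout and the sample records are assumptions for illustration, not the input types generated by OBG-gen.

```python
def matches(record, expr):
    """Return True if `record` satisfies the nested filter expression `expr`."""
    if "_and" in expr:
        return all(matches(record, sub) for sub in expr["_and"])
    if "_or" in expr:
        return any(matches(record, sub) for sub in expr["_or"])
    if "_not" in expr:
        return not matches(record, expr["_not"])
    # Leaf expression: {"field": ..., "op": ..., "value": ...}
    actual = record.get(expr["field"])
    if expr["op"] == "_eq":
        return actual == expr["value"]
    if expr["op"] == "_in":
        return actual in expr["value"]
    raise ValueError(f"unsupported operator: {expr['op']}")


# Hypothetical calculation records.
calculations = [
    {"id": "c1", "formula": "SiC"},
    {"id": "c2", "formula": "GaN"},
    {"id": "c3", "formula": "SiC"},
]

# Q7-like condition: the ID is in a list AND the formula is in a list.
q7_like = {
    "_and": [
        {"field": "id", "op": "_in", "value": ["c1", "c2"]},
        {"field": "formula", "op": "_in", "value": ["SiC"]},
    ]
}

selected = [c["id"] for c in calculations if matches(c, q7_like)]
```

Evaluating the filter in memory, as done here, is only one option; a resolver could instead translate the same expression tree into a `WHERE` clause when the source is a relational database.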
Meanings of filter expressions in Q6 to Q12
Experiments and measurements We evaluate the query execution time (QET) of the different systems over the six dataset settings. We run each query four times, treat the first run as a warm-up, and report the average of the remaining three runs. Figure 8 illustrates the measurements over the six data sizes per query (Q1–Q12). Figure 9 and Figure 10 illustrate the measurements of all systems per data size for queries without filtering conditions and with filtering conditions, respectively. The measurements for all data sizes and all queries are available online.14
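The measurement protocol (four runs, the first discarded as a warm-up, the remaining three averaged) can be sketched as follows. The function name `measure_qet` and the injectable `timer` parameter are assumptions added so that the sketch can be checked deterministically; they are not part of the evaluation scripts used in this work.

```python
import statistics
import time


def measure_qet(run_query, runs=4, warmups=1, timer=time.perf_counter):
    """Run `run_query` `runs` times and average all but the first `warmups` runs."""
    timings = []
    for _ in range(runs):
        start = timer()
        run_query()
        timings.append(timer() - start)
    # Discard the warm-up run(s) and average the rest.
    return statistics.mean(timings[warmups:])


if __name__ == "__main__":
    # Stand-in workload; a real measurement would send a GraphQL request here.
    qet = measure_qet(lambda: sum(range(100_000)))
    print(f"average QET over 3 runs: {qet:.6f} s")
```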

Query Execution Time (QET) per query on materials dataset.
Results and discussion By analyzing the obtained measurements, we summarize three observations. The
The

Query Execution Time (QET) per data size on materials dataset for queries without filtering conditions.

Query Execution Time (QET) per data size on materials dataset for queries with filtering conditions.
The
Based on the second and the third observations, we can answer the research question
To show the generalizability of our system, we conduct an evaluation based on LinGBM. LinGBM provides tools for generating datasets (data generator)15
Data The dataset generated by the data generator is a scalable, synthetic dataset regarding the University domain, including several entity types (e.g., University and Department). We generate data in scale factors (

A query according to QT5. Such a query goes from a given department to its university, then retrieves all graduate students who obtained their bachelor's degree from that university, and then returns to the department. This cycle is repeated two times
Queries The experiments are performed over eight query sets, where each set contains 100 queries that are generated using the LinGBM query generator based on a query template (QT). A query template has placeholders for input arguments. From a query template, the query generator produces a set of actual queries (query instances) by replacing each placeholder with an actual value. We select eight query templates (QT1–QT6, QT10 and QT11) for constructing eight query sets (QS1–QS8). We show an example query according to QT5 in Listing 9. The other six query templates from LinGBM require GraphQL servers to implement functionalities such as ordering and paging, which OBG-gen does not currently support. However, these functionalities are interesting for future extensions of OBG-gen.
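The relation between a query template and its query instances can be illustrated with a small sketch based on Python's `string.Template`. The template text, the placeholder name `departmentID` and the field names are assumptions for illustration; they are not copied from the LinGBM query templates, and the sketch does not reproduce the LinGBM query generator.

```python
from string import Template

# Hypothetical template with one placeholder for an input argument.
EXAMPLE_TEMPLATE = Template('{ department(nr: "$departmentID") { id } }')


def instantiate(template, placeholder, values):
    """Create one query instance per value by filling in the placeholder."""
    return [template.substitute({placeholder: value}) for value in values]


# A query set of 100 instances, one per department number 1..100.
query_set = instantiate(EXAMPLE_TEMPLATE, "departmentID", range(1, 101))
```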
Experiments, results and discussion As in the real-case evaluation, we evaluate the query execution time (QET) of our system on the three datasets. Each query from a query set is evaluated once. We show the average query execution times for the different query sets in Table 6. Based on the obtained measurements, we observe that the average QETs of our system increase slightly for QS1, QS2, QS4, QS6 and QS7. For QS3, the average QET is stable across all three datasets. For QT5, the increase from 0.51 seconds at data scale factor 20 to 13.85 seconds at data scale factor 100 is due to the dramatic increase in result size. More specifically, the queries in QS5 and QS8 need to access the ‘graduateStudent’ table, which increases dramatically in size from 50,482 rows in the table (
Average QET (in seconds) in the evaluation based on LinGBM
We furthermore demonstrate the generalizability of our system by evaluating it against GTFS-Madrid-Bench, which is a benchmark for evaluating OBDI systems.
Data, queries and systems The dataset provided by GTFS-Madrid-Bench is a scalable dataset regarding the Transport domain (the metro system of Madrid), including several entity types (e.g., Route, Stop, Shape and Trip). We use the data generator provided by GTFS-Madrid-Bench to generate data in scale factors (
Experiments, results and discussion As in the previous two evaluation scenarios, we evaluate the query execution time (QET) of the systems on the different datasets. We show the measurements in Table 7. According to the measurements, both OBG-gen-rdb and Ontop show increases in QETs for all four queries as the dataset size increases. However, as observed in the real-case evaluation, Ontop is less sensitive to the increase in dataset size. In terms of how the two systems behave for different queries, both engines spend more time answering Q1 (without any filter conditions) and Q9 (with several relationship retrievals). For answering Q1, OBG-gen-rdb spends more than 3,600 seconds for scale factors 10 and 50. Although Ontop is able to answer Q1 in less time than OBG-gen, it cannot finish the execution for scale factor 50 because it runs out of the reserved 4 GB of memory. More specifically, Q1 needs to access the ‘Shape’ table, which increases dramatically in size from 58,540 rows in the table (
QET (in seconds) in the evaluation based on GTFS-Madrid-Bench
For evaluating our approach, ontology-based GraphQL server generation, we conducted an experiment motivated by the materials design domain and experiments based on two synthetic benchmark scenarios (LinGBM and GTFS-Madrid-Bench). Based on the measurements of these experiments, we can answer the three research questions presented at the beginning of Section 7. Our approach can generate GraphQL servers for data access and data integration and can be used in various domains (
Discussion and future work
We emphasize that our work aims to enable GraphQL not only for data access (as in other GraphQL-based approaches) but also for data integration, by automatically generating the server based on an ontology and semantic mappings. Our work presents the first solution to this problem. Essentially, our approach concentrates on providing ontology-based data access and integration in a GraphQL setting in a way that supports practical applications. In this respect, our work fills a gap in GraphQL applications (e.g., [17,30,32,55,60]). Compared with existing GraphQL-based approaches for data access (e.g., UltraGraphQL [32] and HyperGraphQL [55]), our approach provides more GraphQL query features by supporting arbitrary filtering conditions. Our work can also provide an alternative way to build data access and data integration applications, in addition to existing OBDA or OBDI approaches [13,19,28,47,49] (e.g., users can write GraphQL queries).
It should be noted that our current effort on OBG-gen focuses only on the GraphQL language [29] features that support semantics-aware and integrated data access, namely how underlying data can be queried, rather than on reflecting the semantics of a complex knowledge representation language in GraphQL schemas. Therefore, not all description logic constructors are used, but only those that are necessary for data access via GraphQL. It would be worthwhile to investigate how to represent more complex description logic constructors within the GraphQL context. In the future, we will follow the development of the GraphQL language and investigate whether new features for data access can be generated formally based on the description logic currently used by OBG-gen or whether a more expressive language is needed. One specific example is the formal generation of union types in GraphQL schemas based on ontologies. This will necessitate updates to the schema generator algorithm and the generic resolver function. Another extension related to the schema generator algorithm is to extend the Φ function, which is responsible for translating a datatype in the DL TBox to a corresponding datatype in GraphQL. Our current work focuses on generating basic datatypes supported by GraphQL (e.g., String, Float, Int). However, in GraphQL schemas, custom type definitions can be used to represent datatypes instead of the basic ones above. We will extend the Φ function to support translating more datatypes in the DL TBox into custom type definitions in GraphQL schemas.
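As an illustration of what a datatype translation in the spirit of the Φ function might look like, the following sketch maps XSD datatype names to GraphQL's built-in scalars (where the integer scalar is named `Int`) and optionally to custom scalars. The concrete mapping table, the custom scalar names (`DateTime`, `URI`), the `String` fallback and the function names are all assumptions for illustration; they are not the definition of Φ used by OBG-gen.

```python
# Hypothetical mapping from XSD datatypes to GraphQL built-in scalars.
BUILT_IN = {
    "xsd:string": "String",
    "xsd:integer": "Int",
    "xsd:int": "Int",
    "xsd:float": "Float",
    "xsd:double": "Float",
    "xsd:decimal": "Float",
    "xsd:boolean": "Boolean",
}

# Hypothetical custom scalars that an extended translation could emit.
CUSTOM = {
    "xsd:dateTime": "DateTime",
    "xsd:anyURI": "URI",
}


def phi(datatype, allow_custom=False):
    """Translate an XSD datatype name into a GraphQL scalar type name."""
    if datatype in BUILT_IN:
        return BUILT_IN[datatype]
    if allow_custom and datatype in CUSTOM:
        return CUSTOM[datatype]
    return "String"  # assumed fallback for datatypes without a mapping


def scalar_definitions(datatypes):
    """Return the `scalar` declarations needed for the given datatypes."""
    names = sorted({CUSTOM[d] for d in datatypes if d in CUSTOM})
    return [f"scalar {name}" for name in names]
```

With `allow_custom=True`, the schema generator would also have to emit the declarations returned by `scalar_definitions`, and the server would need serialization logic for each custom scalar.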
In contrast to the query languages SQL and SPARQL, which have been specifically designed for relational databases and triple stores, respectively, and encompass a wide range of query features, the capabilities of the GraphQL query language are contingent on the definitions of GraphQL schemas and the implementation of resolver functions. In our work, we implement resolver functions in a generic manner. As a result, together with GraphQL schemas containing input type definitions, OBG-gen supports arbitrary filtering conditions in GraphQL queries. Additional query features, including aggregates (group by, having) and solution sequences and modifiers (order by, distinct, offset, limit), are not yet covered but are part of our planned future work. Another extension related to the generic resolver function is to support user-defined functions (e.g., a date normalization function) on underlying data, which our approach does not currently implement. To support this, the Function Ontology (FnO) [23] can be used when creating RML mappings. We will work on extending the generic resolver function to enable user-defined functions.
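To indicate what a user-defined function on underlying data could look like, the sketch below normalizes dates to ISO 8601 and applies registered functions to selected fields of a record. It uses a plain Python registry as a stand-in and does not use FnO or RML; the function name `normalizeDate`, the accepted input formats and the record layout are assumptions for illustration.

```python
from datetime import datetime


def normalize_date(value, formats=("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")):
    """Return `value` as an ISO 8601 date string, trying each format in turn."""
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {value!r}")


# Hypothetical registry that a mapping could refer to by function name.
FUNCTIONS = {"normalizeDate": normalize_date}


def apply_functions(record, field_functions):
    """Return a copy of `record` with the named function applied per field."""
    result = dict(record)
    for field, function_name in field_functions.items():
        if result.get(field) is not None:
            result[field] = FUNCTIONS[function_name](result[field])
    return result


row = {"id": "c1", "created": "31/12/2020"}
normalized = apply_functions(row, {"created": "normalizeDate"})
```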
In this work, we conducted a query performance comparison, specifically of query execution times, between our tool and various OBDA-based and GraphQL-based approaches. While our approach offers richer query capabilities than other GraphQL-based methods and shows performance similar to that of OBDA-based approaches (as demonstrated in Section 7), there is still room for optimizing query performance. We emphasize that these comparisons are intended to provide initial insights into query performance. One direction for future work is to optimize our generic resolver function to enhance query performance. This may involve adapting the mapping partition group rules recently proposed in [6].
From a practical standpoint, we plan to implement a search system for OPTIMADE in the materials design domain based on our approach. As mentioned before, OPTIMADE aims to make materials databases interoperable. Our approach can provide an integrated view of data to increase this interoperability. This will enable data integration over more data sources by including additional materials databases. Our previous work [45] has shown the capability of MDO to represent an integrated view of data over several representative materials databases.
Concluding remarks
To leverage ontologies for generating GraphQL APIs that support semantics-aware data access and data integration, we have presented in this article a GraphQL-based framework (Section 3) for data access and integration in which an ontology drives the generation of the GraphQL server. Our approach consists of a formal method to generate a GraphQL schema based on an ontology (Section 4) and a generic implementation of resolver functions (Section 5). In detail, ontologies play two roles in our approach: one is as an integrated view of underlying data sources for generating a GraphQL schema; the other is as a basis for defining semantic mappings on which the generic GraphQL resolver function is based. Generating a GraphQL schema based on an ontology rather than just on semantic mappings (e.g., Morph-GraphQL) ensures an integrated view of data in data integration scenarios. Such a schema does not need to be regenerated when new data sources are added, unless the ontology needs to be modified. We show the feasibility and usefulness of our approach in terms of using GraphQL for data integration and avoiding implementing a GraphQL server from scratch, based on a real-world data integration scenario motivated by the materials design domain (Section 7.1) and two synthetic benchmark scenarios, LinGBM and GTFS-Madrid-Bench (Sections 7.2 and 7.3). Additionally, we discuss the strengths and limitations of our approach and outline directions for future work (Section 8).
Footnotes
Acknowledgements
This work has been financially supported by the Swedish e-Science Research Centre (SeRC), the Swedish National Graduate School in Computer Science (CUGS), the Swedish Research Council (Vetenskapsrådet, dnr 2018-04147 and dnr 2019-05655), and the Swedish Agency for Economic and Regional Growth (Tillväxtverket).
