Decentralized infrastructure for knowledge discovery in the Semantic Web

Abstract

An enormous volume of well-structured data with explicit semantics, published in accordance with W3C standards, has become a reality in the Web of Linked Data. However, the Semantic Web promise to turn this data into a machine-processable global graph of knowledge still faces numerous impediments. Efficient access and discovery, along with semantic heterogeneity, have been identified as major stumbling blocks. Following the design principles of the Semantic Web and Linked Data, we present ActiveDiscovery, a decentralized infrastructure for distributed SPARQL query evaluation driven by the query’s terminological entities, namely the ontologies used in the query. ActiveDiscovery’s main goal is to facilitate distributed and transparent semantic search in the Semantic Web based on structural rather than keyword-based querying. Key architectural extensions regarding metadata, indexing and ontology alignment are proposed to make federated query execution transparent and decentralized. The rewriting procedure for an extensional SPARQL query is described in terms of the proposed components and the SERVICE clause, the standard recommendation for query federation. We investigate the feasibility of our approach and present preliminary results of an initial evaluation. We conclude by indicating questions that need to be addressed in future work.

Keywords

Semantic Web, query federation, ontologies, SPARQL, decentralized architecture

1 Introduction

The Semantic Web and the Linked Data initiative together lay the foundation for a new paradigm of information processing and knowledge discovery on a global scale. The vision of the Web as a data- and resource-oriented space rather than a document-centric medium has been well established for over a decade. Common vocabularies and ontologies play an important role as a semantic layer for datasets: they add explicit semantics and reasoning capabilities, thus enabling intelligent data processing on the Web. The means for both data exposure in a machine-processable form and semantic awareness with structured querying are provided by W3C standards, namely the Resource Description Framework [39], the Web Ontology Language [38] and SPARQL [40], which together constitute the set of formalisms for the Semantic Web. The Linked Data initiative follows the Design Issues [3], which stipulate using dereferenceable HTTP URIs to identify resources, providing useful information expressed with W3C standards, and linking across datasets. The ultimate goal is to turn the Web into a shared space of machine-understandable knowledge. However, heterogeneity, an inherent quality of the Web, makes this a challenging task. It is generally agreed that the decentralization of data sources and the semantic diversity expressed in many ontologies can be tackled with integration efforts. The prevailing approach to semantic integration on the Web is to provide mappings, leaving miscellaneous ontologies loosely coupled rather than seeking full semantic alignment. There is a variety of styles and approaches to the access and integration of assertional data, in particular federation architectures in which SPARQL queries are executed in a distributed manner.

In this article we consider a decentralized infrastructure for knowledge discovery in the Semantic Web, which we refer to as ActiveDiscovery, in terms of distributed SPARQL query evaluation, automatic discovery of data sources based on the terminological elements of a query, and semantic expansion that broadens the result set. The purpose of such an infrastructure is to enable Semantic Web applications to execute structural queries across the Web without prior knowledge of the relevant data sources. To achieve this goal we propose architectural extensions for OWL ontology metadata, decentralized indexing services and an algorithm for federated SPARQL query execution based on the standard SERVICE clause.

Section 2 briefly outlines the research landscape in the area of data discovery in the Semantic Web. Section 3 presents the conceptual foundation and the main architectural assumptions of the infrastructure. Section 4 presents preliminary results of the evaluation of a proof-of-concept implementation. Finally, Section 5 concludes and indicates directions for future research.

2 Related work

Data discovery on the Semantic Web has been a widely explored topic in two primary research areas, namely access to distributed RDF datasets on the Web and semantic integration of Web ontologies. Both issues are vital for efficient data discovery.

2.1 RDF data access

RDF data is distributed across the Web, which makes it difficult to find, query and consume. Triples are either freely available via the standard HTTP GET mechanism or stored in datasets equipped with a SPARQL endpoint. These sources are frequently disconnected, leading to so-called data islands. There are four general approaches to this problem. The navigational approach [1, 37] mimics the way we browse the Web, exploiting the graph nature of RDF and the links between RDF graphs. Navigation does not require any additional structures supporting data collection; however, the collected data depends heavily on the choice of starting point. This technique is implemented in semantic browsers such as Tabulator [4], Disco [5] and DBpedia Mobile [2]. Warehousing [9, 39] is another approach, originating from database theory, in which heterogeneous data needs standardization and schema unification before being queried. It assumes collecting and loading entire datasets into a central repository, thus enabling efficient query evaluation. This approach, however, suffers from data redundancy, which raises problems with scalability and with data becoming obsolete. Semantic Web search engines such as Swoogle [13], Sindice [29], SWSE [18] and Falcons [8] crawl the Web and index RDF documents much like traditional search engines. They collect RDF documents, create a centralized indexing structure and enable keyword-based search using classic information retrieval methods. Since they are not resource-oriented, their support for structural or semantic querying is limited to metadata, and search results refer to RDF documents rather than RDF resources. Federation [16, 17] is oriented towards structural query evaluation across multiple data sources on the Web. It derives from database research on distributed query processing [21] and assumes a mediator which splits the query into subqueries that are then delegated to different endpoints. The mediator collects and merges intermediate results to obtain the final result set for the query. SPARQL 1.1 [31] has built-in support for federation in the form of the SERVICE clause. Two examples of federated SPARQL systems are DARQ [32] and SemWIQ [22].

2.2 Semantic integration

Web ontologies determine the terminological layer for RDF triples, so they constitute the semantics both of datasets and of queries. Ontologies, even if shared and reused, introduce heterogeneity on the semantic level. Fruitful knowledge discovery in the Semantic Web requires alignment of the miscellaneous ontologies used to describe RDF data. Ontology matching is a broad topic covering a variety of methods, which can be categorized by matching objective and degree of human involvement. Ontology merging and integration produce a new ontology, which may be significantly altered in a deep reengineering process requiring extensive human involvement. Mapping and alignment of ontologies are considered more lightweight and better suited to the reality of the Web. A potentially incomplete mapping between Web ontologies, computed (semi-)automatically and represented without affecting their autonomy, is believed to be particularly suitable for semantic alignment. Comprehensive surveys on ontology matching are given in [10, 35]. Another important aspect is the representation formalism, which determines the expressiveness of mappings and the ways they can be exploited. Formal and logic-oriented approaches [14, 28] consider the interpretation of logic concepts across different ontologies, first-order formulas for cross-ontological translations and upper ontologies for schema mediation. Query-language-level approaches [11, 34] propose various extensions of SPARQL to evaluate queries formulated in one ontology over data expressed in another. A great deal of attention is dedicated to query rewriting strategies and algorithms. Lightweight mapping representation tools include the Alignment API [12], R2R [6] and BLOOMS [19].

3 ActiveDiscovery infrastructure

Current approaches to information seeking and knowledge discovery in the Semantic Web require the active involvement of the querying party, hereafter referred to as the querying agent (QA), and thus assume that the Web infrastructure behaves passively. In particular, the discovery and selection of data sources remains an unresolved issue. Our goal is to introduce a set of means, consisting of architectural extensions and a decentralized infrastructure, that changes the perspective on query evaluation: the Web plays an active part and takes over responsibilities of the QA. In the motivating scenario, the QA issues an extensional SPARQL query using terminological entities of OWL ontologies according to Semantic Web standards. From that point on, we expect certain Web components to take care of query processing and to bring the best possible results back to the QA. This can be seen as a generalization of the query federation problem. We formulate the following expectations towards such an infrastructure.

It is the Web’s underlying infrastructure which is to behave proactively in the query evaluation process; the QA merely states the query.

The locations of data sources are unknown to the QA; it is the Web’s responsibility to discover suitable data sources and execute the query against them.

The number of contributing data sources is to be maximized.

The QA’s query is based on any dereferenceable Web ontologies of its choice; the terminological components of the query are to determine the evaluation process.

Semantically related data should contribute even if it is described with a different ontology than the query ontology.

Such an infrastructure should constitute a robust and scalable system that is independent of any controlling entity.

3.1 Architectural extensions

We follow the common distinction between terminological data (TBox), namely ontologies, and assertional data (ABox) stored in RDF datasets, usually of much higher volume. Dereferenceability of the ontology IRI and availability of the ontology document, including its metadata, are not only stipulated by the Linked Data design recommendations but also required by the OWL 2 specification [27]. This provides a basis for additional annotations of the owl:Ontology concept (owl, rdfs and rdf are abbreviations for the standard namespaces of those vocabularies) describing the new services needed to facilitate query federation. Two annotations are defined and illustrated in Figure 1.

Definition 1. (IndexingServiceAnnotation). Annotation is defined as a set of axioms:

IndexingService : owl:Class

hasIndexingService : owl:AnnotationProperty

⊤ ⊑ ∀ hasIndexingService.IndexingService

∃ hasIndexingService.Thing ⊑ owl:Ontology

Definition 2. (MappingServiceAnnotation). Annotation is defined as a set of axioms:

MappingService : owl:Class

hasMappingService : owl:AnnotationProperty

⊤ ⊑ ∀ hasMappingService.MappingService

∃ hasMappingService.Thing ⊑ owl:Ontology

Fig. 1 Indexing and Mapping annotations.

Conceptually, five separate logical nodes in the network infrastructure can be denoted. Each of them exposes a service which, along with ontology annotation, facilitates discovery protocol.

Ontology Repository Node. The ORN is responsible for hosting ontologies and serving ontology documents via the HTTP GET mechanism. Its implementation is based on a standard HTTP server, since terminological datasets are small in comparison to assertional data. Every hosted ontology has at least one IndexingService annotation and one MappingService annotation, pointing at an ISN and an MSN, respectively.

Indexing Service Node. The ISN stores index entries linking ontologies with RDF datasets, together with additional metadata that improves the discovery process. The basic index entry takes the form of a pair (IRI_TBox, URL_SEndpoint), where IRI_TBox is the identifier of an ontological concept and URL_SEndpoint is the address of a SPARQL endpoint which provides triples containing IRI_TBox. The actual index entry is more elaborate, but this is of less importance at this point. The ISN exposes the index entries for TBox concepts through the getDataSources(IRI_TBox) service.

Mapping Service Node. The MSN stores mapping entries between TBox elements of different Web ontologies. The expressiveness of mappings covers named concepts, both classes and properties, which can be related by means of standard description logic equivalence and inclusion:

$C_{1} \equiv C_{2}$ (class equivalence) (1)

$C_{1} \sqsubseteq C_{2}$ (class inclusion) (2)

$R_{1} \equiv R_{2}$ (property equivalence) (3)

$R_{1} \sqsubseteq R_{2}$ (property inclusion) (4)

The basic getMappings(IRI_TBox) service returns the URIs of all concepts TBox_i, both classes and properties, which satisfy TBox_i ⊑ TBox for every i.

Data Repository Node. The DRN is merely an RDF dataset repository equipped with a standard SPARQL endpoint. The service can be implemented with any existing SPARQL server and is supposed to handle large volumes of RDF data.

Querying Service Node. The QSN orchestrates all other services to accomplish the scenario according to the discovery protocol. It does not store any data itself but calls on the other nodes, thus playing the role of the mediator in a federation system.

These elements form the infrastructural basis for looking up the data sources necessary for query rewrite and execution.
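The two lookup services described above can be illustrated with a minimal in-memory sketch. It is not the authors' Java implementation: the class names, the `register`, `add_inclusion` and `add_equivalence` helpers, and the decision to follow inclusion chains transitively are assumptions made for illustration. Only the `getDataSources` and `getMappings` operations come from the text.

```python
from collections import defaultdict


class IndexingService:
    """In-memory stand-in for an ISN: maps a TBox IRI to SPARQL endpoint URLs."""

    def __init__(self):
        self._entries = defaultdict(set)

    def register(self, iri_tbox, url_sendpoint):
        # Store the basic index entry (IRI_TBox, URL_SEndpoint).
        self._entries[iri_tbox].add(url_sendpoint)

    def get_data_sources(self, iri_tbox):
        # Counterpart of getDataSources(IRI_TBox); sorted for stable output.
        return sorted(self._entries.get(iri_tbox, ()))


class MappingService:
    """In-memory stand-in for an MSN: stores equivalence and inclusion mappings."""

    def __init__(self):
        self._subsumed = defaultdict(set)  # general concept -> more specific ones

    def add_inclusion(self, sub, sup):
        # Records sub ⊑ sup.
        self._subsumed[sup].add(sub)

    def add_equivalence(self, a, b):
        # a ≡ b is treated as inclusion in both directions.
        self.add_inclusion(a, b)
        self.add_inclusion(b, a)

    def get_mappings(self, iri_tbox):
        # Counterpart of getMappings(IRI_TBox): every concept included in
        # iri_tbox, following inclusion chains transitively (an assumption).
        # The queried concept itself is not returned.
        found, stack = set(), [iri_tbox]
        while stack:
            for sub in self._subsumed.get(stack.pop(), ()):
                if sub != iri_tbox and sub not in found:
                    found.add(sub)
                    stack.append(sub)
        return sorted(found)


msn = MappingService()
msn.add_equivalence("dbpedia:City", "schema:City")
msn.add_inclusion("dbpedia:country", "schema:location")

isn = IndexingService()
isn.register("dbpedia:City", "http://dbpedia.org/sparql")

print(msn.get_mappings("schema:City"))       # ['dbpedia:City']
print(isn.get_data_sources("dbpedia:City"))  # ['http://dbpedia.org/sparql']
```

In a deployment each service would sit behind the URL given by the corresponding ontology annotation; the dictionaries here only stand in for that remote state.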

3.2 Ontology-based query evaluation

Following the distinction between TBox and ABox data, the evaluated query is considered to be an ABox query according to definition 3. Let V be the set of variables, L be the set of literals, I_TBox be the set of terminological IRIs of any ontology excluding common vocabularies for OWL, RDFS and RDF, I_ABox be the set of any assertional IRIs.

Definition 3. (ABox Query). A SPARQL query Q_ABox is an ABox query ⇔ ∀ tp ∈ BGP : tp ∈ (V × {rdf:type} × I_TBox) ∪ ((I_ABox ∪ V) × (I_TBox ∪ {owl:sameAs, owl:differentFrom}) × (I_ABox ∪ V ∪ L)), where tp and BGP stand for a triple pattern and the Basic Graph Pattern of the query, respectively.

Definition 3 entails five types of a triple pattern. TP_i is a shorthand for triple pattern of type i.

Type 1 (V, rdf:type, I_TBox): tp = (s, p, o), where s ∈ V, p ∈ {rdf:type}, o ∈ I_TBox.

Type 2 (V ∪ I_ABox ∪ L, I_TBox, V ∪ I_ABox ∪ L): tp = (s, p, o), where p ∈ I_TBox, (s ∈ V ∧ o ∈ I_ABox ∪ L) ∨ (s ∈ I_ABox ∪ L ∧ o ∈ V).

Type 3 (V, I_TBox, V): tp = (s, p, o), where p ∈ I_TBox, s, o ∈ V.

Type 4 (V ∪ I_ABox ∪ L, I_OWL, V ∪ I_ABox ∪ L): tp = (s, p, o), where p ∈ {owl:sameAs, owl:differentFrom}, (s ∈ V ∧ o ∈ I_ABox ∪ L) ∨ (s ∈ I_ABox ∪ L ∧ o ∈ V).

Type 5 (V, I_OWL, V): tp = (s, p, o), where p ∈ {owl:sameAs, owl:differentFrom}, s, o ∈ V.

An immediate consequence is that there is at most one element of I_TBox in each triple pattern in Q_ABox and at least one element of I_ABox. Any variable can therefore be bound only to assertional elements (or literals), never to terminological ones. According to the primary proposition, the initial ABox query carries no addressing elements in the form of FROM or SERVICE clauses, and no background dataset or default graph is assumed either.
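The five triple pattern types can be checked mechanically. The following sketch assumes a hypothetical `Term` representation in which every term carries a tag: `var`, `tbox`, `abox` or `lit`, plus `vocab` for the rdf:type and owl:* predicates. Neither the tags nor the function name are part of the original formalism.

```python
from dataclasses import dataclass

RDF_TYPE = "rdf:type"
OWL_IDENTITY = {"owl:sameAs", "owl:differentFrom"}


@dataclass(frozen=True)
class Term:
    kind: str   # "var", "tbox", "abox", "lit" or "vocab" (hypothetical tags)
    value: str


def tp_kind(s, p, o):
    """Return the type (1-5) of a triple pattern, or None if no type applies."""
    s_var, o_var = s.kind == "var", o.kind == "var"
    s_const = s.kind in ("abox", "lit")
    o_const = o.kind in ("abox", "lit")
    # Exactly one side is a variable, the other an ABox IRI or a literal.
    one_side_bound = (s_var and o_const) or (s_const and o_var)
    both_vars = s_var and o_var

    if p.kind == "vocab" and p.value == RDF_TYPE:
        return 1 if s_var and o.kind == "tbox" else None
    if p.kind == "tbox":
        return 2 if one_side_bound else 3 if both_vars else None
    if p.kind == "vocab" and p.value in OWL_IDENTITY:
        return 4 if one_side_bound else 5 if both_vars else None
    return None


city = Term("var", "city")
poland = Term("abox", "http://dbpedia.org/resource/Poland")

print(tp_kind(city, Term("vocab", RDF_TYPE), Term("tbox", "schema:City")))     # 1
print(tp_kind(city, Term("tbox", "schema:location"), poland))                  # 2
print(tp_kind(city, Term("tbox", "geo:population"), Term("var", "population")))  # 3
```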

With the architecture components introduced and the input query defined, the scenario of ontology-based ABox query evaluation can be described. The overall protocol is presented in Fig. 2. The sequence diagram illustrates the message passing between nodes and the pieces of intermediary data collected during the evaluation process. The entry point for the querying agent is the Querying Service Node, which receives the input SPARQL query and starts the procedure listed in Algorithm 1. Data source discovery and query rewriting interleave in the course of the algorithm’s execution. First, the query’s BGP is extracted and a new one is initialized (lines 1-2). Next, all triple patterns are analyzed in turn (lines 3-27). We are interested in TPs having a TBox element in the predicate or object position, so only TP₁, TP₂ and TP₃ are processed further (line 4). TP₄ and TP₅ are appended to the new BGP untouched (line 24). From each triple pattern the TBox IRI is extracted (line 6) and the respective ontology is dereferenced by issuing an HTTP GET request to this location. Then the ontology’s hasMappingService annotation is consulted (line 7) and, after the respective Mapping Service Node is contacted, the set of mappings for the TP’s TBox element is acquired (line 8). The new TBox elements extracted from the mappings, along with the source TBox element, form the basis for the data source search (lines 9-13). For each element of the TBox set, the indicated ontology is loaded from the respective Ontology Repository Node by dereferencing its IRI, and the hasIndexingService annotation is read (line 15). The IndexingService pointer enables a remote invocation of the Indexing Service Node, which returns the data sources for a given TBox element (line 16). At this point all addressing information required for the query rewrite is in place. For each data source a new SERVICE clause for the given endpoint and the rewritten triple pattern is appended, thus creating a part of a compound UNION clause, which in turn forms a new expanded triple pattern (lines 17-20) appended to the new BGP of the target query. When all triple patterns have been transformed, the rewriting procedure stops. The query also undergoes important pre- and postprocessing with optimization in mind, elaborated in the next section.

The rewritten (and optimized) query is then executed by the SPARQL engine that is part of the QSN implementation. Subqueries from the SERVICE clauses are delegated to the respective sources; intermediate results are joined on the QSN and sent back to the querying agent.

Algorithm 1 An algorithm for ontology-based ABox query rewriting

Require: Non-addressing ontology-based ABox query

Ensure: Addressing ontology-based ABox query

    1: BGP ← extractBGP(Q_ABox);
    2: newBGP ← initEmptySet();
    3: for all (TP tp : BGP) do
    4:   if (tp.getTpKind() in (1, 2, 3)) then
    5:     nTP ← initNewTP();
    6:     tbox ← tp.getTBoxIRI();
    7:     mServ ← getMappingService(tbox);
    8:     mappings ← mServ.getMappings(tbox);
    9:     tboxSet ← initEmptySet();
    10:    tboxSet.add(tbox);
    11:    for all (Map map : mappings) do
    12:      tboxSet.add(map.getTBox());
    13:    end for
    14:    for all (TBox t : tboxSet) do
    15:      iServ ← getIndexingService(t);
    16:      DSs ← iServ.getDataSources(t);
    17:      for all (DataSource ds : DSs) do
    18:        nTP.appendSERVICE(tp, ds, t);
    19:        nTP.append('UNION');
    20:      end for
    21:    end for
    22:    newBGP.append(nTP);
    23:  else
    24:    newBGP.append(tp);
    25:  end if
    26:  newBGP.append(' . ');
    27: end for

Fig. 2 Query evaluation protocol.
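The rewriting step of Algorithm 1 can be sketched in a few lines of Python. This is an illustrative approximation under stated assumptions, not the QSN implementation. The `mappings` and `index` dictionaries stand in for the remote MSN and ISN calls (including the dereferencing of ontology annotations). Triple patterns are plain tuples of strings tagged with their type and TBox IRI. Alternatives are joined with UNION so that no trailing UNION remains. Raising `LookupError` when no data source is found is a choice made here, as the text does not specify that case.

```python
def rewrite_bgp(bgp, mappings, index):
    """Rewrite a non-addressing BGP into SERVICE/UNION groups.

    bgp      -- list of (kind, (s, p, o), tbox_iri); tbox_iri is None for kinds 4 and 5
    mappings -- dict: TBox IRI -> list of mapped TBox IRIs (stand-in for the MSN)
    index    -- dict: TBox IRI -> list of endpoint URLs (stand-in for the ISN)
    """
    new_bgp = []
    for kind, tp, tbox in bgp:
        if kind not in (1, 2, 3):
            # TP4 and TP5 are copied untouched.
            new_bgp.append(" ".join(tp))
            continue
        # Source TBox element plus everything the mapping service returns.
        tbox_set = [tbox] + [m for m in mappings.get(tbox, []) if m != tbox]
        alternatives = []
        for t in tbox_set:
            # Substitute the (possibly mapped) TBox IRI into the pattern.
            rewritten = " ".join(t if term == tbox else term for term in tp)
            for endpoint in index.get(t, []):
                alternatives.append("{ SERVICE <%s> { %s } }" % (endpoint, rewritten))
        if not alternatives:
            raise LookupError("no data source found for " + tbox)
        new_bgp.append("{ " + " UNION ".join(alternatives) + " }")
    return " .\n".join(new_bgp)


bgp = [
    (1, ("?city", "a", "schema:City"), "schema:City"),
    (3, ("?city", "geo:name", "?label"), "geo:name"),
]
mappings = {"schema:City": ["dbpedia:City"]}
index = {
    "schema:City": ["http://dbpedia.org/sparql"],
    "dbpedia:City": ["http://dbpedia.org/sparql"],
    "geo:name": ["http://factforge.net/repositories/ff-news"],
}
target = rewrite_bgp(bgp, mappings, index)
print(target)
```

The returned string is the body of the new WHERE clause; prefixes and the SELECT clause would be carried over from the source query.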

3.3 Optimization techniques

The query preprocessing and postprocessing steps shown in Fig. 2 comprise the optimization efforts undertaken to prepare the target query for efficient evaluation in a distributed environment. In a federation architecture the most expensive operations are executing subqueries on remote nodes, transmitting the intermediate results over the network and joining them on the mediator. Therefore, it is crucial for performance to keep the intermediate results as small as possible and to ensure that the query planner on the mediator supports a smallest-first join order. The key factor is the selectivity of a triple pattern (Definition 4).

Definition 4. (Triple pattern selectivity). Let tp_x = (s_x, p_x, o_x) ∈ (I ∪ V)³ be a triple pattern x and let tp = (s, p, o) ∈ V³, where I is the set of IRIs and V is the set of variables. The selectivity S(tp_x) is then $S(tp_{x}) = \frac{|eval(tp)|}{|eval(tp_{x})|}$ where eval(tp) is the set of results of evaluating tp.

Roughly speaking, selectivity represents a capability of a triple pattern to restrict results. It can be estimated based on the triple pattern itself or it may take into account additional metadata carried by index entries. Both possibilities are considered in ActiveDiscovery query evaluation.
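As a small worked example of Definition 4, the sketch below evaluates patterns against a toy set of triples. The data and the `ex:` identifiers are invented purely for illustration. Returning infinity for a pattern that matches nothing is an assumption, since the definition leaves that case open.

```python
import math


def matches(pattern, triple):
    # A term starting with "?" is a variable and matches anything.
    # Repeated variables within one pattern are not checked in this sketch.
    return all(p.startswith("?") or p == t for p, t in zip(pattern, triple))


def selectivity(pattern, triples):
    total = len(triples)  # |eval(tp)| for the all-variable pattern (?s, ?p, ?o)
    matched = sum(matches(pattern, t) for t in triples)  # |eval(tp_x)|
    return math.inf if matched == 0 else total / matched


# Invented toy data, six distinct triples.
triples = [
    ("ex:a", "rdf:type", "schema:City"),
    ("ex:b", "rdf:type", "schema:City"),
    ("ex:c", "rdf:type", "schema:Country"),
    ("ex:a", "geo:name", "A"),
    ("ex:b", "geo:name", "B"),
    ("ex:a", "schema:location", "ex:c"),
]

print(selectivity(("?s", "rdf:type", "schema:City"), triples))   # 3.0
print(selectivity(("?s", "schema:location", "ex:c"), triples))   # 6.0
print(selectivity(("?s", "?p", "?o"), triples))                  # 1.0
```

A higher value means the pattern restricts the result set more strongly; the all-variable pattern has selectivity 1.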

In the preprocessing phase the FILTER expression is analyzed first. The filter pushing strategy is applied by converting the filtering formula into conjunctive form. Any conjunct whose set of variables is contained in a TP’s set of variables is attached to that TP as a separate FILTER, so that both can later be pushed into one SERVICE call. Afterwards, an initial reordering of the triple patterns is performed. TPs containing an ABox IRI, namely TP₂ and TP₄, as well as TPs with an attached FILTER clause are moved to the beginning of the BGP, on the heuristic that a constant or a filter restricts the bindings most. They become the seed of the BGP. Next in line are TPs sharing variables with the seed TPs, which narrows the possible bindings for those TPs.
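The preprocessing phase can be sketched as follows, assuming that triple patterns and filter conjuncts are available as plain strings and that the filter has already been split into conjuncts. The function names, the regular expression for variables and the example filter `?population > 100000` are hypothetical. The triple patterns are those of the source query in Listing 1.

```python
import re

VAR = re.compile(r"\?\w+")


def variables(text):
    # All SPARQL variables occurring in a pattern or filter conjunct.
    return set(VAR.findall(text))


def push_filters(tps, conjuncts):
    """Attach each conjunct to every TP whose variables cover the conjunct's variables."""
    return {tp: [c for c in conjuncts if variables(c) <= variables(tp)] for tp in tps}


def initial_order(tps, kinds, attached):
    """Seeds (TP2, TP4 or TPs with a FILTER) first, then TPs sharing variables with them."""
    seeds = [tp for tp in tps if kinds[tp] in (2, 4) or attached[tp]]
    seed_vars = set().union(*(variables(tp) for tp in seeds))
    linked = [tp for tp in tps if tp not in seeds and variables(tp) & seed_vars]
    rest = [tp for tp in tps if tp not in seeds and tp not in linked]
    return seeds + linked + rest


tps = [
    "?city a schema:City",
    "?city schema:location <http://dbpedia.org/resource/Poland>",
    "?city geo:population ?population",
    "?city geo:name ?label",
]
kinds = {tps[0]: 1, tps[1]: 2, tps[2]: 3, tps[3]: 3}
attached = push_filters(tps, ["?population > 100000"])  # hypothetical filter
order = initial_order(tps, kinds, attached)
print(order)
```

With this input the location pattern and the filtered population pattern become the seed, followed by the two remaining patterns, which share `?city` with the seed.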

In the postprocessing phase the reordering is augmented with information about the cardinality of TBox elements. The basic index entry pair is extended to the triple (IRI_TBox, URL_SEndpoint, n), where n is the number of occurrences of IRI_TBox in the data source exposed by URL_SEndpoint. The total number of occurrences of IRI_TBox in all data sources URL_SEndpoint_i discovered for this element is denoted as

$card(IRI_{TBox}) = \sum_{(IRI_{TBox}, URL_{SEndpoint_{i}}, n_{i})} n_{i}$ (5)

Cardinality supports an estimation of the selectivity of a triple pattern containing IRI_TBox. Lower cardinality implies fewer bindings on average, therefore the relevant TPs take precedence in the BGP order. However, the rule requiring TPs with bound variables to occur before those with unbound variables must be satisfied at all times. One way of improving a subquery’s selectivity is filter pushing. Another is increasing the number of triple patterns with shared variables that are pushed to a remote endpoint within one subquery. This is possible whenever there are triple patterns addressed to the same endpoint which are not elements of any UNION clause.
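Equation (5) amounts to summing the per-endpoint counts of each TBox IRI. The sketch below applies it to the index entries reported in Table 1 (Section 4.1). The function name and the list layout are illustrative choices, and the final sort shows only the cardinality criterion, not the additional rule on bound variables.

```python
from collections import defaultdict

# Extended index entries (IRI_TBox, URL_SEndpoint, n) as listed in Table 1.
INDEX_ENTRIES = [
    ("schema:City", "dbpedia.org/sparql", 20839),
    ("schema:City", "ff.net/repositories/ff-news", 41078),
    ("dbpedia:City", "dbpedia.org/sparql", 28675),
    ("dbpedia:country", "dbpedia.org/sparql", 769223),
    ("dbpedia:City", "ff.net/repositories/ff-news", 41078),
    ("dbpedia:country", "ff.net/repositories/ff-news", 6709883),
    ("geo:name", "ff.net/repositories/ff-news", 11870162),
    ("geo:population", "ff.net/repositories/ff-news", 698583),
]


def cardinalities(entries):
    """card(IRI_TBox): sum of n_i over all endpoints indexed for that IRI."""
    card = defaultdict(int)
    for iri_tbox, _url, n in entries:
        card[iri_tbox] += n
    return dict(card)


card = cardinalities(INDEX_ENTRIES)
by_cardinality = sorted(card, key=card.get)  # lowest cardinality first

print(card["schema:City"])  # 61917
print(by_cardinality)
```

For these entries geo:population (698583) precedes geo:name (11870162), so a triple pattern on geo:population would be ordered first by this criterion.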

To summarize the optimization strategies applied: filter pushing and aggregating as many TPs as possible within one SPARQL SERVICE invocation improve selectivity and reduce network transmission by cutting off useless intermediary results. Reordering triple patterns based on heuristics and source statistics tends to significantly reduce the resources consumed by mediator joins.

4 Evaluation

Although the evaluation of ActiveDiscovery has been carried out primarily on artificial datasets, a real-world example is also presented. The example demonstrates the added value and as such serves as a proof of concept for the architecture’s feasibility. In the second part of the evaluation, scalability, an important factor of feasibility, is briefly discussed and preliminary results of a quantitative analysis are presented. The evaluation environment consists of computational nodes, one per logical node. The Fuseki SPARQL server, based on the open-source Jena ARQ library, is used for all SPARQL endpoints. Services on the ISN, MSN and QSN nodes are implemented in Java. Apache Web Server serves as the implementation of the ORN.

4.1 Validation example

Two ontologies are used to express the source query presented in Listing 1: Schema.org¹ and the Geonames² ontology. The query asks for all cities in Poland with their names (labels) and population. It may be trivial, but it is still worth mentioning, that the source query cannot be executed successfully unless it is explicitly submitted to a specific endpoint. First of all, the querying agent must know the endpoint’s address. If there are multiple endpoints, it needs to decide which one to query, and different choices lead to different results. Moreover, the data necessary to form the results may be split across various sources, which makes the task even harder.

Listing 1: Source query

    PREFIX schema: <http://schema.org/>
    PREFIX geo: <http://www.geonames.org/ontology#>

    SELECT *
    WHERE {
      ?city a schema:City .
      ?city schema:location <http://dbpedia.org/resource/Poland> .
      ?city geo:population ?population .
      ?city geo:name ?label
    }

For the purpose of the experiment, the ontologies’ namespaces had to be altered and the ontologies themselves hosted accordingly, because they needed to be augmented with the relevant Mapping Service and Indexing Service annotations. However, this does not affect the principles, so the original namespaces are used in the listings.

Table 1
Index entries for Schema, DBpedia and Geonames

TBox IRI	SPARQL endpoint URL	Cardinality
schema:City	dbpedia.org/sparql	20839
schema:City	ff.net/repositories/ff-news	41078
dbpedia:City	dbpedia.org/sparql	28675
dbpedia:country	dbpedia.org/sparql	769223
dbpedia:City	ff.net/repositories/ff-news	41078
dbpedia:country	ff.net/repositories/ff-news	6709883
geo:name	ff.net/repositories/ff-news	11870162
geo:population	ff.net/repositories/ff-news	698583

The hasMappingService annotation property of the Schema ontology points at a node returning the following mappings:

dbpedia:City ≡ schema:City

dbpedia:country ⊑ schema:location

These mappings extend the query so that the DBpedia ontology³ is used in it as well. The index entries for all TBox elements in question are collected in Table 1, as returned by the Indexing Service Nodes for the three ontologies. Two endpoints are available (ff.net is shorthand for factforge.net). The cardinality values reflect real-world numbers of occurrences of the TBox elements in both sources.

Listing 2: Target query

    PREFIX schema: <http://schema.org/>
    PREFIX geo: <http://www.geonames.org/ontology#>
    PREFIX dbpedia: <http://dbpedia.org/ontology/>

    SELECT *
    WHERE {
      {
        { SERVICE <http://dbpedia.org/sparql> {
            ?city dbpedia:country <http://dbpedia.org/resource/Poland> } }
        UNION
        { SERVICE <http://factforge.net/repositories/ff-news> {
            ?city dbpedia:country <http://dbpedia.org/resource/Poland> } }
      }
      {
        { SERVICE <http://dbpedia.org/sparql> { ?city a dbpedia:City } }
        UNION
        { SERVICE <http://factforge.net/repositories/ff-news> { ?city a dbpedia:City } }
        UNION
        { SERVICE <http://dbpedia.org/sparql> { ?city a schema:City } }
        UNION
        { SERVICE <http://factforge.net/repositories/ff-news> { ?city a schema:City } }
      }
      SERVICE <http://factforge.net/repositories/ff-news> {
        ?city geo:population ?population .
        ?city geo:name ?label
      }
    }

Submitting the source query to either the DBpedia endpoint or the Factforge.net endpoint returns no bindings. Translation of the query and the optimization-oriented transformation produce the target query presented in Listing 2. The target query executed on the Querying Service Node returns 39 bindings, of which 31 are distinct; the duplicates arise because Factforge aggregates data from multiple sources, DBpedia included.

4.2 Scalability

The scalability of the ActiveDiscovery architecture is measured in terms of the execution time of a rewritten query Q as a function of database size, for different queries Q varying in size (number of triple patterns) and selectivity:

$T = f_{Q}(S)$ (6)

where T is the execution time in ms of the already rewritten ABox query Q and S is the total number of triples in all sources contributing to the result set of the query. The duration of the translation process itself can be neglected, as it is small in comparison to the execution time.

The goal of the experiment is to determine the shape of f_Q for increasing sizes of RDF datasets. Four different queries have been examined:

Q₁, consisting of 7 triple patterns of low selectivity,

Q₂, the same query of 7 triple patterns but of higher selectivity,

Q₃, a smaller query of 3 triple patterns of moderate selectivity, and

Q₄, a small query of 2 triple patterns of low selectivity.

Fig. 3 Execution time of test queries.

An RDF triple generator was used to produce twelve datasets of different sizes for the measurements: starting from 38 triples in the smallest dataset, the size doubled each time, reaching 77824 triples in the largest. Each dataset was uniformly distributed over four separate DRN nodes. For each combination of dataset size and query type, 100 samples were taken, the 10 highest and 10 lowest results were discarded, and the remaining ones were averaged. The results for each query are presented in Fig. 3. A logarithmic scale is used on both axes to show all average measurements clearly.
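The measurement procedure can be reproduced with two short helpers. They are a sketch of the described methodology only: the function names are chosen here and no timing data from the experiment is included.

```python
def dataset_sizes(start=38, count=12):
    # Twelve dataset sizes, doubling from 38 triples.
    return [start * 2 ** i for i in range(count)]


def trimmed_mean(samples, cut=10):
    """Average after discarding the `cut` lowest and `cut` highest samples."""
    ordered = sorted(samples)
    kept = ordered[cut:len(ordered) - cut]
    if not kept:
        raise ValueError("not enough samples to trim")
    return sum(kept) / len(kept)


sizes = dataset_sizes()
print(sizes[0], sizes[-1])  # 38 77824
```

For 100 samples per setting, `trimmed_mean` averages the middle 80 values, which limits the influence of outliers such as network delays.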

The first and most significant observation is the linear growth of f_Q for all queries. The relative difference in execution time between queries is constant and depends on the query type. The much smaller Q₄ outperforms the largest Q₁ by a factor of approximately 24, while Q₂, equal in size to Q₁ but more selective, runs 4 times faster. It is important to note that these results cannot be meaningfully compared to the performance of a single query engine, which measures centralized query execution rather than distributed execution against multiple endpoints.

5 Conclusions and future work

In this paper a decentralized infrastructure for extensional SPARQL query evaluation based on the terminological components of the query has been presented. Following the essential principles of the Web of Data, we have proposed extensions to OWL ontology annotations, a service-oriented architecture and a protocol for the discovery of RDF data sources, which determines the query rewriting procedure and distributed execution. Certain optimization issues and a selection of improvements have been discussed. The proof-of-concept implementation has enabled an initial evaluation of the infrastructure’s feasibility. The preliminary results have been presented, including a validation example of the added value gained for a real-world query and data sources, as well as performance metrics showing linear growth of execution time with respect to dataset size.

Our future work will focus on areas such as results ranking and limitation. An efficient way to control the volume of intermediate results from triple patterns delegated to remote endpoints is crucial, particularly when low selectivity is observed. Means for more adequate selectivity measurement might be provided by indexing services. Moreover, additional metadata expressing more complex characteristics of the data available via an endpoint seems helpful, both for assigning cut-off thresholds and for a results ranking strategy. The development of ranking criteria goes hand in hand with the need for data chunking. Another area of research is index creation and update, which is necessary whenever a data source appears or ceases to be available. Likewise, an index update is needed when dataset content changes in terms of data volume or when a new vocabulary is used. The protocol extension for dataset registration, deregistration or update could follow the pattern known from the DNS architecture. Linked Data connections between RDF datasets enable updates based on Web crawling. Lastly, a more expressive representation language for inter-ontology mappings remains a direction for future research. The ability to make use of equivalence between individual resources (owl:sameAs) would further improve the accuracy of the discovery process.

Footnotes

References

1. Baier, Daroch, Reutter and Vrgoč, Evaluating Navigational RDF Queries over the Web, in: Proceedings of the 28th ACM Conference on Hypertext and Social Media, 2017, pp. 165–174.

2. Becker and Bizer, Exploring the geospatial semantic web with DBpedia mobile, in: J Web Sem 7 (2009), 278–286.

3. Berners-Lee, Linked Data Design Issues, 2006.

4. Berners-Lee, Chen, Chilton, Connolly, Dhanaraj, Hollenbach, Lerer and Sheets, Tabulator: Exploring and Analyzing linked data on the Semantic Web, in: Proceedings of the 3rd International Semantic Web User Interaction, 2006.

5. Bizer and Gauss, Disco - Hyperdata Browser, http://wifo5-03.informatik.uni-mannheim.de/bizer/ng4j/disco/.

6. Bizer and Schultz, The R2R Framework: Publishing and Discovering Mappings on the Web, in: 1st International Workshop on Consuming Linked Data (COLD 2010), Shanghai, 2010.

7. Bouquet, Ghidini and Serafini, Querying The Web Of Data: A Formal Approach, in: 4th Asian Semantic Web Conference, vol. 5926, 2009, pp. 291–305.

8. Cheng and Qu, Searching Linked Objects with Falcons: Approach, Implementation and Evaluation, in: Int J Semantic Web Inf Syst, vol. 5, 2009, pp. 49–70.

9. Choi, Son, Cho, Sung and Chung, SPIDER: A system for scalable, parallel / distributed evaluation of largescale RDF data, in: Proceedings of the 18th ACM Conference on Information and Knowledge Management, 2009, pp. 2087–2088.

10. Choi, Song and Han, A survey on ontology mapping, in: SIGMOD Rec, vol. 35, 2006, pp. 34–41.

11. Correndo, Salvadores, Millard, Glaser and Shadbolt, SPARQL query rewriting for implementing data integration over linked data, in: Proceedings of the 2010 EDBT/ICDT Workshops, 2010, pp. 4:1–4.11.

12. David, Euzenat, Scharffe and Trojahn dos Santos, The Alignment API 4.0, in: Semant Web 2 (2011), 3–10.

13. Ding, Finin, Joshi, Pan, Cost, Peng, Reddivari, Doshi and Sachs, Swoogle: A search and metadata engine for the semantic web, in: Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, 2004, pp. 652–659.

14. Dou, McDermott and Qi, Ontology Translation on the Semantic Web, in: Journal on Data Semantics II, vol. 3360, 2005, pp. 35–57.

15. Euzenat, Polleres and Scharffe, Processing Ontology Alignments with SPARQL, in: Proceedings of the 2008 International Conference on Complex, Intelligent and Software Intensive Systems, 2008, 913–917.

16. Görlitz and Staab, Federated Data Management and Query Optimization for Linked Open Data, in: New Directions in Web Data Management 1, 2011, pp. 109–137.

17.

Hartig

and Langegger

A Database Perspective on Consuming Linked Data on theWeb, in: Datenbank-Spektrum, vol. 10, 2010, pp. 57–66.

18.

Hogan

, Harth

, Umrich

, Kinsella

, Polleres

and Decker

Searching and Browsing Linked Data with SWSE: The Semantic Web Search Engine, in: Web Semantics: Science, Services and Agents on the World Wide Web, vol. 9, 2011.

19.

Jain

, Hitzler

, Sheth

, Verma

and Yeh

Ontology alignment for linked open data, in: Proceedings of the 9th International Semantic Web Conference on The Semantic Web – Volume part I, 2010, pp. 402–417.

20.

Kalfoglou

and Schorlemmer

Ontology mapping: The state of the art, in: Knowl Eng Rev, vol. 18, 2003, pp. 1–31.

21.

Kossmann

The State of the Art in Distributed Query Processing, in: ACM Comput Surv, vol. 32, 2000, pp. 422–469.

22.

Langegger

, Wöß

and Blöchl

A semantic web middleware for virtual data integration on the web, in: Proceedings of the 5th European Semantic Web Conference on The Semantic Web: Research and Applications, 2008, pp. 493–507.

23.

Madhavan

, Bernstein

, Domingos

and Halevy

Representing and reasoning about mappings between domain models, in: Eighteenth National Conference on Artificial Intelligence, 2002, pp. 80–86.

24.

Makris

, Bikakis

, Gioldasis

and Christodoulakis

SPARQL-RW: Transparent Query Access over Mapped RDF Data Sources, in: Proceedings of the 15th International Conference on Extending Database Technology, 2012, 610–613.

25.

Makris

, Bikakis

, Gioldasis

, Tsinaraki

and Christodoulakis

Towards a Mediator Based on OWL and SPARQL, vol. 5736, 2009, pp. 326–335.

26.

Makris

, Gioldasis

, Bikakis

and Christodoulakis

Ontology mapping and SPARQL rewriting for querying federated RDF data sources, in: Proceedings of the 2010 International Conference on the Move to Meaningful Internet Systems: Part II, 2010, pp. 1108–1117.

27.

Motik

, Patel-Schneider

P.F.

and Parsja

OWL 2 Web Ontology Language, Structural Specification and Functional-Style Syntax. Available from: http://www.w3.org/TR/owl2-syntax.

28.

Noy

Semantic Integration: A Survey Of Ontology-Based Approaches, in: SIGMOD Record, vol. 33, pp. 2004.

29.

Oren

, Delbru

, Catasta

, Cyganiak

, Stenzhorn

and Tummarello

Sindice.com: A document-oriented lookup index for open linked data, in: Int J Metadata Semant Ontologies, vol. 3, 2008, pp. 37–52.

30.

Polleres

, Scharffe

and Schindlauer

SPARQL++ for mapping between RDF vocabularies, Proceedings of the 2007 OTM Confederated International Conference on the move to meaningful internet systems: CoopIS, DOA, ODBASE, GADA, and IS - Volume Part I, 2007, pp. 878–896.

31.

Prud’hommeaux

and Buil-Aranda

SPARQL 1.1 federated query, W3 Recommendation 2013.

32.

Quilitz

and Leser

Querying distributed RDF data sources with SPARQL, in: Proceedings of the 5th European Semantic Web Conference on the Semantic Web: Research and Applications, 2008, pp. 524–538.

33.

Scharffe

and de Bruijn

A language to specify mappings between ontologies, in: SITIS, 2005, pp. 267–271.

34.

Scharffe

, Ding

and Fensel

Towards Correspondence Patterns for Ontology Mediation, in: Proceedings of the 2Nd International Conference on Ontology Matching - Volume 304, 2007, pp. 296–300.

35.

Shvaiko

and Euzenat

A survey of schema-based matching approaches journal on data semantics IV, in: Journal on Data Semantics IV 3730 (2005), 146–171.

36.

Solomou

, Koutsomitropoulos

and Papatheodorou

Semantics-Aware Querying of Web-Distributed RDF(S) Repositories, 2008.

37.

Xin

A General Approach to Query the Web of Data, in: 9th International Semantic Web Conference (ISWC2010), 2010.

38.

W3C Web Ontology Language (OWL). Available from: https://www.w3.org/OWL/.

39.

W3C Resource Description Framework (RDF). Available from: https://www.w3.org/RDF/.

40.

W3C SPARQL 1.1 Query Language. Available from: https://www.w3.org/TR/sparql11-query/.