Abstract
Ontologies play a key role in the development of the Semantic Web and are being used in many diverse application domains such as biomedicine and e-commerce. An application domain may have been modeled according to different points of view and purposes. This situation usually leads to the development of different ontologies that intuitively overlap, but that use different naming and modeling conventions. The problem of (semi-)automatically integrating independently developed ontologies through mappings, is usually referred to as the ontology matching problem. Ontology matching systems, however, rely on lexical and structural heuristics, and the integration of the input ontologies and the mappings may lead to many undesired logical consequences, which could sensibly diminish their usefulness. The present paper, on the one hand aims at veryfing the hypothesis that classification of large ontologies via mappings still poses a challenge to OWL 2 reasoners. On the other it also explores the applicability of OWL 2 reasoning for the repair of unintended entailments (namely, unsatisfiable concepts or violations of the conservativity principle). In this paper we provide an update on the feasibility of using OWL 2 reasoners to repair the integration of ontologies via mappings, providing a more accurate evaluation of the feasibility of extracting all the justifications. Additionally, the current evaluation also encompasses the analysis of the use of OWL 2 reasoners for solving the violations of the so-called conservativity principle.
Introduction
The problem of (semi-)automatically computing mappings between independently developed ontologies is usually referred to as the ontology matching problem. A number of sophisticated ontology matching systems have been developed in the last years [8, 29]. Ontology matching systems, however, rely on lexical and structural heuristics and the integration of the input ontologies and the mappings may lead to many undesired logical consequences (e.g., unsatisfiable classes or violations of the conser-vativity principle).
The fix of undesired logical consequences caused by ontology mappings is known as the mapping repair problem [14]. Mapping repair can be addressed using state-of-the-art approaches for debugging OWL 2 ontologies, which rely on the extraction of justifications for the unwanted axiom (e.g, [13, 34]). However, in [12] it was pointed out that justification-based technologies do not scale when the number of such axioms is large (a typical scenario in mapping repair problems).
This paper extends our previous evaluations presented in [12, 32], in different respects. [12] provided a first evaluation on the use of OWL 2 reasoning for the classification and repair of integrated ontologies. [32] also considered reasoners for OWL 2 profiles (e.g, ELK), and tested the extraction of more than a single justification for computing a repair for logical violations represented by incoherent classes.
Our extended evaluation is based on the datasets and ontology matching systems from the Ontology Alignment Evaluation Initiative (OAEI) [8]. In addition to the previous versions of the evaluation, in this paper we also: consider the so-called violations of the conservativity principle, that is, novel axioms entailed by the aligned ontology, involving elements of one of the two input ontologies, that are not entailed by the input ontologies in isolation [1, 31]. provide extended experimental results concerning the extraction of (a subset of) all the justifications for a given entailment, providing additional insights on the problem. compare the black-box justification extraction techniques with one of the latest glass-box approaches based on tracing, for the optimization of the justifications extraction.
Our results suggest that the classification of the integration of large ontologies via mappings still poses a challenge to OWL 2 reasoners. Furthermore, the repair of unintended entailments (e.g., unsatisfiable concepts or conservativity violations) using OWL 2 reasoners critically compromises the performance of mapping repair systems in the best case, or it is simply not tractable when all the justifications need to be extracted in order to compute an optimal repair.
The remainder of the paper is organised as follows: Section 2 introduces the needed preliminaries, Section 3 describes the dataset, the environment used for the evaluation, and also provides and discusses in detail its results. Finally, Section 4 concludes the paper.
Preliminaries
In this section, we present the formal representation of ontology mappings (Section 2.1), the notions of semantic difference (Section 2.2), mapping coherence and conservativity principle violations (Section 2.3).
Representation of ontology mappings
Mappings are conceptualised as 4-tuples of the form 〈e1, e2, n, ρ〉, where e1, e2 are entities in the vocabulary or signature of the relevant input ontologies 𝒪1 and 𝒪2 (i.e., e1 ∈ Sig𝒪1 and e2 ∈ Sig𝒪2), n is a confidence measure between 0 and 1, and ρ is a relationship between e1 and e2, typically subsumption (i.e., e1 is more specific than e2), equivalence (i.e., e1 and e2 are synonyms) or disjointness (i.e., e1 and e2 cannot share individuals) [7].
RDF Alignment [5] is the main format used in the Ontology Alignment Evaluation Initiative (OAEI) to represent mappings containing the aforementioned elements. Additionally, mappings are also represented as OWL 2 subclass, equivalence, and disjointness axioms [4]; mapping confidence values (n) are then represented as axiom annotations. Such a representation enables the reuse of the extensive range of OWL 2 reasoning infrastructure that is currently available. Note that alternative formal semantics for ontology mappings have been proposed in the literature (e.g., [2, 23]), but they are out of scope of the present article because the reasoning in the aligned ontology cannot be achieved directly with OWL 2 reasoning infrastructure, rather it requires custom reasoning facilities.
Semantic consequences of the integration
The ontology resulting from the integration of the two ontologies 𝒪1 and 𝒪2 via a set of mappings ℳ typically entails axioms that do not follow from 𝒪1, 𝒪2, or ℳ alone. These new semantic consequences can be captured by the notion of deductive difference [18, 19].
Intuitively, the deductive difference between 𝒪 and 𝒪′ w.r.t. a signature Σ is the set of entailments constructed over Σ that do not hold in 𝒪, but do hold in 𝒪′. No algorithm is available for computing the deductive difference for DLs more expressive than ℰℒ, for which the existence of tractable algorithms is still open [18].
Thus in this paper we rely on the approximation of the deductive difference given in Definition 1. This approximation only requires comparing the classification hierarchies of 𝒪 and 𝒪′ provided by an OWL 2 reasoner, and it has successfully been used in the past in the context of ontology integration [13].
Violations and mapping repair
As already discussed in Section 2.2, the notion of deductive difference can capture the novel entailments of an aligned ontology w.r.t.the input ontologies. However, some of these entailments may be undesired, and are called violations, stemming from erroneous mappings in ℳ, or from an inherent incompabilities between the input ontologies 𝒪1 and 𝒪2.
A set of mappings that leads to unsatisfiable classes in 𝒪1 ∪ 𝒪 2 ∪ ℳ is referred to as incoherent w.r.t 𝒪1 and 𝒪2 [21], as formalized in Definition 2. Analogously, a set of mappings that leads to violations of the conservativity principle in the aligned ontology 𝒪1 ∪ 𝒪 2 ∪ ℳ is referred to as nonconservative w.r.t 𝒪1 and 𝒪2 [1, 31], as formalized in Definition 3.
More generally, an alignment being incoherent and/or nonconservative, is called problematic, as introduced in Definition 4.
A problematic set of mappings ℳ can be fixed by removing mappings from ℳ. This process is referred to as mapping repair (or repair for short).
A trivial repair is ℛ = ℳ, since an empty set of mappings is obviously nonproblematic. Nevertheless, the objective is to minimize a loss function over the alignment (e.g., to remove as few mappings as possible or to minimize the total confidence of the removed mappings). Minimal (mapping) repairs are typically referred to in the literature as mapping diagnosis [20] — a term coined by Reiter [25] and introduced to the field of ontology debugging in [28].
In the literature there are different approaches to compute a repair or diagnosis for an incoherent set of mappings. Early approaches were based on Distributed Description Logics (DDL) (e.g., [22–24]). Alternatively, if mappings are represented as OWL 2 axioms, a repair or diagnosis can also be computed using the state-of-the-art approaches for debugging and repairing OWL 2 ontologies, which rely on the extraction of justifications for the undesired entailments (e.g., [13, 34]).
“A justification for an entailment in an ontology is a minimal subset of the ontology that is sufficient for the entailment to hold. The set of axioms corresponding to the justification is minimal in the sense that if an axiom is removed from the set, the remaining axioms no longer support the entailment.” [10] Definition 7 formally introduces the notion of justification.
In ontology matching scenarios the use of incomplete reasoning techniques to enhance scalability is very frequent (e.g., [1, 31]). Incomplete reasoning leads to an approximate repair ℛ≈, i.e., there is no guaranteee that ℳ \ ℛ≈ is nonproblematic, but the number of violations caused by the original set of mappings ℳ tends to be reduced while minimizing the loss function over the originalalignment.
Given that the justifications for an entailment are usually exponential in the size of the ontology, [10] approximate mapping repair techniques, based on the extraction of a single justification, has been successfully used in the past to achieve scalability (e.g., LogMapFull [11]). For this reason, our empirical evaluation does not only consider the (limited) extraction of all the justifications, but also the computation of a single one, as described in details in [10].
Experimental evaluation
This section describes the conducted experimental evaluation. In Section 3.1 we present the used datasets and mapping sets. Section 3.2 introduces the evaluation setting. The obtained results are discussed in Section 3.3.
Datasets
The datasets are based on the OAEI, an international campaign for the systematic evaluation of ontology matching systems. The matching problems in the OAEI are organised in several tracks, with each track involving different kinds of test ontologies [3, 8]. In this paper, we have focused on the anatomy, largebio, library and conferencetracks. For largebio we we used both the small setting, in which reduced fragments of FMA, NCI and SNOMED CT are employed (where the fragments are relevant portions of one of the ontologies with respect to the other two), and the big setting, employing the whole ontologies (at the exception of SNOMED CT, for which a large fragment for both FMA and NCI is used). Library is composed by not very expressive medium-sized ontologies, while conferenceontologies are very expressive but of limited size. Anatomy is composed by a fragment of NCI ontology (named HUMAN in this context to avoid confusion with the largebio dataset) involving human anatomy, that should be matched with an ontology describing the anatomy of mice (called MOUSE ontology). Table 1 summarizes the metrics of the selected ontology pairs for the evaluation, while Tables 2–4 provides the details about the selected subset of mapping sets computed by ontology matching systems participating in the OAEI 2013 and 2014 campaigns. Due to the excessive time required for running a so expensive evaluation, we were forced to select only a representative subset of the computed mappings sets (we have selected, for each track, the alignments with the highest or lowest precision and recall values) 1 . Please refer to [3, 6] for more information about the datasets and ontology matching systems.
Evaluation settings
System Details. The test environment consists of a desktop computer equipped with 32GB DDR3 RAM at 1333MHz and an AMD Fusion FX 4350 (quad-core, each running at 4.2GHz) as CPU. The dataset is stored on a 128GB SSD, where the operating system (Ubuntu 12.04, 64-bit version) is installed. The employed build of Java Runtime Environment (JRE) is 1.8.0_45-b14, while the one for the Oracle 64-Bit Java Virtual Machine (JVM) is the 25.45-b02 (mixed mode). The amount of memory allocated for the heap of the JVM is 12GB, the processes not involved in the evaluation require approximately 3GB of space, thus leaving 17GB of free RAM (plus 1.8GB of swap memory, that is not used unless totally necessary 2 ).
Tested Reasoners. The versions of the employed reasoners are: (i) Konclude 0.6.0-408 64-bit [33] (linux binaries), ELK0.4.2 17 3 Pellet2.3.1 [30], HermiT1.3.8 [9].
ELK, Pellet, and HermiT implement the OWLReasoner interface of the OWL-API and they all are called on a fresh thread. A timeout on the classification task is enforced by killing the thread after reaching the timeout value, times are measured using the getNanoSec function, because it measures the elapsed time without skew corrections. 4
ELK is a (very fast) reasoner for the OWL 2 EL profile, thus it cannot guarantee complete results for ontologies outside this profile.
Konclude does not implement the OWL-API’s OWLReasoner interface and its invocation through OWLlink1.2.1 is raising an OWLlinkReasonerRuntimeException exception caused by an IndexOutOfBoundsException exception during the parsing of most of the ontologies in our dataset (expressed in OWL/XML format). Thus, Konclude is instead called using an external process, 5 using the ProcessBuilder class, 6 and it is allowed to use all the available cores. For Konclude, timeout on classification is enforced using timeout program for Linux, 7 and wall-clock time is measured using the time program. 8 .
Note that due to its invocation, the measured time for Konclude also includes the loading time of the aligned ontology. However, given that we aim at verifying the feasibility of using full OWL 2 reasoners in a mapping diagnosis context, and not at comparing reasoners, all these biases are not influencing our analysis.
It was not possible to extend our analysis to FaCT++ because its invocation using JNI is permanently failing with a StackOverflowError.
Justification Extractor. Our evaluation is based on the black-box justification extractor described in [10]. 9 Black-box extractors typically allow to use any reasoner implementing the axiom pinpointing service.
In addition to our previous evaluation, we have also compared the performance of the aforementioned black-box approach using ELK0.4.2, and the glass-box trace extraction technique offered by ELKreasoner [16]. The tracing functionality offerred by ELKkeeps track of the subset of the ontology used for the entailment check of the axiom of interest, then the black-box justification module is run only against this relevant subset of the ontology, allowing, in principle, a significant reduction of the required runtime. [10] When we refer to this additional functionality offered by ELK, the reasoner will be referred to as ELKtrace .
Conducted evaluation
The evaluation algorithm is presented in Algorithm 1, which takes as input a pair of ontologies (𝒪1 and 𝒪2) and an alignment ℳ between them from the datasets described in Section 3.1. For each of the available reasoners we compute the classification 10 and record the classification times in seconds (see Tables 5–9 and Class.(s) in Tables 10–31). Then, if the classification succeeds, we record the number of unsatisfiable concepts (#Unsat in Tables 10–31). For at most 50 of them, we compute justifications 11 (a single one and up to a maximum of 50 justifications, recording the total time in seconds required for completing the respective operations (1Just.(s) and 50Just.(s) in Tables 10–31, respectively). In addition, we also keep track of the percentage of violations for which neither error nor timeout occurred (% 1JustOK and %50JustOK). When a timeout or an error occurs, the measured time is only a lower-bound of the real required time.
Analogously, the number of detected conservativity violations are recorded (#Viols in Tables 10–31), and, for at most 50 of them, we compute justifications.
Notice that when computing more than one justification, the timeout is enforced for each single justification computation, and reaching a timeout on one of them stops the whole computation. This is motivated by the fact that the search space for justifications is explored by increasing the number of axioms composing them, and it is therefore reasonable that at the first failure the justification module will not be able to find other ones within the given time limit.
Classification. In Tables 5–9 the classification time for a selection of the testcases is shown.
For the anatomydataset (Table 5), all the reasoners succesfully managed to compute the classification of the aligned ontology. The only reasoner that experienced problems was Pellet, that permanently raised a ConcurrentModificationException exception on several aligned ontologies of this dataset.
In the largebio-smalldataset (Table 6), for the mapping sets involving FMA and NCI, none of the reasoners failed at classifying the aligned ontology. For the FMA and SNOMED testcase, Pelletfailed to classify, due to timeouts (T/OUT), half of the cases. For the integration of SNOMED and NCI, Pelletfailed due to timeouts in all the cases, and exclusively ELKcould classify the aligned ontology in 3 out of 7 testcases. 12 Konclude, for instance, failed with an out of memory error (OOM), and HermiTreached the timeout.
For the largebio-bigdataset (Table 7), not all the mapping sets are available because some of the ontology matchers failed at computing the alignment for this extended case. Regarding classification, the results are very similar to the small version of the dataset, with a little increase of the number of timeouts for Pellet due to the increase of size of the aligned ontologies.
For library (Table 8), instead, the reasoners succeeded in most of the cases (with the partial exception of Pellet), but only Konclude managed to classify, within the timeout, the integrated ontology via the mappings computed by XMapGen. These mappings include an extremely high number of many to many correspondences, that caused problems to all the reasoners but Konclude.
Concerning conference, the classification could be performed in the vast majority of the cases, with only a single failure for both HermiT and Pellet. In Table 9 we only report the cases in which the classification either required a time greater than 1 second, or during which a timeout or error occurred.
Computation of Justifications for Unsatisfiabilities. Tables 10–30, instead, show the details for justification computation for the existing unsatisfiabilites, for relevant cases. For the library dataset, the results are omitted due to the lack of unsatisfiable classes in the aligned ontologies (the input ontologies are simple and they do not contain disjointness axioms). Note that Konclude, since it was invoked from the command line, could not be evaluated on the justification extraction tasks, due to the limitation of using it from the command line. Missing rows mean that the corresponding reasoner failed at classifying the integrated ontology, due to timeout or an error, and has been excluded.
Note that, the computed times in the Tables 10–30 are only for 50 unsatisfiable classes. Thus, the total times given below for all unsatisfiable classes have been extrapolated from these results.
A subset of the results for the anatomydataset is reported in Tables 10-11- and Tables 10-11. Consider for instance Table 10a, which presents the justification extraction for the mapping set computed by MaasMatch. Computing a single justification for each of the 5, 972 unsatisfiable classes, would require for ELK2h (68s for 50 unsatisfiable classes), while > 11 days for computing fifty of them (>2h for 50 unsatisfiableclasses).
The available version of ELKtrace is still in beta quality, and therefore we experienced some problems in the justification extraction process. For this reason, despite the great improvement in the runtime for computing justifications, it exhibited a high number of errors during the evaluation. For instance, for the present task, ELKtrace did not manage to correctly compute any justification. When HermiTis used, >37m and >90h would be required, respectively. These times are surprisingly lower than the corresponding for ELK, despite the latter is an approximated reasoner.
While already the reported times could be considered to be incompatible with “online” repair scenarios, we also need to consider that the runtime is a lower-bound, limited to at most 50 justifications, of the one extracting all the justifications, whose number is usually exponential in the size of the ontology.
Considering the conference dataset, composed by small sized ontologies having high expressivity, we also find cases that could not be compatible with an “online” mapping repair (e.g., almost two hours for HermiT in Table 14a, >53m for Pellet in Table 13a and >47m for ELK in Table 13a). Notice also that in this last case, only 16% of the computations involving Pellet finished before the timeout, and therefore the total runtime is expected to be much higher. The worst result for ELKtrace, instead, is in Table 15a, with a runtime >22m.
For the largebio datasets, shown in Tables 16–30, both big and small variants, the values are definitely higher and not affordable for an online repair process. Computing a single justification for each unsatisfiable concept in the largebio testcase of Table 20a would require >63h for ELK, >6h for ELKtrace, >68h for HermiT, while >613 days, >485 days and >51 days for computing 50 of them, respectively.
Even considering a testcase of largebio-small, with a fairly limited number of unsatisfiable classes (2058 for the mapping set involving FMA and SNOMED and computed by GOMMA, Table 25a), the runtime is still prohibitive if more than one justification is computed. Indeed, computing a single justification would require >10m for ELK, >44s for ELKtrace, >6m for HermiT and >4m for Pellet, while >24 days, >32h, >10 days and >7 days for computing 50 of them, respectively.
It is evident that the proposed runtimes for largebio might be only acceptable in an off-line mapping repair process for the small testcases, while they are not affordable for the largest ones.
Computation of Justifications for Conservativity Violations. In Tables 10–31 we show the details for justification computation for the conservativity violations, restricted to the relevant cases.
What is evident from the analyzed testcases is that, the average runtime required for computing the justifications for a conservativity violation (that is, a subsumption axiom between named classes) is in general lower than the time required for computing the justifications for an unsatisfiable class.
On the other hand, it is also true that the number of violations is usually higher than the number of unsatisfiabilities, with millions of violations like in the library testcase, and this is balancing the required time for computing justifications.
For instance, for the anatomy dataset, in Table 10b, computing a single justification requires >25m for ELK, >5h for ELKtrace, and >4m for HermiT, while >50h, >43h, and >16h for computing 50 justifications, respectively. It is again notable that the required runtimes for HermiT are lower than the corresponding for ELK (both variants).
The conferencedataset exhibits a very limited number of conservativity violations, in general, favored by the extremely reduced size of the signature of the involved ontologies. An exception is represented by the testcase of Table 13b, where computing a single justification requires less than 2 seconds for all the tested reasoners, but computing 50 justifications caused 32% of timeouts for Pellet. All the other reasoners were still able to accomplish the task within one minute.
For largebio, Table 26b presents a very limited number of conservativity violations (2270), that paired with the reduced runtime for the computation of their justifications, as already discussed, requires at most 10 seconds for computing a single justification, and at most 39 minutes for computing 50 justifications. Even if the required time can appear to be limited, we need to remind again that in order to provide a complete and minimal repair, all the (possibly exponentially many) justifications must be computed, while the proposed runtime is for computing at most 50 justifications.
In other largebio testcases, such as that described in Table 28b, ELK would need 12 hours to compute a single justification (ELKtrace raised an error on 90% of the computations and its runtime cannot be compared), while >127 days would be required by ELK, and >5 days by ELKtrace for computing 50 justifications.
The highest number of conservativity violations occurs in the library dataset, favored by the extremely high number of many to many correspondences that its mapping sets exhibit. For instance, in the testcase of Table 31b, the required time for the computation of a single justification is >229 days for ELK, and >18 days for HermiT, while for computing 50 justifications the runtime is >48 and >7 years, respectively. The results for ELKtrace are omitted because, for the selected testcases, all the single justification computation failed, and all that for multiple justifications reached the timeout. Similarly to the anatomy testcase, also for library the runtime for justification computation with HermiT has been lower than that of ELK.
It is again evident that despite the smaller runtime required for computing justifications for conservativity violations, the total runtime would be prohibitive in the majority of the cases, also for off-line repair scenarios.
Conclusions
In this paper, we have extended previous evaluations of the feasibility of using OWL 2 reasoning capabilities in mapping repair related tasks, under different aspects. Firstly, the evaluation has been extended to extract more justifications than in the previous evaluation, in order to provide more accurate results. Additionally, the black-box justification extraction method has also been compared against glass-box techniques offered by ELKreasoner, which provides the capability of extracting a trace of the relevant subset of an ontology, for the entailment check of a given axiom. Finally, we also analyzed the performances of justification extraction for repairing the so-called conservativity principle.
Our empirical results on the performances of several top-level reasoners suggest that the classification of the integration of medium/large size ontologies via mappings is hard to compute for current OWL 2 reasoners. Furthermore, when OWL 2 reasoners are to be used in mapping repair tasks, the computation time increases considerably, and in the majority of the cases it becomes impractical, even when restricting to reasoners for one of the OWL 2 profiles. From the extended empirical evaluation presented in this paper, the problem is even exacerbated when considering conservativity violations, in addition to consistency ones.
Hence, the integration of ontologies via mappings seems an ideal reasoning benchmarks, and its hardness motivates the interest in approximated repair techniques in the context of ontology matching.
Acknowledgments
This work was supported by the EU FP7 IP project Optique (no. 318338) and the EPSRC project Score!. Alessandro Solimando has been fully supported by the AI*IA 2014 Mobility Grant for visiting Ernesto Jiménez-Ruiz at the Knowledge Representation and Reasoning group, University of Oxford, UK. We also thank the unvaluable help provided by Bernardo Cuenca Grau and Ian Horrocks.
Due to space reasons we can only present a subset of the computed evaluation, a technical report with the full analysis is available at ftp://ftp.disi.unige.it/person/SolimandoA/aijournaltr.pdf.
This behaviour is enforced by means of the swappiness Linux kernel parameter set to 0, see for more information.
Version compiled on the 29th of May 2015 from the sources available at https://github.com/klinovp/elk/tree/feature/tracing-complete.
https://docs.oracle.com/javase/8/docs/api/java/lang/System.html#nanoTime–.
Konclude is runned with “Konclude classification -w AUTO -i aligneOntology.owl”
With the command “timeout –preserve-status –s TERM timeoutVal cmd”
Using “/usr/bin/time -f % E cmd” command.
Current version available at https://github.com/matthewhorridge/owlexplanation. For the experiments we used the version available here: https://github.com/protegeproject/mvn-repo/tree/master/releases/org/semanticweb/owl/explanation/3.3.0
With a timeout of 10, 60, 20 and 10 minutes for anatomy, largebio, library and conference, respectively.
With a timeout of 60 seconds to find each new justification. ELKtrace has a timeout of 60 seconds for computing the trace, and than another timeout of 60 seconds for computing each justification.
Note that ELK is an OWL 2 EL reasoner and since NCI falls outside the OWL 2 EL profile, the classification computed by ELK for the integration of SNOMED and NCI is incomplete.
