The knowledge increase estimation framework for ontology integration on the concept level

Abstract

Nowadays, due to the high level of data distribution, it is frequently impossible to generate a unified representation of a variety of heterogenous data sources in a single step. Dividing the integration process into smaller subtasks and their parallelization can solve this problem. Unfortunately, it entails difficulties concerning the initial classification of data sources into groups that can be independently integrated, and serve as an input for the final integration step. The problem becomes even more complicated when not only raw data is required to be integrated, but the designed system is expected to perform more expressive integration of heterogenous knowledge representations, such as ontologies. In our previous work [10] we have proved both analytically and experimentally that such approach to the integration task can increase its effectiveness in terms of the time required to obtain the final result. In this article we intend to explore the issue of selecting initial classes of ontologies based on the novel notion of the knowledge increase. This indicator can be computed before the integration and moreover answer the question concerning whether this integration is viable. This not only simplifies the initial distribution of aforementioned subtasks, but can also be used as a stop condition during subsequent steps of the integration.

Keywords

Ontology integration knowledge management consensus theory

1 Introduction

The integration of ontologies is a topic that gained high attention throughout the years. All of the available publication (both old like [2] and very recent like [13]) emphasize the fact that it can be very time- and cost-consuming. It requires processing of many concepts, their attributes, instances and relations. This problem is even more complicated if integrated ontologies are geographically distant which may entail major delays on the level of raw data transfer. Moreover, their inner integrity is a crucial requirement because to their content can be sensitive (for example - ontologies may be involved in cancer treatment [29]).

The aforementioned problems imply some research questions and in literature we have found some studies about the following matters: how to decrease the cost of sending and processing a semantically enriched big set of data [9, 10]? How to preserve the integrity during and after the procedure [5]? How to analyze partial results [15]?

Despite this high attention, there are still some questions that remain unanswered, mainly touching issues concerning the profitability of the ontology integration process. In other words - is the integration process worth performing or not? In order to achieve that, it is necessary to assess what is the growth of knowledge after the set of ontologies has been integrated creating one, single, unified knowledge base. Therefore, our goal is to establish a balance between the cost and the effectiveness of the integration process.

In the literature it is possible to find some measures which allow to calculate the effectiveness of the integration [14], for example: completeness, precision or optimality. The first one allows to check how much knowledge is lost after the integration. Precision estimates how many elements are duplicated and how many new elements are introduced after the integration. The last one tells how close the output of the integration is to its inputs in terms of some accepted distance measure.

Despite many advantages and certain simplicity these functions have one, serious defect - all of them require the integration to be performed beforehand and only afterwards they can be used to evaluate obtained results. As it was mentioned, in many cases the integration is very complicated. For example, lets assume that some company’s board needs to make a decision based on an analysis of information gathered from this company’s branches. Data used in eventual ontologies are stored in many distant locations - for example each branch of the company stores different portion of sales information.

Because of the complexity of ontologies and their semantic expressiveness, preparing the unified view on all of them is a difficult task. Before performing the integration it would be useful to know if the whole process itself is profitable and how many benefits it may bring. In other words, how much knowledge will be gained thanks to the conducted integration. If this growth is significant enough, the company can decide about performing the integration without any (or very little) risk.

In our previous work [8, 10] we have proposed a multi-level approach to ontology integration. During the developed procedure, integrated ontologies are divided into explicitly assumed classes. For each of such groups the partial integration (that incorporates the algorithm taken from [16]) is performed in order to obtain a separate ontology. Eventually, all of the partial results are combined into a final ontology using the same integration algorithm, which was used in preceding steps. It is somehow similar to the map-reduce approach commonly incorporated in many practical applications [11]. As a result, we have demonstrated that the multilevel integration process can be even 20% faster than the single-level approach, thanks to a parallelization of the fragmentary calculations. What is more, we have analytically proven that the outcomes of the integrations done on one level and on multiple levels are identical. The biggest problem we have encountered concerned the initial classification of ontologies into subgroups - we have not found any formal method that could be used to perform this task or indicate the correctness of accepted classification.

In this paper we want to revise and extend our previous research. The main contribution is to develop a novel measure which allows to estimate a potential growth of knowledge that will be gained after the integration of a set of ontologies. As stated in our previous research and a variety of related articles (e.g. [25]) the considered process is always preceded by finding which elements of source ontologies may be integrated, using one of the methods of ontology alignment. We claim that during this stage it is also possible to assess qualities of a final outcome and partial results (if a multilevel approach is used). In other words - our measure can be used to evaluate the integration before it is even performed. Moreover, it is build on a well grounded theoretical background originating from the ontology alignment framework that we have developed in the past [22]. This can be useful in medical application, for example, when several disease ontology are needed to be integrated with semantic networks concerning genes or phenotypes (like in [12]). Estimating the gain of knowledge before performing such vast procedure can be invaluable.

As aforementioned, our previous achievements concern the multi-level approach to the integration process of distributed and complex data. This can be used to illustrate one of the applications of the proposed measure. For example, the intended approach can simplify the initial classification of input ontologies into appropriate classes. It can also serve as a stop condition during subsequent levels of integration when a satisfactory level of knowledge increase is about to be reached. In other words, it can be somehow treated as a quality of the integration. Unfortunately, due to the limited space, in this paper we will only consider the concept level of ontologies.

The remaining part of this paper is organized as follows. In the next section we give a summary of related works. Section 3 contains the introduction to ontologies and basic notions used throughout our research. In Section 4 a description of revised methodology for the multilevel ontology integration can be found, which is formally analyzed in Section 5. In Section 6 we describe the measure that can be used to estimate the increase of knowledge after the ontology integration. Section 7 recalls the results of the conducted experiment and eventually propose, how the notion on knowledge increase can be used in conjunction with the multilevel approach to ontology integration. Section 8 presents the knowledge increase obtained during the integration of ontologies performed in the earlier section. The possible application is considered along with a pragmatic analysis of obtained values. The last section focuses on our upcoming research plans and concludes the paper.

2 Related works

In the literature it is possible to find some measures which allow to calculate the effectiveness of the integration process. Most of them are, however, rather difficult to calculate and the integration needs to be performed beforehand. They can only evaluate the integrated ontology once it has been designated, and not to estimate the potential profit before any resources are assigned to perform the integration.

Authors of [28] presented a theoretical and practical aspects of the knowledge integration process in companies. They demonstrated that the time it takes the company to integrate external sources of information depends on the attributes of the knowledge source and company’s own internal capabilities. Moreover, a paper highlighted the fact that the speed of integration depends on accumulated experiences. Similar issues where covered in [23] where authors focused on identifying sources of knowledge that collectively satisfy the quality requirements and budget limitations of their applications.

In [27] authors proposed a model called OntoQA that analyses ontology schemas along with their populations and describes them through a well defined set of metrics. Authors defined two categories of metrics. The schema metrics addressed the design of the ontology which indicates its richness, taxonomy’s width and depth, and an inheritance of an ontology schema. The instance metrics were grouped into two categories: knowledge base metrics, which describe the knowledge base as a whole, and Class metrics, which described the way each class is being utilized in the knowledge base. The main disadvantage of proposed metrics is their excessive simplicity and the lack of connection between concepts, relations, instances. All of the described metrics are calculated in separation with each other and none of the metric can give a big picture of the quality of the performed integration.

Supekar in [26] proposed a methodology representing a qualitative and quantitative analysis of ontologies and their classification. Authors defined a quality of ontology knowledge and computed the Degree of Quality. Additionally, a formal method that determines the level of quality of ontology based on characterization of knowledge was designed. The model which characterizes the quality of knowledge were based on cognitive adequacy, context (quantifiable features) and veracity, complexity and specificity. Based on this model, the correctnesses of domain ontologies by ranking them according to their degree of correctness was determined. The proposed measure is focused on a description of ontologies and their usefulness, therefore, it is not generic due to its relation with domains of ontologies. The measure described above evaluates quality of ontologies and does not consider the integration of them.

An interesting approach to assessing the increase of knowledge gained after the integration (treated as a knowledge of some collective, not a result of merging several ontologies) can be found in [17]. Authors investigated the influence of the inconsistency degree on the collective knowledge increase by taking into account the number of members of this collective. The idea behind this was based on a remark that the more consistent the collective’s knowledge is, the smaller the diversity in the opinions of collective’s members. Therefore, the smaller amount of knowledge is eventually gained.

In [1] an information quality analysis of schemas in data integration environments was presented. The evaluation of schema quality focusing on minimality, consistency and completeness aspects was discussed. Authors defined entity and relationship redundancy degree and based on those measures the schema minimality was proposed. Schema completeness was calculated based on a number of concepts. The overall schema type consistency score was estimated based on the total number of consistent attributes. Despite that all of defined measures are very useful, unlike our approach, they require that the integration process is conducted beforehand.

Andrew Frank [7] proposed a novel definition of data quality understood as “fitness for use” in the sense an executable approach. This definition led to a method of assessing the influence of data imperfections on the number of errors in the final decisional process. Data quality was considered from philosophical point of view and unfortunately no formal measures of data quality was presented.

In Ceusters [3] a whole methodology of assessing the quality of ontologies that are merged or mapped was described. This paper contains only theoretical consideration about some factors which can be used to assess the adequateness of both, the original ontologies, and the outputs of matching and merging.

There are some publications focusing on similarity measures which are used in ontology integration and alignment like [6, 16] or [22]. However, none of these works do not allow to estimate the knowledge increase and evaluate a profitability of the eventual integration. In this paper we would like to propose a measure without the flaws that have been described above.

3 Basic notions

In our research we accept the notion of the real world represented by a pair (A,V), where A is a set of attributes that can be used to describe objects from that world, and V is a set of valid valuations of these attributes. Assuming that V_a is a domain of an attribute a taken from the set A the following condition holds: V = ⋃ _a∈AV_a.

We define an ontology as follows: $O = (C, R, I)$ (1)

where C is a finite set of concepts, R is a finite set of relations between concepts R = {r₁, r₂,. . . , r_n}, n ∈ N, r_i ⊂ C × C for i ∈ [1, n] and I is a finite set of instances.

Every concept from the set C is represented as a triple: $c = ({Id}^{c}, A^{c}, V^{c})$ (2) where Id^c is a unique label, A^c is a set of its attributes and V^c is a set of domains (sets of values) of these attributes (V^c = ⋃ _a∈A^cV_a).

We say that an ontology O is (A,V)-based if ∀_c∈CA^c ⊆ A and ∀_c∈CV^c ⊆ V are simultaneously met.

Raw attributes from the set A do not have any semantics - they obtain a possibility to be interpreted only when they become included within selected concepts. We assume the existence of a set containing atomic descriptions of attributes denoted as D_A (for example - this set can contain a label year_of_birth). Subsequently we define a sub-language $L_{s}^{A}$ of the sentence calculus built from elements of D_A and logic operators of conjunction, disjunction and negation. Eventually, by the semantics of attributes we understand a following function: $S_{A} : A \times C \to L_{s}^{A}$ (3)

For example, an attribute “Date_of_birth” within a concept “Person” obtains the following semantics: S_A (Date _ of _ birth, Person) : year _ of _ birth ∧ month _ of _ birth ∧ day _ of _ birth ∧ age

Considered tool gives us the opportunity to formally define different meanings that one attribute can express while being a part of different concepts. For example, an attribute title conveys different meaning when it is used in a concept Book and entirely different when being a part of a concept Person. This allowed us to formally define equivalency (denoted as ≡), generalization (denoted as ↑) and contradiction (denoted as ↓) between attributes:

Two attributes a ∈ A^c, b ∈ A^c′ are semantically equivalent if the formula S_A (a, c) ⇔ S_A (b, c′) is a tautology for any two c ∈ C, c′ ∈ C′. The symbol ≡ will be used to designate such relationship.

The attribute b ∈ A^c′ in concept c′ ∈ C′ is more general than the attribute a ∈ A^c in concept c ∈ C if the formula S_A (a, c) ⇒ S_A (b, c′) is a tautology for any two c ∈ C, c′ ∈ C′. To denote this relationship the symbol ↑ will be further used.

Two attributes a ∈ A^c, b ∈ A^c′ are in semantical contradiction if the formula ¬ (S_A (a, c) ∧ S_A (b, c′)) is a tautology for any two c ∈ C, c′ ∈ C′. To denote this relationship the symbol ↓ will be used.

In consequence, such approach gives us the opportunity to formulate a novel ontology alignment framework, presented in details in [22].

To describe the overall meaning that a particular concept can carry, we define a context function with the following signature: $ctx : C \to L_{s}^{A}$ (4)

Assuming that a concept c has a set of attributes with cardinality equal to n (|A^c| = n) and contains attributes {a₁, a₂,. . . , a_n} we define the above function as follows: $ctx (c) = S_{A} (a_{1}, c) \land S_{A} (a_{2}, c) \land . . . \land S_{A} (a_{n}, c)$ (5)

We also introduce a set D_R containing descriptions of relations and in consequence another sublanguage of the sentence calculus $L_{s}^{R}$ that is used in a function representing semantics of relations from the set R: $S_{R} : R \to L_{s}^{R}$ (6)

An instance i from the set I is defined as a triple i = (id, A_i, v_i), where id is its unique identifier, A_i is a set of assigned attributes and v_i is a function v_i : A_i → ⋃ _{a∈A_i}V_a that is used to give certain values from the corresponding sets V_a to elements of the set A_i. If there exists some concept c = (Id^c, A^c, V^c) such that A^c ⊆ A_i and ∀_{a∈A_i∩A^c}v_i (a) ∈ V^c we say that i = (id, A_i, v_i) is its instance.

In the further parts of our paper we will accept the existence of a function: $dist : L_{s}^{A} \times L_{s}^{A} \to [0, 1]$ that will be used to calculate a distance between logic statements expressed using the language $L_{s}^{A}$ . Due to the limited space we cannot fully present them - for broad description please refer to our previous article [21].

In our alignment framework this function has been used to investigate how far are two descriptions of attributes from two concepts. In this work we will use it to measure how much knowledge we have gained after an integration of two ontologies on the concept level. We assume that it accepts the contexts (defined according to the Equation 5) of concepts from two ontologies as an input, therefore it has the following signature d_S : C₁ × C₂ → [0, 1] and can be defined as following: d_S (c₁, c₂) = dist (ctx (c₁) , ctx (c₂)). Obviously it also meets the criteria of identity of indiscernible d_S (c₁, c₂) ⇔ c₁ = c₂ and symmetry d_S (c₁, c₂) = d_S (c₂, c₁).

4 The multilevel approach to the task of ontology integration

According to [18], the task involving a knowledge integration can be described as a process of combining several, independent knowledge bases into a single, unified structure containing all of the knowledge extracted from the input sources witch the exclusion or resolution of potential conflicts and inconsistencies that may occur. The high complexity of required transformations may cause that in some cases it is impossible to achieve the required output using only one-level integration in which all of the incoming knowledge is processed at the same time. Other issues that may be a reason of such difficulties are memory consumption or simply geographical distance between knowledge sources that entails an unacceptable latency.

A simultaneous integration of knowledge taken from a smaller number of sources than their origins can be treated as a viable approach, thanks to dividing them into several subgroups and the eventual merging of the results into the one final knowledge base. The general idea is presented in Fig. 1.

The task of ontology integration can be defined as follows: for given n ontologies O₁, O₂,. . . , O_n one should determine an ontology O* which is the best representation of given input ontologies. As aforementioned, the process of the integration can be done in a single step or in special cases in two or more stages. These steps are frequently identified as levels of integration. An appropriate definition of the one level ontology integration is given below:

Definition 1. The input of the one-level integration process is a sequence of n ontologies: $O_{1}^{1}, O_{2}^{1}, . . ., O_{n}^{1}$ . The output of the integration process is a single ontology O^1*, which is in multiple relationships with input ontologies, as defined by a group of criteria. Integration criteria $K_{1}^{1}, K_{2}^{1}, . . ., K_{n}^{1}$ are the parameters of the integration task and tying O^1* with $O_{1}^{1}, O_{2}^{1}, . . ., O_{n}^{1}$ each at least at a given level $α_{1}^{1}, α_{2}^{1}, . . ., α_{n}^{1}$ " of $K_{i}^{1} (O^{1 *} | O_{1}^{1}, O_{2}^{1}, . . ., O_{n}^{1}) \geq α_{i}^{1}$ .

Based on the above definition the multi-level approach to the ontology integration can be defined as follows:

Definition 2. Let $O_{1}^{m - 1 *}, O_{2}^{m - 1 *}, . . ., O_{n}^{m - 1 *}$ be ontologies obtained during m - 1 level of the knowledge integration, where m ≥ 2. The output of the m-level of integration is a single ontology O^m*, which is in multiple relationships with input structures, as defined by a group of criteria: $K_{1}^{m}, K_{2}^{m}, . . ., K_{n}^{m}$ .

According to the available publications, like [14], we can distinguish a following set of criteria that the outcome of an integration procedure should meet:

Completeness which states that after the integration no knowledge is lost

Minimality which states that the output of the integration is not much larger than its inputs

Precision which states that the integration procedure will not duplicate any data

Optimality which states that the output of the integration is the closest to the inputs in referring to an accepted distance measure

Sub-tree agreement which states that in case of a tree-based knowledge representation (which in some cases can be identified with ontologies) the integration’s output includes all of the sub-trees from its inputs

As defined in the Section 3, an ontology consists of a few main components, namely concepts, relations and instances. These elements form so called ontology stack, first aforementioned in our initial works, and further developed in [22]. Such approach to structuring ontologies and their content implies that the problem of ontology integration should be conducted in three steps: integration of concepts, integration of relations and integrations of instances.

Using the one level approach, this problem has been solved in [16], where author has decomposed it into three phases corresponding to described components and for each of such phase an integration algorithm has been proposed:

Integration on an instance level has been solved using consensus methods

Integration on a concept level has required defining some additional formal postulates

An algorithm for a relational level in the resulting set of relations has included only those relations which appeared most frequently in the input ontologies and did not cause any inconsistencies.

The multi-level ontology integration task requires to conduct an initial division of the sequence of n ontologies O₁, O₂,. . . , O_n into k classes X₁, X₂,. . . , X_k where k < n. For each class X_i of ontologies the one-level integration of concepts, relation or instances is performed using one of the selected algorithms.

Then, using the exact same procedure, ontologies denoted as $O_{1}^{2 *}, O_{2}^{2 *}, . . ., O_{k}^{2 *}$ (which are the result of the 1st level integration) can be further integrated into the final ontology O^2*. Obviously this approach based on the division of a sequence of ontologies into classes which are simultaneously integrated can be carried out many times.

5 Formal analysis of the integration algorithm

Due to the limited space available, overall simplicity and for illustrative purposes we will focus only on the evaluation of one- and multi-level integration of concepts. The base algorithm is taken from [16]:

We can easily prove that the outcomes of the ontology integration performed according to the presented algorithm using the one-level and the multilevel approach are identical:

Theorem 1. For an ontology integration on a concept level and for m ≥ 2 the following condition is always satisfied: O^m* is equal O^1*.

Proof. In the first step we show that O^m* is equal to O¹* for m = 2. Due to the fact that we consider only concept integration we want to show that A^m* is equal A¹* and V^m* is equal V¹* where A^m*, A¹* are the results of attribute integration on multi- and one-level respectively and V^m*, V¹* are integrated values of attributes for multi- and one-level algorithm.

From Step 1 of Algorithm 1 it is obvious that $A^{1} * = ⋃_{i = 1}^{n} A^{i}$ . Two-level integration process is more complicated. Let us assume that A₁, A₂ . . . . , A_n were divided into k classes. Therefore, S₁ = {i : A_i belongs to a class 1} , S₂ = {i : A_i belongs to a class 2},...,S_k = {i : A_i belongs to a class k}. In the first stage of the multi-level integration process we obtain $A_{1}^{1} * = ⋃_{i \in S_{1}} A^{i}$ , $A_{2}^{1} * = ⋃_{i \in S_{2}} A^{i}, . . ., A_{k}^{1} * = ⋃_{i \in S_{k}} A^{i}$ . In the second stage we get A² * = ⋃ _i∈S₁Aⁱ ∪ ⋃ _i∈S₂Aⁱ ∪ . . . ∪ ⋃ _{i∈S_k}Aⁱ. Therefore, A²* is A¹* equal because union of sets is associative. The same reasoning could be conducted for the set of attributes values. For m ≥ 2 it is easy to show by using mathematical induction. □

The above theorem is a formal proof of the additivity of integration performed using one of the algorithms taken from [16]. The considered procedure assumes that the final state of integration gives the broadest possible perspective of integrated ontologies and does not discriminate less frequently used elements. In other words - it returns a sum of parts (input concept structures) after it has been cleaned form potential inner conflicts or redundancies.

6 The quantity of knowledge on the ontology concept level

As it was mentioned in the previous Section, the ontology consists of three main elements: concepts, relations and instances. Therefore, the problem of ontology integration should also be conducted in three corresponding steps: integration of concepts, relations and instances. This decomposition has been proposed in [16], where the author divided the main problem into aforementioned three phases and for each of them an appropriate integration algorithm has been proposed.

The main contribution of this paper relies on the estimation of the increase of knowledge which may occur during ontology integration on the concept level. In the rest of the article we will use the following notation:

O₁ = (C₁, R₁, I₁), O₂ = (C₂, R₂, I₂) are the two ontologies that are integrated

σ denotes the selected integration algorithm on the concept level. This can be treated as a function with the following signature σ : 2^C₁∪C₂ → C^*, where C₁, C₂ are sets of concepts from ontologies O₁ and O₂, respectively. The powerset is used due to the fact that the resulting ontology (the output of the integration) may contain concepts that were merged from any other number of input concepts

O^* = (C^*, R^*, I^*) denotes a final ontology. C^* is the output of σ

c₁, c₂ are two integrated concepts from respective ontologies O₁ and O₂

c^* the result of the integration of concepts c₁ and c₂

C_merged ⊂ C^* is a subset of C^* containing concepts that are a result of an integration of two or more concepts from ontologies O₁ and O₂.

The considered problem can be described as follows: For given two ontologies O₁ and O₂, one should determine the increase of knowledge that can be gained during the integration of two ontologies using the function σ.

The naive approach to the estimation of the growth of knowledge (that can be understood as the quality of the integration) would be based on a function with a signature Δ_O₁,O₂ : C₁ × C₂ × C^* → [0, 1]. Answering the question about the growth of knowledge can shed some light on the issue of profitability of the integration itself. In the case of so called Big Data, large collections of unprocessed and unstructured information are used as the source for creating more expressive and flexible ontologies. Such situation entails that the final ontologies are also rapidly growing and therefore, their integration (despite the potential profits given by the final result) can be very time- and cost-consuming. Therefore, for given two ontologies is it possible to estimate the increase of knowledge that will be gained after their integration before the integration itself takes place? A plethora of integration quality measures relies on the fact that the final result has already been designated and it is comparable with the source ontologies. But what if the output of the considered process is not satisfactory, the potential resource loss is too high and the results of any cost- and time-consuming integration will not be used?

We claim that, ideally, we should be able to measure such potential gain of knowledge before the integration in order to decide if it is profitable enough to be performed despite the other costs. Therefore, the element concerning the set C^* (which contains concepts which are the result of the integration) should be skipped.

The measurement proposed in this paper allows to establish a balance between the cost and effectiveness of integration process using the function: $Δ_{O_{1}, O_{2}} : C_{1} \times C_{2} \to [0, 1]$ (7) that meets the following postulates:

Δ_O₁,O₁ (C₁, C₁) =0

Δ_O₁,O₂ (O₁, O₂) =1 ⇔ C_merged = φ

The first postulate illustrates the situation in which an ontology is integrated with itself. Intuitively, in such case, the gain of knowledge does not exist.

The second postulate concerns the opposite situation in which none of the elements of the final ontology are the result of the integration of two or more concepts taken from the source ontologies. Such, final, ontology can be treated as a simple sum of elements of source ontologies. Therefore, the knowledge gained thanks to such integration is maximal.

By σ^-1 we will denote a function with a signature σ^-1 : C^* → 2^C₁∪C₂ that takes a concept from the final ontology O^* and returns concepts from source ontologies O₁ and O₂ that have been merged to create it. In other words - it “unpacks” the concept that is a result of the integration of elements taken from source ontologies.

We can extend this idea to develop a tool that can be used to get concepts from the specific ontology that has been merged. Formally, these functions have the following signatures $σ_{0_{1}}^{- 1} : C^{*} \to 2^{C_{1}}$ and $σ_{0_{2}}^{- 1} : C^{*} \to 2^{C_{2}}$ . They can be defined as:

$σ_{0_{1}}^{- 1} (c *) = σ^{- 1} (c^{*}) \cap C_{1}$

$σ_{0_{2}}^{- 1} (c *) = σ^{- 1} (c^{*}) \cap C_{2}$

Obviously the following properties are always true (φ denotes an empty set):

∀c^* ∈ C^* (σ (σ^-1 (c^*)) = c^*)

¬ ∃ c^* ∈ C_merged (σ^-1 (c^*) = φ)

$\neg \exists c^{*} \in C_{merged} (σ_{0_{1}}^{- 1} (c^{*}) = φ \land σ_{0_{2}}^{- 1} (c^{*}) = φ)$

$\forall c^{*} \in C^{*} (σ_{0_{1}}^{- 1} (c^{*}) \cup σ_{0_{2}}^{- 1} (c^{*}) = σ^{- 1} (c^{*}))$

By incorporating our alignment framework developed in [22] we dispose an auxiliary structure map (O₁, O₂) containing tuples in the form <s, t> where s represents a concept either from O₁ or O₂ and t represents a set of concepts from the corresponding ontology that can be unequivocally matched with the given s. In other words, map (O₁, O₂) is a set of candidate concepts that have been chosen to be integrated as members of the set C_merged. Note, that the cardinality of the set C^* is a sum of cardinalities of concept sets of input ontologies reduced by the number of integrations of selected concepts from C₁ and C₂. Formally: $| C_{1} | + | C_{2} | - | C^{*} | = \sum_{< s, t > \in map (O_{1}, O_{2})} | t |$ (8)

Therefore, map (O₁, O₂) has the following property: ∀ < s, t > ∈ map (O₁, O₂) ((s ∈ C₁ ⇒ t ⊈ C₁) ∧ (s ∈ C₂ ⇒ t ⊈ C₂)).

Incorporating presented notations and the proof from Section 5 (concerning the additivity of knowledge expressed using concepts’ structures) we have noticed that the knowledge gain can be calculated directly by comparing two concepts, that will be integrated and eventually included in the final ontology.

These remarks allows to omit a limitation of previously developed integration quality measures that were based on the evaluation of a strict output of the integration. Our approach is built on top of the comparison of source ontologies that can be done during their initial alignment (which must be always performed). Using the basic naive approach we can estimate the potential profitability of the integration as follows:

$\begin{matrix} Δ_{O_{1}, O_{2}} (C_{1}, C_{2}, C^{*}) \\ = \frac{| C^{*} | - | C_{merged} | + \sum_{< s, t > \in map (O_{1}, O_{2})} {max}_{c_{i} \in t} d_{S} (s, c_{i})}{| C^{*} |} \end{matrix}$ (9)

Bare in mind that the set C_merged is a result of the integration of matchable concepts from source ontologies, initially selected in the mapping phases and included in the set map (O₁, O₂). Therefore, we can say that |C_merged| = |map (O₁, O₂) |. We innclude only the maximal value from the availabledistances for the single mapping taken from map (O₁, O₂) to approximate the potential growth of knowledge after the integration.

Also note that from the Equation 8 we have |C^*| = |C₁| + |C₂| - ∑_{<s,t>∈map(O₁,O₂)}|t|. Hence, we can replace |C^*| in the Equation 9 and eventually become completely independent from the analysis of O^*: $\begin{matrix} Δ_{O_{1}, O_{2}} (C_{1}, C_{2}) \\ = 1 - (\frac{| map (O_{1}, O_{2}) | - \sum_{< s, t > \in map (O_{1}, O_{2})} {max}_{c_{i} \in t} d_{S} (s, c_{i})}{| C_{1} | + | C_{2} | - \sum_{< s, t > \in map (O_{1}, O_{2})} | t |}) \end{matrix}$ (10)

Example 1. Consider an example from Fig. 2 with two input ontologies O₁ = (C₁, R₁, I₁) and O₂ = (C₂, R₂, I₂) and one result of integration denoted as O^* = (C^*, R^*, I^*). Lets assume the following: ctx (c₁) = p ∧ q ∧ r, ctx (c₂) = w, ctx (c_A) = q ∧ r, ctx (c_B) = x, ctx (c_C) = m ∧ o ∧ z. Moreover, σ (c₁, c_A, c_B) = C_1,A,B and σ (c₃, c_C) = C_3,C which implies that concepts c₁, c_A, c_B and c₃, c_C can be integrated and any other concept from input ontologies O₁ and O₂ will be simply duplicated within O^*. According the Algorithm 1 the context of integrated concepts will be as following: ctx (c_1,A,B) = p ∧ q ∧ r ∧ x and ctx (c_3,C) = m ∧ n ∧ o ∧ z. The calculated distances are $d_{S} (c_{1}, c_{A}) = \frac{1}{3}$ , d_S (c₁, c_B) =1, $d_{S} (c_{3}, c_{C}) = \frac{1}{2}$ . As defined in the Equation 10 to calculate the knowledge gain of integration we need to select the maximal distance between contexts of concepts taking part in the procedure. For the integration of concepts c₁, c_A, c_B the maximal distance between their contexts is equal to 1 and, due to only two participating concepts, the maximal distance between c₃, c_C is obviously equal $\frac{1}{2}$ . The final increase of knowledge in the considered example can be calculated as $Δ_{O_{1}, O_{2}} (C_{1}, C_{2}) = 1 - \frac{2 - (1 + \frac{1}{2})}{3} = \frac{5}{6} \approx 83 %$ .

In the next section of the article we will extend our experiment verifying the multilevel ontology integration (described in Section 4) with proposed measure of knowledge gain. It will estimate the profit of every step of integration which can be used as a stop condition in real-life applications.

7 The analysis of time required for the multilevel ontology integration

The most common dataset, frequently used to evaluate any kind of application related to processing ontologies, is the one provided by sOntology Alignment Evaluation Initiative (OAEI). It is handed during the annual evaluation campaigns, which are mainly aimed at verifying ontology alignment frameworks (that designate a set of symmetrical mappings which connect equivalent elements from two ontologies). The evaluation procedure is based on a large set of pairs of ontologies (for convenience grouped into smaller thematic subsets referred to as tracks) complemented with a reference mapping. The evaluation of an alignment tool consists of the comparison of a provided result with such reference. Eventually generic values of Precision and Recall are calculated, accompanied by other quality metrics (e.g. time required to perform a mapping).

The selected integration algorithm, presented in Section 5, in order to be conducted using some real data, involves some initial requirements that need to be fulfilled beforehand:

The algorithm expects a set of pairs designating equivalent concepts from separate ontologies that may be integrated into a final ontology. This assumption is similar to the one used in [24].

A set that designates equivalent attributes from integrated concepts for the sake of Step 3 of the considered algorithm

To meet the above requirements we have used reference alignments given by OAEI for four ontologies (Sigkdd, Edas, ConfTool and Sofsem) taken from the conference track (due to the accessibility and simplicity of their domain) of the 2015 evaluation campaign [19]. Therefore, we were able to test the one- and two-level approach using a dedicated experimental environment implemented using Python programming language.

The integration of all ontologies using the generic one-level approach took 0.0788 seconds and in Table 1 times (in seconds) required by the two-level integration approach can be found.

We have analyzed outcomes of seven different initial classifications of selected four ontologies into subsets containing one, two or three input ontologies. It can be identified with classes X₁ and X₂ from Definition 2). Each of such division can be found in a separate row in Table 1. Columns contain durations of each level of integration for different classes and the total time taken.

The given values are the arithmetic means of outcomes collected from ten iterations of the identical integration process. This allowed to avoid potential distortions that may take place due to a number of random technical issues such as memory access downtime, stack overflow etc.

The obtained results allowed us to draw a conclusion that the multi-level integration can be significantly faster than the one-level equivalent. As easily seen in the last column of Table 1, the evaluation clearly showed that in comparison to the simple one-level integration such integration for given input dataset is faster even by 20%. In the context of the Big Data [4] the shortest possible time required to obtain their final representation is a key factor in providing reliable business solutions.

8 The analysis of knowledge increase during the ontology integration

We have analyzed seven different selection of initial classes X₁ and X₂. By using the aforementioned experimental environment we have calculated a separate knowledge increase for every level of integration. To calculate the distance between contexts of concepts we have assumed that it can be modeled using URIs of datatype properties. According to the OWL specification [20] they can be treated as concepts’ attributes.

We have incorporated the alignments provided by OAEI (described in the Section 7) which contain several mappings between a handful of single attributes. This implies that such attributes are equivalent and therefore, the distance between them is equal 0. We assumed the distance between any other attributes is maximal and equal 1.

The knowledge increase obtained during standard one-level approach is equal 0.89. The results of multilevel integration, calculated for every selection of X₁ and X₂ can be found in the Table 2.

The high knowledge increase in our experiment is the result of the chosen dataset. In the considered ontologies the number of attributes is low and the concepts are rarely aligned, therefore, the integration brings a lot of new knowledge. However, the results of our experiment demonstrate some interesting dependencies. The most important ontology is Edas. The increase of knowledge in the integration process without this ontology is smaller than in other cases. Additionally, 1 - Δ_O₁,O₂ (C₁, C₂) informs how much information is redundant.

As it was mentioned in Section 1, the proposed measure can be helpful in the initial classification of input ontologies into appropriate classes. It can serve as a stop condition during subsequent levels of integration when a satisfactory level of knowledge is reached. From the first line of Table 2 we can draw a conclusion that such initial classification of input ontologies is the least effective because the pairs: {Sofsem, Sigkdd} and {Edas, ConfTool} have the most redundant data. Moreover, the results of the second level show that the integration process should not be stopped on the first level, because further integrations are still profitable.

9 Future works and summary

Ontologies allow to easily store the big set of data enriching them with expressive semantics. In consequence processing them is a difficult and complex task. The easy and cheap method for estimation of the profitability of the integration process is therefore desired. In this work we have proposed a novel measure which allows to estimate a potential growth of knowledge after the integration of a set of ontologies on a concept level. The main advantages of the proposed method is its simplicity, a well grounded theoretical background and the fact that it can be performed before any level of the integration.

Moreover, the multi-level integration process has been described as the use case scenario. The developed measure has been used in the multi-level integration process for two tasks. The first one is an initial classification of input ontologies into appropriate classes. The second one is a stop condition during subsequent levels of integration when a satisfactory level of knowledge is reached.

The preliminary experimental results pointed out that the multi-level integration process is shorter by even 20% in comparison to the standard one-level approach. The calculation of proposed knowledge increase measure allows to denote that the most important ontology is the one which brings the most information. Moreover, the analysis of the time taken by the integration procedure and the growth of the knowledge have demonstrated that the best initial classification of input ontologies is: X₁ = {Sofsem, Edas} , X₂ = {Sigkdd, ConfTool} or X₁ = {Sofsem, Conftool} , X₂ = {Sigkdd, Edas}

In our future work we would like to conduct experiments for more ontologies and for more levels of integration (relations and instances), therefore, expanding it into fully fledged ontology integration framework. In this paper we were able to examine only four ontologies integrated on only two levels. More sophisticated experiments could bring more reliable conclusions. We have proposed the method which estimates the objective growth of knowledge that does not judge the disperse in the amounts of knowledge available in the input ontologies. In our upcoming publications, the subjective (from the point of view of the integrated ontologies) measures will be proposed.

References

Batista

M.D.C.M.

and Salgado

A.C.

, Information Quality Measurement in Data Integration Schemas. In QDB, 2007, pp. 61–72.

Calvanese

, De Giacomo

and Lenzerini

, Ontology of integration and integration of ontologies, Description Logics49 (2001), 10–19:30.

Ceusters

and Smith

, Towards A realism-based metric for quality assurance in ontology matching, Frontiers in Artificial Intelligence and Applications150 (2006), 321–332.

Chen

, Chiang

R.H.

and Storey

V.C.

, Business intelligence and analytics: From big data to big impact, MIS quarterly36(4). Society for Information Management and The Management Information Systems Research Center, Minneapolis, MN, USA, 2012, pp. 1165–1188.

Chen

C.P.

and Zhang

C.Y.

, Data-intensive applications, challenges, techniques and technologies: A survey on Big Data, Information Sciences275 (2014), 314–347.

Euzenat

and Petko

, An integrative proximity measure for ontology alignment, Proc ISWC-2003 workshop on semantic information integration. No commercial editor., 2003.

Frank

A.U.

, Data quality ontology: An ontology for imperfect knowledge, In Spatial Information Theory, Springer Berlin Heidelberg, 2007, pp. 406–420.

Kozierkiewicz-Hetmańska

, Comparison of one-level and two-level consensuses satisfying the 2-optimality criterion, 2012. DOI: 10.1007/978-3-642-34630-9_1

Kozierkiewicz-Hetmańska

and Nguyen

N.T.

, A Comparison Analysis of Consensus Determining Using One and Two-level Methods, Volume 243: Advances in Knowledge-Based and Intelligent Information and Engineering Systems, 2012, pp. 159–168.

10.

Kozierkiewicz-Hetmańska

and Pietranik

, Preliminary evaluation of multilevel ontology integration on the concept level, Lecture Notes in Computer Science9621 (2016), Springer London. DOI: 10.1007/978-3-662-49381-6

11.

Kumar

, Moseley

, Vassilvitskii

and Vattani

, Fast greedy algorithms in mapreduce and streaming, ACM Transactions on Parallel Computing2(3) (2015), 14.

12.

and Dang

, Ontology-based disease similarity network for disease gene prediction, Vietnam Journal of Computer Science3 (2016), 197. DOI: 10.1007/s40595-016-0063-3

13.

Lv Ni

, Zhou

and Chen

, Multi-level ontology integration model for business collaboration, The International Journal of Advanced Manufacturing Technology84(1-4) (2016), 445–451. ISO 690.

14.

Maleszka

and Nguyen

N.T.

, A method for complex hierarchical data integration, Cybernetics and Systems42(5) (2011), 358–378.

15.

Merelli

, Pérez-Sánchez

, Gesing

and D’Agostino

, Managing, analysing, and integrating big data in medical bioinformatics, open problems and future perspectives, BioMed Research International2014 (2014), 1–13. DOI: https://dx-doi-org.web.bisu.edu.cn/10.1155/2014/134023

16.

Nguyen

N.T.

, Advanced Methods for Inconsistent Knowledge Management, Springer London, 2008.

17.

Nguyen

V.D.

, Nguyen

N.T.

, Some Novel Results of Collective Knowledge Increase Analysis Using Euclidean Space. In: Rutkowski

, Korytkowski

, Scherer

, Tadeusiewicz

, Zadeh

A.L.

and Zurada

M.J.

(eds.) Artificial Intelligence and Soft Comuting: 15th International Conference, ICAISC 2016, Zakopane, Poland, June 12-16, 2016, Proceedings, Part II. pp. 352–363. DOI: 10.1007/978-3-319-39384-1_30. Springer International Publishing, Cham, 2016.

18.

Noy

N.F.

and Musen

M.A.

, An algorithm for merging and aligning ontologies: Automation and tool support, Proceedings of the Workshop on Ontology Management at the Sixteenth National Conference on Artificial Intelligence (AAAI-99)1999.

19.

http://oaei.ontologymatching.org/2015/

20.

http://www.w3.org/TR/owl2-overview/

21.

Pietranik

and Nguyen

N.T.

, Semantic Distance Measure Between Ontology Concept’s Attributes, Proceedings of 15th International Conference, KES 2011 Lecture Notes in Artificial Intelligence, Vol. 6881, Springer, 2011, pp. 210–219.

22.

Pietranik

and Nguyen

N.T.

, A multi-atrribute based framework for ontology aligning, Neurocomputing146 (2014), 276–290. DOI: 10.1016/j.neucom.2014.03.067

23.

Rekatsinas

, Dong

X.L.

, Getoor

and Srivastava

, Finding Quality in Quantity: The Challenge of Discovering Valuable Sources for Integration, In 7th Biennial Conference on Innovative Data Systems Research (CIDR’15), 2015.

24.

Jiménez-Ruiz

, Grau

B.C.

, Horrocks

and Berlanga

, Ontology integration using mappings: Towards getting the right logical consequences, 2009. DOI: 10.1007/978-3-642-02121-3_16

25.

Shvaiko

and Euzenat

, Ontology matching: State of the art and future challenges, IEEE Trans Knowl Data Eng25(1) (2013), 158–176.

26.

Supekar

, Patel

and Lee

, Characterizing Quality of Knowledge on Semantic Web, In FLAIRS Conference2004, pp. 472–478.

27.

Tartir

, Arpinar

I.B.

, Moore

, Sheth

A.P.

and Aleman-Meza

, OntoQA: Metric-based ontology quality analysis, (2005). Available at: http://works.bepress.com/amit_sheth/341/

28.

Tzabbar

, Aharonson

B.S.

and Amburgey

T.L.

, When does tapping external sources of knowledge result in knowledge integration?Research Policy42(2) (2013), 481–494. ISSN 0048-7333 DOI: https://dx-doi-org.web.bisu.edu.cn/10.1016/j.respol.2012.07.007

29.

Zhou

, Li

, Zhang

, Chen

and Kong

, Feature classification and analysis of lung cancer related genes through gene ontology and KEGG pathways, Current Bioinformatics11(1) (2016), 40–50.