Abstract
Nowadays, due to the high level of data distribution, it is frequently impossible to generate a unified representation of a variety of heterogenous data sources in a single step. Dividing the integration process into smaller subtasks and their parallelization can solve this problem. Unfortunately, it entails difficulties concerning the initial classification of data sources into groups that can be independently integrated, and serve as an input for the final integration step. The problem becomes even more complicated when not only raw data is required to be integrated, but the designed system is expected to perform more expressive integration of heterogenous knowledge representations, such as ontologies. In our previous work [10] we have proved both analytically and experimentally that such approach to the integration task can increase its effectiveness in terms of the time required to obtain the final result. In this article we intend to explore the issue of selecting initial classes of ontologies based on the novel notion of the knowledge increase. This indicator can be computed before the integration and moreover answer the question concerning whether this integration is viable. This not only simplifies the initial distribution of aforementioned subtasks, but can also be used as a stop condition during subsequent steps of the integration.
Introduction
The integration of ontologies is a topic that gained high attention throughout the years. All of the available publication (both old like [2] and very recent like [13]) emphasize the fact that it can be very time- and cost-consuming. It requires processing of many concepts, their attributes, instances and relations. This problem is even more complicated if integrated ontologies are geographically distant which may entail major delays on the level of raw data transfer. Moreover, their inner integrity is a crucial requirement because to their content can be sensitive (for example - ontologies may be involved in cancer treatment [29]).
The aforementioned problems imply some research questions and in literature we have found some studies about the following matters: how to decrease the cost of sending and processing a semantically enriched big set of data [9, 10]? How to preserve the integrity during and after the procedure [5]? How to analyze partial results [15]?
Despite this high attention, there are still some questions that remain unanswered, mainly touching issues concerning the profitability of the ontology integration process. In other words - is the integration process worth performing or not? In order to achieve that, it is necessary to assess what is the growth of knowledge after the set of ontologies has been integrated creating one, single, unified knowledge base. Therefore, our goal is to establish a balance between the cost and the effectiveness of the integration process.
In the literature it is possible to find some measures which allow to calculate the effectiveness of the integration [14], for example: completeness, precision or optimality. The first one allows to check how much knowledge is lost after the integration. Precision estimates how many elements are duplicated and how many new elements are introduced after the integration. The last one tells how close the output of the integration is to its inputs in terms of some accepted distance measure.
Despite many advantages and certain simplicity these functions have one, serious defect - all of them require the integration to be performed beforehand and only afterwards they can be used to evaluate obtained results. As it was mentioned, in many cases the integration is very complicated. For example, lets assume that some company’s board needs to make a decision based on an analysis of information gathered from this company’s branches. Data used in eventual ontologies are stored in many distant locations - for example each branch of the company stores different portion of sales information.
Because of the complexity of ontologies and their semantic expressiveness, preparing the unified view on all of them is a difficult task. Before performing the integration it would be useful to know if the whole process itself is profitable and how many benefits it may bring. In other words, how much knowledge will be gained thanks to the conducted integration. If this growth is significant enough, the company can decide about performing the integration without any (or very little) risk.
In our previous work [8, 10] we have proposed a multi-level approach to ontology integration. During the developed procedure, integrated ontologies are divided into explicitly assumed classes. For each of such groups the partial integration (that incorporates the algorithm taken from [16]) is performed in order to obtain a separate ontology. Eventually, all of the partial results are combined into a final ontology using the same integration algorithm, which was used in preceding steps. It is somehow similar to the map-reduce approach commonly incorporated in many practical applications [11]. As a result, we have demonstrated that the multilevel integration process can be even 20% faster than the single-level approach, thanks to a parallelization of the fragmentary calculations. What is more, we have analytically proven that the outcomes of the integrations done on one level and on multiple levels are identical. The biggest problem we have encountered concerned the initial classification of ontologies into subgroups - we have not found any formal method that could be used to perform this task or indicate the correctness of accepted classification.
In this paper we want to revise and extend our previous research. The main contribution is to develop a novel measure which allows to estimate a potential growth of knowledge that will be gained after the integration of a set of ontologies. As stated in our previous research and a variety of related articles (e.g. [25]) the considered process is always preceded by finding which elements of source ontologies may be integrated, using one of the methods of ontology alignment. We claim that during this stage it is also possible to assess qualities of a final outcome and partial results (if a multilevel approach is used). In other words - our measure can be used to evaluate the integration before it is even performed. Moreover, it is build on a well grounded theoretical background originating from the ontology alignment framework that we have developed in the past [22]. This can be useful in medical application, for example, when several disease ontology are needed to be integrated with semantic networks concerning genes or phenotypes (like in [12]). Estimating the gain of knowledge before performing such vast procedure can be invaluable.
As aforementioned, our previous achievements concern the multi-level approach to the integration process of distributed and complex data. This can be used to illustrate one of the applications of the proposed measure. For example, the intended approach can simplify the initial classification of input ontologies into appropriate classes. It can also serve as a stop condition during subsequent levels of integration when a satisfactory level of knowledge increase is about to be reached. In other words, it can be somehow treated as a quality of the integration. Unfortunately, due to the limited space, in this paper we will only consider the concept level of ontologies.
The remaining part of this paper is organized as follows. In the next section we give a summary of related works. Section 3 contains the introduction to ontologies and basic notions used throughout our research. In Section 4 a description of revised methodology for the multilevel ontology integration can be found, which is formally analyzed in Section 5. In Section 6 we describe the measure that can be used to estimate the increase of knowledge after the ontology integration. Section 7 recalls the results of the conducted experiment and eventually propose, how the notion on knowledge increase can be used in conjunction with the multilevel approach to ontology integration. Section 8 presents the knowledge increase obtained during the integration of ontologies performed in the earlier section. The possible application is considered along with a pragmatic analysis of obtained values. The last section focuses on our upcoming research plans and concludes the paper.
Related works
In the literature it is possible to find some measures which allow to calculate the effectiveness of the integration process. Most of them are, however, rather difficult to calculate and the integration needs to be performed beforehand. They can only evaluate the integrated ontology once it has been designated, and not to estimate the potential profit before any resources are assigned to perform the integration.
Authors of [28] presented a theoretical and practical aspects of the knowledge integration process in companies. They demonstrated that the time it takes the company to integrate external sources of information depends on the attributes of the knowledge source and company’s own internal capabilities. Moreover, a paper highlighted the fact that the speed of integration depends on accumulated experiences. Similar issues where covered in [23] where authors focused on identifying sources of knowledge that collectively satisfy the quality requirements and budget limitations of their applications.
In [27] authors proposed a model called OntoQA that analyses ontology schemas along with their populations and describes them through a well defined set of metrics. Authors defined two categories of metrics. The schema metrics addressed the design of the ontology which indicates its richness, taxonomy’s width and depth, and an inheritance of an ontology schema. The instance metrics were grouped into two categories: knowledge base metrics, which describe the knowledge base as a whole, and Class metrics, which described the way each class is being utilized in the knowledge base. The main disadvantage of proposed metrics is their excessive simplicity and the lack of connection between concepts, relations, instances. All of the described metrics are calculated in separation with each other and none of the metric can give a big picture of the quality of the performed integration.
Supekar in [26] proposed a methodology representing a qualitative and quantitative analysis of ontologies and their classification. Authors defined a quality of ontology knowledge and computed the Degree of Quality. Additionally, a formal method that determines the level of quality of ontology based on characterization of knowledge was designed. The model which characterizes the quality of knowledge were based on cognitive adequacy, context (quantifiable features) and veracity, complexity and specificity. Based on this model, the correctnesses of domain ontologies by ranking them according to their degree of correctness was determined. The proposed measure is focused on a description of ontologies and their usefulness, therefore, it is not generic due to its relation with domains of ontologies. The measure described above evaluates quality of ontologies and does not consider the integration of them.
An interesting approach to assessing the increase of knowledge gained after the integration (treated as a knowledge of some collective, not a result of merging several ontologies) can be found in [17]. Authors investigated the influence of the inconsistency degree on the collective knowledge increase by taking into account the number of members of this collective. The idea behind this was based on a remark that the more consistent the collective’s knowledge is, the smaller the diversity in the opinions of collective’s members. Therefore, the smaller amount of knowledge is eventually gained.
In [1] an information quality analysis of schemas in data integration environments was presented. The evaluation of schema quality focusing on minimality, consistency and completeness aspects was discussed. Authors defined entity and relationship redundancy degree and based on those measures the schema minimality was proposed. Schema completeness was calculated based on a number of concepts. The overall schema type consistency score was estimated based on the total number of consistent attributes. Despite that all of defined measures are very useful, unlike our approach, they require that the integration process is conducted beforehand.
Andrew Frank [7] proposed a novel definition of data quality understood as “fitness for use” in the sense an executable approach. This definition led to a method of assessing the influence of data imperfections on the number of errors in the final decisional process. Data quality was considered from philosophical point of view and unfortunately no formal measures of data quality was presented.
In Ceusters [3] a whole methodology of assessing the quality of ontologies that are merged or mapped was described. This paper contains only theoretical consideration about some factors which can be used to assess the adequateness of both, the original ontologies, and the outputs of matching and merging.
There are some publications focusing on similarity measures which are used in ontology integration and alignment like [6, 16] or [22]. However, none of these works do not allow to estimate the knowledge increase and evaluate a profitability of the eventual integration. In this paper we would like to propose a measure without the flaws that have been described above.
Basic notions
In our research we accept the notion of the real world represented by a pair (A,V), where A is a set of attributes that can be used to describe objects from that world, and V is a set of valid valuations of these attributes. Assuming that V a is a domain of an attribute a taken from the set A the following condition holds: V = ⋃ a∈AV a .
We define an ontology as follows:
where C is a finite set of concepts, R is a finite set of relations between concepts R = {r1, r2,. . . , r n }, n ∈ N, r i ⊂ C × C for i ∈ [1, n] and I is a finite set of instances.
Every concept from the set C is represented as a triple:
We say that an ontology O is (A,V)-based if ∀c∈CA c ⊆ A and ∀c∈CV c ⊆ V are simultaneously met.
Raw attributes from the set A do not have any semantics - they obtain a
possibility to be interpreted only when they become included within selected concepts. We
assume the existence of a set containing atomic descriptions of attributes denoted as
D
A
(for example - this set can contain a
label year_of_birth). Subsequently we define a sub-language
of the sentence calculus built from elements of
D
A
and logic operators of conjunction,
disjunction and negation. Eventually, by the semantics of attributes we understand a
following function:
For example, an attribute “Date_of_birth” within a concept “Person” obtains the following semantics: S A (Date _ of _ birth, Person) : year _ of _ birth ∧ month _ of _ birth ∧ day _ of _ birth ∧ age
Considered tool gives us the opportunity to formally define different meanings that one
attribute can express while being a part of different concepts. For example, an attribute
title conveys different meaning when it is used in a concept
Book and entirely different when being a part of a concept
Person. This allowed us to formally define equivalency
(denoted as ≡), generalization (denoted as ↑) and
contradiction (denoted as ↓) between attributes: Two attributes
a ∈ A
c
,
b ∈ Ac′ are
semantically equivalent if the formula
S
A
(a,
c) ⇔ S
A
(b,
c′) is a tautology for any two
c ∈ C, c′ ∈ C′.
The symbol ≡ will be used to designate such relationship. The attribute
b ∈ Ac′ in concept
c′ ∈ C′ is more general than the attribute
a ∈ A
c
in concept
c ∈ C if the formula
S
A
(a,
c) ⇒ S
A
(b,
c′) is a tautology for any two
c ∈ C, c′ ∈ C′.
To denote this relationship the symbol ↑ will be further
used. Two attributes
a ∈ A
c
,
b ∈ Ac′ are in
semantical contradiction if the formula
¬ (S
A
(a,
c) ∧ S
A
(b,
c′)) is a tautology for any two
c ∈ C, c′ ∈ C′.
To denote this relationship the symbol ↓ will be used.
In consequence, such approach gives us the opportunity to formulate a novel ontology alignment framework, presented in details in [22].
To describe the overall meaning that a particular concept can carry, we define a context
function with the following signature:
Assuming that a concept c has a set of attributes with cardinality equal
to n
(|A
c
| = n) and contains
attributes {a1, a2,. . . ,
a
n
} we define the above function as follows:
We also introduce a set D
R
containing
descriptions of relations and in consequence another sublanguage of the sentence calculus
that is used in a function representing semantics of relations from the set
R:
An instance i from the set I is defined as a triple i = (id, A i , v i ), where id is its unique identifier, A i is a set of assigned attributes and v i is a function v i : A i → ⋃ a∈A i V a that is used to give certain values from the corresponding sets V a to elements of the set A i . If there exists some concept c = (Id c , A c , V c ) such that A c ⊆ A i and ∀a∈A i ∩A c v i (a) ∈ V c we say that i = (id, A i , v i ) is its instance.
In the further parts of our paper we will accept the existence of a function: that will be used to calculate a distance between logic statements expressed using the language . Due to the limited space we cannot fully present them - for broad description please refer to our previous article [21].
In our alignment framework this function has been used to investigate how far are two descriptions of attributes from two concepts. In this work we will use it to measure how much knowledge we have gained after an integration of two ontologies on the concept level. We assume that it accepts the contexts (defined according to the Equation 5) of concepts from two ontologies as an input, therefore it has the following signature d S : C1 × C2 → [0, 1] and can be defined as following: d S (c1, c2) = dist (ctx (c1) , ctx (c2)). Obviously it also meets the criteria of identity of indiscernible d S (c1, c2) ⇔ c1 = c2 and symmetry d S (c1, c2) = d S (c2, c1).
According to [18], the task involving a knowledge integration can be described as a process of combining several, independent knowledge bases into a single, unified structure containing all of the knowledge extracted from the input sources witch the exclusion or resolution of potential conflicts and inconsistencies that may occur. The high complexity of required transformations may cause that in some cases it is impossible to achieve the required output using only one-level integration in which all of the incoming knowledge is processed at the same time. Other issues that may be a reason of such difficulties are memory consumption or simply geographical distance between knowledge sources that entails an unacceptable latency.
A simultaneous integration of knowledge taken from a smaller number of sources than their origins can be treated as a viable approach, thanks to dividing them into several subgroups and the eventual merging of the results into the one final knowledge base. The general idea is presented in Fig. 1.
The task of ontology integration can be defined as follows: for given n ontologies O1, O2,. . . , O n one should determine an ontology O* which is the best representation of given input ontologies. As aforementioned, the process of the integration can be done in a single step or in special cases in two or more stages. These steps are frequently identified as levels of integration. An appropriate definition of the one level ontology integration is given below:
Based on the above definition the multi-level approach to the ontology integration can be defined as follows:
According to the available publications, like [14], we can distinguish a following set of criteria that the outcome of an
integration procedure should meet: Completeness which states that after the
integration no knowledge is lost Minimality which states that the output of the integration is
not much larger than its inputs Precision which states that the integration procedure will not
duplicate any data Optimality which states that the output of the integration is
the closest to the inputs in referring to an accepted distance
measure Sub-tree
agreement which states that in case of a tree-based knowledge
representation (which in some cases can be identified with ontologies) the
integration’s output includes all of the sub-trees from its
inputs
As defined in the Section 3, an ontology consists of a few main components, namely concepts, relations and instances. These elements form so called ontology stack, first aforementioned in our initial works, and further developed in [22]. Such approach to structuring ontologies and their content implies that the problem of ontology integration should be conducted in three steps: integration of concepts, integration of relations and integrations of instances.
Using the one level approach, this problem has been solved in [16], where author has decomposed it into three phases corresponding to
described components and for each of such phase an integration algorithm has been
proposed: Integration on an instance level has been solved using consensus
methods Integration on a concept level
has required defining some additional formal postulates An algorithm for a relational level in the resulting set of
relations has included only those relations which appeared most frequently in the
input ontologies and did not cause any inconsistencies.
The multi-level ontology integration task requires to conduct an initial division of the sequence of n ontologies O1, O2,. . . , O n into k classes X1, X2,. . . , X k where k < n. For each class X i of ontologies the one-level integration of concepts, relation or instances is performed using one of the selected algorithms.
Then, using the exact same procedure, ontologies denoted as (which are the result of the 1st level integration) can be further integrated into the final ontology O2*. Obviously this approach based on the division of a sequence of ontologies into classes which are simultaneously integrated can be carried out many times.
Formal analysis of the integration algorithm
Due to the limited space available, overall simplicity and for illustrative purposes we will focus only on the evaluation of one- and multi-level integration of concepts. The base algorithm is taken from [16]:
We can easily prove that the outcomes of the ontology integration performed according to the presented algorithm using the one-level and the multilevel approach are identical:
From Step 1 of Algorithm 1 it is obvious that . Two-level integration process is more complicated. Let us assume that A1, A2 . . . . , A n were divided into k classes. Therefore, S1 = {i : A i belongs to a class 1} , S2 = {i : A i belongs to a class 2},...,S k = {i : A i belongs to a class k}. In the first stage of the multi-level integration process we obtain , . In the second stage we get A2 * = ⋃ i∈S1A i ∪ ⋃ i∈S2A i ∪ . . . ∪ ⋃ i∈S k A i . Therefore, A2* is A1* equal because union of sets is associative. The same reasoning could be conducted for the set of attributes values. For m ≥ 2 it is easy to show by using mathematical induction. □
The above theorem is a formal proof of the additivity of integration performed using one of the algorithms taken from [16]. The considered procedure assumes that the final state of integration gives the broadest possible perspective of integrated ontologies and does not discriminate less frequently used elements. In other words - it returns a sum of parts (input concept structures) after it has been cleaned form potential inner conflicts or redundancies.
The quantity of knowledge on the ontology concept level
As it was mentioned in the previous Section, the ontology consists of three main elements: concepts, relations and instances. Therefore, the problem of ontology integration should also be conducted in three corresponding steps: integration of concepts, relations and instances. This decomposition has been proposed in [16], where the author divided the main problem into aforementioned three phases and for each of them an appropriate integration algorithm has been proposed.
The main contribution of this paper relies on the estimation of the increase of knowledge
which may occur during ontology integration on the concept level. In the rest of the article
we will use the following notation: O1 = (C1,
R1, I1),
O2 = (C2,
R2, I2) are the two
ontologies that are integrated σ denotes the selected integration algorithm on the concept
level. This can be treated as a function with the following signature
σ : 2C1∪C2 → C*,
where C1, C2 are sets of
concepts from ontologies O1 and
O2, respectively. The powerset is used due to the fact
that the resulting ontology (the output of the integration) may contain concepts that
were merged from any other number of input concepts O* = (C*,
R*, I*) denotes a final
ontology. C* is the output of σ c1,
c2 are two integrated concepts from respective
ontologies O1 and O2 c* the
result of the integration of concepts c1 and
c2 C
merged
⊂ C*
is a subset of C* containing concepts that are a result of
an integration of two or more concepts from ontologies O1
and O2.
The considered problem can be described as follows: For given two ontologies O1 and O2, one should determine the increase of knowledge that can be gained during the integration of two ontologies using the function σ.
The naive approach to the estimation of the growth of knowledge (that can be understood as the quality of the integration) would be based on a function with a signature ΔO1,O2 : C1 × C2 × C* → [0, 1]. Answering the question about the growth of knowledge can shed some light on the issue of profitability of the integration itself. In the case of so called Big Data, large collections of unprocessed and unstructured information are used as the source for creating more expressive and flexible ontologies. Such situation entails that the final ontologies are also rapidly growing and therefore, their integration (despite the potential profits given by the final result) can be very time- and cost-consuming. Therefore, for given two ontologies is it possible to estimate the increase of knowledge that will be gained after their integration before the integration itself takes place? A plethora of integration quality measures relies on the fact that the final result has already been designated and it is comparable with the source ontologies. But what if the output of the considered process is not satisfactory, the potential resource loss is too high and the results of any cost- and time-consuming integration will not be used?
We claim that, ideally, we should be able to measure such potential gain of knowledge before the integration in order to decide if it is profitable enough to be performed despite the other costs. Therefore, the element concerning the set C* (which contains concepts which are the result of the integration) should be skipped.
The measurement proposed in this paper allows to establish a balance between the cost and
effectiveness of integration process using the function: ΔO1,O1 (C1,
C1) =0 ΔO1,O2 (O1,
O2) =1 ⇔ C
merged
= φ
The first postulate illustrates the situation in which an ontology is integrated with itself. Intuitively, in such case, the gain of knowledge does not exist.
The second postulate concerns the opposite situation in which none of the elements of the final ontology are the result of the integration of two or more concepts taken from the source ontologies. Such, final, ontology can be treated as a simple sum of elements of source ontologies. Therefore, the knowledge gained thanks to such integration is maximal.
By σ-1 we will denote a function with a signature σ-1 : C* → 2C1∪C2 that takes a concept from the final ontology O* and returns concepts from source ontologies O1 and O2 that have been merged to create it. In other words - it “unpacks” the concept that is a result of the integration of elements taken from source ontologies.
We can extend this idea to develop a tool that can be used to get concepts from the
specific ontology that has been merged. Formally, these functions have the following
signatures
and .
They can be defined as:
Obviously the following properties are always true (φ denotes an empty
set): ∀c* ∈ C* (σ (σ-1 (c*)) = c*) ¬ ∃ c* ∈ C
merged
(σ-1 (c*) = φ)
By incorporating our alignment framework developed in [22] we dispose an auxiliary structure
map (O1, O2)
containing tuples in the form <s, t> where
s represents a concept either from O1 or
O2 and t represents a set of concepts from
the corresponding ontology that can be unequivocally matched with the given
s. In other words, map (O1,
O2) is a set of candidate concepts that have been chosen to be
integrated as members of the set C
merged
. Note,
that the cardinality of the set C* is a sum of cardinalities of
concept sets of input ontologies reduced by the number of integrations of selected concepts
from C1 and C2. Formally:
Therefore, map (O1, O2) has the following property: ∀ < s, t > ∈ map (O1, O2) ((s ∈ C1 ⇒ t ⊈ C1) ∧ (s ∈ C2 ⇒ t ⊈ C2)).
Incorporating presented notations and the proof from Section 5 (concerning the additivity of knowledge expressed using concepts’ structures) we have noticed that the knowledge gain can be calculated directly by comparing two concepts, that will be integrated and eventually included in the final ontology.
These remarks allows to omit a limitation of previously developed integration quality measures that were based on the evaluation of a strict output of the integration. Our approach is built on top of the comparison of source ontologies that can be done during their initial alignment (which must be always performed). Using the basic naive approach we can estimate the potential profitability of the integration as follows:
Bare in mind that the set C merged is a result of the integration of matchable concepts from source ontologies, initially selected in the mapping phases and included in the set map (O1, O2). Therefore, we can say that |C merged | = |map (O1, O2) |. We innclude only the maximal value from the availabledistances for the single mapping taken from map (O1, O2) to approximate the potential growth of knowledge after the integration.
Also note that from the Equation 8 we have
|C*| = |C1| + |C2| - ∑<s,t>∈map(O1,O2)|t|.
Hence, we can replace |C*| in the Equation 9 and eventually become completely independent from the analysis
of O*:
In the next section of the article we will extend our experiment verifying the multilevel ontology integration (described in Section 4) with proposed measure of knowledge gain. It will estimate the profit of every step of integration which can be used as a stop condition in real-life applications.
The most common dataset, frequently used to evaluate any kind of application related to processing ontologies, is the one provided by sOntology Alignment Evaluation Initiative (OAEI). It is handed during the annual evaluation campaigns, which are mainly aimed at verifying ontology alignment frameworks (that designate a set of symmetrical mappings which connect equivalent elements from two ontologies). The evaluation procedure is based on a large set of pairs of ontologies (for convenience grouped into smaller thematic subsets referred to as tracks) complemented with a reference mapping. The evaluation of an alignment tool consists of the comparison of a provided result with such reference. Eventually generic values of Precision and Recall are calculated, accompanied by other quality metrics (e.g. time required to perform a mapping).
The selected integration algorithm, presented in Section 5, in order to be conducted using
some real data, involves some initial requirements that need to be fulfilled
beforehand: The
algorithm expects a set of pairs designating equivalent concepts from separate
ontologies that may be integrated into a final ontology. This assumption is similar to
the one used in [24]. A set that designates
equivalent attributes from integrated concepts for the sake of Step 3 of the
considered algorithm
To meet the above requirements we have used reference alignments given by OAEI for four ontologies (Sigkdd, Edas, ConfTool and Sofsem) taken from the conference track (due to the accessibility and simplicity of their domain) of the 2015 evaluation campaign [19]. Therefore, we were able to test the one- and two-level approach using a dedicated experimental environment implemented using Python programming language.
The integration of all ontologies using the generic one-level approach took
We have analyzed outcomes of seven different initial classifications of selected four ontologies into subsets containing one, two or three input ontologies. It can be identified with classes X1 and X2 from Definition 2). Each of such division can be found in a separate row in Table 1. Columns contain durations of each level of integration for different classes and the total time taken.
The given values are the arithmetic means of outcomes collected from ten iterations of the identical integration process. This allowed to avoid potential distortions that may take place due to a number of random technical issues such as memory access downtime, stack overflow etc.
The obtained results allowed us to draw a conclusion that the multi-level integration can be significantly faster than the one-level equivalent. As easily seen in the last column of Table 1, the evaluation clearly showed that in comparison to the simple one-level integration such integration for given input dataset is faster even by 20%. In the context of the Big Data [4] the shortest possible time required to obtain their final representation is a key factor in providing reliable business solutions.
The analysis of knowledge increase during the ontology integration
We have analyzed seven different selection of initial classes X1 and X2. By using the aforementioned experimental environment we have calculated a separate knowledge increase for every level of integration. To calculate the distance between contexts of concepts we have assumed that it can be modeled using URIs of datatype properties. According to the OWL specification [20] they can be treated as concepts’ attributes.
We have incorporated the alignments provided by OAEI (described in the Section 7) which contain several mappings between a handful of single attributes. This implies that such attributes are equivalent and therefore, the distance between them is equal 0. We assumed the distance between any other attributes is maximal and equal 1.
The knowledge increase obtained during standard one-level approach is equal
The high knowledge increase in our experiment is the result of the chosen dataset. In the considered ontologies the number of attributes is low and the concepts are rarely aligned, therefore, the integration brings a lot of new knowledge. However, the results of our experiment demonstrate some interesting dependencies. The most important ontology is Edas. The increase of knowledge in the integration process without this ontology is smaller than in other cases. Additionally, 1 - ΔO1,O2 (C1, C2) informs how much information is redundant.
As it was mentioned in Section 1, the proposed measure can be helpful in the initial classification of input ontologies into appropriate classes. It can serve as a stop condition during subsequent levels of integration when a satisfactory level of knowledge is reached. From the first line of Table 2 we can draw a conclusion that such initial classification of input ontologies is the least effective because the pairs: {Sofsem, Sigkdd} and {Edas, ConfTool} have the most redundant data. Moreover, the results of the second level show that the integration process should not be stopped on the first level, because further integrations are still profitable.
Future works and summary
Ontologies allow to easily store the big set of data enriching them with expressive semantics. In consequence processing them is a difficult and complex task. The easy and cheap method for estimation of the profitability of the integration process is therefore desired. In this work we have proposed a novel measure which allows to estimate a potential growth of knowledge after the integration of a set of ontologies on a concept level. The main advantages of the proposed method is its simplicity, a well grounded theoretical background and the fact that it can be performed before any level of the integration.
Moreover, the multi-level integration process has been described as the use case scenario. The developed measure has been used in the multi-level integration process for two tasks. The first one is an initial classification of input ontologies into appropriate classes. The second one is a stop condition during subsequent levels of integration when a satisfactory level of knowledge is reached.
The preliminary experimental results pointed out that the multi-level integration process is shorter by even 20% in comparison to the standard one-level approach. The calculation of proposed knowledge increase measure allows to denote that the most important ontology is the one which brings the most information. Moreover, the analysis of the time taken by the integration procedure and the growth of the knowledge have demonstrated that the best initial classification of input ontologies is: X1 = {Sofsem, Edas} , X2 = {Sigkdd, ConfTool} or X1 = {Sofsem, Conftool} , X2 = {Sigkdd, Edas}
In our future work we would like to conduct experiments for more ontologies and for more levels of integration (relations and instances), therefore, expanding it into fully fledged ontology integration framework. In this paper we were able to examine only four ontologies integrated on only two levels. More sophisticated experiments could bring more reliable conclusions. We have proposed the method which estimates the objective growth of knowledge that does not judge the disperse in the amounts of knowledge available in the input ontologies. In our upcoming publications, the subjective (from the point of view of the integrated ontologies) measures will be proposed.
