Abstract
Ontology integration is an important work when integrating information from heterogeneous ontologies into an ontology. The existing methods about ontology integration cannot effectively make full use of non-1-1 mappings, which are very common in the real world. Furthermore, these methods only stated that the concept-pairs with mappings should be integrated, but not gave the specific operations for it. Therefore, these methods cannot describe a complete framework for ontology integration. To this end, this paper proposes a framework for Ontology Integration based on Genetic Algorithm, called OI-GA. During the process of integrating ontologies, OI-GA firstly creates mappings between them based on similarity measures. Next, OI-GA finds out all the non-1-1 mappings from mappings, and provides an evolutionary method to extract 1-1 mappings from them. Finally, all the concepts belonging to different ontologies are integrated into a new knowledge base called integrated ontology. Experimental results indicate that OI-GA performs encouragingly well in the optimization of mapping set as well as in the integration of ontologies from the real world.
Nomenclature list
Edit distance
Genetic algorithm
Information content
Ontology integration
Ontology mapping
Source concept
Similarity
Similarity matrix
Source ontology
Target concept
Target ontology
Introduction
In recent years, ontologies [14], an explicit, formal specification of a shared conceptualization, have been considered as standard knowledge model in a variety of areas such as the Semantic Web [4], Knowledge discovery, and information retrieval. Unfortunately, different ontologies, even those from the same domain, are semantic heterogeneous, as no uniform standard can be observed by knowledge engineers when building ontologies. The problem of heterogeneity seriously hinders semantic interoperability among ontologies.
In order to solve the above problem, many researches about ontology mapping [20] and integration [3] have been done. However, these researches only considered 1-1 mappings between ontologies as the basis of ontology integration, and ignore other types of mappings (collectively called non-1-1 mappings), which are commonly found in ontology mappings. Furthermore, these researches explain that the concept-pair connected by mapping should be merged into one concept, but they do not clarify us how to merge them. For this purpose, this paper proposes a new framework for ontology integration based on genetic algorithm, which is called OI-GA. This method puts forward an evolutionary method based GA to extract 1-1 mappings from non-1-1 mappings thereby optimizing candidate mapping set that is the precondition to ontology integration. As for the problem of merging concepts, OI-GA analyzes concepts at property-level and instance-level, and provides rules to integrate the property sets and instance sets for the concepts to be merged. Given two ontologies: target ontology (TO) and source ontology (SO), the objective of OI-GA is to integrate SO into TO. Firstly, OI-GA creates a candidate mapping set for TO and SO, according based on the edit-distance-based and the information-content-based similarity measures. Next, OI-GA finds out all the non-1-1 mappings, and extract 1-1 mappings from them. Finally, SO is integrated into TO by merging concept-pairs, each of which is connected by a 1-1 mapping.
Specifically, this paper makes the following contributions: OI-GA describes a framework of ontology integration, which takes two ontologies as the input, and generates an integrated ontology as its output. The framework consists of two main modules: extraction and integration, and an auxiliary module, i.e., similarity calculation. OI-GA introduces genetic algorithm into the extraction of 1-1 mappings from non-1-1 mappings, and show that it can be adapted to efficiently optimize the candidate mapping set, which is the basis of ontology integration. OI-GA analyzes concept at the view of property-level and instance-level, and proposes integration rules formatted as formulas to deal with the property sets and instance sets of the concepts to be integrated.
The remainder of this paper is organized as follows: Section 2 introduces the related works about ontology integration as well as ontology mapping. In Section 3, we provide background knowledge, such as ontology, the categories for ontology mappings, and genetic algorithm. Section 4 proposes a new framework for ontology integration, which is called OI-GA (Ontology Integration based on Genetic Algorithm). Experimental results in Section 5 shows that OI-GA performs well in integrating ontologies of the real world. Finally, Section 6 shows the general conclusions.
Related works
Researchers in the field of ontology integration have produced a large body of researches to facilitate the interoperability between heterogeneous ontologies. Pinto [15] discussed some issues on ontology integration by analyzing the difference between aligning and merging for integration. Caluanese [5] adopted two strategies of the global-centric and local-centric to structure a framework of ontology integration, in which mappings between ontologies are created by the principle of view-based query answering. Liang [25] presented the architecture of hybrid ontology integration framework, which combines the schemas of global-centric, local-centric and mixture to integrate the information in databases in the distribute environment into ontologies. Duong [22] proposed Enriched Ontology Model (EOM) and Semantic Supporting Environment (SSE) to expand the ontologies to be integrated with semantic information, then adopted a hybrid method in several levels, such as the element level, internal structure, and relational structure, to integrate these ontologies. Heer [21] used graphs to represent both ontologies and domain knowledge, and provided substantial supports for the interactive part of the integration process. Alasoud [1] developed a prototypical framework for ontology integration, which is composed of various views: materialized, virtual, and integrated views. Nguyen [11] applied the theory of consensus to integrate ontologies with regarding to the conflicts on the instance level. Li [8] presented a flexible and extensible framework for ontology integration, in which all kinds of agents are considered as the intelligent bodies to complete tasks automatically. Lambrix [16] provided an ontology integration system, called SAMBO, to aligning and merging biological ontologies. In the literature [26], machine learning is used to the classification for the concepts to be integrated, and the integration of two ontologies is accomplished by classifying the concepts in an ontology into the other. Kutz [12] sketched a new language DOL (distributed ontology language) by combining various ontology languages, and applied it to integrating ontologies as well as ontology mappings. The algorithm ILIADS [13] proposed by Udrea combined statistical matching and logical reasoning to determine cluster relationships between concepts, and produced high quality integration results. Jimenea-Ruiz [6] presented an ontology integration system ContentMap, in which mappings between ontologies are used to guide the process of integration, and some algorithms are provided to detect and repair potential errors for the integrated ontology. Biletskiy [24] resolved the problem of conflicts for ontology integration by identifying homonymy and synonymy ofconcepts.
As for ontology mapping, which is the key of ontology integration, some classic methods for ontology mapping have been developed. GLUE [2] proposed by Doan is an ontology mapping system that applies machine learning to train multiple learners to improve the accuracy of mapping. Tang [7] proposed RiMOM based on risk model as the ontology mapping system, in which the risk is measured by five strategies, i.e., name, instance, description, taxonomy, and constraint based decision, and the problem to be solved is transformed to the problem of how to reduce the risk of creating mappings between ontologies. In literature [23], an evolutionary approach for ontology mapping is provided. It evaluates every mapping in candidate set with fitness computed by a coincidence-based weighting, and finds out the satisfied mappings. Tun [10] uses three philosophical notions –identity, rigidity, and dependency –to clarify and enrich the semantics of concepts in ontology, and then divided all the concepts into five categories, finally proposed the corresponding mapping algorithms for them. Zhang [9] applied conceptual graph model to represent all the ontologies in a specific domain, and created mappings for theseontologies.
However, the above-mentioned methods for ontology integration and mapping have two weaknesses. Firstly, these methods only consider 1-1 mappings as their research subjects, and ignore other types of mappings, such as 1-N mappings, M-1 mappings, and M-N mappings. The later ones, which are named as non-1-1 mappings, represent ambiguous relationships among concepts, and they cannot explicitly indicate which concepts should be integrated. Secondly, although the integration methods state that the concept-pair in 1-1 mapping should be merged into a concept, they do not describe the details for the process of merging concepts. To this end, this paper proposes a new framework for Ontology Integration based on Genetic Algorithm, which is called OI-GA.
Preliminaries and notation
In this section, we recall some important background knowledge about ontology, ontology mapping, and genetic algorithm, which is used throughout the paper.
Ontology
Ontology is a formal, explicit specification of a shared conceptualization of a domain of interest [19]. Ontology provides a formal and explicit shared knowledge model to description domain knowledge, and it plays an important role in sharing, integration, and query of knowledge.
Definition 1 (Ontology): The formal definition of Ontology is a quintuple O = {C, P, I, R, A}, where: C denotes a set of concepts, which are considered as classes applied to classify instances (objects) of the domain. It is the most important entities in ontology. P denotes a set of properties. It is the intensional characteristic of concept, which restrains the semantic scope of concept, and gives explicit semantic meaning for concept. I denotes a set of instances. It is the extensional characteristic of concept, which is used to materialize concepts. R denotes a set of hyponymy relations among concepts. Based on it, a taxonomical structure, i.e., a tree-structure, is organized for ontology. A denotes a set of axioms. It is expressed in a proper logical language, in order to standardize the representation for concepts, instances, properties, and relationships among the first threeentities.
Ontology mapping
In essence, ontology mapping is defined as a process that takes target ontology (TO) and source ontology (SO) as input, and returns mappings between them. Based on mappings, we can identify the equivalence relations among entities, i.e., map: TO = {C, P, I, R, A} TO ↔ SO = {C, P, I, R, A} SO. In our paper, we only consider the mappings for concepts, because concepts are the most important entities for ontology. These mappings among concepts can be subdivided into two categories: 1-1 mapping and non-1-1 mapping.
Definition 2 (1-1 mapping): An 1-1 mapping connecting two concepts TC and SC is represented as a formula M(TC, SC), where TC represents the concept from TO, SC the concept from SO, and M is a function to label the semantic equivalent relationship between concepts, i.e., M: TC × SC ⟶≡.
However, for a concept in an ontology, more than one concept in the other ontology is semantic equivalent to it. It is easy to see that a 1-1 mapping cannot represent all the relationships for the concept. Therefore, we introduce non-1-1 mappings to represent more complex relationships among concepts, based on 1-1 mappings. To be specific, non-1-1 mappings can be divided into three categories: 1-N mapping, M-1 mapping, and M-N mapping. 1-N mapping
A 1-N mapping is a set of 1-1 mappings, in which all the target concepts of mappings are the same. Supposed that TC is semantic equivalent to SC1 and SC2, a 1-N mapping representing the relationships about TC is a set {M(TC, SC1), M(TC, SC2)}. M-1 mapping
As for an M-1 mapping, it also consists of a set of 1-1 mappings, which describes the case that a SC is mapped with multiple TCs at the same time. For example, the M-1 mapping {M(TC1, SC), M(TC2, SC)} represents TC1 and TC2 are mapped to SC. M-N mapping
An M-N mapping is a combination of 1-N mapping and M-1 mapping. It can be used to represent more complex relationships among two fragments that are composed of concepts. For example, two fragments are composed of {TC1, TC2} and {SC1, SC2}. The relationships between them are that TC1 is mapped to SC1 and SC2, and SC2 mapped to TC1 and TC2. In this way, we use the 1-N mapping {M(TC1, SC1), M(TC1, SC2)}, and M-1 mapping {M(TC1, SC2), M(TC2, SC2)} to represent mappings among these fragments. Obviously, the 1-1 mapping M(TC1, SC2) is redundancy, so the M-N mapping is obtained by the intersection of {M(TC1, SC1), M(TC1, SC2)} and {M(TC1, SC2), M(TC2, SC2)}, i.e., {M(TC1, SC1), M(TC1, SC2), M(TC2, SC2)}.
Genetic algorithm
Genetic algorithm (GA) is widely used in many fields, such as artificial intelligence [17], pattern matching, and scientific calculation [18]. It is a heuristic recognition technology that follows the principle of the survival of the fittest, and is routinely used to generate useful solutions to optimization problems. Specially, GA is an iterative process that starts from a population of individuals. In each iteration, the population is called a generation, and the fitness of every individual in the generation is evaluated by its performance in the optimization problem. The more appropriate individuals are selected as seeds of the next generation. As for the other individuals, their elements (genomes) are randomly exchanged, and the generated individuals are put into the next generation. However, some elements are modified to make their individuals adapt to the dynamic environment. To sum up, evolutionary process based on GA is an iteration composed of selection, crossover, and mutation. Selection. In this sub-process, the well-performed individuals are selected to breed a new generation through a fitness-based process. Crossover. The rest of individuals in the previous generation are re-organized as groups, each of which contains two individuals. For a group, two individuals exchange their elements randomly to generate two new individuals. Finally, the fitnesses of individuals are re-calculated, and the excellent individuals are passed to the next generation. Mutation. In the breeding process, some individuals are changed by gene mutation to better fit a change in environment.
OI-GA: A framework of ontology integration based on genetic algorithm
In this section, we present a new framework for Ontology Integration based on Genetic Algorithm, which is called OI-GA.
OI-GA takes two ontologies as input, and integrates source ontology (SO) into target ontology (TO). The process of ontology integration in OI-GA is shown in Fig. 1, where the rectangles are the operations, and the rounded rectangles represent the processing results. Firstly, the similarities of concept-pairs are calculated by edit-distance strategy and information-content strategy. Next, OI-GA creates mappings for those concept-pairs whose similarities are greater than or equal to the given threshold predefined by ontology experts. The created mappings are stored in the candidate mapping set ({Set-Candi-Mapping}), which contains many non-1-1 mappings. For this reason, OI-GA puts forward an evolutionary method based on GA to extract 1-1 mappings from {Set-Candi-Mapping}. The evolutionary method is repeatedly carried out, until all the non-1-1 mappings in {Set-Candi-Mapping} are transformed into 1-1 mappings. Finally, for each concept-pair (e.g., <TC, SC>) connecting with 1-1 mapping, OI-GA incorporates SC into TC by its integration rules. In this way, SO is successfully integrated into TO.
Similarity calculation for concepts
As is known to all, a concept in ontology is labeled by a name composed by a string with some semantic meaning. In this paper, we adopt the edit-distance based strategy and the information-content based strategy to compare the names of concepts, and calculate similarities for concepts. Edit-distance based strategy
Edit-distance based strategy is the most intuitive way to compare strings, so it is widely used by many similarity methods. Specifically, this strategy defines the similarity for two strings by the minimum number of insertions, deletions, and substitutions required to transforming one string into the other. The similarity method based on edit distance is defined as follows, where |C| is the length of C, and ed(C1, C2) is the edit distance of C1 and C2. However, this strategy ignores that two entity name with similar semantic meaning might be absolutely differently spelled. Therefore, it is not used alone.
Information-content based strategy
For two concepts (e.g., C1 and C2), one key to the semantic similarity is the extent to which they share information in common, indicated by the information content (IC) of their Least Common Ancestor (LCA) that is the first concept upward that subsumes both of them in an ontology. Inspired by the core idea computing semantic similarity of concepts, we apply to the formula (2), where L(C1, C2) represents the LCA of C1 and C2. Furthermore, the information content of a concept is defined as the negative log likelihood of the probability of the concept, i.e., IC(C) = –log(P(C)), and P(C) is measured by the ratio of the number of instances belonging to C to the number of all instances.
Combination
The two partial similarity measures are combined in a weighted-sum calculation resulting in a single similarity. Thus, the similarity of C1 and C2 is calculated by the formula (3), where λED and λIC, λED + λIC = 1, are the weights for the edit-distance based strategy and the information-content based strategy, respectively. λED and λIC can be trained by artificial neural network.
Based on the above work, mappings are created for all the similar concept-pairs between TO and SO, and they are restored in candidate mapping set {Set-Candi-Mapping}. However, {Set-Candi-Mapping} contains not only 1-1 mappings, but also more complex types of mapping, such as 1-N mappings, M-1 mappings and M-N mappings. Thus, we cannot know which concepts should be integrated. For this reason, OI-GA proposes an evolutionary method (E-method) based on GA to gradually extract 1-1 mappings from {Set-Candi-Mapping}. Generally speaking, the E-method firstly encodes mappings as hypotheses composed by data strings that are the objects to be handled by GA. Next, the operations of selection, crossover, and mutation are iteratively applied to extracting 1-1 mappings, until all the mappings in {Set-Candi-Mapping} have been processed. In addition, {Set-Candi-Mapping} should be divided into four subsets, i.e., {Set-Candi-Mapping}1 - 1, {Set-Candi-Mapping}1 - N, {Set-Candi-Mapping}M - 1, and {Set-Candi-Mapping}M - N, according to the types of mappings.
Encoding concepts and mappings
For two similar concepts connecting with a mapping, they must have a great number of similar elements, i.e., properties and instances. In other words, the rationality of a mapping between two concepts depends on their similar elements. To further evaluate the rationality for a mapping between two concepts (e.g., TC and SC), the similarity of each element-pair should be considered. To be specific, OI-GA firstly encodes TC and SC as two sequences by their elements, i.e., <e1TC, e2TC, … , emTC>and <e1SC, e2SC, … , enSC, en +1SC, … , emSC> (m > n), where the superscript of element represents the concept containing it. As m > n, en +1SC, … , emSC are represented by the symbol ∅. Then, the mapping between TC and SC can be encoded as a hypothesis: <Sim(e1TC, e1SC), … , Sim(emTC, emSC), F>, in which the first m items are the similarities of the corresponding element-pairs, and the last item of the expression represents the fitness of the hypothesis. The function of the fitness is to determine whether or not the hypothesis is rational. Obviously, a key problem to be solved is how to arrange elements to encode TC and SC. In order to solve this problem, OI-GA carries out the following steps: OI-GA creates a similarity matrix (SM) for TC and SC, where lines and columns are respectively labeled by the properties and instances. Each item in the SM is the similarity of the corresponding element-pair, calculated by the strategies of edit-distance and information-content. An iterative process of selection is performed. In each sub-process, OI-GA finds out and the biggest element from SM, and put the labels of line and column into the sequences of TC and SC respectively. Meanwhile, the elements in the line or column are deleted from the SM. This iterative process ends up until the line or column is empty.
Next, OI-GA finds out the biggest element in SM, e.g., Sim(eiTC, ejSC), where 1 < = i < = m and 1 < = j< = n. Therefore, the elements eiTC and ejSC are respectively put into the sequences of TC and SC, i.e., E1TC ← eiTC and E1SC ← ejSC. Meanwhile, the ith line and jth column are deleted from the SM. The above process is performed repeatedly, until SM becomes one-dimensional matrix. As a result, TC and SC can be encoded as <E1TC, E2TC, …, EmTC>and <E1SC, E2SC, …, EmSC>, and the mapping between TC and SC can be encoded as <Sim(E1TC, E1SC), Sim(E2TC, E2SC), …, Sim(EmTC, EmSC), F>.
The process of extracting 1-1 mappings based on GA
For the {Set-Candi-Mapping}1 - 1, the mappings in it are put into the mapping set {Set-Mapping}, directly. As for {Set-Candi-Mapping}1 - N and {Set-Candi-Mapping}M - 1, OI-GA designs two methods, i.e., method-1 and method-2, to extract 1-1 mappings from them. Finally, to deal with {Set-Candi-Mapping}M - N, M-N mappings are decomposes into 1-N mappings and M-1 mappings. In this way, OI-GA designs the method-3 by combining the method-1 and method-2. The following are the descriptions for the three methods that are included by the E-method. Method-1 for {Set-Candi-Mapping}1 - N
Grouping:
In a 1-N mapping, a target concept is mapped with more than one source concept. For this reason, method-1 groups all the mappings in {Set-Candi-Mapping}1 - N based on target concepts. Specifically, for each target concept, method-1 creates a group named by its name. Then, method-1 searches all the mappings in {Set-Candi-Mapping}1 - N, and to put the mappings into the group, whose target concepts are the same as the name of the group. For instance, mappings M(A1, A1’) and M(A1, A2’) have the same target concept A1. Thus, they are put into the same group {M(A1, A1’), M(A1, A2’)} that is restored in 1-N group-set. Note that, some M-N mappings may be put into 1-N group-set by mistake. Suppose that source concept A2’ in the group {M(A1, A1’), M(A1, A2’)} is also mapped to another target concept (A2). The M-N mapping {M(A1, A1’), M(A1, A2’), M(A2, A2’)} is moved to the M-N group-set by Method-3.
Crossover:
Next, the groups in 1-N group-set are taken out in turn, and method-1 performs the crossover for them. To confirm the elements to be exchanged between mappings in a group, e.g., {M(A1, A1’), M(A1, A2’)} shown in Fig. 3(a), method-1 generates a crossover operator (C-operator) for the group. The C-operator is a bit string composed by 0 or 1, and its length is equal to the number of items (i.e., properties and instances) contained in A1. Then, the C-operator is applied to exchange items between the source concepts (A1’ and A2’) that are mapped with the same target concept (A1). Finally, a 1-1 mapping is extracted from the group. The following example illustrates the generation of crossover in details.
The mappings M(A1, A1’) and M(A1, A2’) are transformed as bit strings, shown in Table 1. The two corresponding similarities in the same column are compared. If the result of comparison is “< ”, the corresponding item in the C-operator is set to 1, otherwise is set to 0. Thus, method-1 obtains the C-operator <0, 1, … , 1, … ,0>for M(A1, A1’) and M(A1, A2’). Next, according to the generated C-operator, method-1 determines how to exchange items between A1’ and A2’. Specifically, method-1 finds out all the items in the C-operator, which is set to 1, and records their columns. For each column, method-1 exchanges the two items belonging to A1’ and A2’ respectively. The fitness of the hypothesis representing mapping is recalculated. Obviously, the crossover in the method-1 increases the fitness of the mapping M(A1, A1’). As for other mapping(s), method-1 decreases it (their) fitness. In this way, the 1-N mapping {M(A1, A1’), M(A1, A2’)} can be replaced by the 1-1 mapping M(A1, A1’) shown in Fig. 3(b).
Mutation:
When extracting the 1-1 mapping M(A1, A1’) from the group {M(A1, A1’), M(A1, A2’)}, the method-1 performs the mutation to modify the items in A1’ by the similar items in A1. Like the crossover, method-1 firstly generates a mutation operator (M-operator) for concepts A1 and A1’. The M-operator is also a bit string composed by 0 or 1, and its length is equal to the length of the C-operator. The function of the M-operator is to identify the similar items between A1 and A1’. For example, two items (Iti and Iti’) belonging to A1 and A1’ are similar, then the ith item in M-operator is set to 1, otherwise is set to 0. Next, the items in A1 are used to replace the corresponding items in A1’, according to the M-operator.
To sum up, the mutation in method-1 is to find out the similar properties and instances between target concept and source concept in a 1-1 mapping. Therefore, it is the preparation for integrating ontologies.
Finally, the 1-1 mappings extracted from 1-N group-set are put into {Set-Candi-Mapping}1 - 1, and the method-1 repeatedly performs the selection, grouping, crossover, and mutation, until {Set-Candi-Mapping}1 - N is empty. Method-2 for {Set-Candi-Mapping}M - 1
Grouping:
The grouping of method-2 is similar to the grouping of method-1. For instance, mappings M(A1, A1’) and M(A2, A1’) have the same source concept A1’. Therefore, they are put into the same group {M(A1, A1’), M(A2, A1’)} that is staged at M-1 group-set. Like the method-1, some M-N mappings may be put into M-1 group-set by mistake. They are also moved to the M-N group-set by the method-3.
For a group in the M-1 group-set, method-2 firstly creates copies of the source concept, and transforms the group into multiple 1-1 mappings. Then, the following operations, i.e., crossover and mutation, are used to exchange items between sub-concepts.
Crossover:
The M-1 mapping {M(A1, A1’), M(A2, A1’)} in Fig. 4 has transformed as 1-1 mappings {M(A1, C1’), M(A2, C2’)}, where C1’ and C2’ are the copies of A1’. Then, method-2 generates a C-operator for them. Then, method-2 applies the string < ∞, ∞, … , ∞>to exchange the items in C1’ or C2’, according to the C-operator. Note that, the symbol “∞” is to label the items that should be removed from C1’ or C2’. The following example illustrates the crossover in details.
The 1-1 mappings M(A1, C1’) and M(A2, C2’) can be transformed as bit strings, shown in Table 2. Next, the two similarities in the same column are compared. Thus, method-2 obtains the C-operator <0, 1, … , 1, … X>for C1’ and C2’. Next, the generated C-operator is used to exchange items in C1’ and C2’ with < ∞, ∞ , …, ∞>. To be specific, method-2 firstly finds out the item in the C-operator, which is set to 1. Then, method-2 exchanges the corresponding item in C2’ with the symbol ∞. For the item in the C-operator, which is set to 0, method-2 exchanges the corresponding item in C1’ with the symbol ∞. Finally, method-2 deletes the items labeled with ∞ from C1’ and C2’, and recalculates the fitness of 1-1 mappings in the group. The mapping result is shown in Fig. 3(b).
Mutation:
For all the 1-1 mappings, i.e., M(A1, C1’) and M(A2, C2’) in the group {M(A1, A1’), M(A2, A1’)}, the method-2 performs the mutation to modify the items in the copies (C1’ and C2’) by the similar items in the target concepts (A1 and A2). To be more specific, method-2 generates two mutation operators (M-operator) for M(A1, C1’) and M(A2, C2’), respectively. The function of M-operator is to identify the similar items between the concept-pairs, i.e., (A1, C1’) and (A2, C2’). Supposed that two items (Iti and Iti’) belonging to A1 and C1’ are similar, then the ith item in M-operator is set to 1. Next, the ith item in A1 are used to replace the corresponding items in C1’. If Iti is not similar to Iti’, then the ith item in M-operator is set to 0, and no operation is done.
Like the mutation in the method-1, this operation is also to find out the similar properties and instances between target concept and source concept in a 1-1 mapping. It is also the preparation for integrating ontologies. Finally, the 1-1 mappings extracted from M-1 group-set are put into {Set–Mapping}, and the method-2 repeatedly performs the selection, grouping, crossover, and mutation, until {Set-Candi-Mapping}M - 1 is empty. Method-3 for {Set-Candi-Mapping}M - N
Grouping:
Reviewing the above works, it is easy to see that M-N mappings are decomposed into 1-N mappings and M-1 mappings by mistake, which are staged at {Set-Candi-Mapping}1 - N and {Set-Candi-Mapping}M - 1, respectively. To find out all the M-N mappings, the grouping is designed as follows: Method-3 obtains the intersection of {Set-Candi-Mapping}1 - N and {Set-Candi-Mapping}M - 1. The generated set is composed of the mappings included by both of candidate sets. Method-3 takes a mapping from the intersection, and considers its concepts as key words to search {Set-Candi-Mapping}1 - N and {Set-Candi-Mapping}M - 1. In this way, the mappings containing either concept of the mapping can be found out. Then, the group {Groupi} is created to store the mappings, where “i” is set to1. The new concepts included in {Groupi} are used to search {Set-Candi-Mapping}1 - N and {Set-Candi-Mapping}M - 1, and the searched mappings are put into {Groupi}. This step is carried out repeatedly, until no mapping can be add to {Groupi}. Method-3 updates “i” by adding 1 to itself. Then, it turns to step 2 if the intersection is not empty, otherwise the grouping ends up.
Realigning:
Now, all the M-N mappings are respectively restored in {Groupi} in M-N group-set. To extract 1-1 mappings from M-N mappings effectively, concepts in a group are realigned based on the hyponymy. For instance, suppose that the relationships among the concepts in {Group1} provided by example 4 are A1 sup A2 sup A3 and A1’ sup A2’, then a bigraph is generated by the realignment for {Group1}, shown in Fig. 5(a). In this way, M-N mappings are represented as bigraphs restored in M-N group-set.
Extracting:
As the two methods of extracting 1-1 mappings from 1-N mappings and M-1 mappings are respectively provided in method-1 and method-2, the method-3 adopts the top-down strategy to divide a bigraph representing an M-N mapping into 1-N mappings and M-1 mappings. Thus, method-1 and method-2 can be used to extract 1-1 mappings from M-N group-set.
By the top-down strategy, the M-1 mapping, i.e., {M(A1, A1’), M(A2, A1’)}, is firstly isolated from the bigraph, and the method-2 is applied to generate 1-1 mappings {M(A1, C1’), M(A2, C2’)}, shown in Fig. 5(b). Next, the 1-N mapping, i.e., {(A2, C2’), (A2, A2’)}, is separated out, and the method-1 extracts a 1-1 mapping (A2, C2’) for it. Note that, although the concept A2’ is deleted during the extraction, its items with high similarities are inherited by C2’. For this reason, the mapping M(A3, A2’) is changed by M(A3, C2’), shown in Fig. 5(c). Finally, the method-2 deals with the M-1 mapping {M(A2, C2’), M(A3, C2’)}, and the result containing 1-1 mappings {M(A1, C1’), M(A2, C2’), M(A3, C3’)} is shown in Fig. 5(d).
In conclusion, OI-GA employs the method-1,method-2 and method-3 to extract 1-1 mappings from {Set-Candi-Mapping}1 - N, {Set-Candi-Mapping}M - 1 and {Set-Candi-Mapping}M - N. Some examples describe the extraction in details. Until the three subsets of {Set-Candi-Mapping} are empty, OI-GA stops running, and returns the {Set-Mapping} to us.
The analysis of the E-method based on GA
In the E-method, the source concepts are considered as a population, and the size of the population is the number of source concepts. The E-method takes all the source concepts in the {Set-Candi-Mapping}1 - 1 into the next generation, directly. As for the source concepts in the sets, {Set-Candi-Mapping}1 - M, {Set-Candi-Mapping}M - 1, {Set-Candi-Mapping}M - N, three sub-methods (method-1, method-2 and method-3) are designed, inspired by the main process of GA (selection, crossover, mutation).
In each iterative process, the {Set-Candi-Mapping} is gradually optimized by transforming a non-1-1 mapping into a group of 1-1 mappings. The E-method is stopped, when all the non-1-1 mappings are dealt with. In other words, the E-method converges when new 1-1 mappings are not extracted. Therefore, the iterations depend on the number of non-1-1 mappings. To sum up, compared with other evolutionary algorithms based on the traditional GA, the E-method decreases the randomness, and have a shorter convergence time.
Ontology integration
Ontology integration in OI-GA is to integrate the source ontology (SO) into the target ontology (TO), according to 1-1 mappings between them. From the microscopic view, each source concept (SC) connected by 1-1 mapping (M(TC, SC)) is merged into the target concept (TC). During the process of merging, the properties and instances of concepts TC and SC are considered as the main research objects. We use the following methodology to integrate TO and SO. OI-GA takes a 1-1 mapping out of {Set-Mapping} generated by the extraction from {Set-Candi-Mapping}. For example, M(TC, SC) is the 1-1 mapping, and concepts TC and SC are to be merged. OI-GA respectively creates property sets ({Set-Property}TC and {Set-Property}SC) and instance sets ({Set-Instance}TC and {Set-Instance}SC) to restore the properties and instances belonging to TC and SC. Obviously, the item sets {Set-Item}TC and {Set-Item}SC of TC and SC are represented as {Set-Item}TC = {Set-Property}TC ∪ {Set-Instance}TC and {Set-Item}SC = {Set-Property}SC ∪ {Set-Instance}SC. OI-GA merges SC into TC by dealing with {Set-Item}TC and {Set-Item}SC. To distinguish the merged concept with TC, it is named as M-TC. For the new merged concept M-TC, the semantic scope restrained by its properties covers the semantic scope of TC and SC. Moreover, its instances are composed by the instances of TC and SC. Therefore, the property set of M-TC is the intersection of {Set-Property}TC and {Set-Property}SC, i.e., {Set-Property}TC ∩ {Set-Property}SC. The instance set of M-TC is the union of {Set-Instance}TC and {Set-Instance}SC, i.e., {Set-Instance}TC ∩ {Set-Instance}SC. The integration rules are represented by the following formulas:
OI-GA checks the state of {Set-Mapping}. If it is a nonempty set, OI-GA jumps to step (1) and continues to merge concepts with 1-1 mappings by the above three steps. Otherwise, the process of integration stops, and the merged ontology TO is the result of integrating.
Obviously, the step of merging concepts with 1-1 mapping is composed by step (2) and step (3). It is the key of integration. To this end, we give the following figure to describe it in details.
In this section, two different experiments were performed to assess the feasibility and robustness of the E-method.
The feasibility analysis of the E-method
To state the advantage of OI-GA in integrating ontologies connecting with mappings composed of 1-1 mappings and non-1-1 mappings, we provide an instance coming from the biological field. Our objective is to describe the process of extracting 1-1 mappings from non-1-1 mappings in detailed. To date, there is not uniform data set for M-N mappings, and the ontologies (TC and SC) were designed by our laboratory. As shown in Fig. 7, there are two fragments of ontologies connecting with an M-N mapping. The left one is from TC, and the other from SC. The concepts in the dashed box are mapped by an M-N mapping, where the labels of mappings are the similarities of concept-pairs. Table 3 shows the concepts, as well as their properties and instances.
An M-N mapping between them had been created by OI-GA. The fitness of every mapping is represented by the corresponding label. Based on the top-down strategy of method-3 included in the E-method based GA, the SC (source concept) “Catamount” in M-1 mapping i.e., {(Mammal, Catamount), (Felid, Catamount)} is firstly separated into two concepts: “C1” and “C1”. The properties and instances of them are shown in Table 4. Thus, two 1-1 mappings, i.e., M(Mammal, C1) and M(Felid, C1’), are extracted from the M-1 mapping, and the first generation is shown in Fig. 8 (a). Their fitnesses are 0.94 and 0.85, respectively. Next, the 1-N mapping, i.e., {(Felid, C1’), (Felid, Panthera)} should be dealt with. Method-3 extracts 1-1 mapping M(Felid, C1’) from it, shown in Fig. 8 (b). In this extraction, all the items of the concept “Panthera”, which are similar to the items of concept “Lion” in TO, are substituted in the concept “Felid”. For this reason, the mapping M(Lion, Panthera) is replaced by M(Lion, C1’). The properties and instances of them are shown in Table 5. Their fitnesses are 0.96 and 0.87, respectively. Finally, the M-1 mapping {(Felid, C1’), (Lion, C1’)} is transformed into 1-1 mappings M(Felid, C2) and (Lion, C3), shown in Fig. 8 (c). Thus, concepts “Catamount” and “Panthera” are transformed into concepts C1, C2, and C3, shown in Table 6. The fitness of M(Lion, C3) is 0.89.
In the process of integration, OI-GA applies the integration rules to deal with the property set and the instance set of concept-pair with 1-1 mapping. In this way, SO is integrated into TO.
The robustness analysis of the E-method
To date, the method of M-N mapping is not well researched, and there is not a dataset specialized for measuring this kind of method. For this reason, we used the dataset “benchmark” to measure the feasibility, stability and reliability of the E-method. The benchmark, provided by OAEI, contains 74 ontologies, where the ontology #101 is the reference ontology, and other source ontologies are obtained by adding, deleting or changing some information of #101, excepting the ontology #102. Therefore, most of them share the same schema, which is composed by 33 concepts, 72 properties and 55 instances.
To create some non-1-1 mappings, the similarity threshold is set to 0.8. The number of non-1-1 mappings between #101 and other source ontologies is shown in the second column of Table 7. Our work is to extract 1-1 mappings from non-1-1 mappings, and integrate all the source ontologies into #101 based on 1-1 mappings. In order to evaluate the integration performance of OI-GA, we provide the calculation formula of integration precision (PI) as follows.
In the formula (6), the numerator is the number of the correct instances integrated by OI-GA, and the denominator the number of all the instances (55). In this way, the statistical result is shown in the last column of Table 7. In these source ontologies, #102 is an irrelevant ontology, #103 and #104 contains the comments with different language generalization and restriction, and #201∼#266 are created by deleting or modifying some elements. The statistical results of OI-GA are shown in Table 7. Obviously, no mapping is created between #101 and #102, and all the instances in #102 cannot be integrated into #101. For this reason, the integration precision is 0. Compared with #101, the concepts in #103 and #104 are enriched by the comment of generalization and restriction. Therefore, the integration precisions are 1.0. As for #201∼#266, some non-1-1 mappings are created, shown in the second column of Table 7. When extracting 1-1 mappings from them, some instances and properties to be integrated are not included in the new source concepts. The range of the integration precisions is between 0.76and 0.91.
To sum up, almost 70 source ontologies (#102∼#103, #201∼#266) with different elements are respectively integrated into the target ontology (#101). Compared with the integration result provided by artificial method, the mean integration precision is about 0.89. The experimental result is desirable, and we conclude that the OI-GA is robust.
In this paper, we propose a new framework of ontology integration based on genetic algorithm, called OI-GA. The whole contribution can be divided further into two parts: (1) the extraction of 1-1 mappings from non-1-1 mappings and (2) the integration method. The first part consists of three evolutionary methods based on GA, which are used to extract 1-1 mappings from 1-N mappings, M-1 mappings, and M-N mappings. The next part involves the design of ontology integration method. Moreover, two integration rules are proposed to deal with the property sets and instance sets of concepts. Experimental results consisting of two parts indicate that (1) OI-GA performs encouragingly well when integrating ontologies from the real world; (2) OI-GA is robust, when almost 70 different source ontologies in the benchmark are used to measure the integration precision of OI-GA.
Some problems should be solved in our future work. Firstly, we should create more datasets of M-N mappings. Other datasets provided by OAEI, such as anatomy and conference, should be focused on. They help us evaluate the benefits of OI-GA. Secondly, we should design the user interface with the functions of setting parameter, debugging, and test result display, in order to improve the flexibility of OI-GA.
Footnotes
Acknowledgments
The authors wish to thank the anonymous referees for their valuable comments and suggestions, which improved the technical content and the presentation of the paper. This work was supported by the National Natural Science Foundation of China (61204127), and the Science & Technology Research Project of Education Department (12541899) and the Natural Science Foundation of Heilongjiang Province (F2015024 and F201334).
