Abstract
Biomedical ontology matching aims to find the alignment between two heterogeneous ontologies and thereby address their heterogeneity problem. Typically, a biomedical ontology contains many biomedical concepts that are described with various labels and datatype property names, which form a lexical space where each label or datatype property represents one dimension. It is therefore effective to represent two biomedical concepts as vectors in this space and use the cosine similarity to measure how similar they are. In this work, we represent two biomedical concepts in a lexical vector space constructed from the lexical information of the concepts themselves and of their context concepts, and then use the cosine similarity of the two vectors as their similarity value. We then propose a compact Evolutionary Algorithm (cEA) to find the concept correspondences. The experiment uses the testing cases of the Ontology Alignment Evaluation Initiative (OAEI), and the comparison with the Vector space Based Ontology Matcher (VBOM), the Genetic Algorithm based Ontology Matcher (GAOM) and OAEI’s participants shows the effectiveness of our proposal.
Introduction
Nowadays, Artificial Intelligence (AI) techniques [2, 21] are widely applied in various domains. Among these domains, the semantic web, whose kernel technique is the ontology, attracts more and more attention. A biomedical ontology provides a formal definition of biomedical concepts and their relationships, which is the foundation for inter-operation among intelligent biomedical systems [26]. However, knowledge defined in different biomedical ontologies may be described in different ways, e.g. the same biomedical concept may be defined with different terminologies or in different contexts. It is therefore necessary to bridge the semantic gaps among different biomedical ontologies, a task known as biomedical ontology matching. In general, the matching process can be divided into two steps: (1) calculating the similarity value of two entities, and (2) finding all the entity mappings. Typically, a biomedical ontology contains many biomedical concepts that are described with various labels and datatype property names, which form a lexical space where each label or datatype property represents one dimension [4]. It is therefore effective to represent two biomedical concepts as two vectors in that lexical space and use the cosine similarity to calculate their similarity value. Tous et al. [18] take the predicates as references and represent each entity as a vector on the basis of the relationships between the predicates. They further propose a similarity matrix over all the ontology entities and update its similarity values through a graph matching algorithm. Eidoon et al. [4] make use of the concepts and properties in two ontologies to construct a multi-dimensional vector space model, and then vectorize each entity through a weighting mechanism. They calculate the similarity value of two entities through the cosine function in the vector space.
The existing work mainly models the ontology in the vector space by using taxonomy information, i.e. the relationships between the entities, so the resulting matching quality can be poor when two ontologies have different structures. To address this issue, we additionally take the entities’ lexical information into consideration when vectorizing them, in order to enhance the confidence of the results.
Moreover, since determining the entity mapping set is a complex and time-consuming task, the Evolutionary Algorithm (EA) is a suitable methodology for finding a high-quality alignment [1, 5–7]. In recent years, Xue et al. have proposed a hybrid EA [24], an interactive EA [26], a multi-objective EA [25] and a many-objective EA [27] to optimize the quality of the ontology alignment. The existing EA-based matchers rely on a population-based evolving mechanism. To reduce the algorithm’s memory consumption, this work uses a compact EA (cEA), which replaces the population with a probabilistic representation, to optimize the alignment.
The rest of the paper is organized as follows: Section 2 defines some basic concepts of biomedical ontology matching; Section 3 describes the biomedical concept similarity measure in the lexical vector space; Section 4 describes cEA in detail; Section 5 gives the experimental configuration and results; and finally, Section 6 draws the conclusion.
Biomedical ontology matching problem
A biomedical ontology consists of a set of concepts and a set of datatype properties that describe the concepts’ features. An ontology alignment is a set of entity mappings, where each correspondence consists of two entities, one from each ontology, the equivalence relation between them, and the confidence that this relation holds. A similarity measure takes as input the information of two ontology entities and outputs a real number in [0,1] that indicates to what extent they are similar: 1 means they are identical, and 0 means they are entirely dissimilar. Given two concept correspondences (ci1, cj1) and (ci2, cj2), if they preserve similarity well, |sim (ci1, ci2) - sim (cj1, cj2) | should be close to 0. On this basis, given an alignment A, its quality can be approximately measured as follows:
It is effective to model two biomedical concepts in a vector space and use a vector space based similarity measure, such as the cosine similarity measure, to calculate their similarity value. First, we need to construct a lexical vector space for an ontology, in which each dimension relates to a label or a datatype property. In particular, the lexical vector space should cover all the ontology entities while avoiding redundant dimensions for similar entities. Given two biomedical concepts, their lexical vector space is built by extracting, as the dimensions, the labels and all the datatype properties of the two concepts and of their direct ascendant and descendant classes. Then, a weighting mechanism is used to represent each concept as a vector in the lexical vector space.
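As a minimal sketch of the idea described above, the following Python code builds a shared lexical space from the lexical items of two concepts and compares them with cosine similarity. The weighting here is plain term frequency, which is an assumption made only for illustration and not the weighting mechanism of this work; the function names and the sample lexical items are likewise hypothetical.

```python
import math
from collections import Counter

def build_space(*token_lists):
    """Collect the distinct lexical items (labels, datatype property names) as dimensions."""
    return sorted(set().union(*token_lists))

def vectorize(tokens, space):
    """Represent a concept as a vector; the weight is raw term frequency (assumed)."""
    counts = Counter(tokens)
    return [float(counts[dim]) for dim in space]

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical lexical items gathered from each concept and its direct
# ascendant and descendant classes.
concept_a = ["heart", "disease", "hasSymptom", "cardiac"]
concept_b = ["heart", "disorder", "hasSymptom"]

space = build_space(concept_a, concept_b)
sim = cosine_similarity(vectorize(concept_a, space), vectorize(concept_b, space))
```

Because the weights are non-negative, the result stays in [0,1], which matches the range required of a similarity measure in Section 2.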
Given a lexical vector space
Compact evolutionary algorithm
Since matching biomedical ontologies is a complex and time-consuming task, EA is a suitable method for addressing it [23]. First, we need to encode a biomedical alignment, i.e. a set of concept mappings. Since the kernel elements of an entity correspondence are the two mapped concepts, we can simply use their indices in the ontologies to encode it. Here, we empirically choose the Gray code, which is a binary encoding mechanism, to encode an alignment and ensure cEA’s evolving efficiency.
cEA uses a Probability Vector (PV) [22] to approximate the evolving mechanism of a population-based EA. Each element of the PV corresponds to one gene bit of a solution and gives the probability that this bit is 1. A PV can therefore be sampled to generate different solutions. If a newly generated solution beats the current elite, it becomes the new elite and the PV is moved toward it by increasing (or decreasing) each corresponding element of the PV by a real number st.
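The index-based Gray encoding mentioned above can be sketched as follows. The layout is an assumption made for illustration: the i-th gene holds the fixed-width Gray code of the target concept index mapped to source concept i. The helper names and the sample alignment are hypothetical.

```python
def to_gray(n: int) -> int:
    """Convert a non-negative integer to its reflected binary Gray code."""
    return n ^ (n >> 1)

def from_gray(g: int) -> int:
    """Invert to_gray by XOR-folding the successive right shifts of g."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

def encode_alignment(target_indices, bits):
    """Concatenate fixed-width Gray codes, one gene per source concept."""
    return "".join(format(to_gray(j), f"0{bits}b") for j in target_indices)

def decode_alignment(bitstring, bits):
    """Split the bit string into genes and recover the target indices."""
    return [from_gray(int(bitstring[k:k + bits], 2))
            for k in range(0, len(bitstring), bits)]

# Hypothetical alignment: source concept i maps to target index alignment[i].
alignment = [2, 0, 3]
encoded = encode_alignment(alignment, bits=2)   # "110010"
decoded = decode_alignment(encoded, bits=2)     # [2, 0, 3]
```

A useful property of the Gray code is that consecutive indices differ in exactly one bit, so a single bit flip during the search moves a gene to a neighbouring index rather than to an arbitrary one.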
Given the maximum generations MaxGeneration = 2000, a real number st = 0.1 for updating PV, a solution (or PV)’s length len, cEA’s pseudo-code is presented in Algorithm 1.
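**** Initialization ****
for i = 1 to len do
    PV[i] = 0.5;
end for
solution_elite = generateSolution (PV);
generation = 1;
**** Evolving Process ****
while generation <= MaxGeneration do
    solution_new = generateSolution (PV);
    winner = compete (solution_new, solution_elite);
    if winner == solution_new then
        solution_elite = solution_new;
        for i = 1 to len do
            if solution_elite[i] == 1 then
                PV[i] = PV[i] + st;
            else
                PV[i] = PV[i] - st;
            end if
        end for
    end if
    generation = generation + 1;
end while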
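The following Python code is a minimal runnable sketch of the procedure in Algorithm 1, under stated assumptions rather than the authors’ implementation. The fitness function is passed in by the caller, and the toy fitness used here (counting 1-bits) merely stands in for the alignment quality measure. Clamping each PV element to [0,1] is an addition so that the elements remain valid probabilities. All function names are hypothetical.

```python
import random

def generate_solution(pv, rng):
    """Sample a bit string: bit i is 1 with probability pv[i]."""
    return [1 if rng.random() < p else 0 for p in pv]

def compact_ea(fitness, length, max_generation=2000, st=0.1, seed=0):
    """Evolve a single probability vector instead of a population."""
    rng = random.Random(seed)
    pv = [0.5] * length
    elite = generate_solution(pv, rng)
    elite_fit = fitness(elite)
    for _ in range(max_generation):
        candidate = generate_solution(pv, rng)
        cand_fit = fitness(candidate)
        if cand_fit > elite_fit:
            # The candidate wins the competition and becomes the new elite.
            elite, elite_fit = candidate, cand_fit
            # Move the PV toward the elite by st, clamped to [0, 1] (assumed).
            for i, bit in enumerate(elite):
                step = st if bit == 1 else -st
                pv[i] = min(1.0, max(0.0, pv[i] + step))
    return elite, pv

# Toy fitness standing in for the alignment quality: number of 1-bits.
elite, pv = compact_ea(fitness=sum, length=16)
```

Only the PV and two solutions are held in memory at any time, which is the source of the memory saving compared with a population-based EA.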
Experiment
Experimental configuration
In the experiment, the Disease and Phenotype track and Biodiversity and Ecology track provided by Ontology Alignment Evaluation Initiative (OAEI) are utilized to test cEA’s performance, and Table 1 summarizes their main statistics.
Description on the OAEI’s tracks.
To compare cEA with the Vector-based Ontology Matcher (VBOM) [4], the Genetic Algorithm based Ontology Matcher (GAOM) [20] and OAEI’s participants, the obtained alignments are evaluated by recall, precision and f-measure [19]. cEA uses the parameters given in Section 4, which were determined empirically, and VBOM and GAOM use the configurations from their corresponding literature.
As can be seen from Figs. 1 and 2, cEA’s f-measure is the highest in all testing cases and its precision is generally high, which shows the effectiveness of the lexical vector space based similarity measure. Compared with VBOM and GAOM, our approach improves both recall and precision significantly, which further shows the effectiveness of the compact evolving mechanism.
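For reference, the three evaluation metrics can be computed from a found alignment and a reference alignment as in the short sketch below. It assumes each correspondence is represented as a hashable pair of entity identifiers; the function name and the sample identifiers are hypothetical.

```python
def evaluate(found, reference):
    """Return (precision, recall, f_measure) of a found alignment."""
    found, reference = set(found), set(reference)
    correct = len(found & reference)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical correspondences: (source entity, target entity).
reference = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4"), ("a5", "b5")}
found = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b9")}

precision, recall, f_measure = evaluate(found, reference)
```

Here three of the four found correspondences are correct and three of the five reference correspondences are recovered, so precision is 0.75 and recall is 0.6.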

Comparison on Disease and Phenotype track

Comparison on Biodiversity and Ecology track
As Tables 2 and 3 show, cEA reduces the memory consumption and runtime of VBOM and GAOM by 53.23% and 49.80%, respectively. To conclude, our approach is able to match the biomedical ontologies effectively while significantly reducing memory consumption and runtime.
Comparison of the memory consumption per generation
Comparison on the runtime taken per generation
Conclusion
Biomedical ontology matching aims at finding identical biomedical entities in two heterogeneous biomedical ontologies. To improve the efficiency of matching two biomedical ontologies, in this work we first represent two biomedical concepts in a lexical vector space constructed from the lexical information of the concepts themselves and of their context concepts, and then use the cosine similarity of the two vectors as their similarity value. We model biomedical ontology matching as a discrete optimization problem and propose a cEA to address it. The experimental results show the effectiveness of our proposal.
Acknowledgment
This work is supported by the National Natural Science Foundation of China (No. 61503082), the Natural Science Foundation of Fujian Province (No. 2016J05145), the Program for New Century Excellent Talents in Fujian Province University (No. GY-Z18155), the Program for Outstanding Young Scientific Researcher in Fujian Province University (No. GY-Z160149) and the Scientific Research Foundation of Fujian University of Technology (Nos. GY-Z17162 and GY-Z15007).
