A multi-strategy approach for the merging of multiple taxonomies

Abstract

Taxonomy merging is an important work to provide a uniform schema for several heterogeneous taxonomies. Previous studies primarily focus on merging two taxonomies in a specific domain, while the merging of multiple taxonomies has been neglected. This article proposes a taxonomy merging approach to automatically merge multiple source taxonomies into a target taxonomy in an asymmetric manner. The approach adopts a strategy of breaking up the whole into parts to decrease the complexity of merging multiple taxonomies and employs a block-based method to reduce the scale of measuring semantic relations between concept pairs. In addition, for the problem of multiple inheritance, a method of topical coverage is proposed. Experiments conducted on synthetic and real-world scenarios indicate that the proposed merging approach is feasible and effective to merge multiple taxonomies. In particular, the proposed approach works well in the aspects of limiting the semantic redundancy and establishing high-quality hierarchical relations between concepts.

Keywords

Multiple taxonomies merging resource reuse schema integration taxonomy

1. Introduction

A taxonomy as a hierarchical data organisation framework is the most important ontology element and is widely used to provide the sharing and exchange of content related to specific knowledge or application domain [1 –3]. Nowadays, a growing number of taxonomies have been developed on the Web for resource management [4,5], intelligent information retrieval [6] and knowledge discovery system [7,8].

However, taxonomies were built by different organisations or individuals using different tools and scattered on the Web. As a result, many taxonomies refer to the same domain in different formats, which make it hard for users to fully understand the knowledge behind these taxonomies and bring great difficulties for the application of taxonomy resources. It is necessary, therefore, to merge multiple related taxonomies in order to resolve the taxonomy heterogeneity and provide a unified schema serving for the management of resources.

Currently, schema integration, including taxonomy merging, is regarded as a vital, yet challenging task. Various schema integration approaches have been proposed to deal with the problem [9 –11]. The process of taxonomies merging can be symmetric or asymmetric with respect to the input taxonomies. The symmetric-based merging preserves all information from input taxonomies with the same priority but thereby introduces a large amount of redundant information and reduces the understandability of the merged result [10,11]. Comparatively, the asymmetric-based merging takes one of the input taxonomies as the target taxonomy and the other input taxonomies as the source taxonomies, and it merges the source taxonomies into the target taxonomy. In this way, the concepts and relations of the target taxonomy are fully preserved, and only the concepts and relations of the source taxonomies, that do not introduce semantic redundancy, are added.

One limitation of previous taxonomy merging approaches is that these approaches mainly focus on merging two taxonomies, while the problem of merging three or more taxonomies has not been investigated. With the increasing use of taxonomies in various domains, the merging of multiple taxonomies is encountered in numerous situations. In comparison with two taxonomies merging, the multiple taxonomies merging is more time-consuming and complicated, which is primarily attributed to the following two factors:

In general, before performing the merging of taxonomies, it is essential to measure the semantic relations between any two concepts belonging to different taxonomies. In the case of two taxonomies merging, the measures need to be carried out $n^{2}$ times when both taxonomies have n concepts. For the task of m taxonomies merging, the maximum times of measures would be $1 / 2 (m (m - 1) n^{2})$ . It is noted that calculating the semantic relations between a concept pair is already a complicated and time-consuming task. Accordingly, the merging of multiple heavyweight taxonomies that consists of hundreds of thousands of concepts is computationally unacceptable.

Since the asymmetric-based approach can retain the structure of the target taxonomy and largely limit the semantic redundancy, it is more desirable than the symmetric-based approach. However, maintaining the structure of the target taxonomy is a great challenge in the process of merging two taxonomies. For the merging of multiple taxonomies, the increase in the number of concepts to be merged and more complicated semantic relations among taxonomies add a further dimension to the complexity of the asymmetric-based approach.

In this article, we propose a taxonomy merging approach that takes two or more domain-specific taxonomies as input and produces a merged taxonomy as output. The proposed approach is asymmetric, and it can automatically determine the merged solution. For the multiple large-scale taxonomies that should be merged, the strategy of ‘breaking up the whole into parts’ is adopted. According to the characteristics of concepts, our approach divides all concepts from source taxonomies into two parts, that is, core concepts and non-core concepts, which are the interior nodes and leaf nodes of the merged taxonomy, respectively. In the process of integrating core concepts, a block-based approach is proposed to reduce the calculating scale of measuring semantic relations between concept pairs. Specifically, non-core concepts mainly contain leaf nodes that will not modify the hierarchical structure of the target taxonomy; thus, the goal of maintaining the hierarchical structure of the target taxonomy can be achieved in the process of integrating core concepts.

The main contributions of our work are summarised as follows:

An approach of merging multiple taxonomies is proposed, which can merge multiple source taxonomies into a target taxonomy in a fully automatic and asymmetric manner.

To reduce the complexity of the multiple taxonomies merging, we propose a method to divide concepts from source taxonomies into different types and develop different integration strategies for each type.

A block-based approach is presented to decompose the symmetric merged taxonomy. The approach reduces the scale of measuring semantic relations between concept pairs and preserves the structure of the target taxonomy.

A topical coverage is provided for each parent–child pair to address the multiple inheritance problem, which can effectively reduce semantic overlap and increase the understandability of the merged taxonomy.

The rest of this article is organised as follows. In section 2, the related works are reviewed and discussed. Problem definition and relevant terminologies are presented in section 3. Section 4 gives the general overview of the proposed approach and details the step of merging multiple taxonomies. Section 5 provides the approach of generating alignment. Experiments conducted on synthetic and real-world scenarios are presented in section 6. Finally, we provide remarks along with some future research directions in section 7.

2. Related work

Taxonomy merging is a complicated and challenging task, which can be decomposed into independently solvable two subproblems, that is, identifying correspondence relations and combining the information from input taxonomies.

The identification of corresponding concepts between two or more heterogeneous schemas is an active field of current research. It is a prerequisite for the merging of ontologies and taxonomies. A number of approaches have been developed to determine different kinds of correspondences, including but not limited to equivalence, is-a and inverse-is-a correspondences.

Several methods focus mainly on identifying equivalence correspondences such as COMA++ [11], Anchor-Flood [12], ASMOV [13] and RiMOM [14], which primarily used linguistic-based strategies and structure-based strategies. COMA++ [15] provided context-dependent and segment-based approach to solve large-scale alignment problems. Similar to COMA++, Anchor-Flood [12] also took segment-based method to achieve equivalence correspondence between two ontologies of arbitrary size. ASMOV [13] derived equivalence correspondences between two ontologies by iteratively calculating an equivalence measure in conceptual and structural characteristics. RiMOM [14] proposed a multi-strategy method to align different kinds of ontologies and employed a learning-based approach to support multilingual environments.

When it comes to detect correspondences other than equivalence correspondences, there are also many studies on the identifying of more expressive relations. S-Match [16] and STROMA [17] are able to distinguish the different kinds of correspondences including equivalence, is-a and part-of correspondences between concepts of two ontologies based on linguistic characteristics and background knowledge from WordNet. COMPOSE [18] developed a three-stage alignment framework to detect equivalence and is-a correspondences. Recently, SBOMT [19] employed a statistical model to generate one-to-many alignment between two ontologies, which can discover all semantic relations for any concept pairs. Different from those approaches of identifying equivalence correspondences, the approaches of measuring enriched correspondences need to use external resources, which limit the adaptation of these approaches in special situation.

As for the process of schema merging, considerable work has investigated such research in the last 20 years. In the early stages, ontology merging approaches are mostly semi-automatic, that is, the generation of merged result requires human domain experts’ intervention [20 –25]. Such semi-automatic approaches are far from realistic, obviously, when dealing with large-scale taxonomies with thousands of concepts. To overcome the limitation, several schema merging approaches with fully automatised manner have been proposed recently.

In Guzmán-Arenas and Cuevas [26], the authors proposed an automatic ontology merging approach that considers the inconsistencies, contradictions and redundancies between the input ontologies but leaves out some information in the merged result. In order to incorporate all concepts of ontologies without semantic redundancy, Rangel et al. [27] developed a novel ontology merging algorithm based on semantic similarity measure and exhaustive searching along with the closest concepts. Maree and Belkhatir [28] developed an ontology merging approach that focuses on domain-specific ontology.

Many approaches have been developed and successfully used in merging two ontologies. However, the resulting ontology as a graph-based model is not suitable for some applications that require the tree-structured taxonomy [29,30]. Furthermore, previous works paid little attention to the merging of taxonomies. Yang et al. [31] proposed a taxonomy integration approach that employs several operations, such as mapping, splitting and inserting to ensure the merged result conform to the tree structure. Likewise, Xie et al. [32] proposed a tree-structured ontology integration method to facilitate queries between ontologies. One representative approach in the taxonomies merging, called ATOM, is to employ target-driven approach to merge automatically a source taxonomy into the target taxonomy [10]. The approach largely limits the semantic redundancy, but it allows the existence of multiple inheritance in the merged output.

In summary, the review of the state of the art shows that all existing merging approaches – including ontology merging and taxonomy merging – focus mainly on merging two schemas, and little work has been done to study the multiple schemas merging. This motivates our research on the merging of multiple taxonomies in an automatic manner.

3. Terminology and preliminaries

3.1. Taxonomy

A taxonomy T_k represents a tree-based schema that consists of a set of concepts having certain relations and sharing a set of resources. It is denoted by a triple $T_{k} = (C, R, D)$ where:

$C = {(c_{i}), i \in [1, | C |]}$ represents a set of concepts of the taxonomy. Each concept $c_{i}$ is named by its label $label (c_{i})$ . The concept hierarchy of $T_{k}$ is a partial order in the concept set. The topmost concept in a taxonomy is called the root $c_{root}$ . There is only one root in a taxonomy.

$R = {(r_{i}), i \in [1, | R |]}$ represents a set of semantic relations between concepts, where each semantic relation $r_{i} (c_{i}, c_{j})$ stands for a directed semantic connection between a concept pair. Some relations can be defined like ‘is-a’, ‘part-of’, ‘type-of’ and so on. In this article, only the relation ‘is-a’ is considered. If an order concept pair $c_{i} ⪯ c_{j}$ , then $c_{i}$ ‘is-a’ $c_{j}$ and $c_{j}$ ‘inverse-is-a’ $c_{i}$ .

$D = {(d_{i}), i \in [1, | D |]}$ represents a corpus of documents. Each concept $c_{i}$ attaches with some documents $d_{i}$ . The documents $d_{i}$ also satisfy a partial order ⊆, where ⊆ denotes the inclusion relation between document sets.

A taxonomy $T_{k}$ satisfies the following properties:

If $c_{i} ⪯ c_{j}$ , then $d_{i} \subseteq d_{j}$ .

For any three concepts $c_{i}$ , $c_{j}$ and $c_{k}$ , if $c_{i} ⪯ c_{j}, c_{j} ⪯ c_{k}$ , then $c_{i} ⪯ c_{j} ⪯ c_{k}$ .

For children $c_{1}, \dots, c_{i}, \dots, c_{m}$ of a concept $c_{n}$ , $\forall c_{i} ⪯ c_{n} (i \in [1, m])$ .

For any two children $c_{i}$ and $c_{j}$ of a concept $c_{n}$ , $d_{i} \cap d_{j} = \emptyset$ .

3.2. Alignment

Alignment $A$ represents the set of corresponding concepts between two or more taxonomies, denoted by a triple A = (T_i, T_j, R), where $R$ is the correspondence relations holding between a concept pair from taxonomies $T_{i}$ and $T_{j}$ , respectively. For two concepts $c'$ and $c ″$ , respectively, from taxonomies $T_{i}$ and $T_{j}$ , the possible correspondence relations between them include three typical types: equivalence, is-a and inverse-is-a.

If concepts $c'$ and $c ″$ are equivalent in meaning or they refer to the same semantic object, that is, one of them is a synonym of the other, we can say the two concepts have an equivalence correspondence, represented as $Equal (c', c ″)$ . The definition of $Equal (c', c ″)$ is as follows

\begin{matrix} Equal (c', c ″) : = {(c', c ″) | c' \in T_{i} \land c ″ \in T_{j} \land (object (c') = object (c ″))} \end{matrix}

(1)

where $object (c)$ stands for the semantic object referred to by concept $c$ . In this article, we only consider one-to-one equivalence correspondence, that is, for any concept in a taxonomy, there is only one concept at most having an equivalence correspondence in another taxonomy.

The semantic relation between concepts $c'$ and $c ″$ is called is-a correspondence if concept $c'$ describes something more special than concept $c ″$ , that is, $c'$ is a sub-concept of $c ″$ , as denoted by $Isa (c', c ″)$

\begin{matrix} Isa (c', c ″) : = {(c', c ″) | c' \in T_{i} \land c ″ \in T_{j} \land c' ⪯ c ″} \end{matrix}

(2)

If there exists a correspondence $Isa (c', c ″)$ between concepts $c ″$ and $c'$ , then we can deduce that $c'$ is the super concept of $c ″$ , represented as $Invisa (c', c ″)$ . The correspondence $Invisa (c', c ″)$ is defined as

\begin{matrix} Invisa (c', c ″) : = {(c', c ″) | c' \in T_{i} \land c ″ \in T_{j} \land c ″ ⪯ c'} \end{matrix}

(3)

In particular, if an equivalence correspondence is established between a concept pair, we can deduce other correspondence relations according to the order relation of concept pair in its taxonomy.

In Figure 1(a), concepts $c'_{1}$ and ${c ″}_{1}$ are the super concepts of concepts $c'_{2}$ and ${c ″}_{2}$ , respectively, and concepts $c'_{3}$ and ${c ″}_{3}$ are the sub-concepts of $c'_{2}$ and ${c ″}_{2}$ , respectively. If there exists an equivalence correspondence between $c'_{2}$ and ${c ″}_{2}$ then $c'_{1}$ and ${c ″}_{1}$ are the ancestors of both $c'_{2}$ and ${c ″}_{2}$ , and meanwhile $c'_{3}$ and ${c ″}_{3}$ are the descendants of both $c'_{2}$ and ${c ″}_{2}$ . The semantic relations between $c'_{1}$ and ${c ″}_{2}$ ; ${c ″}_{1}$ and $c'_{2}$ ; $c'_{2}$ and ${c ″}_{3}$ ; and ${c ″}_{2}$ and $c'_{3}$ do not need to be measured again.

Figure 1.

The inheritance for super concept and sub-concept relations.

A special case may happen in the inheritance of semantic relations. In Figure 1(b), there exist two equivalence correspondences between concept pairs $c'_{2}$ and ${c ″}_{2}$ ; $c'_{4}$ and ${c ″}_{4}$ , namely, $Equal (c'_{2}, {c ″}_{2})$ and $Equal (c'_{4}, {c ″}_{4})$ . After merging the two concept pairs, respectively, concepts $c'_{3}$ and ${c ″}_{3}$ become the parents of the new merged concept $(c'_{4}, {c ″}_{4})$ . We assume that the semantic relation for the concept pair $(c'_{3}, {c ″}_{3})$ cannot be identified since one of the concepts or both do not exist in the semantic dictionaries, this would lead to multiple inheritance for the merged concept $(c'_{4}, {c ″}_{4})$ . However, the multiple inheritance is not allowed in many applications such as Document Management Systems, because it may introduce ambiguous semantic relation and mislead resource management [33 –37]. Thus, a problem arises that how to remove the multiple inheritance in a taxonomy, which will be addressed in section 4.

3.3. Taxonomy merging

Given a target taxonomy $T_{t}$ and some source taxonomies $T_{s 1}, T_{s 2}, \dots, T_{sm}$ , referred to the same domain as $T_{t}$ , taxonomy merging is to integrate the multiple source taxonomies into the target taxonomy based on the alignments among $T_{t}, T_{s 1}, T_{s 2}, \dots, T_{sm}$ . The merged taxonomy $T_{merged}$ retains relatively complete information of all input taxonomies without introducing additional redundancy and incoherence [10,28].

To support a good coverage and compactness of the merged result, based on the definition of two taxonomies merging introduced in [10,11], we suggest that the merged result generated in the asymmetric manner from multiple taxonomies should satisfy the following requirements:

R1. Target concept preservation: each concept coming from the target taxonomy should have an equivalent concept in the merged result.

R2. Target relation preservation: each semantic relation in the target taxonomy should be explicitly in or implied by the merged result.

R3. Equality concept preservation: the equivalent concepts in input taxonomies should be mapped to the same concepts in the merged result.

R4. Is-a relation preservation: if there is an is-a relation between a concept pair, the relation should be maintained in the merged result or evolved into irrelevant relations for reducing semantic redundancy, but translation into inverse-is-a or equivalence relation is not supported.

R5. Tree-based preservation: the merged result is a tree-based hierarchical structure, in which concepts should be connected by is-a relations, and cyclic and multiple inheritance are not allowed.

The purpose of R1 and R2 is to maintain the concepts and structure of the target taxonomy in the merged result. And, R3, R4 and R5 ensure that the merged result should remain a tree structure and not contain those concepts and relationships that would introduce semantic redundancy and incoherence.

3.4. Core concept and non-core concept

We categorise the concepts of the source taxonomies according to the equivalence correspondences between each source taxonomy and the target taxonomy: core concepts and non-core concepts:

Core concept: given a concept $c ″$ of $T_{s}$ , if there exists a concept $c'$ in $T_{t}$ , such that $Invisa (c ″, c')$ or $Equal (c ″, c')$ , then the concept $c ″$ is a core concept.

Non-core concept: given a concept $c ″$ of $T_{s}$ , if there does not exist node $c'$ in $T_{t}$ , such that $Invisa (c ″, c')$ or $Equal (c ″, c')$ , then the concept $c ″$ is a non-core concept.

To give an example, Figure 2 shows a source taxonomy $T_{s}$ and a target taxonomy $T_{t}$ , and three equivalence correspondences $Equal ({c ″}_{1}, c'_{1})$ , $Equal ({c ″}_{6}, c'_{5})$ and $Equal ({c ″}_{4}, c'_{7})$ (indicated by dotted lines). Using the inheritance of semantic relations, we can deduce that $Invisa ({c ″}_{2}, c'_{7})$ and $Invisa ({c ″}_{3}, c'_{5})$ . According to the definitions of the core concept and the non-core concept, ${c ″}_{1}$ , ${c ″}_{2}$ , ${c ″}_{3}$ , ${c ″}_{4}$ and ${c ″}_{6}$ can be identified as the core concepts, while ${c ″}_{5}$ , ${c ″}_{7}$ , ${c ″}_{8}$ , ${c ″}_{9}$ and ${c ″}_{10}$ can be identified as the non-core concepts.

Figure 2.

The dividing of core concepts and non-core concepts.

The identifications of core concepts and non-core concepts are at the core of our approach. It is clear that the core concepts consist mainly of branch nodes. After integrating core concepts into the target taxonomy, these core concepts will modify and reconstruct the hierarchical structure of the target taxonomy. Conversely, non-core concepts are mostly leaf nodes and their incorporation is an extension rather than a reconstructing for the target taxonomy. And, because the semantic relations between core concepts are notably more complicated than non-concepts, this makes the process of merging core concepts harder. To reduce the complexity of taxonomies merging, we therefore classify concepts in source taxonomies into core concepts and non-core concepts and propose different merging strategies for these two categories.

4. Implemented strategies

4.1. Framework overview

The basic workflow of our approach is shown in Figure 3, which depicts the input and output, as well as three data processing steps: dividing concept sets, building core tree and adding non-core concepts.

Figure 3.

Basic workflow for merging multiple taxonomies.

First, the taxonomies to be merged are fed into the process of dividing concept sets. Dividing concept sets divides the concepts in source taxonomies into core concepts and non-core concepts based on the equivalence correspondences between each source taxonomy and the target taxonomy (see section 4.2). After dividing, the process of building core tree incorporates core concepts into the target taxonomy to generate a fully information-preserving model and invokes a detecting block method to split the fully merged model into multiple graph-based blocks (see section 4.3.1). Then, three decomposition operations – including building semantic relations, translation semantic relations and eliminating multiple inheritance– are performed to reconstruct each block into a tree structure (see section 4.3.2). Finally, the process of adding non-core concepts divides the non-core concept set into multiple independent subtrees, and then integrates them into the core tree in turn to generate a new merged taxonomy (see section 4.4).

4.2. Dividing concept sets

We first introduce the process of dividing the concepts of the source taxonomies into core concepts and non-core concepts. As the definitions of the core concept and the non-core concept point out, the core concept is the ancestor of one or more concepts in the target taxonomy, or has an equivalence correspondence with one target concept, while the non-core concepts are the remaining concepts in the source taxonomies after removing the core concepts. Some concepts in the target taxonomy are the ancestors of non-core concepts, but no concept is the descendant of non-core concepts. The equivalence correspondences between each source taxonomy and the target taxonomy can be viewed as the most distinguishing mark to identify the core concepts and the non-core concepts. Algorithm 1 illustrates the general procedure of dividing concept sets.

Algorithm 1: Pseudo-code of dividing concept sets

Input:
source taxonomies

T_{s 1}, T_{s 2}, . . ., T_{sm}

;
equivalence correspondences

Equal (T_{s 1}, T_{t}), Equal (T_{s 2}, T_{t}), \dots, Equal (T_{sm}, T_{t})

;
Output:
core concept set

C_{core} [m]

, non-core concept set

C_{non - core} [m]

;
1. initialise

C_{core} [m]

C_{non - core} [m]

and

C_{temp}

;
2. fori← 1 to m do
3. dividingTree

(i, c_{root}^{si})

; /*

c_{root}^{si}

denotes the root of source taxonomy

T_{si}

*/
/* implement the Postorder Traversal to each source taxonomy */
4. def dividingTree

(i, c^{si})

5. add

c^{si}

C_{temp}

;
6.

C_{child}

← getChildren

(c^{si})

; /* return all children of the concept

c^{si}

in taxonomy

T_{si}

*/
7. for each concept c in

C_{child}

do
8. dividingTree(i, c);
9. if

c^{si}

Equal (T_{si}, T_{t})

then
10. add

C_{temp}

C_{core} [i]

;
11. clear

C_{temp}

;
12. else
13. add

c^{si}

C_{non - core} [i]

;
14. remove

c^{si}

from

C_{temp}

;

For each source taxonomy, the algorithm employs Postorder Traversal to determine the core concepts and the non-core concepts (lines 2 and 3). A temporary array $C_{temp}$ is employed to hold each concept that is not involved in any equivalence correspondence (line 5). When traversing from leaf nodes to upper layers, if the current concept is not involved in any equivalence correspondence, it can be confirmed as a non-core concept; otherwise, if the current concept belongs to an equivalence correspondence, then all its ancestors (stored in array $C_{temp}$ ) can be confirmed as core concepts (lines 9–14).

4.3. Building core tree

The merging of core concepts has a direct impact on the understandability of the final output, because it determines the hierarchical structure of the output taxonomy $T_{merged}$ . The challenges for the process of merging core concepts include two questions. The first one is how to integrate core concepts into the target taxonomy and the process must satisfy the requirements proposed in section 3 as well. The second is how to reduce the scale of measuring semantic relations between concept pairs from different taxonomies. To deal with these problems, a two-phase approach is proposed, whose pseudocode is given in Algorithm 2.

Algorithm 2: Pseudo-code of building core tree
Input: core concept set $C_{core} [m]$ , target taxonomy $T_{t}$ ; equivalence correspondences $Equal (T_{s 1}, T_{t}), Equal (T_{s 2}, T_{t}), \dots, Equal (T_{sm}, T_{t})$ ; Output: core tree $T_{merged}^{core}$ ; 1. initialise $T_{merged}^{core}$ ← null, $C_{equal}$ ← null; 2. add $T_{t}$ to $T_{merged}^{core}$ ; 3. fori← 1 to m do 4. add $C_{core} [i]$ and their relation $R_{core} [i]$ to $T_{merged}^{core}$ ; 5. merge every concept-pair from $Equal (T_{si}, T_{t})$ into a new concept $c^{t}$ and label it using $label (c^{t})$ from $T_{t}$ ; 6. add $c^{t}$ to $C_{equal}$ ; /* reconstruct the graph-based taxonomy $T_{merged}^{core}$ into a tree structure */ 7. for each concept $c_{equal}$ in $C_{equal}$ do 8. block $b_{block}$ ← detectingBlock $(T_{merged}^{core}, c_{equal})$ ; 9. buildingSemanticRelations $(b_{block})$ ; 10. translatingSemanticRelations $(b_{block})$ ; 11. eliminatingMultipleInheritance $(b_{block})$ ;

Algorithm 2: Pseudo-code of building core tree

Input:
core concept set

C_{core} [m]

, target taxonomy

T_{t}

;
equivalence correspondences

Equal (T_{s 1}, T_{t}), Equal (T_{s 2}, T_{t}), \dots, Equal (T_{sm}, T_{t})

;
Output:
core tree

T_{merged}^{core}

;
1. initialise

T_{merged}^{core}

← null,

C_{equal}

← null;
2. add

T_{t}

T_{merged}^{core}

;
3. fori← 1 to m do
4. add

C_{core} [i]

and their relation

R_{core} [i]

T_{merged}^{core}

;
5. merge every concept-pair from

Equal (T_{si}, T_{t})

into a new concept

c^{t}

and label it using

label (c^{t})

from

T_{t}

;
6. add

c^{t}

C_{equal}

;
/* reconstruct the graph-based taxonomy

T_{merged}^{core}

into a tree structure */
7. for each concept

c_{equal}

C_{equal}

do
8. block

b_{block}

← detectingBlock

(T_{merged}^{core}, c_{equal})

;
9. buildingSemanticRelations

(b_{block})

;
10. translatingSemanticRelations

(b_{block})

;
11. eliminatingMultipleInheritance

(b_{block})

;

The algorithm first generates a fully merged taxonomy with graph-based structure by a symmetric manner which retains all information from the target taxonomy and core concepts (lines 2–6). At this point, the algorithm only employs equivalence correspondences between concept pairs, in which two concepts come from the target taxonomy and source taxonomies, respectively. Each concept pair that has an equivalence correspondence is merged into one new integrated concept where the label of the integrated concept is defined by the label of the target concept (line 5).

After this process, the algorithm takes all integrated concepts $C_{equal}$ as marks to reconstruct the fully merged taxonomy (lines 7–11). To reduce the complexity of the algorithm, the detecting and decomposing of blocks are performed in sequence. We build semantic relations between those concept pairs coming from same blocks but from different source taxonomies (line 9). Note that the semantic relations between concepts from different blocks are either type of is-a or type of irrelevant since blocks are organised in a hierarchical structure. After that, the redundant concepts and relations for each block are eliminated (line 10). Finally, latent Dirichlet allocation (LDA) topic model is employed to address the multiple inheritance problem (line 11).

4.3.1. Detecting blocks

The fully merged taxonomy is a directed acyclic graph, which remains all information of the target taxonomy and all core concepts from source taxonomies. Figure 4(a) shows a fully merged taxonomy. Grey nodes represent concepts from the target taxonomy, while the white nodes represent concepts from source taxonomies. Note that the merged taxonomy may introduce a lot of redundant concepts and relations.

Figure 4.

The process of detecting and reconstructing blocks.

To remove the redundancy information, a straightforward idea is to obtain the semantic relations between each concept pair in the fully merged taxonomy and then invoke merging method to eliminate semantic overlap. However, this approach can be problematic because, typically, it is computationally expensive to measure semantic relations between each concept pair, including equivalence, is-a and inverse-is-a correspondences, especially when these concepts come from multiple heterogeneous large-scale taxonomies. Therefore, we take a block-based approach to divide the fully merged taxonomy into multiple independent sub-graphs, that is, blocks. For each block, it has only one concept with zero indegree (called source point) and one concept with zero degree (called sink point) where source point and sink point are the integrated equivalent concepts $c_{equal}$ . In addition, the sink point has two characteristics: it is an integrated equivalent concept, and it has multiple parents.

Figure 4(b) shows that a fully merged taxonomy is divided into two graph-based blocks. There exists a hierarchical structure between any two blocks divided by the integrated equivalent concepts. In the interior of each block, the semantic relations between core concepts are complicated; we get the relations and further translate the graph-based block into a tree-based pattern. Thus, the integrated tree-based taxonomy can be constructed by multiple tree-based patterns.

Algorithm 3 demonstrates the process of detecting blocks based on all integrated equivalent concepts in the fully merged taxonomy. To decompose blocks from top-down, we first sort the integrated equivalent concepts based on their hierarchical order in the fully merged taxonomy (line 1). Then, we find the sink points with multiple parents from the integrated equivalent concepts (lines 2–6). Detecting blocks is a process of searching for a source point from a sink point. The number of searching paths is equal to the number of the parents of the sink point. If the current node searched by a search path is an integrated equivalent concept, the searching will be temporarily stopped to estimate whether these searching paths have converged to a node or not (lines 10–17). If the current nodes being searching refer to a node, then the node is the source point and a block has been found; otherwise, the searching path at the highest level does not move, and the remaining paths continue to search upwards (lines 18–23). This entire process is continued until the source node is found out.

Algorithm 3: Pseudo-code of detecting blocks
Input: fully merged taxonomy $T_{merged}^{core}$ ; integrated equivalent concepts $C_{equal}$ ; Output: the set of blocks $B_{block}$ ;
1. sort $C_{equal}$ by BFS; /* employ Breadth First Search to sort the concepts $C_{equal}$ / 2. for each concept $c_{equal}$ in $C_{equal}$ do 3. $C_{parent}$ ← getParents $(c_{equal})$ ; / return all parents of the concept $c_{equal}$ in taxonomy $T_{merged}^{core}$ / 4. s←\| $C_{parent}$* \|; 5. ifs <= 1 then 6. continue; /* the concept $c_{equal}$ is not a sink point / 7. $C_{block}$ ← $c_{equal}$ ; add $C_{parent}$ to $C_{block}$ ; 8. P[s] ← $C_{parent}$ ; pFlag[s]= False; bFlag* = True; 9. while (bFlag) do 10. fori←1 to s do 11. ifpFlag[i] == True then 12. continue; 13. while (P[i] not in $C_{equal}$ ) then 14. P[i] ← getParents(P[i]); /* if P[i] not in $C_{equal}$ , then \|getParents(P[i])\| is 1 / 15. add P[i] to $C_{block}$ ; 16. if each c in P[s] are the same concept then / whether these paths have converged / 17. bFlag* = False; /* a source point has been found / 18. else 19. j = max Level(P[s]); pFlag*[j] = True; 20. fori←1 to s do 21. ifi != j then 22. P[i] ← getParents(P[i]); 23. add P[i] to $C_{block}$ ;

Algorithm 3: Pseudo-code of detecting blocks

Input:
fully merged taxonomy

T_{merged}^{core}

;
integrated equivalent concepts

C_{equal}

;
Output:
the set of blocks

B_{block}

;

1. sort

C_{equal}

by BFS; /* employ Breadth First Search to sort the concepts

C_{equal}

*/
2. for each concept

c_{equal}

C_{equal}

do
3.

C_{parent}

← getParents

(c_{equal})

; /* return all parents of the concept

c_{equal}

in taxonomy

T_{merged}^{core}

*/
4. s←| $C_{parent}$ |;
5. ifs <= 1 then
6. continue; /* the concept

c_{equal}

is not a sink point */
7.

C_{block}

←

c_{equal}

; add

C_{parent}

C_{block}

;
8. P[s] ←

C_{parent}

; pFlag[s]= False; bFlag = True;
9. while (bFlag) do
10. fori←1 to s do
11. ifpFlag[i] == True then
12. continue;
13. while (P[i] not in

C_{equal}

) then
14. P[i] ← getParents(P[i]); /* if P[i] not in

C_{equal}

, then |getParents(P[i])| is 1 */
15. add P[i] to

C_{block}

;
16. if each c in P[s] are the same concept then /* whether these paths have converged */
17. bFlag = False; /* a source point has been found */
18. else
19. j = max Level(P[s]); pFlag[j] = True;
20. fori←1 to s do
21. ifi != j then
22. P[i] ← getParents(P[i]);
23. add P[i] to

C_{block}

;

4.3.2. Reconstructing blocks

After finishing the process of detecting blocks, we deconstruct each found block into a tree structure. Three steps are executed in sequence.

Building semantic relations

It is important to highlight that section 4.2 only employed the equivalence correspondences between the target taxonomy and each source taxonomy. Here, we will build more expressive semantic relations for the concepts in the interior of each block. For each path in a block, we find that the concepts belonged to it come from the same taxonomy. Therefore, it is only necessary to measure the semantic relations between concept pairs from different paths. For example, the first graph-based block in Figure 4(b) has three paths {a, b, f}, {a, c, f} and {a, d, e, f}. For these concept pairs from different taxonomies (b, c), (b, d) and (b, e), they need to be estimated whether there exist is-a correspondences or inverse-is-a correspondences. The equivalence correspondences between them have been detected in the process of dividing concept sets. As mentioned in section 3, after obtaining the semantic relation between a concept pair, we can further deduce other existing semantic relations between the concept pair and other concepts. This enables us to reduce the number of concepts whose semantic relations need to be measured. As shown in Figure 4(b), if an equivalence correspondence between c and d is obtained, we can deduce that e is a sub-concept of c and the relation between the concept pair (c, e) does not need to be detected again.

Translating semantic relations

In this step, we discuss the process of translating semantic relations, including translating inverse-is-a correspondences, merging equivalent concepts and eliminating redundant relations. According to the requirement R5, all relations between concepts should be expressed in the type of is-a. As the translation of the inverse-is-a correspondence is much simpler than that of the other two relations, we will directly translate each inverse-is-a correspondence $Invisa (c', c ″)$ into the is-a relation $Isa (c ″, c')$ .

For some concepts that exist equivalence correspondences with each other, we integrate them into a new concept. Relations attached to these concepts are added to the new concept. For the label of the integrated concept, section 4.3 provides a method that uses target concepts to define it. However, this method cannot be used here because equivalent concepts in a block are all from source taxonomies. At this point, each of the following approaches can be utilised:

Manually assigns a label for the integrated concept.

Randomly selects a label from the equivalent concepts.

Concatenate these labels of the equivalent concepts to form a new label.

Define the label using a common prototype of the equivalent concepts.

Let us discuss the process of eliminating redundant relations. Given two concepts with an is-a relation, if there exists one more detailed is-a relation to depict their semantic relation, then the direct is-a relation is disconnected and the detailed is-a relation is preserved. The intuition behind selecting the detailed path is that it preserves and extends the structure and relation between the concept pair.

To give an example, in the second graph-based block in Figure 4(b), the concept pair $(f, j)$ exists a semantic relation $Isa (j, f)$ , and there is another more detailed path {f, g, j} that represents the same relation with $(f, j)$ , which not only contains the relation $Isa (j, f)$ but also contains another concept g. In this path, since g is a sub-concept of concept f, and concept j is a sub-concept of g, then we can deduce that j is a sub-concept of f, and this is possible thanks to the transitivity of is-a relation.

Eliminating multiple inheritance

Due to the inadequacy of semantic dictionaries, the state-of-the-art automatic semantic alignment approaches, including linguistic-based and structure-based approaches, cannot cover all terms so that they are difficult to identify the semantic relations between any two concepts accurately. When semantic alignment approaches cannot identify the semantic relations among several concepts having one common child concept, the problem of how to eliminate the multiple inheritance for a concept, that is, the multiple inheritance problem (R5), is raised.

We conduct LDA topic model to detect topical coverage for each parent–child pair to eliminate multiple inheritance. LDA topic model [38] is a three-layer Bayesian model, including documents layer, topics layer and words layer, where each document can be represented as a probability distribution of topics and each topic can be represented as a probability distribution over words. LDA topic model is one of the most popular unsupervised machine learning techniques that can be used to identify the implicit topics and mine deep semantics of documents [39].

In this stage of our algorithm, the topic distribution for each concept $c_{i}$ that has documents $d_{i}$ is calculated, and then the distribution of concept topic is employed to represent the inner relation between concepts. The distribution of concept-topic is used to eliminate the multiple inheritance for a concept $c$ as follows. Let $C_{pa}$ be the parent concept set of $c$ , and $C_{de}$ be a set including concept c and c’s descendants. We assume that, for each concept $c_{pa}$ in $C_{pa}$ , if the similarity of topic distribution between $c_{pa}$ and $C_{de}$ is higher, then it is more probably to be c’s parent since it can better cover the topic information of the set $C_{de}$ . The similarity is defined as

\begin{matrix} TC (c_{pa}, C_{de}) = \sum_{c_{de} \in C_{de}} (D_{JS} (θ_{c_{de} t_{k}}, θ_{c_{pa} t_{k}})) \end{matrix}

(4)

where $θ_{c_{de} t_{k}}$ is the top k major topics of concept $c_{de}$ , and $θ_{c_{pa} t_{k}}$ represents the k topics corresponding to $θ_{c_{de} t_{k}}$ in the topic distribution of concept $c_{de}$ . $D_{JS} (θ_{c_{de} t_{k}}, θ_{c_{pa} t_{k}})$ stands for the distance of Jensen–Shannon divergence (JSD) between $θ_{c_{de} t_{k}}$ and $θ_{c_{pa} t_{k}}$ . JSD distance derives from Jensen’s inequality and Shannon’s information entropy, which has a good performance for the calculation of probability distributions in different topics. The JSD distance is defined as

\begin{matrix} D_{JS} (p, q) = \frac{1}{2} [D_{KL} (p, \frac{p + q}{2}) + D_{KL} (q, \frac{p + q}{2})] \end{matrix}

(5)

where $p$ and $q$ represent the topic probability distributions, respectively. $D_{KL}$ is the distance of Kullback–Leibler divergence (KLD). KLD distance is based on the definition of information entropy, also called relative entropy. It is mainly used to measure the difference between two probability distributions. The KLD distance is defined as

\begin{matrix} D_{KL} (p, q) = \sum_{t \in T} p (t) \log \frac{p (t)}{q (t)} \end{matrix}

(6)

where $T$ is the total number of the common topics in $p$ and $q$ . $p (t)$ represents the topic probability of $t$ in $p$ , and $q (t)$ represents the topic probability of $t$ in $q$ .

For a concept $c$ with multiple inheritance, we, respectively, get the $TC$ value for each parent, and select the concept with maximum $TC$ value from all parent concepts as the final parent of $c$ . The relations between the concept $c$ and its other parents will be removed. As shown in Figure 4(c), the dotted lines between the concept pair $(e, f)$ and the concept pair $(g, j)$ represent these removed relations.

4.4. Adding non-core concepts

In this section, we discuss how to automatically integrate non-core concepts into the core tree to generate a final merged taxonomy as the output. According to the definition of non-core concept, it is not difficult to find that the non-core concepts consist of subtrees from source taxonomies. Each subtree $C_{tree}$ in the set of non-core concept has a root node $c_{root}$ that is a sub-concept of one concept $c_{root - pa}$ in the core tree. Each concept in the subtree $C_{tree}$ is the sub-concept of $c_{root - pa}$ . Hence, the goal of integrating non-core concepts can be achieved by integrating each subtree into its corresponding subtree (rooted at $c_{root - pa}$ ) in the core tree. Algorithm 4 explains the procedure of adding non-core concepts to the core tree.

Algorithm 4: Pseudo-code of adding non-core concepts
Input: non-core concept set $C_{non - core} [m]$ , core tree $T_{merged}^{core}$ ; source taxonomies $T_{s 1}, T_{s 2}, \dots, T_{sm}$ ; Output: the merged taxonomy $T_{merged}$ ;
1. initialise $C_{root} [m]$ ← null; $T_{merged}$ ← $T_{merged}^{core}$ ; /* find the root node for each subtree / 2. fori← 1 to m do 3. for each concept $c$ in $C_{non - core} [i]$ do 4. $c_{pa}$ ← getParents $(c)$ ; 5. if $c_{pa}$ not in $C_{non - core} [i]$ then 6. add $c$ to $C_{root} [i]$ ; / merging each subtree into the core tree */ 7. fori← 1 to m do 8. for each concept $c_{root}$ in $C_{root} [i]$ do 9. $T_{snc}$ ← getSubtree $(c_{root}, T_{si})$ ; 10. $c_{root - pa}$ ← getParents $(c_{root}, T_{merged})$ ; 11. $T_{sc}$ ← getSubtree $(c_{root - pa}, T_{merged})$ ; 12. merge $T_{snc}$ into $T_{sc}$ ;

Algorithm 4: Pseudo-code of adding non-core concepts

Input:
non-core concept set

C_{non - core} [m]

, core tree

T_{merged}^{core}

;
source taxonomies

T_{s 1}, T_{s 2}, \dots, T_{sm}

;
Output:
the merged taxonomy

T_{merged}

;

1. initialise

C_{root} [m]

← null;

T_{merged}

←

T_{merged}^{core}

;
/* find the root node for each subtree */
2. fori← 1 to m do
3. for each concept

c

C_{non - core} [i]

do
4.

c_{pa}

← getParents

(c)

;
5. if

c_{pa}

not in

C_{non - core} [i]

then
6. add

c

C_{root} [i]

;
/* merging each subtree into the core tree */
7. fori← 1 to m do
8. for each concept

c_{root}

C_{root} [i]

do
9.

T_{snc}

← getSubtree

(c_{root}, T_{si})

;
10.

c_{root - pa}

← getParents

(c_{root}, T_{merged})

;
11.

T_{sc}

← getSubtree

(c_{root - pa}, T_{merged})

;
12. merge

T_{snc}

into

T_{sc}

;

Adding non-core concepts includes two major procedures: finding the root node for each subtree (lines 2–6) and merging each subtree into the core tree (lines 7–12). For any subtree, if we can obtain its root, then all concepts in the tree can be obtained using either breadth-first search or depth-first search to traverse a source taxonomy from the root. The rationale behind the method of identifying root concept is that for any concept in the non-core concept set, if it does not have parent concept or its parent concept is not in the non-core concept set, then it is the root of a subtree. After performing the method on each non-core concept, we get a list of root concept $C_{root} [m]$ . For each root in the list, we obtain all its descendant concepts $T_{snc}$ from its corresponding source taxonomy (line 9). Then, we obtain another subtree $T_{sc}$ in the core tree through its parent concept in the core concept set (line 11). Finally, two trees are merged, where $T_{sc}$ is viewed as the target tree and the tree $T_{snc}$ is viewed as the source tree (line 12).

4.5. The scale of measuring semantic relations

Assume that the number of taxonomies to be merged is m(m≥ 2), the average number of concepts per taxonomy is n(n≥ 1), the number of blocks in the fully merged taxonomy is r(r≥ 1) and the average number of concepts per block is (m × n)/r. Section 4.2 measured the equivalence correspondences between the target taxonomy and each source taxonomy. In this phase, the maximum number of concept pairs to be measured would be $(m - 1) n^{2}$ . Section 4.3.1 measured three kinds of correspondences for the concepts in the interior of each block. For each block, the maximum number of concept pairs to be measured would be $1 / 2 ((m - 1) (m - 2) (n / r)^{2})$ . As there are r blocks, the total number of concept pairs to be measured in the worst case would be $1 / 2 r ((m - 1) (m - 2) n^{2})$ .

By summing up the two phases, the total number of concept pairs to be measured in the worst case is $1 / 2 r ((m - 1) (m - 2) n^{2} + (m - 1) n^{2})$ . As discussed in section 1, for the merging m taxonomies, in general, the maximum number of concept pairs to be measured would be $1 / 2 (m (m - 1) n^{2})$ , which is greater than the number of our proposed algorithms, that is

\frac{1}{2 r} (m - 1) (m - 2) n^{2} + (m - 1) n^{2} \leq \frac{1}{2} m (m - 1) n^{2}

5. Alignment generation

In this section, we introduce the alignment approach for determining the exact correspondence between a concept pair from different taxonomies. As discussed in section 2, existing approaches for measuring semantic relations can be classified into two types: linguistic-based strategy and structure-based strategy [40].

The linguistic-based strategy relies on the concept’s label to determine correspondences between a concept pair. This strategy is able to support different types of correspondences, including equivalence, is-a, inverse-is-a and so on, by utilising linguistic techniques and background sources such as WordNet [41,42]. Due to the complexity and diversity of natural language, the linguistic-based strategy has obvious limitation. For example, the equivalent concepts expressed by different labels (synonyms) cannot be detected, and the different concepts described by same labels (homonyms) might be mistakenly detected as equivalent concepts.

Differently from the linguistic-based strategy, the structure-based strategy relies on the taxonomy structure information to determine correspondences between a concept pair. This strategy is based on the intuition that if two concepts belonging to different taxonomies are similar in semantic, their neighbours are semantically related. The structure-based strategy can detect homonyms and synonyms in taxonomies, but it cannot further identify kinds of correspondences.

Combining the advantages of the two strategies, we employ a two-phase approach to determine the correspondence between a concept pair. The structure-based strategy is used to determine whether the concept pair is semantically related in the first stage, and then the linguistic-based strategy is applied to identify more expressive correspondences, including equivalence, is-a and inverse-is-a.

5.1. Determining candidate concept pair

This stage is to identify the concept pair that is semantically related, and to roughly filter out homonyms. To compute semantic relatedness between a concept pair, the approach of information distance proposed in Jiang et al. [43] is used as follows.

Assume that there is a concept-pair $c'$ and $c ″$ such that $c' \in T_{i}$ and $c ″ \in T_{j}$ . If $0 < | sim (c', c ″) - sim (c ″, c') | < δ$ , the concepts $c'$ and $c ″$ are semantically related. $δ$ is a threshold value to measure the semantic relatedness, which is empirically set to 0.1. $sim (x, y)$ represents the similarity between concepts x and y, which is designed to consider both terminological similarity and structural similarity. The definition of $sim (x, y)$ is described as follows

\begin{matrix} sim (x, y) = γ \times si m_{t} (x, y) + \frac{1 - γ}{2} \times si m_{si} (x, y) + \frac{1 - γ}{2} \times si m_{se} (x, y) \\ \begin{matrix} = γ \times ω (x, y) + \frac{1 - γ}{2} (\sum_{h_{i} \in H (x), h_{j} \in H (j)} max {ω (h_{i}, h_{j})} + \sum_{p_{i} \in P (x), p_{j} \in P (j)} max {ω (p_{i}, p_{j})}) \end{matrix} \end{matrix}

(7)

where $γ$ is the balance factor, which is empirically set to 0.6. $si m_{t} (x, y)$ is the terminological similarity. $si m_{si} (x, y)$ is the structural internal similarity. $si m_{se} (x, y)$ is the structural external similarity. $H (x)$ and $H (j)$ , respectively, denote the child set of concepts $x$ and $y$ . $ω (m, n)$ denotes the link weight, which is asymmetric dependency measures between concepts m and n based on the Normalised Google Distance (NGD) [43,44]. $ω (m, n)$ can be defined as follows

\begin{matrix} ω (m, n) = \frac{f (m, n)}{f (n)} \end{matrix}

(8)

where $f (n)$ denotes the number of web pages containing the concept n in Google, and $f (m, n)$ denotes the number of web pages containing both the concepts m and n in Google.

Owing to the combination of terminological similarity and structural similarity, the approach of information distance has a high performance in identifying semantic relatedness. Moreover, it is also effective to correctly detect synonyms and homonyms.

5.2. Identifying expressive relationship

The first step in our semantic alignment approach is to identify the candidate concept pair with semantic relatedness. We then apply linguistic-based strategy proposed by Arnold and Rahm [17] to determine more expressive semantic relations between the candidate concept pair. This strategy consists of four approaches: lexical-based, compound, background knowledge and itemisation.

Lexical-based approach

The approach is proposed to deal with concept pairs which share the same lexical representation or are both found in semantic lexicon like WordNet. Given two concepts $c' (c' \in T_{i})$ and $c ″ (c ″ \in T_{j})$ , the approach works as follows:

If $label (c') = label (c ″)$ , return $Equal (c', c ″)$ .

If $c'$ and $c ″$ are both found in semantic lexicon:

If $c'$ is the synonym of $c ″$ , return $Equal (c', c ″)$ .

If $c'$ is a hyponym of $c ″$ , return $Isa (c', c ″)$ .

If $c'$ is a hypernym of $c ″$ , return $Invisa (c', c ″)$ .

Type cannot be determined, return unknown.

If either $c'$ or $c ″$ is not contained in semantic lexicon, return unknown.

Compound approach

The approach is based on the idea that a compound c consisting of a head c_h and a modifier c_m expresses something more specific than its head c_h, that is, c is a sub-concept of c_h. Specifically, given two concepts $c' (c' \in T_{i})$ and $c ″ (c ″ \in T_{j})$ , if concepts $c'$ ends with $c ″$ and $length (c') - length (c ″)$ > 3, $c'$ is considered a compound to $c ″$ and the correspondence $Isa (c', c ″)$ is hold. The condition $length (c') - length (c ″)$ > 3 is able to discard obviously wrong compound conclusions, like sword is a word (pseudo-compounds) and unlike is a like (prefix derivations).

Background knowledge approach

The approach mainly employs semantic lexicon as external background source to identify specific correspondence types between a non-compound concept c_w and an open compound c_c, where the concept c_w is contained in semantic lexicon, yet the concept c_c = c_mc_h consisting of a modifier c_m and a head c_h is not. For the compound concept c_c, we gradually remove the modifier c_m and check whether the remaining form ${c_{m}}^{'}$ (c_c is a sub-concept of ${c_{m}}^{'}$ ) is contained in semantic lexicon. If ${c_{m}}^{'}$ is contained in semantic lexicon, the relation between ${c_{m}}^{'}$ and c_w is identified in semantic lexicon. If ${c_{m}}^{'}$ is the synonym of c_w or ${c_{m}}^{'}$ is a sub-concept of c_w, then the relation $Isa (c_{c}, c_{w})$ is hold. If ${c_{m}}^{'}$ is not contained in semantic lexicon, the process continues until the form ${c_{m}}^{'}$ is the same as the head of $c_{c}$ .

Itemisation approach

The approach is utilised to identify the correspondence types between two concepts where at least one concept is an itemisation containing a list of words or phrases, like ‘Games and Puzzles’. Given two concepts $c' (c' \in T_{i})$ and $c ″ (c ″ \in T_{j})$ , let $W'$ and $W ″$ , respectively, be the word set of $c'$ and $c ″$ . The implementation of itemisation approach to derive the semantic relation between $c'$ and $c ″$ is as follows:

For each set W, remove each w_i∈W which is a sub-concept of w_j∈W.

For each set W, remove each w_i∈W which is a synonym of w_j∈W.

Remove each $w_{i} \in W'$ and $w_{j} \in W ″$ if w_i is a synonym of w_j.

Remove each $w_{j} \in W ″$ if w_j is a sub-concept of $w_{i} \in W'$ .

Determine the semantic relation:

If $W' = \emptyset$ and $W ″ = \emptyset$ , return $Equal (c', c ″)$ .

If $W' = \emptyset$ and $| W ″ | \geq 1$ , return $Isa (c', c ″)$ .

If $| W' | \geq 1$ and $W ″ = \emptyset$ , return $Invisa (c', c ″)$ .

If $| W' | \geq 1$ and $| W ″ | \geq 1$ , return unknown.

Since these introduced approaches will return exactly one correspondence type or ‘unknown’, a conflict may occur when four approaches return different types. In this case, we use the following priority order: equivalence, is-a/inverse-is-a and unknown.

It is noted that for the measure of equivalence correspondences in section 4.2, we only employ the lexical-based approach and do not involve the compound approach, the background knowledge approach and the itemisation approach. Although the lexical-based approach is only suitable to measure the equivalence correspondences for non-compound concepts but not for the compound concepts, the strategy can effectively reduce the complexity and time consumption of measuring the equivalence correspondence. Moreover, the equivalence correspondences utilised in section 4.2 are to divide the target taxonomy and the source taxonomies into blocks; therefore, in this stage, this strategy that we only get the equivalence correspondences between non-compound concepts does not affect the result of the whole taxonomies merging. The measures for compound concepts will be implemented in section 4.3.1.

6. Evaluation

To make a comprehensive evaluation for the performance of the proposed approach, the experiments are respectively conducted on real-world and synthetic scenarios. For the real-world scenario, we select four popular e-commerce website product catalogues as the test data to assess the proposed approach in terms of semantic redundancy. Regarding the synthetic scenario, seven groups of test cases are applied to evaluate the semantic accuracy of the taxonomy generated by the proposed approach.

The hardware device used for evaluating the proposed approach is a PC running Windows 10 system with Intel Core 2 Duo processors (2.30 GHz) and 4 GB RAM. A prototype of the proposed approach is implemented as a Java application. In our experiments, JGibbLDA (http://jgibblda.sourceforge.net) is used as the implementation of LDA topic model to discover the topic distribution of each concept. Gibbs sampling is employed in JGibbLDA to estimate and infer the document-topic probability distribution and the topic-word probability distribution. We empirically set the number of topics K = 50, hyper parameters α = 0.5 and β = 0.01, and the number of Gibbs sampling iterations = 2000.

6.1. Experiment 1: real-world scenario

As discussed in section 1, an important advantage of the asymmetric-based merging approach is that it can largely limit the semantic redundancy in the merged taxonomy. Here, we conduct experiments on real-world scenario to evaluate and verify the performance of proposed approach in limiting the semantic redundancy. For a quantitative evaluation, we follow [10,45,46] to measure the compactness and the introduced degree of semantic overlap in the merged result. The measure is defined to compare the fully merged taxonomy generated in symmetric-based manner with our solution in terms of the number of concepts and leaf paths. Besides, considering that the asymmetric-based merging approach needs to specify a target taxonomy, we conduct careful experiments to analyse how different choices of the target taxonomy affect the merged result.

We select four popular e-commerce website catalogues as input taxonomies, in which the descriptions of products are viewed as the documents attached to concepts. The product catalogues were crawled in June 2019. Table 1 shows the detailed characteristics of these catalogues, including the site name, URL, the number of concepts and leaf concepts as well as the maximum depth and the average depth.

Table 1.

The characteristics of experimental data for real-world scenario.

Website	URL	Concepts	Leaf concepts	Maximum depth	Average depth
Groupon	https://www.groupon.com	2273	1852	7	4.76
eBay	https://www.ebay.com	1088	993	4	4
Amazon	https://www.amazon.com	359	315	3	3
Walmart	https://www.walmart.com	620	530	4	3.89

Table 2 provides the experiment results for the real-world scenario obtained by the proposed approach using different target taxonomies [10]. We find that the solutions, whether using the product catalogues of Groupon, eBay, Amazon or Walmart as the target taxonomies, have the same number of concepts, leaf concepts, leaf paths and maximum depth. Moreover, although the average depth of the four taxonomies is quite different, the four solutions using different target taxonomies achieve almost the same average depth. Similar results obtained using different target taxonomies demonstrate that the proposed approach is robust to the chosen target taxonomy.

Table 2.

Comparison of our solutions using different target taxonomies and the fully merged solution on the real-world scenario.

	Concepts	Core concepts	Non-core concepts	Leaf concepts	Leaf paths	Maximum depth	Average depth
# Groupon	3887	656	1411	3316	3316	7	4.73
# eBay	3887	557	2695	3316	3316	7	4.64
# Amazon	3887	478	3503	3316	3316	7	4.62
# Walmart	3887	504	3216	3316	3316	7	4.77
Fully merged	3887			3278	8527	7	4.95

# Groupon, # eBay, # Amazon and # Walmart, respectively, indicate the solutions using the product catalogue of Groupon, eBay, Amazon and Walmart as the target taxonomies.

We further compare our solutions with the fully merged solution. The results show that our solutions and the fully merged solution have the same number of concepts indicating that our approach not only combines equivalent concepts but also maintains all the input concepts. However, the number of leaf concepts in the fully merged solution is less than our solutions, since our approach converted some branch concepts to leaf concepts in the problem-solving process of multiple inheritance. In addition, Table 2 depicts the number of leaf paths, which could act as an index to evaluate the introduced degree of redundancy. It is observed that all our solutions introduce 60% fewer leaf paths compared with the fully merged solution, and the number of leaf paths is equal to that of leaf concepts in our solutions, the reason for which is that the merged taxonomy generated by our approach does not introduce any multiple inheritance and remains a tree structure.

To summarise, the experiments indicate that the proposed approach has a good performance to preserve all concepts and relationships of the input taxonomies and can limit the semantic redundancy in the merged result at the same time. Furthermore, the experiments demonstrate the capability of our approach to merge multiple taxonomies from real world.

6.2. Experiment 2: synthetic scenario

6.2.1. Evaluation measures

The above experiments evaluate the performance of proposed approach in the degree of semantic redundancy. To evaluate the performance more holistically, in this section, we use a golden standard evaluation [47,48] as evaluation framework to systematically evaluate the precision, recall and F-measure of the proposed approach. The golden standard evaluation mainly emphasises on comparing the positions and hierarchies between two concepts belonging, respectively, to the merged taxonomy and the benchmark taxonomy (i.e. a reference taxonomy) which represents an accurate well-merged result. The evaluation usually includes two aspects: local measure and global measure.

Before describing local measure, let us introduce the common semantic cotopy $(\csc)$ to compare the concept relations between the merged taxonomy and the reference taxonomy. Given a concept $c$ , let $C_{MP}$ and $C_{MB}$ denote its super concept set and its sub-concept set in the merged taxonomy $T_{M}$ , respectively, and $C_{RP}$ and $C_{RB}$ represent its super concept set and its sub-concept set in the reference taxonomy $T_{R}$ , respectively. $\csc$ is defined as follows

\begin{matrix} \csc (c, T_{M}, T_{R}) = {c_{i} | c_{i} C_{MP} \cap C_{RP} \lor c_{i} \in C_{MB} \cap C_{RB} \land (c_{i} ⪯ c \lor c ⪯ c_{i})} \end{matrix}

(9)

where ‘ $C_{1} \cap C_{2}$ ’ is employed to compute the set of common concepts between the sets $C_{1}$ and $C_{2}$ . The symbol ‘⪯’ stands for the concept order induced by the hierarchical relations. The definition indicates that $\csc$ is the set that contains concept c, c’s super concepts and sub-concepts which are shared by a merged taxonomy and a reference taxonomy.

The framework of local measure includes the local taxonomic precision $(LP)$ and local taxonomic recall $(LR)$ . The definitions of $LP$ and $LR$ are as follows

\begin{matrix} LP (c, T_{M}, T_{R}) = \frac{| \csc (c, T_{M}, T_{R}) \cap \csc (c, T_{R}, T_{M}) |}{\csc (c, T_{M}, T_{R})} \end{matrix}

(10)

\begin{matrix} LR (c, T_{M}, T_{R}) = \frac{| \csc (c, T_{M}, T_{R}) \cap \csc (c, T_{R}, T_{M}) |}{\csc (c, T_{R}, T_{M})} \end{matrix}

(11)

The definitions of $LP$ and $LR$ leverage $\csc$ to determine the quality of a single concept. The $LP$ value shows the proportion of the size of the common set and the size of $\csc$ from the merged taxonomy’s perspective. The $LR$ value represents the proportion of the size of the common set and the size of $\csc$ from the reference taxonomy’s perspective.

The local measure compares the position of a concept in the merged taxonomy with the position of the corresponding concept in the reference taxonomy. The global taxonomic precision $(GP)$ and global taxonomic recall $(GR)$ show the quality of whole concept hierarchies of a taxonomy. The definitions of $GP$ and $GR$ are as follows

\begin{matrix} GP (T_{M}, T_{R}) = \frac{1}{| C_{M} \cap C_{R} |} \sum_{c \in C_{M} \cap C_{R}} LP (c, T_{M}, T_{R}) \end{matrix}

(12)

\begin{matrix} GR (T_{M}, T_{R}) = \frac{1}{| C_{M} \cap C_{R} |} \sum_{c \in C_{M} \cap C_{R}} LR (c, T_{M}, T_{R}) \end{matrix}

(13)

where $C_{M}$ represents the concept set of $T_{M}$ , $C_{R}$ represents the concept set of $T_{R}$ , $GP$ is the sum of the taxonomic precision values of $C_{M} \cap C_{R}$ (i.e. the common concept set of both taxonomies), and $GR$ indicates the sum of the taxonomic recall values of $C_{M} \cap C_{R}$ .

In some cases, the taxonomic precision and the taxonomic recall are contradictory. In order to balance the two measures, we introduce the taxonomic F-measure $(GF)$ to comprehensively evaluate the merged taxonomy. The definition of $GF$ is as follows

\begin{matrix} GF (T_{M}, T_{R}) = \frac{2 \cdot GP (T_{M}, T_{R}) \cdot GR (T_{M}, T_{R})}{GP (T_{M}, T_{R}) + GR (T_{M}, T_{R})} \end{matrix}

(14)

The F-measure is the harmonic mean of the global taxonomic precision value and global taxonomic recall value. The higher of the value of F-measure, the better of the quality of the merged taxonomy.

6.2.2. Evaluation data

Virtually, all previous studies on integrating heterogeneous ontologies or schemas focus on the problem of merging two schemas. A number of benchmarks for two ontologies merging are available [46]. For instance, the Ontology Matching conference presents more than 20 data sets and their alignment (http://www.ontologymatching.org). However, little work has done for the multiple taxonomies merging, and there is no public benchmark for multiple taxonomies merging, to the best of our knowledge. To investigate the performance of our approach, seven groups of test cases that contain, respectively, 3, 4, 5, 6, 7, 8 and 9 input taxonomies are constructed by a two-stage method as follows.

In the first stage, an original taxonomy is generated from the Open Directory Project data set (DMOZ). DMOZ is the largest human-edited directory on the Web with over 5,169,995 sites listed in over 1,017,500 concepts. In this article, we just extract the taxonomy ‘Science’ from DMOZ. As the taxonomy contains many meaningless concepts and some concepts with multiple inheritance, these concepts and their sub-concepts are removed. After cleaning these concepts, the number of concepts in the taxonomy is 1187 and the depth of the taxonomy is 7. Due to the lack of corresponding resource documents for many concepts in DMOZ, each concept in the extracted taxonomy is searched by Google, and the texts of the first 100 records from the search results are collected as the documents attached to the concept.

In the second stage, the taxonomies of each test case are generated from the original taxonomy. The generation method includes two operations: copying taxonomy and removing concept.

Copying taxonomy

n copies are generated from this original taxonomy, where $3 \leq n \leq 9$ , corresponding to the number of the taxonomies of the generated test cases.

Removing concept

The process of removing concepts is a vital key since it determines the unique and heterogeneous nature of each copy of a test case. Here, we take the test case containing six taxonomies (i.e. n = 6) as an example to illustrate the operation process, which comprises five (i.e. n− 1) times of removing operation. When performing the ith (i = 1, 2, …, 5) removing operation, 100 concepts are selected one by one at random from the original taxonomy (once a concept is selected, it will be marked so that it will not be selected again); after that, for each selected concept, we first randomly select i copies from the test case, and then remove this concept from the i copies. Note that to maintain the hierarchical structure of the copy, when a concept is removed from a copy, its sub-concepts become the direct sub-concepts of its super concept.

In this way, the generated test case ensures that each concept in the original taxonomy exists in at least one copy. Meanwhile, the massive and random removal of concepts from the copies guarantees the heterogeneity of each copy. Due to the stochastic nature of the generation method, 50 different test cases were constructed for each group. The general characteristics of each group are depicted in Table 3. Furthermore, the total number of concepts in a test case for the seven groups is, respectively, 3261, 4148, 4935, 5622, 6209, 6696 and 7083. As can be seen from the generation method, the total number of concepts of a test case can be calculated using the following formula

\begin{matrix} f = n \times m - \frac{n (n - 1)}{2} \times 100 \end{matrix}

(15)

Table 3.

The characteristics of experimental data for synthetic scenario.

	Group 1	Group 2	Group 3	Group 4	Group 5	Group 6	Group 7
The largest number of concepts	1114	1059	1019	976	920	878	824
The smallest number of concepts	1075	1016	954	909	856	805	755
Maximum depth	7	7	7	7	7	7	7
Minimum depth	6	6	5	5	5	5	5

where n is the number of taxonomies in a test case and m is the number of concepts of the original taxonomy (i.e. m = 1187).

6.2.3. Evaluation results

For each test case, the first taxonomy is viewed as the target taxonomy, and the others as the source taxonomies. As pointed out in section 6.1, the original taxonomy is viewed as the benchmark taxonomy to evaluate each merged taxonomy. In order to focus on evaluating the performance of the merged taxonomy and remove the influence of the quality of alignments for the merged result, we transform the semantic relations between concepts in the original taxonomy to the correspondence relations between concepts from different taxonomies so that a near-perfect alignment is available. Figure 5 consists of seven figures depicting the detailed results of the experiments from 50 different test cases for each group in terms of the measures of GP, GR and GF. Figure 7 summarises the average performance of 50 different test cases for each group.

Figure 5.

Performance of 50 test cases for each group: (a) group 1, (b) group 2, (c) group 3, (d) group 4, (e) group 5, (f) group 6 and (g) group 7.

The results displayed in Figures 5 and 6 show that the precisions for each group all are 100%. It means that the hierarchical relations for the concept in each merged taxonomy are same as the benchmark taxonomy. In other words, the merged taxonomies do not introduce any incorrect semantic relations. As a general requirement, any semantic error in a taxonomy is not supported by the applications using the taxonomy [49], so that it is necessary to ensure that the precision of semantic relations of merged taxonomy reaches 100%.

Figure 6.

Average performance of seven groups of test cases.

The average recalls for each group (all less than 100%) show that the benchmark taxonomy contains more information than these merged taxonomies. In fact, each merged taxonomy and the benchmark taxonomy have exactly the same concept set, which means that some semantic relations have been missed in the process of merging taxonomies. We find that with the increasing number of taxonomies in a test case, the recall shows a downward trend (see Figure 6). Since more concepts are removed in the test case with more taxonomies, this leads to more concepts having multiple inheritance. To deal with multiple inheritance problem, not all semantic relations between concepts coming from source taxonomies are retained, but the relations from the target taxonomy are maintained. Even so, the minimum recall of our approach in all test cases for seven groups also reaches over 93.02% (in group 7).

As can be seen from Figure 6, the average F-measures of our approach applied on each merging task show that the merged taxonomies have a high quality of hierarchical relations (99.67%, 99.41%, 99.21%, 98.75%, 98.39%, 97.94% and 97.57%). Indeed, our approach preserves not only the concepts of the target taxonomy but also the non-redundant concepts of the source taxonomies. However, we lost some semantic relations mainly coming from source taxonomies, but we did not introduce any redundancy and inconsistency in the merged result.

7. Conclusion

In this article, we present an approach to automatically merge multiple taxonomies in an asymmetric manner, which includes three procedures: dividing concept sets, building core tree and adding non-core concepts. In order to reduce the scale of measuring semantic relations between concepts, the approach decomposes a fully merged taxonomy into many graph-based blocks and further reconstructs each block into a tree structure. Moreover, the approach employs the LDA topic model to solve the multiple inheritance problem. The proposed approach is extensively tested on real-world and synthetic scenarios in a fully automatic way. Experimental results demonstrate the effectiveness of the proposed approach in merging multiple taxonomies.

As a future work, we would like to explore an automatic way to measure the accuracy of information in the input taxonomies and eliminate erroneous information from them, which will help to further increase the utility of the merged taxonomy. Moreover, we plan to investigate how to evaluate the knowledge coverage of the merged taxonomy for some specific domains, which is another important decision factor for the utility of the merged taxonomy [50].

Footnotes

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship and/or publication of this article: This work was supported by the National Key R&D Program of China (Grant No. 2017YFB1401300 and 2017YFB1401302), the self-determined research funds of CCNU from the colleges’ basic research and operation of MOE (Grant No. CCNU19ZN011) and the Fundamental Research Funds for the Central Universities (Grant No. 2019YBZZ010).

ORCID iD

Chao Wu

References

Gruber

. A translation approach to portable ontology specifications. Knowl Acquis 1993; 5(2): 199–220.

Rios-Alvarado

Lopez-Arevalo

Sosa-Sosa

. Learning concept hierarchies from textual resources for ontologies construction. Expert Syst Appl 2013; 40(15): 5907–5915.

Zhu

Chen

et al. An efficient path computing model for measuring semantic similarity using edge and density. Knowl Inf Syst 2017; 55(1): 79–111.

Borsje

Levering

. Hermes: a semantic web-based news decision support system. In: Proceedings of the 2008 ACM symposium on applied computing (SAC), Fortaleza, 16–20 March 2008, pp. 2415–2420. New York: ACM.

De Knijff

Frasincar

Hogenboom

. Domain taxonomy learning from text: the subsumption method versus hierarchical clustering. Data Knowl Eng 2013; 83: 54–69.

Vandic

Van Dam

Frasincar

. Faceted product search powered by the semantic web. Decis Support Syst 2012; 53(3): 425–437.

Zhou

. Improving financial data quality using ontologies. Decis Support Syst 2012; 54(1): 76–86.

Zhou

Bruza

et al. A two-stage decision model for information filtering. Decis Support Syst 2012; 52(3): 706–716.

Zong

Nam

Eom

et al. Aligning ontologies with subsumption and equivalence relations in linked data. Knowledge Based Syst 2015; 76: 30–41.

10.

Raunich

Rahm

. Target-driven merging of taxonomies with atom. Inf Syst 2014; 42: 1–14.

11.

Pottinger

Bernstein

. Merging models based on given correspondences. In: Proceedings of the 2003 VLDB Conference, Berlin, 9–12 September 2003, pp. 862–873. New York: Springer.

12.

Seddiqui

Aono

. An efficient and scalable algorithm for segmented alignment of ontologies of arbitrary size. J Web Semant 2009; 7(4): 344–356.

13.

Jean-Mary

Shironoshita

Kabuka

. Ontology matching with semantic verification. J Web Semant 2009; 7(3): 235–251.

14.

Zhang

Jin

Pan

et al. RiMOM results for OAEI 2016, In: OM @ ISWC, Kobe, Japan, 17–21 October 2016, pp. 210–216. Berlin: Springer.

15.

Rahm

. Matching large schemas: approaches and evaluation. Inf Syst 2007; 32(6): 857–885.

16.

Giunchiglia

Autayeu

Pane

. S-Match: an open source framework for matching lightweight ontologies. Semant Web 2012; 3(3): 307–317.

17.

Arnold

Rahm

. Enriching ontology mappings with semantic relations. Data Knowl Eng 2014; 93: 1–18.

18.

Vennesland

. Matcher composition for identification of subsumption relations in ontology matching. In: Proceedings of the WI ’17, Leipzig, 23–26 August 2017, pp. 154–161. New York: ACM.

19.

Ochieng

Kyanda

. A statistically-based ontology matching tool. Distrib Parallel Databases 2018; 36(1): 195–217.

20.

Gangemi

Pisanelli

Steve

. Overview of the ONIONS project: applying ontologies to the integration of medical terminologies. Data Knowl Eng 1999; 31(2): 183–220.

21.

McGuinness

Fikes

Rice

et al. An environment for merging and testing large ontologies. In: Proceedings of the seventh international conference on principles of knowledge representation and reasoning, Breckenridge, Colorado, USA, 12–15 April 2000, pp. 483–493. New York: IEEE.

22.

Noy

Musen

. PROMPT: algorithm and tool for automated ontology merging and alignment. In: Proceedings of the 17th national conference artificial intelligence and 12th conference on innovative application of artificial intelligence1–3 June 2000, pp. 450–455. New York: ACM.

23.

Stumme

Maedche

. FCA-MERGE: bottom-up merging of ontologies. In: IJCAI seventeenth international joint conference on artificial intelligence, 4–10 August 2001, pp. 225–230. Menlo Park, CA: IJCAI.

24.

Xue

Pan

. A segment-based approach for large-scale ontology matching. Knowl Inf Syst 2017; 52(2): 467–484.

25.

Hakimpour

Geppert

. Global schema generation using formal ontologies. In: Proceedings of the 21st international conference on conceptual modeling, Tampere, 7–11 October 2002, pp. 307–321. Berlin: Springer.

26.

Guzmán-Arenas

Cuevas

. Knowledge accumulation through automatic merging of ontologies. Expert Syst Appl 2010; 37(3): 1991–2005.

27.

Rangel

Altamiranda

Cerrada

et al. Procedure based on semantic similarity for merging ontologies by non-redundant knowledge enrichment. Int J Knowl Manag 2018; 14(2): 16–36.

28.

Maree

Belkhatir

. Addressing semantic heterogeneity through multiple knowledge base assisted merging of domain-specific ontologies. Knowledge Based Syst 2015; 73: 199–211.

29.

Zhang

Ren

. OIM-SM: a method for ontology integration based on semantic mapping. J Intell Fuzzy Syst 2017; 32(3): 1983–1995.

30.

Khan

Safyan

. Semantic matching in hierarchical ontologies. J King Saud Univ – Comput Inf Sci 2014; 26(3): 247–257.

31.

Yang

Lin

Wei

. Retaining knowledge for document management: category-tree integration by exploiting category relationships and hierarchical structures. J Am Soc Inf Sci Technol 2010; 61(7): 1313–1331.

32.

Xie

Liu

Guan

. Tree-structure based ontology integration. J Inf Sci 2011; 37(6): 594–613.

33.

Otero-Cerdeira

Rodríguez-Martínez

Gómez-Rodríguez

. Ontology matching: a literature review. Expert Syst Appl 2015; 42(2): 949–971.

34.

Song

. A mental model approach for category hierarchy maintenance on sellers’ self-input items in e-commerce websites. In: 2016 11th Iberian conference on information systems and technologies (CISTI), Las Palmas, 15–18 June 2016. New York: IEEE.

35.

Yuan

Cong

Sun

et al. Category hierarchy maintenance: a data-driven approach. In: 35th International ACM SIGIR conference on research and development in information retrieva, Portland, OR, 12–16 August 2012, pp. 791–800. New York: ACM.

36.

Tang

Zhang

Liu

. Acclimatizing taxonomic semantics for hierarchical. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining KDD ‘06, Philadelphia, PA, 20–23 August 2006, pp. 384–393. New York: ACM.

37.

Matykiewicz

Duch

. Multiple inheritance problem in semantic spreading activation networks. In: International conference on brain informatics and health, Warsaw, 11–14 August 2014, pp. 252–265. New York: Springer.

38.

Blei

Jordan

et al. Latent Dirichlet allocation. J Mach Learn Res 2003; 3: 993–1022.

39.

Zhuge

. Automatic maintenance of category hierarchy. Futur Gener Comput Syst 2017; 67: 1–12.

40.

Otero-cerdeira

Rodríguez-martínez

Gómez-rodríguez

. Expert systems with applications ontology matching: a literature review. Expert Syst Appl 2015; 42(2): 949–971.

41.

Fellbaum

Miller

. WordNet: an electronic lexical database. Cambridge, MA: The MIT Press, 1988, pp. 22.

42.

Ahsaee

Naghibzadeh

Naeini

SEY

. Semantic similarity assessment of words using weighted WordNet. Int J Mach Learn Cybernet 2014; 5(3): 479–490.

43.

Jiang

Wang

Zheng

. A semantic similarity measure based on information distance for ontology alignment. Inf Sci 2014; 278: 76–87.

44.

Cilibrasi

Vit

PMB

. The Google similarity distance. IEEE Trans Knowl Data Eng 2007; 19(3): 370–383.

45.

Duchateau

Bellahsene

. Measuring the quality of an integrated schema. In: Proceedings of the 29th international conference on conceptual modeling, Vancouver, BC, Canada, 1–4 November 2010, pp. 261–273, Berlin: Springer.

46.

Mahfoudh

Forestier

Hassenforder

. A benchmark for ontologies merging assessment. In: 9th international conference on knowledge science, engineering and management, Passau, 5–7 October 2016, pp. 555–566. Berlin: Springer.

47.

Dellschaft

Staab

. On how to perform a gold standard based evaluation of ontology learning. In: Proceedings of the 5th international conference on the semantic web, Athens, GA, 5–9 November 2006, pp. 228–241. Berlin: Springer.

48.

Meijer

Frasincar

Hogenboom

. A semantic approach for extracting domain taxonomies from text. Decis Support Syst 2014; 62: 78–93.

49.

Liang

Xiao

Wang

. Graph-based wrong isa relation detection in a large-scale lexical taxonomy. In: Proceedings of 31th conference of artificial intelligence (AAAI 2017), San Francisco, CA, 4–9 February 2017, pp. 1178–1184.

50.

Zaveri

Rula

Maurino

et al. Quality assessment for linked data: a survey. Semant Web 2016; 7(1): 63–93.