Abstract
Ontologies are an extensively used structural and semantic knowledge representation for describing entities with unambiguous meaning and relations. A large number of general and domain-specific semantic similarity measures are available in the literature to assess the similarity among these rich knowledge bases. Nevertheless, no single measure performs best in all domains and applications. Each measure uses a different strategy and has its own pros and cons. Hence, to consolidate the different kinds of measures, their applicability, their similarities and differences, and their advantages and disadvantages, a detailed review of semantic similarity measures has been carried out in this paper. Specifically, a comprehensive and novel classification of the semantic similarity measures, based on the types of ontology information they exploit, is presented and each measure is briefly described. Further, the open challenges in this field and the existing evaluation methodologies and datasets for semantic similarity measures are also described.
Introduction
Realizing the goals of the semantic web requires rich, well-defined structure and relationships among web data, which ontologies provide. In general, an ontology is an abstract model that describes a domain of interest through a set of concepts and rich relationships (IS-A and other relations) among those concepts. According to Gruber [19], an ontology is a “formal explicit specification of conceptualization”. In this definition, “formal” means machine readable; “explicit specification” means that the classes, their properties, instances, rules, etc. are defined explicitly; and “conceptualization” refers to the abstract model of a phenomenon or real-world concept.
Ontologies are used in various applications in which similarity measures are deployed in the preliminary processing stage to discover the similarity among the concepts in the ontologies. These measures are used in areas such as ontology clustering [15, 57], ontology matching systems [16], XML document clustering [20], the biomedical domain [39, 40], error detection and correction [10], cross-lingual processing [26], synonym detection [33], word sense disambiguation [38], knowledge discovery [22], etc.
In general, the input to a similarity measure is a pair of concepts from the same ontology or from multiple ontologies, and the similarity is represented by a real value, usually ranging from 0 to 1. Although this paper deals only with semantic similarity measures, similarity measures can be broadly classified into linguistic, semantic and extensional measures.
Linguistic similarity measures consider the information about the concept such as name, label, comment, annotation, synonyms and its translated name to other languages. This information is used by linguistic similarity measures such as edit-distance, n-gram, cosine, Cosynonymy similarity, etc., to find linguistic similarity between the concepts.
Semantic similarity measures use the semantic information of the concept such as depth, Information Content (IC), neighbourhood, synset, path length, etc., between the two concepts to be matched, to compute the similarity.
Extensional similarity measures, such as Jaccard similarity and Hamming distance, use the instances of the concepts to compute similarity values.
Numerous semantic similarity measures are available in the literature, several of which are either domain or application specific [22]. Each measure is designed with different objectives such as efficiency, applicability [51], scalability, quality, etc. Similarly, each of them has certain prerequisite criteria and parameters to be fixed (if any), and explores particular type(s) of information from the ontologies to find the similarity. Therefore, choosing among this plethora of measures for a particular application and/or domain requires a thorough understanding of each measure’s design objective, prerequisites, type of information required, advantages, disadvantages, and applicability.
Hence, in this paper, we present an extensive review of the existing semantic similarity measures that provides a detailed insight into each measure. A new classification hierarchy for the semantic similarity measures is also proposed to describe the measures in a systematic way. In our classification, the semantic similarity measures are first classified coarsely into intra and inter semantic similarity measures [54]. A semantic similarity measure is called an “intra semantic similarity measure” if the two concepts belong to the same ontology, and an “inter semantic similarity measure” if the two concepts are from different ontologies [41]. These two classes are then divided into finer levels, where each measure is discussed in detail.
The rest of the paper is organized as follows. Section 2 discusses the existing surveys on semantic similarity measures. Section 3 presents the proposed classification of the semantic similarity measures and briefly describes each measure. Section 4 outlines the disadvantages of each class and the open challenges. Section 5 gives a brief overview of the evaluation strategies for the similarity measures. Section 6 concludes the paper with a note on the challenges.
Existing surveys
Few surveys [5, 66] have been carried out on semantic similarity measures. In [54], the similarity measures are classified into intra and inter similarity measure. The intra similarity measure is classified into ontology based, Information Content (IC) based, feature based and hybrid measure. The ontology based class is sub classified into path based and depth based measures. The inter similarity measure is classified into path based and feature based measures.
In [40], the similarity measures are classified into node and edge based. Further, the node based similarity measures are sub-classified into external (IC based) and internal (depth, link density and number of children). This survey is presented based on the gene ontology and evaluates the sets of measures against the various biomedical tasks such as sequence similarity, gene expression, family similarity, clustering, etc.
Le et al. [29] present a survey that discusses both intra and inter semantic similarity measures. The intra measures are sub-classified based on the presence or absence of a subclass relation between the two concepts to be matched. The inter measures are sub-classified into two groups, according to whether the two concepts to be matched belong to the same or to different DL languages. The survey concentrates on semantic similarity measures specifically designed for Description Logic based ontologies.
Ricklef and Blomqvist [47] have described the ways in which sets of semantic similarity measures have been combined, tested and evaluated. Their paper classifies the measures into the following: (i) “weighting methods”, where measures compute edge weights, (ii) “path calculation”, where the paths between the concepts are used, and (iii) “similarity of sets”, which covers general set similarity measures rather than ontology-specific measures.
In [22], the measures are divided into three categories: (i) edge based, (ii) node based and (iii) hybrid measures. Further node based measures are subdivided into feature based and IC based measures. Here, the equivalence among the measures is discovered and a framework to unify all the measures in the biomedical domain has been proposed.
Meng et al. [35] reviewed the semantic measures based on WordNet [17] and classified them into path based, information based, feature based and hybrid measures. Zare et al. [66] reviewed and classified intra similarity measures that use SNOMED-CT as the input ontology. These measures are classified into path based, IC based and hybrid measures. Batet et al. [5] classified the existing intra similarity and relatedness measures and briefly introduced the pros and cons of each measure. The similarity measures are classified into edge based, feature based and IC based measures.
The surveys in [5, 54] lack coverage, since not all the existing semantic similarity measures are discussed. Certain inter similarity measures based on features and IC are not covered in [54], and the entire set of IC based measures is not covered in [47]. In [35], only the measures based on WordNet are discussed. Domain-dependent surveys are presented in [22, 66], where the measures are formalized and discussed based on biomedical ontologies. A language-dependent survey is presented in [29], where the semantic similarity measures for non-Description Logic ontologies are not discussed.
Preliminary definitions
This section introduces the definition of the basic terminologies of ontologies and similarity measures.
In this paper, we consider an ontology as a directed graph G = (C, R), where C is the set of vertices in the graph, representing the set of concepts (unnamed and named) in the ontology, and R is the set of edges in the graph, representing the relations between the concepts in the ontology.
The edges in R can be of many types, depending on the relation between the two concepts. If two concepts are linked by a parent-child relation, then the edge is of the IS-A relation type. The parent and child concepts are called the hypernym (a.k.a. super-concept) and hyponym (a.k.a. sub-concept) respectively. For a given concept c, the parent concept and all subsumer concepts up to the root of the ontology are called the ancestors of c. Similarly, the child concepts and all the concepts down the hierarchy for which c is an ancestor are called the descendants of c. If two concepts are linked by a domain-specific relation, then the edge is of a non-IS-A relation type. The two concepts linked by a non-IS-A relation are called the domain and range concepts respectively.
The term path is defined as a set of edges which connects two concepts. The term path length is defined as the number of edges in the path between the two concepts. The path length between two concepts a and b is represented as dist (a,b).
The edges form the structure of the ontology and each concept is placed in a particular level of the structure. A concept is said to be at depth k, if it is k edges away from the root node (owl:Thing) of the ontology. It is represented using the function: depth(c) which denotes the depth of a concept c.
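The definitions above can be illustrated with a small sketch. The toy transport taxonomy, the `PARENTS` mapping and the helper names below are hypothetical and only cover IS-A edges; the root is taken to be at depth 0, as in the definition of depth(c), and dist(a, b) ignores edge direction.

```python
from collections import deque

# Hypothetical toy taxonomy: each concept maps to its IS-A parents.
PARENTS = {
    "vehicle": [],
    "two_wheeler": ["vehicle"],
    "four_wheeler": ["vehicle"],
    "bicycle": ["two_wheeler"],
    "motorcycle": ["two_wheeler"],
    "car": ["four_wheeler"],
}

def ancestors(c):
    """All subsumers of c up to the root (c itself excluded)."""
    seen, stack = set(), list(PARENTS[c])
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(PARENTS[p])
    return seen

def depth(c):
    """Number of edges on the shortest IS-A chain from c to the root."""
    return 0 if not PARENTS[c] else 1 + min(depth(p) for p in PARENTS[c])

def dist(a, b):
    """Path length: number of edges on the shortest path between a and b."""
    neighbours = {c: set(ps) for c, ps in PARENTS.items()}
    for c, ps in PARENTS.items():
        for p in ps:
            neighbours[p].add(c)  # make every edge traversable both ways
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        node, d = queue.popleft()
        if node == b:
            return d
        for n in neighbours[node] - seen:
            seen.add(n)
            queue.append((n, d + 1))
    return None  # the two concepts are not connected
```

In this toy graph, depth("bicycle") is 2 and dist("bicycle", "car") is 4, since the shortest path passes through the root concept.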
A new classification hierarchy for semantic similarity measures
This section outlines the various semantic similarity measures and discusses the new classification of both intra and inter semantic similarity measures.
Under each class, the existing measures are reviewed and compared in terms of advantages and disadvantages.
Intra semantic similarity measures
The intra semantic similarity measures are classified based on the information they use. The measures which use the path distance between the two concepts to be matched are classified as path based measures. Measures which consider the depth information of the concepts are classified as depth based measures. The measures which are based on the IC of the concepts are classified as IC based measures, and the measures which exploit the concept description represented using Description Logic (DL) are classified as DL concept description based measures. Some measures assess the similarity using the features of the concepts, and these are classified as feature based measures. The measures which combine different kinds of information about the concepts are classified as hybrid measures. The classified measures are described below.
Path based measures
The intra path based measures can be further classified as follows: (i) ‘simple path’ measures, where each edge in the shortest path connecting the concept pair is assigned a weight of one, (ii) ‘weighted path’ measures, where each edge in the shortest path is assigned a precomputed weight, and (iii) ‘multi path’ measures, which consider all the paths between the two concepts.
Simple Path: These measures are based on the number of edges separating the two concepts, where each edge is assigned a weight of one. The basic assumption is that the greater the distance between two concepts in the ontology, the more dissimilar they are. Rada et al. [45] based their measure on the number of edges in the shortest path connecting the two concepts to be matched. Hirst & St-Onge [23] considered the number of edges in the shortest path (dist (a, b)) between the concept pair (a, b) and the number of edge direction changes in the path (turns (a, b)). The longer the path and the more changes in edge direction it contains, the lower the similarity is assumed to be. It is formally defined as follows.
where C and k are constants. Hirst & Onge et al. used C = 8 and k = 1 for their experiments.
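As a minimal sketch of this measure, assuming the commonly cited form sim(a, b) = C − dist(a, b) − k · turns(a, b) and taking the path length and the number of turns as precomputed inputs (the function name is hypothetical):

```python
def hirst_st_onge(dist_ab, turns_ab, C=8, k=1):
    """Assumed form: sim = C - dist(a, b) - k * turns(a, b).

    dist_ab  : number of edges in the shortest path between a and b
    turns_ab : number of changes of edge direction along that path
    Defaults C = 8 and k = 1 are the constants quoted in the text.
    """
    return C - dist_ab - k * turns_ab
```

For a path of two edges with one change of direction (for example, two siblings connected through their parent), the sketch yields 8 − 2 − 1 = 5; a longer or more winding path yields a lower value.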
Weighted Path: This measure is based on the weights of the edges along the shortest path connecting the two concepts. In Yang & Power et al. [65], the similarity is computed based on the product of the edge weights along the path, where weights of the edges are assigned based on the relation type of the edge. Types of relations considered are hyponymy, hypernymy, synonym, antonym, holonymy, meronymy and identity. The similarity value is computed as follows.
where γ is a threshold on distance, α_t is the weighting factor based on the sequence of edges of type t and β_{t_i} is the weighting factor for the i-th edge of type t in the path.
In Sussna et al. [61], edges are weighted based on the relation type and the deepest depth d between the two concepts. The author states that each edge between two concepts in an ontology represents two relations. For example, an IS-A edge between two concepts represents both a generalization and a specialization relation. The weight of each edge (w (a, c)) in the shortest path between the concept pair (a, b) is computed as follows.
where r represents the relation type between the concepts a and b, r′ represents the inverse relation of relation r,
Zhu et al. [69] designed a new way to measure the distance between two concepts based on the number of edges in the shortest path (dist ()), depth and density of the concept pair. First, the local area density of the concept pair is computed as follows.
where AreaSubsumer (a, b) indicates the set of subsumers of concept a or b in the shortest path between a and b, and Density(c) represents the number of hyponyms of the concept c. The similarity measure can then be formally defined as follows.
where LCS (Least Common Subsumer) is the most specific common ancestor of the concept pair (a, b) and λ is the adjustable controlling factor. This is a generic and heuristic distance computing measure that can be substituted into any of the path based measure’s dist () function.
Multi Path: These measures consider length of all the paths connecting the concepts for similarity computation. All the paths are considered based on the argument that each path conveys some semantic relation between the concepts and hence cannot be ignored just because a path is longer [2, 3].
In Bulskov et al. [11], the similarity is computed based on the number of generalization and specialization edges along the path p_j, among the set of paths between the two concepts, which gives the maximum similarity. The similarity value is computed as follows.
where m is the number of paths between the concepts a and b, σ ∈ [0, 1] is a specialization factor, γ ∈ [0, 1] is a generalization factor, the function s (p_j) refers to the number of specialization edges and g (p_j) refers to the number of generalization edges along the path p_j.
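A minimal sketch of this multi path measure, assuming the form sim(a, b) = max over j = 1..m of σ^s(p_j) · γ^g(p_j). Each path is represented here simply by its pair of edge counts, and the default values of σ and γ are illustrative placeholders, not values taken from [11].

```python
def bulskov_similarity(paths, sigma=0.9, gamma=0.4):
    """Assumed form: max_j sigma**s(p_j) * gamma**g(p_j).

    paths : list of (s, g) tuples, one per path between a and b, where
            s is the number of specialization edges and g the number of
            generalization edges along that path.
    sigma, gamma : factors in [0, 1]; the defaults are placeholders.
    """
    return max(sigma ** s * gamma ** g for s, g in paths)
```

With two candidate paths, one with a single specialization and a single generalization edge and another with three generalization edges, the first path determines the result (0.9 × 0.4 = 0.36 against 0.4³ = 0.064), which shows that a longer path is not ignored but simply contributes a smaller candidate value.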
IC based measures
IC based measures compute similarity based on the information obtained from the ontology and a corpus. Generally, a corpus is a collection of texts that is either generic or pertains to a specific domain. The domain of the corpus chosen for the similarity computation should be similar to the domain of the ontology. The corpus serves as additional knowledge for the similarity computation.
The IC of a concept quantifies the amount of information the concept expresses. Information-theoretic approaches compute the needed IC values by associating a probability with each concept c in the ontology (Equation (9)). These probabilities (p (c)) are based on word occurrences in a given corpus.
The more frequently a word occurs, the less informative it is [51], which is captured by the negative logarithm. The intra IC based measures can be classified into corpora based and taxonomy based.
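The relation between occurrence probability and IC can be sketched as follows, assuming the standard definition IC(c) = −log p(c) with p(c) estimated as a relative frequency. The function name and the counts used in the usage note are hypothetical.

```python
import math

def information_content(concept_count, total_count):
    """IC(c) = -log p(c), with p(c) = concept_count / total_count.

    Written as log(total / count), which equals -log(count / total).
    concept_count must be positive: an unseen concept has undefined IC.
    """
    return math.log(total_count / concept_count)
```

A concept that accounts for every occurrence in the corpus (such as the root, if counts are propagated upwards) has IC 0, while a concept seen once in 100 occurrences is more informative than one seen 10 times.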
Corpora Based: In Resnik et al. [46] the similarity measure is based on the IC value of the LCS of the two concepts. This measure is formally represented as follows.
The drawback of this measure is that it yields the same semantic similarity value for all concept pairs which have the same LCS, irrespective of any other information about the individual concepts in the pair. In addition, this measure does not yield the maximum value of 1 for identical concepts.
Lin et al. [33] tackled the drawback of Resnik’s measure by normalizing the IC of the least common subsumer using the IC of both concepts. In this way, both the common and the distinct information are considered.
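The two corpora based measures can be contrasted in a short sketch, assuming the commonly cited forms sim_Resnik(a, b) = IC(LCS(a, b)) and sim_Lin(a, b) = 2 · IC(LCS(a, b)) / (IC(a) + IC(b)). The IC values are taken as precomputed inputs, and the numbers in the usage note are made up for illustration.

```python
def resnik(ic_lcs):
    """Assumed form: the similarity is the IC of the least common subsumer."""
    return ic_lcs

def lin(ic_a, ic_b, ic_lcs):
    """Assumed form: 2 * IC(LCS) / (IC(a) + IC(b))."""
    denom = ic_a + ic_b
    # Assumption: if both concepts carry no information, return 0.
    return 2 * ic_lcs / denom if denom > 0 else 0.0
```

With hypothetical values IC(a) = 2.5, IC(b) = 2.3 and IC(LCS) = 1.2, the Resnik sketch returns 1.2 for every pair sharing that LCS, whereas the Lin sketch returns 2.4 / 4.8 = 0.5 and reaches 1 when a concept is compared with itself.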
Taxonomy Based: To overcome the drawback of corpus based IC computation, taxonomy based measures calculate the IC value from the taxonomical information of the ontology rather than the corpus. David Sanchez et al. [52] defined a new way of computing IC of a concept c, using the sum of commonness of the leaf concepts subsumed by c (commonness (c)). The commonness of a leaf ℓ is defined as inversely proportional to the number of subsumers of the leaf ℓ. The IC of a concept c is formally defined as follows.
To normalize the IC measure in the range of 0 to 1, it is scaled by the commonness value of root concept which is the generic concept in the ontology.
Seco et al. [58] and Pirro & Seco [44] calculated IC based on the number of hyponyms of the concepts. Let hypo (c) denote the set of hyponyms (subclasses) of the concept c and root_node be the top concept of the ontology. The IC of a concept c is computed as follows.
The drawback of this method is similar to the Resnik et al. [46] measure. Two concept pairs belonging to different specificity i.e. depth, but with the same number of hyponyms could be computed with the same similarity value. However, as the hierarchy is traversed down, concepts in the deeper level are distinguished by little information as compared to the concepts in the upper level of hierarchy which are distinguished by large amount of information. Therefore the similarity value should be tuned according to the depth of the concept. Hence to overcome this drawback, Zhou et al. [68] used number of hyponyms and depth of the concept to compute the IC value. IC computation of Zhou et al. is formally defined as follows.
where max_depth represents the maximum depth of the ontology. The importance of the two features: depth and number of hyponyms are controlled by the factor k, which was set as 0.5 by the author. All the four taxonomy based IC computations of the concepts, are incorporated in the existing IC measures such as Resnik et al. [46], Lin et al. [33] and Jiang & Conrath [28] to compute the similarity.
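The two hyponym based IC computations can be sketched as follows, assuming the commonly cited forms IC_Seco(c) = 1 − log(hypo(c) + 1) / log(max_nodes) and IC_Zhou(c) = k · (1 − log(hypo(c) + 1) / log(max_nodes)) + (1 − k) · log(depth(c)) / log(max_depth). The parameter names are hypothetical, and depths are assumed to be counted from 1 so that the logarithm is defined for the root.

```python
import math

def ic_seco(hypo_count, max_nodes):
    """Assumed form: 1 - log(hypo(c) + 1) / log(max_nodes).

    hypo_count : number of hyponyms of the concept
    max_nodes  : total number of concepts in the ontology
    """
    return 1 - math.log(hypo_count + 1) / math.log(max_nodes)

def ic_zhou(hypo_count, depth_c, max_nodes, max_depth, k=0.5):
    """Assumed form: k * IC_Seco(c) + (1 - k) * log(depth) / log(max_depth).

    Assumption: depth_c >= 1 (root at depth 1) and max_depth > 1.
    """
    hyponym_part = 1 - math.log(hypo_count + 1) / math.log(max_nodes)
    depth_part = math.log(depth_c) / math.log(max_depth)
    return k * hyponym_part + (1 - k) * depth_part
```

Under these assumptions a leaf has a Seco IC of 1 and the root, which subsumes every other concept, has an IC of 0. In the Zhou variant, two concepts with the same number of hyponyms receive different IC values when they sit at different depths.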
In Pirro [42, 43], the IC of a concept is calculated based on the IC definition of Seco et al. [58] and Pirro & Seco [44] (Equation (13)). The measure in [42] uses Tversky’s theory of similarity [62] to compute the similarity as follows.
Meanwhile, the measure in [43] uses Tversky’s ratio model. Additionally, a new IC measure called Extended IC (EIC) is introduced to consider the neighbour concepts connected using non-IS-A relations. The EIC of a concept c is defined as the average of the IC values (computed by Equation (13)) of the neighbour concepts of c. The similarity measure computed using these IC measures is as follows.
where the parameters ζ and η determine the influence of the IC and EIC measure of the concept. In both the above measures, the similarity is based on the common and distinct features of the concept pairs. The IC of the LCS contributes to the common features and the IC of each concept in the pair contributes to the distinct features of the concept pair.
Aouicha et al. [4] designed a new way to compute IC of a concept based on the ancestors subgraph. Here, the IC value is calculated based on the specificity (AnchorSpecificity(b)) of the ancestors which is quantified by the depth and descendants of the hypernyms of the ancestors. The IC of a concept is defined as follows.
where QuantifiedDescendants (b) can represent number of descendants of b or number of leaf descendants of b or the depth probability distribution of descendants in the given ontology. After the IC values are calculated, the Lin et al. [33] measure is used to compute the similarity value.
Meng et al. [34] calculated IC of a concept based on the concept’s topology in the ontology. Specifically, concept’s depth, its number of hyponyms and depth of each hyponym are used for IC computation. The IC of a concept is formally defined as follows.
where depth_max is the maximum depth of the ontology and node_max is the total number of concepts in the ontology. Similar to Aouicha et al. [4] measure, Lin et al. [33] measure is used to compute the similarity value.
Zhang et al. [67] proposed a novel way of computing IC of a concept c, using the sum of the number of subsumers of the leaf concepts subsumed by c. The IC of a concept c is formally defined as follows.
Based on the above defined IC computation and Resnik et al. [46] measure, Zhang et al. [67] proposed a new similarity measure which is formally defined as follows.
where the function hyp (a) represents the hypernym of the concept a, cmax-depth denotes the deepest node in the ontology and L represents hyponyms of the LCS of the concept pair (a, b), i.e. L = hyp(LCS(a,b)).
With ontologies being described by description logic, the features of the description logic are used to compute concept similarity.
Bentallah et al. [7] aimed at finding the web service that best suits a given query (the required web service), where the set of web services is represented using a DL-based ontology. This web service search problem is reformulated as a best covering problem. Using the concepts’ DL descriptions, the proposed hypergraph based matching algorithm takes the query Q and the web service ontology O as input and discovers the best covered web services. The best covered web services for a concept in Q are the web services, represented by sets of concepts in O, which have the maximum matched information with Q.
In Li & Horrocks [31] and Castillo et al. [18], the semantic similarity between the user query and the available set of web services is calculated using a DL reasoner, which uses the concepts’ DL descriptions. The available web services in [31] are represented using the DAML-S markup language. The compatibility between the user request and the web services is determined using the proposed matchmaking algorithm, which is based on the intersection satisfiability check of the RACER DL reasoner. The intersection is satisfiable if the concept descriptions of the user request and the web service have some information in common. The compatibility is distinguished into five levels: Exact, PlugIn, Subsume, Intersection and Disjoint. Web services with an Exact match are given top priority and those with a Disjoint match are assigned the least priority.
Castillo et al. [18] used a similar procedure to find the similarity between the user query and the available web services using the intersection satisfiability operation. The main differences are: (i) the DAML+OIL markup language is used to represent the web services and (ii) in addition to the RACER DL reasoner, the FACT reasoner is also used.
The drawbacks of these measures are that they can be applied only to DL ontologies and that similarity can be computed only for concept pairs which are subsumed.
Depth based measures
The depth based approaches are essentially shortest path approaches that also take depth into account. The depth of a concept represents its information specificity. As the depth increases, the specificity of the concept increases and the distinct information between the concepts decreases. For example, consider a Transport ontology in which the concept pair (four wheeler, two wheeler), high up in the hierarchy, has more distinguishing information than the concept pair (bicycle, motorcycle) further down the hierarchy. Assume that the path length of each of these two concept pairs is 2. If the similarity for these two concept pairs is computed using a path based measure, say Rada et al. [45], then both concept pairs receive the same value of 2, even though the second pair is more similar than the first. Hence the similarity value should be scaled up as the depth increases. Therefore, many measures in the literature have used depth information along with path information to compute the similarity. The depth based measures can be further classified as shortest path and common subsumer measures. Shortest path measures consider only the shortest path and depth information, whereas common subsumer measures also consider the LCS of the concept pair.
Shortest Path: In Leacock & Chodorow [30], the number of edges (n) in the shortest path between the two concepts is normalized by the overall depth (d) of the taxonomy.
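A minimal sketch of this normalization, assuming the commonly cited form sim(a, b) = −log(n / (2 · d)). The function name is hypothetical, and n is assumed to be at least 1, since the logarithm is undefined for a path of length 0.

```python
import math

def leacock_chodorow(n_edges, taxonomy_depth):
    """Assumed form: -log(n / (2 * d)).

    n_edges        : number of edges in the shortest path (assumed >= 1)
    taxonomy_depth : overall depth d of the taxonomy
    """
    return -math.log(n_edges / (2 * taxonomy_depth))
```

For a taxonomy of depth 10, a pair separated by 2 edges scores −log(0.1), and any pair separated by a longer path scores lower.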
Common Subsumer: In Wu & Palmer [64], the similarity measure is computed based on the depth of the LCS (d) of two concepts and path lengths of the two concepts from the LCS (p1 and p2). It is formally defined as follows.
This measure is designed in such a way that, a concept pair at higher level of hierarchy (i.e. abstract concepts) is assigned a similarity value less than the concept pairs at lower level of hierarchy (i.e. specific concepts) since the concept specialization increases as we go down the hierarchy.
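The behaviour described above can be checked with a small sketch, assuming the commonly cited form sim(a, b) = 2d / (p1 + p2 + 2d). The function name and the depth values in the usage note are hypothetical.

```python
def wu_palmer(lcs_depth, p1, p2):
    """Assumed form: 2 * d / (p1 + p2 + 2 * d).

    lcs_depth : depth d of the least common subsumer
    p1, p2    : path lengths from each concept to the LCS
    """
    denom = p1 + p2 + 2 * lcs_depth
    # Assumption: a root concept compared with itself is fully similar.
    return 2 * lcs_depth / denom if denom else 1.0
```

Two sibling pairs with the same path length of 2 are now separated by depth: siblings under an LCS at depth 1 score 2 / 4 = 0.5, while siblings under an LCS at depth 3 score 6 / 8 = 0.75.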
In Li et al. [32], the shortest path length and the depth of the LCS are combined in a non-linear fashion. The contributions of the path length and the depth are controlled by two parameters, α and β, whose values were determined experimentally.
where d is the depth of the LCS, path i (a, b) represents all possible paths between the concept pair (a, b).
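A minimal sketch of this non-linear combination, assuming the commonly cited form sim(a, b) = e^(−α·l) · (e^(β·d) − e^(−β·d)) / (e^(β·d) + e^(−β·d)), where l is the length of the shortest of the possible paths. The second factor is the hyperbolic tangent of β·d. The default parameter values are illustrative and should be treated as assumptions.

```python
import math

def li_similarity(path_lengths, lcs_depth, alpha=0.2, beta=0.6):
    """Assumed form: exp(-alpha * l) * tanh(beta * d).

    path_lengths : lengths of all candidate paths between a and b;
                   the shortest one is used as l
    lcs_depth    : depth d of the least common subsumer
    alpha, beta  : illustrative defaults, not prescribed values
    """
    shortest = min(path_lengths)
    return math.exp(-alpha * shortest) * math.tanh(beta * lcs_depth)
```

The sketch decreases as the shortest path grows and increases, with saturation, as the LCS gets deeper, so the result stays between 0 and 1.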
Hybrid measures
The hybrid class of measures uses more than one class of information to compute the similarity value more accurately.
Jiang & Conrath [28] is a weighted shortest path based similarity measure. The weight of each edge connecting a concept c to its parent concept p, is determined based on the depth, local density and IC of the concept c as follows.
where E (p) is the number of child edges that a parent concept p possesses (i.e. the local density of p), Ē is the average density of the ontology, d (p) represents the depth of p in the ontology, and T (c, p) represents the edge relation/type factor. The parameters α and β determine the contribution of the depth and the density of the concept. The similarity measure is the summation of the weights of the edges in the shortest path between the two concepts.
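The edge weighting can be sketched as follows, assuming the commonly cited form wt(c, p) = (β + (1 − β) · Ē / E(p)) · ((d(p) + 1) / d(p))^α · (IC(c) − IC(p)) · T(c, p). The parameter names are hypothetical, d(p) is assumed to be positive, and the IC values in the usage note are made up.

```python
def jc_edge_weight(ic_c, ic_p, E_p, E_avg, d_p, T=1.0, alpha=0.0, beta=1.0):
    """Assumed form of the weight of the edge from concept c to parent p.

    ic_c, ic_p : IC of the child and of the parent
    E_p        : number of child edges of p (local density)
    E_avg      : average density of the ontology
    d_p        : depth of p (assumed > 0)
    T          : edge relation/type factor
    alpha, beta: contribution of depth and density
    """
    density_factor = beta + (1 - beta) * E_avg / E_p
    depth_factor = ((d_p + 1) / d_p) ** alpha
    return density_factor * depth_factor * (ic_c - ic_p) * T

def jc_distance(edge_weights):
    """Sum of the edge weights along the shortest path."""
    return sum(edge_weights)
```

With α = 0, β = 1 and T = 1 each weight reduces to IC(c) − IC(p), so the sum over the path from a up to the LCS and down to b becomes IC(a) + IC(b) − 2 · IC(LCS). For hypothetical values IC(a) = 2.5, IC(b) = 2.3 and IC(LCS) = 1.2 this gives 2.4.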
Cia et al. [12] proposed a novel way to compute IC value of a concept c using the number of hyponyms of the concept (hypo (c)) and depth (d (c)).
where
First model is used to represent the edge length between two adjacent concepts using the difference in the IC value which is defined as follows.
Then, in the second model, the above measure is normalized using the maximum depth of WordNet (MaxDepth).
Based on the above two models, the similarity measure can be defined as follows.
where α and β indicate the importance of spl W () and spl N () respectively.
Inter semantic similarity measures
Inter semantic similarity measures are further classified into path based, IC based and feature based measures on the basis of the information they exploit. Since the majority of the feature based measures can be applied to both intra and inter semantic similarity, they are discussed separately in Subsection 2.3. Establishing connectivity across the input ontologies is the major problem which each inter semantic similarity measure must overcome. Each method handles this problem differently, as outlined in the following subsections.
Path based measures
Al-Mubaid & Nguyen [1] proposed both intra and inter similarity measures. For inter similarity computation, the connectivity between the two ontologies is established by a set of bridge concepts. First, similar concepts across the different ontologies are identified. Then the connectivity between the two ontologies is established by adding edges between these similar concepts to create the set of bridge concepts. The shortest path between two concepts of different ontologies is identified through these bridge concepts. The intra similarity is calculated based on the commonality between the concepts, which is defined as follows.
where D is the maximum depth of the ontology and Depth (LCS (a, b)) represents the depth of the LCS of the concept pair (a, b). Based on the above defined commonality, the intra similarity computation is defined as follows.
where Path represents the shortest path length between the concepts, CSpec is defined in 19, k is a constant and α > 0 and β > 0 are the parameters which determine the contribution of path and depth component.
For inter similarity computation, the denser input ontology is termed the primary ontology and the other(s) the secondary ontology/ontologies. The two concepts in a concept pair can belong to: (i) the primary ontology, (ii) a secondary ontology, (iii) the primary ontology for one concept and a secondary ontology for the other, and (iv) multiple secondary ontologies. In the first scenario, the similarity is computed using Equation (33). For the other three scenarios, the intra similarity measure of Equation (33) is adapted by scaling the Path and Depth values based on the maximum depths of the primary and secondary ontologies to compute the inter similarity value.
IC based measures
The IC is calculated based on taxonomy information such as the number of children, the number of subsumers, the total number of concepts in the taxonomy and the maximum number of leaves. In Sanchez et al. [52], the two ontologies are connected by means of bridge concepts, as already discussed, and hence the measure has the same limitation. The IC of a concept is computed as follows.
where |leaves (c) | represents the number of leaf concepts subsumed by c, |subsumers (c) | represents the number of subsumers of c and max_leaves represents the total number of leaf concepts in the ontology. The similarity value is the IC value of the LCS of the concept pair, where the LCS is a bridge concept.
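A minimal sketch of this taxonomy based IC, assuming the commonly cited form IC(c) = −log((|leaves(c)| / |subsumers(c)| + 1) / (max_leaves + 1)). It is further assumed that subsumers(c) includes c itself and that a leaf concept has |leaves(c)| = 0; the function name is hypothetical.

```python
import math

def ic_leaves_subsumers(n_leaves, n_subsumers, max_leaves):
    """Assumed form: -log((|leaves(c)| / |subsumers(c)| + 1) / (max_leaves + 1)).

    n_leaves    : number of leaf concepts subsumed by c (0 for a leaf)
    n_subsumers : number of subsumers of c (assumed to include c itself)
    max_leaves  : total number of leaf concepts in the ontology
    """
    commonness = n_leaves / n_subsumers + 1
    return -math.log(commonness / (max_leaves + 1))
```

Under these assumptions the root, which subsumes all leaves and has a single subsumer, gets an IC of 0, while a leaf gets the maximum value log(max_leaves + 1).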
In Saruladha et al. [55] the two ontologies are connected by a virtual root which links the two roots of the input ontologies. The author has proposed three measures: COSS, RRCOSS and RLCOSS. In COSS, the Tversky’s similarity ratio model is modified for inter similarity computation where the commonality between the features is represented by the IC of the LCS. Similarly RRCOSS is the modified Resnik et al. [46] measure and RLCOSS is the modified Lin et al. [33] measure. In all these three measures two main modifications are proposed. First the LCS of the concept pair across the ontologies is discovered through the virtual root. Second, the IC value of the LCS is computed based on the following definition.
where a and b are concepts belonging to ontologies O1 and O2 respectively, hypo (a) represents the number of hyponyms of concept a, |O1| is the total number of concepts in O1 and min(|O1|, |O2|) represents the smaller of the total numbers of concepts in the two ontologies. Using the above definition, the IC of the LCS is computed and used in the existing Tversky similarity, Resnik and Lin measures.
Feature based measures
In feature based measures, the concepts to be compared are each represented by a set of features. The feature set of a concept consists of information such as neighbourhood (ancestors, descendants and siblings), synonyms, attributes, gloss, functions, etc. These measures overcome a drawback of the path and depth based measures, because they do not rely on the assumption that all taxonomical links represent the same distance and have the same importance in computing similarity. The feature based measures can be used to compute both intra and inter semantic similarity, since they are based on set operations over features which can come from the same or from different ontologies. In contrast, the intra measures based on path, depth and IC cannot be applied across ontologies unless the two ontologies are explicitly linked.
In feature based measures, the similarity of a concept pair is computed from the commonality and difference between the features of the two concepts. This follows the set theory model of similarity proposed by Tversky [62], which considers both the common and the distinct features of the compared concept pair. Tversky’s similarity ratio model is defined as below.
where A and B denote the feature sets of concepts a and b. The function f represents the importance of a feature set and is, in general, the cardinality function. The set intersection (A ∩ B) denotes the features common to both concepts. The set differences (A ∖ B) and (B ∖ A) denote the features distinct to concept a and to concept b respectively. The parameters α and β decide the relative importance of the distinct features of a and b. All feature based measures use Tversky’s ratio model; each measure differs mainly in its definition of what constitutes the features of a concept, as discussed below.
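As an illustration of the ratio model described above, the following minimal Python sketch takes f to be set cardinality and returns the common features divided by the common features plus the weighted distinct features. The function name, the default parameter values, the example feature sets and the choice to return 0 for two empty sets are assumptions made for this sketch, not part of the reviewed measures.

```python
def tversky_similarity(features_a, features_b, alpha=0.5, beta=0.5):
    """Tversky ratio model with set cardinality as the function f."""
    a, b = set(features_a), set(features_b)
    common = len(a & b)   # f(A ∩ B): features shared by both concepts
    only_a = len(a - b)   # f(A \ B): features distinct to concept a
    only_b = len(b - a)   # f(B \ A): features distinct to concept b
    denominator = common + alpha * only_a + beta * only_b
    # Assumption: two empty feature sets are treated as having similarity 0.
    return common / denominator if denominator else 0.0


# Hypothetical feature sets, for illustration only.
car = {"wheels", "engine", "doors", "steering wheel"}
bicycle = {"wheels", "pedals", "handlebar"}
print(tversky_similarity(car, bicycle, alpha=1.0, beta=1.0))
print(tversky_similarity(car, bicycle, alpha=0.5, beta=0.5))
```

With α = β = 1 the ratio reduces to the Jaccard coefficient, and with α = β = 0.5 it reduces to the Dice coefficient, which is why many of the set based measures discussed later can be seen as special cases of this model.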
David Sanchez et al. [51] and Batet et al. [6] proposed two similar intra feature based measures in which the features of a concept c are its ancestors up to the root of the taxonomy, represented as φ(c) in [51] and T(c) in [6]. In [6, 51], the measures are based on the ratio of the number of distinct (non-shared) ancestors of the two concepts to the total number of ancestors of the two concepts. It is formally defined as follows.
The advantage of these measures [6, 51] is their simplicity, since they depend only on the ancestral taxonomical relations of the ontology and have no parameters to be tuned.
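The ratio on which these ancestor based measures rely can be sketched in a few lines of Python. The sketch below computes only the plain ratio of non-shared ancestors to all ancestors; the published formulas in [6, 51] may apply further transformations to this ratio, which are not reproduced here. The toy taxonomy, the function names and the decision to count a concept as a member of its own ancestor set are assumptions made for illustration.

```python
def ancestors(concept, parents):
    """Return the concept together with all of its ancestors up to the root.

    `parents` maps each concept to a list of its direct parents.
    Assumption: the concept itself is counted in its own ancestor set.
    """
    seen = set()
    stack = [concept]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def ancestor_dissimilarity_ratio(a, b, parents):
    """Ratio of non-shared ancestors to all ancestors of the two concepts."""
    anc_a, anc_b = ancestors(a, parents), ancestors(b, parents)
    distinct = len(anc_a ^ anc_b)  # ancestors belonging to exactly one concept
    total = len(anc_a | anc_b)     # all ancestors of either concept
    return distinct / total


# Hypothetical toy taxonomy (child -> direct parents).
toy_parents = {
    "mammal": ["animal"],
    "fish": ["animal"],
    "dog": ["mammal"],
    "cat": ["mammal"],
    "trout": ["fish"],
}
print(ancestor_dissimilarity_ratio("dog", "cat", toy_parents))    # 0.5
print(ancestor_dissimilarity_ratio("dog", "trout", toy_parents))  # 0.8
```

A lower ratio means the two concepts share more of their ancestry, so the sibling pair (dog, cat) scores as less dissimilar than the pair (dog, trout), which shares only the root.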
David Sanchez et al. [50] also proposed two methodologies to find the LCS of a concept pair across multiple biomedical ontologies. As already discussed in Section 2.2, the connectivity between the two ontologies is established here by the set of LCS. The feature set considered consists of the ancestors (subsumers) of the two concepts up to the root of the taxonomy and the children (hyponyms) down to the leaf level. The LCS is discovered based on the semantic overlap and the structural similarity among the possible subsumer (ancestor) pairs of the concept pair to be matched. The semantic overlap (SO) of a subsumer pair for a concept pair is formally defined as follows.
where total_hypo_O1(s_i) represents the complete set of hyponyms, down to the leaf level, of the subsumer s_i in the ontology O1. The structural similarity of a subsumer pair for a concept pair is based on the semantic overlap among the immediate subsumers and hyponyms of the subsumer pair. Once the LCS for a concept pair has been discovered based on these two similarities, existing similarity measures such as Rada et al. [45], Wu & Palmer [64] and Leacock & Chodorow [30] are used to evaluate the similarity value.
LOMPT, an ontology matching system [56], proposed an intra feature based semantic similarity measure for clustering the ontology, i.e. for partitioning the ontology into a set of small sub-ontologies. In this measure, the parents, children and siblings of a concept are considered as its features. The similarity is based on the commonality between the feature sets of concepts a and b, normalized by the total number of features. It is formally defined as follows.
where N(a) represents the immediate parents, children and siblings of the concept a. Although this measure does not consider an exhaustive neighbourhood, its neighbour set is static and the combined number of children and siblings can occasionally be large; hence it is classified under this class.
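The idea of an immediate neighbourhood N(a) and of a commonality normalized by the number of features can be sketched as follows. This is only an illustrative reading of the description above: normalizing by the size of the union of the two neighbourhoods is an assumption, as are the toy taxonomy and all function names, and the sketch is not presented as the exact formula of [56].

```python
def neighbourhood(concept, parents):
    """Immediate parents, children and siblings of a concept.

    `parents` maps each concept to a list of its direct parents.
    """
    direct_parents = set(parents.get(concept, ()))
    children = {c for c, ps in parents.items() if concept in ps}
    siblings = {
        c for c, ps in parents.items()
        if c != concept and direct_parents & set(ps)
    }
    return direct_parents | children | siblings


def neighbourhood_similarity(a, b, parents):
    """Shared neighbours divided by all neighbours of either concept.

    Assumption: the normalization factor is the size of the union.
    """
    n_a, n_b = neighbourhood(a, parents), neighbourhood(b, parents)
    total = len(n_a | n_b)
    return len(n_a & n_b) / total if total else 0.0


# Hypothetical toy taxonomy (child -> direct parents).
toy_taxonomy = {
    "mammal": ["animal"],
    "fish": ["animal"],
    "dog": ["mammal"],
    "cat": ["mammal"],
    "trout": ["fish"],
}
print(neighbourhood_similarity("dog", "cat", toy_taxonomy))
print(neighbourhood_similarity("dog", "trout", toy_taxonomy))
```

Because only immediate neighbours are collected, the pair (dog, trout) shares no features at all in this toy taxonomy even though both concepts descend from the same root.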
Rodriguez & Egenhofer [48] proposed an inter feature based measure which considers three components of a concept for similarity computation: the synset (set of synonyms), the features and the semantic neighbourhood. In this measure, each concept’s parts, attributes and functions are considered as the features of the concept. The semantic neighbourhood of a concept c is defined as the set of concepts which are semantically linked to c within a radius of r. The measure is formally defined as follows.
where x, y and z are weights reflecting the contribution of each component. The information overlap of each component, obtained using the Tversky similarity model, is as follows.
where A and B are the terms belonging to one of the components of the concepts a and b respectively, and the parameter β = (1 - γ). The parameters γ and β, which are computed based on the depth of the concepts, determine the relative importance of the distinct characteristics of each concept. According to [41], the measure does not specify how to fix the values of certain parameters used. Also, the performance results obtained are not encouraging according to Saruladha [54].
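The structure of this measure, a weighted sum of three Tversky style component overlaps with β = 1 - γ, can be sketched as below. In the sketch γ is passed in as a fixed number rather than derived from concept depths, and the component names, weights and example concepts are hypothetical; they are assumptions for illustration rather than values from [48].

```python
def component_overlap(terms_a, terms_b, gamma):
    """Tversky style overlap of one component, with beta = 1 - gamma."""
    a, b = set(terms_a), set(terms_b)
    common = len(a & b)
    denominator = common + gamma * len(a - b) + (1.0 - gamma) * len(b - a)
    # Assumption: a component that is empty for both concepts contributes 0.
    return common / denominator if denominator else 0.0


def weighted_similarity(concept_a, concept_b, weights, gamma):
    """Weighted sum of the overlaps of all components named in `weights`."""
    return sum(
        weight * component_overlap(concept_a[name], concept_b[name], gamma)
        for name, weight in weights.items()
    )


# Hypothetical concepts described by synset, features and neighbourhood.
stadium = {
    "synset": {"stadium", "arena"},
    "features": {"seats", "field"},
    "neighbourhood": {"building"},
}
theatre = {
    "synset": {"arena"},
    "features": {"seats", "stage"},
    "neighbourhood": {"building", "venue"},
}
# Assumed weights x, y, z for the three components; they sum to 1.
weights = {"synset": 0.5, "features": 0.25, "neighbourhood": 0.25}
print(weighted_similarity(stadium, theatre, weights, gamma=0.5))
```

Choosing weights that sum to 1 keeps the result in the range [0, 1], so that a concept compared with itself scores exactly 1.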
Petrakis et al. [41] used the concept’s description and neighbourhood to calculate the similarity. Concept descriptions are taken from the gloss in WordNet and the “scope notes” in MeSH. The formal definition is as follows.
The information overlap of each component, S(a, b), is formulated as the ratio of the amount of information common to a and b to the total amount of information.
Numerous general set theory similarity measures are available, which can be adapted for computing semantic similarity by defining the features of the concepts. These set theory measures can be generalized into two formulae as follows.
where C1 and C2 denote the information of the first and second concept, each represented using a feature set, and Nr denotes the normalization factor, which varies from one measure to another.
Set measures based on the commonality, Equation (44), are Jaccard [27], Dice [13], Ochiai [37], Simpson [60] and Braun-Blanquet [9]. Seung-Seok Choi et al. [59] listed further measures such as Czekanowski, 3w-Jaccard, Nei & Li and Sokal & Sneath-I, which also use the commonality of the features. Seung-Seok Choi et al. [59] also listed feature based dissimilarity measures, which are based on the difference, Equation (45), between the two feature sets. These include Lance & Williams, Hellinger, Cosine, Gilbert & Wells, Ochiai, Forbesi, Fossum, etc.
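To make the role of the normalization factor concrete, the sketch below computes five of the commonality based coefficients named above using their standard textbook definitions: each divides the number of common features by a different Nr. The function name, the dictionary layout, the example feature sets and the convention of returning 0 when Nr is 0 are assumptions made for illustration.

```python
import math


def commonality_measures(c1, c2):
    """Common features divided by a measure-specific normalization factor."""
    c1, c2 = set(c1), set(c2)
    common = len(c1 & c2)
    normalizers = {
        "jaccard": len(c1 | c2),                 # size of the union
        "dice": (len(c1) + len(c2)) / 2,         # arithmetic mean of the sizes
        "ochiai": math.sqrt(len(c1) * len(c2)),  # geometric mean of the sizes
        "simpson": min(len(c1), len(c2)),        # size of the smaller set
        "braun_blanquet": max(len(c1), len(c2)), # size of the larger set
    }
    # Assumption: a zero normalization factor yields a similarity of 0.
    return {name: (common / nr if nr else 0.0) for name, nr in normalizers.items()}


# Hypothetical feature sets, for illustration only.
features_1 = {"wheels", "engine", "doors"}
features_2 = {"wheels", "engine", "pedals", "saddle"}
for name, value in commonality_measures(features_1, features_2).items():
    print(f"{name}: {value:.3f}")
```

Since the numerator is the same in every case, the coefficients differ only in how strongly they penalize a size difference between the two feature sets: Simpson is the most lenient and Jaccard the strictest of the five.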
First, the drawbacks of each class of intra semantic similarity measures are outlined. The shortcoming of the path based measures is that they do not consider much of the information available in the ontology, such as the relative depth/specificity of the concepts and the information shared between concepts. Further, they assume that all the taxonomical links in the hierarchy represent a uniform distance [7, 63], whereas links are more specific or more general depending on the degree of granularity and the amount of information incorporated in the taxonomy.
The drawbacks of corpora IC based measures are the complex and time consuming computation of the IC value and its dependency on the corpus. IC computed from the taxonomy information of the ontology was able to produce more accurate results [51] and eliminated the need for highly complex corpus processing. However, its performance is hampered when it is applied to ontologies with a low branching factor or depth [51]. The IC values obtained from such small or specific ontologies become too homogeneous to differentiate the information conveyed by the concepts.
In depth based measures, the shared information such as neighbours, properties, domain, range, etc. between concepts is not considered for similarity computation which is a significant downside.
Second, the drawbacks of each class of inter semantic similarity measures are outlined. Both path [1] and IC [52] based inter measures consider only concepts that match exactly at the linguistic level as bridge concepts, which is a disadvantage [53]. Methodologies like synonym detection and light weight linguistic matchers can be used to find linguistically non-equivalent concept pairs which could also be considered as bridge concepts. Further, in [55] the two ontologies are connected by a virtual root, which poorly captures the other possible commonalities between the ontologies.
Finally, the disadvantages of the feature based measures are listed. The two main drawbacks are the limited applicability of the measures and the selection of an exhaustive or insufficient neighbourhood as the features of a concept.
Firstly, measures like Rodriguez & Egenhofer [48] and Petrakis et al. [41] can only be applied to ontologies which have specifications like synsets, attributes, parts and functions. However, Swoogle, the ontology search engine [14], shows that ontologies rarely model these semantic features and concentrate more on the taxonomical relations.
Secondly, exhaustive feature sets, such as the ancestors up to the root [51] or the combination of ancestors up to the root and descendants down to the leaves [50], could lead to a time consuming similarity computation. The measure by LOMPT [56] considers only the immediate semantic neighbourhood, which could be exhaustive or insufficient depending on the ontology considered. Hence, similar concept pairs which are taxonomically apart may be evaluated as dissimilar pairs. Also, like path based measures, feature based measures do not consider depth/specificity for similarity computation.
From the drawbacks of these measures, it can be inferred that the majority of the measures are not applicable to all domains and applications. Generally, each measure is designed to work best for a particular application, domain and objective. Further, each measure has prerequisites and parameters to be fixed (if any), and uses particular type(s) of information from the ontologies to find the similarity. Similarly, each application has certain specifications. Hence choosing a semantic similarity measure for a given application is a challenging task, since a thorough understanding of both is mandatory. To choose the best measure, the user should answer the questions in Table 1 (a few of which are listed) with regard to the application and its ontology/ontologies.
Questions for understanding the specification of the application
Based on the answers to these questions, an appropriate measure for the given application can be chosen. For example, if more than one ontology is used (Q-1), then an inter semantic similarity measure should be chosen. If the ontology is described in DL-OWL (Q-2), then a DL based measure will best fit the application. Next, if the application has small ontologies (Q-4), then taxonomy based IC measures should not be selected, and so on. For each application and the available measures, appropriate questions should be formulated, and based on the answers the best measure can be chosen.
If the chosen measure has parameter(s) to be fixed, this is a further challenge, since the parameter value(s) will differ for each application. Manual selection and tuning can be carried out, but the user must have good knowledge of both the application and the measure. The parameters can also be fixed empirically based on observations for the given datasets; however, such parameter values would not be optimal for a different dataset. Hence, there is a future need for automatic selection and tuning of the measure based on the given application and dataset. This could be done by a machine learning algorithm trained to choose and tune the best measure based on the characteristics of the given application and its ontologies.
The semantic similarity measures can be evaluated in two ways. The first way uses three standard benchmark datasets, each consisting of a set of word pairs whose similarity has been judged by human experts. The first benchmark dataset, Rubenstein and Goodenough (R&G) [49], consists of 65 noun pairs from the WordNet (https://wordnet.princeton.edu/wordnet/) ontology, which is a lexical ontology for the English language. The second benchmark dataset, Miller and Charles (M&C) [36], consists of 30 noun pairs from the WordNet ontology. The third benchmark dataset, created by Petrakis et al. [41], contains 49 word pairs from the MeSH (Medical Subject Heading) (MeSH, http://www.nlm.nih.gov/mesh) thesaurus.
The measure to be evaluated should be deployed to assess the similarity of the noun pairs of any or all of these datasets. The similarity values obtained are then correlated with the similarity values assigned by the human experts in the corresponding dataset(s). A Pearson’s correlation coefficient close to 1 indicates that the measure has good accuracy, and a lower value indicates poorer accuracy. The noun pairs of the first and second benchmark datasets are generic English words and the noun pairs of the third dataset are medical terms. Hence these datasets may not be suitable for evaluating measures designed for other domains.
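The benchmark procedure described above amounts to computing Pearson’s correlation coefficient between two equally long lists of scores. A minimal Python sketch follows; the function name is an assumption, and the ratings shown are invented placeholder numbers, not values taken from R&G, M&C or the MeSH benchmark.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson's correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant sequences")
    return covariance / math.sqrt(spread_x * spread_y)


# Invented placeholder scores for five word pairs (not real benchmark data).
human_ratings = [3.9, 3.1, 1.2, 0.4, 2.5]         # judgments of human experts
measure_scores = [0.95, 0.70, 0.30, 0.05, 0.60]   # output of the measure under test
print(pearson_correlation(human_ratings, measure_scores))
```

Because the coefficient is invariant to linear rescaling, the measure’s scores do not need to be on the same scale as the human ratings before they are compared.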
In such scenarios, the second way is to use application based evaluation. This methodology evaluates the measure based on the application for which it is deployed. Among the candidate measures, the one which best meets the objective of the application is chosen. For example, when clustering an ontology, the objective is to reduce the entropy (randomness) of the clusters formed. Hence the measure which yields the lowest entropy should be chosen.
Table 2 shows the correlation coefficients of semantic similarity measures belonging to various classes: path based, depth based, IC based (corpora and taxonomy), hybrid and feature based. For the R&G and M&C benchmarks, the correlation coefficients of the existing measures are taken either from the paper in which the authors reported them or from David Sanchez et al. [51], who collected them from the original papers. For the MeSH benchmark, the results are taken either from the paper in which the authors reported them or from Petrakis et al. [41], who evaluated the measures themselves.
As evident from the results (Table 2), the Cai et al. [12] and IC (taxonomy) measures outperform most of the similarity measures in categories such as path based, depth based, IC (corpora), hybrid and feature based measures. This is due to the precise selection of structural information, such as the number of hyponyms, the concept’s depth and neighbour concepts connected through non IS-A relations, and the correct formulation of similarity measures utilizing this information.
Correlation coefficient of the semantic similarity measures
N/A* (Not Available).
The need for inter and intra semantic similarity measures grows as the semantic web and its heterogeneity increase. To meet this need, numerous measures have been proposed in the literature. Hence, to inform users about the definitions, prerequisites, disadvantages, etc. of these measures, a detailed review of the existing semantic similarity measures is presented in this paper. A new classification scheme considering both inter and intra semantic similarity measures has also been proposed and, under each class, the corresponding measures are discussed in detail. The drawbacks of each class of measures, the open challenges and the evaluation strategies have also been outlined. From this study, it can be concluded that the majority of the measures are designed to suit a particular domain, application or objective. Hence, choosing and tuning a measure for a particular scenario is a challenging task. Further, no standard evaluation dataset and metric is available which could guarantee the accuracy and applicability of a measure across domains, applications or objectives.
Acknowledgments
This work was financially supported by Anna University, Chennai, India under the Anna Centenary Research Fellowship.
