Semantic relatedness maximisation for word sense disambiguation using a hybrid firefly algorithm

Abstract

Word sense disambiguation (WSD) refers to determining the right meaning of a vague word using its context. The WSD intermediately consolidates the performance of final tasks to achieve high accuracy. Mainly, a WSD solution improves the accuracy of text summarisation, information retrieval, and machine translation. This study addresses the WSD by assigning a set of senses to a given text, where the maximum semantic relatedness is obtained. This is achieved by proposing a swarm intelligence method, called firefly algorithm (FA) to find the best possible set of senses. Because of the FA is based on a population of solutions, it explores the problem space more than exploiting it. Hence, we hybridise the FA with a one-point search algorithm to improve its exploitation capacity. Practically, this hybridisation aims to maximise the semantic relatedness of an eligible set of senses. In this study, the semantic relatedness is measured by proposing a glosses-overlapping method enriched by the notion of information content. To evaluate the proposed method, we have conducted intensive experiments with comparisons to the related works based on benchmark datasets. The obtained results showed that our method is comparable if not superior to the related works. Thus, the proposed method can be considered as an efficient solver for the WSD task.

Keywords

Firefly algorithm local search meta-heuristic semantic relatedness word sense disambiguation

1 Introduction

In natural languages, words have multiple interpretations based on their context. Natural language processing (NLP) is a sub-field of artificial intelligence, which provides the task of word sense disambiguation (WSD). WSD concerns the issue of identifying the right sense of an ambiguous word computationally using the available information [3]. For a given text, WSD asks to determine a suitable sense for each non-stopping word based on its context. Thus, it is a computational problem that requires to search the best possible set of senses that have strong semantic relatedness. This semantic value is calculated based on hierarchical relations provided by a lexical database.

In specific, the relations among the senses (or concepts) are stored by a lexical database, called the WordNet [19]. It contains information about verbs, nouns, adverbs and adjectives. WordNet organizes related concepts into hierarchies, which are represented by trees, where the root is a very general concept and the leaves are very specific. Thus, these hierarchies help to measure the similarity between two concepts by identifying the path length that connect them [26]. However, such measure lacks to the knowledge of how frequent these concepts appear in a specific corpora. Thus, more information from a corpora, like the frequency of two concepts, and their nearest root can help to measure the semantic similarity [29]. Also, WordNet provides the gloss of each concept, and thus, methods based on gloss overlap notion can measure the relatedness easily. Particularly, Lesk algorithm determines the relatedness between two concepts by counting the common words in the definitions of these concepts [16]. An extension to Lesk algorithm suggests including the glosses of concepts that have WordNet-based relations with the given concepts to be measured [7]. These semantic relatedness methods are vital to achieve the WSD, where the semantic coherent can be identified.

The WSD plays an intermediate role in various NLP tasks, like the automatic text summarization [25], machine translation [27], and information retrieval [14]. In these applications, the WSD performs the task of identifying the correct sense for each word to obtain accurate results for a final task (e.g. machine translation). For instance, a search engine with a semantic disambiguation model would retrieve highly relative results to the user query. Similarly, a machine translator would assign the suitable meaning when the vague of the words is dropped by a WSD solver. Hence, it is necessary to provide accurate WSD system to get the desired impact on the final tasks. This directs the research flow to develop methods that maximise the semantic coherent of a given text. Specifically, several meta-heuristic were proposed to search for the best set of senses that maximises relatedness of a given text [2 , 38].

Recently, swarm intelligence based methods received great deal of attention for various NLP tasks [15, 24]. Firefly algorithm (FA) is a nature-inspired algorithm that mimics the flashing lights of the fireflies [35]. FA belongs to the swarm-based algorithms, which implies that a set of agents move toward a goal defined by an objective function. In swarm algorithms, there is a communication tactic to control the movement of the swarm toward the goal. In FA, fireflies communicate through the brightness of their lights. Because the light intensity is inversely proportional to the distance, fireflies only can communicate within a limited distance. In analogue to the WSD, the flashing lights of a firefly represent the semantic relatedness of the processed text (the objective function). Hence, brighter fireflies attract others to move toward them, and thus their lights become brighter. Recently, FA has found successful applications in various fields [9 , 39]. Interested readers may refer to the book of Dey for more of the recent applications of the FA and its variants [12].

The efficiency of the FA across various fields has motivated us to investigate its effectiveness for the WSD task. In specific, the FA has significantly improved the accuracy of document classification [15]. This is attributed to the unique feature of the FA, which is the automatic subdivision. This feature enables the population of the FA to subdivide into subsets automatically [35]. Thus, each subset of the population can explore each mode or local optimum. From these modes, the FA can determine the best global solution. This advantage over other algorithms has motivated us to investigate the FA to determine the best set of senses that maximize the relatedness of a given text to achieve the WSD.

This study addresses the WSD by maximising the semantic relatedness of a given text through the FA. To achieve this task, the proposed method invokes a lexical database to provide the number of senses for each word in the given text. Then, we propose an improved semantic relatedness method. This method serves as a fitness function for the proposed FA. FA has the ability to explore the problem space effectively as its searching procedure is based on multiple solutions, which increase the chance of exploring a wide area in the state space. Nevertheless, the FA is less efficient in exploiting the problem space when it is compared to solo-solution-based algorithms [6, 8], and thus, the FA may fall in local optima easily. To overcome this issue, we hybridise the FA with a local search algorithm called the late acceptance hill climbing (LAHC). The LAHC exploits the neighbourhood of a given solution effectively; hence, it aids the FA to balance between exploration and exploitation.

The proposed hybrid FA (HFA) evaluates the solutions during the search process based on a semantic relatedness method is known as the extended Lesk (e-Lesk) algorithm [7]. The e-Lesk algorithm is based on the concept of gloss overlapping of the given senses. Thus, e-Lesk can measure the relatedness for a pair of senses even if their POS are different. However, the gloss overlapping does not consider the hierarchical relations for the senses that belong to the same POS. Thus, we incorporate a semantic similarity method based on the notion of information content (IC) [30]. This method aims to improve the e-Lesk algorithm for pairs of the same POS, i.e., verbs and nouns only. The most important feature is the incorporation of the IC to evaluate the semantic similarity of two concepts. That is the algorithm takes into account the hierarchical relations provided by the WordNet, instead of relying only on the glosses overlapping. This allows the semantic measure to know about the distance between the given two concepts, through their Least Common Subsumer (LCS). Overall, our contributions are as follows:

Providing an effective semantic relatedness method to serve as a fitness function for the searching algorithm.

Proposing an efficient optimization method named hybrid firefly algorithm (HFA) for addressing the WSD.

To prove the efficiency of the proposed methods, intensive experiments have been conducted. Also, we provide comparisons to the related works based on benchmark datasets.

This paper is organized into six sections, including the current one, which gave the introduction of this work. The second section reviews the related works. The third section explains the representation of the WSD as an optimisation problem. The fourth section describes the method proposed in this study, which consists of three subsections. These present the formulation of the FA, hybrid FA, and semantic measures used in this study. The fifth section is dedicated to analysing the components of the proposed method and presenting comparisons against the related works. Finally, section six draws the conclusion of this research.

2 Literature review

In WSD, the main clue to identify the right sense of an ambiguous word is its context in the given text. In most WSD approaches, the length of contexts has a strong impact on the task of the WSD, where a short context may be inadequate to identify the correct sense. In contrast, a very long context may add noise to the features used in the disambiguation process [31]. Thus, some studies triggered to investigate the impact of the context length on the WSD approaches [2, 33]. Also, filtering the context from the noisy words has been investigated using a syntactic parser based on a meta-heuristic algorithm [1].

Meta-heuristic algorithms are used for the WSD task when a bag of words is required to undergo the disambiguation process. The first investigation of such algorithms was proposed by Cowie et al. in 1992 [10]. This was followed by several experimental studies of the meta-heuristic algorithms, especially using the genetic algorithm [2 , 38]. Due to the difficulty of the WSD task, the aforementioned studies either were dedicated to a single part-of-speech (POS) [38] or limited to a specific type of datasets [2]. Recently, other algorithms of the same class were able to scale for multiple datasets and POS [33, 36]. In [33], a particle swarm optimisation was used to maximise the semantic similarity of a given text. Also, a combinatorial PSO is proposed for WSD by Ajeena and Chinmayan [4]. The authors in [4] chose some parts from famous novels to test their method. However, their method was not tested using a WSD benchmark dataset. While in the study of [36], a bat algorithm was employed to find the best set of senses of the processed text. These studies show a successful searching process when a local search algorithm is invoked. Also, the type of local search has a strong impact on the search behaviour of the algorithm. For instance, using the simulated annealing in [33] revealed improved results, especially for the nouns among other POS. The authors attributed the effectiveness of their algorithm to the acceptance criterion of the simulated annealing (SA), which accepts non-improving solutions to take place for the purpose of avoiding local optima.

However, the SA in [33] requires prolonged searching process to allow the system accepting non-improving solutions at higher temperatures. On the other hand, allowing only improved solutions to take place in [36] can lead the system to fall in local optima. In [36], the authors employed hill climbing algorithm (HC) as a local search, which prevents non-improving solutions to take place as the a new point of the search process. In addition, neglecting the exploitation of the state space would delay convergence of the searching algorithm [38]. Also, the WSD system in [38] is limited to nouns only. Thus, the study in [2] attempted to provide a WSD system for all parts of speech. Nevertheless, the semantic measure in [2] is limited to only glosses overlapping of the given concepts without considering their synsets.

In this paper, we present a hybrid firefly algorithm that aims to attain a balance between exploration and exploitation. The attractiveness of the fireflies allows the algorithm to explore wide area of the search space. However, exploiting the search space is obtained by incorporating a local search with an acceptance criterion for non-improving solutions. The solutions obtained by the HFA is evaluated based on gloss overlapping method and information content notion. The proposed HFA considers all parts of speech.

3 Problem description

For a given sentence, WSD seeks a suitable sense for each non-stopping word based on its context. WSD is a computational problem that requires searching the possible set of senses of the given text. For a given text, there are numerous combinations of the senses (see Fig. 1), and thus it is exhaustive to enumerate all of them.

Fig. 1

An example of the sense combinations for the words Bank and Note.

Hence, employing an intelligent searching technique would alleviate the problem of tracking all combinations by emphasising eligible ones only. To illustrate the combinations of senses, assuming two words, “bank” and “note” appeared in one context. Considering only the first six senses (obtained from the WordNet) of each word, then the possible combinations are shown in Fig. 1.

In Fig. 1, each sense from the word “bank” is paired with all senses of the word “note”. Thus, there are 36 tuples of these senses for both words. The L in Fig. 1 refers to a function of measuring the relatedness between every two senses. Each word is concatenated with its POS and the sense number, where the "#" symbol separates them. For instance, bank#n#1 refers to the first sense of the word bank when its POS is a noun. This form is used in the WordNet similarity Java API called WS4J. This API is a re-implementation of the Perl WordNet-Similarity library [22]. Hence, the library measures the relatedness between two senses rather than two words.

Figure 1 shows six senses only for the words “bank” and “note”, yet it should be noted that WordNet provides more than this number of senses for these words. Thus, considering all the senses of the given words will make the task of searching for the best set of senses more complicated. Moreover, the example in Fig. 1 gives only two words which makes it easy to iterate through all the tuples. Nevertheless, the all-word WSD task asks to find suitable senses for all non-stopping words in the given text. Hence, the WSD becomes a combinatorial problem, and thus an approximate method can serve to handle it.

In this study, the proposed method represent the problem numerically, as it searches for the best set of numbers (senses Ids). This set is evaluated based on a well-known relatedness algorithm called extended Lesk (eLesk) [7]. Also, a similarity method is combined with eLesk to refine the obtained relatedness. The refinement of the similarity method is due to the notion of information content (IC) [29]. The IC measures how frequent a sense appears in relation to other words’ senses in a specific corpus. While, a relatedness method only measures the overlap between two senses. Hence, combining relatedness and similarity methods can provide more information. In this study, the set of senses (which are provided by the proposed searching algorithm) are evaluated using the combined methods as follows:

$f (x) = \sum_{i = 1}^{n} \sum_{j = 1}^{w} | C_{i} \cap C_{j} | + IC (LCS (C_{i}, C_{j}))$ (1) where n is the size of a given text excluding stopping words, and w refers to the size of the window of words, which is the context of the ambiguous word. The first part of equation 1 shows the relatedness calculation. While, The second part is for the semantic similarity, where the LCS is the least common subsumer of the concepts C_i and C_j. A detailed description about each of these will be explained in Section 4.3. To explain numerically the solution representation of the WSD, let us consider a sentence of six words, each has five senses as shown in Table 1.

Table 1

A numerical example of the WSD

	Word 1	Word 2	Word 3	Word 4	Word 5	Word 6
Solution 1	3	5	2	1	3	5	84
Solution 2	1	4	3	2	2	5	37
Solution 3	5	1	1	3	4	1	167
Solution 4	2	3	1	1	5	4	91

In Table 1, each number represents the Id of a concept (or sense) of its corresponding word, and a row of these numbers is a solution. For instance, solution 1 implies that concept 3 has been assigned to word 1, concept 5 assigned to word 2, and so on. Then, the semantic relatedness among these concepts is reported in the last column, the higher the better. In Table 1, we have gave four example solutions, as our proposed method process multiple solutions iteratively. These solutions evolve based on the updating rules of the proposed HFA.

For quick reference, we provide a list of notations as shown in Table 2.

Table 2

List of notations and acronyms used in this paper

Notation	Definition
β _i	The attracftiveness of firefly i
γ	The light absorption coefficient
I (x)	The brightness of a firefly at position x
N (x)	The neighbourhood of position x
FA	Firefly Algorithm
HFA	Hybrid Firefly Algorithm
SA	Simulated Annealing
LAHC	Late Acceptance Hill Climbing
WSD	Word Sense Disambiguation
POS	Part of Speech
IC	Information Content
LCS	Least Common Subsumer
WS4J	A Java API WordNet similarity
JWI	A Java API for accessing WordNet

4 Proposed method

This study aims to maximise the relatedness of a given text by exploring the potential senses for each word. To achieve this aim, the designed method is based on stochastic searching algorithms, which are capable of providing good-quality solutions in a reasonable amount of time. These invoke a semantic measurement to evaluate the obtained solutions. These algorithms include a swarm intelligence-based method called Firefly Algorithm (FA). This kind of algorithms is effective in exploring the problem space. However, they are less competitive in exploiting the problem space when it is compared to the meta-heuristic algorithms that use a single solution. Hence, this study attempts to make a balance between exploration and exploitation through hybridising the FA with the Late Acceptance Hill Climbing (LAHC) algorithm. The work-flow of this hybridisation is presented in Fig. 2.

Fig. 2

The work-flow of the proposed method.

In Fig. 2, the first step is to read the data, which is a text composed of paragraphs. Each paragraph is consisted of a set of sentences. These sentences are tokenized to words, where each word lies on a single line of a file. Due to the special structure of this dataset, a Java API called JSemCor is used for interfacing with dataset files. After reading the data, it is required to retrieve the number of senses for each word using the WordNet. To achieve this, we used another Java API known as JWI. The remaining steps are the main searching algorithm which includes the semantic evaluation. These are described in details in the subsequent sections.

4.1 Firefly algorithm (FA)

Inspired by the flashing light of the fireflies, an objective function can be optimised by moving toward the brightest firefly in the swarm [35].

The core steps of the FA are inspired from the three rules of firefly swarms. First, fireflies are attracted to each other regardless their sex because they are unisex. Second, the attractiveness is proportional to their brightness which implies, a firefly of lower flashing light will move towards the brighter one. The brightness and attractiveness of two fireflies decrease when their distance increases. A firefly moves randomly when no one is brighter than it. Third, the landscape of an optimisation function has impact on the brightness of a firefly. Considering the optimisation problem in this study, the brightness will be proportional to the value of the objective function. These three rules can be translated to algorithmic steps, as shown in Algorithm 1.

In the FA, the movement of an agent is based on the attractiveness of others in the swarm. Typically, a firefly’s attractiveness is always determined by its brightness or light intensity, which in turn is mapped to the given fitness function. At a specific position x, a firefly’s brightness I is proportional to the objective function, i.e., I (x) ∝ f (x). Nevertheless, a firefly’s attractiveness β is relative, and thus, other fireflies should assess it. Hence, it varies with respect to the distance r_ij between two fireflies j and i. On the other hand, light intensity is changing inversely in relation to the distance from its source. Also, media absorb fireflies’ light, which leads to variation in the attractiveness with respect to the degree of absorption. Thus, the attractiveness is calculated by considering the light absorption coefficient γ of a given medium and current distance r, as shown in Equation 2. $β (r) = β_{0} e^{- γ r^{2}}$ (2) where β₀ is the attractiveness when r is zero. The Cartesian distance is used to represent the distance between any two fireflies i and j at the positions x_i and x_j, respectively, as shown in Equation 3. $r_{ij} = ∥ xi - xj ∥ = \sqrt{\sum_{k = 1}^{d} (x_{ik} - x_{jk})^{2}}$ (3) where x_ik represents the kth component of the ith firefly’s spatial coordinate x_i.

A firefly i changes its position affected by another more attractive (brighter) firefly j, which can be represented as follows: $x_{i} = xi + β r (xj - xi) + α (rand - 1 / 2)$ (4) where the second term reflects the attraction, and the last part dedicated for diversity, which is controlled by the randomization parameter α, and rand is a random number in the period [0, 1] generated using the uniform distribution.

4.2 Hybrid firefly algorithm (HFA)

The proposed method in this study employs the FA for its exploration’s capacity. The FA is based on the swarm intelligence technique, which suggests deploying a set of agents to explore the problem space effectively. In FA, this can be seen by the attractiveness feature where all fireflies move toward the brightest one. This movement will form the swarm, which moves toward a potential best area in the problem space. However, the FA and other population-based meta-heuristics are less competitive in exploiting the problem space when they are compared to the local search algorithms. These are based on a single solution, and thus a local search algorithm intensifies the neighbourhood of a given solution. This is done by moving the solution to an eligible neighbouring point. Thereafter, the same movement will be performed for the new point to move toward a better one. This movement will be repeated until a predefined stopping criterion is met. The aforementioned process is the general scheme of local search algorithms.

However, the main feature that can distinguish a local search algorithm from others is how the algorithm accepts a new movement. Intuitively, a new point will be accepted since its quality better than the previous one. However, relying only on improving solutions may end up with a low-quality local optimum. A current solution is identified as a local optimum when its neighbouring solutions have lower quality than it [32], and thus the algorithm sticks in this point. Formally, a solution x is a local optimum when it satisfied the condition in Equation 5. $f (x) > f (x') \forall x' \in N (x)$ (5) where N(x) is the neighbourhood of the solution x. The considered problem in this study is to maximisation one, and thus in Equation 5 all neighbours are considered of lower quality than the current one if they have a smaller similarity of a given text.

To avoid local optima, the LAHC allows the searching process to accommodate non-improving solutions, which permits the movement to a new point of the search space. In LAHC, accepting a new solution depends on the history of the searching procedure. In specific, LAHC accepts a new solution if it is better or equal to a solution of previous iterations. In other words, a new move will take place even if it does not improve the current one, but it should be at least as good as it was in a number of iterations ago.

To achieve this, the LAHC uses a list to record the cost of previous solutions. The cost of a new solution is compared to the last element of the list, an then it will be accepted it is better than the compared one. After that, the cost of the newly accepted solution is inserted at the beginning of the list, and the last one is removed. In case of rejecting the new solution, the cost of the current solution is instead inserted. These steps are summarized in Algorithm 2.

Memorising previous information in a list resembles the mechanism of the tabu list in Tabu Search (TS) [13]. However, the list in LAHC is completely different from the one in TS. First, the list of LAHC holds the values of a fitness function which is the overall similarity of the given text. While, the tabu list contains the previous movements to be forbidden in the next cycles. Second, TS compares the candidates of the current solution to the whole list, whilst LAHC compares with only one value from the list. Let’s assume the LAHC list Fa = {f0, f1, f2, …, f_Lfa-1}, and v is a pointer to the fitness to be compared with the new solution which is calculated as follows: $V = I mod Lfa$ (6)

Using a pointer to the target fitness in the LAHC list makes the processing time independent from the length of the list. Hence, a long list can be used to memorise more about the search history without delaying the searching process. Obviously, the Lfa is the only parameter in the LAHC, which makes it easy to set.

Practically, the proposed HFA tackles the WSD problem based on the numerical presentation of the given text, as shown in Section 3. In the FA, a solution (firefly) that holds the higher semantic coherent (attractiveness) is identified as the best solution. The semantic relatedness is calculated from the given solution, where it is composed of a set of senses’ ids that corresponds to a given text (see Fig. 1). Then, all other solutions will move toward the best solution to enhance their similarity. After this movement, the best solution so far is updated if any solution acquired more relatedness than the best one (one firefly became brighter than best one). This process works on exploring various sets of senses of the processed text. However, to make the search focuses on an eligible set of senses, the HFA will invoke the LAHC to intensify the search around a given solution. The LAHC is called within a predefined probability called local search rate (lr) as shown in Algorithm 3. Though only one solution (the best one) from the FA will be processed by the LAHC, other solutions will get updated when they move toward the best solution. LAHC searches the neighbourhood of a given solution using a predefined neighbourhood structure (NS). In this study, the NS swaps two random senses of the given solution. Then the similarity of the new set of senses is calculated. LAHC repeats these steps, including the list of costs explained above, till a predefined number of cycles.

The parameters of the HFA are set experimentally, where the Swarm_size is 40, max_number_of_iterations is 100, α is 20%, and γ is set to 1 based on the original FA study [35]. The local search parameters which are lr, L_fa, and max_cycles are set to 15%, 17000 and 30000, respectively. The HFA is highly sensitive to the list length L_fa of the LAHC, and thus extensive experiments is done to set the value of this parameter (see Fig. 5). On the other hand, the other parameters have negligible impact on the HFA when we experimented it with different values. That is, when we set the Swarm_size to 60, and the max_number_of_iterations to 500, the accuracy only improved by 0.006%, while, we required to make extra 26 thousand times of semantic evaluations. Hence, we followed the recommended values of the FA’s parameters based on its original study in [35].

4.3 Semantic measurement

This study quantifies the relatedness between two concepts based on the notion of gloss overlapping, which is represented by the Lesk algorithm. Given two concepts (or senses) C₁ and C₂, the relatedness score is the overlap in definitions (i.e., words in common) of these concepts. Given two words w1 and w2, where C₁∈ concepts (w1) and C₂∈ concepts (w2), then the semantic relatedness based on Lesk algorithm is calculated as follows:

$ScoreLesk (C_{1}, C_{2}) = | gloss (C_{1}) \cap gloss (C 2) |$ (7) where gloss (C_i) is the set of words in the definition of the concept C_i of the word w_i.

The extended Lesk algorithm (eLesk) goes beyond glosses of a pair of concepts to include related ones [7]. This is enables eLesk to find more overlaps for the concepts being measured, which lead to accurate semantic relatedness. However, eLesk algorithm does not consider the information of words in frequency with respect to lexical relations. Thus, this study incorporates the Resink method to measure the similarity between two concepts.

Resnik method is a measure of the similarity between two concepts based on the notion of information content (IC) [29]. IC is based on the specificity of the concept, i.e., highly specific concepts (e.g., fork and pitch) would reveal a high similarity score. IC is calculated regarding the frequency counts of the concept in a corpus of text. Formally, IC for each concept calculated as the negative log of the concept’s frequency to its root’s frequency, as shown in Equation: $IC (C) = - \log (\frac{Freq (C)}{Freq (root)})$ (8) where C is a concept for which the IC is being measured, Freq (C) is the frequency C in a corpus, and Freq (root) is the frequency of the root in the hierarchy. Freq (root) includes the frequency of all words (of the same POS of C) in the corpus.

Given two concepts, the method of Resnik states that the IC of the least common subsumer (LCS) of the concepts. The LCS of any two concepts is the nearest root that subsumes them. Formally, Equation 9 presents Resnik method mathematically to calculate the similarity between two concepts (C1 and C2) as follow:

$Sim (C_{1}, C_{2}) = IC (LCS (C_{1}, C_{2}))$ (9)

In this study the Resnik similarity enforces the relatedness measured by the eLesk. Resnik method only measures the similarity for the concepts that belong to the verbs or nouns POS because it is based on the WordNet hierarchies. This is due to WordNet isolates the noun hierarchy from the verb hierarchy. Consequently, the LCS of two concepts of different POS cannot be identified. However, this study utilizes eLesk that can measure the relatedness between concepts of different POS.

5 Experimental results

This section provides the metrics and datasets used for the evaluation of the proposed HFA. Thereafter, this section presents the gained results and compares it to the related works.

5.1 Experimental setup

The proposed algorithms in this work have been programmed using Java language. The operating system used in this work is Ubuntu 16.04.6 installed on a Dell machine. This machine has a RAM of 16 GB and the CPU model is Intel(R) Xeon(R) CPU E5-2650 of 2.30 GHz. To read the datasets, a Java API called JSemCor is used to interface with datasets files and retrieve the required data. Also, another Java API called JWI is used to retrieve data from the WordNet database.

5.2 Datasets and metrics of evaluation

To evaluate an algorithm for the WSD task, it is required to know the ambiguous words and their corresponding correct senses. The well-known Semcor dataset provides the detailed tags for each word in the given text. Semcor came from the Brown corpus, where WordNet is used to annotate some of its files to constitute a semantic concordance (which is shorten to Semcor) [20]. The Semcor is composed of set of files which covers more than 234,000 word frequencies, and thus these files could serve to evaluate a WSD system. Commonly, set of files from Secmcor is used as a benchmark for the evaluation, which have been used in previous studies [33, 38]. Particularly, these are nineteen files, which are as follows, br-a01, b13, c0l, d02, e22, r05, g14,h21, j0l, k01, k11, l09, m02, n05, p07, r04, r06, r08, and r09. Also, two more datasets are considered which are Senseval-2, and Senseval-3 [18, 21]. This dataset is formatted in tag-based files, where each sentence is tokenized to set of words. Each word occupies one line as shown in Fig. 3.

Fig. 3

A snippet from the dataset used in this study.

In Fig. 3, each line contains the orthogonal form of the word, sense number, POS, and lemma which is the base of the word. The POS and the lemma save the effort of doing pre-processing operations like stemming and POS tagging. The lemma of word and its POS are used to retrieve the sense count from the WordNet of the targeted word.

Based on these datasets, the accuracy of the proposed method is calculated by counting the number of words correctly tagged to the suitable sense. There are three forms of calculating the accuracy which are the precision, recall, and coverage. The coverage reports the percentage of number of tackled words to the total number of word (See Equation 10). Thus, the coverage does not show the accuracy, yet it shows the coverage of the algorithm. $Coverage = \frac{all answered senses}{total number of senses}$ (10) On the other hand, the Recall and Precision reveals the accuracy of the algorithm as shown in Equations 11 and 12.

$Precision = \frac{tp}{tp + fp}$ (11) $Recall = \frac{tp}{tp + fn}$ (12) where tp and fp is the number of true positives and false positives reported by the algorithm, while tn is the true negative.

In WSD, some methods obtained higher precision than others, but they are inferior to others in term of recall. Hence, taking the average of these metrics is helpful to indicate the accuracy of a method. F-measure is the harmonic average of the precision and recall which is calculated as follows: $F - measure = \frac{2 \times Precision \times Recall}{(Precision + Recall)}$ (13)

5.3 Results analysis

In this study, we have conducted series of experiments based on the aforementioned datasets. These experiment show the effectiveness of each part of the proposed method. The first experiment aims to show the impact of the proposed semantic relatedness method. This consists of examining the searching algorithm with different semantic relatedness methods as shown in Table 3–5. Second experiment concerns the effectiveness of the LAHC when different values used for its parameter L_fa as shown in Fig. 5. Third experiment is dedicated to show impact of the LAHC on the FA (see Fig. 4). Finally, comparisons to related methods are presented in Fig. 6. In these experiments, the window of words is set to eleven words, which has been successfully employed in previous studies [1, 33]. On the other hand, selecting a window of more than eleven may contribute negatively by including unwanted attributes [17].

Table 3
HFA’s results based on Resnik semantic similarity method

Dataset POS Coverage Precision Recall F-measure

Semcor Noun 100% 71.69% 71.69% 71.69%

Verb 77.39% 49.10% 38.05% 42.87%

Senseval-2 Noun 100% 69.21% 69.21% 69.21%

Verb 93.25% 48.21% 44.73% 46.40%

Senseval-3 Noun 100% 68.74% 68.74% 68.74%

Verb 74.84% 57.36% 43.11% 49.22%

Dataset	POS	Coverage	Precision	Recall	F-measure
Semcor	Noun	100%	71.69%	71.69%	71.69%
	Verb	77.39%	49.10%	38.05%	42.87%
Senseval-2	Noun	100%	69.21%	69.21%	69.21%
	Verb	93.25%	48.21%	44.73%	46.40%
Senseval-3	Noun	100%	68.74%	68.74%	68.74%
	Verb	74.84%	57.36%	43.11%	49.22%

Table 4

HFA’s results using eLesk method based on Semcor dataset

POS	Coverage	Precision	Recall	F-measure
Noun	100%	71.13%	71.13%	71.13%
Verb	77.39%	48.61%	39.88%	43.81%
Adjective	100%	73.36%	73.36%	73.36%
Adverb	100%	63.84%	63.84%	63.84%
All	94.34%	66.92%	62.25%	64.50%

Table 5

HFA’s results using a combination of eLesk and Resnik methods

Dataset	POS	Coverage	Precision	Recall	F-measure
Semcor	Noun	100%	72.48%	72.48%	72.48%
	Verb	77.39%	49.53%	40.16%	44.35%
	Adjective	100%	73.36%	73.36%	73.36%
	Adverb	100%	63.84%	63.84%	63.84%
	All	94.34%	68.21%	65.10%	66.61%
Senseval-2	Noun	100%	70.19%	70.19%	70.19%
	Verb	93.25%	48.52%	40.11%	43.91%
	Adjective	100%	74.20%	74.20%	74.20%
	Adverb	100%	69.41%	69.41%	69.41%
	All	98.31%	66.92%	64.25%	65.55%
Senseval-3	Noun	100%	70.08%	70.08%	70.08%
	Verb	77.84%	57.64%	43.39%	49.51%
	Adjective	100%	73.36%	73.36%	73.36%
	Adverb	100%	68.91%	68.91%	68.91%
All	94.46%	67.14%	62.74%	64.86%

Table 3 shows the results obtained using the proposed HFA, where Resnik method is employed to measure the similarity. The Resnik method is experimented on all types of datasets used in this study. In these experiments the proposed HFA tackled two classes of the POS, which are the nouns and verbs. Other POS’s classes were not experimented because the similarity method used in these experiments are limited to nouns and verbs only. Based on Semcor, the proposed HFA yielded a precision of 71.69% for nouns, which is equivalent to the recall and F-measure as all nouns are covered (i.e., the coverage is 100%). While, it achieved more than 77% and 38% for verb’s precision and recall, respectively. Due to the incomplete coverage for verb POS, the F-measure can used to identify the average of the precision and recall, which is 42.87%. On the other hand, for senseval-2 and senseval-3 the proposed HFA achieved a slightly lower F-measure in comparison to semcor four noun POS. While, it was able to cover more verbs in senseval-2 than semcor, and senseval, i.e., its verb’s coverage is more than 93%. As the Resnik algorithm is unable to measure the similarity for all POS’s classes, the proposed method in this research utilized the extended Lesk beside the Resnik algorithm.

To examine the eLesk capacity, we first experimented the HFA using eLesk only as shown in Table 4. Thereafter, the eLesk and Resnik methods are combined to serve as the objective function for the HFA (see Table 4).

Table 4 shows the accuracy of the HFA when eLesk algorithm is used for evaluating the relatedness between two given concepts. Using eLesk, the HFA was able to cover 94.34% of the given text with 64.50% for the F-measure metric. The eLesk was slightly inaccurate in comparison to Resnik algorithm when noun and verb POS are considered. However, the eLesk algorithm compensate the shortage of the Resnik algorithm, where the latter is limited to nouns and verbs only. The experimental results of combining these algorithms are given in Table 5.

Fig. 4

Experimental results of the hybrid firefly algorithm versus its basic form and the LAHC based on the F-measure metric.

Fig. 5

The impact of the L_fa on the accuracy of the HFA for the br-a01 instance from the Semcor dataset.

Fig. 6

The HFA’s F-measure against the state-of-the-art meta-heuristic-based WSD algorithms for all POS classes.

The combination of the semantic measures in Table 5 shows improvement over the individual ones in Tables 4 and 3. This is can be noted when we compare the results in Table 5 to these in Tables 3 and 4. The gain of the combined measure can be observed for nouns and verbs as the Resnik method can operate on them. In specific, for the Semcor dataset, the combined measure achieved 72.35% and 44.35% in term of F-measure for nouns and verbs, respectively. While, when only eLesk was used the HFA obtained an F-measure of 71.31% and 43.81% for same POS classes. On the other hand, the combined measure yielded an F-measure of 70.19%, and 43.91% for nouns and verbs using the Senseval-2 dataset. This is more accurate than using Resnik method individually, where it has obtained under the same setting 69.21% and 46.40% for the F-measure of nouns and verbs. Also, the combined measure gained about 2% improvement over the Resnik method for nouns when the Senseval-3 is considered.

The aforementioned experiments have showed the performance of th proposed HFA using different types of the semantic measures. However, the capacity of the HFA is still not detailed. Hence, to show the impact of the LAHC on the HFA, we conducted an experiment by running each algorithm separately (see Fig. 4).

Figure 4 displays the accuracy in term of F-measure of the search methods used in this study. The proposed method in this study used two meta-heuristic algorithms which is the FA and the LAHC. The FA emphasis exploring the problem space, while the LAHC focuses on exploiting it. Figure 4 shows that the FA can obtain better accuracy than the LAHC for all POS’s classes. Yet, achieving the balance between exploration and exploitation features is vital to obtain the best possible estimation for the optimal solution [11, 34]. Thus, incorporating LAHC to perform local searching helped the HFA to exploit the problem space better than the FA individually as shown in Fig. 4. The experiment presented in Fig. 4 is done after tuning the length of the LAHC list. The tuning of this parameter consisted of extensive experiments (around 300 runs), which investigates range of values for this parameter as shown in Fig. 5. This experiment showed that setting the L_fa to 17000 entries is a proper value for the WSD task.

In comparison to the related works, Fig. 6 gives the accuracy of the HFA against the related disambiguation algorithms.

The comparison in Fig. 6 is based on the three datasets. This comparison includes five methods based on meta-heuristic algorithms. For the Semcor dataset, the proposed HFA outperformed other methods for all metrics, i.e., precision, recall, and F-measure. While for the Senseval-2, the proposed HFA was slightly less accurate than the PSO for the recall and F-measure. Yet, for the same dataset, the HFA could achieve better precision than other at the expense of its coverage. On the other hand, the HFA obtained better F-measure than the PSO and other methods when the Senseval-3 was considered.

The experimental outcome shows that the proposed HFA achieved best results with respect to the solution quality in comparison to the other five methods. The proposed semantic measure has the vital role to accurately evaluate the HFA solutions. Also, the LAHC used in the HFA enabled the algorithm to effectively avoid low-quality local optima as shown in the experiment presented in Fig. 4.

The accuracy of the proposed HFA relies mainly on its objective function, which is the semantic measurement. In this study, our proposed semantic relatedness has outperformed conventional methods to achieve the WSD as shown in Tables 3, 4, and 5. That is, the overall performance of the HFA is highly impacted by the semantic relatedness. Thus, obtaining accurate semantic measurement can lead to further improvement in the searching algorithm. Also, the proposed FA has shown considerable improvement when it was hybridized with the local search.

5.4 Complexity of the HFA

The FA is easy to implement due to its simplicity in term of complexity as it is for the most meta-heuristic algorithms. The FA’s complexity is $O (n^{2} t)$ , where the algorithm has a nested loop. The inner one iterates through the swarm, and outer loop for the maximum number of cycles [35]. In this study, n (i.e., the swarm size) is set to 40, and the t is set to 100 for the maximum number of cycles. That is, the FA has a linear complexity in term of t.

Practically, the first step in the FA is to initialize the positions of the swarm members as shown in Fig. 2. Also, this steps requires sorting these members according to their attractiveness. Thus, the complexity of this step is as follows: $O (n \times w + n \times log (n))$ where w is the window size of the given text.

In this study, the HFA consists of the FA and LAHC. The LAHC is invoked within a predefined probability lr. The complexity of the LAHC is $O (n)$ , and thus the complexity of the HFA is $O ((n^{2} + n \times lr) \times t)$ . That is, the complexity of the HFA including the step of initializing the positions of the swarm is as follows:

$O (n \times w + n \times log (n)) + (t (n^{2} + n \times lr))$

6 Conclusion

This work presented a computational method to perform the task of word sense disambiguation. This method has gained improved results when it is tested on three distinct benchmark datasets. The success of the proposed method is attributed to the developed optimization algorithm and its objective function. Based on the experimental results, the role of the proposed LAHC as a local search has enabled the HFA to obtain accurate results. That is, the LAHC algorithm consolidates the intensification process of the FA. However, the LAHC is sensitive to its main parameter, which is the length of the history list. Thus, we have tuned this parameter to obtain the best possible set of senses. For the semantic measure, the developed method in this work has been proved to be more efficient than conventional methods. This is due to exploiting the information content notion beside the gloss overlapping. Specifically, the knowledge obtained from the WordNet hierarchies has added more information to the idea of calculating the commonalities between glosses of two concepts. On the other hand, the proposed method would not be able to measure the semantic relatedness for two words outside the WordNet. Thus, investigating other algorithms that can scale for informally written text will be beneficial for future works.

References

Abed

S.A.

, Tiun

and Omar

, Harmony search algorithm for word sense disambiguation, PloS one 10(9):e0136614, 2015.

Abed

S.A.

, Tiun

and Omar

, Word sense disambiguation in evolutionary manner, Connection Science 28(3) (2016), 226–241.

Agirre

, Edmonds

, Word sense disambiguation: Algorithms and applications, volume 33. Springer Science & Business Media, 2007.

Ajeena Beegom

and Chinmayan

, Solving word sense disambiguation problem using combinatorial pso, Journal of Intelligent & Fuzzy Systems 38(5) (2020), 6193–6200.

Alsaeedan

, Menai

M.E.B.

, A self-adaptive genetic algorithm for the word sense disambiguation problem, In Ali

, Kwon

Y.S.

, Lee

C.-H.

, Kim

, and Kim

, editors, Current Approaches in Applied Artificial Intelligence, pages 581–590, Cham, 2015. Springer International Publishing.

Altabeeb

A.M.

, Mohsen

A.M.

and Ghallab

, An improved hybrid firefly algorithm for capacitated vehicle routing problem, Applied Soft Computing 84 (2019), 105728.

Banerjee

and Pedersen

, An adapted lesk algorithm for word sense disambiguation using wordnet, In International conference on intelligent text processing and computational linguistics, pages 136–145. Springer, 2002.

Beigi

A.M.

and Maroosi

, Parameter identification for solar cells and module using a hybrid firefly and pattern search algorithms, Solar Energy 171 (2018), 435–446.

Chuan Wang

, Xu

, Wing Chau

and Mei Xu

, Yinyang firefly algorithm based on dimensionally cauchy mutation, Expert Systems with Applications 150 (2020), 113216.

10.

Cowie

, Guthrie

and Guthrie

, Lexical disambiguation using simulated annealing, In Proceedings of the Workshop on Speech and Natural Language, HLT ’91, page 238–242, USA, 1992. Association for Computational Linguistics.

11.

Cuevas

, Diaz

and Camarena

, Experimental Analysis Between Exploration and Exploitation, pages 249–269. Springer International Publishing, Cham, 2021.

12.

Dey

, Applications of firefly algorithm and its variants, Springer, 2020.

13.

Glover

, Tabu search—part i, ORSA Journal on Computing 1(3) (1989), 190–206.

14.

Hristea

and Colhon

, The long road from performing word sense disambiguation to successfully using it in information retrieval: An overview of the unsupervised approach, Computational Intelligence 36(3) (2020), 1026–1062.

15.

Larabi Marie-Sainte

and Alalyani

, Firefly algorithm based feature selection for arabic text classification, Journal of King Saud University - Computer and Information Sciences 32(3) (2020), 320–328.

16.

Lesk

, Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone, In Proceedings of the 5th annual international conference on Systems documentation, pages 24–26, 1986.

17.

, Huang

and Zhu

, Feature words selection for knowledge-based word sense disambiguation with syntactic parsing, Przeglad Elektrotechniczny 88(1b) (2012), 82–87.

18.

Mihalcea

, Chklovski

and Kilgarriff

, The senseval-3 english lexical sample task, In Proceedings of SENSEVAL-3, the third international workshop on the evaluation of systems for the semantic analysis of text, pages 25–28, 2004.

19.

Miller

G.A.

, WordNet: An electronic lexical database, MIT press, 1998.

20.

Miller

G.A.

, Leacock

, Tengi

and Bunker

R.T.

, A semantic concordance, In HUMAN LANGUAGE TECHNOLOGY: Proceedings of a Workshop Held at Plainsboro, New Jersey, March 21–24, 1993, 1993.

21.

Palmer

, Fellbaum

, Cotton

, Delfs

and Dang

H.T.

, English tasks: All-words and verb lexical sample, In Proceedings of SENSEVAL-2 Second International Workshop on Evaluating Word Sense Disambiguation Systems, pages 21–24, 2001.

22.

Pedersen

, Patwardhan

, Michelizzi

, et al., Wordnet:: Similarity-measuring the relatedness of concepts, In AAAI 4 (2004), 25–29.

23.

Peng

, Zhu

, Deng

and Wu

, Enhancing firefly algorithm with courtship learning, Information Sciences 543:18–42, 2021.

24.

Peška

, Tashu

T.M.

and Horváth

, Swarm intelligence techniques in recommender systems - a review of recent research, Swarm and Evolutionary Computation 48 (2019), 201–219.

25.

Plaza

and Diaz

, Using semantic graphs and word sense disambiguation techniques to improve text summarization, Procesamiento del Lenguaje Natural (47) (2011), 97–105.

26.

Rada

, Mili

, Bicknell

and Blettner

, Development and application of a metric on semantic nets, IEEE Transactions on Systems, Man and Cybernetics 19(1) (1989), 17–30.

27.

Raganato

, Scherrer

and Tiedemann

, An evaluation benchmark for testing the word sense disambiguation capabilities of machine translation systems, In Proceedings of the 12th Language Resources and Evaluation Conference pages 3668–3675, Marseille, France, May 2020. European Language Resources Association.

28.

Ren

, Luo

, Zhao

and Huang

, Agricultural information resource scheduling algorithm based on firefly algorithm in cloud computing, Journal of Intelligent & Fuzzy Systems 38(6) (2020), 7437–7448.

29.

Resnik

, Using information content to evaluate semantic similarity in a taxonomy, arXiv preprint cmp-lg/9511007, 1995.

30.

Resnik

, Semantic similarity in a taxonomy: An informationbased measure and its application to problems of ambiguity in natural language, Journal of Artificial Intelligence Research 11 (1999), 95–130.

31.

Rosso

, Masulli

and Buscaldi

, Word sense disambiguation combining conceptual distance, frequency and gloss, In International Conference on Natural Language Processing and Knowledge Engineering, 2003, Proceedings, 2003, pages 120–125, 2003.

32.

Voudouris

and Tsang

, Guided local search and its application to the traveling salesman problem, European Journal of Operational Research 113(2) (1999), 469–499.

33.

Wafaa

A.-S.

, Tiun

, Ahmed

A.-S.

, Awang

and Al-Khaleefa

, Word sense disambiguation using hybrid swarm intelligence approach, PloS one 13(12) (2018), e0208695.

34.

Wilson

R.C.

, Bonawitz

, Costa

V.D.

and Ebitz

R.B.

, Balancing exploration and exploitation with information and randomization, Current Opinion in Behavioral Sciences 38 (2021), 49–56.

35.

Yang

X.-S.

, Firefly algorithms for multimodal optimization, In Watanabe

and Zeugmann

, editors, Stochastic Algorithms: Foundations and Applications, pages 169–178, Berlin, Heidelberg, 2009. Springer Berlin Heidelberg.

36.

Ying

and Hamad

A.H.

, A hybrid bat algorithm based on combined semantic measures for word sense disambiguation, In Liu

, Wang

, Zhao

and Yu

, editors, Advances in Natural Computation, Fuzzy Systems and Knowledge Discovery, pages 149–157, Cham, 2020. Springer International Publishing.

37.

, Zuo

, Fan

, Liu

and Pei

, An improved firefly algorithm based on personalized step strategy, Computing, pages 1–14, 2021.

38.

Zhang

, Zhou

and Martin

, Genetic word sense disambiguation algorithm, In 2008 Second International Symposium on Intelligent Information Technology Applicationvolume 1, pages 123–127, Dec 2008.

39.

Zhu

, Peng

, Leng

, Deng

and Wu

, Surrogateassisted firefly algorithm for breast cancer detection, J Intell Fuzzy Syst 40(5) (2021), 8915–8926.