Applying various distance functions and feature extraction schemes to ambiguity resolution

Abstract

. Word Sense Disambiguation, which is one of the most challenging problems in the process of machine translation, can be considered as a classification problem. In this paper, we use K-Nearest-Neighbor, as one of the most popular classification methods, as well as some knowledge based resources in order to design a WSD scheme. The success of K-Nearest-Neighbor is tightly dependent on two factors; the features used to represent the context in which an ambiguous word occurs and the distance/similarity measure used for comparison of text vectors. Hence, in the present study, we focus on these two matters. For the first purpose, we extract three sets of features; syntactic features, lexical features and semantic features. In order to produce enriched and useful corpora, we apply preprocessed steps. In this work, we carry out a feature selection process as well as a feature weighting policy in order to fine-tune the classifier. For the second purpose, we try several distance/similarity metrics (rather than one metric) in order to find the most proper one. We also assign and use feature weights and propose a weighted formula for every metric. Moreover, to show that the proposed schemes are not language-dependent, we apply the suggested schemes to two sets of data; English and Persian corpora. The evaluation results, with regards to the feature selection and feature weighting strategies, show that the semantic and syntactic features have a significant effect on the classification ability of the system. The results are also encouraging compared to state of the art.

Keywords

Machine translation word sense disambiguation K-Nearest-Neighbor similarity metric distance metric feature weighting feature selection

1. Introduction

We all have frequently referred to a dictionary to look up the translation of a word and found that it bears several definitions. Human language is abundant with such so called “Ambiguous words”. The right meaning of those words are determined by the context. The context of an ambiguous word is the word(s), phrase(s) and sentence(s) surrounding it. Here we encounter an attractive and difficult process called Word Sense Disambiguation (WSD) which would automatically recognize the meaning of the ambiguous word based on the context bearing it.

This hard and interesting process is an important technique in many NLP (Natural Language Processing) applications such as machine translation, information retrieval, text mining and so on.

WSD is important in many fields specially the subcategories of Computational Linguistics and we can say advancement in this regard would improve many fields including speech recognition and information extraction and the like.

One of the most important applications of WSD is in Machine Translation in which a text shall be automatically translated from one language into another. For instance, consider an English to Farsi Machine Translator which is going to translate the word “paper” while translating a sentence. The word “paper” could be translated as “Ø±ÙˆØ†Ø§Ù…Ù‡” (newspaper), “Ù…Ù‚Ø§Ù„Ù‡” article or “Ú©Ø§Ø°Ø°” letter. WSD can improve this process.

WSD is also required in Information Retrieval (IR) systems. In IR systems, ambiguity arises in some queries. For example, given the query “depression”, it may be unclear for the system whether to return documents about illness, weather systems, or economics.

WSD methods based on supervised approaches normally yield acceptable results. But the problem with these methods is lack of sense-annotated data which is big enough. In this article we used a supervised approach as the system’s main engine to resolve the ambiguity and we benefited from several knowledge resources to resolve the data shortage as well.

We used K-Nearest-Neighbor (KNN) Algorithm which is one of the well-known and credible supervised approaches to solve WSD matter.

We believe that two factors guarantee the success of KNN: the type and the quality of features used to represent the context in which an ambiguous word occurs and the type measure used for comparison of text vectors.

For this reason,at the first step, our approach extracts three sets of features: the lexical features, the syntactic features and the semantic features. In fact we tried to add most features that include lexical, syntactic, and semantic levels of language to our feature sets as influential factors. Another feature influencing the success of KNN is the method of comparing the input sentence with former input sentences to choose the meaning of ambiguous word.

In order to obtain the most suitable measure for comparison between the new sentence which houses the ambiguous word and existing sentences using KNN algorithm, we examine the various distance/similarity measures.

It should be noted that in order to improve the accuracy of the WSD method, a feature selection process for lexical and syntactic features and some weighting schemes will be proposed and discussed.

One of the other points considered in this article is using two different languages of English and Persian.

Persian language (also known as Farsi or Parsi) is spoken in Iran, Afghanistan, Tajikistan and Uzbekistan. This language has also had a significant effect on neighboring languages, especially Armenian, Turkic and Georgian languages. There are only few efforts to develop natural language processing resources, tools and researches despite of Persian language potential and high number of its speakers in the world. So we aim to do this language a little favor and also we intend to apply the suggested schemes to two sets of data; English and Persian corpora to show that the proposed schemes are not language-dependent.

WSD methods are usually tested in two different ways. When certain words in the text are being disambiguated, which is called Sample Word WSD, and when all the words in the text are disambiguated, which is referred to as All Words WSD. The first way is highly useful for systems using Supervised Approaches. As this article presents Supervised Approach, we used the first case for our experiments.

This paper is structured as follows. Section 2 is devoted to the introduction of related literature. Section 3 illustrates the feature extraction process. Section 4 presents feature selection process. Section 5 introduces the corpus and also illustrates the preprocessing steps for our goals. Section 6 describes the combination of the proposed system and also presents a variety of distance/similarity measures. Experimental settings and results are presented in Section 7. Finally, Section 8 concludes the paper.

2. Related work

Several methods have already been proposed to be applied in WSD problem. WSD methods can be divided into supervised, unsupervised and hybrid scheme sub categories. The first category includes methods which are based on supervised learning. These methods use classification systems to determine the correct translation of ambiguous words. The second category includes methods that use unsupervised learning. Text clustering is the main learning process used by the methods included in this category. Generally, supervised methods compared to unsupervised once produce better results. There is also another category of disambiguation methods which propose a combination of supervised and unsupervised learning.

From another perspective, WSD methods can be categorize into two main knowledge-based and corpora-based approaches [1, 2, 3]. Knowledge-based methods mainly rely on external knowledge base, such as machine readable dictionaries, large lexical semantic resources, ontologies, thesauri, etc., for disambiguation. However, the corpora-based approaches do not make use of any of these resources.

Supervised approach has been applied for WSD problem frequently. They use data mining and machine learning techniques to train a classifier sense-annotated data. Some of these approaches rely on Decision List [4, 5], Naïve Bayesian [6, 8], Nearest Neighbor [9], Boosting [10], Winnow [11], Transformation Based Learning [12], Naïve Bayesian Ensemble [13] and Support Vector Machine [14].

Pedersen, has presented a corpus-based approach to word sense disambiguation that builds an ensemble of Naive Bayesian classifiers. Each Bayesian classifier is based on lexical features that represent co-occurring words in varying sized windows of context [13].

The system presented by Ng and Lee in [9] is based on the Nearest Neighbor method. The prototypes are the instances of the ambiguous word in the training corpus, each containing the following features: singular/plural; POS tags of the current word; three words on either side; support for verbs, which have a different verbal morphological feature; a verb–object syntactic feature for nouns; and nine local collection features. Note that the local collection features are the words that are frequently adjacent to the ambiguous word. These features are calculated for each instance of w in the sense-tagged training data, where w is an ambiguous word. The results are stored as exemplars of their senses. By calculating the same feature vector for the current word and comparing by all the examples of that word, the given word is disambiguated choosing the closest matching instance. The accuracy of the system on test sets from the Brown corpus and the WSJ corpus was reported to be 58% and 75.2%, respectively. The results were calculated on a task including 121 nouns and 70 verbs, using fine-grained sense distinctions from WordNet.

Dandala et al. used parallel corpora that help map the sense transfer from the initial language to the final language. Word Sense Disambiguation uses Wikipedia as a source of sense annotations. Through experiments on four different languages, they showed that the Wikipedia-based sense annotations are reliable and can be used to construct accurate sense classifiers [15].

In addition to supervised approaches, unsupervised approaches and their combinations have also been proposed for word sense disambiguation. For example, an unsupervised learning method using the Expectation-Maximization (EM) algorithm for text classification problems was proposed by Nigam and McCallum in [16], which then was improved by Shinnou and Sasaki [17], in order to apply it to the ambiguity resolution problem. The approaches presented in [18, 19, 20, 21] are other examples of unsupervised learning methods. Navigli et al., built a sub-graph of the entire lexicon containing vertices useful for disambiguation and then used graph connectivity measures to determine the most appropriate senses [22]. Chaplot et al. [23] proposed a graph-based unsupervised to solve WSD. Using sense dependency and selective dependency, they concluded that the sense of a word depends on senses of few other words in the sentence. A combined supervised and unsupervised lexical knowledge methods for word sense disambiguation is proposed by Agirre et al. [24]. Yarowsky applied rule-learning and neural network approaches respectively to the task of WSD [25].

There are many other WSD methods which require a bilingual corpus. This problem is more challenging for some language pairs such as Persian, since a sufficiently large bilingual corpus can hardly be found [26, 27, 28]. Sarrafzadeh et al. try to overcome the lack of knowledge resources and NLP tools for the Persian language. They propose a cross lingual approach to tagging the word senses in Persian texts. The method makes use of English sense disambiguators, the Wikipedia articles in both English and Persian, and a newly developed lexical ontology named FarsNet which has a similar structure to WordNet [29].

There are some approaches that takes advantage from multiple languages simultaneously to solve WSD [30, 31, 32]. The method presented by Navigli and Ponzetto was based on a large multilingual encyclopedic dictionary, called BabelNet, that was built from WordNet and Wikipedia. Their approach used the graph structure of “BabelNet” to identify complementary sense evidence from translations in different languages.

The method proposed by Dagan and Itai [33] chooses the most probable sense of a word using frequencies of the related word combinations in a target language corpus. In this method, first of all, the system identifies syntactic relations between words using a source language parser and maps those relations to several possibilities in the target corpus using a bilingual lexicon. Two evaluations were carried out for this method, one using Hebrew sentences and the other using German sentences. The accuracy of the system was 91% and 78% for Hebrew and German sentences, respectively.

There are a lot of WSD methods which use heuristic algorithms. Schwab et al proposed the use of Ant Colony Optimization to solve WSD. This approach uses a variant of the original Lesk algorithm where glosses are expanded using the relations in WordNet. Each word is represented as a component in a vector. The algorithm starts with a graph. The graph is built based on the text structure where each word is linked to its predecessor sentence, and each sentence is linked to its predecessor paragraph [34]. Abualhaija et al introduced D-Bees as a knowledge-based unsupervised method for solving WSD problem. D-Bees has been inspired by bee colony optimization where artificial bee agents collaborate to solve the problem [35].

3. Feature extraction process

As described before, WSD refers to the process of automatically identifying the correct meaning of an ambiguous word based on the context in which it occurs. Therefore what is important here is the context of ambiguous word presented in the form of Feature in this study. We believe that in addition to the words that appear beside the ambiguous word, we’d better to extract the words that have a special syntactic relation with the ambiguous word as well as the words that are not in the context of ambiguous word but could be relied on semantically, as features to improve the disambiguation process.

So we extract three set of features; lexical features, syntactic features and semantic features. Then, we employ several sub features for each set.

The lexical features are the set of words which exist in the context of the ambiguous word, the syntactic features are part of speech, syntactic relation and head of phrases, and finally, the semantic features are synonym, antonym, same lemma, hyponymy and hypernymy of words which exist in the context.

3.1 Syntactic features

3.1.1 Part of speech

When we consider POS as a feature, this procedure can be useful in two cases and help us throughout the word sense disambiguation process. In the first case, a word belongs to different POS categories with different senses. For example, consider the following sentences:

1)
He turned off the light. (noun/lamp)
2)
Chandeliers light the house. (verb/make brighter)
3)
Magnesium is a light metal. (adjective/not heavy)

The word light, has different syntactic roles in each sentence and as its POS changes, its sense also becomes different. Similarly in Persian language, the meaning of the word “Ù…Ø±Ø¯” as a noun is “man”, and its meaning as a verb is “died”. Undoubtedly, in this case, the POS of the ambiguous word determines the correct meaning of the target word.

The second case occurs when a word has more than one sense, while its different senses fall under the same POS category. For instance, the word paper, when falls into the noun role, has more than one sense such as newspaper, composition and scholarly article. Similarly, in Persian language the word “Ø´ÛŒØ±”, which is a noun, has more than one sense, i.e., lion and milk. In this case, the POS of words around the ambiguous word can be a useful feature for the WSD model.

Hence, in this work, we considered the POS of the target word as well as the POS of two words before and two word after the ambiguous word.
3.1.2 Syntactic relation

In this study, syntactic relation is the important relation of an ambiguous word with other words in a sentence based on their POS and syntactic roles. We believe these relations if don’t clearly determine the correct meaning of target word, at least can help us reach this goal. For example when the target word is subject of a sentence its verb can guide the system to ambiguity resolution. In the following sentence this claim has been clearly shown.

Black bass can swim very fast in short distances.

Bass is the subject and also is the target word in this sentence and swim is its verb. As is obvious, swim can uniquely determine that the right sense of bass is a type of fish. Also consider the following Persian sentence:

Nozad shirr ra noushid. (Ù†ÙˆØØ§Ø¯ Ø´ÛŒØ± Ø±Ø§ Ù†ÙˆØ´ÛŒØ¯)

In this sentence the ambiguous word is shirand its role is object. On the other hand, the verb is noushid. Similarly, this verb is clearly related to the milk sense of shir. Hence, we extracted syntactic relations as shown in Table 1 and considered them as a set of features.

Table 1
Syntactic relation of an ambiguous word with other words in a sentence

Relation
Verb $+$ object
Subject $+$ verb
Adj $+$ Noun
Noun $+$ Noun

3.1.3 Head of phrase

A phrase is a small sequence of words that forms a meaningful unit within a sentence. Most phrases contain a central word that identifies the type of the phrase known as the head. The Part of speech of the head is used to determine the category of the phrase, for example, a phrase whose head is a noun is called as noun phrase. “Black sea bass” is a noun phrase which bass as head is noun.

Fig. 1.

A high-level view of the proposed WSD system.

We believe the head word of a phrase that includes the ambiguous word, if does not directly determine the correct meaning of the ambiguous word, can be a helpful factor to this process. For instance, consider the following noun phrases:

(a)

a light metal.

(b)

light colors.

The ambiguous word light has many senses, such as, little physical weight and small amount of color. As is obvious, the head words of these phrases, metal and colors, can clearly determine the correct meaning of the light. So we employ the head word of the phrase, which includes the ambiguous word, as a sub feature.

Similar to the head word of a phrase, the head word of its parent phrase that includes the ambiguous word can also help to identify the proper sense. For instance, consider the following sentence as well as some part of its parse tree shown in Fig. 2.

Fig. 2.

Part of parse tree structure.

Soldiers learn how to explode a tank

The ambiguous word in this sentence is tank which has many senses such as, container and vehicle. This ambiguous word is part of a noun phrase a tank. The parent phrase of this phrase is a verb phrase which its head is the verb explode. It is obvious that this verb conduces our system to war vehicle sense of tank. So we employ the head word of the parent phrase of the phrase, which includes the ambiguous word, as a sub feature.

In addition to the head word of a phrase and the head word of the parent phrase, the head word of the child phrase that includes the ambiguous word can help the WSD method to discriminate the correct meaning of the target word. For example, consider the following sentence as well as some part of its parse tree shown in Fig. 3.

Fig. 3.

Part of parse tree structure.

I played pop music with my guitar.

The target word in this sentence is play. The semantic concept of play is divided into entertainment and music. This ambiguous word is a part of a verb phrase which is made up of a noun phrase and a prepositional phrase. These phrases are the child phrases of the verb phrase. As shown in Fig. 3, The head word of the noun phrase is music that guides the algorithm towards the music sense of play. So, we employ the head word of all child phrases of the phrase, which include the ambiguous word, as sub features.

Similar to the head of phrase which was described before, the head word of sibling phrases of the phrase which include the ambiguous word is also considered as a sub feature in this study. These features as shown in the following sentences can help us to apply word sense disambiguation process. In the English sentence, the target word is crane. Bird and motion are the senses of this ambiguous word.

Fig. 4.

Part of parse tree structure.

Fig. 5.

Part of parse tree structure.

As presented in Fig. 4, Crane is a part of a noun phrase and this noun phrase is the left sibling of the verb phrase with head word of hunts. It is obvious that this head word can help the system to select the bird sense of crane. Also in the following Persian sentence the ambiguous word is ‘Ø´ÛŒØ±’ (shir) and it is a part of a noun phrase. As shown in Fig. 5 this noun phrase is the left sibling of the verb phrase with head word of ‘Ø´Ú©Ø§Ø± Ú©Ø±Ø¯’ (shekar kard) that guides the algorithm towards the lion sense of ‘Ø´ÛŒØ±’ (shir).

shir madeh gavazn ra shekar kard (Ø´ÛŒØ± Ù…Ø§Ø¯Ù‡ Ú¯ÙˆØÙ† Ø±Ø§ Ø´Ú©Ø§Ø± Ú©Ø±Ø¯).

3.2 Lexical features

3.2.1 Set of frequent words

The first subset of the extracted lexical features includes frequent words. Every word that frequently occurs in the training corpus is represented as a feature in this set. The value of a typical feature is dependent on the number of times it has co-occurred with the ambiguous word (in the same paragraph), ignoring the position of its occurrence in the paragraph.

A word such as $w_{i}$ can be selected as a feature of this category, if it meets the following pair of heuristic conditions:

1.
$W_{i}$ occurs at least $n$ times in the paragraph that the ambiguous word is in its $k$ -th sense. It should be noted that the proper value of $n$ should be determined experimentally.
2.
$\textit{prob}(k|w_{i})\geqslant p$ (1)

Where $p$ is a predefined value determined through experiments, and the value of prob( $k|w_{i}$ ) is calculated as follows:

$\textit{prob}(k|w_{i})=\frac{N(k,w_{i})}{N(w_{i})}$ (2)

where $N(kw_{i})$ refers to the number of paragraphs in which the word $w_{i}$ occurs with the ambiguous word in its $k$ -th sense, and $N(w_{i})$ refers to the total number of paragraphs in which the word $w_{i}$ occures.

The second condition is checked for the words that satisfy the first condition. The first condition tries to prevent selecting the words based on spurious and rare occurrences. The second condition is used to reduce the probability of selecting words that are frequent, but co-occur with all senses of the ambiguous word.

After detecting a number of words that satisfy the above pair of conditions, $m$ words that co-occur more frequently with the k_th sense of the ambiguous word are selected as features to construct the dataset (i.e., if the number of these words for a given sense k exceeds m). Notice that the value of m is determined via experiments.

For example, consider the Persian word “Ø´ÛŒØ±” (shir) which has two meanings (i.e., Lion and Milk). our system extracted the Persian words “ØÙ†Ú¯Ù„” (janagl), “Ø´Ú©Ø§Ø±” (shekar), “ÛŒØ§Ù„” (yal), “ØØ§Ù†ÙˆØ±” (janevar) and “Ø¨Ø¨Ø±” (babr) as features for the first sense of “Ø´ÛŒØ±” (i.e., Lion) and the Persian words “Ú¯Ø§Ùˆ” (gav), “Ù„Ø¨Ù†ÛŒØ§Ø” (labaniat), “Ù¾Ø§ØØÙˆØ±ÛŒØÙ‡” (pastorize), “Ú†Ø±Ø¨ÛŒ” (charbi), “Ú©Ù„ØÛŒÙ…” (kalsium) were extracted as features for the second sense of “Ø´ÛŒØ±” (i.e., Milk). These features have been extracted from the Persian corpus. Section 4.1 will describe this corpus in detail.
3.2.2 Set of the surrounding words

Every feature in this subset is assigned a weight value according to its positional distance to the ambiguous word. For this purpose, we select $s$ words on either sides of the ambiguous word (where s is determined experimentally) and for each of these selected words, the pair of conditions discussed in the previous section is verified. Among all the words satisfying both conditions, if the number of words for a given sense k exceeds a definite number m, a subset containing m words are selected. The selected subset will consist of the words that have co-occurred more frequently with the $k$ -th sense of the ambiguous word.

This set of features does not get binary values (i.e., 1 and 0, showing whether the word exists or not), instead, the positional distance to the ambiguous word will affect the weight value assigned to a typical feature. As a heuristic, we assign the value of $\frac{1}{|i|}$ to a word which has a distance of i to the ambiguous word. Here, distance means the number of words between two words in a text.

3.3 Semantic features

Lexical features, which are extracted from a small tagged corpus, only in limited cases can guide the disambiguated system to correct meaning of a word. As it is impossible to find all co-occurrence words in a corpus, a knowledge based resource such as WorldNet can help us to solve this problem. We believe that when a word is considered as a lexical feature, it can be a root for a lot of other features. These features will be extracted from a knowledge based resource. Notice that these features can partly cover the shortage of co-occurrence words. We entitled these features as semantic features.

In fact using this set of features as well as other tools finding syntactical relations, we tried to employ a combination of supervise approach and knowledge based algorithm in ambiguity resolution.

3.3.1 Synonym

A synonym is a word that has exactly or closely the same meaning as another word. When a word is extracted as a lexical feature, its synonym can also consider as a helpful feature for disambiguated system. Because this lexical feature has frequently occurred with ambiguous word and its synonym may also happen in a new sentence that houses ambiguous word. In such cases, synonym can lead the system to the correct meaning.

For example, when bass as an ambiguous word is employed to the fish sense of itself, the word sea was extracted as one of its lexical features. Therefore we obtained its synonym. Based on WordNet the synonym of sea is ocean. Now consider the following sentence:

Oceans is only one of the places hosting bass

Between lexical features that we extracted from our corpus for fish sense of bass, there are none of them in this sentence. So based on lexical features the disambiguated system cannot detect the right meaning oftarget word. But ocean as a semantic feature can guide our system to the fish sense of bass.

3.3.2 Antonym

An antonym is a word that means the opposite of another word. Similar to the synonym, the antonym of a lexical feature which has been extracted from corpus, can sometimes guide the disambiguated system to the correct meaning of a target word. For example consider the following sentence:

After long inspections, the expert informed us that the palms were fruitless.

In this sentence the target word is palm (has two senses, hand and tree). We know the correct meaning of palm is tree here, but there is no lexical feature for this sense in the sentence. of course it is not the later sense because one of lexical features extracted for the second sense of palm (tree) is “fruitful”, and the antonym of it i.e., fruitless exists in this sentence and our system based on this semantic feature can determine the right meaning of the target word. So we consider antonym of lexical feature with adjective roles as semantic features.

3.3.3 Same lemma

A word can lay in different forms. For example, a verb can be in form of simple present tense, present participle, simple past, past participle and so on, or a noun can be singular and plural, while an adjective can be a normal adjective, comparative and superlative. When a word was extracted as a lexical feature based on its part of speech, all of its word forms are considered as same lemma features. For instance consider the following sentence:

Only two members voted against the new motion.

In this sentence the target word is motion, which has been used in legal sense. In addition, when motion is in legal sense, vote is one of its lexical features. But in this sentence, vote has been used in past simple form, and only with same lemma features of vote we can guide our system to the correct meaning of target word.

3.3.4 Hyponym and hypernym

Hyponym and hypernym refer to a hierarchical semantic relationship between words. Hypernym presents a general term and hyponym shows the more specific terms that fall under the category of the general term.

For example, the fork, knife and spoon are hyponyms. They fall under the general term of utensil, which is the hypernym. Also in Persian the “Ø´ÛŒØ±”, “Ù¾Ù„Ù†Ú¯”, and “Ø¨Ø¨Ø±” are hyponyms of “Ú¯Ø±Ø¨Ù‡ ØØ§Ù†” which is their hypernym.

Hyponyms and hypernyms can be described by using a taxonomy, as seen in the Fig. 6.

Fig. 6.

An example for Hypernym and Hyponym.

Similar to synonym a hypernym or all hyponyms of a lexical feature which have been extracted from the corpus can be a helpful item to our system, so we consider all of them as semantic features in the process of feature extraction.

4. Feature selection

Generally, Feature selection is the process of selecting a subset of input features by eliminating most irrelevant and redundant features. Undoubtedly, choosing a proper feature affects the final functionality significantly. As we are going to extract certain words as feature and such features are acquired from a limited context, some words might repeatedly be chosen as feature in each set. Emergence of repeated features might have negative effects on classifier. To solve this problem, a feature selection process has to be added to the system.

As discussed above, three sets of features are extracted in our system. We apply the feature selection process for the lexical features and the syntactic features. Because lexical features are the input of system to extract the semantic features, this process doesn’t need to be employed for semantic features.

4.1 Syntactic feature selection

Between three subsets of syntactic features that are extracted by our system, two of them are likely to have common features. These two subsets are the head of phrase and the syntactic relation. This case occurs when one of the head phrase that is important for us, has exactly a syntactic relation with the target word as shown on Table 1. So, the system extracts such features more than once. For example, consider the example given in Fig. 4. In this example the word “hunts” is the head word of sibling phrases of the phrase which includes the ambiguous word (i.e., the head of phrase feature). This word also has “subject $+$ verb” relation with the target word. Thus, the word “hunts” is extracted twice, i.e., as the head of phrase feature and as a syntactic relation feature. In these cases, we select one of these features, because holding both may have a negative impact on the classifier.

4.2 Lexical feature selection

As discussed above, two subsets of features are extracted for lexical features in our system. The features of the first subset are extracted from the paragraphs in which the ambiguous word occurs and the features of the second subset are extracted from a window of words that surrounds the ambiguous word. These two subsets of features are likely to have common features. In other words, one feature may exist in both sets of features. Undoubtedly, for each pair of common features, one is redundant and also has a negative impact on the KNN classifier. Thus, we should omit redundant features.

It should be noticed that the features which exist in both of the sets have two different values. Thus, it matters which of them is omitted. Regarding this fact, we do not omit one of the common features randomly. Instead, among a pair of identical features (with different values) we hold the feature which leads to a higher classification accuracy on training data and remove the other [36].

5. Corpora and preprocessing steps

Prior to performing feature extraction and feature selection processes, the existing datasets are required to be preprocessed in a way that the mentioned processes could be easily performed on them. Additionally, these preprocessing steps would yield a set of enriched corpora which could be used for other NLP works.

5.1 Datasets

A proper dataset significantly affects the final functionality when using a Supervised Approach. We used two English and Persian datasets to extract features and assess our proposed scheme and we implemented our preprocess steps on these two datasets.

In the English dataset, we focused on 6 nouns, 1 adjective and 1 verb, namely palm, bass, crane, plant, motion, tank, light and call all of which involve sense ambiguity. We used a corpus which includes a set of texts including the mentioned words frequently in their different senses. A great part of this corpus has been obtained from a corpus named TWA (developed at University of North Texas by [37]). These ambiguous words have higher frequencies compared to other words in the texts and appear in different situations in the training data. Further information about this corpus can be found in Table 2.

Table 2
The details about the datasets that include six English ambiguous words

Data set	No. of paragraphs	role	Different senses
Bass	194	Noun	fish, music
Crane	150	Noun	bird, machine
Motion	290	Noun	physical, legal
Palm	286	Noun	hand, tree
Plant	220	Noun	factory, living
Tank	246	Noun	container, vehicle
Light	95	Adjective	Not dark, not heavy
call	80	Verb	Invite, telephone

In the second dataset, we concentrate on two Persian ambiguous words, namely “Ø´ÛŒØ±” and “Ø´Ú©Ø±”. The corpus which includes a set of texts including these two Persian words in their different senses has been collected using Google search engine. Additional details about this corpus can be found in Table 3. The aim of this part of the experiments is to show that the proposed method is applicable for Persian as low-resource language and is not language-dependent.

Table 3

The details about the datasets that include two Persian ambiguous words

Data set	No. of paragraphs	Different senses
Ø´ÛŒØ±	72	Milk, Lion
Ø´Ú©Ø±	51	Thankfulness, Sugar

5.2 Syntactic preprocessing task

For syntactic features, the treebank is created on top of the tagged corpus, also a dependency relation is added to each sentences. In English, this task is done using Stanford parser. In Persian, because there isn’t a good parser, this task is done manually. Finally we present the prepared corpus in Xml format. Figure 7 shows an English example sentence before syntactic preprocessing as well as some part of its XML representation after preprocessing.

Fig. 7.

An example sentence of the corpus (a)Before syntactic preprocessing (b)Part of Xml representation after syntactic preprocessing.

The sub-features of POS and also the head of phrase are extracted from Treebank ( $<$ Chunk $>$ node) and the sub-features of syntactic relations are extracted from $<$ dependency $>$ node that exist in XML format of the prepared corpus.

5.3 Lexical preprocessing task

For lexical features, we omit all stop-words (i.e., the words which are not quite valuable, such as articles, all types of pronouns, etc.) as well as the punctuations from the context. The processing we perform over the text is not case sensitive, i.e., it does not make a difference between upper case and lower case letters.We present the preprocessed corpus in XML format and enhance it by providing some other information such as the word frequencies in the corpus and part of speech of each word. Figure 8 shows an example sentence before preprocessing as well as some part of its XML representation after preprocessing.

Fig. 8.

An example sentence of the corpus (a)Before lexical preprocessing (b)Part of Xml representation after lexical preprocessing.

5.4 Semantic preprocessing task

The preprocessing task for semantic features is carried out after extraction of lexical and syntactic features. These features are the input of the semantic features extraction step. For this purpose, an array list is created each row of which can be represented as al $=<$ Lex_fea, S, POS, Syn, Ant, Lem, Hypr, Hypo $>$ , where Lex_fea is a frequent word; S is the Sense of the ambiguous word; POS is Part of Speech of the frequent word; Syn is the synonym of the frequent word; Ant is the antonym of the frequent word; lem is a word having the same lemma with the frequent word; Hypr is the hypernym of the frequent word and Hypo is the hyponym of the frequent word. This function has a positive impact on system’s accuracy.

Fig. 9.

An example of semantic features after lexical and syntactic features extraction.

For instance, Fig. 9 shows a frequent word (for second sense of tank) and its semantic features in format of the array list.

Fig. 10.

The semantic feature extraction algorithm.

The algorithm of semantic features extraction is presented in Fig. 10. As it is illustrated in this algorithm, here we tried to compensate for the sense tagged corporashortage using some knowledge based resources.

6. Integrated WSD system

In this section, we explain how the processes already mentioned are integrated and form a seamless disambiguation system. In general, as is demonstrated in Fig. 1, the preliminary input is the very English and Persian dataset receiving the preprocess operation and formed in the desired format.

Then the feature extraction process is performed and from the enriched and preprocessed corpus, three sets of features are extracted whereas every set consists of some sub features. Then, following a feature selection process, a pure subset of the extracted features is chosen to be used for classification. In order to improve the classification capacity, we propose a heuristic feature weighting method and assign different weights to the selected features. Finally, each paragraph included in the corpus is converted into a vector including values of the syntactic, lexical and semantic features.

6.1 Disambiguation process

After performing the mentioned processes, the system is ready to receive the sentence housing an ambiguous word. At this moment, the second part of the system that is responsible for disambiguation process starts working.

This part converts the input sentence to a vector based on the features extracted in previous processes and the features existing in this vector are compared with the vectors resulting from the previous part. It is a classifier that uses an improved version of K-NN and performs the WSD task on the set of constructed vectors.

Since the type of measure which is used for comparison between vectors is very important in KNN, we illustrate and employ various distance/similarity measures for this task as the following. We also assign and use feature weights and propose a weighted formula for every metric.

6.1.1 Distance/similarity measures

K-NN is a supervised learning algorithm in which the classification is accomplished based on learning by analogy, that is, by comparing a given test vector with training vector that are similar to it.

When an unknown vector is introduced, K-NN classifier finds $k$ most similar training vectors that are closest to the unknown vector. These $k$ training vectors are the $k$ “nearest neighbors” of the unknown vector. K-NN determines the label of the new coming vector according to its $k$ nearest neighbors.

In K-NN, the number $k$ is a positive integer number and can be determined experimentally. If $k=$ 1, then the unknown vector will simply be assigned to the class of its nearest neighbor, while in other cases it is classified by the majority vote of its neighbors.

As mentioned, the KNN algorithm is based on comparison. Thus, first of all, a proper metric for comparison should be determined. Since selection of a proper metric has a high effect on the accuracy of the WSD system, in this section, we aim to find the most proper similarity (or distance) measure among a set of standard metrics. In order to illustrate any of the distance (or similarity metrics), we use two typical vectors X1 $=$ ( $x_{11}$ , $x_{12}$ , $\ldots$ , $x_{1n}$ ) and X2 $=$ ( $x_{21}$ , $x_{22}$ , …, $x_{2n}$ ). For each metric, the distance of this pair of vectors is denoted as dist( $x_{1}$ , $x_{2}$ ). A distance measure must satisfy the following four rules:

1.
The distance values must non-negative, i.e., dist(x ${}_{1},x_{2})\geqslant$ 0.
2.
The distance of two vectors equls zero, if and only if the vectors are identical, i.e., $x=y\Leftrightarrow$ dist(x ${}_{1},x_{2}$ ) $=$ 0
3.
dist(x,y) is a symmetric function, i.e., dist( $x_{1}$ , $x_{2}$ ) $=$ dist( $x_{2}$ , $x_{1}$ )
4.
Triangular Inequality Rule: dist(x ${}_{1},x_{3})\geqslant$ dist(x ${}_{1},x_{2}$ ) $+$ dist( $x_{2}$ , $x_{3}$ )

Table 4 presents some distance/similarity measures, namely Euclidean, Squared Euclidean, Manhattan, Chebyshev, Canberra, Bray Curtis, Hamming, Cosine and Pearson. In Table 4, the distance/similarity, for each metric, between two points (vectors) is reported as “formula” and once, the feature weights are used for computing the distance/similarty of a pair of vectors, denoted as “Weighted formula”.

Table 4
Distance/similarity measures between two points with/without applying feature weights

Distance/ similarity measures Formula

Euclidean distance Formula $\textit{dist}\left(x_{1},x_{2}\right)=\sqrt{\sum\limits_{i=1}^{n}\left(x_{1i}-% x_{2i}\right)^{2}}$ (3)

Weighted Formula $\textit{dist}\left(x_{1},x_{2}\right)=\sqrt{\sum\limits_{i=1}^{n}{w_{f_{i}}(x_% {1i}-x_{2i})}^{2}}$ (4)

Squared euclidean distance Formula $\textit{dist}\left(x_{1},x_{2}\right)=\sum\limits_{i=1}^{n}{(x_{1i}-x_{2i})}^{2}$ (5)

Weighted formula $dist\left(x_{1},x_{2}\right)=\sum\limits_{i=1}^{n}{w_{f_{i}}(x_{1i}-x_{2i})}^{2}$ (6)

Manhattan distance Formula $\textit{dist}\left(x_{1},x_{2}\right)=\sum\limits_{i=1}^{n}\left|x_{1i}-x_{2i}\right|$ (7)

Weighted formula $\textit{dist}\left(x_{1},x_{2}\right)=\sum\limits_{i=1}^{n}{w_{f_{i}}\left|x_{% 1i}-x_{2i}\right|}$ (8)

Chebyshev distance Formula $\textit{dist}(x_{1},x_{2})=\mathop{max}_{i}(\left|{x_{1i}-x_{2i}}\right|)$ (9)

Weighted formula $\textit{dist}(x_{1},x_{2})=\mathop{max}_{i}(w_{fi}\left|{x_{1i}-x_{2i}}\right|)$ (10)

Canberra distance Formula $\textit{dist}(x_{1},x_{2})=\sum\limits_{i=1}^{n}{\frac{|x_{1i}-x_{2i}|}{|x_{1i% }|+|x_{2i}|}}$ (11)

Weighted formula $\textit{dist}(x_{1},x_{2})=\sum\limits_{i=1}^{n}{\frac{w_{fi}|x_{1i}-x_{2i}|}{% |x_{1i}|+|x_{2i}|}}$ (12)

Bray curtis distance Formula $\textit{dist}(x_{1},x_{2})=\frac{\sum\limits_{i=1}^{n}{|x_{1i}-x_{2i}|}}{\sum% \limits_{i=1}^{n}{(x_{1i}+x_{2i})}}$ (13)

Weighted formula $\textit{dist}(x_{1},x_{2})=\frac{\sum\limits_{i=1}^{n}{w_{fi}|x_{1i}-x_{2i}|}}% {\sum\limits_{i=1}^{n}{(x_{1i}+x_{2i})}}$ (14)

Hamming distance Formula $\textit{dist}(x_{1},x_{2},n)_{1\leqslant i\leqslant n}{=\left\{{\begin{array}[% ]{ll}\textit{dist}\left(x1,x2,n-1\right)+1,&if\ n\geqslant\ 1\ and\ {x}_{1i}% \neq x_{2i}\\ \textit{dist}\left(x1,x2,n-1\right),&if\ {n\geqslant\ 1\ and\ x}_{1i}=x_{2i}\\ 0&if\ n=0\\ \end{array}}\right.}$ (15)

Weighted formula $\textit{dist}(x_{1},x_{2},n)_{1\leqslant i\leqslant n}{=\left\{{\begin{array}[% ]{ll}\textit{dist}\left(x1,x2,n-1\right)+w_{fi},&if\ n\ \geqslant 1\ and\ x_{1% i}\neq x_{2i}\\ \textit{dist}\left(x1,x2,n-1\right),&if\ {n\geqslant 1\ and\ x}_{1i}=x_{2i}\\ 0&if\ n=0\\ \end{array}}\right.}$ (16)

Cosine similarity Formula $\textit{sim(}x_{1},x_{2})=\frac{x_{1}.x_{2}}{\left\|{x_{1}}\right\|\left\|{x_{% 2}}\right\|}=\frac{\sum\limits_{i=1}^{n}{x_{1i}.x_{2i}}}{\sqrt{\sum\limits_{i=% 1}^{n}{(x_{1i})^{2}}\sum\limits_{i=1}^{n}{(x_{2i})^{2}}}}$ (17)

Weighted formula $\textit{sim}(x_{1},x_{2})=\frac{\sum\limits_{i=1}^{n}{(w_{fi}\times x_{1i}).(w% _{fi}\times x_{2i})}}{\sqrt{\sum\limits_{i=1}^{n}{(w_{fi}\times x_{1i})^{2}}% \sum\limits_{i=1}^{n}{(w_{fi}\times x_{2i})^{2}}}}$ (18)

Pearson correlation Formula $r_{x_{1}x_{2}}=\frac{\sum\limits_{i=1}^{n}{(x_{1i}-\bar{{x}}_{1})(x_{2i}-\bar{% {x}}_{2})}}{\sqrt{\sum\limits_{i=1}^{n}{(x_{1i}-\bar{{x}}_{1})^{2}\sum\limits_% {i=1}^{n}{(x_{2i}-\bar{{x}}_{2})^{2}}}}}$ (19)

Weighted formula $r_{x_{1}x_{2}}=\frac{\sum\limits_{i=1}^{n}{w_{fi}(x_{1i}-\bar{{x}}_{w1})(x_{2i% }-\bar{{x}}_{w2})}}{\sqrt{\sum\limits_{i=1}^{n}{w_{fi}(x_{1i}-\bar{{x}}_{w1})^% {2}\sum\limits_{i=1}^{n}{w_{fi}(x_{2i}-\bar{{x}}_{w2})^{2}}}}}$ (20)

Note that in all relations, xji is the value of the i-th feature of the vactor xj and in weighted relations, $w_{fi}$ is the weight assigned to the feature $f_{i}$ .

The Chebyshev Distance is also known as Chessboard distance, since in the game of chess, the minimum number of moves needed by a king to go from one square on a chessboard to another equals the Chebyshev distance between the centers of the squares, if the squares have side length one, as represented in 2-D spatial coordinates with axes aligned to the edges of the board. For example, the Chebyshev distance between F6 and E2 equals 4.

In Canberra and Bray-Curtis distance, if both points are zeroes, they are omitted from the result (because in mathematics $\frac{0}{0}$ is undefined).

Similarity metrics have a reverse meaning to distance metrics. Cosine similarity is a measure of similarity between two vectors in which, the cosine of the angle between the vectors is used to represent the similarity.The Cosine similarity measure of two vectors is given in relation (17). The cosine of the angle between two vectors varies from $-$ 1 to 1. However, in our application, since there is no negative value for features, the Cosine similarity of two vectors will range from 0 to 1. Higher scores indicate lower distance and higher similarity. Due to the reverse meanings of distance and similarity, we can modify the result by easily subtracting it from 1, in order to obtain a distance measure:

$\displaystyle\textit{dist}(x_{1},x_{2})=1-\textit{sim}(x_{1},x_{2})$ (3)

The weighted Cosine similarity is also given in relation (18).

The Pearson’s correlation is another similarity metric that commonly denoted by $r$ , is presented in relation (19). In this relation, $\bar{{x}}$ is the mean value of $x$ . In order to convert this metric to a distance measure, like cosine similarity we can subtract the correlation formula from 1. As discussed above, we aim to perform feature weighting to improve the classifier. Thus, the Pearson correlation can be modified as formula (20). Now, $\bar{{x}}_{w}$ is the weighted mean of $x$ :

$\displaystyle\bar{{x}}_{w}=\frac{\sum\limits_{i=1}^{n}{w_{fi}\times x_{ki}}}{n}$ (4)
7. Experiment setting and results

Distance/ similarity measures	Formula
Euclidean distance	Formula	$\textit{dist}\left(x_{1},x_{2}\right)=\sqrt{\sum\limits_{i=1}^{n}\left(x_{1i}-% x_{2i}\right)^{2}}$	(3)
	Weighted Formula	$\textit{dist}\left(x_{1},x_{2}\right)=\sqrt{\sum\limits_{i=1}^{n}{w_{f_{i}}(x_% {1i}-x_{2i})}^{2}}$	(4)
Squared euclidean distance	Formula	$\textit{dist}\left(x_{1},x_{2}\right)=\sum\limits_{i=1}^{n}{(x_{1i}-x_{2i})}^{2}$	(5)
	Weighted formula	$dist\left(x_{1},x_{2}\right)=\sum\limits_{i=1}^{n}{w_{f_{i}}(x_{1i}-x_{2i})}^{2}$	(6)
Manhattan distance	Formula	$\textit{dist}\left(x_{1},x_{2}\right)=\sum\limits_{i=1}^{n}\left\|x_{1i}-x_{2i}\right\|$	(7)
	Weighted formula	$\textit{dist}\left(x_{1},x_{2}\right)=\sum\limits_{i=1}^{n}{w_{f_{i}}\left\|x_{% 1i}-x_{2i}\right\|}$	(8)
Chebyshev distance	Formula	$\textit{dist}(x_{1},x_{2})=\mathop{max}_{i}(\left\|{x_{1i}-x_{2i}}\right\|)$	(9)
	Weighted formula	$\textit{dist}(x_{1},x_{2})=\mathop{max}_{i}(w_{fi}\left\|{x_{1i}-x_{2i}}\right\|)$	(10)
Canberra distance	Formula	$\textit{dist}(x_{1},x_{2})=\sum\limits_{i=1}^{n}{\frac{\|x_{1i}-x_{2i}\|}{\|x_{1i% }\|+\|x_{2i}\|}}$	(11)
	Weighted formula	$\textit{dist}(x_{1},x_{2})=\sum\limits_{i=1}^{n}{\frac{w_{fi}\|x_{1i}-x_{2i}\|}{% \|x_{1i}\|+\|x_{2i}\|}}$	(12)
Bray curtis distance	Formula	$\textit{dist}(x_{1},x_{2})=\frac{\sum\limits_{i=1}^{n}{\|x_{1i}-x_{2i}\|}}{\sum% \limits_{i=1}^{n}{(x_{1i}+x_{2i})}}$	(13)
	Weighted formula	$\textit{dist}(x_{1},x_{2})=\frac{\sum\limits_{i=1}^{n}{w_{fi}\|x_{1i}-x_{2i}\|}}% {\sum\limits_{i=1}^{n}{(x_{1i}+x_{2i})}}$	(14)
Hamming distance	Formula	$\textit{dist}(x_{1},x_{2},n)_{1\leqslant i\leqslant n}{=\left\{{\begin{array}[% ]{ll}\textit{dist}\left(x1,x2,n-1\right)+1,&if\ n\geqslant\ 1\ and\ {x}_{1i}% \neq x_{2i}\\ \textit{dist}\left(x1,x2,n-1\right),&if\ {n\geqslant\ 1\ and\ x}_{1i}=x_{2i}\\ 0&if\ n=0\\ \end{array}}\right.}$	(15)
	Weighted formula	$\textit{dist}(x_{1},x_{2},n)_{1\leqslant i\leqslant n}{=\left\{{\begin{array}[% ]{ll}\textit{dist}\left(x1,x2,n-1\right)+w_{fi},&if\ n\ \geqslant 1\ and\ x_{1% i}\neq x_{2i}\\ \textit{dist}\left(x1,x2,n-1\right),&if\ {n\geqslant 1\ and\ x}_{1i}=x_{2i}\\ 0&if\ n=0\\ \end{array}}\right.}$	(16)
Cosine similarity	Formula	$\textit{sim(}x_{1},x_{2})=\frac{x_{1}.x_{2}}{\left\\|{x_{1}}\right\\|\left\\|{x_{% 2}}\right\\|}=\frac{\sum\limits_{i=1}^{n}{x_{1i}.x_{2i}}}{\sqrt{\sum\limits_{i=% 1}^{n}{(x_{1i})^{2}}\sum\limits_{i=1}^{n}{(x_{2i})^{2}}}}$	(17)
	Weighted formula	$\textit{sim}(x_{1},x_{2})=\frac{\sum\limits_{i=1}^{n}{(w_{fi}\times x_{1i}).(w% _{fi}\times x_{2i})}}{\sqrt{\sum\limits_{i=1}^{n}{(w_{fi}\times x_{1i})^{2}}% \sum\limits_{i=1}^{n}{(w_{fi}\times x_{2i})^{2}}}}$	(18)
Pearson correlation	Formula	$r_{x_{1}x_{2}}=\frac{\sum\limits_{i=1}^{n}{(x_{1i}-\bar{{x}}_{1})(x_{2i}-\bar{% {x}}_{2})}}{\sqrt{\sum\limits_{i=1}^{n}{(x_{1i}-\bar{{x}}_{1})^{2}\sum\limits_% {i=1}^{n}{(x_{2i}-\bar{{x}}_{2})^{2}}}}}$	(19)
	Weighted formula	$r_{x_{1}x_{2}}=\frac{\sum\limits_{i=1}^{n}{w_{fi}(x_{1i}-\bar{{x}}_{w1})(x_{2i% }-\bar{{x}}_{w2})}}{\sqrt{\sum\limits_{i=1}^{n}{w_{fi}(x_{1i}-\bar{{x}}_{w1})^% {2}\sum\limits_{i=1}^{n}{w_{fi}(x_{2i}-\bar{{x}}_{w2})^{2}}}}}$	(20)

7.1 Setting classifier

We initially divide each corpus into 5 equal folds in a 5 fold cross-validation method. Of which four folds were used to extract the features and to train K-NN classifier, while the remaining folds were used as test data. This method is used for 5 times in a way that each fold is once chosen as data test. Training vectors are made after extracting features from 4 folds that are not related to the test. These vectors include syntactic features, lexical features, semantic features, as well as a class label (which is in fact the meaning of the ambiguous word) for each data instance.

Then using the K-NN algorithm, every test vector (which exists in fold test) is compared with all training vectors to find the $k$ most similar training vectors (i.e., the k nearest neighbors of the test vector). In order to determine the class label of a test pattern using K-NN, we try Weighted Voting strategy.

Note that Weighted Voting is a strategy each neighbor of the test vector s(found as one of the nearest neighbors) is assigned a weight according to its similarity(reverse ) to the vector. Since the similarity of two vectors (say sim( $x_{1}$ , $x_{2}$ )) has a reverse relation with their distance (say dist( $x_{1}$ , $x_{2}$ )), it can be roughly computed as follows.

$\displaystyle\textit{sim}(x_{1},x_{2})=M-\textit{dis}(x_{1},x_{2})$ (5)

where $M$ is a number that is surely greater than all computed distance values.

7.2 Evaluation

Based on the goal of this study, the experiments are divided into two parts. The first part shows the effect of the various features and the second part compares distance/similarity measures. also the experiments is tried to do on three type of part of speech (Bass, Crane, Motion, Palm, Plant, Tank as noun, light as adjective and call as verb). Unfortunately, for Persian we suffer from lack of such resources. Hence, we did this experiment on two nouns.

7.2.1 Feature selection evaluation

In this section the effect of various features on classifier accuracy is presented.

Table 5
Error rate of the KNN by considering the lexical features before and after lexical feature selection

Lexical features	Bass	Crane	Motion	Palm	Plant	Tank	Light	Call	Ø´ÛŒØ±	Ø´Ú©Ø±
Before feature selection	12.5	19.1	17.4	18.2	32.3	27.7	31.2	32.4	11.7	19
After feature selection	11.8	17.2	17.4	14.7	32.3	26.3	29.1	32.4	11.7	19

Table 5, shows the average error rate of the proposed method across the 5 fold cross validation by considering the set of frequent words as well as the set of the surrounding words, simultaneously (before and after lexical feature selection).

It can be observed in Table 5, that applying feature selection reduces the error rate. For example, applying feature selection on palm has led to a 3.5% reduction in the error rate. Consequently, it is obvious that selection of the better feature from each couple of equal (common) features will have an outstanding effect on the error rate reduction. In the cases where selection of the better feature seems to be non-influent, surely there had been either no common feature or a few ones among the pair of feature sets. Note that the comparison of these two sets of lexical features (before and after feature selection and feature weighting) are shown by [36].

Table 6

Error rate of the proposed scheme using syntactic features

Syntactic features	Bass	Crane	Motion	Palm	Plant	Tank	Light	Call	Ø´ÛŒØ±	Ø´Ú©Ø±
Part of speech	16.4	21.1	15.9	16.7	30.4	27.1	35.9	31.0	19.5	25.7
Syntactic relation	17.6	19.2	18.7	17.2	34.3	28.4	41.2	31.4	29.5	27.1
Head of phrase	16.8	18.0	16.4	17.9	31.1	28.4	37.7	30.8	17.8	22.4

The average error rate of the proposed method across the 5 fold cross validation is reported in Table 6. by considering the syntactic features. As described before, we utilized the part of speech, the syntactic relation and the various head of phrase as syntactic features. The results based on these features are presented under syntactic features in Table 6.

We created the semantic features based on lexical features, POS and some extra features such as synonym, antonym, same lemma, hyponym and hypernym of the frequent words. The effect of the semantic features to reduce the classification error rate is evaluated in Fig. 11. As it was predictable, the obtained results based on the semantic features are better than the results obtained based on lexical features for all datasets.

Note that in the above results, Euclidean was used as distance/similarity measures to compare records.

7.2.2 distance/similarity measures evaluation

As described before, one of the important factors affecting success of KNN is selection of a good distance/similarity measure used for comparison of text vectors. In order to evaluate the distance/similarity effect on classification accuracy, this evaluation was repeatedly carried out using all distance measures presented before. Figure 12, shows the result, based on semantic features for Euclidean, Squared Euclidean, Manhattan, Chebyshev, Canberra, Bray Curtis, Hamming, Cosine and Pearson metrics in two cases. Note that Fig. 12a presents the results where we used Bass, Crane, Motion, Palm and Plant as dataset and Fig. 12b presents the results where we used Tank, Light, Call, Ø´ÛŒØ± and Ø´Ú©Ø± as dataset.

Table 7
Classification Error rates (%) after performing feature weighting and feature selection compared to other classifiers

	Bass	Crane	Motion	Palm	Plant	Tank	Light	Call	Ø´ÛŒØ±	Ø´Ú©Ø±
Ng method	19.8	22.4	23.3	25.7	35.1	38	35.2	31.4	14.1	20.6
Brown method	17.7	24.9	27.2	22.2	38.7	32.7	31.3	35.1	28.7	25.4
Yarowsky method	18.8	27.6	25.8	27.9	41.8	35.9	36.4	39.6	16.4	22.7
Degan method	14.2	19.7	17.9	15.2	31.4	28.6	29.6	32.7	12.3	18.6
Proposed method	11.8	15.6	13.8	12.9	28.1	23.9	24.3	28.7	11.7	19

Fig. 11.

The effect of semantic features on reducing the error rates of the classifier.

Fig. 12.

Comparison of various distance/similarity measures (a) using five first corpora (bass, crane, motion, palm and plant) (b) using five second corpora (tank, light, call, Ø´ÛŒØ± and Ø´Ú©Ø±).

In order to compare the results with other disambiguation methods, we executed some of the existing corpora-based methods over the same data. The results are also shown in average in Table 7. As it can be observed, our results are better than most of the results obtained based on other methods. For example, the average error rate of the proposed method is significantly lower than the Dagan method [33] as a method that had outperformed others.

8. Conclusion

WSD is a crucial and useful matter in many subcategories of Computational Linguistic as well asnatural language processing .In this paper, we proposed an accurate word sense disambiguation system based on K-Nearest Neighbor algorithm. We first created enriched Persian and English corpora using preprocessing steps. Then we extracted three sets of features: the lexical features, the syntactic features and the semantic features. In order to improve the classification accuracy of K-NN, we proposed and evaluated a feature selection process as well a feature weighting strategy. We also examined and described various distance/similarity metrics in order to find the most proper metric for the KNN classifier. The proposed scheme was compared with some other classifiers already proposed for WSD. It demonstrated a uniform good behavior and achieved results which are comparable or better than other efficient classification systems, proposed in the past.

References

Navigli

, Word sense disambiguation: A survey, ACM Computing Surveys (CSUR) 41(2) (2009), 10.

Mihalcea

, Multilingual word sense disambiguation using wikipedia (Doctoral dissertation, University of North Texas), 2013.

Kumar

, Word Sense Disambiguation (Doctoral dissertation, The Pennsylvania State University), 2015.

Yarowsky

, Decision lists for lexical ambiguity resolution: Application to accent restoration in Spanish and French. In Proceedings of the 32nd annual meeting on Association for Computational Linguistics, Association for Computational Linguistics, 1994 June, pp. 88–95.

Parameswarappa

and Narayana

V.M.

, Kannada Word Sense Disambiguation Using Decision List, Volume, 2, 2013, pp. 272–278.

Gale

W.A.

Church

K.W.

and Yarowsky

, A method for disambiguating word senses in a large corpus, Computers and the Humanities 26(5–6), pp. 415–439.

Goutte

, Note on free lunches and cross-validation, Neural Computation 9(6) (1997), 1211–15.

Aung

N.T.T.

Soe

K.M.

and Thein

N.L.

, A Word Sense Disambiguation System Using Naïve Bayesian Algorithm for Myanmar Language, International Journal of Scientific and Engineering Research 2(9) (2011).

H.T.

and Lee

H.B.

, Integrating multiple knowledge sources to disambiguate word sense: An exemplar-based approach. In Proceedings of the 34th annual meeting on Association for Computational Linguistics, Association for Computational Linguistics, 1996 June, pp. 40–47.

10.

Escudero

Màrquez

and Rigau

, Boosting applied to word sense disambiguation, In European Conference on Machine Learning, Springer Berlin Heidelberg, 2000 May, pp. 129–141.

11.

Golding

A.R.

and Roth

, A winnow-based approach to context-sensitive spelling correction, Machine Learning 34(1–3) (1999), 107–130.

12.

Mangu

and Brill

, Automatic rule acquisition for spelling correction, In ICM 97 (1997 July), 187–194.

13.

Pedersen

, A simple approach to building ensembles of Naive Bayesian classifiers for word sense disambiguation, In Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference, Association for Computational Linguistics, 2000 April, pp. 63–69.

14.

Zhong

and Ng

H.T.

,It makes sense: A wide-coverage word sense disambiguation system for free text. In Proceedings of the ACL 2010 System Demonstrations, Association for Computational Linguistics, 2010 July, pp. 78–83.

15.

Dandala

Mihalcea

and Bunescu

,Word sense disambiguation using Wikipedia, In The People’s Web Meets NLP, Springer Berlin Heidelberg, 2013, pp. 241–262.

16.

Nigam

and McCallum

A.K.

,Text Classification from Labeled and Unlabeled Documents using EM, Machine Learning 39 (2000), 103–34.

17.

Shinnou

and Sasaki

,Unsupervised learning of word sense disambiguation rules by estimating an optimum iteration number in the EM algorithm. In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003-Volume 4, Association for Computational Linguistics, 2003 May, pp. 41–48.

18.

Agirre

Martínez

de Lacalle

O.L.

and Soroa

,Two graph-based algorithms for state-of-the-art WSD. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2006, July, pp. 585–593.

19.

Brody

and Lapata

,March. Bayesian word sense induction. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, 2009, pp. 103–111.

20.

Manandhar

Klapaftis

I.P.

Dligach

and Pradhan

S.S.

,SemEval-2010 task 14: Word sense induction & disambiguation, In Proceedings of the 5th international workshop on semantic evaluation, Association for Computational Linguistics, 2010 July, pp. 63–68.

21.

Di Marco

and Navigli

,Clustering and diversifying web search results with graph-based word sense induction, Computational Linguistics 39(3) (2013), 709–754.

22.

Navigli

Jurgens

and Vannella

,Semeval-2013 task 12: Multilingual word sense disambiguation. In Second Joint Conference on Lexical and Computational Semantics (* SEM), (Vol. 2), 2013 June, pp. 222–231.

23.

Chaplot

D.S.

Bhattacharyya

and Paranjape

,Unsupervised Word Sense Disambiguation Using Markov Random Field and Dependency Parser, In AAAI. 2015 January, pp. 2217–2223.

24.

Agirre

Rigau

Padro

and Atserias

,Combining supervised and unsupervised lexical knowledge methods for word sense disambiguation, Computers and the Humanities 4(1-2) (2000), 103–108.

25.

Yarowsky

,Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd annual meeting on Association for Computational Linguistics, Association for Computational Linguistics, 1995 June, pp. 189–196.

26.

Diab

and Resnik

,An unsupervised method for word sense tagging using parallel corpora, In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Association for Computational Linguistics, 2002 July, pp. 255–262.

27.

Mihltz

and Pohl

,Exploiting parallel corpora for supervised word-sense disambiguation in english-hungarian machine translation. In Proceedings of the 5th Conference on Language Resources and Evaluation, 2006, pp. 1294–1297.

28.

Tufiş

and Koeva

,Ontology-supported text classification based on cross-lingual word sense disambiguation, In International Workshop on Fuzzy Logic and Applications, Springer Berlin Heidelberg, 2007 July, pp. 447–455.

29.

Sarrafzadeh

Yakovets

Cercone

and An

,Cross-lingual word sense disambiguation for languages with scarce resources, In Canadian Conference on Artificial Intelligence, Springer Berlin Heidelberg, 2011 May, pp. 347–358.

30.

Banea

and Mihalcea

, Word sense disambiguation with multilingual features, In Proceedings of the Ninth International Conference on Computational Semantics, Association for Computational Linguistics, 2011 January, pp. 25–34.

31.

Lefever

, ParaSense: parallel corpora for word sense disambiguation (Doctoral dissertation, Ghent University), 2012.

32.

Fernandez-Ordonez

Mihalcea

and Hassan

,Unsupervised Word Sense Disambiguation with Multilingual Representations, In LREC, 2012 May, pp. 847–851.

33.

Dagan

and Itai

,Word sense disambiguation using a second language monolingual corpus, Computational linguistics 20(4) (1994), 563–596.

34.

Schwab

and Guillaume

,A global ant colony algorithm for word sense disambiguation based on semantic relatedness, In Highlights in Practical Applications of Agents and Multiagent Systems, Springer Berlin Heidelberg, 2011, pp. 257–264.

35.

Abualhaija

and Zimmermann

K.H.

,D-Bees: A novel method inspired by bee colony optimization for solving word sense disambiguation, Swarm and Evolutionary Computation 27 (2016), 188–195.

36.

Rezapour

Fakhrahmad

S.M.

Sadreddini

M.H.

and Jahromi

M.Z.

,An accurate word sense disambiguation system based on weighted lexical features, Literary and Linguistic Computing 29(1) (2014), 74–88.

37.

Mihalcea

and Yang

,TWA Sense Tagged Data, University of North Texas. W WW. cse. unt. edu/

\sim

rada/downloads. html, 2003.

38.

Agirre

and Edmonds

, eds. ,Word sense disambiguation: Algorithms and applications, Vol. 33, Springer Science & Business Media, 2007.

39.

Ahmed

and Nürnberger

,September. Arabic/english word translation disambiguation using parallel corpora and matching schemes, In Proceedings of EAMT (Vol. 8), 2008, p. 28.

40.

Brown

P.F.

Pietra

S.A.D.

Pietra

V.J.D.

and Mercer

R.L.

,Word-sense disambiguation using statistical methods, In Proceedings of the 29th annual meeting on Association for Computational Linguistics, Association for Computational Linguistics, 1991 June, pp. 264–270.

41.

Clough

and Stevenson

,Cross-language information retrieval using eurowordnet and word sense disambiguation, In European Conference on Information Retrieval, Springer Berlin Heidelberg, 2004 April, pp. 327–337.

42.

Diou

Katsikatsos

and Delopoulos

,December. Constructing Fuzzy Relations fromWordNet forWord Sense Disambiguation. In 2006 First International Workshop on Semantic Media Adaptation and Personalization (SMAP’06), IEEE, 2006, pp. 135–140.

43.

Fakhrahmad

S.M.

Rezapour

A.R.

Jahromi

M.Z.

and Sadreddini

M.H.

,A novel approach to machine translation: A proposed language-independent system based on deductive schemes, Iranian Journal of Science and Technology, Transactions of Electrical Engineering 38(E1) (2014), 59.

44.

Justeson

J.S.

and Katz

S.M.

,Principled disambiguation: Discriminating adjective senses with modified nouns, Computational Linguistics 21(1) (1995), 1–27.

45.

Kilgarriff

Rychly

Smrz

and Tugwell

,Itri-04-08 the sketch engine, Information Technology 105 (2004), 116.

46.

Kohonen

Barna

and Chrisley

,Statistical pattern recognition with neural networks: Benchmarking studies. In Neural Networks, IEEE International Conference on, 1988 July, pp. 61–68, IEEE.

47.

Lagarda

A.L.

Alabau

Casacuberta

Silva

and Diaz-de-Liano

, Statistical post-editing of a rule-based machine translation system, In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, Association for Computational Linguistics, 2009 May, pp. 217–220.

Applying various distance functions and feature extraction schemes to ambiguity resolution

Abstract

Keywords

1. Introduction

2. Related work

3. Feature extraction process

3.1 Syntactic features

3.1.1 Part of speech

Table 1 Syntactic relation of an ambiguous word with other words in a sentence

3.2.1 Set of frequent words

3.3 Semantic features

3.3.1 Synonym

3.3.2 Antonym

3.3.3 Same lemma

3.3.4 Hyponym and hypernym

4.1 Syntactic feature selection

4.2 Lexical feature selection

5. Corpora and preprocessing steps

5.1 Datasets

Table 2 The details about the datasets that include six English ambiguous words

6.1 Disambiguation process

6.1.1 Distance/similarity measures

7.1 Setting classifier

7.2.1 Feature selection evaluation

Table 5 Error rate of the KNN by considering the lexical features before and after lexical feature selection

Table 7 Classification Error rates (%) after performing feature weighting and feature selection compared to other classifiers

References

Table 1
Syntactic relation of an ambiguous word with other words in a sentence

Table 2
The details about the datasets that include six English ambiguous words

Table 5
Error rate of the KNN by considering the lexical features before and after lexical feature selection

Table 7
Classification Error rates (%) after performing feature weighting and feature selection compared to other classifiers