Abstract
Supervised Word Sense Disambiguation (WSD) systems use features of the target word and its context to learn from the samples in an annotated dataset. Recently, word embeddings have emerged as a powerful feature in many NLP tasks. In supervised WSD, word embeddings can be used as a high-quality feature representing the context of an ambiguous word. In this paper, four improvements to existing state-of-the-art WSD methods are proposed. First, we propose a new model for assigning vector coefficients for a more precise context representation. Second, we apply a PCA dimensionality reduction process to find a better transformation of feature matrices and train a more informative model. Third, a new weighting scheme is suggested to tackle the problem of unbalanced data in standard WSD datasets. Finally, a novel idea is presented to combine word embedding features extracted from different independent corpora, using a voting aggregator among the available trained models. All of these proposals individually improve disambiguation performance on standard English lexical sample tasks, and combining all of the proposed ideas yields a significant improvement in the accuracy score.
Introduction
Word Sense Disambiguation is a long-standing problem in computational linguistics. It is defined as the problem of finding the most probable sense of a multi-sense word in a sentence. There are four main approaches to solving the problem: supervised, unsupervised, semi-supervised and knowledge-based [1]. In supervised WSD, the problem of finding the most probable sense of a word is treated as a classification task: word senses are the classes, and the context provides clues for training a model.
Supervised WSD systems use standard features such as Part of Speech (POS) tags, surrounding words and collocations extracted from the context of the target word, and it is assumed that they have enough information to represent the feature vector of an ambiguous word. It Makes Sense (IMS) is one of the few open-source frameworks for supervised WSD and it is so flexible that different features and classification algorithms can be employed to train models for predicting the intended sense of an ambiguous word [33].
In recent years, word embeddings have attracted attention in NLP applications by providing valuable information on semantic relations between words in a corpus. They have been applied successfully to many NLP tasks such as opinion mining [16], machine translation [34], named entity recognition [13], and dependency parsing [31]. Using word embeddings as a feature in a supervised WSD system has been studied in several works, and a recent study leveraged word embeddings as a new feature for IMS [10]. In this paper, we modify the state-of-the-art work of Iacobacci et al. [10]. First, we introduce new coefficients and incorporate them into the feature vector generation process of word embeddings. Second, the effect of dimensionality reduction on word embedding based features is examined using the PCA algorithm. Third, a weighting method is used to alleviate the problem of data imbalance in available corpora. Finally, a method for aggregating different word vectors from different corpora is discussed. We evaluate all of these proposed methods on two English lexical sample datasets, Senseval 2 and Senseval 3, and show that they achieve better performance than previous approaches.
The main contributions of this work can be summarized as follows:
1. Using distance-based and frequency-based coefficients in building word embedding vectors for WSD tasks
2. Using PCA as a pre-processing step to find a better transformation of feature matrices
3. Considering the imbalanced data problem in WSD tasks by introducing a simple solution based on class weighting
4. Introducing a voting strategy to exploit different word embeddings extracted from different corpora
The rest of the paper is organized as follows: Section 2 studies different WSD methods which have used word embeddings as a feature. Our proposed methods are introduced in Section 3. Section 4 includes the experiments and results, as well as comparisons with other works, and finally, Section 5 concludes the paper.
Related works
Word embeddings are a group of techniques that map words from a high-dimensional space, where each word is a dimension, to a much lower-dimensional space; the resulting representation is called a distributed representation. Traditionally, word embeddings were generated using methods such as dimensionality reduction on the word co-occurrence matrix [14], or probabilistic methods [7].
Bengio et al. proposed neural language modelling and derived word embeddings using deep neural networks [3]. Mikolov et al. popularized neural word embeddings by introducing two shallow neural network models, skip-gram and CBOW. Interesting relational information can be extracted using these embeddings [18]. Pennington et al. proposed GloVe, which learns word embeddings from global word co-occurrence statistics [23]. One of the main drawbacks of word embeddings is their inability to capture polysemy.
Considering word embedding applications, the existing WSD approaches can be categorized into two groups. The first group modifies pre-trained word embeddings or the training algorithm to obtain sense embeddings. These approaches try to solve Word Sense Induction (WSI) as a clustering problem by integrating sense embeddings into their training models. The second group uses word embeddings as a feature for the supervised task of sense classification (WSD).
Sense induction
Guo et al. proposed using translation as a tool to cluster word senses and built a monolingual sense-tagged corpus. When an ambiguous word is translated into another language, the ambiguity is, to a large extent, not present in the target language. Training a recurrent neural network on the word clusters results in sense embeddings [8].
Another line of work deals with the training process of skip-gram to achieve sense specific word embeddings [15, 29]. Some other works use knowledge bases or sense inventories to learn sense embeddings. Rothe and Schütze introduced Autoextend, a system which acquires word embeddings as input and derives embeddings for synsets and lexemes using WordNet [19] as an external resource [27]. Iacobacci et al. used BabelNet [20] as a sense inventory to create sense vectors for word similarity tasks [9]. Pelevina et al. introduced a method based on using ego networks to cluster word vectors and induce word senses [22].
These kinds of sense embeddings are useful for improving the performance of WSI, Part of Speech tagging, Named Entity Recognition and so on.
Sense classification
The application of word embeddings in WSD mainly consists of using them as new features in a supervised learning algorithm. Taghipour and Ng applied a modified version of word embeddings to the IMS system. Their strategy for incorporating word vectors in WSD was to use the vectors of all surrounding words of the target word in a given window as new features. They improved results on the English lexical sample and all-words tasks [28]. Chen et al. introduced a knowledge-based approach to WSD using word embeddings; they built context vectors and sense vectors for each target word and ranked word senses based on two simple algorithms that measure the similarity between the context vector and the sense vectors [4].
Iacobacci et al. introduced a new method for using word embeddings as features in a WSD system [10]. We modify this work by proposing four different ideas, which are discussed in more detail in the next section.
There is also a recent trend toward using neural networks to improve the performance of WSD. Context2vec is a neural network architecture which is based on word2vec [18] and generates context vectors for every target word in a corpus using Long Short-Term Memory (LSTM); the resulting vectors can then be used in several NLP tasks including WSD. Kågebäck and Salomonsson proposed a language-independent WSD system which uses a bidirectional LSTM architecture [12]. Yuan et al. proposed two methods for improving WSD. The first is an LSTM-based algorithm which tries to predict a held-out word using the surrounding context words. The second is a semi-supervised approach based on label propagation, which labels more data given some labeled samples. The best performance was achieved by combining the two ideas [32]. Raganato et al. introduced a new perspective for supervised WSD in which they used neural models to disambiguate a sequence of words instead of creating a single classifier per word. They evaluated different neural models and found that sequence learning is the best performing and most consistent method across different tasks in different languages [25]. Pesaranghader et al. used a bidirectional LSTM to disambiguate all words in a text document without having to train a classifier per word. Their network architecture includes sense and word embedding layers and takes word order into account [24].
Proposed methods
Here we introduce our proposed methods for a supervised WSD system which uses word embeddings as features. The first two methods generate word embedding feature vectors in an efficient way, the third uses a weighting strategy to address the data imbalance problem, and the last introduces a voting scheme among different models trained on different types of word embeddings.
Figure 1 shows the block diagram of the entire system. The bottom right of the figure illustrates our proposed voting mechanism. When a new sample enters the system, three SVM classifiers, each trained using a different embedding type, give three lists of probabilities for all of the senses of the target word, and the sense with the maximum summed probability is selected as the correct sense.

The block diagram of the entire system, including the IMS framework and the proposed ideas integrated into it. The second and third ideas, PCA and the weighting scheme respectively, are shown at the top of the figure in dashed-line rectangles, and the voting scheme is shown at the bottom right of the figure, marked off by dotted-line shapes.
All four ideas proposed in this paper are independent of each other, and each could be considered an independent extension to the IMS system [33]. In our methodology, we investigated different combinations of the proposed methods to find the best possible solution for our working system. The following subsections introduce each of the ideas in detail.
Iacobacci et al. proposed four different strategies for using word embeddings as features of a supervised WSD system [10]:
Concatenation: in this strategy, the word vectors of the context words of the target word are concatenated into one large vector whose size is the sum of the sizes of all vectors in the context window. Given $W$ as the window size, $D$ as the vector dimensionality, and $I$ as the index of the target word, the $i$-th dimension of the feature vector, $e_i$, is given as:
Averaging: in this strategy, a regular average over the word vectors of the context words is used. The average strategy calculates $e_i$ as:
Fractional Decay: in this strategy, a weighted average of the vectors of the context words is calculated, with weights based on the distance from the target word:
Exponential Decay: this strategy has achieved the best performance among the four strategies. The vector values are calculated using:
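For reference, the four strategies can be written out as follows. This is a reconstruction based on our reading of the definitions in [10], not a verbatim copy: $w_{ij}$ is assumed to denote the $i$-th component of the embedding of the $j$-th word in the sentence, the sums run over the $2W$ context positions around the target index $I$, and the `amsmath` package is assumed for `cases` and `\substack`.

```latex
% (1) Concatenation, for 0 <= i < 2WD; the target word itself is skipped
\begin{equation}
e_i =
\begin{cases}
  w_{(i \bmod D),\; I - W + \lfloor i/D \rfloor}     & \text{if } \lfloor i/D \rfloor < W \\
  w_{(i \bmod D),\; I - W + 1 + \lfloor i/D \rfloor} & \text{otherwise}
\end{cases}
\end{equation}

% (2) Averaging
\begin{equation}
e_i = \sum_{\substack{j = I - W \\ j \neq I}}^{I + W} \frac{w_{ij}}{2W}
\end{equation}

% (3) Fractional decay
\begin{equation}
e_i = \sum_{\substack{j = I - W \\ j \neq I}}^{I + W} w_{ij} \, \frac{W - |I - j|}{W}
\end{equation}

% (4) Exponential decay
\begin{equation}
e_i = \sum_{\substack{j = I - W \\ j \neq I}}^{I + W} w_{ij} \, (1 - \alpha)^{|I - j| - 1},
\qquad \alpha = 1 - 0.1^{(W - 1)^{-1}}
\end{equation}
```

With this choice of $\alpha$, the word immediately next to the target has weight 1 and the farthest word in the window has weight 0.1.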
In our experiments, the Exponential Decay strategy has the best performance among the four strategies. This shows not only that the distance from the target word matters and vectors should be weighted by it, but also that exponential weighting has the best effect on the resulting averaged context vector.
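The three weighted-sum strategies can be sketched in a few lines of Python. This is a minimal illustration rather than the IMS implementation: the function names are hypothetical, and the fractional and exponential weights assume the forms $(W - |I - j|)/W$ and $(1 - \alpha)^{|I - j| - 1}$ with $\alpha = 1 - 0.1^{1/(W-1)}$, as we understand them from [10].

```python
def decay_weight(distance, window, strategy):
    """Weight of a context word at sentential distance `distance` (>= 1)."""
    if strategy == "average":
        return 1.0 / (2 * window)
    if strategy == "fractional":
        return (window - distance) / window
    if strategy == "exponential":
        # Assumed form from [10]; requires window >= 2.
        # alpha is chosen so the farthest word in the window gets weight 0.1.
        alpha = 1.0 - 0.1 ** (1.0 / (window - 1))
        return (1.0 - alpha) ** (distance - 1)
    raise ValueError(f"unknown strategy: {strategy}")


def context_feature(vectors, target_idx, window, strategy="exponential"):
    """Weighted sum of the context word vectors around `target_idx`.

    `vectors` is a list of equal-length lists, one embedding per word
    of the sentence. Positions outside the sentence are simply skipped.
    """
    dim = len(vectors[0])
    feature = [0.0] * dim
    lo = max(0, target_idx - window)
    hi = min(len(vectors) - 1, target_idx + window)
    for j in range(lo, hi + 1):
        if j == target_idx:
            continue  # the target word is not part of its own context
        weight = decay_weight(abs(target_idx - j), window, strategy)
        for i in range(dim):
            feature[i] += weight * vectors[j][i]
    return feature


# Toy sentence of five words with one-hot "embeddings" (hypothetical data),
# so each output entry is exactly the weight given to that word.
toy = [[1.0 if r == c else 0.0 for c in range(5)] for r in range(5)]
exp_feature = context_feature(toy, target_idx=2, window=2)
```

Because the toy vectors are one-hot, `exp_feature` directly exposes the weights: the two neighbours of the target receive the largest weight and the words at the window boundary the smallest.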
In the proposed method, two different coefficients are used to capture more information from the context of the target word and to generate a richer word embedding feature. The first coefficient is the distance of each surrounding word from the target word. This distance is not merely the sentential distance; it is calculated as the Euclidean distance in vector space between the target word and each of the surrounding words in its context.
For example, in the sentence “I went to the
The second coefficient is the word frequency coefficient. Again considering the bank example, words such as “I”, “to” and “the” are very frequent words, known as English stopwords. We decided not to remove stopwords; instead, we weighted the word embedding feature vector by inverse term frequency (TF) to reduce the effect of high-frequency words on the resulting feature vectors. English Wikipedia was used for counting word occurrences, and the following equation shows how to calculate weights based on word frequencies:
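As a rough illustration of the two coefficients, the sketch below computes a Euclidean distance in vector space and a damped inverse-frequency weight. The functional forms `1 / (1 + d)` and `1 / (1 + log(count))`, the toy counts, and all function names are assumptions made for this example; they are not the weighting equations used in our experiments.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_coefficient(target_vec, context_vec):
    # ASSUMPTION: words closer to the target in vector space get a larger
    # weight, here through the simple form 1 / (1 + d).
    return 1.0 / (1.0 + euclidean_distance(target_vec, context_vec))


def frequency_coefficient(word, counts):
    # ASSUMPTION: log-damped inverse frequency; words missing from the
    # count table are treated as if they occurred once.
    return 1.0 / (1.0 + math.log(counts.get(word, 1)))


# Hypothetical corpus counts: a stopword versus a content word.
toy_counts = {"the": 1_000_000, "bank": 5_000}
stopword_weight = frequency_coefficient("the", toy_counts)
content_weight = frequency_coefficient("bank", toy_counts)
```

Under these assumed forms, a stopword such as "the" contributes less to the context vector than a content word, without being removed outright.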
Inspired by the work of Raunak, which presents a dimensionality reduction and post-processing technique to reduce the size of word embeddings [26], we performed a linear dimensionality reduction using the PCA algorithm on pre-trained word embeddings to evaluate its effect on WSD performance.
Performing PCA while keeping the output dimensionality equal to the input dimensionality yields better results than actually reducing the size of the vectors. As long as the calculations after this step are not linear, the results can be biased towards a given representation over others. Therefore, the number of output dimensions of PCA was set to the dimensionality of the input vectors. In this sense, PCA performs a linear transformation that better distributes the features in the n-dimensional space. Based on our experiments, the post-processing algorithm [26] did not improve the performance of our models, so only the best output representations of PCA are used in our work.
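To make the "same dimensionality" setting concrete, here is a minimal NumPy sketch of PCA that keeps every component. It is an illustration under stated assumptions, not our experimental code: the function name and the random toy matrix are hypothetical, and PCA is computed through an SVD of the centered data.

```python
import numpy as np


def pca_transform(X, n_components=None):
    """Project the rows of X onto their principal axes.

    With n_components=None all axes are kept, so the output has the same
    number of columns as the input (for data with at least as many rows
    as columns).
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    if n_components is None:
        n_components = Vt.shape[0]
    return centered @ Vt[:n_components].T


# Hypothetical stand-in for an embedding matrix: 50 "words", 4 correlated dims.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))
Z = pca_transform(X)
```

Keeping all components makes the transformation a centering followed by a rotation: pairwise distances between rows are unchanged, while the new coordinates are mutually uncorrelated, which is one way to read "better distributing features" above.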
New weighting scheme
The sense frequency distribution of multi-sense words is not uniform, and the same unbalanced characteristic is also seen in standard WSD datasets such as Senseval 2 and Senseval 3. In supervised WSD, every sense is a class and a supervised learning algorithm should assign a class, i.e. sense, to a sample. In a supervised learning task, when the number of training samples of different classes is different, a problem arises that is known as imbalanced data [6].
Table 1 shows the sense distribution of one word from the Senseval 2 and one word from the Senseval 3 English Lexical Sample tasks as an example of imbalanced data in WSD datasets. In an imbalanced dataset, a classifier develops a bias towards the majority class (or classes) because the minority class (or classes) is treated as noise. Several methods have been proposed to deal with the imbalanced data problem, such as undersampling, oversampling, ensemble classifiers, and cost-sensitive methods [6]. We use a simple approach based on the last of these, cost-sensitive learning.
Sense frequency of two sample words from Senseval 2 (Cool) and Senseval 3 (Party) English Lexical Sample tasks
Support Vector Machine (SVM) is a popular discriminative classifier defined by a separating hyper-plane. Given a number of points in an n-dimensional space, SVM tries to find an optimal hyper-plane which separates these data points into two classes (although real WSD datasets have more than two classes, SVM can be generalized to support multi-class classification). A possible hyper-plane can be represented by:
In our proposed method for handling imbalanced data in a multi-class classification task using SVM, the C parameter of each class is computed as follows:
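One common cost-sensitive choice, shown here purely as an illustration and not as the exact formula used in this work, scales a base penalty by inverse class frequency: $C_k = C \cdot N / (K \cdot n_k)$, where $N$ is the number of training samples, $K$ the number of senses and $n_k$ the number of samples of sense $k$. This is the same heuristic scikit-learn applies for `class_weight="balanced"`. The function name, the sense labels and the counts below are hypothetical.

```python
from collections import Counter


def class_penalties(labels, base_c=1.0):
    """Per-class SVM penalty C_k = base_c * N / (K * n_k).

    ASSUMPTION: inverse-frequency weighting; rare senses get a larger
    penalty so that misclassifying them costs more during training.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        sense: base_c * n_samples / (n_classes * n)
        for sense, n in counts.items()
    }


# Hypothetical training labels for one target word: 8 vs. 2 samples.
toy_labels = ["sense_a"] * 8 + ["sense_b"] * 2
penalties = class_penalties(toy_labels)
```

Here the minority sense receives four times the penalty of the majority sense, which counteracts the classifier's tendency to treat minority samples as noise.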
One of the main drawbacks of word embeddings is their inability to capture polysemy. For every word such as bank, there exists exactly one vector in the vector space. The word bank has 10 senses as a noun according to WordNet [19]. So the vector representation for this word is the combination of all the 10 senses. Our idea is that using different embeddings built on different corpora, or built using different algorithms, one can capture more information.
Figure 2 shows the most similar words to the word apple using word embeddings built on the Wikipedia corpus, and Fig. 3 shows the most similar words to the word apple using word embeddings built on the Google News corpus. These two figures show an important point: each corpus has its own domain, and word embeddings built on one corpus are biased towards different word senses than those built on another. To the best of our knowledge, this simple observation has not been considered in previous works.

The vector space of Wikipedia embeddings, where apple is near the fruit sense and far from the company sense.

The vector space of Google news embeddings, where apple is located between fruit sense and company sense.
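The kind of neighbourhood comparison shown in the two figures can be sketched with cosine similarity. The two-dimensional vectors below are invented for illustration (axis 0 loosely standing for a "fruit" direction and axis 1 for a "company" direction); they are not taken from the Wikipedia or Google News embeddings, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, embeddings, top_n=2):
    """Return the `top_n` (word, similarity) pairs closest to `word`."""
    query = embeddings[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in embeddings.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical embedding spaces with different sense biases for "apple".
fruit_leaning = {"apple": [0.9, 0.1], "banana": [1.0, 0.0], "microsoft": [0.0, 1.0]}
mixed = {"apple": [0.6, 0.6], "banana": [1.0, 0.0], "microsoft": [0.0, 1.0]}

nearest_in_fruit_space = most_similar("apple", fruit_leaning, top_n=1)
neighbours_in_mixed_space = most_similar("apple", mixed, top_n=2)
```

In the first toy space the nearest neighbour of "apple" is the fruit word, while in the second it is equally close to both senses, mirroring the contrast between the two figures.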
We have selected three different kinds of word embeddings from two different corpora. The first is the set of word vectors for the 2014 Wikipedia dump used by Iacobacci et al. [10]; it has 400 dimensions and was trained using word2vec [18]. The second set of word vectors was trained using the fastText algorithm [11] with 300 dimensions, and the last is the Google News word embeddings with 300 dimensions, which were trained using word2vec.
Each of these word embeddings is used as a feature to the supervised learning algorithm, and the algorithm gives a probability for each sense of a polysemous word. Using the following equation, the most probable sense of a word w in question is chosen:
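Following the description of Figure 1, the voting rule selects the sense whose probabilities, summed over the models, are largest: $\hat{s} = \arg\max_{s} \sum_{m} P_m(s \mid \text{context})$, where the notation $P_m$ for the probability given by the $m$-th model is ours. A minimal sketch, with hypothetical sense labels and made-up probabilities:

```python
def vote(prob_lists):
    """Return the sense with the largest summed probability.

    `prob_lists` holds one dict per trained model, each mapping
    sense label -> probability for the same target word.
    """
    totals = {}
    for probs in prob_lists:
        for sense, p in probs.items():
            totals[sense] = totals.get(sense, 0.0) + p
    return max(totals, key=totals.get)


# Made-up outputs of three classifiers for one occurrence of "bank".
wikipedia_model = {"bank/river": 0.55, "bank/finance": 0.45}
fasttext_model = {"bank/river": 0.30, "bank/finance": 0.70}
google_news_model = {"bank/river": 0.40, "bank/finance": 0.60}

chosen = vote([wikipedia_model, fasttext_model, google_news_model])
```

In this toy case the first model alone would pick the river sense, but the summed probabilities favour the finance sense, so the aggregate overrides a single model's preference.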
When only one type of word embedding is used, according to our experiments, the Wikipedia embeddings perform better than the other two, but voting among the three embeddings yields a better result. This shows that the voting scheme is a robust technique; however, using more embeddings as voters does not necessarily improve the result. Table 2 summarizes the parameters of the three word embedding types.
Specifications of the word embeddings used
We evaluated the proposed methods on English lexical sample tasks. Senseval 2 [5] and Senseval 3 [17] challenges provide standard training and test data for English WSD tasks. Lexical sample includes training data for a number of selected words; each sample is a paragraph which contains the target word and its surrounding context. We have used IMS [33] to train a model for every word and the default classifier is a linear Support Vector Machine (SVM).
Experimental setup
Table 3 shows the number of word types, training samples and test samples for Senseval 2 and Senseval 3 English lexical sample tasks.
Information about Senseval 2 and Senseval 3 English Lexical Sample datasets
In all of the proposed methods, we used the Exponential Decay strategy [10] to generate word embedding feature vectors. The baseline WSD system is IMS, which uses SVM as the classifier. In addition to the word embedding feature, the standard WSD features (surrounding words, POS tags of surrounding words, and collocations) were also used.
According to our experiments, the original sentential distance coefficient proposed by Iacobacci et al. in Equation (4) is effective enough. We therefore decided to combine our newly proposed coefficients with the existing coefficient proposed by Iacobacci et al. [10]. The first 400 entries of the word embedding feature vector are reserved for the original Exponential Decay strategy of Iacobacci et al. [10], and the second 400 entries (800 in total) are used to integrate the proposed coefficients into the system. Two models are considered for each coefficient, as follows:
Distance based coefficient:
Version 1. Combining the original coefficients and the new ones:
Version 2. Omitting the original coefficient:
Word frequency based coefficient:
Version 1. Combining the original coefficients and the new ones:
Version 2. Omitting the original coefficient:
Table 4 shows the results on the Senseval 2 English lexical sample task for our proposed methods discussed in Subsections 3.1 to 3.3. We compare our results with the work of Iacobacci et al. [10] because it outperformed all of the previous works [4, 28]. Almost all of the proposed ideas perform similarly to or better than the baseline. In the cases where using one of the proposed ideas individually leads to a worse result, e.g., the word count coefficient (Wcount) on Senseval 2 and the distance coefficient (Coeff) on Senseval 3, combining it with the other ideas helps us achieve a better result.
The result of all proposed methods on Senseval 2 English Lexical Sample task. IMSE is the baseline [10], and Coeff, Weight and PCA are our new coefficient, weighting scheme and PCA based methods, respectively. The number of all test cases is 4328. F1: F1 score percentage
Similarly, Table 5 shows the results for the Senseval 3 English lexical sample task. In both tasks, the new coefficient method, the weighting scheme and the dimensionality reduction technique are assessed individually, as are different combinations of these methods and the combination of all of them.
The result of all proposed methods on Senseval 3 English Lexical Sample task. IMSE is the baseline [10], and Coeff, Weight and PCA represent our new coefficient, weighting scheme and PCA based methods, respectively. The number of all test cases is 3944. F1: F1 score percentage
The comparison between baseline (IMSE) and the new voting system (IMSE + Voting) on both Senseval 2 (with 4328 samples) and Senseval 3 English Lexical Sample tasks (with 3944 samples). F1: F1 score percentage
The last method, voting scheme, was separately compared to the baseline because it is a different method and could be independently applied to every WSD system which uses word embeddings as a feature. The result of this method is shown in Table 6.
Our results show that our proposed methods outperform the work of Iacobacci et al. [10] on both the Senseval 2 and Senseval 3 datasets. Although almost all of the methods improve over the baseline, the best performance is achieved by applying a combination of all three ideas, namely the new coefficient, weighting and PCA, to the base system. In some cases, the combination strategy is not the best choice; for example, combining the weighting idea and PCA on Senseval 3 does not improve the accuracy at all. As discussed in Section 3, the idea of voting among different systems using different word embeddings was successful in our experiments. The important point is that, in our experiments, both the fastText and Google News embeddings had a lower F1 score than the Wikipedia word embeddings when used separately, but voting among the three (Wikipedia, fastText and Google News) gave a better result.
Furthermore, the proposed weighting scheme could be applied to any standard dataset whose data is not sufficiently balanced, and since word senses naturally have a non-uniform distribution, this method may improve results in other natural language processing tasks.
Conclusion
At the moment, supervised WSD approaches outperform other available approaches. The exploitation of word embeddings as a feature representing the semantic information of words in the context of an ambiguous word has recently been introduced in WSD. In this work, we introduced the following ideas to improve WSD accuracy, and measured encouraging performance improvements on both standard WSD tasks (Senseval 2 and Senseval 3):
1. Using a new coefficient scheme, applied to a state-of-the-art supervised WSD system (IMS) which uses word embeddings as a high-quality feature vector
2. Applying PCA as a dimensionality reduction technique in order to find a better transformation of word embedding feature vectors before training our supervised models
3. Using a weighting system to decrease the negative effects of data imbalance in existing WSD datasets on accuracy
4. A novel voting idea to aggregate word embeddings created from different corpora
All of the proposed ideas were evaluated individually on standard English lexical sample tasks, and the results show an improvement over the baseline in almost all cases. Also, a combination of our ideas and the voting scheme outperforms the baseline and all individual F1 scores, with scores of 71.4% and 76.2% for the Senseval 2 and Senseval 3 tasks respectively (compared to the baseline scores of 70.9% and 75.8%).
