Abstract
In Natural Language Processing, word sense disambiguation (WSD) is an open challenge whose solution improves the performance of applications such as machine translation and information retrieval systems. Most natural languages contain many ambiguous words, and the meaning of these words differs with context. Choosing the correct meaning of a word in a given context is known as WSD. In this article, we develop a WSD system for the Telugu language using a machine learning technique and a knowledge-based approach. The knowledge resource used to develop the WSD system is a Lexical Knowledge Base (LKB). The WSD system performs well when compared with other unsupervised approaches.
Introduction
In Natural Language Processing (NLP), most natural languages have many polysemous words, which take different meanings in different contexts. The process of identifying the appropriate sense of an ambiguous word is word sense disambiguation (WSD). Humans understand a polysemous word from the context in which it is used, but for machines this is a difficult problem, as it involves arranging information in a data structure. By thoroughly analysing this data structure we can find the appropriate meaning of the polysemous word in the given scenario. The exact meaning of an ambiguous word depends on its surrounding words and on the specific context in which it is used. There are three approaches to developing a WSD system: knowledge-based, supervised and unsupervised. Of these, the supervised approach is the best suited to WSD and also performs best. Its drawback is that it requires large volumes of hand-tagged data. To overcome this drawback we use the knowledge-based approach, where data is stored in a Lexical Knowledge Base (LKB). On general domain data, the performance of the knowledge-based approach is below that of the supervised approach. Much work on WSD has been reported for English, where it is one of the fastest-growing research topics in computational linguistics, but for Telugu, WSD systems are not yet up to the mark.
WSD is treated as an Artificial Intelligence (AI)-complete problem, that is, it is considered as difficult as the hardest problems in AI. WSD depends heavily on external knowledge sources such as machine-readable dictionaries, thesauri, lexical knowledge bases, collocations and ontologies.
Telugu has many ambiguous words. For example, consider the sentence బంతి మాల అందముగా ఉన్నది. In this sentence, the ambiguous word “బంతి” (banthi) has three senses:
పువ్వు (flower); గుండ్రని, ఆటవస్తువు (a round plaything, i.e. a ball); వరుస (row)
Identifying the correct sense of the ambiguous word “బంతి” (banthi) depends on the context in which it is used.
In this article, the literature survey is outlined in Section 2, which briefly describes the approaches to WSD. Sections 3 and 4 explain the proposed model and the proposed method. Section 5 assesses the WSD systems. Section 6 concludes the work and outlines its future scope, followed by the references used in this work.
Literature survey
WSD systems are developed using three approaches: supervised, unsupervised and knowledge-based. According to [21], the relatedness between two senses is calculated by counting the overlap between the definitions of the words. Patwardhan, Banerjee, and Pedersen (2003) disambiguate polysemous words by finding the distance between two senses using the hierarchical structure of the LKB. Jiang and Conrath [24] combine the lexical taxonomy structure with corpus statistics into a metric used to develop a WSD system. The work surveyed here concerns WSD methods for English.
Navigli and Lapata first extract from the whole LKB a subgraph that is relevant to the target polysemous word by applying a depth-first search algorithm, and then apply graph-based centrality algorithms to the subgraph. The highest-ranking concept is assigned to the target word as its appropriate sense.
Tsatsaronis et al. also adopted a two-step process: they first build a subgraph for the target word and then rank the nodes of the graph with a spreading activation algorithm. Agirre and Soroa [3] used the same process, but rank the nodes with the PageRank algorithm. Tsatsaronis et al. compared the two algorithms and obtained better results with spreading activation.
In the knowledge-based approach, WSD systems are developed using the Lesk algorithm and the modified Lesk algorithm, which are based on a similarity metric that measures the relatedness between two senses. The conceptual distance between two senses of an ambiguous word is also used to identify the correct sense of a polysemous word.
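The overlap idea behind Lesk-style methods can be illustrated with a minimal sketch. It is a simplified variant written for illustration only, not the implementation of any system cited above. The function names, the English glosses and the stopword list are all assumptions; a real system would read the glosses from the LKB.

```python
def lesk_overlap(text_a, text_b, stopwords=frozenset()):
    """Count the distinct non-stopword tokens shared by two texts."""
    words_a = set(text_a.lower().split()) - stopwords
    words_b = set(text_b.lower().split()) - stopwords
    return len(words_a & words_b)


def simplified_lesk(context_words, sense_glosses, stopwords=frozenset()):
    """Return the sense whose gloss shares the most words with the context."""
    context = " ".join(context_words)
    scores = {
        sense: lesk_overlap(context, gloss, stopwords)
        for sense, gloss in sense_glosses.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical English glosses for the three senses of "banthi".
glosses = {
    "flower": "yellow flower used in a garland",
    "ball": "round object used in play",
    "row": "line of people seated for a meal",
}
stop = frozenset({"a", "the", "in", "of", "for", "is", "used"})
best, scores = simplified_lesk(["the", "earth", "is", "round"], glosses, stop)
```

Ties are broken in favour of the first sense listed, which is an arbitrary choice of this sketch.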
A WSD system developed using the supervised approach is more accurate than one developed using the knowledge-based approach. The drawback of the supervised technique is the knowledge acquisition bottleneck. The knowledge-based approach, an evolving technique used in our proposed system, overcomes this bottleneck because it exploits the knowledge already stored in LKBs. Until now, no work has been reported on WSD systems for the Telugu language.
Proposed model
In our proposed model, world knowledge is defined as common-sense information, and lexical knowledge specifies the semantic relations between concepts. The combination of the two is defined as the sense knowledge of each polysemous word, which is stored in our LKB. The contextual features specify the context in which each sense of a polysemous word is used.
Our proposed model is shown in Fig. 1.
Fig. 1: Proposed model.
Proposed method
Recently, graph-based approaches in NLP have gained much importance for disambiguating ambiguous words. In this paper we propose a graph-based approach to WSD for disambiguating polysemous words.
In existing knowledge-based methods, the senses of the polysemous words are compared in a pairwise fashion, and the number of computations grows exponentially with the number of words. In our proposed method, disambiguation is instead done in a suboptimal word-by-word process.
The proposed algorithms fall into two categories: context-independent algorithms, which do not depend on the input context, and context-dependent algorithms, which do. We use a graph-based method to determine the sense of a polysemous word. This is the first attempt to develop a WSD system for the regional language Telugu.
Graph-based WSD methods provide an optimal solution and are suitable for disambiguating word sequences. As a baseline method, the context-independent PageRank algorithm is applied to the semantic network to perform the disambiguation. The context-dependent modified PageRank algorithm is then applied to the graph to disambiguate the polysemous word, which gives more accurate results than any other knowledge-based approach.
In our proposed disambiguation process, the input text is first pre-processed: tokenization, part-of-speech (POS) tagging, lemmatization, chunking and parsing are performed to build a structured representation in which the WSD system can easily find the appropriate sense.
Let us consider the graph.
Build a disambiguation graph.
Modified PageRank method: apply the PageRank algorithm to the graph with a slight modification to the PageRank formula, where cMP is the voting scheme.
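The difference between a context-independent and a context-dependent ranking can be sketched as follows. This is not the modified formula of the proposed method, whose exact form and cMP voting scheme are not reproduced here. It is standard power-iteration PageRank with an optional personalized random-jump vector, swapped in as a stand-in for the context-dependent variant. The toy graph, its node names and the default parameter values are assumptions made for illustration.

```python
def pagerank(graph, damping=0.85, iterations=30, personalization=None):
    """Power-iteration PageRank on an undirected graph.

    graph: dict mapping each node to a list of its neighbours (symmetric,
           every node has at least one neighbour).
    personalization: optional dict of non-negative node weights. When given,
           the random jump lands on nodes in proportion to these weights
           (context-dependent); otherwise it is uniform (context-independent).
    """
    nodes = list(graph)
    if personalization is None:
        teleport = {v: 1.0 / len(nodes) for v in nodes}
    else:
        total = sum(personalization.get(v, 0.0) for v in nodes)
        if total <= 0:
            raise ValueError("personalization must give some node a positive weight")
        teleport = {v: personalization.get(v, 0.0) / total for v in nodes}

    rank = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iterations):
        # Each node keeps a share of the random jump and receives an equal
        # split of the rank of every neighbour.
        rank = {
            v: (1.0 - damping) * teleport[v]
            + damping * sum(rank[u] / len(graph[u]) for u in graph[v])
            for v in nodes
        }
    return rank


# Hypothetical toy graph: three sense nodes of "banthi", each linked to one
# related concept node. Node names are illustrative, not taken from the LKB.
toy_graph = {
    "banthi#flower": ["garland"],
    "banthi#ball": ["play"],
    "banthi#row": ["meal"],
    "garland": ["banthi#flower"],
    "play": ["banthi#ball"],
    "meal": ["banthi#row"],
}
senses = ["banthi#flower", "banthi#ball", "banthi#row"]

static = pagerank(toy_graph)                               # baseline: the senses tie
biased = pagerank(toy_graph, personalization={"play": 1})  # context maps to "play"
winner = max(senses, key=biased.get)
```

On this symmetric toy graph the context-independent run cannot separate the senses, while concentrating the random jump on the concept matched by the context lifts the neighbouring sense node.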
Let us consider the following test samples:
భూమి బంతి వలె ఉంటుంది
వివాహ సమయాల్లో ఏర్పాటు చేసే సామూహిక భోజనాలను బంతి భోజనాలు అంటారు
బంతి మాల అందముగా ఉన్నది
The above test sentences place the same target polysemous word “బంతి” (banthi) in three different contexts, and its sense is different in each. The word “బంతి” (banthi) has three distinct senses: sense 1, పువ్వు (flower); sense 2, గుండ్రని, ఆటవస్తువు (a round plaything, i.e. a ball); sense 3, వరుస (row).
Fig. 2: Graph for the polysemous word “banthi” with its three senses.
Using the graph (Fig. 2), we have to disambiguate the polysemous word “బంతి” (banthi), which takes different senses in different contexts. The appropriate sense depends on the words surrounding the polysemous word in the given context. Because the context may vary, we take a window of 2 to 3 context words to the left and right of the polysemous word.
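A window of this kind can be taken with two list slices. The sketch below assumes the sentence is already tokenized; the function name and the placeholder tokens are hypothetical.

```python
def context_window(tokens, target_index, size=3):
    """Return up to `size` tokens on each side of the target token."""
    left = tokens[max(0, target_index - size):target_index]
    right = tokens[target_index + 1:target_index + 1 + size]
    return left, right


# Placeholder tokens stand in for the words of a pre-processed sentence.
tokens = ["w1", "w2", "w3", "banthi", "w4", "w5", "w6", "w7"]
left, right = context_window(tokens, tokens.index("banthi"), size=2)
```

Near the start or end of a sentence the window is simply shorter on that side, so no padding is needed.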
Fig. 3: Graph for the word “banthi” with context.
Depending on these contexts we can disambiguate the meaning of a polysemous word. The context words are inserted into the graph (Fig. 3) and assigned to concepts in our LKB. We map the context words onto the LKB graph of the polysemous word in such a way that each concept node to which a context word is assigned gains intensity, which makes that node more important. The sense node in the graph (Fig. 3) to which the context is mapped is treated as the winning sense.
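The mapping step can be pictured with a small sketch in which each context word raises the intensity of the concept node it is assigned to, and the sense adjacent to the most intense concepts wins. This is a deliberately simplified stand-in for the graph ranking, not the proposed algorithm. The romanized words, the concept names and both lookup tables are hypothetical.

```python
from collections import Counter


def node_intensities(context_words, word_to_concept):
    """Count how many context words are assigned to each concept node."""
    return Counter(
        word_to_concept[w] for w in context_words if w in word_to_concept
    )


def winner_sense(sense_neighbours, intensities):
    """Pick the sense whose neighbouring concepts carry the most intensity."""
    scores = {
        sense: sum(intensities.get(concept, 0) for concept in concepts)
        for sense, concepts in sense_neighbours.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical fragment of an LKB graph around the word "banthi".
sense_neighbours = {
    "banthi#flower": {"garland", "petal"},
    "banthi#ball": {"play", "round"},
    "banthi#row": {"meal", "wedding"},
}
# Hypothetical assignment of (romanized) context words to concept nodes.
word_to_concept = {"vivaha": "wedding", "bhojanalu": "meal", "mala": "garland"}

intensities = node_intensities(["vivaha", "samuhika", "bhojanalu"], word_to_concept)
best, scores = winner_sense(sense_neighbours, intensities)
```

Context words with no entry in the lookup table are ignored, so an unknown word neither helps nor hurts any sense.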
Evaluation
Precision and recall, the metrics commonly used in NLP and in information retrieval, are also used to measure the efficiency of our proposed WSD method. The F1 measure, which summarises the accuracy of the system, is our main measure of evaluation. Precision is the ratio of the number of correctly disambiguated words to the number of words disambiguated. Some WSD systems cannot disambiguate every polysemous word; because the words they leave undisambiguated are excluded from the denominator, the precision of such systems will be high relative to their recall. Recall is the ratio of the number of correctly disambiguated words to the number of words to be disambiguated. The F1 measure is the harmonic mean of precision and recall.
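These definitions translate directly into a few lines of code. The sketch below assumes three counts are available: words disambiguated correctly, words the system attempted, and words that should have been disambiguated. The function name is hypothetical and the numbers in the example are illustrative only, not results from this work.

```python
def wsd_scores(correct, attempted, total):
    """Return (precision, recall, F1) for a WSD run.

    correct:   words assigned the right sense
    attempted: words the system assigned any sense to
    total:     words that should have been disambiguated
    """
    if not 0 <= correct <= attempted <= total:
        raise ValueError("expected 0 <= correct <= attempted <= total")
    precision = correct / attempted if attempted else 0.0
    recall = correct / total if total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative counts: 100 target words, 80 attempted, 60 correct.
precision, recall, f1 = wsd_scores(correct=60, attempted=80, total=100)
```

With these counts precision is 0.75 and recall is 0.60, which shows how skipping hard words keeps precision above recall.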
Table 1: Accuracy of the different WSD methods.
Table 1 shows the accuracy of our proposed methods for Telugu WSD, so that the methods can be compared easily.
Fig. 4: Comparison of WSD methods.
The behaviour of the Telugu WSD system on the training data (general domain) depends on several performance factors: the damping factor, the number of iterations and the size of the context. The F-measure is calculated as each factor is varied, and it converges at a point.
The number of iterations is one of these performance factors and is also a parameter of the PageRank computation. The point at which the F-measure converges as the number of iterations varies is shown in Fig. 5.
Fig. 5: Varying number of iterations.
The damping factor is another parameter of the PageRank computation that is varied in our experiments. The point at which the F-measure converges as the damping factor varies is shown in Fig. 6.
Fig. 6: Varying damping factor.
Fig. 7: Varying context size.
In our experiments, for each test sentence to be disambiguated, at least 20 content words before and after the original sentence are taken to build the context. The effect of varying the context size is shown in Fig. 7.
Conclusion
In this article, we proposed an algorithm for developing a word sense disambiguation system for the regional language Telugu using a knowledge-based approach. Word sense disambiguation for Telugu is at an early stage, and little research work has been reported. At present, word sense disambiguation in Telugu offers more scope than in any other regional language. In future, word sense disambiguation systems for Telugu can be developed using the unsupervised approach.
Acknowledgments
We would like to thank the reviewers, who greatly helped to improve this article. We extend our thanks to the management of MLR Institute of Technology for providing excellent infrastructure to complete this research work. We would also like to thank the research and development team for their continuous support.
