Graph-based word sense disambiguation in Telugu language

Abstract

In Natural Language Processing, word sense disambiguation (WSD) is an open challenge whose solution improves the performance of applications such as machine translation and information retrieval. Natural languages contain many ambiguous words, and the meaning of such a word differs with its context. Choosing the correct meaning of a word in a given context is known as WSD. This article proposes a WSD system for the Telugu language that combines a machine learning technique with a knowledge-based approach. The knowledge resource used to develop the WSD system is a Lexical Knowledge Base (LKB). The resulting system compares favourably with other unsupervised approaches.

Keywords

Telugu language, word sense disambiguation, Natural Language Processing, knowledge-based approach

1. Introduction

Natural languages contain many polysemous words, which take different meanings in different contexts. In Natural Language Processing (NLP), the process of identifying the appropriate sense of an ambiguous word is called word sense disambiguation (WSD). Humans understand a polysemous word from the context in which it is used, but for machines this is a difficult problem, because the relevant information must first be arranged in a data structure. By thoroughly analysing this data structure we can find the appropriate meaning of the polysemous word in the given scenario. The exact meaning of an ambiguous word depends on its surrounding words and on the specific context in which it is used. There are three approaches to developing a WSD system: knowledge-based, supervised and unsupervised. Of these, the supervised approach generally performs best. Its drawback is that it requires large volumes of hand-tagged data. To overcome this drawback we use the knowledge-based approach, in which the data is stored in a Lexical Knowledge Base (LKB). On general domain data the performance of the knowledge-based approach is below that of the supervised approach. For English, a large body of work on WSD has been reported, and it is one of the fastest growing research topics in computational linguistics. For Telugu, however, WSD systems are not yet up to the mark.

WSD is treated as an AI-complete problem; that is, it is considered as difficult as the hardest problems in Artificial Intelligence (AI). WSD depends heavily on external knowledge sources such as machine-readable dictionaries, thesauri, lexical knowledge bases, collocations and ontologies.

Telugu has many ambiguous words. For example, consider the sentence “బంతి మా[…] అందము గా ఉన్నది”. In this sentence, the ambiguous word “బంతి” (banthi) has three senses:

• పువ్వు (flower)
• గుండ్రని, ఆటవస్తువు (a round plaything, i.e. a ball)
• వరుస (row)

Identifying the correct sense of the ambiguous word “బంతి” (banthi) depends on the context in which it is used.

This article is organised as follows. Section 2 outlines the literature survey, in which the approaches to WSD are briefly described. Sections 3 and 4 explain the proposed model and the proposed method. Section 5 evaluates the WSD systems. Section 6 concludes the work and discusses future scope, followed by the references used in the current work.

2. Literature survey

A WSD system can be developed using one of three approaches: supervised, unsupervised or knowledge-based. According to [21], the relatedness between two senses is calculated by counting the overlap between the definitions of the words. Patwardhan, Banerjee and Pedersen (2003) disambiguate polysemous words by measuring the distance between two senses in the hierarchical structure of the LKB. Jiang and Conrath [24] combine the lexical taxonomy structure with corpus statistics into a single metric for WSD. The methods surveyed here were developed for English.

Navigli and Lapata first extract from the whole LKB a subgraph that is relevant to the target polysemous word by applying a depth-first search algorithm, and then apply graph-based centrality algorithms to the subgraph. The highest-ranking concept is assigned to the target word as its appropriate sense.
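The two-step idea can be illustrated with a small sketch. It only loosely follows the description above: the graph, the node names (`banthi#ball`, `banthi#flower` and the English concept words) and the depth limit are hypothetical, and plain degree is used as the simplest possible centrality measure.

```python
def dfs_subgraph(graph, start_nodes, max_depth):
    """Collect the undirected edges reachable from the start nodes by a
    depth-limited depth-first search."""
    edges = set()

    def visit(node, depth, seen):
        if depth == max_depth:
            return
        for nxt in graph.get(node, ()):
            edges.add(frozenset((node, nxt)))
            if nxt not in seen:
                visit(nxt, depth + 1, seen | {nxt})

    for start in start_nodes:
        visit(start, 0, {start})
    return edges


def degree_centrality(edges):
    """Count how many subgraph edges touch each node."""
    degree = {}
    for edge in edges:
        for node in edge:
            degree[node] = degree.get(node, 0) + 1
    return degree


# Hypothetical LKB fragment (undirected adjacency lists).
toy_graph = {
    "banthi#ball": ["game", "round"],
    "banthi#flower": ["garden"],
    "game": ["banthi#ball", "play"],
    "round": ["banthi#ball"],
    "garden": ["banthi#flower"],
    "play": ["game"],
}
senses = ["banthi#ball", "banthi#flower"]

degree = degree_centrality(dfs_subgraph(toy_graph, senses, max_depth=2))
best_sense = max(senses, key=lambda s: degree.get(s, 0))
```

In this toy graph the ball sense touches more subgraph edges than the flower sense, so it is the one selected.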

Tsatsaronis et al. also adopted a two-step process: they first build a subgraph for a target word and then rank the nodes of the graph with a spreading activation algorithm. Agirre and Soroa [3] used the same process but ranked the nodes with the PageRank algorithm. Tsatsaronis et al. compared the two ranking algorithms and reported better results with spreading activation.

In the knowledge-based approach, WSD systems have been developed using the Lesk algorithm and the modified Lesk algorithm, which rely on a similarity metric that measures the relatedness between two senses. The conceptual distance between senses is also used to identify the correct sense of a polysemous word.
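The overlap idea behind Lesk-style methods can be sketched in a few lines. This is a simplified variant that compares each sense definition with the context words rather than comparing definitions pairwise; the English glosses are hypothetical stand-ins for LKB definitions, and the function names are illustrative only.

```python
def gloss_overlap(text_a: str, text_b: str) -> int:
    """Count the distinct words shared by two pieces of text."""
    return len(set(text_a.lower().split()) & set(text_b.lower().split()))


def simplified_lesk(context_words, sense_glosses):
    """Return the sense whose gloss shares the most words with the context."""
    context = " ".join(context_words)
    return max(sense_glosses,
               key=lambda sense: gloss_overlap(sense_glosses[sense], context))


# Hypothetical glosses for the three senses of "banthi".
glosses = {
    "flower": "a flower that grows in a garden",
    "ball": "a round object used to play a game",
    "row": "a line of people seated together",
}

best = simplified_lesk(["children", "play", "game"], glosses)
```

Here the context shares two words with the gloss of the ball sense and none with the other two, so that sense is chosen.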

A WSD system developed using the supervised approach is more accurate than one developed using the knowledge-based approach. The drawback of the supervised technique is the knowledge acquisition bottleneck. The knowledge-based approach is an evolving technique that overcomes this bottleneck by exploiting the knowledge stored in LKBs, and it is the approach used in our proposed system. Until now, no work has been reported on WSD for the Telugu language.

3. Proposed model

In our proposed model, world knowledge is defined as common sense information, and lexical knowledge specifies the semantic relations between concepts. Together these form the sense knowledge of each polysemous word, which is stored in our LKB. The contextual features specify the contexts in which each sense of a polysemous word is used.

The proposed model is shown in Fig. 1.

Figure 1. Proposed model.

4. Proposed method

Recently, graph-based approaches have gained much importance in NLP for disambiguating ambiguous words. In this paper we propose a graph-based approach to WSD for polysemous words.

In existing knowledge-based methods the senses of the polysemous words are compared in a pairwise fashion, and the number of computations grows exponentially with the number of words. In our proposed method, disambiguation is instead done word by word, a suboptimal process that avoids this growth.

The proposed algorithms fall into two categories: context-independent algorithms, which do not depend on the input context, and context-dependent algorithms, which do. We use a graph-based method to determine the sense of a polysemous word. This is the first attempt to develop a WSD system for the regional Telugu language.

Graph-based WSD methods provide an optimal solution and are suitable for disambiguating word sequences. The context-independent PageRank algorithm (GBPR) is applied to the semantic network to perform disambiguation and is treated as the baseline method. The context-dependent modified PageRank algorithm (GBMPR) is applied to the graph to disambiguate the polysemous word, and it gives more accurate results than the other methods compared in Section 5.

In our proposed disambiguation process, the input text is first pre-processed: tokenization, part-of-speech (POS) tagging, lemmatization, chunking and parsing are applied to produce a structured format from which the WSD system can easily find the appropriate sense.

Consider the graph $G(V,E)$, where $V$ is the set of nodes that represent LKB concepts and $E$ is the set of edges that represent the relations between the concepts. First, extract the senses of the target (polysemous) word and their related context words from the LKB.

Build a disambiguation graph $G(V,E)$ over the senses and the related context words of the target word. Insert the context words of the input sentence into the graph as nodes and link them to their respective concepts. Once the context words are related to concepts, every concept receives a score. Apply the GBMPR algorithm to the graph to disambiguate the target word in its context: the context (surrounding) words determine the most relevant sense of the target word, which is output as the result.
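The graph construction described above can be sketched as follows. This is a minimal illustration, not the system's implementation: the `LKB` dictionary, the node naming scheme (`target#sense`, `ctx:word`) and the English concept words standing in for Telugu entries are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical LKB fragment: each sense of the target word maps to related
# concept words. English stand-ins are used instead of Telugu entries.
LKB = {
    "banthi": {
        "flower": ["garden", "petal", "garland"],
        "ball": ["play", "round", "game"],
        "row": ["line", "seat", "meal"],
    }
}


def build_disambiguation_graph(target, context_words, lkb):
    """Return an undirected adjacency map for one target word."""
    graph = defaultdict(set)

    def link(a, b):
        graph[a].add(b)
        graph[b].add(a)

    # Sense nodes are linked to their related concepts from the LKB.
    for sense, related in lkb[target].items():
        sense_node = f"{target}#{sense}"
        for concept in related:
            link(sense_node, concept)

    # Context words become nodes and are linked to matching concepts.
    concepts = {c for related in lkb[target].values() for c in related}
    for word in context_words:
        node = f"ctx:{word}"
        graph.setdefault(node, set())
        if word in concepts:
            link(node, word)
    return dict(graph)


graph = build_disambiguation_graph("banthi", ["children", "play", "game"], LKB)
```

Context words that match no concept (here `children`) stay in the graph as isolated nodes, so they contribute nothing to any sense.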

Modified PageRank method: Apply the PageRank algorithm to the graph with a slight modification of its formula, in which the vector $v$ is initialised with the input context words. The PageRank vector $P$ over $G$ is calculated by

$$P = cMP + (1-c)v$$

where

$M$ – $N\times N$ transition probability matrix of $G$, with $N$ the number of nodes.

$v$ – $N\times 1$ random vector (initial).

$c$ – damping factor, $c\in[0,1]$.

$cMP$ – the voting scheme.

$(1-c)v$ – the smoothing factor, i.e. the probability of a random jump (not following any paths).
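A minimal sketch of this computation by power iteration is given below. It assumes an undirected graph stored as an adjacency map, a vector $v$ that spreads its mass uniformly over the context-word nodes, and the conventional damping value 0.85; none of these choices is stated in the text, and the toy graph and node names are hypothetical.

```python
def personalized_pagerank(graph, seeds, c=0.85, iterations=30):
    """Iterate P = c*M*P + (1 - c)*v over an adjacency map.

    `seeds` are the context-word nodes; v is uniform over them and zero
    elsewhere. The sketch assumes every node has at least one neighbour.
    """
    nodes = sorted(graph)
    v = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Random-jump term (1 - c) * v.
        nxt = {n: (1.0 - c) * v[n] for n in nodes}
        # Voting term c * M * P: each node shares its rank among neighbours.
        for n in nodes:
            share = c * p[n] / len(graph[n])
            for m in graph[n]:
                nxt[m] += share
        p = nxt
    return p


# Hypothetical disambiguation graph with two senses and two context words.
toy_graph = {
    "banthi#flower": {"garden"},
    "banthi#ball": {"play", "game"},
    "garden": {"banthi#flower"},
    "play": {"banthi#ball", "ctx:play"},
    "game": {"banthi#ball", "ctx:game"},
    "ctx:play": {"play"},
    "ctx:game": {"game"},
}

ranks = personalized_pagerank(toy_graph, seeds={"ctx:play", "ctx:game"})
senses = ["banthi#flower", "banthi#ball"]
best_sense = max(senses, key=ranks.get)
```

Because the random jump always returns to the context words, rank accumulates on the sense connected to them, while a sense with no path to the context loses its initial rank over the iterations.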

Algorithm:
Input: Test sentence with target polysemous word
Output: Appropriate sense of the polysemous word
1. Read the input sentence IS
   Target word $=$ null, context words $=$ null, sense $=$ null.
2. Pre-processing stage
   a. Removal of stop words
   b. Stemming
   c. Part-of-speech (POS) tagging
3. For each word Wi of IS, check whether it is a target word or not
   For each word Wj in PSW (the list of polysemous words)
      If (Wi $==$ Wj)
         Target word is Wi
   If no Wj matches Wi
      Add Wi to context words
4. Build a disambiguation graph $G(V,E)$ with the target word, its senses and their context words, which are retrieved from the LKB
5. Insert the input context words into the graph $G$ and relate them to their respective concepts
6. Assign each concept a score; the concept with the maximum score is the appropriate sense
7. Output the sense
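Steps 1 to 3 of the algorithm can be sketched as below. The stop-word list, the polysemous word list and the romanised example sentence are hypothetical, and stemming and POS tagging are omitted because they need language-specific resources.

```python
# Hypothetical stop-word list; a real system would use a Telugu list.
STOP_WORDS = {"the", "is", "a", "with"}


def split_target_and_context(sentence, polysemous_words, stop_words=STOP_WORDS):
    """Return (target word, context words) for one input sentence.

    The first token found in the polysemous word list is taken as the
    target; every other non-stop-word token becomes a context word.
    """
    tokens = [t for t in sentence.lower().split() if t not in stop_words]
    target = None
    context = []
    for token in tokens:
        if target is None and token in polysemous_words:
            target = token
        else:
            context.append(token)
    return target, context


target, context = split_target_and_context(
    "the children play with a banthi", polysemous_words={"banthi"}
)
```

If the sentence contains no word from the polysemous word list, the function returns `None` as the target and all remaining tokens as context.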

Let us consider the following test samples

The above test sentences place the same target polysemous word “బంతి” (banthi) in three different contexts, and its sense differs in each. The word “బంతి” (banthi) has three distinct senses: sense 1, పువ్వు (flower); sense 2, గుండ్రని, ఆటవస్తువు (a round plaything, i.e. a ball); sense 3, వరుస (row).

Figure 2. Graph for the word “banthi”, a polysemous word with three senses.

Using the graph (Fig. 2), we disambiguate the polysemous word “బంతి” (banthi), which takes different senses in different contexts. The assignment of the appropriate sense depends on the words surrounding the polysemous word in the given context. Because the context may vary, we take a window of two to three context words to the left and right of the polysemous word.
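Selecting such a window is straightforward; the sketch below assumes the sentence is already tokenised, and the placeholder tokens `w1` to `w8` are hypothetical.

```python
def context_window(tokens, target_index, size=3):
    """Return up to `size` tokens on each side of the target word."""
    left = tokens[max(0, target_index - size):target_index]
    right = tokens[target_index + 1:target_index + 1 + size]
    return left + right


tokens = ["w1", "w2", "w3", "w4", "banthi", "w5", "w6", "w7", "w8"]
window = context_window(tokens, tokens.index("banthi"), size=2)
```

When the target word is near the start or end of the sentence, the window simply contains fewer words on that side.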

Figure 3. Graph for the word “banthi” with context.

Depending on these contexts we can disambiguate the meaning of a polysemous word. The context words are inserted into the graph (Fig. 3) and assigned to the concepts in our LKB. We map the context words onto the LKB graph of the polysemous word in such a way that each node (concept) to which a context word is assigned gains importance. The sense node in the graph (Fig. 3) to which the context maps is treated as the winning sense.

5. Evaluation methodology

Precision and recall, metrics widely used in NLP and in information retrieval, are used to measure the efficiency of our proposed WSD method. Precision is the ratio of the number of correctly disambiguated words to the number of words the system disambiguated. Recall is the ratio of the number of correctly disambiguated words to the number of words to be disambiguated. Some WSD systems are unable to disambiguate some polysemous words; because those words are excluded from the denominator of precision but not of recall, precision is then higher than recall. The F1 measure, the harmonic mean of precision and recall, summarises the accuracy of the system and is our main measure of evaluation.
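These definitions translate directly into code. The counts in the example are hypothetical and are not taken from the experiments reported here.

```python
def precision_recall_f1(correct, attempted, total):
    """Compute precision, recall and F1 from raw counts.

    correct   - number of words disambiguated correctly
    attempted - number of words the system disambiguated
    total     - number of words that should have been disambiguated
    """
    precision = correct / attempted if attempted else 0.0
    recall = correct / total if total else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical run: 100 target words, 90 attempted, 81 correct.
p, r, f1 = precision_recall_f1(correct=81, attempted=90, total=100)
```

In this example the ten words left unattempted lower recall but not precision, which is why precision (0.90) exceeds recall (0.81).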

Table 1. Accuracy of different WSD methods

Method  Precision  Recall  Accuracy (F1)
WFS     0.56       0.54    0.55
WMFS    0.68       0.65    0.66
WTSS    0.89       0.81    0.85
GBPR    0.69       0.66    0.67
GBMPR   0.93       0.89    0.90

Table 1 shows the accuracy of the methods evaluated for Telugu WSD and allows them to be compared directly.

Figure 4. Comparison of WSD methods.

The behaviour of the Telugu WSD system depends on several performance factors, which were studied on the training data (general domain).

The performance factors are the damping factor, the number of iterations and the size of the context. The F-measure is calculated as each factor is varied, until it converges.

The first performance factor, which is also a parameter of the PageRank computation, is the number of iterations. The effect of varying it, and the point of convergence, are shown in Fig. 5.

Figure 5. Varying number of iterations.

The second parameter of the PageRank computation is the damping factor. The effect of varying its value, and the point of convergence, are shown in Fig. 6.

Figure 6. Varying damping factor.

Figure 7. Varying context size.

In our experiments, for each test sentence to be disambiguated, at least 20 content words before and after the original sentence are taken to build the context. The effect of varying the context size is shown in Fig. 7.

6. Conclusion and future scope

In this article, we proposed an algorithm for developing a word sense disambiguation system for the regional Telugu language using a knowledge-based approach. WSD for Telugu is at a beginning stage and little research work has been reported, which leaves considerable scope for further work. In future, a WSD system for Telugu can also be developed using an unsupervised approach.

Acknowledgments

We would like to thank the reviewers, whose comments greatly helped to improve this article. We extend our thanks to the management of MLR Institute of Technology for providing excellent infrastructure to complete this research work. We would also like to thank the research and development team for their continuous support.

References

1. Agirre, de Lacalle O.L. and Soroa, Random walks for knowledge-based word sense disambiguation, Computational Linguistics 40(1) (2014), 57–84.

2. Navigli, Word sense disambiguation: A survey, ACM Computing Surveys (CSUR) 41(2) (2009), 10.

3. Agirre and Soroa, Personalized page rank for word sense disambiguation, in: Proceedings of EACL-09, Athens, Greece, 2009.

4. Chatterjee, Joshi, Bhattacharyya, Kanojia and Meena, A study of the sense annotation process: Man v/s machine, International Conference on Global Wordnets, Matsue, Japan, Jan, 2012.

5. Al Bayaty B.F.Z. and Joshi, Word sense disambiguation (WSD) and information retrieval (IR): Literature review, IJARCSSE 4(2) (February 2014), ISSN: 2277 128X.

6. Ponzetto S.P. and Navigli, Knowledge-rich word sense disambiguation rivaling supervised systems, Proc of the 48th Annual Meeting of the Association for Computational Linguistics (ACL), 2010, pp. 1522–1531.

7. Mihalcea, Knowledge Based Methods for WSD, e-ISBN 978-1-4020-4809-2, Springer, 2007.

8. Navigli and Velardi, Structural semantic interconnections: A knowledge-based approach to word sense disambiguation, IEEE Transactions on Pattern Analysis and Machine Intelligence 27(7) (July 2005), 1075–1086.

9. Koppula, Padmaja R.B. and Koppula S.R., Hybrid approaches for word sense disambiguation: A survey, International Journal of Applied Engineering Research 10(23) (2015), 43891–43895. ISSN 0973–4562.

10. Pal A.R., Saha and Pal, A knowledge based methodology for word sense disambiguation for low resource language, Advances in Computational Sciences and Technology 10(2) (2017), 267–283. ISSN 0973–6107, ©Research India Publications http://www.ripublication.com.

11. Koppula and Padmaja Rani D.R.B., Word sense disambiguation using knowledge based approach in Regional Language, Journal of Advanced Research in Dynamical and Control System 5 (2018), 109–111.

12. A Knowledge-Based Approach to Word Sense Disambiguation by distributional selection and semantic features, Mokhtar Billami (LIF).

13. Nameh M.S., Fakhrahmad and Jahromi M.Z., A new approach to word sense disambiguation based on context similarity, Proceedings of the World Congress on Engineering I (2011).

14. Mittal and Jain, Word sense disambiguation method using semantic similarity measures and OWA operator, ICTACT Journal on Soft Computing: Special Issue on Soft-Computing Theory, Application and Implication in Engineering and Technology 5(2) (January 2015).

15. Parameswarappa and Narayana V.N., Kannada word sense disambiguation using decision list, International Journal of Emerging Trends & Technology in Computer Science (IJETTCS) 2(3) (May–June 2013), 272–278.

16. Diana M.C. and Carroll, Disambiguating nouns, verbs, and adjectives using automatically acquired selectional preferences, Computational Linguistics 29(4) (2003), 639–654.

17. Sinha and Mihalcea, Unsupervised graph-based word sense disambiguation using measures of word semantic similarity, in: Proceedings of the IEEE International Conference on Semantic Computing (ICSC 2007), Irvine, CA, USA, 2007.

18. Hughes and Ramage, Lexical semantic relatedness with random graph walks, in: Proceedings of EMNLP-CoNLL-2007, 2007, pp. 581–589.

19. Xiaojie and Matsumoto, Chinese word sense disambiguation by combining pseudo training data, Proceedings of The International Conference on Natural Language Processing and Knowledge Engineering, 2003, pp. 138–143.

20. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.65.6828&rep=rep1&type=pdf, date: 14/05/2015.

21. http://www.aclweb.org/anthology/P12-1029, date: 14/05/2015.

22. https://www.comp.nus.edu.sg/~nght/pubs/esair11.pdf, date: 14/05/2015.

23. http://cui.unige.ch/isi/reports/2008/CLEF2008-LNCS.pdf, date: 14/05/2015.

24. Jain and Yadav, Measuring context-meaning for open class words in Hindi language, Sixth International Conference on Contemporary Computing, IEEE, 2013, pp. 173–178. ISBN: 978-1-4673-5114-0.

25. Véronis, Hyperlex: Lexical cartography for information retrieval, Comput Speech Lang 18(3) (2004), 223–252.

26. Schutze, Automatic word sense discrimination, Computat Ling 24(1) (1998), 97–124.

27. Niu, Srihari and Li, Word independent context pair classification model for word sense disambiguation, in: Proceedings of the 9th Conference on Computational Natural Language Learning (CoNLL, Ann Arbor, MI), 2005.

28. Pal A.R. and Saha, Word sense disambiguation: A survey, International Journal of Control Theory and Computer Modeling (IJCTCM) 5(3) (July 2015).

29. Singh R.L., Ghosh, Nongmeikapam and Bandyopadhyay, A decision tree based word sense disambiguation system in Manipuri language, Advanced Computing: An International Journal (ACIJ) 5(4) (July 2014), 17–22.

30. Kumar and Khanna, Natural language engineering: The study of word sense disambiguation in Punjabi, Research Cell: An International Journal of Engineering Sciences 1 (July 2011), 230–238. ISSN: 2229–6913.

31. http://arxiv.org/pdf/cs/0007010.pdf, date: 14/05/2015.

32. http://www.aclweb.org/anthology/S01-1017, date: 14/05/2015.

33. http://www.academia.edu/5135515/Decision_List_Algorithm_for_WSD_for_Telugu_NLP.

34. http://cse.iitkgp.ac.in/~ayand/ICON-2013_submission_36.pdf, date: 14/05/2015.

35. http://shodhganga.inflibnet.ac.in:8080/jspui/bitstream/10603/34324/12/12_chapter%203.pdf, date: 14/05/2015.

36. http://wwwusers.di.uniroma1.it/~navigli/pubs/AIIA_2011_DiMarco_Navigli.pdf, date: 14/05/2015.