Bilingual Corpus-based Hybrid POS Tagger for Low Resource Tamil Language: A Statistical approach

Abstract

In India, most Science and Technology resources are available in English. Developing an Automatic Language Translation Engine from English (source language) to Tamil (target language) is essential for people who need technical resources in their native language. The challenges in designing such engines using Natural Language Processing (NLP) tools include lexical, structural, and syntax-level ambiguity. To address these challenges, a Part-Of-Speech (POS) tagger is essential. Verb-framed languages like Tamil, Japanese, and many languages in the Romance, Semitic, and Mayan families are morphologically rich but lack either a large volume of annotated corpora or manually constructed linguistic resources for building a POS tagger. Moreover, Tamil is a low-resource language with high word sense ambiguity and free word order, which makes designing Tamil POS taggers challenging. In this paper, we propose a Hybrid POS tagger algorithm for the Tamil language using Cross-Lingual Transformation Learning techniques. Its core is a novel Mining-based algorithm (MT), which finds the English equivalents of Tamil words from a small English-Tamil bilingual unannotated parallel corpus. To enhance the performance of MT, we developed Tamil language-specific auxiliary algorithms, namely a Keyword-based tagging algorithm (KT) and a Verb pattern-based tagging algorithm (VT). We also developed a Unique pair occurrence tagging algorithm (UT) to tag Tamil-English word pairs that occur only once. Our experiments show that, by improving the context-based bilingual corpus to a bilingual parallel corpus and leaving out one-time occurrence words, the proposed Hybrid POS tagger can tag 81.15% of the words, with 73.51% accuracy and 90.50% precision. The evaluations indicate that our algorithms can generate language resources that can improve the performance of NLP tasks in Tamil.

Keywords

Natural language processing, part-of-speech tagger, sandhi, bilingual parallel corpus, cross-lingual transformation learning

1 Introduction

In NLP, the POS tagger plays an important role. Many NLP applications, such as speech recognition, machine learning information processing, information retrieval, question answering, machine translation, and word sense disambiguation, use POS tags as part of their processing. The literature shows that work on POS taggers falls into two major categories:

A. Rule-based tagger

A Constraint Grammar consists of a sequence of sub-grammars, each consisting of a set of constraints that set context conditions. This type of tagger generally gives an accuracy of more than 90%. However, developing it requires sound grammatical knowledge of the language to be tagged as well as knowledge of computational linguistics. The rule-based tagger [4] is an error-driven transformation-based tagger. Initially, the algorithm assigns a tag to each word based on the dictionary, morphological rules and their probabilities, capitalization, various prefix or suffix strings, etc. After all word tokens have been provisionally tagged, contextual rules that examine small amounts of context are applied iteratively to correct the tags until a threshold is reached.

B. Tagger using Learning-based algorithm

The machine acquires the knowledge by training itself on the tagged corpus. Then this knowledge is used to classify the word as a particular POS. The Learning-based algorithm can be classified as:

a. Stochastic-based Tagger

The supervised stochastic techniques automatically assign a POS to a word based on the probability that a word belongs to a particular tag or based on the probability of a word being a tag based on a sequence of preceding or succeeding words. The unsupervised stochastic techniques do not require a pre-tagged corpus, but instead, use advanced computational methods to induce tag sets and transformation rules.

The POS taggers developed using the statistical method are based on probabilistic measures. The probabilities are calculated using unigram, bigram, trigram, and n-gram methods [11]. For many morphologically rich languages like Japanese and Arabic, the Viterbi algorithm is used for tagging and disambiguation [18, 31]. The Hidden Markov Model (HMM) and Maximum Entropy (ME) based stochastic taggers described in [28] for Bengali are simple and efficient for automatic tagging even when the amount of available annotated text is small, and they achieved much higher accuracy than the naïve baseline model.

The supervised system developed by [7] is based on a Conditional Random Field classifier for the task of POS tagging on Code-Mixed Indian social media Text. This system could successfully assign coarse as well as fine-grained POS tag labels for three different language pairs, such as English-Hindi, English-Bengali, and English-Telugu on three different social media platforms -Twitter, Facebook & WhatsApp.

The cross-lingual approach [9] used graph-based label propagation for knowledge transfer. The structure of the word [34] becomes more apparent by combining cues from multiple languages. The fundamental idea of this work is that the patterns of ambiguity inherent in POS tag assignments differ across languages. At the lexical level, a word with POS tag ambiguity in one language may correspond to an unambiguous word in other languages. The paper [36] utilized word embedding with anchor-based label propagation to improve the accuracy of cross-lingual part-of-speech tagging under the graph-based framework.

b. Neural Networks

The machine learning or deep learning method requires a huge volume of the annotated corpus in a particular language to train the model at word level and character level. After training, the model predicts the tag of the words in the same language in which it has been trained. The tagging has been done in various languages such as Nepali [35], Malayalam [1], and Kannada [30] using Support Vector Machines. In [22], the model is created using two Recurrent Neural Networks. The first network makes the POS tagging of each sentence, while the second one indicates for each word whether it is a component of the subject or the predicate. The model using Bidirectional Long Short-Term Memory [16] showed noticeably better performance when the source language and the target languages belong to the same language family, and competitively performed with the highest average accuracies for target languages in different families.

2 Related work

In this section, we present an overview of the literature on automated hybrid part-of-speech tagging methods. Syntactic structure in the field of linguistics was first studied extensively by Chomsky [5]. Harris [13] explored POS tagging in his early work, where tagging was done manually using a defined set of hand-prepared rules. Several rule-based systems have since been developed to improve accuracy and efficiency. However, developing rule-based systems required a lot of manual effort in framing the grammatical rules and was time-consuming. Statistical and machine learning-based approaches were then introduced for the POS tagging task, and these approaches have been widely used for their simplicity and language independence. For example, Cutting et al. [10] proposed an HMM model that used about 500,000 words of tagged text from the Brown Corpus and a raw corpus along with the training data. The Maximum Entropy model for part-of-speech tagging was introduced by Ratnaparkhi [26]. Lafferty et al. [20] used Conditional Random Fields (CRF) for POS tagging. Neural networks were used for part-of-speech tagging in [6, 15].

The remaining part of this section summarizes the work that has been carried out in the Tamil Language. A linear programming approach for POS tagging in the Tamil Language [8] has been demonstrated using Support Vector Machine (SVM) methodology. As the grammatically tagged corpus is required to develop the SVM model, they have designed their tagset consisting of 32 tags for preparing the annotated corpus for Tamil. They prepared a corpus by collecting corpora from Dinamani newspaper, Yahoo Tamil news, online Tamil short stories, etc. The annotated corpora are not publicly available. AUKBC-Tamil POS Corpus2016 v1 (AUKBC website [3]) is developed by the Computational Linguistic Research Group (CLRG), AU-KBC Research Centre, MIT Campus of Anna University. The Corpus was tagged by using Conditional Random Field and Bureau of Indian Standards Tagset.

The paper [23] proposed a tagger using a morpheme-based language model. In [29], the authors built a rule-based morphological analyzer and POS tagger via Projection and Induction techniques, which improve on rule-based POS tagging. A semi-supervised rule mining approach using morphological features [24] has been employed for Hindi, Tamil, and Telugu languages. The work has used a combination of small annotated and untagged training data to build a classifier model using a concept of context-based association rule mining that worked as context-based tagging rules. The paper [12] has presented a pattern-based bootstrapping approach using only a small set of POS labeled suffix context patterns. The patterns consist of a stem and a sequence of suffixes, obtained by segmentation using a manually created suffix list.

Human annotators [21] can correct the automatically annotated corpus with less effort, and the corrected annotated data set can be used iteratively to re-train the tagger. Thus, graph-based semi-supervised approaches are particularly useful for the POS tagging of low-resource languages such as Tamil. The POS Tagging [25] for the women’s health-related documents is implemented and tested for 53 documents by using Naïve Bayes’ classification method.

The approaches that have been carried out in POS tagging either require deep linguistic knowledge of the language to frame the grammatical rules or require a large volume of the pre-tagged corpus to train the model. The grammatical rules thus framed are not a common rule that could be applied to any language, because the structural ambiguity, lexical ambiguity, word order form, etc., differs from one language to another language. The other requirement that is not much addressed in the existing work is the domain knowledge of the words to be tagged.

The key idea in our proposed work is how to develop the POS tagging system for the Tamil language by eliminating the requirements such as the need for grammatical rules and the need for a large volume of the pre-tagged corpus, which is an overhead to the existing system. Our major contributions to the proposed work focus on building the POS tagging system 1. By applying the Cross-Lingual Transformation Learning technique for transferring the annotations of English words to their equivalent Tamil words and 2. By applying the association rule mining approach in a suitable way such that its support and confidence are used to find the equivalent of Tamil words in English. The above two approaches are language-independent. As our work specifically focuses on Tamil words, we developed a pattern-matching algorithm such as a Keyword-based algorithm and Verb-based algorithm to enhance the system with the use of created word feature set. We also developed an algorithm to find and tag the one-time occurrence of Tamil-English pair words.

3 The proposed approach

As Tamil is a morphologically rich language, framing the lexical rules of the language to construct a rule-based tagger is a tedious process. Tamil is also a low-resource language because few pre-annotated corpora are available. As learning-based algorithms require a pre-annotated corpus, creating such a corpus manually takes a lot of effort. In this setting, the proposed work seeks to develop a POS tagging system for the Tamil language and to examine how the system contributes to overall effectiveness, measured by the percentage of tagged words, accuracy, and precision.

The proposed work starts with preparing the parallel corpus. Three parallel corpora are used to compare the characteristics of the datasets. The first corpus is taken from the Tatoeba Project, the second is prepared manually from a school textbook, and the third is prepared with the help of the Google Translation Engine. All these corpora are small, with fewer than 200 lines each. The proposed approach predicts the POS tags of Tamil words without requiring either extensive grammar rules or a pre-annotated corpus. The proposed hybrid tagger consists of a Keyword-based Tagger (KT), a Verb pattern-based Tagger (VT), a Unique pair occurrence Tagger (UT), and a Mining-based Tagger (MT). The primary tagger, MT, works on the association rule mining technique and uses the maximum likelihood estimation method to predict the POS tag of a word. A significant property of MT is that it is not language-specific. The performance of MT is improved using the auxiliary taggers KT, VT, and UT. The auxiliary taggers not only increase the number of tagged words, accuracy, and precision but also reduce the computation time of MT. The auxiliary taggers KT and VT are language-specific. The effectiveness of the proposed algorithm on the Tamil tagged corpus is measured in terms of the percentage of tagged words, accuracy, and precision, and its performance is compared across the datasets.

The System architecture is shown in Fig. 1. The English-Tamil corpus is given as input to the system. After it is pre-processed, the proposed hybrid POS tagger tags the Tamil words with the help of a Word Feature set and annotated English corpus. The POS tagged Tamil corpus is obtained as the output of the system along with a bilingual dictionary.

Fig. 1

Proposed system architecture.

3.1 Data input

A corpus is a representative subset of a language with an adequate size so that it is useful for linguistic analysis. The input to the system is a bilingual parallel corpus in which one language is English and the other is the language whose words are to be tagged. As the proposed work tags Tamil words, it requires an English-Tamil parallel corpus in which each English sentence is paired with its Tamil translation. Parallel corpora of literature or religious books, parallel content mined from bilingual websites, etc., can be used as input to the system.

3.1.1 Data pre-processing – sandhi removal

The bilingual parallel corpus, which contains both Tamil and English sentences, is split into sentences and cleaned by removing punctuation, and then the English and Tamil texts are separated. Each word of both languages is tokenized. Sandhi is a cover term for a wide variety of sound changes that occur at morpheme or word boundaries. Sandhi can be either internal, occurring at morpheme boundaries within words, or external, occurring at word boundaries. Between two words, there are four possibilities for writing those words based on their pronunciation: a letter may be inserted, deleted, or altered, or the words may join naturally without any change. The proposed work considers the sandhi that has been inserted at the word boundary. Removing this sandhi allows the proposed hybrid tagger to predict more words correctly. Fig. 2 shows an example.

Fig. 2

(a) The word ‘him’ appears in all four sentences. Its Tamil translation ‘’ is highlighted. One of the sandhi characters at the word boundary joins ‘’ with the next word in the Tamil sentences. (b) The sandhi letters are removed from the Tamil word. Wherever ‘’ appears in a Tamil sentence, its translated sentence has ‘him’, so the proposed tagger is more likely to predict ‘him’ as the equivalent of ‘’.

Algorithm 1: Removal of Sandhi in the Tamil Text
procedure remove_sandhi()
  sandhi_letters={}
  for each sentence S_i from Tamil Sentences S, do
    for each word T_j in the sentence S_i, do
      if T_j ends with one of sandhi_letters and T_{j+1} starts with a letter formed by one of sandhi_letters and vowels
        Remove sandhi from T_j
      end if
    end for
  end for
end procedure remove_sandhi()
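Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the set of sandhi letters did not survive in the listing above, so the four consonants used here (ka, ca, ta, pa) are an assumption, as is the requirement that the next word begin with the same consonant followed by a vowel. The function name and its `sandhi_consonants` parameter are hypothetical.

```python
PULLI = "\u0bcd"  # Tamil virama; a consonant followed by it has no vowel


def remove_sandhi(sentences, sandhi_consonants):
    """Strip a word-final sandhi consonant when the next word starts
    with the same consonant carrying a vowel (assumed matching rule)."""
    cleaned = []
    for words in sentences:
        out = list(words)
        for i in range(len(out) - 1):
            word, nxt = out[i], out[i + 1]
            for c in sandhi_consonants:
                ending = c + PULLI
                # next word must begin with consonant + vowel, not a bare consonant
                next_ok = nxt.startswith(c) and not nxt.startswith(ending)
                if word.endswith(ending) and len(word) > len(ending) and next_ok:
                    out[i] = word[: -len(ending)]
                    break
        cleaned.append(out)
    return cleaned


# Assumed sandhi consonants: ka, ca, ta, pa (not taken from the paper)
ASSUMED_SANDHI = ["\u0b95", "\u0b9a", "\u0ba4", "\u0baa"]

# Illustrative sentence: "avanai-p paartthen"; the final "p" is an inserted sandhi
sentences = [["அவனைப்", "பார்த்தேன்"]]
print(remove_sandhi(sentences, ASSUMED_SANDHI))
```

Passing the letter set as a parameter keeps the sketch usable with whichever sandhi list the tagger is actually configured with.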

3.2 Word feature set

3.2.1 Preparation of POS annotated Tamil Keyword

A list of frequently occurring Tamil words from normal/conventional contexts is prepared with their English translations, and each word is assigned a suitable POS tag from the NLTK POS tag list, as shown in Table 1. The KT uses this feature set to tag words by direct mapping, without additional computation.

Table 1
For the defined list of Tamil Words, a suitable POS tag from the NLTK POS tag list is assigned. The translation of Tamil words in English and the description of the NLTK POS tag is also shown.

3.2.2 POS annotated Tamil Verb Pattern matched word

Tamil is an agglutinative language in which suffixes are added to a nominal or a verbal lexical root to convey grammatical information such as person, number, and case. Most of these affixes can be derivational or inflectional. A list of suffixes with which verbal words end is prepared, as shown in Table 2, and each suffix is assigned a suitable NLTK POS tag. VT uses this feature set to tag words by suffix pattern matching.

Table 2
Verb suffix patterns of Tamil words are identified and assigned suitable NLTK POS tags for those patterns.

3.3 Annotation of English Words

There are many POS taggers in NLTK Library namely Unigram Tagger, Bigram Tagger, Trigram Tagger, Perceptron Tagger, Brill Tagger, Conditional Random Fields Tagger, and Classifier Based POS Tagger. Among these taggers, the results from [27] show the highest accuracy up to 88.7% with the perceptron tagger having 80% of the train data and dropping to 1.6% when using half of the data as the training set. Hence, the perceptron tagger is used to tag the English words of our corpus.

4 Hybrid POS Tagger

The proposed hybrid tagger consists of KT, VT, UT, and MT. These taggers are invoked in the order given in Algorithm 2 so that the Tamil words left untagged by one tagger are passed to the next.

Algorithm 2: Proposed Hybrid POS Tagger
procedure make_Hybrid_Tagger()
  Tag based on Keyword-based Method
  Tag based on Verb Suffix Pattern Method
  Tag based on Unique pair occurrence Method
  Tag based on Mining Method using Confidence with Maximum Likelihood Estimation of Target Words given Source Words
end procedure make_Hybrid_Tagger()
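The cascade in Algorithm 2 can be illustrated with a short Python sketch. It assumes, for simplicity, that each tagger is a function from a list of untagged words to a word-to-tag dictionary and that tagging is done per word type; `hybrid_tag` and the two toy taggers (which use English placeholder words) are hypothetical stand-ins, not the taggers of this paper.

```python
def hybrid_tag(words, taggers):
    """Run taggers in order; each one only receives words that are still untagged."""
    tags = {}
    for tagger in taggers:
        untagged = [w for w in words if w not in tags]
        for word, tag in tagger(untagged).items():
            tags.setdefault(word, tag)  # an earlier tagger's decision is never overwritten
    return {w: tags.get(w, "UNK") for w in words}


def toy_keyword_tagger(words):
    # Hypothetical lookup table standing in for the keyword feature set
    table = {"we": "PRP", "red": "JJ"}
    return {w: table[w] for w in words if w in table}


def toy_suffix_tagger(words):
    # Hypothetical suffix rule standing in for the verb pattern feature set
    return {w: "VBD" for w in words if w.endswith("ed")}


result = hybrid_tag(["we", "red", "walked", "sky"],
                    [toy_keyword_tagger, toy_suffix_tagger])
print(result)
```

Because "red" is tagged by the first tagger, it is never passed to the suffix tagger, which mirrors how the earlier taggers reduce the work left for the later ones.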

4.1 Keyword-based Tagger

The KT treats each Tamil word as a keyword and searches the lookup table of the POS Annotated Tamil Keyword feature set for a direct mapping. If the keyword is in the feature set table, the KT assigns its corresponding feature, i.e., the tag, to the keyword. This direct mapping is done for the closed set of POS tags (Personal Pronoun, Possessive Pronoun, Adverb, Cardinal Digit, Determiner, Coordinating Conjunction, and Preposition / Subordinating Conjunction), each with a finite number of words. As these words occur frequently in sentences with little or no ambiguity in their POS tags, finding their tags does not require much computation. Thus, direct mapping reduces the computation required by MT.

Algorithm 3: Keyword based Tagger
procedure Tagger_using_Keywords
  Let X be the POS annotated Tamil Keyword Feature Set
  for each sentence S_i in Tamil Corpus, do
    for each word W_j in S_i, do
      for each Keyword K_k in X, do
        if W_j equals K_k
          Assign the POS tag of K_k to W_j
        end if
      end for
    end for
  end for
end procedure
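A minimal Python sketch of the keyword lookup follows. The entries in `ILLUSTRATIVE_KEYWORDS` are illustrative assumptions and are not taken from Table 1; the function name is hypothetical. A dictionary lookup replaces the innermost loop of Algorithm 3 while producing the same mapping.

```python
def keyword_tag(sentences, keyword_tags):
    """Return, per sentence, a list of (word, tag) pairs; tag is None when
    the word is not in the keyword feature set."""
    return [[(w, keyword_tags.get(w)) for w in words] for words in sentences]


# Illustrative entries only (assumed): "I", "he", "and"
ILLUSTRATIVE_KEYWORDS = {"நான்": "PRP", "அவன்": "PRP", "மற்றும்": "CC"}

tagged = keyword_tag([["நான்", "xyz"]], ILLUSTRATIVE_KEYWORDS)
print(tagged)
```

Words left with `None` are the ones that would be handed on to the next tagger in the cascade.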

4.2 Verb Suffix pattern-based Tagger

The VT uses the POS Annotated Tamil Verb Pattern matched word feature set to tag the words that match a verbal suffix pattern. The feature set has a finite number of verb suffix patterns with their corresponding tags. In Tamil, a verb with suffixes can convey many features such as tense, person, number, and gender. However, as the proposed work is restricted to the NLTK tag list for generic purposes, it considers only the tense of the verb.

Algorithm 4: Verb based Tagger
procedure Tagger_using_Verbs
  Let X be the POS annotated Tamil Verb Pattern matched word Feature Set
  for each sentence S_i in Tamil Corpus, do
    for each word W_j in S_i, do
      for each verb suffix pattern VP_k in X, do
        if W_j ends with VP_k
          Assign the POS tag of VP_k to W_j
        end if
      end for
    end for
  end for
end procedure
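The suffix matching of Algorithm 4 can be sketched as below. The two suffix patterns and their tags are assumptions chosen for illustration and are not the entries of Table 2; trying longer suffixes first is also a design choice of this sketch rather than something the algorithm above specifies.

```python
def verb_suffix_tag(word, suffix_tags):
    """Return the tag of the longest matching verb suffix, or None."""
    for suffix in sorted(suffix_tags, key=len, reverse=True):
        # require a non-empty stem before the suffix
        if word.endswith(suffix) and len(word) > len(suffix):
            return suffix_tags[suffix]
    return None


# Hypothetical patterns: a present-tense and a past-tense first-person ending
ILLUSTRATIVE_SUFFIXES = {"கிறேன்": "VBP", "த்தேன்": "VBD"}

past = verb_suffix_tag("பார்த்தேன்", ILLUSTRATIVE_SUFFIXES)
present = verb_suffix_tag("படிக்கிறேன்", ILLUSTRATIVE_SUFFIXES)
print(past, present)
```

Checking the longest suffix first avoids a short pattern shadowing a more specific one when both match the same word.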

4.3 Unique pair occurrence Tagger

The UT is useful when a Tamil word and its English translation each occur only once in the whole corpus, in the same sentence pair. NLP applications generally do not consider words that occur only once in the corpus. The results from the General Text Corpus show that around 5% of such words are tagged with 4% accuracy.

Algorithm 5: Unique Pair based Tagger
procedure Tagger_using_UniquePair
  for each sentence S_i in Bilingual Corpus, do
    Let T_i be the sentence in Tamil from Corpus
    Let E_i be the sentence in English from Corpus
    if T_i has any Unique word U_ti and E_i has any Unique word U_ei
      Assign the POS tag of U_ei to U_ti
    end if
  end for
end procedure
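A hedged Python sketch of Algorithm 5 is given below. It assumes that a "unique" word is one that occurs exactly once in its side of the whole corpus, and that a tag is transferred only when a sentence pair has exactly one such word on each side; the algorithm above does not say how several unique words in one pair are resolved, so that restriction is an assumption. The `t_` tokens are placeholders for Tamil words.

```python
from collections import Counter


def unique_pair_tag(pairs, english_tags):
    """pairs: list of (tamil_words, english_words) per sentence pair.
    english_tags: dict mapping an English word to its POS tag."""
    ta_freq = Counter(w for ta, _ in pairs for w in ta)
    en_freq = Counter(w for _, en in pairs for w in en)
    tagged = {}
    for ta, en in pairs:
        ta_once = [w for w in ta if ta_freq[w] == 1]
        en_once = [w for w in en if en_freq[w] == 1]
        # transfer only when the pairing is unambiguous (assumption of this sketch)
        if len(ta_once) == 1 and len(en_once) == 1:
            tagged[ta_once[0]] = (en_once[0], english_tags.get(en_once[0], "UNK"))
    return tagged


pairs = [
    (["t_we", "t_where"], ["where", "are", "we"]),
    (["t_we", "t_go"], ["we", "are", "going"]),
]
english_tags = {"where": "WRB", "going": "VBG"}
unique_tags = unique_pair_tag(pairs, english_tags)
print(unique_tags)
```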

4.4 Mining based Tagger (MT)

4.4.1 The mining approach

The primary tagger MT employs the association rule mining technique and the maximum likelihood estimation method. The association rule mining technique uses support and confidence to measure how strongly a Tamil word in a sentence is associated with each English word in its translated sentence. The maximum likelihood estimation method uses these association values to find the most likely associated English word for a given Tamil word and then assigns the POS tag of that English word to the Tamil word. This technique is not language-specific.

The significance of using association rule mining for tagging words in an NLP system is that the words of any language can be tagged without requiring either lexical rules or a pre-annotated corpus. The only requirement is that the input to the system be a bilingual parallel corpus in which the language of the words to be tagged acts as the source language and its English translation acts as the target language.

Consider a sentence from the parallel corpus, shown in Fig. 3. English is a fixed word-order language, whereas Tamil word order is flexible: the order may change freely without affecting the grammatical meaning of the sentence. Fig. 3 shows two word-order versions of the Tamil sentence for the English sentence “Where are we”. The word “” appears as the first word in the first diagram and as the second word in the second diagram, and its meaning in English is “where”. The association of the word “” with the three words “where”, “are”, and “we” is found using the mining method. The most likely associated word of “” is found by the maximum likelihood estimation method. After “where” is predicted as the most likely associated word, the POS tag of “where” is assigned to “”.

Fig. 3

Panels (a) and (b) show the free word-order forms of the Tamil words in the sentence.

4.4.2 Mathematical representation of mining tagger

Let the language of the words to be tagged be the source language and its English translation be the target language.

Association rule mining. Association rule mining is one of the data mining techniques that finds interesting association or correlation relationships among a large set of data items. This discovery of interesting association or correlation relationships among huge amounts of transaction records helps in finding the association relationship of words in the source language with the target language. Let us consider the following assumptions for representing the Association rule in terms of mathematical representation,

C be the Bilingual Sentence Pair Parallel Corpus

S and T be the set of Parallel Corpus in the source language and target language

L, M, and N be the number of sentences in parallel Corpus, number of unique words in S, and number of unique words in T respectively

S_W={sw₁, sw₂, . . . , sw_M} be a set of words in S

T_W={tw₁, tw₂, . . . ,tw_N} be a set of words in T

SW_i={sw₁, sw₂, . . . , sw_m} be a set of words in an i^th sentence of source language S

TW_i={tw₁, tw₂, . . . ,tw_n} be a set of words in i^th sentence of target language T

SS_i and TS_i be the i^th sentences of S and T and are the set of words such that SS_i⊆SW_i and TS_i⊆TW_i

SP_i ={SS_i, TS_i} be the i^th line of parallel sentence pair of SS_i with its translation in TS_i

Support(X→Y) is the ratio of the number of sentence pairs SP_i that contain a word X in SS_i and a word Y in TS_i to the total number of sentences. The number of such sentence pairs is the support count σ(X ∪ Y):

$\sigma(X \cup Y) = | \{ {SP}_{i} \mid X \subseteq {SS}_{i}, Y \subseteq {TS}_{i}, {SP}_{i} \in C \} |$ (1)

where |.| denotes the number of sentence pairs in the set, which indicates how frequently the words X and Y appear together in SP_i in the corpus.

Let X, Y be the words in SS_i and TS_i where i∈L and an association rule of words in S and T is an implication expression of the form X⇒Y where X and Y are disjoint word sets i.e., X∩Y=Ø.

Support determines how often a rule applies to a given word set, while confidence determines how frequently a word Y appears in the translation of a sentence that contains the word X. Confidence suggests a co-occurrence relationship between words in the antecedent and consequent of the rule. For a given rule X ⟶ Y, the higher the confidence, the more likely it is for Y to be present in the parallel sentence pairs that contain X. It also estimates the conditional probability of Y given X.

$Support(X \rightarrow Y) = \sigma(X \cup Y) / L$ (2)

$Confidence(X \rightarrow Y) = Support(X \rightarrow Y) / Support(X) = \sigma(X \cup Y) / \sigma(X)$ (3)
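As a worked illustration with hypothetical counts (not taken from the corpora used in this paper), suppose the corpus has L = 4 sentence pairs, the source word X occurs in 3 of them, and X and the target word Y occur together in 2 of them. Then:

$Support(X \rightarrow Y) = \sigma(X \cup Y) / L = 2 / 4 = 0.5$

$Support(X) = \sigma(X) / L = 3 / 4 = 0.75$

$Confidence(X \rightarrow Y) = 0.5 / 0.75 = 2 / 3 \approx 0.67$

The factor L cancels in the ratio, so the confidence can be computed directly from the two counts as 2 / 3.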

Maximum likelihood estimation. The likelihood of a word SW_ij, the j^th word of the i^th sentence from S, being translated as TW_ij is found by computing the conditional probability of every word in the i^th sentence of the target language given SW_ij. The TW_ij that has the highest probability given SW_ij is likely to be the translated word of SW_ij:

$SW_{ij} \text{ is equivalent to } TW_{ij}, \text{ if } P(TW_{ij} \mid SW_{ij}) = \max_{k} P(TW_{ik} \mid SW_{ij})$ (4)

The POS tag of the most likely associated target word TW_ij is assigned to the source word SW_ij.

$Tag(SW_{ij}) \Leftarrow Tag(TW_{ij})$ (5)

Algorithm to find an association between the source and target word. Algorithm 6 uses the confidence to check the association between a source word(S) in Tamil and a target word (T) in English. Confidence is the ratio of Support(S ->T) to Support(S). Support(S->T) is the number of sentences in the corpus that has the word S in a Tamil sentence and word T in its translated sentence. Support(S) is the number of sentences in the corpus that has the word S in the Tamil sentence. The time complexity is given by O(LW), where L is the total number of lines in either Tamil or English corpus and W is W_T+W_E, the total number of words in both Tamil and English corpus.

Algorithm 6: Finding the association between a Tamil word and an English word using confidence
procedure Mining_using_Confidence (S, T)
  count <- 0 (number of sentence pairs that contain both S and T)
  S_Count <- 0 (number of sentence pairs that contain S)
  for each sentence from Bilingual Corpus, do
    X1 <- 0, X2 <- 0
    for each word T1 in the sentence in the target language, do
      if T1 equals T, then
        X1 <- 1
      end if
    end for
    for each word S1 in the sentence in the source language, do
      if S1 equals S, then
        X2 <- 1
      end if
    end for
    if X2 equals 1, then
      Increment S_Count by 1
    end if
    if X1 and X2 both equal 1, then
      Increment count by 1
    end if
  end for
  prob <- count / S_Count
  Return prob
end procedure Mining_using_Confidence
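The confidence computation of Algorithm 6 can be sketched in Python as below. The function name and the `t_` placeholder tokens standing in for Tamil words are illustrative assumptions; both counts are taken per sentence pair, following the definitions of Support(S->T) and Support(S) given above, and a source word that never occurs is given confidence 0 to avoid division by zero.

```python
def confidence(pairs, s, t):
    """pairs: list of (source_words, target_words) per sentence pair.
    Returns sigma(s and t together) / sigma(s), counted in sentence pairs."""
    both = 0      # pairs containing s on the source side and t on the target side
    s_count = 0   # pairs containing s on the source side
    for src, tgt in pairs:
        if s in src:
            s_count += 1
            if t in tgt:
                both += 1
    return both / s_count if s_count else 0.0


pairs = [
    (["t_where", "t_we"], ["where", "are", "we"]),
    (["t_where", "t_he"], ["where", "is", "he"]),
    (["t_we", "t_go"], ["we", "go"]),
]
print(confidence(pairs, "t_where", "where"))
print(confidence(pairs, "t_where", "are"))
```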

Algorithm to find more likely associated word. Algorithm 7 finds the more likely translated word of a Tamil source word. Among the target words in a sentence, the word that has more confidence with the source word is more likely to have an association with the source word. Then the POS tag of the target word is assigned to the source word. The above process is repeated for all source words of all sentences in the Tamil Corpus. The time complexity is O(L* W_T *W_E).

Time complexity. The complexity of the overall mining tagger that uses Algorithm 6 and Algorithm 7 is O(L* W_T *W_E *L *(W_T+W_E)), approximated to O(L²W²). To reduce this cost, the algorithm skips the source words that are already POS tagged by the first three algorithms, KT, VT, and UT. After the association between a Tamil word of a sentence and all of the English words in its translated sentence is found using Algorithm 6, the most likely associated (translated) word of the Tamil word is found by Algorithm 7.

Algorithm 7: Finding more likely associated word of Tamil using Maximum Likelihood Estimation Method
procedure Assign_POS_Tag_using_MLE
  for each sentence from Bilingual Corpus, do
    for each word S in the sentence in the source language, do
      Initialize X as an empty array
      for each word T in the sentence in the target language, do
        Find Mining_using_Confidence (S, T) and store the value in the array X
      end for
      if X has one max value having index i
        Assign the tag of the target word i to the source word
      else
        Assign the tag of the source word to be UNK (Unknown)
      end if
    end for
  end for
end procedure Assign_POS_Tag_using_MLE
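A self-contained Python sketch of Algorithm 7 follows. The function names, the `t_` placeholder tokens, and the toy tag dictionary are assumptions made for illustration. As in the algorithm, a source word receives UNK when no single target word has the maximum confidence.

```python
def confidence(pairs, s, t):
    """sigma(s and t together) / sigma(s), counted in sentence pairs."""
    s_count = sum(1 for src, _ in pairs if s in src)
    both = sum(1 for src, tgt in pairs if s in src and t in tgt)
    return both / s_count if s_count else 0.0


def assign_tags_mle(pairs, target_tags):
    """Tag every source token with the tag of its single most confident
    target word in the same sentence pair, or UNK on a tie."""
    tagged_corpus = []
    for src, tgt in pairs:
        tagged_sentence = []
        candidates = sorted(set(tgt))
        for s in src:
            scores = [confidence(pairs, s, t) for t in candidates]
            best = max(scores, default=0.0)
            winners = [t for t, sc in zip(candidates, scores) if sc == best]
            if best > 0 and len(winners) == 1:
                tag = target_tags.get(winners[0], "UNK")
            else:
                tag = "UNK"
            tagged_sentence.append((s, tag))
        tagged_corpus.append(tagged_sentence)
    return tagged_corpus


pairs = [
    (["t_where", "t_we"], ["where", "are", "we"]),
    (["t_where", "t_he"], ["where", "is", "he"]),
    (["t_we", "t_go"], ["we", "go"]),
]
target_tags = {"where": "WRB", "are": "VBP", "we": "PRP",
               "is": "VBZ", "he": "PRP", "go": "VBP"}
mle_result = assign_tags_mle(pairs, target_tags)
print(mle_result)
```

In this toy corpus the words `t_he` and `t_go` occur only once, so every target word in their sentence has confidence 1 and they are left as UNK. This is the situation the Unique pair occurrence Tagger is meant to handle before MT runs.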

4.5 Data output

The proposed Hybrid POS tagger outputs the POS annotated Tamil corpus shown in Table 3. As the MT works by mapping Tamil words to their English translations, the proposed system also creates an annotated bilingual dictionary, shown in Table 4.

Table 3
Three sentences of tagged Tamil words from Corpus III are shown along with their parallel tagged text in English. UNK in the Tamil tagged corpus marks words whose tag is unknown. For the same three sentences, the table shows in bold the words tagged by each of the taggers in the proposed hybrid tagger.

Table 4

Tamil-English dictionary created by the proposed mining tagger for the words that appear in the three sentences of Table 3. Each entry has the Tamil search word, its associated word in English, and its POS tag. The remarks, which are entered manually, indicate whether the associated English word is correct.

5 System implementation

5.1 Dataset

The dataset is a collection of tab-delimited bilingual sentence-pair parallel corpora. Each parallel corpus contains original texts in English and their translations into Tamil. The corpora are neither aligned nor annotated. Three datasets are used:

A. General Text Corpus (Corpus I)

The English-Tamil sentence pair dataset from the Tatoeba Project collection was downloaded from https://www.manythings.org/anki/. This corpus has 199 lines, 987 English words, and 777 Tamil words, and contains colloquial sentence pairs. Of the 777 words in the Tamil sentences, 407 occur only once in the corpus. The remaining 370 words appear more than once.

B. Crop Production Corpus (Corpus II)

The corpus is prepared manually from the English version and the translated Tamil version of the Standard 8 Science textbook of the Tamil Nadu State Board syllabus. Though the sentence pairs of Corpus II are not exact translations of each other, they carry a similar contextual meaning. There are 1582 English words and 1206 Tamil words in 124 parallel sentences, of which 663 Tamil words appear only once.

5.1.3 Crop Production Corpus – Google Translated Text (Corpus III)

Because the Crop Production Corpus does not contain an exact translation for each sentence, Corpus III was built by taking the English sentences from the Crop Production Corpus and translating them line by line with the Google Translate (GT) engine. It has the same 124 English sentences as Corpus II, each paired with its GT-translated Tamil sentence. There are 1144 Tamil words, of which 555 occur only once.
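The share of one-time occurrence words in each corpus follows directly from the counts reported above. The sketch below is a minimal illustration and assumes that the reported totals are word tokens; the helper `one_time_share` and the toy token list are hypothetical.

```python
from collections import Counter


def one_time_share(tokens):
    """Return (one-time words, total tokens, share) for a list of tokens."""
    counts = Counter(tokens)
    once = sum(1 for c in counts.values() if c == 1)
    return once, len(tokens), once / len(tokens)


# Counts reported in the text: (one-time Tamil words, total Tamil words).
reported = {
    "Corpus I": (407, 777),
    "Corpus II": (663, 1206),
    "Corpus III": (555, 1144),
}
shares = {name: round(100 * once / total, 1) for name, (once, total) in reported.items()}
for name, share in shares.items():
    print(f"{name}: {share}% of Tamil words occur only once")

# Toy usage on a token list: "b" and "c" each occur once.
toy = one_time_share(["a", "b", "a", "c"])
```

With the reported counts this gives roughly 52.4%, 55.0%, and 48.5% for Corpus I, II, and III respectively.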

5.2 Performance evaluation metrics

Accuracy is defined as the number of correctly tagged Tamil words relative to the total number of Tamil words in the dataset/corpus.

$$\text{Accuracy} = \frac{\text{No. of Tamil words correctly tagged with POS}}{\text{Total number of Tamil words in the corpus}}$$

Precision is defined as the number of correctly tagged Tamil words relative to the total number of tagged words.

$$\text{Precision} = \frac{\text{No. of Tamil words correctly tagged with POS}}{\text{Total number of tagged Tamil words}}$$
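The two definitions can be written as a small helper. The sketch below is an illustration under stated assumptions: the function name `tagging_metrics` and the example counts are hypothetical, and the percentage of tagged words is computed as tagged words over total words.

```python
def tagging_metrics(total_words, tagged_words, correct_words):
    """Return (tagged share, accuracy, precision) as fractions."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    tagged_share = tagged_words / total_words
    accuracy = correct_words / total_words
    # Precision is undefined when nothing is tagged; report 0.0 in that case.
    precision = correct_words / tagged_words if tagged_words else 0.0
    return tagged_share, accuracy, precision


# Hypothetical counts, chosen only to illustrate the definitions.
tagged_share, accuracy, precision = tagging_metrics(
    total_words=200, tagged_words=150, correct_words=120
)
print(f"tagged {tagged_share:.2%}, accuracy {accuracy:.2%}, precision {precision:.2%}")
```

Under these definitions, accuracy equals the tagged share multiplied by the precision. The Corpus I figures reported later are consistent with this: 92.97% × 83.14% ≈ 77.3%.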

5.3 Experimental analysis

The English words were tagged using the NLTK POS tagger library. The proposed hybrid POS tagger algorithm was then run on the POS-annotated English sentences to tag the Tamil words. The performance of each component tagger was studied to determine its contribution, and the performance of the overall hybrid system was analysed on each corpus. The analysis was carried out on all three corpora, both including and excluding one-time occurrence words. In all three datasets, nearly 50% of the Tamil words appear only once. The precision of KT is 100% in all three corpora, as it is purely based on word matching.

5.3.1 General Text Corpus (Corpus I)

The evaluation metrics (percentage of tagged words, accuracy, and precision) of each tagger in the proposed hybrid tagger system are shown in Fig. 4(a), and the same metrics after leaving out one-time occurrence words are shown in Fig. 4(b). Corpus I consists mostly of casual, conversational words, many of which are pronouns and verbs. KT and VT play a major role in identifying and tagging the Tamil words, as shown in Fig. 4(a) and 4(b). The drop in the percentage of words tagged by VT in Fig. 4(b) indicates that UT is not the only tagger that tags one-time occurrence words: VT also tags one-time verbs that match a suffix pattern.

Fig. 4(a)

The output of the proposed taggers KT, VT, UT, and MT. The precision of KT, VT, and UT is better. 4% of one-time tagged words are correctly tagged by UT. 21% of words are not tagged.

Fig. 4(b)

Output of the proposed taggers KT, VT, and MT after leaving out one-time occurrence words. Compared with Fig. 4(a), there is a drop in the output of VT and a reduction in the percentage of untagged words. The output of VT and MT is improved.

The overall performance of the proposed hybrid system on Corpus I is shown in Fig. 5(a). The proposed hybrid algorithm performs better when one-time occurrence words are left out than when they are included. Leaving out one-time occurrence words means that only a subset of the words in the corpus is considered. On this subset, the percentage of tagged words is 92.97%, the accuracy is 77.3%, and the precision is 83.14%.

Fig. 5(a)

Comparison of the overall performance of the proposed hybrid system including and excluding one-time occurrence words of the General Text Corpus.

The precision of tagged Tamil words for the entire Corpus I and after leaving out one-time occurrence words is shown in Fig. 6(a). The words are classified into 16–18 categories of POS tags, such as PRP (personal pronoun), NN (noun), RB (adverb), and JJ (adjective). The X-axis shows the POS tags, as defined in the NLTK library, into which the words are classified. The POS tags stand for the following: CC - coordinating conjunction, CD - cardinal digit, DT - determiner, EX - existential there, FW - foreign word, IN - preposition/subordinating conjunction, JJ - adjective (e.g., ‘big’), MD - modal (could, will), NN - singular noun, NNS - plural noun, NNP - singular proper noun, NNPS - plural proper noun, PDT - predeterminer, POS - possessive ending, PRP - personal pronoun, PRP$ - possessive pronoun, RB - adverb, VB - verb, base form, VBD - verb, past tense, VBG - verb, gerund/present participle, VBN - verb, past participle, VBP - verb, singular present, VBZ - verb, third person singular, WDT - wh-determiner (which), WP - wh-pronoun (who, what), WP$ - possessive wh-pronoun (whose), WRB - wh-adverb (where, when). The proportions of precision by POS tag category are shown in Fig. 6(b) and 6(c).

Fig. 6(a)

The precision of tagged Tamil words under the various categories of POS tags.

Fig. 6(b) and (c)

The proportion of precision of the various POS tags for the entire corpus and after leaving out one-time occurrence words.

5.3.2 Crop Production Corpus (Corpus II)

Corpus II is not an exact parallel corpus; it contains context-based translated text. Neither the number of words nor the structural pattern of the parallel sentences corresponds. A sentence in Tamil or English might correspond to one or more sentences in its translation. This type of corpus is a challenging dataset for any type of tagger. However, such corpora are readily available in large volumes on the Web and social media in almost all areas, such as politics, health care, movies, religious books, and law.

As this corpus has few words that are in the proposed word feature set, the percentages of words tagged by KT and VT are low, as shown in Fig. 7(a) and 7(b). The corpus also has few words suitable for UT: only 0.17% of words are tagged by UT. Here MT plays the major role in assigning POS tags to Tamil words. Figures 7(a) and 7(b) show that approximately two-thirds of the correctly identified tags come from MT.

Fig. 7(a)

The output of the proposed taggers KT, VT, UT, and MT. The precision of KT and VT is 100%. The accuracy of MT is 17.58% and its precision is 61.63%. 59.2% of words are not tagged.

Fig. 7(b)

Output of the proposed taggers KT, VT, and MT after leaving out one-time occurrence words. Compared with Fig. 7(a), the accuracy and precision of MT increase to 34.99% and 67.62%.

Despite the challenges posed by the morphological richness of the Tamil language, the quality of the parallel text available for computation, the small volume of the corpus, and the nearly 55% of Tamil words that occur only once in this corpus, the proposed hybrid algorithm produced good results, as shown in Fig. 8. After leaving out one-time occurrence words, the percentage of tagged words increases from 40.8% to 68.32%, the accuracy from 29.69% to 51.57%, and the precision from 72.76% to 75.47%.

Fig. 8

Comparison of the overall performance of the proposed hybrid system including and excluding one-time occurrence words of the Crop Production Corpus (Corpus II).

Fig. 9(a)

The precision of tagged words under the various categories of POS tags. Fig. 9(b) and 9(c) show the proportions of precision of the various POS tags.

5.3.3 Crop Production Corpus – Google Translated Text (Corpus III)

In Corpus III, because the sentence-by-sentence translation is produced by the GT engine, the corpus is better suited to the computation than Corpus II. Although 48.5% of the words in the corpus occur only once, the proposed hybrid tagger, after leaving out one-time occurrence words, can predict 81.15% of words with 73.51% accuracy and 90.50% precision, as shown in Fig. 11. Fig. 10(a) and (b) show that 50.1% and 18.9% of words, respectively, are still not tagged. This is due to the insufficient volume of the dataset. All taggers in the proposed hybrid tagger perform better than on Corpus II (Fig. 8). The proportions of POS tag categories are shown in Fig. 12.

Fig. 10(a)

Around 50% of words are tagged.

Fig. 10(b)

Around 82% of words are tagged.

Fig. 11

Precision is around 90% which means 90% of tagged words are correct.

Fig. 12(a)

The precision of tagged words under the various categories of POS tags. Fig. 12(b) and 12(c) show the proportions of precision of the various POS tags.

5.4 Performance Analysis

5.4.1 Comparison among Corpus I, II, and III

Fig. 13(a) and (b) show that leaving out the one-time occurrence words in Corpus I, II, and III improves the percentage of tagged words, the accuracy, and the precision by a reasonable amount, and that around 19 types of POS tags are found in the corpora. MT plays a significant role in correctly identifying the POS tags in Corpus II and III, where it accounts for more than 50% of the identified tags. In Corpus I, KT, VT, and MT contribute more or less equally to identifying the tags. The one-time tagger (UT) contributes very little in all three corpora, as they contain little data suited to this tagger.

Fig. 13

Performance comparison.

The proposed algorithm also produces a bilingual English-Tamil dictionary of translated words with their POS tags. Tables 4 and 5 show the bilingual dictionaries created from Corpus III and Corpus I.

Table 5

Bilingual Dictionary created from Corpus I

5.4.2 Analysis of Corpus III with the created dictionary

The 18.85% of words left untagged in Corpus III are analysed in Table 6. These 18.85% correspond to 111 words, of which 51 are found in the created dictionary. Of these 51 words, 36 have either a suitable context or a translated English word associated with the correct POS tag, so these 36 untagged words could be tagged using the created dictionary. Doing so increases the accuracy from 73.51% to 79.63%.

Table 6
Analysis of untagged words in Corpus III

No. of words not tagged	111
No. of words not available in the dictionary	51
No. of words available in the dictionary with no suitable context/translated word	14
No. of words available in the dictionary with suitable context/translated word	36
No. of tagged words	433
Total	469
Accuracy	79.63%
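The accuracy figures in Table 6 can be reproduced from the reported counts. The short check below is a sketch under two assumptions that are not stated explicitly in the table: that the denominator is the 1144 Tamil words of Corpus III minus its 555 one-time occurrence words, and that the 433 "tagged words" are the correctly tagged ones.

```python
# Assumption: the evaluation set excludes one-time occurrence words.
total_words = 1144 - 555          # 589 words remain
untagged = 111
correct_before = 433              # assumed to be the correctly tagged words
recovered = 36                    # untagged words recovered from the dictionary

untagged_share = round(100 * untagged / total_words, 2)
accuracy_before = round(100 * correct_before / total_words, 2)
accuracy_after = round(100 * (correct_before + recovered) / total_words, 2)

print(f"untagged: {untagged_share}%")
print(f"accuracy before dictionary lookup: {accuracy_before}%")
print(f"accuracy after dictionary lookup: {accuracy_after}%")
```

Under these assumptions the three values come out as 18.85%, 73.51%, and 79.63%, matching the figures quoted in the text.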

5.4.3 Comparative analysis with other models

The performance of the proposed tagger is compared with that of other POS taggers for languages such as English, Hindi, Malayalam, Kannada, Tamil, and Sanskrit. The accuracy of the proposed tagger on all three corpora is shown in Table 7. The accuracy of the system depends on both the volume and the characteristics of the dataset. The accuracy of the proposed algorithm on Corpus II is 51.57% because Corpus II is a context-based bilingual corpus. When the same dataset is converted into a bilingual parallel corpus in Corpus III, the accuracy improves to 73.51%. The accuracy of the proposed algorithm is lower than that of the other taggers. However, the proposed tagger uses a very small dataset, less than 10% of the size of the datasets used by the other models, and works without requiring a tagged corpus. Given these constraints, the accuracy of the proposed tagger is acceptable.

Table 7
Comparison of Proposed Hybrid POS Tagger with other Models

Model	Dataset	Accuracy (%)
Wang et al. (2015) - BiLSTM recurrent neural network-based POS Tagger [37]	Penn Treebank WSJ test set - 910K words training and 129K words test tokens	97.4
Huang et al. (2015) - Model combining BiLSTM with the CRF model [15]	Penn TreeBank dataset - Training data consists of 9,50,011 tokens	97.55
Shrivastava and Bhattacharyya (2008) - Hindi POS Tagger based on HMM approach [32]	Tagged corpus of size 81,751 tokens	93.12
Antony et al. (2010) - POS tagger for Malayalam using SVM [1]	Manually tagged corpus consisting of 1,80,000 words labeled with 29 tags for training	94
Antony and Soman (2010) - SVM based POS tagger for Kannada [2]	Training corpus consisting of 54,000 words	86
Junaida and Babu (2021) - Deep learning techniques for Malayalam POS tagging [17]	TDIL dataset comprising approximately 31,000 sentences, 7,00,000 words	87.05
Krishnan et al. (2017) - Bi-LSTM neural network model in Tamil POS Tagging [19]	33,216 words training and 3723 words test data	86.45
Soman et al. (2018) - POS Tagger for Sanskrit using different deep learning algorithms [34]	34,270 words training and 4231 words test data	97.86
Our Proposed Hybrid POS Tagger for Tamil (after leaving one-time occurrence words)	Corpus I dataset – 199 parallel sentences, 987 English words, 777 Tamil words	77.3
	Corpus II dataset – 124 parallel sentences, 1582 English words, 1206 Tamil words	51.57
	Corpus III dataset – 124 parallel sentences, 1582 English words, 1144 Tamil words	73.51

6 Applying the approach of Mining based Tagger in the Hindi Language

The UT and MT algorithms of the proposed hybrid tagger are language-independent. UT does not play a major role for Corpus I, II, and III, as these corpora do not contain data suitable for this tagger. The proposed MT algorithm was run on an English-Hindi corpus to identify the POS tags of Hindi words. This corpus was downloaded from https://www.manythings.org/anki/ and contains 2867 lines of various lengths. The majority of the words were tagged correctly, as shown in Table 8.
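The language independence of a mining-based tagger comes from relying only on co-occurrence across sentence pairs, not on target-language rules. The sketch below illustrates that general idea and is not the authors' MT algorithm: it scores each target word against English words with a Dice coefficient (a stand-in chosen here for simplicity), skips one-time words, and copies the tag of the best-scoring English word. The function `mine_tags`, the `min_count` parameter, and the romanized toy tokens are all hypothetical.

```python
from collections import Counter, defaultdict


def mine_tags(pairs, min_count=2):
    """Transfer POS tags from tagged English words to target-language words.

    pairs: list of (tagged_english, target_tokens), where tagged_english is a
    list of (word, tag) tuples and target_tokens is a list of strings.
    Returns {target_word: (english_word, tag)}.
    """
    en_count = Counter()          # number of sentences containing each English word
    tgt_count = Counter()         # number of sentences containing each target word
    cooc = defaultdict(Counter)   # cooc[target][english] = shared sentence count
    tag_of = {}                   # simplification: first tag seen per English word

    for tagged_english, target_tokens in pairs:
        en_words = {word.lower() for word, _ in tagged_english}
        for word, tag in tagged_english:
            tag_of.setdefault(word.lower(), tag)
        tgt_words = set(target_tokens)
        en_count.update(en_words)
        tgt_count.update(tgt_words)
        for tgt in tgt_words:
            cooc[tgt].update(en_words)

    mined = {}
    for tgt, partners in cooc.items():
        if tgt_count[tgt] < min_count:
            continue  # a one-time word gives too little evidence to pick a partner

        def dice(en):
            return 2 * partners[en] / (tgt_count[tgt] + en_count[en])

        best = max(partners, key=lambda en: (dice(en), en))
        mined[tgt] = (best, tag_of[best])
    return mined


# Toy data: romanized placeholder tokens, not real corpus lines.
toy_pairs = [
    ([("I", "PRP"), ("drink", "VBP"), ("water", "NN")], ["naan", "thanneer", "kudikkiren"]),
    ([("I", "PRP"), ("drink", "VBP"), ("milk", "NN")], ["naan", "paal", "kudikkiren"]),
    ([("water", "NN"), ("is", "VBZ"), ("cold", "JJ")], ["thanneer", "kulir"]),
    ([("milk", "NN"), ("is", "VBZ"), ("hot", "JJ")], ["paal", "soodu"]),
    ([("I", "PRP"), ("sleep", "VBP")], ["naan", "thoongukiren"]),
]
mined = mine_tags(toy_pairs)
for word, (english, tag) in sorted(mined.items()):
    print(word, "->", english, tag)
```

Nothing in this procedure refers to the target language itself, which is why the same code path could be applied to Tamil or Hindi once the English side has been tagged.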

Table 8
Corpus tagged using the proposed algorithm. Nine lines from the corpus are shown. The English words are tagged by the NLTK POS tag library. The Hindi words are tagged using the MT of the proposed hybrid tagger.

7 Conclusion

The proposed hybrid POS tagger comprises four taggers: KT, VT, UT, and MT. Because KT and VT are built on the word feature set of a language, these two algorithms are language-specific. UT is useful when the sentences in both languages have a unique word. MT is developed using the cross-lingual information retrieval technique and the association rule mining technique with the maximum likelihood estimation method to tag the words; it is language-independent. As this work focuses mainly on tagging Tamil words, an English-Tamil bilingual corpus is required. Three corpora were used to study the performance of the proposed hybrid tagger. Corpus I was downloaded from the web. Corpus II and III were prepared from the Standard 8 Tamil Nadu textbook and the Google Translate engine.

The proposed hybrid algorithm was run on these three corpora with and without one-time occurrence words, and its accuracy and precision were measured. The algorithm performs better on all three corpora when one-time occurrence words are left out. In Corpus I, the proposed KT and VT, with their defined word feature sets, play the major role, correctly tagging 65% of the tagged Tamil words. In Corpus II and III, by contrast, the proposed MT plays the major role in identifying the tags of Tamil words: MT alone accounts for 69% and 62% of the correctly tagged words in Corpus II and Corpus III, respectively. The performance obtained on Corpus II is improved on Corpus III by converting the context-based bilingual corpus of Corpus II into the bilingual parallel corpus of Corpus III. After leaving out one-time occurrence words, the proposed hybrid POS tagger can predict 81.15% of words, with 73.51% accuracy and 90.50% precision. The Tamil words were classified into nearly 20 different types of POS tags.

The proposed hybrid algorithm also produces a bilingual dictionary as a by-product. This dictionary could be used to tag the untagged words of the same corpus as part of an iterative loop. The proposed MT algorithm was also used to tag Hindi words, and the majority of the tagged words were found to be correct.

The proposed hybrid algorithm is a novel approach that uses a mining technique on a bilingual corpus to tag words. It could be used for languages that lack a large volume of annotated corpora and for languages that lack manually crafted linguistic resources sufficient for building NLP applications. The proposed algorithm produces good accuracy with high precision even on a small corpus. The proposed hybrid POS tagger can therefore be a preferable choice for such languages.

Acknowledgment

This work is funded by the All India Council for Technical Education under Research Promotion Scheme (AICTE-RPS).

References

1. Antony P.J., Santhanu Mohan, Soman K.P., SVM Based Part of Speech Tagger for Malayalam, IEEE International Conference on Recent Trends in Information, Telecommunication and Computing (2010). https://doi.org/10.1109/ITC.2010.86.

2. Antony P.J. and Soman K.P., Kernel based part of speech tagger for Kannada, International Conference on Machine Learning and Cybernetics, IEEE 4 (2010), 2139–2144.

3. AUKBC website: http://www.au-kbc.org/.

4. Brill, A simple rule-based part of speech tagger, ANLC ’92 Proceedings of the Third Conference on Applied Natural Language Processing (1992), 152–155. https://doi.org/10.3115/974499.974526.

5. Chomsky, Syntactic Structures, The Hague: Mouton (1957).

6. Cicero Dos Santos, Bianca Zadrozny, Learning character-level representations for part-of-speech tagging, Proceedings of the 31st International Conference on Machine Learning, PMLR 32(2) (2014), 1818–1826.

7. Deepak Gupta, Shubham Tripathi, Asif Ekbal, Pushpak Bhattacharyya, SMPOST: Parts of Speech Tagger for Code-Mixed Indic Social Media Text, Computation and Language (cs.CL) (2017). https://arxiv.org/abs/1702.00167.

8. Dhanalakshmi, Anand Kumar, Shivapratap, Soman K.P., Rajendran, Tamil POS Tagging using Linear Programming, International Journal of Recent Trends in Engineering 1(2) (2009). https://www.semanticscholar.org/paper/Tamil-POS-Tagging-using-Linear-Programming-Dhanalakshmi-Kumar/18a4e319cb0093be3cb9cf9408ee55ad0fe2b44f.

9. Dipanjan Das, Slav Petrov, Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (2011), 600–609, Portland, Oregon, June 19-24. https://aclanthology.org/P11-1061.

10. Doug Cutting, Julian Kupiec, Jan Pedersen, Penelope Sibun, A Practical Part-of-Speech Tagger, Proceedings of ANLP-92, Trento, Italy (1992).

11. Eugene Charniak, Statistical language learning, Language 73(3) (1997), 588–590. https://doi.org/10.2307/415888.

12. Ganesh, Parthasarathi, Geetha T.V., Balaji, Pattern-based bootstrapping technique for Tamil POS tagging, In Mining Intelligence and Knowledge Exploration, Lecture Notes in Computer Science (2014), 256–267, Springer. https://doi.org/10.1007/978-3-319-13817-6_25.

13. Zellig Harris, String analysis of language structure, The Hague: Mouton (1962).

14. Helmut Schmid, Part-of-speech tagging with neural networks, COLING ’94: Proceedings of the 15th Conference on Computational Linguistics 1 (1994). https://doi.org/10.3115/991886.991915.

15. Huang, Xu, Yu, Bidirectional LSTM-CRF models for sequence tagging, arXiv preprint (2015).

16. Joo-Kyung Kim, Young-Bum Kim, Ruhi Sarikaya, Eric Fosler-Lussier, Cross-Lingual Transfer Learning for POS Tagging without Cross-Lingual Resources, Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics (2017), 2832–2838. https://doi.org/10.18653/v1/D17-1302.

17. Junaida M.K., Babu A.P., A Deep Learning Approach to Malayalam Parts of Speech Tagging, In Second International Conference on Networks and Advances in Computational Technologies, Springer, Cham (2021), 243–250.

18. Jun’ichi Kazama, Yusuke Miyao, Jun’ichi Tsujii, A maximum entropy tagger with unsupervised hidden Markov models, Journal of Natural Language Processing 11(4) (2001), 333–340. https://doi.org/10.5715/jnlp.11.4_3.

19. Krishnan Gokul, Pooja, Anand Kumar, Soman, Character based bidirectional LSTM for disambiguating Tamil part-of-speech categories, Int. J. Control Theory Appl (2017), 229–235.

20. Lafferty, McCallum, Pereira, Conditional random fields: Probabilistic models for segmenting and labeling sequence data, Proc. 18th International Conf. on Machine Learning (2001).

21. Mokanarangan Thayaparan, Surangika Ranathunga, Uthayasanker Thayasivam, Graph-Based Semi-Supervised Learning for Tamil POS Tagging, Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018) (2018), 3955–3960. https://aclanthology.org/L18-1624.pdf.

22. Muñoz-Valero, International Journal of Computational Intelligence Systems 13(1) (2020), 333–340. https://doi.org/10.2991/ijcis.d.200527.005.

23. Pandian S.L., Geetha T.V., CRF models for Tamil part of speech tagging and chunking, In Proceedings of the 22nd International Conference on Computer Processing of Oriental Languages, Language Technology for the Knowledge-based Economy (2009), 11–22, Springer-Verlag. https://doi.org/10.1007/978-3-642-00831-3_2.

24. Pratibha Rani, Vikram Pudi, Dipti Misra Sharma, A semi-supervised associative classification method for POS tagging, Int J Data Sci Anal 1(2) (2016), 123–136. https://doi.org/10.1007/s41060-016-0010-5.

25. Rajasekar and Udhayakumar, POS Tagging Using Naïve Bayes Algorithm For Tamil, International Journal of Scientific & Technology Research 9(02), ISSN 2277-8616. http://www.ijstr.org/final-print/feb2020/Pos-Tagging-Using-Nave-Bayes-Algorithm-For-Tamil.pdf.

26. Ratnaparkhi, A maximum entropy model for part-of-speech tagging, In Proceedings of the Conference on Empirical Methods in Natural Language Processing EMNLP-96 (1996).

27. Ritu Banga, Pulkit Mehndiratta, Tagging Efficiency Analysis on Part of Speech Taggers, International Conference on Information Technology, IEEE (2017), 264–267. https://doi.org/10.1109/ICIT.2017.57.

28. Sandipan Dandapat, Sudeshna Sarkar, Anupam Basu, Automatic Part-of-Speech Tagging for Bengali: An Approach for Morphologically Rich Languages in a Poor Resource Scenario, Proceedings of the ACL 2007 Demo and Poster Sessions (2007), 221–224. https://aclanthology.org/P07-2056.

29. Selvam, Natarajan A.M., Improvement of rule-based morphological analysis and POS Tagging in the Tamil Language via Projection and Induction Techniques, International Journal of Computers (2009). https://scholar.google.co.in/citations?view_op=view_citation&hl=en&user=Wfh3P5UAAAAJ&citation_for_view=Wfh3P5UAAAAJ:d1gkVwhDpl0C.

30. Shambhavi B.R., Ramakanth Kumar, Kannada part-of-speech tagging with probabilistic classifiers, International Journal of Computer Applications (0975 – 888) 48(17) (2012), 26–30. https://doi.org/10.1007/s41060-016-0010-5.

31. Shereen Khoja, APT: Arabic Part-of-speech Tagger, Proceedings of the Student Workshop at the Second Meeting of the North American Chapter of the Association for Computational Linguistics (2001). https://www.semanticscholar.org/paper/APT%3A-Arabic-Part-of-speech-Tagger-Khoja/4072d185e733726fca0861398f23b03d84eaf2a8.

32. Shrivastava Manish, Bhattacharyya Pushpak, Hindi POS tagger using naive stemming: Harnessing morphological information without extensive linguistic knowledge, International Conference on NLP, Pune, India (2008).

33. Soman K.P., Premjith, Prabaharan Poornachandran, A deep learning based Part-of-Speech (POS) tagger for Sanskrit language by embedding character level features, In Proceedings of the 10th Annual Meeting of the Forum for Information Retrieval Evaluation, ACM (2018), 56–60.

34. Tahira Naseem, Snyder, Jacob Eisenstein, Regina Barzilay, Multilingual part-of-speech tagging: two unsupervised approaches, Journal of Artificial Intelligence Research (2009), 341–385. https://doi.org/10.1613/jair.2843.

35. Tej Bahadur Shahi, Tank Nath Dhamala, Bikash Balami, Support Vector Machines based Part of Speech Tagging for Nepali Text, 70(24) (2013). https://doi.org/10.5120/12217-8374.

36. Wei Yuan, Lei Wang, Xiao-Fei Sun, Wen-Wen Pan, Jia-Guo Lv, Cross-lingual part-of-speech tagging using word embedding, JVE International Ltd., Vibroengineering Procedia 8 (2016), ISSN 2345-0533. https://www.jvejournals.com/article/17552.

37. Wang, Qian, Soong F.K., He, Zhao, Parts-of-Speech Tagging with Bidirectional Long Short-Term Memory Recurrent Neural Network, arXiv preprint arXiv:1510.06168 (2015).
