Enhancing HMM-based POS tagger for Mizo language

Abstract

The process of associating words with their relevant parts of speech is known as part-of-speech (POS) tagging. Obtaining good tagger performance takes a substantial amount of well-organized data or corpora and significant research on the target language. Mizo is an under-resourced language that needs more research attention in computational linguistics. The limited availability of corpora and relevant literature adds complexity to the task of assigning POS labels to Mizo text. This paper explores two methods to improve the Hidden Markov Model (HMM)-based POS tagger for the Mizo language. The proposed taggers are compared with the baseline HMM tagger and the N-gram taggers on the designed Mizo corpus, which consists of 72,077 manually tagged tokens. The experimental results showed that the two proposed taggers enhanced the HMM-based Mizo POS tagger, achieving 81.52% and 84.29% accuracy, respectively. Moreover, a comprehensive analysis of the performance of the proposed hybrid tagger was conducted, yielding a weighted average precision, recall, and F1-score of 83.09%, 77.88%, and 79.64%, respectively.

Keywords

Hybrid POS tagger; rule-based POS tagger; N-gram tagger; Mizo POS tagger; Hidden Markov Model

1 Introduction

Part-of-speech (POS) tagging is the process of assigning a meaningful description to each individual word in a sentence [1]. The description is referred to as a “tag,” and it represents one of the language’s parts of speech. POS tagging assists in corpus parsing, which is an essential stage in language processing. It is an important step because it resolves ambiguous words in a phrase by giving each word a specific POS label based on its context.

POS tagging has gained importance in various natural language processing (NLP) applications, including information extraction, parsing, chunking, machine translation, question answering, semantic processing, and word sense disambiguation. POS taggers can be developed in a number of ways. Rule-based, statistical, and neural network-based methods are the most often adopted techniques. In the rule-based method, the language’s characters and structures are taken into account when developing rules. The most commonly used statistical approaches are based on Conditional Random Fields, Hidden Markov Models, Maximum Entropy models, and Support Vector Machines. The primary challenges in POS tagging are ambiguity and the handling of unfamiliar words [2]. Occasionally, a word has different meanings with distinct parts of speech depending on the circumstances in which it is used. A POS tagger’s objective is to resolve this ambiguity based on the context in which the word is used. Several POS taggers have been developed for a variety of languages using various approaches.

The Mizo language is spoken mainly by the Mizo people, who live in the state of Mizoram, as their first language. Mizoram, one of the Indian states, shares international borders with Myanmar and Bangladesh, as well as domestic borders with Tripura, Manipur, and Assam. The name ‘Mizo’ is used interchangeably to refer to both the people of Mizoram and the language they speak. The Mizo language belongs to the Tibeto-Burman language family, and there are approximately 8.3 lakh speakers in the country [3].

The Mizo language is currently under-resourced and still in its early stages of development. As a result, there is a scarcity of publicly available resources for computational linguistics pertaining to this language. Numerous studies in language processing have been conducted for various Indian languages, English, and many European languages. However, only a limited number of research studies have focused on computational linguistics in the Mizo language. Furthermore, the United Nations Educational, Scientific and Cultural Organization (UNESCO) has recognized the Mizo language as an endangered language [4]. In such circumstances, it is necessary to give a great deal of research attention to the language. The main goal of this study is to help move the Mizo language forward in the field of computational linguistics.

This study introduces a novel approach for part-of-speech tagging in the Mizo language, combining the HMM-based, N-gram, and rule-based models. The primary contributions of this study encompass two key aspects: Firstly, the development of tailored regular expressions designed specifically for the intricacies of the Mizo language. Secondly, the enhancement of an HMM-based Mizo POS tagger through two innovative strategies. The first strategy involves optimizing the decoder utilized in the process, resulting in improved performance. The second strategy involves the integration of three distinct models, leading to the creation of a highly robust and reliable Mizo POS tagger that can handle a wide range of linguistic variations and complexities.

2 Related works

Several academics have already begun working on building POS taggers, employing a variety of algorithmic approaches. In recent years, considerable effort has been expended on the POS tagging of Indian languages as well. Some of the POS tagging works done for various Indian languages are highlighted below.

One of the earliest articles on POS taggers for the Hindi language was published by Ranjan et al. [5]. It was based on the lexical tags of words. Many new approaches have emerged since that work was first published. A statistical POS tagger for Bengali using the Maximum Entropy (ME) model was presented by Ekbal et al. [6]. The POS tagger was trained with 72,341 words and obtained an overall accuracy of 88.2% on a testing dataset of 20,000 words. Experiments have shown that the different word suffixes, a named entity recognizer, and a lexicon can help with unknown-word difficulties and considerably enhance the POS tagger’s accuracy.

Applying Artificial Neural Networks, Narayan et al. [7] built a POS tagging system for Hindi. The proposed model was compared to other strategies, such as Maximum Entropy-based, Rule-based, and CRF-based taggers. The experimental findings demonstrated that the proposed tagger outperformed all of them. Jahara et al. [8] presented the results of an empirical study of various POS tagging approaches for the Bengali language. Brill combined with the CRF had the highest accuracy of 91.83% of all tagging approaches.

Sharma et al. [9] proposed a bigram HMM-based POS tagger for the Punjabi language. The model was put through its paces on a corpus of 26,479 tokens. This procedure yielded an accuracy of 90.11%. A POS tagger for the Maithili language based on the CRF model was proposed by Priyadarshi et al. [10]. They built a corpus containing around 52K words, manually annotated with the designed tagset. They have experimented with various orthography features in the model and achieved an accuracy of 82.67%.

Jobanputra et al. [11] introduced the use of Long Short-Term Memory (LSTM) for developing a part-of-speech (POS) tagger for the Gujarati language. They reported achieving an accuracy of 95.34% with their proposed approach. Tailor et al. [12] proposed a hybrid technique for Gujarati POS tagging that combines computational linguistic rules and LSTM-based POS tagging. The experimental outcome indicated that adopting language-specific rules improved the statistical tagger. The review paper by Gamit et al. [13] presented various methods of POS tagging in the Gujarati language.

Modi et al. [14] developed a hybrid Hindi POS tagger with 88.15% accuracy using rule-based and probability-based models. Mundotiya et al. [15] introduced an attention-based model for Hindi POS tagging, achieving 98.36% accuracy on the Hindi disease dataset. Dalal et al. [16] created a Hindi POS tagger based on Maximum Entropy Markov Model with 88.4% average accuracy. Swamy et al. [17] designed a Kannada POS tagger using CRF with F1-score, recall, and precision values of 91.4%, 91.6%, and 91.3% respectively.

Daimary et al. [18] introduced an HMM-based Assamese POS tagger with 89.21% accuracy, using a corpus of 2,71,890 words. Pathak et al. [19] presented another Assamese POS tagger based on deep learning, achieving 86.52% accuracy. Singh et al. [20] developed Manipuri POS taggers using CRF and SVC, obtaining accuracies of 72.04% and 74.38%, respectively, with 39,449 training tokens and 8,672 testing tokens. Tham [21] proposed a hybrid POS tagger for Khasi, combining HMM and CRF with 95.29% accuracy. Additionally, Warjri et al. [22] presented a Khasi POS tagger using BiLSTM-CRF and character-based models, achieving a maximum accuracy of 96.98% based on testing data. Vaishali et al. [23] presented a Marathi POS tagging system using a rule-based technique. The experimental result demonstrated an accuracy of 97.56%.

Many more diverse POS tagging works have been highlighted in these papers [24–26]. In conclusion, various approaches, such as rule-based, machine learning, and deep learning techniques, have been extensively employed in POS tagging for Indian languages. As documented in this review, the most notable achievement in terms of accuracy stands at 98.36% [15]. Moreover, publicly available POS tagged corpora for several Indian languages have been published [27].

3 System description

This section discusses the dataset, the tagset, the details of the baseline Hidden Markov Model-based tagger, and the architecture of the two proposed systems for improving the baseline Hidden Markov Model-based tagger.

3.1 Dataset and tagset

The selection of appropriate lexical categories is critical to the development of a tagging system. According to Zikpuia [28], the Mizo language contains 60 different parts of speech, which include subcategories of the major parts of speech. Nevertheless, these parts of speech cannot all be included in their original form in the POS tagging system because the language has many compound phrases that can be classified as a single part of speech. For instance,

Compound noun: Biak in (church), vawk sa (pork), ar tui (egg)

Nounal Adjective: Bawng sa (beef)

Verbal adjective: Naupang mu (sleeping baby), bawng thlun (tied cow)

We combined certain fine-grained tags in this research to improve efficiency and accuracy. This merging process reduces the number of tags while preserving essential linguistic information, enhancing both processing speed and accuracy. For example, material nouns, countable nouns, common nouns, and concrete nouns are classified under the category of Common Nouns. Based on an analysis of the Mizo language’s morphological structure and considering various factors, a comprehensive tag set of 48 tags (as shown in Table 1) has been designed to encompass all grammatical categories in the Mizo corpus. The process involved following the Part-of-Speech Tagging Guidelines for the Penn Treebank Project [29] as a starting point for creating POS tags. These initial tags were then refined by taking into account the specific morphological characteristics of the Mizo language, resulting in the final set of 48 POS tags.

Table 1
List of proposed Mizo tagsets and each tag’s frequency in the corpus

Tag	Description	No. of occurrences	Tag	Description	No. of occurrences
VB	Verb base form	10782	MP	Demonstrative Pronoun	307
CMN	Common Noun	8012	DJJ	Double Adjective	298
RB	Adverb base form	7478	;	Semi colon	258
PSP	Personal Pronoun	7167	RBM	Adverb of Manner	188
PPN	Proper Noun	5991	(	Opening bracket	167
FW	Foreign Word	3902	)	Closing bracket	168
,	Comma	3057	VBN	Verbal Noun	170
PT	Particles	2536	POP	Possessive Pronoun	159
CC	Coordinating Conjunction	2499	RLP	Relative Pronoun	138
JJ	Adjective base form	2442	SJJ	Superlative Adjective	98
.	Full stop	2286	-	Hyphen	93
PPT	Postposition	1912	SRB	Superlative Adverb	88
RBP	Adverb of Place	1746	SF	Suffix	75
AT	Article	1597	NVB	Nounal Verb	57
MJJ	Demonstrative Adjective	1323	CJJ	Comparative Adjective	56
CD	Cardinal Number	1090	UH	Interjection	46
NG	Negation	1072	IJJ	Interrogative Adjective	46
ABN	Abstract Noun	1063	SYM	Symbol	45
MRB	Demonstrative Adverb	866	IRB	Interrogative Adverb	37
RBT	Adverb of Time	719	:	Colon	34
SPRB	Specifying Adverb	523	CRB	Comparative Adverb	38
QM	Quotation Mark	519	DVB	Double Verb	28
DRB	Double Adverb	440	IP	Interrogative Pronoun	27
ET	Date	374	?	Question mark	23

As far as we know, the Mizo language lacks a properly classified corpus, and the dearth of reliable free corpora for the Mizo language necessitated the manual creation of a tagged corpus. Raw digital texts were gathered from a wide variety of web sources and cover a wide range of topics such as sports, music, current events, and articles. With the assistance of linguistic experts, the required pre-processing tasks were performed. These tasks included rectifying spelling errors, standardizing the writing styles of the collected phrases, and eliminating undesired symbols, which ensures the quality and consistency of the data used in the research. The cleaned text was then tokenized into words and manually annotated with the proposed tagset. Table 2 displays statistics of the designed Mizo corpus, which consists of 72,077 tokens (the statistics are given for a 90:10 split into training and testing data).

Table 2

Corpus Statistics

Total no. of words	72,077
No. of training sentences	2,046
No. of testing sentences	228
No. of train tagged words	64,869
No. of test tagged words	7,208
No. of unique words in train set	8,564
No. of unique tags	48

3.2 Hidden Markov model

A Hidden Markov Model (HMM) is a statistical model that depicts probability distributions over a sequence of observed events [2]. It assumes that the system being described is a Markov process with hidden states. To model a problem using a hidden Markov model, both observation sequences and a set of possible states must be available.

In the part-of-speech tagging problem, the observations are the tokens in the sentence, and the hidden states are the POS tags of those tokens. The HMM tagger selects the tag sequence that maximizes the probability for a particular sentence, using counts of past occurrences in the training data to assign probabilities to the current word and tag. The main goal of the HMM is to find the most likely tag sequence $\hat{t}_{1}^{n}$ for a given sequence of words $w_{1}^{n}$ [1]. This can be represented by the following equation:

$\hat{t}_{1}^{n} = \underset{t_{1}^{n}}{\operatorname{argmax}} \, P (t_{1}^{n} \mid w_{1}^{n})$ (1)

Here, $t_{1}^{n}$ is a sequence of tags $(t_{1} \ldots t_{n})$ and $w_{1}^{n}$ is a sequence of words $(w_{1} \ldots w_{n})$. Applying Bayes’ rule for conditional probability and dropping the denominator $P (w_{1}^{n})$, which is the same for every candidate tag sequence, we get the following equation:

$\hat{t}_{1}^{n} = \underset{t_{1}^{n}}{\operatorname{argmax}} \, P (w_{1}^{n} \mid t_{1}^{n}) \, P (t_{1}^{n})$ (2)

Here, $P (w_{1}^{n} \mid t_{1}^{n})$ is referred to as the likelihood of the word sequence, and $P (t_{1}^{n})$ is called the prior probability of the tag sequence. The HMM is predicated on two assumptions. The first assumption states that the probability of a word occurring is determined solely by its own tag:

$P (w_{1}^{n} \mid t_{1}^{n}) \approx \prod_{i = 1}^{n} P (w_{i} \mid t_{i})$ (3)

The second assumption states that the probability of a tag occurring is determined by a fixed number of preceding tags. For this research work, a first-order (bigram) HMM has been considered, in which the probability of a tag is determined solely by the preceding tag:

$P (t_{1}^{n}) \approx \prod_{i = 1}^{n} P (t_{i} \mid t_{i - 1})$ (4)

The following equation is derived by substituting Equation (3) and Equation (4) into Equation (2):

$\hat{t}_{1}^{n} = \underset{t_{1}^{n}}{\operatorname{argmax}} \, P (t_{1}^{n} \mid w_{1}^{n}) \approx \underset{t_{1}^{n}}{\operatorname{argmax}} \prod_{i = 1}^{n} P (w_{i} \mid t_{i}) \, P (t_{i} \mid t_{i - 1})$ (5)

The HMM employs Equation (5) to identify the most likely tag sequence. It contains two kinds of probabilities: $P (w_{i} \mid t_{i})$, the emission probabilities, and $P (t_{i} \mid t_{i - 1})$, the tag transition probabilities.
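As a hedged illustration of Equation (5), consider the three-token sequence ka tel ve with the candidate tag sequence PSP VB RB (tokens and tags as listed in Table 9). Assuming a sentence-start pseudo-tag $\langle s \rangle$ in the role of $t_{0}$, which the text does not define, the score of this candidate expands to:

$P (\text{ka} \mid \text{PSP}) \, P (\text{PSP} \mid \langle s \rangle) \cdot P (\text{tel} \mid \text{VB}) \, P (\text{VB} \mid \text{PSP}) \cdot P (\text{ve} \mid \text{RB}) \, P (\text{RB} \mid \text{VB})$

The tagger compares this product across all candidate tag sequences and keeps the sequence with the largest value.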

Transition probability: The transition probability is the likelihood of a specific tag occurring in a sequence given the preceding tag. The transition probabilities can be calculated using the following equation:

$P (t_{i} \mid t_{i - 1}) = C (t_{i - 1}, t_{i}) / C (t_{i - 1})$ (6)

Here, $C (t_{i - 1}, t_{i})$ denotes the number of times the current tag follows the previous tag in the training corpus, and $C (t_{i - 1})$ is the previous tag’s frequency count in the corpus.

Emission probability: The emission probability is the likelihood of a specific word occurring given its tag, estimated from the number of times the word appears with that tag. The emission probabilities can be calculated using the following equation:

$P (w_{i} \mid t_{i}) = C (t_{i}, w_{i}) / C (t_{i})$ (7)

Here, $C (t_{i}, w_{i})$ denotes the number of times the current tag is associated with the current word, and $C (t_{i})$ denotes the current tag’s frequency in the corpus.
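The counting behind Equations (6) and (7) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the two toy sequences are assembled from tokens in Tables 8 and 9, and the sentence-start pseudo-tag `<s>` and the function names are assumptions introduced here.

```python
from collections import Counter

# Toy tagged sequences (illustrative only; tokens and tags from Tables 8 and 9).
tagged_sents = [
    [("Aizawlah", "RBP"), ("ka", "PSP"), ("bazar", "VBN"), ("dawn", "RB")],
    [("ka", "PSP"), ("tel", "VB"), ("ve", "RB")],
]

START = "<s>"  # ASSUMPTION: pseudo-tag marking the start of a sentence

tag_count = Counter()      # C(t)
bigram_count = Counter()   # C(t_{i-1}, t_i)
emit_count = Counter()     # C(t_i, w_i)

for sent in tagged_sents:
    prev = START
    tag_count[START] += 1
    for word, tag in sent:
        bigram_count[(prev, tag)] += 1
        emit_count[(tag, word)] += 1
        tag_count[tag] += 1
        prev = tag

def transition_prob(prev_tag, tag):
    """Equation (6): C(t_{i-1}, t_i) / C(t_{i-1}); 0.0 for an unseen previous tag."""
    total = tag_count[prev_tag]
    return bigram_count[(prev_tag, tag)] / total if total else 0.0

def emission_prob(word, tag):
    """Equation (7): C(t_i, w_i) / C(t_i); 0.0 for an unseen tag."""
    total = tag_count[tag]
    return emit_count[(tag, word)] / total if total else 0.0

print(transition_prob("PSP", "VB"))  # PSP is followed by VB in one of its two occurrences
print(emission_prob("ka", "PSP"))    # both PSP tokens are the word "ka"
```

These are plain maximum-likelihood estimates without smoothing, so any word absent from the training data receives an emission probability of zero for every tag.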

This baseline HMM model has been utilized for the Mizo POS tagging system that generates a sequence of tags for a particular sentence. With this model, 81.13% accuracy has been obtained on the designed Mizo corpus.

3.3 Proposed method I

This section outlines our first technique for improving an HMM-based tagger for the Mizo language by modifying the decoder.

The Hidden Markov Model (HMM) defines the most probable tag sequence, but finding it by enumerating every candidate sequence takes time that grows exponentially with sentence length. We therefore decode with the Viterbi algorithm, the most commonly used decoding algorithm for HMMs in part-of-speech tagging. It is a dynamic programming technique that determines the most probable sequence of hidden (unobservable) states, referred to as the Viterbi path, for a given sequence of observed events. The Viterbi algorithm takes a sequence of observations, W = (w₁w₂w₃w₄ . . . w_n), as input and returns the most likely state sequence, S = (s₁s₂s₃s₄ . . . s_n), together with its probability.

The Viterbi algorithm works with two probability matrices, one for the transitions and one for the emissions. During the testing phase, the decoder uses the matrix of tag transition probabilities and the matrix of emission probabilities to calculate the most probable sequence of tags for each sentence in the input corpus.
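A compact sketch of baseline Viterbi decoding over the two probability tables is given below. It is an illustration under stated assumptions rather than the decoder used in this work: the dictionaries `trans` and `emit`, the start pseudo-tag `<s>`, and the toy probabilities are hypothetical, and the input sentence is assumed to be non-empty.

```python
def viterbi(words, tags, trans, emit, start="<s>"):
    """Return (best tag sequence, its probability) for a non-empty word list.

    trans[(prev_tag, tag)] and emit[(tag, word)] hold probabilities;
    missing entries are treated as 0.0.
    """
    # best[i][t] = (probability of the best path ending in tag t at position i,
    #               tag at position i-1 on that path)
    best = [{t: (trans.get((start, t), 0.0) * emit.get((t, words[0]), 0.0), None)
             for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            e = emit.get((t, words[i]), 0.0)
            column[t] = max(
                ((best[i - 1][pt][0] * trans.get((pt, t), 0.0) * e, pt) for pt in tags),
                key=lambda pair: pair[0],
            )
        best.append(column)
    # Follow the back-pointers from the best final tag.
    last = max(tags, key=lambda t: best[-1][t][0])
    prob = best[-1][last][0]
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    path.reverse()
    return path, prob

# Hypothetical toy model.
tags = ["PSP", "VB", "RB"]
trans = {("<s>", "PSP"): 1.0, ("PSP", "VB"): 1.0, ("VB", "RB"): 1.0}
emit = {("PSP", "ka"): 1.0, ("VB", "tel"): 0.5, ("RB", "ve"): 0.5}
print(viterbi(["ka", "tel", "ve"], tags, trans, emit))
```

For a sentence of n tokens and T tags, the table has n columns of T cells and each cell inspects T predecessors, so the work is on the order of n·T² rather than Tⁿ. Multiplying raw probabilities can underflow on long sentences; summing log-probabilities is a common remedy.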

In this work, the baseline version of the Viterbi algorithm is modified to handle unknown words, resulting in improved performance of the HMM-based tagger. As mentioned earlier, the Viterbi algorithm uses both emission probabilities and transition probabilities to determine the most probable tag sequence. However, if a word is not found in the training data (that is, it is an unknown word), its emission probability is zero for every tag, which makes the state probability zero. Therefore, when the modified algorithm encounters an unseen word during decoding, it relies exclusively on the transition probability to calculate the state probability and disregards the emission probability. Thus, the algorithm operates in the following manner:

if Word not in Vocabulary then

State_prob = Trans_prob

else

State_prob = Trans_prob * Emi_prob

end if

The algorithm mentioned above can be interpreted as follows: If a word is not present in the vocabulary, then the state probability is equal to the transition probability; otherwise, the state probability equals the product of the transition probability and the emission probability.
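The rule can be sketched as a small scoring function that a Viterbi decoder would call for each candidate tag. The function name, the table layout, and the probabilities below are hypothetical and chosen only for illustration.

```python
def state_prob(word, tag, prev_tag, trans, emit, vocabulary):
    """State probability with the unknown-word fallback described above."""
    trans_prob = trans.get((prev_tag, tag), 0.0)
    if word not in vocabulary:
        # Unknown word: every emission probability would be zero,
        # so only the transition probability is used.
        return trans_prob
    return trans_prob * emit.get((tag, word), 0.0)

# Hypothetical probabilities.
vocabulary = {"ka", "dawn"}
trans = {("PSP", "VB"): 0.6, ("PSP", "ABN"): 0.1}
emit = {("PSP", "ka"): 1.0, ("RB", "dawn"): 0.5}

print(state_prob("hau", "VB", "PSP", trans, emit, vocabulary))   # unknown word
print(state_prob("hau", "ABN", "PSP", trans, emit, vocabulary))  # unknown word
print(state_prob("ka", "VB", "PSP", trans, emit, vocabulary))    # known word, never seen as VB
```

With this fallback, an unknown word no longer zeroes out every path; its tag is chosen by how plausible each tag is after the preceding tag.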

With this simple method, the baseline version of the HMM-based tagger for the Mizo language has been improved, allowing it to produce better results.

3.4 Proposed method II: Hybrid Mizo POS tagger

This section describes another proposed method for enhancing the HMM-based POS tagger for the Mizo language by integrating the Hidden Markov Model (HMM) with the N-gram bigram tagger from the NLTK package and the rule-based tagger.

The N-gram model is one of the most straightforward language models for assigning probabilities to phrases and word sequences. An N-gram tagger first builds a context for the token and then decides which tag should be assigned. This context consists of the current token and the part-of-speech tags of the n-1 tokens that precede it. Figure 1 depicts an example of an N-gram tagger with n=3, where the context for determining the tag $t_{n}$, shaded gray, includes $t_{n - 2}$, $t_{n - 1}$, and $w_{n}$.

Fig.1

Context of the N-gram model.

3.4.1 Building the Mizo rule-based tagger

In order to handle unknown words, a number of regular expressions for analyzing the Mizo language have been designed. The rules have to be carefully crafted because the Mizo language is morphologically complex. The regular-expression tagger designed in this work is based on an analysis of the structure of the language, carried out with language experts and with reference to authoritative Mizo grammar publications [28, 30], and it provides a word-level analysis of the tokens in the string. In addition to the trivial tagging of symbols, conjunctions, articles, and personal pronouns, the rules and the clues they rely on are given below:

- Foreign Word (FW):

If a word contains characters that are not in the Mizo alphabet, such as Q(q), X(x), Y(y)

If a word contains W(w) that is not preceded by A(a)

If a word contains C(c) that is not followed by H(h)

If a word contains G(g) that is not preceded by N(n)

- Proper Noun (PPN):

If a word starts with a capital letter and ends with ‘i’ or ‘a’

If all letters of a word are capital

If a word ends with ‘-in’

- Abstract Noun (ABN):

If a word ends with ‘na’

- Common Noun (CMN):

If a word ends with ‘te’ or ‘ten’

If a word ends with ‘ho’

If a word ends with ‘pui’

If a word ends with ‘tu’

If a word ends with ‘in’

- Demonstrative Pronoun (MP):

If a word ends with ‘ngte’

- Adverb of Place (RBP):

If a word consists of characters only and ends with ‘-ah’ or ‘-a’

If a word ends with ‘ah’

- Adverb of Time (RBT):

If a word consists of digits only and ends with ‘-ah’ or ‘-a’

- Verb base form (VB):

If a word starts with ‘In’ or ‘IN’

If a word ends with ‘san’

If a word ends with ‘tir’

- Date (ET):

If a word is a month or day of the week

- Adjective base form (JJ):

If a word ends with ‘zia’

- Default tag: VB
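A few of the rules above can be sketched as an ordered list of regular expressions. This is only an illustration: it covers a subset of the rules, and the order in which the patterns are tried is an assumption, since the text does not state which rule takes precedence when a word matches several of them.

```python
import re

# (compiled pattern, tag) pairs, tried top to bottom; the first match wins.
# ASSUMPTION: this precedence is illustrative and not taken from the paper.
RULES = [
    (re.compile(r"^[A-Z]+$"), "PPN"),         # all letters capital
    (re.compile(r"[QqXxYy]"), "FW"),          # letters outside the Mizo alphabet
    (re.compile(r"(?<![Aa])[Ww]"), "FW"),     # w not preceded by a
    (re.compile(r"[Cc](?![Hh])"), "FW"),      # c not followed by h
    (re.compile(r"(?<![Nn])[Gg]"), "FW"),     # g not preceded by n
    (re.compile(r"na$"), "ABN"),              # ends with 'na'
    (re.compile(r"^[A-Z].*[ia]$"), "PPN"),    # capitalized, ends with 'i' or 'a'
    (re.compile(r"ah$"), "RBP"),              # ends with 'ah'
    (re.compile(r"^(In|IN)"), "VB"),          # starts with 'In' or 'IN'
    (re.compile(r"zia$"), "JJ"),              # ends with 'zia'
]
DEFAULT_TAG = "VB"

def rule_based_tag(word):
    """Return the tag of the first matching rule, or the default tag."""
    for pattern, tag in RULES:
        if pattern.search(word):
            return tag
    return DEFAULT_TAG

for w in ["WPO", "X-ray", "Mawia", "Remhriatna", "Inkhelh", "hau"]:
    print(w, rule_based_tag(w))
```

Under this assumed ordering, `rule_based_tag("Remhriatna")` yields ABN and `rule_based_tag("hau")` falls through to the default VB. A different ordering would change the outcome for words that satisfy more than one rule, for example capitalized words ending in ‘na’.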

Fig.2

The overall architecture.

3.4.2 The overall architecture of the Mizo hybrid tagger

Figure 2 depicts the overall architecture of the proposed tagger. The HMM tagger and the N-gram bigram tagger are trained with the training data. While determining the most likely tag sequence for a sentence, the HMM-based tagger first assigns tags to the tokens that it can tag with high probability. Any token whose probability falls below the threshold value is passed on to the N-gram bigram backoff tagger. To find the appropriate tag for a given word, the bigram tagger looks up the tuple consisting of the previous tag and the current word. If the bigram tagger fails to identify a suitable tag for the word, the duty is handed off to the unigram tagger.

The unigram tagger employs a basic statistical technique to tag a token by selecting the most frequently encountered tag associated with that token in the annotated training text corpus. The unigram tagger will not be able to label words that are not found in its vocabulary, commonly known as out-of-vocabulary words (or unknown words). These words are then forwarded to the rule-based tagger. Subsequently, the rule-based tagger analyzes these unknown words and attempts to assign them appropriate tags based on the predefined rules. Since VB (Verb) is the most prevalent tag in the corpus used in this study, it has been chosen as the default tag. So, if the rule-based tagger cannot find a suitable rule for tagging a token, the token will be given the default tag "VB (Verb)."
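The backoff chain described above (bigram, then unigram, then rules, then the default tag) can be sketched in plain Python. This is a simplified stand-in and not the NLTK-based implementation: the HMM stage with its probability threshold is omitted, `rule_tag` includes only two of the rules from Section 3.4.1, and all names and the one-sentence training set are assumptions.

```python
from collections import Counter, defaultdict

def train_tables(tagged_sents):
    """Build lookup tables: word -> tag, and (previous tag, word) -> tag."""
    uni = defaultdict(Counter)
    bi = defaultdict(Counter)
    for sent in tagged_sents:
        prev = None  # no previous tag at the start of a sentence
        for word, tag in sent:
            uni[word][tag] += 1
            bi[(prev, word)][tag] += 1
            prev = tag
    def pick(table):
        # Keep only the most frequent tag for each context.
        return {key: counts.most_common(1)[0][0] for key, counts in table.items()}
    return pick(uni), pick(bi)

def rule_tag(word):
    """Stand-in for the rule-based tagger (two rules only)."""
    if word.endswith("na"):
        return "ABN"
    if word.endswith("ah"):
        return "RBP"
    return None

def backoff_tag(words, uni, bi, default="VB"):
    """Tag each word with the first stage that produces a tag."""
    tags, prev = [], None
    for w in words:
        tag = bi.get((prev, w)) or uni.get(w) or rule_tag(w) or default
        tags.append(tag)
        prev = tag
    return tags

train = [[("Aizawlah", "RBP"), ("ka", "PSP"), ("bazar", "VBN"), ("dawn", "RB")]]
uni, bi = train_tables(train)
print(backoff_tag("Dintharah ka hau dawn".split(), uni, bi))
```

In this toy run, "Dintharah" is resolved by a suffix rule, "ka" by the bigram table, "hau" by the default tag, and "dawn" by the unigram table, because the bigram context (VB, dawn) was never observed in the toy training data.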

The designed Mizo-specific regular expressions play a crucial role in the success of this hybrid technique, enabling it to capture unknown words in Mizo text. Although computationally more time-consuming, this approach yields substantial improvements in accuracy.

4 Implementation and result comparison

This section covers the detailed experimental setup, results obtained from different methods, and conclusions drawn from the findings.

We tested and compared the accuracy of the baseline HMM tagger and our proposed taggers on the designed Mizo corpus of 72,077 tokens. Two data sets of varying sizes are employed to compare the performance of the taggers. Set 1 (95:5) has 68,474 words in the training set and 3,603 words in the test set, whereas Set 2 (90:10) has 64,869 words in the training set and 7,208 words in the test set. For comparison purposes, we also evaluated the performance of the NLTK unigram tagger, the NLTK bigram tagger, and the NLTK trigram tagger on the same dataset. The outcomes of the various taggers we have examined are summarized in Table 3.

Table 3
Comparison of accuracy

Name of tagger	Train:test ratio 95:5	Train:test ratio 90:10
NLTK Unigram tagger	79.42%	77.33%
NLTK Bigram tagger	81.41%	80.11%
NLTK Trigram tagger	81.22%	79.50%
Baseline HMM tagger	82.20%	80.44%
Proposed method I	83.36%	81.52%
Proposed hybrid tagger	86.65%	84.29%

Input Features: Based on the various possible combinations of the word and tag context, the primary input features for the POS tagging task have been identified. In order to find plausible tags, the model utilized in this study employs two types of probabilities: transition probabilities and emission probabilities, as well as word-level analyses. The main input features identified for known and unknown words are given in Table 4.

Table 4

Input features

Particulars	Features
Known words	Preceding word, current word, and the previous POS tag
Unknown words	Suffix analysis: whether the word contains a particular suffix
Unknown words	Word content analysis: whether the word contains particular characters, digits, or special symbols

Because our experiments showed that the NLTK bigram tagger performed marginally better than the NLTK trigram tagger, the bigram tagger was selected for integration into the proposed hybrid method, together with the HMM-based tagger and the rule-based tagger. Table 3 shows that Proposed method I improves on the baseline HMM tagger by approximately 1 percentage point, and the proposed hybrid approach improves on it by around 4 percentage points. Every tagger also scores higher on the 95:5 split than on the 90:10 split, which suggests that performance improves as the training set grows. Both proposed techniques improve on the baseline, with the proposed hybrid approach producing the best results (86.65% accuracy on the 95:5 split and 84.29% on the 90:10 split).

We assessed the performance of our hybrid POS tagger by evaluating precision, recall, and F1-score for each tag. The results are presented in Table 5, and Table 6 displays the macro and weighted averages of these metrics. The macro average precision, recall, and F1-score were 78.87%, 76.25%, and 75.95%, respectively. The weighted average precision, recall, and F1-score, which weight each tag by its frequency and therefore account for class imbalance, were 83.09%, 77.88%, and 79.64%, respectively. The macro averages indicate reasonably balanced performance across most tags, although Table 5 shows that a few infrequent tags, such as IP, CRB, and DVB, are tagged with low precision. The weighted averages indicate good overall performance.
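The difference between the macro and the weighted average can be illustrated with a short sketch. The gold and predicted tags below are toy values invented for the example, and the function names are hypothetical; the sketch only shows how the two averages are formed from per-tag scores.

```python
from collections import Counter

def per_tag_scores(gold, pred):
    """Return {tag: (precision, recall, f1, support)} for every gold tag."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1   # p was predicted but is wrong
            fn[g] += 1   # g was missed
    scores = {}
    for tag in set(gold):
        predicted = tp[tag] + fp[tag]
        support = tp[tag] + fn[tag]
        prec = tp[tag] / predicted if predicted else 0.0
        rec = tp[tag] / support if support else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[tag] = (prec, rec, f1, support)
    return scores

def macro_and_weighted(scores, index):
    """Average the metric at `index` (0 precision, 1 recall, 2 F1) two ways."""
    total = sum(s[3] for s in scores.values())
    macro = sum(s[index] for s in scores.values()) / len(scores)
    weighted = sum(s[index] * s[3] for s in scores.values()) / total
    return macro, weighted

# Toy data: three VB tokens and one CMN token.
gold = ["VB", "VB", "VB", "CMN"]
pred = ["VB", "VB", "CMN", "CMN"]
scores = per_tag_scores(gold, pred)
print(macro_and_weighted(scores, 0))  # precision: macro vs. weighted
```

Here the frequent tag VB has precision 1.0 and the rare tag CMN has precision 0.5, so the macro average (0.75) is lower than the weighted average (0.875). The same effect explains why the weighted averages in Table 6 exceed the macro averages.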

Table 5

Precision, recall, and F1-score for each tag

Tags	Precision	Recall	F1-score	Support	Tags	Precision	Recall	F1-score	Support
-	1.00	0.63	0.77	0.0013	MJJ	0.87	0.85	0.86	0.0184
(	1.00	1.00	1.00	0.0023	MP	0.64	0.88	0.74	0.0043
)	0.60	1.00	0.90	0.0023	MRB	0.91	0.65	0.76	0.0120
,	1.00	0.99	0.99	0.0424	NG	0.97	0.86	0.97	0.0149
.	1.00	1.00	1.00	0.0317	NVB	0.47	0.70	0.56	0.0008
:	1.00	1.00	1.00	0.0005	POP	0.94	0.47	0.63	0.0022
;	1.00	1.00	1.00	0.0036	PPN	0.77	0.67	0.72	0.0831
?	1.00	1.00	1.00	0.0003	PPT	0.80	0.94	0.85	0.0265
ABN	0.89	0.91	0.90	0.0147	PSP	0.90	0.78	0.84	0.0994
AT	0.90	0.74	0.82	0.0222	PT	0.67	0.72	0.69	0.0352
CC	0.86	0.75	0.80	0.0347	QM	1.00	0.91	0.95	0.0072
CD	0.56	1.00	0.72	0.0151	RB	0.70	0.89	0.77	0.1038
CJJ	0.51	0.86	0.63	0.0008	RBM	0.90	0.67	0.77	0.0026
CMN	0.83	0.61	0.70	0.1112	RBP	0.91	0.72	0.80	0.0242
CRB	0.19	0.72	0.30	0.0005	RBT	0.91	0.63	0.75	0.0100
DJJ	1.00	1.00	1.00	0.0041	RLP	0.93	0.54	0.68	0.0019
DRB	0.92	0.82	0.86	0.0061	SF	0.78	0.46	0.58	0.0010
DVB	0.17	0.78	0.37	0.0004	SJJ	0.90	0.47	0.62	0.0014
ET	1.00	0.33	0.50	0.0052	SPRB	0.79	0.67	0.72	0.0073
FW	0.81	0.86	0.83	0.0541	SRB	1.00	0.67	0.80	0.0012
IJJ	0.72	0.65	0.76	0.0006	SYM	1.00	1.00	1.00	0.0006
IP	0.00	0.00	0.00	0.0004	UH	0.45	0.72	0.55	0.0006
IRB	0.32	0.68	0.64	0.0005	VB	0.82	0.74	0.78	0.1496
JJ	0.87	0.76	0.81	0.0339	VBN	0.67	0.90	0.77	0.0024

5 Performance analysis

This section presents the performance analysis of the baseline HMM model and the proposed hybrid HMM-based tagger. The experimental results unveiled that, while both taggers correctly tagged a considerable number of words, there were instances where they mislabeled certain words. Table 7 showcases some examples of such cases, providing insight into the performance of both taggers on the designed corpus.

When the baseline HMM tagger encounters a word that is not in the vocabulary, it will tag the word as "None." It has been observed that certain words in the corpus are annotated with multiple tags due to variations in their contextual usage. For example, consider the word, "sawrkar" in the corpus. Based on the context, this word has been assigned both CMN and VBN tags. However, the baseline HMM tagger assigns the CMN tag to all occurrences of the word, regardless of context. On the other hand, the proposed hybrid tagger utilizes the relevant context and correctly assigns both CMN and VBN tags to different instances of the word.

Let us consider another example, the term, "boxer." As per the rules defined in the proposed tagger, it should be tagged as FW. However, due to the probabilistic nature of determining the most probable tag, both the baseline HMM tagger and the proposed hybrid tagger incorrectly assign the tag CMN to this term.

The number of incorrect predictions made by both the baseline HMM tagger and the proposed hybrid tagger is illustrated in labeled bar graphs in Figures 3 and 4, respectively. A comparison of these two graphs reveals a significant difference in the number of inaccurate predictions. The baseline HMM tagger demonstrates a considerably higher number of incorrect predictions (over 300 counts), while the proposed hybrid tagger shows a lower number (fewer than 150 counts). Figure 3 demonstrates that the error rate is largely influenced by the RBP, which has been reduced in Figure 4. In Figure 4, the distribution of errors across various tags is not heavily biased towards a single tag. However, it is worth noting that the default tag exhibits a slightly higher number of incorrect predictions compared to the other tags.
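Counts of this kind can be produced with a few lines of code. The sketch below uses the nine example rows of Table 7 as data and groups the errors by the correct tag; that grouping is an assumption, since the text does not say whether Figures 3 and 4 group by the correct tag or by the predicted tag.

```python
from collections import Counter

def errors_per_tag(gold_tags, predicted_tags):
    """Count wrong predictions, grouped by the correct (gold) tag."""
    return Counter(g for g, p in zip(gold_tags, predicted_tags) if g != p)

# The nine rows of Table 7, in order.
gold     = ["FW", "VB", "RBP", "PSP", "ABN", "VBN", "CMN", "RBP", "FW"]
baseline = [None, None, None,  None,  "VBN", "CMN", "CMN", "RBT", "CMN"]
hybrid   = ["FW", "VB", "VB",  "PPN", "ABN", "VBN", "CMN", "RBT", "CMN"]

print(errors_per_tag(gold, baseline))
print(errors_per_tag(gold, hybrid))
```

On these nine rows the baseline tagger makes eight errors and the hybrid tagger four, which mirrors, on a very small scale, the gap visible between the two figures.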

Table 6
Macro average and Weighted average

	Precision	Recall	F1-score
Macro avg	78.87%	76.25%	75.95%
Weighted avg	83.09%	77.88%	79.64%
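
The difference between the two rows of Table 6 can be illustrated with a minimal sketch. It assumes that per-tag precision, recall, and F1-score are computed over the tags that occur in the gold annotation, that the macro average is the unweighted mean over tags, and that the weighted average uses each tag's number of gold tokens as its weight. The function names and the toy tag sequences are illustrative assumptions, not the authors' evaluation code or data.

```python
from collections import Counter


def per_tag_scores(gold, pred):
    """Return {tag: (precision, recall, f1, support)} for every gold tag."""
    true_pos = Counter(g for g, p in zip(gold, pred) if g == p)
    gold_count = Counter(gold)   # support: gold tokens per tag
    pred_count = Counter(pred)   # predicted tokens per tag
    scores = {}
    for tag, support in gold_count.items():
        precision = true_pos[tag] / pred_count[tag] if pred_count[tag] else 0.0
        recall = true_pos[tag] / support
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[tag] = (precision, recall, f1, support)
    return scores


def macro_and_weighted(scores):
    """Average per-tag metrics without weights and with support weights."""
    n_tags = len(scores)
    total = sum(s[3] for s in scores.values())
    macro = tuple(sum(s[i] for s in scores.values()) / n_tags for i in range(3))
    weighted = tuple(
        sum(s[i] * s[3] for s in scores.values()) / total for i in range(3)
    )
    return macro, weighted


# Toy sequences (hypothetical, not taken from the Mizo corpus).
gold = ["CMN", "CMN", "CMN", "CMN", "VB", "VB"]
pred = ["CMN", "CMN", "CMN", "VB", "VB", "CMN"]

macro, weighted = macro_and_weighted(per_tag_scores(gold, pred))
print("macro    P/R/F1:", macro)
print("weighted P/R/F1:", weighted)
```

Because frequent tags contribute more to the weighted average, it can sit above the macro average when the tagger does better on frequent tags than on rare ones, which is consistent with the pattern in Table 6.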

Table 7

Comparing word tagging by baseline HMM tagger and hybrid tagger

Words	Baseline HMM tagger	Proposed hybrid tagger	Correct tag
Yellow	None	FW	FW
Inkhelh	None	VB	VB
Lammualah	None	VB	RBP
Anmahni	None	PPN	PSP
Thutlukna	VBN	ABN	ABN
Sawrkar	CMN	VBN	VBN
Sawrkar	CMN	CMN	CMN
Khatah	RBT	RBT	RBP
Boxer	CMN	CMN	FW

Fig.3

Number of incorrect predictions by baseline HMM tagger.

Fig.4

Number of incorrect predictions by the proposed hybrid tagger.
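
The kind of per-tag error count plotted in Figures 3 and 4 can be sketched on the nine example rows of Table 7. This is only an illustration: the figures cover the whole test set, whereas the snippet covers the listed examples, and grouping the errors by the correct tag is an assumption about how the figures were built. The helper name `errors_by_correct_tag` is hypothetical.

```python
from collections import Counter

# Rows copied from Table 7: (word, baseline tag, hybrid tag, correct tag).
# None stands for the baseline's "None" output on out-of-vocabulary words.
TABLE_7 = [
    ("Yellow", None, "FW", "FW"),
    ("Inkhelh", None, "VB", "VB"),
    ("Lammualah", None, "VB", "RBP"),
    ("Anmahni", None, "PPN", "PSP"),
    ("Thutlukna", "VBN", "ABN", "ABN"),
    ("Sawrkar", "CMN", "VBN", "VBN"),
    ("Sawrkar", "CMN", "CMN", "CMN"),
    ("Khatah", "RBT", "RBT", "RBP"),
    ("Boxer", "CMN", "CMN", "FW"),
]


def errors_by_correct_tag(rows, column):
    """Count mistakes per correct tag; column 1 = baseline, 2 = hybrid."""
    return Counter(row[3] for row in rows if row[column] != row[3])


baseline_errors = errors_by_correct_tag(TABLE_7, 1)
hybrid_errors = errors_by_correct_tag(TABLE_7, 2)

print("baseline:", dict(baseline_errors))
print("hybrid:  ", dict(hybrid_errors))
```

On these examples the baseline makes eight errors and the hybrid tagger four, with RBP accounting for two errors in each case.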

In addition, we compared the two taggers on external test data. We selected three sentences that contain both known and unknown words and fed them to the system to analyze how each tagger assigns tags to each token.

Input text 1: "Aizawlah ka bazar dawn"

Input text 2: "Dintharah ka hau dawn"

Table 8

Performance on external test data 1

Input	Taggers	Tagging
Input text 1	Baseline HMM tagger	Aizawlah/RBP ka/PSP bazar/VBN dawn/RB
	Proposed hybrid tagger	Aizawlah/RBP ka/PSP bazar/VBN dawn/RB
Input text 2	Baseline HMM tagger	Dintharah/RBP ka/PSP hau/ABN dawn/RB
	Proposed hybrid tagger	Dintharah/RBP ka/PSP hau/VB dawn/RB

Table 9

Performance on external test data 2

Input	Known/Unknown	Baseline HMM tagger	Proposed hybrid tagger	Correct tag
Bombay-ah	Unknown	ABN	RBP	RBP
Mawia	Unknown	ABN	PPN	PPN
Remhriatna	Unknown	ABN	ABN	ABN
Avangin	Known	CC	CC	CC
WPO	Unknown	PPN	PPN	PPN
Buatsaih	Known	VB	VB	VB
X-ray	Unknown	ABN	FW	FW
Chungchang	Known	PPT	PPT	PPT
Seminar-ah	Unknown	ABN	RBT	RBP
ka	Known	PSP	PSP	PSP
tel	Known	VB	VB	VB
ve	Known	RB	RB	RB

Both taggers handle these two simple sentences reasonably well, as shown in Table 8. In the training corpus, the word "bazar" has been assigned three different tags: CMN, VBN, and PPN. When Input text 1 is fed to the system, both the baseline HMM tagger and the proposed tagger identify "bazar" as VBN. The term "hau," which appears in Input text 2, is an unknown word. The baseline HMM tagger wrongly tags it as ABN, whereas the proposed tagger correctly tags it as VB.

Input text 3: "Bombay-ah Mawia remhriatna avangin WPO buatsaih X-ray chungchang seminar-ah ka tel ve."

The third input sentence is somewhat more complicated than the previous two. Table 9 provides a detailed analysis.

The proposed hybrid tagger performs well on this specific sample input. Apart from the word "seminar-ah," which it incorrectly tags as RBT instead of RBP, the proposed method assigns the correct tag to every word. Integrating the rule-based tagger into the proposed system enables the effective handling of unknown terms.
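
As a worked check of the comparison in Table 9, the token-level accuracy of each tagger on Input text 3 can be computed directly from the table rows. The snippet below is a minimal sketch; the `accuracy` helper is a hypothetical name, and the figures apply to this single sentence only.

```python
# Rows copied from Table 9: (token, baseline tag, hybrid tag, correct tag).
TABLE_9 = [
    ("Bombay-ah", "ABN", "RBP", "RBP"),
    ("Mawia", "ABN", "PPN", "PPN"),
    ("Remhriatna", "ABN", "ABN", "ABN"),
    ("Avangin", "CC", "CC", "CC"),
    ("WPO", "PPN", "PPN", "PPN"),
    ("Buatsaih", "VB", "VB", "VB"),
    ("X-ray", "ABN", "FW", "FW"),
    ("Chungchang", "PPT", "PPT", "PPT"),
    ("Seminar-ah", "ABN", "RBT", "RBP"),
    ("ka", "PSP", "PSP", "PSP"),
    ("tel", "VB", "VB", "VB"),
    ("ve", "RB", "RB", "RB"),
]


def accuracy(rows, column):
    """Share of tokens whose tag in `column` matches the correct tag."""
    correct = sum(1 for row in rows if row[column] == row[3])
    return correct / len(rows)


baseline_acc = accuracy(TABLE_9, 1)  # column 1 = baseline HMM tagger
hybrid_acc = accuracy(TABLE_9, 2)    # column 2 = proposed hybrid tagger

print(f"baseline: {baseline_acc:.1%}, hybrid: {hybrid_acc:.1%}")
```

On this sentence the baseline tags 8 of 12 tokens correctly and the hybrid tagger 11 of 12; all four baseline errors fall on unknown words.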

6 Conclusion and future works

This study describes a Hidden Markov Model-based tagging technique for the Mizo language. Two methods have been proposed to improve the performance of the baseline model. On the 90:10 train-test split, the first proposed model yielded 81.52% accuracy, which is higher than the 80.44% of the baseline HMM model. The highest accuracy, 84.29%, is obtained with the second proposed method, the hybrid tagger, which is 3.85 percentage points higher than the baseline HMM model.

The effectiveness of the proposed hybrid POS tagger has also been assessed by computing the precision, recall, and F1-score for each tag. The overall performance of our model is summarized by the macro average and weighted average of each metric, which indicate satisfactory results.

We have also analyzed the Mizo language and its grammatical structure, which has helped us derive morphological clues for identifying the tag of a particular word. The experiments show that our methods improve the accuracy and overall performance of the tagger for both known and unknown words. We anticipate that this work will serve as groundwork for subsequent research on the language.

The contributions and results obtained with our proposed approaches are quite satisfactory, considering that Mizo is a low-resource language. We believe that there is room for improvement in the taggers’ performance by increasing the corpus’s size and developing more precise regular expressions tailored to the language. In our future research, we intend to explore the application of diverse deep-learning architectures to further enhance the accuracy of the Mizo tagger. Moreover, future studies could also consider incorporating the Kappa score, a metric for evaluating inter-rater agreement. This inclusion would provide a comprehensive understanding of result reliability and strengthen the overall robustness of the study.

Acknowledgment

The authors would like to express their gratitude to Mr. Mika Lalngaihtuaha and Mr. Lallawmsanga for their assistance in developing the Mizo corpus. They also thank the Department of IT, Mizoram University and the Department of CSE, NIT Silchar for supporting this research work.


Table 3

Comparison of accuracy

Name of taggers	Train:test ratio 95:5	Train:test ratio 90:10
NLTK Unigram tagger	79.42%	77.33%
NLTK Bigram tagger	81.41%	80.11%
NLTK Trigram tagger	81.22%	79.50%
Baseline HMM tagger	82.20%	80.44%
Proposed method I	83.36%	81.52%
Proposed hybrid tagger	86.65%	84.29%
