Abstract
Low-resource languages pose multiple challenges for machine translation systems, and translation accuracy suffers when linguistic information is insufficiently incorporated. Linguistic differences between the two languages of a pair also significantly affect the creation of datasets for improving translation accuracy. Although neural machine translation achieves state-of-the-art results, handling low-resource languages remains challenging because the approach depends on large amounts of training data. This paper attempts to address the data scarcity problem by augmenting the training data with synthetic parallel sentences and source-target phrase pairs, and by using a language model on the target side, for English-to-Mizo and Mizo-to-English translation with transformer-based neural machine translation. We attain state-of-the-art results for both directions of translation.
Introduction
Machine translation (MT) attempts to minimize linguistic barriers through automatic translation among natural (human) languages. A language is called low-resource in an MT task when the resources available for it are limited. Generally, this means limited online data [1] or computational tools [2]. Low-resource languages face many challenges in MT in terms of accuracy and comprehension because they have been explored only to a limited extent. Corpus-based MT requires a bilingual corpus, and building one involves linguistic challenges. Moreover, improving the output of an MT system requires a large bilingual corpus, which is often a significant problem for low-resource language pairs.
Even though MT systems are available for many major Indian languages, there are minimal resources for studying the Mizo language in MT. As English is the most widely spoken language globally, English–Mizo machine translation will help the Mizo community overcome communication barriers in the modern age. However, the two languages are very different because they originate from different language families: English belongs to the Indo-European family, while Mizo belongs to the Sino-Tibetan family. In Mizo [3], gender is indicated by a suffix added to all proper names: the letter ‘i’ is suffixed to all female proper names, and the letter ‘a’ to all male proper names. In English, proper names carry no such gender distinction. Conversely, English personal pronouns distinguish gender (‘he’ and ‘she’), whereas Mizo personal pronouns do not make it possible to determine gender. Although both languages use the Roman script, they differ linguistically with respect to word order and tonality [4, 5], as the following examples show.
Word-order of Mizo
Mizo: Thlalatu ani a, lehkhabu a ziak
Mizo: Mi zawng zawng
Mizo: A awmah
Mizo: Kan
Indication of tone marker along with the tone
High tone: ‘báwk’ - ‘also’
Low tone: ‘bàwk’ - ‘down’
Rising tone: ‘bawk’ - ‘tumour’
Falling tone: ‘bâwk’ - ‘hut’
Given the restricted availability of resources and the dissimilarities between the two languages, English–Mizo (eng-lus) can be classified as a low-resource pair and poses a challenging MT task. According to ISO 639-3, the language code of Mizo is “lus” and the language code of English is “eng”. In the MT community, neural machine translation (NMT) based approaches attain state-of-the-art results for both high- and low-resource pair translation [7]. In this paper, we investigate different NMT models that address the data scarcity issue for eng–lus translation.
The contributions are summarized below:
We handle the data scarcity problem by augmenting the training data with synthetic parallel data and phrase pairs, which increases the amount of training data and provides token alignment information.
We contribute a pre-trained Mizo language model (LusLM) that helps to improve eng-to-lus translation. LusLM will be made publicly available and can be used in different downstream NLP tasks for the Mizo language.
We achieve state-of-the-art results on the test data [5] for eng-to-lus and lus-to-eng translation and perform different types of error analysis.
The rest of the paper is organized as follows. Sections 2 and 3 briefly discuss machine translation and related work. Our approach to English–Mizo NMT is presented in Section 4. Section 5 presents the experimental results and error analysis. Lastly, Section 6 concludes the paper and outlines future work.
With technological advances, MT has a significant influence on today’s society, as it bridges the gap between languages. MT systems serve as an alternative to professional human translators by offering instant translations. Although MT has had a substantial impact on high-resource languages, it is especially beneficial for low-resource languages since it removes barriers to communication. Moreover, MT can help protect a low-resource language from extinction by compensating for its limited resources through automatic translation. Although conventional human translation remains unrivaled, MT systems have unquestionably increased in accuracy over time.
NMT attains state-of-the-art performance and has made significant progress [7]. To deal with variable-length phrases, NMT based on recurrent neural networks (RNN) was developed [8, 9]. To improve the learning of long-term features, the RNN uses long short-term memory (LSTM) units for encoding and decoding. The NMT architecture is further improved by the addition of an attention mechanism, which allows the decoder to focus both locally and globally on various parts of the sequence. A disadvantage of the RNN is that it processes data in a fixed temporal order, conditioning on previous words but not on future words. The bidirectional RNN (BRNN) [8] overcomes this problem by employing two separate RNNs, one for the forward and the other for the backward direction. Convolutional neural network (CNN) based NMT [10] was introduced to take advantage of parallel operations; it uses the relative positions of tokens rather than the temporal dependency among sequence tokens. However, it falls short of RNN capabilities in encoding the source sentence. CNN-based techniques also need several layers to capture long-term dependencies, resulting in a complicated network. To deal with these problems, transformer-based NMT is used [11].
The Transformer model utilizes an encoder-decoder architecture that resembles the RNN models. It is the first model to construct representations of its input and output only via self-attention, without the use of convolution or sequence-aligned RNN models. It is intended to address long-term dependencies as well as the limited parallelization of RNN models. The idea of the Transformer is to encode each position and to rely entirely on attention mechanisms to relate any two words, which can be parallelized to speed up learning. Unlike the classic attention mechanism, the self-attention mechanism calculates attention several times; this process is known as multi-head attention. Both the encoder and the decoder are formed by six (6) identical layers stacked on top of one another, as shown in Figure 1. Each encoder layer consists of two sub-layers: a multi-head self-attention layer and a fully connected position-wise feed-forward network layer. Each decoder layer has three sub-layers. Two of the three are the same as those in the encoder, and the third is another multi-head attention layer that attends to the output of the encoder stack, as shown in Figure 2.

Transformer Architecture Model

Encoder and Decoder of Transformer Model
The attention in the Transformer model is defined in Equation 1 using a query (QY), key (K), and value (V), with $d_k$ as the dimension of the keys. The dot product of the query with each key is computed, divided by $\sqrt{d_k}$, and passed through the softmax function to obtain the weight of each word at a specific position.
\[
\mathrm{Attention}(QY, K, V) = \mathrm{softmax}\left(\frac{QY\,K^{T}}{\sqrt{d_k}}\right) V \qquad (1)
\]
In contrast to single-head attention, the Transformer model introduces multi-head attention, which enables the model to jointly attend to information from different representation subspaces at different positions.
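The attention computation described above can be illustrated with a minimal pure-Python sketch. Following the standard formulation in [11], the scores are divided by the square root of the key dimension before the softmax. The function names are hypothetical, and the multi-head variant is simplified: it only slices the model dimension into heads and omits the learned projection matrices that a real Transformer applies before and after attention.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of vectors."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Dot product of the query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights

def slice_columns(rows, start, stop):
    """Return the columns [start, stop) of every row."""
    return [row[start:stop] for row in rows]

def multi_head_attention(queries, keys, values, num_heads):
    """Simplified multi-head attention WITHOUT learned projections:
    each head attends over its own slice of the model dimension,
    and the head outputs are concatenated."""
    d_model = len(queries[0])
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads
    head_outputs = []
    for h in range(num_heads):
        start, stop = h * d_head, (h + 1) * d_head
        out, _ = scaled_dot_product_attention(
            slice_columns(queries, start, stop),
            slice_columns(keys, start, stop),
            slice_columns(values, start, stop),
        )
        head_outputs.append(out)
    # Concatenate the per-head outputs position by position.
    return [[x for head in head_outputs for x in head[i]]
            for i in range(len(queries))]
```

Each row of the returned weights sums to one, so every output vector is a convex combination of the value vectors.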
In machine translation (MT), neural machine translation (NMT) has emerged as a promising technique owing to its ability to analyse context and to handle long-range dependencies [9, 11]. Yet, it requires a substantial quantity of training data, which is a major challenge for low-resource language pairs [12]. Researchers have devised different strategies to improve NMT for low-resource languages. The most prominent and effective data augmentation method in MT for a low-resource language is back translation (BT) [13]. BT trains a translation model in the reverse direction to create a synthetic parallel corpus from target-side monolingual data. The concept of BT is further improved by iterative back translation (IBT), where both the forward and backward directions of translation are used for mutual training. In [14], a set of low-resource translation models is trained with data augmented via iterative back translation, resulting in improved output. For limited-resource, morphologically rich Indian languages [15], an NMT model utilizing multi-head self-attention and byte pair encoding is presented to construct an effective translation strategy that overcomes the out-of-vocabulary barrier. Furthermore, multilingual NMT systems can handle translation between numerous language pairs. The main idea behind these strategies is to transfer knowledge from high-resource to low-resource translation. Recent research suggests that multilingual models outperform bilingual models, especially when only a few languages are present in the system, and that the degree of relatedness between the languages also affects performance [16].
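The mutual-training idea behind IBT can be sketched as a short loop. This is only an illustration under stated assumptions: `train` and the translator callables are hypothetical placeholders for a real NMT toolkit, and the number of rounds is arbitrary rather than taken from [13] or [14].

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]                      # (source sentence, target sentence)
Translator = Callable[[List[str]], List[str]]
Trainer = Callable[[List[Pair]], Translator]

def swap(pairs: List[Pair]) -> List[Pair]:
    """Reverse the direction of every sentence pair."""
    return [(tgt, src) for src, tgt in pairs]

def iterative_back_translation(parallel: List[Pair],
                               mono_src: List[str],
                               mono_tgt: List[str],
                               train: Trainer,
                               rounds: int = 2):
    """Alternately improve the forward and backward models.

    `train` is a hypothetical function that takes sentence pairs and
    returns a translator for that direction.
    """
    forward = train(parallel)            # source -> target
    backward = train(swap(parallel))     # target -> source
    for _ in range(rounds):
        # Back-translate genuine target sentences: synthetic source, real target.
        synthetic_fwd = list(zip(backward(mono_tgt), mono_tgt))
        # Translate genuine source sentences: synthetic target, real source.
        synthetic_bwd = list(zip(forward(mono_src), mono_src))
        # Retrain each direction on original plus fresh synthetic data.
        forward = train(parallel + synthetic_fwd)
        backward = train(swap(parallel) + synthetic_bwd)
    return forward, backward
```

In both directions the synthetic text sits on the input side and the genuine text on the output side, so each model learns to produce human-written sentences.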
For the English–Mizo language pair, only a few works exist in the area of MT [4, 17–20]. Prior work focuses on developing eng-lus parallel data to overcome the limited availability of resources for the English–Mizo MT task. In [17], an English–Mizo parallel corpus is built, and RNN-based NMT is compared with phrase-based statistical machine translation (PBSMT). Also, [4, 18] investigate the English–Mizo pair using several attention-based NMT models, including RNN, BRNN, and transformer, and analyse the models’ prediction errors and accuracy as sentence length changes. In our previous work [5], eng-lus parallel data and Mizo monolingual data are prepared, and English-to-Mizo translation involving tonal words is investigated with a post-processing step. Although researchers have explored the eng-lus pair for the MT task, none of them has tackled the data scarcity and word-order divergence issues. Moreover, since Mizo is a low-resource language with numerous linguistic characteristics, it is essential to investigate the different types of translation errors. Therefore, to explore the language in the MT domain, it is advisable to analyse its linguistic characteristics and then address the linguistic challenges when building the corpus. In this work, we address the data scarcity challenge by utilizing the iterative back-translation (IBT) strategy [21], phrase-pair extraction [22, 23], and a pre-trained language model (LM) [24] to improve the translation performance of low-resource eng-lus NMT.
Our approach for English–Mizo NMT
We tackle the data scarcity issue using the iterative back-translation (IBT) strategy [21] to prepare synthetic parallel data, together with phrase-pair augmentation [22]. Our approach consists of three phases: synthetic parallel data preparation, phrase-pair extraction, and preparation of a language model for the target language. In this work, we use our previously developed dataset (eng-lus parallel data and lus monolingual data) [5]. The data statistics are presented in Table 3. The English monolingual data is taken from WMT16. We consider equal amounts of Mizo and English monolingual data for a preliminary investigation of the impact of the LM on low-resource pair translation. The monolingual data of the target language (eng/lus) is used to train an LM with the transformer model, and the decoder of the encoder-decoder transformer-based NMT is then initialized with the weight matrices of this pre-trained LM. Figure 3 illustrates our approach for eng-lus NMT.
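The decoder initialization step can be pictured as copying matching weight matrices from the pre-trained LM into the NMT decoder. The sketch below is a toolkit-independent illustration, not the mechanism of any specific library: the parameter names, the `decoder.` prefix, and the nested-list matrices are all hypothetical.

```python
def shape(matrix):
    """Return (rows, columns) of a matrix stored as a list of lists."""
    return (len(matrix), len(matrix[0]) if matrix else 0)

def init_decoder_from_lm(nmt_params, lm_params, prefix="decoder."):
    """Copy LM weights into same-named, same-shaped decoder parameters.

    Both arguments map parameter names to matrices. The naming scheme
    (LM name + `prefix` = decoder name) is an assumption made for this
    sketch. Returns the new parameter dict and the list of copied names.
    """
    new_params = dict(nmt_params)
    copied = []
    for name, value in lm_params.items():
        target = prefix + name
        if target in new_params and shape(new_params[target]) == shape(value):
            # Copy row by row so the LM weights are not shared by reference.
            new_params[target] = [row[:] for row in value]
            copied.append(target)
    return new_params, copied
```

Parameters with no counterpart in the LM, such as the decoder's attention over the encoder output, keep their original initialization in this sketch, as do all encoder parameters.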
Data statistics for parallel and Mizo monolingual data [5]

Our approach for eng–lus NMT
Synthetic parallel data preparation: For synthetic parallel data preparation, the monolingual data of both languages (eng/lus) is utilized by following the iterative back translation strategy [21]. We perform a series of experiments in each direction, increasing the amount of synthetic data according to the ratio “original parallel corpus : synthetic parallel corpus”. For instance, the lus-to-eng transformer-based NMT model is applied to lus monolingual data to generate eng sentences. Blank lines and under-translations (single-word translations) are removed, and synthetic eng-lus parallel data is prepared. The resulting synthetic eng-lus parallel data is added to the original parallel data (train set) at different ratios of original to synthetic data. This process is repeated until the convergence condition is reached. The intuition is that not all synthetic parallel sentences are of good quality; therefore, we identify better-quality synthetic parallel data by adopting the iterative back translation technique [21]. In this work, ratios of 1 : 4 and 1 : 3 show improvement when utilizing lus and eng monolingual data. Therefore, we merge both sets and use synthetic parallel data at a ratio of 1 : 7.
Phrase-pairs extraction: We adopt the phrase-pair extraction strategy of [22, 23]. Phrase-based SMT is trained with the Moses toolkit on the original eng-lus parallel data, and phrase pairs with translation probability p ≥ 0.5 are extracted. Duplicates are removed, and the statistics of the obtained pairs are presented in Table 4. Augmenting the train set with phrase pairs passes more word alignment information to the training model, which addresses the word-order divergence problem in addition to the data scarcity issue.
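The data-selection steps described above (dropping blank and single-word translations, mixing original and synthetic data at a fixed ratio, and keeping phrase pairs with p ≥ 0.5 without duplicates) can be sketched as follows. The function names are hypothetical, the random sampling used to meet the ratio is an assumption (the selection procedure is not specified here), and the phrase pairs are assumed to be already parsed into (source, target, probability) tuples rather than read from a toolkit-specific file.

```python
import random

def filter_synthetic(synthetic_side, genuine_side):
    """Pair machine-translated sentences with their genuine counterparts,
    dropping blank lines and single-word (under-translated) outputs."""
    kept = []
    for synth, real in zip(synthetic_side, genuine_side):
        if len(synth.split()) > 1 and real.strip():
            kept.append((synth.strip(), real.strip()))
    return kept

def mix_at_ratio(original, synthetic, ratio, seed=0):
    """Return the original pairs plus at most `ratio` times as many
    synthetic pairs, i.e. an original : synthetic ratio of 1 : ratio.
    Sampling at random is an assumption of this sketch."""
    budget = min(len(synthetic), ratio * len(original))
    return list(original) + random.Random(seed).sample(synthetic, budget)

def select_phrase_pairs(scored_pairs, threshold=0.5):
    """Keep unique (source, target) phrase pairs whose translation
    probability is at least `threshold`, preserving input order."""
    seen = set()
    selected = []
    for src, tgt, prob in scored_pairs:
        key = (src, tgt)
        if prob >= threshold and key not in seen:
            seen.add(key)
            selected.append(key)
    return selected
```

With the train set of 118,035 sentence pairs, a ratio of 1 : 7 gives a budget of 826,245 synthetic pairs, in line with the roughly 826,000 synthetic sentences used in the experiments.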
Data statistics for phrase-pairs
In the experiments, the publicly available Marian [25] toolkit is employed for three basic operations: data preprocessing, training, and testing. The word-segmentation technique byte pair encoding (BPE) [26] is used with 32k merge operations. The vocabulary sizes of lus and eng are 28,006 and 26,834, respectively. During preprocessing, the source and target vocabularies are shared, and the merged vocabulary size is 49,229. We follow the default configuration [27] (6 layers, 8 attention heads, Adam optimizer with a learning rate of 0.001, and a dropout of 0.1) both for training the LM of the target language and for NMT training in the eng-to-lus and lus-to-eng directions. The Marian toolkit allows a custom LM to be used during the training of the NMT model. The models are trained on a single NVIDIA Quadro P2000 GPU.
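To make the notion of BPE merge operations concrete, the following toy sketch learns merges from a small word-frequency table in the spirit of [26]. It is an illustration only, not the tooling used in the experiments; the function names and the `</w>` end-of-word marker are choices made for this sketch.

```python
import collections
import re

def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair with its concatenation."""
    merged = "".join(pair)
    # Match the pair only at symbol boundaries (whitespace-delimited).
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub(lambda _: merged, word): freq
            for word, freq in vocab.items()}

def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge operations from {word: frequency}."""
    # Start from characters, with an explicit end-of-word marker.
    vocab = {" ".join(word) + " </w>": freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(vocab)
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges
```

The number of merge operations (32k in the experiments) controls how far frequent character sequences are joined into larger subword units, which in turn determines the subword vocabulary size.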
The predicted sentences of our experiments are evaluated using automatic evaluation metrics: bilingual evaluation understudy (BLEU) [28], translation error rate (TER) [29], metric for evaluation of translation with explicit ordering (METEOR) [30], and F-measure. Tables 5, 7, 8, and 9 present the BLEU, TER, METEOR, and F-measure scores, respectively. For BLEU, METEOR, and F-measure, a higher score indicates better translation accuracy, whereas for TER a lower score is better. We have also performed human evaluation (HE) on 100 sample predicted sentences using a scale of 1-5, following [17]. We hired three human evaluators who possess linguistic knowledge of both languages, and their average scores are reported in Table 6. The evaluators are undergraduate students and native speakers of the Mizo language. The preliminary experiments [5] show that transformer-based NMT achieves higher accuracy than RNN-based NMT. Therefore, we explore transformer-based NMT models in the following configurations:
Baseline: “Original Parallel Sentences” (Train Set: 118,035)
With SPA (Synthetic Parallel-Sentence Augmentation): “Original Parallel Sentences” (Train Set: 118,035) + “Synthetic Parallel Sentences” (826,000)
With PPA (Phrase-Pairs Augmentation): “Original Parallel Sentences” (Train Set: 118,035) + “Phrase-Pairs” (42,110)
With LM: “Original Parallel Sentences” (Train Set: 118,035) + Pre-trained LM (Target Language: eng/lus)
With SPA + PPA + LM: “Original Parallel Sentences” (Train Set: 118,035) + “Synthetic Parallel Sentences” (826,000) + “Phrase-Pairs” (42,110) + Pre-trained LM (Target Language: eng/lus)
From the quantitative results, it is observed that the transformer-based NMT with SPA + PPA + LM attains the best scores for both directions of translation and outperforms the previous work [5]. The comparisons with the previous work [5] and Google translation are presented in Figures 4 and 5. The lus-to-eng evaluation scores are higher than the eng-to-lus scores, which we attribute to the larger number of lus tokens than eng tokens in the training data. As a result, the model encodes lus words more frequently, and thus the decoder can generate better lus-to-eng translations. We have also evaluated our best model on the benchmark dataset [31] and report the BLEU scores in Table 10.
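For readers unfamiliar with BLEU, the n-gram overlap idea behind the metric can be sketched in a few lines. This is a simplified corpus-level variant with a single reference per sentence, whitespace tokenization, and no smoothing; these simplifications are assumptions of the sketch, so its values are not directly comparable to the scores reported in the tables.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus-level BLEU in [0, 1], one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped counts: an n-gram is credited at most as often
            # as it occurs in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives zero BLEU
    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes hypotheses shorter than the references.
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)
```

A hypothesis identical to its reference scores 1.0, while a hypothesis that shares no bigram with the reference scores 0.0 under this unsmoothed variant.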
BLEU scores of eng-to-lus and lus-to-eng translation. SPA: Synthetic Parallel-Sentence Augmentation, PPA: Phrase Pairs Augmentation, LM: Language Model
Human evaluation scores of eng-to-lus and lus-to-eng translation, SPA: Synthetic Parallel-Sentence Augmentation, PPA: Phrase Pairs Augmentation, LM: Language Model
TER scores of eng-to-lus and lus-to-eng translation, SPA: Synthetic Parallel-Sentence Augmentation, PPA: Phrase Pairs Augmentation, LM: Language Model
METEOR scores of eng-to-lus and lus-to-eng translation, SPA: Synthetic Parallel-Sentence Augmentation, PPA: Phrase Pairs Augmentation, LM: Language Model
F-measure scores of eng-to-lus and lus-to-eng translation, SPA: Synthetic Parallel-Sentence Augmentation, PPA: Phrase Pairs Augmentation, LM: Language Model
BLEU scores of eng-to-lus and lus-to-eng translation on FLORES-200 test data

Comparison among the present work (best model: with SPA+PPA+LM), previous work [5] and Google translation in terms of BLEU and Human evaluation scores for eng-to-lus

Comparison among the present work (best model: with SPA+PPA+LM) and Google translation in terms of BLEU and Human evaluation scores for lus-to-eng
To further evaluate the efficiency of our NMT system, we assessed the quality of several predicted sentences produced by the transformer models from various viewpoints along with Google translation. The predicted sentences are compared against the reference sentence in terms of adequacy and fluency. Adequacy measures how well a reference sentence’s meaning is retained in the predicted translation. Fluency is indicated by the appropriate formation of the predicted sentence, regardless of the reference translation. Using the following notations, the samples of predicted sentences are presented below to investigate the errors.
MTS - lus Test Sentence
ETS - eng Test Sentence
Base_mz - Predicted sentence of baseline model for eng-to-lus
Base_en - Predicted sentence of baseline model for lus-to-eng
Best_mz - Predicted sentence of the best model (With SPA+PPA+LM) for eng-to-lus
Best_en - Predicted sentence of the best model (With SPA+PPA+LM) for lus-to-eng
G_mz - Google translation for eng-to-lus
G_en - Google translation for lus-to-eng
1. Sample predicted sentence for
MTS - Jakoba thlàhte zàwng záwng chu mi sàwmsarih an ni.
ETS - All the descendants of Jacob were seventy persons.
Base_mz - Jakoba thlahte chu mi sawmsarih an ni.
Best_mz - Jakoba thlahte zàwng záwng chu sawmsarih an ni.
G_mz - Jakoba thlah zawng zawng chu mi sawmsarih an ni.
2. Sample predicted sentence for
MTS - Arona tiang chuan an tiangte chu a lem zo ta vék a.
ETS - Aaron’s staff swallowed up their staffs.
Base_en - Aaron struck the staff of Aaron.
Best_en - Aaron’s staff swallowed up their staffs.
G_en - And Aaron ’s rod swallowed up their rods.
Discussion for
3. Sample predicted sentence for
MTS - Pûk kawngka kha han hawng rawh u.
ETS - Open the mouth of the cave.
Base_mz - biru rawh u.
Best_mz - Tukverh hawng rawh.
G_mz - Puk kawngkhar chu hawng rawh.
4. Sample predicted sentence for
MTS - Josefa chu Egypt rámah chuan alo áwm tawh a.
ETS - Joseph was already in Egypt.
Base_en - Joseph took a party to the Egyptians.
Best_en - Joseph had enough of the Egyptians.
G_en - Joseph was already in Egypt.
Discussion for
5. Sample predicted sentence for
MTS - Chu chu thlang lampang rám chu a ni.
ETS - This forms the western side.
Base_mz - Chu chu chhim lampang chu a ni.
Best_mz - Hei hi thlang lampang thlengin a ni.
G_mz - Hei hian chhim lam a siam a ni.
6. Sample predicted sentence for
MTS - Nile luia sanghate chu an thi áng.
ETS - The fish in the Nile shall die.
Base_en - The fish of the fish is dead.
Best_en - The fish in the Nile died.
G_en - The fish of the Nile River are dead.
Discussion for
7. Sample predicted sentence for
MTS - Theitui ka bùnna lamah ka tibua.
ETS - I spilled the juice while I was pouring it.
Base_mz - A split a.
Best_mz - Ka tân chawhtawlh ka ha a.
G_mz - Ka leih lai chuan a tui chu ka theh chhuak a.
8. Sample predicted sentence for
MTS - Tho tam tàk chuan rám chu a tichhe chiam a.
ETS - All the land was ruined by the swarms of flies.
Base_en - In the hand of all the country, all the people were under the land
Best_en - The land was ruined by the land
G_en - Then the land was destroyed.
Discussion for
9. Sample predicted sentence for
MTS - Naomi mi ti suh u, Marai mi ti zâwk rawh u.
ETS - Do not call me Naomi, call me Mara.
Base_mz - Naomi min ko va, mi ko va.
Best_mz - Naomi mi ti suh, Naomi mi ti zâwk rawh.
G_mz - Naomi min ti suh la, Mara ti rawh.
10. Sample predicted sentence for
MTS - A fanu Zipporah chu Moses a pè a.
ETS - He gave Moses his daughter Zipporah.
Base_en - The daughter of his daughter, the servant of his daughter.
Best_en - He opened his daughter s wife.
G_en - And he gave Moses his daughter Zipporah.
Discussion for
11. Sample predicted sentence for
MTS - I nu leh pa te ti hèk suh.
ETS - Don’t used up your parents’s money.
Base_mz - In pawisa hman dah suh.
Best_mz - I nu leh pate chuan sum leh pai te chu hmang suh.
G_mz - I nu leh pa pawisa chu hmang zo suh.
12. Sample predicted sentence for
MTS - Ka nunna hian a nghak réng a ni.
ETS - My soul waits.
Base_en - My life is for my life.
Best_en - My soul runs in wait for him.
G_en - My life is waiting for you.
Discussion for
13. Sample predicted sentence for
MTS - Ní a, Gibeon khaw chúngah díng rèng rawh.
ETS - Sun, stand still at Gibeon.
Base_mz - Ngawi rawh u.
Best_mz - Gibeon-ah.
G_mz - Sun, Gibeon-ah chuan ding reng rawh.
14. Sample predicted sentence for
MTS - Va chhuak la, va bèi mawlh rawh.
ETS - Go out now and fight with them.
Base_en - Go out, go out.
Best_en - Go out, and go.
G_en - Go out, go out, go out.
Discussion for
Conclusion and future work
In this paper, we have tackled the data scarcity problem for eng-to-lus and lus-to-eng translation using transformer-based NMT, by augmenting the training data with synthetic parallel sentences and phrase pairs and by using an LM on the target side. The experimental results show that the current work attains higher translation accuracy than the previous work. Based on the discussions in the error analysis, the majority of the predicted translations are acceptable in terms of both adequacy and fluency for the best model as well as for Google Translate. The best model also predicts some of the Mizo tone markers correctly, whereas Google Translate never predicts a tonal word along with its tone marker. The pre-trained LM of the Mizo language will be made publicly available. In this work, we have not attempted to resolve linguistic challenges such as gender or pronouns in either direction of translation. In future work, we will investigate these issues and address the errors identified in the error analysis to enhance the translation performance of NMT for both directions.
