Neural machine translation for low resource Indian language: Hindi-Kangri

Abstract

Neural Machine Translation (NMT) for low resource languages is a challenging task due to the unavailability of large parallel corpora. The efficacy of Transformer based NMT models largely depends on the scale of the parallel corpus and the configuration of hyperparameters used during model training. This study aims to investigate and explain the impact of hyperparameters on the performance of NMT models for low resource languages. To accomplish this, a series of experiments is conducted using an open-source Hindi-Kangri corpus to train both supervised and semi-supervised NMT models. During experimentation, a significant number of discrepancies were identified within the dataset, necessitating manual correction. The best translation performance, evaluated with the metrics BLEU (0–1), SacreBLEU (0–100), Chrf (0–100), Chrf+ (0–100), Chrf++ (0–100) and TER (%), is (0.15, 14.98, 41.43, 41.49, 38.77, 68.20) for the Hindi to Kangri direction and (0.283, 28.17, 49.71, 50.64, 48.63, 51.25) for the Kangri to Hindi direction.

Keywords

Neural machine translation; low resource language; low resource MT; transformers; semi-supervised MT; Kangri; natural language processing

1 Introduction

The task of machine translation is to develop a model for translating a piece of text from one language into another. Within the realm of natural language processing, machine translation stands out as one of the most intricate tasks. The history of machine translation can be traced back to the Cold War era: in 1954, an experiment at the IBM headquarters used the IBM 701 computer to translate sentences from Russian to English [1]. In recent years, NMT has emerged as a groundbreaking approach in the field of natural language processing, revolutionizing the way languages are translated in various domains. NMT models are built on neural networks, which allow them to grasp complex relationships between words and phrases in diverse languages, resulting in more fluent and contextually accurate translations compared to traditional statistical methods [2–4].

The pivotal moment arrived in 2017 when the Transformer architecture was introduced, revolutionizing NMT [4]. The Transformer architecture laid the foundation for the latest advancements in AI models, including BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer) [5]. The effectiveness of Transformer based NMT models is heavily contingent upon both the size of the parallel corpus and the configuration of hyperparameters employed during model training [6, 7]. India is a linguistically diverse nation boasting over 22 officially recognized languages and hundreds of dialects [8]. Many of these languages lack the vast amounts of parallel data required to train a robust NMT system. This creates a need for efficient techniques that make effective use of a small parallel corpus when building an NMT system for low resource languages such as Kangri, Dogri and Kashmiri.

The efficacy of Transformer based NMT models is greatly affected by the configuration of hyperparameters such as the embedding dimension, the feed-forward dimension and the number of encoder-decoder layers [6, 7, 9]. All these hyperparameters affect the complexity of the model, and their suitable values depend directly on the size of the parallel corpus and the vocabulary size of the language. A large parallel corpus tends to have a larger vocabulary. The objective of this study is to investigate and explain the impact of hyperparameters on the performance of NMT models. All experiments were carried out on an open source Hindi-Kangri corpus. Kangri is an extremely low resource language spoken mostly in a few regions of Himachal Pradesh, India [10, 11]. During experimentation, a substantial number of inconsistencies were detected within the dataset. These discrepancies, along with the procedures employed to rectify them, are described in Section 4 of this paper.

The research contributions of this paper are:

– A cleaner version of the existing open source Hindi-Kangri corpus. Note: no claim of contribution is made regarding the collection of this dataset.
– A demonstration of the impact of hyperparameters on the performance of NMT models.
– A new baseline score for translation performance on the Hindi-Kangri corpus.

The remainder of the paper discusses the related work (Section 2), the dataset (Section 3), the proposed work and its implementation (Section 4), and the results (Section 5) in detail.

2 Literature survey

This section presents a detailed survey of the literature. The survey centers on neural machine translation models previously proposed by researchers. Emphasis is placed on the methodologies and techniques that have been adopted for low resource language scenarios, particularly in the context of Indian languages.

2.1 General survey for NMT architectures

Constructing a neural model is the contemporary and efficient method for developing a machine translation system, as it harnesses the embeddings of individual tokens within the vocabulary. This capability helps capture richer semantic and syntactic structure within sentences [12, 13]. The first end-to-end deep learning NMT framework featured a Recurrent Neural Network (RNN) encoder-decoder structure [3]. This network comprised a bidirectional RNN encoder and a unidirectional RNN decoder, with an essential attention layer bridging the two. The attention layer is an important component in the architecture of any NMT system. RNN cells, however, are highly prone to the vanishing gradient problem and struggle to handle long-term dependencies [14]. Consequently, RNN cells were subsequently replaced by Long Short-Term Memory (LSTM) cells [15]. The introduction of the Transformer architecture marked a significant breakthrough in deep learning research [4]. The Transformer model laid the groundwork for the latest advancements in AI models such as BERT and GPT [5]. Following the advent of the Transformer architecture, researchers worldwide have shifted their focus toward employing this architecture and its variations for diverse language pairs [16–18].

2.2 NMT for low-resource languages

The availability of a large parallel corpus is a prerequisite for building a robust NMT system, and languages for which a large parallel corpus is not available are usually referred to as low-resource languages. The two possible scenarios with respect to low resource languages are:

– No parallel corpus, large monolingual corpus
– Small parallel corpus, large monolingual corpus

When no parallel corpus is available, researchers have adopted unsupervised techniques to build an NMT system. The idea behind the design of a UNMT system is to combine a denoising auto-encoder (based on reconstruction error) with On-the-fly backtranslation. The fundamental concept is to establish a shared latent space for both languages, allowing the model to acquire translation capabilities by reconstructing content in both language domains. The model is trained in much the same way as a language model: it receives noisy input data and reconstructs the original data [19]. Using this methodology, a BLEU score of 15.56 is achieved on the WMT 2014 dataset for the French to English translation direction [20], and adding a small parallel corpus increases the performance to 21.81 BLEU. This implies that the availability of a small parallel corpus helps to improve UNMT system performance.

Transfer learning is another approach for building an NMT system for a low resource language. One common approach is to use data from a high-resource language pair to pretrain a generic translation model. At a later stage, the pre-trained model is fine-tuned on another dataset (the low resource language data). In other words, the weights trained on a high resource language pair are used as an initialization for fine-tuning on the low resource language [21, 22]. Using this approach, a change of +5 BLEU is observed for Uzbek-English translation with a French-English parent model [21].

2.3 NMT for low-resource Indian language

Machine translation for Indian languages poses a significant challenge due to the existence of large morphological differences [23, 24]. Much of the research conducted in the field of machine translation for Indian languages has primarily focused on NMT system for translating English to Telugu [25], Kannada [26, 27], Hindi [28], Malayalam [29] etc. These languages hold official status and are equipped with more readily available linguistic resources [8, 30].

Recently, there has been a shift in focus towards addressing the challenges of extremely low-resource languages like Kashmiri [31], Kangri [10], and Kumaoni [32]. To facilitate research on the Kangri language for various natural language processing (NLP) applications, an open-source Hindi-Kangri small parallel corpus and a Kangri monolingual corpus were released [11]. These resources not only promote Kangri language research but also establish baselines for unsupervised neural machine translation (UNMT) and unsupervised statistical machine translation (USMT). Using USMT and UNMT models, the authors of [11] achieved BLEU scores of 4.98 and 3.25, respectively, for translating from Kangri to Hindi.

Furthermore, a comparative study on semi-supervised and unsupervised neural machine translation for the Hindi-Kangri language pair was conducted using the same dataset [10]. This study included two main models:

1. A shared encoder model implementing back-translation, incorporating both fully unsupervised and semi-supervised techniques.
2. A language model featuring a denoising auto-encoder, leveraging fully unsupervised learning.

The highest BLEU score reported in this study was 21, achieved in the semi-supervised translation scenario with semi-supervised cross-lingual word embeddings.

The literature discussed in this section makes it evident that very limited research is available for low resource Indian languages, and that attention has only recently been drawn towards extremely low resource languages like Kashmiri and Kangri. The objective of this study is to explore the methodologies and techniques that can be adopted to build an NMT system for Kangri with a very limited parallel corpus. Special attention is paid to the quality of the dataset, and a manual audit was carried out to improve it. The impact of hyperparameters on the performance of a Transformer based NMT system is discussed in detail.

3 Dataset

This study focused exclusively on experiments with a low-resource Indian language (Kangri) to explore methods for building an effective NMT system using a limited parallel corpus. All experiments were carried out on an open source Hindi-Kangri parallel and monolingual corpus.¹ Kangri is an extremely low resource language spoken mostly in a few regions of Himachal Pradesh, India. Kangri and Hindi are Indo-Aryan languages that use the Devanagari script for writing [10, 11].

Tables 1 and 2 show the data statistics of the raw corpus as reported in [11] and of the data available on GitHub, respectively. After the dataset was collected from GitHub (open-source), it was processed. The complete procedure for data pre-processing is discussed in Section 4.

Table 1
Data statistics mentioned in [11]

	Parallel		Monolingual		Total
Language	Train	Test	Train	Test
Hindi	26,862	500	–	–	27,362
Kangri	26,862	500	1,80,552	1000	2,08,914

Table 2

Data statistics from data available on GitHub

	Parallel		Monolingual		Total
Language	Train	Test	Train	Test
Hindi	26,854	500	–	–	27,354
Kangri	26,854	500	1,71,064	1000	1,99,418

4 Methodology

This section describes the methodology used in this research study in detail. It covers the data preprocessing, model architecture, hyperparameters, training mechanism, implementation details and the performance evaluation metrics used in this study.

4.1 Data preprocessing

In data preprocessing, the raw data is converted into a usable form. Basic preprocessing steps include the removal of punctuation, conversion of text to lower case, etc. The sequence of steps used to preprocess the parallel and monolingual data is:

– Removal of punctuation characters.
– Removal of web-links or html links.
– Removal of numeric characters.
– Removal of English alphabet characters.
– Removal of extra white-spaces.
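
The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' script: the function name `clean_sentence`, the exact punctuation set and the URL pattern are hypothetical. Punctuation is removed through an explicit character list rather than a `\W`-style pattern, because in Python's `re` module Devanagari vowel signs are combining marks that such a pattern would also strip. Links are removed first here so that the URL pattern can still match.

```python
import re
import string

# ASCII punctuation plus the Devanagari danda and double danda (assumed set).
PUNCTUATION = string.punctuation + "\u0964\u0965"
_PUNCT_TABLE = str.maketrans({ch: " " for ch in PUNCTUATION})
_URL_RE = re.compile(r"https?://\S+|www\.\S+")
_DIGIT_RE = re.compile(r"\d+")       # \d also matches Devanagari digits
_LATIN_RE = re.compile(r"[A-Za-z]+")
_SPACE_RE = re.compile(r"\s+")


def clean_sentence(text: str) -> str:
    """Apply the listed cleaning steps to one sentence."""
    text = _URL_RE.sub(" ", text)         # web links
    text = text.translate(_PUNCT_TABLE)   # punctuation characters
    text = _DIGIT_RE.sub(" ", text)       # numeric characters
    text = _LATIN_RE.sub(" ", text)       # English alphabet characters
    return _SPACE_RE.sub(" ", text).strip()  # extra white-space
```

Replacing removed characters with a space (instead of deleting them) avoids gluing two words together when they were separated only by a punctuation mark.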

During data analysis, a significant number of discrepancies were identified in the raw corpus obtained from the original source. The types of discrepancies found in the dataset are summarized as:

– 1356 parallel sentence pairs were in the form Hindi-Hindi, whereas they should be in the form Hindi-Kangri (line 7797 to line 9152).
– A large number of parallel sentence pairs (more than 5000, approximately) were incomplete or misaligned translations.
– Duplicates of 1640 parallel sentences were present in the training set.
– 30 parallel sentences from the training corpus were leaked into the test set, which can inflate any evaluation metric score.

The steps taken to remove these discrepancies from the dataset are:

– Removed the false parallel sentences from the training set.
– Manually corrected the alignment of the parallel sentences in the training set.
– Removed the duplicates from the training set and the leaked data from the test set.
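
The last step can be sketched as below. This is a hedged illustration only: the helper names `deduplicate` and `remove_leaked` are hypothetical, and exact string matching is an assumption (near-duplicates that differ in spacing or spelling would not be caught). The alignment correction was manual and is not represented here.

```python
def deduplicate(pairs):
    """Drop repeated (source, target) pairs, keeping first occurrences in order."""
    seen = set()
    unique = []
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            unique.append(pair)
    return unique


def remove_leaked(test_pairs, train_pairs):
    """Drop test pairs that also occur verbatim in the training set."""
    train_set = set(train_pairs)
    return [pair for pair in test_pairs if pair not in train_set]


# Toy usage with placeholder strings instead of real sentences.
train = [("a", "x"), ("b", "y"), ("a", "x")]
test = [("b", "y"), ("c", "z")]
clean_train = deduplicate(train)
clean_test = remove_leaked(test, clean_train)
```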

Table 3 shows the data statistics of the final data retrieved from the original corpus². The monolingual corpus for Hindi is acquired from the IITB English-Hindi parallel corpus [33]. From the Kangri and Hindi monolingual corpora, only sentences with three or more words are retained.

Table 3
Data statistics after removing discrepancies

	Parallel	Monolingual	Total
Hindi	23,021	1,49,049	1,72,070
Kangri	23,021	1,49,049	1,72,070

4.2 Dataset splitting

In practical machine learning, a random data split is often employed to generate training and test sets. However, a single random split may yield a fortuitous outcome. To mitigate this, multiple splits of equal size are generated such that each test set is unique; in other words, no two test sets contain the same parallel sentence pairs. Five different splits are created, and the percentage of Out-of-Vocabulary (OOV) words (tokens) is calculated for both Hindi and Kangri. The performance of a machine translation system greatly depends on the breadth of vocabulary it covers. Of the five splits, the one with the lowest OOV% is selected as the best split for building an NMT system. The training and test sets for each split comprise 22,500 and 521 parallel sentence pairs, respectively (refer Table 5).
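
The splitting procedure can be sketched as follows. This is an illustration under assumptions rather than the authors' code: the function names, the random seed, and the definition of OOV% as the share of test tokens unseen in the training vocabulary are all hypothetical choices, since the text does not state whether OOV is counted over tokens or over word types.

```python
import random


def make_splits(pairs, n_splits=5, test_size=521, seed=0):
    """Return n_splits (train, test) splits whose test sets are pairwise disjoint."""
    if n_splits * test_size > len(pairs):
        raise ValueError("not enough pairs for disjoint test sets")
    order = list(range(len(pairs)))
    random.Random(seed).shuffle(order)
    splits = []
    for k in range(n_splits):
        # Each test set takes its own slice of the shuffled index list.
        test_idx = set(order[k * test_size:(k + 1) * test_size])
        train = [pairs[i] for i in range(len(pairs)) if i not in test_idx]
        test = [pairs[i] for i in sorted(test_idx)]
        splits.append((train, test))
    return splits


def oov_percent(train_sentences, test_sentences):
    """Percentage of test tokens that never occur in the training sentences."""
    vocab = {tok for sent in train_sentences for tok in sent.split()}
    test_tokens = [tok for sent in test_sentences for tok in sent.split()]
    if not test_tokens:
        return 0.0
    unseen = sum(tok not in vocab for tok in test_tokens)
    return 100.0 * unseen / len(test_tokens)
```

With 23,021 cleaned pairs and a test size of 521, each split leaves 22,500 pairs for training, which matches the sizes reported in the text.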

Table 4
Out-of-vocabulary ratio after parallel dataset split

Splits	HI Vocab Size	KA Vocab Size	OOV (%) HI	OOV (%) KA
Split 1	7984	9368	0.186	0.105
Split 2	7976	9360	0.347	0.167
Split 3	7976	9352	0.347	0.387
Split 4	7984	9360	0.219	0.244
Split 5	7968	9368	0.413	0.102

Table 5

Data statistics after train-test split

	Parallel
	Train	Test
Hindi	22,500	521
Kangri	22,500	521

The performance of NMT systems is directly influenced by the chosen tokenization scheme, which has a direct impact on the OOV%. An efficient and widely used technique for managing OOV% is Byte Pair Encoding (BPE) [34]. The choice of tokenizer depends primarily on the specific language, especially the writing script used in that language. In this case, both Hindi and Kangri share the Devanagari script. Therefore, joint BPE codes are learned through 10,000 merge operations. These learned joint BPE codes are then used to tokenize the dataset, and the vocabularies are extracted individually for each language (with no shared vocabulary). Table 4 presents the OOV% statistics for each split. For Split 1, the OOV% is low compared to the other splits. Split 1 is therefore taken as the best split, and all further experiments and results are reported with respect to Split 1.
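
To make the idea of learning joint BPE codes concrete, the sketch below implements a toy version of the merge-learning loop in pure Python. It is not the tool used in the study (the text does not name one), and the function names are hypothetical. "Joint" here simply means that the Hindi and Kangri training sentences are concatenated before the merges are learned; the study uses 10,000 merge operations, whereas the toy usage learns only two.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(sentences, num_merges):
    """Learn up to `num_merges` BPE merges from whitespace-tokenized sentences."""
    # Every word starts as a sequence of characters plus an end-of-word marker.
    word_freq = Counter()
    for sent in sentences:
        for word in sent.split():
            word_freq[tuple(word) + ("</w>",)] += 1
    merges = []
    for _ in range(num_merges):
        pair_freq = Counter()
        for symbols, freq in word_freq.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_freq[(a, b)] += freq
        if not pair_freq:
            break
        best = max(pair_freq, key=pair_freq.get)  # most frequent adjacent pair
        merges.append(best)
        merged = Counter()
        for symbols, freq in word_freq.items():
            merged[merge_pair(symbols, best)] += freq
        word_freq = merged
    return merges


def apply_bpe(word, merges):
    """Segment one word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy usage: ASCII placeholders stand in for the concatenated Hindi and Kangri text.
merges = learn_bpe(["low lower lowest", "low low"], num_merges=2)
segments = apply_bpe("low", merges)
```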

Note: The cleaned data provided in the link² consists of parallel and monolingual data. The parallel data corresponds to the Split 1 data. Monolingual data is not used during the learning of BPE codes.

4.3 Model architecture

The Transformer architecture represents a significant advancement in the fields of NMT and large language models (LLMs) [4]. In this study, the same Transformer architecture (refer Fig. 1) is used to construct an NMT system for the Hindi-Kangri language pair. The Transformer is an encoder-decoder network that uses Multi-Head Attention (an extended version of self-attention) in the encoder part of the network. The Masked Multi-Head Attention block in the decoder gives the model its auto-regressive ability during training and inference. Positional Encoding carries information about the order in which tokens appear in a sentence. Input embedding and target embedding refer to the word embeddings (vector representations) of the words or tokens in the source and target language corpus, respectively. When training an NMT model on a large parallel corpus, pre-trained word embeddings such as FastText [35], GloVe [36] and Word2Vec [37] are often used. Pre-trained embeddings are effective when they are trained on a sufficiently large corpus, so that the semantics of the sentence can be captured [38]. Since a large corpus is not available in this study, the embeddings are learned during the training of the NMT model. The Transformer architecture has a number of hyperparameters, which are discussed in the next section.

Fig. 1

Transformer architecture.

4.4 Transformer hyperparameter

In machine learning models, the total parameters can be categorized into two distinct groups: learnable parameters and hyperparameters. Learnable parameters are those that undergo training, meaning they are updated using techniques like backpropagation. These parameters adapt to the underlying distribution of the training data. In contrast, hyperparameters are independent of the data distribution and are specified by the user. They remain fixed throughout both the training and testing phases of the model.

The Transformer architecture is known for having a substantial number of hyperparameters, and its performance is highly influenced by the selection of the optimal set of hyperparameters. Discovering the most effective hyperparameter combinations often involves conducting numerous experiments. Some crucial hyperparameters within the Transformer architecture include:

– Embedding Dimension
– Feed Forward Dimension (Latent Dimension)
– Number of Encoder-Decoder Layers
– Tokenizer
– Dropout
– Attention Dropout
– Learning Rate
– Label Smoothing
– Number of Attention Heads
– Loss Function and Optimizer
– Batch Size
– Maximum Tokens
– Training Epochs

Among the listed hyperparameters, the Embedding Dimension, the Feed Forward Dimension (Latent Dimension) and the Number of Encoder-Decoder Layers are the ones that directly impact the model’s complexity, specifically in terms of the number of learnable parameters. To determine the best set of hyperparameter values, hyperparameter optimization techniques like Random Search, Grid Search, Bayesian Optimization, etc., are typically employed. These techniques are predominantly used for optimizing parameters like learning rate and label smoothing, which do not significantly affect the model’s complexity.
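
The link between these three hyperparameters and the number of learnable parameters can be illustrated with a rough count for a standard encoder-decoder Transformer. The sketch below is an approximation under assumptions (biases on all attention and feed-forward projections, two layer norms per encoder layer and three per decoder layer, parameter-free positional encodings, an untied output projection without bias); the function name is hypothetical and the count of a particular toolkit's implementation may differ slightly.

```python
def transformer_param_count(embed_dim, ffn_dim, num_layers,
                            src_vocab, tgt_vocab, tie_output=False):
    """Approximate number of learnable parameters (see stated assumptions)."""
    # Q, K, V and output projections, each embed_dim x embed_dim plus a bias.
    attention = 4 * (embed_dim * embed_dim + embed_dim)
    # Two feed-forward projections with biases.
    ffn = (embed_dim * ffn_dim + ffn_dim) + (ffn_dim * embed_dim + embed_dim)
    layer_norm = 2 * embed_dim  # scale and shift vectors
    encoder_layer = attention + ffn + 2 * layer_norm
    decoder_layer = 2 * attention + ffn + 3 * layer_norm  # self- and cross-attention
    embeddings = (src_vocab + tgt_vocab) * embed_dim
    output_proj = 0 if tie_output else tgt_vocab * embed_dim
    return num_layers * (encoder_layer + decoder_layer) + embeddings + output_proj


# Usage with the Split 1 vocabulary sizes reported in this paper.
small = transformer_param_count(64, 64, 2, src_vocab=7984, tgt_vocab=9368)
large = transformer_param_count(256, 512, 4, src_vocab=7984, tgt_vocab=9368)
```

The attention term grows quadratically with the embedding dimension, while the embedding tables grow linearly with both the vocabulary size and the embedding dimension, which is why the embedding dimension dominates the model size for small vocabularies.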

However, when it comes to optimizing parameters like embedding dimension and feed-forward dimension, a more intricate approach is needed, often requiring multiple GPUs to handle the abrupt changes in memory requirements as different combinations of hyperparameters are explored.

The size of the dataset or vocabulary plays a vital role in determining these three hyperparameter values. In scenarios with limited linguistic resources, such as low-resource languages, the availability of a parallel corpus is consistently constrained, resulting in a notably smaller vocabulary. Transformer based NMT models or language models are usually trained on large datasets with a large vocabulary, and in such cases the embedding dimension, latent dimension and number of encoder-decoder layers are usually set to (1024 or 2048), (2048 or 4096) and (6, 8 or 12), respectively. Using a similar range of values for a low resource language does not yield an optimal result [7, 8].

4.5 Training mechanism

To harness the potential of both parallel and monolingual data, both supervised and semi-supervised training mechanisms are used. Supervised training is the most common training mechanism in NMT. In this study, two supervised NMT models (SUP_HI→KA and SUP_KA→HI) are trained independently for the Hindi to Kangri and Kangri to Hindi translation directions. The trained models SUP_HI→KA and SUP_KA→HI use the monolingual Hindi and Kangri corpora to generate the synthetic Kangri and Hindi corpora, respectively (refer Fig. 2).

Fig. 2

Training mechanism for NMT model.

In a semi-supervised training approach, both parallel and monolingual datasets are employed. This methodology uses a technique known as On-the-fly backtranslation [39], which involves training two NMT models simultaneously. In the initial phase, both models are trained using parallel data. Subsequently, in the second stage, these trained models translate the monolingual data, creating synthetic sentences. Once these synthetic sentences are generated, a synthetic parallel corpus is created and used as input for the NMT models, with the target sentence remaining the same as the original sentence. With each epoch, the quality of the synthetic sentences improves, and the NMT models learn further from the monolingual corpus. The iterative nature of the On-the-fly backtranslation process results in an extended training duration, necessitating the use of multiple GPUs to accelerate the training process.

To circumvent this situation, an alternative technique, referred to as Off-the-fly backtranslation, is used to exploit the monolingual data. In this technique, a synthetic corpus is generated using a fully trained NMT model, as depicted in Fig. 2. The synthetic corpus is then merged with the parallel corpus, and a new NMT model is subsequently trained on the expanded dataset. The effectiveness of the semi-supervised NMT model relies heavily on the quality of the synthetic parallel corpus. If the supervised NMT model used to generate the synthetic parallel corpus produces translations of poor quality, including such a corpus alongside the genuine parallel data can diminish the NMT model's performance.
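
The corpus construction described above can be sketched as follows. This is a minimal illustration under a stated assumption: it follows the usual back-translation convention in which the genuine monolingual sentence is kept on the target side and the machine-generated sentence is placed on the source side, which the text specifies for the on-the-fly variant but does not spell out for the off-the-fly one. The function name and the `backward_translate` callable, which stands in for a fully trained supervised model, are hypothetical.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def build_augmented_corpus(parallel: List[Pair],
                           mono_target: List[str],
                           backward_translate: Callable[[str], str]) -> List[Pair]:
    """Merge genuine parallel pairs with synthetic pairs from monolingual data."""
    # Synthetic source = machine translation of a genuine target-side sentence.
    synthetic = [(backward_translate(sentence), sentence) for sentence in mono_target]
    return list(parallel) + synthetic


# Toy usage: an upper-casing function stands in for the trained reverse model.
augmented = build_augmented_corpus(
    parallel=[("src one", "tgt one")],
    mono_target=["tgt two", "tgt three"],
    backward_translate=str.upper,
)
```

A new model is then trained from scratch on the merged list, so any systematic errors of the reverse model are learned as if they were genuine source sentences.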

4.6 Implementation details

All experiments in this research were fully implemented using the Google Colab-Pro platform, with access to the paid NVIDIA A100-SXM4-40 GPU machine. The Python modules, libraries and toolkits employed in this work include Fairseq [40], Numpy [41], NLTK [42] and Sacrebleu [43].

4.7 Performance evaluation metric

The evaluation metrics used to assess the performance of the trained NMT models are SacreBLEU [43], Bilingual Evaluation Understudy (BLEU) [44], Translation Error Rate (TER) [45] and the variants of the CHaRacter-level F-score, namely Chrf, Chrf+ and Chrf++ [46]. The evaluation conditions, i.e., the arguments that need to be defined for calculating each metric score, are listed in Table 6. For Chrf, Chrf+ and Chrf++, no tokenizer needs to be defined, and for TER neither a tokenizer nor a smoothing operation needs to be defined.

Table 6
Evaluation conditions

Metric	Tokenizer	Smoothing	Python Package
BLEU	Whitespace	None	nltk 3.8.1
SacreBLEU	13a	Exponential	sacrebleu 2.3.1
Chrf	–	None	sacrebleu 2.3.1
Chrf+	–	None	sacrebleu 2.3.1
Chrf++	–	None	sacrebleu 2.3.1
TER	–	–	sacrebleu 2.3.1
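
To illustrate why the character-level metrics need no tokenizer, the following is a simplified, sentence-level sketch of a character n-gram F-score. It is not the sacrebleu implementation used in this study and will not reproduce its scores exactly: the averaging of precision and recall over n-gram orders 1 to 6, the value β = 2 and the removal of whitespace are assumptions chosen to keep the example short, and the function names are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n):
    """Multiset of character n-grams of `text`, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified character n-gram F-score on a 0-100 scale."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # this order is longer than one of the strings
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # beta > 1 weights recall more heavily than precision.
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

Because the score is computed over characters, a hypothesis that gets a word stem right but an inflection wrong still receives partial credit, which is one reason such metrics are often reported for morphologically rich languages.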

5 Results and analysis

In this section a detailed discussion over the results obtained for all the experiments conducted in this research study is presented. Analysis of the results obtained for the trained NMT models has been discussed.

5.1 Hyperparameter selection

To find the best set of hyperparameter values for the Hindi-Kangri corpus (low resource scenario), multiple experiments are performed to observe the effect of each hyperparameter on the model’s performance. All the hyperparameters used for training the model are listed in Table 7. Parameters marked in bold (those listed with multiple values) need to be optimized, and multiple experiments are performed for them. To select the best hyperparameter combination, an NMT model is trained for the Hindi to Kangri translation direction under different combinations of hyperparameters. The performance of the Hindi to Kangri NMT model under the different hyperparameter configurations is presented in Table 8. The performance is evaluated over 521 test sentences using SacreBLEU as the evaluation metric. Decoding is performed using the Beam search algorithm [47] with a beam length of 5 (B5).

Table 7
Training parameters for transformer based NMT model

Hyperparameter	Value
Embedding Dimension	64, 128, 256
Latent Dimension	64, 128, 256, 512, 1024, 2048
No. of Encoder-Decoder layers	2, 4
Tokenizer	BPE (10,000 merge operations)
Hindi Vocab Size	7984
Kangri Vocab Size	9368
Dropout	0.3
Attention Dropout	0
Learning Rate	0.0005 (Fixed)
Label Smoothing	0.1
No. of Attention Heads	4
Optimizer	Adam
Batch Size	64
Training Epochs	50
Max Tokens	4000
Loss Function	Cross Entropy
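
The search space in Table 7 can be enumerated as in the sketch below, which also shows how one configuration might be mapped to a `fairseq-train` argument list. This is an assumption-laden illustration, not the authors' command: the constraint that the latent dimension is at least the embedding dimension is inferred from the rows of Table 8, reading "Fixed" as the `fixed` learning-rate scheduler is an interpretation, and the data directory name and both function names are hypothetical.

```python
from itertools import product

EMBED_DIMS = [64, 128, 256]
LATENT_DIMS = [64, 128, 256, 512, 1024, 2048]
LAYERS = [2, 4]


def grid():
    """All (embedding, layers, latent) triples with latent >= embedding."""
    return [(e, n, f) for e, n, f in product(EMBED_DIMS, LAYERS, LATENT_DIMS) if f >= e]


def fairseq_train_args(data_dir, embed_dim, layers, latent_dim):
    """Build a fairseq-train argument list for one configuration."""
    args = ["fairseq-train", data_dir, "--arch", "transformer"]
    for side in ("encoder", "decoder"):
        args += [f"--{side}-embed-dim", str(embed_dim),
                 f"--{side}-ffn-embed-dim", str(latent_dim),
                 f"--{side}-layers", str(layers),
                 f"--{side}-attention-heads", "4"]
    # Fixed settings taken from Table 7.
    args += ["--dropout", "0.3", "--attention-dropout", "0",
             "--optimizer", "adam", "--lr", "0.0005", "--lr-scheduler", "fixed",
             "--criterion", "label_smoothed_cross_entropy", "--label-smoothing", "0.1",
             "--batch-size", "64", "--max-tokens", "4000", "--max-epoch", "50"]
    return args


configs = grid()
example = fairseq_train_args("data-bin", 256, 4, 512)  # "data-bin" is a placeholder path
```

Under the inferred constraint the grid contains 30 configurations; Table 8 additionally reports one run with 5 layers, which lies outside this grid.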

The analysis of the results presented in Table 8 is summarized as:

Table 8

Hyperparameter selection for Hindi to Kangri translation direction NMT model

Embedding dimension	Number of encoder-decoder layers	Latent dimension	SacreBLEU (B5)
64	2	64	7.40
64	2	128	6.93
64	2	256	7.28
64	2	512	7.54
64	2	1024	6.97
64	2	2048	6.73
64	4	64	7.33
64	4	128	7.08
64	4	256	7.41
64	4	512	7.54
64	4	1024	7.07
64	4	2048	0.78
128	2	128	11.12
128	2	256	10.10
128	2	512	11.80
128	2	1024	10.71
128	2	2048	10.75
128	4	128	11.87
128	4	256	11.07
128	4	512	11.83
128	4	1024	11.60
128	4	2048	0.55
256	2	256	11.81
256	2	512	10.08
256	2	1024	9.86
256	2	2048	10.24
256	4	256	11.07
256	4	512	14.61
256	4	1024	3.45
256	4	2048	0
256	5	2048	0

– When the embedding dimension and the number of encoder-decoder layers are held constant and only the latent dimension is varied from 64 to 1024, there is no notable impact on the SacreBLEU score (refer Fig. 3(b)).
– When the latent dimension is changed from 1024 to 2048 while the embedding dimension is kept constant, there is no substantial effect on the SacreBLEU score with 2 encoder-decoder layers. However, with 4 layers, a notable decrease in the SacreBLEU score becomes apparent.
– This suggests that the selection of the latent dimension has minimal impact on the performance of the NMT model. However, when the latent dimension is increased significantly, the performance declines sharply.
– When the number of encoder-decoder layers is changed while the other parameters are kept constant, no significant change is observed in the SacreBLEU score.
– When the embedding dimension is changed while the other parameters are kept constant, an increase of 3 to 5 points is observed in the SacreBLEU score (refer Fig. 3(a)).
– The highest SacreBLEU score is achieved with 4 encoder-decoder layers, with 256 and 512 as the embedding dimension and latent dimension, respectively.

Fig. 3

SacreBLEU score variation under different hyperparameter configuration.

Based on the analysis of Table 8, it can be concluded that in a low-resource language scenario (with a parallel corpus of 20K to 30K sentence pairs), the selection of the embedding dimension significantly impacts the NMT model’s performance. Furthermore, other parameters like the number of encoder-decoder layers and the latent dimension can markedly influence the model’s performance, especially when the latent dimension exceeds 512 and the model’s depth exceeds 4. The results reported in the following sections of this paper are obtained with the best hyperparameter values, i.e., 256, 512 and 4 as the Embedding Dimension, Latent Dimension and Number of Encoder-Decoder layers, respectively. The same set of hyperparameter values is used to train the final models for both the Hindi to Kangri and Kangri to Hindi translation directions.
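
The selection can be reproduced directly from the numbers in Table 8. The sketch below copies the table rows into a list and picks the best configuration per embedding dimension; the variable and function names are hypothetical, and ties are resolved in favor of the row that appears first in the table.

```python
# (embedding dim, layers, latent dim, SacreBLEU B5), copied from Table 8.
RESULTS = [
    (64, 2, 64, 7.40), (64, 2, 128, 6.93), (64, 2, 256, 7.28),
    (64, 2, 512, 7.54), (64, 2, 1024, 6.97), (64, 2, 2048, 6.73),
    (64, 4, 64, 7.33), (64, 4, 128, 7.08), (64, 4, 256, 7.41),
    (64, 4, 512, 7.54), (64, 4, 1024, 7.07), (64, 4, 2048, 0.78),
    (128, 2, 128, 11.12), (128, 2, 256, 10.10), (128, 2, 512, 11.80),
    (128, 2, 1024, 10.71), (128, 2, 2048, 10.75),
    (128, 4, 128, 11.87), (128, 4, 256, 11.07), (128, 4, 512, 11.83),
    (128, 4, 1024, 11.60), (128, 4, 2048, 0.55),
    (256, 2, 256, 11.81), (256, 2, 512, 10.08), (256, 2, 1024, 9.86),
    (256, 2, 2048, 10.24),
    (256, 4, 256, 11.07), (256, 4, 512, 14.61), (256, 4, 1024, 3.45),
    (256, 4, 2048, 0.0), (256, 5, 2048, 0.0),
]


def best_per_embedding(results):
    """Map each embedding dimension to its best (layers, latent, score) row."""
    best = {}
    for emb, layers, latent, score in results:
        if emb not in best or score > best[emb][2]:
            best[emb] = (layers, latent, score)
    return best


best = best_per_embedding(RESULTS)
overall = max(RESULTS, key=lambda row: row[3])
```

The best score per embedding dimension rises from 7.54 (64) to 11.87 (128) to 14.61 (256), which is the trend the conclusion above relies on.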

5.2 Final experiment results

This section discusses the results of the final experiments, in which NMT models are trained with the supervised and semi-supervised training mechanisms. The results are obtained using the best hyperparameter combination identified above. The evaluation metrics used to assess the performance of the NMT models are listed in Table 6. These performance metrics were computed at the corpus level, which is the standard approach for evaluating machine translation systems. NMT models undergo supervised and semi-supervised training, as elaborated in sub-section 4.5, for translation between Hindi and Kangri in both directions. Their performance scores are displayed in Tables 9 and 10 for the respective translation directions. The decoding process employs the Beam search algorithm with beam lengths of 1, 5, and 8, denoted as B1, B5 and B8, respectively. It’s worth noting that using a beam length of 1 in Beam search is analogous to employing a greedy algorithm.
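
The relationship between beam length and greedy decoding can be seen in a small, self-contained sketch. It is a toy illustration rather than the decoder used in the experiments: `step_fn` is a hypothetical callable returning next-token log-probabilities for a prefix, the toy distribution is invented for the example, and no length normalization is applied.

```python
import math


def beam_search(step_fn, beam_size, max_len, eos="</s>"):
    """Return the highest-scoring (tokens, log-prob) hypothesis found."""
    beams = [((), 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, logprob in step_fn(prefix).items():
                candidates.append((prefix + (token,), score + logprob))
        # Keep only the beam_size best expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            if prefix[-1] == eos:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off by max_len
    return max(finished, key=lambda c: c[1])


def toy_model(prefix):
    """Invented distribution in which the greedy first choice is a trap."""
    if prefix == ():
        return {"a": math.log(0.6), "b": math.log(0.4)}
    if prefix == ("a",):
        return {"x": math.log(0.5), "y": math.log(0.5)}
    if prefix == ("b",):
        return {"z": math.log(1.0)}
    return {"</s>": 0.0}


greedy_tokens, greedy_score = beam_search(toy_model, beam_size=1, max_len=5)
beam_tokens, beam_score = beam_search(toy_model, beam_size=2, max_len=5)
```

With a beam of 1 the search commits to the locally best token "a" and ends with probability 0.3, whereas a beam of 2 keeps "b" alive and finds the sequence with probability 0.4. This mirrors the small gains from B1 to B5 and B8 in the tables that follow.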

Table 9
Performance for Hindi to Kangri translation

	BLEU (0–1)	SacreBLEU (0–100)	Chrf (0–100)	Chrf+ (0–100)	Chrf++ (0–100)	TER (%)
Supervised (SUP_HI→KA)
B1	0.130	13.02	39.34	39.34	36.62	73.46
B5	0.146	14.61	41.38	41.41	38.64	68.58
B8	0.150	14.98	41.43	41.49	38.77	68.20
Semi-Supervised (SEMI_HI→KA)
B1	0.080	7.98	34.76	34.40	31.64	78.03
B5	0.089	8.95	35.35	34.99	32.23	77.00
B8	0.091	9.06	35.27	34.94	32.22	77.32

Table 10

Performance for Kangri to Hindi translation

	BLEU (0–1)	SacreBLEU (0–100)	Chrf (0–100)	Chrf+ (0–100)	Chrf++ (0–100)	TER (%)
Supervised (SUP_KA→HI)
B1	0.256	25.44	47.20	48.11	46.07	57.15
B5	0.279	27.77	49.17	50.09	48.09	54.87
B8	0.277	27.52	49.11	50.00	47.99	54.77
Semi-Supervised (SEMI_KA→HI)
B1	0.272	27.06	48.54	49.47	47.46	53.69
B5	0.282	28.04	49.63	50.56	48.55	51.51
B8	0.283	28.17	49.71	50.64	48.63	51.25

In the Hindi to Kangri translation direction, the top-performing model is the supervised model (SUP_HI→KA), achieving a SacreBLEU score of 14.98. For the Kangri to Hindi direction, the leading model is the semi-supervised model (SEMI_KA→HI), attaining a SacreBLEU score of 28.17. Incorporating the synthetic corpus improves the SacreBLEU score for Kangri to Hindi translation slightly, from 27.77 to 28.17. Conversely, in the opposite direction, the score declines from 14.98 to 9.06. Based on the outcomes observed for both the supervised and semi-supervised models, two questions emerge:

1. In the case of the supervised models (SUP_HI→KA and SUP_KA→HI), what accounts for the 12-13 point gap in SacreBLEU scores, considering that both models were trained on the same parallel corpus?
2. Why does combining synthetic parallel data with the original parallel corpus cause an approximate 6-point decrease in SacreBLEU score for Hindi to Kangri translation, whereas the opposite direction experiences a minor improvement?

One potential explanation for this behavior is the significant linguistic similarity between the two languages. To understand it, a manual analysis of the systems' translations was conducted. Table 11 presents a concise set of Hindi words and their counterparts in Kangri. In most cases, a single Hindi word can be associated with multiple Kangri words, and at times Hindi words are used directly within the Kangri language.

Table 11

Hindi-Kangri corresponding word list

Figure 4 provides two examples of genuine parallel sentence pairs. It is evident that in the first example, the Hindi sentence contains the word with its corresponding Kangri counterpart being . In contrast, in the second example, the Hindi sentence includes the word and its corresponding word in Kangri is also . Such scenarios arise because it is a natural tendency for bilingual individuals, who are proficient in two highly similar languages, to mix words and accents.

Fig. 4

Sample of true parallel sentence pairs.

Because of this pattern, when a sentence is translated from Hindi to Kangri and the input contains a Hindi word that has more than one Kangri counterpart, the model, relying on context information, distributes the output probability among those counterparts. The word prediction can therefore go wrong, and the evaluation metric scores decrease. In contrast, when translation is done from Kangri to Hindi and the input contains any of these Kangri variants, the majority of the probability mass goes to the single corresponding Hindi word, which leads to a correct prediction and does not reduce the metric scores (refer to Fig. 5). This is one possible reason for the gap of 12-13 SacreBLEU points between the two supervised models.
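
The asymmetry described above can be made concrete with a small relative-frequency sketch. The alignment counts and the placeholder tokens (`hi_word`, `ka_variant_1`, `ka_variant_2`) are hypothetical, and a count-based lexical table is only a stand-in for the context-dependent output distribution of an NMT model.

```python
from collections import Counter, defaultdict

# Hypothetical (Hindi, Kangri) word-alignment counts, for illustration only.
ALIGNED_COUNTS = Counter({
    ("hi_word", "ka_variant_1"): 50,
    ("hi_word", "ka_variant_2"): 30,
    ("hi_word", "hi_word"): 20,  # Hindi word used unchanged in Kangri
})


def conditional(counts, source_index):
    """Relative-frequency estimate of P(target | source).

    source_index is 0 for Hindi -> Kangri and 1 for Kangri -> Hindi.
    """
    totals = defaultdict(int)
    for pair, n in counts.items():
        totals[pair[source_index]] += n
    table = defaultdict(dict)
    for pair, n in counts.items():
        src, tgt = pair[source_index], pair[1 - source_index]
        table[src][tgt] = n / totals[src]
    return {src: dict(tgts) for src, tgts in table.items()}


hi_to_ka = conditional(ALIGNED_COUNTS, 0)
ka_to_hi = conditional(ALIGNED_COUNTS, 1)

# Hindi -> Kangri: the mass is split over three candidate words.
print(hi_to_ka["hi_word"])
# Kangri -> Hindi: every variant maps to the same Hindi word.
print(ka_to_hi)
```

With these made-up counts, the most likely Kangri output for the Hindi word receives only half of the probability mass, while each Kangri variant maps to the Hindi word with probability 1.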

Fig. 5

Translation output from the best NMT systems.

The supervised models (SUP_HI→KA and SUP_KA→HI) are used to generate the synthetic parallel corpus from the monolingual corpora. Since SUP_HI→KA and SUP_KA→HI reach SacreBLEU scores of only 14.98 and 27.52, respectively, with a beam size of 8, the translations they generate are not of high quality, which directly affects the quality of the synthetic corpus. Because of this low-quality synthetic corpus, there is a drop of about 6 SacreBLEU points in the Hindi to Kangri direction. Sample translation outputs of the proposed systems are shown in Fig. 6. The portions of the translation output that overlap with the reference sentence are marked in red and blue.

Fig. 6

Sample translations for all 4 NMT systems.

6 Comparison with existing work

Kangri is considered an extremely low-resource language, with its inaugural dataset being published as recently as 2021 [11]. Consequently, there is a scarcity of relevant research and resources available for this language. The source dataset¹ was employed in a prior study [10] as part of a research work in NMT.

However, because of the discrepancies identified within the raw corpus, as discussed in Section 4.1, it became necessary to rectify them. This rectification led to a new data split for training and testing purposes. As a result, the test set used in this study differs from the one used in the earlier work [10], which makes it difficult to compare the performance of the NMT systems in the two studies directly.

The best translation model proposed in [10] achieves BLEU and TER scores of (21, 7) and (20, 7) for the Hindi to Kangri and Kangri to Hindi translation directions, respectively. BLEU assesses the similarity between the machine-generated translation and the reference translation by examining n-gram overlap. In contrast, TER measures the edit distance, i.e. the number of edits needed to transform the machine-generated translation into the reference translation.

Note: BLEU and TER scores are reported on a scale of either (0–1) or (0–100). Since [10] provides no information about the scale of its scores, a scale of (0–100) is assumed here.

A BLEU score of 21 signifies a low degree of similarity between the system translation and the reference: a considerable number of tokens in the translation output do not correspond to those in the reference, and the sequence of tokens may also differ significantly. In this context, a TER score of 7 appears inconsistent. Typically, a TER score of 7 indicates that relatively few edits are necessary, implying that the machine translation closely resembles the reference. Because of this contradiction between the BLEU and TER values, it is questionable whether [10] can be considered as a baseline.
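
The tension between the two reported values can be illustrated on a toy sentence pair. The sketch below uses simplified stand-ins for both metrics: a word-level edit rate without the shift operation of TER, and a geometric mean of clipped 1- to 4-gram precisions without smoothing (the brevity penalty is 1 here because both sentences have the same length). The placeholder tokens and all function names are assumptions made for illustration; this is not the sacrebleu implementation and it is a single example, not a general proof.

```python
import math
from collections import Counter


def edit_distance(hyp, ref):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        curr = [i]
        for j, r in enumerate(ref, 1):
            curr.append(min(prev[j] + 1,               # delete h
                            curr[j - 1] + 1,           # insert r
                            prev[j - 1] + (h != r)))   # match or substitute
        prev = curr
    return prev[-1]


def ngram_precision(hyp, ref, n):
    """Clipped n-gram precision of the hypothesis against one reference."""
    hyp_counts = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    return matched / sum(hyp_counts.values())


# Hypothetical 15-token reference; the hypothesis differs in one token.
reference = [f"t{i}" for i in range(15)]
hypothesis = list(reference)
hypothesis[7] = "X"

edit_rate = 100.0 * edit_distance(hypothesis, reference) / len(reference)
precisions = [ngram_precision(hypothesis, reference, n) for n in range(1, 5)]
overlap_score = 100.0 * math.exp(sum(math.log(p) for p in precisions) / 4)

print(f"edit rate: {edit_rate:.2f}%  n-gram overlap score: {overlap_score:.2f}")
```

In this example an edit rate of about 6.7% coincides with an n-gram overlap score of about 80, far above 21, which is the kind of mismatch noted above.
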
7 Conclusion and future work

This study explores NMT systems for low-resource languages. Experiments are conducted on the Hindi-Kangri language pair, where Kangri is an extremely low-resource language. The research contributions of this paper are:

– A cleaner version of the existing Hindi-Kangri corpus.

– An exploration of the impact of hyperparameters on the performance of NMT models.

– A new baseline score for NMT over the Hindi-Kangri corpus.

After cleaning the dataset, a large number of experiments were conducted to obtain the best combination of hyperparameters for the transformer architecture. Using this combination, NMT models were trained in a supervised and a semi-supervised manner. For the HI to KA NMT model, the SacreBLEU scores achieved with a beam size of 8 were 14.98 (supervised) and 9.06 (semi-supervised), whereas for the KA to HI NMT model they were 27.52 (supervised) and 28.17 (semi-supervised).

Future work involves investigating methods to filter poor translations out of the synthetic corpus. The quality of this corpus significantly impacts the performance of NMT systems, so filtering techniques are needed to make the semi-supervised models more effective. In addition, not all languages have readily available parallel corpora; addressing such situations involves exploring unsupervised techniques, which is part of the ongoing research agenda.
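
One simple form such filtering could take is sketched below. This is not a method evaluated in this study: the heuristics (dropping empty outputs, extreme length ratios and exact duplicates), the threshold values and the function name `filter_synthetic` are all assumptions made for illustration.

```python
def filter_synthetic(pairs, min_ratio=0.5, max_ratio=2.0):
    """Keep (source, target) pairs that pass basic sanity checks.

    The thresholds are hypothetical defaults, not tuned values.
    """
    kept = []
    seen = set()
    for src, tgt in pairs:
        src_tokens, tgt_tokens = src.split(), tgt.split()
        # Drop pairs with an empty side.
        if not src_tokens or not tgt_tokens:
            continue
        # Drop pairs whose lengths differ too much to be plausible translations.
        ratio = len(tgt_tokens) / len(src_tokens)
        if not (min_ratio <= ratio <= max_ratio):
            continue
        # Drop exact duplicates.
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept


# Placeholder token strings stand in for real sentences.
synthetic = [
    ("a b c d", "p q r s"),      # kept
    ("a b c d", ""),             # empty target
    ("a b", "p q r s t u"),      # length ratio 3.0
    ("a b c d", "p q r s"),      # duplicate of the first pair
]
filtered = filter_synthetic(synthetic)
print(filtered)
```

Stronger filters, for example ones based on model confidence or round-trip translation, would follow the same pattern of scoring each synthetic pair and discarding those below a threshold.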

Footnotes

References

1. https://www.freecodecamp.org/news/a-history-of-machine-translation-from-the-cold-war-to-deep-learning-f1d335ce8b5/.

2. Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio, Learning phrase representations using RNN Encoder–Decoder for statistical machine translation, In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734 (2014).

3. Dzmitry Bahdanau, Kyunghyun Cho, Bengio, Neural machine translation by jointly learning to align and translate, ArXiv 1409 (2014).

4. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Łukasz Kaiser, Illia Polosukhin, Attention is all you need, In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17) (2017).

5. Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pages 4171–4186 (2019).

6. Ali Araabi, Christof Monz, Optimizing transformer for low-resource neural machine translation, In Proceedings of the 28th International Conference on Computational Linguistics, pages 3429–3435 (2020).

7. Seamus Lankford, Haithem Afli, Andy Way, Transformers for low-resource languages: Is Féidir Linn!, In Proceedings of Machine Translation Summit XVIII: Research Track, pages 48–60 (2021).

8. Shubham Dewangan, Shreya Alva, Nitish Joshi, Pushpak Bhattacharyya, Experience of neural machine translation between Indian languages, Machine Translation 35(1) (2021) 71–99.

9. Biljon E.V., Pretorius, Kreutzer, On optimal transformer depth for low-resource language translation, In The International Conference on Learning Representations (ICLR 2020) (2020).

10. Chauhan, Saxena, Daniel, Analysis of neural machine translation KANGRI language by unsupervised and semi supervised methods, IETE Journal of Research (2022), 1–11.

11. Chauhan, Saxena, Daniel, Monolingual and parallel corpora for Kangri low resource language, arXiv preprint arXiv:2103.11596 (2021).

12. Ben Athiwaratkun, Andrew Wilson, Anima Anandkumar, Probabilistic FastText for multi-sense word embeddings, In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1–11 (2018).

13. Chen, Perozzi, Al-Rfou, Skiena S.S., The expressive power of word embeddings, ArXiv, abs/1301.3226 (2013).

14. Razvan Pascanu, Tomas Mikolov, Yoshua Bengio, On the difficulty of training recurrent neural networks, In Proceedings of the 30th International Conference on International Conference on Machine Learning, Volume 28 (2013).

15. Stahlberg, Neural machine translation: A review, Journal of Artificial Intelligence Research 69 (2020) 343–418.

16. Karim Ahmed, Nitish Shirish Keskar, Richard Socher, Weighted transformer network for machine translation, ArXiv, abs/1711.02132 (2017).

17. Liu, Zhou, Integrating dependency tree into self-attention for sentence representation, ICASSP 2022, 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8137–8141 (2022).

18. Peter Shaw, Jakob Uszkoreit, Ashish Vaswani, Self-attention with relative position representations, In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2, pages 464–468 (2018).

19. Artetxe, Labaka, Agirre, Cho, Unsupervised neural machine translation, ArXiv, abs/1710.11041 (2017).

20. Lample, Denoyer, Ranzato, Unsupervised machine translation using monolingual corpora only, ArXiv, abs/1711.00043 (2017).

21. Barret Zoph, Deniz Yuret, Jonathan May, Kevin Knight, Transfer learning for low-resource neural machine translation, In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1568–1575 (2016).

22. Mieradilijiang Maimaiti, Yang Liu, Huanbo Luan, Maosong Sun, Enriching the transfer learning with pre-trained lexicon embedding for low-resource neural machine translation, Tsinghua Science and Technology 27(1) (2022) 13.

23. Shailashree Sheshadri, Deepa Gupta, Marta Costa-Jussà, A voyage on neural machine translation for indic languages, Procedia Comput Sci 218(C) (2023) 2694–2712.

24. Premjith, Anand Kumar, Soman K.P., Neural machine translation system for English to Indian language translation using MTIL parallel corpus: Special issue on natural language processing, Journal of Intelligent Systems (2019).

25. Sheshadri S.K., Dhanush, Pradyumna N.V.S., Sripathi S.R., Gupta, Reordering based unsupervised neural machine translation system for English To Telugu, 2022 13th International Conference on Computing Communication and Networking Technologies (ICCCNT), Kharagpur, India, pp. 1–6 (2022).

26. Sheshadri S.K., Sai Bharath, Hari Naga Sree Chandana Sarvani, Reddy Vijaya Bharathi Reddy, Gupta, Unsupervised neural machine translation for English to Kannada using pre-trained language model, 2022 13th International Conference on Computing Communication and Networking Technologies (ICCCNT), Kharagpur, India, pp. 1–5 (2022).

27. Gadugoila, Sheshadri S.K., Nair P.C., Gupta, Unsupervised pivot-based neural machine translation for English to Kannada, 2022 IEEE 19th India Council International Conference (INDICON), Kochi, India, pp. 1–6 (2022).

28. Sheshadri S.K., Gupta, Costa-Jussà M.R., Neural machine translation for Kashmiri to English and Hindi using pre-trained embeddings, 2022 OITS International Conference on Information Technology (OCIT), Bhubaneswar, India, pp. 238–243 (2022), doi: 10.1109/OCIT56763.2022.00053.

29. Premjith, Soman K.P., Anand Kumar, Jyothi Ratnam, Embedding linguistic features in word embedding for preposition sense disambiguation in English–Malayalam machine translation context, Studies in Computational Intelligence 823 (2019), 341–370.

30. AI4Bharat, Jay Gala, Pranjal Chitale, Raghavan A.K., Sumanth Doddapaneni, Varun Gumma, Aswanth Kumar, Janki Nawale, Anupama Sujatha, Ratish Puduppully, Vivek Raghavan, Pratyush Kumar, Mitesh Khapra, Raj Dabre, Anoop Kunchukuttan, IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages, ArXiv, abs/2305.16307 (2023).

31. Lone N.A., Giri K.J., Bashir, Machine intelligence for language translation from Kashmiri to English, Journal of Information & Knowledge Management (2022), 2250074.

32. Gusain, Dash S.R., Parida, Jha G.N., Automatic language identification: A case study of Pahari languages, Language Resources and Evaluation (2023), 1–27.

33. Anoop Kunchukuttan, Pratik Mehta, Pushpak Bhattacharyya, The IIT Bombay English-Hindi parallel corpus, In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018) (2018).

34. Rico Sennrich, Barry Haddow, Alexandra Birch, Neural machine translation of rare words with subword units, In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Volume 1, pages 1715–1725 (2016).

35. Ben Athiwaratkun, Andrew Wilson, Anima Anandkumar, Probabilistic FastText for multi-sense word embeddings, In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1–11 (2018).

36. Pennington, Socher, Manning C.D., GloVe: Global vectors for word representation, In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014).

37. Yoav Goldberg, Omer Levy, word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method, ArXiv, abs/1402.3722 (2014).

38. Ye Qi, Devendra Sachan, Matthieu Felix, Sarguna Padmanabhan, Graham Neubig, When and why are pre-trained word embeddings useful for neural machine translation?, In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 529–535 (2018).

39. Artetxe, Labaka, Agirre, Cho, Unsupervised neural machine translation, arXiv preprint arXiv:1710.11041 (2017).

40. Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, Michael Auli, fairseq: A fast, extensible toolkit for sequence modeling, In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53 (2019).

41. Van Der Walt, Colbert S.C., Varoquaux, The NumPy array: A structure for efficient numerical computation, Computing in Science & Engineering 13(2) (2011) 22–30.

42. Steven Bird, Edward Loper, NLTK: The Natural Language Toolkit, In Proceedings of the ACL Interactive Poster and Demonstration Sessions, pages 214–217 (2004).

43. Matt Post, A call for clarity in reporting BLEU scores, In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191 (2018).

44. Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu, Bleu: A method for automatic evaluation of machine translation, In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318 (2002).

45. Matthew Snover, Bonnie Dorr, Rich Schwartz, Linnea Micciulla, John Makhoul, A study of translation edit rate with targeted human annotation, In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers, pages 223–231 (2006).

46. Maja Popović, chrF: character n-gram F-score for automatic MT evaluation, In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395 (2015).

47. Tillmann, Ney, Word reordering and a dynamic programming beam search algorithm for statistical machine translation, Computational Linguistics 29(1) (2003) 97–133.


Table 1

Data statistics mentioned in [11]

Language	Parallel Train	Parallel Test	Monolingual Train	Monolingual Test	Total
Hindi	26,862	500	–	–	27,362
Kangri	26,862	500	1,80,552	1000	2,08,914

Table 4

Out-of-vocabulary ratio after parallel dataset split

Splits	HI Vocab Size	KA Vocab Size	OOV (%) HI	OOV (%) KA
Split 1	7984	9368	0.186	0.105
Split 2	7976	9360	0.347	0.167
Split 3	7976	9352	0.347	0.387
Split 4	7984	9360	0.219	0.244
Split 5	7968	9368	0.413	0.102

Table 6

Evaluation conditions

Metric	Tokenizer	Smoothing	Python Package
BLEU	Whitespace	None	nltk 3.8.1
SacreBLEU	13a	Exponential	sacrebleu 2.3.1
Chrf	–	None	sacrebleu 2.3.1
Chrf+	–	None	sacrebleu 2.3.1
Chrf++	–	None	sacrebleu 2.3.1
TER	–	–	sacrebleu 2.3.1