Improving Thai-Lao neural machine translation with similarity lexicon

Abstract

It has been shown that the performance of neural machine translation (NMT) drops sharply in low-resource conditions. Thai-Lao is a typical low-resource language pair with a tiny parallel corpus, which leads to suboptimal NMT performance. However, Thai and Lao have considerable similarities in linguistic morphology, and a bilingual lexicon for the pair is relatively easy to obtain. To exploit this property, we first build a bilingual similarity lexicon composed of pairs of similar words. We then propose a novel NMT architecture that leverages the similarity between Thai and Lao. Specifically, besides the prevailing sentence encoder, we introduce an extra similarity lexicon encoder into the conventional encoder-decoder architecture, which represents the semantic information carried by the similarity lexicon. We further provide a simple mechanism in the decoder to balance the information representations delivered from the input sentence and the similarity lexicon. Our approach exploits the linguistic similarity carried by the similarity lexicon to improve translation quality. Experimental results demonstrate that our approach achieves significant improvements over the state-of-the-art Transformer baseline system and previous similar works.

Keywords

Neural machine translation, Thai-Lao, linguistic similarity, structure improving, lexicon

1 Introduction

Low-resource NMT has attracted considerable attention in recent years [1], and several approaches have been proposed to improve translation quality through data augmentation [2, 3], unsupervised learning [4, 5], transfer learning [6, 7], structural improvements [8, 9], etc. Thai-Lao NMT is a typical low-resource task. However, influenced by the language family and the available language processing tools, research on Thai-Lao NMT over the past decade has not been widespread. The bulk of the related research has focused instead on language model training and named entity recognition [10, 11]. With the rapid development of the global economy, the demand for Thai-Lao translation has been increasing. It is therefore important to investigate how to design an effective and suitable model that improves the translation performance of Thai-Lao NMT on a small-scale parallel corpus.

Limited by the parallel corpus and basic language processing tools, dominant low-resource NMT approaches perform suboptimally on the Thai-Lao language pair. However, as both languages belong to the Tai-Kadai language family, Thai and Lao have considerable cross-lingual similarities [12]: the pronunciation and spelling of a large number of words are close or even identical. These similarities make the two languages mutually intelligible in human communication, but there is still a lack of approaches that apply them to NMT. Intuitively, to use cross-lingual similarity, the following preconditions need to be met: (1) accessible similarity representation: the similarity between the two languages should be represented explicitly and be easier to obtain than parallel sentences; and (2) improved NMT architecture: an efficient and suitable model should be designed to receive the extra similarity information and to balance the weight between the similarity information and the conventional sentence input. To meet these preconditions, we use a bilingual similarity lexicon as the container of similarity information and design a novel architecture to process the information flow. Our main contributions are as follows:

We investigate the cross-lingual similarities between Thai and Lao from the perspective of linguistic morphology and discuss the feasibility of choosing a Thai-Lao similarity lexicon as an extra semantic information representation. The similarity lexicon is composed of pairs of Thai-Lao words, such as < ,> (corresponding English: to), in which the Thai word “” and the Lao word “ ” are similar in linguistic morphology and have identical semantics.

To utilize the linguistic similarity carried by the lexicon, we propose a novel framework that introduces an extra similarity encoder into the conventional encoder-decoder architecture, which allows linguistic similarity to be infused into the transformation from the source language to the target language. On the decoding side, we further provide a simple balance mechanism whose central idea is to balance the flow of information representations delivered from the input sentence and the similarity lexicon.

The remainder of the paper is organized as follows. In Section 2, we introduce the linguistic similarity between Thai and Lao. In Section 3, we describe the architecture of our proposed model. In Section 4, we report the experimental settings and results. Section 5 concludes the paper and outlines our future work.

2 Linguistic similarity of Thai and Lao

Thai and Lao are both tonal languages from the Tai-Kadai language family, and their pronunciation and writing are highly similar. Spoken Thai and Lao are basically mutually intelligible. The two languages share a large number of etymologically related words and have similar head-initial syntactic structures. Both are written with abugida scripts that differ slightly from each other but are similar in basic morphological structure [12]. The example in Table 1 shows the similarity in the shape of tokens.

Table 1
Thai-Lao linguistic similarity in writing

Language	Sentence
Thai
Lao
English	Driving	to	Beijing

Besides token shape, we investigate the similarity of syntactic structure. We use the GIZA++ tool [13] to obtain word alignments over the 20 K Thai-Lao portion of the publicly available ALT dataset [14]. Based on the alignments, we calculate Kendall’s τ following the previous work of Isozaki [15]. Similar to edit distance, Kendall’s τ mainly measures the cost of rearranging the words of a parallel sentence pair into the same order based on the word alignment. As illustrated in Fig. 1, the Thai-Lao language pair shows a relatively similar word order with an average τ of around 0.73, while the average τ for Thai-Chinese is about 0.25. The result demonstrates the considerable similarity in syntactic structure between Thai and Lao, which is consistent with the conclusion of [12].
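To make the word-order measure concrete, the following is a minimal sketch of one common formulation of Kendall's τ over a word alignment: the fraction of word pairs whose aligned target positions are in increasing order, rescaled to [-1, 1]. The function name and the input format (a list of target positions given in source-word order) are assumptions made for illustration; the exact variant and alignment post-processing used in the experiments are not specified in the text.

```python
from itertools import combinations


def kendall_tau(aligned_positions):
    """Kendall's tau for target positions listed in source-word order.

    Returns 1.0 for identical word order and -1.0 for fully reversed order.
    """
    if len(aligned_positions) < 2:
        # Assumed convention: a single aligned word is trivially in order.
        return 1.0
    pairs = list(combinations(aligned_positions, 2))
    increasing = sum(1 for first, second in pairs if first < second)
    return 2.0 * increasing / len(pairs) - 1.0


# Monotone alignment, fully reversed alignment, and one local swap.
print(kendall_tau([0, 1, 2, 3]))
print(kendall_tau([3, 2, 1, 0]))
print(kendall_tau([0, 2, 1, 3]))
```

Averaging this value over all aligned sentence pairs would give a corpus-level figure comparable in spirit to the averages reported above.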

Fig. 1

Distribution of Kendall’s τ on Thai-to-Lao (a) and Thai-Chinese (b).

According to the above analysis, the Thai-Lao language pair has considerable cross-lingual similarity in both token shape and syntactic structure. Parallel sentence pairs that exhibit this similarity are the best representation of those characteristics; however, they are difficult to obtain. We therefore choose the more accessible similarity lexicon as the similarity representation, which can be obtained from a bilingual lexicon with a small number of manual modifications.

Although some work has been done on language modeling and statistical machine translation (SMT) [16] for the Thai-Lao language pair, few works have focused on the impact of using an extra representation of similarity information in machine translation. To the best of our knowledge, there is no existing work that integrates a bilingual similarity lexicon into Thai-Lao NMT through a customized architecture. We argue that the similarity carried by the similarity lexicon conveys more adequate information from Thai to Lao and improves translation accuracy.

3 Our approach

In this section, we elaborate on the details of our proposed model. Our goal is a customized Thai-Lao NMT model that integrates an extra similarity lexicon encoder and guides the information flow in the decoder.

3.1 Overall framework

Given a source language sentence x = {x₁, x₂, …, x_m} and a target language sentence y = {y₁, y₂, …, y_n}, we use P (y|x, θ) to denote a standard attention-based NMT model: $P (y | x, θ) = \prod_{i = 1}^{n} P (y_{i} | y_{< i}; x; θ)$ (1) where θ is a set of model parameters and y_<i is a partial translation. For every word x_i in the input sentence x, if it can be found in the similarity lexicon, we redefine it as w_si, combine it with the corresponding target word w_ti into a word pair 〈w_si, w_ti〉, and feed the pair into a similarity sequence w. To integrate this sequence, we redefine the NMT model as $P (y | x, w, θ) = \prod_{i = 1}^{n} P (y_{i} | y_{< i}; x; w; θ)$ (2) where w is composed of the word pairs {〈w_s1, w_t1〉, …, 〈w_sl, w_tl〉} when there are l matches in the input sentence x.

To integrate the similarity sequence into an efficient encoder-decoder architecture, and inspired by multi-source NMT works [17, 18], we choose the Transformer [19] as the base model and apply structural modifications to it. As shown in Fig. 2, we first add an extra encoder component to the conventional Transformer. Furthermore, to handle the extra input generated by the similarity encoder, we improve the structure of the decoder by applying a simple mechanism in the encoder-decoder attention layer to balance the information flow. In the following subsections, we elaborate on the components of our proposed model and show how they are adapted to the Transformer architecture.

Fig. 2

Illustration of the Thai-Lao NMT model based on the modified Transformer. For simplicity, we assume that the input sentence x matches only one pair of similar words 〈w_s1, w_t1〉.

3.2 Sentence encoder

As illustrated on the left of Fig. 2, the N sentence encoder layers are standard Transformer encoder layers and are identical in structure. The layers are stacked, and each comprises two sub-layers: a multi-head self-attention layer and a position-wise feed-forward network layer. A residual connection and layer normalization are applied to each sub-layer. Given an input sentence x = {x₁, x₂, …, x_m}, the sentence encoder transforms x into a hidden state sequence h_sen = {h₁, h₂, …, h_m}, where h_i is the hidden state of x_i.

3.3 Similarity encoder

For the input x = {x₁, x₂, …, x_m} of the sentence encoder, we traverse it and obtain the matched similarity sequence w = {w₁, w₂, …, w_l}, where l is the number of matched word pairs. For any i ∈ {1, …, l}, w_i is a word pair 〈w_si, w_ti〉 comprising a source word w_si and its corresponding target word w_ti in the similarity lexicon. Note that if x matches nothing in the similarity lexicon, we feed w with a special token pair 〈None, None〉.
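The construction of the similarity sequence can be sketched as follows. This is a minimal illustration under stated assumptions: the lexicon is represented as a plain dictionary from source words to target words, exact matching is used, and the token strings are ASCII placeholders rather than real Thai or Lao words. The function name and the representation of the special pair are hypothetical.

```python
# Special pair fed to the similarity encoder when nothing matches
# (string form assumed for illustration).
NONE_PAIR = ("None", "None")


def build_similarity_sequence(sentence_tokens, lexicon):
    """Collect <source word, target word> pairs for every lexicon match.

    Pairs are kept in sentence order, and repeated words yield repeated pairs.
    """
    pairs = [(tok, lexicon[tok]) for tok in sentence_tokens if tok in lexicon]
    return pairs if pairs else [NONE_PAIR]


# Placeholder tokens stand in for Thai (source) and Lao (target) words.
toy_lexicon = {"th_go": "lo_go", "th_i": "lo_i"}
print(build_similarity_sequence(["th_i", "th_will", "th_go"], toy_lexicon))
print(build_similarity_sequence(["th_will"], toy_lexicon))
```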

As illustrated on the right of Fig. 2, the similarity encoder is identical in structure to the sentence encoder: its layers are stacked, and each comprises two sub-layers, a multi-head self-attention layer and a position-wise feed-forward network layer. For simplicity, we omit the description of the Transformer layers and describe only the input and output here. Given the similarity sequence w = {w₁, w₂, …, w_l}, the similarity encoder transforms it into a hidden state sequence h_sim = {h₁, h₂, …, h_l} by the self-attention mechanism, where h_i = 〈h_si, h_ti〉 for each i ∈ {1, …, l}. The hidden state h_sim and the sentence hidden state h_sen are then passed together (see the blue line and the red line) to the improved encoder-decoder attention layer in the decoder.

3.4 Decoder

Because extra similarity information flows to the decoder, the prevailing Transformer decoder needs to be modified to accommodate the new input. Furthermore, to balance the weight between sentence information and similarity information, a dynamic mechanism is needed to obtain the optimal information flow. Specifically, to process the additional similarity input alongside the normal sentence input of the encoder-decoder attention layer, we split the layer into two components, one for each kind of information, and keep the other source-independent layers unchanged. The modified Transformer decoder has three sub-layers: (1) a masked multi-head attention sub-layer, (2) a modified encoder-decoder attention sub-layer, which is composed of a sentence encoder-decoder attention component and a similarity encoder-decoder attention component, and (3) a position-wise fully connected feed-forward network sub-layer.

The details of the modified encoder-decoder attention sub-layer are illustrated in the upper part of Fig. 2. Given the output s_self of the masked multi-head self-attention layer at position t and the representation h_sen of the input sentence, the sentence encoder-decoder attention is calculated as: $s_{sen} = Multihead (s_{self}, h_{sen}, h_{sen})$ (3) where Multihead (·) is multi-head attention. When the input of the similarity lexicon encoder is null, the similarity encoder produces no actual output, and the similarity encoder-decoder attention s_sim is set to the null vector directly. Otherwise, s_sim is obtained as: $s_{sim} = Multihead (s_{self}, h_{sim}, h_{sim})$ (4)

Inspired by the works [9, 21] that adopt a balancing mechanism to control the information flow, we concatenate s_sen and s_sim to calculate the balancing coefficient α_t (when s_sim is the null vector, α_t is set to 1 directly): $α_{t} = sigmoid (W_{α_{t}} [s_{sen}; s_{sim}] + b_{α_{t}})$ (5) where W_{α_t} and b_{α_t} are parameters. Subsequently, a simple weighted sum is adopted to calculate the output of the encoder-decoder attention layer: $s_{enc\_dec} = α_{t} * s_{sen} + (1 - α_{t}) * s_{sim}$ (6)

As illustrated in the upper part of Fig. 2, s_{enc_dec} is then transferred to the position-wise fully connected feed-forward network: $s_{ffn} = f (s_{enc\_dec})$ (7) where the feed-forward network f (x) is defined as max(0, xW₁ + b₁) W₂ + b₂, in which W₁, W₂, b₁ and b₂ are parameters. The final translation y_t is then predicted as follows: $P (y_{t} | y_{< t}; x; w; θ) = softmax (σ (s_{ffn}))$ (8) where σ (·) is a linear transformation function and w is the sequence composed of similar word pairs.

3.5 Training strategy

In conventional NMT, to drive the model to predict the target sequence, the maximum likelihood estimation (MLE) loss function is used to update the model parameters by maximizing the log-likelihood of the translation. The MLE loss function can be described as: $L (θ^{'}) = \sum_{(x, y) \in D} - \log P (y | x; θ^{'})$ (9) where θ′ are the parameters of the encoder and the decoder. After integrating the similarity encoder into the conventional sequence-to-sequence architecture, the loss function can be calculated as: $L (θ^{''}) = \sum_{(x, y) \in D} - \log P (y | x; w; θ^{''})$ (10) where θ″ are the parameters of the source encoder, similarity encoder and decoder. To balance the two objectives, we train our model on the L (θ′) objective for a fraction β of the iterations and on the L (θ″) objective for the remaining fraction (1 - β). The final loss function is calculated by the following formula: $L (θ) = β L (θ^{'}) + (1 - β) L (θ^{''})$ (11)
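The two ways the balance is expressed above, an iteration schedule and the weighted sum of Eq. (11), can be sketched separately. This is an illustration only: the text does not spell out how the two relate in the actual training loop, and the assumption that the sentence-only phase comes first, as well as all names, are hypothetical.

```python
def objective_for_step(step, total_steps, beta):
    """Pick the objective for a 0-based training step.

    Assumes the first beta fraction of steps uses the sentence-only
    objective L(theta') and the rest uses L(theta'') with similarity input.
    """
    switch_step = round(beta * total_steps)
    return "sentence_only" if step < switch_step else "with_similarity"


def combined_loss(loss_sentence_only, loss_with_similarity, beta):
    """Weighted combination of the two losses, as in Eq. (11)."""
    return beta * loss_sentence_only + (1.0 - beta) * loss_with_similarity


# With beta = 0.4 and 100 steps, steps 0..39 use the sentence-only objective.
print(objective_for_step(39, 100, 0.4), objective_for_step(40, 100, 0.4))
print(combined_loss(2.0, 4.0, 0.5))
```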

We assume that the conventional MLE loss function drives the translation toward the true distribution, while the similarity-integrated loss function guides the translation procedure from a semantic perspective. In practice, we find that using the two objectives together makes the training procedure easier and yields better translation performance.

4 Experiments

In this section, we present empirical studies of the proposed approach on a publicly available NMT dataset. We also conduct multiple studies to thoroughly analyze the proposed approach, covering translation quality, inference efficiency, the effect of similarity, and a case study.

4.1 Experimental setup

Parallel Data. We conduct experiments on the Thai-Lao and Thai-Chinese portions of the publicly available ALT dataset, a multilingual parallel dataset supplied by the Asian Language Treebank Project. Following the ALT guide, we split the Thai-Lao portion, which comprises 20106 sentence pairs, into three subsets with the ALT-Standard-Split toolkit: 18088, 1000, and 1018 sentence pairs for the training, development, and test sets, respectively. The same partition is applied to the Thai-Chinese portion. We apply only simple preprocessing to the parallel corpus before applying our approach. For Chinese, we apply word segmentation with the jieba tool. For Thai, we use the pythaipiece tool, which is based on sentencepiece, to segment sentences. For Lao, we use our in-house tool to segment sentences; the source code and model are included in the supplementary files. The dataset is encoded by BPE with a shared vocabulary of 2 K symbols.

Similarity Lexicon. For the similarity input, we first build a Thai-Lao-Chinese trilingual lexicon. The lexicon comprises 4512 word pairs, of which 3512 are collected by ourselves and the rest are obtained from dict.land. We then split it into Thai-Lao and Thai-Chinese bilingual similarity lexicons. For the Thai-Lao portion, we manually select the 2700 word pairs with the greatest similarity, and for the Thai-Chinese portion we extract the corresponding 2700 word pairs. Note that because there is no apparent similarity between Thai and Chinese, Thai-Chinese lexicon infusion is effectively equivalent to ordinary term infusion. Moreover, to maximize the utilization of the similarity lexicon, we apply approximate matching when checking whether a source-side word matches, which allows for some morphological variation in the word.
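The approximate matching step is not specified further in the text. One possible realization, shown purely as an assumption, scores each lexicon key against the query word with the character-level similarity ratio from Python's standard `difflib` module and accepts the best key above a threshold. The threshold value, the function name, and the ASCII placeholder words are all hypothetical.

```python
from difflib import SequenceMatcher


def approximate_lookup(word, lexicon, threshold=0.8):
    """Return (matched_source_word, target_word), or None if nothing is close.

    Exact matches win; otherwise the most similar key is accepted only if
    its similarity ratio reaches the (assumed) threshold.
    """
    if word in lexicon:
        return word, lexicon[word]
    best_key, best_score = None, 0.0
    for key in lexicon:
        score = SequenceMatcher(None, word, key).ratio()
        if score > best_score:
            best_key, best_score = key, score
    if best_key is not None and best_score >= threshold:
        return best_key, lexicon[best_key]
    return None


# ASCII placeholders stand in for Thai source words and Lao target words.
toy_lexicon = {"walking": "lo_walk"}
print(approximate_lookup("walkin", toy_lexicon))
print(approximate_lookup("xyz", toy_lexicon))
```

A linear scan over the lexicon is adequate for a few thousand entries; a larger lexicon would call for an index.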

Monolingual Data. To compare our approach with the traditional data augmentation method based on back-translation [22], we conduct a back-translation experiment using 60 K Thai and 60 K Lao monolingual corpora chosen from the OSCAR dataset.

Evaluation. We choose the case-insensitive 4-gram BLEU score [23] as the main evaluation metric and compute it with the multi-bleu.perl script from the Moses toolkit. Significance tests are conducted on the best BLEU results using a bootstrap resampling [24] tool.
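The idea behind a paired bootstrap significance test can be sketched as follows. This is a simplified illustration rather than the tool used in the experiments: it assumes per-sentence quality scores that can be summed, whereas BLEU is a corpus-level metric that would be recomputed on every resample. The function name, parameters, and defaults are hypothetical.

```python
import random


def paired_bootstrap_win_rate(scores_a, scores_b, n_samples=1000, seed=0):
    """Fraction of resampled test sets on which system A beats system B.

    Both score lists must refer to the same test sentences, in the same order.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same sentences")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_samples):
        # Resample sentence indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        total_a = sum(scores_a[i] for i in idx)
        total_b = sum(scores_b[i] for i in idx)
        if total_a > total_b:
            wins += 1
    return wins / n_samples


print(paired_bootstrap_win_rate([0.5, 0.6, 0.7], [0.1, 0.2, 0.3]))
```

Under this simplified scheme, a win rate of at least 0.95 would be read as significance at p < 0.05.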

Baseline. We compare the proposed model against the conventional Transformer system and comparable previous works:

Moses: The dominant phrase-based SMT system, used with the default configuration and a 4-gram language model. We train the model on the entire training data and use the lexicon to constrain the word alignments generated by GIZA++. We choose a phrase-based SMT system as a baseline because in low-resource settings SMT tends to achieve better performance than NMT [25].

Transformer: The dominant NMT approach, which obtains state-of-the-art performance on machine translation and predicts the target sentence from left to right relying on self-attention.

Song et al. [26]: A data augmentation approach that uses a replace strategy to introduce a term lexicon. For each source-target word pair, k₁ matching sentences are randomly sampled and the source-side word is replaced with its target-side word. For each combination of two source-target word pairs, the sampling hyper-parameter is k₂, and both source-side matching words are replaced with their target translations. We follow the empirical settings of Song et al. [26], in which k₁ is set to 100 and k₂ is set to 30.

Georgiana et al. [27]: A data augmentation approach that utilizes an external term lexicon on the generic NMT architecture. The approach provides two strategies for data augmentation: append and replace. For append, the target words in the term lexicon are appended to the corresponding source words. For replace, the words in the source sentences are replaced by the corresponding words in the term lexicon. We compare our approach with the append strategy, which performs better according to the conclusions of their work.

Post et al. [28]: A constrained decoding approach that uses a dynamic beam allocation (DBA) technique to reduce the computational overhead of term integration to a constant factor.

Implementation Details. We follow the guidelines of Sennrich et al. [25] on optimizing low-resource NMT and adopt conservative parameter settings. The Transformer used for the baseline and for the implementation of our approach comprises a 2-layer encoder and a 2-layer decoder. The batch size is set to 128, and the dimensions of the word embeddings, hidden states, and filter sizes are all set to 256. Dropout is set to 0.1 during training. Because the dataset is tiny, we set the vocabulary size to 3000. We use Adam [29] with the same learning rate for optimization. The models are trained on one P100 GPU and evaluated every 1000 steps. We implement our approach on THUMT [30], a strong open-source machine translation platform.

4.2 Experimental results

Translation Quality. Table 2 shows the experimental results evaluated by BLEU score. We first compare our approach with the NMT models on the Thai-Lao translation task, where it obtains a 2.78 BLEU point improvement over the Transformer baseline. Compared with the data augmentation approaches, our approach also outperforms Song et al. [26] and Georgiana et al. [27] under the best strategies they adopt (replace and append, respectively). Similarly, our approach improves on the constrained decoding method of Post et al. [28]. Considering the strong performance of SMT models in low-resource settings, we also compare our approach against the Moses model. As shown in Table 2, our approach still gains 1.67 BLEU points over Moses. Moreover, we observe that in this low-resource setting, the translation task on the Thai-Lao language pair achieves higher BLEU scores than Thai-Chinese. For example, relative to the Transformer baseline, the gap between the Thai-Lao (+1.13) and Thai-Chinese (+0.75) gains is 0.38 for Georgiana et al. [27], while for our approach the trend is consistent but the gap between the two tasks is more pronounced (1.0). We believe the likely reason is that Georgiana et al. [27] is structure-independent: it adds an additional stream (by adding different tags) to indicate the original source words and the target words from the bilingual dictionary, and generates an augmented corpus. Although the essence of our approach is still to add an additional information stream (through the similarity encoder) to the NMT procedure, compared with the random ignoring of matches used in Georgiana et al. [27], the additional encoder and the gating mechanism built on it allow the model to control the information flow more finely by using the self-learning ability of the neural network.

To verify the effect of our model against the dominant data augmentation method, we conduct a back-translation experiment using Thai and Lao monolingual corpora. Compared with the Transformer baseline, back-translation brings a significant performance improvement, but the improvement from our approach is greater. A possible reason is that our approach provides clear alignment information through the similarity lexicon, and alignment plays a large role in attention-based translation. The results show that our approach does not depend on a monolingual corpus: by using an accessible bilingual similarity lexicon alone, it achieves better performance.

Table 2
Results on the ALT Thai⟶Lao and Thai⟶Chinese translation tasks (BLEU score). Significance tests are conducted based on the best BLEU results by using bootstrap resampling (p < 0.05). “bt” denotes back-translation.

Type	Models	Thai-Lao	Thai-Chinese
SMT	Moses	9.36	7.43
NMT	Transformer	8.25	7.05
	Transformer (bt)	10.29	8.34
	Song et al. [26]	9.17	7.61
	Georgiana et al. [27]	9.38	7.80
	Post et al. [28]	9.22	7.67
	Our approach	11.03	8.83
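The gains quoted in the translation-quality discussion can be re-derived from Table 2 with a few lines of arithmetic. The snippet below is only a convenience check; the dictionary layout and function name are illustrative, and the scores are copied from the table.

```python
# BLEU scores from Table 2 as (Thai-Lao, Thai-Chinese).
bleu = {
    "Moses": (9.36, 7.43),
    "Transformer": (8.25, 7.05),
    "Transformer (bt)": (10.29, 8.34),
    "Song et al. [26]": (9.17, 7.61),
    "Georgiana et al. [27]": (9.38, 7.80),
    "Post et al. [28]": (9.22, 7.67),
    "Our approach": (11.03, 8.83),
}


def gain(model, baseline="Transformer"):
    """BLEU improvement of `model` over `baseline` on both language pairs."""
    return tuple(round(m - b, 2) for m, b in zip(bleu[model], bleu[baseline]))


for name in ("Georgiana et al. [27]", "Our approach"):
    th_lo, th_zh = gain(name)
    # Third value is the gap between the Thai-Lao and Thai-Chinese gains.
    print(name, th_lo, th_zh, round(th_lo - th_zh, 2))
print("vs Moses:", gain("Our approach", baseline="Moses"))
```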

Speed. We investigate the decoding overhead of our approach and the NMT baseline model. The total inference time for decoding the Thai-Lao test set, which comprises 1018 sentences, is reported in Table 3. As the data augmentation approaches of Song et al. [26] and Georgiana et al. [27] are structure-independent, they introduce only 4% additional parameters compared with the Transformer model, for the embeddings of the augmented sentences. The structure-dependent approach of Post et al. [28] introduces 40% extra parameters and brings 2.5 times the overhead of the Transformer. By contrast, our approach is also structure-dependent but increases the inference time by only 26% compared with the structure-independent approaches. As translation quality is more valuable than inference cost in the extremely low-resource setting, we believe that an extra cost within the same order of magnitude is acceptable.

Table 3

Decoding overhead. “Inference time” is measured as the total decoding time on one P100 GPU for the Thai-Lao test set (1018 sentences). The “Type” column indicates whether the model is structure-independent.

Type	Models	Param	Inference time
independent	Transformer	9.3 M	31.7 s
	Song et al. [26]	9.7 M	31.7 s
	Georgiana et al. [27]	9.7 M	31.7 s
dependent	Post et al. [28]	12.5 M	74.6 s
	Our approach	12.8 M	39.4 s

Similarity Effect. We investigate the influence of the scale of the similarity lexicon on our approach through an ablation study. We compare performance over the entire lexicon of 2700 word pairs with a step size of 300. As Fig. 3 shows, the Thai-Lao (th-lo) task obtains a more significant BLEU increase than the Thai-Chinese (th-zh) task as similar word pairs are infused. In particular, at the scale of 2700, the BLEU gap between Thai-Lao and Thai-Chinese is clearly larger than without the similarity lexicon (when the scale is 0). The trend indicates that the similarity lexicon is beneficial for the linguistically similar Thai-Lao language pair. By contrast, infusion is not as effective for Thai-Chinese. We believe the likely reason is that, because there is no apparent similarity between Thai and Chinese, Thai-Chinese infusion is effectively equivalent to ordinary term infusion and cannot benefit from linguistic similarity.

Fig. 3

Performance of our proposed approach when infusing different numbers of similarity word pairs.

Robustness. Building a Thai-Lao similarity lexicon from scratch by hand is time-consuming and expensive. Fortunately, some bilingual websites can be crawled to help build the similarity lexicon. However, crawled similarity word pairs are usually noisy, and the original lexicon is corrupted when crawled words are added to it. We therefore conduct a robustness experiment on the Thai-Lao translation task to examine how performance declines. To analyze this from multiple perspectives, we adopt two strategies: (1) replace randomly: randomly replace the Lao-side words of Thai-Lao word pairs in the similarity lexicon with words crawled from Lao websites. To ensure a reasonable bias, we verify that the ratio of the lengths of the two words is not greater than 2. (2) replace by 〈None〉: replace the Lao-side word in the similarity lexicon with the token 〈None〉. The results are shown in Fig. 4. We observe that with 〈None〉 replacement, the BLEU score drops sharply as dummy word pairs are infused: at 300 dummy pairs it is close to, and at 400 clearly below, the performance of the Transformer baseline. By contrast, with random replacement, the BLEU score drops more smoothly and outperforms the Transformer baseline until the number of dummy word pairs reaches 500. The result confirms our intuition that our proposed approach is robust within a reasonable noise range when the noisy input conforms to linguistic norms.
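The two corruption strategies can be sketched as a small helper. Several details are assumptions made for illustration: the length-ratio check is applied between the original Lao word and its replacement, entries with no admissible replacement are left unchanged, and the function name, strategy labels, and ASCII placeholder words are hypothetical.

```python
import random

NONE_TOKEN = "<None>"


def corrupt_lexicon(lexicon, n_corrupt, strategy, noise_words=(), seed=0):
    """Return a copy of `lexicon` with `n_corrupt` target sides replaced.

    strategy "none":   replace the target word by the <None> token.
    strategy "random": replace it by a noise word whose length ratio to the
                       original target word is at most 2 (assumed reading).
    """
    rng = random.Random(seed)
    corrupted = dict(lexicon)
    for src in rng.sample(sorted(lexicon), n_corrupt):
        original = lexicon[src]
        if strategy == "none":
            corrupted[src] = NONE_TOKEN
        elif strategy == "random":
            candidates = [
                w for w in noise_words
                if w != original
                and max(len(w), len(original)) <= 2 * min(len(w), len(original))
            ]
            if candidates:
                corrupted[src] = rng.choice(candidates)
        else:
            raise ValueError("unknown strategy: " + strategy)
    return corrupted


toy = {"th_a": "bb", "th_b": "cc", "th_c": "dd"}
print(corrupt_lexicon(toy, 2, "none"))
print(corrupt_lexicon(toy, 3, "random", noise_words=["x", "yyy", "zzzzzzzzz"]))
```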

Fig. 4

Performance decline of our proposed approach under different corruption strategies (on the Thai-Lao translation task).

Balance of Two Objectives. We use the hyper-parameter β to balance the two objectives described in Section 3.5. We study its influence on the Thai-Lao translation task by raising the value of β over the discrete values β ∈ {10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%}.

As shown in Fig. 5, when β lies between 0.4 and 0.6, our model performs better than in the intervals [0.1, 0.3] and [0.7, 0.9]. The results show that setting the hyper-parameter β within a reasonable interval ([0.4, 0.6]) keeps the balance between the source text and the similarity input.

Fig. 5

The influence of hyper-parameter β on the model. Note we conduct the evaluation on the validation set and then validate the conclusion on the test set.

Case Study. Apart from the quantitative analysis, we provide an example produced by our proposed approach. In the simple sentence shown in Table 4, the Thai words “ ” (I) and “” (go/to/leave) are translated correctly by our approach. We argue that one of the main reasons is that the similarity encoder delivers more information to the translation process: “ ” (I go/to/leave) in Thai is similar in morphology to “” (I go/leave) in Lao, and both can be found in the similarity lexicon.

Table 4

Example of Thai-Lao translation

Input:	(I will go abroad next week)
Golden:	(I will go abroad next week)
Baseline:	(Next week begins the journey)
Our approach:	(I am leaving next week)

5 Conclusions

We propose a novel neural machine translation (NMT) approach for the Thai-Lao language pair, which has an extremely limited parallel corpus but is cross-lingually similar. We first investigate the cross-lingual similarity of the Thai-Lao language pair. Then, to utilize this similarity, we propose a new end-to-end NMT model in which a similarity encoder and a modified encoder-decoder attention layer in the decoder are designed to receive and fuse the similarity information, respectively. We further conduct comparative experiments. The results show that our approach achieves a significant BLEU improvement on the Thai-Lao task with a tiny parallel corpus, compared with strong SMT and NMT baseline models.

An interesting direction for future work is to apply our approach to other low-resource NMT tasks in which the source language is similar to the target language, such as Malay-Indonesian.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant 61866020, and in part by the Natural Science Foundation of Yunnan Province under Grant 2019FB082.

References

Philipp

and Rebecca

, Six Challenges for Neural Machine Translation, Proc. First Workshop on Neural Machine Translation, pages 28–39, Vancouver, Canada, August 4, (2017).

Xia

, Kong

, Anastasopoulos

and Neubig

, Generalized data augmentation for low-resource translation, Proc. 57th Annual Meeting of the Association for Computational Linguistics, pages 5786–5796. Florence, Italy, July 28 - August 2, (2019).

Burlot

and Yvon

, Using Monolingual Data in Neural Machine Translation: A Systematic Study, Proc. The Third Conference on Machine Translation (WMT), Volume 1: Research Papers, pages 144–155. Belgium, Brussels, October 31 - Novermber 1, (2018).

Artetxe

, Labaka

, Agirre

and Cho

, Unsupervised Neural Machine Translation, Proc. The 2018 Conference of the international conference on learning representations. Vancouver, BC, Canada. April 30-May 3, (2018).

Lample

, Ott

, Conneau

, Denoyer

and Ranzato

, Phrase-Based & Neural Unsupervised Machine Translation, Proc. The 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium, October 31 - November 4, (2018).

Nguyen

T.Q.

and Chiang

, Transfer learning across low-resource, related languages for neural machine translation, Proc. The The 8th International Joint Conference on Natural Language Processing, pages 296–301. November 27-December 1, (2017).

Lakew

S.M.

, Erofeeva

and Negri

, Transfer Learning in Multilingual Neural Machine Translation with Dynamic Vocabulary, Proc. The International Workshop on Spoken Language Translation. (2018).

, Yu

, Guo

, Huang

and Wen

, Efficient Low-Resource Neural Machine Translation with Reread and Feedback Mechanism, ACM Trans. Asian Low-Resour. Lang. Inf. Process 19(3) Article 34 (December 2019), 13 pages (2019).

9. Wang, Xia Y.C., Gao, Tian, Tao, Zhai and Liu, Neural Machine Translation with Soft Prototype, Proc. Advances in Neural Information Processing Systems 32. pages 6316–6325. Vancouver, Canada, (2019).

10. Srithirath and Seresangtakul, A Hybrid Approach to Lao Word Segmentation using Longest Syllable Level Matching with Named Entities Recognition, Proc. The 10th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON), Krabi, Thailand, pp. 1-5, May 15-17, (2013).

11. Yang, Zhou, Yu, Gao and Guo, Lao Named Entity Recognition based on conditional random fields with simple heuristic information, Proc. The 12th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), Zhangjiajie, (2015), 1426–1431.

12. Ding, Utiyama and Sumita, Similar Southeast Asian Languages: Corpus-Based Case Study on Thai-Laotian and Malay-Indonesian, Proc. The 3rd Workshop on Asian Translation, pages 149–156, Osaka, Japan, December 11-17, (2016).

13. Och F.J. and Ney, A Systematic Comparison of Various Statistical Alignment Models, Computational Linguistics 29(1) (2003), 19–51.

14. Riza, Purwoadi, Gunarso T.U., Ti A.A., Aljunied S.M., Mai L.C., Thang V.T., Thai N.P., Chea, Sun, Sam, Seng, Soe K.M., Nwet K.T., Utiyama and Ding, Introduction of the Asian Language Treebank, Oriental COCOSDA (2016).

15. Isozaki, Sudoh, Tsukada and Duh, HPSG-based preprocessing for English-to-Japanese translation, ACM Transactions on Asian Language Information Processing 11(3) (2012), 1–16.

16. Singvongsa and Seresangtakul, Lao-Thai machine translation using statistical model, International Joint Conference on Computer Science & Software Engineering. IEEE. (2016).

17. Zoph and Knight, Multi-source neural translation, Proc. The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

18. Dabre, Cromieres and Kurohashi, Enabling Multi-Source Neural Machine Translation By Concatenating Source Sentences In Multiple Languages, In Proceedings of MT Summit XV, volume 1, pages 96–107, Nagoya, Japan. (2017).

19. Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez A.N., Kaiser and Polosukhin, Attention is all you need. In I. Guyon, U.V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan and R. Garnett, editors, Advances in Neural Information Processing Systems 30 (2017), 5998–6008.

20. Gulcehre, Firat, Xu, Cho, Barrault and Lin, On using monolingual corpora in neural machine translation, Computer Science (2015).

21. Cao and Xiong, Encoding Gated Translation Memory into Neural Machine Translation, Proc. The 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium. (2018), 3042–3047.

22. Sennrich, Haddow and Birch, Improving Neural Machine Translation Models with Monolingual Data. Proc. The 54th Annual Meeting of the Association for Computational Linguistics, Berlin (2016), 1715–1725.

23. Papineni, Roukos, Ward and Zhu, Bleu: a method for automatic evaluation of machine translation. In Proceedings of ACL, (2002), pages 311–318.

24. Koehn, Statistical significance tests for machine translation evaluation. Proc. The 2004 Conference on Empirical Methods in Natural Language Processing, (2004), pages 388–395.

25. Sennrich and Zhang, Revisiting Low-Resource Neural Machine Translation: A Case Study, Proc. The 57th Annual Meeting of the Association for Computational Linguistics (ACL2019), Florence, Italy, (2019).

26. Song, Zhang and Yu, Code-Switching for Enhancing NMT with Pre-Specified Translation, Proc. The 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Minneapolis, Minnesota, (2019).

27. Dinu, Mathur, Federico and Onaizan Y.A., Training neural machine translation to apply terminology constraints, Proc. The 57th Annual Meeting of the Association for Computational Linguistics, pages 3063–3068, Florence, Italy, July 2019. Association for Computational Linguistics.

28. Post and Vilar, Fast Lexically Constrained Decoding with Dynamic Beam Allocation for Neural Machine Translation, In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1–6, (2018), Volume 1 (Long Papers), pages 1314–1324.

29. Kingma D.P. and Ba, Adam: A method for stochastic optimization, Proc. The 3rd International Conference on Learning Representations. San Diego, (2015).

30.

Zhang

, Ding

, Shen

, Cheng