A comparative study of neural machine translation models for Turkish language

Abstract

Machine translation (MT) is an important challenge in the field of Computational Linguistics. In this study, we conducted neural machine translation (NMT) experiments on two different architectures. First, the Sequence to Sequence (Seq2Seq) architecture, along with a variation that utilizes an attention mechanism, is applied to the translation task. Second, an architecture that is fully based on the self-attention mechanism, namely the Transformer, is employed to perform a comprehensive comparison. In addition, the contributions of Byte Pair Encoding (BPE) and Gumbel Softmax distributions are examined for both architectures. The experiments are conducted on two datasets: TED Talks, one of the popular benchmark datasets for NMT, especially among morphologically rich languages like Turkish, and the WMT18 News dataset, provided by the Third Conference on Machine Translation (WMT) for shared tasks on various aspects of machine translation. The evaluation of the Turkish-to-English translations demonstrates that the Transformer model with the combination of BPE and Gumbel Softmax achieved a BLEU score of 22.4 on TED Talks and 38.7 on the WMT18 News dataset. The empirical results support that using the Gumbel Softmax distribution improves the quality of translations for both architectures.

Keywords

Neural machine translation, Gumbel Softmax, sequence to sequence, transformer

1 Introduction

Machine Translation (MT), a sub-field of Computational Linguistics, is regarded as an effective method for translating a source language into a target language. The pioneering work on MT was published by Weaver et al. [28], and the topic has developed widely alongside the rapid technological advances of recent decades.

Studies on MT can be roughly classified into three groups: rule-based, statistical and neural methods. While rule-based machine translation (RMT) uses handcrafted rules to translate from a source language S to a target language T, statistical machine translation (SMT) utilizes probabilistic models estimated from parallel corpora. The aim of traditional SMT is to search for the most probable translation $\hat{T}$ for a given source sentence S: $\hat{T} = \underset{T}{\operatorname{argmax}} \, P(S | T) \, P(T)$, where the translation model P(S|T) specifies the faithfulness of T as a translation of S and the language model P(T) expresses the fluency of T in the target language. The translation probability is further computed from phrase probabilities and distortion probabilities.
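The search over candidate translations can be illustrated with a minimal sketch. The function name `best_translation` and the toy log-probabilities are hypothetical; a real SMT decoder searches a far larger hypothesis space rather than a fixed candidate list.

```python
def best_translation(candidates, translation_logprob, language_logprob):
    """Return the candidate T maximizing log P(S|T) + log P(T).

    Working in log space turns the product P(S|T) P(T) into a sum,
    which avoids numerical underflow for long sentences.
    """
    return max(
        candidates,
        key=lambda t: translation_logprob[t] + language_logprob[t],
    )


# Toy, made-up scores for two candidate translations of one source sentence.
candidates = ["i am learning", "i learning am"]
translation_logprob = {"i am learning": -2.0, "i learning am": -1.8}  # log P(S|T)
language_logprob = {"i am learning": -3.0, "i learning am": -9.0}     # log P(T)
print(best_translation(candidates, translation_logprob, language_logprob))
```

In this toy case the second candidate is slightly more faithful but far less fluent, so the language model term decides the outcome.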

In recent years, deep neural networks have proved attractive in many application fields. Neural machine translation (NMT) was inspired by the increased usage of deep learning architectures. NMT models have recently been proposed [5, 26] and have shown promising results by advancing state-of-the-art translation performance. Unlike traditional SMT [16], most NMT models do not require linguistic knowledge such as syntactic or semantic features, do not consist of many small sub-components, and use a neural network that is trained jointly to maximize translation performance. Neural network architectures, especially encoder-decoder structures, have recently been used in MT as well as in other applications. While an encoder encodes the input sentence into a fixed-size vector, a decoder maps the vector representation to a sentence in the target language.

Agglutinative languages, like Turkish, are cumbersome for Computational Linguistics since they comprise morphologically complex words formed by morpheme agglutination. For example, to form the present continuous tense of the verb “öğrenmek”, which stands for “to learn”, the morpheme “-yor” is added along with a suffix that varies with the subject. Consequently, the translation of “I’m learning” becomes “öğreniyorum”. As the example indicates, the surface form of a word changes not with its sense but with its context, so many different forms of each word must be included in the vocabulary to represent them adequately. Therefore, collecting resources and building applications for agglutinative languages is difficult and computationally heavy.

This study contributes to the literature on morphologically rich and low-resource languages, including Turkish, by comprehensively comparing two main architectures of NMT: Sequence to Sequence (Seq2Seq) [25] and Transformer [26]. The contribution of this study is to examine Byte Pair Encoding (BPE) for morpheme tokenization and the Gumbel Softmax distribution for providing fluency in morpheme selection to the models during training. Although Gumbel Softmax distributions were studied by Gu et al. [10] for the inference model, their effect on training has not been examined before, to the best of our knowledge. Our empirical studies indicate that training models with the Gumbel Softmax distribution is fairly successful in improving the results.

The rest of the paper is organized as follows. Section 2 introduces the related work on Turkish-English machine translation along with NMT studies. Section 3 describes the datasets, architectures and methodology used in this study. Section 4 provides details about the parameters used in the experiments. Section 5 presents the results of the experiments. Section 6 discusses the experiments. Lastly, the conclusion is drawn in Section 7.

2 Related works

A novel approach to NMT started with the adaptation of neural language models to traditional SMT systems [22]. The studies [23, 27] trained neural networks for scoring phrase pairs using different representations as an additional input. In the study of Schwenk [23], continuous space translation model probabilities of a phrase-based SMT system were used. Feedforward neural networks were employed, and the model predicted an entire target phrase rather than one word at a time. A similar study [7] proposed a feedforward neural network translation model that predicts one word of a target phrase at a time. Other similar studies [4, 9] utilized a feedforward neural network that accepts a bag-of-words representation of the input phrase. Kalchbrenner et al. [15] proposed a different approach, in which a convolutional n-gram model is used for the encoder and a hybrid of an inverse convolutional n-gram model and RNNs is used for the decoder. A similar study [5] proposed an RNN encoder-decoder architecture that uses two RNNs: one acts as an encoder that maps a variable-length sequence to a fixed-length vector, and the other acts as a decoder that maps the vector representation to a variable-length target sequence. A closely related study [25] employed Long Short-Term Memory (LSTM) [12] in both the encoder and decoder networks to address the vanishing gradient problem. Exploiting the encoder-decoder framework, Vaswani et al. [26] developed another architecture, namely the Transformer, that replaces the conventional RNNs with self-attention mechanisms. The inference method, which decodes the predictions of the model to generate translations, is another important and challenging subject in the NMT field. Sutskever et al. [25] used a beam-search decoder that keeps a number of candidate sequences determined by the beam size and selects the one with the highest probability. Moreover, Gu et al. [10] proposed to use Gumbel Softmax distributions for the inference model. Their study inspired us to question the effect of Gumbel Softmax distributions on model training, rather than on inference, in order to teach the models a robust utilization of morphemes.

For Turkish, Ataman et al. [1] addressed the problem of insufficient vocabulary coverage in morphologically rich languages and proposed a vocabulary reduction methodology that encodes the suffixes. They showed that the impact of the stem on model learning increases and that the proposed model achieves better performance in open-vocabulary translation of morphologically rich languages. Their research was also conducted on the TED Talks dataset [20] using a Seq2Seq model with attention. Curry et al. [6] also addressed the low-resource problem for languages such as Turkish and proposed to alleviate the lack of data by feeding the models, in addition to the original dataset, with a copied corpus in which the source and target languages are switched. Another study, conducted by Gülçehre et al. [11] on improving NMT models, employed two alternative language models, named shallow and deep fusion, together with a decoder augmented accordingly. The resulting deep fusion language modelling was successful in improving the state-of-the-art score in Turkish-English translation. The study [19] developed strategies for the low-resource and morphologically rich languages Turkish and Uyghur. The results showed that morphologically motivated word segmentation is better suited to NMT and yields notable improvements in Turkish-English and Uyghur-Chinese machine translation. Other studies [17, 30] also applied multilingual neural machine translation to Turkish based on the WMT17 and WMT18 News datasets.

3 Experimental setup

3.1 Dataset

Morphologically rich languages, such as the agglutinative language Turkish, suffer from a lack of resources for training neural network architectures because of the vocabulary and corpus sizes required. TED Talks, compiled by Qi et al. [20], has become a benchmark dataset for NMT, especially among morphologically rich languages. The dataset contains sentence pairs in 60 different languages taken from TED Talks transcripts. Another dataset in this domain, namely WMT18, is provided by the shared task of the Third Conference on Machine Translation [3]. We used the WMT18 News corpus, which contains texts from published news articles. In this study, we selected the Turkish-English pairs to perform Turkish-to-English NMT. We chose the Turkish-to-English direction, rather than English-to-Turkish, in order to compare the results of our models with studies conducted exclusively on the selected datasets for this particular direction.

In our qualitative observations of the selected datasets, we noticed that the translation pairs in the TED Talks dataset have an even distribution of complex and simple sentences, while the WMT18 News dataset contains more complex sentences. Intuitively, a mixture of complex and simple sentences is required to develop a robust translation system, since it forces the models to learn complicated sentence fragmentation while acquiring the basics of the language from simple sentences. Similarly, another requirement for training a robust translation model is diversity of word stems. Although most datasets for morphologically rich languages suffer from low stem diversity due to agglutination, an adequate variety of stems is maintained in both datasets. The quantitative summaries of the selected datasets are given in Tables 1 and 2.

Table 1
The summary of the TED Talks dataset

	Training		Test
	EN	TR	EN	TR
Sentences	182K		5K
Unique Words	47K	185K	8K	17K
Total Words	4.1M	3.1M	111K	83K
Mean Sentence Length	21	15	20	15

Table 2

The summary of the WMT18 News dataset

	Training		Test
	EN	TR	EN	TR
Sentences	205K		3K
Unique Words	55K	146K	8K	15K
Total Words	4.9M	4.3M	68K	54K
Mean Sentence Length	24	21	23	18

3.2 Byte pair encoding (BPE)

Morphologically complex words increase the vocabulary size because of the morpheme agglutination described in Section 1, and consequently more computational power is required. To address this problem, Sennrich et al. [24] proposed to use the Byte Pair Encoding (BPE) algorithm for word segmentation in NMT tasks. Instead of using one token to represent a word, BPE segments complex words into subword units. Thus, less storage is required for the vocabulary while more of the complex words in the language are covered.

To apply BPE, the most frequent pairs of adjacent symbols (initially characters) in the corpus are identified and merged into new subword units. Merging is repeated until the specified target vocabulary size is reached. Because words are segmented into subword units, this approach also alleviates the out-of-vocabulary (OOV) problem, since words that are not in the training vocabulary can still be generated or understood from their fragments. Therefore, BPE is a notable approach for low-resource corpora, especially in morphologically rich languages.
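The merge procedure can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments: the function name `learn_bpe`, the end-of-word marker `</w>` and the stopping criterion (a fixed number of merges instead of a target vocabulary size) are assumptions made for brevity.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge operations from a {word: frequency} dictionary."""
    # Start from characters; "</w>" marks the end of each word.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every pair of adjacent symbols, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair (first one on ties)
        merges.append(best)
        # Rewrite every word with the best pair merged into one symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges, vocab


# Toy corpus with made-up frequencies.
merges, segmented = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3)
print(merges)
```

On this toy corpus the first merges build up the frequent suffix "est", which illustrates how recurring morphemes end up as single subword units.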

3.3 Architectures

Many recent studies on NMT are built upon Recurrent Neural Networks (RNNs) [5]. However, a newer approach that abandons RNNs in favour of self-attention mechanisms, namely the Transformer [26], has become as popular as RNNs after the success of Transformer variants, e.g. BERT [8] and GPT-2 [21], in language modelling. Regardless of the architectural differences, the goal is to estimate the conditional probability P(y₁, y₂, ..., y_n | x₁, x₂, ..., x_m), where x = (x₁, x₂, ..., x_m) is the given sentence and y = (y₁, y₂, ..., y_n) is the output sentence, whose length may differ from that of the given sentence.

3.3.1 Sequence-to-Sequence (Seq2Seq)

Sutskever et al. [25] proposed the Sequence-to-Sequence (Seq2Seq) architecture, which comprises two networks fully based on RNNs. The first RNN, the encoder, learns to encode the input sentence into a fixed-length vector, and the second RNN, the decoder, maps the fixed-length vector representation into a sequence in the target language.

The encoder network reads the input sentence x sequentially and forms a sequence of hidden states h = (h₁, h₂, ..., h_T) of length T, one for each time step t:

$h_{t} = f (h_{t - 1}, x_{t})$ (1)

where f is a non-linear activation function. The hidden state vector h is passed to the decoder in order to produce the conditional distribution of the next symbol:

$P (y_{t} | y_{t - 1}, y_{t - 2}, \ldots, y_{1}) = g (h_{t}, y_{t - 1})$ (2)

where g is a softmax function that produces the next symbol from the conditional distribution over the possible candidates.

In practice, RNNs often suffer from the vanishing gradient problem while learning long-range dependencies [13]. Variants of the RNN such as Long Short-Term Memory (LSTM) [12] and the Gated Recurrent Unit (GRU) [5] were proposed to tackle this problem by allowing the network to forget hidden states if necessary. The notable difference between LSTM and GRU is the number of gates and, consequently, the required computational power. In this study, we preferred the GRU because of computational limitations.

Besides the above-mentioned pure Seq2Seq architecture, another model that is fully based on Seq2Seq but exploits an extra layer, namely Attention, is employed to expand our comparison on the task. The attention layer introduced by Bahdanau et al. [2] provides a soft-search mechanism for the decoder network by weighting the significance of the input sequence values. To obtain attention, the context vector γ is computed as:

$\begin{aligned} e_{tj} &= v_{\gamma}^{\top} \tanh (W_{\gamma} s_{t - 1} + U_{\gamma} h_{j}) \\ \alpha_{tj} &= \frac{\exp (e_{tj})}{\sum_{k = 1}^{m} \exp (e_{tk})} \\ \gamma_{t} &= \sum_{j = 1}^{m} \alpha_{tj} h_{j} \end{aligned}$ (3)

where s = (s₁, s₂, ..., s_n) is the sequence of decoder hidden states, m is the length of the input sequence and n is the length of the output sequence. The probability of the target word at the current time step, p(y_t | s_t, y_(t-1), γ_t), is then calculated from the combination of the decoder hidden state s_t, the previous word y_(t-1) and the context vector γ_t. The representation of Seq2Seq with the Attention layer is given in Fig. 1.
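The three steps of Eq. (3) for a single decoder time step can be sketched with NumPy as follows. The function name `additive_attention`, the argument names and the shapes are illustrative assumptions; in a trained model, W, U and v would be learned parameters rather than random values.

```python
import numpy as np


def additive_attention(s_prev, H, W, U, v):
    """Context vector and attention weights for one decoder step.

    s_prev: (d_s,)     previous decoder hidden state s_{t-1}
    H:      (m, d_h)   encoder hidden states h_1 .. h_m
    W:      (d_a, d_s) projection of the decoder state
    U:      (d_a, d_h) projection of the encoder states
    v:      (d_a,)     scoring vector
    """
    # e_tj = v^T tanh(W s_{t-1} + U h_j), evaluated for all j at once.
    scores = np.tanh(W @ s_prev + H @ U.T) @ v   # shape (m,)
    scores = scores - scores.max()               # numerical stability only
    alpha = np.exp(scores) / np.exp(scores).sum()
    context = alpha @ H                          # weighted sum of h_j, shape (d_h,)
    return context, alpha


rng = np.random.default_rng(0)
H = rng.normal(size=(5, 4))                      # m = 5 source positions
context, alpha = additive_attention(
    rng.normal(size=3), H,
    rng.normal(size=(6, 3)), rng.normal(size=(6, 4)), rng.normal(size=6),
)
print(alpha.round(3), context.shape)
```

The weights in `alpha` are non-negative and sum to one, so the context vector is a convex combination of the encoder states.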

Fig. 1

The representation of Seq2Seq with Attention layer.

3.3.2 Transformer model

The Seq2Seq architecture exploits LSTMs or GRUs in order to overcome the vanishing gradient problem in sequences. However, the self-attention mechanism has proved successful in handling longer dependencies [26]. The Transformer model relies on multi-head attention instead of LSTM or GRU units. The rest of the framework is a traditional encoder-decoder architecture, except that the decoder network has an extra attention layer, which combines the encoder and decoder outputs through another n-stacked multi-head attention layer and passes the result to a fully connected network with a softmax activation function in order to predict the tokens in the target language.

Abandoning RNNs causes the sequential information of the sentence tokens to be lost. The Transformer model therefore uses positional encodings to determine the position of the tokens. By combining the positional encoding with the embedding vectors, sequentiality is acquired as in RNN structures. The positional encoding consists of sine and cosine functions of the position pos and the dimension index i:

$\begin{aligned} PE_{(pos, 2 i)} &= \sin (pos / 10000^{2 i / d_{model}}) \\ PE_{(pos, 2 i + 1)} &= \cos (pos / 10000^{2 i / d_{model}}) \end{aligned}$ (4)
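Eq. (4) can be evaluated for all positions and dimensions at once. The sketch below is a NumPy illustration with a hypothetical function name; the sizes in the usage line (50 positions, model dimension 128) mirror the settings reported in Section 4, and an even `d_model` is assumed.

```python
import numpy as np


def positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix of sinusoidal position encodings."""
    positions = np.arange(max_len)[:, None]   # (max_len, 1)
    dims = np.arange(d_model)[None, :]        # (1, d_model)
    # Dimensions 2i and 2i+1 share the same exponent 2i / d_model.
    angles = positions / np.power(10000.0, (2 * (dims // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])     # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])     # odd dimensions use cosine
    return pe


pe = positional_encoding(50, 128)
print(pe.shape)
```

The resulting matrix is added to the token embeddings row by row, so each position receives a distinct, deterministic offset.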

The resulting vector, which carries the sequential information of the tokens, is then fed to a multi-head attention layer. The multi-head attention layer is constructed from query (Q), key (K) and value (V) matrices, which are linear projections of the input. The queries and keys have dimension d_k. Each set of Q, K and V matrices corresponds to one head, and the heads are concatenated to form the multi-head attention layer.

$\begin{aligned} \mathrm{Attention} (Q, K, V) &= \mathrm{softmax} \left( \frac{Q K^{T}}{\sqrt{d_{k}}} \right) V \\ \mathrm{MultiHead} (Q, K, V) &= \mathrm{Concat} (\mathrm{head}_{1}, \ldots, \mathrm{head}_{h}) W^{O} \end{aligned}$ (5)

where d_k, K^T and W^O refer to the dimension of the keys, the transpose of K and the output weight matrix, respectively.
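Both lines of Eq. (5) can be sketched with NumPy as below. This is an illustration of self-attention over a single sentence without masking, dropout or batching; the function names and the layout of the projection weights (one matrix per head) are assumptions, and the sizes in the usage example follow the settings reported in Section 4 (model dimension 128, 4 heads).

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability only
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    weights = softmax(Q @ K.swapaxes(-1, -2) / np.sqrt(d_k))
    return weights @ V


def multi_head(X, W_q, W_k, W_v, W_o):
    """Self-attention over X with h heads.

    X: (n, d_model); W_q, W_k, W_v: (h, d_model, d_k); W_o: (h * d_k, d_model).
    """
    heads = [
        attention(X @ wq, X @ wk, X @ wv)      # one head per projection triple
        for wq, wk, wv in zip(W_q, W_k, W_v)
    ]
    return np.concatenate(heads, axis=-1) @ W_o


rng = np.random.default_rng(0)
n, d_model, h = 7, 128, 4
d_k = d_model // h
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
W_o = rng.normal(size=(h * d_k, d_model))
print(multi_head(X, W_q, W_k, W_v, W_o).shape)
```

Dividing by the square root of d_k keeps the dot products in a range where the softmax does not saturate as the key dimension grows.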

The main architecture of the Transformer follows the same encoding and decoding pattern as Seq2Seq models. The encoder of the Transformer contains n stacked multi-head attention (MHA) layers, each followed by a position-wise feedforward network, and its output is passed to the decoder, similar to the hidden state passing in Seq2Seq models. However, the decoder differs from the one used in Seq2Seq models since it consists of two sub-sections, a decoder MHA layer and an encoder-decoder MHA layer. The decoder MHA layer connects to a position-wise feedforward network in the same way as the encoder. Further, these two layers link to another MHA layer that combines the information from both the input and target sentences before the softmax layer. The basic representation of the Transformer model can be seen in Fig. 2.

Fig. 2

The representation of Transformer model.

3.4 Gumbel softmax function

A fully connected network with a softmax activation function is used in both architectures to select the tokens of the output sequence from the probability distribution predicted by the decoder network. In our experiments, we alternatively trained our models using Gumbel distributions [14] on the softmax function.

The Gumbel Softmax function [14] creates a new categorical distribution from the output probability distribution, controlled by a tunable temperature parameter τ. The token y_(t+1) selected by the softmax function with respect to the Gumbel distribution is calculated as:

$\begin{aligned} \rho_{t} &= - \log (- \log \mathrm{Uniform}_{t} (0, 1)) \\ y_{t + 1} &= \operatorname{argmax} (\tau (o_{t} + \rho_{t})) \end{aligned}$ (6)

where o_t is the probability distribution predicted by the decoder network. As the temperature τ approaches 0, the distribution becomes more likely to be the output. In our empirical studies, we observed that the choice of the temperature value is critical for translation performance. Therefore, we set the temperature value depending on the dataset.
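For orientation, the sketch below shows the commonly used form of the Gumbel Softmax relaxation, in which Gumbel noise is added to the log-probabilities and the sum is divided by the temperature before a softmax. This is an assumption about the standard formulation and may differ in detail from Eq. (6) as printed, which scales by τ and takes an argmax; the function name and the example probabilities are hypothetical.

```python
import numpy as np


def gumbel_softmax(probs, tau, rng):
    """Draw a relaxed one-hot sample from a categorical distribution.

    probs: (..., vocab) probabilities predicted by the decoder
    tau:   temperature; smaller values give samples closer to one-hot
    rng:   numpy.random.Generator used for the uniform noise
    """
    probs = np.asarray(probs, dtype=float)
    u = rng.uniform(1e-10, 1.0 - 1e-10, size=probs.shape)
    g = -np.log(-np.log(u))                    # Gumbel(0, 1) noise
    logits = (np.log(probs + 1e-20) + g) / tau
    logits = logits - logits.max(axis=-1, keepdims=True)
    y = np.exp(logits)
    return y / y.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
probs = np.array([0.7, 0.2, 0.1])
print(gumbel_softmax(probs, tau=0.5, rng=rng))   # temperature used for TED Talks
print(gumbel_softmax(probs, tau=0.25, rng=rng))  # temperature used for WMT18 News
```

Taking the argmax of such a sample selects each token with the probability the decoder assigned to it, so the noise injects controlled randomness into token selection.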

4 Experimental settings

In this study, we employed the Seq2Seq and Transformer architectures for the Turkish-to-English translation task. In addition, an Attention layer is utilized on the Seq2Seq architecture. In our experiments, the Gumbel Softmax function and BPE are applied to each architecture. Consequently, we trained 12 models without using any pre-trained weights. The trained models and the corresponding results on the test datasets are given in Tables 3 and 4.

Table 3
BLEU scores for the translations on TED Talks test dataset

	BPE	Gumbel	BLEU ↑
Seq2Seq			19.4
	✓		18.0
		✓	20.0
	✓	✓	17.3
Seq2Seq + Attention			13.3
	✓		10.7
		✓	20.1
	✓	✓	16.0
Transformer			16.7
	✓		21.7
		✓	20.0
	✓	✓	22.4
Qi et al. [20]			17.9

Table 4

BLEU scores for the translations on WMT18 News test dataset

	BPE	Gumbel	BLEU ↑
Seq2Seq			28.8
	✓		24.5
		✓	27.2
	✓	✓	24.5
Seq2Seq + Attention			22.2
	✓		15.7
		✓	25.8
	✓	✓	21.8
Transformer			37.5
	✓		38.2
		✓	21.2
	✓	✓	38.7
Yang et al. [30]			26.7
Lin et al. [17]			33.3

In the evaluation of the trained models, teacher forcing [29] is selected as the decoding strategy of the inference model that generates translations for the given sentences. The teacher-forcing algorithm predicts the next output by supplying the token generated one step before. The algorithm exploits the maximum likelihood principle:

$P (y_{1}, y_{2}, \ldots, y_{T}) = P (y_{1}) \prod_{t = 2}^{T} P (y_{t} | y_{1}, \ldots, y_{t - 1})$ (7)
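The factorization in Eq. (7) means that the probability of a whole sequence is the product of per-step conditional probabilities, each computed with the preceding tokens supplied to the decoder. A minimal sketch, assuming the per-step distributions are already available as a matrix (the function name and the toy numbers are hypothetical):

```python
import numpy as np


def sequence_log_likelihood(step_probs, target_ids):
    """Log-probability of a target sequence under the chain rule.

    step_probs: (T, vocab) array; row t holds P(y_t | y_1 .. y_{t-1}),
                computed with the preceding tokens fed to the decoder
    target_ids: length-T sequence of token indices
    """
    step_probs = np.asarray(step_probs, dtype=float)
    target_ids = np.asarray(target_ids)
    # Pick the probability assigned to the target token at every step.
    picked = step_probs[np.arange(len(target_ids)), target_ids]
    # Summing logs equals the log of the product in the factorization.
    return float(np.log(picked).sum())


# Toy distributions over a 3-token vocabulary for a 2-token target.
step_probs = [[0.5, 0.3, 0.2],
              [0.1, 0.8, 0.1]]
print(sequence_log_likelihood(step_probs, [0, 1]))
```

Maximizing this quantity over the training pairs is equivalent to minimizing the categorical cross-entropy loss summed over time steps.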

In the experiments conducted without Byte Pair Encoding (BPE), we had to reduce the number of unique tokens because of computational resource limitations. We selected the 40K most frequent tokens of each language from the training dataset to create the vocabularies. All out-of-vocabulary tokens were removed from the sentences. In the experiments with BPE, we set the subword vocabulary size to 8K. Although Seq2Seq models require less computational power than the Transformer, the vocabulary sizes were kept constant to eliminate the effect of vocabulary size and achieve a fair comparison between architectures.
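The vocabulary reduction for the experiments without BPE can be sketched as follows. Whitespace tokenization and the function names `build_vocab` and `drop_oov` are assumptions made for illustration; the actual preprocessing pipeline may differ.

```python
from collections import Counter


def build_vocab(sentences, max_size):
    """Keep the max_size most frequent whitespace-separated tokens."""
    counts = Counter(tok for sent in sentences for tok in sent.split())
    return {tok for tok, _ in counts.most_common(max_size)}


def drop_oov(sentence, vocab):
    """Remove every token that is not in the vocabulary."""
    return " ".join(tok for tok in sentence.split() if tok in vocab)


# Toy corpus; the experiments use max_size = 40000 per language.
train = ["ve biz bunu biliyoruz", "ve biz bunu yapıyoruz", "ve biz öğreniyoruz"]
vocab = build_vocab(train, max_size=3)
print(drop_oov("ve biz bunu yapıyoruz", vocab))
```

Because inflected forms are often rare, this kind of truncation removes many Turkish word forms outright, which is the OOV effect discussed in Section 6.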

The Seq2Seq models employ GRU layers with 1024 units in both the encoder and the decoder. The RMSProp optimizer is used to minimize the sparse categorical cross-entropy loss function. For the attention variant, a Bahdanau attention layer [2] with 1024 units is added on top of the pure architecture.

The Transformer models are trained with 2 layers, a model dimension of 128, a position-wise feedforward dimension of 512, a dropout rate of 0.1 and 4 attention heads. ReLU is chosen as the activation of the position-wise feedforward layer.

For both architectures, the maximum sentence length is limited to 50 because of computational limitations. The embedding dimension is set to 256. We did not use any pre-trained word embeddings. The Gumbel Softmax temperature τ is set to 0.5 for the TED Talks dataset and 0.25 for the WMT18 News dataset. The models are trained with a batch size of 64. Our implementation is independent of the languages that are translated. The models can therefore be trained for other language pairs when a suitable dataset is supplied (see Footnotes).

5 Experimental results

In this study, we trained 12 different models based on the above-mentioned architectures. Each trained model was employed to produce Turkish-to-English translations on the TED Talks and WMT18 News datasets. The BLEU metric is used to evaluate the models. The quantitative results are given in Tables 3 and 4.

In our empirical study on the TED Talks dataset, we observed that the Transformer with the combination of BPE and Gumbel Softmax outperforms the other models with a BLEU score of 22.4. Moreover, Table 3 indicates that using Gumbel Softmax distributions during training has a significant positive effect on the results in all experiments. Although using BPE reduces the quality of translation in most of the experiments, the combination of BPE and Gumbel Softmax improves the results for every architecture except pure Seq2Seq. The result reported by Qi et al. [20] is also given in Table 3 to provide a comparison. They employed the Seq2Seq architecture with an attention mechanism to perform Turkish-to-English translation on the TED Talks dataset and reported that their model achieved a BLEU score of 14.9, which increased to 17.9 when pre-trained word embeddings were used for both languages.

As the results in Table 4 show, the experiments conducted on the WMT18 News dataset are in line with the results achieved on TED Talks. That is, the Transformer architecture achieved the best result, a BLEU score of 38.7, by combining BPE and Gumbel Softmax, and it outperformed the Seq2Seq models overall. In previous studies on the WMT18 News dataset, Yang et al. [30] proposed to train the Transformer model in a multilingual setting, i.e. training on the languages simultaneously and updating the network based on the proposed agreement function. A similar study by Lin et al. [17] employed the Transformer architecture to learn from sentences in which tokens are randomly replaced with their counterparts from another language.

For qualitative analysis, an example taken from the TED Talks test dataset and the translation outputs of the models are given in Table 5. As the table indicates, the translation closest to the ground truth is produced by the Transformer with the combination of BPE and Gumbel Softmax. Although the BLEU score of the pure Seq2Seq model with the combination of BPE and Gumbel Softmax is not the highest, its translation of the given example demonstrates two notable features: (i) using ’that’ as the object of the verb, and (ii) using auxiliary verbs with apostrophes.

Table 5
Translation results on an example sentence that is taken from TED Talks test dataset

	Model		Sentence
	BPE	Gumbel
Seq2Seq			and we know what a brain is doing if it &apos s a brain
	✓		and that we have to do something about the brain.
		✓	and we know how this is an brain of how we think about it
	✓	✓	and we know that this is the same kind of information we &apos re doing to the right kind of the brain .
Seq2Seq w/ Attention			and we know a lot about understanding about the brain actually does .
	✓		and we know the brain is how to do that this is actually .
		✓	and we know a little bit about how the brain does this.
	✓	✓	and we also know a little stuff and which the brain do it .
Transformer			and we know something about what the brain does is actually .
	✓		and we know something about how the brain does it .
		✓	and we also know how the brain does that .
	✓	✓	and we know something about how the brain did that , actually.
Ground-truth			and we even know something about the way the brain does this
Input			ve beynin bunu nasıl yaptığıyla ilgili bir şey de biliyoruz aslında.

(i) Learning to use the word ’that’ as an object of the verb is an important achievement because it requires decent language modelling quality. Further, (ii) using apostrophes is another notable feature: the word “yapıyoruz” is a single token, whereas the corresponding translation “we’re doing” comprises 4 tokens including the apostrophe. Producing it therefore requires a robust mapping between the agglutinative language and English.

6 Discussion

Agglutinative languages, such as Turkish, are among the most difficult language types for studies in Computational Linguistics. Their main characteristic is that the meaning of words can be changed by attaching different combinations of morphemes. As a result, these languages possess a vast vocabulary of unique word forms, so that even a huge dataset may not cover a substantial share of the words. Therefore, the resources for morphologically rich languages are limited.

Due to computational limitations, we were forced to reduce the vocabulary sizes and sentence lengths of the training dataset. For general-purpose translation systems, such reductions are not tolerable and the models require extensive data. However, our aim in this study is to provide a comprehensive analysis of the configurations and architectures for training models. Therefore, we prioritized the comparability of the experiments over attempting state-of-the-art results.

Intuitively, reducing the vocabulary sizes causes an out-of-vocabulary (OOV) issue, even though the vocabulary is selected from the most frequent tokens. Sutskever et al. [18] also addressed this problem and improved the translation quality with a post-processing replacement step based on a positional unknown model that learns the behaviour of the unknown words in sentences. However, we leave the investigation of the effect of such OOV models on morphologically rich languages for future work.

Based on our empirical results, we concluded that the self-attention mechanism is more suitable for the Turkish-to-English NMT task than traditional gated RNN structures such as LSTM and GRU. We observed that although BPE improves the results of the Transformer architecture, combining BPE with the Seq2Seq architecture does not improve the results. Our experimental results also indicate that using an Attention layer on the Seq2Seq architecture without Gumbel Softmax decreases the quality on this task.

The Gumbel Softmax function balances the distribution of the outputs with respect to the temperature parameter. Training with more evenly distributed prediction probabilities provides output regularization, so that the models are pushed to learn more robust predictions. Overall, we observed that the results of the models improve when Gumbel Softmax distributions are used. Therefore, we concluded that the Gumbel Softmax function has a strong effect on improving the scores. However, it must be noted that the temperature has a substantial influence on the performance and should therefore be tuned with considerable caution. Although training the network with the Gumbel Softmax function is independent of the languages that are translated, our observations are based on a morphologically rich language. Investigating the effects of using Gumbel Softmax for other types of languages is a subject of future work. In the light of our observations, we suppose that using Gumbel Softmax distributions with BPE may improve the quality of translations for languages containing a high density of compound words, like German.

Moreover, we analyzed the relationship between sentence length and the performance of the trained models to assess the efficiency of the models on different types of sentences. Figures 3 and 4 show the performance of the best three models from different architectures on the TED Talks and WMT18 News test datasets, respectively, with respect to the length of the input sentences. The figures depict sentences whose length is between 3 and 65, because sentences outside this range are rare in the test sets. In the TED Talks experiments, as shown in Fig. 3(b), the Transformer with BPE and Gumbel Softmax outperforms the other models on sentences whose length is in the 5-25 range; however, the performance of the model drops drastically as the sequence length increases. Although the Seq2Seq architectures significantly surpass the Transformer architecture on sentences longer than 25 tokens, sentences in this region are scarce. Therefore, the overall results of the Transformer model are substantially better than those of the others. In the WMT18 News experiments, as shown in Fig. 4(b), the Transformer with BPE and Gumbel Softmax surpasses the other models at all sentence lengths. Consequently, this confirms the effectiveness of the proposed model on combinations of complex and simple sentences.

Fig. 3

Performance of the best three models on TED Talks test dataset with respect to length.

Fig. 4

Performance of the best three models on WMT18 News test dataset with respect to length.
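The length analysis above can be sketched as a simple bucketing step. The snippet below is an illustrative assumption rather than the procedure used for Figs. 3 and 4: it presumes that a per-sentence quality score has already been computed by some metric, measures length in whitespace-separated tokens, and uses a hypothetical bucket width of 5. Only the 3 to 65 length window comes from the text.

```python
from collections import defaultdict


def bucket_by_length(sources, scores, min_len=3, max_len=65, width=5):
    """Average per-sentence scores within source-length buckets.

    sources: list of source sentences (strings)
    scores:  list of per-sentence scores aligned with `sources`
    Returns a dict mapping each bucket's starting length to its mean score.
    Sentences shorter than `min_len` or longer than `max_len` are skipped.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for src, score in zip(sources, scores):
        n = len(src.split())  # length in whitespace-separated tokens
        if n < min_len or n > max_len:
            continue
        start = min_len + ((n - min_len) // width) * width
        totals[start] += score
        counts[start] += 1
    return {start: totals[start] / counts[start] for start in sorted(counts)}


# Hypothetical sentences and scores, for illustration only.
example_sources = ["a b c", "a b c d", "a b c d e f g h", "a b"]
example_scores = [0.2, 0.4, 0.9, 1.0]
print(bucket_by_length(example_sources, example_scores))
```

In this toy input the two-token sentence falls outside the window and is ignored, while the remaining three sentences fall into the buckets starting at lengths 3 and 8.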

7 Conclusion

In this study, two main NMT architectures, i.e. Seq2Seq and Transformer, were employed to address the Turkish-to-English translation task. An attention mechanism was also added to the Seq2Seq architecture to extend its capabilities. In addition, BPE and the Gumbel Softmax function were examined for each model. Although BPE is an intuitively suitable approach for agglutinative languages, the empirical results showed that it did not improve the results on the translation task overall. In contrast, the Gumbel Softmax function was found to be effective in improving the results. In our experiments, the Transformer model with the combination of BPE and Gumbel Softmax achieved the highest quality, with a BLEU score of 22.4 on the TED Talks dataset and 38.7 on the WMT18 dataset.

In conclusion, we conducted a comprehensive comparison of NMT architectures for Turkish, which is a morphologically rich language due to agglutination. In our experiments, we found that the Transformer architecture outperforms Seq2Seq. In addition, we found that using Gumbel Softmax distributions during model training contributes substantially to the results. We will devote our future work to investigating the decoding strategy of the inference model and the OOV token problem in morphologically rich languages.

Footnotes

The implementation codes are available at .

References

1. Ataman, Negri, Turchi and Federico, Linguistically motivated vocabulary reduction for neural machine translation from Turkish to English, The Prague Bulletin of Mathematical Linguistics, 331–342, 2017.

2. Bahdanau, Cho and Bengio, Neural machine translation by jointly learning to align and translate, arXiv preprint arXiv:1409.0473, 2014.

3. Bojar, Federmann, Fishel, Graham, Haddow, Huck, Koehn and Monz, Findings of the 2018 conference on machine translation (WMT18), In Proceedings of the Third Conference on Machine Translation (Vol. 2, pp. 272–307), 2018.

4. Chandar A.P.S., Lauly, Larochelle, Khapra M.M., Ravindran, Raykar V.C. and Saha, An autoencoder approach to learning bilingual word representations, Advances in neural information processing systems, 1853–1861, 2014.

5. Cho, van Merrienboer, Gülçehre Ç., Bougares, Schwenk and Bengio, Learning phrase representations using RNN encoder-decoder for statistical machine translation, arXiv preprint arXiv:1406.1078, 2014.

6. Currey, Miceli Barone A.V. and Heafield, Copied monolingual data improves low-resource neural machine translation, Proceedings of the Second Conference on Machine Translation, Association for Computational Linguistics, Copenhagen, Denmark, 148–156, 2017.

7. Devlin, Zbib, Huang, Lamar, Schwartz and Makhoul, Fast and robust neural network joint models for statistical machine translation, Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics 1, Association for Computational Linguistics, Baltimore, Maryland, 1370–1380, 2014.

8. Devlin, Chang M.W., Lee and Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805, 2018.

9. Gao, Pantel, Gamon, He and Deng, Modeling interestingness with deep neural networks, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Doha, Qatar, 2–13, 2014.

10. , Im D.J. and Li, Neural machine translation with gumbel-greedy decoding, Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32, No. 1, 2018.

11. Gülçehre Ç., Fırat, Xu, Cho, Barrault, Lin H.-C., Bougares, Schwenk and Bengio, On using monolingual corpora in neural machine translation, arXiv preprint arXiv:1503.03535, 2015.

12. Hochreiter and Schmidhuber, Long short-term memory, Neural Comput 9 (1997), 1735–1780.

13. Hochreiter, The vanishing gradient problem during learning recurrent neural nets and problem solutions, Int J Uncertain Fuzziness Knowl-Based Syst 6 (1998), 107–116.

14. Jang, Gu and Poole, Categorical reparameterization with gumbel-softmax, arXiv preprint arXiv:1611.01144, 2016.

15. Kalchbrenner and Blunsom, Recurrent continuous translation models, Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 1700–1709, 2013.

16. Koehn, Och F.J. and Marcu, Statistical phrase-based translation, Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, Association for Computational Linguistics, 48–54, 2003.

17. Lin, Pan, Wang, Qiu, Feng, Zhou and Li, Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2649–2663, 2020.

18. Luong, Sutskever, Le, Vinyals and Zaremba, Addressing the rare word problem in neural machine translation, Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, 1, Association for Computational Linguistics, Beijing, China, 11–19, 2015.

19. Pan, Li, Yang and Dong, Multi-Task Neural Model for Agglutinative Language Translation, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 103–110, 2020.

20. Qi, Sachan D.S., Felix, Padmanabhan S.J. and Neubig, When and why are pre-trained word embeddings useful for neural machine translation? arXiv preprint arXiv:1804.06323, 2018.

21. Radford, Wu, Child, Luan, Amodei and Sutskever, Language models are unsupervised multitask learners, OpenAI Blog 1(8) (2019), 9.

22. Schwenk, Continuous space language models, Computer Speech & Language 21 (2007), 492–518.

23. Schwenk, Continuous space translation models for phrase-based statistical machine translation, Proceedings of COLING 2012: Posters, The COLING 2012 Organizing Committee, Mumbai, India, 1071–1080, 2012.

24. Sennrich, Haddow and Birch, Neural machine translation of rare words with subword units, arXiv preprint arXiv:1508.07909, 2015.

25. Sutskever, Vinyals and Le Q.V., Sequence to sequence learning with neural networks, Advances in Neural Information Processing Systems, 3104–3112, 2014.

26. Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez A.N., Kaiser and Polosukhin, Attention is all you need, Advances in neural information processing systems, 5998–6008, 2017.

27. Wang, Zhao, Lu B.-L., Utiyama and Sumita, Neural network based bilingual language model growing for statistical machine translation, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Doha, Qatar, 189–195, 2014.

28. Weaver, Translation, Machine Translation of Languages, eds. W.N. Locke and A.D. Boothe (MIT Press, Cambridge, MA, 1949/1955), 15–23, reprinted from a memorandum written by Weaver in 1949.

29. Williams R.J. and Zipser, A learning algorithm for continually running fully recurrent neural networks, Neural Comput 1 (1989), 270–280.

30. Yang, Yin, Ma, Huang, Zhang, Li and Wei, Multilingual Agreement for Multilingual Neural Machine Translation, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 233–239, 2021.

31. Zou W.Y., Socher, Cer and Manning C.D., Bilingual word embeddings for phrase-based machine translation, Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Seattle, Washington, USA, 1393–1398, 2013.


Table 1
The summary of the TED Talks dataset

                        Training          Test
                        EN      TR        EN      TR
Sentences               182K              5K
Unique Words            47K     185K      8K      17K
Total Words             4.1M    3.1M      111K    83K
Mean Sentence Length    21      15        20      15

Table 3
BLEU scores for the translations on TED Talks test dataset

Model                  BPE    Gumbel    BLEU ↑
Seq2Seq                -      -         19.4
                       ✓      -         18.0
                       -      ✓         20.0
                       ✓      ✓         17.3
Seq2Seq + Attention    -      -         13.3
                       ✓      -         10.7
                       -      ✓         20.1
                       ✓      ✓         16.0
Transformer            -      -         16.7
                       ✓      -         21.7
                       -      ✓         20.0
                       ✓      ✓         22.4
Qi et al. [20]         -      -         17.9