An optimized Transformer algorithm: Implementation for intelligent English translation

Abstract

There is an increasing demand for high-quality translations in the realm of intelligent English translation. This paper optimized the traditional Transformer algorithm by enhancing position coding and the softmax layer. Bidirectional long short-term memory (BiLST) was employed to realize position encoding, capturing both contextual and positional information simultaneously. Additionally, the softmax function was replaced with the sparsemax function to obtain sparser results. The translation performance of some algorithms on Chinese and English datasets was compared and analyzed. It was found that that the optimized Transformer algorithm performed better than the RNNSearch, ConvS2S, and Transformer-base algorithms in terms of bilingual evaluation understudy (BLEU) score on the test set. It achieved an average BLEU score of 23.72, representing an improvement of 1.56 over the RNNSearch algorithm, 1.17 over the ConvS2S algorithm, and 0.73 over the Transformer-base algorithm. The parameter quantity of the optimized algorithm was 6.83 M, which was 0.05 M higher than the Transformer-base algorithm. Furthermore, its running time was 4,862.76 s, showing a marginal increase of 0.21% compared to the Transformer-base algorithm. These findings validate the reliability of the optimized Transformer algorithm for intelligent English translation and its potential practical application.

Keywords

Transformer algorithm intelligent translation position coding softmax function

1. Introduction

With society’s continuous development and progress, there is an increasing demand for extensive cross-linguistic communication. Translation, as an indispensable part of cross-linguistic communication, has been the subject of in-depth study. However, traditional human translation is increasingly unable to meet the wide range of needs in terms of speed and cost. This has led to the emergence of computer-based intelligent translation as a key area of research. Intelligent translation can quickly complete the translation of large-scale cross-language materials [1], playing a crucial role in international communication, such as business translation and travel translation [2]. Consequently, improving the quality of translation has become an important task [3]. The significance of intelligent translation has led to extensive research on intelligent translation algorithms [4]. For instance, Pielka et al. [5] learned contradiction-specific sentence embeddings in German translation using a recurrent neural network (RNN), and their proposed method demonstrated good performance based on dataset evaluation. Vathsala et al. [6] implemented code-switching and transliteration based on a long short-term memory (LSTM) network for translating social media data. It was trained using 1 M data for transliteration and translation of Twitter data. Lee et al. [7] proposed a reinforcement learning-based attention mechanism to ensure translation quality and address latency issues in intelligent translation. They obtained better translation quality through experiments. Raju et al. [8] focused on parallel corpus building and the out-of-vocabulary issue in English-to-Telugu translation. They found that effective preprocessing improved the parameters of translation such as accuracy, perplexity, and bilingual evaluation understudy (BLEU) score. English, a widely used language worldwide [9], holds excellent application value in intelligent translation research. Therefore, this paper optimized the current Transformer algorithm for intelligent English translation, obtaining an optimized algorithm. The reliability of this method was demonstrated through experimental analysis, providing a new and effective approach for intelligent English translation. Furthermore, this research contributes to the theoretical directions for further exploration of the Transformer algorithm.

2. An optimized Transformer algorithm

2.1 Transformer algorithm

The Transformer algorithm employs the Seq2Seq architecture [10], where encoders are positioned on the left and decoders on the right. Each encoder consists of a multi-head self-attention layer and a feedforward neural network (FFN) layer. Residual connection and layer regularization are utilized. The decoder incorporates a masking operation to selectively conceal specific information from influencing parameter updates. The Transformer algorithm encompasses several pivotal components, which are outlined below.

(1) Attention mechanism

The Transformer algorithm uses an attention mechanism with Query-Key-Value, written as:

$\displaystyle\textit{Attention}({Q,K,V})=\textit{softmax}\left({\frac{QK^{T}}{% \sqrt{d_{k}}}}\right)V,$

where $K^{T}$ is the transposition of $K$ and $d_{k}$ refers to the dimension of Key.

The Transformer algorithm combines multiple attention mechanisms, i.e., the multi-head attention mechanism, written as:

$\displaystyle\textit{MultiHaedAttention}({Q,K,V})=\textit{Concat}({\textit{% head}_{1},\textit{head}_{2},\ldots,\textit{head}_{H}})W^{O},$ $\displaystyle\textit{head}_{i}=\textit{Attention}({QW_{i}^{Q},KW_{i}^{K},VW_{i% }^{V}}),$

where $W_{i}^{Q}$ , $W_{i}^{K}$ , and $W_{i}^{V}$ are the parameter matrices, and $H$ is the quantity of heads, which is taken as 8 in the Transformer algorithm.

(2) Feedforward neural network

The FFN maps the results obtained from the multiple attention layers to a new space through a fully connected network, which can be regarded as two convolutional layers, and its computational formula is:

$\displaystyle\textit{FFN}(x)=\textit{ReLU}({xW_{1}+b_{1}})W_{2}+b_{2}$

where $x$ is the output of the previous layer, $W_{1}$ and $W_{2}$ are the weight of different layers, $b_{1}$ and $b_{2}$ are the bias of different layers.

(3) Residual connection and layer regularization (Add&Norm)

Residual connection mitigates the problem of vanishing gradients, and layer regularization facilitates faster convergence. Their equations are:

$\displaystyle\textit{LayerNorm}({x+\textit{MultiHeadAttention}(x)}),$ $\displaystyle\textit{LayerNorm}({x+\textit{FFN}(x)}),$

where $x$ refers to the input to the multi-head attention or FFN layers, $\textit{MultiHeadAttention}(x)$ and $\textit{FFN}(x)$ are outputs. The input and output have the same dimension.

(4) Position coding

The attention mechanism in the Transformer algorithm does not represent the positional information of the word, as the positional encoding needs to be added at the time of input. The calculation formula of positon coding is:

$\displaystyle PE({\textit{pos},2i})=\sin\left({\frac{\textit{pos}}{10000^{2i}/% d_{\textit{model}}}}\right),$ $\displaystyle PE({\textit{pos},2i+1})=\cos\left({\frac{\textit{pos}}{10000^{2i% }/d_{\textit{model}}}}\right),$

where pos is the word position and $d_{\textit{model}}$ is the word vector dimension.

2.2 Optimized position coding

In the Transformer algorithm, the position encoding lacks contextual information. To enhance the performance of English intelligent translation, this paper introduces an optimization technique for position encoding. LSTM is a special kind of RNN [11], which has a unique advantage in the capture of long-distance information, and it mainly realizes the information transfer through three gates.

(1) Forgetting gate: it determines what information needs to be passed or discarded:

$\displaystyle f_{t}=\sigma({W_{f}h_{t-1}+V_{f}x_{t}+b_{f}}).$

(2) Update gate: it determines what information needs to be updated in the cell:

$\displaystyle i_{t}=\sigma({W_{i}h_{t-1}+V_{i}x_{t}+b_{i}}).$

Then, the candidate cell is:

$\displaystyle\tilde{c}_{t}=\tanh(W_{c}h_{t-1}+V_{c}x_{t}+b_{c}).$

The updated cell is:

$\displaystyle c_{t}=f_{t}c_{t-1}+i_{t}\tilde{c}_{t}.$

(3) Output gate: it decides the next hidden state. Its equation is:

$\displaystyle o_{t}=\sigma({W_{o}h_{t-1}+V_{o}x_{t}+b_{o}}).$

Finally, the output of the hidden layer state at time $t$ is written as:

$\displaystyle h_{t}=o_{t}\times\tanh({c_{t}}).$

In the above equations, $h_{t-1}$ is the output at time $t-1$ , $x_{t}$ is the input at time $t$ , $\sigma$ is the sigmoid function, $t$ refers to the tanh function, $W$ denotes the weight matrix, $b$ denotes the bias matrix, and $V$ is the weight matrix of $h_{t-1}$ .

Unidirectional LSTM can only process undirectional information, so when applied to location coding, it can only process preceding context. To capture contextual information at the same time, this paper uses a bidirectional LSTM (BiLSTM) [12], which realizes the integration of the position information and the contextual information. At time $t$ , the output of the hidden layer state of BiLSTM is written as: $h_{t}=[\overrightarrow{h}_{t},\overleftarrow{h}_{t}]$ , where $[\cdot]$ denotes the splicing of the vectors, $\overrightarrow{h}_{t}$ and $\overleftarrow{h}_{t}$ are the hidden layer output of forward and backward LSTM.

As to BiLSTM-based position coding, the word vectors are taken as input sequences and fed into BiLSTM for training to get output vector $h_{t}$ containing the positional information of the words and the contextual information, and the output result of BiLSTM is used as the input sequence of the Transformer algorithm to improve the effect of intelligent translation.

2.3 Optimized softmax layer

In the Transformer algorithm, a softmax layer is used in the decoder part to get the word with high probability as output:

$\displaystyle\textit{softmax}_{i}(z)=\frac{\exp({z_{i}})}{\mathop{\sum}% \nolimits_{j}\exp({z_{j}})}.$

To obtain a more sparse output, this paper uses the sparsemax function [13] instead of the softmax function:

$\displaystyle\textit{sparsemax}_{i}(z)=\left\{{{\begin{array}[]{l}{1,\text{if % }t>1}\\ {({t+1})/2,\text{if }-1\leqslant t\leqslant 1}\\ {0,\text{if }t<-1}\\ \end{array}}}\right..$

Based on the sparse mix function, the output case of the decoder is computed:

$\displaystyle p({Y_{j}|Y_{<j},X,\Theta})\propto\textit{sparsemax}({W_{v}T_{j}^% {N}}),$

where $X$ is the source sentence, $Y$ is the target word translated by the model, $\Theta$ is a parameter, $W_{v}\in R^{V\times d}$ , $V$ is the size of the word list, and $T_{j}^{N}$ is the output of the model, which is used to predict the $j$ -th target word.

In addition, for the softmax function in the attention mechanism, the same sparsemax function is used instead, i.e.,

$\displaystyle\textit{Attention}({Q,K,V})=\textit{sparsemax}\left({\frac{QK^{T}% }{\sqrt{d_{k}}}}\right)V.$

3. Results and analysis

3.1 Experimental setup

The experiments were conducted on a Linux environment using Python 3.6 as the programming language and TensorFlow as the deep learning framework. The Chinese-English dataset from the WMT17 corpus was used for training, newsdev2017 was used for validation, and newstest2017, cwmt2018, and newstest2018 were used as the test sets. The data underwent various preprocessing steps, including cleaning, letter case conversion, word segmentation, and byte pair encoding (BPE) [14].

The proposed method in this paper was compared with the following intelligent translation methods.

(1) RNNSearch [15]: This method utilized 1,000 neurons in the hidden layer, 620 as the word embedding dimension, 5e-4 as the initial learning rate, a batch size of 128, and the Adam optimizer.

(2) ConvS2S [16]: The codec in this method employed 512 neurons in the hidden layer, 512 as the word embedding dimension, 0.25 as the initial learning rate, and a batch size of 64.

(3) Transformer-base [17]: It refers to the conventional Transformer algorithm utilizing identical parameter configurations as the optimized Transformer algorithm proposed in this paper.

Table 1
Transformer algorithm parameter settings

Parameter	Value
Number of layers in the codec	6
Number of attention heads	8
Word vector dimension	512
Hidden layer dimension	512
FFN layer dimension	2048
Dropout	0.1
Learning rate	0.5
Batch size	32
Beam size	5

The effectiveness of intelligent translation was evaluated based on BLEU [18]. BLEU is an algorithm based on the n-gram model. First, the n-gram of each sentence was computed. Correction precision $p_{n}$ is solved:

$\displaystyle p_{n}=\frac{\mathop{\sum}\nolimits_{C\in\{{\textit{Candidates}}% \}}\mathop{\sum}\nolimits_{n-\textit{gram}\in C}\textit{Count}_{\textit{clip}}% ({n-\textit{gram}})}{\mathop{\sum}\nolimits_{C^{\prime}\in\{{\textit{% Candidates}}\}}\mathop{\sum}\nolimits_{n-\textit{gram}^{\prime}\in C^{\prime}}% \textit{Count}_{\textit{clip}}({n-\textit{gram}^{\prime}})},$ $\displaystyle\textit{Count}_{\textit{clip}}=\min({\textit{Count},\textit{Max\_% Ref\_Count}}),$

where Count is the frequency at which the n-gram appears in the model-generated sentences, and Max_Ref_Count is the maximum frequency at which the n-gram appears in the reference corpus.

BLEU is calculated by the following formula:

$\displaystyle\textit{BLEU}=BP\cdot\exp\left({\mathop{\sum}\limits_{n=1}^{n}w_{% n}\log p_{n}}\right),$ $\displaystyle BP=\left\{{{\begin{array}[]{l}{1,\text{if }c>r}\\ {e^{({1-r/c})},\text{if }c\leqslant r}\\ \end{array}}}\right.,$

where $B P$ is a penalty factor, $c$ is the length of the candidate sentence, $r$ is the reference translation, and $w_{n}$ is a weighted weight, $n=4$ usually, i.e., $w_{n}=1/4$ .

3.2 Analysis of results

Figure 1 shows the improvement achieved by the optimized Transformer algorithm compared with the existing intelligent translation methods in terms of BLEU score.

According to Fig. 1, using the newstest2017 dataset, the BLEU score of the RNNSearch, ConvS2S, and Transformer-base algorithms was 22.21, 22.76, and 23.22, respectively. Among these algorithms, the Transformer-base algorithm demonstrated superior performance in intelligent translation. Additionally, the optimized Transformer algorithm proposed in this paper achieved a BLEU score of 24.08, exhibiting an improvement of 1.87, 1.32, and 0.86 compared to the previous three algorithms, respectively. Similarly, the optimized Transformer algorithm achieved the highest BLEU scores on the cwmt2018 and newstest2018 datasets. When comparing the average BLEU scores on the three test sets, the optimized Transformer algorithm outperformed the RNNSearch algorithm by 1.56, the ConvS2S algorithm by 1.17, and the Transformer-base algorithm by 0.73. These results confirmed the effectiveness of the optimized algorithm for English intelligent translation.

Then, comparative experiments were conducted on the optimized part, and the outcomes are presented in Table 2.

Table 2
Comparative experiments for the optimized parts

		Average BLEU score
Position coding optimization	Original position coding	22.99
	BiLSTM position coding	23.67 ( $+$ 0.68)
Softmax optimization	Softmax	22.99
	Sparsemax	23.26 ( $+$ 0.27)

According to Table 2, in the part of position coding optimization, the Transformer-base algorithm adopting the original position coding achieved an average BLEU score of 22.99 on the test set. However, by switching to BiLSTM position coding, the average BLEU score improved to 23.67. This improvement of 0.68 indicated that BiLSTM performed better in position coding and effectively utilized contextual information, resulting in improved translation quality. Moving on to the softmax optimization part, the Transformer-base algorithm adopting the softmax function yielded an average BLEU score of 22.99 on the test set. However, by implementing the sparsemax function instead, the score increased to 23.26. This improvement of 0.27 suggested that utilizing the sparsemax function enhanced the sparsity of results during probability normalization, thereby effectively enhancing translation quality.

Finally, the effect of the optimized Transformer algorithm on the parameter quantity and running time was compared (Table 3).

Table 3

Comparison of parameter quantity and operation time

	Parameter quantity	Operation time
Transformer-base	6.78 M	4852.33 s
Optimized transformer	6.83 M ( $+$ 0.05 M)	4862.76 s ( $+$ 10.43 s)

Figure 1.

Comparison of BLEU score between various algorithms on the test set.

According to Table 3, the parameter quantity of the optimized algorithm was 6.78 M, which was 0.05 M higher than the Transformer-base algorithm (6.78 M). It indicated that the improvement had a small increase in the parameter quantity. The average running time of the Transformer-base algorithm for the test set was 4,852.33 s, whereas the optimized Transformer algorithm had an average running time of 4,862.76 s, showing an increase of 10.43 s compared to the former. These results indicated that the optimized algorithm slightly increased the overall running time. However, considering the BLEU score, the optimized Transformer algorithm not only increased the parameter quantity by 0.05 M and the running time by 0.21% but also brought a significant improvement of 3.18% in the BLEU score. Overall, it demonstrated excellent performance in English intelligent translation.

4. Conclusion

This paper focuses on optimizing the traditional Transformer algorithm in terms of position coding and the softmax layer to obtain an optimized Transformer algorithm. Comparative experiments found that the optimized Transformer algorithm outperformed the RNNSearch, ConvS2S, and Transformer-base algorithms on the test set, achieving an average BLEU score of 23.72. Despite a slight increase in parameter quantity and running time, it did not significantly add to the computational burden. Therefore, the optimized Transformer algorithm is promising for practical applications in intelligent English translation and provides a novel reliable approach for intelligent English translation. This work provides theoretical support for better application and optimization of the Transformer algorithm in the field of intelligent translation. However, the present study also has some limitations, such as only considering intelligent translation between Chinese and English and lacking analysis of the algorithm’s translation performance on longer texts. In the future, there will be experiments on intelligent English translation in more languages to understand the applicability of this method. Further research will also be conducted on translating longer texts. Additionally, lightweight design considerations for the algorithm are needed to ensure translation performance while reducing complexity and improving computational efficiency.

References

Shirsath

Velankar

Patil

Shinde

. Various Approaches of Machine Translation for Marathi to English Language. ITM Web Conf. 2021; 40(4): 1-6.

Zhao

. The relationship between machine translation and human translation under the influence of artificial intelligence machine translation. Mob. Inf. Syst. 2022; 2022(Pt.16): 1-8.

Maksymenko

Saichyshyna

Turuta

Yerokhin

Babii

. Improving the Machine Translation Model in Specific Domains for the Ukrainian Language. 2022 IEEE 17th International Conference on Computer Sciences and Information Technologies (CSIT). 2022. pp. 123-129.

Minh

KNT

Nguyen

LHB

Dinh

. Exploring composite indexes for domain adaptation in neural machine translation. Vietnam J. Comput. Sci. 2024; 11(01): 75-94.

Pielka

Sifa

Hillebrand

Biesner

Ramamurthy

Ladi

Bauckhage

. Tackling Contradiction Detection in German Using Machine Translation and End-to-End Recurrent Neural Networks. 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy. 2021. pp. 6696-6701.

Vathsala

Holi

. RNN based machine translation and transliteration for Twitter data. Int. J. Speech Technol. 2020; 23(10): 499-504.

Lee

Shin

Kim

. Simultaneous neural machine translation with a reinforced attention mechanism. ETRI J. 2021; 43(5): 775-786.

Raju

Satyanarayana

. Effective preprocessing based neural machine translation for English to Telugu cross-language information retrieval. IAES Int. J. Artif. Intell. 2021; 10: 306-315.

Anggraeni

. English Borrowing Words in Indonesian Informatics Engineering Articles: Types and Meanings. DEIKSIS. 2021; 13(2): 134.

10.

Wang

Ouchi

. Boost Transformer with BERT and copying mechanism for ASR error correction. 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China. 2021. pp. 1-6.

11.

Saha

Senapati

Mahajan

. LSTM based Deep RNN Architecture for Election Sentiment Analysis from Bengali Newspaper. 2020 International Conference on Computational Performance Evaluation (ComPE), Shillong, India. 2020. pp. 564-569.

12.

Cheng

Wan

. Application Research of BiLSTM in Cross-Site Scripting Detection. J. Front. Comput. Sci. Technol. 2020; 14(8): 1338-1347.

13.

Martins

AFT

Astudillo

. From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification. ICML’16: Proceedings of the 33rd International Conference on International Conference on Machine Learning. Vol. 48. 2016. pp. 1614-1623.

14.

Sennrich

Haddow

Birch

. Neural Machine Translation of Rare Words with Subword Units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Vol. 3. 2016. pp. 1715-1725.

15.

Bahdanau

Cho

Bengio

. Neural Machine Translation by Jointly Learning to Align and Translate. Comput. Sci. 2014.

16.

Gehring

Auli

Grangier

Yarats

Dauphin

. Convolutional Sequence to Sequence Learning. ICML’17: Proceedings of the 34th International Conference on Machine Learning. Vol. 70. 2017. pp. 1243-1252.

17.

Vaswani

Shazeer

Parmar

, et al. Attention Is All You Need. NIPS’17: Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017. pp. 6000-6010.

18.

Adlaon

KMM

Marcos

. Neural Machine Translation for Cebuano to Tagalog with Subword Unit Translation. 2018 International Conference on Asian Language Processing (IALP), Bandung, Indonesia. 2018. pp. 328-333.