Abstract
Recently, considerable effort has been devoted to speeding up neural machine translation models. Among them, the non-autoregressive translation (NAT) model is promising because it removes the sequential dependence on previously generated tokens and parallelizes the generation of the entire sequence. On the other hand, the autoregressive translation (AT) model generally achieves higher translation accuracy than its NAT counterpart. A natural idea is therefore to fuse the AT and NAT models to seek a trade-off between inference speed and translation quality. This paper proposes ARF-NAT (NAT with auxiliary representation fusion), which introduces the merits of a shallow AT model into an NAT model. Three functions are designed to fuse the auxiliary representation into the decoder of the NAT model. Experimental results show that ARF-NAT outperforms the NAT baseline by 5.26 BLEU points on the WMT’14 German-English task with a significant speedup (7.58 times) over several strong AT baselines.
Introduction
Neural Machine Translation (NMT) has achieved promising improvements over statistical machine translation. Models of this type first encode the source sentence into hidden states and then generate the target sentence from these hidden states. For fast training, it is popular to improve parallelism on both the encoder and decoder sides. For example, the Transformer [2] has become one of the most successful models because it parallelizes computation over the input sequence well.
Unfortunately, the techniques used to speed up training are not applicable to NMT inference (on the decoder side), because the autoregressive nature of machine translation enforces a serial order of generation. In the inference stage, an autoregressive model decodes the target sentence word by word according to the translation history. This method effectively captures the distribution of real translations and has achieved superior performance in machine translation. However, because the output is generated token by token, the inference process is hard to parallelize.
Thus, autoregressive machine translation (AT) systems perform inference in a left-to-right (or right-to-left) manner and run slowly. As an alternative, the non-autoregressive machine translation (NAT) model removes the constraint of serial generation and makes predictions at different positions independently. Parallel WaveNet [1] is a very successful application of non-autoregressive generation: it is more than 1000 times faster than the original autoregressive WaveNet and has been deployed in the Google Assistant. Besides, the fast inference of NAT allows bigger and deeper Transformer models to be deployed under a given latency budget in production. However, although the inference of NAT systems is an order of magnitude faster, removing serial generation leads to a significant drop in translation quality because of the “weak” target-side representation.
A natural question is whether we can combine the merits of AT and NAT models for fast but accurate translation. To seek an answer, one can fuse multiple representations to make a stronger neural model [29, 35]. For example, Guo et al. [12] improved the NAT baseline by feeding transformed source-side embeddings to the decoder and showed a modest improvement (2.9 BLEU on the WMT’14 German-English task). But studies on introducing high-level representations encoded in AT models into NAT models remain rare.
In this paper, we make the source- and target-side representations of a lightweight AT system directly accessible to an NAT system. We call this approach non-autoregressive neural machine translation with auxiliary representation fusion (ARF-NAT). More concretely, we generate the first k target words in an autoregressive fashion and then generate the whole target sentence in a non-autoregressive fashion. In this way, the NAT system can benefit from different levels of the target-side representation in the AT system. We also develop three ways to incorporate the AT representation into the NAT model. The AT system can be seen as establishing a better “cold start” for the NAT system. With an appropriate choice of k, the overhead introduced by the AT system is modest and the system runs as fast as the NAT system.
We test our method on the IWSLT’14 German-English [20], WMT’14 English-German [26] and WMT’14 German-English [26] translation tasks. Experimental results show that it significantly improves translation quality while keeping its speed advantage. Specifically, the ARF-NAT model outperforms the NAT baseline by 5.26 BLEU points on the WMT’14 German-English task and speeds up inference significantly (7.58 times) over several strong AT baselines.
Related work
Most NMT models are implemented in the encoder-decoder framework. The encoder first encodes the source sentence into hidden states, and the decoder then performs inference autoregressively, generating the target sentence in a left-to-right (or right-to-left) manner. Several models use convolution [9, 25] or self-attention [2] to make training highly parallel and therefore faster. However, inference remains slow because of the autoregressive decoding, which is a great challenge for practical industrial applications.
To alleviate this problem, Gu et al. [11] first introduced non-autoregressive neural machine translation, which decodes the whole target sentence at once. But its performance is considerably worse than that of AT counterparts because it removes the constraint of serial generation. Many works have been proposed to mitigate this performance degradation. Some methods explore better training objectives than the negative log-likelihood loss. Li et al. [36] examined the differences between the NAT model and an autoregressive teacher and then used the distance between their hidden states as an additional loss function to help training. Wei et al. [3] proposed an imitation learning framework for non-autoregressive machine translation, which combines an imitation learning term with the commonly used cross-entropy loss as the overall training objective. Wang et al. [33] improved the quality of decoder hidden representations in the NAT model via two auxiliary regularization terms: similarity regularization and reconstruction regularization.
Several works remedy the issue by incorporating a partially autoregressive module to improve the expressiveness of the network. Sun et al. [37] used linear-chain Conditional Random Fields (CRF) to model richer structural dependencies and performed fast autoregressive decoding with beam approximation. Ran et al. [28] introduced a lightweight autoregressive reordering module that explicitly models reordering information to narrow the decoding search space and guide NAT decoding. Shao et al. [4] parallelized the bottom layers in a non-autoregressive way to accelerate the model but kept the top layer autoregressive to enhance translation quality.
Existing models are typically trained with the cross-entropy loss, which has achieved excellent performance in autoregressive translation models. Cross entropy is a strict loss function: a penalty is incurred for every word that is predicted out of position, even for output sequences with small edit distances. Autoregressive models learn to avoid such penalties because the word at the current position is conditioned on the previously generated words, whereas a non-autoregressive translation model cannot access this information. One approach [21] uses a new cross-entropy function that provides more accurate training signals for non-autoregressive translation models by ignoring absolute positions and focusing on relative order and vocabulary matching. In addition, n-gram-based training objectives [5] have been used to minimize n-gram differences between the model output and the reference translation, which encourages NAT to capture target-side sequential dependency and correlates well with translation quality.
Prior methods also address this problem by modeling with well-designed latent variables. Akoury et al. [24] treated syntactic information as latent variables: their model first autoregressively predicts a chunked parse tree and then generates all target tokens in one shot conditioned on the predicted parse. Ma et al. [31] used generative flow to model complex distributions with neural networks and designed several flow layers tailored to modeling the conditional density of sequential latent variables.
Other works [13, 22] proposed methods based on iterative refinement. This framework gives up on completely parallelizable generation and refines previously generated words in each iteration, which is equivalent to the autoregressive model at the sentence level. In this paper, we find that the NAT system can benefit from different levels of the target-side representation in the AT system. Therefore, we propose combining the merits of AT and NAT models for fast but accurate translation.
Problem definition
Given a sentence pair (
Then, the decoder of the AT model generates the target sentence
where θAT denotes the parameters of the AT model. During inference, AT systems generate the target sentence in a left-to-right (or right-to-left) manner. While this mechanism gives the AT model strong performance, it also causes high latency.
The non-autoregressive machine translation (NAT) model removes the constraint of serial generation and factorizes the joint probability over the target words into a product of conditionally independent distributions:
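For reference, the two factorizations can be written in their standard textbook form, as sketched below. The notation is an assumption and not a recovery of the original equations: X is the source sentence, Y = (y_1, ..., y_{T_y}) the target sentence, y_{<t} the target prefix before position t, θ_NAT the NAT parameters (by analogy with θAT), and the length model P_L is assumed to condition on X.

```latex
% Autoregressive (AT): each token conditions on all previously generated tokens.
P_{AT}(Y \mid X; \theta_{AT}) = \prod_{t=1}^{T_y} P(y_t \mid y_{<t}, X; \theta_{AT})

% Non-autoregressive (NAT): given the target length, tokens are conditionally independent.
P_{NAT}(Y \mid X; \theta_{NAT}) = P_L(T_y \mid X; \theta_{NAT}) \prod_{t=1}^{T_y} P(y_t \mid X; \theta_{NAT})
```

The only difference inside the product is the missing y_{<t} term, which is what allows all T_y positions to be predicted in parallel and also what weakens the target-side information available to the decoder.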
We believe that the lack of an accurate target representation is the main reason for the heavy performance degradation of the NAT model. In this paper, our objective is to introduce representations encoded in the AT model into the NAT model, combining the merits of both for fast but accurate translation.
In this section, we introduce our proposed method in detail. We make the source- and target-side representations of a lightweight AT system directly accessible to an NAT system. We call this approach non-autoregressive neural machine translation with auxiliary representation fusion (ARF-NAT). With an appropriate choice of k, the overhead introduced by the AT system is modest and the system runs as fast as the NAT system.
As shown in Fig. 1, we fuse the auxiliary representation provided by the shallow AT model to the NAT model. We first generate the first k target words in an autoregressive fashion,
The fusion functions to incorporate the AT representation into the NAT model.
By introducing the merits of a shallow AT model into an NAT model, ARF-NAT improves translation quality while keeping its speed advantage. The AT model can be viewed as a special case of ARF-NAT in which the auxiliary representation length k equals the target length T_y. The NAT model described in Gu et al. [11] can also be viewed as a special case of ARF-NAT with auxiliary representation length k = 0.
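The two special cases can be made concrete with a minimal control-flow sketch. It is an illustration under stated assumptions, not the authors' implementation: `at_next_token` and `nat_decode` are hypothetical callables standing in for the shallow AT decoder and the NAT decoder, and plain tokens stand in for the auxiliary representations that the real model fuses.

```python
from typing import Callable, List


def arf_nat_decode(
    at_next_token: Callable[[List[str]], str],          # hypothetical AT step
    nat_decode: Callable[[List[str], int], List[str]],  # hypothetical NAT pass
    target_len: int,
    k: int,
) -> List[str]:
    """Generate up to k tokens autoregressively, then decode all positions in parallel."""
    prefix: List[str] = []
    # Autoregressive phase: at most k sequential steps.
    for _ in range(min(k, target_len)):
        prefix.append(at_next_token(prefix))
    if k >= target_len:
        # Special case k >= T_y: the whole sentence is produced autoregressively.
        return prefix
    # Non-autoregressive phase: one parallel pass conditioned on the prefix.
    # With k = 0 the prefix is empty, which corresponds to a plain NAT model.
    return nat_decode(prefix, target_len)
```

The number of sequential steps is bounded by k regardless of sentence length, which is why a moderate k keeps most of the NAT speed advantage.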
Previous NAT models take tokens in the source language as decoder inputs, which makes the decoding task difficult because of the “weak” target-side representation. Considering that one can fuse multiple representations to make a stronger neural model [29, 35], a straightforward idea to improve translation quality is to introduce high-level representations encoded in AT models into NAT models. For the representation
Besides, we believe that fusing the hidden states of the top layers of the encoder and decoder can result in better performance. Compared with word embedding fusion, hidden-state fusion has two advantages. First, the hidden states of the AT model contain syntactic information and capture global context, which helps the decoding of the remaining tokens; with hidden states, the NAT system can benefit from different levels of the target-side representation in the AT system. Second, word embedding fusion requires the model to be trained with a vocabulary and embeddings shared between source and target, so that the fused embeddings lie in the same representation space. Word embedding fusion is therefore feasible for similar languages such as English-German or English-French, whereas hidden-state fusion can extend our method to distant language pairs such as Chinese-English or Japanese-English.
Fusion functions
For the decoder input, we fuse the auxiliary representation from the AT model and source-side representation to replace copied source embedding. In this paper, we develop three ways named
Before sending the fused representation to the decoder, we apply layer normalization [18]. Xu et al. [15] find, by detaching the derivatives of the mean and variance, that LayerNorm normalizes both forward layer inputs and backward gradients. During training, the AT model predicts all auxiliary representations in parallel, conditioned on the “ground truth” tokens. During inference (as shown in Fig. 2), the AT model uses greedy decoding to generate the auxiliary representation.
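As a rough illustration of fusing and then normalizing the decoder input, the sketch below combines the two representations with a weighted sum at the first k positions and applies layer normalization at every position. Several things here are assumptions: that a weighted sum is one of the fusion functions, that λ weights the auxiliary representation and μ the source-side one, and that positions beyond k fall back to the source-side representation. The function names are hypothetical, and the learned gain and bias of LayerNorm are omitted.

```python
import math
from typing import List

Vector = List[float]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize one vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def fuse_inputs(source_repr: List[Vector], aux_repr: List[Vector],
                lam: float = 0.4, mu: float = 0.6) -> List[Vector]:
    """Weighted-sum fusion at the first k = len(aux_repr) positions, then layer norm."""
    fused = []
    for t, src in enumerate(source_repr):
        if t < len(aux_repr):
            # Assumed form: lam * auxiliary + mu * source-side representation.
            vec = [lam * a + mu * s for a, s in zip(aux_repr[t], src)]
        else:
            # Beyond the auxiliary prefix only the source-side representation exists.
            vec = list(src)
        fused.append(layer_norm(vec))
    return fused
```

Normalizing after fusion keeps the fused and non-fused positions on the same scale before they enter the decoder.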

The framework of ARF-NAT model.
Different from the autoregressive model, the NAT model needs to estimate the target length to generate all words in parallel. We train a model to predict the length offset between the target and source sentences. The length predictor model P_L(T_y |
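A minimal sketch of turning an offset prediction into a target length follows. The classifier that produces `offset_logits` is not shown, and the offset range `max_offset = 20` and the function name are hypothetical choices made only for illustration.

```python
from typing import Sequence


def predict_target_length(source_len: int, offset_logits: Sequence[float],
                          max_offset: int = 20) -> int:
    """Pick the most likely offset class and add it to the source length."""
    # One class per offset in [-max_offset, +max_offset] (assumed range).
    assert len(offset_logits) == 2 * max_offset + 1
    best_class = max(range(len(offset_logits)), key=lambda i: offset_logits[i])
    offset = best_class - max_offset  # class 0 corresponds to -max_offset
    # A target sentence has at least one token.
    return max(1, source_len + offset)
```

Predicting an offset rather than an absolute length keeps the number of classes small, since target and source lengths are usually close.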
Experiments
Datasets
We conduct experiments on widely adopted benchmark datasets to evaluate the effectiveness of our proposed method: IWSLT’14 De-En, and WMT’14 En-De and WMT’14 De-En, which share the same dataset. For the IWSLT’14 De-En task, we hold out 7K sentence pairs from the training set as the validation set and use the concatenation of dev2010, test2010, test2011 and test2012 as the test set. For the WMT’14 En-De and WMT’14 De-En tasks, we use a much larger dataset that contains 4.5M training pairs, with newstest2013 and newstest2014 as the validation and test sets, respectively. All datasets are tokenized with Moses [27] and segmented into subword units using byte-pair encoding [30] to address the OOV problem. We share the source and target vocabulary and embeddings in each language pair.
Similar to previous work on non-autoregressive translation [6, 11], the target side of the training corpus is replaced by the output of an autoregressive Transformer model. Specifically, we pre-train an autoregressive Transformer network [2] and then run beam search over the whole training set with this model.
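The data replacement step described above can be sketched as follows. The `teacher_translate` callable is a hypothetical stand-in for beam search with the pre-trained autoregressive Transformer; only the corpus rewriting logic is shown.

```python
from typing import Callable, List, Tuple

Sentence = List[str]


def distill_corpus(
    pairs: List[Tuple[Sentence, Sentence]],
    teacher_translate: Callable[[Sentence], Sentence],  # hypothetical teacher decoder
) -> List[Tuple[Sentence, Sentence]]:
    """Keep each source sentence and replace its reference with the teacher's output."""
    return [(src, teacher_translate(src)) for src, _reference in pairs]
```

The NAT model is then trained on the returned pairs in place of the original ones.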
Model settings
To enable a fair comparison, we use the same network architectures as in NAT [11]. Specifically, for the WMT’14 datasets, we use the default hyperparameters of the base model described in Vaswani et al. [2]: the encoder and decoder both have 6 layers, the hidden states and embeddings have size 512, and the number of attention heads is 8. For the IWSLT’14 dataset, we use a smaller Transformer architecture consisting of a 5-layer encoder and a 5-layer decoder, with hidden states and embeddings of size 256 and 4 attention heads.
For the AT model, we use the same architecture as the NAT model but with only a 1-layer decoder to improve decoding speed. We pre-train it on the same training data and freeze its parameters during the whole process. For the NAT model, we fuse the auxiliary representation and source-side information as the decoder input. We also remove the causal attention mask in the decoder and introduce multi-head positional attention, as suggested in Gu et al. [11], to rearrange local word order within a sentence and strengthen the position information.
For translation quality evaluation, we use the standard automatic metric BLEU [17], which computes the geometric mean of the n-gram precisions of the system output with respect to the reference translations, multiplied by a brevity penalty that prevents very short candidates from receiving too high a score. Scores range between 0% (worst) and 100% (best).
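The definition above can be illustrated with a simplified sketch for one hypothesis and one reference. It is not the evaluation script used for the reported scores: real BLEU is accumulated over the whole corpus, and this version uses no smoothing, so any n-gram order with zero matches yields a score of 0.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis: List[str], reference: List[str], max_n: int = 4) -> float:
    """Geometric mean of clipped n-gram precisions times a brevity penalty (0 to 100)."""
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp, ref = ngrams(hypothesis, n), ngrams(reference, n)
        total = sum(hyp.values())
        # Clip each hypothesis n-gram count by its count in the reference.
        matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
        if total == 0 or matched == 0:
            return 0.0
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: penalize hypotheses shorter than the reference.
    if len(hypothesis) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1.0 - len(reference) / len(hypothesis))
    return 100.0 * bp * math.exp(sum(log_precisions) / max_n)
```

A hypothesis that is a perfect but truncated copy of the reference has precision 1 at every order, so its score is determined entirely by the brevity penalty.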
Model initialization
Since the AT model and the NAT model differ only slightly in architecture and share essentially the same configuration, we initialize the NAT model with the corresponding parameters of the AT model to speed up training convergence. These parameters include all encoder parameters and the target word embeddings. The remaining parameters are initialized from a normal distribution.
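The initialization scheme can be sketched with plain dictionaries of flat parameter lists. The parameter names (`encoder.*`, `decoder.embed_tokens`), the standard deviation of 0.02, and the function name are hypothetical and chosen only for illustration.

```python
import random
from typing import Dict, List


def init_nat_from_at(at_params: Dict[str, List[float]],
                     nat_param_sizes: Dict[str, int],
                     std: float = 0.02, seed: int = 0) -> Dict[str, List[float]]:
    """Copy encoder and target-embedding parameters from the AT model; sample the rest."""
    rng = random.Random(seed)
    nat_params: Dict[str, List[float]] = {}
    for name, size in nat_param_sizes.items():
        # Assumed naming: encoder parameters and the target word embeddings are shared.
        shared = name.startswith("encoder.") or name == "decoder.embed_tokens"
        if shared and name in at_params:
            nat_params[name] = list(at_params[name])
        else:
            # Remaining parameters are drawn from a normal distribution.
            nat_params[name] = [rng.gauss(0.0, std) for _ in range(size)]
    return nat_params
```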
Model knowledge distillation
Knowledge distillation [8] aims to transfer the knowledge of a large teacher network
Our final training loss L is a weighted sum of the hidden loss and the negative log-likelihood loss
As the source and target sentences often differ in length, we need to map the source-side representation
Training and inference
As in Vaswani et al. [2], we train ARF-NAT by minimizing the cross-entropy loss. We use the Adam optimizer [7] with β1 = 0.9 and β2 = 0.98. We train the NAT model on 8 NVIDIA TITAN V GPUs for the WMT datasets and on 1 for the IWSLT dataset. We set λ = 0.4 and μ = 0.6 in Eq. (6) for all tasks to control the contributions of the different representation terms. We implement ARF-NAT with fairseq [23]. Following common practice in previous work, we evaluate with tokenized case-sensitive BLEU for the WMT’14 datasets and case-insensitive BLEU for the IWSLT dataset. During inference, we also remove consecutive repeated symbols, following Lee et al. [14]. To speed up decoding, we share the encoder between the AT and NAT models. The inference latency is computed as the average per-sentence decoding time (ms) on the WMT’14 En-De test set, measured on a single NVIDIA TITAN X Pascal GPU. The reported latencies are averages over five runs.
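The post-processing step that removes consecutive repeated symbols can be written in a few lines. This is a minimal sketch of the idea using the standard library; the function name is hypothetical.

```python
from itertools import groupby
from typing import List


def remove_consecutive_repeats(tokens: List[str]) -> List[str]:
    """Collapse each run of identical adjacent tokens into a single token."""
    return [token for token, _run in groupby(tokens)]
```

Only adjacent duplicates are collapsed, so a word that legitimately appears twice in different places of the sentence is kept.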
Main results
We compare ARF-NAT with strong NAT baselines, including the NAT with fertility [11]; the NAT with iterative refinement [14], which trains extra decoders to iteratively refine the translation output; the NAT with imitation learning [3], which forces the NAT model to imitate an AT model during training; and the NAT with enhanced decoder input [12], which uses a phrase table or a mapping vector to transform source-side word embeddings into target-side word embeddings. For the last baseline, we list the “embedding mapping” results reported in their paper.
The results are shown in Table 1. Latency is computed as the time to decode a single sentence without mini-batching, averaged over the whole test set. “‡” and “†” indicate that the latency is measured on our own platform or by previous works, respectively. “/” indicates that the corresponding result is not reported. k denotes the auxiliary representation length in Eq. (4).
BLEU scores on official test set
We compare the proposed ARF-NAT against the autoregressive counterpart both in terms of translation quality and inference latency. We can see that the
Word embedding vs. hidden states
We have designed two different fusion approaches to provide the auxiliary representation and compare them in this subsection. According to Table 2, using word embeddings or hidden states as the source-side representation achieves similar performance, which shows that merely replacing word embeddings with the hidden states of the NAT encoder does not improve performance.
Comparison between word embedding and hidden states on the IWSLT test set
When applying our auxiliary representation fusion approach, we achieve an improvement of 2 to 3 BLEU. It is worth noting that fusing source-side hidden states with auxiliary hidden states achieves a 1.2 BLEU improvement over word embedding fusion. This confirms our initial hypothesis that, with the help of hidden states, the NAT system can benefit from different levels of the target-side representation in the AT system.
Figure 3 shows the speed-performance trade-off for different auxiliary representation lengths on the IWSLT’14 De-En test set. Each circle represents an ARF-NAT decoding run with a different auxiliary representation length. The auxiliary representation length k can take an arbitrary value; in our experiments, we vary k from 5 to 30 to test its influence on translation quality and speed. If the target sentence is shorter than k, we generate the whole sentence in an autoregressive fashion. As the auxiliary representation length increases, translation quality improves steadily, while the model still maintains a high decoding speed. ARF-NAT is thus versatile: we can translate over 9.7 times faster than the Transformer baseline at the cost of a 22% performance degradation (k = 5), or retain a high quality of 29.50 BLEU while gaining a 424% speed-up (k = 30). We set k to 20 in our experiments to balance translation quality and decoding speed. To compare with previous work, we decode a single sentence at a time without mini-batching here. Because our method runs both the AT model and the NAT model, single-sentence decoding hurts our inference speed considerably. Batch decoding can further improve the decoding speed of our method and is also closer to actual industrial applications.

The trade-off between speed-up and translation quality of the ARF-NAT model.
We compare the translation quality of an AT model with a 5-layer decoder [2], NAT [11], and our method with respect to sentence length on the IWSLT’14 De-En test set. We divide the sentence pairs into length buckets according to the length of the reference sentence. The results are shown in Fig. 4. As sentence length increases, the accuracy of the NAT model drops quickly. Our method achieves a significant improvement over the NAT model and performance comparable to the AT model for all lengths. This verifies that ARF-NAT provides strong sequence information to the NAT decoder, resulting in more accurate sentences.

The BLEU scores comparison between AT, NAT, and the ARF-NAT model.
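The length-bucket analysis can be sketched as below. The bucket width of 10 tokens and the function name are assumptions made for illustration; the sketch only groups sentence indices by reference length, after which a metric such as BLEU would be computed per bucket.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def bucket_by_length(references: List[List[str]],
                     bucket_size: int = 10) -> Dict[Tuple[int, int], List[int]]:
    """Map each (low, high) length range to the indices of references in that range."""
    buckets: Dict[Tuple[int, int], List[int]] = defaultdict(list)
    for index, reference in enumerate(references):
        low = (len(reference) // bucket_size) * bucket_size
        buckets[(low, low + bucket_size - 1)].append(index)
    return dict(buckets)
```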
To further verify that the choice of k is reasonable, we analyzed the sentence length distribution of the three datasets. As shown in Fig. 5, although the average sentence length of the IWSLT’14 De-En dataset is smaller than that of the WMT’14 De-En and WMT’14 En-De datasets, sentence lengths in all three datasets mainly fall between 10 and 30. With k set to 20, the lengths are balanced around k. For example, on the IWSLT’14 De-En dataset, 50.1% of sentences are shorter than 20, while 49.9% are longer than or equal to 20. We therefore set k to 20 so that both the AT part and the NAT part of our model can be adequately trained.

Sentence length distribution of three datasets.
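The balance check around k amounts to a simple count over sentence lengths, sketched below with a hypothetical helper. It returns the fraction of sentences shorter than k and the fraction of sentences with length at least k.

```python
from typing import List, Tuple


def split_fractions(lengths: List[int], k: int = 20) -> Tuple[float, float]:
    """Return (fraction shorter than k, fraction with length >= k)."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    shorter = sum(1 for n in lengths if n < k)
    fraction_shorter = shorter / len(lengths)
    return fraction_shorter, 1.0 - fraction_shorter
```

A value of k for which the two fractions are close to each other gives both the autoregressive and the non-autoregressive parts a comparable share of the training signal.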
In Table 3, we present two sample translations from the IWSLT’14 German-English test set, including the source sentence, the target reference (i.e., the ground truth translation), and the translations given by the AT model (1-layer decoder), by the basic NAT with sequence distillation (NAT), and by our NAT with auxiliary representation fusion (ARF-NAT). As can be seen, the NAT model suffers severely from incomplete translation (e.g., “immediately spontaneanswer you immediately” in the first sample) and semantically incoherent translation (e.g., “let’s start perceive that.” in the second sample). With auxiliary representation fusion, both issues are largely alleviated. Furthermore, our method can also fix errors of the lightweight AT model, which suggests that ARF-NAT can use the provided representation to generate a better sentence.
Translation examples on IWSLT’14 De-En task
Non-autoregressive models have achieved impressive inference speedups but suffer from a significant drop in translation quality because of the “weak” target-side representation. In this paper, we propose ARF-NAT (NAT with auxiliary representation fusion), which introduces the merits of a shallow AT model into an NAT model. We design three functions to fuse the auxiliary representation into the decoder of the NAT model. Experimental results on multiple datasets demonstrate that our method achieves better performance than several strong NAT baselines while being an order of magnitude faster in inference than the AT model. In the future, we plan to apply other existing techniques to ARF-NAT to further bridge the gap between non-autoregressive and autoregressive sequence models. Specifically, we believe that the current cross-entropy loss function is not well suited to non-autoregressive translation models. We leave the use of objective functions beyond cross entropy, aimed at better modeling long-distance dependencies, for future work.
