Modeling hypotactic structure for Chinese-English neural machine translation of complex sentences

Abstract

The hypotactic structural relation between clauses plays an important role in improving the discourse coherence of document-level translation. However, standard neural machine translation (NMT) models do not explicitly model the hypotactic relationship between clauses, which usually leads to structurally incorrect translations of long and complex sentences. This problem is particularly noticeable in the Chinese-to-English translation of complex sentences because the two languages differ in grammatical form. English is rich in grammatical form (e.g. verb morphological changes and subordinating conjunctions), while Chinese is poor in grammatical form. These linguistic phenomena make it challenging for NMT to learn hypotactic structure knowledge from Chinese as well as the structure alignment between Chinese and English. To address these issues, we propose to model the hypotactic structure for Chinese-to-English complex sentence translation by introducing hypotactic structure knowledge. Specifically, we annotate and build a hypotactic structure aligned parallel corpus that provides rich hypotactic structure knowledge for NMT. Moreover, we propose a structure-infused neural framework that combines the hypotactic structure knowledge with the NMT model through two integrating strategies. In particular, we introduce a specific structure-aware loss to encourage the NMT model to better learn the structure knowledge. Experimental results on WMT17, WMT18 and WMT19 Chinese-to-English translation tasks demonstrate the effectiveness of the proposed methods.

Keywords

Neural machine translation; hypotactic structure; discourse coherence; structure-infused neural framework

1 Introduction

The hypotactic structural relation defines an interdependency relation between clauses and largely constitutes the logic of natural language [1]. It plays an important role in improving the discourse coherence of machine translation [2]. In recent years, neural machine translation (NMT) systems, which adopt an encoder-decoder framework to model source and target sentences in a sequence-to-sequence (Seq2Seq) learning fashion, have achieved substantial progress [3–8]. However, NMT systems often ignore the hypotactic structural relation between clauses when translating a complex sentence. As a result, NMT can often generate correct translations of isolated clauses, but when these translations are put together in a complex sentence, they end up incoherent and independent of each other.

The above problem is particularly salient in Chinese-to-English complex sentence translation, owing to the distinct grammatical forms of the two languages. English is rich in grammatical form (e.g. verb morphological changes and subordinating conjunctions such as “since”, “when”, and “although”), which serves as a crucial clue for an NMT model to learn the hypotactic relation between clauses. Chinese, on the other hand, is poor in grammatical form and lacks explicit logical connectives between clauses [9], which poses a major challenge for learning the hypotactic relation in an end-to-end way. In Fig. 1, we compare the reference provided by a human translator with the output of the Google NMT system. In the reference, the first, third and fourth clauses are subordinate clauses, and the second and fifth clauses are main clauses. These clauses make up the hypotactic structural relationship (interdependency) indicated by the subordinating conjunction “although”. However, the NMT system can hardly capture this relation. In the machine translation output, all five clauses are wrongly rendered in a parallel structural relationship (independency), as revealed by the conjunctions “and” and “but”. This indicates that the NMT system still struggles to learn the implicit hypotactic structure knowledge from Chinese sentences as well as the structure alignment between Chinese and English.

Fig. 1

An example of Chinese-to-English complex sentence translation where a Chinese complex sentence is translated as multiple English clauses. The numbers express the order of clauses and “/” indicates the boundaries of clauses. In both of the translation versions, the sentences in italics are subordinate clauses and the rest are main clauses. This example shows that NMT is still weak in dealing with the hypotactic structural relation of complex sentence from Chinese to English.

To address the above issues, (1) we identify the hypotactic structure and annotate a hypotactic structure aligned parallel corpus in Chinese-to-English translation, to provide rich hypotactic structure knowledge for NMT. Our motivation is that the explicit alignment knowledge of hypotactic structure is beneficial for modeling the main and subordinate clauses, and therefore to improve the translation of complex sentences. Specifically, we annotate all the clauses involved in hypotactic relationship with “main” or “subordinate” clause labels in Chinese-English parallel complex sentence pairs. Each labeled complex sentence consisting of multiple clauses is treated as a document and provides larger context information as well as hypotactic structure information for the NMT model. By doing so, our corpus provides fine-grained alignment at clause level. Compared with general parallel corpus for NMT, our corpus can encourage the NMT model to better capture clause relations as well as the alignment across two languages. (2) Moreover, to combine the hypotactic structure knowledge with the NMT model, we propose a structure-infused neural framework using two integrating strategies. One is using a gated fusion mechanism for integrating the label embedding of hypotactic structure; the other is using an explicit label-infused mechanism for integrating the labeled connectives of hypotactic structure. In particular, we introduce a specific structure-aware loss to encourage NMT to better exploit the integrated structure knowledge during training.

Experimental results on WMT17, WMT18 and WMT19 Chinese-to-English translation tasks show that integrating bilingual alignment knowledge of the hypotactic structure is beneficial for improving translation performance. Our best model achieves an average improvement over the baseline Transformer by 1.18 BLEU points on three tasks. In particular, extensive analyses illustrate the proposed method can significantly enhance the translation of main and subordinate clauses, and therefore improve discourse coherence. Meanwhile, the proposed method can also enhance the translation adequacy via fine-grained clause alignment learning.

The key contributions of this work are summarized as follows:

We point out the translation problem of hypotactic structure on the Chinese-to-English translation task of complex sentences, and propose to address this problem by annotating hypotactic structure knowledge and integrating the structure knowledge into the NMT model.

We create a hypotactic structure aligned Chinese-English parallel corpus which provides rich hypotactic structure knowledge and fine-grained alignment at clause level for NMT.

We propose two strategies to integrate the structure knowledge into NMT and introduce a specific structure-aware loss to encourage the NMT model to better learn the integrated structure knowledge.

Extensive experiments and analyses demonstrate that the proposed approach can significantly improve the translation quality of complex sentences in discourse, and therefore enhance the coherence of translations.

2 Related work

In this work, our research focuses on the hypotactic structure problem on the Chinese-to-English translation task of complex sentences, and it is related to cross-sentence context, discourse coherence and the training data in NMT. We discuss these topics in the following sections.

2.1 Context-aware NMT

Traditional NMT models [4, 7] focus on improving the translation quality of individual sentences, and such methods often show degraded performance when translating multiple sentences. To overcome the weaknesses of sentence-level NMT, researchers have made effective efforts in context-aware NMT. These works mainly build an additional RNN-based encoder [10–14] to capture larger context. In addition, some works model the cross-sentence context with a hierarchical attention mechanism [15, 16]. In summary, these NMT systems are designed to accept extra-sentential context as input. However, one limitation of these methods is that they assume all discourse structure knowledge can be learned automatically by the NMT model from larger context, ignoring the influence of linguistic phenomena (such as the implicit hypotactic relationship in Chinese) on the NMT model. Similar to these methods, we aim to capture more discourse structure information from larger context, e.g. complex sentences that contain multiple clauses. Different from these methods, we focus on the semantic relevance and structure dependency between clauses rather than only on larger context for translation. We annotate hypotactic structure knowledge and propose a structure-infused neural model to integrate the structure knowledge into the NMT model, which overcomes the shortcoming of existing parallel data caused by the lack of structure knowledge in the Chinese-English translation process.

2.2 Coherence-aware NMT

In recent years, modeling the discourse coherence of translations has attracted more attention. To improve the discourse coherence of translations, Born et al. [17] integrated the graph-based coherence representation into a statistical machine translation system. Kuang et al. [18] introduced a cache-based approach to model discourse coherence for NMT by capturing cross-sentence information either from recently translated sentences or the entire document. Xiong et al. [19] proposed a two-pass decoder translation model to improve the coherence of translations. Similar to these methods, we take the semantic relevance and links among sentences into account. The difference is that we extract the hypotactic structure knowledge from complex sentences and use the structure knowledge to enhance discourse coherence of translations.

2.3 Training data for NMT

Current NMT models, both sentence-level models [4, 7] and document-level models [10–14, 19], are all trained on parallel corpora that consist of sentence pairs. Such corpora are aligned at the sentence level or the document level. In contrast, the corpus that we created is aligned at the fine-grained clause level, and the hypotactic structure is marked in the corpus. Therefore, our corpus can provide more structure information and help the NMT model better align the hypotactic structures of English and Chinese.

3 Background

NMT uses an encoder-decoder framework with attention mechanism to optimize the translation probability of a target sentence y = (y₁, ⋯ , y_N) given its corresponding source sentence x = (x₁, ⋯ , x_M): $P (y | x; θ) = \prod_{n = 1}^{N} P (y_{n} | y_{< n}, x; θ),$ (1)

where θ denotes a set of tunable model parameters and y_<n expresses a partial translation. Some neural structures are widely used in NMT, such as Recurrent Neural Network (RNN) [20] and self-attention network (Transformer) [7]. In this work, we perform experiments on both RNN-based NMT and Transformer-based NMT.

RNN-based NMT. This architecture adopts an RNN structure. First, an encoder summarizes the source sentence x into a sequence of continuous representations H = ⟨h₁, ⋯ , h_M⟩, and then a decoder generates the t-th target word based on the following probability: $P (y_{t} | y_{< t}, x; θ) = softmax (g (y_{t - 1}, s_{t}, c_{t})),$ (2) where g(·) represents a non-linear activation function; y_{t-1} is the target word at time-step t-1; the current decoder hidden state s_t is calculated as follows: $s_{t} = RNN (y_{t - 1}, s_{t - 1}, c_{t}),$ (3) where RNN(·) indicates a transforming function with a specific network architecture, exploiting either Long Short Term Memory (LSTM) [21] or Gated Recurrent Units (GRU) [22]; c_t is the context vector of the source side obtained by the attention mechanism: $\begin{matrix} c_{t} = \sum_{i = 1}^{M} α_{i, t} h_{i}, \\ α_{i, t} = softmax (m (s_{t - 1}, h_{i})), \end{matrix}$ (4) where m(·) is a compatibility function that scores the previous decoder state s_{t-1} against the i-th source representation h_i.

Transformer-based NMT. The Transformer-based NMT model [7] follows an encoder-decoder framework and adopts stacked multi-head self-attention and residual connections [23] to model sequential information. It is a recent major breakthrough in NMT research.

Transformer adds positional encodings to the source and target embeddings, allowing the NMT model to capture the sequentiality of a sentence without using a recurrent structure. The positional encodings are calculated based on the position (pos) and the dimension (i) as follows: $\begin{matrix} PE (pos, 2 i) = sin (pos / 10000^{(2 i / d_{\text{model}})}), \\ PE (pos, 2 i + 1) = cos (pos / 10000^{(2 i / d_{\text{model}})}) . \end{matrix}$ (5)
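As an illustration of Equation (5), the following minimal sketch builds the table of sinusoidal positional encodings in plain Python. The function name `positional_encoding` and the toy sizes are assumptions made for this example and are not taken from the original implementation.

```python
import math


def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal encodings (Equation (5))."""
    table = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for k in range(d_model):
            i = k // 2  # index i shared by the sine/cosine pair (2i, 2i + 1)
            angle = pos / (10000 ** (2 * i / d_model))
            # Even dimensions use the sine, odd dimensions use the cosine.
            table[pos][k] = math.sin(angle) if k % 2 == 0 else math.cos(angle)
    return table


# Toy sizes chosen only for illustration.
pe = positional_encoding(max_len=4, d_model=8)
```

At position 0 every sine dimension is 0 and every cosine dimension is 1, which gives a quick sanity check of the table.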

In the self-attention sub-layer, Transformer generates a context vector of the current word by considering other words in the same sequence. Given an input sequence x = (x₁, ⋯ , x_M), where $x_{i} ∈ R^{d_{x}}$, each attention head calculates a new context sequence z = (z₁, ⋯ , z_M) of the same length, where $z_{i} ∈ R^{d_{z}}$ is computed as follows: $z_{i} = \sum_{j = 1}^{M} α_{i, j} (x_{j} W^{V}),$ (6) where α_{i,j} is a weight coefficient, calculated by a softmax function: $α_{i, j} = \frac{exp (e_{i, j})}{\sum_{l = 1}^{M} exp (e_{i, l})},$ (7) where e_{i,j} denotes the similarity between the two input elements x_i and x_j, computed by a compatibility function: $e_{i, j} = \frac{(x_{i} W^{Q}) (x_{j} W^{K})^{T}}{\sqrt{d_{z}}},$ (8) where $W^{Q}, W^{K}, W^{V} ∈ R^{d_{x} × d_{z}}$ are parameter matrices and the scaled dot product is used as the compatibility function.
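Equations (6)–(8) can be traced with a small single-head sketch in plain Python. The helper names (`matmul`, `softmax`, `attention_head`) and the toy identity projection matrices are assumptions for illustration only; a real model would use learned matrices and a tensor library.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(scores):
    """Numerically stable softmax over a list of scores (Equation (7))."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(x, w_q, w_k, w_v):
    """Single attention head: returns the context sequence z and the weights alpha."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_z = len(q[0])
    # Equation (8): scaled dot-product compatibility scores e_{i,j}.
    e = [[sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d_z) for k_j in k]
         for q_i in q]
    alpha = [softmax(row) for row in e]  # Equation (7)
    z = matmul(alpha, v)                 # Equation (6)
    return z, alpha


# Toy input with M = 3 tokens and d_x = d_z = 2; identity projections are assumed.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
z, alpha = attention_head(x, identity, identity, identity)
```

Each row of `alpha` sums to 1, so every z_i is a convex combination of the projected inputs.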

4 Annotating the hypotactic structure aligned parallel corpus

To address the discourse phenomenon of the main and subordinate clauses translations, we construct a bilingual parallel corpus with Chinese-to-English complex sentence pairs aligned in the hypotactic structure. The construction process mainly includes: (1) the recognition of Chinese and English clauses, (2) the alignment of Chinese and English clauses, (3) the recognition and annotation of English main and subordinate clauses, and (4) the annotation of Chinese main and subordinate clauses.

In this work, we define a tagged main clause set as M = (<M-start> clause <M-end>), where “<M-start>” and “<M-end>” are the hypotactic structure labels of a “main” clause, and represent the start and the end of a main clause respectively. Similarly, we define a tagged subordinate clause set as S = (<S-start> clause <S-end>), where “<S-start>” and “<S-end>” are the hypotactic structure labels of a “subordinate” clause, and denote the start and the end of a subordinate clause respectively. In addition, we give each clause a serial number label such as <1>, <2>, and <3>, and call these clause alignment labels; they align each source clause with the corresponding target clause. The tagged main and subordinate clauses make up the complex sentences in the training corpus.
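The labeling scheme just described can be sketched as a small formatting routine. The function name `tag_sentence`, the input format (a list of clause text and clause type pairs) and the exact spacing between labels are assumptions for illustration; the corpus itself may serialize the labels slightly differently.

```python
def tag_sentence(clauses):
    """Wrap each clause in structure labels and a clause alignment label.

    clauses: list of (text, kind) pairs, where kind is "main" or "subordinate".
    """
    parts = []
    for number, (text, kind) in enumerate(clauses, start=1):
        if kind not in ("main", "subordinate"):
            raise ValueError(f"unknown clause type: {kind}")
        prefix = "M" if kind == "main" else "S"
        # <X-start> <n> clause <X-end>, with n the serial number of the clause.
        parts.append(f"<{prefix}-start> <{number}> {text} <{prefix}-end>")
    return " ".join(parts)


tagged = tag_sentence([
    ("Because it rained,", "subordinate"),
    ("we stayed home.", "main"),
])
```

Applying the same routine to both sides of a sentence pair, with clauses listed in aligned order, would give matching serial numbers on the source and target sides.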

4.1 Annotation principles for hypotactic structure alignment

(1) Alignment of Chinese and English clauses. Given a document-aligned Chinese-English parallel corpus, we first divide the Chinese and English discourse-level units (e.g. paragraphs) into clauses. Then, we align these Chinese clauses and English clauses. As shown in Fig. 2, the bilingual alignment results of Chinese-English clauses are as follows:

Fig. 2

An example of alignment between Chinese and English clauses. The Chinese and English paragraphs have been divided into clauses marked with sequence numbers. Among them the clauses marked with * are subordinate clauses, and those unmarked are main clauses.

Z1-E1, Z2-E2, Z3-E3, Z4-E5, Z5-E6, Z6-E4, Z7-E7.

(2) Alignment of Chinese and English main and subordinate clauses. English clauses have lexical-level formal signs (e.g. conjunctions and verb morphological changes) to represent their categories of the “main” or “subordinate” structure, while Chinese clauses typically lack formal signs to reveal their structure categories. Therefore, we adopt an English priority annotation strategy. We first tag the English clauses with “main” or “subordinate” structure category in terms of the English grammatical form information such as subordinating conjunctions “since”, “when” and “although”. Then, we use the same categories of English clauses that have been tagged to annotate the corresponding aligned Chinese clauses. As shown in Fig. 2, the alignment results of Chinese-English main and subordinate clauses are as follows:

*Z1-*E1, Z2-E2, *Z3-*E3, *Z4-*E5, *Z5-*E6, Z6-E4, *Z7-*E7.
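The English priority strategy amounts to projecting clause labels through the clause alignment. The sketch below reproduces the Fig. 2 example; the function name `project_labels` and the dictionary-based data layout are assumptions for illustration.

```python
def project_labels(alignment, english_labels):
    """Copy each English clause label to its aligned Chinese clause.

    alignment: list of (zh_index, en_index) pairs.
    english_labels: dict mapping en_index to "main" or "subordinate".
    """
    return {zh: english_labels[en] for zh, en in alignment}


# Clause alignment from Fig. 2: Z1-E1, Z2-E2, Z3-E3, Z4-E5, Z5-E6, Z6-E4, Z7-E7.
alignment = [(1, 1), (2, 2), (3, 3), (4, 5), (5, 6), (6, 4), (7, 7)]
# English labels from Fig. 2: E2 and E4 are main clauses, the rest are subordinate.
english_labels = {
    1: "subordinate", 2: "main", 3: "subordinate", 4: "main",
    5: "subordinate", 6: "subordinate", 7: "subordinate",
}
chinese_labels = project_labels(alignment, english_labels)
```

The result marks Z2 and Z6 as main clauses and the other Chinese clauses as subordinate, matching the alignment listed above.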

4.2 Automatic construction of hypotactic structure aligned corpus

We design an automatic construction method for our Chinese-English parallel corpus aligned in the hypotactic structure based on the above annotation principles (see Section 4.1). The construction process is illustrated in Fig. 3.

Fig. 3

The construction process of the Chinese-English hypotactic structure aligned parallel corpus.

Data source. Considering the scarcity and cost of current document-level parallel corpora, we use the WMT2019 United Nations Parallel Corpus v1.0 as our data source, from which 1M complex sentence pairs are selected. Each complex sentence is treated as a document, and we label it with bilingual alignment knowledge of the hypotactic structure. This method allows us to easily obtain a large amount of parallel data that is close, in terms of context, to document-level parallel data.

Clause recognition. Inspired by the work of Li et al. [24], we adopt a Bi-LSTM-CRF clause recognition model. The task of clause recognition, also called elementary discourse unit (EDU) recognition, assigns a label to every word in a sentence. Sentences are represented in the Y (Yes) or N (No) format, where each word is labeled Y if it is at the end of a clause, and N if it is inside the clause but not at its end. In this work, our clause recognition model combines CRF [25] and Bi-LSTM [26]. We first use Stanford CoreNLP [27] to obtain the part of speech (POS) and syntactic feature of each word. Then, following the work of Li et al. [24], the vector representations of each word, its POS and its syntactic feature are added and fed into skip-n-gram, a variation of word2vec [28], to train a fused word embedding. After that, the word embedding is input to the Bi-LSTM layer to learn contextual features. Finally, the CRF uses the output of the Bi-LSTM to predict the optimal binary label of each word. Chinese clause recognition achieves P 92.0, R 93.6, and F1 92.8, and English clause recognition achieves P 94.6, R 93.0, and F1 93.8.
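The Y/N labeling scheme for clause boundaries can be illustrated with a short sketch that converts between clause segmentations and per-word labels. It covers only the label format, not the Bi-LSTM-CRF model itself; the function names and the toy sentence are assumptions for illustration.

```python
def boundary_labels(clauses):
    """Flatten tokenized clauses into tokens with Y/N clause-end labels."""
    tokens, labels = [], []
    for clause in clauses:
        for position, token in enumerate(clause):
            tokens.append(token)
            # Y marks the last word of a clause; every other word gets N.
            labels.append("Y" if position == len(clause) - 1 else "N")
    return tokens, labels


def split_by_labels(tokens, labels):
    """Recover the clauses from a token sequence and its Y/N labels."""
    clauses, current = [], []
    for token, label in zip(tokens, labels):
        current.append(token)
        if label == "Y":
            clauses.append(current)
            current = []
    if current:  # trailing words without a closing Y form a final clause
        clauses.append(current)
    return clauses


clauses = [["although", "it", "rained"], ["we", "went", "out"]]
tokens, labels = boundary_labels(clauses)
```

Here `labels` is `["N", "N", "Y", "N", "N", "Y"]`, and `split_by_labels(tokens, labels)` returns the original two clauses.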

Clause alignment. Following the work of Ding et al. [29], we use a neural sentence alignment method based on word translation to align the Chinese clauses and the English clauses. The method is motivated by the observation that aligned sentence pairs contain more aligned words than unaligned ones, and it improves sentence alignment performance by incorporating word translation as external knowledge. Specifically, we learn a bilingual dictionary with Giza++ [30] to obtain word translations from a parallel corpus. Then the sentence pair and its corresponding word translation sequences are combined by concatenating the embedding of each word with that of its word translation. After that, they are fed into an encoder composed of two bi-directional RNNs [31], one for the source sequence and the other for the target sequence, to obtain the hidden representation of each word. We obtain the semantic relevance matrix of a sentence pair by computing the cosine similarity between the word pairs of the two sentences. Finally, we max-pool the semantic relevance matrix into a vector and use a multilayer perceptron to predict whether the sentence pair is aligned. Following the work of Ding et al. [29], we use the same data for model training.
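The relevance-matrix step of the aligner can be sketched as follows. This is a simplified illustration only: the hidden states are toy vectors rather than bi-directional RNN outputs, pooling is taken over each source word's row (the pooling axis is an assumption, since the text does not specify it), and the multilayer perceptron is omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def relevance_matrix(src_states, tgt_states):
    """Cosine similarity between every source word and every target word."""
    return [[cosine(s, t) for t in tgt_states] for s in src_states]


def max_pool_rows(matrix):
    """Keep, for each source word, its best-matching target word score."""
    return [max(row) for row in matrix]


# Toy hidden states standing in for the encoder outputs.
src_states = [[1.0, 0.0], [0.0, 1.0]]
tgt_states = [[1.0, 0.0], [1.0, 1.0]]
pooled = max_pool_rows(relevance_matrix(src_states, tgt_states))
```

The pooled vector would then be the input to the classifier that decides whether the two clauses are aligned.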

English-to-Chinese annotation of main and subordinate clauses. Since English clauses have explicit formal signs (e.g. subordinating conjunctions) to represent the “main” or “subordinate” structure categories of the clauses, we adopt an English priority strategy to first identify the English main and subordinate clauses in corpus with a rule-based method in terms of the English formal signs. We collect a set of subordinating conjunctions as English formal signs, and these formal signs are used to identify the main clauses or the subordinate clauses in a rule-based method. Then we annotate each clause with hypotactic structure label. Note that the clauses with paratactic relation are also tagged as “main” clauses or “subordinate” clauses. Specifically, if one of the paratactic clauses is identified as a main clause or subordinate clause, then they are all tagged as “main” clauses or “subordinate” clauses. Finally, we exploit the same labels of English clauses to annotate the corresponding aligned Chinese clauses.
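A heavily simplified version of the rule-based identification step might look like the sketch below. The conjunction list is an assumption: only “since”, “when” and “although” are named in the text, and the remaining entries are added for illustration. Checking only the clause-initial word is likewise an assumption, and the propagation of labels across paratactic clauses described above is not modeled.

```python
# "since", "when" and "although" come from the text; "because" and "if"
# are assumed additions for this illustration.
SUBORDINATORS = ("since", "when", "although", "because", "if")


def classify_clause(clause):
    """Label an English clause by its clause-initial formal sign."""
    words = clause.lower().split()
    first = words[0].strip(",.;:\"'") if words else ""
    return "subordinate" if first in SUBORDINATORS else "main"


examples = [
    "Although it was late,",
    "we continued working.",
]
predicted = [classify_clause(c) for c in examples]
```

A production rule set would need a much larger inventory of formal signs and would have to handle conjunctions that are ambiguous between subordinating and other uses.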

Statistically, the annotated corpus contains 1 M sentence pairs, 1.75 M clause pairs, 0.5 M main clause pairs and 1.25 M subordinate clause pairs, and the English-Chinese clause alignment results achieve the accuracy at P 88.3, R 89.4, and F1 88.8.
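As a consistency check on the reported scores, F1 is the harmonic mean of precision and recall. The short sketch below recomputes it from the P and R values given in this section; the function name `f1_score` is an assumption for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (P, R) pairs reported in this section, in percentage points.
reported = {
    "clause alignment": (88.3, 89.4),
    "Chinese clause recognition": (92.0, 93.6),
    "English clause recognition": (94.6, 93.0),
}
recomputed = {name: round(f1_score(p, r), 1) for name, (p, r) in reported.items()}
```

Rounded to one decimal place, the recomputed values agree with the reported F1 scores of 88.8, 92.8 and 93.8.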

The final annotation results are shown in the following example of sentence pair:

EN: <S-start> <1> Because the development and opening up of Pudong is a cross-century project to revitalize Shanghai and build a modern economy, trade and financial center, <S-end> <M-start> <2> there are a lot of new situations and new problems that have never been encountered before. <M-end>

These hypotactic structure labels annotated in our corpus can guide the NMT model to learn rich hypotactic structure knowledge. Moreover, our corpus provides fine-grained alignment at clause level with clause alignment labels, which can encourage the NMT model to capture key context information through clause alignment learning.

In the following sections, we will explore how to integrate the alignment knowledge of the hypotactic structure into NMT and how to make the model learn such knowledge effectively.

5 Integrating hypotactic structure knowledge into NMT

In this section, we explore how to integrate the annotated hypotactic structure knowledge into the NMT model. To better combine the structure knowledge with the NMT model, we propose a structure-infused neural framework. The overall architecture is shown in Fig. 4(a). Specifically, we design different structure information fusion layers for the encoder and the decoder on top of the Transformer-based NMT model. Further, we introduce two strategies to incorporate the annotated structure labels into the NMT encoder and decoder respectively. In particular, a specific structure-aware loss is added to the original loss to guide the NMT model to better learn such structure knowledge during training.

Fig. 4

The proposed structure-infused NMT framework. (a) The overall model architecture integrating Chinese-English hypotactic structure alignment knowledge into NMT; (b) the proposed structure knowledge integration strategy using gated label embedding of the hypotactic structure; and (c) the proposed structure knowledge integration strategy using explicit label-infused mechanism of the hypotactic structure.

Formally, we define the source and target sentences as x=(x₁, ·· · , x_M) and y=(y₁, ·· · , y_N) respectively.

5.1 Gated fusion mechanism for integrating label embedding of hypotactic structure

In this integration strategy, we incorporate the hypotactic structure knowledge into NMT with a gated fusion mechanism.

Source side. To incorporate the hypotactic structure knowledge, we first embed each token of the source sentence with gated structure label embedding. Then we encode the source sequence and get the source representation that is sensitive to the hypotactic structure. The source representation is further used to learn a specific context vector from which the NMT model generates the translations of the main and subordinate clauses.

As shown in Fig. 4(b), a source Chinese sentence x tagged with structure labels is first processed by our encoder structure fusion layer. Inspired by the work of Devlin et al. [32], for each token within the scope of the subordinate clause labels <S-start> and <S-end>, we infuse a subordinate structure embedding e_S. Similarly, for each token within the scope of the main clause labels <M-start> and <M-end>, we infuse a main structure embedding e_M. Concretely, the input representation of a given token is obtained by combining the corresponding token embedding t_i, the main or subordinate structure embedding e_i and the position embedding p_i. Here, e_i is either e_S or e_M, depending on the token’s structure label.

We introduce a gating operation [33] to regulate the weights between the token embedding t_i and the structure embedding e_i. The intuition is that different tokens require different amounts of semantic influence at the discourse structure level: $λ_{i} = σ (W_{t} t_{i} + W_{e} e_{i}),$ (9) $h_{i}^{'} = λ_{i} t_{i} + (1 - λ_{i}) e_{i},$ (10) $h_{i} = h_{i}^{'} + p_{i},$ (11) where σ (·) is the logistic sigmoid function, W_t and W_e are parameter matrices, $h_{i}^{'}$ is the token embedding after the gating mechanism, and h_i is the final embedding of the source word at position i, fused with the structure information. The positional embedding p_i is computed as in Equation (5), and e_S and e_M are randomly initialized.
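Equations (9)–(11) can be traced with a minimal sketch for a single token. The gate is treated here as a vector applied element-wise, which is an assumption since the text does not state whether λ_i is a scalar or a vector; the function names and toy values are likewise illustrative.

```python
import math


def matvec(w, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in w]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(t, e, p, w_t, w_e):
    """Fuse token embedding t with structure embedding e, then add position p."""
    # Equation (9): gate computed from both embeddings.
    gate = [sigmoid(a + b) for a, b in zip(matvec(w_t, t), matvec(w_e, e))]
    # Equation (10): element-wise interpolation between t and e.
    fused = [g * t_k + (1.0 - g) * e_k for g, t_k, e_k in zip(gate, t, e)]
    # Equation (11): add the positional embedding.
    return [f_k + p_k for f_k, p_k in zip(fused, p)]


# Toy values; with zero weight matrices the gate is exactly 0.5 everywhere.
zeros = [[0.0, 0.0], [0.0, 0.0]]
h = gated_fusion(t=[1.0, 2.0], e=[3.0, 4.0], p=[0.1, 0.2], w_t=zeros, w_e=zeros)
```

With the gate at 0.5, the output is the average of the token and structure embeddings plus the position embedding, i.e. approximately `[2.1, 3.2]`.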

Then, the representation h_i of each source word is fed into the encoder to learn contextual token representations. The encoder consists of a stack of N identical layers, in which the input representation of the n-th encoder layer, H^{n-1}, is passed through a self-attention network and a fully connected feed-forward network (FFN) to learn the source representation H^{n} of the n-th encoder layer: ${\bar{H}}^{n} = LN (SelfAtt (H^{n - 1}) + H^{n - 1}),$ (12) $H^{n} = LN (FFN ({\bar{H}}^{n}) + {\bar{H}}^{n}),$ (13) where LN (·) is a layer normalization operation and SelfAtt (·) is a self-attention mechanism (see Equations (6–8)). Finally, the learned source representation of the N-th encoder layer, H^{N}, contains rich hypotactic structure knowledge and can enhance the context vector that is generated by a multi-head encoder-decoder attention mechanism.

Target side. Like the source sentences, the target sentences also contain structure labels. At target side, the NMT model generates the target sentence from left to right, so it is uncertain whether the current word belongs to a main clause or a subordinate clause at inference step. Therefore, different from the integration method of source side, we integrate the annotated hypotactic structure knowledge by viewing each structure label as a label word like subordinating conjunction, and jointly encoding the labels and the words of the input sentence in the same embedding space. These labels serve as “clue words”, revealing the hypotactic structure between clauses in complex sentences.

Specifically, we generate the target representation by summing the token embedding and the position embedding of each token at our decoder hypotactic structure fusion layer. As shown in Fig. 4(b), the target representation is denoted as follows: $l_{i} = t_{i}^{'} + p_{i}^{'},$ (14) where $t_{i}^{'}$ and $p_{i}^{'}$ express the token embedding and the position embedding of a target token. Note that we adopt the same computation as usual tokens for structure labels.

Similarly, the decoder consists of a stack of N identical layers. The input representation of the n-th decoder layer, L^{n-1}, is transformed into a target representation ${\bar{L}}^{n}$ through multi-head self-attention. Then a multi-head attention computes alignment weights over the encoder output H^{N} and learns the context vector C^{n} to enhance translation accuracy: $C^{n} = LN (Att ({\bar{L}}^{n}, K^{N}, V^{N}) + {\bar{L}}^{n}),$ (15)

where K^N and V^N are transformed from the N-th layer of the encoder, Att (·) denotes the multi-head attention between the encoder and the decoder. Then the n-th layer of the decoder output is computed as follows: $L^{n} = LN (FFN (C^{n}) + C^{n}) .$ (16)

Finally, the decoder top layer output L^{N}, enhanced by the context vector based on the discourse structure knowledge, is used to predict the next target word and improve the translation coherence of complex sentences: $P (y_{t} | y_{< t}, x; θ) = softmax (W_{W} L_{t}^{N}),$ (17) where W_W is a projection matrix and $L_{t}^{N}$ is the top-layer decoder output at position t.

The annotated hypotactic structure knowledge is integrated into NMT by performing such strategy.

5.2 Explicit label-infused mechanism for integrating knowledge of hypotactic structure

In this integration strategy, we use the same way that is proposed at the target side in Section 5.1 to integrate the structure labels into NMT at both the source and target sides with an explicit label-infused mechanism. We view hypotactic structure labels in both the source and target sentences as specific discourse conjunctions that connect the main clause and the subordinate clause. These structure labels contribute a formal explicit representation of the hypotactic structure at the lexical level. As shown in Fig. 4(c), these tag tokens are first mapped and fed into the encoder or decoder with other tokens. Then, they are used to help NMT learn the source representation or the target representation that contains rich discourse structure information through a multi-head self-attention sub-layer. Finally, the context vector is computed to predict the next target word y_i at the time-step i.

Note that all the source sentences and target sentences are divided into the simple and complex sentences. Among them, the complex sentences annotated with discourse structure knowledge are fed into NMT using the above two strategies, while the simple sentences are fed into NMT using traditional method without special treatment.

5.3 Target structure-aware loss

In the above two integration methods, the target sentences contain a large number of structure labels. We treat these structure labels as linguistic conjunctions, and they contribute substantially to representing the hypotactic structure of complex sentences. To make better use of the structure labels, we introduce a specific structure-aware loss to supervise the predictions of the structure labels. The introduced loss is sensitive to the hypotactic structure and can encourage the NMT model to learn more discourse structure knowledge. As shown in Fig. 4(a), the overall loss is the sum of the original loss and the specific structure-aware loss.

Considering the long-tailed recognition problem when learning from the imbalanced data (i.e., structure labels) [34], we introduce a re-weighting coefficient λ to regulate the weights [35]. Formally, the final training objective is denoted as follows: $\hat{θ} = arg max_{θ} {P (y | x; θ) + λ * P (r | x; θ)},$ (18) where r expresses the sequence of the structure labels obtained from the reference translation y at the target side; the hyper-parameter λ is empirically set to 0.3 in this work. Note that the structure-aware loss does not require any new parameter overhead when training the NMT model.
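The re-weighted objective in Equation (18) can be illustrated with a small sketch. Two assumptions are made here that go beyond the text: the objective is computed on log-probabilities, as is common practice when training NMT models, and the term for r is taken to be the sum of the log-probabilities at the positions of the structure labels in the reference. The function name and the toy values are hypothetical.

```python
STRUCTURE_LABELS = {"<M-start>", "<M-end>", "<S-start>", "<S-end>"}


def structure_aware_objective(tokens, log_probs, lam=0.3):
    """Sentence log-likelihood plus a re-weighted term over structure labels.

    tokens: reference target tokens, including structure labels.
    log_probs: model log-probability of each reference token.
    lam: re-weighting coefficient (0.3 in this work).
    """
    full = sum(log_probs)
    # Extra term restricted to the positions that hold structure labels.
    label_part = sum(lp for tok, lp in zip(tokens, log_probs)
                     if tok in STRUCTURE_LABELS)
    return full + lam * label_part


# Toy reference and toy log-probabilities, for illustration only.
tokens = ["<S-start>", "because", "<S-end>"]
log_probs = [-0.1, -0.5, -0.2]
objective = structure_aware_objective(tokens, log_probs)
```

Because the extra term reuses the log-probabilities already produced by the decoder, this formulation adds no parameters, which is consistent with the note above.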

5.4 Our translation models

In this work, we introduce four NMT models based on our annotated parallel corpus and the above two integration strategies: (1) GEMStructure: the hypotactic structure knowledge is integrated into NMT by using the gated label embedding of the hypotactic structure (see Fig. 4 ((a) and (b))); (2) COStructure: the hypotactic structure knowledge is integrated into NMT by using the explicit label-infused mechanism of the hypotactic structure (see Fig. 4 ((a) and (c))); (3) GEMStruLoss: it combines the GEMStructure model and the target hypotactic structure-aware loss; (4) COStrucLoss: it combines the COStructure model and the target hypotactic structure-aware loss (see Section 5.3 and Fig. 4 (a)).

6 Experiments

6.1 Setup

Datasets. We evaluated the proposed approach on the WMT17, WMT18 and WMT19 Chinese-to-English (ZH-EN) translation tasks. The training corpus is a mixture of 3M sentence pairs, of which 1M sentence pairs are complex sentences tagged with hypotactic structure labels, and the other 2M are simple sentence pairs from the WMT2019 United Nations Parallel Corpus. We chose newsdev2017 as our development set, and newstest-2017 (MT17), newstest-2018 (MT18) and newstest-2019 (MT19) as our test sets. The statistics of these datasets are shown in Table 1. To test the effectiveness of our method, we preprocessed all three test sets by marking the source and target sentences with hypotactic structure labels. For more details on data annotation, please refer to Section 4.2. During testing, the labeled test sentences were fed into our four models according to the corresponding integration strategy, and each model generated its machine translation output. The translation results were then scored with BLEU after a post-processing step that removed the structure labels.
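The post-processing step can be sketched as follows. The surface form of the structure labels is an assumption made only for illustration: the pattern below treats any token such as `<main>` or `<sub_3>` as a label, and the names `LABEL_PATTERN` and `strip_structure_labels` are hypothetical. The actual label inventory is the one defined in Section 4.2.

```python
import re

# Hypothetical label format: angle-bracketed tokens such as "<main>" or "<sub_3>".
LABEL_PATTERN = re.compile(r"<[A-Za-z_]+\d*>")

def strip_structure_labels(text):
    """Remove structure labels and normalize whitespace before BLEU scoring."""
    without_labels = LABEL_PATTERN.sub(" ", text)
    return " ".join(without_labels.split())

hypothesis = "<sub> because it rained , <main> the match was postponed ."
clean = strip_structure_labels(hypothesis)
```

Applying the same function to system outputs and references keeps the labels from inflating or deflating n-gram matches.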

Table 1
The statistics of training, development and test sets in number of sentence pairs and words

Data	Sentences	Words (ZH)	Words (EN)
Train	3.0M	62.6M	71.5M
Development	2.0K	52.3K	61.8K
MT17	2.0K	49.1K	56.9K
MT18	3.9K	94.1K	117.5K
MT19	2.0K	60.6K	84.2K

Model settings. We implemented our proposed models on top of both the RNN-based NMT model RNNSearch [5] and the state-of-the-art Transformer model [7], and compared our models with these two strong baseline systems. We used the case-insensitive BLEU-4 score [36] as our evaluation metric, and the byte pair encoding (BPE) algorithm [37] was used to segment all words into sub-word units. The source and target vocabularies were both limited to 40K entries, covering approximately 98.4% and 99.2% of the Chinese and English data respectively. Because our training data contain a large number of long complex sentences, we trained our models on sentences of up to 150 words. When integrating our tagged discourse structure knowledge into Transformer, we followed the Transformer base model [7]. The encoder and decoder each had 6 layers, with 8 heads in each multi-head attention layer. The hidden size and inner-layer size were set to 512 and 2048 respectively. The beam size was set to 5 and the dropout [38] rate was set to 0.1. The Adam optimizer [39] and other model settings were the same as those of the default Transformer model. Similarly, when integrating our tagged discourse structure knowledge into the RNNSearch model, we used the open-source toolkit OpenNMT [40]. The encoder and decoder each had 2 layers. The hidden size was set to 512, and the batch size was set to 64. Other configurations were identical to those in the work of Bahdanau et al. [5].
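The vocabulary coverage figures can be read as the share of running tokens that belong to the most frequent vocabulary entries. The sketch below shows one plausible way to compute such a figure. The function name `vocab_coverage` and the toy corpus are hypothetical, and the exact counting procedure used in the experiments is not specified in the text.

```python
from collections import Counter

def vocab_coverage(tokens, vocab_size):
    """Fraction of running tokens covered by the vocab_size most frequent types."""
    counts = Counter(tokens)
    if not counts:
        return 0.0
    kept = sum(freq for _, freq in counts.most_common(vocab_size))
    return kept / len(tokens)

# Toy corpus; in the experiments the tokens would be sub-word units
# and vocab_size would be 40000.
corpus = "the cat sat on the mat and the dog sat too".split()
coverage = vocab_coverage(corpus, vocab_size=2)
```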

6.2 Main results

Table 2 shows the results of the proposed approach on top of the implemented RNNSearch and Transformer models in terms of BLEU. Our models outperformed their corresponding baseline systems in all cases.

Table 2
Results on ZH-EN translation tasks. Our four models that integrate the hypotactic structure alignment knowledge into NMT are implemented on top of RNNSearch and Transformer respectively, and these two strong translation models are our baselines. ^†† represents a significant improvement over the baseline model

Type	Systems	MT17	MT18	MT19	Avg
Baseline	MOSES	15.14	16.21	17.97	16.44
Baseline	RNNSearch	17.66	19.07	21.45	19.39
This Work	+GEMStructure	18.01^†	19.48^†	21.90^†	19.80
This Work	+COStructure	18.10^†	19.63^†	22.04^†	19.92
This Work	+GEMStruLoss	18.32^†	20.06^††	22.26^†	20.21
This Work	+COStrucLoss	18.45^††	19.97^†	22.59^††	20.34
Baseline	Transformer	19.14	20.47	22.87	20.83
This Work	+GEMStructure	19.60^†	21.03^†	23.39^†	21.34
This Work	+COStructure	19.81^†	20.99^†	23.67^†	21.49
This Work	+GEMStruLoss	19.92^†	21.33^†	23.88^†	21.71
This Work	+COStrucLoss	20.25^††	21.52^††	24.26^††	22.01

Baselines. Transformer and RNNSearch significantly outperformed MOSES, a conventional phrase-based SMT system [41], by 4.39 and 2.95 BLEU points on average respectively (see the Avg column of Table 2), showing that they are strong NMT baseline systems; Transformer in particular has been the dominant NMT system in recent years. Note that all the baseline systems ran their experiments on the unlabeled data of 3M sentence pairs.

Overall results. All four proposed structure-aware NMT models outperformed their corresponding baseline models. This demonstrates that incorporating the linguistic knowledge of hypotactic structure is beneficial to NMT as measured with BLEU.

Among the proposed models, COStructure performed better than GEMStructure on average. This suggests that the explicit label-infused mechanism of the hypotactic structure affects the NMT model more than the gated label embedding mechanism. COStructure combined with the structure-aware loss (COStrucLoss) achieved the best translation performance: its Transformer-based variant outperformed the strong RNNSearch baseline and the state-of-the-art Transformer baseline by 2.62 and 1.18 BLEU points respectively on average. This indicates that the knowledge integration and the knowledge learning methods can be used together to further enhance the translation models on both the source and target sides.
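The average gains quoted from Table 2 can be re-derived with a few lines of arithmetic. The sketch below copies the relevant BLEU scores from the table; the helper names `average` and `gain` are introduced here only for illustration.

```python
# BLEU scores copied from Table 2, in the order (MT17, MT18, MT19).
scores = {
    "MOSES": (15.14, 16.21, 17.97),
    "RNNSearch": (17.66, 19.07, 21.45),
    "RNNSearch+COStrucLoss": (18.45, 19.97, 22.59),
    "Transformer": (19.14, 20.47, 22.87),
    "Transformer+COStrucLoss": (20.25, 21.52, 24.26),
}

def average(system):
    """Mean BLEU over the three test sets, rounded as in the Avg column."""
    values = scores[system]
    return round(sum(values) / len(values), 2)

def gain(system, baseline):
    """Difference between the rounded averages of two systems."""
    return round(average(system) - average(baseline), 2)

best_vs_transformer = gain("Transformer+COStrucLoss", "Transformer")
best_vs_rnnsearch = gain("Transformer+COStrucLoss", "RNNSearch")
```

Computing the gain from the rounded averages mirrors how the Avg column would be compared by a reader of the table.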

Experimental results demonstrated the effectiveness of our method in terms of BLEU scores. However, the n-gram-based BLEU score is poorly suited to evaluating discourse phenomena [12]. We therefore further analyze the influence of our methods on the Chinese-to-English translation of main and subordinate clauses with human evaluation.

6.3 Evaluating translations of the hypotactic structure

We used the sign test [42] to calculate the statistical significance of our results. We randomly selected 120 complex sentences from the test set and counted how many hypotactic structure errors were 1) generated by the NMT baseline model (Total); 2) fixed by our method (Fixed); and 3) newly produced by our method (New). As shown in Table 3, 70 complex sentences were translated with inappropriate main and subordinate clauses by the baseline, and our model corrected 46 of these errors (about 66 percent). However, our method also generated five new errors (about 7 percent). We attribute these to the small amount of noisy data that is inevitably incorporated when integrating structure knowledge into the NMT model. In future work, we would like to explore how to make full use of the annotated structure labels and reduce the influence of the incorporated noisy data on NMT.

Table 3
Translation error statistics of the hypotactic structure

Errors	Hypotactic structure errors	Share of total
Total	70	–
Fixed	46	66%
New	5	7%

In brief, statistics show that our method can effectively help NMT learn the correct hypotactic structure and improve the translations of the main and subordinate clauses at document level.
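For illustration, an exact two-sided sign test on the counts of Table 3 can be sketched as follows. How the sign test was applied in the experiments is not detailed in the text, so the pairing used here (sentences fixed by our method as wins, newly introduced errors as losses, unchanged sentences dropped as ties) is an assumption, and the function name `sign_test_p_value` is hypothetical.

```python
from math import comb

def sign_test_p_value(wins, losses):
    """Exact two-sided sign test; ties are assumed to be dropped beforehand."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Probability of observing k or fewer of the rarer outcome under p = 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Counts from Table 3: 46 errors fixed, 5 new errors introduced.
p = sign_test_p_value(wins=46, losses=5)
```

Under this pairing the imbalance between fixed and new errors is far larger than chance would produce, which is consistent with the summary above.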

6.4 Visualization of the attention weight and alignment matrices

To further analyze the influence of the proposed method on the translations of the main and subordinate clauses, we conducted a visualization analysis of the attention weight and alignment matrices. First, given a source complex sentence, we compared the translations generated by the baseline Transformer and by our best COStrucLoss model. Then, we computed the attention distribution of each target word over the tokens in the source sequence. The attention distribution is calculated by averaging the attention scores of all multi-head attention heads (8 heads in total) in the top layer (of 6 layers in total) of the encoder and decoder. The attention scores are computed through a multi-head attention layer (see Equation (8)).
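The head-averaging step can be sketched in a few lines. The sketch assumes that the per-head attention weights of the top layer have already been extracted as a nested list indexed by head, target position and source position. The names `average_heads` and `hard_alignment` and the toy weights are hypothetical.

```python
def average_heads(attention):
    """Average attention[h][t][s] over heads h, giving one weight per (t, s)."""
    num_heads = len(attention)
    tgt_len = len(attention[0])
    src_len = len(attention[0][0])
    return [
        [sum(attention[h][t][s] for h in range(num_heads)) / num_heads
         for s in range(src_len)]
        for t in range(tgt_len)
    ]

def hard_alignment(avg):
    """For each target position, pick the source position with the largest weight."""
    return [max(range(len(row)), key=row.__getitem__) for row in avg]

# Toy example: 2 heads, 2 target tokens, 2 source tokens.
heads = [
    [[0.7, 0.3], [0.2, 0.8]],
    [[0.5, 0.5], [0.4, 0.6]],
]
avg = average_heads(heads)
alignment = hard_alignment(avg)
```

Since each head's row is a probability distribution over source tokens, the averaged rows also sum to one, so they can be plotted directly as an alignment heat map.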

The attention weights and alignment matrices are visualized in Fig. 5. We observe that our model generated a clearer alignment path than the baseline model in both the main clause (see the purple dashed box) and the subordinate clause (see the red dashed box). In the subordinate clause, the omission of the conjunction meaning “so” at the beginning of the source Chinese subordinate clause led to incorrect translations at the discourse structure level. The baseline model missed the conjunction “so” and did not produce the correct English subordinate clause. Moreover, since the Chinese token meaning “difficult” and the English token “difficult” were not aligned, the baseline model generated an incorrect target word, “repulsive”. In contrast, our model addressed these issues by relying on the guidance of the bilingual structure alignment labels, and correctly predicted the conjunction “so” and the subsequent subordinate clause. This indicates that our method, based on structure alignment learning, can help NMT learn rich alignment knowledge at both sentence level and discourse level. Such alignment learning also effectively improves translation accuracy.

Fig. 5

The visualization of the attention weight and alignment matrices. (a) Generated by the baseline Transformer; and (b) generated by our +COStrucLoss model. The alignment of the subordinate clause is shown in the red dashed box, and the alignment of the main clause is shown in the purple dashed box.

6.5 Evaluating the accuracy of structure label generation in inference step

To further verify the effectiveness of our method for translating the hypotactic structure of complex sentences, we evaluated the accuracy of structure label generation in the inference step of NMT. Specifically, we first randomly sampled 200 sentence pairs from each of the test sets newstest-2017 and newstest-2019, and manually annotated the resulting 400 sentence pairs with hypotactic structure labels. Then we decoded the source sentences with our COStrucLoss NMT model and generated translations that contain hypotactic structure labels. Finally, we calculated the accuracy of structure label generation in the inference step by comparing the target structure labels generated by NMT with those annotated by human annotators. The result is shown in Table 4. The accuracy of label generation reaches 95.2% in the inference step over the above test data. This indicates that the proposed model can capture the hypotactic structure information of complex sentences from our annotated data and thus improve the translation performance.

Table 4
The accuracy of structure labels generation in inference step

Test sets	Number of labels (by human)	Correctly generated labels (by our model)	Accuracy (%)
newstest2017	686	649	94.6
newstest2019	621	595	95.8
Total	1307	1244	95.2
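The accuracies in Table 4 follow directly from the label counts. The short sketch below recomputes them; the helper name `label_accuracy` is introduced only for illustration. Note that the overall figure is computed from the pooled counts rather than as the mean of the two per-set accuracies.

```python
def label_accuracy(correct, total):
    """Percentage of human-annotated labels that the model generated correctly."""
    return round(100.0 * correct / total, 1)

# (correctly generated labels, human-annotated labels), copied from Table 4.
counts = {"newstest2017": (649, 686), "newstest2019": (595, 621)}

per_set = {name: label_accuracy(c, t) for name, (c, t) in counts.items()}
total_correct = sum(c for c, _ in counts.values())
total_labels = sum(t for _, t in counts.values())
overall = label_accuracy(total_correct, total_labels)
```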

6.6 Effect of structure-aware loss

Figure 6 shows the BLEU scores of the GEMStruLoss model and the COStrucLoss model on the same Chinese-English test set for different values of the hyper-parameter λ. We observed that when λ increased from 0 to 0.3, the two models improved by 0.5 and 0.6 BLEU points over the GEMStructure and COStructure models respectively. This indicates that the introduced structure-aware loss helps the NMT model learn bilingual alignment knowledge and improves translation results. However, larger values of λ subsequently decreased the BLEU scores, suggesting that excessive attention to structure alignment labels may damage the NMT model. We therefore set λ to 0.3 to regulate the structure-aware loss, so that the NMT model is trained better and is encouraged to learn more translation knowledge of discourse structure from our annotated structure-aligned corpus.

Fig. 6

BLEU scores of the GEMStruLoss model and the COStrucLoss model on the same ZH-EN test set with different λ values.

6.7 Case study

We use an example of document-level translation to illustrate how the structure alignment knowledge improves NMT (see Fig. 7). A major translation error made by the baseline system is the wrong translation of the hypotactic structure. The reason is that the baseline system cannot identify the semantic relevance between multiple Chinese clauses. That is, it cannot differentiate between the main clause and the subordinate clause among Chinese clauses, which often leads to inappropriate translations at the discourse structure level. For example, the source clause ZH₇ is a subordinate clause, but the baseline incorrectly translates it as a target main clause EN₇. Similarly, the source clause ZH₁₁ is a subordinate clause, but the baseline also translates it as a main clause EN₁₁. Our model corrects the two errors in clauses $\bar{{EN}_{7}}$ and $\bar{{EN}_{11}}$. This illustrates that our method can improve the translation of complex sentences by learning the hypotactic structure alignment knowledge of parallel sentence pairs. Furthermore, our method strengthens the semantic relevance between clauses in both the source and target languages, and thus enhances the coherence of complex sentence translation.

Fig. 7

An example of document-level translation (ZH-EN). The source sentences (ZH) are incorrectly translated by the baseline Transformer (EN), but are correctly translated by our COStructure model ( $\bar{EN}$ ). Mistranslations are shown in italics and the corrected translations in bold.

Another translation problem of the baseline is inadequate translation, which is also a challenge for document-level machine translation. When translating a document, a larger cross-sentence context needs to be considered, which often leads to the omission of some terms in NMT output. As shown in Fig. 7, two terms (“in a timely manner” and “demolition”) were missed by the baseline system, but recovered by our model in clause $\bar{{EN}_{5}}$ and clause $\bar{{EN}_{7}}$. This indicates that our method can also enhance translation adequacy via bilingual sentence alignment learning, which is in line with the findings of Shi et al. [43].

7 Conclusion and future work

In this paper, we focused on the semantic relevance between clauses and explored the influence of hypotactic structure knowledge on improving Chinese-to-English machine translation. First, we created a hypotactic structure aligned Chinese-English parallel corpus that provides NMT with rich hypotactic structure knowledge and fine-grained alignment at clause level. The annotated datasets will be released to help train better NMT models for Chinese-English complex sentence translation. Then we proposed a structure-infused neural model with two strategies to incorporate the tagged hypotactic structure knowledge into NMT. In particular, we introduced a specific structure-aware loss to guide the NMT model to better learn the structure alignment knowledge. Experimental results on the WMT17, WMT18 and WMT19 Chinese-to-English translation tasks demonstrated that incorporating the tagged discourse structure knowledge is beneficial for NMT as measured with BLEU. Further analyses illustrate that the proposed NMT model can capture the structure knowledge and improve the translations of complex sentences, and therefore improve the discourse coherence of the translations. The proposed method can also enhance translation adequacy through fine-grained clause alignment learning.

In the future, we would like to investigate how to encourage the NMT model to learn more discourse alignment knowledge from our structurally aligned training corpus and how to reduce the influence of the incorporated noisy data on NMT. We will also study methods for automatically evaluating the document-level translation performance of main and subordinate clauses on the Chinese-to-English task.

Footnotes

This Chinese complex sentence is an excerpt from Jia Pingwa’s novel Turbulence.

Acknowledgments

The research work described in this paper has been supported by the National Key R&D Program of China (2020AAA0108001), the National Natural Science Foundation of China (No. 61976015, 61976016, 61876198 and 61370130) and the Guangdong Basic and Applied Basic Research Foundation (2020A1515011056). The authors would like to thank the anonymous reviewers for their valuable comments and suggestions to improve this paper.

References

1. Mann and Thompson S.A., Rhetorical structure theory: Toward a functional theory of text organization, Text 8(3) (1988), 243–281.
2. Grosz B.J., Joshi and Weinstein, Centering: a framework for modeling the local coherence of discourse, Computational Linguistics 21(2) (1995), 203–225.
3. Kalchbrenner and Blunsom, Recurrent continuous translation models, In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, 2013, 1700–1709.
4. Sutskever, Vinyals and Le Q.V., Sequence to sequence learning with neural networks, In Proceedings of the 2014 Neural Information Processing Systems, Montreal, Canada, 2014, 3104–3112.
5. Bahdanau, Cho and Bengio, Neural machine translation by jointly learning to align and translate, In Proceedings of the 3rd International Conference on Learning Representations, San Diego, USA, 2015, 1–15.
6. Gehring, Auli, Grangier, Yarats, et al., Convolutional sequence to sequence learning, In Proceedings of the 34th International Conference on Machine Learning, 2017, 1243–1252.
7. Vaswani, Shazeer, Parmar, et al., Attention is all you need, In Advances in Neural Information Processing Systems, 2017, 5998–6008.
8. Zhang, Feng, Meng, You and Liu, Bridging the gap between training and inference for neural machine translation, In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, 4334–4343.
9. Feng W.H., Alignment and annotation of Chinese-English discourse structure parallel corpus, Journal of Chinese Information Processing 27(6) (2013), 158–165.
10. Jean, Lauly, Firat and Cho, Does Neural Machine Translation Benefit from Larger Context, arXiv preprint arXiv:1704.05135, 2017.
11. Wang, Tu, Way and Liu, Exploiting Cross Sentence Context for Neural Machine Translation, In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, 2816–2821.
12. Bawden, Sennrich, Birch and Haddow, Evaluating Discourse Phenomena in Neural Machine Translation, In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1 (2018), 1304–1313.
13. Voita, Serdyukov, Sennrich and Titov, Context-aware neural machine translation learns anaphora resolution, In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 1 (2018), 1264–1274.
14. Zhang, Luan, Sun, Zhai, Xu, Zhang and Liu, Improving the transformer translation model with document-level context, arXiv preprint arXiv:1810.03581, 2018.
15. Miculicich, Ram, Pappas and Henderson, Document-level neural machine translation with hierarchical attention networks, In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, 2947–2954.
16. Maruf, Martins A.F.T. and Haffari, Selective attention for context-aware neural machine translation, In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1 (2019), 3092–3102.
17. Born, Mesgar and Strube, Using a Graph-based Coherence Model in Document-Level Machine Translation, In Proceedings of the Third Workshop on Discourse in Machine Translation, 2017, 26–35.
18. Kuang, Xiong, Luo and Zhou, Modeling coherence for neural machine translation with dynamic and topic caches, In Proceedings of the 27th International Conference on Computational Linguistics, 2018, 596–606.
19. Xiong, He, Wu and Wang, Modeling coherence for discourse neural machine translation, In Proceedings of the AAAI Conference on Artificial Intelligence, 33(01) (2019), 7338–7345.
20. Bahdanau, Brakel, Xu, et al., An actor-critic algorithm for sequence prediction, In ICLR, 2017.
21. Sundermeyer, Schluter and Ney, LSTM neural networks for language modeling, In Thirteenth Annual Conference of the International Speech Communication Association, 2012.
22. Chung, Gulcehre, Cho and Bengio, Gated feedback recurrent neural networks, In International Conference on Machine Learning, 2015, 2067–2075.
23. He, Zhang, Ren and Sun, Deep residual learning for image recognition, In CVPR, 2016, 770–778.
24. […], Lai, Feng, et al., Chinese and English Elementary Discourse Units Segmentation based on Bi-LSTM-CRF Model, In Proceedings of the 19th Chinese National Conference on Computational Linguistics, 2020, 1068–1078.
25. Lafferty, McCallum and Pereira, Conditional random fields: Probabilistic Models for Segmenting and Labeling Sequence Data, In Proceedings of the Eighteenth International Conference on Machine Learning, 2001, 282–289.
26. Hochreiter and Schmidhuber, Long short-term memory, Neural Computation 9(8) (1997), 1735–1780.
27. Manning C.D., Surdeanu, Bauer, et al., The Stanford CoreNLP Natural Language Processing Toolkit, In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 2014, 55–60.
28. Mikolov, Sutskever, Chen, et al., Distributed representations of words and phrases and their compositionality, Advances in Neural Information Processing Systems 26 (2013), 3111–3119.
29. Ding, Li, et al., Improving neural sentence alignment with word translation, Frontiers of Computer Science 15(1) (2020).
30. Och F.J. and Ney, A systematic comparison of various statistical alignment models, Computational Linguistics 29(1) (2003), 19–51.
31. Cho, van Merriënboer, Gulcehre, et al., Learning phrase representations using RNN encoder-decoder for statistical machine translation, In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 2014, 1724–1734.
32. Devlin, Chang M.W., et al., BERT: pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805, 2018.
33. Tu, Liu, Lu, et al., Context gates for neural machine translation, Transactions of the Association for Computational Linguistics 5 (2017), 87–99.
34. Liu, Miao, Zhan, et al., Large-scale long-tailed recognition in an open world, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, 2537–2546.
35. Cui, Jia, Lin, et al., Class-balanced loss based on effective number of samples, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, 9268–9277.
36. Papineni, Roukos, et al., Bleu: A method for automatic evaluation of machine translation, In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, 2002, 311–318.
37. Sennrich, Haddow and Birch, Neural machine translation of rare words with subword units, In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 1 (2016), 1715–1725.
38. Srivastava, Hinton, Krizhevsky, Sutskever and Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting, The Journal of Machine Learning Research 15(1) (2014), 1929–1958.
39. Kingma D.P. and Ba, Adam: A method for stochastic optimization, CoRR, abs/1412.6980, 2015.
40. Klein, Kim, Deng, Senellart and Rush, OpenNMT: Open-source toolkit for neural machine translation, In Proceedings of ACL 2017, 67–72.
41. Koehn, Hoang, Birch, et al., Moses: Open source toolkit for statistical machine translation, In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, 2007, 177–180.
42. Collins, Koehn and Kucerova, Clause restructuring for statistical machine translation, In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, 2005, 531–540.
43. Shi, Huang, Jian and Tang, Improving neural machine translation with sentence alignment learning, Neurocomputing 420 (2021), 15–26.
