STA: An efficient data augmentation method for low-resource neural machine translation

Abstract

Transformer-based neural machine translation (NMT) has achieved state-of-the-art performance in the NMT paradigm. However, it relies on the availability of copious parallel corpora. For low-resource language pairs, the amount of parallel data is insufficient, resulting in poor translation quality. To alleviate this issue, this paper proposes an efficient data augmentation (DA) method named STA. Firstly, the pseudo-parallel sentence pairs are generated by translating sentence trunks with the target-to-source NMT model. Furthermore, two strategies are introduced to merge the original data and pseudo-parallel corpus to augment the training set. Experimental results on simulated and real low-resource translation tasks show that the proposed method improves the translation quality over the strong baseline, and also outperforms other data augmentation methods. Moreover, the STA method can further improve the translation quality when combined with the back-translation method with the extra monolingual data.

Keywords

Data augmentation, neural machine translation, sentence trunk, mixture, concatenation

1 Introduction

Transformer-based neural machine translation (NMT) has shown the best performance on several language pairs with an end-to-end architecture and large-scale parallel data [1–4]. Consequently, the availability of parallel corpora strongly influences how well an NMT model performs. However, large-scale parallel corpora are unavailable for many language pairs, e.g., Turkish→English, and even high-resource language pairs suffer from data scarcity in some domains. Manually constructing high-quality bilingual datasets is time-consuming and expensive.

To alleviate this issue, some researchers exploit high-resource languages to improve the translation quality of low-resource NMT models [5–7], but these methods usually involve complex models and high computational cost. Data augmentation (DA) is an important technique for generating additional training samples when the available parallel data are scarce, and it has proved useful in previous work [8–15]. It is widely used in computer vision and has also made progress in neural machine translation. Word replacement is employed to generate augmented data of relatively high quality [9, 11, 12, 16–18]. However, these methods can hardly exploit all the possible expanded data because of the data sparsity problem in low-resource languages. Another family of data augmentation methods uses monolingual data [8, 10, 19, 20], e.g., back-translation and self-learning. Nevertheless, gathering and collating the required quantity of monolingual data takes a significant amount of work. Some researchers use a synonym dictionary to replace words in parallel sentence pairs, but thesaurus resources are also limited for low-resource languages.

Different from previous work, this paper proposes an effective data augmentation method named STA. It uses a target-to-source NMT model and sentence trunks to generate additional sentence pairs, and it requires no external monolingual data or synonym dictionary. The main steps of the proposed DA method are as follows: first, a target-to-source NMT model (backward NMT model) is trained; second, Stanford CoreNLP [21] generates the constituency parse tree of each target sentence in the training set, and a proposed algorithm extracts the sentence trunk from it; next, the backward NMT model translates the extracted sentence trunks; finally, two strategies are employed to generate the pseudo-parallel corpus. The method is simple yet effective, and its major contributions are as follows:

To the best of our knowledge, it is the first effort to utilize sentence trunks to augment the training set for NMT tasks;

Two DA strategies (Mixture and Concatenation) are proposed to augment the training dataset with the pseudo-parallel sentence pairs in the STA method;

The STA method improves the performance of the Transformer-based NMT model steadily, especially for long sentences.

The structure of this paper is as follows: the related work is introduced in section 2, and section 3 describes the STA method in detail. Section 4 introduces the experiment and results, and the conclusion is presented in section 5.

2 Related work

Data augmentation (DA) is widely used in many fields of deep learning to improve model training. In neural machine translation (NMT), DA is often used either to generate noisy data that increase the model’s robustness or to generate more diverse training samples that improve translation performance.

One family of DA methods requires the introduction of monolingual data. Using source monolingual data, Zhang and Zong [10] improved the NMT model with the self-learning method, which augments the training set by translating source sentences into the target language. Using target monolingual data, Sennrich et al., [8] proposed a DA method named back-translation (BT), which translates target monolingual sentences into the source language to expand the training set. Variants of back-translation have proven effective in many works [14, 19, 22, 23]. He et al., [24] extended the back-translation method to use monolingual data from both the source and target sides. Hoang et al., [25] proposed an iterative DA method to improve both BT and the NMT model. Zhang and Matsumoto [26] proposed a corpus augmentation method that segments lengthy sentences with word alignment and generates the synthetic parallel data by BT. Imamura et al., [27] generated several source sentences for a target sentence by sampling. All these methods are effective, yet they require additional monolingual data. Different from previous work, the sentence trunks augmentation (STA) method does not need additional monolingual data. It merely extracts the sentence trunk with rules based on the constituency parse tree and constructs pseudo-parallel sentence pairs to augment the training set.

Another category of data augmentation is based on word replacement. Existing methods for word-level data augmentation also include random word swapping and word dropping. Fadaee et al., [11] improved the NMT for low-resource translation by replacing words in the original sentence with the rare vocabulary. Words in the source and target languages are individually replaced by SwitchOut [12] with words that are uniformly sampled from corresponding vocabulary. Artetxe et al., [28] randomly replaced words with adjacent words within a window size. Xie et al., [29] proposed two ways to introduce noise into sentences: randomly substituting words with placeholder words and substituting words with other words that have a comparable frequency distribution across the vocabulary. The alternative words generated by a bidirectional language model or BERT [30, 31] can be used in a sentence instead of the word token. A soft contextual DA method is proposed by Gao et al., [18], which replaces the word representation with the soft distribution. Similar to the Dropout method, the STA method can be seen as word dropping in the sentence, which is constrained by the extraction rules of the sentence trunk. Therefore, the new pseudo-parallel sentence pairs contain less noise than that of simple word dropping.

3 Method

The data augmentation method named STA (sentence trunks augmentation) generates pseudo-parallel sentence pairs for the original training set from sentence trunks and back-translation. Compared with the original training set, the pseudo-parallel sentences on the source and target sides are shorter yet structurally complete. The method has two steps: extracting the sentence trunks, and generating and using the pseudo-parallel sentence pairs. Both are described in detail below.

3.1 Generating sentence trunk by constituency parse tree

Previous work has pointed out that different words play different roles in a sentence; e.g., Chen et al., [32] divide the words in a sentence into function words and content words. Inspired by this notion, we hypothesise that certain words establish the fundamental structure of the sentence and describe its primary content; we term these words the sentence trunk. The sentence components and their interrelations can be obtained through syntactic analysis. Therefore, an algorithm based on the constituency parse tree can be used to extract the sentence trunk.

Algorithm 1
Extraction of the sentence trunk based on the constituency parse tree

Input: a constituency parse tree T of a sentence s
Output: the sentence trunk st for the sentence s

candidate set S = {}
set d = depth(x), where x is the top leaf node
for each x in T do
    if depth(x) == d then S.add(x)
end for
while ∀ x ∈ S is non-leaf node do
    if x is NP then
        x' = GetNodeNN(x)
    else
        x' = GetNode(x)
    end if
    if CheckNode(x) == True && x is non-leaf node then
        x' = GetNode(x)
    end if
    S.remove(x) and S.add(x')
end while
st = Generate(S)

GetNodeNN(x): return the first leaf node which is NN.
GetNode(x): return the first leaf node.
CheckNode(x): return whether the node x only has one leaf node.
Generate(S): combine the words in set S to generate the sentence trunk.

Figure 1 shows a constituency parse tree of an English sentence. It uses constituency grammar to identify terminal and non-terminal nodes to denote the structure information of the entire sentence. The content words of a sentence are leaf nodes in the constituency-based parse tree, while non-leaf nodes are the constituent attribute words, which consist of multiple content words of the sentence. The leaf nodes can be represented as the set L = {Do, you, expect, changes, in, the, current, situation, ?}, and the non-terminal nodes can be represented as the set N = {VBP, NP, VP, . . . , NN}.

Fig. 1

The constituency parse tree of an example sentence.

Given the constituency parse tree of a sentence, Algorithm 1 extracts the sentence trunk. The sentence trunk of the example in Fig. 1 is “Do you expect changes in situation?”. Note that the sentence trunk may deviate from grammatical rules to some extent.
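To make the idea concrete, the following Python sketch applies a simplified reading of the extraction rule to the example sentence. It is not a line-by-line implementation of Algorithm 1: it assumes that a base NP (an NP whose children are all part-of-speech nodes) is reduced to its first noun, in the spirit of GetNodeNN, or to its first word if it contains no noun, in the spirit of GetNode, and that all other words are kept. The tuple-based tree encoding, the bracketing of the example sentence, and the names `is_preterminal` and `trunk_words` are assumptions made for illustration.

```python
from typing import List

# Assumed encoding: a node is (label, child, child, ...);
# a part-of-speech node (preterminal) is (POS_tag, word).


def is_preterminal(node: tuple) -> bool:
    """True for nodes of the form (POS_tag, word)."""
    return len(node) == 2 and isinstance(node[1], str)


def trunk_words(node: tuple) -> List[str]:
    """Collect trunk words under a simplified rule (an assumption, see text)."""
    if is_preterminal(node):
        return [node[1]]
    children = node[1:]
    if node[0] == "NP" and all(is_preterminal(c) for c in children):
        # Base NP: keep the first noun, otherwise fall back to the first word.
        for tag, word in children:
            if tag.startswith("NN"):
                return [word]
        return [children[0][1]]
    words: List[str] = []
    for child in children:
        words.extend(trunk_words(child))
    return words


# Hypothetical bracketing of the example sentence from Fig. 1.
EXAMPLE_TREE = (
    "ROOT",
    ("SQ",
     ("VBP", "Do"),
     ("NP", ("PRP", "you")),
     ("VP",
      ("VB", "expect"),
      ("NP",
       ("NP", ("NNS", "changes")),
       ("PP",
        ("IN", "in"),
        ("NP", ("DT", "the"), ("JJ", "current"), ("NN", "situation"))))),
     (".", "?")),
)

trunk = trunk_words(EXAMPLE_TREE)
print(" ".join(trunk))
```

Under these assumptions the determiner and adjective of the innermost NP are dropped, which matches the trunk quoted for the Fig. 1 example up to tokenisation of the question mark.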

3.2 Data augmentation

With the sentence trunks of the original training set extracted, this section describes how the STA method uses them for data augmentation.

3.2.1 Generating pseudo-parallel corpus

As shown in Fig. 2, the method of generating a pseudo-parallel corpus can be briefly summarized as follows:

Fig. 2

Data augmentation method STA based on sentence trunks.

Given a parallel corpus D = {x, y}, where x and y are sentences in the source language S and the target language T, respectively, train a T → S NMT model and denote it M_T→S;

∀ (x_i, y_i) ∈ D, generate the sentence trunk of y_i with the extraction algorithm in section 3.1 and denote it $y_{i}^{'}$;

Translate $y_{i}^{'}$ with the NMT model M_T→S to obtain the corresponding translation $x_{i}^{'}$, and construct the pseudo-parallel corpus from the pairs ($x_{i}^{'}$, $y_{i}^{'}$).
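The three steps above can be sketched in Python as follows. The trunk extractor and the backward translation model are passed in as plain callables, because the actual parser and NMT model are outside the scope of this sketch; the name `build_pseudo_corpus`, the use of `None` for sentences whose trunk cannot be extracted, and the toy stand-in functions are all assumptions for illustration.

```python
from typing import Callable, List, Optional, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def build_pseudo_corpus(
    corpus: List[Pair],
    extract_trunk: Callable[[str], Optional[str]],
    translate_t2s: Callable[[str], str],
) -> List[Optional[Pair]]:
    """Return one pseudo pair (x_i', y_i') per training pair, or None.

    The result stays index-aligned with `corpus`, so a failed trunk
    extraction is recorded as None instead of being silently dropped.
    """
    pseudo: List[Optional[Pair]] = []
    for _source, target in corpus:
        trunk = extract_trunk(target)         # y_i'
        if trunk is None:
            pseudo.append(None)
            continue
        back_translation = translate_t2s(trunk)  # x_i'
        pseudo.append((back_translation, trunk))
    return pseudo


# Toy stand-ins (hypothetical): drop "the " as a fake trunk rule and
# upper-case the text as a fake target-to-source translation.
def toy_trunk(sentence: str) -> Optional[str]:
    return sentence.replace("the ", "") or None


toy_corpus = [("le chat dort", "the cat sleeps"), ("", "")]
toy_pseudo = build_pseudo_corpus(toy_corpus, toy_trunk, str.upper)
print(toy_pseudo)
```

Keeping the pseudo pairs aligned with the original pairs is a design choice of this sketch; it makes it straightforward to pair each original sentence with its own trunk later on.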

3.2.2 Data augmentation strategy

As shown in Fig. 2, D denotes the original training set, and the pseudo-parallel corpus is composed of the sentence trunks and their corresponding translations. After generating the pseudo-parallel corpus, two data augmentation strategies are employed to augment the training set.

Mixture: The new training set is constructed by merging D and the pseudo-parallel corpus. In other words, the number of sentence pairs in the new training set equals the number in D plus the number in the pseudo-parallel corpus. Note that the source and target sentences in the pseudo-parallel corpus are shorter than those in the original parallel dataset D.

Concatenation: The dataset D and the pseudo-parallel corpus are concatenated pairwise to generate a new pseudo-parallel corpus, which is then merged with the original parallel dataset D to construct the final training set. Specifically, x_i and $x_{i}^{'}$ are concatenated to construct a new sentence on the source side, and y_i and $y_{i}^{'}$ are concatenated to construct the corresponding target sentence. In this case, the new source and target sentences are longer than those in the original training set.
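A minimal Python sketch of the two strategies is given below. It assumes that the pseudo-parallel corpus is index-aligned with D, with `None` marking sentences whose trunk could not be extracted, and that concatenation joins the two sentences with a single space; the text does not state whether a separator token is used, so this is an assumption, as are the function names.

```python
from typing import List, Optional, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def mixture(corpus: List[Pair], pseudo: List[Optional[Pair]]) -> List[Pair]:
    """Mixture: original pairs plus the (shorter) pseudo-parallel pairs."""
    return list(corpus) + [p for p in pseudo if p is not None]


def concatenation(corpus: List[Pair], pseudo: List[Optional[Pair]]) -> List[Pair]:
    """Concatenation: original pairs plus (x_i + x_i', y_i + y_i') pairs."""
    joined: List[Pair] = []
    for (x, y), p in zip(corpus, pseudo):
        if p is None:
            continue  # no trunk was extracted for this pair
        x_trunk, y_trunk = p
        joined.append((f"{x} {x_trunk}", f"{y} {y_trunk}"))
    return list(corpus) + joined


# Toy data (hypothetical sentences, for illustration only).
D = [("a b c d", "A B C D"), ("e f", "E F")]
D_pseudo = [("a d", "A D"), None]

print(mixture(D, D_pseudo))
print(concatenation(D, D_pseudo))
```

Both functions return a training set of the same size here; the difference is that the added Mixture pairs are shorter than the originals, while the added Concatenation pairs are longer.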

4 Experiments and results

4.1 Datasets and settings

4.1.1 Dataset and pre-processing

Several experiments are conducted on simulated and real low-resource translation tasks to validate the performance of the sentence trunks augmentation (STA) method. They include Chinese (Zh), Spanish (Es), German (De), Vietnamese (Vi) and Turkish (Tr) to English (En) translation tasks, as well as English to Vietnamese and English to Turkish translation tasks, drawn from the well-known IWSLT14, IWSLT15 and WMT18 datasets. All experiments are based on the transformer [4] as implemented in the fairseq toolkit [33], and all NMT models are run on a server equipped with two GeForce RTX 3090 Ti GPUs.

Table 1 lists the number of sentences in each dataset. For the IWSLT14 De→En translation task, we follow the same training set, test set and pre-processing steps as Gao et al., [18], and we use a shared vocabulary with 10K byte-pair encoding (BPE) [34] types. For the IWSLT14 Es→En task, the settings are consistent with Cheng et al., [35]. For the IWSLT15 Zh→En task, we follow the setting of Werlen et al., [36] with a vocabulary size of 30K for both the source and target sides. For the IWSLT15 Vi↔En translation tasks, we adopt the data pre-processing setting used by Wang et al., [12]. For the WMT18 Tr↔En translation tasks, the experimental settings are consistent with Bugliarello and Okazaki [37].

Table 1
The number of sentences in each dataset. “Sentence Trunk” denotes the number of sentences whose trunks were extracted successfully

Corpus	IWSLT14 De→En	IWSLT14 Es→En	IWSLT15 Zh→En	IWSLT15 Vi↔En	WMT18 Tr↔En
Training	160239	169028	209941	133317	207678
Validation	7283	7683	887	1553	3000
Test	6750	5593	5473	1268	3007
Sentence Trunk	160150	168949	209842	133157	207662

4.1.2 Settings

For all IWSLT15 translation tasks, we use the standard transformer model. For the IWSLT14 De→En translation task, we adopt the default transformer_small and transformer_base configurations for the neural machine translation (NMT) model; for the other IWSLT14 and WMT18 translation tasks, only the default transformer_small configuration is used. For the transformer_small model, the learning rate is set to 5 × 10^-4 and the dropout rate to 0.3. For the transformer_base model, the learning rate is set to 0.001 and the dropout rate to 0.1. The Adam optimizer [38] is used with β₁ = 0.9, β₂ = 0.98 and ɛ = 10^-9. The beam size is 5 and the maximum number of epochs is 30 for all translation tasks. Default settings are used for the other parameters.

Inspired by self-training [10], for the IWSLT15 En→Vi and WMT18 En→Tr translation tasks, we adopt a source-to-target NMT model to translate source sentence trunks. Moreover, to compare with previous data augmentation methods, this paper adopts BLEU [39] and SacreBLEU [40] as the evaluation metrics for several translation tasks based on the transformer_base and the transformer_small models. The models with the best performance on the validation set are evaluated. To account for variance, all experiments are run three times and the median BLEU is reported.

4.2 Main results and analyses

4.2.1 Performance on several translation tasks

Tables 2 and 3 show that both the Mixture and Concatenation strategies for sentence trunks augmentation (STA) proposed in this paper outperform the baseline. On the IWSLT15 En→Vi translation task, the two strategies perform similarly (29.82 vs. 29.76 BLEU). On all other translation tasks, the Concatenation strategy performs better than the Mixture strategy.

Table 2
Performance on several translation tasks with the BLEU and SacreBLEU metrics based on the transformer_small model

Method	De→En (SacreBLEU)	Es→En (SacreBLEU)	Tr→En (BLEU)	En→Tr (BLEU)
Baseline	36.59	42.66	14.63	13.13
STA(Mixture)	36.93	42.99	15.11	13.65
STA(Concatenation)	37.40	43.26	15.83	14.80

Table 3

Performance on several translation tasks with the BLEU metric based on the transformer_base model

Method	De→En	Zh→En	Vi→En	En→Vi
Baseline	34.20	17.03	26.06	28.19
STA(Mixture)	34.67	17.74	27.10	29.82
STA(Concatenation)	35.01	17.86	27.55	29.76

On real low-resource language pairs, e.g., the WMT18 Tr→En and IWSLT15 Vi→En translation tasks, the Concatenation strategy improves BLEU over the baseline by +1.20 and +1.49, respectively (Tables 2 and 3). Similar results are found on the other translation tasks. These results indicate that the STA method consistently improves performance across multiple language pairs in both simulated and real low-resource translation tasks.
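The gains over the baseline can be recomputed directly from the scores reported in Tables 2 and 3. The short Python sketch below does this for the two real low-resource tasks; the dictionary layout and the function name `gains` are illustrative choices, while the scores themselves are copied from the tables.

```python
from typing import Dict

# BLEU scores copied from Table 2 (Tr→En) and Table 3 (Vi→En).
SCORES: Dict[str, Dict[str, float]] = {
    "Tr→En": {"Baseline": 14.63, "Mixture": 15.11, "Concatenation": 15.83},
    "Vi→En": {"Baseline": 26.06, "Mixture": 27.10, "Concatenation": 27.55},
}


def gains(scores: Dict[str, Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Return the BLEU difference of each strategy relative to the baseline."""
    result: Dict[str, Dict[str, float]] = {}
    for task, by_method in scores.items():
        baseline = by_method["Baseline"]
        result[task] = {
            method: round(score - baseline, 2)
            for method, score in by_method.items()
            if method != "Baseline"
        }
    return result


DELTAS = gains(SCORES)
for task, by_method in DELTAS.items():
    print(task, by_method)
```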

4.2.2 Effect of pseudo-parallel corpus size

We perform further experiments, increasing the pseudo-parallel corpus size for augmentation gradually to validate the performance of the NMT model. Four settings are considered: +25%, +50%, +75% and +100%, respectively. As shown in Fig. 3, for all language pairs, both the Mixture strategy and the Concatenation strategy outperform the baseline substantially when increasing the pseudo-parallel data, and the Concatenation strategy achieved better results. Compared with the baseline, the improvements of the Concatenation strategy range from 0.21 in the De→En translation task to 1.67 in the En→Tr translation task.
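One way to build the four settings is to draw a fixed fraction of the pseudo-parallel corpus before applying a strategy. The Python sketch below assumes that the percentages refer to fractions of the pseudo-parallel corpus and that pairs are drawn uniformly at random with a fixed seed; the text does not state how the subsets were selected, so both points, and the name `subsample`, are assumptions.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def subsample(pseudo: List[Pair], fraction: float, seed: int = 0) -> List[Pair]:
    """Draw `fraction` of the pseudo-parallel pairs without replacement."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    k = round(len(pseudo) * fraction)
    rng = random.Random(seed)  # fixed seed keeps the subsets reproducible
    return rng.sample(pseudo, k)


# Toy pseudo-parallel corpus of 100 hypothetical pairs.
toy_pseudo = [(f"src {i}", f"tgt {i}") for i in range(100)]
subset_sizes = {f: len(subsample(toy_pseudo, f)) for f in (0.25, 0.50, 0.75, 1.00)}
print(subset_sizes)
```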

Fig. 3

△BLEU scores (above) for Mixture, and △BLEU scores (below) for Concatenation. The result of De→En is from Table 2.

4.2.3 Comparison with other data augmentation (DA) methods

To validate the performance of the sentence trunks augmentation (STA) method, we compare it with several existing DA methods on three translation tasks. For a fair comparison of the experiments, the same settings are adopted as those in previous work.

For the De→En translation task in Table 4, we evaluate the translation quality with the SacreBLEU metric based on the transformer_small model; the proposed method outperforms the other reported methods. For the De→En translation task in Table 6, the Concatenation strategy achieves results competitive with the Augment_{source+target} method used in [41] and outperforms the other methods. The results on the other translation tasks listed in Tables 4, 5 and 6 also show that the STA method is competitive with existing DA methods.

Table 4
Performance of several data augmentation methods on De→En and Es→En translation tasks with the SacreBLEU metric based on the transformer_small model. (*) is from Cheng et al., [35]. (↑) is from Maimaiti et al., [41]

Method	De→En	Es→En
Other Reported Results
Baseline^*	33.62	40.87
${CMLM}_{hard}^{*}$	35.07	41.45
${CMLM}_{soft}^{*}$	35.31	42.01
Baseline^↑	36.50	/
Blanking^↑	36.80	/
Dropout^↑	36.40	/
Replacement^↑	36.20	/
Our Works
Baseline	36.59	42.66
STA(Mixture)	36.93	42.99
STA(Concatenation)	37.40	43.26

Table 5

Performance of several data augmentation methods on En→Vi translation task with the BLEU metric based on the transformer_base model. (*) is from Wang et al., [12]

Method	En→Vi
Other Reported Results
Transformer^*	27.97
+WordDropout^*	28.56
+SwitchOut^*	28.67
+RAML^*	28.88
+RAML + WordDropOut^*	28.86
+RAML + SwitchOut^*	29.09
Our Works
Baseline	28.19
STA(Mixture)	29.82
STA(Concatenation)	29.76

Table 6

Performance of several data augmentation and translation methods on De→En translation task with the BLEU metric based on the transformer_base model. (*) is from Maimaiti et al., [41]

Method	De→En
Other Reported Results
Transformer^*	33.53
BT^*	33.69
Copy^*	34.63
Swap^*	33.98
Dropout^*	34.68
Blank^*	34.83
Smooth^*	34.85
SwitchOut^*	34.75
SCA^*	34.89
${Augment}_{source + target}^{*}$	35.14
Our Works
Baseline	34.20
STA(Mixture)	34.67
STA(Concatenation)	35.11

4.2.4 Performance by sentence length

Figure 4 shows the results of several models for varying sentence lengths. Across all sentence length intervals and all translation tasks, the Concatenation strategy (below) improves BLEU scores substantially over the baseline. The Mixture strategy (middle), however, improves only when the sentence length ranges from 11 to 50. We contend that the sentence trunk is of poor quality when the original sentence is too long or too short, and that the quality of the pseudo-parallel corpus therefore affects the performance of the NMT model.

Fig. 4

Analysis by sentence length: percentage of test set (above), △BLEU scores (middle) for Mixture, and △BLEU scores (below) for Concatenation. The result of De→En is from Table 2.

The Concatenation strategy thus performs better than the Mixture strategy on long sentences, and it works well for both short and long sentences. We contend that, because the sentences in the pseudo-parallel corpus generated by Concatenation are longer than those in the original training set, they improve the ability of the NMT model to translate long sentences.
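An analysis by sentence length first groups the test sentences into length intervals and then scores each group separately. The Python sketch below shows only the grouping step. It assumes whitespace tokenisation, intervals of width 10, and a single open interval above 50 tokens; the exact intervals used in Fig. 4 are not given in the text, so these settings and the function names are assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def length_bucket(n_tokens: int, width: int = 10, max_len: int = 50) -> str:
    """Map a token count to an interval label such as '11-20' or '>50'."""
    if n_tokens > max_len:
        return f">{max_len}"
    n_tokens = max(n_tokens, 1)  # treat empty sentences as length 1
    low = ((n_tokens - 1) // width) * width + 1
    return f"{low}-{low + width - 1}"


def group_by_length(sentences: List[str]) -> Dict[str, List[int]]:
    """Return, for each interval, the indices of the sentences that fall in it."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for index, sentence in enumerate(sentences):
        groups[length_bucket(len(sentence.split()))].append(index)
    return dict(groups)


# Toy test set (hypothetical sentences, for illustration only).
toy_sources = ["a b c", " ".join(["w"] * 11), " ".join(["w"] * 50), " ".join(["w"] * 51)]
toy_groups = group_by_length(toy_sources)
print(toy_groups)
```

The hypotheses and references of each index group can then be passed to a BLEU implementation to obtain one score per interval.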

4.2.5 Training loss and BLEU score

To examine the convergence of the sentence trunks augmentation (STA) method, we compute the BLEU score on the validation set and the training loss at different epochs on two translation tasks. Figs. 5–6 show that the STA method converges faster than the baseline. In particular, the Concatenation strategy achieves a better BLEU score and a lower loss early in the training process. In our opinion, the pseudo-parallel sentence pairs enrich the diversity of the training data, which helps the neural machine translation model perform better.

Fig. 5

BLEU score (left) and training loss (right) at different epochs on the Zh→En translation task.

Fig. 6

BLEU score (left) and training loss (right) at different epochs on the Vi→En translation task.

4.2.6 Performance of comparative and stacked methods

The back-translation (BT) method is widely adopted to augment the training set for many low-resource neural machine translation (NMT) tasks. To evaluate the robustness of the sentence trunks augmentation (STA) method, we perform several experiments that combine the STA method with the BT method, using 100k extra monolingual English sentences from the WMT14 translation task. Another pseudo-parallel corpus is thus generated from the monolingual data with the target-to-source NMT model.

As can be seen in Table 7, the NMT model performs noticeably better than the baseline when the BT method is used with the additional monolingual data. On the Vi→En translation task, BT even outperforms both the STA(Mixture) and STA(Concatenation) methods. We contend that BT with extra monolingual data enriches the diversity of the training set more than the STA method does. However, combining STA with BT improves the translation quality over STA alone for both strategies, and STA(Concatenation)+BT achieves the best result on both tasks. This suggests that BT and STA are complementary ways of improving the translation quality.

Table 7
Performance of back-translation (BT) compared to and combined with sentence trunks augmentation (STA)

Method	Vi-En	Tr-En
Transformer	26.06	14.63
+BT	27.65	15.59
+STA(Mixture)	27.10	15.11
+STA(Mixture)+BT	27.57	15.43
+STA(Concatenation)	27.55	15.83
+STA(Concatenation)+BT	27.89	16.17

4.2.7 Ablation study

Previous experiments have shown that the Concatenation strategy steadily improves performance on all the translation tasks. Viewed from another perspective, the original training sentence makes up the first half of each concatenated pseudo-parallel sentence, while the sentence trunk makes up the second half. It is therefore unclear whether the improvement comes from replicating the original data or from the sentence trunks. To address this issue, we copy the parallel sentences in the training set whose target-side sentence trunk was extracted successfully, and combine these copies with the original training set as an additional parallel corpus. This method is named Double. Figure 7 shows the improvement of several data augmentation methods over the baseline on the corresponding translation tasks. Unexpectedly, the Double strategy even outperforms the Mixture strategy. Nevertheless, the Concatenation strategy brings a larger improvement than the Double strategy, which supports the effectiveness of the proposed Concatenation strategy.
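The Double control can be sketched in a few lines of Python. As in the description above, only the pairs whose target-side trunk was extracted successfully are copied; the sketch assumes a list `pseudo` that is index-aligned with the training set and holds `None` where extraction failed, which is an illustrative convention, as is the function name `double`.

```python
from typing import List, Optional, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def double(corpus: List[Pair], pseudo: List[Optional[Pair]]) -> List[Pair]:
    """Double: original pairs plus one verbatim copy of every pair
    whose sentence trunk was extracted successfully."""
    copies = [pair for pair, p in zip(corpus, pseudo) if p is not None]
    return list(corpus) + copies


# Toy data (hypothetical sentences, for illustration only).
toy_corpus = [("a b c d", "A B C D"), ("e f", "E F")]
toy_pseudo = [("a d", "A D"), None]
toy_double = double(toy_corpus, toy_pseudo)
print(toy_double)
```

Because Double adds exactly as many pairs as Concatenation does, comparing the two isolates the contribution of the sentence trunks from that of plain data replication.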

Fig. 7

△BLEU scores for different translation tasks. The result of De→En is from Table 2.

4.2.8 Comparison of different parsers

To verify whether the performance of the STA method is affected by the result of the syntactic analysis, we test it with another parser, the Berkeley parser [42], on two translation tasks. The experimental results are listed in Table 8. The STA method achieves a stable improvement over the baseline with both syntactic parsers, which shows that it is robust to the choice of parser.

Table 8
Performance of different parsers on two translation tasks with the BLEU metric

Method	Tr→En (transformer_small), Berkeley	Tr→En (transformer_small), Stanford	Vi→En (transformer_base), Berkeley	Vi→En (transformer_base), Stanford
Baseline	14.63	14.63	26.06	26.06
STA(Mixture)	15.28	15.11	27.17	27.10
STA(Concatenation)	15.78	15.83	27.38	27.55

In fact, sentence trunks extracted by the STA method may deviate from grammatical rules. Ungrammatical sentence trunks can be regarded as data noise, which may improve the robustness of the model.

4.2.9 Comparison of source sentence length

We analyse the source sentence length of the training set under the two proposed strategies (Mixture and Concatenation) on different translation tasks. The results are displayed in Fig. 8. The proportion of each sentence length interval is more balanced for the Concatenation strategy than for the baseline, whereas the Mixture strategy aggravates the unbalanced distribution of the original training set. We contend that adding long sentences to the training set improves the ability of the NMT model to deal with long sentences. This conjecture is consistent with the experimental results in section 4.2.4.

Fig. 8

Analysis by source sentence length in the training set: percentage of training set (above) for Baseline, percentage of training set (middle) for Mixture, and percentage of training set (below) for Concatenation.

4.2.10 Case study

Table 9 shows a case in which the STA method works well, whereas Table 10 shows a case in which the translation quality deteriorates. In Table 9, the translations of STA (Mixture) and STA (Concatenation) are closer to the reference than that of the baseline, although they are not perfect. In Table 10, the translation of the STA (Concatenation) strategy contains a repetitive output error that does not appear in the translation of the baseline. This type of error occurs several times for short sentences in the test set, which suggests that a model trained to generate long sentences may produce unnatural repetition when the expected output is short. Table 11 lists a sample translation from Turkish to English. The translations generated by the STA method are closer to the reference than that of the baseline, although the translation quality can still be improved. These cases illustrate that the STA method plays a positive role in low-resource language translation tasks.

Table 9
A German example of the effectiveness of the STA method

Source Sentence	Vielleicht wollte ich so erfolgreich und so fähig sein verantwortung zu übernehmen, dass ich so handelte und so wäre ich in der lage für meine patienten zu sorgen ohne ihn kontaktieren zu müssen.
Reference	Maybe i wanted to be so successful and so able to take responsibility that i would do so and i would be able to take care of my attending’s patients without even having to contact him.
Baseline	Maybe i wanted to be so successful, and so be able to take responsibility, so i would be able to take care of my patients without having to take it.
STA(Mixture)	Maybe i wanted to take so successful and so capable of taking responsibility that i was so amazed, and so i would be able to take care of my patients without having to have them connected.
STA(Concatenation)	Maybe i wanted to be so successful and so able to take responsibility that i would be able to get my patients to be able to do so without having to get him started.

Table 10

A German example where the STA method may cause errors

Source Sentence	Sie hatten dieses programm an die ägyptische regierung verkauft.
Reference	They had sold this tool to the egyptian government.
Baseline	They had sold this program to the corporate government.
STA(Mixture)	They had sold this program to the educated government.
STA(Concatenation)	They had sold this program to the call from the call from the call from the government.

Table 11

A Turkish example of the effectiveness of the STA method

Source Sentence	Bu üç dönemde hazırlanmış üç değişiklik var: ekonomik, sosyal, siyasal.
Reference	There are three changes prepared in these three periods: economic, social and political.
Baseline	Three changes have been drafted in these three years: economic, social, political.
STA(Mixture)	There are three changes in these three years: economic, social, political.
STA(Concatenation)	There are three changes prepared in these three times: economic, social and political.

5 Conclusion

This paper proposes a straightforward yet effective method named STA to augment the training set of neural machine translation for low-resource language pairs. STA extracts sentence trunks from the constituency parse trees of the training set and back-translates them to generate a pseudo-parallel corpus, which the Mixture and Concatenation strategies then merge with the original data. Experimental results on simulated low-resource language pairs (e.g., German→English) and real low-resource language pairs (e.g., Turkish↔English) show that the STA method improves over the strong baselines, with gains of 0.33 to 1.67 points across the four directions in Table 2. In addition, the experimental results indicate that the STA method is more competitive than other available data augmentation approaches. In future work, we intend to investigate data augmentation at a finer granularity, e.g., n-grams, and to study our method on other natural language generation tasks, such as text summarization.
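The pipeline summarized above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used in the experiments: the set of pruned constituent labels (`MODIFIER_LABELS`), the placeholder `fake_back_translate`, and the exact form of the `mixture` and `concatenation` functions are all hypothetical readings of the method, and the bracketed parse of the Table 10 reference sentence is written by hand.

```python
from typing import Callable, List, Tuple, Union

Tree = Union[str, list]   # a leaf token, or [label, child, child, ...]
Pair = Tuple[str, str]    # (source sentence, target sentence)

# Hypothetical pruning rule: constituents treated as removable modifiers.
MODIFIER_LABELS = {"PP", "ADVP", "ADJP", "SBAR"}


def parse_bracketed(text: str) -> Tree:
    """Parse a Penn-style bracketed tree such as "(S (NP (PRP They)) ...)"."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read() -> Tree:
        nonlocal pos
        if tokens[pos] != "(":
            leaf = tokens[pos]
            pos += 1
            return leaf
        pos += 1                     # consume "("
        node: list = [tokens[pos]]   # constituent label
        pos += 1
        while tokens[pos] != ")":
            node.append(read())
        pos += 1                     # consume ")"
        return node

    return read()


def trunk_tokens(tree: Tree) -> List[str]:
    """Collect leaf tokens, skipping subtrees labeled as modifiers."""
    if isinstance(tree, str):
        return [tree]
    if tree[0] in MODIFIER_LABELS:
        return []
    return [tok for child in tree[1:] for tok in trunk_tokens(child)]


def make_pseudo_pair(target_parse: str,
                     back_translate: Callable[[str], str]) -> Pair:
    """Back-translate the target-side trunk to obtain a pseudo source."""
    trunk = " ".join(trunk_tokens(parse_bracketed(target_parse)))
    return back_translate(trunk), trunk


def mixture(original: List[Pair], pseudo: List[Pair]) -> List[Pair]:
    """Assumed reading: pseudo pairs become extra training examples."""
    return original + pseudo


def concatenation(original: List[Pair], pseudo: List[Pair]) -> List[Pair]:
    """Assumed reading: each original pair is joined with its pseudo pair."""
    return [(f"{src} {p_src}", f"{tgt} {p_tgt}")
            for (src, tgt), (p_src, p_tgt) in zip(original, pseudo)]


def fake_back_translate(sentence: str) -> str:
    """Placeholder for a trained target-to-source NMT model."""
    return "<src> " + sentence


original = [("Sie hatten dieses programm an die ägyptische regierung verkauft.",
             "They had sold this tool to the egyptian government.")]
# Hand-written parse of the reference sentence, for illustration only.
parse = ("(S (NP (PRP They)) (VP (VBD had) (VP (VBN sold) "
         "(NP (DT this) (NN tool)) "
         "(PP (TO to) (NP (DT the) (JJ egyptian) (NN government))))) (. .))")
pseudo = [make_pseudo_pair(parse, fake_back_translate)]
print(pseudo[0][1])
print(len(mixture(original, pseudo)), len(concatenation(original, pseudo)))
```

Under these assumptions, the Mixture reading enlarges the training set with shorter pairs, while the Concatenation reading keeps the number of examples fixed and lengthens each one.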

Acknowledgments

We would like to thank the anonymous reviewers for their valuable comments, and Jiarui Li for his helpful advice on improving the paper. This research was supported by grants 2021-YKLH-12 and 2022-YKLH-18 from the Natural Science Foundation of Liaoning Province, China.

Table 8

Performance of different parsers on two translation tasks with the BLEU metric

Method	Tr→En (transformer_small), Berkeley	Tr→En (transformer_small), Stanford	Vi→En (transformer_base), Berkeley	Vi→En (transformer_base), Stanford
Baseline	14.63	14.63	26.06	26.06
STA(Mixture)	15.28	15.11	27.17	27.10
STA(Concatenation)	15.78	15.83	27.38	27.55

Table 2

Performance on several translation tasks with the BLEU and SacreBLEU metric based on the transformer_small model

Method	SacreBLEU	BLEU
	De→En	Es→En	Tr→En	En→Tr
Baseline	36.59	42.66	14.63	13.13
STA(Mixture)	36.93	42.99	15.11	13.65
STA(Concatenation)	37.40	43.26	15.83	14.80

Table 7

Performance of back-translation (BT) compared to and combined with sentence trunks augmentation (STA)

Method	Vi-En	Tr-En
Transformer	26.06	14.63
+BT	27.65	15.59
+STA(Mixture)	27.10	15.11
+STA(Mixture)+BT	27.57	15.43
+STA(Concatenation)	27.55	15.83
+STA(Concatenation)+BT	27.89	16.17
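
To make the comparison in Table 7 easier to read, the short script below computes each system's gain over the Transformer baseline. The scores are copied from the table; the variable names are ours and the script is only a reading aid, not part of the experimental setup.

```python
# Scores copied from Table 7: (Vi-En, Tr-En).
scores = {
    "Transformer": (26.06, 14.63),
    "+BT": (27.65, 15.59),
    "+STA(Mixture)": (27.10, 15.11),
    "+STA(Mixture)+BT": (27.57, 15.43),
    "+STA(Concatenation)": (27.55, 15.83),
    "+STA(Concatenation)+BT": (27.89, 16.17),
}

base_vi, base_tr = scores["Transformer"]

# Gain of every system over the baseline, rounded to two decimals.
gains = {
    method: (round(vi - base_vi, 2), round(tr - base_tr, 2))
    for method, (vi, tr) in scores.items()
}

for method, (gain_vi, gain_tr) in gains.items():
    print(f"{method:<24} {gain_vi:+.2f} {gain_tr:+.2f}")
```

On both tasks, STA(Mixture)+BT stays slightly below BT alone (27.57 vs. 27.65 and 15.43 vs. 15.59), whereas STA(Concatenation)+BT gives the largest gain over the baseline.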