A multi-domain adaptive neural machine translation method based on domain data balancer

Abstract

Most methods for multi-domain adaptive neural machine translation (NMT) currently rely on mixing data from multiple domains in a single model to achieve multi-domain translation. However, this mixing can lead to imbalanced training data, causing the model to focus on training for the large-scale general domain while ignoring the scarce resources of specific domains, resulting in a decrease in translation performance. In this paper, we propose a multi-domain adaptive NMT method based on Domain Data Balancer (DDB) to address the problems of imbalanced data caused by simple fine-tuning. By adding DDB to the Transformer model, we adaptively learn the sampling distribution of each group of training data, replace the maximum likelihood estimation criterion with empirical risk minimization training, and introduce a reward-based iterative update of the bilevel optimizer based on reinforcement learning. Experimental results show that the proposed method improves the baseline model by an average of 1.55 and 0.14 BLEU (Bilingual Evaluation Understudy) scores respectively in English-German and Chinese-English multi-domain NMT.

Keywords

Multi-domain adaptation; machine translation; domain data balancer; empirical risk minimization

1. Introduction

In recent years, Neural Machine Translation (NMT) has shown great performance due to the support of large-scale corpora [47, 9, 52]. However, because it is trained on a large amount of general-purpose data, its translation performance in specific domains such as law and medicine is often unsatisfactory. The use of specialized terminology and text styles in specific domains can confuse the model, and the scarcity of specialized corpus data makes it difficult for the model to learn suitable parameters. Multi-domain adaptive NMT [13, 43] aims to construct a single unified model by blending data from the general domain and multiple specific domains [26, 27], such as law and medicine, to accurately translate texts with distinct styles or vocabularies [43].

Although multi-domain adaptive NMT shows great potential, there often exists an imbalance in training data between different domains of the same language. Specifically, as shown in Table 1, the amount of general domain training data is much larger than that of specific domains. Due to this data imbalance, in the process of multi-domain joint training, the model tends to focus more on the general domain and resource-rich domains, leading to a decrease in translation performance in low-resource [45, 1] domains, ultimately affecting the overall translation performance of the model.

To address this issue, many researchers have attempted various approaches. In Architecture-Centric Adaptation methods, Bapna et al. [4] proposed injecting domain-specific adapter modules into each layer of a general-domain model and fine-tuning the domain-specific adapter modules by freezing the parameters of the general-domain model. This greatly reduces the amount of data required for fine-tuning but also leads to an increase in model capacity. In Data-Centric Adaptation methods, Zhang et al. [55] attempted to apply curriculum learning [6] to domain adaptation, gradually transitioning unlabeled network data to the corresponding target domain through curriculum design to expand target domain data. However, this approach may also introduce noise to the original text style of the target domain. In Training Schemes for Adaptation, Liang et al. [30] identified and froze the most informative parameters in the general-domain model, pruned unnecessary parameters, and trained and adjusted a specific domain subnetwork using a small amount of domain-specific data. However, when dealing with a large number of domains, parameter interference may occur among the subnetworks.

This paper proposes a multi-domain adaptive NMT method based on Domain Data Balancer (DDB) to address the issue of data imbalance. Drawing on the idea of domain control [25], we add the DDB to the Transformer [52] model. It can adaptively learn the sampling distribution of each group of training data, select suitable training data from the original training sets of various domains for model training, and replace the Maximum Likelihood Estimation (MLE) [5, 29] function with the Empirical Risk Minimization (ERM) [8] training criterion to optimize the expected evaluation metrics. In addition, we introduce a reward function for reinforcement learning [54] to iteratively update the bilevel optimizer. Experimental results demonstrate that the proposed method effectively solves the problem of data imbalance and improves the translation performance of both general and specific domains.

Our main contributions can be summarized as follows:

• We propose a multi-domain adaptive neural machine translation method based on Domain Data Balancer, which effectively addresses data imbalance issues in multi-domain settings and substantially improves the translation performance of the model.
• We employ the adaptive learning capability of DDB to select appropriate training data from the original training sets of each domain. By leveraging this capability, we can focus on the most relevant and informative data, thus enhancing the training process and achieving better translation results across multiple domains.
• We adopt the Empirical Risk Minimization training criterion in conjunction with a bilevel optimizer to update both the Domain Data Balancer and model parameters. This approach ensures that the model benefits from the balanced training across domains, contributing to improved translation performance and overall model effectiveness.

This paper is organized as follows. Section 1 introduces the background and significance of the study, reviews the research status of multi-domain adaptive NMT, and analyzes the key technical issues involved in this paper. Section 2 provides an overview of the related work in multi-domain adaptive NMT. Section 3 describes the fundamentals of multi-domain adaptive NMT. Section 4 describes the design of our method. Section 5 evaluates the method described in Section 4, conducts experiments on domain data balancing and ERM, and analyzes the experimental results. Section 6 discusses the limitations, theoretical implications, and practical significance of our research. Section 7 summarizes the main work of this paper and outlines future work.

2. Related work

Various approaches have been proposed to address the challenge of multi-domain adaptation in NMT. Chu et al. [11] proposed adjusting the model on top of fine-tuning, by training on both general and specific domain data while oversampling the specific domain samples. Kobus et al. [25] concatenated the input word embeddings and domain embeddings to enable the model to recognize the input domain information during training, thereby preserving the domain-specific features. Tars and Fishel [48] treated different domains as different languages and used multilingual NMT to train a multi-domain translation system. Zhang et al. [55] applied curriculum learning to domain adaptation, gradually transitioning unlabeled network data to the target domain, thereby expanding the target domain data and improving performance. Bapna and Firat [4] injected domain-specific adapter [53] modules into every layer of a general domain model and fine-tuned the adapters for specific domains while freezing the general domain parameters. Chu et al. [10] proposed incorporating multilingual data into a multi-domain adaptation framework to improve the translation quality of resource-scarce specific domains. Gu et al. [20] adjusted the model using knowledge distillation, where a trimmed model was trained with specific domain data under the guidance of an untrimmed model trained on a general domain, and then expanded to the original size as a specific domain parameter model. Liang et al. [31] proposed a method similar to that of Gu et al. [20], where the most informative parameters in the general domain model were identified and frozen, unnecessary parameters were pruned, and the specific domain subnetwork was discovered and adjusted with specific domain data. Recently, Morishita et al. [37] proposed a domain-adaptation method that efficiently gathers in-domain parallel sentences from the web with the assistance of crowdworkers. This approach can swiftly collect target-domain parallel data within a few days at a reasonable cost. Hendy et al. [21] proposed Domain Specific Sub-networks. They achieved this by pruning and masking to partition a sub-network for each domain, which includes shared parameters with other domains as well as domain-specific parameters. Cao et al. [7] proposed enhancing the datastore retrieval of k-Nearest Neighbor Machine Translation (kNN-MT) by reconstructing the original datastore. They utilized a reviser to improve key representations, enabling a better fit in the downstream domain. Despite these developments, challenges persist, including effectively balancing shared and domain-specific information, as well as handling domain shifts. Our work addresses the problems of imbalanced data caused by simple fine-tuning.

3. Background

3.1 Multi-domain adaptive neural machine translation model architecture

The Neural Machine Translation (NMT) [3, 8, 52] architecture views all inputs as a series of standard ‘tokens’ [14, 36], which are denoted as integer values. Specifically, the NMT model receives an input $x=x_{1},\dots,x_{i},\dots,x_{m}$ , where each $x_{i}$ belongs to the source language vocabulary, and generates an output integer sequence $y=y_{1},\dots,y_{j},\dots,y_{n}$ , where each $y_{j}$ belongs to the target language vocabulary [35]. The standard paradigm of multi-domain adaptive NMT is a fully shared model trained on bilingual translations from all domains. A domain tag is attached to the source text to indicate the target domain, i.e. $x=\{tag,x_{1},\dots,x_{i},\dots,x_{m}\}$ [24]. Multi-domain adaptive NMT is often framed as multi-task optimization, in which each task corresponds to a domain translation direction, e.g. $En\Rightarrow\textit{Novel}$ . Although numerous NMT networks are capable of learning mappings between $x$ and $y$ that can be applied to unseen $x$ during inference, our attention is specifically directed toward the Transformer, which has emerged as the de facto standard for NMT in recent years.
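
The tagging step described above can be illustrated with a minimal sketch. The function name `add_domain_tag` and the angle-bracket tag format are assumptions made for illustration; the text does not specify how the tag token is spelled.

```python
from typing import List


def add_domain_tag(tokens: List[str], domain: str) -> List[str]:
    """Prepend a domain tag to a tokenized source sentence.

    The "<domain>" spelling is a hypothetical convention, not taken from the paper.
    """
    tag = f"<{domain}>"
    return [tag] + list(tokens)


# Example: a source sentence routed to the (hypothetical) "novel" domain tag.
tagged = add_domain_tag(["it", "was", "a", "cold", "night"], "novel")
print(tagged)
```

In practice the tag would also have to be added to the source vocabulary so that it maps to its own integer id.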

3.2 Likelihood estimation

The prevalent training objective in NMT is to adjust the weights $\theta$ along the gradient direction of the log-likelihood of training examples $D$ , aiming to maximize the log-likelihood. This method is referred to as Maximum Likelihood Estimation (MLE):

$\displaystyle\hat{\theta}_{\textit{MLE}}=\arg\max_{\theta}\sum_{(x,y)\in D}\log P(y\mid x;\theta)$ (1)

Unlike NMT, multi-domain NMT has parallel corpora from multiple domains, $D_{1},D_{2},\ldots,D_{K}$ , and its MLE is:

$\displaystyle\hat{\theta}_{\textit{MLE}}=\arg\max_{\theta}\sum_{k=1}^{K}\sum_{(x,y)\in D_{k}}\log P(y\mid x;\theta)$ (2)

3.3 Bilevel optimization

To address the limitation of traditional optimization algorithms, which focus solely on updating model parameters under a fixed set of hyperparameters, bilevel optimization [15] was introduced to harness the full potential of the model and achieve optimal performance. In bilevel optimization, the outer level is responsible for adjusting hyperparameters such as learning rate, regularization strength, or optimization strategy. These hyperparameters control the behavior and performance of the optimization process. The inner level, on the other hand, focuses on updating the model parameters based on the selected hyperparameters. Together, the two levels provide a structured and systematic approach to improving optimization performance and enhancing the overall capabilities of machine learning models.
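
As a rough illustration of the two levels, the following toy sketch tunes a single hyperparameter (an L2 regularization strength `lam`) in the outer loop while the inner loop fits one scalar parameter `w`. The quadratic losses, the target values, and the finite-difference outer gradient are all assumptions chosen to keep the example small; they are not part of the method described in this paper.

```python
def inner_solve(lam, train_target=2.0, steps=200, lr=0.1):
    """Inner level: minimize (w - train_target)^2 + lam * w^2 over w by gradient descent."""
    w = 0.0
    for _ in range(steps):
        grad = 2.0 * (w - train_target) + 2.0 * lam * w
        w -= lr * grad
    return w


def valid_loss(w, valid_target=1.0):
    """Validation objective that the outer level tries to minimize."""
    return (w - valid_target) ** 2


def outer_solve(lam=0.0, outer_steps=100, outer_lr=0.2, eps=1e-4):
    """Outer level: adjust lam using a central finite-difference gradient of the validation loss."""
    for _ in range(outer_steps):
        up = valid_loss(inner_solve(lam + eps))
        down = valid_loss(inner_solve(lam - eps))
        grad = (up - down) / (2.0 * eps)
        lam = max(0.0, lam - outer_lr * grad)  # keep the regularization strength non-negative
    return lam


best_lam = outer_solve()
best_w = inner_solve(best_lam)
print(f"lam={best_lam:.3f}, w={best_w:.3f}")
```

For this toy problem the inner optimum is $w^{*}=2/(1+\lambda)$, so the validation loss is minimized at $\lambda=1$, which the outer loop should approach.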

4. Methodology

Notably, in deep learning [3, 47], it is often impossible to feed all the data into the model at once during training. Random sampling [23, 39] is a commonly used data sampling method in NMT, which shuffles the training corpus and randomly selects a certain proportion of data samples from it for training. The goal is to make model training more efficient and accurate while avoiding overfitting to the training data. However, in the random sampling strategy, the proportion of data samples $S_{D}$ is often controlled by a temperature parameter $\tau$ [2, 16], which requires multiple experiments to determine the optimal value and lacks flexibility. The relative data size $s_{i}$ of each domain is raised to the power $1/\tau$ , so a poorly chosen $\tau$ can result in inadequate sampling of certain data types, particularly in multi-domain adaptation scenarios where there can be significant differences in data size between the general and specific domains.

$\displaystyle S_{D_{i}}=\frac{s_{i}^{1/\tau}}{\sum_{j=1}^{K}s_{j}^{1/\tau}}$ (3)

where

$\displaystyle s_{i}=\frac{|D_{i}^{\textit{train}}|}{\sum_{j=1}^{K}|D_{j}^{\textit{train}}|}$ (4)
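
To make Eqs. (3) and (4) concrete, the sketch below computes the temperature-based sampling distribution for the English-German training sizes listed in Table 1. The function name `temperature_sampling` is illustrative, and the choice of $\tau=5$ mirrors the TBS baseline setting reported later in the paper.

```python
def temperature_sampling(sizes, tau):
    """Return the sampling probability of each domain under temperature tau."""
    total = sum(sizes.values())
    # Eq. (4): relative size of each domain's training set.
    s = {name: n / total for name, n in sizes.items()}
    # Eq. (3): raise each share to the power 1/tau and renormalize.
    scaled = {name: share ** (1.0 / tau) for name, share in s.items()}
    norm = sum(scaled.values())
    return {name: value / norm for name, value in scaled.items()}


# Training-set sizes (sentence pairs) from Table 1, English-German.
sizes = {"WMT14": 3_900_000, "IWSLT14": 170_000, "EMEA": 587_000, "Novel": 50_000}

proportional = temperature_sampling(sizes, tau=1.0)  # plain size-proportional sampling
flattened = temperature_sampling(sizes, tau=5.0)     # smaller domains are upsampled

for name in sizes:
    print(f"{name}: tau=1 -> {proportional[name]:.4f}, tau=5 -> {flattened[name]:.4f}")
```

With $\tau=1$ the distribution equals the raw data shares, while larger $\tau$ moves it toward uniform, which is why the value has to be tuned by hand for each dataset.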

Therefore, in order to better optimize $S_{D}$ , we propose the method of Domain Data Balancer (DDB).

4.1 Empirical risk minimization

Before introducing the DDB, we briefly introduce the concept of Empirical Risk Minimization (ERM). In multi-domain adaptive NMT, MLE has drawbacks as it may lead to overfitting and poor performance on new data. Additionally, it may not balance the data distribution of different domains well, leading to poor performance in some domains. Thus, more effective parameter estimation methods are needed to improve the model’s generalization and cross-domain adaptation capabilities. Given $K$ different data domains $D_{1},D_{2},\ldots,D_{K}$ , our goal is to adjust the model parameters $\theta$ to minimize the average empirical risk over all domains, where $R(\theta,D_{k})$ represents the empirical risk of the model on the data domain $D_{k}$ , that is, the expected value of the loss function $\ell(x,y;\theta)$ on the training samples drawn from $S_{D_{k}}(X,Y)$ for that domain. Therefore, our objective is to optimize the model’s performance by minimizing the empirical risk across all data domains:

$\displaystyle\hat{\theta}_{\textit{ERM}}=\underset{\theta}{\operatorname{argmin}}\frac{1}{K}\sum_{k=1}^{K}R(\theta,D_{k})$ (5)

where

$\displaystyle R(\theta,D_{k})=\mathbb{E}_{x,y\sim S_{D_{k}}(X,Y)}[\ell(x,y;\theta)]$ (6)

In order to further optimize the model, we define the following loss function with respect to the model parameters $\theta$ for a single training sentence pair $\langle x,y\rangle$ [38]:

$\displaystyle\ell(x,y;\theta)=\sum_{y^{\prime}}\operatorname{Err}(y,y^{\prime})P(y^{\prime}\mid x;\theta)$ (7)

which is summed over all potential translations $y^{\prime}$ in the target language. In this equation, the function $\operatorname{Err}(\cdot)$ represents an arbitrary error function that we define as 1 – SBLEU $(y,y^{\prime})$ , where SBLEU $(\cdot)$ is the smoothed BLEU score (BLEU $+$ 1) proposed by Lin and Och [32].
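
Since the sum in Eq. (7) runs over all possible translations, it cannot be enumerated directly. The sketch below evaluates the same expression over a small, finite candidate set and renormalizes the model probabilities over that set. Both the restriction to a candidate set and the renormalization are assumptions made for illustration, as are the example probabilities and error values; the error values stand in for $1-\text{SBLEU}(y,y^{\prime})$.

```python
import math


def expected_risk(log_probs, errors):
    """Eq. (7) restricted to a finite candidate set.

    log_probs: model log-probabilities log P(y' | x) of each candidate.
    errors:    Err(y, y') of each candidate, e.g. 1 - SBLEU(y, y').
    Probabilities are renormalized over the candidates (an assumption).
    """
    peak = max(log_probs)
    weights = [math.exp(lp - peak) for lp in log_probs]  # numerically stable softmax
    norm = sum(weights)
    return sum(err * w / norm for err, w in zip(errors, weights))


# Three hypothetical candidates: the most probable one also has the lowest error.
log_probs = [math.log(0.5), math.log(0.3), math.log(0.2)]
errors = [0.1, 0.4, 0.9]
risk = expected_risk(log_probs, errors)
print(f"expected risk: {risk:.4f}")
```

Lowering this quantity pushes probability mass toward candidates with small error, which is how the criterion ties training to the evaluation metric.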

The ERM method balances the contribution of each domain and handles the data distribution mismatch across domains, resulting in improved performance while also preventing overfitting and improving the model’s generalization ability to new and unseen data.

It is worth noting that our method comprises two types of risk: $R(\theta,D^{\textit{train}})$ represents the multi-domain training objective used for the training dataset, while $R(\theta,D^{\textit{valid}})$ serves as the evaluation objective used for the validation dataset.

4.2 Domain data balancer

In this context, we introduce the Domain Data Balancer, which addresses the problem of imbalanced data, allowing for efficient optimization of $S_{D_{i}}$ even with limited domain-specific data. To achieve this, we use bilevel optimization [15] to construct a parameterized balancer that has a second set of parameters $\psi$ , which we modify to learn the training objective $R(\theta,D^{\textit{train}})$ for minimizing the final objective $R(\theta,D^{\textit{valid}})$ .

DDB is a differentiable function that solely relies on the input features of the example $(x,y)$ . It serves as a distribution over the training data, assigning a higher probability to more crucial data points, denoted as $B(x,y;\psi)$ . Unlike previous approaches that often necessitate complex feature engineering for both the model state and the data [17, 22], DDB does not take the model parameters $\theta$ into account as input. Additionally, the parameters of DDB are highly effective in enabling adaptive updating of the network. Moreover, DDB is a lightweight component, and its parameter count remains constant even with an increase in the number of domains.

Before the training process, we use the Xavier initialization [19] method to initialize the model parameters $\theta$ , ensuring that the parameters have similar distributions across different domains. Meanwhile, we randomly initialize the parameters in DDB to ensure that the initial DDB can balance the data distribution across all domains.

Throughout the training process, we continuously iterate and optimize the model parameters $\theta$ and DDB parameters $\psi$ , where the update rules for $\theta$ are as follows:

$\displaystyle\theta_{t}\leftarrow\theta_{t-1}-\nabla_{\theta}\frac{1}{K}\sum_{k=1}^{K}R(\theta_{t-1},D^{\textit{train}}_{k})$ (8)

where

$\displaystyle R(\theta,D^{\textit{train}}_{k})=\mathbb{E}_{x,y\sim B(x,y;\psi)}[\ell(x,y;\theta)]$ (9)

Here, the parameter $\theta$ is updated using gradient descent. At each iteration step $t$ , the new value $\theta_{t}$ is computed by subtracting from the previous value $\theta_{t-1}$ the gradient, with respect to $\theta$ , of the training risk averaged over the $K$ domains. $K$ represents the number of domains, that is, the number of different training datasets used in the optimization.

By calculating the cosine similarity between the gradient of the training data extracted by the balancer and the gradient of the corresponding validation data for each individual domain, we introduce a reward function based on reinforcement learning to update the parameters in the DDB. This function rewards the balancer for selecting training data that is similar to the validation data and punishes it for selecting dissimilar data. Considering the optimization of the multi-domain training objective and the similarity between different domains, we calculate a separate reward function for each domain and take the average of these rewards as the final reward value:

$\displaystyle J(x,y;\theta_{t})\approx\frac{1}{K}\sum_{k=1}^{K}\cos(\nabla R(\theta_{t},D^{\textit{valid}}_{k}),\nabla_{\theta}\ell(x,y;\theta_{t-1}))$ (10)

where $\cos(\cdot)$ is the cosine similarity of two vectors.

The reward function $J(x,y;\theta_{t})$ indicates that the update of the data selection network should weight the data with similar gradients to the validation data $D^{\textit{valid}}_{k}$ . According to the reinforcement learning algorithm, the update rules for the DDB are as follows:

$\displaystyle\psi_{t+1}\leftarrow\psi_{t}+J(x,y;\theta_{t})\cdot\nabla_{\psi}\log B(x,y;\psi)$ (11)

where $\nabla_{\psi}\log B(x,y;\psi)$ represents the gradient of the logarithm of the $B(x,y;\psi)$ with respect to the parameter $\psi$ .
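
The reward in Eq. (10) and the update in Eq. (11) can be sketched on toy vectors as follows. To keep the example self-contained, the balancer is simplified here to a categorical distribution over domains whose logits play the role of $\psi$; this simplification, the two-dimensional gradient vectors, and the optional step size `lr` are assumptions for illustration and do not reproduce the feature-based $B(x,y;\psi)$ described above.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either vector is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def reward(valid_grads, example_grad):
    """Eq. (10): average cosine similarity with each domain's validation gradient."""
    return sum(cosine(g, example_grad) for g in valid_grads) / len(valid_grads)


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def update_balancer(psi, chosen, r, lr=1.0):
    """Eq. (11) for a categorical balancer: psi += lr * r * grad_psi log B(chosen).

    For softmax logits, grad_psi log B(chosen) = one_hot(chosen) - softmax(psi).
    lr = 1.0 matches Eq. (11), which has no explicit step size.
    """
    probs = softmax(psi)
    return [
        p + lr * r * ((1.0 if i == chosen else 0.0) - probs[i])
        for i, p in enumerate(psi)
    ]


# Toy setup: two domains, two-dimensional gradients.
valid_grads = [[1.0, 0.0], [0.0, 1.0]]
example_grad = [1.0, 1.0]      # gradient of a training example drawn from domain 1
r = reward(valid_grads, example_grad)
new_psi = update_balancer([0.0, 0.0], chosen=1, r=r)
print(r, new_psi)
```

A positive reward raises the logit of the sampled domain and lowers the others, so data whose gradients align with the validation gradients are sampled more often in later steps.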

5. Experimental results and discussion

The experiments were conducted on English-German and Chinese-English datasets, each consisting of one general domain and three specific domains, in a Python 3.8.1 environment on a computer running the Ubuntu 20.04 operating system. We used Fairseq [40] as our sequence modeling tool and chose the Transformer-Big as the base framework for our experiments. During the experiments, we set the number of encoder and decoder layers to 6, the embedding dimension and the hidden units of the feedforward network in both the encoder and decoder to 1024 and 4096, respectively, and the number of attention heads in both the multi-head self-attention and cross-attention mechanisms to 16. The initial learning rate was set to $5\times 10^{-4}$ , and we used batched gradient descent to train our model. To prevent overfitting, we set the Dropout [46] parameter to 0.1. During training, the maximum number of tokens per batch was set to 4096. During inference, the beam search [18] parameter was set to 4 for both the English-German and Chinese-English datasets, and the length penalties were set to 0.6 and 1.0, respectively. Furthermore, all of our experiments were evaluated using the sacreBLEU [42] tool, and we report the best average BLEU score [41] obtained from the best-performing model for simplicity.

5.1 Datasets and data preprocessing

We conducted comparative experiments on open datasets for English-German and Chinese-English translation, and the detailed data size statistics for each domain are presented in Table 1.

For the English-German dataset, the training corpus for the general domain was obtained from the news translation task in WMT14, with newstest2013 and newstest2014 serving as the validation and test sets, respectively. The specific domains included TED talks, biomedical texts, and novels. For the TED talks domain, IWSLT14 was used as the training corpus, with dev2010 and tst2014 as the validation and test sets, respectively. For the biomedical domain, the EMEA News Crawl dataset was used as the training corpus, and the Khresmoi Medical Summary Translation Test Data 2.0 was used as the validation and test sets. For the novel domain, the book dataset from OPUS [50] was used as the training corpus, and a few chapters randomly selected from Jane Eyre were used as the validation set, while The Metamorphosis was used as the test set.

For the Chinese-English dataset, the training corpus for the general domain was obtained from the news translation task in WMT17, with newsdev2017 and newstest2017 serving as the validation and test sets, respectively. The specific domains were selected from three domains of papers, spoken language, and education in the UM-Corpus [49] dataset as experimental data.

Because English and German both belong to the West Germanic language branch, a shared vocabulary was used for the English-German dataset. However, because the linguistic difference between Chinese and English is greater, no shared vocabulary was used for the Chinese-English dataset. For the English-German dataset, SentencePiece [28] was used to tokenize the data with a shared vocabulary size of 32768. For the Chinese-English dataset, Stanford NLP [34] and the Moses tokenizer were used to tokenize the Chinese and English text, respectively. Byte Pair Encoding (BPE) [44] was used to perform subword segmentation on both languages, with vocabulary sizes of 44K and 33K, respectively.

Table 1
English-German (EN-DE) and Chinese-English (ZH-EN) open dataset size

	Domain	Train	Valid	Test
EN-DE	WMT14	3.9M	3000	3003
	IWSLT14	170K	6750	1305
	EMEA	587K	500	1000
	Novel	50K	1015	1031
ZH-EN	WMT17	23.97M	2002	2001
	Thesis	75K	1000	1000
	Spoken	75K	1000	1000
	Education	75K	1000	1000

5.2 Baseline

To evaluate the performance of the proposed method, we compared it with several classic and state-of-the-art (SOTA) methods. These methods include:

• General domain model [51]: This model utilizes the Transformer architecture and is solely trained on parallel corpora data from the general domain.
• Fine-tuning [33]: This method first trains the model on a general domain corpus, and then fine-tunes it with specific domain corpus data to continue training.
• Mixed domain model [12]: This model is trained on a mixture of data from the general domain and all specific domains.
• Mixed with domain tags (MDT) [48]: This approach trains a unified multi-domain machine translation model by mixing data from the general domain and all specific domains, with domain tags added to distinguish between data from different domains.
• Sequential PRUNE-TUNE (SPT) [30]: This method identifies and freezes the most informative parameters in the general domain model, then prunes unnecessary network model parameters, and finally fine-tunes specific domain sub-network parameters using a mask matrix and specific domain data.
• Temperature-Based Sampling (TBS) [2, 16]: This model is trained by determining the sampling distribution of each domain of training data using a manually set temperature value $\tau$ , where we set $\tau$ to 5.

5.3 Main results

In this study, we employed the DDB to balance the training data of a multi-domain NMT model in order to enhance its performance. Tables 2 and 3 display our experimental findings on the English-German and Chinese-English public datasets, respectively. Our approach outperformed the baseline models in all domains of the English-German dataset except IWSLT, where only TBS scored higher, and in the WMT17 and Spoken domains of the Chinese-English dataset. This suggests that DDB effectively balances the training data across different domains, preventing the model from overfitting to a specific domain and ignoring information from other domains.

Furthermore, in the general domain of the English-German and Chinese-English datasets, our approach outperformed the SOTA methods with improvements of 1.4 and 3.68 BLEU scores, respectively. These results demonstrate our approach’s capability to effectively address the catastrophic forgetting problem induced by fine-tuning, thereby significantly enhancing translation performance within the general domain. It is worth noting that our method achieved a more significant improvement on the English-German dataset than on the Chinese-English dataset. The average BLEU score improvements were 1.93 and 0.14, respectively. For the English-German dataset, both the general domain and all specific domains achieved higher BLEU scores compared to all SOTA methods except TBS, particularly in the novel domain where there was an improvement of 1.33 BLEU score. The results were remarkably significant. However, the improvement was not as pronounced for the Chinese-English dataset. The performance in the education domain was average and lower than the SOTA methods, while the other three domains showed good performance. We speculate that this is due to the difference between the two datasets. For the English-German dataset, both languages belong to the same language family, which allows DDB to balance the training data of both the source and target languages’ domains reasonably. However, for the Chinese-English dataset, the two languages differ significantly, and DDB may not balance the training data of the source and target languages’ domains effectively, resulting in slightly inferior model performance.

Moreover, while both TBS and our method can control the sampling distribution for each domain, we observe overfitting in the experimental results of TBS on the English-German and Chinese-English datasets. Specifically, in the IWSLT domain of English-German and the Thesis domain of Chinese-English, the TBS BLEU scores (38.75 and 25.47) are far higher than those of the other methods, at the cost of considerably lower scores in other domains. In contrast, our method effectively avoids overfitting and achieves balanced performance across domains, resulting in average BLEU score improvements of 5.94 and 2.45 over TBS on the English-German and Chinese-English datasets, respectively.

Finally, our experimental results show that although the validation sets of the English-German dataset differ in size across domains, whereas the specific domains of the Chinese-English dataset use validation sets of the same size, the impact of the validation set on optimizing the DDB and model parameters is negligible relative to that of the training set. This indicates that our method is robust to how the datasets are constructed.

Table 2
The BLEU score of the English-German dataset

	Model	Domain				Avg.
		General	IWSLT	EMEA	Novel
Baseline	General	28.70	28.50	28.40	14.50	25.03
	Fine-tuning	–	31.50	29.70	23.40	28.20
	Mix	27.90	31.30	32.00	21.20	28.10
	MDT	28.55	30.50	31.70	22.73	28.37
	SPT	28.40	31.90	30.10	23.60	28.50
	TBS	24.36	38.75	18.20	16.64	24.49
	Ours	31.02	32.71	33.04	24.93	30.43

Table 3
The BLEU score of the Chinese-English dataset

	Model	Domain				Avg.
		WMT17	Thesis	Spoken	Education
Baseline	General	20.74	12.60	16.47	17.85	16.92
	Fine-tuning	12.58	16.99	18.64	19.43	16.91
	Mix	20.49	14.55	15.73	17.20	16.99
	MDT	20.07	14.66	16.65	17.49	17.21
	SPT	–	16.20	14.60	31.20	20.67
	TBS	18.56	25.47	14.62	14.78	18.36
	Ours	24.42	18.48	20.90	19.45	20.81
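
The average gains quoted in the text can be rechecked from the per-domain scores in Table 2. The short sketch below does this for the English-German results of SPT, TBS, and our method; because it averages the unrounded per-domain scores, the last printed digit may differ slightly from differences taken between the rounded Avg. column entries.

```python
# Per-domain BLEU scores from Table 2 (General, IWSLT, EMEA, Novel).
table2 = {
    "SPT": [28.40, 31.90, 30.10, 23.60],
    "TBS": [24.36, 38.75, 18.20, 16.64],
    "Ours": [31.02, 32.71, 33.04, 24.93],
}


def mean(scores):
    return sum(scores) / len(scores)


avg = {model: mean(scores) for model, scores in table2.items()}
gain_over_spt = avg["Ours"] - avg["SPT"]
gain_over_tbs = avg["Ours"] - avg["TBS"]

for model, value in avg.items():
    print(f"{model}: {value:.3f}")
print(f"gain over SPT: {gain_over_spt:.3f}")
print(f"gain over TBS: {gain_over_tbs:.3f}")
```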

5.4 Ablation experiment

To validate the effectiveness of the proposed method, we conducted ablation experiments on the English-German dataset.

5.4.1 Influence of the domain data balancer

Table 4 shows that the DDB proposed in this study has a practical effect in improving the performance of multi-domain NMT models. Without the DDB, the BLEU scores decrease in the WMT14, TED talks, and Novels domains, with the largest drop in Novels (from 24.93 to 21.51), while the Biomedical score is essentially unchanged (33.08 versus 33.04). This demonstrates the effectiveness of the proposed method in balancing the training data across various domains. However, it is worth noting that the model without the DDB still exhibits good performance. This could be attributed to the replacement of the MLE training criterion with ERM during the optimization of the data selection network and model parameters. Further analysis will be conducted in other experiments.

Table 4
Influence of the domain data balancer

Model	WMT14	TED talks	Biomedical	Novels
–	29.18	31.31	33.08	21.51
DDB	31.02	32.71	33.04	24.93

5.4.2 Influence of empirical risk minimization

Based on the findings presented in Table 5, it is clear that not using the ERM training criterion to optimize the model parameters and DDB within the unchanged network model framework leads to a decrease in BLEU values across most domains. We hypothesize that training NMT models using the MLE training criterion may result in overfitting in some domains and underfitting in others. In contrast, the ERM training criterion can help the model better adapt to data from multiple domains and exhibit more stable performance in each domain.

Table 5
Influence of empirical risk minimization

Method	WMT14	TED talks	Biomedical	Novels
MLE	27.36	28.61	34.97	22.51
ERM	31.02	32.71	33.04	24.93

5.4.3 Influence of vocabulary sharing

In our experiments, we used a shared vocabulary for the English-German dataset, since both languages belong to the same language family. However, for the Chinese-English dataset, we used non-shared vocabularies due to the larger linguistic differences between the two languages. In order to investigate whether our method can effectively improve translation performance in both shared and non-shared vocabulary settings, and thus adapt to most languages in the world, we conducted experiments on the English-German dataset with non-shared vocabularies, and the results are shown in Table 6.

Table 6
Influence of vocabulary sharing

Method	WMT14	TED talks	Biomedical	Novels
Non-shared vocabulary	25.14	29.35	36.86	22.46
Proposed method	31.02	32.71	33.04	24.93

Based on the information presented in the table, it appears that sharing vocabulary has a greater impact on the performance of general domains, while the impact on specific domains can be positive or negative. Our hypothesis is that not sharing vocabulary causes the model to learn the same words from two different vocabularies, which increases the sparseness of the data. The vast amount of data in the general domain occupies most of the vocabulary, leading to a more significant decrease in translation performance. However, uncommon words that are often overlooked in specific domains benefit from the expanded vocabulary, resulting in improved translation performance in those domains.

6. Limitations and implications

Despite the significant performance improvements achieved by our proposed multi-domain adaptive method based on DDB, we acknowledge several limitations. First, the accuracy of the domain labels in the training data is crucial for the effectiveness of our method, and inaccurate labels may degrade overall performance. Future research should therefore explore approaches for robust domain label annotation. Second, updating the DDB during training introduces additional computational overhead and may extend the training time. This process should be optimized to maintain efficiency while ensuring effective domain adaptation. Finally, our evaluation focuses on English-German and Chinese-English datasets, so the generalizability of our method to other languages and domains remains an open question. Further investigation is needed to assess the applicability and performance of our approach in diverse linguistic scenarios.

Despite these limitations, our method is a promising step towards addressing data imbalance in multi-domain adaptive neural machine translation and provides a foundation for future research in this area.

The theoretical implications of our study concern the advancement of multi-domain adaptive neural machine translation. By introducing the Domain Data Balancer, we propose a method that tackles the data imbalance problem in multi-domain scenarios. This approach extends the understanding of how to leverage domain-specific data efficiently and achieve balanced training across domains. Moreover, the exploration of alternative training criteria, such as the ERM criterion we employed, contributes to the theoretical understanding of different optimization strategies in multi-domain settings.

On a practical level, our study has implications for real-world applications of machine translation. The ability to use domain-specific data effectively allows the model to adapt and perform better in various specialized domains, leading to more accurate and contextually appropriate translations. This can facilitate the deployment of high-quality machine translation systems in domains such as law, medicine, and technology, where domain-specific language patterns and terminology are prevalent.

7. Conclusions

In this paper, we propose a multi-domain adaptive NMT method based on the Domain Data Balancer (DDB) to address the problem of data imbalance in multi-domain adaptive neural machine translation, which can degrade model performance. Our method adaptively learns the sampling distribution of each group of training data, selects appropriate training data from the original training sets of each domain, and uses the ERM training criterion combined with a bilevel optimizer to update both the DDB and the model parameters, achieving balanced training across domains.
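
To illustrate the general idea of learning a sampling distribution over domains from a reward signal, the following is a minimal sketch under strong assumptions, not our implementation. It keeps one logit per domain, samples a domain from the softmax distribution, and applies a REINFORCE-style update to the logits. The reward function `toy_reward`, the learning rate, and the number of steps are hypothetical placeholders; the NMT model update that would alternate with this step in a bilevel scheme is omitted.

```python
import math
import random

domains = ["WMT14", "TED talks", "Biomedical", "Novels"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_reward(domain):
    # Hypothetical stand-in for a reward signal: pretend that only batches
    # from the scarce Biomedical domain currently help the model.
    return 1.0 if domain == "Biomedical" else 0.0


def train_balancer(steps=200, lr=0.1, seed=0):
    """REINFORCE-style update of a categorical sampling distribution."""
    rng = random.Random(seed)
    logits = [0.0] * len(domains)  # start from a uniform distribution
    for _ in range(steps):
        probs = softmax(logits)
        k = rng.choices(range(len(domains)), weights=probs)[0]
        reward = toy_reward(domains[k])
        # Gradient of log p(k) w.r.t. the logits is onehot(k) - probs.
        for j in range(len(logits)):
            grad = (1.0 if j == k else 0.0) - probs[j]
            logits[j] += lr * reward * grad
    return softmax(logits)


final_probs = train_balancer()
for domain, p in zip(domains, final_probs):
    print(f"{domain:<11} sampling probability {p:.3f}")
```

With this toy reward, the sampling probability of the rewarded domain grows above its uniform starting value of 0.25, while the others shrink. In an actual system the reward would have to come from the model's training or validation behaviour, which this sketch does not attempt to reproduce.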

Extensive experiments on public English-German and Chinese-English datasets demonstrate that our method outperforms various strong baselines. Specifically, our method improves average BLEU scores, with gains of approximately 1.93 and 0.14 points on the English-German and Chinese-English datasets, respectively. Notably, on the English-German dataset, both the general domain and the specific domains outperform all state-of-the-art (SOTA) methods, with the Novels domain improving by 1.33 BLEU points.

In future work, we will continue to explore integrating other methods, such as few-shot translation and incremental multi-domain translation, into this framework to further enhance the performance of multi-domain NMT. We will also investigate strategies to address the limitations discussed in Section 6.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grants U21B2027, 61866020, 61972186, and 61732005, and in part by the Yunnan Provincial Major Science and Technology Special Plan Projects under Grants 202103AA080015, 202203AA080004, and 202202AD080003.

