Abstract
Most methods for multi-domain adaptive neural machine translation (NMT) currently rely on mixing data from multiple domains in a single model to achieve multi-domain translation. However, this mixing can lead to imbalanced training data, causing the model to focus on training for the large-scale general domain while ignoring the scarce resources of specific domains, resulting in a decrease in translation performance. In this paper, we propose a multi-domain adaptive NMT method based on Domain Data Balancer (DDB) to address the problems of imbalanced data caused by simple fine-tuning. By adding DDB to the Transformer model, we adaptively learn the sampling distribution of each group of training data, replace the maximum likelihood estimation criterion with empirical risk minimization training, and introduce a reward-based iterative update of the bilevel optimizer based on reinforcement learning. Experimental results show that the proposed method improves the baseline model by an average of 1.55 and 0.14 BLEU (Bilingual Evaluation Understudy) scores respectively in English-German and Chinese-English multi-domain NMT.
Introduction
In recent years, Neural Machine Translation (NMT) has shown great performance due to the support of large-scale corpora [47, 9, 52]. However, because it is trained on a large amount of general-purpose data, its translation performance in specific domains such as law and medicine is often unsatisfactory. The use of specialized terminology and text styles in specific domains can confuse the model, and the scarcity of specialized corpus data makes it difficult for the model to learn suitable parameters. Multi-domain adaptive NMT [13, 43] aims to construct a single unified model by blending data from the general domain and multiple specific domains [26, 27], such as law and medicine, to accurately translate texts with distinct styles or vocabularies [43].
Although multi-domain adaptive NMT shows great potential, there often exists an imbalance in training data between different domains of the same language. Specifically, as shown in Table 1, the amount of general domain training data is much larger than that of specific domains. Due to this data imbalance, in the process of multi-domain joint training, the model tends to focus more on the general domain and resource-rich domains, leading to a decrease in translation performance in low-resource [45, 1] domains, ultimately affecting the overall translation performance of the model.
To address this issue, many researchers have attempted various approaches. In Architecture-Centric Adaptation methods, Bapna et al. [4] proposed injecting domain-specific adapter modules into each layer of a general-domain model and fine-tuning the domain-specific adapter modules by freezing the parameters of the general-domain model. This greatly reduces the amount of data required for fine-tuning but also leads to an increase in model capacity. In Data-Centric Adaptation methods, Zhang et al. [55] attempted to apply curriculum learning [6] to domain adaptation, gradually transitioning unlabeled network data to the corresponding target domain through curriculum design to expand target domain data. However, this approach may also introduce noise to the original text style of the target domain. In Training Schemes for Adaptation, Liang et al. [30] identified and froze the most informative parameters in the general-domain model, pruned unnecessary parameters, and trained and adjusted a specific domain subnetwork using a small amount of domain-specific data. However, when dealing with a large number of domains, parameter interference may occur among the subnetworks.
This paper proposes a multi-domain adaptive NMT method based on Domain Data Balancer (DDB) to address the issue of data imbalance. Drawing on the idea of domain control [25], we add the DDB to the Transformer [52] model. It can adaptively learn the sampling distribution of each group of training data, select suitable training data from the original training sets of various domains for model training, and replace the Maximum Likelihood Estimation (MLE) [5, 29] function with the Empirical Risk Minimization (ERM) [8] training criterion to optimize the expected evaluation metrics. In addition, we introduce a reward function for reinforcement learning [54] to iteratively update the bilevel optimizer. Experimental results demonstrate that the proposed method effectively solves the problem of data imbalance and improves the translation performance of both general and specific domains.
Our main contributions can be summarized as follows:
We propose a multi-domain adaptive neural machine translation method based on Domain Data Balancer, which effectively addresses data imbalance issues in multi-domain settings and substantially improves the translation performance of the model.
We employ the adaptive learning capability of DDB to select appropriate training data from the original training sets of each domain. By leveraging this capability, we can focus on the most relevant and informative data, thus enhancing the training process and achieving better translation results across multiple domains.
We adopt the Empirical Risk Minimization training criterion in conjunction with a dual optimizer to update both the Domain Data Balancer and model parameters. This approach ensures that the model benefits from the balanced training across domains, contributing to improved translation performance and overall model effectiveness.
The structure of this paper is as follows. Section 1 introduces the background and significance of the study, reviews the research status of multi-domain adaptive NMT, and analyzes the key technical issues addressed in this paper. Section 2 provides an overview of related work in multi-domain adaptive NMT. Section 3 describes the fundamentals of multi-domain adaptive NMT. Section 4 describes the design of our method. Section 5 evaluates the method described in Section 4, reports experiments on domain data balancing and ERM, and analyzes the experimental results. Section 6 discusses the limitations, theoretical implications, and practical significance of our research. Section 7 summarizes the main work of this paper and outlines directions for future work.
Various approaches have been proposed to address the challenge of multi-domain adaptation in NMT. Chu et al. [11] proposed adjusting the model on top of fine-tuning, by training on both general and specific domain data while oversampling the specific domain samples. Kobus et al. [25] concatenated the input word embeddings and domain embeddings to enable the model to recognize the input domain information during training, thereby preserving the domain-specific features. Tars and Fishel [48] treated different domains as different languages and used multilingual NMT to train a multi-domain translation system. Zhang et al. [55] applied curriculum learning to domain adaptation, gradually transitioning unlabeled network data to the target domain, thereby expanding the target domain data and improving performance. Bapna and Firat [4] injected domain-specific adapter [53] modules into every layer of a general domain model and fine-tuned the adapters for specific domains while freezing the general domain parameters. Chu et al. [10] proposed incorporating multilingual data into a multi-domain adaptation framework to improve the translation quality of resource-scarce specific domains. Gu et al. [20] adjusted the model using knowledge distillation, where a pruned model was trained with specific domain data under the guidance of an unpruned model trained on a general domain, and then expanded to the original size as a specific domain parameter model. Liang et al. [31] proposed a similar method to Gu et al. [20], where the most informative parameters in the general domain model were identified and frozen, unnecessary parameters were pruned, and the specific domain subnetwork was discovered and adjusted with specific domain data. Recently, Morishita et al. [37] proposed a domain-adaptation method that efficiently gathers in-domain parallel sentences from the web with the assistance of crowdworkers. This approach can collect target-domain parallel data within a few days at a reasonable cost. Hendy et al. [21] proposed Domain Specific Sub-networks, using pruning and masking to partition a sub-network for each domain that includes parameters shared with other domains as well as domain-specific parameters. Cao et al. [7] proposed enhancing the datastore retrieval of k-Nearest Neighbor Machine Translation (kNN-MT) by reconstructing the original datastore. They utilized a reviser to improve key representations, enabling a better fit in the downstream domain. Despite these developments, challenges persist, including effectively balancing shared and domain-specific information and handling domain shifts. Our work addresses the problems of imbalanced data caused by simple fine-tuning.
Background
Multi-domain adaptive neural machine translation model architecture
The Neural Machine Translation (NMT) [3, 8, 52] architecture views all inputs as a series of standard ‘tokens’ [14, 36], which are denoted as integer values. Specifically, the NMT model receives an input
Likelihood estimation
The prevalent training objective in NMT is to adjust the weights
Unlike NMT, multi-domain NMT has parallel corpora from multiple domains,
Traditional optimization algorithms update only the model parameters under a fixed set of hyperparameters. Bilevel Optimization [15] was introduced to address this limitation. In bilevel optimization, the outer level adjusts hyperparameters such as the learning rate, regularization strength, or optimization strategy, which control the behavior and performance of the optimization process. The inner level updates the model parameters under the selected hyperparameters. Together, the two levels provide a structured and systematic approach to improving optimization performance and the overall capabilities of machine learning models.
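As a rough illustration of the two-level structure described above, the following Python sketch treats a learning rate as the outer variable and a single scalar weight as the inner variable. The quadratic losses, the candidate learning rates, and all function names are hypothetical and chosen only to keep the example self-contained; it is a toy under these assumptions, not the optimizer used in our method.

```python
def inner_train(lr, steps=20, w0=0.0):
    """Inner level: gradient descent on a toy training loss (w - 3)^2."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)  # derivative of the toy training loss
        w -= lr * grad
    return w


def val_loss(w):
    """Outer objective: toy validation loss with a slightly shifted optimum."""
    return (w - 2.9) ** 2


def outer_search(candidates):
    """Outer level: keep the hyperparameter whose inner solution validates best."""
    scored = [(val_loss(inner_train(lr)), lr) for lr in candidates]
    best_loss, best_lr = min(scored)
    return best_lr, best_loss


if __name__ == "__main__":
    best_lr, best_loss = outer_search([0.001, 0.01, 0.1, 1.1])
    print(f"selected learning rate: {best_lr}, validation loss: {best_loss:.5f}")
```

The outer level never inspects the training loss directly: it only sees the validation loss of whatever the inner level produced, which is the defining feature of a bilevel problem.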
Methodology
Notably, in deep learning [3, 47], it is often impossible to feed all the data into the model at once during training. Random sampling [23, 39] is a commonly used data sampling method in NMT, involving randomly selecting a certain proportion of data samples from the training corpus and shuffling and sampling them for training. The goal is to make model training more efficient and accurate while avoiding overfitting the training data. However, in the random sampling strategy, the proportion of data samples
where
Therefore, in order to better optimize
Before introducing the DDB, we briefly introduce the concept of Empirical Risk Minimization (ERM). In multi-domain adaptive NMT, MLE has drawbacks as it may lead to overfitting and poor performance on new data. Additionally, it may not balance the data distribution of different domains well, leading to poor performance in some domains. Thus, more effective parameter estimation methods are needed to improve the model’s generalization and cross-domain adaptation capabilities. Given
where
In order to further optimize the model, we define the following loss function with respect to the model parameters
which is summed over all potential translations
The ERM method balances the contribution of each domain and handles the data distribution mismatch across domains, resulting in improved performance while also preventing overfitting and improving the model’s generalization ability to new and unseen data.
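To make the idea of optimizing an expected evaluation metric concrete, the sketch below computes an expected risk over a small set of candidate translations. It follows the common minimum-risk-style formulation, in which the sum over all potential translations is approximated by a candidate subset and the model probabilities are renormalized over that subset. The function name, the `alpha` sharpness parameter, and the use of 1 minus sentence-level BLEU as the cost are assumptions made for illustration and may differ from the exact loss used in our method.

```python
import math


def expected_risk(log_probs, costs, alpha=1.0):
    """Expected cost under the model distribution renormalized over candidates.

    log_probs: model log-probabilities log p(y | x), one per candidate
    costs:     task cost per candidate, e.g. 1 - sentence-level BLEU (assumption)
    alpha:     sharpness of the renormalized distribution (hypothetical)
    """
    if len(log_probs) != len(costs) or not log_probs:
        raise ValueError("log_probs and costs must be non-empty and equal length")
    scaled = [alpha * lp for lp in log_probs]
    shift = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - shift) for s in scaled]
    norm = sum(weights)
    return sum(w / norm * c for w, c in zip(weights, costs))


if __name__ == "__main__":
    costs = [0.2, 0.5, 0.9]  # lower cost means a better candidate
    flat = [math.log(p) for p in (0.5, 0.3, 0.2)]
    peaked = [math.log(p) for p in (0.8, 0.1, 0.1)]
    print(expected_risk(flat, costs), expected_risk(peaked, costs))
```

Moving probability mass toward the low-cost candidate lowers the risk, which is the behavior a risk-based criterion rewards and a likelihood-based criterion does not measure directly.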
It is worth noting that our method comprises two types of risk:
In this context, we introduce the Domain Data Balancer, which addresses the problem of imbalanced data, allowing for efficient optimization of
DDB is a differentiable function that solely relies on the input features of the example
Before the training process, we use the Xavier initialization [19] method to initialize the model parameters
Throughout the training process, we continuously iterate and optimize the model parameters
where
the parameter
By calculating the cosine similarity between the training data extracted by the balancer and the corresponding validation data for each individual domain, we introduced a reward function based on reinforcement learning to update the parameters in the DDB. This function rewards the balancer for selecting training data that is similar to the validation data and punishes it for selecting dissimilar data. Considering the optimization of the multi-domain training objective and the similarity between different domains, we calculate a separate reward function for each domain and take the average of these rewards as the final reward value:
where
The reward function
where
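The per-domain reward and its average, as described above, can be sketched as follows. How the selected training data and the validation data are turned into vectors is not specified here, so the sketch simply assumes that each domain already has one vector for its sampled training data and one for its validation data. The function names and the dictionary layout are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def average_reward(train_vecs, valid_vecs):
    """Return (mean reward over domains, per-domain rewards).

    train_vecs / valid_vecs: dicts mapping a domain name to its vector
    (assumed representation; see the lead-in).
    """
    rewards = {d: cosine(train_vecs[d], valid_vecs[d]) for d in train_vecs}
    return sum(rewards.values()) / len(rewards), rewards


if __name__ == "__main__":
    train = {"general": [1.0, 0.0], "medical": [0.0, 1.0]}
    valid = {"general": [1.0, 0.0], "medical": [1.0, 0.0]}
    mean_reward, per_domain = average_reward(train, valid)
    print(mean_reward, per_domain)
```

A domain whose selected training data points in the same direction as its validation data receives a reward near 1, while dissimilar data receives a reward near 0 or below, so the averaged value penalizes a balancer that serves only some domains well.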
The experiments are conducted on English-German and Chinese-English datasets, each consisting of one general domain and three specific domains, under the Python 3.8.1 computing environment deployed on a computer with the Ubuntu 20.04 operating system. We used Fairseq [40] as our sequence modeling tool and chose the Transformer-Big as the base framework for our experiments. During the experiments, we set the encoder and decoder layers to 6, the embedding dimension and the hidden units of the feedforward network in both the encoder and decoder to 1024 and 4096, respectively, and the number of attention heads in both the multi-head self-attention and cross-attention mechanisms to 16. The initial learning rate was set to
Datasets and data preprocessing
We conducted comparative experiments on open datasets for English-German and Chinese-English translation, and the detailed data size statistics for each domain are presented in Table 1.
For the English-German dataset, the training corpus for the general domain was obtained from the news translation task in WMT14, with newstest2013 and newstest2014 serving as the validation and test sets, respectively. The specific domains included TED talks, biomedical texts, and novels. For the TED talks domain, IWSLT14 was used as the training corpus, with dev2010 and tst2014 as the validation and test sets, respectively. For the biomedical domain, the EMEA News Crawl dataset was used as the training corpus, and the Khresmoi Medical Summary Translation Test Data 2.0 was used as the validation and test sets. For the novel domain, the book dataset from OPUS [50] was used as the training corpus, and a few chapters randomly selected from Jane Eyre were used as the validation set, while The Metamorphosis was used as the test set.
For the Chinese-English dataset, the training corpus for the general domain was obtained from the news translation task in WMT17, with newsdev2017 and newstest2017 serving as the validation and test sets, respectively. The specific domains were selected from three domains of papers, spoken language, and education in the UM-Corpus [49] dataset as experimental data.
The two experimental datasets were preprocessed differently. Because English and German both belong to the West Germanic language branch, a shared vocabulary was used for the English-German dataset. Because Chinese and English differ far more, no shared vocabulary was used for the Chinese-English dataset. For the English-German dataset, sentencepiece [28] was used to tokenize the data with a shared vocabulary size of 32768. For the Chinese-English dataset, Stanford NLP [34] and Moses tokenizer tools were used to tokenize the Chinese and English, respectively. Byte Pair Encoding (BPE) [44] was used to perform subword segmentation on both languages, with vocabulary sizes of 44K and 33K, respectively.
Table 1: English-German (EN-DE) and Chinese-English (ZH-EN) open dataset sizes.
To evaluate the performance of the proposed method, we compared it with several classic and state-of-the-art (SOTA) methods. These methods include:
General domain model [51]: This model utilizes the Transformer architecture and is trained solely on parallel corpora from the general domain.
Fine-tuning [33]: This method first trains the model on a general domain corpus and then continues training on specific domain corpus data.
Mixed domain model [12]: This model is trained on a mixture of data from the general domain and all specific domains.
Mixed with domain tags (MDT) [48]: This approach trains a unified multi-domain machine translation model by mixing data from the general domain and all specific domains, with domain tags added to distinguish between data from different domains.
Sequential PRUNE-TUNE (SPT) [30]: This method identifies and freezes the most informative parameters in the general domain model, then prunes unnecessary network model parameters, and finally fine-tunes specific domain sub-network parameters using a mask matrix and specific domain data.
Temperature-Based Sampling (TBS) [2, 16]: This model is trained by determining the sampling distribution of each domain of training data using a manually set temperature value
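For reference, the temperature-based sampling baseline can be sketched with its usual formulation, in which the probability of sampling from a domain is proportional to that domain's share of the data raised to the power 1/T. The domain sizes below are made-up numbers used only for illustration, not the statistics of our datasets, and the function name is hypothetical.

```python
def temperature_sampling(sizes, temperature):
    """Return per-domain sampling probabilities p_d proportional to (n_d / N) ** (1 / T).

    sizes:       dict mapping a domain name to its number of training examples
    temperature: T = 1 gives size-proportional sampling; larger T flattens
                 the distribution toward uniform
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    total = sum(sizes.values())
    scaled = {d: (n / total) ** (1.0 / temperature) for d, n in sizes.items()}
    norm = sum(scaled.values())
    return {d: s / norm for d, s in scaled.items()}


if __name__ == "__main__":
    sizes = {"general": 900, "specific": 100}  # hypothetical sizes
    for t in (1, 5, 100):
        print(t, temperature_sampling(sizes, t))
```

Because the temperature is fixed by hand, the resulting distribution does not react to how training progresses, which is the main contrast with a sampling distribution that is learned during training.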
In this study, we employed the DDB to balance the training data of a multi-domain NMT model in order to enhance its performance. Tables 2 and 3 display our experimental findings on the English-German and Chinese-English public datasets, respectively. Remarkably, our approach outperformed the baseline models in all domains of the English-German dataset and in most domains of the Chinese-English dataset. This suggests that DDB effectively balances the training data across different domains, preventing the model from overfitting to a specific domain and ignoring information from other domains.
Furthermore, in the general domain of the English-German and Chinese-English datasets, our approach outperformed the SOTA methods by 1.4 and 3.68 BLEU points, respectively. These results indicate that our approach mitigates the catastrophic forgetting problem induced by fine-tuning, thereby improving translation performance within the general domain. Our method achieved a larger improvement on the English-German dataset than on the Chinese-English dataset: the average BLEU score improvements were 1.93 and 0.14, respectively. For the English-German dataset, both the general domain and all specific domains achieved higher BLEU scores than all SOTA methods except TBS, notably with an improvement of 1.33 BLEU in the novel domain. The improvement was less pronounced for the Chinese-English dataset: performance in the education domain was lower than that of the SOTA methods, while the other three domains showed good performance. We speculate that this is due to the difference between the two datasets. For the English-German dataset, both languages belong to the same language family, which allows DDB to balance the training data of both the source and target languages' domains reasonably. For the Chinese-English dataset, however, the two languages differ significantly, and DDB may not balance the training data of the source and target languages' domains effectively, resulting in slightly inferior model performance.
Moreover, although both TBS and our method can control the sampling distribution for each domain, the TBS results on the English-German and Chinese-English datasets show signs of overfitting. Specifically, in the IWSLT domain of English-German and the Thesis domain of Chinese-English, its BLEU scores are about 10 points higher than those of the other methods, at the cost of considerably lower scores in the other domains. In contrast, our method avoids this overfitting and achieves balanced performance across domains, resulting in average BLEU score improvements of 5.94 and 2.45 on the English-German and Chinese-English datasets, respectively.
Finally, our experimental results show that, although the validation sets differ across the domains of the English-German dataset while the Chinese-English dataset uses a consistent validation set, the impact of the validation set on optimizing the DDB and model parameters is negligible relative to that of the training set. This indicates that our method is robust to how the datasets are constructed.
Table 2: BLEU scores on the English-German dataset.
Table 3: BLEU scores on the Chinese-English dataset.
To validate the effectiveness of the proposed method, we conducted ablation experiments on the English-German dataset.
Influence of the domain data balancer
The experimental results show that the DDB proposed in this study has a practical effect in improving the performance of multi-domain NMT models. Without the DDB, the BLEU scores of the model translations decrease significantly, which demonstrates the effectiveness of the proposed method in balancing the training data across domains. However, as indicated in Table 4, the model without the DDB still performs well. This could be attributed to the replacement of the MLE training criterion with ERM during the optimization of the data selection network and model parameters, which is analyzed further in the experiments that follow.
Table 4: Influence of the domain data balancer.
The results in Table 5 show that, with the network architecture unchanged, optimizing the model parameters and the DDB without the ERM training criterion leads to a decrease in BLEU scores across most domains. We hypothesize that training NMT models with the MLE training criterion may result in overfitting in some domains and underfitting in others. In contrast, the ERM training criterion can help the model better adapt to data from multiple domains and exhibit more stable performance in each domain.
Table 5: Influence of empirical risk minimization.
In our experiments, we used a shared vocabulary for the English-German dataset, since both languages belong to the same language family. For the Chinese-English dataset, we used non-shared vocabularies due to the larger linguistic differences between the two languages. To investigate whether our method can improve translation performance in both shared and non-shared vocabulary settings, and thus adapt to a wide range of language pairs, we conducted experiments on the English-German language pair with non-shared vocabularies. The results are shown in Table 6.
Table 6: Influence of vocabulary sharing.
The results in Table 6 suggest that sharing the vocabulary has a greater impact on the performance of the general domain, while its impact on specific domains can be positive or negative. Our hypothesis is that not sharing the vocabulary causes the model to learn the same words from two different vocabularies, which increases the sparseness of the data. The vast amount of data in the general domain occupies most of the vocabulary, leading to a more significant decrease in translation performance there. However, uncommon words that are often overlooked in specific domains benefit from the expanded vocabulary, resulting in improved translation performance in those domains.
Despite the significant performance improvements achieved by our proposed multi-domain adaptive method based on DDB, we acknowledge several limitations. Firstly, the accuracy of domain labels in the training data is crucial for the effectiveness of our method, and inaccuracies may impact the overall performance. Therefore, future research should explore approaches for robust domain label annotation. Secondly, the need to update the DDB during training introduces additional computational overhead, potentially extending the training time. We recognize the importance of optimizing this process to maintain efficiency while ensuring effective domain adaptation. Lastly, since our evaluation focuses on English-German and Chinese-English datasets, the generalizability of our method to other languages and domains remains an open question. Further investigation is needed to assess the applicability and performance of our approach in diverse linguistic scenarios.
Despite these limitations, our method presents a promising step towards addressing data imbalance issues in multi-domain adaptive neural machine translation and provides a strong foundation for future research in this domain. The theoretical implications of our study lie in the advancement of multi-domain adaptive neural machine translation. By introducing the Domain Data Balancer, we propose a novel method that effectively tackles the data imbalance problem in multi-domain scenarios. This approach extends the understanding of how to leverage domain-specific data efficiently and achieve balanced training across domains. Moreover, the exploration of alternative loss functions, such as the one we employed, contributes to the theoretical understanding of different optimization strategies in multi-domain settings. On a practical level, our study has significant implications for real-world applications of machine translation. The ability to effectively utilize domain-specific data allows the model to adapt and perform better in various specialized domains, leading to more accurate and contextually appropriate translations. As a result, this can facilitate the deployment of high-quality machine translation systems in a wide range of domains, including law, medicine, technology, and more, where domain-specific language patterns and terminologies are prevalent.
Conclusions
In this paper, we propose a multi-domain adaptive NMT method based on Domain Data Balancer to address the problem of data imbalance in multi-domain adaptive neural machine translation, which can lead to a decrease in model performance. Our method can adaptively learn the sampling distribution of each group of training data, select appropriate training data from the original training sets of each domain to train the model, and use the ERM training criterion combined with a dual optimizer to update both the Domain Data Balancer and model parameters, achieving balanced training across domains.
Extensive experiments conducted on English-German and Chinese-English public datasets demonstrate the superior performance of our method compared to various strong baselines. Specifically, our method achieves substantial improvements in average BLEU scores, with gains of approximately 1.93 and 0.14 points in the English-German and Chinese-English datasets, respectively. Notably, in the English-German dataset, both the general domain and specific domains outperform all SOTA methods, with the novel domain exhibiting a remarkable improvement of 1.33 BLEU score.
In future work, we will explore integrating other methods, such as few-shot translation and incremental multi-domain translation, within this framework to further enhance the performance of multi-domain NMT. We will also investigate strategies to address the limitations mentioned earlier.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China under Grants U21B2027, 61866020, 61972186, and 61732005, and in part by the Yunnan provincial major science and technology special plan projects under Grants 202103AA080015, 202203AA080004, and 202202AD080003.
