Improving sentence representation for Vietnamese natural language understanding using optimal transport

Abstract

Multilingual pre-trained language models have achieved impressive results on most natural language processing tasks. However, their performance is limited by model capacity and by the under-representation of some languages in the pre-training data, especially languages with limited resources. This has led to the creation of tailored pre-trained language models, which are pre-trained on large amounts of monolingual data or domain-specific corpora. Nevertheless, compared to relying on multiple monolingual models, utilizing multilingual models offers the advantage of multilinguality, such as generalization on cross-lingual resources. To combine the advantages of both multilingual and monolingual models, we propose KDDA, a framework that transfers knowledge from monolingual models into a single multilingual model with the aim of improving sentence representation for Vietnamese. KDDA employs a teacher-student framework and cross-lingual transfer to distill knowledge from two monolingual models (teachers) into a unified multilingual model (student). Since the representations from the teachers and the student lie in disparate semantic spaces, we measure the discrepancy between their distributions using Sinkhorn Divergence, an optimal transport distance. We conduct experiments on two Vietnamese natural language understanding tasks: machine reading comprehension and natural language inference. Experimental results show that our model outperforms other state-of-the-art models and yields competitive performance.

Keywords

Natural language understanding, machine reading comprehension, natural language inference, knowledge distillation, optimal transport

1 Introduction

With advances in data collection and self-supervised objectives, large pre-trained language models (LMs) have become one of the significant breakthroughs in natural language processing (NLP) and have achieved state-of-the-art results on many tasks. Several works [1–3] focus on building LMs for English and other resource-rich languages because of the enormous amount of available training data. However, with more than 7,000 languages in the world, collecting corpora for pre-training is time-consuming and requires significant effort, especially for resource-poor languages. Training models for resource-poor languages is therefore difficult, because curated vocabularies and corpora are scarce and small. Multilingual pre-trained LMs have shown surprisingly good cross-lingual effectiveness on many downstream tasks [4, 5]. Well-known multilingual pre-trained LMs, such as mBERT [1], XLM [6], and XLM-R [7], extend monolingual models to around a hundred languages by learning deep linguistic representations from a large multilingual corpus. Hence, multilingual models can generalize across many languages and domains at different granularities by leveraging knowledge from the pre-training corpora. As a result, these models deliver impressive performance on many cross-lingual transfer benchmarks.

Despite the effectiveness of cross-lingual transfer, multilingual models come with their own drawbacks. Even though they are trained on enormous corpora, the number of parameters is fixed, so performance on downstream tasks decreases as more languages are added for pre-training. Multilingual models perform worse on both resource-rich and resource-poor languages because they lack language-specific capacity [7, 8]. To address this, several recent efforts have introduced language-specific LMs with custom vocabularies, which yield more compact representations from large multi-domain data [9, 10] and provide comparable advances. However, building a language-specific LM prevents us from exploiting multilingual strengths such as cross-lingual transfer between a resource-rich language and its closely related language varieties.

Is it possible for multilingual models to leverage language-specific monolingual models, perform well on supervised tasks, and enable positive language transfer between languages? This motivates us to leverage the power of monolingual pre-trained models to improve the performance of multilingual models for downstream tasks. In this work, we focus on improving Vietnamese sentence representation in natural language understanding (NLU) in particular for machine reading comprehension and natural language inference tasks. Recently, knowledge distillation has emerged as a technique for transferring knowledge between models using the student-teacher framework. We employ the teacher-student framework to leverage the capabilities of language-specific monolingual models and transfer their rich knowledge to a multilingual model. Intuitively, we first fine-tune the language-specific monolingual teachers on downstream tasks and then proceed to distill their knowledge into the multilingual student model.

It is worth noting that, during the knowledge transfer process, the hidden representations of the teacher and student models belong to two separate spaces, monolingual and multilingual. Inspired by the work of [11], we propose a distillation strategy in which the teacher models transfer language-specific and task-specific features to the student model without changing the model architecture. Sinkhorn Divergence [12], an Optimal Transport measure, can be efficiently utilized for this optimization. Sinkhorn Divergence is a powerful method for comparing distributions that aims to minimize the cost of moving probability mass from a source distribution to a target distribution. Unlike other divergences such as the KL-Divergence, it does not require the two distributions to overlap. This makes Sinkhorn Divergence an effective solution for transferring knowledge between models, as two different models belong to distinct feature spaces and cannot be directly projected one-to-one. Furthermore, Sinkhorn Divergence can be computed on GPUs, allowing the distillation loss to be computed for batches of many samples.

To summarize, the contributions of this paper are as follows:

We propose XLMR-KDDA, a knowledge distillation approach based on Optimal Transport in which the multilingual student model learns supervision signals from two monolingual teacher models at the fine-tuning stage. Our method does not change the model architecture and needs no additional pre-training, which makes it easy to apply to different languages or tasks. In practice, the Optimal Transport distance is used as the training objective that minimizes the discrepancy between the hidden representations of the teacher models and the student model. Our findings indicate that letting the multilingual model learn from multiple teacher models can yield an overall boost in performance compared to other strong baselines.

We perform a comprehensive analysis and conduct extensive experiments on Vietnamese NLU datasets, namely UIT-ViQuAD and ViNLI. Our results demonstrate substantial improvements over other baselines, thus highlighting the efficacy of our approach. Additionally, we conduct further extensive experiments to explore and evaluate various components within our approach.

The rest of the paper is organized as follows. Section 2 provides background. Section 3 describes the methodology and the training procedure of our model. We discuss experimental results and further analysis in Section 4. The last section presents our conclusions.

2 Background

This section presents the background for our proposed framework. Our goal is to improve the Vietnamese sentence representation of the multilingual model by transferring knowledge from monolingual models.

2.1 Multilingual pre-trained language models

Multilingual pre-trained LMs have shown great success in various NLP tasks thanks to their cross-lingual effectiveness. Following the success of the Transformer architecture [13], several efforts have trained large-scale LMs on multiple languages. mBERT is a multilingual variant of BERT [1] that adopts the same training regime and is pre-trained on 104 languages. Conneau and Lample proposed XLM [6], a multilingual model pre-trained on a parallel corpus using the translation language modeling objective. XLM-R [7] improves XLM with far more pre-training data and a better vocabulary, achieving further performance gains on downstream tasks [14, 15]. More recently, mDeBERTa [16], a multilingual variant of DeBERTa [17], was pre-trained on the same amount of data as XLM-R. mDeBERTa improves on the BERT and RoBERTa models by using disentangled attention and an enhanced mask decoder, and it demonstrates strong capability on cross-lingual transfer benchmarks. Meanwhile, many empirical studies analyze the multilinguality and cross-lingual transfer ability of multilingual models. They show that a model fine-tuned in one language can be used for another language without relying on any direct cross-lingual supervision [4, 5].

However, many studies point out fundamental limitations of multilingual models. Conneau et al. [7] observed that scaling multilingual LMs to a larger number of languages improves performance only up to a certain point, after which performance drops as per-language capacity degrades. This is termed “the curse of multilinguality” or the “transfer-interference trade-off”. Because of this defect, many languages in multilingual models are underrepresented and lag behind their monolingual counterparts on downstream tasks. This can be alleviated by extending per-language capacity [7, 18] or by further pre-training [19, 20]. Since model capacity is limited, scaling up a model to represent all languages is not practical. To address this, recent works mainly focus on improving performance on one language or a small set of languages. They show that monolingual language-specific models pre-trained from scratch achieve better performance than multilingual models, for example for Vietnamese [9, 21], Arabic [22], French [23], Finnish [10], and Dutch [24]. Several reasons have been given for this advantage over multilingual models. For example, monolingual LMs use tailored vocabularies, which reduce the need to split a word into multiple sub-words and thus avoid sub-optimal decompositions [25]. Monolingual LMs also often use more pre-training data, which leads to significant improvements in performance, as observed in Li et al. [3].

Recently, the field of natural language understanding has witnessed rapid development, because of the potent ability of LMs in representing textual data. They have a wide range of applications in natural language understanding, including sentiment analysis, question answering, natural language inference, etc. Through the utilization of pre-trained LMs, natural language understanding is becoming more accessible and effective on comprehension and interpretation of human language.

2.2 Knowledge distillation

Knowledge Distillation (KD), first introduced by Hinton et al. [26], is an effective technique for transferring knowledge from one model to another. KD aims to train a model, called the student, by exploiting the information in the soft label distribution of another model, called the teacher. KD has been widely used in a variety of NLP applications, including model compression [27–29] and multi-task learning [30, 31]. Recent works have investigated methods that align the feature spaces of the student and teacher models for better knowledge transfer. In particular, Wang et al. [32] proposed a KD method that distills structural knowledge from several monolingual teacher models into a unified multilingual student model, in order to close the performance gap between monolingual and multilingual models caused by capacity limitations. Li et al. [19] focused on learning the semantic structure of representations by adopting the KD framework to transfer rich knowledge from English BERT and improve the multilingual LM. Khanuja et al. [33] enhanced the generalization ability of student models for resource-poor languages by distilling knowledge from multiple multilingual teachers in a task-agnostic setting.

2.3 Optimal transport

Optimal Transport (OT) [34, 35] has become popular in NLP due to its ability to compare two different distributions. It has been successfully applied to many applications. For example, Jianqiao Li et al. [36] proposed using OT to capture positional and contextual information of tokens and tackle the problem of exposure bias in text generation tasks. Peggy Tang et al. [37] formulated text summarization as an OT problem, they considered optimizing transportation cost from an optimal summary to a document based on their semantic distributions. Kyle Swanson et al. [38] employed OT as an objective to align inputs in the text matching problem. Recently, OT has been used for KD to transfer knowledge across models. In this line of work, OT is utilized to compute the optimal value to map between semantic spaces and update the student model afterwards. Thong Nguyen et al. [11] investigated transferring knowledge from a monolingual teacher to a multilingual student for the cross-lingual summarization task. Specifically, they proposed a Knowledge Distillation loss using Sinkhorn Divergence for the transfer process. Tulika Bose et al. [39] introduced a new framework that distills the natural language semantic knowledge from multiple teacher networks to a student network using OT.

In our setup, OT distance is employed as a distillation objective to align the feature spaces of the teacher models and the student model, ensuring that OT enables the positive transfer of knowledge between models.

3 Proposed method

In this section, we formalize our ideas and the training procedure. Our goal is to improve Vietnamese sentence representation in the multilingual model by transferring knowledge from the monolingual models using the KD framework. One challenge of the knowledge transfer process is that the teacher and student models operate in two different feature spaces, which makes it difficult to measure the discrepancy between them. We need to determine a mapping between their hidden representations. To address this, we use the Sinkhorn Divergence as the distillation objective.

Figure 1 demonstrates the overview of our setting, consisting of 3 components: student model, teacher models and training objective.

Fig. 1

Illustration of our proposed framework KDDA. The figure shows our pipeline, including a student model (multilingual XLM-R) and two teacher models (the Vietnamese PhoBERT and the English RoBERTa). Note that in each training iteration, only one monolingual model is used as the teacher.

3.1 Pre-trained language model fine-tuning

This work focuses on two Vietnamese NLU tasks: machine reading comprehension (MRC) and natural language inference (NLI). The two tasks are formulated as follows:

MRC: the MRC problem can be characterized as a triple 〈C, Q, A〉, where C denotes the context, Q the question, and A the answer. An MRC model must be capable of reading the context and extracting a relevant span of text from it to answer the question. Table 1 shows an example from the UIT-ViQuAD dataset.

NLI: the task of logically determining the semantic relation between a “hypothesis” H and a “premise” P. The task can be described as classifying the premise-hypothesis pair as “entailment”, “contradiction” or “neutral”. An example from the ViNLI dataset is given in Table 2.

Table 1
An example of UIT-ViQuAD dataset on MRC task (English translation included)

Table 2

An example of ViNLI dataset on NLI task (English translation included)

As shown in Figure 1, we present the inputs as a single packed sequence x of length L, where x consists of two segments for the tasks. We then use pre-trained LMs (PLM) to obtain the contextualized embedding H of the input:

$H = \mathrm{PLM}(x) = [h_{1}, \ldots, h_{L}]$ (1)

Where $H \in ℝ^{L \times d}$ ; d is the hidden dimension of the PLM. At the output level, a task-specific layer is utilized by taking the contextualized embedding of the final Transformer layer H for MRC or the representation of the first token in the input sequence [CLS] for NLI. In order to output the predictions, a task-specific layer is followed by an additional Softmax layer. Denoting D as the training data; θ as model parameters, the student model is fine-tuned with negative log-likelihood loss:

$L_{task} = - \sum_{(x, y) \in D} \log p(y \mid x, θ)$ (2)
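As a minimal sketch of Eq. (2), the loss can be computed from raw classifier scores as shown below. The function names `log_softmax` and `task_loss` are illustrative assumptions rather than the authors' implementation; the example uses a three-way NLI output, and for MRC the same loss would presumably be applied to the start and end position distributions.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def task_loss(batch_logits, gold_labels):
    """Negative log-likelihood summed over (logits, label) pairs, as in Eq. (2)."""
    return -sum(
        log_softmax(logits)[label]
        for logits, label in zip(batch_logits, gold_labels)
    )


# Hypothetical scores for two premise-hypothesis pairs over three classes.
loss = task_loss([[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]], [0, 2])
print(f"{loss:.4f}")
```

A uniform prediction over three classes contributes log 3 to the sum, which is a convenient sanity check when wiring the loss into a training loop.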

3.2 Knowledge distillation

Let $D = \{D_{1}, \ldots, D_{k}\}$ denote the set of training data in k languages, where $D_{i}$ is the corpus for language i with $m_{i}$ examples: $D_{i} = \{(x_{i}^{j}, y_{i}^{j})\}_{j = 1}^{m_{i}}$. Let $T = \{T_{1}, \ldots, T_{k}\}$ denote the set of teacher models pre-trained on the k languages. Given the set of training data and teacher models, we first fine-tune the monolingual teachers on the corresponding training datasets. Fine-tuning the teachers gives us access to the task-specific and language-specific features used in the knowledge transfer process. In each training iteration, we select a random language l and sample a batch $B_{l} = \{(x_{i}^{l}, y_{i}^{l})\}$ from the corresponding dataset $D_{l}$. We then feed the batch $B_{l}$ to the respective monolingual teacher and the multilingual student concurrently to obtain the hidden semantic representations of the student and the teacher:

$H_{S} = \mathrm{PLM}_{student}(x_{i}^{l})$ (3)

$H_{T} = \mathrm{PLM}_{T_{l}}(x_{i}^{l})$ (4)

Notice that only the student model parameters θ_S are updated during knowledge distillation. So, we detach H_T from the computational graph.

Previous works [40, 41] have shown the effectiveness of leveraging the hidden representations of the teacher in the knowledge distillation process. However, using the hidden states is difficult in our setting, as the teachers and the student were pre-trained on different kinds of data and belong to two different feature spaces. As a result, they cannot be projected one-to-one, because their vocabularies are disjoint. To make the monolingual and multilingual feature spaces compatible, we employ the Sinkhorn Divergence as the KD objective:

$L_{KD} = \mathrm{dist}(H_{T}, H_{S})$ (5)

where dist is the OT distance between two probability measures, presented in the next section. Finally, the student jointly learns from the gold targets and the soft targets by minimizing the following objective function:

$L = λ L_{KD} + L_{task}$ (6)

where λ is a hyperparameter that controls the contribution of the KD loss and needs to be tuned.
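The per-iteration procedure of Eqs. (3) to (6) can be sketched in framework-agnostic Python. Every callable here (`student`, `teachers`, `dist`, `task_loss`) is a hypothetical placeholder introduced for illustration, and the assumption that the student returns both hidden states and logits is ours; in an autograd framework the teacher output would additionally be detached so that only the student parameters receive gradients.

```python
import random


def distillation_loss(datasets, teachers, student, dist, task_loss,
                      lam=0.3, batch_size=32, rng=random):
    """One KDDA-style loss evaluation: pick a language, then combine Eqs. (5) and (6).

    datasets: dict mapping language code -> list of (x, y) examples.
    teachers: dict mapping language code -> callable returning hidden states.
    student:  callable returning (hidden_states, logits) for a list of inputs.
    """
    lang = rng.choice(sorted(datasets))            # random language l
    pool = datasets[lang]
    batch = rng.sample(pool, k=min(batch_size, len(pool)))
    xs = [x for x, _ in batch]
    ys = [y for _, y in batch]

    h_s, logits = student(xs)                      # Eq. (3)
    h_t = teachers[lang](xs)                       # Eq. (4), treated as a constant

    l_kd = dist(h_t, h_s)                          # Eq. (5)
    return lam * l_kd + task_loss(logits, ys)      # Eq. (6)


# Toy stand-ins, only to show how the pieces are combined.
toy_data = {"vi": [("a", 0), ("b", 1)], "en": [("c", 0)]}
toy_teachers = {"vi": lambda xs: 1.0, "en": lambda xs: 1.0}
loss = distillation_loss(
    toy_data, toy_teachers,
    student=lambda xs: (3.0, None),
    dist=lambda h_t, h_s: abs(h_t - h_s),
    task_loss=lambda logits, ys: 10.0,
)
print(loss)
```

The default `lam=0.3` and `batch_size=32` mirror the values reported in Section 4.2; everything else is a design choice of this sketch.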

For the purpose of taking the supervision signal from English and Vietnamese monolingual models, we also have to fine-tune the multilingual student model using cross-lingual data. Specifically, at the beginning of the fine-tuning process, we concatenate the original Vietnamese training data with the English training data as the final training set for the student model. Several works have investigated the cross-lingual transfer effect, showing that leveraging one or more source languages can improve the performance of the target language [42–44]. In our setting, the student model is supervised by the training signal from the hard targets of the training data and the soft targets of monolingual models.

3.3 Optimal transport and Sinkhorn divergence

Since the representations of the monolingual teacher models and the multilingual student model lie in two different feature spaces, we propose to use OT to measure the distance between them. We regard this distance as the transportation cost between two probability measures. OT is a powerful method for transferring probability mass from one distribution to another. Formally, given the source distribution α and the target distribution β over the domains $𝕋$ and $𝕊$, our goal is to optimize θ_S such that the student model distribution matches the teacher model distribution. We use the Sinkhorn Divergence [12], which interpolates between OT and Maximum Mean Discrepancy.

Let α and β be probability distributions that take the form of a sum of Diracs:

$α = \sum_{i = 1}^{L_{T}} α_{i} δ_{h_{i}^{t}}, \quad α_{i} > 0$ (7)

$β = \sum_{i = 1}^{L_{S}} β_{i} δ_{h_{i}^{s}}, \quad β_{i} > 0$ (8)

Note that the weights of α and of β must each sum to 1: $\sum_{i = 1}^{L_{T}} α_{i} = 1$ and $\sum_{i = 1}^{L_{S}} β_{i} = 1$. Let $C \in ℝ^{L_{T} \times L_{S}}$ be the cost matrix, where c_ij specifies the cost of transferring probability mass from point i of the source distribution to point j of the target distribution. In our setting, we optimize the student model parameters θ_S so that the transportation cost from H_T to H_S is minimized. Each element of the transport plan $Π \in ℝ^{L_{T} \times L_{S}}$ denotes how much probability mass from a point in the source distribution is assigned to a point in the target distribution. Cuturi [45] proposed an entropy-regularized approximation of the OT distance that can be solved quickly by adding an entropy penalty term. Using the above notation, the Sinkhorn Divergence and the underlying entropy-regularized optimal transport problem are formulated as follows:

$\mathrm{dist}(H_{T}, H_{S}) = \mathrm{OT}_{ϵ}(α, β) - \frac{1}{2} \mathrm{OT}_{ϵ}(α, α) - \frac{1}{2} \mathrm{OT}_{ϵ}(β, β)$ (9)

$\mathrm{OT}_{ϵ}(α, β) = \min_{Π} 〈Π, C〉 + ϵ \, \mathrm{KL}(Π \,\|\, α \otimes β)$ (10)

$\text{s.t.} \quad Π \geq 0, \quad Π \mathbf{1}_{L_{S}} = α, \quad Π^{\top} \mathbf{1}_{L_{T}} = β$ (11)

Note that in Equation (9), the weights of α and β are initialized uniformly, and ϵ > 0 controls the amount of entropy regularization, interpolating between OT and Maximum Mean Discrepancy. The OT plan can be efficiently approximated by the Sinkhorn-Knopp algorithm [45, 46]. The Sinkhorn-Knopp algorithm is differentiable, which makes it suitable for use with any neural network architecture.
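The divergence of Eqs. (9) to (11) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the squared Euclidean ground cost, the log-domain Sinkhorn updates, the fixed iteration count, and all function names are assumptions made for illustration. Uniform weights follow the text, and the teacher and student hidden sizes are assumed to be equal so that a pairwise cost can be computed.

```python
import numpy as np


def _logsumexp(a, axis):
    """Stable log-sum-exp along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))


def entropic_ot(x, y, eps=0.1, n_iters=200):
    """Approximate OT_eps between uniform point clouds x (n, d) and y (m, d)."""
    n, m = x.shape[0], y.shape[0]
    log_a = np.full(n, -np.log(n))           # uniform weights alpha_i
    log_b = np.full(m, -np.log(m))           # uniform weights beta_j
    # Assumed ground cost: c_ij = 0.5 * ||x_i - y_j||^2.
    cost = 0.5 * ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    f = np.zeros(n)
    g = np.zeros(m)
    for _ in range(n_iters):                 # log-domain Sinkhorn updates
        f = -eps * _logsumexp(log_b[None, :] + (g[None, :] - cost) / eps, axis=1)
        g = -eps * _logsumexp(log_a[:, None] + (f[:, None] - cost) / eps, axis=0)
    # At convergence the dual value <alpha, f> + <beta, g> equals OT_eps.
    return float(np.exp(log_a) @ f + np.exp(log_b) @ g)


def sinkhorn_divergence(h_t, h_s, eps=0.1, n_iters=200):
    """Debiased divergence of Eq. (9) between teacher and student states."""
    return (entropic_ot(h_t, h_s, eps, n_iters)
            - 0.5 * entropic_ot(h_t, h_t, eps, n_iters)
            - 0.5 * entropic_ot(h_s, h_s, eps, n_iters))


rng = np.random.default_rng(0)
h_teacher = rng.normal(size=(5, 4))           # L_T = 5 tokens, d = 4
h_student = rng.normal(size=(7, 4)) + 1.0     # L_S = 7 tokens, d = 4
print(sinkhorn_divergence(h_teacher, h_student))
```

The two sequences may have different lengths, which is why the cost matrix is rectangular. Smaller values of ϵ, such as the 0.005 used in the experiments, generally need more iterations to converge; in practice a GPU implementation with automatic differentiation would replace this NumPy loop.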

4 Experiments and results

In this section, we explain training details and show the experimental results on two Vietnamese NLU benchmark datasets, which are ViQuAD and ViNLI for MRC and NLI, respectively. We also compare our work with state-of-the-art models including XLMRQA [47] and ViReader [48] on the MRC task, and conduct further analysis to prove the effectiveness of our approach based on these benchmarks. In our setting, the models are supervised with cross-lingual training signals from Vietnamese and English data and directly evaluated on Vietnamese.

4.1 Datasets

We conduct experiments on two Vietnamese benchmark datasets for NLU, namely:

Vietnamese Question Answering Dataset (UIT-ViQuAD) [49] contains human-generated question-answer pairs from Vietnamese Wikipedia articles for Machine Reading Comprehension (MRC). Given a question and a context, the objective of the MRC task is to read the context and extract the answer, which is a text span in the context.

Vietnamese Natural Language Inference (ViNLI) [50] is an open-domain corpus that contains human-annotated premise-hypothesis sentence pairs from news articles for Vietnamese Natural Language Inference (NLI). ViNLI is intended for determining the relationship between two sentences (premise and hypothesis) from {entailment, neutral, contradiction}. The authors also introduce an "other" label with the purpose of distinguishing unrelated semantic information. As can be seen from their work, overall performance when the “other” label is included in the training dataset is often higher than performance on the original three-label dataset. The reason is that distinguishing the “other” label amounts to a topic classification problem, since the negative example is constructed with material from another document. Topic prediction is easier to learn than semantic relation prediction, as it requires less reasoning between the two sentences.

At the input layer, the input must be presented as a single packed sequence, as suggested by [1]. For both tasks, we concatenate the two pieces of text with a special [SEP] token and place the [CLS] token at the beginning of the sequence. We employ cross-lingual data augmentation for MRC and NLI by using two English datasets, namely SQuAD [51] and MNLI [52], respectively. We consider Vietnamese and English as the source languages and Vietnamese as the target language. Table 3 describes the training data statistics.
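The packed input format can be illustrated with a short sketch. The helper name `pack_pair`, the trailing [SEP] token, and the segment ids follow the BERT convention of [1] and are assumptions here; the actual tokenizers of RoBERTa-style models use their own special tokens (`<s>` and `</s>`) and sub-word vocabularies.

```python
def pack_pair(first_tokens, second_tokens, cls="[CLS]", sep="[SEP]"):
    """Return (tokens, segment_ids) for a packed two-segment input."""
    tokens = [cls] + list(first_tokens) + [sep] + list(second_tokens) + [sep]
    first_len = len(first_tokens) + 2          # [CLS] + first segment + [SEP]
    segment_ids = [0] * first_len + [1] * (len(tokens) - first_len)
    return tokens, segment_ids


# NLI: premise followed by hypothesis (whitespace tokens for illustration only).
tokens, segments = pack_pair("Trời đang mưa".split(), "Trời nắng".split())
print(tokens)
print(segments)
```

For MRC the two segments would be the question and the context; which one comes first is a convention of the chosen model and is left open in this sketch.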

4.2 Model configurations

We employ transformer-based pre-trained LMs in two versions: base and large. All of the student and teacher models share the same architecture: 12 layers and 8 attention heads for the base version, and 24 layers and 16 attention heads for the large version. For the monolingual teachers, we use RoBERTa [2] for English and PhoBERT [9] for Vietnamese. All of the teachers are fine-tuned on the downstream tasks beforehand so that they learn language-specific features. The student model is initialized from XLM-R [7] and followed by a task-specific layer. During the fine-tuning process, we keep the student model parameters trainable while freezing the teachers’ parameters. For both tasks, we optimize the student model with the AdamW optimizer, search for the best learning rate in the set {1e-5, 2e-5, 3e-5}, and train for 2 epochs. The batch size is 32. We also employ a learning rate scheduler with a linear decay after 10% of the training iterations. We set the entropic regularization parameter ϵ to a relatively small value, 0.005, and the value of λ in Equation (6) to 0.3. All experiments are performed using two T4 GPUs.

We report the results across evaluation benchmarks. Evaluations are performed on the test set using the fine-tuned models selected by their performance on the development set. We name our models XLMR-KDDA. We compare our models with XLM-R, XLM-R with cross-lingual data augmentation (XLMR-DA), and the Vietnamese model PhoBERT on both tasks. For the MRC task, we also compare our model to two other baselines, XLMRQA [47] and ViReader [48], two open-domain MRC systems for Vietnamese.

To measure the performance of our approach, we adopt the commonly used F1 score and Accuracy for the NLI task. For the MRC task, we measure the F1 and Exact Match (EM) scores. In detail, the F1 score measures the token overlap (partial match) between the ground truth and the prediction, while the EM score measures exact matches. We run all tasks three times with different seeds and report the average scores. More detailed results are presented in the following section.
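The learning rate schedule can be sketched as a function of the training step. The text only states that linear decay starts after 10% of the iterations; treating the first 10% as a linear warm-up is an assumption of this sketch, as are the function name `linear_schedule` and the example step counts.

```python
def linear_schedule(step, total_steps, peak_lr, warmup_frac=0.1):
    """Learning rate at `step`: assumed linear warm-up, then linear decay to zero."""
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical run of 1,000 optimizer steps with a peak rate from the search grid.
for step in (0, 50, 100, 550, 1000):
    print(step, linear_schedule(step, total_steps=1000, peak_lr=2e-5))
```

The peak value 2e-5 is one of the three candidates in the search set; the total number of steps depends on dataset size, batch size, and the number of epochs.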
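The two MRC metrics can be sketched in the style of the SQuAD evaluation script. The normalization used here (lower-casing and whitespace tokenization) and the function names are assumptions; the official UIT-ViQuAD evaluation may normalize answers differently.

```python
from collections import Counter


def exact_match(prediction, ground_truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(prediction.strip().lower() == ground_truth.strip().lower())


def token_f1(prediction, ground_truth):
    """Harmonic mean of token-level precision and recall (partial match)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical prediction covering half of a four-token gold answer.
print(exact_match("Hà Nội", "thành phố Hà Nội"))
print(token_f1("Hà Nội", "thành phố Hà Nội"))
```

In this example the prediction has precision 1 and recall 0.5, so it receives partial F1 credit while its exact match score is zero, which is the distinction drawn in the text.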

4.3 Results

ViQuAD. Table 4 compares the results of our methods with the baselines on MRC. Our XLMR-KDDA method outperforms the other baselines on all evaluation sets, providing improvements of +5.11% and +4.25% in test EM over the original multilingual XLM-R fine-tuning method for the base and large versions, respectively. Our gains over ViReader and XLMRQA are +2.40% and +2.94% in EM score and +0.51% and +3.19% in F1 score, respectively. The monolingual PhoBERT performs almost the same as the multilingual XLM-R with analogous training data for the base and large versions. We observe that in this setting, all baseline multilingual fine-tuning methods benefit from extra training data from other languages. The performance gains in EM score are up to +2.89% and +3.80% for the base and large versions of XLM-R, respectively. Moreover, with the help of KD, we improve the EM score from 65.89% to 68.11% and the F1 score from 84.60% to 85.37% on the base version. Similarly, on the large version, our proposed KDDA improves the EM and F1 scores by +0.45% and +0.84%, respectively, compared to DA. This demonstrates the effectiveness of our method.

ViNLI. Table 5 presents our evaluation results on the ViNLI dataset. Note that, for consistency with the English dataset, we do not take the “other” label into account in the evaluation dataset. Once again, XLMR-KDDA provides better results than the other baseline methods. In particular, we obtain test-set Accuracy/F1 scores of 76.59%/76.59% on the base version and 85.25%/85.25% on the large version. Compared to the XLM-R_large model, our model shows improvements of +3.89% and +3.94% in Accuracy and F1 score, respectively. Similar to the MRC setting, combining KD with DA outperforms DA alone in all cases. The performance of PhoBERT is similar to that of XLM-R, except for PhoBERT_base on the development set (+3.05%/+3.09% in Accuracy and F1 score over XLM-R_base).

Table 3
Number of examples in UIT-ViQuAD and ViNLI datasets

Dataset	#Train	#Dev	#Test
UIT-ViQuAD	18,759	2,285	2,210
ViNLI	24,376	3,016	3,016

Table 4

Results in Exact Match and F1 Score of UIT-ViQuAD MRC task

Model	Dev EM	Dev F1	Test EM	Test F1
PhoBERT_base [9]	64.88	82.33	63.08	81.84
PhoBERT_large [9]	70.50	86.62	67.48	85.22
XLM-R_base [49]	63.87	81.90	63.00	81.95
XLM-R_large [49]	69.18	87.14	68.98	87.02
XLMRQA [47]	73.23	88.36	70.29	86.86
ViReader [48]	-	-	70.83	89.54
XLM-R_base - DA	69.22	85.89	65.89	84.60
XLM-R_base - KDDA	70.58	86.81	68.11	85.37
XLM-R_large - DA	74.80	90.15	72.78	89.21
XLM-R_large - KDDA	76.33	90.95	73.23	90.05

Table 5

Results in Accuracy and F1 Score of the ViNLI NLI task

Model	Dev Accuracy	Dev F1	Test Accuracy	Test F1
PhoBERT_base [50]	75.07	75.08	72.87	72.79
PhoBERT_large [50]	80.72	80.72	80.67	80.69
XLM-R_base [50]	72.02	71.99	71.59	71.51
XLM-R_large [50]	83.02	82.98	81.36	81.31
XLM-R_base - DA	74.43	74.35	74.77	74.75
XLM-R_base - KDDA	77.21	77.22	76.59	76.59
XLM-R_large - DA	84.21	84.23	84.05	84.08
XLM-R_large - KDDA	86.39	86.40	85.25	85.25
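The gains quoted in the text can be recomputed from the test-set columns of Tables 4 and 5. The snippet below is only an arithmetic check: the scores are copied from the tables, while the dictionary layout and the helper name `gain` are illustrative.

```python
# Test-set scores copied from Table 4 (EM) and Table 5 (Accuracy).
TEST_SCORES = {
    "ViQuAD EM": {
        "XLM-R_base": 63.00, "XLM-R_base-KDDA": 68.11,
        "XLM-R_large": 68.98, "XLM-R_large-KDDA": 73.23,
    },
    "ViNLI Accuracy": {
        "XLM-R_base": 71.59, "XLM-R_base-KDDA": 76.59,
        "XLM-R_large": 81.36, "XLM-R_large-KDDA": 85.25,
    },
}


def gain(task, size):
    """Absolute improvement of KDDA over plain XLM-R fine-tuning."""
    scores = TEST_SCORES[task]
    return scores[f"XLM-R_{size}-KDDA"] - scores[f"XLM-R_{size}"]


for task in TEST_SCORES:
    for size in ("base", "large"):
        print(f"{task} ({size}): {gain(task, size):+.2f}")
```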

Comparing our method with other baselines, we can conclude that our method notably improves the results on both ViQuAD and ViNLI datasets. The results show that in this setting, the XLMR-KDDA enables positive transfer in three aspects. First, we observe that the performance benefits from additional training data from another language, particularly the XLM-R_large gets the highest improvement. All the results from the benchmarks suggest that the multilingual model benefits from positive transfer, i.e., training a multilingual model on multiple language datasets yields greater improvements than training it on a monolingual dataset. This demonstrates that cross-lingual transfer may act as regularization strategy, making the model robust to noise and generalize on the training dataset, as proven in [53]. Second, our KD method utilizing OT improves the results and outperforms other baselines in all settings. The monolingual teacher models provide additional training signals, including language-specific and task-specific features, to the multilingual student model, thereby further boosting its performance. In our case, the student and the teacher features are determined by two feature spaces which are Multilingual and Monolingual spaces, while the OT loss function promotes the learning of a student model that minimizes the transportation cost between the feature sets. This provides an efficient approach for the distillation process without the need of additional mapping between two spaces. Third, our findings suggest that the overall performance of the multilingual student model can be enhanced by letting it learn from multiple monolingual teachers simultaneously. These monolingual supervising signals contribute to a diverse and comprehensive understanding of the relationships between different languages from various perspectives, thus helping the generalization process. 
Our study has led us to the conclusion that simultaneously modeling the soft label distribution from the monolingual teachers and the corresponding hard label distribution results in a more flexible and positive transfer. This transfer is learned without any modification to the original model architecture, thereby demonstrating the effectiveness of the proposed approach.

It is also worth mentioning that the monolingual models perform worse than the multilingual models when trained on the same data. Except for PhoBERT-base on the ViNLI development set, we observe only a deterioration in performance relative to XLM-R. One possible reason is that the superior performance of XLM-R can be attributed to its enormous amount of Vietnamese pre-training data (7 times larger than that of PhoBERT), which aligns with the findings of previous work [3].

4.4 Analysis

To shed more light on the contribution of each component in the OT-based knowledge distillation setting, this section studies the performance of the student model under different configurations. We inspect (1) the student’s performance with different teachers, (2) the distillation strategies, and (3) the trade-off between hard targets and soft targets.

Effect of teacher model performance. To investigate the performance of the student model under different teachers, we use a single English teacher and fine-tune seven Vietnamese teachers with different random seeds. We then run the knowledge transfer process with the English teacher and each of the Vietnamese teachers. As shown in Figures 2 and 3, KDDA yields noticeably different results under different teachers. On the ViQuAD dataset, the EM score is concentrated in the range 73.3% to 76.5% and peaks at 76.33%. On the ViNLI dataset, we observe a smaller performance gap, and the accuracy peaks at 86.39%. Most of the student models improve when receiving supervision signals from the teacher models. This indicates that, given fixed data and student model capacity, the performance of the student is sensitive to the quality of the teacher models, and that stronger teachers tend to result in better student performance.

Fig. 2

Exact Match score on the UIT-ViQuAD dataset for different Vietnamese teacher models.

Fig. 3

Accuracy score on the ViNLI dataset for different Vietnamese teacher models.

Effect of using different model outputs to calculate the OT objective. Here, we study how the performance of the student model is affected by the distillation scenario, namely whether the hidden states or the logits are used as input for calculating the distillation objective. Previous studies [11, 40, 41] regard the teacher’s hidden states as containing supervision signals; we additionally investigate the effect of utilizing the teacher’s logits. Specifically, we take the unnormalized predictions generated after the task-specific layer and examine how the different signals affect the performance of KDDA. The results on the two datasets in Table 6 indicate that hidden states provide a more meaningful representation for the knowledge transfer process than logits. In particular, EM on ViQuAD and accuracy on ViNLI increase by 2.78% and 1.02%, respectively. Both distillation scenarios exhibit improvements in performance. This highlights the importance of choosing an appropriate KD objective in the framework.

Table 6

Performance on UIT-ViQuAD and ViNLI using different OT objectives

KD loss type	UIT-ViQuAD EM	UIT-ViQuAD F1-Score	ViNLI Accuracy	ViNLI F1-Score
Hidden states	76.33	90.95	86.39	86.40
Logits	74.43	89.86	85.37	85.39

Effect of the lambda parameter. Our KD framework depends on one additional parameter λ, which controls the contribution of the KD loss to the fine-tuning objective. Table 7 presents the effect of varying λ on the development set results while keeping the other hyperparameters fixed. Setting λ to 0.3 gives the best accuracy and F1-score on ViNLI and the best F1-score on UIT-ViQuAD, whereas the best EM on UIT-ViQuAD is obtained at λ = 0.2. KDDA consistently outperforms the other baselines, indicating that the student model benefits from KD from the teacher models.

Table 7

Performance on UIT-ViQuAD and ViNLI with different λ parameters

λ value	UIT-ViQuAD EM	UIT-ViQuAD F1-Score	ViNLI Accuracy	ViNLI F1-Score
0.1	75.54	90.90	85.10	85.10
0.2	76.33	90.95	85.81	85.83
0.3	75.94	91.31	86.39	86.40
0.4	75.85	90.65	84.48	84.52
0.5	75.59	90.60	85.99	85.99
0.6	75.32	90.55	84.48	84.49
0.7	74.62	90.54	84.97	84.98
0.8	74.18	90.27	84.52	84.49
0.9	73.30	89.76	84.21	84.23
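The λ sweep in Table 7 can also be read off programmatically. The sketch below first shows one possible way in which λ could weight the KD loss (a convex combination, which is an assumption made here for illustration; the actual objective is the one defined in Section 3), and then looks up the best λ for each metric using the values reported in Table 7. The names `combined_loss`, `TABLE_7`, `METRICS`, and `best_lambda` are hypothetical.

```python
def combined_loss(task_loss, kd_loss, lam):
    """Blend the hard-target task loss with the KD loss.

    ASSUMPTION: a convex combination weighted by lam. This is only one
    common choice and may differ from the objective used in the paper.
    """
    return (1.0 - lam) * task_loss + lam * kd_loss


# Development-set results copied from Table 7:
# lambda -> (ViQuAD EM, ViQuAD F1, ViNLI accuracy, ViNLI F1)
TABLE_7 = {
    0.1: (75.54, 90.90, 85.10, 85.10),
    0.2: (76.33, 90.95, 85.81, 85.83),
    0.3: (75.94, 91.31, 86.39, 86.40),
    0.4: (75.85, 90.65, 84.48, 84.52),
    0.5: (75.59, 90.60, 85.99, 85.99),
    0.6: (75.32, 90.55, 84.48, 84.49),
    0.7: (74.62, 90.54, 84.97, 84.98),
    0.8: (74.18, 90.27, 84.52, 84.49),
    0.9: (73.30, 89.76, 84.21, 84.23),
}
METRICS = ("UIT-ViQuAD EM", "UIT-ViQuAD F1", "ViNLI Accuracy", "ViNLI F1")


def best_lambda(metric_index):
    """Return the lambda value with the highest score for one metric."""
    return max(TABLE_7, key=lambda lam: TABLE_7[lam][metric_index])


for index, name in enumerate(METRICS):
    lam = best_lambda(index)
    print(f"{name}: best lambda = {lam} ({TABLE_7[lam][index]:.2f})")
```

On these numbers, λ = 0.2 gives the highest EM on UIT-ViQuAD, while λ = 0.3 gives the highest value for the other three metrics, and all four metrics decline as λ approaches 0.9.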

5 Conclusion

In this study, we introduce a framework that distills knowledge from monolingual models into a multilingual model in order to improve the representation of Vietnamese sentences. To this end, we use an Optimal Transport (OT) objective to minimize the discrepancy between the feature spaces of the student model and the teacher models. Specifically, the proposed framework distills the knowledge of two monolingual teacher models and transfers it to a multilingual student model. Our results show that the approach yields competitive improvements and, in our experiments, outperforms the other baselines on two Vietnamese NLU benchmarks covering NLI and MRC. Our study contributes to research on resource-poor languages such as Vietnamese, where the main challenge is the lack of resources, including annotated corpora and other lexical resources. The proposed approach could also serve as a baseline for future research and help catalyze further improvements in NLU for resource-poor languages.

In future work, we will conduct further experiments to explore the impact of the approach on other NLP tasks, such as natural language generation. We will also explore other OT algorithms in our framework for Vietnamese NLU tasks.

Footnotes

CRediT author statement

Phu Xuan-Vinh Nguyen & Thu Hoang-Thien Nguyen: Conceptualization, Methodology, Software, Validation, Writing - Original Draft, Visualization. Ngan Luu-Thuy Nguyen & Kiet Van Nguyen: Resources, Writing - Review & Editing, Supervision, Project administration.

Acknowledgement

This research is funded by Vietnam National University HoChiMinh City (VNU-HCM) under grant number DS2022-26-01.

References

1. Devlin, Chang M.-W., Lee and Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), (Minneapolis, Minnesota), pp. 4171–4186, Association for Computational Linguistics, June 2019.

2. Lan, Chen, Goodman, Gimpel, Sharma and Soricut, Albert: A lite bert for self-supervised learning of language representations, The International Conference on Learning Representations (ICLR), 2020.

3. Liu, Ott, Goyal, Du, Joshi, Chen, Levy, Lewis, Zettlemoyer and Stoyanov, Roberta: A robustly optimized bert pretraining approach, arXiv preprint arXiv:1907.11692, 2019.

4. Pires, Schlinger and Garrette, How multilingual is multilingual BERT? in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, (Florence, Italy), pp. 5931–5937, Association for Computational Linguistics, July 2019.

5. Wu and Dredze, Beto, bentz, becas: The surprising cross-lingual effectiveness of BERT, in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), (Hong Kong, China), pp. 833–844, Association for Computational Linguistics, Nov. 2019.

6. Conneau and Lample, Cross-lingual language model pretraining, in Advances in Neural Information Processing Systems (H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox and R. Garnett, eds.), vol. 32, Curran Associates, Inc., 2019.

7. Conneau, Khandelwal, Goyal, Chaudhary, Wenzek, Guzmán, Grave, Ott, Zettlemoyer and Stoyanov, Unsupervised cross-lingual representation learning at scale, in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, (Online), pp. 8440–8451, Association for Computational Linguistics, July 2020.

8. Aharoni, Johnson and Firat, Massively multilingual neural machine translation, in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), (Minneapolis, Minnesota), pp. 3874–3884, Association for Computational Linguistics, June 2019.

9. Nguyen D.Q. and Tuan Nguyen, PhoBERT: Pre-trained language models for Vietnamese, in Findings of the Association for Computational Linguistics: EMNLP 2020, (Online), pp. 1037–1042, Association for Computational Linguistics, Nov. 2020.

10. Virtanen, Kanerva, Ilo, Luoma, Luotolahti, Salakoski, Ginter and Pyysalo, Multilingual is not enough: Bert for finnish, vol. abs/1912.07076, 2019.

11. Nguyen T.T. and Luu A.T., Improving neural cross-lingual abstractive summarization via employing optimal transport distance for knowledge distillation, in Proceedings of the AAAI Conference on Artificial Intelligence 36 (2022), 11103–11111.

12. Feydy, Séjourné, Vialard F.-X., Amari S.-I., Trouvé and Peyré, Interpolating between optimal transport and mmd using sinkhorn divergences, in The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2681–2690, PMLR, 2019.

13. Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez A.N., Kaiser Ł. and Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).

14. Schick and Schütze, Exploiting cloze-questions for few shot text classification and natural language inference, in Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, (Online), pp. 255–269, Association for Computational Linguistics, Apr. 2021.

15. Lauscher, Ravishankar, Vulić and Glavaš, From zero to hero: On the limitations of zero-shot language transfer with multilingual Transformers, in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), (Online), pp. 4483–4499, Association for Computational Linguistics, Nov. 2020.

16. He, Gao and Chen, Debertav3: Improving deberta using electra-style pre-training with gradient-disentangled embedding sharing, arXiv preprint arXiv:2111.09543, 2021.

17. He, Liu, Gao and Chen, Deberta: Decoding-enhanced bert with disentangled attention, The International Conference on Learning Representations (ICLR), 2021.

18. Pfeiffer, Goyal, Lin, Li, Cross, Riedel and Artetxe, Lifting the curse of multilinguality by pre-training modular transformers, in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, (Seattle, United States), pp. 3479–3495, Association for Computational Linguistics, July 2022.

19. Li, Ding, Zhang, Cheng, Hu and Luo, Multilevel distillation of semantic knowledge for pre-training multilingual language model, The Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022.

20. Pan, Hang C.-W., Qi, Shah, Potdar and Yu, Multilingual BERT post-pretraining alignment, in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, (Online), pp. 210–219, Association for Computational Linguistics, June 2021.

21. Phan, Tran, Nguyen and Trinh T.H., ViT5: Pretrained text-to-text transformer for Vietnamese language generation, in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop, (Hybrid: Seattle, Washington + Online), pp. 136–142, Association for Computational Linguistics, July 2022.

22. Antoun, Baly and Hajj, AraBERT: Transformer based model for Arabic language understanding, in Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection, (Marseille, France), pp. 9–15, European Language Resource Association, May 2020.

23. Martin, Muller, Ortiz Suárez P.J., Dupont Y., Romary L., de la Clergerie É., Seddah and Sagot, CamemBERT: a tasty French language model, in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, (Online), pp. 7203–7219, Association for Computational Linguistics, July 2020.

24. De Vries, van Cranenburgh, Bisazza, Caselli, van Noord and Nissim, Bertje: A dutch bert model, arXiv preprint arXiv:1912.09582, 2019.

25. Chung H.W., Garrette, Tan K.C. and Riesa, Improving multilingual models with language-clustered vocabularies, in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), (Online), pp. 4536–4546, Association for Computational Linguistics, Nov. 2020.

26. Hinton, Vinyals and Dean, Distilling the knowledge in a neural network, Conference on Neural Information Processing Systems, 2014.

27. Pan, Wang, Qiu, Zhang, Li and Huang, Meta-kd: A meta knowledge distillation framework for language model compression across domains, Annual Meeting of the Association for Computational Linguistics, 2021.

28. Wang, Wei, Dong, Bao, Yang and Zhou, Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers, Advances in Neural Information Processing Systems 33 (2020), 5776–5788.

29. Jiao, Yin, Shang, Jiang, Chen, Li, Wang and Liu, TinyBERT: Distilling BERT for natural language understanding, in Findings of the Association for Computational Linguistics: EMNLP 2020, (Online), pp. 4163–4174, Association for Computational Linguistics, Nov. 2020.

30. Clark, Luong M.-T., Khandelwal, Manning C.D. and Le Q.V., BAM! Born-again multi-task networks for natural language understanding, in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, (Florence, Italy), pp. 5931–5937, Association for Computational Linguistics, July 2019.

31. Chi, Dong, Wei, Mao and Huang, Can monolingual pretrained models help cross-lingual classification? in Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, (Suzhou, China), pp. 12–17, Association for Computational Linguistics, Dec. 2020.

32. Wang, Jiang, Bach, Wang, Huang and Tu, Structure-level knowledge distillation for multilingual sequence labeling, in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, (Online), pp. 3317–3330, Association for Computational Linguistics, July 2020.

33. Khanuja, Johnson and Talukdar, MergeDistill: Merging language models using pre-trained distillation, in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, (Online), pp. 2874–2887, Association for Computational Linguistics, Aug. 2021.

34. Monge, Mémoire sur la théorie des déblais et des remblais, in Mem. Math. Phys. Acad. Royale Sci., pp. 666–704, 1781.

35. Kantorovich L.V., On the translocation of masses, Journal of Mathematical Sciences 133(4) (2006), 1381–1382.

36. Li, Li, Wang, Fu, Lin, Chen, Zhang, Tao, Zhang, Wang, Shen, Yang and Carin, Improving text generation with student-forcing optimal transport, in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), (Online), pp. 9144–9156, Association for Computational Linguistics, Nov. 2020.

37. Tang, Hu, Yan, Zhang, Gao and Wang, OTExtSum: Extractive Text Summarisation with Optimal Transport, in Findings of the Association for Computational Linguistics: NAACL 2022, (Seattle, United States), pp. 1128–1141, Association for Computational Linguistics, July 2022.

38. Swanson, Yu and Lei, Rationalizing text matching: Learning sparse alignments via optimal transport, Annual Meeting of the Association for Computational Linguistics, 2020.

39. Bose, Illina and Fohr, Transferring knowledge via neighborhood-aware optimal transport for low-resource hate speech detection, Annual Meeting of the Association for Computational Linguistics, 2022.

40. Romero, Ballas, Kahou S.E., Chassang, Gatta and Bengio, Fitnets: Hints for thin deep nets, arXiv preprint arXiv:1412.6550, 2014.

41. Lee and Song B.C., Graph-based knowledge distillation by multi-head attention network, British Machine Vision Conference (BMVC), 2019.

42. Neubig and Hu, Rapid adaptation of neural machine translation to new languages, in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, (Brussels, Belgium), pp. 875–880, Association for Computational Linguistics, Oct.-Nov. 2018.

43. Johnson, Schuster, Le Q.V., Krikun, Chen, Thorat, Viégas, Wattenberg, Corrado, Hughes and Dean, Google’s multilingual neural machine translation system: Enabling zero-shot translation, Transactions of the Association for Computational Linguistics 5 (2017), 339–351.

44. Cattan, Servan and Rosset, On the usability of transformers-based models for a French question-answering task, in Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), (Held Online), pp. 244–255, INCOMA Ltd., Sept. 2021.

45. Cuturi, Sinkhorn distances: Lightspeed computation of optimal transport, Advances in Neural Information Processing Systems 26 (2013).

46. Sinkhorn, A relationship between arbitrary positive matrices and doubly stochastic matrices, The Annals of Mathematical Statistics 35(2) (1964), 876–879.

47. Nguyen K.V., Do P.N.-T., Nguyen N.D., Huynh T.V., Nguyen A.G.-T. and Nguyen N.L.-T., Xlmrqa: Open-domain question answering on vietnamese wikipedia-based textual knowledge source, Intelligent Information and Database Systems: 14th Asian Conference, ACIIDS 2022, Ho Chi Minh City, Vietnam, November 28–30, 2022, Proceedings, Part I, pp. 377–389, Springer, 2022.

48. Van Nguyen, Nguyen N. Duy, Do P.N.-T., Nguyen A. Gia-Tuan and Nguyen N.L.-T., Vireader: A wikipedia based vietnamese reading comprehension system using transfer learning, Journal of Intelligent & Fuzzy Systems 41(1) (2021), 1993–2011.

49. Nguyen, Nguyen and Nguyen, A vietnamese dataset for evaluating machine reading comprehension, in Proceedings of the 28th International Conference on Computational Linguistics, pp. 2605–2020, 2020.

50. Van Huynh, Van Nguyen and Nguyen N.L.-T., Vinli: a vietnamese corpus for studies on open-domain natural language inference, in Proceedings of the 29th International Conference on Computational Linguistics, (2022), 3858–3872.

51. Rajpurkar, Zhang, Lopyrev and Liang, Squad: 100,000+ questions for machine comprehension of text, arXiv preprint arXiv:1606.05250, 2016.

52. Williams, Nangia and Bowman S.R., A broad-coverage challenge corpus for sentence understanding through inference, North American Chapter of the Association for Computational Linguistics, 2018.

53. Shorten, Khoshgoftaar T.M. and Furht, Text data augmentation for deep learning, Journal of Big Data 8 (2021), 1–34.

