Abstract
Cross-lingual event retrieval is an information retrieval task that uses a query in one language to find texts or documents in other languages that are related to a specific event. In Chinese-Vietnamese cross-lingual event retrieval, a Chinese query is used to retrieve Vietnamese documents related to the queried event. The critical issue is how to align query and document representations efficiently with limited resources. Existing cross-lingual pre-trained models are trained on large-scale multilingual corpora, but their training objectives do not include explicit language alignment tasks. Because the training corpora are unevenly distributed across languages, these models suffer from language bias, and cross-lingual retrieval built on them inherits this bias. To solve this problem, this paper proposes a Chinese-Vietnamese cross-lingual event retrieval method based on knowledge distillation. By transferring knowledge from a high-resource language to low-resource languages, the approach enables the model to learn good query-document matching features from monolingual retrieval. By enhancing the alignment between queries and documents in different languages in a shared semantic space, the method improves the performance of Chinese-Vietnamese cross-lingual event retrieval.
Introduction
Chinese-Vietnamese cross-language event retrieval aims to retrieve relevant Vietnamese event documents for an input Chinese event query [1]. As shown in Fig. 1, a Chinese query (the seventh Sino-Vietnamese defense exchange) is entered and the relevant Vietnamese event documents are retrieved. Cross-language event retrieval helps researchers and data analysts obtain event information across language barriers in a multilingual environment, and collecting and analyzing multilingual text data yields more comprehensive and accurate event analysis. In addition, it helps users understand events of concern to other language communities and meets their needs for diverse and global information.

Example diagram of cross-language event retrieval.
Cross-lingual event retrieval is a particular cross-lingual information retrieval task. In recent years, progress in traditional cross-language information retrieval has fallen into three categories: methods based on machine translation, methods based on cross-language (multilingual) word embeddings, and methods based on multilingual pre-trained language models (such as mBERT [2] and XLM-R [3]). Machine translation-based methods use neural machine translation to map queries and documents to the same semantic space and then perform monolingual retrieval [4]. These methods bridge the semantic gap between languages in cross-language information retrieval to a certain extent, but they rely heavily on the accuracy of neural machine translation, which can easily cause word mismatch and translation ambiguity. Machine translation errors directly affect retrieval results, especially for low-resource language pairs with significant differences (such as Chinese and Vietnamese). To solve these problems, researchers proposed cross-language information retrieval methods based on pre-trained cross-language word vectors [5]. The core idea is to use cross-language word vectors to map text semantics in different languages into the same semantic space, thereby addressing the cross-lingual retrieval problem. However, because these approaches neglect word order and contextual information, they produce inaccurate semantic representations of the query or the text to be retrieved, and they are prone to error propagation when semantic representations are mapped between languages, which in turn degrades the performance of the retrieval model.
With the emergence of multilingual pre-trained language models such as mBERT [2] and XLM-R [3], approaches based on these models [6, 7] have become the primary paradigm for cross-language information retrieval. mBERT uses the same pre-training tasks as BERT, MLM and NSP, to learn word-level and sentence-level representations, but its training process includes no cross-language alignment objective. XLM combines monolingual self-supervised and cross-lingual supervised training objectives, using Causal Language Modeling (CLM), Translation Language Modeling (TLM), and MLM-like tasks, so that sentences in different languages are encoded in the same embedding space. To make better use of cross-language pre-trained models, Litschko et al. [8] studied pre-trained multilingual text encoders based on the Transformer architecture, such as multilingual BERT (mBERT) and XLM, as the latest paradigm of multilingual and cross-language representation learning, and examined their applicability to unsupervised cross-language document retrieval and cross-language sentence retrieval. Their results show that, without additional supervised training, a self-supervised multilingual encoder cannot outperform CLIR models based on cross-lingual word embeddings in most cases. State-of-the-art performance can be achieved in sentence-level CLIR, but this requires a variant of the encoder further specialized for sentence understanding rather than a general pre-trained multilingual text encoder; such specialization can go some way toward solving the “linguistic bias” problem. Reimers [9] et al. proposed a knowledge distillation framework. 
In order to extend a monolingual sentence embedding model to multilingual scenarios, the student model is trained on a parallel corpus so that a translated sentence is mapped to the same position in the vector space as the original sentence; the student model thereby learns to transfer monolingual embedding knowledge to multilingual embedding generation. Its validity, verified in more than 50 languages, shows that it addresses the linguistic bias of cross-language models to a certain extent and achieves effective results in sentence retrieval. Li [10] et al. proposed to enhance cross-language representations using knowledge distillation, learning representations in different languages through cross-language word alignment algorithms. Relying on the representational capabilities of a pre-trained multilingual encoder, useful representations of non-English text are learned from an information retriever trained only on English. The teacher model is constructed as a two-stage inference process that involves query translation and monolingual information retrieval. Knowledge from the teacher model is then transferred to the student model through knowledge distillation. The goal is to teach the model powerful multilingual representations as well as CLIR through two corresponding knowledge distillation objectives. Xu [11] et al. proposed a semi-interactive knowledge distillation retrieval model that sits between the representation-based and interaction-based paradigms. The model is initialized from the multilingual pre-trained language model mBERT and is built on a non-interactive architecture, but it encodes each document together with its associated multilingual queries, which allows the model to learn cross-lingual features better, similar to an interactive model. The semi-interactive mechanism gives the V-CLIR model the ability to combine the advantages of both interactive and non-interactive paradigms. 
This architecture can learn good cross-language matching features through interaction while improving retrieval efficiency. Although cross-lingual pre-trained models can represent inputs in different languages, some problems remain in cross-lingual information retrieval (CLIR). Taking mBERT as an example, the model is pre-trained on Wikipedia texts in 104 languages, but the training data in different languages are not aligned, and no language alignment objective is considered during training. As a result, the distribution of embeddings across languages lacks consistency. The data used to train mBERT are also clearly imbalanced: training data for high-resource languages are relatively easy to obtain, so the model performs relatively well on them, while training data for low-resource languages are harder to obtain, so performance on these languages is significantly lower [12]. This data imbalance has led to the concern that mBERT suffers from a “language bias”, which may be inherited by cross-language retrieval models built on it. In cross-language retrieval, this bias may cause the model to prefer documents in the same language as the query over documents that are semantically closer. For a Chinese query, for example, the model is more likely to select Chinese documents and ignore Vietnamese documents that are semantically closer to the query. This problem highlights the potential negative impact of “language bias” on cross-language retrieval performance and the need to mitigate or correct it through appropriate methods.
In order to construct an effective cross-language event retrieval model and solve the language bias problem in cross-language models, this paper proposes a Chinese-Vietnamese cross-language event retrieval method based on knowledge distillation. We improve ColBERT [13] by replacing its encoder module, on both the query and document encoder sides, with a Chinese-Vietnamese cross-language event pre-training model. The teacher model is then trained on an English dataset to learn knowledge about matching queries and documents. The student model learns two kinds of features from the teacher model: the query and document representations on the one hand, and the interaction between the query and document representations on the other. The student model is then encouraged to further align the Chinese and Vietnamese representations by moving the embedding spaces of the different languages closer to the well-behaved English embedding space, making the Chinese and Vietnamese embedding distributions as close as possible to the English embedding distribution. When the Chinese-Vietnamese query-document pairs used to train the student model are parallel to the English-English query-document pairs used to train the teacher model, the student model can acquire the teacher model’s retrieval capabilities. Experiments on the publicly available mMARCO dataset and a self-constructed Chinese-Vietnamese cross-language event retrieval dataset show that the proposed method outperforms the traditional baseline methods, validating its effectiveness.
Cross-lingual event retrieval can be regarded as a special case of Cross-Lingual Information Retrieval (CLIR) [14, 15], which aims to use event queries in one language to retrieve event documents in another language. Unlike monolingual retrieval, cross-lingual event retrieval needs to align query and document information in different languages and model the relevance of event queries and documents. Research on cross-language event retrieval is still immature and builds mainly on cross-language information retrieval, whose recent methods fall into three groups: (1) methods based on machine translation; (2) methods based on cross-language word embedding; (3) methods based on cross-language pre-training models.
Machine translation-based methods first use machine translation to translate the query or the document into the language of the other and then perform monolingual retrieval once the two are in the same language. Hull [4] et al. introduced bilingual dictionaries to resolve translation ambiguity during query translation and explored the feasibility of cross-language information retrieval based on bilingual dictionaries. Gao [16] et al. argued that, during query translation, grouping query words into phrases as far as possible and translating them as phrases improves cross-language retrieval. They proposed a translation selection method and used a statistical translation model to improve retrieval. Machine translation methods mainly use translation to align queries with target-language documents. They achieve good results for high-resource languages, where translation quality is high, but for low-resource languages translation quality is poor, which reduces retrieval effectiveness.
Methods based on cross-language word embedding map words from different languages to a common vector space, so that knowledge can be transferred from one language to another without additional translation, which simplifies the entire retrieval process. Litschko [17] et al. proposed a fully unsupervised framework for cross-lingual information retrieval that does not require any bilingual data, utilizing off-the-shelf cross-lingual word embeddings combined with query translation and semantic space ranking. Bonab [5] et al. used dictionaries to guide statistical word alignment methods to generate cross-lingual word embeddings, combined with a deep matching model (DRMM) to complete query-document pair matching. On this basis, Zhou [18] et al. utilized a neural generative model with a Wasserstein autoencoder to learn neural topic-augmented cross-lingual word embeddings and applied them to cross-lingual retrieval. The word embedding approach avoids the error propagation of the machine translation approach. However, pre-training cross-language word vectors requires a large amount of data and computing resources, so the cost is high. Moreover, because static word vectors do not fully represent the contextual semantic information in the text, deeper semantic information is lost in some texts, so this approach is not the best solution for cross-lingual event retrieval.
Methods based on cross-language pre-training train on large-scale datasets so that the model learns rich linguistic knowledge and semantic representations, which can alleviate data scarcity, capture contextual information, and effectively improve cross-language retrieval. Jiang [19] et al. used weak supervision to construct cross-language question-answer pairs from parallel corpora. The pre-trained BERT model is then fine-tuned for information retrieval to learn IR-specific features, such as the exact query-document word matching used in monolingual IR, together with bilingual features. Through this process, the BERT model is adapted to CLIR tasks. The inputs are pairs consisting of an English query and a foreign-language document sentence. The query and the sentence are concatenated into a single text sequence, and the output embedding of the first BERT token is used as the representation of the whole query-sentence pair. A single-layer feed-forward neural network then performs binary classification to predict the relevance of the query to the document sentence, and the model is used for reranking in CLIR. Wang [20] et al. applied multilingual BERT to a cross-language information retrieval (CLIR) task, using a triplet loss to learn the relevance between queries and documents in different languages. In addition, aligning the token embeddings of different languages through an adversarial network helps the language model learn cross-lingual sentence representations. Results are reported on the CLIR dataset CLIRMatrix. Nair [6] et al. transferred monolingual query-document matching features to CLIR. In addition, Yu [7] et al. identified the lack of cross-language segment-level relevance data for fine-tuning and the lack of query-document style pre-training as the key constraints on using pre-trained language models for cross-language retrieval. In the pre-training and fine-tuning phases of the cross-language model, they employed a global + sliding window attention mechanism to better align cross-language textual representations and to minimize information loss, which also shows that data issues can lead to language bias in pre-trained models. Yang [21] et al. performed continued pre-training with a weakly supervised contrastive learning approach, using comparable documents connected by cross-language links in Wikipedia, so that text representations conveying similar information in different languages become more similar. Because multilingual pre-training models do not use parallel texts in the pre-training phase, representations of texts with similar information are not necessarily similar across languages; contrastive learning explicitly promotes word-level similarity between such texts during pre-training, which improves the effectiveness of the cross-language information retrieval model. A cross-language pre-training model can overcome the language barrier well. However, the lack of alignment between representations in different languages leads to inaccurate similarity calculations, so additional alignment tasks are required. Although pre-training-based methods have been widely used in several tasks, to our knowledge they have yet to be used in cross-lingual event retrieval.
Cross-lingual event retrieval
Compared with traditional text retrieval, event retrieval searches for events by exploiting event knowledge and using events as proxies for information needs. Xue [1] et al. proposed to transform the cross-language event retrieval problem into a monolingual event retrieval problem through query translation, to filter irrelevant documents through event trigger words, and to expand event query words with multilingual knowledge graphs, improving the performance of cross-language event retrieval. Tang [22] et al. used word vectors to construct Chinese semantic feature vectors for event keywords, then calculated feature translation vectors for event keywords in Vietnamese, and finally aligned keywords across languages by calculating the similarity between semantic feature vectors, achieving automatic translation of query keywords and thus cross-language event retrieval. Metzler [23] et al. utilized a pseudo-relevance feedback mechanism combined with the temporal information of events to expand query terms and then performed event retrieval on microblogging data. Similarly, Rosin [24] et al. performed query expansion on event query phrases based on pseudo-relevance feedback and then computed the cosine similarity between the query and the document. Zhao [25] et al. constructed a neural matching model that accounts for the dynamic and coupled nature of events to retrieve news texts related to specific events. Bernard [26] et al. used an additional knowledge base to extract event features, which were then weighted according to the relevance of the entity to the event, followed by similarity calculation. When applied to cross-language information retrieval, these methods suffer from problems such as poor multilingual alignment, semantic mismatch in the retrieval process, and language bias.
In this study, we are concerned with effectively acquiring alignment knowledge between Vietnamese and Chinese to reduce the bias between high-resource and low-resource languages. We therefore replace the encoder module of ColBERT with the Chinese-Vietnamese cross-language event pre-training model to obtain Chinese-Vietnamese alignment knowledge. The model is then trained as a teacher model on English triples to learn language-independent query-document matching knowledge. To further align the Chinese and Vietnamese representations and learn better cross-language event retrieval capabilities, the embedding spaces of the different languages in the student model are encouraged to move closer to the well-performing English embedding space, so that the Chinese and Vietnamese embedding distributions are as close as possible to the English embedding distribution. In short, this study focuses on making full use of the knowledge learned in monolingual retrieval for cross-lingual event retrieval.
Chinese-Vietnamese cross-lingual event retrieval model based on knowledge distillation
ColBERT model improvement
In this study, mBERT is further pre-trained to address the language and event differences in Chinese-Vietnamese cross-lingual event retrieval. To validate the proposed method, evaluations were carried out on the public mMARCO dataset and the self-built Chinese-Vietnamese cross-lingual event retrieval dataset. The retrieval model is an improvement on ColBERT [13] and mainly includes a shared dual encoder module and a score fusion ranking module. The model encodes the query and the document independently and uses the late-interaction mechanism proposed in ColBERT to perform fine-grained interactions between queries and documents, which improves retrieval accuracy while reducing retrieval cost. On the query and document encoder side, the Chinese-Vietnamese cross-lingual event pre-training model replaces the encoder module of ColBERT, and the resulting model is applied to the Chinese-Vietnamese cross-lingual event retrieval task, as shown in Fig. 2.

Cross-lingual event retrieval model.
In Fig. 2, a query phrase $Q$ is given to the query encoder. First, based on the cross-language event pre-training model (emBERT), the query is segmented into the sequence $q_1, q_2, \ldots, q_t$, where $t$ is the length of the query and $q_i$ ($i = 1, 2, \ldots, t$) is the $i$-th word in the query. Unlike ColBERT, we do not add a special marker to identify queries; instead, we directly prepend the special token [CLS] to the query $Q$, so that the model learns to distinguish queries and documents in different languages. emBERT then produces a contextual representation of the query sequence $Q = q_1, q_2, \ldots, q_t$, and the output at [CLS] is used as the context representation $e_q$ of the query. The specific encoding of the query is given in formula (1).
Like the query encoder, the document encoder represents the document as $D = d_1, d_2, \ldots, d_m$, where $m$ is the length of the document and $d_j$ ($j = 1, 2, \ldots, m$) is the $j$-th word in the document. The context representation $e_d$ of the document is obtained through emBERT, and the specific encoding of the document is shown in formula (2).
Then, after a given query-document pair is encoded by emBERT to obtain the corresponding representations $e_q$ and $e_d$, the relevance score of the query and the document is calculated through the late-interaction mechanism: the sum of the scores obtained with the MaxSim operator is $Score_{q,d}$. The specific calculation process is shown in formula (3).
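The late-interaction scoring described above can be illustrated with a minimal sketch. It assumes token-level embeddings stored as plain Python lists and cosine similarity as the token-to-token similarity, as in the original ColBERT; the function names and the toy vectors are hypothetical and are not taken from the paper's implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_embs, doc_embs):
    """Sum, over query tokens, of the best cosine match among document tokens."""
    q = [l2_normalize(v) for v in query_embs]
    d = [l2_normalize(v) for v in doc_embs]
    return sum(max(dot(qi, dj) for dj in d) for qi in q)


# Toy 2-dimensional token embeddings (illustrative values only).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # covers both query tokens
doc_b = [[-1.0, 0.0]]                         # matches neither query token

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
```

Because each query token only needs its single best-matching document token, documents can be encoded offline and scored cheaply at query time, which is the efficiency argument behind late interaction.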
For the encoder that replaces the ColBERT encoder, we further pre-train mBERT on a Chinese-Vietnamese bilingual corpus with two tasks: event element mask pre-training and cross-language event contrastive pre-training.
Event element mask pre-training
Mask pre-training has demonstrated strong language understanding capabilities in BERT and its variants, but random masking cannot fully learn the event knowledge in news texts. Therefore, we improve traditional mask pre-training and perform mask prediction on event elements so that the model focuses on the main part of the event. Given a Chinese event sentence $Sentence_{zh}$ whose event elements are $el_l$ ($l = 1, 2, 3, \ldots$), we first replace each $el_l$ with the [MASK] tag and then concatenate the result with the Vietnamese pseudo-parallel event sentence $Sentence_{vi}$; the final input is a sequence containing special tags, as in formula (4).
It is then converted into the corresponding context representation $H^{(k)} \in \mathbb{R}^{N \times dim}$ through the emBERT embedding layer and $k$ Transformer layers, where $N$ is the maximum sequence length and $dim$ is the hidden layer dimension. The specific calculation process is given in formulas (5) and (6).
The sequence representation output by the last layer is then sent to the subsequent linear layer to obtain the probability of each masked event element. For each position $el_l$ replaced by the [MASK] tag in $Sentence_{zh}$, the corresponding final representation is $H_l$, which is used to calculate the loss. We use the cross-entropy loss to optimize the model, and the specific calculation is given in formula (7).
In the event element mask pre-training, we only replace the event elements in $Sentence_{zh}$. This encourages the model to use the semantic information of the Vietnamese pseudo-parallel sentence to restore the replaced parts and thereby learn cross-lingual features. The model structure is shown in Fig. 3.
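As an illustration of how such an input could be assembled, the sketch below masks only the event elements of a tokenized Chinese sentence and appends the Vietnamese pseudo-parallel sentence. The layout `[CLS] zh [SEP] vi [SEP]` is an assumption based on common BERT practice, since the exact special tags of formula (4) are not reproduced in the text; the function name, the tokenization, and the example sentences are hypothetical.

```python
def build_masked_input(zh_tokens, event_element_positions, vi_tokens):
    """Mask event elements in the Chinese sentence and append the Vietnamese one.

    Returns the input sequence and a dict mapping each masked position in that
    sequence to the original token the model should recover.
    """
    masked_zh = [
        "[MASK]" if i in event_element_positions else tok
        for i, tok in enumerate(zh_tokens)
    ]
    # Assumed layout: [CLS] masked-zh [SEP] vi [SEP]; only the zh side is masked.
    sequence = ["[CLS]"] + masked_zh + ["[SEP]"] + list(vi_tokens) + ["[SEP]"]
    # Offset by 1 because [CLS] occupies position 0 of the final sequence.
    labels = {i + 1: zh_tokens[i] for i in sorted(event_element_positions)}
    return sequence, labels


zh_tokens = ["克里米亚", "大桥", "发生", "爆炸"]
vi_tokens = ["Cầu", "Crimea", "bị", "nổ"]
sequence, labels = build_masked_input(zh_tokens, {0, 1}, vi_tokens)
```

Leaving the Vietnamese side intact means the only clue to the masked Chinese event elements is the parallel sentence, which is what pushes the model toward cross-lingual features.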

Event element mask pre-training structure.
Existing contrastive learning is mostly applied in monolingual environments, aiming to maximize the similarity between queries and related documents, so that semantically similar sentences are closer in the representation space while dissimilar sentences are farther apart. We extend this goal to scenarios where queries and documents belong to different languages. Given a Chinese query phrase $Q_{zh}$, its corresponding relevant documents are
Specifically, as shown in Fig. 4, our Chinese query phrase is a phrase composed of event elements and event trigger words, such as “Russian Crimea Bridge Explosion”, taking “On October 8, 2022, an explosion occurred on the Crimea Bridge connecting the Russian mainland and the Crimea Peninsula, causing part of the bridge deck to be destroyed...” as a positive example. The corresponding Vietnamese descriptions of similar events “Destruction of the Big Sur Bridge” and “Collapse of the Wuxi Bridge” are taken as negative examples.
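A contrastive objective of this kind is often written as an InfoNCE-style loss. The sketch below is one such formulation under stated assumptions: pooled sentence vectors, cosine similarity, and a temperature of 0.05 are all illustrative choices, and the paper's own loss may differ in detail.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce_loss(query_vec, positive_vec, negative_vecs, temperature=0.05):
    """Negative log-probability of the positive among positive + negatives."""
    sims = [cosine(query_vec, positive_vec)]
    sims += [cosine(query_vec, n) for n in negative_vecs]
    logits = [s / temperature for s in sims]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


# Toy vectors: a Chinese query, a matching Vietnamese document, two distractors.
q_zh = [1.0, 0.0]
d_vi_pos = [0.9, 0.1]
d_vi_negs = [[0.0, 1.0], [-1.0, 0.0]]

loss_aligned = info_nce_loss(q_zh, d_vi_pos, d_vi_negs)
loss_misaligned = info_nce_loss(q_zh, [0.0, 1.0], [d_vi_pos, [-1.0, 0.0]])
```

Using descriptions of similar but different events as negatives, as in the bridge example, makes the task harder than random negatives and forces the encoder to separate events rather than topics.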

Cross-lingual event contrastive learning structure.
Both our teacher and student models are based on the ColBERT architecture, which mainly contains three modules: a query encoder, a document encoder, and a late-interaction mechanism. ColBERT uses a Transformer-based encoder to encode the input query and document separately, followed by a linear compression layer, computed as in formula (9) and formula (10).
Each training instance is a triple of a query $q$, a positively related document $d^+$, and a negatively related document $d^-$. The relevance score $S_{q,d}$ of the query-document pair is then computed using formula (11), where $E_{q_i}$ denotes the embedded representation of the $i$-th word in query $q$ and $E_{d_j}$ denotes the embedded representation of the $j$-th word in document $d$.
For a given training triplet, the pairwise softmax cross-entropy loss over $S_{q,d^+}$ and $S_{q,d^-}$ is minimized.
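A minimal sketch of this triplet objective, assuming the pairwise softmax cross-entropy used in the original ColBERT: the loss is the negative log-probability that the positive document wins a two-way softmax over the two scores. The function name is hypothetical.

```python
import math


def pairwise_softmax_loss(score_pos, score_neg):
    """-log(exp(s+) / (exp(s+) + exp(s-))), written as a stable softplus."""
    diff = score_neg - score_pos
    # softplus(diff) = log(1 + exp(diff)), computed without overflow.
    return max(diff, 0.0) + math.log1p(math.exp(-abs(diff)))


loss_tie = pairwise_softmax_loss(0.0, 0.0)    # equal scores
loss_good = pairwise_softmax_loss(5.0, 1.0)   # positive ranked higher
loss_bad = pairwise_softmax_loss(1.0, 5.0)    # negative ranked higher
```

The loss depends only on the score gap, so it rewards ranking the positive above the negative rather than hitting any absolute score.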
As mentioned above, our purpose is to transfer the good query-document matching features of the monolingual retrieval model to the cross-lingual event retrieval model, so as to build an efficient end-to-end cross-lingual event retrieval model for different language scenarios. Since the query encoder and the document encoder are independent of each other, the ColBERT architecture can be extended to other languages by using different encoders. Specifically, the teacher model is first trained on an English dataset, and the student model learns features from two aspects of the teacher model. The embedding spaces of the different languages in the student model are then encouraged to move closer to the well-performing English embedding space, so that the Chinese and Vietnamese embedding distributions are as close as possible to the English embedding distribution, further aligning the Chinese and Vietnamese representations. The specific model structure is shown in Fig. 5.

Chinese-Vietnamese cross-lingual event retrieval based on knowledge distillation.
The teacher model contains a query encoder $E^T_q$ and a document encoder $E^T_d$. We initialize it with the Chinese-Vietnamese cross-language event pre-training model and train it on English triples, each consisting of a query $q$, a positively related document $d^+$, and a negatively related document $d^-$. The teacher model aims to learn language-independent query-document matching knowledge. Given an English query $q$ and an English candidate document $d$, the relevance score $S_{q,d}$ of the query and document is calculated according to formula (12).
Here $E^T_q(q_i)$ is the embedding representation of the $i$-th word in the query, and $E^T_d(d_j)$ is the embedding representation of the $j$-th word in the document. The scoring function applies a MaxSim operation to each word in the query: it performs a soft search over all words in the document, finds the document word most relevant to that query word, and finally sums the relevance scores over all query words.
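For reference, the scoring function described here can be written out as follows. This is a reconstruction that assumes the standard ColBERT MaxSim form with the symbols defined in the text; $|q|$ and $|d|$ (the numbers of query and document words) are notation introduced only for this sketch, and $\top$ denotes the transpose, to avoid confusion with the teacher superscript $T$.

```latex
S_{q,d} \;=\; \sum_{i=1}^{|q|} \; \max_{1 \le j \le |d|} \; E^{T}_{q}(q_i) \cdot E^{T}_{d}(d_j)^{\top}
```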
The student model and the teacher model have the same architecture, and the goal of the student model is to retrieve the relevant Vietnamese document $d'$ for a given Chinese query $q'$. Suppose that the query encoder $E^S_q$ of the student model can mimic the behavior of the query encoder $E^T_q$ of the teacher model when encoding a Chinese query, and that the document encoder $E^S_d$ of the student model can mimic the behavior of the document encoder $E^T_d$ of the teacher model when encoding a Vietnamese document, i.e.:
(1) Query encoder learning process:
Assuming that $Q$ denotes the set of Chinese queries, the output $E^S_q(q')$ of the student query encoder $E^S_q$ should be close to the output $E^T_q(q)$ of the teacher query encoder $E^T_q$, as shown in formula (13).
This goal indicates that we want the query encoder output of the student model to match the teacher model on Chinese queries.
(2) The learning process of the document encoder:
Assuming that $D$ denotes the collection of Vietnamese documents, the output $E^S_d(d')$ of the student document encoder $E^S_d$ should be close to the output $E^T_d(d)$ of the teacher document encoder $E^T_d$, as shown in formula (14).
This goal indicates that we want the document encoder output of the student model to match the teacher model on Vietnamese documents.
(3) Relevance learning process for queries and documents:
For the learning process of query and document relevance, we compute the relevance score $S_{Q,D}$ of the query and document, as shown in formula (15).
This calculation is used for both the teacher model and the student model to obtain
This goal indicates that we want the student model to match the teacher model on Chinese queries and Vietnamese documents as a whole. When the Chinese-Vietnamese query-document pairs used to train the student model are parallel to the English-English query-document pairs used to train the teacher model, the student model can perform as well as the teacher model, because after the multilingual pre-training step, words in different languages that are translations of each other tend to lie close together in the embedding space. Thus, the training goal of knowledge distillation is to reduce the distance between the outputs of the teacher and student encoders given parallel inputs.
Implementation: To reduce the distance between the outputs of the student encoders and the outputs of the corresponding teacher encoders, the student learns the query and document representations of the teacher model. Our model is optimized using the mean squared error loss, and the loss functions for the query and document encoders are defined in formulas (16) and (17), respectively.
For a given parallel input, when the teacher model judges the English query document pair as relevant, it is hoped that the student model also judges the Chinese-Vietnamese query document pair as relevant. Hence, this paper uses Kullback-Leibler divergence to reduce the distribution difference between
Finally, the above losses are combined to obtain the final joint loss function, which is calculated as (19), where α, β, and γ are hyperparameters used to balance the effects of $loss_q$, $loss_d$, and $loss_s$.
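Formulas (16) to (19) are not reproduced in this text, so the following is only a sketch of how such a joint objective could be assembled. It assumes mean squared error between student and teacher embeddings for the query and document terms, a Kullback-Leibler divergence KL(teacher || student) between softmax-normalized relevance scores over a set of candidate documents for the relevance term, and a weighted sum with α, β, and γ. The KL direction, the softmax normalization, and all function names are assumptions.

```python
import math


def mse(student_vec, teacher_vec):
    """Mean squared error between two embedding vectors."""
    return sum((s - t) ** 2 for s, t in zip(student_vec, teacher_vec)) / len(student_vec)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_scores, student_scores):
    """KL(teacher || student) over softmax-normalized relevance scores (assumed form)."""
    p = softmax(teacher_scores)
    q = softmax(student_scores)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def joint_loss(student_q, teacher_q, student_d, teacher_d,
               student_scores, teacher_scores,
               alpha=0.25, beta=0.25, gamma=0.5):
    """Weighted sum of query, document, and relevance distillation losses."""
    loss_q = mse(student_q, teacher_q)                      # query encoder term
    loss_d = mse(student_d, teacher_d)                      # document encoder term
    loss_s = kl_divergence(teacher_scores, student_scores)  # relevance term
    return alpha * loss_q + beta * loss_d + gamma * loss_s
```

The default weights 0.25, 0.25, and 0.5 are the values reported in the experimental settings. When the student reproduces the teacher exactly, all three terms vanish and the joint loss is zero.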
Dataset
The teacher and student models in this paper were trained on the publicly available mMARCO dataset, a multilingual version of the MS MARCO passage ranking dataset. mMARCO was produced with two translation methods, Google Translate and machine translation models; this paper uses the Google Translate version. The mMARCO dataset covers nine languages, including Chinese and Vietnamese, with the queries and documents in each language parallel to English. The original training set contains 808,731 queries and 8,841,823 documents. During the data cleaning process, 538,192 Chinese queries and 5,829,766 Vietnamese documents were processed. The data were then divided into training, validation, and test sets in a 7:2:1 ratio.
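The text does not say how the 7:2:1 division was carried out. A minimal sketch, assuming a seeded random split at the query level (the function name and the seed are illustrative):

```python
import random


def split_7_2_1(items, seed=0):
    """Shuffle items and split them into train/validation/test at a 7:2:1 ratio.

    The first two parts use integer division; the test set takes the remainder,
    so every item lands in exactly one part.
    """
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded for reproducibility
    n = len(items)
    n_train = n * 7 // 10
    n_val = n * 2 // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


# Example with 100 placeholder query ids.
train_ids, val_ids, test_ids = split_7_2_1(range(100))
```

Splitting by query id rather than by query-document pair keeps all judgments for one query in the same part, which avoids leakage between training and evaluation.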
To further validate the effectiveness of the model, this paper also constructs a Chinese-Vietnamese cross-language event retrieval dataset, which contains Chinese queries, Vietnamese event documents, and relevance labels. Specifically, the dataset contains 4,000 Chinese queries and 50,000 Vietnamese event documents. This self-constructed dataset is used to evaluate the retrieval performance of the model.
Evaluation metrics
MRR (Mean Reciprocal Rank) is a metric used to evaluate the performance of an information retrieval system. It is calculated as in formula (20):
where |Q| denotes the number of queries in the query set and $Rank_i$ denotes the position of the highest-ranked relevant document in the sorted list for the i-th query; MRR averages the reciprocal of this position over all queries. If no relevant document is found, $Rank_i$ can be set to a large value, so that its reciprocal is close to zero.
The process of calculating MRR consists of ranking the relevant documents found for each query, calculating the
P@k represents the proportion of relevant documents among the top k results predicted by the model. R@k represents the ratio of the relevant documents retrieved in the top k results to the total number of relevant documents. MAP takes into account the positions of the relevant documents in the list returned by the system: the higher the relevant documents are ranked, the higher the MAP value. It is calculated as in (21).
where r is the number of positive samples in the ranking list. NDCG@k denotes the normalized discounted cumulative gain over the top k positions of the ranking results, determined jointly by DCG@k and IDCG@k. First, calculate the cumulative gain CG, the sum of the relevance scores of the results. DCG then introduces a logarithmic position discount, so that relevant documents ranked higher contribute more, as in formula (22), where rel(i) represents the relevance between the query and the document at the i-th position in the ranking results, and IDCG is the DCG of the ideal ranking, in which documents are arranged in descending order of relevance scores. Finally, NDCG@k is calculated as in (23):
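The metric definitions above can be sketched in a few lines of Python. This is an illustrative implementation rather than the authors' evaluation code. It assumes binary relevance for MRR, P@k, R@k, and MAP, a reciprocal rank of 0 when no relevant document is retrieved (the limit of setting the rank to a large value), average precision normalized by the number of relevant documents, and the linear-gain form of DCG, rel(i) / log2(i + 1); the exact forms of formulas (20) to (23) are not reproduced in this text.

```python
import math


def reciprocal_rank(ranked_ids, relevant):
    """1 / position of the first relevant document; 0.0 if none is retrieved."""
    for pos, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / pos
    return 0.0


def mean_reciprocal_rank(runs):
    """runs: list of (ranked_ids, relevant_set) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def precision_at_k(ranked_ids, relevant, k):
    """Share of the top-k results that are relevant."""
    return sum(1 for d in ranked_ids[:k] if d in relevant) / k


def recall_at_k(ranked_ids, relevant, k):
    """Share of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for d in ranked_ids[:k] if d in relevant) / len(relevant)


def average_precision(ranked_ids, relevant):
    """Mean of precision values at the positions of relevant documents."""
    hits, total = 0, 0.0
    for pos, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / pos
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """MAP: average precision averaged over all queries."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


def dcg_at_k(gains, k):
    """Discounted cumulative gain with a log2 position discount (linear gain)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))


def ndcg_at_k(gains, k):
    """DCG@k divided by the DCG@k of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0
```

Here `gains` is the list of graded relevance scores in the order the system ranked the documents, while the binary metrics take the ranked document ids and the set of relevant ids for a query.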
We initialize the ColBERT query and document encoders in the teacher and student models based on the above pre-trained models. Both the teacher and the student models are trained with the pairwise cross-entropy loss. The teacher model is trained with a learning rate of 5e-6, 200k iterations, and a batch size of 64, using the Adam optimizer. The student model is trained with a learning rate of 5e-5 and a batch size of 32, also using the Adam optimizer. These settings are listed in Table 1. The hyperparameters α, β, and γ, which balance the query loss, the document loss, and the query-document relevance loss, are set to 0.25, 0.25, and 0.5, respectively.
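For reference, the pairwise cross-entropy loss mentioned above is commonly computed on a (query, relevant document, non-relevant document) triple as the negative log of the softmax probability assigned to the relevant document. The sketch below assumes that standard form; the function name is illustrative and the scores would come from the model's relevance function.

```python
import math


def pairwise_cross_entropy(pos_score, neg_score):
    """-log softmax([pos_score, neg_score])[0] = log(1 + exp(neg_score - pos_score)).

    Computed in a numerically stable way for large score differences.
    """
    diff = neg_score - pos_score
    if diff > 0:
        # log(1 + e^diff) = diff + log(1 + e^-diff)
        return diff + math.log1p(math.exp(-diff))
    return math.log1p(math.exp(diff))


# The loss shrinks as the relevant document outscores the non-relevant one.
loss_tied = pairwise_cross_entropy(0.0, 0.0)       # equals log(2)
loss_separated = pairwise_cross_entropy(10.0, 0.0)  # close to zero
```

Minimizing this loss pushes the score of the relevant document above that of the non-relevant one without requiring a fixed margin.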
Experimental parameter settings
In order to verify the effectiveness of the proposed method for Chinese-Vietnamese cross-lingual event retrieval, this paper conducts comparative experiments with the following representation-based retrieval methods.
(1) XLM-R-triplet: This baseline trains XLM-R on Chinese-Vietnamese query-document triples. XLM-R is a multilingual pre-trained model that learns cross-lingual semantic representations from texts in different languages. In this approach, XLM-R is trained on Chinese-Vietnamese query-document triples to learn features relevant to the cross-language event retrieval task.
(2) mBERT-triplet: This baseline trains mBERT on Chinese-Vietnamese query-document triples. mBERT is a multilingual pre-trained model based on the Transformer architecture that generalizes across languages. This method uses mBERT to capture the semantic representation of Chinese-Vietnamese queries and documents in order to improve cross-lingual event retrieval.
(3) SBERT: This baseline transfers knowledge by minimizing the difference between the embedding vectors of the source and target languages, and uses dual encoders to retrieve documents. By aligning embedding vectors across languages, SBERT aims to improve the semantic matching of queries and documents and thereby the retrieval performance in cross-language settings.
Comparative experiment
Table 2 compares the proposed method with the baseline models on the mMARCO dataset. On this dataset, the proposed method is significantly better than the baselines on both evaluation metrics, R@10 and MRR, which has practical implications for cross-language information retrieval under low-resource conditions. R@10 is the ratio of the relevant documents retrieved among the top 10 results to the total number of relevant documents. MRR, the mean reciprocal rank, measures how highly relevant documents are ranked in the retrieval results. Our method outperforms XLM-R-triplet, mBERT-triplet, and SBERT on both metrics.
The significant improvement in R@10 and MRR suggests that our model captures relevant documents more accurately among the top-ranked results and ranks the retrieved results better. This result may be attributed to two key aspects of our approach. First, by fine-tuning on mBERT and XLM-R, we improve the model's ability to adapt to the Chinese-Vietnamese cross-lingual event retrieval task under low-resource conditions. Second, the introduced alignment mechanism enables the model to better relate the representations of different languages, which improves the retrieval results. In summary, the experimental results and analysis show that the proposed method achieves a significant performance improvement on the mMARCO dataset. This provides a feasible solution for cross-language information retrieval in low-resource settings and offers useful insights for further improving cross-language information retrieval models.
Experimental results of mMARCO dataset
Table 3 reports the results of the proposed method and the comparison models on the Chinese-Vietnamese cross-language event retrieval dataset. The proposed method performs significantly better than the other methods, and the gain for Chinese-Vietnamese cross-language event retrieval is substantial.
Specifically, the proposed method achieves better results than the methods that train mBERT and XLM-R on Chinese-Vietnamese query-document triples. This is mainly because, by learning query-document representation features from the monolingual teacher retrieval model, our model obtains better Chinese-Vietnamese query-document alignment as well as better query-document interaction features, which play a key role in improving Chinese-Vietnamese cross-lingual event retrieval. The XLM-R-triplet and mBERT-triplet methods use cross-language pre-trained models, which attend more fully to contextual semantic information and obtain better contextual feature representations. However, in cross-language event retrieval tasks, event knowledge learned in advance is crucial for a deeper understanding of the textual semantics, which shows the importance of event knowledge for improving retrieval performance.
Compared with the SBERT method, the proposed model improves P@1, NDCG@10, and MAP by 1.75%, 5.13%, and 4.08%, respectively. This indicates that sentence-level alignment improves retrieval to some extent, but that the semantic information in the text still needs to be mined more fully. Moreover, because SBERT lacks query-document interaction, linguistic information is easily lost during computation. Further research should therefore mine the semantic information in the text more fully, especially where interaction information is missing, in order to improve Chinese-Vietnamese cross-lingual event retrieval.
Experimental results of Chinese-Vietnamese cross-language dataset
We conduct ablation experiments on the three key strategies of knowledge distillation, namely query-encoding knowledge distillation, document-encoding knowledge distillation, and query-document interaction feature distillation, to explore the impact of each part on overall performance. Here, w/o query-distillation means that the query encoding features are not distilled, w/o document-distillation means that the document encoding features are not distilled, and w/o interactive-distillation means that the query-document interaction features are not distilled.
Table 4 shows the ablation results on the cross-lingual event retrieval dataset. The performance drop is greatest when the distillation of document encoding features is removed, followed by the distillation of query-document interaction features. When query encoding feature distillation is removed, the NDCG@10 and MAP values decrease by 3.8% and 2.72%, respectively; when query-document interaction feature distillation is removed, they decrease by 7.99% and 7.04%; and when document encoding feature distillation is removed, they decrease by 12.19% and 10.03%. These results verify the effectiveness of the three knowledge distillation strategies proposed in this paper: removing any one of them degrades model performance to some degree. The drop is most pronounced for document encoding features. A likely reason is that documents contain richer semantic information than queries, which are usually shorter and therefore easier to align. In addition, aligning Chinese with English is easier than aligning English with Vietnamese. Therefore, removing document feature distillation causes a marked drop, and a better document representation is more effective in improving retrieval. The results of removing query-document interaction feature distillation show that fine-grained query-document interactions are essential.
Ablation experiment results
This paper introduces a knowledge distillation-based method for Chinese-Vietnamese cross-lingual event retrieval. To address the unsatisfactory query-document alignment and the language bias that affect low-resource languages, the method uses a better-performing English retrieval model as a teacher, from which the student learns query and document representation features as well as query-document interaction features. After the representations of the different languages are further aligned, the student learns the fine-grained interaction features of queries and documents. Experiments show that the knowledge distillation method effectively improves the retrieval results, which demonstrates the effectiveness of the proposed method.
Acknowledgment
This work was supported by the National Natural Science Foundation of China (Grant Nos. U23A20388, U21B2027, 62266027, 62266028, 61972186), the Yunnan high-tech industry development project (Grant No. 201606), the Yunnan Provincial Key Research and Development Plan (Grant Nos. 202303AP140008, 202302AD080003, 202103AA080015), the Yunnan Basic Research Project (Grant No. 202001AS070014), and the Talents and Platform Program of Science and Technology of Yunnan (Grant No. 202105AC160018).
