Span prompt dense passage retrieval for Chinese open domain question answering

Abstract

Dense passage retrieval is a popular method in information retrieval recently, especially in open domain question answering. It aims to retrieve related articles from massive passages to answer the question. Retriever can increase retrieval speed with less loss of accuracy compared to other methods. However, the pretrained language models used in recent research are often ineffective in semantic embedding, which will reduce accuracy. In addition, we find that contrastive learning will diverge the representation space, and Siamese models with independent parameters on both sides will decrease generalization performance. Therefore, we propose span prompt dense passage retrieval (SPDPR) based on span mask prompt tuning and parameter sharing in Chinese open-domain dense retrieval. This model can generate more efficient representation embeddings and effectively counteract the separation tendency between positive samples. We evaluate the effectiveness of SPDPR in DYKzh, as well as two Chinese datasets. SPDPR surpasses all SOTAs implemented in DYKzh and achieves a competitive result in other datasets.

Keywords

Dense retrieval prompt tuning question answering natural language processing

1 Introduction

Open-domain dense retrieval (DR) uses vector similarity to retrieve relevant passages, making it an essential component of open-domain question answering systems, as well as search engines and voice assistants. Table 1 provides an example of the DR system, which searches numerous passages to retrieve those that may contain the answer to a given question. For example, the answer "dodo" can be found in the first passage.

Table 1
A sample of open-domain dense retrieval

Currently, DRs are being trained using the contrastive learning strategy [1]. This approach uses a specially designed loss function to increase the distance between positive samples and decrease the distance between negative samples, achieved by dividing the vectors into a spherical unit space [2]. In the DR task, the questions and the corresponding passages are positive examples, and the questions with the others are negative. Therefore, semantic representation and sample selection are two main topics in dense retrieval. On semantic representation, Li et al. [3] and Su et al. [4] pointed out that the hidden embedding dimension of the [CLS] token may collapse in pretrained language models (PLMs) like BERT [5], usually used as the representation of the sentence in downstream tasks such as semantic similarity and dense retrieval.

To solve this problem, some researchers modified the structure of PLMs and retrained them for better semantic representation [6]. Some designed targeted pretraining tasks to enhance the representation of the [CLS] token [7]. However, these approaches do not fully exploit the potential of PLMs and require high computation costs. In addition, existing models typically used independent representers on both sides of the model, which are different from most contrastive learning models in other tasks. Independent representer may gradually separate the representation space in the training process. It will increase the distance between untrained positive samples and reduce retrieval accuracy.

To address the aforementioned issues and advance the research progress in the Chinese dense passage retrieval, this paper presents the following contributions. Firstly, a Chinese open domain retrieval dataset, DYKzh, is constructed to solve the problems of low retrieval difficulty and unnatural question format in existing datasets. Secondly, to overcome the disadvantages of existing methods that rely heavily on pre-trained language model [CLS] token, this paper proposes a span prompt dense passage retrieval (SPDPR) model.

SPDPR introduces discrete prompt tuning for the first time in dense passage retrieval [9], eliminating the shortcomings of using [CLS] token and significantly reducing training costs. Given that Chinese vocabulary typically consists of multiple terms, this model also proposes a span mask prompt template that uses multiple consecutive [MASK] prompts instead of a single [MASK] prompt. Furthermore, to improve the generalization of the model by mitigating the independent training risk of the two representers parameters, a weight sharing strategy is proposed.

The contributions of this paper can be summarized as follows:

We design a Chinese open domain passage retrieval model SPDPR. We propose span masking prompt tuning methods and adopt the representer parameter sharing strategy to reduce the distance between the representation spaces of questions and passages.

We validate SPDPR’s performance on a standard Chinese open-domain dense retrieval dataset DYKzh and 2 ordinary retrieval datasets. For comparison, we reproduce multiple baseline models in Chinese. Results show that span masking prompt template and parameter sharing strategy increase the accuracy of Chinese dense retrieval.

From the experimental results, we analyze the change process of the representation space with the training in the Siamese model. We conclude that the parameter sharing strategy can help improve the generalization of the model.

2 Related works

2.1 Open-domain dense retrieval

Open-domain Dense Retrieval is an effective approach when compared to both sparse retrieval methods and re-ranking retrieval methods [10]. Sparse retrieval methods, such as TF-IDF and BM25, are commonly utilized to query relevant articles in search engines; while TF-IDF [11] examines the relevance of passages based on word frequency and co-occurrence between queries and passages, BM25 [12] considers the impact of passage length. However, since both methods rely on exact word matching, they are unable to handle synonyms. Re-ranking retrieval techniques utilize a correlation calculation approach to retrieve relevant passages based on neural network models’ capabilities using input from both questions and passages. For instance, Nie et al. [13] devised a multi-granularity model for large-scale reading comprehension, where the retrieval part inputs the question and passage into the model together and obtain the vector at the [CLS] position. The ranking model computes all the passages for each input question, resulting in a substantial reduction in inference speed, given the large number of sets of passages, which fails to meet the demand [10].

Dense Retrieval, on the other hand, is capable of handling synonyms and substantially enhances the inference’s running speed by caching passage embeddings, with only a minor loss of accuracy. DR models employ the Siamese structure to obtain semantic representations of questions and passages. DPR [1] and ORQA [14] leverage pre-trained language models (e.g., BERT [5] and RoBERTa [15]) to generate semantic representations. These representations are trained with unsupervised inverse close task (ICT) or supervised in-batch negative contrastive learning. However, Reimers and Gurevych [16] propose that some PLMs have not been prepared for semantic representation at the [CLS] position. Li et al. [3] and Su et al. [4] argue that the outputs of these PLMs undergo dimensional collapse. Xiong et al. [17] points out that the ICT task and in-batch negative strategy are too simple to train models. Therefore, research on DR focuses on improving the model structure and designing training tasks.

Improving the model structure focuses on enhancing the semantic representation of PLMs. SEED-Encoder [7] trains the model to improve the representation at the [CLS] by introducing a bottleneck between the backbone and the prediction head. Condenser [6] adds shortcuts from the middle of the backbone model to the prediction head, in addition to the bottleneck. These models redesign the structure of PLMs and retrain them to avoid the shortcomings of [CLS]. Others concentrate on the selection of training samples. ANCE [17] dynamically acquires negative samples using the model being trained. RocketQA [18], on the other hand, selects negative samples using a ranking model. LaPraDoR [19] and DANCE [20] improve the representation by enhancing the self contrasts.

The Table 2 illustrates that current approaches primarily utilize the Siamese model structure, with variations mainly in the training approach. Some models aim to enhance the model structure with retraining and address the shortcomings of the pre-trained language model’s [CLS] token.

Table 2
Some prior works in Dense Passage Retrieval

model name supvision structure strategy training strategy

DPR yes basic Siamese Model in-batch negative

ORQA no basic Siamese Model inverse close task

ANCE yes basic Siamese Model dynamic negative samples from previous step model

RocketQA yes basic Siamese Model negative samples from more precise model

SEED-Encoder yes set a bottleneck and retrain [CLS] –

Condenser yes set a bottleneck and retrain [CLS] –

LaPraDoR no basic Siamese Model self contrastive learning

DANCE no basic Siamese Model self contrastive learning and momentum array

model name	supvision	structure strategy	training strategy
DPR	yes	basic Siamese Model	in-batch negative
ORQA	no	basic Siamese Model	inverse close task
ANCE	yes	basic Siamese Model	dynamic negative samples from previous step model
RocketQA	yes	basic Siamese Model	negative samples from more precise model
SEED-Encoder	yes	set a bottleneck and retrain [CLS]	–
Condenser	yes	set a bottleneck and retrain [CLS]	–
LaPraDoR	no	basic Siamese Model	self contrastive learning
DANCE	no	basic Siamese Model	self contrastive learning and momentum array

However, there are two issues with existing methods that reduce the prediction accuracy of the model. Firstly, when Siamese models are employed, the parameters of both representers are independent. Although parameter independence may seem logically sound, it causes the representation space of the two representers to gradually diverge during the training phase, reducing the generalization performance of the model. Secondly, existing methods rely on the representation performance of the [CLS] token of the pre-trained language model. However, the semantic representation capability of this token is inadequate and unsuitable for dense passage retrieval tasks. Moreover, retraining the model to overcome this shortcoming incurs high computational costs, making it impractical to respond quickly to rapid changes in the collection of articles in actual scenarios.

2.2 Contrastive learning

Contrastive learning (CL) is a representation learning method commonly used to solve classification tasks. It needs to design the loss function to pull relevant samples closer and push irrelevant samples farther in the representation space. Therefore, these samples can be classified by a simple linear classifier [21]. Contrastive learning is suitable for unsupervised tasks where the input sample and its perturbations are positive to each other. This strategy has recently been widely used in sentence textual similarity (STS) [22, 23]. Contrastive learning can also be applied to supervised tasks. Khosla et al. [24] proposed a supervised contrastive learning loss function in computer vision. In contrast, although dense retrieval is a supervised task, unsupervised loss functions are usually used because there is no category concept like classification problems [1].

Contrastive learning in dense retrieval usually uses the in-batch negative strategy in training. Tsai et al. [25] show that the classification error rate decreases with the increasing size of negative samples, which is also an empirical consensus in dense retrieval. Qu et al.[18] expands the overall batch of negative samples by obtaining negative samples from other GPUs during multi-GPU training. Xin et al. [20], on the other hand, use a momentum queue and incrementally update negative samples. Some recent studies have focused on unsupervised dense retrieval tasks, which focus on constructing suitable positive samples. Xu et al. [19] uses ICT and Dropout to create positive samples, and Gao et al. [26] use the same passage strategy to achieve this.

Tasks using contrastive learning usually use the Siamese model structure, i.e., the two inputs feed into two PLMs with shared parameters to obtain the corresponding semantic embeddings for downstream tasks. However, in the field of dense retrieval, most models have independent parameters on both sides. This is probably due to the large semantic gap between question and passage, but it may also lead to an increase in the representation gap between the two models as the training proceeds.

2.3 Prompt tuning

Prompt tuning is a machine learning paradigm that fully exploits the capabilities of pretrained language models. It requires a prompt template that resembles the pretrained task for specific tasks with a few shot samples. Prompt tuning has been mainly used in classification tasks. For example, Schick et al. [27] designed the “{text} It was [MASK].” to determine the probability of “great” or “bad” at [MASK] for classifying sentiment. Ding et al. [28] constructed the “{text} {entity} is a [MASK].” template for fine-grained category annotation of named entities. In addition to discrete templates, prompt tuning contains continuous templates [29 –31]. They use vectors directly in the word embedding layer space and only adjust the vectors to achieve the expected performance. Prompt tuning has also been applied in semantic representation. Jiang et al. [32] successfully investigated prompt tuning for the STS task and experimented with manually written discrete and automatic templates. Jiang et al. [33] experimented with continuous templates and also achieved a competitive result. However, these techniques are tailored to English language usage. Given the distinct lexical structures in Chinese, these approaches may encounter certain limitations.

3 Method

3.1 Task definition

The user’s input question is Q and the set of all passages is D = d₁, d₂, ⋯ , d_n. The corresponding vectors h^q, h^{d
₁}, h^{d
₂}, ⋯ , h^{d
_n} are obtained by passing the question and the passages through the representation model M, respectively. The goal of the task is to estimate the probability that a passage can answer the input question using similarity between vectors. Ranking all passages in descending order of probability, the higher the ranking of the golden passage d^q. The higher the retrieval accuracy of the model. The final model parameters θ are in Eq (1).

$θ = arg min_{θ} \sum_{q} r (q, d^{q})$ (1) where r (q, d) is the ranking function, which gets the rank of the similarity between passage d and question q.

Fig. 1

SPDPR model. The circles stand for the term or the corresponding hidden vectors. The blue one and green one is the question and passage term. The red one stand for the [CLS] of PLM, and the yellow one is the prompt template. The question vector and the passage vector are obtained from the [MASK] after the prompt template.

3.2 Span prompt dense passage retrieval

To address the limitations of the current open domain dense passage retrieval model, we propose the Span Masking Prompt Tuning Dense Passage Retrieval model (SPDPR). The structure of the model, depicted in Fig. 1, follows the DPR model and represents questions and passages as vectors through a Siamese model. However, it employs two pre-trained language models with shared parameters as its backbone. The input section then applies the span masking prompt template to both sides, with the hidden vector at the [MASK] position serving as the semantic embeddings of the question or passage.

The Siamese model, also referred to as the two towers model, is a popular machine learning approach utilized to align two feature spaces. Specifically, the Siamese model consists of two identical representation models. Each model transforms input sentences into feature embeddings, and subsequently uses contrastive learning to match the outputs of both models. This process trains the representation models to align features in a comprehensive and effective manner.

3.2.1 Input

To avoid using embeddings at [CLS] of PLMs with dimensional collapse, SPDPR adopted prompt tuning to obtain representation from embedding at [MASK] instead. However, existing discrete templates are generally not applicable to Chinese and face generalization challenges in multiple datasets.

Most Chinese words are divided into multiple terms in the vocabulary of PLMs instead of the single one in English. Inspired by this, we proposed the Span Mask Prompt Template (SMPT) to obtain semantic representation of questions and passages. SPMT uses multiple consecutive [MASK]s to fully represent a Chinese word, bringing downstream tasks closer to pretrained tasks. Several examples of SMPT compliance are shown in Table 3. In SPDPR, given the difference in the form of questions and passages, different SMPTs are usually used on both sides.

Table 3
Samples of SMPT and Normal prompt template

Let the index of each character r in question q in PLMs vocabulary be i_{r
₁}, i_{r
₂}, ⋯ , i_{r
_||q||}, and the index of each character e in passage d be i_{e
₁}, i_{e
₂}, ⋯ , i_{e
_||d||}. If the SMPT is ’{text}[MASK][MASK]’, the question and passages input sequence of the PLM are in Eq (2)(3).

$\begin{matrix} Q = \\ {i_{[C]}, i_{r_{1}}, \dots, i_{r_{| | q | |}}, i_{[M]}, i_{[M]}, i_{[M]}, i_{[S]}, i_{[P]}, \dots, i_{[S]}} \end{matrix}$ (2)

$\begin{matrix} D = \\ {i_{[C]}, i_{r_{1}}, \dots, i_{r_{| | d | |}}, i_{[M]}, i_{[M]}, i_{[M]}, i_{[S]}, i_{[P]}, \dots, i_{[S]}} \end{matrix}$ (3)

where i_[C], i_[S], i_[M], i_[P] stand for the index of [CLS], [SEP], [MASK] and [PAD] in PLMs vocabulary.

3.2.2 Representation

The representation layer incorporates pre-trained language models (PLMs) to derive semantically contextual embeddings. In current models, the PLMs on both sides typically use separate parameters due to the presumed significant semantic gap between the question and the passage. This approach poses a risk of deviations in the representational space of the two PLMs during the training process, particularly for untrained passages.

Therefore, this model proposes a parameter sharing approach among representers. Accordingly, the above potential separation hazard in the representation space is circumvented.

Let $h_{*}^{q}, h_{*}^{d}$ denote the hidden states at a position of the question and the passage respectively. Then the vectors h^q, h^d of the question sentence and article are in Eq (4)

$h^{q} = PLM (Q)_{[MASK]} h^{d} = PLM (D)_{[MASK]}$ (4)

Where the PLM (·) means the pretrained language model. The question representation and the passage representation are the hidden states at the first [MASK] in the span.

3.2.3 Training

When dense retrieval uses a Siamese model, it is usually trained using contrastive learning. If the passage can answer the question, they are mutually positive samples and vice versa. SPDPR mainly followed the DPR[1] approach, using negative samples obtained by BM25 and other negative samples in the batch.

The contrastive loss function is the logarithm likelihood function of positive samples. For the training sample $q, d^{+}, d_{1}^{-}, d_{2}^{-}, \dots, d_{n}^{-}$ , the passage d⁺ related to the question q is positive, and the irritated passages $d_{1}^{-}, d_{2}^{-}, \dots, d_{n}^{-}$ are negative. The loss function is in Eq (5).

$\begin{matrix} L (q, d^{+}, d_{1}^{-}, d_{2}^{-}, \dots, d_{n}^{-}) = \\ - log \frac{exp [sim (h^{q}, h^{d_{+}})]}{exp [sim (h^{q}, h^{d_{+}})] + \sum_{i = 1}^{n} exp [sim (h^{q}, h^{d_{i -}})]} \end{matrix}$ (5)

Where h^q is the semantic representation of the question, h^{d
₊}, h^{d
_i-} are the ones of positive and negative passages. Similarity function sim (·) usually use the dot product function as Eq (6).

$sim (h_{1}, h_{2}) = h_{1}^{T} \cdot h_{2}$ (6)

Since handicraft prompts are sensitive to downstream datasets, SPDPR uses multiple prompt templates for training. During the training process, templates are randomly selected to form the input, thus improving generality in different datasets.

3.2.4 Inference

The representations of passages are fixed at each inference. These vectors can be calculated and stored before deployment to increase speed. The inference process generates the corresponding vectors of the questions with the best prompt by the SPDPR model above. Approximate Nearest Neighbors (ANN) tools such as FAISS [34] obtain the associated passage vectors as retrieval results.

4 Experiments

4.1 Setup

Because there are limited studies on Chinese dense passage retrieval, we established the DYKzh Chinese open domain question answering dataset using the XQA method proposed by [8] and evaluated SPDPR with various reimplemented models using this dataset.

4.1.1 Dataset

The DYKzh dataset comes from Wikipedia’s section called Did You Know, which consists of manually written article recommendation questions and hyperlinks to the corresponding articles. Nowadays, the form of the recommendation question is mainly a wh-question, but some old recommendations are yes-no questions. Therefore, the dataset only retained the wh-question and organized them into the form of the question, article title, and article content, which were used as the question, answer, and passage in the dataset, respectively. Table 1 above is an example of the DYKzh dataset.

In addition, the passage collection was from archived files of all Chinese Wikipedia articles. And we used WikiExtractor to get all the passages filtered out of meta articles like redirects and disambiguation. DYKzh contained a total of 557,754 passages and 39,331 questions. Since there are some variants of Chinese, some vocabulary in some regions has a gap with simplified Chinese. Therefore, all the words of Chinese variants were processed using OpenCC and regular expressions. Finally, the passages in the DYKzh dataset were aligned to complete the construction of the article collection.

In addition to the DYKzh dataset, the paper validates DuReaderretrieval and MultiCPR. But their questions are made up of scattered words. MultiCPR passages are even not fluent sentences. Therefore, these two datasets are not satisfactory for evaluating retrieval models in open domain question-answering systems, but they can also be reasonable references.

4.1.2 Model

Compared to the SPDPR model, we used BM25 for sparse retrieval and some dense retrieval models on the DYKzh. The retrieval test uses only the method of traversing all passages to obtain accurate similarity ranking results, in order to more precisely evaluate the effectiveness of feature embedding for each model.

BM25 is a sparse retrieval method considering the length of documents. We used the default implementation of Pyserini [35] for Chinese with parameters k₁ = 0.9, b = 0.4.

DPR[1] is a dense retrieval model training on the English corpus. For comparison, the PLM in the model was changed to Chinese-BERT-wwm fully trained in Chinese and optimized for Chinese vocabulary in the masking language task. Due to the shortage of Chinese corpus, we trained models with 6 epochs on the DYKzh dataset with a batch size of 16.

DPR-ANCE[17] is another DPR trained with ANCE’s strategy. ANCE dynamically updates training data from the previous version of the model. Since the time difference between the trainer and the inferencer is too large, we manually executed the ANCE pipeline and trained for 6 epochs.

RocketQA[18] is the first dense retrieval model trained on millions of Chinese documents, which has achieved competitive results in both English and Chinese corpus. The model uses a two-step strategy by concatenating the dual model and cross-model retriever based on the Chinese pretrained language model, ERNIE. For comparability, we train its dual model for 6 epochs with a batch size of 16.

SPDPR used the SMPT and parameter sharing strategy, and other hyperparameters and optimizer settings were consistent with the original DPR model.

4.1.3 Metrics

This task metric is top-k accuracy. All the questions from the test set were fed to the model for prediction, and the top-k accuracy is the percentage of correct articles ranked in the top-k results. The formal representation is in Eq (7)

${Top}_{k} (D, r) = \frac{\sum_{i = 1}^{N} I (r (q_{i}, d_{i}^{+}) \leq k)}{N}$ (7)

where D is the dataset, r (q, d) means the ranking function of the DRs, which is the rank of the passage d in all documents according to the question q, and I (·) stands for indicator function, which the output is 1 when the proposition is true, vice versa.

According to the experiments of DPR, we used top-1, top-5, top-20 and top-100.

4.1.4 Environments

The hardware environment is a computing server equipped with 4 Nvidia A40 graphics cards, the CPU is Intel Xeon Silver 4210R, the memory is 128GB, and it works under good heat dissipation conditions.

The software environment is Ubuntu 18.04 operating system, and CUDA version is 11.6. The above model is implemented using PyTorch 1.12.1 machine learning framework written based on Python 3.9.12 programming language.

4.2 Contrast experiment

The results are in Table 4. On the DYKzh dataset, each dense retrieval model exceeds BM25, indicating that the reproduction results are correct. The accuracy of SPDPR on DYKzh is higher than that of other models, showing that SMPT and parameter sharing strategy improve retrieval accuracy. RocketQA achieved the best results on the DuReader dataset but performed significantly worse on the other datasets. It shows that RocketQA’s generalization performance is poor.

Table 4
The result of models on dense retrieval

Datasets DYKzh DuReader_retrieval Multi-CPR

ecom medical video

Models Top-1 Top-5 Top-1 Top-5 Top-1 Top-1 Top-1

BM25 53.03 68.07 12.85 33.75 15.6 17.2 12.3

DPR 54.13 74.64 32.35 59.00 15.1 19.3 12.2

DPR-ANCE 59.44 80.12 33.85 61.20 16.1 23.7 14.0

RocketQA 50.01 70.22 43.05 69.80 9.6 13.0 7.5

SPDPR 61.39 80.69 34.10 63.10 17.8 20.0 13.3

Datasets	DYKzh	DuReader_retrieval	Multi-CPR
BM25	53.03	68.07	12.85	33.75	15.6	17.2	12.3
DPR	54.13	74.64	32.35	59.00	15.1	19.3	12.2
DPR-ANCE	59.44	80.12	33.85	61.20	16.1	23.7	14.0
RocketQA	50.01	70.22	43.05	69.80	9.6	13.0	7.5
SPDPR	61.39	80.69	34.10	63.10	17.8	20.0	13.3

4.3 Analyze parameter sharing strategy

We analyzed the training process of the Siamese model to further investigate the contribution of parameter sharing strategy. We captured the representation spaces in the DPR’s and SPDPR’s training process by calculating the mean similarity between positive and negative samples.

To reduce the effect of model training on representation space, we filter out the passages that appear in the DYKzh training set from the development set. The total dispersion of the representation space is defined as the average similarity between the question and the negative sample passage S_base. For comparison, we observe changes in the average similarity between the question and its positive sample passages S_pos. Formalized as Eq (8)(9).

$S_{pos} = \frac{\sum_{i = 1}^{N_{q}} sim (q_{i}, d_{i}^{+})}{N_{q}}$ (8)

$S_{base} = \frac{\sum_{i = 1}^{N_{q}} \sum_{j = 1}^{∥ d_{i}^{-} ∥} sim (q_{i}, d_{ij}^{-})}{N_{q} ∥ d_{i}^{-} ∥}$ (9)

We separately analyze the variation of the representation space during training for DPR, SPDPR, and a variant of DPR using only the parameter sharing strategy. The results of the experiment are shown in Table 5. First, we observe that question and passage representation spaces become increasingly separated during training. However, both the parameter sharing strategy and the prompt learning method reduce the separation between positive samples. They will effectively improve the generalization capability of the model.

Table 5

The representation spaces status during training

epoch	DPR			DPR_mod			SPDPR
	pos	base	delta	pos	base	delta	pos	base	delta
1	87.146	82.600	4.546	95.297	90.999	4.298	92.076	87.742	4.334
2	86.521	81.465	5.056	95.219	90.324	4.895	92.360	87.073	5.287
3	85.310	79.747	5.563	94.057	88.483	5.574	92.699	87.507	5.192
4	85.613	79.783	5.830	94.368	88.617	5.751	91.998	85.959	6.039
5	85.630	79.464	6.166	93.803	87.472	6.331	91.166	84.508	6.658
6	84.665	78.578	6.087	94.051	88.709	5.342	91.331	84.629	6.702

4.4 Sensitivity analysis

4.4.1 Sensitivity to random seeds

Random seeds can significantly affect the optimization outcomes of a model, particularly in neural network-based models. Consequently, this study employs 6 different random seeds at random to train the model, and evaluates the corresponding accuracy fluctuations.

The results are shown in Fig. 2. It has been observed that SPDPR exhibits significantly greater stability than DPR when subjected to various random seed values. This demonstrates the reliability of SPDPR’s approach.

4.4.2 Sensitivity to batch sizes

Fig. 2

Sensitivity to random seeds.

Fig. 3

Sensitivity to batch sizes.

When training Siamese models through contrastive learning, it is widely acknowledged that higher batch sizes lead to improved model training. This paper provides an experimental analysis of the sensitivity of SPDPR and DPR models to batch size changes. In this study, the batch sizes of 8, 16, and 32 were selected and the model was trained with 3 random seeds. The accuracy of the model in the test set was meticulously recorded.

The results are illustrated in Fig. 3. Results indicate that batch size impacts both models, with SPDPR being relatively less susceptible. This could be attributed to the SPDPR model’s multi-template approach, which enhances the model’s overall stability.

4.4.3 Sensitivity to number of negative samples

Fig. 4

Sensitivity to number of negative samples.

Similar to batch size, the quantity of negative samples also has an impact on the ultimate accuracy of the model in contrastive learning training. It is widely accepted that a larger number of negative samples leads to higher accuracy in the trained model. Consequently, this research paper designated the number of negative samples to be 1 (the default setting), 2, and 3, utilizing 3 distinct random seeds for experimentation.

The results are presented in Fig. 4. It is evident that increasing the number of negative samples results in only marginal improvement in the accuracy of the two models. The results indicate that neither model exhibits sensitivity to the hyperparameter for the number of negative samples used.

5 Conclusion

This paper focused on the open domain dense retrieval in Chinese and found that some existing models still had room for further improvement. At present, semantic representation models still focused on the hidden vector at [CLS] in PLM, but the vector is still unsuitable as a full sentence representation. In addition, the existing models use separate parameters for question and passage representers. This paper experimentally confirms that this leads to a decrease in the generalization performance of the model.

In this paper, we proposed SPDPR to solve the above problems. SPDPR reduced the semantic gap between Chinese questions and passages by using span masking prompt templates. In addition, we constructed DYKzh, a Chinese Wikipedia-based open domain question answer dataset based on the XQA method, and evaluated the retrieval accuracy of several models on this dataset. Experiments have shown that SPDPR has a significant improvement over DPR in several metrics, and also achieves competitive results on several datasets.

Through further experiments, we found that the Siamese model has a tendency to gradually separate the representation space during the training process, and experimentally confirmed that the parameter sharing strategy helps to reduce the separation between positive samples, thus improving the retrieval accuracy of the model. Sensitivity analysis also confirmed that the SPDPR model has better stability on multiple parameters.

In future research, primary consideration should be given to the feasibility of implementing continuous prompt templates. This approach has the potential to reduce the need for manual prompt template design, while simultaneously training more robust and effective templates from the dataset. Moreover, as computational power increases, unsupervised contrastive learning can be explored to further enhance the practical applicability of the dense passage retrieval task.

Footnotes

Acknowledgments

The work was supported by the National Natural Science Foundation of China (No. 62071060), and the Beijing Key Laboratory of Work Safety Intelligent Monitoring.

References

Vladimir Karpukhin , et al., Dense Passage Retrieval for Open-Domain Question Answering, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020.

Tongzhou Wang and Phillip Isola , Understanding contrastive representation learning through alignment and uniformity on the hypersphere, International Conference on Machine Learning, PMLR, 2020.

Bohan Li , et al., On the Sentence Embeddings from Pre-trained Language Models, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020.

Jianlin Su , et al., Whitening sentence representations for better semantics and faster retrieval, arXiv preprint arXiv:2103.15316, (2021).

Jacob Devlin , et al., Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805, (2018).

Luyu Gao and Jamie Callan , Condenser: a Pre-training Architecture for Dense Retrieval, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021.

Shuqi Lu , et al., Less is more: Pre-train a strong text encoder for dense retrieval using a weak decoder, arXiv preprint arXiv:2102.09206, (2021).

Jiahua Liu , et al., XQA: A cross-lingual open-domain question answering dataset, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.

Timo Schick and Hinrich Schütze , Few-shot text generation with pattern-exploiting training, arXiv preprint arXiv:2012.11926, (2020).

10.

Fengbin Zhu , et al., Retrieving and reading: A comprehensive survey on open-domain question answering, arXiv preprint arXiv:2101.00774, (2021).

11.

Jure Leskovec , Anand Rajaraman and Jeffrey David Ullman , Mining of massive data sets, Cambridge university press, 2020.

12.

Stephen Robertson and Hugo Zaragoza , The probabilistic relevance framework: BM25 and beyond, Foundations and Trends™ in Information Retrieval 3(4) (2009), 333–389.

13.

Yixin Nie , Songhe Wang and Mohit Bansal , Revealing the Importance of Semantic Retrieval for Machine Reading at Scale, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019.

14.

Kenton Lee , Ming-Wei Chang and Kristina Toutanova , Latent Retrieval for Weakly Supervised Open Domain Question Answering, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.

15.

Yinhan Liu , e al., Roberta: A robustly optimized bert pretraining approach, arXiv preprint arXiv:1907.11692, (2019).

16.

Nils Reimers and Iryna Gurevych , Sentence-BERT: Sentence Embeddings using SiameseBERT-Networks, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019.

17.

Lee Xiong , et al., Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval, International Conference on Learning Representations.

18.

Yingqi Qu , et al., RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering, Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021.

19.

Canwen Xu , et al., LaPraDoR: Unsupervised Pretrained Dense Retriever for Zero-Shot Text Retrieval, Findings of the Association for Computational Linguistics: ACL 2022, 2022.

20.

Ji Xin , et al., Zero-Shot Dense Retrieval with Momentum Adversarial Domain Invariant Representations, Findings of the Association for Computational Linguistics: ACL 2022, 2022.

21.

Aaron van den Oord , Yazhe Li and Oriol Vinyals , Representation learning with contrastive predictive coding, arXiv preprint arXiv:1807.03748, (2018).

22.

Yuanmeng Yan , et al., ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021.

23.

Tianyu Gao , Xingcheng Yao and Danqi Chen , SimCSE: Simple Contrastive Learning of Sentence Embeddings, 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Association for Computational Linguistics (ACL), 2021.

24.

Prannay Khosla , et al., Supervised contrastive learning, Advances in neural information processing systems 33 (2020), 18661–18673.

25.

Yao-Hung Hubert Tsai , et al., Self-supervised Learning from a Multi-view Perspective, International Conference on Learning Representations.

26.

Luyu Gao and Jamie Callan , Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022.

27.

Timo Schick and Hinrich Schütze , Exploiting Cloze-Questions for Few-Shot Text Classification and Natural Language Inference, Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 2021.

28.

Ning Ding , et al., Prompt-learning for fine-grained entity typing, arXiv preprint arXiv:2108.10604, (2021).

29.

Brian Lester , Rami Al-Rfou and Noah Constant , The Power of Scale for Parameter-Efficient Prompt Tuning, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021.

30.

Guanghui Qin and Jason Eisner , Learning How to Ask: Querying LMs with Mixtures of Soft Prompts, Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021.

31.

Zexuan Zhong , Dan Friedman and Danqi Chen , Factual Probing Is [MASK]: Learning vs. Learning to Recall, Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021.

32.

Ting Jiang , et al., Promptbert: Improving bert sentence embeddings with prompts, arXiv preprint arXiv:2201.04337, (2022).

33.

Yuxin Jiang and Wei Wang , Deep continuous prompt for contrastive learning of sentence embeddings, arXiv preprint arXiv:2203.06875, (2022).

34.

Jeff Johnson , Matthijs Douze and Herve Jegou , Billion-ScaleSimilarity Search with GPUs, IEEE Transactions on Big Data 7(03) (2021), 535–547.

35.

Jimmy Lin , et al., Pyserini: A Python toolkit for reproducible information retrieval research with sparse and dense representations, Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021.

36.

Yiming Cui , et al., Pre-training with whole word masking for chinese bert, IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021), 3504–3514.