I²R: Intra and inter-modal representation learning for code search

Abstract

Code search, which locates code snippets in large code repositories based on natural language queries entered by developers, has become increasingly popular in the software development process. It has the potential to improve the efficiency of software developers. Recent studies have demonstrated the effectiveness of using deep learning techniques to represent queries and code accurately for code search. Specifically, pre-trained models of programming languages have recently achieved significant progress in code search. However, we argue that aligning programming and natural languages is crucial, as they are two different modalities. Existing approaches based on pre-trained models do not effectively consider the implicit alignment of representations across modalities (inter-modal representation). Moreover, existing methods do not take into account the consistency constraint of intra-modal representations, which limits the effectiveness of the model. We therefore propose a novel code search method that optimizes both intra-modal and inter-modal representation learning. The representations of the two modalities are aligned by introducing contrastive learning, and the consistency of intra-modal feature representations is constrained by KL-divergence. Experimental results on seven test datasets confirm the model’s effectiveness and show that the proposed method significantly improves on existing methods. Our source code is publicly available on GitHub.¹

Keywords

Code search, semantic alignment, semantic representations, contrastive learning, pre-trained models

1. Introduction

Code search aims to find and reuse code available in large-scale code repositories based on specific needs expressed as a query. Existing code search tools support different types of queries, such as the natural language-based queries used by GitHub search and structured code-based queries. Figure 1 illustrates an example of a natural language-based query, “Downloads Sina videos by URL”, which is typically a short description of the developer’s requirements. The code snippet shown in Fig. 1 meets these requirements. In this paper, we focus on code search based on natural language queries. Precise and automatic code search has the potential to improve the efficiency of software development.

Approaches for code search can be roughly classified into information retrieval-based, deep learning-based, and pre-trained model-based approaches. Early research on code search [1, 2, 3] focused on the lexical information of code snippets and employed information retrieval methods to find code snippets relevant to a given search query. With the development of deep learning, approaches for code search embedded both code snippets and queries into a shared high-dimensional vector space and compared their semantic similarity using neural networks [4, 5, 6, 7, 8, 9]. Recently, pre-trained models have been widely employed since they can learn semantic representations of code fragments and search queries and capture their semantic relevance efficiently. For example, pre-trained models trained on large-scale multi-programming language datasets have improved both the understanding of code semantics and the performance of code search engines [10, 11, 12, 13].

Despite the success of pre-trained models for code search, these approaches rely solely on the representations learned by large-scale pre-trained models and return a list of code snippets ranked by the cosine similarity between the natural language query and the programming language code. The alignment between query and code representations is not fully explored in the fine-tuning phase. As a result, search results are subject to bias. As shown in Fig. 1, Query $A$ and Query $B$ share similar semantic information but are different. Thus Code $B$, the code corresponding to Query $B$, will wrongly have a high similarity with Query $A$. It is therefore crucial to learn the differences between Query $A$ and Code $B$ during the fine-tuning phase in order to improve the performance of code search, which we call inter-modal representation learning. Moreover, as pointed out in R-drop [14], model performance improves significantly when the high-level representations are consistent across inputs. Previous work has demonstrated that achieving consistent high-level representations for different input formats, or for different optimizations of the underlying representations, can significantly improve model performance [14, 15, 16]. Therefore, it is necessary to ensure the consistency of representations in intra-modal representation learning.

Figure 1. Examples of code search.

To tackle the challenges mentioned above, we first employ contrastive learning to capture the differences between Query $A$ and the negative sample Code $B$ in inter-modal representation learning. Contrastive learning methods have been successfully used for self-supervised representation learning for images [17, 18, 19] and natural language text [20, 21, 22, 23]. They bring similar representations close together and push different representations apart. Secondly, the R-drop technique is applied to intra-modal representation learning to keep the representations consistent [16]. R-drop is a regularization strategy that encourages the output distributions of different submodels sampled by dropout to be consistent. Following Wu et al. [16], we use KL-divergence to align the representations of two submodels with different dropout rates when using R-drop for intra-modal representation consistency constraints.

The main contributions are summarized as follows:

• We propose a novel framework for intra- and inter-modal representation learning (I²R). In this framework, contrastive learning is used to optimize the information contained in inter-modal representations, while the R-drop technique is used to improve the consistency of intra-modal representations.

• Extensive experiments are conducted on seven benchmark datasets, and the results show that the proposed I²R framework outperforms several state-of-the-art approaches for code search. Moreover, the results show that other pre-trained models can be easily adapted to the proposed framework with some improvement in performance.

2. Related work

Over the years, code search has evolved from using basic Information Retrieval (IR) techniques to more advanced deep learning models. However, these models still face challenges in capturing the semantics of code and queries. To overcome these limitations, researchers have proposed co-attentive representation learning models that focus on learning interdependent representations of embedded codes and queries. Additionally, context-aware code translation techniques have been developed to translate code fragments into natural language descriptions, enabling more effective code search. Furthermore, large-scale code pre-training models have shown promising results in improving code search tasks. On the other hand, in the field of natural language processing, pre-training methods have revolutionized the acquisition of image representations from text through transformer-based models. Similar advancements have been made in programming language pre-training models, enhancing the understanding of code semantics and improving code search tasks. Contrastive learning techniques, which bring similar representations closer together and push apart different ones, have been successfully applied to various domains, including images and natural language text. In the multimodal context, aligning and unifying visual and textual representations has been a significant challenge. However, recent approaches have aimed to overcome this challenge by leveraging large-scale image-text pairs and cross-modal contrastive learning methods. The ability of contrastive learning to align and differentiate representations makes it a suitable approach for solving code search problems. We describe code search, multi-modal pre-trained models, and contrastive learning in the following.

2.1 Code search

Early code search models used Information Retrieval (IR) techniques to index a large corpus of code and return relevant code for a developer’s search query. Unfortunately, IR-based models cannot capture the semantics of code and queries. Deep learning techniques have been applied to code search models to address this problem. A deep learning-based model, DeepCS, was proposed by Gu et al. and demonstrated significant improvements over previous models [24]. DeepCS embeds code and natural language queries into vectors with two LSTMs (Long Short-Term Memory networks) and returns to the developer the code that is most similar to the search query. However, this embedding approach does not take into account the internal semantic relevance between the two isolated representations of code and query. Consequently, learning isolated representations of the code and query may limit the search’s effectiveness. To address this issue, a co-attentive representation learning model (CARLCS-CNN) has been proposed [5]. CARLCS-CNN employs a co-attention mechanism to learn interdependent representations of embedded code and queries. TabCS uses a two-stage attention network to extract code and query information from code features (e.g., method names, API sequences, and tokens), code structure features (e.g., abstract syntax trees), and query features (e.g., tokens) [7]. The first stage uses attention mechanisms to extract semantics, taking into consideration the semantic gap between code and queries. The second stage applies a co-attention mechanism to determine the semantic relevance of each code/query pair. TranCS [9] is a context-aware code translation technique that translates code fragments into natural language descriptions (known as translations); it was evaluated in extensive experiments on a large, multilingual dataset. Code search is then performed by computing the similarity between the natural language query and the translations of the code fragments. The translation is performed on machine instructions, and contextual information is gathered by simulating the execution of those instructions. TranCS further includes a shared word mapping function that generates embeddings for translations and queries based on a vocabulary. In recent years, large-scale code pre-training models have acquired generic source code representations and have demonstrated substantial improvements in code search tasks [10, 12].

2.2 Multi-modal pre-trained models

The field of natural language processing has undergone a revolution with the introduction of pre-training methods that learn directly from raw text without supervision. During the past few years, Transformer-based models have become increasingly popular for cross-modal representation learning [25, 10, 26, 18]. In recent studies, CLIP [19] and VirTex [25] have used new architectures and pre-training methods, including Transformer-based language modeling, masked language modeling, and contrastive learning, to acquire image representations from text. Similarly, large-scale programming language pre-training models [10, 27, 11, 12, 13] have improved the understanding of code semantics and led to significant improvements in code search tasks. UniXcoder [12], CodeBERT [10], and GraphCodeBERT [11] are all based on the RoBERTa [28] architecture. In terms of code representation, CodeBERT relies on lexical information, preserving as much as possible the natural semantics of identifiers through WordPiece subword tokenization. Syntactic information is not explicitly modeled, so the model needs to learn it from a large amount of code. GraphCodeBERT takes as input source code paired with a comment and the corresponding data flow, and is pre-trained with Masked Language Modeling (MLM), edge prediction, and node alignment tasks. In UniXcoder, syntactic information is explicitly modeled: the abstract syntax tree (AST) of the code is flattened into a sequence. Additionally, UniXcoder provides a unified model for both generating and understanding code. Our approach can easily be applied to these pre-trained models for downstream tasks to improve their performance.

2.3 Contrastive learning

Contrastive learning [29], which brings similar representations closer together and pushes different ones apart, has been successfully used for self-supervised representation learning for images [25, 18] and natural language text [21, 23]. Recently, several studies [30, 12] have also applied contrastive learning to learn cross-modal representations of video/images and text. To accommodate multimodal scenarios, a series of multimodal pre-training methods, such as VisualBERT [26] and UNITER [17], have been proposed and pre-trained on corpora of image-text pairs, which greatly improves the ability to handle multimodal information. The biggest challenge in unifying different modalities is to align them in the same semantic space in a way that generalizes to different data modalities. Several existing cross-modal pre-training approaches attempt to align visual and textual representations by simple image-text matching based on a limited corpus of image-text pairs [26, 17]. As the randomly sampled negative texts or images are usually very different from the originals, these approaches can only learn a very rough alignment between textual and visual representations. CLIP [19] uses large-scale image-text pairs to learn transferable visual representations through image-text matching, which allows the model to be transferred to various visual classification tasks in a zero-shot manner. WenLan [31] was further proposed as a similar two-tower Chinese multimodal pre-training model that improves the contrastive cross-modal learning process. The end-to-end visual language pre-training architecture SOHO [32] does not rely on a pre-trained object detection model to extract salient image regions; instead, it jointly learns a Convolutional Neural Network (CNN) and a Transformer to perform cross-modal alignment from millions of image-text pairs. UNIMO [18] leverages large-scale free text corpora and image collections to improve visual and textual understanding, and uses cross-modal contrastive learning (CMCL) to align textual and visual information in a unified semantic space over a corpus of image-text pairs augmented with related images and texts. Contrastive learning methods [29], which pull together representations of positive samples and push apart those of negative samples, are well suited to code search, where paired code fragments and queries are expected to have close representations and unpaired ones different representations.

3. The proposed approach

Figure 2 shows the architecture of I²R. In general, I²R employs UniXcoder as the encoding model, which encodes code and query separately to obtain semantic representations. UniXcoder is a multilayer bi-directional transformer encoder-decoder that can perform both code generation and code understanding operations. Inter- and intra-modal representations are learned through contrastive learning and KL-divergence, respectively. Our approach consists of a siamese network in which code is encoded by two UniXcoders, while another two UniXcoders encode queries. Within each pair of UniXcoder models, KL-divergence explicitly constrains the consistency of the code representations and of the query representations. Additionally, we introduce contrastive learning, in which negative samples are used to model the semantic mapping between code and query. Finally, we use the trained model to perform code search. The following sections describe the design of each module in detail.

Figure 2. Intra- and inter-modal representation learning approach.

3.1 UniXcoder based semantic representation model

Given the training dataset ${\operatorname{D}}=\{(\textit{query}_{i},\textit{code}_{i})\}_{i=1}^{n}$, where $n$ is the number of training samples, the goal is to learn the models $\operatorname{P}_{1}^{w_{1}}(\textit{query}_{i})$ and $\operatorname{P}_{3}^{w_{3}}(\textit{code}_{i})$. Candidate code is ranked and recommended based on the cosine similarity of $\operatorname{P}_{1}^{w_{1}}(\textit{query}_{i})$ and $\operatorname{P}_{3}^{w_{3}}(\textit{code}_{i})$. We use UniXcoder as the underlying semantic representation model: for example, given the input $\textit{query}_{i}$, the corresponding representation $\operatorname{P}^{w}(\textit{query}_{i})$ is obtained. UniXcoder is a unified pre-trained model that exploits the code structure information provided by the AST. It supports three modes: encoder-only, decoder-only, and encoder-decoder [12].
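The retrieval step described above, ranking candidate code by cosine similarity to the query representation, can be sketched as follows. This is a minimal illustration under stated assumptions: the three-dimensional vectors are toy stand-ins for encoder outputs, and the helper names `cosine` and `rank_codes` are hypothetical, not taken from the I²R source code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_codes(query_vec, code_vecs):
    """Return candidate indices ordered from most to least similar to the query."""
    scores = [cosine(query_vec, c) for c in code_vecs]
    return sorted(range(len(code_vecs)), key=lambda i: scores[i], reverse=True)


# Toy embeddings (assumed values) standing in for encoder representations.
query = [1.0, 0.0, 1.0]
codes = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [1.0, 0.1, 0.9],  # close to the query
    [1.0, 1.0, 0.0],  # partially related
]
ranking = rank_codes(query, codes)
print(ranking)
```

In this toy case the second candidate is ranked first because its direction is closest to that of the query vector.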

3.2 Inter-modal representation learning

Figure 3. Inter-modal representation learning.

As shown in Fig. 3, by encoding with UniXcoder we obtain a representation of the query, $\operatorname{P}_{1}^{w_{1}}(\textit{query}_{i})$, and of the code, $\operatorname{P}_{3}^{w_{3}}(\textit{code}_{i})$. The next step is to conduct inter-modal representation learning with contrastive learning to achieve better alignment between the two modalities. We employ negative sample pairs so that fine-grained query-code similarity can be learned from the contrastive loss function. Following the work of Oord et al. [29], we use $(\textit{query}_{i},\textit{code}_{i})$ as a positive sample pair, and the other code samples in the same batch as $\textit{query}_{i}$ constitute the negative samples. The loss is calculated as in Eq. (1), where $b$ is the batch size, $\tau$ is the temperature hyper-parameter, and $\cos(\cdot,\cdot)$ is the cosine similarity between two vectors.

$\displaystyle\operatorname{loss}_{\textit{CLQ}}=-\sum_{i=0}^{b-1}\log\frac{e^{\cos(\operatorname{P}_{1}^{w_{1}}(\textit{query}_{i}),\operatorname{P}_{3}^{w_{3}}(\textit{code}_{i}))/\tau}}{\sum_{j=0}^{b-1}e^{\cos(\operatorname{P}_{1}^{w_{1}}(\textit{query}_{i}),\operatorname{P}_{3}^{w_{3}}(\textit{code}_{j}))/\tau}}$ (1)

Eq. (1) introduces positive and negative samples guided by $\textit{query}_{i}$ into the contrastive loss. We similarly introduce positive and negative samples guided by $\textit{code}_{i}$ into the contrastive loss, as shown in Eq. (2), to achieve a bidirectional alignment of query-code representations. The two losses are summed in Eq. (3).

$\displaystyle\operatorname{loss}_{\textit{CLC}}=-\sum_{i=0}^{b-1}\log\frac{e^{\cos(\operatorname{P}_{3}^{w_{3}}(\textit{code}_{i}),\operatorname{P}_{1}^{w_{1}}(\textit{query}_{i}))/\tau}}{\sum_{j=0}^{b-1}e^{\cos(\operatorname{P}_{3}^{w_{3}}(\textit{code}_{i}),\operatorname{P}_{1}^{w_{1}}(\textit{query}_{j}))/\tau}}$ (2)

$\displaystyle\operatorname{loss}_{CL1}=\operatorname{loss}_{\textit{CLQ}}+\operatorname{loss}_{\textit{CLC}}$ (3)

Since Section 3.3 uses KL-divergence to impose consistency constraints on the intra-modal representations, the above operation must be performed twice, as shown in Fig. 2. The second pass yields $\operatorname{loss}_{CL2}$, and the optimization objective of inter-modal representation learning is given in Eq. (4).

$\displaystyle\operatorname{loss}_{CL}=\operatorname{loss}_{CL1}+\operatorname{loss}_{CL2}$ (4)
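The bidirectional contrastive objective of Eqs. (1) to (3) can be sketched in plain Python. This is an illustrative sketch, not the authors' implementation: the two-dimensional vectors are assumed toy representations, the function names are hypothetical, and the temperature default of 0.01 follows the value reported in Section 4.2.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def info_nce(anchors, candidates, tau):
    """Sum over i of -log softmax(cos(anchor_i, candidate_j) / tau) taken at j == i."""
    loss = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, c) / tau for c in candidates]
        top = max(logits)  # subtract the maximum for numerical stability
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        loss -= logits[i] - log_denominator
    return loss


def bidirectional_loss(query_vecs, code_vecs, tau=0.01):
    loss_clq = info_nce(query_vecs, code_vecs, tau)  # query-guided term, Eq. (1)
    loss_clc = info_nce(code_vecs, query_vecs, tau)  # code-guided term, Eq. (2)
    return loss_clq + loss_clc                       # Eq. (3)


# Toy batch of size 2 (assumed values): row i of each list forms a positive pair.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [aligned[1], aligned[0]]  # deliberately mismatched pairs

loss_aligned = bidirectional_loss(queries, aligned)
loss_swapped = bidirectional_loss(queries, swapped)
print(loss_aligned, loss_swapped)
```

The loss is close to zero when each query is most similar to its own code and grows sharply when the pairs are mismatched, which is the behaviour the in-batch negatives are meant to enforce.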

3.3 Intra-modal representation learning

Inter-modal representation learning first requires stable intra-modal representations.

Wu et al. found that when a training sample has two different input formats, model performance improves significantly if the high-level representations are consistent across the inputs [14]. Furthermore, Bengio et al. proposed a technique known as “Fraternal Dropout”, in which two identical RNNs (with shared parameters) are trained with different dropout masks while the difference between their (pre-softmax) predictions is minimized [15]. With the development of pre-trained models, large-scale pre-trained representations have enabled significant progress in natural language processing tasks. Wu et al. used KL-divergence during the fine-tuning of pre-trained models to make the prediction distributions of any two randomly sampled sub-models consistent [16]. They also showed theoretically that consistency between sub-models ensures consistency between the sub-models and the whole model. The KL-divergence between two distributions $p$ and $q$ is defined in Eq. (5).

$\displaystyle KL(p(\bm{x})\|q(\bm{x}))=\mathbb{E}_{\bm{x}\sim p(\bm{x})}\left[\log\frac{p(\bm{x})}{q(\bm{x})}\right]=\mathbb{E}_{\bm{x}\sim p(\bm{x})}[\log p(\bm{x})]+\mathbb{E}_{\bm{x}\sim p(\bm{x})}[-\log q(\bm{x})]$ (5)

In this study, we apply the R-drop technique to the learning of natural language and programming language representations in order to maintain the stability of the representations within each modality. We draw on the work of Wu et al. [16] to apply the R-drop technique to the learning process. The specific implementation details are described below.

Figure 4. Intra-modal representation learning.

As shown in Fig. 4, encoding the query twice with UniXcoder yields two semantic representations of the query. We then use KL-divergence to keep these two intra-modal representations consistent, as shown in Eq. (6).

$\displaystyle\operatorname{loss}_{\textit{KLQ}}=\frac{1}{2}(\operatorname{KL}(\operatorname{P}_{1}^{w_{1}}(\textit{query}_{i})\|\operatorname{P}_{2}^{w_{2}}(\textit{query}_{i}))+\operatorname{KL}(\operatorname{P}_{2}^{w_{2}}(\textit{query}_{i})\|\operatorname{P}_{1}^{w_{1}}(\textit{query}_{i})))$ (6)

Similarly, we do the same for code, as shown in Eq. (7).

$\displaystyle\operatorname{loss}_{\textit{KLC}}=\frac{1}{2}(\operatorname{KL}(\operatorname{P}_{3}^{w_{3}}(\textit{code}_{i})\|\operatorname{P}_{4}^{w_{4}}(\textit{code}_{i}))+\operatorname{KL}(\operatorname{P}_{4}^{w_{4}}(\textit{code}_{i})\|\operatorname{P}_{3}^{w_{3}}(\textit{code}_{i})))$ (7)

In the overall I²R, the optimization objective for intra-modal representation learning is shown in Eq. (8).

$\displaystyle\operatorname{loss}_{KL}=\operatorname{loss}_{\textit{KLQ}}+\operatorname{loss}_{\textit{KLC}}$ (8)
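The symmetric KL term, which follows the form of Eq. (6), can be illustrated with a short sketch. Two points are assumptions made here for illustration only: the representation vectors are turned into probability distributions with a softmax, and the numeric values are toy stand-ins for two dropout passes over the same query. The function names are hypothetical.

```python
import math


def softmax(xs):
    """Normalise a vector into a strictly positive probability distribution."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for discrete distributions with strictly positive entries, Eq. (5)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def symmetric_kl(p, q):
    """Average of both KL directions, in the form of Eq. (6)."""
    return 0.5 * (kl(p, q) + kl(q, p))


# Two assumed representations of the same query from two dropout passes.
p1 = softmax([2.0, 1.0, 0.1])
p2 = softmax([1.8, 1.1, 0.3])

loss_klq = symmetric_kl(p1, p2)
print(loss_klq)
```

Averaging the two directions makes the penalty independent of which pass is treated as the reference, and the term vanishes only when the two distributions coincide.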

3.4 Optimisation objectives

From Sections 3.2 and 3.3, we obtain the optimisation objectives $\operatorname{loss}_{CL}$ and $\operatorname{loss}_{KL}$ for inter-modal and intra-modal representation learning, respectively. The final training objective is to minimise $L$, defined in Eq. (9).

$\displaystyle\operatorname{L}=\operatorname{loss}_{CL}+\operatorname{loss}_{KL}$ (9)

Algorithm 1 I²R Training Algorithm
Input: Training data ${\operatorname{D}}=\{(\textit{query}_{i},\textit{code}_{i})\}_{i=1}^{n}$
Output: model parameters $w_{1}$, $w_{2}$, $w_{3}$, $w_{4}$
1: Initialize model with parameters $w_{1}$, $w_{2}$, $w_{3}$, $w_{4}$
2: while not at the end of training do
3: repeat input data twice as [$(\textit{query}_{i},\textit{code}_{i})$, $(\textit{query}_{i},\textit{code}_{i})$] and obtain the output distributions $\operatorname{P}_{1}^{w_{1}}(\textit{query}_{i})$, $\operatorname{P}_{2}^{w_{2}}(\textit{query}_{i})$, $\operatorname{P}_{3}^{w_{3}}(\textit{code}_{i})$, $\operatorname{P}_{4}^{w_{4}}(\textit{code}_{i})$
4: calculate the InfoNCE loss by Eq. (4)
5: calculate the KL-divergence loss by Eq. (8)
6: update the model parameters $w_{1}$, $w_{2}$, $w_{3}$, $w_{4}$ by minimizing loss $L$ of Eq. (9)
7: end while

The overall training procedure of I²R is given in Algorithm 1. In each training step, the duplicated input is fed forward through the model once to obtain the output distributions $\operatorname{P}_{1}^{w_{1}}(\textit{query}_{i})$, $\operatorname{P}_{2}^{w_{2}}(\textit{query}_{i})$, $\operatorname{P}_{3}^{w_{3}}(\textit{code}_{i})$, and $\operatorname{P}_{4}^{w_{4}}(\textit{code}_{i})$ (line 3). Line 4 then calculates the contrastive loss of the inter-modal distributions, and line 5 calculates the KL-divergence of the intra-modal distributions. Finally, the model parameters are updated according to $L$ (line 6). Training proceeds over epochs until convergence.
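The loss computation inside one training step (lines 3 to 5 of the algorithm) can be sketched as below. This is a toy sketch under several assumptions, not the authors' code: the "encoder" is replaced by dropout applied directly to fixed feature vectors, the representations are normalised with a softmax before the KL term, the KL term is summed over the batch, and all function names are hypothetical. The dropout rates (0.1, 0.4) and the temperature 0.01 follow Section 4.2; the parameter update of line 6 is omitted.

```python
import math
import random


def dropout(vec, rate, rng):
    """Inverted dropout: zero each entry with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in vec]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norms + 1e-12)  # epsilon guards against an all-zero vector


def info_nce(anchors, candidates, tau):
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, c) / tau for c in candidates]
        top = max(logits)
        total -= logits[i] - (top + math.log(sum(math.exp(x - top) for x in logits)))
    return total


def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def symmetric_kl(u, v):
    p, q = softmax(u), softmax(v)
    forward = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    backward = sum(qi * math.log(qi / pi) for pi, qi in zip(p, q))
    return 0.5 * (forward + backward)


def i2r_loss(queries, codes, rates=(0.1, 0.4), tau=0.01, seed=0):
    rng = random.Random(seed)
    # Line 3: two stochastic passes per modality (P1, P2 for queries; P3, P4 for code).
    q1 = [dropout(q, rates[0], rng) for q in queries]
    q2 = [dropout(q, rates[1], rng) for q in queries]
    c1 = [dropout(c, rates[0], rng) for c in codes]
    c2 = [dropout(c, rates[1], rng) for c in codes]
    # Line 4: bidirectional contrastive loss for both passes, Eq. (4).
    loss_cl = (info_nce(q1, c1, tau) + info_nce(c1, q1, tau)
               + info_nce(q2, c2, tau) + info_nce(c2, q2, tau))
    # Line 5: intra-modal consistency for queries and code, Eq. (8).
    loss_kl = (sum(symmetric_kl(a, b) for a, b in zip(q1, q2))
               + sum(symmetric_kl(a, b) for a, b in zip(c1, c2)))
    return loss_cl + loss_kl, loss_cl, loss_kl  # total is L of Eq. (9)


data_rng = random.Random(1)
queries = [[data_rng.uniform(0.1, 1.0) for _ in range(8)] for _ in range(4)]
codes = [[x + data_rng.uniform(-0.05, 0.05) for x in q] for q in queries]
total, loss_cl, loss_kl = i2r_loss(queries, codes)
print(total, loss_cl, loss_kl)
```

With both dropout rates set to zero the two passes coincide, so the KL term is exactly zero and only the contrastive term remains.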

4. Experiment

We evaluate the performance of I²R on the code search task and investigate the impact of inter-modal and intra-modal representations on performance. We also extend the I²R approach to other pre-trained models, such as GraphCodeBERT. Finally, we explore the impact of the dropout rate on the performance of I²R. We assess the effectiveness of I²R by answering the following research questions.

• RQ1: How effective is I²R in code search?

To validate the effectiveness of I²R on the code search task, we compare it with current state-of-the-art models on the six CSN datasets and the AdvTest dataset.

• RQ2: What is the impact of intra-modal and inter-modal representations on code search performance?

To investigate the impact of intra-modal and inter-modal representations on performance, we conduct ablation experiments that remove individual components of I²R and compare the resulting performance across the variants.

• RQ3: How well does applying I²R to other pre-trained programming language models work?

In addition to UniXcoder, a number of other pre-trained models have also achieved good results on code search tasks. We therefore examine whether migrating I²R to other pre-trained models has a similar effect on code search. To this end, we replace the pre-trained model UniXcoder with GraphCodeBERT and compare the resulting model with the baseline.

• RQ4: How do different dropout rate settings affect the performance of I²R?

We assign different dropout rates to I²R and verify their impact on code search performance.

4.1 Dataset

The main objective of code search is to find the most relevant code among a collection of candidates for a given natural language query. We conduct experiments on two benchmarks: CSN [11], which comprises datasets for six programming languages, and AdvTest [33].

The CSN dataset was constructed from the CodeSearchNet dataset in six languages, but it differs from the dataset and setup used by Husain et al. [34]: Guo et al. filtered low-quality queries with handcrafted rules and expanded the 1000 candidates to the whole code corpus, which is closer to a real-life scenario. To better test the model’s understanding and generalization abilities, Lu et al. constructed the AdvTest dataset from the Python portion of the CSN dataset by normalizing Python function names and variable names. We present the statistics of the CSN dataset in Table 1.

Table 1
Statistical information on the dataset

Language	Ruby	Javascript	Go	Python	Java	PHP
Training	24,927	58,025	167,288	251,820	164,923	241,241
Validation	1,400	3,885	7,325	13,914	5,183	12,982
Test	1,261	3,291	8,122	14,918	10,955	14,014
Candidate Code	4,360	13,981	28,120	43,827	40,347	52,660

4.2 Experiment settings

Our experiments run on a GPU server with two Intel Xeon Gold 6142 (2.7GHz) CPUs, 128GB of RAM, a 1TB solid-state drive for the system, a 4TB hybrid hard drive for storage, and four Nvidia RTX3090 GPUs with 24GB of graphics memory each. We set the learning rate to 2e-5, the batch size to 64, the dropout rates of the two intra-modal submodels to 0.1 and 0.4, and the maximum sequence lengths of code and query to 256 and 128, respectively. We use the Adam optimizer to fine-tune the model for 10 epochs and perform early stopping on the development set. The temperature hyperparameter $\tau$ is set to 0.01.
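For reference, the hyperparameters listed above can be collected in a single configuration object. The sketch below uses a Python dataclass; the class name `I2RConfig` and all field names are hypothetical and do not come from the released source code, while the values are those reported in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class I2RConfig:
    """Fine-tuning settings reported in Section 4.2 (field names are illustrative)."""
    learning_rate: float = 2e-5
    batch_size: int = 64
    dropout_rates: tuple = (0.1, 0.4)  # one rate per intra-modal submodel
    max_code_length: int = 256
    max_query_length: int = 128
    epochs: int = 10
    temperature: float = 0.01          # tau in Eqs. (1) and (2)
    optimizer: str = "adam"


config = I2RConfig()
print(config)
```

Freezing the dataclass keeps the settings immutable during a run, which makes it harder to change a hyperparameter by accident between experiments.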

4.3 Evaluation metrics

We use MRR (Mean Reciprocal Rank) to evaluate the performance of the code search system. MRR averages, over all queries, the reciprocal of the rank at which the correct result appears in the search list, so a higher MRR means that correct results are ranked closer to the top.

$\displaystyle\textit{MRR}=\frac{1}{|Q|}\sum_{i=1}^{|Q|}\frac{1}{\textit{rank}_{i}}$ (10)

where $|Q|$ is the number of queries, and $\textit{rank}_{i}$ is the ranking position of the first item in the ground-truth result for the $i$ th query in the search list. The experimental results below show the MRR metrics of the corresponding models on the test set.
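As a worked example of Eq. (10), the following sketch computes MRR from a list of 1-based ranks. The function name and the rank values are illustrative assumptions.

```python
def mean_reciprocal_rank(ranks):
    """ranks[i] is the 1-based position of the first correct result for query i."""
    if not ranks:
        raise ValueError("ranks must be non-empty")
    return sum(1.0 / r for r in ranks) / len(ranks)


# Three assumed queries whose correct code appears at positions 1, 2 and 4.
ranks = [1, 2, 4]
mrr = mean_reciprocal_rank(ranks)  # (1 + 1/2 + 1/4) / 3
print(round(mrr, 4))
```

Because each query contributes the reciprocal of its rank, moving a correct result from position 2 to position 1 raises the score far more than moving it from position 20 to position 19.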

4.4 Comparison methods

Over the past few years, significant advancements have been made in the field of natural language processing for programming languages. Researchers have developed several pre-training models that aim to improve code understanding and facilitate various code-related tasks. In this section, we will explore some of these state-of-the-art pre-training models and their unique features.

•
RoBERTa’s pre-training strategy has been modified, including removing the pre-training target for the next sentence prediction in BERT [35], as well as training with larger batches and learning rates. Furthermore, Roberta receives an order of magnitude more training over a more extended period than BERT. Consequently, RoBERTa is able to generalize more effectively to downstream tasks than BERT. According to Liu et al., RoBERTa is trained by treating code as a sequence of tokens, thus producing the code-based model that has been pre-trained [28].
•
CodeBERT is a pre-trained model that can handle bi-modal data (programming language PL and natural language NL) and can represent generic downstream NL-PL applications (e.g., natural language code search, code document generation, etc.). CodeBERT is a cross-modal pre-training model derived from RoBERTa, developed on the Transformer architecture, and designed to combine a pre-training task involving replacement token detection with a hybrid objective function [10].
•
With GraphCodeBERT, code representations can be learned based on the semantic structure of the source code by implementing the BERT pre-training model. To achieve a vector representation of the source code based on the data stream, there are two new pre-training tasks proposed (data stream edge prediction, variable alignment of the source code and the data stream) in addition to the traditional MLM task. With GraphCodeBERT, code representations can be learned based on the semantic structure of the source code by implementing the RoBERTa pre-training model [11].
•
The SynCoBERT pre-training model allows for better representation of code using syntax-guided multimodal contrastive learning [30]. Based on the symbolic and syntactic properties of the source code, two new pre-training targets were developed, namely Identifier Prediction (IP) and AST Edge Prediction (TEP), for anticipating edges between identifiers and AST nodes, respectively. A multimodal contrastive learning strategy is also proposed in order to capitalize on the complementary information in the semantic equivalence modalities of the code (i.e., code, annotation, AST).
•
CodeT5-base [27] is based on the same architecture as Google’s T5 (Text-to-Text Transfer Transformer) framework [36] but with a better understanding of programming language. It proposes to make use of developer-designed identifiers in the code. A new objective function that incorporates code-specific knowledge is proposed. This objective function trains the model to distinguish between tokens that represent identifiers as well as to recover them when they are blocked. Furthermore, annotations in the code are used to allow the model to learn better representations by studying the alignment properties of the code and the text.
•
UniXcoder is a unified cross-modal pre-trained model for programming languages [12]. It uses a masked attention matrix with prefix adapters to control the behavior of the model, and it uses cross-modal content such as ASTs and code comments to enhance the code representation. The paper proposes a method that encodes an AST, which is a tree, as a sequence while retaining all of the structural information it contains. The model also uses multi-modal content to learn representations of code fragments through contrastive learning, followed by a cross-modal generation task that aligns the representations between programming languages.

4.5 Experiment results and analysis

In this section, we evaluate the performance of the I²R model on code search (RQ1). We investigate the impact of intra-modal and inter-modal representation learning on code search performance (RQ2) and explore the feasibility of applying I²R to other pre-trained programming language models (RQ3). We also examine how different dropout rate settings affect the performance of I²R (RQ4). Together, these findings provide insight into the effectiveness of I²R for code search.

4.5.1 Effectiveness in code search (RQ1)

Table 2
Performance of each method on AdvTest and CSN datasets

		CSN
Model	AdvTest	Ruby	Javascript	Go	Python	Java	PHP	Overall
RoBERTa	18.3	58.7	51.7	85.0	58.7	59.9	56.0	61.7
CodeBERT	27.2	67.9	62.0	88.2	67.2	67.6	62.8	69.3
GraphCodeBERT	35.2	70.3	64.4	89.7	69.2	69.1	64.9	71.3
SYNCOBERT	38.3	72.2	67.7	91.3	72.4	72.3	67.8	74.0
PLBART	34.7	67.5	61.6	88.7	66.3	66.3	61.1	68.5
CodeT5-base	39.3	71.9	65.5	88.8	69.8	68.6	64.5	71.5
UniXcoder	41.3	74.0	68.4	91.5	72.0	72.6	67.6	74.4
I²R	43.8	75.5	69.2	91.6	73.1	73.4	68.2	75.2

Table 2 compares the performance of the different methods on the code search task. As the table shows, methods built on pre-trained models of code are currently the most effective for code search.

Overall, I²R achieves the best performance among all compared methods. Table 2 shows that it outperforms every baseline on all datasets, with improvements of more than 1 point over the strongest baseline on the AdvTest, Ruby, and Python datasets and smaller gains on the Javascript, Java, PHP, and Go datasets.

On the CSN task, the standard I²R model improves on the current state-of-the-art model UniXcoder by 0.8 points overall. On the AdvTest dataset, I²R outperforms UniXcoder by 2.5 points.

These results show that I²R achieves state-of-the-art performance on code search among the compared methods.
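The gains quoted in this subsection follow directly from the UniXcoder and I²R rows of Table 2. The short check below recomputes them; the variable and function names are illustrative, and the differences are absolute differences in the reported scores.

```python
# Scores copied from the UniXcoder and I²R rows of Table 2.
DATASETS = ["AdvTest", "Ruby", "Javascript", "Go", "Python", "Java", "PHP", "Overall"]
UNIXCODER = [41.3, 74.0, 68.4, 91.5, 72.0, 72.6, 67.6, 74.4]
I2R = [43.8, 75.5, 69.2, 91.6, 73.1, 73.4, 68.2, 75.2]


def gains(baseline, improved):
    """Absolute gain per dataset, rounded to one decimal place."""
    return [round(new - old, 1) for old, new in zip(baseline, improved)]


GAINS = dict(zip(DATASETS, gains(UNIXCODER, I2R)))

if __name__ == "__main__":
    for name, delta in GAINS.items():
        print(f"{name:<10} +{delta:.1f}")
```

The gain exceeds 1 point on AdvTest, Ruby, and Python, and is smallest on Go, where both models already score above 91.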

4.5.2 Effect of intra- and inter-modal representation (RQ2)

We conducted ablation experiments on the I²R model, as shown in Table 3. The first row of the table reports the base model UniXcoder, on which I²R is built.

The second row (IntraR) optimizes only the intra-modal representation and corresponds to Section 3.3. The results show that introducing KL-divergence for intra-modal representation learning on top of the UniXcoder model significantly improves code search on the AdvTest, Ruby, Javascript, Python, Java, and PHP datasets, with an overall improvement of 0.6 points on the CSN task.

The third row (InterR) optimizes only the inter-modal representation and corresponds to Section 3.2. Using contrastive learning to align the inter-modal representations on top of the UniXcoder model also significantly improves the performance of the code search model.

The fourth row (I²R) reports the results of optimizing the intra- and inter-modal representations simultaneously. The combined effect of the two optimizations further improves code search performance.
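To make the inter-modal objective concrete, the following is a minimal pure-Python sketch of a common in-batch contrastive (InfoNCE-style) loss. It is an illustration under stated assumptions, not the authors' implementation: the function names `cosine` and `info_nce`, the toy two-dimensional vectors, and the use of cosine similarity are hypothetical, and the exact loss in Section 3.2 may differ. Only the temperature value 0.01 is taken from the experimental setup.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(query_vecs, code_vecs, tau=0.01):
    """Mean in-batch contrastive loss; query i is assumed to match code i.

    All other code vectors in the batch act as negatives for query i.
    """
    total = 0.0
    for i, query in enumerate(query_vecs):
        logits = [cosine(query, code) / tau for code in code_vecs]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_denominator - logits[i]
    return total / len(query_vecs)


# Toy batch (assumed values): each query points towards its own code vector.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_codes = [[0.9, 0.1], [0.1, 0.9]]
swapped_codes = [[0.1, 0.9], [0.9, 0.1]]

LOSS_ALIGNED = info_nce(queries, aligned_codes)
LOSS_SWAPPED = info_nce(queries, swapped_codes)

if __name__ == "__main__":
    print(f"aligned pairs: {LOSS_ALIGNED:.4f}")
    print(f"swapped pairs: {LOSS_SWAPPED:.4f}")
```

The loss is close to zero when each query is most similar to its own code and large when the pairs are swapped, which is the pressure that pulls matching code-query pairs together across modalities.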

Table 3
Performance of each ablation method based on UniXcoder

		CSN
Model	AdvTest	Ruby	Javascript	Go	Python	Java	PHP	Overall
UniXcoder	41.3	74.0	68.4	91.5	72.0	72.6	67.6	74.4
IntraR	18.3	75.5	69.0	91.3	73.0	72.9	68.3	75.0
InterR	43.3	75.3	68.8	91.3	72.9	72.9	67.9	74.9
I²R	43.8	75.5	69.2	91.6	73.1	73.4	68.2	75.2

4.5.3 Performance on other pre-trained models (RQ3)

Table 4
Performance of each ablation method based on GraphCodeBERT

		CSN
Model	AdvTest	Ruby	Javascript	Go	Python	Java	PHP	Overall
GraphCodeBERT	35.2	70.3	64.4	89.7	69.2	69.1	64.9	71.3
IntraR-GraphCodeBERT	42.2	73.1	67.0	90.5	72.0	71.9	67.3	73.6
InterR-GraphCodeBERT	42.4	72.7	67.8	90.6	71.9	71.6	67.0	73.6
I²R-GraphCodeBERT	41.6	72.9	67.0	90.6	72.0	72.1	67.1	73.6

We evaluated the performance of I²R-GraphCodeBERT and compared it to the baseline model GraphCodeBERT, as shown in Table 4. Using the same experimental setup as for I²R-UniXcoder, we set the learning rate to 2e-5, the batch size to 64, the dropout rates of the two sub-models used for intra-modal learning to 0.1 and 0.4, and the maximum sequence lengths of code and query to 256 and 128, respectively. We used the Adam optimizer to fine-tune the model for 10 epochs and performed early stopping on the development set. The temperature hyperparameter $\tau$ was set to 0.01.
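For reference, the fine-tuning settings listed above can be collected in a single configuration object. This is only a sketch: the class name `FineTuneConfig` and its field names are hypothetical and do not come from the released code; the values are those stated in the text.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FineTuneConfig:
    """Fine-tuning settings reported for I²R (field names are illustrative)."""

    learning_rate: float = 2e-5
    batch_size: int = 64
    # Dropout rates of the two sub-models used for intra-modal learning.
    dropout_rates: Tuple[float, float] = (0.1, 0.4)
    max_code_length: int = 256
    max_query_length: int = 128
    optimizer: str = "Adam"
    epochs: int = 10
    early_stopping_on_dev: bool = True
    temperature: float = 0.01  # tau in the contrastive loss


CONFIG = FineTuneConfig()

if __name__ == "__main__":
    print(CONFIG)
```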

Besides migrating the I²R method to GraphCodeBERT, we also carried out ablation experiments on I²R-GraphCodeBERT, following the same ablation scheme as for I²R-UniXcoder in Section 4.5.2. The experiments show that intra-modal representation learning and inter-modal representation learning are each highly effective on the GraphCodeBERT model. Because the hyperparameters of I²R-GraphCodeBERT were transferred directly from I²R-UniXcoder and the dropout rate was not re-tuned, combining the two brings no further significant benefit: all three variants reach an overall CSN score of 73.6.

4.5.4 Impact of different dropout rates (RQ4)

Table 5
Performance of I²R under different dropout rates

		CSN
Dropout	AdvTest	Ruby	Javascript	Go	Python	Java	PHP	Overall
Rate $=$ 0.0	43.8	74.7	68.5	91.4	72.9	72.9	67.9	74.7
Rate $=$ 0.1	43.1	75.0	68.7	91.2	72.9	72.9	67.7	74.7
Rate $=$ 0.2	43.5	74.9	69.1	91.5	73.0	73.2	68.0	75.0
Rate $=$ 0.3	43.8	75.2	69.1	91.4	73.0	73.4	68.2	75.1
Rate $=$ 0.4	43.1	75.5	69.2	91.6	73.1	73.1	68.1	75.1
Rate $=$ 0.5	42.3	74.2	69.1	91.1	73.0	73.1	67.9	74.7

In addition to the studies above, we examine I²R from another perspective: the dropout rate.

During training, intra-modal representation learning produces two output distributions that are obtained with different dropout rates. One distribution always uses the UniXcoder dropout rate, while the dropout rate of the other is varied over {0, 0.1, 0.2, 0.3, 0.4, 0.5}. We trained with each of these settings to observe the effect on code search performance; the results are shown in Table 5.
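The consistency constraint between the two output distributions can be sketched as a symmetric KL-divergence in the spirit of R-Drop [16]. The pure-Python example below is a toy illustration under assumptions: the helper names, the dot-product scoring of one query against three candidate code vectors, and the numeric values are all hypothetical, and the paper's actual distributions are defined in Section 3.3.

```python
import math
import random


def dropout(vec, rate, rng):
    """Inverted dropout: zero each entry with probability `rate`, rescale the rest."""
    if rate == 0.0:
        return list(vec)
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in vec]


def softmax(logits):
    """Numerically stable softmax."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for two discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def symmetric_kl(p, q):
    """Average of KL(p || q) and KL(q || p), used as a consistency penalty."""
    return 0.5 * (kl(p, q) + kl(q, p))


def scores(query_vec, code_vecs):
    """Dot-product score of one query against each candidate code vector."""
    return [sum(a * b for a, b in zip(query_vec, code)) for code in code_vecs]


rng = random.Random(0)
query = [0.5, -0.2, 0.8, 0.1]
codes = [[0.4, -0.1, 0.9, 0.0], [-0.3, 0.7, 0.1, 0.2], [0.1, 0.1, -0.5, 0.6]]

# Two passes over the same query with dropout rates 0.1 and 0.4.
P = softmax(scores(dropout(query, 0.1, rng), codes))
Q = softmax(scores(dropout(query, 0.4, rng), codes))
CONSISTENCY_LOSS = symmetric_kl(P, Q)

if __name__ == "__main__":
    print(f"consistency loss: {CONSISTENCY_LOSS:.4f}")
```

Minimizing this term pushes the two dropout-perturbed passes towards the same output distribution, which is how the randomness of the unimodal representation is reduced.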

From these results, we observe that 1) a dropout rate of 0.4 for the second distribution is the most appropriate choice (the current setting), and 2) R-Drop consistently achieves better performance when the dropout rate of the second distribution lies in the range {0.3, 0.4}. We therefore chose a dropout rate of 0.4 in our experiments.

4.6 Discussion: Why does I²R improve GraphCodeBERT more than UniXcoder?

As Tables 3 and 4 show, the I²R method yields a much larger improvement on GraphCodeBERT than on UniXcoder: the overall CSN score rises from 71.3 to 73.6 for GraphCodeBERT, compared with 74.4 to 75.2 for UniXcoder.

UniXcoder is a unified cross-modal pre-trained model for programming languages. It uses a prefix adapter with a masked attention matrix to control the behavior of the model, and cross-modal content such as ASTs and code comments to enhance the code representation. The paper proposes a one-to-one mapping method that encodes an AST, which is a tree, as a sequence while preserving all of the structural information it contains. The model learns representations of code fragments through contrastive learning on multi-modal content and then aligns the representations between programming languages through a cross-modal generation task.

GraphCodeBERT is a pre-trained model that learns code representations from the semantic structure of the code (rather than from AST information), where the semantic structure is derived from data flow. In addition to the MLM pre-training task, it introduces two code-structure-related pre-training tasks that help the model learn code representations from both the source code and the data flow.

Figure 5. Model-specific code-query representations. Red indicates the representation of code and blue indicates the representation of query.

Compared with GraphCodeBERT, UniXcoder learns better code representations and aligns representations between programming languages through cross-modal generation tasks. The generic representation of the pre-trained UniXcoder model is therefore better suited to code search than that of GraphCodeBERT, and Table 2 confirms that UniXcoder outperforms GraphCodeBERT on every dataset. GraphCodeBERT performs worse on code search because its pre-training lacks a task that aligns code with natural language. As the comparison of Tables 3 and 4 shows, applying I²R to GraphCodeBERT brings I²R-GraphCodeBERT close to the original UniXcoder: it slightly exceeds UniXcoder on AdvTest (41.6 vs. 41.3) and narrows the gap on CSN (73.6 vs. 74.4 overall). We therefore conclude that our method effectively aligns the inter-modal representations while improving the stability of the intra-modal representations.

4.7 Further analysis: I²R’s effectiveness is explained by the code-query pair representation

Fig. 5 shows how the representations of code-query pairs vary among the different models. Fig. 5a shows the representations obtained with the UniXcoder model, where code and query are encoded separately and code search is performed by directly computing their similarity. Fig. 5b shows the representations produced by the model based on intra-modal representation learning, and Fig. 5c shows those produced by the model based on inter-modal representation learning. Finally, Fig. 5d shows the representations produced by the model that uses both intra- and inter-modal representation learning.

Comparing Fig. 5a–c shows that introducing intra-modal and inter-modal representation learning leads to a more concentrated distribution of the code-query pair representations, as the axis ranges of the figures indicate. The code representations and the query representations each also form tighter clusters, which suggests that intra-modal representation learning effectively improves the unimodal representations. Furthermore, the smaller distance between the code-query pairs in Fig. 5c reflects the alignment of the two modalities. With the I²R model in Fig. 5d, the effects seen in Fig. 5b and 5c are combined: the code-query pair representations cluster further, the unimodal representations become more stable, and the modalities are further aligned.

5. Conclusion

This paper presents I²R, a novel approach to intra- and inter-modal representation learning for code search. The approach improves the consistency of intra-modal representations through R-Drop while optimizing inter-modal representations through contrastive learning, and it integrates both into a single unified framework. The inter-modal representation learning module aligns the representations of the two modalities, while R-Drop mitigates the randomness of the unimodal representations by constraining the consistency of the output distributions of the unimodal models. Extensive experiments and ablation studies confirm the efficacy of the proposed I²R approach, which achieves state-of-the-art performance on code search tasks. In the future, we plan to investigate more fine-grained feature representations of source code in order to further improve code representation learning.

Acknowledgments

The authors would like to thank the anonymous reviewers for their insightful comments. This work was funded by the National Natural Science Foundation of China (62176053).

References

1. McMillan, Grechanik, Poshyvanyk, Xie and Fu, Portfolio: finding relevant functions and their usage, in: Proceedings of the 33rd International Conference on Software Engineering, 2011, pp. 111–120.
2. Zhang, Lou J.-g., Wang, Zhang and Zhao, Codehow: Effective code search based on api understanding and extended boolean model (e), in: 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), IEEE, 2015, pp. 260–270.
3. Sun, Wang and Duan, Query expansion via wordnet for effective code search, in: 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER), IEEE, 2015, pp. 545–549.
4. Yan, Chen, Shen and Jiang, Are the code snippets what we are searching for? A benchmark and an empirical study on code search with natural-language queries, in: 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER), IEEE, 2020, pp. 344–354.
5. Shuai, Liu, Yan, Xia and Lei, Improving code search with co-attentive representation learning, in: Proceedings of the 28th International Conference on Program Comprehension, 2020, pp. 196–207.
6. Shi, Wang, Shi, Han and Zhang, Is a single model enough? mucos: A multi-model ensemble learning approach for semantic code search, in: Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 2021, pp. 2994–2998.
7. Yang, Liu, Shuai, Yan, Lei and Xu, Two-stage attention-based model for code search with textual and structural features, in: 2021 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), IEEE, 2021, pp. 342–353.
8. Mathew and Stolee K.T., Cross-language code search using static and dynamic analyses, in: Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2021, pp. 205–217.
9. Sun, Fang, Chen, Tao, Han and Zhang, Code search based on context-aware code translation, in: Proceedings of the 44th International Conference on Software Engineering, ICSE '22, Association for Computing Machinery, 2022, pp. 388–400. ISBN 9781450392211.
10. Feng, Guo, Tang, Duan, Feng, Gong, Shou, Qin, Liu, Jiang et al., CodeBERT: A Pre-Trained Model for Programming and Natural Languages, in: Findings of the Association for Computational Linguistics: EMNLP 2020, 2020, pp. 1536–1547.
11. Guo, Ren, Feng, Tang, Shujie, Zhou, Duan, Svyatkovskiy et al., GraphCodeBERT: Pre-training Code Representations with Data Flow, in: International Conference on Learning Representations, 2020.
12. Guo, Duan, Wang, Zhou and Yin, UniXcoder: Unified Cross-Modal Pre-training for Code Representation, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 7212–7225.
13. Chai, Zhang, Shen and Gu, Cross-domain deep code search with meta learning, in: Proceedings of the 44th International Conference on Software Engineering, Association for Computing Machinery, 2022, pp. 487–498. ISBN 9781450392211.
14. Xie, Xia, Fan, Lai J.-H., Qin and Liu, Sequence generation with mixed representations, in: International Conference on Machine Learning, PMLR, 2020, pp. 10388–10398.
15. Zolna, Arpit, Suhubdy and Bengio, Fraternal dropout, in: International Conference on Learning Representations, 2018.
16. Wang, Meng, Qin, Chen, Zhang, Liu T.-Y. et al., R-drop: Regularized dropout for neural networks, Advances in Neural Information Processing Systems 34 (2021), 10890–10905.
17. Chen Y.-C., El Kholy, Ahmed, Gan, Cheng and Liu, Uniter: Universal image-text representation learning, in: European Conference on Computer Vision, Springer, 2020, pp. 104–120.
18. Gao, Niu, Xiao, Liu and Wang, UNIMO: Towards unified-modal understanding and generation via cross-modal contrastive learning, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 2592–2607.
19. Radford, Kim J.W., Hallacy, Ramesh, Goh, Agarwal, Sastry, Askell, Mishkin, Clark et al., Learning transferable visual models from natural language supervision, in: International Conference on Machine Learning, PMLR, 2021, pp. 8748–8763.
20. Zhang et al., S-SimCSE: sampled sub-networks for contrastive learning of sentence embedding, arXiv preprint arXiv:2111.11750, 2021.
21. Gao, Yao and Chen, SimCSE: Simple contrastive learning of sentence embeddings, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 6894–6910.
22. Gao, Zang, Han, Wang and Hu, Esimcse: Enhanced sample building method for contrastive learning of unsupervised sentence embedding, arXiv preprint arXiv:2109.04380, 2021.
23. Chuang Y.-S., Dangovski, Luo, Zhang, Chang, Soljacic, Li S.-W., Yih W.-t., Kim and Glass, DiffCSE: Difference-based contrastive learning for sentence embeddings, in: Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2022.
24. Zhang and Kim, Deep code search, in: 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE), IEEE, 2018, pp. 933–944.
25. Desai and Johnson, Virtex: Learning visual representations from textual annotations, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11162–11173.
26. Li L.H., Yatskar, Yin, Hsieh C.-J. and Chang K.-W., Visualbert: A simple and performant baseline for vision and language, arXiv preprint arXiv:1908.03557, 2019.
27. Wang, Joty and Hoi S.C., CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 8696–8708.
28. Liu, Ott, Goyal, Joshi, Chen, Levy, Lewis, Zettlemoyer and Stoyanov, Roberta: A robustly optimized bert pretraining approach, arXiv preprint arXiv:1907.11692, 2019.
29. Oord A.v.d. and Vinyals, Representation learning with contrastive predictive coding, arXiv preprint arXiv:1807.03748, 2018.
30. Wang, Zhou, Wan, Liu and Jiang, Syncobert: Syntax-guided multi-modal contrastive pre-training for code representation, arXiv preprint arXiv:2108.04556, 2021.
31. Huo, Zhang, Liu, Gao, Yang, Wen, Zhang, Zheng et al., WenLan: Bridging vision and language by large-scale multi-modal pre-training, arXiv preprint arXiv:2103.06561, 2021.
32. Huang, Zeng, Huang, Liu and Fu, Seeing out of the box: End-to-end pre-training for vision-language representation learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 12976–12985.
33. Guo, Ren, Huang, Svyatkovskiy, Blanco, Clement, Drain, Jiang, Tang et al., CodeXGLUE: A machine learning benchmark dataset for code understanding and generation, in: Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021.
34. Husain, Wu H.-H., Gazit, Allamanis and Brockschmidt, Codesearchnet challenge: Evaluating the state of semantic code search, arXiv preprint arXiv:1909.09436, 2019.
35. Devlin, Chang M.-W., Lee and Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186.
36. Raffel, Shazeer, Roberts, Lee, Narang, Matena, Zhou and Liu P.J., Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of Machine Learning Research 21(140) (2020), 1–67. http://jmlr.org/papers/v21/20-074.html.
37. Sun, Myers, Vondrick, Murphy and Schmid, Videobert: A joint model for video and language representation learning, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 7464–7473.
38. Srivastava, Hinton, Krizhevsky, Sutskever and Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, The Journal of Machine Learning Research 15(1) (2014), 1929–1958.
39. Park, Shin S.-J. and Moon I.-C., Adversarial dropout for supervised and semi-supervised learning, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32, 2018.
