Region-Attention Prompt Learning for CLIP

Abstract

Pre-trained Visual Language Models (VLMs) such as CLIP have shown great potential in the multimodal domain. In particular, constructing prompts from different modal contexts and interaction features can stimulate the model’s prior knowledge more accurately and thus produce better outputs. However, in CLIP, the mismatch in the form of textual descriptions between the pre-training and inference phases limits the representation ability of the prompt, which is detrimental to the model’s alignment learning. We therefore propose the Region-Attention Prompt (RAP), which introduces region features to enrich the semantic representation of the prompt. RAP is obtained through the Cross-Attention mechanism between images and texts, and it is essentially a region-level prompt with category-sensitive properties: for each category, RAP adaptively assigns greater attention weight to the image regions that are more semantically relevant to that category. CLIP is then equipped with RAP (called RA-CLIP) to improve image classification performance in generalization scenarios. Extensive experiments demonstrate that RA-CLIP outperforms the current SOTA CoCoOp by 0.4% - 4.16% on base classes and by 0.25% - 11.34% on new classes across 7 datasets. In addition, we show that focusing on category-related regions when constructing the prompt can further improve the model’s alignment ability.

Keywords

Prompt learning, CLIP, Cross-Attention mechanism, image classification

1 Introduction

Recently, a growing body of research on pre-trained Vision-Language Models (VLMs) has shown that models learned jointly from image and text knowledge can achieve striking performance on downstream tasks, which demonstrates great potential in the field of visual-language learning [1–4]. For example, CLIP [5] and ALIGN [2] are pre-trained on millions of image-text pairs to align the vision and language modalities in the embedding space, and the resulting models obtain impressive performance on downstream tasks in a zero-shot manner.

Various works such as CLIP [5] transform a discrete category label into a complete sentence using manually designed prompts such as “a photo of a {CLASS}” to make better use of the encoder’s prior knowledge. However, designing such prompts by hand is time-consuming and inefficient, because it relies on trial and error and cannot guarantee an optimal prompt. In addition, the currently popular Large Language Models (LLMs) also rely heavily on prompt engineering. A well-designed prompt can fully stimulate the prior knowledge in LLMs and enable them to generate output in a more precise and controlled manner. Therefore, how to construct prompts that efficiently mine the prior knowledge of models is an important research direction [6, 7].

To automate prompt engineering, Zhou et al. [8] recently introduced prompt learning, a recent trend in NLP, into VLMs [9–13] and proposed CoOp. CoOp introduces a set of learnable vectors that replace the manual prompt and adapt to specific downstream tasks, and it achieves better performance with few trainable parameters. CoCoOp [14] adds an image-conditional token to the learnable context vectors and shifts the focus from text-only learning to image-text learning, which improves CoOp’s generalizability.

It can be concluded that the mismatch in textual form between the pre-training data and the downstream data results in a suboptimal representation ability of the prompt. The text description used by CLIP [5] in the pre-training phase corresponds to the image and is often a semantically rich and complete sentence. For example, in the pre-training stage “Pepper the aussie pup” is the text description corresponding to an image of a dog, but when performing inference on image classification tasks, a manually designed template “a photo of a {CLASS}” is used, where "dog" is filled into {CLASS} as the text description. This is detrimental to CLIP’s alignment learning, because a prompt built from a fixed template cannot fully exploit the prior knowledge in CLIP, thus compromising classification accuracy.

One way to address the problem above is to use image-based features to construct the prompt. Such a prompt enriches the text representation so that it approaches, both semantically and formally, the text descriptions seen during pre-training, thus benefiting CLIP’s alignment learning. Moreover, because CLIP learns to match a whole image with the corresponding text description rather than with regions in the image, it does not learn a more fine-grained alignment representation [15]. The image-text level alignment capability of CLIP should therefore be extended to the region-text level by introducing region features into the prompt.

Therefore, Region-Attention Prompt (RAP) is proposed in this paper to serve as a more region-level and category-sensitive prompt, thus modeling a more fine-grained region-text joint embedding learning. RAP is essentially an instance-level and category-sensitive prompt, which enhances the attention weights of regions that are more relevant to the category in the vector space. Therefore, RAP can further enrich the semantic representation of the text description to make it more sensitive to important regions, thus improving the alignment ability of CLIP.

Our contributions are summarized as follows:

(1) We point out that the representation ability of the prompt is crucial for the downstream task performance of Pre-trained Visual Language Models (VLMs). Meanwhile, we propose a paradigm for constructing prompts using image-based features, called Region-Attention Prompt (RAP), to enrich the semantic expressiveness of prompts. RAP is essentially a category-sensitive and fine-grained prompt, which focuses more on category-related regions.

(2) Semantic Interaction Network (SINet) is proposed to generate RAP, which includes: (a) Transformer Decoder Module, which uses the Cross-Attention mechanism for image and text to obtain region-attention features. (b) Projector Module, a lightweight neural network, which gets the projected representation of region-attention features.

(3) CLIP is equipped with RAP (called RA-CLIP) to improve image classification performance in generalization scenarios. Concretely, RA-CLIP’s generalization performance is evaluated on 7 image classification datasets, and on average RA-CLIP outperforms CLIP, CoOp, and CoCoOp by 6.77%, 7.21%, and 3.52%, respectively, in harmonic-mean performance (and by 0.64%, 10.86%, and 3.7%, respectively, on new classes). In addition, we demonstrate that concentrating on category-related regions to construct prompts can further improve the model’s alignment ability.

2 Related work

2.1 Theoretical development

Vision Language Models In recent years, there has been a series of works on building the connection between computer vision and natural language processing [1–4], e.g., text-to-image retrieval [16], image captioning [17], visual question answering [18, 19], and referring segmentation [20].

Those developments in image-text joint learning are largely driven by advances in the following three areas: (1) text representation learning with Transformers [21], (2) large-minibatch contrastive representation learning, and (3) web-scale training datasets [2 , 22].

A representative approach is CLIP [5], which trains two neural network-based encoders using a contrastive loss to match pairs of images and texts. After consuming 400 million data pairs, CLIP shows competitive performance in various downstream tasks such as image recognition, object detection, and dense prediction. Moreover, CLIP shows an impressive ability to transfer to other datasets. Inspired by CLIP [5], several follow-ups have been proposed to improve the training strategy (e.g., CoOp [8], CLIP-Adapter [23], Tip-adapter [24], CoCoOp [14]) or apply CLIP to other domains (e.g., ActionCLIP [25]). Our work is built on CLIP, aiming to construct a more instructive and region-sensitive prompt to model more fine-grained region-text learning.

Prompt Learning This topic is derived from the Natural Language Processing (NLP) domain. The motivation of prompt learning is to exploit the knowledge learned by large-scale pre-trained models to perform various downstream tasks. Prompt learning can be summarized as “pre-train, prompt, and predict” [26], in which the downstream task is reorganized into a form similar to a pre-training task. Timo Schick and Hinrich Schütze [27] proposed PET, which adapts the language model to the task by recasting the input of the text classification task as a fill-in-the-blank question, while the inference process is recast as a text generation task to take full advantage of the text generation capabilities of the language model. Fabio Petroni et al. [28] proposed LAMA, which obtained better relation extraction than knowledge bases by recasting the relation extraction task as fill-in-the-blank questions, without modifying the pre-trained language model. Shin et al. [29] proposed AUTOPROMPT, which uses prompt learning for text classification and text implication determination tasks. However, AUTOPROMPT sometimes struggles when the training data is highly imbalanced, and it also lacks interpretability.

Motivated by the success of prompt learning in NLP, several works have adapted prompt learning to VLMs. CoOp [8] replaces the manual prompt used in CLIP with a set of learnable vectors. Despite its outstanding performance, interpreting the results of CoOp, as with other continuous prompt learning methods in NLP, poses a significant challenge. Furthermore, experiments indicate that CoOp is highly susceptible to noisy labels. Moreover, the learned context prompt does not generalize to wider unseen classes within the same dataset.

Based on CoOp, prompt tuning methods for complex tasks have also been proposed, such as open-vocabulary object detection [30, 31], zero-shot semantic segmentation [32], continual learning [33], and multi-label image classification [34]. CoCoOp [14] extends CoOp by further learning a lightweight neural network that generates an input-conditional token (vector) for each image. Although CoCoOp introduces instance-level image features into the context prompt, it assigns the same image features to every category when constructing the prompt. As a result, the model still fails to recognize the intrinsic semantic associations between the texts of different classes and the image, resulting in semantic ambiguity and inaccuracy of the prompt.

Our work aims to explore the use of image information to build a category-sensitive prompt, which we hope can adaptively focus on different regions in the image according to different category descriptions, so we call it Region Attention Prompt.

Zero-Shot Learning (ZSL) ZSL is another relevant research area, whose goal is to recognize new classes by training only on base classes [3, 35–37]. Moreover, the "seen-class bias" issue raised in the ZSL literature is related to the generalization problem, where a model trained on base classes frequently fails on new classes. VLMs like CLIP [5] also suffer from the generalization problem, which is attributed to the mismatch in text form between the pre-training data and the downstream task data and is detrimental to alignment learning.

The most common approach to ZSL is to learn a semantic space based on auxiliary information such as attributes or word embedding. Different from existing ZSL methods, our work addresses the above problem by using image-based information to construct an instance-level prompt called Region-Attention Prompt (RAP). RA-CLIP is further proposed to improve image classification performance under generalization scenarios.

Fig. 1

Semantic Interaction Network (SINet), which makes the prompt focus on category-related regions (the cat’s head, tail, and paws). SINet assigns a higher attention weight to these regions and a lower one to background regions (grass and trees).

2.2 Reviews

CLIP is a powerful vision-language pre-training model that uses contrastive learning to align image and text representations in the same vector space. CLIP consists of two encoders: an image encoder (ResNet [38] or ViT [39]) and a text encoder (Transformer [21]). When performing zero-shot learning, CLIP [5] uses a manual prompt “a photo of a {CLASS}” to turn the original category text into a complete sentence, feeds the sentence into the text encoder, and selects the category with the greatest similarity to the image as the label of the image. Note that CLIP does not need any additional parameter training in this process [8, 25].
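
The zero-shot decision rule described above can be illustrated with a minimal sketch. It assumes cosine similarity as the similarity measure and uses made-up 3-dimensional vectors in place of real encoder outputs; the function names `cosine` and `zero_shot_predict` are hypothetical and not part of CLIP's code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_predict(image_feat, text_feats):
    """Return the prompt whose text feature is most similar to the image feature."""
    scores = {name: cosine(image_feat, feat) for name, feat in text_feats.items()}
    return max(scores, key=scores.get), scores


# Toy features standing in for encoder outputs (values are illustrative only).
text_feats = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
}
image_feat = [0.8, 0.2, 0.1]
label, scores = zero_shot_predict(image_feat, text_feats)
print(label)
```

No parameter is updated anywhere in this procedure, which is what makes the prediction zero-shot.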

Context Optimization (CoOp) is proposed to extend CLIP’s ability in few-shot tasks. Motivated by prompt learning in the NLP field, CoOp [8] introduces M learnable context vectors (soft prompt) {v₁, v₂, … , v_M } to replace the manual template “a photo of a {CLASS}”, where each has the same dimension with the word embedding. The key idea of the CoOp is to make use of the prompt learning method in the text encoder by only tuning the learnable context vectors, then the CLIP-like models can learn a better prompt representation.

Conditional Context Optimization (CoCoOp) is proposed to improve CoOp’s generalizability. Specifically, CoCoOp [14] further learns a lightweight neural network that generates an input-conditional token for each image. This token is then added to the soft prompt to obtain an instance-level prompt for each image. As in CoOp, CoCoOp minimizes the cross-entropy loss (i.e., maximizes the score of the ground-truth class) to update the learnable context vectors and the lightweight linear-transformation head by gradient propagation, with both the image encoder and the text encoder frozen during training.
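
To make the difference between the two prompt constructions concrete, the following sketch builds both from plain Python lists. It is a simplification under stated assumptions: the class name is represented by a single embedding vector, the input-conditional token `pi` is given rather than produced by a network, and the function names are hypothetical.

```python
def coop_prompt(context, class_emb):
    """CoOp-style prompt: M learnable context vectors followed by the class embedding."""
    return [list(v) for v in context] + [list(class_emb)]


def cocoop_prompt(context, class_emb, pi):
    """CoCoOp-style prompt: the same image-conditional token pi shifts every context vector."""
    shifted = [[a + b for a, b in zip(v, pi)] for v in context]
    return shifted + [list(class_emb)]


context = [[0.0, 0.0], [1.0, 1.0]]   # M = 2 context vectors (toy values)
class_emb = [5.0, 5.0]               # embedding of one class name (toy values)
pi = [0.5, -0.5]                     # token generated from one image (assumed given)

static_prompt = coop_prompt(context, class_emb)
instance_prompt = cocoop_prompt(context, class_emb, pi)
print(instance_prompt)
```

Note that `pi` depends only on the image: every class receives the same shift, whichever class embedding is appended.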

3 Method

3.1 Region features analysis

A good prompt in CLIP should be instructive and precise, so that alignment learning adapts better to downstream tasks. For image classification tasks, the information in different regions of the image does not have the same relevance to the category. Moreover, if the object to be classified is not prominent in the image, or the image contains other focal parts that are irrelevant to the classification, then simply making the whole image interact with the category text to construct the prompt may harm CLIP’s alignment ability. Category texts provide conceptual-level semantic descriptions, whereas regions offer more precise, instance-level ones, so refining the image features with category features yields more instructive and fine-grained region information.

Therefore, in constructing the prompt for image classification tasks, it is necessary to pay more attention to regions that contribute more to the category texts, which makes prompt more sensitive and precise to important regions. This prompting method can enrich the semantic representation of text descriptions, thus improving CLIP’s image-text level alignment capabilities.

To illustrate that focusing on category-related regions is beneficial for better alignment in CLIP, region-text similarity experiments are conducted on the OxfordPets dataset. We split a complete image into 4 regions (region 1 - 4) and additionally extract the classification object region (region*) from the image; the similarity between classes and regions/region* is then calculated by the dot-product operation. Greater similarity indicates better alignment.

Table 1 shows that for a cat image, region* (which is more relevant to the class “cat”) achieves the highest similarity score (0.942), higher than the other regions and the full image. This indicates that it is beneficial for CLIP’s alignment to use category-related regions to construct the prompt. Moreover, the regions that are not relevant to “cat” (region 1, region 3) may be classified into the wrong category (such as “dog” or “lion”), thus hurting CLIP’s alignment ability.

Table 1
Region-category similarity experiments

Category	Region *	Region 1	Region 2	Region 3	Region 4	Full
“cat”	0.942	0.132	0.913	0.194	0.909	0.902
“dog”	0.005	0.215	0.034	0.319	0.004	0.003
“lion”	0.053	0.652	0.053	0.487	0.087	0.095

3.2 Region-attention prompt (RAP)

In general, an instance-level prompt called Region-Attention Prompt (RAP) is proposed to model fine-grained region-text joint embedding learning. RAP introduces region-attention features under the guidance of category texts, which can enrich the semantic representation of the text description and can avoid the referencing ambiguity caused by unimodal prompts.

SINet is built to generate RAP, which uses the Cross-Attention mechanism to make every category interact with the image to assign greater attention weight to the region with higher semantical relevance to the category. In this process, all category texts act as queries while images serve as keys/values. SINet allows each category text to obtain unique region-attention features that make prompt more instructive and precise to important regions. Also, benefiting from the properties of the Cross-Attention mechanism, SINet can maximize the use of semantic information from both modalities. Specifically, SINet consists of two modules:

Transformer Decoder Module Let T_θ (·) denote the Transformer Decoder Module parameterized by θ, and let $\bar{x}$ denote the image features generated by the image encoder. Let $\{c_{i}\}_{i = 1}^{K}$ be the word embedding(s) for the K category texts produced by the text encoder (suppose there are K classes in total). The K category features $\{c_{i}\}_{i = 1}^{K}$ and the image features $\bar{x}$ are passed through the Transformer Decoder Module T_θ (·) to get the region-attention features:

$\{e_{i}\}_{i = 1}^{K} = T_{θ} (\{c_{i}\}_{i = 1}^{K}, \bar{x})$ (1)

where in Eq. (1) each region-attention feature $e_{i}$ corresponds to the $c_{i}$ of the same category name.
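
The core operation behind Eq. (1), with category texts as queries and image regions as keys/values, can be sketched as a single-head scaled dot-product cross-attention. This is only an illustration: it omits the learned query/key/value projections, the multiple heads, and the residual, normalization, and feed-forward sublayers of a real Transformer decoder block, and all names and values are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Each query attends over all keys and returns a weighted sum of the values."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out = [sum(w[j] * values[j][t] for j in range(len(values)))
               for t in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Two category embeddings act as queries; three region features act as keys and values.
categories = [[1.0, 0.0], [0.0, 1.0]]
regions = [[2.0, 0.0], [0.0, 2.0], [0.1, 0.1]]
feats, attn = cross_attention(categories, regions, regions)
print(attn[0])  # attention of category 0 over the three regions
```

Each category receives its own row of attention weights, so the resulting region-attention features differ from one category to another even though the image is the same.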

Projector Module To generate RAP, a lightweight neural network called the Projector Module is further learned. Specifically, let G_η (·) denote the Projector Module parameterized by η. For the i-th category, the region-attention features $e_{i}$ are passed through the projector G_η (·), and RAP is obtained by Eq. (2):

$p_{i} = G_{η} (e_{i})$ (2)

3.3 Framework of RA-CLIP

In general, CLIP is equipped with RAP (called RA-CLIP). The framework of RA-CLIP is shown in Fig. 2 and mainly includes the following parts:

Fig. 2

Framework of RA-CLIP, where both text encoders are original CLIP text encoders and share the same pre-trained parameters. "concat" means that RAPs are concatenated with the class embeddings as inputs to the right text encoder (the number of RAPs is always equal to the number of classes). Meanwhile, to keep the input of the text encoder consistent, we pad the token length of the concatenated vectors to 77.

Modified Image Encoder Motivated by DenseCLIP [40], we modify CLIP’s image encoder as f′ (·) so that it additionally outputs the features after the MHSA (Multi-Head Self-Attention) layer that have not been globally average-pooled (called the language-compatible features), which serve as the input of SINet.

Language-compatible features contain sufficient image space features while being well aligned with text features. Eq. (3) can be used to obtain the language-compatible features: $\bar{x} = f^{'} (I)$ (3) where I denotes the original image.

Image Encoder To maintain the uniformity of RA-CLIP, the original CLIP’s image encoder f (·) is used to generate the image features: $x = f (I)$ (4)

SINet Let SINet_θ,η (· , ·) be the SINet parameterized by θ and η, then SINet can be represented in the form of Eq. (5): $\mathrm{SINet}_{θ, η} (\cdot, \cdot) = \{T_{θ} (\cdot, \cdot), G_{η} (\cdot)\}$ (5)

And RAP can be denoted by Eq. (6): $p_{i} = \mathrm{SINet}_{θ, η} (c_{i}, \bar{x})$ (6)

Text Encoder The original pre-trained CLIP’s text encoder g (·) is adopted as our feature extractor. Note that the two text encoders in the framework are the same and shared among all data.

Let us assume a dataset τ with training data $D_{train} = \{(T_{i}, I_{i})\}_{i = 1}^{K}$, where K denotes the number of classes and T denotes the category texts. The first text encoder is used to get the word embedding(s), and the first M tokens are taken as the input of SINet: $c_{i} = g (T_{i}) [:, : M, :]$ (7) where M in Eq. (7) is a hyperparameter.

Then the features of the K classes are appended after the RAPs to get the input of the second text encoder: $\{ω_{i}\}_{i = 1}^{K} = concat (\{p_{i}\}_{i = 1}^{K}, g (\{T_{i}\}_{i = 1}^{K}) [:, M :, :])$ (8)
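
The slicing in Eq. (7) and the concatenation in Eq. (8) can be pictured with plain Python lists, one list of token vectors per class. The sketch below is illustrative only: the helper names are hypothetical, and padding with zero vectors is an assumption, since the text states only that the sequence is padded to a length of 77.

```python
def split_tokens(tokens, m):
    """Split a token sequence into its first m tokens and the remainder."""
    return tokens[:m], tokens[m:]


def build_prompt(rap_tokens, class_tokens, m, max_len=77):
    """Concatenate RAP with the class tokens from position m onward, then pad."""
    _, tail = split_tokens(class_tokens, m)
    prompt = [list(t) for t in rap_tokens] + [list(t) for t in tail]
    if len(prompt) > max_len:
        raise ValueError("prompt is longer than the text encoder's context length")
    dim = len(prompt[0])
    # Zero-vector padding is an assumption made for this sketch.
    padding = [[0.0] * dim for _ in range(max_len - len(prompt))]
    return prompt + padding


# Toy example: 6 class tokens of dimension 2, M = 4, RAP made of 4 tokens.
class_tokens = [[float(i), float(i)] for i in range(6)]
rap_tokens = [[9.0, 9.0] for _ in range(4)]
prompt = build_prompt(rap_tokens, class_tokens, m=4, max_len=8)
print(len(prompt))
```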

For example, the input of the second text encoder for the i-th category now becomes: $ω_{i} = concat (p_{i}, g (T_{i}) [:, M :, :])$ (9)

Loss Function RA-CLIP is trained to minimize the standard classification loss, and the gradients can be back-propagated all the way through the text encoder to optimize the SINet.

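Specifically, the cross-entropy loss is adopted as our loss function to maximize the score of the ground-truth class, where the prediction probability is given by Eq. (10): $p (y ∣ x) = \frac{\exp (sim (x, g (ω_{y})) / τ)}{\sum_{i = 1}^{K} \exp (sim (x, g (ω_{i})) / τ)}$ (10) where K denotes the number of classes, sim (·, ·) denotes the similarity between the image features and the text features, and τ here is the temperature of the softmax.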
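
As a worked illustration of Eq. (10), the sketch below turns a list of image-text similarities into class probabilities and evaluates the cross-entropy for a given ground-truth index. The similarity values and the temperature of 0.01 are assumptions chosen for the example, not settings reported in this paper.

```python
import math


def class_probabilities(similarities, temperature):
    """Softmax over similarity / temperature, as in Eq. (10)."""
    logits = [s / temperature for s in similarities]
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


def cross_entropy(similarities, target, temperature=0.01):
    """Negative log-probability of the ground-truth class."""
    return -math.log(class_probabilities(similarities, temperature)[target])


sims = [0.30, 0.25, 0.10]  # sim(x, g(w_i)) for K = 3 classes (toy values)
probs = class_probabilities(sims, temperature=0.01)
loss_if_class0 = cross_entropy(sims, target=0)
loss_if_class1 = cross_entropy(sims, target=1)
print(probs, loss_if_class0, loss_if_class1)
```

A small temperature sharpens the distribution, so even a modest similarity gap between two classes translates into a large difference in loss.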

During training, only the parameters of SINet are updated; both the text encoders and the image encoders are frozen.

In this work, the Transformer Decoder Module is built on six-layer Transformer Decoder Blocks with eight heads, and the Projector Module is built with a two-layer bottleneck structure (Linear-ReLU-Linear), with the hidden layer reducing to 64 dimensions.
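
The Linear-ReLU-Linear bottleneck of the Projector Module can be sketched in a few lines of plain Python. The hidden width of 64 comes from the text; the embedding width of 512, the random initialization, and the choice of an output width equal to the input width are assumptions made for illustration.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Return (weights, bias) of a dense layer with small random weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def linear(x, layer):
    """Apply a dense layer: y = W x + b."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def projector(e, down, up):
    """Bottleneck projector: Linear -> ReLU -> Linear."""
    return linear(relu(linear(e, down)), up)


rng = random.Random(0)
EMBED_DIM, HIDDEN_DIM = 512, 64  # EMBED_DIM is an assumed value; 64 is from the text
down = make_linear(EMBED_DIM, HIDDEN_DIM, rng)
up = make_linear(HIDDEN_DIM, EMBED_DIM, rng)
e_i = [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]  # stand-in for one region-attention feature
p_i = projector(e_i, down, up)
print(len(p_i))
```

With these assumed sizes the two layers hold about 66 thousand parameters, which gives a sense of why such a bottleneck is described as lightweight.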

Table 2

Datasets Statistics

(a) Base Classes
Datasets	Base/Total Classes	Train.	Val.	Test.
Caltech101	50/100	800	200	1,549
OxfordPets	19/37	304	76	1,881
StanfordCars	98/196	1,568	392	4,002
Food101	51/101	816	204	15,300
FGVCAircraft	50/100	800	200	1,666
EuroSAT	5/10	80	20	4,200
DTD	24/47	384	96	2,340
(b) New Classes
Datasets	New/Total Classes	Val.	Test.
Caltech101	50/100	200	916
OxfordPets	18/37	72	1,788
StanfordCars	98/196	392	4,039
Food101	50/101	200	15,000
FGVCAircraft	50/100	200	1,667
EuroSAT	5/10	20	3,900
DTD	23/47	92	2,248

4 Experiments and conclusions

Datasets Our experiments are evaluated over 7 datasets. Specifically, the benchmark includes Caltech101 for the classification of generic objects; OxfordPets, StanfordCars, Food101, and FGVCAircraft for fine-grained classification; DTD for texture classification; and finally, EuroSAT for satellite imagery recognition. Following Zhou et al. [8], in each dataset we randomly sample 16 images per class to construct a few-shot training set while using the original test set for testing. On each dataset we split the classes equally into base classes and new classes. RA-CLIP is trained on base classes, and evaluation is conducted on both base and new classes. The detailed data statistics for base and new classes are shown in Table 2.
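
The data preparation described above can be sketched as follows. Two details are assumptions made for the sketch: the first half of the class list is taken as the base classes, and an odd class count gives the extra class to the base split (which matches the 19/18 split of OxfordPets in Table 2). The function names are hypothetical.

```python
import random


def split_base_new(class_names):
    """Split classes into base (first half, rounded up) and new (the rest)."""
    half = (len(class_names) + 1) // 2
    return class_names[:half], class_names[half:]


def sample_few_shot(samples_by_class, shots=16, seed=0):
    """Randomly draw up to `shots` training samples for every class."""
    rng = random.Random(seed)
    return {
        name: rng.sample(items, min(shots, len(items)))
        for name, items in samples_by_class.items()
    }


# Toy stand-in for OxfordPets: 37 classes with 40 sample ids each.
classes = [f"class_{i}" for i in range(37)]
base, new = split_base_new(classes)
pool = {name: list(range(40)) for name in base}
train_set = sample_few_shot(pool, shots=16)
total_train = sum(len(v) for v in train_set.values())
print(len(base), len(new), total_train)
```

With 19 base classes and 16 shots each, the few-shot training set has 304 samples, in agreement with the OxfordPets row of Table 2(a).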

Table 3
Training Settings

Method	Batch Size	Optimizer	Learning Scheduler	Training Epoch	Warmup Epoch	Number of Trials
CoOp	32	SGD	Cosine	50	1	3
CoCoOp	1	SGD	Cosine	10	1	3
RA-CLIP(Ours)	1	SGD	Cosine	10	1	3

Baselines The baselines are CLIP [5], CoOp [8], and CoCoOp [14]. Specifically, CLIP uses a manual-designed template as the prompt, CoOp only tunes the learnable context vector as the soft prompt, and CoCoOp adds the image-conditional token to the learnable context vectors using Meta-Net.

Training Details For the image encoder, ViT-B/16 is used as the vision backbone in CLIP. For the text encoder, the original CLIP text encoder is adopted. Because of the relatively slow training speed and high GPU memory consumption, RA-CLIP is trained with a batch size of 1 for 10 epochs. During training, only the parameters of SINet are updated, and both the text and image encoders are frozen. The detailed training settings are listed in Table 3.

Table 4

Comparison of CLIP, CoOp, CoCoOp, and RA-CLIP in the base-to-new generalization setting

(a) Average over 7 datasets
Method	Base Acc.	New Acc.	HM
CLIP	68.34	73.94	70.90
CoOp	81.46	63.72	70.46
CoCoOp	78.90	70.88	74.15
RA-CLIP	81.02	74.58	77.67
(b) Caltech101
Method	Base Acc.	B-mF1.	New Acc.	N-mF1.	HM
CLIP	96.84	-	94.00	-	95.40
CoOp	98.00 ±0.29	0.979	89.91 ±0.39	0.899	93.73
CoCoOp	97.96 ±0.23	0.965	93.81 ±0.29	0.937	95.84
RA-CLIP	98.34 ±0.16	0.983	95.21 ±0.24	0.951	96.75
(c) Food101
Method	Base Acc.	B-mF1.	New Acc.	N-mF1.	HM
CLIP	90.10	-	91.22	-	90.66
CoOp	88.33 ±0.14	0.882	82.26 ±0.31	0.821	85.19
CoCoOp	90.70 ±0.19	0.907	91.29 ±0.22	0.910	90.99
RA-CLIP	90.74 ±0.17	0.907	91.65 ±0.12	0.917	91.19
(d) StanfordCars
Method	Base Acc.	B-mF1.	New Acc.	N-mF1.	HM
CLIP	63.37	-	74.89	-	68.65
CoOp	78.12 ±0.36	0.780	60.40 ±0.29	0.601	68.13
CoCoOp	70.49 ±0.37	0.702	73.59 ±0.20	0.735	72.01
RA-CLIP	74.40 ±0.26	0.743	74.95 ±0.42	0.750	74.67
(e) OxfordPets
Method	Base Acc.	B-mF1.	New Acc.	N-mF1.	HM
CLIP	91.17	-	97.26	-	94.12
CoOp	93.67 ±0.31	0.935	95.29 ±0.49	0.952	94.47
CoCoOp	95.20 ±0.21	0.951	97.69 ±0.12	0.975	96.43
RA-CLIP	95.62 ±0.25	0.956	97.94 ±0.21	0.977	96.77
(f) FGVCAircraft
Method	Base Acc.	B-mF1.	New Acc.	N-mF1.	HM
CLIP	27.19	-	36.29	-	31.09
CoOp	40.44 ±0.37	0.401	22.30 ±0.88	0.222	28.75
CoCoOp	33.41 ±0.47	0.334	23.71 ±0.96	0.235	27.74
RA-CLIP	36.82 ±0.29	0.366	34.11 ±0.52	0.343	35.41
(g) DTD
Method	Base Acc.	B-mF1.	New Acc.	N-mF1.	HM
CLIP	53.24	-	59.90	-	56.37
CoOp	79.44 ±0.51	0.787	41.18 ±0.28	0.415	54.24
CoCoOp	77.01 ±0.87	0.767	56.00 ±0.31	0.558	64.85
RA-CLIP	81.18 ±0.62	0.812	56.84 ±0.26	0.568	66.86
(h) EuroSAT
Method	Base Acc.	B-mF1.	New Acc.	N-mF1.	HM
CLIP	56.48	-	64.05	-	60.03
CoOp	92.19 ±1.91	0.919	54.74 ±1.58	0.545	68.69
CoCoOp	87.49 ±1.87	0.874	60.04 ±2.97	0.597	71.21
RA-CLIP	90.07 ±1.31	0.898	71.38 ±2.33	0.712	73.16

Evaluation Metric We adopt accuracy (Base Acc. and New Acc.), Macro F1 score (B-mF1 and N-mF1), and the harmonic mean of Base Acc. and New Acc. (HM) as evaluation metrics. The Macro F1 score is the unweighted average of the per-class F1 scores (the F1 score being the harmonic mean of Precision and Recall) and is a commonly used metric in multi-class classification tasks. HM is more affected by very small values than by very large ones, so it highlights the generalizability trade-off between Base Acc. and New Acc.
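
For reference, both metrics can be computed with a few lines of plain Python. The sketch below uses the standard definitions, HM = 2 · Base · New / (Base + New) and Macro F1 as the unweighted mean of per-class F1; the function names and the toy labels are illustrative.

```python
def harmonic_mean(base_acc, new_acc):
    """Harmonic mean of base-class and new-class accuracy."""
    return 2 * base_acc * new_acc / (base_acc + new_acc)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)


# RA-CLIP on DTD in Table 4: Base Acc. 81.18, New Acc. 56.84.
hm_dtd = harmonic_mean(81.18, 56.84)
score = macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"])
print(round(hm_dtd, 2), round(score, 4))
```

For the DTD row the harmonic mean is about 66.86, well below the arithmetic mean of 69.01, which shows how HM penalizes the lower of the two accuracies.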

4.1 Main results

RA-CLIP is evaluated over 7 datasets in the base-to-new generalization setting, and the results are averaged over three runs. CoOp [8], CoCoOp [14], and RA-CLIP are learning-based methods, while CLIP [5] uses the zero-shot manual prompt. The detailed results are shown in Table 4 and demonstrate the strong generalization performance of RA-CLIP.

RA-CLIP vs. CLIP As shown in Fig. 3, for new classes RAP outperforms CLIP on 5 datasets and falls behind on 2 datasets (FGVCAircraft and DTD). However, RAP still achieves better performance on new classes on average, which indicates that RAP has similar or even better generalizability than CLIP on most datasets.

Fig. 3

Comparisons of RA-CLIP and CLIP in the base-to-new generalization setting.

For base classes, RA-CLIP outperforms CLIP on all 7 datasets, with gains ranging from 0.64% to 33.59%. This indicates that learning-based prompt methods adapt to the training data more effectively, thus providing a better semantic representation for base classes. Moreover, relative to CLIP, RA-CLIP’s gains on base classes far outweigh its losses on new classes, thus achieving better harmonic performance: on the DTD dataset the gain on base classes is +27.94% while the loss on new classes is -3.06%, and on the FGVCAircraft dataset the gain on base classes is +9.63% while the loss on new classes is just -2.18%.

RA-CLIP vs. CoOp In comparison to CoOp (see Fig. 4), performance drops in the base classes occur for RA-CLIP on 3 datasets (StanfordCars, FGVCAircraft, EuroSAT). This is reasonable because CoOp optimizes specifically for base classes, while RA-CLIP optimizes for each specific image in order to gain the instance-level prompt over an entire task (both new classes and base classes).

Fig. 4

Comparisons of RA-CLIP and CoOp in the base-to-new generalization setting.

It is worth noting, however, that although RA-CLIP’s base accuracy drops below CoOp on 3 of 7 datasets (by -2.12% to -3.72%), RA-CLIP’s gains on new classes are significantly larger than these losses, which is enough to make the overall change positive. For example, StanfordCars sees the worst base accuracy drop of -3.72% but obtains an accuracy gain of +14.55% on new classes, a net improvement of 10.83% for RA-CLIP, while EuroSAT sees a base accuracy drop of -2.12% but obtains an accuracy gain of +16.64% on new classes, a net improvement of 14.52% for RA-CLIP.

RA-CLIP vs. CoCoOp As shown in Fig. 5, when considering both base and new classes (harmonic performance), RAP shows a gain of 3.52% over CoCoOp on average (77.67% vs. 74.15%, see Table 4). For new classes, RA-CLIP outperforms CoCoOp on all datasets, and by a relatively large margin on EuroSAT and FGVCAircraft.

Fig. 5

Comparisons of RA-CLIP and CoCoOp in the base-to-new generalization setting.

For base classes, RAP still outperforms CoCoOp on all datasets (especially on DTD, StanfordCars, FGVCAircraft, and EuroSAT). In general, compared to CoCoOp (which adopts image-conditional features to construct the prompt), RAP is more competitive on both base and new classes. This suggests that using region-attention features to construct the prompt can provide a more precise and instructive semantic representation for the text description.

4.2 How does RAP aid in image classification

To understand the working mechanism of RAP and explore how RAP improves image classification performance, we visualized the attention weights of RAP on the OxfordPets and Food101 datasets. In our experiments, we visualized RAP’s attention weights on base classes and new classes separately, and for each image we sampled the attention feature maps corresponding to three classes (only one of which is the ground truth).

The detailed results are shown in Table 5 and Table 6. It can be inferred that, on both base classes and new classes, the Ground-Truth-RAP focuses on the correct regions in the image and assigns higher attention weights to these regions, while the remaining RAPs maintain lower attention weights overall. This suggests that RA-CLIP is conditioned on each input instance: it focuses on regions that are semantically similar to the category and assigns them high attention weights. Meanwhile, RAP can be optimized to characterize each instance, so it is more robust to class shift. In other words, RA-CLIP learns a pattern for extracting region-level features, and in the generalization scenario this learned extraction pattern helps RA-CLIP obtain instance-level region features.

Table 5
RAP attention weight visualization on OxfordPets dataset, where "GT" represents Ground Truth

(a) Base Class
Image	"Abyssinian"(GT)	"American bulldog"	"Basset hound"

Image	"Beagle"(GT)	"Boxer"	"British shorthair"

(b) New Class
Image	"Russian Blue"(GT)	"Bombay"	"Birman"

Image	"Shiba"(GT)	"Chihuahua"	"Havanese"

Table 6

RAP attention weight visualization on Food101 dataset, where "GT" represents Ground Truth

(a) Base Class
Image	"French fries"(GT)	"Donuts"	"Ramen"

Image	"Hotdog"(GT)	"Waffles"	"Chicken curry"

(b) New Class
Image	"Ice cream"(GT)	"Bombay"	"Birman"

Image	"Pizza"(GT)	"Chihuahua"	"Havanese"

Moreover, to explore the advantages of this regional attention feature in model alignment learning, we further measured the cosine similarity η between the Ground-Truth-RAP and the image on the OxfordPets dataset (see Table 7). The larger η is, the smaller the semantic distance between the Ground-Truth-RAP and the image, which indicates stronger alignment ability. The results in Table 7 demonstrate that RAP, the prompting approach that uses the Cross-Attention mechanism to acquire region features, brings images and classes closer in the vector space, thus improving the model’s alignment ability (and image classification performance).
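The similarity measure can be sketched as follows. The cosine formula is standard; averaging it over (prompt embedding, image embedding) pairs to obtain a single η per split is an assumption made here for illustration, since the text does not state how the per-image values are aggregated. The function names and toy vectors are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_similarity(prompt_embeddings, image_embeddings):
    """Average cosine similarity over paired embeddings (assumed aggregation)."""
    sims = [cosine_similarity(p, x) for p, x in zip(prompt_embeddings, image_embeddings)]
    return sum(sims) / len(sims)

# Hypothetical toy embeddings: one ground-truth prompt per image.
prompts = [[1.0, 0.0], [0.0, 2.0]]
images = [[1.0, 1.0], [0.0, 1.0]]
eta = mean_similarity(prompts, images)
print(round(eta, 4))
```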

Table 7

Similarity experiments between RAP and image on OxfordPets dataset

Method	η-Base	Base Acc.	η-New	New Acc.	HM.
CLIP	0.1266	91.17	0.1159	97.26	94.12
CoCoOp	0.1814	95.20	0.1525	97.69	96.43
RA-CLIP	0.3047	95.62	0.2804	97.94	96.77

The results in Table 7 show that the larger η of RA-CLIP goes together with better generalizability. It indicates that RAP has a semantic representation closer to the image features in the vector space, which enhances the alignment ability. Moreover, RA-CLIP also improves base accuracy, which means RA-CLIP can reduce the trade-off between base classes and new classes (i.e., increase the harmonic mean, HM).
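For reference, the HM column of Table 7 is consistent with the harmonic mean of base and new accuracy. The short check below recomputes it from the Base Acc. and New Acc. values reported in the table; the function and variable names are illustrative only.

```python
def harmonic_mean(base_acc, new_acc):
    """Harmonic mean of base-class and new-class accuracy."""
    return 2 * base_acc * new_acc / (base_acc + new_acc)

# (Base Acc., New Acc.) taken from Table 7 (OxfordPets).
table7 = {
    "CLIP": (91.17, 97.26),
    "CoCoOp": (95.20, 97.69),
    "RA-CLIP": (95.62, 97.94),
}

hm = {method: round(harmonic_mean(b, n), 2) for method, (b, n) in table7.items()}
for method, value in hm.items():
    print(method, value)
```

Because the harmonic mean is dominated by the smaller of the two values, a method cannot raise HM by trading new-class accuracy for base-class accuracy.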

4.3 Ablation study

The Necessity of Transformer Decoder Module. To demonstrate the key role of region-attention features in image classification tasks, an Image-Text prompt (IT-prompt) is proposed as a comparison method. Specifically, IT-prompt passes the image features through a linear projection layer and then adds the result point-wise to the embedding(s) of the category to obtain the prompt. In the comparison experiment, RAP is replaced with IT-prompt, and the rest of the model structure and the training parameters remain unchanged.
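A minimal sketch of the IT-prompt baseline as described above: a linear projection of the global image feature, added point-wise to each category token embedding. The helper names, the toy dimensions, and the choice to add the same projected vector to every token are assumptions for illustration; the text does not give the exact shapes.

```python
def linear(x, weight, bias):
    """Compute y = W x + b for a single feature vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def it_prompt(image_feature, class_token_embeddings, weight, bias):
    """Project the image feature, then add it to every class token embedding."""
    projected = linear(image_feature, weight, bias)
    return [[t + p for t, p in zip(token, projected)] for token in class_token_embeddings]

# Hypothetical toy sizes: 3-dim image feature, 2-dim text embedding space.
weight = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0]]
bias = [0.5, -0.5]
image_feature = [0.2, 0.4, 0.6]
class_tokens = [[1.0, 2.0], [3.0, 4.0]]  # two tokens of one category name

prompt = it_prompt(image_feature, class_tokens, weight, bias)
print(prompt)
```

Unlike RAP, this baseline uses one global image vector and has no notion of regions, which is what the ablation isolates.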

Comparison experiments over 7 datasets are shown in Fig. 6. The results show that RA-CLIP outperforms CLIP with IT-prompt by approximately 1.7% on both base classes and new classes, suggesting that the key to enhancing alignment capability is constructing the prompt from region features.

Fig. 6

RAP vs. IT-prompt.

The Effectiveness of Projector Module. In this experiment, we removed the Projector Module from SINet while keeping the rest of the model structure and the training parameters unchanged (see Fig. 7). Classification performance degraded on both the base classes and the new classes. This deterioration can be attributed to the incomplete representation and reduced parameterization capability.

The Projector Module plays a crucial role in extracting higher-level abstract features through additional nonlinear transformations and mappings. Without the Projector Module, the model can no longer fully leverage the output of the Transformer Decoder Module, which leads to incomplete feature representations and consequently lower performance.

Fig. 7

SINet vs. SINet without Projector Module on Caltech101 and StanfordCars datasets.

4.4 Parameter analysis

The Length of RAP. To determine whether the length of RAP affects generalizability, we vary the length (4, 8, and 16 tokens) while keeping all other parameters identical. Fig. 8 summarizes the average results over the 7 datasets and shows that RAP with 8 tokens performs better than the other lengths on both base classes and new classes.

Fig. 8

Study on the length of RAP over 7 datasets.

This indicates that a suitable RAP length (8 tokens in this work, chosen empirically) captures the most precise region features (semantic representation), which maximizes the model’s alignment ability. In contrast, a longer RAP may suffer from information redundancy and ambiguous referents, while a shorter RAP may miss information about important regions; in both cases the representation ability of the prompt is inadequate.

The Scale of SINet. To find out whether the scale of SINet affects the image classification performance of RA-CLIP, we conduct experiments over all 7 datasets with three scales of SINet (RA-CLIP-Small, RA-CLIP-Standard, RA-CLIP-Large). In addition, CoCoOp is introduced as a baseline to verify whether a smaller RA-CLIP can still achieve better results than CoCoOp.

Table 8 shows that RA-CLIP gains a slight performance improvement as the scale increases, but the rate of improvement slows down noticeably. A larger model also carries the risk of increased computational cost and overfitting. Moreover, RA-CLIP-Small still achieves better performance than CoCoOp on both the new classes and the base classes. This indicates that the region features adopted in RAP, rather than the number of trainable parameters in SINet, are the key to guaranteeing the alignment ability.

Table 8

Average results of scale experiments on 7 datasets

Scale and Baseline	Params	Params (% CLIP)	Base Acc.	New Acc.	HM.
Zero-shot CLIP	/	/	68.34	73.94	70.90
CoOp	2048	0.002	81.46	63.72	70.46
CoCoOp	35360	0.03	78.90	70.88	74.15
RA-CLIP-Small	0.92M	0.74	79.74	73.85	76.73
RA-CLIP-Standard	2.75M	2.21	81.02	74.58	77.67
RA-CLIP-Large	8.25M	6.62	82.03	75.21	78.47

Training Epochs. To explore the effect of the number of training epochs on performance, and whether fewer epochs can still yield desirable results, we train for 5, 10, 20, and 30 epochs. Fig. 9 shows that RA-CLIP continues to gain classification performance on both base and new classes as the number of training epochs increases. However, the growth on new classes becomes progressively slower, which may be because too many training epochs cause RA-CLIP to overfit the base classes. Meanwhile, with fewer training epochs RA-CLIP still outperforms CoCoOp, which suggests that RA-CLIP is an efficient prompt-tuning method that reaches high performance while converging quickly during training.

Fig. 9

Training epoch experiments.

4.5 Is a manually-designed template important in RA-CLIP?

To understand whether templated category descriptions (as used in CLIP) affect the performance of RA-CLIP, an experiment is conducted over 7 datasets. In the experiment, a template expands each class name into a sentence that serves as the input of SINet. Following CLIP [5], we use "a photo of a CLASS" as the template and compare it with the non-templated RA-CLIP. The detailed results are shown in Fig. 10.
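The two text inputs being compared can be sketched as below. Only the template string comes from the text; the function name and the example class names are hypothetical, and tokenization and embedding are left out.

```python
TEMPLATE = "a photo of a {}"  # template from CLIP [5]

def build_text_inputs(class_names, use_template):
    """Return the per-class text fed to the text side of the model."""
    if use_template:
        return [TEMPLATE.format(name) for name in class_names]
    # Non-templated variant: the bare class name is used as-is.
    return list(class_names)

classes = ["Beagle", "Boxer"]
print(build_text_inputs(classes, use_template=True))
print(build_text_inputs(classes, use_template=False))
```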

Fig. 10

Template experiments over 7 datasets.

Fig. 10 shows that the templated RA-CLIP regresses on both base classes and new classes (drops of 1.26% and 0.60%, respectively, compared to the original RAP). This illustrates that the original RA-CLIP already contains enough concept-level semantic representation to guide the image in extracting important region features. In contrast, using additional templates for classes may complicate the text semantics and thus be detrimental to the generation of RAP.

4.6 Computational cost analysis

To analyze the computational resource cost and the scalability of RA-CLIP, we conduct experiments with 7 datasets (see Table 9) on each of three computing platforms (Nvidia 3090, Nvidia 3080Ti, Nvidia Tesla T4). We also run CoCoOp on the Nvidia Tesla T4 platform as a baseline to explore the trade-off between performance improvement and computational cost.

Table 9
Computational cost analysis over 7 datasets

(a) Nvidia 3090
Datasets	Training Time(mins)	Inference Time(mins)	GPU Memory Consumption(GB)	Memory Consumption(GB)
Caltech101	20.2	4.1	5.8	6.6
OxfordPets	6.3	1.0	5.2	5.3
StanfordCars	68.7	8.2	8.9	7.3
Food101	30.9	12.0	8.3	4.1
FGVCAircraft	14.5	1.8	6.0	5.0
EuroSAT	10.1	2.3	5.5	4.7
DTD	11.3	1.9	6.3	4.6
(b) Nvidia 3080Ti
Datasets	Training Time(mins)	Inference Time(mins)	GPU Memory Consumption(GB)	Memory Consumption(GB)
Caltech101	24.0	4.1	5.9	6.7
OxfordPets	7.3	1.2	5.2	5.2
StanfordCars	73.1	10.1	8.9	7.3
Food101	31.5	13.8	8.5	4.4
FGVCAircraft	14.6	1.9	6.2	5.2
EuroSAT	12.2	3.0	5.4	4.7
DTD	12.5	2.1	6.6	4.9
(c) Nvidia Tesla T4, where values in parentheses represent the additional time/GPU memory spent in RA-CLIP compared to CoCoOp
Datasets	Training Time(mins)	Inference Time(mins)	GPU Memory Consumption(GB)	Memory Consumption(GB)
Caltech101	51.0 (+7.3)	8.5 (+1.1)	6.0 (+2.1)	6.8
OxfordPets	17.0 (+3.3)	3.3 (+0.9)	5.7 (+3.6)	6.1
StanfordCars	156.0 (+17.5)	18.5 (+2.5)	9.3 (+1.1)	7.9
Food101	88.0 (+12.1)	36.0 (+5.1)	8.8 (+4.1)	5.4
FGVCAircraft	52.0 (+6.1)	5.0 (+0.4)	6.5 (+2.9)	6.3
EuroSAT	23.1 (+3.2)	5.1 (+0.5)	5.8 (+2.3)	6.0
DTD	20.0 (+2.4)	3.3 (+0.2)	6.6 (+2.2)	5.6

Table 9 shows that on the Nvidia Tesla T4 platform, RA-CLIP needs more training time, inference time, and GPU memory than CoCoOp: depending on the dataset, training takes 2.4 to 17.5 minutes longer, inference takes 0.2 to 5.1 minutes longer, and GPU memory consumption is 1.1 to 4.1 GB higher. This is likely due to the increased complexity and parameter count of RA-CLIP. Given the enhanced performance of RA-CLIP, we consider this increase acceptable. In addition, RA-CLIP scales across a diverse range of datasets: it delivers stable and reliable performance from datasets requiring fewer computational resources, such as OxfordPets, to those requiring more, such as StanfordCars. Furthermore, RA-CLIP’s memory usage and GPU memory consumption remain relatively consistent across platforms, which indicates a stable balance between computational resource expenditure and performance outcomes.
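To put the absolute differences in Table 9(c) into relative terms, the sketch below derives the implied CoCoOp training time (the RA-CLIP value minus the parenthesized extra) and the resulting relative overhead. The numbers are copied from Table 9(c); the function and variable names are illustrative.

```python
# Training time on Nvidia Tesla T4 from Table 9(c):
# dataset -> (RA-CLIP minutes, extra minutes over CoCoOp)
T4_TRAINING = {
    "Caltech101": (51.0, 7.3),
    "OxfordPets": (17.0, 3.3),
    "StanfordCars": (156.0, 17.5),
    "Food101": (88.0, 12.1),
    "FGVCAircraft": (52.0, 6.1),
    "EuroSAT": (23.1, 3.2),
    "DTD": (20.0, 2.4),
}

def relative_overhead(ra_clip_value, extra):
    """Extra cost of RA-CLIP as a fraction of the implied CoCoOp value."""
    cocoop_value = ra_clip_value - extra
    return extra / cocoop_value

overheads = {name: relative_overhead(*row) for name, row in T4_TRAINING.items()}
for name, frac in overheads.items():
    print(f"{name}: +{100 * frac:.1f}% training time")
```

On these figures the relative training-time overhead ranges from roughly 13% (StanfordCars) to 24% (OxfordPets). The same function applies to the inference-time and GPU-memory columns.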

4.7 Limitation

The first limitation pertains to training efficiency. RA-CLIP is notably slow to train and consumes a large amount of GPU memory. Our runtime analysis indicates that a batch size above 4 becomes untenable on standard consumer-grade GPUs. This is primarily attributed to the proposed SINet, which requires an independent forward and backward propagation for each image in the minibatch during training in order to construct an instance-level Region-Attention Prompt. This is markedly less efficient than both CLIP and CoOp, which require only a single forward and backward propagation through the text encoder for the entire minibatch, irrespective of its size.
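The scaling argument can be made concrete with a small counting sketch. It assumes one prompt per class and, for the instance-conditioned case, a separate set of class prompts per image; these counts and the function name are illustrative assumptions, not measurements from the paper.

```python
def prompts_encoded_per_step(batch_size, num_classes, instance_conditioned):
    """Number of prompts passed through the text encoder in one training step.

    Static prompts (CLIP, CoOp) are shared by the whole minibatch, whereas
    instance-conditioned prompts are rebuilt for every image (assumption).
    """
    if instance_conditioned:
        return batch_size * num_classes
    return num_classes

# Hypothetical setting: 100 classes, batch sizes 1 and 4.
for batch_size in (1, 4):
    static = prompts_encoded_per_step(batch_size, 100, instance_conditioned=False)
    conditioned = prompts_encoded_per_step(batch_size, 100, instance_conditioned=True)
    print(batch_size, static, conditioned)
```

Under these assumptions the text-encoder workload grows linearly with the batch size for instance-conditioned prompts and stays constant for static ones.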

The second limitation emerges in the detailed performance analysis. Although RA-CLIP achieves the best overall performance across all 7 datasets, it does not consistently yield the best results on either the base or the new classes. Specifically, it failed to provide the top performance for base category recognition on 3 datasets and for new category recognition on 2 datasets. This underscores the need for a substantial community effort to fully bridge, or even reverse, the gap between manually-designed prompts and learning-based prompts.

4.8 Conclusion

An instance-level and category-sensitive prompt called Region-Attention Prompt (RAP) is proposed in this paper to model fine-grained region-text joint embedding learning. RAP integrates region features into prompting, where the region features are obtained via a Cross-Attention mechanism between image and text data. In essence, RAP is a region-level prompt with category-sensitive attributes that adaptively assigns a greater attention weight to image regions with higher semantic relevance to the category. A Semantic Interaction Network (SINet), which consists of a Transformer Decoder Module and a Projector Module, is proposed to generate RAP. CLIP is then equipped with RAP (called RA-CLIP) to improve image classification performance in generalization scenarios. Extensive experiments across seven datasets show that RA-CLIP surpasses the current state-of-the-art method, CoCoOp, by a margin of 0.4% - 4.16% on base classes and 0.25% - 11.34% on new classes. Furthermore, our findings reveal that by focusing on category-related regions to construct prompts, as opposed to uniformly focusing on the entire image, the model’s alignment ability can be further improved, demonstrating the effectiveness of RAP in boosting the image classification performance of VLMs.

References

1. Fürst, Rumetshofer, Tran, Ramsauer, Tang, Lehner, Kreil, Kopp, Klambauer, Bitto-Nemling, et al., Cloob: Modern hopfield networks with infoloob outperform clip, arXiv preprint arXiv:2110.11316, 2021.
2. Jia, Yang, Xia, Chen Y.-T., Parekh, Pham, Le, Sung Y.-H., Li and Duerig, Scaling up visual and vision-language representation learning with noisy text supervision, in International Conference on Machine Learning, PMLR, 2021, pp. 4904–4916.
3. Li, Liang, Zhao, Cui, Ouyang, Shao, Yu and Yan, Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm, arXiv preprint arXiv:2110.05208, 2021.
4. Bansal, Singhi, Yang, Yin, Grover and Chang K.-W., Cleanclip: Mitigating data poisoning attacks in multimodal contrastive learning, arXiv preprint arXiv:2303.03323, 2023.
5. Radford, Kim J.W., Hallacy, Ramesh, Goh, Agarwal, Sastry, Askell, Mishkin, Clark, et al., Learning transferable visual models from natural language supervision, in International Conference on Machine Learning, PMLR, 2021, pp. 8748–8763.
6. Zhao W.X., Zhou, Li, Tang, Wang, Hou, Min, Zhang, Dong, et al., A survey of large language models, arXiv preprint arXiv:2303.18223, 2023.
7. Qiao, Ou, Zhang, Chen, Yao, Deng, Tan, Huang and Chen, Reasoning with language model prompting: A survey, arXiv preprint arXiv:2212.09597, 2022.
8. Zhou, Yang, Loy C.C. and Liu, Learning to prompt for vision-language models, International Journal of Computer Vision 130(9) (2022), 2337–2348.
9. Gao, Fisch and Chen, Making pre-trained language models better few-shot learners, arXiv preprint arXiv:2012.15723, 2020.
10. Jiang, Xu F.F., Araki and Neubig, How can we know what language models know? Transactions of the Association for Computational Linguistics 8 (2020), 423–438.
11. Lester, Al-Rfou and Constant, The power of scale for parameter-efficient prompt tuning, arXiv preprint arXiv:2104.08691, 2021.
12. Li X.L. and Liang, Prefix-tuning: Optimizing continuous prompts for generation, arXiv preprint arXiv:2101.00190, 2021.
13. Zhong, Friedman and Chen, Factual probing is [mask]: Learning vs. learning to recall, arXiv preprint arXiv:2104.05240, 2021.
14. Zhou, Yang, Loy C.C. and Liu, Conditional prompt learning for vision-language models, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16816–16825.
15. Zhong, Yang, Zhang, Li, Codella, Li L.H., Zhou, Dai, Yuan, Li, et al., Regionclip: Region-based language-image pretraining, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16793–16803.
16. Khattak M.U., Rasheed, Maaz, Khan and Khan F.S., Maple: Multi-modal prompt learning, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19113–19122.
17. Liu, He, Wang, Chen, Zhang, Yang, Li, Yu, et al., Internchat: Solving vision-centric tasks by interacting with chatbots beyond language, arXiv preprint arXiv:2305.05662, 2023.
18. Zhu, Chen, Shen, Li and Elhoseiny, Minigpt-4: Enhancing vision-language understanding with advanced large language models, arXiv preprint arXiv:2304.10592, 2023.
19. Liu, Li, Wu and Lee Y.J., Visual instruction tuning, arXiv preprint arXiv:2304.08485, 2023.
20. Li, Dai, Han and Ding, Mseg3d: Multi-modal 3d semantic segmentation for autonomous driving, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 21694–21704.
21. Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez A.N., Kaiser Ł. and Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).
22. Zhang, Jiang, Miura, Manning C.D. and Langlotz C.P., Contrastive learning of medical visual representations from paired images and text, in Machine Learning for Healthcare Conference, PMLR, 2022, pp. 2–25.
23. Gao, Geng, Zhang, Ma, Fang, Zhang, Li and Qiao, Clip-adapter: Better vision-language models with feature adapters, arXiv preprint arXiv:2110.04544, 2021.
24. Zhang, Fang, Zhang, Gao, Li, Dai, Qiao and Li, Tip-adapter: Training-free clip-adapter for better vision-language modeling, arXiv preprint arXiv:2111.03930, 2021.
25. Wang, Xing and Liu, Actionclip: A new paradigm for video action recognition, arXiv preprint arXiv:2109.08472, 2021.
26. Liu, Yuan, Fu, Jiang, Hayashi and Neubig, Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing, ACM Computing Surveys 55(9) (2023), 1–35.
27. Schick and Schütze, Exploiting cloze questions for few shot text classification and natural language inference, arXiv preprint arXiv:2001.07676, 2020.
28. Petroni, Rocktäschel, Lewis, Bakhtin, Wu, Miller A.H. and Riedel, Language models as knowledge bases? arXiv preprint arXiv:1909.01066, 2019.
29. Shin, Razeghi, Logan IV R.L., Wallace and Singh, Autoprompt: Eliciting knowledge from language models with automatically generated prompts, arXiv preprint arXiv:2010.15980, 2020.
30. Feng, Zhong, Jie, Chu, Ren, Wei, Xie and Ma, Autoprompt: Eliciting knowledge from language models with automatically generated prompts, arXiv preprint arXiv:2010.15980, 2020.
31. Du, Wei, Zhang, Shi, Gao and Li, Learning to prompt for open-vocabulary object detection with vision-language model, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 14084–14093.
32. Xu, Zhang, Wei, Lin, Cao, Hu and Bai, A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model, in Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, Proceedings, Part XXIX, Springer, 2022, pp. 736–753.
33. Wang, Zhang, Lee C.-Y., Zhang, Sun, Ren, Su, Perot, Dy and Pfister, Learning to prompt for continual learning, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 139–149.
34. Sun, Hu and Saenko, Dualcoop: Fast adaptation to multi-label recognition with limited annotations, arXiv preprint arXiv:2206.09541, 2022.
35. Wang, Zheng V.W., Yu and Miao, A survey of zero-shot learning: Settings, methods, and applications, ACM Transactions on Intelligent Systems and Technology (TIST) 10(2) (2019), 1–37.
36. Xian, Schiele and Akata, Zero-shot learning-the good, the bad and the ugly, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4582–4591.
37. Yi, Shen, Gou and Elhoseiny, Exploring hierarchical graph representation for large-scale zero-shot image classification, in Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, Proceedings, Part XX, Springer, 2022, pp. 116–132.
38. He, Zhang, Ren and Sun, Deep residual learning for image recognition, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
39. Dosovitskiy, Beyer, Kolesnikov, Weissenborn, Zhai, Unterthiner, Dehghani, Minderer, Heigold, Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929, 2020.
40. Rao, Zhao, Chen, Tang, Zhu, Huang, Zhou and Lu, Denseclip: Language-guided dense prediction with context-aware prompting, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 18082–18091.

Table 1

Region-category similarity experiments

Category	Region *	Region 1	Region 2	Region 3	Region 4	Full
“cat”	0.942	0.132	0.913	0.194	0.909	0.902
“dog”	0.005	0.215	0.034	0.319	0.004	0.003
“lion”	0.053	0.652	0.053	0.487	0.087	0.095

Table 3

Training Settings

Method	Batch Size	Optimizer	Learning Scheduler	Training Epoch	Warmup Epoch	Number of Trials
CoOp	32	SGD	Cosine	50	1	3
CoCoOp	1	SGD	Cosine	10	1	3
RA-CLIP(Ours)	1	SGD	Cosine	10	1	3