Abstract
As a prototype-based few-shot learning method, the Prototypical Network generates a prototype for each class in a low-resource setting and classifies queries with a metric module. The quality of the prototypes therefore matters, yet prototypes estimated from only a few support instances are inaccurate, and domain-specific information in the training data harms their generalizability. We propose a
Introduction
Deep learning models rely heavily on large amounts of high-quality labeled training data, which are time-consuming to collect and costly to annotate. Few-Shot Learning (FSL) seeks to achieve superior performance with a limited amount of labeled data, bridging the sample-efficiency gap between deep learning models and their practical application in fields such as Computer Vision, Reinforcement Learning and Speech Recognition [1].
Prototype-based few-shot methods, like Prototypical Network [2] and Induction Network [3], generate prototypes for each class and perform classification based on the distances between queried examples and class prototypes. These methods have found wide applications in scenarios where acquiring instances is costly, such as Medical Image Analysis [4]. However, the learning of prototypes in these methods is dominated by the support set, which is limited and sparse in the feature space. The presence of noise and disturbance in the support set can lead to deviations in the prototypes [5, 6], making it challenging to learn robust prototypes solely from the few support instances.
Previous works [7] ignore the interaction between the query set and the support set, which is informative for generating prototypes. Because the query set, with its larger and more diverse pool of instances, better represents the distribution of the entire dataset, it is worthwhile to leverage the abundant query instances. By using these query instances to guide the distribution of the support set so that it aligns closely with that of the query set, we can create more robust and representative prototypes that better capture the dataset’s characteristics. Specifically, we introduce a query-detached set, which is generated from the query set with its labels removed. This detachment ensures that the labels of query instances remain hidden during training, which prevents label leakage. As a result, the obtained Instance Prototype becomes more typical at the instance level, thereby reducing the influence of noise and bias in the support set.
Another challenge in prototype-based methods is that classification accuracy deteriorates when transferring to new domains (categories) during the testing phase. This occurs because the categories present in the training data differ from those encountered in the testing data, resulting in inevitable inter-domain dissimilarities. For example, in a sentiment analysis task, a model trained on “cosmetics” and “mother & baby products” may extract domain-specific features that impair its classification of test sets such as “appliances”. Therefore, it is necessary to create prototypes that focus solely on sentiment concepts and disregard domain-specific features.
The visual-language model CLIP [8–10] and the zero-shot detection method ZSD [11–13] have made significant advances in few-shot and zero-shot tasks by incorporating label semantics as concepts and knowledge. Inspired by these approaches, we propose a novel method that leverages concepts as prototypes through prompt engineering. By extracting the textual semantics of classes from their labels (e.g., ‘Positive’ and ‘Negative’ for sentiment classes), we create a Concept Prototype that is immune to domain-specific features of the training data.
We synthesize the Instance Prototype and the Concept Prototype into a novel representation called the Conceptual Prototype (CP). Our contributions are as follows:
(1) We introduce a novel Instance Prototype, which models the interaction between the support and query sets with an interactive network. This helps to reconstruct the distribution of the support set, resulting in a more robust and typical prototype that represents the entire dataset.
(2) We propose the concept as a general prototype, which enhances the semantics of classes to mitigate the impact of domain-specific features when testing on different domains. This enables our model to handle few-shot and zero-shot tasks more robustly, as it better captures the core concepts and semantics associated with each class.
(3) We combine the Instance Prototype and the Concept Prototype into the Conceptual Prototype (CP), which integrates both individualized and general features. This comprehensive prototype is used in the classification process through Prototypical Contrastive Learning.
Related work
Few-shot learning
Most existing few-shot learning models follow the meta-learning paradigm, employing an episodic training setup in which the meta-learner iterates through episodes in the meta-training phase. In each episode, a task is drawn from the base classes $C_{base}$, and a limited amount of support and query data from that task is made available. The meta-learner then learns a task-specific classifier from the support data, and this classifier makes predictions on the query data. Updates to the meta-learner are computed from the performance of the classifier on the query set. Evaluation of the meta-learner is also carried out in episodes in a similar fashion, except that the meta-learner is no longer updated and the performance on query data is aggregated across multiple episodes. This training setup is known as “N-way K-shot”, where each episode involves N classes and each class contains K support samples.
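The episodic setup described above can be sketched as follows. This is a minimal illustration, not the authors' data pipeline: the function name `sample_episode`, the dictionary layout of the data and the number of queries per class are assumptions made for the example.

```python
import random

def sample_episode(data_by_class, n_way, k_shot, n_query, seed=None):
    """Draw one N-way K-shot episode from a dict {class_label: [samples]}."""
    rng = random.Random(seed)
    classes = rng.sample(sorted(data_by_class), n_way)   # pick N classes
    support, query = [], []
    for label in classes:
        picked = rng.sample(data_by_class[label], k_shot + n_query)
        support += [(x, label) for x in picked[:k_shot]]  # K support samples
        query += [(x, label) for x in picked[k_shot:]]    # held-out queries
    return support, query

# Toy base classes: three labels with ten placeholder samples each.
base = {c: [f"{c}_{i}" for i in range(10)] for c in ("a", "b", "c")}
support, query = sample_episode(base, n_way=2, k_shot=5, n_query=3, seed=0)
```

Support and query samples of a class are drawn together and then split, so no sample appears in both sets of the same episode.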
Metric learning is a kind of meta-learning method that learns a distance metric over samples, ensuring that similar samples are positioned closely together in the metric space. The Siamese Neural Network [14] calculates the distance between two representations to measure their similarity. The Prototypical Network [2] creates a prototype representation $c_k$ for each class by averaging: the model calculates the mean of the embedded support samples belonging to that class,
$$c_k = \frac{1}{|S_k|} \sum_{(x_i, y_i) \in S_k} f_\Phi(x_i),$$
where $S_k$ denotes the support samples of class $k$ and $f_\Phi(x)$ represents the feature vector of sample $x$. A query sample is classified by computing its distance to each prototype.
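The averaging and nearest-prototype steps of the Prototypical Network can be sketched in a few lines. The feature vectors below stand in for encoder outputs $f_\Phi(x)$; the helper names and the toy two-dimensional data are assumptions made for illustration.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equally sized feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def class_prototypes(support):
    """support: list of (feature_vector, label); returns {label: prototype}."""
    by_class = {}
    for vec, label in support:
        by_class.setdefault(label, []).append(vec)
    return {label: mean_vector(vecs) for label, vecs in by_class.items()}

def classify(query_vec, prototypes):
    """Assign the label of the nearest prototype (Euclidean distance)."""
    return min(prototypes, key=lambda c: math.dist(query_vec, prototypes[c]))

# Toy 2-way 2-shot support set with hand-written feature vectors.
support = [([0.0, 0.0], "neg"), ([0.0, 2.0], "neg"),
           ([4.0, 4.0], "pos"), ([6.0, 4.0], "pos")]
protos = class_prototypes(support)
prediction = classify([1.0, 1.0], protos)
```

Because every support vector contributes equally to the mean, a single outlying instance shifts the prototype directly, which is the weakness the Instance Prototype is meant to address.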
Based on the Prototypical Network, several advancements have been made. The Relation Network [15] replaces the traditional distance metric with neural networks. The Induction Network [3] improves the averaging method and introduces dynamic routing to generate prototypes. In the case of ProtoPShare [16], ProtoPool [17], and Deformable ProtoPNet [18], all similarity scores are processed through a fully connected layer to produce the final classification. ProtoTree [19], on the other hand, uses these similarity scores to compute a path across a soft decision tree, where each leaf corresponds to one specific class. In another approach, [20] utilizes word distribution to improve the performance.
Language models (LMs) at scale, such as BERT [21] and GPT [22], have become cornerstones in the field of Natural Language Processing (NLP) due to their extensive knowledge. These models are already central to several products with millions of users, such as the coding assistant Copilot [23] and, more recently, ChatGPT [24]. In recent years, prompting LMs has become a dominant paradigm [22], which mines the knowledge of LMs to achieve superior performance in various low-resource downstream tasks such as question answering, natural language understanding, and text classification. [25] transforms the input into fill-in-the-blank phrases to help the language model understand a given task. [26] introduces chain-of-thought (CoT) prompting, where the prompt comprises examples of a task, with inputs followed by intermediate reasoning steps leading to the final output.
Prompt learning has also been applied successfully in Computer Vision. ZSD [11, 12] leverages the semantic information carried by image labels as an intermediate representation to transfer knowledge learned from known classes to unknown classes, enabling zero-shot classification. CLIP [8] trains a vision model with natural-language supervision and, in a zero-shot manner, achieves performance close to or even surpassing supervised methods on more than 20 tasks. [13] performs object-centric alignment of the language embeddings obtained from CLIP to localize objects more precisely.
Prototypical contrastive learning
Contrastive learning aims to learn meaningful representations in a self-supervised setting. The goal of Instance-wise Contrastive Learning (ICL) is to bring the embeddings of different views of the same instance closer to each other while pushing the embeddings of views from different instances further apart, using an instance-level contrastive loss. This is commonly achieved with a large batch size that allows positive and negative pairs to accumulate within the same batch [27, 28], or with a momentum encoder that updates negative instances from a large and consistent dictionary in real time [29].
However, ICL has a weakness: the representation is not encouraged to encode the semantic structure of the data. In ICL, different instances are treated as negative pairs regardless of whether they share similar semantics. As a consequence, negative instance pairs that actually have similar semantics are undesirably pushed apart in the feature space, preventing the model from learning high-level semantic information.
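The weakness of ICL can be illustrated with a small sketch of the InfoNCE loss for a single anchor. The dot-product similarity, the temperature value and the toy vectors are assumptions made for the example. When a "negative" has exactly the same embedding as the positive, the loss cannot fall below log 2, so the objective keeps pushing semantically identical instances apart.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor with one positive and several negatives."""
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]

a = [1.0, 0.0]
# The "negative" shares the anchor's semantics: the loss stays at log(2).
same_meaning = info_nce(a, a, [a])
# A genuinely different negative gives a loss close to zero.
different = info_nce(a, a, [[-1.0, 0.0]])
```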
Prototypical Contrastive Learning (PCL) is more concerned with the learning of semantic information. In PCL [30] the traditional InfoNCE loss is replaced with ProtoNCE loss, which encourages representations to be closer to their assigned prototypes and farther from other prototypes.
SCCL [31] jointly optimizes a clustering loss and a contrastive loss, which makes the resulting clusters easier to separate. SPCL [32] generates prototypes offline with k-means clustering and explicitly regularizes samples toward their corresponding prototypes. [33] proposes a label obfuscation strategy for dynamically refining the constraint relations between prototypes in the semantic space, improving the accuracy of the mapping between the feature space and the semantic space.
Our method is built upon the Prototypical Contrastive Learning framework, where we introduce a novel Conceptual Prototype (CP) to replace the traditional prototype in the contrastive loss, as illustrated in Fig. 1 for the “2-way 2-shot” case.

Fig. 1. CPCL model (in the case of 2-way 2-shot).
A training set $T$ consists of a support set $S_t$ with a label set $C_t$, a query set $Q_t$, and the query-detached set $Q_{detached}$.
The vanilla Prototypical Network [2] generates prototypes by assigning equal weight to each support instance, which may lead to inaccuracies. This is especially problematic because the support set contains only a handful of instances, some of which may be affected by noise. For example, in Fig. 2(a), the class prototype p′ computed from the support set $S_t$ is biased away from the real class center p of the entire dataset because of the deviated support instances s1 and s2. As a consequence, the generated prototypes may not be optimal, which compromises classification accuracy. Moreover, the prototype itself is not well suited to classifying instances in the query set, largely because the prediction depends on a one-way information flow from the support set to the query set. The prototype therefore suffers from representation bias, which can have a negative impact on the performance of downstream tasks.

Fig. 2. (a) The case of deviated support instances: the interactive network moves the prototype p′ closer to p. Blue lines indicate instances with less importance, and red lines indicate instances with greater importance. (b) The process of reducing intra-class variance: the red lines illustrate how the support instances are drawn closer together.
Statistically, the many query instances in the query-detached set $Q_{detached}$ can better represent the whole dataset in terms of center and variance. We develop a novel interactive network $f(S_t, Q_{detached})$ that improves the support set $S_t$ by leveraging the information present in $Q_{detached}$, which ultimately leads to a new and improved support set, denoted by $S_t'$.
We argue that not all instances in the support set are equally important with respect to the query set. In Fig. 2(a), s3 is closer to the real center p and should therefore be given a higher weight when generating the prototype. Conversely, s1 and s2 are much further away from p and should be assigned lower weights to minimize their influence on the final prototype. The interactive network assigns a weight $a_i$ to each support sample.
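The re-weighting idea can be sketched as follows. The exact weighting formula is not recoverable from this text, so the code makes an assumption: each weight is a softmax over the dot-product similarity between a support vector and the mean of the unlabeled query-detached vectors. All names and the toy data are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_prototype(support_vecs, query_detached_vecs):
    """Weight each support vector by its similarity to the unlabeled query mean."""
    center = mean_vector(query_detached_vecs)   # label-free summary of the queries
    weights = softmax([dot(s, center) for s in support_vecs])  # assumed form of a_i
    proto = [sum(w * s[d] for w, s in zip(weights, support_vecs))
             for d in range(len(center))]
    return weights, proto

# s1 and s2 deviate from the query distribution; s3 is typical.
support = [[1.0, -1.0], [-1.0, 1.0], [1.0, 1.0]]
queries = [[1.0, 1.2], [1.2, 1.0], [0.8, 0.8]]
weights, proto = weighted_prototype(support, queries)
```

In this toy case the typical instance s3 receives the largest weight, and the weighted prototype lies closer to the query center than the plain mean of the support set does. No query labels are used at any point.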
From the perspective of variance, Instance loss
Where the dot product is used to measure the similarity between
Among all the support instances in $S_t$, the distribution of instances with the same label $c$, namely

Fig. 3. Calculation process of Instance loss.
Here $c' \in C \setminus \{c\}$ means that $c'$ is not of the same class as $c$, and $f(\cdot, \cdot)$ represents the Euclidean distance between two distributions.
By effectively utilizing query instances, the Instance Prototype transfers well from the support set to the query set during the training phase. Nevertheless, inter-domain differences between the training set and the testing set are inevitable. When tested on new domains, a model trained exclusively on the training domains, without information about the test domains, may generate prototypes that contain domain features specific to the training data. This can bias and interfere with prediction. For example, in sentiment analysis, product-category features from the training set could have an unexpected impact on the model’s predictions on the testing set, leading to inaccurate results.
CLIP obtains strong generalizability through prompt learning based on class labels. Similarly, ZSD leverages labels as knowledge for classifying new classes. Inspired by these works, we integrate class concepts into prototypes through prompt engineering, which further enhances generalizability. In sentiment analysis, the labels ‘Positive’ and ‘Negative’ are used as concepts in a ‘2-way’ setup. By capturing semantic knowledge directly from the labels, we propose the Concept Prototype, which discriminates between different sentiments more easily and is immune to the negative effects of transferring to new domains.
Three kinds of Concept Prototype strategies are proposed to leverage labels effectively: (1) class labels as prototype, (2) prompt template as prototype, and (3) prompt engineering for prototype.
Feeding the class labels “Positive” and “Negative” into the text encoder yields feature vectors that serve directly as the Concept Prototype. This strategy is in line with ZSD.
Relying solely on labels may not capture all the nuances and complexities of the task. To address this, we employ descriptive natural-language templates as prompts, which capture more detailed semantics.
By utilizing a prompt template like “This is a {positive} comment.”, the model becomes capable of understanding that the task is related to sentiment analysis, which makes the prototype more interpretable.
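Filling such a template once per class label is straightforward. The sketch below is illustrative only: the placeholder name `{label}` and the function `build_prompts` are assumptions, and the resulting strings would then be passed to the text encoder.

```python
TEMPLATE = "This is a {label} comment."

def build_prompts(labels, template=TEMPLATE):
    """Fill the template once per class label."""
    return {label: template.format(label=label.lower()) for label in labels}

prompts = build_prompts(["Positive", "Negative"])
```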
It is unclear which manually written templates are most effective at representing semantics. CLIP integrates 80 templates; we likewise manually compose a set of unique templates to describe the task for the dataset [36]. CLIP treats every template equally, whereas we assume that the templates should be customized for different scenarios. We adopt a self-attention method to focus on the more important templates, as shown in Fig. 4.

Fig. 4. Prompt Engineering for Concept Prototype.
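Aggregating several template embeddings with attention weights can be sketched as below. This is a simplified stand-in for learned self-attention, under stated assumptions: there are no learned query, key or value projections, each template is scored by its average scaled dot-product similarity to all templates, and the toy vectors replace real encoder outputs.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(template_vecs):
    """Pool template embeddings with weights from scaled dot-product scores."""
    d = len(template_vecs[0])
    n = len(template_vecs)
    scores = []
    for q in template_vecs:
        sims = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                for k in template_vecs]
        scores.append(sum(sims) / n)   # average agreement with all templates
    weights = softmax(scores)
    pooled = [sum(w * v[i] for w, v in zip(weights, template_vecs))
              for i in range(d)]
    return weights, pooled

# Two templates agree; the third is an outlier and should be down-weighted.
template_vecs = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.5]]
weights, concept = attention_pool(template_vecs)
```

Unlike uniform averaging of templates, the outlying template contributes less to the pooled concept vector.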
Given a set of prompt templates P,
Concept Prototype is an abstract representation of class that does not contain a great deal of detail. However, by focusing on the intrinsic characteristics “Positive” and “Negative”, Concept Prototype can point to a wider range of domains, making it a powerful tool for achieving a more comprehensive understanding of sentiments.
In contrast, Instance Prototype is the concrete representation of class that contains diverse details about the instances like “satisfied”, “pleased”, “discontented” and “complaint”. This level of detail makes it an excellent tool for achieving specific analysis of particular problems.
To leverage the strengths of both prototypes, we synthesize the two prototypes into a Conceptual Prototype (CP). CP takes into account both the individual characteristics of the instance and the general attributes of the class simultaneously, enabling a more nuanced and comprehensive understanding.
To enhance the ability to learn and perform classification tasks, an MLP is used to project the abstract Concept Prototype and concrete Instance Prototype into a shared space where they are seamlessly integrated into a CP.
By learning to adjust the importance of the Instance Prototype and the Concept Prototype, CP can handle instances of varying quality. When instances are of low quality, the Concept Prototype plays a more prominent role in guaranteeing the overall performance of the model. Conversely, when instances are of high quality and carry strong discriminative information, the Instance Prototype is given a higher weight in the representation.
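One way to picture the fusion step is an MLP that maps the concatenated prototypes to a single vector. The sketch below is an assumption-laden illustration, not the authors' architecture: concatenation as the input, ReLU activations, random untrained weights and tiny layer sizes are all choices made for the example.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def make_mlp(sizes, seed=0):
    rng = random.Random(seed)
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

def fuse(instance_proto, concept_proto, layers):
    """Project the concatenated prototypes into a shared space."""
    x = instance_proto + concept_proto           # list concatenation
    for i, (weights, bias) in enumerate(layers):
        x = [sum(w * xi for w, xi in zip(row, x)) + b
             for row, b in zip(weights, bias)]
        if i < len(layers) - 1:                  # ReLU on hidden layers only
            x = [max(0.0, v) for v in x]
    return x

# Small sizes keep the demo fast; two hidden layers, one output layer.
dim, hidden = 4, 8
layers = make_mlp([2 * dim, hidden, hidden, dim])
cp = fuse([0.1] * dim, [0.2] * dim, layers)
```

With trained weights, the network could learn how much each input prototype contributes to the fused vector, which is the behavior described in the paragraph above.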
Prototypical contrastive learning
Contrastive learning is adopted as the metric module in our method. Further, Conceptual Prototypical Contrastive Learning (CPCL) is proposed, in which the prototype is replaced by CP.
By enforcing the representations of instances to be more similar to their corresponding prototypes and far away from others, CPCL generates a clear classification boundary. The CPCL contrastive loss is defined as follows:
In contrast to instance-wise contrastive learning, where the InfoNCE loss (Equation (4)) is computed between instance-level features, our CPCL imbues class semantics and instance diversity by means of CP, thereby enabling the model to obtain better and more precise representations. In addition to its strong results in few-shot tasks, CPCL is well suited to inter-domain transfer.
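A prototype-level contrastive loss of the kind described here can be sketched as a softmax cross-entropy over similarities between an instance embedding and each class prototype. This follows the general ProtoNCE pattern and is only an assumed form: the dot-product similarity, the temperature and the toy vectors standing in for CPs are not taken from the paper's own equation.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def prototype_contrastive_loss(embedding, label, prototypes, temperature=0.1):
    """Pull the embedding toward its own prototype and away from the others."""
    logits = {c: dot(embedding, p) / temperature for c, p in prototypes.items()}
    m = max(logits.values())  # subtract the maximum for numerical stability
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return log_sum - logits[label]

# Hypothetical conceptual prototypes for a 2-way task.
cps = {"positive": [1.0, 0.0], "negative": [-1.0, 0.0]}
loss_good = prototype_contrastive_loss([0.9, 0.1], "positive", cps)
loss_bad = prototype_contrastive_loss([-0.9, 0.1], "positive", cps)
```

An embedding near its own prototype yields a loss close to zero, while one near the wrong prototype is penalized heavily. Instances are never contrasted with each other, so same-class instances are not pushed apart.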
The whole loss consists of three parts, as depicted in Fig. 5: (1) Instance loss

Fig. 5. Components of the objective function.
Datasets
Parameters and settings
For the text encoder, we use RoBERTa-base [35], a pre-trained language model known for its effectiveness in various natural language processing tasks. The MLP consists of two hidden layers with 1024 units each and an output layer of dimension 768. For the objective function, we set α, β, and γ to 0.4, 0.4, and 0.2, respectively. All models are trained on the base classes, and the model that performs best on the validation set is selected. Performance is evaluated by accuracy and F1 score, both of which are widely used in few-shot learning [2, 15].
The baseline models are trained in the ‘2-way 5-shot’ setting. In the few-shot experiments, we repeat each experiment 5 times and compare K ∈ {1, 3, 5, 10, 20}. In the zero-shot experiments, we repeat each experiment 3 times.
In addition to the K support instances per class, each batch is filled with an equal number of positive and negative query instances. For example, the ‘2-way 5-shot’ task has 27 positive and 27 negative query instances when the batch size is 64. It is worth noting that our model does not undergo any fine-tuning phase, as its inherent generalization ability allows it to perform well across various tasks and domains.
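The batch composition follows from simple arithmetic: the slots left after the N × K support instances are split evenly across classes. The helper below is a hypothetical illustration of that calculation.

```python
def queries_per_class(batch_size, n_way, k_shot):
    """Split the batch slots left after the support set evenly across classes."""
    remaining = batch_size - n_way * k_shot
    if remaining < 0 or remaining % n_way:
        raise ValueError("batch size does not split evenly across classes")
    return remaining // n_way

# 2-way 5-shot with a batch of 64: 64 - 2 * 5 = 54 query slots, 27 per class.
n_queries = queries_per_class(batch_size=64, n_way=2, k_shot=5)
```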
The baseline models in our experiments are introduced as follows:
Comparison with baselines
In order to verify the role and performance of the Instance Prototype and Concept Prototype, we introduce two separate models: Instance-PCL and Concept-PCL, which only use Instance Prototype and Concept Prototype respectively.
The classification results for the “2-way 5-shot” task on ARSC and SST-2 are shown in Table 1.
Table 1. The “2-way 5-shot” classification results on the ARSC and SST-2 datasets. Concept-PCL denotes the model using only the Concept Prototype with PCL, and Instance-PCL denotes the model using only the Instance Prototype with PCL.
It is worth mentioning that Concept-PCL, without any support instances, achieves accuracy comparable to Induction-R. We attribute this to the fact that the Concept Prototype draws semantics from the labels through prompt engineering. The results indicate that the Concept Prototype can acquire class knowledge and directly guide the classification of the query set.
On the other hand, by fully mining the interactions between support set and query set, Instance-PCL has the ability to focus on the support instances that are most critical, thereby effectively eliminating the impact of inaccurate support instances. The Instance Prototypes generated in Instance-PCL are more robust than those in Induction Networks, with an accuracy 1.82% higher.
The CPCL model has the best performance, achieving an accuracy of 92.41% on the ARSC dataset. By learning from both low-level features of individual instances and high-level semantics, CPCL overcomes the denotational ambiguity of purely labelled text and acquires general concepts that are difficult to obtain from instances.
In our model, the loss function is composed of three parts:
Table 2. The ablation results on the objective function of CPCL for the “2-way 5-shot” task on the ARSC dataset.
To evaluate CPCL further, we conducted experiments with varying numbers of shots K. Figure 6 shows that the models perform better as K increases. This can be attributed to the fact that prototypes built from more instance features become more robust. The few-shot task becomes a 0-shot task when K decreases to 0, where the Instance Prototype is unavailable and the model relies solely on the Concept Prototype to complete classification.

Fig. 6. The “2-way K-shot” classification results of CPCL and Induction-R on the ARSC dataset. Note that Induction-R is not applicable when K is 0, since there are no support instances.
We find that the Concept Prototype serves as a reliable foundation for high accuracy. Although 0-shot learning is meaningful and has strong academic value, it remains somewhat distant from real-world scenarios. In fact, even for people, it is arduous to grasp new knowledge with only limited extra information and no instances. Furthermore, it is challenging to construct a concept that is entirely pertinent to the current task. The instance knowledge provided by the Instance Prototype is therefore essential and contributes substantially to the overall performance of the model.
When K is set to 1, the accuracy drops and is even worse than in the 0-shot case. This is because the Instance Prototype then relies on a single instance and is easily influenced by a low-quality instance, leading to unstable performance. This result aligns with observations from the CLIP experiments, where excellent 0-shot performance is achieved but the model falls short in 1-shot and 2-shot scenarios.
As K grows larger, the performance of CPCL is gradually improved compared to Induction-R. The support data becomes more reliable when K is more than 5, making the model more robust than in the case of 0 or 1 shot. However, as K continues to increase, the improvement of accuracy slows down.
To further analyze the effect of Instance Prototype, we design experiments by setting different batch sizes.
Table 3. The “2-way 5-shot” classification results of CPCL under different batch sizes on the ARSC dataset.
With the support instances fixed, a larger batch contains more query instances, which provide more distribution features. Specifically, when the batch size is set to 128, the Instance Prototype better represents the whole dataset, resulting in a 0.57% increase in accuracy.
We leverage a query-detached set to provide additional distribution features for the support set, which may be viewed as leakage from the query set. To rule this out, we design an experiment in which a query memory bank stores the query instances from the past three batches. The prototype generated from the current batch is then used to classify the instances in the memory bank. The results in Table 4 show that the prototype generated from the current batch does not suffer a decrease in accuracy on the memory bank, indicating that the Instance Prototype acquires spatial features rather than labels.
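The memory-bank check can be sketched with a fixed-length queue. The class name `QueryMemoryBank`, the nearest-prototype rule and the toy vectors are assumptions made for illustration; only the idea of keeping the queries of the past three batches comes from the text.

```python
import math
from collections import deque

def nearest(vec, prototypes):
    """Label of the closest prototype under Euclidean distance."""
    return min(prototypes, key=lambda c: math.dist(vec, prototypes[c]))

class QueryMemoryBank:
    """Keeps the labeled query instances of the most recent batches."""

    def __init__(self, max_batches=3):
        self.batches = deque(maxlen=max_batches)  # oldest batch drops out

    def add(self, batch):
        self.batches.append(list(batch))

    def accuracy(self, prototypes):
        items = [item for batch in self.batches for item in batch]
        correct = sum(nearest(vec, prototypes) == label for vec, label in items)
        return correct / len(items)

bank = QueryMemoryBank(max_batches=3)
for _ in range(4):  # the fourth batch evicts the first
    bank.add([([0.9, 1.1], "pos"), ([-1.2, -0.8], "neg")])

# Prototypes from the current batch are scored on the earlier queries.
current_prototypes = {"pos": [1.0, 1.0], "neg": [-1.0, -1.0]}
bank_accuracy = bank.accuracy(current_prototypes)
```

Because the banked queries were never used to build the current prototypes, comparable accuracy on the bank suggests that the prototypes did not memorize query labels.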
Table 4. The “2-way 5-shot” classification results of the original CPCL and CPCL with a memory bank on the ARSC dataset.
In Table 5, we have also recorded the distance between the prototype and the query set center, as well as the intra-class variance of the support set.
Table 5. The distribution features of the support set for Induction-R and CPCL on the ARSC dataset.
It is evident that the prototype in CPCL is closer to the query set center and deviates less. This indicates that CPCL better captures the distribution of the query data. Moreover, CPCL reduces the intra-class variance, which diminishes the impact of low-quality instances.
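The two statistics discussed here can be computed as in the sketch below. The exact definitions behind Table 5 are not given in this text, so the code assumes the Euclidean distance between the prototype and the query mean, and the mean squared distance of the support vectors to their own mean. The toy vectors are hypothetical.

```python
import math

def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def prototype_to_query_center(prototype, query_vecs):
    """Euclidean distance between a prototype and the mean of the queries."""
    return math.dist(prototype, mean_vector(query_vecs))

def intra_class_variance(vectors):
    """Mean squared Euclidean distance of the vectors to their own mean."""
    center = mean_vector(vectors)
    return sum(math.dist(v, center) ** 2 for v in vectors) / len(vectors)

support = [[0.0, 0.0], [2.0, 0.0]]   # one class of a toy support set
queries = [[1.0, 1.0], [1.0, 3.0]]
prototype = mean_vector(support)
distance = prototype_to_query_center(prototype, queries)
variance = intra_class_variance(support)
```

Lower values of both quantities correspond to a prototype that sits nearer the query distribution and a tighter support cluster.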
We also investigate the impact of the three Concept Prototype strategies. The results are displayed in Table 6. The direct use of class labels such as “positive” and “negative” adds semantic information to the prototype. However, class labels alone may be inadequate because they lack context. Constructing prompt templates enriches the textual semantic information. Building on templates, the prompt engineering strategy uses self-attention to select the more relevant templates when aggregating the concept. Compared with a single template, the performance is improved by 0.14%.
Table 6. The “2-way 5-shot” classification results of different strategies for generating the Concept Prototype on the ARSC dataset.
To verify the generalization of the model, we additionally create a new dataset by regrouping the categories of the training and testing sets. Specifically, we construct three training sets: “daily necessities”, “entertainments” and “electronic products”. The composition of the three training sets is shown in Table 7. For the test sets, we select “kitchen housewares”, “DVD” and “electronics”, corresponding to the three training sets above, respectively.
Table 7. The composition statistics of the three training sets in the ARSC dataset.
The results presented in Table 8 show that when the testing category is similar to the training category, such as “Kitchen housewares” to “Daily necessities”, “DVD Book” to “Entertainments” and “Electronics” to “Electronic products”, the performance of the model is satisfactory. However, when transferring to less similar categories, the accuracy of the Induction Network decreases significantly, with the maximum degradation reaching 11.91% between “DVD Book” and “Electronics” when the training set is “Electronic products”. In contrast, CPCL maintains a consistently high level of performance regardless of the change of domain. We attribute this high lower bound to the generalization ability of the concept.
Table 8. The “2-way 5-shot” classification results of CPCL transferring from training domains to testing domains on the ARSC dataset.
This paper presents a novel Conceptual Prototype (CP), which combines the Concept Prototype and the Instance Prototype to provide a robust class representation from the perspectives of both instances and label concepts, addressing the inaccuracy and limited generalizability of the original prototypes. We further apply CP to PCL and propose Conceptual Prototypical Contrastive Learning (CPCL), which brings instances closer to their corresponding prototypes and pushes them away from the others. Experimental results demonstrate that CPCL outperforms other prototype-based methods on ARSC and SST-2. In addition, CPCL has a high lower bound on performance and is suitable for 0-shot tasks.
