Few-shot image semantic segmentation based on meta-learning: A review

Abstract

Deep learning-based image semantic segmentation approaches rely heavily on large-scale training datasets with dense annotations and often suffer from scarce semantic labels for unseen categories. This limitation has spurred a research trend in Few-shot image Semantic Segmentation (FSS), which makes it possible to segment objects of new categories using only a few labeled samples. Although more and more FSS methods are emerging and are gradually being integrated into practical applications, a thorough understanding of their achievements and open issues is still missing. In this survey, we focus on recent developments in FSS, specifically on FSS methods based on meta-learning. According to their network architectures, we group the related research into three classes: Convolutional Neural Network-based (CNN-based) models, Graph Neural Network-based (GNN-based) models, and Transformer-based models. Then, we explore the specific implementations of these models, including parameter-based, metric-based, attention-based, and optimization-based methods. Furthermore, we describe the datasets and analyze the experimental results of the various kinds of methods. Toward the end of the paper, we discuss the limitations of FSS and present its applications and challenges to suggest further research directions.

Keywords

Deep learning; Few-shot learning; Image semantic segmentation; Meta-learning

1 Introduction

Recently, deep learning has witnessed great success in semantic segmentation, but it depends on large annotated datasets [1, 2]. In real life, especially in the medical and security fields, labeled data are often insufficient [3]. Semi-supervised and weakly supervised image semantic segmentation methods [4, 5] may reduce the dependence on large amounts of pixel-level labeled data. However, they still require many weakly annotated training images. Furthermore, as networks grow deeper, millions of parameters often must be trained [6]. When data are scarce, a model may fit the training set well yet still generalize poorly to new samples.

Human beings can pick up knowledge from a limited number of examples, integrate it, draw conclusions, and use it in a variety of situations [7]. The ultimate goal of artificial intelligence is to achieve human-like thinking and reasoning [8]. It is challenging to bridge the gap between artificial intelligence and human learning. Few-Shot Learning (FSL) [9] has been proposed as a solution to this problem; it reduces the burden of collecting large-scale supervision information by mining knowledge from a small number of samples. In recent years, FSL has demonstrated strong performance in image semantic segmentation. Few-shot image Semantic Segmentation (FSS) is regarded as a pixel-level extension of few-shot image classification [10–12]. The traditional FSS scheme simply fine-tunes the parameters of a pre-trained network on the labeled datasets [13]. However, with millions of parameters, the network easily overfits during the update process. To address this issue, meta-learning [14], also known as the ‘learn to learn’ approach, offers a new learning paradigm for FSS. Meta-learning can help FSS improve performance on a new task by leveraging the provided datasets and extracting meta-knowledge across tasks.

Recent review papers [15–17] summarize and analyze meta-learning-based FSS algorithms and classify them in different ways. However, the reviews [15, 16] focus only on metric-based meta-learning methods and ignore other methods, and [17] mainly discusses the open challenges that few/zero-shot learning brings to visual semantic segmentation. Unlike these earlier reviews, this paper gives a comprehensive summary of meta-learning-based FSS algorithms and classifies them according to their neural network architectures. Summaries of state-of-the-art FSS methods and the timeline of different network architectures in recent years are shown in Fig. 1. Convolutional Neural Network-based (CNN-based) models have been the core of FSS research since the appearance of One-Shot Learning for Semantic Segmentation (OSLSM) [18]. Graph Neural Network-based (GNN-based) models have emerged since Pyramid Graph Networks (PGNet) [19], and Classifier Weight Transformer (CWT) [20] opened a new chapter of Transformer-based models.

Fig. 1

Summaries of state-of-the-art methods and the timelines of different network architectures for FSS in recent years.

Thus, from the perspective of network architecture, this paper synthesizes pertinent research based on meta-learning and divides FSS algorithms into three categories: CNN-based models, GNN-based models, and Transformer-based models. Furthermore, the meta-learning-based FSS algorithms are categorized into parameter-based, metric-based, attention-based, and optimization-based approaches based on their implementations. Parameter-based methods use a trainable set of parameters to learn the classifier. Metric-based methods measure the similarity between the support set and the query set using distance metrics. Attention-based methods incorporate attention mechanisms to selectively emphasize or suppress certain features in the support set or query set. Optimization-based methods develop an optimization strategy to optimize the weights of the network. Additionally, this paper summarizes the advantages and disadvantages of each type of approach, lists their applications in different fields (medical images, 3D point clouds, etc.), and discusses future trends.

The main contributions of this paper are as follows:

A systematic review of Few-shot image Semantic Segmentation (FSS) algorithms based on meta-learning is presented, covering the description of the FSS task and development of FSS methods.

Categorizations of FSS methods are provided according to their network architectures and implementations.

The strengths and limitations of each type of method are analyzed, and the applications and research trends are discussed.

The rest of the paper is organized as follows. The task description of meta-learning-based FSS is given in Section 2. The FSS methods are classified and illustrated in Section 3. The main datasets and evaluation metrics are introduced and experimental results are analyzed in Section 4. The discussion and outlook are presented in Section 5. In Section 6, we conclude the whole work.

2 Task description

The goal of FSS is to produce a pixel-level segmentation of new categories in the query image using one or a few labeled support images. By leveraging meta-learning, FSS models can learn and generalize concepts from a diverse set of tasks, and then apply the meta-knowledge to unseen tasks. As illustrated in Fig. 2, meta-learning-based FSS is implemented by scenario-based (episodic) training. The FSS task is usually formulated as an N-way K-shot task, where N is the number of new categories in the query image and K is the number of labeled images in the support set. Most existing FSS algorithms consider N = 1, i.e., the 1-way K-shot task.

Fig. 2

The conventional settings of the FSS based on meta-learning. The meta-learning-based FSS models adopt scenario-based training, which divides samples into different episodes. Once training is finished on the training set, the models are evaluated on the test set.

Specifically, the training set $D_{train}$ and the test set $D_{test}$ contain non-overlapping categories, i.e., base categories (for training) and new categories (for testing). Both sets are composed of many episodes, and each episode contains a support set $S = \{(I_{s}^{i}, Y_{s}^{i})\}_{i = 1}^{K}$ and a query set $Q = \{I_{q}, Y_{q}\}$, where $\{I_{*}, Y_{*}\}$ denotes an image and its corresponding category-specific label, and K is the number of support images, typically set to 1 or 5. By training on an extensive range of episodes from $D_{train}$, the FSS models are optimized to segment the query image conditioned on only a limited number of labeled support images. Once the training process is finished, the performance of the FSS models is evaluated using the test episodes in $D_{test}$.
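As a rough illustration of the episode construction described above, the following Python sketch samples one 1-way K-shot episode. The `images_by_class` dictionary and the `sample_episode` helper are hypothetical names introduced only for this example; real pipelines load images and masks from disk and may sample differently.

```python
import random

def sample_episode(images_by_class, k_shot=1, rng=None):
    """Sample one 1-way K-shot episode.

    images_by_class maps a class id to a list of (image, mask) pairs.
    This in-memory layout is an assumption made for the sketch.
    Returns the class id, K support pairs, and one query pair.
    """
    if rng is None:
        rng = random.Random()
    cls = rng.choice(sorted(images_by_class))
    # Draw K + 1 distinct pairs so the query never appears in the support set.
    pairs = rng.sample(images_by_class[cls], k_shot + 1)
    support, query = pairs[:k_shot], pairs[k_shot]
    return cls, support, query
```

Called repeatedly on the base categories, such a sampler would yield training episodes; called on the new categories, it would yield test episodes.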

3 FSS algorithms

According to the different neural network architectures, we classify FSS algorithms into CNN-based models, GNN-based models, and Transformer-based models. The FSS algorithms mainly adopt a meta-learning paradigm with four implementations, namely parameter-based, metric-based, attention-based, and optimization-based methods. Next, detailed implementations for each kind of model are described.

3.1 CNN-based models

The CNN-based FSS models typically use a CNN encoder and construct convolutional algorithm units to learn transferable semantic information from the labeled support images to the query image. The implementations of the CNN-based models are parameter-based, metric-based, attention-based, and optimization-based methods. The metric-based methods can be classified into prototype-based linear metrics and dense matching-based non-linear metrics, and the attention-based methods can be classified into convolutional attention-based and cross-attention-based methods.

3.1.1 Parameter-based methods

The parameter-based methods often use a shared encoder to extract features for two branches, and then learn a parameter generator to update the classifier parameters. The general framework of such models is shown in Fig. 3, where a parameter generator is devised to predict the neural weights of the prediction layer for cross-class adaption. After training on base categories, the segmentation ability of the classifier on new categories can be quickly enhanced.

Fig. 3

Framework of parameter-based FSS methods.

Shaban et al. [18] used a conditional branch to learn prediction weights and replaced the parameters of the classifier with logistic regression layers. Unlike direct and simple replacement of classifier parameters, dynamic update strategies [21, 22] can provide more potential for parameter-based methods. Dynamic Reasoning Network (DRNet) [21] is proposed to adaptively generate the parameters of predicting layers and infer the segmentation mask for each unseen category. By leveraging the knowledge from the base classes, the model [22] can dynamically construct and maintain a classifier for the novel class.

Remarks: The parameter-based methods update the classifier of the model in a direct and efficient way; however, it is difficult for the parameter generator to estimate the parameters of a large-scale model.

3.1.2 Metric-based methods

The metric-based methods are dominant and effective FSS algorithms [15, 16]. These algorithms usually map feature representations to a metric space and use distance functions to measure the similarity among the samples. We classify the metric-based FSS methods into two types: prototype-based linear metrics methods and dense matching-based non-linear metrics methods, depending on whether the distance function is linear or not.

(1) Prototype-based methods

Prototype-based methods often use Masked Average Pooling (MAP) [23] to obtain prototype vectors. Then, the distance between the query features and the prototype vectors is measured using a cosine similarity or Euclidean distance function. Figure 4 displays the framework of the prototype-based methods.
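A minimal NumPy sketch of these two steps may help: pooling the support features under the support mask to obtain a prototype, then scoring every query position by cosine similarity. The function names, the `(C, H, W)` feature layout, and the small `1e-8` stabilizer are assumptions of this example, not details taken from any cited method.

```python
import numpy as np

def masked_average_pooling(feat, mask):
    """feat: (C, H, W) support features; mask: (H, W) binary foreground mask.

    Returns a (C,) prototype: the mean feature vector over foreground positions.
    """
    mask = mask.astype(feat.dtype)
    return (feat * mask).sum(axis=(1, 2)) / (mask.sum() + 1e-8)

def cosine_similarity_map(query_feat, prototype):
    """query_feat: (C, H, W); prototype: (C,). Returns an (H, W) similarity map."""
    q_norm = np.linalg.norm(query_feat, axis=0) + 1e-8
    p_norm = np.linalg.norm(prototype) + 1e-8
    return np.einsum("chw,c->hw", query_feat, prototype) / (q_norm * p_norm)
```

Thresholding the similarity map (or comparing it with a background prototype's map) would give a coarse query mask; the choice of threshold is left open here.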

Fig. 4

Framework of prototype-based FSS methods.

The Prototype Learning (PL) [24] model employs a prototype learner to obtain prototype vectors and uses a non-parametric weighted nearest neighbor classifier to classify semantics. The model is complex and unable to directly obtain segmentation results. Further research focused on addressing the limitations of the PL model. For instance, Wang et al. [25] developed a simple yet effective Prototype Alignment Network (PANet) that utilizes query prototypes to make backward predictions of support images. Some algorithms [26, 27] try to produce representative prototypes to reduce the prototype bias brought on by data scarcity and intra-class heterogeneity. However, the aforementioned strategies are constrained by a single global representation.

Multi-prototype techniques attempt to address prototype bias by learning multiple prototypes. For instance, the Part-aware Prototype Network (PPNet) [28] uses the Simple Linear Iterative Clustering (SLIC) [29] strategy to convert a single prototype into multiple local prototypes. The Prototype Mixture Model (PMM) [30] strengthens prototype semantic descriptions by using the expectation-maximization (EM) algorithm. Besides extracting various prototypes from the support set, the potential new classes during training can also aid in fine-tuning prototypes [31]. Additionally, self-support prototypes (SSP) [32] of the query image can be employed to reduce the uncertainty resulting from intra-class diversity.

(2) Dense comparison-based methods

Unlike linear metrics, dense comparison-based methods use a ‘concatenation + convolution’ operation to learn transferable category information, which is flexible and easy to integrate with other methods. As illustrated in Fig. 5, the general model first concatenates the query feature with the category feature learned from the labeled support image. Then, the model uses the convolutional layer to make a dense comparison. Finally, with a decoder, the target information can be segmented.
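The ‘concatenation + convolution’ step can be sketched in NumPy as follows. For brevity the convolution is taken to be a single 1×1 convolution followed by ReLU, and the weights are plain arrays supplied by the caller; both are simplifying assumptions of this example, and the names `dense_comparison`, `weight`, and `bias` are hypothetical.

```python
import numpy as np

def dense_comparison(query_feat, category_vec, weight, bias):
    """Compare every query position with one category vector.

    query_feat:   (C, H, W) query features
    category_vec: (C,) category feature from the support image
    weight:       (C_out, 2C) weights of a 1x1 convolution (assumed form)
    bias:         (C_out,)
    Returns a (C_out, H, W) comparison feature map.
    """
    c, h, w = query_feat.shape
    # Tile the category vector to the spatial size of the query feature.
    tiled = np.broadcast_to(category_vec[:, None, None], (c, h, w))
    fused = np.concatenate([query_feat, tiled], axis=0)  # (2C, H, W)
    # A 1x1 convolution is a per-position linear map over channels.
    out = np.einsum("oc,chw->ohw", weight, fused) + bias[:, None, None]
    return np.maximum(out, 0.0)  # ReLU
```

In a trained model the weights would be learned and the output passed to a decoder; here they only illustrate the data flow.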

Fig. 5

Framework of dense comparison-based FSS methods.

The idea of dense comparison, derived from the Relation Network [33], was first introduced to FSS by the model in [34], where two convolutional layers are used to embed the relationship between the support features and query features. Nonetheless, spatial semantic information can be misaligned when local features are compared. To address this issue, Class-Agnostic Segmentation Networks (CANet) [35] employ MAP to obtain a global category vector. Then, each position of the query feature is densely compared with the vector. Following CANet [35], the Prior Guided Feature Enrichment Network (PFENet) [36] adds a prior mask generated from high-level features in a training-free manner to improve performance. Although PFENet outperforms the previous algorithms, it still suffers from inadequate use of guidance information and a lack of spatial information in the global category feature.

To address these problems, some algorithms enrich the category features [37–39] or introduce memory units [40]. The self-guided learning strategy [37] enhances the segmentation results by aggregating the missing key information into the main category features. A superpixel-guided clustering and adaptive allocation-based algorithm replaces the global category feature with multiple local category features [38]. The dynamic convolution strategy [39] transfers more semantic information by acquiring dynamic category features. The recurrent memory network [40] is introduced to obtain rich category information from all resolution features in a cyclic manner.

The aforementioned works endeavor to optimize the category feature but do not consider the impact of the base categories. Therefore, some algorithms [41–43] make use of the base categories. Literature [41] integrates base category prototypes with the new category prototypes to match each region in the query image. Literature [42] introduces a set of learnable memory embeddings to record meta-information. Divide-and-conquer proxies (DCP) [44] integrate a series of support-induced proxies derived from the coarse segmentation mask of the annotated support image to boost discrimination. The Base and Meta (BAM) network [43] adaptively integrates the coarse segmentation results of the base categories and the new categories to produce accurate segmentation predictions.

Remarks: Metric-based methods play a vital role in handling the FSS task. Prototype-based methods are stable and easy to implement but often suffer from the issue of prototype bias. Dense comparison-based methods are easy to combine with other methods. Learning discriminative category feature representations becomes the state-of-the-art solution for dense comparison-based methods.

3.1.3 Attention-based methods

The attention mechanism aims to focus on relevant information while disregarding extraneous information [45]. In recent years, the attention mechanism has significantly advanced the performance of computer vision tasks [46]. Given the limited number of support samples provided by the FSS task, it is crucial to concentrate on the details of the target category. The attention mechanism can improve the interpretability and performance of FSS models. The attention techniques adopted in CNN-based FSS models consist of Convolutional Attention [47–49], Self-Attention [50], and Cross-Attention [51]. Since the Multi-Head Attention (MHA) of the Transformer-based models is composed of numerous self-attention modules, here we only elaborate on the approaches that rely on Convolutional Attention and Cross-Attention.

(1) Convolutional Attention-based methods

The convolutional attention-based methods construct convolutional layers to learn a spatial or channel weight vector that reweights the features. To capture the multi-layer contextual information from the labeled support images, the Attention-based Multi-context Guiding (A-MCG) network [52] uses a modified residual attention module (RAM) [47] as the feature selector. Furthermore, a comparative analysis is conducted in its ablation experiments using the channel attention of Squeeze-and-Excitation Networks (SENet) [49]. The Attention-based Refinement Network (ARNet) [53] makes modest changes and uses RAM to concentrate effective information. To improve the performance of the K-shot FSS task, CANet employs a parallel attention mechanism inspired by spatial attention [47] to fuse segmentation results from different support images.

(2) Cross-Attention-based methods

The cross-attention-based FSS methods usually adopt a symmetrical structure that enhances the common semantic information by exploring correlation across support and query features. Typically, the structure (depicted in Fig. 6) involves the simultaneous segmentation of support and query images.

Fig. 6

Framework of the cross-attention-based FSS methods.

Cross-Reference Networks (CRNet) [54] use a two-branch squeeze-and-excitation module to mine co-occurrent features in two images and generate updated representations. However, CRNet neglects the correlation of spatial information. To address this issue, some algorithms [41, 55–59] further learn the correlation between the two branches. Cross-Reference and Region Global Conditional Networks (CRCNet) [55] extend CRNet to learn both local and global common information. Literature [56] computes the relationship matrix between query and support based on cosine distance. Holistic Prototype Activation (HPA) [41] devises a feature interaction weighting scheme to model the inter-dependence and self-dependence of low-level features by matrix multiplication. To learn more common information, long-range dependencies on both channel and spatial dimensions are explored [57, 59], and literature [58] incorporates foreground and background attention into the model.

Remarks: Attention-based methods are often used in conjunction with other methods in FSS models, such as metric-based methods, to achieve optimal segmentation performance. While it works well for learning local features, convolution-based attention is not as good at modeling long-range dependencies. Matrix multiplication-based cross-attention can capture long-range dependencies, but it often suffers from high computational complexity.
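To make the cost of matrix multiplication-based cross-attention concrete, the sketch below computes a generic scaled dot-product cross-attention between query and support feature maps in NumPy. It is a textbook formulation without learned projections, used here as a stand-in rather than the module of any cited method; the function name and the `(C, H, W)` layout are assumptions.

```python
import numpy as np

def cross_attention(query_feat, support_feat):
    """query_feat: (C, Hq, Wq); support_feat: (C, Hs, Ws).

    Returns query-shaped features aggregated from the support positions,
    plus the (Hq*Wq, Hs*Ws) attention matrix.
    """
    c, hq, wq = query_feat.shape
    q = query_feat.reshape(c, -1).T    # (Nq, C)
    s = support_feat.reshape(c, -1).T  # (Ns, C)
    scores = q @ s.T / np.sqrt(c)      # (Nq, Ns) affinity matrix
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=1, keepdims=True)         # softmax over support
    out = attn @ s                                        # (Nq, C)
    return out.T.reshape(c, hq, wq), attn
```

The affinity matrix has one entry per pair of query and support positions, so its size grows with the product of the two spatial resolutions, which is the source of the computational cost noted in the remarks.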

3.1.4 Optimization-based methods

The optimization-based methods aim to make the model easy to fine-tune for new categories by learning a good initialization of the network. Feature Weighting and Boosting (FWB) [27] draws inspiration from the optimization concept and utilizes the pre-segmentation of support images to provide an initialization value for a segmentation model. Similarly, several works [37, 57] have utilized this optimization approach to train their models. In contrast, Tian et al. [60] redefined the FSS task as an optimization-based pixel classification problem. To this end, they developed an embedding module that utilizes both a global and a local feature branch to extract appropriate meta-knowledge for the meta-segmentation network. To learn a general meta-learning framework, Cao et al. [61] presented a network with a meta-learner and a base learner. The meta-learner can learn good initialization and parameter updating strategies for the FSS task. The base learner can in principle be an arbitrary semantic segmentation model, and its parameters can be updated quickly with the guidance of the meta-learner. To achieve adaptive tuning, Zhu et al. [62] designed the base learner as an inner-loop task and used an optimization-based learner as an outer loop to progressively refine the segmentation results.

Remarks: The optimization-based methods often change the training strategy to iteratively update the model’s parameters, so that the model can quickly adapt to tasks with new categories. However, these methods often require a more complex experimental setup, such as separate training steps or an inner-outer-loop framework, which makes them less flexible than other approaches.
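The inner-outer-loop idea of learning a good initialization can be illustrated on a toy problem. The sketch below uses a Reptile-style first-order update on small linear regression tasks; this is a deliberately simplified stand-in chosen for this example, not the procedure of FWB [27] or of [60–62], and all names and hyperparameters are hypothetical.

```python
import numpy as np

def inner_adapt(w, x, y, lr=0.1, steps=5):
    """Inner loop: a few gradient steps on one task's mean squared error."""
    w = w.copy()
    for _ in range(steps):
        grad = 2.0 * x.T @ (x @ w - y) / len(x)
        w -= lr * grad
    return w

def meta_train(tasks, dim, meta_lr=0.5, epochs=50):
    """Outer loop: move the shared initialization toward each adapted solution."""
    w0 = np.zeros(dim)
    for _ in range(epochs):
        for x, y in tasks:
            w_task = inner_adapt(w0, x, y)
            w0 = w0 + meta_lr * (w_task - w0)  # Reptile-style update (assumption)
    return w0

# Toy tasks: regression problems whose true weights cluster near [2, -1].
x = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]])
true_ws = [np.array(w) for w in ([1.8, -1.2], [2.2, -0.8], [2.1, -1.1], [1.9, -0.9])]
tasks = [(x, x @ w) for w in true_ws]
w_init = meta_train(tasks, dim=2)
```

The learned `w_init` ends up close to the cluster of task solutions, so a new task from the same family needs only a few inner-loop steps. The extra loop structure also hints at why such methods need a more involved training setup.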

3.2 GNN-based models

The GNN-based models use structured representations, i.e., graphs, to represent the inputs and propagate category information from the support graph to the query graph in terms of graphical reasoning. To reason about the potential relationships between query nodes and labeled support nodes, graph attention networks are often introduced. The graph attention modules [63] apply an attention mechanism of gridded structure to learn the weights among the nodes’ connections and highlight the significance of the nodes. As shown in Fig. 7, existing GNN-based methods first learn graph representations for the features extracted from the CNN encoder. Then, the label information is propagated from the support image to the unlabeled query image by modeling the relationship between the support and query graphs.

Fig. 7

Framework of the graph-attention-based FSS methods.

The first GNN-based method, entitled Pyramid Graph Networks (PGNet) [19], employs the graph attention mechanism to construct a pyramid graph structure. To effectively propagate label information from the support images to the query image, subsequent algorithms aim to construct more optimal graph attention networks. For example, the Democratic Attention Network (DAN) [64] propagates guided information to the query image by rescaling the activated regions of objects in the support image. Scale-Aware Graph Neural Network (SAGNN) [65] interprets the cross-scale interactions between the support and query images in a structure-to-structure manner. However, the aforementioned weight adjustment strategies may not be reasonable and could introduce noisy pixels, which would ultimately lead to incorrect query image segmentation. To address this issue, the Mutually Supervised Graph Attention Network (MSGA) [66] constructs a bipartite graph attention module, which can improve performance through mutual guidance.

Remarks: The GNN-based models can reveal the linkages between intra-graphs or inter-graphs through various graph attention modules, and the visualization of the weights in their modules can provide interpretability of links. However, their computational efficiency is greatly impacted by the size and dimensions of the input samples.

3.3 Transformer-based models

The Transformer architecture has made significant progress in computer vision due to its powerful representational capabilities [67]. Inspired by this, researchers have begun exploring its potential applicability in the FSS task. Current transformer-based FSS methods usually take the form of ‘CNN + Transformer’ or ‘Pure Transformer’. As shown in Fig. 8, these methods extract features either from the CNN or Transformer encoder and design transformer-based modules to propagate the semantic information of the target categories within a global receptive field.

Fig. 8

Framework of the Transformer-based FSS methods.

The Transformer is a neural network architecture based on the self-attention mechanism. It alternates MHA and Multi-Layer Perceptron (MLP) blocks, which aggregate relevant mappings by attending over a global receptive field [50]. The first transformer-based FSS algorithm [20] proposed a novel meta-learning framework that employs a Classifier Weight Transformer (CWT) to dynamically adapt the classifier weights.

To filter out potentially harmful support features, the Cycle-Consistent Transformer (CyCTR) [68] uses self-alignment and cross-alignment transformer blocks to aggregate the context information within query images or across support and query images. The Context and Affinity Transformer (CATrans) [69] propagates contextual information using a relation-guided context Transformer and generates a reliable cross-affinity map by a relation-guided affinity Transformer. To ensure the purity of the sequential features and consistency of pattern matching, Hierarchically Decoupled Matching Network (HDMNet) [70] proposes a new hierarchically matching structure that utilizes self-attention modules and transformer-based matching modules to mine multi-scale pixel-level correlations. To address spurious matches in affinity learning methods, the Adaptive Buoys Correlation (ABC) network [71] rectifies direct pairwise pixel-level correlation by mining buoys and adaptive correlations. To reduce high computational complexity, Dense pixel-wise Cross-query-and-support Attention weighted Mask Aggregation (DCAMA) [72] aggregates multi-level pixel correlation information between matched query and support features with the support masks. To learn a specialized feature extractor for the FSS task, Multi-level Heterogeneity Suppressing (MuHS) [73] enhances attention/interaction between different samples (query and support), different regions, and neighboring patches, respectively.

Remarks: The transformer architecture can leverage pixel-wise alignment to capture global information for the FSS task, which transcends the limitation of semantic-level prototypes. By allowing for fine-grained analysis and considering long-range dependencies, transformers have proven to be a powerful tool for achieving state-of-the-art performance in the FSS task. However, the computational complexity of the aggregation process still hinders their adaptability.

4 Experimental comparison

4.1 Datasets and evaluation

The popular benchmark datasets used for the FSS task primarily include PASCAL-5ⁱ, COCO-20ⁱ, and FSS-1000, which encompass 20, 80, and 1,000 categories, respectively. The first two datasets are derived from commonly used image semantic segmentation datasets redefined according to the concept of few-shot learning, while the third is a distinct collection created specifically for the FSS task. PASCAL-5ⁱ was first introduced in OSLSM [18] and is composed of images and semantic labels from PASCAL VOC 2012 [74] and the extended set SDS [75]. Its 20 categories are divided into 4 folds: in each fold, 5 categories form the test set and the remaining 15 categories form the training set. COCO-20ⁱ was first introduced in FWB [27] and is composed of images and semantic labels from MSCOCO [76]. Its 80 categories are likewise divided into 4 folds: in each fold, 20 categories form the test set and the remaining 60 categories form the training set. FSS-1000 [34] is specifically designed to tackle the problem of few-shot segmentation for general objects. It comprises 1,000 categories and emphasizes the number of categories rather than the number of images. Each category contains 10 images and corresponding category labels, thereby making FSS-1000 highly scalable. The train/validation/test split used in the experiments consists of 5,200/2,400/2,400 image and label pairs.
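The fold construction can be sketched in a few lines of Python. The text does not state how categories are assigned to folds, so the contiguous class-id blocks used below are an assumption of this example (benchmark implementations may define the splits differently), and `fold_split` is a hypothetical helper.

```python
def fold_split(num_classes, num_folds, fold):
    """Return (train_classes, test_classes) for one cross-validation fold.

    Assumes fold i holds out the i-th contiguous block of class ids.
    """
    per_fold = num_classes // num_folds
    test = list(range(fold * per_fold, (fold + 1) * per_fold))
    train = [c for c in range(num_classes) if c not in test]
    return train, test
```

With 20 classes and 4 folds this gives 15 training and 5 test categories per fold; with 80 classes and 4 folds it gives 60 and 20, matching the counts above.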

Mean Intersection over Union (mIoU), Foreground-Background IoU (FBIoU), inference time, and the number of learnable parameters are employed as FSS assessment metrics. $mIoU = \frac{1}{C} \sum_{c = 1}^{C} {IoU}_{c}$ is the average IoU over all categories, where C is the number of categories in the test fold and ${IoU}_{c}$ is the intersection over union of category c. $FBIoU = \frac{1}{2} ({IoU}_{F} + {IoU}_{B})$ ignores object categories and averages the foreground IoU, ${IoU}_{F}$, and the background IoU, ${IoU}_{B}$. In both cases, $IoU = \frac{TP}{TP + FP + FN}$, where TP, FP, and FN are the numbers of true-positive, false-positive, and false-negative pixels in the prediction, respectively.
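These definitions translate directly into a few NumPy functions. The sketch below computes them for a single binary prediction; how predictions are aggregated across test episodes varies between papers and is not specified here, so that part is left to the caller. Returning 0 when the union is empty is a convention chosen only for this example.

```python
import numpy as np

def iou(pred, target):
    """IoU = TP / (TP + FP + FN) for two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    tp = np.logical_and(pred, target).sum()
    fp = np.logical_and(pred, ~target).sum()
    fn = np.logical_and(~pred, target).sum()
    denom = tp + fp + fn
    return float(tp) / float(denom) if denom > 0 else 0.0

def mean_iou(per_class_iou):
    """mIoU: average of the per-category IoU values in the test fold."""
    return float(np.mean(per_class_iou))

def fb_iou(pred, target):
    """FBIoU: average of foreground IoU and background IoU."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    return 0.5 * (iou(pred, target) + iou(~pred, ~target))
```

For example, a prediction covering two pixels of which one is correct, against a one-pixel target in a 2×2 image, has a foreground IoU of 1/2 and a background IoU of 2/3.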

Ideally, an FSS model should be evaluated in multiple respects, such as quantitative accuracy (mIoU, FBIoU), speed (inference time), and storage requirements (learnable parameters). However, most FSS methods focus on accuracy metrics, and only a few compare learnable parameters and inference time. Thus, we compare and analyze the performance of the FSS methods under the most popular mIoU and FBIoU metrics.

4.2 Performance and analysis

In this section, we compare and analyze the experimental results of FSS methods on the datasets PASCAL-5ⁱ, COCO-20ⁱ, and FSS-1000. We list the performance of meta-learning-based FSS methods on these three datasets with models, solutions, backbones, mIoU (%), and FBIoU (%), respectively. Table 1 presents the experimental results of various methods on datasets PASCAL-5ⁱ and COCO-20ⁱ. Table 2 displays the experimental results of several methods on the FSS-1000 dataset.

Table 1
Segmentation performance of various FSS methods on the PASCAL-5ⁱ and COCO-20ⁱ datasets

Solutions	Backbones	Methods	PASCAL-5ⁱ				COCO-20ⁱ
			mIoU (%)		FBIoU (%)		mIoU (%)		FBIoU (%)
			1-shot	5-shot	1-shot	5-shot	1-shot	5-shot	1-shot	5-shot
CNN-based models
	VGG-16	OSLSM [18]	40.8	43.9	–	–	–	–	–	–
Parameter	ResNet-50	DRNet [21]	59.9	61.8	72.5	74.1	33.8	37.8	57.9	60.2
		DENet [22]	60.1	60.5	–	–	42.8	43.0	–	–
		PL [24]	–	–	61.2	62.3	–	–	–	–
		PANet [25]	48.1	55.7	66.5	70.7	–	–	–	–
		AMP [26]	43.4	46.9	62.2	63.8	–	–	–	–
		SG-one [23]	46.3	47.1	63.1	65.9	–	–	–	–
		PMM [30]	53.0	54.0	–	–	–	–	–	–
		FSS-1000 [34]	–	58.6	–	–	–	–	–	–
		PFENet [36]	58.0	59.0	72.0	72.3	36.3	40.4	63.3	65.0
		MM-Net [42]	58.3	58.3	–	–	–	–	–	–
	VGG-16	DPCN [39]	61.7	65.3	73.7	77.2	39.5	46.2	62.5	66.1
		HPA [41]	62.3	67.2	75.2	79.3	42.1	48.1	67.2	72.2
		DCP [44]	61.3	65.8	74.9	79.4	–	–	–	–
		FECANet [59]	64.3	66.7	76.2	77.6	35.4	41.5	65.5	67.7
		BAM [43]	64.4	68.8	77.3	81.1	43.5	49.3	–	–
		PPNet [28]	52.8	63.0	–	–	27.2	36.7	–	–
		MLC [31]	63.6	66.8	–	–	35.1	41.4	–	–
		PMM [30]	56.3	57.3	–	–	–	–	–	–
		SSP [32]	61.4	69.3	–	–	37.4	44.1	–	–
		CANet [35]	55.4	57.1	66.2	69.6	–	–	–	–
		PFENet [36]	60.8	61.9	73.3	73.9	–	–	–	–
Metric		MM-Net [42]	61.8	63.4	–	–	37.5	38.2	–	–
	ResNet-50	CMN [40]	62.8	63.7	72.3	72.8	39.3	43.1	61.7	63.3
		ASGNet [38]	59.3	63.9	69.2	74.2	–	–	–	–
		DPCN [39]	66.7	69.9	78.0	80.7	43.0	49.8	63.2	67.4
		HPA [41]	65.5	69.4	76.4	81.1	43.8	50.5	68.3	71.4
		DCP [44]	62.8	67.8	75.6	79.7	41.4	46.5	–	–
		FECANet [59]	67.4	70.0	78.7	80.7	41.6	47.6	69.6	71.1
		BAM [43]	67.8	70.9	79.7	82.2	46.2	51.2	–	–
		PPNet [28]	55.2	65.1	–	–	29.0	38.5	–	–
		MLC [31]	63.8	69.3	–	–	37.5	45.1	–	–
		PMM [30]	–	–	–	–	30.6	35.5	–	–
	ResNet-101	SSP [32]	64.6	73.1	–	–	42.0	50.2	–	–
		PFENet [36]	60.1	61.4	72.9	73.5	38.5	42.7	63.0	65.8
		ASGNet [38]	59.3	64.4	71.7	75.2	34.6	42.5	60.4	67.0
		HPA [41]	66.1	69.4	76.6	80.4	46.3	52.8	68.8	74.4
	VGG-16	ARNet [53]	48.1	49.1	–	–	–	–	–	–
		CRNet [54]	55.7	58.8	66.8	71.5	–	–	–	–
		CRCNet [55]	56.4	59.1	66.9	71.7	38.7	43.1	62.0	66.1
Attention	ResNet-50	MIC [57]	57.4	59.4	68.3	69.1	–	–	–	–
		LTM [56]	57.0	60.6	71.8	74.6	–	–	–	–
		SimPropNet [58]	57.2	60.0	72.9	72.9	–	–	–	–
	ResNet-101	A-MCG [52]	–	–	61.2	62.2	–	–	–	–
	VGG-16	FWB [27]	51.9	55.1	–	–	20.0	22.6	–	–
Optimization		Literature [61]	48.6	50.2	–	–	–	–	–	–
	ResNet-50	SST [62]	57.6	61.4	–	–	–	–	–	–
	ResNet-101	FWB [27]	56.2	59.9	–	–	21.2	23.7	–	–
GNN-based models
		PGNet [19]	56.0	58.5	69.9	70.5	–	–	–	–
	ResNet-50	SAGNN [65]	62.1	62.8	73.2	73.3	–	–	–	–
Attention		MSGA [66]	62.7	63.4	–	–	–	–	–	–
	ResNet-101	DAN [64]	58.2	60.5	71.9	72.3	24.4	29.6	62.3	63.9
		SAGNN [65]	–	–	–	–	37.2	42.7	60.9	63.4
Transformer-based models
Parameter	ResNet-50	CWT [20]	56.4	63.7	–	–	32.9	41.3	–	–
	ResNet-101	CWT [20]	58.0	64.7	–	–	32.4	42.0	–	–
	VGG-16	HDMNet [70]	65.1	69.3	–	–	45.9	52.4	–	–
		CyCTR [68]	64.2	65.6	–	–	40.3	45.6	–	–
		CATrans [69]	66.3	75.3	–	–	46.6	58.2	–	–
	ResNet-50	DCAMA [72]	64.6	68.5	75.7	79.5	43.3	46.0	69.5	71.7
		ABC [71]	66.0	69.6	76.0	80.0	44.1	49.1	69.9	72.7
		HDMNet [70]	69.4	71.8	–	–	50.0	56.0	72.2	77.7
Attention		CyCTR [68]	64.3	66.6	72.9	75.0	–	–	–	–
		CATrans [69]	67.2	76.5	–	–	48.8	59.7	–	–
	ResNet-101	DCAMA [72]	64.6	68.3	77.6	80.8	43.5	51.9	69.9	73.3
		ABC [71]	65.6	69.4	78.5	80.8	–	–	–	–
	Swin-T	CATrans [69]	67.6	77.3	–	–	49.4	60.1	–	–
	Swin-B	DCAMA [72]	69.3	74.9	78.5	82.9	50.9	58.3	73.2	76.9
	DeiT-B	MuHS [73]	69.1	76.7	–	–	47.4	56.7	–	–

Table 2

Segmentation performance of several FSS methods on the FSS-1000 dataset

Models	Solutions	Backbones	Methods	Sizes	mIoU (%)		FBIoU (%)
					1-shot	5-shot	1-shot	5-shot
CNN	Metric	VGG-16	FSS-1000 [34]	224×224	73.5	80.1	–	–
		ResNet-101	SSP [32]	473×473	87.3	88.6	–	–
	Attention	ResNet-50	MIC [57]	224×224	83.7	85.2	–	–
			CRCNet [55]	473×473	81.1	83.2	–	–
GNN	Attention	ResNet-50	MSGA [66]	224×224	75.0	81.3	–	–
		ResNet-101	DAN [64]	MS	85.2	88.1	–	–
Transformer	Attention	ResNet-50	DCAMA [72]	384×384	88.3	89.1	92.4	93.1
		ResNet-101	DCAMA [72]	384×384	88.3	89.1	92.4	93.1
		Swin-B	DCAMA [72]	384×384	90.1	90.4	93.8	94.1

We analyze the results from three aspects: datasets, backbones, and methods.

Almost all methods are evaluated on PASCAL-5ⁱ. Performance on FSS-1000 is higher than on the other two datasets, while significant room for improvement remains on COCO-20ⁱ. Thus, the scene diversity of a dataset has a great impact on FSS models.
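The dataset gap can be illustrated with a small calculation over the methods that the tables report on all three datasets with the same backbone. The sketch below copies the 1-shot mIoU values from Tables 1 and 2 and averages them per dataset. It is only a rough indication, since input sizes and evaluation protocols differ between methods, and the helper name is an assumption.

```python
# 1-shot mIoU (%) copied from Tables 1 and 2 for methods reported on all
# three datasets with the same backbone ("5i"/"20i" stand for 5ⁱ/20ⁱ).
ONE_SHOT_MIOU = {
    "SSP [32] (ResNet-101)": {"PASCAL-5i": 64.6, "COCO-20i": 42.0, "FSS-1000": 87.3},
    "DAN [64] (ResNet-101)": {"PASCAL-5i": 58.2, "COCO-20i": 24.4, "FSS-1000": 85.2},
    "DCAMA [72] (Swin-B)": {"PASCAL-5i": 69.3, "COCO-20i": 50.9, "FSS-1000": 90.1},
}


def dataset_means(scores):
    """Average the per-method scores for each dataset."""
    datasets = next(iter(scores.values())).keys()
    return {d: sum(m[d] for m in scores.values()) / len(scores) for d in datasets}


means = dataset_means(ONE_SHOT_MIOU)
for name, value in sorted(means.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.1f}")
```

For these three methods the averages are roughly 87.5 on FSS-1000, 64.0 on PASCAL-5ⁱ, and 39.1 on COCO-20ⁱ, consistent with the ordering described above.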

The CNN-based and GNN-based methods mainly use VGG [77] and ResNet [78] as backbones, while the Transformer-based methods additionally adopt backbones such as Swin [79] and DeiT [80]. Several methods [69, 73] evaluate their algorithms with both CNN and Transformer backbones. Methods employing a Transformer backbone exhibit better generalization than those employing a CNN backbone. Thus, the discriminability of the features extracted by the backbone affects the performance of FSS methods.

CNN-based methods are the most numerous, followed by Transformer-based methods, and GNN-based methods are the least common. Metric-based CNN methods are the most widely used and have demonstrated promising performance. Attention-based methods in different models are often combined with metric-based methods. Attention-based Transformer methods have proven especially effective, achieving the best mIoU on all three datasets. Thus, both metric-based and attention-based methods play important roles in FSS research.

5 Discussion and outlook

In this section, we first summarize the state-of-the-art techniques with their merits, drawbacks, and applicability. Then, we discuss the limitations of FSS. After that, we present the applications of current FSS research. Finally, we outline challenges that point to directions for further research on the FSS task.

5.1 Discussion

FSS algorithms are primarily classified into CNN-based, GNN-based, and Transformer-based methods according to their architectures. The majority of FSS algorithms are CNN-based, a minority are GNN-based, and Transformer-based approaches are growing in number. The framework of these methods is shifting from ‘Pure CNN’, ‘CNN + Graph’, and ‘CNN + Transformer’ toward a ‘Pure Transformer’ configuration. In addition, each architecture is implemented in different ways. These implementations, along with their merits, drawbacks, and applicability, are discussed in Table 3.

Table 3
Discussion of FSS techniques with merits, drawbacks, and applicability

Models	Solutions	Merits	Drawbacks	Applicability	Literature
CNN	Parameter	Learns the classifier parameters of the model directly.	Difficult for the generator to estimate large-scale model parameters.	Suitable for updating small-scale predictor parameters.	[18, 22]
	Metric	Models similarity learning with distance functions or convolutional layers, offering an easy-to-implement solution for FSS.	Prone to prototype bias or non-discriminative category features.	Suitable for datasets with small intra-class diversity.	[23–32, 59]
	Attention	Uses convolutional or cross-attention mechanisms to improve feature representation and convey more useful information.	Learns only local features or incurs high computational complexity.	Influenced by factors such as feature discriminability and task complexity.	[52–58]
	Optimization	Learns good update strategies for the model or provides auxiliary information so that it can quickly adapt to new categories.	The experimental setting is complex and less flexible.	Influenced by parameters and settings specific to the optimization process.	[27, 60–62]
GNN	Attention	Graph attention represents the interrelationships between samples by updating edges and nodes, giving the model strong interpretability.	The computation of the model increases exponentially with the number and size of samples.	Suitable for modeling graph structure; influenced by the design of nodes and edges.	[19, 64–66]
Transformer	Parameter	Uses a transformer structure to adapt the classifier of the model.	Difficult to classify images with extreme viewpoint differences.	Influenced by large intra-class variation.	[20]
	Attention	Aggregates self- or cross-sample information in a global receptive field to learn a model that generalizes robustly.	Prone to high aggregation costs and overfitting.	Suitable for modeling interactions between single or multiple inputs.	[68–73]

In general, research on the FSS task has made great progress, but several issues with the different methods in each model family still need to be addressed.

In CNN-based models: Parameter-based methods directly modify the classifier’s parameters to suit the segmentation of new categories, but adapting them to large-scale models can be challenging. Metric-based methods are straightforward to implement and can transfer meta-knowledge in meta-learning frameworks, but they still cannot fully address intra-class diversity and often require significant effort to learn optimal prototype representations or discriminative category features. Attention-based methods can improve feature representation, but they must strike a balance between computational complexity and performance. Optimization-based ideas are often exploited by other methods, but they tend to bring more complex experimental setups.
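The prototype idea behind many metric-based methods can be sketched in a few lines. The example below is a generic illustration, not the implementation of any surveyed method: it assumes features are already extracted and flattened into one vector per pixel, builds a foreground prototype from each support image by masked average pooling, averages the prototypes over the shots, and labels query pixels by cosine similarity. The function names, the toy features, and the 0.5 threshold are all assumptions.

```python
import math


def masked_average_pooling(features, mask):
    """Average the feature vectors of the pixels whose mask value is 1."""
    selected = [f for f, m in zip(features, mask) if m == 1]
    if not selected:
        raise ValueError("support mask has no foreground pixel")
    dim = len(selected[0])
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def segment_query(query_features, supports, threshold=0.5):
    """Label each query pixel by its similarity to the mean support prototype.

    `supports` is a list of (features, mask) pairs, one pair per shot.
    """
    prototypes = [masked_average_pooling(f, m) for f, m in supports]
    dim = len(prototypes[0])
    prototype = [sum(p[d] for p in prototypes) / len(prototypes) for d in range(dim)]
    return [1 if cosine(q, prototype) > threshold else 0 for q in query_features]


# Toy 1-shot episode with 2-D features for four pixels (hypothetical values).
support_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
support_mask = [1, 1, 0, 0]
query_features = [[1.0, 0.1], [0.0, 1.0], [0.8, 0.2], [0.2, 0.9]]

print(segment_query(query_features, [(support_features, support_mask)]))
```

A single averaged prototype is exactly where the prototype bias discussed above comes from: when the foreground pixels of a category vary widely, their mean may resemble none of them.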

In GNN-based models: Graph attention-based methods can provide strong interpretability for the model, but they need to deal with complex computations.

In Transformer-based models: Parameter-based methods can use a transformer structure to adapt the classifier, but they struggle to classify images with extreme viewpoint differences. Attention-based methods can learn correlations within or across samples in a global receptive field, but they easily incur high aggregation costs and overfitting.
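A rough count shows where the aggregation cost comes from. The sketch below assumes a dense cross-attention in which every query token attends to every support token, and a hypothetical 60×60 feature map. Individual methods may reduce this cost with downsampling or sparse attention, so the numbers are illustrative only.

```python
def affinity_entries(feat_h, feat_w, shots):
    """Number of query-to-support affinities in one dense cross-attention map."""
    query_tokens = feat_h * feat_w
    support_tokens = shots * feat_h * feat_w
    return query_tokens * support_tokens


for shots in (1, 5):
    n = affinity_entries(60, 60, shots)  # 60x60 is an assumed feature-map size
    print(f"{shots}-shot: {n:,} affinities")
```

Under these assumptions the 1-shot case already involves about 13 million affinities per attention map. The count grows linearly with the number of shots and quadratically with the number of spatial positions.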

5.2 Applications

Compared with traditional semantic segmentation algorithms, FSS algorithms have clear application advantages: they greatly alleviate the problems caused by data scarcity and provide a flexible solution for fast cross-class adaptation. FSS has recently spurred intensive research efforts to apply it to various fields, such as medical imaging and 3D point clouds.

Compared to natural images, medical images are considerably more challenging to collect and require expert manual labeling. Several FSS algorithms have been developed for medical images, including few-shot organ segmentation [81–83], few-shot skin lesion segmentation [84, 85], few-shot brain CT segmentation [86–90], and few-shot COVID-19 pneumonia diagnosis [91–94]. These methods are highly valuable to society and can improve medical diagnosis and promote medical research.

3D point clouds are complex, multidimensional data collections with intricate labeling requirements. Few-shot 3D point cloud segmentation methods [95–97] use only a few labeled point clouds to segment new point cloud data. These methods hold great practical value and can benefit various applications, including autonomous driving, robotics, and augmented and virtual reality.

Beyond these fields, FSS has also achieved breakthroughs in other real-world segmentation tasks, including texture segmentation [98], logo segmentation [99], metal generic surface defect segmentation [100], and document layout segmentation [101]. In short, FSS plays a positive role in real scenes and makes life more intelligent and autonomous.

5.3 Challenges

Few-shot image semantic segmentation is gradually extending to more challenging tasks, such as Generalized Few-shot Semantic Segmentation (GFSS) tasks, weakly supervised FSS tasks, and cross-domain FSS tasks. These advancements create a strong basis for future research and application by significantly expanding the theoretical knowledge and application possibilities of FSS.

GFSS. The commonly studied FSS task is narrowly defined: it uses a few labeled support images to segment new categories in the query image, and it neglects the base categories contained in query samples. Evaluating generalization only on new categories therefore falls short of real-world FSS scenarios. BAM [43] leverages segmentation information from base categories to facilitate the segmentation of novel categories, while also enabling the segmentation of base categories present in the query image. However, it still conforms to the narrow task setting, which requires the support set to contain the categories present in the query sample. GFSS is proposed to overcome the restrictive narrow FSS setting and the poor generalization over base categories. GFSS methods [102–104] can simultaneously segment both base and new categories in the query image without prior knowledge of which categories it contains.

Weakly Supervised FSS. The labels in the support set of an FSS task are typically pixel-level labels, which are labor-intensive to produce. Consequently, some methods adopt weak labels [35, 105–108] (bounding boxes, dot annotations, etc.) or image-level labels [109, 110], or even leverage self-supervised learning to generate labels [111]. These approaches further reduce the reliance on dense annotation.

Cross-domain FSS. The FSS task usually refers to the single-domain setting, where a segmentation model learned on base categories is generalized to new categories of the same dataset. This is far from genuine generalization. Some approaches [20, 112–114] emphasize cross-domain transfer between natural image datasets, such as COCO→PASCAL and PASCAL→COCO, and others [115] span from natural datasets to medical datasets. The cross-domain FSS task is both more realistic and more difficult, since the trained model must be used in previously unexplored domains with distinct data distributions.

6 Conclusion

FSS can segment new categories with only a few labeled samples, which mitigates the reliance on labeled datasets and demonstrates generalization to new categories. In this paper, we survey FSS algorithms based on meta-learning and present their research trajectory. Specifically, we group them by architecture into CNN-based, GNN-based, and Transformer-based models and explore their techniques, namely parameter-based, metric-based, attention-based, and optimization-based methods. Moreover, we analyze the merits and drawbacks of existing FSS methods and discuss the limitations of current FSS research. To inspire future research in FSS, we also illustrate its applications and challenging tasks.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant 62102129 and Grant 62276088, the Natural Science Foundation of Hebei Province under Grant F2021202030, Grant F2019202381 and Grant F2019202464.

References

1. Shelhamer, Long and Darrell, Fully Convolutional Networks for semantic segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence 39(4) (2017), 640–651.

2. , Wu, Yang, et al., Review the state-of-the-art technologies of semantic segmentation based on deep learning, Neurocomputing 493 (2022), 626–646.

3. , Yang, Tan, et al., Methods and datasets on semantic segmentation: A review, Neurocomputing 304 (2018), 82–103.

4. Zhang, Zhou, Zhao, et al., A survey of semi- and weakly supervised semantic segmentation of images, Artificial Intelligence Review 53(6) (2020), 4259–4288.

5. Papandreou, Chen L.C., Murphy K.P., et al., Weakly- and Semi-Supervised Learning of a Deep Convolutional Network for Semantic Image Segmentation, in IEEE International Conference on Computer Vision, 2015, 1742–1750.

6. Zhang Q.C., Yang L.T., Chen Z.K., et al., A survey on deep learning for big data, Information Fusion 42 (2018), 146–157.

7. Lake B.M., Ullman T.D., Tenenbaum J.B., et al., Building machines that learn and think like people, Behavioral and Brain Sciences 40(1) (2017), 1–101.

8. Turing A.M., Computing Machinery and Intelligence, Mind 59(236) (1950), 433–460.

9. Fei-Fei, Fergus and Perona, One-shot learning of object categories, IEEE Transactions on Pattern Analysis and Machine Intelligence 28(4) (2006), 594–611.

10. Koch, Siamese neural networks for one-shot image recognition, University of Toronto, PhD thesis, 2015.

11. Vinyals, Blundell, Lillicrap, et al., Matching Networks for One Shot Learning, in Neural Information Processing Systems (NIPS), 2016, 3630–3638.

12. Wang, Yao, Kwok J.T., et al., Generalizing from a few examples: A survey on few-shot learning, ACM Computing Surveys 53(3) (2020), 1–34.

13. Caelles, Maninis K.K., Pont-Tuset, et al., One-Shot Video Object Segmentation, in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017, 5320–5329.

14. Hochreiter, Younger A.S. and Conwell P.R., Learning to Learn Using Gradient Descent, in Artificial Neural Networks – ICANN 2001, 2001, 87–94.

15. Chen, Yang, Huang, et al., A survey on few-shot image semantic segmentation, Frontiers of Data & Computing 3(6) (2021), 17–34.

16. Wei, Li and Liu, A review of image semantic segmentation under few-shot dilemma, Computer Engineering and Applications 59(02) (2023), 1–11.

17. Ren, Tang, Sun, et al., Visual semantic segmentation based on few/zero-shot learning: An overview, IEEE/CAA Journal of Automatica Sinica (2023), 1–21.

18. Shaban, Bansal and Z, One-Shot Learning for Semantic Segmentation, in British Machine Vision Conference, 2017, 1–14.

19. Zhang, Lin G.S., Liu F.Y., et al., Pyramid Graph Networks with Connection Attentions for Region-Based One-Shot Semantic Segmentation, in IEEE/CVF International Conference on Computer Vision (ICCV), 2019, 9586–9594.

20. , He, Zhu, et al., Simpler is Better: Few-shot Semantic Segmentation with Classifier Weight Transformer, in IEEE/CVF International Conference on Computer Vision (ICCV), 2021, 8721–8730.

21. Zhuge and Shen, Deep reasoning network for few-shot semantic segmentation, in Proceedings of the 29th ACM International Conference on Multimedia, 2021, 5344–5352.

22. Liu, Cao, Liu, et al., Dynamic extension nets for few-shot semantic segmentation, in Proceedings of the 28th ACM International Conference on Multimedia, 2020, 1441–1449.

23. Zhang, Wei, Yang, et al., SG-One: Similarity guidance network for one-shot semantic segmentation, IEEE Transactions on Cybernetics 50(9) (2020), 3855–3865.

24. Dong and Xing E.P., Few-shot semantic segmentation with prototype learning, in 29th British Machine Vision Conference, 2018, 1–13.

25. Wang, Liew J.H., Zou, et al., PANet: Few-shot image semantic segmentation with prototype alignment, in Proceedings of the IEEE International Conference on Computer Vision, 2019, 9196–9205.

26. Siam, Oreshkin B.N., Jagersand, et al., AMP: Adaptive Masked Proxies for Few-Shot Segmentation, in IEEE/CVF International Conference on Computer Vision (ICCV), 2019, 5248–5257.

27. Khoi and Todorovic, Feature Weighting and Boosting for Few-Shot Segmentation, in IEEE/CVF International Conference on Computer Vision (ICCV), 2019, 622–631.

28. Liu, Zhang, et al., Part-Aware Prototype Network for Few-Shot Semantic Segmentation, in European Conference on Computer Vision, 2020, 142–158.

29. Achanta, Shaji, Smith, et al., SLIC superpixels compared to State-of-the-Art superpixel methods, IEEE Transactions on Pattern Analysis and Machine Intelligence 34(11) (2012), 2274–2281.

30. Yang, Liu, Li, et al., Prototype Mixture Models for Few-Shot Semantic Segmentation, in European Conference on Computer Vision (ECCV), 2020, 763–778.

31. Yang, Zhuo, Qi, et al., Mining Latent Classes for Few-shot Segmentation, in 18th IEEE/CVF International Conference on Computer Vision (ICCV), 2021, 8701–8710.

32. Fan, Pei, Tai Y.-W., et al., Self-support Few-Shot Semantic Segmentation, in 17th European Conference on Computer Vision, 2022, 701–719.

33. Sung, Yang, Zhang, et al., Learning to compare: Relation network for few-shot learning, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, 1199–1208.

34. , Wei, Chen Y.P., et al., FSS-1000: A 1000-class dataset for few-shot segmentation, in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, 2866–2875.

35. Zhang, Lin, Liu, et al., CANet: Class-Agnostic Segmentation Networks With Iterative Refinement and Attentive Few-Shot Learning, in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, 5212–5221.

36. Tian, Zhao, Shu, et al., Prior guided feature enrichment network for few-shot segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence 44(2) (2022), 1050–1065.

37. Zhang B.F., Xiao J.M., Qin, et al., Self-Guided and Cross-Guided Learning for Few-Shot Segmentation, in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, 8308–8317.

38. , Jampani, Sevilla-Lara, et al., Adaptive Prototype Learning and Allocation for Few-Shot Segmentation, in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, 8330–8339.

39. Liu, Bao, Xie G.-S., et al., Dynamic Prototype Convolution Network for Few-Shot Semantic Segmentation, in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, 11543–11552.

40. Xie G.-S., Xiong, Liu, et al., Few-shot semantic segmentation with cyclic memory network, in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, 7293–7302.

41. Cheng, Lang and Han, Holistic prototype activation for few-shot segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence (2022), 1–17.

42. , Shi, Lin, et al., Learning meta-class memory for few-shot semantic segmentation, in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, 517–526.

43. Lang, Cheng, Tu, et al., Learning What Not to Segment: A New Perspective on Few-Shot Segmentation, in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, 8047–8057.

44. Lang, Cheng, Tu, et al., Few-shot segmentation via divide-and-conquer proxies, International Journal of Computer Vision (2023), 1–23.

45. Niu, Zhong and Yu H.J.N., A review on the attention mechanism of deep learning, Neurocomputing 452 (2021), 48–62.

46. Guo M.-H., Xu T.-X., Liu J.-J., et al., Attention mechanisms in computer vision: A survey, Computational Visual Media 8(3) (2022), 331–368.

47. Wang, Jiang, Qian, et al., Residual attention network for image classification, in 30th IEEE Conference on Computer Vision and Pattern Recognition, 2017, 6450–6458.

48. Woo, Park, Lee J.-Y., et al., CBAM: Convolutional Block Attention Module, in European Conference on Computer Vision (ECCV), 2018, 3–19.

49. , Shen and Sun, Squeeze-and-Excitation Networks, in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, 7132–7141.

50. Vaswani, Shazeer, Parmar, et al., Attention is all you need, in Advances in Neural Information Processing Systems, 2017, 5999–6009.

51. Hou R.B., Chang, Ma B.P., et al., Cross Attention Network for Few-shot Classification, in Advances in Neural Information Processing Systems (NeurIPS), 2019.

52. , Yang, Zhang, et al., Attention-Based Multi-Context Guiding for Few-Shot Semantic Segmentation, in AAAI Conference on Artificial Intelligence, 2019, 8441–8448.

53. , Liu, Zhu, et al., ARNet: Attention-based refinement network for few-shot semantic segmentation, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, 2238–2242.

54. Liu, Zhang, Lin, et al., CRNet: Cross-Reference Networks for Few-Shot Segmentation, in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, 4164–4172.

55. Liu, Zhang, Lin, et al., CRCNet: Few-shot segmentation with cross-reference and region-global conditional networks, International Journal of Computer Vision 130(12) (2022), 3140–3157.

56. Yang, Meng, Li, et al., A New Local Transformation Module for Few-Shot Segmentation, in International Conference on MultiMedia Modeling, 2020, 76–87.

57. Liu, Guo, Zhu, et al., Mining semantic information from intra-image and cross-image for few-shot segmentation, Multimedia Tools and Applications 81(13) (2022), 18305–18326.

58. Gairola, Hemani, Chopra, et al., SimPropNet: Improved Similarity Propagation for Few-shot Image Segmentation, in 29th International Joint Conference on Artificial Intelligence, 2021, 573–579.

59. Liu, Peng, Chen, et al., FECANet: Boosting few-shot semantic segmentation with feature-enhanced context-aware network, IEEE Transactions on Multimedia (2023), 1–13.

60. Tian, Wu, Qi, et al., Differentiable meta-learning model for few-shot semantic segmentation, in AAAI Conference on Artificial Intelligence, 2020, 12087–12094.

61. Cao, Zhang, Diao, et al., Meta-Seg: A generalized meta-learning framework for multi-class few-shot semantic segmentation, IEEE Access 7 (2019), 166109–166121.

62. Zhu, Zhai and Cao, Self-supervised tuning for few-shot segmentation, in International Joint Conference on Artificial Intelligence (IJCAI), 2020, 1019–1025.

63. Veličković, Cucurull and C. A, Graph Attention Networks, in International Conference on Learning Representations (ICLR), 2018, 1–12.

64. Wang, Zhang, Hu, et al., Few-Shot Semantic Segmentation with Democratic Attention Networks, in European Conference on Computer Vision (ECCV), 2020, 730–746.

65. Xie G.S., Liu, Xiong, et al., Scale-Aware Graph Neural Network for Few-Shot Semantic Segmentation, in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, 5471–5480.

66. Gao, Xiao, Yin, et al., A Mutually Supervised Graph Attention Network for Few-Shot Segmentation: The Perspective of Fully Utilizing Limited Samples, IEEE Transactions on Neural Networks and Learning Systems (2022), 1–13.

67. Han, Wang, Chen, et al., A survey on vision transformer, IEEE Transactions on Pattern Analysis and Machine Intelligence 45(1) (2022), 87–110.

68. Zhang, Kang, Yang, et al., Few-shot segmentation via cycle-consistent transformer, Advances in Neural Information Processing Systems 34 (2021), 21984–21996.

69. Zhang, Wu, et al., CATrans: Context and affinity transformer for few-shot segmentation, arXiv preprint arXiv:.12817 (2022).

70. Peng, Tian, Wu, et al., Hierarchical Dense Correlation Distillation for Few-Shot Segmentation, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, 23641–23651.

71. Wang, Sun and Zhang, Rethinking the Correlation in Few-Shot Segmentation: A Buoys View, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, 7183–7192.

72.

Shi

, Wei

, Zhang

, et al., Dense Cross-Query-and-Support Attention Weighted Mask Aggregation for Few-Shot Segmentation, in European Conference on Computer Vision (ECCV), 2022, 151–168.

73.

, Sun

and Yang

, Suppressing the heterogeneity: A strong feature extractor for few-shot segmentation, in The Eleventh International Conference on Learning Representations, 2023.

74.

Everingham

, Van Gool

, Williams

C.K.I.

, et al., The Pascal Visual Object Classes (VOC) Challenge, International Journal of Computer Vision 88(2) (2010), 303–338.

75.

Hariharan

, Arbelaez

, Bourdev

, et al., Semantic Contours from Inverse Detectors, in IEEE International Conference on Computer Vision (ICCV), 2011, 991–998.

76.

Lin

T.Y.

, Maire

, Belongie

, et al., Microsoft COCO: Common Objects in Context, in European Conference on Computer Vision (ECCV), 2014, 740–755.

77.

Simonyan

and Zisserman

, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556, (2014).

78.

Krizhevsky

, Sutskever

and Hinton

G.E.

, ImageNet classification with deep convolutional neural networks, Communications of the ACM 60(6) (2017), 84–90.

79.

Liu

, Lin

, Cao

, et al., Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, in IEEE/CVF International Conference on Computer Vision (ICCV), 2021, 9992–10002.

80. Touvron, Cord, Douze, et al., Training data-efficient image transformers & distillation through attention, in International Conference on Machine Learning, 2021, 10347–10357.

81. Roy A.G., Siddiqui, Polsterl, et al., ‘Squeeze & excite’ guided few-shot segmentation of volumetric images, Medical Image Analysis 59 (2020), 1–12.

82. Q.J., Dang, Tajbakhsh, et al., A location-sensitive local prototype network for few-shot medical image segmentation, in 18th IEEE International Symposium on Biomedical Imaging (ISBI), 2021, 262–266.

83. Huang, Xu, Shen, et al., Rethinking Few-Shot Medical Segmentation: A Vector Quantization View, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, 3072–3081.

84. Guha Roy, Siddiqui, Polsterl, et al., Squeeze & excite guided few-shot segmentation of volumetric images, Medical Image Analysis 59 (2020), 101587.

85. Khadka, Jha, Hicks, et al., Meta-learning with implicit gradients in a few-shot setting for medical image segmentation, Computers in Biology and Medicine 143 (2022), 105227.

86. Achmamad, Ghazouani and Ruan, Few-shot learning for brain tumor segmentation from MRI images, in IEEE International Conference on Signal Processing (2022), 489–494.

87. Wang, Cao, Wei, et al., Alternative Baselines for Low-Shot 3D Medical Image Segmentation-An Atlas Perspective, in AAAI Conference on Artificial Intelligence (2021), 634–642.

88. Feng, Zheng, Gao, et al., Interactive few-shot learning: Limited supervision, better medical image segmentation, IEEE Transactions on Medical Imaging 40(10) (2021), 2575–2588.

89. Zhao, Balakrishnan, Durand, et al., Data augmentation using learned transformations for one-shot medical image segmentation, in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, 8535–8545.

90. Ding, Sun, Tang, et al., Few-shot medical image segmentation with cycle-resemblance attention, in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, 2488–2497.

91. Wang X.Y., Yuan Y.W., Guo D.Y., et al., SSA-Net: Spatial self-attention network for COVID-19 pneumonia infection segmentation with semi-supervised few-shot learning, Medical Image Analysis 79 (2022), 102459.

92. Chen, Yao, Zhou, et al., Momentum contrastive learning for few-shot COVID-19 diagnosis from chest CT images, Pattern Recognition 113 (2021), 107826.

93. Jadon, COVID-19 detection from scarce chest x-ray image data using few-shot deep learning approach, in Medical Imaging 2021: Imaging Informatics for Healthcare, Research and Applications, 2021, The Society of Photo-Optical Instrumentation Engineers (SPIE).

94. Abdel-Basset, Chang, Hawash, et al., FSS–nCov: A deep learning architecture for semi-supervised few-shot segmentation of COVID-19 infection, Knowledge-Based Systems 212 (2021), 106647.

95. Zhao, Chua T.-S. and Lee G.H., Few-shot 3D Point Cloud Semantic Segmentation, in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, 8869–8878.

96. Mao, Guo, Lu, et al., Bidirectional Feature Globalization for Few-shot Semantic Segmentation of 3D Point Cloud Scenes, in International Conference on 3D Vision, 2022, 505–514.

97. Sharma, Dash, Roy Chowdhury, et al., PriFit: Learning to fit primitives improves few shot point cloud segmentation, Computer Graphics Forum 41(5) (2022), 39–50.

98. Zhu, Cao, Zhai, et al., One-shot texture retrieval using global grouping metric, IEEE Transactions on Multimedia 23 (2021), 3726–3737.

99. Bhunia A.K., Bhunia A.K., Ghose, et al., A deep one-shot network for query-based logo retrieval, Pattern Recognition 96 (2019), 106965.

100. Bao Y.Q., Song K.C., Liu, et al., Triplet-graph reasoning network for few-shot metal generic surface defect segmentation, IEEE Transactions on Instrumentation and Measurement 70 (2021), 3083561.

101. Y.J., Zhang P.F., Xu, et al., Few-shot prototype alignment regularization network for document image layout segmentation, Pattern Recognition 115 (2021), 107882.

102. Tian, Lai, Jiang, et al., Generalized Few-shot Semantic Segmentation, in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, 11553–11562.

103. Liu S.-A., Zhang, Qiu, et al., Learning Orthogonal Prototypes for Generalized Few-Shot Semantic Segmentation, in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, 11319–11328.

104. Hajimiri, Boudiaf, Ben Ayed, et al., A Strong Baseline for Generalized Few-Shot Semantic Segmentation, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, 11269–11278.

105. Azad, Fayjie A.R., Kauffmann, et al., On the Texture Bias for Few-Shot CNN Segmentation, in IEEE Winter Conference on Applications of Computer Vision (WACV), 2021, 2673–2682.

106. Raza, Ravanbakhsh, Klein, et al., Weakly Supervised One Shot Segmentation, in IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), 2019, 1401–1406.

107. Han and Oh T.H., Learning Few-shot Segmentation from Bounding Box Annotations, in IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023, 3739–3748.

108. Kang, Koniusz, Cho, et al., Distilling Self-Supervised Vision Transformers for Weakly-Supervised Few-Shot Classification & Segmentation, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, 19627–19638.

109. Pambala A.K., Dutta and Biswas, SML: Semantic meta-learning for few-shot semantic segmentation, Pattern Recognition Letters 147 (2021), 93–99.

110. Yang, Chen, Feng, et al., MIANet: Aggregating Unbiased Instance and General Information for Few-Shot Semantic Segmentation, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, 7131–7140.

111. Amac M.S., Sencan, Baran O.B., et al., MaskSplit: Self-supervised Meta-learning for Few-shot Semantic Segmentation, in IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2022, 428–438.

112. Kalluri and Chandraker, Cluster-to-adapt: Few Shot Domain Adaptation for Semantic Segmentation across Disjoint Labels, in IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2022, 4120–4130.

113. Wang, Duan, Wang, et al., Remember the Difference: Cross-Domain Few-Shot Semantic Segmentation via Meta-Memory Transfer, in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, 7055–7064.

114. Ahmed, Lin J.C.-W. and Srivastava, Ensemble-based deep meta learning for medical image segmentation, Journal of Intelligent & Fuzzy Systems 42(5) (2022), 4307–4313.

115. Lei, Zhang X.C., He J.F., et al., Cross-Domain Few-Shot Semantic Segmentation, in European Conference on Computer Vision (ECCV), 2022, 73–90.