Abstract
Few-shot text classification aims to learn a classifier from very few labeled text data. Existing studies on this topic mainly adopt prototypical networks and focus on the interactive information between the support set and query instances to learn generalized class prototypes. However, during encoding, these methods attend only to the matching information between the support set and query instances, and ignore much useful information about intra-class similarity and inter-class dissimilarity among all support samples. Therefore, in this paper we propose a negative-supervised capsule graph neural network (NSCGNN), which explicitly makes use of the similarity and dissimilarity between samples to bring text representations of the same class closer together and push those of different classes farther apart, leading to representative and discriminative class prototypes. We first construct a graph to obtain text representations in the form of node capsules, where both intra-cluster similarity and inter-cluster dissimilarity between all samples are explored with information aggregation and negative supervision. Then, to induce generalized class prototypes from the node capsules obtained from the graph neural network, we apply the dynamic routing algorithm. Experimental results demonstrate the effectiveness of the proposed NSCGNN model, which outperforms existing few-shot approaches on three benchmark datasets.
Introduction
Text classification is one of the fundamental tasks in natural language processing, with broad applications such as topic labeling [41, 47], sentiment analysis [36, 50] and question answering [17, 21]. Conventional supervised models based on feature engineering [27, 28] and deep learning [16, 46] have been widely explored on this task and have achieved great success. However, these methods rely on large-scale manually labeled data, which is often lacking, and thus are hard to generalize to new classes.
Intuitively, humans can promptly assimilate new knowledge and grasp a fresh concept from just a few instances [11]. This has motivated considerable interest in Few-Shot Learning (FSL), which aims to efficiently solve new tasks with only a few labeled examples. Hence, this paper focuses on the few-shot text classification task, which aims to classify a query instance given only a few labeled support instances of each class, as shown by the example in Table 1.
An example of 5-way 1-shot text classification in HuffPost dataset
The correct text class for the query instance is class B: travel. Other query instances are omitted for saving space.
Few-shot learning has been studied extensively in the computer vision field. Some optimization-based methods [5, 26] extract transferable knowledge between tasks and learn to optimize model parameters rapidly, while some metric-based methods [19, 38] learn the distance distributions among classes. Among these methods, Prototypical Networks (PN) [33], which average the embeddings of support set samples to obtain class prototypes and calculate the distance between the query and each class prototype, achieve state-of-the-art results on several benchmarks.
In natural language processing, several methods [6, 48] based on PN and its variants have been proposed. A key challenge faced by these methods is how to learn generalized class-wise representations from few labeled support instances. To address this problem, MLMAN [48] and Proto-HATT [6] apply attention mechanisms to focus on the support instances that are more relevant to the query. Induction Network [9] adopts a dynamic routing mechanism to give prototypical networks more flexibility. Although these models highlight the interaction between support set instances and queries, they ignore the intra-class similarity and inter-class dissimilarity between support samples, which are also important for learning representations of class prototypes and queries. In general, by considering intra-class similarity and inter-class dissimilarity between instances, we can bring the representations of support instances in the same class closer together and push those in different classes farther apart. This makes it easier to learn representative and discriminative class prototypes and to classify the queries correctly. Therefore, in this paper we use graph neural networks (GNNs) with negative supervision to explore the intra-class similarity and inter-class dissimilarity between all instances and thereby learn better class prototypes and query representations.
Graph neural networks (GNNs) have shown advantage in expressing the relationship among data instances and are suitable for the FSL problem. Previous GNN approaches [7, 23] in few-shot learning are mainly based on label propagation. Different from them, we aim to learn sample representations which incorporate similarity or dissimilarity information from neighbor nodes. Recently, Kim et al. [18] devise an Edge-labeling GNN (EGNN) which directly exploits both intra-cluster similarity and inter-cluster dissimilarity with 2-dimensional edge features for few-shot image classification. However, it performs classification with sample-wise comparison rather than class-wise comparison, which may be severely disturbed by the various expressions in the same class in NLP. In contrast, we adopt dynamic routing algorithm to learn class prototypes and classify the query instances with class-wise comparison, avoiding the noise caused by the various expressions of text.
In this paper, to exploit the similarity and dissimilarity of all samples for learning representative and discriminative class prototypes and query representations, we propose a Negative-Supervised Capsule Graph Neural Network (NSCGNN) for few-shot text classification. NSCGNN consists of two main stages: learning instance-level representations and learning class-level representations. In the first stage, we devise a GNN with negative supervision to learn the representations of all support and query instances, where the GNN is composed of a number of layers and each layer has a node-update block and an edge-update block. In the node-update block, each node represents an instance and is updated by aggregating the features of similar neighbor nodes. Meanwhile, the features of dissimilar neighbor nodes are regarded as negative supervision signals that encourage distinct representations for nodes with different labels. Compared with methods that focus only on the interactive information between support instances and queries, our model applies a graph structure to explore the similarity and dissimilarity between all instances, and adopts information aggregation and negative supervision to bring node features of the same class closer together and push those of different classes farther apart in the feature space. After node updating, the edge features are adjusted according to the updated nodes. Inspired by EGNN [18], the edge features in this paper are also 2-dimensional vectors that explicitly indicate the strengths of the intra- and inter-class relations of the two connected nodes. In the second stage, inspired by the capsule network [30], we regard the node features extracted from the GNN as basic-level capsules and the class representations as high-level capsules, and adopt the dynamic routing algorithm to learn the class prototypes.
In addition, since the node features at different GNN layers aggregate adjacent nodes within different numbers of steps and can provide multi-aspect sample information, we use the node features extracted from all GNN layers to induce the final class prototypes. Compared with methods that simply learn class prototypes by applying the dynamic routing mechanism to the original support set, our model combines a GNN with a capsule neural network and takes full advantage of the multi-aspect sample information provided by the GNN. Finally, we evaluate our proposed NSCGNN model on 3 benchmark datasets [12, 25]; it improves 5-way 1-shot accuracy by about 4.0%-5.5% and 5-way 5-shot accuracy by about 1.5%-3.0% against the best baseline for each dataset.
In summary, our main contributions are as follows:
1. We propose a Negative-Supervised Capsule Graph Neural Network for few-shot text classification. To the best of our knowledge, our model is the first one to apply negative supervision in GNN and combine GNN with capsule neural network in a few-shot learning task.
2. In the procedure of learning instance representations, our model explicitly exploits both intra-class similarity and inter-class dissimilarity between all samples with graph neural networks, and adopts information aggregation and negative supervision to bring node features of the same class closer together and push those of different classes farther apart.
3. In the procedure of learning class prototypes, by combining GNN with capsule neural network, the node features of all GNN layers can provide multi-aspect information of samples and are considered by the dynamic routing algorithm to induce representative and discriminative class prototypes.
4. By learning representative and discriminative class prototypes and query representations, our proposed NSCGNN model outperforms existing few-shot models on three benchmark datasets [12, 25]. It improves 5-way 1-shot accuracy by 5.5%, 3.9% and 5.4% on the three datasets respectively and 5-way 5-shot accuracy by 3.2%, 1.5% and 2.3% respectively, against the best baseline for each dataset.
Few-shot learning
In the few-shot learning paradigm, a model is required to generalize to new tasks with only a few labeled samples. Early works [4, 14] apply transfer learning methods to fine-tune pre-trained models for FSL or adopt data augmentation [31] techniques to slightly alleviate the overfitting problem. Recently, the idea of meta-learning [5] has proven effective for the FSL problem; it encourages models to acquire fast-learning abilities from previous experience and to generalize rapidly to new concepts. Existing meta-learning approaches for FSL mainly include metric-based methods [19, 38] and optimization-based methods [5, 29].
Metric-based Methods. Metric-based methods map instances into a distance space with learned projection functions, and then compare the distances between queries and support sets to make a classification. Koch et al. [19] propose Siamese Networks to identify whether input pairs belong to the same class by computing pair-wise distances. Vinyals et al. [38] present Matching Networks to explore a weighted K-nearest neighbor classifier augmented with external memories. Considering that test and train conditions must match, they also propose an episodic training strategy, which is widely adopted by subsequent studies. Prototypical Networks [33] average the embeddings of support instances to derive the prototype of each class and compare the distances between the query and the prototypes to make a classification. Relation Networks [35] learn a deep distance metric instead of a fixed metric to compare the query against labeled examples. Recently, many PN-based studies have been proposed for few-shot text classification. Proto-HATT [6] and MLMAN [48] devise multi-level attention schemes to focus on the interaction between query and support instances. Induction Networks [9] apply a dynamic routing mechanism to learn more generalized class prototypes, and DMIN [8] further leverages a memory component to enhance Induction Networks. Nevertheless, these methods neglect the useful similarity and dissimilarity information between support instances. Therefore, in this paper we make use of a graph structure and negative supervision to exploit the intra-class similarity and inter-class dissimilarity between each pair of samples, bringing text representations of the same class closer together and pushing those of different classes farther apart.
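As a minimal illustration of the prototype-and-compare scheme used by Prototypical Networks [33], the Python sketch below averages the support embeddings of each class and assigns a query to the nearest prototype. The helper names, the toy 2-dimensional embeddings and the use of plain Euclidean distance are assumptions made for illustration only; they are not taken from any of the cited implementations.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify_by_prototype(support, query):
    """support maps a class label to its list of embeddings.

    Returns the label whose prototype (mean embedding) is closest to query.
    """
    prototypes = {label: mean_vector(vecs) for label, vecs in support.items()}
    return min(prototypes, key=lambda label: euclidean(prototypes[label], query))

# Toy 2-way 2-shot task with hypothetical embeddings.
support = {
    "travel": [[1.0, 0.0], [0.8, 0.2]],
    "sports": [[0.0, 1.0], [0.2, 0.8]],
}
predicted = classify_by_prototype(support, [0.9, 0.2])
```

Because the prototype is a plain average, every support instance contributes equally, which is the limitation that the attention-based and routing-based variants discussed above try to relax.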
Optimization-based Methods. Optimization-based methods learn to optimize model parameters with only few labeled instances by extracting some transferable knowledge between tasks. Ravi and Larochelle [29] propose an LSTM-based meta-learner to learn the initial values of model parameters and update them. MAML [5] is a model agnostic approach, which aims to learn a representation that is easily fit to new data with few steps of gradient descent. Mishra et al. [24] present SNAIL to internalize past experience with a combination of temporal convolutions and soft attention. ATAML [15] improves MAML with attention mechanism to be more effective for few-shot text classification.
Graph neural network
Graph neural networks were first proposed in [10, 32] to deal directly with graph-structured data and have been applied in many areas in recent years, such as recommender systems [39, 42], semantic parsing [2, 34] and information extraction [22, 49]. Since a GNN can model complex interactions among data instances by recursively aggregating and transforming features of neighboring nodes, it has great potential for solving the FSL problem.
Garcia and Bruna [7] first propose to construct a densely connected graph from all of the support instances and a query. Each input node is represented by the concatenation of the instance features and the given label information. The model then iteratively updates node features through neighborhood aggregation and classifies the unlabeled query. TPN [23] explicitly models transductive inference in FSL and propagates labels from labeled support instances to all of the unlabeled queries. Although these models make progress in FSL, they are mainly based on label propagation, without intuitively expressing the intra- and inter-class relationships. To address this, Kim et al. [18] devise the EGNN model with 2-dimensional edge features to explore both intra- and inter-class relations between nodes, and classify the query by calculating similarity scores at the sample level instead of the class level. However, EGNN utilizes inter-class dissimilarity information through simple feature fusion, which introduces new learnable weights and noise, while our model directly employs a cosine similarity loss function as the negative supervision signal during training and does not introduce any new learnable weights. Moreover, the sample-wise comparison in EGNN is not well suited to text classification because the diversity of text causes much more noise than in images. In contrast, we induce class prototypes from all node features in the GNN and classify by class-wise comparison to avoid this text noise problem.
Capsule neural network
Capsule Neural Network (CapsNet) was first proposed by [30] and designed for image feature extraction, in which the transmission of information between layers follows a dynamic routing mechanism. In the natural language processing field, Yang et al. [45] successfully apply CapsNet to text classification with large labeled datasets. Geng et al. [9] first propose Induction Networks, which use capsules and dynamic routing to induce generalized class prototypes from samples. They then present Dynamic Memory Induction Networks [8], which enhance Induction Networks with a memory component. Different from them, we combine GNN with CapsNet to deal with few-shot text classification and consider the node features of all layers. In theory, the node features of different GNN layers provide multi-aspect features of text instances, which helps the dynamic routing algorithm induce generalized class prototypes.
Problem definition
In few-shot text classification, we are given two datasets,
Episodic training [38] has been proven to be an efficient way of meta-learning, the main idea of which is to sample numerous training tasks (i.e. episodes) from
Concretely, both training and test tasks of the N-way K-shot problem are formulated as
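To make the episodic N-way K-shot setting concrete, the following Python sketch samples one episode from a labeled pool: N classes are drawn, and for each class K support texts and a fixed number of query texts are drawn without overlap. The function name `sample_episode`, its parameters and the toy data are hypothetical and serve only as an illustration under these assumptions.

```python
import random

def sample_episode(data_by_class, n_way, k_shot, n_query, rng):
    """Sample one N-way K-shot episode.

    data_by_class maps a class label to its list of texts.
    Returns (support, query), each a list of (text, label) pairs.
    """
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for label in classes:
        # Draw support and query texts together so that they never overlap.
        picked = rng.sample(data_by_class[label], k_shot + n_query)
        support += [(text, label) for text in picked[:k_shot]]
        query += [(text, label) for text in picked[k_shot:]]
    return support, query

# Toy pool: 8 classes with 10 placeholder texts each.
toy_pool = {f"class_{c}": [f"text_{c}_{i}" for i in range(10)] for c in range(8)}
rng = random.Random(0)
support, query = sample_episode(toy_pool, n_way=5, k_shot=1, n_query=5, rng=rng)
```

In a 5-way 1-shot episode with 5 queries per class, this yields 5 support pairs and 25 query pairs whose labels all come from the same 5 sampled classes.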
Methodology
As shown in Figure 1, our proposed NSCGNN consists of four modules: graph construction, feature aggregation and update, class prototype induction, and matching and training. In the rest of this section, we introduce these modules in detail.

The overall framework of the proposed NSCGNN model. In this illustration, a 2-way 2-shot problem is presented as an example. Orange and blue circles represent two different classes. The strength of edge feature is represented by the color in the square.
This module describes how we construct graphs for the few-shot classification problem, including representation extraction and graph initialization.
Representation Extraction. Given an input text x = {w_1, w_2, …, w_l} composed of l words, we extract its text representation from the pre-trained language model BERT [3] to better reflect the semantic information. Following [3], we place the special tokens [CLS] and [SEP] at the start and end of x respectively and feed the sequence into the BERT model. The output state of BERT that corresponds to the [CLS] token is then used as our text representation. In this way, we obtain the representations of all support and query samples of the target task, as shown in Figure 1(a1).
Graph Initialization. Based on the text representations, we initially construct a fully-connected graph G = (V, E), where each node v_i ∈ V represents a sample, and each edge e_ij ∈ E represents the type of relationship between the two connected nodes, as shown in Figure 1(a2). For an N-way K-shot problem, the size of the support set is fixed at N × K, and the size of the query set is determined by the number of queries selected from each class. For example, if we select 5 queries from each class, the size of the query set is N × 5. To better adapt the support samples to each query sample, each graph predicts the label of a single query. Therefore, if N × 5 query samples are selected, we construct N × 5 graphs, each composed of N × K + 1 nodes. We then form a meta-episode from all N × 5 graphs and feed them to the model together as one single N-way K-shot task. For simplicity, we only introduce the framework of one graph in the following. In each graph, node features are initialized by the output of BERT as mentioned above.
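The graph layout described above can be sketched in Python as follows: one fully-connected graph is built per query, each holding the N × K support nodes plus that single query node. The initial 2-dimensional edge values used here ((1, 0) for same-class pairs, (0, 1) for different-class pairs and (0.5, 0.5) for pairs involving the unlabeled query) follow the EGNN-style convention and are an assumption for illustration, as are the function name and the toy features.

```python
def build_query_graphs(support, queries):
    """Build one fully-connected graph per query.

    support: list of (feature, label) pairs; queries: list of features.
    Each edge is a pair (similarity, dissimilarity).
    """
    graphs = []
    for query_feature in queries:
        nodes = [feature for feature, _ in support] + [query_feature]
        labels = [label for _, label in support] + [None]  # query label unknown
        size = len(nodes)
        edges = [[None] * size for _ in range(size)]
        for i in range(size):
            for j in range(size):
                if i == j:
                    edges[i][j] = (1.0, 0.0)   # a node is similar to itself
                elif labels[i] is None or labels[j] is None:
                    edges[i][j] = (0.5, 0.5)   # relation to the query is unknown
                elif labels[i] == labels[j]:
                    edges[i][j] = (1.0, 0.0)   # same class
                else:
                    edges[i][j] = (0.0, 1.0)   # different classes
        graphs.append({"nodes": nodes, "edges": edges})
    return graphs

# Toy 2-way 2-shot task with 5 queries per class (10 queries in total).
support = [([1.0, 0.0], "A"), ([0.9, 0.1], "A"), ([0.0, 1.0], "B"), ([0.1, 0.9], "B")]
queries = [[0.5, 0.5] for _ in range(2 * 5)]
graphs = build_query_graphs(support, queries)
```

For this 2-way 2-shot example the sketch produces 2 × 5 = 10 graphs with 2 × 2 + 1 = 5 nodes each, matching the counts given in the text.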
A GNN [10, 32] consists of multiple layers that process the graph, and it iteratively updates node features and edge features through these layers. The specific aggregation and update algorithm is detailed in the next section.
Our proposed NSCGNN consists of L layers, each of which includes a node feature update and an edge feature update. The node feature update aggregates information from similar neighbor nodes and, through negative supervision, pushes each node away from dissimilar nodes. The edge feature update dynamically computes the intra-class similarity and inter-class dissimilarity scores of the two connected nodes.
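One simple way to picture the negative supervision signal is as a penalty on the cosine similarity between nodes that carry different labels. The Python sketch below averages that similarity over all different-class node pairs; this averaging form, the function names and the toy features are assumptions for illustration and should not be read as the exact loss used in our model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def negative_supervision_loss(nodes, labels):
    """Mean cosine similarity over all node pairs with different labels.

    Minimizing this value pushes nodes of different classes apart.
    """
    pairs = [
        (i, j)
        for i in range(len(nodes))
        for j in range(i + 1, len(nodes))
        if labels[i] != labels[j]
    ]
    if not pairs:
        return 0.0
    return sum(cosine(nodes[i], nodes[j]) for i, j in pairs) / len(pairs)

# Two toy configurations: classes overlapping vs. classes well separated.
loss_overlapping = negative_supervision_loss([[1.0, 0.0], [0.9, 0.1]], ["A", "B"])
loss_separated = negative_supervision_loss([[1.0, 0.0], [0.0, 1.0]], ["A", "B"])
```

The overlapping configuration receives the larger penalty, so gradient descent on such a term would favor the separated one. Note that a loss of this kind adds no learnable weights.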
Node feature update. Given
Edge feature update. After obtaining the newly updated node features from the node feature update, we re-calculate the similarity and dissimilarity between each pair of nodes and update the edge features as follows:
After L rounds of iterative aggregation and update, the node and edge features gradually integrate intra- and inter-class information. Unlike EGNN [18], which directly predicts the label of the query with sample-wise comparison, we make a prediction with class-wise comparison, i.e., learning to compare a query instance with class prototypes. The intuition is that, compared with images, text is more diverse and noisy, and the sample-wise comparison may be severely disturbed by the various expressions within the same class [9]. As a result, in this paper we combine the GNN with the dynamic routing algorithm [30] to learn generalized class prototypes for comparison and classification.
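The alternation between node and edge updates within one layer can be sketched in Python as below. The concrete rules are simplifying assumptions chosen for illustration: each node becomes the average of all nodes weighted by the similarity channel of its edges, and each edge is then recomputed from the cosine similarity of the updated nodes, rescaled to [0, 1], with the dissimilarity channel as its complement. No learnable transformation is included, and the function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def update_nodes(nodes, edges):
    """Aggregate node features weighted by the similarity channel edges[i][j][0]."""
    dim = len(nodes[0])
    updated = []
    for i in range(len(nodes)):
        weights = [edges[i][j][0] for j in range(len(nodes))]
        total = sum(weights) or 1.0  # avoid division by zero for isolated nodes
        updated.append(
            [sum(w * nodes[j][d] for j, w in enumerate(weights)) / total for d in range(dim)]
        )
    return updated

def update_edges(nodes):
    """Recompute (similarity, dissimilarity) for every pair of updated nodes."""
    size = len(nodes)
    edges = [[None] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            similarity = (cosine(nodes[i], nodes[j]) + 1.0) / 2.0
            edges[i][j] = (similarity, 1.0 - similarity)
    return edges

# Nodes 0 and 1 share a class; node 2 belongs to a different class.
nodes = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
edges = [
    [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)],
    [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)],
    [(0.0, 1.0), (0.0, 1.0), (1.0, 0.0)],
]
new_nodes = update_nodes(nodes, edges)
new_edges = update_edges(new_nodes)
```

Stacking L such layers and keeping the node features of every layer gives the K × L representations per class that the prototype induction step consumes.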
Inspired by [30, 44], we regard node features extracted from GNN as base-level capsules and class prototypes as higher-level capsules. Node features of all layers of GNN are considered by the dynamic routing algorithm because they can provide multi-aspect sample information. Therefore, for each class i (i = 1, … N), we can totally obtain K × L node representations from the GNN. Formally, given support node sets
Then dynamic routing is applied iteratively to adjust the coupling coefficients d_i with regard to the base-level capsules v_ij and the higher-level capsules c_i:
Finally, after a number of iterations, the dynamic routing algorithm automatically learns more representative class prototypes c_i based on the node features of class i. Besides, against query node set
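The routing loop for a single class can be sketched in Python as follows, using the standard routing-by-agreement procedure of [30]: coupling coefficients are a softmax over routing logits, the class capsule is the squashed weighted sum of the base-level capsules, and each logit grows with the agreement between its capsule and the class capsule. The sketch omits any learned transformation of the capsules, and the function names and toy capsules are assumptions for illustration; the default of 3 iterations matches the setting reported in the implementation details.

```python
import math

def squash(vector):
    """Capsule squashing: keeps the direction, maps the norm into [0, 1)."""
    squared_norm = sum(x * x for x in vector)
    if squared_norm == 0.0:
        return [0.0] * len(vector)
    scale = squared_norm / (1.0 + squared_norm) / math.sqrt(squared_norm)
    return [scale * x for x in vector]

def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_class_prototype(capsules, iterations=3):
    """Induce one class prototype from the base-level capsules of that class.

    Returns the prototype and the coupling coefficients implied by the
    routing logits after the last update.
    """
    logits = [0.0] * len(capsules)
    prototype = [0.0] * len(capsules[0])
    for _ in range(iterations):
        coupling = softmax(logits)
        weighted_sum = [
            sum(d * v[k] for d, v in zip(coupling, capsules))
            for k in range(len(capsules[0]))
        ]
        prototype = squash(weighted_sum)
        # Raise the logit of capsules that agree with the current prototype.
        logits = [
            b + sum(x * y for x, y in zip(v, prototype))
            for b, v in zip(logits, capsules)
        ]
    return prototype, softmax(logits)

# Two consistent capsules and one outlier for the same class.
capsules = [[1.0, 0.0], [0.9, 0.1], [-0.2, 1.0]]
prototype, coupling = route_class_prototype(capsules)
```

In this toy example the outlier ends up with the smallest coupling coefficient, which is the behavior that distinguishes routing from simple averaging.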
This section describes how we predict the label of the query and train the model. After the representations of class c_i and query instance q have been determined, we use a cosine similarity function to calculate the matching score s_i between q and c_i:
In the stage of node feature update, the training objective is to encourage more distinct representations of nodes in different classes, i.e., minimizing the loss function L_ns:
In the stage of classification, the objective function we adopt is large margin cosine loss [40] which is defined as:
where M is the number of queries to be classified in a mini-batch, s and m are two learnable scalar values, and y_i is the ground-truth label of q_i. This loss function makes the prediction score of the correct class larger than those of the other classes by a margin.
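The matching and classification objective can be illustrated with the Python sketch below. It computes cosine matching scores between each query and the class prototypes and then evaluates the large margin cosine loss in its standard form from [40], where the margin m is subtracted from the score of the correct class before scaling by s. The default values s = 7.0 and m = 0.2 are those reported in the implementation details; the function names and toy vectors are illustrative assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def matching_scores(query, prototypes):
    """Cosine matching score between one query and every class prototype."""
    return [cosine(query, prototype) for prototype in prototypes]

def large_margin_cosine_loss(scores, targets, s=7.0, m=0.2):
    """Mean large margin cosine loss over a batch of queries.

    scores[i][j] is the cosine score of query i for class j;
    targets[i] is the index of the correct class of query i.
    """
    total = 0.0
    for row, target in zip(scores, targets):
        logits = [s * (c - m) if j == target else s * c for j, c in enumerate(row)]
        peak = max(logits)
        log_normalizer = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_normalizer - logits[target]
    return total / len(scores)

# Toy 3-way task with two queries and hypothetical prototypes.
prototypes = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
queries = [[0.8, 0.6], [0.1, 0.9]]
targets = [0, 1]
scores = [matching_scores(q, prototypes) for q in queries]
predictions = [max(range(len(row)), key=row.__getitem__) for row in scores]
loss = large_margin_cosine_loss(scores, targets)
```

With m > 0 the loss is strictly larger than plain scaled softmax cross-entropy on the same scores, so the model must separate the correct class by at least the margin to reach the same loss value.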
By combining Eqs. (15) and (16), the final objective function for training the whole model is defined as:
Dataset
We evaluate our model on both text classification datasets and relation classification datasets.
Amazon product data: Amazon [13] consists of customer reviews of 24 product categories which are collected from a real-world dataset Amazon.com. The average length of all instances is 141 and our goal is to classify reviews into product categories they describe.
HuffPost headlines: HuffPost [25] contains 41 types of news headlines published on huffpost.com between 2012 and 2018. These headlines are shorter and less grammatical than formal sentences, with an average length of 11. We aim to classify the headlines into their corresponding news categories.
FewRel: FewRel is a few-shot relation classification dataset presented by [12]. It contains 70000 instances on 100 relations and each instance is annotated with a head entity, a tail entity and their relation. The average length of all instances is 25 and our goal is to predict the relation between the head and tail entities.
It should be noted that the above three datasets originally contain numerous labeled instances, so a subset must first be sampled from each class for few-shot learning. To compare experimental results fairly, we directly use the processed datasets provided by [1].
Implementation details
We use the pre-trained BERT model to extract the initial text representations and construct a GNN with three layers (L = 3). In each layer, the dimension d of the node features is 768. During Class Prototype Induction, the number of iterations iter used in the dynamic routing algorithm is 3. The s and m in Eq. (17) are set to 7.0 and 0.2 respectively. All parameters are optimized by the Adam optimizer with a learning rate of 2 × 10^-5.
During meta-training, we sample 100 training episodes per epoch and apply an early stopping strategy on the validation set. To evaluate the performance of our models, we randomly sample 1000 testing episodes and report the average classification accuracy over 5 different random seeds.
Results
Overall performance
In this section, we evaluate our model in both the 5-way 1-shot and 5-way 5-shot scenarios and compare it with current state-of-the-art models. The results are reported in Table 2.
Accuracies (%) of different models on three few-shot text or relation classification datasets
As shown in Table 2, existing few-shot learning models such as Prototypical Networks (Proto) [33], Model-Agnostic Meta-Learning (MAML) [5] and Graph Neural Networks (GNN) [7], which have achieved strong performance in image classification, perform poorly on the text classification task. This may be because text expression is more abstract and diverse than images. Induction Networks [9] leverage the dynamic routing algorithm to strengthen the original Prototypical Networks and achieve better results on the HuffPost and FewRel datasets (improving 5-shot accuracy by 2.1% and 6.7% respectively), but perform worse on the long-text dataset Amazon because they are insufficient at capturing the interactive information between the query and the support set. The Multi-Level Matching and Aggregation Network (MLMAN) proposed by [48], which considers the matching information between the query and the support set at both local and instance levels to encode text in an interactive way, performs better on the FewRel dataset. EGNN [18] predicts the query with an edge-labeling graph and achieves a large improvement over GNN.
Among these models, our NSCGNN model performs best across all datasets. In general, NSCGNN improves 5-way 1-shot accuracy by about 4.0%-5.5% and 5-way 5-shot accuracy by about 1.5%-3.0% against the best baseline for each dataset. The improvement on 5-way 1-shot tasks is clearly higher. The reason is that when each class has fewer support instances, taking the node features of all layers into consideration provides much more sample information and leads to more general class representations; as the number of support instances in each class increases, this advantage of our model diminishes. We also verify this view in the following ablation study. Additionally, EGNN achieves satisfactory performance, but it applies sample-wise comparison, which may be severely disturbed by the various expressions of text. Our model combines the advantages of Induction Networks and EGNN, gaining at least 3.9% and 1.5% improvement in accuracy on 1-shot and 5-shot tasks respectively. Lastly, as mentioned above, although the training set and the test set have their own disjoint label spaces, our model generalizes well from the training set to the unseen test set when only one labeled instance is available in each class (in the 5-way 1-shot setting, the accuracy of our model reaches 54.7%, 43.2% and 77.8% respectively). In particular, when the number of labeled instances in each class increases from 1 to 5 (i.e., in the 5-way 5-shot setting), our model improves the accuracy by 10.8%, 15.4% and 11.2% on the three datasets respectively. This indicates that as long as a few more labeled instances are added to each class, our model can achieve better performance and generalization. On the other hand, although the three datasets differ significantly in topic, context and average instance length, our model performs better than the other baselines across all the datasets.
This also verifies the good generalization and robustness of our model.
To verify the effectiveness of each main component of our NSCGNN model, we conduct ablation experiments on the HuffPost and FewRel datasets and report the results in Table 3. When we remove the negative supervision mechanism, the accuracy of our model drops significantly on 5-way 1-shot tasks, by 1.79% and 1.73%, while on 5-way 5-shot tasks it drops only slightly, by 0.26% and 0.19%. The negative supervision mechanism thus has a heavier impact on 5-way 1-shot tasks. The reason may be that, in the 1-shot case, the prototype of each class is determined by only one support instance and little intra-class similarity information is available, so the negative supervision mechanism plays a more important role by encouraging distinct representations for nodes with different labels. As the number of samples in each class increases, the intra-class similarity also provides useful information for learning class prototypes. As a result, although the negative supervision mechanism also benefits 5-way 5-shot tasks, its effect there is relatively minor. Besides, we replace the dynamic routing algorithm with simple averaging. The accuracy of the model declines by 0.14% and 0.25% respectively on 5-way 1-shot tasks and by 0.22% and 0.12% respectively on 5-way 5-shot tasks. Compared with removing negative supervision, removing the dynamic routing algorithm has less effect on 5-way 1-shot tasks but a comparable effect on 5-way 5-shot tasks. Intuitively, in the 5-way 1-shot case, the node features in each class all derive from the same support instance, so there is little difference between averaging the node features and applying the dynamic routing algorithm to them, and the effectiveness of the dynamic routing algorithm is limited.
In the case of 5-way 5-shot, though much more useful information is available to the model, dynamic routing algorithm can make a difference to the results by giving large weight to samples that are more related to class prototype. Finally, we learn class representations with node features of the last layer of GNN instead of all the layers, and the model drops more clearly on 1-shot tasks (1.92% and 1.71% respectively) compared to 5-shot tasks (0.23% and 0.21% respectively). This verifies that, taking node features of all layers into consideration can provide multi-aspect sample information and contributes more to performance when the number of samples in each class is fewer.
Accuracies (%) of ablation study results
NS stands for negative supervision, DR stands for dynamic routing, MR stands for the multi-layer node representations of GNN.
In this section, we further analyze some variants of our model to show the advantage of our NSCGNN. These variants are described as follows:
Fused+DR: Fused+DR adopts a fully connected layer to fuse similar node features with dissimilar node features in Node Feature Update, replacing the negative supervision mechanism. In Class Prototype Induction, it still applies dynamic routing algorithm.
NS+Attention: NS+Attention adopts self-attention in Class Prototype Induction instead of dynamic routing algorithm, and the rest of the model remains unchanged.
FCL+Sigmoid: FCL+Sigmoid applies a fully connected layer followed by a sigmoid function to calculate the matching scores between query and class prototypes, and the rest of the model remains unchanged.
We conduct experiments on the HuffPost and FewRel datasets and show the results in Table 4. Fused+DR improves over using only similarity information (-w/o NS in Table 3), which verifies the role of inter-class dissimilarity information, but there is still a slight gap compared with our NSCGNN. NS+Attention models the weights of samples with a self-attention mechanism, but its ability is limited by the learnt attention parameters. In contrast, dynamic routing can automatically adjust the coupling coefficients according to the input support sets. In addition, we use a fully connected layer and a sigmoid function instead of the cosine function, constructing the FCL+Sigmoid model. The results in Table 4 show that it achieves better performance on the FewRel dataset but is not suitable for the HuffPost dataset. To further explore the performance differences between the FCL+Sigmoid model and our model, we compare the convergence speed of the two models during training on FewRel. As shown in Figure 2, although the accuracy of our model is 0.05% lower than that of FCL+Sigmoid on 5-way 5-shot tasks, its loss decreases faster and converges at about 1500 iterations. In contrast, the FCL+Sigmoid model converges more slowly, at about 2000 iterations. The reason is that FCL+Sigmoid uses a fully connected layer, which needs to learn additional weight parameters, while our model directly employs the cosine function, which does not introduce any new learnable weights.
Accuracies (%) of different model variants

Normalized Loss of FCL+Sigmoid and our NSCGNN on the training set of FewRel.
We visualize the support sample vectors under the 5-way 5-shot scenario in Figure 3 to illustrate the effectiveness of our proposed negative-supervised graph neural network module. First, we randomly select a support set of 25 texts (5 texts per class) from the FewRel test dataset, and obtain the node features of the last layer of our negative-supervised GNN and the node features of the last layer of a general GNN without the negative supervision mechanism. We then visualize them with t-SNE. After iterative updating, the generated features from the same class are closer in the embedding space in both (a) and (b). In (b), however, the vectors of different classes are farther apart and more separable, demonstrating the effectiveness of the negative supervision signal. Compared with a general GNN, which encodes node features with implicit information aggregation, our NSCGNN model explicitly captures intra-class similarity and inter-class dissimilarity with 2-dimensional edge features and the negative supervision mechanism.

Effect of negative supervision under the 5-way 5-shot scenario. (a) The support sample vectors obtained from general GNN. (b) The support sample vectors obtained from GNN with negative supervision. Circles represent corresponding clusters.
In this paper, we propose a negative-supervised capsule graph neural network for few-shot text classification. First, a graph is constructed, where nodes are text representations extracted from BERT and edges are 2-dimensional vectors indicating the similarity and dissimilarity between nodes. Then, both intra-cluster similarity and inter-cluster dissimilarity between nodes are explored with iterative information aggregation and negative supervision. Furthermore, a dynamic routing algorithm is applied to induce generalized class prototypes from the node capsules obtained from the GNN. Finally, we classify the query by class-wise comparison and train the whole model by jointly minimizing the negative-supervised loss and the large margin cosine loss. Experimental results demonstrate the effectiveness of our proposed model, which outperforms existing state-of-the-art few-shot models on both few-shot text classification and few-shot relation classification datasets. In future work, we will study the few-shot sequence labeling task with data generated by distant supervision and extend our NSCGNN model to zero-shot learning.
Acknowledgments
We thank all the reviewers for their insightful and valuable comments. This work was supported by the National Natural Science Foundation of China (Grant No. 72071145).
