Abstract
In recent years, Deep Learning methods have become very popular in classification tasks for Natural Language Processing (NLP), mainly because they reach high performance while relying on very simple input representations, i.e., raw tokens. One drawback of deep architectures is the large amount of annotated data required for effective training. In Machine Learning, this problem is usually mitigated by semi-supervised methods or, more recently and in the context of deep architectures, by Transfer Learning. One recent and promising method to enable semi-supervised learning in deep architectures has been formalized within Semi-Supervised Generative Adversarial Networks (SS-GANs) in the context of Computer Vision. In this paper, we adopt the SS-GAN framework to enable semi-supervised learning in the context of NLP. We demonstrate how an SS-GAN can boost the performance of simple architectures when operating in expressive low-dimensional embeddings; these are derived by combining the unsupervised approximation of linguistic Reproducing Kernel Hilbert Spaces and the so-called Universal Sentence Encoders. We experimentally evaluate the proposed approach on a semantic classification task, i.e., Question Classification, by considering different sizes of training material and different numbers of target classes. By applying this adversarial schema to a simple Multi-Layer Perceptron, a classifier trained on a subset amounting to 1% of the original training material achieves 92% accuracy. Moreover, when considering a complex classification schema, e.g., involving 50 classes, the proposed method outperforms state-of-the-art alternatives such as BERT.
Introduction
In recent years, Deep Learning methods have become very popular in classification tasks for Natural Language Processing (NLP); this is mainly due to their ability to reach high performances by relying on very simple input representations, i.e. raw tokens. Their success is also due to the capability to scale such architectures to very large datasets, by exploiting parallel architectures for training. As an example, recent architectures have been shown to be effective in capturing syntactic information by only observing sequences of words (e.g., LSTM as in [17]), redundant subsets of n-grams (e.g., CNNs as in [20]) up to just sequences of characters [21]. The networks themselves are expected to learn during training the representations useful for the final decision.
One of the drawbacks of complex deep architectures is the large amount of annotated data required for effective training. A neural network needs large-scale annotated material to learn suitable representations in its intermediate (i.e., hidden) layers. Unfortunately, annotated material is often scarce in many languages and domains, mainly because annotating new instances of the targeted phenomena is costly and time-consuming. Unlabeled examples, instead, are less expensive and typically more abundant. When only a few labeled examples are available, a setting usually known as few-shot learning, deep learning methods are not competitive.
Usually, in Machine Learning, this problem is mitigated by the usage of semi-supervised methods [5] or, more recently, by Transfer Learning mechanisms [27]. Semi-supervised learning methods aim at improving the generalization capability of a learner when little labeled data is available, while unlabeled data can be easily acquired. With Transfer Learning, an already pre-trained architecture is exploited to transfer the knowledge embedded in its weights to a new task.
In Computer Vision, one recent method for semi-supervised learning in deep architectures is formalized within the so-called Semi-Supervised Generative Adversarial Networks (SS-GANs) framework. Generative Adversarial Networks (GANs) [18, 30] are a class of neural generative models based on game theory. The goal of GANs is to train a generator network
In SS-GANs, the training is extended to use unlabeled examples. While the labeled material is used to train the classifier, the unlabeled one improves the inner representations of
In weakly supervised problems within NLP tasks, exploiting linguistic information can be beneficial. For example, syntactic information has been demonstrated to be useful for many language understanding tasks [29]. Recently, many approaches have been proposed for producing vector representations of sentences that try to capture syntactic or semantic information. In [7], the authors showed how to successfully embed linguistically motivated kernel functions into low-dimensional embeddings using the Nyström method [36]. In [9], the authors showed that these linguistically motivated embeddings are effective also within deep architectures, in the so-called Kernel-based Deep Architecture (KDA). The authors demonstrated how the syntactic information captured by a Semantic Tree Kernel can be embedded into vectors to be used as input for neural network learning.
A different approach to encoding sentences is proposed in [4], where the semantics of short texts is captured in vector spaces whose geometric distances closely reflect the semantic similarity of sentences. The Universal Sentence Encoder (USE), proposed in [4], is trained over very large datasets; deep architectures that take the vectors generated by the USE as input reach very good performance in different natural language inference (NLI) tasks.
It is worth noting that the above methods are not designed to solve specific tasks. They are specifically designed to properly encode texts in metric spaces where the underlying distance successfully reflects linguistic generalizations of lexical and grammatical properties. Such properties are thus induced from unlabeled textual material to model relations between sentences and texts. However, to tackle a new NLI task, e.g. a sentence classification task, a specific (potentially deep) neural architecture needs to be trained. Whereas the above methods provide linguistically meaningful inputs to the target networks, a large labeled dataset is still mandatory, as it embodies knowledge about the classification task. The greater the number of required annotations, the higher the cost of applying the targeted neural classifier.
In this paper, we investigate how to improve the applicability of deep architectures by exploiting the encoding of rich linguistic information in expressive metric spaces, fully capitalizing on a semi-supervised learning perspective. The aim is to improve the robustness of deep architectures when annotated datasets are too small, by exploiting unlabeled material properly encoded in a linguistically expressive vector space. We will study the impact of syntactic encoding of textual data as presented in [4, 9]. In particular, we will use the information contained in a Semantic Tree Kernel, a semantic similarity metric sensitive to grammatical and lexical information. The embedding of such linguistic evidence is obtained by using the Nyström method and is used to feed a Multi-Layer Perceptron (MLP). In this work, we will augment that reference learning architecture through the adoption of the SS-GAN framework. An adversarially trained generator network
Extending our previous work presented in [8], in this paper we also show how a similar schema is effective with representations acquired from the USE. This allows us to verify the capability of the SS-GAN framework to deal with a different kind of representation, one not specifically focused on syntactic information. Moreover, we show that the information provided by Tree Kernels and by the USE is complementary, and that their combination provides further improvements in the semi-supervised setting provided by the SS-GAN. The resulting architecture enables semi-supervised learning in linguistically rich spaces. We will evaluate this framework on two text classification tasks, both based on the Question Classification (QC) TREC dataset. We will show that our approach is beneficial both in the coarse-grained setting of QC, i.e., with 6 output categories, and in the fine-grained setting, i.e., with 50 categories. The experimental evaluation shows significant improvements in the few-shot setting, i.e., when using only 1%, 2% and 5% of the labeled examples. By applying this adversarial schema to a simple Multi-Layer Perceptron, a classifier trained on a subset amounting to 1% of the original training material achieves 92% accuracy. Moreover, when considering a complex classification schema, e.g., involving 50 classes, the proposed method outperforms state-of-the-art alternatives such as BERT [12].
In the remainder of the paper, Section 2 discusses related work. Section 3 discusses the methods for representing sentences in vector spaces. Section 4 presents the proposed SS-GAN. Section 5 reports the experimental evaluation. Finally, Section 6 draws the conclusions.
Related work
This work is in the area of Deep Learning for NLP. In particular, it relates to Generative Adversarial Networks and Transfer Learning.
In this work, we will adopt the USE model to provide initial representations for the SS-GAN framework. We aim at improving the applicability of the above embedding methods in scenarios where very few labeled data are provided, while large collections of unlabeled texts exist.
Embedding sentences in geometrical spaces
Lexical embeddings have been widely adopted to encode lexical and textual information in metric spaces, where the corresponding vector operations reflect syntactic as well as semantic properties. In this work, we explore two kinds of embedding techniques. In section 3.1, we describe a vector space obtained by approximating a Semantic Tree Kernel function, which aims at capturing meaningful syntactic and lexical information. In section 3.2, we study a different representation of sentences obtained by a deep learning model, i.e., the Universal Sentence Encoder.
Nyström-based approximation of tree kernels
Syntax-based vector representations aim at capturing information reflecting syntactic aspects of sentences. For example, one could enumerate every possible syntactic relation and associate each of them with a specific vector dimension. Given a sentence, one could then compute its syntax-based vector and use it in algebraic operations with other vectors, e.g., the cosine similarity, to measure how syntactically similar two sentences are. However, such a naive approach does not capture all the facets of syntax.
One powerful way of dealing with syntax in Machine Learning is given by the Tree Kernels [6]. A Tree Kernel captures the similarity between two parse trees by looking at the common structures in the two parse trees. Tree Kernels have been demonstrated successful in many different tasks, e.g., as in [2, 10]. More formally, given an input training dataset
In [9] it has been shown that neural networks can be trained in low-dimensional spaces which approximate a generic Reproducing Kernel Hilbert Space. These low-dimensional approximations are derived as a reconstruction from a set of real reference training (unlabeled) examples, called landmarks, which can be used to compile the representation of any unseen test instance.
Let us assume we apply the projection function Φ over all examples from
The Nyström method [36] aims to derive a new low-dimensional embedding
Given an input example
where
In this work, we follow the idea of [9] to generate vector representations of sentences to be fed into an MLP architecture acting as the discriminator in an SS-GAN setting. In particular, we will approximate the space defined by the Compositionally Smoothed Partial Tree Kernel discussed in [1] as it suitably captures both lexical syntactic and compositional semantic information.
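As a rough illustration of the Nyström projection discussed above, the following NumPy sketch builds low-dimensional embeddings from a set of landmarks. It is only a sketch under stated assumptions: an RBF kernel stands in for the tree kernel used in this work (which operates on parse trees, not on vectors), and the function names, the number of landmarks, and the toy data are hypothetical choices made for the example.

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    """Stand-in kernel (assumption); the paper uses a tree kernel over parse trees."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def fit_nystrom(landmarks, kernel, eps=1e-10):
    """Build the projection matrix U S^{-1/2} from the landmark Gram matrix."""
    gram = kernel(landmarks, landmarks)          # l x l Gram matrix
    eigvals, eigvecs = np.linalg.eigh(gram)      # symmetric eigendecomposition
    keep = eigvals > eps                         # drop numerically null directions
    return eigvecs[:, keep] / np.sqrt(eigvals[keep])

def nystrom_embed(examples, landmarks, projection, kernel):
    """Embed examples through their kernel values against the landmarks."""
    return kernel(examples, landmarks) @ projection

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 5))                 # toy "sentences" as vectors
landmarks = data[rng.choice(len(data), size=20, replace=False)]
projection = fit_nystrom(landmarks, rbf_kernel)
embedded = nystrom_embed(data, landmarks, projection, rbf_kernel)
landmark_embedded = nystrom_embed(landmarks, landmarks, projection, rbf_kernel)
```

In this construction, dot products between the embedded landmarks reproduce the landmark Gram matrix, while dot products between other embedded examples only approximate the original kernel; the quality of the approximation depends on how many landmarks are used and how they are selected.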
Universal Sentence Encoder
The Universal Sentence Encoder (USE) [4] is a deep learning model providing 512-dimensional vector representations of sentences or short paragraphs. It belongs to the family of Transfer Learning models, which try to provide good out-of-the-box representations for NLP tasks, ranging from text classification to text similarity. The model we refer to in this work is the one trained with the Deep Averaging Network (see footnote 1), as presented in [4]. The model takes as input the embeddings of words and bi-grams, averages them, and passes the result through a feed-forward deep neural network to produce sentence embeddings. The model is trained with a combination of unsupervised and supervised learning.
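The averaging-then-feed-forward idea behind this encoder can be sketched in a few lines of NumPy. This is not the pre-trained USE model: the weights are random, only unigram embeddings are averaged (bi-grams are omitted for brevity), and the layer sizes, the tanh activation, and all names are assumptions chosen for illustration.

```python
import numpy as np

def dan_encode(token_ids, embeddings, layers):
    """Average the token embeddings, then apply feed-forward layers."""
    hidden = embeddings[token_ids].mean(axis=0)      # order-insensitive pooling
    for weights, bias in layers:
        hidden = np.tanh(hidden @ weights + bias)    # activation chosen for illustration
    return hidden

rng = np.random.default_rng(0)
vocab_size, embed_dim, sentence_dim = 1000, 64, 512  # hypothetical sizes
embeddings = rng.normal(scale=0.1, size=(vocab_size, embed_dim))
layers = [
    (rng.normal(scale=0.1, size=(embed_dim, 256)), np.zeros(256)),
    (rng.normal(scale=0.1, size=(256, sentence_dim)), np.zeros(sentence_dim)),
]
sentence_vector = dan_encode([12, 7, 431, 7], embeddings, layers)
shuffled_vector = dan_encode([7, 431, 7, 12], embeddings, layers)
```

Because the pooling step is an average, the two calls above yield the same vector up to floating-point rounding: word order is only captured to the extent that n-gram embeddings are included in the input.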
In [4], it has been demonstrated that using the representations computed by the USE with a simple MLP can provide very strong performance on multiple NLP tasks. In this work, we aim at verifying the capability of these representations to provide useful semantic information in a few-shot scenario, as opposed to the representations described in section 3.1. As with the Nyström-based representation, we will adopt the USE vectors in an MLP architecture in the SS-GAN setting. We will show that the two types of vector representations provide complementary information.
Adversarial training for few-shot text classification
Semi-supervised GANs
In order to be effective, deep neural networks are usually trained on large labeled datasets. The availability of labeled data is not always guaranteed, as the process of annotating the data for a given task/language/domain is a costly and time-consuming process: on the contrary, unlabeled data can be easily accessed. Deep architectures were initially adapted to the semi-supervised case by using concepts coming from the theory of graph-based methods [22, 37]. In this setting, both labeled and unlabeled data are represented in the same graph, and their relationships are exploited to “transfer” knowledge from the unlabeled instances to the labeled ones.
Semi-Supervised Generative Adversarial Networks (SS-GANs) [30] support a semi-supervised setting within the GAN framework. GANs are traditionally used for their generative capabilities. In GANs, the generator aims to produce new examples resembling an existing distribution; the discriminator (i.e., a binary classifier) aims to distinguish real examples from the ones generated by the generator. As shown in Figure 1, in SS-GAN the discriminator

Figure 1: SS-GAN architecture.
More formally, let
Let us define
At the same time,
This is combined with the unsupervised loss of
Finally, the
While in literature an SS-GAN is usually stimulated with input examples x encoding images (e.g. from the MNIST dataset), in the next section we will show that they can be easily adopted over input spaces encoding linguistic information.
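To make the training objectives more concrete, the following NumPy sketch computes the losses of a semi-supervised GAN on a batch of logits. It assumes the common formulation of [30], where the discriminator outputs k+1 logits and the last one denotes the "fake" class, and where the generator is trained with a feature-matching term plus an unsupervised term. The function names and the exact combination of terms are assumptions made for illustration and may differ from the configuration used in this work.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis of a 2D array."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def discriminator_losses(logits_labeled, labels, logits_real, logits_fake):
    """Logits have k+1 columns; the last column is the 'fake' class (assumption)."""
    # Supervised term: cross-entropy over the k real classes only.
    log_p_classes = log_softmax(logits_labeled[:, :-1])
    supervised = -log_p_classes[np.arange(len(labels)), labels].mean()
    # Unsupervised term: real examples should not be 'fake', generated ones should.
    p_fake_given_real = np.exp(log_softmax(logits_real)[:, -1])
    log_p_fake_given_fake = log_softmax(logits_fake)[:, -1]
    unsupervised = -np.log1p(-p_fake_given_real).mean() - log_p_fake_given_fake.mean()
    return supervised, unsupervised

def generator_losses(features_real, features_fake, logits_fake):
    """Feature matching on an intermediate layer plus the adversarial term."""
    feature_matching = np.sum((features_real.mean(axis=0) - features_fake.mean(axis=0)) ** 2)
    p_fake = np.exp(log_softmax(logits_fake)[:, -1])
    unsupervised = -np.log1p(-p_fake).mean()     # generated examples should look real
    return feature_matching, unsupervised

# Toy check with k = 6 classes and an uninformative discriminator (all-zero logits).
labels = np.array([0, 1, 2, 5])
d_sup, d_unsup = discriminator_losses(np.zeros((4, 7)), labels, np.zeros((3, 7)), np.zeros((3, 7)))
g_fm, g_unsup = generator_losses(np.ones((3, 5)), np.ones((2, 5)), np.zeros((2, 7)))
```

With all-zero logits the discriminator assigns uniform probability to every class, so the supervised loss equals log 6, which gives a simple sanity check for an implementation.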
In [8] we showed how to incorporate the SS-GAN perspective for the training of the KDA architecture, defined in [9]. The KDA, made of Nyström specific layers and an MLP, was used as a discriminator
In this work, we extend this schema to any vectors representing sentences. In particular, we substitute the KDA architecture with a more generic MLP as the discriminator
Applying a GAN perspective on such an architecture promotes two main benefits: the semi-supervised nature of the GAN approach and the expressiveness of linguistically-motivated vector representations for the involved sentences. This paves the way to cost-effective and highly reliable complex NLP inference, even in poor training conditions, e.g., small labeled data sets.
During the training phase, the input dataset is expected to contain labeled and unlabeled examples: these are randomly extracted and grouped in mini-batches of size b to be provided to
We expect that “fake” and unlabeled examples will boost the capability of
The application of the resulting architecture after training is straightforward [30].
Experimental evaluation
In this section, we provide a set of evaluations of the proposed architecture based on the SS-GAN approach with respect to text classification tasks involving short sentences, in particular questions.
Experimental setup
In this work, we aim to assess the impact of the proposed semi-supervised approach in boosting the performances of a neural classifier in poor training conditions, i.e., when a minimal set of labeled data is available. To evaluate the benefit of adopting the SS-GAN, we consider the setting where only a subset L of a dataset is labeled, while the rest U is unlabeled. This second subset will be used to estimate the unsupervised losses of
Fine-grained labels, such as
In our experiments, we measure the performance of the SS-GAN approach for text classification against both the coarse-grained and the fine-grained target categories. The former task allows us to assess the performance of the model in a quite common setting (6 classes), widely discussed in the literature. The latter allows us to assess the performance of the model in a more complex setting, i.e., 50 categories for 5,452 training examples. This task becomes even more challenging when the number of labeled instances is reduced to one or two examples per class.
We thus compare the contribution of the proposed technique when different embedding and learning methods are used, as discussed hereafter.
It must be said that the time required to train a GAN is higher with respect to the training of the corresponding MLP. This depends on the size of the unlabeled material that augments the labeled data and the replication factor f. Moreover, as pointed out in [25], the convergence of a GAN requires a higher number of epochs (in this work about 100 epochs for each training are required). However, as discussed in the previous section, once the training phase is complete, no additional computational costs are introduced, since only the Discriminator (i.e., an MLP) is used at inference time.
We used a split of the training set (about 10% of the entire training dataset, the same for all the experiments) to tune the hyper-parameters of each model. We performed 5 runs, shuffling the training set each time, and report the average classification accuracy. The MLP and SS-GAN are implemented in Tensorflow 1.14 and trained on an NVIDIA Tesla T4 GPU. Accuracy, i.e., the percentage of the test questions assigned to their correct classes, is used in the evaluations.
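The evaluation measure described above can be sketched as follows. The helper names and the toy labels are hypothetical and serve only to show how accuracy is computed per run and then averaged across runs.

```python
from statistics import mean

def accuracy(predicted, gold):
    """Percentage of test questions assigned to their correct class."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)

def average_accuracy(runs, gold):
    """Mean accuracy over several runs evaluated on the same test set."""
    return mean(accuracy(predicted, gold) for predicted in runs)

# Toy example: two runs over four test questions (labels are illustrative).
gold = ["LOC", "NUM", "HUM", "DESC"]
runs = [
    ["LOC", "NUM", "HUM", "ENTY"],   # 3 of 4 correct
    ["LOC", "NUM", "HUM", "DESC"],   # 4 of 4 correct
]
single_run = accuracy(runs[0], gold)
averaged = average_accuracy(runs, gold)
```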
Experimental results
Results in terms of accuracy with respect to the coarse-grained setting within the Question Classification task. Rows report the different input embeddings and different learning algorithms. Columns express the percentage of training material used. In bold the highest accuracy score with respect to each column

Figure 2: Learning curves for the QC Coarse-Grained task.
In the coarse-grained setting, we consider 6 classes. This specific classification task is characterized by a random baseline (obtained by a system that randomly assigns questions to classes) of 16%. Since the dataset is not balanced, a system that always assigns the most frequent class (i.e.,
The first group of results is obtained by applying the different learning architectures over the Kernel-based embeddings. A Kernel-based SVM (K-SVM) achieves 95.0% when trained over the entire L. It demonstrates the expressiveness of the (implicit) representation space provided by the kernel function. However, when the size of the dataset decreases, performances drop as well. When using 10% of labeled material (i.e., about 550 questions), an accuracy of 86% is obtained. When reducing this amount to only 50 questions, the accuracy drops to 58%. It is still higher with respect to the baselines, but the drop is nevertheless significant.
When the Kernel space is approximated by applying the Nyström method, both a linear SVM (the K-LinSVM, proposed in [7]) and an MLP (the K-MLP, which is essentially the KDA reported in [9]) show similar trends. Even though the explicit 500-dimensional space provides an expressive reconstruction of the original kernel space, both methods suffer a similar performance drop. The first application of the semi-supervised approach is reported in the K-GAN row. When very little training material is used, it provides higher performance than the previous supervised approaches: the K-GAN obtains 64.9% accuracy, while the Kernel SVM and the KDA obtain 58.0% and 57.8%, respectively, which corresponds to an error reduction of about 17%. As reported in Figure 2a, the differences between the models get smaller as the size of L grows: starting at about 30-40% of the labeled examples, the K-GAN, K-MLP and K-SVM perform almost equally. At 100% of the examples, all models perform very similarly: the K-GAN, the K-MLP and the K-SVM obtain 93.2%, 93.4% and 95.0%, respectively. This confirms our idea that linguistically motivated information, combined with adversarial learning, can help in reducing the amount of annotated data needed to train text classification models. However, when only 50 questions are used to train a neural architecture, we are still quite far from a solution that can be used in a realistic scenario.
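The error reduction quoted above can be checked with a short calculation. The sketch below assumes the usual definition of relative error reduction, i.e., the decrease in error rate divided by the baseline error rate; the function name is hypothetical.

```python
def relative_error_reduction(baseline_accuracy, improved_accuracy):
    """Relative reduction of the error rate, with accuracies given in percent."""
    baseline_error = 100.0 - baseline_accuracy
    improved_error = 100.0 - improved_accuracy
    return 100.0 * (baseline_error - improved_error) / baseline_error

# Accuracies reported in the text for the smallest labeled set.
vs_kernel_svm = relative_error_reduction(58.0, 64.9)   # 6.9 / 42.0
vs_kda = relative_error_reduction(57.8, 64.9)          # 7.1 / 42.2
print(f"vs. Kernel SVM: {vs_kernel_svm:.1f}%  vs. KDA: {vs_kda:.1f}%")
```

Under this definition, both comparisons give a reduction between 16% and 17%, consistent with the approximate figure reported in the text.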
Results obtained when training the above algorithms over the embeddings obtained through the Universal Sentence Encoder (USE) show similar trends, but also some significant differences. A linear SVM and an MLP trained over the USE representation (i.e., the USE-LinSVM and USE-MLP) do not outperform the Kernel-based SVM when trained on the entire L, achieving 93.6% and 93.8%, respectively. However, these classifiers achieve higher results when only 1% of the training material is used, i.e., 79.6% and 77.8%, respectively. Most importantly, when applying the SS-GAN approach, the robustness of the discriminator increases, as shown by the USE-GAN curve in Figure 2b: with only 1% of the training material, accuracy rises to 87.5%. Such improvements remain noticeable until the methods are trained on about 20% of L.
The most straightforward results are obtained when combining the above representations, in the Kernel + USE setting. The embedding spaces complement each other, improving the quality of the overall representation. As reported in rows KUSE-LinSVM and KUSE-MLP, this expanded representation space achieves better results than any method trained over the individual embeddings. Most importantly, as shown in Figure 2c, when the SS-GAN is adopted, an accuracy of 92.0% is achieved with only 1% of L. These results outperform all previous methods at every percentage of labeled material.
The proposed solution is highly competitive in poor training conditions with respect to state-of-the-art methods such as BERT. Although BERT achieves an impressive accuracy of 96.9% when the entire L is used, its capability of generalizing in the QC downstream task is compromised with reduced training datasets. As shown in Figure 2d, BERT trained with only 50 questions diverges, obtaining an accuracy comparable with the lowest baselines. More than 1,600 labeled questions (i.e., 30% of L) are required for competitive training. Overall, the proposed approach based on the SS-GAN yields a classification model whose quality, with respect to the addressed QC task, is stable between 92% and 94% even with a handful of annotated examples, i.e., only 50 labeled examples from a dataset of 5,500 instances.
Results in terms of accuracy with respect to the fine-grained setting within the Question Classification task. Rows report the different input embeddings and different learning algorithms. Columns express the percentage of training material used. In bold the highest accuracy score with respect to each column

Learning curves for the QC Fine-Grained task.
This result is straightforward for two reasons. First, the adversarial training allows improving the robustness of the discriminator
Conclusions
In this paper, the application of Generative Adversarial Networks (GANs) for semi-supervised learning within natural language inference tasks is presented. The approach is specifically tailored to amplify the applicability of combined Kernel-based and Deep Learning architectures. The strict requirement of large-scale annotated corpora, which strongly characterizes most complex NLP inference tasks in Deep Learning, is here mitigated by the semi-supervised perspective offered by Semi-Supervised GANs. In the proposed approach, the specific SS-GAN formulation has been adopted in combination with sentence embeddings derived by approximating a rich linguistic kernel space and by adopting the Universal Sentence Encoder. This makes it possible to bootstrap the training of a classifier in all situations where a significant amount of labeled examples is not available, but a large unlabeled dataset can be obtained. The experimental evaluations of the SS-GAN in sentence-level semantic classification tasks confirm several benefits. When only a small dataset is available, as little as 1% of the original training material, an impressive 92% accuracy is achieved in the coarse-grained Question Classification task. Moreover, when considering a complex classification schema, e.g., involving 50 classes, the proposed method outperforms state-of-the-art methods such as BERT.
In the future, we aim to target more tasks and to explore the role of other kernel embeddings. Finally, this paper does not fully explore the possibility of adopting SS-GAN settings with transformer-based embeddings (e.g., [11]) and their variants. This is certainly a core issue to be explored in future work.
Footnotes
We adopted the model that is available on Tensorflow HUB.
http://www.kelp-ml.org, presented in [15].
Acknowledgments
We would like to thank Carlo Gaibisso, Bruno Luigi Martino and Francis Farrelly of the Istituto di Analisi dei Sistemi ed Informatica “Antonio Ruberti” (IASI) for supporting the experimentations through access to dedicated computing resources made available by the Artificial Intelligence & High-Performance Computing laboratory.
