SECL: Sampling enhanced contrastive learning

Abstract

Instance-level contrastive learning methods such as SimCLR have proven to be powerful for representation learning. However, SimCLR suffers from sampling bias, feature bias and model collapse. This paper proposes a set-level Sampling Enhanced Contrastive Learning (SECL) method based on SimCLR. The proposed super-sampling method expands the augmented samples into a contrastive-positive set, which lets the model learn class features of the target sample and thereby reduces the bias. The contrastive-positive set includes Augmentations (the original augmented samples) and Neighbors (the super-sampled samples). We also introduce a samples-correlation strategy to prevent model collapse, in which a positive correlation loss or a negative correlation loss is computed to balance the model’s Alignment and Uniformity. SECL reaches 94.14% classification precision on the SST-2 dataset and 89.25% on the ARSC dataset. For the multi-class classification task, SECL achieves 90.99% on the AGNews dataset. These results are all about 1% higher than the precision of SimCLR. Experiments show that SECL converges faster in training and reduces the risk of bias and model collapse.

Keywords

Contrastive learning, sampling enhancement, InfoNCE loss, model collapse

1. Introduction

Representation learning is a key challenge in natural language processing (NLP). Self-supervised learning (SSL) has achieved great empirical success across multiple domains, including computer vision [2,3,8], NLP [6,7,18], and speech recognition [17]. SSL learns meaningful structures from large-scale unlabeled data, starting from randomly initialized networks, and yields effective representations for downstream tasks.

SimCLR [2] proposes a simple contrastive learning framework for visual representation learning, in which augmented samples are constructed as contrastive-positive sample pairs. The contrastive network maximizes agreement between differently augmented views of the same sample via a contrastive loss in the latent space.

Pre-trained models used in text representation learning, such as BERT [5], BERT-whitening [13], and BERT-flow [10], have achieved great success in different tasks. Methods that combine a pre-trained model with contrastive learning [7,12,18] are more effective than BERT and other pre-trained models alone. Compared with the traditional pre-trained model, applying contrastive learning to NLP can obtain better representations with a simple structure.

However, instance specificity [1,19] leads to sampling bias in SimCLR, as illustrated in Fig. 1, where label “0” denotes negative sentiment and label “1” denotes positive sentiment in the SST-2 dataset. Since the true labels are not available in pre-training, the contrastive-negative samples are randomly selected from the dataset. It is therefore possible that a sample with positive sentiment (Sample-5 in Fig. 1) is included in the contrastive-negative sample set (CNS). This phenomenon is referred to as sampling bias [4], and it may lead to a significant performance drop in the classification task [4]. Meanwhile, instance-level learning focuses only on the target sample, so the learned representation may lack category invariance, which we refer to as feature bias. Thus, increasing the number of contrastive-positive samples may help reduce both sampling bias and feature bias.

Fig. 1.

Sampling bias on SST-2.

Besides, SimCLR needs a large number of negative samples in training to avoid model collapse [2,9,14,20]. Once the number of negative samples in SimCLR is reduced, the model collapses to a constant solution and little valid classification information can be learned. MoCo [9] avoids collapse by updating the key encoder as a momentum-based moving average of the query encoder, so that the negative samples can come from several preceding mini-batches. BYOL [8] does not use negative samples and iteratively bootstraps the outputs of a network to serve as targets for an enhanced representation. SimSiam [3] is a simpler structure based on BYOL that directly maximizes the similarity of two views of one image, using neither negative pairs nor a momentum encoder. It works surprisingly well and prevents collapse.

To address sampling bias and feature bias, a set-level contrastive learning method is proposed, as shown in Fig. 2. SimCLR generates two augmented samples (Augmentations) from each original sample (Starter). In this paper, additional samples with the closest semantic distance are super-sampled as the Neighbors of the Augmentation. We use RoBERTa as the feature extractor, which provides reasonable embeddings. As an approximation, if the current batch contains samples of the same emotional category as the Augmentation, they will be selected as Neighbors. The Augmentations and Neighbors form a contrastive-positive sample set (CPS), which extends SimCLR from instance-level sampling to set-level sampling.

Fig. 2.

Set-level super-sampling.

Since an imbalance between Alignment and Uniformity leads to model collapse, we propose a samples-correlation strategy to maintain the balance of the model. In the super-sampling method, the CPS improves Alignment with more contrastive-positive samples, but the smaller number of samples in the CNS decreases the model’s Uniformity. The samples-correlation strategy balances Alignment and Uniformity by adjusting either the correlation within the CPS (positive correlation loss) or the correlation with the CNS (negative correlation loss).

A Sampling Enhanced Contrastive Learning (SECL) method based on SimCLR is proposed for the downstream classification task, which includes: (1) A CPS obtained by super-sampling, (2) Sampling correlation in the loss function. Overall, contributions can be summarized as follows:

The proposed super-sampling method selects additional samples (Neighbors) from the CNS to form the CPS. The Neighbors help the model learn set-level sample features for the downstream classification task.

We propose a positive correlation loss, where the Augmentations’ correlation is computed and added to the Super-Sampling InfoNCE loss. The loss ensures a larger difference between Augmentations and samples in CNS.

The negative correlation loss is proposed to strengthen the negative correlation between contrastive-positive and contrastive-negative samples, which can also highlight the impact of the CPS. The network structure is simpler while similar precision and recall are achieved.

Experiments show that SECL improves the downstream classification precision by 1.31%, 0.72%, and 1.15% over SimCLR on the SST-2, ARSC, and AGNews datasets, respectively.

2. SimCLR

SimCLR is a simple contrastive framework proposed in [2]. Its representation learning system works as follows. First, a batch of N images is sampled from the original images. For each image in the batch, a random transformation function is applied to obtain a pair of randomly augmented images, which form a contrastive-positive sample pair. Then, the representation vector of each augmented image is obtained by two identical encoders. A series of nonlinear layers, defined as the projector, maps the representations to the space where the contrastive loss is applied. Finally, the contrastive loss function is defined for a contrastive prediction task. This framework considerably improves on earlier self-supervised learning methods and matches the performance of a supervised ResNet-50 on ImageNet.

The InfoNCE contrastive loss function in SimCLR can be expressed as follows: $$\ell_{InfoNCE} = - \log \frac{\exp (\mathrm{sim} (z_{i}, z_{j}) / \tau)}{\sum_{k = 1}^{2 N} I_{[k \neq i]} \exp (\mathrm{sim} (z_{i}, z_{k}) / \tau)} \tag{1}$$ where $I_{[k \neq i]}$ is an indicator function that evaluates to 0 when $k = i$ and to 1 otherwise, $\mathrm{sim} (z_{i}, z_{j}) = \frac{z_{i}^{T} z_{j}}{\| z_{i} \| \| z_{j} \|}$ is the cosine similarity between the vectors $z_{i}$ and $z_{j}$, $z_{i} \in \mathbb{R}^{d}$ is the output of the projection head for $x_{i}$, N is the batch size, and $\tau$ is a temperature parameter.

The numerator of InfoNCE encourages high similarity between the contrastive-positive samples. The denominator encourages low similarity between the target sample and every contrastive-negative sample. As the numerator only measures the distance between different augmentations of the same image, sampling bias, feature bias and related problems may arise. Meanwhile, to keep enough individual features in the model, the denominator relies on a large number of negative samples, which creates a risk of collapse when the number of negative samples decreases.
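As a minimal sketch of Eq. (1), the following standard-library Python computes the loss for one anchor over a toy batch. The function names (`cosine`, `info_nce`) and the toy vectors are illustrative assumptions, not part of the original method description; the temperature default of 0.05 is taken from Table 2.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(z, i, j, tau=0.05):
    """Eq. (1) for anchor index i and positive index j over the 2N projections z."""
    numerator = math.exp(cosine(z[i], z[j]) / tau)
    denominator = sum(
        math.exp(cosine(z[i], z[k]) / tau)
        for k in range(len(z))
        if k != i  # indicator I[k != i]
    )
    return -math.log(numerator / denominator)

# Toy batch with N = 2 (2N = 4 projections); the values are illustrative only.
z = [
    [1.0, 0.0],   # view 1 of sample A
    [0.9, 0.1],   # view 2 of sample A
    [0.0, 1.0],   # view 1 of sample B
    [-0.1, 1.0],  # view 2 of sample B
]
loss_aligned = info_nce(z, i=0, j=1)     # the two views of sample A as the positive pair
loss_mismatched = info_nce(z, i=0, j=2)  # an unrelated sample treated as the positive
```

In this toy setting the loss is close to zero when the positive is the near-duplicate view and large when an unrelated sample is treated as the positive.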

2.1. Sampling bias

Unsupervised deep learning methods include clustering [21] and sample specific learning [1,19]. Clustering mainly focuses on the characteristics of classes, whereas sample specific learning, such as SimCLR, goes to the other extreme by treating every single sample as an independent class. However, sample specific learning is likely to yield more ambiguous class structures and less discriminative features.

Figure 3(a) shows the bias problem. Samples $x_{1}^{pos}, x_{2}^{pos}$, which belong to the same class as the Augmentations $(x_{i}^{pos}, x_{j}^{pos})$, are treated as contrastive-negative samples (sampling bias), and the learning of the Augmentation $x_{i}^{pos}$ lacks class features (feature bias).

Fig. 3.

Comparison of sampling method. Triangles and squares represent contrastive-positive and contrastive-negative samples, $x^{pos}, x^{neg}$ represent classifying-positive and classifying-negative samples of downstream task respectively, $(x_{i}^{pos}, x_{j}^{pos})$ are Augmentations, and $(x_{1}^{pos}, x_{2}^{pos})$ are Neighbors.

In contrast to the contrastive-positive sampling of SimCLR, the proposed super-sampling method captures set-level sample features. As shown in Fig. 3(b), the best-matching samples (for example, $x_{1}^{pos}, x_{2}^{pos}$) are selected from the CNS as Neighbors, which introduces clustering into sample specific learning. SimCLR only ensures closeness between the augmented samples during training, whereas super-sampling simultaneously considers closeness among all members of the CPS, which improves the robustness and generalization of the learned representation.

2.2. Collapse

Wang and Isola [15] proposed that good contrastive learning should have two attributes: Alignment and Uniformity. Through contrastive-positive samples, Alignment makes similar samples have features that are as similar as possible, while Uniformity means retaining as much distinguishing information as possible in the features. Violating the principle of Uniformity leads to a trivial solution in which all samples collapse into a single representation. We can use Eq. (2) and Eq. (3) to examine the inner workings of each approach [6,15]: $$\ell_{align} \triangleq \underset{(z, z^{+}) \sim R^{+}}{E} \| z - z^{+} \|^{2} \tag{2}$$ $$\ell_{uniform} \triangleq \log \underset{(z, z^{-}) \overset{i.i.d.}{\sim} R^{d}}{E} e^{- 2 \| z - z^{-} \|^{2}} \tag{3}$$ where $R^{d}$ denotes the data distribution and $R^{+}$ denotes the distribution of contrastive-positive samples.
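The two quantities in Eq. (2) and Eq. (3) can be estimated from a finite set of features. The sketch below is one possible estimator, assuming the expectations are replaced by sample means (over the given positive pairs for Alignment, and over all distinct pairs of points for Uniformity); the function names and the toy points are illustrative assumptions.

```python
import math
from itertools import combinations

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def align_loss(pos_pairs):
    """Eq. (2): mean squared distance over contrastive-positive pairs."""
    return sum(sq_dist(z, z_pos) for z, z_pos in pos_pairs) / len(pos_pairs)

def uniform_loss(points):
    """Eq. (3): log of the mean Gaussian potential over all distinct pairs."""
    pairs = list(combinations(points, 2))
    mean_potential = sum(math.exp(-2.0 * sq_dist(u, v)) for u, v in pairs) / len(pairs)
    return math.log(mean_potential)

# Four unit vectors spread evenly on the circle versus four collapsed copies.
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
collapsed = [[1.0, 0.0]] * 4

uniform_spread = uniform_loss(spread)
uniform_collapsed = uniform_loss(collapsed)  # log(1) = 0, the worst case
align_identical = align_loss([([1.0, 0.0], [1.0, 0.0])])  # perfectly aligned pair
```

A collapsed representation attains perfect Alignment but the worst possible Uniformity, which is why the two measures have to be read together.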

SimCLR makes a balance between maximum similarity (Alignment) and minimum similarity (Uniformity) by generating a large number of contrastive-negative samples based on a large batch of samples.

In the super-sampling method, more samples are treated as contrastive-positive ones, which may increase the risk of collapse. Experiments (Table 4) show that collapse occurs when the number of Neighbors increases to 6 with a batch size of 16.

The equilibrium between Alignment and Uniformity can be maintained by learning the correlation between Neighbors and Augmentations. As shown in Fig. 4, compared with plain super-sampling in SimCLR, the positive correlation strategy further narrows the distance between the Augmentations, which ensures a larger difference between the Augmentations and the samples in the CNS. The negative correlation strategy enlarges the distance between the CPS and the CNS, which strengthens the expression of sample specificity in the model.

Fig. 4.

Correlation strategies.

3. Sampling enhanced contrastive learning

3.1. Super-sampling method

To address sampling bias, a super-sampling method is proposed that captures class features by increasing the number of contrastive-positive samples. The Neighbors are chosen as the samples with the smallest semantic distance to the Augmentation.

Augmentations $(x_{i}, x_{j})$ are the augmentation of Starter x, and all samples in CNS $(x_{1}, x_{2}, \dots, x_{n})$ are candidates for Neighbors. Cosine similarity is used to calculate the semantic distance between Augmentation and the Neighbors’ candidates.

The CPS X is obtained by adding the Neighbors, which are the M samples nearest to the Augmentation. As $\mathrm{sim} (z_{i}, z_{k})$ has already been calculated in InfoNCE, the Neighbors are searched as follows: $$\min_{x_{k}} (1 - \mathrm{sim} (z_{i}, z_{k})) \tag{4}$$ $$X = (x_{i}, x_{j}, x_{1}, x_{2}, \dots, x_{M}) \tag{5}$$

Neighbors in the super-sampling method are selected among the input batch. When the batch size is N, after data augmentation there will be $2 N$ samples for the contrastive structure. In SimCLR, there are 2 positive samples (two augmentations of Starter) and $2 N - 2$ negative samples. In SECL, there are $2 + M$ positive samples (two augmentations of Starter and M Neighbors) and $2 N - 2 - M$ negative samples.
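The Neighbor search of Eq. (4) and the resulting set sizes can be illustrated with a small sketch. It assumes that the candidates are all projections in the batch except the anchor and its partner Augmentation; the helper names and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def select_neighbors(z, i, j, m):
    """Indices of the m candidates with the smallest 1 - sim(z_i, z_k), as in Eq. (4)."""
    candidates = [k for k in range(len(z)) if k != i and k != j]
    candidates.sort(key=lambda k: 1.0 - cosine(z[i], z[k]))
    return candidates[:m]

# Toy projections for N = 3 (2N = 6); indices 0 and 1 are the two Augmentations.
z = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.8, 0.3],
    [0.0, 1.0],
    [0.7, 0.6],
    [-1.0, 0.2],
]
i, j, M = 0, 1, 2
neighbors = select_neighbors(z, i, j, M)
cps = [i, j] + neighbors                          # Eq. (5): 2 + M members
cns = [k for k in range(len(z)) if k not in cps]  # 2N - 2 - M members
```

With these toy values the two Neighbors are the candidates at indices 2 and 4, leaving two contrastive-negative samples.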

Correspondingly, the loss function of SimCLR is modified as follows: $$\ell_{Super} = - \log \frac{\exp (\mathrm{sim} (z_{i}, z_{j}) / \tau) + \sum_{m \in M} \exp (\mathrm{sim} (z_{i}, z_{m}) / \tau)}{\sum_{k = 1}^{2 N} I_{[k \neq i \land k \notin M]} \exp (\mathrm{sim} (z_{i}, z_{k}) / \tau)} \tag{6}$$ where $I_{[k \neq i \land k \notin M]} \in \{0, 1\}$ is an indicator function that evaluates to 1 if $k \neq i \land k \notin M$, $\mathrm{sim} (z_{i}, z_{j}) = \frac{z_{i}^{T} z_{j}}{\| z_{i} \| \| z_{j} \|}$ denotes the cosine similarity between the projection head output vectors, N is the batch size, $M = (1, 2, \dots, M)$ is the index set of $z_{m}$, $z_{m}$ denotes the Neighbors, and $\tau$ denotes the temperature parameter.
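A literal transcription of Eq. (6) for a single anchor might look as follows. This is a sketch under the assumption that the Neighbor indices have already been selected; the function names and toy vectors are illustrative, and the temperature default is the pretraining value from Table 2.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def super_sampling_loss(z, i, j, neighbors, tau=0.05):
    """Eq. (6): Neighbors move from the denominator to the numerator."""
    def weight(k):
        return math.exp(cosine(z[i], z[k]) / tau)

    numerator = weight(j) + sum(weight(m) for m in neighbors)
    denominator = sum(
        weight(k)
        for k in range(len(z))
        if k != i and k not in neighbors  # indicator I[k != i and k not in M]
    )
    return -math.log(numerator / denominator)

# Toy projections (2N = 6); indices 2 and 4 are assumed to be the selected Neighbors.
z = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.8, 0.3],
    [0.0, 1.0],
    [0.7, 0.6],
    [-1.0, 0.2],
]
loss_plain = super_sampling_loss(z, i=0, j=1, neighbors=[])      # reduces to Eq. (1)
loss_super = super_sampling_loss(z, i=0, j=1, neighbors=[2, 4])  # Eq. (6) with M = 2
```

With an empty Neighbor set the function reduces to the InfoNCE loss of Eq. (1), so the effect of super-sampling can be read off by comparing the two values.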

Figure 5 shows the margins of super-sampling. Within a narrow margin, the selected Neighbors belong to the same class as the Augmentation with high probability. Within a wide margin, however, Neighbors of different classes can be drawn into the CPS, which clearly harms subsequent model training.

Fig. 5.

Margins of super-sampling.

3.2. Samples-correlation strategy

We introduce a samples-correlation strategy including two kinds of loss functions to adjust the balance of model’s Alignment and Uniformity, which will reduce the risk of model collapse.

3.2.1. Positive correlation loss (PC loss)

SimSiam [3] adopts Siamese networks for contrastive learning without negative sample pairs and performs well on downstream tasks. Following SimSiam, the correlation of the Augmentations is calculated with an asymmetric structure and added to the final loss. The final loss thus includes the influence of the Neighbors while also highlighting the importance of the Augmentations, which balances the Alignment and Uniformity of learning.

The asymmetric structure

The asymmetric structure, shown in Fig. 6, applies different MLP layers to the Transformer outputs $(h_{i}, h_{j})$ and a stop-gradient on one branch, which is crucial for preventing collapse.

The loss function is optimized as follows: $$\min D (z_{i}^{\prime}, z_{j}^{\prime\prime}) \tag{7}$$ where $D (z_{i}^{\prime}, z_{j}^{\prime\prime})$ is the cross-entropy term defined in Eq. (9), and $z_{j}^{\prime\prime}$ is the stop-gradient copy of $z_{j}^{\prime}$.

Fig. 6.

The asymmetric structure.

Loss function

The positive correlation loss (PC loss) includes the Super-Sampling loss and the cross-entropy loss of the Augmentations: $$L_{PC} = \ell_{Super} + \ell_{Cross\_Entropy} \tag{8}$$ where $\ell_{Super}$ is calculated by Eq. (6), and $(z_{i}, z_{j})$ are the outputs of the first two layers of the MLP in the projection head (which has three MLP layers).

The cross-entropy term, which takes the form of a negative cosine similarity, is as follows: $$D (z_{i}^{\prime\prime}, z_{j}^{\prime}) = - \frac{z_{i}^{\prime\prime} \cdot z_{j}^{\prime}}{\| z_{i}^{\prime\prime} \| \| z_{j}^{\prime} \|} \tag{9}$$ $$\ell_{Cross\_Entropy} = \left( D (z_{i}^{\prime\prime}, z_{j}^{\prime}) + D (z_{i}^{\prime}, z_{j}^{\prime\prime}) \right) / 2 \tag{10}$$ where $\| z \|$ is the $\ell_{2}$-norm, $z_{i}^{\prime\prime}$ and $z_{j}^{\prime\prime}$ denote $stopgrad (z_{i}^{\prime})$ and $stopgrad (z_{j}^{\prime})$, and $(z_{i}^{\prime}, z_{j}^{\prime})$ are the outputs of the third layer.
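The forward value of Eq. (9) and Eq. (10) can be sketched in plain Python. Note the main assumption: stop-gradient only affects back-propagation, so in this gradient-free sketch it is represented by a plain copy, and an automatic-differentiation framework would be needed to reproduce its training effect. The function names are illustrative.

```python
import math

def neg_cosine(p, q):
    """Eq. (9): negative cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return -dot / (norm_p * norm_q)

def symmetric_neg_cosine(z_i, z_j):
    """Eq. (10): average of the two directed terms.

    z_i_sg and z_j_sg stand in for stopgrad(z_i') and stopgrad(z_j'); they are
    plain copies here because no gradients are computed in this sketch.
    """
    z_i_sg, z_j_sg = list(z_i), list(z_j)
    return 0.5 * (neg_cosine(z_i_sg, z_j) + neg_cosine(z_i, z_j_sg))

loss_parallel = symmetric_neg_cosine([1.0, 2.0], [2.0, 4.0])    # same direction
loss_orthogonal = symmetric_neg_cosine([1.0, 0.0], [0.0, 1.0])  # unrelated directions
loss_opposite = symmetric_neg_cosine([1.0, 0.0], [-1.0, 0.0])   # opposite directions
```

The term ranges from -1 for perfectly aligned Augmentations to 1 for opposite ones, so minimizing it pulls the two Augmentations together.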

3.2.2. Negative correlation loss (NC loss)

The proposed NC loss increases the weight of the negative sample pairs in the Super-Sampling loss while keeping the simple contrastive structure of SimCLR, and thereby strengthens sample specificity. The sample specificity in SimCLR is: $$\sum_{k = 1}^{2 N} I_{[k \neq i]} \exp (\mathrm{sim} (z_{i}, z_{k}) / \tau) \tag{11}$$ where $z_{i}$ denotes the positive sample of SimCLR.

In the Super-Sampling loss, new contrastive-positive samples are added, and the sample specificity becomes: $$\sum_{k = 1}^{2 N} I_{[k \neq i \land k \notin M]} \exp (\mathrm{sim} (z_{i}, z_{k}) / \tau) \tag{12}$$ where $M$, which contains M indices, is the index set of $z_{m}$, and $z_{m}$ denotes the Neighbors.

Compared with Eq. (11), Eq. (12) removes $(batch\_size \times M)$ contrastive-negative samples, and the sample specificity decreases accordingly. A simple remedy is to amplify the influence of the negative sample pairs appropriately, and the corresponding loss function (NC loss) is: $$\ell_{NC} = - \log \frac{\exp (\mathrm{sim} (z_{i}, z_{j}) / \tau) + \sum_{m \in M} \exp (\mathrm{sim} (z_{i}, z_{m}) / \tau)}{(1 + \lambda) \sum_{k = 1}^{2 N} I_{[k \neq i \land k \notin M]} \exp (\mathrm{sim} (z_{i}, z_{k}) / \tau)} \tag{13}$$ The NC loss is effective for $\lambda \in [0.1, 3]$.
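Read literally, Eq. (13) scales the denominator of Eq. (6) by $(1 + \lambda)$. The sketch below transcribes it for one anchor, assuming pre-selected Neighbor indices; the function names and toy vectors are illustrative, while $\lambda = 1.4$ is the value reported for SST-2 in Section 4.2.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nc_loss(z, i, j, neighbors, lam, tau=0.05):
    """Eq. (13): Super-Sampling loss with the negative term scaled by (1 + lam)."""
    def weight(k):
        return math.exp(cosine(z[i], z[k]) / tau)

    numerator = weight(j) + sum(weight(m) for m in neighbors)
    negatives = sum(
        weight(k)
        for k in range(len(z))
        if k != i and k not in neighbors  # indicator I[k != i and k not in M]
    )
    return -math.log(numerator / ((1.0 + lam) * negatives))

# Toy projections (2N = 6); indices 2 and 4 are assumed to be the selected Neighbors.
z = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.8, 0.3],
    [0.0, 1.0],
    [0.7, 0.6],
    [-1.0, 0.2],
]
loss_super = nc_loss(z, i=0, j=1, neighbors=[2, 4], lam=0.0)  # lam = 0 recovers Eq. (6)
loss_nc = nc_loss(z, i=0, j=1, neighbors=[2, 4], lam=1.4)     # lam as used for SST-2
```

Setting `lam=0.0` recovers the Super-Sampling loss of Eq. (6), which makes it easy to compare the two variants on the same batch.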

3.3. Contrastive learning framework

The proposed framework is illustrated in Fig. 7, where SECL-P denotes the model with PC loss and SECL-N denotes the model with NC loss. SECL comprises four components:

Data augmentation: Use the classic text augmentation methods [16]: synonym replacement, random insertion, random swap, and random deletion, to generate positive pair $(x_{i}, x_{j})$ for each document x.

Feature extractor: Use pretrained RoBERTa [11] as feature extractor. Note that the feature extractor is shared among all data, including augmented data. It computes on $x_{i}$ , $x_{j}$ independently and outputs hidden features $h_{i}$ , $h_{j}$ .

Projection head: Adopt an MLP applying on $h_{i}$ , $h_{j}$ to get the projected representations $z_{i}$ , $z_{j}$ (the outputs of two MLP layers) and $z_{i}^{'}$ , $z_{j}^{'}$ (the outputs of three MLP layers).

Loss function: SECL-P adopts the loss function from Eq. (8), and SECL-N adopts the loss function from Eq. (13).
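Two of the four augmentation operations listed above, random swap and random deletion, can be sketched without external resources; synonym replacement and random insertion need a synonym dictionary and are omitted. This is an illustrative sketch in the spirit of [16], not the authors' implementation; the function names, the deletion probability, and the example sentence are assumptions.

```python
import random

def random_swap(words, n_swaps, rng):
    """Swap the positions of two randomly chosen words, n_swaps times."""
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n_swaps):
        a, b = rng.sample(range(len(words)), 2)
        words[a], words[b] = words[b], words[a]
    return words

def random_deletion(words, p, rng):
    """Drop each word independently with probability p, keeping at least one word."""
    if not words:
        return []
    kept = [w for w in words if rng.random() > p]
    return kept if kept else [rng.choice(words)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
tokens = "a gripping and well acted drama".split()
view_i = random_swap(tokens, n_swaps=1, rng=rng)   # plays the role of x_i
view_j = random_deletion(tokens, p=0.1, rng=rng)   # plays the role of x_j
```

The two views `view_i` and `view_j` would then be fed independently to the shared feature extractor to obtain $h_{i}$ and $h_{j}$.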

Fig. 7.

SECL.

4. Experiments and discussion

The experiments in this section are designed to evaluate the performance of SECL and to ensure that our results can be easily reproduced with reasonable computation resources. Each experiment is run 5 times with different random seeds, and the results are averaged.

4.1. Datasets

The experiments are conducted on the Stanford Sentiment Treebank dataset (SST-2), the Amazon Review Sentiment Classification dataset (ARSC) and AG’s News Topic Classification dataset (AGNews). SST-2 and ARSC contain two kinds of emotions: Positive and Negative. AGNews is a four-class classification task. To verify the existence of collapse, we also build SST-2 (mini), a subset of SST-2 in which the numbers of Positive and Negative samples are the same (7500 each). The distribution of the datasets is shown in Table 1. The pre-training data is unlabeled, and the fine-tuning and testing data are labeled.

Table 1
Distribution of the datasets

Dataset	Pretrain	Train	Test
SST-2	20000	8000	1500
SST-2(mini)	15000	8000	1500
ARSC	18624	5963	998
AGNews	20000	10000	1000

4.2. Experimental settings

We use pretrained RoBERTa-base as our feature extractor. During pre-training, we update both RoBERTa and MLP. In fine-tuning, we replace the MLP in SECL with a linear classification head and update SECL.

The parameters for each training stage are shown in Table 2, except that SECL-N uses a learning rate of 1e−5 in the pretraining stage. The MLP’s hidden dimension size is 768. The value of λ is set per dataset: 1.4 (SST-2), 1.6 (ARSC), and 2 (AGNews).

Table 2
Parameter settings

Stage	Seq-length	Batch size	Learning rate	Epochs	Temperature
Pretrain	128	16	2e−5	30	0.05
Finetune	128	32	2e−5	10	–

4.3. Comparison with baselines

SECL ( $M = 2$ ) is compared with RoBERTa, SimCLR, and SimSiam in Table 3. SECL outperforms all other methods on the SST-2, ARSC and AGNews datasets. On SST-2, it gives an absolute improvement of 1.31% in classification precision over SimCLR, from 92.83% to 94.14%. The improvement of SECL-P in downstream classification tasks suggests that it learns more effective representations with better robustness and generalization. Meanwhile, SECL-N achieves precision similar to that of SECL-P with a simpler network.

Table 3
Comparison with baselines

	SST-2			ARSC			AGNews		
Model	Precision	Recall	F1	Precision	Recall	F1	Precision	Recall	F1
RoBERTa	92.52%	92.48%	0.925	86.58%	86.9%	0.852	89.9%	89.85%	0.897
SimCLR	92.83%	92.79%	0.927	88.53%	88.67%	0.874	89.84%	89.76%	0.896
SimSiam	93.16%	93.13%	0.93	87.48%	87.79%	0.866	90.11%	90.1%	0.9
SECL-N	94.09%	94.03%	0.94	89.03%	89.14%	0.88	90.99%	90.9%	0.908
SECL-P	94.14%	94.09%	0.941	89.25%	89.47%	0.885	90.87%	90.86%	0.907

4.4. The convergence of loss

The convergence of SECL’s loss is shown in Fig. 8 (taking SST-2 as an example), where the y axis shows the training loss on a logarithmic scale and the x axis shows the training epoch. The loss converges to $10^{- 5}$ by the $10^{th}$ epoch and continues to decline slightly after the $20^{th}$ epoch, which further increases the accuracy.

The accuracies of SimCLR and SECL on the validation set (also taking the SST-2 dataset as an example) are shown in Fig. 9. After the $10^{th}$ epoch, SECL’s validation accuracy rises fastest and is significantly higher than that of SimCLR. After the $20^{th}$ epoch, the accuracy still increases slightly.

Fig. 8.

The convergence of loss.

Fig. 9.

Accuracies on validation on SST-2.

4.5. Alignment and uniformity

We use Eq. (2) and Eq. (3) to examine the inner workings of our approaches (taking SST-2 as an example), as shown in Fig. 10. We visualize checkpoints every 300 training steps, and the arrows indicate the training direction. For both $\ell_{align}$ and $\ell_{uniform}$, lower numbers are better [6,15]. As the results of SECL-P and SECL-N are similar, we use SECL-P to represent SECL. S_SimCLR (SimCLR with super-sampling) greatly improves Alignment, and SECL improves both Alignment and Uniformity compared with SimCLR.

Fig. 10.

$ℓ_{align}$ – $ℓ_{uniform}$ plot.

4.6. Super-sampling method

In order to study the super-sampling margin, M ( $M = 2, 4, 6$ ) samples are chosen as Neighbors with batch_size = 16, and the results on SST-2 are shown in Table 4.

With the super-sampling method, S_SimCLR ( $M = 2$ ) achieves 93.81% precision, which is 0.98% higher than SimCLR. The downstream classification precision improves significantly, probably because the selected Neighbors are more likely to belong to the same class as the Starter.

When the margin is $M = 6$, precision drops to 26.7%, which means the model hardly learns a usable representation. An excessive number of Neighbors may include more samples of a different class, which disturbs model updating; Neighbors and Starters therefore need to belong to the same category for effective learning. Meanwhile, the decline in sample specificity caused by having fewer negative samples may lead to model collapse.

Table 4
Results on SST-2 for different numbers of Neighbors M ( $M = 2, 4, 6$ )

Model	Precision	Recall	F1
RoBERTa	92.52%	92.48%	0.925
SimCLR	92.83%	92.79%	0.927
S_SimCLR ( $M = 2$ )	93.81%	93.71%	0.936
S_SimCLR ( $M = 4$ )	92.65%	92.6%	0.925
S_SimCLR ( $M = 6$ )	26.7%	51.7%	0.352

4.7. Importance of stop-gradient

Table 5 presents a comparison on “with vs. without stop-gradient”. The architectures and all hyperparameters are kept unchanged, and stop-gradient is the only difference.

We update SECL asymmetrically, using different MLP layers on the output embeddings together with stop-gradient. As shown in Table 5, once the stop-gradient is removed, the precision drops significantly, from 94.14% to 83.77%. This suggests that the stop-gradient is the key to preventing the model from converging to a constant solution.

Table 5
Experimental results of “with vs. without stop-gradient”

Model	Precision	Recall	F1
Remove stop-gradient	83.77%	83.77%	0.837
SECL-P	94.14%	94.09%	0.941

4.8. Solution of collapse problem

Model collapse occurs when the super-sampling margin increases, as shown in Table 4. To investigate the impact of the collapse, the dataset SST-2 (mini) with the same amount of Positive and Negative samples is designed to ensure that Neighbors have enough samples with the same class as the Augmentation. Results are shown in Table 6.

The classification precision of S_SimCLR ( $M = 6$ ) increases significantly compared with Table 4, which demonstrates the importance of selecting the right Neighbors. The precision is still slightly lower than with $M = 2$ (91.83% versus 93.44%), which confirms the existence of collapse.

On SST-2 (mini), the classification precision of SECL with $M = 6$ is significantly higher than that of S_SimCLR. The results show that both the positive correlation loss and the negative correlation loss in sampling enhanced contrastive learning can effectively avoid model collapse.

Table 6
Results on SST-2 (mini)

Model	Precision	Recall	F1
S_SimCLR ( $M = 2$ )	93.44%	93.43%	0.934
S_SimCLR ( $M = 4$ )	92.37%	92.22%	0.922
S_SimCLR ( $M = 6$ )	91.83%	91.73%	0.917
SECL-P ( $M = 6$ )	93.76%	93.71%	0.937
SECL-N ( $M = 6$ )	93.78%	93.75%	0.937

5. Conclusion

In this work, a Sampling Enhanced Contrastive Learning model (SECL) is proposed. A super-sampling method is introduced to capture class features through set-level contrastive learning. Its sampling strategy expands the augmented samples into a contrastive-positive set, which lets the model learn class features of the target sample, reduces sampling bias and feature bias, and improves the samples’ representation. A samples-correlation strategy is proposed for the collapse problem, in which a positive correlation loss or a negative correlation loss is computed. The positive correlation loss helps balance the correlation between Augmentations and Neighbors based on set-level sample features. The negative correlation loss strengthens the negative correlation between contrastive-positive and contrastive-negative samples, so that the collapse problem can also be solved with a simple network while achieving similar precision. The strong performance of SECL on the benchmarks demonstrates its ability to learn suitable representations for downstream classification tasks.

References

1. Bojanowski and Joulin, Unsupervised learning by predicting noise, in: Proceedings of the International Conference on Machine Learning (ICML), 2017, pp. 1–10.

2. Chen, Kornblith, Norouzi and Hinton, A simple framework for contrastive learning of visual representations, 2020, arXiv preprint arXiv:2002.05709.

3. Chen and He, Exploring simple Siamese representation learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.

4. C.Y. Chuang, Robinson, Y.C. Lin et al., Debiased contrastive learning, 2020, arXiv preprint arXiv:2007.00224.

5. Devlin, M.W. Chang, Lee et al., BERT: Pre-training of deep bidirectional transformers for language understanding, 2018, arXiv preprint arXiv:1810.04805v2.

6. Gao, Yao and Chen, SimCSE: Simple contrastive learning of sentence embeddings, 2021, arXiv preprint arXiv:2104.08821.

7. J.M. Giorgi, Nitski, G.D. Bader and Wang, DeCLUTR: Deep contrastive learning for unsupervised textual representations, 2020, arXiv preprint arXiv:2006.03659.

8. J.-B. Grill, Strub, Altché, Tallec, P.H. Richemond, Buchatskaya, Doersch, B.A. Pires, Z.D. Guo, M.G. Azar, Piot, Kavukcuoglu, Munos and Valko, Bootstrap your own latent: A new approach to self-supervised learning, 2020, arXiv preprint arXiv:2006.07733v1.

9. He, Fan, Wu, Xie and Girshick, Momentum contrast for unsupervised visual representation learning, 2019, arXiv preprint arXiv:1911.05722.

10. Li, Zhou, He et al., On the sentence embeddings from pre-trained language models, 2020, arXiv preprint arXiv:2011.05864.

11. Liu, Ott, Goyal, Du, Joshi, Chen, Levy, Lewis, Zettlemoyer and V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, 2019, arXiv preprint arXiv:1907.11692.

12. Reimers and Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, 2019, arXiv preprint arXiv:1908.10084.

13. Su, Cao, Liu et al., Whitening sentence representations for better semantics and faster retrieval, 2021, arXiv preprint arXiv:2103.15316.

14. Tian, Krishnan and Isola, Contrastive multiview coding, 2019, arXiv preprint arXiv:1906.05849.

15. Wang and Isola, Understanding contrastive representation learning through alignment and uniformity on the hypersphere, 2020, arXiv preprint arXiv:2005.10242.

16. Wei and K. Zou, EDA: Easy data augmentation techniques for boosting performance on text classification tasks, 2019, arXiv preprint arXiv:1901.11196.

17. Wu, Wang, Pino and Gu, Self-supervised representations improve end-to-end speech translation, 2020, arXiv preprint arXiv:2006.12124.

18. Wu, Wang, Gu et al., CLEAR: Contrastive learning for sentence representation, 2020, arXiv preprint arXiv:2012.15466.

19. Wu, Xiong, S.X. Yu and Lin, Unsupervised feature learning via non-parametric instance discrimination, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

20. Wu, Xiong, Yu and Lin, Unsupervised feature learning via non-parametric instance discrimination, 2018, arXiv preprint arXiv:1805.01978.

21. Zhang et al., Supporting clustering with contrastive learning, 2021, arXiv preprint arXiv:2103.12953.