Contrastive learning with hybrid data augmentation and pseudo-label supervision for short text clustering

Abstract

Contrastive learning has become a powerful paradigm for unsupervised representation learning. However, its effectiveness largely depends on carefully designed data augmentation strategies to generate meaningful positive and negative pairs. Additionally, unsupervised clustering algorithms are typically sensitive to initialization and prone to converging to suboptimal local minima, resulting in unstable performance. To overcome these challenges, we propose HAPL, a unified end-to-end framework for short text clustering that integrates Hybrid data Augmentation with Pseudo-Label supervision. HAPL combines explicit and implicit data augmentation techniques in a synergistic strategy. It also incorporates an adaptive optimal transport mechanism for pseudo-label generation. This design provides principled supervision that stabilizes the optimization process and adapts to varied cluster distributions, thereby enhancing the model’s discriminative power. Furthermore, prototype learning is employed to reinforce the coherence of representations in the embedding space. Extensive experiments on eight benchmark datasets show that HAPL achieves state-of-the-art performance across various evaluation metrics. Comprehensive ablation experiments validate the contribution of each component to the overall effectiveness of the framework.

Keywords

Short text clustering; contrastive learning; data augmentation; pseudo-label; unsupervised learning

1. Introduction

Text clustering is a fundamental unsupervised learning technique widely applied in natural language processing and data mining.¹ It generally follows a three stage pipeline comprising textual representation learning, semantic similarity computation and cluster partitioning.² This methodological framework underpins critical applications such as topic discovery, news categorization, document organization, user opinion mining and search engine optimization.³ The exponential growth of social media, instant messaging, and e-commerce has led to a predominance of short texts in modern data streams. These texts, typically containing no more than 50 tokens, include microblogs, chat messages, search queries, and product reviews.⁴ This shift has positioned short text clustering as both an essential and challenging research frontier, where conventional methods often struggle due to data sparsity and limited contextual cues.⁵

To address these issues, recent studies have prioritized enriching semantic representations for short texts.⁶ Modern approaches have progressively moved beyond surface level lexical statistics toward two prevailing paradigms. One is the infusion of external knowledge through structured resources such as WordNet and domain specific knowledge graphs, which inject contextual relationships and ontological constraints.⁷ The other is deep semantic encoding via pretrained language models including Word2Vec, GloVe and BERT, which produce dense feature vectors capturing compositional semantics.⁸ These advances help mitigate the intrinsic information deficiency of short texts by transforming sparse lexical signals into rich distributed representations.

Nevertheless, both knowledge intensive and deep learning methods exhibit inherent limitations. Knowledge infusion approaches face domain coverage gaps and high ontology maintenance costs, whereas deep encoders generally require large scale labeled datasets to achieve optimal performance, and such resources are rarely available for short text scenarios.⁹ To bridge this gap, recent research has explored self-supervised enhancement strategies, such as data augmentation and pseudo-label propagation. Data augmentation generates context preserving variants to alleviate sparsity, while pseudo-label propagation leverages cluster consistency to provide self-generated supervision, jointly constructing robust feature spaces without relying on external knowledge.

Despite their promise, these techniques are not without drawbacks. Data augmentation, while effective against data scarcity, can distort the original semantics. Operations such as synonym replacement, random swapping, or insertion may introduce noise or disrupt coherence, making it challenging to preserve semantic integrity while enhancing diversity.¹⁰ Over-augmentation may further push samples away from their true semantic space, thereby impairing discriminative capability. Advanced generative augmentation methods¹¹ can produce more varied samples, but their high computational cost limits practical deployment.

In the realm of unsupervised learning, pseudo-labeling has emerged as a powerful mechanism for guiding representation learning.¹² Early work such as DeepCluster¹³ applied K-means to generate pseudo-labels for visual representation learning, albeit without a unified optimization objective. SeLa¹⁴ addressed this limitation by formulating pseudo-label assignment as an Optimal Transport (OT) problem, minimizing cross entropy loss while ensuring theoretical convergence. Subsequently, SwAV¹⁵ integrated this approach with contrastive learning to enable online clustering. However, both SeLa¹⁴ and SwAV¹⁵ rely on uniform distribution constraints for pseudo-labels, which are ineffective under real-world long-tailed data distributions. More recent methods such as RSTC¹⁶ attempt to model true class distributions to handle imbalance, while POTA¹⁷ improves pseudo-label reliability via semantic similarity at the cost of increased computational overhead.

Motivated by recent advancements in contrastive learning and deep clustering,¹⁸ we propose HAPL, a novel framework that integrates hybrid data augmentation with pseudo-label supervision to optimize contrastive learning. Our approach employs a hybrid augmentation strategy to enrich sample diversity while preserving semantic consistency. We introduce an adaptive OT-based pseudo-label generation mechanism that dynamically adjusts class distribution priors to address the clustering degradation problem in scenarios with varying data distributions. Additionally, the prototype learning component effectively captures intrinsic sample similarities to enhance representation quality.

The main contributions of this work are as follows:

(1) We propose a novel hybrid data augmentation strategy that synergistically integrates explicit and implicit augmentation techniques, enabling the generation of semantically enriched positive pairs that enhance the model’s discriminative capacity.

(2) We introduce HAPL, a principled end-to-end framework that leverages adaptive OT for pseudo-label generation, providing theoretically grounded supervision that simultaneously guides prototype learning and cluster assignment while ensuring stability across diverse cluster distributions.

(3) We conduct comprehensive empirical evaluations demonstrating that HAPL achieves state-of-the-art performance across multiple benchmark datasets and evaluation metrics, with systematic ablation studies rigorously validating the synergistic contributions of each architectural component.

2. Related work

2.1. Short text clustering

Existing short text clustering methods can be broadly classified into three categories: traditional methods, deep learning methods and deep joint clustering methods.

Traditional methods primarily rely on manually engineered features, such as Term Frequency-Inverse Document Frequency and Bag of Words models. These features are then processed using classical clustering algorithms like K-means or hierarchical clustering. While these approaches are computationally efficient and offer high interpretability, their ability to capture complex semantic representations is often limited, hindering performance on nuanced textual data.

Deep learning methods¹⁹ leverage pre-trained language models to learn dense, semantic-aware text representations, which are subsequently clustered using traditional algorithms. Although this paradigm captures semantic information more effectively, its two-stage nature, which decouples representation learning from clustering, can lead to suboptimal results.

To overcome this limitation, deep joint clustering methods integrate representation learning and clustering within an end-to-end framework, enabling mutual reinforcement between feature refinement and cluster assignment. Among these approaches, autoencoder-based methods pioneered this direction by jointly optimizing reconstruction and clustering objectives. DEC²⁰ introduces a clustering layer atop the encoder and iteratively refines cluster assignments by minimizing the KL divergence between soft assignments and auxiliary target distributions. However, its simple encoder architecture limits the capacity to capture global dependencies in data. Contrastive learning methods have gained prominence by exploiting instance-level discrimination. CC²¹ constructs positive pairs from augmented views of the same instance and negative pairs across different instances, enabling the model to learn discriminative features that naturally separate clusters. ProCA²² extends contrastive learning to the prototype level by incorporating inter-class information into class-wise prototypes and adopting class-centered distribution alignment: treating same-class prototypes as positives and other-class prototypes as negatives to achieve intra-cluster compactness and inter-cluster separability. TAKE²³ employs a Transformer AutoEncoder that incorporates transformer structures to learn global features and leverages contrastive learning mechanisms to enhance feature discrimination, while introducing a convex combination loss to produce K-means-friendly feature spaces.

Despite their success, these methods still face challenges in balancing representation quality and clustering performance and handling complex and heterogeneous data distributions. In particular, the trade-off between preserving fine-grained local structures and achieving globally coherent clusters remains unresolved, often leading to suboptimal performance when applied to real-world datasets with high variability.

2.2. Data augmentation

Research has shown that single augmentation methods often yield limited performance gains, and that contrastive learning typically requires stronger augmentation strategies than its supervised counterpart.²⁴ This insight has spurred the exploration of various text augmentation techniques.

Early efforts focused on identifying the most effective single augmentation strategy. For instance, SCCL²⁵ empirically compared synonym substitution, context augmentation and back translation, identifying context augmentation as the most effective for its task. While this work provided valuable empirical insights, it remained confined to selecting among individual augmentation methods rather than systematically combining them, thereby limiting the potential for capturing multi-faceted semantic variations.

Recognizing the limitations of single-strategy approaches, subsequent research began exploring hybrid augmentation frameworks. TCL²⁶ proposed a dual-strategy framework combining weak (Dropout) and strong (RandAugment) augmentations to construct a more challenging learning objective. This advancement introduced the concept of augmentation composition; however, both strategies primarily operate at the architectural or token-level manipulation, lacking explicit semantic-level transformations that preserve deep contextual meaning.

Despite these advancements, a critical gap persists: existing methods typically augment data at a single semantic level, either through syntactic perturbations or shallow lexical substitutions, without strategically integrating cross-level augmentations that simultaneously ensure diversity and semantic fidelity. For tasks requiring fine-grained semantic understanding, such as short text clustering, the challenge of generating high-quality "hard" positive pairs that are both meaningfully diverse and semantically consistent remains largely unresolved.

3. Methodology

The aim of this study is to design a unified model that enhances unsupervised clustering performance by integrating contrastive learning, prototype learning and pseudo-label generation. As shown in Figure 1, our framework first generates an augmented sample using an explicit augmentation strategy. Both the original and this explicitly augmented input are encoded into a shared representation space by a neural network $Φ$ . Following this, an implicit augmentation strategy is applied to produce a second augmented sample. The three resulting representations are fed into three specialized head modules, denoted as $G_{C}$ , $G_{P}$ and $G_{Z}$ . Pseudo-labels are generated through an adaptive OT algorithm, which in turn provide supervisory signals for both the clustering process and prototype learning.

Figure 1.

The overall training framework of HAPL, which jointly optimizes clustering loss, prototype loss, and contrastive loss.

3.1. Feature extraction

During the feature extraction phase, HAPL employs SentenceBERT as the text encoder and combines explicit and implicit augmentation strategies to enhance sample diversity. Explicit augmentation generates extended samples $X^{e x +}$ by replacing key words in the input text through a BERT contextual augmenter. Implicit augmentation introduces random erasure operations at the embedding representation level of the original text, creating perturbed samples by masking tokens while preserving sequence length. The augmented text is mapped to a 768 dimensional vector space, where $G_{C}$ predicts its class distribution. Subsequently, the text representations are reduced to 128 dimensions via a projection layer for subsequent training.
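The implicit augmentation described above can be illustrated with a minimal NumPy sketch. It assumes that erasure means zeroing whole token embeddings and that the sentence vector is obtained by mean pooling; the function names and the erasure probability are hypothetical and are not taken from the paper.

```python
import numpy as np

def random_erase(token_embeddings, erase_prob=0.15, rng=None):
    """Zero out randomly chosen token embeddings; sequence length is unchanged.

    token_embeddings: array of shape (batch, seq_len, dim).
    erase_prob: hypothetical per-token erasure probability (not from the paper).
    """
    rng = np.random.default_rng() if rng is None else rng
    batch, seq_len, _ = token_embeddings.shape
    # keep_mask[b, t] is False where token t of sample b is erased
    keep_mask = rng.random((batch, seq_len)) >= erase_prob
    return token_embeddings * keep_mask[:, :, None], keep_mask


def mean_pool(token_embeddings):
    """Average over the token axis to obtain one vector per text (assumed pooling)."""
    return token_embeddings.mean(axis=1)


rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 12, 768))      # 4 texts, 12 tokens, 768-dim
erased, keep_mask = random_erase(tokens, erase_prob=0.25, rng=rng)
E_im = mean_pool(erased)                    # implicit view, shape (4, 768)
```

Because the mask only zeroes vectors, the perturbed batch keeps the shape of the original, which matches the requirement that sequence length be preserved.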

3.2. Pseudo-label generation

The pseudo-label generation module bridges feature extraction and representation learning. Its workflow can be formally described as follows: Encoder network $Φ$ maps raw text $X$ to a feature representation $Φ (X) = E \in R^{N \times D_{1}}$ , where $N$ is the batch size and $D_{1}$ is the feature dimension. Subsequently, clustering network $G_{C}$ , constructed via fully connected layers, predicts cluster assignment probability $G_{C} (E) = C \in R^{N \times K}$ , where $K$ represents the predefined number of categories.

To enhance pseudo-label quality, this module solves a discrete OT problem to minimize cross entropy loss and generate pseudo-labels $y^{'} \in R^{N \times K}$ . To address potential degradation issues under random initialization conditions, a regularization penalty term for the distribution variable $b$ is introduced into the OT objective function. $b$ is dynamically updated during the solution process for the transport matrix²⁷ $π$ . This optimization problem can be formally expressed as:

$$\begin{aligned} \min_{π, b} \quad & ⟨π, M⟩ + ε_{1} H(π) + ε_{2} {(Ψ(b))}^{T} \mathbf{1}, \\ \text{s.t.} \quad & π \mathbf{1} = a, \; π^{T} \mathbf{1} = b, \; π \geq 0, \; b^{T} \mathbf{1} = 1. \end{aligned} \tag{1}$$

The transport matrix $π = \frac{1}{N} y^{'}$ represents the association between samples and categories, while the cost matrix $M = - \log C$ quantifies the migration cost from samples to categories. Hyperparameters $ε_{1}$ and $ε_{2}$ adjust the balance of the optimization objective, while the entropy regularization term $H(π) = ⟨π, \log π - 1⟩$ prevents overly sparse solutions. The class distribution variable $b$ is constrained by the penalty function $Ψ(b) = - \log b - \log(1 - b)$. The sample distribution is assumed uniform with $a = \frac{1}{N} \mathbf{1}$, where $\mathbf{1}$ is the all ones vector.

This design incorporates a dual adjustment mechanism: on one hand, the cost matrix $M$ guides $b_{j}$ to reflect the true data distribution, enabling high frequency classes to obtain larger $b_{j}$ values; on the other hand, the penalty function $Ψ (b)$ maintains the uniformity tendency of $b$ . When $b_{j}$ approaches zero for low frequency classes, it may trigger clustering degradation. $Ψ (b)$ counters this tendency to ensure all classes are effectively modeled, with its strength flexibly adjustable based on dataset imbalance.

Ultimately, through iterative optimization, a stable transport matrix $π$ is obtained and the pseudo-labels are generated by the following mapping:

$$y_{ij}^{'} = \begin{cases} 1, & \text{if } j = \underset{j^{'}}{\arg\max} \; π_{i j^{'}}, \\ 0, & \text{otherwise}. \end{cases} \tag{2}$$
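To make the role of the transport matrix concrete, the following is a simplified sketch rather than the adaptive solver used by HAPL: it runs standard Sinkhorn iterations for the entropic OT problem with a fixed class marginal $b$ (so the $ε_{2}$ penalty and the update of $b$ are omitted) and then applies the arg max mapping of Eq. 2. The function name, the default $ε_{1}$ and the iteration count are assumptions.

```python
import numpy as np

def sinkhorn_pseudo_labels(C, b=None, eps1=1.0, n_iter=1000):
    """Entropic OT with FIXED marginals (a simplification of Eq. 1), then Eq. 2.

    C: (N, K) predicted cluster probabilities, strictly positive.
    b: (K,) class marginal; uniform if None. HAPL updates b adaptively,
       which this sketch does not do.
    """
    N, K = C.shape
    a = np.full(N, 1.0 / N)                       # uniform sample marginal
    b = np.full(K, 1.0 / K) if b is None else np.asarray(b, dtype=float)
    M = -np.log(C)                                # cost matrix M = -log C
    kernel = np.exp(-M / eps1)                    # Gibbs kernel of the entropic problem
    u, v = np.ones(N), np.ones(K)
    for _ in range(n_iter):
        u = a / (kernel @ v)                      # rescale rows towards marginal a
        v = b / (kernel.T @ u)                    # rescale columns towards marginal b
    pi = u[:, None] * kernel * v[None, :]         # transport plan
    y = np.zeros_like(pi)
    y[np.arange(N), pi.argmax(axis=1)] = 1.0      # Eq. 2: one-hot arg max per row
    return pi, y


rng = np.random.default_rng(0)
logits = rng.normal(size=(64, 4))
C = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
pi, y = sinkhorn_pseudo_labels(C, b=[0.4, 0.3, 0.2, 0.1])
```

Passing a non-uniform `b` shows how the class marginal steers how much mass each cluster receives; in HAPL this marginal is itself optimized instead of being supplied by hand.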

3.3. Cluster learning

The objective of clustering learning is to aggregate samples belonging to the same semantic category. Specifically, two augmented text segments from the same original text form a positive sample pair; the pseudo-label serves as the supervised target for this pair. Through a hybrid augmentation strategy, the embedded representations $E^{i m +} \in R^{N \times D_{1}}$ and $E^{e x +} \in R^{N \times D_{1}}$ of the two augmented sample sets are obtained, respectively yielding the predicted distributions $C^{i m +} \in R^{N \times K}$ and $C^{e x +} \in R^{N \times K}$ via the clustering network. To achieve consistency in the prediction distributions across different augmented views of the same original text, a cross entropy constraint targeting the pseudo-labels is employed, defined as follows:

$$L_{C} = - \frac{1}{N} \left( ⟨y^{'}, \log C^{ex+}⟩ + ⟨y^{'}, \log C^{im+}⟩ \right), \tag{3}$$

where $C^{ex+}$ and $C^{im+}$ represent the predicted probability distributions under the two augmented views, respectively.
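A small numerical sketch of the clustering loss may help. It reads Eq. 3 as the cross entropy of both augmented views against the same one-hot pseudo-labels, averaged over the batch; the function name and the small constant added inside the logarithm are assumptions made for illustration.

```python
import numpy as np

def cluster_loss(y, C_ex, C_im, tiny=1e-12):
    """Eq. 3: cross entropy of both views against one-hot pseudo-labels y."""
    N = y.shape[0]
    inner_ex = np.sum(y * np.log(C_ex + tiny))    # <y', log C^{ex+}>
    inner_im = np.sum(y * np.log(C_im + tiny))    # <y', log C^{im+}>
    return -(inner_ex + inner_im) / N


y = np.eye(3)[[0, 1, 2, 0]]                       # 4 samples, 3 clusters
confident = 0.94 * y + 0.02                       # 0.96 on the label, rows sum to 1
uniform = np.full((4, 3), 1.0 / 3.0)
loss_good = cluster_loss(y, confident, confident)
loss_flat = cluster_loss(y, uniform, uniform)
```

Predictions that agree with the pseudo-labels in both views give a loss near zero, while uninformative predictions give $2 \log K$.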

3.4. Prototype learning

Consistent with the pseudo-label generation, the original text $X$ is encoded into a representation vector $E \in R^{N \times D_{1}}$ via an encoding network, then mapped to a prototype representation $P \in R^{N \times D_{1}}$ through a learnable parameter matrix $G_{P}$. Prototype learning maintains a learnable prototype vector for each category, achieving intra class aggregation through cross entropy loss. For sample $i$ in a batch, with features $E_{i}$ and pseudo-label ${y^{'}}_{i}$, the cosine similarity $sim(E_{i}, P_{k})$ is computed between it and all $K$ prototypes $\{P_{1}, P_{2}, \dots, P_{K}\}$. The prototype loss is defined as:

$$L_{P} = - \frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(sim(E_{i}, P_{{y^{'}}_{i}}) / τ)}{\sum_{k=1}^{K} \exp(sim(E_{i}, P_{k}) / τ)}, \tag{4}$$

where $P_{k}$ is the $k$-th prototype and $τ$ is the temperature hyperparameter.
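The prototype loss can be sketched as a temperature-scaled softmax over cosine similarities. The version below assumes the temperature divides the similarity inside the exponential, as in the instance-level loss of Section 3.5, and uses integer pseudo-labels; the function name and the value of `tau` are hypothetical.

```python
import numpy as np

def prototype_loss(E, prototypes, labels, tau=0.5):
    """Eq. 4 with the temperature applied inside the exponential.

    E: (N, D) features; prototypes: (K, D); labels: (N,) integer pseudo-labels.
    tau is a hypothetical value, not taken from the paper.
    """
    E_n = E / np.linalg.norm(E, axis=1, keepdims=True)
    P_n = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = (E_n @ P_n.T) / tau                   # cosine similarity / tau
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(labels)), labels].mean()


rng = np.random.default_rng(0)
prototypes = rng.normal(size=(5, 16))
labels = rng.integers(0, 5, size=32)
E = prototypes[labels] + 0.1 * rng.normal(size=(32, 16))   # near own prototype
loss_matched = prototype_loss(E, prototypes, labels)
loss_shifted = prototype_loss(E, prototypes, (labels + 1) % 5)
```

Features that sit close to the prototype named by their pseudo-label yield a lower loss than the same features paired with the wrong prototype, which is the intra class compactness effect the loss is meant to encourage.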

3.5. Contrastive learning

Contrastive learning aims to pull the projection representations of positive sample pairs closer together while pushing negative sample pairs apart. To achieve this, we use a fully connected layer $G_{Z}$ as the projection network to map augmented representations into a comparison space. This yields $G_{Z} (E^{i m +}) = Z^{i m +} \in R^{N \times D_{2}}$ and $G_{Z} (E^{e x +}) = Z^{e x +} \in R^{N \times D_{2}}$ , where $D_{2}$ is the projection dimension. Suppose a batch contains $2 N$ augmented samples, whose projection representation is $Z = {[Z^{e x +}, Z^{i m +}]}^{T}$ . Given a positive pair, where two texts are augmented from the same original text, the other $2 (N - 1)$ augmented texts are treated as negative samples. The loss for a pair of texts $(i, j)$ is defined as:

$$l(i, j) = - \log \frac{\exp(sim(Z_{i}, Z_{j}) / τ)}{\sum_{k=1}^{2N} I_{k \neq i} \exp(sim(Z_{i}, Z_{k}) / τ)}, \tag{5}$$

where $I$ denotes the indicator function.

The instance comparison loss considers all positive sample pairs within a batch, including $(i, j)$ and $(j, i)$ :

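$$L_{I} = \frac{1}{2N} \sum_{i=1}^{N} \left( l(i, i + N) + l(i + N, i) \right). \tag{6}$$

Here rows $i$ and $i + N$ of $Z$ hold the explicit and implicit views of the same original text.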
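The instance-level loss can be sketched in NumPy as follows. The sketch assumes the stacked layout $Z = {[Z^{ex+}, Z^{im+}]}^{T}$ given above, so that row $i$ is paired with row $i + N$; the function name and the temperature value are hypothetical.

```python
import numpy as np

def instance_contrastive_loss(Z_ex, Z_im, tau=0.5):
    """Eqs. 5-6 with the two views stacked as Z = [Z_ex; Z_im] (2N rows)."""
    N = Z_ex.shape[0]
    Z = np.concatenate([Z_ex, Z_im], axis=0)
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    sim = (Z @ Z.T) / tau                          # cosine similarity / tau
    np.fill_diagonal(sim, -np.inf)                 # indicator term: drop k == i
    sim_max = sim.max(axis=1, keepdims=True)
    log_denom = sim_max[:, 0] + np.log(np.exp(sim - sim_max).sum(axis=1))
    partner = np.concatenate([np.arange(N, 2 * N), np.arange(N)])
    log_num = sim[np.arange(2 * N), partner]       # similarity to the positive
    return float(np.mean(log_denom - log_num))     # mean of l(i, j) over 2N rows


rng = np.random.default_rng(0)
Z_ex = rng.normal(size=(8, 128))
Z_close = Z_ex + 0.01 * rng.normal(size=(8, 128))  # views that nearly agree
Z_rand = rng.normal(size=(8, 128))                 # unrelated "views"
loss_aligned = instance_contrastive_loss(Z_ex, Z_close)
loss_random = instance_contrastive_loss(Z_ex, Z_rand)
```

Averaging over all $2N$ rows covers both orderings of every positive pair, which is what the factor $\frac{1}{2N}$ in Eq. 6 accounts for.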

Three loss functions work together to enhance representation quality across three levels: prototypical, categorical and instance. The prototype loss promotes intra class compactness, the clustering loss enhances inter class discriminability, and the instance contrastive loss maintains sample level discriminative power. The final objective function is:

$$L = α L_{P} + β L_{C} + γ L_{I}, \tag{7}$$

where $α$, $β$ and $γ$ are weight hyperparameters.

In our experiments, setting $α = 1$, $β = 1$ and $γ = 10$ allows the model to learn rapidly through instance comparisons during early training stages and to subsequently refine the clustering structure by leveraging prototype and class comparisons in later stages.

3.6. Training and convergence

The training process of HAPL follows a carefully designed two-stage paradigm aimed at progressively refining representations and cluster assignments. The first stage is the Warm-up Phase, whose core objective is to construct a well-structured and discriminative feature space through instance-level contrastive learning. In this phase, the model applies a hybrid augmentation strategy to each training batch of input texts, generating both explicit and implicit augmented views. These views are then encoded into feature representations, and the contrastive loss $L_{I}$ is computed to maximize consistency between different views of the same instance while minimizing consistency with views of other instances. This stage trains the encoder $Φ$ and the projection head $G_{Z}$ exclusively without using any clustering loss or pseudo-labels. Its fundamental purpose is to transform the initial SentenceBERT features into a geometrically more separable space, providing a robust initialization foundation for the subsequent clustering stage.

The second stage is the Joint Optimization Phase with Adaptive Feedback. Building upon the pre-trained encoder from the first stage, this phase synchronously optimizes feature representations and cluster assignments through an iterative approach. The workflow for each iteration is as follows: First, in the Forward Propagation and Loss Calculation step, the model computes feature representations for a given batch. The clustering head $G_{C}$ outputs cluster assignment distributions, which are used to compute the clustering loss $L_{C}$ . Simultaneously, the prototype loss $L_{P}$ is calculated by measuring the alignment between features and the current prototype vectors, while the contrastive loss $L_{I}$ continues to be applied to augmented views to maintain local discriminability. All losses are computed based on the pseudo-labels $y^{' (t - 1)}$ generated in the previous iteration. Next, in the Backward Propagation and Parameter Update step, the total loss $L = α L_{P} + β L_{C} + γ L_{I}$ is minimized via gradient descent, thereby updating all model parameters, including the encoder $Φ$ , the various heads $G_{P}$ , $G_{C}$ , $G_{Z}$ , and the prototype vectors. Subsequently, in the Pseudo-label Update step, after parameter updates are complete, the Adaptive Optimal Transport module is invoked on the latest feature representations of the entire dataset. By solving the OT problem, it generates a new and more refined set of pseudo-labels $y^{' (t)}$ for the next iteration. The adaptive penalty term $Ψ (b)$ plays a crucial role in this step by preventing cluster degeneration, especially in scenarios with imbalanced data distributions, ensuring that all clusters are effectively represented. Finally, in the Convergence Monitoring step, the algorithm checks for convergence by evaluating the change in pseudo-labels between iterations. If the change falls below a predefined threshold $δ$ or if the maximum number of iterations is reached, the training process terminates. 
The entire training process is summarized in Algorithm 1.

Algorithm 1: HAPL
Input: Dataset $X$; number of warm-up epochs pre_epoch; number of total epochs total_epoch.
Output: The clustering model.
1. Generate explicit augmented dataset $X^{ex+}$ based on $X$.
2. Load the pre-trained SentenceBERT as encoder $Φ$ and initialize parameters in networks $G_{P}$, $G_{C}$, and $G_{Z}$.
3. Obtain embedded representations $\{E, E^{ex+}\}$ through encoder $Φ$. Generate implicit augmented dataset $E^{im+}$ based on $E$.
for epoch = 1 to total_epoch do
    4. Sample a mini-batch $\{E, E^{ex+}, E^{im+}\}$.
    5. Compute representations $\{Z^{ex+}, Z^{im+}\}$ and probability assignments $\{P, C, C^{ex+}, C^{im+}\}$.
    if epoch < pre_epoch then
        6. Compute pseudo-labels $y^{'(0)}$ using k-means.
        7. Compute the loss $L_{I}$.
        8. Update parameters in $Φ$ and $G_{Z}$.
    else
        9. Compute probability assignments $C$.
        10. Compute pseudo-labels $y^{'}$ using AOT.
        11. Compute the loss $L = α L_{P} + β L_{C} + γ L_{I}$ (Eq. 7).
        12. Update parameters in $Φ$, $G_{P}$, $G_{C}$, and $G_{Z}$.
    end if
end for
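The convergence monitoring step of Section 3.6 can be sketched as a comparison of consecutive pseudo-label assignments. The paper does not state how the change is measured or what value $δ$ takes, so the fraction of changed labels and the threshold below are assumptions, as are the function names.

```python
import numpy as np

def label_change_ratio(prev_labels, new_labels):
    """Fraction of samples whose pseudo-label changed between two iterations."""
    prev_labels = np.asarray(prev_labels)
    new_labels = np.asarray(new_labels)
    return float(np.mean(prev_labels != new_labels))


def has_converged(prev_labels, new_labels, delta=0.01):
    """True when the change ratio falls below delta (hypothetical threshold)."""
    return label_change_ratio(prev_labels, new_labels) < delta


prev = np.array([0, 1, 1, 2, 2, 2, 0, 1])
new = np.array([0, 1, 1, 2, 2, 1, 0, 1])          # one of eight labels changed
ratio = label_change_ratio(prev, new)
```

Comparing raw label indices is reasonable here because the pseudo-labels are tied to the output units of the clustering head, so cluster identities do not permute between iterations.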

4. Experiments

4.1. Datasets

We evaluated our method on eight widely used real-world short text datasets, covering multiple domains including news headlines, technical forums, biomedical literature and social media under varying degrees of class imbalance. Each experiment was repeated at least 10 times to ensure statistical reliability. Table 1 summarizes the key statistics of the datasets, including vocabulary size ( $V$ ), number of samples ( $N$ ), number of classes ( $K$ ) and the imbalance ratio ( $R$ ), defined as the ratio of the largest to the smallest class size. AgNews, StackOverflow and Biomedical are balanced datasets; SearchSnippets is a mildly imbalanced dataset; GoogleNews-TS, GoogleNews-T and GoogleNews-S are moderately imbalanced datasets; Tweet is a heavily imbalanced dataset.

Table 1.
Dataset details.

Datasets	V	N	K	R
AgNews	21K	8000	4	1
StackOverflow	15K	20000	20	1
Biomedical	19K	20000	20	1
SearchSnippets	31K	12340	8	7
GoogleNews-TS	20K	11109	152	143
GoogleNews-T	8K	11109	152	143
GoogleNews-S	18K	11109	152	143
Tweet	5K	2472	89	249
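As a quick illustration of the imbalance ratio $R$ reported in Table 1, the definition (largest class size divided by smallest class size) can be computed as below. The label arrays are toy data, not the benchmark labels, and the function name is hypothetical.

```python
import numpy as np

def imbalance_ratio(labels):
    """R = size of the largest class divided by size of the smallest class."""
    _, counts = np.unique(labels, return_counts=True)
    return counts.max() / counts.min()


balanced = np.repeat(np.arange(4), 2000)          # 4 classes x 2000 samples
skewed = np.repeat([0, 1, 2], [70, 20, 10])       # toy imbalanced labels
r_balanced = imbalance_ratio(balanced)
r_skewed = imbalance_ratio(skewed)
```

A perfectly balanced label set gives $R = 1$, and the toy skewed set gives $R = 7$.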

Following the experimental setup of SCCL,²⁵ we used raw, unprocessed text as input. This allows us to assess model robustness in noisy environments while ensuring a fair comparison with baseline methods.

AgNews²⁸ comprises 8,000 news articles evenly distributed across four categories: World, Sports, Business and Science/Technology. It is commonly used in news classification studies.

StackOverflow¹⁹ includes 20,000 question titles from the Kaggle challenge, labeled into 20 technical topics such as Qt, Matlab and Bash. It is widely used in technical text classification and domain specific QA systems.

Biomedical¹⁹ consists of 20,000 paper titles from PubMed (BioASQ project), spanning 20 medical categories. The dataset contains abundant domain specific terminology, suitable for biomedical NLP tasks.

SearchSnippets²⁹ contains 12,340 web search snippets from Google, categorized into eight thematic classes. The texts are keyword rich and lack full sentence structure, making it suitable for keyword representation and retrieval research.

GoogleNews³⁰ is built from a snapshot of Google News,³¹ from which the titles and snippets of 11,109 news articles belonging to 152 clusters were crawled. The dataset is further divided into three subsets: GoogleNews-TS (title + summary), GoogleNews-T (title only) and GoogleNews-S (summary only). These are commonly used in event detection and news analysis.

Tweet³⁰ contains 2,472 short messages from the microblog track of TREC 2011-2012, labeled into 89 categories. Its informal style, slang and symbols pose challenges for social media text analysis.

4.2. Experiment settings

All experiments are conducted on a workstation equipped with an NVIDIA A40 GPU (48GB VRAM), an Intel Xeon Platinum 8380 CPU and 16GB of RAM. The model is implemented using PyTorch 2.2.1 with Python 3.9.13 and CUDA 12.0. We employ the distilbert-base-nli-stsb-mean-tokens model from the Sentence Transformers library³² as the sentence encoder. The model is optimized using the Adam optimizer. The number of clusters is set to the ground-truth category count for each dataset. All results are averaged over ten independent runs and reported across four standard clustering metrics: Accuracy (ACC), Normalized Mutual Information (NMI), Adjusted Rand Index (ARI), and Adjusted Mutual Information (AMI).
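The text does not spell out how ACC is computed. A common convention in clustering work, assumed here, is to find the best one-to-one mapping between predicted clusters and ground-truth classes with the Hungarian algorithm before counting matches. The sketch below uses `scipy.optimize.linear_sum_assignment`; the function name `clustering_accuracy` is hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """ACC under the best one-to-one mapping between clusters and classes."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = max(y_true.max(), y_pred.max()) + 1
    counts = np.zeros((n, n), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        counts[p, t] += 1                          # contingency table
    rows, cols = linear_sum_assignment(counts, maximize=True)
    return counts[rows, cols].sum() / y_true.size


y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([2, 2, 2, 0, 0, 1, 1, 1, 1])    # permuted ids, one mistake
acc = clustering_accuracy(y_true, y_pred)
```

The remaining three metrics have standard implementations in scikit-learn (`normalized_mutual_info_score`, `adjusted_rand_score` and `adjusted_mutual_info_score` in `sklearn.metrics`), which could be used alongside this function.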

4.3. Baselines

Among traditional methods, we adopted an approach that combines Bag of Words (BOW) or TF-IDF features with the K-means algorithm, serving as a feature engineering baseline. For deep learning methods, two representative models were included: STC $^{2}$ , which employs a two stage pipeline using Word2Vec embeddings and convolutional neural networks (CNNs) for feature extraction; Self-Train, which utilizes an autoencoder enhanced with Smooth Inverse Frequency (SIF) weighting for training.

In the category of deep joint clustering methods, we selected: SCCL,²⁵ which builds on a SentenceBERT (SBERT) framework and incorporates instance level contrastive learning to refine text representations, combined with deep embedded clustering for joint optimization of feature learning and cluster assignment; RSTC,¹⁶ which employs OT to generate pseudo-labels for guiding cluster learning. These baselines collectively represent state-of-the-art advancements in short text clustering and allow for a rigorous assessment of HAPL’s performance across different learning paradigms.

Table 2.
Clustering performance (%) comparison on eight short text datasets. The best results are in bold and the second best results are underlined.

	AgNews	SearchSnippets
	ACC	NMI	ARI	AMI	ACC	NMI	ARI	AMI
BOW	26.93	3.12	0.17	3.06	23.71	9.41	0.49	9.28
TF-IDF	30.89	7.20	2.83	7.15	29.80	17.58	2.83	17.48
STC $^{2}$ ¹⁹	–	–	–	–	76.75	62.39	57.84	61.98
Self-Train³³	–	–	–	–	77.88	56.88	55.71	56.84
SCCL²⁵	86.23	64.91	68.43	64.78	79.14	68.03	64.05	68.00
RSTC¹⁶	85.42	63.34	66.86	63.21	79.51	65.62	61.38	65.16
POTA¹⁷	85.94	63.74	67.05	63.59	79.21	68.28	66.16	68.25
HAPL	86.81	65.77	69.08	65.72	81.92	70.92	68.78	70.58
	StackOverflow	Biomedical
	ACC	NMI	ARI	AMI	ACC	NMI	ARI	AMI
BOW	51.58	57.67	17.82	57.52	26.26	28.05	4.63	27.78
TF-IDF	64.72	67.93	26.80	67.83	31.80	30.62	6.95	30.38
STC $^{2}$ ¹⁹	50.87	48.93	30.26	48.77	43.26	38.12	25.21	37.41
Self-Train³³	59.08	52.90	41.96	52.74	39.29	33.63	22.36	33.40
SCCL²⁵	71.05	71.21	49.20	71.12	38.36	33.55	21.49	33.35
RSTC¹⁶	79.95	72.34	66.81	71.93	45.71	38.97	28.18	38.69
POTA¹⁷	82.40	73.59	69.55	73.48	46.91	38.91	28.16	38.46
HAPL	85.48	76.71	73.66	76.63	48.01	40.76	29.71	40.69
	GoogleNews-TS				GoogleNews-T
	ACC	NMI	ARI	AMI	ACC	NMI	ARI	AMI
BOW	58.23	82.15	29.02	78.75	49.94	73.50	11.06	68.64
TF-IDF	70.52	88.91	57.18	86.78	58.48	79.69	24.34	75.79
SCCL²⁵	82.96	92.68	81.79	91.49	73.91	86.02	72.76	83.47
RSTC¹⁶	84.11	93.26	83.40	92.72	74.73	87.87	72.79	85.62
POTA¹⁷	83.34	92.77	82.35	91.71	73.85	87.63	72.53	85.33
HAPL	85.53	94.05	84.79	93.18	77.74	89.22	75.14	87.53
	GoogleNews-S				Tweet
	ACC	NMI	ARI	AMI	ACC	NMI	ARI	AMI
BOW	51.30	74.93	22.45	70.21	52.37	72.49	23.89	66.02
TF-IDF	63.84	84.14	41.78	81.07	54.87	78.54	37.24	72.55
SCCL²⁵	72.38	85.23	70.65	82.80	74.97	86.78	74.11	83.57
RSTC¹⁶	77.63	88.64	76.13	86.77	77.41	86.24	79.79	83.29
POTA¹⁷	78.97	88.78	77.06	86.96	78.23	86.94	80.11	84.31
HAPL	78.52	89.91	78.43	88.64	80.90	88.39	84.50	86.41

4.4. Performance and analysis

Table 2 presents the experimental results of the proposed method on eight short text benchmark datasets. Overall, the traditional methods BOW and TF-IDF are limited by the inherent high-dimensional sparsity of short text, which makes it difficult to capture deep semantic information, and they perform worst on most datasets. STC $^{2}$ ¹⁹ and Self-Train,³³ two-stage methods based on deep learning, alleviate the sparsity problem by means of pre-trained word embeddings and the representation ability of deep neural networks, and they improve over BOW and TF-IDF most clearly on SearchSnippets and Biomedical. The deep joint clustering methods SCCL,²⁵ RSTC,¹⁶ POTA¹⁷ and the proposed HAPL achieve strong clustering performance on all eight datasets, which supports the effectiveness of the end-to-end joint optimization strategy. Among all compared methods, HAPL achieves the best overall performance: it obtains the highest score on every metric and dataset except ACC on GoogleNews-S, where POTA is marginally higher (78.97 vs. 78.52). This indicates strong robustness and generalizability across diverse domains and dataset scales.

4.5. The impact of hyperparameters

We investigate the effects of three key hyperparameters: $α$ (prototype loss weight), $β$ (clustering loss weight), and $γ$ (contrastive loss weight). As shown in Figure 2, we evaluate $α \in \{0, 1, 2, 3, 5, 10, 20\}$ , $β \in \{0, 1, 2, 3, 5, 10, 20\}$ , and $γ \in \{0, 1, 5, 10, 20, 50, 100\}$ on three representative datasets with varying class imbalance: StackOverflow serves as a balanced dataset, GoogleNews-T as an imbalanced dataset, and Tweet as a severely imbalanced dataset. The results show that model performance is relatively insensitive to $α$ and $γ$ across a wide range, whereas performance stabilizes and becomes robust only when $β \geq 1$ . Based on these observations and considering the need for a balanced contribution from all losses, we select $α = 1$ , $β = 1$ , and $γ = 10$ as our final hyperparameter configuration. This setting allows contrastive learning to dominate early training, while the clustering loss and prototype loss fine-tune the structure in later stages.

Figure 2.

The impact of $α$ , $β$ and $γ$ on model performance.
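The role of the three weights can be illustrated with a minimal sketch. It assumes that the training objective is a plain weighted sum of the prototype, clustering and contrastive losses, and that the sweep varies one weight at a time while the other two stay at the selected values; neither assumption is spelled out in this section, and the names `LossWeights`, `total_loss` and `one_at_a_time_sweep` are hypothetical.

```python
from dataclasses import dataclass, replace

# Sweep grids quoted from the text above.
ALPHA_GRID = [0, 1, 2, 3, 5, 10, 20]
BETA_GRID = [0, 1, 2, 3, 5, 10, 20]
GAMMA_GRID = [0, 1, 5, 10, 20, 50, 100]


@dataclass(frozen=True)
class LossWeights:
    alpha: float = 1.0   # prototype loss weight (selected value)
    beta: float = 1.0    # clustering loss weight (selected value)
    gamma: float = 10.0  # contrastive loss weight (selected value)


def total_loss(proto, cluster, contrastive, w=LossWeights()):
    """Assumed form: a plain weighted sum of the three loss terms."""
    return w.alpha * proto + w.beta * cluster + w.gamma * contrastive


def one_at_a_time_sweep(base=LossWeights()):
    """Assumed protocol: vary one weight over its grid, keep the others at `base`."""
    grids = (("alpha", ALPHA_GRID), ("beta", BETA_GRID), ("gamma", GAMMA_GRID))
    return [replace(base, **{name: float(v)}) for name, grid in grids for v in grid]


if __name__ == "__main__":
    print(total_loss(0.5, 0.2, 0.1))   # loss values here are made-up placeholders
    print(len(one_at_a_time_sweep()))  # number of configurations in the sweep
```

Under this form, when the three terms have comparable magnitude, $γ = 10$ gives the contrastive term ten times the weight of each of the other two.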

Additionally, we investigate the effects of the hyperparameters $ε_{1}$ and $ε_{2}$ through experiments conducted over the ranges $\{0.01, 0.05, 0.1, 0.2, 0.5, 1\}$ and $\{0, 0.001, 0.01, 0.1, 1, 10\}$ , respectively. As shown in Figure 3, $ε_{1}$ and $ε_{2}$ have negligible impact on performance for balanced datasets but the model is sensitive to them on imbalanced and severely imbalanced datasets. Consequently, $ε_{1} = 0.1$ and $ε_{2} = 1$ are set for balanced datasets, while $ε_{1} = 0.1$ and $ε_{2} = 0.1$ are employed for imbalanced and severely imbalanced datasets.

Figure 3.

The impact of $ε_{1}$ and $ε_{2}$ on model performance.

In the experimental evaluation, the number of clusters was set to the true number of categories. All experiments were repeated ten times and their average results were reported to ensure the robustness of the conclusions.
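As an illustration of this protocol, the sketch below computes clustering accuracy (ACC) and averages it over repeated runs. It assumes the common definition of ACC as the best accuracy over one-to-one mappings between predicted clusters and true categories, which this section does not state explicitly. The mapping is found by brute force over permutations, so the sketch is only practical for a handful of clusters; datasets with many clusters would need a polynomial-time assignment solver instead. The function names and the toy labels are hypothetical.

```python
from collections import Counter
from itertools import permutations
from statistics import mean


def clustering_accuracy(y_true, y_pred):
    """Best accuracy over one-to-one mappings from cluster ids to class ids."""
    classes = sorted(set(y_true))
    clusters = sorted(set(y_pred))
    k = max(len(classes), len(clusters))
    # Pad the shorter side with dummy ids so that a full matching exists.
    classes += [None] * (k - len(classes))
    clusters += [None] * (k - len(clusters))
    counts = Counter(zip(y_pred, y_true))  # (cluster, class) -> number of samples
    best = 0
    for perm in permutations(classes):
        matched = sum(counts[(c, t)] for c, t in zip(clusters, perm))
        best = max(best, matched)
    return best / len(y_true)


def average_over_runs(y_true, runs):
    """Mean ACC over several runs (ten in the protocol described above)."""
    return mean(clustering_accuracy(y_true, y_pred) for y_pred in runs)


if __name__ == "__main__":
    y_true = [0, 0, 1, 1, 2, 2]
    runs = [[1, 1, 0, 0, 2, 2], [1, 1, 0, 0, 2, 0]]  # toy predictions
    print(average_over_runs(y_true, runs))
```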

4.6. Visualization

To clarify the distinct roles of each module, we performed t-SNE visualization on the AgNews dataset, as shown in Figure 4. (a) SBERT: Raw features exhibit significant overlap with blurred class boundaries. (b) Contrastive Learning (CL): While CL substantially improves classification metrics by learning a more discriminative feature space, its effect on low-dimensional projection separation is limited. This is because CL optimizes instance-level similarity in the high-dimensional metric space, which effectively aids the classifier but does not directly enforce geometric separation in a 2D projection like t-SNE. (c) HAPL without Augmentation (HAPL $^{HA-}$ ): Introducing pseudo-labels via Adaptive OT provides category-level supervision. This explicitly drives prototype and cluster learning to achieve intra-class compactness and inter-class separation, resulting in well-formed, distinct clusters in the visualization. (d) HAPL: Integrating hybrid data augmentation further refines the clusters by providing robust positive pairs, working synergistically with pseudo-label supervision. In summary, the visualization reveals complementary roles: CL learns discriminative features for better classification, while pseudo-label-guided learning directly structures the geometric layout of the representation space.

Figure 4.

t-SNE visualization depicted on AgNews, with each color representing a true category. (a) SBERT, (b) CL, (c) HAPL $^{HA-}$ and (d) HAPL

4.7. Ablation experiments

4.7.1. Hybrid data augmentation

This paper systematically investigates the impact of diverse unsupervised text augmentation strategies through ablation studies. Unless otherwise specified, all augmentation techniques are applied at a probability of 10%. The evaluated techniques are categorized as follows:

Explicit Augmentation: Techniques that generate new textual instances, including back-translation (bt0.1), contextual word replacement using the pre-trained language models BERT (be0.1) and RoBERTa (rb0.1), and Easy Data Augmentation (eda0.1).

Implicit Augmentation: Techniques that perturb input representations without generating new text, specifically random token replacement at probabilities of 10% (tr0.1) and 30% (tr0.3), and random token erasure (te0.1).

Hybrid Augmentation: Combinations integrating one explicit augmentation method with one implicit augmentation method.

All augmentation strategies are applied exclusively to unlabeled training data, maintaining label consistency. Figure 5 summarizes the performance of these augmentation strategies on the StackOverflow, Biomedical and GoogleNews-T datasets. Experimental results demonstrate that hybrid augmentation methods generally outperform single augmentation techniques, with hy(be0.1,te0.1) achieving the best overall performance in terms of ACC across multiple datasets.
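The implicit operations and their combination with an explicit method can be sketched as follows. This is a toy illustration rather than the authors' implementation: it operates on whitespace tokens instead of model token ids, it assumes that the explicit step is applied before the implicit one, and `toy_explicit` with its synonym table is a hypothetical stand-in for a real explicit method such as contextual word replacement with BERT.

```python
import random


def token_erasure(tokens, p=0.1, rng=None):
    """Drop each token independently with probability p (te0.1 uses p = 0.1)."""
    rng = rng or random.Random()
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else tokens[:1]  # keep at least one token


def token_replacement(tokens, vocab, p=0.1, rng=None):
    """Swap each token for a random vocabulary entry with probability p (tr0.1, tr0.3)."""
    rng = rng or random.Random()
    return [rng.choice(vocab) if rng.random() < p else t for t in tokens]


def hybrid(tokens, explicit_fn, implicit_fn):
    """Assumed composition: explicit augmentation first, implicit perturbation second."""
    return implicit_fn(explicit_fn(tokens))


# Hypothetical stand-in for an explicit method: a fixed synonym table.
TOY_SYNONYMS = {"error": "exception", "fix": "resolve"}


def toy_explicit(tokens):
    return [TOY_SYNONYMS.get(t, t) for t in tokens]


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = "how to fix null pointer error in java".split()
    view = hybrid(tokens, toy_explicit, lambda ts: token_erasure(ts, p=0.1, rng=rng))
    print(view)  # one augmented view; a second call yields another view for the pair
```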

Figure 5.

ACC metrics for ten different data augmentation strategy combinations.

To validate the effectiveness of the proposed hybrid augmentation strategy, we integrate it into the RSTC¹⁶ framework (RSTC+) and remove it from our HAPL model (HAPL-). As shown in Table 3, RSTC+ improves over RSTC on every metric across all eight benchmark datasets, while HAPL- falls below HAPL on every metric and dataset. These results confirm that the hybrid augmentation strategy substantially enhances clustering performance.

Table 3.

Clustering performance (%) of RSTC, RSTC+, HAPL- and HAPL across eight benchmark datasets.

	AgNews				SearchSnippets
	ACC	NMI	ARI	AMI	ACC	NMI	ARI	AMI
RSTC	85.42	63.34	66.86	63.21	79.51	65.62	61.38	65.16
RSTC+	86.40	65.00	68.16	64.98	84.29	71.05	69.29	70.88
HAPL-	86.12	64.39	67.75	64.25	81.36	68.07	64.15	67.89
HAPL	86.81	65.77	69.08	65.72	81.92	70.92	68.78	70.58
	StackOverflow				Biomedical
	ACC	NMI	ARI	AMI	ACC	NMI	ARI	AMI
RSTC	79.95	72.34	66.81	71.93	45.71	38.97	28.18	38.69
RSTC+	83.67	75.30	71.69	75.27	47.47	40.14	29.08	40.05
HAPL-	79.90	72.73	66.93	72.69	46.39	39.81	28.87	39.77
HAPL	85.48	76.71	73.66	76.63	48.01	40.76	29.71	40.69
	GoogleNews-TS				GoogleNews-T
	ACC	NMI	ARI	AMI	ACC	NMI	ARI	AMI
RSTC	84.11	93.26	83.40	92.72	74.73	87.87	72.79	85.62
RSTC+	84.97	93.86	84.47	93.06	77.03	88.96	75.06	87.19
HAPL-	85.03	93.44	84.63	92.51	75.83	88.29	73.98	86.53
HAPL	85.53	94.05	84.79	93.18	77.74	89.22	75.14	87.53
	GoogleNews-S				Tweet
	ACC	NMI	ARI	AMI	ACC	NMI	ARI	AMI
RSTC	77.63	88.64	76.13	86.77	77.41	86.24	79.79	83.29
RSTC+	79.44	90.05	79.12	88.75	79.60	87.53	83.50	85.66
HAPL-	78.46	89.23	77.12	87.62	78.54	86.52	80.08	84.24
HAPL	78.52	89.91	78.43	88.64	80.90	88.39	84.50	86.41

4.7.2. Adaptive optimal transport

This study comparatively evaluates standard OT and Adaptive Optimal Transport (AOT) methods across datasets with varying class imbalance: balanced StackOverflow, mildly imbalanced GoogleNews-T and heavily imbalanced Tweet. The methods are configured as follows: (1) OT: Employs entropy regularization $Ψ (b) = KL (b ∥ \hat{b})$ where $\hat{b}$ denotes the transport plan from the previous iteration. (2) AOT: Utilizes $Ψ (b) = - \log b - \log (1 - b)$ , enabling dynamic updates of $b$ during optimization. To visualize convergence behavior, experiments initialize $b$ randomly and track two metrics over iterations: clustering accuracy and predicted cluster count.
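To make the difference between the two penalty functions concrete, the sketch below evaluates both on small example vectors. It assumes that $b$ is a vector of cluster proportions and that each $Ψ$ is summed over its entries; the vectors, the function names and the summation are illustrative assumptions, not details taken from the paper.

```python
import math


def kl_penalty(b, b_ref):
    """KL(b || b_ref), with b_ref the reference value from the previous iteration."""
    return sum(bk * math.log(bk / rk) for bk, rk in zip(b, b_ref) if bk > 0)


def log_barrier_penalty(b):
    """Sum over entries of -log(b_k) - log(1 - b_k)."""
    return sum(-math.log(bk) - math.log(1.0 - bk) for bk in b)


if __name__ == "__main__":
    uniform = [0.25, 0.25, 0.25, 0.25]
    skewed = [0.97, 0.01, 0.01, 0.01]    # three nearly empty clusters

    print(kl_penalty(skewed, skewed))    # zero when b equals the reference
    print(kl_penalty(skewed, uniform))   # grows as b moves away from the reference
    print(log_barrier_penalty(uniform))
    print(log_barrier_penalty(skewed))   # larger: entries close to 0 or 1 are penalized
```

The KL term only measures the distance to a fixed reference, so it carries no preference of its own about the shape of $b$. The logarithmic barrier depends on $b$ alone and grows without bound as any entry approaches 0 or 1, which is one way to read the claim that it lets $b$ be updated dynamically during optimization.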

As shown in Figure 6, AOT (solid line) consistently achieves higher accuracy than OT (dashed line) across the three datasets. Figure 7 further reveals that AOT recovers the correct number of clusters during optimization, in contrast to OT's tendency toward over-merging (particularly in imbalanced settings). These results demonstrate AOT's enhanced robustness to data complexity and its effectiveness in mitigating under-clustering during optimization.
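The predicted cluster count can be monitored with a few lines of code. The sketch assumes that it is defined as the number of clusters that receive at least one sample under an argmax over the predicted assignment probabilities; this definition and the function name are assumptions, since the section does not state how the count is computed.

```python
def predicted_cluster_count(probs):
    """Number of distinct clusters selected by argmax over each row of probabilities."""
    hard_labels = {max(range(len(row)), key=row.__getitem__) for row in probs}
    return len(hard_labels)


if __name__ == "__main__":
    # Toy assignment probabilities for three samples and three clusters.
    probs = [
        [0.7, 0.2, 0.1],
        [0.6, 0.3, 0.1],
        [0.1, 0.1, 0.8],
    ]
    print(predicted_cluster_count(probs))  # cluster 1 is never selected here
```

Under this definition, over-merging appears as a count that stays below the true number of categories.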

Figure 6.

Changes in Accuracy for AOT and OT Training Across Three Datasets.

Figure 7.

Changes in Predicted Cluster Counts for AOT and OT Training Across Three Datasets.

Figure 8.

Changes in ACC and NMI during training of five models on GoogleNews-TS.

Figure 9.

Comparison of computing overhead (in seconds) between HAPL and POTA across six datasets.

4.7.3. Prototype learning

We design an ablation model HAPL $^{PL-}$ (excluding prototype learning) and compare it with the full HAPL model and the baseline methods. Figure 8 depicts the performance trajectories of these models during training on GoogleNews-TS under identical random seeds. The HAPL model maintains stable performance throughout training, whereas the HAPL $^{PL-}$ variant exhibits significant fluctuations and gradual performance decay after reaching intermediate accuracy levels. This divergence confirms that prototype learning enhances model stability.

As shown in Figure 9, HAPL exhibits a significant reduction in computational overhead compared to the state-of-the-art baseline POTA. The computational costs on the AgNews and Tweet datasets are excluded from this comparison, because the model converges at an earlier stage on these datasets and the overhead varies considerably across random seed initializations.

5. Conclusion

In this work, we present HAPL, an end-to-end framework for short text clustering that integrates feature extraction, pseudo-label generation, cluster learning, prototype learning and contrastive learning into a unified architecture. By combining explicit and implicit augmentation within a hybrid strategy, HAPL produces semantically consistent yet diverse positive pairs that strengthen representation quality. The adaptive OT module generates distribution-aware pseudo-labels that provide stable guidance for cluster assignment, while the prototype learning component further aligns sample representations with cluster prototypes, reinforcing semantic coherence and enhancing the robustness of the learned embedding space.

Ablation experiments confirm that each module contributes to overall performance, with joint optimization yielding the greatest gains. Hybrid augmentation outperforms any single strategy; the adaptive OT mechanism alleviates clustering degradation under class imbalance compared to standard OT; and prototype learning improves model stability. The complementary interplay among augmentation, adaptive pseudo-label supervision, and prototype-guided refinement validates our joint optimization design.

Two limitations remain for future work. First, HAPL currently supports single-label clustering; extending it to multi-label settings is the next step. Second, the adaptive OT module incurs $O(n^{2})$ complexity, which may limit scalability on large datasets; more efficient approximations will be explored.

Footnotes

Acknowledgements

This paper was supported by the National Natural Science Foundation of China (No. 62076215, No. 62301473), the Fundamental Research Funds for the Central Universities, China (No. K93-9-2022-03), the Jiangsu Provincial Natural Science Foundation of Higher Education (No. 23KJB520039), the Jiangsu Provincial Key Laboratory of Network and Information Security (No. BM2003201), the Jiangsu University Qing Lan Project and the Yancheng Industrial Innovation Technology Support (Industrial) Special Program (No. YCBG2025201).

Ethical and informed consent for data used

This article does not contain studies with human participants or animals. Statement of informed consent is not applicable since the manuscript does not contain any patient data.

Authors contribution statement

Tao Yan: Writing-original draft, validation, methodology, formal analysis, data curation. Sen Xu: Writing-review, supervision, resources, project administration, funding acquisition. Shanliang Yao: Writing-review, data verification, funding acquisition. Naixuan Guo: Writing-review, project administration, funding acquisition. Xuesheng Bian: Writing-review, funding acquisition. Xiufang Xu: Supervision, project administration. Xianye Ben: Supervision, resources, project administration. Tian Zhou: Supervision, resources.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This paper was supported by the National Natural Science Foundation of China (62322111, 62271289) and the Natural Science Fund for Distinguished Young Scientists of Shandong Province (ZR2024JQ007).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability and access

The datasets used in this paper are publicly available.

References

1. Min, Ross, Sulem, et al. Recent advances in natural language processing via large pre-trained language models: A survey. ACM Comput Surv 2023; 56: 1–40.

2. Guan, Zhang, Liang, et al. Deep feature-based text clustering and its explanation. IEEE Trans Knowl Data Eng 2020; 34: 3669–3680.

3. Oyewole, Thopil. Data clustering: application and trends. Artif Intell Rev 2023; 56: 6439–6475.

4. Laureate CDP, Buntine, Linger. A systematic review of the use of topic models for short text social media analysis. Artif Intell Rev 2023; 56: 14223–14255.

5. Zhou, Zheng, et al. A comprehensive survey on deep clustering: Taxonomy, challenges, and future directions. ACM Comput Surv 2024; 57: 1–38.

6. Zhang, Dong, Yin, et al. Attentive representation learning with adversarial training for short text clustering. IEEE Trans Knowl Data Eng 2021; 34: 5196–5210.

7. Peng, Xia, Naseriparsa, et al. Knowledge graphs: Opportunities and challenges. Artif Intell Rev 2023; 56: 13071–13102.

8. Subakti, Murfi, Hariadi. The performance of BERT as data representation of text clustering. J Big Data 2022; 9: 15.

9. Zhao, Alzubaidi, Zhang, et al. A comparison review of transfer learning and self-supervised learning: Definitions, applications, advantages and limitations. Expert Syst Appl 2024; 242: 122807.

10. Wei, Zou. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing, 2019, pp.6382–6388.

11. Zheng, Sabour, Wen, et al. AugESC: Dialogue augmentation with large language models for emotional support conversation. In: Findings of the association for computational linguistics, 2023, pp.1552–1568.

12. Chen, et al. Learning deep discriminative representations with pseudo supervision for image clustering. Inf Sci (Ny) 2021; 568: 199–215.

13. Caron, Bojanowski, Joulin, et al. Deep clustering for unsupervised learning of visual features. In: Proceedings of the European conference on computer vision, 2018, pp.132–149.

14. Asano, Rupprecht, Vedaldi. Self-labelling via simultaneous clustering and representation learning. In: International conference on learning representations.

15. Caron, Misra, Mairal, et al. Unsupervised learning of visual features by contrasting cluster assignments. Adv Neural Inf Process Syst 2020; 33: 9912–9924.

16. Zheng, Liu, et al. Robust representation learning with reliable pseudo-labels generation via self-adaptive optimal transport for short text clustering. In: Proceedings of the 61st annual meeting of the association for computational linguistics, 2023, pp.10493–10507.

17. Yao, Yin. Reliable pseudo-labeling via optimal transport with attention for short text clustering. arXiv preprint arXiv:2501.15194, 2025.

18. Sadeghi, Armanfard. Deep multirepresentation learning for data clustering. IEEE Trans Neural Netw Learn Syst 2023; 35: 15675–15686.

19. Wang, et al. Self-taught convolutional neural networks for short text clustering. Neural Netw 2017; 88: 22–31.

20. Xie, Girshick, Farhadi. Unsupervised deep embedding for clustering analysis. In: International conference on machine learning, 2016, pp.478–487. PMLR.

21. Liu, et al. Contrastive clustering. In: Proceedings of the AAAI conference on artificial intelligence, 2021, Vol. 35, pp.8547–8555.

22. Jiang, Yang, et al. Prototypical contrast adaptation for domain adaptive semantic segmentation. In: European conference on computer vision, 2022, pp.36–54. Springer.

23. Wang, Jia, et al. Transformer autoencoder for k-means efficient clustering. Eng Appl Artif Intell 2024; 133: 108612.

24. Chen, Kornblith, Norouzi, et al. A simple framework for contrastive learning of visual representations. In: International conference on machine learning, 2020, pp.1597–1607. PMLR.

25. Zhang, Nan, Wei, et al. Supporting clustering with contrastive learning. In: Proceedings of the 2021 conference of the North American chapter of the association for computational linguistics: Human language technologies, 2021, pp.5419–5430.

26. Yang, Peng, et al. Twin contrastive learning for online clustering. Int J Comput Vis 2022; 130: 2205–2221.

27. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. Adv Neural Inf Process Syst 2013; 26: 2292–2300.

28. Rakib MRH, Zeh, Jankowska, et al. Enhancement of short text clustering by iterative classification. In: International conference on applications of natural language to information systems, 2020, pp.105–117. Springer.

29. Phan, Nguyen, Horiguchi. Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In: Proceedings of the 17th international conference on world wide web, 2008, pp.91–100.

30. Yin, Wang. A model-based approach for text clustering with outlier detection. In: Proceedings of the 32nd IEEE international conference on data engineering, 2016, pp.625–636. IEEE.

31. Yin, Wang. A Dirichlet multinomial mixture model-based approach for short text clustering. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining, 2014, pp.233–242.

32. Reimers, Gurevych. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing, 2019, pp.3982–3992.

33. Hadifar, Sterckx, Demeester, et al. A self-training approach for short text clustering. In: Proceedings of the 4th workshop on representation learning for natural language processing, 2019, pp.194–199.
