Spectral-adaptive masking and hierarchical contrastive learning for time series clustering

Abstract

Time series clustering is a pivotal technique for efficiently mining the structure of data. However, time series data possess unique characteristics such as periodicity, nonlinearity and sensitivity to high-frequency noise in dynamic environments, which significantly impact the performance of clustering methods. Recently, deep clustering has garnered widespread attention for its outstanding performance in capturing multi-scale temporal dependencies. Despite this, existing methods struggle to effectively capture the similarities and diverse temporal patterns in noisy time series. Accordingly, we propose a novel deep time series clustering framework integrating spectral-adaptive masking with hierarchical contrastive learning. First, an enhanced encoder is designed to generate representations of time series through the incorporation of a frequency-domain adaptive noise filter, which dynamically suppresses high-frequency fluctuations by learning threshold parameters from spectral power distributions. Second, hierarchical contrastive information is captured through both temporal-level alignment within overlapping segments and instance-level comparisons across augmented subsequences, while clustering is simultaneously performed in a low-dimensional space with a novel fuzzy clustering loss that improves robustness against outlier interference. Finally, the network architecture is optimized through the integration of contrastive loss and clustering loss, which achieves end-to-end joint representation learning and clustering assignment. Extensive experiments on various time series datasets demonstrate that our approach outperforms state-of-the-art clustering methods.

Keywords

time series clustering; contrastive learning; representation learning; Fourier transform

1. Introduction

Time series data, which is a type of data inherently related to time, widely exists in fields such as financial markets,¹ industrial manufacturing,² and medical analysis.³ In addition, time series clustering can reveal potential patterns within the data and group them into distinct categories, allowing researchers to extract valuable information from extensive datasets.⁴ However, time series data is often characterized by high dimensionality and temporal dependencies, which pose challenges for traditional clustering methods in effectively handling large-scale datasets.

Traditional methods rely on manually designing and extracting features, which are inadequate to capture the intrinsic characteristics of time series data. Thus, these methods tend to perform poorly when applied to complex and nonlinear datasets. In recent years, the application of deep neural networks to clustering methods has significantly improved the performance of time series clustering.⁵ Deep clustering methods focus on generating cluster-oriented representations. They automatically learn and extract the underlying features of the data by using deep neural networks, and these features are represented in a low-dimensional embedding space to better reflect the intrinsic structure of the data, thereby improving clustering accuracy and performance.⁶ Madiraju et al.,⁷ among the earliest researchers in deep time series clustering, utilized an autoencoder to map time series data into a low-dimensional latent space, and then updated the neural network parameters and clustering centers based on the predicted distribution of the clusters and the reconstruction loss. Building on this, Ma et al.⁸ introduced a strategy for generating pseudo-samples to further enhance the capability of the encoder. The optimization of the network combines a spectral relaxation-based k-means loss and an auxiliary classification loss on top of the reconstruction loss. These methods primarily rely on instance reconstruction and cluster distribution alignment, often neglecting the inherent structural relationships between samples, which limits the ability to capture high-level semantic representations for clustering.

However, most deep time series clustering methods significantly depend on high-level features. As a novel technique in self-supervised learning, contrastive learning can efficiently learn invariant representations from augmented data without the need for labeled samples.⁹,¹⁰ For instance, Li et al.¹¹ proposed a single-stage clustering method that incorporates contrastive learning through the construction of a feature matrix, performing instance-level and cluster-level contrastive learning in the row and column spaces, respectively, to maximize the similarity of positive pairs while minimizing the similarity of negative pairs. The contrastive learning framework is illustrated in Figure 1, where $X$ is the input time series, $f (\cdot)$ denotes the encoder, $Z_{a}$ and $Z_{b}$ are the latent representations of the two augmented views, and $L$ is the contrastive loss. By enhancing the distinction between positive and negative samples, more discriminative features can be learned, which facilitates the downstream clustering tasks.

Figure 1.

Contrastive learning framework.

While contrastive learning has shown promise for clustering, its direct application to time series clustering is still a challenging problem because of the dynamics and complexity of time series data.¹² Moreover, few methods can adequately capture multi-scale features of time series data while mitigating noise that may potentially impact clustering performance. In light of this, we propose a novel deep time series clustering method based on contrastive learning. Firstly, data augmentation is applied to the original time series, converting the input into a frequency-domain representation within the encoder. Then, an adaptive noise filter is employed to reduce the influence of noise, followed by multi-layer convolution to generate clustering-friendly representations. Subsequently, two levels of contrastive learning are implemented, which include instance-level and temporal-level. Finally, we optimize the entire network with joint loss functions. The contributions of this paper are as follows:

We introduce a novel deep time series clustering method based on contrastive learning, which incorporates instance and temporal contrastive losses, along with an improved fuzzy clustering loss.

An enhanced encoder is designed with an adaptive noise filter, which dynamically adjusts high-frequency components and enhances the representation of time series.

Extensive experiments with various datasets confirm the outstanding performance of the proposed method compared to state-of-the-art methods.

We organize the remainder of this paper as follows: In Section 2, we start with a review of the relevant studies on time series clustering. Then, we introduce the details of the proposed method in Section 3. Extensive experiments in Section 4 are conducted to demonstrate the effectiveness of the proposed method. Finally, we present the conclusion in Section 5.

2. Related work

Existing time series clustering methods can be divided into traditional and deep learning-based time series clustering methods.³ Traditional time series clustering methods can be further classified into those based on raw data and those based on features. Methods based on raw data directly use the original values of time series for clustering by calculating the distances. To simplify the data, highlight important features, or enable comparisons of time series with different lengths, preprocessing operations such as normalization, smoothing, and interpolation are often applied to the data. Petitjean et al.¹³ proposed a global technique for averaging a set of sequences, using the Dynamic Time Warping (DTW) distance metric for time series clustering. While this method provides a more robust average under DTW and shows improved performance on datasets like those from the UCR archive, it inherits DTW’s high computational cost and sensitivity to noise, which limits its scalability and practicality on large or noisy datasets. Additionally, Paparrizos et al.¹⁴ proposed k-Shape, a clustering method that applies a standard cross-correlation distance measure to group time series with similar trends by optimizing shape similarity. k-Shape is computationally efficient due to its use of the Fast Fourier Transform and demonstrates strong accuracy on benchmark datasets. However, its reliance on global alignment and z-normalization makes it less effective for sequences with complex local temporal variations or inconsistent amplitude characteristics. These raw-data-based methods are highly susceptible to noise and outliers in the data, and exhibit poor performance on high-dimensional data. This limitation prompted the development of feature-based methods, which cluster the data after capturing its features. Zhang et al.¹⁵ proposed an Unsupervised Salient Subsequence Learning (USSL) model that automatically discovers shapelets by integrating pseudo-labels and spectral analysis.
Building on this, Cai et al.¹⁶ introduced SE-Shapelets, a semi-supervised method that leverages a few labels to extract salient subsequence chains and select the most representative shapelets via linear discriminant selection, achieving even higher clustering accuracy. However, a fundamental limitation persists in these works: the decoupling of feature extraction from the clustering process. This separation means the learned features are not explicitly optimized for the final clustering objective, potentially failing to capture the most informative contextual hierarchies in the data.

In contrast, deep learning-based time series clustering methods have significant advantages in representation learning and the processing of complex data patterns.⁶ Deep time series clustering can automatically learn the complex features of the data, and minimize intermediate errors through joint optimization of feature extraction and clustering, thereby improving accuracy and robustness. Recently, extracting effective representations from time series data for downstream tasks has garnered considerable attention. Yue et al.¹⁷ proposed a universal framework for learning representations of time series at arbitrary semantic levels, and performed contrastive learning in a hierarchical manner on the enhanced context view, demonstrating strong performance in classification and forecasting. However, as a general-purpose representation model, its learning objective is not explicitly designed to optimize for cluster-friendly structures in the latent space, which may limit its direct applicability to clustering tasks. Eldele et al.¹⁸ proposed a lightweight adaptive network for time series that segments the input into multiple blocks. This method captures global patterns across different frequency components using adaptive spectral blocks, showing remarkable noise robustness. Interactive convolutional blocks are then employed to extract local features from the time series. While it achieves state-of-the-art results in supervised and semi-supervised settings, its feature learning process is decoupled from the ultimate clustering objective, potentially leading to representations that are suboptimal for partitioning data. The core ideas of representation learning are reflected in subsequent deep time series clustering methods. Zhong et al.¹⁹ leveraged data augmentations to construct dual views and applied contrastive losses at both the instance and cluster levels.
While this approach achieves state-of-the-art performance by maximizing agreement between views, its effectiveness is inherently tied to the quality of augmentations and is susceptible to noise in the positive pairs. Lee et al.²⁰ exploited the eigenstructure of latent representations to define a topological loss that aligns samples with similar temporal structures. This allows it to achieve superior results even with a simple MLP encoder, but it introduces significant computational overhead from the eigendecomposition and is sensitive to the initial cluster assignments. In parallel, Huang et al.²¹ learned representations via an adversarial game between a generator and a discriminator. Although capable of capturing complex temporal dynamics, these models often face training instability and lack an explicit clustering objective, which can limit their final clustering performance and consistency. Although the aforementioned methods have achieved considerable performance, they do not adequately leverage the inherent information in the data to obtain discriminative representations while accounting for the noise sensitivity and clustering consistency of the model. The proposed method bridges this gap by applying a frequency-domain adaptive denoising mechanism coupled with dual-level contrastive learning, enabling joint optimization of representation purity and cluster separability.

3. Proposed method

We propose a deep time series clustering method via spectral-adaptive masking and hierarchical contrastive learning (DTCSC), which includes an enhanced contrastive learning framework for time series representation and a clustering module. First, the overall network framework of DTCSC is presented. Then, we elaborate on each module of the proposed method.

3.1. Overall network framework

The overall framework of DTCSC is shown in Figure 2. In the network architecture of DTCSC, the raw time series is first randomly cropped to obtain two different but overlapping subseries. Then, these subseries are fed into the encoder to generate latent low-dimensional representations which encapsulate the contextual information. Further, the contrastive loss is obtained through two-level contrastive learning, which incorporates both instance-level and temporal-level contrastive learning. Additionally, the raw series from which those subseries are cropped is fed into the encoder in parallel, with the output serving as the input for the clustering module. The network is jointly optimized in a self-supervised manner using the contrastive loss and the clustering loss. It is worth noting that these steps are performed during the fine-tuning phase. Prior to this, we pretrain the model to obtain initial latent representations that reflect the temporal context information of time series, as well as initial cluster centers, thereby avoiding the problem of local optima caused by random initialization. In the pretraining phase, the representations generated by the encoder may not exhibit clearly distinct clustering structures because the encoder weights are updated only through the minimization of the contrastive loss function.

3.2. Self-Supervised contrastive representation learning

The key to deep clustering is obtaining clustering-friendly representations through representation learning, and self-supervised learning generates representations for downstream tasks by leveraging the inherent structure and attributes of the data, without the need for labels.²² Further, self-supervised contrastive learning emphasizes point details, establishes spatial relationships between data samples by making instances comparable, and clarifies similarities and dissimilarities, thereby enabling more effective capture of discriminative representations.

3.2.1. Data augmentation

The core idea of contrastive representation learning is to construct positive and negative samples in representation space via data augmentation, and to maximize the similarity between positive samples while minimizing it for negative samples.²³,²⁴ The random cropping strategy helps the model learn the invariance of local patterns in time series and improves its ability to capture local features at low computational cost.²⁵ Thus, we use the random cropping strategy to generate subseries from data samples.

Consider a time series dataset $x = \{x_{1}, x_{2}, \dots, x_{n}\}$, where $n$ is the number of time series samples. The $i$-th sample is $x_{i} = \{x_{i, 1}, x_{i, 2}, \dots, x_{i, t}\}$, where $t$ is the length of $x_{i}$. The model randomly selects two different but overlapping subseries $x_{i}^{a}$ and $x_{i}^{b}$ from the input sample $x_{i}$, where the time segment of $x_{i}^{a}$ is $[p_{1}, q_{1}]$ and the time segment of $x_{i}^{b}$ is $[p_{2}, q_{2}]$. The overlapping time segment of the two distinct subseries is $[p_{2}, q_{1}]$, such that $0 \leq p_{1} < p_{2} \leq q_{1} < q_{2} \leq t$. These subseries are fed to the time series encoder to generate low-dimensional representations, which are then used by the contrastive model to measure similarity.
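The cropping constraint can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `sample_overlapping_crops` is hypothetical, the four indices are treated as half-open slice bounds, and the sketch draws strictly ordered indices ($p_{2} < q_{1}$), which is a subset of the cases allowed by $p_{2} \leq q_{1}$.

```python
import numpy as np

def sample_overlapping_crops(x, rng):
    """Draw two overlapping crops of a 1-D series x (hypothetical helper).

    Four distinct cut points are sampled from {0, ..., t} and sorted, which
    guarantees 0 <= p1 < p2 < q1 < q2 <= t. Crops are the half-open slices
    x[p1:q1] and x[p2:q2]; they share the segment x[p2:q1].
    """
    t = x.shape[0]
    if t < 3:
        raise ValueError("series must contain at least 3 points")
    p1, p2, q1, q2 = np.sort(rng.choice(t + 1, size=4, replace=False))
    return x[p1:q1], x[p2:q2], (int(p1), int(q1), int(p2), int(q2))

rng = np.random.default_rng(0)
series = np.arange(50.0)
crop_a, crop_b, bounds = sample_overlapping_crops(series, rng)
```

Sampling sorted distinct cut points keeps the overlap non-empty without rejection sampling; a practical implementation might additionally enforce a minimum crop length.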

3.2.2. Adaptive noise filter

In practice, time series data often exhibit noise that manifests as anomalies and irregular fluctuations. These noises may distort the similarity between data points, making it challenging for clustering methods to accurately distinguish different time patterns. However, it is common in existing deep clustering methods to assume that the data points are noise-free or that the noise is uniformly distributed across the entire dataset. This overlooks the varying degrees of impact that noise might have across different time intervals and the potential temporal dependencies, which may mask or misidentify relevant temporal features.

Figure 2.

Overview of the proposed DTCSC.

High-frequency components often represent rapid fluctuations that stray from the main trend, leading to more randomness and making the data challenging to analyze.¹⁸ The Discrete Fourier Transform (DFT) efficiently transforms complex time-domain signals into frequency-domain information, which facilitates the analysis of frequency components and periodic patterns. Therefore, the 1D DFT is well-suited as a preliminary step that decomposes time series data into their frequency components. For clarity, the input time series $x_{i} [n]$, a sequence of $t$ (in general complex) numbers, is transformed into a frequency-domain representation by the 1D DFT:

$$P = \mathcal{F} (x_{i}) \tag{1}$$

where $P$ encapsulates the spectral features of the raw time series and $\mathcal{F}$ is the 1D DFT operation, defined as:

$$X_{i} [k] = \sum_{n = 0}^{t - 1} x_{i} [n] e^{- j 2 \pi n k / t} := \sum_{n = 0}^{t - 1} x_{i} [n] W_{t}^{k n} \tag{2}$$

where $j$ is the imaginary unit and $W_{t} = e^{- j (2 \pi / t)}$. $X_{i} [k]$ is the spectrum of the sequence $x_{i} [n]$ at the frequency $\omega_{k} = 2 \pi k / t$. Additionally, the original sequence $x_{i} [n]$ can be recovered by the Inverse DFT (IDFT) because the DFT is invertible, i.e.

$$x_{i} [n] = \frac{1}{t} \sum_{k = 0}^{t - 1} X_{i} [k] e^{j 2 \pi k n / t} \tag{3}$$
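As a quick numerical check of the transform pair, the sketch below evaluates the DFT and IDFT sums directly as matrix products, with the inverse normalized by the series length $t$, and compares them with NumPy's FFT routines. The function names `dft` and `idft` are illustrative only.

```python
import numpy as np

def dft(x):
    """Direct O(t^2) evaluation of X[k] = sum_n x[n] * W_t^{kn}."""
    t = len(x)
    idx = np.arange(t)
    W = np.exp(-2j * np.pi * np.outer(idx, idx) / t)  # W[k, n] = W_t^{kn}
    return W @ x

def idft(X):
    """Direct evaluation of x[n] = (1/t) * sum_k X[k] * exp(j 2 pi k n / t)."""
    t = len(X)
    idx = np.arange(t)
    W_inv = np.exp(2j * np.pi * np.outer(idx, idx) / t)
    return (W_inv @ X) / t

rng = np.random.default_rng(0)
signal = rng.standard_normal(16)
spectrum = dft(signal)
matches_fft = np.allclose(spectrum, np.fft.fft(signal))   # same convention as np.fft
round_trip_ok = np.allclose(idft(spectrum), signal)       # DFT is invertible
```

The direct sums cost $O(t^{2})$ operations, whereas `np.fft.fft` computes the same values in $O(t \log t)$.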

Further, the Fast Fourier Transform (FFT) and inverse fast Fourier transform (IFFT) optimize the computation of DFT and IDFT by exploiting the symmetry and periodicity of $W_{t}^{k n}$ , which increases computational efficiency. After obtaining the frequency-domain representation $P$ , the power spectrum $S = | P |^{2}$ is computed to identify the dominant frequency components, and a dynamic adaptive filtering process is applied to remove high-frequency noise from the power spectrum $S$ :

$$P_{filtered} = P \circ M \tag{4}$$

$$M [i] = \begin{cases} 1 & \text{if } S [i] > \theta \\ 0 & \text{otherwise} \end{cases} \tag{5}$$

where $S [i] = | P [i] |^{2}$ is the power of the $i$-th frequency component, $\theta$ is a trainable threshold that is dynamically adjusted based on the spectral characteristics of the data, and $\circ$ is the element-wise multiplication. Only the frequency components whose power exceeds the threshold $\theta$ are retained, with the other parts being filtered out. After adaptively removing high-frequency noise, two sets of learnable weights $w_{1}$ and $w_{2}$ are introduced to integrate the features from $P$ and $P_{filtered}$ as follows:

$$F = w_{1} \circ P + w_{2} \circ P_{filtered} \tag{6}$$

Finally, the frequency-domain data is transformed back to the time domain through IFFT to obtain the output, and it is given as:

$$P^{'} = \mathcal{F}^{- 1} (F) \tag{7}$$

where $\mathcal{F}^{- 1}$ denotes the inverse DFT and $F$ is the fused spectrum of Equation (6).
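The filtering pipeline of Equations (4) to (7) can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the mask is computed by comparing the power spectrum with the threshold, and `theta`, `w1` and `w2` are passed as fixed scalars, whereas in the paper they are learned during training. The function name `adaptive_noise_filter` is hypothetical.

```python
import numpy as np

def adaptive_noise_filter(x, theta, w1, w2):
    """Spectral masking sketch for a real-valued 1-D series x.

    Assumptions: theta, w1, w2 are fixed scalars here (trainable in the
    paper), and the mask keeps bins whose power exceeds theta.
    """
    P = np.fft.fft(x)                # frequency-domain representation, Eq. (1)
    S = np.abs(P) ** 2               # power spectrum
    M = (S > theta).astype(float)    # binary mask, Eq. (5)
    P_filtered = P * M               # element-wise masking, Eq. (4)
    F = w1 * P + w2 * P_filtered     # weighted fusion, Eq. (6)
    return np.fft.ifft(F).real       # back to the time domain, Eq. (7)

t = 128
n = np.arange(t)
clean = np.sin(2 * np.pi * 4 * n / t)            # dominant low-frequency component
noisy = clean + 0.1 * np.sin(2 * np.pi * 40 * n / t)  # weak high-frequency component
denoised = adaptive_noise_filter(noisy, theta=100.0, w1=0.0, w2=1.0)
```

In this toy signal the dominant bins have power $64^{2} = 4096$ and the weak component has power $6.4^{2} = 40.96$, so a threshold of 100 removes only the latter. A hard mask has zero gradient with respect to `theta` almost everywhere, so a trainable version would need a suitable relaxation or estimator; the sketch does not address that.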

3.2.3. Enhanced encoder

Let the non-linear mapping function $f (\cdot)$ denote the encoder, where $f (\cdot) : x_{i} \to z_{i}$. Here, $z_{i} = \{z_{i, 1}, z_{i, 2}, \dots, z_{i, t}\}$ is the low-dimensional latent representation of the $i$-th time series $x_{i}$, in which $z_{i} \in R^{l}$ and $l$ is the dimension of the latent space. The encoding process can be defined as:

$$z_{i} = f (x_{i}) \tag{8}$$

The encoder consists of three components: an adaptive noise filter, a dual-layer convolutional structure and a residual block, as depicted in Figure 3. The input is processed through multiple stacked convolutional layers for feature extraction, where each layer contains two parallel paths: a main path and an auxiliary path. The main path comprises two dilated 1-D convolution blocks with progressively increasing dilation parameters, where the dilation rate of the $i$-th block is $2^{i}$, each followed by a GELU activation function, which allows the model to capture temporal dependencies at different scales. In contrast, the auxiliary path independently processes each input channel using a dilated convolution with group-based separation, followed by a $1 \times 1$ convolution block to aggregate cross-channel features. In this manner, the main path captures hierarchical multi-scale temporal dependencies through progressive dilated convolutions, while the auxiliary path enhances local feature diversity through channel-independent processing. The fusion of both paths enables complementary optimization of global context and local details. Finally, a residual connection between the output and the input is established to enhance the stability of the training process and mitigate the vanishing gradient problem in deep networks. The contextual representation $z$ of the input time series is thereby obtained.

Figure 3.

Encoder architecture.
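To make the role of the dilation rates concrete, the following single-channel NumPy sketch stacks dilated convolutions with dilation $2^{i}$, applies GELU after each block, and adds a residual connection. It covers only the main path; the auxiliary path and the noise filter are omitted. The function names, the zero-based block index, the kernel size of 3 and the tanh approximation of GELU are all assumptions made for illustration.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def dilated_conv1d(x, kernel, dilation):
    """Zero-padded, length-preserving dilated cross-correlation (odd kernel size)."""
    k = len(kernel)
    if k % 2 == 0:
        raise ValueError("kernel size must be odd")
    pad = dilation * (k - 1) // 2
    xp = np.pad(x, pad)
    return np.array([
        sum(kernel[j] * xp[i + j * dilation] for j in range(k))
        for i in range(len(x))
    ])

def main_path(x, kernels):
    """Stack of dilated blocks; block i (zero-based, an assumption) uses dilation 2**i."""
    h = x
    for i, kernel in enumerate(kernels):
        h = gelu(dilated_conv1d(h, kernel, dilation=2 ** i))
    return h + x  # residual connection between output and input

def receptive_field(num_blocks, kernel_size=3):
    """Number of input steps that influence one output step of the stack."""
    return 1 + (kernel_size - 1) * sum(2 ** i for i in range(num_blocks))

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
kernels = [rng.standard_normal(3) * 0.1 for _ in range(2)]
y = main_path(x, kernels)
```

With two blocks and kernel size 3, `receptive_field(2)` evaluates to 7, and each additional block roughly doubles the temporal context at the same per-layer cost.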

3.2.4. Dual contrastive learning

Contrastive loss functions aim to refine the geometry of the feature space by reducing the distance between related or positively paired instances while simultaneously increasing the separation between unrelated or negative pairs, which enhances the capacity of the model to distinguish between diverse data points. Thus, we use two contrastive losses, which focus on instance-level and temporal-level comparisons, to effectively capture the internal dynamic changes within the time series and the differences between instances. These two strategies constrain the representation space along two orthogonal dimensions. The instance-level loss provides a discriminative signal by contrasting different samples within a batch. This guides the model to learn features that can effectively separate one data instance from another, focusing on inter-instance differences. On the other hand, the temporal-level loss provides a consistency signal by contrasting different temporal contexts of the same instance. This guides the model to become invariant to trivial temporal variations and to capture the essential intra-instance dynamics and trends. Relying solely on instance-level contrast might yield representations that are sensitive to noise and lack temporal smoothness, while relying solely on temporal-level contrast might result in representations that are temporally coherent but lack discriminative power between different classes. Therefore, their combination is designed to learn representations that are both discriminative and temporally meaningful.¹⁷ In our work, for any given timestamp, we consider representations from different timestamps within the same instance and from other instances in the same batch as negatives. This approach avoids the strong and often unrealistic invariance assumptions required by other methods, which can distort the inherent temporal structure of the data.
Instead, by leveraging the natural structure within and across time series, it ensures that the learned representations faithfully preserve crucial temporal dependencies while maintaining high computational efficiency. Instance-level contrastive loss compares the inherent similarities between samples, pulling similar instances closer in the feature space to obtain discriminative representations. At timestamp $t$ , representations from different samples within the same batch are treated as negative pairs. The instance contrastive loss can be defined as:

$$\ell_{instance}^{(i, t)} = - \ln \frac{\exp(z_{i, t} \cdot z_{i, t}^{'})}{\sum_{j = 1}^{N} \left(\exp(z_{i, t} \cdot z_{j, t}^{'}) + \mathbb{1}_{[i \neq j]} \cdot \exp(z_{i, t} \cdot z_{j, t})\right)} \tag{9}$$

where $z_{i, t}$ and $z_{i, t}^{'}$ are the representations at timestamp $t$ of the two augmentations $x_{i}^{a}$ and $x_{i}^{b}$ of $x_{i}$, respectively, $N$ denotes the number of samples in a mini-batch, and $\mathbb{1}$ is the indicator function. In contrast, the temporal contrastive loss focuses on the temporal characteristics within the time series. We consider representations from the same timestamp of the same series as positive pairs, and representations from different timestamps as negative pairs. The temporal contrastive loss can be defined as:

$$\ell_{temporal}^{(i, t)} = - \ln \frac{\exp(z_{i, t} \cdot z_{i, t}^{'})}{\sum_{t^{'} \in \Omega} \left(\exp(z_{i, t} \cdot z_{i, t^{'}}^{'}) + \mathbb{1}_{[t \neq t^{'}]} \cdot \exp(z_{i, t} \cdot z_{i, t^{'}})\right)} \tag{10}$$

where $\Omega$ is the set of timestamps at which the two subsequences overlap. Assuming that $T$ is the length of the time series, the overall contrastive loss, based on instance and temporal contrastive loss, can be expressed as:

$$\ell_{contrastive} = \frac{1}{N T} \sum_{i = 1}^{N} \sum_{t = 1}^{T} \left(\ell_{temporal}^{(i, t)} + \ell_{instance}^{(i, t)}\right) \tag{11}$$
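The two losses and their average can be written compactly with NumPy. The sketch below is a single-scale illustration under assumptions: `z` and `zp` hold the representations of the two views on the overlapping timestamps, both of shape `(N, T, d)`, so that $\Omega$ corresponds to all `T` positions; the function names are hypothetical, and any hierarchical pooling across scales is left out.

```python
import numpy as np

def _logsumexp(a, axis):
    # numerically stable log(sum(exp(a))) along one axis
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def instance_loss(z, zp):
    """Instance-level loss for every (i, t); returns an array of shape (N, T)."""
    N = z.shape[0]
    cross = np.einsum("itd,jtd->tij", z, zp)   # z_{i,t} . z'_{j,t}
    same = np.einsum("itd,jtd->tij", z, z)     # z_{i,t} . z_{j,t}
    same = np.where(np.eye(N, dtype=bool), -np.inf, same)  # exclude j == i
    logits = np.concatenate([cross, same], axis=2)         # (T, N, 2N)
    pos = np.einsum("itd,itd->ti", z, zp)                  # positive pair
    return (_logsumexp(logits, axis=2) - pos).T

def temporal_loss(z, zp):
    """Temporal-level loss for every (i, t); returns an array of shape (N, T)."""
    T = z.shape[1]
    cross = np.einsum("itd,isd->its", z, zp)   # z_{i,t} . z'_{i,t'}
    same = np.einsum("itd,isd->its", z, z)     # z_{i,t} . z_{i,t'}
    same = np.where(np.eye(T, dtype=bool), -np.inf, same)  # exclude t' == t
    logits = np.concatenate([cross, same], axis=2)         # (N, T, 2T)
    pos = np.einsum("itd,itd->it", z, zp)
    return _logsumexp(logits, axis=2) - pos

def contrastive_loss(z, zp):
    """Average of the summed temporal and instance losses over all (i, t)."""
    return float(np.mean(instance_loss(z, zp) + temporal_loss(z, zp)))

rng = np.random.default_rng(0)
z = rng.standard_normal((3, 4, 5))
zp = rng.standard_normal((3, 4, 5))
loss = contrastive_loss(z, zp)
```

Each per-position loss is non-negative because the positive term also appears in its own denominator. The pairwise similarity tensors have sizes $T \cdot N^{2}$ and $N \cdot T^{2}$.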

3.3. Clustering in representation space

When dealing with datasets that exhibit fuzzy boundaries or overlapping features, data points are often not strictly assigned to a specific cluster. This uncertainty necessitates that clustering algorithms not only identify the cluster to which each data point belongs but also quantify the degree of relevance of data points to each cluster.²⁶ Therefore, we use the Euclidean distance to quantify the membership relationships between each data point and the cluster center, which is represented by the membership matrix $P$. The membership $P_{i j}$ indicates the degree of membership of the $i$-th sample to the $j$-th cluster center. $P_{i j}$ and $c_{j}$ are updated as follows:

$$P_{i j} = \frac{1}{\sum_{k = 1}^{c} (d_{i j} / d_{i k})^{2 / (m - 1)}} \tag{12}$$

$$c_{j} = \frac{\sum_{i = 1}^{n} P_{i j}^{m} \cdot z_{i}}{\sum_{i = 1}^{n} P_{i j}^{m}} \tag{13}$$

where $d_{i j}$ is the Euclidean distance between the representation $z_{i}$ and the cluster center $c_{j}$, and $n$ and $c$ are the number of samples and the number of cluster centers, respectively. $m$ is the level of fuzziness, with a range of $m \in [1, + \infty)$. To enhance the robustness and accuracy of clustering assignments, especially in scenarios where data points exhibit nuanced differences, we propose a refined distance measure, inspired by previous work,²⁷ which can handle dense samples near the cluster center and sparse samples that are further away. It combines absolute error with the smooth characteristics of Huber Loss, as follows:

$$L (z_{i}, c_{j}; w, \epsilon) = \begin{cases} w \cdot \log (1 + d_{i j} / \epsilon) & \text{if } d_{i j} < w \\ d_{i j} - q & \text{otherwise} \end{cases} \tag{14}$$

$$q = w \cdot (1 - \log (1 + w / \epsilon)) \tag{15}$$

where $w$ is the smoothing parameter, $\epsilon$ is the nonlinear curvature parameter, which is adjusted dynamically during training, and $q$ is used to smoothly connect the piecewise-defined linear and nonlinear parts. Then, the clustering loss is given as:

$$\ell_{clustering} = \frac{1}{n} \sum_{i = 1}^{n} \sum_{j = 1}^{c} P_{i j} \cdot L (z_{i}, c_{j}; w, \epsilon) \tag{16}$$
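The membership update, center update and robust clustering loss can be sketched as follows. This is a NumPy illustration rather than the authors' code: the membership sum is taken over the cluster index for each sample, as in standard fuzzy c-means, the defaults `m=2.0` and `tiny=1e-12` are assumptions, `w` and `eps` are fixed scalars although the paper adjusts the curvature parameter during training, and all function names are hypothetical.

```python
import numpy as np

def pairwise_distances(z, centers):
    # d[i, j] = Euclidean distance between z_i and c_j, shape (n, c)
    return np.linalg.norm(z[:, None, :] - centers[None, :, :], axis=2)

def memberships(z, centers, m=2.0, tiny=1e-12):
    """Fuzzy memberships; each row sums to 1. Requires m > 1 in this form."""
    d = pairwise_distances(z, centers) + tiny           # avoid division by zero
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))  # d_ij / d_ik
    return 1.0 / ratio.sum(axis=2)

def update_centers(z, P, m=2.0):
    """Membership-weighted mean of the representations for each cluster."""
    Pm = P ** m
    return (Pm.T @ z) / Pm.sum(axis=0)[:, None]

def robust_distance(d, w, eps):
    """Logarithmic for d < w, linear otherwise, continuous at d == w."""
    q = w * (1.0 - np.log1p(w / eps))
    return np.where(d < w, w * np.log1p(d / eps), d - q)

def clustering_loss(z, centers, P, w, eps):
    """Membership-weighted robust distance, averaged over samples."""
    d = pairwise_distances(z, centers)
    return float(np.mean(np.sum(P * robust_distance(d, w, eps), axis=1)))

rng = np.random.default_rng(0)
z = rng.standard_normal((20, 4))
centers = rng.standard_normal((3, 4))
P = memberships(z, centers)
centers = update_centers(z, P)
loss = clustering_loss(z, centers, P, w=2.0, eps=0.5)
```

At $d_{i j} = w$ both branches equal $w \log (1 + w / \epsilon)$, which is what the offset $q$ is chosen to achieve.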

3.4. Optimization

The contrastive loss and the clustering loss are jointly optimized within a unified framework. The overall loss function, formulated as a weighted sum of these two components, can be expressed as follows:

$$\ell_{total} = \ell_{contrastive} + \lambda \ell_{clustering} \tag{17}$$

where $\lambda \in [0, 1]$ is the adjustment factor that balances the contrastive loss and clustering loss. The network is trained in a self-supervised manner, and better clustering-oriented representations are learned through the iterative updating of the cluster centers and the weights of the encoder. The overall process of DTCSC is summarized in Algorithm ??.

3.5. Complexity analysis

Given a time series dataset with $N$ samples, each of length $T$ and feature dimension $d$, the overall complexity consists of several components.

The encoder network, built upon dilated causal convolutions with depth $L$ and hidden dimension $d_{h}$ , has a time complexity of $O (N \cdot T \cdot d_{h}^{2} \cdot L)$ . The adaptive spectral filtering module based on Fourier transforms introduces an additional $O (N \cdot T \log T \cdot d_{h})$ complexity due to the FFT operations. For contrastive learning, both temporal-level and instance-level comparisons are performed, resulting in $O (T \cdot N^{2} \cdot d_{z} + N \cdot T^{2} \cdot d_{z})$ complexity where $d_{z}$ is the latent dimension. The fuzzy clustering component contributes $O (N \cdot K \cdot d_{z} \cdot I)$ , where $K$ is the number of clusters and $I$ is the number of iterations.

Therefore, the total time complexity of DTCSC is approximately:

$O (N \cdot T \cdot d_{h}^{2} \cdot L + N \cdot T \log T \cdot d_{h} + T \cdot N^{2} \cdot d_{z} + N \cdot T^{2} \cdot d_{z} + N \cdot K \cdot d_{z} \cdot I)$
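To get a feel for which term dominates, the individual terms can be evaluated numerically. The sketch below is only an order-of-magnitude illustration: constants hidden by the $O(\cdot)$ notation are ignored, $N$, $T$ and $K$ are taken from the Symbols row of Table 1, and the values of $d_{h}$, $L$, $d_{z}$ and $I$ are hypothetical placeholders, not settings reported in the paper.

```python
import math

def complexity_terms(N, T, d_h, L, d_z, K, I):
    """Evaluate each term of the complexity expression (constants ignored)."""
    return {
        "encoder": N * T * d_h ** 2 * L,
        "spectral_filter": N * T * math.log2(T) * d_h,
        "instance_contrast": T * N ** 2 * d_z,
        "temporal_contrast": N * T ** 2 * d_z,
        "fuzzy_clustering": N * K * d_z * I,
    }

# N, T, K from Table 1 (Symbols); d_h, L, d_z, I are assumed values
terms = complexity_terms(N=1020, T=398, d_h=64, L=10, d_z=320, K=6, I=1)
for name, value in sorted(terms.items(), key=lambda kv: -kv[1]):
    print(f"{name:>18}: {value:.3e}")
```

Under these assumed values the two contrastive terms are the largest, which suggests that computing the instance-level term per mini-batch rather than over the full dataset matters in practice; with different hidden sizes the encoder term can dominate instead.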

4. Experiment

In this section, extensive experiments were conducted to evaluate the efficacy of the proposed DTCSC. The experimental environment includes Windows 11 64-bit operating system, Intel i5-10400 at 3.10 GHz CPU, and 16 GB RAM.

4.1. Datasets

We employed ten datasets from the publicly available UCR archive: Meat for meat spectral data; DistalPhalanxOutlineAgeGroup, ProximalPhalanxOutlineAgeGroup, and ProximalPhalanxTW for phalanx outline data; ECGFiveDays for electrocardiogram curves; Beef for beef spectrometer spectral curves; MoteStrain for temperature and humidity sensor data; OSULeaf for leaf contour curves; Plane for military aircraft contour curves; and Symbols for symbol outline or trajectory data. Each dataset is characterized by three attributes: sample size, length, and number of classes. Detailed descriptions of each dataset are provided in Table 1.

Table 1.
Statistical description of the ten time series datasets.

Dataset	Abbreviation	Samples	Length	Classes
Meat	ME	120	449	3
DistalPhalanxOutlineAgeGroup	DA	539	81	3
ECGFiveDays	EF	884	137	2
Beef	BE	60	471	5
MoteStrain	MS	1272	85	2
OSULeaf	OL	442	428	6
Plane	PL	210	145	7
ProximalPhalanxOutlineAgeGroup	PA	605	81	3
ProximalPhalanxTW	PT	605	81	6
Symbols	SY	1020	398	6

4.2. Baseline methods

To verify the effectiveness of the proposed method, we compared DTCSC with several clustering methods, including both deep and non-deep clustering methods.

DEC:²⁸ A deep clustering method that learns features and clustering assignments, minimizing a clustering loss based on KL divergence to improve both feature representation and clustering.

IDEC:²⁹ A deep clustering method that uses an undercomplete autoencoder that integrates clustering loss with autoencoder loss to jointly assign clustering labels and learn features that are suitable for clustering while preserving the data structure.

DTC:⁷ A classical unsupervised temporal clustering method that maps data to a low-dimensional latent space and clusters it using K-means, updating both the neural network parameters and the cluster centers based on the clustering results.

DSC:³⁰ A deep clustering method that combines dual autoencoders and deep spectral clustering. The dual autoencoder consists of an encoder, a noisy decoder, and a noise-free decoder.

SDCN:³¹ A deep graph clustering method that captures the low-order and high-order structures of the data, and GCN is used to propagate the representations learned by the autoencoder.

TS2Vec:¹⁷ A general framework for learning time series representations at arbitrary semantic levels, performing contrastive learning in a hierarchical manner on augmented contextual views.

TCGAN:²¹ Employs a generative adversarial network with two one-dimensional CNNs to learn hierarchical representations from unlabeled time series data.

R-clust:³² A time series clustering method using random convolutional kernels and PCA for feature extraction and dimensionality reduction.

SE-shapelets:¹⁶ A semi-supervised method that leverages a small number of labeled and pseudo-labeled time series to discover representative shapelets. It incorporates a salient subsequence chain to extract informative subsequences and a linear discriminant selection algorithm to identify shapelets that capture discriminative local features.

TS-TCC:³³ A contrastive learning framework employing temporal and contextual contrasting with Transformer encoders, suitable for representation learning and clustering tasks.

TSLANet:¹⁸ A universal time series model that combines spectral analysis with convolutional operations, utilizing frequency-domain processing to enhance feature representation and handle complex temporal patterns across various tasks.

DEETO:²⁰ A deep clustering method that leverages self-supervised pretraining followed by fine-tuning with topological constraints to learn cluster-oriented representations.

4.3. Evaluation metrics

We used two widely accepted clustering metrics to evaluate model performance: Normalized Mutual Information (NMI) and Rand Index (RI). Both range from 0 to 1, with higher values indicating better clustering performance. NMI is based on Mutual Information (MI) and is calculated as:

\begin{aligned} NMI = \frac{\sum_{i = 1}^{c} \sum_{j = 1}^{c} {\bar{N}}_{i j} \log \left( \frac{N \cdot {\bar{N}}_{i j}}{| Y_{i} | \cdot | Y_{j} |} \right)}{\sqrt{\left( \sum_{i = 1}^{c} | Y_{i} | \log \frac{| Y_{i} |}{N} \right) \left( \sum_{j = 1}^{c} | Y_{j} | \log \frac{| Y_{j} |}{N} \right)}} \end{aligned}

(18)

where $c$ and $N$ are the numbers of clusters and time series samples, respectively, $| Y_{i} |$ and $| Y_{j} |$ are the numbers of samples in clusters $Y_{i}$ and $Y_{j}$, and ${\bar{N}}_{i j}$ is the number of samples in the intersection of $Y_{i}$ and $Y_{j}$. RI evaluates the quality of clustering by measuring the agreement of sample pairs between two clustering results. It is calculated as:

\begin{aligned} RI = \frac{n_{11} + n_{00}}{N (N - 1) / 2} \end{aligned}

(19)

where $n_{11}$ is the number of sample pairs that are in the same cluster in both the clustering and the ground truth, and $n_{00}$ is the number of sample pairs that are in different clusters in both the clustering and the ground truth.
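As a minimal illustration of Eqs. (18) and (19), the following Python sketch computes both metrics from two label lists. The function names `nmi` and `rand_index` are hypothetical, and treating the first argument as ground-truth classes and the second as predicted clusters is an assumption of this sketch. Returning 0 when either partition has a single cluster is likewise a convention chosen here, not one stated in the text.

```python
import math
from collections import Counter
from itertools import combinations


def nmi(labels_true, labels_pred):
    """Normalized Mutual Information with square-root normalization, as in Eq. (18)."""
    n = len(labels_true)
    true_sizes = Counter(labels_true)               # |Y_i|
    pred_sizes = Counter(labels_pred)               # |Y_j|
    joint = Counter(zip(labels_true, labels_pred))  # N_ij (only non-empty cells)
    numerator = sum(
        n_ij * math.log(n * n_ij / (true_sizes[i] * pred_sizes[j]))
        for (i, j), n_ij in joint.items()
    )
    h_true = sum(s * math.log(s / n) for s in true_sizes.values())
    h_pred = sum(s * math.log(s / n) for s in pred_sizes.values())
    denominator = math.sqrt(h_true * h_pred)
    # Assumed convention: a degenerate (single-cluster) partition scores 0.
    return numerator / denominator if denominator > 0 else 0.0


def rand_index(labels_true, labels_pred):
    """Rand Index, as in Eq. (19): fraction of sample pairs on which both labelings agree."""
    n = len(labels_true)
    agreeing_pairs = 0
    for a, b in combinations(range(n), 2):
        same_true = labels_true[a] == labels_true[b]
        same_pred = labels_pred[a] == labels_pred[b]
        if same_true == same_pred:  # counts both n_11 and n_00
            agreeing_pairs += 1
    return agreeing_pairs / (n * (n - 1) / 2)


truth = [0, 0, 1, 1]
print(nmi(truth, [1, 1, 0, 0]), rand_index(truth, [1, 1, 0, 0]))
print(nmi(truth, [0, 0, 0, 1]), rand_index(truth, [0, 0, 0, 1]))
```

Both metrics are invariant to a relabeling of the clusters, which is why the first call compares a permuted but otherwise identical partition. The pairwise loop is quadratic in the number of samples and is written for readability rather than speed.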

4.4. Parameters setting

The training process of DTCSC consists of two phases: pretraining and finetuning. In the encoder, the dimension of the hidden layer is set to 64, while the dimension of the output layer is reduced to 32 via a max pooling layer. Standard weight initialization is used. During the pretraining phase, the learning rate is set to 0.001, the model uses the Adam optimizer with its default hyperparameters ( $β_{1} = 0.9$ , $β_{2} = 0.999$ , and $ε = 10^{-8}$ ), and the encoder is trained for 25 epochs. The cluster centroids are initialized using the K-means algorithm. This choice is consistent with the prevailing methodology in deep clustering, as K-means provides a computationally efficient and conceptually straightforward initialization. For the finetuning phase, the learning rate is fixed at 0.0001, the same Adam optimizer configuration is used, and the maximum number of epochs is 50. In addition, the level of fuzziness $m$ is set to 2, and the smoothing parameter $w$ and the nonlinear curvature parameter $ϵ$ are initially set to 10 and 2, respectively. The adjustment factor $λ$ is set to 0.1, a common value. With these parameter settings, the model achieved its best performance in our experiments.
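For readers who want to reproduce the setup, the settings above can be collected in a single configuration object. The following is only a sketch: the class name `DTCSCConfig`, its field names, and the helper `phase_settings` are hypothetical and do not come from the authors' code; only the numeric values are taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DTCSCConfig:
    """Hyperparameters reported in Section 4.4 (field names are illustrative)."""
    hidden_dim: int = 64             # encoder hidden layer dimension
    output_dim: int = 32             # encoder output dimension after max pooling
    pretrain_lr: float = 1e-3
    pretrain_epochs: int = 25
    finetune_lr: float = 1e-4
    finetune_max_epochs: int = 50
    adam_betas: tuple = (0.9, 0.999)
    adam_eps: float = 1e-8
    fuzziness_m: float = 2.0         # level of fuzziness m
    smoothing_w: float = 10.0        # smoothing parameter w
    curvature_eps: float = 2.0       # nonlinear curvature parameter
    adjustment_lambda: float = 0.1   # adjustment factor lambda


def phase_settings(cfg, phase):
    """Return (learning_rate, epochs) for the 'pretrain' or 'finetune' phase."""
    if phase == "pretrain":
        return cfg.pretrain_lr, cfg.pretrain_epochs
    if phase == "finetune":
        return cfg.finetune_lr, cfg.finetune_max_epochs
    raise ValueError(f"unknown phase: {phase!r}")


cfg = DTCSCConfig()
for phase in ("pretrain", "finetune"):
    lr, epochs = phase_settings(cfg, phase)
    print(f"{phase}: lr={lr}, epochs={epochs}")
```

Keeping the two phases in one frozen object makes it harder to accidentally reuse the pretraining learning rate during finetuning.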

4.5. Experimental results

We evaluated the clustering performance of the proposed method and compared it with that of the baseline methods. The NMI scores for the 10 datasets are shown in Table 2, and the RI scores are presented in Table 3. In addition, we report the average rank (Avg. Rank), the number of best results (Num.Top1), and the number of second-best results (Num.Top2) for each method. Underlined values indicate the best result among all methods.

Table 2.
Comparison results of the NMI for 10 time series datasets.

Dataset	DEC	IDEC	DTC	DSC	SDCN	TS2Vec	TCGAN	R-clust	SE-shape	TS-TCC	TSLANet	DEETO	DTCSC
ME	0.5176	0.2250	0.2250	0.2019	0.2015	0.1099	0.5060	0.6264	0.4802	0.1936	0.3665	0.5476	0.7514
DA	0.4405	0.4400	0.3406	0.1502	0.2760	0.1906	0.3171	0.4223	0.3460	0.2575	0.2726	0.4542	0.4159
EF	0.0178	0.0223	0.0022	0.0033	0.0197	0.1844	0.0018	0.0195	0.0167	0.1487	0.2737	0.1973	0.9281
BE	0.2463	0.2463	0.2751	0.4993	0.3031	0.1874	0.2530	0.2466	0.2167	0.1257	0.2737	0.4826	0.2284
MS	0.3867	0.3821	0.0094	0.0987	0.5564	0.0008	0.4487	0.4883	0.0144	0.1267	0.3012	0.4893	0.5707
OL	0.2141	0.2412	0.2201	0.0200	0.0200	0.1889	0.1834	0.4982	0.1210	0.1928	0.3841	0.1735	0.3798
PL	0.8947	0.8947	0.8678	0.8736	0.8132	0.6399	0.8949	0.9819	0.7937	0.9306	1.0000	0.9652	1.0000
PA	0.2500	0.5396	0.4153	0.0677	0.0582	0.1481	0.4909	0.5638	0.4337	0.1463	0.4121	0.5711	0.5666
PT	0.5864	0.3289	0.6199	0.1929	0.0724	0.0721	0.5094	0.5455	0.4181	0.2246	0.5031	0.5624	0.6413
SY	0.7421	0.7419	0.7995	0.9110	0.7123	0.1091	0.7982	0.9420	0.4817	0.7613	0.8989	0.9388	0.9488
Avg. NMI	0.4296	0.4062	0.3775	0.3019	0.3033	0.1831	0.4403	0.5335	0.3322	0.3108	0.4686	0.5372	0.6431
Avg. Rank	6.3	6.6	7.3	9.3	9.1	11.1	7.0	3.9	9.3	9.1	5.5	3.4	2.6
Num.Top1	0	0	0	1	0	0	0	1	0	0	1	2	6
Num.Top2	1	0	1	0	1	0	0	2	0	0	2	1	1

Table 3.

Comparison results of the RI for 10 time series datasets.

Dataset	DEC	IDEC	DTC	DSC	SDCN	TS2Vec	TCGAN	R-clust	SE-shape	TS-TCC	TSLANet	DEETO	DTCSC
ME	0.6475	0.6220	0.3220	0.3029	0.3015	0.5695	0.7324	0.8234	0.7248	0.6570	0.6818	0.6489	0.8437
DA	0.7785	0.7786	0.7812	0.6107	0.5742	0.4870	0.7212	0.7404	0.7253	0.6223	0.6286	0.7835	0.7399
EF	0.5103	0.5114	0.5016	0.4933	0.5052	0.6214	0.5007	0.5123	0.5044	0.5975	0.6457	0.5558	0.9798
BE	0.5954	0.6276	0.6345	0.5220	0.5731	0.5793	0.6633	0.6701	0.5288	0.5568	0.6096	0.8747	0.6599
MS	0.7435	0.7324	0.5062	0.4380	0.5748	0.5001	0.7580	0.7787	0.5117	0.5598	0.6752	0.7887	0.8354
OL	0.7484	0.7607	0.7329	0.6465	0.6021	0.7378	0.7396	0.8266	0.7150	0.5425	0.7073	0.5766	0.7813
PL	0.9447	0.9447	0.9040	0.8980	0.8392	0.8368	0.9489	0.9947	0.9264	0.9563	1.0000	0.9971	1.0000
PA	0.4263	0.8091	0.7430	0.5000	0.5133	0.4928	0.7816	0.8001	0.7676	0.6525	0.8216	0.8121	0.8166
PT	0.8189	0.9030	0.8380	0.6071	0.4997	0.5669	0.7877	0.7880	0.7411	0.6078	0.9082	0.8937	0.8421
SY	0.8841	0.8857	0.9053	0.8110	0.8923	0.7161	0.8999	0.9842	0.7752	0.8572	0.9168	0.9744	0.9867
Avg. RI	0.7098	0.7575	0.6869	0.5830	0.5875	0.6108	0.7533	0.7919	0.6920	0.6610	0.7595	0.7906	0.8485
Avg. Rank	7.2	5.5	7.5	11.6	10.5	10.3	6.1	3.7	8.8	8.7	4.6	4.0	2.3
Num.Top1	0	0	0	0	0	0	0	1	0	0	3	2	5
Num.Top2	0	1	1	0	0	0	0	3	0	0	1	1	2
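The summary rows in Tables 2 and 3 (Avg. Rank and Num.Top1) follow from per-dataset rankings of the methods. The sketch below shows one way to compute them in Python on a three-method, three-dataset excerpt of Table 2. The helper names `rank_row` and `summarize` are hypothetical, and ties are assumed to share the mean rank, since the tie-breaking rule is not stated in the text.

```python
def rank_row(scores):
    """Rank methods on one dataset (1 = best, higher score is better).

    Tied scores share the mean of the ranks they occupy (an assumed convention).
    """
    order = sorted(scores, key=scores.get, reverse=True)
    ranks = {}
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mean_rank = (pos + 1 + end + 1) / 2
        for name in order[pos:end + 1]:
            ranks[name] = mean_rank
        pos = end + 1
    return ranks


def summarize(table):
    """Return (average rank, number of best results) per method."""
    methods = list(next(iter(table.values())))
    rank_sum = {m: 0.0 for m in methods}
    top1 = {m: 0 for m in methods}
    for scores in table.values():
        ranks = rank_row(scores)
        best = max(scores.values())
        for m in methods:
            rank_sum[m] += ranks[m]
            top1[m] += int(scores[m] == best)
    avg_rank = {m: rank_sum[m] / len(table) for m in methods}
    return avg_rank, top1


# NMI values copied from Table 2 for three datasets and three methods only.
nmi_subset = {
    "ME": {"R-clust": 0.6264, "DEETO": 0.5476, "DTCSC": 0.7514},
    "EF": {"R-clust": 0.0195, "DEETO": 0.1973, "DTCSC": 0.9281},
    "MS": {"R-clust": 0.4883, "DEETO": 0.4893, "DTCSC": 0.5707},
}
avg_rank, top1 = summarize(nmi_subset)
print(avg_rank)
print(top1)
```

Because the excerpt contains only three of the thirteen methods, the ranks it prints are relative to this subset and are not expected to match the Avg. Rank row of the full table.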

The experimental results demonstrate that DTCSC achieved the best or second-best clustering performance on most datasets (7 of 10 for both NMI and RI), outperforming the mainstream methods. Moreover, DTCSC achieved the highest average scores in both NMI (0.6431) and RI (0.8485). DEETO and TSLANet demonstrate competitive performance, with DEETO leveraging topological information for representation alignment and TSLANet employing spectral analysis for feature enhancement, though both are still outperformed by our approach on average. While the Transformer-based TS-TCC method benefits from its powerful sequence modeling capacity, its clustering performance is moderate, potentially because the generic representations it learns are not explicitly optimized for the clustering objective. R-clust and SE-shapelets, as raw data-based methods, exhibit exceptional computational efficiency, but their clustering effectiveness remains below that of DTCSC on average. TCGAN, a GAN-based method, reaches only mid-range average ranks (7.0 for NMI and 6.1 for RI), while TS2Vec obtains the lowest average NMI (0.1831). In contrast to TS2Vec, a two-stage approach where the learned representations may lack sufficient adaptability for clustering objectives, our proposed method establishes a unified end-to-end framework that jointly optimizes representation learning and clustering objectives, enabling synergistic adaptation between feature encoding and cluster structure formation.

Notably, DEC and IDEC achieved satisfactory performance on datasets with short sequences but performed poorly on long-sequence datasets. DTC, a deep clustering method based on a Recurrent Neural Network (RNN), relies heavily on the quality of the representations generated by its encoder. As shown in Figure 4, DTCC, a dual-view deep clustering method that employs contrastive learning and RNN-based autoencoders for temporal representation extraction, outperforms DTC on most datasets. This advantage arises from the ability of contrastive learning to enforce semantic consistency by filtering stochastic noise and redundant variations, thereby yielding more discriminative and robust representations.

Figure 4.

Comparison of NMI scores of DTCSC with those of DTC and DTCC.

To provide an internal validation of the clustering results, we employed the silhouette coefficient, which measures the cohesion and separation of the formed clusters without the need for ground truth labels.³⁴ The coefficient ranges from -1 to 1, where values above 0 indicate that samples are, on average, closer to members of their own cluster than to members of other clusters.
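To make the definition concrete, the following Python sketch computes the mean silhouette coefficient for a small set of labeled points. It assumes Euclidean distance on fixed-length feature vectors and scores singleton clusters as 0. Neither choice is stated in the text, and the function name `silhouette_coefficient` is hypothetical.

```python
import math


def silhouette_coefficient(points, labels):
    """Mean silhouette over all samples (Euclidean distance assumed)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette coefficient needs at least two clusters")

    def mean_dist(i, members):
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # assumed convention: a singleton cluster contributes 0
        a = mean_dist(i, own)  # cohesion: mean distance to the sample's own cluster
        b = min(               # separation: mean distance to the nearest other cluster
            mean_dist(i, members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(labels)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print(silhouette_coefficient(points, [0, 0, 1, 1]))  # compact, well-separated clusters
print(silhouette_coefficient(points, [0, 1, 0, 1]))  # clusters that cut across the groups
```

The first labeling groups nearby points and yields a value close to 1, while the second mixes the two groups and yields a negative value, matching the interpretation given above.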

The silhouette coefficients for our method across all ten datasets are summarized in Table 4. The results indicate that our method achieves reasonable clustering quality: nine of the ten silhouette scores are above 0.5, meaning that samples are, on average, well matched to their own cluster. The observed variation in scores, including the lower value on OL (0.455), can be primarily attributed to the inherent characteristics of the data themselves. Certain datasets may exhibit higher intrinsic overlap between native clusters or more complex noise patterns.

Table 4.

Silhouette coefficients of the proposed method across benchmark datasets.

Metric	ME	DA	EF	BE	MS	OL	PL	PA	PT	SY
Silhouette Coefficient	0.507	0.570	0.535	0.537	0.633	0.455	0.640	0.543	0.532	0.586

To visually illustrate the changes in clustering performance throughout the clustering process, we used the MoteStrain dataset to analyze the relationship between clustering performance and the number of iterations, as depicted in Figure 5. As the loss decreased, the evaluation metric scores consistently increased. Furthermore, to track the evolution of clusters over time, the t-distributed Stochastic Neighbor Embedding (t-SNE) was utilized to visualize changes in cluster formation on the Symbols dataset. The experiments were conducted under standard parameter settings for 50 iterations, with clustering results generated by the current trained model after every 10 iterations. As shown in Figure 6, which presents a t-SNE projection where the axes are dimensionless and serve only to represent relative similarity, we can observe the evolution of clusters during the iterative optimization process. The visual analysis reveals an overall trend where samples of the same category gradually cluster together, demonstrating higher similarity, while samples of different categories gradually separate.

While the above t-SNE visualization demonstrates effective cluster formation on a dataset where our model performs well, we also employed the same technique to diagnose the challenges presented by the datasets with relatively lower metrics (DA, BE, and OL). The corresponding t-SNE plots are provided in Figure 7. The visual analysis confirms the quantitative results, revealing less distinct cluster separation. We attribute this primarily to the inherent complexities of these specific datasets. For instance, the small sample size and long series length of the BE dataset hinder the learning of robust feature representations, while the OL and DA datasets are shape-based, and their global shape morphology may not be fully exploited by our contrastive learning framework, which is more sensitive to temporal context.

Figure 5.

Visualization of the loss along with three metrics during training on MoteStrain.

Figure 6.

The evolution of clusters during the iteration process.

Figure 7.

t-SNE visualization of the clustering results on DA, BE and OL.

4.6. Effectiveness of noise filter

To verify the effectiveness of the noise filter used in the proposed method, we investigated its ability to reduce noise and enhance the robustness of the model. As shown in Figure 8, we added four different levels of noise to the Meat dataset to evaluate the performance of the model under high-noise conditions. The noise levels were quantified using the signal-to-noise ratio (SNR) in decibels (dB), where lower SNR values indicate stronger noise contamination. We applied four settings: 20 dB, 10 dB, 0 dB, and -10 dB. As the noise intensity increased, the model’s accuracy declined rapidly. However, the model with the noise filter consistently outperformed the model without it. This indicates that the noise filter can effectively mitigate the impact of noise and improve model performance.

Figure 8.

The effectiveness of DTCSC in handling noisy data.
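The noise settings used in this experiment can be illustrated with a short Python sketch that corrupts a series at a target SNR. It assumes additive white Gaussian noise and defines signal power as the mean squared amplitude; the text does not specify either choice. The function names `add_noise_at_snr` and `measured_snr_db` and the sine-wave input are hypothetical.

```python
import math
import random


def add_noise_at_snr(series, snr_db, rng=None):
    """Add zero-mean Gaussian noise so that the expected SNR equals snr_db."""
    if rng is None:
        rng = random.Random()
    signal_power = sum(x * x for x in series) / len(series)
    # SNR(dB) = 10 * log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR/10)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in series]


def measured_snr_db(clean, noisy):
    """Empirical SNR in dB between a clean series and its noisy version."""
    signal_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum((y - x) ** 2 for x, y in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


# Synthetic stand-in for a real series; not data from the paper.
clean = [math.sin(2 * math.pi * t / 50) for t in range(5000)]
for target in (20, 10, 0, -10):
    noisy = add_noise_at_snr(clean, target, rng=random.Random(0))
    print(target, round(measured_snr_db(clean, noisy), 2))
```

At 0 dB the noise carries as much power as the signal, and at -10 dB it carries ten times as much, which is why accuracy is expected to fall sharply at the lower settings.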

4.7. Ablation experiment

In this section, we compared DTCSC with five ablation variants to verify the validity of several designs: DTCSC without the noise filter (DTCSC-F), DTCSC without the improved clustering loss (DTCSC-L), DTCSC without both the noise filter and the improved clustering loss (DTCSC-A), DTCSC without the instance-level contrastive loss (DTCSC-I), and DTCSC without the temporal-level contrastive loss (DTCSC-T). As shown in Table 5, DTCSC outperforms the five variants on most datasets. Crucially, the performance degradation observed in both DTCSC-I and DTCSC-T indicates that the instance-level and temporal-level contrastive components are complementary rather than redundant, since removing either one lowers performance. DTCSC shows varying degrees of improvement over the five variants, although it is slightly worse on certain datasets (for example, DA and BE), which could be attributed to the inherent characteristics of the datasets themselves. Moreover, we evaluated the training time required for each method on each dataset. As shown in Figure 9, while DTCSC requires slightly more computation time than the ablated variants across most datasets, the increase is under 0.1 min, demonstrating that the performance gains are achieved with only modest additional time cost.

Table 5.
Results from experiments using different strategies.

Dataset	Metric	DTCSC-F	DTCSC-L	DTCSC-A	DTCSC-I	DTCSC-T	DTCSC
ME	NMI	0.6421	0.6251	0.6226	0.7302	0.7337	0.7514
	RI	0.7552	0.7557	0.7231	0.8388	0.8340	0.8437
DA	NMI	0.4242	0.3671	0.4212	0.4001	0.4119	0.4159
	RI	0.7415	0.7274	0.7407	0.7377	0.7421	0.7399
EF	NMI	0.5269	0.7853	0.4323	0.8662	0.8035	0.9281
	RI	0.7904	0.9176	0.7529	0.9557	0.9344	0.9798
BE	NMI	0.2362	0.2031	0.2391	0.2125	0.2284	0.2284
	RI	0.6559	0.6514	0.6629	0.6565	0.6599	0.6599
MS	NMI	0.4460	0.5410	0.4136	0.5316	0.4796	0.5707
	RI	0.7648	0.8163	0.7490	0.8201	0.7906	0.8354
OL	NMI	0.3695	0.3756	0.3568	0.3623	0.3797	0.3798
	RI	0.7873	0.7931	0.7853	0.7878	0.7901	0.7813
PL	NMI	0.9819	0.9892	0.9892	0.9892	1.0000	1.0000
	RI	0.9947	0.9973	0.9973	0.9973	1.0000	1.0000
PA	NMI	0.5355	0.5341	0.5286	0.5425	0.5562	0.5666
	RI	0.7866	0.7989	0.7887	0.8104	0.7943	0.8166
PT	NMI	0.5176	0.5290	0.4926	0.5778	0.6044	0.6413
	RI	0.7994	0.7896	0.7828	0.8180	0.8380	0.8421
SY	NMI	0.9262	0.9154	0.9327	0.9356	0.9323	0.9488
	RI	0.9780	0.9754	0.9818	0.9831	0.9804	0.9867

Figure 9.

Comparison of execution times of different strategies (unit: min).

4.8. Comparative analysis of computational complexity

To comprehensively assess the efficiency of different clustering methods, we conducted a comparative analysis of their computational complexity. Based on their core operations, the benchmark methods exhibit a clear hierarchy in computational cost. Methods like R-clust, which utilize random convolutional kernels, achieve the lowest complexity, typically on the order of $O (N \cdot T)$ , as they avoid expensive parameter learning. Most deep learning-based methods form the middle tier: autoencoder-based approaches like DEC and IDEC have complexity dominated by forward propagation through fully-connected networks and clustering iterations, approximately $O (N \cdot d_{h}^{2} \cdot L)$ . Methods handling raw time series, such as DTC and TCGAN, introduce convolutional operations, increasing complexity to $O (N \cdot T \cdot d_{h} \cdot L)$ . TS2Vec, with its hierarchical contrastive learning strategy, further introduces terms of $O (N^{2} \cdot d_{z} + N \cdot T^{2} \cdot d_{z})$ . The highest complexity tier includes structurally more complex methods: SDCN incurs $O (N^{2})$ complexity due to graph construction; DSC involves spectral decomposition with $O (N^{3})$ complexity; and SE-shapelets faces complexity on the order of $O (N \cdot T^{2})$ due to extensive subsequence distance computations.

In comparison, the proposed DTCSC method positions itself in the medium-to-high complexity range. Its computational cost stems primarily from its multi-component architecture: the encoder network based on dilated causal convolutions has a complexity of $O (N \cdot T \cdot d_{h}^{2} \cdot L)$ ; the adaptive spectral filtering module, leveraging the Fast Fourier Transform, provides a relatively efficient $O (N \cdot T \log T \cdot d_{h})$ complexity. However, the contrastive learning component, introduced for more robust representations, brings a higher complexity of $O (T \cdot N^{2} \cdot d_{z} + N \cdot T^{2} \cdot d_{z})$ , which constitutes the main computational bottleneck of the proposed method. Consequently, DTCSC accepts increased computational costs in exchange for the performance gains enabled by its synergistic multi-component design.
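To see why the contrastive term becomes the bottleneck as datasets grow, the leading-order terms above can be evaluated numerically. The sketch below is a rough illustration only: it ignores all constant factors hidden by the big-O notation, takes $d_{h} = 64$ and $d_{z} = 32$ from Section 4.4 on the assumption that these symbols denote the hidden and output dimensions, and uses a hypothetical layer count of 10, since the number of layers is not given here. The function name `cost_terms` is also hypothetical.

```python
import math


def cost_terms(n, t, d_h=64, d_z=32, layers=10):
    """Leading-order operation counts of the three DTCSC components (constants ignored).

    layers=10 is an illustrative assumption, not a value from the paper.
    """
    return {
        "encoder": n * t * d_h ** 2 * layers,                # O(N T d_h^2 L)
        "spectral_filter": n * t * math.log2(t) * d_h,       # O(N T log T d_h)
        "contrastive": t * n ** 2 * d_z + n * t ** 2 * d_z,  # O(T N^2 d_z + N T^2 d_z)
    }


# Fixed series length (398, as for Symbols in Table 1) and a growing number of samples.
for n in (120, 1020, 10000):
    terms = cost_terms(n, 398)
    ratio = terms["contrastive"] / terms["encoder"]
    print(n, {k: f"{v:.2e}" for k, v in terms.items()}, f"contrastive/encoder={ratio:.2f}")
```

Under these assumptions, the ratio of the contrastive term to the encoder term equals $d_{z} (N + T) / (d_{h}^{2} L)$ and therefore grows linearly with the number of samples, whereas the spectral filtering term stays well below the encoder term. For small datasets the encoder term can still be the larger of the two.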

5. Conclusion

In this study, we propose a novel deep clustering method for time series based on contrastive learning. The method utilizes an encoder to extract feature representations from time series data, incorporating a noise filter to suppress high-frequency noise and enhance the representations. Then, contrastive information is captured at both the temporal and instance levels. To improve the robustness and accuracy of clustering assignments, a refined distance measure, which combines absolute error with the smooth characteristics of the Huber loss, is employed. Finally, we jointly optimize the contrastive loss and the clustering loss to learn cluster-friendly representations. Extensive experiments on multiple time series datasets demonstrate that the proposed method outperforms state-of-the-art deep clustering methods. A limitation of the current work is that it is designed for and evaluated on univariate time series, and extending it to multivariate data requires further architectural adjustments. Additionally, extracting feature representations from multiple layers of the deep network may require significant storage space, which could hinder deployment on ultra-large time series datasets; addressing this is a direction for future work.

Footnotes

Acknowledgements

This paper was supported by the National Natural Science Foundation of China (No.62076215, No.62301473), Jiangsu University Qing Lan Project, the Fundamental Research Funds for the Central Universities, China (No. K93-9-2022-03), the Jiangsu Provincial Natural Science Foundation of Higher Education (No. 23KJB520039), Jiangsu Provincial Key Laboratory of Network and Information Security (No. BM2003201), Yancheng Basic Research Fund Project (No. YCBK2023008, YCBK2024028) and Graduate Innovation Program of Yancheng Institute of Technology (No. KYCX24_XZ055).

Ethical and informed consent for data used

This article does not contain studies with human participants or animals. Statement of informed consent is not applicable since the manuscript does not contain any patient data.

Authors contribution statement

Zhixuan Wang: Writing-original draft, visualization, validation, methodology, formal analysis, data curation. Xiufang Xu: Writing-review, supervision, resources, project administration, funding acquisition. Sen Xu: writing-review, data verification, funding acquisition. Naixuan Guo: writing-review, project administration, funding acquisition. Xuesheng Bian: writing-review, funding acquisition. Shanliang Yao: Supervision, project administration. Tian Zhou: Supervision, resources, project administration. Yuyang Shen: Supervision, resources.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability and access

The datasets used in this paper are publicly available and not subject to privacy protection.

References

1. Brophy, Wang, She, et al. Generative adversarial networks in time series: A systematic literature review. ACM Comput Surv 2023; 55: 1–31.

2. Jung. Deep learning for anomaly detection in multivariate time series: Approaches, applications, and challenges. Inform Fusion 2023; 91: 93–102.

3. Oyewole, Thopil. Data clustering: application and trends. Artif Intell Rev 2023; 56: 6439–6475.

4. Huang, Hao. Exploring the explainability of time series clustering: A review of methods and practices. In: Proceedings of the Eighteenth ACM international conference on web search and data mining, 2025, pp.1005–1007.

5. Chen, et al. Structure-aware deep clustering network based on contrastive learning. Neural Netw 2023; 167: 118–128.

6. Kexin, Wen, Zhang, et al. Self-supervised learning for time series analysis: Taxonomy, progress, and prospects. IEEE Trans Pattern Anal Mach Intell 2024; 46: 6775.

7. Madiraju. Deep temporal clustering: Fully unsupervised learning of time-domain features. Master’s Thesis, Arizona State University, 2018.

8. Zheng, et al. Learning representations for time series clustering. Adv Neural Inf Process Syst 2019; 32: 3781–3791.

9. Yin, Sun. Effective sample pairs based contrastive learning for clustering. Inform Fusion 2023; 99: 101899.

10. Pöppelbaum, Chadha, Schwung. Contrastive learning based self-supervised time-series analysis. Appl Soft Comput 2022; 117: 108397.

11. Liu, et al. Contrastive clustering. In: Proceedings of the AAAI conference on artificial intelligence, Vol. 35, 2021, pp.8547–8555.

12. Cao, Xing, Yang, et al. Unsupervised contrastive learning for time series data clustering. Electronics 2025; 14: 1660.

13. Petitjean, Ketterlin, Gançarski. A global averaging method for dynamic time warping, with applications to clustering. Pattern Recognit 2011; 44: 678–693.

14. Paparrizos, Gravano. k-shape: Efficient and accurate clustering of time series. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data, 2015, pp.1855–1870.

15. Zhang, et al. Salient subsequence learning for time series clustering. IEEE Trans Pattern Anal Mach Intell 2018; 41: 2193–2207.

16. Cai, Huang, Yang, et al. Se-shapelets: Semi-supervised clustering of time series using representative shapelets. Expert Syst Appl 2024; 240: 122584.

17. Yue, Wang, Duan, et al. Ts2vec: Towards universal representation of time series. In: Proceedings of the AAAI conference on artificial intelligence, Vol. 36, 2022, pp.8980–8987.

18. Eldele, Ragab, Chen, et al. Tslanet: Rethinking transformers for time series representation learning. In: International conference on machine learning, 2024, pp.12409–12428. PMLR.

19. Zhong, Huang, Wang. Deep temporal contrastive clustering. Neural Process Lett 2023; 55: 7869–7885.

20. Lee, Choi, Son. Deep time-series clustering via latent representation alignment. Knowl Based Syst 2024; 303: 112434.

21. Huang, Deng. Tcgan: Convolutional generative adversarial network for time series classification and clustering. Neural Netw 2023; 165: 868–883.

22. Guan, Lam. Spatial-spectral contrastive learning for hyperspectral image classification. In: IGARSS 2022-2022 IEEE International geoscience and remote sensing symposium, 2022, pp.1372–1375. IEEE.

23. Chen, et al. Structural deep multi-view clustering with integrated abstraction and detail. Neural Netw 2024; 175: 106287.

24. Semenoglou, Spiliotis, Assimakopoulos. Data augmentation for univariate time series forecasting with neural networks. Pattern Recognit 2023; 134: 109132.

25. Cheng, Zhang, et al. Learning hierarchical time series data augmentation invariances via contrastive supervision for human activity recognition. Knowl Based Syst 2023; 276: 110789.

26. Wang. From soft clustering to hard clustering: A collaborative annealing fuzzy c-means algorithm. IEEE Trans Fuzzy Syst 2024; 32: 1181–1194.

27. Feng, Kittler, Awais, et al. Wing loss for robust facial landmark localisation with convolutional neural networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2018, pp.2235–2245.

28. Xie, Girshick, Farhadi. Unsupervised deep embedding for clustering analysis. In: International conference on machine learning, 2016, pp.478–487. PMLR.

29. Guo, Gao, Liu, et al. Improved deep embedded clustering with local structure preservation. In: Proceedings of the Twenty-Sixth international joint conference on artificial intelligence, 2017, pp.1753–1759.

30. Yang, Deng, Zheng, et al. Deep spectral clustering using dual autoencoder network. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp.4066–4075.

31. Wang, Shi, et al. Structural deep clustering network. In: Proceedings of the web conference 2020, 2020, pp.1400–1410.

32. Jorge, Rubén. Time series clustering with random convolutional kernels. Data Min Knowl Discov 2024; 38: 1862–1888.

33. Eldele, Ragab, Chen, et al. Time-series representation learning via temporal and contextual contrasting. In: Proceedings of the Thirtieth international joint conference on artificial intelligence, 2021, pp.2352–2359.

34. Jeon, Aupetit, Shin, et al. Measuring the validity of clustering validation datasets. IEEE Trans Pattern Anal Mach Intell 2025; 47: 5045–5058.