Abstract
Time series clustering is a pivotal technique for efficiently mining the structure of data. However, time series data possess unique characteristics such as periodicity, nonlinearity and sensitivity to high-frequency noise in dynamic environments, which significantly impact the performance of clustering methods. Recently, deep clustering has garnered widespread attention for its outstanding performance in capturing multi-scale temporal dependencies. Despite this, existing methods struggle to effectively capture the similarities and diverse temporal patterns in noisy time series. Accordingly, we propose a novel deep time series clustering framework integrating spectral-adaptive masking with hierarchical contrastive learning. First, an enhanced encoder is designed to generate representations of time series through the incorporation of a frequency-domain adaptive noise filter, which dynamically suppresses high-frequency fluctuations by learning threshold parameters from spectral power distributions. Second, hierarchical contrastive information is captured through both temporal-level alignment within overlapping segments and instance-level comparisons across augmented subsequences, while clustering is simultaneously performed in a low-dimensional space with a novel fuzzy clustering loss that improves robustness against outlier interference. Finally, the network is optimized through the integration of the contrastive loss and the clustering loss, which achieves end-to-end joint representation learning and clustering assignment. Extensive experiments on various time series datasets demonstrate that our approach outperforms state-of-the-art clustering methods.
Introduction
Time series data, which is a type of data inherently related to time, widely exists in fields such as financial markets, 1 industrial manufacturing, 2 and medical analysis. 3 In addition, time series clustering can reveal potential patterns within the data and group them into distinct categories, allowing researchers to extract valuable information from extensive datasets. 4 However, time series data is often characterized by high dimensionality and temporal dependencies, which pose challenges for traditional clustering methods in effectively handling large-scale datasets.
Traditional methods rely on manually designing and extracting features, which are inadequate to capture the intrinsic characteristics of time series data. Thus, these methods tend to perform poorly when applied to complex and nonlinear datasets. In recent years, the application of deep neural networks to clustering methods has significantly improved the performance of time series clustering. 5 Deep clustering methods focus on generating cluster-oriented representations. They automatically learn and extract the underlying features of the data by using deep neural networks, and represent these features in a low-dimensional embedding space that better reflects the intrinsic structure of the data, thereby improving clustering accuracy and performance. 6 Madiraju et al., 7 among the earliest researchers in deep time series clustering, utilized an autoencoder to map time series data into a low-dimensional latent space, and then updated the neural network parameters and clustering centers based on the predicted distribution of the clusters and the reconstruction loss. Building on this, Ma et al. 8 introduced a strategy for generating pseudo-samples to further enhance the capability of the encoder. Their network optimization combines a spectral relaxation-based k-means loss and an auxiliary classification loss on top of the reconstruction loss. These methods primarily rely on instance reconstruction and cluster distribution alignment, often neglecting the inherent structural relationships between samples, which limits the ability to capture high-level semantic representations for clustering.
However, most deep time series clustering methods significantly depend on high-level features. As a novel technique in self-supervised learning, contrastive learning can efficiently learn invariant representations from augmented data without the need for labeled samples. 9,10 For instance, Li et al. 11 proposed a single-stage clustering method that incorporates contrastive learning through the construction of a feature matrix, performing instance-level and cluster-level contrastive learning in the row and column spaces, respectively, to maximize the similarity of positive pairs while minimizing the similarity of negative pairs. The contrastive learning framework is illustrated in Figure 1, where

Contrastive learning framework.
While contrastive learning has shown promise for clustering, its direct application to time series clustering is still a challenging problem because of the dynamics and complexity of time series data. 12 Moreover, few methods can adequately capture multi-scale features of time series data while mitigating noise that may impact clustering performance. In light of this, we propose a novel deep time series clustering method based on contrastive learning. First, data augmentation is applied to the original time series, and the input is converted into a frequency-domain representation within the encoder. Then, an adaptive noise filter is employed to reduce the influence of noise, followed by multi-layer convolution to generate clustering-friendly representations. Subsequently, contrastive learning is performed at two levels: the instance level and the temporal level. Finally, we optimize the entire network with joint loss functions. The contributions of this paper are as follows: (1) We introduce a novel deep time series clustering method based on contrastive learning, which incorporates instance and temporal contrastive losses, along with an improved fuzzy clustering loss. (2) An enhanced encoder is designed with an adaptive noise filter, which dynamically adjusts high-frequency components and enhances the representation of time series. (3) Extensive experiments on various datasets confirm the outstanding performance of the proposed method compared to state-of-the-art methods.
We organize the remainder of this paper as follows: In Section 2, we start with a review of the relevant studies on time series clustering. Then, we introduce the details of the proposed method in Section 3. Extensive experiments in Section 4 are conducted to demonstrate the effectiveness of the proposed method. Finally, we present the conclusion in Section 5.
Existing time series clustering methods can be divided into traditional and deep learning-based time series clustering methods. 3 Traditional time series clustering methods can be further classified into those based on raw data and those based on features. Methods based on raw data directly use the original values of time series for clustering by calculating the distances. To simplify the data, highlight important features, or enable comparisons of time series with different lengths, preprocessing operations such as normalization, smoothing, and interpolation are often applied to the data. Petitjean et al. 13 proposed a global technique for averaging a set of sequences, using the Dynamic Time Warping (DTW) distance metric for time series clustering. While this method provides a more robust average under DTW and shows improved performance on datasets like those from the UCR archive, it inherits DTW’s high computational cost and sensitivity to noise, which limits its scalability and practicality on large or noisy datasets. Additionally, Paparrizos et al. 14 proposed k-Shape, a clustering method that applies a cross-correlation-based distance measure to group time series with similar trends by optimizing shape similarity. k-Shape is computationally efficient due to its use of the Fast Fourier Transform and demonstrates strong accuracy on benchmark datasets. However, its reliance on global alignment and z-normalization makes it less effective for sequences with complex local temporal variations or inconsistent amplitude characteristics. These raw data-based methods are highly susceptible to noise and outliers, and exhibit poor performance on high-dimensional data. This limitation prompted the development of feature-based methods, which cluster the data after capturing its features. Zhang et al. 15 proposed an Unsupervised Salient Subsequence Learning (USSL) model that automatically discovers shapelets by integrating pseudo-labels and spectral analysis.
Building on this, Cai et al. 16 introduced SE-Shapelets, a semi-supervised method that leverages a few labels to extract salient subsequence chains and select the most representative shapelets via linear discriminant selection, achieving even higher clustering accuracy. However, a fundamental limitation persists in these works: the decoupling of feature extraction from the clustering process. This separation means the learned features are not explicitly optimized for the final clustering objective, potentially failing to capture the most informative contextual hierarchies in the data.
In contrast, deep learning-based time series clustering methods have significant advantages in representation learning and the processing of complex data patterns. 6 Deep time series clustering can automatically learn the complex features of the data, and minimize intermediate errors through joint optimization of feature extraction and clustering, thereby improving accuracy and robustness. Recently, extracting effective representations from time series data for downstream tasks has garnered considerable attention. Yue et al. 17 proposed a universal framework for learning representations of time series at arbitrary semantic levels, and performed contrastive learning in a hierarchical manner on the enhanced context view, demonstrating strong performance in classification and forecasting. However, as a general-purpose representation model, its learning objective is not explicitly designed to optimize for cluster-friendly structures in the latent space, which may limit its direct applicability to clustering tasks. Eldele et al. 18 proposed a lightweight adaptive network for time series that segments the input into multiple blocks. This method captures global patterns across different frequency components using adaptive spectral blocks, which show remarkable noise robustness, and then employs interactive convolutional blocks to extract local features from the time series. While it achieves state-of-the-art results in supervised and semi-supervised settings, its feature learning process is decoupled from the ultimate clustering objective, potentially leading to representations that are suboptimal for partitioning data. The core ideas of representation learning are reflected in subsequent deep time series clustering methods. Zhong et al. 19 leveraged data augmentations to construct dual views and applied contrastive losses at both the instance and cluster levels.
While this approach achieves state-of-the-art performance by maximizing agreement between views, its effectiveness is inherently tied to the quality of augmentations and is susceptible to noise in the positive pairs. Lee et al. 20 exploited the eigenstructure of latent representations to define a topological loss that aligns samples with similar temporal structures. This allows it to achieve superior results even with a simple MLP encoder, but it introduces significant computational overhead from the eigendecomposition and is sensitive to the initial cluster assignments. In parallel, Huang et al. 21 learned representations via an adversarial game between a generator and a discriminator. Although capable of capturing complex temporal dynamics, such models often face training instability and lack an explicit clustering objective, which can limit their final clustering performance and consistency. Although the aforementioned methods have achieved considerable performance, they do not adequately leverage the inherent information in the data to obtain discriminative representations while accounting for the noise sensitivity and clustering consistency of the model. The proposed method bridges this gap by applying a frequency-domain adaptive denoising mechanism coupled with dual-level contrastive learning, enabling joint optimization of representation purity and cluster separability.
Proposed method
We propose deep time series clustering via spectral-adaptive masking and hierarchical contrastive learning (DTCSC), which includes an enhanced contrastive learning framework for time series representation and a clustering module. First, the overall network framework of DTCSC is presented. Then, we elaborate on each module of the proposed method.
Overall network framework
The overall framework of DTCSC is shown in Figure 2. In the network architecture of DTCSC, the raw time series is first randomly cropped to obtain two different but overlapping subseries. Then, these subseries are fed into the encoder to generate latent low-dimensional representations that encapsulate the contextual information. Further, the contrastive loss is obtained through two-level contrastive learning, which incorporates both instance-level and temporal-level contrastive learning. Additionally, the raw series from which those subseries are cropped are fed into the encoder in parallel, with the output serving as the input for the clustering module. The network is jointly optimized in a self-supervised manner using the contrastive loss and the clustering loss. It is worth noting that these steps are performed during the fine-tuning phase. Prior to this, we pretrain the model to obtain initial latent representations that reflect the temporal context information of time series, as well as initial cluster centers, thereby avoiding the problem of local optima caused by random initialization. In the pretraining phase, the representations generated by the encoder may not exhibit clearly distinct clustering structures because the encoder weights are updated only through the minimization of the contrastive loss function.
Self-Supervised contrastive representation learning
The key to deep clustering is obtaining clustering-friendly representations through representation learning, and self-supervised learning generates representations for downstream tasks by leveraging the inherent structure and attributes of the data, without the need for labels. 22 Further, self-supervised contrastive learning emphasizes point details, establishes spatial relationships between data samples by making instances comparable, and clarifies similarities and dissimilarities, thereby enabling more effective capture of discriminative representations.
Data augmentation
The core idea of contrastive representation learning is to construct positive and negative samples in representation space via data augmentation, and to maximize the similarity between positive samples while minimizing it for negative samples. 23,24 The random cropping strategy helps the model learn the invariance of local patterns in time series and improves its ability to capture local features, at low computational complexity. 25 Thus, we use a random cropping strategy to generate subseries from data samples.
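The random cropping step can be illustrated with a short NumPy sketch. It draws two crops of one series that are guaranteed to share a common segment, so the shared timestamps can later serve as positive pairs. The function name `overlapping_crops`, the `min_overlap` parameter, and the way the cut points are sampled are assumptions made for illustration, not the exact sampling scheme of DTCSC.

```python
import numpy as np

def overlapping_crops(x, rng, min_overlap=1):
    """Return two crops of x (time on axis 0) that share the segment x[s:e].

    Cut points satisfy a1 <= a2 < b1 <= b2, so crop 1 is x[a1:b1],
    crop 2 is x[a2:b2], and the overlap is x[a2:b1].
    Hypothetical helper: the sampling rule is an illustrative assumption.
    """
    T = len(x)
    overlap_len = rng.integers(min_overlap, T + 1)   # length of shared part
    a2 = rng.integers(0, T - overlap_len + 1)        # start of overlap
    b1 = a2 + overlap_len                            # end of overlap
    a1 = rng.integers(0, a2 + 1)                     # crop 1 starts at or before a2
    b2 = rng.integers(b1, T + 1)                     # crop 2 ends at or after b1
    return x[a1:b1], x[a2:b2], (a2, b1)

rng = np.random.default_rng(0)
series = np.arange(20.0)
view1, view2, (s, e) = overlapping_crops(series, rng)
# The tail of view1 and the head of view2 cover the same timestamps s..e-1.
```

Only the overlapping part of the two views is comparable timestamp by timestamp; the non-overlapping parts give each view a different context.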
Given a time series dataset
Adaptive noise filter
In practice, time series data often exhibit noise that manifests as anomalies and irregular fluctuations. Such noise may distort the similarity between data points, making it challenging for clustering methods to accurately distinguish different temporal patterns. However, existing deep clustering methods commonly assume that the data points are noise-free or that the noise is uniformly distributed across the entire dataset. This overlooks the varying degrees of impact that noise might have across different time intervals and the potential temporal dependencies, so that relevant temporal features may be masked or misidentified.

Overview of the proposed DTCSC.
High-frequency components often represent rapid fluctuations that stray from the main trend, leading to more randomness and making data challenging to analyze. 18 The Discrete Fourier Transform (DFT) can efficiently transform complex time-domain signals into frequency-domain information, which facilitates the analysis of frequency components and periodic patterns. Therefore, the 1D DFT is well suited for preliminary processing to decompose time series data into their frequency components. For clarity, the input time series of
Further, the Fast Fourier Transform (FFT) and inverse fast Fourier transform (IFFT) optimize the computation of DFT and IDFT by exploiting the symmetry and periodicity of
Finally, the frequency-domain data is transformed back to the time domain through IFFT to obtain the output, and it is given as:
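As a rough illustration of the filtering steps in this subsection (FFT, masking based on spectral power, IFFT), consider the NumPy sketch below. The threshold here is a fixed quantile of each series' power spectrum, which is only a stand-in for the learned threshold parameters of DTCSC; the function name `spectral_mask_filter` and the `keep_quantile` parameter are likewise assumptions made for illustration.

```python
import numpy as np

def spectral_mask_filter(x, keep_quantile=0.9):
    """Suppress low-power frequency bins of a real-valued series.

    x: array of shape (..., T), time on the last axis.
    keep_quantile: bins whose power falls below this quantile of the
        series' own power spectrum are zeroed (illustrative stand-in
        for a learned threshold).
    """
    spectrum = np.fft.rfft(x, axis=-1)                # time -> frequency
    power = np.abs(spectrum) ** 2                     # spectral power per bin
    threshold = np.quantile(power, keep_quantile, axis=-1, keepdims=True)
    mask = power >= threshold                         # keep dominant components
    return np.fft.irfft(spectrum * mask, n=x.shape[-1], axis=-1)  # back to time

rng = np.random.default_rng(0)
t = np.arange(256) / 256.0
clean = np.sin(2 * np.pi * 5 * t)
noisy = clean + 0.3 * rng.normal(size=t.shape)
filtered = spectral_mask_filter(noisy)
```

For a series dominated by a few periodic components, most noise energy sits in bins that fall below the threshold, so the reconstruction is closer to the clean signal than the noisy input is.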
Define non-linear mapping function
The encoder consists of three components, including an adaptive noise filter, a dual-layer convolutional structure and a residual block, as depicted in Figure 3. The input is processed through multiple stacked convolutional layers for feature extraction, where each layer contains two parallel paths: a main path and an auxiliary path. The main path comprises two dilated 1-D convolution blocks with progressively increasing dilation parameters, where the dilation rate of the

Encoder architecture.
Contrastive loss functions aim to refine the geometry of the feature space by reducing the distance between related or positively paired instances while simultaneously increasing the separation between unrelated or negative pairs, which enhances the capacity of the model to distinguish between diverse data points. Thus, we use two contrastive losses, which focus on instance-level and temporal-level comparisons, to effectively capture the internal dynamic changes within the time series and the differences between instances. These two strategies constrain the representation space along two orthogonal dimensions. The instance-level loss provides a discriminative signal by contrasting different samples within a batch. This guides the model to learn features that can effectively separate one data instance from another, focusing on inter-instance differences. On the other hand, the temporal-level loss provides a consistency signal by contrasting different temporal contexts of the same instance. This guides the model to become invariant to trivial temporal variations and to capture the essential intra-instance dynamics and trends. Relying solely on instance-level contrast might yield representations that are sensitive to noise and lack temporal smoothness, while relying solely on temporal-level contrast might result in representations that are temporally coherent but lack discriminative power between different classes. Therefore, their combination is designed to learn representations that are both discriminative and temporally meaningful. 17 In our work, for any given timestamp, we consider representations from different timestamps within the same instance and from other instances in the same batch as negatives. This approach avoids the strong and often unrealistic invariance assumptions required by other methods, which can distort the inherent temporal structure of the data. Instead, by leveraging the natural structure within and across time series, it ensures that the learned representations faithfully preserve crucial temporal dependencies while maintaining high computational efficiency. Instance-level contrastive loss compares the inherent similarities between samples, pulling similar instances closer in the feature space to obtain discriminative representations. At timestamp
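The choice of negatives described above can be sketched in NumPy as follows. This is a minimal illustration in the style of hierarchical contrastive losses, assuming dot-product similarity and a softmax-style normalization; the function names and the exact form of the loss are assumptions and are not taken from the DTCSC formulation. Both inputs are representations of the overlapping part of the two views, with shape (batch, time, channels).

```python
import numpy as np

def _logsumexp(x, axis=-1):
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))

def _contrast(a, b):
    """a, b: (N, M, C). Within each of the N groups, item i of `a` is
    pulled toward item i of `b` and pushed from all other items."""
    cross = a @ np.swapaxes(b, 1, 2)              # (N, M, M): a_i . b_j
    same = a @ np.swapaxes(a, 1, 2)               # (N, M, M): a_i . a_j
    eye = np.eye(a.shape[1], dtype=bool)
    same = np.where(eye, -np.inf, same)           # drop a_i . a_i
    logits = np.concatenate([cross, same], axis=-1)
    positive = np.diagonal(cross, axis1=1, axis2=2)   # (N, M): a_i . b_i
    return np.mean(_logsumexp(logits) - positive)

def instance_contrastive_loss(z1, z2):
    """Negatives: other instances in the batch at the same timestamp."""
    a, b = z1.transpose(1, 0, 2), z2.transpose(1, 0, 2)   # (T, B, C)
    return 0.5 * (_contrast(a, b) + _contrast(b, a))

def temporal_contrastive_loss(z1, z2):
    """Negatives: other timestamps within the same instance."""
    return 0.5 * (_contrast(z1, z2) + _contrast(z2, z1))

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 6, 8))
z /= np.linalg.norm(z, axis=-1, keepdims=True)    # unit-norm representations
loss_inst = instance_contrastive_loss(z, z)
loss_temp = temporal_contrastive_loss(z, z)
```

The two losses share one helper and differ only in which axis is contrasted: the batch axis for the instance level and the time axis for the temporal level.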
When dealing with datasets that exhibit fuzzy boundaries or overlapping features, data points are often not strictly assigned to a specific cluster. This uncertainty necessitates that clustering algorithms not only identify the cluster to which each data point belongs but also quantify the degree of relevance of data points to each cluster. 26 Therefore, we use the Euclidean distance to quantify the membership relationships between each data point and the cluster center, which is represented by the membership matrix
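To make the idea of a distance-based membership matrix concrete, the sketch below computes the standard fuzzy c-means memberships from Euclidean distances between latent points and cluster centers. It is an assumption that this textbook form resembles the one used here; the fuzzifier `m`, the function name `fuzzy_membership`, and the toy data are illustrative, and the improved clustering loss of DTCSC is not reproduced.

```python
import numpy as np

def fuzzy_membership(z, centers, m=2.0, eps=1e-12):
    """Standard fuzzy c-means membership matrix.

    z: (N, D) latent representations; centers: (K, D) cluster centers.
    Returns u of shape (N, K), where u[i, j] is the degree to which
    point i belongs to cluster j and each row sums to 1.
    """
    # Euclidean distance from every point to every center, shape (N, K).
    dist = np.linalg.norm(z[:, None, :] - centers[None, :, :], axis=-1) + eps
    inv = dist ** (-2.0 / (m - 1.0))          # closer centers get larger weight
    return inv / inv.sum(axis=1, keepdims=True)

points = np.array([[0.0, 0.0], [5.0, 5.0], [2.0, 2.0]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
u = fuzzy_membership(points, centers)
```

A point near one center receives a membership close to 1 for that cluster, while a point between the two centers is shared between them instead of being forced into a hard assignment.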
The contrastive loss and the clustering loss are jointly optimized within a unified framework. The overall loss function, formulated as a weighted sum of these two components, can be expressed as follows:
Suppose given a time series dataset with
The encoder network, built upon dilated causal convolutions with depth
Therefore, the total time complexity of DTCSC is approximately:
Experiment
In this section, extensive experiments were conducted to evaluate the efficacy of the proposed DTCSC. The experimental environment includes Windows 11 64-bit operating system, Intel i5-10400 at 3.10 GHz CPU, and 16 GB RAM.
Datasets
We employed ten datasets from the publicly available UCR archive: Meat (meat spectral data); DistalPhalanxOutlineAgeGroup, ProximalPhalanxOutlineAgeGroup, and ProximalPhalanxTW (phalanx contour data); ECGFiveDays (electrocardiogram curves); Beef (beef spectrometer spectral curves); MoteStrain (temperature and humidity sensor data); OSULeaf (leaf contour curves); Plane (military aircraft contour curves); and Symbols (symbol outline or trajectory data). Each dataset is described by three attributes: sample size, series length, and number of classes. Detailed descriptions of each dataset are provided in Table 1.
Statistics description of ten time series datasets.
To verify the effectiveness of the proposed method, we compared DTCSC with several clustering methods, including both deep and non-deep clustering methods.
DEC 28: A deep clustering method that learns features and clustering assignments, minimizing a clustering loss based on KL divergence to improve both feature representation and clustering.
IDEC 29: A deep clustering method that uses an undercomplete autoencoder and integrates the clustering loss with the autoencoder loss to jointly assign clustering labels and learn features that are suitable for clustering while preserving the data structure.
DTC 7: A classical unsupervised temporal clustering method that maps data to a low-dimensional latent space and clusters it using K-means, updating both the neural network parameters and the cluster centers based on the clustering results.
DSC 30: A deep clustering method that combines dual autoencoders and deep spectral clustering. The dual autoencoder consists of an encoder, a noisy decoder, and a noise-free decoder.
SDCN 31: A deep graph clustering method that captures the low-order and high-order structures of the data, using a GCN to propagate the representations learned by the autoencoder.
TS2Vec 17: A general framework for learning time series representations at arbitrary semantic levels, performing contrastive learning in a hierarchical manner on augmented contextual views.
TCGAN 21: A method that employs a generative adversarial network with two one-dimensional CNNs to learn hierarchical representations from unlabeled time series data.
R-clust 32: A time series clustering method using random convolutional kernels and PCA for feature extraction and dimensionality reduction.
SE-shapelets 16: A method that leverages a small number of labeled and pseudo-labeled time series to discover representative shapelets, incorporating a salient subsequence chain to extract informative subsequences and a linear discriminant selection algorithm to identify shapelets that capture discriminative local features.
TS-TCC 33: A contrastive learning framework employing temporal and contextual contrasting with Transformer encoders, suitable for representation learning and clustering tasks.
TSLANet 18: A universal time series model that combines spectral analysis with convolutional operations, utilizing frequency-domain processing to enhance feature representation and handle complex temporal patterns across various tasks.
DEETO 20: A deep clustering method that leverages self-supervised pretraining followed by fine-tuning with topological constraints to learn cluster-oriented representations.
Evaluation metrics
We used two widely accepted metrics in the clustering field to evaluate model performance, including Normalized Mutual Information (NMI), and Rand Index (RI), both of which range from 0 to 1, with higher values indicating better clustering performance. NMI is based on Mutual Information (MI). It is calculated via the following formula:
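For illustration, both metrics can be computed from scratch as in the NumPy sketch below. Normalizing the mutual information by the geometric mean of the two label entropies is one common convention and is an assumption here, as are the function names; the sketch is not meant to restate the exact definition used in the experiments.

```python
import numpy as np

def rand_index(labels_true, labels_pred):
    """Fraction of sample pairs on which the two partitions agree."""
    a, b = np.asarray(labels_true), np.asarray(labels_pred)
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    upper = np.triu_indices(len(a), k=1)          # each unordered pair once
    return float(np.mean(same_a[upper] == same_b[upper]))

def normalized_mutual_info(labels_true, labels_pred):
    """NMI = MI(U, V) / sqrt(H(U) * H(V)) (assumed normalization)."""
    _, ai = np.unique(np.asarray(labels_true).ravel(), return_inverse=True)
    _, bi = np.unique(np.asarray(labels_pred).ravel(), return_inverse=True)
    joint = np.zeros((ai.max() + 1, bi.max() + 1))
    np.add.at(joint, (ai, bi), 1.0)               # contingency table
    joint /= joint.sum()
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(pa, pb)[nz]))
    ha, hb = -np.sum(pa * np.log(pa)), -np.sum(pb * np.log(pb))
    if ha == 0.0 or hb == 0.0:                    # a partition with one cluster
        return 1.0 if ha == hb else 0.0
    return float(mi / np.sqrt(ha * hb))

y_true = [0, 0, 1, 1]
y_pred = [0, 0, 1, 2]
ri = rand_index(y_true, y_pred)
nmi = normalized_mutual_info(y_true, y_pred)
```

Both functions are invariant to how the cluster labels are numbered, which is why they are suitable for comparing a clustering with ground-truth classes.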
The training process of DTCSC consists of two steps: the pretraining phase and the fine-tuning phase. In the encoder, the dimension of the hidden layer is set to 64, while the dimension of the output layer is reduced to 32 via a max pooling layer. Standard weight initialization is used. During the pretraining phase, the learning rate is set to 0.001, and the model uses the Adam optimizer with its default hyperparameters (
Experimental results
We evaluated the clustering performance of the proposed method and compared it with that of the baseline methods. The NMI scores for the 10 datasets are shown in Table 2, and the RI scores are presented in Table 3. In addition, we report the average rank, the number of best results, and the number of second-best results for each method. Underlined values indicate the best result among all methods.
Comparison results of the NMI for 10 time series datasets.
Comparison results of the RI for 10 time series datasets.
The experimental results demonstrate that DTCSC achieved the best or second-best clustering performance across most datasets, outperforming the mainstream methods. Moreover, DTCSC achieved the highest average scores in both NMI and RI metrics. DEETO and TSLANet demonstrate competitive performance, with DEETO leveraging topological information for representation alignment and TSLANet employing spectral analysis for feature enhancement, though both are still outperformed by our approach. While the Transformer-based TS-TCC method benefits from its powerful sequence modeling capacity, its clustering performance is moderate, potentially because the generic representations it learns are not explicitly optimized for the clustering objective. While R-clust and SE-shapelets, as raw data-based methods, exhibit exceptional computational efficiency, their clustering effectiveness is suboptimal. TCGAN, a deep clustering method that focuses on image data, exhibits the poorest performance. In contrast to TS2Vec, a two-stage approach where the learned representations may lack sufficient adaptability for clustering objectives, our proposed method establishes a unified end-to-end framework that jointly optimizes representation learning and clustering objectives, enabling synergistic adaptation between feature encoding and cluster structure formation.
Notably, DEC and IDEC methods achieved satisfactory performance on datasets with short sequences but poor performance when handling long sequence datasets. DTC, a deep clustering method based on Recurrent Neural Network (RNN), heavily relies on the representation quality generated by its encoder. As shown in Figure 4, DTCC—a dual-view deep clustering method that employs contrastive learning and RNN-based autoencoders for temporal representation extraction—outperforms DTC in most datasets. This performance superiority arises from contrastive learning’s inherent mechanism to enforce semantic consistency by filtering stochastic noise and redundant variations, thereby yielding representations with enhanced discriminative separability and robustness.

Comparison of NMI scores of DTCSC with those of DTC and DTCC.
To provide an internal validation of the clustering results, we employed the silhouette coefficient, which measures the cohesion and separation of the formed clusters without the need for ground truth labels. 34 The coefficient ranges from -1 to 1, where values above 0 indicate that samples are, on average, closer to members of their own cluster than to members of other clusters.
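As a concrete reference, the silhouette coefficient can be computed as in the NumPy sketch below, which uses Euclidean distances and assigns a score of 0 to samples in singleton clusters. The function name and the toy data are assumptions for illustration; at least two clusters are required.

```python
import numpy as np

def silhouette_coefficient(x, labels):
    """Mean silhouette over all samples.

    For sample i: a = mean distance to the other members of its cluster,
    b = smallest mean distance to the members of any other cluster,
    s = (b - a) / max(a, b).
    """
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    clusters = np.unique(labels)
    scores = np.zeros(len(x))
    for i in range(len(x)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue                      # singleton cluster: score stays 0
        a = dist[i, own].sum() / (n_own - 1)          # excludes dist[i, i] = 0
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())

points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good = silhouette_coefficient(points, [0, 0, 1, 1])   # compact, well separated
bad = silhouette_coefficient(points, [0, 1, 0, 1])    # clusters mixed up
```

The well-separated assignment yields a score close to 1, whereas the mixed-up assignment yields a negative score because each sample is closer to the other cluster than to its own.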
The silhouette coefficients for our method across all ten datasets are summarized in Table 4. The results indicate that our method achieves reasonable clustering quality, with the majority of silhouette scores above 0.5, meaning that samples are, on average, well matched to their own cluster. The observed variation in scores, including the lower performance on a few specific datasets, can be primarily attributed to the inherent characteristics of the data themselves. Certain datasets may exhibit higher intrinsic overlap between native clusters or more complex noise patterns.
Silhouette coefficients of the proposed method across benchmark datasets.
To visually illustrate the changes in clustering performance throughout the clustering process, we used the MoteStrain dataset to analyze the relationship between clustering performance and the number of iterations, as depicted in Figure 5. As the loss decreased, the evaluation metric scores consistently increased. Furthermore, to track the evolution of clusters over time, the t-distributed Stochastic Neighbor Embedding (t-SNE) was utilized to visualize changes in cluster formation on the Symbols dataset. The experiments were conducted under standard parameter settings for 50 iterations, with clustering results generated by the current trained model after every 10 iterations. As shown in Figure 6, which presents a t-SNE projection where the axes are dimensionless and serve only to represent relative similarity, we can observe the evolution of clusters during the iterative optimization process. The visual analysis reveals an overall trend where samples of the same category gradually cluster together, demonstrating higher similarity, while samples of different categories gradually separate.
While the above t-SNE visualization demonstrates effective cluster formation on datasets where our model performs well, we also employed the same technique to diagnose the challenges presented by the datasets with relatively lower metrics (DA, BE, and OL). The corresponding t-SNE plots are provided in Figure 7. The visual analysis confirms the quantitative results, revealing less distinct cluster separation. We attribute this primarily to the inherent complexities of these specific datasets. For instance, the BE dataset’s small sample size and long series length hinder the learning of robust feature representations, while the OL and DA datasets, being shape-based classification problems, may not be fully exploited by our contrastive learning framework, which is more sensitive to temporal context than to global shape morphology.

Visualization of the loss along with three metrics during training on MoteStrain.

The evolution of clusters during the iteration process.

t-SNE visualization of the clustering results on DA, BE and OL.
To verify the effectiveness of the noise filter used in the proposed method, we investigated the efficacy of the noise filters in reducing noise and enhancing the robustness of the model. As shown in Figure 8, we added four different levels of noise to the Meat dataset to simulate the performance of the model under high noise conditions. The noise levels were quantified using signal-to-noise ratio (SNR) in decibels (dB), where lower SNR values indicate stronger noise contamination. We applied four settings: 20 dB, 10 dB, 0 dB, and -10 dB to simulate varying noise intensities. As the noise intensity increased, the model’s accuracy rapidly declined. However, the performance of the model utilizing noise filters consistently outperformed that of the model without noise filters. This indicates that the noise filter can effectively mitigate the impact of noise and improve model performance.
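The SNR-controlled corruption can be reproduced along the lines of the sketch below, which scales additive noise so that the ratio of signal power to noise power matches a target value in dB. The use of zero-mean Gaussian noise and the function name `add_noise_at_snr` are assumptions made for illustration; the type of noise is not specified above.

```python
import numpy as np

def add_noise_at_snr(x, snr_db, rng):
    """Add zero-mean Gaussian noise so that the expected SNR equals snr_db.

    SNR(dB) = 10 * log10(P_signal / P_noise), hence
    P_noise = P_signal / 10 ** (snr_db / 10).
    """
    x = np.asarray(x, dtype=float)
    signal_power = np.mean(x ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=x.shape)
    return x + noise

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 100_000, endpoint=False)
clean = np.sin(2 * np.pi * 7 * t)
noisy_versions = {snr: add_noise_at_snr(clean, snr, rng) for snr in (20, 10, 0, -10)}
```

At 0 dB the noise carries as much power as the signal, and at -10 dB it carries ten times as much, which explains the rapid decline in accuracy at the lowest settings.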

The effectiveness of DTCSC in handling noisy data.
In this section, we compared DTCSC with five ablation strategies to verify the validity of several designs: DTCSC without noise filter (DTCSC-F), DTCSC without improved clustering loss (DTCSC-L), DTCSC without both noise filter and improved clustering loss (DTCSC-A), DTCSC without instance-level contrastive loss (DTCSC-I) and DTCSC without temporal-level contrastive loss (DTCSC-T). As shown in Table 5, DTCSC outperforms the other five variants on most datasets. Crucially, the performance degradation observed in both DTCSC-I and DTCSC-T confirms that the instance-level and temporal-level contrastive components are both indispensable, as removing either leads to non-redundant performance drops. Compared to the other five strategies, DTCSC demonstrates varying degrees of performance improvement, although it may experience some degradation on certain datasets, which could be attributed to the inherent characteristics of the datasets themselves. Moreover, we evaluated the training time required for each method on each dataset. As shown in Figure 9, while DTCSC requires slightly more computation time than the other three methods across most datasets, the increase is under 0.1, demonstrating that the performance gains are achieved with only modest additional time cost.
Results from experiments using different strategies.

Comparison of execution times of different strategies (unit: min).
To comprehensively assess the efficiency of different clustering methods, we conduct a comparative analysis of the computational complexity. Based on their core operations, the benchmark methods exhibit a clear hierarchy in computational cost. Methods like R-Clust, which utilize random convolutional kernels, achieve the lowest complexity, typically on the order of
In comparison, the proposed DTCSC method positions itself in the medium-to-high complexity range. Its computational cost stems primarily from its multi-component architecture: the encoder network based on dilated causal convolutions has a complexity of
Conclusion
In this study, we propose a novel deep clustering method for time series based on contrastive learning. The method uses an encoder to extract feature representations from time series data and incorporates a noise filter that suppresses high-frequency noise and enhances the representations. Contrastive information is then captured at both the temporal and instance levels. To improve the robustness and accuracy of clustering assignments, a refined distance measure, which combines absolute error with the smooth characteristics of the Huber loss, is employed. Finally, we jointly optimize the contrastive loss and the clustering loss to learn cluster-friendly representations. Extensive experiments on multiple time series datasets demonstrate that the proposed method outperforms state-of-the-art deep clustering methods. One limitation of the current work is that it is designed for and evaluated on univariate time series; extending it to multivariate data requires further architectural adjustments. In addition, extracting feature representations from multiple layers of the deep network may require significant storage space, which could hinder deployment on ultra-large time series datasets. Addressing this storage cost is a direction for future work.
Footnotes
Acknowledgements
This paper was supported by the National Natural Science Foundation of China (No.62076215, No.62301473), Jiangsu University Qing Lan Project, the Fundamental Research Funds for the Central Universities, China (No. K93-9-2022-03), the Jiangsu Provincial Natural Science Foundation of Higher Education (No. 23KJB520039), Jiangsu Provincial Key Laboratory of Network and Information Security (No. BM2003201), Yancheng Basic Research Fund Project (No. YCBK2023008, YCBK2024028) and Graduate Innovation Program of Yancheng Institute of Technology (No. KYCX24_XZ055).
Ethical and informed consent for data used
This article does not contain studies with human participants or animals. Statement of informed consent is not applicable since the manuscript does not contain any patient data.
Authors contribution statement
Zhixuan Wang: Writing-original draft, visualization, validation, methodology, formal analysis, data curation. Xiufang Xu: Writing-review, supervision, resources, project administration, funding acquisition. Sen Xu: Writing-review, data verification, funding acquisition. Naixuan Guo: Writing-review, project administration, funding acquisition. Xuesheng Bian: Writing-review, funding acquisition. Shanliang Yao: Supervision, project administration. Tian Zhou: Supervision, resources, project administration. Yuyang Shen: Supervision, resources.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data availability and access
The datasets used in this paper are publicly available and are not subject to privacy restrictions.
