Semi-supervised contrastive learning with decomposition-based data augmentation for time series classification

Abstract

While time series data are prevalent across diverse sectors, the data labeling process remains resource-intensive. This results in a scarcity of labeled data for deep learning, emphasizing the importance of semi-supervised learning techniques. Applying semi-supervised learning to time series data presents unique challenges due to its inherent temporal complexities. Efficient contrastive learning for time series requires specialized methods, particularly in the development of tailored data augmentation techniques. In this paper, we propose a single-step, semi-supervised contrastive learning framework named nearest neighbor contrastive learning for time series (NNCLR-TS). Specifically, the proposed framework incorporates a support set to store representations including their label information, enabling pseudo-labeling of the unlabeled data based on nearby samples in the latent space. Moreover, our framework presents a novel data augmentation method, which selectively augments only the trend component of the data, preserving their inherent periodic properties and facilitating effective training. For training, we introduce a novel contrastive loss that utilizes the nearest neighbors of augmented data for positive and negative representations. With this framework, we obtain high-quality embeddings and achieve strong performance in downstream time series classification tasks. Experimental results demonstrate that our method outperforms the state-of-the-art approaches across various benchmarks, validating the effectiveness of our proposed method.

Keywords

Deep learning, machine learning, representation learning, self-supervised learning, semi-supervised learning, time series analysis

1. Introduction

Time series are utilized for a variety of tasks in a wide range of scientific and industrial areas, including health assessment in medical fields, human activity recognition, and industrial process monitoring [1, 2, 3]. Recently, interest in time series classification tasks has grown substantially due to the increase in the amount of available data. With this increase in data, achieving optimal classification performance becomes increasingly crucial [4]. The success of the classification task depends not only on the quantity and quality of the observed data but also on the availability of sufficient annotated label information [5, 6].

Unlike other types of data, time series can be continuously generated as long as the source remains available for collection [1]. Because labeling time series sequences is a time-consuming and labor-intensive process, large amounts of unlabeled data are highly likely to accumulate in practice [7, 8, 9]. Consequently, the lack of labeled data poses a huge challenge in applying deep learning approaches to time series tasks, since deep neural networks rely heavily on a large number of labeled samples for training [6].

There has been a surge in research efforts aiming to tackle these challenges through self-supervised representation learning methods [5, 10]. In particular, contrastive learning has emerged as a promising approach in computer vision, demonstrating superior classification performance compared to other representation learning techniques [11, 12]. It leverages the data structure to learn meaningful representations without relying on explicit labels, effectively utilizing the information inherent in the data. This method focuses on maximizing agreement between differently transformed views of the same instance, while minimizing it between others. As a result, it can capture complex, high-level features without the need for extensive labeled data, making it particularly effective for tasks with limited annotations [13, 14].

Meanwhile, self-supervised contrastive learning models carry the risk of treating instances from the same class as negatives, which could potentially undermine their performance. To further enhance the potential of this approach, semi-supervised contrastive learning attempts to utilize a subset of labeled instances, leading to improved performance and potentially surpassing the capabilities of self-supervised methods in time series analysis [15, 16].

Many of these studies follow multi-step approaches where they first generate pseudo-labels from a pretrained encoder and then train the model using them. This sequential process implies that valuable label information is not exploited in the initial phase of encoder training, which is a key period for learning representation. Although some strategies have attempted to address these issues with the use of supervised contrastive loss and hard pseudo-labeling [15], these methods still face challenges due to the reduced learning efficiency from inaccuracies in pseudo-labeling in the early stages of training [17, 18].

On the other hand, data augmentation is also known to be essential for successful contrastive learning for time series [19]. Augmentation methods using random transformations, such as jittering and scaling, have had a significant effect on improving learning performance in the field of computer vision [20]. However, directly applying these methods to time series data may not produce results comparable to those observed in other fields [21]. This may be because existing methods do not take into account the unique characteristics of time series, such as stationarity and seasonality [22, 23].

To address the aforementioned issues, we propose a single-step semi-supervised contrastive learning framework called nearest neighbor contrastive learning for time series (NNCLR-TS). First, we introduce a novel decomposition-based data augmentation algorithm tailored for time series, employing a sampling-based approach that respects the inherent properties of the data. This method focuses on selectively augmenting the trend component, thus enhancing the model’s ability to learn from the data’s temporal dynamics. Using different views of the data from this augmentation method, the model learns to capture transformation invariance in time series.

Furthermore, following the structure proposed in [13], we employ a memory buffer, called the support set, in which the representations and their labels/pseudo-labels are stored. By utilizing the support set, we perform pseudo-labeling on unlabeled data, where the assigned pseudo-label is determined by a majority vote of its nearby samples in the latent space.

Lastly, we introduce two innovative loss functions, namely Instance-wise Cross Entropy (IW-Xent) and Intra-batch Similarity (IBS) losses, designed to enhance contrastive learning in our proposed model. The former takes full advantage of label information by selecting the nearest neighbor (NN) representations for both positive and negative views, effectively utilizing them in the design of the loss function. The latter contributes by exploiting the similarities among representations within a batch. Together, these new loss functions enable our framework to employ both labeled and unlabeled data efficiently, enhancing the model’s ability to generalize and improving the quality of the learned representations. We combine our proposed loss functions with the well-established normalized temperature-scaled cross entropy (NT-Xent) loss [14] to form a composite loss function. Empirically, we show that this strategy results in a significant performance improvement, demonstrating the effectiveness of our novel loss functions.

In this work, our contributions are summarized as follows:

– A novel semi-supervised contrastive learning framework, called NNCLR-TS, is proposed for time series classification tasks in label-deficient scenarios. Also, we develop a new data augmentation algorithm based on decomposition that respects time series properties.
– By employing a memory buffer referred to as a support set, we store the representations and their corresponding labels/pseudo-labels. We utilize the support set for pseudo-labeling and for selecting NN representations of input data, which are used for computing the loss function.
– We propose a new contrastive loss that makes use of the NN representations, called IW-Xent loss, and a new intra-batch similarity loss, named IBS loss. Combined with the NT-Xent loss into a composite function, they facilitate the effective use of both labeled and unlabeled data.
– Our approach outperforms previous methods across a wide range of benchmarks for classification tasks. Furthermore, comprehensive ablations on each component confirm the effectiveness of our method.

2. Related work

2.1 Semi-supervised learning

Semi-supervised learning is a learning paradigm that leverages both labeled and unlabeled data for training. This approach aims to better understand the underlying structure of the data [24]. The labeled data offers foundational understanding based on explicit annotations, while unlabeled data helps to capture the broader underlying structure of the data. Combining these two types of data can not only enhance model performance but also mitigate the risk of overfitting [25, 26].

A frequently used approach in semi-supervised learning is consistency regularization [27, 28]. This method encourages consistent predictions across different perturbations of a given input for improved model generalization. Perturbation methods range from simple techniques like random max-pooling and dropout [29] to more complex strategies such as temporal ensembling [30] and virtual adversarial training [31]. Recently, Mixup augmentation [32] has been utilized in various studies to further extend the capabilities of this technique [33, 76, 77]. While consistency regularization is a flexible approach, its performance can be sensitive to the choice of domain-specific augmentations [34].

On the other hand, there have been research efforts focused on joint training of labeled and unlabeled data using pseudo-labeling. This approach involves initially training a model on labeled data and then using it to predict labels for unlabeled data. These predicted labels, known as pseudo-labels, are treated as if they were true labels in subsequent training iterations to leverage information from the unlabeled data [35]. Pseudo-labels serve as a regularizer during training, leading to improved performance on downstream tasks [36]. Some studies [37, 38] have adopted hard pseudo-labeling based on the class with the highest predicted probability.

While such approaches are straightforward and task-agnostic, they are susceptible to what is known as confirmation bias [36, 39]. This issue arises particularly in the early stages of training when the model reinforces potentially incorrect initial predictions through hard pseudo-labeling. Another notable issue is that these methods often prioritize the class with the highest predicted probability, neglecting other factors such as prediction uncertainty or confidence level. This could lead to less robust pseudo-labels, affecting the model’s overall performance if not managed carefully [40].

To alleviate these issues, several studies have explored alternative pseudo-labeling and training schemes based on information from nearby representations or clustering constraints [41, 42, 17]. These adaptations can lead to improvements in the reliability and accuracy of pseudo-labeling. Although these methods have demonstrated success in computer vision, to the best of our knowledge, there has been no similar exploration in time series analysis.

2.2 Contrastive learning

The goal of contrastive learning is to enable models to extract meaningful features and generate more discriminative representations. This is achieved by maximizing the similarity between representations for pairs of augmentations from the same data and minimizing the similarity between views from others [14]. It has demonstrated significant success in representation learning across diverse domains, including computer vision, natural language processing, and speech recognition [43, 44, 45].

However, its effectiveness is not as straightforward when applied to time series analysis. One of the significant challenges is the method’s limited capability to capture the temporal dependencies inherent in time series data [46]. Additionally, the non-stationary nature of time series data, where statistical properties shift over time, further complicates its adaptability [47].

In response to these challenges, specialized methodologies like [19] and [22] have been introduced. For example, [19] proposed a novel temporal contrasting module to foster robust representation learning. Additionally, [22] introduced a method that computes the contrastive loss in a hierarchical manner, incorporating temporal information through leveraging progressive max-pooling of representations. These approaches are designed to capture the distinctive temporal dynamics of time series data, demonstrating superior performance over conventional contrastive learning techniques in time series.

Although contrastive learning has been employed in time series analysis, techniques such as memory banks have not been extensively explored for time series data [48, 49, 50]. Memory banks have demonstrated improved performance in computer vision tasks, as seen in works such as [13]. That work proposed a variant of contrastive learning that utilizes a modified memory-bank architecture, referred to as the support set, in combination with an NN approach for sample pair selection. This method not only diversifies the sample pairs but also significantly enhances the learning process, achieving competitive results on ImageNet classification tasks.

Despite its success in computer vision, this method has not been extensively applied to contrastive learning for time series data. This suggests a promising opportunity to enhance performance in time series classification through the integration of memory bank techniques. Our research aims to investigate the potential benefits of applying memory banks to time series.

2.3 Data augmentation for time series

Data augmentation plays a pivotal role in contrastive learning, contributing to the model’s robustness by encouraging the learning of transformation invariance from input data [51]. Conventional random transformations widely adopted in image processing, such as flipping and jittering, have also been explored in the context of time series [20]. Additionally, transformation strategies tailored to the unique characteristics of time series data, including window warping and window slicing, have been proposed [52, 86].

Despite the introduction of various augmentation methods, there has not been a one-size-fits-all solution, especially given the unique and diverse nature of time series data [2, 53]. This inconsistency is presumably due to the diversity in the temporal characteristics of datasets, where recurring features like seasonality can be compromised during the augmentation process.

Thus, recognizing the need for preserving such crucial temporal structures, the development of novel augmentation strategies specifically for time series data has become imperative [20]. In response to this need, methods based on Dynamic Time Warping (DTW) have emerged as promising solutions, aimed at maintaining the integrity of temporal structures within the datasets [54, 55, 56].

However, these methods have significant computational complexity due to the need to align two time series [57, 58], making them impractical for online training scenarios or handling large datasets. Given these challenges, there is a need for exploration and development of augmentation methods that are computationally efficient and suitable for time series, to fully leverage the potential of contrastive learning in this domain.

3. Method

A set of $N$ univariate time series is given by $X = \{x_{1}, x_{2}, \dots, x_{N}\}$ , where each input time series $x_{i}$ has a length of $L$ , i.e., $x_{i} \in R^{L \times 1}$ for all $i$ . Our goal is to find an embedding function $f_{θ} (\cdot)$ that generates a discriminative representation $r_{i} = f_{θ} (x_{i})$ , tailored for the downstream classification task by capturing semantic features from the given data.

Figure 1.

An overview of NNCLR-TS architecture.

The representation $r_{i} = [r_{i 1}^{⊤}; r_{i 2}^{⊤}; \dots; r_{i L}^{⊤}] \in R^{L \times D}$ consists of representation vectors for each timestamp $t \in [1, \dots, L]$ with dimension $D$ , i.e., $r_{i t} \in R^{D}$ for all $i$ and $t$ . In this paper, our primary focus is on the classification task. The objective is to predict the class of the input data using the representations obtained from the trained embedding network, $r = f_{θ} (x)$ as the input to a classifier. Here, we use support vector machine (SVM) as a classifier for the classification task.

We illustrate the overall architecture of our proposed model, NNCLR-TS, in Figure 1, with each step from A to E corresponding to subsections of Section 3. In step A, we generate two views using our proposed augmentation method, named STLDDA. Step B describes the process of inputting these views into the encoder, composed of two identical networks. Each network is designed to capture either the temporal or spectral characteristics of the input data. In step C, the output representations are stored in a memory buffer known as the support set. Concurrently, pseudo-labels are generated using our proposed technique and are stored alongside the data. Step D outlines how the NN samples from the augmented data are identified. Finally, in step E, we explain how these samples are used in computing the loss function. The whole procedure of NNCLR-TS training is described in Algorithm 1. The corresponding subsections further detail each component for a comprehensive understanding.

Algorithm 1: NNCLR-TS Training Procedure
1: Input: Dataset $X = \{x_{1}, x_{2}, \dots, x_{N}\}$
2: Initialize the encoder $f_{θ} (\cdot)$ and the support set $M$
3: Pretraining: Train the encoder with the labeled data using $L_{NT-Xent}$ and update $M$ until each class has at least one representation in the support set
4: Training:
5: for each $x_{i}$ in $X$ do
6:     Compute representation: $r_{i} = f_{θ} (x_{i})$
7:     Obtain ${\tilde{x}}_{i}^{a}$ and ${\tilde{x}}_{i}^{b}$ from $x_{i}$ using STLDDA
8:     Compute representations: ${\tilde{r}}_{i}^{a} = f_{θ} ({\tilde{x}}_{i}^{a})$ , ${\tilde{r}}_{i}^{b} = f_{θ} ({\tilde{x}}_{i}^{b})$
9:     if $x_{i}$ is unlabeled or pseudo-labeled then
10:         Assign a pseudo-label $y_{i}$ to $x_{i}$ (Eq. (4))
11:     end if
12:     Update support set: $M \leftarrow M \cup \{(r_{i}, y_{i})\}$
13:     Obtain ${\tilde{r}}_{i}^{a +}$ and ${\tilde{r}}_{i}^{a -}$ (Eqs (5) and (6))
14:     Compute the loss $L$ (Eq. (10)) and update parameter $θ$
15: end for

3.1 Decomposition-based data augmentation

We propose a new augmentation method, termed “STL Decomposition-based Data Augmentation” (STLDDA). Seasonal and trend decomposition using LOESS (STL) [59] is a widely used classical decomposition method for time series. It decomposes the data $x_{i}$ into three additive components, trend $T_{i}$ , seasonal $S_{i}$ , and residual $R_{i}$ , i.e., $x_{i} = T_{i} + S_{i} + R_{i}$ , which facilitates the analysis of time series data. Our method augments only the trend component $T_{i}$ obtained from STL instead of applying transformations to the entire data $x_{i}$ . By preserving the seasonal component, this approach ensures that the periodic information inherent in the original data is not distorted, potentially enhancing the learning process.

Figure 2.

The overall process of STLDDA. The original data is decomposed into trend, seasonal, and residual components using STL. To respect the properties of time series data, augmentation is performed while preserving the periodical information, i.e., keeping the seasonal component unchanged. Specifically, we augment only the trend component where the augmented signal is sampled from the empirical cumulative distribution, which is obtained from Eq. (1), in the frequency domain.

The STLDDA process is depicted in Figure 2. We begin by applying the real fast Fourier transform (RFFT) to the trend component $T_{i}$ of all training data $x_{i}$ . This generates $ω_{i} = FFT (T_{i}) \in R^{M}$ for all $i$ , where $M$ is the number of predefined bins in the FFT function. Subsequently, we obtain the empirical cumulative distribution of magnitudes, ${\hat{F}}_{j} (u)$ , for each frequency bin $j$ :

\begin{aligned} {\hat{F}}_{j} (u) = \frac{1}{N} \sum_{i = 1}^{N} 1_{\{| ω_{i j} | \leq u\}}, \end{aligned}

(1)

where $1_{\{\cdot\}}$ is the indicator function, and $| ω_{i j} |$ corresponds to the magnitude for the $j$ -th frequency bin of the $i$ -th data. Based on these empirical cumulative distributions, for any given data $x$ to be augmented (subscript is omitted for simplicity), we sample the new amplitudes ${\tilde{ω}}_{j}$ for all frequency bins $j \in [1, 2, \dots, M]$ to augment the trend component:

\begin{aligned} \tilde{ω} = [{\tilde{ω}}_{1}; {\tilde{ω}}_{2}; \dots; {\tilde{ω}}_{M}] \in R^{M} . \end{aligned}

(2)

Note that the amplitude ${\tilde{ω}}_{j}$ can be easily sampled from ${\hat{F}}_{j} (u)$ using the inverse transform sampling [60]. Afterward, we transform this sampled component $\tilde{ω}$ back to the time domain and replace the original trend component $T$ with the augmented trend component as follows:

\begin{aligned} \tilde{x} = \tilde{T} + S + R, \quad \text{where } \tilde{T} = \mathrm{IFFT} (\tilde{ω}) . \end{aligned}

(3)

Here, IFFT denotes the inverse fast Fourier transform, and $\tilde{x}$ denotes the augmented data of $x$ . The whole process is described in Algorithm 2.
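To make the sampling step concrete, the following minimal sketch evaluates the empirical distribution of Eq. (1) for a single frequency bin and draws new amplitudes from it by inverse transform sampling. The function names `ecdf` and `inverse_transform_sample` and the toy magnitudes are illustrative assumptions, not part of the original method description.

```python
import numpy as np

def ecdf(samples, u):
    """Empirical CDF of Eq. (1): fraction of observed magnitudes that are <= u."""
    samples = np.asarray(samples, dtype=float)
    return np.mean(samples <= u)

def inverse_transform_sample(samples, rng, size=1):
    """Draw from the empirical distribution through its generalized inverse.

    For v ~ Uniform(0, 1), the generalized inverse returns the smallest
    observed value u with F_hat(u) >= v, i.e. the ceil(v * N)-th order statistic.
    """
    ordered = np.sort(np.asarray(samples, dtype=float))
    n = ordered.size
    v = rng.uniform(0.0, 1.0, size=size)
    idx = np.clip(np.ceil(v * n).astype(int) - 1, 0, n - 1)
    return ordered[idx]

# Toy magnitudes |omega_ij| observed for one frequency bin j (hypothetical values).
magnitudes = np.array([0.5, 2.0, 1.0, 4.0])
rng = np.random.default_rng(0)
draws = inverse_transform_sample(magnitudes, rng, size=1000)
```

Because the distribution is empirical, every sampled amplitude is one of the magnitudes already observed for that bin in the training data.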

Algorithm 2: STL Decomposition-based Data Augmentation (STLDDA)
1: Input: Dataset $X = \{x_{1}, x_{2}, \dots, x_{N}\}$ , Number of frequency bins $M$
2: Output: Augmented time series $\tilde{X} = \{{\tilde{x}}_{1}, {\tilde{x}}_{2}, \dots, {\tilde{x}}_{N}\}$
3: Apply STL decomposition to $X$ to obtain $\{T_{i}, S_{i}, R_{i}\}$ for all $i \in \{1, 2, \dots, N\}$
4: for each trend component $T_{i}$ in the decomposed data do
5:     Apply real fast Fourier transform to $T_{i}$ to get $ω_{i} = FFT (T_{i}) \in R^{M}$
6: end for
7: Let $j \in \{1, 2, \dots, M\}$ denote the index of frequency bins
8: Compute empirical distribution of magnitudes ${\hat{F}}_{j} (u)$ for each frequency bin $j$
9: for each $x_{i}$ to be augmented do
10:     for each frequency bin $j$ do
11:         Sample new amplitude ${\tilde{ω}}_{j}$ from ${\hat{F}}_{j} (u)$ using inverse transform sampling
12:     end for
13:     Form the sampled component $\tilde{ω} = [{\tilde{ω}}_{1}; {\tilde{ω}}_{2}; \dots; {\tilde{ω}}_{M}]$
14:     Transform $\tilde{ω}$ using IFFT to obtain ${\tilde{T}}_{i} = I F F T (\tilde{ω})$
15:     Create augmented data ${\tilde{x}}_{i} = {\tilde{T}}_{i} + S_{i} + R_{i}$
16: end for
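The steps of Algorithm 2 can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the reference implementation: the trend, seasonal, and residual components are taken as given (in practice they could come from any STL implementation, for instance the `STL` class in statsmodels), the toy components below are synthetic, and the phase of the original trend spectrum is kept while only the magnitudes are resampled, which is a choice the text does not specify. The names `fit_sorted_magnitudes` and `stldda_augment` are hypothetical.

```python
import numpy as np

def fit_sorted_magnitudes(trends):
    """Sort RFFT magnitudes per frequency bin; column j encodes F_hat_j of Eq. (1)."""
    spectra = np.fft.rfft(np.asarray(trends, dtype=float), axis=1)  # shape (N, M)
    return np.sort(np.abs(spectra), axis=0)

def stldda_augment(trend, seasonal, residual, sorted_mags, rng):
    """Return one augmented view: resampled trend + unchanged seasonal + residual."""
    n, m = sorted_mags.shape
    # Inverse transform sampling, independently for each frequency bin.
    v = rng.uniform(size=m)
    rows = np.clip(np.ceil(v * n).astype(int) - 1, 0, n - 1)
    new_mag = sorted_mags[rows, np.arange(m)]
    # ASSUMPTION: reuse the phase of the original trend spectrum.
    phase = np.angle(np.fft.rfft(trend))
    new_trend = np.fft.irfft(new_mag * np.exp(1j * phase), n=len(trend))
    return new_trend + seasonal + residual

# Synthetic stand-ins for STL components (hypothetical data).
rng = np.random.default_rng(0)
N, L = 8, 64
t = np.arange(L)
trends = rng.uniform(0.5, 2.0, size=(N, 1)) * t / L
seasonal = np.sin(2 * np.pi * t / 8)
residual = 0.05 * rng.standard_normal(L)

sorted_mags = fit_sorted_magnitudes(trends)
view_a = stldda_augment(trends[0], seasonal, residual, sorted_mags, rng)
view_b = stldda_augment(trends[0], seasonal, residual, sorted_mags, rng)
```

Calling the augmentation twice on the same series yields two different views, since each call draws fresh amplitudes while leaving the seasonal and residual components untouched.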

Contrastive learning necessitates the generation of two different augmented time series data for the computation of its loss function [14]. Following this standard, our proposed loss function also requires the creation of two unique augmented sets. To fulfill this requirement, we execute the STLDDA process twice on the original data, thus producing two distinct augmented time series data.

In summary, our augmentation method offers several distinct advantages. By sampling in the frequency domain, the method preserves essential spectral information. Furthermore, it does not depend on label information, which broadens its applicability across a wide range of contexts. Additionally, our method constructs empirical distributions based on the entire dataset, thus ensuring the augmented data follows the probability distribution of the trend component. Consequently, this strategy mitigates the creation of outlier instances, further enhancing its utility.

3.2 Representation encoder

In order to extract richer information from a given input data, $x_{i}$ , we incorporate both its temporal and spectral components. To accomplish this, our encoder model consists of two separate but structurally identical networks: the temporal and spectral networks, denoted as $f_{θ_{1}}^{temp} (\cdot)$ and $f_{θ_{2}}^{spec} (\cdot)$ , respectively.

These two networks operate independently, sharing no learned parameters, yet follow the same architecture derived from the TS2Vec encoder [22]. This architecture is composed of a linear input projection layer followed by a dilated convolutional neural network (CNN) module. Each network is specialized to process its respective component (either temporal or spectral) of the input data.

The roles of these networks are to extract embedding representations from the original data, $x_{i}$ , viewed from a temporal perspective, and the spectral signal, $FFT (x_{i})$ , respectively. The final embedding of our encoder is obtained by concatenating these distinct temporal and spectral representations. Each of these representations has the shape $(L, D / 2)$ , and they are concatenated along the feature dimension. Thus, the final embedding output is defined as follows: $r_{i} = f_{θ} (x_{i}) : = [f_{θ_{1}}^{temp} (x_{i}), f_{θ_{2}}^{spec} (FFT (x_{i}))] \in R^{L \times D} .$
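The shape bookkeeping of this two-branch design can be illustrated with a toy example. The branches below are hypothetical per-timestamp affine maps standing in for the actual dilated CNN networks, and feeding the magnitude of the length-$L$ FFT to the spectral branch is an assumption made only so that both branches return an $(L, D / 2)$ array.

```python
import numpy as np

rng = np.random.default_rng(0)
L, D = 32, 8

def make_toy_network(out_dim, rng):
    """Hypothetical stand-in for one branch: per-timestamp affine map + tanh."""
    w = rng.standard_normal((1, out_dim))
    b = rng.standard_normal(out_dim)
    return lambda signal: np.tanh(signal[:, None] @ w + b)  # (L,) -> (L, out_dim)

f_temp = make_toy_network(D // 2, rng)  # parameters theta_1
f_spec = make_toy_network(D // 2, rng)  # parameters theta_2, not shared

def encode(x):
    # ASSUMPTION: the spectral branch receives the magnitude of the length-L FFT.
    spectrum = np.abs(np.fft.fft(x))
    # Concatenate the two (L, D/2) outputs along the feature axis.
    return np.concatenate([f_temp(x), f_spec(spectrum)], axis=1)

x = np.sin(np.linspace(0, 4 * np.pi, L))
r = encode(x)  # shape (L, D)
```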

3.3 Support set & pseudo-labeling

The goal of training the proposed encoder is to cluster data from the same class closely in the latent space. To achieve this, we propose a novel pseudo-labeling method that assigns labels to unlabeled data based on majority voting of nearby samples within that space. This enables effective training with a limited amount of labeled data, guided by loss functions that will be detailed later.

In the pursuit of reliable pseudo-labeling, it is important to have a sufficient number of representations. To achieve this, inspired by [13], we employ a queue-type memory buffer known as a support set, denoted by $M$ , to store these representations. At each iteration, given the data $x_{i}$ , the obtained representation $r_{i} = f_{θ} (x_{i})$ and the corresponding label $y_{i}$ are saved to the support set as a tuple $(r_{i}, y_{i})$ . If $x_{i}$ is unlabeled, a pseudo-label is assigned to it and stored along with its representation. This pseudo-labeling process is depicted in Figure 3.

To facilitate this process at the beginning, we obtain representations for labeled data and pretrain the encoder. In this stage, the representations of all labeled instances are stored in the support set. Subsequently, the training process begins using the complete dataset, which includes both labeled and unlabeled data. If an incoming sample is labeled data, the true label is simply stored with the newly obtained representation. However, if the given sample $x_{i}$ is unlabeled or its corresponding label $y_{i}$ is a pseudo-label, a new label is assigned to this sample according to the following rule:

\begin{aligned} y_{i} = \underset{c \in C}{\arg\max} \sum_{k \in N_{K} (r_{i}, M)} 1_{\{y_{k} = c\}}, \end{aligned}

(4)

where $N_{K} (r_{i}, M)$ represents the set of data indices of the $K$ -nearest samples from $r_{i}$ within the support set $M$ , and $C$ indicates the set of class labels. In simple terms, the label is assigned based on the majority vote of labels/pseudo-labels of the $K$ -nearest samples in the latent representation space. It is important to note that the pseudo-label is reassigned every iteration when the data is initially unlabeled, providing an opportunity for improved pseudo-label assignments as the learning progresses.
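The voting rule of Eq. (4) can be sketched as below. The function name `knn_pseudo_label` and the toy support set are illustrative; the sketch also assumes that each representation has already been reduced to a single vector (for example by pooling over timestamps), which the text does not specify.

```python
import numpy as np

def knn_pseudo_label(r, support_reprs, support_labels, k):
    """Eq. (4): majority vote among the K nearest support-set entries."""
    # ASSUMPTION: representations are fixed-length vectors compared by L2 distance.
    dists = np.linalg.norm(support_reprs - r, axis=1)
    nearest = np.argsort(dists)[:k]
    classes, votes = np.unique(support_labels[nearest], return_counts=True)
    return classes[np.argmax(votes)]  # ties resolve to the smallest class id

# Toy support set with two well-separated classes (hypothetical values).
support_reprs = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, -0.1],
                          [5.0, 5.0], [5.2, 4.9], [4.9, 5.1]])
support_labels = np.array([0, 0, 0, 1, 1, 1])

label_near_1 = knn_pseudo_label(np.array([4.8, 5.0]), support_reprs, support_labels, k=3)
label_near_0 = knn_pseudo_label(np.array([0.1, 0.0]), support_reprs, support_labels, k=3)
```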

Figure 3.

An illustrative figure of our pseudo-labeling process. If the data is identified as unlabeled or pseudo-labeled, a pseudo-label is newly assigned based on the majority vote of $K$ -NN samples. The input data is pseudo-labeled as class 3 in the above example.

Remarks. Here, we provide additional details on how a newly computed representation is stored along with its label or pseudo-label, as well as on the size of the support set $M$ . When the representation is stored with its label, the support set is updated as $M \leftarrow M \cup \{(r_{i}, y_{i})\}$ , where, if an entry for the same data index already exists in the support set, the old representation is replaced by the newly computed one. Regarding the size of $M$ , since the computed representations are saved to the memory at each iteration, $M$ could grow as large as the training data. However, we empirically demonstrate that even with a smaller memory capacity, such as 20% of the training data, comparable performance can still be achieved, as shown in Section 4. In this case, the support set acts as a first-in-first-out queue: once the capacity is reached, entries are discarded in order of entry.
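A minimal sketch of such a support set is given below. The class name `SupportSet` and its methods are hypothetical, and one detail is an assumption: when an existing index is refreshed, the entry is treated as the newest one for eviction purposes, which the remarks above leave open.

```python
from collections import OrderedDict
import numpy as np

class SupportSet:
    """FIFO memory of (representation, label) pairs keyed by data index."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def update(self, index, representation, label):
        # Replace a stale entry for the same data index instead of duplicating it.
        # ASSUMPTION: a refreshed entry counts as the newest one.
        self._items.pop(index, None)
        self._items[index] = (np.asarray(representation, dtype=float), label)
        # Discard entries in order of entry once the capacity is exceeded.
        while len(self._items) > self.capacity:
            self._items.popitem(last=False)

    def indices(self):
        return list(self._items)

    def arrays(self):
        reprs = np.stack([r for r, _ in self._items.values()])
        labels = np.array([y for _, y in self._items.values()])
        return reprs, labels

memory = SupportSet(capacity=3)
for idx in (0, 1, 2):
    memory.update(idx, [float(idx), 0.0], label=idx % 2)
memory.update(0, [0.5, 0.5], label=0)   # refresh index 0
memory.update(3, [3.0, 0.0], label=1)   # exceeds capacity, evicts index 1
reprs, labels = memory.arrays()
```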

3.4 Nearest neighbor

Given an input data $x_{i}$ , the proposed model utilizes STLDDA to obtain two distinct views of the data, namely, ${\tilde{x}}_{i}^{a}$ and ${\tilde{x}}_{i}^{b}$ . From these two correlated views of data, the representations are generated from the encoder, i.e., ${\tilde{r}}_{i}^{a} = f_{θ} ({\tilde{x}}_{i}^{a})$ and ${\tilde{r}}_{i}^{b} = f_{θ} ({\tilde{x}}_{i}^{b})$ . Drawing inspiration from the previous work [61] that enhances the performance of contrastive representation learning using label information, we aim to utilize label information for further improvement.

To achieve this, the NN operation is performed on ${\tilde{r}}_{i}^{a}$ to obtain the nearest representation with the same label as the input data, which is designated as the NN-positive, while the nearest representation with a different label serves as the NN-negative. The specific definitions of the operation are as follows:

\begin{aligned} ({\tilde{r}}_{i}^{a +}, y_{i}) : = \underset{(r_{k}, y_{k}) \in M : y_{k} = y_{i}}{\arg\min} \lVert {\tilde{r}}_{i}^{a} - r_{k} \rVert_{2}, \end{aligned}

(5)

\begin{aligned} ({\tilde{r}}_{i}^{a -}, y_{i}^{a -}) : = \underset{(r_{k}, y_{k}) \in M : y_{k} \neq y_{i}}{\arg\min} \lVert {\tilde{r}}_{i}^{a} - r_{k} \rVert_{2} . \end{aligned}

(6)

${\tilde{r}}_{i}^{a +}$ and ${\tilde{r}}_{i}^{a -}$ represent the positive and negative NNs, respectively, and $y_{i}$ and $y_{i}^{a -}$ denote their corresponding (pseudo) labels. While the NN-positive ${\tilde{r}}_{i}^{a +}$ must match the (pseudo) label of the input data, the specific (pseudo) label of the NN-negative ${\tilde{r}}_{i}^{a -}$ remains undetermined until the NN operation is performed. Our approach applies the NN operation only to ${\tilde{r}}_{i}^{a}$ to explore the relationship between the original data and its closest representations in the support set. This enables the model to better understand the alignment of augmented instances with their nearest peers, improving its ability to distinguish and classify time series data. Although SimCLR uses a pair of two correlated views ( ${\tilde{r}}_{i}^{a}$ , ${\tilde{r}}_{i}^{b}$ ) for training, we construct a triplet ( ${\tilde{r}}_{i}^{a +}$ , ${\tilde{r}}_{i}^{a -}$ , ${\tilde{r}}_{i}^{b}$ ) using the NN-positive and NN-negative.
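The NN operations of Eqs. (5) and (6) amount to two masked nearest-neighbor searches over the support set. The sketch below illustrates this with a hypothetical function `nn_positive_negative` and toy data, again assuming vector-valued representations.

```python
import numpy as np

def nn_positive_negative(r_a, label, support_reprs, support_labels):
    """Return the NN-positive, the NN-negative, and the NN-negative's label."""
    dists = np.linalg.norm(support_reprs - r_a, axis=1)
    same = support_labels == label
    if not same.any() or same.all():
        raise ValueError("support set needs both same-label and other-label entries")
    pos_idx = np.argmin(np.where(same, dists, np.inf))   # Eq. (5)
    neg_idx = np.argmin(np.where(~same, dists, np.inf))  # Eq. (6)
    return support_reprs[pos_idx], support_reprs[neg_idx], support_labels[neg_idx]

# Toy support set (hypothetical values).
support_reprs = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [4.0, 0.0]])
support_labels = np.array([0, 0, 1, 2])

r_pos, r_neg, y_neg = nn_positive_negative(
    np.array([1.2, 0.0]), 0, support_reprs, support_labels
)
```

As in the text, the label of the NN-negative is only known after the search: here it is whichever non-matching class happens to lie closest to the query.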

3.5 Contrastive & similarity loss

The normalized temperature-scaled cross-entropy (NT-Xent) loss, which is a widely utilized loss function in contrastive representation learning, is adopted in this study [14]. The NT-Xent loss is defined as follows:

\begin{aligned} L_{NT-Xent} = - \frac{1}{N_{B}} \sum_{i = 1}^{N_{B}} \log \frac{\exp ({\tilde{r}}_{i}^{a +} \cdot {\tilde{r}}_{i}^{b} / τ)}{\sum_{k = 1}^{N_{B}} \exp ({\tilde{r}}_{i}^{a +} \cdot {\tilde{r}}_{k}^{b} / τ)} . \end{aligned}

(7)

Here, $N_{B}$ represents the batch size processed in each iteration. The parameter $τ$ represents the temperature term that adjusts the contrast strength between positive and negative pairs, directly affecting the model’s sensitivity to different similarities. It is important to note that during the pretraining stage, where only labeled samples are utilized, the encoder is exclusively trained with NT-Xent loss. Subsequently, to more effectively capture the relationships between instances in the latent space and enhance the performance of the trained model, two additional losses are developed. Finally, the triplet loss is formed by combining these new losses with the NT-Xent.
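As a rough illustration of Eq. (7), the NT-Xent loss can be computed as a row-wise softmax cross entropy over the similarity matrix between the NN-positive anchors and the second views. The sketch assumes vector-valued representations that are L2-normalized inside the function; the function name `nt_xent` and the default temperature are illustrative choices.

```python
import numpy as np

def nt_xent(anchors, views_b, tau=0.1):
    """Eq. (7) with anchors[i] = r_i^{a+} and views_b[i] = r_i^{b}, both (N_B, D)."""
    # ASSUMPTION: vectors are L2-normalized, so the dot product is a cosine similarity.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    b = views_b / np.linalg.norm(views_b, axis=1, keepdims=True)
    logits = a @ b.T / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))  # positives sit on the diagonal

aligned = nt_xent(np.eye(4), np.eye(4))
shuffled = nt_xent(np.eye(4), np.roll(np.eye(4), 1, axis=0))
uniform = nt_xent(np.ones((4, 3)), np.ones((4, 3)))
```

When every anchor matches its own second view and is orthogonal to the others, the loss is close to zero; when all similarities are equal, it reduces to $\log N_{B}$ .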

First, the instance-wise cross entropy (IW-Xent) loss encourages the proximity between the input representation and its NN-positive in the latent space, while simultaneously pushing the input representation away from its NN-negative. The IW-Xent loss is defined as:

\begin{aligned} L_{IW-Xent} = - \frac{1}{N_{B}} \sum_{i = 1}^{N_{B}} \log \frac{\exp ({\tilde{r}}_{i}^{a +} \cdot {\tilde{r}}_{i}^{b} / τ)}{\exp ({\tilde{r}}_{i}^{a +} \cdot {\tilde{r}}_{i}^{b} / τ) + \exp ({\tilde{r}}_{i}^{a -} \cdot {\tilde{r}}_{i}^{b} / τ)} . \end{aligned}

(8)

Notably, the IW-Xent loss is effective in promoting similarity between positive pairs and dissimilarity between negative pairs in the support set.
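Eq. (8) is a two-way softmax over the NN-positive and the NN-negative, which simplifies to a softplus of the logit difference. The sketch below illustrates this in plain Python under the assumption that all inputs are already L2-normalized; `softplus` and `iw_xent_loss` are hypothetical helper names.

```python
import math


def dot(u, v):
    """Dot product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def iw_xent_loss(nn_pos, nn_neg, view_b, tau=0.07):
    """Eq. (8): -log(e^p / (e^p + e^n)) = softplus(n - p), averaged over the batch,
    where p and n are the scaled similarities of view b to the NN-positive
    and to the NN-negative, respectively."""
    losses = []
    for pos, neg, b in zip(nn_pos, nn_neg, view_b):
        pos_logit = dot(pos, b) / tau
        neg_logit = dot(neg, b) / tau
        losses.append(softplus(neg_logit - pos_logit))
    return sum(losses) / len(losses)
```

When the NN-positive and NN-negative are equally similar to the second view, the loss equals log 2; it falls toward zero as the positive becomes more similar than the negative.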

The last one is the intra-batch similarity (IBS) loss, which is derived from the disparity in similarities between instances within each mini-batch. The objective of this loss is to increase the similarity of each instance to the least similar representation of the same (pseudo) label, while simultaneously reducing the similarity to the most similar instance with a different (pseudo) label.

This strategy establishes a local structure within the latent space, ensuring that instances of the same class are densely clustered while maintaining distinct separation from instances of different classes. The IBS loss is as follows:

\begin{aligned} L_{IBS} = \frac{1}{N_{B}} \sum_{i = 1}^{N_{B}} \left[ \min_{m \in B_{neg}^{i}} Sim (r_{i}, r_{m}) - \min_{j \in B_{pos}^{i}} Sim (r_{i}, r_{j}) \right], \end{aligned}

(9)

where $Sim (\cdot)$ indicates the similarity measure, and $B_{pos}^{i}$ and $B_{neg}^{i}$ are the sets of indices for samples within the batch that belong to the same and different classes as the $i$ -th sample, respectively. Note that $i$ is excluded in $B_{pos}^{i}$ , and we utilize the Euclidean distance as the similarity measure.
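As a small illustration of the index sets used in Eq. (9), the sketch below builds $B_{pos}^{i}$ and $B_{neg}^{i}$ from the (pseudo) labels of a mini-batch. The function name `batch_index_sets` is hypothetical, and the loss itself is intentionally not implemented here.

```python
def batch_index_sets(labels):
    """For each sample i, return the in-batch indices with the same label
    (excluding i itself) and the indices with a different label."""
    b_pos, b_neg = [], []
    for i, y_i in enumerate(labels):
        b_pos.append([j for j, y_j in enumerate(labels) if j != i and y_j == y_i])
        b_neg.append([m for m, y_m in enumerate(labels) if y_m != y_i])
    return b_pos, b_neg
```

In a small batch, either set can be empty for some samples (for example, when a class appears only once); how that case is handled is not specified in this section.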

Finally, the total triplet loss can be defined as follows:

\begin{aligned} L = L_{NT-Xent} + λ_{1} L_{IW-Xent} + λ_{2} L_{IBS} . \end{aligned}

(10)

By combining these three losses, the model aims to learn a more discriminative and structured latent space, which is expected to improve the performance in classification tasks. The algorithmic description of our overall learning procedure is given in Algorithm 1.

4. Experimental settings

4.1 Datasets

To assess the performance of our proposed model, we conduct experiments on several publicly available datasets. Detailed characteristics of each dataset, including the number of instances, time series length, and number of classes, are provided in Table 1.

4.1.1 Epilepsy seizure prediction

The Epileptic Seizure Recognition (Epilepsy) dataset [62] consists of electroencephalogram (EEG) recordings from 500 patients. Each recording has a duration of 23.6 seconds and contains 4,097 data points. Each recording is divided into 23 segments of 178 points, and the resulting segments are randomly shuffled. Originally, the dataset was categorized into five classes. However, since only one class corresponds to epileptic seizures, the remaining four classes are merged into a single class, transforming the task into a binary classification problem.

4.1.2 UCR classification archive

The UCR Time Series Classification Archive (UCR archive) [63] is a comprehensive repository of benchmark datasets for univariate time series classification. The UCR archive encompasses a diverse range of datasets from various application areas, such as images, medical data, and other engineering-related fields. For performance evaluation, we select the 13 datasets from the archive that have at least two training data points for every class when only 1% of the training data is labeled. The selected datasets are as follows: Crop, DistalPhalanxOutlineCorrect (DPOC), ElectricDevices, FordA, FordB, HandOutlines, MiddlePhalanxOutlineCorrect (MPOC), PhalangesOutlinesCorrect (POC), ProximalPhalanxOutlineCorrect (PPOC), StarLightCurves, Strawberry, TwoPatterns, and Wafer.

Table 1
A brief description of selected datasets for evaluation. Since the Epilepsy dataset does not have predefined train and test sets, we split the data in an 8:2 ratio. Although the Epilepsy dataset originally contains five classes, we merge four classes into a single class as only one class represents seizures.

Dataset	Train	Test	Length	Class
Epilepsy	9,200	2,300	178	2
Crop	7,200	16,800	46	24
DPOC	600	276	80	2
ElectricDevices	8,296	7,710	96	7
FordA	3,601	1,320	500	2
FordB	3,636	810	500	2
HandOutlines	1,000	370	2,709	2
MPOC	600	291	80	2
POC	1,800	858	80	2
PPOC	600	291	80	2
StarLightCurves	1,000	8,236	1,024	3
Strawberry	613	370	235	2
TwoPatterns	1,000	4,000	128	4
Wafer	1,000	6,164	152	2

4.2 Implementation details

For the Epilepsy dataset, we divide the data into 60% for training, 20% for validation, and 20% for testing. As for the 13 datasets selected from the UCR archive, the test sets are already separated, but no validation sets are provided. Therefore, we partition the training data into 75% for training and 25% for validation. To ensure the consistency of the experiments, we conduct the experiments five times using different seeds and record the average and standard deviation of the test accuracy accordingly.
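The evaluation protocol for the UCR datasets can be sketched as follows. This is only an illustration: the text does not state whether the 75/25 split is stratified or whether the reported standard deviation is the sample or population version, so the sketch assumes a simple random split and the sample standard deviation. The names `split_train_val` and `summarize` are hypothetical.

```python
import random
import statistics


def split_train_val(n_samples, val_fraction=0.25, seed=0):
    """Randomly split indices 0..n_samples-1 into training and validation parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = round(n_samples * val_fraction)
    return indices[n_val:], indices[:n_val]


def summarize(accuracies):
    """Mean and sample standard deviation of test accuracy over repeated runs."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)
```

For a UCR dataset with 600 training series, such as DPOC, this yields 450 training and 150 validation series per seed.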

The experimental settings of the proposed model follow Ts2Vec [22]. The batch size is set to 8, and we adopt the SWA optimizer [64] with a learning rate of 1e-3. The number of optimization iterations is set to 200 for datasets with a size less than 100,000 and 600 otherwise. The representation dimension is set to 80, and the dropout is set to 0.1. As for the losses, $τ$ is set to 0.07. Additionally, $λ_{1}$ and $λ_{2}$ are set to 0.2 and 2, respectively. For the other baseline models, we follow the best settings as described in their respective papers to ensure a fair comparison with our proposed method. We performed all experiments on an NVIDIA GeForce GTX TITAN X GPU using PyTorch 1.10.

4.3 Baseline models

In our experiments, we include several baseline models for comparative evaluation. We compare the performance of our proposed method with the following baseline models:

– Supervised [19]: Supervised learning where the encoder and the projection layer are adopted from CA-TCC [15].
– Ts2Vec [22]: A self-supervised contrastive learning framework that learns contextual representation for arbitrary sub-series at various semantic levels.
– MixupCLR [65]: A self-supervised contrastive learning framework that predicts the amount of mixing between data points using Mixup [66] augmentation.
– SemiTime [67]: A semi-supervised learning framework that jointly trains the supervised classification of labeled data along with the self-supervised temporal relation prediction of segment pairs.
– CA-TCC [15]: A semi-supervised model that learns representation from the data points with a cross-view prediction task and supervised contextual contrast.

By comparing our proposed method with these models, we provide a comprehensive evaluation of the effectiveness of our approach in the context of semi-supervised learning for time series classification. In the case of self-supervised baselines like Ts2Vec and MixupCLR, we adopt the training procedure from CA-TCC, which involves self-supervised training of the encoder. Afterward, the encoder is frozen and used in conjunction with a linear layer for supervised training.

The supervised model serves as a baseline, representing a lower bound on performance when training solely on labeled data. We also include models based on self-supervised approaches, such as Ts2Vec and MixupCLR, to explore whether extending self-supervised models to the semi-supervised setting could provide a more efficient alternative to traditional semi-supervised methodologies.

While certain baseline models, such as CA-TCC, reported experimental results on subsets of our selected datasets, they merged the provided train and test files into new splits. In alignment with the UCR Archive’s recommendation to exclusively utilize the predefined test data for testing, we conducted our experiments and re-evaluated other baselines in accordance with this guideline.

5. Results

5.1 Performance comparison with baseline models

To measure the performance of our experimental models, we utilized the average accuracy on test data as our evaluation metric. The results for the selected datasets, with labeled data ratios set at 1% and 5%, are presented in Table 2. Across all 14 datasets, our proposed model, NNCLR-TS, achieved the highest average test accuracy of 71.0% and 78.2% when the proportion of labeled data is 1% and 5%, respectively. These results demonstrate that the proposed method outperformed the second-best model by a margin of 3.8 percentage points (pp) and 3.1 pp, respectively.

When 1% of labeled data is used for training, NNCLR-TS yielded the highest performance in 7 out of 14 datasets and the second-highest in one, demonstrating its superior capability to learn representations more efficiently than other models. Furthermore, in comparison with Ts2Vec, which employs the same backbone network as NNCLR-TS, NNCLR-TS exhibits an average performance enhancement of 5.1 pp. This suggests that the newly proposed elements, including STLDDA, the support set, and the proposed losses, contribute to further performance improvements. With 5% of labeled data, NNCLR-TS achieved the highest performance in 6 datasets and ranked second on another 6 datasets. This reinforces that our proposed model consistently demonstrates robust performance across a variety of datasets.
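The margins quoted above can be reproduced from the "Average" rows of Table 2. The short sketch below copies those averages by hand; the helper name `margin_over` is illustrative.

```python
# Average test accuracy (%) per model, copied from the "Average" rows of Table 2.
averages = {
    "1%": {"Supervised": 60.5, "Ts2Vec": 65.9, "MixupCLR": 65.1,
           "SemiTime": 67.2, "CA-TCC": 63.7, "NNCLR-TS": 71.0},
    "5%": {"Supervised": 66.4, "Ts2Vec": 73.5, "MixupCLR": 70.0,
           "SemiTime": 75.1, "CA-TCC": 72.1, "NNCLR-TS": 78.2},
}


def margin_over(row, model="NNCLR-TS"):
    """Return the strongest competing model and the margin in percentage points."""
    others = {name: acc for name, acc in row.items() if name != model}
    runner_up = max(others, key=others.get)
    return runner_up, round(row[model] - others[runner_up], 1)


margins = {ratio: margin_over(row) for ratio, row in averages.items()}
gain_over_ts2vec = round(averages["1%"]["NNCLR-TS"] - averages["1%"]["Ts2Vec"], 1)
```

In both settings the strongest competitor on average is SemiTime, with margins of 3.8 pp and 3.1 pp, and the gain over Ts2Vec at 1% is 5.1 pp.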

5.2 Analyzing the effect of the ratio of labeled data

We evaluated the performance of NNCLR-TS compared to other baseline models with varying labeled data ratios: 1%, 5%, 10%, 50%, and 80%. Figure 4 presents the test accuracy of SemiTime, CA-TCC, and NNCLR-TS on Crop, TwoPatterns, and Wafer.

Table 2
Comparative performance of self-supervised and semi-supervised baselines with 1% and 5% labeled training data; Best results per row are in bold and the second-bests are underlined.

Datasets	Supervised	Ts2Vec	MixupCLR	SemiTime	CA-TCC	NNCLR-TS
Average test accuracy (%) on 1% of labeled data
Epilepsy	92.4 $\pm$ 0.5	96.3 $\pm$ 0.5	93.8 $\pm$ 1.5	93.3 $\pm$ 3.0	96.3 $\pm$ 0.4	96.9 $\pm$ 0.2
Crop	40.4 $\pm$ 1.6	30.4 $\pm$ 15.4	39.5 $\pm$ 4.0	35.8 $\pm$ 3.0	37.0 $\pm$ 0.9	38.7 $\pm$ 1.9
DPOC	53.7 $\pm$ 10.2	45.1 $\pm$ 3.5	59.6 $\pm$ 1.5	63.0 $\pm$ 4.8	53.7 $\pm$ 5.5	47.2 $\pm$ 8.0
ElectricDevices	50.8 $\pm$ 1.3	48.3 $\pm$ 15.8	50.0 $\pm$ 2.3	52.0 $\pm$ 5.2	53.7 $\pm$ 2.7	57.7 $\pm$ 4.6
FordA	53.6 $\pm$ 2.2	73.7 $\pm$ 14.3	76.9 $\pm$ 9.7	88.3 $\pm$ 0.8	83.8 $\pm$ 1.7	90.8 $\pm$ 1.2
FordB	53.5 $\pm$ 0.8	68.5 $\pm$ 2.0	61.7 $\pm$ 7.2	67.7 $\pm$ 1.1	64.8 $\pm$ 1.7	76.2 $\pm$ 5.4
HandOutlines	77.5 $\pm$ 4.7	61.5 $\pm$ 14.5	58.8 $\pm$ 12.8	66.3 $\pm$ 2.8	79.7 $\pm$ 2.2	82.6 $\pm$ 8.2
MPOC	57.5 $\pm$ 0.3	57.3 $\pm$ 0.5	56.7 $\pm$ 1.0	53.6 $\pm$ 9.3	57.7 $\pm$ 0.3	55.2 $\pm$ 5.1
POC	61.8 $\pm$ 0.2	60.5 $\pm$ 3.5	61.3 $\pm$ 0.1	61.1 $\pm$ 2.0	59.1 $\pm$ 2.4	58.2 $\pm$ 6.4
PPOC	68.0 $\pm$ 5.0	71.8 $\pm$ 6.4	68.2 $\pm$ 1.2	64.2 $\pm$ 7.1	68.3 $\pm$ 4.4	59.5 $\pm$ 16.2
StarLightCurves	78.0 $\pm$ 1.1	83.8 $\pm$ 3.7	78.7 $\pm$ 4.9	87.3 $\pm$ 4.7	78.9 $\pm$ 0.7	76.6 $\pm$ 7.1
Strawberry	38.3 $\pm$ 5.1	60.1 $\pm$ 5.3	61.0 $\pm$ 7.9	70.9 $\pm$ 6.9	36.9 $\pm$ 1.7	67.8 $\pm$ 5.7
TwoPatterns	26.1 $\pm$ 0.4	76.0 $\pm$ 4.0	55.3 $\pm$ 5.0	48.1 $\pm$ 6.8	28.9 $\pm$ 0.5	91.0 $\pm$ 10.6
Wafer	95.1 $\pm$ 0.2	89.5 $\pm$ 0.4	89.3 $\pm$ 0.2	89.3 $\pm$ 0.1	93.6 $\pm$ 0.7	95.7 $\pm$ 1.1
Average	60.5 $\pm$ 2.4	65.9 $\pm$ 6.4	65.1 $\pm$ 4.2	67.2 $\pm$ 4.1	63.7 $\pm$ 1.9	71.0 $\pm$ 5.9
Average test accuracy (%) on 5% of labeled data
Epilepsy	95.0 $\pm$ 0.4	97.4 $\pm$ 0.2	97.4 $\pm$ 0.5	97.6 $\pm$ 0.9	97.2 $\pm$ 0.3	97.4 $\pm$ 0.5
Crop	53.8 $\pm$ 1.0	54.4 $\pm$ 1.1	54.9 $\pm$ 1.3	49.5 $\pm$ 1.6	53.3 $\pm$ 0.7	52.9 $\pm$ 0.7
DPOC	59.6 $\pm$ 0.7	57.5 $\pm$ 8.5	59.9 $\pm$ 2.2	61.5 $\pm$ 3.9	59.2 $\pm$ 0.5	63.6 $\pm$ 3.9
ElectricDevices	50.2 $\pm$ 2.5	61.0 $\pm$ 3.0	56.2 $\pm$ 2.5	62.1 $\pm$ 1.7	50.3 $\pm$ 1.4	59.6 $\pm$ 1.7
FordA	63.1 $\pm$ 4.8	89.7 $\pm$ 0.8	90.7 $\pm$ 1.4	91.3 $\pm$ 0.9	89.8 $\pm$ 0.4	91.6 $\pm$ 0.4
FordB	53.0 $\pm$ 1.3	72.2 $\pm$ 2.6	71.2 $\pm$ 3.3	72.2 $\pm$ 3.2	70.7 $\pm$ 0.9	75.1 $\pm$ 3.8
HandOutlines	85.6 $\pm$ 2.3	78.6 $\pm$ 4.1	64.1 $\pm$ 0.0	67.1 $\pm$ 3.8	86.1 $\pm$ 0.3	85.6 $\pm$ 5.6
MPOC	57.7 $\pm$ 0.9	57.1 $\pm$ 0.2	57.2 $\pm$ 0.3	61.8 $\pm$ 8.1	57.9 $\pm$ 0.4	68.8 $\pm$ 5.6
POC	65.2 $\pm$ 0.8	61.9 $\pm$ 0.4	61.0 $\pm$ 1.6	62.9 $\pm$ 2.6	64.5 $\pm$ 0.9	64.9 $\pm$ 3.4
PPOC	69.0 $\pm$ 0.5	68.9 $\pm$ 1.2	68.4 $\pm$ 0.0	76.6 $\pm$ 5.6	70.2 $\pm$ 3.6	72.6 $\pm$ 3.9
StarLightCurves	77.4 $\pm$ 4.0	85.9 $\pm$ 0.9	83.9 $\pm$ 1.4	95.8 $\pm$ 1.1	84.9 $\pm$ 3.4	85.9 $\pm$ 0.4
Strawberry	48.8 $\pm$ 4.2	68.3 $\pm$ 4.3	64.3 $\pm$ 0.0	84.9 $\pm$ 1.9	56.1 $\pm$ 3.6	81.9 $\pm$ 3.7
TwoPatterns	55.9 $\pm$ 4.9	85.1 $\pm$ 3.4	62.5 $\pm$ 8.4	75.1 $\pm$ 5.0	75.5 $\pm$ 1.5	100.0 $\pm$ 0.0
Wafer	95.3 $\pm$ 0.2	90.6 $\pm$ 1.2	89.2 $\pm$ 0.0	93.0 $\pm$ 1.9	94.2 $\pm$ 0.3	95.3 $\pm$ 3.7
Average	66.4 $\pm$ 2.0	73.5 $\pm$ 2.3	70.0 $\pm$ 1.6	75.1 $\pm$ 3.0	72.1 $\pm$ 1.3	78.2 $\pm$ 2.7

All models showed improved test accuracy as the proportion of labeled data increased. Nevertheless, NNCLR-TS outperformed both SemiTime and CA-TCC across most scenarios. Notably, on the Crop dataset, while CA-TCC and NNCLR-TS exhibited similar performance at lower labeled data ratios, the performance gap widened as the labeled data proportion increased. At the 80% level, NNCLR-TS achieved a test accuracy of 72.2%, outperforming CA-TCC’s 67.7%.

For TwoPatterns, NNCLR-TS attained 100.0% test accuracy when only 5% of training data was labeled, showing performance close to its maximum potential, while other models required 50% or 80% of labeled data to reach comparable results. The necessity for a relatively smaller amount of labeled data to obtain the potential peak performance of the model was also evident in Wafer. With only 10% of the data labeled, NNCLR-TS achieved a test accuracy of 98.2%, which approached the maximum accuracy of 99.3% when 80% of the data was labeled. In contrast, the other baseline models required a larger proportion of labeled data to achieve similar performance.

From these observations, we can infer that NNCLR-TS excels in environments with limited labeled data, demonstrating test accuracies close to its maximum performance. This suggests NNCLR-TS’s adaptability to variations in data availability, making it a practical choice for applications where labeled data may be scarce. Additionally, when sufficient labeled data is provided, NNCLR-TS is capable of further widening the performance gap with other semi-supervised baselines. This highlights the capacity of NNCLR-TS to leverage the availability of labeled data to enhance its performance relative to other models.

Figure 4.

Performance comparison of CA-TCC, SemiTime, and NNCLR-TS with varying ratios of labeled data across three datasets: (a) Crop, (b) TwoPatterns, and (c) Wafer. In all scenarios, NNCLR-TS consistently outperforms other baselines, achieving the highest accuracy among the three models across all label ratio intervals.

5.3 Evaluating the robustness of NNCLR-TS’s pseudo-labeling to dataset biasness

In semi-supervised learning, the labeling of unlabeled data points can potentially introduce biases that may not be appropriate for the domain, leading to suboptimal performance. To investigate the effectiveness of our proposed pseudo-labeling approach in various scenarios, we introduce a metric that quantifies the degree of bias in the labeled data. By comparing the performance of NNCLR-TS with Ts2Vec (which shares the same backbone architecture but forgoes pseudo-labeling) across different levels of this bias metric, we can experimentally evaluate the robustness and benefits of our pseudo-labeling technique. Additionally, we compare the performance of NNCLR-TS with other semi-supervised learning models to assess the effectiveness of our proposed model in handling biased labeled data and its potential advantages over other approaches.

The introduced metric, termed “label biasness”, is defined as follows:

\begin{aligned} \text{Label biasness} = \frac{\sum_{i \in I_{labeled}} \mathbf{1}_{\{y_{i} = c^{*}\}}}{| I_{labeled} |}, \end{aligned}

(11)

where $I_{labeled}$ represents the set of labeled samples utilized for training, and $\mathbf{1}_{\{y_{i} = c^{*}\}}$ is an indicator function that takes the value 1 if $y_{i} = c^{*}$ and 0 otherwise. The value of $c^{*}$ is determined as follows:

\begin{aligned} c^{*} = \begin{cases} 0, & \text{if } | C | = 2, \\ \text{randomly selected class from } C, & \text{otherwise,} \end{cases} \end{aligned}

(12)

where $C = \{0, \dots, | C | - 1\}$ is the set of unique labels in the dataset, and $| C |$ denotes the number of unique labels. In other words, for binary classification problems ( $| C | = 2$ ), $c^{*}$ is set to 0, while for datasets with multiple classes ( $| C | > 2$ ), $c^{*}$ is randomly selected from the set of unique labels. The interpretation of label biasness depends on the comparison with the original dataset’s label distribution. If the label biasness of the sampled labeled data differs significantly from the original dataset’s label biasness, it indicates a higher level of bias introduced during the sampling process.
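The metric in Eqs. (11) and (12) amounts to the fraction of labeled samples that belong to a reference class. A minimal sketch follows, assuming integer labels encoded as 0 to |C| - 1 as in the definition above; the function names and the optional `rng` argument are illustrative.

```python
import random


def choose_reference_class(labels, rng=None):
    """Eq. (12): class 0 for binary problems, otherwise a randomly chosen class."""
    classes = sorted(set(labels))
    if len(classes) == 2:
        return 0
    if rng is None:
        rng = random.Random(0)  # fixed seed only to keep the sketch reproducible
    return rng.choice(classes)


def label_biasness(labeled_labels, c_star):
    """Eq. (11): share of the labeled samples whose label equals c_star."""
    return sum(1 for y in labeled_labels if y == c_star) / len(labeled_labels)
```

For example, a labeled subset with labels `[0, 0, 1, 0]` and reference class 0 has a label biasness of 0.75.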

We conducted experiments in the scenario where the ratio of the labeled data is 5%. Biasness is introduced into the labeled data by varying the label biasness from 0.1 to 0.9 in increments of 0.2, i.e., 0.1, 0.3, 0.5, 0.7, and 0.9. For each level of label biasness, five random experiments were performed to ensure the robustness of our results. In each experiment, the sampled labeled data remained consistent across all models to ensure a fair comparison. We compared the performance of NNCLR-TS with other baseline models under these varying levels of label biasness. Furthermore, we included the results of the proposed model, NNCLR-TS, with labeled data sampled randomly without any intentional biasness. These results were obtained from the main experiments presented in the previous section, specifically from Table 2.
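The exact sampling procedure used to reach a target label biasness is not described here. One plausible way to construct such a labeled subset is sketched below, purely as an assumption: draw the required number of samples from the reference class and fill the rest from the remaining classes. The function name `sample_with_biasness` is hypothetical.

```python
import random


def sample_with_biasness(labels, c_star, n_labeled, biasness, seed=0):
    """Pick n_labeled indices so that about `biasness` of them have label c_star.

    Assumed procedure, not taken from the paper: sample without replacement
    from the reference class and from all other classes separately.
    """
    rng = random.Random(seed)
    in_class = [i for i, y in enumerate(labels) if y == c_star]
    out_class = [i for i, y in enumerate(labels) if y != c_star]
    n_in = round(n_labeled * biasness)
    n_out = n_labeled - n_in
    if n_in > len(in_class) or n_out > len(out_class):
        raise ValueError("not enough samples to reach the requested label biasness")
    return rng.sample(in_class, n_in) + rng.sample(out_class, n_out)
```

Fixing the seed per experiment keeps the sampled labeled subset identical across models, which matches the fairness requirement stated above.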

Table 3

Performance comparison of NNCLR-TS and other baselines on the HandOutlines and Strawberry datasets with varying label biasness. The best results per row are in bold, and the second-best results are underlined. Delta* refers to the performance difference between NNCLR-TS and NNCLR-TS with randomly sampled labeled data.

Label biasness	Supervised	Ts2Vec	MixupCLR	SemiTime	CA-TCC	NNCLR-TS	Delta*
Average test accuracy (%) on HandOutlines dataset
0.10	75.78	64.49	64.05	64.21	74.86	78.54	$-$ 7.06
0.30	75.73	73.57	64.05	65.66	76.32	83.78	$-$ 1.82
0.50	72.32	75.89	65.41	65.89	60.92	88.43	2.83
0.70	59.62	64.49	35.95	47.93	54.86	88.00	2.40
0.90	43.78	35.95	35.95	35.79	44.65	76.38	$-$ 9.22
Random	–	–	–	–	–	85.60	–
Average test accuracy (%) on Strawberry dataset
0.10	63.62	64.32	64.32	73.00	66.05	86.27	4.37
0.30	55.19	63.78	64.32	81.06	66.43	82.22	0.32
0.50	52.54	66.86	70.86	76.82	58.65	78.00	$-$ 3.90
0.70	45.51	49.46	35.89	65.53	54.11	78.38	$-$ 3.52
0.90	38.27	40.11	35.68	45.23	43.30	55.24	$-$ 26.66
Random	–	–	–	–	–	81.90	–

Table 3 presents the average test accuracy (%) of NNCLR-TS compared to other baselines under varying label biasness levels in the HandOutlines and Strawberry datasets. HandOutlines has an average label biasness of 0.36 when randomly sampled, while Strawberry has an average label biasness of 0.34. For the HandOutlines dataset, NNCLR-TS consistently outperformed other models across all label biasness levels. Notably, it exhibits a smaller performance degradation than other methods when the bias is severe (0.7 or 0.9). For instance, at a label biasness of 0.9, NNCLR-TS achieves an accuracy of 76.38%, which is 31.73 pp higher than the second-best model, CA-TCC, at 44.65%. Furthermore, even in the presence of biasness, NNCLR-TS maintained performance close to that obtained under random sampling. The largest difference is observed at a label biasness of 0.9, where the accuracy is 9.22 pp lower than with random sampling. At moderate levels of label biasness (0.3 and 0.5), the difference is even smaller, with delta values of $-$ 1.82 pp and 2.83 pp, respectively.

Similar trends can be observed in the Strawberry dataset. NNCLR-TS outperformed other baselines across all levels of label biasness, with the performance gap widening at higher bias. For example, at a label biasness of 0.9, NNCLR-TS achieved an accuracy of 55.24%, which is 10.01 pp higher than the second-best model.

It is worth noting that Ts2Vec, which shares the same backbone architecture as NNCLR-TS but does not employ pseudo-labeling, experiences a widening performance gap with the proposed model as the label biasness increases. This suggests that Ts2Vec’s architecture might be more vulnerable to the effects of label bias than NNCLR-TS. Although pseudo-labeling may introduce biases that can negatively impact the model’s performance, the results demonstrate that the technique used in NNCLR-TS effectively addresses this concern in the HandOutlines and Strawberry datasets. This leads to improved performance even in the presence of severe label biasness.

Interestingly, both the HandOutlines and Strawberry datasets share common characteristics. They are both binary classification tasks with relatively long data lengths (2,709 and 235, respectively) compared to other datasets. These characteristics may enable the presence of periodic features, including trends and seasonality within the data, which can be effectively captured by the proposed data augmentation method, STLDDA. This could explain why NNCLR-TS performs particularly well on these datasets, even under high levels of label biasness.

To investigate the generalizability of NNCLR-TS’s performance beyond binary classification, we conducted additional experiments on two multi-class datasets: Crop and TwoPatterns. These datasets were chosen because they represent common challenges in multi-class semi-supervised learning. Table 4 presents the changes in average test accuracy (%) of NNCLR-TS compared to other baselines under varying levels of label biasness in these datasets.

Table 4

Performance comparison of NNCLR-TS and other baselines on the Crop and TwoPatterns datasets with varying label biasness. The best results per row are in bold, and the second-best results are underlined. Delta* refers to the performance difference between NNCLR-TS and NNCLR-TS with randomly sampled labeled data.

Label biasness	Supervised	Ts2Vec	MixupCLR	SemiTime	CA-TCC	NNCLR-TS	Delta*
Average test accuracy (%) on Crop dataset
0.10	53.04	53.90	54.04	47.63	52.47	54.25	1.35
0.30	50.29	52.04	50.41	46.69	47.74	51.73	$-$ 1.17
0.50	45.85	48.96	46.64	42.45	43.11	50.56	$-$ 2.34
0.70	40.71	45.40	38.13	33.33	37.06	45.02	$-$ 7.88
0.90	33.06	40.57	25.93	21.54	22.06	37.31	$-$ 15.59
Random	–	–	–	–	–	52.90	–
Average test accuracy (%) on TwoPatterns dataset
0.10	51.20	83.73	50.87	69.91	71.27	100.00	0.00
0.30	54.27	85.05	69.86	74.41	76.93	99.99	$-$ 0.01
0.50	51.33	78.92	43.84	61.74	70.98	99.98	$-$ 0.02
0.70	42.92	70.81	27.32	45.78	59.78	100.00	0.00
0.90	35.34	52.67	24.79	35.75	49.96	97.72	$-$ 2.28
Random	–	–	–	–	–	100.00	–

For the Crop dataset, NNCLR-TS performed slightly worse than the self-supervised model Ts2Vec at several bias levels. At a label biasness of 0.3, Ts2Vec outperformed NNCLR-TS by 0.31 pp, and this gap widened to 0.38 pp and 3.26 pp at label biasness levels of 0.7 and 0.9, respectively. This suggests that in this particular case, pseudo-labeling might have a detrimental effect on the model’s performance.

Our analysis revealed several factors that could contribute to this observation. First, the Crop dataset contains a significantly higher number of classes (24) compared to other datasets in our experiments. Pseudo-labeling techniques might be less effective with a larger number of classes due to the increased difficulty in accurately assigning pseudo-labels. Second, the Crop dataset also has the shortest data length (46) among all datasets, which may limit the effectiveness of trend and seasonal decomposition through STL. Consequently, this could lead to a decline in the performance of STLDDA and NNCLR-TS.

However, it is important to note that having multiple classes does not always hinder the performance of NNCLR-TS. In the case of the TwoPatterns dataset, NNCLR-TS demonstrated superior performance compared to other models across all levels of label biasness. Even at a label biasness of 0.9, NNCLR-TS exhibited minimal performance degradation (a decrease of less than 3 pp in accuracy) and significantly outperformed the second-best model, Ts2Vec, by 45.05 pp. This could be attributed to the fact that TwoPatterns is a simulated dataset, making it easier to capture periodic features. Additionally, the data length of TwoPatterns is 128, which is longer than that of Crop, allowing STLDDA to have a more positive impact on the model’s performance.

In conclusion, while pseudo-labeling may have a negative effect on the model’s performance in some cases, such as datasets with a high number of classes and short data lengths like Crop, the proposed model still exhibits a relatively smaller decline in performance compared to other semi-supervised methods. Furthermore, for datasets with distinct regularities and sufficient data length, like TwoPatterns, the proposed model demonstrates good performance even in the presence of high label biasness.

6. Model analysis

In this section, we present a comprehensive model analysis to assess the robustness of our proposed model and identify the key factors contributing to its performance improvement.

Figure 5.

Heatmap representation of test accuracies on (a) DPOC, (b) PPOC, and (c) Strawberry with varying $λ_{1}$ and $λ_{2}$ at a 5% labeled ratio.

6.1 Loss coefficients

Figure 5 demonstrates the effects of varying coefficients, $λ_{1}$ for IW-Xent loss and $λ_{2}$ for IBS loss, on the test accuracy of (a) DPOC, (b) PPOC, and (c) Strawberry when the labeled ratio is at 5%. The results indicate that the integration of these additional loss components, regulated by $λ_{1}$ and $λ_{2}$ , can lead to enhanced model performance. When both $λ_{1}$ and $λ_{2}$ are set to zero, meaning the model is trained only with NT-Xent loss, the performance metrics are comparable to other baseline methods. Incorporating a balanced combination of IW-Xent and IBS loss leads to notable improvements across the datasets.

For example, in PPOC, when both $λ_{1}$ and $λ_{2}$ were set to zero, indicating usage of only the NT-Xent loss, the performance achieved was 75.6%. This was higher than certain configurations with additional losses. However, the highest performance was observed at 82.1% when $λ_{1}$ was set to 1.5 and $λ_{2}$ was set to 0.2. For the other two datasets, DPOC and Strawberry, the maximum test accuracy was achieved when all three loss components, including NT-Xent, IW-Xent, and IBS loss, were adequately utilized in the learning process.

However, when the values of $λ_{1}$ and $λ_{2}$ are set too high, there is a noticeable decline in performance. For instance, on DPOC with $λ_{1} = 2.0$ and $λ_{2} = 2.0$ , the performance dropped to 65.6%. This is notably lower than the best performance of 75.2%, achieved with $λ_{1} = 0.2$ and $λ_{2} = 2.0$ , and is even inferior to using NT-Xent alone, which yielded 71.0%. This suggests that weighting the two proposed losses too heavily degrades performance, emphasizing the importance of balancing multiple loss components.

6.2 Maximum support set capacity

Although the support set plays a vital role in NNCLR-TS, increasing the maximum support set capacity could potentially introduce scalability problems. This is because retrieving nearest neighbors within the support set has a computational complexity of $O (N^{2})$ . To address this concern, it is essential to ensure that the model maintains its robustness when the support set capacity is restricted. Figure 6 presents the performance variation for FordA and DPOC when the support set capacity is limited to 50%, 80%, and 100% (unrestricted) of the training data size. This experiment was conducted five times with different randomly chosen seeds to ensure the robustness of the results.
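To illustrate where the quadratic cost comes from and what a capacity limit changes, the following sketch implements a bounded support set with brute-force nearest-neighbor lookup. Several details are assumptions made for illustration only: the first-in-first-out eviction policy, the use of Euclidean distance for retrieval, and the class and method names (`SupportSet`, `add`, `nearest`).

```python
import math
from collections import deque


class SupportSet:
    """Bounded store of (representation, label) pairs with brute-force NN lookup."""

    def __init__(self, capacity):
        # deque(maxlen=...) drops the oldest entry once the capacity is reached.
        self.entries = deque(maxlen=capacity)

    def add(self, representation, label):
        self.entries.append((list(representation), label))

    def nearest(self, query, label, same_label=True):
        """Closest stored representation whose label matches `label`
        (same_label=True) or differs from it (same_label=False)."""
        best, best_dist = None, math.inf
        for rep, rep_label in self.entries:  # one full scan per query: O(N)
            if (rep_label == label) != same_label:
                continue
            dist = math.dist(query, rep)
            if dist < best_dist:
                best, best_dist = rep, dist
        return best
```

Each query scans the whole store, so querying once per training sample against a store that holds the full training set costs on the order of N squared distance computations per epoch; capping the capacity bounds the per-query cost.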

For the FordA, increasing the support set capacity appears to improve the performance in general. The median test accuracy rises gradually from 90.9% with a 50% capacity, to 91.5% at 80% capacity, and finally reaching 91.7% when the capacity is unrestricted. It is noteworthy that even with a reduced support set size, the model continues to perform admirably, with only a marginal decrease in performance.

Figure 6.

Variations in performance according to the maximum capacity of the support set when only 5% of the dataset is labeled, for (a) FordA, and (b) DPOC.

On the other hand, for DPOC, performance varies more with the support set capacity. When the support set is limited to 50% of the training data, the median accuracy is 58.9%. This improves to 61.4% when the capacity is increased to 80%, and allowing the support set to use the entire training data yields a median accuracy of 63.8%.

These results indicate that larger support set capacities improve performance, while the decline under reduced capacities remains moderate: less than one percentage point for FordA and about five percentage points for DPOC at 50% capacity. NNCLR-TS therefore remains usable when computational resources or memory are constrained, which broadens its range of potential applications.

6.3 Augmentation methods

To validate the effectiveness of the proposed augmentation technique, STLDDA, for contrastive learning on time series, we investigated how different augmentation choices influence the performance of NNCLR-TS. For comparison, in addition to using no augmentation, we selected the commonly used jittering, scaling, and rotation techniques, along with permutation and window warp, which were found to be generally effective for time series classification in [68].

Table 5 presents the average test accuracy for each augmentation method used in NNCLR-TS across 13 UCR Archive datasets. Even without explicit augmentation, our model achieved a relatively high test accuracy of 76.4%, tying with permutation for third place among all methods. This may be because the selection of nearest neighbors itself acts as a form of data augmentation.

Performance improved further with the proposed STLDDA method, which recorded the highest test accuracy of 76.8%. A pairwise t-test at a significance level of 0.10 supports the effectiveness of STLDDA against each alternative. This improvement demonstrates the benefit of STLDDA, which leverages the concept of nearest neighbors to generate more diverse and representative augmentations. Despite the inherent variability across the datasets, STLDDA achieved the best average performance, highlighting its robustness and versatility in processing a wide range of time series data.
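For readers unfamiliar with the baseline augmentations, the sketch below shows common textbook forms of jittering, scaling, and permutation for a univariate series. The function names, the default noise levels, and the number of segments are illustrative assumptions; they are not the settings used in the experiments reported here.

```python
import random


def jitter(series, sigma=0.03, rng=random):
    """Add independent zero-mean Gaussian noise to every time step."""
    return [x + rng.gauss(0.0, sigma) for x in series]


def scale(series, sigma=0.1, rng=random):
    """Multiply the whole series by one random factor drawn around 1."""
    factor = rng.gauss(1.0, sigma)
    return [x * factor for x in series]


def permute(series, n_segments=4, rng=random):
    """Split the series into contiguous segments and shuffle their order."""
    n = len(series)
    n_segments = max(1, min(n_segments, n))
    # Segment boundaries, spread as evenly as possible over the series.
    bounds = [round(i * n / n_segments) for i in range(n_segments + 1)]
    segments = [series[bounds[i]:bounds[i + 1]] for i in range(n_segments)]
    rng.shuffle(segments)
    return [x for segment in segments for x in segment]
```

Jittering and scaling keep the temporal order intact, whereas permutation rearranges it, which can disturb periodic structure in the series.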
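The idea of augmenting only the trend while leaving the periodic part untouched can be sketched as follows. This is a simplified illustration and not the STLDDA procedure itself: a centered moving average stands in for the STL trend estimate, the detrended remainder (seasonal plus residual) is kept unchanged, and the random rescaling of the trend is a hypothetical perturbation chosen only to keep the example short. All function and parameter names are assumptions.

```python
import random


def moving_average_trend(series, window=5):
    """Centered moving average with truncated edges; a crude stand-in for the STL trend."""
    half = window // 2
    n = len(series)
    trend = []
    for t in range(n):
        lo, hi = max(0, t - half), min(n, t + half + 1)
        trend.append(sum(series[lo:hi]) / (hi - lo))
    return trend


def augment_trend_only(series, window=5, strength=0.2, rng=random):
    """Perturb the trend and add the untouched detrended part back."""
    trend = moving_average_trend(series, window)
    # Seasonal + residual components, deliberately left unchanged.
    detrended = [x - m for x, m in zip(series, trend)]
    # Hypothetical perturbation: rescale the trend by a random factor near 1.
    factor = 1.0 + rng.uniform(-strength, strength)
    return [factor * m + d for m, d in zip(trend, detrended)]
```

Because the detrended part is added back unchanged, the oscillations around the trend in the augmented series match those of the original; only the slowly varying level differs.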

Table 5
Performance of NNCLR-TS on 13 UCR Archive datasets using various augmentation methods, with only 5% of the data labeled.

Augmentation	Accuracy (%)	Diff vs. STLDDA	Significance
No augmentation	76.4	$-0.4$	*
Jittering	75.0	$-1.8$	**
Scaling	76.5	$-0.3$	*
Permutation	76.4	$-0.4$	*
Rotation	75.3	$-1.5$	**
Window warp	74.9	$-1.9$	**
STLDDA	76.8	–	–

Note: **p < 0.05, *p < 0.10. Significance levels indicate the result of pairwise comparisons with STLDDA.

6.4 Ablations on model architecture

The encoder network in our proposed model combines a temporal network and a spectral network, which are designed to learn from the input data from multiple perspectives. To investigate the benefit of including the spectral component, we evaluated the model using the temporal network alone. Table 6 presents the average accuracy across 13 UCR Archive datasets when 1% and 5% of the training data are labeled.

The results show that, regardless of the proportion of labeled data, the model incorporating the spectral network achieved higher test accuracy. The improvements were 0.7 and 1.5 percentage points at the 5% and 1% levels, respectively. These findings suggest that integrating the spectral network into the encoder can enhance the overall performance of the proposed model.

Pseudo-labeling is based on the labels (or pseudo-labels) of the $K$ nearest-neighbor representations of the selected representation in the support set. It is therefore necessary to investigate how model performance depends on the value of $K$ . To avoid ties in pseudo-labeling, $K$ should be an odd number. For datasets such as Crop and PPOC, there are fewer than 20 training samples per class when just 5% of the data are labeled. Given this limitation, we restricted our experiments to $K$ values of 3, 5, and 7.

As shown in Table 7, the highest average test accuracy, 76.8%, was recorded when $K$ was set to 5. A lower test accuracy of 73.0% was observed when $K$ was set to 3, which may be attributed to insufficient neighbor information for pseudo-labeling in non-binary classification datasets. When $K$ was set to 7, the test accuracy was 75.4%, a decrease compared with $K = 5$ .

One possible explanation for the performance drop at $K = 7$ is the inclusion of distant representations among the $K$ nearest neighbors. Even though these samples are farther away in the latent space, they still contribute to the pseudo-labeling process when $K$ is larger. Their inclusion may therefore introduce noise into the labeling, contributing to the observed reduction in performance.

Our experimental results suggest that the model performs best when $K$ is set to 5. Increasing $K$ does not necessarily improve performance, so $K$ should be selected with the characteristics of the dataset in mind.
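The voting step described above can be sketched as a plain majority vote over the $K$ nearest support entries. This is a minimal illustration rather than the authors' code: the function name `pseudo_label`, the Euclidean distance, and the tie-breaking rule are assumptions made for the example.

```python
import math
from collections import Counter


def pseudo_label(query, support, k=5):
    """Majority label among the k support entries closest to the query.

    `support` is a list of (representation, label) pairs. Euclidean
    distance is an assumption made for this sketch.
    """
    if not support:
        return None
    ranked = sorted(support, key=lambda item: math.dist(query, item[0]))
    neighbors = ranked[:k]
    votes = Counter(label for _, label in neighbors)
    top_count = max(votes.values())
    winners = {label for label, count in votes.items() if count == top_count}
    # An odd k rules out ties between two classes, but ties remain possible
    # with three or more classes; here the nearest tied neighbor decides.
    for _, label in neighbors:
        if label in winners:
            return label
```

With two classes and an odd `k`, the vote always has a strict winner; the tie-breaking branch only matters for multi-class data.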

Table 6
Average test accuracy (%) on 13 UCR Archive datasets with respect to the choice of encoder network.

Mode/ratio of labeled data	1%	5%
NNCLR-TS (temporal only)	66.3	76.1
NNCLR-TS (temporal $+$ spectral)	67.8	76.8

Table 7

Average test accuracy of NNCLR-TS across 13 UCR Archive datasets for different numbers of nearest neighbors $K$ used for pseudo-labeling, with 5% of the data labeled.

$K$ (nearest samples)	Accuracy (%)
3	73.0
5	76.8
7	75.4

Figure 7.

t-SNE visualizations on the representations of the test data from SemiTime, CA-TCC, and NNCLR-TS applied for HandOutlines and TwoPatterns. All models were trained with 1% of the training data labeled.

6.5 Visualized explanations

For a qualitative evaluation of the learned representations, the test data were passed through the trained encoder to extract features, which were then visualized in a 2D space using t-SNE [69]. Figure 7 presents the learned representations from SemiTime, CA-TCC, and NNCLR-TS for HandOutlines and TwoPatterns, with only 1% of the training data labeled.

For HandOutlines, our proposed model displays a clearer separation between the two classes (0 and 1) than the baselines, indicating more effective feature extraction. Although SemiTime and CA-TCC show a certain level of separation, overlapping regions remain, which could contribute to misclassifications. Moreover, the clusters formed by NNCLR-TS are notably more compact, suggesting a more consistent representation.

TwoPatterns is a four-class scenario. NNCLR-TS separates each class into its own cluster without any visible overlap. In contrast, both SemiTime and CA-TCC exhibit overlapping regions, especially among classes 0, 1, and 2. NNCLR-TS thus yields better-separated representations on both datasets, suggesting that the model can generalize across problems of different complexity.

In conclusion, the t-SNE embeddings show that NNCLR-TS has clear advantages over its counterparts in class separation and cluster compactness in these scenarios. Its strong performance on the presented datasets highlights its potential for time series analysis.

7. Conclusions

We introduced a novel semi-supervised contrastive learning framework, NNCLR-TS, designed for time series classification. The NNCLR-TS framework uses an asynchronously updated support set that stores both data representations and label information. This design is crucial for assigning pseudo-labels to unlabeled data and for identifying the nearest representations used in contrastive learning. To further enhance NNCLR-TS, we developed STLDDA, which generates a diverse set of time series while preserving seasonal information, reflecting the importance of retaining seasonal properties during augmentation.

Along with the standard NT-Xent loss, we introduced two additional losses: IW-Xent and IBS. These losses are designed to bring representations of the same class closer together while separating them from others. Our results suggest that NNCLR-TS outperforms other self-supervised and semi-supervised benchmarks in time series classification, particularly in scenarios with limited labeled data, such as the 1% and 5% settings.

In our comprehensive model analysis, we observed that performance remains fairly consistent across different hyperparameters. Although certain parameters, such as support set capacity, have an influence on the outcomes, these variations are not substantial. This consistent performance indicates the model’s robustness, reducing the necessity for extensive hyperparameter tuning.

As we look ahead, our research goals include the application of NNCLR-TS in multivariate contexts and the advancement of classification in diverse multivariate time series datasets. We also plan to explore cases where not every class is represented in the labeled datasets, delving into the challenges of zero-shot learning.

Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) grants funded by the Korea government (MSIT) (No. NRF-2022R1F1A1066744 and NRF-2020R1G1A1007453).

References

1. Hamilton J.D., Time series analysis, Princeton University Press, 2020.
2. Iwana B.K., Uchida, An empirical survey of data augmentation for time series classification with neural networks, PLoS ONE 16(7) (2021), e0254841.
3. Ismail Fawaz, Forestier, Weber, Idoumghar, Muller P.-A., Deep learning for time series classification: A review, Data Mining and Knowledge Discovery 33(4) (2019), 917–963.
4. Wang, Yan, Oates, Time series classification from scratch with deep neural networks: A strong baseline, in: 2017 International Joint Conference on Neural Networks (IJCNN), IEEE, 2017, pp. 1578–1585.
5. Doersch, Gupta, Efros A.A., Unsupervised visual representation learning by context prediction, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1422–1430.
6. Jawed, Grabocka, Schmidt-Thieme, Self-supervised learning for semi-supervised time series classification, in: Advances in Knowledge Discovery and Data Mining: 24th Pacific-Asia Conference, PAKDD 2020, Singapore, May 11–14, 2020, Proceedings, Part I 24, Springer, 2020, pp. 499–511.
7. Shen, Yun, Lipton Z.C., Kronrod, Anandkumar, Deep active learning for named entity recognition, in: International Conference on Learning Representations (ICLR), 2018. https://openreview.net/forum?id=ry018WZAZ.
8. Malhotra, Bansal, Ganapathy, Active Learning Methods for Low Resource End-to-End Speech Recognition, in: Proc. Interspeech 2019, 2019, pp. 2215–2219. doi: https://doi.org/10.21437/Interspeech.2019-2316.
9. Farha Y.A., Gall, Ms-tcn: Multi-stage temporal convolutional network for action segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 3575–3584.
10. Hao, Wang, Alexander A.D., Yuan, Zhang, MICOS: Mixed supervised contrastive learning for multivariate time series classification, Knowledge-Based Systems 260 (2023), 110158.
11. Zhai, Oliver, Kolesnikov, Beyer, S4l: Self-supervised semi-supervised learning, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1476–1485.
12. Misra, Maaten L.v.d., Self-supervised learning of pretext-invariant representations, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6707–6717.
13. Dwibedi, Aytar, Tompson, Sermanet, Zisserman, With a little help from my friends: Nearest-neighbor contrastive learning of visual representations, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 9588–9597.
14. Chen, Kornblith, Norouzi, Hinton, A simple framework for contrastive learning of visual representations, in: International Conference on Machine Learning, PMLR, 2020, pp. 1597–1607.
15. Eldele, Ragab, Chen, Kwoh C.K., Guan, Time-Series Representation Learning via Temporal and Contextual Contrasting, in: Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, 2021, pp. 2352–2359.
16. Liu, Abdelzaher, Semi-supervised contrastive learning for human activity recognition, in: 2021 17th International Conference on Distributed Computing in Sensor Systems (DCOSS), IEEE, 2021, pp. 45–53.
17. Wang, Breckon, Unsupervised domain adaptation via structured prediction based selective pseudo-labeling, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, 2020, pp. 6243–6250.
18. Pei, Cao, Long, Wang, Multi-adversarial domain adaptation, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32, 2018.
19. Eldele, Ragab, Chen, Kwoh C.K., Guan
20. Wen, Sun, Yang, Song, Gao, Wang, Time Series Data Augmentation for Deep Learning: A Survey, in: Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, Vol. 8, 2021, pp. 4653–4660.
21. Franceschi J.-Y., Dieuleveut, Jaggi, Unsupervised scalable representation learning for multivariate time series, Advances in Neural Information Processing Systems 32 (2019).
22. Yue, Wang, Duan, Yang, Huang, Tong, Ts2vec: Towards universal representation of time series, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36, 2022, pp. 8980–8987.
23. Han, Jeong, Time-series data augmentation based on interpolation, Procedia Computer Science 175 (2020), 64–71.
24. Zhu, Goldberg A.B., Introduction to semi-supervised learning, Synthesis Lectures on Artificial Intelligence and Machine Learning 3(1) (2009), 1–130.
25. Kingma D.P., Mohamed, Jimenez Rezende, Welling, Semi-supervised learning with deep generative models, Advances in Neural Information Processing Systems 27 (2014).
26. Oliver, Odena, Raffel C.A., Cubuk E.D., Goodfellow, Realistic evaluation of deep semi-supervised learning algorithms, Advances in Neural Information Processing Systems 31 (2018).
27. Abuduweili, Shi, Xu C.-Z., Dou, Adaptive consistency regularization for semi-supervised transfer learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 6923–6932.
28. Lin T.R., A consistency regularization based semi-supervised learning approach for intelligent fault diagnosis of rolling bearing, Measurement 165 (2020), 107987.
29. Sajjadi, Javanmardi, Tasdizen, Regularization with stochastic transformations and perturbations for deep semi-supervised learning, Advances in Neural Information Processing Systems 29 (2016).
30. Laine, Aila, Temporal Ensembling for Semi-Supervised Learning, in: International Conference on Learning Representations, 2016.
31. Miyato, Dai A.M., Goodfellow, Adversarial Training Methods for Semi-Supervised Text Classification, in: International Conference on Learning Representations, 2016.
32. Zhang, Cisse, Dauphin Y.N., Lopez-Paz, mixup: Beyond Empirical Risk Minimization, in: International Conference on Learning Representations, 2018.
33. Berthelot, Carlini, Goodfellow, Papernot, Oliver, Raffel C.A., Mixmatch: A holistic approach to semi-supervised learning, Advances in Neural Information Processing Systems 32 (2019), 5050–5060.
34. Rizve M.N., Duarte, Rawat Y.S., Shah, In Defense of Pseudo-Labeling: An Uncertainty-Aware Pseudo-label Selection Framework for Semi-Supervised Learning, 2021.
35. Lee D.-H. et al., Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks, in: Workshop on Challenges in Representation Learning, ICML, Vol. 3, 2013, p. 896.
36. Arazo, Ortego, Albert, O'Connor N.E., McGuinness, Pseudo-labeling and confirmation bias in deep semi-supervised learning, in: International Joint Conference on Neural Networks (IJCNN), IEEE, 2020, pp. 1–8.
37. Yang, Nevatia, Simple: Similar pseudo label exploitation for semi-supervised classification, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 15099–15108.
38. Lee, Kim, Cheon, Cho, Han W.-S., Contrastive regularization for semi-supervised learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 3911–3920.
39. Haoran, Guanze, Semi-supervised end-to-end automatic sleep stage classification based on pseudo-label, in: 2021 IEEE International Conference on Power Electronics, Computer Applications (ICPECA), IEEE, 2021, pp. 83–87.
40. Cascante-Bonilla, Tan, Ordonez, Curriculum labeling: Revisiting pseudo-labeling for semi-supervised learning, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, 2021, pp. 6912–6920.
41. Choi, Jeong, Kim, Pseudo-labeling curriculum for unsupervised domain adaptation, arXiv preprint arXiv:1908.00262, 2019.
42. Niu, Shan, Wang, Spice: Semantic pseudo-labeling for image clustering, IEEE Transactions on Image Processing 31 (2022), 7264–7278.
43. Chen, Kornblith, Swersky, Norouzi, Hinton G.E., Big self-supervised models are strong semi-supervised learners, Advances in Neural Information Processing Systems 33 (2020), 22243–22255.
44. Kaushik, Hovy, Lipton Z.C., Learning the difference that makes a difference with counterfactually-augmented data, in: ICLR, 2020. https://openreview.net/forum?id=Sklgs0NFvr.
45. Saeed, Grangier, Zeghidour, Contrastive learning of general-purpose audio representations, in: ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2021, pp. 3875–3879.
46. Fulcher B.D., Jones N.S., Highly comparative feature-based time-series classification, IEEE Transactions on Knowledge and Data Engineering 26(12) (2014), 3026–3037.
47. Che, Purushotham, Cho, Sontag, Liu, Recurrent neural networks for multivariate time series with missing values, Scientific Reports 8(1) (2018), 6085.
48. Fan, Xie, Girshick, Momentum contrast for unsupervised visual representation learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9729–9738.
49. Chen, Fan, Girshick, Improved baselines with momentum contrastive learning, arXiv preprint arXiv:2003.04297, 2020.
50. Tian, Krishnan, Isola, Contrastive multiview coding, in: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16, Springer, 2020, pp. 776–794.
51. Shorten, Khoshgoftaar T.M., A survey on image data augmentation for deep learning, Journal of Big Data 6(1) (2019), 1–48.
52. Tan C.W., Herrmann, Forestier, Webb G.I., Petitjean, Efficient search of the best warping window for dynamic time warping, in: Proceedings of the 2018 SIAM International Conference on Data Mining, SIAM, 2018, pp. 225–233.
53. Iglesias, Talavera, González-Prieto Á., Mozo, Gómez-Canaval, Data Augmentation techniques in time series domain: a survey and taxonomy, Neural Computing and Applications (2023), 1–23.
54. Iwana B.K., Uchida, Time series data augmentation for neural networks by time warping with a discriminative teacher, in: 2020 25th International Conference on Pattern Recognition (ICPR), IEEE, 2021, pp. 3558–3565.
55. Kamycki, Kapuscinski, Oszust, Data augmentation with suboptimal warping for time-series classification, Sensors 20(1) (2019), 98.
56. Jeong Y.-S., Jeong M.K., Omitaomu O.A., Weighted dynamic time warping for time series classification, Pattern Recognition 44(9) (2011), 2231–2240.
57. Cuturi, Blondel, Soft-dtw: a differentiable loss function for time-series, in: International Conference on Machine Learning, PMLR, 2017, pp. 894–903.
58. Keogh E.J., FastDTW is approximate and generally slower than the algorithm it approximates, IEEE Transactions on Knowledge and Data Engineering 34(8) (2020), 3779–3785.
59. Cleveland R.B., Cleveland W.S., McRae J.E., Terpenning, STL: A seasonal-trend decomposition, J. Off. Stat 6(1) (1990), 3–73.
60. Lee, Risteski, Efficient sampling from the Bingham distribution, in: Proceedings of the 32nd International Conference on Algorithmic Learning Theory, 2021, pp. 673–685.
61. Khosla, Teterwak, Wang, Sarna, Tian, Isola, Maschinot, Liu, Krishnan, Supervised contrastive learning, Advances in Neural Information Processing Systems 33 (2020), 18661–18673.
62. Andrzejak R.G., Lehnertz, Mormann, Rieke, David, Elger C.E., Indications of nonlinear deterministic and finite-dimensional structures in time series of brain electrical activity: Dependence on recording region and brain state, Physical Review E 64(6) (2001), 061907.
63. Dau H.A., Bagnall, Kamgar, Yeh C.-C.M., Zhu, Gharghabi, Ratanamahatana C.A., Keogh, The UCR time series archive, IEEE/CAA Journal of Automatica Sinica 6(6) (2019), 1293–1305.
64. Izmailov, Podoprikhin, Garipov, Vetrov, Wilson A.G., Averaging weights leads to wider optima and better generalization, in: Conference on Uncertainty in Artificial Intelligence (UAI), AUAI Press for Association for Uncertainty in Artificial Intelligence, 2018, pp. 876–885.
65. Wickstrøm, Kampffmeyer, Mikalsen K.Ø., Jenssen, Mixing up contrastive learning: Self-supervised representation learning for time series, Pattern Recognition Letters 155 (2022), 54–61.
66. Zhang, Cisse, Dauphin Y.N., Lopez-Paz, mixup: Beyond Empirical Risk Minimization, in: International Conference on Learning Representations, 2018. https://openreview.net/forum?id=r1Ddp1-Rb.
67. Fan, Zhang, Wang, Huang, Semi-supervised Time Series Classification by Temporal Relation Prediction, in: 46th International Conference on Acoustics, Speech, and Signal Processing, IEEE, 2021.
68. Iwana B.K., Uchida, An empirical survey of data augmentation for time series classification with neural networks, PLoS ONE 16(7) (2021), e0254841.
69. Van der Maaten, Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research 9(11) (2008).
70. Zbontar, Jing, Misra, LeCun, Deny, Barlow twins: Self-supervised learning via redundancy reduction, in: International Conference on Machine Learning, PMLR, 2021, pp. 12310–12320.
71. Jaiswal, Babu A.R., Zadeh M.Z., Banerjee, Makedon, A survey on contrastive self-supervised learning, Technologies 9(1) (2020), 2.
72. Sohn, Berthelot, Carlini, Zhang, Raffel C.A., Cubuk E.D., Kurakin, Li C.-L., Fixmatch: Simplifying semi-supervised learning with consistency and confidence, Advances in Neural Information Processing Systems 33 (2020), 596–608.
73. Sarkar, Etemad, Self-supervised ECG representation learning for emotion recognition, IEEE Transactions on Affective Computing 13(3) (2020), 1541–1554.
74. Khosla, Teterwak, Wang, Sarna, Tian, Isola, Maschinot, Liu, Krishnan, Supervised contrastive learning, Advances in Neural Information Processing Systems 33 (2020), 18661–18673.
75. Chen, Kornblith, Swersky, Norouzi, Hinton G.E., Big self-supervised models are strong semi-supervised learners, Advances in Neural Information Processing Systems 33 (2020), 22243–22255.
76. Sohn, Berthelot, Carlini, Zhang, Raffel C.A., Cubuk E.D., Kurakin, Li C.-L., Fixmatch: Simplifying semi-supervised learning with consistency and confidence, Advances in Neural Information Processing Systems 33 (2020), 596–608.
77. Kurakin, Raffel, Berthelot, Cubuk E.D., Zhang, Sohn, Carlini, ReMixMatch: Semi-Supervised Learning with Distribution Matching and Augmentation Anchoring, in: ICLR, 2020. https://openreview.net/pdf?id=HklkeR4KPB.
78. Chen, Kornblith, Norouzi, Hinton, A simple framework for contrastive learning of visual representations, in: International Conference on Machine Learning, PMLR, 2020, pp. 1597–1607.
79. Xiong, Yu S.X., Lin, Unsupervised feature learning via non-parametric instance discrimination, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3733–3742.
80. Fan, Xie, Girshick, Momentum contrast for unsupervised visual representation learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9729–9738.
81. Wickstrøm, Kampffmeyer, Mikalsen K.Ø., Jenssen, Mixing up contrastive learning: Self-supervised representation learning for time series, Pattern Recognition Letters 155 (2022), 54–61.
82. Liu, Zhou, Zhang, Xiong, Semi-supervised learning quantization algorithm with deep features for motor imagery EEG Recognition in smart healthcare application, Applied Soft Computing 89 (2020), 106071.
83. Zhang, Zhao, Tsiligkaridis, Zitnik, Self-supervised contrastive pre-training for time series via time-frequency consistency, 2022, 3988–4003.
84. Wang, Zhang, Pan, Chen, Time series feature learning with labeled and unlabeled data, Pattern Recognition 89 (2019), 55–66.
85. Oord A.v.d., Vinyals, Representation learning with contrastive predictive coding, arXiv preprint arXiv:1807.03748, 2018.
86. Han, Jeong, Time-series data augmentation based on interpolation, Procedia Computer Science 175 (2020), 64–71.
87. Längkvist, Karlsson, Loutfi, A review of unsupervised feature learning and deep learning for time-series modeling, Pattern Recognition Letters 42 (2014), 11–24.
