Abstract
Human activity recognition (HAR) plays a crucial role in remotely monitoring the health of the elderly. Human annotation is time-consuming and expensive, especially for abstract sensor data. Contrastive learning can extract robust features from weakly annotated data to promote the development of sensor-based HAR. However, current research mainly focuses on the exploration of data augmentation methods and pre-trained models, disregarding the impact of data quality on label effort for fine-tuning. This paper proposes a novel active contrastive coding model that focuses on using an active query strategy to evenly select small, high-quality samples in downstream tasks to complete the update of the pre-trained model. The proposed uncertainty-based balanced query strategy mines the most indistinguishable hard samples according to the data posterior probability in the unlabeled sample pool, and imposes class balance constraints to ensure equilibrium in the labeled sample pool. Extensive experiments have shown that the proposed method consistently outperforms several state-of-the-art baselines on four mainstream HAR benchmark datasets (UCI, WISDM, MotionSense, and USCHAD). With approximately only 10% labeled samples, our method achieves impressive F1-scores of 98.54%, 99.34%, 98.46%, and 87.74%, respectively.
Introduction
Sensor-based human activity recognition (HAR) uses raw time series data from embedded sensors to infer the complex activities of human beings [1, 2]. HAR has been widely used in monitoring systems [3, 4], ambient assisted living [5], and other intelligent medical applications [6, 7]. The sensor data flow from mobile devices is abstract and challenging for humans to interpret. Traditional machine learning methods primarily focus on developing domain-specific feature extraction techniques that allow prior knowledge to be incorporated during the learning process [8]. However, these methods rely on human creativity to come up with novel features and cannot capture potential explanatory factors in low-level sensor signals. To overcome these limitations, deep learning methods based on neural networks have gained popularity. Tang et al. explore the application of HAR in embedded scenarios by optimizing convolutional neural networks (CNNs) with Lego filters [9] and a transformer architecture [10]. Challa et al. propose a multibranch CNN-BiLSTM network for automatic feature extraction from the raw sensor data with minimal data pre-processing [11]. As these deep learning models are trained in a data-driven manner, having an adequately large training set is crucial for achieving high performance. However, building a large labeled dataset is very time-consuming and expensive. The labeling cost increases with the size of the data and task complexity [12], posing a significant hurdle to deploying data-driven systems in mobile HAR applications. Recently, a novel context-aware mutual learning method (CMLHAR) was proposed for semi-supervised HAR, aimed at reducing the amount of annotation required [13]. Nevertheless, inaccurate pseudo-labels degrade the classification performance of the system [14, 15].
Self-supervised learning (SSL) technology presents another paradigm to circumvent label restrictions [16]. Unlike semi-supervised learning methods, SSL does not rely on any label values during training, but instead discovers relationships between samples by exploring the intrinsic features of the data. As a form of SSL, contrastive learning (CL) learns in an unsupervised manner from a large amount of unlabeled data and fine-tunes on a few labeled samples [17]. Several studies have demonstrated the effectiveness of CL for sensor-based HAR [18]. Reference [19] presents the first application of CL in HAR for healthcare, reporting improvements over supervised learning (SL) models with fine-tuning and random rotation. The multi-task SSL model trains a multi-task temporal convolutional network to recognize transformations applied to the input signal [20]. It demonstrates that simple auxiliary binary classification tasks provide a robust supervisory signal for extracting useful features for downstream tasks. In [21], STFNet is used as the basic building block of the backbone, and the resulting model significantly outperforms time-domain SSL approaches. Similarly, the CSSHAR model [22] and the CPCHAR model [23] replace the CL backbone with a transformer encoder and the Contrastive Predictive Coding framework, respectively. Wang et al. propose a novel resampling-based data augmentation method and compare the performance of SimCLR and MoCo [24, 25]. In general, existing research on HAR using the CL paradigm primarily focuses on the exploration of data augmentation methods and pre-trained models, yielding certain results. As an SSL paradigm, besides pre-training a high-performance backbone, fine-tuning the model is also crucial for downstream tasks. Current fine-tuning for CL often relies on an unrealistic assumption: that it is already known which samples should be labeled [26]. Consequently, researchers randomly select a proportion of labeled samples for retraining.
The resulting problem is that a randomly selected labeled sample pool can poorly represent the feature-space distribution of the dataset, so the downstream task needs a large volume of labeled data to obtain accurate decision boundaries. Regrettably, no research has analyzed which samples should be selected for fine-tuning so that a retrained model suitable for HAR healthcare tasks is easier to obtain.
Motivated by this issue, an active contrastive coding (ACC) model is proposed for real-world monitoring scenarios. The key objective is to mine “high-quality samples” to fine-tune the optimal weight of the backbone. Active learning (AL) aims to select an informative subset for labeling that achieves the highest performance within a fixed labeling budget, rather than labeling the entire dataset [27, 28]. It is suitable for sensor-based HAR as it offers another perspective for reducing label effort [29]. Core-set techniques [30] based on representativeness methods select samples by minimizing the Euclidean distance between query data and labeled samples in the feature space. However, their performance is constrained by the number of classes in the dataset and they have a poor representation of features in high-dimensional spaces [31]. Uncertainty query criteria address this problem by sampling the most uncertain data, leveraging class posterior probabilities, entropy and loss prediction [27]. Recently, some researchers have begun to explore the application of AL and self-supervised training on large-scale image classification tasks. Bengar et al. show that SSL is more effective than AL in reducing label effort on image recognition datasets [32]. Xie et al. propose an ActiveFT model for image classification to minimize the distance between distributions of the selected subset and entire unlabeled pool [26]. The inherent properties of human behavior need to be considered when selecting the optimal labeled data for fine-tuning. The abstract sensor data are easily confused for the classification model, and the frequency of different activities in daily life is inconsistent. Whether CL can benefit from AL in sensor-based HAR tasks remains an open question.
Unlike previous research that improves data augmentation and backbone performance, this study uses AL to further reduce the label effort of CL for sensor-based HAR. Instead of randomly selecting samples for fine-tuning, we employ active query strategies to select class-balanced hard samples near the decision boundary for labeling. The results demonstrate significant performance gains of the proposed ACC model in reducing label effort compared with baselines. The main contributions are as follows:
(1) A novel ACC model is proposed to tackle the challenges associated with annotating substantial volumes of abstract sensor data in practical HAR applications.
(2) The proposed method significantly reduces the labeling effort required for fine-tuning in existing CL methods by integrating an AL query strategy into downstream tasks.
(3) An uncertainty-based balanced query strategy is utilized to evenly mine hard samples in specific tasks, taking into account the characteristics of samples during the fine-tuning process to enhance the model’s adaptability for HAR.
Methods
The overall framework of the ACC method is illustrated in Figure 1. ACC consists of two main parts: pre-training a backbone network and fine-tuning the network with AL, as depicted in the upper and lower parts of Figure 1, respectively. The objective of pre-training is to learn meaningful representations of samples in an unsupervised manner to obtain a neural network-based encoder f(·; w_0) with pre-trained weights w_0. The pre-training pipeline comprises data augmentation, an encoder, a projection head, and a contrastive loss function, which are detailed in subsection 2.1. After pre-training is complete, a linear classifier is added on top of the learned encoder for supervised fine-tuning with the initial labeled data. Then, we employ an AL loop using the fine-tuned model to evenly select the most confusing samples and query them for labeling. The iteration continues until the specified labeling budget is reached, resulting in the final updated classification model f(·; w_f) with updated parameters w_f. The details of the active query strategy are provided in subsection 2.2.

Overview of the proposed active contrastive coding (ACC) framework. The framework consists of three stages. (i) A backbone is trained in an unsupervised way using the entire dataset. (ii) Given a few labeled data, the backbone with a linear classifier is fine-tuned in a supervised way. (iii) The model is run on the unlabeled data and the samples are ranked via the active query strategy. Finally, the samples that meet the selection criteria are sent to annotators for labeling and added to the labeled set. Stages (ii) and (iii) are repeated until the labeling budget is exhausted.
We first present the generation process of the pre-trained model, which comprises four parts. Unlike existing CL work, the second part is enhanced with an attention-based feature extraction mechanism to obtain more robust feature representations. The pre-training of the backbone network is depicted in the upper part of Figure 1:
The first part is data augmentation to maximize consistency between augmented views. Three-dimensional rotation and resampling serve as the fundamental augmentation methods. Three-dimensional rotation is chosen because it can effectively simulate sensor shaking during data collection. Resampling introduces variable domain information and simulates real activity data by changing the sampling frequency to maximize the coverage of the sampling space. The original input data X generates two augmented views.
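As a rough illustration of the two augmentations described above, the following sketch applies a random three-dimensional rotation and a simple linear-interpolation resampling to a tri-axial window. The function names, the rotation-angle range, the resampling factor, and the clamping at the window end are assumptions made for this example; they are not taken from the implementation used in the experiments.

```python
import math
import random


def random_rotation_matrix(rng):
    """Rotation matrix for a random axis and angle (Rodrigues' formula)."""
    while True:  # draw a random, non-degenerate axis
        axis = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(a * a for a in axis))
        if norm > 1e-8:
            break
    x, y, z = (a / norm for a in axis)
    theta = rng.uniform(-math.pi, math.pi)  # assumed angle range
    c, s = math.cos(theta), math.sin(theta)
    t = 1.0 - c
    return [
        [c + x * x * t, x * y * t - z * s, x * z * t + y * s],
        [y * x * t + z * s, c + y * y * t, y * z * t - x * s],
        [z * x * t - y * s, z * y * t + x * s, c + z * z * t],
    ]


def rotate(window, rng):
    """Apply one random 3D rotation to every (x, y, z) sample of the window."""
    r = random_rotation_matrix(rng)
    return [[sum(r[i][k] * sample[k] for k in range(3)) for i in range(3)]
            for sample in window]


def resample(window, factor):
    """Re-read the window at `factor` times the sampling rate.

    Values between original samples are linearly interpolated. The output
    keeps the original length; positions past the end are clamped to the
    last sample.
    """
    n = len(window)
    out = []
    for t in range(n):
        pos = min(t / factor, n - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, n - 1)
        w = pos - lo
        out.append([(1 - w) * window[lo][k] + w * window[hi][k]
                    for k in range(3)])
    return out


# Toy tri-axial window of 128 samples and its two augmented views.
rng = random.Random(0)
window = [[math.sin(0.1 * t), math.cos(0.1 * t), 0.05 * t] for t in range(128)]
view_i = rotate(window, rng)
view_j = resample(window, factor=1.5)
```

A rotation leaves the magnitude of each sample unchanged, which is why it is a plausible model of a change in sensor orientation rather than a change in the activity itself.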
The second part is a neural-network encoder f(·) that maps the augmented samples into a low-dimensional representation. We utilize a 1D convolutional encoder as the fundamental model, incorporating batch normalization and ReLU activation, as depicted in Figure 2. The output of the CNN is then processed by a self-attention encoder. The primary objective here is to learn weight coefficients that capture the relationships between input data at different times and to perform weighted encoding of the features extracted by the CNN. The representations of the two augmented views are denoted h_i and h_j.
The third part is the projection head, which is a multi-layer perceptron (MLP) with a hidden layer. It maps the representations learned by the encoder into another space where the contrastive loss is applied, resulting in z_i = g(h_i) and z_j = g(h_j).
The final part defines the contrastive loss function for the contrastive task. In our method, the normalized temperature-scaled cross-entropy loss (NT-Xent) [33] is used as the loss function. With these four components, samples from the entire dataset undergo unsupervised pre-training to derive the backbone f(·; w_0), where w_0 represents the pre-trained parameters of the backbone.
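To make the loss concrete, the sketch below computes NT-Xent in its commonly used form for a small batch of projected pairs. It is a plain-Python illustration written for readability, not the training code; the helper names are hypothetical, and the default temperature of 0.1 simply mirrors the value reported in the implementation details.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def nt_xent(z_i, z_j, temperature=0.1):
    """NT-Xent loss for N positive pairs (z_i[k], z_j[k]).

    Each of the 2N projections acts as an anchor. Its positive is the other
    view of the same sample; the remaining 2N - 2 projections are negatives.
    """
    n = len(z_i)
    z = z_i + z_j  # 2N projections; the positive of index a is (a + n) % 2n
    total = 0.0
    for a in range(2 * n):
        pos = (a + n) % (2 * n)
        logits = [cosine(z[a], z[k]) / temperature
                  for k in range(2 * n) if k != a]
        pos_logit = cosine(z[a], z[pos]) / temperature
        m = max(logits)  # log-sum-exp with max subtraction for stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - pos_logit
    return total / (2 * n)


# Views that agree give a small loss; mismatched views give a large one.
anchors = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
shuffled = [[0.0, 1.0], [-1.0, 0.0], [1.0, 0.0]]
loss_aligned = nt_xent(anchors, aligned)
loss_shuffled = nt_xent(anchors, shuffled)
```

A lower temperature sharpens the softmax, so hard negatives that lie close to the anchor contribute more strongly to the loss.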

Convolutional encoder structure with self-attention
The objective of ACC is to achieve the best performance in the downstream task model with a small amount of labeled data. Specifically, we adopt a batch-mode AL scheme. In each round of AL iteration, based on the proposed uncertainty-based balanced sampling strategy, we select b hard samples from the unlabeled sample pool to join the labeled sample pool, until the labeling budget is exhausted. When AL selects data for labeling, the uncertainty and the class distribution of the dataset need to be considered simultaneously. Selecting a similar number of confusing samples from each category for labeling helps ensure that the fine-tuned model recognizes all classes well. As the data within the unlabeled sample pool D_unlabel lack the correct class labels, we use the pre-trained backbone f(·; w_0) to generate a probability matrix to estimate its data distribution, as shown in Eq. (3).
In addition to constraining the class distribution of the labeled data, this paper utilizes uncertainty as a measure of sample information. Commonly used methods for measuring uncertainty based on posterior probability include the least confidence strategy, the entropy-based sampling strategy, and the margin sampling strategy. The least confidence strategy only considers the category with the highest probability and disregards other categories. The entropy-based sampling strategy considers all categories but is easily influenced by individual categories. Therefore, we choose margin sampling to quantify data uncertainty, which selects the samples with the smallest difference in probability between the highest and second highest model predictions, as defined in Eq. (7).
1: Initialization: initial labeled data pool |D_label| = b_0, n = 1.
2: f(·; w_f) ← f(·; w_0) with a linear classifier added on top of it.
3: while the labeling budget is not exhausted do
4: f(·; w_f) ← fine-tune model f(·; w_f) on D_label
5: Use f(·; w_f) to compute probabilities P for x_t ∈ D_unlabel
6: Obtain the highest probability P1 and the second highest probability P2
7: Compute P_onehot from Eq. (4)
8: Compute Ω from Eq. (6)
9: Solve Eq. (9) to obtain C
10: Query C to the human annotator
11: D_label ← D_label ∪ C, D_unlabel ← D_unlabel ∖ C
12: n ← n + 1
13: end while
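The selection step of the procedure above can be sketched as follows. Because Eqs. (4), (6) and (9) are not reproduced in this text, the balance constraint here is a stand-in assumption: an equal per-class quota over predicted classes, filled greedily in order of increasing margin. The function names are hypothetical, and the sketch should be read as one plausible instance of an uncertainty-based balanced query rather than the exact optimization solved in the paper.

```python
def margin(probs):
    """Difference between the highest and second-highest class probability."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]


def balanced_margin_query(prob_matrix, b):
    """Pick b unlabeled indices with small margins, spread across classes.

    prob_matrix[t][k] is the predicted probability of class k for sample t.
    Assumed constraint: each predicted class contributes at most
    ceil(b / num_classes) samples; any budget left over is filled with the
    remaining smallest-margin samples.
    """
    num_classes = len(prob_matrix[0])
    quota = -(-b // num_classes)  # ceiling division
    order = sorted(range(len(prob_matrix)),
                   key=lambda t: margin(prob_matrix[t]))
    picked, per_class = [], [0] * num_classes
    for t in order:
        if len(picked) == b:
            break
        predicted = max(range(num_classes), key=lambda k: prob_matrix[t][k])
        if per_class[predicted] < quota:
            picked.append(t)
            per_class[predicted] += 1
    if len(picked) < b:  # some classes had too few candidates: top up
        chosen = set(picked)
        for t in order:
            if len(picked) == b:
                break
            if t not in chosen:
                picked.append(t)
                chosen.add(t)
    return picked


# Toy pool: the three most uncertain samples are all predicted as class 0.
P = [
    [0.50, 0.45, 0.05],
    [0.52, 0.46, 0.02],
    [0.51, 0.44, 0.05],
    [0.10, 0.55, 0.35],
    [0.05, 0.15, 0.80],
    [0.90, 0.05, 0.05],
]
query = balanced_margin_query(P, b=3)
```

In this toy pool, pure margin sampling would return three samples of the same predicted class, whereas the quota forces one sample per class. This is the kind of imbalance the balanced strategy is meant to avoid.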
Dataset description
The UCI dataset [34] is obtained from a group of 30 volunteers with a waist-mounted Samsung Galaxy S2 smartphone. The accelerometer and gyroscope signals were collected at 50Hz when subjects performed the following 6 activities: standing, sitting, laying down, walking, downstairs, and upstairs. The total number of samples is 10299.
The MotionSense dataset [35] comprises accelerometer, gyroscope, and attitude data from 24 participants, collected using an iPhone 6s kept in the user’s front pocket. The subjects performed 6 different activities (i.e., walking, jogging, downstairs, upstairs, sitting, and standing) in 15 trials under similar environments. The total number of samples is 21865.
The USCHAD dataset [36] consists of well-defined low-level daily activities suitable for healthcare scenarios. Accelerometer and gyroscope signals were collected when 14 subjects performed the following 12 activities: walking forward, walking left, walking right, going upstairs, going downstairs, running forward, jumping, sitting, standing, sleeping, going up and going down the elevator. The total number of samples is 42708.
The WISDM dataset [37] was collected in a controlled study with 29 volunteers who carried the cell phone in their pockets. The data were recorded for 6 different activities (i.e., sit, stand, walk, jog, ascend stairs, descend stairs) via an app developed for an Android phone. The total number of samples is 20846.
Implementation details
We utilize the model depicted in Figure 1 for pre-training. The projection head comprises three fully connected layers with 256, 128, and 50 units, respectively. The Adam optimizer is employed with a learning rate of 1e-3 for pre-training over 200 epochs. The temperature coefficient is 0.1, and the batch size is 1024. All samples in the dataset are used without labels during pre-training to derive the backbone network. To assess the performance of this pre-trained backbone, we adopt a fine-tuning evaluation protocol on the downstream classification tasks. The model is fine-tuned with the Adam optimizer for 200 epochs at a learning rate of 5e-4. The loss function is the cross-entropy loss and the batch size is 512. The dataset is partitioned into a 70% training set and a 30% test set. From the 70% training set, a small subset of data is randomly selected to serve as the initial labeled sample pool for fine-tuning. Following the query strategy, a specific number of hard samples are selected at each iteration to join the labeled sample pool, and fine-tuning continues until the labeling budget is reached. The evaluation metric used in the experiments is the F1 score. All experiments were repeated 10 times on different training and test splits, and the test results were averaged. All deep learning code was built on the TensorFlow platform. An NVIDIA Tesla T4 GPU with 16 GB of memory was used to accelerate training.
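For reference, the evaluation metric can be sketched as below. The text does not state how per-class F1 scores are combined, so this example assumes macro averaging (an unweighted mean over classes) and then averages over repeated runs; the function names and the toy labels are illustrative only.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging is assumed)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def mean_over_runs(runs):
    """Average the F1 score over repeated (y_true, y_pred) runs."""
    return sum(macro_f1(t, p) for t, p in runs) / len(runs)


# Toy example with three classes and two runs.
run_a = ([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
run_b = ([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 2, 2])
f1_a = macro_f1(*run_a)
f1_mean = mean_over_runs([run_a, run_b])
```

Macro averaging gives every activity the same weight regardless of how often it occurs, which matters on class-imbalanced datasets such as WISDM.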
Result analysis
The effectiveness of the backbone for CL
The performance of the backbone network has an impact on the final classification results, so the effectiveness of the backbone is first analyzed. Four backbones of CPC [23], TPN [20], DeepConvLSTM [24] and the Conv+Self-Attention proposed by us are compared. All models adopt a fine-tuning method consistent with the literature [24], randomly selecting varying proportions of labeled samples for downstream task fine-tuning. We evaluate their classification performance on two typical datasets using mean F1 score as the evaluation metric. Figure 3 illustrates the class distribution of these datasets, with UCI being balanced and WISDM exhibiting severe class imbalance. Table 1 shows the classification performance of different backbone networks on UCI and WISDM. All backbones show similar performance regardless of whether the class distribution is balanced. Both “DeepConvLSTM” and “Conv+Self-Attention” perform slightly better than “CPC” and “TPN”, and all four backbones achieve approximately 99% F1 scores with 60% labeled data. As the proportion of labeled samples in downstream tasks increases, the classification performance of the four backbones improves while their disparities diminish. This experiment substantiates that enhancing the backbone can improve the representation ability of CL and affect the final classification results, but it does not make a significant difference in reducing the label effort. In addition, the proposed backbone “Conv+Self-Attention” slightly outperforms other models.

The class distribution of the UCI dataset and the WISDM dataset
The comparison of classification performance (F1 score) with different backbones on UCI and WISDM for CL. 1%, 10% and 60% represent randomly selecting corresponding proportions of annotated data for fine-tuning
To validate the effectiveness of the proposed ACC model, we conduct comparisons among SL, CL, and our ACC methods. For each method, two pre-trained backbones with superior performance as shown in Table 1 are selected to further eliminate the influence of backbone settings on classification outcomes. The results are displayed in Table 2. While CL does not significantly surpass the SL method in final recognition effectiveness, it can reduce label effort by approximately 10%. The query strategy based on AL can substantially decrease the label requirements for specific tasks. The proposed ACC method requires approximately 10% of the labeled data to achieve the best results on the UCI, WISDM, and MotionSense datasets. Thus, it reduces the label effort to 1/7 that of SL and 1/6 that of traditional CL methods. Different from the other three datasets, the USCHAD contains 12 categories of activities and is essentially a fine-grained classification task, such as walking forward, walking left and walking right. To capture subtle differences between different categories, more labeled data is needed for fine-tuning. Furthermore, we observe that there is no significant variation in labeled size when changing backbone settings within each deep learning paradigm. Therefore, enhancing the quality of labeled data for specific tasks proves to be a more effective approach to further reduce label effort compared to optimizing the backbone.
The effectiveness comparison for different deep learning paradigms on UCI, WISDM, MotionSense, and USCHAD. The "70% (7209)" means that labeled size accounts for 70% of the entire UCI dataset, and the specific number is 7209
To assess the hard sample mining capability of the ACC model, we construct confusion matrices for the four datasets and count the number of labeled samples in each category, as depicted in Figure 4. The backbone is set to the top-performing "Conv+Self-Attention". As shown in Figure 4(a), SL confuses the "sitting" and "standing" activities in the UCI dataset, while the ACC method shows significant improvement in recognizing these confusing activities compared to CL. As highlighted by the rectangular box in the last column of Figure 4(a), the ACC model allocates most of its label budget to confusing data and includes more uncertain, difficult samples within an equivalently sized labeled sample pool. This targeted fine-tuning allows the ACC method to establish more precise decision boundaries. The classification results on WISDM and MotionSense align with those on UCI. The results on USCHAD are slightly different. From Figure 4(d), the number of labeled samples for the most confusing activities ("elevator up" and "elevator down") is four times higher than with the other methods; however, the improvement in results is not as significant as on the other three datasets. A possible explanation is that the contrastive loss tends to cluster time series based on overall similarity. Fine-grained features are usually subtle, making them less useful in distinguishing paired activity categories during contrastive pretext tasks. Consequently, boundaries between different clusters do not align well with those among fine-grained classes. This effect may be negligible when evaluating coarse-grained classes, but becomes apparent in finer-grained tasks. It is consistent with the experimental observation that the ACC model performs well on the UCI, WISDM and MotionSense (6-category) datasets, but performs poorly on the USCHAD (12-category) dataset.

Confusion matrix and number of labeled samples with SL, CL and ACC methods, respectively. The rectangle boxes in the last column indicate confusing activities. (a). The number of labeled samples for UCI is 10%. (b). The number of labeled samples for WISDM is 5%. (c). The number of labeled samples for MotionSense is 5%. (d). The number of labeled samples for USCHAD is 15%.
Furthermore, to demonstrate the advantages of the ACC model in reducing label effort, we plot the curve of F1 scores as the proportion of labeled data increases (Figure 5). Compared with SL, the advantages of CL are mainly manifested when the labeled data size is less than 20%. When the labeled samples reach more than 50%, the classification performance of CL and SL gradually equalizes. On the UCI, WISDM and MotionSense datasets, about 10% of the labeled data can get the best classification results. In the USCHAD data, 24% of the labeled data is required. This experiment shows that our proposed ACC method has the best classification performance by mining hard samples when the given label budget of HAR is small.

The curve of F1 scores with different labeled data size on the UCI, WISDM, MotionSense, and USCHAD datasets.
The proposed active query strategy considers both sample uncertainty and class balance. To verify their contribution, we conduct experiments as shown in Figure 6 to compare the effects of four different fine-tuning methods. These four methods include random sampling for fine-tuning, uncertainty sampling alone, balanced sampling alone, and combining balanced and uncertainty sampling strategies (ACC). The total labeling budgets on UCI, WISDM, MotionSense, and USCHAD are 1350, 2100, 1450, and 10400, respectively. The sampling budgets for each cycle of AL are 100, 100, 100, and 800, respectively. From Figure 6(a), the random sampling strategy used in traditional CL shows the poorest performance, whereas the balanced sampling strategy slightly outperforms random sampling. The classification performance sees a significant enhancement with uncertainty sampling method. Given the same labeling budget, mining hard samples proves effective in fine-tuning accurate boundaries. Nevertheless, focusing on samples near the decision boundary may lead to class imbalance issues. This is illustrated in Figure 6(b), where the white bar represents the number of samples labeled in each category using uncertainty sampling while the black bar corresponds to combining uncertainty and balanced sampling. The ACC model can alleviate this class imbalance problem induced by uncertainty sampling while maintaining the benefits of mining hard samples. As demonstrated in Figure 6(a), given an equal number of labeled samples across all datasets, our proposed ACC model delivers superior classification performance.

(a) The curves of F1 scores with different numbers of labeled samples for the UCI, WISDM, MotionSense, and USCHAD datasets. (b) The number of samples labeled in each category using uncertainty sampling and combining uncertainty and balanced sampling, respectively. The labeling budget is consistent with (a).
The ablation experiments further demonstrate the necessity of CL and AL for HAR, with the experimental results shown in Table 3. Given the same small annotated dataset, the F1 score of our ACC model surpasses that of SL by 3%-4% and that of traditional CL by 2%-3%. The consistent performance gain across all datasets further shows that active query strategies can be integrated with CL without performance degradation. Additionally, effective active query strategies can significantly enhance classification performance for HAR.
Ablation experiments for the proposed ACC on four datasets
We compare our ACC method with state-of-the-art methods, as shown in Table 4. The comparison methods include SL, semi-supervised learning (semi-SL), and SSL. For clarity, the best performance is highlighted in bold. Compared with SL, our method increases the F1 score by 2% on UCI and WISDM and by 5% on MotionSense, with the most significant improvement on USCHAD, where it increases by approximately 27%. We also compare three popular semi-SL methods. Although they reduce label effort to between 10% and 20%, their recognition performance diminishes significantly. Finally, our proposed method is compared with seven recently proposed SSL methods. With only 10% labeled data, ACC significantly outperforms the other SSL methods on the UCI, WISDM and MotionSense datasets. The performance on the USCHAD dataset is only 0.12% lower than that of the state-of-the-art method, ClusterCLHAR, while the F1 scores on UCI, WISDM and MotionSense are higher by 2.63%, 2.38% and 2.02%, respectively. This is consistent with the previous experimental results: CL is not well suited to the fine-grained task of USCHAD. However, for the coarse-grained datasets UCI, WISDM and MotionSense, AL and CL benefit from each other, enabling balanced mining of hard samples to achieve optimal classification performance while reducing label effort.
Comparison results of F1 score on four public datasets
This paper proposes an ACC model that utilizes a pretraining-finetuning strategy to tackle the challenge of requiring a large amount of labeled data for end-to-end learning from scratch. Our method achieves this by evenly selecting hard samples from the unlabeled sample pool of downstream tasks for fine-tuning. The proposed uncertainty-based balanced query strategy not only effectively enhances the feature representation ability of labeled data but also mitigates harmful classifier biases caused by class imbalance, thereby increasing its adaptability to practical HAR platforms. On the UCI, WISDM and MotionSense datasets, ACC requires only 13%, 9.4% and 10% labeled samples, respectively, to achieve state-of-the-art classification results. This label effort is approximately 1/6 of that of traditional CL methods and 1/7 of that of SL methods. Additionally, we observe that the CL paradigm performs poorly on the fine-grained classification task USCHAD compared to the other three coarse-grained datasets. Despite significantly increasing the label budget for confusing activities, there is no noticeable improvement in classification performance. One possible explanation is that contrastive loss tends to cluster time series based on overall similarity, making it difficult to utilize subtle fine-grained features for distinguishing between paired activity categories in contrastive pretext tasks. Consequently, the boundaries between different clusters may not align well with the boundaries between fine-grained classes. Finally, we reduce the label effort on USCHAD to 24%, which still outperforms SL and traditional CL methods. Our experiments show that ACC achieves state-of-the-art performance with high data selection efficiency. These results are encouraging because they demonstrate that it is possible to use unlabeled data to obtain effective sensor data representations, resulting in a more efficient HAR system. This work can help exploit annotation budgets for supervised fine-tuning in practical HAR applications and make a solid contribution to mobile sensor-based health monitoring systems.
This paper also has certain limitations. The proposed ACC model does not exhibit substantial improvement on fine-grained HAR tasks. Fully understanding the performance degradation caused by this “granularity gap” requires further analysis, which we leave for future work. Moreover, incorporating AL increases the time cost of the model, and accelerating the convergence of the fine-tuning model is also an issue to address in the future.
Funding
This work was supported by Changzhou Sci&Tech Program (Grant No. CJ20235026).
