Active contrastive coding reducing label effort for sensor-based human activity recognition

Abstract

Human activity recognition (HAR) plays a crucial role in remotely monitoring the health of the elderly. Human annotation is time-consuming and expensive, especially for abstract sensor data. Contrastive learning can extract robust features from weakly annotated data to promote the development of sensor-based HAR. However, current research mainly focuses on data augmentation methods and pre-trained models, disregarding the impact of data quality on the label effort required for fine-tuning. This paper proposes a novel active contrastive coding model that uses an active query strategy to select a small, class-balanced set of high-quality samples in the downstream task and uses them to update the pre-trained model. The proposed uncertainty-based balanced query strategy mines the hardest-to-distinguish samples in the unlabeled sample pool according to their posterior probabilities, and imposes class-balance constraints to keep the labeled sample pool balanced. Extensive experiments show that the proposed method consistently outperforms several state-of-the-art baselines on four mainstream HAR benchmark datasets (UCI, WISDM, MotionSense, and USCHAD). With only approximately 10% of the samples labeled, our method achieves F1-scores of 98.54%, 99.34%, 98.46%, and 87.74%, respectively.

Keywords

Contrastive learning; active learning; human activity recognition; hard sample mining; mobile medical system

1 Introduction

Sensor-based human activity recognition (HAR) infers complex human activities from the raw time series data acquired by embedded sensors [1, 2]. HAR has been widely used in monitoring systems [3, 4], ambient assisted living [5], and other intelligent medical applications [6, 7]. The sensor data stream from mobile devices is abstract and challenging for humans to interpret. Traditional machine learning methods primarily focus on developing domain-specific feature extraction techniques that incorporate prior knowledge into the learning process [8]. However, these methods rely on human creativity to devise novel features and cannot capture latent explanatory factors in low-level sensory data. To overcome these limitations, deep learning methods based on neural networks have gained popularity. Tang et al. explore the application of HAR in embedded scenarios by optimizing a convolutional neural network (CNN) with Lego filters [9] and a transformer architecture [10]. Challa et al. propose a multibranch CNN-BiLSTM network for automatic feature extraction from raw sensor data with minimal pre-processing [11]. As these deep learning models are trained in a data-driven manner, an adequately large training set is crucial for achieving high performance. However, building a large labeled dataset is very time-consuming and expensive. The labeling cost increases with the size of the data and the complexity of the task [12], posing a significant hurdle to deploying data-driven systems in mobile HAR applications. Recently, a context-aware mutual learning method (CMLHAR) was proposed for semi-supervised HAR to reduce the amount of annotation required [13]. Nevertheless, inaccurate pseudo-labels degrade the classification performance of such systems [14, 15].

Self-supervised learning (SSL) presents another paradigm for circumventing label restrictions [16]. Unlike semi-supervised learning methods, SSL does not rely on any labels during training, but instead discovers relationships between samples by exploring the intrinsic features of the data. As a form of SSL, contrastive learning (CL) learns in an unsupervised manner from a large amount of unlabeled data and is then fine-tuned on a few labeled samples [17]. Several studies have demonstrated the effectiveness of CL for sensor-based HAR [18]. Reference [19] explores the first application of CL to HAR for healthcare, finding improvements over supervised learning (SL) models with fine-tuning and random rotation. The multi-task SSL model learns a multi-task temporal convolutional network to recognize transformations applied to the input signal [20]. It demonstrates that simple auxiliary binary classification tasks provide a robust supervisory signal for extracting features useful for downstream tasks. In [21], STFNet is used as the basic building block of the backbone, and the resulting model significantly outperforms time-domain SSL approaches. Similarly, the CSSHAR model [22] and the CPCHAR model [23] replace the CL backbone with a transformer encoder and the Contrastive Predictive Coding framework, respectively. Wang et al. propose a novel resampling-based data augmentation method and compare the performance of SimCLR and MoCo [24, 25]. In general, existing research on HAR under the CL paradigm primarily focuses on data augmentation methods and pre-trained models, and has yielded certain results. In an SSL paradigm, however, fine-tuning the model for the downstream task is as crucial as pre-training a high-performance backbone. Current fine-tuning for CL often relies on an unreasonable assumption: that people already know which samples should be labeled [26]. Consequently, researchers randomly select a proportion of the samples to label for retraining. As a result, the labeled sample pool hardly represents the feature space distribution of the dataset, so a large volume of labeled data is needed before the downstream task can obtain accurate decision boundaries. Regrettably, no research has analyzed which samples should be selected for fine-tuning to make it easier to obtain a retrained model suitable for HAR healthcare tasks.

Motivated by this issue, an active contrastive coding (ACC) model is proposed for real-world monitoring scenarios. The key objective is to mine “high-quality samples” with which to fine-tune the weights of the backbone. Active learning (AL) aims to select an informative subset for labeling that achieves the highest performance within a fixed labeling budget, rather than labeling the entire dataset [27, 28]. It is suitable for sensor-based HAR as it offers another perspective for reducing label effort [29]. Core-set techniques [30] based on representativeness methods select samples by minimizing the Euclidean distance between query data and labeled samples in the feature space. However, their performance is constrained by the number of classes in the dataset, and they represent features poorly in high-dimensional spaces [31]. Uncertainty query criteria address this problem by sampling the most uncertain data, leveraging class posterior probabilities, entropy and loss prediction [27]. Recently, some researchers have begun to explore the combination of AL and self-supervised training on large-scale image classification tasks. Bengar et al. show that SSL is more effective than AL in reducing label effort on image recognition datasets [32]. Xie et al. propose an ActiveFT model for image classification that minimizes the distance between the distributions of the selected subset and the entire unlabeled pool [26]. When selecting the optimal labeled data for fine-tuning in HAR, the inherent properties of human behavior need to be considered: abstract sensor data are easily confused by the classification model, and different activities occur with different frequencies in daily life. Whether CL can benefit from AL in sensor-based HAR tasks remains an open question.

Unlike previous research on improving data augmentation and backbone performance, this study uses AL to further reduce the label effort of CL for sensor-based HAR. Instead of randomly selecting samples for fine-tuning, we employ active query strategies to select class-balanced hard samples near the decision boundary for labeling. The results demonstrate significant performance gains of the proposed ACC model in reducing label effort compared with baselines. The main contributions are as follows:

A novel ACC model has been proposed to tackle the challenges associated with annotating substantial volumes of abstract sensor data in practical HAR applications.

The proposed method can significantly mitigate the labeling effort required for fine-tuning in existing CL methods by integrating an AL query strategy into downstream tasks.

An uncertainty-based balanced query strategy is utilized to evenly mine hard samples in specific tasks, taking into account the characteristics of samples during the fine-tuning process to enhance the model’s adaptability for HAR.

2 Methods

The overall framework of the ACC method is illustrated in Figure 1. ACC consists of two main parts: pre-training a backbone network and fine-tuning the network with AL, depicted in the upper and lower parts of Figure 1, respectively. The objective of pre-training is to learn meaningful sample representations in an unsupervised manner, yielding a neural-network encoder f (· ; w₀) with pre-trained weights w₀. Pre-training comprises data augmentation, an encoder, a projection head, and a contrastive loss function, which are detailed in subsection 2.1. After pre-training, a linear classifier is added on top of the learned encoder for supervised fine-tuning with the initial labeled data. Then, we run an AL loop in which the fine-tuned model evenly selects the most confusing samples and queries them for labeling. The iteration continues until the specified labeling budget is reached, resulting in the final classification model f (· ; w_f) with updated parameters w_f. The details of the active query strategy are provided in subsection 2.2.

Fig. 1

Overview of the proposed active contrastive coding (ACC) framework. The framework consists of 3 stages. (i) A backbone is trained in an unsupervised way using the entire dataset. (ii) Given a few labeled data, the backbone with a linear classifier is fine-tuned in a supervised way. (iii) The model is run on the unlabeled data, and the samples are ranked by the active query strategy. Finally, the samples that meet the selection criteria are sent to annotators for labeling and added to the labeled set. Stages (ii) and (iii) are repeated until the labeling budget is exhausted.

2.1 Backbone network for pre-training

We first present the generation process of the pre-trained model, which comprises four parts. Unlike existing CL work, the second part is enhanced with an attention-based feature extraction mechanism to obtain more robust feature representations. The pre-training of the backbone network is depicted in the upper part of Figure 1:

The first part is data augmentation to maximize consistency between augmented views. Three-dimensional rotation and resampling serve as the fundamental augmentation methods. This is attributed to the fact that three-dimensional rotation can effectively simulate sensor shaking during data collection. Resampling introduces variable domain information and simulates real activity data by changing the sampling frequency to maximize the coverage of the sampling space. The original input data X generates two views ${\hat{X}}_{i}, {\hat{X}}_{j}$ of each sample by applying two transformations.
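The two augmentations described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes a window is a list of tri-axial samples [x, y, z], the distributions of the rotation axis and angle are assumptions, resampling is approximated by linear interpolation onto a denser time grid, and the function names `random_rotation` and `resample` are hypothetical.

```python
import math
import random


def rotation_matrix(axis, angle):
    """Rodrigues rotation matrix for a rotation of `angle` radians about `axis`."""
    x, y, z = axis
    norm = math.sqrt(x * x + y * y + z * z)
    x, y, z = x / norm, y / norm, z / norm
    c, s = math.cos(angle), math.sin(angle)
    k = 1.0 - c
    return [
        [c + x * x * k, x * y * k - z * s, x * z * k + y * s],
        [y * x * k + z * s, c + y * y * k, y * z * k - x * s],
        [z * x * k - y * s, z * y * k + x * s, c + z * z * k],
    ]


def random_rotation(window, rng):
    """Apply one random 3D rotation to every [x, y, z] sample of a window."""
    axis = [rng.gauss(0.0, 1.0) for _ in range(3)]  # assumed: isotropic random axis
    angle = rng.uniform(-math.pi, math.pi)          # assumed: full angle range
    rot = rotation_matrix(axis, angle)
    return [[sum(rot[i][j] * sample[j] for j in range(3)) for i in range(3)]
            for sample in window]


def resample(window, factor):
    """Linearly interpolate a window onto a time grid `factor` times as dense."""
    last = len(window) - 1
    new_len = int(round(last * factor)) + 1
    out = []
    for n in range(new_len):
        pos = min(n / factor, float(last))  # position on the original time axis
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append([(1 - frac) * a + frac * b
                    for a, b in zip(window[lo], window[hi])])
    return out


rng = random.Random(0)
window = [[0.0, 0.0, 0.0], [2.0, 2.0, 2.0], [4.0, 4.0, 4.0]]
view_i = random_rotation(window, rng)  # first augmented view
view_j = resample(window, 2.0)         # second augmented view
```

A rotation leaves the magnitude of each sample unchanged, which is why it is a plausible stand-in for a change in sensor orientation, whereas resampling changes the apparent sampling frequency of the window.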

The second part is a neural-network encoder f (·) that maps the augmented samples into a low-dimensional representation. We use a 1D convolutional encoder with batch normalization and ReLU activation as the base model, as depicted in Figure 2. The output of the CNN is then processed by a self-attention encoder, whose purpose is to learn weight coefficients that capture the relationship between inputs at different time steps and to perform weighted encoding of the features extracted by the CNN. Let ${\hat{X}}_{conv}$ denote the representation of the augmented data $\hat{X}$ after one-dimensional convolution. The attention score $\alpha_{t}$ (t = 1, 2, ..., T) signifies the contribution of the feature ${\hat{X}}_{conv}$ to the subsequent layer and is calculated by Eq. (1),

$\alpha_{t} = \mathrm{softmax} (W_{v} \tanh (W_{u} {\hat{X}}_{conv}^{T}))$ (1)

where ${\hat{X}}_{conv}^{T}$ denotes the transpose of ${\hat{X}}_{conv}$, W_u and W_v are the weight matrices of the self-attention module, and the function softmax (·) ensures that all computed weights sum to 1. The final feature vector h encoded via the self-attention mechanism is a linear weighted sum of ${\hat{X}}_{conv}$ with the attention scores $\alpha_{t}$ as weights, as shown in Eq. (2),

$h = \sum_{t = 1}^{T} \alpha_{t} {\hat{X}}_{conv}$ (2)

Subsequently, the encoded features are flattened for nonlinear spatial mapping through the projection head.
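Eqs. (1) and (2) can be illustrated with a small attention-pooling sketch. The shapes are assumptions, since the text does not state them: the CNN output is taken to be a list of T feature vectors of length d, `w_u` a matrix with one row per attention unit, and `w_v` a vector with one entry per attention unit, so that each time step receives a single score. The function name `attention_pool` is hypothetical.

```python
import math


def attention_pool(x_conv, w_u, w_v):
    """Attention-weighted sum over time, following Eqs. (1) and (2).

    x_conv: T feature vectors of length d (CNN output, one per time step)
    w_u:    attention weight matrix, one row of length d per attention unit
    w_v:    attention weight vector, one entry per attention unit
    """
    scores = []
    for x_t in x_conv:
        hidden = [math.tanh(sum(w * x for w, x in zip(row, x_t))) for row in w_u]
        scores.append(sum(v * h for v, h in zip(w_v, hidden)))

    # Softmax over the T time steps (shifted by the maximum for stability).
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]

    # Eq. (2): weighted sum of the per-step features.
    dim = len(x_conv[0])
    h = [sum(a * x_t[k] for a, x_t in zip(alpha, x_conv)) for k in range(dim)]
    return alpha, h


x_conv = [[1.0, 2.0], [3.0, 4.0]]  # toy CNN output with T = 2, d = 2
alpha, h = attention_pool(x_conv, w_u=[[0.5, -0.5]], w_v=[1.0])
```

With all attention weights set to zero, every time step receives the same score and the pooled vector reduces to the plain average of the per-step features.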

The third part is the projection head, which is a multi-layer perceptron (MLP) with a hidden layer. It maps the representations learned by the encoder into another space where contrastive loss is applied, resulting in z_i = g (h_i) and z_j = g (h_j).

The final part defines the contrastive loss function for the contrastive task. In our method, the normalized temperature-scaled cross-entropy loss (NT-Xent) [33] is used as the loss function. With these four components, samples from the entire dataset undergo unsupervised pre-training to derive the backbone f (· ; w₀), where w₀ represents the pre-trained parameters of the backbone.
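For readers unfamiliar with NT-Xent, the loss can be sketched in a few lines. This is an illustrative plain-Python version, not the training code used in the paper: it assumes `z_i[k]` and `z_j[k]` are the projections of the two views of sample k, uses cosine similarity, and defaults the temperature to the 0.1 reported in the implementation details. The helper names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def nt_xent(z_i, z_j, temperature=0.1):
    """NT-Xent loss averaged over all 2N augmented views of a batch."""
    z = list(z_i) + list(z_j)
    n = len(z_i)
    total = 0.0
    for a in range(2 * n):
        positive = (a + n) % (2 * n)  # index of the other view of the same sample
        logits = [cosine(z[a], z[k]) / temperature
                  for k in range(2 * n) if k != a]
        pos_logit = cosine(z[a], z[positive]) / temperature
        # log-sum-exp over all other views, shifted for numerical stability
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - pos_logit
    return total / (2 * n)


views = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = nt_xent(views, views)           # each view matches its partner
mismatched_loss = nt_xent(views, views[::-1])  # partners point in other directions
```

The loss is small when the two views of a sample are close to each other and far from the views of the other samples in the batch, and large otherwise.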

Fig. 2

Convolutional encoder structure with self-attention

2.2 Model fine-tuning based on active learning

The objective of ACC is to achieve the best downstream performance with a small amount of labeled data. Specifically, we adopt a batch-mode AL scheme. In each AL iteration, the proposed uncertainty-based balanced sampling strategy selects b hard samples from the unlabeled sample pool and moves them to the labeled sample pool, until the labeling budget is exhausted. When AL selects data for labeling, the uncertainty of the samples and the class distribution of the dataset need to be considered simultaneously. Selecting a similar number of confusing samples from each category ensures that the fine-tuned model recognizes all classes well. As the data in the unlabeled sample pool D_unlabel lack class labels, we use the pre-trained backbone f (· ; w₀) to generate a probability matrix that estimates the data distribution, as shown in Eq. (3),

$P = \begin{bmatrix} p_{11} & p_{12} & \dots & p_{1N} \\ p_{21} & p_{22} & \dots & p_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ p_{n1} & p_{n2} & \dots & p_{nN} \end{bmatrix} \in R^{n \times N}$ (3)

where N is the number of categories and n is the number of samples in the unlabeled sample pool. Each row of P sums to 1. The one-hot encoding P_onehot of the probability matrix P is given in Eq. (4),

$P_{onehot} = \mathrm{onehot} [\arg\max (P)]$ (4)

where arg max(P) returns the index of the maximum value in each row of P. Using C_i ∈ {0, 1} to indicate whether sample i is selected during the query process, the difference between the expected distribution and the estimated distribution is given in Eq. (5),

$l_{equalization} = l (\Omega, P_{onehot}^{T} C) = {\| \Omega - P_{onehot}^{T} C \|}_{1}$ (5)

where $P_{onehot}^{T} C$ is the estimated distribution and Ω is the expected distribution. Given the data already labeled, Ω_j (j ∈ [1, N]) is the number of samples to select from class j so that the class distribution of the labeled sample pool becomes equal, as shown in Eq. (6),

$\begin{cases} \Omega = \max (D_{label} / N - K, 0) \\ K = [k_{1}, k_{2}, \dots, k_{N}] \end{cases}$ (6)

where the maximum is taken element-wise, D_label is the total number of samples that the labeled sample pool should reach in this iteration, and k_j is the number of samples of class j that had been labeled by the end of the last iteration.
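The quantities in Eqs. (4) to (6) can be made concrete with a toy example. The sketch below is illustrative only; the function names are hypothetical, and the maximum in Eq. (6) is read as an element-wise maximum against zero, which is an interpretation of the notation rather than something the text states explicitly.

```python
def one_hot_predictions(prob_matrix):
    """Eq. (4): one-hot encode the arg max of every row of P."""
    encoded = []
    for row in prob_matrix:
        top = max(range(len(row)), key=row.__getitem__)
        encoded.append([1 if j == top else 0 for j in range(len(row))])
    return encoded


def expected_quota(target_pool_size, labeled_counts):
    """Eq. (6): per-class quota max(D_label / N - k_j, 0)."""
    per_class = target_pool_size / len(labeled_counts)
    return [max(per_class - k, 0) for k in labeled_counts]


def balance_loss(quota, one_hot, selection):
    """Eq. (5): L1 distance between the quota and P_onehot^T C."""
    estimated = [sum(c * row[j] for c, row in zip(selection, one_hot))
                 for j in range(len(quota))]
    return sum(abs(q - e) for q, e in zip(quota, estimated))


# Toy pool: 4 unlabeled samples, 2 classes.
P = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
P_onehot = one_hot_predictions(P)

# 3 classes already labeled 30, 10 and 20 times; the pool should grow to 90.
quota_three_classes = expected_quota(90, [30, 10, 20])

# 2 classes labeled 1 and 1 times; the pool should grow to 4, so one more each.
quota = expected_quota(4, [1, 1])
balanced = balance_loss(quota, P_onehot, [1, 0, 1, 0])    # one sample per class
unbalanced = balance_loss(quota, P_onehot, [1, 1, 0, 0])  # both from class 0
```

In the three-class example the already well-represented first class receives a quota of zero, so the whole batch is directed to the two under-represented classes.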

In addition to constraining the class distribution of the labeled data, this paper utilizes uncertainty as a measure of sample information. Commonly used methods for measuring uncertainty based on posterior probability include the least confidence strategy, the entropy-based sampling strategy and the margin sampling strategy. The least confidence strategy only considers the category with the highest probability and disregards other categories. The entropy-based sampling strategy considers all categories but is easily influenced by individual categories. Therefore, we choose margin sampling to quantify data uncertainty, which selects the samples with the smallest difference in probability between the highest and second highest model predictions, as illustrated in Eq. (7),

$H = {\arg\min}_{x} (P (y_{1} | x) - P (y_{2} | x))$ (7)

where P (y₁|x) and P (y₂|x) are the highest and second highest predicted probabilities of the sample x, respectively. The matrix form of the uncertainty loss can be expressed as Eq. (8),

$l_{uncertainty} = \sum_{\{i | C_{i} = 1\}} H (x_{i}) = C^{T} (P_{1} - P_{2}) 1_{N \times 1}$ (8)

where 1_N×1 is an all-ones column vector. In order to obtain balanced fine-tuning data with high uncertainty, the objective in Eq. (9) needs to be minimized,

$\begin{matrix} \min (l_{equalization} + l_{uncertainty}) \\ = \min_{C} \{ C^{T} (P_{1} - P_{2}) 1_{N \times 1} + \lambda {\| \Omega - P_{onehot}^{T} C \|}_{1} \} \\ s.t. \; C^{T} 1_{N \times 1} = b, \; C_{i} \in \{0, 1\} \end{matrix}$ (9)

where b is the sampling budget for each cycle, and λ is a parameter that regularizes the contribution of the balancing term in the objective. We set λ = 1 in our experiments. The constraint ensures that the number of samples selected by AL equals the sampling budget. The underlying problem of Eq. (9) is a Binary Integer Programming (BIP) problem, which we solve with the GLPK_MI solver in the CVXPY toolkit. Solving Eq. (9) yields the optimal samples to label. We then unfreeze the backbone, add a linear classification layer at the top of the model, and use this model for supervised fine-tuning on the labeled sample pool. The pseudo-code for our ACC model is shown in Algorithm 1.
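To show what the objective in Eq. (9) trades off, the sketch below evaluates it exhaustively on a tiny pool. Note the substitution: the paper solves the binary integer program with a mixed-integer solver, whereas this illustration simply enumerates every subset of size b, which is only feasible for a handful of samples. It also assumes that the uncertainty term sums the per-sample margins P₁ − P₂ of the selected samples; the function names are hypothetical.

```python
from itertools import combinations


def margin(row):
    """Difference between the highest and second highest class probability."""
    top_two = sorted(row, reverse=True)[:2]
    return top_two[0] - top_two[1]


def query_objective(chosen, prob_matrix, quota, lam):
    """Objective of Eq. (9) for one candidate selection (a tuple of indices)."""
    uncertainty = sum(margin(prob_matrix[i]) for i in chosen)
    counts = [0] * len(quota)
    for i in chosen:
        row = prob_matrix[i]
        counts[max(range(len(row)), key=row.__getitem__)] += 1
    balance = sum(abs(q - c) for q, c in zip(quota, counts))
    return uncertainty + lam * balance


def select_batch(prob_matrix, quota, budget, lam=1.0):
    """Brute-force minimizer: stands in for the integer-programming solver."""
    candidates = combinations(range(len(prob_matrix)), budget)
    return min(candidates,
               key=lambda chosen: query_objective(chosen, prob_matrix, quota, lam))


# Toy pool: samples 0 and 1 are the most uncertain, but both look like class 0.
P = [[0.525, 0.475], [0.55, 0.45], [0.2, 0.8],
     [0.4, 0.6], [0.9, 0.1], [0.05, 0.95]]
quota = [1, 1]  # one more sample wanted from each class

with_balance = select_batch(P, quota, budget=2, lam=1.0)
uncertainty_only = select_batch(P, quota, budget=2, lam=0.0)
```

With λ = 0 the selection contains the two smallest margins regardless of class, while with λ = 1 it contains the most uncertain sample of each predicted class, which is the behavior the balancing term is meant to produce.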

Algorithm 1 Pseudo-code for ACC

Input: Unlabeled data pool D_unlabel, pre-trained model f (· ; w₀), annotation budget B, budget per cycle b, initial labeled set b₀, and the parameter λ.

Output: the fine-tuning model f (· ; w_f).

1: Initialization: Initial labeled data pool |D_label| = b₀, n = 1.

2: f (· ; w_f) ← f (· ; w₀) adding a linear classifier on top of it.

3: while |D_label| < B do

4: f (· ; w_f) ← fine-tuning model f (· ; w_f) on D_label

5: Use f (· ; w_f) to compute probabilities P for x_t ∈ D_unlabel

6: Obtain the highest probability P₁, and the second highest probability P₂.

7: Compute P_onehot from Eq.(4)

8: Compute Ω from Eq.(6)

9: Solve Eq.(9) to obtain C

10: Query C to human annotator

11: D_label ← D_label ∪ C, D_unlabel ← D_unlabel ∖ C

12: n ← n + 1

13: end while
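The control flow of Algorithm 1 can be sketched as a loop over caller-supplied components. Everything here is a hypothetical skeleton: `fine_tune`, `predict_proba`, `select` and `annotate` stand in for the fine-tuning step, the probability computation, the solution of Eq. (9) and the human annotator, and the stubs at the bottom exist only to exercise the loop. One assumption goes beyond the pseudo-code: the sketch fine-tunes once more after the last query so that the returned model has seen the full labeled pool.

```python
def acc_fine_tuning_loop(unlabeled, labeled, budget, per_cycle,
                         fine_tune, predict_proba, select, annotate):
    """Query loop of Algorithm 1 with pluggable (hypothetical) components.

    unlabeled: list of samples; labeled: list of (sample, label) pairs
    fine_tune(labeled) -> model
    predict_proba(model, samples) -> list of per-class probability rows
    select(probs, labeled, b) -> indices into `unlabeled` (at most b of them)
    annotate(sample) -> label
    """
    unlabeled, labeled = list(unlabeled), list(labeled)
    while len(labeled) < budget and unlabeled:
        model = fine_tune(labeled)                               # step 4
        probs = predict_proba(model, unlabeled)                  # steps 5-6
        b = min(per_cycle, budget - len(labeled), len(unlabeled))
        chosen = set(select(probs, labeled, b))                  # steps 7-9
        if not chosen:
            break  # nothing selected: stop instead of looping forever
        labeled += [(unlabeled[i], annotate(unlabeled[i]))       # step 10
                    for i in sorted(chosen)]
        unlabeled = [x for i, x in enumerate(unlabeled)          # step 11
                     if i not in chosen]
    # Assumption: one final fine-tuning pass on the complete labeled pool.
    return fine_tune(labeled), labeled, unlabeled


# Stub components, for illustration only.
model, labeled, unlabeled = acc_fine_tuning_loop(
    unlabeled=list(range(1, 11)),
    labeled=[(0, 0)],
    budget=5,
    per_cycle=2,
    fine_tune=lambda pool: len(pool),               # "model" = pool size
    predict_proba=lambda m, xs: [[0.5, 0.5] for _ in xs],
    select=lambda probs, pool, b: range(b),         # take the first b samples
    annotate=lambda x: x % 2,
)
```

In a real run, `select` would wrap the solver for Eq. (9) and `fine_tune` would continue training the backbone with its linear classifier on the labeled pool.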

3 Experiments

3.1 Dataset description

The UCI dataset [34] is obtained from a group of 30 volunteers with a waist-mounted Samsung Galaxy S2 smartphone. The accelerometer and gyroscope signals were collected at 50Hz when subjects performed the following 6 activities: standing, sitting, laying down, walking, downstairs, and upstairs. The total number of samples is 10299.

The MotionSense dataset [35] comprises accelerometer, gyroscope, and attitude data from 24 participants, collected using an iPhone 6s kept in the user’s front pocket. The subjects performed 6 different activities (walking, jogging, downstairs, upstairs, sitting, and standing) in 15 trials under similar environments. The total number of samples is 21865.

The USCHAD dataset [36] consists of well-defined low-level daily activities suitable for healthcare scenarios. Accelerometer and gyroscope signals were collected while 14 subjects performed the following 12 activities: walking forward, walking left, walking right, going upstairs, going downstairs, running forward, jumping, sitting, standing, sleeping, elevator up, and elevator down. The total number of samples is 42708.

The WISDM dataset [37] was collected in a controlled study with 29 volunteers who carried the cell phone in their pockets. The data were recorded for 6 different activities (i.e., sit, stand, walk, jog, ascend stairs, descend stairs) via an app developed for an Android phone. The total number of samples is 20846.

3.2 Implementation details

We utilize the model depicted in Figure 1 for pre-training. The projection head comprises three fully connected layers with 256, 128, and 50 units, respectively. The Adam optimizer is employed with a learning rate of 1e-3 for 200 pre-training epochs. The temperature coefficient is 0.1, and the batch size is 1024. All samples in the dataset are used without labels during pre-training to derive the backbone network. To assess the performance of this pre-trained backbone, we adopt a fine-tuning evaluation protocol on the downstream classification tasks. The model is fine-tuned with the Adam optimizer for 200 epochs at a learning rate of 5e-4, using cross-entropy loss and a batch size of 512. The dataset is partitioned into a 70% training set and a 30% test set. From the training set, a small subset of data is randomly selected as the initial labeled sample pool for fine-tuning. Following the query strategy, a specific number of hard samples are selected at each iteration to join the labeled sample pool, and fine-tuning continues until the labeling budget is reached. The evaluation metric is the F1 score. All experiments were repeated 10 times on different training and test splits, and the test results were averaged. All deep learning code was built on the TensorFlow platform. An NVIDIA Tesla T4 GPU with 16GB memory was used to accelerate training.
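As a reference for the evaluation metric, the sketch below computes an F1 score averaged over classes. The averaging scheme is an assumption: the text reports an F1 score (and later a mean F1 score) without specifying it, so the unweighted per-class average (macro-F1) is used here purely for illustration, and the function name `macro_f1` is hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (assumed averaging scheme)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


score = macro_f1(y_true=[0, 0, 1, 1], y_pred=[0, 1, 1, 1])
```

Macro averaging weights every activity equally, which matters on class-imbalanced datasets where rare activities would otherwise contribute little to the score.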

3.3 Result analysis

3.3.1 The effectiveness of the backbone for CL

The performance of the backbone network has an impact on the final classification results, so the effectiveness of the backbone is first analyzed. Four backbones of CPC [23], TPN [20], DeepConvLSTM [24] and the Conv+Self-Attention proposed by us are compared. All models adopt a fine-tuning method consistent with the literature [24], randomly selecting varying proportions of labeled samples for downstream task fine-tuning. We evaluate their classification performance on two typical datasets using mean F1 score as the evaluation metric. Figure 3 illustrates the class distribution of these datasets, with UCI being balanced and WISDM exhibiting severe class imbalance. Table 1 shows the classification performance of different backbone networks on UCI and WISDM. All backbones show similar performance regardless of whether the class distribution is balanced. Both “DeepConvLSTM” and “Conv+Self-Attention” perform slightly better than “CPC” and “TPN”, and all four backbones achieve approximately 99% F1 scores with 60% labeled data. As the proportion of labeled samples in downstream tasks increases, the classification performance of the four backbones improves while their disparities diminish. This experiment substantiates that enhancing the backbone can improve the representation ability of CL and affect the final classification results, but it does not make a significant difference in reducing the label effort. In addition, the proposed backbone “Conv+Self-Attention” slightly outperforms other models.

Fig. 3

The class distribution of the UCI dataset and the WISDM dataset

Table 1

The comparison of classification performance (F1 score) with different backbones on UCI and WISDM for CL. 1%, 10% and 60% represent randomly selecting corresponding proportions of annotated data for fine-tuning

Backbone		UCI			WISDM
	1%	10%	60%	1%	10%	60%
CPC [23]	0.8904	0.9568	0.9862	0.8615	0.9603	0.9922
TPN [20]	0.8934	0.9575	0.9861	0.8791	0.9649	0.9938
DeepConvLSTM [24]	0.8886	0.9523	0.9891	0.8620	0.9711	0.9948
Conv+Self-Attention(ours)	0.8948	0.9634	0.9893	0.8794	0.9731	0.9959

3.3.2 The effectiveness of the proposed ACC model

To validate the effectiveness of the proposed ACC model, we conduct comparisons among SL, CL, and our ACC methods. For each method, two pre-trained backbones with superior performance as shown in Table 1 are selected to further eliminate the influence of backbone settings on classification outcomes. The results are displayed in Table 2. While CL does not significantly surpass the SL method in final recognition effectiveness, it can reduce label effort by approximately 10%. The query strategy based on AL can substantially decrease the label requirements for specific tasks. The proposed ACC method requires approximately 10% of the labeled data to achieve the best results on the UCI, WISDM, and MotionSense datasets. Thus, it reduces the label effort to 1/7 that of SL and 1/6 that of traditional CL methods. Different from the other three datasets, the USCHAD contains 12 categories of activities and is essentially a fine-grained classification task, such as walking forward, walking left and walking right. To capture subtle differences between different categories, more labeled data is needed for fine-tuning. Furthermore, we observe that there is no significant variation in labeled size when changing backbone settings within each deep learning paradigm. Therefore, enhancing the quality of labeled data for specific tasks proves to be a more effective approach to further reduce label effort compared to optimizing the backbone.
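The reduction factors quoted above follow from the labeled sizes in Table 2, taking the approximately 10% of labeled data used by ACC on UCI, WISDM and MotionSense against the 70% used by SL and the 60% used by CL:

$\frac{10\%}{70\%} = \frac{1}{7} \approx 0.14, \qquad \frac{10\%}{60\%} = \frac{1}{6} \approx 0.17$

For USCHAD, where ACC uses 24% of the data, the same calculation gives a smaller reduction:

$\frac{24\%}{70\%} \approx 0.34, \qquad \frac{24\%}{60\%} = 0.40$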

Table 2
The effectiveness comparison for different deep learning paradigms on UCI, WISDM, MotionSense, and USCHAD. The "70% (7209)" means that labeled size accounts for 70% of the entire UCI dataset, and the specific number is 7209

Model		UCI		WISDM		MotionSense		USCHAD	
Backbone	Type	Labeled size	F1 score	Labeled size	F1 score	Labeled size	F1 score	Labeled size	F1 score
DeepConvLSTM	SL	70% (7209)	0.9782	70% (14592)	0.9933	70% (15305)	0.9864	70% (29895)	0.9180
Conv+Self-Attention	SL	70% (7209)	0.9838	70% (14592)	0.9973	70% (15305)	0.9886	70% (29895)	0.9215
DeepConvLSTM	CL	60% (6180)	0.9891	60% (12327)	0.9948	60% (13119)	0.9863	60% (25624)	0.9214
Conv+Self-Attention	CL	60% (6180)	0.9893	60% (12327)	0.9959	60% (13119)	0.9862	60% (25624)	0.9248
DeepConvLSTM	ACC	13% (1350)	0.9858	10% (2100)	0.9927	9.4% (1450)	0.9842	24% (10400)	0.9201
Conv+Self-Attention	ACC	13% (1350)	0.9883	10% (2100)	0.9934	9.4% (1450)	0.9846	24% (10400)	0.9254

To assess the hard sample mining capability of the ACC model, we construct confusion matrices for the four datasets and count the number of labeled samples in each category, as depicted in Figure 4. The backbone is set to the top-performing "Conv+Self-Attention". As shown in Figure 4(a), SL confuses the "sitting" and "standing" activities in the UCI dataset, while the ACC method recognizes these confusing activities significantly better than CL. As highlighted by the rectangular box in the last column of Figure 4(a), the ACC model allocates most of its label budget to confusing data and includes more uncertain, difficult samples within a labeled sample pool of the same size. This targeted fine-tuning allows the ACC method to establish more precise decision boundaries. The classification results on WISDM and MotionSense align with those on UCI. The results on USCHAD are slightly different. From Figure 4(d), the number of labeled samples for the most confusing activities ("elevator up" and "elevator down") is four times higher than with the other methods; however, the improvement in results is not as significant as on the other three datasets. The explanation is that contrastive loss tends to cluster time series based on overall similarity. Fine-grained features are usually subtle, making them less useful for distinguishing paired activity categories during contrastive pretext tasks. Consequently, the boundaries between clusters do not align well with those between fine-grained classes. This effect may be negligible for coarse-grained classes, but becomes apparent in finer-grained tasks. It is consistent with the experimental observation that the ACC model performs well on the UCI, WISDM and MotionSense (6-category) datasets, but shows smaller gains on the USCHAD (12-category) dataset.

Fig. 4

Confusion matrices and numbers of labeled samples for the SL, CL and ACC methods. The rectangular boxes in the last column indicate confusing activities. (a) UCI with 10% labeled samples. (b) WISDM with 5% labeled samples. (c) MotionSense with 5% labeled samples. (d) USCHAD with 15% labeled samples.

Furthermore, to demonstrate the advantages of the ACC model in reducing label effort, we plot the F1 score against the proportion of labeled data (Figure 5). Compared with SL, the advantages of CL appear mainly when less than 20% of the data are labeled. Once more than 50% of the samples are labeled, the classification performance of CL and SL gradually converges. On the UCI, WISDM and MotionSense datasets, about 10% labeled data suffices to reach the best classification results, whereas USCHAD requires 24%. This experiment shows that, by mining hard samples, the proposed ACC method achieves the best classification performance when the label budget for HAR is small.

Fig. 5

The curve of F1 scores with different labeled data size on the UCI, WISDM, MotionSense, and USCHAD datasets.

3.3.3 The effectiveness of the active query strategy

The proposed active query strategy considers both sample uncertainty and class balance. To verify their contribution, we conduct experiments as shown in Figure 6 to compare the effects of four different fine-tuning methods. These four methods include random sampling for fine-tuning, uncertainty sampling alone, balanced sampling alone, and combining balanced and uncertainty sampling strategies (ACC). The total labeling budgets on UCI, WISDM, MotionSense, and USCHAD are 1350, 2100, 1450, and 10400, respectively. The sampling budgets for each cycle of AL are 100, 100, 100, and 800, respectively. From Figure 6(a), the random sampling strategy used in traditional CL shows the poorest performance, whereas the balanced sampling strategy slightly outperforms random sampling. The classification performance sees a significant enhancement with uncertainty sampling method. Given the same labeling budget, mining hard samples proves effective in fine-tuning accurate boundaries. Nevertheless, focusing on samples near the decision boundary may lead to class imbalance issues. This is illustrated in Figure 6(b), where the white bar represents the number of samples labeled in each category using uncertainty sampling while the black bar corresponds to combining uncertainty and balanced sampling. The ACC model can alleviate this class imbalance problem induced by uncertainty sampling while maintaining the benefits of mining hard samples. As demonstrated in Figure 6(a), given an equal number of labeled samples across all datasets, our proposed ACC model delivers superior classification performance.

Fig. 6

(a) The curve of F1 scores with different numbers of labeled samples for the UCI, WISDM, MotionSense, and USCHAD datasets. (b) The number of samples labeled in each category using uncertainty sampling alone and using combined uncertainty and balanced sampling, respectively. The labeling budget is consistent with (a).

3.3.4 Ablation experiments

The ablation experiments further demonstrate the necessity of both CL and AL for HAR; the results are shown in Table 3. Given the same small annotated dataset, the F1 score of our ACC model surpasses that of SL by roughly 2.5 to 4.2 percentage points and that of traditional CL by roughly 1.9 to 2.9 percentage points. The consistent gain across all four datasets further confirms that active query strategies can be integrated with CL without performance degradation, and that an effective active query strategy can significantly enhance classification performance for HAR.

Table 3
Ablation experiments for the proposed ACC on four datasets

Datasets (ratio of training sets)	UCI (13%)	WISDM (10%)	MotionSense (9.4%)	USCHAD (24%)
SL	0.9502	0.9606	0.9598	0.8834
CL	0.9657	0.9735	0.9653	0.8962
CL+balanced sampling	0.9644	0.9771	0.9710	0.9039
CL+uncertainty sampling	0.9880	0.9845	0.9814	0.9153
ACC	0.9883	0.9934	0.9846	0.9254

3.3.5 Contrastive analysis with existing work

We compare our ACC method with state-of-the-art methods, as shown in Table 4. The comparison methods include SL, semi-supervised learning (semi-SL), and SSL. For clarity, we have highlighted the best performance in bold. Compared with the SL methods, our method increases the F1 score by about 2 percentage points on UCI and WISDM and about 5 on MotionSense, with the largest improvement on USCHAD, approximately 27 percentage points. We also compare against three popular semi-SL methods. Although they reduce the label effort to between 10% and 20%, their recognition performance drops significantly. Finally, our proposed method is compared with seven recently proposed SSL methods. With only 10% labeled data, ACC significantly outperforms the other SSL methods on the UCI, WISDM and MotionSense datasets. Relative to the state-of-the-art method ClusterCLHAR, the F1 score on USCHAD is only 0.12 percentage points lower, while the F1 scores on UCI, WISDM and MotionSense are higher by 2.63, 2.38 and 2.02 percentage points, respectively. This is consistent with the previous experimental results: CL is not well suited to the fine-grained task of USCHAD. For the coarse-grained datasets UCI, WISDM and MotionSense, however, AL and CL benefit from each other, enabling balanced mining of hard samples that achieves the best classification performance while reducing label effort.

Table 4
Comparison results of F1 score on four public datasets

Methods	Type	UCI	WISDM	MotionSense	USCHAD	Label effort
TPN [20]	SL	0.9427	–	0.9300	0.5560	–
Lego CNN [9]	SL	0.9627	0.9751	–	–	–
Transformer [22]	SL	0.9526	–	–	0.6056	–
CNN-BiLSTM [11]	SL	0.9631	0.9604	–	–	–
SemiC-HAR [14]	Semi-SL	0.9264	0.9006	0.9393	–	10%
SelfHAR [15]	Semi-SL	0.8915	0.8780	0.9062	–	10%
CMLHAR [13]	Semi-SL	0.9350	0.8813	–	–	20%
CPCHAR [23]	SSL	0.8165	–	0.8905	0.5201	80%
CSSHAR [22]	SSL	0.9114	–	–	0.5776	80%
Multi-task SSL [20]	SSL	0.8987	0.8686	0.9005	–	80%
SimCLRHAR [19]	SSL	0.7827	–	0.9577	0.8332	10%
SimCLRHAR(resampling) [24]	SSL	0.9558	–	0.9608	0.8550	10%
NNCLR [18]	SSL	0.9555	–	0.9640	0.8707	10%
ClusterCLHAR [25]	SSL	0.9591	0.9696	0.9644	0.8786	10%
ACC(ours)	SSL	0.9854	0.9934	0.9846	0.8774	10%
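The margins over ClusterCLHAR quoted in the comparison can be recomputed directly from Table 4. The short check below is only an illustration; the dictionary and function names are hypothetical, and the scores are copied from the ACC and ClusterCLHAR rows of the table.

```python
# F1 scores copied from Table 4 (both methods use 10% label effort).
acc = {"UCI": 0.9854, "WISDM": 0.9934, "MotionSense": 0.9846, "USCHAD": 0.8774}
cluster_clhar = {"UCI": 0.9591, "WISDM": 0.9696, "MotionSense": 0.9644, "USCHAD": 0.8786}

def margins(ours, baseline):
    """Difference in F1 score, expressed in percentage points, per dataset."""
    return {name: round((ours[name] - baseline[name]) * 100, 2) for name in ours}

gaps = margins(acc, cluster_clhar)
for name, gap in gaps.items():
    print(f"{name}: {gap:+.2f} percentage points")
```

A positive value means ACC is ahead of ClusterCLHAR on that dataset; only USCHAD yields a negative margin.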

4 Conclusion

This paper proposes an ACC model that uses a pretraining-finetuning strategy to address the need for large amounts of labeled data in end-to-end learning from scratch. Our method achieves this by evenly selecting hard samples from the unlabeled sample pool of downstream tasks for fine-tuning. The proposed uncertainty-based balanced query strategy not only enhances the feature representation ability of labeled data but also mitigates harmful classifier biases caused by class imbalance, thereby increasing its adaptability to practical HAR platforms. On the UCI, WISDM and MotionSense datasets, ACC requires only 13%, 10% and 9.4% labeled samples, respectively, to achieve state-of-the-art classification results. This label effort is approximately 1/6 of that of traditional CL methods and 1/7 of that of SL methods. Additionally, we observe that the CL paradigm performs poorly on the fine-grained classification task USCHAD compared with the other three coarse-grained datasets. Despite a significantly larger label budget for confusing activities, there is no noticeable improvement in classification performance. One possible explanation is that contrastive loss tends to cluster time series based on overall similarity, making it difficult to exploit subtle fine-grained features for distinguishing between paired activity categories in contrastive pretext tasks. Consequently, the boundaries between different clusters may not align well with the boundaries between fine-grained classes. With the label effort on USCHAD reduced to 24%, ACC still outperforms SL and traditional CL methods. Our experiments show that ACC delivers leading performance with high data selection efficiency. These results are encouraging because they demonstrate that unlabeled data can be used to obtain effective sensor data representations, resulting in a more efficient HAR system. This work can help exploit annotation budgets for supervised fine-tuning in practical HAR applications and makes a solid contribution to mobile sensor-based health monitoring systems.

This paper also has certain limitations. The proposed ACC model does not exhibit substantial improvement on fine-grained HAR tasks. Fully understanding the performance degradation caused by this “granularity gap” requires further analysis, which we leave for future work. Moreover, incorporating AL increases the time cost of the model, and accelerating the convergence of the fine-tuning model is another issue we need to address in the future.

Funding

This work was supported by Changzhou Sci&Tech Program (Grant No. CJ20235026).

References

1. Ferrari, Micucci, Mobilio and Napoletano, Deep learning and model personalization in sensor-based human activity recognition, Journal of Reliable Intelligent Environments 9(1) (2023), 27–39.

2. Chen, Hoey, Nugent C.D., Cook D.J. and Yu, Sensor-based activity recognition, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42(6) (2012), 790–808.

3. […], Wei, Wen, Liu and Wang, Hand gesture recognition based improved multi-channels CNN architecture using EMG sensors, Journal of Intelligent & Fuzzy Systems 43(1) (2022), 643–656.

4. Vijayvargiya, Singh, Kumar and Tavares J.M.R., Human lower limb activity recognition techniques, databases, challenges and its applications using sEMG signal: An overview, Biomedical Engineering Letters 12(4) (2022), 343–358.

5. Rajamohan and Nazz, Smart home activity recognition for Ambient Assisted Living (AAL), ECS Transactions 107(1) (2022), 20253.

6. Balasubramanian, Prabu, Shaik M.F., Naik R.A. and Suguna S.K., A hybrid deep learning for patient activity recognition (PAR): Real time body wearable sensor network from healthcare monitoring system (HMS), Journal of Intelligent & Fuzzy Systems 44(1) (2023), 195–211.

7. […], Yang, Zhou, Huang, Yang and Pang, Teleoperation of collaborative robot for remote dementia care in home environments, IEEE Journal of Translational Engineering in Health and Medicine 8 (2020), 1–10.

8. Okeyo, Chen, Wang and Sterritt, Ontology-based learning framework for activity assistance in an adaptive smart home, Activity Recognition in Pervasive Intelligent Environments 4 (2011), 237–263.

9. Tang, Teng, Zhang, Min and He, Layer-wise training convolutional neural networks with smaller filters for human activity recognition using wearable sensors, IEEE Sensors Journal 21(1) (2020), 581–592.

10. Tang, Zhang, Wu, He and Song, Dual-branch interactive networks on multichannel time series for human activity recognition, IEEE Journal of Biomedical and Health Informatics 26(10) (2022), 5223–5234.

11. Challa S.K., Kumar and Semwal V.B., A multibranch CNN-BiLSTM model for human activity recognition using wearable sensor data, The Visual Computer 38(12) (2022), 4095–4109.

12. […] J.S.K., Seo, Park and Choi D.-G., PT4AL: Using self-supervised pretext tasks for active learning, European Conference on Computer Vision (ECCV) (2022), 596–612.

13. […], Tang, Yang, Wen and Zhang, Context-aware mutual learning for semi-supervised human activity recognition using wearable sensors, Expert Systems with Applications 219 (2023), 119679.

14. Liu and Abdelzaher, Semi-supervised contrastive learning for human activity recognition, 2021 17th International Conference on Distributed Computing in Sensor Systems (DCOSS), 2021, pp. 45–53.

15. Tang C.I., Perez-Pozuelo, Spathis, Brage, Wareham and Mascolo, SelfHAR: Improving human activity recognition through self-training with unlabeled data, Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 5(1) (2021), 1–30.

16. Haresamudram, Essa and Plötz, Assessing the state of self-supervised human activity recognition using wearables, Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 6(3) (2022), 1–47.

17. Chen, Kornblith, Norouzi and Hinton, A simple framework for contrastive learning of visual representations, International Conference on Machine Learning (PMLR), 2020, pp. 1597–1607.

18. Qian, Tian and Miao, What makes good contrastive learning on small-scale wearable-based tasks? Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022, pp. 3761–3771.

19. Tang C.I., Perez-Pozuelo, Spathis and Mascolo, Exploring contrastive learning in human activity recognition for healthcare, arXiv preprint arXiv:2011.11542 (2020).

20. Saeed, Ozcelebi and Lukkien, Multi-task self-supervised learning for human activity detection, Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 3(2) (2019), 1–30.

21. Liu, Wang, Liu, Wang, Yao and Abdelzaher, Contrastive self-supervised representation learning for sensing signals from the time-frequency perspective, 2021 International Conference on Computer Communications and Networks (ICCCN), 2021, pp. 1–10.

22. Khaertdinov, Ghaleb and Asteriadis, Contrastive self-supervised learning for sensor-based human activity recognition, 2021 IEEE International Joint Conference on Biometrics (IJCB), 2021, pp. 1–8.

23. Haresamudram, Essa and Plötz, Contrastive predictive coding for human activity recognition, Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 5(2) (2021), 1–26.

24. Wang, Zhu, Gan, Chen L.L., Ning and Wan, Sensor data augmentation by resampling in contrastive learning for human activity recognition, IEEE Sensors Journal 22(23) (2022), 22994–23008.

25. Wang, Zhu, Chen, Ning and Wan, Negative selection by clustering for contrastive learning in human activity recognition, IEEE Internet of Things Journal 10(12) (2023), 10833–10844.

26. Xie, Lu, Yan, Yang, Tomizuka and Zhan, Active Finetuning: Exploiting Annotation Budget in the Pretraining-Finetuning Paradigm, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 23715–23724.

27. Ren, Xiao, Chang, Huang P.-Y., Li, Gupta B.B., Chen and Wang, A survey of deep active learning, ACM Computing Surveys (CSUR) 54(9) (2021), 1–40.

28. Bengar J.Z., van de Weijer, Fuentes L.L. and Raducanu, Class-balanced active learning for image classification, Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022, pp. 1536–1545.

29. […], Perello-Nieto, Santos-Rodriguez and Flach, Human activity recognition based on dynamic active learning, IEEE Journal of Biomedical and Health Informatics 25(4) (2020), 922–934.

30. Sener and Savarese, Active Learning for Convolutional Neural Networks: A Core-Set Approach, International Conference on Learning Representations, 2018.

31. Kim and Shin, In Defense of Core-set: A Density-aware Core-set Selection for Active Learning, Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022, pp. 804–812.

32. Bengar J.Z., van de Weijer, Twardowski and Raducanu, Reducing label effort: Self-supervised meets active learning, Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 1631–1639.

33. […], Zhang, Ren and Sun, Deep residual learning for image recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

34. Anguita, Ghio, Oneto, Parra and Reyes-Ortiz J.L., A public domain dataset for human activity recognition using smartphones, ESANN (2013), 3.

35. Malekzadeh, Clegg R.G., Cavallaro and Haddadi, Protecting sensory data against sensitive inferences, Proceedings of the 1st Workshop on Privacy by Design in Distributed Systems, 2018, pp. 1–6.

36. Zhang and Sawchuk A.A., USC-HAD: A daily activity dataset for ubiquitous activity recognition using wearable sensors, Proceedings of the 2012 ACM Conference on Ubiquitous Computing, 2012, pp. 1036–1043.

37. Kwapisz J.R., Weiss G.M. and Moore S.A., Activity recognition using cell phone accelerometers, ACM SigKDD Explorations Newsletter 12(2) (2011), 74–82.