Abstract
BACKGROUND:
A daily activity routine is vital for overall health and well-being, supporting physical and mental fitness. Consistent physical activity is linked to many benefits for the body, mind, and emotions, and plays a key role in fostering a healthy lifestyle. Wearable devices have become essential in health and fitness because they facilitate the monitoring of daily activities. While convolutional neural networks (CNNs) have proven effective, challenges remain in quickly adapting to a variety of activities.
OBJECTIVE:
This study aimed to develop a model for precise human activity recognition from wearable-device data by integrating transformer models with multi-head attention, with the goal of improving health monitoring.
METHODS:
The Human Activity Recognition (HAR) algorithm uses deep learning to classify human activities from spectrogram data. It uses a pretrained convolutional neural network (CNN), MobileNetV2, to extract features, and a dense residual transformer network (DRTN) with a multi-head multi-level attention architecture (MH-MLA) to capture time-related patterns. The model then blends information from the CNN and the transformer through an adaptive attention mechanism and uses a softmax function to provide classification probabilities for the various human activities.
RESULTS:
The integrated approach, which combines a pretrained CNN with transformer models to recognize human activities from spectrogram data, achieved accuracies of 92.81%, 97.98%, and 95.32% on the HARTH, KU-HAR, and HuGaDB datasets, respectively, outperforming the compared methods. This suggests that the integration of diverse methodologies yields good results in capturing nuanced human activities across different datasets. The comparison analysis showed that the integrated system consistently performs better for dynamic human activity recognition datasets.
CONCLUSION:
In conclusion, maintaining a routine of daily activities is crucial for overall health and well-being. Regular physical activity contributes substantially to a healthy lifestyle, benefiting both the body and the mind. The integration of wearable devices has simplified the monitoring of daily routines. This research introduces an innovative approach to human activity recognition, combining the CNN model with a dense residual transformer network (DRTN) with multi-head multi-level attention (MH-MLA) within the transformer architecture to enhance its capability.
Introduction
Wearable sensors play a crucial role in comprehending human actions across many environments, such as smart homes [1, 2, 3], sports [4, 5, 6], and monitoring systems [7, 8, 9]. Accelerometers and gyroscopes [10, 11] are sensors that gather data to precisely detect behaviors such as running, walking, and sitting. In addition, sensors such as contact switches and pressure mats are advancing to offer discreet and privacy-conscious monitoring choices for residential settings [12, 13]. This advancement extends to wireless sensor networks, offering a range of applications from detecting open doors to tracking body and mind states using wearable sensors, particularly in the medical field [14, 15].
In a standard supervised Human Activity Recognition (HAR) framework [16], there are essential components at play. These involve gathering data from sensors, segmenting the raw data into consistent window sizes, extracting relevant features, and categorizing activities. In the feature extraction stage, activities are transformed into fixed-size feature vectors, serving as the training data for classifiers. This structured approach enables thorough detection, interpretation, and identification of human movements across various activities, such as walking, running, eating, lying down, and sitting down.
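The segmentation and feature-extraction stages described above can be sketched in a few lines of Python. This is only an illustration: the function names `segment_windows` and `mean_and_range`, the window size, the step, and the two toy features are assumptions made for the example and are not taken from the framework in [16].

```python
from typing import List, Sequence


def segment_windows(signal: Sequence[float], window_size: int, step: int) -> List[List[float]]:
    """Split a 1-D sensor signal into fixed-size windows.

    Trailing samples that do not fill a complete window are dropped.
    """
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and step must be positive")
    windows = []
    for start in range(0, len(signal) - window_size + 1, step):
        windows.append(list(signal[start:start + window_size]))
    return windows


def mean_and_range(window: Sequence[float]) -> List[float]:
    """Toy fixed-size feature vector (hypothetical choice): [mean, max - min]."""
    return [sum(window) / len(window), max(window) - min(window)]


if __name__ == "__main__":
    samples = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    for window in segment_windows(samples, window_size=4, step=2):
        print(window, mean_and_range(window))
```

Every window yields a feature vector of the same length regardless of the activity, which is what allows the vectors to be stacked into a training set for a classifier.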
Ongoing research emphasizes HAR based on sensor signals using deep learning (DL) techniques in the time domain. These techniques encompass convolutional neural networks (CNNs) [17, 18, 19], variants of recurrent neural networks (RNNs) such as long short-term memory (LSTM) and gated recurrent units (GRU), and hybrid DL methods [20, 21, 22]. Several studies [23, 24, 25] have introduced additional techniques, including attention layers and convolution layers with various kernel sizes, to improve on the original CNN architecture for HAR.
Mukherjee et al. [26] developed EnsemConvNet, an ensemble model that achieved a recognition accuracy of approximately 97% on the WISDM dataset. Additionally, Das et al. [27] introduced MMHAR-EnsemNet, a multi-modal HAR model, achieving an accuracy of around 99% on both the UTD-MHAD and Berkeley-MHAD datasets. Bhattacharya et al. [28] proposed SV-NET, a deep-learning model specifically designed for recognizing human activities from video images. Banerjee et al. [29] presented a CNN classifier model based on fuzzy integrals to address skeleton-based HAR problems.
Furthermore, Bhattacharya et al. [30] and Chattopadhyay et al. [31] explored various applications of CNN models for solving image classification problems. Research [32, 33] particularly advanced the field of HAR by introducing channel equalization and channel selectivity into convolutional neural networks (CNNs). This improvement, a first in HAR, represents a significant contribution to enhancing the performance and capabilities of CNNs in recognizing human activities.
In the field of HAR, the current systems, primarily driven by CNN, face several challenges. They struggle with issues like being too specialized, adapting to dynamic environments [34], processing information quickly, and gaining user acceptance [35]. While CNNs have shown impressive capabilities, there’s a growing interest in exploring alternative approaches. The proposed model introduces transformer models to overcome these challenges and improve the effectiveness of HAR.
The transition from CNNs to transformers brings its own set of challenges, such as understanding diverse situations [36, 37], adapting to changes, and enhancing user-friendliness. Our goal is to fine-tune these transformer models, enhancing their ability to understand various situations, adapt dynamically, process data in real time, and make interactions smoother for users. Through this effort, this research aims to advance the capabilities of HAR systems, moving beyond conventional CNN-based methods.
The study incorporates three HAR datasets: HARTH [38], KU-HAR [39], and HuGaDB [40]. On the HARTH dataset, an SVM classifier achieved an F1 score of 81%. On the KU-HAR dataset, an RF classifier attained nearly 90% accuracy. Various approaches have been applied to this dataset, including hierarchical feature-based techniques, hybrid feature selection models, CNN-based models, ANN-based classification models, and LSTM-based deep learning classifier models, achieving accuracies ranging from 79.24% to 92.5%.
This paper introduces the following contributions:
(1) The design of a multi-head adaptive attention mechanism that fuses features from a pretrained CNN (MobileNetV2) and a transformer model. (2) The application of a dense residual transformer method that combines dense and residual connections for HAR. (3) Experiments on three HAR datasets (HARTH, KU-HAR, and HuGaDB) that evaluate the performance of the proposed method as well as different combinations.
In this paper, we introduce a new method to design a multihead adaptive attention mechanism to fuse features from pre-trained CNN with MobileNetV2 and the transformer model. Our method also applies the dense residual transformer method, which combines residual connections for human activity recognition (HAR). We performed the experiments on three HAR datasets, namely HARTH [38], KU-HAR [39], and HuGaDB [40], to evaluate the performance of the proposed method as well as different combinations of dense, residual, and integration of dense and residual transformer methods.
Dataset collection
This research has conducted experiments using three publicly available Human Activity Recognition (HAR) datasets [38, 39, 40]. Specifically, the HARTH dataset includes 12 activities, the KU-HAR dataset encompasses 18 activities, and the HuGaDB dataset includes 12 activities.
HARTH dataset
The HARTH dataset [38] is freely available and includes data from 22 people who wore accelerometers on their lower backs and thighs, recording acceleration information. It covers 12 different human activities, each labeled for classification. The distribution of samples for each activity is shown in Fig. 1.
Sample of HARTH dataset spectrogram images.
KU-HAR dataset
The KU-HAR dataset [39] includes data from 90 participants (75 men and 15 women) covering 18 activities, recorded with smartphone sensors such as accelerometers and gyroscopes.
HuGaDB dataset
The HuGaDB dataset [40] contains continuous recordings of different activities such as standing up, walking, and using stairs. The data were collected using a body sensor system of six wearable sensors placed on the thighs and feet, together with EMG sensors on the quadriceps to track muscle activity.
The different activities for the three datasets are shown in Table 1.
Different activities available for three datasets
This section briefly explains the proposed system. Figure 2 shows a block diagram of the human activity identification system. The model is divided into two major parts. The first part represents the primary contribution of the experiment, incorporating both CNN and transformer models. The second part focuses on the dense residual transformer architecture with multi-head multi-level attention (MH-MLA) [41].
Proposed model.
The proposed model was developed using TensorFlow 2.0. For training the model with the pretrained MobileNetV2 CNN, the Adam optimizer with a learning rate of 0.001 and a categorical cross-entropy loss function are used. The proposed model’s classification performance is measured using accuracy, precision, recall, and the F1 measure.
To prepare for training, the raw sensor signals in each dataset are converted into spectrogram images. The data are first organized activity-wise into data arrays (Figure 3 shows the preprocessing steps), and spectrogram images are generated by dividing these arrays into frames of rows. To maintain balance, the length of the smallest activity data among all activities is calculated. After processing, the spectrogram images are normalized using z-score normalization, which rescales the values in a dataset so that their mean is 0 and their standard deviation is 1. Each value is normalized as $\text{new value} = (x - \mu) / \sigma$, where $x$ is the original value, $\mu$ is the mean of the values, and $\sigma$ is their standard deviation.
Data preprocessing.
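As a minimal sketch of the z-score step in the preprocessing stage, the following Python function normalizes a flat list of values with the standard library. The function name `z_score_normalize`, the use of the population standard deviation, and the handling of constant inputs are assumptions made for this example; the text does not state whether normalization is applied per image or across the whole dataset.

```python
import statistics
from typing import List, Sequence


def z_score_normalize(values: Sequence[float]) -> List[float]:
    """Return (x - mean) / std for every x, using the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant input has no spread; map it to zeros instead of dividing by zero.
        return [0.0 for _ in values]
    return [(x - mean) / std for x in values]


if __name__ == "__main__":
    pixels = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
    normalized = z_score_normalize(pixels)
    print(normalized)
    print(statistics.fmean(normalized), statistics.pstdev(normalized))
```

For a spectrogram image, the same function could be applied to the flattened pixel values before reshaping them back to the image dimensions.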
CNNs are among the most frequently used models in imaging applications such as medical imaging. In this model, a pre-trained CNN, MobileNet_V2 [44], has been applied to derive the features. A CNN processes images of a uniform size [11, 34, 35]. Therefore, before being passed to the CNN, all images were resized to 224 by 224 pixels. The CNN efficiently captures spatial features from spectrograms, providing a strong foundation for spatial analysis. The image dataset has been divided so that 80% of the images (randomly chosen) are used to train the model and the remaining 20% are used to test it.
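A random 80/20 division of the image set can be sketched as follows. The function name `random_split`, the fixed seed, and the use of file names as items are assumptions made for illustration; the paper does not describe how the random selection was implemented.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def random_split(items: Sequence[T], train_fraction: float = 0.8, seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of items and split it into (train, held_out) parts."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must lie strictly between 0 and 1")
    shuffled = list(items)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    # Hypothetical file names standing in for spectrogram images.
    images = [f"spectrogram_{i:03d}.png" for i in range(10)]
    train, held_out = random_split(images)
    print(len(train), len(held_out))
```

The two parts are disjoint by construction, so no held-out image is seen during training.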
The proposed model integrates CNN and Transformer architectures for enhanced human activity recognition (HAR). MobileNet_V2 [44] has been used as a pre-trained CNN model. CNN efficiently captures spatial features from spectrograms, providing a strong foundation for spatial analysis. In parallel, the Transformer, equipped with dense connections, excels at modeling long-range temporal dependencies crucial for recognizing complex human activities. The incorporation of dense and residual connections addresses challenges like vanishing gradients, promoting stable information flow during training. The adaptive attention fusion mechanism intelligently merges information from both modalities, leveraging their synergies. This integration optimizes feature extraction, bolstering the model’s robustness and capacity to discern diverse human activities across varying contexts.
Dense residual transformer network
Dense residual transformer network.
In the feature extraction process, a dense residual connection integrates two fundamental concepts in deep learning architectures: dense connections and residual connections. Dense connections, characteristic of DenseNet architectures, involve each layer receiving inputs from both the previous layer and all preceding layers. This dense connectivity enhances feature reuse and information flow. In a network with L layers, each layer l receives input from all previous layers (1, 2, …, l-1). Residual connections, introduced in ResNets, tackle the vanishing gradient problem in deep neural networks. This issue arises when gradients struggle to traverse numerous layers, impeding learning progress. In a residual connection, the layer’s output is combined with its original input. The dense residual connection combines the principles of residual and dense connections. In each layer, the output is added to the original input, similar to ResNets. Additionally, the layer receives inputs from all preceding layers, following DenseNet’s dense connectivity. This combination promotes feature reuse and smooth information flow, easing the training of deep networks. The incorporation of dense and residual connections aims to leverage the strengths of both architectures, improving feature reuse, information flow, and the training dynamics of deep neural networks. This design has been explored in various models to enhance the performance of deep learning systems.
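One way to write the three connection types described above is given below. This is a sketch in assumed notation, not necessarily the authors' exact formulation: $x_0$ denotes the network input, $x_l$ the output of layer $l$, $F_l$ the transformation applied by layer $l$, and $[\cdot]$ concatenation. For the addition to be defined, $F_l$ is assumed to project its output back to the dimension of $x_{l-1}$.

```latex
\begin{align}
\text{residual (ResNet):}\quad & x_l = x_{l-1} + F_l(x_{l-1}) \\
\text{dense (DenseNet):}\quad & x_l = F_l\big([x_0, x_1, \dots, x_{l-1}]\big) \\
\text{dense residual:}\quad & x_l = x_{l-1} + F_l\big([x_0, x_1, \dots, x_{l-1}]\big)
\end{align}
```

In the third form, the identity term $x_{l-1}$ gives gradients a direct path through the network, while the concatenated argument of $F_l$ makes the features of all earlier layers available for reuse.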
These connections help capture time-related patterns in the input, allowing the model to understand complex representations. Position-wise feedforward networks make these representations even better, and residual connections maintain smooth information flow, which is represented in Fig. 4.
The architecture of the DRTN model employed to solve the recognition problem is shown in Fig. 4.
In the described architecture, each transformer block is intricately connected through a residual connection that includes normalization and a multi-layer perceptron (MLP). This connection serves to stabilize the training process and capture complex relationships within the input data. Additionally, within each transformer block, a dense connection is established, ensuring that each layer receives input from all preceding layers. This dense connectivity promotes the reuse of features and allows the model to capture intricate relationships between different parts of the input. The overall design, with both residual and dense connections, facilitates the learning of hierarchical features and dependencies across multiple Transformer blocks, making the model well-suited for tasks such as image-based activity recognition, where understanding intricate patterns is crucial.
The inclusion of dense and residual connections ensures the model can handle deep structures, enhancing accuracy and reliability in recognizing various human activities. Algorithm-1: Dense residual transformer incorporates a CNN as the initial layer before applying the transformer architecture.
The input spectrogram undergoes processing by the CNN. Subsequently, the output is handled by the transformer encoder and decoder layers, both featuring dense and residual connections. The refinement of features benefits from position-wise feedforward networks with residual connections.
In recognizing human activities, it’s important to combine the outcomes of the CNN and the Transformer using an adaptive attention mechanism, as shown in Algorithm-2-Adaptive Attention. This combination helps capture both the visual patterns and the time-related aspects of the input data. The CNN focuses on visual details like patterns in images or spectrograms, while the Transformer looks at the order of events to understand how activities unfold over time. With the adaptive attention mechanism, the model dynamically decides how much importance to give to spatial (visual) and temporal (time-related) features. This decision is made through attention scores (
Description of metrics
The Softmax function transforms these scores into attention weights, ensuring they add up to 1 and form a valid probability distribution. It represents how spatial features from the CNN (C) and temporal features from the Transformer (T) are dynamically combined, guided by their attention scores. The resulting integrated output (I) is then utilized for additional processing and making final predictions in the Human Activity Recognition task.
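The fusion of C and T into I can be sketched as a softmax-weighted sum. In this illustration the two attention scores are passed in as plain numbers, whereas in the model they would be learned; the function names `softmax` and `adaptive_fusion` and the element-wise weighting are assumptions made for the example and are not taken from Algorithm-2.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax: non-negative weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_fusion(c: Sequence[float], t: Sequence[float],
                    score_c: float, score_t: float) -> List[float]:
    """Blend spatial features C and temporal features T into an integrated output I.

    score_c and score_t are hypothetical attention scores for the two branches.
    """
    if len(c) != len(t):
        raise ValueError("C and T must have the same length")
    weight_c, weight_t = softmax([score_c, score_t])
    return [weight_c * ci + weight_t * ti for ci, ti in zip(c, t)]


if __name__ == "__main__":
    spatial = [1.0, 2.0]
    temporal = [3.0, 4.0]
    # Equal scores give equal weights, so I is the element-wise average.
    print(adaptive_fusion(spatial, temporal, score_c=0.0, score_t=0.0))
    # A much larger spatial score makes I follow C almost entirely.
    print(adaptive_fusion(spatial, temporal, score_c=10.0, score_t=-10.0))
```

Because the weights always sum to 1, the integrated output stays on the same scale as the two inputs regardless of how the scores are distributed.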
To assess the model’s performance, the model has been evaluated using various metrics, including F1 score, recall, precision, and accuracy (Table 2), using the confusion matrix of the classification.
For a given activity class, a True Positive (TP) is a sample of that class that is correctly predicted as that class, and a True Negative (TN) is a sample of another class that is correctly predicted as not belonging to it. A False Positive (FP) is a sample of another class that is wrongly predicted as the given class, and a False Negative (FN) is a sample of the given class that is wrongly predicted as another class.
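The following sketch derives per-class precision, recall, and F1, as well as overall accuracy, from a multi-class confusion matrix. It uses the standard definitions of these metrics, which are assumed here to match those used in the paper; the function names and the small example matrix are hypothetical.

```python
from typing import Dict, List


def per_class_metrics(confusion: List[List[int]], cls: int) -> Dict[str, float]:
    """Precision, recall and F1 for one class.

    confusion[i][j] counts samples whose true class is i and predicted class is j.
    """
    n = len(confusion)
    tp = confusion[cls][cls]
    fn = sum(confusion[cls][j] for j in range(n)) - tp  # true cls, predicted as another class
    fp = sum(confusion[i][cls] for i in range(n)) - tp  # another class, predicted as cls
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def accuracy(confusion: List[List[int]]) -> float:
    """Fraction of all samples that lie on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical two-class example: rows are true labels, columns are predictions.
    matrix = [[8, 2],
              [1, 9]]
    print(per_class_metrics(matrix, cls=0))
    print(accuracy(matrix))
```

For multi-class HAR datasets, the per-class values can be averaged over all activities to obtain a single precision, recall, or F1 figure.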
Results and discussion
A confusion matrix is a performance measurement tool commonly used in classification tasks such as human activity recognition (HAR). Each matrix allows for a detailed examination of the model’s performance by depicting the distribution of predicted and actual class labels. Figures 5–7 provide a visual representation of the confusion matrices associated with three distinct datasets.
A careful analysis of these confusion matrices can provide insights into the strengths and weaknesses of the proposed method across different activities and datasets. Researchers and practitioners can use these visual representations to make informed decisions about model adjustments, fine-tuning, or the need for additional data preprocessing steps to enhance the overall performance of the Human Activity Recognition system.
Upon analysis, it is observed that the integrated transformer consistently shows higher performance compared to the dense and residual transformers. The integrated transformer excels in terms of a higher number of true positives, indicating its effectiveness in accurately recognizing human activities. Additionally, the integrated transformer shows a reduced number of false positives, signifying better precision in avoiding misclassifications.
The performance of the proposed model is assessed using three datasets (Tables 3–5).
Performance for HARTH dataset
Confusion matrix for HARTH dataset.
Performance for KU-HAR dataset
Performance for HuGaDB dataset
Confusion matrix for KU-HAR dataset.
In technical terms, the improvement in accuracy of 1% to 2% for human activity recognition suggests that the integrated transformer model achieves better classification performance compared to individual models, as tabulated in Tables 3–5 for three datasets. This improvement implies that the integrated transformer’s methodology involves effectively integrating information from different sources or adapting to diverse datasets, leading to a more resilient and accurate model. The model’s ability to generalize across various datasets and capture nuanced patterns from different sources contributes to the observed increase in accuracy. This adaptability likely stems from the integrated transformer’s architecture, which enables it to handle the complexities and variations present in each dataset, ultimately resulting in superior overall performance in human activity recognition tasks.
Table 6 indicates substantial variance in performance across different existing methodologies.
Comparison with existing methods
Confusion matrix for HuGaDB database.
The m-CNN (Multi-Convolutional Neural Network) [38] excels at spatial feature extraction from images, making it proficient in tasks like image analysis. However, it faces challenges in comprehending sequential activities. Conversely, the Bi-LSTM (Bidirectional Long Short-Term Memory) [38] model captures temporal dependencies well but requires careful parameter tuning and may struggle with intricate patterns. Traditional CNNs, renowned for spatial feature extraction, may fall short in capturing the temporal order of sequential actions. Wolf-based optimization [42] explores solution spaces inspired by collaborative wolf-hunting behavior, while wrapper-based feature selection [43] dynamically optimizes feature subsets for specific models.
The study [45] employed attention-mechanism-based deep learning to recognize human activity; however, it encountered difficulties with accuracy on various datasets. The authors [46] utilized peripheral sensors to create DeepTransHHAR, which was designed to recognize heterogeneous activities. However, they encountered constraints as a result of the presence of numerous similar activities. Teran-Pineda et al. [47] implemented multimodal sensors to identify gait; however, they encountered difficulties with precision due to sensor noise and intricate patterns. Nevertheless, it is important to note that the proposed integrated approach achieved accuracy surpassing all other methods. This indicates that the proposed integrated approach is more effective than individual methodologies in accurately identifying human activities, as evidenced by the efficiency of the approach across a variety of datasets.
The limitations of the proposed study involve the Dense Residual Transformer Network (DRTN) and the pre-trained CNN with MobileNetV2, which can have limited effectiveness in utilizing both spatial and temporal information. Even then, the integrated approach, combining transformer models with a dense residual transformer network and employing multi-head, multi-level attention, outperformed these methods in various action recognition scenarios shown in Table 6. This suggests that a synergistic integration of diverse methodologies yields superior results in capturing nuanced human activities across different contexts. The comparison analysis showed that the integrated system consistently performs better for dynamic human activity recognition datasets.
In conclusion, maintaining a routine of daily activities is crucial for overall health and well-being. Regular physical activity contributes substantially to a healthy lifestyle, benefiting both the body and the mind. The integration of wearable devices has simplified the monitoring of daily routines. Our research introduces an innovative approach to human activity recognition, combining transformer models with a dense residual transformer network (DRTN). Leveraging multi-head, multi-level attention (MH-MLA) within the transformer architecture enhances its adaptability. This collaboration with Convolutional Neural Networks (CNNs) significantly improves accuracy. Our primary objective was to surpass traditional methods in activity recognition, and our approach demonstrated impressive results. Testing on three datasets – HARTH, KU-HAR, and HuGaDB – produced accuracies of 92.81%, 97.98%, and 95.32%, respectively. These findings underscore the effectiveness of our method for real-time health monitoring and precise recognition of human activities. The integration of transformer models with a dense residual transformer network, particularly leveraging multi-head, multi-level attention, holds promise for advancing activity recognition in diverse and dynamic settings.
Funding
The authors extend their appreciation to the Deanship of Research and Graduate Studies at King Khalid University for funding this work through a Small Group Research Project under grant number RGP1/316/45.
Data availability
Online data sources [38, 39, 40] from Kaggle are used throughout the study.
Footnotes
Conflict of interest
No conflict of interest.
