Abstract
Human Activity Recognition (HAR) is a challenging task that involves accurately classifying diverse daily movements from data captured by sensors, videos, or images. In this study, we propose a robust HAR framework that integrates CatBoost with a stacked ensemble learning (SEL) strategy, combining multiple base classifiers to enhance accuracy and generalization beyond conventional machine learning approaches. The framework was first evaluated on the benchmark WISDM, RealWorld and PAMAP2 datasets, comprising raw triaxial accelerometer signals segmented with a sliding window approach, demonstrating its effectiveness. The CatBoost model within the SEL framework achieved strong performance in identifying activities such as walking and jogging, while also delivering nearly perfect recognition for stair-related activities, with average scores of 87.06% accuracy, 89.25% recall, 79.93% precision, 84.26% F1-score, and 85.43% ROC-AUC across all WISDM activities. To assess generalization, the framework was further tested on the RealWorld HAR and PAMAP2 datasets. On RealWorld HAR, it achieved 99.2% accuracy, 99.06% recall, 99.23% precision, 99.13% F1-score, and 99.1% ROC-AUC, whereas on PAMAP2, it attained 99.43% accuracy, 99.33% recall, 99.53% precision, 99.43% F1-score, and 99.36% ROC-AUC. These results highlight the capability of ensemble learning combined with boosting methods to advance sensor-based HAR across multiple benchmark datasets, offering high reliability and generalization in real-world scenarios.
Keywords
Introduction
Human Activity Recognition refers to the task of identifying and classifying everyday human actions from various data sources such as images, video sequences, or sensor signals. Over the last few years, it has emerged as a critical research topic due to its relevance in healthcare, elderly assistance, security, smart home automation, surveillance, and sports applications. For example, monitoring behavioural changes in patients or older adults enables timely medical intervention. To achieve reliable outcomes, both vision-based and sensor-based strategies have been explored extensively. While vision-based approaches rely on images or video streams, sensor-driven techniques leverage time-series data generated by accelerometers, gyroscopes, magnetometers, GPS, and related wearable devices. After acquisition, these sensor data undergo preprocessing to derive meaningful activity patterns or insights. 1
HAR Applications and Modalities: HAR plays an equally vital role in three broad domains: human behaviour modelling, ubiquitous computing, and human-computer interaction. 2 Its applications span gait and gesture recognition, smart surveillance, rehabilitation, and intelligent healthcare systems.3–6 Depending on the target environment, HAR may employ a wide range of sensors including Bluetooth, Wi-Fi, infrared sensors, depth cameras, and smartphones. 7 These modalities provide complementary advantages and are generally classified into two major categories: vision-based8,9 and sensor-based approaches.10,11 Although video systems can capture detailed actions, they raise challenges of cost, privacy, and limited coverage. In contrast, wearable and mobile sensors—such as inertial measurement units (IMUs), accelerometer-equipped smartphones, and smart devices—offer a scalable and cost-effective alternative.
Limitations: Despite their promise, sensor-based approaches face difficulties in capturing fine-grained motion patterns, particularly when multiple joints or body positions are involved. Mobile sensors such as accelerometers, gyroscopes, and barometers help translate physical motion into measurable signals, offering privacy-preserving recognition in comparison with video-based methods. However, accurate classification still requires addressing challenges related to signal quality and body posture coverage. In industrial or high-performance contexts, multiple sensors are often combined to improve robustness and reliability. 12
Gait signals—derived from changes in speed and angular velocity—serve as reliable indicators of human movement. Compared with fixed installations, mobile and wearable sensors offer portability, energy efficiency, and adaptability for daily use, making them well-suited for HAR deployment. Consequently, research on smartphone-based HAR has grown rapidly, with many works investigating optimal sensor combinations and feature representations. 13 Classical methods have employed handcrafted statistical features such as mean, variance, correlation, and entropy, along with Fourier or wavelet-based transformations. These features are then classified using machine learning algorithms such as Random Forest (RF), k-Nearest Neighbours (KNN), Decision Trees (DT), Naive Bayes, and multilayer perceptrons (MLPs). More recently, deep learning models—such as Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM), and Bidirectional LSTM (Bi-LSTM)—have been adopted for sequence modelling.
The primary contributions of this article are summarized in five points. (1) We propose a stacked ensemble learning (SEL) model for HAR that integrates multiple classifiers, including k-Nearest Neighbours, Gaussian Naive Bayes, CatBoost, Logistic Regression, Decision Trees, and Extra Trees, thereby providing a robust framework for activity recognition. (2) A comprehensive evaluation strategy is employed: individual base learners are assessed first and then combined within the SEL framework, giving a detailed understanding of model strengths and weaknesses. (3) Our experimental analysis highlights the superior performance of the CatBoost classifier, which obtains a maximum accuracy of 98.78% when incorporated into the ensemble. (4) The proposed SEL method is validated using cross-validation and confusion matrix analysis, underscoring its potential for real-world deployment. (5) The study contributes new methodological ideas to HAR and attains state-of-the-art results, with implications for behaviour monitoring, smart environments, and healthcare.
Article Organization: The rest of this paper is organized as follows. Section 2 reviews existing HAR approaches together with their strengths and limitations. Section 3 introduces the preliminaries, namely the notation and abbreviations used throughout the article. Section 4 presents the proposed methodology, which is designed to overcome the shortcomings identified in prior work, along with a description of the datasets. Section 5 reports the simulation of the proposed methodology and the results obtained. Section 6 concludes the paper with a summary of the results and recommendations for future research.
Literature review
This section reviews prior research aimed at predicting and classifying human activities, covering both traditional and modern frameworks. It also positions the advantages of the proposed methodology relative to earlier work.
A variety of approaches have been investigated to enhance the accuracy and efficiency of Human Activity Recognition (HAR) systems. One notable contribution is the federated personalized random forest model introduced by Liu et al. 14 It applies privacy-preserving strategies through differential privacy. By combining ensemble learning with locality-sensitive hashing, this model captures user-specific patterns while maintaining privacy, yielding accuracies of 94.5% on the WISDM dataset and 93.1% on the Smartphone dataset. However, further enhancement is required before it can be adopted in real-time applications.
In Zhongkai et al., 15 hybrid deep learning frameworks that integrate CNN, RNN, and self-attention modules for HAR are explored. Their proposed V3 model employed cross-channel and multi-size convolution transformations, outperforming other neural architectures such as SeNet on benchmark datasets like HASC, WISDM, and UCI. However, the accuracy is still low.
In Nayak, 16 sensor-driven HAR using wearable devices such as smartphones and smartwatches is examined, employing algorithms such as Random Forest, Simple Logistic, and SMO on the UCI-HAR and WISDM datasets. RF achieved the highest accuracies, 98% on UCI-HAR and 90.69% on WISDM, demonstrating the effectiveness of traditional classifiers for sensor-based activity recognition.
Helmi et al. 17 introduced GBOGWO, a hybrid feature selection method that integrates the Gradient-Based Optimizer (GBO) with the Grey Wolf Optimizer (GWO). With an SVM classifier, the authors reported accuracies of up to 98% on both the UCI-HAR and WISDM datasets.
Bozkurt 18 compared a wide range of machine learning and deep learning models for classifying daily activities such as walking, running, and stair movements. Using public HAR datasets, the study demonstrated competitive performance for both families of models. The highest accuracy, 96.81% with a mean error of 0.03, was achieved by a deep neural network tested on the UCI-HAR dataset.
Deep learning-based frameworks have also gained traction. Dua et al. 19 proposed a CNN-GRU hybrid model that eliminates the need for manual feature engineering. Tested on the WISDM, PAMAP2, and UCI-HAR datasets, the model achieved accuracies of 97.21%, 95.27%, and 96.20%, respectively.
Thakur 20 proposed the Conv-AE-LSTM framework, which leverages CNNs for temporal modelling, autoencoders for dimensionality reduction, and LSTMs for sequence learning. This architecture improved both accuracy and computational efficiency across multiple datasets.
Walse 21 investigated ensemble and boosting techniques, employing AdaBoost alongside classifiers such as Random Tree, J48, and REP Tree, and achieved high recognition rates on the WISDM dataset.
Tang 22 proposed a triplet cross-dimensional attention mechanism that significantly improved classification accuracy with CNN (97.34%) and ResNet (98.61%) models on datasets such as UCI-HAR, WISDM, and PAMAP2.
Islam 23 presented a review of CNN-based HAR systems, categorizing input sources into multimodal, radar, vision, and smartphone devices. The review highlighted limitations of existing CNN-based approaches and suggested pathways for improvement.
Saeed 24 conducted a comparative analysis of models using smartphone and smartwatch sensors and concluded that deep learning architectures outperform classical approaches in most scenarios.
Finally, Antar 25 provided a comprehensive survey addressing challenges in HAR, including preprocessing requirements, noise reduction, sensor placement, and dataset limitations, while summarising benchmark resources for evaluating HAR systems.
A structured comparison of these studies is presented in Table 1.
Summary of state-of-the-art human activity recognition approaches.
Despite substantial progress, existing HAR studies exhibit several gaps as follows: (a)
The proposed framework provides several advantages compared with existing techniques based on ML and DL: (1) In contrast to existing frameworks, it incorporates rigorous pre-processing, including noise reduction and segmentation, ensuring robust input signals for classification. (2) Unlike existing frameworks that rely exclusively on deep learning models,15,18–20,22–24 the approach leverages stacked ensemble learning (SEL) in combination with boosting, specifically the CatBoost classifier. (3) By integrating classical machine learning classifiers with ensemble learning strategies, the proposed framework improves generalization and robustness relative to existing methods.14,16,18 (4) Comparative evaluation reveals that the SEL approach with CatBoost consistently outperforms existing deep learning solutions, demonstrating superior accuracy and stability across datasets. (5) While deep learning methods remain dominant in HAR literature, this study underscores the strength of ensemble-based strategies, highlighting CatBoost within SEL as a high-performing and practical alternative.
Table 2 summarizes the notation used throughout this work. Each symbol is accompanied by its definition to ensure consistent interpretation across the paper. This reference table is also intended to help readers and researchers who may wish to replicate or extend the methodology. By presenting clear and uniform definitions, the notation provides a structured foundation for the mathematical formulations and experimental procedures.
List of notations and abbreviations.
In this section, the design of the proposed framework is demonstrated, covering data set organization, pre-processing, feature representation, model training, and ensemble learning implementation. The main focus is on stacking ensemble learning (SEL), which integrates multiple classifiers to improve recognition accuracy.
Performance of individual learners
Each classifier’s performance was first evaluated independently using confusion matrices and 10-fold cross-validation. This analysis provided insight into the relative strengths and weaknesses of the individual learners. The classifiers considered included Logistic Regression, Decision Tree, K-Nearest Neighbour, Naive Bayes, Extra Trees, and CatBoost.
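The 10-fold protocol mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the helper names `k_fold_indices` and `cross_val_accuracy`, the shuffling seed, and the toy majority-class learner are all hypothetical stand-ins for the actual classifiers.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_val_accuracy(fit, predict, X, y, k=10):
    """Average held-out accuracy of a learner over k folds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        hits = sum(predict(model, X[i]) == y[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)


# Toy stand-in learner (assumption): always predicts the majority training class.
def fit_majority(X_train, y_train):
    return Counter(y_train).most_common(1)[0][0]


def predict_majority(model, x):
    return model


# Synthetic data (assumption): 70 samples of class 0 and 30 of class 1.
X = [[float(i)] for i in range(100)]
y = [0] * 70 + [1] * 30
mean_acc = cross_val_accuracy(fit_majority, predict_majority, X, y, k=10)
```

Any of the six classifiers could be plugged in through the `fit` and `predict` callables; the per-fold predictions can also be accumulated into a confusion matrix instead of a single accuracy value.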
The workflow of the proposed methodology is presented in Figure 1, which summarises the sequence of preprocessing, model training, base learner evaluation, and final stacked ensemble integration.

Execution flow of the proposed methodology.
The pseudocode of the proposed method is presented in Algorithm 1.
Machine learning models description
This subsection outlines the ML algorithms28,29 and boosting methods that form the foundation of the proposed stacked ensemble framework. 30 Each model is briefly described to highlight its working principles, advantages, and applications in Human Activity Recognition (HAR).
Decision tree (DT)
Decision Trees are supervised learning models that partition the input space into regions based on feature conditions. They are widely applied for both classification and regression tasks. In HAR, a sample is classified by traversing the tree structure from the root node to a leaf node, where each internal node corresponds to a test on a feature, and each branch represents a possible outcome of that test. The final prediction is determined at the leaf node, which represents a class label or a value. Decision Trees are intuitive, interpretable, and effective for non-linear decision boundaries, making them suitable for activity recognition tasks. 31
Extra trees (ET)
Extremely Randomized Trees (ET) are an ensemble method derived from Decision Trees. In contrast to Random Forests (RF), which use bootstrap sampling and random feature selection, ET introduces additional randomness by selecting split points at random rather than optimizing them. This stronger randomization reduces variance and speeds up training by avoiding bootstrapping, while also producing a diverse set of base learners. Each tree is built on the full dataset, enhancing efficiency. ET is especially effective in managing the bias-variance trade-off, and its performance is highly dependent on careful parameter tuning, which can be optimized through cross-validation. 32
Gaussian Naive Bayes (NB)
Naive Bayes is one of the basic probabilistic classifiers based on Bayes’ theorem, with the simplifying assumption that features are conditionally independent given the class label. The Gaussian Naive Bayes variant assumes that continuous features follow a normal distribution. Training involves estimating the parameters of these distributions for each class. During inference, the posterior probability of each class is computed, and the sample is assigned to the class with the highest probability. Despite its simplicity, Naive Bayes is often effective in practice, and when combined in ensembles, it improves robustness and prediction accuracy compared to using a single classifier. 31
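The decision rule described above can be written compactly. The symbols used here ($x_j$ for the $j$-th feature, $\mu_{jc}$ and $\sigma_{jc}^2$ for the per-class mean and variance, $d$ for the number of features) are chosen for illustration and may differ from the paper's notation table.

```latex
P(x_j \mid y = c) = \frac{1}{\sqrt{2\pi\sigma_{jc}^{2}}}
    \exp\!\left(-\frac{(x_j - \mu_{jc})^{2}}{2\sigma_{jc}^{2}}\right),
\qquad
\hat{y} = \arg\max_{c}\; P(y = c)\prod_{j=1}^{d} P(x_j \mid y = c)
```

Training amounts to estimating $\mu_{jc}$, $\sigma_{jc}^2$, and the class prior $P(y = c)$ from the labelled windows of each class.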
CatBoost (CB)
CatBoost is a gradient boosting algorithm developed by Yandex that uses symmetric, oblivious decision trees as base learners. 33 In contrast to traditional Gradient Boosting Decision Trees (GBDT), which can suffer from overfitting and biased gradient estimation, CatBoost employs ordered boosting to mitigate prediction shift and reduce bias in gradient estimates. This approach enhances generalization and improves performance on categorical and noisy data.
The model prediction can be expressed as:
The optimal parameters are obtained by minimizing a loss function:
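For reference, the generic gradient-boosting formulation that these two statements usually refer to is sketched below. This is the textbook form, assumed here rather than taken from the paper; the symbols $F_M$, $h_m$, $\alpha_m$, $\theta_m$, and $\mathcal{L}$ are illustrative.

```latex
% Additive model: the prediction is a weighted sum of M trees
F_M(x) = \sum_{m=1}^{M} \alpha_m \, h_m(x; \theta_m)

% Stage-wise fitting: each new tree minimises the loss given the current model
(\alpha_m, \theta_m) = \arg\min_{\alpha,\, \theta}
    \sum_{i=1}^{N} \mathcal{L}\bigl(y_i,\; F_{m-1}(x_i) + \alpha \, h(x_i; \theta)\bigr)
```

Here $h_m$ is the $m$-th tree with parameters $\theta_m$, $\alpha_m$ is its weight, and $\mathcal{L}$ is the training loss evaluated over the $N$ training samples.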
CatBoost further improves over standard GBDT by using a permutation-driven technique to handle categorical features and reduce conditional bias. It employs oblivious trees, where all nodes at the same depth split using the same condition, ensuring symmetry and simplifying computation. Studies have shown that CatBoost often surpasses other boosting frameworks such as XGBoost and LightGBM in terms of robustness, accuracy, and generalisation performance. 34
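The oblivious-tree structure described above can be illustrated with a small sketch. Because every node at a given depth shares one split condition, the leaf can be found by collecting one bit per level. The function name, the split values, and the leaf values below are hypothetical and do not come from CatBoost's API or from the study.

```python
def oblivious_tree_predict(x, splits, leaf_values):
    """Evaluate one oblivious (symmetric) tree.

    splits: list of (feature_index, threshold), one per depth level;
            every node at a given depth shares the same condition.
    leaf_values: list of 2 ** len(splits) outputs, indexed by the bit
                 pattern of the split outcomes.
    """
    leaf_index = 0
    for feature_index, threshold in splits:
        bit = 1 if x[feature_index] > threshold else 0
        leaf_index = (leaf_index << 1) | bit
    return leaf_values[leaf_index]


# Hypothetical depth-2 tree on a triaxial accelerometer feature vector [x, y, z].
splits = [(0, 0.5), (2, -1.0)]
leaf_values = [0.1, 0.4, -0.3, 0.9]

score = oblivious_tree_predict([0.8, 0.0, -2.0], splits, leaf_values)
```

With this layout a tree of depth `d` needs only `d` comparisons and a single table lookup, which is the computational simplification the symmetry provides.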
K-Nearest Neighbour (KNN) is an instance-based learning algorithm that assigns class labels based on the majority class of its
Logistic regression (LR)
Logistic Regression (LR) is a linear model commonly used for binary and multi-class classification. It estimates the probability of a class by applying a logistic (sigmoid) function to a linear combination of input features. 36
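For the binary case, the logistic model is commonly written in the standard form below. The weight vector $w$ and bias $b$ are illustrative symbols and are not taken from the paper's notation table.

```latex
P(y = 1 \mid x) = \sigma\!\left(w^{\top} x + b\right)
                = \frac{1}{1 + e^{-(w^{\top} x + b)}}
```

In the multi-class setting the sigmoid is typically replaced by a softmax over one linear score per class.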
LR is interpretable and computationally efficient, and variable selection techniques are often employed to improve its performance by removing irrelevant features and including significant interaction terms. The inclusion or exclusion of variables is typically determined by significance testing, often using a threshold
Model construction
Initially, the
To further enhance model robustness and generalization capability, experiments were extended to the

RealWorld dataset collection images. (a) Chest; (b) Forearm; (c) Forearm Thumb; (d) Head; (e) Shin; (f) Shin Thumb; (g) Upperarm; (h) Upperarm Thumb; (i) Waist; (j) Waist Thumb.
Additionally, the
The integration of these three datasets—WISDM, RealWorld HAR, and PAMAP2—enabled comprehensive evaluation across heterogeneous activity types, sensor configurations, and environmental contexts. This multi-dataset approach allowed the model to learn generalized motion features and improved its capacity to recognize complex activity transitions under real-world conditions. The distribution percentage of various HAR activities of aforementioned datasets is presented graphically in Figure 3.

Activity distribution of the WISDM, RealWorld, & PAMAP2 datasets. (a) Activity Dist. of the WISDM; (b) Activity Dist. of the RealWorld HAR; (c) Activity Dist. of the PAMAP2.
Following the initial loading of the WISDM dataset, it was verified that 33 subjects and six different activities were included. The RealWorld and PAMAP2 datasets, which include seven and eighteen distinct physical activities, respectively, were then loaded. However, all three datasets (WISDM, RealWorld, and PAMAP2) are noticeably imbalanced. This was addressed by applying the preprocessing procedures detailed in the following subsection, which ensure balanced class distributions before training.
Data pre-processing
The pre-processing stage consisted of (1) labelling, (2) segmentation, and (3) splitting of the dataset. First, all activities were encoded into binary labels, as shown in Table 3 for the WISDM dataset; the activities in the RealWorld and PAMAP2 datasets were binary encoded in the same way. Then, each dataset was divided into training (70%) and testing (30%) sets to ensure a fair evaluation of model performance. Following this, multiple classifiers were applied, and their predictions were integrated using the stacked ensemble learning framework.
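The segmentation and splitting steps can be sketched as follows. This is an illustrative outline only: the window length of 80 samples (4 s at a 20 Hz sampling rate), the 50% overlap, the majority-label rule, the seed, and the function names are assumptions, since the exact settings are not stated here.

```python
import random


def sliding_windows(samples, labels, window=80, step=40):
    """Cut a labelled triaxial stream into fixed-length, overlapping windows.

    Each window takes the label that occurs most often inside it.
    """
    segments, segment_labels = [], []
    for start in range(0, len(samples) - window + 1, step):
        chunk_labels = labels[start:start + window]
        segments.append(samples[start:start + window])
        segment_labels.append(max(set(chunk_labels), key=chunk_labels.count))
    return segments, segment_labels


def train_test_split_70_30(segments, segment_labels, seed=42):
    """Shuffle window indices and split them 70% / 30%."""
    order = list(range(len(segments)))
    random.Random(seed).shuffle(order)
    cut = int(0.7 * len(order))
    train, test = order[:cut], order[cut:]
    return ([segments[i] for i in train], [segment_labels[i] for i in train],
            [segments[i] for i in test], [segment_labels[i] for i in test])


# Synthetic stream (assumption): 400 (x, y, z) readings, first half labelled 0,
# second half labelled 1.
stream = [(0.1 * i, 0.0, 9.8) for i in range(400)]
stream_labels = [0] * 200 + [1] * 200

segments, segment_labels = sliding_windows(stream, stream_labels)
X_train, y_train, X_test, y_test = train_test_split_70_30(segments, segment_labels)
```

Because overlapping windows share raw samples, a random split over windows can place near-identical segments in both sets; splitting by subject or by recording is a common alternative when that is a concern.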
Activity label encoding in WISDM dataset.
In the performance analysis, each model was first assessed individually using accuracy, precision, recall, and F1-score before the models were combined. This process ensured a comprehensive evaluation of both base learners and the ensemble strategy.
Ensemble learning is a strategy that improves classification or regression outcomes by combining multiple models. Each base learner, often considered weak on its own, provides a unique prediction. These outputs are then aggregated through voting or averaging to form a stronger final prediction. The diversity of the base learners reduces overfitting and enhances generalization.
Stacked ensemble learning (SEL)
Stacked Ensemble Learning extends traditional ensembles by combining heterogeneous classifiers in a layered architecture. In this study, six base learners were selected: (1) Decision Tree, (2) K-Nearest Neighbour, (3) Logistic Regression, (4) Gaussian Naive Bayes, (5) Extra Trees, and (6) CatBoost. Their obtained outputs serve as inputs for a meta-classifier, which produces the final decision. This hybrid approach leverages the strengths of both simple and complex models, providing robustness across diverse activity types.
Illustration of Figure 4: It represents the SEL framework, where base models first generate predictions that are then combined by the meta-classifier to form the final output. The arrows in the figure represent the data flow from base learners to the final prediction stage.
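The data flow from base learners to meta-classifier can be sketched in a few lines. This is a deliberately simplified hold-out variant of stacking (sometimes called blending), written under several assumptions: two single-feature nearest-centroid learners stand in for the six base classifiers, a small logistic regression stands in for the meta-classifier, and the data are synthetic. None of the names or settings are taken from the study.

```python
import math


def fit_centroid_stump(X, y, feature):
    """Base learner: per-class mean of a single feature (nearest-centroid rule)."""
    means = {}
    for label in (0, 1):
        values = [row[feature] for row, target in zip(X, y) if target == label]
        means[label] = sum(values) / len(values)
    return feature, means


def predict_stump(model, row):
    feature, means = model
    return min(means, key=lambda label: abs(row[feature] - means[label]))


def fit_logistic(Z, y, lr=0.5, epochs=500):
    """Meta-learner: logistic regression trained by batch gradient descent."""
    weights, bias = [0.0] * len(Z[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for z, target in zip(Z, y):
            score = sum(w * v for w, v in zip(weights, z)) + bias
            p = 1.0 / (1.0 + math.exp(-score))
            for j, v in enumerate(z):
                grad_w[j] += (p - target) * v
            grad_b += p - target
        weights = [w - lr * g / len(Z) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(Z)
    return weights, bias


def predict_logistic(model, z):
    weights, bias = model
    return 1 if sum(w * v for w, v in zip(weights, z)) + bias > 0 else 0


# Toy data (assumption): feature 0 separates the classes, feature 1 is noisy.
X = [[0.1, 5.0], [0.2, 1.0], [0.3, 4.0], [0.4, 2.0],
     [0.9, 4.0], [1.0, 0.0], [1.1, 5.0], [1.2, 1.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# Level 0: fit base learners on one half, predict on the held-out half.
fit_idx, meta_idx = [0, 1, 4, 5], [2, 3, 6, 7]
base_models = [fit_centroid_stump([X[i] for i in fit_idx],
                                  [y[i] for i in fit_idx], f) for f in (0, 1)]
meta_features = [[predict_stump(m, X[i]) for m in base_models] for i in meta_idx]

# Level 1: the meta-classifier learns how much to trust each base learner.
meta_model = fit_logistic(meta_features, [y[i] for i in meta_idx])


def stacked_predict(row):
    return predict_logistic(meta_model, [predict_stump(m, row) for m in base_models])
```

The key design point is that the meta-classifier is trained on predictions made for samples the base learners did not see, which keeps it from simply rewarding base learners that memorised their training data.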

Architecture of the stacked ensemble learning framework.
The Hard Voting Classifier is used as a decision-level fusion technique to integrate the outputs of the stacked ensemble learning (SEL) architecture and boost the final prediction robustness. Practically speaking, after compiling the class predictions produced by the meta-learner and selected base learners for each input sample x, the Hard Voting Classifier selects the final class by majority vote. Formally, if
This section presents the experimental environment, the performance measures, and the outcomes of the proposed framework across different activity categories. In addition, the proposed framework is compared with existing ones to demonstrate its effectiveness.
Simulation environment
All experiments were executed in Python on a Microsoft Windows 11 platform with an Intel Core i7 processor (3.40 GHz), 16 GB RAM, and chipset 2600. To enhance ensemble diversity, each of the six classifiers was executed five times, yielding a total of 30 weak learners. Predictions from these learners were aggregated using a majority voting scheme, where the class with the highest frequency was assigned as the final prediction. Both standard ensemble learning and the proposed Stacked Ensemble Learning (SEL) method were evaluated using benchmark measures, including Accuracy, Precision, Recall, F1-score, and AUC-ROC. For benchmarking, competing methods were also implemented under the same environment.
Dataset description
In this subsection, the benchmark HAR datasets WISDM, RealWorld and PAMAP2 are discussed.
Description of wireless sensor data mining (WISDM) HAR dataset
The experiments in this study use the publicly available benchmark WISDM Human Activity Recognition (HAR) dataset, as described in Walse et al. 21 The data were collected through a mobile Android application, with participants carrying smartphones in their front trouser pockets while performing six activities: walking, jogging, ascending stairs, descending stairs, sitting, and standing. Accelerometer readings were sampled at a fixed rate of 20 Hz. This dataset contains raw time-series data and has been widely employed as a benchmark in HAR research. Table 4 provides a detailed summary of the activity distribution.
Wireless sensor data mining (WISDM) dataset description.
Given the imbalance in activity distribution, the dataset was restructured into three balanced groups, each consisting of two activities. This division ensured uniform proportions across activity classes, allowing more consistent training. Table 5 details the transformed datasets.
Balanced transformed datasets.
By restructuring the dataset into these subsets, class imbalance was minimized, enabling fairer evaluations across different activity types. Each group maintained an approximately balanced proportion of records, making the dataset suitable for rigorous experimentation.
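The regrouping into activity pairs can be sketched as below. The three pairs follow the groups analysed in the results, while the 0/1 codes, the activity name strings, and the function names are illustrative assumptions rather than the actual encoding table.

```python
from collections import Counter

# Activity pairs follow the three groups analysed in the results; the 0/1 codes
# are illustrative assumptions, not the paper's actual encoding table.
ACTIVITY_GROUPS = {
    "walking_jogging": {"Walking": 0, "Jogging": 1},
    "upstairs_downstairs": {"Upstairs": 0, "Downstairs": 1},
    "sitting_standing": {"Sitting": 0, "Standing": 1},
}


def split_into_groups(records):
    """Route (features, activity) records into per-pair binary datasets."""
    groups = {name: [] for name in ACTIVITY_GROUPS}
    for features, activity in records:
        for name, encoding in ACTIVITY_GROUPS.items():
            if activity in encoding:
                groups[name].append((features, encoding[activity]))
    return groups


def class_proportions(group):
    """Fraction of records per binary label, to check how balanced a group is."""
    counts = Counter(label for _, label in group)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Synthetic records (assumption): (x, y, z) reading plus activity name.
records = [((0.1, 0.2, 9.8), "Walking"), ((1.5, 0.3, 9.1), "Jogging"),
           ((0.2, 0.1, 9.7), "Walking"), ((0.0, 0.0, 9.8), "Sitting"),
           ((0.0, 0.1, 9.9), "Standing"), ((0.4, 0.6, 9.5), "Upstairs")]

groups = split_into_groups(records)
proportions = class_proportions(groups["walking_jogging"])
```

Checking `class_proportions` for each group is a quick way to confirm that a pair is roughly balanced before training a binary classifier on it.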
Description of RealWorld HAR dataset
This dataset was collected using six sensor modalities: acceleration, GPS, gyroscope, light, magnetic field, and sound level. It covers 7 activities or body postures, i.e., climbing stairs down and up, jumping, lying, standing, sitting, running/jogging, and walking. Data were collected from 15 subjects, and each movement was also recorded by a video camera. Acceleration was recorded at seven body positions: chest, forearm, head, shin, thigh, upper arm, and waist. 37
Description of PAMAP2 dataset
This dataset comprises 18 distinct physical activities performed by nine subjects, recorded with a heart rate monitor and three inertial measurement units. A total of 3,850,505 instances were gathered during collection; some contain missing values, which are handled in the pre-processing stage. 38
Performance metrics
The effectiveness of the models was evaluated using accuracy, precision, recall and F1 score, along with 10-fold cross-validation and the ROC-AUC score. These measures provide complementary insights into classification reliability and generalization.
Accuracy
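Accuracy measures the ratio of correctly predicted activities (6 for WISDM, 7 for RealWorld and 18 for PAMAP2) to the total number of predictions:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where $TP$, $TN$, $FP$, and $FN$ denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.
Recall
Recall quantifies the proportion of correctly identified positive activities:
$$\text{Recall} = \frac{TP}{TP + FN}$$
Precision
Precision measures the proportion of correctly predicted positives among all positive predictions:
$$\text{Precision} = \frac{TP}{TP + FP}$$
F1-score
The F1-score is the harmonic mean of precision and recall:
$$\text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$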
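The four measures can be computed directly from the counts of a binary confusion matrix, as in the short sketch below. The function name and the example counts are hypothetical and are not taken from the paper's tables.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for one activity pair (not taken from the paper's tables).
metrics = binary_metrics(tp=80, fp=20, fn=10, tn=90)
```

For these counts the accuracy is 0.85 and the precision is 0.80, while the recall is higher at about 0.889; reporting all four values together exposes this kind of gap, which accuracy alone would hide.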
All tuned hyperparameters used in the simulation for LR, DT, KNN, GNB, ET, and CatBoost are presented in Table 6.
Experimental setup and hyperparameter configuration for stacked ensemble learning (SEL).
This section presents the results obtained on the HAR datasets with respect to the various performance measures.
Performance analysis for walking and jogging activities
The first analysis compared classifiers for recognizing walking and jogging activities. Metrics reported in Table 7 show that the CatBoost Classifier consistently achieved the best performance. Specifically, CatBoost reached 80.4% accuracy, 83.3% recall, 69.9% precision, 76.1% F1-score, and 76.3% ROC-AUC, outperforming all other baseline classifiers. Figure 5 illustrates these findings graphically.
Performance on walking and jogging activities.

Performance comparison: walking vs. jogging activities.
A similar evaluation was conducted for stair-related activities. As shown in Table 8, CatBoost again delivered the highest performance, achieving 82.3% accuracy, 86.0% recall, 71.9% precision, 78.3% F1-score, and 81.8% ROC-AUC. Figure 6 visually depicts these outcomes. These findings demonstrate the model’s robustness in distinguishing between upward and downward movements, which are typically more challenging to classify.
Performance analysis for sitting and standing
The final evaluation focused on sitting versus standing activities. Table 9 indicates that CatBoost achieved exceptional performance, with accuracy of 98.5%, recall of 98.4%, precision of 98.0%, F1-score of 98.4%, and ROC-AUC of 98.2%. These values significantly outperform baseline classifiers, underscoring CatBoost’s effectiveness for simpler, posture-based tasks. The comparative performance is visualised in Figure 7.
Experimental setup of stacked ensemble learning (SEL)
To evaluate the effectiveness of the proposed stacked ensemble framework, we compared parameter configurations of individual base classifiers with those used in the ensemble. For conventional classifiers such as Logistic Regression (LR) and Gaussian Naïve Bayes (GNB), no special tuning was required. In contrast, other models including K-Nearest Neighbour (KNN), Extra Tree (ET), and CatBoost (CB) were trained with optimised parameters such as
For the SEL framework, each base learner was configured according to the settings listed in Table 9. A soft voting classifier was then applied to combine their predictions, with CatBoost emerging as the dominant contributor to the final meta-learner. This demonstrates the value of strategically integrating diverse classifiers to improve overall predictive capability.
Performance analysis for upstairs and downstairs activities.

Performance comparison: upstairs vs. downstairs activities.
To better understand the SEL model’s performance, confusion matrices were generated for each activity pair. These matrices, together with metrics such as accuracy, recall, precision, and F1-score, reveal the strengths and weaknesses of the model.
Notably, sitting and standing activities achieved the highest recognition accuracy with a cross-validation score of 98.5%, highlighting SEL’s ability to distinguish static postures with exceptional precision. Meanwhile, activities with similar movement patterns, such as walking versus jogging, presented greater classification challenges, but the SEL framework still demonstrated competitive results.
The confusion matrices for all six activities are provided in Figure 8, which gives a detailed view of misclassification patterns and supports targeted refinements in future work.
Comparative analysis with existing models
Finally, we benchmarked the proposed SEL+CatBoost framework against previously published methods. As shown in Table 10, our model achieved superior accuracy of 98.78%, outperforming state-of-the-art techniques such as CNN-GRU, PP-FPRF, and ConvAE-LSTM on the WISDM dataset.
Performance on sitting and standing activities.

Performance comparison: sitting vs. standing activities.
This result underscores the effectiveness of combining SEL with CatBoost, yielding improved robustness, generalisation, and resistance to overfitting. Such performance establishes the proposed model as a strong candidate for real-world deployment in activity recognition applications.
In this subsection, the proposed method is evaluated on the RealWorld HAR dataset alongside several ML models and shows superior performance compared with them. The CatBoost Classifier, in particular, demonstrates outstanding results across different activities: for Walking and Running, it achieves 98.9% accuracy, 98.7% recall, 98.8% precision, 98.7% F1-score, and 98.8% ROC-AUC; for Sitting and Standing, it attains 99.2% accuracy, 99.1% recall, 99.3% precision, 99.2% F1-score, and 99.1% ROC-AUC; and for Cycling and Mixed Activities, it obtains 99.5% accuracy, 99.4% recall, 99.6% precision, 99.5% F1-score, and 99.4% ROC-AUC. The detailed results are summarized in Tables 11, 12, and 13.

Confusion matrix of SEL for all six activities. (a) Confusion matrix of SEL for walking and jogging; (b) Confusion matrix of SEL for upstairs and downstairs; (c) Confusion matrix of SEL for sitting and standing.
In this subsection, the proposed method is evaluated on the PAMAP2 HAR dataset alongside several ML models and outperforms each of them. The CatBoost classifier delivers strong results across the activity groups: for Walking, Running, and Nordic Walking, it achieves 99.4% accuracy, 99.3% recall, 99.5% precision, 99.4% F1-score, and 99.4% ROC-AUC; for Sitting, Standing, and Lying, it attains 99.6% accuracy, 99.5% recall, 99.7% precision, 99.6% F1-score, and 99.5% ROC-AUC; and for Cycling, Rope Jumping, and Daily Activities (Vacuuming, Ironing, Stairs), it obtains 99.3% accuracy, 99.2% recall, 99.4% precision, 99.3% F1-score, and 99.2% ROC-AUC. The detailed results are presented in Tables 14, 15, and 16.
Comparison of proposed model with existing approaches.
Performance on walking and running activities (RealWorld HAR dataset).
Performance on sitting and standing activities (RealWorld HAR dataset).
Performance on cycling and mixed activities (RealWorld HAR dataset).
Performance on walking, running, and nordic walking activities (PAMAP2 dataset).
Performance on sitting, standing, and lying activities (PAMAP2 dataset).
Performance on cycling, rope jumping, and daily activities (PAMAP2 dataset).
This study implemented six distinct machine learning models arranged in layered stacks, where each stack comprised five learners, resulting in a total of thirty base models. To integrate their predictions, we adopted a soft voting strategy within a stacked ensemble framework. Unlike hard voting, which relies only on the majority class label, soft voting averages the predicted probabilities, thereby producing a more balanced and reliable output. This approach exploits the complementary strengths of multiple models, ensuring that the final decision benefits from the collective predictive capacity of all learners.
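The difference between the two voting rules can be illustrated with a minimal sketch. The function names `soft_vote` and `hard_vote` and the probability values below are hypothetical and chosen only for illustration; they are not the study's implementation or data.

```python
from collections import Counter
from typing import List, Sequence


def soft_vote(probas: Sequence[Sequence[float]]) -> int:
    """Average per-class probabilities over all models; return the argmax class."""
    n_models = len(probas)
    n_classes = len(probas[0])
    avg: List[float] = [
        sum(p[c] for p in probas) / n_models for c in range(n_classes)
    ]
    return max(range(n_classes), key=avg.__getitem__)


def hard_vote(probas: Sequence[Sequence[float]]) -> int:
    """Let each model vote for its own top class; return the majority class."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in probas]
    return Counter(votes).most_common(1)[0][0]


# Illustrative (made-up) probabilities from three models for two classes.
example = [
    [0.55, 0.45],  # model 1 leans weakly towards class 0
    [0.60, 0.40],  # model 2 leans weakly towards class 0
    [0.05, 0.95],  # model 3 is highly confident in class 1
]

print(hard_vote(example))  # majority of labels
print(soft_vote(example))  # averaged probabilities
```

In this toy case the two rules disagree: two models prefer class 0 by a narrow margin, so hard voting returns class 0, whereas the averaged probabilities (0.4 versus 0.6) favour class 1 because soft voting also accounts for each model's confidence.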
Performance was assessed using standard metrics such as accuracy, precision, recall, F1-score, and cross-validation outcomes. The comparative results clearly indicate that the stacked ensemble consistently surpassed individual models, confirming the advantage of combining multiple learners. The CatBoost classifier, in particular, emerged as the most effective contributor across all stacks, repeatedly achieving superior results. Confusion matrices further validated these findings, illustrating fewer misclassifications in the ensemble compared to single-model approaches. Overall, the analysis highlights that the ensemble’s success is driven by its ability to consolidate diverse model outputs, with CatBoost acting as a pivotal element in enhancing predictive performance.
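As a brief illustration of how these metrics follow from a confusion matrix, the sketch below computes accuracy and per-class precision, recall, and F1-score. The helper name `per_class_metrics` and the 2x2 matrix are hypothetical examples, not values from the experiments, and the rows-as-true-labels convention is an assumption.

```python
from typing import Dict, List, Tuple


def per_class_metrics(
    cm: List[List[int]],
) -> Tuple[float, Dict[int, Tuple[float, float, float]]]:
    """Return overall accuracy and (precision, recall, F1) for each class.

    Assumes cm[i][j] counts samples of true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total
    results: Dict[int, Tuple[float, float, float]] = {}
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, true class differs
        fn = sum(cm[k]) - tp                       # true k, predicted otherwise
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results[k] = (precision, recall, f1)
    return accuracy, results


# Made-up confusion matrix for two classes (e.g. sitting vs. standing).
cm = [[8, 2],
      [1, 9]]
accuracy, metrics = per_class_metrics(cm)
print(accuracy)
print(metrics[0])
```

Averaging the per-class values (macro averaging) is one common way to obtain a single score per activity group; whether the reported figures use macro or weighted averaging is not stated here.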
Convergence analysis and training stability
The convergence study characterizes the proposed model’s learning dynamics and training stability. As shown in Figure 9, the training loss decreases smoothly and monotonically across epochs, indicating that the optimization process behaves as intended. No sudden spikes, oscillations, or divergences occur during the learning phase, which suggests that the updates are stable, well-regulated, and numerically reliable. This behavior reflects how well the chosen hyperparameters, such as the learning rate schedule, regularization strength, batch size configuration, and architectural depth, work together to create a balanced training environment.

Convergence curve: loss vs. epochs.
The loss declines rapidly in the early epochs, indicating that the model is effectively capturing the most prominent and discriminative structures in the data. As training continues, the rate of improvement levels off and the loss reaches a plateau after a number of iterations. This plateau region is especially significant because it marks the point at which the model converges to an optimal or nearly optimal solution without compromising generalization. The absence of post-plateau fluctuations provides further evidence that the model does not suffer from the overfitting or noisy weight adjustments that are common in poorly regularized or excessively complex architectures.
Furthermore, the convergence pattern shows that the model maintains a steady balance between exploration (searching the loss landscape) and exploitation (refining the learned parameters), confirming the appropriateness of the chosen optimization technique. Additionally, the steadily declining loss curve suggests that the underlying loss surface is well-behaved and free of saddle points or severe local minima that could impede training. This stability supports the model’s capacity to generalize to new data and strengthens the training pipeline’s dependability. Taken together, the convergence behavior supports the robustness, resilience, and efficacy of the proposed training process and provides confidence in the model’s overall performance and practical applicability.
This study proposed a robust framework for Human Activity Recognition (HAR) based on data from smartphone accelerometer sensors. The human activities considered in this study fall into three categories: (i) walking and jogging; (ii) climbing and descending stairs; and (iii) sitting and standing. When tested on three benchmark HAR datasets (WISDM, RealWorld, and PAMAP2), the system achieved better accuracy than individual models by utilizing stacked ensemble learning. CatBoost consistently outperformed the other classifiers in terms of accuracy, precision, recall, and F1-measure. Notably, the accuracy of 98.5% in recognizing sitting and standing activities on the WISDM dataset demonstrated how well the proposed model handled both dynamic and static activities. In addition, the framework achieved 99.20% accuracy on the RealWorld dataset and 99.43% accuracy on PAMAP2.
The results obtained demonstrate that ensemble learning offers a feasible way to develop dependable HAR systems for real-world situations. The methodology might be expanded in future studies by using multimodal sensor data, investigating sophisticated ensemble designs, or adding more learners. These enhancements have the potential to increase HAR systems’ versatility and broaden their use in fields including smart environments, rehabilitation, and healthcare monitoring.
Footnotes
Ethical approval
This article does not involve experiments with human participants or animals conducted by the authors.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Malaysian Ministry of Higher Education (MOHE) for providing the Fundamental Research Grant Scheme (FRGS) (Grant number: FRGS/1/2024/TK04/USM/02/1).
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
