Abstract
Background
Hypertension is one of the most important health problems worldwide and requires continuous monitoring.
Objective
Conventional blood pressure monitoring methods have notable disadvantages, which motivates the search for better solutions.
Methods
In this study, PPG signals from 218 subjects at Guilin People's Hospital were analyzed; 657 PPG recordings were used together with demographic and clinical data. Three deep learning models (CNN-Attention, CNN-GRU, and LSTM) were trained with z-score normalization and data augmentation, using an 80:20 train-test split.
Results
The CNN-GRU model performed best, with 75% accuracy, an AUC-ROC of 0.658, and perfect recall (1.00) for hypertensive cases. The CNN-Attention model reached an accuracy of 61%, and the LSTM model performed worst overall.
Conclusion
These results suggest that accessible cardiovascular monitoring is feasible and valuable in resource-limited settings.
Keywords
Introduction
Hypertension remains one of the leading public health issues worldwide, affecting a large number of adults globally.1 Because it is a leading risk factor for cardiovascular disease, stroke, and kidney failure, early detection and management of hypertension are crucial to avoiding adverse health outcomes.2,3 Traditional methods of blood pressure monitoring (such as cuff-based techniques) are limited in accuracy and usually involve intermittent, irregular measurements, so they may not capture the dynamic nature of blood pressure (BP) changes throughout the day.4,5
Modern wearables offer advanced functions and continuous monitoring of health parameters such as blood glucose levels, heart rate variability, blood pressure, calories burned, and steps walked, among others. Heart rate variability is one of the key parameters for predicting major cardiovascular events in both the general population and patients. Such non-invasive detection methods have become a basic requirement for wearable medical devices, particularly where devices are used to track personal health without the assistance of nursing staff. Bio-signals are closely related to cardiovascular condition and can therefore be an efficient way to detect heart disease by estimating blood pressure.6 Wearable technology, particularly photoplethysmography (PPG) sensors, has opened new avenues for continuous, non-invasive cardiovascular monitoring.7 PPG sensors, integrated into smartwatches and various fitness trackers, measure changes in blood volume in the microvascular tissue bed through optical sensing. These sensors provide useful information about cardiovascular parameters such as heart rate variability, pulse wave velocity, and blood volume patterns.8,9
Recent advances in machine learning (ML) have shown great success in processing complex physiological data and identifying subtle patterns indicative of various health conditions.10 Combining PPG sensor data with ML algorithms offers a new opportunity to develop approaches for early detection and monitoring of hypertension. Moreover, factoring in demographic elements can enhance the accuracy and reliability of such predictions, since the risk of hypertension differs widely across population segments according to age, gender, ethnicity, and lifestyle factors.11,12
Related studies have extensively explored PPG for BP prediction and hypertension detection using a range of ML and deep learning (DL) methods on diverse datasets, often with demographic or clinical data, using either feature engineering or direct analysis of the raw signal with models such as convolutional neural networks (CNNs), Long Short-Term Memory networks (LSTMs), and ensemble techniques.13,14 However, these studies have limitations, including the need for regular calibration, inconsistent accuracy and validation against established standards, limited generalizability from small or specialized populations (e.g., healthy volunteers or intensive care unit (ICU) patients), data heterogeneity that prevents direct comparison, challenges with DL interpretability and overfitting, sensitivity to signal quality and motion artifacts, and a general focus on correlation rather than causation.15–20 In contrast, this paper attempts to address such shortcomings by using a population from a general hospital, rather than only healthy or severely ill patients, and by directly comparing the performance of different DL models, showing the potential of affordable cardiovascular monitoring in low-resource settings while maintaining high recall for hypertensive cases.
In general, this work presents a unified machine learning strategy that fuses PPG sensor data with demographic and health data to predict the risk of hypertension.
This work is important because it can support early intervention and personalized healthcare strategies. By building on widely available PPG sensor technology and other data, our approach seeks to provide a cost-effective, accessible, and continuous monitoring solution for hypertension risk assessment. This can be especially useful in resource-limited settings where traditional medical screening may be difficult to access.
This research makes several contributions to hypertension prediction and wearable health monitoring. (1) Novel hybrid architectures: we propose and compare three DL architectures (CNN-Attention, CNN-GRU, and LSTM) for hypertension prediction, demonstrating the benefit of the CNN-GRU model (75% accuracy, 1.00 recall for hypertensive cases) in leveraging both PPG waveform morphology and temporal relationships. (2) Integrated data framework: by fusing PPG signals with clinical and demographic features, our approach accounts for the multifactorial nature of the disease and supports personalized prediction. (3) Resource-limited applications: the research demonstrates the value of routine non-invasive hypertension screening using off-the-shelf PPG sensors as a low-cost means of early detection in resource-limited health-care settings.
Materials and methods
Dataset
The openly available data used in this work is the PPG-BP dataset, a database of PPG signals with corresponding blood pressure measurements from 219 subjects,21 of which 218 were valid. Data collection was performed at Guilin People's Hospital in China. The dataset consists of 657 PPG waveform segments (three segments per subject) recorded at 1 kHz with 12-bit resolution using a custom portable hardware platform with a SEP9AF-2 PPG sensor (SMPLUS Company, Korea) operating at 660 nm (red) and 905 nm (infrared); each PPG segment is 2.1 s long.
Subjects' ages ranged from 21 to 86 years, with a median of 58 years; 48% of the population was male. The dataset contains recordings from healthy subjects and from subjects with conditions such as hypertension, diabetes, cerebral infarction, and insufficient blood supply to the brain. Blood pressure was measured with a validated upper-arm non-invasive blood pressure monitor (Omron HEM-7201).
Data collection was carried out under standardized conditions: the subjects sat comfortably on an office chair, with their arms relaxed and resting on a desk. The PPG signals were recorded from the left index fingertip after 10 min of acclimatization, while blood pressure was measured simultaneously from the right forearm. All measurements were obtained within a 3-min period. Signal quality was monitored in real time by calculating a skewness signal quality index (SSQI), and only segments with an SSQI greater than zero were retained to minimize noise and motion artifacts.
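The skewness-based quality check described above can be illustrated with a minimal sketch. It assumes the SSQI is the population skewness of the segment (the mean of the cubed z-scores); the function names `skewness_sqi` and `keep_segment` and the toy signals are hypothetical and are not taken from the dataset's acquisition software.

```python
import math


def skewness_sqi(segment):
    """Population skewness of a segment: the mean of the cubed z-scores."""
    n = len(segment)
    mean = sum(segment) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in segment) / n)
    if std == 0.0:
        # A flat signal carries no pulse information; treat it as failing.
        return 0.0
    return sum(((x - mean) / std) ** 3 for x in segment) / n


def keep_segment(segment, threshold=0.0):
    """Retain a segment only if its SSQI is strictly above the threshold."""
    return skewness_sqi(segment) > threshold


# Toy signals (illustrative only): a sharp peak over a low baseline is
# right-skewed; its mirror image is left-skewed.
right_skewed = [0.0, 0.0, 0.1, 0.0, 1.0, 0.2, 0.0, 0.0]
left_skewed = [-x for x in right_skewed]

print(keep_segment(right_skewed), keep_segment(left_skewed))
```

With a threshold of zero, the right-skewed toy segment is kept and its mirror image is rejected, which matches the retention rule stated in the text.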
Each subject in the dataset has three PPG segments, accompanied by blood pressure and clinical information such as age, gender, height, weight, heart rate, and diagnoses of relevant medical conditions. All personally identifiable information was removed at the time of data collection. Data were screened for integrity, availability of relevant clinical information, and signal quality before inclusion in the final dataset.
This dataset enables the development and testing of machine learning approaches for blood pressure estimation and the detection of cardiovascular diseases using PPG signals alone, which could make monitoring of cardiovascular health easier and more accessible.
Data processing and model development
We propose and evaluate three different DL architectures for the analysis of PPG signals in combination with clinical features. All models were implemented in TensorFlow 2.10.0 with Python 3.9.18 to support reproducibility and scalability.
Data preprocessing
The PPG signals were processed as sequences of 2100 time points across three temporal segments. All clinical features were standardized with z-score normalization to put the variables on a consistent scale. To deal with the intrinsic class imbalance in this dataset, we designed an extensive augmentation strategy for the minority class. The strategy applied controlled signal perturbations: additive Gaussian noise with a standard deviation between 0.005 and 0.02, time warping with up to 5% variation in signal duration, and amplitude scaling within ±10%. These augmentation techniques were calibrated to preserve the physiologically relevant characteristics of the PPG signals while introducing realistic variations.
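As an illustration of the preprocessing steps above, the following sketch implements z-score normalization and the three perturbations in plain Python. Only the parameter ranges come from the text. The helper names, the linear-interpolation form of time warping with edge padding, the chaining of all three perturbations in one pass, and the synthetic sine wave standing in for a PPG segment are assumptions made for this example.

```python
import math
import random


def zscore(values):
    """Standardize values to zero mean and unit (population) variance."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std if std > 0 else 0.0 for v in values]


def add_gaussian_noise(signal, rng, std_range=(0.005, 0.02)):
    """Add zero-mean Gaussian noise with a std drawn from std_range."""
    std = rng.uniform(*std_range)
    return [x + rng.gauss(0.0, std) for x in signal]


def time_warp(signal, rng, max_change=0.05):
    """Stretch or compress the time axis by up to max_change.

    The output keeps the input length: sample i is read at position
    i / factor by linear interpolation, and positions past the end are
    clamped to the last sample (an assumed padding choice).
    """
    n = len(signal)
    factor = rng.uniform(1.0 - max_change, 1.0 + max_change)
    warped = []
    for i in range(n):
        pos = min(i / factor, n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        warped.append((1.0 - frac) * signal[lo] + frac * signal[hi])
    return warped


def scale_amplitude(signal, rng, max_change=0.10):
    """Multiply the whole signal by a factor within 1 +/- max_change."""
    factor = rng.uniform(1.0 - max_change, 1.0 + max_change)
    return [x * factor for x in signal]


def augment(signal, rng):
    """Apply noise, then time warping, then amplitude scaling."""
    return scale_amplitude(time_warp(add_gaussian_noise(signal, rng), rng), rng)


rng = random.Random(0)
# Synthetic 2.1 s stand-in sampled at 1 kHz (2100 points), not real PPG data.
ppg = [math.sin(2 * math.pi * 1.2 * t / 1000.0) for t in range(2100)]
augmented = augment(ppg, rng)
print(len(augmented))
```

Keeping the output length fixed at 2100 samples lets augmented segments feed the same model input as the originals.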
Model architectures
We implemented three different DL architectures, each designed to model different aspects of the temporal and morphological features of the PPG signals. The first architecture, the CNN-Attention model, began with convolutional layers with 64 filters, a kernel size of 7, and a stride of 2. Multi-scale feature extraction was then carried out through parallel convolutions with kernel sizes of 3, 7, and 15 to capture features across different temporal scales. A channel-wise attention mechanism known as squeeze-and-excitation was used to adaptively re-weight feature responses. Global pooling was dual-path, combining maximum and average pooling, and was followed by a dense layer with batch normalization and a dropout rate ranging from 0.2 to 0.4.
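To make the channel-wise attention step concrete, the sketch below implements a squeeze-and-excitation operation on a small feature map in plain Python. It follows the commonly used formulation (global average over time, a bottleneck layer with ReLU, a sigmoid gate per channel) and omits bias terms for brevity. The shapes, the reduction to two hidden units, and the random weights are hypothetical and do not reproduce the trained model.

```python
import math
import random


def squeeze_excite(features, w1, w2):
    """Re-weight channels of a (time steps x channels) feature map.

    features: list of T time steps, each a list of C channel values.
    w1: bottleneck weights, shape (C // r, C).
    w2: expansion weights, shape (C, C // r).
    """
    n_steps, n_channels = len(features), len(features[0])
    # Squeeze: the global average over time gives one value per channel.
    squeezed = [
        sum(step[c] for step in features) / n_steps for c in range(n_channels)
    ]
    # Excitation: bottleneck layer with ReLU, then a sigmoid gate per channel.
    hidden = [max(0.0, sum(w * z for w, z in zip(row, squeezed))) for row in w1]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply every channel by its gate, a value in (0, 1).
    scaled = [[step[c] * gates[c] for c in range(n_channels)] for step in features]
    return scaled, gates


rng = random.Random(0)
features = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(16)]
w1 = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(2)]
w2 = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(8)]
scaled, gates = squeeze_excite(features, w1, w2)
print([round(g, 3) for g in gates])
```

Each gate is shared across all time steps of its channel, so the operation changes how strongly a feature channel contributes without altering its temporal shape.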
The second architecture, the CNN-GRU model, included convolutional layers with 64 to 128 filters and residual connections to improve gradient flow during training. These were followed by bidirectional GRU layers with 128 and 64 units, allowing the model to learn forward and backward temporal dependencies. To preserve fine-grained information, skip connections were implemented between major blocks, followed by dense layers with batch normalization.
The third architecture, the LSTM model, began with a convolutional reduction layer to shorten the temporal dimension, followed by stacked LSTM layers of progressively smaller size, from 256 to 64 units. This architecture uniquely incorporated highway connectivity with its dense layers, allowing the network to regulate the balance between transformed and untransformed features passing through it. A separate parallel pathway processed the clinical features, which were subsequently integrated with the PPG-derived features through concatenation.
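The highway connectivity mentioned above can be sketched as a gated mix of a transformed input and the input itself, y = T(x) * H(x) + (1 - T(x)) * x. The code below is a minimal illustration of that standard formulation; the ReLU transform, the weight values, and the function names are assumptions made for the example rather than details reported for the model.

```python
import math


def dense(x, weights, bias):
    """Affine map: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def highway_layer(x, w_h, b_h, w_t, b_t):
    """Gated mix of a transformed input H(x) and the raw input x."""
    transformed = [max(0.0, v) for v in dense(x, w_h, b_h)]  # H(x), ReLU
    gate = [1.0 / (1.0 + math.exp(-v)) for v in dense(x, w_t, b_t)]  # T(x)
    return [t * h + (1.0 - t) * xi for t, h, xi in zip(gate, transformed, x)]


x = [0.5, -1.0, 2.0]
w_h = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]  # H(x) = relu(2x)
w_t = [[0.0] * 3 for _ in range(3)]  # gate depends on the bias only here
zero = [0.0, 0.0, 0.0]

carry = highway_layer(x, w_h, zero, w_t, [-30.0] * 3)  # gate near 0: pass x
transform = highway_layer(x, w_h, zero, w_t, [30.0] * 3)  # gate near 1: use H(x)
half = highway_layer(x, w_h, zero, w_t, zero)  # gate = 0.5: equal mix
print(carry, transform, half)
```

A strongly negative gate bias makes the layer behave like an identity connection, while a strongly positive one makes it behave like an ordinary dense layer; training can settle anywhere in between for each unit.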
The three deep architectures, CNN-Attention, CNN-GRU, and LSTM, were designed with regard to the spatiotemporal and multifactorial structure of PPG signals and hypertension prediction. The CNN-Attention architecture employs multi-scale convolutions (kernel sizes 3–15) to extract morphological patterns (e.g., pulse shape) and channel-wise attention to pick out salient signal components, while CNN-GRU combines CNNs (local feature extraction) with bidirectional GRUs (temporal rhythm modeling) to leverage both waveform morphology and heart rate dynamics. The LSTM architecture employs stacked LSTMs with highway connections to handle sequential dependencies but struggles with short 2.1-s segments. All models incorporate clinical data (e.g., age, BMI) after feature extraction to capture hypertension's demographic risk factors. Preprocessing procedures (z-score normalization, augmentation with noise and time warping, and class balancing) enable stable training on the fixed-duration, imbalanced PPG data.
Training protocol
The training protocol was designed to optimize model performance while limiting overfitting. We used a batch size of 24 samples and trained for up to 100 epochs. Early stopping with a patience of 15 epochs, monitored on validation performance, was used to avoid overfitting. The learning rate was reduced on plateaus by a factor of 0.2. To handle class imbalance, we used a focal loss function with gamma = 2 and alpha = 0.25. We used the Adam optimizer with an initial learning rate of 0.001. In addition, class weights of 7.0 for the minority class and 1.0 for the majority class were applied to further address class imbalance.
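For reference, the focal loss with the stated settings can be written out for a single prediction. The sketch below uses the usual binary form, FL = -alpha_t * (1 - p_t)^gamma * log(p_t), in plain Python rather than TensorFlow. How alpha and the class weights were combined in the actual training code is not stated, so the multiplication shown here, and the choice of label 0 (normal, 79 of 218 subjects) as the minority class, are assumptions.

```python
import math


def binary_focal_loss(y_true, p_pred, gamma=2.0, alpha=0.25, eps=1e-7):
    """Focal loss for one sample; p_pred is the predicted P(label == 1)."""
    p = min(max(p_pred, eps), 1.0 - eps)  # clip to keep log() finite
    p_t = p if y_true == 1 else 1.0 - p  # probability of the true class
    alpha_t = alpha if y_true == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def weighted_mean_focal_loss(labels, probs, class_weight=None, **kwargs):
    """Mean focal loss with an optional per-class multiplier (assumed scheme)."""
    class_weight = class_weight or {0: 1.0, 1: 1.0}
    losses = [
        class_weight[y] * binary_focal_loss(y, p, **kwargs)
        for y, p in zip(labels, probs)
    ]
    return sum(losses) / len(losses)


labels = [1, 1, 0, 0]
probs = [0.9, 0.6, 0.2, 0.7]
# Assumed mapping: label 0 is the minority class and gets weight 7.0.
batch_loss = weighted_mean_focal_loss(labels, probs, class_weight={0: 7.0, 1: 1.0})
print(batch_loss)
```

The factor (1 - p_t)^gamma shrinks the loss of confidently correct predictions, so training concentrates on hard examples; with gamma = 0 and alpha = 0.5 the expression reduces to half the ordinary cross-entropy.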
The data were divided into training and testing sets in an 80:20 ratio with class stratification. The training set was further subdivided into training (80%) and validation (20%) subsets to monitor model performance during training.
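The two-stage stratified split can be sketched as follows. The example assumes the split is made per subject, so that the three segments of one subject stay in the same partition; the text does not state this explicitly, although the class counts of 139 abnormal and 79 normal subjects yield a 28 + 16 test set under this assumption, in line with the confusion-matrix totals in the Results. The function name and the seeds are hypothetical.

```python
import random


def stratified_split(labels, holdout_frac, seed):
    """Split indices so each class contributes holdout_frac of its members."""
    rng = random.Random(seed)
    keep, holdout = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_holdout = round(len(idx) * holdout_frac)
        holdout.extend(idx[:n_holdout])
        keep.extend(idx[n_holdout:])
    return sorted(keep), sorted(holdout)


# One label per subject: 139 abnormal (1) and 79 normal (0).
labels = [1] * 139 + [0] * 79

# Stage 1: 80:20 train-test split.
train_val_idx, test_idx = stratified_split(labels, 0.20, seed=42)

# Stage 2: split the remaining subjects 80:20 into training and validation.
tv_labels = [labels[i] for i in train_val_idx]
tr_pos, va_pos = stratified_split(tv_labels, 0.20, seed=43)
train_idx = [train_val_idx[i] for i in tr_pos]
val_idx = [train_val_idx[i] for i in va_pos]

print(len(train_idx), len(val_idx), len(test_idx))
```

Splitting by subject rather than by segment avoids placing near-identical recordings from the same person on both sides of the train-test boundary.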
Evaluation metrics
Model performance was assessed from a number of complementary perspectives. We computed the area under the receiver operating characteristic curve, AUC-ROC, as a metric of discrimination capability across a range of classification thresholds. Average precision was calculated as a metric assessing performance where large class imbalances were present. F1 scores were calculated to provide a balanced consideration of both precision and recall. We further dissected class-specific precision and recall metrics to understand model performance for each outcome category. Confusion matrices were built to further describe classification patterns. For the optimization of thresholds in classification, F1 scores were maximized across a variety of threshold values ranging between 0.2 and 0.8 to capture the optimal balance between precision and recall.
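The threshold search described above can be illustrated with a small grid sweep. This is a sketch under stated assumptions: the grid step of 0.01, the tie-breaking rule (the lowest threshold wins), and the toy labels and probabilities are invented for the example and do not come from the study.

```python
def f1_at_threshold(labels, probs, threshold):
    """F1-score of the positive class when predicting 1 for p >= threshold."""
    tp = sum(1 for y, p in zip(labels, probs) if y == 1 and p >= threshold)
    fp = sum(1 for y, p in zip(labels, probs) if y == 0 and p >= threshold)
    fn = sum(1 for y, p in zip(labels, probs) if y == 1 and p < threshold)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(labels, probs, low=0.2, high=0.8, steps=61):
    """Return the grid threshold in [low, high] with the highest F1-score."""
    candidates = [low + i * (high - low) / (steps - 1) for i in range(steps)]
    # max() keeps the first maximum, so ties resolve to the lowest threshold.
    return max(candidates, key=lambda t: f1_at_threshold(labels, probs, t))


# Toy predictions (illustrative only).
labels = [0, 0, 1, 1, 1, 0, 1]
probs = [0.105, 0.255, 0.405, 0.555, 0.705, 0.455, 0.305]

threshold = best_threshold(labels, probs)
print(round(threshold, 2), round(f1_at_threshold(labels, probs, threshold), 3))
```

Because the threshold is chosen to maximize F1 for the positive class, the search can favor low thresholds that raise recall at the cost of more false positives.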
Results
Our analysis included 218 of the 219 patients, namely those with valid PPG recordings and clinical features. Of these, 139 cases were labeled abnormal (prehypertension and hypertension stages 1 and 2), while the rest were labeled normal. Three DL models were evaluated on classifying hypertension status from PPG signals and clinical features. Training accuracy and loss are shown in Fig. 1.
Among the three architectures tested, the CNN-GRU model had the best overall performance, with an accuracy of 75%, an AUC-ROC of 0.658, and a notably high average precision of 0.764. The CNN-GRU architecture showed perfect precision (1.00) in identifying normal cases, though with a low recall of 0.31. For hypertensive cases, it achieved a precision of 0.72 with perfect recall of 1.00, giving an F1-score of 0.84 for this class. According to the confusion matrix (Fig. 2), the model correctly classified all 28 hypertensive cases but misclassified 11 of the 16 normal cases.

Fig. 1. Training loss and accuracy for (a) CNN-Attention, (b) CNN-GRU, and (c) LSTM.

Fig. 2. Confusion matrices for CNN-Attention, CNN-GRU, and LSTM.
The CNN-Attention model showed moderate performance, with an overall accuracy of 61% and an AUC-ROC of 0.525. In this model, the precision and recall for normal cases were 0.45 and 0.31, respectively, giving an F1-score of 0.37, while for hypertensive cases the precision and recall were 0.67 and 0.79, respectively, with an F1-score of 0.72. The confusion matrix showed 5 true negatives, 22 true positives, 11 false positives, and 6 false negatives, indicating a strong bias toward the hypertension class.
The LSTM model performed worst among the three architectures, with an accuracy of 57% and an AUC-ROC of 0.513 (Fig. 3). For normal cases, its precision and recall were 0.38 and 0.31, respectively, giving an F1-score of 0.34, while for hypertensive cases the precision and recall were 0.65 and 0.71, respectively, giving an F1-score of 0.68. Its confusion matrix showed 5 true negatives, 20 true positives, 11 false positives, and 8 false negatives.

Fig. 3. ROC curve comparison for CNN-Attention, CNN-GRU, and LSTM.
Decision thresholds for all three models were optimized by threshold tuning; the optimal thresholds were 0.380 for the CNN-GRU model and 0.200 for both the CNN-Attention and LSTM models. The superiority of the CNN-GRU model was also reflected in its class-weighted average F1-score of 0.71, compared with 0.59 and 0.56 for the CNN-Attention and LSTM models, respectively.
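The reported per-class metrics can be re-derived from the confusion-matrix counts given in this section. The sketch below computes accuracy, per-class precision, recall and F1-score, and the support-weighted F1-score; the counts are those stated in the text, while the function and variable names are illustrative.

```python
def report(tn, fp, fn, tp):
    """Metrics from binary confusion counts (positive class = hypertensive)."""

    def prf(correct, predicted, actual):
        precision = correct / predicted if predicted else 0.0
        recall = correct / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        return precision, recall, f1

    n_normal, n_hyper = tn + fp, fn + tp
    normal = prf(tn, tn + fn, n_normal)
    hyper = prf(tp, tp + fp, n_hyper)
    n_total = n_normal + n_hyper
    return {
        "accuracy": (tn + tp) / n_total,
        "normal": normal,
        "hypertensive": hyper,
        # F1 averaged over classes, weighted by the number of true cases.
        "weighted_f1": (n_normal * normal[2] + n_hyper * hyper[2]) / n_total,
    }


# (TN, FP, FN, TP) for the 44 test cases, as reported in the text.
confusion = {
    "CNN-GRU": (5, 11, 0, 28),
    "CNN-Attention": (5, 11, 6, 22),
    "LSTM": (5, 11, 8, 20),
}

results = {name: report(*counts) for name, counts in confusion.items()}
for name, r in results.items():
    print(name, round(r["accuracy"], 2), round(r["weighted_f1"], 2))
```

All three models share the same 5 true negatives and 11 false positives, so the differences between them arise entirely from how many hypertensive cases each one misses.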
Discussion
This section discusses the key insights, implications for clinical practice, and future research directions arising from the DL-based analysis of hypertension classification using PPG signals. Our study found that DL models could extract meaningful patterns from PPG signals for identifying hypertension, though success varied across architectures, with the CNN-GRU architecture performing best. This agrees with the growing literature indicating that hybrid architectures are well suited to physiological signal processing; previous works have successfully applied CNNs and RNNs to such tasks.22
The better performance of the CNN-GRU model (accuracy: 75%, AUC-ROC: 0.658) relative to the CNN-Attention and LSTM architectures suggests that the combination of convolutional and recurrent components is particularly well suited to this application. The ability of CNN layers to capture local morphological features in PPG waveforms, combined with the GRU's capability to model temporal dependencies, seems to create a more robust framework for hypertension detection. This result agrees with Zhang et al., who identified the benefits of combining CNNs with LSTMs for sequence classification and used such a hybrid model to exploit the strengths of both architectures.23
However, the performance differences across classes call for careful consideration. The perfect recall of 1.00 for hypertensive cases, together with the low recall of 0.31 for normal cases, indicates a classification bias toward the hypertensive class. Although this high sensitivity may be useful in a screening tool, the relatively high false positive rate suggests that the model is better suited to initial screening than to definitive diagnosis. This observation is consistent with the findings of Doan et al., who stated that hybrid models often show trade-offs between sensitivity and specificity, especially in medical applications.24
The modest performance of the CNN-Attention model (accuracy of 61%, AUC-ROC of 0.525) raises questions about the role of attention mechanisms in the analysis of PPG signals. Although attention mechanisms have performed well in many DL applications, their relatively modest performance in our study may indicate that hypertension-related temporal dependencies in PPG signals are better captured by recurrent architectures than by attention-based approaches. This is consistent with Shen and Lee, who concluded that attention models, though strong in some contexts, do not always outperform traditional RNNs on tasks requiring strong temporal modeling.25
Although the LSTM model performed worst among the three architectures, with an accuracy of 57% and an AUC-ROC of 0.513, it provides initial insight into the complexity of the classification task. Its modest performance in both classes indicates that simple sequential modeling may not be enough to capture the small variations in PPG morphology associated with hypertension. This agrees with Yang et al., who emphasized the difficulty LSTMs have in extracting relevant features from complex physiological signals without support from complementary model architectures.26
For the dataset used in this study, hypertension detection (classification) is a major research focus, as also noted in a review.14 Other studies also aim at ML-based classification of hypertension stages.13,19,27
Several recent studies have also employed CNNs and LSTMs, and even a fusion of the two (CNN-LSTM), for BP estimation and hypertension detection from PPG data.28 The attention mechanism applied in the present study is a more recent approach to locating relevant temporal features in PPG signals and has been attempted in a few other studies to enhance model performance. The dataset used in this study distinguishes this work from the majority of state-of-the-art research, which relies heavily on publicly available datasets such as MIMIC-II and MIMIC-III that consist largely of ICU patient data.17,27–29
Although helpful, these ICU datasets might not record the complete physiological profiles of either the general population or individuals in non-daily life circumstances. 18 This study addresses a reported limitation in the literature by utilizing data from a general hospital setting, with the possibility of being more relevant to real-world application outside critical care. This is in concurrence with the suggestion for more research on patients both outside the ICU and across all ranges of blood pressure. Some studies, like the HYPE study, have indeed focused on hypertensive patients. 16
Regarding performance, this work focused on achieving perfect recall for hypertensive cases with the CNN-GRU model. State-of-the-art BP estimation studies tend to report performance as mean absolute error (MAE) and standard deviation (SD) and to compare against AAMI, BHS, and IEEE standards.20,28 For hypertension classification, accuracy, precision, recall, and F1-score are typical metrics. The recall-focused approach here emphasizes correctly identifying hypertensive patients, a clinically pertinent factor in managing hypertension.
For instance, Liang et al.,27 using the MIMIC-III database, applied a LightGBM model with Tsfresh feature extraction to PPG, VPG, and APG signals and reported an F1-score of 92.77% for the (NT + PHT) vs. HT classification problem, a different metric but strong performance in an ICU setting. Yan et al.19 conducted a pilot study with a self-collected dataset of 30 volunteers (15 healthy, 15 hypertensive) and an LSTM-Attention model, achieving a much higher accuracy of 99.1% in hypertension detection. Although their dataset was small and self-collected, their results support the potential of DL on PPG for hypertension detection. Furthermore, Seto et al.13 used a large dataset of 5992 normotensive and hypertensive adults and achieved 73% accuracy with a Random Forest model on a wide range of features beyond PPG alone. The focus of our study on non-ICU patients gives insight into the possibilities of PPG-based hypertension monitoring in more general health care settings, where data quality may differ from that in ICU settings.
Several limitations of our study must be acknowledged. First, the relatively small dataset of 218 patients may limit the models' ability to learn more complex features. Second, class imbalance in our dataset may have produced certain classification tendencies in the models despite the augmentation strategies. Finally, binary classification may not fully represent the spectrum of hypertension severity. These limitations echo the point made by Mejhoudi et al.: larger medical datasets are needed to support robust machine learning.30
Integrating other physiological signals or clinical parameters could improve classification accuracy, as also suggested by Sánchez-Reolid et al. (2019), who took a more holistic approach to analyzing physiological data. In addition, developing more sophisticated augmentation methods specific to PPG signals may improve performance; Choi and Lee noted the importance of proper preprocessing in DL applications.31 Further work could also investigate the interpretability of the proposed methods to understand which PPG features most influence the classification decision. Such investigation would provide valuable insight for clinicians and aligns with the growing emphasis on explainable AI in healthcare.
Validation on larger and more diverse patient populations would help in generalizing our results. This is important because, as reported by Han et al., PPG signal characteristics may vary significantly among demographic groups; effective training of machine learning models therefore requires diverse datasets.32 The literature on hypertension classification also points to the investigation of multi-class classification methods, which would discriminate between grades of hypertension and therefore allow more detailed insight into the disease.
From a clinical viewpoint, these results suggest that DL-based PPG analysis could be applied to hypertension screening, especially in resource-constrained settings where traditional methods of blood pressure monitoring are limited in their applicability. The high sensitivity of the CNN-GRU model for hypertensive cases makes it suitable for initial screening, although the relatively high false positive rate noted above indicates that positive results should be confirmed with traditional blood pressure measurements. This agrees with recommendations by health organizations calling for novel screening methods to improve hypertension detection rates in resource-poor settings.
In summary, our findings show the promise of DL approaches to PPG-based hypertension detection while underlining challenges relevant to currently available methods. The performance of the CNN-GRU architecture offers a promising direction for future development, but further studies are warranted to enhance specificity and confirm these observations in larger and more diverse populations before clinical use. Going forward, DL research for hypertension detection should combine the best modeling techniques with a focus on interpretability.
Conclusion
This study indicates that integrating machine learning with demographic data and PPG sensor data holds considerable potential for hypertension prediction. Among the three DL models studied in this work (CNN-Attention, CNN-GRU, and LSTM), the CNN-GRU model performed best, with 75% accuracy and an AUC-ROC of 0.658. The model has high recall for hypertensive cases and is therefore well suited to screening purposes. The better overall performance of the CNN-GRU architecture suggests that combining convolutional layers for morphological feature extraction with recurrent units for modeling temporal dependencies is particularly suitable for PPG signal analysis. However, the limited specificity in detecting normal cases suggests that these models should be used only as preliminary screening tools and not as definitive diagnostic instruments. Although the results are encouraging, the experiments have a number of limitations, such as the relatively small size of the dataset and its imbalance, which leave room for further improvement. Future work should be directed at increasing the size and diversity of the dataset, using more physiological signals and clinical parameters, developing more sophisticated augmentation techniques for PPG signals, investigating interpretability methods to understand how the model makes decisions, and exploring multi-class classification strategies for different grades of hypertension. These findings add to the growing evidence supporting the feasibility of PPG-based hypertension detection and point to the potential for developing accessible, continuous monitoring solutions for cardiovascular health. This may be particularly useful in resource-poor settings where traditional methods of blood pressure monitoring are less commonly used.
Footnotes
Acknowledgments
The authors have no acknowledgments.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
