Abstract
Background
Atrial fibrillation (AF) is a common arrhythmia associated with substantial morbidity, particularly ischemic stroke. Conventional AF screening using rule-based algorithms is limited by suboptimal accuracy and the need for extensive clinician review when processing large volumes of electrocardiogram (ECG) data. Artificial intelligence (AI) has emerged as a promising approach to improve detection efficiency and scalability.
Objectives
To develop and internally evaluate a deep learning model for AF screening using real-world 24-hour Holter ECG data, and to evaluate a semi–real-time monitoring workflow integrating AI and wearable ECG devices in high-risk patients.
Methods
This retrospective study included 1,489 Holter ECG recordings collected at Nguyen Trai Hospital. A Residual Network (ResNet) model was trained on 1,089 recordings and evaluated on an independent dataset of 400 recordings using case-level classification with patient-level data splitting. A semi–real-time screening workflow integrating AI analysis and wearable ECG devices was prospectively implemented in a high-risk cohort.
Results
In the training dataset, cardiologists annotated a total of 29,765 minutes of AF episodes across 82 AF-positive recordings. These episode-level annotations were subsequently mapped to fixed 60-second segments for model training using a predefined sampling strategy. On the independent evaluation dataset (n = 400; AF prevalence 6.3%), the model achieved 100.0% sensitivity, 88.0% specificity, and 35.7% positive predictive value for case-level AF detection. In the pilot cohort (n = 167), AF was detected in 24 patients (14.4%), including 20 (12.0%) in whom AF was identified only after more than 24 hours of monitoring.
Conclusions
The proposed framework demonstrated high sensitivity for AF detection in real-world Holter ECG data. Integration with wearable devices in a semi–real-time clinician-in-the-loop workflow was feasible and may support AF screening in high-risk populations.
1. Introduction
Atrial fibrillation (AF) is one of the most prevalent cardiac arrhythmias and is strongly associated with an increased risk of thromboembolic events, particularly ischemic stroke. Timely diagnosis and appropriate anticoagulation therapy substantially reduce stroke risk, whereas delayed or missed detection exposes patients to preventable cerebrovascular events. Conversely, overdiagnosis may lead to unnecessary anticoagulation and associated bleeding risks.1,2 Recent guidelines have therefore expanded indications for AF screening, particularly among patients with prior ischemic stroke or transient ischemic attack. 3
The diagnosis of AF is typically established using 12-lead electrocardiography (ECG); however, paroxysmal or subclinical AF episodes are frequently missed in routine clinical evaluation. Ambulatory ECG monitoring, including 24-hour Holter monitoring, is widely used to improve detection yield. 4 Nevertheless, traditional Holter analysis relies on rule-based algorithms with limited accuracy in noisy or complex recordings, often requiring time-intensive expert review. 5
Recent advances in deep learning have enabled artificial intelligence (AI) models to automatically extract complex temporal and morphological ECG features, offering improved accuracy compared with conventional approaches. These models provide a scalable solution for AF screening and rhythm interpretation in large datasets. 6
In this context, the present study aimed to (1) develop and internally evaluate a deep learning–based AI model for AF detection using real-world 24-hour Holter ECG data, and (2) evaluate a semi–real-time monitoring framework integrating AI analysis with wearable ECG devices for screening in high-risk populations.
2. Methods
2.1. Study objectives
This study had three main objectives: (1) to develop a deep learning–based classifier for atrial fibrillation (AF) detection using Holter ECG data; (2) to evaluate the diagnostic performance of the model at the case (Holter recording) level using real-world datasets; and (3) to implement and assess the feasibility of a semi–real-time AF monitoring workflow integrating wearable ECG devices and AI analysis in high-risk patients.
The study was conducted at Nguyen Trai Hospital, Ho Chi Minh City, Vietnam, between March 2024 and September 2025. All data were anonymized prior to analysis in accordance with institutional ethical standards.
2.2. AI model development
2.2.1. Population and dataset
A total of 1,489 24-hour Holter ECG recordings were retrospectively collected during the study period. Of these, 135 recordings were excluded from demographic analysis because of incomplete demographic information. All 1,489 recordings were retained for model development and evaluation (Figure 1).
Figure 1. Dataset flow diagram showing inclusion, exclusion, and allocation of Holter ECG recordings for demographic analysis and model development.
2.2.2. Data splitting
The dataset was split at the patient (subject) level to prevent data leakage. All recordings were assigned exclusively to either the training dataset (n = 1,089) or the evaluation dataset (n = 400), with no overlap. Each patient contributed only one Holter recording.
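The splitting rule can be illustrated with a minimal sketch. It assumes a random assignment of patients with a fixed seed; the source states only that the split was made at the patient level, so the randomization, the function name `split_by_patient`, and the toy identifiers are hypothetical.

```python
import random

def split_by_patient(recordings, n_eval_patients, seed=0):
    """Split (patient_id, recording_id) pairs so no patient is in both sets.

    Assumption: patients are assigned at random; the study does not
    describe how patients were allocated to each dataset.
    """
    patient_ids = sorted({pid for pid, _ in recordings})
    rng = random.Random(seed)
    rng.shuffle(patient_ids)
    eval_patients = set(patient_ids[:n_eval_patients])
    train = [r for r in recordings if r[0] not in eval_patients]
    evaluation = [r for r in recordings if r[0] in eval_patients]
    return train, evaluation

# Toy identifiers: 1,489 patients with one recording each, as in the study.
recordings = [(f"P{i:04d}", f"R{i:04d}") for i in range(1489)]
train, evaluation = split_by_patient(recordings, n_eval_patients=400)

# Leakage check: the two sets must not share any patient.
train_ids = {pid for pid, _ in train}
eval_ids = {pid for pid, _ in evaluation}
assert train_ids.isdisjoint(eval_ids)
print(len(train), len(evaluation))  # 1089 400
```

Because segments are generated only after this step, every segment inherits the dataset of its parent recording.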
ECG signals were preprocessed using a bandpass filter of 0.5–30 Hz before segmentation. No notch filtering, resampling, normalization, or feature scaling was applied. ECG segments were generated only after dataset splitting, ensuring that no segment from the same recording appeared in both datasets.
Within the training dataset, a patient-level validation subset was used for hyperparameter tuning, model selection, and early stopping, and remained fully disjoint from the evaluation dataset. No cross-validation was performed.
2.2.3. Annotation and labeling
Two independent cardiologists reviewed all recordings and annotated AF episodes at the episode level. Inter-rater agreement, assessed with Cohen’s kappa coefficient before consensus adjudication, was κ = 0.92. Continuous AF episodes were labeled from onset to termination, whereas intermittent AF was annotated as separate discrete episodes. Transition points were defined based on the earliest visually identifiable rhythm change.
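For readers unfamiliar with the agreement statistic, the following sketch computes Cohen's kappa from two raters' labels. The unit of agreement (recording, episode, or segment) is not specified in the source, so the sketch assumes one binary label per item, and the toy labels are invented for illustration only.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    # Observed agreement: fraction of items given the same label.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    p_chance = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if p_chance == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical labels (1 = AF, 0 = non-AF); not data from the study.
rater_a = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
rater_b = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(round(cohens_kappa(rater_a, rater_b), 3))  # 0.737
```

In this toy case the raters agree on 9 of 10 items, yet kappa is about 0.74 because much of that agreement is expected by chance when one class dominates.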
In this study, annotated rhythm strips referred to cardiologist-identified continuous rhythm intervals or episodes with variable duration, rather than fixed-length model input segments. The reported AF duration therefore reflects cumulative annotated AF time across recordings. These episode-level annotations were subsequently mapped to fixed 60-second segments for model training.
The non-AF category included sinus rhythm, sinus tachycardia, sinus bradycardia, atrial tachycardia, supraventricular ectopy (premature atrial contractions, PACs), and ventricular ectopy (premature ventricular contractions, PVCs). No atrial flutter cases were present in the dataset. Frequent PACs or PVCs with irregularity were classified as non-AF when preserved P-wave morphology or consistent ectopic patterns were observed. Recordings from patients with implanted pacemakers were excluded. Recordings with variable signal quality were retained for model development; potential noise-related misclassification was mitigated through clinician verification.
2.2.4. Segmentation and aggregation
After patient-level splitting, ECG recordings were segmented into 60-second windows. In the training dataset, overlapping segmentation was first applied to AF episodes using a fixed stride smaller than 60 seconds, whereas non-AF signals were segmented using non-overlapping windows (stride = 60 seconds). Segment-level sampling was then applied, retaining all AF-positive segments and randomly sampling non-AF segments to achieve an approximately balanced class distribution.
This sampling was applied only in the training dataset and does not reflect dataset-level AF prevalence. In the evaluation dataset, non-overlapping segmentation (stride = 60 seconds) was applied uniformly to avoid redundancy and ensure unbiased performance estimation.
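The segmentation and sampling scheme described above can be sketched as follows. The overlapping stride is not reported in the source, so the 10-second value used here is purely hypothetical, as are the function names and the toy signal lengths. For simplicity the sketch windows an AF episode in isolation, whereas real windows may extend beyond episode boundaries.

```python
import random

FS = 200            # sampling rate in Hz, as stated in the study
WINDOW = 60 * FS    # 60-second window = 12,000 samples

def window_starts(n_samples, stride, window=WINDOW):
    """Start indices of every full window that fits in n_samples."""
    return list(range(0, n_samples - window + 1, stride))

def balance(af_segments, non_af_segments, seed=0):
    """Keep all AF segments and randomly sample an equal number of non-AF."""
    rng = random.Random(seed)
    k = min(len(af_segments), len(non_af_segments))
    return af_segments, rng.sample(non_af_segments, k)

# Toy lengths: a 3-minute AF episode and 1 hour of non-AF signal.
af = window_starts(180 * FS, stride=10 * FS)       # hypothetical 10 s stride
non_af = window_starts(3600 * FS, stride=WINDOW)   # non-overlapping
kept_af, kept_non_af = balance(af, non_af)
print(len(af), len(non_af), len(kept_non_af))  # 13 60 13
```

With non-overlapping windows the same 3-minute episode would yield only 3 segments, which shows how overlap enlarges the minority class before non-AF segments are down-sampled.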
Each segment contained 12,000 samples (200 Hz × 60 seconds). A segment was defined as AF-positive if it contained ≥6 consecutive seconds of confirmed AF. The model generated 60 probability outputs per segment (one per second). A segment was classified as AF-positive if ≥6 consecutive seconds had predicted AF probability ≥0.5. At the case level, a Holter recording was classified as AF-positive if at least one segment met the AF-positive criterion.
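The decision rules in this paragraph translate directly into a short sketch. Only the rules themselves (60 per-second probabilities, a threshold of 0.5, at least 6 consecutive seconds, and at least one positive segment per recording) come from the text; the function names and the toy probability vectors are illustrative assumptions.

```python
def longest_run(flags):
    """Length of the longest run of consecutive True values."""
    best = run = 0
    for flag in flags:
        run = run + 1 if flag else 0
        best = max(best, run)
    return best

def segment_is_af(probs, threshold=0.5, min_run=6):
    """Segment rule: >= min_run consecutive seconds with probability >= threshold."""
    if len(probs) != 60:
        raise ValueError("expected one probability per second (60 values)")
    return longest_run(p >= threshold for p in probs) >= min_run

def case_is_af(segments):
    """Case rule: a recording is AF-positive if any segment is AF-positive."""
    return any(segment_is_af(probs) for probs in segments)

# Toy per-second probabilities, not model output from the study.
quiet = [0.1] * 60
burst = [0.1] * 20 + [0.9] * 6 + [0.1] * 34   # one 6-second run
scattered = ([0.9] * 5 + [0.1]) * 10          # many 5-second runs
print(segment_is_af(quiet), segment_is_af(burst), segment_is_af(scattered))
# False True False
print(case_is_af([quiet, scattered]), case_is_af([quiet, burst]))
# False True
```

Because a single positive segment flags the whole recording, this aggregation favors sensitivity at the cost of case-level specificity.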
2.2.5. Model architecture and training
A Residual Network (ResNet) architecture was used to model temporal ECG features (Figure 2). Input segments consisted of 60-second ECG signals sampled at 200 Hz. A single ECG lead was selected from the available channels and used as input to the model for both training and inference. Convolutional layers with progressively increasing filter counts (16–112) were used to extract hierarchical features, followed by batch normalization, ReLU activation, and dropout layers.
Figure 2. Residual network (ResNet) architecture.
The classification threshold was determined using the training/validation dataset with prioritization of sensitivity for screening purposes and fixed prior to evaluation. External datasets (MIT-BIH and AFDB) were used for technical benchmarking only and were not involved in model training, hyperparameter tuning, or evaluation on the primary dataset.7,8
Model architecture, training, and evaluation framework.
2.3. AI performance evaluation
The evaluation was designed to estimate diagnostic performance, particularly sensitivity, at the case (Holter recording) level. No formal a priori sample size calculation was performed, and the evaluation dataset size was determined based on available real-world data.
All diagnostic performance metrics were calculated at the case level following aggregation of segment-level predictions. Sensitivity, specificity, positive predictive value, and negative predictive value were reported with 95% confidence intervals using the Wilson score method. AUROC and PR-AUC were reported descriptively.
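As an illustration of the interval method, the sketch below implements the standard Wilson score formula for a binomial proportion. The example uses the 25 AF-positive evaluation cases and the 100.0% sensitivity reported in this paper; the printed bounds are what the formula gives for those counts and are not values quoted from the study's results table.

```python
from math import sqrt

def wilson_interval(successes, n, z=1.959963984540054):
    """Wilson score interval for a binomial proportion (default: 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)

# Sensitivity: 25 of 25 AF-positive recordings detected.
low, high = wilson_interval(25, 25)
print(f"{low:.3f} to {high:.3f}")  # 0.867 to 1.000
```

Unlike the simple normal-approximation interval, which collapses to zero width when the observed proportion is 100%, the Wilson interval still gives an informative lower bound for small samples.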
2.4. Clinical application: Semi–real-time AF screening
2.4.1. Design of the pilot study
Following model development, a semi–real-time atrial fibrillation (AF) screening protocol was deployed in a cohort of high-risk patients without previously diagnosed AF using wearable multi-use 7-day ambulatory ECG devices. The high-risk population was defined as individuals meeting at least one of the following criteria: (1) history of ischemic stroke, (2) age ≥ 75 years, or (3) CHA2DS2-VA score ≥ 2 points. 2 Monitoring duration was determined by clinicians based on individual clinical judgment.
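The inclusion rule is a simple logical OR over the three criteria, which can be written as a short sketch. The function and parameter names are hypothetical; the absence of previously diagnosed AF is a separate eligibility condition and is not modeled here.

```python
def is_high_risk(prior_ischemic_stroke, age_years, cha2ds2_va_score):
    """High-risk definition: at least one of the three stated criteria."""
    return bool(
        prior_ischemic_stroke
        or age_years >= 75
        or cha2ds2_va_score >= 2
    )

# Hypothetical patients, for illustration only.
print(is_high_risk(False, 68, 1))  # False: no criterion met
print(is_high_risk(False, 76, 1))  # True: age >= 75
print(is_high_risk(True, 55, 0))   # True: prior ischemic stroke
```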
The primary endpoint was feasibility of the AI-assisted screening workflow, defined as successful completion of ECG acquisition, data upload, automated analysis, and clinician verification. Secondary endpoints included AF detection yield, monitoring duration, timing of AF detection, and AF burden (cumulative duration of AF episodes).
2.4.2. Wearable device
A wearable multi-use ECG device, OCTOBEAT (OCTOMED Co., Ltd., Vietnam), was used to support ambulatory rhythm monitoring (Figure 3). The device enables up to 7 days of continuous recording using a 3-lead configuration and supports Bluetooth-based data transmission to a mobile application with subsequent upload to a hospital server. The system provides basic device status monitoring (e.g., battery level and electrode contact) and allows patient-triggered symptom annotation. The device complies with IEC 60601 safety standards and was used solely for data acquisition in this study; all clinical decisions were made by cardiologists.
Figure 3. Wearable multi-use device and patch electrodes.
2.4.3. Operational definition of semi–real-time workflow
In this study, “semi–real-time” refers to automated AI-based analysis performed after ECG data upload, with clinician review completed within 24 hours. Accordingly, the expected detection-to-review latency is within 24 hours from data upload.
This daily review cadence was selected to balance detection timeliness with clinical workflow feasibility in a screening context. No continuous real-time alerting was implemented. Instead, AI-detected AF events were communicated to clinicians at a predefined daily review time, representing a scheduled alerting workflow. No automated alert performance metrics (e.g., alert frequency or response time) were systematically recorded in this study.
2.4.4. AI-integrated screening workflow
ECG data were transmitted to a central server via wireless communication, preprocessed using the same bandpass filtering pipeline, segmented into 60-second non-overlapping windows, and input into the trained AI model for AF probability inference.
All AI-generated outputs were reviewed by cardiologists before clinical decision-making. The clinician-in-the-loop framework supports scalable ECG analysis with cardiologist oversight in real-world clinical settings. The AI-integrated screening workflow is illustrated in Figure 4 and includes segment prioritization, cardiologist confirmation, and subsequent clinical action following AF detection. This study was reported in accordance with TRIPOD-AI principles to ensure transparency and reproducibility (Appendix).
Figure 4. AI-integrated semi–real-time atrial fibrillation screening workflow.
3. Results
A total of 1,489 24-hour Holter ECG recordings were included, with 1,089 assigned to the training dataset and 400 to the evaluation dataset. A total of 135 recordings with missing demographic information were excluded from descriptive analyses.
3.1. Demographics of study population
Demographics of study population.
*A total of 135 cases with missing demographic information were excluded.
Atrial fibrillation prevalence.
3.2. AI model training
Annotated rhythm intervals and cumulative duration in the training dataset.
Annotated rhythm strips are variable-duration rhythm intervals identified by cardiologists.
AF: Atrial fibrillation.
External benchmark performance on MIT-BIH and AFDB datasets using segment-based and duration-based metrics.
ESe: sensitivity, segment-based; E+P: precision, segment-based; DSe: sensitivity, duration-based; D+P: precision, duration-based.
3.3. AI model performance
3.3.1. Evaluation dataset
Independent Holter ECG recordings were used for case-level evaluation. In the final evaluation dataset, 25 AF-positive cases were observed among 400 recordings (6.3%).
3.3.2. Diagnostic performance evaluation
Confusion matrix and diagnostic performance on the evaluation dataset (case-level).
CI: confidence interval, PPV: positive predictive value, NPV: negative predictive value, AUROC: area under the receiver operating characteristic curve, PR-AUC: precision–recall AUC.
3.4. Clinical implementation
A pilot implementation of the AI-assisted screening workflow was conducted in 179 high-risk patients. The AI-assisted workflow was successfully completed in 167 of 179 patients (93.3%), including ECG acquisition, data upload, automated analysis, and clinician verification. The remaining 12 patients were excluded due to early device removal, insufficient recording duration, or inadequate signal quality.
Clinical application of the semi–real-time AF screening protocol.
Patients with detected AF were allowed to discontinue monitoring early and were referred for clinical evaluation. Oral anticoagulation therapy was initiated when clinically indicated, given the high-risk baseline characteristics of the cohort.
4. Discussion
In this study, we developed and evaluated a deep learning–based model for atrial fibrillation (AF) detection using real-world 24-hour Holter ECG data and implemented a semi–real-time screening workflow integrating wearable ECG devices and AI analysis. The model achieved high sensitivity (100.0%) and moderate specificity (88.0%) at the case level. The high sensitivity reflects the case-level aggregation rule (≥1 AF-positive segment) combined with a threshold prioritizing detection of AF episodes, consistent with prior AI-based AF detection studies.9,10 In addition, the pilot implementation in a high-risk cohort demonstrated the feasibility of integrating AI-assisted ECG analysis into routine clinical workflows.
The relatively low positive predictive value observed in this study is primarily attributable to the low AF prevalence in the evaluation dataset (6.3%), which is consistent with a screening context in a low-prevalence population. In addition, the sensitivity-prioritized classification strategy may contribute to an increased false-positive rate. In real-world screening implementation, this may increase the number of AI-flagged recordings requiring clinician review and potentially contribute to additional workload. Further refinement of classification thresholds and post-processing strategies may help improve precision while maintaining adequate screening sensitivity.
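The dependence of positive predictive value on prevalence follows from Bayes' rule and can be checked with a short calculation. The sensitivity, specificity, and evaluation prevalence (25 of 400) are taken from this paper; the 20% prevalence in the second call is a hypothetical value chosen only to show the trend.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from sensitivity, specificity, prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    if true_pos + false_pos == 0:
        raise ValueError("no positive predictions; PPV is undefined")
    return true_pos / (true_pos + false_pos)

# Reported operating point at the evaluation prevalence (25/400 = 6.25%).
print(round(ppv(1.0, 0.88, 25 / 400), 3))  # 0.357

# Same operating point at a hypothetical 20% prevalence.
print(round(ppv(1.0, 0.88, 0.20), 3))      # 0.676
```

The first result reproduces the reported 35.7%. Holding sensitivity and specificity fixed, the same model would yield a substantially higher positive predictive value in a higher-prevalence population.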
Conventional ECG interpretation remains dependent on clinician expertise and is subject to interobserver variability. AI-based ECG analysis enables scalable review of large volumes of ambulatory recordings, reducing reliance on continuous manual interpretation. 11 The AI system was designed as a decision-support tool rather than an autonomous diagnostic system, with all outputs reviewed by cardiologists prior to clinical decision-making.
The semi–real-time workflow implemented in this study enables structured batch analysis of ECG data with clinician review within 24 hours, providing a practical balance between timely detection and clinical workload constraints. By prioritizing segments with high predicted AF probability, the system facilitates efficient identification of clinically relevant AF episodes. 12
Continuous ECG monitoring is essential for detecting paroxysmal and asymptomatic AF. While conventional 24-hour Holter monitoring provides high-quality multi-lead recordings, it is limited by short monitoring duration. 13 Extended monitoring using single-use patch devices improves detection yield but may be associated with higher cost, limited reusability, and reduced signal diversity. 14 In contrast, the reusable multi-lead wearable ECG device used in this study supports extended monitoring and flexible ambulatory use. This approach may facilitate repeated or longitudinal rhythm monitoring in high-risk populations, where intermittent AF episodes are often missed with short-duration recordings.
The pilot implementation further demonstrated the feasibility of this approach in a real-world high-risk cohort. A high completion rate (93.3%) was achieved, and AF was detected in 14.4% of patients, with the majority of cases identified after 24 hours of monitoring. These findings underscore the importance of extended monitoring duration for detecting intermittent AF. The workflow endpoint was defined at the system level, including successful ECG acquisition, data transmission, automated analysis, and clinician verification. The ability to discontinue monitoring early after AF detection represents an additional practical advantage, enabling timely clinical intervention while optimizing resource utilization. Patients with detected AF were referred for clinical evaluation, and subsequent management, including initiation of oral anticoagulation therapy, was determined based on individual clinical assessment.
This study has several important limitations. First, it was conducted at a single center using retrospective data for model development, which may limit generalizability. External validation in independent and multicenter datasets is required. Second, although sensitivity was high, specificity could be further improved through refinement of classification thresholds or post-processing strategies. Third, workflow-level metrics, including alert frequency, false-positive alert rate, clinician review time, and time-to-notification, were not systematically recorded and should be evaluated in future prospective studies. Fourth, calibration performance was not assessed and warrants further investigation using calibration curves and related metrics. Fifth, no formal signal quality validation against a reference standard (e.g., 12-lead ECG) was performed. Sixth, no dedicated model interpretability analysis (e.g., saliency maps or attention mechanisms) was conducted, which may limit understanding of model decision processes. Finally, potential failure modes include misclassification in rhythms with irregular RR intervals (e.g., atrial tachycardia, frequent ectopy) and in low-quality recordings; these risks were partially mitigated through expert annotation and clinician verification.
5. Conclusions
This study demonstrates the feasibility of integrating deep learning–based ECG analysis with wearable monitoring devices in a semi–real-time clinician-in-the-loop workflow for atrial fibrillation screening. Using real-world Holter ECG data, the proposed framework achieved high sensitivity for AF detection while enabling structured prioritization and review of ambulatory ECG recordings. In a high-risk pilot cohort, AI-assisted wearable ECG monitoring was successfully incorporated into routine clinical workflows. Further prospective multicenter studies are warranted to evaluate broader clinical implementation and workflow performance.
Footnotes
Acknowledgements
The authors thank Minh Van Le from Nguyen Trai Hospital for his assistance and support in data collection.
Ethical considerations
The study protocol was reviewed and approved by the Institutional Review Board of the University of Medicine and Pharmacy at Ho Chi Minh City (IRB-VN01002/IRB00010293/FWA00023448). All procedures were conducted in accordance with the ethical standards of the institutional research committee and with the 1964 Declaration of Helsinki and its later amendments.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work has received support from the Korea International Cooperation Agency (KOICA) under the project entitled “Education and Research Capacity Building Project at University of Medicine and Pharmacy at Ho Chi Minh City” conducted from 2024 to 2025 (Project No. 2021-00020-3).
Declaration of conflicting interests
The authors declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: Minh Khac Ho is affiliated with OCTOMED Co., Ltd., which developed the wearable ECG device used in this study. The device was used solely for data acquisition, and all diagnostic decisions were made by cardiologists. The other authors declare no conflicts of interest.
Contributorship
Si Van Nguyen: Conceptualization, study design, data analysis, manuscript drafting, and critical revision; Minh Khac Ho: AI model development, critical revision; Dat Vu Nguyen: Data labeling; Canh Quang Nguyen: Data labeling; An Le Pham: Supervision, critical revision; Hung Thanh Quach: Supervision, critical revision. All authors read and approved the final manuscript.
Guarantor
Si Van Nguyen, M.D., Ph.D.
