Abstract
Background
Low birth weight (LBW) is a leading cause of death for newborns and increases chronic disease risks later in life. Early identification of LBW risk is crucial.
Aim
The objective of this study was to develop predictive models for LBW using boosting ensemble machine learning, with a focus on features available during early pregnancy, such as pre-pregnancy body mass index, body height, and blood pressure before 20 weeks of pregnancy.
Methods
This is a retrospective cohort study. We used electronic medical records from four hospitals in Taiwan, covering 6719 pregnant women who received prenatal care from January 2016 to July 2019. Data preprocessing involved normalization, one-hot encoding, and the synthetic minority oversampling technique to address class imbalance. Boosting ensemble methods were used to build the LBW predictive models.
Results
The mean diastolic blood pressure (DBP) in early pregnancy (<20 weeks) was 66.5 mmHg; 29.6% of participants had a history of abortion, 8.7% delivered LBW infants, 12.2% were overweight or obese before pregnancy, and 18.3% had elevated blood pressure or stage I hypertension before 20 weeks of pregnancy. Light Gradient Boosting Machine was the best-performing LBW model, with an area under the curve of 0.96 and an accuracy of 93.4%. Early pregnancy DBP, maternal height, and number of abortions were the most important features.
Conclusions
The LBW prediction model performed well. Nurses could use the model to assess LBW risk and intervene early. Preventive efforts could be directed to blood pressure management starting in early pregnancy, nutritional support for mothers of short stature, and self-care for women with a history of abortion.
Introduction
The World Health Organization (WHO) defines low birth weight (LBW) as a birth weight of <2500 g, irrespective of gestational age. Annually, over 20 million newborns are classified as having LBW, significantly contributing to mortality among newborns and children under 5 years of age. 1 Globally, 14.7% of live births are classified as LBW. 2 In Taiwan, the incidence of LBW increased from 5% in 1997 to 8.4% in 2016, 3 and increased to 10.94% by 2023. 4 Infants with LBW are ∼20 times more likely to die than heavier infants. 1 Common risks associated with LBW include cardiovascular disease, type 2 diabetes, hypertension, dyslipidemia, and chronic kidney disease. 5 Furthermore, LBW infants are at risk of growth disorders in early childhood, which may lead to obesity and subsequently cause blood pressure (BP) changes and other chronic cardiometabolic complications from childhood to adulthood. 6 Therefore, developing early predictive models for LBW may help develop prevention and intervention measures and reduce its incidence and subsequent health risks.
Machine learning (ML) forms the core of artificial intelligence (AI) and data science. 7 Ensemble learning merges predictions from two or more base models and often achieves the best performance in ML tasks. Boosting algorithms include AdaBoost, Gradient Boosting (GB), Extreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), and Categorical Boosting (CatBoost). AdaBoost improves classifier accuracy by adjusting the weights of misclassified samples. GB incrementally adds new models that reduce the error of the previous ones and is applicable to both regression and classification. XGBoost offers optimized computational speed and model performance and is widely used in ML competitions. LightGBM focuses on speed and efficiency, making it suitable for large-scale datasets. CatBoost is optimized for categorical features and automatically handles missing values and categorical data. 8
ML has been increasingly applied in maternal–fetal medicine. Recent studies have used ML techniques to predict a wide range of perinatal outcomes, including pre-eclampsia, 9 intrauterine growth restriction, 10 mother–newborn skin-to-skin contact, 11 nonreassuring fetal heart rate patterns, 12 and episiotomy risk factors. 13
This study used ensemble ML boosting techniques to establish a prediction model for LBW. Recent studies have demonstrated that ML-based LBW prediction is evolving toward multimodal data integration, advanced ensemble learning, and early gestational prediction. For example, Camargo-Marín et al. 14 proposed a multimodal birth weight prediction framework using multiple kernel learning, achieving a mean absolute error of 234 g with first-trimester maternal–fetal variables. Ranjbar et al. 15 evaluated eight ML algorithms on a large national perinatal dataset and identified XGBoost as the best-performing model (AUROC = 0.79, precision = 0.87). Khan et al. 16 developed ML-based LBW classification and birth weight estimation models in the UAE, integrating maternal and fetal features with ensemble and regression approaches, achieving competitive performance. These advances provide a strong methodological foundation and align with the objectives of the present study.
Regarding the factors related to LBW, a meta-analysis showed that women receiving at least one antenatal care session were associated with a reduced incidence of LBW. 17 Pattath et al. 18 demonstrated that mothers who did not receive prenatal care had a significantly higher proportion of LBW infants. A systematic review and meta-analysis showed that women with short stature had a higher risk of LBW. 19 Additionally, one study showed that mothers with short stature (height <145 cm), lower educational level, body mass index (BMI) <18.5, and inadequate prenatal care were associated with a higher risk of LBW. 20 Islam Pollob et al. 21 also found that maternal height and educational level were associated with LBW. Pregnant women with a history of abortion tend to have a significantly higher risk of LBW. 22 One study showed that the number of abortions was positively related to LBW. 23 LBW is associated with advanced maternal age, such that women aged 18–34 years are less likely to have infants with LBW than those aged ≥35 years. 24 Parity is also associated with LBW, with primiparous women having a higher risk for LBW. 25
The American College of Cardiology (ACC)/American Heart Association (AHA) revised the definition of hypertension in 2017, defining it as a systolic blood pressure (SBP) ≥ 130 mmHg or a diastolic blood pressure (DBP) ≥ 80 mmHg. 26 Elevated BP was defined as SBP between 120 and 129 mmHg and DBP <80 mmHg, while stage 1 hypertension was defined as SBP between 130 and 139 mmHg or DBP between 80 and 89 mmHg. 26 However, the criteria for diagnosing hypertensive disorders of pregnancy (HDP) have not changed, with BP defined as ≥140/90 mmHg. A previous study showed that HDP was associated with LBW using a pre-revised definition of hypertension (140/90 mmHg). 27 Although obstetric guidelines have not yet adopted new criteria for HDP, several studies have used the new criteria to explore their relationship with newborn outcomes. Wu et al. 28 demonstrated that elevated BP and stage 1 hypertension contribute to an increased risk of LBW. In contrast, Greenberg et al. 29 reported that only stage 1 hypertension, not elevated BP, was associated with an increased risk of adverse neonatal outcomes. These findings indicate inconsistencies between HDP cutoff values during pregnancy and birth outcomes across studies.
Hypertension increases the risk of maternal vascular malperfusion (MVM), and DBP reflects MVM better than SBP. These disorders can affect the blood supply to the placenta, thereby increasing the risk of LBW. 31 A previous study showed that high DBP in pregnant women is related to LBW. 32 However, few studies have investigated whether early DBP affects LBW, and studies on the relationship between LBW and the revised BP standards are limited. Exploring early pregnancy BP is crucial for understanding the effects of revised hypertension standards and BP on LBW. In addition, previous studies on LBW prediction using ML have predominantly examined cohorts from Latin America,9 the Middle East,10,11,23 South Asia, 30 and Africa. 23 To the best of our knowledge, few studies have incorporated early pregnancy DBP—evaluated under the 2017 ACC/AHA hypertension criteria—into ensemble boosting models for LBW prediction, particularly in a contemporary Chinese cohort. We selected DBP as the core BP indicator based on prior evidence that DBP is more strongly associated with LBW than SBP and may be linked to MVM. Although MVM was not directly measured in this study, the inclusion of early pregnancy DBP aligns with the concept of early health screening and risk stratification. We also systematically compared several state-of-the-art boosting algorithms and generated complete feature-importance rankings to strengthen model interpretability and facilitate clinical translation.
Developing predictive models for LBW using ML methodologies can facilitate the early identification of at-risk pregnancies and enable timely interventions. The study aimed to use ensemble boosting techniques to build predictive ML models, determine the best predictive model, and identify the important features of LBW with a focus on features available during early pregnancy, such as pre-pregnancy BMI, body height, and BP before 20 weeks of pregnancy. By enhancing the accuracy and timeliness of identifying at-risk pregnancies, this study not only builds on prior findings but also contributes to a deeper understanding of the factors influencing LBW. The insights gained from this study can inform the development of targeted prevention and intervention strategies, ultimately improving maternal and neonatal health outcomes and reducing the incidence of LBW.
Methods
Design and setting
This retrospective cohort study was based on electronic medical records obtained from four hospitals (three in Northern Taiwan and one in Eastern Taiwan) where pregnant women received prenatal care between January 2016 and July 2019. Eligible participants were women over 20 years of age with a singleton pregnancy of more than 20 weeks and no history of chronic disease. We retrieved data from the electronic records of the study hospitals between February 2020 and May 2021. The study was approved by the Institutional Review Boards of the National Yang Ming Chiao Tung University (YM107127E) and Mackay Hospital (19MMHIS320e).
Participants
The initial dataset included 7547 participants. During data cleaning, we excluded participants with abnormal BP values (SBP <70 mmHg or DBP <40 mmHg; n = 38) and those with fewer than two early pregnancy BP measurements (within 20 weeks of pregnancy; n = 685). At least two BP measurements are required to reduce the influence of a single spurious recording. 34 We further excluded participants with missing prenatal record information, including birthweight (n = 6) and history of HDP (n = 99). The final analytic cohort comprised 6719 participants.
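The exclusion steps above can be checked with a few lines of arithmetic. The sketch below only re-traces the counts reported in this paragraph; the variable names are illustrative and are not taken from the study code.

```python
# Counts come from the Participants paragraph; names are illustrative only.
initial = 7547
exclusions = {
    "abnormal BP (SBP < 70 or DBP < 40 mmHg)": 38,
    "fewer than two early pregnancy BP measurements": 685,
    "missing birthweight": 6,
    "missing history of HDP": 99,
}

remaining = initial
for reason, n_excluded in exclusions.items():
    remaining -= n_excluded
    print(f"excluded {n_excluded:>4} ({reason}); remaining = {remaining}")

final_cohort = remaining
```

The running total ends at the analytic cohort size reported in the text.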
Measures
We intentionally included features that occurred sufficiently early during pregnancy to provide implications for preventing LBW. To ensure both clinical relevance and statistical robustness, we first examined the association between each candidate feature and LBW using univariate analyses. Chi-squared tests were applied to categorical features, and logistic regression was used for continuous features. Features with P-values < 0.05 were deemed statistically significant and included in the model. In addition, some features had a P-value > 0.05 but were still included in the predictive model because the literature highlights their importance for LBW. This dual approach—statistical screening and literature-informed inclusion—ensured both empirical validity and clinical interpretability of the final model. Ultimately, features included: background characteristics (maternal age, education, maternal height, and pre-pregnancy BMI) and clinical characteristics (parity, abortion, prenatal care, history of HDP, early pregnancy DBP, and early pregnancy BP category).
Gestational age and preterm birth status were excluded from the feature set to prevent outcome leakage. As both are determined postnatally and strongly associated with LBW, their inclusion could compromise the model's temporal validity. To ensure clinical applicability for early intervention, the model focused exclusively on antenatal predictors available prior to delivery. The target feature was LBW, defined as birth weight < 2500 g. For classification purposes, we adopted a binary outcome structure: LBW versus non-LBW. The non-LBW group (≥2500 g) included both normal birth weight and macrosomic infants. Within the non-LBW group, 6084 infants (90.5%) had a normal birth weight, whereas macrosomia was rare (n = 48, 0.7%). Given the extremely small number and proportion of macrosomic cases, their impact on the overall results is expected to be negligible. Birth weight data were obtained from delivery records. This binary classification approach is consistent with prior literature that emphasizes early identification of LBW risk due to its clinical relevance in neonatal outcomes.
Background characteristics
Maternal age and education were obtained from the nursing assessment records reported when pregnant women were admitted to the hospital for delivery. Maternal age was defined as age at delivery. Advanced maternal age was defined as ≥ 35 years of age. Maternal height and pre-pregnancy weight (usually self-reported) were obtained during the first prenatal care session and recorded by a nurse. Pre-pregnancy BMI was derived from pre-pregnancy weight and maternal height and was calculated by the researcher. It was classified according to the 2009 Institute of Medicine guidelines as underweight (<18.5 kg/m2), normal weight (18.5–24.9 kg/m2), overweight (25.0–29.9 kg/m2), and obesity (≥30 kg/m2). 33
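The BMI derivation and categorization described above can be written as a short sketch. The cutoffs are those listed in the text; the function names and the example values are hypothetical.

```python
def bmi(weight_kg: float, height_cm: float) -> float:
    """Body mass index in kg/m2 from weight in kg and height in cm."""
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)


def bmi_category(bmi_value: float) -> str:
    """Category using the cutoffs stated in the text (18.5, 25.0, 30.0 kg/m2)."""
    if bmi_value < 18.5:
        return "underweight"
    if bmi_value < 25.0:
        return "normal weight"
    if bmi_value < 30.0:
        return "overweight"
    return "obesity"


# Hypothetical participant: 55 kg before pregnancy, 160 cm tall.
example_bmi = bmi(55, 160)
example_category = bmi_category(example_bmi)
```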
Clinical characteristics
Parity, abortion, and history of HDP were based on self-reports of pregnant women when they were admitted to the hospital for delivery. Parity was classified as primiparous or multiparous. Abortion was defined as the number of abortions. The history of HDP was defined as a history of gestational hypertension, preeclampsia, or eclampsia. Data on early pregnancy DBP, early pregnancy BP, and prenatal care were obtained from records of prenatal care visits during pregnancy. In this study, early pregnancy BP comprised all BP measurements recorded from the fifth week to before the 20th week of pregnancy. Participants with more than two abnormal BP measurements were classified as having elevated BP or stage I hypertension, and the early pregnancy BP category was defined according to the revised AHA/ACC standards. Prenatal care was defined as the total number of prenatal care sessions attended from the start of pregnancy to delivery.
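As an illustration of the BP categorization, the sketch below classifies a single reading with the 2017 ACC/AHA thresholds and then counts abnormal readings per participant. Two points are assumptions rather than details from the study: "more than two abnormal values" is read as at least three readings (exposed as the parameter `min_abnormal`), and the function names are hypothetical. How the study assigned participants specifically to the elevated versus stage 1 category is not described, so the sketch stops at a yes/no flag.

```python
def classify_reading(sbp: float, dbp: float) -> str:
    """Classify one BP reading (mmHg) using the 2017 ACC/AHA thresholds."""
    if sbp >= 140 or dbp >= 90:
        return "stage 2 or higher"  # guideline category not used in the text
    if sbp >= 130 or dbp >= 80:
        return "stage 1"
    if sbp >= 120:
        return "elevated"  # SBP 120-129 with DBP < 80
    return "normal"


def has_abnormal_early_bp(readings, min_abnormal=3):
    """True if at least `min_abnormal` readings are not normal.

    `min_abnormal=3` is an assumed reading of "more than 2 times".
    """
    abnormal = sum(1 for sbp, dbp in readings if classify_reading(sbp, dbp) != "normal")
    return abnormal >= min_abnormal


# Hypothetical readings taken before 20 weeks of pregnancy.
readings = [(125, 75), (131, 78), (122, 81), (116, 70)]
flag = has_abnormal_early_bp(readings)
```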
Statistical analysis
Models were built in Python (version 3.11.3) using the pandas, NumPy, and scikit-learn libraries in Jupyter Notebook. The dataset was split at a ratio of 80:20, with 80% used for training and 20% used for testing and performance verification. Hyperparameters were tuned by grid search with 10-fold cross-validation. Figure 1 illustrates the program flow of the LBW prediction model. Continuous data were normalized, and categorical data were processed using one-hot encoding, which converts each multicategory feature into multiple binary features so that the contribution of each subcategory can be distinguished. 35 Following encoding, 19 features were included in this study. We used ensemble ML boosting techniques to establish the LBW prediction model, including AdaBoost, GB, XGBoost, LightGBM, and CatBoost.

Program flow of the prediction model for low birth weight (LBW).
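The two preprocessing steps in the program flow, normalization and one-hot encoding, can be illustrated in plain Python. The text does not state which normalization was used, so min-max scaling is an assumption here; the helper names and the toy values are hypothetical. In practice, pandas `get_dummies` and scikit-learn's `MinMaxScaler` perform equivalent operations.

```python
def min_max_normalize(values):
    """Rescale values to [0, 1]; min-max scaling is an assumed choice."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values, prefix):
    """Turn one multicategory feature into one binary column per category."""
    categories = sorted(set(values))
    return [{f"{prefix}_{c}": int(v == c) for c in categories} for v in values]


# Toy values, not study data.
dbp_normalized = min_max_normalize([60, 66, 72, 84])
bp_encoded = one_hot(["N", "E", "stage1", "N"], prefix="BPcategory")
```

Each row of `bp_encoded` has exactly one column set to 1, which is how a single three-level BP category becomes three of the 19 encoded features.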
Balancing the dataset and resampling procedures
The target feature LBW accounted for 8.7% of the study participants. We used the synthetic minority oversampling technique (SMOTE) to solve the imbalanced data distribution problem. 38 The dataset was balanced before model construction and training, enabling the model to learn more accurately. SMOTE was applied prior to train-test splitting to ensure adequate representation of the minority class. Subsequently, the dataset was split into training and testing sets using an 80:20 ratio, with stratified sampling to preserve class distribution across both sets. This approach mitigates class imbalance while maintaining the integrity of the hold-out evaluation. Model hyperparameters were optimized via three rounds of GridSearchCV, each with 10-fold cross-validation on the training set.
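The core idea of SMOTE is to create synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below is a minimal standard-library illustration of that idea, not the implementation used in the study; the function name, the choice of `k`, and the toy points are assumptions. The imbalanced-learn package provides a full implementation as `imblearn.over_sampling.SMOTE`.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating within the minority class.

    `minority` is a list of numeric tuples with at least two samples.
    """
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen sample (itself excluded)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy minority class in a normalized two-feature space.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority, n_synthetic=10, k=2)
```

Because each synthetic point lies on a segment between two real minority samples, the new points stay inside the region already occupied by the minority class.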
External validation
In the absence of an independent external cohort, we conducted internal–external cross-validation (IECV) across the four participating hospitals using a leave-one-hospital-out (LOHO) design. Although SMOTE was initially applied in the primary pipeline, and hyperparameter selection was performed on the SMOTE augmented data, the subsequent LOHO IECV used those selected parameter sets for fold-wise fitting. Resampling was restricted to the training partitions in each IECV iteration to prevent information leakage. 36 Each fold involved training on data from three hospitals and testing on the held-out hospital. Model performance was evaluated using the AUROC (with bootstrap 95% CI), Brier score, calibration intercept and slope from logistic recalibration, and two calibration-error metrics: expected calibration error (ECE) and maximum calibration error (MCE). This IECV/LOHO framework follows current methodological recommendations for assessing model generalizability in multicenter datasets.36,37
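The leave-one-hospital-out design can be sketched as a generator of train/test index splits. The hospital labels and the helper name are illustrative; the per-hospital sample sizes are those reported in the Results. scikit-learn's `LeaveOneGroupOut` offers the same splitting logic.

```python
def loho_splits(hospital_ids):
    """Yield (held_out, train_idx, test_idx) with one hospital held out per fold."""
    for held_out in sorted(set(hospital_ids)):
        train_idx = [i for i, h in enumerate(hospital_ids) if h != held_out]
        test_idx = [i for i, h in enumerate(hospital_ids) if h == held_out]
        yield held_out, train_idx, test_idx


# Hospital sizes as reported in the Results (A-D labels are illustrative).
hospital_ids = ["A"] * 5106 + ["B"] * 337 + ["C"] * 1207 + ["D"] * 69

fold_sizes = {}
for held_out, train_idx, test_idx in loho_splits(hospital_ids):
    # Any resampling (for example SMOTE) would be fitted on train_idx only.
    fold_sizes[held_out] = (len(train_idx), len(test_idx))
```

Note how uneven the folds are: holding out Hospital A leaves 1613 training records, whereas holding out Hospital D leaves 6650.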
Normality assessment
The normality of continuous features was assessed prior to univariate analyses to ensure the validity of mean-based statistical tests and to enhance data transparency. We applied the Shapiro–Wilk test to DBP, maternal height, and number of abortions. Features with P-values > 0.05 were considered approximately normally distributed, whereas those with P-values ≤ 0.05 were treated as non-normal. Detailed results are provided in Appendix 1 Table A1.
Optimization strategy
ML models were optimized through hyperparameter tuning with GridSearchCV, which evaluates the model for each hyperparameter combination using 10-fold cross-validation. This process yields the accuracy or loss for every combination, enabling the selection of the best-performing set. 39 We conducted three rounds of GridSearchCV tuning, each with 10-fold cross-validation, to systematically search for the optimal parameter combination and to ensure that model performance was comprehensively evaluated. Hyperparameter selection was performed on the SMOTE-augmented dataset; during LOHO IECV, the previously selected best parameter sets were applied directly for fold-wise model fitting, and all training procedures (including any resampling) were confined to each fold's training partition to avoid information leakage.
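The logic of an exhaustive grid search with k-fold cross-validation can be sketched without any ML library. Everything specific below is hypothetical: the grid values are not the ones used in the study, and `toy_score` is a stand-in for fitting a model on the training folds and scoring it on the validation fold.

```python
from itertools import product


def k_fold_indices(n_samples, n_splits=10):
    """Yield (train_idx, val_idx) for contiguous folds of near-equal size."""
    sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
             for i in range(n_splits)]
    start = 0
    for size in sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


def grid_search(param_grid, fit_and_score, n_samples, n_splits=10):
    """Return the parameter combination with the highest mean fold score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        scores = [fit_and_score(params, tr, va)
                  for tr, va in k_fold_indices(n_samples, n_splits)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


# Hypothetical grid; the study's actual grid is not reported here.
param_grid = {"learning_rate": [0.01, 0.1], "num_leaves": [15, 31, 63]}


def toy_score(params, train_idx, val_idx):
    # Stand-in for "fit on train_idx, evaluate on val_idx".
    return -(abs(params["learning_rate"] - 0.1) + abs(params["num_leaves"] - 31) / 100)


best_params, best_score = grid_search(param_grid, toy_score, n_samples=100)
```

With six combinations and ten folds, this search performs 60 fit-and-score calls per round, which is why the cost of grid search grows quickly with the size of the grid.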
Performance evaluation
Area under the receiver operating characteristic curve (AUROC)
The AUROC is a widely used indicator of the discriminative ability of ML models. An AUROC > 0.7 is considered acceptable, and values ≥ 0.9 indicate excellent performance. 40 To quantify the uncertainty of the AUROC estimate, we performed 1000 bootstrap resamples on the test set predictions. The resulting AUROC distribution was visualized using a kernel density plot, with the 2.5th and 97.5th percentiles marked to indicate the 95% confidence interval.
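A percentile bootstrap for the AUROC can be sketched as follows. This is an illustration under stated assumptions rather than the study code: the AUROC is computed directly as the probability that a random positive case scores higher than a random negative case, resamples containing only one class are redrawn, and the toy labels and scores are simulated.

```python
import random


def auroc(y_true, y_score):
    """AUROC as P(score of positive > score of negative); ties count as 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auroc_ci(y_true, y_score, n_boot=1000, seed=0):
    """Approximate 95% percentile interval from n_boot bootstrap resamples."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y_b = [y_true[i] for i in idx]
        if 0 < sum(y_b) < n:  # both classes must be present
            stats.append(auroc(y_b, [y_score[i] for i in idx]))
    stats.sort()
    lower = stats[round(0.025 * n_boot)]
    upper = stats[round(0.975 * n_boot) - 1]
    return lower, upper


# Simulated test-set predictions (not study data).
rng = random.Random(42)
y_true = [0] * 30 + [1] * 30
y_score = [rng.gauss(0.3, 0.15) for _ in range(30)] + [rng.gauss(0.6, 0.15) for _ in range(30)]
ci_lower, ci_upper = bootstrap_auroc_ci(y_true, y_score)
```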
Evaluation metrics were calculated on held-out hospitals (LOHO folds) and included AUROC with bootstrap 95% CI, Brier score, calibration intercept and slope from logistic recalibration, expected calibration error (ECE), and maximum calibration error (MCE); bootstrap uncertainty estimates and per-hospital metrics are reported.
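The Brier score, ECE, and MCE named above can be computed as in the following sketch. The binning scheme is an assumption (ten equal-width probability bins), since the text does not specify one, and the function names and toy values are hypothetical. The calibration intercept and slope require fitting a logistic recalibration model and are not shown.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def calibration_errors(y_true, y_prob, n_bins=10):
    """Return (ECE, MCE) using equal-width probability bins (assumed scheme)."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((y, p))
    n = len(y_true)
    ece, mce = 0.0, 0.0
    for members in bins:
        if not members:
            continue
        observed = sum(y for y, _ in members) / len(members)
        predicted = sum(p for _, p in members) / len(members)
        gap = abs(observed - predicted)
        ece += len(members) / n * gap  # gap weighted by bin size
        mce = max(mce, gap)            # largest gap over all bins
    return ece, mce


# Toy predictions, not study data.
toy_y = [0, 0, 1, 1]
toy_p = [0.1, 0.2, 0.8, 0.9]
toy_brier = brier_score(toy_y, toy_p)
toy_ece, toy_mce = calibration_errors(toy_y, toy_p)
```

ECE averages the gap between observed and predicted risk across bins, whereas MCE reports the single worst bin, so MCE is much more sensitive to sparsely populated bins in small samples.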
Confusion matrix
Accuracy, precision, recall, specificity, and F1 score were derived from the confusion matrix. The confusion matrix compares the predicted and actual results and categorizes them as True Positive (TP), True Negative (TN), False Positive (FP), or False Negative (FN). Several metrics can be calculated from these four counts to evaluate the model's performance. Accuracy is the proportion of correct predictions among all predictions. Precision is the proportion of TP cases among all predicted positive cases. Sensitivity (recall) is the proportion of TP cases among all actual positive cases. Specificity is the proportion of TN cases among all actual negative cases. The F1 score is the harmonic mean of precision and sensitivity. Together, these metrics provide a comprehensive picture of the model's predictive capabilities in different scenarios. 41
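The metric definitions above translate directly into code. The counts in the example are hypothetical and are not taken from the study's confusion matrices.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the five metrics from the four confusion matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)    # TP among predicted positives
    recall = tp / (tp + fn)       # sensitivity: TP among actual positives
    specificity = tn / (tn + fp)  # TN among actual negatives
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=90, tn=85, fp=15, fn=10)
```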
Feature importance analysis
Feature importance analysis is commonly used for model interpretation. Feature importance was quantified to analyze the contribution of features to the model prediction. 42 Feature importance was computed for the final trained ensemble model using the gain metric, which quantifies the total improvement in the model's objective function contributed by each feature across all decision trees. We also computed SHAP (SHapley Additive exPlanations) values using the TreeSHAP algorithm to quantify each feature's contribution to the final LightGBM model output. The mean absolute SHAP value (mean(|SHAP|)) across all 6719 participants was used to summarize global feature importance and visualized as a ranking plot. SHAP values were also visualized in a beeswarm plot to illustrate the distribution and directionality of feature effects on individual predictions, providing insight into both global and local model interpretability. Both approaches were applied to the complete set of antenatal predictors retained in the model.
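The global SHAP ranking described above is the mean of the absolute SHAP values per feature. The sketch below shows only this aggregation step; the SHAP matrix is made up for illustration, and the feature names are shortened placeholders. In practice the per-participant values would come from a tree explainer, for example `shap.TreeExplainer(model).shap_values(X)` in the shap package.

```python
def mean_abs_shap(shap_rows, feature_names):
    """Rank features by mean(|SHAP|) across all rows (participants)."""
    n_rows = len(shap_rows)
    totals = [0.0] * len(feature_names)
    for row in shap_rows:
        for j, value in enumerate(row):
            totals[j] += abs(value)
    ranking = [(name, total / n_rows) for name, total in zip(feature_names, totals)]
    return sorted(ranking, key=lambda item: item[1], reverse=True)


# Made-up SHAP values: one row per participant, one column per feature.
feature_names = ["early_DBP", "maternal_height", "abortions"]
shap_rows = [
    [0.5, -1.2, 0.1],
    [-0.7, 1.0, -0.3],
    [0.6, -1.4, 0.2],
]
ranking = mean_abs_shap(shap_rows, feature_names)
```

Taking absolute values before averaging matters: a feature that pushes some predictions up and others down would otherwise average out near zero despite having a large effect.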
Results
Participants
The analysis included 6719 participants, of whom 8.7% delivered LBW infants. The mean maternal age was 33.0 years (SD = 4.5), with 37.8% aged above 35 years, and the mean maternal height was 160.1 cm (SD = 5.4). Before pregnancy, 12.2% of participants were overweight (9.8%) or obese (2.4%), and 73.01% were primiparous. The mean number of abortions was 0.4 (SD = 0.81; range 0–10); 70.4% had no history of abortion, and 29.6% had one or more abortions. Participants with three or fewer prenatal care visits accounted for 1.6%, and those with 4 to 8 visits for 15.4%. The mean early pregnancy DBP was 66.5 mmHg (SD = 7.4). In the early pregnancy BP category, 18.6% had elevated BP or stage I hypertension (Table 1).
Background and clinical characteristics related to LBW (N = 6719).
LBW: low birth weight; Pre BMI: pre-pregnancy body mass index; HDP: hypertensive disorders of pregnancy; DBP: diastolic blood pressure; H/T: hypertension; SBP: systolic blood pressure.
Abortion: number of abortions. SD: standard deviation. P-values were calculated to compare background and clinical characteristics between LBW and non-LBW groups.
Feature importance
This study analyzed the feature importance based on the best model established by LightGBM. Feature importance analysis showed that DBP during early pregnancy had the greatest impact on LBW prediction. Maternal height and abortion were the second- and third-most important predictors of LBW, respectively. Importance values reflect each feature's relative contribution to model performance, based on gain metrics. Other important predictors included advanced maternal age, prenatal care, maternal education, and pre-pregnancy BMI (Figure 2). The full list of antenatal predictors, including each feature's name, type, and collection time point, is presented in Table 2. Figure 3 presents the mean absolute SHAP ranking summarizing global feature importance for LBW prediction. Maternal height demonstrated the highest SHAP value (mean(|SHAP|) = 1.24), followed by early pregnancy DBP (0.58) and number of abortions (0.51). Prenatal care frequency, maternal age, and parity also contributed meaningfully, though to a lesser extent. The full mean(|SHAP|) values for all features are provided in Appendix 2 Table A2. Figure 4 (SHAP beeswarm) visualizes individual-level SHAP distributions, with color gradients representing original feature values. This plot reveals heterogeneity and potential nonlinear associations—such as varying DBP effects across subgroups—thereby enhancing clinical interpretability.

Predictive feature importance ranking from the LightGBM model. Note: early_DBP_nor = normalized early pregnancy diastolic blood pressure, M_BH_nor = normalized maternal height, Abortion_nor = normalized number of abortions, age_AA = advanced maternal age, Prenatalcare_above9 = more than 9 prenatal care visits, edu_College = maternal education level: college/university, preBMIIOM_N = pre-pregnancy BMI: normal, Parity = parity classification, Prenatalcare_4_8 = 4–8 prenatal care visits, BPcategory_N = blood pressure category: normal, edu_graduate School = maternal education level: graduate school, edu_High school = maternal education level: high school or below, preBMIIOM_light = pre-pregnancy BMI: underweight, BPcategory_E = blood pressure category: elevated, preBMIIOM_heavy = pre-pregnancy BMI: overweight, preBMIIOM_ Obesity = pre-pregnancy BMI: obesity, Prenatalcare_less3 = fewer than 3 prenatal care visits, BPcategory_stage1 = blood pressure category: stage 1 hypertension, HDP_M_C = history of hypertensive disorders of pregnancy.

SHAP ranking from the LightGBM model.

SHAP beeswarm from LightGBM model.
Features used in the boosting ensemble models for LBW prediction.
LBW: low birth weight; BMI: body mass index.
Model performance
All five models achieved AUROC values above 0.91 (range 0.9118–0.9634) and accuracy values above 80% (range 81.9%–93.4%). LightGBM outperformed the other four prediction models (Table 3). LightGBM achieved the highest AUROC of 0.9634 (95% CI: 0.9551–0.9714), followed by XGBoost (0.9632, 95% CI: 0.9550–0.9709), AdaBoost (0.9615, 95% CI: 0.9521–0.9698), CatBoost (0.9599, 95% CI: 0.9519–0.9676), and GradientBoost (0.9118, 95% CI: 0.9013–0.9228). The AUROC and the confusion matrix are shown in Figures 5 and 6, respectively.

Confusion matrix of (a) AdaBoost, (b) gradient boosting, (c) extreme gradient boosting, (d) light gradient boosting machine, and (e) categorical boosting.

Receiver operating characteristic (ROC) by (a) AdaBoost, (b) gradient boosting, (c) extreme gradient boosting, (d) light gradient boosting machine, and (e) categorical boosting.
Performance evaluation of the prediction models for LBW.
LBW: low birth weight; AUROC: area under the receiver operating characteristic curve; GB: Gradient Boosting; XGBoost: Extreme Gradient Boosting; LightGBM: Light Gradient Boosting Machine; CatBoost: Categorical Boosting.
External validation and calibration performance
We evaluated generalizability across the four hospitals (N = 6719) using LOHO IECV. Per-hospital AUROCs ranged from 0.7574 to 0.8677: 0.8190 (95% CI 0.7985–0.8374) for Hospital A (n = 5106), 0.8594 (95% CI 0.7634–0.9410) for Hospital B (n = 337), 0.8677 (95% CI 0.8294–0.9027) for Hospital C (n = 1207), and 0.7574 (95% CI 0.5961–0.9078) for Hospital D (n = 69) (Table 4). Brier scores ranged from 0.0562 to 0.0998 across sites, supporting acceptable overall predictive accuracy (Table 4). Calibration analyses showed between-hospital heterogeneity (intercepts A = 0.4475, B = −0.3570, C = 0.1388, D = −0.2586; slopes A = 1.6514, B = 1.5193, C = 1.8584, D = 1.0172; 95% CIs in Table 4). Estimates from smaller sites had larger uncertainty.
LOHO IECV external validation: per-hospital performance and calibration.
Note: LOHO IECV: leave-one-hospital-out internal-external cross-validation; AUROC: area under the receiver operating characteristic curve; ECE: expected calibration error; MCE: maximum calibration error.
Hospitals A–C demonstrated good discrimination and acceptable calibration, although calibration slopes >1 (1.5193–1.8584) suggest that predicted risks were too moderate, that is, not extreme enough at the low and high ends of the risk range. Hospital D showed wider confidence intervals due to its small sample size (n = 69), resulting in greater uncertainty in both discrimination (AUROC = 0.7574) and calibration estimates. The calibration intercepts ranged from −0.3570 to 0.4475, indicating moderate shifts in baseline risk across hospitals. ECE and MCE values (0.0544–0.0942 and 0.2271–0.7558, respectively) were within acceptable limits, supporting reasonable calibration with some site-level variability. Overall, the LOHO IECV results demonstrate that the model maintains robust discrimination and acceptable calibration across diverse hospital settings, reflecting good generalizability within the multicenter context (Figure 7).

Calibration plots for leave-one-hospital-out (LOHO) for four hospitals.
Discussion
This study found that the three most important features for LBW prediction were DBP (5–20 weeks of pregnancy), maternal height, and number of abortions. All of these variables are routinely obtainable during pregnancy. Based on these results, the model could be used to predict LBW in clinical settings, allowing for timely and appropriate interventions. Both the feature importance and SHAP analyses consistently highlighted early pregnancy DBP, maternal height, and number of abortions as primary drivers of LBW risk in our cohort. Slight differences in ranking between the methods reflect methodological distinctions rather than conflicting evidence about predictor relevance: feature importance was computed using the gain metric, which aggregates split-wise improvements to the model objective across the tree ensemble, whereas SHAP quantifies the average marginal contribution of each feature to individual predictions. The concordance across methods strengthens confidence that these features are robust predictors, warranting clinical attention. Integrating SHAP provided two complementary interpretive advantages: it identified globally important predictors while preserving patient-level variability, thereby bridging model transparency with clinical reasoning. The pronounced influence of maternal height underscores the role of maternal anthropometry in early risk stratification, whereas nonlinear SHAP patterns for early pregnancy DBP highlight the need to explore clinically actionable BP thresholds. SHAP, therefore, complements traditional statistical inference and should be interpreted alongside baseline risk profiles and clinical judgment to inform individualized care decisions.
Elevated DBP in early pregnancy is a significant predictor of LBW. This finding aligns with previous literature, particularly studies by Fraser and Catov 31 and Steer et al. 32 The results highlighted that hypertension increases the risk of MVM, with DBP being a more accurate reflection of MVM than SBP. These conditions can impair placental blood supply, thereby increasing the risk of LBW. In addition, hypertension during pregnancy inhibits the typical decrease in DBP during early pregnancy, further underscoring the importance of DBP in influencing LBW. 31 Steer et al. 32 also demonstrated that high DBP in pregnant women is associated with LBW, increased risk of infants being small for gestational age, and perinatal death. Our results are consistent with those of previous studies, indicating that DBP is a significant predictive feature. This finding highlights the critical influence of DBP during early pregnancy on LBW and suggests that LBW prediction can be conducted as early as 20 weeks of pregnancy. These findings have important implications for early disease detection and healthcare.
Our study applied the revised AHA/ACC BP criteria to LBW prediction and found that BP categories based on the lower cutoff values were not important features. Wu et al. 28 found that both elevated BP and stage 1 hypertension were positively associated with LBW. Greenberg et al. 29 reported that only stage 1 hypertension, not elevated BP, was associated with adverse neonatal outcomes. The inconsistencies across studies may be attributable to differences in population characteristics, measurement protocols, and model architectures. The lack of mechanistic explanations also leaves gaps in understanding; potential physiological pathways, such as altered maternal vascular resistance and impaired placental perfusion, warrant further investigation. One aim of our study was to determine whether the revised AHA/ACC definition of hypertension is an important feature of LBW; our results indicate that it is not. Further research is needed to clarify the relationship between BP cutoff values and LBW.
This study showed that maternal height is an important feature of LBW. According to a systematic review and meta-analysis by Han et al., 19 women of short stature have a higher risk of LBW. Kader and Pereira 20 also showed that mothers with a height <145 cm had a higher risk of LBW, and Islam Pollob et al. 21 supported this finding. Our results extend this work by showing that maternal height is also important as a continuous feature in predictive models. However, the mechanism underlying the association between short stature and LBW remains unclear. Camilleri pointed out that short-statured women are more likely to pass on a genetic predisposition for smaller growth to their fetuses, which contributes to LBW. 43 Further studies are required to explore the potential mechanisms underlying this association.
This study found that the number of abortions is an important feature of LBW, consistent with previous studies. 22,23 The number of abortions may be associated with underlying reproductive health issues, such as uterine abnormalities, which can adversely affect subsequent pregnancies and their stability. Diagnostic hysteroscopy should be performed after the first miscarriage to identify congenital and acquired uterine abnormalities. 44 Timely diagnosis of the causes of abortion and appropriate interventions can play a crucial role in reducing the incidence of LBW in women who have experienced abortion. Enhanced prenatal care, including frequent check-ups and intensive monitoring, can help detect abnormalities early and allow for timely intervention.
The prominence of early pregnancy DBP underscores its predictive value in assessing LBW risk, whereas maternal height and number of abortions reflect underlying maternal health and reproductive experience; both have plausible biological links to fetal growth. This pattern complements the strong performance of our model (AUROC = 0.96, accuracy = 93.4%) and supports its potential integration into clinical decision support systems to enable timely, personalized care. In contrast to our findings, Moges et al. 30 analyzed LBW risk in the context of midwife-led continuity care in Ethiopia, using causal ML models to identify key features. Pregnancy-induced hypertension was identified as one of the top five features influencing LBW, while BP was included in their model but ranked lower in importance and was not further discussed. Moges et al. 30 applied a causal ML framework to evaluate intervention effects on LBW, whereas our analysis focuses on predictive modeling using routinely collected early antenatal data to support early risk identification rather than causal attribution. Predictive ML can identify high-risk cases, while causal ML (as shown by Moges et al. 30) can determine which interventions may effectively reduce LBW risk. Future studies could integrate the two approaches to create a comprehensive framework for both prediction and prevention.
Despite recent advances in ML applications for LBW prediction, several limitations persist in the literature. Prior studies have either relied on imaging and Doppler-based modalities with limited clinical accessibility, 14 focused on demographic features without incorporating early physiological indicators, 15 or lacked interpretability and actionable insights for early intervention. 16 To address these gaps, our study developed a clinically practical LBW prediction model using routinely collected features, including early pregnancy (<20 weeks) DBP, maternal height, and history of abortion. By integrating the 2017 AHA/ACC BP classification, we demonstrated the predictive value of early DBP, an often overlooked parameter, in identifying LBW risk. The LightGBM-based model achieved high performance (AUROC = 0.96, accuracy = 93.4%). Furthermore, SMOTE and multi-round hyperparameter tuning were used to support model robustness despite class imbalance. This study contributes a novel, regionally representative ML framework with potential for integration into clinical decision support systems, enabling timely risk stratification and personalized care for pregnant women.
The integration of AI into clinical practice can enhance the prediction and management of LBW using these predictive models. AI can assist healthcare professionals in identifying high-risk pregnancies early, allowing for timely interventions. By incorporating AI-driven insights into routine prenatal care, healthcare providers can improve the outcomes for both mothers and infants. Automation is key to integrating AI into clinical practice effectively, because it streamlines processes and reduces the workload of healthcare providers. It is essential to implement systems that automatically collect data on maternal height, early pregnancy DBP, and history of abortions and feed them into the AI model. AI algorithms can then stratify pregnant women into risk categories. For high-risk pregnancies, automated personalized monitoring plans can be developed, including scheduling more frequent prenatal visits and specific tests to monitor fetal growth. Additionally, AI can automatically alert healthcare providers to early signs of complications, such as unexpected changes in DBP, prompting timely interventions. Offering automated educational resources and support to expectant mothers is also crucial, emphasizing the importance of regular prenatal care and adherence to medical advice. Finally, facilitating automated communication and collaboration among healthcare professionals ensures a comprehensive approach for managing high-risk pregnancies. By automating these processes, AI can make clinical adoption more feasible and improve maternal and child health outcomes.
To further evaluate the model's cross-hospital generalizability and clinical implications, we performed a LOHO IECV. The LOHO analyses demonstrated that the model maintained robust discrimination across the four hospitals, with AUROCs ranging from 0.7574 to 0.8677. This consistency across geographically and organizationally distinct sites supports the model's transportability within the multicenter context. The modest variation in Brier scores (0.0562 to 0.0998) and calibration intercepts (−0.3570 to 0.4475) suggests that differences in baseline event rates and data recording practices may partly explain between-hospital calibration drift. Calibration slopes greater than one, observed in several hospitals, indicate that predicted probabilities were somewhat too moderate (mildly under-confident) at those sites, a pattern that can arise in cross-site validation due to variation in population case-mix or measurement protocols. From a practical standpoint, these findings imply that the model can be reliably transferred to new clinical settings with comparable data structures, but local recalibration (e.g. intercept/slope adjustment or isotonic regression) is advisable to align predicted and observed risks. Such recalibration requires only a small local sample and can substantially improve clinical interpretability without retraining the model. The use of LOHO IECV also illustrates a pragmatic validation strategy for multicenter health datasets, bridging the gap between internal validation and full external validation when independent cohorts are unavailable. Collectively, these results underscore the model's robustness, fairness, and readiness for stepwise clinical translation, provided that local adaptation and ongoing performance monitoring are implemented during deployment.
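The intercept/slope adjustment mentioned above can be illustrated with a logistic recalibration, in which local outcomes are regressed on the logit of the model's predicted probability. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the data are simulated rather than taken from the study, and the intercept and slope are fitted jointly by Newton-Raphson (the calibration intercept reported in validation studies is often estimated with the slope fixed at one, which is a different quantity).

```python
import math
import random


def logit(p, eps=1e-6):
    p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
    return math.log(p / (1 - p))


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_recalibration(y, p, n_iter=50):
    """Fit y ~ sigmoid(a + b * logit(p)) by Newton-Raphson; return (a, b)."""
    x = [logit(pi) for pi in p]
    a, b = 0.0, 1.0  # start from "no recalibration"
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            mu = sigmoid(a + b * xi)
            w = mu * (1 - mu)
            g0 += yi - mu            # gradient w.r.t. intercept
            g1 += (yi - mu) * xi     # gradient w.r.t. slope
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if abs(det) < 1e-12:
            break
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < 1e-10:
            break
    return a, b


def recalibrate(p, a, b):
    """Map an original predicted probability to a locally recalibrated one."""
    return sigmoid(a + b * logit(p))


# Simulated "local hospital" sample (hypothetical, not study data)
rng = random.Random(0)
p_pred = [rng.uniform(0.01, 0.6) for _ in range(2000)]
y_obs = [1 if rng.random() < sigmoid(-0.3 + 1.3 * logit(p)) else 0 for p in p_pred]

a, b = fit_recalibration(y_obs, p_pred)
recalibrated = [recalibrate(p, a, b) for p in p_pred]
print("intercept:", round(a, 3), "slope:", round(b, 3))
print("observed rate:", sum(y_obs) / len(y_obs))
print("mean recalibrated risk:", sum(recalibrated) / len(recalibrated))
```

Because only two parameters are estimated, the original model is left untouched and the ranking of patients (and hence the AUROC) is preserved whenever the fitted slope is positive; only the probability scale is adjusted.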
We additionally considered ethical, fairness, and explainability issues in our modeling and deployment recommendations. This study incorporated methodological and ethical safeguards to promote fairness, transparency, and accountability in predictive modeling for maternal health. To further promote fairness and mitigate institutional bias, we leveraged the LOHO IECV, which provided a practical assessment of performance consistency and fairness across heterogeneous healthcare contexts. In addition, SHAP analysis was performed to visualize individual-level feature contributions, enabling clinicians to interpret and verify the model's reasoning. While class imbalance was addressed using SMOTE, we acknowledge, consistent with Sun et al., 45 that algorithmic correction alone does not guarantee ethical fairness. Rather, fairness in AI should be understood as a contextual and ethical construct linked to data sources, population characteristics, and clinical applications. By integrating LOHO validation, explainability analysis, and ethical reflection, this study exemplifies a responsible AI framework that aligns with emerging global standards in digital maternal health research. 37 Collectively, these methodological and ethical safeguards provide a foundation for responsible clinical translation and highlight the importance of ongoing monitoring of fairness and transparency as the model is deployed in real-world maternal health settings.
Limitations and recommendations
While the study offers valuable insights, it also has important limitations that should be acknowledged. The sample was drawn from only four hospitals in Taiwan, which may constrain the generalizability of the findings to other populations or regions. This restriction represents a potential threat to external validity, as the sample may not be sufficiently diverse to capture variations across different healthcare systems and demographic profiles. Although the four hospitals were located in northern and eastern Taiwan and ranged from medical centers to regional hospitals, future studies should include samples from different regions and demographic backgrounds to improve overall generalizability. Moreover, no external validation was performed, primarily because external datasets with comparable features were unavailable, which may limit applicability across different clinical settings. To mitigate overfitting and assess model stability, we employed GridSearchCV for hyperparameter tuning with 10-fold cross-validation. While these measures strengthen internal validity, they cannot substitute for independent external validation. Future work should validate the model using datasets from diverse institutions and clinical environments, and adjust the model to suit different application scenarios.
The retrospective design may introduce biases, including reliance on electronic medical records that could contain missing or inaccurate data. Such biases may affect internal validity by introducing measurement errors or incomplete data capture. To improve robustness and generalizability, future studies should adopt prospective data collection, cross-validate findings with multiple data sources, and perform sensitivity analyses. Additionally, we acknowledge a methodological limitation regarding the application of SMOTE prior to train-test splitting. While this approach helped improve class balance and model sensitivity, it may introduce data leakage, because synthetic samples generated from the full dataset can appear in, or be derived from, both the training and testing sets, which can inflate the reported performance. Therefore, we report bootstrap AUROC CIs and LOHO IECV results to quantify uncertainty and evaluate robustness. To prevent such contamination, future studies should apply SMOTE exclusively to the training set after splitting.
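The recommended ordering (split first, then oversample only the training portion) can be sketched as follows. This is an illustrative, dependency-free example under stated assumptions: `stratified_split` and `smote_like` are hypothetical helpers, the interpolation is a simplified SMOTE-style procedure rather than the implementation used in this study or in any particular library, and the data are randomly generated.

```python
import random


def stratified_split(y, test_frac, rng):
    """Return (train_idx, test_idx), sampled per class to keep the class ratio."""
    train_idx, test_idx = [], []
    for label in sorted(set(y)):
        idx = [i for i, yi in enumerate(y) if yi == label]
        rng.shuffle(idx)
        n_test = max(1, round(len(idx) * test_frac))
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def smote_like(minority, n_new, k, rng):
    """Create n_new synthetic rows by interpolating between a minority row
    and one of its k nearest minority neighbours (simplified SMOTE)."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [m for m in minority if m is not base]
        neighbours = sorted(others, key=lambda m: sq_dist(base, m))[:k]
        partner = rng.choice(neighbours)
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, partner)])
    return synthetic


# Toy data: 200 rows, 2 features, 9% positives (made-up values)
rng = random.Random(42)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
y = [1 if i < 18 else 0 for i in range(200)]

# 1) Split BEFORE any oversampling, so the test set contains only real rows
train_idx, test_idx = stratified_split(y, 0.2, rng)
X_train, y_train = [X[i] for i in train_idx], [y[i] for i in train_idx]
X_test, y_test = [X[i] for i in test_idx], [y[i] for i in test_idx]

# 2) Oversample the minority class using training rows only
minority = [x for x, yi in zip(X_train, y_train) if yi == 1]
n_new = y_train.count(0) - y_train.count(1)
X_train_bal = X_train + smote_like(minority, n_new, k=5, rng=rng)
y_train_bal = y_train + [1] * n_new

print("train class counts:", y_train_bal.count(0), y_train_bal.count(1))
print("test class counts:", y_test.count(0), y_test.count(1))
```

With this ordering, the test set keeps the original class ratio and no synthetic row is derived from a test observation. When cross-validation is used, the same principle applies within each fold: oversampling is refitted on the training folds only.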
Furthermore, differences in the importance of BP categories compared to prior research underscore the need for future studies to incorporate multicenter, multi-ethnic cohorts and to explore potential physiological mechanisms, such as altered maternal vascular resistance and impaired placental perfusion, to strengthen causal inference. This study is a retrospective observational predictive analysis and does not perform causal inference because the dataset lacks several key elements required for causal estimation; results are therefore presented and interpreted strictly as predictive associations.
In addition, the successful application of AI in clinical practice requires supporting measures, including adequate technical infrastructure and professional training.
Conclusion
Early pregnancy DBP, maternal height, and history of abortion were the three most significant predictors of LBW. This finding has considerable clinical implications as it enables healthcare practitioners to identify high-risk pregnancies at an early stage, thereby facilitating enhanced monitoring and timely interventions. Future investigations should prioritize the integration of these models into clinical practice to enhance their accessibility and usability. Real-time monitoring systems are essential for immediate feedback and interventions.
Footnotes
Acknowledgements
We would like to thank the Ministry of Science and Technology of Taiwan for supporting this research (grant no. MOST 108-2314-B-010-059-MY3).
Ethical considerations
The study was approved by the Institutional Review Boards of the National Yang Ming Chiao Tung University (YM107127E) and Mackay Hospital (19MMHIS320e).
Consent to participate
This retrospective study was conducted using pre-existing clinical data, and informed consent was not required, as approved by the relevant ethics committee and in line with international and local ethical regulations.
Author contributions
Ya-Ling Hu: writing–original draft, writing–review and editing, methodology, formal analysis, data curation, investigation, conceptualization, and visualization. Kung-Liahng Wang: writing–review and editing and resources. Jerry Cheng-Yen Lai: writing–review and editing and resources. Li-Yin Chien: writing–review and editing, validation, supervision, project administration, methodology, formal analysis, conceptualization, and resources.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Ministry of Science and Technology, Taiwan (grant number MOST 108-2314-B-010-059-MY3) and the Ministry of Science and Technology, Taiwan (grant number MOST 114-2314-B-A49-055-MY3).
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data availability statement
The datasets generated and/or analyzed during the current study are not publicly available as further analyses are ongoing. However, the corresponding author may make the data available upon reasonable request.
Guarantor
Li-Yin Chien is the guarantor for this article and accepts full responsibility for the integrity of the work as a whole.
