Abstract
Aim:
The primary objective of our research was to compare the performance of three different regression approaches in predicting vitamin D deficiency and to evaluate the usefulness of incorporating machine learning algorithms into data analysis in a clinical setting.
Methods:
We included 221 patients from our hypertension unit, whose data were collected from electronic records dated between 2006 and 2017. We used classical stepwise logistic regression, and two machine learning methods [least absolute shrinkage and selection operator (LASSO) and elastic net]. We assessed the performance of these three algorithms in terms of sensitivity, specificity, misclassification error, and area under the curve (AUC).
Results:
LASSO and elastic net regression performed better than logistic regression in terms of AUC, which was significantly better in both penalized methods, with AUC = 0.76 and AUC = 0.74 for elastic net and LASSO, respectively, than in logistic regression, with AUC = 0.64. In terms of misclassification rate, elastic net (18%) outperformed LASSO (22%) and logistic regression (25%).
Conclusion:
Compared with a classical logistic regression approach, penalized methods were found to have better performance in predicting vitamin D deficiency. The use of machine learning algorithms such as LASSO and elastic net may significantly improve the prediction of vitamin D deficiency in a hypertensive obese population.
Introduction
Vitamin D has an important role in calcium homeostasis, but some studies have also found that vitamin D serum levels are inversely associated with risk of cardiovascular disease. 1–3 Vitamin D deficiency can, therefore, be considered a predictor of cardiovascular disease and can increase cardiovascular events. Similarly, vitamin D deficiency has been linked to the development of diabetes mellitus, essential hypertension, and coronary artery disease 4,5 as well as metabolic syndrome, insulin resistance, impaired fasting glycemia, and impaired glucose tolerance. 6
In terms of obesity, vitamin D levels have also been inversely related to body mass index (BMI). 7 This relationship has not yet been clarified, and a number of hypotheses have been proposed, such as sequestration in adipose tissue, the effect of volumetric dilution, or that obese individuals have less sun exposure because of less outdoor activity. 8
The biomarker measured for vitamin D status is serum 25-hydroxy vitamin D, [25(OH)D], which can be expensive and, therefore, 25(OH)D determination may not be widely available in all clinical settings. An alternative approach to determining vitamin D deficiency would be to build predictive models using cheaper available biomarkers to estimate the vitamin D status. Only individuals with a high probability of having vitamin D deficiency would then undergo a specific laboratory test.
Some recent publications demonstrate that vitamin D deficiency is prevalent in certain populations, such as adolescents and the elderly. 9,10 However, there are no publications on the association between vitamin D deficiency, obesity, and essential hypertension. Furthermore, assessment of factors associated with vitamin D status in the aforementioned publications relies on models based on self-reported data such as physical activity, sun exposure, or smoking habit. To our knowledge, no predictive model has included clinical and laboratory parameters collected in an outpatient setting.
In health care research, the most common method for selecting informative features is logistic regression. 11 In this research, we used machine learning methods as an alternative approach to logistic regression to handle high-dimensional data sets obtained from demographic, clinical, and laboratory data.
Using data collected in an outpatient hypertension unit, our aim was to study associated features in our hypertensive population, which has a high prevalence of overweight and obesity. To determine the relationship between vitamin D status and demographic, clinical, and laboratory features, we used machine learning algorithms.
Methods
Patient selection, clinical and laboratory features
We conducted a retrospective cross-sectional study with 221 patients from the hypertension unit at Mostoles University Hospital (Madrid, Spain), whose records were dated between 2006 and 2017. Several demographic, clinical, and blood parameters had been measured through medical history, physical examination, and laboratory tests, such as systolic blood pressure (SBP) and diastolic blood pressure (DBP), LDL (low-density lipoprotein) and HDL (high-density lipoprotein) cholesterol, C-reactive protein (CRP), glycated hemoglobin (HbA1c), and albumin. Kidney function was assessed by means of serum levels of creatinine and cystatin C at baseline. Serum 25(OH)D was assessed using the quantitative electrochemiluminescence immunoassay method performed in a Cobas 8000 e602 analyzer (Roche Diagnostics). Patients were considered to have vitamin D deficiency if serum 25(OH)D levels were <20 ng/mL. Smoking habit, alcohol intake, exposure to sun, and physical activity were not properly recorded or were not available, and thus were not included as features to be analyzed.
Statistical analysis
Normality was checked using the Shapiro–Wilk test, to assess the shape of the distribution of the continuous variables. Data were shown as percentages, mean and standard deviation, or median and interquartile range, as appropriate. P values ≤0.05 were considered statistically significant. Statistical analyses were performed with R version 3.3.3 (2017-03-06), along with the packages glmnet and ROCR. 12–14
Regularization methods
Our statistical analyses were based on Pavlou et al., 15 in which the potential use of penalized regression methods was discussed. We applied penalized regression to select relevant features regarding vitamin D deficiency. Penalized regression is recommended by the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis Or Diagnosis checklist for developing and validating risk and diagnostic models. 16
Regularization is a technique that adds a penalty to the objective function. This penalty controls the complexity of the model by shrinking the values of the regression coefficients. A penalty based on the L2 norm shrinks coefficients toward zero without setting them exactly to zero; ridge regression is the machine learning algorithm that uses the L2 penalty. A penalty based on the L1 norm can shrink some coefficients to exactly zero; least absolute shrinkage and selection operator (LASSO) uses the L1 penalty. Elastic net combines the L1 and L2 penalties. Penalized regression is therefore a machine learning technique built on this regularization approach: it fits the same statistical model as standard regression, but adds a constraint (the aforementioned penalty) on the values of the regression coefficients. The penalty term is controlled by a regularization parameter (λ), which can be selected either using a cross-validation procedure or with the help of training and validation sets. In this study, λ was chosen using 10-fold cross-validation. If λ = 0, the coefficients are the same as the usual regression estimates wj. If λ is very large, however, the coefficients are shrunk so strongly that the model will be underfitted. To determine the optimal value of λ, we plotted its values against the number of coefficients, and then selected the minimum value of λ after the coefficients tended to stabilize. This tuning parameter was obtained using the training data set.
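The L1 penalty is the sum of the absolute values of the coefficients (wj):
\[ \lVert w \rVert_1 = \sum_{j=1}^{k} |w_j| \]
LASSO uses this L1 penalty, with λ controlling the strength of the penalization:
\[ \hat{w}^{\mathrm{LASSO}} = \arg\min_{w} \left\{ \ell(w) + \lambda \sum_{j=1}^{k} |w_j| \right\} \]
where \ell(w) denotes the unpenalized loss of the regression model (for a binary outcome, the negative log-likelihood) and k is the number of features.
The L2 penalty is the sum of the squares of the coefficients (wj),
\[ \lVert w \rVert_2^2 = \sum_{j=1}^{k} w_j^2 \]
which is used by ridge regression:
\[ \hat{w}^{\mathrm{ridge}} = \arg\min_{w} \left\{ \ell(w) + \lambda \sum_{j=1}^{k} w_j^2 \right\} \]
Elastic net uses both penalties, combining their advantages, and may overcome their individual limitations. It therefore has the effect of shrinking coefficients (as in ridge regression) and setting some coefficients to zero (as in LASSO), thus automatically selecting features. The estimates from the elastic net algorithm are defined by
\[ \hat{w}^{\mathrm{EN}} = \arg\min_{w} \left\{ \ell(w) + \lambda_1 \sum_{j=1}^{k} |w_j| + \lambda_2 \sum_{j=1}^{k} w_j^2 \right\} \]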
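The different behavior of the penalties described above can be illustrated with a minimal sketch. The analyses in this study were performed in R with glmnet; the Python functions below are hypothetical helpers written only for illustration, and the names lam1 and lam2 and the example coefficients are assumptions. The sketch uses the one-coefficient closed forms: the L1 penalty leads to soft-thresholding (small coefficients become exactly zero), whereas the L2 penalty leads to proportional shrinkage (coefficients become smaller but remain nonzero).

```python
def l1_penalty(w):
    """Sum of absolute coefficient values (the L1 penalty)."""
    return sum(abs(wj) for wj in w)


def l2_penalty(w):
    """Sum of squared coefficient values (the L2 penalty)."""
    return sum(wj * wj for wj in w)


def elastic_net_penalty(w, lam1, lam2):
    """Weighted combination of both penalties (lam1, lam2 are assumed names)."""
    return lam1 * l1_penalty(w) + lam2 * l2_penalty(w)


def soft_threshold(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*|w|: small values become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def ridge_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*w**2: values shrink but stay nonzero."""
    return z / (1.0 + 2.0 * lam)


# Hypothetical unpenalized coefficients, chosen only for illustration.
coefs = [0.8, -0.05, 0.3]
lasso_like = [soft_threshold(z, 0.1) for z in coefs]
ridge_like = [ridge_shrink(z, 0.1) for z in coefs]
print(lasso_like)  # the middle coefficient is set exactly to 0.0
print(ridge_like)  # all coefficients are smaller in magnitude but nonzero
```

This is why LASSO and elastic net can act as feature selectors, while ridge regression alone cannot.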
Logistic regression
Since the outcome in our analyses was binary, we used logistic regression, an algorithm for modeling a binary dependent variable as a function of independent variables (in our case, clinical and laboratory features). The output of logistic regression is a probability, which can be expressed as a function of the log-odds:
\[ \log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij}, \qquad p_i = P(y_i = 1 \mid x_i) \]
The dependent variable is denoted by yi, where i indexes a subject who either has vitamin D deficiency or does not. The beta coefficients estimate the magnitude of the association between the independent variables x and the outcome y, and k is the number of independent variables. To avoid overfitting or underfitting, we decided to use a stepwise method to estimate the logistic regression. We used the backward selection approach, based on the Akaike Information Criterion, to choose the best model.
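The relationship between the log-odds and the predicted probability, and the criterion used in backward selection, can be sketched as follows. This is an illustration only, not the authors' R code; the function names, coefficients, feature values, and log-likelihoods are hypothetical. The Akaike Information Criterion is computed with its standard definition, AIC = 2p - 2 log L, where p is the number of estimated parameters and L is the maximized likelihood.

```python
import math


def log_odds(beta0, betas, x):
    """Linear predictor: beta0 + sum_j beta_j * x_j."""
    return beta0 + sum(b * xj for b, xj in zip(betas, x))


def probability(beta0, betas, x):
    """Inverse of the log-odds transformation (logistic function)."""
    return 1.0 / (1.0 + math.exp(-log_odds(beta0, betas, x)))


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: lower values indicate a preferable model."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical coefficients and feature values, for illustration only.
beta0, betas, x = -1.0, [0.04, 0.5], [25.0, 1.0]
p = probability(beta0, betas, x)
print(round(p, 4))

# Backward selection drops a feature when the smaller model has a lower AIC.
full_model_aic = aic(-100.0, 7)     # hypothetical log-likelihood, 7 parameters
reduced_model_aic = aic(-100.5, 6)  # hypothetical log-likelihood, 6 parameters
print(reduced_model_aic < full_model_aic)
```

In this hypothetical comparison the reduced model would be retained, because removing one parameter costs less in log-likelihood than it saves in the complexity term.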
Internal validation: training and testing sets
To highlight the usefulness of regularization methods, we trained the three evaluated algorithms (logistic regression, LASSO, and elastic net) using training data, and then compared their performance using testing data. To do so, we split the data set into two samples: a training sample (80% of observations) and a testing sample (20% of observations). Ideally, the proportion of events and nonevents in the two samples should be the same. To avoid class bias, we sampled the observations so that events and nonevents appeared in similar proportions in the two data sets.
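A split that preserves the proportion of events and nonevents can be sketched as below. This is a minimal illustration under stated assumptions, not the procedure actually coded by the authors: the function name, the seed, and the rounding rule are hypothetical, and only the class counts (56 patients with deficiency and 165 without, out of 221) come from the study.

```python
import random


def stratified_split(labels, train_fraction=0.8, seed=0):
    """Split indices into training and testing sets, class by class."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(train_fraction * len(idx))  # per-class training size
        train.extend(idx[:cut])
        test.extend(idx[cut:])
    return sorted(train), sorted(test)


# 1 = vitamin D deficiency (56 patients), 0 = no deficiency (165 patients).
labels = [1] * 56 + [0] * 165
train_idx, test_idx = stratified_split(labels)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), round(train_rate, 2), round(test_rate, 2))
```

Because of the rounding rule assumed here, the resulting set sizes can differ by one or two patients from those reported in the Results, but the event rate stays close to 25% in both sets.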
To be clear, all three models were trained on the training sample, and the testing set was then used to make predictions and assess their performance. However, when LASSO and elastic net (but not logistic regression) were computed, the regularization parameter λ was automatically calculated by an embedded function within each machine learning algorithm. The method by which λ was internally calculated was 10-fold cross-validation.
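The idea behind choosing λ by 10-fold cross-validation can be sketched with a deliberately simplified example. It is not the embedded routine used in this study (which operated on penalized logistic regression in R); instead, it assumes a single feature, no intercept, and squared-error loss, for which the LASSO estimate has a closed form. All function names and the synthetic data are hypothetical.

```python
import random


def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def fit_lasso_1d(x, y, lam):
    """Minimize (1/(2n))*sum((y - w*x)**2) + lam*|w| for one feature."""
    n = len(x)
    rho = sum(xi * yi for xi, yi in zip(x, y)) / n
    sxx = sum(xi * xi for xi in x) / n
    return soft_threshold(rho, lam) / sxx


def kfold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cv_error(x, y, lam, k=10, seed=0):
    """Mean held-out squared error over k folds for a given lam."""
    errors = []
    for held_out in kfold_indices(len(x), k, seed):
        held = set(held_out)
        x_train = [x[i] for i in range(len(x)) if i not in held]
        y_train = [y[i] for i in range(len(x)) if i not in held]
        w = fit_lasso_1d(x_train, y_train, lam)
        errors.append(sum((y[i] - w * x[i]) ** 2 for i in held_out) / len(held_out))
    return sum(errors) / k


def select_lambda(x, y, grid, k=10, seed=0):
    """Return the grid value with the lowest cross-validated error."""
    return min(grid, key=lambda lam: cv_error(x, y, lam, k, seed))


# Synthetic data: y is roughly 2*x with a small alternating perturbation.
x = [i / 10 - 2.5 for i in range(50)]
y = [2.0 * xi + 0.1 * (-1) ** i for i, xi in enumerate(x)]
grid = [0.0, 0.01, 0.1, 1.0, 10.0]
best_lambda = select_lambda(x, y, grid)
print(best_lambda)
```

The tuning parameter is chosen using training data only, so the testing set remains untouched until the final performance assessment.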
Performance
After developing the models with the training data set, we computed several statistics to assess the performance of the constructed models in the testing data set: (1) classification accuracy, expressed as misclassification error, which is the percentage of predicted values that do not match the observed known values (the lower the misclassification error, the better the model); (2) sensitivity, expressed as the probability of predicting vitamin D deficiency in patients who have it; and (3) specificity, expressed as the probability of predicting a normal vitamin D status in patients who have it. We also calculated the area under the receiver operating characteristic (ROC) curve (AUC) for the testing group to evaluate the predictive performance of the fitted models. We performed pairwise comparisons of AUCs, using the method of DeLong et al. to evaluate their power of discrimination. 17
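The performance statistics listed above can be computed as in the following sketch. It is illustrative only (the study used R with the ROCR package); the function names, the 0.5 classification threshold, and the example labels and scores are assumptions. The AUC is computed through its rank interpretation: the proportion of (deficient, nondeficient) patient pairs in which the deficient patient receives the higher predicted probability, with ties counted as one half.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels coded 1 (event) and 0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def sensitivity(y_true, y_pred):
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn)


def specificity(y_true, y_pred):
    _, fp, tn, _ = confusion_counts(y_true, y_pred)
    return tn / (tn + fp)


def misclassification_error(y_true, y_pred):
    _, fp, _, fn = confusion_counts(y_true, y_pred)
    return (fp + fn) / len(y_true)


def auc(y_true, scores):
    """Proportion of event/nonevent pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical observed status and predicted probabilities for 8 patients.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(sensitivity(y_true, y_pred), specificity(y_true, y_pred))
print(misclassification_error(y_true, y_pred), auc(y_true, scores))
```

Sensitivity, specificity, and misclassification error depend on the chosen threshold, whereas the AUC summarizes discrimination across all thresholds.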
Results
The main features of our hypertensive population are shown in Table 1. We selected data from 221 participants, of whom 56 (25%) were considered to have vitamin D deficiency. Women made up 51.1% of our population. The mean age was 59 years. The mean BMI was 29.7, indicating that our patients were overweight or obese, but they had well-controlled hypertension (SBP 137.4 mmHg and DBP 75.7 mmHg). We randomly chose 176 patients (80%) for the training set, and 45 (20%) for the testing set. In these groups, 44 patients (25%) in the training set and 12 (27%) in the testing set had vitamin D deficiency, indicating well-balanced samples.
Table 1. Clinical and Laboratory Features of Our Groups
Data are shown as mean ± standard deviation, median (interquartile range), absolute values, and percentages, as appropriate. BMI, body mass index; CRP, C-reactive protein; DBP, diastolic blood pressure; GGT, gamma-glutamyl transferase; HbA1c, glycated hemoglobin; HDL, high-density lipoprotein; HOMA-IR, homeostasis model assessment-estimated insulin resistance; LDL, low-density lipoprotein; SBP, systolic blood pressure.
Feature selection
When applying the assessed algorithms, three models were built. Each model selected a subset of features (Table 2). Stepwise logistic regression selected a subset of six features: age, SBP, total cholesterol, HbA1c, GGT (gamma-glutamyl transferase), and CRP. LASSO also included six features in its model: age, SBP, DBP, LDL cholesterol, GGT, and CRP. Elastic net produced the most parsimonious model, with five features: age, SBP, DBP, LDL cholesterol, and CRP.
Table 2. Feature Selection Using Different Approaches Within Our Cohort
LASSO, least absolute shrinkage and selection operator.
Features whose coefficients have a larger absolute value have a greater effect on the prediction of vitamin D deficiency. Likewise, lower values show less influence on the prediction. Accordingly, blood pressure, LDL cholesterol, and GGT were found to have minimal influence, despite being selected by the models.
Predictive performance
After training the models, several performance parameters were computed using the testing sample (Table 3). Overall, penalized methods performed better than logistic regression in terms of AUC and misclassification rate. AUC was significantly better in both penalized methods, with AUC = 0.76 and AUC = 0.74 for elastic net and LASSO, respectively, than logistic regression, with AUC = 0.64. ROC curves are shown in Fig. 1. Regarding misclassification rate, elastic net (18%) outperformed LASSO (22%) and logistic regression (25%).

Fig. 1. ROC curves for models produced with stepwise logistic regression
Table 3. Performance of the Different Algorithms Using the Testing Data Regarding Vitamin D Deficiency
AUC, area under the curve.
To perform pairwise comparisons of the three calculated AUCs, we used the method of DeLong et al., 17 which showed differences among the models (Table 4). Both LASSO and elastic net outperformed logistic regression, but we found no differences between LASSO and elastic net.
Table 4. Pairwise Comparison of Receiver Operating Characteristic Curves
If P < 0.05, the two AUCs being compared are significantly different. Using nonparametric analyses (the Mann–Whitney U test), we confirmed that there were differences between logistic regression and the penalized models. However, there were no discriminative differences between LASSO and elastic net.
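A pairwise comparison of two AUCs obtained on the same patients can be sketched as follows. This is a minimal illustration of the nonparametric approach of DeLong et al., which builds on the Mann–Whitney statistic; it is not the code used in this study, and the function names and example scores are hypothetical. The variance of the AUC difference is estimated from the per-patient "structural components" of each model, and a two-sided P value is obtained from the normal distribution.

```python
import math


def _psi(x, y):
    """Mann-Whitney kernel: 1 if x > y, 0.5 if tied, 0 otherwise."""
    return 1.0 if x > y else 0.5 if x == y else 0.0


def _components(pos, neg):
    """Structural components for events (v10) and nonevents (v01)."""
    v10 = [sum(_psi(x, y) for y in neg) / len(neg) for x in pos]
    v01 = [sum(_psi(x, y) for x in pos) / len(pos) for y in neg]
    return v10, v01


def _cov(a, b):
    """Sample covariance of two equally long lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)


def delong_test(y_true, scores_a, scores_b):
    """Return (auc_a, auc_b, z, two-sided p) for two correlated ROC curves."""
    pos_a = [s for t, s in zip(y_true, scores_a) if t == 1]
    neg_a = [s for t, s in zip(y_true, scores_a) if t == 0]
    pos_b = [s for t, s in zip(y_true, scores_b) if t == 1]
    neg_b = [s for t, s in zip(y_true, scores_b) if t == 0]
    m, n = len(pos_a), len(neg_a)
    v10a, v01a = _components(pos_a, neg_a)
    v10b, v01b = _components(pos_b, neg_b)
    auc_a, auc_b = sum(v10a) / m, sum(v10b) / m
    var = ((_cov(v10a, v10a) + _cov(v10b, v10b) - 2 * _cov(v10a, v10b)) / m
           + (_cov(v01a, v01a) + _cov(v01b, v01b) - 2 * _cov(v01a, v01b)) / n)
    if var <= 0.0:  # e.g., identical score vectors: the test is undefined
        return auc_a, auc_b, float("nan"), float("nan")
    z = (auc_a - auc_b) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal P value
    return auc_a, auc_b, z, p_value


# Hypothetical status and predicted probabilities from two models.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
model_a = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]
model_b = [0.5, 0.4, 0.6, 0.7, 0.3, 0.45, 0.2, 0.1]
auc_a, auc_b, z, p_value = delong_test(y_true, model_a, model_b)
print(round(auc_a, 3), round(auc_b, 3), round(p_value, 3))
```

With so few hypothetical patients the test has little power; the example is intended only to show how the comparison is constructed.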
Discussion
The main finding in our study was the demonstration of the usefulness of several machine learning algorithms in the development of predictive models to determine vitamin D deficiency. The models we produced were capable of accurately establishing the relationships between our analyzed features and vitamin D deficiency. We consider that these models can determine, with a high degree of accuracy, whether a certain patient is at high risk of having vitamin D deficiency, and thus should be tested for this biomarker.
A strength in our study was that we used machine learning-based approaches (i.e., penalization methods) to address the issue of a high-dimensional small data set. We chose LASSO and elastic net algorithms to obtain a limited number of features (i.e., a more parsimonious model). In a clinical setting, a model with few features is desirable due to limitations of resources and time to obtain information from patients. Another advantage of penalization approaches is that they address overfitting better than conventional analyses, and as such, are more reliable.
Both penalized algorithms performed better than classical logistic regression. Although elastic net seemed to perform better than LASSO, both methods were equivalent in terms of performance because the observed differences were not significant. Logistic regression has been widely used in clinical research because its main advantage is that it does not require a linear relationship between dependent and independent variables: it can handle several types of relationship because it applies a nonlinear log transformation to the predicted odds. However, it does not accurately handle high-dimensional data sets and cannot handle correlated features. In contrast, LASSO and elastic net are designed for such situations. Using L1 penalization, LASSO performed better than logistic regression, but since LASSO handles groups of correlated features less well, modeling could be improved by introducing elastic net, which combines the advantages of L1 and L2 penalization and can produce a more stable solution.
Our proposed penalized models rely on the automatic selection of features. 18 The LASSO approach has been used elsewhere when trying to identify prognostic factors and dealing with high-dimensional data such as radiological features of PET images, 19 and environmental enteropathy biomarkers. 20 A recent publication on prediction of breast cancer diagnosis demonstrated the superiority of penalized regression over logistic regression in predicting the presence of breast malignancy. 21
The AUCs and misclassification rates did not differ significantly when we compared LASSO and elastic net using the DeLong test. This may be because of the low importance (i.e., low coefficient) of GGT in the LASSO model. Although LASSO and elastic net performed equivalently, we believe clinicians should opt for a parsimonious model with few relevant features.
We observed that penalized regression coefficients (Table 2) were smaller than those from logistic regression. Shrinkage is another interesting aspect of penalized methods, as they impose a constraint (the so-called penalty term) on the coefficient values. A more parsimonious, less complex model may be preferable to save time and effort in gathering data from patients. Elastic net was able to select a subset of 5 out of the initial 20 features.
Some alternative machine learning methods, such as support vector machines or random forest, may perform better than elastic net or LASSO regarding vitamin D status. 22,23 These methods are not, however, easy to implement in a clinical setting, and they do not produce coefficient values, but rather features ranked in order of importance. Our proposed methods are easy to implement and simple to interpret.
Our objective was to identify the most relevant informative features associated with vitamin D status in order to identify the population at risk of having vitamin D deficiency. In our population, a high-sensitivity test is preferable to identify patients who need further investigation, that is, laboratory determination of vitamin D. The ideal, if unrealistic, situation would be to produce a 100% accurate test, but in a clinical setting, a good alternative would be to have a high sensitivity/low specificity test, and if this is positive, then perform a low sensitivity/high specificity test. In this way, false positives may be correctly identified as disease negative.
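One way to obtain a high-sensitivity first-stage test from a predictive model is to lower the probability threshold until a target sensitivity is reached, accepting the resulting loss of specificity. The sketch below illustrates this idea only; the function names, the 90% target, and the example data are assumptions and were not part of the analyses in this study.

```python
import math


def threshold_for_sensitivity(y_true, scores, target=0.9):
    """Largest threshold at which sensitivity (score >= threshold) meets target."""
    pos = sorted((s for t, s in zip(y_true, scores) if t == 1), reverse=True)
    needed = max(1, math.ceil(target * len(pos) - 1e-9))  # events to capture
    return pos[needed - 1]


def sens_spec(y_true, scores, threshold):
    """Sensitivity and specificity when score >= threshold is called positive."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < threshold)
    tn = sum(1 for t, s in zip(y_true, scores) if t == 0 and s < threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical observed status and predicted probabilities for 8 patients.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]

cutoff = threshold_for_sensitivity(y_true, scores, target=0.9)
sens, spec = sens_spec(y_true, scores, cutoff)
print(cutoff, sens, spec)
```

Patients scoring at or above the cutoff would then proceed to the confirmatory laboratory determination of 25(OH)D, which resolves the false positives produced by the first stage.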
Our study found that some clinical conditions are associated with vitamin D deficiency, although evidence of causation has not yet been established in the medical literature. Aging can involve low vitamin D serum levels due to decreased renal production of vitamin D-related metabolites by aging kidneys, decreased skin production of vitamin D, or tissue resistance to circulating vitamin D, among other factors. 24,25 Also, vitamin D deficiency is highly prevalent in patients with high LDL cholesterol, 26 but the underlying mechanism is not yet clear. A recent publication reviewed the role of vitamin D deficiency in the development of essential hypertension, 27 probably due to inappropriate activation of angiotensin II and renin, although the underlying mechanisms are still debatable. 28 Regarding CRP, a significant positive association between both high CRP levels and low vitamin D levels and cardiovascular disease was observed in an epidemiological study, 29 but whether this is a cause or a consequence relationship remains unknown. 30
Regarding obesity, our study found that vitamin D deficiency was not associated with overweight or obesity. In contrast, some studies have found an inverse correlation between vitamin D and obesity. 31,32 The true nature of this relationship remains, however, unclear, and several hypotheses have been proposed, such as malabsorption, or sequestration by the adipose tissue, which may decrease the bioavailability of vitamin D in overweight or obese individuals. 33,34
Vitamin D has also been associated with metabolic syndrome in some publications, 35,36 but, as with obesity, the true extent of this association is unclear. In our population, vitamin D deficiency was not associated with obesity criteria, that is, BMI and waist circumference, nor with some metabolic syndrome criteria, such as homeostasis model assessment-estimated insulin resistance, triglycerides, HDL cholesterol, or fasting glucose. Likewise, we found that in our patients, vitamin D status was not associated with metabolic syndrome.
Limitations
Our study has a number of limitations. The main limitation is the small sample size (221 patients). Another limitation involves the disadvantages inherent to retrospective studies. In addition, the high-risk profile of our patients (hypertensive, obese, and hyperglycemic) could have introduced potential confounders, resulting in bias. Participants were patients referred to a hypertension unit that provided confirmatory ambulatory blood pressure monitoring. Thus, our cohort of patients is not representative of the general population. Furthermore, unlike other studies on metabolic syndrome, we did not include smoking status as a variable because data on this factor were not correctly obtained from patients. However, we took all of these limitations into full consideration when choosing the appropriate statistical approach: small samples can be properly handled by regularization. 15,16 Although our results are not novel, developing a predictive model that uses machine learning to perform analyses extends previous knowledge on the relationship between vitamin D deficiency and some components of metabolic syndrome.
Future Work
Our data set included patients with essential hypertension, a prevalent disease in our current clinical setting. Although we believe that prospective studies should be performed to validate our proposed approach, the approach could be easily extended by other research groups to build their own predictive models for specific populations. Our main technical contribution is in providing a simple, easy-to-use method of developing well-fitting predictive models for characterizing vitamin D deficiency and identifying at-risk populations. Our approach can be implemented in other situations or clinical settings (e.g., in patients with metabolic syndrome, heart failure, or prevalent cardiovascular disease). Also, important factors such as LDL cholesterol or CRP need more in-depth analysis to determine whether they are causes of vitamin D deficiency or are simply correlated with it. If they are a cause of this deficiency, they could become therapeutic targets. Therefore, these factors may help clinicians improve their understanding of vitamin D deficiency.
Conclusions
Penalized methods such as LASSO or elastic net performed better than logistic regression in identifying individuals at a high risk of having vitamin D deficiency. In our hypertensive population, with a high prevalence of overweight, obesity, and metabolic syndrome, vitamin D was not associated with any criteria of insulin resistance, dyslipidemia, or adiposity.
Footnotes
Ethical Approval
All procedures performed in studies involving human participants were in accordance with the Ethical Standards of the Institutional and/or National Research Committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards.
Author Disclosure Statement
No conflicting financial interests exist.
Funding Information
This study has been partly funded by Research Projects TEC2016-75361-R and TEC2016-75161-C2-1-R from the Spanish Government, and Research Project DTS17/00158 from Instituto Carlos III (Spain).
