Abstract
This work contributes to the development of effective statistical methods of big data analysis for type 2 diabetes mellitus (T2DM) risk assessment in routine clinical practice. The study, carried out via machine-learning analysis, has two objectives: to investigate whether biochemical biomarkers can be used for T2DM risk prediction when little is known about an individual's biometrical parameters, and to assess the predictive ability of a derived parameter (the rate of change of a biomarker over time). The obtained statistical parameters (AUC, p-value, etc.) indicate a relatively high quality of the model. Further improvement may be achieved by adding new factors and models, including lifestyle/habits and genetic parameters.
1. Introduction
Type 2 diabetes mellitus (T2DM) is one of the major causes of mortality in the world. According to World Health Organization data, the number of people with diabetes rose from 108 million in 1980 to 422 million in 2014, and the number of newly diagnosed cases increases each year. Diabetes is a major cause of blindness, kidney failure, heart attacks, stroke, and lower limb amputation. Most deaths from T2DM and associated diseases occur at an early age and are associated with high blood glucose levels. However, T2DM may be prevented through lifestyle modifications, physical activity, and preventive use of drugs such as metformin. Therefore, effective methods of T2DM risk assessment should be developed and used in routine clinical practice.
During the past three decades, a large number of possible predictive biomarkers have been investigated in relation to T2DM risk prediction. Among them are traditional biomarkers that are members of canonical signaling and metabolic pathways related to T2DM etiology and pathogenesis (their link with T2DM is well established), and novel biomarkers obtained by using modern high-throughput methods such as liquid chromatography coupled with tandem mass spectrometry, DNA sequencing, and differential gene expression analysis. According to the large meta-analysis of different predictors and predictive models performed by Abbasi et al. (2016), the following biomarkers have the highest predictive power: fasting glucose concentration, fructosamine concentration, glucose tolerance test results, and concentrations of glycated albumin, glycated hemoglobin, and uric acid. It is also possible to use other biomarkers, including biochemical indicators of liver function [alanine aminotransferase (ALT), aspartate aminotransferase (AST), bilirubin, and others], immunological biomarkers, biomarkers of metal metabolism, and others.
The use of noninvasive biomarkers (body mass index [BMI], wrist thickness, history, smoking, etc.), classical glycemic biomarkers (glycated hemoglobin, blood glucose concentration, glucose tolerance test), as well as new biomarkers (metabolic and genetic) is reviewed by Herder et al. (2014).
Single-nucleotide polymorphisms (SNPs) in more than 60 loci of the genome are associated with the risk of developing type 2 diabetes (Herder et al., 2011; Mehta, 2012; Pal and McCarthy, 2013). The presence of each of these variants separately increases the risk of developing diabetes by 5%–40% (odds ratio [OR] 1.05–1.4). The use of 40 SNPs for predicting type 2 diabetes yields an area under the receiver operating characteristic curve (AROC) of 0.55 to 0.63 (Zarkoob et al., 2017). A relatively recent study of the MTNR1B locus encoding the melatonin 1B receptor by the Meta-Analysis of Glucose and Insulin-Related Traits Consortium (Bonnefond et al., 2012) showed that rare mutations could be present in this locus, leading to a greater risk of developing type 2 diabetes (OR 5.7, 95% confidence interval [CI] 2.2–14.8). It was also reported that the use of genetic data allows more accurate predictions of the risk of developing type 2 diabetes.
Fagerberg et al. (2011) demonstrated that a decrease in plasma adiponectin concentration to below 11.54 g/L is associated with the risk of developing type 2 diabetes. Other factors included in that analysis were HOMA-IR (homeostatic model assessment for insulin resistance), AIR (acute insulin response), IFG (impaired fasting glucose), IGT (impaired glucose tolerance), and number of cigarettes per year.
Savolainen et al. (2017) used metabolite profiling to predict the risk of developing type 2 diabetes. Nine metabolites (sorbitol, galactitol, mannose, galactose, uric acid, oxalic acid, glucaric acid-1,4-lactone, 3-methyl-2-oxopentanoic acid, and 2-hydroxybutyric acid), combined with noninvasive methods, can improve the quality of prediction by 9% compared with the adiponectin model.
Wang-Sattler et al. (2012) reported three metabolites (glycine, lysophosphatidylcholine, and acetylcarnitine) to be associated with the risk of developing insensitivity to glucose and type 2 diabetes. In addition, seven genes (PPARG, TCF7L2, HNF1A, GCK, IGF1, IRS1, and IDE) were found to change expression in the phenotype of developing type 2 diabetes (Peddinti et al., 2017).
Moreover, at this point, researchers are ready to introduce various machine-learning algorithms to try to improve predictions using routine clinical data (Weng et al., 2017). These risk prediction models of different types include classic generalized linear regression, distributed random forest, gradient boosting machine, and artificial neural networks. For instance, the data obtained from metabolic studies using machine learning to interpret results show that the decrease in the concentration of alpha-tocopherol is associated with the risk of developing type 2 diabetes.
This exploratory study pursues two main objectives to be reached via machine-learning analysis: First, investigate whether biochemical biomarkers widely used in routine clinical practice can be employed for the T2DM risk prediction without any prior knowledge about biometrical parameters of the individual. Second, assess the predictive ability of a derived parameter (rate of a biomarker change over time) to be employed for T2DM risk prediction.
2. Materials and Methods
2.1. Individuals
The datasets from the MedExpert medical center were employed in this study. MedExpert is one of the major nongovernment medical groups in the Voronezh region, Russia, and holds most of the medical data of its patients (up to half a million) in digital form. The database on type 2 diabetes contains 10,464 patients, with varying sets of biomarkers available (8461 records were excluded, see below). Patients were examined from 2010 to 2016, and 2202 of them had measurements taken more than once. For 2003 people, the list of measured biomarkers included fasting glucose, HbA1c, bilirubin, cholesterol, liver enzymes, C-reactive protein, and several others. Unlike typically employed multivariate models (Herder et al., 2014), the following parameters were not used in this analysis: BMI, waist circumference, and blood pressure.
All individuals were informed regarding the ongoing research work and provided written consent. The ethical approval was received from Voronezh State University ethical committee.
2.2. Risk factors assessment
Plasma glucose was measured by a glucose oxidase method. C-reactive protein was assessed by ultrasensitive nephelometry.
All measurements, including γ-glutamyl transferase and ALT, were performed using a Cobas Integra 400 plus automated analyzer (Roche Diagnostics).
2.3. Dataset
We excluded 8461 records from our dataset because they did not meet all of the following criteria:
(1) The individual must have follow-up, namely two or more appointments during 6 years. (2) The individual must have at least two glucose measurements during the follow-up period. (3) Other biomarkers must be measured at least once.
We also excluded from the statistical analysis all patients whose recorded glucose levels were unpredictable or inconsistent. We defined a "predictable" level as one that can be coherently approximated by any typical mathematical function with R² greater than or equal to 0.6. This was done to eliminate from our analysis all individuals who were either diagnosed with T2DM during the follow-up or did not comply with the prerequisites for fasting glucose testing.
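The predictability criterion can be illustrated with a minimal sketch. It assumes a straight-line fit, which is only one of the "typical mathematical functions" the criterion allows; the function names `linear_trend` and `is_predictable` and the example series are hypothetical and not taken from the study. The slope of the fitted line is one possible definition of the rate of change of a biomarker.

```python
def linear_trend(times, values):
    """Least-squares line through (time, value) points.

    Returns (slope, intercept, r_squared). The slope is the rate of
    change of the biomarker per unit of time.
    """
    n = len(times)
    if n < 2 or n != len(values):
        raise ValueError("need at least two paired measurements")
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    s_tt = sum((t - mean_t) ** 2 for t in times)
    if s_tt == 0:
        raise ValueError("all measurements share the same time point")
    s_tv = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = s_tv / s_tt
    intercept = mean_v - slope * mean_t
    ss_res = sum((v - (intercept + slope * t)) ** 2 for t, v in zip(times, values))
    ss_tot = sum((v - mean_v) ** 2 for v in values)
    # A perfectly flat series has no variance to explain; treat it as fully described.
    r_squared = 1.0 if ss_tot == 0 else 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared


def is_predictable(times, values, r2_threshold=0.6):
    """Apply the R^2 >= 0.6 criterion (assumption: linear fit only)."""
    _, _, r_squared = linear_trend(times, values)
    return r_squared >= r2_threshold


# Hypothetical fasting glucose series (mmol/L) over follow-up years.
steady = linear_trend([0, 1, 2, 3], [4.8, 5.0, 5.3, 5.5])
erratic_ok = is_predictable([0, 1, 2, 3], [5.0, 6.5, 4.9, 6.4])
```

Note that a line through exactly two measurements always fits perfectly, so under this sketch the criterion only discriminates for patients with three or more glucose measurements.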
2.4. Statistical analysis
To estimate the risk of developing T2DM, regression analysis in the form of logistic regression was employed. Modeling was performed by using built-in functions of the freely available R software.
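For readers unfamiliar with logistic regression, the following is a minimal sketch of how such a model can be fitted. It is not the R procedure used in the study: R fits generalized linear models by iteratively reweighted least squares, whereas this sketch substitutes plain batch gradient descent for brevity. The function names, learning rate, number of epochs, and toy data are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(rows, labels, lr=0.1, epochs=5000):
    """Fit an intercept and coefficients by batch gradient descent
    on the mean log-loss. beta[0] is the intercept."""
    n, k = len(rows), len(rows[0])
    beta = [0.0] * (k + 1)
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for x, y in zip(rows, labels):
            p = sigmoid(beta[0] + sum(b * xi for b, xi in zip(beta[1:], x)))
            err = p - y
            grad[0] += err
            for j, xi in enumerate(x):
                grad[j + 1] += err * xi
        beta = [b - lr * g / n for b, g in zip(beta, grad)]
    return beta


def predict_proba(beta, x):
    """Predicted probability of the positive class for one record."""
    return sigmoid(beta[0] + sum(b * xi for b, xi in zip(beta[1:], x)))


# Toy, non-separable data: one hypothetical predictor, binary outcome.
toy_rows = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
toy_labels = [0, 0, 1, 0, 1, 1]
toy_beta = fit_logistic(toy_rows, toy_labels)
```

With real data, predictors on very different scales would normally be standardized first so that a single learning rate suits all coefficients.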
Three models were used in the analysis of risks associated with T2DM.
Model 1 used all predictors listed next. However, in the subsequent postprocessing, only patients with steadily rising, falling, or stable glucose concentrations were retained. If a patient's glucose behavior could not be described by a linear function, that patient's record was excluded.
Model 2 took into account all predictors, omitting the rate of change of the concentration of glucose and without any filtration based on the transitional behavior of the glucose (see Model 1).
Model 3 used all predictors except the rate of change of the glucose concentration, with a filter that retained only patients with linearly growing glucose levels. This was done to exclude all individuals who had previously been diagnosed with T2DM and were undergoing the corresponding treatment. Moreover, such a procedure makes it possible to exclude patients who underwent an additional measurement after an erroneous one (e.g., after excessive and/or prohibited consumption of food).
The quality of the model was judged based on two typical statistical parameters: the AUC (area under the receiver operating characteristic curve), which characterizes the trade-off between true-positive and false-positive predictions, and the p-value, which is the probability of obtaining the observed, or more extreme, results when the null hypothesis of the study question is true.
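As an illustration of the first quality measure, the AUC can be computed directly from predicted scores by using its rank interpretation: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below is a generic implementation under that definition; the function name `roc_auc` and the example scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive case has the higher score; ties count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical outcomes (1 = developed T2DM) and predicted probabilities.
example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

An AUC of 0.5 corresponds to scores that carry no information about the outcome, and 1.0 to perfect separation of the two groups. The pairwise loop is quadratic in the number of records, which is acceptable for a dataset of a few thousand patients.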
In addition, the interplay of the rates of change of various biomarkers was tested. In particular, the rate of increase of the glucose concentration was predicted from the rates of change of other biomarkers by using linear regression.
3. Results
Different sizes of the training and testing subsets of our dataset were tested for all three models. The results of the performance assessment of each model are shown in Table 1. The best performance (highest AUC with p-value <0.01) was achieved by Model 3 with a training subset containing 90% of all records.
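The procedure of varying the training subset size can be sketched as follows. This is a minimal illustration, assuming a simple random split; the function name `split_records`, the fixed seed, and every fraction other than 0.9 are hypothetical and not taken from the study.

```python
import random


def split_records(records, train_fraction, seed=0):
    """Randomly split records into training and testing subsets."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must lie strictly between 0 and 1")
    shuffled = list(records)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# 2003 records remain after the exclusions described in the Dataset section.
record_ids = list(range(2003))
subset_sizes = {}
for fraction in (0.5, 0.6, 0.7, 0.8, 0.9):  # only 0.9 is stated in the text
    train, test = split_records(record_ids, fraction)
    subset_sizes[fraction] = (len(train), len(test))
```

With a 90% training subset, only about 200 records remain for testing, so performance estimates obtained on that subset carry considerable sampling uncertainty.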
Table 1. Evaluation of Different Predictive Models
AUC, area under the receiver operating characteristic curve; SEP, size of the training subset.
Model 3 achieved an accuracy of 71.01%, an AUC of 0.763, and a p-value of 0.006649, with sensitivity and specificity of 86% and 44%, respectively. After analyzing the residual deviances of Model 3 (see Table 2 and the discussion below), the following predictors were selected to improve its performance: gender, age, bilirubin concentration, ALT concentration, ALT rate of change, and glucose rate of change. Excluding any one of these predictors, or adding any other, reduced the predictive power of the model (accuracy and AUC). The resulting modified model (Model 3m) had an accuracy of 73.85% (95% CI 0.689–0.7839), an AUC of 0.81, a p-value of 0.001162, and sensitivity and specificity of 90% and 42%, respectively.
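For reference, accuracy, sensitivity, and specificity follow from the counts of correct and incorrect predictions in each class. The sketch below shows the standard definitions; the function name `classification_metrics` and the example counts are hypothetical and are not the study's confusion matrix.

```python
def classification_metrics(labels, predictions):
    """Accuracy, sensitivity, and specificity for binary outcomes (1 = T2DM)."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes are required")
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),  # share of true cases that are detected
        "specificity": tn / (tn + fp),  # share of non-cases correctly cleared
    }


# Hypothetical test subset: 10 cases (9 detected), 5 non-cases (2 cleared).
example_labels = [1] * 10 + [0] * 5
example_predictions = [1] * 9 + [0] + [1] * 3 + [0] * 2
example_metrics = classification_metrics(example_labels, example_predictions)
```

A combination of high sensitivity and low specificity, as in the hypothetical counts above, means that few true cases are missed but many individuals without the disease are flagged as being at risk.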
Table 2. Residual Deviances of Selected Biomarkers
ALT, alanine aminotransferase.
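Beta-coefficients of the logistic regression obtained during the analysis are shown in Table 2. The individual probability P of T2DM development can be calculated by using the logistic equation, given here in its general form with the fitted coefficients left symbolic:
$$P = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6)}}$$
where β0 is the intercept, β1–β6 are the regression coefficients, and x1–x6 are the values of ALT rate of change, glucose rate of change, ALT concentration, bilirubin, gender, and age (gender being a categorical variable, with x5 = 1 for a female and x5 = 0 for a male).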
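A minimal sketch of how an individual probability could be computed from the six predictors is given below. It assumes the standard logistic form with an intercept. The coefficient values are hypothetical placeholders chosen only to make the example run; they are NOT the fitted values of the study and must be replaced by the actual beta-coefficients. The function and key names are likewise illustrative.

```python
import math

# Hypothetical placeholder coefficients, NOT the fitted values of the study.
EXAMPLE_COEFFICIENTS = {
    "intercept": -3.0,
    "alt_rate": 0.01,      # x1: ALT rate of change
    "glucose_rate": 0.8,   # x2: glucose rate of change
    "alt": 0.02,           # x3: ALT concentration
    "bilirubin": -0.01,    # x4: bilirubin concentration
    "female": -0.3,        # x5: 1 for a female, 0 for a male
    "age": 0.04,           # x6: age in years
}


def t2dm_probability(alt_rate, glucose_rate, alt, bilirubin, sex, age,
                     coef=EXAMPLE_COEFFICIENTS):
    """Logistic probability of T2DM development for one individual."""
    if sex not in ("female", "male"):
        raise ValueError("sex must be 'female' or 'male'")
    x = {
        "alt_rate": alt_rate,
        "glucose_rate": glucose_rate,
        "alt": alt,
        "bilirubin": bilirubin,
        "female": 1.0 if sex == "female" else 0.0,
        "age": age,
    }
    z = coef["intercept"] + sum(coef[name] * value for name, value in x.items())
    # Evaluate the logistic function without overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Hypothetical individual; the result is meaningful only with real coefficients.
example_risk = t2dm_probability(0.5, 0.2, 25.0, 12.0, "male", 50)
```

Passing the coefficients as a dictionary keyed by predictor name keeps the gender coding (female = 1, male = 0) explicit and reduces the chance of supplying the predictors in the wrong order.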
Table 2 shows that the greatest decrease in residual deviance occurs when ALT concentration, glucose rate of change, and age are added to the model. Other variables do not improve the model significantly, despite the fact that age has a rather small p-value. In general, the larger the p-value of a variable, the less important that variable is to the model (in other words, a model without the variable gives almost the same results as a model that includes it). In our case, such variables are the bilirubin concentration and the rate of change of ALT.
Analysis of the interplay of the rates of change of various biomarkers showed that the rate of change of no other biomarker has a statistically significant relationship with the rate of change of the glucose level (p > 0.5). In addition to the main model, which includes the levels of ALT and bilirubin, extended models covering all combinations of the biomarkers selected for analysis were obtained in this study. However, the use of models that do not include ALT, bilirubin, and the glucose rate of change is not recommended because of their extremely low discriminative ability (AUC ≈ 0.5, i.e., close to chance).
In the resulting model, the age of the patient makes the greatest contribution to the probability of developing the disease. Figure 1 shows a clear dependence of the probability of developing diabetes on the age of the individual. According to the published data, men are more likely to develop T2DM than women, especially between 25 and 50 years of age. In both sexes, the probability increases uniformly with age.

Figure 1. Dependence of the probability of T2DM development on the age and sex of the patient. T2DM, type 2 diabetes mellitus.
Figure 2 shows the effect of the glucose level on the likelihood of T2DM. The main increase (slope change) in the T2DM probability corresponds to glucose levels in the range from 4 to 6 mmol/L, which indirectly supports the selected boundary of 6.1 mmol/L (exceeding this value is the criterion for a patient to be diagnosed with T2DM in this study).

Figure 2. Dependence of the probability of T2DM development on glucose concentration.
Figures 3–5 show the dependence of the probability of T2DM development on ALT, AST, and cholesterol levels, respectively. From these plots, we conclude that ALT and AST do not make a significant contribution to the probability of developing the disease (the same level of biomarker corresponds to a wide range of probabilities, from 0% to 100%), whereas cholesterol shows better concordance with the T2DM probability. In practice, the ALT-to-AST ratio is used to diagnose a wide range of diseases, including nonalcoholic liver diseases, fibrosis, cirrhosis, a number of metabolic diseases, as well as hepatitis B and C, when both biomarkers exceed certain limits. However, the introduction of a new biomarker based on liver tests (the ratio of ALT to AST) did not lead to any significant improvement in the predictive power of the model. Thus, the ALT concentration, used without any stratification, currently seems to be the most important of the biochemical predictors considered in these figures.

Figure 3. Dependence of the probability of T2DM development on ALT levels. ALT, alanine aminotransferase.

Figure 4. Dependence of the probability of T2DM development on AST levels. AST, aspartate aminotransferase.

Figure 5. Dependence of the probability of T2DM development on cholesterol concentrations.
4. Discussion
Currently, T2DM is recognized as one of the most lethal diseases. Moreover, it is considered the seventh leading cause of death worldwide (World Health Organization, 2014). Therefore, it is crucial to put effort into minimizing the overall incidence of T2DM and to try to decrease its socioeconomic impact.
Major health policy regulators such as the American Diabetes Association publish recommendations regarding the diagnosis, prevention, and treatment of T2DM. Most of them concentrate on a daily monitoring of blood glucose of individuals having a high risk of T2DM. We believe that thorough, predictive, and accurate ways of T2DM risk estimation have to be implemented and used in the routine practice of medical organizations.
Various mathematical models have been developed to date for the calculation of T2DM risks. Most of them include biometrical parameters of the individual and several biochemical biomarkers. It is widely accepted that such models are sufficiently accurate, with AUCs ranging from 0.8 to 0.95 and high specificity and sensitivity, and that no further improvement is needed. Although it is well established that T2DM is determined, among other things, by genetic factors, most works published before 2016 assign only a limited role to genetic traits (SNPs and differential gene expression) in T2DM risk-assessment models.
Most medical organizations worldwide do not calculate global disease risks on a continuous basis. Therefore, a patient at risk of T2DM may remain unaware of the need for lifestyle changes and pharmacological interventions until irreversible changes occur. Another problem that hinders the implementation of a continuous risk assessment tool is the fragmentation and incompleteness of medical data. In most cases, only a small subset of the parameters needed to estimate T2DM risk is available for a given patient (e.g., the majority of database entries in medical organizations contain only age, gender, and several biomarkers, but not height, weight, waist circumference, smoking status, and other parameters required by an arbitrary risk model). Therefore, models that contain only a few parameters should be developed and tested.
This work demonstrates, in particular, that T2DM risk can be calculated with reasonable accuracy and high sensitivity, albeit with modest specificity, by using only a small subset of widely used biochemical biomarkers. Several studies show that including the rate of change of a biomarker level in a risk model can significantly improve its accuracy in ischemic stroke and coronary artery disease prediction. Two variables in our model are the rates of change of glucose and ALT. Analysis of the model performance shows that the impact of these variables is comparable to that of the main parameters/biomarkers (ALT and bilirubin concentrations, and gender).
To sum up, the model described in this study can be readily employed in the routine clinical practice for an approximate although predictive estimation of risks associated with T2DM. The model is of particular interest in cases when the patient's data is incomplete and therefore insufficient for use in classical risk prediction algorithms.
5. Conclusion
In the course of this study, a parametric logistic regression analysis was applied to the MedExpert database of 10,464 patients with measurements performed from 2010 to 2016. The biomarkers were selected based on two criteria: their abundance in the database and their importance for the development of T2DM, as established from the literature review. Three mathematical models were then constructed to link the selected biomarkers and their measurements with the probability that an individual develops diabetes. The best model was chosen based on the AUC and p-value criteria and applied to provide predictive estimates of risks associated with T2DM.
The AUC of 0.81 and p-value of 0.001 indicate a relatively high quality of Model 3m. Further improvement can be pursued by adding new factors, that is, lifestyle/habits and biometrical and genetic parameters. This work also demonstrated that the rates of change in the concentrations of ALT, AST, bilirubin, and cholesterol show no correlation with the rate of change of blood glucose and, thus, cannot be used to estimate the time at which the glucose level would exceed the threshold of 6.1 mmol/L.
A number of papers indicate that genetic factors make a significant contribution to the risk of developing different types of diabetes. The performance of the model, thus, can be improved by adding new variables to it, reflecting the presence of certain genetic polymorphisms in the patient's genome. In addition, prediction of risk estimates of T2DM in a long timeframe can be further investigated by using a Cox model, which was reported in a separate preliminary study.
Moreover, the development of the reported model and the associated statistical methods (including advanced machine-learning techniques) lays a basis for risk assessment in other disease scenarios, including cardiovascular and oncological diseases and prenatal chromosomal abnormalities.
Footnotes
Acknowledgments
The authors would like to acknowledge provision of the valuable dataset and technical support by the MedExpert medical center.
Author Disclosure Statement
The authors declare there are no conflicting financial interests.
