Understanding Prediabetes in a Medicare Advantage Population Using Data Adaptive Techniques

Abstract

The objective was to identify individuals with undiagnosed prediabetes from administrative data using data adaptive techniques. The data source was a national Medicare Advantage Prescription Drug (MAPD) plan administrative data set. A retrospective, cross-sectional study developed and evaluated data adaptive logistic regression, decision tree, neural network, and ensemble predictive models for metabolic syndrome and prediabetes using 3 mutually exclusive cohorts (N = 279,903). The misclassification rate (MCR), average squared error (ASE), c-statistic, sensitivity (SN), and false positive (FP) rates were compared to select the final predictive models. MAPD individuals with continuous enrollment from 2013 to 2014 were included. Metabolic syndrome and prediabetes were defined using clinical guidelines, diagnosis, and laboratory data. A total of 512 variables, identified through subject matter expertise and by utilizing all available data, were evaluated for the modeling. The ensemble model demonstrated better discrimination (c-statistic, MCR, and ASE of 0.83, 0.24, and 0.16, respectively), high SN, and low FP rate in predicting metabolic syndrome than the individual data adaptive modeling techniques. Logistic regression demonstrated better discrimination (c-statistic, MCR, and ASE of 0.67, 0.13, and 0.11, respectively), high SN, and low FP rate in predicting prediabetes than the other adaptive modeling techniques or ensemble methods. The scored data predicted prediabetes in 44% of the MAPD population, which is comparable to 2005–2006 National Health and Nutrition Examination Survey prediabetes rates of 41%. The logistic regression model demonstrated good performance in predicting undiagnosed prediabetes in MAPD individuals.

Introduction

The prevalence of type 2 diabetes has quadrupled among adults in the United States over the last 3 decades. The number of individuals diagnosed with diabetes increased from 5.5 million in 1980 to 22 million in 2014.¹ Type 2 diabetes is associated with several complications such as heart diseases, stroke, blindness, kidney diseases, and lower limb amputation, leading to increased mortality and health care costs. In 2012, the total estimated cost incurred by patients diagnosed with diabetes was $245 billion, which was comprised of $176 billion in direct medical costs and $69 billion in productivity losses.^1,2

Based on fasting glucose or glycosylated hemoglobin (HbA1c) levels, an additional 86 million Americans aged 20 years or older were diagnosed with prediabetes in 2012 and 51% of those individuals were aged 65 years or older.¹ Recent clinical practice guidelines from the American Diabetes Association (ADA) identify individuals with prediabetes based on laboratory criteria including an impaired fasting glucose (IFG) of 100–125 mg/dl, impaired glucose tolerance (IGT) of 140–199 mg/dl, or HbA1c 5.7%–6.4%.³ According to the National Health and Nutrition Examination Survey from 2005–2006, 45.6% of elderly individuals with diabetes had undiagnosed diabetes while 40.8% of this population had prediabetes.⁴ Disease onset and progression may be delayed through lifestyle or pharmacological interventions if individuals at higher risk of developing diabetes during the prediabetic period can be identified.^5–7

Identifying individuals at risk for prediabetes is a challenge and is a potential barrier to implementing timely effective interventions. According to the Centers for Disease Control and Prevention, only 4.7% to 10.6% of individuals with prediabetes had been informed of this condition by their health care professionals.⁸ Several public health initiatives for diabetes screening have failed, mainly because of low turnout, the cost of screening a large number of individuals, and barriers to accessing care that provides guidance or appropriate referrals.⁹

Metabolic syndrome (MS), an indicator of prediabetes, is widely defined among practice guidelines. A constellation of cardiometabolic conditions, MS includes elevated waist circumference/central obesity, dyslipidemia, hypertension, and impaired fasting glucose.^10,11 The National Cholesterol Education Program Third Adult Treatment Panel (NCEP ATP III) defines MS as the presence of at least 3 of the aforementioned abnormalities. The prevalence of MS in adults has been estimated to be 22% to 34% using the NCEP ATP III definition.^12,13 Although IFG ≥100 mg/dl is one of the contributing factors to MS and can be a single indicator of prediabetes, its absence does not eliminate the risk of prediabetes if other components of MS are present; however, this has not been adequately explored and quantified, further complicating timely identification of patients for lifestyle or pharmacological interventions.

Health care providers and researchers have been working for years to design robust testing paradigms and instruments to identify individuals with prediabetes. In clinical settings, HbA1c and 1-hour plasma glucose levels were found to be important predictors of type 2 diabetes.¹⁴ Additionally, the Finnish diabetes risk score, based on medical history and health behavior, demonstrated the ability to identify individuals at high risk of this disease.¹⁵ Using a different approach, Gray et al¹⁶ developed a logistic regression model to predict undiagnosed IFG and type 2 diabetes through evaluation of medical history, sociodemographics, behavioral factors, anthropometric information such as weight, height, waist circumference, and IFG data from 2 primary health care centers. Interestingly, another approach to this testing paradigm was put forth in a separate study based on a Chinese population, which used 12 self-reported risk factors such as demographic characteristics, family diabetes history, anthropometric measurements, and lifestyle risk factors, to compare logistic regression, neural network, and decision tree models in terms of accuracy, sensitivity, and specificity for predicting diabetes or prediabetes.¹⁷

A review of the literature suggests that more than 145 risk models for type 2 diabetes have been developed over the past decade.¹⁸ These models have not been introduced in daily clinical practice. Barriers to employing these models are attributed mainly to the inability to obtain necessary clinical, laboratory, and patient-reported information such as anthropometrics, lifestyle-related information, socioeconomics, family history, smoking, and/or laboratory data.^18,19 The challenge of obtaining this information without direct patient contact or invasive resource-intensive laboratory tests makes currently available models potentially insufficient to meet the larger general public health need. These data are potentially obtainable through administrative claims, which provide a wealth of longitudinal information regarding a patient's clinical characteristics (diagnoses, procedures, laboratory values, health care utilization, and costs), along with sociodemographics.

Data mining is the process of selecting, exploring, and modeling large amounts of multidimensional data, such as health care claims data, to identify unknown patterns or relationships. Mining administrative data may reduce the need for patient contact in this context and thus the need for patient-reported information and laboratory data. These data can be de-identified and used to build predictive models that can be used to assist in population health management strategies.

The goal of predictive modeling is to accurately predict potential cases rather than to test hypotheses; variable selection is therefore based on the strength of the association between the terms and the response variable. Although independent variables in linear inferential regression models lend themselves to interpretation, independent variables in predictive models built on high-dimensional data are subject to covariate effects, and interpretation of the individual variables is not intended, as each variable is in the presence of the others and, when combined, they may explain sufficient variance in relationship to the dependent variable. Variables that explain a sufficient proportion of the variance in the data are retained. Though it can be diminished through dimensionality reduction, multicollinearity is not an issue in predictive modeling as it does not affect a model's predictive ability.²⁰ Using complex and predicted variables as terms in a predictive model does not threaten the validity or predictive power of a validated predictive model; the issue that threatens this type of model's success is overfitting. That is, bias is significantly reduced and prediction in the training data is extremely accurate; however, accuracy is lost in new, unseen data, and the generalizability of the model to potential new cases is lost. The best predictive models consistently and accurately predict potential cases in new data samples regardless of the complexity of their terms.²⁰

Several data adaptive techniques have been developed in recent years.^17,21,22 In medical research, data adaptive techniques have been used to explore unknown factors and build predictive models.^23–25 However, existing literature suggests that very few studies have explored data adaptive techniques in order to construct predictive models for prediabetes using administrative data. Furthermore, data adaptive ensemble methods that construct a model by combining the predictions from multiple analytic techniques (eg, decision tree, neural network, regression) help improve performance in comparison to a single predictive model.²⁶

The purpose of this study was to develop a data adaptive predictive model to identify individuals who were at high risk for diabetes during the prediabetic period. MS can be an indicator of prediabetes, but some of the constellation components are sparsely available in administrative claims data (ie, laboratory values, biometrics); therefore, this study aimed to develop and validate (1) an MS scoring algorithm to be evaluated for inclusion in (2) a prediabetes scoring algorithm for predicting prediabetes based on administrative claims data using data adaptive techniques. The rationale for the 2-stage approach was to ensure that each patient had a predicted probability for MS, a constellation of conditions prognostic for prediabetes, which was then considered for inclusion in the final prediabetes predictive model.²⁷

Methods

Data source

This retrospective, observational, cross-sectional study utilized administrative claims data from patients with a Medicare Advantage Prescription Drug (MAPD) health plan offered by Humana, a health and well-being company that predominately serves people in the southern and midwestern United States through Medicare Advantage, stand-alone prescription drug, and commercial health plan offerings. The study cohort was developed utilizing enrollment, medical, pharmacy, and laboratory data from the de-identified Humana research database from January 1, 2013, to December 31, 2014. The study was approved by an independent Institutional Review Board.

Study subject selection

The study cohort included fully insured MAPD individuals aged 65–89 years as of January 1, 2014, who had continuous enrollment for 2 years, from 2013 to 2014. Calendar year 2013 was considered the baseline period and 2014 was considered the identification period. Patients diagnosed with diabetes, historical pregnancy-related complications, and chronic kidney disease during the baseline or identification period were excluded (see Supplementary Table S1; Supplementary Data are available online at www.liebertpub.com/pop). The eligible cohort of 839,709 patients with an MAPD plan was randomly assigned to 3 equally sized, mutually exclusive cohorts (N = 279,903 each) to achieve the study objectives and avoid multiple testing on a single data set. Figure 1 provides the details of the study cohort selection strategy and attrition summary.
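The random assignment described above can be illustrated with a short sketch. It is not the study's SAS code: the function name `assign_cohorts`, the seed, and the use of integer stand-ins for member identifiers are assumptions made purely for illustration.

```python
import random


def assign_cohorts(member_ids, n_cohorts=3, seed=2014):
    """Randomly split member IDs into equally sized, mutually exclusive cohorts.

    Hypothetical helper: the seed and names are illustrative, not from the study.
    """
    ids = list(member_ids)
    if len(ids) % n_cohorts != 0:
        raise ValueError("population does not divide into equal cohorts")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(ids)
    size = len(ids) // n_cohorts
    # Consecutive slices of the shuffled list cannot share a member.
    return [ids[i * size:(i + 1) * size] for i in range(n_cohorts)]


# Integer stand-ins for the 839,709 eligible members reported in the text.
cohorts = assign_cohorts(range(839_709))
print([len(c) for c in cohorts])
```

Because each member appears in exactly one slice, a model developed on one cohort is never evaluated or scored on the same individuals.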

FIG 1.

Identification and attrition of eligible individuals and cohort development.

After excluding patients with known prediabetes, Cohort 1 (N = 231,867) was used to develop a predictive model for MS. Cohort 2 (N = 279,903) was used to score the data to identify predicted MS utilizing the model developed from Cohort 1. Subsequently, Cohort 2 (N = 279,903), which included patients with known prediabetes, was used to develop a predictive model for prediabetes. Cohort 3 (N = 279,903) was used to identify individuals most likely to have undiagnosed prediabetes and to descriptively compare them with individuals less likely to have prediabetes. This approach resembles the methods discussed by Kvancz et al.²⁷

Primary dependent variables (target variables)

MS was defined as having a diagnosis of MS or at least 2 of the following risk factors obtained from diagnostic and/or laboratory results during the identification period: elevated waist circumference/central obesity, dyslipidemia, hypertension, and impaired fasting blood glucose (see online Supplementary Table S1).²⁸ The dependent variable (target variable), MS, was a binary categorical variable (0 and 1), where 0 indicated normal or no MS identified via claims and 1 indicated the presence of MS.
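As a minimal sketch of how this target could be derived, the rule above can be written as a small function. The function name `ms_target` and the risk factor labels are hypothetical; the actual diagnosis codes and laboratory criteria are those in Supplementary Table S1 and are not reproduced here.

```python
# Hypothetical labels for the four risk factors named in the text.
MS_RISK_FACTORS = (
    "central_obesity",
    "dyslipidemia",
    "hypertension",
    "impaired_fasting_glucose",
)


def ms_target(has_ms_diagnosis, risk_factors):
    """Return 1 if MS is present under the study definition, else 0.

    MS = an MS diagnosis OR at least 2 of the 4 listed risk factors.
    """
    n_present = sum(1 for factor in MS_RISK_FACTORS if factor in risk_factors)
    return int(bool(has_ms_diagnosis) or n_present >= 2)


print(ms_target(False, {"hypertension"}))
print(ms_target(False, {"hypertension", "dyslipidemia"}))
print(ms_target(True, set()))
```

A member with hypertension alone is coded 0, whereas a member with hypertension and dyslipidemia, or with an MS diagnosis and no recorded risk factors, is coded 1.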

Prediabetes was defined according to the ADA clinical guideline using diagnosis and laboratory data obtained during the identification period, as patients with an International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) code for prediabetes, elevated HbA1c, IFG, or IGT (see online Supplementary Table S1).²⁹ The dependent variable (target variable), prediabetes, was a binary categorical variable with 2 categories (0 and 1), where 0 indicated normal or no prediabetes identified from claims and 1 indicated the presence of prediabetes.
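The prediabetes target can be sketched in the same way. This sketch assumes the laboratory ranges quoted in the Introduction (IFG 100–125 mg/dl, IGT 140–199 mg/dl, HbA1c 5.7%–6.4%); the function and parameter names are hypothetical, and the diagnosis code list in Supplementary Table S1 is represented only by a boolean flag.

```python
def prediabetes_target(has_prediabetes_dx=False, hba1c_pct=None,
                       fasting_glucose_mg_dl=None, ogtt_2h_glucose_mg_dl=None):
    """Return 1 if any prediabetes criterion is met, else 0.

    Missing laboratory values (None) never satisfy a criterion, reflecting
    that laboratory data are only sparsely available in claims.
    """
    elevated_hba1c = hba1c_pct is not None and 5.7 <= hba1c_pct <= 6.4
    ifg = (fasting_glucose_mg_dl is not None
           and 100 <= fasting_glucose_mg_dl <= 125)
    igt = (ogtt_2h_glucose_mg_dl is not None
           and 140 <= ogtt_2h_glucose_mg_dl <= 199)
    return int(bool(has_prediabetes_dx) or elevated_hba1c or ifg or igt)


print(prediabetes_target(hba1c_pct=5.9))
print(prediabetes_target(fasting_glucose_mg_dl=95))
print(prediabetes_target())
```

Note that a member with no diagnosis and no laboratory results is coded 0, which is why the target identifies only documented prediabetes.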

Independent variables (input variables)

The independent variables (input variables) included mainly patient demographics, clinical history, health care utilization, costs, and physician characteristics. Patient demographic characteristics including age, sex, race, geographic region of residence, health plan benefit type, low-income subsidy status, and access to care proxies were obtained from the enrollment file during the baseline period. Baseline clinical characteristics included Deyo-Charlson comorbidity index and clinical conditions based on the Agency for Healthcare Research and Quality (AHRQ) Clinical Classification System Software (CCS) level II, and were identified from ICD-9-CM diagnosis categories.³⁰ Health care utilization and amount of medical services utilized were derived from Current Procedural Terminology and Healthcare Common Procedure Coding System procedure data based on AHRQ CCS Health Care Cost and Utilization Project (H-CUP).³¹ All-cause health care utilization included counts of emergency department (ED) visits, inpatient hospital visits, and outpatient visits, and was measured based on all medical claims and associated place of treatment codes from the baseline period. All-cause health care costs were calculated based on financial data associated with medical and pharmacy claims and adding plan-allowed and patient out-of-pocket costs from the baseline period. Provider characteristics such as provider specialty (primary care physician, internal medicine, endocrinology, other), provider sex, and type of practice (group, solo, other) were obtained from the last provider office visit during the baseline period. A total of 512 independent variables related to patient baseline and provider characteristics were assessed for inclusion in the predictive models (see online Supplementary Table S2). The independent variables were informed by subject matter expertise, and aggregated using a common data grouper (H-CUP) allowing the use of all data available and common to administrative claims.

Analytical plan

Data cleaning, manipulation, and variable creation were performed using SAS Enterprise Guide version 7.1, and the data adaptive prediction models and subsequent scoring of data were constructed with SAS Enterprise Miner version 14.1 (SAS Institute Inc., Cary, NC).³² A descriptive analysis was conducted for patient demographic characteristics, provider characteristics, all-cause health care utilization, and access to care proxies using a t test for continuous variables and chi-square test for categorical variables. The data adaptive models that were explored included decision tree, probability decision tree, binary logistic regression, neural network, and ensemble models (a combination of the aforementioned machine learning techniques), and were constructed using training and validation data sets separately for the dependent variables MS and prediabetes.
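To make the idea of an ensemble concrete, the sketch below combines the predicted probabilities of several models by simple averaging. The text does not state how the component models were combined, so averaging is an assumption chosen as one common approach; the function name and the example probabilities are likewise hypothetical.

```python
def ensemble_average(*model_probs):
    """Average per-individual predicted probabilities across models.

    Assumption: simple unweighted averaging; the study's combination rule
    is not described in the text.
    """
    if not model_probs:
        raise ValueError("at least one model is required")
    n = len(model_probs[0])
    if any(len(probs) != n for probs in model_probs):
        raise ValueError("all models must score the same individuals")
    # zip(*...) groups the predictions for each individual across models.
    return [sum(ps) / len(ps) for ps in zip(*model_probs)]


# Made-up probabilities for three individuals from three component models.
tree_probs = [0.2, 0.8, 0.6]
network_probs = [0.3, 0.7, 0.4]
regression_probs = [0.1, 0.9, 0.5]
combined = ensemble_average(tree_probs, network_probs, regression_probs)
print([round(p, 3) for p in combined])
```

Averaging tends to smooth out the idiosyncratic errors of any single model, which is the usual motivation for evaluating an ensemble alongside its components.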

For the purpose of the study, Cohort 1 was randomly and equally split into training and validation samples to develop a predictive model for the dependent variable of MS. This model was further used to score Cohort 2. The scored data were used to classify patients with MS with predicted probabilities at a 0.5 cutoff point. Cohort 2 was randomly and equally divided into training and validation samples in order to develop predictive models for the dependent variable prediabetes using the predicted MS along with other independent variables. The final model developed from Cohort 2 was applied to Cohort 3 to score the data and identify individuals most likely to have undiagnosed prediabetes using predicted probabilities at a cutoff point of 0.13 based on the observed proportion of primary outcome of prediabetes.³³
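The two cutoff points can be illustrated with a brief sketch. The function names are hypothetical, and treating a probability exactly equal to the cutoff as a positive classification is an assumption; the counts are those reported for Cohort 2.

```python
def prevalence_cutoff(n_cases, n_total):
    """Cutoff set at the observed proportion of the outcome, to 2 decimals."""
    return round(n_cases / n_total, 2)


def classify(probabilities, cutoff):
    """Flag individuals whose predicted probability reaches the cutoff.

    Assumption: a probability equal to the cutoff counts as positive.
    """
    return [int(p >= cutoff) for p in probabilities]


# 37,503 of 279,903 Cohort 2 members had documented prediabetes.
cutoff = prevalence_cutoff(37_503, 279_903)
print(cutoff)

# Made-up predicted probabilities for four individuals.
print(classify([0.05, 0.13, 0.40, 0.68], cutoff))
```

When an outcome is uncommon, a 0.5 cutoff would classify very few individuals as positive; setting the cutoff near the observed prevalence (37,503/279,903 ≈ 0.134) trades specificity for sensitivity.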

Results

Eligible individuals with an MAPD plan (N = 839,709) were randomly assigned to 3 equally sized, mutually exclusive cohorts (N = 279,903). After excluding MAPD patients with prediabetes from the baseline and identification periods, the remaining Cohort 1 (N = 231,867) had 112,108 (48%) study members diagnosed with MS. Cohort 2 had 37,503 (13%) individuals diagnosed with prediabetes. Table 1 displays the significant difference between those with and without MS in terms of demographics, provider, baseline health care utilization (inpatient, ED, and outpatient visits), and access to care. Deyo-Charlson comorbidity index and the top 10 clinical condition categories based on variable importance factor significantly differed between individuals with and without MS (Table 2). Patients identified with MS had a significantly higher average Deyo-Charlson comorbidity index (0.51 vs. 0.34), and a greater proportion had coronary atherosclerosis and other heart diseases, cardiac dysrhythmias, nutritional, endocrine, and metabolic disorders, thyroid disorders, heart valve disorders, and esophageal disorders (Table 2).

Table 1.

Descriptive Characteristics of Study Cohorts Used to Develop Predictive Models for Metabolic Syndrome and Prediabetes

	Cohort 1 (n = 231,867) MS predictive model		Cohort 2 (n = 279,903) prediabetes predictive model
Characteristics	Metabolic syndrome (n = 112,108) n (%)	No metabolic syndrome (n = 119,759) n (%)	Prediabetes (n = 37,503) n (%)	No prediabetes (n = 242,400) n (%)
Age, Mean (SD)	74.12 (6.07)	73.51 (6.01)	73.31 (5.70)	73.80 (6.03)
Sex
Female	66,797 (59.58)	68,823 (57.47)	21,363 (56.96)	141,200 (58.25)
Male	45,311 (40.42)	50,936 (42.53)	16,140 (43.04)	101,200 (41.75)
Region
Northeast	2368 (2.11)	2553 (2.13)	634 (1.69)	5023 (2.07)
Midwest	33,002 (29.44)	38,499 (32.15)	9634 (25.69)	75,135 (31.00)
South	68,073 (60.72)	65,614 (54.79)	22,591 (60.24)	139,724 (57.64)
West	8665 (7.73)	13,093 (10.93)	4644 (12.38)	22,518 (9.29)
Race
White	95,608 (85.28)	103,607 (86.51)	31,017 (82.71)	207,862 (85.75)
Black	9211 (8.22)	7912 (6.61)	2969 (7.92)	18,135 (7.48)
Hispanic	740 (0.66)	834 (0.70)	546 (1.46)	1740 (0.72)
Other	6549 (5.84)	7406 (6.18)	2971 (7.92)	14,663 (6.05)
Health Plan
HMO	38,428 (34.28)	39,566 (33.04)	16,450 (43.86)	82,252 (33.93)
PPO	64,013 (57.1)	69,796 (58.28)	18,033 (48.08)	139,194 (57.42)
POS	784 (0.70)	919 (0.77)	307 (0.82)	1661 (0.69)
Other	8883 (7.92)	9478 (7.91)	2713 (7.23)	19,293 (7.96)
Low-income Status	9935 (8.86)	8938 (7.46)	3366 (8.98)	20,117 (8.30)
Provider Specialty
PCP	21,631 (19.26)	20,126 (16.81)	7522 (20.06)	43,723 (18.04)
Internal Medicine	24,893 (22.20)	17,821 (14.88)	8532 (22.75)	44,776 (18.47)
Endocrinology	323 (0.29)	355 (0.30)	162 (0.43)	762 (0.31)
Other	67,288 (60.02)	83,135 (69.42)	22,555 (60.14)	157,185 (64.85)
Type of Practice
Group	54,327 (48.46)	61,988 (51.76)	16,878 (45.00)	120,503 (49.71)
Solo	12,285 (10.96)	11,597 (9.68)	3922 (10.46)	24,702 (10.19)
Other	45,496 (40.58)	46,174 (38.56)	16,703 (44.54)	97,195 (40.10)
Access to Care
Urban	64,900 (57.89)	69,081 (57.68)	24,615 (65.63)	140,709 (58.05)
Suburban	31,717 (28.29)	32,834 (27.42)	9229 (24.61)	67,282 (27.76)
Rural	14,802 (13.20)	17,046 (14.23)	3287 (8.76)	32,874 (13.56)
Prior Use of Medical Services
Inpatient Visit	15,580 (13.9)	10,817 (9.03)	4415 (11.77)	28,285 (11.67)
ED visit	21,641 (19.3)	17,174 (14.34)	6555 (17.48)	41,114 (16.96)
Outpatient Visit	110,303 (98.39)	103,274 (86.23)	36,869 (98.31)	224,168 (92.48)

ED, emergency department; HMO, health maintenance organization; MS, metabolic syndrome; PCP, primary care physician; POS, point of service; PPO, preferred provider organization; SD, standard deviation.

Table 2.

Deyo-Charlson Comorbidity Index and Top 10 Clinical Classification Software Categories of Study Cohorts Used to Develop Predictive Models for Metabolic Syndrome and Prediabetes

Cohort 1 (n = 231,867) for MS predictive model
	Metabolic syndrome (n = 112,108)	No metabolic syndrome (n = 119,759)
Condition	N (%)	N (%)
Deyo-Charlson Comorbidity Index, Mean (SD)	0.51 (1.04)	0.34 (0.92)
Coronary atherosclerosis and other heart disease	20,233 (18.05)	5237 (4.37)
Cardiac dysrhythmias	15,854 (14.14)	8439 (7.05)
Other aftercare	23,314 (20.80)	14,171 (11.83)
Other nutritional, endocrine, and metabolic disorders	12,380 (11.04)	6384 (5.33)
Thyroid disorders	18,901 (16.86)	13,091 (10.93)
Heart valve disorders	7249 (6.47)	3449 (2.88)
Esophageal disorders	15,848 (14.14)	9743 (8.14)
Other and ill-defined heart disease	3018 (2.69)	1196 (1.00)
Peripheral and visceral atherosclerosis	6534 (5.83)	3129 (2.61)
Occlusion or stenosis of pre-cerebral arteries	3897 (3.48)	1300 (1.09)

Cohort 2 (n = 279,903) prediabetes predictive model
	Prediabetes (n = 37,503)	No prediabetes (n = 242,400)
Clinical conditions	n (%)	n (%)
Deyo-Charlson comorbidity Index, Mean (SD)	0.49 (1.02)	0.43 (0.99)
Disorders of lipid metabolism	22,095 (58.92)	97,163 (40.08)
Predicted metabolic syndrome	21,958 (58.55)	101,996 (42.08)
Essential hypertension	22,699 (60.53)	112,234 (46.3)
Other nutritional, endocrine, and metabolic disorders	5875 (15.67)	20,415 (8.42)
Peripheral and visceral atherosclerosis	2795 (7.45)	10,423 (4.3)
Nutritional deficiencies	3793 (10.11)	14,745 (6.08)
Hyperplasia of prostate	2697 (7.19)	12,300 (5.07)
Esophageal disorders	5386 (14.36)	27,380 (11.3)
Thyroid disorders	6458 (17.22)	33,667 (13.89)
Other nontraumatic joint disorders	5716 (15.24)	31,588 (13.03)

MS, metabolic syndrome; SD, standard deviation.

There were significant differences between individuals who had prediabetes and those with no evidence of prediabetes in terms of demographic, provider, access to care-related characteristics, and baseline ED and outpatient visits in Cohort 2 (Table 1). In addition, the Deyo-Charlson comorbidity index and top 10 clinical condition categories were significantly different between individuals with and without prediabetes (Table 2). Patients identified with prediabetes had a significantly higher prevalence of disorders of lipid metabolism, predicted MS, essential hypertension, other nutritional, endocrine, and metabolic disorders, peripheral and visceral atherosclerosis, and nutritional deficiencies (Table 2).

The model fit and performance statistics for predictive models constructed for MS and prediabetes are displayed in Table 3. In the validation sample, the ensemble model developed by combining the decision tree, probability tree, neural network, and regression models demonstrated the best discrimination (c-statistic, 0.828; misclassification rate, 0.242; and average square error, 0.164), maximized the sensitivity, and lowered the false positive rate of predicting MS (Table 3). The ensemble model developed from Cohort 1 was used to score Cohort 2 data and 123,954 (44%) individuals were classified as having MS with predicted probabilities >0.50. Predicted MS was one of the contributing factors based on variable importance level and Gini coefficient split statistic (Table 2). In the validation sample, the logistic regression model outperformed the decision tree, probability tree, neural network, and ensemble model based on the model fit statistics, and further maximized sensitivity and lowered the false positive rate in comparison to other models while predicting prediabetes (Table 3). Cohort 3 was scored based on the predictive model developed for prediabetes from Cohort 2, which identified 122,849 (43.89%) patients likely to have prediabetes undocumented in administrative claims, based on predicted probabilities between 0.13 and 0.68.

Table 3.

Model Fit Statistics and Performance of Predictive Models from Validation Sample

Predictive models for metabolic syndrome based on validation sample of cohort 1
	Fit statistics		Performance of models
Models	MCR	ASE	c-statistics	Sensitivity	Specificity	FP rate	FN rate
Ensemble^*	0.242	0.164	0.828	0.674	0.837	0.163	0.326
Decision Tree	0.242	0.167	0.815	0.671	0.838	0.162	0.329
Neural Network	0.242	0.164	0.829	0.670	0.839	0.161	0.330
Probability Tree	0.242	0.165	0.822	0.675	0.833	0.167	0.325
Regression	0.259	0.170	0.825	0.742	0.739	0.261	0.258

Predictive models for prediabetes based on validation sample of cohort 2
Models	MCR	ASE	c-statistics	Sensitivity	Specificity	FP rate	FN rate
Regression^*	0.134	0.110	0.670	0.669	0.571	0.429	0.330
Ensemble	0.134	0.111	0.670	0.672	0.560	0.440	0.328
Neural Network	0.134	0.110	0.670	0.672	0.569	0.431	0.328
Decision Tree	0.134	0.114	0.600	0.670	0.571	0.429	0.330
Probability Tree	0.134	0.112	0.650	0.704	0.510	0.490	0.296

ASE, average square error; FN, false negative; FP, false positive; MCR, misclassification rate.

*Indicates final selected model based on fit statistics and model performance.
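For readers less familiar with the statistics reported in Table 3, the sketch below computes them from observed outcomes and predicted probabilities. It uses standard textbook definitions, which may differ in detail from the SAS Enterprise Miner output; the function names and the toy data are hypothetical.

```python
def c_statistic(y_true, y_prob):
    """Share of (case, non-case) pairs in which the case has the higher
    predicted probability; ties count as one half."""
    cases = [p for y, p in zip(y_true, y_prob) if y == 1]
    non_cases = [p for y, p in zip(y_true, y_prob) if y == 0]
    score = sum(1.0 if pc > pn else 0.5 if pc == pn else 0.0
                for pc in cases for pn in non_cases)
    return score / (len(cases) * len(non_cases))


def fit_statistics(y_true, y_prob, cutoff=0.5):
    """Return the Table 3 style statistics for one model."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        predicted = int(p >= cutoff)
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    n = len(y_true)
    return {
        "MCR": (fp + fn) / n,  # misclassification rate
        "ASE": sum((y - p) ** 2 for y, p in zip(y_true, y_prob)) / n,
        "c-statistic": c_statistic(y_true, y_prob),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "FP rate": fp / (fp + tn),
        "FN rate": fn / (fn + tp),
    }


# Toy validation sample: 4 cases and 4 non-cases with made-up probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob = [0.9, 0.7, 0.4, 0.6, 0.2, 0.1, 0.3, 0.8]
stats = fit_statistics(y_true, y_prob)
for name, value in stats.items():
    print(name, round(value, 4))
```

Under these definitions the FP rate is the complement of specificity and the FN rate is the complement of sensitivity, which agrees with Table 3 (for the ensemble MS model, 1 − 0.837 = 0.163 and 1 − 0.674 = 0.326).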

Table 4 presents the characteristics of Cohort 3 by predicted prediabetes status. Notably, the top 10 clinical condition categories were highly prevalent among individuals with predicted prediabetes in comparison to those without predicted prediabetes. The prevalence of disorders of lipid metabolism, predicted MS, essential hypertension, peripheral and visceral atherosclerosis, and other nutritional, endocrine, and metabolic disorders was significantly higher among MAPD individuals with predicted prediabetes in comparison to those without predicted prediabetes (P < .0001 for all; Table 4).

Table 4.

Characteristics of Study Cohort 3 Used to Score for Predicted Prediabetes

	Cohort 3 (n = 279,903)
Characteristics	Predicted prediabetes (n = 122,849) n (%)	No predicted prediabetes (n = 157,054) n (%)	P value
Age, Mean (SD)	72.97 (5.54)	74.34 (6.27)	<.0001
Sex
Female	67,940 (55.3)	94,778 (60.35)	<.0001
Male	54,909 (44.7)	62,276 (39.65)
Region
Northeast	2148 (1.75)	3625 (2.31)	<.0001
Midwest	29,120 (23.7)	55,470 (35.32)
South	75,032 (61.08)	87,796 (55.90)
West	16,549 (13.47)	10,163 (6.47)
Race
White	100,664 (81.94)	138,249 (88.03)	<.0001
Black	10,317 (8.40)	10,757 (6.85)
Hispanic	1700 (1.38)	593 (0.38)
Other	10,168 (8.28)	7455 (4.75)
Health Plan
HMO	59,730 (48.62)	38,739 (24.67)	<.0001
PPO	54,527 (44.39)	103,108 (65.65)
POS	1105 (0.90)	790 (0.50)
Other	7487 (6.09)	14,417 (9.18)
Low-income Status	10,934 (8.90)	12,488 (7.95)	<.0001
Provider Specialty
PCP	25,263 (20.56)	26,229 (16.70)	<.0001
Internal Medicine	30,750 (25.03)	22,906 (14.58)
Endocrinology	529 (0.43)	372 (0.24)
Other	70,534 (57.42)	108,708 (69.22)
Type of Practice
Group	55,232 (44.96)	82,644 (52.62)	<.0001
Solo	12,356 (10.06)	15,989 (10.18)
Other	55,261 (44.98)	58,421 (37.20)
Access to Care
Urban	85,552 (69.64)	79,798 (50.81)	<.0001
Suburban	29,309 (23.86)	47,166 (30.03)
Rural	6757 (5.50)	29,435 (18.74)
Unknown	1231 (1.00)	655 (0.42)
Prior Use of Medical Services
Inpatient Visit	14,945 (12.17)	18,119 (11.54)	<.0001
ED visit	21,121 (17.19)	26,887 (17.12)	0.6108
Outpatient Visit	122,839 (99.99)	138,280 (88.05)	<.0001
Top 10 Clinical Conditions
Deyo-Charlson comorbidity Index, Mean (SD)	0.55 (1.09)	0.35 (0.92)	<.0001
Disorders of lipid metabolism	93,816 (76.37)	25,194 (16.04)	<.0001
Predicted metabolic syndrome	88,490 (72.03)	34,691 (22.09)	<.0001
Essential hypertension	87,330 (71.09)	47,035 (29.95)	<.0001
Other nutritional, endocrine, and metabolic disorders	21,849 (17.79)	4721 (3.01)	<.0001
Peripheral and visceral atherosclerosis	10,277 (8.37)	2973 (1.89)	<.0001
Nutritional deficiencies	14,752 (12.01)	3595 (2.29)	<.0001
Hyperplasia of prostate	10,121 (8.24)	4898 (3.12)	<.0001
Esophageal disorders	19,232 (15.65)	13,420 (8.54)	<.0001
Thyroid disorders	23,238 (18.92)	16,902 (10.76)	<.0001
Other non-traumatic joint disorders	19,656 (16.00)	17,539 (11.17)	<.0001

ED, emergency department; HMO, health maintenance organization; PCP, primary care physician; POS, point of service; PPO, preferred provider organization; SD, standard deviation.

Discussion

This study developed an MS risk factor scoring algorithm, and subsequently a prediabetes risk scoring algorithm, to identify individuals at high risk for type 2 diabetes during the prediabetic period among patients with an MAPD health plan. As laboratory and biometric data are scarce and often inconsistent within administrative databases, a 2-step approach was developed to enhance identification of patients with undiagnosed prediabetes. This study utilized all diagnoses, procedures, sociodemographic, laboratory (when available), and provider characteristics to the full extent while developing predictive models for MS and prediabetes. The ensemble model demonstrated good discrimination, with a c-statistic of 0.83, in predicting MS compared with the regression, neural network, and decision tree models. Predicted MS was identified in 44% of individuals in scored data, and was found to be the third most important contributing factor in predicting prediabetes based on variable importance and Gini statistics.

The regression model demonstrated good discrimination, with a c-statistic of 0.67, as well as high sensitivity and a low false positive rate in comparison to the decision tree, neural network, and ensemble models in predicting prediabetes. The predictive model in this study identified approximately 44% with prediabetes in a cohort of MAPD individuals aged 65–89 years without a diagnosis of diabetes, whereas only 13% were identified based on specific diagnosis and laboratory criteria from the claims data. Clinical condition categories including disorders of lipid metabolism, hypertension, metabolic disorders, nutritional disorders, and deficiencies were highly associated with diabetes and were more prevalent in patients predicted to have prediabetes than in patients not predicted to have prediabetes.

Cowie et al⁴ estimated the nationwide prevalence of prediabetes as defined by IFG or IGT to be approximately 40.8% among individuals aged 65 years and older using National Health and Nutrition Examination Survey data from 2005–2006. The current study identified 44% of MAPD individuals with prediabetes, which is similar to national estimates. Furthermore, Meng et al¹⁷ compared logistic regression, neural network, and decision tree models for predicting diabetes or prediabetes using 12 self-reported risk factors in a Chinese population and evaluated each model for its accuracy, sensitivity, and specificity. They found the decision tree model performed better than other models with a classification accuracy of 77.87%, a sensitivity of 80.68%, and a specificity of 75.13%. However, Meng et al did not construct an ensemble model by combining the different models. Furthermore, the study by Meng et al used only limited self-reported input variables, whereas the current study included 512 input variables to predict prediabetes.

This study included only individuals with an MAPD plan, which is a limitation. Unlike beneficiaries of original Medicare, individuals enrolled in an MAPD plan elect to have their entitlement benefit managed by a private health organization. MAPD plans are required to include all the benefits of traditional Medicare with additional value-added benefits (eg, vision, dental, health education, fitness, post-discharge transition planning, enhanced disease management, alternative therapies, transportation). Although this is a limitation to the findings, it is important to note that in 2018 the Centers for Medicare & Medicaid Services expanded its Medicare Diabetes Prevention Program (MDPP) nationally, making the MDPP part of the core benefit that must be offered by any managed care organization (ie, MAPD, or other private fee-for-service providers). The purpose of the MDPP is to prevent the onset of type 2 diabetes through lifestyle management focused on diet and physical exercise with a primary end point of 5% weight loss.³⁴ For managed care organizations, and from a public health perspective, it is important to identify those individuals who would benefit most from this type of personalized preventive care planning. The current study used a 2-stage adaptive method to improve the accuracy of identifying these at-risk individuals.

This study has additional limitations that are common to administrative claims data and attributable to the absence of health behavior-related information, errors in claims coding, scarcity of highly prognostic measures, and the potential influence of unmeasured confounding variables. In addition, laboratory information used to define MS and prediabetes was limited to patients who had these data available.

Conclusions

Ensemble methods and logistic regression models demonstrated good performance in predicting MS and undiagnosed prediabetes, respectively, among individuals with an MAPD plan based on demographic, clinical, and health care utilization information obtained from administrative claims. A 2-step approach ensured that each subject had a metabolic risk factor score, as this constellation of conditions is highly associated with disease progression. This potentially increased the predictive accuracy of identifying individuals most likely to have undiagnosed prediabetes, which is ideal for finding candidates for engagement in a diabetes prevention program.

Footnotes

Author Disclosure Statement

At the time of writing, Dr. Kamble, Ms. Collins, and Mr. Harvey were or are employees of Comprehensive Health Insights Inc., a subsidiary of Humana Inc., which received funding from Novo Nordisk Inc. to conduct this study. Dr. Allen is an employee of Novo Nordisk, Inc., and owns stock in the company. Drs. Kimball and Deluzio, and Mr. Bouchard were employees of Novo Nordisk, Inc. at the time of this study. Dr. Prewitt is an employee of Humana Inc. This research was funded by Novo Nordisk Inc., and conducted as part of the Novo Nordisk-Humana Research Collaboration.

References

1. Centers for Disease Control and Prevention. National Diabetes Statistics Report, 2014. https://www.cdc.gov/diabetes/pdfs/data/2014-report-estimates-of-diabetes-and-its-burden-in-the-united-states.pdf Accessed March 3, 2016.

2. American Diabetes Association. Economic costs of diabetes in the US in 2012. Diabetes Care, 2013; 36:1033–1046.

3. American Diabetes Association. Standards of medical care in diabetes—2015 abridged for primary care providers. Clin Diabetes, 2015; 33:97–111.

4. Cowie, Rust, Ford, et al. Full accounting of diabetes and pre-diabetes in the US population in 1988–1994 and 2005–2006. Diabetes Care, 2009; 32:287–294.

5. Nichols, Brown. Higher medical care costs accompany impaired fasting glucose. Diabetes Care, 2005; 28:2223–2229.

6. Gillies, Abrams, Lambert, et al. Pharmacological and lifestyle interventions to prevent or delay type 2 diabetes in people with impaired glucose tolerance: systematic review and meta-analysis. BMJ, 2007; 334:299.

7. Diabetes Prevention Program Research Group. The 10-year cost-effectiveness of lifestyle intervention or metformin for diabetes prevention. Diabetes Care, 2012; 35:723–730.

8. Centers for Disease Control and Prevention. Awareness of prediabetes—United States, 2005–2010. Morb Mortal Wkly Rep, 2013; 62:209.

9. Tabaei, Burke, Constance, et al. Community-based screening for diabetes in Michigan. Diabetes Care, 2003; 26:668–670.

10. Ginsberg, MacCallum. The obesity, metabolic syndrome, and type 2 diabetes mellitus pandemic: Part I. increased cardiovascular disease risk and the importance of atherogenic dyslipidemia in persons with the metabolic syndrome and type 2 diabetes mellitus. J Cardiometab Syndr, 2009; 4:113–119.

11. Welzel, Graubard, Zeuzem, El-Serag, Davila, McGlynn. Metabolic syndrome increases the risk of primary liver cancer in the United States: a study in the SEER-Medicare database. Hepatology, 2011; 54:463–471.

12. Ford, Giles, Dietz. Prevalence of the metabolic syndrome among US adults: findings from the third National Health and Nutrition Examination Survey. JAMA, 2002; 287:356–359.

13. Ford. Prevalence of the metabolic syndrome defined by the International Diabetes Federation among adults in the US. Diabetes Care, 2005; 28:2745–2749.

14. Abdul-Ghani, Abdul-Ghani, Müller, et al. Role of glycated hemoglobin in the prediction of future risk of T2DM. J Clin Endocrinol Metab, 2011; 96:2596–2600.

15. Lindström, Tuomilehto. The diabetes risk score. Diabetes Care, 2003; 26:725–731.

16. Gray, Barros, Raposo, Khunti, Davies, Santos. The development and validation of the Portuguese risk score for detecting type 2 diabetes and impaired fasting glucose. Prim Care Diabetes, 2013; 7:11–18.

17. Meng X-H, Huang Y-X, Rao D-P, Zhang, Liu. Comparison of three data mining models for predicting diabetes or prediabetes by risk factors. Kaohsiung J Med Sci, 2013; 29:93–99.

18. Noble, Mathur, Dent, Meads, Greenhalgh. Risk models and scores for type 2 diabetes: systematic review. BMJ, 2011; 343:d7163.

19. Abbasi, Peelen, Corpeleijn, et al. Prediction models for risk of developing type 2 diabetes: systematic literature search and independent external validation study. BMJ, 2012; 345:e5900.

20. Shmueli. To explain or to predict? Stat Sci, 2010; 25:289–310.

21. Bellazzi, Zupan. Predictive data mining in clinical medicine: current issues and guidelines. Int J Med Inform, 2008; 77:81–97.

22. …, Walker, Kadam. Predicting breast cancer survivability: a comparison of three data mining methods. Artif Intell Med, 2005; 34:113–127.

23. Aslan, Bozdemir, Sahin, Ogulata. Can neural network able to estimate the prognosis of epilepsy patients according to risk factors? J Med Syst, 2010; 34:541–550.

24. Chen H-Y, Chuang C-H, Yang Y-J, Wu T-P. Exploring the risk factors of preterm birth using data mining. Expert Syst Appl, 2011; 38:5384–5387.

25. Chang C-D, Wang C-C, Jiang. Using data mining techniques for multi-diseases prediction modeling of hypertension and hyperlipidemia by common risk factors. Expert Syst Appl, 2011; 38:5507–5513.

26. Simidjievski, Todorovski, Džeroski. Modeling dynamic systems with efficient ensembles of process-based models. PLoS One, 2016; 11:e0153507.

27. Kvancz, Sredzinski, Tadlock. Predictive analytics: a case study in machine-learning and claims databases. Am J Pharm Benefits, 2016; 8:214–219.

28. Steinberg, Scott, Honcz, Spettell, Pradhan. Reducing metabolic syndrome risk using a personalized wellness program. J Occup Environ Med, 2015; 57:1269–1274.

29. American Diabetes Association. Standards of medical care in diabetes—2015 abridged for primary care providers. Diabetes Care, 2015; 38(suppl 1):S1–S94.

30. Quan, Sundararajan, Halfon, et al. Coding algorithms for defining comorbidities in ICD-9-CM and ICD-10 administrative data. Med Care, 2005; 43:1130–1139.

31. Agency for Healthcare Research and Quality. Healthcare Cost Utilization Project Clinical Classifications Software (CCS). www.hcup-us.ahrq.gov/toolssoftware/ccs/ccs.jsp Accessed December 13, 2015.

32. SAS Institute, Inc. SAS® Enterprise Guide® 7.1. Cary, NC: SAS Institute, Inc., 2014.

33. Shah. Use of cutoff and SAS code node in SAS® Enterprise Miner to determine appropriate probability cutoff point for decision making with binary target models. https://support.sas.com/resources/papers/proceedings12/127-2012.pdf Accessed March 6, 2018.

34. Centers for Medicare & Medicaid Services. Medicare diabetes prevention program (MDPP) expanded model. 2017. https://innovation.cms.gov/initiatives/medicare-diabetes-prevention-program/ Accessed January 21, 2016.
