Abstract
Disease classification is a crucial element of biomedical research. Recent studies have demonstrated that machine learning techniques, such as Support Vector Machine (SVM) modeling, achieve similar or better predictive performance than the traditional method of Logistic Regression. In addition, social network metrics have been found to provide useful predictive information for disease modeling. In this study, we combine simulated social network metrics with SVM to predict diabetes in a sample of data from the Behavioral Risk Factor Surveillance System (BRFSS). In this dataset, Logistic Regression outperformed SVM, with a validation ROC index of 81.8 and 81.7 for models with and without graph metrics, respectively. SVM with a polynomial kernel had a validation ROC index of 72.9 and 75.6 for models with and without graph metrics, respectively. Although SVM did not perform as well as Logistic Regression, the results are consistent with previous studies that used SVM to classify diabetes.
Introduction
Disease classification is a crucial element of biomedical research. Improved disease classification models aim to provide accurate and timely prediction to allow for earlier diagnosis and implementation of preventative measures. For example, in the US, approximately 29.1 million people (9.3%) are affected by diabetes, with 1/3 unaware of their disease status, and 57 million have pre-diabetes (American Diabetes Association, 2012). Diabetes and pre-diabetes are known to increase the risk of heart disease and stroke (American Diabetes Association, 2012), but these long-term effects can be prevented with lifestyle changes and/or medical intervention (Pi-Sunyar, 2007). Early screening and predictive risk models built with simple clinical measurements (no lab tests required) are important for the deployment of prevention strategies, especially in undiagnosed populations (Heikes et al., 2008).
Traditionally, biomedical data are modeled using Logistic Regression, a method that relies on fitting data to a pre-determined model. Alternatively, the Support Vector Machine (SVM) algorithm is a supervised machine learning method that is “model-free”: it is a non-probabilistic classifier that, given a labeled dataset, assigns new examples to one class or the other without requiring a probability distribution. In SVM, each data point is represented as an n-dimensional vector, and the algorithm constructs an (n-1)-dimensional separating hyperplane to discriminate between 2 classes. The hyperplane is chosen to maximize the margin (defined by the support vectors) while minimizing a penalty for points that fall on the wrong side of the margin. Non-linear functions, called kernels, can also be used to map the data into a higher-dimensional space. Previous research demonstrates that SVM has similar or improved predictive capabilities for disease classification in comparison to Logistic Regression (Yu et al., 2010).
In addition, it has been found that graph theory metrics provide useful information for the disease classification problem. Studies classifying diseases such as Alzheimer’s (Khazaee et al., 2014) and Multiple Sclerosis (Kocevar et al., 2016) combined graph theory with machine learning methods, such as SVM, for improved prediction. I have not found previous research on the application of graph theory metrics to demographic and behavioral data for the prediction of disease.
This project aims to assess the application of SVM for classification of diabetes in a sample of people in Georgia, and apply graph theory metrics as potential predictors of disease in the model. This paper includes: Section 2 description of the dataset, Section 3.1 overview of SVM application to disease classification, Section 3.2 overview of graph theory application to disease classification, Section 3.3 overview of nonlinear SVM with kernels, Section 4.1 social network simulation, Section 4.2 SVM algorithm, Section 4.3 logistic regression algorithm, Section 5 results, and Section 6 discussion.
Dataset details
Data from 2015 were obtained from Georgia’s Behavioral Risk Factor Surveillance System (BRFSS) (Centers for Disease Control and Prevention, 2014), a landline- and cellphone-based survey conducted by the Centers for Disease Control and Prevention (CDC). BRFSS includes over 300 variables covering health behaviors and chronic conditions. A binary outcome variable was defined based on survey respondents reporting that their physician had informed them they had diabetes or pre-diabetes. Variables related to survey design and unrelated to disease outcome were dropped, as were variables with a missing, refused, or unknown response rate greater than 30%. Variables known to be redundant were also dropped to avoid unnecessary collinearity. Categorical variables with a response of 9 (missing/refused) in less than 30% of observations were kept as potential predictors, with 9 treated as a valid response category. After missing values of continuous variables were imputed with the median, the analysis dataset included 3,465 observations with 100 potential predictors of interest, of which 616 observations (17.8%) were classified as having diabetes or pre-diabetes. For this study, I selected potential predictor variables based on literature review and the known conceptual model (Okwechime & Roberson, 2015).
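The filtering and imputation steps described above can be sketched in Python with pandas. This is a minimal illustration, not the code used in the study: the function name `prepare_predictors`, the toy column names, and the assumption that refused or unknown responses on continuous variables have already been recoded to missing values are all hypothetical.

```python
import numpy as np
import pandas as pd

def prepare_predictors(df, continuous_cols, max_missing=0.30):
    """Drop columns that are too sparse, then median-impute continuous ones."""
    # Share of missing values in each column
    missing_share = df.isna().mean()
    # Keep only columns whose missing share does not exceed the cutoff
    kept = df.loc[:, missing_share <= max_missing].copy()
    # Median imputation for the continuous columns that survived the filter
    for col in continuous_cols:
        if col in kept.columns:
            kept[col] = kept[col].fillna(kept[col].median())
    return kept

# Hypothetical toy data; the column names are illustrative only
toy = pd.DataFrame({
    "bmi": [22.0, np.nan, 31.0, 27.0, 25.0],
    "age": [34, 51, 62, 45, 29],
    "rarely_answered": [np.nan, np.nan, np.nan, 1.0, 0.0],
})
clean = prepare_predictors(toy, continuous_cols=["bmi", "age"])
```

In the toy data, `rarely_answered` is 60% missing and is dropped, while the single missing `bmi` value is replaced by the median of the observed values.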
Literature review
SVM application to disease classification
Yu et al. (2014) used the National Health and Nutrition Examination Survey (NHANES), an ongoing, cross-sectional probability sample of the US population, to build SVM and Logistic Regression classification models for 2 classification schemes: (I) persons with diabetes (diagnosed or undiagnosed) vs. persons without diabetes, and (II) persons with undiagnosed diabetes or pre-diabetes vs. persons without diabetes. They used 14 potential predictors commonly associated with diabetes: family history, age, gender, race and ethnicity, weight, height, waist circumference, BMI, hypertension, physical activity, smoking, alcohol use, education, and household income (Yu et al., 2014). They found that the Radial Basis Function (RBF) kernel and the Linear kernel worked best for classification schemes I and II, respectively, and that there was no significant difference between Logistic Regression and SVM performance (validation AUC 0.83 and 0.73 for classification schemes I and II, respectively, with both models) (Yu et al., 2014).
Kumari and Chitra (2013) also found success with the SVM model for classification of diabetes in the Pima Indians Diabetes Dataset from the UCI Machine Learning Repository. In this case, an 8-predictor SVM model that included lab data (plasma glucose concentration, 2-hr serum insulin) was validated with 78% accuracy using the RBF kernel.
SVM has been used across diverse biomedical classification problems. This includes a patient financial risk model using health claims and clinical encounter data, and a patient response to flu awareness campaign model, both using weighted SVM (Razzaghi et al., 2016). A project comparing various machine learning techniques with Logistic Regression for prediction of heart disease also shows no significant difference between Logistic Regression and SVM, with the Linear kernel performing best (Khanna et al., 2015).
Graph theory application to disease classification
Kocevar et al. (2016) combined graph metrics with SVM to classify various Multiple Sclerosis (MS) clinical profiles. Cortical and sub-cortical gray matter (GM) segmentation was performed on the advanced MRI scans of 77 MS patients and 26 healthy controls (HC). The segmented regions form the nodes, and anatomically constrained probabilistic streamline tractography is used to create edges between the segments. Edge weights are determined by a function of the number of fibers connecting the segments. The weakest connections are removed by applying a threshold 0
The study found that global graph metrics were not significantly dependent on patients’ age or gender. Overall, significant differences in graph metrics were found when comparing MS patients with HC groups, as well as between different clinical classifications of MS. SVM classification with an RBF kernel was then used to predict varying binary classifications of HC groups and clinical courses. The highest classification accuracy, 91.8%, was achieved using all graph metrics as a feature vector in the model. Using only one graph metric, the best in this case being modularity, the study achieved a validation accuracy of 88.9%.
Khazaee et al. (2014) found that changes in brain connections from functional magnetic resonance imaging (fMRI) provided strong predictive measures for distinguishing Alzheimer’s Disease (AD) patients from healthy controls (HC). Twenty patients with AD and 20 age-matched HC from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database were selected for study. The images were parcellated into 90 regions, and edges were defined as the connectivity of all pairs of regions using Pearson’s correlation coefficient. As in the previous study, thresholding was used to maintain the strongest connections in the network. Preserving a high proportion of the network results in a dense graph in which noisy and less significant edges are maintained. However, removing too many edges can result in a disconnected graph for which global graph metrics cannot be calculated. Drawing on their previous research, the authors found that a threshold of 12% was optimal (Khazaee et al., 2014). The study maintained the bridge edges between any disconnected sections that resulted from thresholding, regardless of the edge weight.
Graph metrics calculated for this project included: functional segregation via clustering coefficient, local efficiency, and normalized local efficiency to measure specialized processing within densely interconnected groups of regions; functional integration via characteristic path length and global efficiency to assess ability of the brain to rapidly combine specialized information from distributed regions; and 3 local measures including degree, participation coefficient, and betweenness centrality to measure properties of the 90 regions. An iterative feature selection algorithm using 7 different methods was then used to filter the most effective graph features for the classification problem. Linear SVM with a tuned C parameter using leave-one-out cross validation was used to perform the final feature classification.
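To make three of the metrics named above concrete (degree, clustering coefficient, and characteristic path length), the following is a minimal pure-Python sketch on an unweighted, undirected toy graph. The function names and the toy graph are illustrative assumptions; the cited study worked with weighted brain networks and dedicated software.

```python
from collections import deque

def degree(adj, v):
    """Number of edges attached to node v."""
    return len(adj[v])

def clustering_coefficient(adj, v):
    """Share of pairs of v's neighbours that are themselves connected."""
    nbrs = list(adj[v])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return 2.0 * links / (k * (k - 1))

def shortest_path_lengths(adj, source):
    """Breadth-first search distances from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def characteristic_path_length(adj):
    """Mean shortest-path length over all ordered pairs of connected nodes."""
    total, pairs = 0, 0
    for v in adj:
        dist = shortest_path_lengths(adj, v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

# Toy graph: a triangle 0-1-2 with a pendant node 3 attached to node 2
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
```

Node 0 sits inside the triangle, so both of its neighbours are linked to each other; node 2 also touches the pendant node, which lowers its clustering coefficient.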
The results showed that the Fisher score was the best feature selection method for the discriminative algorithm. The best algorithm could distinguish AD patients from the HC group with a validation accuracy of up to 97.5%.
Nonlinear SVM with kernels
Zhang (2007) found that, in addition to the widely used SVM kernel functions for mapping the feature space (linear, RBF, polynomial), a tuned kernel function built from combinations of the common functions, or a kernel designed for feature selection, can provide improved model results. For feature selection, recursive methods can be used to repeatedly eliminate the predictors that rank last in model performance until optimization is achieved.
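The recursive elimination idea can be sketched with scikit-learn's `RFE` class wrapped around a linear-kernel SVM. This is an illustration under stated assumptions rather than the procedure from Zhang (2007) or from this study: the data are synthetic, and stopping at 5 predictors is an arbitrary choice standing in for "until optimization is achieved".

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Synthetic stand-in data (not BRFSS): 200 samples, 12 candidate predictors
X, y = make_classification(n_samples=200, n_features=12, n_informative=4,
                           random_state=0)

# A linear-kernel SVM exposes coef_, which RFE uses to rank the predictors.
# step=1 removes the single lowest-ranked predictor at each iteration.
selector = RFE(estimator=SVC(kernel="linear", C=1.0),
               n_features_to_select=5, step=1)
selector.fit(X, y)

# Column indices of the predictors that survived elimination
selected = [i for i, keep in enumerate(selector.support_) if keep]
```

`selector.ranking_` records the order of elimination, with rank 1 assigned to every retained predictor.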
For this study, feature selection was completed using SAS
To build the custom kernel based on feature selection, a dot product matrix was created to select the top 5 features based on conceptual knowledge, variable clustering, and performance in a stepwise selection logistic model. Features selected included: age, cholesterol, race, education, and sex.
Methods
The Python scikit-learn package was used to build the SVM models. Grid search and 5-fold cross validation were used to determine the best kernel and parameters based on the smallest mean square error. The data were split into 80% training and 20% validation datasets. The grid search compared the common kernels (Linear, Polynomial, and RBF) with the custom kernel based on feature selection, using 5-fold cross validation for each kernel to determine the best parameters: C, gamma, and degree. It assessed penalty parameter C values of 0.001, 0.01, 0.1, 1, and 10 for all kernels, gamma values of 0.001, 0.01, 0.1, and 1 for the RBF kernel, and degrees 2 and 3 for the polynomial kernel. The penalty parameter C controls overfitting by specifying how much misclassification is allowed. SAS
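A grid search of this shape can be sketched with scikit-learn's `GridSearchCV`. The sketch makes several assumptions: the data are synthetic rather than BRFSS, the custom kernel is omitted, and models are ranked by accuracy (for 0/1 labels, the mean square error of predicted labels equals 1 minus accuracy, so the ranking is the same).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the analysis dataset (assumption, not BRFSS)
X, y = make_classification(n_samples=300, n_features=10, weights=[0.8, 0.2],
                           random_state=0)

# 80% training / 20% validation split, stratified on the outcome
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Parameter values as listed in the text; one sub-grid per kernel
c_values = [0.001, 0.01, 0.1, 1, 10]
param_grid = [
    {"kernel": ["linear"], "C": c_values},
    {"kernel": ["rbf"], "C": c_values, "gamma": [0.001, 0.01, 0.1, 1]},
    {"kernel": ["poly"], "C": c_values, "degree": [2, 3]},
]

# 5-fold cross validation over every kernel/parameter combination
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_params = search.best_params_
valid_accuracy = search.score(X_valid, y_valid)
```

Splitting the grid into one dictionary per kernel avoids fitting combinations that have no effect, such as varying `degree` for the RBF kernel.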
Social network simulation
To test the application of graph theory metrics as potential predictors in a classification model, it was necessary to simulate a social network within the BRFSS dataset. The Watts-Strogatz small world network model was selected to represent the sample social network for this application due to its ability to simulate the interconnected groups (clusters) that exist in real-world networks, as well as the existence of random irregular connectivity patterns (Prettejohn et al., 2011). This model was first introduced by Duncan Watts and Steven Strogatz in Nature in 1998 (Liu et al., 2015).
Watts-Strogatz is a variation of the lattice network, in which nodes are connected only to their nearest neighbors. Watts-Strogatz randomly rewires some of the lattice edges, resulting in high clustering and short paths. This network is undirected. Ideally, the data studied would include some network characteristics, but I did not have access to a public use dataset that includes both demographic and network characteristics. For the purposes of testing the application of graph metrics to a predictive model, a simulated network will suffice. To incorporate an element of the demographic data into the social network simulation, I weighted each edge by the average standardized number of adults in the respondent’s household, assuming that a respondent with more household members would have more social network connections.
The algorithm for creating a Watts-Strogatz network starts with a lattice network where each node is adjacent to a defined
To randomly rewire the lattice network, each edge has a defined probability,
Figure 1. Watts-Strogatz simulated network
The R igraph package was used to create the Watts-Strogatz network using 1284 nodes,
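The two-step construction described above (a ring lattice followed by random rewiring) can be sketched in pure Python. This is not the R igraph code used in the study; the function name and the parameter names `k` (lattice neighbours per node) and `p` (rewiring probability) are assumptions, and the values in the usage line are hypothetical because the study's values are not recoverable from the text.

```python
import random

def watts_strogatz(n, k, p, seed=None):
    """Return an undirected small-world graph as {node: set(neighbours)}.

    n: number of nodes; k: even number of lattice neighbours per node;
    p: probability of rewiring each lattice edge.
    """
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    # Step 1: ring lattice, each node linked to k/2 neighbours on each side
    for v in range(n):
        for offset in range(1, k // 2 + 1):
            w = (v + offset) % n
            adj[v].add(w)
            adj[w].add(v)
    # Step 2: visit each lattice edge once and rewire it with probability p
    for offset in range(1, k // 2 + 1):
        for v in range(n):
            w = (v + offset) % n
            if rng.random() < p:
                # New endpoints must avoid self-loops and duplicate edges
                candidates = [u for u in range(n)
                              if u != v and u not in adj[v]]
                if candidates:
                    new_w = rng.choice(candidates)
                    adj[v].discard(w)
                    adj[w].discard(v)
                    adj[v].add(new_w)
                    adj[new_w].add(v)
    return adj

# Hypothetical small example; n, k and p are illustrative values only
g = watts_strogatz(n=20, k=4, p=0.1, seed=1)
```

Rewiring moves edges without adding or removing any, so the graph always keeps n*k/2 edges.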
SVM algorithm
SVM is a supervised learning algorithm that represents instances of data as points in space and then builds a model to assign new instances to one category or another. Each data point is represented as a
Data are represented as
where
Where
and
Figure 2 shows a maximum margin separation for linearly separable data. The samples that fall on the margin are known as the support vectors.
Figure 2. Maximum margin hyperplane (Kumari et al., 2013).
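For reference, the maximum-margin problem is commonly written as follows. This is the standard textbook formulation, offered as a sketch; the symbols w, b, m, and ξ are assumptions and may not match the notation originally used in this paper. The last expression is the soft-margin variant, in which a penalty parameter C weighs margin violations for data that are not linearly separable.

```latex
% Labeled training data: m points in n dimensions with labels -1 or +1
\{(\mathbf{x}_i, y_i)\}_{i=1}^{m}, \qquad
\mathbf{x}_i \in \mathbb{R}^{n}, \quad y_i \in \{-1, +1\}

% Separating hyperplane and the two margin boundaries
\mathbf{w}\cdot\mathbf{x} + b = 0, \qquad
\mathbf{w}\cdot\mathbf{x} + b = +1, \qquad
\mathbf{w}\cdot\mathbf{x} + b = -1

% Hard margin: the margin width is 2 / ||w||, so minimize ||w||
\min_{\mathbf{w},\, b}\ \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1, \quad i = 1, \dots, m

% Soft margin: slack variables xi_i, penalized by C
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}}\
\tfrac{1}{2}\lVert \mathbf{w} \rVert^{2} + C \sum_{i=1}^{m} \xi_i
\quad \text{subject to} \quad
y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0
```

A large C penalizes violations heavily and narrows the margin, while a small C tolerates more misclassified training points in exchange for a wider margin.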
For data that are not linearly separable, we can include a hinge loss function with a penalty parameter, ‘C’, that determines the trade-off between increasing the margin and whether an instance of
Figure 3. Kernel transformation of feature space (Khanna et al., 2015).
The most common kernel functions include:
Linear Kernel
Polynomial Kernel
RBF (Gaussian) Kernel
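In their usual textbook form, these three kernels can be written as below. The symbols c (offset), d (degree), and γ (width parameter) are assumptions about notation rather than a reconstruction of the original equations. Note that scikit-learn parameterizes the polynomial kernel slightly differently, as (γ x·z + coef0)^degree.

```latex
% Linear kernel
K(\mathbf{x}, \mathbf{z}) = \mathbf{x}\cdot\mathbf{z}

% Polynomial kernel with offset c and degree d
K(\mathbf{x}, \mathbf{z}) = (\mathbf{x}\cdot\mathbf{z} + c)^{d}

% RBF (Gaussian) kernel with width parameter gamma > 0
K(\mathbf{x}, \mathbf{z}) = \exp\!\left(-\gamma \lVert \mathbf{x} - \mathbf{z} \rVert^{2}\right)
```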
For example, polynomial kernel with degree 2 and offset
And would map to:
Thus, mapping the features from
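The equivalence between a degree-2 polynomial kernel and a dot product in a higher-dimensional feature space can be checked numerically. The sketch below assumes 2-dimensional inputs and an offset of c = 1, both chosen for illustration; the function names `poly_kernel` and `phi` are hypothetical.

```python
import math

def poly_kernel(x, z, c=1.0, degree=2):
    """Polynomial kernel (x . z + c) ** degree."""
    return (sum(a * b for a, b in zip(x, z)) + c) ** degree

def phi(x, c=1.0):
    """Explicit feature map for the degree-2 kernel on 2-dimensional input."""
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2,
            math.sqrt(2 * c) * x1, math.sqrt(2 * c) * x2, c)

x, z = (1.0, 2.0), (3.0, -1.0)

# Kernel evaluated directly in the original 2-dimensional space
k_direct = poly_kernel(x, z)

# Same quantity as an ordinary dot product in the 6-dimensional mapped space
k_mapped = sum(a * b for a, b in zip(phi(x), phi(z)))
```

Both routes give the same value, which is why the kernel lets SVM work in the 6-dimensional space without ever computing the mapped vectors.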
Logistic regression algorithm
Logistic Regression examines the non-linear relationship between a binary outcome and categorical or continuous predictor variables. The logistic model outputs the probability of an event, between 0 and 1, by modeling the log of the odds as a linear function of the predictors:
where
SAS
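The link between the linear predictor, the predicted probability, and an odds ratio can be illustrated with a few lines of Python. The coefficients below are hypothetical values for illustration and are not estimates from this study; the function names are likewise assumptions.

```python
import math

def logit(p):
    """Log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))

def predicted_probability(intercept, coefs, x):
    """Inverse logit of the linear predictor intercept + sum(beta_j * x_j)."""
    eta = intercept + sum(b * v for b, v in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-eta))

def odds_ratio(beta):
    """Multiplicative change in the odds for a one-unit increase in x_j."""
    return math.exp(beta)

# Hypothetical coefficients, not estimates from this study
intercept, coefs = -2.0, [0.5, -0.3]
p = predicted_probability(intercept, coefs, [1.0, 2.0])
```

Applying `logit` to `p` recovers the linear predictor, and a positive coefficient corresponds to an odds ratio above 1, which is how the increased and decreased odds reported in the results are read.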
Results
Table 1 shows the significant effects remaining in the logistic regression models, with and without the network characteristic metrics. Age, BMI, hypertension, and cholesterol are all associated with increased odds of the diabetes outcome, while education is associated with decreased odds. This is consistent with outcomes of previous research and known risk factors for diabetes. In the model with graph metrics included, closeness centrality is the only graph metric that remains significant, alongside the same demographic variables as in the previous model. If this were a real network in the dataset (not simulated), this would indicate that people with shorter total paths to other people in the network would have increased risk of diabetes.
Table 1. Odds ratio estimates and Wald confidence intervals
Table 2. Comparison of logistic regression and SVM results for data with and without simulated graph metrics
Table 2 shows the model performance results for models with and without the social network graph characteristics. The models were evaluated based on the sensitivity, specificity, and ROC index for the validation data set. The logistic model performs best for models with and without graph metrics included, and the SVM model with polynomial kernel is comparable but the ROC index is affected by lower sensitivity. Figure 4 provides a visual comparison of the area under the ROC curves for models including graph metrics. The ROC curves for models without the graph metrics are the same and are not included to avoid redundancy.
Figure 4. Comparison of ROC curves: Models including graph metrics.
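The three evaluation measures can be computed from first principles as sketched below. The labels and scores are hypothetical toy values, the 0.5 cutoff is an assumption, and the ROC index is returned on a 0 to 1 scale (multiply by 100 to compare with the values reported in this paper).

```python
def sensitivity_specificity(y_true, y_pred):
    """True positive rate and true negative rate from 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

def roc_index(y_true, scores):
    """Area under the ROC curve: the share of (case, non-case) pairs in
    which the case receives the higher score; ties count as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical validation labels and model scores
y_true = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

sens, spec = sensitivity_specificity(y_true, y_pred)
auc = roc_index(y_true, scores)
```

Sensitivity and specificity depend on the chosen cutoff, whereas the ROC index summarizes ranking quality across all cutoffs.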
Discussion
Support Vector Machines are important tools to consider for disease classification problems. While SVM did not perform as well as Logistic Regression in this study, its results were comparable to previous research. SVM is known to be less sensitive to high dimensionality and sparse data, so it would likely perform better than Logistic Regression in studies with biomedical data of that nature, particularly much larger datasets. The dataset used here is comparatively small for machine learning algorithms. Including graph metrics from the simulated network improved the validation ROC index of the Logistic Regression model only slightly (81.8 vs. 81.7) and did not improve the SVM model with a polynomial kernel (72.9 vs. 75.6). Ideally, future research will use a dataset that includes both demographic and network characteristics. Program code is available upon request.
Future improvements to this study will include:
Parameter selection using machine learning methods such as Random Forest.
Multiple imputation of missing values for features of interest.
Creating a custom kernel for SVM using deep learning techniques such as neural networks.
Acknowledgments
I would like to thank Dr. Erik Westlund for his encouragement to pursue applied graph theory methods for this project.
