Abstract
Multifactor dimensionality reduction (MDR) method is a machine learning algorithm to detect nonlinear interactions. This multifactor dimensionality reduction analysis is a combination of factor selection by classification accuracy, model selection by prediction accuracy and cross-validation consistency of classification accuracy, and statistical significance by the permutation testing. In this paper, we compare the performances of the standard multifactor dimensionality reduction method and a modified method in which the best model is selected by the area under receiver operating characteristic curve and cross-validation consistency of the area under the receiver operating characteristic curve. We conducted simulation studies based on 1,000 replicates per each parameter setting. The proposed MDR shows 1–8% increase of power to detect nonlinear interactions while false discovery rates remain the same as the original MDR. As an illustration, we applied these methods to pharmacogenomics data of antiepileptic drugs.
Introduction
Genome-wide association study (GWAS) is a powerful tool in genetic analysis of common human diseases. One of the limitations of GWAS is its assumption that a common genetic variation plays a large role in explaining the heritable variation of common diseases (Visscher et al., 2012). Common medical disorders such as heart disease, diabetes, and obesity do not have a single genetic cause. They are likely to be associated with the effects of multiple genes in combination with lifestyle and environmental factors. Medical conditions caused by many contributing factors are called complex or multifactorial disorders. The logistic regression analysis is difficult to discover interactions among more than two factors and requires a large sample size.
Multifactor Dimensionality Reduction (MDR) method was motivated by a combinatorial partitioning method (Nelson et al., 2001), one of the data-reduction methods for the exploratory analysis of quantitative responses. MDR (Ritchie et al., 2001) is a nonparametric data mining method for detecting nonlinear interaction effects. It is a comprehensive and powerful approach that combines variable selection, variable construction, classification by cross-validation (CV), and permutation test. There are many extensions of the original MDR method including family-based MDR (Martin et al., 2006), fuzzy MDR (Leem & Park, 2017), robust MDR (Gui et al., 2011), covariate adjustment for MDR (Gui et al., 2010), survival MDR (Gui et al., 2011; Lee et al., 2016), methods for quantitative traits (Lou et al., 2007), risk scores based MDR (Dai et al., 2013), odds ratio based MDR (Chung et al., 2006), and so on.
Detecting interacting genetic or/and environmental factors that cause diseases is difficult, thus making it computationally infeasible due to an extremely large number of possible combinations. However, efficient filtering algorithms which can detect interacting SNPs are needed. Kira and Rendell (1992) developed a feature selection algorithm, called Relief, based on a nearest neighbor approach to detecting highly dependent factors relevant for some outcome. Kononenko et al. (1997) extended the method to ReliefF algorithm that improves the reliability of the probability approximation, making it robust to incomplete data, and generalizing it to multi-class problems. While Relief uses, for each individual, a 1-nearest neighbor in each class, ReliefF uses k-nearest neighbors and thus more robust when the dataset contains noise. Moore and White (2007) developed a tuned ReliefF (TuRF) algorithm that systematically eliminates factors that have low quality estimates and the weights of the remaining factors are re-estimated. Greene et al. (2009) introduced Spatially Uniform ReliefF (SURF) algorithm that uses all neighbors within a fixed distance of the individual.
In this paper, we propose a multifactor dimensionality reduction procedure based on the area under receiver operating characteristic curve in which it is not necessary to pick up a threshold for classifications. We call it MDR-AUC or AUC-based MDR through this work. We examine the performance of our method by conducting simulation studies and compare our approach to the original multifactor dimensionality reduction algorithm. In addition, we illustrate an example of a pharmacogenomics data set by applying the multifactor dimensionality reduction methods.
Method
We begin this section by summarizing the multifactor dimensionality reduction method considered in our work.
Multifactor dimensionality reduction
In MDR,
The data set is partitioned into a training set A set of The Each multifactor cell in the The model with the smallest misclassification error is selected, denoted by The prediction error of the model The best model is selected based on the prediction accuracy, the average of the prediction accuracy over the
Though the best model often has the maximum prediction accuracy and the maximum CVC, another model may sometimes have the highest CVC. In this case, the more parsimonious model with the higher CVC can be selected (Pattin et al., 2009).
Permutation testing can be used with MDR to assess the statistical significance of the best model with the maximum testing accuracy. An empirical distribution of the maximum testing accuracy is obtained by running the whole MDR procedures with permuted data, in which case-control labels are permuted. This permutation testing procedure is computationally expensive.
A receiver operating characteristic (ROC) curve is a visualization method that illustrates the overall performance of a binary classifier. The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold values. With normalized units, the area under the ROC curve (AUC) is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. The AUC is commonly used for model comparison and selection of binary classifiers. In our work, we perform the MDR without choosing a threshold for classification. We instead calculate the posterior probability
where
For a given number of factors, we select an optimal model that maximizes the average AUC over cross validation training sets. The best model of the optimal models in the set is chosen by the maximum prediction AUC and the maximum cross validation consistency of the AUC. Another important modification is made when a model with the maximum prediction AUC is different from the model(s) with the maximum CVC of AUC. The best model is defined as a more parsimonious one of the candidate models with the maximum prediction AUC and the less parsimonious model of the maximum CVC models. This method does not need a threshold since we do not actually classify a subject.
MDR analysis is often performed by combining with filtering algorithms when a dataset contains many factors such as in GWAS. Efficient algorithms for identifying sets of factors likely to contain predictive models for disease susceptibility are needed. ReliefF, TuRF, and SURF filtering algorithms are commonly used combinatorial methods to explore and filter a large number of SNPs. We briefly describe the filtering algorithms here.
Relief and ReliefF AlgorithmsRelief algorithm is one of classification methods used in the factor selection developed by Kira and Rendell (1992). This filtering algorithm is based on a nearest neighbor approach to detect highly dependent factors related to some outcome, but it is designed only for binary response model. Suppose a data set consists of
where ReliefF algorithm introduced by Kononenko et al. (1997), is the extended version of Relief algorithm, that is more robust to incomplete or/and multi-class models. While Relief algorithm uses a 1-nearest neighbor in each class for a randomly selected sample, ReliefF algorithm uses
where Tuned ReliefF (TuRF) algorithm Unfortunately, the power of ReliefF is reduced in the presence of a large number of noisy noninteracting factors. Moore and White (2007) developed TuRF by adding an iterative component to ReliefF algorithm. First, we apply ReliefF with the training set The TuRF algorithm is known to be consistently better than ReliefF across small factor subset sizes such as 50, 100, and 150. The powers of ReliefF and TuRF are asymptotically equal to each other as the sample size increases, however, the advantage of TuRF becomes apparent at the sample size over 1500 in many genetic epidemiological studies. Between TuRF10% and TuRF1%, the TuRF1% performs better than the TuRF10% for large number of samples (Moore & White, 2007). Specifically, in the study of human genetics, the TuRF algorithm is known to be significantly better than the ReliefF algorithm by systematically removing factors that have low weights so that the weights of the remaining factors can be re-estimated. Spatially Uniform ReliefF (SURF) algorithmAlthough TuRF improved the estimation of weights in noisy data, it did not fundamentally change the underlying ReliefF algorithm. Greene et al. (2009) developed SURF algorithm that uses all neighbors within a fixed distance
where
In this section, we summarize a simulation study. Since this study requires highly intensive computation time, we conducted a simulation study based on 1,000 replications per each scenario: (1) there is no interaction (Null simulation), and (2) there is a two-way or three-way interaction (Power simulation). We assume that a filtering algorithm has chosen five candidate factors. In the first scenario, the filtering methods failed to find both factors so that there is no interaction. In another case, we assume a filtering method has successfully found two factors before applying an MDR analysis. We simulate five common SNPs with minor allele frequency 0.3. The sample sizes for each case and control groups are set to be
Null simulation results based on 1,000 replicates per each setting
Null simulation results based on 1,000 replicates per each setting
Power simulation results based on 1,000 replicates per each setting: the value in each cell of the third and fourth columns is equal to the empirical probability that a method identifies the true model as the best model.
Description of 23 SNPs from 15 candidate genes. REF/ALT represents reference allele/alternative allele, MAF is minor allele frequency. P-value is obtained from the chi-square test of independent. NS is nonsignificant
Plot of the simulation results of 2-way and 3-way interactions.
TOP 5 SNPs selected by ReliefF, TuRF10%, TuRF1%, and SURF
MDR and MDR-AUC results using five SNPs selected by ReliefF, TuRF10%, TuRF 1%, SURF from the total 23 SNPs
MDR and MDR-AUC results using five SNPs selected by ReliefF, TuRF10%, TuRF 1%, SURF from the total 22 SNPs
The results of our null simulation studies are illustrated in Table 1. It shows distributions of each order of the best models chosen by the original MDR and the AUC-based MDR. These null simulation results do not show a significant difference in the type I error rates or the false positive rates of the original MDR and the AUC-based MDR. Table 2 and Fig. 1 summarize the power simulations. We define Increase in Power (%) by
so that a positive Increase in Power (%) implies a better performance of the AUC-based MDR than the original MDR. We observe that the AUC-based MDR has higher power to detect the correct interactions than the original MDR. Particularly, Increase in Power by the AUC-based MDR appears high when true interactions are less strong. The power of the AUC-based MDR was almost 7% higher than the original MDR when the relative risk was set as two. The Increase in Power (%) is equal to 21.9% in this simulation setting. The power gain was around 2% when the relative risk was set as four. For the low relative risk setting of 1.5, the Increase in Power (%) was as large as 20.9%, though the powers of the MDRs may not be enough to detect interactions due to the small strength of interactions. We can statistically evaluate whether the AUC-based MDR has higher power than the original MDR by McNemar test. In the low relative risk setting, the MDR-AUC identified the true model 34 times while the original MDR selected an incorrect model for the same samples. In contrast, the original MDR identified the correct model 16 times whereas the MDR-AUC failed to find the true model for the same samples. The one-tailed
Epistatic interactions in antiepileptic drug resistance were studied by the original MDR analysis (Kim et al., 2011) based on twenty five candidate SNPs of 200 drug-responsive patients and 200 drug-refractory patients. A new data set of 112 independent patients has been collected and their genotypes are obtained by the next generation sequence experiment. Unfortunately, the sequencing experiment has not identified two SNPs and hence we removed the two SNPs in our analysis. Table 3 summarizes the descriptions of 23 biallelic SNPs. One of SNPs, rs2272400, is significant in a chi-square test of independent. We performed filtering and MDR analyses with and without the significant SNP rs2272400 to see the effects of including a significant single factor to the filtering algorithms, the original MDR analysis, and the proposed AUC-based MDR analysis.
Table 4 shows the selected SNPs by the four filtering algorithms, ReliefF, TuRF10%, TuRF1%, and SURF when the significant single factor is included and not included. ReliefF has not selected rs2272400 but all other algorithms have selected the SNP. TuRF10% and TuRF1% select the same four SNPs (1, 5, 6, 12) when rs2272400 is included, whereas only one SNP (14) is selected by both algorithms if rs2272400 is removed. It appears an inclusion of a single SNP may highly influence the factor selections of the four filtering methods. As shown in Table 5, the best models obtained by the original MDR are exactly identical to the best models that the AUC-based MDR has chosen, when the significant SNP rs2272400 was included in the MDR analysis. However, if the SNP is deleted to search pure interactions, then the best models selected by two methods may differ as shown in Table 6. For example, when we applied the MDR and MDR-AUC methods to the five SNPs (1, 6, 9, 22, 23) obtained by ReliefF algorithm to 22 SNPs, the best model by the AUC-based MDR method is of third order while the best model by the original MDR is of second order. By combining the MDR results based on the four filtering algorithms, the overall best model selected by the original MDR is (1, 6, 14, 21) with prediction accuracy 53.88 and 10 CVC whereas the overall best model selected by the AUC-based MDR is (1, 6, 21) with AUC prediction accuracy 55.44 with 10 AUC-CVC.
Discussion and conclusions
In this paper, we compared the original MDR with our modified version of MDR based on the posterior probability and the AUC. This AUC-based MDR method does not require to choose a threshold for classifications to proceed a MDR analysis. In addition, the proposed MDR method does not require more computational resources. The results of the power simulation suggest that the AUC-based MDR performs better than the original MDR in all scenarios when interactions exist. When there is no interaction, the performance of the AUC-based MDR is almost equal to the performance of the original MDR in terms of the false discovery rate. In other words, when interacting factors are priorly chosen by filtering algorithms, our proposed MDR shows better power to detect the correct interactions than the original MDR.
The results shown in this paper may have limitations since there are many different ways of interactions that we have not considered here. Additionally, we have not considered the permutation testing, which is computationally intensive, because the main goal of this work is a comparison study of the best model selections by the original MDR and the AUC-based MDR.
Footnotes
Acknowledgments
This research was supported by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HI15C1559).
