Multifactor dimensionality reduction method based on area under receiver operating characteristic curve

Abstract

The multifactor dimensionality reduction (MDR) method is a machine learning algorithm for detecting nonlinear interactions. An MDR analysis combines factor selection by classification accuracy, model selection by prediction accuracy and cross-validation consistency of the classification accuracy, and assessment of statistical significance by permutation testing. In this paper, we compare the performance of the standard MDR method with that of a modified method in which the best model is selected by the area under the receiver operating characteristic curve (AUC) and the cross-validation consistency of the AUC. We conducted simulation studies based on 1,000 replicates for each parameter setting. The proposed MDR shows a 1–8% increase in power to detect nonlinear interactions, while the false discovery rates remain the same as those of the original MDR. As an illustration, we applied these methods to pharmacogenomics data on antiepileptic drugs.

Keywords

AUC, epistasis, MDR, multifactorial disease, ROC curve

1. Introduction

The genome-wide association study (GWAS) is a powerful tool in the genetic analysis of common human diseases. One limitation of GWAS is its assumption that common genetic variation plays a large role in explaining the heritable variation of common diseases (Visscher et al., 2012). Common medical disorders such as heart disease, diabetes, and obesity do not have a single genetic cause. They are likely to be associated with the effects of multiple genes in combination with lifestyle and environmental factors. Medical conditions caused by many contributing factors are called complex or multifactorial disorders. Logistic regression analysis has difficulty detecting interactions among more than two factors and requires a large sample size.

The multifactor dimensionality reduction (MDR) method was motivated by the combinatorial partitioning method (Nelson et al., 2001), one of the data-reduction methods for the exploratory analysis of quantitative responses. MDR (Ritchie et al., 2001) is a nonparametric data mining method for detecting nonlinear interaction effects. It is a comprehensive and powerful approach that combines variable selection, variable construction, classification by cross-validation (CV), and permutation testing. There are many extensions of the original MDR method, including family-based MDR (Martin et al., 2006), fuzzy MDR (Leem & Park, 2017), robust MDR (Gui et al., 2011), covariate adjustment for MDR (Gui et al., 2010), survival MDR (Gui et al., 2011; Lee et al., 2016), methods for quantitative traits (Lou et al., 2007), risk-score-based MDR (Dai et al., 2013), and odds-ratio-based MDR (Chung et al., 2006).

Detecting interacting genetic and/or environmental factors that cause diseases is difficult because an exhaustive search is computationally infeasible owing to the extremely large number of possible combinations. Efficient filtering algorithms that can detect interacting SNPs are therefore needed. Kira and Rendell (1992) developed a feature selection algorithm, called Relief, based on a nearest neighbor approach to detecting highly dependent factors relevant for some outcome. Kononenko et al. (1997) extended the method to the ReliefF algorithm, which improves the reliability of the probability approximation, is robust to incomplete data, and generalizes to multi-class problems. While Relief uses, for each individual, one nearest neighbor in each class, ReliefF uses the $k$ nearest neighbors and is thus more robust when the dataset contains noise. Moore and White (2007) developed the tuned ReliefF (TuRF) algorithm, which systematically eliminates factors that have low quality estimates and re-estimates the weights of the remaining factors. Greene et al. (2009) introduced the Spatially Uniform ReliefF (SURF) algorithm, which uses all neighbors within a fixed distance of the individual.

In this paper, we propose a multifactor dimensionality reduction procedure based on the area under the receiver operating characteristic curve, in which it is not necessary to choose a threshold for classification. We call it MDR-AUC or AUC-based MDR throughout this work. We examine the performance of our method by conducting simulation studies and compare our approach with the original multifactor dimensionality reduction algorithm. In addition, we illustrate the multifactor dimensionality reduction methods by applying them to a pharmacogenomics data set.

2. Method

We begin this section by summarizing the multifactor dimensionality reduction method considered in our work.

2.1 Multifactor dimensionality reduction

In MDR, $p$-dimensional qualitative predictors are projected onto a one-dimensional space of two groups, called the high-risk and low-risk groups, effectively reducing the genotype predictors from $p$ dimensions to one. The newly constructed one-dimensional variable is evaluated for its ability to classify and predict a response through cross-validation and a permutation test. While traditional parametric approaches such as the logistic regression model require a large number of samples to detect nonlinear interactions, MDR is known to have reasonable power to detect interactions with relatively small sample sizes (Moore et al., 2006). The original MDR analysis (Ritchie et al., 2001) proceeds in the following steps:

1. The data set is partitioned into a training set $\mathcal{X}_{(-i)}=\bigcup_{j\neq i}\mathcal{X}_{j}$ and an independent testing set $\mathcal{X}_{i}$ for cross-validation, for $i=1,\ldots,k$.
2. A set of $p$ qualitative predictors is selected from the pool of all $M$ factors. Let $j=1,\ldots,{M\choose p}$ index the combinations of $p$ predictors.
3. The $p$ factors and their $3^{p}$ possible multifactor cells are represented in $p$-dimensional space.
4. Each multifactor cell in the $p$-dimensional space is labelled high risk (HR) if the ratio of affected to unaffected individuals reaches a threshold $C$, and low risk (LR) otherwise. That is, cell $s$ is classified as HR if ${n_{1js}\over n_{0js}}\geqslant C$, where $n_{1js}$ ($n_{0js}$) is the number of affected (unaffected) individuals in cell $s$ for the $j$th combination of $p$ predictors.
5. Among all combinations of $p$ predictors, the model with the smallest misclassification error on the training set is selected, denoted by $\mathcal{M}_{p(i)}$.
6. The prediction error of the model $\mathcal{M}_{p(i)}$ is estimated using the independent test data $\mathcal{X}_{i}$.
7. The best model is selected based on the prediction accuracy, averaged over the $k$-fold cross-validation, and the cross-validation consistency (CVC), which is the number of times in the $k$-fold cross-validation that a specific model is identified as the optimal model based on the training data.

Although the best model often has both the maximum prediction accuracy and the maximum CVC, a different model may sometimes have the highest CVC. In this case, the more parsimonious model with the higher CVC can be selected (Pattin et al., 2009).
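As a minimal sketch of steps 3 to 5 for a single training set, the following Python code labels each observed cell as high or low risk and picks the $p$-factor combination with the highest training accuracy. The function names, the list-of-tuples data layout, the default $C=1$, and the treatment of cells unseen in training as low risk are assumptions made for illustration; this is not the code of the original MDR software.

```python
from collections import Counter
from itertools import combinations


def label_cells(X, y, factors, C=1.0):
    """Label each observed multifactor cell: True = high risk, False = low risk."""
    cases, controls = Counter(), Counter()
    for row, label in zip(X, y):
        cell = tuple(row[f] for f in factors)
        if label == 1:
            cases[cell] += 1
        else:
            controls[cell] += 1
    high_risk = {}
    for cell in set(cases) | set(controls):
        n1, n0 = cases[cell], controls[cell]
        # n1 / n0 >= C, written without division so that n0 = 0 is handled
        high_risk[cell] = n1 >= C * n0
    return high_risk


def accuracy(high_risk, X, y, factors):
    """Fraction of subjects whose HR/LR label agrees with case/control status."""
    correct = 0
    for row, label in zip(X, y):
        cell = tuple(row[f] for f in factors)
        # cells not seen in training are treated as low risk (assumption)
        predicted = 1 if high_risk.get(cell, False) else 0
        correct += predicted == label
    return correct / len(y)


def best_model(X, y, p, C=1.0):
    """Return (training accuracy, factor indices) of the best p-factor model."""
    scored = []
    for factors in combinations(range(len(X[0])), p):
        hr = label_cells(X, y, factors, C)
        scored.append((accuracy(hr, X, y, factors), factors))
    return max(scored)
```

In a full analysis, `best_model` would be called once per training fold, the selected model would be scored with `accuracy` on the held-out fold, and the CVC would be the number of folds in which the same factor combination is returned.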

Permutation testing can be used with MDR to assess the statistical significance of the best model with the maximum testing accuracy. An empirical distribution of the maximum testing accuracy is obtained by running the whole MDR procedure on permuted data, in which the case-control labels are permuted. This permutation testing procedure is computationally expensive.

2.2 MDR based on ROC curve and AUC

A receiver operating characteristic (ROC) curve is a visualization method that illustrates the overall performance of a binary classifier. The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold values. With normalized units, the area under the ROC curve (AUC) is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. The AUC is commonly used for model comparison and selection of binary classifiers. In our work, we perform the MDR without choosing a threshold for classification. We instead calculate the posterior probability $\tau(x_{1},\ldots,x_{p})$ for each cell. Given training data $\mathcal{X}$ , the posterior probability of a case ( $Y=1$ ) given a combination $(x_{1},\ldots,x_{p})$ of $p$ factors is estimated by

$\displaystyle\widehat{\tau}(x_{1},\ldots,x_{p})={\sum_{(y_{i},x_{i1},\ldots,x_{ip})\in\mathcal{X}}I(y_{i}=1,x_{i1}=x_{1},\ldots,x_{ip}=x_{p})\over\sum_{(y_{i},x_{i1},\ldots,x_{ip})\in\mathcal{X}}I(x_{i1}=x_{1},\ldots,x_{ip}=x_{p})}$ (1)

where $I(\cdot)$ is an indicator function. With these posterior probabilities, we calculate the AUC for each combination of factors. Let $\widehat{\tau}_{(i)}^{0}$ and $\widehat{\tau}_{(j)}^{1}$ be the order statistics of the estimated posterior probabilities for the $n_{0}$ controls and $n_{1}$ cases, respectively. The AUC can be calculated as follows:

$\displaystyle\widehat{\textit{AUC}}=1-{1\over n_{0}n_{1}}\sum_{i=1}^{n_{0}}\#\{j:\widehat{\tau}_{(j)}^{1}<\widehat{\tau}_{(i)}^{0}\}$ (2)

For a given number of factors, we select the optimal model as the one that maximizes the average AUC over the cross-validation training sets. Among these optimal models, the best model is chosen by the maximum prediction AUC and the maximum cross-validation consistency of the AUC. Another important modification applies when the model with the maximum prediction AUC differs from the model(s) with the maximum CVC of the AUC. In that case, the best model is defined as the more parsimonious of two candidates: the model with the maximum prediction AUC and the less parsimonious of the models with the maximum CVC. This method does not need a threshold since we do not actually classify a subject.
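The two estimators above can be sketched in a few lines of Python. The function names, the data layout, and the default score of 0.5 for cells that do not occur in the training data are assumptions for illustration only. Note that Eq. (2), as written, subtracts only the pairs in which the case scores strictly below the control, so tied pairs count in full; this differs from conventions that give ties half weight.

```python
from collections import Counter


def posterior_by_cell(X, y, factors):
    """Eq. (1): proportion of cases among the subjects in each observed cell."""
    cases, totals = Counter(), Counter()
    for row, label in zip(X, y):
        cell = tuple(row[f] for f in factors)
        totals[cell] += 1
        if label == 1:
            cases[cell] += 1
    return {cell: cases[cell] / totals[cell] for cell in totals}


def auc_eq2(tau_controls, tau_cases):
    """Eq. (2): 1 minus the share of (control, case) pairs with case < control."""
    n0, n1 = len(tau_controls), len(tau_cases)
    discordant = sum(1 for t0 in tau_controls for t1 in tau_cases if t1 < t0)
    return 1.0 - discordant / (n0 * n1)


def model_auc(train_X, train_y, eval_X, eval_y, factors, default=0.5):
    """AUC of one factor combination: fit Eq. (1) on training data, score eval data."""
    tau = posterior_by_cell(train_X, train_y, factors)
    # cells unseen in training receive a neutral score (assumption)
    scores = [tau.get(tuple(row[f] for f in factors), default) for row in eval_X]
    controls = [s for s, label in zip(scores, eval_y) if label == 0]
    cases = [s for s, label in zip(scores, eval_y) if label == 1]
    return auc_eq2(controls, cases)
```

Passing the same data as training and evaluation sets gives the training AUC used for model selection within a fold; passing the held-out fold as the evaluation set gives the prediction AUC.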

2.3 Filtering algorithms

MDR analysis is often combined with filtering algorithms when a dataset contains many factors, as in GWAS. Efficient algorithms are needed for identifying sets of factors that are likely to contain predictive models of disease susceptibility. The ReliefF, TuRF, and SURF filtering algorithms are commonly used to explore and filter a large number of SNPs. We briefly describe these filtering algorithms here.

1. Relief and ReliefF algorithms. The Relief algorithm, developed by Kira and Rendell (1992), is a classification-based method for factor selection. This filtering algorithm is based on a nearest neighbor approach to detect highly dependent factors related to some outcome, but it is designed only for a binary response. Suppose a data set consists of $n$ samples of $M$ factors, belonging to known classes (case or control). Given training data $\mathcal{X}$, sample size $n$, and a relevance threshold $h$, the Relief filtering algorithm proceeds as follows. The algorithm starts with an $M$-dimensional zero weight vector $\mathbf{w}^{(0)}=(0,\ldots,0)$. At each iteration, it selects a sample $x=(x_{1},\ldots,x_{M})\in\mathcal{X}$ at random and finds the nearest sample $x_{nh}=(x_{nh,1},\ldots,x_{nh,M})$ from the same class, called the near-hit, and the nearest sample $x_{nm}=(x_{nm,1},\ldots,x_{nm,M})$ from a different class, called the near-miss. Distances are measured by the Euclidean ($L_{2}$) distance. At the $(r+1)$th iteration, the weight vector is updated as in Eq. (3).

$\displaystyle w^{(r+1)}_{j}\leftarrow w^{(r)}_{j}-\left({x_{j}-x_{nh,j}\over c_{j}}\right)^{2}+\left({x_{j}-x_{nm,j}\over c_{j}}\right)^{2}\quad\mbox{for }j=1,\ldots,M$ (3)

where $c_{j}$ is a normalization constant; a possible choice is $\max(x_{j})-\min(x_{j})$. Thus, the weight of a factor is updated upward when its value differs between the selected sample and its near-miss, and downward when it differs between the selected sample and its near-hit. After $m$ iterations, the relevance is calculated as $\text{relevance}_{j}=w^{(m)}_{j}/m$. Factor $x_{j}$ is considered relevant if $\text{relevance}_{j}\geqslant h$, for $j=1,\ldots,M$ (Kira & Rendell, 1992).

The ReliefF algorithm, introduced by Kononenko et al. (1997), is an extension of the Relief algorithm that is more robust to incomplete data and handles multi-class problems. While Relief uses one nearest neighbor in each class for a randomly selected sample, ReliefF uses the $k$ nearest neighbors in each class and averages their contributions to the weight of each factor. ReliefF is thus more robust than Relief, particularly when the dataset contains noise, and it improves the reliability of the probability approximation. Generally, 10 nearest neighbors are used in the ReliefF algorithm. It finds the nearest hits and nearest misses using the Manhattan ($L_{1}$) distance rather than the Euclidean ($L_{2}$) distance. For all sample sizes, the power of ReliefF is known to improve with increasing factor subset size much more rapidly than the power of a chi-square test of independence. When $x_{i}=(x_{i1},\ldots,x_{iM})$ of class $y_{i}$ is randomly selected, the update formula for ReliefF is given by

$\displaystyle\begin{split}&w^{(r+1)}_{j}\leftarrow w_{j}^{(r)}-{1\over mk}\sum_{x_{i^{\prime}}\in N_{k}(x_{i}),\,y_{i^{\prime}}=y_{i}}^{k}\frac{|x_{ij}-x_{i^{\prime}j}|}{\max(x_{j})-\min(x_{j})}\\ &+\sum_{C\neq C_{i}}\frac{P(C)}{1-P(C_{i})}{1\over mk}\sum_{x_{i^{\prime}}\in N_{k}(x_{i}),\,y_{i^{\prime}}\in C}^{k}\frac{|x_{ij}-x_{i^{\prime}j}|}{\max(x_{j})-\min(x_{j})}\\ &\mbox{for }j=1,\ldots,M\end{split}$

where $C$ is a class of the response variable and $C_{i}$ is the class of $y_{i}$ that is the response value of the randomly chosen $x_{i}$ .
2. Tuned ReliefF (TuRF) algorithm. Unfortunately, the power of ReliefF is reduced in the presence of a large number of noisy, noninteracting factors. Moore and White (2007) developed TuRF by adding an iterative component to the ReliefF algorithm. First, we apply ReliefF with the training set $\mathcal{X}$, the number of factors $M$, and the relevance threshold $h$. We then sort the factors in descending order of weight and remove the $M/R$ factors with the lowest weights. After $R$ iterations of the TuRF algorithm, we return the last ReliefF estimate for each factor. Generally, $k=10$ nearest neighbors and $R=10$ iterations are used. If the worst 10% of factors are removed at each of the $R$ iterations, the procedure is called TuRF10%; if the worst 1% are removed at each iteration, it is called TuRF1% (Chiong, 2009).

The TuRF algorithm is known to be consistently better than ReliefF across small factor subset sizes such as 50, 100, and 150. The powers of ReliefF and TuRF become asymptotically equal as the sample size increases; however, the advantage of TuRF becomes apparent at sample sizes over 1500 in many genetic epidemiological studies. Of the two variants, TuRF1% performs better than TuRF10% for large numbers of samples (Moore & White, 2007). Specifically, in the study of human genetics, the TuRF algorithm is known to be significantly better than the ReliefF algorithm because it systematically removes factors that have low weights so that the weights of the remaining factors can be re-estimated.
3. Spatially Uniform ReliefF (SURF) algorithm. Although TuRF improved the estimation of weights in noisy data, it did not fundamentally change the underlying ReliefF algorithm. Greene et al. (2009) developed the SURF algorithm, which uses all neighbors within a fixed distance $T$ of each sample, where the fixed distance is determined by a similarity threshold. While ReliefF may use either fewer or more neighbors possibly neglecting informative samples or including uninformative ones, the contribution of the irrelevant SNPs to the distance from a sample to its nearest sample tends to be less than that to its nearest sample. The SURF score of a subject $i$, $S_{i}^{R}$, depends on the value of the distance threshold, which is chosen to maximize $S_{i}^{R}$.

$\displaystyle S_{i}^{R}={1\over 2}\left(|\textit{TM}_{1}|-|\textit{TH}_{1}|\right)+\left(|\textit{TM}_{2}|-|\textit{TH}_{2}|\right)$

where $\textit{TM}_{j}$ is the subset of the set consisting of exactly $k$ near-misses of two relevant SNPs in a different state from the subject $i$ , in which the misses lie within the threshold distance $T$ . The notation $\textit{TH}_{j}$ is similarly defined with the near-hit instead of the near-miss. The final score of a relevant SNP is the sum of the $S_{i}^{R}$ ( $i=1,\ldots,n$ ) values for each sample.
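The Relief update in Eq. (3), the ReliefF update, and the TuRF removal loop described above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the implementation used by any of the cited authors: the function names are hypothetical, `relieff` is restricted to two classes (so the prior ratio $P(C)/(1-P(C_{i}))$ equals 1), missing data are not handled, constant factors are given $c_{j}=1$, and each class must contain at least two samples. SURF is not covered.

```python
import random


def _spans(X):
    # c_j = max(x_j) - min(x_j); constant factors get 1 to avoid division by zero (assumption)
    return [(max(col) - min(col)) or 1 for col in zip(*X)]


def _euclidean(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5


def _manhattan(a, b):
    return sum(abs(u - v) for u, v in zip(a, b))


def relief(X, y, m, seed=0):
    """Relief relevance scores, Eq. (3): one near-hit and one near-miss per draw."""
    rng, c, n, M = random.Random(seed), _spans(X), len(X), len(X[0])
    w = [0.0] * M
    for _ in range(m):
        i = rng.randrange(n)
        others = [t for t in range(n) if t != i]
        hit = min((t for t in others if y[t] == y[i]), key=lambda t: _euclidean(X[i], X[t]))
        miss = min((t for t in others if y[t] != y[i]), key=lambda t: _euclidean(X[i], X[t]))
        for j in range(M):
            w[j] += (((X[i][j] - X[miss][j]) / c[j]) ** 2
                     - ((X[i][j] - X[hit][j]) / c[j]) ** 2)
    return [wj / m for wj in w]  # relevance_j = w_j / m


def relieff(X, y, m, k=10, seed=0):
    """Two-class ReliefF weights: k nearest hits and misses, Manhattan distance."""
    rng, c, n, M = random.Random(seed), _spans(X), len(X), len(X[0])
    w = [0.0] * M
    for _ in range(m):
        i = rng.randrange(n)
        ranked = sorted((t for t in range(n) if t != i), key=lambda t: _manhattan(X[i], X[t]))
        hits = [t for t in ranked if y[t] == y[i]][:k]
        misses = [t for t in ranked if y[t] != y[i]][:k]
        for j in range(M):
            # with two classes the prior ratio P(C) / (1 - P(C_i)) equals 1
            w[j] += (sum(abs(X[i][j] - X[t][j]) for t in misses)
                     - sum(abs(X[i][j] - X[t][j]) for t in hits)) / (c[j] * m * k)
    return w


def turf(X, y, score_fn, R=10):
    """TuRF loop: score, drop the M/R lowest-weighted factors, re-score the rest."""
    M = len(X[0])
    active = list(range(M))
    drop = max(1, M // R)
    weights = {}
    for _ in range(R):
        sub = [tuple(row[j] for j in active) for row in X]
        weights.update(zip(active, score_fn(sub, y)))  # keep the latest estimate per factor
        if len(active) <= drop:
            break
        ranked = sorted(active, key=lambda j: weights[j], reverse=True)
        active = sorted(ranked[:len(ranked) - drop])
    return weights
```

A call such as `turf(X, y, lambda A, b: relieff(A, b, m=len(A), k=10))` would combine the two; the five factors with the largest returned weights would then be passed on to the MDR analysis.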

3. Simulation study

In this section, we summarize a simulation study. Because the computations are highly intensive, we conducted the simulation study with 1,000 replications for each of two scenarios: (1) there is no interaction (null simulation), and (2) there is a two-way or three-way interaction (power simulation). We assume that a filtering algorithm has chosen five candidate factors. In the first scenario, the filtering method has failed to find both interacting factors, so that there is no interaction among the candidates. In the second scenario, we assume that the filtering method has successfully found both interacting factors before the MDR analysis is applied. We simulate five common SNPs with minor allele frequency 0.3. The sample size for each of the case and control groups is set to $n=200$. For the two functional SNPs, the probability of being a case for subjects having the $(1,1)$ genotype vector of SNP1 and SNP2 is set to 0.6. We set the probability of being a case for all other genotypes to 0.3, so that the relative risk is equal to $0.6/0.3=2$. For a higher relative risk setting, we set it to $0.4/0.1=4$. Additionally, we set the relative risk to $0.3/0.2=1.5$ for a low power simulation setting. We also considered three-way interactions by assigning high risk when SNP1 $+$ SNP2 $+$ SNP3 $=$ 2, with the same relative risk configurations. These simulation settings were chosen to reflect the pharmacogenomics data set described in the example section, although it may be interesting to consider a setting in which one or more interacting factors are rare variants.

For the power simulations, we searched for interactions up to the fourth order, while the maximum order of interactions for the null simulations was set to three. All simulations were performed in the statistical software R. We started from the “MDR” package (Winham & Motsinger-Reif, 2011) to implement the AUC-based MDR procedure and used the “ROCR” package (Sing et al., 2005) for AUC calculations.
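One replicate of the two-way power setting could be generated along the following lines. The simulations in this paper were run in R; the Python sketch below is only an illustration, and several details are assumptions that the text does not specify: genotypes are drawn as the number of minor alleles under Hardy-Weinberg equilibrium, the SNPs are independent, and subjects are drawn until 200 cases and 200 controls have been collected.

```python
import random


def simulate(n_per_group=200, n_snps=5, maf=0.3, p_high=0.6, p_low=0.3, seed=0):
    """Draw one case-control replicate with a two-way interaction of SNP1 and SNP2."""
    rng = random.Random(seed)
    cases, controls = [], []
    while len(cases) < n_per_group or len(controls) < n_per_group:
        # genotype = number of minor alleles (0, 1, or 2), assuming Hardy-Weinberg equilibrium
        g = tuple((rng.random() < maf) + (rng.random() < maf) for _ in range(n_snps))
        # only the (1, 1) genotype vector of the first two SNPs carries the high risk
        p_case = p_high if (g[0], g[1]) == (1, 1) else p_low
        if rng.random() < p_case:
            if len(cases) < n_per_group:
                cases.append(g)
        elif len(controls) < n_per_group:
            controls.append(g)
    X = cases + controls
    y = [1] * len(cases) + [0] * len(controls)
    return X, y
```

Calling `simulate(p_high=0.4, p_low=0.1)` or `simulate(p_high=0.3, p_low=0.2)` would correspond to the relative risk settings of 4 and 1.5, and setting `p_high` equal to `p_low` gives a null replicate.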

Table 1
Null simulation results based on 1,000 replicates for each setting

Order of the best model	Single-locus	Two-locus	Three-locus
MDR	0.483	0.320	0.197
MDR-AUC	0.483	0.321	0.196

Table 2

Power simulation results based on 1,000 replicates for each setting. The value in each cell of the third and fourth columns is the empirical probability that a method identifies the true model as the best model. The $P$-value is calculated by the McNemar test for equality of the powers of MDR and MDR-AUC.

Order of interactions	Relative risk	MDR	MDR-AUC	Increase in power (%)	$P$ -value
2-way	1.5	0.086	0.104	20.9	0.0055
	2	0.306	0.373	21.9	$<$ 10 ${}^{-6}$
	4	0.616	0.630	2.27	0.2100
3-way	1.5	0.133	0.149	12.0	0.026
	2	0.704	0.754	7.10	0.00023
	4	0.849	0.856	0.82	0.30

Table 3

Description of 23 SNPs from 15 candidate genes. REF/ALT represents reference allele/alternative allele; MAF is minor allele frequency. The $P$-value is obtained from the chi-square test of independence. NS is nonsignificant.

SNP $i$	Genes	rs number	REF/ALT	MAF	$P$ -value
1	MDR1	rs1128503	T/C	A $=$ 0.4581	NS
2	MDR1	rs1045642	C/T	A $=$ 0.4966	NS
3	BCRP	rs2231137	G/A	T $=$ 0.1076	NS
4	SCN1B	rs55742440	T/C	C $=$ 0.4262	NS
5	SCN1A	rs2298771	T/C	C $=$ 0.2748	NS
6	KCNQ3	rs1801475	A/G	T $=$ 0.3628	NS
7	KCNQ2	rs2303995	T/C	C $=$ 0.0341	NS
8	CLCN2	rs2228291	A/G	G $=$ 0.3550	NS
9	GAT3	rs2304725	T/C	C $=$ 0.3482	NS
10	GAT3	rs2272394	G/A	A $=$ 0.0292	NS
11	GAT3	rs2272395	T/C	C $=$ 0.3784	NS
12	GAT3	rs2272400	C/T	T $=$ 0.0265	0.00524
13	GAT3	rs2245532	A/G	G $=$ 0.3594	NS
14	GABRA1	rs35166395	T/C	C $=$ 0.2591	NS
15	GABRG2	rs11135176	C/T	T $=$ 0.1255	NS
16	GABRG2	rs211037	T/C	T $=$ 0.2854	NS
17	CHRNA4	rs1044396	G/A	A $=$ 0.4659	NS
18	CHRNA4	rs1044397	G/A	T $=$ 0.4750	NS
19	CHRNB2	rs2280781	C/T	T $=$ 0.1148	NS
20	VGLUT3	rs11110359	G/A	A $=$ 0.0112	NS
21	EAAT2	rs752949	C/T	T $=$ 0.2268	NS
22	EAAT2	rs1042113	A/G	C $=$ 0.2400	NS
23	EAAT3	rs2228622	G/A	A $=$ 0.3971	NS

Figure 1.

Plot of the simulation results of 2-way and 3-way interactions.

Table 4

Top five SNPs selected by ReliefF, TuRF10%, TuRF1%, and SURF

SNPs	ReliefF	TuRF10%	TuRF1%	SURF
23 SNPs	rs1128503 (1)	rs1128503 (1)	rs2272394 (10)	rs1128503 (1)
	rs1801475 (6)	rs2272400 (12)	rs2272400 (12)	rs1801475 (6)
	rs2272394 (10)	rs2298771 (5)	rs2298771 (5)	rs2304725 (9)
	rs1042113 (22)	rs2272394 (10)	rs1801475 (6)	rs2272400 (12)
	rs752949 (21)	rs1801475 (6)	rs2245532 (13)	rs752949 (21)
22 SNPs	rs2304725 (9)	rs1801475 (6)	rs1042113 (22)	rs1128503 (1)
	rs1128503 (1)	rs35166395 (14)	rs35166395 (14)	rs211037 (16)
	rs1042113 (22)	rs752949 (21)	rs2272394 (10)	rs2304725 (9)
	rs1801475 (6)	rs2245532 (13)	rs2280781 (19)	rs1042113 (22)
	rs2228622 (23)	rs1128503 (1)	rs2304725 (9)	rs1801475 (6)

Table 5

MDR and MDR-AUC results using five SNPs selected by ReliefF, TuRF10%, TuRF1%, and SURF from the total 23 SNPs

Algorithm	SNPs	CA	PA	CVC	SNPs	CA	PA	CVC
	MDR				MDR-AUC
ReliefF	1	53.99	51.66	8	1	54.41	50.36	8
	1, 10	57.00	53.93	7	1, 10	58.51	51.75	7
	1, 6, 22	61.04	51.70	10	1, 6, 22	65.05	55.44	10
	1, 6, 21, 22	65.68	58.37	9	1, 6, 21, 22	72.12	60.33	9
TuRF10%	12	55.59	55.70	10	12	55.74	55.84	10
	6, 12	57.37	53.82	6	6, 12	59.61	56.97	6
	1, 6, 12	60.24	50.97	9	1, 6, 12	65.01	54.40	9
	1, 5, 6, 12	63.57	52.79	6	1, 5, 6, 12	70.06	54.46	6
TuRF1%	12	55.59	55.70	10	12	55.74	55.84	10
	6, 12	57.36	53.83	8	6, 12	59.59	57.23	8
	5, 6, 12	58.57	55.36	8	5, 6, 12	62.27	57.18	8
	5, 6, 10, 12	59.72	51.68	6	5, 6, 10, 12	65.18	50.88	6
SURF	12	55.59	55.70	10	12	55.74	55.84	10
	6, 12	57.40	53.67	7	6, 12	59.66	56.65	7
	1, 6, 12	61.21	51.37	8	1, 6, 12	65.21	54.19	8
	1, 6, 9, 21	66.40	51.13	6	1, 6, 9, 21	73.40	53.40	6

Table 6

MDR and MDR-AUC results using five SNPs selected by ReliefF, TuRF10%, TuRF1%, and SURF from the total 22 SNPs

Algorithm	SNPs	CA	PA	CVC	SNPs	CA	PA	CVC
	MDR				MDR-AUC
ReliefF	1	53.91	51.57	9	1	54.36	50.73	9
	1, 6	56.67	53.16	9	1, 6	58.91	53.19	9
	1, 6, 22	61.22	50.93	9	1, 6, 22	65.15	54.27	9
	1, 6, 9, 22	66.18	51.01	9	1, 6, 9, 22	72.81	53.19	9
TuRF10%	1	53.92	51.57	9	1	54.36	50.73	9
	1, 6	56.67	53.61	7	1, 6	58.96	52.61	7
	1, 6, 21	61.04	51.70	10	1, 6, 21	65.05	55.44	10
	1, 6, 14, 21	66.24	53.88	10	1, 6, 14, 21	72.59	51.14	10
TuRF1%	9	53.26	48.89	8	9	53.55	49.52	8
	9, 14	56.40	47.86	4	9, 14	53.54	47.67	4
	9, 14, 22	59.66	53.71	10	9, 14, 22	63.64	52.95	10
	9, 14, 19, 22	63.23	51.90	5	9, 14, 19, 22	68.40	50.09	5
SURF	1	53.92	51.57	9	1	54.36	50.73	9
	1, 6	56.82	51.68	7	1, 6	58.94	52.81	7
	1, 6, 22	61.50	48.61	6	1, 6, 22	65.52	50.41	6
	1, 9, 16, 22	66.49	51.27	5	1, 9, 16, 22	73.56	51.35	5

The results of our null simulation studies are given in Table 1, which shows the distribution of the order of the best model chosen by the original MDR and by the AUC-based MDR. These null simulation results do not show a significant difference between the type I error rates (false positive rates) of the original MDR and the AUC-based MDR. Table 2 and Fig. 1 summarize the power simulations. We define the Increase in Power (%) by

$\displaystyle\text{Increase in Power}(\%)={\text{Power}_{\text{MDR-AUC}}-\text{Power}_{\text{MDR}}\over\text{Power}_{\text{MDR}}}\times 100(\%)$

so that a positive Increase in Power (%) implies better performance of the AUC-based MDR than of the original MDR. We observe that the AUC-based MDR has higher power to detect the correct interactions than the original MDR. In particular, the Increase in Power of the AUC-based MDR appears high when the true interactions are less strong. For two-way interactions, the power of the AUC-based MDR was almost 7 percentage points higher than that of the original MDR when the relative risk was set to two (0.373 versus 0.306); the Increase in Power (%) is equal to 21.9% in this setting. The power gain was around 2% when the relative risk was set to four. For the low relative risk setting of 1.5, the Increase in Power (%) was as large as 20.9%, though the power of either MDR may not be sufficient to detect interactions because the interactions are weak. We can statistically evaluate whether the AUC-based MDR has higher power than the original MDR by the McNemar test. In the low relative risk setting for two-way interactions, the MDR-AUC identified the true model 34 times when the original MDR selected an incorrect model for the same samples. In contrast, the original MDR identified the correct model 16 times when the MDR-AUC failed to find the true model for the same samples. The one-tailed $P$-value of the McNemar test is equal to 0.00545. The last column of Table 2 shows the results of the McNemar test for equality of the powers of MDR and MDR-AUC, where the alternative hypothesis is that the MDR-AUC has higher power than the original MDR. Summarizing the simulation results, the power gain from using the AUC-based MDR is larger when the strength of the interactions is low. This finding may be an important advantage of the AUC-based MDR over the original MDR.
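The two summary quantities can be checked with a short calculation. The Python sketch below computes the Increase in Power (%) from two empirical powers and a one-sided McNemar $P$-value from the two discordant counts. The text does not state which form of the McNemar test was used; the normal approximation without continuity correction is an assumption here, chosen because it gives a value close to the reported 0.00545 for the counts 34 and 16.

```python
import math


def increase_in_power(power_mdr, power_auc):
    """Relative gain of MDR-AUC over MDR, in percent."""
    return (power_auc - power_mdr) / power_mdr * 100.0


def mcnemar_one_sided(b, c):
    """One-sided McNemar P-value from discordant counts.

    b: replicates where only MDR-AUC found the true model
    c: replicates where only the original MDR found the true model
    Uses the normal approximation without continuity correction (assumption).
    """
    z = (b - c) / math.sqrt(b + c)
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # upper tail of the standard normal


print(round(increase_in_power(0.306, 0.373), 1))  # two-way interaction, relative risk 2
print(mcnemar_one_sided(34, 16))                  # two-way interaction, relative risk 1.5
```

An exact binomial version of the test would give a somewhat different value for the same counts, so the choice of approximation matters when comparing against the last column of Table 2.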

4. Example

Epistatic interactions in antiepileptic drug resistance were studied with the original MDR analysis (Kim et al., 2011) based on twenty-five candidate SNPs in 200 drug-responsive patients and 200 drug-refractory patients. A new data set of 112 independent patients has been collected, and their genotypes were obtained by next-generation sequencing. Unfortunately, the sequencing experiment did not identify two of the SNPs, and hence we removed these two SNPs from our analysis. Table 3 describes the remaining 23 biallelic SNPs. One of the SNPs, rs2272400, is significant in a chi-square test of independence. We performed filtering and MDR analyses with and without the significant SNP rs2272400 to see the effect of including a significant single factor on the filtering algorithms, the original MDR analysis, and the proposed AUC-based MDR analysis.

Table 4 shows the SNPs selected by the four filtering algorithms, ReliefF, TuRF10%, TuRF1%, and SURF, when the significant single factor is included and when it is not. ReliefF did not select rs2272400, but all the other algorithms did. TuRF10% and TuRF1% select the same four SNPs (5, 6, 10, 12) when rs2272400 is included, whereas only one SNP (14) is selected by both algorithms when rs2272400 is removed. It appears that the inclusion of a single SNP may strongly influence the factor selections of the four filtering methods. As shown in Table 5, the best models obtained by the original MDR are identical to the best models chosen by the AUC-based MDR when the significant SNP rs2272400 was included in the MDR analysis. However, if the SNP is deleted in order to search for pure interactions, then the best models selected by the two methods may differ, as shown in Table 6. For example, when we applied the MDR and MDR-AUC methods to the five SNPs (1, 6, 9, 22, 23) obtained by applying the ReliefF algorithm to the 22 SNPs, the best model by the AUC-based MDR method is of third order, while the best model by the original MDR is of second order. Combining the MDR results based on the four filtering algorithms, the overall best model selected by the original MDR is (1, 6, 14, 21), with prediction accuracy 53.88 and a CVC of 10, whereas the overall best model selected by the AUC-based MDR is (1, 6, 21), with AUC prediction accuracy 55.44 and an AUC-CVC of 10.

5. Discussion and conclusions

In this paper, we compared the original MDR with our modified version of MDR based on the posterior probability and the AUC. The AUC-based MDR method does not require a classification threshold to be chosen in order to carry out an MDR analysis. In addition, the proposed MDR method does not require more computational resources. The results of the power simulation suggest that the AUC-based MDR performs better than the original MDR in all scenarios in which interactions exist. When there is no interaction, the performance of the AUC-based MDR is almost equal to that of the original MDR in terms of the false discovery rate. In other words, when the interacting factors have been chosen beforehand by filtering algorithms, our proposed MDR shows better power to detect the correct interactions than the original MDR.

The results shown in this paper may have limitations, since there are many forms of interaction that we have not considered here. Additionally, we have not considered permutation testing, which is computationally intensive, because the main goal of this work is to compare the best model selections of the original MDR and the AUC-based MDR.

Acknowledgments

This research was supported by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HI15C1559).

References

Chiong (2009). Nature-inspired informatics for intelligent applications and knowledge discovery: Implications in business, science, and engineering. Hershey, PA: Information Science Reference.

Chung, Lee, S. Y., Elston, R. C., & Park (2006). Odds ratio based multifactor-dimensionality reduction method for detecting gene-gene interactions. Bioinformatics, 23(1), 71-76.

Dai, Charnigo, R. J., Becker, M. L., Leeder, J. S., & Motsinger-Reif, A. A. (2013). Risk score modeling of multiple gene to gene interactions using aggregated-multifactor dimensionality reduction. BioData Mining, 6(1), 1.

Greene, C. S., Penrod, N. M., Kiralis, & Moore, J. H. (2009). Spatially uniform ReliefF (SURF) for computationally-efficient filtering of gene-gene interactions. BioData Mining, 2(1), 5.

Gui, Andrew, A. S., Andrews, Nelson, H. M., Kelsey, K. T., Karagas, M. R., & Moore, J. H. (2010). A simple and computationally efficient sampling approach to covariate adjustment for multifactor dimensionality reduction analysis of epistasis. Human Heredity, 70(3), 219-225.

Gui, Andrew, A. S., Andrews, Nelson, H. M., Kelsey, K. T., Karagas, M. R., & Moore, J. H. (2011). A robust multifactor dimensionality reduction method for detecting gene-gene interactions with application to the genetic analysis of bladder cancer susceptibility. Annals of Human Genetics, 75(1), 20-28.

Gui, Moore, J. H., Kelsey, K. T., Marsit, C. J., Karagas, M. R., & Andrew, A. S. (2011). A novel survival multifactor dimensionality reduction method for detecting gene-gene interactions with application to bladder cancer prognosis. Human Genetics, 129(1), 101-110.

Kim, M.-K., Moore, J. H., Kim, J.-K., Cho, K.-H., Cho, Y.-W., Kim, Y.-S., Lee, M. C., Kim, Y. O., & Shin, M.-H. (2011). Evidence for epistatic interactions in antiepileptic drug resistance. Journal of Human Genetics, 56(1), 71-76.

Kira & Rendell, L. A. (1992). The feature selection problem: Traditional methods and a new algorithm. San Jose, California: AAAI Press.

Kononenko, Simec, & Robnik-Sikonja (1997). Overcoming the myopia of inductive learning algorithms with ReliefF. Applied Intelligence, 7(1), 39-55.

Lee, Son, & Park (2016). Gene-gene interaction analysis for the accelerated failure time model using a unified model-based multifactor dimensionality reduction method. Genomics & Informatics, 14(4), 166-172.

Leem & Park (2017). An empirical fuzzy multifactor dimensionality reduction method for detecting gene-gene interactions. BMC Genomics, 18(2), 115.

Lou, X.-Y., Chen, G.-B., Yan, J. Z., Zhu, Elston, R. C., & Li, M. D. (2007). A generalized combinatorial approach for detecting gene-by-gene and gene-by-environment interactions with application to nicotine dependence. The American Journal of Human Genetics, 80(6), 1125-1137.

Martin, E. R., Ritchie, Hahn, Kang, & Moore (2006). A novel method to identify gene-gene effects in nuclear families: The MDR-PDT. Genetic Epidemiology, 30(2), 111-123.

Moore, J. H., Gilbert, J. C., Tsai, C.-T., Chiang, F.-T., Holden, Barney, & White, B. C. (2006). A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility. Journal of Theoretical Biology, 241(2), 252-261.

Moore, J. H., & White, B. C. (2007). Tuning ReliefF for genome-wide genetic analysis. Valencia, Spain: Springer-Verlag.

Nelson, Kardia, Ferrell, & Sing (2001). A combinatorial partitioning method to identify multilocus genotypic partitions that predict quantitative trait variation. Genome Research, 11(3), 458-470.

Pattin, K. A., White, B. C., Barney, Gui, Nelson, H. H., Kelsey, K. T., Andrew, A. S., Karagas, M. R., & Moore, J. H. (2009). A computationally efficient hypothesis testing method for epistasis analysis using multifactor dimensionality reduction. Genetic Epidemiology, 33(1), 87-94.

Ritchie, M. D., Hahn, L. W., Roodi, Bailey, L. R., Dupont, W. D., Parl, F. F., & Moore, J. H. (2001). Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. The American Journal of Human Genetics, 69(1), 138-147.

Sing, Sander, Beerenwinkel, & Lengauer (2005). ROCR: Visualizing classifier performance in R. Bioinformatics, 21(20), 3940-3941.

Visscher, P. M., Brown, M. A., McCarthy, M. I., & Yang (2012). Five years of GWAS discovery. American Journal of Human Genetics, 90(1), 7-24.

Winham, S. J., & Motsinger-Reif, A. A. (2011). An R package implementation of multifactor dimensionality reduction. BioData Mining, 4(1), 24.
