Potpourri: An Epistasis Test Prioritization Algorithm via Diverse SNP Selection

Abstract

Genome-wide association studies (GWAS) explain only a fraction of the underlying heritability of genetic diseases. Investigating epistatic interactions between two or more loci helps to close this gap. Unfortunately, the sheer number of loci combinations to process and hypotheses to test makes the search prohibitive both computationally and statistically. Epistasis test prioritization algorithms rank likely epistatic single nucleotide polymorphism (SNP) pairs to limit the number of tests. However, they still suffer from very low precision. It was shown in the literature that selecting SNPs that are individually correlated with the phenotype and also diverse with respect to genomic location leads to better phenotype prediction due to genetic complementation. Here, we hypothesize that an algorithm that pairs SNPs from such diverse regions and ranks them can improve prediction power. We propose an epistasis test prioritization algorithm that optimizes a submodular set function to select a diverse and complementary set of genomic regions that span the underlying genome. The SNP pairs from these regions are then further ranked w.r.t. their co-coverage of the case cohort. We compare our algorithm with the state of the art on three GWAS and show that we (1) substantially improve precision (from 0.003 to 0.652) while maintaining the significance of selected pairs, (2) decrease the number of tests by 25-fold, and (3) decrease the runtime by 4-fold. We also show that promoting SNPs from regulatory/coding regions improves the performance (precision up to 0.8). Potpourri is available at http://ciceklab.cs.bilkent.edu.tr/potpourri

1. Introduction

Genome-wide association studies (GWAS) have been an important tool for susceptibility gene discovery in genetic disorders (Easton et al., 2007; Samani et al., 2007; Rivas et al., 2011). Analyzing single-locus associations has provided many valuable insights, but these associations alone do not account for the full heritability (Manolio et al., 2009). Investigating the interplay among multiple loci has helped to bridge the missing heritability gap. Such interactions between two or more loci are called epistasis, and they have a major role in complex genetic traits such as cancer (Wang et al., 2014) and neurodevelopmental disorders (Moore and Mitchell, 2015).

Exhaustive identification of interacting loci, even just pairs of loci, is potentially intractable for large GWAS (Cowman and Koyutürk, 2017). Moreover, such an approach lacks statistical power due to multiple hypothesis testing. Several methods have been developed to circumvent these problems. TEAM and BOOST are exhaustive models that exploit data structures and efficient data representation to improve the brute force performance (Wan et al., 2010; Zhang et al., 2010). However, these methods still perform many tests and do not scale for large tasks. For instance, BOOST reports a runtime of 60 hours for 360k SNPs. Another approach is to reduce the search space by filtering pairs based on a type of statistical threshold. Examples include SNPHarvester (Yang et al., 2008), SNPRuler (Wan et al., 2009), and IBBFS (Chuang et al., 2013). Despite their advantages, these methods mostly do not follow a biological reasoning and tend to detect interactions that are in linkage disequilibrium (LD) as noted in Piriyapongsa et al. (2012). On another track, incorporating biological background and testing the SNP pairs that are annotated has also proven useful (Baranzini et al., 2009; Holmans et al., 2009; Weng et al., 2011; Liu et al., 2012; Mckinney and Pajewski, 2012). However, this approach requires most SNPs to be discarded as many are quite far away from any gene to be associated. Moreover, this introduces a literature bias in the selection of SNPs.

A more popular approach is to prioritize the tests to be performed rather than discarding pairs from the search space and controlling for Type-I error. In this approach, the user can keep performing tests in the order specified by the algorithm until the desired number of significant pairs is found. The idea is to provide the user with a manageable number of true positives (TPs; statistically significant epistatic pairs) while minimizing the number of tests to preserve statistical power. The first algorithm of this kind is iLOCi (Piriyapongsa et al., 2012), which ranks SNP pairs by performing a dependence test and avoiding pairs that are unrelated to disease but might seem related due to LD.

This work was followed by Ayati and Koyutürk (2014), who proposed testing pairs of SNPs in population covering locus sets (POCOs). First, the algorithm greedily selects multiple exclusive groups of SNPs that cover all affected individuals. That is, each case sample has to have at least one SNP in each POCO. Epistasis tests are then performed across POCOs with the hope that independent coverage of the cases will lead different POCOs to include complementary and epistatic SNPs (Ayati and Koyutürk, 2014, 2016). Finally, Cowman and Koyutürk (2017) introduced the LINDEN algorithm. First, in a bottom-up fashion, the method generates SNP trees on greedily selected genomic regions (LD forest). Each node represents the genotypes of cases and controls for the SNPs in that node. Then, by comparing the roots of these trees, it decides whether this pair of regions is a promising candidate for the epistasis test. Lower-level nodes are checked in turn, and leaf pairs (individual SNPs) are tested only if the estimation at higher levels is promising. LINDEN was shown to achieve state-of-the-art results. Despite using various heuristics, all methods still have high false discovery rates. For instance, the false discovery rate (FDR) of LINDEN ranges from 0.96 to 0.998 on three GWAS from the Wellcome Trust Case Control Consortium (WTCCC), where the FDR is the fraction of detected reciprocally significant epistatic pairs that do not pass the significance threshold (Cowman and Koyutürk, 2017).

LD is an important source of information for epistasis prioritization algorithms. Two SNPs that appear to be interacting statistically might not be biologically meaningful if they are on the same haplotype block (Cordell and Clayton, 2005). For this reason, all three methods mentioned earlier focus on such regions and aim at avoiding testing pairs that are located in close vicinity of each other. In an orthogonal study, Yilmaz et al. (2019) propose a feature (SNP) selection algorithm that avoids LD for better phenotype prediction. The authors show that, when looking for a small set of loci (e.g., 100) that is the most predictive of a continuous phenotype, selecting SNPs that are far away from each other results in better predictive power. This method, SPADIS, is designed for feature selection for multiple regression. As the SNP set it generates contains diverse and complementary SNPs, it results in better R² values by covering more biological functions.

Inspired by this idea, we conjecture that selecting pairs of SNPs from genomic regions that (1) harbor individually informative SNPs, and (2) are diverse in terms of genomic location would avoid LD better and yield more functionally complementing and more epistatic SNP pairs compared with the current state of the art since no other algorithm exploits this information. We propose a new method that, for the first time, incorporates the genome location diversity with the population coverage density. Specifically, our proposed method, Potpourri, maximizes a submodular set function to select a set of genomic regions (1) that include SNPs that are individually predictive of the cases, and (2) that are distant from each other on the underlying genome. Epistasis tests are performed for pairs across these regions, such that pairs that densely co-cover the case cohort are given priority.

We validate our hypothesis and show that Potpourri can detect statistically significant and biologically meaningful epistatic SNP pairs. We perform extensive tests on three WTCCC GWA studies and compare our method with the state-of-the-art LINDEN algorithm. First, we guide LINDEN by pruning its search space using Potpourri-selected SNPs to show that (1) it is possible to significantly improve the precision (from 0.003 up to 0.302) and (2) our diversification approach is sound. Then, we show that ranking the diverse SNPs by their co-coverage of the case cohort further improves the prediction power and the precision (up to 0.652 in the selected setting). Potpourri drastically reduces the number of hypothesis tests to perform (from 380k to 15k), and yet it is still able to detect more epistatic pairs with similar significance levels in all three GWA studies considered. The running time is also cut by fourfold in the selected settings. Another problem with the current techniques is the biological interpretation of the obtained epistatic pairs. When the most significant SNP pairs returned are in noncoding regions and are too isolated to be associated with any gene, the user can hardly make sense of such a result despite statistical significance. We investigate the advantage of promoting SNPs that fall into regulatory and coding regions for testing and propose three techniques. We show that these techniques further improve the precision (up to 0.8) with a similar number of epistatic pairs detected.

Finally, we investigate the biological meaning of the detected SNP pairs. We find (1) an SNP pair that supports the hypothesis of a shared genetic architecture between type 2 diabetes (T2D) and chronic kidney disease; and (2) a pair that suggests a new candidate risk gene (NPW) for hypertension (HT) that has loose indications only in rat studies. Potpourri is available at http://ciceklab.cs.bilkent.edu.tr/potpourri

2. Methods

2.1. Notation

A GWAS dataset consists of genotypes of a set of samples S who are associated with a binary phenotype: Case or Control. Let $f(s)$ be an indicator function that corresponds to the phenotype of a sample $s \in S$ , that is: $f(s) = 1$ , if s is a case sample, and $f(s) = 0$ , otherwise. Function h, which represents the genotype of sample $s \in S$ at locus $v \in V$ , is encoded as: $h(s, v) = \begin{cases} 0, & \text{if the genotype of sample } s \text{ at } v \text{ is homozygous reference} \\ 1, & \text{if the genotype of sample } s \text{ at } v \text{ is heterozygous} \\ 2, & \text{if the genotype of sample } s \text{ at } v \text{ is homozygous alternative} \end{cases}$

One can generate an undirected SNP-SNP network $G (V', E)$ , where $V' \subseteq V$ and $v_{i} \in V'$ if $\exists s \in S$ such that $h (s, v_{i}) \neq 0$ . An edge $e_{v_{i}, v_{j}} \in E$ exists if $v_{i}, v_{j} \in V'$ are related. The definition of relatedness might change with respect to the application and the biological definition. In our setting, two SNPs are related if they are neighbors on the underlying genome.
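As an illustrative sketch of this notation (not the authors' implementation), the snippet below filters the loci that carry at least one non-reference genotype and links consecutive loci into a sequence network. The function names, the toy genotype matrix, and the assumption that columns are already sorted by genomic position on a single chromosome are ours.

```python
from typing import List, Set, Tuple

def variable_loci(genotypes: List[List[int]]) -> List[int]:
    """Indices of loci v with h(s, v) != 0 for at least one sample s (the set V')."""
    n_loci = len(genotypes[0]) if genotypes else 0
    return [v for v in range(n_loci) if any(row[v] != 0 for row in genotypes)]

def sequence_network(loci: List[int]) -> Set[Tuple[int, int]]:
    """Connect loci that are consecutive in genomic order (assumes `loci` is sorted)."""
    return set(zip(loci, loci[1:]))

# Toy data: rows are samples, columns are loci, entries follow the 0/1/2 encoding.
genotypes = [
    [0, 1, 0, 2],
    [0, 0, 0, 1],
    [0, 2, 0, 0],
]
v_prime = variable_loci(genotypes)   # loci with at least one non-reference genotype
edges = sequence_network(v_prime)    # edges between neighbouring retained loci
```

Here adjacency is computed among the retained loci only, which is one possible reading of "neighbors on the underlying genome".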

2.2. Selection of diverse and informative SNP regions

Regions harboring informative SNPs (ones that are correlated with the phenotype) and also far away on the underlying genome with respect to an SNP-SNP network are likely to yield a diverse and explanatory SNP set without introducing a literature bias. Our hypothesis is that those selected SNPs are likely to be epistatic, because SNPs selected using this technique lead to better phenotype prediction due to better genomic complementation (Yilmaz et al., 2019). Hence, picking pairs from such a set should yield highly epistatic SNP pairs and is well suited for epistasis test prioritization. In our application, the goal is not to predict the phenotype. Our SNP set selection approach differs from Yilmaz et al. (2019) in the calculation of the SKAT scores (Wu et al., 2011), as in epistasis test prioritization we are analyzing a GWA study in which the phenotype is dichotomous (the term $c_{v_{i}}$ in Eq. 1). Moreover, we introduce a hyper-parameter that can assign an extra artificial prize to SNPs that fall in regulatory or coding regions to give priority to SNPs that are more likely to have a functional effect (the term $\omega_{v_{i}}$ in Eq. 1). Also, node prizes (i.e., SKAT scores) are normalized by the set size to be able to compare performances of selected sets of different sizes (division of $c_{v_{i}} \omega_{v_{i}}$ by k in Eq. 1).

First, for every SNP $v_{i}$, a score $c_{v_{i}}$ is calculated by using the SKAT method, which works by regressing phenotypes on the covariates using a flexible semiparametric linear model (Wu et al., 2011). Instead of directly associating genotypes of the variants with the phenotype, SKAT uses a nonparametric function of the genotypes that is possibly contained in a vector space generated by a positive semi-definite kernel function. Given a user-specified number of SNPs to select (k), the second step selects a subset of loci $EP \subset V'$ that maximizes the sum of selected SNP scores while penalizing SNPs that are in close vicinity of each other. In particular, the function F shown in Equation (1) is maximized: $F (EP) = \sum_{v_{i} \in EP} \frac{c_{v_{i}} \omega_{v_{i}}}{k} + \beta \left(1 - \sum_{v_{i}, v_{j} \in EP; v_{i} \neq v_{j}} \frac{D - d (v_{i}, v_{j})}{2 k D}\right)$ (1)

In this equation, $D \in \mathbb{R}_{> 0}$ is a parameter that sets the upper limit on the distance at which two SNPs are considered close. $d (v_{i}, v_{j})$ is the distance between vertices $v_{i}, v_{j} \in EP$ on graph G. In this application, $d (v_{i}, v_{j})$ denotes the shortest path distance between $v_{i}$ and $v_{j}$ if that distance is at most D; otherwise, $d (v_{i}, v_{j}) = D$ , which cancels out the penalty term. The parameter $\beta \in \mathbb{R}_{\geq 0}$ adjusts the relative magnitudes of the prizes and the penalties. Finally, the parameter $\omega_{v_{i}} \in \mathbb{R}_{\geq 1}$ rewards $v_{i} \in V'$ if it falls into a regulatory or coding region of interest. If this information is not taken into consideration, $\omega_{v_{i}} = 1$ , $\forall v_{i} \in V'$ (see Section 2.3.3). The subset $EP^{*}$ that maximizes F is selected as the seed set for epistasis prioritization, as shown in Equation (2): $EP^{*} = \arg \max_{EP \subseteq V', |EP| = k} F (EP)$ (2)
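To make Equation (1) concrete, the following minimal sketch evaluates F for a candidate set. It rests on two assumptions of ours: SNPs are identified by their index along a single chromosome, so the shortest-path distance on the sequence network is the absolute index difference, and the penalty sum runs over ordered pairs, so each unordered pair contributes twice. The function and parameter names are hypothetical.

```python
from itertools import combinations
from typing import Dict, Sequence

def objective(ep: Sequence[int], score: Dict[int, float], weight: Dict[int, float],
              k: int, beta: float, big_d: float) -> float:
    """Evaluate F(EP) from Eq. (1), with d(v_i, v_j) capped at D."""
    # Prize term: weighted SKAT-like scores, normalised by the target set size k.
    prize = sum(score[v] * weight[v] for v in ep) / k
    # Penalty term: pairs closer than D are penalised; pairs at distance >= D add 0.
    penalty = 0.0
    for a, b in combinations(ep, 2):
        d = min(abs(a - b), big_d)
        penalty += 2 * (big_d - d) / (2 * k * big_d)  # ordered pairs (a, b) and (b, a)
    return prize + beta * (1 - penalty)

scores = {0: 1.0, 1: 1.0, 10: 1.0}
weights = {v: 1.0 for v in scores}   # omega = 1: no regulatory/coding promotion
value = objective([0, 1, 10], scores, weights, k=3, beta=1.0, big_d=5.0)
```

In this toy call only the pair (0, 1) is closer than D, so it is the only pair that lowers the objective.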

Given the set of SNPs $V'$ , the earlier-mentioned function $F : 2^{V'} \to \mathbb{R}$ is a submodular function (see Section 1.1 in Supplementary Data for the proof). Maximizing a submodular function under a cardinality constraint is non-deterministic polynomial-time hard (NP-hard). However, the greedy algorithm given in Section 1.2 in Supplementary Data, which iteratively adds to the set the SNP that maximizes F at each step, ensures a $(1 - \frac{1}{e})$ -factor approximation to the optimum solution (Nemhauser et al., 1978). This guarantee requires the submodular set function to be also monotone non-decreasing and non-negative. The proofs that F satisfies these properties are provided in Section 1.3 in Supplementary Data.
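The greedy scheme described above can be sketched as follows. This is a simplified illustration, not the algorithm from the Supplementary Data: it assumes SNPs indexed by genomic order on one chromosome, sets every $\omega_{v_{i}}$ to 1, uses the marginal gain implied by Equation (1) under an ordered-pair reading of the penalty sum, and breaks ties by the smallest index. All names are hypothetical.

```python
from typing import Dict, List

def greedy_select(score: Dict[int, float], k: int, beta: float, big_d: float) -> List[int]:
    """Iteratively add the SNP with the largest marginal gain in F until k are chosen."""
    selected: List[int] = []
    candidates = set(score)
    while len(selected) < k and candidates:
        def gain(v: int) -> float:
            # Prize of v minus the extra penalty for being close to already selected SNPs.
            closeness = sum(big_d - min(abs(v - u), big_d) for u in selected)
            return score[v] / k - beta * closeness / (k * big_d)
        best = max(sorted(candidates), key=gain)  # sorted() makes tie-breaking deterministic
        selected.append(best)
        candidates.remove(best)
    return selected

toy_scores = {0: 1.0, 1: 0.9, 2: 0.2, 8: 0.8}
anchors = greedy_select(toy_scores, k=2, beta=1.0, big_d=4.0)
```

In the toy input, SNP 1 has the second-highest score but sits next to SNP 0, so the distant SNP 8 is preferred as the second pick.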

Note that using only the SNPs in $EP^{*}$ would restrict the epistasis testing to SNPs that are individually correlated with the trait. We rather use those as anchors to detect genomic regions of interest. We define a region as a set of consecutive SNPs on the genome. After $EP^{*}$ is set, for every SNP $v_{i} \in EP^{*}$ , a region $R_{i}$ is formed such that it includes $v_{i}$ and the m upstream and m downstream SNPs of $v_{i}$. Thus, $| R_{i} | = 2 m + 1$ . R is the set of regions $R_{i}$, $\forall v_{i} \in EP^{*}$. Note that $| \bigcup_{R_{i} \in R} R_{i} | = 2 m k + k$ when the regions do not overlap.
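A minimal sketch of the region construction, assuming SNPs are indexed by genomic order on a single chromosome and that regions are clipped at the chromosome ends (the clipping rule and the names are our assumptions):

```python
from typing import Dict, List

def build_regions(anchors: List[int], m: int, n_loci: int) -> Dict[int, List[int]]:
    """Map each anchor SNP to its region: the anchor plus m upstream and m downstream SNPs."""
    return {
        v: list(range(max(0, v - m), min(n_loci, v + m + 1)))
        for v in anchors
    }

regions = build_regions([5, 40], m=2, n_loci=100)
union_size = len(set().union(*regions.values()))  # equals k * (2m + 1) only without overlap
```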

2.3. Prioritization of the tests

The region set R generates a pruned search space for finding likely epistatic SNP pairs. Within-region tests are avoided to prevent LD. That is, SNPs v_i and v_j are not tested if both $v_{i} \in R_{x}$ and $v_{j} \in R_{x}$ such that $R_{x} \in R$ . Still, the number of possible tests is $\frac{k * (k - 1)}{2} (2 m + 1)^{2}$ and a prioritization scheme is needed to rank candidate pairs for testing. We employ two strategies. The first method uses our SNP selection strategy to prune the search space for LINDEN to show that the selected regions are good candidates for epistasis testing (see Section 2.3.1). Instead of guiding LINDEN, in the second approach, we introduce a new method that aims at testing the pairs that co-occur in the case cohort (see Section 2.3.2).
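To give a sense of scale, the count above can be instantiated for the default setting reported later ($k = 750$, $m = 9$). The helper below is only a direct transcription of the formula and assumes full-size, non-overlapping regions; its name is hypothetical.

```python
def cross_region_tests(k: int, m: int) -> int:
    """Number of SNP pairs across distinct regions: C(k, 2) * (2m + 1)^2."""
    return k * (k - 1) // 2 * (2 * m + 1) ** 2

n_candidate_pairs = cross_region_tests(750, 9)
print(n_candidate_pairs)  # 101395875
```

Even after pruning, roughly 10^8 candidate pairs remain under these assumptions, which is why a further ranking step is needed.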

2.3.1. Guiding LINDEN for ranking SNP pairs

In this ranking strategy, we use LINDEN to rank the pairs selected in R. That is, we use Potpourri selected SNPs to guide LINDEN. Thus, in this strategy, Potpourri acts as a preprocessing step for LINDEN to prune its large search space. Doing so, we test whether our diverse SNP selection scheme is sound and whether it improves the performance of LINDEN.

LINDEN is given as input only the SNPs that fall into the regions in R. Then, the algorithm is run as described in Cowman and Koyutürk (2017). That is, the algorithm forms LD trees over greedily selected regions. The leaves of these trees represent actual SNPs, each of which is represented by a sample genotype vector (i.e., the genotype of this SNP for every sample). The tree is constructed in a bottom-up fashion, and nodes are merged to generate higher levels. During this process, sample genotype vectors are merged and ambiguous indices (i.e., samples with different genotypes) are assigned a NIL value. This merging step continues until a threshold d is met that denotes the fraction of NIL values allowed. This is a dynamically adjusted threshold that goes up with the number of iterations. Then, the tree pairs are tested with respect to the sample genotype vectors starting from the root, ignoring the NIL values. Lower levels are tested only if the significance of the chi-squared test meets a certain threshold. In line with our hypothesis on diversification of the regions, we prohibit LINDEN from merging SNPs from different regions. Thus, the tests are performed across regions in R.
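The merging of genotype vectors described above can be illustrated with a small sketch. It is not LINDEN's code: the NIL representation, the function names, and the fixed threshold (LINDEN adjusts it dynamically) are assumptions made for illustration.

```python
from typing import List, Optional

NIL = None  # marker for samples whose genotypes disagree between merged nodes

def merge_vectors(a: List[Optional[int]], b: List[Optional[int]]) -> List[Optional[int]]:
    """Keep a genotype where both vectors agree; mark the sample as NIL otherwise."""
    return [x if x == y else NIL for x, y in zip(a, b)]

def nil_fraction(vec: List[Optional[int]]) -> float:
    """Fraction of samples that became ambiguous after merging."""
    return sum(1 for x in vec if x is NIL) / len(vec)

merged = merge_vectors([0, 1, 2, 1], [0, 2, 2, 1])
can_merge = nil_fraction(merged) <= 0.45  # 0.45 is the d value used in the experiments
```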

2.3.2. Population co-covering for ranking SNP pairs

We propose a new strategy that aims at maximizing the population coverage of co-occurring SNPs. Population cover for epistasis test prioritization was first proposed in Ayati and Koyutürk (2014, 2016), where they select multiple exclusive groups of SNPs (POCOs) that cover the case cohort. That is, the union of the samples with the SNPs in a POCO should be equal to the case cohort. Epistasis tests are performed across POCOs, and the idea is that the independent coverage of the cases across POCOs will result in detecting complementary SNPs. We take a different approach in terms of covering the population. We would like the SNP pairs to be tested to co-cover the population. This is intuitive: the diverse selection step described in Section 2.2 finds complementing regions and avoids LD, but for an SNP pair to be epistatic, the two SNPs also need to be observed together in cases. More formally, let $p (v_{x}, v_{y})$ be a function that scores the SNP pair $v_{x}, v_{y} \in V'$ for testing. Given three SNPs $v_{x}, v_{y}$ and $v_{z} \in V'$ , $p (v_{x}, v_{y}) > p (v_{x}, v_{z})$ if and only if $v_{x}$ is observed more frequently with $v_{y}$ than with $v_{z}$ in cases. p is formally defined as follows: $p (v_{x}, v_{y}) = \sum_{s \in S; f (s) = 1} \mathbb{1}_{(h (s, v_{x}) \neq 0 \land h (s, v_{y}) \neq 0)}$ (3)

Potpourri computes the pairwise population co-covering of all SNP pairs $(\mathrm{SNP}_{i}, \mathrm{SNP}_{j})$ such that $\mathrm{SNP}_{i} \in R_{x}$ , $\mathrm{SNP}_{j} \in R_{y}$ , $\forall R_{x}, R_{y} \in R$ , $x \neq y$ , and $i \neq j$ . Testing is then performed in descending order of population co-cover. The algorithm is restricted to testing the top v pairs among all region pairs in R. See Section 2.3.3 for details of prioritization when regulatory/coding regions are considered.
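The co-cover score of Equation (3) and the resulting ranking can be sketched as below. The data layout (rows are samples, columns are SNP indices), the function names, and the tie-breaking rule are our assumptions; the cutoff is applied globally over all cross-region pairs, which is one possible reading of the restriction to the top v pairs.

```python
from itertools import combinations
from typing import List, Tuple

def co_cover(genotypes: List[List[int]], phenotype: List[int], vx: int, vy: int) -> int:
    """Eq. (3): number of case samples carrying a non-reference genotype at both SNPs."""
    return sum(
        1 for row, f in zip(genotypes, phenotype)
        if f == 1 and row[vx] != 0 and row[vy] != 0
    )

def rank_pairs(genotypes: List[List[int]], phenotype: List[int],
               regions: List[List[int]], top: int) -> List[Tuple[int, int]]:
    """Score every cross-region pair and return the `top` pairs by descending co-cover."""
    scored = []
    for rx, ry in combinations(regions, 2):      # never pair SNPs within one region
        for vx in rx:
            for vy in ry:
                if vx != vy:
                    scored.append((co_cover(genotypes, phenotype, vx, vy), vx, vy))
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))  # ties broken by SNP index
    return [(vx, vy) for _, vx, vy in scored[:top]]

genotypes = [[1, 0, 1, 0], [1, 0, 2, 0], [0, 1, 1, 1], [1, 1, 1, 1]]
phenotype = [1, 1, 1, 0]                          # last sample is a control
ranked = rank_pairs(genotypes, phenotype, regions=[[0, 1], [2, 3]], top=2)
```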

2.3.3. Promoting SNPs in regulatory and coding regions

SNPs can alter gene expression and the downstream function depending on their genomic location. Those that fall into regulatory regions can affect messenger RNA (mRNA) levels, and those that fall into coding regions can alter the structure and function of the protein. Since such SNPs are likely to alter the function and more likely to be related to the phenotype, we conjecture that we can find more statistically significant and biologically meaningful SNP pairs via promoting mutations in regulatory and coding regions. Although this might introduce a literature bias, it is a plausible choice for a life scientist who would like to narrow down the search space by using functional regions, and we therefore investigate the usefulness of this approach. However, one should not totally disregard the unannotated parts of the genome, as most of the variation exists in such regions. Thus, we seek a balance.

We employ three techniques to promote regulatory/coding variants. In the first approach (Potpourri RC1), we assign an artificial prize to SNPs in the diverse SNP selection phase of the algorithm, as described in Section 2.2. This approach favors regulatory/coding regions when forming the regions in R. Then, the pairs are tested with respect to the population co-cover ranking. The second approach (Potpourri RC2) uses $\omega_{v_{i}} = 1, \forall v_{i} \in V'$ . That is, the SNP selection step does not promote selection of coding/regulatory SNPs. However, among the selected SNPs, the prioritization step prefers pairs such that at least one SNP falls into regulatory/coding regions. Then, these pairs are ranked with respect to the population co-cover, as described in Section 2.3.2. Note that after all such pairs are tested, the algorithm tests the remaining pairs with respect to their population co-cover ranking. The final strategy (Potpourri RC3) combines both RC1 (promoting selection of SNPs in regulatory/coding regions) and RC2 (first testing pairs with at least one SNP in regulatory/coding regions) strategies.
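The RC2 ordering can be expressed as a two-level sort: pairs with at least one regulatory/coding SNP come first, and each group is ordered by descending co-cover. The sketch below assumes the co-cover scores are already computed; the function name, the input layout, and the final tie-breaker are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[int, int]

def rc2_order(co_cover_scores: Dict[Pair, int], rc_snps: Set[int]) -> List[Pair]:
    """Order pairs: regulatory/coding pairs first, then by descending co-cover score."""
    def key(pair: Pair):
        touches_rc = pair[0] in rc_snps or pair[1] in rc_snps
        # False sorts before True, so negate the flag to put RC pairs first.
        return (not touches_rc, -co_cover_scores[pair], pair)
    return sorted(co_cover_scores, key=key)

pair_scores = {(0, 2): 5, (1, 3): 9, (0, 3): 2}
order = rc2_order(pair_scores, rc_snps={2})
```

In the toy input, the pair (0, 2) is tested first despite its lower co-cover because SNP 2 lies in a regulatory/coding region.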

3. Results

3.1. Datasets

To benchmark Potpourri, we used three GWAS datasets obtained from WTCCC: (1) T2D, (2) HT, and (3) bipolar disorder (BD). We use the 1958 Birth (58C) cohort as control for all datasets (Craddock et al., 2010). The details of the preprocessing and quality control steps are given in Section 1.4 in Supplementary Data. The number of samples in each dataset is listed in Supplementary Table S1.

We used the following resources to obtain the regulatory and coding region information. We obtained the distant-acting transcriptional enhancer dataset from VISTA Enhancer Browser (Visel et al., 2007). VISTA enhancer dataset contains 1912 human noncoding fragments with gene enhancer activity. Transcription start sites (TSSs) and coding region coordinates were obtained from UCSC Genome Browser (Lindblad-Toh et al., 2011). We chose Ensembl Genes as a gene annotation track (Zerbino et al., 2017) and the March 2006 NCBI36/hg18 assembly of the human genome that matches the WTCCC datasets. The number and types of genes obtained from the Ensembl dataset are given in Supplementary Table S2. We defined 1 kb downstream and upstream of the TSSs as a regulatory region. The coding region for a gene begins from the first base of the first exon and continues to the last base of the last exon.

Potpourri operates on an SNP-SNP interaction network. In this study, we used the genomic sequence network as defined in Azencott et al. (2013). In this network, SNPs are connected if they are adjacent on the genome. This was the network of choice, as it was shown to provide the best regression performance in Yilmaz et al.

3.2. Experimental setup

We follow the experimental procedure in Yilmaz et al. (2019) and Azencott et al. (2013) for the SNP selection step. The parameters were selected by using a nested 10-fold cross-validation. First, the distance parameter D was selected via a line search among seven values (log-scale) between 1 and $D_{\max}$ (a value for which the distance penalty for all SNPs in the selected set is 0). The D value that maximizes the L2-regularized logistic regression performance was picked. Then, 16 $\beta$ values between $10^{-4}$ and a maximum of $\beta = 2 k D_{\max}$ were tried. Again, the $\beta$ value that maximizes the classification performance was picked. We experimented with the following k values: 500, 750, 1000, 1500, and 2000. Overall, the best results were obtained with $k = 750$ , which was set as the default. Then, we added $m = 9$ upstream and $m = 9$ downstream neighbors of the selected SNPs for further coverage analysis. Unless otherwise stated, we set $\omega_{v_{i}} = 1, \forall v_{i} \in V'$ . When the regulatory and coding regions were considered in the diverse SNP selection step, we set $\omega_{v_{i}} > 1$ , $\forall v_{i} \in RC$ , where RC contains the SNPs that fall into regulatory and coding regions. We experimented with three $\omega$ values: $1 + 10^{-0.5}$ , $1 + 10^{-0.25}$ , and 2.
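The parameter grids can be written down as a short configuration sketch. Several details are assumptions on our part: the value of `D_MAX` is a placeholder (it depends on the dataset), the grids include both endpoints, and the $\beta$ grid is log-spaced (the text states log-scale spacing only for D).

```python
from typing import List

def log_grid(low: float, high: float, n: int) -> List[float]:
    """n values spaced evenly on a log scale between low and high, endpoints included."""
    ratio = high / low
    return [low * ratio ** (i / (n - 1)) for i in range(n)]

D_MAX = 1000.0   # hypothetical placeholder; the real value is dataset dependent
K = 750          # default number of selected SNPs

d_grid = log_grid(1.0, D_MAX, 7)                   # seven candidate D values
beta_grid = log_grid(1e-4, 2 * K * D_MAX, 16)      # 16 candidate beta values (spacing assumed)
omega_values = [1 + 10 ** -0.5, 1 + 10 ** -0.25, 2.0]  # prizes for regulatory/coding SNPs
```

The first $\omega$ value evaluates to about 1.31623, which matches the setting quoted for the RC1 and RC3 panels of Figure 1.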

Unless otherwise stated, we used the suggested settings for LINDEN as described in Cowman and Koyuturk: d = 0.45 and $b = 10$ . When guiding LINDEN with our approach, we used default parameters for LINDEN. The only exception was that we limited the extent of LD in the first two iterations of the merging procedure of LINDEN, as explained in Section 2.3.1. That is, LINDEN's merging step could merge SNPs, forming trees only within selected regions that contain $2 m + 1$ SNPs ( $m = 9$ in our experiments). We used the earlier-mentioned parameter selection techniques for Potpourri. For the population co-cover, we performed an epistasis test for the top $v = 10$ SNP pairs among regions in R.

To quantify the performance of the proposed algorithms, we used precision as the evaluation metric in which TPs refer to the reciprocally significant epistatic pairs that pass the Bonferroni-adjusted statistical significance threshold. False positives (FPs) refer to failed tests: the reciprocally significant epistatic pairs that fail to pass the aforementioned threshold. Note that, in the epistasis test prioritization context, an FP does not mean a false rejection of the null hypothesis and falsely concluding that the pair has an epistatic interaction. Rather, it refers to a prioritized pair for testing that could not pass the corrected statistical significance threshold.
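The precision metric can be sketched as follows, using the 10% significance level of the experiments and a Bonferroni correction by the number of tests. The p-values in the example are made-up placeholders chosen only to reproduce the counts reported for Potpourri with $k = 750$ on T2D (15 of 23 reported pairs pass); the function name is hypothetical.

```python
from typing import List

def precision_at_bonferroni(p_values: List[float], n_tests: int, alpha: float = 0.1) -> float:
    """Fraction of reported pairs whose p-value passes the Bonferroni-adjusted threshold."""
    if not p_values:
        return 0.0
    threshold = alpha / n_tests
    true_positives = sum(1 for p in p_values if p <= threshold)
    return true_positives / len(p_values)

# Placeholder p-values: 15 pairs well below the threshold, 8 pairs above it.
reported = [1e-9] * 15 + [1e-3] * 8
precision = precision_at_bonferroni(reported, n_tests=14146)
```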

We use the definition of a reciprocally significant epistatic pair from Cowman and Koyutürk (2017). As the authors argue, most epistatic pairs are dominated by some hub SNPs, and this leads to detection of redundant pairs. In contrast, the SNPs in a reciprocal pair are the most epistatic partners for each other. We also use this definition to measure the performance of our algorithm. We set the significance level to 10% throughout the experiments and adjust it by using the Bonferroni correction based on the number of tests performed by each approach. Epistasis testing is performed via a chi-squared test on a 9 × 2 contingency table of all genotype combinations between cases and controls for a selected SNP pair (df = 8), as also done by Cowman and Koyutürk (2017). All tests are performed on an Intel Xeon E5-2650 v3 ten-core Haswell processor (2.3 GHz, 25M cache, 9.6 GT/s QPI). The parallel setting uses 251 GB of RAM.
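A standard-library sketch of the 9 × 2 chi-squared test is given below. It is an illustration rather than the authors' code: the function names are hypothetical, genotype combinations that never occur are skipped while df is kept at 8 as stated in the text, and the p-value uses the closed-form survival function of the chi-squared distribution for even degrees of freedom.

```python
import math
from typing import List, Sequence

def contingency_9x2(gx: Sequence[int], gy: Sequence[int], phenotype: Sequence[int]) -> List[List[int]]:
    """Rows: the 9 genotype combinations (3 * g_x + g_y); columns: control (0), case (1)."""
    table = [[0, 0] for _ in range(9)]
    for a, b, f in zip(gx, gy, phenotype):
        table[3 * a + b][f] += 1
    return table

def chi2_statistic(table: List[List[int]]) -> float:
    """Pearson chi-squared statistic; cells with zero expected count are skipped."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(row[j] for row in table) for j in range(2)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat

def chi2_sf_df8(x: float) -> float:
    """P(X >= x) for df = 8: exp(-x/2) * sum_{j=0}^{3} (x/2)^j / j!."""
    h = x / 2.0
    return math.exp(-h) * (1.0 + h + h * h / 2.0 + h ** 3 / 6.0)

table = contingency_9x2([0, 1, 2], [0, 1, 2], [0, 1, 1])
p_value = chi2_sf_df8(chi2_statistic(table))
```

In practice, a library routine for contingency tables would normally replace the hand-written statistic; the explicit version is shown only to make the df = 8 computation visible.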

3.3. Guiding LINDEN with diverse and informative regions improves precision

We quantified the improvement in precision when LINDEN is guided by Potpourri-selected regions. First, we ran LINDEN on all datasets and it detected 1786, 885, and 1022 reciprocally significant epistatic pairs for T2D, BD, and HT datasets, respectively. Only 5, 35, and 2 pairs were statistically significant at the 10% level after Bonferroni correction, respectively. These numbers correspond to precision values of 0.003, 0.04, and 0.002, respectively. These results set our baseline. Then, we ran Potpourri on these datasets with top k SNPs, where $k = 500, 750, 1000, 1500,$ and 2000. We obtained five R sets. We then guided LINDEN with these regions and ran it as explained in Section 2.3.1. The SNP pairs selected by Potpourri-guided LINDEN are listed in Supplementary Tables S30, S31, and S32.

Complete results are shown in Table 1 and Supplementary Tables S3 and S4 for T2D, BD, and HT datasets, respectively. The guidance of Potpourri improves the precision substantially, from 0.003 to 0.302 when $k = 750$ in the T2D dataset (up to 0.421 when $k = 500$ ). This is achieved by drastically reducing the number of FPs while also increasing the number of TPs. Our pipeline outperforms LINDEN for all k values on all datasets, but we observe that the ideal k values are 500, 750, and 1000. Similar precision increases are also observed for BD and HT datasets. Statistics on the number of tests performed and trees formed by LINDEN and Potpourri-guided LINDEN are shown in Supplementary Table S6.

Table 1.

Results for Type 2 Diabetes Dataset That Compares LINDEN, Potpourri-Guided LINDEN, and Potpourri (with Population Co-Cover)

Method	No. of tested loci	No. of reported pairs (no. passing the Bonferroni-corrected threshold)	Precision
LINDEN	378,016	1786 (5)	0.003
Potpourri-guided LINDEN
k = 500	9250	19 (8)	0.421
k = 750	14,146	43 (13)	0.302
k = 1000	18,943	55 (21)	0.382
k = 1500	28,014	79 (18)	0.228
k = 2000	37,927	95 (11)	0.116
Potpourri
k = 500	9250	12 (8)	0.667
k = 750	14,145	23 (15)	0.652
k = 1000	18,943	31 (20)	0.645
k = 1500	28,014	50 (21)	0.420
k = 2000	37,927	74 (18)	0.243

The number of pairs reported is the total number of reciprocally significant pairs returned by each method for a varying number of selected SNPs. The number in parentheses denotes the significant pairs passing the significance threshold (0.1) after Bonferroni correction based on the number of tests performed by each method. Bold denotes the best result for a given k value. The table shows that the guidance of Potpourri substantially improves the precision of LINDEN. For all considered k values, Potpourri provides the best precision values.

The significance levels of each epistasis test performed by LINDEN and Potpourri-guided LINDEN are shown in Figure 1 and Supplementary Figures S1 and S2 for the T2D, BD, and HT datasets, respectively. The green lines denote the significance level (0.1) to be passed for each approach ( $k = 750$ ) after Bonferroni correction. Each point represents an SNP pair, and the ones below the threshold are FPs whereas the ones above are TPs (reciprocally significant epistatic SNP pairs). It is clear that the pipeline drastically reduces the number of FPs while increasing the number of TPs. Also, we can observe the importance of the number of tests performed by looking at the difference between the Bonferroni-corrected significance thresholds.

FIG. 1.

On the T2D dataset, this figure compares the p-values of the SNP pairs selected by the following methods: (1) LINDEN, (2) Potpourri-guided LINDEN, (3) Potpourri; and the variants of Potpourri that further promote SNP pairs in regulatory and coding regions: (4) Potpourri RC1, (5) Potpourri RC2, and (6) Potpourri RC3. We show the significance levels (y-axis) of each reported pair (dots) given the Bonferroni-corrected significance threshold (0.1, green dashed lines). The x-axis values are randomly assigned to pairs for better visualization. Potpourri is run with $k = 750$ for all five related subplots. For RC1 and RC3 strategies, $\omega$ is set to $1.31623$ ; for Potpourri and RC2, it is set to 1. LINDEN is run with default parameters. For Potpourri-guided LINDEN, tree formation is restricted to the individual Potpourri-selected regions in R, as described in Section 2.3.1. T2D, type 2 diabetes.

Since Potpourri provides a pruned search space for LINDEN by eliminating SNPs that are most likely irrelevant to the trait, it also reduces the number of tests that will be performed during epistasis testing. Indirectly, it mitigates the negative effect of multiple hypothesis testing, which reduces the statistical power. As seen in Figure 1 and Supplementary Figures S1 and S2, owing to a less stringent Bonferroni threshold, the pipeline is able to discover more TPs compared with LINDEN. We also show that our pipeline not only reduces the number of FPs but is also able to maintain the significance level of the returned pairs. LINDEN's top SNP pair is more significant, but guided LINDEN's top SNP pairs still stand out in terms of their p-values. We compared the sets of SNP pairs detected by LINDEN and Potpourri-guided LINDEN and found that the former is not a superset of the latter. Potpourri forces LINDEN to form trees on its selected regions of $2 m + 1$ SNPs, and tests are performed across these trees. LINDEN, in contrast, follows a greedy approach to form trees in a bottom-up fashion. Thus, the trees it forms are potentially much different (e.g., deeper, covering more loci) from those of the guided version, and the returned pairs are different.

In short, by just providing guidance to LINDEN by using Potpourri's diverse and informative SNP region selection, we were able to improve the performance of the state of the art with a substantially smaller number of tests performed. Next, we provide the results of the complete Potpourri pipeline that uses a population co-cover strategy for prioritization, instead of LINDEN's tree-based strategy.

3.4. Comparison of Potpourri with the state of the art

In this section, we evaluate the performance of the Potpourri pipeline with the population co-covering technique. Again, we compare the performance with the LINDEN algorithm. Table 1 shows the results of the algorithm for different k values on the T2D dataset. We observe that the population co-cover ranking strategy yields further performance improvements. Precision rises to 0.652, as 15 out of 23 reciprocally significant pairs pass the Bonferroni-corrected threshold ($k = 750$). See Supplementary Tables S3 and S4 for the results on the BD and HT datasets, respectively, which follow the same pattern.

LINDEN performs 7,629,272,394 tests (378,016 tests for leaves), as opposed to 14,146 tests performed by Potpourri. The larger number of tests sets a much more conservative significance threshold for LINDEN. One could argue that the precision gain is only due to Potpourri's less stringent significance threshold. However, Figure 1 shows that this is not the case. The third panel shows the significance levels of Potpourri-selected SNP pairs on the T2D dataset. We see that 6 out of 15 Potpourri-selected reciprocally significant pairs are significant even when the Bonferroni-corrected threshold of LINDEN is considered (dashed green line on the left-most panel, which is far more conservative). Note that LINDEN detects only five reciprocally significant pairs. We also observe that the number of FPs is decreased even further compared with Potpourri-guided LINDEN. See Supplementary Figures S1 and S2 for the results on the BD and HT datasets, respectively, which again show a similar pattern.
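To make the difference between the two thresholds concrete, the following minimal sketch computes the Bonferroni-corrected per-test cutoffs from the test counts quoted above. It assumes that the cutoff is the family-wise level (0.1) divided by the total number of tests of each method; the function and variable names are illustrative and are not taken from the Potpourri or LINDEN code.

```python
# Family-wise significance level used in the text (0.1).
ALPHA = 0.1


def bonferroni_threshold(num_tests: int, alpha: float = ALPHA) -> float:
    """Per-test p-value cutoff after Bonferroni correction (alpha / number of tests)."""
    if num_tests <= 0:
        raise ValueError("num_tests must be positive")
    return alpha / num_tests


# Test counts quoted in the text for the T2D dataset.
# Assumption: the total number of tests is what enters the correction.
linden_tests = 7_629_272_394
potpourri_tests = 14_146

linden_cutoff = bonferroni_threshold(linden_tests)
potpourri_cutoff = bonferroni_threshold(potpourri_tests)

print(f"LINDEN cutoff:    {linden_cutoff:.3e}")
print(f"Potpourri cutoff: {potpourri_cutoff:.3e}")
# How many times stricter LINDEN's cutoff is than Potpourri's.
print(f"Ratio:            {potpourri_cutoff / linden_cutoff:,.0f}")
```

Under these assumptions, a pair must reach a p-value several orders of magnitude smaller to be called significant at LINDEN's cutoff than at Potpourri's, which is why pairs that pass both cutoffs are the stronger evidence.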

We also checked whether adjusting LINDEN's parameters to make it more conservative would improve the precision. Increasing d to 0.9 and to 0.99 forced it to perform fewer tests, but still far too many to approach Potpourri's efficiency, as shown in Supplementary Table S5. We also checked whether we would get better results by testing only SPADIS-selected SNPs. For each dataset, we performed pairwise epistasis tests on all SPADIS-selected SNPs ($k = 750$). In each dataset, only one reciprocally significant pair is detected.

3.5. Promoting regulatory and coding regions improves precision

In this section, we investigate the advantage of promoting coding and regulatory regions in the Potpourri pipeline. For the T2D dataset, Table 2 compares the performance of the original Potpourri pipeline with the three variants (RC1, RC2, and RC3) that promote the selection of SNPs from coding and regulatory regions, as described in Section 2.3.3. Results show that in the suggested setting ($k = 750$) the RC1 technique increases the precision to 0.8, a value that the original Potpourri pipeline does not reach for any of the k values (Table 2). The last three panels in Figure 1 show the significance levels of tested pairs. For the RC1 technique, we see that although we keep the most significant pairs detected by the original pipeline, we also discover new significant pairs (i.e., $p < 10^{-15}$). Moreover, two of the three pairs that failed to pass the significance threshold are borderline. Seven out of 15 reciprocally significant pairs are also significant with respect to LINDEN's stringent significance threshold, whereas LINDEN was able to detect only 5 such pairs and Potpourri was able to detect 6. Supplementary Tables S11 and S14 provide similar results for the BD and HT datasets, respectively. The margin of improvement is relatively low in the HT dataset (up to 0.2). Supplementary Figures S1 and S2 compare the significance levels of the detected pairs for the BD and HT datasets, respectively.

Table 2.

Comparison of the Potpourri Results with Three Strategies (RC1, RC2, and RC3) to Promote Regulatory and Coding Regions on the Type 2 Diabetes Dataset

Method	k	No. of tested loci	No. of reported reciprocally significant pairs (no. passing the threshold)	Precision
Potpourri	500	9,250	12 (8)	0.667
Potpourri	750	14,145	23 (15)	0.652
Potpourri	1000	18,943	31 (20)	0.645
Potpourri RC1	500	9,489	17 (13)	0.765
Potpourri RC1	750	13,727	15 (12)	0.800
Potpourri RC1	1000	15,500	21 (12)	0.571
Potpourri RC2	500	9,250	13 (7)	0.538
Potpourri RC2	750	14,145	24 (14)	0.583
Potpourri RC2	1000	18,943	33 (23)	0.697
Potpourri RC3	500	9,489	18 (11)	0.611
Potpourri RC3	750	13,727	15 (11)	0.733
Potpourri RC3	1000	15,500	20 (13)	0.650

For RC1 and RC3, $ω$ is set to 1.31623. The number of pairs reported is the total number of reciprocally significant pairs. The number in parentheses denotes the pairs that pass the significance threshold (0.1) after Bonferroni correction based on the number of tests performed by each method. Precision is the ratio of the latter to the former. Results show that promoting regulatory and coding regions improves Potpourri results substantially in all strategies, with RC1 performing the best. The best precision for a given k value is obtained by RC1 for k = 500 and k = 750 and by RC2 for k = 1000.
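The precision values in Table 2 can be re-derived from the reported counts. The sketch below assumes that precision is the number of pairs passing the Bonferroni-corrected threshold divided by the number of reported reciprocally significant pairs, which reproduces every value in the table; the dictionary layout and function names are illustrative only.

```python
# (method, k) -> (reported reciprocally significant pairs, pairs passing the threshold)
# Counts are copied from Table 2 (T2D dataset).
TABLE2 = {
    ("Potpourri", 500): (12, 8),
    ("Potpourri", 750): (23, 15),
    ("Potpourri", 1000): (31, 20),
    ("Potpourri RC1", 500): (17, 13),
    ("Potpourri RC1", 750): (15, 12),
    ("Potpourri RC1", 1000): (21, 12),
    ("Potpourri RC2", 500): (13, 7),
    ("Potpourri RC2", 750): (24, 14),
    ("Potpourri RC2", 1000): (33, 23),
    ("Potpourri RC3", 500): (18, 11),
    ("Potpourri RC3", 750): (15, 11),
    ("Potpourri RC3", 1000): (20, 13),
}


def precision(reported: int, passing: int) -> float:
    """Fraction of reported pairs that pass the corrected threshold."""
    return passing / reported


def best_method_per_k(table):
    """Return {k: (method, precision)} for the highest-precision method at each k."""
    best = {}
    for (method, k), (reported, passing) in table.items():
        p = precision(reported, passing)
        if k not in best or p > best[k][1]:
            best[k] = (method, p)
    return best


for k, (method, p) in sorted(best_method_per_k(TABLE2).items()):
    print(f"k = {k}: {method} ({p:.3f})")
```

Comparing methods at a fixed k, as done here, keeps the comparison on the same number of selected regions.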

We experimented with three ω values and selected 1.31623, the best performing one, as the suggested value. See Supplementary Tables S7–S10, S12, and S13 for all results obtained by using other parameter choices. The SNP pairs selected by all Potpourri variants with the suggested settings ($k = 750$ and $ω = 1$ or $1.31623$) are listed in Supplementary Tables S15–S26. We also compared the reciprocally significant epistatic SNP pairs returned by LINDEN and Potpourri RC1 (suggested setting) and found no overlap. The search space is vast and the two algorithms take different approaches, which leads to different result sets. The SNP pairs selected by LINDEN are listed in Supplementary Tables S27–S29, and those selected by Potpourri-guided LINDEN in Supplementary Tables S29–S32, for the T2D, BD, and HT datasets, respectively. Finally, we show that in the suggested setting of Potpourri RC1 (k = 750) the runtime is decreased by 4-fold compared with LINDEN, as shown in Supplementary Tables S33–S35.

3.6. Novel epistatic pairs

One interesting pair discovered by Potpourri is the rs12548378–rs10787472 pair, which is detected for T2D. The former falls into the TCF7L2 gene, and the latter falls into the long intergenic noncoding RNA LINC01111. On one hand, TCF7L2 is a well-known T2D susceptibility gene that affects the blood glucose balance by regulating proglucagon gene expression through the WNT signaling pathway (Grant et al., 2006). Its other epistatic interactions have also been reported in the literature (An et al., 2009). On the other hand, LINC01111 was shown to be related to urine creatinine level in UK Biobank samples (genome-wide significant, $p = 5.28 \times 10^{-9}$) (Zanetti et al., 2018). An elevated urine albumin-to-creatinine ratio is a biomarker for diabetic nephropathy (DN). DN occurs as a result of the damage to renal nerves due to high glucose levels in blood (Bouhairie and McGill, 2016). Although T2D causes nephropathy through nerve damage, it has long been debated whether the two might have overlapping genetic architectures. At least four T2D loci are associated with kidney function in American Indians, and two are related to kidney disease partially independently of T2D (Franceschini et al., 2012). Hussain et al. (2014) report that a variant in TCF7L2 causes renal dysfunction, but this effect is not independent of T2D in the south Indian population. Buraczynska et al. (2014) report that a variant in TCF7L2 is significantly more frequent in patients with DN than in those without DN.

Finally, an interaction between the EGFR and RXRG genes, which increases the risk of DN in a T2D cohort, was reported (Hsieh et al., 2006). Thus, this new epistatic interaction further supports the hypothesis of a shared genetic architecture between these two diseases and that T2D-related loci can also genetically increase the risk for DN. In the HT data, we detect an epistatic interaction between rs34585560 in the MECOM gene and rs8051877 in the NPW gene. MECOM is a transcriptional regulator and a well-known risk gene for high blood pressure, as detected in a large-scale GWAS (200k samples) (Ehret et al., 2011). However, not much is known about the mechanism through which MECOM affects blood pressure (Coffman, 2011). NPW is a G-coupled protein activator that regulates neuroendocrine signals. Although its ties to HT in humans are loose, Pate et al. (2013) and Yu et al. (2007) demonstrate in rats that exogenously applied NPW increases mean arterial pressure. Thus, this new epistatic pair suggests NPW as a new risk gene candidate for HT, coupled with the effect of MECOM.

4. Conclusion

The detection of epistatic interactions is a promising approach for understanding the genetic underpinnings of complex traits. However, the large number of hypotheses to test prohibits a brute-force investigation of the search space. We proposed a new test prioritization technique and showed that selecting SNPs that are individually informative and diverse in terms of genomic location leads to the detection of statistically significant epistatic interactions. Our approach performs favorably compared with the state of the art.

Acknowledgments

The authors would like to thank WTCCC and the study participants. They used SPADIS and LINDEN code in their implementation. They are grateful to S. Yilmaz, T. Cowman, and M. Koyuturk for the source code.

Author Disclosure Statement

The authors declare they have no competing financial interests.

Funding Information

This work has been supported by TUBITAK via Career Grant number 116E148 to A.E.C. A.E.C. acknowledges the support of the TUBA GEBIP 2017 award. O.T. and A.E.C. acknowledge the support of Bilim Akademisi BAGEP awards.

Supplementary Material

References

An, Feitosa, Ketkar. 2009. Epistatic interactions of CDKN2B-TCF7L2 for risk of type 2 diabetes and of CDKN2B-JAZF1 for triglyceride/high-density lipoprotein ratio longitudinal change: evidence from the Framingham Heart Study. BMC Proc. 3, S71.

Ayati and Koyutürk. 2014. Prioritization of genomic locus pairs for testing epistasis. Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, ACM, 240–248.

Ayati and Koyutürk. 2016. PoCos: population covering locus sets for risk assessment in complex diseases. PLoS Comput. Biol. 12, e1005195.

Azencott C.-A., Grimm, Sugiyama. 2013. Efficient network-guided multi-locus association mapping with graph cuts. Bioinformatics 29, i171–i179.

Baranzini S.E., Galwey, Wang. 2009. Pathway and network-based analysis of genome-wide association studies in multiple sclerosis. Hum. Mol. Genet. 18, 2078–2090.

Bouhairie V.E. and McGill J.B. 2016. Diabetic kidney disease. Missouri Med. 113, 390.

Buraczynska, Zukowski, Ksiazek. 2014. Transcription factor 7-like 2 (TCF7L2) gene polymorphism and clinical phenotype in end-stage renal disease patients. Mol. Biol. Rep. 41, 4063–4068.

Chuang L.-Y., Yang. 2013. Improved branch and bound algorithm for detecting SNP-SNP interactions in breast cancer. J. Clin. Bioinforma. 3, 4.

Coffman T.M. 2011. Under pressure: the search for the essential mechanisms of hypertension. Nat. Med. 17, 1402.

Cordell H.J. and Clayton D.G. 2005. Genetic association studies. Lancet 366, 1121–1131.

Cowman and Koyutürk. 2017. Prioritizing tests of epistasis through hierarchical representation of genomic redundancies. Nucleic Acids Res. 45, e131.

Craddock N.J., Hurles, Niall JC. 2010. Genome-wide association study of CNVs in 16,000 cases of eight common diseases and 3,000 shared controls. Nature 464, 713–720.

Easton D.F., Pooley K.A., Dunning A.M. 2007. Genome-wide association study identifies novel breast cancer susceptibility loci. Nature 447, 1087.

Ehret G.B., Munroe P.B., Rice K.M. 2011. Genetic variants in novel pathways influence blood pressure and cardiovascular disease risk. Nature 478, 103.

Franceschini, Shara N.M., Wang. 2012. The association of genetic variants of type 2 diabetes with kidney function. Kidney Int. 82, 220–225.

Grant S.F., Thorleifsson, Reynisdottir. 2006. Variant of transcription factor 7-like 2 (TCF7L2) gene confers risk of type 2 diabetes. Nat. Genet. 38, 320.

Holmans, Green E.K., Pawha J.S. 2009. Gene ontology analysis of GWA study data sets provides insights into the biology of bipolar disorder. Am. J. Hum. Genet. 85, 13–24.

Hsieh C.-H., Liang, Hung. 2006. Analysis of epistasis for diabetic nephropathy among type 2 diabetic patients. Hum. Mol. Genet. 15, 2701–2708.

Hussain, Ramachandran, Ravi. 2014. TCF7L2 rs7903146 polymorphism and diabetic nephropathy association is not independent of type 2 diabetes—a study in a south Indian population and meta-analysis. Endokrynol. Pol. 65, 298–305.

Lindblad-Toh, Garber, Zuk. 2011. A high-resolution map of human evolutionary constraint using 29 mammals. Nature 478, 476–482.

Liu, Maxwell, Feng. 2012. Gene, pathway and network frameworks to identify epistatic interactions of single nucleotide polymorphisms derived from GWAS data. BMC Syst. Biol. 6, S15.

Manolio T.A., Collins, Cox. 2009. Finding the missing heritability of complex diseases. Nature 461, 747.

McKinney and Pajewski. 2012. Six degrees of epistasis: statistical network models for GWAS. Front. Genet. 2, 109.

Moore J.H. and Mitchell K.J. (ed) 2015. The role of genetic interactions in neurodevelopmental disorders, 69–80. In The Genetics of Neurodevelopmental Disorders. John Wiley & Sons, Inc., Hoboken, NJ.

Nemhauser G.L., Wolsey L.A., Fisher M.L. 1978. An analysis of approximations for maximizing submodular set functions—I. Math. Program. 14, 265–294.

Pate A.T., Yosten G.L.C., Samson W.K. 2013. Neuropeptide W increases mean arterial pressure as a result of behavioral arousal. Am. J. Physiol. Regul. Integr. Comp. Physiol. 305, R804–R810.

Piriyapongsa, Ngamphiw, Intarapanich. 2012. iLOCI: a SNP interaction prioritization technique for detecting epistasis in genome-wide association studies. BMC Genomics 13, S2.

Rivas M.A., Beaudoin, Gardet. 2011. Deep resequencing of GWAS loci identifies independent rare variants associated with inflammatory bowel disease. Nat. Genet. 43, 1066.

Samani N.J., Erdmann, Hall A.S. 2007. Genomewide association analysis of coronary artery disease. N. Engl. J. Med. 357, 443–453.

Visel, Minovitsky, Dubchak. 2007. Vista enhancer browser—a database of tissue-specific human enhancers. Nucleic Acids Res. 35, D88–D92.

Wan, Yang. 2009. Predictive rule inference for epistatic interaction detection in genome-wide association studies. Bioinformatics 26, 30–37.

Wan, Yang. 2010. Boost: a fast approach to detecting gene-gene interactions in genome-wide case-control studies. Am. J. Hum. Genet. 87, 325–340.

Wang, Fu A.Q., McNerney M.E. 2014. Widespread genetic epistasis among cancer genes. Nat. Commun. 5, 4828.

Weng, Macciardi, Subramanian. 2011. SNP-based pathway enrichment analysis for genome-wide association studies. BMC Bioinformatics 12, 99.

Wu M.C., Lee S., Cai T. 2011. Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet. 89, 82–93.

Yang, He, Wan. 2008. Snpharvester: a filtering-based approach for detecting epistatic interactions in genome-wide association studies. Bioinformatics 25, 504–511.

Yilmaz, Tastan, Cicek A.E. 2019. SPADIS: an algorithm for selecting predictive and diverse SNPs in GWAS. IEEE/ACM Trans. Comput. Biol. Bioinform. (in press).

Yu, Chu, Kunitake. 2007. Cardiovascular actions of central neuropeptide W in conscious rats. Regul. Pept. 138, 82–86.

Zanetti, Rao, Gustafsson. 2018. Genetic analyses in UK biobank identifies 78 novel loci associated with urinary biomarkers providing new insights into the biology of kidney function and chronic disease. bioRxiv 315259.

Zerbino D.R., Achuthan, Akanni. 2017. Ensembl 2018. Nucleic Acids Res. 46, D754–D761.

Zhang, Huang, Zou. 2010. Team: efficient two-locus epistasis tests in human genome-wide association study. Bioinformatics 26, i217–i227.
