Abstract
Genome-wide association studies (GWAS) explain a fraction of the underlying heritability of genetic diseases. Investigating epistatic interactions between two or more loci help to close this gap. Unfortunately, the sheer number of loci combinations to process and hypotheses prohibit the process both computationally and statistically. Epistasis test prioritization algorithms rank likely epistatic single nucleotide polymorphism (SNP) pairs to limit the number of tests. However, they still suffer from very low precision. It was shown in the literature that selecting SNPs that are individually correlated with the phenotype and also diverse with respect to genomic location leads to better phenotype prediction due to genetic complementation. Here, we propose that an algorithm that pairs SNPs from such diverse regions and ranks them can improve prediction power. We propose an epistasis test prioritization algorithm that optimizes a submodular set function to select a diverse and complementary set of genomic regions that span the underlying genome. The SNP pairs from these regions are then further ranked w.r.t. their co-coverage of the case cohort. We compare our algorithm with the state of the art on three GWAS and show that (1) we substantially improve precision (from 0.003 to 0.652) while maintaining the significance of selected pairs, (2) decrease the number of tests by 25-fold, and (3) decrease the runtime by 4-fold. We also show that promoting SNPs from regulatory/coding regions improves the performance (up to 0.8). Potpourri is available at http:/ciceklab.cs.bilkent.edu.tr/potpourri
1. Introduction
Genome-wide association studies (GWAS) have been an important tool for susceptibility gene discovery in genetic disorders (Easton et al., 2007; Samani et al., 2007; Rivas et al., 2011). Analyzing single loci associations have provided many valuable insights but they alone do not account for the full heritability (Manolio et al., 2009). Investigating the interplay among multiple loci has helped to bridge the missing heritability gap. Such interactions between two or more loci is called epistasis, and it has a major role in complex genetic traits such as cancer (Wang et al., 2014) and neurodevelopmental disorders (Moore and Mitchell, 2015).
Exhaustive identification of interacting loci, even just pairs of loci, is potentially intractable for large GWAS (Cowman and Koyutürk, 2017). Moreover, such an approach lacks statistical power due to multiple hypothesis testing. Several methods have been developed to circumvent these problems. TEAM and BOOST are exhaustive models that exploit data structures and efficient data representation to improve the brute force performance (Wan et al., 2010; Zhang et al., 2010). However, these methods still perform many tests and do not scale for large tasks. For instance, BOOST reports a runtime of 60 hours for 360k SNPs. Another approach is to reduce the search space by filtering pairs based on a type of statistical threshold. Examples include SNPHarvester (Yang et al., 2008), SNPRuler (Wan et al., 2009), and IBBFS (Chuang et al., 2013). Despite their advantages, these methods mostly do not follow a biological reasoning and tend to detect interactions that are in linkage disequilibrium (LD) as noted in Piriyapongsa et al. (2012). On another track, incorporating biological background and testing the SNP pairs that are annotated has also proven useful (Baranzini et al., 2009; Holmans et al., 2009; Weng et al., 2011; Liu et al., 2012; Mckinney and Pajewski, 2012). However, this approach requires most SNPs to be discarded as many are quite far away from any gene to be associated. Moreover, this introduces a literature bias in the selection of SNPs.
A rather more popular approach is to prioritize the tests to be performed rather than discarding pairs from the search space and controlling for Type-I error. In this approach, the user can keep performing tests in the order specified by the algorithm until the desired number of significant pairs are found. The idea is to provide the user with a manageable number of true positives (TPs; statistically significant epistatic pairs) while minimizing the number of tests to ensure statistical power. The first algorithm of this kind is iLOCi (Piriyapongsa et al., 2012), which ranks SNP pairs by performing a dependence test and avoiding pairs that are unrelated to disease but might seem related due to LD.
This work was followed by Ayati and Koyuturk (2014), who proposed testing pairs of SNPs in population covering locus sets—POCOs. First, the algorithm greedily selects multiple exclusive groups of SNPs that cover all affected individuals. That is, each case sample has to have at least one SNP in each POCO. Epistasis tests are then performed across POCOs with the hope that independent coverage of the cases will lead different POCOs to include complementary and epistatic SNPs (Ayati and Koyutürk, 2014, 2016). Finally, Cowman and Koyuturk (2017) introduced the LINDEN algorithm. First, in a bottom-up fashion, the method generates SNP trees on greedily selected genomic regions (LD forest). Each node represents the genotypes of cases and controls for the SNPs in that node. Then, by comparing the roots of these trees, it decides whether this pair of regions is a promising candidate for the epistasis test. Nodes in lower levels are continued to be checked, and leaf pairs (individual SNPs) are tested only if the estimation at higher levels is promising. LINDEN was shown to achieve the state-of-the-art results. Despite using various heuristics, all methods still have high false discovery rates. For instance, the false discovery rate (FDR) of LINDEN ranges from 0.96 to 0.998 on three GWAS from Wellcome Trust Case Control Consortium (WTCCC)—the ratio of significant pairs to the number of detected (reciprocally) significant epistatic pairs (Cowman and Koyutürk, 2017).
The LD is an important source of information for epistasis prioritization algorithms. Two SNPs that appear to be interacting statistically might not be biologically meaningful if they are on the same haplotype block Cordell and Clayton (2005). For this reason, all three methods mentioned earlier focus on such regions and aim at avoiding testing pairs that are located in close vicinity of each other. In an orthogonal study, Yilmaz et al. (2019) propose a feature (SNP) selection algorithm, which avoids LD for better phenotype prediction. The authors show that while looking for a small set of loci (i.e., 100) that is the most predictive of a continuous phenotype, selecting SNPs that are far away from each other, results in better predictive power. This method, SPADIS, is designed for feature selection for multiple regression. As the SNP set it generates contains diverse and complementary SNPs, it results in better R2 values by covering more biological functions.
Inspired by this idea, we conjecture that selecting pairs of SNPs from genomic regions that (1) harbor individually informative SNPs, and (2) are diverse in terms of genomic location would avoid LD better and yield more functionally complementing and more epistatic SNP pairs compared with the current state of the art since no other algorithm exploits this information. We propose a new method that, for the first time, incorporates the genome location diversity with the population coverage density. Specifically, our proposed method, Potpourri, maximizes a submodular set function to select a set of genomic regions (1) that include SNPs that are individually predictive of the cases, and (2) that are distant from each other on the underlying genome. Epistasis tests are performed for pairs across these regions, such that pairs that densely co-cover the case cohort are given priority.
We validate our hypothesis and show that Potpourri can detect statistically significant and biologically meaningful epistatic SNP pairs. We perform extensive tests on three WTCCC GWA studies and compare our method with the state-of-the-art LINDEN algorithm. First, we guide LINDEN by pruning its search space using Potpourri-selected-SNPs to show that (1) it is possible to significantly improve the precision (from 0.003 up to 0.302) and (2) that our diversification approach is sound. Then, we show that the ranking of the diverse SNPs by the co-coverage of the case cohort further improves the prediction power and the precision (up to 0.652 in the selected setting). Potpourri drastically reduces the number of hypothesis tests to perform (from 380k to 15k), and yet it is still able to detect more epistatic pairs with similar significance levels in all there GWA studies considered. The running time is also cut by fourfold in the selected settings. Another problem with the current techniques is the biological interpretation of the obtained epistatic pairs. Once the most significant SNP pairs returned are in the noncoding regions and are too isolated to be associated with any gene, the user can hardly make sense of such a result despite statistical significance. We investigate the advantage of the promotion of SNPs falling into regulatory and noncoding regions for testing and propose three techniques. We show that these techniques further improve the precision (up to 0.8) with a similar number of epistatic pairs detected.
Finally, we investigate the biological meaning of the detected SNP pairs. We find (1) an SNP pair that supports the hypothesis of a shared genetic architecture between type 2 diabetes (T2D) and chronic kidney disease; and (2) a pair that suggests a new candidate risk gene (NPW) for hypertension (HT) that has loose indications only in rat studies. Potpourri is available at http://ciceklab.cs.bilkent.edu.tr/potpourri
2. Methods
2.1. Notation
A GWAS dataset consists of genotypes of a set of samples S who are associated with a binary phenotype: Case or Control. Let
One can generate an undirected SNP-SNP network
2.2. Selection of diverse and informative SNP regions
Regions harboring informative SNPs (ones that are correlated with the phenotype) and also far away on the underlying genome with respect to an SNP-SNP network are likely to yield a diverse and explanatory SNP set without introducing a literature bias. Our hypothesis is that those selected SNPs are likely to be epistatic, because SNPs selected using this technique lead to better phenotype prediction due to better genomic complementation (Yilmaz et al., 2019). Hence, picking pairs from such a set should yield highly epistatic SNP pairs and is well situated for epistasis test prioritization. In our application, the goal is not to predict the phenotype. Our SNP set selection approach differs from Yilmaz et al. in the calculation of the SKAT scores (Wu et al., 2011), as in epistatis test prioritization we are analyzing a GWA study in which the phenotype is dichotomous (the term
First, for every SNP vi, a score
In this equation,
Given the set of SNPs
Note that using only the SNPs in
2.3. Prioritization of the tests
The region set
2.3.1. Guiding LINDEN for ranking SNP pairs
In this ranking strategy, we use LINDEN to rank the pairs selected in
LINDEN is input with only SNPs that fall into the regions in
2.3.2. Population co-covering for ranking SNP pairs
We propose a new strategy that aims at maximizing the population coverage of co-occurring SNPs. Population cover for epistasis test prioritization was first proposed in Ayati and Koyutürk (2014, 2016), where they select multiple exclusive groups of SNPs (POCOs) that cover the case cohort. That is, the union of the samples with the SNPs in a POCO should be equal to the case cohort. Epistasis tests are performed across POCOs, and the idea is that the independent coverage of the cases across POCOs will result in detecting complementary SNPs. We take a different approach in terms of covering the population. We would like the SNP pairs to be tested to co-cover the population. This is intuitive; the diverse selection step described in Section 2.2 finds complementing regions and avoids LD, but for an SNP pair to be epistatic they also need to be observed together in cases. More formally, let
Potpourri computes the pairwise population co-covering of all SNP pairs
2.3.3. Promoting SNPs in regulatory and coding regions
SNPs can alter gene expression and the downstream function depending on their genomic location. Those that fall on to regulatory regions can affect messenger RNA (mRNA) levels, and those that fall onto coding regions can alter the structure and function of the protein. Since such SNPs are likely to alter the function and more likely to be related to the phenotype, we conjecture that we can find more statistically significant and biologically meaningful SNP pairs via promoting mutations in regulatory and coding regions. Although this might introduce a literature bias, for a life scientist who would like to narrow down the search space by using functional regions, this might be plausible and investigates the usefulness of this approach. However, one should not totally disregard the unannotated parts of the genome as most of the variation exists in such regions. Thus, we seek a balance.
We employ three techniques to promote regulatory/coding variants. In the first approach (Potpourri RC1), we assign an artificial prize to SNPs in the diverse SNP selection phase of the algorithm, as described in Section 2.2. This approach favors the regions in R to be regulatory/coding regions. Then, the pairs are tested with respect to the population co-cover ranking. The second approach (Potpourri RC2) uses
3. Results
3.1. Datasets
To benchmark Potpourri, we used three GWAS datasets obtained from WTCCC: (1) T2D, (2) HT, and (3) bipolar disorder (BD). We use the 1958 Birth (58C) cohort as control for all datasets (Craddock et al., 2010). Please see the details of preprocessing and quality control steps taken in Sec 1.4 in Supplementary Data. The number of samples in each dataset is listed in Supplementary Table S1.
We used the following resources to obtain the regulatory and coding region information. We obtained the distant-acting transcriptional enhancer dataset from VISTA Enhancer Browser (Visel et al., 2007). VISTA enhancer dataset contains 1912 human noncoding fragments with gene enhancer activity. Transcription start sites (TSSs) and coding region coordinates were obtained from UCSC Genome Browser (Lindblad-Toh et al., 2011). We chose Ensembl Genes as a gene annotation track (Zerbino et al., 2017) and the March 2006 NCBI36/hg18 assembly of the human genome that matches the WTCCC datasets. The number and types of genes obtained from the Ensembl dataset are given in Supplementary Table S2. We defined 1 kb downstream and upstream of the TSSs as a regulatory region. The coding region for a gene begins from the first base of the first exon and continues to the last base of the last exon.
Potpourri operates on an SNP-SNP interaction network. In this study, we used the genomic sequence network as defined in Azencott et al. (2013). In this network, SNPs are connected if they are adjacent on the genome. This was the network of choice, as it was shown to provide the best regression performance in Yilmaz et al.
3.2. Experimental setup
We follow the experimental procedure in Yilmaz et al. (2019) and Azencott et al. (2013) for the SNP selection step. The parameters were selected by using a nested 10-fold cross-validation. First, the distance parameter D was selected via a line search among seven values (log-scale) between 1 and
Unless otherwise stated, we used the suggested settings for LINDEN as described in Cowman and Koyuturk: d = 0.45 and
To quantify the performance of the proposed algorithms, we used precision as the evaluation metric in which TPs refer to the reciprocally significant epistatic pairs that pass the Bonferroni-adjusted statistical significance threshold. False positives (FPs) refer to failed tests: the reciprocally significant epistatic pairs that fail to pass the aforementioned threshold. Note that, in the epistasis test prioritization context, an FP does not mean a false rejection of the null hypothesis and falsely concluding that the pair has an epistatic interaction. Rather, it refers to a prioritized pair for testing that could not pass the corrected statistical significance threshold.
We use the definition of reciprocally significant epistatic pair from Cowman and Koyuturk. As the authors argue, most epistatic pairs are dominated by some hub SNPs and this leads to detection of redundant pairs. On the other hand, SNPs in a reciprocal pair are the most epistatic partners for each other. We also use this definition to measure the performance of our algorithm. We set the significance level as 10% throughout the experiments and adjust the significance level by using the Bonferroni correction based on the number of tests performed by each approach. Epistasis testing is performed via a chi-squared test on a 9 × 2 contingency table of all genotype combinations between cases and controls for a selected SNP pair (df = 8), as also done by Cowman and Koyuturk. All tests are performed on an Intel® Xeon® E5-2650 v3 Ten-Core Haswell Processor (2.3 GHz 2 5M 9.6 GT/s QPI). Two hundred fifty-one GB RAM is used in the parallel setting.
3.3. Guiding LINDEN with diverse and informative regions improves precision
We quantified the improvement in precision when LINDEN is guided by Potpourri-selected regions. First, we ran LINDEN on all datasets and it detected 1786, 885, and 1022 reciprocally significant epistatic pairs for T2D, BD, and HT datasets, respectively. Only 5, 35, and 2 pairs were statistically significant at the
Complete results are shown in Table 1 and Supplementary Tables S3 and S4 for T2D, BD, and HT datasets, respectively. The guidance of Potpourri improves the precision substantially, from 0.003 to 0.3 when
Results for Type 2 Diabetes Dataset That Compares LINDEN, Potpourri-Guided LINDEN, and Potpourri (with Population Co-Cover)
The number of pairs reported is the total number of reciprocally significant pairs returned by each method for a varying number of selected SNPs. The number in parentheses denotes the significant pairs passing the significance threshold (0.1) after Bonferroni correction based on the number of tests performed by each method. Bold denotes the best result for a given k value. The table shows that the guidance of Potpourri substantially improves the precision of LINDEN. For all considered k values, Potpourri provides the best precision values.
The significance levels of each performed epistasis test by LINDEN and Potpourri-guided LINDEN are shown in Figure 1 and Supplementary Figures S1 and S2 for the T2D, BD, and HT datasets, respectively. The green lines denote the significance level (0.1) to be passed for each approach (

On the T2D dataset, this figure compares the p-values of the selected SNP pairs by the following methods: (1) LINDEN, (2) Potpourri-guided LINDEN, (3) Potpourri; and then also the variants of Potpourri, which further promotes SNP pairs in regulatory and coding regions: (4) Potpourri RC1, (5) Potpourri RC2, and finally (6) Potpourri RC3. We show the significance levels (y-axis) of each reported pair (dots) given the Bonferroni-corrected significance threshold (0.1, green dashed lines). X-axis is just randomly assigned values to pairs for better visualization. Potpourri is run with
Since Potpourri provides a pruned search space for LINDEN by eliminating SNPs that are most likely irrelevant to the trait, it also reduces number of tests that will be performed during the epistasis test. Indirectly, it eliminates the negative effect of multiple hypothesis testing, which reduces the statistical power. As seen in Figure 1 and Supplementary Figures S1 and S2, due to a low Bonferroni threshold, the pipeline is able to discover more TPs compared with LINDEN. We also show that our pipeline not only minimizes the number of FPs but is also able to maintain the significance level of the returned pairs. LINDEN's top SNP pair is more significant, whereas guided LINDEN's top SNP pairs still stand out in terms of their p-values. We compared the sets of SNP pairs detected by LINDEN and Potpourri-guided LINDEN and found that there the former is not a superset of the latter. Potpourri enforces LINDEN to form trees on its selected regions of
In short, by just providing guidance to LINDEN by using Potpourri's diverse and informative SNP region selection, we were able to improve the performance of the state of the art with a substantially smaller number of tests performed. Next, we provide the results of the complete Potpourri pipeline that uses a population co-cover strategy for prioritization, instead of LINDEN's tree-based strategy.
3.4. Comparison of Potpourri with the state of the art
In this section, we evaluate the performance of the Potpourri pipeline with the population co-covering technique. Again, we compare the performance with the LINDEN algorithm. Table 1 shows the results of the algorithm for different k values on the T2D dataset. We observe that the population co-cover ranking strategy results in even further performance improvements. The precision is moved up to 0.652, as 15 out of 23 reciprocally significant pairs pass the Bonferonni-corrected threshold (
LINDEN performs 7,629,272,394 tests (378,016 tests for leaves), as opposed to 14,146 tests performed by Potpourri. This sets a much more conservative significance threshold. One could argue that the precision gain is only due to the reduced significance threshold. However, Figure 1 shows that it is not the case. The third panel shows the significance levels of Potpourri-selected SNP pairs on the T2D dataset. We see that 6 out of 15 Potpourri-selected reciprocally significant pairs are significant even when the Bonferonni-corrected threshold of LINDEN is considered (dashed green line on the left-most panel, which is far more conservative). Note that LINDEN only detects five reciprocally significant pairs. We also observe that the number of FPs is even further decreased compared with Potpourri-guided LINDEN. See Supplementary Figures S1 and S2 for the results on BD and HT datasets, respectively, which again show a similar pattern.
We also checked whether adjusting LINDEN's parameters to make it more conservative would improve the precision. Increasing d to 0.9 and to 0.99 enforced it to perform a smaller number of tests but still too many to get close to Potpourri's efficiency, as shown in Supplementary Table S5. We checked whether we would get better results by testing only SPADIS-selected SNPs. For each dataset, we performed pairwise epistasis tests on all SPADIS-selected SNPs (
3.5. Promoting regulatory and coding regions improves precision
In this section, we investigate the advantage of promoting coding and regulatory regions in the Potpourri pipeline. For the T2D dataset, Table 2 compares the performance of original Potpourri pipeline with the three variants (RC1, RC2, and RC3) that promote the selection of SNPs from coding and regulatory regions, as described in Section 2.3.3. Results show that in the suggested setting (
Comparison of the Potpourri Results with Three Strategies (RC1, RC2, and RC3) to Promote Regulatory and Coding Regions on the Type 2 Diabetes Dataset
For RC1 and RC3,
We experimented with three ω values and picked the 1.31623 as the best performing and suggested parameter values. See Supplementary Tables S7–S10, 12 and 13 for all results obtained by using other parameter choices. Finally, the selected SNP pairs by all Potpourri variant techniques with suggested settings (
3.6. Novel epistatic pairs
One interesting pair discovered by Potpourri is the
Finally, an interaction between EGFR and RXRG genes was reported, which increases the risk of DN in a T2D cohort (Hsieh et al., 2006). Thus, this new epistatic interaction further supports the hypothesis of a shared genetic architecture between these two diseases and that T2D related loci can also genetically increase the risk for DN. In the HT data, we detect an epistatic interaction between
4. Conclusion
The detection of epistatic interactions is a promising approach for understanding the genetic underpinnings of complex traits. However, the large number of hypotheses to test prohibits the brute force investigation of the search space. We proposed a new test prioritization technique and showed that selecting individually informative and topologically diverse SNPs in terms of genomic location leads to detecting statistically significant epistatic interactions. Our approach performs favorably compared with state of the art.
Footnotes
Acknowledgments
The authors would like to thank WTCCC and the study participants. They used SPADIS and LINDEN code in their implementation. They are grateful to S. Yilmaz, T. Cowman, and M. Koyuturk for the source code.
Author Disclosure Statement
The authors declare they have no competing financial interests.
Funding Information
This work has been supported by TUBITAK via Career Grant number 116E148 to A.E.C. A.E.C. acknowledges the support of TUBA GEBIP 2017 award. O.T., and A.E.C. acknowledge the support of Bilim Akademisi BAGEP awards.
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
