An Efficient Nonlinear Regression Approach for Genome-wide Detection of Marginal and Interacting Genetic Variations

Abstract

Genome-wide association studies have revealed individual genetic variants associated with phenotypic traits such as disease risk and gene expression. However, detecting pairwise interaction effects of genetic variants on traits still remains a challenge due to the large number of combinations of variants (∼10¹¹ SNP pairs in the human genome) and relatively small sample sizes (typically <10⁴). Despite recent breakthroughs in detecting interaction effects, there are still several open problems, including: (1) how to quickly process a large number of SNP pairs, (2) how to distinguish between true signals and SNPs/SNP pairs merely correlated with true signals, (3) how to detect nonlinear associations between SNP pairs and traits given small sample sizes, and (4) how to control false positives. In this article, we present a unified framework, called SPHINX, which addresses the aforementioned challenges. We first propose a piecewise linear model for interaction detection, because it is simple enough to estimate model parameters given small sample sizes but complex enough to capture nonlinear interaction effects. Then, based on the piecewise linear model, we introduce randomized group lasso under stability selection, and a screening algorithm to address the statistical and computational challenges mentioned above. In our experiments, we first demonstrate that SPHINX achieves better power than existing methods for interaction detection under false positive control. We further applied SPHINX to a late-onset Alzheimer's disease dataset, and report 16 SNPs and 17 SNP pairs associated with gene traits. We also present a highly scalable implementation of our screening algorithm, which can screen ∼118 billion candidate associations on a 60-node cluster in <5.5 hours.

1. Introduction

A fundamental problem in genetics is to understand the interaction (or epistatic) effects of pairs or larger sets of single-nucleotide polymorphisms (SNPs) on phenotypic traits (Moore et al., 2010). Existing methods for detecting causal SNP pairs include hypothesis-testing-based methods (Zhang et al., 2008; Wan et al., 2010; Purcell et al., 2007) and penalized multivariate regression (PMR)-based methods (Park and Hastie, 2008; Lee and Xing, 2012; Bien et al., 2013). Arguably, PMR-based methods are more powerful than hypothesis-testing-based methods because PMR can in principle jointly estimate all marginal and interaction effects simultaneously (Lee and Xing, 2012; Hoffman et al., 2013). However, statistical and computational bottlenecks have prevented PMR from being widely used for detecting interaction effects on traits. Firstly, it is difficult to control false positives. One can use a “screen and clean” procedure to compute p-values (Wasserman and Roeder, 2009; Meinshausen et al., 2009), but this strategy substantially downgrades the power in genome-wide association mapping because only half of the samples can be used for each step of screening and cleaning. Secondly, the high correlations between pairs of SNPs also decrease the power of PMR, because PMR can only detect true associations accurately under conditions with little correlation between different SNPs/SNP pairs (Bühlmann et al., 2013). Lastly, there is a substantial computational challenge to overcome. If we were to consider millions of SNPs as candidates in studying a particular phenotypic trait, the number of potential pairwise interactions between SNPs to be considered is >10¹¹. Such a massive pool of candidate SNP pairs makes it infeasible to solve the mathematical optimization program underlying PMR with currently available tools.

The past several years have seen the emergence of several statistical methods that can potentially be employed to address the problems mentioned above. For the first problem of error control, Meinshausen and Bühlmann (2010) proposed a procedure known as stability selection. The insight behind this technique is that, given randomly chosen multiple subsamples, true associations of covariates (e.g., SNPs or SNP pairs) to responses (e.g., a trait) will be selected at high frequency because true association signals are likely to be insensitive to the random selection of subsamples. Second, to address the nonidentifiability problem in regression due to intercovariate correlation, a randomized lasso technique has been proposed that randomly perturbs the scale of covariates in the framework of stability selection, thereby relaxing the original requirements on small correlation for recovery of true association signals from all covariates (Meinshausen and Bühlmann, 2010). Naturally, such a scheme is expected to help distinguish between true and false associations of SNPs/SNP pairs, because only true ones are likely to be selected under the perturbations. Finally, to combat the computational challenge due to a massive number of covariates, a sure independence screening (SIS) procedure (Fan and Lv, 2008) has been proposed to contain the operational size of the regression problem under provable guarantee of retaining true signals. It is possible to use the idea behind SIS to effectively perform simple independent tests on each pair of SNPs (or individual SNPs) and discard the large fraction of candidates with no associations, such that one can end up with only O(NC) candidates (where N is sample size and C is a data dependent constant) of which no true associations will be missed with high probability. 
These theoretical developments notwithstanding, their promised power remains largely unleashed for practical genome-wide association mapping, especially in nontrivial scenarios such as nonadditive epistatic effects, due to several remaining hurdles, including proper models for association, algorithms for screening with such models and on a computer cluster, and proper integration of techniques for error control, identifiability, screening, etc., in such a new paradigm.

In this article, we present SPHINX [which stems from sparse piecewise linear model with high throughput screening for interaction detection (X)], a new PMR-based approach built on the advancements in statistical methodologies mentioned above. It is an integrative platform that conjoins and extends the aforementioned three components, further enhanced with techniques allowing more realistic trait association patterns to be detected. In particular, SPHINX is designed to capture SNP pairs with nonlinear interaction effects (synergistic/antagonistic epistasis) on traits using a piecewise linear model (PLM), which is better suited to model the complex interactions between a pair of SNPs and the traits. In short, SPHINX is designed as follows: using an extension of SIS based on PLM, it first selects a set of O(NC) SNPs and SNP pairs with the smallest residual sum of squares. Then it runs the randomized group lasso based on PLM on the set of SNPs and SNP pairs selected in the previous step under stability selection. Finally, it reports SNPs and SNP pairs selected by stability selection, whose coefficients are nonzero given a majority of subsamples. In Figure 1, we illustrate the overall framework of SPHINX. Note that in practical association analysis with all pairs of SNPs, we should address the three problems mentioned above simultaneously, which is a nontrivial task. To achieve this goal, we take a unified framework approach, which requires statistically sound models and algorithms and scalable system implementations.

FIG. 1.

Overall framework of SPHINX. Using a screening method, we first discard SNPs/SNP pairs without associations; given that the SNPs/SNP pairs survived in the screening step, we run a method that incorporates three different techniques, each of which is introduced to address the problem on its right side.

In our experiments, we show the efficacy of SPHINX in controlling false positives, detecting true causal SNPs and SNP pairs, and using multiple cores/machines to deal with a large number of SNP pairs. Furthermore, with SPHINX, we analyzed a late-onset Alzheimer's disease eQTL dataset (Zhang et al., 2013), which contains ∼118 billion candidate associations; the analysis took <5.5 hours using a 60-node cluster with 720 cores. As a result, we found 16 SNPs and 17 SNP pairs associated with gene traits. Among our findings, we report the analysis of 6 SNPs (rs1619379, rs2734986, rs1611710, rs2395175, rs3135363, rs602875) associated with immune system–related genes (i.e., HLA gene family) and an SNP pair (pair of rs4272759 and rs6081791) associated with a dopamine-related gene (i.e., DAT gene); the roles of dopamine and the immune system in Alzheimer's disease have been studied in previous research (Li et al., 2004; Maggioli et al., 2013).

2. Methods

SPHINX is a framework for genome-wide association mapping, which consists of a PLM-based screening technique and a PLM-based randomized group lasso under stability selection. Among the SPHINX components, the effectiveness of the randomization technique and stability selection is demonstrated with theory and experiments in Fan and Lv (2008) and Meinshausen and Bühlmann (2010); the screening approach is extensively studied in both parametric and nonparametric settings (Fan and Lv, 2008; Fan et al., 2011). In this section, we first describe our proposed model, the PLM-based group lasso, together with the randomization technique and stability selection. We then present the PLM-based screening method, followed by our system implementation of the screening method. Note that SPHINX runs the screening method prior to the PLM-based randomized group lasso, as shown in Figure 1.

2.1. Piecewise linear model-based group lasso

The relationships between genetic variations and phenotypic traits are complex, for example, nonlinear. However, due to the highly under-determined nature of the mathematical problem—too many features (SNPs and SNP pairs) but too few samples—it is difficult to employ models that have a high degree of freedom. Traditionally, linear models have been used extensively in genome-wide association studies despite the fact that these models are not flexible enough to capture the complexity of the trait-associated epistatic interactions between SNPs.

We introduce a multivariate piecewise linear model (PLM), which is better suited to model the complex interactions between a pair of SNPs and traits. Note that we employ PLM to add degrees of freedom to a linear model in a high-dimensional multivariate regression setting. This differs from changing the degrees of freedom in statistical tests such as the F-test. We denote the j-th SNP for the i-th individual by $x_j^i \in \{0, 1, 2\}$, the number of minor alleles at the locus. We convert a linear model into a piecewise linear model with two knots denoted by $\Delta = \{\eta_1, \eta_2\}$, where η₁ = 1 and η₂ = 2 for our SNP encoding. The resulting model uses three degrees of freedom, which is flexible enough to capture the change of gene expression with a change in the genotype.
Specifically, let $m_{jk}^i$ denote the genotype encoding for the interaction between the j-th SNP and k-th SNP for the i-th individual, that is, $m_{jk}^i \equiv x_j^i x_k^i$. Then, we have a piecewise linear model as follows:

$$\hat{\mathbf{y}} = \mathbf{1}C + \sum_{j=1}^{P} \mathbf{x}_j \beta_j + \sum_{j<k} \Psi(\mathbf{m}_{jk}, \{u_{jk}, t_{jk}, w_{jk}\}) + \boldsymbol{\epsilon}, \tag{1}$$

where $\mathbf{m}_{jk} = [m_{jk}^1, \ldots, m_{jk}^N]^T$, $\hat{\mathbf{y}}$ is an output trait based on the model, β_j is the regression coefficient for the j-th SNP, and ε is Gaussian noise. Here, Ψ(·) is a piecewise linear function given by

$$\Psi(m_{jk}^i, \{u_{jk}, t_{jk}, w_{jk}\}) = \begin{cases} m_{jk}^i u_{jk} & \text{if } m_{jk}^i \le \eta_1, \\ m_{jk}^i u_{jk} + (m_{jk}^i - \eta_1) t_{jk} & \text{if } \eta_1 < m_{jk}^i \le \eta_2, \\ m_{jk}^i u_{jk} + (m_{jk}^i - \eta_1) t_{jk} + (m_{jk}^i - \eta_2) w_{jk} & \text{if } m_{jk}^i > \eta_2, \end{cases} \tag{2}$$

where u_jk, t_jk, and w_jk represent the regression coefficients for the first, second, and third line segment, respectively. Given the model in Equation (1), to select significant SNPs/SNP pairs, we propose the following penalized multivariate piecewise-linear regression, referred to as PLM-based group lasso:

$$\begin{aligned} \min_{C, \{\beta_j\}, \{u_{jk}, t_{jk}, w_{jk}\}} \; & \left\| \mathbf{y} - \left\{ \mathbf{1}C + \sum_{j=1}^{P} \mathbf{x}_j \beta_j + \sum_{j<k} \Psi(\mathbf{m}_{jk}, \{u_{jk}, t_{jk}, w_{jk}\}) \right\} \right\|_2^2 \\ & + \lambda_1 \sum_{j=1}^{P} |\beta_j| + \lambda_2 \sum_{j<k} \sqrt{|\Delta|} \sqrt{u_{jk}^2 + t_{jk}^2 + w_{jk}^2}, \end{aligned} \tag{3}$$

where λ₁ and λ₂ are regularization parameters that determine the sparsity of the solutions. The ℓ₁ penalty and the ℓ₁/ℓ₂ penalty are introduced to set the coefficients of individual SNPs and SNP pairs, respectively, to exactly zero if they are irrelevant to the observed trait y. The ℓ₁/ℓ₂ term is equivalent to the group lasso penalty (Yuan and Lin, 2005), which has been shown to select the true nonzero β_j's and {u_jk, t_jk, w_jk}'s under certain conditions (Bach, 2008). We can optimize Equation (3) using standard optimization techniques for group lasso such as block coordinate descent (Friedman et al., 2007) or a proximal gradient method (Liu and Ye, 2010) [we used a proximal gradient method to optimize Eq. (3) (Liu et al., 2009)] because the loss function is differentiable and Ψ(·) is linear in its coefficients. Further, the penalty is separable because there is no overlap between different groups of coefficients. Here, we considered the squared loss for eQTL (expression quantitative trait loci) mapping with continuous traits; however, our methodology can be extended to other loss functions (e.g., logistic loss in case/control studies).
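As a rough illustration of Equations (2) and (3), the sketch below writes Ψ(·) with hinge terms (which reproduces the three cases of Eq. (2) for η₁ = 1 and η₂ = 2) and gives the two standard proximal operators a proximal gradient method would apply to the ℓ₁ and ℓ₁/ℓ₂ penalties. The function names are hypothetical and this is not the solver used in the paper, only a minimal standard-library sketch of the textbook operators.

```python
import math

ETA1, ETA2 = 1.0, 2.0  # knots eta_1 = 1 and eta_2 = 2, as stated in the text


def psi(m, u, t, w):
    """Piecewise linear function of Eq. (2), written with hinge terms.

    max(m - eta, 0) is zero below a knot, so the three cases of Eq. (2)
    collapse into a single expression.
    """
    return m * u + max(m - ETA1, 0.0) * t + max(m - ETA2, 0.0) * w


def soft_threshold(b, tau):
    """Proximal operator of tau * |b| (lasso part, one marginal coefficient)."""
    return math.copysign(max(abs(b) - tau, 0.0), b)


def group_soft_threshold(v, tau):
    """Proximal operator of tau * ||v||_2 for one group (u_jk, t_jk, w_jk).

    The whole group is set to zero when its norm is at most tau; otherwise
    it is shrunk toward zero by a common factor.
    """
    norm = math.sqrt(sum(c * c for c in v))
    if norm <= tau:
        return [0.0] * len(v)
    scale = 1.0 - tau / norm
    return [scale * c for c in v]
```

In a proximal gradient step with step size s (an assumed name), the thresholds would be `tau = s * lambda_1` for each β_j and `tau = s * lambda_2 * math.sqrt(2)` for each group, since |Δ| = 2 knots in Eq. (3).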

Randomization. As previously mentioned, high correlations between SNPs or SNP pairs make it hard to distinguish between true association SNPs/SNP pairs and the correlated ones. To address the problem, we randomly perturb the scale of covariates in Eq. (3) (Meinshausen and Bühlmann, 2010), which we call PLM-based randomized group lasso:

$$\begin{aligned} \min_{C, \{\beta_j\}, \{u_{jk}, t_{jk}, w_{jk}\}} \; & \left\| \mathbf{y} - \left\{ \mathbf{1}C + \sum_{j=1}^{P} W_j \mathbf{x}_j \beta_j + \sum_{j<k} \Psi(W_{jk}\mathbf{m}_{jk}, \{u_{jk}, t_{jk}, w_{jk}\}) \right\} \right\|_2^2 \\ & + \lambda_1 \sum_{j=1}^{P} |\beta_j| + \lambda_2 \sum_{j<k} \sqrt{|\Delta|} \sqrt{u_{jk}^2 + t_{jk}^2 + w_{jk}^2}. \end{aligned} \tag{4}$$

Here P(W_j = 1) = P(W_j = δ) = P(W_jk = 1) = P(W_jk = δ) = 0.5, and δ ∈ (0, 1] determines the degree of perturbation (the smaller δ, the larger the perturbation). It has been shown that this randomization with stability selection weakens the condition for the recovery of true nonzero coefficients (Meinshausen and Bühlmann, 2010). Furthermore, the same authors showed empirically that the randomization is very useful for distinguishing between true causal signals and false ones merely correlated with the true signals. However, there is a trade-off in the degree of random perturbation: as we increase the degree of perturbation, false positives will be reduced, but false negatives can increase.
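The perturbation itself is simple to state in code. The following minimal sketch draws the weights W_j (or W_jk) from {1, δ} with probability 0.5 each and rescales the corresponding covariate columns before the model in Eq. (4) is fitted; the helper names and the list-of-columns data layout are assumptions made for illustration.

```python
import random


def sample_weights(n_covariates, delta, rng):
    """Draw one weight per covariate: 1 or delta, each with probability 0.5."""
    if not 0.0 < delta <= 1.0:
        raise ValueError("delta must lie in (0, 1]")
    return [1.0 if rng.random() < 0.5 else delta for _ in range(n_covariates)]


def rescale_columns(columns, weights):
    """Multiply each covariate column by its weight (W_j x_j or W_jk m_jk)."""
    return [[w * value for value in col] for col, w in zip(columns, weights)]


rng = random.Random(0)                      # fixed seed for reproducibility
weights = sample_weights(5, delta=0.5, rng=rng)
scaled = rescale_columns([[0, 1, 2]] * 5, weights)
```

With δ = 1 every weight equals 1 and the unperturbed problem of Eq. (3) is recovered; fresh weights would be drawn for every subsample.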

Stability Selection. Next, to control false positives, we adopt stability selection (Meinshausen and Bühlmann, 2010), which takes a bootstrapping approach. Suppose we have a set of (λ₁, λ₂) parameters denoted by $\{\Lambda_t\}_{t=1}^{T}$, where Λ_t = (λ₁, λ₂)_t. For each Λ_t, we solve Equation (4) on a randomly chosen subsample of size $\lfloor N/2 \rfloor$, and repeat this α times. Then we select SNPs or SNP pairs if their coefficients are nonzero more than π_thr·α times (0.5 < π_thr ≤ 1) for any regularization parameter. Under certain assumptions, it has been shown that the expected number of false positives E(V) is bounded by

$$E(V) \le \frac{1}{2\pi_{\mathrm{thr}} - 1} \, \frac{q_{\Lambda}^{2}}{k_{\mathrm{mar}} + k_{\mathrm{int}}}, \tag{5}$$

where q_Λ is the expected number of nonzero coefficients in a solution of Equation (4) (Meinshausen and Bühlmann, 2010). Note that for whole genome-wide association studies, stability selection based on Equation (4) is computationally challenging because all SNP pairs are considered. Specifically, optimizing Equation (4) is nontrivial because it requires an iterative algorithm such as a proximal gradient method (Liu et al., 2009), which sweeps over this large number of SNP pairs multiple times. To address the problem, we introduce a PLM-based screening algorithm, which efficiently gives us small candidate sets of association SNPs and SNP pairs, denoted by Ω_mar and Ω_int. We describe PLM-based randomized group lasso under stability selection in Algorithm 1.
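The selection rule and the bound in Eq. (5) can be sketched as follows. This is a hedged illustration, not Algorithm 1 itself: `select` is a hypothetical callback standing in for a solver of Eq. (4) that returns the indices of covariates with nonzero coefficients on a given subsample, and all function names are assumptions.

```python
import random


def stability_selection(n_samples, n_features, lambdas, select, alpha, pi_thr, rng):
    """Keep covariates selected in more than pi_thr * alpha subsamples
    for at least one regularization setting in `lambdas`."""
    half = n_samples // 2                       # subsample size floor(N / 2)
    stable = set()
    for lam in lambdas:
        counts = [0] * n_features
        for _ in range(alpha):
            idx = rng.sample(range(n_samples), half)
            for j in select(idx, lam):          # hypothetical solver callback
                counts[j] += 1
        stable.update(j for j, c in enumerate(counts) if c > pi_thr * alpha)
    return sorted(stable)


def false_positive_bound(q, pi_thr, k_mar, k_int):
    """Right-hand side of Eq. (5)."""
    if not 0.5 < pi_thr <= 1.0:
        raise ValueError("pi_thr must lie in (0.5, 1]")
    return q * q / ((2.0 * pi_thr - 1.0) * (k_mar + k_int))
```

For example, with q_Λ = 20, π_thr = 0.9, and k_mar = k_int = 200, the bound evaluates to 400 / (0.8 × 400) = 1.25 expected false positives.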

2.2. Piecewise linear model-based screening

In Equation (4), we include all P SNPs and $\binom{P}{2} = \frac{P(P-1)}{2}$ SNP pairs, which makes it impractical to solve the problem at a whole-genome scale (e.g., millions of SNPs). To handle the quadratic explosion of the number of SNP pairs, we propose a scalable screening method based on PLM. Our screening method is designed to sequentially select potentially relevant SNPs and SNP pairs using a simple test scheme. Note that this screening step focuses on avoiding missing true positives supported by sure-screening theory (Fan et al., 2011).

Our screening algorithm is a variant of iterative sure independence screening based on Equation (1). It greedily selects SNPs and SNP pairs based on the contribution of each candidate SNP or SNP pair to the decrease of the residual sum of squares. The PLM-based screening is described in Algorithm 2. Note that there are two pairs of parameters, (k_mar, k_int) and (b_mar, b_int). The pair (k_mar, k_int) determines the total number of SNPs and SNP pairs selected, and (b_mar, b_int) determines the number of candidates selected per iteration. In our experiments, we used (k_mar, k_int) = (N, N) and (b_mar, b_int) = (10, 10) to limit the number of correlated SNPs/SNP pairs selected at each iteration to 10. When SNPs are highly correlated, we recommend large values of (k_mar, k_int) and small values of (b_mar, b_int), because this allows us to select more independent SNPs/SNP pairs. After the screening step, we obtain small candidate sets of SNPs and SNP pairs, and thus it is computationally tractable to solve the high-dimensional problem in Equation (4).
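To make the greedy idea concrete, here is a simplified sketch for SNP pairs only. It is not Algorithm 2: it assumes each candidate pair is fitted with its own intercept plus the three PLM coefficients, and it updates the residual using only the best pair of each batch rather than refitting all selected candidates jointly. Under that assumption the fit has four free parameters while m = x_j x_k takes at most four values (0, 1, 2, 4), so the least-squares fit equals the mean of the residual within each value of m. All names are hypothetical.

```python
from itertools import combinations


def plm_rss_and_fit(m, r):
    """RSS and fitted values of an intercept + PLM fit of residual r on m.

    Assumption: the saturated fit reduces to per-value means of r.
    """
    sums, counts = {}, {}
    for mi, ri in zip(m, r):
        sums[mi] = sums.get(mi, 0.0) + ri
        counts[mi] = counts.get(mi, 0) + 1
    fit = [sums[mi] / counts[mi] for mi in m]
    rss = sum((ri - fi) ** 2 for ri, fi in zip(r, fit))
    return rss, fit


def screen_pairs(X, y, k_int, b_int):
    """Greedily keep up to k_int SNP pairs, b_int per iteration, by smallest RSS.

    X is a list of SNP columns (entries 0/1/2) and y is the trait.
    """
    candidates = {(j, k): [a * b for a, b in zip(X[j], X[k])]
                  for j, k in combinations(range(len(X)), 2)}
    residual = list(y)
    selected = []
    while len(selected) < k_int and candidates:
        scores = {p: plm_rss_and_fit(m, residual)[0] for p, m in candidates.items()}
        ranked = sorted(scores, key=scores.get)
        batch = ranked[:min(b_int, k_int - len(selected))]
        selected.extend(batch)
        # Simplification: remove only the best pair's fit from the residual.
        _, fit = plm_rss_and_fit(candidates[batch[0]], residual)
        residual = [ri - fi for ri, fi in zip(residual, fit)]
        for p in batch:
            del candidates[p]
    return selected
```

A small b_int relative to k_int means the residual is refreshed more often, which is one way to read the recommendation above for highly correlated SNPs.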

2.3. System implementation of piecewise linear model-based screening

We implemented a highly efficient shared- and distributed-memory parallel PLM-based screening algorithm in C++. Our implementation can exploit parallelism when running on multicore machines, or on clusters of multicore machines. To exploit shared-memory parallelism, we used PFunc (Kambadur et al., 2009), a lightweight and portable library that provides C and C++ APIs to express task parallelism. For distributed-memory parallelism, we used MPI (Message Passing Interface Forum, 1995, 1997), a popular library specification for message-passing that is used extensively in high-performance computing. In this section, we briefly describe some salient features of our implementation that optimize memory and computational efficiency.

First of all, we optimize the memory footprint of SPHINX by storing each SNP using 2 bits (to represent 0, 1, 2), thereby giving us four SNPs per byte of data. This way, the entire SNP dataset is compressed and most of the operations, such as tests for SNP–SNP interactions, are performed as bit-wise operations. For example, using this scheme, a 200-patient, 500,000-SNP dataset only occupies 250 MB of storage that can be entirely cached in-memory on most modern machines. The SNP–SNP interaction pairs are constructed on-the-fly in order to save space instead of explicitly storing $\binom{P}{2}$ additional columns.
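A minimal sketch of such a 2-bit encoding is shown below. The byte layout (lowest-order bits first, each SNP column packed separately) and the function names are assumptions for illustration; the actual C++ layout is not described in the text.

```python
def pack_genotypes(genotypes):
    """Pack genotypes in {0, 1, 2} into bytes, four per byte (2 bits each)."""
    packed = bytearray((len(genotypes) + 3) // 4)
    for i, g in enumerate(genotypes):
        if g not in (0, 1, 2):
            raise ValueError("genotype must be 0, 1, or 2")
        packed[i // 4] |= g << (2 * (i % 4))
    return bytes(packed)


def unpack_genotype(packed, i):
    """Read the i-th genotype back with one shift and one mask."""
    return (packed[i // 4] >> (2 * (i % 4))) & 0b11


def packed_size_bytes(n_individuals, n_snps):
    """Bytes needed when each SNP column is packed separately."""
    return n_snps * ((n_individuals + 3) // 4)
```

Under these layout assumptions, `packed_size_bytes(200, 500000)` gives 25,000,000 bytes for the genotypes alone; the text does not say what else is counted in the storage figure it quotes.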

We also optimize our implementation for computational efficiency. Note that the significant SNPs and SNP–SNP interactions are selected by solving millions of linear systems, followed by computation of the 2-norm of the resulting residual. To quickly solve the linear systems, we use the Cholesky factorization (Bretscher, 1997) because Cholesky decomposition is faster (although less numerically stable in some cases) than other alternatives such as QR decomposition (Bretscher, 1997) and singular value decomposition (SVD) (Golub and Reinsch, 1970). Furthermore, we use BLAS and LAPACK kernels to optimize all the linear operations. After the selection of the first SNP and SNP pair, incremental linear models are built; that is, given a set of selected SNPs and SNP pairs, the best SNP or SNP pair to add to our model has to be determined. As the number of selected candidates increases, the linear system becomes more expensive to solve, thereby making successive later iterations expensive. In order to offset these costs, we resort to using a hand-coded incremental version of the Cholesky factorization, which keeps the per-iteration costs near constant.
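The idea behind an incremental Cholesky factorization can be illustrated with the textbook row-append update: when one covariate is added, the Gram matrix grows by one row and column, and its factor can be extended with a single forward substitution instead of being recomputed. The sketch below is an assumed, pure-Python illustration of that update, not the authors' hand-coded version (which relies on BLAS and LAPACK kernels).

```python
import math


def cholesky(A):
    """Plain Cholesky factorization A = L L^T of a small SPD matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def cholesky_append(L, a, alpha):
    """Extend L (factor of G) to the factor of [[G, a], [a^T, alpha]].

    The existing rows of L are reused unchanged; only one new row is
    computed by forward substitution.
    """
    row = []
    for i in range(len(L)):                     # solve L * row = a
        s = a[i] - sum(L[i][k] * row[k] for k in range(i))
        row.append(s / L[i][i])
    d2 = alpha - sum(v * v for v in row)
    if d2 <= 0.0:
        raise ValueError("extended matrix is not positive definite")
    return [r + [0.0] for r in L] + [row + [math.sqrt(d2)]]
```

Appending to an n × n factor costs on the order of n² operations, compared with n³ for a fresh factorization, which is the kind of saving the paragraph above refers to.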

3. Simulation Study

In this section, we validate the effectiveness of SPHINX in terms of false positive control, statistical power, and the benefits of using a piecewise linear model over a linear model via simulations because ground-truth associations are unknown in real datasets. Furthermore, we show the scalability of our screening implementation on multinode, multicore, and hybrid settings. We first conduct an extensive simulation study to demonstrate and statistically validate the efficacy of SPHINX, in comparison to two popular existing approaches: the two-locus test by PLINK (Purcell et al., 2007) with the --epistasis option and the maximum likelihood method with the fully parametrized two-locus model (saturated two-locus test) (Evans et al., 2006) with Bonferroni correction at significance level 0.01. We set |Ω_mar| = N, |Ω_int| = N, and κ = 3 for SPHINX, allowing for three false positives on average, and used the following sequence of regularization parameters: λ₁ = λ₂ ∈ {0.5, 0.1, 0.05, 0.01, 0.005}. We simulated chromosome 1 with 22834 SNPs and 2000 individuals using GWAsimulator (Li and Li, 2008), and generated traits under additive and nonadditive scenarios: association SNP pairs have (1) additive and (2) nonadditive interaction effects. We ran the methods on 50 different data sets generated by randomly choosing 200 samples and 300 consecutive SNPs from the simulated genome for each simulation setting. In our plots, we report the average performance with error bars of 1/2 standard deviation.

3.1. Generation of simulation data

Let S₁ denote the set of SNPs with marginal effects, and S₂ the set of SNP pairs with interaction effects. For the additive scenario, we generate simulation data as follows:

$$y_i = \sum_{j \in \mathbf{S}_1} x_j^i \beta_j + \sum_{(j,k) \in \mathbf{S}_2} x_j^i x_k^i \beta_{jk} + \epsilon_i, \tag{6}$$

where y_i is the continuous response (e.g., gene expression level) for the i-th individual, $x_j^i \in \{0, 1, 2\}$ represents the encoding of the j-th SNP for the i-th individual (i.e., the number of minor alleles), ε_i represents Gaussian noise with zero mean and unit variance for the i-th individual, and β_j and β_jk are constants that represent the size of marginal and interaction effects, respectively.

For the nonadditive scenario, we generate simulation data as follows:

$$y_i = \sum_{j \in \mathbf{S}_1} x_j^i \beta_j + \sum_{(j,k) \in \mathbf{S}_2} f(x_j^i x_k^i) + \epsilon_i, \tag{7}$$

where f assigns an effect size r_q to each of the four possible values of the genotype product $x_j^i x_k^i \in \{0, 1, 2, 4\}$, with r_q ∼ Unif(−β_jk, β_jk) for all q = 1,…, 4. Note that in this nonadditive scenario, the relationship between y and a pair of SNPs x_j and x_k is nonadditive due to the function $f(x_j^i x_k^i)$, which randomly assigns the size of interaction effects according to the input genotype $x_j^i x_k^i$.
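The two generating models in Equations (6) and (7) can be sketched as below. Several details are assumptions made only for illustration: a single effect size `beta` is used for every β_j and β_jk, f is represented as a lookup table with one random effect per value of the product in {0, 1, 2, 4}, and `noise_sd` is an added parameter (set to 1 for the unit-variance noise of the text, or 0 for noise-free checks).

```python
import random


def simulate_trait(X, marginal, pairs, beta, additive, rng, noise_sd=1.0):
    """Generate y following Eq. (6) (additive) or Eq. (7) (nonadditive).

    X: list of SNP columns with entries in {0, 1, 2}
    marginal: SNP indices in S_1; pairs: (j, k) tuples in S_2
    """
    n = len(X[0])
    # Assumed form of f: one random effect per possible value of x_j * x_k.
    tables = {p: {v: rng.uniform(-beta, beta) for v in (0, 1, 2, 4)}
              for p in pairs}
    y = []
    for i in range(n):
        value = sum(X[j][i] * beta for j in marginal)
        for (j, k) in pairs:
            m = X[j][i] * X[k][i]
            value += m * beta if additive else tables[(j, k)][m]
        value += rng.gauss(0.0, noise_sd)       # Gaussian noise epsilon_i
        y.append(value)
    return y
```

In the nonadditive case, two individuals with the same genotype product at a pair receive the same interaction contribution, but that contribution is not proportional to the product.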

In our experiments below, N denotes the sample size, P the number of SNPs, ν the association strength of marginal and interaction effects (i.e., ν = {β_j, β_jk}), and ξ the number of true association SNPs and SNP pairs (i.e., $\boldsymbol{\xi} = \{|\mathbf{S}_1|, |\mathbf{S}_2|\}$). Furthermore, we randomly choose S₂ such that each SNP pair in S₂ has the minor allele frequency less than MAF1 and MAF2. For the set S₁, we randomly choose SNPs with marginal effects among the SNPs with minor allele frequency less than 0.1.

3.2. False positive control

We first confirm that SPHINX effectively controls the number of false positives of SNP pairs under two null hypotheses: (1) there exist no marginal and no interaction effects, and (2) there exist only marginal effects but no interaction effects. As shown in Figure 2, for both null hypotheses, false positives were well controlled with different sample sizes from 100 to 1000 and different numbers of SNPs from 100 to 700 (less than one false positive under both null hypotheses). However, PLINK and the saturated two-locus test did not effectively control the number of false positives under the second null hypothesis (up to 7.94 and 8470, respectively) because SNP pairs correlated with SNPs having marginal effects were falsely detected.

FIG. 2.

Number of false positives of SNP pairs found by SPHINX (a,d), PLINK (b,e), and saturated two-locus test (c,f) with different sample sizes and the number of SNPs under two null hypotheses (see text for details).

3.3. Comparison of different methods for the detection of SNP pairs with interaction effects

We compare SPHINX with the two-locus test in PLINK (Purcell et al., 2007), run with the --epistasis option, and with the saturated two-locus test (Evans et al., 2006). We evaluate the three methods on simulation datasets with different numbers of true association SNP pairs, different MAFs of true association SNP pairs, and different association strengths.

Comparison with different numbers of true association SNP pairs. We performed experiments to show that SPHINX exhibits high power even when false positives are well suppressed. In the simulation, we randomly chose three SNPs for marginal effects (out of 300 SNPs), and set the number of SNP pairs from 1 to 5 (out of 44,850 possible pairs) for interaction effects (SNPs with minor allele frequency between 0 and 0.2 were randomly chosen). Compared to PLINK and the saturated two-locus test, as shown in Figure 3, SPHINX showed significantly larger true positive rates (up to ∼40%) while generating fewer false positives (<0.18) under both the additive and nonadditive scenarios. PLINK found a smaller fraction of SNP pairs with true interaction effects (up to ∼10%), and its number of false positives was less than 1.04. The saturated two-locus test found more true positives than PLINK, but its number of false positives was very large (>1000).
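The two quantities reported in these comparisons can be computed as in the following sketch. It assumes that the true positive rate is the fraction of true SNP pairs that a method recovers and that a false positive is any reported pair outside the true set; the helper names `pair_key` and `evaluate_pairs` are hypothetical.

```python
from math import comb

def pair_key(j, k):
    """Order-independent identifier for a SNP pair."""
    return (j, k) if j < k else (k, j)

def evaluate_pairs(selected, truth):
    """Return (true positive rate, number of false positives) for reported SNP pairs."""
    selected = {pair_key(*p) for p in selected}
    truth = {pair_key(*p) for p in truth}
    true_positives = len(selected & truth)
    false_positives = len(selected - truth)
    tpr = true_positives / len(truth) if truth else 0.0
    return tpr, false_positives

# Number of candidate pairs among 300 SNPs, as quoted in the text.
n_candidate_pairs = comb(300, 2)

# One of two true pairs recovered, plus one spurious pair.
tpr, n_fp = evaluate_pairs(selected=[(4, 3), (10, 11)], truth=[(3, 4), (5, 6)])
```

With 44,850 candidates and at most five true pairs, even a small per-test false positive rate yields many spurious pairs, which is why the count of false positives is reported rather than a rate.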

FIG. 3.

Comparison of true positive rate and the number of false positives among SPHINX, PLINK, and saturated two-locus test with different numbers of true association SNP pairs under the additive scenario (a,b) and the nonadditive scenario (c,d).

Comparison with different minor allele frequencies. We evaluated the three methods on simulation datasets with N = 200 (sample size), P = 300 (the number of SNPs), ν = {3, 3} (association strength of marginal and interaction effects), and ξ = {3, 3} (the number of true association SNPs and SNP pairs). Figure 4 shows the true positive rate and the number of false positives of the three methods (columns) with different MAFs of true association SNP pairs (i.e., MAF1 = MAF2 ∈ {0.1, 0.2, 0.3}) under the linear scenario (Fig. 4a) and the nonlinear scenario (Fig. 4b). Overall, SPHINX achieved the best performance considering both the true positive rate and the number of false positives. Comparing PLINK and SPHINX under both the additive and nonadditive scenarios, SPHINX showed a significantly better true positive rate than PLINK while producing fewer false positives. Furthermore, SPHINX effectively controlled the number of false positives over all regions of MAFs, showing that the theory of stability selection (Meinshausen and Bühlmann, 2010) is in agreement with the empirical results (e.g., under the additive scenario, SPHINX had 0.12 false positives on average). Comparing the saturated two-locus test and SPHINX, for both scenarios the saturated two-locus test found slightly more true positives but a much larger number of false positives than SPHINX, which makes the saturated two-locus test impractical. This can be explained by the fact that the many parameters in the saturated two-locus test led to over-fitting of the model.

FIG. 4.

Comparison of true positive rate and the number of false positives for SPHINX (first column), two-locus test by PLINK (second column), and saturated two-locus test (third column) under the linear scenario (a) and the nonlinear scenario (b). In each panel, x-axis and y-axis show MAFs of true association SNP pairs (MAF1, MAF2), and z-axis represents the true positive rate or the number of false positives.

Comparison with different association strengths. We also tested the three methods with different association strengths ν₁ = ν₂ = 1,…, 5 (N = 200, P = 300, MAF1 = MAF2 = 0.1, ξ = {3, 3}), and show the true positive rate and the number of false positives under the additive scenario (Fig. 5a and b) and under the nonadditive scenario (Fig. 5c and d). Overall, SPHINX showed the best performance among the three methods, as it found a relatively large number of true positives while effectively suppressing false positives over all association strengths. Furthermore, under the nonadditive scenario, only SPHINX increased its true positive rate as association strength increased while keeping false positives under control. PLINK showed a very low true positive rate (<0.05 for all association strengths), and the saturated two-locus test produced many false positives (>200 in most cases).

FIG. 5.

Comparison of true positive rate and the number of false positives among SPHINX, PLINK, and saturated two-locus test with different association strengths under the additive scenario (a,b) and the nonadditive scenario (c,d).

3.4. Benefits of using a piecewise linear model for screening

We tested the benefits of using a piecewise linear model instead of a simple linear model during the screening procedure. Throughout this section, we use PLS to indicate using a piecewise linear model for screening and LS to indicate using a simple linear model for screening. For this experiment, we simulated data with P = 500 (which generates 124,750 candidate SNP pairs), three SNPs having marginal effects with association strength of 1, and three SNP pairs having interaction effects with association strength of 3. Given the simulation data, for both PLS and LS, N candidate SNP pairs were selected. We then evaluated the true positive rate of PLS and LS under different minor allele frequencies (MAFs) of true association SNP pairs from 0.1 to 0.4 (fixing N = 200) and different sample sizes from 100 to 600 (fixing MAF1 = MAF2 = 0.1). Figure 6 shows the average true positive rate of PLS and LS with error bars of 1/2 standard deviation when the underlying true interaction effect was additive (Fig. 6a and b) and nonadditive (Fig. 6c and d). In general, our results show that PLS is very useful under various simulation settings. When the true model was linear, the true positive rates of PLS and LS were comparable in most of our settings (Fig. 6a and b), which was not expected because a simple linear model would be ideal given finite data under the additive scenario. It seems that the model complexity of PLS was small enough not to lose much power. When the true model was nonlinear, PLS showed clear benefits over LS. As seen in Figure 6c and d, the true positive rate of PLS increased substantially as sample size and minor allele frequency increased, whereas the true positive rate of LS improved only marginally. It seems that the true positive rate of PLS increased because its additional degrees of freedom allowed PLS to fit the data well under the nonadditive scenario.
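The contrast between PLS and LS can be illustrated with a small sketch. It is not the screening algorithm evaluated above: it assumes a hinge basis with a single knot at a genotype product of 1 as a stand-in for the piecewise linear model, fits each candidate pair by ordinary least squares, and ranks pairs by residual sum of squares. The names `rss`, `screen_pairs`, and the `knot` parameter are hypothetical.

```python
import numpy as np

def rss(design, y):
    """Residual sum of squares of the least-squares fit of y on the design matrix."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)

def screen_pairs(X, y, n_keep, piecewise=True, knot=1.0):
    """Rank all SNP pairs by the fit of a (piecewise) linear model in x_j * x_k."""
    N, P = X.shape
    ones = np.ones(N)
    scored = []
    for j in range(P):
        for k in range(j + 1, P):
            m = (X[:, j] * X[:, k]).astype(float)
            columns = [ones, m]
            if piecewise:
                # Hinge term lets the slope change at the knot (an assumed basis).
                columns.append(np.maximum(m - knot, 0.0))
            scored.append((rss(np.column_stack(columns), y), (j, k)))
    scored.sort(key=lambda item: item[0])
    return [pair for _, pair in scored[:n_keep]]

# Toy data: the effect of pair (0, 1) rises and then falls with the genotype product.
rng = np.random.default_rng(0)
X_demo = rng.binomial(2, 0.5, size=(300, 30))
effect = {0: 0.0, 1: 3.0, 2: 0.0, 4: -6.0}
m_true = X_demo[:, 0] * X_demo[:, 1]
y_demo = np.array([effect[int(v)] for v in m_true]) + rng.normal(size=300)

pls_top = screen_pairs(X_demo, y_demo, n_keep=5, piecewise=True)
ls_top = screen_pairs(X_demo, y_demo, n_keep=5, piecewise=False)
```

Since the linear design is nested in the piecewise one, the piecewise fit never has a larger residual for a given pair; the extra hinge column is the "additional degree of freedom" that helps when the true effect is not monotone in the product.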

FIG. 6.

Comparison of true positive rate between piecewise linear screening and linear screening under different sample sizes and MAFs of true association SNP pairs under additive scenario (a,b) and nonadditive scenario (c,d).

3.5. Scalability of Screening Implementation

We carried out scalability experiments for our screening implementation on oxygen, a six-node cluster of dual-socket, quad-core Intel Xeon E5410 machines with 32 GB of RAM per node running Linux kernel 2.6.31-23 (48 cores in total). Figure 7 shows the throughput (SNPs processed per second) for various scenarios when running on a simulated dataset that had 200 SNPs, 500 samples, and 20 true association SNPs. Each experiment was run for 50 iterations, where each iteration considered 200 marginal candidates and $\binom{200}{2}$ interaction candidates. To test the effect of simultaneously evaluating multiple responses (e.g., eQTL mapping on many gene traits), we ran experiments with the number of responses varying from 1 to 16. The left panel depicts the multinode (cluster) performance of our implementation on oxygen; as can be seen, our implementation is able to process up to 14,000 SNPs per second on six machines and shows near-linear speedup. Furthermore, our implementation handles an increasing number of responses (e.g., gene traits) gracefully; processing 16 responses in a multitask fashion results in only a 3.5× slowdown compared to processing just one response (4,000 SNPs per second as opposed to 14,000 SNPs per second). The middle panel shows the near-linear speedup achieved when we use pure multithreading on a single node of oxygen. The scalability is slightly lower than in the multinode case because of memory bandwidth limits that result from BLAS-2 operations such as matrix-vector products. Finally, the right panel demonstrates our algorithm's capability to exploit both multicore and cluster architectures together. In this experiment, we ran eight threads per node and increased the number of nodes from one to six to achieve a near-linear speedup. To conclude, our implementation is able to efficiently process large datasets while scaling near-linearly.
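The figures quoted in this experiment can be checked with a few lines of arithmetic. The sketch below only restates numbers from the text; the interpretation of the multi-response gain (SNP-response evaluations per second) is an assumption about how the throughput is counted, and all variable names are hypothetical.

```python
from math import comb

n_snps = 200
marginal_candidates = n_snps
interaction_candidates = comb(n_snps, 2)
candidates_per_iteration = marginal_candidates + interaction_candidates

# Throughput quoted for 1 response and for 16 responses (SNPs per second).
throughput_single = 14_000
throughput_sixteen = 4_000
slowdown = throughput_single / throughput_sixteen

# Assumed reading: with 16 responses, each processed SNP is evaluated against
# 16 traits, so the number of SNP-response evaluations per second goes up.
evaluations_gain = (throughput_sixteen * 16) / throughput_single

print(candidates_per_iteration, slowdown, round(evaluations_gain, 2))
```

Under that reading, evaluating 16 traits together costs 3.5 times as much as one trait rather than 16 times as much, which is the sense in which the multitask mode scales gracefully.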

FIG. 7.

Performance of the parallel implementation of our screening algorithm on oxygen cluster.

4. Association Analysis of Late-Onset Alzheimer's Disease Data

We applied SPHINX to late-onset Alzheimer's disease (AD) data from the Harvard Brain Tissue Resource Center and Merck Research Laboratories (Zhang et al., 2013) in an attempt to detect causal SNPs associated either marginally or epistatically with molecular traits of interest. The dataset comprises 206 AD cases with 555,091 SNPs in total and expression levels of 37,585 DNA probes, including known and predicted genes, miRNAs, and noncoding RNAs, in three brain regions (cerebellum, visual cortex, and dorsolateral prefrontal cortex), profiled on a custom-made Agilent 44K microarray. Specifically, we are interested in the expression traits of all 718 genes in visual cortex related to neurological diseases according to GAD (genetic association database) (Becker et al., 2004), and we focused on the 18,137 SNPs residing within 50 kb from these genes, in an attempt to search for cis-acting causal SNPs or “restricted” trans-acting (i.e., acting on genes within the same functional group) SNPs related to neurological diseases. This results in a massive problem involving 18,137 SNPs, ∼164 million SNP pairs, and 718 gene traits, that is, ∼118 billion candidate associations between SNPs/SNP pairs and traits. We employed a cluster with a total of 720 cores (see Methods for experimental details); screening took 4.5 hours and stability selection with PLM-based randomized group lasso took <1 hour. Using SPHINX, we found 16 SNPs and 17 SNP pairs significantly associated with the expression traits (see Tables 1 and 2 for the list of all SNPs and SNP pairs found by SPHINX). Note that most association studies on AD have focused on detecting SNPs with marginal effects, and SNP pairs associated with AD are largely unknown. The patterns of marginal and interaction effects are illustrated in Figure 8.
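The problem size quoted above follows directly from the number of SNPs and traits; as a worked check:

\begin{align*}
\binom{18{,}137}{2} &= \frac{18{,}137 \times 18{,}136}{2} = 164{,}466{,}316 \approx 1.64 \times 10^{8}, \\
\left(18{,}137 + 164{,}466{,}316\right) \times 718 &= 118{,}099{,}837{,}254 \approx 1.18 \times 10^{11}.
\end{align*}

The first line counts SNP pairs, and the second multiplies the marginal and pairwise candidates by the 718 gene traits.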

FIG. 8.

Gene expression levels according to the genotypes of (a) 16 SNPs and (b) 17 SNP pairs found by SPHINX. In (a), x-axis represents genotypes and y-axis shows the average gene expression levels of individuals who possess the corresponding genotype with error bars of 1/2 standard deviation. In (b), x- and y-axis represent genotypes of an SNP pair and z-axis shows the average gene expression levels.

Table 1.

Significant Trait-Associated SNPs in Alzheimer's Disease Dataset (Zhang et al., 2013) Found by SPHINX

SNP	GENE	Affected gene	Stability score
rs1047631	DTNBP1	DTNBP1	0.705
rs536635	C9orf72	SELL	0.651
rs7483826	WT1	WT1	0.979
rs2699411	LRPAP1	LRPAP1	0.824
rs16844487	LRPAP1	LRPAP1	0.763
rs1323580	PTPRD	HHEX	0.631
rs4701834	SEMA5A	SEMA5A	0.631
rs7852952	PTPRD	PTPRD	0.724
rs2734986	HLA-A	HLA-A	0.628
rs1611710	HLA-A	HLA-A	0.617
rs2395175	HLA-DRB1	HLA-DRB1	0.692
rs602875	HLA-DQB1	HLA-DQB1	0.809
rs3135363	HLA-DRB1	HLA-DQB1	0.717
rs1619379	HLA-A	HLA-A	0.967
rs156697	GSTO2	GSTO2	0.943
rs7759273	ABCB1	PARK2	0.67

For each SNP, GENE is the gene located within 50 kb of that SNP. The stability score is the proportion of stability selection runs in which the SNP was selected.
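The stability score can be illustrated with a short sketch of stability selection (Meinshausen and Bühlmann, 2010): a selector is run on many random half-samples, and each feature is scored by how often it is selected. The sketch below swaps in a simple top-k correlation selector in place of the randomized group lasso used by SPHINX, so it shows only the scoring idea; the function names and all parameter values are hypothetical.

```python
import numpy as np

def top_k_by_correlation(X, y, k):
    """Stand-in selector (not randomized group lasso): k features most correlated with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12
    corr = np.abs(Xc.T @ yc) / denom
    return np.argsort(corr)[-k:]

def stability_scores(X, y, k, n_subsamples, rng):
    """Fraction of random half-samples in which each feature is selected."""
    N, P = X.shape
    counts = np.zeros(P)
    for _ in range(n_subsamples):
        idx = rng.choice(N, size=N // 2, replace=False)
        counts[top_k_by_correlation(X[idx], y[idx], k)] += 1
    return counts / n_subsamples

# Toy data: only SNP 0 affects the trait.
rng = np.random.default_rng(1)
X_toy = rng.binomial(2, 0.3, size=(200, 20))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(size=200)
scores = stability_scores(X_toy, y_toy, k=3, n_subsamples=100, rng=rng)
```

In this toy run the causal SNP is picked in every half-sample, whereas the remaining slots are filled by different noise features each time, so their scores stay well below 1.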

Table 2.

Significant Trait-Associated SNP Pairs Identified by SPHINX in Alzheimer's Disease Dataset (Zhang et al., 2013)

SNP A	GENE A	SNP B	GENE B	Affected gene	Stability score
rs10501554	DLG2	rs7805834	NOS3	NEFH	0.684
rs4547324	Intergenic	rs7870939	PTPRD	MEIS1	0.602
rs1956993	NUBPL	rs6677129	LOC199897	FARP1	0.633
rs27744	LTC4S	rs13209308	PARK2	CLCN2	0.629
rs17150898	MAGI2	rs7798194	CDK5	NINJ2	0.605
rs2802247	FLT1	rs9533787	DNAJC15	ADH1C	0.629
rs10883782	CYP17A1	rs10786737	CNNM2	SCN1B	0.683
rs7139251	ITPR2	rs12915954	IGF1R	IL6	0.605
rs11207272	PDE4D	rs2274932	ZBP1	ARSB	0.635
rs2634507	TOX	rs11790283	VLDLR	SFXN2	0.622
rs17309944	BDNF	rs358523	HTR1A	GRIK1	0.665
rs10501554	DLG2	rs17318454	RFX4	GNAS	0.611
rs4900468	CYP46A1	rs10217447	PTPRD	CAPN5	0.64
rs17415066	KCNJ10	rs912666	SUSD1	SEMA5A	0.663
rs6578750	CCKBR	rs12340630	TAL2	CTNNA3	0.631
rs4272759	PGR	rs6081791	PDYN	DAT	0.71
rs2679822	MYRIP	rs4538793	NXPH1	CPT7	0.85

For each SNP A (B), GENE A (B) is the gene located within 50 kb of that SNP. The stability score is the proportion of stability selection runs in which the pair was selected.

4.1. Marginal effects in late-onset Alzheimer's disease dataset

Among the 16 SNPs identified with marginal effects, 13 SNPs were located near affected genes (12 SNPs are located within 50 kb, and 1 SNP is located within 130 kb from their associated genes), and 3 SNPs were associated with a gene trait on a different chromosome. As an example, here we investigate 6 SNPs (rs1619379, rs2734986, rs1611710, rs2395175, rs3135363, rs602875) associated with HLA (human leukocyte antigen) genes, including HLA-A, HLA-DRB1, and HLA-DQB1, which are related to the immune system. All 6 SNPs were located near the affected HLA genes, which encode proteins for antigen presentation (Bodmer and Bodmer, 1978). We observed that 5 of the 6 SNPs had a positive correlation with the expression levels of their associated genes, whereas 1 SNP (rs1619379) had a negative correlation with the expression level of its associated gene (HLA-A).

We found that 5 SNPs (out of the 6 SNPs) had genome annotations at their locations. For associations between the three SNPs (rs1619379, rs2734986, rs1611710) and HLA-A, we observed that rs1619379 and rs1611710 coincide with an H3K27Ac histone mark and transcription factor binding sites, respectively, and rs2734986 aligns with spliced ESTs. This suggests that rs1619379 and rs1611710 may perturb regulatory elements of HLA-A, and rs2734986 may be related to a mechanism for DNA transcription. In the case of the association between rs2395175 and HLA-DRB1, rs2395175 was in an intron of the HLA-DRB1 gene (chr6:32489683-32557613). Finally, for associations between the two SNPs (rs3135363, rs602875) and HLA-DQB1, we observed that rs3135363 coincides with both a transcription factor binding site and an H3K27Ac histone mark, which hints that rs3135363 may be related to regulatory mechanisms for HLA-DQB1. For rs602875, we did not find any specific genome annotations.

It should be noted that associations between HLA genes and late-onset AD (Lehmann et al., 2001; Maggioli et al., 2013) have been found, and these findings have been replicated in previous studies. It has been reported that there is an association between HLA-A and late onset of AD (Payami et al., 1997; Guerini et al., 2009), and Lehmann et al. (2006) replicated the association between HLA-B7 and AD. Furthermore, HLA-DRB1, identified by SPHINX, has recently been reported as a new susceptibility locus for AD (Lambert et al., 2013). Lambert et al. identified 11 new loci associated with AD, including HLA-DRB1, from 17,008 AD cases and 37,154 controls. This dataset is independent from ours, which suggests that our findings are reproducible. As we found associations between the 6 SNPs and HLA genes, and previous studies reported associations between HLA genes and AD, it would be interesting to further investigate whether these associations are related to regulatory mechanisms or transcription factor binding.

4.2. Interaction effects in late-onset Alzheimer's disease dataset

Among 17 SNP pairs identified with interaction effects, as an example, we investigate the biological underpinnings of one of our findings—the pair rs4272759 (chr11:100899750) and rs6081791 (chr20:1988298) that is jointly (but not marginally) associated with DAT (dopamine active transporter, chr5:1392905-1445545) — to demonstrate the biological validity of our results. Specifically, the expression level of DAT is high only when both SNPs are heterozygous (i.e., both SNPs have only one minor allele). SNP rs4272759 is located 605 base-pairs upstream of the start position of gene PGR (progesterone receptor, chr11:100900355-101001255), whereas SNP rs6081791 is 13,407 base-pairs downstream of gene PDYN (prodynorphin, chr20:1959402-1974891). An extensive literature survey has yielded intriguing biological evidence to explain the association involving DAT with PGR and PDYN, suggesting that our finding is biologically plausible. It was reported that progesterone treatment could increase the dynorphin concentration and prodynorphin mRNA level (prodynorphin is the precursor protein of dynorphin) (Foradori et al., 2005), suggesting that a disruption of the PGR function could alter the activity of PDYN, which supports our finding that the SNPs in PGR and PDYN are epistatic. A direct association between PDYN and DAT has also been reported. For example, it has been reported that prodynorphin expression in the striatum is associated with D1 dopamine receptor stimulation (Gerfen et al., 1990); furthermore, in the experiments with DAT knock-down mice, Cagniard et al. (2005) found that the increased level of dopamine is associated with the level of dynorphin expression. 
Overall, evidence from the literature seems to support a hypothesis drawn from our association analysis of interacting genetic variations that a pair of SNPs affecting PGR and PDYN are likely to lead to an epistatic effect on DAT, and it would be interesting to further examine the status of DAT in the case studied in Foradori et al. (2005), and the status of PGR in cases studied in Gerfen et al. (1990) and Cagniard et al. (2005) to directly confirm and characterize such an epistatic effect.
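An interaction pattern such as "expression is high only when both SNPs are heterozygous" can be inspected by tabulating the mean trait value for each of the nine joint genotypes. The following sketch uses synthetic data, not the AD dataset; the function name `joint_genotype_means` and the toy effect size are hypothetical.

```python
import numpy as np

def joint_genotype_means(snp_a, snp_b, trait):
    """3 x 3 table of mean trait values; entry [a, b] is for genotypes (a, b)."""
    means = np.full((3, 3), np.nan)  # NaN marks genotype combinations with no samples
    for a in range(3):
        for b in range(3):
            mask = (snp_a == a) & (snp_b == b)
            if mask.any():
                means[a, b] = trait[mask].mean()
    return means

# Synthetic example: the trait is elevated only for double heterozygotes.
rng = np.random.default_rng(2)
snp_a = rng.binomial(2, 0.4, size=500)
snp_b = rng.binomial(2, 0.4, size=500)
trait = 2.0 * ((snp_a == 1) & (snp_b == 1)) + rng.normal(scale=0.1, size=500)
table = joint_genotype_means(snp_a, snp_b, trait)
```

A pattern that peaks at the (1, 1) cell is not monotone in the genotype product (1 for double heterozygotes, 4 for double minor homozygotes), so a single linear term in the product cannot represent it.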

5. Conclusions

We developed a unified framework for detecting marginal and pairwise interaction effects on traits, built on state-of-the-art techniques including screening, randomization, and stability selection. Furthermore, to facilitate the detection of SNPs and SNP pairs associated with traits at a whole-genome scale, we implemented an efficient and scalable screening program. We validated the efficacy of SPHINX via simulations and the analysis of a late-onset Alzheimer's disease dataset. Note that detecting pairwise interaction effects on traits requires us to address computational and statistical challenges simultaneously, which stem from a large number of SNP pairs to be tested, correlations between SNPs/SNP pairs, and nonlinear patterns of marginal and interaction effects; to our knowledge, SPHINX is the first attempt to address these challenges within a single framework. We further note that by redefining mⁱ_jk in Equation (1), it is possible to investigate different choices of interaction encodings [e.g., data-driven encoding (He et al., 2015)]. In this article, we adopted the widely used genotype encoding for the pairwise interaction (i.e., multiplication of two SNPs). For future work, we plan to (1) incorporate diverse prior knowledge into our model such as trait networks using graph-guided fused lasso (Kim and Xing, 2009) or grouping information on both genotypes (e.g., LD structures) and phenotypic traits (e.g., pathways) using structured input–output lasso (Lee and Xing, 2012), (2) use kernel techniques for detecting multiway interactions among SNPs, (3) detect interaction effects under case-control settings via logistic regression, and (4) combine a linear mixed model with SPHINX to correct for population structure (Rakitsch et al., 2013).

Acknowledgments

This work was supported by NIH 1 R01 GM087694-01; NIH 1RC2HL101487-01 (ARRA); AFOSR FA9550010247; ONR N0001140910758; NSF Career DBI-0546594; NSF IIS-0713379; P30 DA035778A1; and an Alfred P. Sloan Fellowship awarded to E.P.X.

Author Disclosure Statement

No competing financial interests exist.

References

Bach F.R. 2008. Consistency of the group lasso and multiple kernel learning. J. Mach. Learn. Res., 9, 1179–1225.

Becker K.G., Barnes K.C., Bright T.J., and Wang S.A. 2004. The genetic association database. Nat. Genet., 36, 431–432.

Bien, Taylor, and Tibshirani 2013. A lasso for hierarchical interactions. Ann. Stat., 41, 1111–1141.

Bodmer W.F., and Bodmer J.G. 1978. Evolution and function of the HLA system. Br. Med. Bull., 34, 309–316.

Bretscher 1997. Linear Algebra with Applications. Prentice-Hall, Englewood Cliffs, NJ.

Bühlmann, Rütimann, van de Geer, and Zhang 2013. Correlated variables in regression: Clustering and sparse estimation. J. Stat. Plan. Infer., 143, 1835–1858.

Cagniard, Balsam P.D., Brunner, and Zhuang 2005. Mice with chronically elevated dopamine exhibit enhanced motivation, but not learning, for a food reward. Neuropsychopharmacology, 31, 1362–1370.

Evans D.M., Marchini, Morris A.P., and Cardon L.R. 2006. Two-stage two-locus models in genome-wide association. PLoS Genet. 2, e157.

Fan, Feng, and Song 2011. Nonparametric independence screening in sparse ultra-high-dimensional additive models. J. Am. Stat. Assoc., 106, 544–557.

Fan, and Lv 2008. Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Ser. B, 70, 849–911.

Foradori C.D., Goodman R.L., Adams V.L., et al. 2005. Progesterone increases dynorphin A concentrations in cerebrospinal fluid and preprodynorphin messenger ribonucleic acid levels in a subset of dynorphin neurons in the sheep. Endocrinology, 146, 1835–1842.

Friedman, Hastie, Höfling, and Tibshirani 2007. Pathwise coordinate optimization. Ann. Appl. Stat., 1, 302–332.

Gerfen C.R., Engber T.M., Mahan L.C., et al. 1990. D1 and D2 dopamine receptor-regulated gene expression of striatonigral and striatopallidal neurons. Science, 250, 1429–1432.

Golub G.H., and Reinsch 1970. Singular value decomposition and least squares solutions. Numer. Math., 14, 403–420.

Guerini F.R., Tinelli, Calabrese, et al. 2009. HLA-A*01 is associated with late onset of Alzheimer's disease in Italian patients. Int. J. Immunopathol. Pharmacol. 22, 991–999.

He, Wang, and Parida 2015. Data-driven encoding for quantitative genetic trait prediction. BMC Bioinform. 16, S10.

Hoffman G.E., Logsdon B.A., and Mezey J.G. 2013. PUMA: A unified framework for penalized multiple regression analysis of GWAS data. PLoS Comput. Biol. 9, e1003101.

Kambadur, Gupta, Ghoting, et al. 2009. PFunc: Modern task parallelism for modern high performance computing, 43. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis. ACM, New York.

Kim, and Xing E.P. 2009. Statistical estimation of correlated genome associations to a quantitative trait network. PLoS Genet. 5, e1000587.

Lambert, et al. 2013. Meta-analysis of 74,046 individuals identifies 11 new susceptibility loci for Alzheimer's disease. Nat. Genet., 45, 1452–1458.

Lee, and Xing E.P. 2012. Leveraging input and output structures for joint mapping of epistatic and marginal eQTLs. Bioinformatics, 28, i137–i146.

Lehmann D.J., Barnardo M.C., Fuggle, et al. 2006. Replication of the association of HLA-B7 with Alzheimer's disease: A role for homozygosity? J. Neuroinflamm. 3, 33.

Lehmann D.J., et al. 2001. HLA class I, II & III genes in confirmed late-onset Alzheimer's disease. Neurobiol. Aging, 22, 71–77.

Li, and Li 2008. GWAsimulator: A rapid whole-genome simulation program. Bioinformatics, 24, 140–142.

Li, Zhu, Manning-Bog A.B., et al. 2004. Dopamine and l-dopa disaggregate amyloid fibrils: Implications for Parkinson's and Alzheimer's disease. FASEB J. 18, 962–964.

Liu, and Ye 2010. Moreau-Yosida regularization for grouped tree structure learning. Adv. Neural Inf. Process. Syst. 187, 195–207.

Liu, Ji, and Ye 2009. SLEP: Sparse learning with efficient projections. Arizona State University. www.public.asu.edu/~jye02/Software/SLEP

Maggioli, Boiocchi, Zorzetto, et al. 2013. The human leukocyte antigen class III haplotype approach: New insight in Alzheimer's disease inflammation hypothesis. Curr. Alzheimer Res., 10, 1047–1056.

Meinshausen, and Bühlmann 2010. Stability selection. J. R. Stat. Soc. Ser. B, 72, 417–473.

Meinshausen, Meier, and Bühlmann 2009. P-values for high-dimensional regression. J. Am. Stat. Assoc., 104, 1671–1681.

Message Passing Interface Forum. June 1995. www.mpi-forum.org/

Message Passing Interface Forum. July 1997. www.mpi-forum.org/

Moore J.H., Asselbergs F.W., and Williams S.M. 2010. Bioinformatics challenges for genome-wide association studies. Bioinformatics, 26, 445–455.

Park, and Hastie 2008. Penalized logistic regression for detecting gene interactions. Biostatistics, 9, 30–50.

Payami, et al. 1997. Evidence for association of HLA-A2 allele with onset age of Alzheimer's disease. Neurology, 49, 512–518.

Purcell, et al. 2007. PLINK: A tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet., 81, 559–575.

Rakitsch, Lippert, Stegle, and Borgwardt 2013. A lasso multi-marker mixed model for association mapping with population structure correction. Bioinformatics, 29, 206–214.

Wan, Yang, et al. 2010. BOOST: A fast approach to detecting gene-gene interactions in genome-wide case-control studies. Am. J. Hum. Genet. 87, 325.

Wasserman, and Roeder 2009. High dimensional variable selection. Ann. Stat. 37, 2178.

Yuan, and Lin 2005. Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B, 68, 49–67.

Zhang, et al. 2013. Integrated systems approach identifies genetic nodes and networks in late-onset Alzheimer's disease. Cell, 153, 707–720.

Zhang, Zou, and Wang 2008. FastANOVA: An efficient algorithm for genome-wide association study, 821–829. In Proceeding of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York.