Statistical Selection Strategy for Risk and Protective Rare Variants Associated with Complex Traits

Abstract

In genetic association studies with deep sequencing data, it is a challenging statistical problem to precisely locate rare variants associated with complex diseases or traits due to the limited number of observed genetic mutations. In particular, both risk and protective rare variants can be present in the same gene or genetic region. There currently exist very few statistical methods to separate casual rare variants from noncausal variants within a disease/trait-related gene or a genetic region, while there are relatively many statistical tests to detect a phenotypic association of a group of rare variants such as a gene or a genetic region. In this article, we propose a new statistical selection strategy that is able to locate causal rare variants within the disease/trait-related gene or a genetic region. The proposed procedure is to linearly combine potential risk and protective variants in order to find the optimal combination of rare variants that can have the strongest association signal. It is also computationally very efficient since the procedure is based on forward selection. In simulation studies we demonstrate that the selection performance of the proposed procedure is more powerful than other existing methods when both risk and protective variants are present. We also applied it to the real sequencing data on the ANGPTL gene family from the Dallas Heart Study.

1. Introduction

In human genetic association studies, newly identified rare variants [with minor allele frequency (MAF) < 1%] are expected to explain missing heritability of complex diseases/traits since early genome-wide association studies (GWAS) have focused on common variants (Manolio et al., 2009; Zuk et al., 2012). As high-throughput sequencing technology is rapidly developing, rare variants analysis is getting attention in genetics research. For example, multiple rare variants that are associated with type 1 diabetes (Nejentsev et al., 2009), hypertension (Ji et al., 2008), autism (Weiss et al., 2008), obesity (Ahituv et al., 2007), and low plasma levels of HDL cholesterol (Romeo et al., 2007; Slatter et al., 2008) have been identified. Due to the extremely low frequency of rare variants compared with common variants, most statistical methods designed to detect common variants are not directly applied to locate susceptible rare variants that are associated with complex diseases or traits. Recently, new statistical approaches have been proposed to test phenotypic associations of a group of rare variants such as a gene or a genetic region, where association signals may be boosted as aggregating or collapsing multiple rare variants (Li and Leal, 2008; Madsen and Browning, 2009; Liu and Leal, 2010; Price et al., 2010; Wu et al., 2011; Lin and Tang, 2011; Ionita-Laza et al., 2011; Neale et al., 2011; Lee et al., 2012, 2013; Ionita-Laza et al., 2013).

However, there is a limited number of statistical methods that are able to distinguish causal rare variants from noncausal rare variants within the disease/trait-related gene or genetic region. Once a phenotypic association of a group of rare variants is detected, next interest is to specify which rare variants among multiple variants within the disease/trait-related gene or genetic region are truly causal variants. Most existing rare variant tests consider only an association of a group of variants since univariate analysis testing of an individual rare variant has too low statistical power due to rarely observed genetic mutations. Recently, Sun and Wang (2014) regarded this as a variable selection problem rather than a statistical test problem. Therefore, they proposed a power set (Pset)-based statistical selection procedure to locate susceptible rare variants within a disease/trait-related gene or genetic region. Their selection method, which is based on a weighted linear combination of multiple rare variants that has the maximum association signal, is powerful particularly when both risk and protective variants are present in the same gene or genetic region. However, it carries two nontrivial problems in practice.

First, Pset initially screens potentially protective variants and flips the coding of the protective variants so that both risk and protective variants can have an effect on the outcome in the same direction. This idea is originally proposed by Han and Pan (2010) and called a data-adaptive procedure. Sun and Wang (2014) demonstrated that the adaptive Pset (aPset) can precisely locate protective variants as well as risk variants when both risk and protective variants are present in the same gene or genetic region. However, the initial screening test of aPset is based on a marginal association test of each rare variant, so the detection power of potential protective variants depends on both variant effect size and a sample size. That is, aPset has a limitation to select protective variants when either the effect size or the sample size is relatively small. Another problem of aPset is computational cost for a big gene that has a large number of rare variants. The power set of the subsets of rare variants generates the 2^p − 1 different combinations of rare variants, where p is the total number of rare variants within a gene or a genetic region. As p is increasing, the computational time of the 2^p − 1 statistics is explosively increased.

In this article, we propose new statistical selection strategy to locate susceptible rare variants within a disease/trait-related gene or genetic region. The proposed method searches a single variant that has the strongest association signal with the phenotype outcome at each stage, where an individual variant is considered as potentially risk, protective, or noncausal. Then, selected variants at each stage are linearly combined until the association signal is no longer powerful. Since the proposed selection procedure proceeds at most p stages, the computational cost can be drastically reduced even for a big gene. Moreover, it is not required to perform the initial screening test to identify potentially protective variants, so the selection performance of protective variants are equivalent to that of risk variants. The rest of the article is organized as follows. In section 2, new statistical selection strategies and methods are described. In section 3, the results of extensive simulation studies where the proposed method is compared with the aPset selection procedure are present. We also applied the proposed selection procedure to real sequencing data on the ANGPTL gene family from the Dallas Heart Study (DHS). Finally, limitation of the proposed method and future research are discussed in section 4.

2. Methods

Assume that we observe the number of mutations over p rare variants in a gene or a genetic region from sequence data of n individuals. For the ith individual, let y_i be a quantitative or a binary phenotype outcome, and if the outcome, is binary, y_i is coded as either 0 for controls or 1 for cases. Let the p-dimensional genotype vector of the ith individual \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$x_i = \left( x_{i1} , \ldots , x_{ip} \right) ^T$$ \end{document} represent the presence of minor alleles over p variants, that is, x_ij = 1 if variant j has a genetic mutation and x_ij = 0 if no genetic mutations are observed for \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$j = 1 , \ldots , p$$ \end{document} . We also allow the ith individual to have the m-dimensional covariate effects \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$u_i = \left( u_{i1} , \ldots , u_{im} \right) ^T$$ \end{document} such as age and gender. Therefore, we denote the data set of the ith individual by \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\left( y_i , x_i , u_i \right) , i = 1 , \ldots , n$$ \end{document} .

Each rare variant should be one of risk, protective, or noncausal variants. We code them as 1, −1, and 0, respectively. Then, the parameter vector representing the type of variant can be defined as \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${ \boldsymbol{\xi}} = \left( \xi_{1} , \ldots , \xi_{p} \right) ^T$$ \end{document} where \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\xi_{j} \in \{ - 1 , 0 , 1 \} \ { \rm for} \ j = 1 , \ldots , p$$ \end{document} . Note that the total number of configurations of p rare variants is 3^p − 1 since a disease/trait-related gene is assumed to have at least one causal variant. That is, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${ \boldsymbol{\xi}} = \left( 0 , 0 , \ldots , 0 \right) ^{ \rm T}$$ \end{document} is not included in the parameter space. The weighted linear combination of multiple rare variants of the ith individual is defined as \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*}z_{i} = \sum_{j = 1}^{p} \xi_{j}w_{j}x_{ij}^* ,\end{align*} \end{document}

When both risk and protective variants are present within the same gene or genetic region, detection power is likely to be decreased. For this reason, Han and Pan (2010) proposed to flip the coding of the protective variant so that both risk and protective variants can have an effect on the outcome in the same direction. Also, Sun and Wang (2014) employed this idea to find the optimal weighted linear combination. Individual variant weight w_j is often used in rare variant analysis to increase association signals. For example, Madsen and Browning (2009) suggested to use the inverse of standard deviation of each variant. Lin and Tang (2011) demonstrated that a regression coefficient of an individual variant is an optimal weight.

In order to locate susceptible rare variants within a disease/trait-related gene, we need to find the optimal configuration of rare variants that has the strongest association between the weighted linear combination z_i and the phenotype outcome y_i. In presence of covariate effects such as age and gender, we can replace y_i by the regression residual \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\tilde{y}_i$$ \end{document} for i = 1, … , n, where a regression of phenotypes y_i on covariate m_i is initially fit. Then, the optimal configuration is defined as \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\newcommand{\textbi}[1]{\textbf{\textit{#1}}}\begin{document} \begin{align*} \hat{\boldsymbol{\xi}} = \mathop{\arg\max}\limits_{\xi}T (\tilde{\textbi{y}} , \textbi{z} ) , \tag{1} \end{align*} \end{document}

where \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\newcommand{\textbi}[1]{\textbf{\textit{#1}}}\begin{document} $$\tilde{\textbi{y}} = ( \tilde{y}_1 , \ldots , \tilde{y}_n )^{T} ,\textbf{\textit{z}} = ( z_{1} , \ldots , z_{n} ) ^{ \rm T}$$ \end{document} , and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\newcommand{\textbi}[1]{\textbf{\textit{#1}}}\begin{document} $$T ( \tilde{\textbi{y}}, {\bf z}) $$ \end{document} represents a statistic to measure the strength of an association between \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\newcommand{\textbi}[1]{\textbf{\textit{#1}}}\begin{document} $$ \tilde{\textbi{y}} $$ \end{document} and z .

Exhaustive search of the optimal configuration requires 3^p – 1 computations of the statistic. Although this is computationally feasible, the exhaustive search is very inefficient when p is relatively large like a big gene. In this article, we propose a forward-based procedure to select both risk and protective variants. The basic idea is that the most outcome-related variant that could be either a risk variant {ξ_j = 1} or a protective variant {ξ_j = −1} is included in the model one by one until no association signals between the weighted linear combination and the phenotype outcome are increased. We start with a single variant association for p variants, where we consider only p configurations such as ξ ∈ {(1,0, … , 0)^T, (0,1, … , 0)^T, … ,(0, 0, … , 1)^T}. Then, we choose one that has the maximum association with the phenotype outcome. Suppose that \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${\hat \xi}_j = 1$$ \end{document} in the selected \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\hat{\boldsymbol{\xi}}$$ \end{document} at the first stage, then we denote j by j₁. Note that the variant j₁ might be a protective variant, but association strength in the single variant association is the same for both risk and protective variants. Therefore, we can simply update \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${\hat \xi}_{j_1} = - 1$$ \end{document} if variant j₁ has a negative relationship with \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\newcommand{\textbi}[1]{\textbf{\textit{#1}}}\begin{document} $$\tilde{\textbi{y}}$$ \end{document} while other ξ_j = 0 for all j ≠ j₁.

At the second stage, the weighted linear combination of the the ith individual z_i is updated as \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} z_{i}^{ [ 2 ] } = {\hat \xi}_{j_1}w_{j_1}x_{ij_1}^* + \sum_{j \neq j_1} \xi_{j}w_{j}x_{ij}^* \end{align*} \end{document}

Given \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${\hat \xi}_{j_1}$$ \end{document} , we find the most outcome-related \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${\hat \xi}_j$$ \end{document} for j ≠ j₁ in this stage. This time we consider either ξ_j = 1 or ξ_j = −1 for all j ≠ j₁. Using the Equation (1) where z is replaced by z ^[2] = (z₁^[2], … , z_n^[2])^T, we select variant j₂, that is, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${\hat \xi}_{j_2} = 1$$ \end{document} or −1. This requires 2(p − 1) computations of the statistic since the remaining p − 1 variants might be a risk variant or a protective variant. In general, at the t-th stage the current model should have t − 1 nonzeros in \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\hat{\boldsymbol{\xi}}$$ \end{document} , and p − t + 1 variants are not included in the model yet. The same forward-based selection procedure is repeatedly applied, that is, the t-th updated weighted linear combination \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} z_{i}^{ [ t ] } = \mathop \sum \limits_{l = 1}^{t - 1} {\hat \xi}_{j_l}w_{j_l}x_{ij_l}^* + \sum_{j \notin \{ j_1 , \ldots , j_{t - 1} \} } \xi_{j}w_{j}x_{ij}^*\end{align*} \end{document}

replaces z_i in Equation (1) and the \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${\hat\xi}_{j_{\rm t}}$$ \end{document} is estimated. This procedure is stopped at the t-th stage when \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\newcommand{\textbi}[1]{\textbf{\textit{#1}}}\begin{document} \begin{align*} T ( \tilde{\textbi{y}} , \textbf{\textit{z}}^{ [ t ] } ) - T ( \tilde{\textbi{y}} , \textbf{\textit{z}}^{ [ t - 1 ] } ) < \lambda\end{align*} \end{document}

where λ ≥ 0 is a tuning parameter that controls a model size. When λ = 0, the model selects the most number of variants, including possible false positives. In contrast, the model selects small numbers of variants as λ is increasing. In this case, both true positives and false positives are expected to be reduced. Consequently, the tuning parameter λ can balance the number of causal variants and noncausal variants selected by the proposed procedure.

The proposed selection procedure requires at most p stages and at most p(p − 1) computations of the statistic, while aPset (Sun and Wang, 2014) always needs to compute exactly 2^p − 1 statistics. Therefore, the difference of computational time between the proposed procedure and aPset is significantly noticeable when p is large. In addition to computational cost, the proposed method allows an individual variant to be risk, protective, or noncausal at each stage, leading to equivalent selection powers of both risk and protective variants. In contrast, aPset initially screens potential protective variants based on a marginal association test where selection power of protective variants are always lower than that of risk variants. Besides, the proposed selection procedure can control a model size using a tuning parameter λ, so a parsimonious model can be chosen to minimize the number of false positives if necessary.

3. Results

In this section, we conducted extensive simulation studies to compare the proposed selection procedure to the data-adaptive power set–based procedure (aPset) proposed by Sun and Wang (2014), who demonstrated that the selection performance of aPset is the most powerful compared with other selection methods, particularly when both risk and protective variants are present within the same gene or the same genetic region. Also, we applied the proposed selection procedure to the Dallas Heart Study data in order to locate both risk and protective variants within the trait-related genes.

3.1. Simulation studies

First, we generated a minor allele frequency (MAF) at each variant using the Wright's distribution (Ionita-Laza et al., 2011; Madsen and Browning, 2009; Sun and Wang, 2014) \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*}f ( q ) = cq^{k - 1} ( 1 - q ) ^{k / 3 - 1}e^{12 ( 1 - q ) }\end{align*} \end{document}

where k = 0.0001, c is a normalizing constant, and q is the allele frequency, 0.001 ≤ q ≤ 0.01. Secondly, on the basis of the Hardy–Weinberg equilibrium (HWE), we generated genotype data \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$x_i = ( x_{i1} , \ldots , x_{ip} )$$ \end{document} with the p variants and the n individuals. Finally, we obtained quantitative phenotype value y_i by using the following regression model \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*}y_i = 0.5u_{i1} + 0.5u_{i2} + \beta_1x_{i1} + \beta_2x_{i2} + \ldots + \beta_px_{ip} + \epsilon_i ,\end{align*} \end{document}

where u_i₁ is a binary covariate generated from Bernoulli(0.5), u_i₂ is a continuous covariate generated from N(0, 1), and ε_i is a standard normal error generated from N(0, 1). The variant effect size is defined as \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*}\beta_{j} = \begin{cases} \gamma \mid \log_{10}{ \rm MAF}_j \mid \ \ \qquad { \rm if \ variant} \ j \ { \rm is \ risk} , \\ - \gamma \mid \log_{10} { \rm MAF}_j \mid \ \ \quad { \rm if \ variant} \ j \ { \rm is \ protective} , \\ 0 \qquad \qquad \ \quad \qquad \ \, { \rm if \ variant} \ j \ { \rm is \ noncausal}.\end{cases}\end{align*} \end{document}

In this setting rarer variants have larger effects. This is often used in rare variant simulation studies (Wu et al., 2011; Sun and Wang, 2014). Also, we set the sequence of γ from 0.2 to 0.7 increased by 0.1 so the average effect size of a risk variant is about 0.5 for γ = 0.2 and 1.75 for γ = 0.7. The sample size was fixed as 1,000, 2,000 and 5,000. For case–control phenotypes, we first generated twice the number of quantitative sample size, then we set case–control data selecting the top 25% of quantitative phenotypes (cases) and the bottom 25% of the phenotypes (controls).

We also considered three different variant settings such as (i) 5 risk, 0 protective, 10 noncausal, (ii) 3 risk, 2 protective, 10 noncausal, and (iii) 5 risk, 5 protective, 5 noncausal. Consequently, the total number of variants is fixed as p = 15 for all simulation studies. Before we conduct variant selection, we first detected significant genes with the 15 rare variants using the sequence kernel association test (SKAT) proposed by Wu et al. (2011). For only significant genes whose SKAT p-value is less than 0.05, we applied aPset and the proposed method with different λ values from 0 to 0.05 increased by 0.005. For a computational speed-up, we adopted a sample correlation for the statistic T(·,·) in Equation (1). In our experiment, other statistics from a regression-based test or a score test have almost the same selection result, since T(·,·) essentially measures the strength of correlation between two numerical vectors.

In order to assess the selection performance, we computed the averaged selection proportions (ASP) of risk, protective, and noncausal rare variants, along with three different variant settings, different sample sizes, different types of phenotype outcome, and different variant effect sizes. The ASP of risk variants is defined as \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*}\hbox { ASP of risk variants } = \frac { \hbox { the number of selected risk variants } } { K \times \hbox { the total number of true risk variants } } ,\end{align*} \end{document}

where K = 1,000 is the number of simulation replications. Similarly, the ASP of protective and noncausal variants are defined. The larger ASP of causal (risk and protective) variants and the smaller ASP of noncausal variants, the better the selection performance.

Figure 1 shows the ASPs of aPSet and the proposed selection procedures with λ = 0.005 (sel1), λ = 0.01 (sel2), and λ = 0.015 (sel3), along with three different variant settings and two different sample sizes of n = 1,000 and 2,000, when the outcome is quantitative and the averaged variant effect size is 0.75. It seems that the SKAT power is increasing as the sample size increases and the total number of causal (risk and protective) variants increases. As expected, the proposed selection procedure selects less noncausal variants when the tuning parameter λ increases, while it also selects less causal variants. For the proposed selection procedure, ASP of risk variants are almost the same as that of protective variants regardless of λ in all simulation settings.

FIG. 1.

Averaged selection proportions of risk (R), protective (P), and noncausal (N) rare variants for data-adaptive Pset (aPset), the proposed methods with λ = 0.005 (sel1), λ = 0.01 (sel2), and λ = 0.015 (sel3) when the outcome is quantitative. The sample size is n = 1,000 for (A)–(C) and n = 2,000 for (D)–(F). There are 5R/0P/10N variants in (A) and (D), 3R/2P/10N in (B) and (E), and 5R/5P/5N in (C) and (F).

In contrast, aPSet always selects a significantly lower number of protective variants than risk variants since it applies a marginal association test to initially screen potential protective variants. Although aPset shows slightly better selection performance than the proposed procedures when protective variants are absent in a gene, the latter overwhelms the former when both risk and protective variants are present. For example, ASP of the protective variant for aPset is around 53% when there are 3 risk, 2 protective, and 10 noncausal variants with n = 1,000. In the same setting, sel1 can increase the ASP of protective variants up to 80%, allowing 10% more noncausal variants. Likewise, sel2 can increase it up to 76%, allowing 5% additional noncausal variants. Similar patterns for case–control binary outcomes were observed in Figure 2.

FIG. 2.

Averaged selection proportions of risk (R), protective (P) and noncausal (N) rare variants for data-adaptive Pset (aPset), the proposed methods with λ = 0.005 (sel1), λ = 0.01 (sel2), and λ = 0.015 (sel3) when the outcome is binary (case–control status). The sample size is n = 1,000 for (A)–(C) and n = 2,000 for (D)–(F). There are 5R/0P/10N variants in (A) and (D), 3R/2P/10N in (B) and (E), and 5R/5P/5N in (C) and (F).

In addition to ASP, we also computed the selection power of each selection procedure. Sun and Wang (2014) defined the selection power as the simulation proportion that all causal variants (both risk and protective variants) are exactly selected among 1,000 simulation replications. Figure 3 displays the selection powers of the four selection procedures along with increasing variant effect sizes when there are 5 risk, 0 protective, and 10 noncausal variants, and 3 risk, 2 protective, and 10 noncausal variants with three different sample sizes, and the outcome is quantitative. The simulation result of selection power clearly demonstrates that the proposed selection procedure is superior over aPset in all simulation settings. Particularly, when the variant effect size is neither very small nor very large, the proposed procedure always outperforms aPset. The similar result was observed for case–control binary outcomes.

FIG. 3.

Selection power of data-adaptive Pset (aPset), the proposed methods with λ = 0.005 (sel1), λ = 0.01 (sel2), and λ = 0.015 (sel3) are displayed as the variant effect size increases when the outcome is quantitative. There are 5R/0P/10N variants in (A)–(C), and 3R/2P/10N variants in (D)–(F). The sample size is n = 1,000 for (A) and (D), n = 2,000 for (B) and (E), and n = 5,000 for (C) and (F).

In Figure 4, the selection power of the proposed method is present along with different values of λ when the sample sizes, variant settings, and the phenotype outcomes are all different. We can see that the values of the optimal tuning parameter λ that maximizes selection power are all different each simulation setting. This result concludes that the selection of the optimal tuning parameter of the proposed procedure can depend on the sample size; the type of phenotype outcome; and the number of risk, protective, and noncausal variants. Consequently, the correct choice of the optimal tuning parameter λ could be another challenging problem in rare variant selection. Although a signal value of λ is desired for real data analysis, we suggest to use multiple λ values and compare the selection results in practice.

FIG. 4.

Selection power of the proposed method along with different λ values and three different sample sizes of 1,000, 2,000 and 5,000. The outcome is binary (case–control status) for (A)–(C) and quantitative for (D)–(F). There are 5R/0P/10N variants in (A) and (D), 3R/2P/10N variants in (B) and (E), and 5R/2P/5N variants in (C) and (F).

3.2. Data analysis

We also applied the proposed selection procedure to the Dallas Heart Study (DHS) data (Romeo et al., 2007, 2009), where four genes—ANGTPL3, ANGTPL4, ANGTPL5, and ANGTPL6—were sequenced to identify susceptible rare variants associated with nine energy metabolism traits, namely triglyceride (TG), low-density lipoprotein (LDL), very low = density lipoprotein (VLDL), high-density lipoprotein (HDL), cholesterol, glucose, body mass index (BMI), and systolic (SysBP) and diastolic blood pressure (DiasBP). Various statistical methods have been applied to DHS data to detect trait-related genes for different ethnic groups (Liu and Leal, 2010, 2012; Sun and Wang, 2014).

Previous analysis of DHS data is largely limited to the European-American population. In this study, we focused on the Hispanic population, considering gender and logarithm-transformed age as covariates. First, we applied SKAT to identify significant genes associated with nine phenotype outcomes at 5% significance level, and found that the ANGTPL5 gene is associated with a binary systolic trait for Hispanic population. This association was also reported by Sun and Wang (2014). The ANGTPL5 gene contains seven rare variants that have less than 2% MAF. We then applied the four selection procedures compared in the simulation studies in order to separate causal variants from noncausal variants within the ANGTPL5 gene.

Table 1 summarizes the selection result of the 7 rare variants. It appears that all of the proposed selection procedures with three different λ values have exactly the same selection result, while aPset shows slightly different results. In particular, although two potential protective variants L98P and T268M were selected by all four methods, the variants L125F and N329S were selected by the proposed procedure but missed by aPset. They could be also regarded as potential protective variants because the genetic mutations of these variants were observed only in controls. In simulation studies, we have already shown that aPset has lower selection power especially for protective variants compared with the proposed selection procedure. The selection result of data analysis might be consistent with that of our simulation studies. Therefore, additional biological analysis of two variants L125F and N329S are strongly recommended.

Table 1.

Susceptible Rare Variants in the ANGPTL5 Gene that is Associated with the Binary Systolic Traits for the Hispanic Population

Variants	MAF	Cases	Ctrls	aPset	Fsel1	Fsel2	Fsel3
S93N	0.005976	2	0	×	×	×	×
L98P	0.009960	0	4	×	×	×	×
L125F	0.003984	0	1		×	×	×
S175P	0.003984	1	0	×	×	×	×
T268M	0.017928	1	7	×	×	×	×
G309R	0.003984	1	0	×
N329S	0.003984	0	1		×	×	×

The symbol × indicates a selected variant.

The columns “cases” and “ctrls” indicate the number of mutations in cases and controls, respectively.

4. Discussion

In this article, we proposed a new selection strategy to locate both risk and protective rare variants within disease/trait-related genes or genetic regions. The proposed method is not only computationally efficient but also shows powerful selection performance especially when both risk and protective variants are present within the same gene or genetic region. In extensive simulation studies we demonstrated that the selection power of the proposed procedure outperforms that of aPset proposed by Sun and Wang (2014) in various simulation scenarios.

One drawback of the proposed method is dependency of tuning parameter λ, which controls the number of selected rare variants. The correct choice of the optimal tuning parameter can maximize the selection power of the proposed method. However, the value of the optimal tuning parameter varies on the sample size; the variant effect size; the type of phenotype outcome; and the number of risk, protective, and noncausal variants. New method development to determine the optimal tuning parameter is a possibility for future research. Although only unweighted versions of the proposed method were discussed in this article, new or existing variant weights can be easily applied to the proposed selection procedure. We expect that the optimal variant weight as well as the optimal tuning parameter can lead the proposed selection procedure to have the best selection performance of causal rare variants in genetic association studies with sequencing data.

Footnotes

Acknowledgments

This research was supported by the Research Fund Program of Research Institute for Basic Sciences, Pusan National University, Korea (RIBS-PNU-2013-310), and Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (2014R1A1A2054178).

Author Disclosure Statement

The authors declare that no competing financial interests exist.

References

Ahituv

, Kavaslar

, Schackwitz

, et al. 2007. Medical sequencing at the extremes of human body mass. Am. J. Hum. Genet., 80, 779–791.

Han

, and Pan

2010. A data-adaptive sum test for disease association with multiple common or rare variants. Hum. Hered., 70, 4254.

Ionita-Laza

, Buxbaum

J.D.

, Laird

N.M.

, et al. 2011. A new testing strategy to identify rare variants with either risk or protective effect on disease. PLoS Genet. 7, e1001289.

Ionita-Laza

, Lee

, Makarov

, et al. 2013. Sequence kernel association tests for the combined effect of rare and common variants. Am. J. Hum. Genet., 92, 841–853.

, Foo

J.N.

, ORoak

B.J.

, et al. 2008. Rare independent mutations in renal salt handling genes contribute to blood pressure variation. Nat. Genet., 40, 592–599.

Lee

, Emond

M.J.

, Bamshad

M.J.

, et al. 2012. Optimal unified approach for rare-variant association testing with application to small-sample case-control whole-exome sequencing studies. Am. J. Hum. Genet., 91, 224237.

Lee

, Teslovich

T.M.

, Boehnke

, et al. 2013. General Framework for meta-analysis of rare variants in sequencing association studies. Am. J. Hum. Genet., 93, 42–53.

, and Leal

2008. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am. J. Hum. Genet., 83, 311321.

Lin

D.Y.

, and Tang

Z.Z.

2011. A general framework for detecting disease associations with rare variants in sequencing studies. Am. J. Hum. Genet., 89, 354367.

10.

Liu

, and Leal

2010. A novel adaptive method for the analysis of next-generation sequencing data to detect complex trait associations with rare variants due to gene main effects and interactions. PLoS Genet. 6, e1001156.

11.

Liu

, and Leal

2012. Estimating genetic effects and quantifying missing heritabaility explained by identified rare-variant associations. Am. J. Hum. Genet., 91, 585596.

12.

Madsen

B.E.

, and Browning

S.R.

2009. A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet. 5, e1000384.

13.

Manolio

T.A.

, Collins

F.S.

, Cox

N.J.

, et al. 2009. Finding the missing heritability of complex diseases. Nature, 461, 747–753.

14.

Neale

B.M.

, Rivas

M.A.

, Voight

B.F.

, et al. 2011. Testing for an unusual distribution of rare variants. PLoS Genet. 7, e100132.

15.

Nejentsev

, Walker

, Riches

, et al. 2009. Rare variants of IFIH1, a gene implicated in antiviral responses, protect against type 1 diabetes. Science, 324, 387–389.

16.

Price

A.L.

, Kryukov

G.V.

, de Bakker

P.I.

, et al. 2010. Pooled association tests for rare variants in exon-resequencing studies. Am. J. Hum. Genet., 86, 832838.

17.

Romeo

, Pennacchio

L.A.

, Fu

, et al. 2007. Population-based resequencing of ANGPTL4 uncovers variations that reduce triglycerides and increase HDL. Nat. Genet., 39, 513516.

18.

Romeo

, Yin

, Kozlitina

, et al. 2009. Rare loss-of-function mutations in ANGPTL family members contribute to plasma triglyceride levels in humans. J. Clin. Invest., 119, 7079.

19.

Slatter

T.L.

, Jones

G.T.

, Williams

M.J.

, et al. 2008. Novel rare mutations and promoter haplotypes in ABCA1 contribute to low-HDL-C levels. J. Clin. Genet., 73, 179–184.

20.

Sun

, and Wang

2014. A power set-based statistical selection procedure to locate susceptible rare variants associated with complex traits with sequencing data. Bioinformatics, 30, 2317–2323.

21.

Weiss

L.A.

, Shen

, Korn

J.M.

, et al. 2008. Association between microdeletion and microduplication at 16p11.2 and autism. N. Engl. J. Med., 358, 667675.

22.

M.C.

, Lee

, Cai

, et al. 2011. Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet., 89, 8293.

23.

Zuk

, Hechter

, Sunyaev

S.R.

, et al. 2012. The mystery of missing heritability: genetic interactions create phantom heritability. PNAS, 109, 1193–1198.