A Simulation Study to Assess a Variable Selection Method for Selecting Single Nucleotide Polymorphisms Associated with Disease

Abstract

In genome-wide association studies, where hundreds of thousands of single nucleotide polymorphisms (SNPs) are genotyped, the potential for false positives is high and methods for selecting models with only a few SNPs are required. Methods for variable selection giving sets of SNPs associated with disease have been developed, but are still less common than evaluation of individual SNPs one at a time. To assess the potential improvement available from multi-SNP approaches, we examined the performance of the software GeneRaVE as a variable selection method when applied to SNP data in case-control studies. The method was assessed via simulations, in which a haplotype identified by three SNPs was taken to be associated with the disease. Simulated data sets reflecting different levels and patterns of genetic association with the disease were generated. To provide a baseline against which to assess the method, we fitted a generalized linear model using only the three disease susceptibility SNPs, which gives an upper bound on the possible performance of the selection methods. To investigate the advantage of a multivariate variable selection method over a single-SNP approach, we used chi-squared tests for each of the disease susceptibility (DS) SNPs with correction for multiple testing. Simulation results showed that GeneRaVE performed well and outperformed single-SNP analysis using the chi-squared method in identifying disease-related SNPs. In application to a large dataset, it identified SNPs known to be associated with disease that were not identified by single-SNP methods.

Introduction

Single nucleotide polymorphisms (SNPs) are widely used as genetic markers in studies to detect association between genes and diseases. Chips containing millions of SNPs are now readily available, which can significantly improve the analyses of genome-wide association studies (GWAS). The potential for false positives is high, however, and methods for selecting models with only a few SNPs are required. One approach is to use a variable selection method to identify SNPs associated with disease. SNP association studies bring specific challenges because the predictors are the SNP genotypes, which have only three possible levels, commonly denoted by AA, AB, and BB. This gives an SNP dataset a different structure from one containing continuous variables: the predictor values lie on the edges and vertices of a high-dimensional hypercube rather than on a subset of a continuous space. It is not a priori clear what impact this will have on estimation, though the extreme sparsity of coverage for discrete variables suggests intuitively that it may be more difficult to discover relationships, as found by Guan and Stephens (2011).

We therefore investigated the performance of a penalized likelihood variable selection method when applied to SNP data in case-control studies, where the response (case/control) is binary. Penalized likelihood methods, such as the lasso, were proposed by Tibshirani (1996) as a means of dealing with situations in which a model is to be fitted with many more parameters than data points. Several variations on the method have been developed, mainly for continuous responses with continuous predictors (see, for example, Fan and Li, 2001; Zou, 2006; and Buhlmann, 2007). There are fewer algorithms for selecting SNPs when the response variable is binary, though more are becoming available. The penalized likelihood approach was used by Kiiveri (2008) and Hoggart et al. (2008), using different penalty functions. Wu et al. (2009) studied the performance of lasso penalized logistic regression when the response is binary for SNP data with large numbers of SNPs. Their algorithm selects a predetermined number of SNPs, after which logistic regression is applied to the selected SNPs to determine the interaction effects. They showed that variable selection is superior to univariate p-value-based methods.

He and Lin (2011) developed a variable selection method “GWASelect” for GWAS. They showed, via simulation, that their method outperformed other existing methods in terms of reducing the false positive rates and in selecting causal SNPs that are weakly associated with disease. Their simulation study included methods for continuous and binary responses.

Guan and Stephens (2011) used a Bayesian variable selection regression method to select a small set of SNPs, and they extended the method to binary responses using the probit link function. As noted above, they found that the method did not perform well for binary responses when the number of SNPs was in the hundreds of thousands. A limitation of all of these papers is that their simulation studies use the SNP genotypes as the basis for the underlying mechanism generating the link with disease. The biological reality is that SNPs will generally be markers of nearby genes rather than causative.

This article examines a simulation model in which the risk is related to an allele of an unobserved gene linked to a specific pattern of surrounding SNPs. The GeneRaVE algorithm of Kiiveri (2008) is used for the variable selection method. A brief description of the algorithm is given in Section 2 (Methods). Some of the analyses were repeated using the methods of Hoggart and colleagues (2008), giving very similar results that are not reported here.

To capture realistic variation and correlation structure in the predictors, we applied this algorithm to simulated SNP data based on genotypes from a real case-control study, in which a “haplotype” identified by three SNPs was taken to be associated with the disease. This reflects the biological reality, in which a mutant allele of a gene is passed on with the surrounding genetic material. To study the effect of the strength of the evidence for association on the performance of the method, simulated datasets were created that reflected different levels of genetic association with disease and different sample sizes. This allows us to assess the impact of the degree of association and the amount of evidence on the comparison of methods. Intuitively, we would expect any method to perform well when the signal is strong, due to strong association or large sample size, and poorly when the signal is very weak. It is of interest to compare the performance of methods in the intermediate situation, in which the signal is relatively weak, and see which methods do well in detecting such a signal. The accuracy of predictions was examined, as well as the ability of the method to select the correct model. We limited our study to additive models, since we were primarily interested in assessing the performance of the GeneRaVE method. The method can be readily applied for recessive and dominant models.

In order to have a baseline level of performance to assess the method against, we considered two other situations. The first was intended to assess the feasibility of detecting genetic association when only the “true” disease susceptibility (DS) SNPs are in the model. We used generalized linear models (GLM) with only the known DS SNPs as predictors for genotyped and haplotyped data. It is clear that the addition of SNPs that are not associated with the disease will make it harder to detect the association, and so the performance of these models provides an upper bound on the possible performance of the selection methods. The second was a single SNP method using chi-squared tests for each SNP, with correction for multiple testing (Balding et al., 2003). The value of the selection method is that it can identify models that involve more than one SNP, and comparison with a single SNP approach will show how well this potential advantage is realized.

All simulations were based on genotype data from a case-control study of acute lymphoblastic leukemia (ALL) (Mullighan et al., 2007). This dataset includes 244 cases and 60 controls, with SNPs genotyped on the Affymetrix 50K Xba 240 platform. The simulated data were based on a haplotype-disease association that reflects different genetic models. In addition, we applied the method to a much larger dataset of SNP genotypes of Crohn's disease sufferers and normal controls, as described in Section 2.2.

2. Methods

We divide this section into several subsections describing our simulation strategy and the methods used in our study. All simulations and calculations were carried out using the R statistical package (R Development Core Team, 2011). Binomial models with a logistic link function were used for all GLMs and in the GeneRaVE package. This models the probability of disease as $$p_i = \frac{1}{1 + e^{-\beta x_i}}$$ where x_i is a vector of predictor values for individual i, and β is a vector of parameters to be estimated. The predictors used here are either the SNP genotypes, coded as AA = 2, AB = 1, BB = 0, or indicator functions for known or estimated haplotypes. Details for each model are given below.
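As a minimal illustration of the logistic model above, the following Python sketch evaluates p_i for the additive genotype coding. The coefficient values are hypothetical, and treating the first element of x_i as a constant 1 (an intercept) is an assumption made here for illustration; the original analyses were run in R.

```python
import math

def disease_probability(beta, x):
    """Logistic model: p = 1 / (1 + exp(-beta . x))."""
    eta = sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-eta))

# Additive genotype coding used in the text.
GENOTYPE_CODE = {"AA": 2, "AB": 1, "BB": 0}

# Hypothetical coefficients: an assumed intercept plus one SNP effect.
beta = [-1.0, 0.5]
for genotype in ("BB", "AB", "AA"):
    x = [1, GENOTYPE_CODE[genotype]]  # leading 1 is the assumed intercept term
    print(genotype, round(disease_probability(beta, x), 3))
```

With this coding, each additional copy of allele A shifts the log-odds of disease by the SNP coefficient, which is what makes the model additive.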

2.1. Simulated data

We selected three adjacent SNPs from the Affymetrix 50K Xba240 SNP array of the ALL data to define a haplotype notionally containing a gene (“the DS gene”) related to disease in our simulation. Our method is, thus, more reflective of the underlying biology than the more common approach in such simulations (Hoggart et al., 2008) of taking a particular SNP allele as the indicator of disease risk: usually the disease gene allele will be indicated by the presence of a particular haplotype block, and an SNP may vary between variants of the block that do not contain the disease gene. The ALL SNP data do not uniquely define the haplotypes, since an SNP array cannot distinguish between SNPs from the two haplotypes contained in the cell; it records only the number of alleles at each SNP. However, for the simulation, we can choose a haplotype to be associated with the DS allele of the gene.

We assumed that disease risk depends on a single DS gene and that the DS allele of the gene was associated with a particular haplotype—the “DS haplotype”—defined by the three SNPs. That is, the DS allele has been inherited from an ancestor who had particular alleles at the three SNPs, and these have been carried unchanged to the current generation. Other SNP allele patterns, therefore, indicate that the non-DS allele of the gene is present. The SNPs used as DS SNPs in the simulation were, therefore, chosen so that the dataset contained a haplotype of frequency greater than 0.5, which was used as the DS haplotype, and also contained at least 2 other haplotypes of frequency greater than 0.1. The three selected SNPs were “SNP_A-1678401,” “SNP_A-1746397,” and “SNP_A-1694538” on chromosome 22. The Expectation-Maximization (EM) algorithm (Excoffier and Slatkin, 1995) was used to determine the two haplotypes present in each case and control; 227 individuals carried the DS haplotype and the remaining 77 did not. SNPs with missing genotypes and those that had only a single allele present in the dataset were not used in our simulations. The number of SNPs used in the analyses was 5585.

Three different sets of models for disease risk were studied. In each, the risk for the 77 individuals without the DS haplotype was set to a low value p₀, while the risk for the 227 with one or two copies of the DS haplotype was set to a higher value p₁. Disease states for the high and low risk groups were then simulated using binomial distributions with these values for the probability of disease. The first set of models, the “Original” set, had p₀ and p₁ set so that the expected numbers of cases and controls were similar to those in the original data: about 31% controls and 69% cases. In the second set, the “Equal” set, p₀ and p₁ were set to give equal expected numbers of cases and controls. The third set, the “Rare Disease” set, was included to examine the effect of varying the risk ratio while keeping the disease risk for noncarriers low. For this set, p₀ was set at 0.05. Within each set, six values of the risk ratio R = p₁/p₀ were used: R = 1.0, 1.5, 2.0, 5.0, 10.0, and 40.0. For the rare disease set, the value of p₁ for R = 40.0 would have been greater than 1.0, so only the other five values of R were used. The values of p₀ and p₁ for each model and R values are given in Table 1. In order to study the effect of sample size on the results, larger synthetic datasets were constructed, which retained the association structure of the original data but did not artificially inflate the linkage disequilibrium (LD) between non-DS and DS SNPs. Datasets were constructed of sizes 304, 608, and 1216, which contained 1, 2, and 4 copies of the original SNP genotypes. Within each copy, the sets of non-DS SNP genotypes for the additional subjects were permuted so that each subject received the non-DS SNP genotypes of another subject. This maintained the association structure of the non-DS SNPs but removed any association they had with the DS SNPs. 
A further set of simulation runs was carried out with the SNP genotypes duplicated so that the SNP genotypes of the additional subjects were exact copies of the subjects in the original dataset. This led to small associations between the SNPs being amplified, and so simulated a situation of LD between some non-DS SNPs and the disease haplotype. For each combination of model type, risk ratio, and sample size, 1000 simulated datasets were constructed using the binomial distribution with the appropriate disease probabilities. The datasets with 1216 subjects were included only for the rare disease model, where smaller datasets gave too few affected individuals to allow the disease link to be detected.
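The disease-status simulation described above can be sketched as follows. This is an illustrative Python sketch, not the authors' R code: the function names and the random seed are hypothetical, while the group sizes (227 carriers, 77 noncarriers) and the probabilities come from the text and Table 1.

```python
import random

N_CARRIERS = 227      # subjects with one or two copies of the DS haplotype
N_NONCARRIERS = 77    # subjects without the DS haplotype

def simulate_disease_status(p1, p0, rng):
    """Draw one simulated case/control vector (1 = case, 0 = control)."""
    carriers = [1 if rng.random() < p1 else 0 for _ in range(N_CARRIERS)]
    noncarriers = [1 if rng.random() < p0 else 0 for _ in range(N_NONCARRIERS)]
    return carriers + noncarriers

def expected_case_fraction(p1, p0):
    """Expected proportion of cases among the 304 subjects."""
    return (N_CARRIERS * p1 + N_NONCARRIERS * p0) / (N_CARRIERS + N_NONCARRIERS)

def rare_disease_p1(risk_ratio, p0=0.05):
    """p1 = R * p0; returns None when that is not a valid probability."""
    p1 = risk_ratio * p0
    return p1 if p1 <= 1.0 else None

rng = random.Random(2011)  # arbitrary seed, for reproducibility only
status = simulate_disease_status(p1=0.866, p0=0.173, rng=rng)  # Original set, R = 5
print(sum(status), "cases out of", len(status))
print(round(expected_case_fraction(0.866, 0.173), 3))
print(rare_disease_p1(40.0))  # None: R = 40 would require p1 > 1
```

Repeating the draw 1000 times for each combination of model, risk ratio, and sample size would give the collection of simulated datasets described in the text.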

Table 1.

Risk Ratio Values and Disease Risk Values for the Six Cases

Case	I	II	III	IV	V	VI
Risk ratio	40	10	5	2	1.5	1
Original
p₁	0.917	0.899	0.866	0.791	0.755	0.691
p₀	0.022	0.090	0.173	0.396	0.503	0.691
Equal
p₁	0.976	0.909	0.833	0.667	0.600	0.500
p₀	0.024	0.091	0.167	0.333	0.400	0.500
Rare disease
p₁	–	0.500	0.250	0.100	0.075	0.050
p₀	–	0.050	0.050	0.050	0.050	0.050

2.2. Crohn's disease dataset

In addition to the simulation, we applied the single-SNP and variable selection methods to the data for Crohn's disease from the Wellcome Trust Case Control Consortium (Consortium WTCCC, 2007). The “truth” is not known in this case, in contrast to the simulations, so the conclusions are necessarily less definite, but it is possible to see whether the SNPs selected by the two methods are different. It is then of interest to investigate whether the SNPs that were found by the variable selection approach but missed using single-SNP tests have been identified in other studies. The genotype data consisted of 3004 controls and 2005 cases from the Affymetrix 500K chip with 490,032 SNPs. Of the SNPs, a total of 20,803 were found to violate Hardy-Weinberg equilibrium, suggestive of errors in the genotyping calls. These were excluded from the analysis. The R software restricts the size of a matrix to 2³¹ elements, and (3004 + 2005) subjects times (490,032 – 20,803) SNPs exceeds this limit. Therefore, for this exploratory study, a further 50,000 SNPs were excluded on the basis of having low single-SNP association with disease status as measured by a chi-squared test. These 50,000 SNPs all had a chi-squared statistic with p-value greater than 0.85. The design matrix thus consisted of 5009 subjects by 419,229 SNPs and so had fewer than 2³¹ elements.
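The matrix-size arithmetic in this paragraph can be checked directly. The short Python sketch below uses only the counts quoted in the text; the variable names are illustrative.

```python
R_MATRIX_LIMIT = 2**31          # element limit quoted in the text

n_subjects = 3004 + 2005        # controls + cases
n_snps_after_hwe = 490_032 - 20_803   # after removing Hardy-Weinberg violations
n_snps_final = n_snps_after_hwe - 50_000  # after removing low-association SNPs

# Before the extra exclusion the design matrix is too large ...
print(n_subjects * n_snps_after_hwe, n_subjects * n_snps_after_hwe > R_MATRIX_LIMIT)
# ... and afterwards it fits under the limit.
print(n_subjects * n_snps_final, n_subjects * n_snps_final < R_MATRIX_LIMIT)
```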

2.3. Scenarios for GLM predictors

We applied the GLM model with a small number of predictors on the 1000 simulated data sets under different scenarios for our knowledge of the genetic causes of the disease. These GLM results provide a “baseline” level when evaluating results from the GeneRaVE method.

In Scenario 1, we assume that we know the exact DS haplotype and also which haplotypes are present in each subject. We thus use as the predictor an indicator variable that takes the value one for subjects carrying one or two copies of the DS haplotype and zero for those with only non-DS haplotypes. These GLM results represent the best possible results we could get assuming we know the DS haplotype exactly.

In Scenario 2, we assume we do not know which haplotype is the DS haplotype, but that information is available on which two haplotypes are present in each individual, and use as predictors the indicator variables for these haplotypes. With three SNPs in the data, there are eight possible haplotypes. The indicator variable for each haplotype takes the value one if a haplotype is present, otherwise it is zero. The GLM results represent the best possible results we can get, assuming we know the DS SNPs and that the haplotypes for each subject are exactly determined, but we do not know which of the haplotypes is the DS haplotype.

In Scenarios 3 and 4, we assume we know which SNPs are the three DS SNPs, but we have only the genotypes rather than the haplotype information. In Scenario 3, we use the genotype data directly, while in Scenario 4, we first estimate the two haplotypes by maximum likelihood (Excoffier and Slatkin, 1995) and use indicator functions for the estimated haplotypes as predictors. The comparison of Scenario 3 and 4 indicates whether using the estimated haplotype information adds to the ability to detect disease association.
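To make the predictor coding in Scenarios 1 to 3 concrete, the following Python sketch builds each set of predictors from a subject's pair of haplotypes. Representing a haplotype as a three-letter string such as "AAB" is an assumption made for illustration, as are the function names. Scenario 4 would apply the Scenario 2 coding to haplotypes estimated from the genotypes rather than to known ones.

```python
from itertools import product

# The 2**3 = 8 possible haplotypes over three biallelic SNPs, e.g. "AAB".
HAPLOTYPES = ["".join(h) for h in product("AB", repeat=3)]

def scenario1_predictor(hap_pair, ds_haplotype):
    """1 if the subject carries one or two copies of the DS haplotype, else 0."""
    return 1 if ds_haplotype in hap_pair else 0

def scenario2_predictors(hap_pair):
    """One indicator per possible haplotype: 1 if present in the subject."""
    return [1 if h in hap_pair else 0 for h in HAPLOTYPES]

def scenario3_predictors(hap_pair):
    """Genotype codes (copies of allele A at each SNP): AA = 2, AB = 1, BB = 0."""
    first, second = hap_pair
    return [(a == "A") + (b == "A") for a, b in zip(first, second)]

subject = ("AAB", "ABB")                      # hypothetical phased haplotypes
print(scenario1_predictor(subject, "AAB"))    # DS haplotype chosen arbitrarily here
print(scenario2_predictors(subject))
print(scenario3_predictors(subject))

# Two different haplotype pairs can share one genotype vector, which is why
# genotype data alone do not determine the haplotypes.
print(scenario3_predictors(("AAB", "ABA")) == scenario3_predictors(("AAA", "ABB")))
```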

2.4. Single-SNP analysis

A common method for detecting genetic association with disease is to carry out significance tests on each SNP separately and use a multiple testing correction to avoid excessive false positives. The aim of the variable selection method being studied here is to outperform single-SNP methods by allowing for models in which multiple SNPs are included. It is, therefore, of interest to see how a single-SNP approach performs on the same datasets.

The chi-squared test is a standard method used to test associations between a disease and SNPs in genome-wide association studies. The significance of an SNP is assessed by its p-value, adjusted for multiple testing. SNPs that are highly significant are considered the DS SNPs. So if we were to test all SNPs in our simulated data we would expect the DS SNPs to be the most significant ones. We assessed the significance of each SNP in our study using a chi-squared test and counted how often the DS and non-DS SNPs were significant. This would show any advantage of using GeneRaVE over the chi-squared test in terms of how well the method is selecting the correct SNPs.

We applied the chi-squared test with two degrees of freedom to the simulated data for all cases to test for association between the disease status and each of the SNPs. SNPs with Bonferroni-corrected p-values of less than 0.05 were identified as associated with the disease. For the rare disease data, the numbers affected were sometimes so small that the chi-squared approximation was inadequate. Any values with a p-value of less than 10^–5 were re-evaluated using simulation of the exact distribution with 10⁷ simulation runs for each. The same approach was applied to the WTCCC dataset, but to allow for the possibility that the chi-squared approximation to the distribution of the test statistic might be inaccurate, all significant values were re-evaluated using simulation of the exact distribution with 10⁹ simulation runs for each. This large number was necessary to give reasonable accuracy for the nominal p-value of 0.05/419229 ≈ 10^–7 required for the Bonferroni correction. All of the retested values proved significant, the majority having simulated p-values less than the approximate values.
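A minimal Python sketch of the single-SNP test follows, assuming a 2 x 3 table of genotype counts (cases and controls by AA, AB, BB). It relies on the fact that the upper tail probability of a chi-squared distribution with two degrees of freedom is exp(-x/2). The example counts are hypothetical, and the simulation-based re-evaluation of small p-values described above is not shown.

```python
import math

def chi_squared_2x3(case_counts, control_counts):
    """Pearson chi-squared statistic for a 2 x 3 table of genotype counts.

    Every genotype column must have a non-zero total.
    """
    rows = [case_counts, control_counts]
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in zip(*rows)]
    total = sum(row_totals)
    stat = 0.0
    for r, row in enumerate(rows):
        for c, observed in enumerate(row):
            expected = row_totals[r] * col_totals[c] / total
            stat += (observed - expected) ** 2 / expected
    return stat

def p_value_2df(stat):
    """Upper tail of the chi-squared distribution with 2 df: exp(-x / 2)."""
    return math.exp(-stat / 2.0)

def bonferroni_significant(p_value, n_tests, alpha=0.05):
    """Compare the nominal p-value with alpha divided by the number of tests."""
    return p_value < alpha / n_tests

# Hypothetical genotype counts (AA, AB, BB) for one SNP.
stat = chi_squared_2x3([120, 80, 10], [30, 40, 24])
p = p_value_2df(stat)
print(stat, p, bonferroni_significant(p, n_tests=5585))
```

Comparing the nominal p-value with 0.05 divided by the number of SNPs tested is equivalent to requiring a Bonferroni-corrected p-value below 0.05.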

2.5. The GeneRaVE method

We describe briefly the method used in our study, GeneRaVE (Kiiveri, 2008). When p >> n, a useful fit is possible only if most of the predictor variables, SNPs in our case, are unrelated to the disease and only a few SNPs are in reality predictive of the response. The algorithm aims to identify these SNPs using a method that forces a sparse model, that is, a model with only a small number of non-zero coefficients. The approach used in GeneRaVE is maximization of a penalized likelihood. The objective function is the binomial likelihood of the data with the addition of a penalty function on the β coefficients. The normal gamma density function is used for the penalty function; it is determined by two parameters 0 ≤ k ≤ 1 (shape) and b > 0 (scale). The EM algorithm is used to maximize the penalized likelihood to produce the β coefficient estimates. As shown by Kiiveri (2008), this leads to many of the estimated β coefficients being exactly zero, thus producing a sparse model.
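The general shape of a penalized likelihood objective can be sketched in Python as below. This is not the GeneRaVE objective: the normal-gamma penalty is replaced here by a lasso-style L1 penalty as a plainly labeled stand-in, and the sign convention (log-likelihood minus penalty, to be maximized) and the `weight` parameter are assumptions made for illustration.

```python
import math

def log_likelihood(beta, X, y):
    """Binomial (logistic) log-likelihood for 0/1 responses y."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        eta = sum(b * v for b, v in zip(beta, x_i))
        if y_i == 1:
            total += -math.log1p(math.exp(-eta))   # log p_i
        else:
            total += -math.log1p(math.exp(eta))    # log (1 - p_i)
    return total

def l1_penalty(beta, weight):
    """Stand-in penalty (lasso-style); NOT the normal-gamma penalty of GeneRaVE."""
    return weight * sum(abs(b) for b in beta)

def penalized_objective(beta, X, y, weight):
    """Quantity to maximize: log-likelihood minus the (assumed) penalty."""
    return log_likelihood(beta, X, y) - l1_penalty(beta, weight)

# Tiny hypothetical dataset: two subjects, two predictors each.
X = [[1, 0], [1, 2]]
y = [0, 1]
print(penalized_objective([0.0, 0.0], X, y, weight=1.0))
print(penalized_objective([-1.0, 1.0], X, y, weight=1.0))
```

Any sparsity-inducing penalty plays the same structural role: larger coefficients must improve the likelihood by more than they cost in penalty, which pushes most coefficients to zero.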

2.6. Parameter selection

The performance of the GeneRaVE algorithm is influenced by the choice of the parameters of the penalty function and also a parameter “epsilon” used to control the convergence of the iterative method. A range of values of these parameters was tried in a small subset of the simulated datasets, and it was found that the values k = 0.8, b = 1.0, and epsilon = 0.26 gave good performance as assessed by the tenfold cross-validated error rate (Arlot and Celisse, 2010). These values of k, b, and epsilon were used in all subsequent analyses of the simulated data.

For the Crohn's disease data, we again used cross-validation to select the best k and b. The grid consisted of 100 values of k and 101 values of b. We calculated the tenfold cross-validated error for each pair and selected the pair with the smallest error. The smallest cross-validated error was 0.011 and was attained when k = 0.09 and b = 0.000819. The value of the epsilon parameter was again set to 0.26.
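The tenfold cross-validated error rate used for parameter selection can be sketched generically as follows. The `fit` and `predict` callables are hypothetical placeholders for whatever model is being assessed; here a trivial majority-class classifier stands in so that the example is self-contained, and the data are synthetic.

```python
import random

def cross_validated_error(X, y, fit, predict, n_folds=10, seed=0):
    """Pooled misclassification rate over n_folds held-out folds."""
    indices = list(range(len(y)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    errors = 0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        errors += sum(predict(model, X[i]) != y[i] for i in held_out)
    return errors / len(y)

# Placeholder classifier: always predict the majority class of the training data.
def fit_majority(X_train, y_train):
    return 1 if 2 * sum(y_train) >= len(y_train) else 0

def predict_majority(model, x):
    return model

y = [1] * 210 + [0] * 94          # synthetic labels, roughly 69% cases
X = [[0]] * len(y)                # predictors are ignored by this placeholder
print(cross_validated_error(X, y, fit_majority, predict_majority))
```

A grid search would call such a function once for each candidate (k, b) pair and keep the pair with the smallest error. The majority-class placeholder also shows where a "null" error level of about 31% comes from when roughly 69% of subjects are cases.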

3. Results

3.1. Summary of the GLM results

We calculated the tenfold cross-validated error rates for each scenario and for each case, model, and sample size. Figure 1 shows the mean of the cross-validated error rates for all cases and scenarios from the 1000 simulations for the Original model and sample size 304. The standard error of each point is about 0.01, similar to the size of the plotting symbols. As would be expected, the mean error increases as the risk ratio decreases. For risk ratios of R = 2 or less, the error rate is close to the level of 31%, which corresponds to the “null” model classifying all subjects as “cases” and so achieving 69% correct results. Knowing the DS haplotype, as in Scenarios 1 and 2, clearly gives much better performance for higher risk ratios than knowing only the genotypes, suggesting that it is worthwhile obtaining such “phased haplotype” data, though they are more difficult to obtain than genotype data. The estimated haplotypes of Scenario 4 give no advantage over the direct use of the genotypes in Scenario 3, and so the simpler genotype approach of Scenario 3 was used as the model for the GeneRaVE study.

FIG. 1.

Mean of cross-validated error rates obtained from 1000 simulations for Scenarios 1–4 and GeneRaVE for all risk ratios in all scenarios. See text for definitions of these scenarios.

3.2. Summary of single-SNP significance results

The use of a Bonferroni-corrected significance level of 0.05 implies that the expected number of non-DS SNPs found significant in each simulation should be 0.05. The average numbers of DS and non-DS SNPs that were significant in the 1000 simulations are shown in Table 2. The results are as expected: the DS SNPs are very likely to be identified when the risk ratio is high, and the probability declines as the risk ratio is decreased. The numbers of non-DS SNPs selected (the false positives) were less than the expected 0.05.

Table 2.

Mean Number of Disease Susceptibility (DS) and Non-DS Single Nucleotide Polymorphisms Selected

			Risk ratio
Model	n		40	10	5	2	1.5	1
Original	304	Chi-squared Bonferroni DS	2.89	2.51	1.81	0.19	0.01	0.00
		Chi-squared matched DS	3.00	2.90	2.56	0.85	0.20	0.00
		GeneRaVE DS	3.00	2.98	2.73	1.09	0.32	0.01
		Chi-squared Bonferroni non-DS	0.02	0.03	0.03	0.04	0.04	0.03
		GeneRaVE non-DS	1.55	1.86	2.41	3.31	3.49	3.74
	608	Chi-squared Bonferroni DS	3.00	3.00	2.92	1.04	0.11	0.00
		Chi-squared matched DS	3.00	3.00	2.94	1.28	0.22	0.00
		GeneRaVE DS	3.00	3.00	3.00	1.86	0.40	0.00
		Chi-squared Bonferroni non-DS	0.02	0.03	0.02	0.03	0.03	0.04
		GeneRaVE non-DS	0.10	0.04	0.06	0.09	0.18	0.30
	608 Dup	Chi-squared Bonferroni DS	3.00	3.00	2.92	1.05	0.10	0.00
		Chi-squared matched DS	3.00	2.98	2.82	1.02	0.15	0.00
		GeneRaVE DS	3.00	3.00	2.94	1.46	0.23	0.00
		Chi-squared Bonferroni non-DS	6.54	3.99	1.69	0.27	0.07	0.03
		GeneRaVE non-DS	0.87	0.64	0.44	0.23	0.18	0.15
Equal	304	Chi-squared Bonferroni DS	3.00	2.59	1.56	0.04	0.00	0.00
		Chi-squared matched DS	3.00	2.94	2.44	0.42	0.08	0.00
		GeneRaVE DS	3.00	2.98	2.60	0.56	0.13	0.01
		Chi-squared Bonferroni non-DS	0.01	0.04	0.03	0.02	0.01	0.00
		GeneRaVE non-DS	0.84	1.77	2.58	3.64	3.75	3.92
	608	Chi-squared Bonferroni DS	3.00	3.00	2.84	0.38	0.01	0.00
		Chi-squared matched DS	3.00	3.00	2.90	0.65	0.06	0.00
		GeneRaVE DS	3.00	3.00	2.94	0.83	0.09	0.00
		Chi-squared Bonferroni non-DS	0.04	0.14	0.01	0.02	0.02	0.02
		GeneRaVE non-DS	0.11	0.18	0.07	0.12	0.17	0.23
Rare disease	304	Chi-squared Bonferroni DS	–	0.12	0.00	0.00	0.00	0.00
		Chi-squared matched DS	–	1.16	0.01	0.00	0.00	0.00
		GeneRaVE DS	–	1.28	0.13	0.00	0.00	0.00
		Chi-squared Bonferroni non-DS	–	0.01	0.03	0.03	0.03	0.02
		GeneRaVE non-DS	–	3.66	2.67	0.63	0.39	0.14
	608	Chi-squared Bonferroni DS	–	1.26	0.01	0.00	0.00	0.00
		Chi-squared matched DS	–	1.47	0.02	0.00	0.00	0.00
		GeneRaVE DS	–	2.39	0.87	0.29	0.24	0.24
		Chi-squared Bonferroni non-DS	–	0.03	0.02	0.02	0.02	0.02
		GeneRaVE non-DS	–	0.08	0.29	0.81	0.94	0.96
	1216	Chi-squared Bonferroni DS	–	2.64	0.24	0.00	0.00	0.00
		Chi-squared matched DS	–	2.56	0.29	0.00	0.00	0.00
		GeneRaVE DS	–	2.55	0.78	0.15	0.10	0.04
		Chi-squared Bonferroni non-DS	–	0.02	0.03	0.02	0.03	0.04
		GeneRaVE non-DS	–	0.01	0.05	0.18	0.17	0.21

The numbers of non-DS SNPs with significant association in the simulations involving the original dataset were small as a result of the Bonferroni correction, so to provide a more meaningful comparison with the GeneRaVE results below, threshold values for significance were calculated, which allowed for numbers of non-DS SNPs equal to those found for GeneRaVE. These were used for the rows labeled “matched” in Table 2. This reflects the inevitable tradeoff between false positive and false negative rates. These are more fully illustrated in the receiver operating characteristic (ROC) curves of true positive rates versus false positive rates given in Figure 2 for the Original model and sample size 304. The ROC curves for the other models are given in Supplementary Figures S1–S6. (Supplementary Material is available online at www.liebertpub.com/cmb.) The points corresponding to the Bonferroni-corrected and “matched” cutoffs are marked.
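The text does not give the exact procedure used to compute the "matched" thresholds, so the following Python sketch shows only one plausible way to do it: pool the non-DS test statistics over all simulations and pick the cutoff that allows a target mean number of false positives per simulation. The function names and the toy statistics are hypothetical. The same selection rule yields one point on a ROC curve.

```python
import math

def matched_threshold(non_ds_stats, n_simulations, target_mean_false_positives):
    """Cutoff such that selecting statistics strictly above it gives at most
    the target mean number of non-DS SNPs per simulation (pooled statistics)."""
    n_allowed = math.floor(target_mean_false_positives * n_simulations + 1e-9)
    ranked = sorted(non_ds_stats, reverse=True)
    if n_allowed >= len(ranked):
        return float("-inf")       # every non-DS SNP may be selected
    return ranked[n_allowed]       # only values strictly above this are selected

def roc_point(ds_stats, non_ds_stats, threshold):
    """(false positive rate, true positive rate) for selection stat > threshold."""
    tpr = sum(s > threshold for s in ds_stats) / len(ds_stats)
    fpr = sum(s > threshold for s in non_ds_stats) / len(non_ds_stats)
    return fpr, tpr

# Toy pooled statistics from two hypothetical simulations.
non_ds = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ds = [8, 12, 5]
cutoff = matched_threshold(non_ds, n_simulations=2, target_mean_false_positives=1.5)
print(cutoff, roc_point(ds, non_ds, cutoff))
```

Sweeping the threshold over all observed statistics and collecting the resulting points would trace out the full ROC curve for the single-SNP method.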

FIG. 2.

Receiver operating characteristic (ROC) curve for chi-squared method and true and false positive rates for GeneRaVE method for N = 304 and dataset Original.

In the simulations involving duplicating the dataset, larger numbers of non-DS SNPs are selected when the R parameter is 40 or 10. These included SNPs that had substantial association by chance with the DS SNPs in the original data and so have more significant correlations in the duplicated data, since the degree of association is the same but the sample size is larger. Such SNPs represent false positives in the simulations.

3.3. Summary of the GeneRaVE results

To see how well the GeneRaVE method performed, in terms of identifying the presence of a genetic link and also of identifying the associated SNPs, we calculated for each simulated data set the average of the number of DS SNPs and the average number of additional SNPs that were selected in 1000 simulations. These are shown in Table 2 as “DS” and “Non-DS.”

The cross-validated error rates of the simulated data for the original model with risk ratio 40, 10, 5, 2, 1.5, and 1 are plotted with the GLM results in Figure 1. The mean error rates are clearly nondecreasing as the risk ratio values decrease, which was expected. It is clear that the GeneRaVE error rates are higher than those of the GLM scenarios, also as expected.

The error rates for low risk ratios slightly exceed the “null” level of 31%, suggesting that, despite the use of penalty parameters selected on the basis of cross-validation error, the procedure is overfitting, selecting too many parameters and so inflating the error rates.

The results for GeneRaVE in Table 2 show that when n = 304, and for risk ratios of 5 or more, the set of SNPs selected usually contains all three DS SNPs and very few others. For lower risk ratios, the performance is poorer. For the larger datasets, the number of DS SNPs selected is increased, and the number of non-DS SNPs selected is reduced, as would be expected. For some of the models, the number of non-DS SNPs selected by GeneRaVE is greater for the lower risk ratios, which is not what would be expected, but there is no general trend.

The number of non-DS SNPs selected in the simulations for a risk ratio of 1 is large for the datasets of size 304 with the “original” and “equal” patterns of risk. Indeed, the number of non-DS SNPs selected increases as the risk ratio declines. The reason for this is unclear, but the pattern is consistent with the overfitting noted above.

The large number of false positives found by the chi-squared tests in the duplicated datasets does not occur in the GeneRaVE results. This is because the true DS SNP genotypes alone are good predictors of disease state, so these additional SNPs, while individually associated with disease state, do not improve the performance of the predictive model.

3.4. Crohn's disease results

Chi-squared tests of association between SNP genotypes and disease status were calculated for each of the SNPs. Using significance levels from 10⁹ simulations as described above, 485 of the 419,229 SNPs gave significant chi-squared statistics at the Bonferroni-corrected level of 0.05/419229 = 1.19 × 10^–7. In the absence of any association, the expected number of significant values would be 0.05.

GeneRaVE gave a fitted model with only 40 SNPs selected. The plug-in error rate for this model was about 4% when the penalized parameter estimates from the GeneRaVE algorithm were used. To investigate the potential for a more parsimonious model, a GLM with binomial errors was fitted to the set of selected SNPs as predictors for the disease status. Stepwise selection was then used to see if a smaller model would fit as well. All of the selected SNPs were found to be necessary in the model. The plug-in error for the unpenalized GLM with all the selected SNPs included was slightly smaller at 3.5%.

The single-SNP p-values of 39 of the 40 SNPs selected by GeneRaVE were less than the Bonferroni-corrected level, but they were not the smallest among the 485 significant SNPs, the highest being 304th in order of p-value. Also, the 40th SNP, whose p-value was not significant, added significantly to the GLM as noted above. However, 24 of the 40 most significant SNPs were included in the set selected by GeneRaVE.

This illustrates the limitation of single-SNP methods, since SNPs with important effects may be masked by the large number of others. The 40 selected SNPs included, for example, rs9292777, found by van der Heide et al. (2010) to be associated with Crohn's disease in nonsmokers, which ranked only 143rd among the single SNP p-values.

3.5. Comparing the methods

GLM Scenario 3 is the best baseline for comparison with the GeneRaVE results, as the best outcome that could be expected using genotype data. Figure 1 shows that for risk ratios of 5 or more, GeneRaVE's prediction performance is similar to that of the GLM. This is explained by the results in Table 2, which show that for high risk ratios, GeneRaVE has a high probability of selecting the correct DS SNPs and few non-DS SNPs, so that the model selected is close to the true model used for the GLM results. When the risk ratio is 2, however, GeneRaVE usually selects only one of the DS SNPs and about 1 non-DS SNP. For smaller risk ratios, the probability of identifying a DS SNP is small. This is reflected in Figure 1, in which the error for the GeneRaVE method rises to the “null” level of around 31% when the risk ratio is 2 or less.

For selection of the DS SNPs by single-SNP tests, the Bonferroni correction meant that the probability of finding a DS SNP was substantially lower than that for GeneRaVE. Relaxing the cutoff for the chi-squared statistics to match the false positive counts of the GeneRaVE approach increases the likelihood of identifying a DS SNP. Even so, the GeneRaVE method still significantly outperforms the single-SNP chi-squared test in the number of DS SNPs found. This is also clear in Figure 2, where the points corresponding to the GeneRaVE results are above the ROC curve for the single-SNP method in every case. Nonparametric tests showed that when the mean number of non-DS SNPs selected was matched, the number of DS SNPs found by GeneRaVE was significantly higher when the risk ratio was greater than 1.5.
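The matching of false positive counts described above can be illustrated with a small Python sketch. It is a simplification under stated assumptions: it works on a single simulated dataset with known DS SNPs, whereas the study matched the mean number of non-DS SNPs across simulations, and the function name and return values are hypothetical.

```python
def ds_found_at_matched_fp(statistics, ds_indices, target_fp):
    """DS SNPs selected when the cutoff admits at most target_fp non-DS SNPs.

    SNPs are ranked by decreasing chi-squared statistic (ties broken
    arbitrarily) and selected from the top until admitting the next
    non-DS SNP would exceed target_fp.
    Returns (ds_found, fp_selected, cutoff), where cutoff is the smallest
    selected statistic, or None if no SNP is selected.
    """
    ds = set(ds_indices)
    ranked = sorted(
        range(len(statistics)), key=lambda i: statistics[i], reverse=True
    )
    ds_found = 0
    fp_selected = 0
    cutoff = None
    for i in ranked:
        if i not in ds and fp_selected == target_fp:
            break  # the next SNP would be one false positive too many
        if i in ds:
            ds_found += 1
        else:
            fp_selected += 1
        cutoff = statistics[i]
    return ds_found, fp_selected, cutoff
```

Running this for target_fp = 0, 1, 2, ... traces out the single-SNP curve of DS SNPs found against non-DS SNPs selected, against which a variable selection result (its own DS and non-DS counts) can be plotted as a single point.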

The GeneRaVE method was also less prone to selecting non-DS SNPs that are associated with the DS SNPs. This is as expected, since variable selection methods like GeneRaVE pick a set of SNPs that are predictive together, rather than the isolated SNPs identified by single-SNP techniques.

4. Discussion

Penalized variable selection methods have become popular in the current climate of massive datasets and increasing computing capability. However, simpler methods remain popular, and it has been unclear whether the more sophisticated approaches add significant value.

This study has shown that there is indeed value in penalized variable selection in the context of discrete predictors and response with a small number of levels. In simulated data where the true model is known, and the signal is strong, the correct predictors are identified, giving prediction error close to that of methods based on knowledge of the true model.

Even when the genetic link is weaker, the variable selection method retains a good ability to select the true predictors, and the performance is even better in the situation considered here, where the identification of any one DS SNP is sufficient to identify a region of interest for further study.

The variable selection approach shows a significant advantage over single-SNP methods. In the simulations, it was able to detect a genetic effect where single-SNP methods failed, since they are unable to detect a relationship that involves more than one predictor. In the WTCCC study, it identified a small set of jointly predictive SNPs, some of which, notably rs9292777, would not have been detected by a single-SNP approach.

It was found that the correct SNPs were often selected even when the cross-validated error rate was relatively large. This suggests that the cross-validated error rate may not be a good measure of the value of a model for classification. More investigation is needed to develop better approaches.

There is some evidence of overfitting, in the form of additional selected SNPs and error rates exceeding 31%, so that too many SNPs may be included in the model. This remains a challenge for such methods and may explain the common perception that many reported SNP associations have not been validated in further studies. However, as discussed in Visscher et al. (2012), this perception is not entirely justified: many associations initially identified by GWAS have been validated in follow-up investigations. Exploratory studies such as GWAS will usually be only the beginning of a program of research, and incorrectly selected SNPs will fail to be validated at later stages, while the true positives will remain. On the other hand, the variable selection approach was able to discount SNPs that were associated with the DS SNPs but not part of the causal haplotype. This ability to identify joint effects represents a substantial advantage of variable selection.

Of course, any simulation study of this kind is limited in the range of models and scenarios it can consider. However, the use of realistic SNP datasets and a model based on the nature of the underlying genetic structure gives some hope that these conclusions have more general applicability. The results are broadly consistent with those of Wu et al. (2009) and other studies that found variable selection methods generally outperformed single-SNP methods. Given the ready availability of methods and code to implement them, it should be standard practice to use multivariate methods, in addition to univariate methods, to analyze case-control studies using SNP data.

Footnotes

Acknowledgments

This research received no specific grant from any funding agency, commercial or not-for-profit sectors.

Disclosure Statement

The authors have no conflicts of interest.

References

Arlot, Celisse. 2010. A survey of cross-validation procedures for model selection. Statistics Surveys, 4:40–79.

Balding D.J., Bishop, Cannings. 2003. Handbook of Statistical Genetics, 2. Wiley: Chichester.

Buhlmann. 2007. Variable selection for high dimensional data: with application in molecular biology. Bulletin of the International Statistical Institute, 56th session. http://stat.ethz.ch/~buhlmann/Publications/.

Consortium WTCCC. 2007. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447:661–78.

Excoffier, Slatkin. 1995. Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol. Biol. Evol., 12(5):921–927.

Fan, Li. 2001. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc., 96:1348–1360.

Guan, Stephens. 2011. Bayesian variable selection regression for genome-wide association studies and other large-scale problems. Ann. Appl. Stat., 5:1780–1815.

He, Lin D.Y. 2011. A variable selection method for genome-wide association studies. Bioinformatics, 27:1–8.

Hoggart C.J., Whittaker J.C., De Iorio, et al. 2008. Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies. PLOS Genetics, 4:1–8.

Kiiveri H.T. 2008. A general approach to simultaneous model fitting and variable elimination in response models for biological data with many more variables than observations. BMC Bioinformatics, 9:195.

Mullighan C.G., Goorha, Radtke, et al. 2007. Genome-wide analysis of genetic alterations in acute lymphoblastic leukaemia. Nature, 446:758–764.

R Development Core Team. 2011. R: A language and environment for statistical computing. R Foundation for Statistical Computing: Vienna, Austria.

Tibshirani. 1996. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Series B. Stat. Methodol., 58:267–288.

van der Heide, Nolte I.M., Kleibeuker J.H., et al. 2010. Differences in genetic background between active smokers, passive smokers, and non-smokers with Crohn's disease. Am. J. Gastroenterol., 5:1165–1172.

Visscher P.M., Brown M.A., McCarthy M.I., et al. 2012. Five years of GWAS discovery. Am. J. Hum. Genet., 90:7–24.

Wu T.T., Chen Y.F., Hastie, et al. 2009. Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics, 25:714–21.

Zou. 2006. The Adaptive Lasso and Its Oracle Properties. Journal of the American Statistical Association, 101:1418–1429.
