Abstract
Variable selection methods play an important role in high-dimensional statistical modeling and analysis. Computational cost and estimation accuracy are the two main concerns for statistical inference from ultrahigh-dimensional data. In particular, genome-wide association studies (GWAS), which focus on identifying single nucleotide polymorphisms (SNPs) associated with a disease of interest, have produced ultrahigh-dimensional data. Numerous methods have been proposed to handle GWAS data. Most statistical methods have adopted a two-stage approach: pre-screening for dimensional reduction and variable selection to identify causal SNPs. The pre-screening step selects SNPs in terms of their P-values or the absolute values of the regression coefficients in single SNP analysis. Penalized regressions, such as the ridge, lasso, adaptive lasso, and elastic-net regressions, are commonly used for the variable selection step. In this paper, we investigate which combination of pre-screening method and penalized regression performs best on a quantitative phenotype using two real GWAS datasets.
Keywords
Introduction
Recently, many high-dimensional datasets have been generated in biomedical science, such as microarrays and single nucleotide polymorphism (SNP) databases. In particular, genome-wide association studies (GWAS), which focus on identifying SNPs associated with a disease of interest, have produced ultrahigh-dimensional data. For theoretical development, we consider data to have high dimensionality if p = O(n a ) for some a≫0, and to have ultra-high dimensionality if log p = O(n a ) for some a≫0. When the dimension p is high, we run into the often-fatal “curse of dimensionality.” The convergence of any estimator to the true value of a smooth function defined on a space of high dimension is very slow. Variable selection plays an important role in high-dimensional statistical modeling and analysis. Computational cost and estimation accuracy are the two main concerns for statistical inference from high-dimensional data.
Many efficient approaches have been introduced to overcome these problems. One is the adoption of multistep strategies.1,2 The first stage of this approach reduces the dimensionality P for significant predictor selection in ultrahigh-dimensional data. This pre-screening stage is used to find variables that may only be marginally associated with a response variable. This step reduces the dimension of the dataset and makes joint analysis possible. Therefore, the multistep approach indicates one solution for the ultrahigh-dimensional problem. Several predictor selection tools have been developed to implement the above idea for ultrahigh-dimensional linear models. Sure independence screening (SIS), which is the most widely used pre-screening method,3,4 ranks the predictor variables using the absolute values of the correlation coefficients as a criterion. Another pre-screening method is described in Cho et al. 1 , which uses the pre-screening step to identify marginally associated responses, using the P-value as a criterion.
We want to know which of the available pre-screening methods is better for quantitative traits. Although many pre-screening methods are available, we do not know which method performs best in predicting a particular quantitative phenotype. We can find predictors that are jointly associated with the response variable among the parameters that remain after the pre-screening step. When multiple predictor variables exist for a response variable, joint identification becomes a powerful tool. 1
One of the traditional approaches for joint identification is the multiple linear/logistic regression method. However, when we handle high-dimensional data using traditional methods, we experience several problems. First, multiple linear regressions do not work well within high dimensionality, which causes computational complexity. Second, multiple linear regression is very sensitive to multicollinearity among SNPs. To overcome this problem, various penalization methods have been proposed, such as the ridge, bridge, least absolute shrinkage and selection operator (lasso), adaptive lasso, smoothly clipped absolute deviation (SCAD), and elastic-net.5–9 These methods can find jointly associated variables in high-dimensional data. The elastic-net method uses both the ridge and lasso penalties, obtaining the advantages of both approaches. The elastic-net method automatically selects significant variables, and, thus, efficiently resolves the problem caused by multicollinearity. The iterative adaptive lasso (IAL) method 10 retains the appealing property of rapid computation even for ultrahigh-dimensional problems. This method yields a sparse solution by setting certain parameters to zero. Predictor selection is then achieved with the nonzero values.
Many methods have been suggested for pre-screening and the variable selection procedure. However, we do not know which method performs best for quantitative traits. In this paper, we investigate which combination of pre-screening method and penalized regression performs best. To compare the power of pre-screening methods and penalized regressions, we use two GWAS datasets: one from the Korea Association Resource (KARE) project and the other from the Age-Related Eye Disease Study (AREDS). The adjusted R-square is used as a measure of comparison. 11
Materials and Methods
Materials
KARE Data
The KARE project began in 2007. 11 Participants in this project were recruited from two community-based cohorts: the rural Ansung cohort and the urban Ansan cohort in Gyeonggi-do province of South Korea. The numbers of people in the Ansung and Ansan cohorts are 5,018 and 5,020, respectively. The age range is from 40 to 69 years. More than 260 phenotypes have been surveyed through physical examinations, epidemiological surveys, and laboratory tests. We focus on the height trait, because height is a highly heritable polygenic characteristic. 11
The KARE data contain 500,568 SNPs. Before analysis, quality control processes are performed following Cho et al. 1 , and missing genotypes are imputed using PLINK software and the Japanese in Tokyo (JPT)/Han Chinese in Beijing (CHB) reference panel in HapMap. 1 After these processes, we obtain a dataset with 327,872 SNPs from 8,842 individuals.
Areds Data
AREDS is a prospective study of 4,757 persons to establish the risk factors of both age-related macular degeneration (AMD) and cataract. 12 The AREDS began in 1992. Ages of participants ranged from 55 to 80 years. Participants have been followed for at least seven years. We used body mass index (BMI) as a quantitative trait. The genotype platform of the AREDS data is an Illumina 100K GWAS chip. A total of 525 individuals were genotyped. Quality control processes were performed using the same criteria as with the KARE data. After quality control, we obtained a dataset with 87,260 SNPs from 462 individuals.
Methods
We formulate a multistage strategy for identifying the significant parameters among an enormous number of explanatory variables. Our strategy consists of three stages. At stage 1, we screen out the variables that are weakly correlated with the response variable via single-variable association tests. We select variables in terms of their P-values or by the absolute values of their regression coefficients in single-variable analysis. At stage 2, we search for multiple-variable associations by using penalized multiple regression with the elastic-net, ridge, lasso, and IAL methods. At stage 3, using the elastic-net and lasso methods, we assess the jointly identified variables using bootstrap selection stability (BSS), which is proposed empirically to assess with what consistency a variable is selected from the bootstrap samples. 1 Using the ridge and IAL methods, we assess the jointly identified variables by using the effect size.
Stage 1. Standardization
Suppose that yi for i = 1, n are the responses for the ith individual, and xij for j = 1, …, p are its predictors. We assume that the predictors are standardized to have zero mean and unit standard deviation in order to maintain generality.
Stage 2. Pre-screening
We use the linear regression model in order to eliminate predictors that are weakly correlated with the response variable to achieve dimensionality reduction.
Stage 3. Variable selection
Method 1
Penalized regression. Multiple linear models are fitted for the selected top P variables after adjusting for the covariates.
We find the optimal solution by using penalized regressions such as the ridge, lasso, and elastic-net. The penalized regressions find the solution as follows:
The amount of shrinkage is represented by parameter λ. We can find an optimal λ by using tenfold cross-validation, which accomplishes mean squared error minimization. Ridge regression (α = 0) entails a shrinkage of the least squares estimators. 8 The ridge is a biased estimator. Since the ridge reduces the variance of the estimators, it reduces the mean square error. In cases of high dimensionality, the ridge provides a shrinkage factor that does not accomplish variable selection. The lasso (α = 1) has an l1-norm penalty function. 5 Thus, the lasso produces coefficients of zero for insignificant variables. The lasso thus automatically performs variable selection. The elastic net method includes the lasso and ridge regressions. In other words, each of them is a special case where α = 1 or α = 0. The elastic net thus has the advantages of both the ridge and lasso regularizations. Variables showing strong joint association with the response variable are automatically selected via the elastic net method. Therefore, the elastic net has the ability to perform grouped selection of highly correlated variables.
Method 2. IAL
The IAL method is a two-stage procedure. 10 At the first stage, single-variable analysis is implemented to rank the magnitude of the marginal linear regression estimators. At the second stage, a weighted least-squares-type objective function is used to approximate a potential function. This allows us to further define a penalized weighted least square (PWLS) model for moderate-scale selection.
Step 1. Let M = {j | β
j
≫ 0, j = 1,…, p}. We need first to predetermine a sparsity parameter size d. It is recommended to take d = n/(logn). For each variable, the single-variable association with phenotype is examined using linear regression. The jth predictor is
to select a subset M1 of A1.
Step 2. For every j ∊ M1c = [1,….., p}\M1. estimate
After ordering
Step 3. Apply the PWLS procedure at {M1, A2}. The nonzero elements of the variable yield a new significant index M2.
Step 4. Iterate steps 2–3 until |M l |≥ d or | M l |=| Ml-1|.
Step 5. Finally, we obtain both the predictor set
Stage 4. Ordering
After selecting the significant predictor variables, we rank them in order of importance. For the elastic-net and lasso methods, we use BSS. Joint selection of SNPs via the elastic-net method is performed for the bootstrap samples. Bootstrapping is a resampling technique: a bootstrapping sample is a random sample with replacements from the original dataset. The bootstrap sample size is equal to the original dataset size. B bootstrap samples are generated. BSS is defined for the ith variables as follows.
BSS signifies how many times each selected predictor variable is replicated in B bootstrap datasets. SNPs are ranked in descending order of BSS.
For the ridge and IAL methods, the selected significant predictor variables are ranked in descending order of effect size.
Results
Pre-screening
KARE Data
At this step, we use the linear regression model in order to perform single SNP analysis for 327,872 SNPs. This linear regression model includes adjustment variables such as gender, age, and recruitment area (rural Ansung and urban Ansan).
where i = 1, 2, …, 8,842 denotes the individuals and j = 1, 2, …, 327,872 represents the SNPs. All SNPs are ranked in ascending order of P-values or in descending order of the absolute values of the coefficients from single-variable analysis. We use the top 1,000 ranked SNPs for each criterion, namely, the P-values and the absolute values of coefficients. We compare the minor allele frequency (MAF) and the number of missing values of the selected SNPs for each criterion. For the P-value criterion, the number of rare variants whose MAF values are less than 0.05 is 87. For the absolute values of coefficients criterion, the number of rare variants is 991 (Fig. 1). We observe that the common variants tend to have large numbers of missing values.
Although we use imputation processes, there are still missing values. Therefore, individuals having at least one missing value are eliminated. When data include individuals who have at least one missing value, penalized regressions such as the elastic-net and adaptive lasso cannot be performed. After the elimination process is performed, the number of remaining individuals is 4,183 for the P-value criterion and 7,496 for the absolute values of coefficients criterion. The number of overlapping individuals is 3,740. We use these overlapping individuals to compare each combination method. However, too many individuals have been lost. In order to reduce the loss of data, we eliminated SNPs with more than 30 missing values. The number of remaining SNPs is then 944 for the P-value criterion and 984 for the absolute values of coefficients criterion. After the elimination process, the remaining number of individuals is 7,481 for the P-value criterion and 8,164 for the absolute values of coefficients criterion. The number of overlapping individuals is 7,061. Henceforth, we shall use these individuals.
Areds Data
For the AREDS data, we use the linear regression model in order to perform single SNP analysis for 87,260 SNPs on the AREDs data. The model is given as follows:
where i = 1, 2, …, 461 are the individuals and j = 1, 2, …, 87,260 are the SNPs. All SNPs are ranked in ascending order of P-values or in descending order of the absolute values of the coefficients from single-variable analysis. We use the top 1,000 ranked SNPs via the P-values and absolute values of coefficients.

Number of missing data in the top 1,000 variables for each filtering method. The X-axis shows the MAFs of each SNP and the Y-axis shows the number of missing data for each SNP. The left figure shows the case where the SNPs are filtered out by P-values. In this case, the MAFs of the SNPs are uniformly distributed. The right figure shows the case where the SNPs are filtered by the absolute values of coefficients. In this case, mainly rare variants are chosen.
Top 10 SNPs in each method after coefficient filtering of KARE data by effect size for each method.
Variable Selection
KARE Data
We fit the multiple linear regression model to select the top 944 jointly associated SNPs for the P-value criterion and the top 984 SNPs for the absolute values of coefficients criterion.
All tuning parameters are determined by 10-fold cross-validation, which minimizes the mean squared error.
We can make eight combinations: (P-value + elastic-net), (P-value + lasso), (P-value + ridge), (P-value + IAL), (absolute values of coefficients + elastic-net), (absolute values of coefficients + lasso), (absolute values of coefficients + ridge), and (absolute values of coefficients + IAL). The combination method identifies 524, 504, 944, 471, 549, 548, 984, and 530 SNPs for these eight combinations, respectively, as putative height-related genetic variants. Then, for the elastic regularization and lasso methods, we generate 1,000 bootstrapped sets. The same fixed value of λ is used for the generated bootstrapped datasets. We can then determine the BSS value of each SNP. The SNPs are ranked in descending order of BSS. The ridge method cannot perform variable selection, as it selects all the SNPs. Therefore, BSS is meaningless in the ridge approach. For the ridge and adaptive lasso methods, the SNPs are ranked in descending order of effect size.
Top 10 SNPs in each method after P-value filtering of KARE data.
Table 1 shows the results of filtering SNPs with absolute value of coefficients in single variant analysis. Table 1 summarizes the list of SNPs that have the top 10 absolute values of coefficients in each penalized method. Among these SNPs, rs10948187, rs3799977, rs7954185, and rs7969076 were reported in other studies.13,14 Table 2 shows the results of filtering SNPs with P-values in single variant analysis. It summarizes the list of SNPs that have the top 10 absolute values of coefficients in each penalized method.
AREDS data.
We fit the multiple linear regression model to select the top 1,000 jointly associated SNPs for the P-value criterion and the top 1,000 SNPs for the absolute values of coefficients criterion.
All tuning parameters are determined by 10-fold cross-validation, which minimizes the mean squared error. The combination method identifies 493, 460, 1000, 559, 485, 442, 1000, and 534 SNPs for the eight combinations, respectively, as putative BMI-related genetic variants.
Top 10 SNPs in each method after coefficient filtering of AREDS data.
Top 10 SNPs in each method after P-value filtering of AREDS data.
Table 3 shows the results of filtering SNPs with absolute values of coefficients in single variant analysis. Table 3 summarizes the list of SNPs that have the top 10 absolute values of coefficients in each penalized method. Table 4 shows the results of filtering SNPs with P-values in single variant analysis. It summarizes the list of SNPs that have the top 10 absolute values of coefficients in each penalized method.
Comparative Study
We calculated the adjusted R-squares for the selected SNPs to investigate which combination of pre-screening method and penalized regression performs best for predicting quantitative traits. SNPs are ranked by BSS for the elastic-net and the lasso methods, while SNPs are ranked by effect size for the IAL and ridge methods.

Comparison of adjusted R-squares when the SNPs are filtered out by P-values in KARE data analysis. The X-axis represents the number of SNPs and the Y-axis the adjusted R-squares. The SNPs are ranked by BSS for the elastic-net and lasso methods, while the SNPs are ranked by effect size for the IAL and ridge methods.

Comparison of adjusted R-squares when SNPs are filtered out by effect size in KARE data analysis. The X-axis represents the number of SNPs and the Y-axis the adjusted R-squares. The SNPs are ranked by BSS for the elastic-net and lasso methods, while the SNPs are ranked by effect size for the IAL and ridge methods.
KARE Data
Figures 2–4 show the results of KARE data analysis. Figure 2 shows the adjusted R-square with the number of SNPs when SNPs are filtered by P-values. There is a tendency for the adjusted R-square to increase as the number of SNPs increases. The increase rate of the ridge method is slower than that of the IAL, lasso, and elastic-net methods. The adjusted R-squares all converge to 0.75 except for the ridge method. The IAL method shows the fastest increase rate.
Figure 3 shows the adjusted R-square with the number of SNPs when the SNPs are filtered by the absolute values of coefficients. There is a tendency for the adjusted R-square to increase as the number of SNPs increases. The increase rate of the ridge method is slower than that of other penalized regression methods. The adjusted R-squares all converge to 0.71 except for the ridge method. The IAL, lasso, and elastic-net methods show very similar increase rates.

Comparison of adjusted R-squares when SNPs are filtered out by P-value in KARE data analysis. The SNPs are ranked by effect size for each method.

Comparison of adjusted R-squares when the SNPs are filtered out by P-values in AREDS data analysis. The X-axis represents the number of SNPs and the Y-axis the adjusted R-squares. The SNPs are ranked by effect size for each method.
Figures 2 and 3 show the consistent results that (1) the P-value criterion tends to select better SNPs to predict the traits than the absolute values of coefficients criterion and (2) the ridge method performs worse in variable selection than other penalized regression methods.
Note that Figures 2 and 3 compare four penalized regression methods for a given pre-screening criterion. Among the IAL, lasso, and elastic-net methods, only the IAL method ranks SNPs by effect size. We wonder whether this difference among these three methods may be because of a different ordering of SNPs. Thus, instead of using BSS for the lasso and elastic-net methods, we use the same ordering of SNPs by effect size. Figure 4 shows the adjusted R-square with the number of SNPs when SNPs are filtered by the absolute values of the coefficients and ordered by effect size. Interestingly, the elastic-net, lasso, and IAL methods produce almost identical results. Thus, Figure 4 suggests that effect size is a better SNP ordering measure than BSS.
Areds Data
Figures 5 and 6 show the results of AREDS data analysis. These figures show very consistent results with those of KARE. Figure 5 shows the adjusted R-square with the number of SNPs when SNPs are filtered by P-values and ordered by effect sizes. There is a tendency for the adjusted R-square to increase as the number of SNPs increases. The increase rate of the ridge method is slower than that of the IAL, lasso, and elastic-net methods. The IAL, lasso, and elastic-net methods show very similar increase rates. Figure 6 shows the adjusted R-square with the number of SNPs when the SNPs are filtered by the absolute values of coefficients ordered by effect sizes. There is a tendency for the adjusted R-square to increase as the number of SNPs increases. The increase rate of the ridge method is slower than that of other penalized regression methods.

Comparison of adjusted R-squares when SNPs are filtered out by effect size in AREDS data analysis. The X-axis represents the number of SNPs and the Y-axis the adjusted R-squares. The SNPs are ranked by effect size for each method.
Conclusion
Recently many high-dimensional datasets have been generated in biomedical science, such as microarrays and SNP databases. Multistep strategies have been introduced to analyze these data. The first stage is pre-screening, in which the marginally associated response variables are identified, using various criteria. The second stage is variable selection. Various penalization methods have been proposed to analyze high-dimensional data. These include the ridge, bridge, least absolute shrinkage and selection operator (lasso), adaptive lasso, SCAD, and elastic-net methods. However, we do not know which method performs best for quantitative traits. Using an adjusted R-square as a measure of comparison, our study shows that for quantitative traits, the P-value criterion selects better variables to predict the trait than the absolute values of coefficients criterion. We conclude that the elastic-net, lasso, and IAL methods have almost the same performance, while the ridge method performs worst in variable selection.
In this study, we use only quantitative traits. However, a similar study could be easily conducted using binary traits such as diabetes and high blood pressure.
Because of gaps in the data, we unavoidably eliminate SNPs and individuals who have at least one missing value. This loss of information may reduce the accuracy of the study. We need to improve this accuracy by trialing appropriate imputation methods using simulated datasets.
Author Contributions
Conceived and designed the experiments: TP, SH. Analyzed the data: SH, YK. Wrote the first draft of the manuscript: TP, SH. Contributed to the writing of the manuscript: TP, SH, YK. Agree with manuscript results and conclusions: TP, SH, YK. All authors reviewed and approved of the final manuscript.
Footnotes
Acknowledgement
The AREDS data used for the analyses described in this manuscript were obtained from the AREDS database, controlled through the database of Genotypes and Phenotypes (dbGaP) accession number phs000001.v2.pl.
