Catching the Genomic Wave in Oligonucleotide Single-Nucleotide Polymorphism Arrays by Modeling Sequence Binding

Abstract

The genomic wave has been identified as a major artifact in genome data and is highly correlated with the sequence GC content. Although statistical methods have been developed to filter this artifact, the mechanism underlying the genomic wave has not been studied yet. Understanding of the artifact, specifically the sources of the artifact, may lead to successful separation of biological signals from the artifact and improve array design, modeling, and association studies. We develop an approach to catching the genomic wave in the oligonucleotide single-nucleotide polymorphism (SNP) arrays by separating biological signals from the array baseline background through modeling sequence binding with a newly developed probe intensity composite representation (PICR) model. The PICR model decomposes the probe intensity of each SNP probe set into the target sequence concentrations, SNP-specific background (nonsignal) and measurement error, and identifies the biological signals through the target concentration for each allele. We demonstrate with the Affymetrix GeneChip 500K HapMap data and the Wellcome Trust Case–Control Study data that the genomic wave is captured through the SNP-specific background term of the PICR model, and is separated successfully from the allelic target concentrations—the biological signals. We further identify two important sources of the genomic waves, the GC content and the fragment length (FL) of the sequence, and conclude that (1) the genomic wave artifact can be removed from the genome data with the PICR model, and (2) in addition to the GC content, the genomic wave also has a component of nonlinear effect of the FL.

1. Introduction

The genomic wave was reported as an artifact in genome studies (Marioni et al., 2007; Diskin et al., 2008). It has been described as a phenomenon of wave pattern in genome data along the chromosome, which is present even after the initial array normalization procedure, such as the quantile normalization (Bolstad et al., 2003). It has been shown to be independent of the array platform generating the data, lowering the accuracy of DNA variant annotation, such as copy number variants (CNVs), and reducing the statistical power in association tests (Diskin et al., 2008). Although robust methods have been developed to remove the genomic wave artifact before the association tests (Marioni et al., 2007; Diskin et al., 2008), mechanism underlying the genomic wave still remains unknown. So far, no report has been published in this regard.

It has been shown in the literature that understanding of the mechanism is crucial in modeling microarray data and successful separation of array artifacts from the biological signals (Li and Wong, 2001; Held et al., 2003, 2006; Johnson et al., 2006; Kapur et al., 2008; Kalari et al. (2006); Mount et al. (1992); Pozhitkov et al., 2010). The first important step is to identify the source of the artifacts by studying the components that present the main characteristics of the artifact. The genomic wave has been shown to present a high correlation with the GC content, a main feature of the genomic waves (Diskin et al., 2008). However, this does not imply that the GC content is the only source of the genomic wave artifact, and more studies, especially based on the array mechanism, are needed to achieve a better understanding of the source of the artifact, which may lead to an accurate estimation of the artifact and its complete correction. Furthermore, it has been reported that the study of GC content plays a major role in signal bias correction (Hasan and Schreiber, 2006) because the GC content varies from intron regions to exon regions (Gilbert and Rechtsteiner, 2009) and may affect many genome functions, including splicing (Zhang et al., 2011), genomic punctuation (Zhang et al., 2004), and perturbations in genomic signals (Hasan and Schreiber, 2006). However, it is not known whether the current methods of removing the genomic wave also remove part of the biological signals, which, if it is true, may lead to potential loss of statistical power and trade off the gain from the artifact removal. Therefore, identification of the source of the genomic waves and confirmation for the separation of the artifact from the biological signals is crucial to subsequent association studies. In this article, we report a novel method of catching the genomic wave on oligonucleotide SNP arrays and provide further insight on the source of the genomic waves, which further confirms that the genomic wave artifact can be separated from the biological signals through a simple regression model that removes not only the genomic wave artifact but also other artifacts simultaneously.

In a previous study (Wan et al., 2009), we modeled hybridization of the DNA–DNA duplex binding through a probe intensity composite representation (PICR) model based on physicochemical thermodynamics model. The single array-based PICR model employs a linear regression in each SNP probe set and generates highly accurate copy number estimation and SNP genotype calls consistently across arrays and array platforms (Wan et al., 2009). Here we report that the PICR model catches the genomic wave with the SNP-specific baseline term, and leaves the estimated target concentration unaffected by the genomic wave artifact. Thus, subsequent analysis based on the allelic concentrations by the PICR does not suffer from the genomic wave artifact. By analyzing the baseline component, we further find that the genomic wave presents the wave pattern of the linear effect of the GC content consistently across chromosomes, and a nonlinear effect of the fragment length (FL) varying across chromosomes. We illustrate the effect of the genomic wave and its removal with the Affymetrix 500K HapMap data and the Wellcome Trust Case–Control Consortium data.

2. Materials and Methods

2.1. Data

GeneChip 500K SNP array HapMap samples were downloaded from the Affymetrix, Inc., website. The hypertension cases of the Wellcome Trust Case–Control Study data were obtained from the Wellcome Trust Case–Control Consortium online (see References).

2.2. Statistical analysis

Probe intensity data were obtained from each CEL file of the Affymetrix 500K GeneChip array, and then processed through the PICR model to generate the allelic concentrations using Equation 1. \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} & \qquad\qquad\qquad\quad \vdots \\ & \\ & \ I_{PA , ks} = b + N_A f^ {\{S^{PA , ks} , S^{TA}\}} + N_B f^{\{S^{PA , ks} , S^{TB}\}} + \varepsilon \\ & \\ & \ I_{PB , ks} = b + N_A f^{\{S^{PB , ks} , S^{TA}\}} + N_B f^{\{S^{PB , ks} , S^{TB}\}} + \varepsilon \qquad , \qquad\qquad\qquad (1) \\ & \ I_{MA , ks} = b + N_A f^{\{S^{MA , ks} , S^{TA}\}} + N_B f^{\{S^{MA , ks} , S^{TB}\}} + \varepsilon \\ & \\ & \ I_{MB , ks} = b + N_A f^{\{S^{MB , ks} , S^{TA}\}} + N_B f^{\{S^{MB , ks} , S^{TB}\}} + \varepsilon \\ & \\ & \qquad \qquad \qquad \vdots \end{align*} \end{document}

where I's are the probe intensities of a probe set designed to annotate a given SNP of two alleles, labeled as alleles A and B. By the SNP array design, these probes in the probe set may have perfect match to allele A (PA), mismatch to allele A (MA), perfect match to allele B (PB), and mismatch to allele B (MB). b is the SNP-specific baseline background of the intensity, and may include the array background and the SNP-specific deviation from it. N_A and N_B are the allelic concentrations of the target sequences for annotating alleles A and B, respectively. The function f represents the binding affinity of each probe to the allele (either A or B) of an SNP between a target sequence S^T and a probe sequence S^P, and was computed following the generalized positional-dependent-nearest-neighbor (GPDNN) model (Wan et al., 2009). The measurement error term ɛ is assumed to follow a normal distribution with mean 0 and a common variance σ². By linear regression, the PICR model yields the allelic concentrations N_A and N_B, and the baseline intensity b by the model intercept. For subsequent analysis, only N_A and N_B are used in association test on DNA variants (Wan et al., 2009).

To illustrate the genomic wave, we examined plot of the relative concentration (RC) at the base 2 logarithmic scale along the genomic position of the SNP, and the plot of the SNP-specific baseline intercept b at the base 2 logarithmic scale. The RC was defined as RC = (N_A + N_B)/med (N), where med(N) = median(N_A + N_B) across all SNPs on the same array except for the SNPs on the X-chromosome. Thus, the regular DNA copy number 2 corresponds to RC = 1, and severe deviation from 1 indicates a copy number deviation from 2 at the SNP locus. A loess smoothing curve with a smoothing parameter 0.1 was also fitted to the RC to reveal the pattern along the chromosome, similar to what was examined by Marioni et al. (2007) and Diskin et al. (2008). We further calculated the Pearson correlation coefficient between the RC and the GC content for each SNP and between the baseline intercept b and the GC content. Here the Pearson correlation is preferred to the Spearman correlation in examining the correlation between continuous variables. The correlation coefficients on multiple arrays in each study data set are illustrated with boxplot by chromosome.

The following generalized additive models (GAM) (Hastie and Tibshirani, 1991) were fitted to the baseline estimate b at the base 2 logarithmic scale and its loess smoothing value: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*}{\log_2} (b) = \alpha_1 + \beta_1 \times {\rm GC} + {\rm s}_1 ({\rm FL}) + {\rm error} \tag{2.1}\end{align*} \end{document}

and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*}{\rm loess} ({\log_2} [b]) = \alpha_2 + \beta_2 \times {\rm GC} + {\rm s}_2 ({\rm FL}) + {\rm error} , \tag{2.2}\end{align*} \end{document}

where log(b) and loess(log[b]) are the log-transformed SNP-specific baseline estimated from the PICR model and the fitted value of its loess smoothing, respectively. s₁(FL) and s₂(FL) are the nonlinear smooth terms of the FL. The linear term on GC content was chosen because the nonlinear effect of the GC content is found nonsignificant (p > 0.05). The capture of the genomic wave is demonstrated by the loess smoothing of the log(b) with a smoothing parameter 0.1, and the effect of the GC content, α + β × GC in the model fitting.

To explain that the wave pattern in the smoothing curve loess(log[b]) of (2.2) is captured by the term b or log(b) in (2.1), rather than some terms other than b, we assess the similarity of the effect of the GC content between models (2.1) and (2.2) by the relative difference (RD), defined as the ratio of the mean squared difference of the fitted value with the GC component between models (2.1) and (2.2) calculated with the total number of SNPs (M) on each chromosome to the median of the loess smoothed value of the log intercept as in Equation 3. \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} {\rm RD} = \frac {\displaystyle {\frac {1} {M}} \sum \limits_ {SNP} [(\hat {\alpha} _1 + \hat {\beta} _1 GC) - (\hat {\alpha} _2 + \hat {\beta} _2 GC)] ^2} {median (loess (\log (b)))} . \tag {3} \end{align*} \end{document}

3. Results

3.1. Components presenting genomic wave

We examined the pattern of the RC along the chromosome and the baseline intercept of the PICR model. The RC plot and its loess smoothing curve show some fluctuation on chromosome 11, but no consistent pattern across arrays as illustrated with 3 randomly selected HapMap SNP arrays (left panels of Fig. 1). Similar plots are found on other chromosomes (data not shown). In contrast, the loess smoothing curve of the baseline intercept presents consistent wave pattern on chromosome 11 (right panels of Fig. 1) as illustrated with 3 randomly selected HapMap SNP arrays, and similar pattern on other chromosomes (data not shown), across all HapMap samples. The correlation between the RC and the GC content is around 0 consistently across all chromosomes (Fig. 2) and not statistically significant, whereas the correlation between the baseline intercept and the GC content is positive (>0.2) on all chromosomes except for the X-chromosome (>0.16) (Fig. 3). Since the genomic wave has been reported to be highly correlated with the GC content consistently across array platforms (Diskin et al., 2008), it implies that the genomic wave is contained in the baseline intercept term only, but not in the allelic concentrations. Since the allelic concentrations contain the genetic information and convey the biological signals and functions of the genes (Wan et al., 2009), this indicates that the genomic wave can be successfully separated from the biological signals by the PICR model, and thus does not affect the outcome of subsequent analysis based on the allelic concentrations.

FIG. 1.

Illustration of the genomic wave on chromosome 11 with 3 randomly selected arrays of the HapMap samples on Affymetrix 500K GeneChip Arrays. Left panel: plot of the relative concentration (RC) in log 2 scale with a loess fit (red curve). Right panel: plot of the baseline intercept in log 2 scale with a loess smoothing curve (span = 0.1) (red curve).

FIG. 2.

Boxplot of the correlation between the baseline intercept and the GC content by chromosome.

FIG. 3.

Boxplot of the correlation between the baseline intercept and the GC content by chromosome.

3.2. Decomposition of the genomic wave

Given that we have identified the source of the artifact, the only term containing the genomic wave is the SNP-specific baseline term of the PICR model (1), we further study the baseline term by decomposing it into the GC content and FL of the target sequence through a GAM model (2.1 and 2.2) to further trace the genomic wave. The GAM model has a linear effect of the GC content (nonlinear term is not significant with p > 0.05) for most chromosomes (1 to 21) except for chromosomes 22 and X. Figure 4A and B illustrates the linear effect and nonlinear effect of the GC content on chromosomes 3 and 22, respectively. The nonlinear effect of the FL also varies across chromosomes as shown in Figure 4C and D on chromosomes 3 and 6 on two randomly selected arrays. The parallel pattern of the effect of GC content across arrays indicates that the effect remains consistent across arrays, but also varies across chromosomes, which further suggests that the GC content may be used for array normalization, as also observed by Gilbert and Rechtsteiner (2009). Figure 5 illustrates the capture of the genomic wave (black curve of loess smoothing of the baseline term) by the GC content (α₁ + β₁ × GC in green) from the GAM model (2.1) (upper panel), and the comparison of the primary effect of the GC content (β₁ × GC in green) to the effect of the nonlinear FL (s₁(FL) in red) on chromosome 11 (lower panel). The difference in scale of the green curve of the GC content between the upper and the lower panels of Figure 5 is because of the omit of the intercept term α₂ in the lower panel, to illustrate the effect of the FL. The fairly flat FL effect compared with the large wave pattern of the GC content in the lower panel demonstrates that the GC content is a dominant source of the genomic wave, while the contribution of the FL is small. The same pattern was observed across all chromosomes.

FIG. 4.

Illustration of the effects of the GC content and fragment length on the genomic wave. (A) Linear effect of the GC content on chromosome 3 in 15 HapMap samples. (B) Nonlinear effect of the GC content on chromosome 22 in 15 HapMap samples. (C) Nonlinear effect of the fragment length on chromosome 3 of a randomly selected HapMap array. (D) Nonlinear effect of the fragment length on chromosome 6 of a randomly selected HapMap array.

FIG. 5.

Capture of the genomic wave on chromosome 11 by the GAM model on the GC content and the fragment length of the target sequence. Upper panel shows the loess smoothing of the baseline term (black) and the fitted value by the GC content and the overall mean (α+ β × GC) of the GAM model (green) in log scale. The exclusion of the fragment length demonstrates that the genomic wave consists largely of GC content. Lower panel shows the effects of the GC content β × GC (green) and the fragment length s(FL) (red) from the GAM model in log scale with the former dominating the latter.

To further quantify the portion of the GC content out of the genomic wave, we examined the variability explained by the GC content out of the total from the loess-smoothed log(b) through GAM model (2.2). Notice that GAM model (2.1) does not serve the purpose since log(b) does not represent the genomic wave, while its loess-smoothed term loess(log(b)) does. The FL term is not significant in model (2.2), and leaves the GC component relatively large to the total variability in loess(log(b)). Although the two GAM models, (2.1) and (2.2), fit different quantities, smoothed and unsmoothed log(b), respectively, their effects of the GC components are very much close through the comparison of the relative difference RD (Table 1). This further confirms that the genomic wave has a dominant GC content component, as illustrated in Figure 6A and B in comparison of the loess smoothing of the SNP-specific baseline and the fit by the GC content. It can thus be concluded that the genomic wave can be captured mainly with the GC content of the SNP target sequence. The WTCCC samples present similar results (data not shown).

FIG. 6.

Illustration of the comparison of the loess smoothing of the SNP-specific baseline and the fit by the GC content on one randomly selected HapMap sample on (A) chromosome 10 and (B) chromosome 20.

Table 1.

Summary Statistics of the Relative Deviation from 15 Affymetrix GeneChip 500K SNP Arrays by Chromosome

Chromosome	Mean	SE	Chromosome	Mean	SE
1	0.0118	0.0004	2	0.0146	0.0005
3	0.0105	0.0004	4	0.0110	0.0004
5	0.0114	0.0004	6	0.0098	0.0003
7	0.0117	0.0004	8	0.0123	0.0004
9	0.0089	0.0003	10	0.0130	0.0004
11	0.0120	0.0003	12	0.0112	0.0004
13	0.0088	0.0003	14	0.0120	0.0004
15	0.0121	0.0004	16	0.0118	0.0004
17	0.0137	0.0004	18	0.0098	0.0004
19	0.0193	0.0004	20	0.0130	0.0004
21	0.0115	0.0005	22	0.0153	0.0005
X	0.0110	0.0006

4. Discussion and Conclusions

The genomic wave has been reported as an artifact in the genome data independent of the array media used for data generation. It requires special computational methods to remove the artifact before downstream association studies. Further understanding of the sources of the artifact may lead to successful separation of the biological signals from the artifact.

In this work, we discovered that the genomic wave in the oligonucleotide SNP arrays can be captured and further separated successfully from the biological signals. By studying the array mechanism of the oligoarrays through the PICR model, we showed that the intercept term of the PICR model contains the genomic wave, while the allelic concentrations do not. Further modeling of the intercept term through the GAM model revealed that both the GC content and the FL are significant, with the former presenting a linear effect and contributing largely to the genomic wave, while the latter presenting a nonlinear effect varying across the chromosomes.

Our finding not only confirmed previous conclusion of the high correlation between the genomic wave and the GC content (Diskin et al., 2008), but also quantified the effect of the GC content and the FL, which has not been reported in the literature. In particular, the linear effect of the GC content and the nonlinear effect of the FL varying across chromosomes are novel observations and may be useful for array normalization. Our finding further confirmed that the allelic concentration of the SNPs does not possess consistent wave pattern and is free of the artifact of the genomic wave. This provides strong evidence to the following conclusions: (1) Subsequent analysis based on the allelic concentrations estimated with the PICR model does not suffer from the genomic wave artifact, including association studies on SNP genotypes, copy number alterations, and so on. (2) The genomic wave artifact can be successfully separated from the biological signals by applying the PICR model and the RC method without taking any procedure for artifact adjustment. (3) The GC content of the target sequence is a major source of the genomic wave. (4) Compared to the GC content, the FL of the SNP target sequence has a significant nonlinear effect on the SNP-specific baseline, which varies largely across chromosomes.

Given our previous findings that the PICR model removes the unequal footing artifact, filters the noise in the probe intensity data from the SNP arrays, and yields accurate copy number estimation, we conclude that the PICR model can simultaneously remove multiple artifacts, including two most important ones, the unequal footing and the genomic wave. Thus, the PICR model provides a unified computational approach to the simultaneous removal of multiple artifacts, which has not been achieved by any other computational approach.

Although the PICR model has a large number of parameters, its implementation is fast because the free energy parameters and binding affinities can be precalculated and remain the same across different arrays. Following one reviewer's suggestion to examine the sensitivity of the PICR model in terms of genotype calling accuracy, we randomly selected one HapMap sample and conducted parameter permutation followed by SNP genotype calling. To keep the parameters sensible, all parameters of the PICR model were permuted within each level of the following parameter space: perfect match nucleotide pair effect, perfect match position effect, mismatch nucleotide pair effect, mismatch nucleotide triplet effect, and mismatch position effect. To examine the effect on SNP genotype calling, we permuted the parameters 10 times and calculated the genotype calling accuracy following each permutation. The results show that compared with the high genotype calling accuracy on the selected sample by the original PICR model (99.4% correct calls using the HapMap SNP annotation as the gold standard), the permutation decreased the accuracy largely. Among the 10 permutations, 8 had 80% accuracy, while the other 2 had very low accuracy (38.4% and 33.8% correct calls). The large drop of genotyping accuracy by permutation indicates that the relative large number of model parameters of the PICR model does not lead to overfitting, unlike previous report achieved by modeling of RNA gene expression arrays with a small number of parameters (Ding and Lawrence, 2001; Wan et al., 2008). This may partly reflect the difference in the Affymetrix array design between SNP GeneChip arrays and gene expression Human Genome arrays that use different number of probes for each target annotation.

Given that the genomic wave has been observed in both oligoarrays and bead arrays, and is independent of the array platform (Diskin et al., 2008), it can be implied that the genomic wave must not come from any array specific features, but rather from common experimental procedures. Our finding showed that the GC content and the FL of the DNA sequence around the SNP locus had significant effects, which partly explains why the artifact is independent of the array media. It may further suggest that improvement of future array design needs to consider the GC content and FL of the target sequence, for either SNPs or CNVs. However, further investigation requires more data, which is out of the scope of this current study.

Footnotes

Acknowledgments

This work was partly supported by an Oncology Scholarship from College of Human Medicine, Michigan State University. The author would like to thank the editor and one anonymous reviewer for their helpful comments.

Author Disclosure Statement

The authors declare that no competing financial interests exist.

References

Affymetrix, Inc. http://www.affymetrix.com/support/technical/sample_data/500K_data.affx.

Bolstad

B.M.

, Irizarry

R.A.

, Astrand

et al. 2003. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19:185–193.

Diskin

S.J.

, Li

, Hou

et al. 2008. Adjustment of genomic waves in signal intensities from whole-genome SNP genotyping platforms. Nucleic Acids Res., 36:e126.

Ding

, Lawrence

C.E.

2001. Statistical prediction of single-stranded regions in RNA secondary structure and application to predicting effective antisense target sites and beyond. Nucleic Acids Res., 29:1034–1046.

Gilbert

, Rechtsteiner

2009. Comments on sequence normalization of tiling array expression. Bioinformatics, 25:2171–2173.

Hasan

, Schreiber

2006. Recovering motifs from biased genomes: application of signal correction. Nucleic Acids Res., 34:5124–5132.

Hastie

, Tibshirani

1991. Generalized Additive Models. Chapman and Hall/CRC: New York, NY.

Held

G.A.

, Grinstein

, Tu

2003. Modeling of DNA microarray data by using physical properties of hybridization. Proc. Natl. Acad. Sci. U. S. A., 100:7575–7580.

Held

G.A.

, Grinstein

, Tu

2006. Relationship between gene expression and observed intensities in DNA microarrays—a modeling study. Nucleic Acids Res., 34:e70.

10.

Johnson

W.E.

, Li

, Meyer

C.A.

et al. 2006. Model-based analysis of tiling-arrays for ChiP-chip. Proc. Natl. Acad. Sci. U. S. A., 103:12457–12462.

11.

Kalari

K.R.

, Casavant

, Bair

T.B.

et al. 2006. First exons and introns: a survey of GC content and gene structure in human genome. In Silico Biol, 6:237–242.

12.

Kapur

, Jiang

, Xing

et al. 2008. Cross-hybridization modeling on Affymetrix exon arrays. Bioinformatics, 24:2887–2893.

13.

, Wong

W.H.

2001. Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc. Natl. Acad. Sci. U. S. A., 98:31–36.

14.

Marioni

J.C.

, Thorne

N.P.

, Valsesia

et al. 2007. Breaking the waves: improved detection of copy number variation from microarray-based comparative genomic hybridization. Genome Biol., 8:R228.

15.

Mount

S.M.

, Burks

, Hertz

et al. 1992. Splicing signals in Drosophila: intron size, information, content, and consensus sequences. Nucleic Acids Res, 20:4255–4262.

16.

Pozhitkov

A.E.

, Boube

, Brouwer

M.H.

et al. 2010. Beyond Affymetrix arrays: expanding the set of known hybridization isotherms and observing pre-wash signal intensities. Nucleic Acids Res., 38:e28.

17.

Wan

, Kang

, Tang

et al. 2008. Meta-prediction of phosphorylation sites with weighted voting and restricted grid search parameter selection. Nucleic Acids Res., 36:e22.

18.

Wan

, Sun

K.L.

, Ding

et al. 2009. Hybridization modeling on oligonucleotide SNP arrays for accurate DNA copy number estimation. Nucleic Acids Res., 37:e117.

19.

Wellcome Trust Case—Control Consortium. http://www.wtccc.org.uk/.

20.

Zhang

, Kuo

C.C.J.

, Chen

2011. GC content around splice sites affects splicing through pre-mRNA secondary structures. BMC Genomics, 12:90.

21.

Zhang

, Kasif

, Cantor

C.R.

et al. 2004. GC/AT-content spikes as genomic punctuation marks. Proc. Natl. Acad. Sci. U. S. A, 101:16855–16860.