1. Introduction
Determining the ancestry of individuals based on their DNA sequence is a common and useful task in the study of human genetics. Ancestry information is essential for improving statistical power in studies that detect disease-related genes (Price et al., 2010; Seldin et al., 2011), in addition to its multiple applications in population genetics studies (Bryc et al., 2010; Jarvis et al., 2012; Hinch et al., 2011; Wegmann et al., 2011; Gravel et al., 2011). Until recently, computational approaches treated ancestry as a discrete entity, and different methods were designed to classify individuals to their ancestry, or to multiple ancestries in the case of individuals of admixed ancestry (Pritchard et al., 2000; Alexander et al., 2009). The discrete modeling approach is unjustified in many cases, since human individuals tend to disperse locally, thereby generating continuous variation patterns across the geographic space. In the absence of continuous ancestry representations, dimensionality reduction techniques, mostly principal component analysis (PCA), have been heuristically used to represent genetic samples on the continuous space.
PCA had been suggested already in the 70's (Menozzi et al., 1978) for the study of population genetics from population data, and was later adapted to whole-genome data (Patterson et al., 2006). Despite its controversial early use in the inference of ancient demographic events (Novembre and Stephens, 2008), over time PCA has proven extremely useful in multiple applications, largely due to its ability to capture genetic variation resulting from different types of processes. One important observation is that in some geographic regions, most prominently Europe (Lao et al., 2008; Novembre et al., 2008), the geographic dispersal of human individuals correlates well with the first PCs of genotype data. This correlation has led to the use of PCA for estimating the geographic origin of individuals based on their genetic data. However, in other regions the correspondence between geography and PCs is rather weak (Wang et al., 2012). The exact relation between PCA and spatial inference has never been formally clarified, to the best of our knowledge, and its use in this context remains a heuristic.
In the past few years multiple models aiming to explicitly describe the relation between geography and human genetic variation have emerged. Such modeling approaches consist of a promising research direction, since they potentially allow for accurate characterization of genetic variation as a function of the geographic space, which in turn can lead to important insights into human demographic history, as well as into selection and mutation processes. In addition, inference in such models allows for the accurate estimation of the geographic origin or multiple origins of human individuals, which can be thought of as their continuous ancestry.
In what follows we attempt to provide a unifying framework for these recent models as well as PCA. We begin by making explicit the spatio-genetic model that underlies PCA. Next, we review the spatio-genetic models behind SPA (Yang et al., 2012) and LOCO-LD (Baran et al., 2013), two recently suggested spatial approaches, and present them as extensions of PCA's model, each handling a different limitation of the latter. LOCO-LD was originally suggested as an approach for spatial inference in a supervised context, in which the geographic locations of origin are known for some of the samples. Leveraging its connection to PCA, we adapt it here to the unsupervised scenario and derive an optimization procedure that yields spatial predictions for each of the samples. We then provide empirical results demonstrating that this new procedure outperforms PCA in the inference of spatial structure. We end by reviewing additional recent relevant work. Throughout the text we discuss connections to related problems in computational genetics, including heritability estimation and gene expression analysis.
2. Data and Notation
We assume we are given the whole-genome genotypes of n individuals over m SNPs. The genotypes are encoded in the m×n matrix G, Gji∈{0, 1, 2}. We denote by Gi the ith column of G, which corresponds to individual i. (GT)
j
is therefore the column vector containing the row of G that corresponds to SNP j. More generally, we denote by Ai and (AT)
j
the ith column and jth row of matrix A, respectively, and by Aji the cell located in their intersection. In addition, we denote by G0 the SNP-centered genotype matrix, whose rows have zero mean.
3. PCA
The PCA procedure searches for a low-dimensional orthogonal coordinate system, such that the variance of the projections of the data samples on the chosen axes is maximized. For a given dimension d, these axes are represented as the columns of an m×d matrix U whose columns are orthonormal. In addition, the projections of the samples on the axes are required to be uncorrelated in pairs; however, this constraint only affects the rotation of the coordinate system, which is irrelevant to our application, and we therefore ignore it in the presentation.
We take the common approach of performing PCA on the SNP-centered matrix G0, instead of on the original matrix G. This step has the effect of centering the data around the origin, and when skipped, the first PCA axis often captures the data intercept (i.e., the SNP means vector) instead of capturing the within-data variance. The function we aim at maximizing in PCA as a function of U is therefore
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}f_{PCA} ( U ) = \mathop\sum\limits_{k = 1}^d {
\parallel{ ( U_k ) }^T G_0 \parallel}^2 = \mathop \sum \limits_{i
= 1}^n {G_{0i}}^T U U^T G_{0i} \tag{1}\end{align*}
\end{document}
where d is the reduced dimension. When a geographic representation of the samples is requested (i.e., longitude and latitude coordinates), this parameter is set to 2.
The function fPCA is optimized when U's columns are the d leading eigenvectors of the SNP covariance matrix
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$ \frac { 1 } { n } G_0 { G_0 } ^T$$
\end{document}
(or any other orthonormal set that spans the same subspace, if we omit the constraint of uncorrelated projections). The genotypes of each individual are then projected onto this subspace, and the resulting PCA scores X=UT G0 are used for representing the genetic structure or the geographic dispersal of the study sample; we refer to these scores as PCA's solution. We note that PCA is often applied to the standardized genotype matrix, in which each SNP has a variance of 1 in addition to zero mean. This modification is motivated by the fact that the frequency change of an SNP due to genetic drift occurs at a rate proportional to
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\sqrt{p ( 1 - p ) }$$
\end{document}
, p being the SNP's minor allele frequency (Patterson et al., 2006; Nicholson et al., 2002). When the Hardy–Weinberg equilibrium holds and the sample size is large, standardization has the effect of scaling the genotypes by the inverse of this rate (indeed, in some analyses the genotypes of each SNP are divided by the respective
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\sqrt{p ( 1 - p ) }$$
\end{document}
instead). We observed that standardization improves PCA's correspondence with geography (results not shown), and the results we present in the experimental section were therefore generated using the standardized matrix.
PCA is merely a computational technique, and its results have no clear genetic interpretation. One notable result draws a connection between the Euclidean distance of two populations in the PCA space and the fixation index (Fst) between them (McVean, 2009). This is, however, an interpretation in terms of coalescence theory, as opposed to a spatial interpretation. Nevertheless, a spatial interpretation does exist: PCA's solution can be cast as the maximum likelihood solution of a probabilistic model in which
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}G_{0ji}\, \sim\, N ( { \beta_j}^T X_i , \sigma^2 ) \tag{2}\end{align*}
\end{document}
where β is a d×m matrix constrained to have orthonormal columns, and X is a d×n matrix. When this model is interpreted from the spatio-genetic perspective (i.e., when X is taken to be spatial coordinates), its underlying assumption is that the expected genotype value of SNP j changes linearly along a single direction in the d-dimensional space, with direction and rate determined by β
j
. This assumption of uniform rate of change is expected to hold better in isolation-by-distance scenarios, where local genetic differences are accumulated under spatially restricted dispersal (Wright, 1943). Indeed, the PCA map fits the geographic map well in Europe, where no strong geographic or cultural barriers to gene flow exist. In cases where more complex scenarios hold [e.g., individuals are admixed (Ma and Amos, 2012), originate from multiple continents or from regions separated by natural barriers], this approximation is not expected to produce results congruent with geography.
We now turn to highlight additional limitations of PCA's model in the context of spatial inference. The first two of these are presented as motivations for two recent spatio-genetic methods, SPA (Yang et al., 2012) and LOCO-LD (Baran et al., 2013).
4. SPA—Introducing Discreteness
The model underlying SPA (Yang et al., 2012) assumes that the frequency of SNP j is a function of the spatial coordinates
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$x = \left( {x_1 \atop x_2} \right)$$
\end{document}
, and specifically is determined as
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$f_j ( x ) = \frac { e^ { { a_j } ^T x + b_j } } { 1 + e^ { { a_j } ^T x + b_j } } $$
\end{document}
, where
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$a_j = \left( {a_{j1} \atop a_{j2}} \right)$$
\end{document}
, bj are SNP-specific parameters. Given the parameters a, b and individual i's location of origin, their genotype is sampled independently for each SNP from the corresponding binomial distributions. The resulting log likelihood is
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}\log L ( G_i;a , b , X_i ) = h_i \log ( 2 ) +
\sum_j \left[ G_ { ji } \log \left( \frac { e^ { { a_j } ^T X_i +
b_j } } { 1 + e^ { { a_j } ^T X_i + b_j } } \right) + ( 2 - G_ {
ji } ) \log \left( \frac { 1 } { 1 + e^ { { a_j } ^T X_i + b_j }
} \right) \right] \tag { 3 } \end{align*}
\end{document}
where X is the 2×n matrix of spatial coordinates, and hi is the number of heterozygous positions in Gi. Since hi is fixed, maximizing the likelihood of the entire sample is equivalent to minimizing
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}f_{SPA} ( X , a , b ) = 2 \mathop \sum \limits_{i =
1}^n \mathop \sum \limits_{j = 1}^m \log ( 1 + e^{{a_j}^T X_i +
b_j} ) - \mathop \sum \limits_{i = 1}^n \mathop \sum \limits_{j =
1}^m G_{ji} ( {a_j}^T X_i + b_j ) \tag{4}\end{align*}
\end{document}
Now, as formulated in Equation (2), PCA can be cast as the maximum likelihood solution to a model in which the genotypes are sampled from normal distributions. The probability density function of observing genotype g∈{0, 1, 2} in SNP j given PCA score
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$x = \left( {x_1 \atop x_2} \right)$$
\end{document}
is
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}f_ { j , x } ( g ) \propto e^ { - \frac { ( (
\beta_j ) ^T x - g ) ^2 } { 2 \sigma^2 } } \tag { 5 }
\end{align*}
\end{document}
However, since genotypes are discrete, the normal density does not induce a proper probability function on its domain. One way of fixing this situation is replacing the normal density with the conditional probability given that the genotype value is legal. We claim that a solution to this discretized version of PCA is also a solution to SPA. We begin with the simpler case in which the data is haploid, that is, Gji∈{0, 1}:
Lemma 5.1 Let X be the discretized PCA solution on the matrix G, assuming it consists of haplotypes. There exist a,b such that {a, b, X} are the arguments minimizing fSPA.
Proof The normal density attains the following values at its legal data points:
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}f_ { j , x } ( g = 1 ) \propto e^ { - \frac { ( (
\beta_j ) ^T x - 1 ) ^2 } { 2 \sigma^2 } } , \quad f_ { j , x } (
g_j = 0 ) \propto e^ { - \frac { ( ( \beta_j ) ^T x ) ^2 } { 2
\sigma^2 } } \tag { 6 } \end{align*}
\end{document}
Conditioning on g∈{0, 1} yields the following probability function:
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}
\begin{split} p_ { j , x } ( g = 0 ) & = 1 - p_ { j , x } ( g = 1
) \\ p_ { j , x } ( g = 1 ) & = \frac { f_ { j , x } ( 1 ) } { f_
{ j , x } ( 1 ) + f_ { j , x } ( 0 ) } = \frac { e^ { - \frac { (
( \beta_j ) ^T x - 1 ) ^2 } { 2 \sigma^2 } } } { e^ { - \frac {
( ( \beta_j ) ^T x - 1 ) ^2 } { 2 \sigma^2 } } + e^ { - \frac { (
( \beta_j ) ^T x ) ^2 } { 2 \sigma^2 } } } \\ & = \frac { e^ {
\frac { 2 ( \beta_j ) ^T x - 1 } { 2 \sigma^2 } } } { e^ { \frac
{ 2 ( \beta_j ) ^T x - 1 } { 2 \sigma^2 } } + 1 } = \frac { e^ {
( \beta^ { \prime } _j ) ^Tx + \alpha^ { \prime } _j } } { e^ { (
\beta^ { \prime } _j ) ^Tx + \alpha^ { \prime } _j } + 1 }
\end{split} \tag { 7 } \end{align*}
\end{document}
The last equality holds by assigning
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\beta_j^ { \prime } = \frac { 1 } { \sigma^2 } \left( { \beta_ { 1j } \atop \beta_ { 2j } } \right)$$
\end{document}
and
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\alpha^ { \prime } _j = - \frac { 1 } { 2 \sigma^2 } $$
\end{document}
. This is simply another parameterization of the SPA model, with the difference that PCA's orthogonality constraint on β is omitted. Optimizing the discretized PCA therefore yields a solution to SPA that maintains this additional constraint. ■
We note that the converse does not hold, that is, a solution to SPA is not necessarily a solution to the discretized PCA, again due to the orthogonality constraint.
SPA assumes Hardy–Weinberg equilibrium (HWE), as reflected in the assumption of independence between the two alleles in a genotype. If this assumption holds in practice, we get equivalence also in the diploid case. To see this, denote by h1 and h2 the two SNP alleles in arbitrary order, so that h1+h2=g, and let
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\bar{h}_j$$
\end{document}
be the average haplotype value for SNP j,
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\bar { h } _j = \frac { 1 } { 2n } \sum\nolimits_ { i = 1 } ^nG_ { ji } $$
\end{document}
. The diploid PCA density can be written as follows:
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*} f_ { j , x } ( g ) & \propto e^ {
- \frac { ( ( \beta_j ) ^T x - g ) ^2 } { 2 \sigma^2 } } \\ & =
e^ { - \frac { \big (\big ( \frac { 1 } { 2 } \, ( \beta_j ) ^T x
- h^1 \big) + \big( \frac { 1 } { 2 } \, ( \beta_j ) ^T x - h^2
\big)\big)^2 } { 2 \sigma^2 } } \\ & = e^ { - \frac { \big( \frac
{ 1 } { 2 } \, ( \beta_j ) ^T x - h^1 \big)^2 } { 2 \sigma^2 } }
\cdot e^ { - \frac { \big( \frac { 1 } { 2 } \, ( \beta_j ) ^T x
- h^2 \big ) ^2 } { 2 \sigma^2 } } \cdot e^ { - \frac { 2 {\big(
\frac { 1 } { 2 } \, ( \beta_j ) ^T x - h^1 \big )} {\big( \frac
{ 1 } { 2 } \, ( \beta_j ) ^T x - h^2 \big ) }} { 2 \sigma^2 } }
\tag { 8 } \end{align*}
\end{document}
The first two factors in the last expression are identical to the haploid density expressions. As for the third factor, when multiplied over all individuals in the sample, the numerator in the exponent becomes proportional to
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}\mathop \sum \limits_ { i = 1 } ^n \left( \frac { 1
} { 2 } { \beta_j } ^T X_i - { H_ { ji } } ^1 \right)\left( \frac
{ 1 } { 2 } { \beta_j } ^T X_i - { H_ { ji } } ^2 \right) \tag {
9 } \end{align*}
\end{document}
where
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$( {H_{ji}}^1 , {H_{ji}}^2 )$$
\end{document}
are the two alleles of SNP j in individual i. Recall that PCA's model assumes that
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${\mathbb E} [ G_{ji} ] = { \beta_j}^TX_i$$
\end{document}
, and therefore
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*} { \mathbb E } [ { H_ { ji } } ^1 ] = { \mathbb E }
[ { H_ { ji } } ^2 ] = \frac { 1 } { 2 } { \beta_j } ^TX_i \tag {
10 } \end{align*}
\end{document}
We therefore see that the expression in Equation (9) is the sample covariance between the alleles of the first haplotype and the alleles of the second haplotype. When the Hardy–Weinberg equilibrium holds in the sample, this covariance equals zero, and the exponent of the third term of Equation (8) vanishes, leaving us with the product of haplotypic likelihoods.
To summarize, we have shown that SPA's model is equivalent to a discretized version of PCA with the orthonormality constraint on the axes of variation omitted and when HWE is in force. We note that without constraining the axes to be orthonormal, SPA's model is nonidentifiable, and the resulting solution may be very different from PCA's. In addition, unlike PCA, there is no closed form solution to the maximum likelihood parameters of SPA, and the model's log likelihood is not concave in a, b, X. Despite these two points, SPA's solution as retrieved by the iterative procedure suggested in Yang et al. (2012) is highly similar to PCA's.
Finally, we note that other adaptations of PCA to the modeling of genotype data have been suggested. They differ in the distribution from which the genotypes are assumed to be drawn: Either encoding them as binary variable through collapsing together both genotypes that include the minor allele (Lu et al., 2012a, b), or treating them as a categorial variable sampled from a multinomial distribution over the three different genotype values (Lu et al., 2014). None of these previous methods assume HWE, as done by SPA, and all require orthogonality, and sometimes additional requirements such as sparsity of the PC loadings.
5. LOCO-LD—Introducing Correlations Between Nearby SNPs
LOCO-LD was originally suggested as a supervised, windows-based approach for spatial localization (Baran et al., 2013). The motivation for its model was that existing methods (specifically, PCA and SPA) assume independence between genotypes of different SNPs given the location of origin. In real genetic data this assumption is violated for nearby SNPs due to linkage disequilibrium, and LOCO-LD was designed to accounts for these correlations. Its model splits the genome into nonoverlapping windows of length L, and assumes that the genotypes in each window were drawn from a multivariate normal distribution. Particularly, in window j the SNP-centered genotype vector of an individual located in coordinate x was drawn from
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${ \cal N} ( \beta_j x + \alpha_j , \Sigma_j )$$
\end{document}
, where β
j
is an L×d matrix, α
j
is an L×1 vector, and Σ
j
is an L×L positive-definite matrix. Denote by G0ji the L×1 centered genotype vector of individual i confined to window j, and by m the number of windows along the genome. The resulting log likelihood of the genotype of individual i as a function of the window-specific parameters β, α, Σ, and the individual's spatial position x is
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}
\begin{split} & \log L ( G_ { 0i } ; \beta , \alpha , \Sigma , x
) = \\ & - \frac { Lm } { 2 } \log ( 2 \pi ) - \frac { 1 } { 2 }
\mathop \sum \limits_ { j = 1 } ^m \log ( \mid \Sigma_j \mid ) -
\frac { 1 } { 2 } \mathop \sum \limits_ { j = 1 } ^m ( \beta_j x
+ \alpha_j - G_ { 0ji } ) ^T { \Sigma_j } ^ { - 1 } ( \beta_j x +
\alpha_j - G_ { 0ji } ) \end{split} \tag { 11 } \end{align*}
\end{document}
The supervised method presented in Baran et al. (2013) infers maximum likelihood estimators for the model parameters β, α, Σ using a training set of individuals whose locations of origin X are known, and uses these estimators to infer the locations of new individuals. Here we suggest a modified procedure that we term LOCO-LD-US (unsupervised), which does not require location data, and instead assumes that Σ is given. Specifically, Σ
j
is set to capture the linkage disequilibrium structure in window j, and can be computed as the covariance matrix of the SNPs in the window using any panel of genotypes, preferably from the relevant geographic region. Given a sample of individuals to be localized, LOCO-LD-US therefore searches for assignment of the model parameters β, α and the spatial coordinates X so as to minimize the following function:
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}f^L_{LOCO} ( X , \beta , \alpha ) = \mathop \sum
\limits_{i = 1}^n \mathop \sum \limits_{j = 1}^m ( \beta_j X_i +
\alpha_j - G_{0ji} ) ^T{ \Sigma_j}^{ - 1} ( \beta_j X_i + \alpha_j
- G_{0ji} ) \tag{12}\end{align*}
\end{document}
5.1. Single-SNP windows
We begin by examining the case where L=1. We state the following:
Lemma 6.1 Let X be the PCA solution on the centered matrix G0. For any constant Σ=c>0 there exist β, α, such that β, α, X are the arguments minimizing
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$f^{1 , c}_{LOCO} ( X , \beta , \alpha )$$
\end{document}
.
Proof In this case the function to be minimized is ■
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}
\begin{split} f^ { 1 , c } _ { LOCO } ( X , \beta , \alpha ) & =
\frac { 1 } { c } \sum_i \sum_j ( \beta_j X_i + \alpha_j - G_ {
0ji } ) ^T ( \beta_j X_i + \alpha_j - G_ { 0ji } ) \\ & \propto
\sum_j \sum_i ( \beta_j X_i + \alpha_j - G_ { 0ji } ) ^2
\end{split} \tag { 13 } \end{align*}
\end{document}
We note that for every SNP j, this model can be thought of as a separate multiple regression problem with d predictors and an intercept term, in which both the predictors and the parameters are unknown. We also note that α
j
can in fact be omitted from the model. This is because for fixed X and β
j
, α
j
can be derived as
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\alpha_j = \bar{g}_{0j} - \beta_j \bar{x}$$
\end{document}
, where
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\bar{g}_{0j}$$
\end{document}
is the average value of row j in G0, and
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\bar{x}$$
\end{document}
is the average spatial coordinates vector. Since G0 is SNP-centered we have
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\bar{g}_{0j} = 0$$
\end{document}
, and we can set
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\bar{x} = 0$$
\end{document}
without loss of generality so that no intercept term is needed. We therefore obtain the simplified formulation
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}
\begin{split}f^{1 , c}_{LOCO} ( X , \beta ) & = \sum_i \sum_j (
\beta_j X_i - G_{0ji} ) ^2 \\ & = \mathop \sum \limits_{i = 1}^n (
\beta X_i - G_{0i} ) ^T ( \beta X_i - G_{0i} ) \end{split}
\tag{14}\end{align*}
\end{document}
where β is the m×d matrix that consists of all the window-specific β
j
's stacked one on top of the other.
β and X are obviously nonidentifiable in this model. For a fixed β, however, Xi is obtained directly as
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}X_i = ( \beta^T \beta ) ^{ - 1} \beta^T G_{0i} \tag{15}\end{align*}
\end{document}
We can now plug this Xi into the optimization and search for the best possible β. Denote W=β(β
T
β)–1β
T
, and note that W is an m×m matrix of rank d, and that WW=W, therefore W is a projection matrix. It follows that W has d eigenvalues that equal 1 and the rest equal 0, and so W can be written as W=UUT, where the d columns of U consist an orthonormal basis of eigenvectors of W. We obtain the following minimization problem:
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}
f^{1 , c}_{LOCO} ( X , \beta ) & = \sum_i ( WG_{0i}
- G_{0i} ) ^T ( WG_{0i} - G_{0i} ) \\ & = \sum_i \left( {G_{0i}}^T
W^TWG_{0i} - 2{G_{0i}}^TWG_{0i} + {G_{0i}}^TG_{0i} \right) \\ & =
\sum_i{G_{0i}}^TG_{0i} - \sum_i {G_{0i}}^TWG_{0i} \\ & = \sum_i
{G_{0i}}^TG_{0i} - \sum_i {G_{0i}}^TUU^TG_{0i} \\ & = - \sum_i
{G_{0i}}^TUU^TG_{0i} + C
\tag{16}\end{align*}
\end{document}
where U is an m×d orthogonal matrix. The last expression is PCA's criterion, and therefore the maximizing U contains the first d eigenvectors of
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$G_{0i}{G_{0i}}^T$$
\end{document}
. By setting β=U, LOCO-LD-US's solution becomes Xi=(β
T
β)–1βT G0i=UT G0i, the same as PCA's solution. PCA's solution is therefore also a solution to LOCO-LD-US when L=1 and Σ is constant. ■
We note that the converse of Lemma 6.1 is not true, that is, not every LOCO-LD-US solution for L=1 and fixed Σ is a PCA solution, and this is because LOCO-LD-US does not include an orthogonality constraint on β.
5.2. Multi-SNP windows
The motivation behind LOCO-LD was to capture correlations between nearby SNPs, and therefore in practice the model would be used with L>1. Particularly, we would want to fix Σ
j
as the window correlation matrix in the study sample:
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}\Sigma_j = \frac { 1 } { n } \mathop \sum
\limits_ { i = 1 } ^n G_ { 0ji } { G_ { 0ji } } ^T \tag { 17 }
\end{align*}
\end{document}
where, as before, G0ji is the L×1 centered genotype vector of individual i confined to window j. When L>1 PCA's solution is no longer a LOCO-LD-US solution. However, as we show below, for a fixed Σ there is still a tight relation between the two procedures. Specifically, a solution to LOCO-LD-US can be obtained by applying PCA to the genotype data after a proper transformation. We prove the following lemma:
Lemma 6.2 Consider the m×m block matrix Λ, which is composed of blocks along the diagonal, where the jth block is
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${ \Sigma_j}^{ - 1}$$
\end{document}
, and denote its decomposition Λ=ST S. Let X be the PCA solution of SG0. Then there exists β and α, such that X, β, α minimize
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$f^{L , \Sigma}_{LOCO} ( X , \beta , \alpha )$$
\end{document}
.
Proof Σ
j
is positive definite, and therefore
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${\Sigma_j}^{ - 1}$$
\end{document}
is also positive definite, and can be written as
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${\Sigma_j}^{ - 1} = {S_j}^T S_j$$
\end{document}
for an L×L full-rank matrix Sj. Let bj=Sj βj, aj=Sjαj. The function to be minimized by LOCO-LD-US can be rewritten as follows:
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}
f^L_{LOCO} ( X , \beta , \alpha , \Sigma ) & =
\sum_i \sum_j ( \beta_j X_i + \alpha_j - G_{0ji} ) ^T{ \Sigma_j}^{
- 1} ( \beta_j X_i + \alpha_j - G_{0ji} ) \\ & = \sum_i \sum_j (
\beta_j X_i + \alpha_j - G_{0ji} ) ^T {S_j}^TS_j ( \beta_j X_i +
\alpha_j - G_{0ji} ) \\ & = \sum_i \sum_j ( b_j X_i + a_j - S_j
G_{0ji} ) ^T ( b_j X_i + a_j - S_j G_{0ji} )
\tag{18}\end{align*}
\end{document}
This is the same optimization problem as in Equation (13), but with the transformed genotype matrix SG0 replacing G0. It follows that solving for PCA on SG0 yields a solution for LOCO-LD-US. ■
We now give some intuition into the last lemma. Let
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}\Sigma_j = \frac { 1 } { n } G_ { 0ji } { G_ {
0ji } } ^T = U_j { D_j } ^2 { U_j } ^T \tag { 19 } \end{align*}
\end{document}
be the SVD decomposition of the sample covariance matrix in window j. The matrix Sj that is used for transforming the data in window j equals
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${D_j}^{ - 1}{U_j}^T$$
\end{document}
, where Uj contains the PCs of G0 restricted to window j, and Dj's diagonal contains the standard deviations of the PCA scores along these axes over the individuals. Transforming the data through multiplying by S therefore removes local correlations from the data by performing per-window PCA, projecting the genotypes onto the “local” PCs, and rescaling the resulting scores so that each PC has the same variance. The rescaled scores then serve as a new set of variables, free of within-window correlations, on which regular PCA is carried out. Similarly to regular PCA, we can use as many PCs as required, despite the fact that the original motivation for this approach was localization in a two-dimensional geographic space.
We note that the window-based approach has the disadvantage of assuming independence between nearby SNPs that happen to reside in different windows. For this reason, instead of de-correlating within windows, in the experiments presented next in this work we used tapering (Kaufman et al., 2008) of the per-chromosome covariance matrices to construct the local correlation matrices. The goal of covariance tapering is to set to zero certain elements of Σ, such that the resulting matrix is positive definite and retains the original properties of Σ for proximate SNPs. The correlations between SNPs that are farther apart will be attenuated, reflecting the prior belief that linkage disequilibrium decays with distance. We performed tapering by multiplying the sample covariance element-wise by a matrix R defined as
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}R_ { ij } = e^ { - \frac { \mid i - j \mid } { \lambda } } \tag { 20 } \end{align*}
\end{document}
where λ is a parameter of the tapering; the higher we set λ, the faster we assume that the correlations decay with physical distance along the chromosome. The optimal value for λ is likely to differ between datasets, being higher in datasets with low SNP density or low number of samples. In this modified procedure we compute the per-chromosome sample covariance matrices, taper them, compute their SVDs, then rescale the data using the resulting PCs and carry out PCA as before. A further improvement could potentially be made be replacing |i−j| with the absolute value of the physical distance between the SNPs on the chromosome so as to account for varying SNP density along the genome.
Finally, we note that a similar idea for structured de-correlation was introduced in the context of differential expression pattern analysis (Nam, 2010). While most tests for group-wise pattern analysis assume independence between the expression of genes within the tested groups, biologically defined gene sets tend to have some correlation structure in their expression profile. This false assumption increases the false positive rate in the analysis. Within-group expression decorrelation through rescaling along the PCs was suggested for solving this problem.
5.3. Regularizing variance contribution and relation to heritability estimation
To demonstrate another interesting property of LOCO-LD-US, as well as its potential contribution to other tasks in computational genetics, we make the following definition:
Definition 6.1 The contribution of SNP j to the variance of a group of variables
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$S = \{ i_1 \ldots i_K \} $$
\end{document}
measured over the same set of observations is the sum of squared correlation coefficients between j and the variables in the group:
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\sum \nolimits_{k = 1}^K r^2_{j , i_k}$$
\end{document}
.
We claim the following:
Lemma 6.3 The contribution of every SNP to the variance of the transformed data in the window in which it resides equals 1.
Proof Let
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${G_0}^j$$
\end{document}
be the L×n centered genotype matrix confined to window j, and as in Equation (19) let
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}\Sigma_j = \frac { 1 } { n } \mathop \sum
\limits_ { i = 1 } ^n G_ { 0ji } { G_ { 0ji } } ^T = \frac { 1 }
{ n } { { G_0 } ^j } { { G_0 } ^j } ^T = U_j { D_j } ^2 { U_j } ^T
\tag { 21 } \end{align*}
\end{document}
be the SVD decomposition of the covariance matrix of the SNPs in the window. It follows that the covariance matrix between the original SNP set
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${G_0}^j$$
\end{document}
and the transformed variables
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${{D_j}^{ - 1}{U_j}^T{G_0}^j}$$
\end{document}
is
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*} \frac { 1 } { n } { G_0 } ^j { \left( { { D_j } ^
{ - 1 } { U_j } ^T { G_0 } ^j } \right) } ^T = \frac { 1 } { n }
{ G_0 } ^j { { G_0 } ^j } ^TU_j { D_j } ^ { - 1 } = \frac { 1 } {
n } \cdot n \cdot U_j { D_j } ^2 { U_j } ^T \cdot { U_j } { D_j }
^ { - 1 } = U_jD_j \tag { 22 } \end{align*}
\end{document}
The sum of squares of row l in this covariance matrix equals the lth element on the diagonal of
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$U_jD_j{D_j}^T{U_j}^T = \Sigma_j$$
\end{document}
, which is the sample variance of SNP l, denoted
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${ \sigma_l}^2$$
\end{document}
. Since
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$r^2_ { i , j } = \frac { { cov ( i , j ) } ^2 } { { \sigma_i } ^2 { \sigma_j } ^2 } $$
\end{document}
, and since the variance of each transformed variable equals 1, the sum of squared covariances in row l needs to be scaled by
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$ \frac { 1 } { { \sigma_l } ^2 } $$
\end{document}
to obtain the sum of squared correlation coefficients, and the result is 1. ■
It follows that another way of interpreting LOCO-LD-US's per-window transformation is the removal of redundancies from the data by generating a new variable set, such that each original SNP contributes equally to the new set's variance.
Genotype decorrelation is relevant to multiple tasks in computational genetics, since many commonly used methods assume SNP independence. One such task is estimation of narrow sense heritability, that is, the fraction of variance of a given phenotype that is attributable to additive genetic variance (Yang et al., 2010). The commonly taken approach for heritability estimation is based on inferring the variance components in a random effect model. The genetic variance–covariance matrix (also termed the kinship matrix) that depicts the genetic similarity between the study individuals is generated from the SNP data, and when the SNPs are correlated this may yield noisy as well as biased kinship estimates. The effect of these correlations on heritability estimates can be substantial when the causal SNPs are strongly correlated with multiple nearby SNPs. An approach for decorrelating genotype data was therefore introduced in the context of heritability estimation as a method named LDAK (Speed et al., 2012). Improving upon an existing weighting approach (Zou et al., 2010), LDAK weights SNPs so as to equate the contribution of each of them to the local variance of the data. More formally, denote by dj,i the physical distance in base pairs between SNPs j and i. LDAK searches for SNP-specific weights
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$( w_1 \ldots w_m )$$
\end{document}
such that the following will hold for every SNP
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$j \in \{ 1 \ldots m \} $$
\end{document}
:
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}\mathop \sum \limits_{i = 1}^m w_{i} \cdot {r_{j ,
i}}^2 \cdot e^{ - \lambda d_{j , i}} = c \tag{23}\end{align*}
\end{document}
where λ is a parameter of the method, and c is an arbitrary constant. The weighted genotypes are then used to construct the correlation-corrected kinship matrix.
LOCO-LD-US's and LDAK's criteria are clearly similar, but exhibit a few differences. First, LDAK restricts itself to rescaling the SNPs, while LOCO-LD-US generates a new variable set that is a linear combination of the original SNPs. Second, the methods account for the physical distance between SNPs differently. Finally, LOCO-LD-US and LDAK differ in the optimization, as LOCO-LD-US's model has a closed form solution, while LDAK approximates the optimal weights using linear programming.
6. An Empirical Comparison
We carried out an empirical comparison between PCA, SPA, and LOCO-LD-US in the task of spatial inference by executing them on European samples from the POPRES (Nelson et al., 2008) dataset. Preprocessing included removing individuals and SNPs with missingness rate higher than 0.05 and imputation using BEAGLE (Browning and Browning, 2007). For PCA and SPA we filtered the SNPs as previously described (Novembre et al., 2008; Baran et al., 2013) to obtain an LD-pruned SNP set, as this was found to improve their performance due to the assumption of independence between SNPs (Baran et al., 2013). The resulting genotype matrix included 1380 individuals and 167,572 (PCA and SPA) or 323,357 (LOCO-LD-US) SNPs. For LOCO-LD-US we performed tapering within windows of length 1000 SNPs using λ=0.05.
We compared two different statistics. First, we computed, for every pair of populations, the quotient of the between-population variance of the two estimated spatial coordinates and the total variance of these coordinates. This statistic, known as the fraction of variance explained by the population clusters, is a common measure of clustering quality:
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*} \frac { N \cdot \parallel \mu_1 - \mu \parallel^2
+ N \cdot \parallel \mu_2 - \mu \parallel^2 } { \sum_ { k = 1 , 2
} \sum\nolimits_ { i = 1 } ^ { N } \parallel x_ { i } ^ { k } -
\mu \parallel^2 } \end{align*}
\end{document}
where N is the number of individuals randomly drawn from each population,
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$x^{k}_{i}$$
\end{document}
is the estimated coordinate vector for the ith individual of population k∈{1,2}, and
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*} \mu_1 & = \frac { 1 } { N } \mathop \sum \limits_
{ i = 1 } ^ { N } x_ { i } ^ { 1 } \quad \mu_2 = \frac { 1 } { N
} \mathop \sum \limits_ { i = 1 } ^ { N } x_ { i } ^ { 2 } \\ \mu
& = \frac { 1 } { 2N } \sum_ { k = 1 , 2 } \mathop \sum \limits_
{ i = 1 } ^ { N } x_ { i } ^ { k } \end{align*}
\end{document}
The results are given in Table 1. LOCO-LD-US can be seen to attain the highest separation quality for most population pairs, often considerably higher than the two other methods. Between PCA and SPA there is no clear winner.
For each pair of populations, the table gives the fraction of variance of spatial coordinates explained by the two population clusters. Included are all countries represented by at least 60 individual, excluding Switzerland which is a mixture of different population groups. For every pair and method we provide the median value over 106 choices of 10 uniformly sampled individuals from both countries. Best results per pair are in bold. DE, Germany; FR, France; IE, Ireland; IT, Italy; PCA, principal component analysis; PT, Portugal; SP, Spain; UK, United Kingdom.
Next, we estimated the fraction of variance in geographic coordinates (longitude and latitude) explained by the estimated spatial coordinates of each of the methods. This statistic is useful since longitude and latitude often serve as confounders in GWAS. In some cases, widespread selection over related phenotypes may cause both disease susceptibility and the geographic coordinates to be correlated with the SNP's frequency (Casto and Feldman, 2011). In other cases, phenotypes that are correlated with geography [for example, melanoma, which is correlated with latitude (Chang et al., 2009)] are correlated with SNPs whose frequency happen to change along the same spatial directions. When conducting a GWAS, researchers often use PCA scores as covariates in the regression model (Price et al., 2010) in order to correct for different confounding effects. Correction is more accurate and statistical power is increased when the confounder, in our case longitude and latitude, are better explained by these scores.
Table 2 presents the results. Here PCA and LOCO-LD-US have the advantage of being able to provide more spatial coordinates by using more PCs (we used 10), while SPA provides only two. Unsurprisingly then, SPA's results are inferior, and LOCO-LD-US does slightly better than PCA.
For PCA and LOCO-LD-US, the 10 leading PCs were used in a linear regression model. For SPA, the two spatial axes were used. The best results for latitude/longitude are in bold.
7. Further Variations
In this section we review a few additional recent approaches for localization, each bringing up another aspect of the problem.
7.1. Nonparametric spatial variation
One strong assumption that is common to the models of PCA, SPA, and LOCO-LD is that the frequency of each SNP changes linearly along a single direction in the geographic space. In reality, the frequencies of most SNPs exhibit more complex spatial variation patterns. A recent work (Rañola et al., 2014) allows for such flexible patterns by using a Markov random field model commonly utilized for image processing. This approach divides the geographic region of interest into demes (that is, pixels on a rectangular grid), and assumes that the genotypes of all individuals in a deme are sampled from a binomial distribution with a deme-specific frequency. The method infers the frequency of every SNP in every deme from training data, while incorporating a smoothing penalty, to account for the fact that neighboring demes are expected to have similar frequencies. This model can therefore be thought of as a version of SPA employing a more general group of frequency functions over a discretized geographic space. Optimization is performed using the Minorize–Maximize (MM) algorithm, and the authors claim to attain a high accuracy in localizing European samples.
7.2. Spatial autocorrelations
Another assumption common to all models surveyed so far is that the genotypes of different individuals in a given SNP are independent given their spatial locations. In reality this assumption is violated since spatial models tend to capture global trends, while failing to capture local variation patterns resulting from the fact that genetic dispersal occurs over short distances. These local patterns, termed spatial autocorrelations, have been shown (Novembre and Stephens, 2008) to distort and bias the interpretation of population genetic structure as inferred from principal component analysis (PCA). For example, in scenarios where genetic similarity decays exponentially with geographic distance, PCA plots are expected to exhibit horseshoe effects, an artifact in which the second axis is curved relative to the first axis (Legendre and Gallagher, 2001). A method called spFA (Frichot et al., 2012) attempts to account for these autocorrelations by selectively removing local-scale correlations from the data. The idea behind the method is analogous to LOCO-LD-US's approach, but instead of decorrelating physically proximal SNPs, it decorrelates geographically proximal individuals. The local correlations matrix is therefore an n×n matrix generated from the pairwise geographic distances between the study individuals, whose origins are assumed to be known. We note that both decorrelations, SNP-wise (to account for linkage disequilibrium) and individual-wise (to account for spatial autocorrelations), may be performed simultaneously.
7.3. Admixed individuals
While the above models assume that all individuals can be assigned to a single geographic location, in practice many people contain DNA characteristics of multiple distinct locations. Such geographic admixture may be recent, for example, when the father's side of the family and the mother's side of the family originate from different regions, or ancient, for example, as in admixed populations such as African American. When PCA is naively executed on such samples, the resulting spatial coordinates are an affine combination of the coordinates of the mixing locations (McVean, 2009; Ma and Amos, 2012). However, the recent explicit spatial models can be easily generalized to account for multiple origins. Two of the approaches described above were extended to model recently admixed individuals, simply by replacing the per-SNP distributions with mixtures of distributions (Yang et al., 2012; Rañola et al., 2014). Moreover, recent mixing yields strong correlations between the geographic origins of nearby SNPs along the chromosomes, since only a few recombinations have occurred since the mixing event. Modeling such correlations, for example, using a hidden Markov model (HMM), should allow for a higher localization accuracy, and was indeed carried out in two other recent works (Yang et al., 2014; Margalit et al., 2015).
7.4. Descriptive models
Finally, we mention two recent localization methods (Hoggart et al., 2012; Elhaik et al., 2014) that avoid laying out an explicit spatio-genetic model. One uses the output of PCA and the other the output of the software ADMIXTURE (Alexander et al., 2009) to train complex descriptive models that lack a clear spatial interpretation but are claimed to attain highly accurate localization. We do not review these approaches here in depth since our focus is on spatio-genetic modeling. It is worth noting that such methods are harder to extend to admixed individuals. In addition, their utility is currently limited to inferring spatial coordinates, since their parameters do not seem to have a genetic interpretation.
8. CONCLUSION
We surveyed recent computational models for human spatio-genetic variation. Inference in spatial models can teach us about the underlying genetic and demographic processes that shaped human spatial variation, in addition to its multiple applications in studies of selection and disease. Due to these applications we believe that spatial models will play an important role in the field of human computational genetics in the future.
While spatial separation of genetic samples has been traditionally performed heuristically using dimensionality reduction techniques, we demonstrated that it is important to understand the assumptions made by such procedures. This understanding fosters their adaptation to genetic data within explicit models, thereby yielding improved spatial inference. As has already been shown, such models can be naturally extended to multiple-origin individuals. Many further enhancements, including incorporating prior information and population genetic theory into the model parametrization and estimation, are yet to be introduced in future work.