Abstract
Abstract
The genomes of remotely related individuals occasionally contain long segments that are identical by descent (IBD). Sharing of IBD segments has many applications in population and medical genetics, and it is thus desirable to study their properties in simulations. However, no current method provides a direct, efficient means to extract IBD segments from simulated genealogies. Here, we introduce computationally efficient approaches to extract ground-truth IBD segments from a sequence of genealogies, or equivalently, an ancestral recombination graph. Specifically, we use a two-step scheme, where we first identify putative shared segments by comparing the common ancestors of all pairs of individuals at some distance apart. This reduces the search space considerably, and we then proceed by determining the true IBD status of the candidate segments. Under some assumptions and when allowing a limited resolution of segment lengths, our run-time complexity is reduced from O(n3 log n) for the naïve algorithm to O(n log n), where n is the number of individuals in the sample.
1. Introduction
S
Multiple inference paradigms, such as approximate Bayesian computation (Beaumont et al., 2002) or importance sampling (Fearnhead and Donnelly, 2001), are based on sampling from a probability space defined by the hypothesized model for the data. In the context of demographic inference, these methods require simulating IBD segments often based on complicated models for histories of populations. Naively, this would be carried out by simulating genetic data using genome simulators (e.g., Excoffier and Foll, 2011; Excoffier et al., 2013; Schaffner et al., 2005; Liang et al., 2007, followed by computational recovery of IBD segments (e.g., Gusev et al., 2009; Browning and Browning 2011). However, this process is both computationally intensive, therefore limiting sample sizes, as well as error prone, contrasting with its role of producing ground-truth simulated data.
In addition to the computational burden, inference of IBD segments from simulated sequences is also redundant, in the sense that information about IBD segments is intrinsic to the simulated genealogies, without the need to explicitly generate the entire sequences. Specifically, genetic data simulated according to the coalescent with recombination is represented as an ancestral recombination graph (ARG), a combinatorial structure that has the genealogies of the entire sample at all positions along the chromosome (see Hudson, 1990; Griffiths and Marjoram, 1996, 1997; and below for definitions). Equivalently, the ARG can be represented as a sequence of genealogical (or “coalescent”) trees, where each new tree is formed due to a recombination event in the history of the sample (Wiuf and Hein, 1999). Our goal in this work is to efficiently scan such a series of trees for IBD segments, that is, find all contiguous segments where pairs of individuals share the same common ancestor. Previous studies have either employed a naïve algorithm (see below; Browning and Browning, 2013; Tataru et al., 2014; Chiang et al., 2014) or avoided coalescent simulations by using permutations of real genotypes (Gusev et al., 2009; Browning and Browning, 2011; Su et al., 2012).
The article is organized as follows. In section 2, we introduce notation and models. In section 3, we describe a series of computational approaches for extracting ground-truth IBD segments from ARGs. In section 3.1, a naïve algorithm is presented. In section 3.2, we analyze the complexity of the algorithm when segment lengths are discretized. In section 3.3, we describe a two-step approach for segment discovery, which is based on decoupling the problem into first identifying a small set of “candidate” pairs and segments, some of which are false but which include all true segments. Then, taking advantage of the constant time LCA query algorithm, we rapidly eliminate all false positives. In section 3.4, we present a fast algorithm to compare the common ancestors of all pairs of leaves between two trees, which, when segment lengths are discretized and combined with the two-step approach, achieves the best asymptotic run-time complexity. In section 4, we present performance benchmarks demonstrating utility for practical applications. In section 5, we demonstrate the application of our tool in demographic inference, and in section 6, we discuss limitations and future plans. The implementation of our algorithms is freely available (Yang, 2014).
2. Preliminaries
We are given a sample of n individuals, each of which is represented by a single continuous chromosome of length L Morgans (M). The ancestry of the sample is denoted by a series of trees, {Ti}, i = 0,…,nT, each defined along a genomic interval, [xi, xi+1) (where x0 = 0, xn T = x n T +1 = L and the last tree is degenerate). The tree at each genomic position corresponds to the genealogy of the individuals at that position (Fig. 1). Genealogies are formed according to the coalescent (Wakeley, 2005): Each pair of lineages merges, going backward in time, at rate 1/N, where N is the effective population size. Intervals are broken, and hence, new trees are formed, due to recombination in one of the lineages in the genealogy. Specifically, the effect of recombination is to create a breakpoint in the tree, leading to rewiring of the edges of the tree (Hudson, 1983; Wiuf and Hein, 1999). While rewiring can happen only in a limited number of ways, we do not make any assumption on the nature of the differences between successive trees. Internal nodes in the tree are formed by lineages merging into their common ancestors (going backward in time) and are labeled by the time in the past when those ancestors existed. Time is assumed to be continuous, and therefore, two internal nodes in different trees but with the same label correspond to the same ancestral individual. The collection of all trees, labels of internal nodes, and intervals is called an ancestral recombination graph (ARG) and is the input to our method.

Extracting IBD segments from a sequence of coalescent trees. A series of trees for a sample of n = 3 is shown. The collection of all trees and their intervals forms an ancestral recombination graph (ARG). The time to the most recent common ancestor (tMRCA) of individuals 1 and 2 is indicated as a horizontal line for each tree. Below the trees, bars of different colors indicate the boundaries of the shared segments for this ARG and individuals 1 and 2, that is, maximal contiguous segments where the MRCA of 1 and 2 does not change. Imposing a minimal segment length m, only one segment exceeds the length cutoff (black). Other segments (gray) will not be reported.
Define a pair of individuals as IBD along a genomic interval if the interval is longer than a threshold m and the time to the most recent common ancestor (MRCA; equivalently, the lowest common ancestor [LCA]) of the individuals is the same along all the trees contained in the interval. Our desired output is the set of maximal IBD segments between all pairs of individuals, in the sense that no reported segment can be extended in either direction and remain IBD (Fig. 1). Typical values of m and L are 1 centiMorgans (cM; roughly a million base pairs) and 100cM (1M), respectively, and we treat them as constants throughout the paper. The mean number of trees is known to satisfy nT = O(NL log n) (Griffiths and Marjoram, 1997). Running times are thus reported as functions of n and N, and occasionally, to project the result into a single dimension, we assume that N = O(n). This is realistic for human populations, where the effective population size is 10,000–20,000 (Conrad et al., 2006; Henn et al., 2012a), and current sample sizes are in the thousands (e.g., Zhang et al., 2015; Guha et al. 2012; The Wellcome Trust Case Control Consortium, 2007).
3. Methods
3.1. The naïve algorithm
The naïve algorithm works by keeping track of the time to the MRCA (tMRCA) between all pairs of chromosomes (i.e., leaves in the tree), and determining, for each tree, which tMRCAs have changed compared to the previous tree. To extract the tMRCAs of all pairs in a given tree, we used an in-order traversal algorithm (Yang, 2014). Whenever we detect a change of the tMRCA for a given pair of leaves, we report the segment as IBD if the previous tMRCA had persisted over length > m (Algorithm 1). Each comparison between trees involves all pairs of leaves and thus runs in time O (n2). As there are O (NL log n) trees, the total time complexity for the naïve algorithm is O (n2NL log n), or O (n3 log n) when treating L as a constant and assuming N = O(n).
A naïve algorithm for reporting IBD
3.2. Discretization of the genome
When the number of trees per unit genetic distance is large (e.g., whenever N or n are large), examining all trees has limited merit. We thus follow Browning and Browning (2013) and consider only trees at fixed tickmarks along the genome, every d = εm cM. This allows even the shortest segment length to be estimated with a relative error up to 1 ± ε, while reducing the running time to

The effect of segment length discretization on the accuracy of genome-wide IBD statistics. (Left) The distribution of the total number of IBD segments shared between each pair of individuals. (Right) the density of the total fraction of the chromosome found in IBD segments. (Top) n = 1000, m = 0.00534; (bottom) n = 5000, m = 0.00245. In all panels, N = 2n, L = 1, and ε = 0.01.
3.3. A two-step approach
For typical values of m and N and in a typical genomic locus, most pairs of individuals do not maintain the same MRCA over lengths longer than m. Therefore, an appealing approach would be to rapidly eliminate, at each genomic position, all pairs of individuals that do not share an IBD segment, and then consider for validating only the remaining pairs. Specifically, when we compare the MRCAs of pairs of individuals at genomic tickmarks spaced s = m/2 apart, we observe that true IBD segments must span at least two consecutive tickmarks (Fig. 3; note that no discretization of segment lengths is assumed). For each pair of individuals that satisfies this condition, we first verify that the MRCA is unchanged in all trees between the tickmarks, and then extend the segment in both directions and determine whether the final segment length is longer than m. For the validation step, we use Bender's LCA algorithm (Bender and Farach-Colton, 2000; Berkman et al., 1989), which requires linear time to preprocess each tree, but then just a constant time for each LCA (i.e., tMRCA) query (Algorithm 2).

A two-step approach for IBD segment discovery. For a given pair of individuals, a partition of the chromosome into shared segments is shown at the top, where in each segment, the MRCA is unchanged. Tickmarks are shown at multiples of m/2, and segments that span at least two tickmarks are considered for further validation. Note that segments that span just a single tickmark are necessarily shorter than m. Extension of some of the candidate segments is shown below the chromosome. The partition of the chromosome at the bottom highlights IBD segments longer than m.
Two-Step Algorithm
The running time of the candidate identification step is simply
3.4. A discretized two-step approach with a novel candidate extraction algorithm
Let us now analyze the complexity of the two-step approach when segment lengths are discretized. The time spent on candidate extraction remains
– a is a node at x′, and
– l, r are leaves in respective subtrees spanned by the two children of a at x′.
The practical implication of Observation 1 is that we should look for a triple intersection between successive trees: an ancestor, a leaf in its left subtree, and a leaf in its right subtree. We implement this intersection by hashing all ancestors in the two trees and looking for shared ones, followed by determining which pairs of leaves are found, in both trees, in distinct subtrees that descend from the shared ancestor. The newly developed algorithm is illustrated in Fig. 4 (see pseudocode in Yang, 2014).

Intersecting trees using hashing. We perform an in-order traversal of each tree in linear time, while hashing all internal nodes, bidirectionally hashing all leaves, and mapping left and right intervals along the traversal order for each internal node. The hash and reverse-hash tables enable us to compute the intersection between the intervals of corresponding internal nodes of the two trees in linear time. Doing that for the left and right children of an ancestor yields the pairs of leaves for which the MRCA is unchanged between the two trees.
In-order traversal and hashing the internal nodes takes O(n) time, and finding the triple intersection can be done, for each internal node, in linear time using bidirectional hashing of the leaves. The overall time complexity is dominated by the number of potential shared ancestors (O(n)) times the number of leaves descending from each ancestor (O(n)) and is thus, at the worst case, O (n2). However, coalescent genealogies are asymptotically balanced (Li and Wiehe, 2013), and it is easy to show that for a full tree topology, the complexity is O (n log n) (see also Supplementary Fig. S2). With an O (n log n) algorithm to extract candidate pairs, the total time complexity, assuming N = O(n), becomes O (n log n), and this is our presently best theoretical result.
4. Benchmarking
We implemented the algorithms of section 3 in C/C++ and performed experiments to evaluate their practical running time. Testing was conducted on a standard workstation running Ubuntu 10.04 LTS. Figure 5 compares the wallclock running time of different algorithms. As a previous implementation of the naïve algorithm was based on the Python open source library DendroPy (Browning and Browning, 2013), results for that method are also shown. While the two-step approach is asymptotically superior, the LCA running time has a large prefactor compared to the tight implementation of the naïve algorithm, and it becomes faster only around n ≈ 7000.

Running times of the algorithms considered in this study (symbols). In all experiments, we used fastsimcoal2 (Excoffier and Foll, 2011) to generate the ARGs, N = 2n, L = 1M, and m = 0.01M. When discretizing segment lengths, we used ε = 0.01. Lines are fits to the asymptotic running times (see boxed key).
5. Applications: Evaluating Demographic Parameters
In this section we show some examples of using our method for demographic inference of recent population history. We developed a tool that uses a new statistic of total IBD to infer demography and demonstrate the process of evaluating feasibility and accuracy of such analysis.
5.1. Population models and IBD statistics used
We first aim at evaluating the effectiveness of the new IBD statistic for inference. We thus start from very simple models, the constant population size model (parameterized by N) and the bottleneck model (parameterized by NA, NC, NB, T, and TB; see Fig. 6 for detailed explanation of the models and the parameters). The latter model is expressive enough to be consistent with some populations that went through a sharp decline in population size in recent history (Palamara et al., 2012; Gusev et al., 2012; Carmi et al., 2014a; Lim et al., 2014; Gauvin et al., 2014).

Simple demographic models. The left panel depicts a naive model of constant population size, and the right one depicts a more complicated and realistic bottleneck model, which involves recent population shrinkage. The time axis goes top to bottom. The constant size population model is parameterized by N (effective population size); the bottleneck model is parameterized by NA (ancient population size), NC (current population size), NB (bottleneck population size), T (bottleneck duration), and TB (time length to bottleneck event).
Figure 7 plots the statistic we use for inference, average total IBD-sharing across the cohort, for several constant population sizes across different cohort sizes and cutoffs m for IBD length. The decay pattern of this statistic as m increases and its growth, with accumulated reference genomes are both encoded as a matrix of total IBD statistics for different values of these two coordinates, which provides some information about the model parameters (see Palamara et al., 2012; Palamara and Pe'er, 2013; Carmi et al., (2014a,b). As shown in Figure 7, the value of model parameter N gives rise to a unique value of this IBD statistic, which suggests that it is possible to estimate this parameter based on the observed IBD statistics. Also, the developments in earlier sections of this manuscript allow using this statistic for demographic inference.

Total-sharing statistics for different population size parameters under constant population size model. Each statistic is constructed by all the points in a surface, which is indexed by the number of reference genomes i and the IBD cutoff value m. For a specific point in (i, m), the value of the statistic represents the average total fraction of IBD sharing of one genome against i other genomes where IBD cutoff value is m.
The practical inference process is as follows: We simulate data for different values of the studied parameter using the fastsimcoal2 simulator (Excoffier and Foll, 2011) and extract IBD segments with our designed tool; we tally total IBD statistics (average total IBD-sharing across the cohort, discussed in the above paragraph; also see Fig. 7) from these data, and we use 1000 replicates for one statistic under each parameter combination, which is necessary and enough to get a better quantification of the statistic (a stable or converged value of the statistic); we calculate the distance (sum of the squared differences) between the matrix of statistics for the designated “true” population parameter and those of all other guessed parameters. Ideally, the distance would be minimized at the “true” parameter, supporting single-point estimation for that parameter. In practice, when no ground truth solution is available, we will use the distance of the simulated statistics from the empirical matrix for real data instead of the true matrix. Figure 8 shows such measurement under constant population size model when we treat different N's as real parameters. In the next section, we will expand this evaluation to all parameters in bottleneck model.

The distance curve under a constant population size model where the true parameter is set to different values. We can see the minimum distance is always attained at the true value (though inaccurate for large N), which means the total IBD statistic allows good estimation of this population parameter.
5.2. Results for the bottleneck model
In this section we present a single-point evaluation under the bottleneck model. This model requires five parameters to be estimated: NC, NA, NB, T, and TB. Their default values are set to NA = 10,000, NC = 1000, NB = 200, T = 10, TB = 30. We evaluated all of them, but focus the discussion on NB and T, which we find to be most instructive (see Supplementary Fig. S3, Fig. S4, and Fig. S5 for the results regarding NA, NC, and TB). We computed the joint distance surface for NB and T (see Fig. 9). We observed that individual values of NB and T are nearly indistinguishable in terms of total sharing, but there is a line across the joint plane that always achieves the minimum distance. This indicates that the ratio

The top two figures are renderings of the joint distance surface for NB and T given all other three correct parameters, NA, NC, and TB, where the true values are NBreal = 200,Treal = 10. The points that attain the minimum distance are highlighted in green (top right). A linear-fit of only these minima (bottom left) is T = 0.053 × NB − 0.651, where the slope is very close to the true ratio of T and NB 0.05. The distances attained along the linear fit (bottom right) approach zero and give a sense of the noise level around the convex pattern.
The reasoning above allows evaluating feasibility of inferring distinct model parameters from total sharing statistics. We fit all these parameters to statistics from real data by a grid search. We experimented also with some faster optimization algorithms, both derivative-based, such as L-BFGS-B algorithm (Byrd et al., 1995), as well as non-derivative-based ones, such as Nelder-Mead algorithm (Nelder and Mead, 1965). Unfortunately, we report (data not shown) that these are ineffective for this problem due to instability across any of the specific dimensions of the problem. Specifically, the statistic used in the inference is simulated. In that case, the distance curve (or surface) we get from the statistics will be fluctuant all the time (as the statistics have no perfectly stable or converged values from finite replicates of simulations), thus unable to make these classical optimizers successfully optimize along.
6. Discussion
We designed and implemented a set of efficient algorithms for extraction of ground-truth IBD segments from a sequence of coalescent trees. We anticipate our method to facilitate progress across multiple areas of IBD research. First, extraction of IBD segments from simulated ARGs will inform analysis of methods based on the increasingly popular SMC and SMC1 models (e.g., Harris and Nielsen, 2013; Li and Durbin, 2011). Second, while simulation-based approaches for demographic inference are widespread (Beaumont et al., 2002; Excoffier et al., 2013), no existing method is based on IBD sharing, which is highly informative regarding recent history. Our method will provide the community with the enabling tool for such analysis. Finally, an efficient means to generate ground-truth-simulated IBD data will enable researchers to generate a background null distribution of key IBD statistics, against which alternative hypothesis can be tested, such as positive natural selection (Albrechtsen et al., 2010) or genetic association (Gusev et al., 2011).
The coalescent trees (or ARGs) that are our input are often simulated, raising the possibility of replacing the benefits of our work by simulators that directly report IBD segments. However, even if simulators evolve to directly output IBD segments, the most straightforward path will be to include methods such as ours as a feature. Additionally, ARGs not only are, but will likely continue to be, the standard output of simulators, as they concisely report all relevant information about the genetic ancestry of the sample, so our method will likely be needed even as simulators evolve. We further point out that the empirical results have demonstrated that the utility of our new two-step approach is limited to sample sizes in the thousands. However, the fast decline in the cost of sequencing and the rapid growth in the number of available genomes generate demand for ever larger simulated samples, where our two-step algorithm is competitive.
The latter part of the article includes demonstration of the utility of our tool for demographic inference via total IBD sharing. We evaluated this new statistic across some simple models and showed the potential of our tool to be used broadly in the community for this type of analysis. This evaluation raises two issues worthy of future work. First, the need to evaluate many unknown parameters jointly, rather than pursue single-point estimation, is a weakness of this strategy. Potential improvement may rely on the ability to break up the set of parameters to low-dimension subsets that can each be optimized separately. The second issue has to do with the reliance of our parameterized models on specific model assumption, which often oversimplify realistic historical scenarios. More flexible models, for example with arbitrary population size at each generation, that rely on regularization rather than a limited number of parameters are an attractive strategy to tackle this issue in the future.
Footnotes
Acknowledgment
We thank the Human Frontier Science Program (SC) and NIH grant 1R01MH095458-01A1 (IP).
Author Disclosure Statement
No competing financial interests exist.
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
