Abstract
Reconstruction of population histories is a central problem in population genetics. Existing coalescent-based methods, such as the seminal work of Li and Durbin, attempt to solve this problem using sequence data but have no rigorous guarantees. Determining the amount of data needed to correctly reconstruct population histories is a major challenge. Using a variety of tools from information theory, the theory of extremal polynomials, and approximation theory, we prove new sharp information-theoretic lower bounds on the problem of reconstructing population structure—the history of multiple subpopulations that merge, split, and change sizes over time. Our lower bounds are exponential in the number of subpopulations, even when reconstructing recent histories. We demonstrate the sharpness of our lower bounds by providing algorithms for distinguishing and learning population histories with matching dependence on the number of subpopulations. Along the way and of independent interest, we essentially determine the optimal number of samples needed to learn an exponential mixture distribution information-theoretically, proving the upper bound by analyzing natural (and efficient) algorithms for this problem.
Introduction
Background: inference of population size history
A
There are many existing methods for estimating the size history of a single population from sequence data. Some rely on maximum likelihood methods (Li and Durbin, 2011; Sheehan et al., 2013; Schiffels and Durbin, 2014; Terhorst et al., 2017) and others utilize Bayesian inference (Nielsen, 2000; Drummond et al., 2005; Heled and Drummond, 2008) along with a variety of simplifying assumptions. A well-known work, Li and Durbin (2011), is based on using sequence data from just a single human (a single pair of haplotypes) and revolves around the assumption that coalescent trees of alleles along the genome satisfy a certain conditional independence property (McVean and Cardin, 2005). By and large, methods such as these do not have any associated provable guarantees. For example, expectation–maximization is a popular heuristic for maximizing the likelihood but can get stuck in a local maximum. Similarly, Markov Chain Monte Carlo (MCMC) methods are able to sample from complex posterior distributions if they are run for a long enough time, but it is rare to have reasonable bounds on the mixing time. In the absence of provable guarantees, simulations are often used to give some sort of evidence of correctness.
Under what sorts of conditions is it possible to infer a single population history? Kim et al. (2015) gave a strong lower bound on the number of samples needed even when one is given exact coalescence data. In particular, they showed that the number of samples must be at least exponential in the number of generations. Thus, there are serious limitations to what kind of information we can hope to glean from (say) sequence data from a single human individual. In a sense, their work provides a quantitative answer to the question: How far back into the past can we hope to reliably infer population size, using the data we currently have? We emphasize that although they work in a highly idealized setting, this only makes their problem easier (e.g., assuming independent inheritance of loci along the genome and assuming that there are no phasing errors) and thus their lower bounds more worrisome.
Our setting: inference of multiple subpopulation histories
A more interesting and challenging task is the reconstruction of population structure, which refers to the subdivision of a single population into several subpopulations that merge, split, and change sizes over time. There are two well-known works that attack this problem using coalescent-based approaches. Both use sequence data to infer population histories where present-day subpopulations were formed through divergence events of a single ancestral population in the distant past. The first is Schiffels and Durbin (2014), who used their method to infer the population structure of nine human subpopulations up to about 200,000 years into the past. More recently, Terhorst et al. (2017) inferred population structures of up to three human subpopulations. Just as in the single population case, these methods do not come with provable guarantees of correctness due to the simplifying assumptions they invoke and the heuristics they employ.
As for theoretical work, the lower bounds proven for a single population trivially carry over to the setting of inferring population structure. However, the lower bound in Kim et al. (2015) only applies when we are trying to reconstruct events in the distant past, leading us to a natural question: Can we infer recent population structure, but, when there are multiple subpopulations?
In this article, we establish strong limitations to inferring the population sizes of multiple subpopulation histories using pairwise coalescent trees. We prove sample complexity lower bounds that are exponential in the number of subpopulations, even for reconstructing recent histories. Our results provide a quantitative answer to the question, Up to what granularity of dividing a population into multiple subpopulations, can we hope to reliably infer population structure?
Our methods incorporate tools from information theory, approximation theory, and analysis (from Turán, 1984). To complement our lower bounds, we also give an algorithm for hypothesis testing based on the celebrated Nazarov–Turán lemma (Nazarov, 1993). Our upper and lower bounds match up to constant factors and establish sharp bounds for the number of samples needed to distinguish between two known population structures as a function of the number of subpopulations. Finally, for the more general problem of learning the population structure (as opposed to testing which of two given population structures is more accurate) we give an algorithm with provable guarantees based on the Matrix Pencil Method (Hua and Sarkar, 1990) from signal processing. We elaborate on our results in Section 1.4.
Modeling assumptions
Our results will apply under the following assumptions: (1) individuals are haploids,* (2) the genome can be divided into known allelic blocks that are inherited independently, and (3) for each pair of blocks, we are given the exact coalescence time.
Indeed, in practice, one must start with sequenced genomes—and in the context of recovering events in human history (potentially unphased) genotypes of diploid individuals. The problem of recovering coalescence times from sequences provides a major challenge and often requires one to either know the population history beforehand, or leverage simultaneous recovery of history and coalescence times using various joint models that enable probabilistic inference.
But since the main message of our article is a lower bound on the number of exact pairwise coalescent samples needed to recover population history, in practice it would only be harder. Even in our idealized setting, handling seven or eight subpopulations already requires more data than one could reasonably be assumed to possess. Thus, our work provides a rather direct challenge to empirical work in the area: either results with seven or eight subpopulations are not to be trusted or there must be some biological reason why the types of population histories that arise in our lower bounds, that are information-theoretically impossible to distinguish from each other using too few samples, can be ruled out.
The Multiple-Subpopulation Coalescent Model
Consider a panmictic haploid † population, such that each subpopulation evolves according to the standard Wright–Fisher dynamics ‡ —we direct the reader to Blythe and McKane (2007) for an overview. For simplicity, we assume no admixture between distinct subpopulations as long as they are separated in the model (i.e., they have not merged into a single population in the time period under consideration).
As a reminder, if one assumes that a single population has size N, which is large and constant throughout time, then the time to the most recent common ancestor of two randomly sampled individuals closely follows the Kingman coalescent with exponential rate N:
where T, the coalescence time for two randomly chosen individuals, is measured in generations. Henceforth, we will assume that this is the distribution of T in the single-component case.
If instead we have a population that is partitioned into a collection of distinct subpopulations with nonconstant sizes, let
Consider the case where
where p0 + p1 + ⋯ + pD = 1, λ0 = 0 and the other λi are
The population structure is assumed to undergo changes over time, where the positive direction points toward the past. The three possible changes are as follows:
(Split) One subpopulation at time t− becomes two subpopulations at time t (i.e., Dk = Dk−1 + 1). (Merge) Two subpopulations at time t− join to form one subpopulation at time t (i.e., Dk = Dk−1 − 1). (Change size) An arbitrary number of subpopulations change size at time t.
Figure 1 provides an illustrative example. If an individual at time t− is from a subpopulation of size M that splits into two subpopulations of sizes M1, M2 at time t, then its ancestral subpopulation is random: for i ∈{1, 2}, subpopulation i is chosen with probability Mi/M. In our model, we only allow at most one of these events at any particular time point. For us, a “split” looking backward in time refers to a convergence event of two subpopulations going forward in time, whereas a “merge” refers to a divergence event. This convention is chosen because we think of reconstruction as proceeding backward in time from the present.

An example of population structure history, illustrating merges and splits starting with three present-day subpopulations.
The main theoretical contribution of this work is an essentially tight bound on the sample complexity of learning population history in the multiple-subpopulation model. In particular, we show sample complexity lower bounds that are exponential in the number of subpopulations k. Here is an organized summary of our results:
First, we show a two-way relationship between the problem of learning a population history (in our simplified model) and the problem of learning a mixture of exponentials. Recall that when the effective subpopulation sizes are all constant, the distribution of coalescence times follows Equation (2) and thus is equivalent to learning the parameters pt and λt in a mixture of exponentials. Conversely, we show how to use an algorithm for learning mixtures of exponentials to reconstruct the entire population history by locating the intervals where there are no genetic events and then learning the associated parameters in each, separately (Section 2.1 with details in Supplementary Appendix B and Supplementary Appendix F).
(Main result) Using this equivalence, we show an information-theoretic lower bound on the sample complexity that applies regardless of what algorithm is being used. In particular, we construct a pair of population histories that have different parameters but which require Ω((1/Δ)4k) samples to tell apart. This lower bound is exponential in the number of subpopulations k. Here, Δ ≤ 1/k is the smallest gap between any pair of the λt's. The proof of this result combines tools spanning information theory, extremal polynomials, and approximation theory (Section 2.3 with details in Supplementary Appendix D).
In the hypothesis testing setting where we are given a pair of population histories that we would like to use coalescence statistics to distinguish between, we give an algorithm that succeeds with only O((1/Δ)4k) samples. The key to this result is a powerful tool from analysis, the Nazarov–Turán lemma (Nazarov, 1993) that lower bounds the maximum absolute value of a sum of exponentials on a given interval in terms of various parameters. This result matches our lower bounds, thus resolving the sample complexity of hypothesis testing up to constant factors (Section 2.4 with details in Supplementary Appendix E).
In the parameter learning setting when we want to directly estimate population history from coalescence times, we give an efficient algorithm that provably learns the parameters of a (possibly truncated) mixture of exponentials given only O((1/Δ)6k) samples. We accomplish this by analyzing the Matrix Pencil Method (Hua and Sarkar, 1990), a classical tool from signal processing, in the real-exponent setting (Section 2.2 with details in Supplementary Appendix C).
Finally, we demonstrate using simulated data that our sample complexity lower bounds really do place serious limitations on what can be done in practice. From our plots it is easy to see that the sample complexity grows exponentially in the number of subpopulations even in the optimistic case where the separation Δ = 1/k, which minimizes our lower bounds. In particular, the number of samples we would need very quickly exceeds the number of functionally relevant genes (on the order of 104) and even the number of SNPs available in the human genome (on the order of 107). In fact, through a direct numerical analysis of our chosen instances, we can give even stronger sample-complexity lower bounds (Section 3, with details in Supplementary Appendix G).
Discussion of results
In summary, this work highlights some of the fundamental difficulties of reconstructing population histories from pairwise coalescence data. Even for recent histories, the lower bounds grow exponentially in the number of subpopulations. Empirically, and in the absence of provable guarantees, and even with much noisier data than we are assuming, many works suggest that it is possible to reconstruct population histories with as many as nine subpopulations. Although testing out heuristics on real data and assessing the biological plausibility of what they find is important, so too is delineating sharp theoretical limitations. Thus, we believe that our work is an important contribution to the discussion on reconstructing population histories. It points to the need for the methods that are applied in practice to be able to justify why their findings ought to be believed. Moreover, they need to somehow preclude the types of population histories that arise in our lower bounds and are genuinely impossible to distinguish between given the finite amount of data we have access to.
Related works
As mentioned in Section 1.1, existing methods that attempt to empirically estimate the population history of a single population from sequence data generally fall into one of two categories: Many are based on (approximately) maximizing the likelihood (e.g., Li and Durbin, 2011; Sheehan et al., 2013; Schiffels and Durbin, 2014; Terhorst et al., 2017) and others perform Bayesian inference (e.g., Nielsen, 2000; Drummond et al., 2005; Heled and Drummond, 2008). Generally, they are designed to recover a piecewise constant function N(t) that describes the size of a population, with the goal of accurately summarizing divergence events, bottleneck events, and growth rates throughout time.
Many notable methods that fall into the first category rely on hidden Markov models (HMMs), which implicitly make a Markovian assumption on the coalescent trees of alleles across the genome. Li and Durbin (2011) gave an HMM-based method (Pairwise Sequentially Markovian Coalescent aka PSMC) that reconstructs the population history of a single population using the genome of a single diploid individual. Later related works gave alternative HMMs that incorporate more than two haplotypes [diCal; Sheehan et al., 2013) and Multiple Sequentially Markovian Coalescent aka MSMC (Schiffels and Durbin, 2014)] and improve robustness under phasing errors [SMC++ (Terhorst et al., 2017)].
Methods in the second category operate under an assumption about the probability distribution of coalescence events and the observed data. For instance, Drummond et al. (2005) prescribes a prior for the distribution of coalescence trees and population sizes, under which MCMC techniques are used to compute both an output and a corresponding 95% credibility interval. However, given the highly idealized nature of their models and the limitations of their methodology (e.g., there is no guarantee their MCMC method has actually mixed), it is unclear whether the ground truth actually lies in those credibility intervals.
In the multiple subpopulations case, there are two major coalescent-based methods. The first is Schiffels and Durbin (2014), which introduced the MSMC model as an improvement over PSMC. These authors used their method to infer the population history of nine human subpopulations up to about 200,000 years into the past. Terhorst et al. (2017) introduced a variant (SMC++) that was directly designed to work on genotypes with missing phase information. In particular, they demonstrate the potential dangers of relying on phase information, by showing that MSMC is sensitive to such errors. In an experiment, SMC++ was used to perform inference of population histories of various combinations of up to three human subpopulations. In these experiments, individuals are purposefully chosen from specific subpopulations. We emphasize that in our model, due to the presence of population merges and splits, one does not always know what subpopulation an ancestral individual is from.
As a side remark, there are approaches that attempt to infer a (single-component) population history using different types of information. We briefly touch upon some of these known works. One alternative strategy is to use the site frequency spectrum (SFS) (e.g., Excoffier et al., 2013; Bhaskar et al., 2015). The earliest theoretical result regarding SFS-based reconstruction is due to Myers et al. (2008), who proved that generic 1-component population histories suffer from unidentifiability issues. Their lower bound constructions have a caveat: they are pathological examples of oscillating functions that are unlikely to be observed in a biological context. Later works (Bhaskar and Song, 2014; Terhorst and Song, 2015) prove both identifiability and lower bounds for reconstructing piecewise constant population histories using information from the SFS. (In contrast, as our algorithms show, reconstruction from coalescence data does not suffer from the same lack of identifiability issues.)
Most recently, Joseph and Pe'er (2018) developed a Bayesian time-series model that incorporates data from ancient DNA to recover the history for multiple subpopulations only under size changes, without considering merges or splits. Although our analysis does not directly account for such data, the necessity of considering such models is consistent with our assertion: extra information about the ground truth, such as directly observable information about the past (e.g., ancestral DNA), is probably required in order for the problem to even be information-theoretically feasible. In addition, Joseph and Pe'er do not solve for subpopulation sizes, but rather subpopulation proportions, which contain less information than what we are after.
Theoretical Discussion
Reductions between mixtures of exponentials and population history
In the rest of our theoretical analysis, we will focus on the mixture of exponentials viewpoint of population history. To justify this, note that if we can learn truncated mixtures of exponentials, then we can easily learn population history. Details are given in Supplementary Appendix F, including a concrete algorithm based on our analysis of the Matrix Pencil Method. Conversely, we observe that an arbitrary mixture of exponentials can be embedded as a submixture of a simple population history with two time periods, so that recovering the population history requires in particular learning the mixture of exponentials. The following theorem makes this precise; its proof is delegated to Supplementary Appendix B.
In addition, we provide a more sophisticated version of this reduction that maps two mixtures of exponentials to different population histories simultaneously, while preserving statistical indistinguishability.
Again, if we take t0 small enough, we ensure that any distinguishing algorithm must rely on information from the second (less recent) period, and hence because the probability of all other events match, must distinguish between the mixtures of exponentials Q and R.
Guaranteed recovery of exponential mixtures through the Matrix Pencil Method
Given samples from a hyperexponential distribution
can we learn the parameters p1, …, pk, λ1, …, λk? In Section 2.1, we established the equivalence between solving this problem and learning population history.
Suppose for now that we are given access to the exact values of probabilities
Let A, B be k × k matrices where Aij = vi+j−1 and Bij = vi+j−2.
Solve the generalized eigenvalue equation det(A − γB) = 0 for the pair (A, B). The γ that solve det(A − γB) = 0 are the α's.
Finish by solving for the p's in a linear system of equations
To understand why the algorithm works in the noiseless setting, consider the decomposition
The Matrix Pencil Method is a close cousin of Prony's Method (Prony, 1795). Before this work, Feldmann and Whitt (1998) considered the strategy of using Prony's Method to fit exponential mixtures to general long-tail distributions. In the upcoming section, we provide a finite-sample guarantee of the MPM in the context of learning exponential mixture distributions. As it turns out (Remarks 3 and 4), this algorithm is nearly optimal in terms of the number of samples required.
We now describe our analysis of the MPM in the more realistic setup where the cumulative distribution function (CDF) is estimated from sample data. First note that the model [Equation (3)] is statistically unidentifiable if there exist two identical λ's. Indeed, the mixture
Without loss of generality, we also assume that (1) the components are sorted in decreasing order of exponents, so that λ1 > ⋯ > λk > 0, and (2) time has been rescaled § by a constant factor, so that λi ∈ (0, 1) for each i. Now we can state our guarantee for the MPM under noise:
for all j.
The full proof of Theorem 2.3 is given in Supplementary Appendix C. As in previous work analyzing the MPM in the super-resolution setting with imaginary exponents (Moitra, 2015), we see that the stability of MPM ultimately comes down to analyzing the condition number of the corresponding Vandermonde matrix, which in our case is very well understood (Gautschi, 1962).
Strong information-theoretic lower bounds
In this section we describe our main results, strong information-theoretic lower bounds establishing the difficulty of learning mixtures of exponentials (and hence, by our reductions, population histories). The full proofs of all results found in this section are given in Supplementary Appendix D. First, we state a lower bound on learning the exponents λj, which is an informal restatement of Corollary D.5.
1. Each λi and μj is in (0, 1], λ1 = μ1, and the elements of
2. Let H1 and H2 be hypotheses, under which the random variable T respectively follows the distributions
If N samples are observed from either H1 or H2, each with prior probability 1/2, then the Bayes error rate for any classifier that distinguishes H1 from H2 is at least
Table 1 shows examples of this lower bound for small values of k.
Sample Lower Bounds of Theorem 2.4
Sample Lower Bounds of Theorem 2.4
We instantiate the bounds with α = 1/k, m = 2k, Δ = 1/(4k), and δ = 1/2, and solve for N in Equation (4) to get the required number of samples N0.
Next, we state an additional information-theoretic lower bound showing that the information-theoretic (minimax) rate is necessarily of the form
where the max is taken over feasible choices of p, and the infimum is taken over possible estimators
As expected, our lower bounds show that the learning problem becomes harder as Δ approaches 0. The “easiest” case, then, ought to be when Δ is as large as possible, so that the λi are equally spaced apart in the unit interval. This raises the following question: as Δ grows, does the sample complexity remain exponential in k, or is there a phase transition [as is the case in super-resolution, from Moitra (2015)] where the problem becomes easier? In the Supplementary Appendix, we completely resolve this question: the sample complexity still grows exponentially in 4k (Theorem D.7) even when Δ is maximally large.
As an alternative to the learning problem that the Matrix Pencil Method solves, we also consider the hypothesis testing scenario in which we want to test if the sampled data match a hypothesized mixture distribution. In this case, we can give guarantees from weaker assumptions and requiring smaller numbers of samples. To state our guarantee, we need the following additional notation: for P a mixture of exponentials, let pλ(P) denote the coefficient of e−λt, which is 0 if this component is not present in the mixture. We study the following simple-versus-composite hypothesis testing problem using N samples:
H0: The sampled data are drawn from P. H1: The sampled data are drawn from a different unknown mixture of at most k1 exponentials Q. Let
Henceforth, we will refer to H0 as the null hypothesis and H1 as the alternative hypothesis (note that H1 is a composite hypothesis). To solve this hypothesis testing problem, we propose a finite-sample variant of the Kolmogorov–Smirnov test:
Let α > 0 be the significance level.
Let FN be the empirical CDF and let F be the CDF under the null hypothesis H0.
Reject H0 if
We show that this test comes with a provable finite-sample guarantee.
1. (Type I Error) Under the null hypothesis, the above test rejects H
0
with probability at most α.
2. (Type II Error) There exists
The full proof of Theorem 2.6 is given in Supplementary Appendix E. The key step in the proof is a careful application of the celebrated Nazarov–Turán lemma (Nazarov, 1993).
Our theoretical analysis rigorously establishes the worst-case dependence on the number of samples needed to learn the parameters of a single period of population history under our model—recall the construction of Theorem 2.4 of two hard-to-distinguish mixtures of exponentials and the result Theorem 2.2 converting these to population histories.
In our simulations, we will analyze both the performance and information-theoretic difficulty of learning not a specially constructed worst-case instance, but instead an extremely simple population history with k populations. More precisely we consider the following instance:
The constant term represents atomic mass at ∞ and corresponds to no coalescence. When k = 1 this is a standard exponential distribution; otherwise, it is a mixture of k + 1 exponentials, counting the degenerate constant term.
We do not believe that this is an unusually difficult instance of a mixture of exponentials on k components. If anything, the situation is likely the opposite: our worst-case analysis (Theorems 2.3 and 2.5) suggests that this is comparatively easy as the gap parameter Δ is maximally large.
To evaluate the error in parameter space from the result of the learning algorithm, we adopted a natural metric, the well-known Earthmover's distance. Informally, this measures the minimum distance (weighted by pi and recovered
For a point of comparison with MPM, we also tested a natural convex programming formulation that essentially minimizes

Plot of #components versus log (base 10) number of samples needed for accurate reconstruction (parameters within Earthmover's distance 0.01). Below the red line, it is mathematically impossible for any method to distinguish with >75% success between the ground truth and a fixed alternative instance that has significantly different parameters.
Values (Before Log-Scale) in Figure 2
CVX, convex program; LB, theoretical lower bound; MPM, Matrix Pencil Method.
Besides showing the performance of the algorithms, we were able to deduce rigorous unconditional lower bounds on the information-theoretic difficulty of these particular instances. Each point on the red line corresponds to the existence of a different mixture of exponentials (found by examining the output of the convex program), with a comparable number of mixture components,
††
which is far in parameter space
‡‡
from the ground truth and yet the distribution of N samples from this model (where N = 10
y
and y is the y-coordinate in the plot) has total variation (TV) distance at most 0.5 from the distribution of N samples from the true distribution. By the Neyman–Pearson lemma, this implies that if the prior distribution is
Notably, the lower bound shows that reliably learning the underlying parameters in this simple model with five components necessarily requires at least 10 trillion samples from the true coalescence distribution. In reality, since we do not truly have access to clean i.i.d. samples from the distribution, this is likely a significant underestimate.
Despite being very different in parameter space, their H2 distance is 7.9727 × 10−6 so any learning algorithm requires at least 15,660 samples to distinguish them with better than 75% success rate.
As a remark, we point out that the CDFs F and G in this example have exponents that are interlaced. Observe that this is a characteristic also shared by the information-theoretic obstructions referenced in Section 2.3 and Supplementary Appendix D. This likely illustrates a major source of difficulty of most reasonable-looking instances: “averaging” adjacent exponents of an exponential mixture typically produces a different mixture with a similar distribution whose components interlace with the original.
Footnotes
Author Disclosure Statement
The authors declare they have no conflicting financial interests.
Funding Information
Ankur Moitra was supported by a National Science Foundation CAREER Award CCF-1453261 and Large CCF-1565235, the David and Lucille Packard Fellowship, the Alfred P. Sloan Fellowship, and the Office of Naval Research Young Investigator award. Elchanan Mossel was supported by the Office of Naval Research N00014-16-1-2227, National Science Foundation CCF-1665252 and DMS-1737944 grants, and the Simons Investigator in Mathematics award 622132. Frederic Koehler was supported by Ankur Moitra's National Science Foundation Large CCF-1565235 grant and the David and Lucille Packard Fellowship. Govind Ramnarayan was supported by Elchanan Mossel's National Science Foundation CCF-1665252 and DMS-1737944 grants.
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
