Abstract
For many types of high-throughput sequencing experiments, success in downstream analysis depends on attaining sufficient coverage for individual positions in the genome. For example, when identifying single-nucleotide variants de novo, the number of reads supporting a particular variant call determines our confidence in that variant call. If sequenced reads are distributed uniformly along the genome, the coverage of a nucleotide position is easily approximated by a Poisson distribution, with rate equal to average sequencing depth. Unfortunately, as has become well known, high-throughput sequencing data are never uniform. The numerous factors contributing to variation in coverage have resisted attempts at direct modeling and change along with minor adjustments in the underlying technology. We propose a new nonparametric method to predict the portion of a genome that will attain some specified minimum coverage, as a function of sequencing effort, using information from a shallow sequencing experiment from the same library. Simulations show our approach performs well under an array of distributional assumptions that deviate from uniformity. We applied this approach to estimate coverage at varying depths in single-cell whole-genome sequencing data from multiple protocols. These resulted in highly accurate predictions, demonstrating the effectiveness of our approach in analyzing complexity of sequencing libraries and optimizing design of sequencing experiments.
1. Introduction
Genome coverage is a critical technical characteristic of high-throughput genomic sequencing experiments (Sims et al., 2014). High coverage is necessary for correcting sequencing errors and for credible biological conclusions. For example, in single-nucleotide variant detection, candidate variants supported by few reads are often filtered out to reduce false positives caused by sequencing errors (Chen et al., 2017; Xu, 2018). However, the cost of sequencing is proportional to the total amount of nucleotides sequenced. To plan sequencing experiments and determine the appropriate allotment of sequencing effort in a given experiment, experimental design must consider the portion of sites in the genome expected to attain sufficient coverage as a function of total nucleotides sequenced. To indicate “sufficient” coverage, we define a site as having sufficient coverage if the site is covered by at least r reads. Here the value r is predetermined by the experiment's designer, based on goals and subsequent analyses. For example, a value of r = 8 has been used to select high-confident single-nucleotide polymorphism candidates as those supported by at least eight reads (Ng et al., 2010). Similarly, a value of r = 5 has been used in identification of indels (The 1000 Genomes Project Consortium, 2012) and a value of r = 3 has been used in detection of ultrarare de novo mutations (Eboreime et al., 2016). For a specified value of r, we seek to estimate the number of sites covered by at least r reads as a function of the sequencing effort.
Excessive variance of coverage along the genome in high-throughput sequencing can be attributed, in large part, to sequence-specific factors. First, DNA fragments in sequencing libraries amplify with different efficiencies. GC-rich and AT-rich DNA fragments have been found underrepresented in sequencing results (Benjamini and Speed, 2012). Second, parts of genomes can go missing during sequencing library preparation. In particular, for single-cell whole-genome sequencing (scWGS), this dropout rate is high compared with bulk sequencing experiments due to the low amount of DNA in a single cell. Given an equal number of sequenced reads, different scWGS experiments can exhibit quite different coverage profiles (Chen et al., 2017). This high and unpredictable variance presents difficulties in formulating any general guidelines for how many total reads to sequence in a given experiment.
The pioneering work in this area is the well-known Lander/Waterman theory (Lander and Waterman, 1988), which relates genome coverage to sequencing effort for Sanger sequencing (Sanger et al., 1977). The basic statistical assumption of the Lander/Waterman model is that reads are generated uniformly at random from the genome. Under this assumption, for each site in the genome, the number of reads that cover this site can be approximated by a Poisson distribution, where the Poisson rate λ is estimated by average sequencing depth using maximum likelihood estimation. The Lander/Waterman theory fits extremely well for Sanger sequencing, and it elegantly accounts for many intricacies relevant to de novo sequence assembly. Although many of the statistical questions we face in high-throughput sequencing applications appear more straightforward, this theory is largely inapplicable when faced with highly variable sequencing coverage (Benjamini and Speed, 2012; Daley and Smith, 2013).
Instead of using the same Poisson rate λ for each site, in this article, we assume that the Poisson rate λ varies among sites and follows a latent distribution G(λ), which is used to describe all varieties of coverage along the genome. Mixtures of Poisson distributions have seen wide use for inferences on the relationship between species and individuals (Engen, 1978). To connect the terminology from this body of literature with our application, one can regard species as sites in a genome, and sequenced nucleotides mapping over those base-pairs are the sampled individuals. As a notable early example, Fisher et al. (1943) assumed that the latent G(λ) is a gamma distribution. Although other parametric distributions have been investigated (Bhattacharya, 1966; Bulmer, 1974; Sichel, 1975; Burrell and Fenton, 1993), it is hard to assess which parametric distribution should be applied based on observed data. In particular, distinct parametric forms may appear to fit observed data well, but exhibit very different extrapolation behaviors (Engen, 1978). In this context, extrapolation means predicting the number of species in a sample of larger size—expanding the number of observed individuals.
Good and Toulmin (1956) established a nonparametric empirical Bayes framework that served as the foundation for much subsequent nonparametric methodology (Efron and Thisted, 1976; Boneh et al., 1998; Chao and Shen, 2004; Daley and Smith, 2013). For the case r = 1, Good and Toulmin (1956) derived an estimator for the expected number of species in a sample while avoiding direct inference of G(λ). This estimator takes the form of an alternating power series, which is accurate for short-range extrapolation but diverges in practice for long-range extrapolation (Good and Toulmin, 1956). Daley and Smith (2013) proposed a solution to the divergence problem by applying rational function approximation to the Good/Toulmin power series. This approach showed promising results in predicting library complexity (Daley and Smith, 2013), the number of sites covered by at least one read (Daley and Smith, 2014), and other applications (Deng et al., 2015). However, because this method does not infer the latent distribution, it did not suggest an extension to predict the number of sites covered by at least r > 1 reads as a function of sequencing effort (Daley, 2014).
Intuitively, predicting genome coverage seems more difficult when r > 1 compared with r = 1. In the example of Figure 1a, the curve for the number of sites covered by at least r = 1 read appears flat after 500 reads, suggesting that the sample is saturated. However, barely any sites are covered at least r = 16 times after sequencing 500 reads—leading to a very different flat curve. Visually inspecting the shape of the curve for r = 16 based on 500 reads (Fig. 1a) provides very little information about the shape of the curve after 3000 reads (Fig. 1b).

The expected number of sites in the genome covered by at least r reads as a function of the sequencing effort. In this toy example, we assume that the length of the reference genome is 100 bp. One unit in the x-axis represents 50 reads with read length 1. The number of reads is up to
We propose a new method to predict the expected number of sites in the genome covered by at least r reads as a function of the sequencing effort. To overcome the difficulties of generalizing nonparametric methods from r = 1 to r > 1, we first derive a relationship between the expected number of sites covered by at least r reads and the expected number of sites covered by at least one read, under the mixture of Poisson distribution assumption. Without inferring the latent distribution, this derived relationship can generalize any smooth expression, which is used to calculate the expected number of sites covered by at least r = 1 read as a function of the sequencing effort, to an expression for r > 1. We then utilize this relationship to construct a nonparametric estimator that can be applied for every value of r. We prove that our estimator converges as the sequencing effort goes to infinity, essential for large-scale applications such as high-throughput sequencing. Extensive simulations suggest that our estimator performs very well for various types of heterogeneous populations. Applications to scWGS data demonstrate accuracy as a basis for resource allocation in high-throughput sequencing experiments.
2. Unified Representation for Multiplicity of Coverage
We will use the conventional “species” terminology of Good (1953). The problem is as follows: a random sample of individuals is captured from a population after trapping for one unit of time. Each individual belongs to exactly one species, and the total number L of species in the population is finite but not known. In the context of sequencing, although the length of the reference genome is known, the actual observable/captured genome varies among prepared sequencing libraries. Let Nj be the number of species represented by exactly j individuals in this sample. Clearly N0 is not observable. Imagine that a second sample is obtained after trapping for t units of time from the same population. The time t > 1 should bring to mind a “scaled up” experiment. This second sample may take the form of an expansion of the initial sample, but may also be a separate sampling experiment. Let Sr(t) be the random variable whose value is the number of species represented at least r times after trapping for t units of time. We are concerned with predicting the expected number
For a given species, we assume that the number of individuals in a sample follows a Poisson distribution with expectation λ per unit time. To describe the heterogeneity of abundance among species, we assume the rate λ is generated from a latent distribution G(λ). Let Nj(t) be the random variable whose value is the number of species represented exactly j times after trapping for t units of time. By definition
From our Poisson mixture assumption, the expected number of species after trapping for t units of time can be expressed as follows:
Taking the jth derivative of
Note that the expected value of Nj(t) is as follows:
By comparing the above expression with the jth derivative of
which has been noted previously (Kalinin, 1965). By replacing the
Thus, we have established a direct relationship between the SAC and the r-SAC. Using Equation (3), we can generalize any smooth expression for the SAC to the expression for the corresponding r-SAC. We demonstrate the use of Equation (3) by deriving the expression of the r-SAC for a few widely used methods that are designed for the SAC (Supplementary Materials in Section S2). We call the ratio
3. A New Nonparametric Estimator
Here we leverage the technique of Padé approximation to build a nonparametric estimator for the r-SAC. A Padé approximant is a rational function with a Taylor expansion that agrees with the power series of the function it approximates up to a specified degree (Baker and Graves-Morris, 1996). In this sense, Padé approximants are rational functions that optimally approximate a power series. This method was successfully applied to construct the estimator of the SAC, using Padé approximants to the Good/Toulmin power series (Deng et al., 2015). Padé approximants are effective because they converge in practice when the Good/Toulmin power series does not, yet within the applicable range of Good/Toulmin power series (t < 2), the two functions remain close. We apply a similar idea beginning with the average discovery rate. This leads to an expression that simplifies the formula in Theorem 1, yielding a new and practical nonparametric estimator for the r-SAC.
Our first step is to obtain a power series representation for the average discovery rate
Replacing expectations with the corresponding observations, each of which is an unbiased estimator, we obtain an unbiased power series estimator of the average discovery rate:
This power series estimator ϕ(t) serves as a bridge between the observed data Si and the Padé approximant for
Although Padé approximants to a given function can have any combination of degrees for the numerator and denominator polynomials, we consider only the subset for which the difference in degree of the numerator and denominator is 1. Importantly, this choice permits these rational functions to mimic the long-term behavior of the average discovery rate, which should approach L/t for large t.
Let Pm-1(t)/Qm(t) denote the Padé approximant to power series ϕ(t) with numerator degree m − 1 and denominator degree m. According to the formal determinant representation (Baker and Graves-Morris, 1996),
The above representation allows us to reason algebraically about the existence of the desired Padé approximant to ϕ(t) for a given initial sample. Define the Hankel determinants as follows:
with Sk = 0 for k < 1. A proof of the next lemma is given in the Supplementary Materials (Section S1.2).
satisfies
and all ai and bj are uniquely determined by S 1 ,S 2 ,…,S 2m .
Combining Lemma 1 and 2, we constructed a nonparametric estimator Ψr,m(t) for the r-SAC in Theorem 2. The proof is found in the Supplementary Materials (Section S1.2). In what follows, we assume that denominators of all rational functions of interest have simple roots. Although in practice we do not encounter Qm(t) with repeated roots, in the Supplementary Materials we show how this assumption can be removed.
satisfies Ψr,m(1) = Sr.
Of note, the coefficients ci and poles xi are independent of r: once determined, they can be used to directly evaluate Ψr,m(t) for any r. The estimator Ψr,m(t) has some favorable properties, summarized in the following proposition, with proofs given in the Supplementary Materials (Section S1.2).
(ii) The estimator Ψr,m(t) converges as t approaches infinity. In particular,
(iii) The estimator Ψr,m(t) is strongly consistent as the initial sample size goes to infinity.
Remark. Both determinants Δm−1,m and Δm,m become 0 when Sj = L for j ≤ 2m and m > 1, so the determinant representation of the Padé approximant [Equation (6)] is ill-defined in such cases. However, the Padé approximant itself remains valid and reduces to L/t for t > 0 (Supplementary Materials in Section S1.3).
4. An Algorithm for Estimator Construction
4.1. Conditions for well-behaved rational functions
The choice of m controls the degree of both the numerator and the denominator in the Padé approximant, and determines the amount of information from the initial sample that is used by Ψr,m(t). In principle, m should be selected sufficiently large so that the estimator Ψr,m(t) can explain the complexity of the latent distribution G(λ). However, a larger value of m leads to more poles in the estimator Ψr,m(t) and makes instability more likely. In practice, the stability of the estimators depends on the locations of poles. For example, if any pole xi resides on the positive real axis, then Ψr,m(t) is unbounded in the neighborhood of xi and becomes ill-defined at t = xi. Here we give a sufficient condition to stabilize the estimator so that it is well-defined and bounded for t ≥ 0 and r ≥ 1. Note: Re(x) is the real part of x. A proof of the next proposition is given in the Supplementary Materials (Section S1.4).
Remark. It is not unusual to constrain roots in such a way to ensure stability. For example, the Hurwitz polynomial, which has all zeros located in the left half-plane of the complex plane, is used as a defining criterion for a system of differential equations to have stable solutions (Rahman and Schmeisser, 2002).
4.2. The construction algorithm
Algorithm 1 provides a complete procedure for constructing our estimator beginning with the observed counts Nj, and satisfying the conditions outlined above. This procedure requires specifying a maximum value of m, and also leaves room for using more effective numerical procedures at each step. Details about these procedures are found in the Supplementary Materials (Section S3).
To see that Algorithm 1 terminates successfully, note that when m = 1,
So if there exists at least one species represented once and one species represented more than once in the initial sample, then we observe
4.3. Variance and confidence interval
Deriving a closed-form expression for the variance of the estimator Ψr,m(t) is challenging. When m ≥ 5, there is no general algebraic solution to the polynomial equations that identify xi in Ψr,m(t), so a closed-form may not exist. Moreover, even for m = 1, the variance of Ψr,1(t) involves a nonlinear combination of random variables S1 and S2 [Equation (10)].
In practice, we approximate the variance of our estimates by bootstrap (Efron and Tibshirani, 1994). Each bootstrap sample is a vector of counts

Relative errors in simulation studies.
5. Simulation Studies
5.1. Models
We carried out a series of simulations to examine the performance of the estimator Ψr,m(t). The simulation scheme is inspired by work from Chao and Shen (2004), but involves populations and samples of larger scale. From our statistical assumptions, the number of individuals for species i in the initial sample follows a Poisson distribution with the rate λi, for i = 1, 2, …, L. The rates λi are generated from distributions we have selected to model populations with different degrees and types of heterogeneity. We describe the degree of heterogeneity in a population by the coefficient of variation (CV) for λi:
The CV quantifies difference in relative abundances among species and is independent of sample size. For the type of heterogeneity, we focus on the shapes of distributions, for example, distinguishing those with exponentially decreasing tails versus heavy-tailed distributions.
We selected six models for our simulations. The first is a homogenous model (P; Poisson), where all species have the same relative abundance, included as a basis for comparison with the other models. Intuitively, the homogeneous model is the simplest one among all models. However, for a given sample size, samples from the homogeneous population contain the least information among any type of population if the sample size is relatively small compared with L (see details and the proof in Supplementary Materials [Section S5]). The second and third models are negative binomials (NB1 and NB2), where the λi follows gamma distributions. The NB models are widely used to describe overdispersed count data (Hilbe, 2011), particularly in modeling sequencing data (Robinson and Smyth, 2007; Robinson et al., 2010; Song and Smith, 2011; McCarthy et al., 2012; Van den Berge et al., 2018). The fourth model is a lognormal (LN) model (Bulmer, 1974), which has been applied to capture–recapture analysis in ecology (Preston, 1948). Models 5 and 6 are a Zipf distribution (Z; Zipf, 1935) and a Zipf/Mandelbrot distribution (ZM; Mandelbrot, 1977), respectively, which are known as power law. Models 4–6 represent so-called heavy-tailed populations (Newman, 2005). Supplementary Table S1 in the Supplementary Materials summarizes these parameter settings.
In our simulations, we fixed the total number L of species at 1 million (M) to represent large-scale applications. For each model, the values of model parameters were determined in such that
5.2. Simulation results
As can be seen from Figure 2a, the estimator Ψr,m(t) performs well under models NB1, NB2, LN, Z, and ZM. We consider these to represent heterogeneous populations due to their large CV compared with the homogeneous model (Supplementary Table S1). The relative errors are 0.002 (± 0.003) and 0.027 (± 0.011) for NB1 and NB2. The errors for the Z and ZM models are slightly higher: 0.057 (± 0.042) and 0.057 (± 0.040), respectively (Supplementary Table S2). Both the relative error and the standard error of Ψr,m(t) are much higher when applied to the homogenous models (Fig. 2a).
We compared the estimator Ψr,m(t) with the five other estimators. The estimator Ψr,m(t) has the least mean relative error compared with other approaches under the LN, Z, and ZM models (Fig. 2b), which are the heavy-tailed models. The relative errors under these three models are 0.020, 0.057, and 0.057 (Supplementary Table S2). In particular, under the Z and ZM models, the second-most accurate approach, our generalization of CS estimator, has a relative error of 0.525 and 0.558, around 10 × the error of Ψr,m(t). The estimator Ψr,m(t) has a higher standard error compared with the other methods (Fig. 2b), which we attribute broadly to its use of procedures that can introduce numerical error (e.g., to fit the Padé approximant). Even considering this variation, when Ψr,m(t) is at its least accurate, it remains substantially more accurate than the other methods across models LN, Z, and ZM. As expected, for models NB1 and NB2, the ZTNB approach is the most accurate because it matches the precise statistical assumptions of those simulations. Importantly, without any assumption about the latent distribution of λi, the estimator Ψr,m(t) also yields excellent accuracy in these two models, with relative errors less than 5%. The LS approach performs similarly to the ZTNB approach when the shape parameter in the NB model is close to zero, as occurs for NB2 (Fig. 2). Similarly, for the homogeneous population model, the ZTP approach is the most accurate.
We found the estimator Ψr,m(t) to be more accurate when the population samples correspond to heavy-tailed distributions compared with other methods. In general, these are the most challenging scenarios for accurately predicting
Clearly our estimator has larger relative errors when the samples are generated from a homogeneous model compared with other models (Fig. 2). Our initial intuition was that the homogeneous model should be easier to predict because all λi are constrained by a single parameter. Our simulation results show an interesting dichotomy in the performance of the methods we tested. On the one hand, nonparametric methods that do not assume an underlying Poisson have a higher relative error on the homogeneous model. For example, the relative errors are 0.5 and 0.32 for BBC and our estimator, respectively (Supplementary Table S2). On the other hand, the relative error is 0.003 for CS, which is based on the Poisson distribution. Parametric methods show similar trends. The ZTNB performs well under the homogeneous model because it can easily describe a Poisson when the shape parameter is large. Although the LS estimator is derived from the negative binomial, it assumes that the shape parameter is close to 0, so it has difficulty describing homogeneous data.
The SC provides one perspective on why the homogeneous model might present challenges for nonparametric approaches. In particular, the homogeneous model has the lowest SC compared with other models having a fixed sample size (Supplementary Materials in Section S5). Increasing the initial sample size can increase SC, which in turn improves the accuracy of our estimator. For example, when we increase the size of the initial sample from 1L to 2L, the relative error reduces to 0.123 (± 0.06).
6. Applications
scWGS is used to study genotypes of individual cells, for example, in the context of detecting mutations in tumors (Zong et al., 2012) or characterizing the landscape of mutations in normal cells (Zhang et al., 2019). Because of the inherent limited amount of DNA in any single cell, scWGS heavily relies on whole-genome amplification (WGA) to obtain enough material for sequencing. Multiple strategies exist for WGA. Early methods include exponential amplification as in degenerate oligonucleotide-primed polymerase chain reaction (Telenius et al., 1992) and multiple displacement amplifications (Dean et al., 2002; Leung et al., 2016). More recent methods avoid exponential amplification. Examples include the multiple annealing and looping-based amplification cycles (Zong et al., 2012), and the linear amplification via transposon insertion (Chen et al., 2017). Each WGA method has different amplification efficiencies and sequencing biases, resulting in different genome coverage profiles for a given amount of sequencing effort (Huang et al., 2015; Chen et al., 2017). By analyzing data with these diverse characteristics, we can examine the behavior of estimators in multiple distinct, yet realistic, settings.
We aim to predict the expected number of sites in the genome covered by at least r reads as a function of the sequencing effort (r-SAC) in a sequencing library. Therefore, for each protocol, our task is to estimate how the genome coverage changes as the sequencing effort increases. In practice, researchers are usually not interested in all values of r but particular values of r, which are defined as “sufficient” coverage based on how the data will be used in subsequent analysis steps.
We collected 10 scWGS samples from 3 studies (Zong et al., 2012; Chen et al., 2017; Dong et al., 2017). These data sets were generated in different laboratories using different protocols. Supplementary Table S3 lists the accession numbers, in the NCBI SRA database, of the sequencing runs used in this study. For each data set, we randomly downsampled a subset of 5M reads as an initial sample, and used this initial sample to predict the r-SAC (see Supplementary Materials in Section S6 for scWGS processing details). We then compared the predicted r-SAC with the ground truth, which is obtained by subsampling without replacement from the entire data set, and used relative error to assess prediction accuracy of our method.
We evaluated the accuracy of our predicted r-SAC for r ≤ 20, which is sufficient for almost all applications. Figure 3 shows the actual r-SAC along with our estimated r-SAC for r = 4, 8, 10, 20. The first thing to note is the diverse shapes associated with each distinct technology. These shapes are not necessarily a result of the technology, and might differ if we examined other data from the same technology. However, this diversity still challenges estimators in different ways. In each case, our estimated curves closely track the true curves for different values of r. Predictions from all four protocols were all generally accurate, with relative errors less than 10% for all data sets when r ≤ 10 (Table 1). We can conclude that using only 5M reads in each experiment, we are able to detect the differences in genome coverage within each of the sequencing libraries, for a variety of r values and at over 40-fold extrapolation.

Predicted r-SACs for scWGS data sets. The number of sites in the genome covered at least r times as a function of the sequencing effort, for r = 4, 8, 10, and 20. The dashed line is the predicted curve based on Ψr,m(t) using an initial sample of 5M reads. The size of initial samples equals to one unit of sequencing effort. We extrapolated up to 500M reads using the initial sample. The solid line is generated by subsampling the entire data set without replacement. The accession numbers in NCBI for each data set used in the figure are as follows:
Relative Errors for Estimating r-Species Accumulation Curves in Single-Cell Whole-Genome Sequencing Using Ψr,m(t) and Zero-Truncated Poisson
ZTP, zero-truncated Poisson.
We compared our estimation method with both parametric and nonparametric methods that are listed above for the simulation study. The estimator Ψr,m(t) has clearly superior performance among these methods. Table 1 and Supplementary Table S4 show the relative errors of each method for r from 1 to 10. Overall, the number of cases where the relative error of Ψr,m(t) is higher than other methods is fewer than 2%. In particular, our estimator has a lower relative error than estimators ZTP, CS, and BBC for all values of r ≤ 10 across all data sets. Furthermore, the relative errors of the estimator Ψr,m(t) are less than 10% across all data sets for all r ≤ 10 (Table 1). In contrast, the ZTNB, which shows the second-best performance, has a relative error of 30.3% at r = 1 for the data set SRR2976568 (Supplementary Table S4).
7. Discussion
We introduced a new approach to estimate the genome coverage as a function of the sequencing effort. The nonparametric estimators obtained by our approach are universal in the sense that they apply across values of r for a given sequencing library. We have shown that these estimators have favorable properties in theory, and also give highly accurate estimates in practice. Accuracy remains high for large values of r and for long-range extrapolations. This approach builds on the theoretical nonparametric empirical Bayes foundation of Good and Toulmin (1956), providing a practical way to compute estimates that are both accurate and stable.
We discovered a relationship between the r-SAC
For large-sequencing projects, such as the international 1000 Genomes Project (The 1000 Genomes Project Consortium, 2015) and the UK 100,000 Genomes Project (Genomics England, 2017), one crucial question is how to balance between the number of individuals in the study and the sequencing effort for each individual. Under a fixed budget, we should aim to sequence as many individuals as possible to enhance the statistical power of the study, while still ensuring that the sequencing in each sample is sufficient to make confident estimates about genotype. Having a small but sufficient sequencing sample also largely saves computing time and data storage. Using a shallow sequencing experiment as a pilot study, our method provides an accurate estimation for the genome coverage in the future large-sequencing sample, and helps scientists plan the sequencing effort for various protocols.
For modern biological sequencing applications, samples are frequently in the millions, and the scale of the data could be different by orders of magnitude. These large-scale applications present new challenges to traditional capture–recapture statistics, and call for methods that can integrate high-order moments to accurately characterize the underlying population. We generalized the classical study of estimating an SAC and propose a nonparametric estimator that can theoretically leverage any number of moments. Both this generalization and the associated methodology suggest possible avenues for practical advances in related estimation problems. Particularly in the context of genomic sequencing, we believe that this perspective will play important roles of experimental planning in the big genome projects, as well as accelerating the development process of new sequencing methods.
Footnotes
Author Disclosure Statement
The authors are not aware of any affiliations, memberships, funding, or financial holdings that might be perceived as affecting the objectivity of this review.
Funding Information
This work was supported by the NIH/NHGRI (National Institutes of Health/National Human Genome Research Institute) R01 HG007650 (A.D.S.).
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
