End-to-End Optimization of High-Throughput DNA Sequencing

Abstract

At the core of Illumina's high-throughput DNA sequencing platforms lies bridge amplification, a biophysical surface process that results in a random geometry of clusters of homogeneous short DNA fragments, typically hundreds of base pairs long. The statistical properties of this random process and the lengths of the fragments are critical, as they affect the information that can be subsequently extracted, that is, the density of successfully inferred DNA fragment reads. The ensembles of overlapping DNA fragment reads are then used to computationally reconstruct the much longer target genome sequence. The success of the reconstruction in turn depends on having a sufficiently large ensemble of DNA fragments that are sufficiently long. In this article, using stochastic geometry, we model and optimize the end-to-end flow cell synthesis and target genome sequencing process, linking and partially controlling the statistics of the physical processes to the success of the final computational step. Based on a rough calibration of our model, we provide, for the first time, a mathematical framework capturing the salient features of the sequencing platform that serves as a basis for optimizing cost and performance and for analyzing sensitivity to various parameters.

1. Introduction

Rapid and affordable detection of the order of nucleotides in DNA molecules has become an indispensable research tool in molecular biology. In this article, we consider the most prevalent sequencing technology that relies on reversible terminator chemistry (Bentley et al., 2008) and the shotgun sequencing strategy (Messing et al., 1981; Venter et al., 1996, 1998) to determine long DNA strands. Our goal is to develop a model and an associated mathematical framework that enable optimization of the end-to-end cost of DNA sequencing. For this, we draw upon stochastic geometry and queuing theoretic tools to model and analyze salient characteristics of growing DNA clusters on the surface of a sequencing chip and optimize the process of sequence assembly from the short reads provided by the sequencing device. These developments provide a systematic basis to study the trade-offs and maximize the cost efficiency of the sequencing procedure.

1.1. High-level description of the problem

A target DNA strand to be sequenced can be viewed as a possibly long sequence of L letters (for example, L = 10⁹). In shotgun sequencing, a large number of copies of the target are first randomly cut into fragments; the fragments are then sequenced, each providing a read of length l, where l is much smaller than L (Venter et al., 1996, 1998). The approach involves two key steps. In Step 1, one reads as many fragments as possible; as we elaborate later in this section, this step can be parallelized and therefore performed very efficiently. In Step 2, one attempts to reconstruct the target sequence by leveraging the library of overlapping reads obtained in Step 1; this step is referred to as assembly. For clarity, let us consider the two steps independently, although one should keep in mind that they are intimately linked and will, in the sequel, be jointly optimized.

To functionalize the surface of a DNA chip (referred to as a flow cell), DNA fragments are first scattered across its surface whereupon they attach at random locations (Bentley et al., 2008). A single fragment is insufficient to generate a signal that is detectable; to remedy this, each of the initially positioned fragments, which we refer to as germs, is replicated in parallel a number of times through the process called bridge amplification. The resulting ensembles of fragments, each comprising hundreds of identical copies of a germ, enable signal amplification and accurate DNA sequence detection. The germ replication can be viewed as a spatial branching process happening on the surface of the flow cell. The result of each such process, in the simplest case, is roughly a disc of the fragment's copies centered at the location where the original strand (germ) happened to attach to the flow cell. The radius of the disc can be controlled by the number of steps of the bridge amplification process. As we shall see, due to possible interaction among such growth patterns, the resulting shapes might be more complex than discs, so in the sequel, we will more generally refer to them as clusters.

As mentioned above, each fragment is replicated to generate a cluster of identical molecules, which enables signal amplification and thus facilitates sequence detection, that is, the underlying letter reading mechanism. Reads are obtained in parallel, that is, all clusters are read simultaneously one letter at a time. This is accomplished by relying on reversible terminator chemistry (Bentley et al., 2008) where the first unread letter of each fragment is identified by detecting the color of the fluorescent label attached to the nucleotide bound to it.¹ Since clusters contain multiple copies of the fragments, with proper illumination, each cluster will light up the color indicative of the latest letter/base being examined. This process, referred to as sequencing-by-synthesis, is applied sequentially for say l steps and thus, in principle, one can determine the first l letters of all fragments on the flow cell—this is the aim of Step 1.

There are two caveats, however. First, the successive reading of nucleotides can fail for some fragments in each cluster. Specifically, on a given step, say k, the chemical processes associated with reading the kth nucleotide base may fail or jump ahead to the (k + 1)st one. Thus, at each step, only a fraction of the fragments in a cluster have their kth nucleotide properly marked with the correct fluorescent marker. As this proceeds, an increasing number of markers get out of phase, that is, the disc/cluster will eventually appear to have a mix of colors, making correct detection of the fragment's next letter increasingly difficult. To deal with this phenomenon, typically referred to as phasing, a number of base calling methods have been proposed in recent years (Erlich et al., 2008; Kao et al., 2009; Kircher et al., 2009; Kao and Song, 2011; Das and Vikalo, 2012, 2013). Indeed, amplifying the signal is the reason for synthesizing the cluster of duplicates of each fragment in the first place, that is, clusters are meant to enable in-phase addition of light emanating from a number of identical fluorescent tags.

The second caveat is that randomly placed germs may be grown into clusters that overlap, which will also impair the reading process. Larger clusters will tend to experience more overlaps. In essence, this is a random disc packing problem: if two germs happen to attach to the flow cell at distance r from each other, any cluster growth that leads to discs of radius larger than r/2 leads to such an overlap and hence to an impaired reading of the letters of the corresponding two strands.

We are now in a position to articulate the main trade-offs that drive the efficiency of shotgun sequencing, which assembles the target using short reads from a flow cell. There are three main parameters at play: the density λ of fragments initially placed on the flow cell, the length l of the fragment reads, and the disc radius r associated with the bridge amplification process:

• λ large looks desirable (because more fragment reads will facilitate Step 2), but could be problematic for any fixed r because of possible disc overlaps;

• l large is desirable (because it will facilitate Step 2), but could be problematic because of the deterioration in read quality as base pair reads get out of phase;

• r large is necessary (to amplify signals and facilitate detection), but this precludes the desire for a large λ (because of disc overlaps).

We pose the following question: What is the optimal set of parameters to maximize the yield and/or possibly minimize the cost in the presence of all these trade-offs?

There are many ways to frame the problem of optimizing yield in such systems. In this article, we will first consider optimizing fragment yield, which we define as the density of length l fragments successfully read per unit area of the flow cell. Then, we consider metrics that are more directly tied to the final objective of sequencing a target DNA sequence of length L. To that end, let us consider Step 2, the reassembly process.

Following a simple stylized model of the process, a condition for successful reassembly is that the collection of successfully read fragments covers the target DNA. That is, if one denotes by $a_1, \ldots, a_L$ the target DNA sequence, then in the set of correctly read fragments there should exist a subset of fragments, say $s_1, \ldots, s_k$, such that $s_1$ contains the sequence $a_1, \ldots, a_{p_1}$, $s_2$ contains the sequence $a_{p_1 + 1}, \ldots, a_{p_2}$, and so on, until $s_k$ contains the sequence $a_{p_{k-1} + 1}, \ldots, a_L$, for some $p_1, \ldots, p_{k-1}$. This stylized model for reassembly is highly simplified. In practice, we may encounter two scenarios, de novo and reference-based sequencing. De novo sequencing (Idury and Waterman, 1995; Myers, 1995; Miller et al., 2010) refers to sequencing genomes that have not been previously characterized, while reference-based sequencing (Delcher et al., 2002) refers to the sequencing task where one or more previously sequenced references, which are similar to the target, are available. These present different challenges in the reassembly process, that is, determining where fragments should be placed to reconstruct the target [sequence alignment has attracted a considerable amount of attention (e.g., see Li et al., 2008; Langmead et al., 2009; Li and Durbin, 2009, 2010; Lunter and Goodson, 2011; Ruffalo et al., 2011 and the references therein)]. Moreover, reassembly requires additional slackness in the cover, that is, in $s_i$, there should be enough letters to the left of $a_{p_{i-1} + 1}$ and to the right of $a_{p_i}$ to correctly reconstruct the long sequence from the fragments, exactly as in a puzzle. Assuming the ability to appropriately map/align fragments, the existence of a covering is a minimal requirement to sequence the target; see Bresler et al. (2013) and Motahari et al. (2013) and references therein for a modern discussion of this problem.
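To make the covering condition concrete, the following minimal Python sketch checks whether a collection of reads covers the target. It assumes, purely for illustration, that each read is summarized by its 1-based start position on the target and that all reads share the same length l; the function name `covers_target` is hypothetical, and the alignment and slackness issues discussed above are ignored.

```python
def covers_target(starts, read_len, target_len):
    """Return True if reads of length read_len, starting at the given
    1-based positions, jointly cover positions 1..target_len.

    Illustrative assumption: reads are already placed on the target,
    so only the existence of a covering is tested.
    """
    covered_up_to = 0  # rightmost target position covered so far
    for s in sorted(starts):
        if s > covered_up_to + 1:
            # Position covered_up_to + 1 is contained in no read: a gap.
            return False
        covered_up_to = max(covered_up_to, s + read_len - 1)
        if covered_up_to >= target_len:
            return True
    return covered_up_to >= target_len


if __name__ == "__main__":
    print(covers_target([1, 4, 7], read_len=4, target_len=10))
    print(covers_target([1, 6], read_len=4, target_len=10))
```

The greedy left-to-right sweep suffices here because, once the reads are sorted by start position, a gap can only appear immediately to the right of the region covered so far.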

A critical trade-off associated with Step 2 now emerges. It is between the number of correctly read fragments and their length. A large collection of fragments is helpful, but if they are too short, that is, l is small, it is difficult to obtain a covering of the entire DNA target sequence².

This brings us to the main problem addressed (and solved) in this article. The goal is to optimize the key parameters associated with the sequencing process we described earlier (namely the density λ of fragments placed on the flow cell, the duration of the growth process, and the length of the fragment reads l) so as to ensure a prespecified probability, say δ, for example, 99%, of obtaining a covering of the target DNA sequence at minimal operational cost. Operational costs can be viewed as being in two main categories: (1) those associated with raw materials, for example, DNA copies, reagents, and flow cells; and (2) those associated with time, for example, the time spent on the sequencing machine or the execution time of the signal processing algorithms. In this article, we shall for simplicity adopt the flow cell area as the cost. Some reagent costs and some flow cell processing time costs might be proportional³ to the flow cell area. Given an initial rough calibration of the model based on available data, we are able to show how to optimize the steps/parameters of the sequencing process. We stress, however, that the article provides a general mathematical framework, which can accommodate other costs and objectives, for instance, costs that are proportional to the total number of strands (this might be the case for certain processing times) or to the number of DNA copies (raw material) rather than to the area. We will not discuss them here for the sake of brevity.

1.2. Related work

To our knowledge, this is the first work attempting to model and analyze the cluster growth process with a view to optimizing DNA sequencing cost/yield. The detailed simulations of the surface physics associated with the bridge amplification process (Mercier et al., 2003; Mercier and Slater, 2005) support the view that the disc/cluster processes introduced earlier, and used in the sequel, are well suited to this purpose. Work optimizing this process has taken place in industry, where empirical evidence and simple rules of thumb have been used. There is, however, a substantial body of work toward developing mathematical tools for analyzing random spatial processes [see, e.g., Stoyan et al. (1995) and some of our work (Baccelli and Blaszczyszyn, 2009)]. Indeed, this branch of mathematics is now ubiquitous, with applications in material science, cosmology, life sciences, and information theory, to name a few. Further developments of the mathematical foundations have recently been carried out by us in Baccelli and Blaszczyszyn (2009) and have proven to be invaluable, for example, for understanding fundamental characteristics of large wireless systems and optimizing their performance. Such stochastic geometry models have been embraced by academia and industry, providing insight into current and future technological developments. Indeed, the aim of this article is to show that such models may also play a role in the DNA sequencing setting.

As mentioned earlier in this section, sequence assembly may be performed with or without referring to a previously determined sequence (genome, transcriptome). De novo genome shotgun assembly is a computationally challenging task due to the presence of perfect repeat regions in the target and to the limited length and accuracy of the reads (Miller et al., 2010; Bresler et al., 2013; Motahari et al., 2013). In the reference-guided assembly setting, the reads are first aligned (i.e., mapped) to the reference, easing some of the difficulties faced by the de novo assembly (Delcher et al., 2002). However, the reference often contains errors and gaps, creating a different set of challenges and problems. In fact, if the sample is highly divergent from the reference or if the reference is missing large regions, it may even be preferable to use de novo assembly (Iqbal et al., 2012). While the development of methods for sequence assembly has received significant attention, the ultimate limits of their performance have been less explored. The pioneering work of Lander and Waterman (1988) provided the first simple mathematical model for the reassembly process. This has been followed by various refinements (Bresler et al., 2013; Motahari et al., 2013), but to our knowledge, none has provided a mathematical framework to compute the flow cell parameters needed to achieve a given likelihood of full target coverage. Moreover, no previous work has linked this end objective to the optimization of the end-to-end process as we will do in this article.

In Wendl and Wilson (2008) and related articles, the authors derive closed-form expressions for the first and second-order statistics of the sequencing depth, which pertains to the number of reads covering bases along the target. Specifically, the expected sequencing depth and its variance are expressed in terms of the parameters of the sequencing task (target and read length, number of reads) while taking into account practical issues, including edge effects and the paired-end nature of the reads. Interestingly the authors find that due to these effects, for small to moderately long targets, shorter reads may provide higher coverage per unit sequence depth than longer ones. This notion of sequencing depth can be seen as a coverage problem in the sense of Hall (1988) and Stoyan et al. (1995). However, this coverage problem is one dimensional (linked to what we call the reassembly problem below) and is quite different from the two-dimensional coverage processes taking place on the flow cell, which are analyzed in the present article, for example, the Boolean model (see again Hall, 1988; Stoyan et al., 1995) discussed below.

The article is organized as follows: we first describe the stochastic geometric models for the distribution of clusters on the flow cell resulting from the bridge amplification process. We then analyze the impact of the geometry of clusters on the achievable yield of fragment reads and discuss its optimization. The last section proposes a simplified model for the reassembly problem, which is based on a queuing model. This allows us to consider the end-to-end cost optimization of the sequencing process to meet a desired likelihood of coverage for the target DNA sequence.

2. Stochastic Geometry for Shotgun DNA Sequencing

In this section, we introduce basic geometric models for the cluster processes associated with the DNA fragments resulting from the bridge amplification procedure on the surface of the flow cell. These are the singleton cluster, shot-noise, and Voronoi models, respectively. These processes will be tied to the salient features of fragment-reading mechanisms.

In the singleton cluster process model, all clusters that intersect (or touch) another cluster are discarded. The retained clusters are roughly modeled as discs of radius r consisting of duplicates of the same DNA fragment. In the shot-noise process model, an attempt is made to read each cluster. Isolated clusters are as in the singleton cluster case. A cluster that is in contact with one or more other clusters is still analyzed as a disc of radius r; however, depending on the number and shape of the other clusters in contact, part of the light signal stemming from that disc creates an interference, which is treated as noise. If the signal dominates the interference/noise, one can still read this cluster. Finally, the Voronoi model addresses the (hypothetical and somewhat futuristic) scenario where one computes optimal masks that allow one to mask all clusters in contact with the tagged cluster and hence to cancel interference.

Some of these models will be used for the reassembly optimization alluded to above. For each case, we describe the mathematical approach used to evaluate its performance. This will also be used to give some yield optimizations of independent interest.

2.1. Random seed model and growth model

We consider the locations of the initial DNA fragments (the centers of the clusters or the seeds) to be a homogeneous Poisson point process N on $\mathbb{R}^2$, with intensity λ. This parameter is simply the number of seeds per unit area. Growth of clusters is assumed to be radially homogeneous, so if a cluster does not come into contact with any other cluster over a time r of growth, it will form a disc of radius r. The amplification process creates up to 1000 copies of the initial fragment in a disc of radius 0.5 μm (Illumina, 2015). This gives us a density a of $1000 / (\pi \cdot 0.5^2) = \frac{4000}{\pi}$ fragments per square micron.

If a cluster does come into contact with another cluster, we assume the growth in that direction stops, but growth continues in all other directions. As r approaches infinity, the configuration of clusters becomes the Voronoi diagram associated with the point process N (Stoyan et al., 1995). For r finite, the resulting shapes are those of the Johnson–Mehl growth model (Stoyan et al., 1995).

2.2. Read reliability model

As already explained, phasing problems occur because nucleotides occasionally fail to incorporate in particular duplicates or anneal to the base pair right next in line. These duplicates are then out of sync with the rest of the duplicates and give off a different color signal. So, even though amplifying the fragments gives a much stronger signal, there is noise due to these out-of-phase duplicates limiting the accuracy of the reads.

To model this in a single cluster, let X_l denote the number of copies of the original DNA fragment that remain in phase after l steps of the process. Let p be the probability that a DNA fragment gets out of phase at one step.

The random variable X_l has a binomial distribution, where the number of trials is the number of DNA fragment copies in a cluster and p(l):=(1 − p)^l is the probability that a fragment remains in phase after l steps, that is, the probability of success.

At first glance, it makes sense to require that $p(l) > \frac{1}{2}$ (so that on average more than half the duplicates remain in phase) to have a correct read. However, this does not capture certain phenomena, for example, the fact that as the radius becomes very small, fluctuations are high around the mean. We will hence require that the number of duplicates remaining in phase is above half by some positive margin. The probability of a correct read will then rather be $\mathbb{P}(X_l \ge \frac{a \pi r^2}{2} + \epsilon)$ for some positive $\epsilon$, which is an important complementary parameter of our model. The value of $\epsilon$ used here is 10, since the output yields using this value are on the same order as the density of clusters achieved by Illumina technology (Illumina, 2015).
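As an illustration of this read reliability criterion, the sketch below evaluates the probability that the number of in-phase copies is at least aπr²/2 + ε, exactly from the binomial distribution. The function names, the rounding of aπr² to an integer number of trials, and the per-step dephasing probability p = 0.005 used in the example call are assumptions made here for illustration, not calibrated values.

```python
import math


def binom_tail(n, q, k_min):
    """P(X >= k_min) for X ~ Binomial(n, q), summed in log space
    so that large n does not overflow."""
    if k_min <= 0:
        return 1.0
    if k_min > n:
        return 0.0
    if q <= 0.0:
        return 0.0
    if q >= 1.0:
        return 1.0
    log_q, log_1mq = math.log(q), math.log1p(-q)
    total = 0.0
    for k in range(k_min, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1)
                   - math.lgamma(n - k + 1)
                   + k * log_q + (n - k) * log_1mq)
        total += math.exp(log_pmf)
    return min(total, 1.0)


def prob_correct_read(r, l, p, a=4000 / math.pi, eps=10):
    """Probability that at least half the copies, plus a margin eps,
    are still in phase after l steps, for a disc of radius r (microns)."""
    n = round(a * math.pi * r ** 2)   # assumption: integer number of copies
    q = (1.0 - p) ** l                # p(l) = (1 - p)^l
    k_min = math.ceil(n / 2 + eps)    # threshold computed from the rounded n
    return binom_tail(n, q, k_min)


if __name__ == "__main__":
    # p = 0.005 is a hypothetical dephasing probability.
    for l in (50, 100, 150):
        print(l, prob_correct_read(r=0.5, l=l, p=0.005))
```

With r = 0.5 μm and a = 4000/π, the cluster holds 1000 copies, so the read is declared correct when at least 510 of them remain in phase.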

2.3. The singleton cluster model

For a cluster with unimpeded growth over time r, the number of DNA fragments in a cluster is aπr², so $X_l \sim B(a \pi r^2, p(l))$.

When the number of trials is large, the binomial distribution can be approximated by the normal distribution. In the singleton case, $$X_l \sim \mathcal{N}\left(a \pi r^2 p(l),\ \sqrt{a \pi r^2 p(l)(1 - p(l))}\right).$$

A cluster centered at $y$ in $N$ is isolated after time $r$ if the ball centered at $y$ with radius $2r$ does not contain any other point of N. A homogeneous Poisson point process is stationary, so we can consider a typical ball centered at 0. Given intensity $\lambda$ and radius $r$, we can then calculate the intensity of isolated clusters $\lambda_i(\lambda, r)$. By Slivnyak's theorem, conditional on a typical cluster at the origin, the distribution of other cluster centers, i.e., the reduced Palm distribution, is also Poisson, so that $$\lambda_i(\lambda, r) = \lambda \mathbb{P}^0(N^!(B(0, 2r)) = 0) = \lambda \mathbb{P}(N(B(0, 2r)) = 0) = \lambda e^{-4 \lambda \pi r^2}.$$
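The closed form for the intensity of isolated clusters is easy to explore numerically. The sketch below (function names are hypothetical) evaluates λe^{−4λπr²} and uses the elementary observation that, for fixed r, this expression is maximized over λ at λ = 1/(4πr²), obtained by setting its derivative in λ to zero. This is offered as a sanity check on the formula, not as the optimization carried out in the sequel.

```python
import math


def isolated_intensity(lam, r):
    """Intensity of isolated clusters: lam * exp(-4 * lam * pi * r^2),
    i.e., lam times the probability that B(0, 2r) holds no other seed."""
    return lam * math.exp(-4.0 * lam * math.pi * r ** 2)


def best_seed_intensity(r):
    """Seed intensity maximizing isolated_intensity for fixed r.

    d/dlam [lam * exp(-c * lam)] = (1 - c * lam) * exp(-c * lam) = 0
    with c = 4 * pi * r^2 gives lam = 1 / c.
    """
    return 1.0 / (4.0 * math.pi * r ** 2)


if __name__ == "__main__":
    r = 0.5  # microns, the radius quoted in the text
    lam_star = best_seed_intensity(r)
    print(lam_star, isolated_intensity(lam_star, r))
```

At the maximizing intensity, a fraction e^{−1} of the seeds are isolated, whatever the value of r.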

The overall fragment yield of singletons is the intensity of the isolated clusters times the probability that a cluster will enjoy a correct read: $$\lambda_y(\lambda, r, l) = \lambda e^{-4 \lambda \pi r^2}\, \mathbb{P}\left(X_l \ge \frac{a \pi r^2}{2} + \epsilon\right).$$
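The yield formula can be sketched in a few lines of Python. The version below uses the normal approximation of X_l rather than the exact binomial law; the function names, the candidate grid of radii, and the dephasing probability p supplied by the caller are illustrative assumptions, while a = 4000/π and ε = 10 are the values quoted in the text.

```python
import math
from statistics import NormalDist

A_DENSITY = 4000.0 / math.pi  # fragments per square micron (from the text)


def read_prob_normal(r, l, p, eps=10.0, a=A_DENSITY):
    """Normal approximation of P(X_l >= a*pi*r^2/2 + eps)."""
    n = a * math.pi * r ** 2
    q = (1.0 - p) ** l                      # p(l) = (1 - p)^l
    threshold = n / 2.0 + eps
    std = math.sqrt(n * q * (1.0 - q))      # binomial standard deviation
    if std == 0.0:                          # degenerate case, e.g. l = 0
        return 1.0 if n * q >= threshold else 0.0
    return 1.0 - NormalDist(n * q, std).cdf(threshold)


def singleton_yield(lam, r, l, p, eps=10.0):
    """lam * exp(-4*lam*pi*r^2) * P(correct read), per unit area."""
    isolated = lam * math.exp(-4.0 * lam * math.pi * r ** 2)
    return isolated * read_prob_normal(r, l, p, eps)


def best_radius(lam, l, p, radii):
    """Radius on a user-supplied grid that maximizes the yield."""
    return max(radii, key=lambda r: singleton_yield(lam, r, l, p))


if __name__ == "__main__":
    grid = [0.1, 0.2, 0.3, 0.4, 0.5]
    # lam = 0.3 seeds per square micron and p = 0.005 are hypothetical.
    print(best_radius(lam=0.3, l=100, p=0.005, radii=grid))
```

The grid search makes the trade-off in r visible: small discs contain too few copies to clear the margin ε, whereas large discs are rarely isolated.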

2.4. The shot-noise model

Using only isolated clusters is clearly suboptimal. We consider here the situation where all clusters are used. In this case, for each cluster, the amount of interference from contact with other clusters during the growth/duplication process has to be taken into account. According to our growth assumptions, the area of the typical cluster (assumed to have its seed located at 0) is $\vert V_0 \cap B(0, r) \vert$, where $\vert \cdot \vert$ is Lebesgue measure, $V_0$ is the Voronoi cell of the point at 0, and $B(x, r)$ is the ball of center $x$ and radius $r$. In this study, we use a lower bound for the area that is easier to calculate. The interference encountered from another cluster centered at $x_i \in N$ is considered to be half of the area of the overlap between the discs $B(0, r)$ and $B(x_i, r)$. We take the total interference for the typical cluster to be the sum of these areas over all surrounding clusters. This is in fact an upper bound on the actual interference; for example, triple intersections are counted twice (Fig. 1).

FIG. 1.

Cluster interference.

This interference upper bound can be described in terms of a shot-noise field $I_N$ defined on $\mathbb{R}^2$ as a functional of our Poisson point process $N$ with response function $\alpha_r(x) = \frac{1}{2} \vert B(0, r) \cap B(x, r) \vert$. For fixed $r$, $\alpha_r$ depends only on the distance $\| x \|$, so we write $\alpha_r(\| x \|)$. The total interference is $I \equiv I_N = \int_{\mathbb{R}^2} \alpha_r(\| x \|)\, N(dx)$.
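The response function has an elementary closed form, since the overlap of two discs of equal radius is a lens. The Python sketch below evaluates $\alpha_r$ and uses Campbell's formula to compute the mean interference, which for a homogeneous Poisson process of intensity $\lambda$ works out to $\lambda \pi^2 r^4 / 2$. The function names and the midpoint integration are illustrative choices, not taken from the original analysis.

```python
import math

def alpha_r(v, r):
    """Half the overlap area of two discs of radius r whose centers are v apart."""
    if v >= 2.0 * r:
        return 0.0
    lens = (2.0 * r * r * math.acos(v / (2.0 * r))
            - 0.5 * v * math.sqrt(4.0 * r * r - v * v))
    return 0.5 * lens

def mean_interference(lam, r, steps=20000):
    """Campbell's formula: E[I] = 2*pi*lam * int_0^{2r} alpha_r(v) v dv (midpoint rule)."""
    h = 2.0 * r / steps
    total = 0.0
    for k in range(steps):
        v = (k + 0.5) * h
        total += alpha_r(v, r) * v
    return 2.0 * math.pi * lam * total * h

lam, r = 2.0, 0.5  # illustrative values
numeric = mean_interference(lam, r)
closed_form = lam * math.pi ** 2 * r ** 4 / 2.0
print(numeric, closed_form)
```

At distance zero the two discs coincide, so $\alpha_r(0) = \pi r^2 / 2$, and the function vanishes once the centers are at least $2r$ apart.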

The Laplace functional of the interference is as follows:
\begin{align*} L_I(s) = \exp\left( - \int_{\mathbb{R}^2} \left( 1 - e^{-s\alpha_r(\| x \|)} \right) \lambda(dx) \right). \end{align*}

Overlap with the typical cluster $B(0, r)$ only occurs for clusters with centers contained in the ball of radius $2r$ around 0, so we consider only the interference on $B(0, 2r)$. Switching to polar coordinates, the Laplace transform becomes
\begin{align*} L_I(s) = \exp\left( -2\pi\lambda \int_0^{2r} \left( 1 - e^{-s\alpha_r(v)} \right) v\, dv \right). \end{align*}
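The polar-coordinate form is straightforward to evaluate numerically. The following Python sketch does so with a midpoint rule and checks two limits: $L_I(0) = 1$, and $L_I(s) \to e^{-4\pi\lambda r^2}$ as $s \to \infty$, which is the probability that no other seed falls within $2r$ (the isolation probability of the singleton model). The helper names, step count, and parameter values are assumptions made for illustration.

```python
import math

def alpha_r(v, r):
    """Half the overlap area of two discs of radius r whose centers are v apart."""
    if v >= 2.0 * r:
        return 0.0
    lens = (2.0 * r * r * math.acos(v / (2.0 * r))
            - 0.5 * v * math.sqrt(4.0 * r * r - v * v))
    return 0.5 * lens

def laplace_interference(s, lam, r, steps=4000):
    """L_I(s) = exp(-2*pi*lam * int_0^{2r} (1 - exp(-s*alpha_r(v))) v dv), midpoint rule."""
    h = 2.0 * r / steps
    integral = 0.0
    for k in range(steps):
        v = (k + 0.5) * h
        integral += (1.0 - math.exp(-s * alpha_r(v, r))) * v
    return math.exp(-2.0 * math.pi * lam * integral * h)

lam, r = 2.0, 0.5  # illustrative values
at_zero = laplace_interference(0.0, lam, r)
at_large_s = laplace_interference(1e9, lam, r)
isolation_probability = math.exp(-4.0 * math.pi * lam * r * r)
print(at_zero, at_large_s, isolation_probability)
```

The slope at the origin gives the mean interference: $-(L_I(\varepsilon) - 1)/\varepsilon$ approaches $\lambda \pi^2 r^4 / 2$ for small $\varepsilon$.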

The random variable $X_l$ (the number of copies of the original DNA fragment that remain in phase after $l$ steps) is now binomial with the number of trials depending on the interference $I$. Since the interference comes from fragments that are not copies of the seed of the tagged cluster, this area contains no potential in-phase fragments. The area occupied by fragments of interest is $\pi r^2 - I$ and thus $X_l(I) \sim \mathrm{Bin}(a(\pi r^2 - I), p(l))$. The cluster is still analyzed as a disk of radius $r$, so the number of in-phase DNA fragments needed for a correct read remains at least $\frac{1}{2} a\pi r^2 + \epsilon$. The probability of a correct reading given interference $I = x$ is
\begin{align*} g(x) = \mathbb{P}\left( X_l(x) > \frac{1}{2} a\pi r^2 + \epsilon \,\Big\vert\, I = x \right) \end{align*}

and the fragment yield is
\begin{align*} \lambda_y(\lambda, r, l) = \lambda \int g(x)\, f(dx), \end{align*}

where f is the law of I, which is known through its Laplace transform (Stoyan et al., 1995). The computation of this yield, which is based on Fourier techniques, is discussed in the Appendix of O'Reilly et al. (2015).

2.5. The Voronoi model

In this subsection, we consider an optimal scenario for collecting signal reads using our assumptions about cluster growth. Clusters are allowed to grow until they have formed a Voronoi tessellation. Then, optimally sized masks, each having the shape of a Voronoi cell, are used to read the signal. In this scenario, no interference from neighboring clusters is present, and the only variable to optimize is the intensity λ of the underlying point process.

The closed form of the distribution for the area of Voronoi cells is unknown, but it can be approximated by the generalized Gamma density:
\begin{align*} f_{\lambda A}(x) = \frac{\gamma \chi^{\nu/\gamma}}{\Gamma(\nu/\gamma)}\, x^{\nu - 1} \exp(-\chi x^{\gamma}) \quad \mathrm{for}\ x \ge 0. \end{align*}

This is the approximate distribution for the normalized cell size λA, where A denotes the area. For λ = 1, good choices for the parameters are γ = 1.08, ν = 3.31, and χ = 3.03 (Stoyan et al., 1995).

For a general intensity λ, the area distribution is given by
\begin{align*} f_A(x) = \lambda f_{\lambda A}(\lambda x) = \lambda\, \frac{\gamma \chi^{\nu/\gamma}}{\Gamma(\nu/\gamma)}\, (\lambda x)^{\nu - 1} \exp\left( -\chi (\lambda x)^{\gamma} \right), \end{align*}

for x ≥ 0, where γ = 1.08, ν = 3.31, and χ = 3.03. The expected fragment yield for the Voronoi case is then
\begin{align*} \lambda_y(\lambda, l) = \lambda \int_0^{\infty} \mathbb{P}\left( X_l > \frac{aA}{2} + \epsilon \,\Big\vert\, A = x \right) f_A(x)\, dx. \end{align*}
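The generalized Gamma approximation can be sanity-checked numerically. The Python sketch below writes the normalized density in the standard generalized Gamma form, with $x^{\gamma}$ inside the exponential (the form for which the stated prefactor integrates to one), rescales it to a general intensity, and checks that the total mass is 1 and that the mean cell area is close to $1/\lambda$, the exact mean area of a typical Poisson-Voronoi cell. The function names and the midpoint integrator are illustrative choices, not part of the original analysis.

```python
import math

GAMMA, NU, CHI = 1.08, 3.31, 3.03  # parameters quoted from Stoyan et al. (1995)

def normalized_area_density(x):
    """Generalized Gamma density of the normalized cell area lam * A."""
    if x <= 0.0:
        return 0.0
    norm = GAMMA * CHI ** (NU / GAMMA) / math.gamma(NU / GAMMA)
    return norm * x ** (NU - 1.0) * math.exp(-CHI * x ** GAMMA)

def area_density(x, lam):
    """Density of the cell area A at intensity lam, by change of variables."""
    return lam * normalized_area_density(lam * x)

def midpoint_integral(fn, a, b, steps=20000):
    """Plain midpoint rule on [a, b]."""
    h = (b - a) / steps
    return h * sum(fn(a + (k + 0.5) * h) for k in range(steps))

lam = 3.7478          # illustrative intensity (the Voronoi value in Table 1)
upper = 12.0 / lam    # the density is negligible beyond 12 normalized units
mass = midpoint_integral(lambda x: area_density(x, lam), 0.0, upper)
mean_area = midpoint_integral(lambda x: x * area_density(x, lam), 0.0, upper)
print("total mass:", mass)
print("lam * mean area:", lam * mean_area)
```

Because the parameters are rounded to two decimals, the product of intensity and mean area is only approximately 1.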

3. Fragment and Letter Yield

In this section, we first study the fragment yield, namely the mean number of correctly read fragments per unit space of the flow cell. We then study the letter yield, which is based on the number of letters correctly read.

3.1. Numerical results on fragment yield

We use the mathematical expressions obtained above to optimize the yield in all three models for l fixed. For the singleton cluster and the shot-noise model, we optimize over the radius r and the intensity λ. For the Voronoi model, the optimization is over λ. Tables 1 and 2 show optimal parameters and fragment yields for l = 200 and l = 150.

Table 1.

Optimal Parameters for l = 200

Value	Singletons	Shot-noise	Voronoi
λ	1.3981	1.9143	3.7478
r	0.2386	0.2447	n/a
F-Yield (per square micron)	0.2844	0.4145	2.8422

Table 2.

Optimal Parameters for l = 150

Value	Singletons	Shot-noise	Voronoi
λ	3.2625	4.6031	8.6723
r	0.1562	0.1686	n/a
F-Yield (per square micron)	0.9146	1.5704	8.6108

For l = 200, a 45.7% increase in optimal fragment yield can be obtained when considering all clusters and not just the singleton ones. This increase in yield comes with a slightly larger radius and higher intensity than the optimal parameters for singleton clusters. This corresponds to increasing the amount of time for replication and increasing the number of initial DNA fragments spread over the flow cell. This will mean more clusters overlap with their neighbors, but many more correct reads can still be made for clusters that only run into their neighbors late in the growth stage.

For each case, decreasing l to 150 from 200 resulted in a smaller optimal radius and a larger optimal intensity. With a smaller l, a fragment is more likely to remain in-phase, making it easier for in-phase fragments to comprise half the cluster plus the fixed margin ε. This allows clusters to be smaller and more densely packed to obtain a higher yield.

The percent increases in the fragment yield obtained when switching from l = 200 to l = 150 are 221.59% for singletons, 278.87% for shot-noise, and 202.96% for Voronoi.
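These percentages follow directly from the F-yield rows of Tables 1 and 2. A short Python check, with the tabulated values copied in by hand and an illustrative helper name:

```python
# F-yields per square micron, copied from Tables 1 and 2.
yield_200 = {"Singletons": 0.2844, "Shot-noise": 0.4145, "Voronoi": 2.8422}
yield_150 = {"Singletons": 0.9146, "Shot-noise": 1.5704, "Voronoi": 8.6108}

def percent_increase(old, new):
    """Relative increase from old to new, in percent."""
    return 100.0 * (new - old) / old

# Gain from using all clusters instead of singletons only, at l = 200.
gain_all_clusters = percent_increase(yield_200["Singletons"], yield_200["Shot-noise"])

# Gain from shortening the read length from 200 to 150, per model.
gain_shorter_reads = {
    model: percent_increase(yield_200[model], yield_150[model])
    for model in yield_200
}

print(round(gain_all_clusters, 1))
for model, gain in gain_shorter_reads.items():
    print(model, round(gain, 2))
```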

3.2. Numerical results on letter yield

In view of the differences between l = 200 and l = 150, it makes sense to consider the optimal letter yield, namely lλ_y(λ, r, l), the mean number of letters correctly deciphered per unit space. The optimization now takes place with respect to l as well.

We see in Table 3 that the optimizing value of l is actually somewhere around 100, which is shorter than the read length provided by, for example, Illumina's HiSeq sequencing platforms.

Table 3.

Optimal Parameters for Letter Yield

Value	Singletons	Shot-noise	Voronoi
λ	6.2320	9.5769	12.1160
r	0.1130	0.1233	n/a
l	91.4570	86.1716	119.8302
L-Yield (per square micron)	179.8272	344.2525	1535.7

The percent increases in the letter yield obtained when switching from l = 150 to the optimal l are 31.07% for singletons, 46.14% for shot-noise, and 18.89% for Voronoi. In this optimized setting, the letter yield of the shot-noise case is close to twice that of the singleton cluster case.
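The comparison can be recomputed from the tables: the letter yield at l = 150 is 150 times the F-yield of Table 2, and the optimized letter yield is the last row of Table 3. The Python sketch below does this arithmetic; the variable names are illustrative, and small differences in the last digit can arise from rounding in the tabulated values.

```python
# F-yields at l = 150 (Table 2) and optimized letter yields (Table 3).
f_yield_150 = {"Singletons": 0.9146, "Shot-noise": 1.5704, "Voronoi": 8.6108}
letter_yield_opt = {"Singletons": 179.8272, "Shot-noise": 344.2525, "Voronoi": 1535.7}

# Letter yield = read length * fragment yield.
letter_yield_150 = {model: 150 * y for model, y in f_yield_150.items()}

letter_gain = {
    model: 100.0 * (letter_yield_opt[model] / letter_yield_150[model] - 1.0)
    for model in f_yield_150
}
shot_noise_vs_singletons = letter_yield_opt["Shot-noise"] / letter_yield_opt["Singletons"]

for model, gain in letter_gain.items():
    print(model, round(gain, 2))
print(round(shot_noise_vs_singletons, 2))
```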

4. Reassembly Model and Optimization

This section is focused on the optimization of the probability of reassembly of the original DNA sequence.

4.1. Reassembly model

The reassembly question can be formulated in terms of n, the number of fragments in the genomic library (fragments correctly deciphered); L, the DNA sequence length in base pairs (for the human genome, L = 3 billion); and l, the length of the fragments in the same unit. We see L as a segment on the real line and we assume that the fragment starting points form a Poisson point process on the real line with parameter $\lambda = \frac{n}{L}$.

As already explained, we are interested in the probability of complete reassembly. We will first reduce this to the probability that all letters are covered.
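As a first-order illustration of coverage under the Poisson assumption: a letter at position x is covered exactly when some fragment starts in the window of length l ending at x, so it is missed with probability $e^{-\lambda l}$ (ignoring edge effects near the ends of the sequence). This is a standard consequence of the model rather than a result stated in the article. The Python sketch below evaluates it; the function names and the example library size are hypothetical.

```python
import math

def uncovered_fraction(n, l, L):
    """P(a given letter is covered by no fragment) under Poisson starts of rate n / L."""
    lam = n / L
    return math.exp(-lam * l)

def expected_uncovered_letters(n, l, L):
    """Mean number of uncovered letters, ignoring edge effects."""
    return L * uncovered_fraction(n, l, L)

L_genome = 3_000_000_000  # human genome length in base pairs, as in the text
l_read = 100              # illustrative read length
n_reads = 900_000_000     # hypothetical library size, giving depth n * l / L = 30

print(uncovered_fraction(n_reads, l_read, L_genome))
print(expected_uncovered_letters(n_reads, l_read, L_genome))
```

Per-letter coverage is necessary but not the same event as full coverage of the whole sequence, which is why the busy-period analysis is needed.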

4.2. Analytic expression for the reassembly probability

Within the Poisson setting described above, this can be reduced to a queuing theory question: consider an M/D/∞ queue, namely a queue with Poisson arrivals, an infinite number of servers, and a constant service time of length l. The probability of reassembly is then just the probability that the busy period in such a queue exceeds L. The distribution of the busy period can be determined through its Laplace transform.

Let $N$ be a Poisson point process on $\mathbb{R}$ with parameter $\lambda$. Let $\{ T_k \}_{k \in \mathbb{N}}$ be the following random sequence: $T_0 = 0$ and
\begin{align*} T_k = \begin{cases} T_{k-1} + l, & \mathrm{if}\ N(T_{k-1}, T_{k-1} + l) = 0 \\ \max\limits_{x \in N,\ x \le T_{k-1} + l} x, & \mathrm{if}\ N(T_{k-1}, T_{k-1} + l) > 0, \end{cases} \quad \mathrm{for}\ k \ge 1. \end{align*}

The busy period of the queue is $T \equiv T_K$, defined by
\begin{align*} T_K = \begin{cases} l, & \mathrm{if}\ K = 1 \\ l + \sum_{i=1}^{K-1} ( \xi_i \vert \xi_i < l ), & \mathrm{if}\ K > 1, \end{cases} \end{align*}

where $K$ is the random variable $K = \min\{ n \in \mathbb{N} : T_n - T_{n-1} = l \}$ and the $\xi_i$ are i.i.d. random variables with distribution equal to that of the last Poisson arrival in the interval $(0, l)$. We claim that the $\xi_i$ are equal in distribution to $l - ( \eta \vert \eta < l )$, where $\eta \sim \exp(\lambda)$. Indeed, the distribution of the distance of the first point in a Poisson point process on $\mathbb{R}$ from 0 is the same looking forward and looking backward. By this and translation invariance, $l - \min\limits_{x \in N : x \in (0, l)} \vert x \vert \overset{(d)}{=} l - \min\limits_{x \in N : x \in (-l, 0)} \vert x \vert \overset{(d)}{=} \max\limits_{x \in N : x \in (-l, 0)} x + l \overset{(d)}{=} \max\limits_{x \in N : x \in (0, l)} x$. Now, let $\eta \sim \exp(\lambda)$.
The Laplace transform of $( \eta \vert \eta < l )$ is
\begin{align*} \Psi_{\eta \vert \eta < l}(s) &= \mathbb{E}( e^{-s\eta} \vert \eta < l ) = \frac{\mathbb{E}( e^{-s\eta} 1_{\{ \eta < l \}} )}{\mathbb{P}( \eta < l )} \\ &= \frac{1}{1 - e^{-\lambda l}} \int_0^l \lambda e^{-(s + \lambda) x}\, dx = \frac{1}{1 - e^{-\lambda l}} \left[ \frac{\lambda}{s + \lambda} \left( 1 - e^{-(s + \lambda) l} \right) \right]. \end{align*}
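The closed form for the truncated exponential can be verified by integrating the defining expression numerically. A minimal Python sketch, with illustrative function names and parameter values:

```python
import math

def psi_closed(s, lam, l):
    """Closed-form Laplace transform of (eta | eta < l), eta ~ exp(lam); requires s != -lam."""
    return (lam / (s + lam)) * (1.0 - math.exp(-(s + lam) * l)) / (1.0 - math.exp(-lam * l))

def psi_numeric(s, lam, l, steps=20000):
    """Midpoint-rule evaluation of int_0^l lam * exp(-(s + lam) x) dx / P(eta < l)."""
    h = l / steps
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * h
        total += lam * math.exp(-(s + lam) * x)
    return total * h / (1.0 - math.exp(-lam * l))

lam, l = 2.0, 1.5  # illustrative rate and fragment length
for s in (0.0, 0.7, -0.5):
    print(s, psi_closed(s, lam, l), psi_numeric(s, lam, l))
```

The negative argument is included because the transform is evaluated at $-s$ in the next step of the derivation.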

Now, if $K = 1$, then $T_K = l$. So, $\mathbb{E}[ e^{-sT_K} \vert K = 1 ] = e^{-sl}$. For $k > 1$, $\mathbb{E}[ e^{-sT_K} \vert K = k ]$ is equal to
\begin{align*} \mathbb{E}\left[ e^{-s\left( l + \sum_{i=1}^{k-1} \left( l - ( \eta_i \vert \eta_i < l ) \right) \right)} \right] &= e^{-sl} \prod_{i=1}^{k-1} e^{-sl}\, \mathbb{E}\left[ e^{s\eta_i} \vert \eta_i < l \right] \\ &= e^{-sl} \left( e^{-sl}\, \Psi_{\eta \vert \eta < l}(-s) \right)^{k-1} = e^{-sl} \left( \frac{\lambda}{\lambda - s} \left[ \frac{e^{-sl} - e^{-\lambda l}}{1 - e^{-\lambda l}} \right] \right)^{k-1}. \end{align*}

Since $K$ is a geometric random variable with success probability $p = \mathbb{P}( N(0, l) = 0 ) = e^{-\lambda l}$, the Laplace transform of the busy period $T$ is
\begin{align*} \Psi_T(s) = \sum_{k=1}^{\infty} \mathbb{E}[ e^{-sT_K} \vert K = k ]\, \mathbb{P}(K = k) = \frac{e^{-(\lambda + s) l} (\lambda - s)}{\lambda - s - \lambda ( e^{-sl} - e^{-\lambda l} )}. \end{align*}
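The summation is a geometric series, and its closed form can be checked by comparing a truncated sum with the displayed expression. The Python sketch below does this for one illustrative parameter choice; the function names and the truncation at 500 terms are assumptions made for the example, and $s$ must be chosen so that the series converges (and $s \ne \lambda$).

```python
import math

def conditional_transform(s, lam, l, k):
    """E[exp(-s T_K) | K = k]: exp(-s l) times the (k-1)-th power of the bracketed factor."""
    q = (lam / (lam - s)) * (math.exp(-s * l) - math.exp(-lam * l)) / (1.0 - math.exp(-lam * l))
    return math.exp(-s * l) * q ** (k - 1)

def busy_transform_series(s, lam, l, terms=500):
    """Truncated sum over k, with P(K = k) = p * (1 - p)^(k - 1) and p = exp(-lam * l)."""
    p = math.exp(-lam * l)
    return sum(
        conditional_transform(s, lam, l, k) * p * (1.0 - p) ** (k - 1)
        for k in range(1, terms + 1)
    )

def busy_transform_closed(s, lam, l):
    """Closed form of Psi_T(s) as displayed in the text."""
    numerator = math.exp(-(lam + s) * l) * (lam - s)
    denominator = lam - s - lam * (math.exp(-s * l) - math.exp(-lam * l))
    return numerator / denominator

lam, l, s = 2.0, 1.0, 0.5  # illustrative values
print(busy_transform_series(s, lam, l), busy_transform_closed(s, lam, l))
```

At $s = 0$ both expressions equal 1, as required of the Laplace transform of a probability distribution.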

We want to compute $\mathbb{P}( T > L )$, so we invert the Laplace transform of the CDF, which is easily calculated from the Laplace transform of the density: $\mathcal{L}_{F_T}(s) = \frac{1}{s} \Psi_T(s)$. The inversion is done numerically using the Euler inversion method (Abate and Whitt, 2006).
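To make the inversion step concrete, the following Python sketch implements the classical Euler algorithm: a trapezoidal discretization of the Bromwich integral, followed by Euler (binomial) averaging of the partial sums of the resulting alternating series. This is a generic variant with commonly quoted default parameters (A = 18.4, n = 15, m = 11), not necessarily the exact formulation or settings used for the results in this article. It is demonstrated on a transform with a known inverse rather than on $\Psi_T$.

```python
import math

def euler_invert(F, t, A=18.4, n=15, m=11):
    """Approximate f(t) from its Laplace transform F (callable on complex s), for t > 0."""
    def term(k):
        s = complex(A, 2.0 * k * math.pi) / (2.0 * t)
        return (-1) ** k * F(s).real

    # Partial sums s_j = term(0)/2 + term(1) + ... + term(j), for j = 1..n+m.
    partial = 0.5 * term(0)
    sums = []
    for k in range(1, n + m + 1):
        partial += term(k)
        sums.append(partial)

    # Euler averaging: binomial-weighted mean of s_n, ..., s_{n+m}.
    averaged = sum(math.comb(m, j) * sums[n - 1 + j] for j in range(m + 1)) / 2 ** m
    return math.exp(A / 2.0) / t * averaged

# Known pair: F(s) = 1 / (s (s + 1)) is the transform of the CDF 1 - exp(-t).
def cdf_transform(s):
    return 1.0 / (s * (s + 1.0))

t = 2.0
approx = euler_invert(cdf_transform, t)
exact = 1.0 - math.exp(-t)
print(approx, exact)
```

In the setting of the text, `F` would be the function $s \mapsto \Psi_T(s)/s$ and `t` the genome length $L$, with $\mathbb{P}(T > L)$ obtained as one minus the inverted value.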

4.3. Reassembly optimization

In our optimization, we ask the following question: For a given genome length, what is the smallest flow cell area for which the entire genome can be reassembled with high probability?

To answer this, let $A$ be the area (in square microns) of the flow cell where fragments are replicated and sequenced. Then, $n = A\lambda_l$, where $\lambda_l$ is the optimal yield per square micron for fragments of length $l$ as derived in the stochastic geometry section. Assume $L$ is given. Then, consider the $M/D/\infty$ queue with arrival rate $\frac{A\lambda_l}{L}$ and service time $l$.

For each $l$, we find the minimum $A$ such that the probability of reassembly, $\mathbb{P}(T > L)$, is greater than some threshold.

Finally, we optimize over $l$ to find the smallest required area. The optimal $l$ is the fragment length that requires the least flow cell area to obtain the desired probability of reassembly (Fig. 2 and Table 4).
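
The two-level search described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the procedure used to produce Fig. 2 or Table 4: it assumes the reassembly probability is nondecreasing in the area, finds the minimum area by bisection, and then scans a list of candidate fragment lengths. The yield curve `toy_yield` and the probability `toy_reassembly_prob` are hypothetical stand-ins chosen so the answer is known in closed form; in practice they would be replaced by the yield from the stochastic geometry section and by the inverted busy-period transform.

```python
import math


def min_area(l, L, yield_per_um2, reassembly_prob, threshold=0.99, rel_tol=1e-9):
    """Smallest area A with reassembly_prob(A * yield(l) / L, l, L) >= threshold.

    Assumes the probability is nondecreasing in A, so bisection applies.
    """
    lam_l = yield_per_um2(l)

    def ok(area):
        return reassembly_prob(area * lam_l / L, l, L) >= threshold

    hi = 1.0
    while not ok(hi):          # grow the bracket until the threshold is met
        hi *= 2.0
        if hi > 1e15:
            raise ValueError("threshold not reachable for this fragment length")
    lo = 0.0
    while hi - lo > rel_tol * hi:
        mid = 0.5 * (lo + hi)
        if ok(mid):
            hi = mid
        else:
            lo = mid
    return hi


def best_fragment_length(lengths, L, yield_per_um2, reassembly_prob, threshold=0.99):
    """Return (minimum area, fragment length) over the candidate lengths."""
    return min(
        (min_area(l, L, yield_per_um2, reassembly_prob, threshold), l)
        for l in lengths
    )


# Hypothetical stand-ins, NOT the article's model: yield decays with length,
# and the "probability" is a simple saturating function of rate * l.
def toy_yield(l):
    return math.exp(-l / 100.0)


def toy_reassembly_prob(rate, l, L):
    return 1.0 - math.exp(-rate * l)


if __name__ == "__main__":
    area, l_star = best_fragment_length(
        range(60, 141, 10), 100_000, toy_yield, toy_reassembly_prob
    )
    print(l_star, area)
```

With these toy inputs the minimum area for a given length is $\ln(100) \, L \, e^{l/100} / l$, which is smallest at $l = 100$, so the search can be verified against a closed form before plugging in the real model.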

FIG. 2.

Minimum area needed to achieve a probability of reassembly of 0.99 versus the length of the fragments for a genome of length 100,000.

Table 4.

Optimal Parameters for $\mathbb{P}(T > 100{,}000) \geq 0.99$

Parameter	Singletons	Shot-noise
Optimal $l$	97	92
Minimum $A$ (square microns)	6483	3088

5. Conclusion

This article establishes a connection, new to the best of our knowledge, between stochastic geometry and queuing theory on the one hand and fast DNA sequencing on the other. This connection allowed us to propose a simple model that captures the key steps of the fast sequencing process: segmentation of multiple copies of the DNA into random fragments; replication of the randomly placed fragments on the flow cell; spatial interactions between the resulting fragment clusters; reading of fragments through their cluster amplification, taking into account the possibility of read errors and interference between clusters; and, finally, assembly of the successfully read fragments. The model is analytically tractable and allowed us to quantify and optimize various notions of yield, including the yield of the end-to-end sequencing process, as a function of the model parameters. It also appears generic and flexible enough to support a series of increasingly realistic yet tractable variants for each step of the process and, eventually, a comprehensive quantitative theory for this class of sequencing problems.

Acknowledgments

The work of the first two authors was supported by a grant of the Simons Foundation (#197982 to UT Austin). The work of the first author was supported by the National Science Foundation Graduate Research Fellowship (Grant DGE-1110007).

Author Disclosure Statement

No competing financial interests exist.

References

Abate and Whitt. 2006. A unified framework for numerically inverting Laplace transforms. INFORMS J. Comput. 18, 408–421.

Baccelli and Blaszczyszyn. 2009. Stochastic Geometry and Wireless Networks. NoW Publishers. Boston; Delft.

Bentley, D.R., Balasubramanian, Swerdlow, H.P., et al. 2008. Accurate whole human genome sequencing using reversible terminator chemistry. Nature, 456, 53–59.

Bresler, Bresler, and Tse. 2013. Optimal assembly for high throughput shotgun sequencing. BMC Bioinformatics, 14 Suppl 5, S18.

Das and Vikalo. 2012. Onlinecall: Fast online parameter estimation and base calling for Illumina's next-generation sequencing. Bioinformatics, 28, 1677–1683.

Das and Vikalo. 2013. Base calling for high-throughput short-read sequencing: Dynamic programming solutions. BMC Bioinformatics, 14, 129.

Delcher, A.L., et al. 2002. Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res. 30, 2478–2483.

Erlich, Mitra, Delabastide, et al. 2008. Alta-cyclic: A self-optimizing base caller for next-generation sequencing. Nat. Methods, 5, 679–682.

Hall. 1988. Introduction to the Theory of Coverage Processes. Wiley. Series in Probability and Statistics.

Idury, R.M., and Waterman, M.S. 1995. A new algorithm for DNA sequence assembly. J. Comput. Biol. 2, 291–306.

Illumina. 2015. Illumina's white paper. Available at: http://www.illumina.com/documents/products/techspotlight

Iqbal, Caccamo, Turner, et al. 2012. De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat. Genet. 226–232.

Kao, W.-C., and Song, Y.S. 2011. naivebayescall: An efficient model-based base-calling algorithm for high-throughput sequencing. J. Comput. Biol. 18, 365–377.

Kao, W.-C., Stevens, and Song, Y.S. 2009. Bayescall: A model-based basecalling algorithm for high-throughput short-read sequencing. Genome Res. 19, 1884–1895.

Kircher, Stenzel, and Kelso. 2009. Improved base calling for the Illumina Genome Analyzer using machine learning strategies. Genome Biol. 10, R83.

Lander and Waterman. 1988. Genomic mapping by fingerprinting random clones: A mathematical analysis. Genomics, 2, 231–239.

Langmead, et al. 2009. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25.

Li and Durbin. 2009. Fast and accurate short-read alignment with Burrows-Wheeler transform. Bioinformatics, 25, 1754–1760.

Li and Durbin. 2010. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics, 26, 589–595.

Li, et al. 2008. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18, 1851–1858.

Lunter and Goodson. 2011. Stampy: A statistical algorithm for sensitive and fast mapping of Illumina sequence reads. Genome Res. 21, 936–939.

Mercier, J.-F., and Slater, G.W. 2005. Solid phase DNA amplification: A Brownian dynamics study of crowding effects. Biophys. J. 89, 32–42.

Mercier, J.-F., Slater, G.W., and Mayer. 2003. Solid phase DNA amplification: A simple Monte Carlo lattice model. Biophys. J. 85, 2075–2086.

Messing, Crea, and Seeburg, P.H. 1981. A system for shotgun DNA sequencing. Nucleic Acids Res. 9, 309–321.

Miller, J.R., Koren, and Sutton. 2010. Assembly algorithms for next-generation sequencing data. Genomics, 95, 315–327.

Motahari, Bresler, and Tse. 2013. Information theory of DNA shotgun sequencing. IEEE Trans. Inf. Theory, 59, 6273–6289.

Myers, E.W. 1995. Toward simplifying and accurately formulating fragment assembly. J. Comput. Biol. 2, 291–306.

O'Reilly, Baccelli, de Veciana, et al. 2015. End-to-end optimization of high throughput DNA sequencing. ArXiv e-prints, pp. 1–17.

Ruffalo, LaFramboise, and Koyuturk. 2011. Comparative analysis of algorithms for next-generation sequencing read alignment. Bioinformatics, 27, 2790–2796.

Stoyan, Kendall, W.S., and Mecke. 1995. Stochastic Geometry and its Applications, second edition. Wiley, Chichester, United Kingdom.

Venter, et al. 1998. Shotgun sequencing of the human genome. Science, 280, 1540–1542.

Venter, J.C., Smith, H.O., and Hood. 1996. A new strategy for genome sequencing. Nature, 381, 364–366.

Wendl, M.C., and Wilson, R.K. 2008. Aspects of coverage in medical DNA sequencing. BMC Bioinformatics, 9, 239.