An R Package for Nonparametric Inference on Dynamic Populations with Infinitely Many Types

Abstract

Fleming–Viot diffusions are widely used stochastic models for population dynamics that extend the celebrated Wright–Fisher diffusions. They describe the temporal evolution of the relative frequencies of the allelic types in an ideally infinite panmictic population, whose individuals undergo random genetic drift and at birth can mutate to a new allelic type drawn from a possibly infinite potential pool, independently of their parent. Recently, Bayesian nonparametric inference has been considered for this model when a finite sample of individuals is drawn from the population at several discrete time points. Previous works have fully described the relevant estimators for this problem, but current software is available only for the Wright–Fisher finite-dimensional case. Here, we provide software for the general case, overcoming some nontrivial computational challenges posed by this setting. The R package FVDDPpkg efficiently approximates the filtering and smoothing distribution for Fleming–Viot diffusions, given finite samples of individuals collected at different times. A suitable Monte Carlo approximation is also introduced in order to reduce the computational cost.

1. INTRODUCTION

Fleming–Viot (FV) diffusions are among the most widely used models in population genetics and stochastic population dynamics. See Ethier and Kurtz (1993) for an extensive review. They take values in the space of purely atomic probability measures defined on an arbitrary complete and separable metric space $Y$ , which in our application denotes the set of all possible allelic types. The process, denoted ${X_{t}, t \geq 0}$ , describes the temporal evolution of the relative frequencies of types in an ideally infinite population, when the individuals that compose the population undergo random genetic drift, due to panmictic random mating, and at birth can mutate to a yet unseen type, drawn from a possibly infinite potential pool. Fleming–Viot diffusion can also accommodate mathematical descriptions of other evolutionary forces such as natural selection or recombination, but in this article, we do not consider these cases. The time-homogeneous transition function, found by Ethier and Griffiths (1993), can be written as $P_{t} (x, d x') = \sum_{m = 0}^{\infty} d_{m}^{θ} (t) \int_{Y^{m}} Π_{α + \sum_{k = 1}^{m} δ_{y_{k}}} (d x') x (d y_{1}) \dots x (d y_{m}) .$ (1)

See Eq. 2 in the Supplementary Text S1 for the diffusion infinitesimal generator. Here $Π_{α}$ is the reversible measure of the FV diffusion and denotes the law of a Dirichlet process (Ferguson, 1973) with baseline measure $α = θ P_{0}$ , with $θ > 0$ being the overall mutation rate and P₀ a probability measure on $Y$ representing the potential pool of allelic types from which mutants are drawn. Furthermore, $d_{m}^{θ} (t)$ is the transition probability of a pure-death process on $ℤ_{+}$ which starts at infinity almost surely and jumps from m to m — 1 at rate $λ_{m} = m (θ + m - 1) / 2$ . These were computed in Griffiths (1980) and Tavaré (1984) and are in fact the transition probabilities of the block-counting process in Kingman’s coalescent with mutation, given by $d_{m}^{θ} (t) = {\begin{array}{l} \sum_{k \geq m} e^{- λ_{k} t} {(- 1)}^{k - m} (\begin{matrix} k \\ m \end{matrix}) \frac{(θ + 2 k - 1) {(θ + m)}_{(k - 1)}}{k!}, & m \geq 1, \\ 1 - \sum_{k \geq 1} e^{- λ_{k} t} {(- 1)}^{k - 1} \frac{(θ + 2 k - 1) {(θ)}_{(k - 1)}}{k!}, & m = 0, \end{array}$ (2)where ${(a)}_{(b)} = a (a + 1) \dots (a + b - 1)$ is the ascending factorial or Pochhammer symbol. A different representation of Eq. 1 highlights the role of the death process as latent variable which determines the correlation between X_s and $X_{s + t}$ (here s can be set equal to 0 as the process is time-homogeneous), and reads $\begin{array}{l} X_{t} | (D_{t} = m, Y_{0, 1}, \dots, Y_{0, m}) \sim Π_{α + \sum_{i = 1}^{m} δ_{Y_{0, i}}} Y_{0, i} | X_{0} \overset{iid}{\sim} X_{0} . \end{array}$ (3)

Here D_t is the above death process with transition probabilities $d_{m}^{θ} (t)$ . The realization of D_t determines the number $m \in ℤ_{+}$ of individuals whose type is drawn from the initial state X₀ and used in the distribution of the arrival state X_t. The intuition is that for small t a large m is more likely, so that X₀ and X_t will tend to be similar, whereas at $t \to \infty$ , m will be small or zero with high probability, and X₀ and X_t will be essentially uncorrelated. Cf. also Walker et al. (2007).

A projection of the above FV diffusion onto a measurable partition $A_{1}, \dots, A_{K}$ of the type space $Y$ yields a neutral Wright–Fisher (WF) diffusion with K types, cf. Etheridge (2000) and Feng (2010). Recently, Jenkins and Spanò (2017) and Sant et al. (2023) addressed the problem of generating exact draws from the transition functions of this and other WF diffusions, in particular addressing the intractability of Eq. 2, which also appears in the associated finite-dimensional special case of Eq. 1.

Given the increasing availability of population-level allelic data, it is often of interest to study the temporal evolution of the associated frequencies, which can be modeled through an FV or a WF diffusion, see, e.g., Tataru et al. (2017), Gory et al. (2018), and Paris et al. (2019) for recent works in this direction. From a statistical perspective, the problem can be formulated as a hidden Markov model, a popular inferential framework for temporally evolving quantities (Cappé et al., 2005). In such a setting, the FV process which models the evolving allelic frequencies is taken as the unobserved target of inference, and finitely many biological samples from the population are assumed to be collected at discrete times. Denote by $Y_{t_{k}, i}$ the type of the ith individual observed at time t_k, for $i = 1, \dots, n_{k}$ and times $0 = t_{0} < t_{1} < \dots < t_{p} = T$ . Given a dataset $Y_{0 : T} : = (Y_{0}, \dots, Y_{T})$ , where $Y_{k} : = (Y_{t_{k}, 1}, \dots, Y_{t_{k}, n_{k}})$ is a row vector, the task is then to make inferences on the process states, hence on the configuration of frequencies in the underlying population. Mathematically this amounts to fully describe the law of $X_{t} | Y_{0 : T}$ , which is typically called filtering distribution when t = T, or smoothing distribution when $0 \leq t < T$ . See Section 2 for a more detailed formalization of the setting. These inferential problems were fully solved and implemented by Papaspiliopoulos and Ruggiero (2014) and Kon Kam King et al. (2021, 2024) for the finite-dimensional WF special case. In the above infinite-dimensional framework, Papaspiliopoulos et al. (2016), Ascolani et al. (2021) and Ascolani et al. (2023) provided recursions that in principle allow to fully evaluate such distributions, showing that they are qualitatively similar to the formulation of Eq. 3, with the important difference that the latent variable D_t is substituted by another variable with finite state space, yielding in fact finite mixtures of laws of Dirichlet processes structurally similar to $\sum_{i \in I} v_{i} Π_{α_{i}}$ , where I has finite cardinality and $\sum_{i \in I} v_{i} = 1$ (see Theorem 1 in the Supplementary Text S1). These results surprisingly overcome the intractability of the death process D_t in Eq. 3, and in particular, the need to deal with the infinite series expansion for its transition probabilities, making the evaluation of these distributions possible with a finite computational effort, without the need for stochastic truncation or approximation strategies. However, sampling in practice from these distributions leads to non trivial computational challenges that are not present in the finite-dimensional subcase, mainly related to the growth in the number of observed types as more data are collected. This aspect reflects the growth of the cardinality of I in the above expression, making exact computations quickly infeasible. Another aspect that is peculiar to this setting is how the weights v_i in some of the distributions of interest depend on the support of the offspring distribution P₀, which from a modeling perspective can be taken as atomic or nonatomic.

Here, we present a simple and efficient Monte Carlo procedure to approximate the filtering and smoothing distributions of an FV process under the assumed data collection framework. These are implemented in the publicly available R package FVDDPpkg. The acronym for the package relates to another line of research on dependent Dirichlet processes (DDPs), which in Bayesian nonparametric statistics are considered as collections of correlated random probability measures with Dirichlet process marginals. See Quintana et al. (2022) for a recent review.

2. MODEL

We assume a hidden Markov model formulation, whereby the unobserved process (sometimes called signal) X_t is an FV diffusion with dynamics as in Eq. 1, denoted $X \sim FV (α)$ , and the observations, which provide the allelic types of the individuals, are conditionally iid given the current FV state, namely, $Y_{t_{k}, i} | X_{t_{k}} = x \overset{iid}{\sim} x$ . That is, when the current configuration of the type frequencies is the atomic probability measure x on $Y$ , so that this admits representation $x = \sum_{i = 1}^{\infty} v_{i} δ_{z_{i}}$ , with $\sum_{i \geq 1} v_{i} = 1$ and $z_{i} \in Y$ , then $Y_{t_{k}, i} = z_{i}$ with probability v_i. This essentially amounts to sampling without replacement when the population from which we draw is infinite.

Let now $X_{0} \sim Π_{α}$ be the initial state of the process. Recall that $α = θ P_{0}$ , with $θ > 0$ a fixed constant and P₀ a probability measure on $Y$ . Denote as above by $Y_{0 : T} : = (Y_{0}, \dots, Y_{T})$ the overall dataset, where $Y_{k} : = (Y_{t_{k}, 1}, \dots, Y_{t_{k}, n_{k}})$ for times $0 = t_{0} < t_{1} < \dots < t_{p} = T$ , and let K be the number of unique values (allelic types) in $Y_{0 : T}$ . We are interested in the filtering and smoothing distribution for X_t, which can be written (Papaspiliopoulos et al., 2016 and Ascolani et al. 2023) as $X_{t} | (Y_{0 : T}, M_{t} = m) \sim Π_{α + \sum_{k = 1}^{K} m_{k} δ_{y_{k}^{*}}}, p (M_{t} = m) = w_{m} (t) .$ (4)

Here $M_{t}$ is a certain death process on a finite state space $M \subset ℤ_{+}^{K}$ , where $m = (m_{1}, \dots, m_{K}) \in M$ describes the multiplicities of $(y_{1}^{*}, \dots, y_{K}^{*})$ , the unique values (types) in $Y_{0 : T}$ . These elements input information in the mixture components through the empirical measure $\sum_{k = 1}^{K} m_{k} δ_{y_{k}^{*}}$ . Notice that Eq. 4 is structurally similar to the representation of Eq. 3, with the key difference that the death process $M_{t}$ has a finite state space. Further details on this representation, including $M$ and $w_{m} (t)$ , can be found in the Supplementary Text S1.

These distributions allow one to make inferences on the evolution of the frequencies in the population, given the data, at any time of interest. Moreover, they can be used to evaluate the probability of observing again the already recorded allelic types in new individuals drawn from the population; see Proposition 2 in Ascolani et al. (2021) and Section 1 in Supplementary Text S1.

3. METHODS

The task of evaluating the right-hand side of Eq. 4 poses significant challenges. The probability weights $w_{m} (t)$ depend on the transition probabilities of a certain death process (informally, a projection of D_t in Eq. 1), whose computation can be numerically unstable and requires some care (Griffiths, 1984). Even assuming regularly spaced data collection times at intervals of length Δ, the cardinality of $M$ grows polynomially with the number of observed datapoints [cf. Section 3.2 in Ascolani et al. (2023)], thus storing all the $w_{m} (Δ)$ becomes demanding even with moderately large datasets. Finally, when t < T as in (4), the set of nodes m with strictly positive probability depends on the nature of baseline distribution P₀ [Proposition 3.6 in Ascolani et al. (2021)]. Specifically, if P₀ is nonatomic, $M$ depends on the unique values shared across different data collection times [cf. point C of Proposition 3.6 in Ascolani et al. (2021)].

An accurate approximation can nevertheless be obtained by exploiting different factors. First, the weights can be computed recursively. If $t \in (t_{j}, t_{j + 1})$ , Eq. 4 can be computed starting by the distributions of the signal given past and future observations used separately, i.e., $\begin{array}{l} L & (X_{t_{j}} | Y_{0 : j}) = \sum_{m_{1} \in M_{1}} u_{m_{1}} Π_{α + \sum_{k = 1}^{K} m_{1, k} δ_{y_{k}^{*}}}, \\ L & (X_{t_{j + 1}} | Y_{j + 1 : T}) = \sum_{m_{2} \in M_{2}} v_{m_{2}} Π_{α + \sum_{k = 1}^{K} m_{2, k} δ_{y_{k}^{*}}}, \end{array}$ (5)where $M_{1} \subset ℕ^{K}$ and $M_{2} \subset ℕ^{K}$ are suitable sets (see Sections 2 and 3 of the Supplementary Text S1). Below we focus on the (arguably harder) case of approximating the smoothing distribution, while the procedure for filtering is discussed in Section 4 of the Supplementary Text S1. Each weight $w_{m}$ in Eq. 4 for the smoothing distribution turns out to be $w_{m} = \sum_{m_{1} \in M_{1}} \sum_{m_{2} \in M_{2}} u_{m_{1}} v_{m_{2}} \sum_{k_{1}, k_{2} : k_{1} + n + k_{2} = m} {\tilde{w}}_{k_{1}, n, k_{2}}^{m_{1}, m_{2}},$ (6)with $u_{m_{1}}, v_{m_{2}}$ taken from (5) and where ${\tilde{w}}_{k_{1}, n, k_{2}}^{m_{1}, m_{2}} \propto p_{m_{1}, k_{1}} (t - t_{i}) p_{m_{2}, k_{2}} (t_{i + 1} - t) q (k_{1}, n, k_{2}) .$

Here, in turn the vector indices n are the multiplicities of the allelic types observed at the time t of interest (one can simply set n to be a vector of zeros if at time t no data are available), $p_{n, k} (t)$ are the transition probabilities of a death process on $ℤ_{+}^{K}$ which decreases from $m \in ℤ_{+}^{K}$ to $m - e_{j}$ at rate $λ_{m, j} : = m_{j} (| m | + θ - 1)$ , where $e_{j}$ denotes the canonical vector in the direction j, and q is a suitable function which depends on P₀. Weights in Eq. 6 represent the probability that, starting two independent death processes at the points of $M_{1}$ and $M_{2}$ , the arrival points $k_{1}$ and $k_{2}$ sum up to $m - n$ . This is then reweighted through the function q, which formalizes how the information contained in the data collected before and after t is combined. Further details are given in Section 3 of the Supplementary Text S1 (cf. Proposition 5).

Computing the transition probabilities $p_{n, k} (t)$ is the hardest task. Indeed, their exact value depends on a sum with alternating signs [see formula B40 in Kon Kam King et al. (2021)] and they assign most of the probability mass to few nodes, hence a large majority of elements in $M$ retains a low cumulative probability [cf. Figure 4 in Ascolani et al. (2021)]. Kon Kam King et al. (2021) (cf. Section 4.2) exploit this principle for WF diffusions through a pruning strategy. Here, we instead consider a Monte Carlo approximation of Eq. 6, leading to the following key part of the algorithm for approximating the terms $p_{n, k} (t)$ . Given Eq. 5: •

draw $m_{1}$ and $m_{2}$ with probability $u_{m_{1}}$ and $v_{m_{2}}$ respectively;

•

simulate two independent death processes with rates $λ_{m_{1}}$ and $λ_{m_{2}}$ over the time intervals $t - t_{j}$ and $t_{j + 1} - t$ respectively, where $λ_{m} : = (λ_{m, 1}, \dots, λ_{m, K})$ and $λ_{m, j}$ is as above;

•

return the two arrival nodes $k_{1}$ and $k_{2}$ .

The death process simulation can be performed by means of the Gillespie algorithm (Gillespie, 2007), as described in Supplementary Text S1. Once the above procedure is repeated for the desired number of particles, for a fixed m the unnormalized version of $w_{m}$ is approximated by the empirical average of $q (k_{1}, n, k_{2})$ , for all the sampled $(k_{1}, k_{2})$ such that $m = k_{1} + n + k_{2}$ . Normalizing the weights is straightforward as their cardinality is finite, and after normalization we can in addition apply a suitable pruning, that is all the weights below a certain threshold (which we chose no larger than $10^{- 9}$ ) are eliminated and a new normalization is performed. Further details can be found in Section 4 of the Supplementary Text S1.

The above strategy typically leads to mixtures with a relatively small number of components which collectively provide a good approximation to the true distribution. Figure 1 shows that the algorithm is able to reproduce efficiently large mixtures. The exact smoothing distribution in the left panel is composed of more than 30.000 components, many of which have an almost null weight: indeed 90%, 95%, and 99% of the mass is given by respectively 237, 432, and 1170 nodes with the largest weight. The approximate distribution whose nodes are represented in the right panel captures the same information using 12.500 components, with an average absolute error of order 5 × 10⁻⁶. In the figure, the x-axis depicts the cardinality of the vectors $m$ , showing that the vast majority of the probability mass is retained by nodes with few elements: this is coherent with the fact that the death process $M_{t}$ decreases very fast [cf. Figure 4 in Ascolani et al. (2021)]. See Section 6 of Supplementary Text S1 for further details on the simulation setting and the associated computational cost: in particular, Supplementary Figure S3 in Supplementary Text S1 depicts the growth of the (time and memory) cost as a function of the Monte Carlo samples. Clearly, one could choose a higher threshold value for pruning the mixture components, leading to more parsimonious mixtures and higher computational efficiency, at the cost of a larger approximation error. The feather-like shape in the figure shows that nodes corresponding to higher cardinalities have typically smaller weights. This behavior is mainly a product of the fact that the death process propagates the probability mass toward the empty node (the origin of the graph in Supplementary Fig. S1 in Supplementary Text S1). Hence the interplay between updates with new data, which moves the mass upward (cf. Supplementary Fig. S1 in Supplementary Text S1), with the temporal propagation which operates for both distributions in (5), which becomes part of the smoothing distribution, ends up keeping most of the probability mass in nodes that are relatively near the origin.

FIG. 1.

Comparison between probability mass assigned to mixture components in the exact (left) and approximate (right) smoothing distribution for FVDDP (Fleming-Viot Dependent Dirichlet Process). The simulation is based on synthetic data from 3 different times, each with 10 observations, and 10⁶ particles for the death process simulations. The cardinality $| m | = \sum_{j = 1}^{K} m_{j}$ associated with each node is depicted on the x-axis, with the corresponding probability weight (in log scale) on the y-axis. Observations are color-coded based on the number of active types ( $card {j : m_{j} \neq 0}$ ), with darker colors indicating lower values. Above the dashed lines lie the 95% and 99% of the total mass, respectively. The figure shows that the exact distribution can be efficiently approximated through simulation of the death processes by mixture with a significantly lower number of components having non-negligible probability mass.

4. DISCUSSION

The R package FVDDPpkg allows one to compute the filtering and smoothing distributions for the above FV-driven hidden Markov model, with an arbitrary set of observation times. It also allows one to evaluate the probability of observing the already recorded allelic type in new individuals to be drawn from the population, through the corresponding predictive distributions (see Section 1 of the Supplementary Text S1 for more details). The package efficiently implements the Monte Carlo approximation discussed earlier by exploiting the dynamic programming paradigm and optimized routines in the C++ language [via the Rcpp package, Eddelbuettel and François (2011)]. The recursive structure of the problem is exploited and a suitable pruning strategy allows to save both computational time and memory requirements. Extensive documentation and a vignette are also available to increase the accessibility.

Footnotes

ACKNOWLEDGMENTS

The authors are grateful to three anonymous referees for their helpful comments.

AUTHORS’ CONTRIBUTIONS

S.D. wrote the package, the associated documentation and performed all the simulations. F.A. and M.R. wrote the article and provided all the mathematical details.

AVAILABILITY AND IMPLEMENTATION

Software available at .

AUTHORS DISCLOSURE STATEMENT

The authors declare that they have no competing interests.

FUNDING INFORMATION

M.R. was partially supported by the European Union – Next Generation EU, PRIN-PNRR 2022 (P2022H5WZ9).

SUPPLEMENTARY MATERIAL

References

Ascolani

, Lijoi

, Ruggiero

. Predictive inference with Fleming–Viot-driven dependent Dirichlet processes. Bayesian Anal, 2021; 16(2):371–395.

Ascolani

, Lijoi

, Ruggiero

. Smoothing distributions for conditional Fleming–Viot and Dawson–Watanabe diffusions. Bernoulli, 2023; 29(2):1410–1434.

Cappé

, Moulines

, Ryden

. Inference in Hidden Markov Models. Springer Series in Statistics. Springer-Verlag, Berlin, Heidelberg, 2005.

Eddelbuettel

, François

. RCPP: Seamless R and C++ integration. J Stat Soft, 2011; 40(8):1–18.

Etheridge

. An Introduction to Superprocesses. Number 20 in University Lecture Series. American Mathematical Society; 2000.

Ethier

, Griffiths

. The transition function of a Fleming-Viot process. Ann Probab, 1993; 21(3):1571–1590.

Ethier

, Kurtz

. Fleming–Viot processes in population genetics. SIAM J Control Optim, 1993; 31(2):345–386.

Feng

. The Poisson–Dirichlet Distribution and Related Topics. Probability and Its Applications. Springer; 2010.

Ferguson

. A Bayesian analysis of some nonparametric problems. Ann, Stat, 1973; 1(2):209–230.

10.

Gillespie

. Stochastic simulation of chemical kinetics. Annu Rev Phys Chem, 2007; 58:35–55.

11.

Gory

, Herbei

, Kubatko

. Bayesian inference of selection in the Wright-Fisher diffusion model. Stat. Appl. Genet. Mol. Biol, 2018; 17(3):20170046.

12.

Griffiths

. Asymptotic line-of-descent distributions. J Math Biology, 1984; 21(1):67–75.

13.

Griffiths

. Lines of descent in the diffusion approximation of neutral Wright-Fisher models. Theor Popul Biol, 1980; 17(1):37–50.

14.

Jenkins

, Spanò

. Exact simulation of the Wright–Fisher diffusion. Ann Appl Probab, 2017; 27(3).

15.

Kon Kam King

, Pandolfi

, Piretto

, et al. Approximate filtering via discrete dual processes. Stoch. Process. Their Appl, 2024; 168:104268.

16.

Kon Kam King

, Papaspiliopoulos

, Ruggiero . M. Exact inference for a class of hidden Markov models on general state spaces. Electron. J. Stat, 2021; 15:2832–2875.

17.

Papaspiliopoulos

, Ruggiero

. Optimal filtering and the dual process. Bernoulli, 2014; 20(4) pages:1999–2019.

18.

Papaspiliopoulos

, Ruggiero

, Spanò

. Conjugacy properties of time-evolving Dirichlet and gamma random measures. Electron. J. Stat, 2016; 10(2):3452–3489.

19.

Paris

, Servin

, Boitard

. Inference of selection from genetic time series using various parametric approximations to the Wright–Fisher model. G3, 2019; 9(12):4073–4086.

20.

Quintana

, Müller

, Jara

, et al. The dependent Dirichlet process and related models. Stat. Science, 2022; 37(1):24–41.

21.

Sant

, Jenkins

, Koskela

, et al. EWF: Simulating exact paths of the Wright-Fisher diffusion. Bioinformatics, 2023; 39(1).

22.

Tataru

, Simonsen

, Bataillon

, et al. Statistical inference in the Wright–Fisher model using allele frequency data. Syst Biol, 2017; 66(1):e30–e46.

23.

Tavaré

. Lines-of-descent and genealogical processes, and their applications in population genetics models. Theor Popul Biol, 1984; 26(2):119–164.

24.

Walker

, Hatjispyros

, Nicoleris

. A Fleming-Viot process and Bayesian nonparametrics. Ann Appl Probab, 2007; 17(1):67–80.

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.00 MB

0.48 MB

0.19 MB

0.12 MB

0.36 MB