Improved Harmonic Mean Estimator for Phylogenetic Model Evidence

Abstract

Bayesian phylogenetic methods are generating noticeable enthusiasm in the field of molecular systematics. Many phylogenetic models are often at stake, and different approaches are used to compare them within a Bayesian framework. The Bayes factor, defined as the ratio of the marginal likelihoods of two competing models, plays a key role in Bayesian model selection. We focus on an alternative estimator of the marginal likelihood whose computation is still a challenging problem. Several computational solutions have been proposed, none of which can be considered outperforming the others simultaneously in terms of simplicity of implementation, computational burden and precision of the estimates. Practitioners and researchers, often led by available software, have privileged so far the simplicity of the harmonic mean (HM) estimator. However, it is known that the resulting estimates of the Bayesian evidence in favor of one model are biased and often inaccurate, up to having an infinite variance so that the reliability of the corresponding conclusions is doubtful. We consider possible improvements of the generalized harmonic mean (GHM) idea that recycle Markov Chain Monte Carlo (MCMC) simulations from the posterior, share the computational simplicity of the original HM estimator, but, unlike it, overcome the infinite variance issue. We show reliability and comparative performance of the improved harmonic mean estimators comparing them to approximation techniques relying on improved variants of the thermodynamic integration.

1. Introduction

The theory of evolution states that all organisms are related through a history of common ancestor and that life on Earth diversified in a tree-like pattern connecting all living species. Phylogenetics aims at inferring the tree that better represents the evolutionary relationships among species studying differences and similarities in their genomic sequences. Alternative tree estimation methods such as parsimony methods (Felsenstein, 2004) and distance methods (Cavalli-Sforza and Edwards, 1967; Fitch and Margoliash, 1967) have been proposed. In this article, we will focus on stochastic models for substitution rates, and we address the model choice issue within a fully Bayesian framework proposing alternative model evidence estimation procedures.

The article is organized as follows: in Section 2, we briefly review basic phylogenetic concepts and the Bayesian inference for substitution models. In Section 3, we focus on the model selection issue for substitution models and discuss some available computational tools for Bayesian model evidence. One of the most popular tools for computing model evidence in phylogenetics is the harmonic mean (HM) estimator proposed elsewhere (Newton and Raftery, 1994), which can be regarded as an easy-to-apply instance of a more general class of estimators called generalized harmonic mean (GHM), defined in Section 4. An alternative version of GHM is considered in Section 4.1. It was introduced in Petris and Tardella (2007) under the name of Inflated density ratio (IDR), and its implementation for substitution models is described in Section 4.2. In Section 5, we describe the thermodynamic integration method and two improved estimators based on the thermodynamic integration idea recently proposed in Fan et al. (2011) and Xie et al. (2011). In Sections 6 and 7, we focus on the effectiveness as well as comparative performance of IDR compared to the original and alternatively modified HM estimators and also to two of the most up-to-date methods for marginal likelihood evaluation in the phylogenetic fields based on thermodynamic integration ideas using simulated and real data. Although IDR can be an order of magnitude more variable than thermodynamic estimators, it appears to be sufficiently robust and less prone to bias due to possible misspecification of tuning features of its competitors. Indeed a real data example (Xie et al., 2011) as well as other simulated phylogenetic data shows that low variability of the marginal likelihood estimates cannot be considered per se a warranty of accuracy if nonnegligible bias, possibly due to miscalibration of some features, can affect the estimates. Since in complex high-dimensional settings no Monte Carlo method can be considered intrinsically fail-safe, we will argue that our improved HM estimator can be also useful to validate the overall estimation results. We conclude with a brief discussion in Section 8.

2. Substitution Models: A Brief Overview

Phylogenetic data consists of homologous DNA strands or protein sequences of related species. Observed data consists of a nucleotide matrix X with n rows representing species and k columns representing sites. Comparing DNA sequences of two related species, we define substitution as the replacement in the same situs of one nucleotide in one species by another one in the other species. The stochastic models describing this replacement process are called substitution models. A phylogeny or a phylogenetic tree is a representation of the genealogical relationships among species, also called taxa or taxonomies. Tips (leaves or external nodes) represent the present-day species, while internal nodes usually represent extinct ancestors for which genomic sequences are no longer available. The ancestor of all sequences is the root of the tree. The branching pattern of a tree is called topology and is denoted with τ, while the lengths ν_τ of the branches of the tree τ represent the time periods elapsed until a new substitution occurs.

DNA substitution models are probabilistic models which aim at modeling changes between nucleotides in homologous DNA strands. Changes at each site occur at random times. Nucleotides at different sites are usually assumed to evolve independently each other. For a fixed site, nucleotide replacements over time are modeled by a four-state Markov process, in which each state represents a nucleotide. The Markov process indexed with time t is completely specified by a substitution rate matrix Q(t) = r_ij(t): each element r_ij(t), i ≠ j, represents the instantaneous rate of substitution from nucleotide i to nucleotide j. The diagonal elements of the rate matrix are defined as \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$r_{ii} (t) = - \sum \nolimits_{j \neq i}r_{ij} (t)$$ \end{document} so that \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\sum \nolimits_{j = 1}^4 r_{ij} (t) = 0, \ \forall i$$ \end{document} . The transition probability matrix P(t) = {p_ij(t)}, where P(t) is the matrix of substitution probability, which gives, for each couple of nucleotides (i, j) the probability that state i is replaced by state j at time t. The substitution process is assumed homogeneous over time t so that Q(t) = Q and P(t) = P. It is also commonly assumed that the substitution process at each site is stationary with equilibrium distribution Π = (π_A, π_C, π_G, π_T) and time-reversible, that is \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} \pi_{i}r_{ij} = \pi_{j}r_{ji} \tag{1} \end{align*} \end{document}

where π_i is the proportion of time the Markov chain spends in state i and π_ir_ij is the amount of flow from state i to j. Equation (1) is known as detailed-balance condition and means that flow between any two states in the opposite direction is the same. Following the notation in Hudelot et al. (2003) we define r_ij(t) = r_ij = ρ_ijπ_i, ∀ i ≠ j, where ρ_ij is the transition rate from nucleotide i to nucleotide j. This reparameterization is particularly useful for the specification of substitution models, since it makes clear the distinction between the nucleotide frequencies π_A,π_G,π_C,π_T and substitution rates ρ_ij, allowing to spell out different assumptions on evolutionary patterns. The most general time-reversible nucleotide substitution model is the so-called GTR defined by the following rate matrix \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} Q = \left(\begin{matrix}\hbox{-} & \rho_{AC}\pi_{C} & \rho_{AG}\pi_{G}& \rho_{AT}\pi_{T}\\ \rho_{AC}\pi_{A} & \hbox{-} & \rho_{CG}\pi_{G} & \rho_{CT}\pi_{T}\\ \rho_{AG}\pi_{A} & \rho_{CG}\pi_{C} & \hbox{-} & \rho_{GT}\pi_{T}\\ \rho_{AT}\pi_{A} & \rho_{CT}\pi_{C} & \rho_{GT}\pi_{G} & \hbox{-}\end{matrix}\right) \tag{2} \end{align*} \end{document}

and more thoroughly illustrated in Lanave et al. (1984). Several substitution models can be obtained simplifying the Q matrix reflecting specific biological assumptions: the simplest one is the JC69 model, originally proposed in Jukes and Cantor (1969), which assumes that all nucleotides are interchangeable and have the same rate of change, that is ρ_ij = ρ ∀ i, j and π_A = π_C = π_G = π_T.

In this article, for illustrative purposes we will consider only instances of the GTR and JC69 models. One can look at Felsenstein (2004) and Yang (2006) for a wider range of alternative substitution models.

2.1. Bayesian inference for substitution models

The parameter space of a phylogenetic model can be represented as \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} \Omega = \{\tau, \nu_{\tau}, \theta \} \end{align*} \end{document}

Observed data consists of a nucleotide matrix X; once specified the substitution model M, the likelihood p(X|τ, ν_τ, θ, M) can be computed using the pruning algorithm, a practical and efficient recursive algorithm proposed in Felsenstein (1981). One can then make inference on the unknown model parameters looking for the values which maximize the likelihood. Alternatively one can adopt a Bayesian approach where the parameter space is endowed with a joint prior distribution π(τ, ν_τ, θ) on the unknown parameters and the likelihood is used to update the prior uncertainty about (τ, ν_τ, θ) following the Bayes' rule: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} p (\theta, \nu_{\tau}, \tau \mid X, M) = \frac {p (X \mid \tau, \nu_{\tau}, \theta, M) \pi (\tau, \nu_{\tau}, \theta)} {m (X \mid M)} \end{align*} \end{document}

The resulting distribution p(τ, ν_τ, θ|X, M) is the posterior distribution which coherently combines prior believes and data information. Prior believes are usually conveyed as \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\pi (\tau, \nu_{\tau}, \theta) = \pi_{{\tau}} (\tau) \pi_{{\cal V}} (\nu_{\tau}) \pi_{\Theta} (\theta)$$ \end{document} . The denominator of the Bayes' rule m(X|M) is the marginal likelihood of model M and it plays a key role in discriminating among alternative models.

More precisely, suppose we aim at comparing two competing substitution models M₀ and M₁. The Bayes factor (BF) is defined as the ratio of the marginal likelihoods as follows \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} BF_{10} = \frac {m (X \mid M_{1})} {m (X \mid M_{0})} \end{align*} \end{document}

where, for i = 0, 1 \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} c^{(i)} = m (X \mid M_{i}) = \sum_{\tau \in {{\cal T}}} \int_{\nu^{(i)}_{\tau}} \int_{\theta^{(i)}} p (X \mid \theta, \tau, M_{i}) \pi_{\Theta} (\theta^{(i)}) d \theta^{(i)} \pi_{{\cal V}} (\nu_{\tau}^{(i)} \mid \tau) d \nu_{\tau}^{(i)} \pi_{{\tau}} (\tau) \tag{3} \end{align*} \end{document}

Numerical guidelines for interpreting the evidence scale are given in Kass and Raftery (1995). Values of BF₁₀ > 1 (log(BF₁₀) > 0) can be considered as evidence in favor of M₁ but only a value of BF₁₀ > 100 (log(BF₁₀ > 4.6) can be really considered as decisive.

Most of the times the posterior distributions p(τ, ν_τ, θ|X, M_i) and marginal likelihoods are not analytically computable but can be approached through appropriate approximations. Indeed, over the past 10 years, powerful numerical methods based on Markov chain Monte Carlo (MCMC) have been developed, allowing one to carry out Bayesian inference under a large category of probabilistic models, even when dimension of the parameter space is very large. Indeed, several ad-hoc MCMC algorithms have been tailored for phylogenetic models (Larget and Simon, 1999; Li et al., 2000) and are currently implemented in publicly available software such as in MrBayes (Ronquist and Huelsenbeck, 2003), PHASE (Gowri-Shankar and Jow, 2006) and Phycas (Xie et al., 2011).

3. Model Selection for Substitution Models and Computational Tools for Bayesian Model Evidence

Given the variety of possible stochastic substitution mechanisms, an important issue of any model-based phylogenetic analysis is to select the model that is most supported by the data. Several model selection procedures have been proposed depending also on the inferential approach. A classical approach to model selection for choosing between alternative nested models is to perform the hierarchical likelihood ratio test (LRT) (Posada and Buckely, 2004). A number of popular programs allow users to compare pairs of models using this test such as PAUP (Swafford, 2003) PAML (Yang, 2007), and the R package APE (R Development Core Team, 2008). However, Posada and Buckely (2004) have shown some drawbacks of performing systematic LRT for model selection in phylogenetics. This is because the model that is finally selected can depend on the order in which the pairwise comparisons are performed (Pol, 2004). Moreover, it is well-known that LRT tends to favor parameter-rich models.

The Akaike Information Criterion (AIC) is another model-selection criterion commonly used also in phylogenetics (Posada and Buckley, 2004): one of the advantages of the AIC is that it allows to compare nested as well as non nested models and it can be easily implemented. However, the AIC tends to favor parameter-rich models as well. To overcome this selection bias, one can use the Bayesian Information Criterion (BIC) (Schwartz, 1978), which better penalizes parameter-rich models. Sometimes these criteria applied to the same data can end up selecting very different substitution models, as shown in Abdo (2005). Indeed, they compare ratios of likelihood values penalized for an increase in the dimension of one of the models, without directly accounting for uncertainty in the estimates of model parameters. The latter aspect is addressed within a fully Bayesian framework through the use of BF. BF directly incorporates this uncertainty, and its meaning is more intuitive than other methods since it can be directly used to assess the comparative evidence provided by the data in terms of the most probable model under equal prior model probabilities. BF for comparing phylogenetic models was first introduced in Sinsheimer et al. (1996) and Weiss et al. (2001). Since then its popularity in phylogenetics has grown, so that some publicly available software provide in their standard output approximations of marginal likelihoods for model evidence and BF evaluation. Indeed the complexity of phylogenetic models and the computational burden in the light of high-dimensional parameter space make the problem of finding alternative more reliable and efficient computational strategy for computing BF still open and in continuous development (Lartillot et al., 2007; Ronquist and Deans, 2010). A recent improved implementation of the thermodynamic approach inspired by the power priors of Friel and Pettitt (2008) has been recently introduced in Xie et al. (2011) and successively improved in Fan et al. (2011).

The computation of the marginal likelihood m(X|M) of a phylogenetic model M is not straightforward. It involves integrating the likelihood over k-dimensional subspaces for the branch length parameters ν_τ and the substitution rate matrix θ = (ρ, π) and eventually summing over all possible topologies.

Most of the marginal likelihood estimation methods proposed in the literature have been applied extensively also in molecular phylogenetics (Lartillot et al., 2007; Minin et al., 2003; Weiss et al., 2001). Among these methods, many of them are valid only under very specific conditions. For instance, the Dickey-Savage ratio (Verdinelli and Wasserman, 1995) applied in phylogenetics in Weiss et al. (2001) assumes nested models. Laplace approximation (Kass and Raftery, 1995) and BIC (Schwartz, 1978) applied in phylogenetics firstly in Minin et al. (2003), require large sample approximations around the maximum likelihood, which can be sometimes difficult to compute or approximate for very complex models. A recent appealing variation of the Laplace approximation has been proposed in Rodrigue et al. (2007): however, its applicability and performance are endangered when the posterior distribution deviates from normality and the maximization of the likelihood can be neither straightforward nor accurate.

The reversible jump approach (Bartolucci et al., 2006; Green, 1995) is another MCMC option applied to phylogenetic model selection in Huelsenbeck et al. (2004). Unfortunately the implementation of this algorithm is not straightforward for the end user and it often requires appropriate delicate tuning of the Metropolis Hastings proposal. Moreover, its implementation suffers extra difficulties when comparing models based on an entirely different parametric rationale (Lartillot et al., 2007).

As recently pointed out in Ronquist and Deans (2010), among the most widely used methods for estimating the marginal likelihood of phylogenetic models are the thermodynamic integration, also known as path sampling, and the harmonic mean approach. The thermodynamic integration reviewed in Gelman and Meng (1998) and first applied in a phylogenetic context in Lartillot and Philippe (2006) produces reliable estimates of BF of phylogenetic models in a large varieties of models. Although this method has the advantage of general applicability, it can incur high computational costs and may require specific adjustments. For certain model comparisons, a full thermodynamic integration may take weeks on a modern desktop computer, even under a fixed tree topology for small single protein data sets (Rodrigue et al., 2007; Ronquist and Deans, 2010). Alternative approaches, based on a similar rationale and resembling the power prior approach of Friel and Pettitt (2008), are the Stepping Stone (SS) and the Generalized Stepping Stone (GSS) methods, recently proposed in two consecutive papers by Xie et al. (2011) and Fan et al. (2011): similarly to thermodynamic integration, these methods are computationally intensive since they need to sample from a series of distributions in addition to the posterior. Such dedicated path-based MCMC analyses appear to be the price for estimating marginal likelihoods more accurately. In Fan et al., (2011) it is shown that GSS produces very stable estimates of the marginal likelihood and in Xie et al. (2011) it appears to be slightly more precise than the plain thermodynamic integration based on power priors. The thermodynamic integration method and GSS method will be described in more detail in Section 5.

The HM estimator can be easily computed, and it does not demand further computational efforts other than those already made to draw inference on model parameters, since it only needs simulations from the posterior distributions. However, it is well known that the HM estimator is unstable since it can end up with an infinite variance.

Although several solutions have been proposed, the computation of the marginal likelihood is still an open problem: none of the proposed methods can be considered intrinsically fail-safe and always outperforming the others, and the comparison among different methods can be useful in order to accurately compare competing models and more thoroughly validate the final estimate. In the next sections, we focus on an alternative GHM estimator, the IDR estimator, which shares the computational simplicity of HM estimator but, unlike it, better copes with the infinite variance issue. Its simple implementation makes the IDR estimator a useful tool to easily provide a reliable method for comparing competing substitution models. We will argue how IDR can be useful as a confirmatory tool even in those models for which more complex estimation methods, such as the thermodynamic integration and GSS, can be applied.

4. Harmonic Mean Estimators

We introduce the basic ideas and formulas for the class of estimators known as GHM. Since the marginal likelihood is nothing but the normalizing constant of the unnormalized posterior density, we illustrate the GHM estimator as a general solution for estimating the normalizing constant of a non-negative, integrable density g defined as \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} c = \int_{\Omega}g (\theta) d \theta \tag{4} \end{align*} \end{document}

where f is a convenient instrumental Lebesgue integrable function, which is only required to have a support which is contained in that of g and to satisfy \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} \int_{\Omega}f (\theta) d \theta = 1. \tag{6} \end{align*} \end{document}

The GHM estimator, denoted as \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\hat{c}_{GHM}$$ \end{document} is the empirical counterpart of (2), namely \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} \hat {c}_{GHM} = \frac {1} {\frac {1} {T} \sum_{t = 1}^{T} \frac {f (\theta_{t})} {g (\theta_{t})}} . \tag {7} \end{align*} \end{document}

where \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\theta_1, \theta_2, \ldots, \theta_T$$ \end{document} are sampled from \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\tilde{g}$$ \end{document} . In Bayesian inference, the very first instance of such GHM estimator was introduced in Gelfand and Dey (1994) to estimate the marginal likelihood considered as the normalizing constant of the unnormalized posterior density g(θ) = π_Θ(θ) L(θ). Hence, taking f (θ) = π_Θ(θ) one obtains as special case of equation (4) the HM estimator \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} \hat {c}_{HM} = \frac {1} {\frac {1} {T} \sum_{t = 1}^{T} \frac {1} {L (\theta_{t})}} \tag {8} \end{align*} \end{document}

which can be easily computed by recycling simulations \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\theta_1, \ldots, \theta_T$$ \end{document} from the target posterior distribution \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\tilde{g} (\theta)$$ \end{document} available from MC or MCMC sampling scheme. This probably explains the original enthusiasm in favor of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\hat{c}_{HM}$$ \end{document} , which indeed was considered a potential convenient competitor of the standard Monte Carlo Importance Sampling estimate given by the (Prior) Arithmetic Mean (AM) estimator. The implementation of equation (7) and, more generally, equation (6) requires a lighter computational burden; hence, it can reduce computing time with respect to thermodynamic integration. The simplicity of the computation has then favored the widespread use of the HM estimator with respect to more complex methods. In fact, the HM estimator is implemented in several Bayesian phylogenetic software as shown in Table 1 and recent biological articles (Normana et al., 2009; Wang et al., 2009; Yamanoue et al. 2008) report the HM as a routinely used model selection tool.

Table 1.

Availability of Marginal Likelihood Estimation Methods in Bayesian Phylogenetics Software: In Spite of Its Inaccuracy, the Harmonic Mean Estimator Is Still One of the Most Diffuse Marginal Likelihood Estimation Tool

Software	Marginal likelihood estimation method
BayesTraits	Harmonic Mean
BEAST	Harmonic Mean or bootstrapped Harmonic Mean
MrBayes	Harmonic Mean
PHASE	Harmonic Mean and Reversible Jump
PhyloBayes	Thermodynamic Integration under normal approximation
Phycas	Generalized Stepping Stone

However, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\hat{c}_{HM}$$ \end{document} can end up with a very large variance and unstable behavior. This fact cannot be considered as an unusual exception, but it often occurs and the reason for that can be argued on a theoretical ground. Indeed, starting from the original article (Newton and Raftery, 1994), it has been shown that even in very simple and standard Gaussian models also the HM estimator can end up having an infinite variance hence yielding unreliable approximations. This fact raises sometimes the question whether they are effective tools and certainly has encouraged researchers to look for alternative solutions. Several generalizations and improved alternatives have been proposed and recently reviewed in Raftery et al. (2007).

In the following sections, we consider the performance of alternative implementations of GHM method. The first one is directly based on equation (7) where g is taken as the posterior density up to the unknown proportionality constant, while f is taken as an auxiliary probability density which is possibly a convenient fully available probability density approximating the posterior. Such auxiliary density is not easy to build up appropriately in general, but in principle, setting it as close as possible to the posterior could yield a very efficient estimator. The second one is derived from a different perspective and does not require any additional calibration of an auxiliary density although this prevents it from achieving in principle arbitrary precision. It has been originally proposed in Petris and Tardella (2007) and it is called IDR estimator although it can be also seen as a particular instance of the GHM approach.

These two estimators basically share the original simplicity and the computation feasibility of the HM estimator but, unlike it, they can guarantee greater stability and important theoretical properties, such as a bounded variance.

4.1. Inflated Density Ratio (IDR) estimator

The inflated density ratio estimator is a different formulation of the GHM estimator, based on a particular choice of the instrumental density f (θ) as originally proposed in Petris and Tardella (2007). The instrumental f (θ) is obtained through a perturbation of the original target function g. The perturbed density, denoted with g_Pk, is defined so that its total mass has some known functional relation to the total mass c of the target density g as in equation (4). In particular, g_Pk is obtained as a parametric inflation of g so that \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} \int_{\Omega}g_{P_{k}} (\theta) = c + k \tag{9} \end{align*} \end{document}

where k is a known inflation constant which can be arbitrarily fixed. The perturbation device comes from an original idea in Petris and Tardella (2003) and is detailed in Petris and Tardella (2007) for unidimensional and multidimensional densities. In the unidimensional case the perturbed density is \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} g_{P_{k}} (\theta) = \begin{cases}g (\theta + r_{k}) \qquad \hbox{if} \ \theta < - r_{k} \\ g (0) \qquad \qquad \hbox{if} - r_{k} \leq \theta \leq r_{k} \\ g (\theta - r_{k}) \qquad \hbox{if} \ \theta > r_{k}\end{cases} \tag{10} \end{align*} \end{document}

with \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$2r_k = \frac {k} {g (0)}$$ \end{document} corresponding to the length of the interval centered around the origin where the density is kept constant. In Figure 1, one can visualize how the perturbation acts. The perturbed density allows one to define an instrumental density \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$f_{k} (\theta) = \frac {g_{P_{k}} (\theta) - g (\theta)} {k}$$ \end{document} which satisfies the requirement (6) needed to define the GHM estimator as in (7). The Inflated Density Ratio estimator \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\hat{c}_{IDR}$$ \end{document} for c is then obtained as follows \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} \hat {c}_{IDR} = \frac {k} {\frac {1} {T} \sum_{t = 1}^{T} \frac {g_{P_{k}} (\theta_{t})} {g (\theta_{t})} - 1} \tag {11} \end{align*} \end{document}

FIG. 1.

(Left) original density g with total mass c. (Right) perturbed density \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$g_{P_{k}}$$ \end{document} defined as in equation (10). The total mass of the perturbed density is then c + k. The shaded area correspond to the inflated mass k with k = 2 · r_k · g(0) as in (10).

where \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\theta_{1}, \ldots, \theta_{T}$$ \end{document} is a sample from the normalized target density \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\tilde{g}$$ \end{document} . The use of the perturbed density as importance function leads to some advantages with respect to the other instances of c_GHM proposed in the literature. In fact \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\hat{c}_{IDR}$$ \end{document} yields a finite-variance estimator under mild sufficient conditions and a wide range of g densities (Petris and Tardella, 2007) (Lemma, 1, 2, and 3). Notice that in order for the perturbed density g_Pk to be defined it is required that the original density g has full support in \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\Re^k$$ \end{document} . Moreover, the use of a parametric perturbation makes the method more flexible and efficient with a moderate extra computational effort.

Like all methods based on importance sampling strategies, the properties of the estimator \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\hat{c}_{IDR}$$ \end{document} strongly depend on the ratio \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\frac {g_{P_{k}} (\theta)} {g (\theta)}$$ \end{document} . To evaluate its performance one can use an asymptotic approximation (as t →+∞) via standard delta-method of the Relative Mean Square Error of the estimator \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} RMSE_{\hat {c}_{IDR}} = \sqrt {E_{\tilde {g}} \Bigg[\left(\frac {\hat {c}_{IDR} - c} {c} \right)^{2} \Bigg]} \approx \frac {c} {k} \sqrt {Var \Bigg[\frac {g_{P_{k}} (\theta)} {g (\theta)} \Bigg]} = RMSE_{\hat {c}_{IDR, Delta}} \tag {12} \end{align*} \end{document}

which can be ultimately estimated as follows: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} {\widehat {RMSE}}_{\hat {c}_{IDR, Delta}} = \frac {\hat {c}_{IDR}} {k} \sqrt {\widehat {Var}_{\tilde {g}} \Bigg[\frac {g_{P_{k}} (\theta)} {g (\theta)} \Bigg]} \tag {13} \end{align*} \end{document}

The expression in equation (13) clarifies the key role of the choice of k with respect to the error of the estimator: for k → 0, the variance of the ratio \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\frac {g_{P_{k}} (\theta)} {g (\theta)}$$ \end{document} tends to 0, since g_Pk is very close to g, but \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\frac {c} {k}$$ \end{document} tends to infinity: in other words, if \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$Var_{\tilde {g}} \left[\frac {g_{P_{k}} (\theta)} {g (\theta)} \right]$$ \end{document} would favor as little values of k as possible, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\frac {1} {k}$$ \end{document} acts in the opposite direction. In order to address the choice of k, Petris and Tardella (2007) suggested to choose the perturbation k which minimizes the estimation error (13). In practice, one can calculate the values of the estimator for a grid of different perturbation values, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\hat{c}_{IDR} (k), k = 1, \ldots, K$$ \end{document} and choose the optimal k^opt as the k for which \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${\widehat {RMSE}}_{\hat{c}_{IDR} (k)}$$ \end{document} is minimum. This procedure for the calibration of k requires iterative evaluation of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\hat{c}_{IDR} (k)$$ \end{document} hence is relatively heavier than the HM estimator, but it does not require extra simulations which in the phylogenetic context is often the main time-consuming part. Hence, the computational cost is alleviated by the fact that one uses the same sample from \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\tilde{g}$$ \end{document} and the only quantity to be evaluated K times is the inflated density g_Pk. Once obtained the ratio of the perturbed and the original density, the computation of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\hat{c}_{IDR} (k)$$ \end{document} is straightforward.

For practical purposes, the computation of the inflated density when the support of g is the whole \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\Re^k$$ \end{document} can be easily implemented in R using a function available at http://sites.google.com/site/idrharmonicmean/home. In Arima (2009) and Petris and Tardella (2007), it has been shown that in order to improve the precision of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\hat{c}_{IDR}$$ \end{document} it is recommended to standardize the simulated data \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\theta_1, \ldots, \theta_T$$ \end{document} with respect to a (local) mode and the sample variance-covariance matrix so that the corresponding density has a local mode at the origin and approximately standard variance-covariance matrix. This is automatically implemented in the publicly available R code.

In order to assess its effectiveness, the IDR method has been applied to simulated data from differently shaped distributions for which the normalizing constant is known. As shown in Petris and Tardella (2007), and confirmed by the numerical examples in Section 6, the estimator produces fully convincing results with simulated data from several known distributions, even for a 100-dimensional multivariate Gaussian distribution. In terms of estimator precision, these results are comparable with those in Lartillot and Philippe (2006) obtained with the thermodynamic integration. In Arima (2009), simple antithetic variate tricks allow the IDR estimator to perform well even for those distributions with severe variations from the symmetric Gaussian case such as asymmetric and even some multimodal distributions. In the next sections, we extend the IDR method in order to use \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\hat{c}_{IDR}$$ \end{document} in more complex settings such as phylogenetic models.

4.2. Implementing IDR for substitution models with fixed topology

We extend the IDR approach in order to compute the marginal likelihood of phylogenetic models. In this section, we show how to compute the marginal likelihood when it involves integration of substitution model parameters θ and the branch lengths ν_τ which are both defined in continuous subspaces. Indeed the approach can be used as a building block to integrate also over the tree topology τ.

For a fixed topology τ and a sequence alignment X, the parameters of a phylogenetic model M_τ are denoted as ω = (θ, ν_τ) ⊂ Ω_τ. The joint posterior distribution on ω is given by \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} p (\theta, \nu_{\tau} \mid X, M_\tau) = \frac {p (X \mid \theta, \nu_{\tau}, M_\tau) \pi_{\Theta} (\theta) \pi_{{\cal V}} (\nu_{\tau})} {m (X \mid M_\tau)} \tag {14} \end{align*} \end{document}

is the marginal likelihood we aim at estimating.

When the topology τ is fixed, the parameter space Ω_τ is continuous. Hence, in order to apply the IDR method we only need the following two ingredients:

• a sample \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$(\theta^{(1)}, \nu_{\tau}^{(1)}), \ldots, (\theta^{(T)}, \nu_{\tau}^{(T)})$$ \end{document} from the posterior distribution, p(θ, ν_τ|X, M_τ)

• the likelihood and the prior distribution evaluated at each posterior sampled value \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$(\theta^{(k)}, \nu_{\tau}^{(k)})$$ \end{document} , that is \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$p (X \mid \theta^{(k)}, \nu_{\tau}^{(k)}, M_\tau)$$ \end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\pi (\theta^{(k)}, \nu_{\tau}^{(k)}) = \pi_{\Theta} (\theta^{(k)}) \pi_{\cal V} (\nu_{\tau}^{(k)})$$ \end{document}

The first ingredient is just the usual output of the MCMC simulations derived from model M and data X. The computation of the likelihood and the joint prior is usually already coded within available software. The first one is accomplished through the pruning algorithm while computing the prior is straightforward. Indeed a necessary condition for the inflation idea to be implemented as prescribed in Petris and Tardella (2007) is that the posterior density must have full support on the whole real k-dimensional space. In our phylogenetic models, this is not always the case hence we explain simple and fully automatic remedies to overcome this kind of obstacle.

We start with branch length parameters which are constrained to lie in the positive half-line. In that case, the remedy is straightforward. One can reparameterize with a simple logarithmic transformation \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} \nu^{\prime}_{\tau} = \log (\nu_{\tau}) \tag{16} \end{align*} \end{document}

so that the support corresponding to the reparameterized density becomes unconstrained. Obviously the log(ν_τ) reparameterization calls for the appropriate Jacobian when evaluating the corresponding transformed density. For model parameters with linear constraints like the substitution θ = {ρ, π}, a little less obvious transformation is needed. In this case, θ = {ρ, π} are subject to the following set of constraints: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} \sum_{i \in \{A, T, C, G \}} \pi_{i} & = 1 \\ \sum_{j \in \{A, T, C, G \}} \rho_{ij} \pi_{j} & = 0 \qquad \qquad \forall \ i \in \{A, T, C, G \} \end{align*} \end{document}

Similarly to the first simplex constraint the last set of constraints together with the reversibility can be rephrased (Gowri-Shankar, 2006) in terms of another simplex constraint concerning only the extra-diagonal entries of the substitution rate matrix (2) namely \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} \rho_{AC} + \rho_{AG} + \rho_{AT} + \rho_{CG} + \rho_{CT} + \rho_{GT} = 1. \end{align*} \end{document}

In order to bypass the constrained nature of the parameter space we have relied on the so-called additive logistic transformation (Aitichinson, 1986; Tiao and Cuttman, 1965), which is a one-to-one transformation from \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${\mathbb R}^{D - 1}$$ \end{document} to the (D − 1)-dimensional simplex \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} S^{D} = \left\{(x_{1}, \ldots, x_{D}) : x_{1} > 0, \ldots, x_{D} > 0; x_{1} + \ldots x_{D} = 1 \right\} . \end{align*} \end{document}

Hence, we can use its inverse, called additive log-ratio transformation, which is defined as follows \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} y_{i} = \log \left(\frac {x_{i}} {x_{D}} \right) \qquad i = 1, \ldots, D - 1 \end{align*} \end{document}

for any \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$x = (x_1, \ldots, x_D) \in S^D$$ \end{document} . Here, the x_i's are the ρ's and D = 6. Applying these transformations to nucleotide frequencies π_i and to exchangeability parameters ρ's, the transformed parameters assume values in the entire real support and the IDR estimator can be applied. Again, the reparameterization calls for the appropriate change-of-measure Jacobian when evaluating the corresponding transformed density (Aitichinson, 1986).

5. Thermodynamic Integration Estimators

Starting with the pioneering work of Lartillot et al. (2007), the thermodynamic integration technique has been fruitfully employed for marginal likelihood approximation in phylogenetic substitution models. Indeed the thermodynamic integration has a longer history dated back at least to the work of Frenkel (1986) and Ogata (1989). The interested reader can refer to Gelman and Meng (1998) for a historic overview and also to Chen and Shao (1997) for a more technical insight.

We will briefly recall the core ideas which start from the so-called path sampling identity for estimating the ratio of two normalizing constants of two different densities say \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$c_f = \int_{\Omega} f (\theta) d \theta$$ \end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$c_g = \int_{\Omega} g (\theta) d \theta$$ \end{document} \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} \log \frac {c_g} {c_f} = \int_{[0, 1]} Q (\beta) d \beta = \int_{[0, 1]} \Bigg[\int_{\Omega} \frac {\partial} {\partial \beta} \log q (\theta, \beta) \frac {q (\theta, \beta)} {c (\beta)} d \theta \Bigg] d \beta \tag {17} \end{align*} \end{document}

where q(θ, β) is a so-called path function which has to guarantee the following conditions \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} f (\theta) = q (\theta, 0) \quad g (\theta) = q (\theta, 1) \end{align*} \end{document}

providing a smooth link (path) between f and g parameterized by β and where c(β) = c_q_{(·, β)} is the normalizing constant of the path function q(·, β) regarded as a density function in θ. An alternative expression of the path sampling identity is \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} \int_{0}^{1} E_{\tilde {q} (\theta, \beta)} \left[\frac {\partial} {\partial \beta} \log q (\theta, \beta) \right] d \beta \tag {18} \end{align*} \end{document}

where \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\tilde{q} (\theta, \beta) = q (\theta, \beta) / c (\beta)$$ \end{document} . This identity is generally used to build up approximations of the log ratio of normalizing constants using simulations for approximating the inner expectations and discretizing over \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$t \in [0, 1]$$ \end{document} for approximating the outer integral. One of the most favorite path function is the geometric path \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} q (\theta, \beta) = g (\theta)^\beta f (\theta)^{1 - \beta} \end{align*} \end{document}

which yields the following simplified expression for the inner integral \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} Q (\beta) = \int_{\Omega} \ \log \frac {g (\theta)} {f (\theta)} \ \frac {g (\theta)^\beta f (\theta)^{1 - \beta}} {c (\beta)} \ d \theta. \end{align*} \end{document}

Indeed the starting density f of the path can be chosen arbitrarily. In the next sections we will consider a general path sampling identity as in equation (18) in which g(θ) = L(θ)π(θ) and f (θ) = π₀(θ) where π₀(θ) is an auxiliary distribution which is chosen as close as possible to the target g. Potentially setting π₀ ∞ L(θ)π(θ) would yield a perfect approximation. The resulting generalized path sampling (GPS) estimator is computed as \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} \log \hat {c}_{GPS} &= \frac {1} {2} \sum_{t = 1}^{T - 1} (\beta_{t + 1} - \beta_{t}) \left[\frac {1} {n} \sum_{i = 1}^{n} \left(\log \pi (\theta_{i; \beta_{t + 1}}) + \log L (\theta_{i; \beta_{t + 1}}) - \log \pi_{0} (\theta_{i; \beta_{t + 1}}) \right) \right. & (19) \\&\quad \left. + \frac {1} {n} \sum_{i = 1}^{n} \left(\log \pi (\theta_{i; \beta_{t}}) + \log L (\theta_{i; \beta_{t}}) - \log \pi_{0} (\theta_{i; \beta_{t}}) \right) \right] \end{align*} \end{document}

where θ_i,βt are sampled from \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\tilde {q} (\theta; \beta_{t}) = \frac {(\pi (\theta) L (\theta))^{\beta_{t}} \pi_{0} (\theta)^{1 - \beta_{t}}} {c (\beta_{t})}$$ \end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\beta_{1}, \beta_{2}, \ldots, \beta_{T}$$ \end{document} are a suitably chosen grid of points in [0, 1].

A simplified version of the GPS estimator has been proposed in Lartillot and Philippe (2006). They used a geometric path for estimating model evidence using g(θ) = π(θ)L(θ) and f (θ) = π(θ). The corresponding path sampling estimator log c_PS can be estimated as \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} \log \hat {c}_{PS} = \frac {1} {2} \sum_{t = 1}^{T - 1} (\beta_{t + 1} - \beta_{t}) \left[\frac {1} {n} \sum_{i = 1}^{n} L (\theta_{i; \beta_{t + 1}}) + \frac {1} {n} \sum_{i = 1}^{n} L (\theta_{i; \beta_{t}}) \right] \tag {20} \end{align*} \end{document}

where θ_i,βt are sampled from the so-called power posterior \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\tilde {q} (\theta; \beta_{t}) = \frac {\pi (\theta) L (\theta)^{\beta_{t}}} {c (\beta_{t})}$$ \end{document} .

However, it can be easily argued (Lefebvre et al., 2010) that the choice of π(θ) as starting density f is hardly a good choice for something which ought to be as close as possible to the target. Moreover, it has been shown in Calderhead and Girolami (2009) and in Fan et al. (2011) that the choice of a uniform grid is not optimal and using a grid which is denser in the lower part of the unit interval may improve substantially the accuracy and the efficiency of the resulting estimator. The values \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\beta_1, \beta_2, \ldots, \beta_T$$ \end{document} are set equal to the quantiles of an auxiliary density w on [0, 1]. Numerical examples in Sections 6 and 7 show that the choice of the reference auxiliary density w and the grid size play a relevant role in the GPS and PS estimates. In order to appropriately calibrate the starting reference distribution, one could rely on the simple idea proposed in Xie et al. (2011), which has led to remarkable improvements of the so-called SS method. The reference distribution f can be chosen with independent marginal distributions within the family of the prior distributions so that the parameters of the reference are guessed from the marginal posterior simulations. In the next sections, we have used the reference distribution π₀ according to Fan et al. (2011).

5.1. Generalized Stepping Stone (GSS) estimator

The GSS method, presented in Fan et al. (2011) is a modification of the SS method introduced in Xie et al. (2011). This method is closely related to the thermodynamic integration and shares with it the need of sampling from a discretized grid of intermediate distributions in addition to the posterior.

Using the same notation of the previous section, the GSS estimator is defined as \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} c_{GSS} = \prod_{t = 1}^{T} \frac {c (\beta_{t})} {c (\beta_{t - 1})} \tag {21} \end{align*} \end{document}

and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\tilde {q} (; \beta_{t}) = \frac {q (; \beta_{t})} {c (\beta_{t})} = \frac {g (\theta)^{\beta_{t}} f (\theta)^{1 - \beta_{t}}} {c (\beta_{t})}$$ \end{document} . In Fan et al. (2011), g(θ) is taken as the product of likelihood and prior and f (θ) is a reference distribution (π₀(θ)), which yields \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} r_{t} = E_{\tilde {q} (; \beta_{t - 1})} \left[\left(\frac {L (\theta) \pi (\theta)} {\pi_{0} (\theta)} \right)^{\beta_{t} - \beta_{t - 1}} \right] \tag {23} \end{align*} \end{document}

where \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\tilde {q} (; \beta_{t}) = \frac {(L (\theta) \pi (\theta))^{\beta_{t}} \pi_{0} (\theta)^{1 - \beta_{t}}} {c (\beta_{t})}$$ \end{document} . The density \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\tilde{q} (; \beta_t)$$ \end{document} is equivalent to the posterior distribution when β = 1 and it is equivalent to the reference distribution when β = 0. The expression in (23) can be estimated using \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} \hat {r}_{t} = \frac {1} {n} \sum_{i = 1}^{n} \left(\frac {L (\theta_{i, \beta_{t - 1}}) \pi (\theta_{i, \beta_{t - 1}})} {\pi_{0} (\theta_{i, \beta_{t - 1}})} \right) \end{align*} \end{document}

In Fan et al. (2011), the reference distribution π₀(·) is constructed using the same parameterized functional form of the prior distribution and calibrated using samples from the posterior distribution \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\tilde{q} (\cdot ; \beta = 1)$$ \end{document} : parameters of the reference distribution are set equal to the posterior mean of the same parameters drawn from \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\tilde{q} (\cdot ; \beta = 1)$$ \end{document} . The relative mean square error of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\hat{c}_{GSS}$$ \end{document} can be approximated as follows \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} {\widehat {RMSE}}_{\hat {c}_{GSS}} \approx \frac {1} {n^{2}} \sum_{t = 1}^{T} \sum_{i = 1}^{n} \left(\frac {\widehat {c}_{GSS, i, t}} {\widehat {c}_{GSS, t}} - 1 \right)^{2} \tag {24} \end{align*} \end{document}

The GSS reduces to the original SS method proposed in Xie et al. (2011) if the reference distribution is equal to the actual prior: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} \hat {r}_{t}^{SS} = \frac {1} {n} \sum_{i = 1}^{n} L (\theta_{i, \beta_{t - 1}}) \tag {25} \end{align*} \end{document}

In Fan et al. (2011), it is argued that GSS is more reliable and efficient than the other methods, such as the harmonic mean, the stepping stone method and the path sampling estimator.

6. Numerical Examples and Comparative Performance: The Generalized Linear Model Example

We have considered a controlled simulation setting as in Calderhead and Girolami (2009) and Lartillot et al. (2007). A simple linear model is considered where the marginal likelihood can be computed exactly. Data X are generated from the following \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} X_i = Z_i^T B + u_i \qquad i = 1, 2, \ldots, n \end{align*} \end{document}

where the n × k matrix Z of covariates are fixed uniformly, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$B = (b_1, \ldots, b_k)$$ \end{document} is a vector of regression coefficients which have been randomly fixed and u_i ∼ N(0, σ²). The exact closed form formula for the marginal likelihood when the prior distribution on B is a multivariate normal with null mean vector and diagonal covariance matrix η² is the following: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} p (X \mid \sigma^{2}, \eta^{2}) = (2 \pi)^{k / 2} \mid \Omega \mid^{- 1 / 2} exp \left\{- \frac {1} {2} X^{T} \Omega^{- 1} X \right\} \end{align*} \end{document}

where Ω = σ²I + η²ZZ^T.

We verified first the comparative performance of two alternative classes of marginal likelihood estimators: the first one relying on the GHM formula and using only a single simulation from the posterior and the second one relying on the thermodynamic integration and requiring a sufficiently dense grid of simulations from the auxiliary distributions \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\tilde{q} (\cdot, \beta_t)$$ \end{document} . As representative of the first class we have considered the classical HM estimator, the GHM estimator and the IDR estimators. The GHM estimator, has been implemented as in (7) using as instrumental function f the reference distribution π₀ as suggested in the implementation of GSS. For the second class, we have considered the Generalized Path Sampling again with f = π₀ and the choice of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\beta_1, \beta_2, \ldots, \beta_d$$ \end{document} set equal to the quantiles of a Beta distribution with parameters α₁ = 0.3 and α₂ = 1.0. We have simulated 10 datasets from the above model in four simulation settings (A, B, C, and D). They differ in the way the design matrix Z and the true regressors B are fixed so that they yield different posterior correlation structures. They are ordered according to the increasing presence of correlation ranging from (−0.13; 0.13) in setting A to (−0.33; 0.58) in setting D. The four settings have been replicated for different choices of n (n = 200; n = 1000) and of k (k = 20; k = 100). Thermodynamic integral estimators are based on 10,000 draws from \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\tilde{q} (\cdot, \beta_t)$$ \end{document} for each β_t in the grid while GHM estimators only on 10,000 from the posterior \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\tilde{q} (\cdot, \beta_t = 1)$$ \end{document} . In Table 2, we can see that overall the comparative performance of the alternative methods is certainly in favor of the Thermodynamic Integration side. However one can notice that in most simulation settings in which the correlation structure is mild and hence the reference π₀ is reasonably close to the target posterior, GHM can be competitive. Notice that the reference distribution has been chosen as a multivariate normal so that only the posterior means and variances are matched in the same way the reference distribution is calibrated in Fan et al. (2011). This means that there is a correlation structure that is overlooked in the starting density. As evident in setting D, a stronger correlation structure may cause a loss of performance of both of GHM and thermodynamic integration methods up to making IDR more competitive. Indeed, GSS and GPS can outperform the other methods also in simulation D provided that the grid size is increased from 20 to 100 (results not reported here). The main message of this example is that although the Thermodynamic Integration approach will eventually outperform the simpler GHM and IDR estimators, the latter can still be reliable in all simulation settings and to some extent competitive when a reference distribution is not too close to the target posterior and/or a very dense grid can be used.

Table 2.

Linear Model Simulated Data: Actual Relative Mean Square Error of Alternative Marginal Likelihood Estimators Resulting from the Simulation of Ten Different Datasets in Four Simulation Settings (A, B, C, and D) with Different Correlation Structures

	HM	GHM	IDR	GSS	GPS
n = 200
Sim. A (k = 20)	0.24184	0.00002	0.00323	0.00001	0.00001
Sim. B (k = 20)	0.27126	0.00002	0.00360	0.00001	0.00001
Sim. C (k = 20)	0.29329	0.00695	0.00281	0.00005	0.00007
Sim. D (k = 20)	0.28573	0.00653	0.00163	0.00126	0.00041
Sim. A (k = 100)	0.60359	0.00657	0.00193	0.00002	0.00002
Sim. B (k = 100)	0.63408	0.00613	0.00179	0.00002	0.00002
Sim. C (k = 100)	0.65727	0.06714	0.00149	0.00314	0.00008
Sim. D (k = 100)	0.64734	0.04089	0.00161	0.00268	0.00405
n = 1000
Sim. A (k = 20)	0.07016	<10⁻⁵	0.00063	<10⁻⁵	<10⁻⁵
Sim. B (k = 20)	0.07903	<10⁻⁵	0.00073	<10⁻⁵	<10⁻⁵
Sim. C (k = 20)	0.08789	0.00124	0.00062	0.00001	0.00002
Sim. D (k = 20)	0.08271	0.00158	0.00089	0.00010	0.00012
Sim. A (k = 100)	0.27634	0.00006	0.00073	<10⁻⁵	<10⁻⁵
Sim. B (k = 100)	0.29862	0.00006	0.00073	<10⁻⁵	<10⁻⁵
Sim. C (k = 100)	0.32460	0.02243	0.00062	0.00028	0.00022
Sim. D (k = 100)	0.30450	0.01108	0.00076	0.00098	0.00109

The best performing estimators are highlighted in bold.

7. Marginal Likelihood for Phylogenetic Data

In this section, we will investigate to what extent simpler estimators based solely on recycling posterior simulations can be effectively used in the phylogenetic context. As benchmark, we will use the most powerful and tunable tools based on the thermodynamic integration which require simulations from a grid of distributions. In particular, the successful implementation of the IDR estimator is illustrated with simulated and real data.

All estimates have been computed using the MCMC output of the simulations from the posterior distribution obtained using Phycas software (Lewis et al., 2008); the likelihood has been computed using the R package PML, while the reparameterization on \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\Re^k$$ \end{document} and IDR perturbation g_Pk(θ) have called for specifically developed R functions. The whole R code is available upon request from the first author.

GSS has been implemented using the open-source software Phycas (Lewis et al., 2008). We consider a grid of 20 β values and, following an initial burn-in phase consisting of 50000 cycles at β = 1 and then 10,000 MCMC cycles for each β including the cold chain at β = 1. A thinning rate of 2 has been used for all chains. In these examples, a grid of 20 β values resulted sufficiently large since estimates obtained with a larger grid size do not appreciably change.

In order to have a fair comparison of the estimation methods, all the other estimators have been computed using 200,000 random draws from the posterior only.

7.1. Hadamard data: marginal likelihood computation

We use as a first benchmark the synthetic data set Hadamard already employed in Felsenstein (2004). It consists of 200 amino acids and four species (A, B, C, and D). This dataset have been simulated from the Jukes-Cantor model (JC69), and the true tree is shown in the left panel of Figure 2.

FIG. 2.

True phylogenetic trees of the two data sets used as benchmarks: Hadamard (left) and Green (right).

For the true topology, we compute the marginal likelihood for the JC69 model and for the GTR + Γ model: parameters lie in a five-dimensional space for the JC69 model and in 14-dimensional space for the GTR + Γ model. The simulated values from the Metropolis-Coupled algorithm have been rearranged to evaluate the model evidence of both models using different variants of the GHM and thermodynamic integration approaches.

GTR + Γ and JC69 models have been implemented in Phycas which provides as output the simulated Markov chains. Simulated values from \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\tilde{q} (\cdot, \beta_t = 1)$$ \end{document} can be used as a sample from the posterior distribution in order to make inference on the model parameters.

In Table 3, we list the corresponding values of the IDR estimator on the log scale ( \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\log \hat{c}_{IDR}$$ \end{document} ), the Relative Mean Square Error (RMSE) estimate ( \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${\widehat {RMSE}}_{\hat{c}_{IDR}}$$ \end{document} ) as in equation (13) and the confidence interval \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\widehat{CI}$$ \end{document} for different perturbation masses k. In order to take into account the autocorrelation of the posterior simulated values, a correction has been applied to \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${\widehat {RMSE}}_{\hat{c}_{IDR}}$$ \end{document} replacing n in (13) with the effective sample size given by \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} {n}_{ESS} = n \times \frac {1} {1 + 2 \sum_{i = 1}^{I} \hat {\rho}_i} \tag {27} \end{align*} \end{document}

Table 3.

Inflated Density Ratio Method Applied to Hadamard Data with a JC69 Model with a Five-Dimensional Parameter Space

k	\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\widehat{RMSE^{*}}_{\hat{c}_{IDR}}$$ \end{document}	\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\log \hat{c}_{IDR} = - 592.2874$$ \end{document}	\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\log \hat{c}_{IDR}$$ \end{document}	\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\widehat{RMSE}_{\hat{c}_{IDR}}$$ \end{document}
10⁻⁷	−592.2877	0.0217	0.0644	[−592.3504, −592.2202]
10⁻⁶	−592.2874	0.0033	0.0099	[−592.2986, −592.2761]
10⁻⁵	−592.2934	0.0037	0.0101	[−592.3035, −592.2832]
10⁻⁴	−592.5918	0.0041	0.0122	[−592.6039, −592.5794]

IDR_α estimates on the log scale for a small regular grid of perturbation values. A trimmed factor of α = 0.01 has been used. The relative mean square errors \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\widehat{RMSE^{*}}_{\hat{c}_{IDR}}$$ \end{document} are computed as in (13) without accounting for autocorrelation. \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\widehat{CI}$$ \end{document} are the relative mean square errors corrected for the autocorrelation. \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\widehat{RMSE}_{\hat{c}_{IDR}}$$ \end{document} are confidence intervals on a log scale. Since the smallest error in the grid corresponds to a perturbation value k^opt = 10⁻⁶, the IDR estimator for the JC69 model is taken to be \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\widehat{RMSE^{*}}_{\hat{c}_{IDR}}$$ \end{document} .

with I = 10. Since the optimal corrected error \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\widehat{CI}$$ \end{document} corresponds to a perturbation value k^opt = 10⁻⁶, the IDR estimator (on a logarithmic scale) for the JC model is log \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\log \hat{c}_{IDR} = - 592.2874$$ \end{document} .

For comparative purposes, we considered the results of the IDR method with those obtained with the original HM and GHM based on the same reference distribution π₀ which is used to implement the GSS as in Fan et al. (2011). Indeed to verify the impact of the choice of the initial distribution we also evaluate the estimates resulting from GSS and SS. The path sampling estimator and its generalized version have been also computed. The evaluation of all these thermodynamic alternatives were simple to implement since the reference distribution as well as the prior together with their evaluation on the sampled parameters are available from the Phycas output.

RMSEs of the HM estimators have been computed adapting the formula (13) while RMSEs of the thermodynamic estimators have been computed adapting the formula (24). For the adjusted RMSEs the effective sample size in (27) has been used. For each method, the Monte Carlo relative error of the estimate has been computed re-estimating the model 10 times ( \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\widehat{RMSE}_{\hat{c}, MC}$$ \end{document} ) and recording the corresponding different values of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\hat{c}$$ \end{document} . Although it is known that under critical conditions such MC error is not sufficient to guarantee its precision it still remains a necessary premise for an accurate estimate.

The methods produce different results which were overall closer in the JC model while sometimes very distant in the GTR + Γ case. The closest marginal likelihood estimates are those from GSS, GPS, and GHM, all methods relying on the same reference distribution. Also SS estimates are quite close to those obtained with GSS. IDR and PS produce estimates which are very similar in both models although only slightly different from the previous ones. On the other hand, HM estimates can be very far apart. This is particularly evident in the GTR + Γ case, confirming that HM can be badly biased and unstable resulting in a seriously doubtful reliability. In the JC model, all methods, with the exception of HM, produce similar results: the corresponding estimates of the relative error, once adjusted for the autocorrelation, show that GPS and GSS are the most stable methods followed by IDR and PS. On the other hand, for the GTR + Γ model all estimates, with the exception of HM, are comparable although IDR results in the largest estimated error. This is also confirmed by the Monte Carlo error. Indeed we notice that the Monte Carlo estimates of the relative error do not always reflect well the theoretical estimates even when not accounting the autocorrelation. This could be explained by the presence of some bias in the corresponding estimates possibly due to choice of a not sufficiently dense grid or a not appropriate starting density such as the prior in the path sampling. This is particularly evident always for HM and for GHM and SS especially in the GRT + Γ model. For GSS, GPS, PS, and IDR, the Monte Carlo errors always fall in between the one estimated as with an independent sample and the one accounting for the presence of autocorrelation.

In order to better visualize the results in Table 4, Figure 3 shows the marginal likelihoods (on logarithmic scale) for JC69 and GTR + Γ model. Considering the reference values for the Bayes Factor defined in Kass and Raftery (1995), all methods, with the exception of HM, strongly and consistently give support to the Jukes-Cantor model, which is known in this case to be the true model. We also point out that the estimates BF are all very similar.

FIG. 3.

Hadamard data: Log marginal likelihood estimated with HM, GHM, IDR, GSS, and SS methods for JC and GTR + Γ models. The estimated values are reported in Table 4.

Table 4.

Hadamard Data: Marginal Likelihood Estimates of the GTR + Γ Model ( \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\omega = \Re^{14}$$ \end{document} ) Obtained with the HM, GHM, IDR, GSS, SS, PS, and GPS Approaches

Method	\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$log (\widehat{c})$$ \end{document}	\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\widehat{RMSE}_{\hat{c}}$$ \end{document}	\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\widehat{RMSE^{*}}_{\hat{c}}$$ \end{document}	\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\widehat{RMSE}_{\hat{c}, MC}$$ \end{document}
JC69
HM	−590.3502	111.5118	361.6873	0.5500
GHM	−591.9880	0.1949	0.5825	0.0038
IDR	−592.2883	0.0068	0.0223	0.0102
SS_{U(0, 1)}	−591.9896	0.0421	0.1141	0.0109
SS_{Beta(0.3, 1)}	−591.9923	0.0353	0.1041	0.0045
GSS_{U(0, 1)}	−592.0057	0.0006	0.0021	0.0008
GSS_{Beta(0.3, 1)}	−591.9885	0.0008	0.0027	0.0005
GPS	−591.9882	0.0008	0.0016	0.0005
PS	−592.0336	0.0105	0.0343	0.0060
GTR + Γ
HM	−596.4504	178.2420	494.4335	0.8641
GHM	−603.8314	0.9089	1.7845	0.0026
IDR	−604.6316	0.1429	0.2479	0.1664
SS_{U(0, 1)}	−603.8979	0.1960	0.2458	0.0591
SS_{Beta(0.3, 1)}	−603.8865	0.0996	0.1400	0.0349
GSS_{U(0, 1)}	−603.9066	0.0035	0.0250	0.0030
GSS_{Beta(0.3, 1)}	−603.8921	0.0040	0.0674	0.0056
GPS	−603.8283	0.0033	0.0132	0.0048
PS	−604.0438	0.0182	0.0929	0.0389

For GSS and SS estimators we show the estimates obtained when β_k are sampled from a Uniform distribution (U(0, 1)) and when are sampled from a Beta distribution with parameters β₁ = 0.3 and β₂ = 1.0. For GPS and PS the β_k are sampled from a Beta distribution with parameters β₁ = 0.3 and β₂ = 1.0. Three different RMSE estimates are provided: for the harmonic mean estimators \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\widehat{RMSE}_{\hat{c}}$$ \end{document} has been computed adapting the expression (13) while for thermodynamic estimators it has been computed adapting the expression (24). \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\widehat{RMSE^{*}}_{\hat{c}}$$ \end{document} is the RMSE corrected for the autocorrelation. \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\widehat{RMSE}_{\hat{c}, MC}$$ \end{document} comes from 10 Monte Carlo independent replicates of the estimation.

7.2. Hadamard data: tree selection

We now show how it is possible to extend the IDR approach for dealing with selecting competing trees when the topology is not fixed in advance. For a fixed substitution model, competing trees can be compared by considering the evidence of the data for a fixed tree topology. The evidence in support of each tree topology \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\tau_i \in N_T$$ \end{document} can be evaluated in terms of its posterior probability p(τ_i|X) derived from the Bayes theorem as \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} p (\tau_{i} \mid X) = \frac {p (X \mid \tau_{i}) \pi_{{\cal T}} (\tau_{i})} {\sum_{\tau \in N_T} p (X \mid \tau) \pi_{{\cal T}} (\tau)} \end{align*} \end{document}

where the experimental evidence in favor of the model M_τi with fixed tree topology τ_i is contained in the marginal likelihood \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} m (X \mid M_{\tau_{i}}) = p (X \mid \tau_{i}) = \int_{\Omega_{\tau_i}} p (X \mid \omega_i, M_{\tau_{i}}) \pi (\omega_i \mid \tau_{i}) d \omega_i \end{align*} \end{document}

where the continuous parameters \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\omega_i \in \Omega_{\tau i}$$ \end{document} of the evolutionary process corresponding to Mτ_i are integrated out as nuisance parameters. Indeed, when prior beliefs on trees are set equal π(τ_i) = π(τ_j) comparative evidence discriminating τ_i against τ_j, is summarized in the BF \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} BF_{ij} = \frac {m (X \mid M_{\tau_{i}})} {m (X \mid M_{\tau_{j}})}. \tag {28} \end{align*} \end{document}

We have considered the problem of selecting competing trees of a substitution model for Hadamard data. In the previous subsection, we have verified the feasibility and effectiveness of the IDR approach in comparing JC69 with the GTR + Γ model. Under a fixed topology, JC69 was favored. Indeed, we know that Hadamard data was simulated from the JC69 model with true topology labeled as τ = 1. Hence, we can compare N_T = 3 alternative topologies within the JC69 model computing the corresponding estimates of the marginal likelihoods and finally evaluate tree selection in terms of the pairwise BF. Results are shown in Table 5.

Table 5.

Hadamard Data: The Bayes Factor Is Computed in Order to Compare Competing Topologies

Method	log(BF₁₂)	log(BF₁₃)
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\widehat{CI}_{log (\widehat{BF}_{HM})}$$ \end{document}	[−4.1912, 10.4829]	[−129.6968,137.9124]
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\widehat{CI}_{log (\widehat{BF}_{GHM})}$$ \end{document}	[3.5812, 3.4479]	[3.8448, 3.9131]
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\widehat{CI}_{log (\widehat{BF}_{IDR})}$$ \end{document}	[4.2868, 4.3468]	[5.0088, 5.0488]
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\widehat{CI}_{log (\widehat{BF}_{SS_{U (0, 1)}})}$$ \end{document}	[3.0223, 3.0220]	[3.1702, 3.1693]
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\widehat{CI}_{log (\widehat{BF}_{SS_{Beta (0.3, 1)}})}$$ \end{document}	[3.464, 3.554]	[3.8548, 3.9348]
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\widehat{CI}_{log (\widehat{BF}_{GSS_{U (0, 1)}})}$$ \end{document}	[3.5205, 3.5210]	[3.8847, 3.8862]
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\widehat{CI}_{log (\widehat{BF}_{GSS_{Beta (0.3, 1)}})}$$ \end{document}	[3.6375, 3.6575]	[3.908, 3.928]
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\widehat{CI}_{log (\widehat{BF}_{GPS})}$$ \end{document}	[3.4961, 3.5280]	[3.8747, 3.8847]
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\widehat{CI}_{log (\widehat{BF}_{PS})}$$ \end{document}	[3.5621, 3.6421]	[3.9526, 4.0346]

The Bayes Factor is approximated with different methods: HM, GHM, IDR, SS, GSS, GPS, and PS. \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\widehat{CI}^{MC}_{log (\widehat{BF})}$$ \end{document} is the confidence interval obtained as \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$log (\hat{BF}) \pm 2 \cdot \widehat{RMSE}_{log (\hat{BF})}$$ \end{document} .

In this case, all methods allow to correctly favor the true topology with an estimate of the BF in favor of τ = 1, which ranges from strong to decisive. However, the original HM estimate is affected by a very large variability so that the confidence interval of the corresponding BF cannot lead to a clear conclusion. On the other hand, although the other methods are similarly conclusive, even accounting for estimate variability, the IDR method yields the most supporting evidence in favor of the true tree.

7.3. Green Plant rbcL example

We analyze the same green plant data used in Xie et al. (2011) (Table 6). Green plant data consists of 10 taxons and DNA sequences of the chloroplast-encoded large subunit of the RuBisCO gene (rbcL) (1296 amino acids). Following Xie et al. (2011) we use the topology shown in the right panel of Figure 1.

Table 6.

Green data Data: Marginal Likelihood Estimates of the GTR + Γ Model ( \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\omega = \Re^{26}$$ \end{document} ) and JC69 Model ( \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\omega = \Re^{17}$$ \end{document} ) Obtained with the HM, GHM, IDR, GSS, and SS Methods

Method	\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$log (\widehat{c})$$ \end{document}	\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\widehat{RMSE}_{\hat{c}}$$ \end{document}	\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\widehat{RMSE^{*}}_{\hat{c}}$$ \end{document}	\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\widehat{RMSE}_{\hat{c}, MC}$$ \end{document}
JC69
HM	−7374.8690	139.3546	453.5142	0.4764
GHM	−7398.5300	0.5945	1.7858	0.0012
IDR	−7399.8450	0.1723	0.3014	0.1959
SS_{U(0, 1)}	−7399.6500	0.6824	1.5566	1.8476
SS_{Beta(0.3, 1)}	−7398.5360	0.1168	0.1884	0.0606
GSS_{U(0, 1)}	−7398.6410	0.0026	0.0084	0.0015
GSS_{Beta(0.3, 1)}	−7398.529	0.0019	0.0061	0.0015
GPS	−7398.5280	0.0036	0.0065	0.0017
PS	−7399.1250	0.0293	0.0964	0.0514
GTR + Γ
HM	−6637.366	219.9642	619.4541	1.5106
GHM	−6679.766	0.0110	0.1325	0.0279
IDR	−6679.679	0.2600	0.3501	0.3492
SS_{U(0, 1)}	−6700.856	0.9540	6.8098	>10³
SS_{Beta(0.3, 1)}	−6679.802	0.2489	0.4344	0.0891
GSS_{U(0, 1)}	−6679.653	0.0110	0.1326	0.0280
GSS_{Beta(0.3, 1)}	−6679.859	0.0094	0.2503	0.0336
GPS	−6679.856	0.0163	0.0824	0.0305
PS	−6681.270	0.0396	0.2777	0.1220

For GSS and SS estimators, we show the estimates obtained when β_k are sampled from a Uniform distribution (U(0, 1)) and when are sampled from a Beta distribution with parameters β₁ = 0.3 and β₂ = 1.0. For GPS and PS, the β_k are sampled from a Beta distribution with parameters β₁ = 0.3 and β₂ = 1.0. Three different RMSE estimates are provided: for the harmonic mean estimators \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\widehat{RMSE}_{\hat{c}}$$ \end{document} has been computed adapting the expression (13) while for thermodynamic estimators it has been computed adapting the expression (24). \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\widehat{RMSE^{*}}_{\hat{c}}$$ \end{document} is the RMSE corrected for the autocorrelation. \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\widehat{RMSE}_{\hat{c}, MC}$$ \end{document} comes from 10 Monte Carlo independent replicates of the estimation.

Again, we fit JC69 and GTR + Γ models and compare HM, GHM, IDR, SS, GSS, GPS, and PS marginal likelihood estimates. Original HM method appears to be unstable and remarkably different from all the other methods. GHM, GSS, and GPS give the most similar estimates. On the other hand, IDR, although showing larger error estimates, agrees with all the other methods and is very similar to PS.

Similar arguments apply when the GTR + Γ model is considered: all methods, with the exception of HM, produce very close estimates and, when accounted for the autocorrelation, also error estimates are comparable. However, SS method produces different results when applied to GTR + Γ model according to different choices of grid points: it appears that the auxiliary distribution whose quantiles are used to set the β grid plays here a key role in better estimating the marginal likelihood.

As far as model comparison is concerned the GTR + Γ model is strongly supported by all methods. Also in this case, HM reveals itself unreliable and not conclusive because of the larger error. On the other hand, the other methods give similar results. Despite some difference of the estimates produced by IDR, those differences are mitigated when BF is considered with IDR providing a slightly stronger evidence in favor of GTR + Gamma model.

In these examples, GSS and GPS are the most stable methods with respect to estimation error, and are somewhat robust with respect to the specification of the auxiliary distribution used to set the grid of β values. The stability of the results is possibly due to the fact that the reference distribution π₀ is well calibrated and closely resembles the posterior distribution. In Table 7, we investigate the robustness of the GSS estimates with respect to change of scale of the reference distribution. In particular, for the branch length parameters ν_τ the reference distribution is specified as the product of independent and identically distributed gamma distributions with shape α₁ and rate α₂ which are estimated using simulations from the posterior distribution. We modify the scale of the reference distribution considering as reference a gamma distribution with shape \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\frac {\alpha_1} {\lambda}$$ \end{document} and rate α₂λ where λ is the perturbation factor. The same is done for the reference distribution component of the Γ parameter. For the substitution rates, we perturbed the original reference distribution multiplying all Dirichlet parameters for λ. Table 7 shows GSS can be affected by some variation in the scale of the reference distribution.

Table 7.

Sensitivity of GSS Method with Respect to Perturbation of the Reference Distribution

\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\hat{c}_{GSS} = - 6679.685$$ \end{document}
λ	ν_τ	ρ	Γ
0.5	−6676.399 (1.6 10⁻2)	−6678.473 (1.3 10⁻4)	−6679.538 (1.2 10⁻4)
1.1	−6679.616 (1.1 10⁻4)	−6679.745 (1.2 10⁻4)	−6679.681 (1.3 10⁻4)
1.2	−6679.485 (1.1 10⁻4)	−6679.771 (1.2 10⁻4)	−6679.673 (1.1 10⁻4)
1.5	−6678.894 (1.1 10⁻4)	−6679.682 (1.2 10⁻4)	−6679.641 (1.2 10⁻4)
2.0	−6677.695 (1.1 10⁻4)	−6679.160 (1.3 10⁻4)	−6679.575 (0.9 10⁻4)

Values in brackets correspond to the estimated RMSE.

8. Discussion

In this article, we have investigated the possibility of using alternative simple effective recipes for evaluating model evidence of competing phylogenetic substitution models. In a Bayesian framework, several methods have been proposed in order to approximate the marginal likelihood of a single model and then eventually estimate the BF of two competing models. None of the methods proposed so far can be considered outperforming the others simultaneously in terms of simplicity of implementation, computational burden and precision of the estimates. One of the most widely used method until now has been the HM: the simplicity of implementation combined with a relatively light computational burden are two appealing features which explain why the HM is still one of the most favorite option for routine implementation (von Reumont et al., 2009). However, the simplicity of HM is often not matched with its accuracy, and very recent literature has highlighted unreliability of HM estimators in phylogenetic models (Fan et al., 2011; Lartillot and Philippe, 2006; Xie et al., 2011), as well as in more general biological applications (Calderhead and Girolami, 2009). In this article, we have provided evidence of improved effectiveness of a simple alternative marginal likelihood estimators belonging to the class of GHM estimators. In particular, we have introduced a new marginal likelihood estimator, the IDR estimator, and we have also borrowed from Fan et al. (2011) the use of a suitable auxiliary distribution in the GHM formula. They share the original simplicity and computation feasibility of the HM estimator, but unlike it, they enjoy important theoretical properties, such as the finiteness of the variance. Moreover, they allow one to recycle posterior simulations which is particularly appealing in those contexts—such as phylogenetic models—where the computational burden of the simulation is heavier than the evaluation of the likelihood, posterior densities, and the like. We have verified the reliability of the IDR estimator as well as the improved GHM in some of the most common phylogenetic substitution models under different model complexity. In this study, the topology was fixed when estimating the marginal likelihood, even though there is often interest in estimating marginal likelihood when the topology is unknown. This problem is still open, as the most widely used methods for estimating the marginal likelihood are applied to a fixed topology (Fan et al., 2011; Lartillot and Philippe, 2006; Xie et al., 2011). As a simple solution, IDR estimator for an unknown topology can be computed by averaging IDR estimators computed with fixed topologies. The single estimators can then be combined in a weighted mean whose weights are the frequency of each topology in the MCMC. However, more work is needed in order to determine the efficiency of this method and its applicability in real context.

Overall, we have compared two alternative classes of marginal likelihood estimators: the first one relying on the GHM formula and using only a single simulation from the posterior, and the second one relying on the thermodynamic integration and requiring simulations from the auxiliary distribution. As representative of the first class, we have considered the classical HM estimator, the GHM estimator, and the IDR estimator. For the second class, we have considered the classical PS based on the prior as starting auxiliary density as well as the GPS with an appropriately chosen reference distribution. We have also considered the SS method and the GSS, which can be considered—so far—as one of the most reliable method for ML approximation.

Focusing only on applications of the first class of estimators, for phylogenetic examples we can conclude that GHM, relying on the same reference distribution of GSS and GPS, resulted as the most performing estimator. On the other hand, IDR turned out still a reliable tool for comparing model evidence of substitution models in agreement with all the other methods, while in simulated examples, IDR resulted competitive and outperform GHM when the reference distribution cannot be calibrated sufficiently close to the target posterior for the presence of unmodeled correlation in the class of candidate reference distributions. With respect to the second class, although GSS and GPS turned out to be very accurate and performed very similarly in the phylogenetic examples, in the simulated examples GPS resulted more robust to possible miscalibration of the reference distribution. Moreover, we have shown that when implementing GPS and GSS the first important feature is the suitable choice of the reference distribution: we have investigated the degree of sensitivity of the final estimates to the change of scale of the reference as well as of the correlation patterns and GPS resulted more robust than GSS. The final important feature to be addressed is the choice of the grid size and of the grid points: in fact, one can consider the marginal likelihood estimator reliable only when, increasing the grid size, it does not change appreciably. However, once a reference distribution is chosen very close to the target distribution, one can obtain accurate estimators when the grid size can be as low as 20 grid points. Whether or not one has calibrated a suitable reference, IDR and GHM can still be considered reliable alternatives, which can be useful when a simpler solution is sought for or when a sophisticated calibration of the reference cannot be successfully obtained. We believe that since no method can be considered as an intrinsically infallible benchmark for estimating the marginal likelihood, a consensus of multiple methods on the plausible order of magnitude of the marginal likelihood may well result in a safer estimation and better grounded scientific conclusions.

Footnotes

Acknowledgments

We would like to thank an anonymous referee whose comments strongly improved our article and Paul Lewis for his kind support with the Phycas software.

Disclosure Statement

No competing financial interests exist.

References

Abdo

, Minin

V.N.

, Joyce

et al. 2005. Accounting for uncertainty in the tree topology has little effect on the decision-theoretic approach to model selection in phylogeny estimation. Mol. Biol. Evol., 22:691–703.

Aitichinson

1986. The Statistical Analysis of Compositional Data. Chapman & Hall: London.

Altekar

, Dwarkadas

, Huelsenbeck

J.P.

et al. 2004. Parallel metropolis-coupled Markov Chain Monte Carlo for Bayesian phylogenetic inference. Bioinformatics, 20:407–415.

Arima

2009. Bayesian tools for complex statistical models in genetics [Ph.D. dissertation] Sapienza Universitá di Roma: Rome.

Bartolucci

, Scaccia

, Mira

2006. Efficient Bayes factor estimation from the reversible jump output. Biometrika, 93:41–52.

Calderhead

, Girolami

2009. Estimating Bayes Factors via thermodynamic integration and population MCMC. Comput. Stat. Data Anal., 53:4028–4045.

Cavalli-Sforza

L.L.

, Edwards

A.W.F.

1967. Phylogenetic analysis: models and estimation procedures. Evolution, 21:550–570.

Chen

M.H.

1994. Importance-weighted marginal Bayesian posterior density estimation. J. Am. Stat. Assoc., 89:818–824.

hen

M.H.

, Shao

Q.M.

1997. On Monte Carlo methods for estimating ratios of normalizing constants. Ann. Stat., 25:1563–1594.

10.

K.D.

, Muller

, Vannucci

2006. Bayesian Inference for Gene Expression and Proteomics. Cambridge University Press: New York.

11.

Fan

, Wu

, Chen

M.H.

et al. 2011. Choosing among partition models in Bayesian phylogenetics. Mol. Biol. Evol., 28:523–532.

12.

Felsenstein

1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol., 17:368–376.

13.

Felsenstein

2004. Inferring Phylogenies. Sinauer: Sunderland, MA.

14.

Fitch

W.M.

, Margoliash

1967. Construction of phylogenetic trees. Science, 155:279–284.

15.

Frenkel

1986. Free-energy competition and first-order phase transition. Ciccotti

, Hover

W.G.

Molecular-Dynamics Simulation of Statistical-Mechanical Systems. North Holland, Amsterdam.

16.

Friel

, Pettitt

A.N.

2008. Marginal likelihood estimation via power posteriors. J. R. Stat. Soc. B Stat. Methodol., 70:589–607.

17.

Gelfand

A.E.

, Dey

D.K.

1994. Bayesian model choice: asymptotics and exact calculations. J. R. Stat. Soc. B Methodological, 56:501–514.

18.

Gelman

, Meng

X.L.

1998. Simulating normalizing constants: from importance sampling to bridge sampling to path sampling. Stat. Sci., 13:163–185.

19.

Gowri-Shankar

, Jow

2006. PHASE: a software package for PHylogenetics And Sequence Evolution, 2.0. University of Manchester: Manchester, UK.

20.

Gowri-Shankar

2006. RNA phylogenetic inference with heterogeneous substitution models [Ph.D. dissertation] University of Manchester: Manchester, UK.

21.

Green

1995. Reversible jump Markov Chain Monte Carlo computation and Bayesian model determination. Biometrika, 82:711–732.

22.

Hudelot

, Gowri-Shankar

, Jow

et al. 2003. RNA-based phylogenetic methods: application to mammalian mitochondrial RNA sequences. Mol. Phylogenet. Evol., 28:241–252.

23.

Huelsenbeck

J.P.

, Ronquist

2001. Mrbayes: Bayesian inference of phylogenetic trees. Bioinformatics, 17:754–755.

24.

Huelsenbeck

J.P.

, Larget

, Alfaro

2004. Bayesian phylogenetic model selection using reversible jump Markov chain Monte Carlo. Mol. Biol. Evol., 21:1123–1133.

25.

Jukes

T.H.

, Cantor

C.R.

1969. Evolution of protein molecules, 21–123. Munro

H.N.

Mammalian Protein Metabolism. Academic Press: New York.

26.

Kass

R.E.

, Raftery

1995. Bayes factor. J. Am. Stat. Assoc., 90:773–795.

27.

Lanave

, Preparata

, Serio

1984. A new method for calculating evolutionary substitution rates. J. Mol. Evol., 20:86–93.

28.

Larget

, Simon

D.L.

1999. Markov Chain Monte Carlo algorithms for the Bayesian analysis of phylogenetic trees. Mol. Biol. Evol., 16:750–759.

29.

Larget

, Miller

R.E.

, Huelsenbeck

J.P.

et al. 2002. Potential applications and pitfalls of Bayesian inference of phylogeny. Syst. Biol., 51:673–688.

30.

Lartillot

, Philippe

2006. Computing Bayes factors using thermodynamic integration. Syst. Biol., 55:195–207.

31.

Lartillot

, Brinkmann

, Philippe

2007. Suppression of long branch attraction artefacts in the animal phylogeny using site-heterogeneous model. BMC Evol. Biol., 7.

32.

Lefebvre

, Steele

, Vandal

A.C.

2010. A path sampling identity for computing the Kullback-Leibler and J divergences Comput. Stat. Data Anal., 54:1719–1731.

33.

Lewis

P.O.

, Holder

M.T.

, Swofford

D.L.

2008. Phycas: software for phylogenetic analysis. http://www.phycas.org/. 2012 January 10.

34.

, Pearl

, Doss

2000. Phylogenetic tree reconstruction using Markov Chain Monte Carlo. J. Am. Stat. Assoc., 95:493–508.

35.

Minin

V.N

, Abdo

, Joyce

et al. 2003. Performance-based selection of likelihood models for phylogeny estimation. Syst. Biol., 52:674–683.

36.

Newton

M.A.

, Raftery

1994. Approximate Bayesian inference by the weighted likelihood bootstrap. J. R. Stat. Soc B, 56:3–48.

37.

Normana

J.A.

, Ericsonc

G.P.

, Jonssond

K.A.

et al. 2009. A multi-gene phylogeny reveals novel relationships for aberrant genera of AustraloPapuan core Corvoidea and polyphyly of the Pachycephalidae and Psophodidae (Aves: Passeriformes) Mol. Phylogenet. Evol., 52:488–497.

38.

Ogata

1989. A Monte Carlo method for high dimensional integration. Numer. Math., 55:137–157.

39.

Petris

, Tardella

2003. A geometric approach to transdimensional Markov chain Monte Carlo. Can. J. Stat., 31:469–482.

40.

Petris

, Tardella

2007. New perspectives for estimating normalizing constants via posterior simulation [Technical Report] Sapienza Universitá di Roma: Rome.

41.

Pol

2004. Empirical problems of the hierarchical likelihood ratio test for model selection. Syst. Biol., 53:949–962.

42.

Posada

, Crandall

K.A.

2001. Selecting the best-fit model of nucleotide substitution. Syst. Biol., 50:580–601.

43.

Posada

, Buckely

T.R.

2004. Model selection and model averaging in phylogenetics: advantages of Akaike information criterion and Bayesian approach over likelihood ratio test. Syst. Biol., 53:793–808.

44.

R Development Core Team. 2008. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing: Vienna, Austria.

45.

Raftery

M.A.

, Newton

, Satagopan

J.M.

et al. 2007. Estimating the integrated likelihood via posterior simulation using the harmonic mean identity. Proc. 8th Valencia Int Mtg. Bayesian Stat., 8:1–45.

46.

Robert

R.C.

, Wraith

2009. Computational methods for Bayesian model choice. arXiv:0907.5123.

47.

Rodrigue

, Philippe

, Lartillot

2007. Exploring fast computational strategies for probabilistic phylogenetic analysis. Syst. Biol., 55:137–157.

48.

Ronquist

, Huelsenbeck

J.P.

2003. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics, 19:1572:1574G.

49.

Ronquist

, Deans

A.R.

2010. Bayesian phylogenetics and its influence on insect systematics. Annu. Rev. Entomol, 55:189–206.

50.

Schwartz

1978. Estimating the dimension of a model. Ann. Stat., 6:461–464.

51.

Sinsheimer

J.S.

, Lake

J.A.

, Little

J.A.

1996. Bayesian hypothesis testing of four-taxon topologies using molecular sequence data. Biometrics, 52:193–210.

52.

Swafford

D.L.

2003. PAUP: Phylogenetic Analysis Using Parsimony (and Other Methods) Sinauer Associates: Sunderland, MA.

53.

Tiao

G.G.

, Cuttman

1965. The inverted Dirichlet distribution with applications. J. Am. Stat. Assoc., 311:793–805.

54.

Verdinelli

, Wasserman

1995. Computing Bayes factor using a generalization of the Savage-Dickey density ratio. J. Am. Stat. Assoc., 90:614–618.

55.

von Reumont

, Meusemann

, Szucsich

N.U.

et al. 2009. Can comprehensive background knowledge be incorporated into substitution models to improve phylogenetic analyses? A case study on major arthropod relationships. BMC Evol. Biol., 9:1–19.

56.

Wang

, Johnston

P.R.

, Yang

Z.L.

et al. 2009. Evolution of reproductive morphology in leaf endophytes. PLoS ONE, 4:e4246.

57.

Weiss

R.E.

, Suchard

M.A.

, Sinsheimer

J.S.

2001. Bayesian selection of continuous-time Markov chain evolutionary models. Mol. Biol. Evol., 18:1001–1013.

58.

Xie

, Lewis

, Fan

et al. 2011. Improving marginal likelihood estimation for Bayesian phylogenetic model selection. Syst. Biol., 60:150–160.

59.

Yamanoue

, Miya

, Matsuura

et al. 2008. A new perspective on phylogeny and evolution of tetraodontiform fishes (Pisces: Acanthopterygii) based on whole mitochondrial genome sequences: basal ecological diversification? SBMC Evol. Biol., 8:212–226.

60.

Yang

2006. Computational Molecular Evolution. OSEE: Oxford, UK.

61.

Yang

2007. PAML 4: phylogentic analysis by maximum likelihood. Mol. Biol. Evol., 24:1586–1591.