Incomplete graphical model inference via latent tree aggregation

Abstract

Graphical network inference is used in many fields such as genomics or ecology to infer the conditional independence structure between variables, from measurements of gene expression or species abundances for instance. In many practical cases, not all variables involved in the network have been observed, and the samples are actually drawn from a distribution where some variables have been marginalized out. This challenges the sparsity assumption commonly made in graphical model inference, since marginalization yields locally dense structures, even when the original network is sparse. We present a procedure for inferring Gaussian graphical models when some variables are unobserved, that accounts both for the influence of missing variables and the low density of the original network. Our model is based on the aggregation of spanning trees, and the estimation procedure on the expectation-maximization algorithm. We treat the graph structure and the unobserved nodes as missing variables and compute posterior probabilities of edge appearance. To provide a complete methodology, we also propose several model selection criteria to estimate the number of missing nodes. A simulation study and an illustration on flow cytometry data reveal that our method has favourable edge detection properties compared to existing graph inference techniques. The methods are implemented in an R package.

Keywords

Gaussian graphical model; latent variables; EM algorithm; model selection

1 Introduction

1.1 Motivations

Graphical models have been extensively studied and used in a wide variety of contexts to represent complex dependency structures. In many practical cases however, it is more than likely that some variables involved in the network were in fact not observed. Such missing variables are interpreted as actors that were not measured but nonetheless influence the measurements, or experimental conditions that were not taken into account.

Structure inference aims at revealing independence between subsets of variables (e.g., pairs) conditional on all other observed ones. When some variables are missing, the inference is carried out in a distribution from which they have been marginalized out. Thus, two variables correlated with a missing one will appear to be ‘conditionally’ dependent based on the observed sample, which misleads the interpretation of the dependency structure.

The existence of unobserved variables can be naturally encompassed in the graphical model framework by assuming there exists a ‘full’ graph describing the conditional independence structure of the joint distribution of observed and hidden variables. Observations are then samples of the marginal distribution of the observed variables only. From a graph-theoretical point of view, marginalizing hidden variables means removing them from the node set and marrying their children together, thus forming complete subgraphs, that is, cliques. Hence, the conditional independence structure among the observed variables is described by a marginal graph containing locally dense structures. This violates the sparsity assumption on which the majority of graph inference methods are based. Moreover, an identifiability problem arises in the hidden variable setting, since infinitely many full graphs induce the same marginal structure.

In this article, we are interested both in checking whether some variables are indeed missing from the graph and, if so, in inferring the complete graphical model. We address these problems in the context of Gaussian graphical models.

1.2 Incomplete Gaussian graphical models

Consider a multivariate Gaussian random vector parametrized by its precision matrix

X \in ℝ^{p + r} \sim N (0, K^{- 1}), p, r \geq 1, K \in ℝ^{(p + r) \times (p + r)} ≻ 0,

(1.1)

where $≻$ denotes positive definiteness. We assume that $X$ can be decomposed as

X = (X_{O}, X_{H}),

where $X_{O} \in ℝ^{p}$ denotes a set of observed variables and $X_{H} \in ℝ^{r}$ a set of hidden variables. The goal of graphical model inference is to uncover the conditional independence structure of $X$ , described by the following ‘full graph’

G = ({1, \dots, p, p + 1, \dots, p + r}, E),

(1.2)

where $E$ is the set of undirected edges, such that ${i, j} \in E$ if and only if $X_{i}$ and $X_{j}$ are dependent conditionally on $X_{{1, \dots, p + r} \ {i, j}}$ , which we denote $X_{i} \not\perp\!\!\!\perp X_{j} | X_{{1, \dots, p + r} \ {i, j}}$ . In the Gaussian setting we consider, the set of edges $E$ is determined by the non-zero entries of $K$ (Lauritzen, 1996):

For all (i, j) \in {1, \dots, p + r}^{2}, i \neq j, {i, j} \in E if and only if K_{ij} \neq 0 .

(1.3)

The precision matrix $K$ can be written blockwise to differentiate the terms corresponding to observed and latent variables:

K = (\begin{matrix} K_{O} & K_{OH} \\ K_{HO} & K_{H} \end{matrix}) .

(1.4)

From (1.4) and the Schur complement formula (Boyd and Vandenberghe, 2004, Example 3.15) we deduce that the marginal distribution of the observed variables is

X_{O} \sim N (0, K_{m}^{- 1}), K_{m} = K_{O} - K_{OH} K_{H}^{- 1} K_{HO} .

(1.5)
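As a small numerical illustration of (1.5), the following Python sketch computes the marginal precision matrix by the Schur complement and shows how marginalizing a hidden hub connects its neighbours. The function name `marginal_precision` and the toy matrix are our own illustrative choices, not part of the method's implementation.

```python
import numpy as np

def marginal_precision(K, p):
    """Schur complement K_m = K_O - K_OH K_H^{-1} K_HO for the first p variables."""
    K_O, K_OH = K[:p, :p], K[:p, p:]
    K_HO, K_H = K[p:, :p], K[p:, p:]
    return K_O - K_OH @ np.linalg.solve(K_H, K_HO)

# Toy star graph (illustrative values): observed nodes 0, 1, 2 are
# connected only to the hidden node 3.
K = np.array([
    [2.0, 0.0, 0.0, 0.5],
    [0.0, 2.0, 0.0, 0.5],
    [0.0, 0.0, 2.0, 0.5],
    [0.5, 0.5, 0.5, 2.0],
])
K_m = marginal_precision(K, p=3)
print(np.round(K_m, 3))
```

In this toy example the observed block of $K$ is diagonal, whereas all off-diagonal entries of $K_{m}$ are non-zero: the three neighbours of the hidden node form a clique in the marginal graph.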

The conditional independence structure of $X_{O}$ is thus described by the following ‘marginal graph’

G_{m} = ({1, \dots, p}, E_{m}),

where $E_{m}$ is the set of undirected edges given by the nonzero entries of $K_{m}$ . Consider a sample $(X_{O}^{1}, \dots, X_{O}^{n})$ of $n$ independent realizations of the marginal distribution $X_{O} \sim N (0, K_{m}^{- 1})$ . From such measurements, standard statistical tasks are to infer the ‘full graph’ $G$ or the ‘marginal graph’ $G_{m}$ ; in this article we tackle both problems.

1.3 Contributions and related work

Methods to perform graphical model inference with unobserved variables have been proposed in the past. Some use the expectation-maximization (EM) algorithm (Dempster et al., 1977), its variational approximation described in Beal and Ghahramani (2003), or the Bayesian structural EM algorithm (Friedman, 1998). Much attention has also been paid to the regularized approach described in Chandrasekaran et al. (2012), based on the analysis of the sum of low-rank and sparse matrices. Alternatives based on this method were also proposed by Meng et al. (2014), Lauritzen and Meinshausen (2012) and Giraud and Tsybakov (2012).

A major concern in the latent variable framework is identifiability; in general, identifiability constraints are very complex, such as those derived in Chandrasekaran et al. (2012) for their model, which rely on algebraic geometry properties of low-rank and sparse matrices. In contrast, in the particular case of trees (acyclic graphs), the conditions for identifying the joint graph from the marginal graph only, described in Pearl (1988), are very simple. In this article, we propose to exploit this property to build an inference strategy based on the EM algorithm and spanning trees.

Latent tree models were studied in the context of phylogenetic tree learning; the neighbor-joining algorithm (Saitou and Nei, 1987), among others, is a popular method in this field. More recently, a method called recursive grouping was proposed in Choi et al. (2011) to reconstruct tree structures from partially observable data. We emphasize that all these methods learn a single tree from data. In the present work, we take advantage of two key properties of tree-structured graphical models. First, we can specify under which conditions they remain identifiable in the presence of missing variables. Second, treating trees as random, we can easily integrate over the whole set of spanning trees, thanks to an algebraic result called the Matrix-Tree theorem (Chaiken, 1982). To our knowledge, no method for latent variable graphical model inference is based on mixtures of trees, which constitutes the main novelty of our approach.

Our contribution can be cast in the framework of Meilă and Jordan (2000), who considered a special mixture of Bayesian networks (Geiger and Heckerman, 1996) where each network involved in the mixture is tree-shaped. Meilă and Jordan (2000) show the interest of such a model both in terms of tractability and interpretation. Meilă and Jaakkola (2006) also use the same framework to estimate the joint distribution of the observed variables, and Shiers et al. (2016) aim at characterizing such distributions, but none of them is interested in the inference of the structure of the graphical model itself. The first difference with these tree-based methods is that we do not limit ourselves to a fixed number of trees but consider a mixture over all possible trees. Second, and more importantly, we extend the framework to the hidden variable setting.

Our inference strategy is based on the EM algorithm. The computations at the E-step are tractable thanks to the Matrix-Tree theorem, which enables us to integrate over the whole set of spanning trees, as opposed to the M-step of Meilă and Jordan (2000) that relies on the Chow–Liu algorithm (Chow and Liu, 1968). This approach enables us to compute posterior probabilities of edge appearance, as proposed by Schwaller et al. (2015) in the fully observable setting. To our knowledge, no other existing approach provides such an edge-specific measure of reliability. The final inference of the graph relies on the ranking of these probabilities; therefore we estimate graphs with general structures, though our method is based on trees. Although we mostly focus on the inference of the graph structure, we also obtain an estimate of the precision matrix of the joint distribution of the observed and hidden variables, as a by-product of the EM algorithm.

Our first contribution is to define, in Section 2, a latent tree aggregation model for graphical model inference in the presence of hidden variables and to give identifiability conditions. In Section 3, we introduce our procedure based on the EM algorithm to infer the parameters of the joint distribution and probabilities of edge appearance, and to estimate the number of missing nodes. In Section 4 we show on synthetic data that our method compares favourably to competitors in terms of edge detection. Finally we illustrate the procedure with a flow cytometry data analysis in Section 5.

2 Latent tree aggregation model

2.1 Identifiability conditions

Assume the ‘full graph’ $G$ defined in (1.2) is tree-structured. We now characterize the class of trees that are statistically identifiable in our model, that is, such that the full graph $G$ is uniquely determined by the marginal structure $G_{m}$ . We assume without loss of generality that the observed and hidden variables are ordered, that is, $X_{i}$ is observed for all $i \in {1, \dots, p}$ and hidden for all $i \in {p + 1, \dots, p + r}$ , and, for a set $A$ , we denote by $Card (A)$ its cardinality. For $i \in {1, \dots, p + r}$ , we define

E_{i} = \{j \in {1, \dots p + r}; {i, j} \in E\} .

The following conditions on $G$ and $K$ , derived from Pearl (1985), Pearl (1988) and Choi et al. (2011), guarantee statistical identifiability.

Assumption [Identifiability conditions]

(a) For all $(i, j) \in {p + 1, \dots, p + r}^{2}$ , ${i, j} \notin E$ ;

(b) For all $i \in {p + 1, \dots, p + r}$ , $Card (E_{i}) \geq 3$ ;

(c) Two nodes connected by an edge are neither perfectly independent nor perfectly dependent.

These conditions stem from the simple graphical properties of spanning trees. Indeed, the maximal cliques of a tree are of size two, therefore if (a) no edge connects two hidden nodes and (b) all hidden variables have at least three neighbours, there is exactly one hidden node for every clique of size at least $3$ in $G_{m}$ , as illustrated in Figure 1, and the class of identifiable trees is now fully characterized. In particular, hubs (central hidden nodes) are identifiable, while recovering chains of hidden nodes, or hidden nodes located at the leaves of the tree, is hopeless. An important feature is that our identifiability conditions allow sparsity in $G_{m}$ , contrary to what happens in the sparse plus low-rank model of Chandrasekaran et al. (2012). Indeed, identifiable graph structures in their case will typically have a small number of central hidden variables (hubs), and marginal graphs will therefore be densely connected, or even complete. This is an important difference with our model, and we will see in Section 4 that the inferred marginal structures are in fact very different.

Figure 1:

Effect of marginalizing one hidden variable (h). Full graph (all edges except blue), marginal graph (all edges except red) (see online for colour images)

2.2 Fixed unknown tree

We now turn to the description of our latent tree aggregation model, and start with a simple procedure where we infer a single tree structure. Let $\mathcal{T}$ be the set of spanning trees with $p + r$ nodes, and assume the graphical model associated with $X$ , that we now write $T \in \mathcal{T}$ , is tree-shaped. Assume further that, conditionally on $T$ , the vector $X = (X_{O}, X_{H})$ is drawn from the Gaussian distribution $N (0, K_{T}^{- 1})$ , where $K_{T}$ has a tree-structured support determined by the edges of $T$ , and can be decomposed as

K_{T} = (\begin{matrix} K_{T, O} & K_{T, OH} \\ K_{T, HO} & K_{T, H} \end{matrix}) .

(2.1)

In the complete data setting where $X$ is fully observed but $T$ is unknown, the Chow–Liu algorithm (Chow and Liu, 1968) computes the tree of maximum likelihood $\hat{T}$ from empirical observations, and the coefficients of the matrix $K_{\hat{T}}$ can be computed easily using a result of Lauritzen (1996) and the empirical covariance matrix. Building $\hat{T}$ in this case boils down to finding a maximum spanning tree, which can be done with Kruskal's algorithm (Kruskal, 1956). If variables are now hidden but the underlying tree $T$ and $K_{T}$ are known, the conditional distribution of the hidden variables given the observed ones is

X_{H} | X_{O} \sim N (μ_{H | O}, K_{H | O}^{- 1}), μ_{H | O} = - K_{T, H}^{- 1} K_{T, HO} X_{O}, K_{H | O} = K_{T, H} .

From these two results, we can derive an EM algorithm to infer the tree-structured graph underlying the distribution of $X$ in the hidden variables setting, which runs iteratively until convergence, with the following steps at iteration $h + 1$ , $h \geq 1$ .

E-step: Evaluation of the conditional expectation of the complete log-likelihood with respect to the current value $K^{h}$ of the parameter, namely:

𝔼_{X_{H} | X_{O}; K^{h}} \log p (X_{O}, X_{H}; K) .

(2.2)

M-step: Maximization of (2.2) with respect to $K$ to update $K^{h}$ into $K^{h + 1}$ , using the Chow–Liu algorithm.
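To make the Chow–Liu step concrete, here is a minimal Python sketch for the Gaussian case. It assumes that edges are weighted by the pairwise Gaussian mutual information $- \frac{1}{2} \log (1 - ρ_{ij}^{2})$ and that the maximum spanning tree is built with Kruskal's algorithm; the function name `chow_liu_tree` and the toy chain are illustrative assumptions, not the implementation of the accompanying package.

```python
import numpy as np

def chow_liu_tree(S):
    """Maximum-weight spanning tree for a Gaussian model with covariance S.

    Edge weights are the pairwise Gaussian mutual informations
    -0.5 * log(1 - rho_ij^2); the tree is built with Kruskal's algorithm.
    """
    d = np.sqrt(np.diag(S))
    rho = S / np.outer(d, d)
    q = S.shape[0]
    pairs = [(i, j) for i in range(q) for j in range(i + 1, q)]
    # Sort candidate edges by decreasing mutual information.
    pairs.sort(key=lambda e: -0.5 * np.log(1.0 - rho[e] ** 2), reverse=True)
    parent = list(range(q))

    def find(u):
        # Root of the component containing u (with path halving).
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    edges = []
    for i, j in pairs:
        ri, rj = find(i), find(j)
        if ri != rj:  # adding {i, j} does not create a cycle
            parent[ri] = rj
            edges.append((i, j))
    return edges

# Toy chain 0 - 1 - 2 - 3 (illustrative values), using the exact covariance.
K_chain = 2.0 * np.eye(4) - 0.8 * (np.eye(4, k=1) + np.eye(4, k=-1))
S_chain = np.linalg.inv(K_chain)
print(sorted(chow_liu_tree(S_chain)))
```

In the EM algorithm above, the empirical covariance would be replaced by the conditional second moments computed at the E-step; with the exact covariance of a tree-structured model, as in this toy example, the tree itself is recovered.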

2.3 Random unknown tree

The inference method described above is very simple, but the tree assumption is restrictive, and we expect poor results when it is violated. To overcome this, we choose to treat $T$ as a random variable. Doing so, we are able to compute a posterior probability of appearance for every possible edge in the graph. Ranking these probabilities in decreasing order, we can infer a graph of general structure, even though our model is based on spanning trees. Denote by $E_{T}$ the set of edges of $T$ . We assume $T$ to be drawn from a distribution parametrized by a weight matrix $(π_{ij})$ . The probability of tree $T$ is then proportional to the product of the weights of its edges:

P (T) \propto \prod_{{i, j} \in E_{T}} π_{ij} .

(2.3)

Prior information about the existence of each edge is easily encoded in a distribution of this form, and a non-informative choice of prior is to set the $π_{ij}$ to be equal for all $i, j$ , that is, all trees have the same probability of being drawn, so every edge has the same probability of being part of the drawn tree. We then assume the existence of a full symmetric matrix $K$ with block decomposition given in (1.4), the entries of which have to be estimated. For every $T \in \mathcal{T}$ , we define the corresponding $(p + r) \times (p + r)$ matrix $K_{T}$ , with off-diagonal term $K_{T, ij} = K_{ij}$ if ${i, j} \in E_{T}$ and zero otherwise. The diagonal term $K_{T, ii}$ depends both on $K_{ii}$ and on the degree of node $i$ in $T$ . Its expression, derived from Lauritzen (1996), is given in (7.2) of Appendix 7. Note that $K$ does not need to be positive definite, although it may be desirable for the numerical stability of the algorithm. The joint distribution of $(X_{O}, X_{H})$ is a mixture of centred Gaussian distributions:

(X_{O}, X_{H}) \sim \sum_{T \in \mathcal{T}} p (T) N (X_{O}, X_{H}; 0, K_{T}^{- 1}) .

We develop this random unknown tree model further in Section 3 where we propose an inference procedure. For every possible edge ${i, j}$ , we will compute the quantity

α_{ij} = \sum_{T \in \mathcal{T} : T \ni {i, j}} P (T | X_{O}),

that we interpret as the edge-specific probability of appearance. First, we derive the conditional distributions that will be necessary. In particular, we show that these distributions factorize over the edges.

2.4 Some conditional distributions

Let us first compute the joint distribution of $T$ and $X_{H}$ conditionally on $X_{O}$ which will be needed in Section 3:

P (T, X_{H} | X_{O}) = P (T | X_{O}) P (X_{H} | X_{O}, T) .

On the one hand, $P (X_{H} | X_{O}, T) = N (μ_{H | O, T}, K_{H | O, T}^{- 1})$ . On the other hand,

\begin{matrix} P (T | X_{O}) & \propto & P (T) P (X_{O} | T) \\ & \propto & (\prod_{{i, j} \in E_{T}} π_{ij}) \underbrace{\frac{\det {(K_{T, m})}^{\frac{n}{2}}}{{(2 π)}^{\frac{n p}{2}}}}_{(1)} \underbrace{\exp (- \frac{n}{2} tr (K_{T, m} Σ_{O}))}_{(2)}, \end{matrix}

(2.4)

where $K_{T, m} = K_{T, O} - K_{T, OH} (K_{T, H})^{- 1} K_{T, HO}$ . Terms (1) and (2) can be expressed as products over the edges of $T$ . We directly give the results and leave the derivations to Appendix 7. Let us define

\begin{matrix} d_{ij} & = {(\frac{K_{ii} K_{jj} - K_{ij}^{2}}{K_{ii} K_{jj}})}^{\frac{n}{2}} \\ t_{ij} & = \exp (- n K_{ij} Σ_{ij}) \end{matrix} \forall {i, j} \in {1, \dots, p}^{2},

(2.5)

\begin{matrix} f_{ih} = \exp (\frac{n}{2} \sum_{k \in O} \frac{K_{ih} K_{hk} Σ_{ki}}{K_{hh}}) \end{matrix} \forall {i, h} \in {1, \dots, p} \times {p + 1, \dots, p + r}

(2.6)

and finally

\begin{matrix} m_{ij} = \{\begin{matrix} t_{ij} & if {i, j} \in {1, \dots, p}^{2} \\ f_{ij} & if {i, j} \in {1, \dots, p} \times {p + 1, \dots, p + r} \\ f_{ij} & if {i, j} \in {p + 1, \dots, p + r} \times {1, \dots, p} \\ 1 & if {i, j} \in {p + 1, \dots, p + r}^{2} \end{matrix} . \end{matrix}

(2.7)

We obtain that the conditional distribution $P (T | X_{O})$ nicely factorizes over the edges of $T$ :

\begin{matrix} P (T | X_{O}) & \propto & P (T) P (X_{O} | T) & \propto & \prod_{{i, j} \in E_{T}} π_{ij} d_{ij} m_{ij} . \end{matrix}

(2.8)

We also need to compute the normalizing constant of $P (T)$ and $P (T | X_{O})$ , that is, respectively,

\sum_{T} \prod_{{i, j} \in E_{T}} π_{ij} and \sum_{T} \prod_{{i, j} \in E_{T}} π_{ij} d_{ij} m_{ij} .

Those constants can be computed with the same complexity as a determinant, that is, in $O ((p + r)^{3})$ operations, using the Matrix-Tree theorem that we now state. For a symmetric matrix $W$ of weights $w_{ij}$ indexed by a node set $V$ , we define the Laplacian $Δ = (Δ_{ij})_{(i, j) \in V^{2}}$ associated with $W$ by

Δ_{ij} = \{\begin{matrix} - w_{ij} & if i \neq j, \\ \sum_{k \neq i} w_{ik} & if i = j . \end{matrix}

Theorem 1. [Chaiken (1982)]. Let $W = (w_{ij})_{(i, j) \in V^{2}}$ be a symmetric matrix of weights and $Δ$ its associated Laplacian. For $(u, v) \in V^{2}$ , let ${\bar{Δ}}_{uv}$ be the $(u, v)$ th minor of $Δ$ . Then all ${\bar{Δ}}_{uv}$ are equal and

{\bar{Δ}}_{uv} = \sum_{T \in \mathcal{T}} \prod_{{i, j} \in E_{T}} w_{ij} := Z (W) .

In Section 3, we will need to compute similar quantities after removing a given edge, and to do so for all possible edges. This can be achieved efficiently for all edges at once thanks to a corollary of Theorem 1 given in Kirshner (2007), Theorem 3.
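The Matrix-Tree theorem can be checked numerically on a small graph. The sketch below, in Python, computes $Z (W)$ as the determinant of the Laplacian with its first row and column removed and compares it with a brute-force enumeration of spanning trees; the function names and the random weights are illustrative assumptions.

```python
import itertools
import numpy as np

def laplacian(W):
    """Laplacian of a symmetric weight matrix W (the diagonal of W is ignored)."""
    W = W - np.diag(np.diag(W))
    return np.diag(W.sum(axis=1)) - W

def tree_partition_function(W):
    """Z(W): sum over spanning trees of the product of their edge weights."""
    L = laplacian(W)
    return np.linalg.det(L[1:, 1:])  # minor obtained by deleting row and column 0

def brute_force_Z(W):
    """Same quantity by explicit enumeration (only feasible for tiny graphs)."""
    q = W.shape[0]
    pairs = list(itertools.combinations(range(q), 2))
    total = 0.0
    for edges in itertools.combinations(pairs, q - 1):
        # q - 1 edges form a spanning tree iff they connect all q nodes.
        comp = list(range(q))
        for i, j in edges:
            ci, cj = comp[i], comp[j]
            comp = [ci if c == cj else c for c in comp]
        if len(set(comp)) == 1:
            total += np.prod([W[i, j] for i, j in edges])
    return total

rng = np.random.default_rng(0)
A = rng.uniform(0.5, 2.0, size=(5, 5))
W = (A + A.T) / 2  # symmetric positive weights
print(tree_partition_function(W), brute_force_Z(W))
```

With unit weights, $Z (W)$ simply counts the spanning trees of the complete graph, so the determinant returns 16 for four nodes, in agreement with Cayley's formula.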

3 Inference of the random unknown tree model

3.1 EM algorithm

Because the proposed model involves unobserved variables, the EM algorithm (Dempster et al., 1977) is a natural framework to carry out the inference. Importantly, two hidden layers appear in the model: the latent tree $T$ and the signal at the unobserved nodes $X_{H}$ . We show that these two hidden layers can be handled thanks to the Matrix-Tree theorem (Chaiken, 1982) introduced in Section 2. We first recall that the EM algorithm aims at maximizing the log-likelihood of the observed data $\log p (X_{O}; K)$ with respect to the parameter $K$ , alternating two steps in an iterative manner. At iteration $h$ we perform:

E-step: Evaluation of all the conditional moments involved in the conditional expectation of the complete log-likelihood with the current value $K^{h}$ of the parameter, namely:

𝔼_{X_{H}, T | X_{O}; K^{h}} \log p (X_{O}, X_{H}, T; K);

(3.1)

M-step: Maximization of (3.1) with respect to $K$ to update $K^{h}$ into $K^{h + 1}$ .

We now give the details of how those two steps are performed.

E-step: The conditional expectation of the complete log-likelihood writes

\begin{matrix} & 𝔼_{T | X_{O}; K^{h}} (𝔼_{X_{H} | X_{O}, T; K^{h}} \log p (X_{O}, X_{H}, T; K)) \\ = & 𝔼_{T | X_{O}; K^{h}} (\log p (T) + 𝔼_{X_{H} | X_{O}, T; K^{h}} [\log p (X_{O}, X_{H} | T; K)]) . \end{matrix}

Thanks to the tree structure of the graphical model, we have a simple form for the latter term:

𝔼_{X_{H} | X_{O}, T; K^{h}} [\log p (X_{O}, X_{H} | T; K)] = \sum_{{i, j} \in T} p_{ij} (K),

where $p_{ij} (K)$ is $- 2 K_{ij} {\hat{Σ}}_{ij}$ if both $i \neq j$ are observed, $2 K_{ij} W_{ij}^{h}$ if $i$ is observed and $j$ is hidden, $- K_{ii} {\hat{Σ}}_{ii}$ if $i = j$ is observed and $- K_{ii} B_{ii}^{h}$ if $i = j$ is hidden, variance and covariance matrices being given by

\begin{matrix} W_{HO}^{h} & = & (K_{H}^{h})^{- 1} K_{HO}^{h} {\hat{Σ}}_{O}, \\ V_{H}^{h} & = & (K_{H}^{h})^{- 1} K_{HO}^{h} {\hat{Σ}}_{O} K_{OH}^{h} (K_{H}^{h})^{- 1}, \\ B_{H}^{h} & = & (K_{H}^{h})^{- 1} + V_{H}^{h} . \end{matrix}
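The three moment matrices above translate directly into a few lines of linear algebra. The following Python sketch computes them from the current precision matrix and the empirical covariance of the observed variables; the function name `e_step_moments` and the toy star graph are illustrative assumptions, and the sketch is not the implementation of the accompanying package.

```python
import numpy as np

def e_step_moments(K, S_O, p):
    """Return (W_HO, V_H, B_H) for the current precision K and observed covariance S_O."""
    K_H, K_HO = K[p:, p:], K[p:, :p]
    A = np.linalg.solve(K_H, K_HO)   # (K_H)^{-1} K_HO, shape (r, p)
    W_HO = A @ S_O
    V_H = A @ S_O @ A.T              # A.T equals K_OH (K_H)^{-1} since K is symmetric
    B_H = np.linalg.inv(K_H) + V_H
    return W_HO, V_H, B_H

# Toy star graph (illustrative values): nodes 0, 1, 2 observed, node 3 hidden.
K = np.array([
    [2.0, 0.0, 0.0, 0.5],
    [0.0, 2.0, 0.0, 0.5],
    [0.0, 0.0, 2.0, 0.5],
    [0.5, 0.5, 0.5, 2.0],
])
Sigma = np.linalg.inv(K)
# Sanity check with the exact observed covariance in place of the empirical one.
W_HO, V_H, B_H = e_step_moments(K, Sigma[:3, :3], p=3)
print(np.allclose(B_H, Sigma[3:, 3:]), np.allclose(-W_HO, Sigma[3:, :3]))
```

When the exact covariance of the observed variables is plugged in, $B_{H}$ coincides with the marginal covariance of the hidden variables and $- W_{HO}$ with the cross-covariance between hidden and observed variables, which is consistent with the opposite signs of the observed and hidden terms in $p_{ij} (K)$ .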

As explained in Section 2, the diagonal term $K_{ii}$ should actually depend on the tree $T$ . We work here with a common parameter $K_{ii}$ , which may result in non-positive definite matrices $K_{T}$ . To circumvent this issue, we project the estimated matrix $K$ on the cone of positive definite matrices (which amounts to thresholding its non-positive eigenvalues) at each step of the EM algorithm. In the case where the tree $T$ is supposed to be fixed, the calculation of the conditional distribution (2.4) is replaced by the determination of the conditionally most probable tree, as in the classification EM introduced by Celeux and Govaert (1992).
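The projection step can be sketched as follows in Python. The eigenvalue floor `1e-6` is an arbitrary illustrative choice, since the threshold value is not specified here, and the function name is hypothetical.

```python
import numpy as np

def project_positive_definite(K, floor=1e-6):
    """Replace eigenvalues of the symmetrized K that fall below `floor` by `floor`."""
    K_sym = (K + K.T) / 2
    eigval, eigvec = np.linalg.eigh(K_sym)
    eigval = np.maximum(eigval, floor)
    # Rebuild the matrix from the thresholded spectrum.
    return (eigvec * eigval) @ eigvec.T

K_bad = np.array([[1.0, 2.0],
                  [2.0, 1.0]])  # eigenvalues 3 and -1
K_proj = project_positive_definite(K_bad)
print(np.linalg.eigvalsh(K_proj))
```

A matrix that is already positive definite, with all eigenvalues above the floor, is left unchanged by this operation.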

M-step: Combined with $p (T) \propto \prod_{{i, j} \in T} π_{ij}$ and with the conditional distribution of $T$ , $p (T | X_{O}; K^{h}) \propto \prod_{{i, j} \in T} γ_{ij}$ given in (2.8) (with $γ_{ij} = π_{ij} d_{ij} m_{ij}$ ), we get that

\begin{matrix} 𝔼_{X_{H}, T | X_{O}; K^{h}} \log p (X_{O}, X_{H}, T; K) & \propto & 𝔼_{X_{H}, T | X_{O}; K^{h}} [\sum_{{i, j} \in T} (\log π_{ij} + p_{ij} (K))] \\ & \propto & \sum_{T} (\prod_{{k, ℓ} \in T} γ_{k ℓ}^{h}) [\sum_{{i, j} \in T} (\log π_{ij} + p_{ij} (K))] \end{matrix}

where the normalizing constant does depend on $K^{h}$ but not on $K$ . Hence, at the M-step, we need to maximize with respect to $K$

\sum_{T} (\prod_{{k, ℓ} \in T} γ_{k ℓ}^{h}) [\sum_{{i, j} \in T} p_{ij} (K)] = \sum_{i < j} A_{ij} p_{ij} (K),

(3.2)

where all $A_{ij} = \sum_{T : {i, j} \in T} (\prod_{{k, ℓ} \in T} γ_{k ℓ}^{h})$ can be computed in $O ((p + r)^{3})$ using Theorem 3 from Kirshner (2007). The resulting update formulas of $K$ are given in Appendix 8.

Initialization: The behaviour of the EM algorithm is known to strongly depend on its starting point. Our initialization strategy is described in Appendix 9.

3.2 Edge probability and model selection

In this section, we derive a series of quantities of interest for practical inference.

Edge probability: From the perspective of network inference, we need to compute the probability for an edge to be part of the tree given the observed data, that is, for edge ${k, l}$ ,

α_{kl} : = P ({k, l} \in T | X_{O}) .

(3.3)

This probability can be computed for all edges at once in $O ((p + r)^{3})$ thanks to Theorem 3 from Kirshner (2007). It depends on the marginal distribution of the tree $P (T)$ given in (2.3) parametrized with $π_{ij}$ , which controls the marginal probability of the edge $p_{ij}^{0} : = P ({i, j} \in E_{T})$ in a complex manner. From a decision-making perspective, it may be desirable to set this probability to an uninformative value such as $1 / 2$ . This change of prior probability can be achieved in $O ((p + r)^{2})$ (Schwaller et al., 2015).
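As an illustration, edge probabilities under a tree distribution of the form (2.3) or (2.8) can be obtained from the pseudo-inverse of the Laplacian. The Python sketch below relies on the classical effective-resistance identity for weighted spanning trees, which we use here as a stand-in for the formula of Kirshner (2007); it has the same cubic cost. The function name and the weight matrix `G` (playing the role of $π_{ij}$ or $γ_{ij}$ ) are illustrative assumptions.

```python
import numpy as np

def edge_probabilities(G):
    """P({k, l} in T) when P(T) is proportional to the product of G[k, l] over the edges of T."""
    G = G - np.diag(np.diag(G))
    L = np.diag(G.sum(axis=1)) - G
    Lp = np.linalg.pinv(L)                # Moore-Penrose pseudo-inverse of the Laplacian
    d = np.diag(Lp)
    R = d[:, None] + d[None, :] - 2 * Lp  # effective resistance between k and l
    return G * R                          # weight times effective resistance

alpha_uniform = edge_probabilities(np.ones((4, 4)))
print(np.round(alpha_uniform, 3))
```

A useful sanity check is that the probabilities of all edges sum to the number of edges of a spanning tree, that is, the number of nodes minus one. With equal weights on four nodes, each of the six possible edges has probability $3 / 6 = 1 / 2$ .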

Conditional entropy of the tree: We are also interested in the variability of the distribution of the tree given the observed data, measured by its entropy. Denoting by $Z_{O}$ the normalizing constant of the conditional distribution $P (T | X_{O})$ , we have that

\begin{matrix} H (T | X_{O}) & = & - \sum_{T} P (T | X_{O}) \log P (T | X_{O}) \\ & = & - \sum_{T} P (T | X_{O}) (- \log Z_{O} + \sum_{kl \in T} \log γ_{kl}) \\ & = & \log Z_{O} - \sum_{kl} \log γ_{kl} (\sum_{T : kl \in T} P (T | X_{O})) \\ & = & \log Z_{O} - \sum_{kl} α_{kl} \log γ_{kl} \end{matrix}

(3.4)

which can be computed with complexity $O ((p + r)^{2})$ , once the edge probabilities $α_{kl}$ have been computed.
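The entropy formula (3.4) can be sketched in Python as follows. The normalizing constant is obtained from the Matrix-Tree theorem, and the edge probabilities from the effective-resistance identity for weighted spanning trees, which is our own choice for this illustration. The function name `tree_entropy` and the argument `G` (the matrix of weights $γ_{kl}$ ) are assumptions.

```python
import numpy as np

def tree_entropy(G):
    """Entropy of the tree distribution P(T) proportional to the product of G[k, l] over edges."""
    q = G.shape[0]
    G = G - np.diag(np.diag(G))
    L = np.diag(G.sum(axis=1)) - G
    # log Z from the Matrix-Tree theorem (row and column 0 removed).
    _, log_Z = np.linalg.slogdet(L[1:, 1:])
    # Edge probabilities alpha_kl = weight times effective resistance.
    Lp = np.linalg.pinv(L)
    d = np.diag(Lp)
    alpha = G * (d[:, None] + d[None, :] - 2 * Lp)
    iu = np.triu_indices(q, k=1)
    g, a = G[iu], alpha[iu]
    keep = g > 0  # edges with zero weight never appear in a tree
    return log_Z - np.sum(a[keep] * np.log(g[keep]))

print(tree_entropy(np.ones((4, 4))), np.log(16))
```

With equal weights on four nodes, the 16 spanning trees are equiprobable, so the entropy equals $\log 16$ ; multiplying all weights by the same constant leaves the distribution, and hence the entropy, unchanged.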

Because our model involves two hidden variables ( $T$ and $X_{H}$ ), one may be interested in the conditional entropy of all hidden variables, that is

H (T, X_{H} | X_{O}) = H (T | X_{O}) + 𝔼_{T | X_{O}} [H (X_{H} | T, X_{O})] .

For the second term, we observe that the conditional distribution of $X_{H}$ given both $T$ and $X_{O}$ is a Gaussian distribution with variance $K_{H}^{- 1}$ (which is diagonal), regardless of $T$ and $X_{O}$ . As a consequence, $H (X_{H} | T, X_{O})$ is constant, so we get that

𝔼_{T | X_{O}} [H (X_{H} | T, X_{O})] = \frac{r \log (2 π e)}{2} - \frac{1}{2} \sum_{i \in H} \log (K_{ii}) .

Model selection: We now turn to the estimation of the unknown number of hidden nodes $r$ . First, a standard Bayesian information criterion (BIC) can be defined as $BIC (r) = \log p (X_{O}; \hat{K}) - pen (r)$ , where the penalty term depends on the number of independent parameters in $K$ , that is,

pen (r) = (\frac{p (p + 1)}{2} + rp + r) \frac{\log n}{2} .

Note that the maximized log-likelihood can be computed as

\log p (X_{O}; \hat{K}) = 𝔼 [\log p (X_{O}, X_{H}, T) | X_{O}; \hat{K}] + H (X_{H}, T | X_{O}, \hat{K}) .

In the context of classification, Biernacki et al. (2000) introduced an Integrated Complete Likelihood (ICL) criterion where the conditional entropy of the hidden variable is added to the penalty. The rationale behind ICL is a preference for models with lower uncertainty for the hidden variables. Because we are mostly interested in network inference, it seems desirable to penalize only for the conditional entropy of the tree. This leads to the following criterion

{ICL}_{T} (r) = \log p (X_{O}; \hat{K}) - H (T | X_{O}) - pen (r),

where $H (T | X_{O})$ is given by (3.4). In situations where a reliable prediction of the hidden node $X_{H}$ is of interest, both entropies can be used in the penalty, leading to

{ICL}_{T, X_{H}} (r) = \log p (X_{O}; \hat{K}) - H (T, X_{H} | X_{O}) - pen (r) .
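Once the maximized log-likelihood and the entropies are available, the three criteria reduce to simple arithmetic. The Python sketch below assumes that these quantities have already been computed by the EM algorithm; the function names and the numerical inputs are purely illustrative placeholders, not results.

```python
import math

def penalty(p, r, n):
    """pen(r) = (p(p+1)/2 + rp + r) * log(n) / 2."""
    n_params = p * (p + 1) / 2 + r * p + r
    return n_params * math.log(n) / 2

def hidden_entropy(K_H_diag):
    """Entropy of X_H given T and X_O when K_H is diagonal with entries K_H_diag."""
    r = len(K_H_diag)
    return r * math.log(2 * math.pi * math.e) / 2 - 0.5 * sum(math.log(k) for k in K_H_diag)

def criteria(loglik, h_tree, K_H_diag, p, n):
    """BIC, ICL_T and ICL_{T, X_H} for r = len(K_H_diag) hidden nodes."""
    pen = penalty(p, len(K_H_diag), n)
    h_all = h_tree + hidden_entropy(K_H_diag)  # H(T, X_H | X_O)
    return {
        "BIC": loglik - pen,
        "ICL_T": loglik - h_tree - pen,
        "ICL_T_XH": loglik - h_all - pen,
    }

# Arbitrary placeholder inputs, for illustration only.
crit = criteria(loglik=-500.0, h_tree=3.0, K_H_diag=[2.0], p=20, n=30)
print(crit)
```

In practice the criteria would be evaluated for several candidate values of $r$ , and the value maximizing the chosen criterion retained.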

4 Numerical experiments

4.1 Experimental set-up

Data synthesis in our framework requires the simulation of a graph and of a sparse inverse covariance matrix with matching support. We simulated graphs of two different structures, which are given in Figure 2, namely a random tree and an Erdős–Rényi graph with density 0.1 containing $p = 20$ nodes. The binary incidence matrix of the graph is then transformed by randomly flipping the sign of some elements in order to simulate both positively and negatively correlated variables. Positive definiteness of this precision matrix $K$ is ensured by adding a large enough constant to the diagonal. We choose the missing nodes at random among those that satisfy the identifiability conditions described in Section 2. The difficulty of detecting missing edges is related to the value of the correlations between the missing nodes and their children. Recall that the marginal precision matrix is

K_{m} = K_{O} - K_{OH} K_{H}^{- 1} K_{HO} .

We measure the difficulty of detecting the second term $K_{OH} K_{H}^{- 1} K_{HO}$ with the ratio

SNR = \frac{{∥K_{OH} K_{H}^{- 1} K_{HO}∥}_{2}^{2}}{{∥K_{O}∥}_{2}^{2}} .

As it increases, the amplitude of the signal coming from the marginalized nodes indeed increases compared to the signal coming from the observed nodes. We control this ratio by multiplying terms in the precision matrix by a constant $ε$ that we vary:

K = (\begin{matrix} K_{O} & ε K_{OH} \\ ε K_{HO} & ε K_{H} \end{matrix}) .

In the experiments we will consider two settings where $ε \in {1, 10}$ . A Gaussian sample of size $n = 30$ with zero mean and the above concentration matrix is then simulated 50 times; the results we present below are averaged over the 50 samples. The total complexity of our inference method is $O (n (p + r)^{3})$ , where $r$ is the (fixed) number of missing nodes. To simulate marginalization, we simply remove in all samples the chosen variable.
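The scaling by $ε$ and the resulting ratio can be sketched in Python as follows. We assume here that the norm in the ratio is the spectral norm, which is one possible reading of the notation; the function names and the toy precision matrix are illustrative and much smaller than the graphs used in the experiments.

```python
import numpy as np

def scale_hidden_blocks(K, p, eps):
    """Multiply the blocks K_OH, K_HO and K_H of the precision matrix by eps."""
    K_eps = K.astype(float).copy()
    K_eps[:p, p:] *= eps
    K_eps[p:, :p] *= eps
    K_eps[p:, p:] *= eps
    return K_eps

def snr(K, p):
    """Ratio of squared spectral norms of K_OH K_H^{-1} K_HO and K_O."""
    K_O, K_OH, K_H = K[:p, :p], K[:p, p:], K[p:, p:]
    hidden_term = K_OH @ np.linalg.solve(K_H, K_OH.T)
    return np.linalg.norm(hidden_term, 2) ** 2 / np.linalg.norm(K_O, 2) ** 2

# Toy star graph (illustrative values): nodes 0, 1, 2 observed, node 3 hidden.
K = np.array([
    [5.0, 0.0, 0.0, 0.5],
    [0.0, 5.0, 0.0, 0.5],
    [0.0, 0.0, 5.0, 0.5],
    [0.5, 0.5, 0.5, 2.0],
])
rng = np.random.default_rng(0)
for eps in (1, 10):
    K_eps = scale_hidden_blocks(K, p=3, eps=eps)
    assert np.all(np.linalg.eigvalsh(K_eps) > 0)  # K_eps must remain positive definite
    X = rng.multivariate_normal(np.zeros(4), np.linalg.inv(K_eps), size=30)
    X_O = X[:, :3]  # marginalization: drop the hidden column
    print(eps, snr(K_eps, p=3), X_O.shape)
```

Since the term $K_{OH} K_{H}^{- 1} K_{HO}$ is multiplied by $ε$ under this scaling, the ratio is multiplied by $ε^{2}$ . Positive definiteness is not automatically preserved when $ε$ grows, which is why it is checked explicitly in the sketch.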

Figure 2:

Two graph structures used for simulation: Tree (left) and Erdős–Rényi with density $0.1$ (right) (see online for colour images)

4.2 Edge detection

We focus this experiment on the ability to recover existing edges of the network, that is, the nonzero entries of the concentration matrix. This is a binary decision problem where the compared algorithms are considered as classifiers. The decision made by a binary classifier can be summarized using four numbers: True Positives ( $TP$ ), False Positives ( $FP$ ), True Negatives ( $TN$ ) and False Negatives ( $FN$ ). We have chosen to draw ROC curves, that is, power ( $power = TP / (FN + TP)$ ) versus false positive rate ( $FPR = FP / (FP + TN)$ ), to display this information and compare how well the methods perform. The performance of five algorithms was tested on all the simulated graph structures: the Chow–Liu algorithm (Chow and Liu, 1968), the graphical lasso (Friedman et al., 2008) (Glasso), the EM of Lauritzen and Meinshausen (2012) (EM-Glasso), the EM algorithm searching for a fixed unknown tree using the Chow–Liu algorithm (EM-Chow–Liu) and our EM algorithm for tree aggregation (Tree Aggregation). The same initialization was used for all methods. Note that the Chow–Liu and Glasso algorithms do not consider missing variables, whereas the three other approaches do. We compare all methods in terms of marginal graph inference and only the three methods considering missing nodes in terms of full graph inference. We put a special emphasis on the inclusion of ‘spurious’ edges, that is, edges resulting from marginalization, in the inferred marginal graph. Technically, spurious edges are edges from the marginal graph linking neighbours of the missing nodes in the full graph. To this aim, we plot the fraction $IS / S$ of included spurious edges ( $IS$ ) among the total number of spurious edges ( $S$ ) versus the density of the inferred graph: $(FP + TP) / [p (p - 1) / 2]$ . The interpretation of this curve differs from that of a ROC curve. An ideal method would keep $IS / S$ at 0 until the end, meaning that the corresponding curve should be pushed down to the bottom right corner.
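For concreteness, the ROC points can be computed from a matrix of edge scores (for instance, posterior edge probabilities) by sweeping a threshold, as in the following Python sketch. The function name `roc_points` and the toy inputs are illustrative assumptions; the sketch also assumes that the true graph contains at least one edge and at least one non-edge, so that both rates are well defined.

```python
import numpy as np

def roc_points(A_true, scores):
    """(FPR, power) pairs obtained by thresholding symmetric edge scores."""
    iu = np.triu_indices(A_true.shape[0], k=1)  # each undirected edge counted once
    truth = A_true[iu].astype(bool)
    s = scores[iu]
    points = []
    for thr in np.sort(np.unique(s))[::-1]:     # from the strictest threshold down
        selected = s >= thr
        tp = np.sum(selected & truth)
        fp = np.sum(selected & ~truth)
        fn = np.sum(~selected & truth)
        tn = np.sum(~selected & ~truth)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points

# Toy chain 0 - 1 - 2 with arbitrary illustrative scores.
A_true = np.array([[0, 1, 0],
                   [1, 0, 1],
                   [0, 1, 0]])
scores = np.array([[0.0, 0.9, 0.6],
                   [0.9, 0.0, 0.2],
                   [0.6, 0.2, 0.0]])
points = roc_points(A_true, scores)
print(points)
```

Each threshold yields one point of the curve; lowering the threshold increases the density of the inferred graph, which is the quantity used on the horizontal axis of the spurious-edge plots.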

Figure 3:

Simulation results for $SNR = 1$ . Top: Tree; bottom: Erdős–Rényi. Left: ROC for the full graph; centre: ROC for the marginal graph; right: spurious edges. Comparison of EM-Tree Aggregation (dots), EM-Glasso (stars), EM-Chow–Liu (squares), Glasso (crosses) and Chow–Liu (triangles)

Figure 4:

Simulation results for $SNR = 10$ . Top: Tree; bottom: Erdős–Rényi. Left: ROC for the full graph; centre: ROC for the marginal graph; right: spurious edges. Comparison of EM-Tree Aggregation (dots), EM-Glasso (stars), EM-Chow–Liu (squares), Glasso (crosses) and Chow–Liu (triangles)

The results are displayed in Figures 3 and 4. The Chow–Liu algorithm and its EM version converge very quickly and provide very similar solutions to the inference problem. On the marginal graph, even when the true model is a tree, neither algorithm seems to provide better results than Glasso. Glasso and Tree Aggregation perform equally well, and better than EM-Glasso, at inferring the marginal graph. On the full graph, Tree Aggregation performs slightly better than EM-Glasso, which tends to overestimate the number of children of the missing node and therefore has a higher false positive rate. This is in accordance with its underlying model, which assumes that all observed nodes have a hidden parent. Each of these false positive edges in the complete graph induces several false positive edges in the marginal graph. Interestingly, although Tree Aggregation is tailored to infer the full graph, it performs as well as Glasso at predicting the marginal graph, which is the primary target of Glasso. In terms of computation time, the tree-based approaches rank in the expected order, that is, Chow–Liu $<$ EM-Chow–Liu $<$ EM-Tree Aggregation (see Table 1). No fair comparison can be made in this respect with EM-Glasso, as this algorithm needs to be stopped after a few iterations and before convergence (see Lauritzen and Meinshausen, 2012). Interestingly, the most costly procedure turns out to be the initialization.

Table 1:

Computational times averaged over $50$ replications (tree, n = 200, p = 20, SNR = 10)

Algorithm	Chow–Liu	EM-Chow–Liu	Glasso	EM-Glasso	EM-Tree Aggregation	Initialization
Time (s)	$1.4 \times 10^{-4}$	$0.23$	$1.3 \times 10^{-3}$	–	$0.97$	$2.60$

We also studied how influential the initialization step is. More specifically, we studied how far the final results of Tree Aggregation are from the initial point. The initialization provides cliques containing initial neighbours for the unobserved nodes. To assess how far the algorithm moves from its starting point, we sort the estimated weights of the edges coming from the unobserved node. Then, we compute the average rank (rescaled between 0 and 1) of the edges corresponding to the initial clique among these weights. An average rank close to zero indicates that the initial edges still have the highest weights, whereas a value close to one means that they have the lowest weights. The results given in Table 2 show that, in all cases, the initial edges have a mean rank between $0.40$ and $0.71$ , that is, roughly in the middle of the ordering, which indicates that the result evolves substantially during the algorithm.

Table 2:

Average rank of the initial clique (50 replications)

SNR	1	1	10	10
Graph	Tree	Erdös	Tree	Erdös
Avg rank	0.40	0.58	0.71	0.46
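The rank statistic reported in Table 2 can be sketched as follows. The exact rescaling and the handling of ties are not specified in the text, so the choices below (rank 0 for the largest weight, rank 1 for the smallest, ties broken by sort order) are assumptions, as are the function name and the toy weights.

```python
def average_rescaled_rank(weights, initial_clique):
    """Average rank, rescaled to [0, 1], of the initial-clique edges.

    weights: dict mapping each observed node to the estimated weight of its
        edge with the unobserved node (at least two nodes).
    initial_clique: observed nodes attached to the unobserved node at start.
    Rank 0 is the highest weight and rank 1 the lowest (assumed convention).
    """
    ordered = sorted(weights, key=weights.get, reverse=True)
    position = {node: k for k, node in enumerate(ordered)}
    scale = len(ordered) - 1
    ranks = [position[node] / scale for node in initial_clique]
    return sum(ranks) / len(ranks)


# Hypothetical edge weights between five observed nodes and the hidden node.
weights = {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.05, "e": 0.02}
print(average_rescaled_rank(weights, ["a", "b"]))  # initial edges still on top
print(average_rescaled_rank(weights, ["d", "e"]))  # initial edges now at the bottom
```

With this convention, a clique whose edges keep the largest weights gives a value near 0, and a clique whose edges end up with the smallest weights gives a value near 1.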

4.3 Model selection

We now assess the performance of the proposed model selection criteria on the same simulated datasets, in which $r = 1$ node is missing. In all simulations, the criteria ${ICL}_{T, X_{H}}$ and ${ICL}_{T}$ displayed very similar results, the conditional entropy of $X_{H}$ being very small compared to that of $T$ . As a consequence, we only provide the results for ${ICL}_{T}$ (hereafter simply named $ICL$ ). Figure 5 shows that, for both network topologies, the BIC and ICL criteria display very similar behaviours and that both detect the existence of a missing node. When the full network is tree-shaped (Figure 5, left block), both criteria are maximal for $r = 1$ , whereas the choice between $r = 1$ and $r = 2$ is more difficult for the Erdös network.

Figure 5:

Model selection. Left block: Tree; right block: Erdös. Top: BIC; bottom: ICL. Within block left: $SNR = 1$ , right: $SNR = 10$ . Dotted line: true number of missing nodes

We repeat the experiment, this time without marginalizing any node. Figure 6 shows that the BIC criterion does not detect any hidden node, whereas the ICL criterion does. Nonetheless, the values of ICL for 0, 1, 2 and 3 hidden nodes are much closer to one another than in the previous example.

Figure 6:

Model selection. Left block: Tree; right block: Erdös. Within block left: BIC, right: ICL. Dotted line: true number of missing nodes

5 Flow cytometry data analysis

We illustrate our method with an application to cellular network inference, where the missing variables can be understood as proteins or experimental conditions that were not measured but nonetheless influence the results of the experiments. We applied our procedure to the inference of the Raf cellular signalling network based on flow cytometry data. The Raf network is involved in the regulation of cellular proliferation. The data were collected by Sachs et al. (2005) and later used by Werhli et al. (2006) and Schwaller et al. (2015) in network inference experiments. Flow cytometry measurements consist in sending single cells suspended in a fluid through a laser beam, and measuring parameters of interest by collecting the light re-emitted by the cell by diffusion or fluorescence. In this study, the parameters of interest are the activation levels of $11$ proteins and phospholipids involved in the Raf pathway, measured by flow cytometry across $100$ different cells. Though the true structure of this network is unknown, experiments have highlighted a consensus pathway that we used as a gold standard to assess the performance of our algorithm. The consensus network displayed in Figure 7 is far from being a tree. We removed one protein from the dataset, which amounts to hiding the corresponding node (in red in Figure 7), and applied our algorithm to these marginal data.

Figure 7:

Gold standard for Raf pathway. (a) Full graph (hidden node in red); (b) marginal graph (see online for colour images)

Using hierarchical clustering initialization, we inferred models with $r = 0$ to $3$ hidden nodes. Figure 8 (left) shows that the three proposed model selection criteria agree on the true model, that is, $r = 1$ . The same figure shows that ${ICL}_{T}$ and ${ICL}_{T, X_{H}}$ are almost equal and both lower than $BIC$ , meaning that the conditional entropy is mostly due to the uncertainty on the tree.

Figure 8:

Selection of the number of hidden nodes (BIC: crosses, ${ICL}_{T}$ : points, ${ICL}_{T, X_{H}}$ : triangles). Left: when removing one protein. Right: complete dataset

The performances of the methods described in Section 4 are compared on this example in Figure 9. The results are similar to those obtained in the simulation study. The proposed latent tree-based approach performs better than the EM-Glasso when trying to infer the full graph. The methods also perform well for the marginal graph. In terms of spurious edges, Tree Aggregation displays a plateau, along which the inclusion of spurious edges is delayed compared to Glasso and EM-Glasso.

Figure 9:

ROC curves for the full (left) and marginal (centre) graphs, and spurious edges (right). EM-Tree Aggregation (points), EM-Glasso (stars), EM-Chow–Liu (square), Chow–Liu (triangle), Glasso (crosses), Recursive Grouping (diamond)

Finally, we analysed the complete dataset from Sachs et al. (2005), without removing any node. Model selection criteria are given in Figure 8 (right): they all agree on the absence of a missing node, which is consistent with the biological consensus on the Raf pathway.

6 Discussion

We proposed a method for graphical model inference with missing variables. Uncovering such a latent structure provides additional hints for the interpretation of the underlying graphical model. For example, inferring a missing variable makes it possible to pinpoint a group of observed variables that are related to this unobserved variable.

Our procedure relies on spanning trees, and the computations are performed efficiently using the Matrix-Tree theorem. We have defined a model with a two-layer hidden structure in which the graph as well as the missing nodes are treated as latent variables. We derived the conditional distributions of the latent variables given the observations and developed an inference procedure based on the EM algorithm. We also proposed three model selection criteria to determine the presence of a hidden structure and to choose the number of missing variables. We observed in a simulation study that the tree constraint, which we overcome by computing posterior edge probabilities, is not too costly in practice. An implementation of the method is publicly available as the R package LITree, hosted on GitHub at https://github.com/cambroise/LITree. Directions for future work include the extension to non-Gaussian data (such as counts) and temporal data.
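To illustrate why the Matrix-Tree theorem makes sums over spanning trees tractable, the sketch below computes the sum, over all spanning trees of a weighted graph, of the product of edge weights as a single determinant of a Laplacian minor. This is the general statement of the theorem only; how the quantity enters the EM algorithm is not reproduced here, and the function names are illustrative assumptions rather than LITree functions.

```python
def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def spanning_tree_sum(weights):
    """Sum over spanning trees of the product of edge weights.

    weights: symmetric matrix (list of lists) of non-negative edge weights;
    the diagonal is ignored.
    """
    n = len(weights)
    laplacian = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                laplacian[i][j] = -weights[i][j]
                laplacian[i][i] += weights[i][j]
    # Matrix-Tree theorem: delete one row and the matching column.
    minor = [row[1:] for row in laplacian[1:]]
    return determinant(minor)


# Triangle with weights 1, 2, 3: the three spanning trees have products
# 1*2, 1*3 and 2*3.
triangle = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
print(spanning_tree_sum(triangle))
```

The determinant costs $O(p^{3})$ operations, whereas the number of spanning trees of a complete graph on $p$ nodes is $p^{p-2}$, so explicit enumeration is out of reach even for moderate $p$.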

7 Computation of the conditional distributions

We show that the conditional distribution of the tree given the observations factorizes over the edges of the tree.

\begin{aligned} P (T \mid X_{O}) & \propto P (T) \, P (X_{O} \mid T) \\ & \propto \Big( \prod_{\{i, j\} \in E_{T}} \pi_{ij} \Big) \underbrace{\frac{\det (K_{T, M})^{\frac{n}{2}}}{(2 \pi)^{\frac{np}{2}}}}_{(1)} \; \underbrace{\exp \Big( - \frac{n}{2} \operatorname{tr} (K_{T, M} \Sigma_{O}) \Big)}_{(2)} . \end{aligned}

(A.1)

We first focus on the $\det$ term (1). A linear algebra result based on the Schur complement states that

\begin{aligned} \det (K_{T}) & = \det \begin{pmatrix} K_{T, O} & K_{T, OH} \\ K_{T, HO} & K_{T, H} \end{pmatrix} \\ & = \det (K_{T, H}) \, \det \big( \underbrace{K_{T, O} - K_{T, OH} (K_{T, H})^{- 1} K_{T, HO}}_{K_{T, M}} \big), \end{aligned}

which, since $\det (K_{T, H}) > 0$ by definition, finally gives $\det (K_{T, M}) = \det (K_{T}) / \det (K_{T, H})$ . The identifiability assumptions on the hidden nodes imply that $K_{T, H}$ is diagonal, so that $\det (K_{T, H}) = \prod_{h \in H} K_{hh}$ does not depend on $T$ . Therefore, we only need to express $\det (K_{T})$ as a product over the edges of $T$ . We know from a result of Lauritzen (1996) on decomposable graphs that the precision matrix and the determinant of tree-structured graphs can be decomposed simply. With $[K_{\{I, J\}}]$ denoting the matrix equal to $K$ on indices $I \times J$ and $0$ elsewhere,

K_{T} = \sum_{i \in V} [K_{\{i, i\}}] + \sum_{\substack{\{i, j\} \in V^{2} \\ \{i, j\} \in E_{T}}} \big( [K_{\{i, j\}}] - [K_{\{i, i\}}] - [K_{\{j, j\}}] \big),

(A.2)

which gives

\operatorname{tr} (K_{T} \Sigma) = \sum_{i \in V} K_{ii} \Sigma_{ii} + \sum_{\substack{\{i, j\} \in V^{2} \\ \{i, j\} \in E_{T}}} \big( 2 K_{ij} \Sigma_{ij} - K_{ii} \Sigma_{ii} - K_{jj} \Sigma_{jj} \big) .

(A.3)

The approximation mentioned in Section 3 arises precisely here, where $K_{ii}$ should actually be $K_{T, ii}$ . We can also decompose the determinant of $K_{T}$ as

\det (K_{T}) = \prod_{i \in V} \det ([K_{\{i, i\}}]) \prod_{\{i, j\} \in E_{T}} \frac{\det ([K_{\{i, j\}}])}{K_{ii} K_{jj}},

(A.4)

where $[K_{\{i, j\}}]$ stands for the sub-matrix of $K$ in which only the $i$ th and $j$ th rows and columns are kept. With $\det (K_{T, H}) = \prod_{h \in H} K_{hh}$ and $V = O \cup H$ , we obtain

\det (K_{T, M}) = \prod_{i \in O} \det ([K_{\{i, i\}}]) \prod_{\{i, j\} \in E_{T}} \frac{\det ([K_{\{i, j\}}])}{K_{ii} K_{jj}} .

(A.5)
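As a quick sanity check of the Schur complement step, consider a hypothetical toy case (not taken from the simulations) with two observed nodes, one hidden node, and a star-shaped tree in which the hidden node is linked to both observed nodes. Writing $a, b, e$ for the diagonal entries and $c, d$ for the two edge entries (all symbols introduced here for illustration only), we have

K_{T} = \begin{pmatrix} a & 0 & c \\ 0 & b & d \\ c & d & e \end{pmatrix}, \qquad K_{T, H} = e, \qquad K_{T, M} = \begin{pmatrix} a - c^{2} / e & - cd / e \\ - cd / e & b - d^{2} / e \end{pmatrix},

so that

\det (K_{T, M}) = ab - \frac{a d^{2}}{e} - \frac{b c^{2}}{e} = \frac{abe - a d^{2} - b c^{2}}{e} = \frac{\det (K_{T})}{\det (K_{T, H})} .

The off-diagonal entry $- cd / e$ of $K_{T, M}$ is nonzero although the two observed nodes are not adjacent in the tree, which is exactly the kind of spurious edge created by marginalization.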

8 Formulas for the M-step

We need to set the derivative of the objective function $E$ given in (3.3) with respect to each $K_{ij}$ to 0. Depending on the status of nodes $i$ and $j$ , $K_{ij}$ must satisfy the following:

\begin{array}{ll} (i, j) \in O^{2}, \ i \neq j : & K_{ij}^{h + 1} = \big( 1 - \sqrt{1 + 4 {\hat{\Sigma}}_{ij}^{2} K_{ii}^{h} K_{jj}^{h}} \big) / (2 {\hat{\Sigma}}_{ij}); \\ (i, j) \in O \times H : & K_{ij}^{h + 1} = \big( - 1 + \sqrt{1 + 4 (W_{ij}^{h})^{2} K_{ii}^{h} K_{jj}^{h}} \big) / (2 W_{ij}^{h}); \\ i = j \in O : & \frac{1}{K_{ii}^{h + 1}} + \sum_{k \in V} \frac{(K_{ik}^{h})^{2}}{K_{ii}^{h + 1} K_{kk}^{h} - (K_{ik}^{h})^{2}} \alpha_{ik}^{h} = {\hat{\Sigma}}_{ii}; \\ i = j \in H : & \frac{1}{K_{ii}^{h + 1}} + \sum_{k \in V} \frac{(K_{ik}^{h})^{2}}{K_{ii}^{h + 1} K_{kk}^{h} - (K_{ik}^{h})^{2}} \alpha_{ik}^{h} = B_{ii}^{h} . \end{array}
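The closed-form update for a pair of observed nodes can be evaluated as in the sketch below. The function name is an assumption, and the rewriting of the formula is ours: multiplying numerator and denominator by $1 + \sqrt{\cdot}$ gives an algebraically identical expression that avoids cancellation for small ${\hat{\Sigma}}_{ij}$ and remains defined when ${\hat{\Sigma}}_{ij} = 0$. The checks at the end rely only on algebraic properties of the expression: it is a root of $s x^{2} - x - s K_{ii} K_{jj} = 0$ with $s = {\hat{\Sigma}}_{ij}$, and it always satisfies $|K_{ij}| < \sqrt{K_{ii} K_{jj}}$.

```python
import math


def offdiag_update_observed(sigma_ij, k_ii, k_jj):
    """Off-diagonal update for two observed nodes.

    Direct form:  (1 - sqrt(1 + 4 s^2 K_ii K_jj)) / (2 s),  s = sigma_ij.
    Stable form: -2 s K_ii K_jj / (1 + sqrt(1 + 4 s^2 K_ii K_jj)).
    Both are algebraically equal; the stable form also covers s = 0.
    """
    root = math.sqrt(1.0 + 4.0 * sigma_ij ** 2 * k_ii * k_jj)
    return -2.0 * sigma_ij * k_ii * k_jj / (1.0 + root)


# Hypothetical values for one pair (i, j).
s, k_ii, k_jj = 0.5, 2.0, 2.0
k_ij = offdiag_update_observed(s, k_ii, k_jj)

# Same value as the direct formula.
direct = (1.0 - math.sqrt(1.0 + 4.0 * s ** 2 * k_ii * k_jj)) / (2.0 * s)
print(k_ij, direct)

# Root of s x^2 - x - s K_ii K_jj = 0, and the 2x2 block stays positive definite.
print(s * k_ij ** 2 - k_ij - s * k_ii * k_jj)
print(abs(k_ij) < math.sqrt(k_ii * k_jj))
```

The sign of the update is opposite to that of ${\hat{\Sigma}}_{ij}$, as expected for an entry of a precision matrix.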

9 Initialization

As the EM algorithm is highly dependent on its starting point, initialization should be carefully undertaken. Although this step is overlooked in most publications, we therefore choose to describe it precisely in this appendix. In our case, it requires an initial graph structure as well as initial values for the missing nodes. Our initialization scheme relies on three stages. First, we perform a clustering step and treat the clusters as groups of nodes that share a hidden parent. Then, we initialize each missing variable as the first principal component of the matrix containing its children. Finally, from these completed data, we infer an initial tree using the Chow–Liu algorithm.

Let us now describe the details of the clustering procedure. We scan all possible triplets of nodes and merge the triplet for which assuming a common hidden parent yields the largest gain in the likelihood of the observed data. Once the ‘best’ triplet is selected, we repeat the same procedure iteratively in order to form clusters in a hierarchical manner. At every level of the hierarchy, we have a set of cliques in which the nodes share the same parent and a set of nodes that have not yet been assigned to a clique. For computational reasons, we restricted the search to the triplets in which at least one pair of nodes was connected by an edge in the current estimate of the structure. The likelihood gain induced by merging two cliques was penalized for the complexity of the model with the BIC criterion (Schwarz, 1978). We show in Figure 10 the dendrogram obtained with this hierarchical clustering procedure and the cliques (coloured nodes) obtained by cutting the hierarchy at the level chosen with BIC. This was done on synthetic data, where we generated 2,000 samples from a Gaussian network with 50 nodes.

Figure 10:

Dendrogram of the hierarchical clustering procedure used for initialization. The coloured nodes correspond to the clusters at the height chosen with the BIC criterion (see online for colour images)
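The last stage of the initialization, the Chow–Liu step, can be sketched as follows for Gaussian data. The sketch assumes that the pairwise mutual information is $- \frac{1}{2} \log (1 - \rho_{ij}^{2})$, with $\rho_{ij}$ the correlation between variables $i$ and $j$, and builds the maximum-weight spanning tree with the algorithm of Kruskal (1956). The function name and the toy correlation matrix are illustrative assumptions, not the LITree implementation.

```python
import math


def chow_liu_tree(corr):
    """Maximum-weight spanning tree from a correlation matrix.

    Edge weight: Gaussian mutual information -0.5 * log(1 - rho^2).
    Assumes |rho| < 1 for every pair. Edges are returned as (i, j), i < j.
    """
    p = len(corr)
    candidates = sorted(
        ((-0.5 * math.log(1.0 - corr[i][j] ** 2), i, j)
         for i in range(p) for j in range(i + 1, p)),
        reverse=True,
    )
    parent = list(range(p))  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for _, i, j in candidates:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # the edge joins two components: keep it
            parent[root_i] = root_j
            tree.append((i, j))
    return tree


# Hypothetical chain 0 - 1 - 2 - 3 with correlation 0.8 between neighbours,
# so that corr[i][j] = 0.8 ** |i - j|.
corr = [[0.8 ** abs(i - j) for j in range(4)] for i in range(4)]
print(sorted(chow_liu_tree(corr)))
```

On this toy chain the recovered tree consists of the three edges between consecutive nodes, since correlations decay with the distance along the chain.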

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

Funding

The authors received no financial support for the research, authorship and/or publication of this article.


References

1. Beal and Ghahramani (2003) The variational Bayes EM algorithm for incomplete data: With application to scoring graphical model structures. In Bernardo J. M., Bayarri M. J., Berger J. O., Dawid A. P., Heckerman, Smith A. F. M. and West (eds), Bayesian Statistics 7 (pp. 453–63). Oxford, UK: Oxford University Press.

2. Biernacki, Celeux and Govaert (2000) Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 719–25.

3. Boyd and Vandenberghe (2004) Convex Optimization. New York, NY: Cambridge University Press.

4. Celeux and Govaert (1992) A classification EM algorithm for clustering and two stochastic versions. Computational Statistics & Data Analysis, 14, 315–32.

5. Chaiken (1982) A combinatorial proof of the all minors matrix tree theorem. SIAM Journal on Algebraic Discrete Methods, 3, 319–29.

6. Chandrasekaran, Parrilo and Willsky (2012) Latent variable graphical model selection via convex optimization. The Annals of Statistics, 40, 1935–67.

7. Choi, Tan, Anandkumar and Willsky (2011) Learning latent tree graphical models. The Journal of Machine Learning Research, 12, 1771–812.

8. Chow and Liu (1968) Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14, 462–67.

9. Dempster, Laird and Rubin (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.

10. Friedman, Hastie and Tibshirani (2008) Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9, 432–41.

11. Friedman (1998) The Bayesian structural EM algorithm. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence (pp. 129–38). San Francisco, CA: Morgan Kaufmann Publishers Inc.

12. Geiger and Heckerman (1996) Knowledge representation and inference in similarity networks and Bayesian multinets. Artificial Intelligence, 82, 45–74.

13. Giraud and Tsybakov (2012) Discussion of ‘latent variable graphical model selection via convex optimization’. The Annals of Statistics. doi:10.1214/12-AOS984.

14. Kirshner (2007) Learning with tree-averaged densities and distributions. In Platt, Koller, Singer and Roweis (eds), Advances in Neural Information Processing Systems 20 (pp. 761–68). New York: Curran Associates, Inc.

15. Kruskal (1956) On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical Society, 7, 48–50.

16. Lauritzen (1996) Graphical Models. Oxford, UK: Oxford University Press.

17. Lauritzen and Meinshausen (2012) Discussion: Latent variable graphical model selection via convex optimization. The Annals of Statistics. doi:10.1214/12-AOS980.

18. Meilă and Jaakkola (2006) Tractable Bayesian learning of tree belief networks. Statistics and Computing. doi:10.1007/s11222-006-5535-3.

19. Meilă and Jordan (2000) Learning with mixtures of trees. Journal of Machine Learning Research, 1, 1–48.

20. Meng, Eriksson and Hero III (2014) Learning latent variable Gaussian graphical models. Proceedings of the 31st International Conference on Machine Learning, 32.

21. Pearl (1985) Learning hidden causes from empirical data. Proceedings of the 1985 International Joint Conference on Artificial Intelligence, 1, 567–72.

22. Pearl (1988) Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Francisco, CA: Morgan Kaufmann.

23. Sachs, Perez, Pe'er, Lauffenburger and Nolan (2005) Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308, 523–29.

24. Saitou and Nei (1987) The neighbor-joining method: A new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4, 406–25.

25. Schwaller, Robin and Stumpf (2015) Bayesian inference of graphical model structures using trees. arXiv:1504.02723.

26. Schwarz (1978) Estimating the dimension of a model. The Annals of Statistics, 6, 461–64.

27.

Shiers

Zwiernik

Aston

JAD