An improvement of FDR for edge detection by applying EM method

Abstract

In building a graphical model, accurate edge detection is crucial for the quality of the model structure. We explore a way to improve the false discovery rate (FDR) method by devising an estimation procedure that is more sensitive to the data under certain conditions. The estimation applies an EM method whose parameters include the parameter of the density function under the null hypothesis (no edge) and the location parameters of the density functions under the alternative hypothesis (presence of an edge). In numerical experiments, our method compares favorably with one of the most popular FDR tools. We applied our method to gene data on 800 genes and built a network based on a vector autoregressive model for the data.

Keywords

Discrepancy measure, edge detection, EM algorithm, error rate, graphical Gaussian model, mixture distribution, Parzen window

1. Introduction

Graphical models [44, 25] are useful for representing inter-relationships among the variables in a model, where an inter-relationship can be, among others, a cause-effect (asymmetric) relationship or an association (symmetric). A variable can be causally related to one variable and associated with another. These inter-relationships are easily read when they are represented in a graph. Causal relationships are represented by arrows and associative ones by lines. When both types of relationships are present in a model, both types of edges appear in its graph.

Whether the relationships are symmetric or not, their strengths can be expressed in terms of partial correlation coefficients (PCC’s) [44]. So testing whether the PCC between each pair of variables in a model is zero is indispensable to structure learning of a graphical model. This is where the false discovery rate (FDR) method plays a key role [19, 18, 14, 16].

Graphical modelling methods have been proposed by many authors [42, 2, 43, 41, 45]. As the model size gets larger, we often run into a small data problem in modelling. In a small data situation, the PCC’s are estimated by applying, among others, a non-parametric approach [35] or a Bayes shrinkage approach [27], which yields estimates of the partial correlations for graphical modelling. Once the estimates are obtained, they are used as a measure of edge strength between nodes (i.e., variables) in a model structure. The main goal in this line of work is to detect as many of the sufficiently strong edges as possible while controlling the overall type 1 error, where the type 1 error refers to false discovery of edges. For simplicity, we will call an edge which is strong enough a non-zero edge and an edge which is not strong enough a zero edge.

Our proposed process for an improvement of FDR involves estimation of the distribution under the alternative hypothesis of non-zero edges. We took the alternative distribution as a mixture of normal distributions. The initial values of the estimates of the locations of the normal distributions are based on the estimates of the partial correlations. This is a main difference between the proposed and the existing methods in which a single distribution, such as a uniform distribution, is usually assumed for the alternative distribution. The proposed process performed favorably in comparison with the methods well known in literature.

The paper is organized in 8 sections. In Section 2, we briefly review FDR methods and graphical Gaussian models. In Section 3, an EM (expectation and maximization) method for mixture distributions is described and then extended to our problem. In Section 4, we describe our proposed method. We apply the Parzen window approach to determine the initial values used in the EM algorithm of our method. Once the estimates are obtained, we compute the error rate along with the density curves under the null and the alternative models. In Section 5, the proposed method is compared with one of the most popular standard methods through numerical experiments. The comparisons are made using ROC curves and a discrepancy measure between graphs. The experiments show favorable results for the proposed method. The proposed method is applied to real data in Section 6. Several aspects of the proposed method are discussed in Section 7. Finally, Section 8 concludes with some summarising remarks.

2. Related works and graphical Gaussian model

In a simultaneous hypothesis testing problem, we try to control the family probability of type 1 error. This problem was addressed by many authors as a multiple comparison problem where many treatment effects are compared with a control [32, 11, 12, 30]. Modified versions of $t$ -statistics were proposed as test statistics for the problem with a view to keep the family probability of type 1 error under control.

As the number of treatments increased, a new method was developed [4] with introduction of FDR to handle large-scale simultaneous hypothesis testing problems and several improvements have been made on this method [19, 14, 1]. This approach is based on the assumption that the test statistics are iid (independent and identically distributed) so that the p-values are uniform random variables under the null hypothesis of no treatment effects.

This assumption on the p-values, however, is often misleading in the sense that the p-values may not be iid at all. For instance, the null model may be misspecified or the test statistics may not be independent of each other [17]. This concern led to FDR methods of wider applicability, which compute the FDR from test statistics such as the $t$ -score, $z$ -score, or correlation coefficient [17, 39, 3].

A variety of FDR methods have been developed. Most of them were based on $p$ -values with estimations made on null proportions and null probability models [37, 33, 38, 7]. Strimmer [39] developed a unified version of the FDR method which is available with $p$ -values as well as test statistics such as $z$ -scores, $t$ -scores, and correlation coefficients [26]. After a comprehensive comparison study of commonly used FDR methods, the unified version of the FDR methods was implemented in an R package ‘fdrtool’, which compared favorably with other FDR methods. Sun and Cai [40] proposed data-driven procedures that aim to minimize the false non-discovery rate subject to a constraint on the FDR. They assumed that the data are generated from a two-state hidden Markov model and devised a test statistic called the local index of significance to be used instead of the $p$ -values.

FDR methods are indispensable to graphical modelling in microarray analysis of causal relationships between genes [6, 36, 31]. In this analysis, PCC’s were used for building a vector autoregressive (VAR) model. The strength of the causal relationship is measured by the PCC’s.

The PCC is also used for structure learning of a graphical Gaussian model (GGM) [10]. Let ${\bf X}=(X_{1},\cdots,X_{v})$ be a random vector whose probability model is a GGM. For the GGM, the PCC’s are obtained from the inverse sample covariance matrix of ${\bf X}$ (Whittaker 1990). If we denote the PCC between $X_{i}$ and $X_{j}$ by $\pi_{ij}$ , then $\pi_{ij}=0$ means conditional independence between $X_{i}$ and $X_{j}$ given the remaining components of ${\bf X}$ . This conditional independence is represented by “no edge” in the model structure of the GGM. If $\pi_{ij}\neq 0$ , then the conditional dependence is represented by an edge between the nodes of the two variables. The decision on the state of an edge (presence or absence) is made through an FDR procedure.
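For reference, the standard relation between the PCC’s and the inverse covariance matrix can be written out explicitly. The symbols $\Sigma$ (covariance matrix of ${\bf X}$) and $\Omega=(\omega_{ij})$ (its inverse) are notation introduced here only for illustration.

```latex
\Omega = \Sigma^{-1} = (\omega_{ij}), \qquad
\pi_{ij} = -\frac{\omega_{ij}}{\sqrt{\omega_{ii}\,\omega_{jj}}},
\quad 1 \leqslant i < j \leqslant v .
```

In particular, $\pi_{ij}=0$ exactly when $\omega_{ij}=0$, which is why a zero PCC corresponds to a missing edge.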

3. EM algorithm with multiple alternative models

Consider a problem of building a GGM model of the normal random vector ${\bf X}=(X_{1},\cdots,X_{v})$ . Denote by $E$ the total number of the variable pairs, i.e., $E=v(v-1)/2$ . For notational convenience, we will simplify the index of $\pi_{ij}$ , $1\leqslant i<j\leqslant v$ , to $\pi_{h}$ for $h=1,\cdots,E$ . Let the estimate of $\pi_{h}$ be denoted by $r_{h}$ .
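The re-indexing of the pairs $(i,j)$ by a single index $h$ can be sketched as follows. The lexicographic ordering of the pairs and the function name `pair_index` are assumptions made for illustration; the text does not fix a particular ordering.

```python
from itertools import combinations


def pair_index(v):
    """Map each pair (i, j), 1 <= i < j <= v, to a single index h = 1, ..., E.

    The lexicographic order used here is an assumption; any fixed order works.
    """
    pairs = combinations(range(1, v + 1), 2)
    return {pair: h for h, pair in enumerate(pairs, start=1)}


index = pair_index(10)
E = len(index)  # equals v(v-1)/2 for v = 10
```

The same count gives $E=190$ for $v=20$ and $E=1225$ for $v=50$, the model sizes used later in the paper.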

We will assume that the $r_{h}$ ’s are from $m+1$ different models, $f_{0},f_{1},\cdots,f_{m}$ . We will call $f_{0}$ a null model in the sense that it is a probability model when $\pi_{h}=0$ and $f_{A}$ an alternative model for the $r_{h}$ ’s when $\pi_{h}\neq 0$ . Let $f_{j}$ be the $j$ -th component of $f_{A}$ for $j=1,\cdots,m$ . Let $\lambda_{j}$ be the parameter in $f_{j}$ , $j=0,1,\cdots,m$ .

$f_{j}$ is weighted by $\eta_{j}$ in the probability model $f$ of $r$ as

$\displaystyle f(r;\eta_{0},\cdots,\eta_{m},\lambda_{0},\cdots,\lambda_{m})=\sum_{j=0}^{m}\eta_{j}f_{j}(r;\lambda_{j}),$ (1)

where $\sum_{j=0}^{m}\eta_{j}=1$ . In other words, $f$ is a mixture of $f_{0},f_{1},\cdots,f_{m}$ and the alternative model $f_{A}$ is given as a mixture of $f_{j}$ with weights $\eta_{j}$ , $j=1,\cdots,m$ .

Let $\theta=(\eta_{0},\cdots,\eta_{m},\lambda_{0},\cdots,\lambda_{m})$ and $\lambda_{A}=(\lambda_{1},\cdots,\lambda_{m})$ . Then, a simplified version of Eq. (1) is

$\displaystyle f(r;\theta)=\eta_{0}f_{0}(r;\lambda_{0})+f_{A}(r;\lambda_{A}).$

Let the latent variable $Z_{ij}$ be defined as

$\displaystyle Z_{ij}=\left\{\begin{array}{ll}1&\textrm{if $r_{i}$ is from $f_{j}$,}\\ 0&\textrm{otherwise,}\end{array}\right.$

where $i=1,\ldots,E$ and $j=0,1,\ldots,m$ .

The probability that $Z_{ij}=1$ is defined as the $j$ -th weight. In other words,

$\displaystyle P(Z_{ij}=1)=\eta_{j},j=0,\ldots,m.$ (2)

Then the $j$ -th component $f_{j}$ of $f_{A}$ is given by

$\displaystyle f(r_{i}|Z_{ij}=1)=f_{j}(r_{i};\lambda_{j})$ (3)

where $\lambda_{j}$ is the location parameter of $f_{j}$ . Then, the probability model in Eq. (1) for $r_{i}$ can be written as

$\displaystyle f(r_{i};\theta)=\sum_{j=0}^{m}f(r_{i}|Z_{ij}=1)P(Z_{ij}=1).$ (4)

As shown in Eq. (4), the distribution of $r_{i}$ is a mixture of $(m+1)$ models.

The probability model of the complete data $(r_{i},Z_{ij})$ , $i=1,\ldots,E$ , $j=0,\ldots,m$ , is given by

$\displaystyle P_{c}(r_{i},z_{ij};\theta)=\prod_{i=1}^{E}\prod_{j=0}^{m}[\eta_{j}f_{j}(r_{i};\lambda_{j})]^{Z_{ij}}.$

The MLE (maximum likelihood estimator) of $\theta$ is obtained by applying an EM algorithm as follows. We denote the complete-data log-likelihood of $\theta$ by

$\displaystyle l_{c}(\theta)=\Sigma_{i=1}^{E}\Sigma_{j=0}^{m}Z_{ij}\log[\eta_{j}f_{j}(r_{i};\lambda_{j})].$

Let the estimates of $\theta$ obtained at the $k$ -th EM step be denoted by $\theta^{(k)}$ . Then the classification probability $P_{ij}=P(Z_{ij}=1|r_{i})$ of $r_{i}$ is estimated as

$\displaystyle P_{ij}^{(k+1)}=P_{\theta^{(k)}}(Z_{ij}=1|r_{i})=\frac{P_{\theta^{(k)}}(r_{i}|Z_{ij}=1)P_{\theta^{(k)}}(Z_{ij}=1)}{\sum_{l=0}^{m}P_{\theta^{(k)}}(r_{i}|Z_{il}=1)P_{\theta^{(k)}}(Z_{il}=1)}=\frac{\eta_{j}^{(k)}f_{j}(r_{i};\lambda_{j}^{(k)})}{\sum_{l=0}^{m}\eta_{l}^{(k)}f_{l}(r_{i};\lambda_{l}^{(k)})}.$ (5)

Note that

$\displaystyle P_{ij}^{(k+1)}=E_{\theta^{(k)}}[Z_{ij}|r_{i}].$

Let ${\bf r}=(r_{1},\cdots,r_{E})$ . Then we can update $\theta^{(k)}$ using $P_{ij}^{(k+1)}$ by maximizing $E_{\theta^{(k)}}[l_{c}(\theta)|\mathbf{r}]$ over $\theta$ as

$\displaystyle\theta^{(k+1)}=\textit{arg}\max_{\theta}E_{\theta^{(k)}}[l_{c}(\theta)|\mathbf{r}]=\textit{arg}\max_{\theta}\Sigma_{i=1}^{E}\Sigma_{j=0}^{m}P_{ij}^{(k+1)}\log[\eta_{j}f_{j}(r_{i};\lambda_{j})].$

Therefore, for $j=0,\ldots,m$ ,

$\displaystyle\eta_{j}^{(k+1)}=\frac{\Sigma_{i=1}^{E}P_{ij}^{(k+1)}}{E},$ (6)

$\displaystyle\lambda_{j}^{(k+1)}=\textit{arg}\max_{\lambda_{j}}\Sigma_{i=1}^{E}P_{ij}^{(k+1)}\log f_{j}(r_{i};\lambda_{j}).$
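The weight update in Eq. (6) follows from a short constrained maximization, sketched here. The Lagrange multiplier $\nu$ is a symbol introduced only for this derivation.

```latex
\begin{aligned}
&\max_{\eta_{0},\ldots,\eta_{m}} \sum_{i=1}^{E}\sum_{j=0}^{m} P_{ij}^{(k+1)} \log \eta_{j}
  \quad \text{subject to} \quad \sum_{j=0}^{m}\eta_{j}=1, \\
&\frac{\partial}{\partial \eta_{j}}\Big[\sum_{i=1}^{E}\sum_{l=0}^{m} P_{il}^{(k+1)}\log\eta_{l}
  - \nu\Big(\sum_{l=0}^{m}\eta_{l}-1\Big)\Big] = 0
  \;\Longrightarrow\; \eta_{j} = \frac{1}{\nu}\sum_{i=1}^{E} P_{ij}^{(k+1)}, \\
&\sum_{j=0}^{m}\eta_{j}=1 \ \text{and} \ \sum_{j=0}^{m}P_{ij}^{(k+1)}=1
  \;\Longrightarrow\; \nu = E, \qquad
  \eta_{j}^{(k+1)} = \frac{1}{E}\sum_{i=1}^{E}P_{ij}^{(k+1)} .
\end{aligned}
```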

Convergence of this EM process is well established in literature (Dempster et al. 1977).

Now we specify $f_{j}$ ’s for our problem of structure learning of a GGM. As for $f_{0}$ , we will assume the distribution of $r_{i}$ which was derived by Hotelling (1953) when $\pi_{i}=0$ as given by

$\displaystyle f_{0}(r_{i};\kappa)=(1-r_{i}^{2})^{\frac{\kappa-3}{2}}\frac{\Gamma(\frac{\kappa}{2})}{\sqrt{\pi}\Gamma(\frac{\kappa-1}{2})}.$ (7)

It is worthwhile to note that $\textit{Var}(r_{i})=\kappa^{-1}$ for the pdf $f_{0}$ . We will consider the mixture Eq. (1) for $r_{i}$ ’s with $f_{0}$ in Eq. (7) and $f_{j}$ as the pdf of $N(\mu_{j},\sigma_{j}^{2})$ for $j=1,\cdots,m$ .
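The variance property of $f_{0}$ can be checked numerically. The following is a minimal sketch that evaluates Eq. (7) and integrates it with a midpoint rule; the function names, the grid size, and the choice $\kappa=20$ are illustrative assumptions.

```python
import math


def f0(r, kappa):
    """Null density of Eq. (7), evaluated through logarithms for stability."""
    log_c = (math.lgamma(kappa / 2) - 0.5 * math.log(math.pi)
             - math.lgamma((kappa - 1) / 2))
    return math.exp(log_c + 0.5 * (kappa - 3) * math.log1p(-r * r))


def mass_and_variance(kappa, n=20000):
    """Midpoint-rule integrals of f0 and r^2 * f0 over (-1, 1)."""
    step = 2.0 / n
    mass = var = 0.0
    for k in range(n):
        r = -1.0 + (k + 0.5) * step  # midpoints never hit r = -1 or r = 1
        p = f0(r, kappa)
        mass += p * step
        var += r * r * p * step      # the mean is 0 by symmetry
    return mass, var


mass, var = mass_and_variance(20.0)  # var should be close to 1/20
```

A larger $\kappa$ therefore corresponds to a more sharply peaked null density around 0.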

With this specification of $f_{j}$ ’s, $\theta$ is redefined as

$\displaystyle\theta=(\eta_{0},\cdots,\eta_{m},\kappa,\mu_{1},\cdots,\mu_{m},\sigma_{1},\cdots,\sigma_{m}).$ (8)

The EM procedure as described in Eqs (5) through (6) is modified in accordance with the above specification of $f_{j}$ ’s.

As for the E-step, we update the estimate of $P_{ij}=E_{\theta}[Z_{ij}|r_{i}]$ , for $i=1,\ldots,E$ and $j=0,1,\ldots,m$ , as

$\displaystyle P_{i0}^{(k+1)}=E_{\theta^{(k)}}[Z_{i0}|r_{i}]=\frac{\eta_{0}^{(k)}f_{0}(r_{i};\kappa^{(k)})}{\eta_{0}^{(k)}f_{0}(r_{i};\kappa^{(k)})+\sum_{l=1}^{m}\eta_{l}^{(k)}f_{l}(r_{i};\mu_{l}^{(k)},\sigma_{l}^{(k)})},$

$\displaystyle P_{ij}^{(k+1)}=E_{\theta^{(k)}}[Z_{ij}|r_{i}]=\frac{\eta_{j}^{(k)}f_{j}(r_{i};\mu_{j}^{(k)},\sigma_{j}^{(k)})}{\eta_{0}^{(k)}f_{0}(r_{i};\kappa^{(k)})+\sum_{l=1}^{m}\eta_{l}^{(k)}f_{l}(r_{i};\mu_{l}^{(k)},\sigma_{l}^{(k)})},\quad j\geqslant 1.$

Once these updates, $P_{ij}^{(k+1)}$ , are obtained, we can get updates of the weights, $\eta_{j}$ , by

$\displaystyle\eta_{j}^{(k+1)}=\frac{\Sigma_{i=1}^{E}P_{ij}^{(k+1)}}{E},\quad j=0,1,\cdots,m.$

Then we can update the other components of $\theta$ as follows:

$\displaystyle\kappa^{(k+1)}=\textit{arg}\max_{\kappa}\Sigma_{i=1}^{E}P_{i0}^{(k+1)}\log f_{0}(r_{i};\kappa),$ (9)

$\displaystyle\mu_{j}^{(k+1)}=\frac{\Sigma_{i=1}^{E}P_{ij}^{(k+1)}r_{i}}{\Sigma_{i=1}^{E}P_{ij}^{(k+1)}},$ (10)

$\displaystyle(\sigma_{j}^{2})^{(k+1)}=\frac{\Sigma_{i=1}^{E}P_{ij}^{(k+1)}(r_{i}-\mu_{j}^{(k+1)})^{2}}{\Sigma_{i=1}^{E}P_{ij}^{(k+1)}}.$ (11)

If we denote the MLE of $\theta$ by $\hat{\theta}$ which is obtained as a result of the EM process, we can see that

$\displaystyle\widehat{\eta}_{j}=\frac{\sum_{i=1}^{E}\widehat{P}_{ij}}{E},\mbox{ for }j=0,1,\cdots,m,$

$\displaystyle\widehat{\mu_{j}}=\frac{\sum_{i=1}^{E}\widehat{P}_{ij}r_{i}}{\sum_{i=1}^{E}\widehat{P}_{ij}},\mbox{ for }j=1,\cdots,m,$

$\displaystyle\widehat{\sigma_{j}^{2}}=\frac{\sum_{i=1}^{E}\widehat{P}_{ij}(r_{i}-\widehat{\mu_{j}})^{2}}{\sum_{i=1}^{E}\widehat{P}_{ij}},\mbox{ for }j=1,\cdots,m.$

$\widehat{\eta_{0}}$ can be interpreted as the estimate of $P(r\mbox{ is from }f_{0})$ , $\widehat{\eta_{j}}$ as the estimate of $P(r\mbox{ is from }f_{j})$ , $\widehat{\mu_{j}}$ as the estimate of $E_{f_{j}}(r)$ , and $\widehat{\sigma_{j}^{2}}$ as the estimate of $\textit{Var}_{f_{j}}(r)$ where the subscript $f_{j}$ means that $r$ is from model $f_{j}$ .

The initial values for the estimates in the EM process are obtained as follows. We assigned 0.5 as the initial value for $\eta_{0}$ and $0.5/m$ for the other $\eta_{j}$ ’s. The initial values for the $\mu_{j}$ ’s are assigned as the Parzen window method suggests. Those for the $\sigma_{j}$ ’s are assigned by using Fisher’s $z$ -transformation for stability [20, 21]. The initial value for $\kappa$ is obtained as an MLE based on the empirical distribution in a neighborhood of 0.
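The E-step and the updates in Eqs (9) through (11) can be put together in a compact sketch. This is an illustration under stated assumptions, not the authors' implementation: the maximization over $\kappa$ is done by a golden-section search on a fixed interval, a small floor is placed on $\sigma_{j}$, the number of iterations is fixed, and the initial values of $\kappa$, $\mu_{j}$, and $\sigma_{j}$ are simply passed in by the caller. The synthetic data at the end are hypothetical.

```python
import math
import random


def log_f0(r, kappa):
    """Log of the null density in Eq. (7)."""
    log_c = (math.lgamma(kappa / 2) - 0.5 * math.log(math.pi)
             - math.lgamma((kappa - 1) / 2))
    return log_c + 0.5 * (kappa - 3) * math.log1p(-r * r)


def normal_pdf(r, mu, sigma):
    z = (r - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def update_kappa(rs, w, lo=3.0, hi=5000.0, iters=100):
    """Eq. (9) by golden-section search; the search interval is an assumption."""
    total_w = sum(w)
    s = sum(wi * math.log1p(-ri * ri) for wi, ri in zip(w, rs))

    def objective(k):  # weighted log-likelihood of f0, concave in k
        log_c = (math.lgamma(k / 2) - 0.5 * math.log(math.pi)
                 - math.lgamma((k - 1) / 2))
        return total_w * log_c + 0.5 * (k - 3) * s

    g = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    for _ in range(iters):
        c, d = b - g * (b - a), a + g * (b - a)
        if objective(c) > objective(d):
            b = d
        else:
            a = c
    return 0.5 * (a + b)


def em(rs, kappa, mus, sigmas, n_iter=50):
    mus, sigmas = list(mus), list(sigmas)
    m, n = len(mus), len(rs)
    etas = [0.5] + [0.5 / m] * m  # initial weights as described in the text
    for _ in range(n_iter):
        # E-step: classification probabilities P[i][j]
        P = []
        for r in rs:
            num = [etas[0] * math.exp(log_f0(r, kappa))]
            num += [etas[j + 1] * normal_pdf(r, mus[j], sigmas[j]) for j in range(m)]
            total = sum(num)
            P.append([x / total for x in num])
        # M-step: weights, then Eqs (9), (10), (11)
        col = [sum(P[i][j] for i in range(n)) for j in range(m + 1)]
        etas = [c / n for c in col]
        kappa = update_kappa(rs, [P[i][0] for i in range(n)])
        for j in range(m):
            if col[j + 1] < 1e-12:  # component carries no weight; leave it as is
                continue
            w = [P[i][j + 1] for i in range(n)]
            mus[j] = sum(wi * ri for wi, ri in zip(w, rs)) / col[j + 1]
            var = sum(wi * (ri - mus[j]) ** 2 for wi, ri in zip(w, rs)) / col[j + 1]
            sigmas[j] = max(math.sqrt(var), 1e-3)  # floor is an assumption
    return etas, kappa, mus, sigmas


# Hypothetical data: 400 null values with kappa = 100 (under Eq. (7),
# r^2 follows Beta(1/2, (kappa - 1)/2)) plus two alternative clusters.
rng = random.Random(0)
rs = [rng.choice((-1, 1)) * math.sqrt(rng.betavariate(0.5, 49.5)) for _ in range(400)]
rs += [rng.gauss(0.6, 0.05) for _ in range(50)]
rs += [rng.gauss(-0.6, 0.05) for _ in range(50)]
etas, kappa, mus, sigmas = em(rs, kappa=50.0, mus=[0.5, -0.5], sigmas=[0.1, 0.1])
```

With well separated clusters as in this toy example, the fitted null weight should settle near the true proportion of null values and the fitted locations near the cluster centres.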

4. Proposed method for edge detection

Once the MLE’s are obtained for the mixture Eq. (1) with parameters in Eq. (8), we have the estimate of the probability model $f$ as

$\displaystyle f(r;\hat{\theta})=\widehat{\eta_{0}}f_{0}(r;\hat{\kappa})+\sum_{j=1}^{m}\widehat{\eta_{j}}f_{j}(r;\widehat{\mu_{j}},\hat{\sigma}_{j}^{2}).$

Let $\sum_{j=1}^{m}\widehat{\eta_{j}}f_{j}(r;\widehat{\mu_{j}},\hat{\sigma}_{j}^{2})$ be denoted simply as $\hat{f}_{A}(r;\lambda_{A})$ . The estimate of $f$ suggests that the null weight is estimated as $\widehat{\eta_{0}}$ and the classification of the sources of $r_{i}$ ’s will be made based on the estimates as follows:

If $\widehat{\eta_{0}}f_{0}(r_{i};\hat{\kappa})>\hat{f}_{A}(r_{i};\lambda_{A})$ , then $r_{i}$ is classified as coming from $f_{0}$ . Otherwise, $r_{i}$ is classified as coming from one of the $f_{j}$ ’s, $j=1,\cdots,m$ .
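The classification rule above translates directly into code. In this minimal sketch the function name `is_edge` and the fitted values in the example are hypothetical placeholders, not estimates from the paper.

```python
import math


def f0(r, kappa):
    """Null density of Eq. (7)."""
    log_c = (math.lgamma(kappa / 2) - 0.5 * math.log(math.pi)
             - math.lgamma((kappa - 1) / 2))
    return math.exp(log_c + 0.5 * (kappa - 3) * math.log1p(-r * r))


def normal_pdf(r, mu, sigma):
    z = (r - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def is_edge(r, etas, kappa, mus, sigmas):
    """Return True unless the weighted null density exceeds the alternative part."""
    null_part = etas[0] * f0(r, kappa)
    alt_part = sum(etas[j + 1] * normal_pdf(r, mus[j], sigmas[j])
                   for j in range(len(mus)))
    return not (null_part > alt_part)


# Hypothetical fitted values, for illustration only.
fit = dict(etas=[0.8, 0.1, 0.1], kappa=100.0, mus=[0.6, -0.6], sigmas=[0.05, 0.05])
decisions = [is_edge(r, **fit) for r in (0.0, 0.6, -0.62)]
```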

4.1 Selection of $m$ by the Parzen window method

In the above EM process, the alternative model $f_{A}$ consists of $m$ component models $f_{j}$ , where $m$ is predetermined. We describe, with examples, how $m$ is chosen through the Parzen window method. Suppose we have a set of $r_{i}$ ’s from a data set of size $n=100$ generated from a GGM model of $v=10$ variables. The histogram in panel (a) of Fig. 1 is of the $r_{i}$ ’s. The Parzen window method is used for drawing a smoothed version of the histogram.

Suppose we use a kernel function (or window function) $\varphi$ defined as

$\displaystyle\varphi\left(\frac{r-r_{i}}{h}\right)=\left\{\begin{array}{ll}1&\textrm{if $|r-r_{i}|<\frac{h}{2}$,}\\ 0&\textrm{otherwise.}\end{array}\right.$ (12)

Then the function $g$ defined as

$\displaystyle g(r)=\sum_{i=1}^{E}\varphi\left(\frac{r-r_{i}}{h}\right)$

will yield a smoother version of the histogram as the window width $h$ increases. As $h$ increases, the $r_{i}$ ’s close to each other will produce a larger value of $g$ . As an extreme case, when $h=4$ , $g(r)=E$ for $-1\leqslant r\leqslant 1$ .
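The function $g$ with the window function of Eq. (12) amounts to counting the $r_{i}$ ’s within $h/2$ of $r$. A minimal sketch follows; the function name `parzen` and the sample values are hypothetical.

```python
def parzen(r, rs, h):
    """g(r): number of r_i with |r - r_i| < h / 2, as in Eq. (12)."""
    return sum(1 for ri in rs if abs(r - ri) < h / 2)


rs = [-0.9, -0.12, 0.0, 0.05, 0.7]   # hypothetical sample PCC's
g_narrow = parzen(0.0, rs, 0.2)      # counts only the values within 0.1 of 0
g_wide = [parzen(x / 10, rs, 4.0) for x in range(-10, 11)]  # h = 4 covers everything
```

For these values every entry of `g_wide` equals the number of $r_{i}$ ’s, which is the extreme case $h=4$ mentioned above.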

The panels (b) through (f) of Fig. 1 illustrate how the shape of the Parzen window graphs (PWG’s) changes as the window width $h$ increases from 0.01 to 0.4. One of the merits of the Parzen window method is that we can get a rough clue about clusters of $r_{i}$ values by examining the PWG’s.

Figure 1.

Histogram and Parzen window graphs with the window width $h=0.01,0.1,0.2,0.3,0.4$ .

When $h=0.01$ , the PWG displays most of the individual $r_{i}$ points. But when $h=0.1$ , we can see 3 clusters, 2 small clusters with a big one in the middle. The cluster in the middle seems to be a mixture of two clusters, a small cluster being attached to a big one from the negative side of the latter. In this respect, four clusters seem reasonable when $h=0.1$ . When $h=0.2,0.3,0.4$ , 3 clusters seem reasonable. Since we assume a null model around 0, the value of $m$ is the number of small clusters on the shoulders of the big cluster around 0.

Suppose we took $m=3$ as suggested in panel (c). Then in the EM process, we use the mode or the mean of each cluster as the initial value for the estimate of $\mu_{j}$ . As indicated in this illustration, a larger $h$ would blind us to smaller clusters, which might lead us to worse initial values for the EM process. It goes without saying that an appropriate value of $m$ must be selected for a successful estimation. In our experience, a larger $m$ than necessary is preferable to a smaller one. We will see an instance of this in the next subsection.
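In the paper $m$ is read off the PWG by inspection. As a rough aid, the cluster centres can also be located automatically as the flat local maxima of $g$ on a grid. The following sketch is our own heuristic, not the authors' procedure: the grid size, the minimum height `min_count`, the function name `plateau_peaks`, and the sample data are all assumptions.

```python
def parzen(r, rs, h):
    """g(r) with the box window of Eq. (12)."""
    return sum(1 for ri in rs if abs(r - ri) < h / 2)


def plateau_peaks(rs, h, n_grid=400, min_count=2):
    """Midpoints of the flat local maxima of g on an even grid over [-1, 1]."""
    grid = [-1 + 2 * k / n_grid for k in range(n_grid + 1)]
    g = [parzen(x, rs, h) for x in grid]
    peaks, k = [], 0
    while k <= n_grid:
        end = k
        while end < n_grid and g[end + 1] == g[k]:
            end += 1                      # extend the run of equal values
        left = g[k - 1] if k > 0 else -1
        right = g[end + 1] if end < n_grid else -1
        if g[k] >= min_count and left < g[k] and right < g[k]:
            peaks.append(0.5 * (grid[k] + grid[end]))
        k = end + 1
    return peaks


# Hypothetical r_i's: a cluster around 0 and one small cluster on each side.
rs = [x / 100 for x in range(-62, -57)]
rs += [x / 100 for x in range(-4, 5)]
rs += [x / 100 for x in range(58, 63)]
peaks = plateau_peaks(rs, h=0.3)
m = len(peaks) - 1  # exclude the cluster around 0, which is taken as the null
```

The remaining peak locations can serve as initial values for the $\mu_{j}$ ’s. With a small $h$ the same routine returns many minor peaks, which mirrors the behaviour of the PWG described above.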

If we do the EM with a larger $m$ than necessary, a meaningful compromise takes place in the estimates of the weights $\eta_{j}$ . If a small cluster is ignorable, then the estimate of its weight $\eta_{j}$ becomes very small, or the cluster is relocated during the EM process and merged into its closest cluster with a larger weight. On the other hand, if $m$ is smaller than necessary, clusters close to each other get merged into a larger one, which may happen when $h$ is larger than desired. A main concern in this case is that some clusters from $f_{A}$ that lie close to the big cluster around $0$ , which may be from $f_{0}$ , may be lost at the initial stage of the EM process. This may reduce the edge detection accuracy.

There is no general guideline for a proper value of $h$ . Its value may depend on the number $E$ of $r_{i}$ ’s and on the empirical distribution itself. As for the kernel function, one may use a function such as a Gaussian kernel, which is more sensitive to the $r_{i}$ values so that the shape of $g$ reflects the data values more closely. But the function $\varphi$ in Eq. (12), which we used in this work, is simpler and serves our goal of choosing a value for $m$ well.

4.2 Edge detection and the number $m$ of $f_{j}$ ’s in $f_{A}$

We will use the sample PCC’s, $r_{i}$ ’s, with their histogram in panel (a) of Fig. 1 to demonstrate the effect of $m$ on the edge detection. The PCC’s are obtained based on the data of size $n=100$ generated from the GGM in Fig. 2. As mentioned above, we chose $m=3$ when $h=0.1$ (panel (c) of Fig. 1) and $m=2$ when $h=0.3$ (panel (e)). The $r_{i}$ ’s are listed in Table 1 for the pairs of variables which are connected by an edge in the GGM model in Fig. 2.

Table 1
Difference in edge detection between $m=2$ and $m=3$

Edge	$r_{i}$	Detected ($m=2$)	Detected ($m=3$)
(1, 2)	0.528	Yes	Yes
(1, 5)	0.413	Yes	Yes
(2, 7)	0.616	Yes	Yes
(3, 10)	$-0.638$	Yes	Yes
(4, 5)	$-0.646$	Yes	Yes
(6, 9)	$-0.647$	Yes	Yes
(8, 10)	$-0.734$	Yes	Yes
(2, 6)	$-0.282$	No	Yes
(3, 5)	$-0.259$	No	Yes
(5, 10)	0.241	No	Yes
(5, 8)	$-0.211$	No	No
(6, 7)	$-0.210$	No	No
(9, 10)	$-0.201$	No	No

We can see in Table 1 an apparent difference in edge detection between the two values, 2 and 3, of $m$ . As for the edges with their absolute $r_{i}$ values between 0.24 and 0.3, they were detected when $m=3$ but not when $m=2$ . The edges whose absolute $r_{i}$ values were smaller than 0.22 were not detected for either of the two values of $m$ .

Figure 2.

A GGM model of 10 variables.

A merit of choosing a larger $m$ is that the edges from $f_{A}$ whose strengths, $|r_{i}|$ , are near the boundary between $f_{0}$ and $f_{A}$ have a higher chance of being detected. This is elaborated in Section 7.

4.3 Partition of $r_{i}$ ’s for estimation for $f_{0}$ and $f_{A}$

Figure 3.

Partition of the set of $r_{i}$ ’s and estimation results based on data of size 100 from a GGM model of 20 variables.

In estimating the parameters $\theta$ , we partition the set of $r_{i}$ ’s into three parts, $[-1,-\tau)$ , $[-\tau,\tau]$ , and $(\tau,1]$ with $0\leqslant\tau\leqslant 1$ . The set in the middle, $C_{0}$ say, is used for estimating the parameter $\kappa$ of $f_{0}$ . We observe that the estimate $\hat{\kappa}$ of $\kappa$ decreases as the threshold $\tau$ increases. When $C_{0}$ begins to be contaminated with the $r_{i}$ ’s from $f_{A}$ , the estimate $\hat{\kappa}$ may begin to decrease, i.e., $f_{0}(r;\hat{\kappa})$ becomes flatter, since $\textit{Var}(r_{i})=\kappa^{-1}$ under $f_{0}$ . As the contamination rate increases, the rate of change becomes larger.
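The dependence of $\hat{\kappa}$ on $\tau$ can be illustrated with a simple stand-in estimator. Because $\textit{Var}(r_{i})=\kappa^{-1}$ and the mean is 0 under $f_{0}$ , the reciprocal of the mean of $r_{i}^{2}$ over the middle set gives a moment-based estimate of $\kappa$. This is an assumption made for illustration; it is not the MLE used in the paper, and it ignores the truncation at $\tau$, so it tends to overestimate $\kappa$. The sample values are hypothetical.

```python
def kappa_moment(rs, tau):
    """Moment-based stand-in for kappa-hat, using only the r_i with |r_i| <= tau."""
    c0 = [r for r in rs if abs(r) <= tau]
    second_moment = sum(r * r for r in c0) / len(c0)
    return 1.0 / second_moment


rs = [-0.7, -0.1, 0.05, 0.08, 0.12, 0.3, 0.65]   # hypothetical sample PCC's
sweep = [kappa_moment(rs, tau) for tau in (0.15, 0.35, 0.7)]
```

Each increase of $\tau$ only admits values with larger $|r_{i}|$ , so this estimate cannot increase with $\tau$. A sharp drop signals that values far from 0 have entered the middle set.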

The partition also affects the estimation of $f_{A}$ as well as of the $\eta_{j}$ ’s. When we take too small a value of $\tau$ , it is highly possible that the $r_{i}$ ’s from $f_{A}$ are contaminated with those from $f_{0}$ , which may end up with an estimate of $f_{A}$ that has a relatively high density part mixed with $f_{0}$ . This phenomenon is depicted in Fig. 3.

The figure is based on a set of the sample PCC’s, $r_{i}$ ’s, which are obtained from a data set of size $n=100$ from a GGM model of size $v=20$ . We can see in the figure that when $\tau=0.45$ , $f_{0}$ is estimated based on a set of $r_{i}$ ’s that is contaminated with some $r_{i}$ ’s from $f_{A}$ . But when the set $C_{0}$ shrinks a bit with $\tau=0.4$ , the shape of $f_{0}$ becomes sharper with $\hat{\kappa}$ increased to more than twice as large as that for $\tau=0.45$ (see panel (b)). If $C_{0}$ gets even smaller with $\tau=0.15$ , then we see an anomalous picture in panel (e). In this situation, the $r_{i}$ ’s from $f_{A}$ may be contaminated at a high rate with those from $f_{0}$ and so $f_{A}$ may contain inappropriate components $f_{j}$ deep inside the $f_{0}$ region of $r_{i}$ ’s. This will end up with a smaller than desired $\hat{\eta}_{0}$ value as is shown in panel (e). Also note that the shape of $f_{0}$ becomes even sharper with a larger $\hat{\kappa}$ . This is a strong indication that $f_{A}$ is estimated based on a largely contaminated set of $r_{i}$ ’s.

Let $C(\tau)$ be the set of $r_{i}$ ’s such that $|r_{i}|\leqslant\tau$ . By examining these four panels, we can see that $C(0.4)$ and $C(0.2)$ are more or less similar to each other and that the set $C(0.2)$ looks like an appropriate set for estimating $\kappa$ . If one moves from $C(0.2)$ down to $C(0.15)$ , $f_{A}$ is now estimated based on a set of $r_{i}$ ’s which is contaminated too much with those from $f_{0}$ . By monitoring this change of the estimates, we can choose appropriate estimation results for edge detection.

4.4 Mixture pattern of $f$ and error rate

Figure 4.

Estimation and error rate for the GGM mentioned in Fig. 3.

Let the error rate in edge detection be defined as

$\displaystyle\textit{Error rate}=\frac{FP+FN}{E}$

where ‘ $FP$ ’ and ‘ $FN$ ’ are the numbers of edges which are falsely detected (false positive) and falsely undetected (false negative), respectively, and $E$ is the total number of node-pairs. This error rate thus increases as the contamination rate increases. As an extreme case, the error rate will be 0 when there is a set $C(\tau)$ which perfectly separates the $r_{i}$ ’s from $f_{0}$ from those from $f_{A}$ . In this context, the error rate is subject to the mixture pattern of the distribution of $r_{i}$ ’s. We will see a few examples of this phenomenon.
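The error rate is straightforward to compute once the true and the detected edge sets are known. In this minimal sketch, edges are represented as pairs of node labels; the function name and the toy edge sets are hypothetical.

```python
def error_rate(true_edges, detected_edges, n_pairs):
    """(FP + FN) / E, with edges given as sets of node pairs."""
    fp = len(detected_edges - true_edges)   # detected but not in the true model
    fn = len(true_edges - detected_edges)   # in the true model but not detected
    return (fp + fn) / n_pairs


# Toy example with v = 10 nodes, so E = 45 node-pairs.
true_edges = {(1, 2), (1, 5), (2, 7)}
detected_edges = {(1, 2), (2, 7), (3, 4)}
rate = error_rate(true_edges, detected_edges, n_pairs=45)
```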

Figure 5.

Estimation and error rate for a “sparse” GGM model of size 50 containing 2% of all the possible edges.

Figure 6.

Estimation and error rate for a “dense” GGM model of size 50 containing 10% of all the possible edges.

Figure 4 shows an edge detection performance based on a data set of size 100 generated from a GGM model of size 20. The error rate is 0 when $\hat{\kappa}=105$ with $\tau=0.2$ , which corresponds to panel (d) of Fig. 3. In panel (c) of Fig. 4, we can see that the error rate varies as the threshold $\tau$ increases, taking its minimum when $0.2\leqslant\tau\leqslant 0.4$ . Panels (a) and (b) show the relationship among $\eta_{0}$ , $\sqrt{\kappa}$ , and the error rate. It is clear that $\kappa$ and $\eta_{0}$ are closely linked: as $\kappa$ increases, $\eta_{0}$ decreases, since a larger $\kappa$ means a sharper $f_{0}$ curve, which increases the proportion attributed to $f_{A}$ and in turn decreases the $\eta_{0}$ value. This phenomenon is shown by the circle points in panel (a). The error rates corresponding to the $\kappa$ values are displayed in the lower part of the panel by asterisk points.

In Figs 5 and 6, we summarized experimental results from a GGM model of size 50. The model structure considered in the former figure has 24 edges, which are only 2% of all the possible node-pairs, $\binom{50}{2}=1225$ , while that in the latter figure has 123 edges (i.e., 10%). For convenience’ sake, we will call the former model “sparse” and the latter “dense.”

The histogram from the “sparse” model suggests that $f_{0}$ may be contaminated by $f_{A}$ deep inside the domain of $f_{0}$ . This is well reflected in the estimates for $f_{0}$ and $f_{A}$ in panels (e) and (f) of Fig. 5. The error rate is minimized when $\hat{\kappa}=36$ (see panels (a) through (c)). In panel (a), we can see an abrupt drop of $\hat{\eta}_{0}$ at some point $\sqrt{\kappa^{*}}$ of $\sqrt{\hat{\kappa}}$ . This means that as the set $C(\tau)$ gets smaller, it may reach some value of $\tau$ where a meaningful distinction takes place between $f_{0}$ and $f_{A}$ .

As for the data in Fig. 6, the $\hat{\kappa}$ values decrease gradually. This seems to reflect that the $r_{i}$ ’s are contaminated over a wide range of their values. If this is the case, the smallest error rate may occur when $\tau$ is near $0$ , since then the distinction between $f_{0}$ and $f_{A}$ can be made over the full range. Note in panel (c) that the error rate is non-decreasing in $\tau$ .

To sum up, there is no general rule for the best edge detection. We recommend, however, monitoring the $\hat{\kappa}$ values over a range of $\tau$ . $\hat{\kappa}$ may drop abruptly, smoothly, or in some way in between as $\tau$ increases. If we work with real data, we cannot compute the error rate. But if we use the plot of $\hat{\kappa}$ and $\hat{\eta}_{0}$ and monitor the patterns of the estimates of $f_{0}$ and $f_{A}$ as in panels (d), (e), and (f) of either of the above two figures, we may attain a reasonably good edge detection. Our FDR procedure is summarized in the form of a workflow in Fig. 7.

5. Performance comparison by numerical experiments

We compared our method with the standard FDR method by numerical experiments using data generated by the method described in subsection ‘Simulation setup’ of Schafer and Strimmer [35]. This method guarantees a positive definite covariance matrix for a given set of random variables involved in a GGM model. Once a GGM model is specified such as the one in Fig. 2, we generate data for the random variables involved in the model and the data is used for performance comparison.

We generated 50 data sets of size $n=100$ from a given GGM model of size $v=20$ (i.e., $E=190$ ) and obtained a receiver operating characteristic (ROC) curve for each of the data sets. The comparison is then made using the area under the ROC curve (AUC). This comparison is given in Table 2 by taking the average of the 50 AUC values. The ROC curves in Fig. 8 are an example from the data sets. We ran numerical experiments of the same kind for GGM models of sizes 50 and 100 and obtained similar results: the proposed method has higher AUC values than the standard method.
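For readers who wish to reproduce this kind of comparison, the AUC can be computed without tracing the ROC curve, as the proportion of (true edge, true non-edge) pairs in which the true edge receives the higher score. The sketch below is generic; which score each method assigns to a node-pair is not specified here and is left to the caller, and the scores and labels in the example are hypothetical.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count one half. `labels` are truthy for true edges.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


scores = [0.9, 0.8, 0.3, 0.2, 0.1]   # hypothetical edge scores
labels = [1, 1, 0, 1, 0]             # hypothetical true edge indicators
value = auc(scores, labels)
```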

Table 2
Comparison of the proposed method and fdrtool using AUC. The AUC value is the average from 50 iterative experiments with $n=100$ and $v=20$ for each of the methods

	Proposed method	fdrtool
Mean	0.994	0.898
Sd	0.004	0.012

Figure 7.

Workflow of the proposed FDR procedure.

Figure 8.

Comparison of the proposed method (in red) and fdrtool (in blue) using ROC curves.

Figure 9.

Discrepancy between the actual and estimated models. ‘EM’ denotes the proposed method and ‘fdr’ the standard method.

It is worthwhile to note that an ROC curve is built from sensitivity and specificity, which are both ratios computed within a class (the true edges and the true non-edges, respectively). For instance, if there are fewer true positive cases, then the sensitivity may vary by a larger rate. So we may need another measure to get a better picture of the overall discrepancy between two model structures. In this context, we define a discrepancy measure $D$ as

$\displaystyle D=\Sigma_{i=1}^{E}d_{i}$

where

$\displaystyle d_{i}\equiv\left\{\begin{array}{ll}0&\textrm{if a correct decision is made on $r_{i}$,}\\ 1&\textrm{otherwise.}\end{array}\right.$

The measure $D$ is an index of structural discrepancy between a true model and an estimated or learned model.
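Since an incorrect decision on a node-pair is either a false positive or a false negative, $D$ is tied directly to the error rate of Section 4.4, as the following identity shows.

```latex
D = \sum_{i=1}^{E} d_{i} = FP + FN = E \times \textit{Error rate}.
```

For a fixed model size, comparing $D$ values is therefore equivalent to comparing error rates.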

For a performance comparison using $D$ between the proposed and the standard FDR methods, we considered GGM models of sizes $v=10,20,50,100$ . We further considered two types of model complexity for the model sizes 50 and 100. One type of the model is of 2% of edges out of all the possible node-pairs in the model structure, and as for the other type, it is of 10% of edges. For convenience’ sake, we will call the former model ‘sparse’ and the latter ‘dense.’ As it was for the AUC, 50 data sets were generated for a given model and as many $D$ values are obtained for the comparison using boxplots. The sample sizes were 100 for $v=10,20,50$ , and it was 200 for $v=100$ . These experiment results are summarized in Fig. 9.

We can see in the boxplots that the discrepancy is smaller by the proposed method than by the standard method for all the models considered.

6. Application to real data

Figure 10.

Histogram and Parzen window graph for the gene data.

We applied the proposed method to graphical modelling of the gene data of Arabidopsis thaliana. The real data we used for the experiment are available in the R package GeneNet of Schafer et al. [34]. The data were originally obtained from microarray analysis of diurnal changes in the starch transcriptome in leaves of Arabidopsis thaliana [6, 36]. The original data of 22,814 probes, 11 time points, and two biological replicates were preprocessed and a subset of 800 genes was selected after filtering out all genes containing missing values and those whose maximum signal intensity value was lower than 5 on a log-base 2 scale [31].

Figure 11.

Estimation of $f_{0}$ and $f_{A}$ by the proposed method for the gene data.

The data were analyzed in Lee et al. [27], and we were able to obtain the $r_{i}$ values from these authors, who computed the PCC’s by applying a Bayesian shrinkage method. The genes are known to be causally related in temporal order, so a vector autoregressive (VAR) model was assumed for this data, and the autoregressive coefficients can be obtained as PCC’s. For a VAR model of order 1, we need to consider all the possible autoregressive coefficients across genes; since 800 genes are involved in the model, we have to deal with $800^{2}=640,000$ $r_{i}$’s.

Figure 10 shows the histogram and a Parzen window graph with $h=0.001$ in panels (a) and (b) respectively. Both of the graphs seem to indicate that the null model dominates with the alternative model almost ignorable. In panel (c), we can see some $r_{i}$ ’s scattered almost unnoticeably off a ‘huge’ stack of $r_{i}$ ’s around $0$ .
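The Parzen window curve described above can be sketched as follows. The Gaussian kernel is an assumption, since the kernel is not restated in this section, and the function name `parzen_density` and the toy $r_{i}$ values are hypothetical; only the bandwidth $h=0.001$ is taken from the text.

```python
import math


def parzen_density(x, sample, h):
    """Parzen window estimate at x with a Gaussian kernel of bandwidth h (assumed kernel)."""
    norm = len(sample) * h * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - r) / h) ** 2) for r in sample) / norm


# Hypothetical r_i values: a stack around 0 plus one isolated value at 0.040.
r_values = [-0.002, -0.001, 0.0, 0.0, 0.001, 0.002, 0.040]
grid = [i / 1000 for i in range(-5, 46)]  # -0.005, -0.004, ..., 0.045
curve = [parzen_density(x, r_values, h=0.001) for x in grid]
```

With such a small bandwidth, an isolated $r_{i}$ shows up as its own small bump away from the stack around $0$, which is the kind of feature a histogram with wide bins tends to hide.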

The estimates of $f_{0}$ and $f_{A}$ are shown in Fig. 11, where the estimated curves are zoomed in at the foot of $f_{0}$. If we drew the whole curves of the estimates of $f_{0}$ and $f_{A}$, the picture would resemble a 100 m-tall tree ($f_{0}$) with little trees ($f_{j}$’s), each less than 1 m tall, at its foot. The alternative model meets the null model almost at the foot of the null model, implying a large value of $\widehat{\eta_{0}}$. In panel (a) of Fig. 12, $\hat{\kappa}$ drops when $\tau$ moves from 0.030 to 0.035, with a slight increase of $\widehat{\eta_{0}}$. This seems to indicate that there may be a small cluster of $r_{i}$’s from $f_{A}$ in the set $C(0.035)-C(0.030)$, while the $r_{i}$’s in the set $C(0.030)$ are mostly from $f_{0}$. Note in Fig. 11 that there is an apparent change in the shape of $f_{A}$ when $\tau$ moves from 0.035 (panel (a)) to 0.030 (panel (b)). We therefore suggest using the estimates obtained with $\tau=0.03$ for edge detection in the model structure.

Table 3

Summary of the analysis of the genes of Arabidopsis thaliana by applying the proposed EM method. ‘fdrtool’ represents the standard FDR method

	Proposed method				fdrtool
Threshold ( $\tau$ )	0.035	0.030	0.025	0.020
$\sqrt{\hat{\kappa}}$	267.7	279.6	280.8	280.9	272.1
$\hat{\eta_{0}}$	0.9989	0.9883	0.9864	0.9862	0.9969
The number ( $m$ ) of $f_{j}$ ’s	1	4	7	10
$\hat{\sigma}$	0.0037	0.0037	0.0037	0.0037
Cutoff for rejection of null	1.3582 $\times$ 10 ${}^{-2}$	1.1148 $\times$ 10 ${}^{-2}$	1.1294 $\times$ 10 ${}^{-2}$	1.1283 $\times$ 10 ${}^{-2}$	1.2806 $\times$ 10 ${}^{-2}$
$E$	1603	4282	4599	4630	2287
Edge proportion	0.2505 (%)	0.6691 (%)	0.7186 (%)	0.7234 (%)	0.3573 (%)
Edge increment		$+$ 2679	$+$ 317	$+$ 31
		0.4186 (%)	0.0495 (%)	0.0048 (%)
Node degree $\geqslant$ 95 (%) (Hub-node)	24	32	33	33	26
Nodes with positive node degree	438	633	649	650	513

Figure 12.

Comparison of $\hat{\kappa}$ and $\hat{\eta}_{0}$ between the standard FDR method and proposed method.

The edge detection result for the gene data is represented by a VAR network in Fig. 13 in the Appendix. Its subgraph with the nodes of the top 150 edge-strengths is given in Fig. 14 in the Appendix. One may compare this graph with Fig. 13 in Lee et al. [27], which was obtained by the standard FDR method. The two graphs are more or less the same, except that, for instance, the genes labelled 198 and 573 are hub-nodes in Fig. 14 while they are not in Lee et al. [27].

A summary of the analysis of the gene data is given in Table 3. The results of the standard FDR method are given in the column ‘fdrtool’; they are as described in Subsection 5.3 of Lee et al. [27]. Note that when $\tau$ moves from 0.035 to 0.030, the number of detected edges increases from 1603 to 4282, that is, 2679 edges are added, whereas any further addition of edges is relatively small as $\tau$ moves from 0.030 down to 0.020. This is closely related to the fact that the cutoff for the rejection of $f_{0}$ (no edge) remains more or less the same, between $1.11\times 10^{-2}$ and $1.13\times 10^{-2}$, for $0.02\leqslant\tau\leqslant 0.03$, while it changes from $1.11\times 10^{-2}$ to $1.36\times 10^{-2}$ when $\tau$ moves from 0.03 to 0.035.
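The edge counts quoted from Table 3 can be checked with a few lines of arithmetic. The sketch below only recomputes edge proportions and edge increments from the reported values of $E$ and the $800^{2}$ candidate coefficients; the variable and function names are illustrative.

```python
TOTAL_COEFFICIENTS = 800 ** 2  # 640,000 candidate r_i's in the VAR(1) model

# Numbers of detected edges E copied from Table 3, keyed by the threshold tau.
detected = {0.035: 1603, 0.030: 4282, 0.025: 4599, 0.020: 4630}


def edge_proportion(num_edges, total=TOTAL_COEFFICIENTS):
    """Detected edges as a percentage of all candidate coefficients."""
    return 100.0 * num_edges / total


taus = sorted(detected, reverse=True)  # 0.035, 0.030, 0.025, 0.020
increments = [detected[b] - detected[a] for a, b in zip(taus, taus[1:])]

for tau in taus:
    print(f"tau={tau:.3f}  E={detected[tau]}  proportion={edge_proportion(detected[tau]):.4f}%")
print(increments)  # [2679, 317, 31]
```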

We cannot compare our method with the standard FDR method directly, since the estimation is implemented under different conditions for the alternative model. According to Table 3, however, the result of the standard method lies between our results obtained with $\tau=0.035$ and with $\tau=0.03$; for instance, it detects 2287 edges, compared with 1603 and 4282 edges respectively.

Our method is also useful for exploratory data analysis, since it lets us find a range of $\tau$ over which a dramatic change of $\hat{\kappa}$, if any, takes place. The set of genes responsible for this variation of $\hat{\kappa}$ may well deserve further investigation.

7. Discussions

The proposed FDR procedure differs from the others in that we take the alternative model to be a mixture of multiple normal distributions instead of one or two distributions. When PCC’s are used in an FDR procedure, our interest is in whether the correlations are strong enough (non-zero) or not (zero). We used the pdf in Eq. (7) for the null distribution ($f_{0}$), which was introduced by Hotelling [24] as the distribution of correlations when $\rho=0$. As for the alternative model, we considered a mixture of normal distributions and estimated it by applying an EM approach. In most of the existing FDR methods, the alternative model is taken as a mixture of at most two distributions, one on each side of the null area.

Note that the $\hat{\kappa}$ value of $f_{0}$ is affected by the locations and the number of the $r_{i}$’s which are regarded as coming from $f_{A}$. By monitoring the values of $\hat{\kappa}$, we could determine reasonably well the location of the $r_{i}$’s where the contamination of $f_{A}$ with $f_{0}$ becomes far more influential on the estimation of $\kappa$. We recommend that the threshold ($\tau$) between $f_{0}$ and $f_{A}$ be smaller than the location of the $r_{i}$’s where $\hat{\kappa}$ decreases abruptly and does not increase thereafter as $\tau$ increases.
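The recommendation above can be sketched as a simple rule on a grid of thresholds: pick the grid value of $\tau$ just before the steepest decrease of the monitored statistic. The function name `suggest_threshold` is hypothetical, the check that $\hat{\kappa}$ does not increase again afterwards is omitted for brevity, and the example uses the $\sqrt{\hat{\kappa}}$ values reported in Table 3.

```python
def suggest_threshold(taus, kappa_hats):
    """Return (tau, drop): the grid value of tau just before the steepest
    decrease of the monitored statistic, and the size of that decrease."""
    pairs = sorted(zip(taus, kappa_hats))
    if len(pairs) < 2:
        raise ValueError("need at least two thresholds to detect a drop")
    # Decrease between consecutive thresholds, paired with the smaller threshold.
    drops = [(pairs[i][1] - pairs[i + 1][1], pairs[i][0]) for i in range(len(pairs) - 1)]
    drop, tau = max(drops)
    return tau, drop


# sqrt(kappa-hat) for the gene data, copied from Table 3.
taus = [0.020, 0.025, 0.030, 0.035]
sqrt_kappa = [280.9, 280.8, 279.6, 267.7]
tau, drop = suggest_threshold(taus, sqrt_kappa)
print(tau)  # 0.03
```

For the gene data this simple rule returns $\tau=0.030$, the threshold adopted in Section 6. In practice the curves of the estimates of $f_{0}$ and $f_{A}$ should be inspected as well, since a grid-based rule cannot locate a drop more finely than the grid spacing.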

The estimation in the proposed method is based on data. We applied the Parzen window approach in search of an appropriate value of $m$, the number of the component models in the alternative model $f_{A}$. We have seen that a value of $m$ larger than necessary is preferable to one that is too small. In this way, we could make our EM process more data-based. A smaller than necessary $m$ may lead to a larger error rate in edge detection, because more $r_{i}$ values than desired are then involved in estimating $f_{0}$.

In most of the preceding studies, $f_{0}$ was the main concern, where $N(0,1)$ [19], $N(\mu,\sigma)$ [15], or Hotelling’s distribution was assumed for the sample PCC [35]. For $f_{A}$, on the other hand, either no distribution or a simple one, such as a uniform distribution, was considered. Once $f_{0}$ was estimated, $f_{A}$ was often taken as $f^{s}-\hat{f}_{0}$, where $f^{s}$ is a smoothed version of the empirical distribution of $r$ and $\hat{f}_{0}$ is the estimate of $f_{0}$. Our method, in contrast, models $f_{A}$ explicitly as a mixture of normal distributions, which distinguishes it from these methods in the literature.

When we work with real data, the error rate is unknown. But by monitoring $\hat{\kappa}$, $\hat{\eta}_{0}$, and the curves of the estimates of $f_{0}$ and $f_{A}$, we may be able to find the values of $\hat{\kappa}$ and $\hat{\eta}_{0}$ at which the best edge detection is attained. The histogram from the “sparse” model suggests that $f_{0}$ may be contaminated by $f_{A}$ deep inside the domain of $f_{0}$. This is well reflected in the estimates of $f_{0}$ and $f_{A}$ in panels (e) and (f) of Fig. 5. The error rate is minimized when $\hat{\kappa}=36$ (see panels (a) through (c)). In panel (a), we can see an abrupt drop of $\hat{\eta}_{0}$ at some point $\sqrt{\kappa^{*}}$ of $\sqrt{\hat{\kappa}}$. This means that, as the set $C(\tau)$ gets smaller, $\tau$ may reach a point where a meaningful distinction between $f_{0}$ and $f_{A}$ takes place.

8. Concluding remarks

A main difference between our proposed method and the conventional FDR method is that we consider a mixture of multiple Gaussian distributions as the distribution under the alternative hypothesis (i.e., presence of edge), while the conventional method uses for the alternative either a uniform distribution or the difference between the smoothed version of the empirical distribution and the estimate of the null model. This difference makes our method more powerful when the model structure is sparse, i.e., when the number of edges is small relative to the number of nodes (i.e., random variables) involved in the model structure. Another reason for the mixed alternative model is that the relationship between a pair of random variables may have its own grounds. For example, the functional connectivity between two brain regions may differ in strength from that of other pairs for some unknown neuro-biological reason. If there are several such pairs of brain regions, it would be desirable to regard them as random variables from a particular distribution.

The Gaussian distributions considered under the alternative hypothesis have their means near or at the estimates of the partial correlations at the initial stage of the EM process. At each step of the process, we obtain MLE’s of the null density function, the means of the Gaussian densities which are mixed into a density function under the alternative hypothesis, and the weights of the null density and the Gaussian densities. The process terminates when the full likelihood reaches its maximum.
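The EM process described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the null density is taken to be the standard density of a sample correlation under $\rho=0$, proportional to $(1-r^{2})^{(\kappa-3)/2}$ (Eq. (7) is not reproduced in this section); all Gaussian components share a fixed standard deviation `sigma`; the search bracket for $\kappa$ is arbitrary; and the loop runs for a fixed number of iterations rather than until the likelihood stops increasing. All function names are hypothetical, and every $r_{i}$ must lie strictly inside $(-1,1)$.

```python
import math


def null_logpdf(r, kappa):
    """Log-density of a sample correlation when rho = 0 (assumed form of f_0)."""
    log_beta = math.lgamma(0.5) + math.lgamma((kappa - 1) / 2) - math.lgamma(kappa / 2)
    return (kappa - 3) / 2 * math.log1p(-r * r) - log_beta


def normal_pdf(r, mu, sigma):
    z = (r - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def em_step(rs, eta, mus, kappa, sigma):
    """One EM iteration; eta[0] weights f_0 and eta[j] weights f_j, j = 1..m."""
    m = len(mus)
    # E-step: posterior probability that each r_i comes from f_0 or from f_j.
    post = []
    for r in rs:
        dens = [eta[0] * math.exp(null_logpdf(r, kappa))]
        dens += [eta[j + 1] * normal_pdf(r, mus[j], sigma) for j in range(m)]
        total = sum(dens)
        post.append([d / total for d in dens])
    # M-step for the weights and the component means.
    mass = [sum(p[k] for p in post) for k in range(m + 1)]
    new_eta = [c / len(rs) for c in mass]
    new_mus = [sum(p[j + 1] * r for p, r in zip(post, rs)) / max(mass[j + 1], 1e-300)
               for j in range(m)]
    # M-step for kappa: the weighted null log-likelihood is concave in kappa,
    # so a ternary search on an assumed bracket [3.5, 1e5] locates its maximum.
    slope = sum(p[0] * math.log1p(-r * r) for p, r in zip(post, rs))

    def objective(k):
        log_beta = math.lgamma(0.5) + math.lgamma((k - 1) / 2) - math.lgamma(k / 2)
        return (k - 3) / 2 * slope - mass[0] * log_beta

    lo, hi = 3.5, 1e5
    for _ in range(200):
        a, b = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if objective(a) < objective(b):
            lo = a
        else:
            hi = b
    return new_eta, new_mus, (lo + hi) / 2


def fit(rs, eta, mus, kappa, sigma, iterations=50):
    """Run a fixed number of EM iterations from the given initial values."""
    for _ in range(iterations):
        eta, mus, kappa = em_step(rs, eta, mus, kappa, sigma)
    return eta, mus, kappa
```

A call such as `fit(rs, [0.8, 0.1, 0.1], [0.4, -0.4], 50.0, 0.05)` starts from a null weight of 0.8 and two alternative components with means near the larger observed $r_{i}$’s, in the spirit of the initialization described above.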

Since the initial values of the proposed method depend on the individual estimates of the partial correlations, the method is in a sense sensitive to the data. With a view to controlling the level of data sensitivity, we used the variances of the Gaussian distributions in a stabilized form through Fisher’s $z$-transformation. For a large-scale problem with small data, the partial correlations are estimated by applying non-parametric or Bayesian shrinkage approaches [31, 27].
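For reference, Fisher’s $z$-transformation and its variance-stabilizing property can be sketched as below. The formulas are the textbook large-sample approximations for a (partial) correlation computed from $n$ observations with a given number of conditioning variables; how exactly they enter the authors' variance estimates is not restated here, so the function names and their use are assumptions.

```python
import math


def fisher_z(r):
    """Fisher's z-transformation of a correlation r in (-1, 1)."""
    return math.atanh(r)


def z_standard_error(n, num_conditioned=0):
    """Approximate standard error of z; it does not depend on the true correlation."""
    return 1.0 / math.sqrt(n - num_conditioned - 3)


def r_standard_error(r, n, num_conditioned=0):
    """Delta-method standard error on the original correlation scale."""
    return (1.0 - r * r) * z_standard_error(n, num_conditioned)
```

On the $z$ scale the standard error is the same for every correlation, whereas on the original scale it shrinks as $|r|$ approaches 1; this is the sense in which the transformation stabilizes the variance.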

Our method compared favorably with the conventional method in numerical experiments for Gaussian graphical models whose model structure is given as an undirected graph. The proposed method can, however, also be used for causal graphical models of continuous random variables where the joint probability model is given as a product of marginal or conditional Gaussian models.

We devised a procedure for the proposed method in which we can monitor the dynamic variation of the null and alternative density curves, with the empirical distribution as a reference curve. This procedure helps us ground our edge detection process in the data.

Although we ran numerical experiments using Gaussian graphical models with up to 100 variables, we anticipate that our method will also compare favorably with the conventional one for larger models. This is because any single distribution taken as an alternative model can be approximated by a mixture of Gaussian densities, and because our method simultaneously monitors the dynamic change in the shapes of the null and the alternative models and in their corresponding weights.

Acknowledgments

The research was supported by a grant from the National Research Foundation of the Republic of Korea (Grant No: 2020R1A2C1A01008767).

Appendix

Model structure representing causal relationship among the genes of Arabidopsis thaliana by the proposed method

Figure 13.

Model structure representing causal relationship among the genes of Arabidopsis thaliana by the proposed method with $\tau=0.030$ . In this network, $v=633$ and $E=4282$ .

Figure 14.

Subgraph of the graph in Fig. 13. The nodes with the top 150 edge-strengths are involved in the graph.

References

1. Aubert, Bar-Hen, Daudin J.J. and Robin, Determination of the differentially expressed genes in microarray experiments using local FDR, BMC Bioinformatics 5 (2004), 125.

2. Bay S.D., Shrager, Pohorille and Langley, Revising regulatory networks: From expression data to linear causal models, J. Biomed. Informatics 35 (2002), 298–497.

3. Benjamini, Discovering the false discovery rate, J. Roy. Statist. Soc. B 72(4) (2010), 405–416.

4. Benjamini and Hochberg, Controlling the false discovery rate: A practical and powerful approach to multiple testing, J. Roy. Statist. Soc. B 57 (1995), 289–300.

5. Bickel P.J. and Doksum K.A., Mathematical Statistics, 2nd ed., New York, CRC Press (Taylor and Francis Group), 2015.

6. Craigon D.J., James, Okyere, Higgins, Jotham and May, NASCArrays: A repository for microarray data generated by NASC’s transcriptomics service, Nucleic Acids Research 32 (2004), D575–D577.

7. Dalmasso, Broët and Moreau, A simple procedure for estimating the false discovery rate, Bioinformatics 21 (2005), 660–668.

8. Dempster A.P., Laird and Rubin, Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society B 39 (1977), 1–38.

9. Dempster A.P., The direct use of likelihood for significance testing, Statistics and Computing 7 (1997), 247–252.

10. Dempster A.P., Covariance selection, Biometrics 28 (1972), 157–175.

11. Dunnett C.W., Multiple comparisons procedure for comparing several treatments with a control, J. Am. Statist. Assoc. 50 (1955), 1096–1121.

12. Dunnett C.W., New tables for multiple comparisons with a control, Biometrics 20 (1964), 482–491.

13. Edwards, Introduction to Graphical Modelling, New York, Springer, 1995.

14. Efron, Robbins, empirical Bayes and microarrays, The Annals of Statistics 31 (2003), 366–378.

15. Efron, Large-scale simultaneous hypothesis testing: The choice of a null hypothesis, Journal of the American Statistical Association 99 (2004), 96–104.

16. Efron, Local False Discovery Rates, National Institute of Health grant 8R01 EB002784 and National Science Foundation grant DMS-0072360, Technical Report N. 2005-20B (234), 2005.

17. Efron, Correlation and large-scale simultaneous significance testing, J. Amer. Statist. Assoc. 102 (2007), 93–103.

18. Efron and Tibshirani, Empirical Bayes methods and false discovery rates for microarrays, Genetic Epidemiology 23 (2002), 70–86.

19. Efron, Tibshirani, Storey and Tusher, Empirical Bayes analysis of a microarray experiment, J. Amer. Statist. Assoc. 96 (2001), 1151–1160.

20. Fisher R.A., Frequency distribution of the values of the correlation coefficient in samples of an indefinitely large population, Biometrika 10(4) (1915), 507–521.

21. Fisher R.A., On the ‘probable error’ of a coefficient of correlation deduced from a small sample, Metron 1 (1921), 3–32.

22. Friedman, Regularized discriminant analysis, Journal of the American Statistical Association 84 (1989), 165–175.

23. Hastie, Tibshirani and Friedman, The Elements of Statistical Learning, Springer, NY, 2001.

24. Hotelling, New light on the correlation coefficient and its transforms, J. R. Statist. Soc. B 15 (1953), 193–232.

25. Lauritzen S.L., Graphical Models, New York, Oxford University Press, 1996.

26. Lee, Kim A.-K., Park and Kim S.-H., An improvement on local FDR analysis applied to functional MRI data, J. Neuroscience Methods 267 (2016), 115–125.

27. Lee, Choi and Kim S.-H., Bayes shrinkage estimation for high-dimensional VAR models with scale mixture of normal distributions for noise, Computational Statistics and Data Analysis 101 (2016), 250–276.

28. McLachlan and Peel, Finite Mixture Models, New York, John Wiley & Sons, 2000.

29. McLachlan and Krishnan, The EM Algorithm and Extensions, New York, John Wiley & Sons, 1997.

30. Miller R.G., Simultaneous Statistical Inference, 2nd edition, Springer-Verlag, 1980.

31. Opgen-Rhein and Strimmer, Learning causal networks from systems biology time course data: An effective model selection procedure for the vector autoregressive process, BMC Bioinformatics 8(Suppl 2) (2007), S3.

32. Paulson, On the comparison of several experimental categories with a control, Ann. Math. Statist. 23 (1952), 239–246.

33. Pounds and Morris S.W., Estimating the occurrence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of p-values, Bioinformatics 19 (2003), 1236–1242.

34. Schafer, Opgen-Rhein and Strimmer, GeneNet: Modeling and inferring gene networks, R Package Version 1.2.11, 2014.

35. Schafer and Strimmer, An empirical Bayes approach to inferring large-scale gene association networks, Bioinformatics 21 (2005), 754–764.

36. Smith S.M., Fulton D.C., Chia, Thorneycroft, Chapple, Dunstan, Hylton, Zeeman S.C. and Smith A.M., Diurnal changes in the transcriptome encoding enzymes of starch metabolism provide evidence for both transcriptional and posttranscriptional regulation of starch metabolism in Arabidopsis leaves, Plant Physiology 136 (2004), 2687–2699.

37. Storey J.D., A direct approach to false discovery rates, J. R. Statist. Soc. B 64 (2002), 479–498.

38. Storey J.D. and Tibshirani, Statistical significance for genomewide studies, Proc. Natl. Acad. Sci. 100 (2003), 9440–9445.

39. Strimmer, Fdrtool: A versatile R package for estimating local and tail area-based false discovery rates, Bioinformatics 24 (2008), 1461–1462.

40. Sun and Cai T.T., Large-scale multiple testing under dependence, J. R. Statist. Soc. B 71 (2009), 393–424.

41. Toh and Horimoto, Inference of a genetic network by a combined approach of cluster analysis and graphical Gaussian modeling, Bioinformatics 18 (2002), 287–297.

42. Waddell P.J. and Kishino, Cluster inferences methods and graphical models evaluated on NCI60 microarray gene expression data, Genome Informatics 11 (2000), 129–140.

43. Wang, Myklebost and Hovig, MGraph: Graphical model for microarray data analysis, Bioinformatics 19 (2003), 2210–2211.

44. Whittaker, Graphical Models in Applied Multivariate Statistics, New York, John Wiley & Sons, 1990.

45. and Subramanian K.R., Interactive analysis of gene interactions using graphical Gaussian model, in: Proceedings of the ACM SIGKDD Workshop on Data Mining in Bioinformatics, Vol. 3, 2003, pp. 63–69.

46. Zweig M.H. and Campbell, Receiver-operating characteristic (ROC) plots: A fundamental evaluation tool in clinical medicine, Clinical Chemistry 39 (1993), 561–577.
