Abstract
1. INTRODUCTION
High dimensional analysis in gene expression studies often requires identifying associations between genes. Scientists are usually interested in a sparse network, which provides biologists with insights into possible pathways from particular groups of genes (Dobra et al., 2004; Meinshausen and Bühlmann, 2006; Friedman et al., 2008). Multivariate Gaussian graphical models (Meinshausen and Bühlmann, 2006; Friedman et al., 2008) have been widely used to model continuous microarray data, since log ratios of the microarray gene expressions are approximately normally distributed after normalization. More recently, next generation high-throughput sequencing (RNA-seq) has become a popular data collection method for expression analysis (Dillies et al., 2013). Because RNA-seq gene expression data consist of counts of sequencing reads for each gene, researchers sought discrete probabilistic models, in favor of continuous Gaussian models, to describe the RNA-seq data (Srivastava and Chen, 2010; Witten, 2011; Allen and Liu, 2012; Gallopin et al., 2013; Choi et al., 2017; Imbert et al., 2018; Chiquet et al., 2019). Some of these previous studies address zero-inflated Poisson distributions (Choi et al., 2017), whereas others focus on multivariate Poisson models (Chiquet et al., 2019).
Owing to restrictions in assumptions imposed by some joint models, some seek to build network models using neighborhood selection (Meinshausen and Bühlmann, 2006; Allen and Liu, 2012). A key advantage of neighborhood selection, in contrast to a joint distribution model, is that each neighborhood sparse estimation can be done simply by a multivariate log-linear regression, and the regression model can be regularized conveniently by popular regularization methods such as ℓ1-regularized Lasso. Neighborhood network selections assume a pair-wise Markov property (Lauritzen, 1996): conditional on all other variables, each variable follows a Poisson distribution, and is estimated locally through neighborhood selection (Meinshausen and Bühlmann, 2006) by fitting ℓ1-regularized log-linear models (Allen and Liu, 2012). This Poisson graphical model based on neighborhood selection by Allen and Liu (2012) was recognized as one of the recent studies of graphical modeling specifically for discrete data with Poisson distributions, and it addresses conditional variable relationships without the need for a joint discrete distribution (Gallopin et al., 2013; Choi et al., 2017; Imbert et al., 2018; Chiquet et al., 2019). For the rest of the article, we address this model as L1 log-linear graphical model (L1-LLGM).
However, the L1-LLGM method has some pitfalls because ℓ1 regularization is known to lack oracle properties and tends to include unwanted noise variables (Zou, 2006; Zou and Zhang, 2009; Zhang, 2010). Consequently, the resulting estimated network is often not sparse enough when compared with the true underlying network structure. To mitigate this issue, Allen and Liu (2012) introduced a threshold to filter out small coefficients retained by ℓ1-regularized log-linear regressions. We demonstrate in simulations that the inferred network is not robust with respect to the threshold level, and can sometimes have a very poor performance when a suboptimal threshold level is used. Unfortunately, no practical guidance is available in the literature on how to choose an appropriate threshold level for a given data set. In subsequent parts of the article, we employ the same assumptions (local Markov property) and settings (normalization, power transformation) of the L1-LLGM, and propose a more refined estimation method for this model by adopting an ℓ0-equivalent regularization.
We frame the goal of this article as an improvement over L1-LLGM. To this aim, we developed and implemented an approximate ℓ0-regularized log-linear graphical model (L0-LLGM) for constructing sparse gene network from RNA-seq count data. We consider ℓ0 regularization because it generally yields higher true-positive estimations than ℓ1 regularization and has been shown to be more accurate for feature selection and parameter estimation (Lin et al., 2010, 2020; Shen et al., 2012, 2013). Because exact ℓ0 regularization is computationally non-deterministic polynomial-time hardness (NP-hard) and only feasible for low dimension data, we adapt the recently developed broken adaptive ridge (BAR) method to approximate ℓ0 regularization. Defined as the limit of an iteratively reweighted ℓ2-regularization algorithm, the BAR method is an approximate ℓ0-regularization method that enjoys the best of ℓ0 and ℓ2 regularizations with desirable selection, estimation, and grouping properties (Dai et al., 2018, 2020; Zhao et al., 2018, 2020; Kawaguchi et al., 2020b).
These desirable properties are important to our network analysis, as they offer a theoretical advantage of ℓ0 regularization in our Poisson graphical model. Similar to L1-LLGM, our proposed L0-LLGM assumes a pair-wise Markov property and estimates gene network structures through a local sparse LLGM that evaluates conditional network correlations to each node. Specifically, at each step, a regularized Poisson log-linear model is fitted on one node, using BAR regularization to introduce sparsity. Nonzero coefficients estimate edges extending from the node. The stability approach to regularization selection (StARS) method (Liu et al., 2010) is used to select the regularization tuning parameters for the graphical model. Our empirical studies suggest that the proposed L0-LLGM generally produces network structures closer to the true structure than those of L1-LLGM, as measured by receiving operating characteristic (ROC) curves.
This article is organized as follows. In Section 4, we briefly review some gene expression normalization methods used before the network analyses, a step essential to all gene expression studies, introduce notations necessary for graphical model constructions, and review steps of LLGM model. In Section 2.2, we describe the proposed L0-LLGM in detail. In Section 3, we illustrate the performance of L0-LLGM in comparison with L1-LLGM by simulating RNA-seq type of data from a few known network structures. We then describe a regularization parameter selection procedure for graphical models based on a stability algorithm in Section 2.3. Finally, we provide a real data illustration on kidney renal clear cell carcinoma (KIRC) micro-RNA (miRNA) data from the Cancer Genome Atlas (TCGA) (Collins and Barker, 2007) in Section 4.
2. METHODS
2.1. Data and notations
We define matrix X as the design matrix, where columns are variables and rows are samples. Matrix
2.2. ℓ0-regularized log-linear Poisson graphical model through BAR
We consider a log-linear Poisson graphical model, which characterizes conditional Poisson relationships by assuming pair-wise Markov properties (Lauritzen, 1996). Specifically, we assume that for each
where the intercept term
The first step to neighborhood network selection method is to infer graphical networks by fitting the mentioned log-linear Poisson regression for every node j,
For each j,
where the first term is the

Simulation study for two network topologies:
The BAR (Kawaguchi et al., 2020a) estimator of
The BAR estimator has been shown to possess the oracle properties in the sense that with large probability, it estimates the zero coefficients as 0's and estimates the non-zero coefficients as well as the scenario when the true submodel is known in advance and a grouping property that highly correlated variables are naturally grouped together with similar coefficients (Kawaguchi et al., 2020a).
A nonzero element in coefficient estimate vector
Theoretically, whether to use the union or intersection of each network edge based on its two neighborhood selections concerning its two nodes is asymptotically identical (Meinshausen and Bühlmann, 2006). This less conservative approach of estimating by unions remains consistent with previous neighborhood selection Poisson graphical model literature (Allen and Liu, 2012). In other words, if either one of the two local log-linear regressions concerning the two nodes i and j produces a nonzero estimate, it implies conditional dependency. Consequently the network estimate specifies an edge between nodes i and j. Therefore, estimated adjacency matrix is always symmetrical. Estimated coefficients,
Lastly, we note that for each j,
2.3. Selecting regularization parameters through StARS criterion
The sparsity and performance of the network largely depend on the regularization parameter
Specifically, let K be the number of subsamples that we draw, and let Xk be a subsample from the design matrix X, where
A potential issue here with estimator
Lastly, an overall stability measure is then calculated by evaluating the mean of all edge-specific instabilities,
When
The performance instability criterion is supported theoretically by the Theorem of Partial Sparsistency (Liu et al., 2010), which states that under suitable regularity conditions, the estimated set of edges is expected to contain the set of edges in the true underlying model as n approaches infinity.
We developed an R package for implementing L0-LLGM, which can be found at repository https://github.com/caeseriousli/prBARgraph.git.
3. SIMULATIONS
In this section, we demonstrate the performance of the BAR Poisson graphical model (L0-LLGM) versus ℓ1 Poisson graphical model (L1-LLGM) through simulations. To evaluate model fit, we measure prediction accuracy by the true-positive rates and false-positive rates. The true-positive rate is defined as the portion of correctly predicted edge out of total number of edges predicted. For instance, if a predicted network has a total of 80 edges, out of which 40 exist in the underlying true network, then the true-positive rate is
3.1. Simulating correlated Poisson networks
To generate simulation data for model comparison, we adapt the same method introduced in Allen and Liu (2012). Again, let n be the number of observations and p the number of elements (genes). We first generate independent Poisson samples: Y, a
Furthermore, using the underlying true network, we construct a structure matrix,
where
3.2. Model comparison
When compared with nondiscrete models, such as graphical Lasso (Friedman et al., 2008), the L1-LLGM model has already been numerically demonstrated to have as good or better prediction accuracy for simulated Poisson data (Allen and Liu, 2012). In this article, as the major innovation is an L0 BAR regularization, we will focus on comparing L0-LLGM with L1-LLGM. We will move on to adopt simulation setup similar to Allen and Liu (2012). Specifically, we simulated RNA sequencing data based on two common network topologies, hub and scale free. The data are randomly generated using methods introduced in Section 3.1. For each topology, we have constructed a network consisting of 50 nodes. For each topology we generate two data sets, with 200 and 500 independent samples, respectively. We further note that, sample sizes no greater than 500 in simulations are common for most RNA-seq studies (Gallopin et al., 2013; Choi et al., 2017; Imbert et al., 2018; Chiquet et al., 2019).
For each model, both the L0-LLGM and L1-LLGM methods are performed on the simulated data, using StARS criterion to determine the regularization parameters. For L1-LLGM, we considered a set of four different values for the additional sparsity threshold (“th”), specified by L1-LLGM implementations as a necessary step (Wan et al., 2016), to investigate its effects on the resulting estimated network. With both true-positive and false-positive measurements defined in the beginning of Section 3, we construct ROC curves to compare the performance of L0-LLGM in comparison with L1-LLGM. Figure 1 shows the ROC curves generated under two different topologies, scale free (Fig. 1A) and hub (Fig. 1D), each consisting of 50 nodes. We observe that L0-LLGM consistently outperformed L1-LLGM, especially in high specificity regions. The advantage of L0-LLGM is more evident for hub topology. Furthermore, it is clear that the performance of L1-LLGM can vary greatly depending on the choice of its sparsity threshold. The optimal choice of this threshold depends on the underlying topology and sample sizes. For any given false-positive rate, L0-LLGM yields a model with more correctly estimated connections. L1-LLGM, in contrast, could potentially lose nodes that are important to the structure of the network shown in Figure 1.
Lastly, we validate the mentioned findings in replications. Owing to limitations of ROC plot visualizing multiple network fits, we summarize 40 replications in box plots. In Figures 2 and 3, L0-LLGM and L1-LLGM are fitted to 40 randomly generated data sets from scale-free and hub topologies, respectively. At each replication, both the topology and data set are randomly generated, and each data set has sample size

Simulation study for scale-free topology with sample size

Simulation study for hub topology with sample size
4. APPLICATION OF L0-LLGM TO KIRC mi RNA-SEQ DATA
High throughput sequencing (second generation RNA sequencing) returns millions of short reads of RNA fragments, which have varying lengths ranging from ∼25 to possibly 300 bp paired-end reads (Chhangawala et al., 2015). These reads are usually mapped to the genome and the data are in the form of non-negative counts of the RNA fragment reads (Witten, 2011). LLGM models can be applied to any data that are assumed to have Poisson distributions. For RNA-seq data specifically, a normalization pipeline is required before data analysis. For comparison purposes, we follow the same normalization pipeline as Wan et al. (2016); Allen and Liu (2012), which consists of the following major steps: (1) adjusting for sequencing depth, (2) biological entities (e.g., genes, miRNAs) with low counts or low variances are filtered out, (3) vectors with potential over-dispersion are transformed using a power transformation to transform the data closer to Poisson distribution (Li et al., 2012; Wan et al., 2016). The normalization steps can be performed by R package XMRF (Wan et al., 2016). We defer the detailed procedures and justifications of this specific pipeline for RNA-seq Poisson graphical models to Wan et al. (2016).
We then fitted the proposed method on the KIRC miRNA data set from The Cancer Genome Atlas. The data set was downloaded from TCGA data portal (https://portal.gdc.cancer.gov) (Collins and Barker, 2007). It contains 1881 miRNAs and 616 samples. Before the normalization pipeline, we filtered out miRNAs that have all zero read counts throughout all samples, resulting in 1502 miRNAs left (20.15% of miRNAs with low counts). Then we normalize the rest of the data using XMRF package developed for L1-LLGM (Wan et al., 2016). For demonstration purposes of this article, we specify the R package to keep top 100 miRNAs with the most variance (i.e., look at top miRNAs that vary the most). Minimum read count is set to be no less than 20, the suggested default (Allen and Liu, 2012; Wan et al., 2016). This keeps ∼6.7% of miRNAs. We then move on to focus on conditional relationships between these 100 miRNAs with the largest variance and reasonable read counts. We then fit L0-LLGM using our R package, along with L1-LLGM, implemented by XMRF (Wan et al., 2016), with an StARS instability threshold of

L0-LLGM KIRC miRNA data: estimated network generated by fitting an L0-LLGM model on KIRC miRNA data from TCGA database. The penalization parameter was chosen by setting an StARS estimation instability threshold of 0.01. KIRC, kidney renal clear cell carcinoma; L0-LLGM, ℓ0-regularized log-linear graphical model; miRNA, micro-RNA; StARS, stability approach to regularization selection; TCGA, the Cancer Genome Atlas.

L1-LLGM KIRC miRNA data: estimated network generated by fitting an L1-LLGM model on KIRC miRNA data from TCGA database. The penalization parameter was chosen by setting an StARS instability threshold of 0.01. In addition, a further artificial threshold (“th”) to fine tune the L1-LLGM model. This figure shows four network estimates by varying the th threshold.
Micro-RNA Network Annotation
miRNA, micro-RNA.
It is observed from Figures 4 and 5 that the two model results reveal some similar structures, including the hub surrounding center, mir-10b (node 25). However, L0-LLGM produces a less visually “chaotic” network, in comparison with L1-LLGM. For instance, L0-LLGM outlines a clean scale-free topology with minimal cyclic loops. L1-LLGM, however, frequently exhibits loops even in sparse network estimates, possibly due to unwanted noises from ℓ1 regularization. From Figure 5, as we increase the “artificial threshold” for L1-LLGM used in XMRF package, to some extent it helps reducing these noise edges.
However, during this process, we observe that this user-imposed threshold also filtered out lower degree nodes (weaker signal), such as the hub miRNAs surrounding node 25. For example, in Figure 5A1, with no artificial threshold, L1-LLGM identifies hub center node 25 (mir-10b), which agrees with L0-LLGM in Figure 4. As the threshold increases, plots in Figure 5A2–B2 show a decreasing degree in hub center node 25. In Figure 5B2, almost entire hub is filtered out by this threshold along with noise. This observation parallels to the simulation section (Fig. 1), where the true-positive rates can be reduced by the artificial threshold, losing important network structures. These preliminary observations suggest that L0-LLGM is potentially more capable of separating signal from noise, which is consistent with our simulation results depicted in Figure 1.
Although a graphical model alone is not enough to make any further conclusions on gene interactions inference, we focus, in particular, on highly connected miRNAs (i.e., hub nodes). Some of the hub miRNAs revealed by the L0-LLGM network were previously known to be associated with each other, and with certain cancers. For example, the center of the largest hub, gene mir-10b in Figure 4 is known to be associated with cancers such as bladder cancer and proteoglycans cancer. Based on literature studies, this RNA was known to be highly expressed in metastatic hepatocellular carcinomas, in contrast to those without metastasis (Ma et al., 2010). Our network results are based on the data from patients with adenomas and adenocarcinomas from project KIRC. It is connected to numerous miRNAs, including several cluster centers known to be associated with cancer suppressing. RNA named hsa-let-7b, for example, identified as a subcluster connected to the hub center mir-10b, a previously known putative cancer suppressor, is found to play a key role in chemoresistance in renal cells from carcinoma cases (Peng et al., 2015). Together with another cluster center RNA, named miR-126 and hsa-let-7b are both identified as crucial biomarkers for identifying renal cell carcinoma (Jusufović et al., 2012; Yin et al., 2014; Carlsson et al., 2019). Our graphical model successfully identifies important miRNAs that align with published biological findings regarding such miRNAs.
We also performed additional analyses using different StARS instability thresholds
5. DISCUSSION
We have proposed and implemented an approximate L0-LLGM for constructing sparse gene network from RNA-seq count data. This approach uses a neighborhood Poisson graphical model, which offers a more comprehensive set of predictions, has less constraints on the Poisson distributions of each element, and is less sensitive to changes of individual genes, than a joint distribution model. Sparsity is achieved through the BAR penalization, a surrogate ℓ0 regularization with established oracle properties for selection and estimation. Our simulations in Section 3 show that, in general, L0-LLGM offers theoretically more accurate estimates than L1-LLGM. It reaches a high level of true-positive rate faster, without accumulating a high rate of false estimates. L0-LLGM also spares users the need of selecting an additional sparsity threshold after the regularization tuning parameter has already been selected by StARS. This brings more consistency and reproducibility to the graphical model.
Our simulations considered two types of network topologies, namely scale-free and hub topologies, and found that both L0-LLGM and L1-LLGM tend to perform better under scale-free topologies as compared with hub. However, because graphical models could potentially give drastically different results under various topologies, it would be of interest to consider more topologies in future studies.
It is worth noting that although this article has focused on the
We acknowledge that, for model applications in RNA-seq data, the network in itself often is not enough to draw definitive inference on complex gene interactions. Often a network serves as a first step in identifying clusters, under the assumptions that genes interact with each other in hubs (Friedman, 2004). One can modularize clustered genes through methods such as dynamic tree cutting. These modules can subsequently be used for gene enrichment analyses (Langfelder and Horvath, 2008).
Footnotes
AUTHOR DISCLOSURE STATEMENT
The authors declare they have no competing financial interests.
FUNDING INFORMATION
The research of G.L. was partly supported by National Institute of Health Grants P30CA-16042, UL1TR000124-02, and P50CA211015.
