Regression-Based Network Estimation for High-Dimensional Genetic Data

Abstract

Continuous advances in genome sequencing technology have made large volumes of gene expression data easy to obtain. However, the corresponding increase in genetic information calls for a new approach to network estimation. As sequencing technology progresses, data dimensions grow, and the resulting multicollinearity makes gene networks difficult to estimate. The same problem arises when hub nodes exist, and gene networks are known to contain regulator genes that can be interpreted as hub nodes. This study aims to develop methods that perform well on high-dimensional data with hub nodes. We propose regression-based approaches as feasible solutions. Elastic-net and adaptive elastic-net penalized regressions were applied to compensate for the disadvantages of existing regression-based approaches employing LASSO or adaptive LASSO. Experiments were performed to compare the proposed regression-based approaches with other conventional methods. We confirmed the superior performance of the regression-based approaches and applied them to actual genetic data to verify their suitability for estimating gene networks. As a result, the robustness of the proposed methods was demonstrated with respect to high-dimensional gene expression data.

1. Introduction

Health care technology has improved dramatically by taking advantage of recent developments in emerging technologies, including big data, the Internet of Things, and genetic testing. Precision medicine plays a vital role in health care by providing patient-tailored care with due consideration of environmental factors, lifestyles, and genes. Gene network estimation has emerged as a leading tool in precision medicine for preventing and diagnosing fatal diseases such as cancer (Weston and Hood, 2004; Alansari et al., 2018).

In general, >10,000 different messenger RNA species are detected in a single cancer sample. Such numbers cause high dimensionality, which becomes a serious problem in gene network analysis. In high-dimensional data, optimal model fitting using statistical methods is often impossible, because the solution obtained can be suboptimal and can require long computation times. This situation is called the “curse of dimensionality.” In particular, high dimensionality induces spurious relationships in network estimation with genetic data, thereby making gene networks difficult to estimate (Clarke et al., 2008). This has exposed the limitations of existing gene network estimation methods.

In this article, we aimed to improve the existing methods to solve the problem of gene network estimation under high-dimensional data. We found clues in the regression-based approaches of LASSO and adaptive LASSO (ADLASSO), which have been shown to have low complexity and high estimation accuracy (Zou and Hastie, 2005; Meinshausen and Bühlmann, 2006; Zou, 2006; Zhou, 2011; Han et al., 2016; Lee et al., 2017). To overcome the shortcomings of traditional regression-based network estimation approaches in high-dimensional genetic data, we applied elastic-net (Enet) and adaptive elastic-net (ADEnet) to resolve the multicollinearity and high-dimensional problems (Zou and Hastie, 2005; Zou and Zhang, 2009).

Network-estimation methods extensively used in bioinformatics were compared against the proposed regression-based approach. Typical bioinformatics methods include MINET, which uses mutual information, and the tree-based GENIE3 (Meyer et al., 2008; Irrthum et al., 2010). By performing experiments using various types of simulation data, the proposed method was validated for successfully handling high-dimensional genetic data.

This article is organized as follows. Section 2 describes simulation results, whereas Section 3 demonstrates the feasibility of the proposed method by applying it to colorectal cancer gene data. Lastly, Section 4 gives the conclusions of this study.

2. Simulation Study

A brief review of previously reported network-estimation methods has been included in Section 1 of Supplementary Materials, and comprehensive descriptions of the proposed regression-based approach and existing methods are provided in Section 2 of Supplementary Materials. Four simulation data types were used in this study, and the relevant settings are given in Section 3 of the Supplementary Materials. The four types of simulation data used in the experiments were as follows: (1) genetic simulation data, (2) Peng's simulation data, (3) scale-free simulation data, and (4) band-type simulation data (Albert and Barabási, 2002; Peng et al., 2009; Han et al., 2016; Lee et al., 2017).

Genetic simulation data and Peng's simulation data are commonly divided into two cases depending on whether corresponding network edges are sparse or dense. The parameter D is expressed as the average number of edges per node; thus, the total number of edges in the network is N = P × D, where P denotes the number of nodes. During experiments, conditions of low and high density were defined as D = 1 and D = 2, respectively.
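As a quick check of the edge-count relation above, the following sketch draws a random undirected network with N = P × D edges. The function name `random_network` and the uniform random edge placement are illustrative assumptions; the actual simulation settings are those given in the Supplementary Materials.

```python
import random

def random_network(p, d, seed=0):
    """Return a set of N = p * d distinct undirected edges on p nodes.

    Assumption: edges are placed uniformly at random. The hub, scale-free,
    and band-type generators used in the study are not reproduced here.
    """
    n_edges = p * d
    max_edges = p * (p - 1) // 2
    if n_edges > max_edges:
        raise ValueError("too many edges requested for this many nodes")
    rng = random.Random(seed)
    edges = set()
    while len(edges) < n_edges:
        i, j = rng.sample(range(p), 2)      # two distinct node indices
        edges.add((min(i, j), max(i, j)))   # store each pair once, i < j
    return edges

sparse = random_network(400, 1)   # low density, D = 1
dense = random_network(400, 2)    # high density, D = 2
print(len(sparse), len(dense))
```

With P = 400, the two density settings therefore correspond to 400 and 800 edges, respectively.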

The results were compared in two phases. During phase 1, regression-based approaches and methods used in bioinformatics were compared separately. During phase 2, final comparisons were performed using the methods that demonstrated satisfactory performance during phase 1. Results under non-high-dimensional simulation data (P = 50, 200; n = 100) are provided in Section 4 of the Supplementary Materials. The receiver operating characteristic (ROC) curves and computation times obtained under high-dimensional simulation data (P = 400; n = 100) are described in the remainder of Section 2.

In this study, we compared the performance of the regression-based approaches using LASSO, ADLASSO, Enet, and ADEnet (Tibshirani, 1996; Zou and Hastie, 2005; Meinshausen and Bühlmann, 2006; Zou and Zhang, 2009; Han et al., 2016). Moreover, the proposed network estimation methods were validated by performing experiments employing the tree-based GENIE3 and other methods based on mutual information, including the algorithm for reconstruction of accurate cellular networks (ARACNE), context likelihood of relatedness (CLR), and multicast reduction network (MRNET) using the maximum relevance/minimum redundancy algorithm (Ding and Peng, 2005; Margolin et al., 2006; Faith et al., 2007; Meyer et al., 2008; Irrthum et al., 2010).
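The regression-based approaches compared here share one idea: each gene is regressed on all other genes with a sparsity-inducing penalty, and nonzero coefficients are read as edges (Meinshausen and Bühlmann, 2006). The following is a minimal pure-Python sketch of that idea with an elastic-net penalty, not the implementation used in this study. The function names, the glmnet-style objective, the "OR" rule for combining the two regression directions, and the toy data are all assumptions made for illustration; the exact estimators are described in the Supplementary Materials.

```python
import random

def soft_threshold(z, t):
    """Soft-thresholding operator used by L1-penalized coordinate descent."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def elastic_net(X, y, lam, alpha, n_iter=200):
    """Coordinate descent for the (assumed) objective
    (1/2n)||y - Xb||^2 + lam * (alpha*||b||_1 + (1-alpha)/2 * ||b||_2^2).
    X is a list of n rows with p entries each; returns p coefficients.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)                       # residual y - Xb for b = 0
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # inner product of column j with the partial residual
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n
            rho += col_sq[j] * beta[j]
            new = soft_threshold(rho, lam * alpha)
            new /= col_sq[j] + lam * (1 - alpha)
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):        # keep the residual in sync
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta

def nodewise_network(data, lam, alpha):
    """Regress each variable on all others and join two nodes when the
    coefficient is nonzero in either direction (assumed "OR" rule)."""
    p = len(data[0])
    edges = set()
    for j in range(p):
        others = [k for k in range(p) if k != j]
        X = [[row[k] for k in others] for row in data]
        y = [row[j] for row in data]
        beta = elastic_net(X, y, lam, alpha)
        for k, b in zip(others, beta):
            if abs(b) > 1e-8:
                edges.add((min(j, k), max(j, k)))
    return edges

# Toy data: variable 1 depends on variable 0; variable 2 is pure noise.
rng = random.Random(1)
data = []
for _ in range(200):
    x0 = rng.gauss(0, 1)
    x1 = x0 + 0.3 * rng.gauss(0, 1)
    x2 = rng.gauss(0, 1)
    data.append([x0, x1, x2])
print(nodewise_network(data, lam=0.5, alpha=0.8))
```

Setting `alpha=1.0` in this sketch gives a LASSO-type fit. Adaptive variants would additionally rescale the L1 threshold of each coefficient by a data-driven weight, which is omitted here for brevity.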

2.1. Comparison of results for genetic simulation data

2.1.1. Results for each approach

When using ADEnet, α denotes the ratio of the L₁-penalty to the L₂-penalty. The closer α is to 1, the more similar ADEnet is to ADLASSO, and the closer α is to 0, the more similar it is to adaptive ridge regression. For the regression-based network method, ADLASSO with initial lambda λ_ini = 0.1 and 0.3 was analyzed with genetic simulation data. ADEnet with λ_ini = 0.1 showed better accuracy than that with λ_ini = 0.3; thus, only the ROC curve for ADEnet (λ_ini = 0.1) was plotted. ADEnet approaches were further compared for different values of α, namely α = 0.2 and 0.8, for the same λ_ini = 0.1. A comparison of the performances of LASSO, ADLASSO (λ_ini = 0.1), ADLASSO (λ_ini = 0.3), ADEnet (λ_ini = 0.1; α = 0.2), and ADEnet (λ_ini = 0.1; α = 0.8) among the regression-based approaches can be seen in Figure 1.
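To make the role of α concrete, one common way of writing an adaptively weighted elastic-net criterion is sketched below. This is a glmnet-style parametrization chosen for illustration and is an assumption, not necessarily the exact form used in this study (see the Supplementary Materials); the weights w_j, the exponent γ, and the initial estimate β^ini (assumed here to come from a first fit at penalty level λ_ini) are likewise illustrative.

```latex
\hat{\beta} = \arg\min_{\beta}\;
  \frac{1}{2n}\lVert y - X\beta \rVert_2^2
  + \lambda \Bigl[\, \alpha \sum_{j} w_j \lvert \beta_j \rvert
  + \frac{1-\alpha}{2} \sum_{j} \beta_j^2 \Bigr],
\qquad
w_j = \frac{1}{\lvert \hat{\beta}^{\mathrm{ini}}_j \rvert^{\gamma}}
```

In this form, α = 1 leaves only the weighted L₁ term (an adaptive-LASSO-type penalty), α close to 0 leaves essentially the squared L₂ term (a ridge-type penalty), and setting every w_j = 1 recovers the non-adaptive elastic-net.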

FIG. 1.

Comparison of regression-based approaches for genetic simulation data. Dark gray dashed lines indicate LASSO, black dotted lines indicate ADLASSO (λ_ini = 0.1), gray dotted lines indicate ADLASSO (λ_ini = 0.3), gray dotted dashed lines indicate Enet (α = 0.8), black solid lines indicate ADEnet (λ_ini = 0.1; α = 0.2), and gray solid lines indicate ADEnet (λ_ini = 0.1; α = 0.8). ADEnet, adaptive elastic-net; ADLASSO, adaptive LASSO; FPR, false positive rate; Enet, elastic-net; TPR, true positive rate.

With the number of samples n = 100, number of nodes P = 400, and density D = 1, regression-based methods were observed to attain a true positive rate (TPR) of 100% before the false positive rate (FPR) reached 0.1%. However, Enet was observed to be ineffective when density increased to D = 2. ADEnet converged with better performance than the other methods.
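The TPR and FPR values quoted above come from ranking candidate edges by an estimated score and comparing them with the true network. The following sketch shows one way to compute such ROC points for an undirected network. The function name `roc_points`, the dictionary-based score format, and the toy example are assumptions for illustration, and tied scores are not grouped.

```python
def roc_points(scores, true_edges, p):
    """Return (FPR, TPR) points obtained by adding edges in score order.

    scores: dict mapping (i, j) with i < j to an edge score (higher means
        more confident); pairs missing from the dict are scored 0.
    true_edges: non-empty set of (i, j) pairs with i < j.
    p: number of nodes; at least one non-edge pair must exist.
    """
    pairs = [(i, j) for i in range(p) for j in range(i + 1, p)]
    n_pos = len(true_edges)
    n_neg = len(pairs) - n_pos
    ranked = sorted(pairs, key=lambda e: scores.get(e, 0.0), reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for e in ranked:
        if e in true_edges:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

# Toy example with 4 nodes and 2 true edges.
truth = {(0, 1), (2, 3)}
est = {(0, 1): 0.9, (2, 3): 0.8, (0, 2): 0.3}
points = roc_points(est, truth, 4)
print(points[:4])
```

For scale, assuming an undirected network with P = 400 and D = 1, there are 400 × 399 / 2 = 79,800 candidate pairs, of which 400 are true edges, so an FPR of 0.1% corresponds to roughly 79 falsely selected edges.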

As shown in Figure 2, GENIE3 performed better than ARACNE, CLR, and MRNET, which use mutual information. When using GENIE3, the performance difference between the random forests (GENIE3_RF) and the extremely randomized trees (GENIE3_ET) was not significant (Breiman, 2001; Geurts et al., 2006; Irrthum et al., 2010). Among methods using mutual information, CLR demonstrated better performance than ARACNE and MRNET.

FIG. 2.

Comparison of methods used in bioinformatics for genetic simulation data. Dark gray dashed lines indicate ARACNE, black dotted lines indicate CLR, gray dotted dashed lines indicate MRNET, black long dashed lines indicate GENIE3_RF, and gray long dashed lines indicate GENIE3_ET. ARACNE, algorithm for reconstruction of accurate cellular networks; CLR, context likelihood of relatedness; MRNET, multicast reduction network.

2.1.2. Comparison of final results for genetic simulation data

Among the bioinformatics techniques, GENIE3 with random forests (GENIE3_RF) and CLR using mutual information were chosen. Among regression-based approaches, LASSO, ADLASSO (λ_ini = 0.1), and ADEnet (λ_ini = 0.1; α = 0.8) were selected, as shown in Figure 3.

FIG. 3.

Comparison of final results for genetic simulation data. Dark gray dashed lines indicate LASSO, black dotted lines indicate ADLASSO (λ_ini = 0.1), black solid lines indicate ADEnet (λ_ini = 0.1; α = 0.8), gray dotted lines indicate CLR, and gray long dashed lines indicate GENIE3_RF.

Although LASSO showed the best performance at low density, its performance degraded increasingly as the density rose. Bioinformatics methods showed relatively poor performance, whereas ADEnet was observed to be the best estimation method for genetic simulation data regardless of the density.

2.2. Comparison of results for Peng's simulation data

2.2.1. Results for each approach

Enet becomes similar to LASSO when α, which is the ratio of the L₁-penalty to L₂-penalty, is close to 1. In Peng's simulation data, Enet with α = 0.8 was nearly identical to LASSO, so it was omitted from the ROC curve. Performances of LASSO, ADLASSO (λ_ini = 0.1), ADLASSO (λ_ini = 0.3), Enet (α = 0.2), ADEnet (λ_ini = 0.1; α = 0.2), and ADEnet (λ_ini = 0.1; α = 0.8) are shown in Figure 4.

FIG. 4.

Comparison of regression-based approaches for Peng's simulation data. Dark gray dashed lines indicate LASSO, black dotted lines indicate ADLASSO (λ_ini = 0.1), gray dotted lines indicate ADLASSO (λ_ini = 0.3), gray dotted dashed lines indicate Enet (α = 0.2), black solid lines indicate ADEnet (λ_ini = 0.1; α = 0.2), and gray solid lines indicate ADEnet (λ_ini = 0.1; α = 0.8).

Enet was the best of the regression-based methods at both low density (D = 1) and high density (D = 2). ADLASSO performed the worst among the regression-based methods, although the difference in performance was very small.

As shown in Figure 5, methods using mutual information failed to perform appropriate network estimation using Peng's simulation data. GENIE3 was observed to be the only bioinformatics method that proved suitable, showing slightly better performance.

FIG. 5.

Comparison of methods used in bioinformatics for Peng's simulation data. Dark gray dashed lines indicate ARACNE, black dotted lines indicate CLR, gray dotted dashed lines indicate MRNET, black long dashed lines indicate GENIE3_RF, and gray long dashed lines indicate GENIE3_ET.

2.2.2. Comparison of final results for Peng's simulation data

As depicted in Figure 6, GENIE3 with extra trees and CLR using mutual information were chosen. Among regression-based approaches, Enet (α = 0.2), ADLASSO (λ_ini = 0.1), and ADEnet (λ_ini = 0.1; α = 0.2) were selected based on performance.

FIG. 6.

Comparison of final results for Peng's simulation data. Black dotted dashed lines indicate Enet (α = 0.2), black dotted lines indicate ADLASSO (λ_ini = 0.1), black solid lines indicate ADEnet (λ_ini = 0.1; α = 0.2), gray dotted lines indicate CLR, and gray long dashed lines indicate GENIE3_ET.

Network estimation from Peng's simulation data was found to be more difficult than from genetic simulation data. In particular, the methods using mutual information were rather inadequate. The regression-based approach was superior to GENIE3, the best bioinformatics method, and Enet performed best overall.

2.3. Comparison of results for other simulation data

2.3.1. Results for each approach

When the initial lambda value of ADEnet was changed to λ_ini = 0.3, it showed better performance than with λ_ini = 0.1. Since Enet (α = 0.8) demonstrated no performance difference compared with LASSO, only the result of Enet (α = 0.2) was plotted. For the regression-based approach, the performances of LASSO, ADLASSO (λ_ini = 0.1), ADLASSO (λ_ini = 0.3), Enet (α = 0.2), ADEnet (λ_ini = 0.3; α = 0.2), and ADEnet (λ_ini = 0.3; α = 0.8) are shown in Figure 7.

FIG. 7.

Comparison of regression-based approaches for other typical simulation data. Dark gray dashed lines indicate LASSO, black dotted lines indicate ADLASSO (λ_ini = 0.1), gray dotted lines indicate ADLASSO (λ_ini = 0.3), gray dotted dashed lines indicate Enet (α = 0.2), black solid lines indicate ADEnet (λ_ini = 0.3; α = 0.2), and gray solid lines indicate ADEnet (λ_ini = 0.3; α = 0.8).

With scale-free simulation data, Enet converged to a higher TPR than other regression-based methods, whereas ADLASSO and ADEnet showed poor performance. Although Enet obtained lower accuracy than LASSO over a narrow FPR range, it demonstrated superior performance when FPR was >0.2%. In addition, Enet reached a 100% TPR, which was not achieved by other regression-based methods. However, the performance of Enet dropped significantly when using band-type simulation data, whereas ADLASSO, ADEnet, and LASSO reached 100% TPR.

Although CLR performed the best among methods using mutual information, GENIE3 performed much better. With both band-type and scale-free simulation data, GENIE3 with extra trees was observed to be superior, as can be seen in Figure 8.

FIG. 8.

Comparison of methods used in bioinformatics for other typical simulation data. Dark gray dashed lines indicate ARACNE, gray dotted lines indicate CLR, gray dotted dashed lines indicate MRNET, black long dashed lines indicate GENIE3_RF, and gray long dashed lines indicate GENIE3_ET.

2.3.2. Comparison of final results for other typical simulation data

GENIE3 with extra trees, which is the best method used in bioinformatics, was selected. Based on their performance, LASSO, ADLASSO (λ_ini = 0.3), Enet (α = 0.2), and ADEnet (λ_ini = 0.3; α = 0.8) were selected from among regression-based methods. Final ROC curves for the scale-free and band-type simulation data were plotted, as depicted in Figure 9, wherein the lower graph shows how TPR improves over a wide FPR range, whereas the upper graph shows TPR over a narrow FPR range.

FIG. 9.

Comparison of final results for other simulation data types. Dark gray dashed lines indicate LASSO, black dotted lines indicate ADLASSO (λ_ini = 0.3), black dotted dashed lines indicate Enet (α = 0.2), black solid lines indicate ADEnet (λ_ini = 0.3; α = 0.8), gray dotted lines indicate CLR, and gray long dashed lines indicate GENIE3_ET.

With scale-free simulation data, GENIE3 and Enet performed better than other methods when FPR exceeded 0.5%. The regression-based approach employing Enet was observed to be superior to GENIE3 across the entire FPR range. Consequently, LASSO and Enet need to be properly combined for scale-free networks. With the band-type simulation data, regression-based methods, excluding Enet, outperformed GENIE3 and CLR. Enet, in contrast, had lower accuracy than GENIE3, unlike the case of the scale-free simulation data.

2.4. Comparison of computational time of the methods

Computational time is another vital performance indicator in network estimation. In this study, a workstation equipped with an Intel(R) Xeon(R) CPU E5-2660 v4 2.00 GHz and 78 GB memory was used. Complexity was defined as the number of estimated edges relative to the number of nodes during simulation, and network estimation computational times were recorded for the genetic random and hub simulation data with number of nodes P = 50, 200, 400. Results concerning estimation of random networks are summarized in Table 1, and those for estimation of hub networks are summarized in Table 2.
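The complexity measure and the timing procedure described above can be sketched as follows. This is only an illustration: how the authors actually measured time is not stated here, and `timed` and `dummy_estimator` are hypothetical helpers, with the latter standing in for a real network estimator.

```python
import time

def complexity(n_estimated_edges, p):
    """Complexity as defined in the text: estimated edges per node."""
    return n_estimated_edges / p

def timed(fn, *args, **kwargs):
    """Run fn once; return (result, elapsed wall-clock time in seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def dummy_estimator(p, edges_per_node):
    """Hypothetical stand-in for an estimator: returns a fixed edge list."""
    return [(i, (i + k + 1) % p) for i in range(p) for k in range(edges_per_node)]

edges, seconds = timed(dummy_estimator, 400, 2)
print(complexity(len(edges), 400), seconds >= 0.0)
```

In practice, an estimator's tuning parameter would be adjusted until the estimated network reaches the target complexity (0.5, 1, 2, or 4), and the elapsed time would be recorded at that setting.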

Table 1.

Computation Times for Estimating Random Networks

			Complexity
Network type	P	Method	0.5	1	2	4
Random network	50	LASSO	0.110	0.116	0.107	0.111
		ADLASSO	0.159	0.164	0.164	0.166
		Enet	0.098	0.095	0.098	0.100
		ADEnet	0.197	0.199	0.199	0.203
		GENIE3	54.488	54.410	54.270	54.642
		ARACNE	0.040	0.034	0.034	0.034
		CLR	0.039	0.045	0.047	0.050
		MRNET	0.041	0.039	0.043	0.041
	200	LASSO	0.465	0.488	0.514	0.580
		ADLASSO	1.137	1.156	1.179	1.218
		Enet	0.548	0.557	0.597	0.654
		ADEnet	1.424	1.440	1.468	1.499
		GENIE3	805.841	807.379	805.457	805.559
		ARACNE	0.761	0.762	0.760	0.766
		CLR	1.083	1.094	1.087	1.095
		MRNET	1.225	1.122	1.140	1.162
	400	LASSO	0.707	0.735	0.827	0.999
		ADLASSO	3.697	3.754	3.790	3.885
		Enet	1.516	1.553	1.558	1.714
		ADEnet	5.726	5.425	5.449	5.523
		GENIE3	4382.671	4382.361	4368.863	4359.436
		ARACNE	4.249	4.288	4.278	4.278
		CLR	5.916	5.789	5.993	5.739
		MRNET	6.207	6.155	6.157	6.216

Time unit and complexity; number of estimated edges divided by P; workstation: Intel(R) Xeon(R) CPU E5-2660 v4 2.00 GHz and 78 GB memory.

ADEnet, adaptive elastic-net; ADLASSO, adaptive LASSO; ARACNE, algorithm for the reconstruction of accurate cellular networks; CLR, context likelihood of relatedness; Enet, elastic-net; MRNET, minimum redundancy networks.

Table 2.

Computation Times for Estimating Hub Networks

			Complexity
Network type	P	Method	0.5	1	2	4
Hub network	50	LASSO	0.110	0.109	0.124	0.116
		ADLASSO	0.160	0.166	0.162	0.165
		Enet	0.095	0.096	0.101	0.098
		ADEnet	0.188	0.190	0.190	0.192
		GENIE3	54.583	54.436	54.570	54.327
		ARACNE	0.035	0.034	0.034	0.035
		CLR	0.054	0.047	0.060	0.053
		MRNET	0.043	0.047	0.046	0.044
	200	LASSO	0.442	0.457	0.473	0.512
		ADLASSO	1.152	1.165	1.197	1.206
		Enet	0.518	0.543	0.560	0.602
		ADEnet	1.476	1.474	1.487	1.506
		GENIE3	806.643	805.102	807.358	805.753
		ARACNE	0.794	0.799	0.808	0.797
		CLR	1.130	1.140	1.085	1.081
		MRNET	1.149	1.158	1.131	1.157
	400	LASSO	0.709	0.740	0.812	0.927
		ADLASSO	3.620	3.723	3.773	3.864
		Enet	1.576	1.663	1.729	1.998
		ADEnet	5.486	5.432	5.477	5.565
		GENIE3	4423.058	4417.574	4420.911	4258.579
		ARACNE	4.548	4.578	4.568	4.600
		CLR	5.928	5.985	5.932	5.851
		MRNET	6.050	6.161	6.167	6.245

Time unit and complexity; number of estimated edges divided by P; workstation: Intel(R) Xeon(R) CPU E5-2660 v4 2.00 GHz and 78 GB memory.

When the number of edges to be estimated was small, methods using mutual information were observed to be rather fast, but they became slower than the regression-based approaches as the number of edges increased. Methods using mutual information have complexities greater than O(n²) + O(n²); thus, an increase in the number of nodes greatly affects the growth of the computational time.
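
As a rough illustration of why the node count dominates the run time of pairwise methods, the sketch below counts the gene pairs for which a pairwise statistic such as mutual information would have to be evaluated at each network size used in Tables 1 and 2. It assumes that n in the expression above denotes the number of nodes; the helper name `n_pairs` is illustrative.

```python
def n_pairs(p):
    """Number of unordered gene pairs among p genes: p(p - 1)/2."""
    return p * (p - 1) // 2


for p in (50, 200, 400):
    print(p, n_pairs(p))
# prints:
# 50 1225
# 200 19900
# 400 79800
```

Doubling P from 200 to 400 roughly quadruples the number of pairs, which is the quadratic growth referred to above.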

Among the regression-based methods, the network estimation method using LASSO was the fastest, and ADEnet was the slowest. However, as the number of nodes increased, ADEnet became faster than CLR and MRNET.

GENIE3, the only bioinformatics method demonstrating good performance, required the highest computational time: roughly 500–1000 times more than the other methods, which represents a significant disadvantage. In the case of ultrahigh-dimensional genetic data, GENIE3 may not be able to estimate gene networks owing to its high computational demands. Overall, based on these results, regression-based methods are more suitable for gene network estimation when both accuracy and computational time are considered.

3. Application

The incidence of colorectal cancer has increased two- to fourfold in recent decades in Asian countries such as South Korea and China. Dietary habits and lifestyle changes are major reasons behind the increase; however, genetic characteristics of Asian populations could also be important. Asian countries, therefore, require more genetics research efforts to be directed toward the treatment and prevention of colon cancer (Sung et al., 2005). The Cancer Genome Atlas (TCGA) project provides type-sorted cancer-gene-expression data, including results obtained during many colorectal cancer studies. In this study, colon cancer-gene-expression data (TCGA-COAD) from 282 patients were obtained from the TCGA data portal (https://gdac.Broadinstitute.org). By cross-referencing these data with information provided by the Genomic Data Commons portal, the top 50 genes most frequently mutated in colorectal cancer were selected, including APC, TP53, TTN, and KRAS. In addition, several articles were reviewed to identify and add essential genes that cause colorectal cancer, and a data set comprising 73 genes was created. Among genes that were assumed to be colon cancer inducing, those that have been studied are listed in Section 6 of Supplementary Materials. Regression-based methods were used to estimate the networks to identify interactions between genes causing colon cancer.

3.1. Applying the regression-based methods to TCGA-COAD

In the visualization of the gene network, node size represents betweenness centrality, whereas node color indicates the number of edges per node (degree). The interquartile range (IQR) was defined as IQR = Q3 − Q1, where Q3 is the upper quartile and Q1 is the lower quartile, and values outside the range [Q1 − 1.5 × IQR, Q3 + 1.5 × IQR] were called outliers. When the degree of a node was determined to be an outlier, the node was marked in dark gray to highlight the importance of the gene. All gene networks estimated in this study were limited to an average of 1.9–2.1 edges per node.
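
The degree-outlier rule described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `degree_outliers`, the toy degrees, and the "inclusive" quartile convention are not taken from the study, and other quartile conventions would shift the fences slightly.

```python
from statistics import quantiles


def degree_outliers(degrees):
    """Return nodes whose degree lies outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    values = list(degrees.values())
    # Assumption: "inclusive" quartiles; the study does not state its convention.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {node for node, d in degrees.items() if d < lower or d > upper}


# Hypothetical degrees for eight genes; only "hub" is far from the rest.
degrees = {"a": 1, "b": 2, "c": 2, "d": 2, "e": 3, "f": 3, "g": 4, "hub": 12}
print(degree_outliers(degrees))
```

Here Q1 = 2 and Q3 = 3.25, so the fences are 0.125 and 5.125 and only the node with degree 12 is flagged.
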
Gene networks causing colon cancer were estimated using LASSO, ADLASSO (λ_ini = 0.1), Enet (α = 0.8), and ADEnet (λ_ini = 0.1; α = 0.8). Results obtained for each network are described in Section 5 of the Supplementary Materials. Lastly, a final colorectal cancer gene network was established by combining the estimated graphical models.

In the final merged model, edges estimated using several regression-based methods were marked by a thick black line. If an edge was estimated by only one of the four models, it was omitted from the network because its reliability was considered to be low. In other words, only edges estimated by at least two graphical models were visualized. The final gene network merging the four regression-based models is depicted in Figure 10.
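
The merging rule above amounts to a simple vote count over the edge sets of the individual models. The sketch below shows one way to express it; the function name `merge_networks`, the `min_votes` parameter, and the toy edge lists are illustrative assumptions and do not reproduce the networks estimated in this study.

```python
from collections import Counter


def merge_networks(edge_lists, min_votes=2):
    """Keep undirected edges that appear in at least `min_votes` edge lists."""
    votes = Counter()
    for edges in edge_lists:
        # Deduplicate within each model so one model casts one vote per edge.
        votes.update({frozenset(e) for e in edges})
    return {edge: n for edge, n in votes.items() if n >= min_votes}


# Hypothetical edge lists from four models over generic genes g1..g4.
lasso = [("g1", "g2"), ("g3", "g2")]
adlasso = [("g2", "g1")]
enet = [("g1", "g2"), ("g3", "g4")]
adenet = [("g2", "g3")]

merged = merge_networks([lasso, adlasso, enet, adenet])
for edge, n in sorted(merged.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(edge), n)
```

The edge g3–g4 is proposed by a single model and is therefore dropped, whereas g1–g2 (three votes) and g2–g3 (two votes) are retained; the vote count could also be used to set line thickness in a plot.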

FIG. 10.

Final colorectal cancer gene network using regression-based approaches. The final gene network is constructed with 60 nodes and 120 edges, that is, two edges per node on average. Node size represents betweenness centrality, and nodes with high degree are marked in dark gray. Edges estimated by several methods are marked by thick black lines.

3.2. Final colorectal cancer gene network

The network shown in Figure 10 was obtained by merging the four regression-based graphical models. The five genes with the highest betweenness centrality were SYNE1, MACF1, CHEK2, PIK3CA, and ATM; the genes with high degree included ANK2, SYNE1, COL6A3, HMCN1, ZFHX4, and MACF1.
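
For readers unfamiliar with the two node statistics used here, the sketch below computes degree and unnormalized betweenness centrality for a small undirected, unweighted graph using Brandes' algorithm. It is a generic textbook implementation on a hypothetical toy graph, not the software used to produce Figure 10; the function name `degree_and_betweenness` is an assumption.

```python
from collections import deque


def degree_and_betweenness(adj):
    """adj maps each node to the set of its neighbors (undirected, unweighted)."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        dist = {s: 0}
        sigma = dict.fromkeys(adj, 0)
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate pair dependencies, farthest nodes first.
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted once from each endpoint.
    return degree, {v: b / 2.0 for v, b in bc.items()}


# Hypothetical graph: "hub" links three leaves, and leaf "c" links on to "tail".
adj = {
    "hub": {"a", "b", "c"},
    "a": {"hub"},
    "b": {"hub"},
    "c": {"hub", "tail"},
    "tail": {"c"},
}
degree, betweenness = degree_and_betweenness(adj)
print(degree)
print(betweenness)
```

In this toy graph "hub" has both the highest degree (3) and the highest betweenness (5 shortest paths pass through it), while "c" has a modest degree (2) but a betweenness of 3 because it is the only route to "tail". The two statistics therefore need not rank nodes identically.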

SYNE1 and MACF1, which had remarkably high betweenness and degree, have received much attention in colorectal cancer research. SYNE1 has been proposed as a promising biomarker for colorectal cancer detection (Melotte et al., 2014), and abnormal expression of MACF1 initiates tumor cell proliferation and metastasis in colorectal cancer (Miao et al., 2017).

Through this application, we could infer why GENIE3 was more suitable for scale-free data than for other data types. GENIE3 is a tree-based method, and, as shown in Section 5 of the Supplementary Materials, the structure of the network it estimates is such that branches extend from a particular node, similar to a scale-free network.

The final network model with the addition of GENIE3 is shown in Figure 11. Genetic networks based on GENIE3 could be used to find genes related to a specific hub node. In contrast, owing to the limitations concerning estimation of the overall gene structure, appropriate care must be taken to merge GENIE3 with other regression-based methods.

FIG. 11.

Final colorectal cancer gene network using regression-based approaches and GENIE3. The final gene network is constructed with 60 nodes and 126 edges, that is, 2.1 edges per node on average. Node size represents betweenness centrality, and nodes with high degree are marked in dark gray. Edges estimated by several methods are marked by thick black lines.

4. Conclusion

This study describes the development of network estimation methods suitable for handling high-dimensional genetic data. We applied Enet and ADEnet to regression-based network estimation to resolve problems associated with handling high-dimensional genetic data with hub nodes. Comparisons with the existing regression-based approaches (LASSO and ADLASSO) and with other bioinformatics methods were also performed.

Simulation data with different properties were used to ensure the diversity and reliability of the experiments. The results demonstrated the strength of regression-based methods with respect to performance and computation time. Moreover, the proposed method using ADEnet was robust to the properties of the gene data and showed good performance.

Finally, we confirmed the feasibility of the network estimation methods employing Enet and ADEnet by applying them to actual colorectal cancer gene data. Because the best performing method depended on the type of data, merging several graphical models rather than using a single model enabled creation of a more reliable genetic network structure. Important genes estimated by the graphical models are now being studied as potential biomarkers for cancer. The proposed study can, therefore, be expected to make meaningful contributions to research concerning genes related to cancer.

In the future, we aim to develop a network estimation method that provides optimal performance for all types of genetic data. ADEnet and Enet have demonstrated satisfactory performance when handling several simulation data types, so further research concerning these approaches must continue. We expect this work to play a vital part in the development of gene estimation methods.

Footnotes

Acknowledgments

This research was supported by grants from the National Research Foundation of Korea (NRF-2017R1E1A1A03070507 and NRF-2017R1C1B2002850) and Korea University (K1719881 and K1822881). This article contains a portion of the MS thesis compiled by Kyu Min Lee, which followed the policy and guidelines of Korea University. Copyright is held by the Journal of Computational Biology.

Author Disclosure Statement

The authors declare that no competing financial interests exist.

References

Alansari, Soomro, Belgaum M.R., et al. 2018. The rise of Internet of Things (IoT) in Big Healthcare Data: Review and open research issues, 675–685. In Saeed, Chaki, Pati, et al., eds. Progress in Advanced Computing and Intelligent Engineering. Springer, Singapore.

Albert, and Barabási A.L. 2002. Statistical mechanics of complex networks. Rev. Mod. Phys. 74, 47.

Breiman 2001. Random forests. Mach. Learn. 45, 5–32.

Clarke, Ressom H.W., Wang, et al. 2008. The properties of high-dimensional data spaces: Implications for exploring gene and protein expression data. Nat. Rev. Cancer. 8, 37.

Ding, and Peng 2005. Minimum redundancy feature selection from microarray gene expression data. J. Bioinf. Comput. Biol. 3, 185–205.

Faith J.J., Hayete, Thaden J.T., et al. 2007. Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol. 5, e8.

Geurts, Ernst, and Wehenkel 2006. Extremely randomized trees. Mach. Learn. 63, 3–42.

Han S.W., Chen, Cheon M.S., et al. 2016. Estimation of directed acyclic graphs through two-stage adaptive LASSO for gene network inference. J. Am. Stat. Assoc. 111, 1004–1019.

Irrthum, Wehenkel, and Geurts 2010. Inferring regulatory networks from expression data using tree-based methods. PLoS One. 5, e12776.

Lee, Seok, Tae, et al. 2017. A comparison of two-stage approaches based on penalized regression for estimating gene networks. J. Comput. Biol. 24, 709–720.

Margolin A.A., Nemenman, Basso, et al. 2006. ARACNE: An algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics. 7, S7.

Meinshausen, and Bühlmann 2006. High-dimensional graphs and variable selection with the lasso. Ann. Stat. 34, 1436–1462.

Melotte, Yi J.M., Lentjes, et al. 2014. Spectrin repeat containing nuclear envelope 1 and Forkhead box protein E1 are promising markers for the detection of colorectal cancer in blood. Cancer Prev. Res. 8, 157–164.

Meyer P.E., Lafitte, and Bontempi 2008. minet: A R/Bioconductor package for inferring large transcriptional networks using mutual information. BMC Bioinformatics. 9, 461.

Miao, Ali, Hu, et al. 2017. Microtubule actin cross-linking factor 1, a novel potential target in cancer. Cancer Sci. J. 108, 1953–1958.

Peng, Wang, Zhou, et al. 2009. Partial correlation estimation by joint sparse regression models. J. Am. Stat. Assoc. 104, 735–746.

Sung J.J., Lau J.Y., Goh K.L., et al. 2005. Increasing incidence of colorectal cancer in Asia: Implications for screening. Lancet Oncol. 6, 871–876.

Tibshirani 1996. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Series B Stat. Methodol. 58, 267–288.

Weston A.D., and Hood 2004. Systems biology, proteomics, and the future of health care: Toward predictive, preventative, and personalized medicine. J. Proteome Res. 3, 179–196.

Zhou 2011. Structure learning of probabilistic graphical models: A comprehensive survey. arXiv preprint arXiv:1111.6925.

Zou 2006. The adaptive LASSO and its oracle properties. J. Am. Stat. Assoc. 101, 1418–1429.

Zou, and Hastie 2005. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Series B Stat. Methodol. 67, 301–320.

Zou, and Zhang H.H. 2009. On the adaptive elastic-net with a diverging number of parameters. Ann. Stat. 37, 1733–1751.
