Abstract
Abstract
Metabolic disorders such as obesity and diabetes are nowadays regarded as diseases affecting majority of population. These diseases develop gradually over time in an individual. Recently, systematic experiments tracking disease progression are conducted giving a high-throughput complex data. There is a pressing need for developing methods to analyze this complex data to capture the disease mechanism at molecular level. Diseases usually develop through perturbations of biological processes in an organism. In this study, we have tried to capture the interlinking between different biological processes that work together to regulate the disease phenotype. Here, we have considered a temporal microarray data from an experiment conducted to study obesity and diabetes in mice. We have analyzed the data to obtain perturbed biological processes and developed methods to establish link between these perturbed biological processes. We have derived a mathematical formula to score genes and identified a significant set of genes regulating such a complex process network. The methods developed in our study are also applicable to a broad array of data types.
1. Introduction
O
Systematic methods have been proposed that analyze such high-throughput data to understand how different biological processes get perturbed in a disease condition. Some of these methods are clustering and biclustering. In these methods, we first find clusters of coexpressing genes and then the biological processes significantly enriched in each cluster (Eisen et al., 1998; Tamayo et al., 1999; Chen et al., 2013; Eren et al., 2013; Yosef et al., 2013). There are also other ways that include Gene Set Enrichment Analysis (Subramanian et al., 2005), which finds processes/gene lists that significantly correlate with a phenotype of interest. Gene Network Enrichment Analysis (Liu et al., 2007) finds high transcriptionally affected subnetworks in a protein–protein interaction network and looks for its significant overlap with a biological process. All these clustering and biclustering methods result in a list of significant biological processes.
In recent studies, people have started tracking disease progression over time and looking for the biological processes driving those changes. For example, Sun et al. (2013) measured gene expression from normal and diseased rats (animal model of diabetes) aged from 4 to 20 weeks, with a time interval of 4 weeks in different tissues. They identified differential expression networks and found pathways enriched in these networks both statically and temporally. Another work (He et al., 2011) looked at molecular mechanisms underlying the hepatocarcinogenesis (HCC) induced by hepatitis C virus (HCV) infection. They took the six pathologically defined disease stages in the development and progression of HCC after HCV infection. They found significant functions represented by the progression-related subnetworks that are perturbed starting from different times.
How a large number of biological processes work together temporally in a disease condition is still not known. Furthermore, it would be interesting to study the underlying genes that are regulating these processes. In this article, we aimed to capture the interdependency between different biological processes and the interconnecting genes through systems biology approaches and the network-related concepts that are used in understanding complex diseases (Barabasi et al., 2011). Furthermore, in the data analysis, we tried to incorporate the temporal dimension in identification of significant genes.
Here, we used a temporal microarray data obtained from liver of a mice fed on high-fat high-sucrose diet over an 18-week period to study obesity progression (Tikoo et al., 2013) (data present in GEO series GSE63175). Mice fed on chow diet were used as control. The high-fat diet-fed mice become obese during the mentioned period as reflected in various physiological parameters measured during the experiment.
We first focused our attention on reducing the complexity of the data by categorizing processes into a broad category of processes and studied their temporal perturbations. Then, we used the common gene concept used to make biological networks (Goh et al., 2007). Using the constructed network along with the temporal information, we identified significant genes.
2. Results
2.1. Discretizing the data and obtaining highly perturbed biological processes
The details of the experiment conducted and processing of data are given in Processing of Microarray Data section in Supplementary Material. A heat map of the microarray data given in Figure 1A shows the patterns that genes are following. We first applied the Clustered Groups (CG) algorithm (Anand and Chatterjee, 2016) on the data to get the discretized version, giving the temporal perturbations of genes. Next, we calculated the percentage perturbation (fraction of genes perturbed in at least one time point in a given process) for different biological processes. List of different biological processes along with genes present was taken from the web (see Gene Ontology Biological Process List section in Supplementary Material). We have plotted this fraction perturbation for each biological process in Figure 2B (Supplementary Fig. S3).

Perturbed biological processes obtained using discretizing the microarray data.

Broad categories of biological processes and their time profiles.
2.2. Grouping of list of biological processes reveals biological categories
Any database of biological processes, which are functionally close to each other, can be put together under a broad process category. The existence of functionally close biological processes is mainly because of the hierarchal-like nature of gene ontology biological process tree (Ashburner et al., 2000). A biological process database contains both the general process (parent) name and specific subprocess (child) name along with their respective genes. The genes present in a child are also present in its parent. Using different enrichment analysis tools on such a list to find significant processes would be bound to give redundant results.
To remove such redundancy and give a systematic way to separate a given list of biological processes into different broad categories, we devised a method that is based on the fact of parent–child relationship. For this, we organized the total biological processes in a tree-like form where each biological process is represented as a node and they are linked through a directed edge going from parent to child. This is a general tree and can be used in any study to categorize a list of biological processes. For example, we chose top 173 processes (with a cutoff of 0.875) for this analysis (see red dots in Fig. 1B). We built the tree by adding the parents of each of these 173 processes until we reached the general most process that has no parents as shown in Figure 2A. We found that the perturbed biological processes obtained clearly separate into different disjoint categories as shown in Figure 2A. We then selected categories that contain 3 or more perturbed processes resulting in 10 categories. We named these categories based on the general biological processes present in them, see Figure 2A. The remaining small categories are grouped under the “others” category. The perturbed biological processes present in each category are provided in Supplementary Table S1. This method helped us to reduce the redundancy that occurred during the analysis. For example, six processes linked through parent–child relationships are present in T Cell Selection category, in which one process “T Cell Selection” does not have any parent. This signifies that those five child processes are redundant and the topmost parent process can be used in place of the six processes without loss of any information.
One can use this concise information to dissect a complex data to reveal insights. For example, in our case, we wanted to dissect the temporal information hidden in the data using the biological categories. For this, we superimposed the temporal information on the biological categories to find the temporal perturbation of each category. Our data contain temporal information from 10 time points. To simplify the temporal complexity of our data, we grouped the 10 time points into different phases or windows depending on the phenotypic changes of the studied mice. According to the phenotypic data published in Midha et al. (2012), we can distinctly group the 10 time points into 3 phases based on phenotypic changes and named them as early, middle, and late phases. We calculated the fractions of total perturbed genes for each process for each phase. Next we defined four equally spaced bins on the calculated fractions to categorize each process into four degrees of perturbations. The very small perturbation (vs) falls in the bin with fraction <0.25, small perturbation(s) falls between 0.25 and 0.5, whereas the highest two perturbations (large, l, and very largest, vl) fall between 0.5 and 0.75, and >0.75, respectively. Now for each time window, we calculated the number of processes of each category that falls in different bins and plotted them as shown in Figure 2B.
Here, we found that “Regulation of Secretion and Protein/Amino Acid Transport” and “Nitrogen Compound/Anion Transport” categories of related processes follow a similar pattern: perturbed to small and large extent in early and middle phases and then perturbed to large and very large extent in late phase, see Figure 2B. The same pattern is being followed by “Muscle System” category of processes. Reasoning that processes perturbed to a very large extent, that is, >0.75 might be playing a dominant role in respective time windows, we found that “T Cell Selection”-related processes were perturbed in all three phases, but “Regulation of Inflammatory Response” and “Regulation of T Cell Activation”-related processes were mostly perturbed in middle and late phases, whereas “Metabolic Process”-related processes were perturbed largely throughout.
2.3. Connectivity between biological categories highlights important genes
The next question we want to address is whether there exists any link between processes within these categories. It might help us to understand how multiple biological processes are working together toward the development of a disease phenotype. One possibility of connecting the processes within these categories is through common genes. There are studies that have shown the importance of considering common nodes (Goh et al., 2007) and that there exists a correlation between observed simultaneous occurrences of two clusters with the fact that these are connected through common nodes (Park et al., 2009). To capture this, we constructed a network for biological processes and their linking genes (those that are common between two or more processes), see Figure 3A (Supplementary Fig. S4).

Connectivity between processes highlights important genes.
As shown in the figure that the processes within same category are more close to each other than processes between two different categories, it may be concluded that the processes within a category are well connected. We also observed that there are some genes connecting processes from different categories and some connecting processes within the same category. Next we constructed a simplified network in which we combined biological processes from the same category as one node and linked them through connecting genes, see Figure 3B. We observed different degrees of connections in different categories. There are some highly connected categories such as “Regulation of Secretion and Protein/Amino Acid Transport,” “Regulation of Lipid Metabolic Process,” and some less connected categories such as “Regulation of Inflammatory Response” and “Regulation of T Cell Activation.”
Next, we looked for the temporal perturbation of these connecting genes. For this, we used the discretized temporal data and calculated how many times these connecting genes are perturbed at different phases, that is, early phase, in middle phase, and in late phase. This was repeated for all pairs of biological process categories and the result is depicted as heat map in Figure 3C. From the heat map we can observe that the connections between some categories change with time, whereas others do not change. For example, the category “Regulation of Secretion and Protein/Amino Acid Transport” shows high connectivity with other categories for all time points. Whereas connection between the categories “Nitrogen Compound Anion Transport” and “Amide/Hormone Metabolic Process” occurs only at the middle phase (Fig. 3C) through the connecting gene Huntingtin (HTT) (Fig. 3B).
To understand the significance of the connecting genes in our data, we obtained the correlation between the perturbations of the processes and its connecting genes (for details, see Agreement of Hypothesis with Data in Supplementary Material). We found that 84 processes out of 136 processes (∼61%) are correlating with their respective connecting genes at 6 or more time points (Supplementary Fig. S1). Thus, we may conclude that the genes connecting two processes might play a significant role in explaining the perturbation of these processes.
To test the significance of these connecting genes, we compared them with the single-nucleotide polymorphism (SNP) genes from the database of catalog of Genome Wide Association Studies (for details, see Significance of Connecting Genes Using SNP Data section in Supplementary Material). We observed that out of 25 SNP genes, 11 SNP genes are present in our list of perturbed processes, and of them 6 genes are connecting genes representing a significant 55% enrichment with p = 0.0839 (using a hypergeometric test).
So, after establishing the significance of the connecting genes, we wanted to identify which of them are more influential in terms of regulating this biological process network. For this we provided a quantitative score to all genes to reflect their influence over the entire network and then ranked them according to their scores. The proposed mathematical formula for calculating the score is based on the following assumptions:
1. High-degree nodes are more significant and influential than the low-degree nodes (Barabasi and Oltvai, 2004). Thus here we can assume that the genes connecting more processes are more influential than those connecting fewer processes. So, degree of a gene is directly proportionate to its significance. 2. We also assumed that more the number of genes present between two processes, less will be the importance of an individual gene. This is because the information flow between these processes is shared between the connecting genes, and shutting down of one or two genes will not have a significant regulation on its flow. A network measure to capture this kind of phenomena is present in network theory (Zhu et al., 2007). So, presence of more sharing genes has an inverse effect on the significance of a gene. 3. Finally, it is reasonable to assume that more the number of times a gene is perturbed, more influential it is in the disease progression. So, number of times a gene is perturbed is directly proportionate to its significance.
With these assumptions, we derived a mathematical formula for scoring a gene, say g1 and is given by
where D is the biological process degree, that is, the number of biological processes connected by gene g1,
TP is number of time points, where gene g1 is perturbed, and
R is number of sharing genes, that is, the number of alternative genes present along with gene g1 between the biological processes it connects.
As the maximum values of three parameters would be different, we normalized each parameter by their respective maximum values so that they lie between 0 and 1 and then calculated the score. Using this formula, we gave score to all the 523 connecting genes present in the network of 173 processes, see Figure 3D. From the plot we can identify the most influential genes in the network by considering the top ranking genes. The list is presented in Supplementary Table S1. To establish the justification behind the selection of 173 processes, we compared the top ranked gene lists from the subnetwork of these 173 processes (small list) with the list from the whole network of 5192 processes (large list). We found that all the genes from top 10% of the small list are also in the top 10% genes of the large list. This overlap is statistically significant with a p value of <10−5. This showed that the significant genes identified from 173 processes network play an important role in influencing the entire network of 5192 processes. Hence, studying these 173 processes is enough to capture the significant information present in the network of 5192 processes.
As these genes have been obtained from liver microarray data from an experiment tracking obesity and diabetes over time, we tested the overlap of our list with a positive data set of obesity and diabetes disease genes. For this, we used the OMIM database (Hamosh et al., 2005). We calculated the precision and recall values for each cutoff (see Validation of Ranking section in Supplementary Material) and found that precision and recall were very much less (<10%) for all cutoffs when overlap between genes of our list and OMIM genes is considered. So, we find that the top genes from our list do not match much with 176 disease-specific genes from the OMIM database. So, we investigated the possibility whether the genes present in both lists are involved in same processes. For this, we calculated the precision and recall values when the processes in which genes participate in are taken into account. We plotted these precision and recall values for different cutoffs shown in Supplementary Figure S2. We saw that we reached a precision of 95% (70% average when a random list is used) when top 100 genes were used with corresponding recall value as 70% (63% average when random list is used) (i.e., capturing 123 of 176 positive genes). These top 100 genes thus correspond to ideal cutoff threshold that can be used. Thus, our ranked genes are able to capture the disease process.
3. Discussion
The idea of using common genes to connect different perturbed biological processes has already been used in a different context (Goh et al., 2007). In this study, we wanted to dissect a complex temporal microarray data obtained from mice liver tissue fed on high-fat high-sucrose diet. Here, we have used a mathematical formula (proposed and derived from network graph theory) to score genes and then ranked them to identify the most significant genes. The first step of our analysis was discretization of input data that was done using CG algorithm (Anand and Chatterjee, 2016). Here, we want to mention that one can also use any other discretization criteria such as a two-fold cutoff on gene expression values. Then, we calculated the fraction of perturbed genes for each process and then selected few top processes with high fraction of perturbed genes. We used a method to categorize such a list of perturbed processes into different categories. We ultimately used our derived mathematical formula to rank each of the connecting genes between processes in the constructed biological process network. We used a different database to validate our ranked list. The top 15 ranked genes are mentioned in Table 1 along with their respective parameters. These top ranked genes were obtained from our microarray data obtained from mice liver tissue fed on high-fat high-sucrose diet. So they should play a significant role in the development of obesity and diabetes.
Available literature also validate that some of these 15 genes are influential in diabetes and obesity development. For example, sonic hedgehog (SHH) is the top ranking gene in our list. The connecting biological processes along with its temporal fold change are shown in Figure 3D bottom inset. SHH is a signal molecule essential for a variety of patterning events during development. A recent study (Swierczynska et al., 2015) has shown that diet-induced obesity in mice elevates SHH signaling in Gli1-positive progenitors. This independently supports the involvement of our top ranked gene in obesity and diabetes. Another example is the gene arginine vasopressin 1a receptor (AVPR1A), which is ranked 13th in our list. The biological processes with which the gene is connecting to along with its temporal fold change are shown in Figure 3D bottom inset. It has been shown that male AVPR1A knockout mice (V1aR−/−) display a phenotype of low triglycerides and high glucose concentrations and high-fat diet-induced obesity and diabetes. Subjects with mutations in this gene had lower concentrations of triglycerides and a tendency toward an increased prevalence of diabetes (Enhorning et al., 2009), thus also supporting involvement of our top ranked gene in obesity and diabetes.
Our analysis provides a framework to dissect complex data to rank genes based on their importance. Our ranking captures the connectivity between perturbed processes in the data, and because the condition in our data set is time, we have also included the time dimension in our ranking. Thus, these ranked genes could play an important role in disease progression (obesity and diabetes) over time. Validation of our ranking based on positive data set of disease genes from the OMIM database confirms the accuracy by which our ranking captures the processes in which disease genes are involved. The whole methodology can also be used to analyze data from microarray, RNAseq, or proteomic experiments containing expression values in different conditions such as time as done here or in tissues.
Footnotes
Acknowledgments
The authors would like to acknowledge the contribution of Dr. K.V.S. Rao in intellectual discussions during early part of the study. This work is supported by CSIR (Government of India), grant number 25 (0258)/16/EMR-II.
Authors' Contributions
R.A. and S.C. developed the algorithm. R.A. tested the algorithm on the data and built the network. R.A. and S.C. wrote the article.
Author Disclosure Statement
No competing financial interests exist.
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
