Identification by Gene Coregulation Mapping of Novel Genes Involved in Embryonic Stem Cell Differentiation

Abstract

A combined analysis of data from a series of literature studies can lead to more reliable results than an analysis based on a single study. A common problem in performing combined analyses of literature microarray gene expression data is that the original raw data are not always available and not always easy to combine in one analysis. We propose an approach that does not require analyzing original raw data, but instead takes literature gene sets derived from (supplementary) tables as input and uses gene co-occurrence in these sets for mapping a co-regulation network. An algorithm for this method was applied to a collection of literature-derived gene sets related to embryonic stem cell (ESC) differentiation. In the resulting network, genes involved in similar biological processes or expressed at similar time points during differentiation were found to cluster together. Using this information, we identified 43 genes not previously associated with cardiac ESC differentiation for which we were able to assign a putative novel biological function. For 6 of these genes (Apobec2, Cth, Ptges, Rrad, Zfp57, and 2410146L05Rik), literature data on mouse knockout phenotypes support their putative function. Three other genes (Rcor2, Zfp503, and Hspb3) are part of major pathways within the network and therefore likely mechanistically relevant candidate genes. We anticipate that these 43 genes can help to improve the understanding of the molecular events underlying ESC differentiation. Moreover, the approach introduced here can be more widely applied to identify possible novel gene functions in biological processes.

Introduction

Combined analysis of microarray gene expression data from multiple studies offers the possibility to derive more reliable results than can be obtained from a single study. Although such combined analyses of microarray data are still comparatively rare, they have been successfully applied in well-studied research areas like cancer [1–3], host response to pathogens [4,5], and aging [6]. However, for many microarray studies published in the literature, their inclusion in a combined analysis is hampered by the fact that their full (raw or normalized) gene expression data are not publicly accessible. Even if the data are accessible, differences in study designs or experimental procedures between studies can provide a practical barrier for inclusion. Further, processing large amounts of data can be time consuming and troublesome, strengthening the need for a convenient alternative.

We propose to use published microarray data in a different way, omitting normalization and statistical issues between different studies. We do this by looking for jointly regulated genes based on co-occurrence in individual experimental gene sets. A gene set is here defined as a list of genes that are described as meeting the same regulation criteria in a given literature study, such as being significantly upregulated, downregulated, or being clustered together based on having similar expression patterns. In fact, most microarray study publications provide (supplementary) tables with gene lists that meet criteria suitable for the underlying study. In many cases, a gene set will correspond directly to such a table, sometimes to table subsections that have different headers. If 2 genes occur together in a gene set in one publication, there is a reasonable chance that they are relevant to the process and also that they are to a certain extent coregulated. If this is reported in multiple studies, both become increasingly likely. Therefore, by analyzing a sufficiently large number of gene sets, it becomes possible to determine with high confidence which genes are commonly regulated and additionally share coregulation. The latter information can be used to construct a coregulation network, which can provide additional insights into the regulatory processes involved [7,8]. More specifically, if a gene with a hitherto unknown function is coregulated with several genes that all are known to be involved in the same biological process, this gene can then be assumed to play a role in that process [9–11]. Further, the number of connections (edges) that a gene has in a network can serve as an indicator for its importance in the network process [12].

In this study, we will show the applicability of our method using 3 different gene set collections. One of these collections, containing gene sets derived from studies on embryonic stem cell (ESC) differentiation, will be analyzed in more detail with a view to cardiomyocyte differentiation and its application in toxicity testing.

During embryonic development, the heart is one of the first tissues to be formed and to acquire functionality. The cellular and molecular events involved in early cardiac development are highly conserved between different species (see review in [13,14]). Cardiac progenitor cells originate in the mesodermal germ layer and ultimately differentiate to cardiomyocytes via a cascade of development steps. Although molecular analysis of cardiac development has already identified several genes involved in (part of) this complex differentiation process, much is still unknown. ESC models are often used as a practical model system to study in vitro cardiomyocyte differentiation (reviewed in [15]). Several gene expression studies using this model system have been published, which allows for data from several related studies to be analyzed in combination.

We applied our method to construct and analyze a coregulation network of ESC differentiation, and used this network to derive groups of coregulated and functionally related genes. Using this approach, we identified several novel genes involved in the cardiac cell differentiation process.

Methods

Libraries

The first gene set collection used is the c2.cgp collection, which was downloaded from the MSigDB Web site (www.broadinstitute.org/gsea/msigdb/). This collection contains 1,186 expert curated gene sets and a total number of 15,891 unique gene identifiers (UGIs) derived from microarray experiments using chemical and genetic perturbations [16] and is, to our knowledge, the largest gene set collection publicly available. The second gene set collection is the lung-inflammation collection, derived from 12 microarray studies used in a meta-analysis on acute lung inflammation models from 45 exposures to air pollutants; bacterial, viral, and parasitic infections; and allergic asthma models [5]. This collection contains 90 gene sets and 2,680 UGIs.

We previously described the third gene set collection, which is the ESC differentiation collection [17,18] (and references cited therein). This collection contains gene sets derived from 42 microarray studies describing gene expression changes during the differentiation of ESC to several types of tissue, with a certain emphasis on cardiomyocytes. During the collecting process, gene sets were included if the underlying study was considered relevant based on mutual agreement by a developmental toxicologist and bioinformatician. To avoid redundancy, gene sets obtained using the Anni textmining tool were excluded for the present study as textmining implicitly uses literature data. The resulting gene set collection used contained 167 gene sets and a total number of 15,196 UGIs.

Coregulation mapping

Gene coregulation mapping was determined using an algorithm written in the statistical software environment R (www.r-project.org). Starting from a gene set collection in tab-delimited file format, UGIs were converted to uppercase (for computational convenience) and, subsequently, nonmeaningful UGIs like “0” or “–” were removed. For the resulting gene symbols, the number of occurrences in the collection was calculated. For each combination of 2 (nonidentical) symbols, coregulation was determined using 2 criteria. First, the absolute number of sets in which both genes co-occurred had to be at least a certain (user-definable) threshold, to select for often coregulated genes and ensure biological relevance. Second, relative coregulation strength, calculated as the cosine correlation of gene set co-occurrences, had to be at least 50%. This is calculated as N_a,b/√(N_a*N_b), where N_a,b is the number of sets in which geneA and geneB occur together, and N_a and N_b are the number of occurrences of, respectively, geneA and geneB in all sets. If both these criteria were met, the 2 gene symbols were connected in a network description file for observation with Cytoscape [19]. The criteria for choosing the user-definable threshold were tested and optimized using the c2.cgp and lung-inflammation collections. For further analyses on these and the ESC differentiation collection, a default stringency was defined so that the threshold for each data set led to ∼5% of the total number of UGIs becoming mapped in the network description file.
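The two coregulation criteria described above can be illustrated with a short sketch. This is not the R algorithm supplied with the study; it is a minimal Python illustration under stated assumptions: the function name `coregulation_edges`, its parameter names, the default threshold of 2, and the toy gene sets are all hypothetical, and duplicate symbols within a set are assumed to count once.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def coregulation_edges(gene_sets, min_cooccurrence=2, min_cosine=0.5,
                       junk=("", "0", "-", "–")):
    """Return {(gene_a, gene_b): (n_ab, cosine)} for pairs meeting both criteria.

    Hypothetical helper: names and defaults are illustrative assumptions.
    """
    # Uppercase the identifiers and drop nonmeaningful ones; using a set
    # means a symbol repeated within one gene set is counted once.
    cleaned = [{g.strip().upper() for g in genes} - set(junk)
               for genes in gene_sets]

    occurrences = defaultdict(int)     # N_a: number of sets containing gene a
    co_occurrences = defaultdict(int)  # N_a,b: number of sets containing both
    for genes in cleaned:
        for gene in genes:
            occurrences[gene] += 1
        for a, b in combinations(sorted(genes), 2):
            co_occurrences[(a, b)] += 1

    edges = {}
    for (a, b), n_ab in co_occurrences.items():
        cosine = n_ab / sqrt(occurrences[a] * occurrences[b])
        # Criterion 1: absolute co-occurrence; criterion 2: cosine >= 50%.
        if n_ab >= min_cooccurrence and cosine >= min_cosine:
            edges[(a, b)] = (n_ab, cosine)
    return edges


# Toy input (invented for illustration, not taken from any collection).
toy_sets = [
    ["Myh6", "Tnnt2", "Nppa"],
    ["MYH6", "TNNT2", "-"],
    ["Nanog", "Pou5f1"],
    ["Nanog", "Pou5f1", "Myh6"],
]
edges = coregulation_edges(toy_sets)
for (a, b), (n_ab, cosine) in sorted(edges.items()):
    print(f"{a}\t{b}\t{n_ab}\t{cosine:.2f}")
```

In this toy input only the pairs MYH6/TNNT2 and NANOG/POU5F1 co-occur in two sets, so only they pass the absolute threshold. Pairs seen once, such as MYH6/NANOG, are rejected however high their cosine value would be. Enumerating all pairs within each set is quadratic in set size, which is acceptable for a sketch but may need optimization for very large gene sets.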

The resulting coregulation network was further observed using Gene Ontology (GO) functional annotations, or gene expression data from earlier studies [5,17]. The presence of significantly densely connected clusters within the total network was determined using the Cytoscape plugin MCODE [20] using default settings.

Identification of novel genes involved in ESC cardiac differentiation

For the genes in the ESC differentiation coregulation network mapping, Gene Ontology functional annotation and enrichment were determined using the Cytoscape plugin BiNGO [21] in combination with the DAVID Web application (david.abcc.ncifcrf.gov) [22]. Genes were considered already associated with ESC differentiation if they were assigned to one of the following Gene Ontology terms: development processes (including stem cell, organ, heart, or muscle development), cell differentiation, cell proliferation, extracellular matrix (including collagen), contractile fiber, muscle system process (including muscle contraction), and heart contraction. Further, genes for which the textmining tool Anni [23] found concept weights >10⁻⁵ for the above-mentioned terms, or the concepts “stem cells,” “gastrulation,” or “ectoderm/endoderm/mesoderm formation” were considered not novel for stem cell differentiation. The concept weight is a measure Anni uses to indicate the degree of literature association of one term with another. We found that when distributions for a larger number of concept weights are combined, the distribution approaches its background level at concept weights around 10⁻⁵, and therefore concept weights below this value are considered as not indicative for literature association. For the genes not functionally annotated by Gene Ontology or Anni, those that were part of an MCODE cluster enriched for one of the above functions were considered candidates for such a function.
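The selection logic in this paragraph amounts to three successive filters, which can be sketched as follows. The sketch assumes the GO annotations, Anni concept weights, and MCODE cluster assignments have already been exported into plain Python containers; the function name, argument names, placeholder gene identifiers, and weight values are all hypothetical and do not reproduce output from BiNGO, DAVID, Anni, or MCODE.

```python
def novel_candidates(network_genes, go_annotated, anni_weights,
                     cluster_of, enriched_clusters, weight_cutoff=1e-5):
    """Return {gene: cluster} for genes that pass all three filters.

    Hypothetical helper illustrating the selection described in the text.
    """
    candidates = {}
    for gene in network_genes:
        # Filter 1: already assigned to a relevant Gene Ontology term.
        if gene in go_annotated:
            continue
        # Filter 2: literature association above the concept-weight cutoff.
        if anni_weights.get(gene, 0.0) > weight_cutoff:
            continue
        # Filter 3: must sit in a cluster enriched for a relevant function.
        cluster = cluster_of.get(gene)
        if cluster in enriched_clusters:
            candidates[gene] = cluster
    return candidates


# Placeholder inputs (invented; not data from the study).
network_genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
go_annotated = {"GENE_A"}
anni_weights = {"GENE_B": 3e-4, "GENE_C": 2e-6}
cluster_of = {"GENE_A": 1, "GENE_B": 1, "GENE_C": 1}  # GENE_D is unclustered
enriched_clusters = {1}

print(novel_candidates(network_genes, go_annotated, anni_weights,
                       cluster_of, enriched_clusters))
```

Here only GENE_C remains: GENE_A is GO-annotated, GENE_B exceeds the concept-weight cutoff, and GENE_D is not part of an enriched cluster.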

Results

Method development

To show the applicability of our approach and to optimize the criteria used, we applied our algorithm to 2 gene set collections that are different in nature. The first of these is the MSigDB c2.cgp collection [16], which contains 1,186 gene sets related to a wide range of chemical and genetic perturbations, and, second, the lung-inflammation collection, with 90 gene sets, which is smaller and more focused on acute lung inflammation models. For both gene set collections we found that biologically meaningful coregulation networks could be obtained (Fig. 1A, B). For the c2.cgp collection, various coregulation networks of varying sizes were found. Several of these consisted of genes with a common related functionality such as cell division or immunological response (Fig. 1A). For the lung-inflammation collection, besides some smaller networks, a large and dense network of commonly inflammation-induced genes containing 120 genes was found (Fig. 1B). This network includes 66 immune response genes, 61 of which overlap with the 100 immune response genes found previously as part of a full-scale meta-analysis [5]. Moreover, this network contains all of the 23 “core inflammation response” genes found in the previous meta-analysis.

FIG. 1.

Network observations for 3 gene set collections. (A) Network for c2.cgp collection (591 genes, 1,243 edges); (B) network for lung-inflammation collection (133 genes, 930 edges); (C) network for embryonic stem cell differentiation collection (927 genes, 7,402 edges) with transitions between sub-networks indicated as dashed lines. Color images available online at www.liebertonline.com/scd.

ESC differentiation mapping

When the algorithm was applied to the ESC differentiation gene set collection, a large and densely interconnected network with 927 genes and 7,402 edges was obtained (Fig. 1C). The network appeared to have 2 major sub-networks, at the left- and right-hand side, respectively. This finding was robust for various stringency settings, as was the observation that both major sub-networks comprised 2 minor sub-networks. The overall network was significantly enriched for several GO-terms related to development, differentiation, and muscle function. In addition, enrichment for several other functional categories was found, especially transcription factors, cell proliferation, small heat shock proteins, and several types of metabolite transport. However, we found that known cardiomyocyte marker genes were located exclusively in the lower right-hand side of the network, whereas stem cell pluripotency markers were located only in the top left. This suggests that these 2 areas correspond to different processes or phases in stem cell differentiation. Both the functional and timing aspects were examined further.

By using the MCODE plugin, we identified 21 clusters within the overall network that are significantly densely connected. The size of these clusters ranged from 4 to 39 genes, with an overall total of 304 genes. Several of these clusters were enriched for specific GO terms. Clusters with such similar GO term enrichments appeared together in the network graph, and (by making small adjustments to the MCODE cluster size threshold parameter) it was found that they were mutually strongly connected. Taken together, 3 cluster groups could be distinguished (Fig. 2A; Tables 1–3; Supplementary Table S1, available online at www.liebertonline.com/scd). The first group contained 5 clusters (87 genes, upper left side of Fig. 2A) that were mainly enriched for early developmental GO terms such as stem cell development, embryonic development, and regulation of cell proliferation. The second group, containing 3 clusters (25 genes, upper right side of Fig. 2A), was enriched for terms such as organ development and embryonic development and GO terms involved in later development stages. The third group (55 genes, lower right side of Fig. 2A) contained 3 clusters and was enriched for specific heart- and muscle-related GO terms such as muscle development, muscle contraction, and heart development. In addition to development- and muscle-related processes, the first 2 groups were enriched for transcription factors, and the third group for small heat shock proteins. Next to the clusters in the 3 mentioned groups, 5 clusters (61 genes) were enriched for other biological processes related to transport and/or metabolism, and 5 other clusters (76 genes) did not show functional enrichment. The 3 groups mentioned above are each found in a distinct minor sub-network. The fourth minor sub-network in the lower left side of Fig. 2A contains several genes that can be associated with early development, proliferation, pluripotency, as well as several metabolic processes. However, the 2 clusters in this sub-network do not reach statistically significant functional enrichment.

FIG. 2.

Embryonic stem cell cardiomyocyte differentiation coregulation network. (A) Biological process enrichment in network clusters containing novel functional candidates; (B) peak gene expression for genes in a stem cell differentiation time series. Color images available online at www.liebertonline.com/scd.

Table 1.

Identified Genes in Cluster Group 1 (Associated with Stem Cell Development, Embryonic Development, and Regulation of Cell Proliferation)

Gene symbol	Gene description	Novel	MCODE cluster	Edges
1700019D03RIK	RIKEN cDNA 1700019D03 gene	Yes	2	74
2310003C23RIK	RIKEN cDNA 2310003C23 gene	Yes	2	49
AVPI1	Arginine vasopressin-induced 1	Yes	2	36
LRRC2	Leucine rich repeat containing 2	Yes	2	69
PIPOX	Pipecolic acid oxidase	Yes	2	48
RCOR2	Rest corepressor 2	Yes	2	59
SLC25A36	Solute carrier family 25, member 36	Yes	2	31
STOML1	Stomatin-like 1	Yes	2	28
TMEM8	Transmembrane protein 8 (5 membrane-spanning domains)	Yes	2	42
UPP1	Uridine phosphorylase 1	Yes	2	51
ZFP57	Zinc finger protein 57	Yes	2	51
COBL	Cordon-bleu	No	2	52
ETV5	Ets variant gene 5	No	2	85
JARID2	Jumonji, AT rich interactive domain 2	No	2	83
KLF9	Kruppel-like factor 9	No	2	41
LEFTY1	Left right determination factor 1	No	2	67
MYCN	N-myc proto-oncogene	No	2	68
NR0B1	Nuclear receptor subfamily 0, group B, member 1	No	2	69
NTN1	Netrin 1	No	2	47
PHC1	Polyhomeotic-like 1 (Drosophila)	No	2	62
REST	RE1-silencing transcription factor	No	2	40
RIF1	Rap1 interacting factor 1 homolog (yeast)	No	2	88
SPRY2	Sprouty homolog 2 (Drosophila)	No	2	65
SPRY4	Sprouty homolog 4 (Drosophila)	No	2	79
TCFCP2L1	Transcription factor CP2-like 1	No	2	67
ZIC5	Zinc finger protein of the cerebellum 5	No	2	52
2200001I15RIK	RIKEN cDNA 2200001I15 gene	Yes	6	30
IFITM2	Interferon induced transmembrane protein 2	Yes	6	43
JAM2	Junction adhesion molecule 2	Yes	6	75
MANBA	Mannosidase, beta a, lysosomal	Yes	6	61
MKRN1	Makorin, ring finger protein, 1	Yes	6	59
PDCL2	Phosducin-like 2	Yes	6	25
TDH	L-threonine dehydrogenase	Yes	6	69
CTNNAL1	Catenin (cadherin associated protein), alpha-like 1	No	6	23
DPPA3	Developmental pluripotency-associated 3	No	6	65
ENAH	Enabled homolog (Drosophila)	No	6	50
GBX2	Gastrulation brain homeobox 2	No	6	43
GPA33	Glycoprotein A33 (transmembrane)	No	6	35
GRSF1	G-rich RNA sequence binding factor 1	No	6	23
IFITM1	Interferon induced transmembrane protein 1	No	6	29
KLF2	Kruppel-like factor 2 (lung)	No	6	40
LEFTY2	Left-right determination factor 2	No	6	33
MORC1	Microrchidia 1	No	6	50
NANOG	Nanog homeobox	No	6	29
OTX2	Orthodenticle homeobox 2	No	6	60
PCAF	p300/CBP-associated factor	No	6	33
SALL1	Sal-like 1 (Drosophila)	No	6	50
SERTAD2	SERTA domain containing 2	No	6	50
TGIF	TGFB-induced factor homeobox	No	6	28
TLE4	Transducin-like enhancer of split 4, homolog of Drosophila E(spl)	No	6	51
ZFP64	Zinc finger protein 64	No	6	24
ZIC3	Zinc finger protein of the cerebellum 3	No	6	35
2410146L05RIK	RIKEN cDNA 2410146L05 gene	Yes	17	29
ACP6	Lysophosphatidic acid phosphatase	Yes	17	35
CDYL2	Chromodomain protein, Y chromosome-like 2	Yes	17	16
EPB4.1L4A	Erythrocyte protein band 4.1-like 4a	Yes	17	17
PTGES	Prostaglandin E synthase	Yes	17	24
FGF4	Fibroblast growth factor 4	No	17	35
HCK	Hemopoietic cell kinase	No	17	21
HK2	Hexokinase 2	No	17	20
IGF2BP1	Insulin-like growth factor 2 mRNA binding protein 1	No	17	36
MRAS	Muscle and microspikes RAS	No	17	19
NCL	Nucleolin	No	17	18
NPHS1	Nephrosis 1 homolog, nephrin (human)	No	17	19
RARG	Retinoic acid receptor, gamma	No	17	29
SGK	Serum/glucocorticoid regulated kinase	No	17	45
TCFAP2C	Transcription factor AP-2, gamma	No	17	34
CTH	Cystathionase (cystathionine gamma-lyase)	Yes	18	47
D14ERTD436E	DNA segment, chr 14, erato doi 436, expressed	Yes	18	19
NFATC2IP	NFAT, cytoplasmic, calcineurin-dependent 2 interacting protein	Yes	18	33
RANBP17	RAN binding protein 17	Yes	18	17
SNX10	Sorting nexin 10	Yes	18	25
FOXD3	Forkhead box D3	No	18	49
POU5F1	POU domain, class 5, transcription factor 1	No	18	32
SIX4	Sine oculis-related homeobox 4 homolog (Drosophila)	No	18	17
SOCS2	Suppressor of cytokine signaling 2	No	18	43
SUZ12	Suppressor of zeste 12 homolog (Drosophila)	No	18	14
TDGF1	Teratocarcinoma-derived growth factor 1	No	18	41
ZFP36L1	Zinc finger protein 36, C3H type-like 1	No	18	46
AGTRAP	Angiotensin II, type I receptor-associated protein	Yes	19	22
ASS1	Argininosuccinate synthetase 1	Yes	19	11
DPPA5	Developmental pluripotency associated 5	No	19	33
ESRRB	Estrogen related receptor, beta	No	19	42
KLF4	Kruppel-like factor 4 (gut)	No	19	32
MYBL2	Myeloblastosis oncogene-like 2	No	19	46
TCL1	T-cell lymphoma breakpoint 1	No	19	29
TRP53	Transformation related protein 53	No	19	15

Table 2.

Identified Genes in Cluster Group 2 (Associated with Organ Development and Embryonic Development)

Gene symbol	Gene description	Novel	MCODE cluster	Edges
COLEC12	Collectin sub-family member 12	Yes	9	8
ZFP503	Zinc finger protein 503	Yes	9	14
FOXC2	Forkhead box C2	No	9	10
GATA5	GATA binding protein 5	No	9	7
HEY1	Hairy/enhancer-of-split related with YRPW motif 1	No	9	33
HOXA3	Homeo box A3	No	9	17
HOXB2	Homeo box B2	No	9	42
HOXB4	Homeo box B4	No	9	15
PDGFRA	Platelet derived growth factor receptor, alpha polypeptide	No	9	29
RYR2	Ryanodine receptor 2, cardiac	No	9	10
SPON1	Spondin 1, (f-spondin) extracellular matrix protein	No	9	60
TBX2	T-box 2	No	9	28
ZFPM2	Zinc finger protein, multitype 2	No	9	37
COL2A1	Collagen, type II, alpha 1	No	12	20
FLRT2	Fibronectin leucine rich transmembrane protein 2	No	12	21
ISL1	ISL1 transcription factor, LIM/homeodomain	No	12	25
MEIS1	Meis homeobox 1	No	12	21
MRG1	Meis1-related gene 1	No	12	13
GAS6	Growth arrest specific 6	No	14	18
HDAC7A	Histone deacetylase 7A	No	14	14
ITGB1	Integrin beta 1 (fibronectin receptor beta)	No	14	18
MMP14	Matrix metallopeptidase 14 (membrane-inserted)	No	14	15
PEG3	Paternally expressed 3	No	14	18
TIMP2	Tissue inhibitor of metalloproteinase 2	No	14	17
VCL	Vinculin	No	14	14

Table 3.

Identified Genes in Cluster Group 3 (Associated with Muscle Development, Muscle Contraction, and Heart Development)

Gene symbol	Gene description	Novel	MCODE cluster	Edges
APOBEC2	Apolipoprotein B editing complex 2	Yes	1	57
G0S2	G0/G1 switch gene 2	Yes	1	40
HSPB3	Heat shock protein 3	Yes	1	39
PPP1R14C	Protein phosphatase 1, regulatory (inhibitor) subunit 14C	Yes	1	46
RRAD	Ras-related associated with diabetes	Yes	1	67
SYNPO2L	Synaptopodin 2-like	Yes	1	72
ACTN2	Actinin alpha 2	No	1	77
ASB2	Ankyrin repeat and SOCS box-containing 2	No	1	51
ATP2A2	ATPase, Ca++ transporting, cardiac muscle, slow twitch 2	No	1	43
CKM	Creatine kinase, muscle	No	1	57
CSRP3	Cysteine and glycine-rich protein 3	No	1	74
ENO3	Enolase 3, beta muscle	No	1	55
HSPB2	Heat shock protein 2	No	1	69
HSPB7	Heat shock protein family, member 7 (cardiovascular)	No	1	62
ITGB1BP2	Integrin beta 1 binding protein 2	No	1	50
LAMA4	Laminin, alpha 4	No	1	48
LDB3	LIM domain binding 3	No	1	58
MB	Myoglobin	No	1	44
MYBPC3	Myosin binding protein C, cardiac	No	1	39
MYH6	Myosin, heavy polypeptide 6, cardiac muscle, alpha	No	1	78
MYL3	Myosin, light polypeptide 3	No	1	60
MYL4	Myosin, light polypeptide 4	No	1	55
MYL7	Myosin, light polypeptide 7, regulatory	No	1	52
MYOM1	Myomesin 1	No	1	34
MYOZ2	Myozenin 2	No	1	70
NPPA	Natriuretic peptide precursor type A	No	1	55
PGAM2	Phosphoglycerate mutase 2	No	1	45
PLN	Phospholamban	No	1	71
POPDC2	Popeye domain containing 2	No	1	48
PPP1R12B	Protein phosphatase 1, regulatory (inhibitor) subunit 12B	No	1	33
SH3BGR	SH3-binding domain glutamic acid-rich protein	No	1	58
SMPX	Small muscle protein, X-linked	No	1	66
TCAP	Titin-cap	No	1	48
TNNC1	Troponin C, cardiac/slow skeletal	No	1	60
TNNI1	Troponin I, skeletal, slow 1	No	1	52
TNNI3	Troponin I, cardiac 3	No	1	59
TNNT2	Troponin T2, cardiac	No	1	63
TRIM63	Tripartite motif-containing 63	No	1	32
TTN	Titin	No	1	55
2310046A06RIK	RIKEN cDNA 2310046A06 gene	Yes	7	36
TM4SF1	Transmembrane 4 superfamily member 1	Yes	7	45
ACTA1	Actin, alpha 1, skeletal muscle	No	7	37
CRYAB	Crystallin, alpha B	No	7	42
DES	Desmin	No	7	45
DKK3	Dickkopf homolog 3 (Xenopus laevis)	No	7	29
EEF1A2	Eukaryotic translation elongation factor 1 alpha 2	No	7	24
IL6	Interleukin 6	No	7	31
NPPB	Natriuretic peptide precursor type B	No	7	37
PTX3	Pentraxin 3	No	7	38
ALPK2	Alpha-kinase 2	Yes	11	26
NEBL	Nebulette	Yes	11	27
RCSD1	RCSD domain containing 1	Yes	11	31
MYL2	Myosin, light polypeptide 2, regulatory, cardiac, slow	No	11	29
MYOCD	Myocardin	No	11	26
SMYD1	SET and MYND domain containing 1	No	11	21

To observe the differentiation process in an alternative manner, we used data from a recent study in which we described gene expression changes during ESC cardiac muscle cell differentiation [17]. This study was not included in the ESC differentiation gene set collection and therefore provided independent data. For the genes for which differential expression was found, we used different colors to indicate at which differentiation phase these genes had their highest expression. This observation indicated that the left and right side of the network can be associated with early (undifferentiated stem cells) and late (more differentiated) time points, respectively (Fig. 2B). This is in agreement with the different functionalities based on GO term enrichment. Genes expressed during intermediate time points did not group together, but appeared in either of the 2 network sides.

When these 2 observation approaches were combined again, we found that out of the 167 genes in the 3 cluster groups (Fig. 2A, Tables 1–3), independent data from the van Dartel [17] study showed differential expression for 49 of these genes (29%, compared to 6% for the whole-genome data). Of these, 18 early expressed genes were all found in the stem cell development group (group 1). Of the 4 mid-phase expressed genes, 2 were found in the stem cell development group and 2 in the organ development group (group 2). Of the 26 late expressed genes, 2 were found in the stem cell development group, 4 in the organ development group, and 20 in the muscle/heart function group (group 3). Finally, one gene with both early and late high expression was found in the stem cell development group.

Novel ESC cardiac differentiation genes

Using Gene Ontology for functional annotation of the genes in the ESC development network, we found that, of the 927 genes, 521 could be associated with GO processes relevant for ESC differentiation, and 418 with related Anni textmining concepts. Leaving out these genes left 327 genes not yet described to be involved in ESC differentiation, 165 and 162 of which were located at the left- and right-hand side of the network, respectively. Of these novel genes, 43 were also part of a significantly dense MCODE cluster with a functional enrichment for a GO term related to ESC or cardiac differentiation, making them candidate genes for having a corresponding function assigned to them. These 43 genes are listed in Table 4 and indicated (in dark red, dark green, or dark blue) in Fig. 2. Concerning the overlap with other pathways found previously, among these novel genes there were no genes involved in cell proliferation, there were 3 transcription factors (Rcor2, Zfp57, and Zfp503), and one heat shock protein (Hspb3).

Table 4.

Identified Novel Genes with Functional Evidence for a Role in Embryonic Stem Cell Differentiation

Gene symbol	Gene description	MCODE cluster	Figure location
1700019D03RIK	RIKEN cDNA 1700019D03 gene	2	Upper left
2310003C23RIK	RIKEN cDNA 2310003C23 gene	2	Upper left
AVPI1	Arginine vasopressin-induced 1	2	Upper left
LRRC2	Leucine rich repeat containing 2	2	Upper left
PIPOX	Pipecolic acid oxidase	2	Upper left
RCOR2	Rest corepressor 2	2	Upper left
SLC25A36	Solute carrier family 25, member 36	2	Upper left
STOML1	Stomatin-like 1	2	Upper left
TMEM8	Transmembrane protein 8 (5 membrane-spanning domains)	2	Upper left
UPP1	Uridine phosphorylase 1	2	Upper left
ZFP57	Zinc finger protein 57	2	Upper left
2200001I15RIK	RIKEN cDNA 2200001I15 gene	6	Upper left
IFITM2	Interferon induced transmembrane protein 2	6	Upper left
JAM2	Junction adhesion molecule 2	6	Upper left
MANBA	Mannosidase, beta a, lysosomal	6	Upper left
MKRN1	Makorin, ring finger protein, 1	6	Upper left
PDCL2	Phosducin-like 2	6	Upper left
TDH	L-threonine dehydrogenase	6	Upper left
2410146L05RIK	RIKEN cDNA 2410146L05 gene	17	Upper left
ACP6	Lysophosphatidic acid phosphatase	17	Upper left
CDYL2	Chromodomain protein, Y chromosome-like 2	17	Upper left
EPB4.1L4A	Erythrocyte protein band 4.1-like 4a	17	Upper left
PTGES	Prostaglandin E synthase	17	Upper left
CTH	Cystathionase (cystathionine gamma-lyase)	18	Upper left
D14ERTD436E	DNA segment, chr 14, erato doi 436, expressed	18	Upper left
NFATC2IP	NFAT, cytoplasmic, calcineurin-dependent 2 interacting protein	18	Upper left
RANBP17	RAN binding protein 17	18	Upper left
SNX10	Sorting nexin 10	18	Upper left
AGTRAP	Angiotensin II, type I receptor-associated protein	19	Upper left
ASS1	Argininosuccinate synthetase 1	19	Upper left
COLEC12	Collectin sub-family member 12	9	Upper right
ZFP503	Zinc finger protein 503	9	Upper right
APOBEC2	Apolipoprotein B editing complex 2	1	Lower right
G0S2	G0/G1 switch gene 2	1	Lower right
HSPB3	Heat shock protein 3	1	Lower right
PPP1R14C	Protein phosphatase 1, regulatory (inhibitor) subunit 14C	1	Lower right
RRAD	Ras-related associated with diabetes	1	Lower right
SYNPO2L	Synaptopodin 2-like	1	Lower right
2310046A06RIK	RIKEN cDNA 2310046A06 gene	7	Lower right
TM4SF1	Transmembrane 4 superfamily member 1	7	Lower right
ALPK2	Alpha-kinase 2	11	Lower right
NEBL	Nebulette	11	Lower right
RCSD1	RCSD domain containing 1	11	Lower right

Discussion

Since its introduction, microarray technology has grown to be used in over 6,000 PubMed publications per year. The growth in the number of studies published has been followed by the development and application of various methods for combined or meta-analysis (reviewed in [24]). However, there are several practical issues related to such analyses of gene expression data, and one major hurdle is the limited public availability of the complete raw or normalized data sets used in literature studies. Several approaches to combined or meta-analyses have been published in the last few years, and although this field is still developing, it becomes apparent that no single best method exists as the suitability of an approach will depend on the nature and quality of available data. In this study we propose an approach that is not based on the actual re-analysis of original data, but on the co-occurrence of genes based on processing of gene sets derived from these data. One of the rationales behind our approach is the need for a combined analysis that allows inclusion of large numbers of literature microarray study results, even if the full original data are not (or not easily) available. Our algorithm for network construction allows the inclusion of almost any published study and further does not require elaborate data processing or new software implementations; R and Cytoscape are freely available and already familiar to most bioinformaticians. Our algorithm is available as Supplementary Data (available online at www.liebertonline.com/scd).

The 3 underlying concepts for our approach have each already been described on their own. First, coexpression or coregulation networks [8] have become a commonly applied method to infer novel gene functionality [7,9–11]. Second, algorithms for determining gene co-occurrence in literature publications have been applied in textmining applications [25–28]. However, whereas textmining is mostly based on literature abstracts or database description fields, our approach specifically searches in (supplementary) tables, allowing the use of information that would otherwise have been overlooked. Third, employing gene sets rather than full microarray data on the one hand or literature abstracts on the other has been described in several applications [29–32]. These approaches, however, look for statistically significant overlap between lists, or with an experimental gene set supplied by the user. Therefore, the combination of these methods provides a novel integrated type of approach that should be able to incorporate some of the advantages of each of the underlying methods. A potential weakness of our approach is that the quality of the gene set collection taken as the starting point for this method will depend on the quality and relevance of the input data. We had to rely on the analyses done by the authors of the underlying studies, including the assessment of the reviewers and editors responsible for the process of peer review.

As part of the algorithm development, our approach was tested on 2 gene set collections, which were selected to be different in nature. On the basis of the multipurpose c2.cgp collection, a large number of small networks were identified, including cell division and immunological response networks (Fig. 1A). Coregulation mapping of the lung-inflammation gene set collection yielded a large network of mainly immunological genes (Fig. 1B), which comprises the majority of the immunological genes that were found to be regulated in a previous full meta-analysis [5]. Moreover, it contains all of the 23 “core inflammation response” genes found previously [5]. These findings illustrate the usefulness of our approach.

For these 2 gene set collections, the examples given were obtained using stringency settings for which the number of genes in the network description file is around 5% of the number of UGIs in the collection, and these settings were chosen as the default stringency settings. However, comparable results were obtained when this fraction was between 1% and 10% (data not shown). This indicates that the results are robust against small changes in stringency. This was also found for the ESC differentiation coregulation network mapping. Moreover, to ensure that the method does not lead to false-positive results, the c2.cgp and lung-inflammation gene set collections were scrambled; that is, the genes were randomly relocated across the gene sets. Applying the algorithm to these scrambled collections did not result in a network even at the lowest stringency level (data not shown), confirming that the method is robust against random false-positive findings.
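The two checks described above (the fraction of genes that end up in the network, and the scrambling control) can be sketched in Python as follows. The exact scrambling procedure is not specified in this section, so the sketch assumes that each gene set is refilled with randomly drawn genes from the pooled collection while keeping its original size; the function names, the `min_shared_sets` threshold, and the toy sets are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations


def scramble_gene_sets(gene_sets, seed=0):
    """Refill every gene set with random genes from the pooled universe,
    keeping each set's size (assumed scrambling scheme)."""
    rng = random.Random(seed)
    universe = sorted({g for genes in gene_sets.values() for g in genes})
    return {name: rng.sample(universe, len(set(genes)))
            for name, genes in gene_sets.items()}


def network_gene_fraction(gene_sets, min_shared_sets):
    """Fraction of all unique genes lying on at least one co-occurrence edge."""
    pair_counts = Counter()
    for genes in gene_sets.values():
        pair_counts.update(combinations(sorted(set(genes)), 2))
    in_network = {g for pair, n in pair_counts.items()
                  if n >= min_shared_sets for g in pair}
    universe = {g for genes in gene_sets.values() for g in genes}
    return len(in_network) / len(universe)


# Toy collection with placeholder identifiers, not study data.
toy_sets = {
    "set1": ["g1", "g2", "g3"],
    "set2": ["g1", "g2", "g4"],
    "set3": ["g5", "g6"],
}
observed = network_gene_fraction(toy_sets, min_shared_sets=2)
scrambled = scramble_gene_sets(toy_sets, seed=1)
scrambled_fraction = network_gene_fraction(scrambled, min_shared_sets=2)
```

On a real collection, one would compare `observed` with the distribution of `scrambled_fraction` over many seeds, and tune the threshold until the observed fraction is near the chosen target (around 5% in the text).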

Applying our algorithm to the ESC differentiation gene set collection with 15,196 UGIs results in a network with 927 genes, mostly located within 2 major sub-networks that correspond to early and late differentiation stages, respectively. Combining GO term functional enrichment with MCODE clusters within the network reveals that genes involved in similar biological processes are significantly densely connected, which is in agreement with the premise that such genes are coregulated. Moreover, this analysis shows that early differentiation events, such as loss of pluripotency, are concentrated in a cluster group at the left side of the network, whereas the right side of the network contains 2 cluster groups involved in later differentiation phases, one of which is specifically related to muscle and heart function (Fig. 2A). The early–late division is supported by overlaying gene expression data from an independent time series study on ESC cardiomyocyte differentiation [17] (Fig. 2B). Genes expressed during intermediate time points in that study did not form a separate sub-network, but instead appeared within either of the major sub-networks. This can partially be ascribed to the focus of most literature studies on the differences between undifferentiated and fully differentiated cells, leaving intermediate time point genes underrepresented in the combined gene set collection. Further, whereas pluripotent stem cells and differentiated cardiomyocytes both represent stable states regarding gene expression, the intermediate phase can be described as consecutive waves of gene expression corresponding to sequential transient activation of differentiation programs [17,33,34]. Due to their transient expression, such genes are more difficult to identify than those involved in the initial and final phases of the differentiation process, and they may therefore have been described in smaller numbers [17,34]. Both these factors will lead to a relatively smaller number of coregulation edges among intermediate phase genes, which influences the Cytoscape view, as its mapping algorithm was developed to group connected genes together rather than to display data as a time series. Nevertheless, the additional use of the MCODE plugin results in 3 functional cluster groups that are visually separated and each correspond to a particular differentiation phase.
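To make the notion of functional enrichment of a cluster concrete, the following Python sketch computes a one-sided hypergeometric p-value, which is one common choice for GO term overrepresentation. It is an illustration only: the function name and arguments are hypothetical, the numbers in the usage example are arbitrary, and no multiple-testing correction is included, whereas dedicated enrichment tools normally apply one.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_annotated, n_cluster, n_overlap):
    """P(X >= n_overlap) when n_cluster genes are drawn without replacement
    from n_universe genes, of which n_annotated carry the GO term."""
    upper = min(n_annotated, n_cluster)
    total = comb(n_universe, n_cluster)
    tail = sum(
        comb(n_annotated, k) * comb(n_universe - n_annotated, n_cluster - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Arbitrary illustrative numbers: a 5-gene cluster in a 10-gene universe
# in which 5 genes are annotated, and all 5 cluster genes carry the term.
p_all_annotated = hypergeom_enrichment_p(10, 5, 5, 5)
```

A small p-value for a cluster indicates that the term occurs among its genes more often than expected from the background; here the background would be the genes in the network or in the whole gene set collection, which is a choice the analyst has to make explicitly.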

At the lower left side of Fig. 2A, there is a small sub-network containing a number of genes involved in development, proliferation, and pluripotency. These genes could have been expected to occur in the larger sub-network in the upper left side of this figure. The presence of a separate sub-network might indicate a possible common factor influencing the (co-)expression of these genes. However, as the main associated processes are comparable to those of its larger neighbor, and the functional annotation enrichment of this smaller sub-network is not significant, the relevance of this sub-network should not be overinterpreted. The same can be said for the 6 groups of 2 or 3 genes each in the far lower right of the network view. Their occurrence as separate groups is not robust to small changes in the algorithm settings used, and they can be considered as falling just short of being connected to the main network.

The 3 functional groups in the network contain over 60 genes annotated with the GO umbrella term organ development, or with more specific terms regarding the development of a particular organ or tissue. Apart from terms related to heart or muscle, there was also functional enrichment for the following terms: blood vessel development, ear development, eye development, gland development, immune system development, lung development, nervous system development, and skeletal development, as well as some of their daughter terms. Although several genes are annotated to the development of more than one type of organ, muscle and heart (-specific) development-related genes are significantly enriched in the third group (Table 3), and the second group (Table 2) is significantly enriched for genes (specifically) involved in the development of other organs. Because genes expressed during the intermediate and late time points in the van Dartel study [17] are found in the sub-networks around the second as well as the third group, this indicates that other types of tissue are also formed under culture conditions favoring cardiomyocyte differentiation.

Other functional analyses showed that the overall network not only contains various development- and heart/muscle-associated genes, but in addition genes involved in cellular proliferation, transcription, and also several members of the alpha crystallin/small heat shock protein family. The first of these processes can easily be understood as the starting point of stem cell differentiation in a culture of proliferating cells. Likewise, transcription factors are necessary to trigger and regulate the developmental changes that occur between proliferating stem cells and fully differentiated cells, which can explain why these genes are mainly found in the development-related first and second group, but hardly in the muscle function-related third group. In contrast, this third group contains a number of small heat shock proteins and the structurally related alpha crystallin B. Several members of this chaperonin family have already been associated with muscle contraction or development, and as Hspb3 has not been associated with such a role yet, this might indicate a novel function for this protein. However, although a common relevance of chaperonins in heart function is conceivable, extending this association to Hspb3 would require additional study and verification, as there is evidence that in heart Cryab and Hspb2 act through different and distinct mechanisms within cardiomyocytes [35].

Of the 927 genes in the network, 327 (35%) have been described in multiple gene lists, but have not yet been linked to a related developmental process in Gene Ontology or by literature association. Although in most of the studies the genes are only part of a bulk of differentially expressed genes, this combined study shows that there is good evidence to describe them as novel genes involved in ESC differentiation. Further, based on their position in the network, it can already be assumed whether they are mainly expressed at early or late differentiation stages. For 43 of these novel genes, there is even more specific evidence, namely, that they are part of densely connected cluster groups within the overall network that are enriched for a relevant biological function.

To determine the in vivo relevance of these 43 genes, we searched the literature as well as the Mouse Genome Informatics (www.informatics.jax.org) and the International Knockout Mouse Consortium (www.knockoutmouse.org) databases for mouse knockout phenotype data. For 12 genes, phenotype data have been described. In the case of 4 genes, these data are supportive of the putative function in heart or muscle development inferred from the network. The Apobec2 knockout phenotype revealed a role in maintenance of slow/fast muscle fiber-type ratios [36]; Cth deletion results in overproliferation of smooth muscle cells [37]; deletion of the Ptges gene leads to adverse ventricular remodeling after myocardial infarction [38]; and Rrad knockout mice display cardiac hypertrophy [39]. In the case of Rrad, this is further supported by additional human data [39]. For an additional 2 genes, mouse knockout data are consistent with an early developmental function. Mice homozygous for a 2410146L05Rik (Ooep) null allele exhibit female infertility associated with a failure of embryos to progress beyond the 2-cell stage [40]; for Zfp57, loss of its zygotic function causes partial neonatal lethality, whereas eliminating both the maternal and zygotic functions of Zfp57 results in a highly penetrant embryonic lethality due to effects on genomic imprinting [41]. Taken together, in vivo knockout data corroborate the putative function for 6 of the 12 genes for which such data are available.

This indicates that the other genes in Table 4 also provide interesting candidates for further studies regarding ESC cardiac differentiation. Among these, Hspb3 and 2 transcription factors (Rcor2 and Zfp503) are part of pathways enriched within the network, which makes them the mechanistically most conceivable starting points. We compared expression data for these genes across multiple tissues and cell types by means of the BioGPS Web site (http://biogps.gnf.org) [42,43]; Hspb3 showed high expression in heart and skeletal muscle, Rcor2 in mouse ESC lines, and Zfp503 in multiple tissues. This is in agreement with their proposed biological role, although further study will be required to corroborate their mechanistic and functional importance.

In addition to mechanistic studies of stem cell differentiation, we expect the newly found genes to be potentially useful in applied test systems that use ESC differentiation as a model to identify developmental-toxic properties of compounds, such as the ESC test (EST) [44]. The implementation of molecular biological approaches in such models may help to improve their prediction accuracy [17,18]. Further, it may help in understanding the biology of the model, which is useful for the definition of its applicability domain. Thus, further knowledge of the genes involved in cardiomyocyte differentiation, as well as of their regulation, can help to optimize such test models for developmental toxicity.

Acknowledgments

This study was supported by grant MFA6809 from the Dutch technology society foundation STW and by grant no. 050-06-510 from the Netherlands Genomics Initiative/Netherlands Organization for Scientific Research (NWO).

Author Disclosure Statement

The authors declare that no competing financial interests exist.

References

1. Segal, Friedman, Koller, Regev. 2004. A module map showing conditional activity of expression modules in cancer. Nat Genet, 36:1090–1098.

2. Tseng, Cheng, Yu, Nelson, Michalopoulos, Luo. 2009. Investigating multi-cancer biomarkers and their cross-predictability in the expression profiles of multiple cancer types. Biomark Insights, 4:57–79.

3. Yang, Pospisil, Iyer, Adelstein, Kassis. 2008. Integrative genomic data mining for discovery of potential blood-borne biomarkers for early diagnosis of cancer. PLoS ONE, 3:e3661.

4. Jenner, Young. 2005. Insights into host responses against pathogens from transcriptional profiling. Nat Rev Microbiol, 3:281–294.

5. Pennings, Kimman, Janssen. 2008. Identification of a common gene expression response in different lung inflammatory diseases in rodents and macaques. PLoS ONE, 3:e2596.

6. de Magalhaes, Curado, Church. 2009. Meta-analysis of age-related gene expression profiles identifies common signatures of aging. Bioinformatics, 25:875–881.

7. Rodriguez-Zas, Ko, Adams, Southey. 2008. Advancing the understanding of the embryo transcriptome co-regulation using meta-, functional, and gene network analysis tools. Reproduction, 135:213–224.

8. Stuart, Segal, Koller, Kim. 2003. A gene-coexpression network for global discovery of conserved genetic modules. Science, 302:249–255.

9. Luo, Yang, Zhong, Gao, Khan, Thompson, Zhou. 2007. Constructing gene co-expression networks and predicting functions of unknown genes by random matrix theory. BMC Bioinformatics, 8:299.

10. Presson, Sobel, Papp, Suarez, Whistler, Rajeevan, Vernon, Horvath. 2008. Integrated weighted gene co-expression network analysis with an application to chronic fatigue syndrome. BMC Syst Biol, 2:95.

11. Wang, Narayanan, Zhong, Tompa, Schadt, Zhu. 2009. Meta-analysis of inter-species liver co-expression networks elucidates traits associated with common human diseases. PLoS Comput Biol, 5:e1000616.

12. Gustin, Paultre, Randon, Bricca, Cerutti. 2008. Functional meta-analysis of double connectivity in gene coexpression networks in mammals. Physiol Genomics, 34:34–41.

13. Fishman, Chien. 1997. Fashioning the vertebrate heart: earliest embryonic decisions. Development, 124:2099–2117.

14. Srivastava, Olson. 2000. A genetic blueprint for cardiac development. Nature, 407:221–226.

15. Beqqali, van Eldik, Mummery, Passier. 2009. Human stem cells as a model for cardiac differentiation and disease. Cell Mol Life Sci, 66:800–813.

16. Subramanian, Tamayo, Mootha, Mukherjee, Ebert, Gillette, Paulovich, Pomeroy, Golub, Lander, Mesirov. 2005. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA, 102:15545–15550.

17. van Dartel, Pennings, van Schooten, Piersma. 2010. Transcriptomics-based identification of developmental toxicants through their interference with cardiomyocyte differentiation of embryonic stem cells. Toxicol Appl Pharmacol, 243:420–428.

18. van Dartel, Pennings, Hendriksen, van Schooten, Piersma. 2009. Early gene expression changes during embryonic stem cell differentiation into cardiomyocytes and their modulation by monobutyl phthalate. Reprod Toxicol, 27:93–102.

19. Shannon, Markiel, Ozier, Baliga, Wang, Ramage, Amin, Schwikowski, Ideker. 2003. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res, 13:2498–2504.

20. Bader, Hogue. 2003. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics, 4:2.

21. Maere, Heymans, Kuiper. 2005. BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics, 21:3448–3449.

22. Huang, Sherman, Lempicki. 2009. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc, 4:44–57.

23. Jelier, Schuemie, Veldhoven, Dorssers, Jenster, Kors. 2008. Anni 2.0: a multipurpose text-mining tool for the life sciences. Genome Biol, 9:R96.

24. Larsson, Wennmalm, Sandberg. 2006. Comparative microarray analysis. OMICS, 10:381–397.

25. Alako, Veldhoven, van Baal, Jelier, Verhoeven, Rullmann, Polman, Jenster. 2005. CoPub Mapper: mining MEDLINE based on search term co-publication. BMC Bioinformatics, 6:51.

26. Frijters, Heupers, van Beek, Bouwhuis, van Schaik, de Vlieg, Polman, Alkema. 2008. CoPub: a literature-based keyword enrichment tool for microarray data analysis. Nucleic Acids Res, 36:W406–W410.

27. Rubinstein, Simon. 2005. MILANO—custom annotation of microarray results using automatic literature searches. BMC Bioinformatics, 6:12.

28. Wren. 2009. A global meta-analysis of microarray expression data to predict unknown gene functions and estimate the literature-data divide. Bioinformatics, 25:1694–1701.

29. Cahan, Ahmad, Burke, Fu, Lai, Florea, Dharker, Kobrinski, Kale, McCaffrey. 2005. List of lists-annotated (LOLA): a database for annotation and comparison of published microarray gene lists. Gene, 360:78–82.

30. Finocchiaro, Mancuso, Muller. 2005. Mining published lists of cancer related microarray experiments: identification of a gene expression signature having a critical role in cell-cycle control. BMC Bioinformatics, 6(Suppl 4):S14.

31. Newman, Weiner. 2005. L2L: a simple tool for discovering the hidden significance in microarray expression data. Genome Biol, 6:R81.

32. Smid, Dorssers, Jenster. 2003. Venn Mapping: clustering of heterologous microarray data based on the number of co-occurring differentially expressed genes. Bioinformatics, 19:2065–2071.

33. Beqqali, Kloots, Ward-van Oostwaard, Mummery, Passier. 2006. Genome-wide transcriptional profiling of human embryonic stem cells differentiating to cardiomyocytes. Stem Cells, 24:1956–1967.

34. Bruce, Gardiner, Burke, Gongora, Grimmond, Perkins. 2007. Dynamic transcription programs during ES cell differentiation towards mesoderm in serum versus serum-freeBMP4 culture. BMC Genomics, 8:365.

35. Pinz, Robbins, Rajasekaran, Benjamin, Ingwall. 2008. Unmasking different mechanical and energetic roles for the small heat shock proteins CryAB and HSPB2 using genetically modified mouse hearts. FASEB J, 22:84–92.

36. Sato, Probst, Tatsumi, Ikeuchi, Neuberger, Rada. 2010. Deficiency in APOBEC2 leads to a shift in muscle fiber type, diminished body mass, and myopathy. J Biol Chem, 285:7111–7118.

37. Yang, Wu, Bryan, Khaper, Mani, Wang. 2010. Cystathionine gamma-lyase deficiency and overproliferation of smooth muscle cells. Cardiovasc Res, 86:487–495.

38. Degousee, Fazel, Angoulvant, Stefanski, Pawelzik, Korotkova, Arab, Liu, Lindsay, Zhuo, Butany, Li, Audoly, Schmidt, Angioni, Geisslinger, Jakobsson, Rubin. 2008. Microsomal prostaglandin E2 synthase-1 deletion leads to adverse left ventricular remodeling after myocardial infarction. Circulation, 117:1701–1710.

39. Chang, Zhang, Tseng, Xie, Ilany, Bruning, Sun, Zhu, Cui, Youker, Yang, Day, Kahn, Chen. 2007. Rad GTPase deficiency leads to cardiac hypertrophy. Circulation, 116:2976–2983.

40. …, Baibakov, Dean. 2008. A subcortical maternal complex essential for preimplantation mouse embryogenesis. Dev Cell, 15:416–425.

41. …, Ito, Zhou, Youngson, Zuo, Leder, Ferguson-Smith. 2008. A maternal-zygotic effect gene, Zfp57, maintains both maternal and paternal imprints. Dev Cell, 15:547–557.

42. Lattin, Schroder, Su, Walker, Zhang, Wiltshire, Saijo, Glass, Hume, Kellie, Sweet. 2008. Expression analysis of G protein-coupled receptors in mouse macrophages. Immunome Res, 4:5.

43. …, Orozco, Boyer, Leglise, Goodale, Batalov, Hodge, Haase, Janes, Huss III, Su. 2009. BioGPS: an extensible and customizable portal for querying and organizing gene annotation resources. Genome Biol, 10:R130.

44. Genschow, Spielmann, Scholz, Pohl, Seiler, Clemann, Bremer, Becker. 2004. Validation of the embryonic stem cell test in the international ECVAM validation study on three in vitro embryotoxicity tests. Altern Lab Anim, 32:209–244.
