Abstract
Abstract
Transcriptome analyses based on high-throughput RNA sequencing (RNA-Seq) provide powerful and quantitative characterization of cell types and in-depth understanding of biological systems in health and disease. In this study, we present a comprehensive transcriptome profile of human primary monocytes, a crucial component of the innate immune system. We performed deep RNA-Seq of monocytes from six healthy subjects and integrated our data with 10 other publicly available RNA-Seq datasets of human monocytes. A total of 1.9 billion reads were generated, which allowed us to capture most of the genes transcribed in human monocytes, including 11,994 protein-coding genes, 5558 noncoding genes (including long noncoding RNAs, precursor miRNAs, and others), 2819 pseudogenes, and 7034 putative novel transcripts. In addition, we profiled the expression pattern of 1155 transcription factors (TFs) in human monocytes, which are the main molecules in controlling the gene transcription. An interaction network was constructed among the top expressed TFs and their targeted genes, which revealed the potential key regulatory genes in biological function of human monocytes. The gene catalog of human primary monocytes provided in this study offers significant promise and future potential clinical applications in the fields of precision medicine, systems diagnostics, immunogenomics, and the development of innovative biomarkers and therapeutic monitoring strategies.
Introduction
M
Transcriptome study is important for understanding the genome functional elements, the molecular components of cells/tissues, and development of diseases. Previously, microarray was the commonly used method for transcriptome analysis; however, recently high-throughput RNA sequencing (RNA-Seq) has become a powerful alternative approach for transcriptome studies. RNA-Seq is able to qualitatively and quantitatively explore any RNA type, including messenger RNAs (mRNAs), long noncoding RNAs (lncRNAs), microRNAs (miRNAs), and small interfering RNAs (siRNAs), as well as novel transcripts (Dong and Chen, 2013). Recent studies have applied RNA-Seq technology for transcriptome profiling of several tissues and cell types such as endometrium (Zieba et al., 2015), spleen (Dang et al., 2016), T cells (Mitchell et al., 2015), B cells (Toung et al., 2011), and macrophages (Beyer et al., 2012).
In addition, by using the RNA-Seq approach, the global gene transcription changes that occur during the differentiation of monocyte to macrophage have been reported (Dong et al., 2013). Furthermore, the RNA-Seq data are valuable for postgenomic study leading to the development of diagnostic and therapeutic applications for precision medicine. Recently, Fuchs et al. (2016) established an integrative tool for postgenomic data analysis utilizing next-generation sequencing, RNA-Seq, and microarray data. However, to the best of our knowledge, the genome-wide transcriptome profiling of human primary monocytes is still not available. Thus, in the current study, we carried out very deep RNA-Seq analysis (200 million reads per sample) on a purified population of monocytes from six healthy subjects. These data were integrated with other publicly available RNA-Seq data for human monocytes from ENCODE and ArrayExpress databases to generate the comprehensive gene expression profile of human primary monocytes under healthy states.
Materials and Methods
Ethics and consent
The Medical Research and Ethics Committee (MREC) Malaysia approved this research study, with given reference number NMRR-13-972-16921. Six healthy unrelated subjects with urban lifestyle were included in this study. All subjects fulfilled the criteria set for the study. These criteria were being nonsmokers, not having any medical illness, not prescribed any chronic medication, and not receiving any vaccination at least 6 months before the study. All subjects completed the written informed consent forms before the study.
Monocyte isolation and RNA extraction
The classical monocytes (CD14++CD16−) were isolated from peripheral blood mononuclear cells by negative selection technique using the human monocyte isolation kit II (Miltenyi Biotec). Total RNA was extracted from isolated monocytes using the RNeasy Mini Kit (Qiagen) following standard protocols. The RNA quality and quantity were assessed using NanoDrop 2000 (Thermo Fisher Scientific, Inc.) and Qubit 2.0 RNA Broad Range Assay (Invitrogen). The purity of total RNA was examined using Agilent Bioanalyzer RNA Nano chip (Agilent Technologies). mRNA was purified from total RNA and an RNA sequence library was generated using the TruSeq RNA Sample Preparation Kit (Illumina) and SuperScript II Reverse Transcriptase (Invitrogen). The RNA libraries were sequenced on the Illumina HiSeq 2000 Platform (Illumina) to generate 2 × 100 bp paired-end sequencing reads. In addition, 10 raw RNA sequences of human classical monocytes of healthy subjects (included paired-end reads), 100 bases long and sequenced on Illumina platforms, were downloaded from ENCODE and ArrayExpress databases with accession numbers, ENCSR000CUC and E-MTAB-2399.
Alignment and transcript assembly
Quality control of all 16 sequencing reads was verified using FASTQC (Andrews, 2010) and the low-quality bases and adaptors were trimmed using Trimmomatic (Bolger et al., 2014). These trimmed reads were aligned separately to the human reference genome sequence (Ensembl GRCH38.79) using HISAT (version 0.1.4) (Kim et al., 2015) and assembled into transcripts by StringTie (version 1.3.3) (Pertea et al., 2015) using a GENCODE reference annotation GTF file (version 22). Separate GTF (general transfer format) files were generated for each of the 16 samples. The transcript abundance was estimated as fragments per kilobase of exon per million fragments mapped (FPKM) (Trapnell et al., 2010). Since each sample has differing read numbers, Cuffnorm (part of the Cufflinks, version 2.2.1) (Trapnell et al., 2010) was used to normalize FPKM values between the samples. The FPKM >0.1 threshold was used to determine expressed transcripts.
Gene expression profiling
For detecting the gene expression pattern in monocytes, transcript assemblies (GTF files) of all 16 samples were merged together to form a single set of nonredundant (NR) transcripts using Cuffmerge (part of the Cufflinks, version 2.2.1) (Trapnell et al., 2010). The merged assembly was then compared with a GENCODE reference annotation (version 22), which is the most comprehensive annotation, contains protein coding with alternative splice variants, noncoding, and pseudogenes.
The transcripts, which are intergenic and not aligned to the reference annotation, were considered as putatively novel transcripts. These transcripts were filtered against the NR database from NCBI using BLASTN (version 2.4.0) with an E-value <1e-10 threshold. TransDecoder (version 3.3.0) (http://transdecoder.github.io) (Haas and Papanicolaou, 2015) was then used to identify the potential novel transcript coding for peptides (transcripts with open reading frame [ORF]). To further capture the ORFs that may have significant functions, potential novel transcripts predicted with ORFs were searched against the PFAM-A database (Finn et al., 2016) and results were filtered with an E-value <1e-10 threshold.
The Gene Ontology (GO) enrichment analysis and KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway analysis of identified protein-coding genes were performed using the DAVID (the database for annotation, visualization and integrated discovery) functional annotation tool (Dennis Jr. et al., 2003) by applying the Hypergeometric statistical test, followed by the Benjamini and Hochberg method (Benjamini and Hochberg, 1995) for adjusting the p value (adjp).
To identify the expression of transcription factors (TFs) in monocytes, the merged transcript assembly of our datasets was compared with the list of human TFs, which was compiled from literatures (Ravasi et al., 2010; Roach et al., 2007) and GO term transcription factor. To detect TF gene targets, the TRRUST database (transcriptional regulatory relationships unraveled by sentence-based text mining) (Han et al., 2015) was used, which contains 8015 interactions between 748 TF genes and 1975 non-TF genes. The interaction network between TFs with their gene target was constructed using the Cytoscape plug-in, GeneMANIA (Montojo et al., 2010).
Results and Discussion
Transcriptome profiling
We extracted RNAs from purified classical monocytes of six healthy subjects and performed a very deep RNA-Seq. Approximately 1.2 billion reads of 100 bp read length were generated from the sequencing. Subsequently, these RNA-Seq datasets were integrated with publicly available RNA-Seq data from human classical monocyte samples to generate a total of 1.9 billion reads. The details of all RNA-Seq datasets are shown in Table 1. All reads were mapped to the reference genome and assembled into transcriptome using the same pipeline to reduce any bias. An average 90% of the reads aligned to the human reference genome (Ensembl GRCh38.79). The abundance of assembled transcripts was estimated using FPKM value. By applying an FPKM >0.1 threshold, we have identified a total of 20,371 genes and 82,996 transcripts expressed in our monocyte datasets. The summary of identified genes and transcripts with regard to their biotype is presented in Table 2.
N/A, the information of subjects is not available in public database.
FPKM, fragments per kilobase of exon per million fragments mapped.
Protein-coding genes
Protein-coding genes expressed at different rates are influenced by different parameters (Ryuchkova-Mostacci and Robinson-Rechavi, 2015). Identifying and measuring the protein-coding gene expression at transcriptome levels are important to quantify which particular gene is expressed within a cell, tissue, or organism under different conditions (Dueck et al., 2015). Of 19,814 protein-coding genes reported in the GENCODE database (version 22), we detected the expression of total 11,994 protein-coding genes in human monocyte (Supplementary Table S1). We divided the identified protein-coding genes into three groups based on their FPKM values: high expression (top 25th percentile; FPKM >26.9) (3009 genes), moderate expression (middle 50th percentile; 1.6< FPKM ≤26.9) (6008 genes), and low expression (bottom 25th percentile; FPKM ≤1.6) (2909 genes) (Supplementary Table S1).
The GO enrichment analysis based on the biological process categories revealed that the highly expressed genes were mainly enriched for the immune system process (278 gene) and death (206 gene) (Fig. 1a), while the low and moderately expressed genes were mainly involved in the cellular process and metabolic process (Fig. 1b, c), confirming the metabolic bias in human classical monocytes (Schmidl et al., 2014). The KEGG pathway enrichment analysis also showed that highly expressed genes mainly enriched in several pathways that belonged to the immune system such as FcγR-mediated phagocytosis, chemokine signaling pathways, and toll-like receptor signaling pathway, and apoptosis (Fig. 1d), while the low and moderately expressed genes were significantly enriched in RNA degradation and glycine and serine and threonine metabolism, respectively (Fig. 1e, f).

The GO and KEGG pathway analyses of protein-coding genes in human monocytes. The significant GO biological process terms (adjp < 0.01) for
Noncoding genes
A comparison of the assembled transcripts with the GENCODE reference annotation genes showed evidence of expression for 5558 noncoding genes, including lncRNAs (Supplementary Table S2), precursor miRNAs (pre-miRNAs) (Supplementary Table S3), and other noncoding genes (snoRNA, snRNA, and miscRNA) (Supplementary Table S4) across all the datasets studied. As the lncRNAs and pre-miRNAs are important classes of noncoding genes, we further inspected the expressed genes in those classes.
Long noncoding RNAs
Of 5558 noncoding genes, we detected the expression of 4799 lncRNAs across all 16 samples (Supplementary Table S2). lncRNAs are the largest class of noncoding RNA (ncRNA) genes in human genomes and defined as transcripts larger than 200 nucleotides without any coding potential (Derrien et al., 2012). Several recent studies have shown the role of lncRNAs in relation to immune regulation, and their role in several autoimmune diseases such as systemic lupus erythematosus and rheumatoid arthritis (Sigdel et al., 2015). However, the exact function of the large majority of lncRNAs still remains unknown. Recently, we reported the landscape of lncRNAs in the human monocytes along with several potential novel long intergenic noncoding RNAs (Mirsafian et al., 2016). In this study, we detected that the expression of another 253 lncRNAs in human monocytes has not been identified in our previous study (Supplementary Table S2).
Some of these lncRNAs (such as nuclear paraspeckle assembly transcript 1 [NEAT1] and small nucleolar RNA host gene 1 [SNHG1]) had high expression levels (with an average FPKM >50) across all 16 samples analyzed in the present study (Supplementary Table S2). NEAT1 is reported to be involved in immune system cell regulation and proliferation (Imamura et al., 2014). The SNHG1 expression is reported to be correlated with some members of the TNF pathway, including TAB2 (TGF-beta activated kinase 1/MAP3K7-binding protein 2) and CREB1 (cyclic AMP-responsive element-binding protein 1). In addition, it has been reported that SNHG1 is significantly upregulated in nonsmall cell lung cancer cell lines, resulting in enhanced proliferation (You et al., 2014).
Precursor miRNAs
The expression of total 166 pre-miRNAs has been detected in our RNA-Seq datasets (Supplementary Table S3). Pre-miRNAs are important mediators during the transcription process of miRNAs (Gan and Denecke, 2013). The exact biological function of these noncoding RNAs remains unknown; however, some pre-miRNAs have been shown to have an important role in the immune system and serve as disease biomarkers for different diseases such as Kaposi sarcoma (O'Hara et al., 2009) and primary effusion lymphoma (O'Hara et al., 2008). We identified the high expression of some pre-miRNAs in our datasets, which was reported to control immune cell proliferation and apoptosis, such as hsa-miR181b, hsa-miR-129, and hsa-miR155 (Haapa-Paananen et al., 2013) The high expression of these pre-miRNAs in monocytes may suggest their potential roles in monocyte cell regulation and function as well.
Pseudogenes
Pseudogenes are copies of protein-coding genes that arise from genomic duplication or retrotransposition of mRNA sequences into the genome, followed by accumulation of deleterious mutations due to loss of selection pressure, degenerating eventually into so-called genetic fossils (Porter et al., 2014). The expression of 2819 pseudogenes was detected in our datasets (Supplementary Table S5). Although pseudogenes were known as genomic fossils for several years, some studies have reported that pseudogenes could play critical roles in regulation of their parent genes, and many pseudogenes were transcribed into RNA (Porter et al., 2014; Tian et al., 2007). We identified the high expression of two functional pseudogenes, including MORF4 (mortality factor 4) (Yochum and Ayer, 2002) and MEIS3P1 (Meis homeobox 3 pseudogene 1) (Tian et al., 2007), in our datasets, which have been reported to act as TFs.
Novel transcripts
A salient feature of RNA-Seq is its ability to detect novel transcripts (Wang et al., 2009). We have found 7043 novel transcripts expressed in monocytes, which have not been previously annotated in databases (Supplementary Table S6). Of these, 1362 transcripts could potentially code for peptides (Supplementary Table S7). A comparison of 1362 novel transcripts against PFAM-A domain database resulted in 210 novel transcripts matching at least one protein domain model in which some of them associated with immune-related functions (Supplementary Table S8). However, further functional studies are needed to identify the exact function and mechanism of these novel transcripts in human monocytes.
Transcription factors
TFs are key molecules that control gene transcription (Vaquerizas et al., 2009). Over the past 30 years, several TFs involved in the immune system have been discovered and their mechanisms of action were studied (Smale, 2014). However, no data exist on complete list of TFs expressed in monocytes. Through this study, we identified the expression of 1155 TFs in human classical monocytes (Supplementary Table S9). As the TFs are the major regulators of gene transcription, identification of the genes that are targeted by a specific TF is important for understanding cellular developmental processes, response to stimulus, and disease etiology (Taverner et al., 2004). Using the TRURSUT database, we have found 1339 targeted genes for 445 TFs in monocytes. The list of TF target genes with their regulatory mode (either activates or represses) is presented in Supplementary Table S10. Several TFs were found to regulate a smaller number of genes such as ZSCAN21 (1 target) and CITED2 (2 targets), while others regulate a larger number of genes such as SP1 (305 targets), NFKB1 (226 targets) RELA (223 targets), TP53 (132 targets), E2F1 (108 targets), JUN (93 targets), and STAT1 (41 targets) (Supplementary Table S10).
The interaction network between top 20 highly expressed TFs and their targeted genes is presented in Figure 2. This network contains 226 interactions in between 20 TFs and 146 targeted genes. The GO enrichment analysis on TF target genes showed that they are significantly involved in response to stimulus, biological regulation, immune system process, and death. The TFs in the network, which regulate the most immune system and death-related genes, were STAT1 (signal transducer and activator of transcription 1, 91 kDa), SATA6 (signal transducer and activator of transcription 6, interleukin-4 induced), FOS (FBJ murine osteosarcoma viral oncogene homolog), JUNB (Jun B proto-oncogene), FLI1 (Fli-1 proto-oncogene, ETS transcription factor), ZEP36 (growth factor-inducible nuclear protein NUP475), and DEK (DEK proto-oncogene).

Interaction network analysis of the top 20 highly expressed TFs with their target in human monocytes. The network contains 226 interactions between 20 TFs and 146 targeted genes. The pink color circles represent the TFs, while the blue and green color circles represent the TF target genes. The green color circles represent the genes that are involved in the immune system process and death. TF, transcription factor.
Conclusions
Monocytes are crucial players in the innate immune system and essential for front-line defense against pathogens. While several studies have addressed the functional elements in monocyte subsets (Ancuta et al., 2009; Dong et al., 2013; Ziegler-Heitbrock et al., 2010), the complete gene expression catalog of human monocytes is not yet available. Motivated by the ability of RNA-Seq technology to study gene expression, we performed deep RNA-Seq of monocytes from six healthy subjects (200 million reads per sample) and integrated our dataset with 10 other publicly available RNA-Seq datasets for monocytes to establish the catalog of gene expression in human monocytes. The catalog contains 20,371 genes (including protein coding, noncoding, and pseudogenes) and 82,996 transcripts (including known and novel transcripts). Moreover, we profiled the expression pattern of 1155 TFs in human monocytes. This study provides an important and significant resource for gene expression signatures of human primary monocytes, which could be used as a starting point for postgenomics and system biology research on human monocytes under healthy and diseased states. Additionally, the gene catalog of human primary monocytes provided in this study offers significant promise for the fields of precision medicine, systems diagnostics, immunogenomics, and the development of innovative biomarkers and therapeutic monitoring strategies.
Data availability
The RNA-Seq dataset of human monocytes discussed in this publication can be accessed from the Gene Expression Omnibus database with accession number GSE80095 (www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE80095).
Footnotes
Acknowledgments
The authors would like to thank the Director General of Health, Malaysia, for supporting the work described in this article. The authors would also like to thank all the volunteers for their cooperation in this study. This research work was supported by the High Impact Research (HIR) Grant: UM.S/P/HIR/MOHE/30; the Ministry of Education's Fundamental Research Grant Scheme (FRGS): FP050-2016; the University of Malaya Research Grant (UMRG): RP004C-13AFR; and the IPPP Postgraduate Grant: PG086-2013B.
Author Disclosure Statement
All the authors declare that there is no conflict of interests.
Abbreviations Used
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
