Abstract
Background:
Colorectal cancer (CRC) is a leading cause of tumor-related mortality. Recent studies have shown that the transcriptome plays an important role in the occurrence and development of CRC. However, a comprehensive repository of CRC transcriptome sequencing data is unavailable. In the present study, we constructed a CRC gene expression database (iCRCexp; http://icrcexp.omicsbio.info/).
Method:
We collected CRC-related transcriptome datasets from The Cancer Genome Atlas (TCGA) and National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) databases up to 2022. The sequencing data were preprocessed through a unified pipeline and subsequently analyzed. CRC-related genes and drugs were identified via text mining of PubMed abstracts.
Results:
A total of 18 466 tissue samples from 231 studies, 2429 CRC-related genes, and 1852 CRC-related drugs were collected and integrated into iCRCexp. Among these studies, 251 CRC-related datasets were identified with abundant characteristic information, including tissue source, baseline characteristics, therapeutic responses, recurrence and metastasis, and survival. We conducted differential expression, correlation, and survival analyses, and predicted potential target drugs for CRC-related genes by calculating connectivity scores. We then integrated these analysis results through network construction and presented them in the iCRCexp database.
Conclusion:
A comprehensive resource, including CRC-related gene and medication information and an expression analysis platform, was constructed for the CRC community.
Introduction
According to the 2018 Global Cancer Statistics, colorectal cancer (CRC) poses a serious threat to human health and has one of the highest incidence and mortality rates worldwide. 1 Since 1990, although CRC mortality in the elderly has steadily declined, mortality among young patients under 50 years of age has gradually increased, at an average annual rate of 1.3%. 4 Therefore, early diagnosis and precise treatment of CRC remain significant challenges. The AJCC eighth edition staging manual divides CRC into 4 clinical stages (I-IV) based on primary tumor size, depth of invasion, and extent of lymph node and distant metastasis. 5 High-risk stage II cancers (defined by fewer than 12 lymph nodes examined, poorly differentiated tumors [excluding those with high microsatellite instability, MSI-H], neurovascular invasion, intestinal obstruction, tumor site perforation, positive surgical margin, or unknown differentiation) and stage III cancers are collectively considered locally advanced CRC, accounting for approximately 26% of newly diagnosed cases. 6 Neoadjuvant chemoradiotherapy (NACRT) can significantly reduce tumor size; achieve complete remission in 20% to 30% of patients; preserve the function of the anus, bladder, and other organs; and potentially prolong patient survival.2,3 Therefore, further improvements in NACRT efficacy are critical for improving both the quality of life and survival time of patients with locally advanced CRC.
With the advent of transcriptome technologies, transcriptome analysis has become a common approach to studying tumor development, and novel therapeutic strategies have been proposed to overcome therapy resistance.4-6 Recent evidence indicates that several transcription factors act as oncogenes or tumor suppressors in CRC progression. For example, NF-κB activation can induce gene transcription, promoting colorectal tumorigenesis and resistance to 5-fluorouracil.7,8 Conversely, downregulation of TP53 has been proposed as a biomarker of prognosis and therapeutic response.9,10 Targeting transcription factors thus represents a promising strategy for CRC therapy. Bortezomib, an FDA-approved drug for myeloma treatment, may also be effective in CRC patients exhibiting NF-κB activation. 11
A series of databases, such as GEPIA and TIMER, which are based on The Cancer Genome Atlas (TCGA) and National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) databases, allow researchers to analyze gene expression efficiently across different cancers.12,13 However, transcriptomic gene expression data specific to CRC are not well integrated in public databases, and obtaining high-throughput sequencing data can be challenging for researchers lacking bioinformatics expertise. To address this, we constructed iCRCexp, an integrative database of colorectal cancer gene expression profiles, based on the TCGA and NCBI GEO databases. CRC-related genes and drugs were identified through text mining of PubMed abstracts. The database is publicly accessible at http://icrcexp.omicsbio.info/. We anticipate that iCRCexp will facilitate CRC research by helping researchers better understand the biology of CRC and predict potential target drugs for CRC treatment.
Materials and Methods
Collection of CRC-Related Genes and Drugs
Following the construction method of the HNCDB database,14 CRC-related abstracts were obtained from NCBI PubMed using disease-related keywords for CRC via the “eSearch” and “eFetch” methods. The disease-related keywords included “colorectal cancer,” “colon cancer,” “rectal cancer,” “colorectal carcinoma,” “colon carcinoma,” “rectal carcinoma,” “carcinoma of the large intestine,” “CRC,” and “colon adenocarcinoma.” A total of 270 052 CRC-related abstracts were retrieved and used as the basis for subsequent analysis. The official full names, aliases, and gene IDs for all gene symbols were downloaded from the HGNC, 15 ENTREZ, 16 OMIM, 17 and UniProt 18 databases. Aliases, trade names, target genes, and functional information for a total of 7331 drugs were collected from DrugBank. 19
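The keyword-based retrieval can be illustrated with a minimal sketch that only builds an ESearch request URL for the public NCBI E-utilities endpoint, without sending it. The OR-combination of keywords, the [Title/Abstract] field restriction, the paging values, and the function name build_esearch_url are assumptions made for illustration; the exact query used for iCRCexp is not stated.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities endpoint for ESearch.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Disease-related keywords listed in the text.
CRC_KEYWORDS = [
    "colorectal cancer", "colon cancer", "rectal cancer",
    "colorectal carcinoma", "colon carcinoma", "rectal carcinoma",
    "carcinoma of the large intestine", "CRC", "colon adenocarcinoma",
]


def build_esearch_url(keywords, retstart=0, retmax=1000):
    """Build an ESearch URL that matches any keyword (hypothetical helper)."""
    # Assumption: keywords are OR-combined and restricted to title/abstract.
    term = " OR ".join(f'"{kw}"[Title/Abstract]' for kw in keywords)
    params = {
        "db": "pubmed",        # search the PubMed database
        "term": term,          # the combined query string
        "retstart": retstart,  # offset of the first returned ID
        "retmax": retmax,      # number of IDs returned per request
        "retmode": "json",     # ask for a JSON response
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_esearch_url(CRC_KEYWORDS)
print(url)
```

The PubMed IDs returned by such a request would then be passed to EFetch in batches to download the abstracts themselves.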
All CRC-related abstracts were scanned for gene and drug names, and the matched gene and drug strings were filtered and manually verified. A total of 2429 CRC-related genes and 1852 CRC-related drugs were identified.
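The scanning step can be sketched as dictionary-based matching, assuming a lookup from each alias to its official symbol. The toy alias table, the case-sensitive whole-token matching rule, and the helper names are illustrative assumptions; the filtering and manual verification described above are not modeled.

```python
import re


def build_matcher(alias_to_symbol):
    """Compile one regex that matches any alias as a whole token."""
    # Longest aliases first, so longer names win over their prefixes.
    aliases = sorted(alias_to_symbol, key=len, reverse=True)
    body = "|".join(re.escape(alias) for alias in aliases)
    # Lookarounds stop "MMP1" from matching inside "MMP10".
    return re.compile(r"(?<![A-Za-z0-9])(" + body + r")(?![A-Za-z0-9])")


def find_genes(abstract, matcher, alias_to_symbol):
    """Return the set of official symbols mentioned in one abstract."""
    return {alias_to_symbol[m.group(1)] for m in matcher.finditer(abstract)}


# Toy alias table (hypothetical subset of an HGNC-style dictionary).
alias_to_symbol = {
    "MMP1": "MMP1",
    "matrix metallopeptidase 1": "MMP1",
    "TP53": "TP53",
    "p53": "TP53",
}
matcher = build_matcher(alias_to_symbol)
abstract = "High MMP1 expression and loss of p53 were observed; MMP10 was unchanged."
hits = find_genes(abstract, matcher, alias_to_symbol)
print(sorted(hits))
```

Short or ambiguous aliases still produce false positives under any such rule, which is why the matched strings were filtered and checked manually.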
Integration of CRC-Related Transcriptome Sequencing Datasets
CRC-related expression profiles from high-throughput sequencing and microarray datasets were obtained from the NCBI GEO database using the following keywords: “colon,” “rectal,” “rectum,” and “Homo sapiens.” The CRC-related expression datasets were manually reviewed. CRC-related expression data from the TCGA database were also downloaded from the UCSC Xena database (https://xena.ucsc.edu/), and clinical information relevant to CRC-related expression datasets was collected.
Probe Reannotation and Normalization of Microarray Expression Data
Probe sequences from the different microarray platforms were reannotated. First, probe sequences were downloaded from the official websites. The latest human genome files (“Homo_sapiens.GRCh38.dna.primary_assembly.fa,” “Homo_sapiens.GRCh38.102.gtf”) were obtained from the ENSEMBL website (https://asia.ensembl.org/index.html). The genome index was built from the human reference genome files using the Rsubread package. All probe sequences from the different platforms were aligned to the human genome (GRCh38), and the alignments were processed via the Rsamtools package for coordinate mapping. All probes were mapped to gene symbols using the resulting reannotation file. If a probe sequence corresponded to both an mRNA and a lncRNA, only the annotation of the mRNA was retained. Microarray expression data were normalized using a previously reported method. 14 When a gene symbol corresponded to multiple probes, the gene expression level was calculated as the average of all corresponding probe values.
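The final averaging step can be sketched as follows, assuming probe-level expression is held as one list of per-sample values per probe. The probe and gene identifiers are placeholders, and the function name collapse_probes is hypothetical.

```python
from collections import defaultdict


def collapse_probes(probe_expr, probe_to_gene):
    """Average probe-level values into one value per gene per sample."""
    grouped = defaultdict(list)
    for probe, values in probe_expr.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:  # skip probes that did not map to a gene
            grouped[gene].append(values)
    # zip(*rows) walks the samples column by column across a gene's probes.
    return {
        gene: [sum(col) / len(col) for col in zip(*rows)]
        for gene, rows in grouped.items()
    }


# Two samples; p1 and p2 map to the same gene, p4 is unmapped.
probe_expr = {
    "p1": [2.0, 4.0],
    "p2": [4.0, 6.0],
    "p3": [1.0, 1.5],
    "p4": [9.0, 9.0],
}
probe_to_gene = {"p1": "GENE_A", "p2": "GENE_A", "p3": "GENE_B"}
gene_expr = collapse_probes(probe_expr, probe_to_gene)
print(gene_expr)
```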
Differential Expression and Survival Analysis
Differential expression analysis was performed for different groups according to the clinical information of each dataset. The clinical characteristics used for grouping were as follows: tissue histopathology, NACRT response, advanced CRC treatment response, clinical stage, distant metastasis, disease recurrence, mismatch repair (MMR) status, and gene mutation. Differential expression analysis was conducted using the “limma” R package for each clinical group. NACRT efficacy was classified as good or poor according to tumor regression grade (TRG) or pathological regression in the specific dataset. The group information for the NACRT response is detailed in Supplemental Table 1. Treatment efficacy for metastatic CRC was similarly grouped according to the clinical information of the dataset. The treatment strategies are listed in Supplemental Table 2.
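As a rough illustration of a two-group comparison for one gene, the sketch below computes a log2 fold change and a Welch t statistic. It assumes expression values are already on a log2 scale and is a simplification: limma fits linear models with empirical Bayes moderated t statistics, which this example does not reproduce. All function names and values are illustrative.

```python
import math
from statistics import mean, variance


def log2_fold_change(case_values, control_values):
    """Difference of group means for values already on a log2 scale."""
    return mean(case_values) - mean(control_values)


def welch_t(case_values, control_values):
    """Welch t statistic (unequal variances); not limma's moderated t."""
    se = math.sqrt(
        variance(case_values) / len(case_values)
        + variance(control_values) / len(control_values)
    )
    return (mean(case_values) - mean(control_values)) / se


# Illustrative log2 expression values for one gene.
tumor = [8.0, 9.0, 10.0]
normal = [6.0, 7.0, 8.0]
lfc = log2_fold_change(tumor, normal)
t_stat = welch_t(tumor, normal)
print(lfc, t_stat)
```

A log2 fold change of 2 corresponds to a four-fold higher mean expression in the first group.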
We collected survival data, including overall survival (OS), disease-free survival (DFS), progression-free survival (PFS), distant metastasis-free survival (DMFS), and recurrence-free survival (RFS). All survival times are given in months. Survival analysis was performed via the “survival” R package. A univariate Cox proportional hazards regression model was employed to evaluate the correlation between gene expression and survival. A heatmap was generated via the ggplot2 R package to show the correlation between gene expression and survival in all CRC-related datasets. Gene expression levels were divided into high and low groups based on the median or optimal cutoff values. A P-value < .05 was considered statistically significant. Significance in the heatmap is indicated as follows: *, .01 < P < .05; **, .001 < P < .01; ***, P < .001.
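The median-based grouping and the conventional one-to-three-star significance marks can be sketched as below. Assigning values equal to the median to the low group, and the handling of P-values that fall exactly on a threshold, are assumptions, since the text does not specify them; the optimal cutoff method is not shown.

```python
from statistics import median


def median_split(expression):
    """Label each sample high or low relative to the median expression."""
    cutoff = median(expression)
    # Assumption: values equal to the median go to the low group.
    return ["high" if value > cutoff else "low" for value in expression]


def significance_stars(p_value):
    """Map a P-value to the star marks used in significance heatmaps."""
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return ""  # not significant at the .05 level


groups = median_split([1.0, 5.0, 3.0, 7.0])
marks = [significance_stars(p) for p in (0.2, 0.03, 0.005, 0.0001)]
print(groups, marks)
```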
Construction of a CRC-Related Gene‒drug Connectivity Map
CRC-related genes and drugs collected from text mining were used to construct a drug‒gene connectivity map. A connectivity score (θgd) was calculated for each CRC-related gene-drug pair based on a regularized log-odds function. The calculation formula is as follows 14 :
where dfgd is the number of CRC-related PubMed abstracts containing both the CRC-related gene g and drug d; dfg and dfd are the numbers of PubMed abstracts containing the CRC-related gene and drug, respectively; P is the total number of CRC-related PubMed abstracts; and Corrgd is the average Pearson correlation coefficient between the targeted drug d and gene g across all CRC-related gene expression datasets. λ was set to 1 as a constant to prevent computational errors. Pearson correlation analysis was performed using the “cor” function in the “stats” R package (version 4.0.2).
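The two ingredients named in these definitions can be sketched separately. The co-occurrence term below uses one common regularized log-odds form, log(dfgd × P + λ) − log(dfg × dfd + λ), which is an assumption and may differ from the published formula; how it is combined with Corrgd is deliberately not modeled. The counts in the usage lines are invented for illustration, apart from the total of 270 052 abstracts.

```python
import math


def cooccurrence_log_odds(df_gd, df_g, df_d, total, lam=1.0):
    """Regularized log-odds of co-mention versus independence (assumed form)."""
    # lam keeps both logarithms defined when a count is zero.
    return math.log(df_gd * total + lam) - math.log(df_g * df_d + lam)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mean_correlation(datasets):
    """Average Pearson r over datasets, each an (x, y) pair of vectors."""
    return sum(pearson(x, y) for x, y in datasets) / len(datasets)


# Hypothetical counts: a pair mentioned together more often than chance.
enriched = cooccurrence_log_odds(df_gd=50, df_g=1000, df_d=500, total=270052)
# A pair that is never mentioned together.
absent = cooccurrence_log_odds(df_gd=0, df_g=1000, df_d=500, total=270052)
corr = mean_correlation([([1, 2, 3], [2, 4, 6]), ([1, 2, 3], [3, 2, 1])])
print(enriched, absent, corr)
```

Under this assumed form, a positive value means the gene and drug are mentioned together more often than expected if their mentions were independent.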
Construction of a Web Database
JavaScript was used to build the webpages, and the data were subsequently uploaded to a Linux server to build the web database. The reporting of this study conforms to the
Results
Data Summary
There were 270 052 CRC-related abstracts, of which 114 616 contained CRC-related genes and 106 587 contained CRC-related drugs (Figure 1A). Additionally, we collected 251 CRC-related expression datasets, annotated via 91 platforms, from the NCBI GEO and TCGA databases (Supplemental Table 3). Probes from most platforms were reannotated. Expression data were obtained for a total of 18 466 samples. A total of 18 677 coding genes and 13 827 lncRNAs were identified across all samples. When the sequencing data were downloaded, the clinical information of all datasets, including age, sex, microsatellite instability (MSI) status or MMR protein expression, clinical stage, gene mutations, therapeutic effect, metastasis, recurrence, and survival information, was also collected.

Figure 1. Construction of the iCRCexp database: (A) flowchart of the database construction process and (B) web interface of the iCRCexp database.
Web Interface and Usage
Gene
The interface features an interactive heatmap providing differential expression results for 19 516 genes across different groups of CRC-related datasets (Figure 2). Based on the clinical information of the datasets, we performed differential expression analysis across the following groups: tumor tissue versus normal tissue; NACRT response: good versus poor; mCRC therapy response: good versus poor; clinical stage: early (I + II) versus advanced (III + IV); distant metastasis: yes versus no; disease recurrence: yes versus no; primary tumor versus metastatic tumor; normal liver tissue versus metastatic liver tissue; MMR status: dMMR versus pMMR; KRAS mutation: wild-type versus mutant; BRAF mutation: wild-type versus mutant; TP53 mutation: wild-type versus mutant; PIK3CA mutation: wild-type versus mutant; and APC mutation: wild-type versus mutant. The color of each box in the interactive heatmap corresponds to the log2 (fold change) value derived from the differential analysis. Users can explore the differential expression of genes of interest in each dataset using the search function. By clicking a box, users are directed to a webpage displaying a box plot of the expression levels of the gene of interest in the selected group. Box plots are annotated with the sample size of the 2 groups and the P-value of the differential analysis.

Figure 2. The web interface of “Gene” in iCRCexp: (A) an interactive heatmap displaying the differential expression results of genes across all the CRC gene expression datasets and (B and C) detailed information of a selected gene in a certain dataset.
Connectivity Map
The “Connectivity Map” module displays the connectivity between CRC-related genes and CRC-related drugs, providing potential gene-targeted therapies for CRC treatment (Figure 3A). In the heatmap, the rows contain 2429 CRC-related genes, and the columns contain 1852 CRC-related drugs. Each box indicates the “connection score” of a specific gene-drug pair. The numbers of PubMed abstracts containing each CRC-related gene and drug are shown in the first column and first row, respectively. Users can explore the connectivity map by entering genes or drugs into a search box. A “connection score” greater than zero indicates that the drug may potentially target the corresponding gene.

Figure 3. (A) The web interface of “Connectivity Map” in iCRCexp and (B) the web interface for the “Differential expression analysis” module of “Analysis” in iCRCexp.
Analysis
The “Analysis” module comprises 3 parts: “Differential Expression Analysis,” “Correlation Analysis,” and “Survival Analysis.” Users can interactively analyze the sequencing data of 18 466 tissue samples collected from the GEO and TCGA databases. “Differential Expression Analysis” shows a heatmap of the top 500 and the bottom 500 significantly differentially expressed genes in each dataset (Figure 3B). By clicking “show results,” users can view and download the differential expression result tables for a specific group. In “Correlation Analysis,” users can input 2 genes to generate a scatter plot showing the correlation between them (Figure 4A). Expression correlation coefficients and P-values are provided for the selected datasets. In “Survival Analysis,” users can explore the prognostic value of a specific gene in CRC-related datasets by selecting a gene symbol and survival type (Figure 4B). Results include 2 heatmaps and 1 Kaplan‒Meier survival curve. The heatmaps display the natural logarithm of the univariate Cox hazard ratio and the log-rank test P-value using 2 grouping methods: median-based (right heatmap) and optimal cutoff-based (left heatmap). The Kaplan‒Meier survival curve is plotted using the optimal gene expression cutoff (Supplemental Figure 1).

Figure 4. (A) The web interface for the “Correlation analysis” module of “Analysis” in iCRCexp and (B) the web interface for the “Survival analysis” module of “Analysis” in iCRCexp.
An Application Example: The Oncogenic Role of MMP1 in CRC and Its Clinical Implications, as Revealed by iCRCexp
MMP1 (matrix metallopeptidase 1) plays prominent roles in CRC invasion and metastasis. 21 High MMP1 expression is associated with poor survival in CRC patients.21-23 By searching for “MMP1” in iCRCexp, we found that MMP1 mRNA expression was significantly higher in CRC primary tumors than in normal tissue (Figure 5A). High MMP1 expression correlated with poor OS and RFS (Figure 5C). Moreover, MMP1 expression was positively correlated with clinical stage and significantly upregulated in patients with BRAF mutations (Figure 5A). In locally advanced CRC, high MMP1 expression was negatively correlated with NACRT response (Figure 5A). Similarly, in advanced CRC patients treated with FOLFOX, responders exhibited lower MMP1 expression than non-responders (Figure 5A). These findings suggest that MMP1 upregulation may be associated with chemoradiation resistance in CRC. Correlation analysis revealed a significant positive association between MMP1 and DNMT1 expression, suggesting that MMP1-mediated chemoradiation resistance may involve DNA methylation (Figure 5B). Connectivity map analysis suggests that amlodipine may target MMP1. Previous studies have shown that amlodipine possesses antitumor effects and can enhance the efficacy of regorafenib in advanced CRC. Therefore, based on iCRCexp, we hypothesized that MMP1 may serve as a novel target to reverse chemoradiation resistance, and that amlodipine may be a potential drug targeting MMP1.

Figure 5. Clinical significance of MMP1 in CRC as revealed by iCRCexp: (A) significant differences in the expression of MMP1 between groups in the CRC gene expression datasets collected in iCRCexp, (B) correlation between MMP1 and DNMT1 expression in CRC tissues, and (C) correlations between MMP1 expression and 2 survival sets (DFS and RFS).
Discussion
Microarray and high-throughput sequencing technologies have been widely used to explore the molecular mechanisms underlying tumorigenesis and therapeutic resistance.24-27 Through bioinformatics analyses, we explored the biological characteristics of the tumors and identified potential tumorigenesis markers and therapeutic targets. 28 iCRCexp is an interactive web application for gene expression analysis based on the sequencing data of 14 831 tumor and 3635 normal tissue samples from the TCGA and NCBI GEO databases. The iCRCexp analysis results cover 19 516 genes, 1852 CRC-related drugs, and 251 CRC-related transcriptome sequencing datasets. By exploring the iCRCexp database, researchers who lack experience in high-throughput sequencing analysis can rapidly assess the clinical significance of specific genes in CRC. Moreover, hypotheses regarding interactions between 2 genes can be evaluated by examining gene correlation in iCRCexp. In 2006, Lamb et al constructed a connectivity map to identify connections between small molecules, genes, and diseases. However, this resource was constructed by analyzing the changes in gene expression signatures in 5 human cancer cell lines treated with various small molecules, and no CRC cells were included. Based on HNCDB, we adopted a simple and convenient strategy to link CRC-related genes to drugs. 14 Therefore, iCRCexp can suggest potentially effective targeted drugs for CRC-related genes by integrating CRC-related genes with evidence extracted from PubMed abstracts.
Previously published CRC-related databases include the CBD (http://sysbio.suda.edu.cn/CBD/), ResMarkerDB (http://www.resmarkerdb.org), and CRC-EBD (http://www.sysbio.org.cn/EBD/).29-31 Zhang et al constructed the CBD database via text mining of PubMed literature published between 1986 and 2017. The CBD identified 870 CRC biomarkers, including 35 DNA, 94 RNA, 583 protein, and 158 other biomarkers. Researchers can explore the prognostic value, experimental sources, and expression sites of these 870 biomarkers in CRC. Analyses of miRNA‒gene interactions and protein-protein interactions, as well as Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses, can also be performed via this database. In addition, researchers can explore the geographical distribution, sex distribution, biological characteristics, and clinical application types of CRC biomarkers of interest.
In 2020, the CBD database researchers established the CRC-EBD database, an epigenetic biomarker database for CRC, integrating 355 epigenetic biomarkers with detailed information on modification types, applications, and biomarker classes. In 2019, Spanish researchers constructed the ResMarkerDB database as a comprehensive resource for biomarkers of response to antibody therapy in breast cancer and CRC. The database integrates 187 CRC-related drug response biomarkers through text mining of guidelines, clinical studies, and basic research abstracts. However, coding RNA accounted for only 17% of the database, and clinical information was largely unavailable. These databases primarily explore the clinical significance of biomarkers via text mining of PubMed literature, but they include only a limited number of genes, and information regarding their association with tumorigenesis is restricted.
Other databases are based on transcriptome sequencing data from cancer research. Examples include the GEPIA database (http://gepia2.cancer-pku.cn/), TIMER database (http://timer.comp-genomics.org/), PrognoScan database (http://dna00.bio.kyutech.ac.jp/PrognoScan/), and Kaplan‒Meier Plotter database (http://kmplot.com/). The TIMER and Kaplan‒Meier Plotter databases explore expression differences and prognostic value across cancers using the TCGA database. 32 Although the GEPIA database integrates several GEO datasets with TCGA datasets, the sample size of the CRC studies is only 308 cases, 12 limiting subgroup analyses. The PrognoScan database includes only 4 GEO transcriptome datasets for assessing gene prognostic value. 33
In contrast, iCRCexp contains most of the bulk transcriptome sequencing data of CRC from the TCGA and NCBI GEO databases, and can also be used to search for potential drugs targeting CRC-related genes. Compared with other online databases, the iCRCexp database provides more comprehensive gene expression information in CRC, facilitates exploration of CRC biology and drug resistance, and helps identify potentially effective targeted drugs.
However, iCRCexp has limitations: it contains only bulk transcriptome data, with no genomics, metabolomics, or proteomics data. The connectivity map is based solely on sequencing data without experimental verification. Additionally, multivariate Cox analysis could not be performed in survival analysis because most samples lacked complete clinical information.
Conclusion
In summary, iCRCexp represents a valuable public resource for the CRC research community, providing a comprehensive collection of CRC-related bulk transcriptomic sequencing datasets. It enables researchers to explore gene expression patterns, investigate biological mechanisms, and validate hypotheses in CRC studies.
Supplemental Material
sj-docx-1-cix-10.1177_11769351261454653 – Supplemental material for iCRCexp: An Integrative Database for Colorectal Cancer-Associated Gene Expression Profiles
Supplemental material, sj-docx-1-cix-10.1177_11769351261454653 for iCRCexp: An Integrative Database for Colorectal Cancer-Associated Gene Expression Profiles by Yan Yuan, Bi-Jin Cao, Zhi-Kai Qian, Ze-Kun Liu, Wei-Wei Xiao, Xing-Yang Li, Zhi-Xiang Zuo, Ze-Xian Liu and Yuan-Hong Gao in Cancer Informatics
Footnotes
Acknowledgements
Not applicable.
Author Note
Full list of author information is available at the end of the article.
Ethical Considerations
Not applicable.
Consent to Participate
Not applicable.
Consent for Publication
Not applicable.
Author Contributions
YY, CBJ, and QZK constructed the database and drafted the manuscript. LZK and XWW collected data. LXY prepared Figure 1-
and supplemental materials. ZZX and LZX greatly improved the database and revised the manuscript. LZX and GYH designed the study. All the authors have read and approved the final manuscript.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was funded by the Guangzhou Municipal Science and Technology Bureau Basic Research Project (No. 2024A04J4494) and the Scientific Research Promotion Project of Guangzhou Medical University (No. 2024SRP149).
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
All the data used in this study were acquired from the GEO (https://www.ncbi.nlm.nih.gov/geo/) and TCGA datasets (https://xenabrowser.net/). iCRCexp is available at http://icrcexp.omicsbio.info/.
Supplemental Material
Supplemental material for this article is available online.
