Abstract
Deep transfer learning improves the inference of gene regulatory networks in human cells, reveals disease-associated genes, and identifies network-based druggable targets in human heart disease.
Cell identity derives from the interplay of extrinsic cues and contextual gene expression. Transcription factors (TFs) are gene products that act as key regulators of cell identity by navigating chromatin landscapes, binding to sequence motifs in cis-regulatory elements, and recruiting cofactors to increase or decrease transcription (Badia-I-Mompel et al., 2023). TFs act in a combinatorial fashion to form complex gene circuits (Sorrells and Johnson, 2015; Moura et al., 2020) that lead to cell-type-specific transcription and safeguard cell identity. Gene circuits are complex, dynamic, and built up from key regulatory motifs (Milo et al., 2002; Sorrells and Johnson, 2015). Therefore, it is challenging to understand how TFs control cell identity and how TF perturbation (i.e., knockdown/knockout or over-expression) affects gene circuits and cell phenotypes. Understanding the regulatory logic of gene regulatory network (GRN) dynamics and its impact on cell-fate transitions would thus have broad implications in biomedical research, such as assisting the development of improved protocols for stem-cell differentiation and cellular reprogramming, and facilitating the identification of more effective therapeutics for a broad range of diseases.
GRNs are one way to represent regulatory relationships (interactions between TFs and target genes) within a cell (Badia-I-Mompel et al., 2023; Bravo González-Blas et al., 2023; Kakimoto et al., 2023). GRNs can uncover master regulators of cell identity and may predict the outcome of gene perturbation (Kakimoto et al., 2023). However, most GRN inference methods still have low predictive power, at least in part due to the limited availability of perturbation data. Single-cell technologies and artificial intelligence (AI) models have led to numerous methods of GRN inference (Badia-I-Mompel et al., 2023). Machine learning (ML) is a subset of AI that enables learning from experience and has become an attractive approach to improve GRNs (Kakimoto et al., 2023). Further, deep learning (i.e., ML based on neural networks) has become widely used in genomics due to its higher capacity and flexibility, as it allows millions of trainable parameters (Zou et al., 2019).
Theodoris et al. (2023) described a deep learning model, Geneformer, to tackle data shortage (Fig. 1). Geneformer was trained with publicly available human single-cell RNA-sequencing (scRNA-seq) datasets in a self-supervised (able to rely on unlabeled data), attention-based (learns the most relevant genes for the phenotype), and context-aware (perceives contextual nuances) manner (Fig. 1). The dataset was quite unbalanced, with greater representation of brain, immune, liver, and heart cells. Transcripts were ranked per cell. Cellular reprogramming to pluripotency was mimicked in silico by placing the Yamanaka factors at the top of the transcript rankings, and the model predicted the upregulation of other pluripotency genes. Based on deep transfer learning (Mignone et al., 2020), Geneformer was applied to untrained scenarios (new tasks) for biological discovery (Fig. 1). To improve predictions, Geneformer underwent task-specific “fine-tuning” on data associated with each new task (Fig. 1). Geneformer provided better predictions than other ML models, despite fine-tuning with only 10,000–35,000 cells. The contexts of Geneformer validation included copy number variations (dosage-sensitive TFs), genes marked by bivalent domains (stretches of chromatin simultaneously marked with H3K4me3 and H3K27me3), and TF binding range (0.1–3.0 vs. >3.0 kilobases from the transcription start site) (Fig. 1). Despite these results, the limited “benchmarking” against other GRN inference methods (based on scRNA-seq alone or on multiomics, i.e., scRNA-seq combined with single-cell chromatin accessibility datasets) did not reveal potential advantages or limitations in comparison to available tools.
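The per-cell ranking and the in silico reprogramming step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a plain ranking by raw expression value (any normalization used by the actual model is omitted), the function names `rank_encode`, `overexpress`, and `delete_genes` are hypothetical, and the toy counts are invented for illustration. The sketch stops before any model call; in practice the perturbed ranking would be passed to the trained model.

```python
from typing import Dict, List, Sequence


def rank_encode(expression: Dict[str, float]) -> List[str]:
    """Order detected genes from highest to lowest expression.

    Genes with zero expression are dropped; ties are broken
    alphabetically so the output is deterministic.
    """
    detected = [(gene, value) for gene, value in expression.items() if value > 0]
    detected.sort(key=lambda item: (-item[1], item[0]))
    return [gene for gene, _ in detected]


def overexpress(ranking: Sequence[str], genes: Sequence[str]) -> List[str]:
    """Simulate over-expression by moving `genes` to the top of the ranking."""
    moved = list(genes)
    moved_set = set(moved)
    rest = [gene for gene in ranking if gene not in moved_set]
    return moved + rest


def delete_genes(ranking: Sequence[str], genes: Sequence[str]) -> List[str]:
    """Simulate deletion by removing `genes` from the ranking."""
    removed = set(genes)
    return [gene for gene in ranking if gene not in removed]


# Hypothetical toy counts for a single fibroblast-like cell (not real data).
cell = {
    "COL1A1": 120.0,
    "ACTB": 95.0,
    "VIM": 60.0,
    "MYC": 4.0,
    "KLF4": 2.0,
    "NANOG": 0.0,
}

ranking = rank_encode(cell)

# Yamanaka factors (OCT4 is encoded by POU5F1).
yamanaka = ["POU5F1", "SOX2", "KLF4", "MYC"]
reprogrammed = overexpress(ranking, yamanaka)

print(ranking)       # original rank-ordered genes
print(reprogrammed)  # Yamanaka factors first, remaining genes in original order
```

Representing a cell as an ordered list of genes means a perturbation is simply an edit to that list, which is why over-expression and deletion reduce to moving or removing entries in this sketch.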

Prediction of GRNs with Geneformer. An ML (i.e., deep transfer learning) model was trained with the scRNA-seq data of nearly 30 million human cells. The model was fine-tuned with datasets for each new task. The model was further tested by simulations of gene deletion or activation (in silico gene perturbation) and modulations of signaling pathways (in silico treatment). AUC is the area under the ROC curve (a performance measurement of machine learning models; better models score closer to 1.0). CNVs, copy number variants; ECs, endothelial cells; ESCs, embryonic stem cells; GRNs, gene regulatory networks; iPSCs, induced pluripotent stem cells; ML, machine learning; scRNA-seq, single-cell RNA-sequencing; ROC, receiver operating characteristic; TFs, transcription factors.
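For readers unfamiliar with the AUC metric mentioned in the legend, the following sketch shows one standard way to compute it: as the fraction of positive/negative pairs in which the positive example receives the higher score, with ties counted as one half. The function name `roc_auc` and the labels and scores are hypothetical and are not taken from the study.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive example
    scores higher than a randomly chosen negative one (ties count 0.5).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier outputs: 1 = positive class, 0 = negative class.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]

auc = roc_auc(labels, scores)
print(round(auc, 3))  # 8 of 9 positive/negative pairs are ordered correctly
```

A score of 0.5 corresponds to random ordering and 1.0 to perfect separation of the two classes, which is the sense in which better models score closer to 1.0.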
To estimate the impact of dataset size on fine-tuning, the authors compared varying numbers of endothelial cells (ECs) from the Heart Atlas (Fig. 1). The use of 5000–30,000 ECs provided similar results, while 2500 ECs led to a modest reduction in model performance. A smaller dataset of 884 ECs (from healthy and dilated aortas) provided reasonable performance, thus reinforcing the importance of matching the dataset to the specific task. In silico gene perturbation provided biological insights (Fig. 1). Gene perturbation in fetal cardiomyocytes revealed TFs associated with human heart function. CRISPR-mediated knockout in induced pluripotent stem cell-derived cardiac tissue validated the role of TEAD4 in contractile potential. After fine-tuning with ∼600,000 cells, Geneformer predicted ∼1000 genes associated with cardiomyopathies (Fig. 1). As a complementary approach, the authors simulated the modulation of signaling pathways (in silico treatment) in cardiomyopathy datasets. Knockout of GSN and PLN diminished the contractile stress of TITIN-mutant cells, thus validating the prediction of druggable targets.
How did Geneformer “learn” about GRN dynamics? The self-supervised training culminated in greater attention to key components of transcriptional networks (TFs, central regulatory nodes, and genes determining cell identity). In silico perturbation of the master regulators GATA4 and TBX5 in cardiomyocytes had a greater effect on direct target genes than on indirect target genes. Co-perturbation had a greater effect on cobound target genes than the sum of the single-TF in silico perturbations. The development of Geneformer made a strong case for the potential of transfer learning to improve predictions of gene networks. However, questions remain about how far this potential extends. Would the inclusion of datasets from other species (e.g., mouse) improve predictions in human cells? How much would its predictive power improve as a multiomics tool? Can it make predictions in other species?
The fast-paced incremental developments in GRN inference suggest that these methods will become useful tools for scientists dissecting the regulatory modes of biological phenomena. Unfortunately, the gene interactions driving GRN dynamics remain much less understood, and dissecting them should be the focus of future efforts. Another major goal of the field should be to improve predictions of cellular phenotypes associated with specific GRN changes. Deep learning (and perhaps deep transfer learning) is here to stay in the field of GRN inference, and improved GRNs will contribute to dissecting developmental processes (Kakimoto et al., 2023), cell identity (Bravo González-Blas et al., 2023), species-specific gene circuits (Moura et al., 2020), and disease (Cook and Vanderhyden, 2020).
Author's Contribution
M.T.M.: writing—original draft, review, and editing.
Footnotes
Author Disclosure Statement
The author declares there are no conflicting financial interests.
Funding Information
No funding was received for this article.
