DiGeS-FN: Assessing genetically associated diseases using semantic, network, and functional scores

Abstract

Identifying associations among diseases is essential for advancing our understanding of disease mechanisms, enhancing diagnostics, facilitating drug repurposing, and guiding new therapeutic development. Despite substantial progress in decoding disease biology, the molecular underpinnings, therapeutic targets, and phenotypic traits of many diseases remain poorly understood. Previous studies have typically relied on either single similarity metrics or weighted combinations of multiple metrics, often lacking objectivity and standardization. In this work, we systematically evaluate and compare state-of-the-art similarity metrics across three distinct categories—semantic, functional, and network-based—to identify the most effective representative from each. Our analysis reveals SemSim as the optimal semantic metric, FunSim for functional similarity, and NetSim for network similarity. Leveraging these findings, we propose DiGeS-FN (Disease-Gene associations using Semantic, Functional, and Network metrics), an integrated framework for comprehensive disease similarity assessment. Experimental results demonstrate that DiGeS-FN achieves an AUC of 0.81, with a high true positive rate and a low false positive rate. The framework effectively recovers well-established disease associations, including atherosclerosis–myocardial infarction, asthma–bronchitis, and asthma–chronic obstructive airway disease, thereby validating its reliability. Notably, it also uncovers a novel association between polycystic ovary syndrome and endometriosis, supported by shared gene ontologies and pathways. These findings demonstrate the dual potential of DiGeS-FN to both validate known disease relationships and uncover novel genetically associated disease pairs.

1 Introduction

Identifying associations among diseases is an evolving field of study in contemporary biological and medical science. Understanding relationships among disease pairs can deepen our knowledge of the molecular mechanisms underlying human diseases and can improve how diseases are classified. The quantitative evaluation of disease similarity, which builds on qualitative associations,^1–5 is gaining interest because of its role in identifying disease-causing genes,^6,7 discovering novel drug indications,⁸ and identifying microRNA functional associations.⁹ Associations between diseases manifest along several dimensions: clinically, related diseases exhibit remarkably similar phenotypic characteristics such as symptoms and indications, whereas genetically, a single gene can harbor diverse mutations. Diseases can also be characterized using varied biomedical data. In biomedical studies, the relationships between diseases are primarily determined using three types of metrics: semantic, functional, and network. Each type provides a distinct approach to determining disease similarity.

Semantic similarity metrics compare diseases based on the terminology and descriptions used to define them. They often use the Disease Ontology (DO) to compute disease similarity, relying on shared attributes and hierarchical associations. DO provides a standardized vocabulary for disease classification and a structured understanding of diseases based on symptoms. Although bio-ontology terms can effectively define the concepts underlying diseases and their semantic relationships, their applicability is limited because not all diseases fit an ontological structure. For this reason, additional metrics are needed to compute disease associations. Functional metrics assess disease similarity based on biological functions, such as shared genetic pathways, molecular mechanisms, or protein interactions, which reveal common underlying biological processes and provide insight into disease mechanisms. In this context, diseases are compared based on similar metabolic pathways or common gene expression profiles. Network-based metrics analyze disease connections using the interactions between genes, proteins, and other biological entities. Diseases are associated in networks based on similar molecular interactions or co-occurrence among pathways. For instance, Protein-Protein Interaction (PPI) networks are used to identify diseases that are associated through shared interacting proteins. Each category has its own significance: semantic similarity metrics offer a high-level, thorough comprehension of diseases; functional similarity metrics explore the biological mechanisms underlying them; and network similarity metrics provide valuable perspectives on the intricate connections and dynamics at the system level.

Previous studies typically used a single measure of similarity, such as a semantic, network, or functional score, to quantify the similarity between diseases. A few researchers have employed weighting coefficients to integrate multiple metrics of differing natures. These studies used varied forms and combinations of semantic, functional, or network similarity metrics, and each category of metric is defined differently. This results in a lack of objectivity when predicting which diseases are similar. To address this issue, our proposed approach aligns multiple disease metrics into a unified spatial dimension, utilizing multi-dimensional vectors to define every disease node. The objective of this paper is to systematically evaluate and compare leading approaches for measuring disease similarity across three categories: semantic, functional, and network-based metrics. We aim to identify the most effective metric within each category and then determine how these can be integrated to achieve a more comprehensive measure of disease similarity. Furthermore, we compare the integrated approach against existing state-of-the-art methods, identify potential genetically associated disease pairs, and investigate the key biological processes and pathways shared by these diseases.

The challenges encountered while conducting research in this direction are as follows:

Data Challenge: Benchmark datasets for identifying disease associations are not readily available; instead, they have to be curated from various data sources. Access to quality data is very limited, and the available datasets are often large and high-dimensional.

Comprehensive and substantial metrics extraction: Researchers have either used semantic, network, or functional metrics on their own or combined them in varied ways, so the computation lacks objectivity. To address this, we analyzed and compared state-of-the-art metrics and used the best among them to build the proposed integrated metric for computing disease similarity.

Different identifiers represent diseases across different databases, demonstrating diversity in disease identification. For example, DO uses DOIDs, MeSH uses MeSH IDs, and OMIM uses OMIM IDs. This diversity of identifiers obstructs integration, as mapping between these data sources is intricate and prone to errors.

The paper's primary contribution consists of the following key aspects:

We drew on multiple data sources and integrated them to extract meaningful relationships and information about disease associations.

We conducted a comparative evaluation of the semantic metrics currently in use for determining semantic similarity in heterogeneous networks, where meta paths symbolize different semantic relationships among diseases.

We conducted a comparative analysis of existing network metrics for computing the topological similarity of diseases in the network, treating nodes within multiple hops as neighbors, to identify the most effective topological similarity metric.

We proposed the DiGeS-FN integration method, which combines the most effective functional, semantic, and topological similarity metrics to identify the optimal disease similarity.

We identified significant disease pairs and performed functional enrichment analysis.

The remainder of the paper is organized as follows: Section 2 presents the related literature; Section 3 describes the proposed framework; Section 4 highlights the experimental details; Section 5 discusses the results; and Section 6 concludes the paper.

2 Related work

Numerous studies have been performed in the past that have identified disease associations using varied means and utilized them for different purposes. This section highlights studies that are closely related to our work. Approaches for computing similarity between diseases can be generally classified into semantic, network, and functional-based metrics.

Semantic similarity metrics are extensively employed in bioinformatics to assess similarity among terms within the GO (Gene Ontology) and HPO (Human Phenotype Ontology). Some of these metrics are used to assess similarity among diseases based on disease-related ontological terms.^10–13 Resnik's method¹⁴ outperforms union-intersection (UI), longest shared path (LP), JC,¹⁵ and Lin's technique.¹⁶ Resnik's method evaluates similarity among diseases using DO terms,^17,18 based on the information content (IC) of the most informative common ancestor (MICA) of a pair of diseases. The dominant ontology for computing similarity among diseases is DO, which defines disease relationships using the “IS_A” relationship format.¹⁹ Furthermore, the Wang method²⁰ computes disease similarity using DO terms by taking into account the multiple ancestors that two terms share. The Wang method demonstrates excellent performance in computing semantic associations among GO terms and has been effectively employed to assess similarity among diseases across Medical Subject Headings (MeSH) terms. The author²¹ measured disease semantic similarity using a metapath technique based on the chemicals used to diagnose a pair of diseases.

Network-based methods have also been used to assess similarity among diseases. These metrics quantify the associations and interactions among genes within biological networks. The author²² proposed a network-centric approach that evaluates gene functional similarity using GO by analyzing the overall structure of the co-functional network and GO terms. The author²³ computed network similarity based on disease pathways using the dynamic time warping technique. To assess network similarity between diseases, the author¹³ computed the average topological properties of each disease. As described, studies have generally used three kinds of metrics to measure similarity among diseases: functional, network, and semantic. However, researchers tend to choose among these metrics based on convenience, and there is no established method for determining the best functional, semantic, or network metric. There is therefore a need to determine the best available metric in each category (functional, network, and semantic) for measuring disease similarity and to devise a method for integrating the three effectively.

In recent times, there have been new function-based approaches developed for assessing disease similarity. Among these, the most commonly used computational techniques rely on the guilt-by-association hypothesis.^24,25 The presumption is that diseases sharing similarities are likely to be linked with the same or similar genes. As a result, a critical aspect was to evaluate the proximity among candidate genes and known disease-associated genes. The similarity among diseases was converted into the similarity between their associated gene sets. The first approach to compute similarity among diseases utilizes the overlapping gene sets (BOG).²⁶ Compared to semantic similarity techniques, the BOG method offers a novel perspective on defining disease similarity. Hence, there is potential to uncover previously unidentified associations.²⁶ Nevertheless, it disregards the functional relationships between genes associated with diseases, which play a role in determining similarity among diseases. The researcher²⁷ introduced a technique known as process-similarity-based (PSB), which incorporates associations derived from GO terms.¹⁰ PSB surpasses BOG in performance and demonstrates superior efficacy compared to Resnik,¹⁴ Lin,¹⁶ LC,²¹ and JC¹⁵ methods.

3 Proposed framework: DiGeS-FN (Disease Gene Semantic Functional & Network Association)

To examine the genetic resemblance among diseases, we selected the best metric in each of three categories (network, functional, and semantic similarity) based on a comparative analysis and integrated them to achieve improved prediction accuracy. This integration leverages the complementary strengths of the three measures, ensuring a more robust and reliable characterization of genetic resemblance among diseases. Figure 1 depicts the proposed workflow of DiGeS-FN.

Figure 1.

The diagram illustrating the proposed workflow of DiGeS-FN.

3.1 Computing network-based similarity among diseases

The STRING database²⁸ was utilized to gather data on interactions between human genes (PPI), and Cytoscape software (v3.8.2)²⁹ was utilized to build a network of human PPI. The network's topological properties concerning genes associated with diseases were calculated.^30–34 The genes associated with disease d were delineated as G = {g₁, g₂, g₃, …g_k}. To evaluate topological similarity among diseases, the mean structural properties of each disease in the network were computed in the following manner:

\begin{aligned} R = \frac{\sum_{1 \leq i \leq k} r_{i}}{k}, C = \frac{\sum_{1 \leq i \leq k} c_{i}}{k}, B = \frac{\sum_{1 \leq i \leq k} b_{i}}{k}, S = \frac{\sum_{1 \leq i \leq k} s_{i}}{k} \end{aligned}

(1)

Here, k denotes the count of genes in the set G where each gene g_i is described by specific network metrics: degree centrality (r_i), clustering coefficient (c_i), betweenness centrality (b_i) and average shortest path length (s_i). The average values of these measures across all genes in G yield the disease-level descriptors: degree centrality (R), reflecting the average number of direct connections per gene; clustering coefficient (C), indicating how interconnected a gene's neighbors are; betweenness centrality (B), measuring the extent to which a gene lies on the shortest paths between other genes and average shortest path length (S), capturing the mean distance from a gene to all others. Together, these parameters provide a comprehensive topological profile of each disease, enabling comparison and similarity evaluation among different disease terms.

As illustrated in Figure 1, G₁ and G₂ represent the set of genes linked to disease terms d₁ and d₂, respectively. T₁ = {R₁, C₁, B₁, S₁} and T₂ = {R₂, C₂, B₂, S₂} are the disease vectors for diseases d₁ and d₂ derived from their topological properties. Network similarity is determined among diseases using the cosine similarity applied to T₁ and T₂.

\begin{aligned} \mathrm{NetSim}(d_{1}, d_{2}) = \cos(T_{1}, T_{2}) = \frac{T_{1} \cdot T_{2}}{\| T_{1} \| \, \| T_{2} \|} \end{aligned}

(2)

where T₁ · T₂ represents the dot product of the two disease vectors, and ||T₁|| and ||T₂|| denote the magnitudes of T₁ and T₂, respectively. Figure 2 provides the schematic representation for computing network similarity, where disease vectors derived from PPI network topological metrics are compared using cosine similarity.

Figure 2.

Schematic representation of network similarity computation among diseases.
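As a minimal sketch of equations (1) and (2), the snippet below averages per-gene topological properties into a disease vector and compares two vectors with cosine similarity. The gene names and property values are hypothetical placeholders; in practice they would come from the PPI network analysis described above.

```python
import math

def disease_vector(genes, props):
    """Mean (R, C, B, S) over a disease's genes, following equation (1)."""
    rows = [props[g] for g in genes if g in props]
    if not rows:
        raise ValueError("none of the disease genes are present in the network")
    k = len(rows)
    return tuple(sum(column) / k for column in zip(*rows))

def net_sim(t1, t2):
    """Cosine similarity between two disease vectors, following equation (2)."""
    dot = sum(a * b for a, b in zip(t1, t2))
    norm = math.sqrt(sum(a * a for a in t1)) * math.sqrt(sum(b * b for b in t2))
    return dot / norm if norm else 0.0

# Hypothetical per-gene values:
# (degree, clustering coefficient, betweenness, average shortest path length)
props = {
    "g1": (10.0, 0.30, 0.020, 2.5),
    "g2": (4.0, 0.50, 0.005, 3.1),
    "g3": (7.0, 0.40, 0.010, 2.8),
}
t1 = disease_vector(["g1", "g2"], props)  # disease d1
t2 = disease_vector(["g2", "g3"], props)  # disease d2
score = net_sim(t1, t2)
```

Because all four properties are non-negative, the cosine score in this sketch falls between 0 and 1.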

3.2 Computing semantic-based similarity among diseases

To quantify proximity among disease terms based on a specific ontology,^35–37 a semantic similarity metric is employed. The metric in Cheng et al.¹² was utilized to evaluate similarity among DO terms. Disease terms were fetched from DO.³⁸ The DO is structured as a DAG (directed acyclic graph) in which each vertex defines a disease term and each edge defines an “IS_A” relationship among the diseases. Figure 3 presents the subgraph of the Disease Ontology DAG representing the hierarchical relationship for the disease term Cutaneous lupus erythematosus (DOID:0050169).

Figure 3.

A DAG subgraph corresponding to the DO term “Cutaneous lupus erythematosus (DOID:0050169)”.

The IC for each DO term can be determined in the following manner:

\begin{aligned} \mathrm{IC}(d) = - \log p(d) \end{aligned}

(3)

Here, d represents the disease term within DO, and p(d) indicates the count of genes associated with d, divided by the total genes associated with disease ontology.

\begin{aligned} d_{\mathrm{MICA}}(d_{1}, d_{2}) = \max_{a \in A(d_{1}) \cap A(d_{2})} \mathrm{IC}(a) \end{aligned}

(4)

where A (d₁) and A (d₂) refer to ancestors of disease terms d₁ and d₂ within the DO respectively. The ancestor with the highest IC is termed as the most informative common ancestor (MICA), denoted as d_MICA which is further used to compute associated gene set (G_MICA). The semantic similarity will then be calculated using equation (5),

\begin{aligned} \mathrm{SemSim}(d_{1}, d_{2}) = \frac{| G_{1} |}{| G_{\mathrm{MICA}} |} \cdot \frac{| G_{2} |}{| G_{\mathrm{MICA}} |} \end{aligned}

(5)

where gene sets linked to disease terms d₁ and d₂ are represented by G₁ and G₂, respectively. d_MICA represents MICA for d₁ and d₂, and G_MICA denotes the genes associated with d_MICA. |G₁|, |G₂|, and |G_MICA| represent the number of genes related to d₁, d₂, and d_MICA, respectively. Figure 4 illustrates the workflow for computing semantic similarity between two diseases.

Figure 4.

Workflow illustrating the computation of semantic similarity among diseases.
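The steps in equations (3) to (5) can be sketched on a toy ontology. Everything in the example is hypothetical: the term names, the IS_A edges, and the gene sets are invented for illustration. Two assumptions are also made that the text does not state: a term counts as its own ancestor, and each term's gene set already includes the genes of its descendants.

```python
import math

# Hypothetical toy ontology: each term maps to its IS_A parents.
PARENTS = {
    "disease": [],
    "skin disease": ["disease"],
    "lupus": ["disease"],
    "cutaneous lupus": ["skin disease", "lupus"],
    "discoid lupus": ["cutaneous lupus"],
    "subacute cutaneous lupus": ["cutaneous lupus"],
}
# Hypothetical gene sets, assumed to be propagated up the hierarchy already.
GENES = {
    "disease": {"a", "b", "c", "d", "e", "f", "g", "h"},
    "skin disease": {"a", "b", "c", "d", "e", "f"},
    "lupus": {"a", "b", "c", "d", "g"},
    "cutaneous lupus": {"a", "b", "c", "d"},
    "discoid lupus": {"a", "b"},
    "subacute cutaneous lupus": {"c", "d"},
}
TOTAL_GENES = len(GENES["disease"])

def ancestors(term):
    """The term itself plus every term reachable through IS_A edges."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen

def ic(term):
    """Information content, equation (3)."""
    return -math.log(len(GENES[term]) / TOTAL_GENES)

def mica(d1, d2):
    """Common ancestor with the highest IC, equation (4)."""
    return max(ancestors(d1) & ancestors(d2), key=ic)

def sem_sim(d1, d2):
    """Equation (5): |G1| / |G_MICA| times |G2| / |G_MICA|."""
    g_mica = len(GENES[mica(d1, d2)])
    return (len(GENES[d1]) / g_mica) * (len(GENES[d2]) / g_mica)

best = mica("discoid lupus", "subacute cutaneous lupus")
score = sem_sim("discoid lupus", "subacute cutaneous lupus")
```

In this toy case the MICA is "cutaneous lupus" with four genes, so the score is (2/4) · (2/4) = 0.25.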

3.3 Computing functional-based similarity among diseases

The functional associations among genes were extracted from HumanNet.³⁶ In HumanNet, each interaction is assigned an LLS (log likelihood score). LLS indicates the likelihood of functional association among the genes.¹² The LLSs among human genes were retrieved from the HumanNet database and standardized using the following equation¹²:

\begin{aligned} \mathrm{LLS}_{N}(g_{i}, g_{j}) = \frac{\mathrm{LLS}(g_{i}, g_{j}) - \mathrm{LLS}_{\min}}{\mathrm{LLS}_{\max} - \mathrm{LLS}_{\min}} \end{aligned}

(6)

Here, g_i and g_j represent the i^th and j^th genes, respectively. LLS_N (g_i, g_j) denotes the normalized LLS between g_i and g_j, and LLS (g_i, g_j) denotes the original LLS between them. LLS_min and LLS_max indicate the minimum and maximum LLS within HumanNet, respectively. The functional similarity score, referred to as FunSim, for a gene pair was established in the following manner:

\begin{aligned} \mathrm{FunSim}(g_{i}, g_{j}) = \left\{ \begin{array}{ll} 1 & i = j \\ \mathrm{LLS}_{N}(g_{i}, g_{j}) & i \neq j \text{ and } e(i, j) \in E(\mathrm{HumanNet}) \\ 0 & i \neq j \text{ and } e(i, j) \notin E(\mathrm{HumanNet}) \end{array} \right. \end{aligned}

(7)

here, e (i, j) denotes the association among genes g_i and g_j. E (HumanNet) comprises all the edges in HumanNet. Following that, the functional links between gene g and the gene set G = {g₁, g₂, g₃, …, g_k} were established in the following manner:

\begin{aligned} F_{G}(g) = \max_{1 \leq i \leq k} \mathrm{FunSim}(g, g_{i}), \quad g_{i} \in G \end{aligned}

(8)

where k indicates the count of genes in G, and g_i is the i^th gene of G. G₁ = {g₁₁, g₁₂, …, g_1i, …, g_1m} and G₂ = {g₂₁, g₂₂, …, g_2j, …, g_2n} denote the sets of genes related to diseases d₁ and d₂, and m and n denote the numbers of genes in G₁ and G₂, respectively. The FunSim score between diseases d₁ and d₂ was computed using equation (9). Figure 5 illustrates the workflow for computing functional similarity between two diseases.

\begin{aligned} \mathrm{FunSim}(d_{1}, d_{2}) = \frac{\sum_{1 \leq i \leq m} F_{G_{2}}(g_{1i}) + \sum_{1 \leq j \leq n} F_{G_{1}}(g_{2j})}{m + n}, \quad g_{1i} \in G_{1}, \ g_{2j} \in G_{2} \end{aligned}

(9)

Figure 5.

Workflow illustrating the computation of functional similarity among diseases.
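Equations (6) to (9) chain together as sketched below. The gene names and LLS values are hypothetical, and storing edges as unordered pairs (`frozenset`) is an implementation choice of this sketch, not something the text prescribes.

```python
def normalize_lls(lls):
    """Min-max normalize edge scores, following equation (6)."""
    lo, hi = min(lls.values()), max(lls.values())
    if hi == lo:
        return {edge: 1.0 for edge in lls}  # degenerate case: all scores equal
    return {edge: (value - lo) / (hi - lo) for edge, value in lls.items()}

def gene_fun_sim(gi, gj, lls_n):
    """Gene-pair FunSim, following equation (7)."""
    if gi == gj:
        return 1.0
    return lls_n.get(frozenset((gi, gj)), 0.0)  # 0 when there is no edge

def gene_to_set(g, gene_set, lls_n):
    """Best link from gene g to any gene in gene_set, following equation (8)."""
    return max(gene_fun_sim(g, other, lls_n) for other in gene_set)

def disease_fun_sim(g1, g2, lls_n):
    """Disease-level FunSim, following equation (9)."""
    total = sum(gene_to_set(g, g2, lls_n) for g in g1)
    total += sum(gene_to_set(g, g1, lls_n) for g in g2)
    return total / (len(g1) + len(g2))

# Hypothetical raw LLS values on three edges
raw = {
    frozenset(("A", "B")): 1.0,
    frozenset(("B", "C")): 3.0,
    frozenset(("C", "D")): 2.0,
}
lls_n = normalize_lls(raw)
score = disease_fun_sim({"A", "B"}, {"C", "D"}, lls_n)
```

One consequence of equation (6) visible here: the edge with the minimum LLS (A–B) normalizes to 0, the same value assigned to gene pairs with no edge at all.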

To identify potential disease pairs, three types of similarity metrics were integrated together to comprehensively compute the similarity, which was established as follows:

\begin{aligned} \mathrm{IntegratedSim}(d_{1}, d_{2}) = \left( \mathrm{SemSim}(d_{1}, d_{2}) \cdot \mathrm{FunSim}(d_{1}, d_{2}) \right) + \mathrm{NetSim}(d_{1}, d_{2}) \end{aligned}

(10)

where d₁ and d₂ signify diseases 1 and 2, respectively. SemSim (d₁, d₂) captures ontological closeness, while FunSim (d₁, d₂) captures shared molecular mechanisms. When both measures are simultaneously high, multiplying them boosts the overall score, reflecting the expectation that diseases which are both semantically similar and functionally linked tend to have real biological associations. In contrast, NetSim (d₁, d₂) captures system-level biology that is not always directly aligned with ontology or function. Therefore, we added the NetSim term to the product, ensuring that network information contributes without disproportionately inflating the combined score when semantic and functional signals are weak. The overall framework of DiGeS-FN for computing disease similarity is illustrated in Figure 6.

Figure 6.

Schematic representation of DiGeS-FN for disease–disease similarity computation.
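To make the behaviour of equation (10) concrete, the sketch below scores and ranks three disease pairs. The pair labels and component scores are hypothetical and serve only to show how the product term and the additive term combine.

```python
def integrated_sim(sem, fun, net):
    """Equation (10): product of SemSim and FunSim, plus NetSim."""
    return sem * fun + net

# Hypothetical (SemSim, FunSim, NetSim) triples for three disease pairs
pairs = {
    "pair A": (0.9, 0.8, 0.95),  # all three signals strong
    "pair B": (0.9, 0.8, 0.10),  # strong semantic/functional, weak network
    "pair C": (0.1, 0.1, 0.95),  # weak semantic/functional, strong network
}
scores = {name: integrated_sim(*triple) for name, triple in pairs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

Assuming each component lies in [0, 1], the integrated score lies in [0, 2], so any decision threshold has to be chosen on that scale.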

4 Experimental details

This section outlines the data sources employed and the experimental findings. The results are organized into four subsections: a comparison of state-of-the-art semantic similarity metrics for diseases, a comparison of network-based similarity metrics, a comparison of functional similarity metrics, and, finally, the identification of the optimal disease similarity through integrated metrics. Furthermore, we identified significant disease pairs and conducted functional enrichment analysis using GO terms and key pathways.

4.1 Data sources

To find disease associations, data were collected from multiple sources and combined to reveal meaningful relationships. In this investigation, four public data sources are used: DO, STRING, HumanNet, and SIDD.

4.1.1 Disease Ontology (DO)

The Disease Ontology (DO)³⁸ comprises 11,000 unique disease terms and 18,000 relationships denoted by “IS_A” between diseases. In the DAG of the DO, disease terms are connected by “IS_A” relationships, where each vertex denotes a DO term, and each association denotes an “IS_A” relationship among various diseases.

4.1.2 STRING

The STRING³⁹ database aims to consolidate both confirmed and predicted interactions among proteins, incorporating physical and functional interactions among the proteins. The protein–protein network includes 1895 proteins with 187874 associations.

4.1.3 HumanNet

We utilized HumanNet⁴⁰ to access functional interactions among genes, which serves as an expanded network of gene functional interactions specific to Homo sapiens. This functional network includes 977,495 interactions involving 18,459 genes (Table 1).

Table 1.
Sources of data utilized to evaluate similarity among diseases.

Data Source	Website Link
DO (Disease Ontology)	https://disease-ontology.org/download/
SIDD (Semantically Integrated Disease-Associated Database)	http://mlg.hit.edu.cn/SIDD
CTD (Comparative Toxicogenomics Database)	https://ctdbase.org/downloads/
STRING (Search Tool for the Retrieval of Interacting Genes)	https://string-db.org/cgi/download?sessionId=bEnz8s3oCSKO
HumanNet	https://staging2.inetbio.org/humannetv3/download.php
GO	https://geneontology.org/docs/download-ontology/
GOA	https://geneontology.org/docs/download-go-annotations/
MimMiner	http://www.cmbi.ru.nl/MimMiner/suppl.html

4.1.4 Disease associated genes

Disease-associated genes are sourced from SIDD,⁴¹ which consolidates information from five databases of disease-associated genes: CTD (Comparative Toxicogenomic Database),⁴² OMIM (Online Mendelian Inheritance in Man),⁴³ GAD (Genetic Association Database),⁴⁴ GeneRIF (Gene Reference into Function),⁴⁵ and SpliceDisease.⁴⁶ Overall, the disease-associated gene dataset encompasses 2814 diseases, 12,063 genes, and 117,190 interactions among them. Table 1 provides specific details about the data sources.

To validate the results, a benchmark dataset comprising 47 diseases with 70 associations among them was utilized from a previous study.¹² This benchmark dataset combines two manually curated datasets that include disease pairs with significant similarity.^40,47,48 Initially, the dataset was introduced in Pakhomov et al.⁴⁷ Then, a literature review was conducted to identify highly similar disease pairs, as described in Mathur and Dinakarpandian.²⁷ The second dataset was provided in Suthram et al.,⁴⁸ based on judgments given by medical residents. Table 2 gives an overview of statistics for all datasets utilized.

Table 2.
Statistics for all datasets utilized.

Summary of Dataset Statistics
Dataset	Diseases	Genes	Relationships
Disease Ontology	11000 (disease terms)	-	18000
HumanNet	-	18459	977495
SIDD	2814	12063	117190
Benchmark Dataset	47	-	70

4.2 Experimental results

This section addresses the following research questions:

Q1: Which semantic method is the best for identifying disease associations?

Q2: Which topological similarity metric is most effective for finding disease associations?

Q3: What is the accuracy of the functional similarity method used for identifying disease associations?

Q4: Which among the top metrics in each category obtained above is the best choice for identifying the disease association?

Q5: Find the genetically associated potential disease pairs based on all three metrics.

Q6: How does the proposed metric compare with state-of-the-art methods for identifying significant disease pairs?

Q7: What are the GO terms and key pathways shared by significant disease pairs?

Q1: Which semantic method is the best for identifying disease associations?

Motivation: The semantic similarity metric identifies related disease pairs by measuring how closely their associated terms are connected through shared annotations within structured knowledge bases such as the Disease Ontology (DO). There are several metrics available, and researchers often choose them based on their convenience without a clear consensus on which is the most optimal.

Method used: In semantic similarity analysis of diseases, the Most Informative Common Ancestor (MICA), defined as the shared ancestor with the highest information content (IC) among the compared terms, is widely adopted across semantic similarity metrics. To identify an efficient semantic similarity metric for pairs of diseases, we employed the following five established methods as benchmarks:

SemFunSim.¹² This method computes disease similarity by combining semantic information from the Disease Ontology (DO) with functional similarity (FunSim). It leverages the two most informative shared ancestor nodes of a disease pair to capture their biological relatedness. The approach for computing semantic similarity is formally defined in equation (5).

Jiang.³¹ The disease ontology-based metric computes Information Content (IC) for measuring similarity among diseases. This approach measures how specific or informative a disease term is within the ontology, with higher IC values indicating more specific diseases. Similarity is then derived by comparing the IC of shared ancestor nodes between disease pairs.

\begin{aligned} \mathrm{Sim}_{\mathrm{Jiang}}(d_{1}, d_{2}) = \frac{2 \cdot | GO(d_{1}) \cap GO(d_{2}) |}{| GO(d_{1}) | + | GO(d_{2}) |} \end{aligned}

(11)

Here, |GO(d₁)| and |GO(d₂)| represent the number of GO terms related to diseases d₁ and d₂, respectively, and |GO(d₁) ∩ GO(d₂)| denotes the number of GO terms common to d₁ and d₂.
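Equation (11) is a set-overlap ratio, which the short sketch below computes directly. The GO identifiers are used purely as illustrative set members, and their assignment to the two diseases is hypothetical.

```python
def sim_jiang(go1, go2):
    """Equation (11): twice the shared GO terms over the total number of GO terms."""
    if not go1 and not go2:
        return 0.0  # no annotations on either side
    return 2 * len(go1 & go2) / (len(go1) + len(go2))

# Hypothetical GO annotations for two diseases
go_d1 = {"GO:0006954", "GO:0006955", "GO:0008283"}
go_d2 = {"GO:0006955", "GO:0008283", "GO:0006915"}
score = sim_jiang(go_d1, go_d2)  # two shared terms out of 3 + 3
```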

Lin.¹⁶ This method introduces a similarity approach based on information theory, enhancing Resnik's method by incorporating assumptions about the shared characteristics between the two concepts. This approach accounts for both the commonality and the distinctiveness of the concepts, resulting in a more balanced and discriminative similarity score.

\begin{aligned} \mathrm{Sim}_{\mathrm{Lin}}(d_{1}, d_{2}) = \frac{2 \cdot \mathrm{IC}(\mathrm{MICA}(d_{1}, d_{2}))}{\mathrm{IC}(d_{1}) + \mathrm{IC}(d_{2})} \end{aligned}

(12)

where MICA (d₁, d₂) denotes the most informative common ancestor of diseases d₁ and d₂, and IC(d₁) and IC(d₂) denote the information content of d₁ and d₂, respectively.

Resnik.¹⁴ Resnik's method calculates similarity between disease pairs using the hierarchical (“IS_A”) relationships in DO. It rests on the premise that the similarity among diseases reflects how much ancestral information they share. The DO encompasses both frequent and uncommon diseases, structured as a DAG (directed acyclic graph) where each vertex signifies a DO term and each edge indicates an “IS_A” association among the diseases. The IC for each DO term is computed as defined in equation (3). The semantic similarity among diseases under Resnik's approach is given by equation (4).
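The two IC-based scores can be compared on a small worked example. The gene counts below are hypothetical; the sketch assumes the IC definition of equation (3) and that the MICA of the two diseases has already been identified.

```python
import math

def ic(n_genes, total_genes):
    """Information content from gene counts, equation (3)."""
    return -math.log(n_genes / total_genes)

def sim_resnik(ic_mica):
    """Resnik: the IC of the most informative common ancestor, equation (4)."""
    return ic_mica

def sim_lin(ic_d1, ic_d2, ic_mica):
    """Lin: shared information relative to the terms' own information, equation (12)."""
    denom = ic_d1 + ic_d2
    return 2 * ic_mica / denom if denom else 0.0

# Hypothetical counts: 8 genes overall, 2 per disease, 4 on their MICA
TOTAL = 8
ic_d1 = ic(2, TOTAL)
ic_d2 = ic(2, TOTAL)
ic_mica = ic(4, TOTAL)

resnik = sim_resnik(ic_mica)         # ln 2, unbounded above in general
lin = sim_lin(ic_d1, ic_d2, ic_mica)  # normalized to the range [0, 1]
```

With these counts the Lin score is 0.5, while the Resnik score is ln 2 ≈ 0.693; the Resnik value depends on the logarithm base and is not normalized, which is one practical difference between the two.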

Wang.²⁰ The method in Wang et al.²⁰ suggests encoding the semantic meanings of GO terms into numerical values by incorporating the semantic information from their ancestral terms in the GO graph. Subsequently, similarity among diseases is determined by assessing similar genes. Suppose that D₁ consists of d₁ along with its ancestor terms within ontology-based “IS_A” relationships. The hierarchical relationship among disease terms from d to d₁ is determined in the following manner:

\begin{aligned} S_{d_{1}}(d) = \left\{ \begin{array}{ll} 1 & d = d_{1} \\ \max \{ w \cdot S_{d_{1}}(d') \mid d' \in \text{children of } d \} & d \neq d_{1} \end{array} \right. \end{aligned}

(13)

where w is the hierarchical relationship factor. As stated in Wang et al.,²⁰ w is set to 0.5. SV(d₁) is the sum of these hierarchical relationship values over all terms in D₁, calculated as:

\begin{aligned} \mathrm{SV}(d_{1}) = \sum_{d \in D_{1}} S_{d_{1}}(d) \end{aligned}

(14)

Suppose that D₂ consists of d₂ along with its ancestral terms. The similarity between diseases d₁ and d₂ according to Wang's method is expressed as follows:

\begin{aligned} \mathrm{Sim}_{\mathrm{Wang}}(d_{1}, d_{2}) = \frac{\sum_{d \in D_{1} \cap D_{2}} \left( S_{d_{1}}(d) + S_{d_{2}}(d) \right)}{\mathrm{SV}(d_{1}) + \mathrm{SV}(d_{2})} \end{aligned}

(15)
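Equations (13) to (15) can be traced on a small hierarchy. The term names and IS_A edges below are hypothetical, and the upward propagation loop is one possible way to evaluate the recursive maximum in equation (13), not necessarily the one used in the cited work.

```python
def s_values(term, parents, w=0.5):
    """S-value of `term` and each of its ancestors, following equation (13)."""
    s = {term: 1.0}
    frontier = [term]
    while frontier:
        child = frontier.pop()
        for parent in parents.get(child, []):
            candidate = w * s[child]
            # Keep the largest contribution when a term is reachable by several paths.
            if candidate > s.get(parent, 0.0):
                s[parent] = candidate
                frontier.append(parent)
    return s

def sim_wang(d1, d2, parents, w=0.5):
    """Wang similarity, following equations (14) and (15)."""
    s1, s2 = s_values(d1, parents, w), s_values(d2, parents, w)
    shared = s1.keys() & s2.keys()
    numerator = sum(s1[t] + s2[t] for t in shared)
    return numerator / (sum(s1.values()) + sum(s2.values()))

# Hypothetical IS_A hierarchy: B and C are siblings under A, which is under root.
parents = {"root": [], "A": ["root"], "B": ["A"], "C": ["A"]}
s_b = s_values("B", parents)       # {"B": 1.0, "A": 0.5, "root": 0.25}
score = sim_wang("B", "C", parents)  # (1.0 + 0.5) / (1.75 + 1.75)
```

Here SV(B) = SV(C) = 1.75 and the shared ancestors A and root contribute 1.5 in total, giving a similarity of 1.5 / 3.5 ≈ 0.43.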

Results obtained: Figure 7 presents the comparative analysis using ROC curves, along with their AUC values, for the standard methods used to compute semantic similarity among diseases. The Resnik,¹⁴ Lin,¹⁶ Jiang,³¹ and Wang²⁰ methods achieve lower AUC values than the semantic method proposed by Cheng et al.¹²

Figure 7.

ROC curves illustrate the experimental results using various semantic similarity methods.
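For readers who want to reproduce this kind of comparison, the sketch below computes AUC directly from similarity scores and binary labels, using the interpretation of AUC as the probability that a positive pair scores higher than a negative one. The scores and labels are hypothetical, and labelling benchmark pairs as 1 and all other pairs as 0 is an assumption of this example.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required to compute AUC")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Hypothetical similarity scores for five disease pairs
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 0]  # 1 = pair is in the benchmark set (assumed labelling)
auc = roc_auc(scores, labels)  # 5 of 6 positive/negative pairs are ordered correctly
```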

Table 3 displays the AUC achieved by each metric for computing semantic similarity among diseases. Cheng's method demonstrates the highest performance among the five semantic-based metrics over Gene Ontology (GO), achieving an AUC of 0.73, which indicates that it is the most effective of the compared methods for assessing semantic similarity among disease pairs.

Table 3.

Comparison of various semantic similarity metrics.

Semantic Similarity Metrics for Assessing Disease-Disease Similarity
Paper ID	Author	Year	AUC
Cheng et al.¹²	Cheng et al.	2014	0.73
Jiang and Conrath³³	Jiang et al.	1997	0.71
Lin¹⁶	Lin et al.	1998	0.67
Resnik¹⁴	Resnik et al.	1995	0.67
Wang et al.²⁰	Wang et al.	2007	0.61

Q2. Which topological similarity metric is most effective for finding disease associations?

Motivation: The topological similarity score examines shared molecular interactions and co-occurrences among pathways. Although various metrics are available, researchers often select a network similarity metric based on convenience rather than on a standard comparison that identifies the most effective one.

Method Used: A comparative analysis has been conducted on the benchmark methods to identify the most efficient network similarity metric.

Yanjun.¹³ The method computes diverse topological characteristics associated with disease genes as shown in equation (16).

\begin{aligned} R = \frac{\sum_{1 \leq i \leq k} r_{i}}{k}, C = \frac{\sum_{1 \leq i \leq k} c_{i}}{k}, B = \frac{\sum_{1 \leq i \leq k} b_{i}}{k}, S = \frac{\sum_{1 \leq i \leq k} s_{i}}{k}, H = \frac{\sum_{1 \leq i \leq k} h_{i}}{k} \end{aligned}

(16)

Here, k is the number of genes associated with the disease, and each lowercase symbol denotes the value for the i-th gene. R is the average degree centrality, i.e., the average number of a gene's direct connections; C is the average clustering coefficient, which shows how connected a gene's neighbors are; B is the average betweenness centrality, which measures how often a gene lies on shortest paths; S is the average shortest path length, which reflects a gene's mean distance to all other genes; and H is the average neighborhood connectivity, which is the average connectivity of a gene's neighbors. Each disease is represented by a five-dimensional feature vector (T) of these topological properties, and network similarity is then computed using the Pearson Correlation Coefficient (PCC), as shown in equation (17).

\begin{aligned} \rho_{T_{1}, T_{2}} = \frac{\mathrm{cov}(T_{1}, T_{2})}{\sigma_{T_{1}} \sigma_{T_{2}}} \end{aligned}

(17)
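As an illustration of equations (16) and (17), the sketch below averages per-gene topological features into a disease vector and correlates two such vectors. It assumes the per-gene values (r, c, b, s, h) have already been computed from the interaction network; the gene names and numbers are hypothetical.

```python
import math


def disease_vector(genes, gene_features):
    """Average each topological feature over the k genes of a disease (Eq. 16)."""
    k = len(genes)
    n_features = len(gene_features[genes[0]])
    return [sum(gene_features[g][i] for g in genes) / k for i in range(n_features)]


def pearson(t1, t2):
    """Pearson correlation coefficient between two feature vectors (Eq. 17)."""
    n = len(t1)
    m1, m2 = sum(t1) / n, sum(t2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(t1, t2))
    sd1 = math.sqrt(sum((a - m1) ** 2 for a in t1))
    sd2 = math.sqrt(sum((b - m2) ** 2 for b in t2))
    # The 1/n factors of the covariance and standard deviations cancel.
    return cov / (sd1 * sd2)


# Hypothetical per-gene features in the order (r, c, b, s, h).
gene_features = {
    "G1": (4.0, 0.50, 0.25, 2.0, 6.0),
    "G2": (2.0, 0.00, 0.75, 3.0, 4.0),
    "G3": (6.0, 0.25, 0.50, 2.5, 3.0),
}

t1 = disease_vector(["G1", "G2"], gene_features)
t2 = disease_vector(["G2", "G3"], gene_features)
print(t1, t2, pearson(t1, t2))
```

Because the five features live on different scales (degree counts versus coefficients in [0, 1]), the correlation is dominated by the large-valued features unless the features are rescaled first; whether any rescaling was applied is not stated here.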

NetSim. We used four key topological features to construct the disease feature vector for computing network similarity, as defined in Equation 1: degree centrality (R), betweenness centrality (B), clustering coefficient (C), and average shortest path length (S). We excluded neighborhood connectivity (H) because it is strongly correlated with degree centrality, which could introduce redundancy into the feature vector and dilute the discriminative power of the representation. Equation 2 describes the NetSim metric used to compute network similarity among disease vectors in the proposed approach.

Jaccard. It calculates similarity from disease-related gene sets: the number of genes shared by gene sets G₁ and G₂, for diseases d₁ and d₂ respectively, is divided by the number of distinct genes associated with either disease.

\begin{aligned} J (d_{1}, d_{2}) = \frac{| G_{1} \cap G_{2} |}{| G_{1} \cup G_{2} |} \end{aligned}

(18)

where |G₁∩G₂| is the number of genes common to diseases d₁ and d₂, and |G₁∪G₂| is the number of distinct genes associated with either disease.

Overlap. Overlap similarity compares disease-related gene sets by counting the genes common to gene sets G₁ and G₂ for diseases d₁ and d₂, respectively, and dividing this count by the size of the smaller of the two gene sets.

\begin{aligned} O (d_{1}, d_{2}) = \frac{| G_{1} \cap G_{2} |}{min (| G_{1} |, | G_{2} |)} \end{aligned}

(19)

where G₁ and G₂ denote the genes linked to diseases d₁ and d₂, respectively. |G₁ ∩ G₂| refers to the number of common genes shared by diseases d₁ and d₂.

Sorensen. The Sorensen similarity coefficient evaluates similarity between gene sets G₁ and G₂ by doubling the count of shared genes and dividing this by the sum of the sizes of both gene sets.

\begin{aligned} S (d_{1}, d_{2}) = \frac{2 * | G_{1} \cap G_{2} |}{| G_{1} | + | G_{2} |} \end{aligned}

(20)

where G₁ and G₂ are the genes related to diseases d₁ and d₂, respectively, |G₁∩G₂| is the number of genes shared by d₁ and d₂, and |G₁| + |G₂| is the sum of the numbers of genes associated with d₁ and d₂.
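The three set-based baselines in equations (18) to (20) differ only in their denominators, which a short sketch makes concrete. The gene sets below are hypothetical placeholders, and the function names are chosen here for illustration.

```python
def jaccard(g1, g2):
    """Shared genes divided by distinct genes in either set (Eq. 18)."""
    return len(g1 & g2) / len(g1 | g2)


def overlap(g1, g2):
    """Shared genes divided by the size of the smaller set (Eq. 19)."""
    return len(g1 & g2) / min(len(g1), len(g2))


def sorensen(g1, g2):
    """Twice the shared genes divided by the sum of set sizes (Eq. 20)."""
    return 2 * len(g1 & g2) / (len(g1) + len(g2))


# Hypothetical gene sets for two diseases.
g1 = {"GENE_A", "GENE_B", "GENE_C"}
g2 = {"GENE_B", "GENE_C", "GENE_D", "GENE_E"}

print(jaccard(g1, g2), overlap(g1, g2), sorensen(g1, g2))
```

With two shared genes, the scores are 2/5, 2/3, and 4/7. All three depend only on gene-set membership, not on how the genes are positioned in the interaction network.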

Results Obtained: We compared the proposed metric (NetSim) with the benchmark methods used for assessing network similarity among diseases. ROC curves and their associated AUC values are shown in Figure 8. The Jaccard, Overlap, and Sorensen similarity approaches perform worse than the method of Yanjun et al.¹³ Compared with Yanjun's method, NetSim raises the AUC by 0.02 (from 0.73 to 0.75).

Figure 8.

ROC curves illustrate the experimental results using various network similarity methods.

Table 4.

Comparison of various network similarity metrics.

Network Similarity Metrics for Assessing Disease-Disease Similarity
Author/Technique Name	AUC
NetSim	0.75
Yanjun Ding et al. ¹³	0.73
Jaccard	0.68
Overlap	0.68
Sorensen	0.68

Table 4 reports the AUC achieved by each method for calculating network similarity among diseases. When applied to the disease vectors, the cosine method outperforms the Pearson correlation used in Ding et al.¹³: it attains an AUC of 0.75, an increase of 0.02 over the 0.73 obtained with Pearson correlation.
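The difference between the two vector comparisons can be sketched as follows, assuming NetSim is the cosine similarity of the four-feature disease vectors (R, B, C, S) as the comparison above indicates; the vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical four-feature disease vectors in the order (R, B, C, S).
t1 = [3.0, 0.50, 0.25, 2.5]
t2 = [4.0, 0.625, 0.125, 2.75]

print(cosine(t1, t2))
```

Unlike the Pearson correlation, the cosine does not subtract each vector's mean before comparing, so it measures the angle between the raw feature vectors rather than the agreement of their deviations from the mean.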

Q3. What is the accuracy of the functional similarity method used for identifying disease associations?

Motivation: The functional similarity metric assesses disease similarity by analyzing functional biological relationships such as genetic pathways, protein-protein interactions, and molecular mechanisms. To the best of our knowledge, most available studies that use functional similarity to find disease associations rely on the same method, so we compute its accuracy.

Method used: FunSim is a widely adopted method for computing disease similarity and has been used by several researchers.^12,13,32,34 As shown in Table 5, a single functional similarity metric is consistently applied across multiple studies. FunSim works by comparing the sets of genes associated with each disease to determine their functional similarity.^26,27 This gene-level comparison captures the biological relatedness between diseases based on shared molecular mechanisms. FunSim is computed using Equation 9.

Table 5.

Single Functional Similarity used by various researchers.

Functional Similarity Metrics for Assessing Disease-Disease Similarity
Paper ID	Author/Technique Name	Year	AUC
Ding et al.¹³	Yanjun et al./FunSim	2021	0.76
Su et al.³²	Shuhui et al./FunSim	2019	0.76
Cheng et al.¹²	Cheng et al./FunSim	2014	0.76

Results Obtained: Figure 9 and Table 5 display the ROC curve and the corresponding AUC value for the baseline functional similarity method.

Figure 9.

ROC curves illustrate the experimental result using functional similarity methods.

Figure 10.

ROC curves illustrate the experimental results using Functional, Semantic and Network metrics.

Q4: Which among the top metrics in each category (semantic, functional, and network) is the best for identifying the disease association?

Motivation: Each metric plays a significant role in determining the disease associations. We want to identify which metric contributes the most and the least in calculating the similarity among diseases.

Methods: We compared the best-performing metric from each of the three categories used to assess disease similarity: SemSim for the semantic category, NetSim for the network-based category, and FunSim for the functional category.

Results obtained: Figure 10 depicts the ROC curves and associated AUC values of the most effective baseline methods alongside our method, covering functional, semantic, and network similarity among diseases. Of the three metrics, functional similarity achieves the highest AUC (0.76), followed by network similarity (0.75) and semantic similarity (0.73).

Q5: How does the proposed integrated metric compare with state-of-the-art methods for identifying significant disease pairs?

Motivation: The network, semantic, and functional similarity scores are combined into an integrated similarity score. Each of the three metrics contributes to a comprehensive understanding of disease associations, and the integrated score helps identify the significant disease pairs.

Results Obtained: To comprehensively capture disease–disease relationships, we integrated the three similarity metrics to compute combined similarity scores across 47 diseases from the benchmark dataset.¹² Figure 11 illustrates the performance of the integrated approach and of the state-of-the-art methods: Autoimmune,¹³ SemFunSim,¹² and FunSim.³⁴ The results demonstrate that the proposed DiGeS-FN framework achieves superior predictive capability compared to existing methods.

Figure 11.

ROC curves illustrate the experimental results using combined similarity metric.

To evaluate the significance of the improvement achieved by DiGeS-FN, we conducted DeLong's test against Autoimmune, FunSim, and SemFunSim to statistically validate the differences in AUC values. The null hypothesis of DeLong's test is that the ROC AUC values of the two compared methods do not differ. Table 6 presents the results. The proposed method achieved an AUC of 0.81, compared with 0.70 for Autoimmune, 0.76 for FunSim, and 0.77 for SemFunSim. The improvements over Autoimmune (p = 0.002) and FunSim (p = 0.010) are statistically significant at the 0.05 level, whereas the improvement over SemFunSim (p = 0.085) is not. DiGeS-FN therefore attains the highest AUC in every comparison, with statistically supported gains over two of the three baselines.
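For readers unfamiliar with the test, the following is a minimal sketch of DeLong's comparison of two correlated ROC curves evaluated on the same labelled disease pairs. It is a plain O(mn) illustration, not the implementation used in this study; the labels and scores are hypothetical.

```python
import math


def _placements(pos, neg):
    """Placement values: V10 for each positive, V01 for each negative."""
    def psi(x, y):
        return 1.0 if x > y else 0.5 if x == y else 0.0
    v10 = [sum(psi(x, y) for y in neg) / len(neg) for x in pos]
    v01 = [sum(psi(x, y) for x in pos) / len(pos) for y in neg]
    return v10, v01


def _cov(a, b):
    """Sample covariance with an (n - 1) denominator."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


def delong_test(labels, scores_a, scores_b):
    """Return (auc_a, auc_b, z, two-sided p) for two score lists on the same cases."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    m, n = len(pos), len(neg)
    v10a, v01a = _placements([scores_a[i] for i in pos], [scores_a[i] for i in neg])
    v10b, v01b = _placements([scores_b[i] for i in pos], [scores_b[i] for i in neg])
    auc_a, auc_b = sum(v10a) / m, sum(v10b) / m
    # Variance of the AUC difference, from the positive and negative placements.
    var = ((_cov(v10a, v10a) + _cov(v10b, v10b) - 2 * _cov(v10a, v10b)) / m
           + (_cov(v01a, v01a) + _cov(v01b, v01b) - 2 * _cov(v01a, v01b)) / n)
    if var <= 0:
        # Degenerate case (e.g. identical score lists): report no evidence of a difference.
        return auc_a, auc_b, 0.0, 1.0
    z = (auc_a - auc_b) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return auc_a, auc_b, z, p


# Hypothetical example: 1 = known associated pair, 0 = non-associated pair.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores_a = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
scores_b = [0.9, 0.3, 0.7, 0.2, 0.5, 0.8, 0.1, 0.4]
print(delong_test(labels, scores_a, scores_b))
```

The test accounts for the fact that both methods are scored on the same disease pairs, which is why the covariance between their placement values enters the variance.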

To further explore potential associations, all disease pairs were ranked in descending order according to their integrated similarity scores. The top ten pairs, presented in Table 7, represent the most promising candidates for biological validation and deeper investigation. Notably, pairs such as atherosclerosis–myocardial infarction, asthma–bronchitis and asthma–chronic obstructive airway disease emerged as highly significant, which aligns with well-established clinical and biological evidence. These findings underscore the capability of DiGeS-FN not only to capture known associations but also to uncover novel and meaningful disease–disease relationships.
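The ranking step can be sketched as below. The combination rule here, an equal-weight mean of the three scores, is an assumption made only for illustration; the integration actually used by DiGeS-FN is the one defined for the proposed framework. The scores are hypothetical.

```python
def integrate(sem, fun, net):
    """Equal-weight mean of three per-pair scores (assumed combination rule)."""
    pairs = sem.keys() & fun.keys() & net.keys()
    return {p: (sem[p] + fun[p] + net[p]) / 3 for p in pairs}


def top_pairs(scores, k=10):
    """Disease pairs sorted by integrated score, highest first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


# Hypothetical per-pair scores keyed by (disease, disease).
sem = {("asthma", "bronchitis"): 0.9, ("obesity", "goitre"): 0.2}
fun = {("asthma", "bronchitis"): 0.6, ("obesity", "goitre"): 0.1}
net = {("asthma", "bronchitis"): 0.3, ("obesity", "goitre"): 0.3}

for pair, score in top_pairs(integrate(sem, fun, net), k=10):
    print(pair, round(score, 3))
```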

Table 6.

DeLong's test results comparing the AUC of the proposed method with baseline approaches.

Method	AUC	Comparison vs DiGeS-FN	P-value	Significant?
DiGeS-FN	0.81	-	-	-
Autoimmune	0.70	Autoimmune vs DiGeS-FN	0.002	Yes (p < 0.05)
FunSim	0.76	FunSim vs DiGeS-FN	0.010	Yes (p < 0.05)
SemFunSim	0.77	SemFunSim vs DiGeS-FN	0.085	No

Table 7.

The top ten disease pairs with the highest integrated similarity scores.

DOID	Disease Name	DOID	Disease Name	Similarity Score
DOID:12365	Malaria	DOID:8469	Influenza	1
DOID:2841	Asthma	DOID:6132	Bronchitis	0.750
DOID:9970	Obesity	DOID:9351	Diabetes Mellitus	0.373
DOID:423	Myopathy	DOID:633	Myositis	0.369
DOID:2841	Asthma	DOID:3083	Chronic Obstructive Airway Disease	0.299
DOID:5419	Schizophrenia	DOID:3312	Bipolar Disorder	0.264
DOID:7148	Rheumatoid Arthritis	DOID:848	Arthritis	0.200
DOID:11612	Polycystic Ovary Syndrome	DOID:289	Endometriosis	0.177
DOID:1936	Atherosclerosis	DOID:5844	Myocardial Infarction	0.176
DOID:12176	Goitre	DOID:1459	Hypothyroidism	0.125

Q6: Find the genetically associated potential disease pairs based on all three metrics.

Motivation: Prior studies indicate that genetically similar diseases often arise from overlapping or shared genes, which in turn reflect common biological pathways and Gene Ontology (GO) terms. Identifying potential disease pairs based on shared genetic components is crucial for understanding disease etiology, improving diagnostic strategies, and guiding the discovery of therapeutic targets.

Results Obtained: To uncover genetically related disease pairs, we evaluated disease similarity using three complementary metrics: Functional, Semantic, and Network Similarity. Disease pairs were ranked in descending order based on their respective scores for each metric. To enhance the reliability of the findings, we selected ten overlapping disease pairs from the top 100 ranked pairs across all three similarity measures as demonstrated in Figure 12. These were considered as potential disease pairs exhibiting significant genetic similarity for further analysis.
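The selection of pairs that rank highly under all three metrics amounts to a set intersection, sketched below. The score dictionaries are hypothetical, and pairs are stored as `frozenset` objects so that the order of the two diseases does not matter.

```python
def top_k(scores, k):
    """Set of the k highest-scoring disease pairs under one metric."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return {pair for pair, _ in ranked[:k]}


def consensus_pairs(sem, fun, net, k=100):
    """Pairs that appear in the top k under all three metrics."""
    return top_k(sem, k) & top_k(fun, k) & top_k(net, k)


# Hypothetical scores for three candidate pairs; k is reduced to 2 for the toy data.
ab, ac, bc = frozenset({"A", "B"}), frozenset({"A", "C"}), frozenset({"B", "C"})
sem = {ab: 0.9, ac: 0.5, bc: 0.1}
fun = {ab: 0.8, ac: 0.2, bc: 0.6}
net = {ab: 0.7, ac: 0.6, bc: 0.3}

print(consensus_pairs(sem, fun, net, k=2))
```

Requiring agreement across all three rankings trades recall for reliability: a pair that scores highly on only one or two metrics is excluded.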

Figure 12.

Analysis via Venn diagram for disease pairs ranked by network, semantic and functional similarity metrics.

Additionally, we created a network from the ten disease pairs and their associated genes and analyzed its topological characteristics. We identified the 35 genes with the highest degree in this network as potentially relevant to the diseases; prior studies have shown that these genes are linked to multiple diseases. A greater number of shared genes indicates a stronger likelihood of genetic similarity between the two diseases of a pair. Therefore, the ten potential disease pairs were prioritized according to the number of genes they have in common, as shown in Table 8.
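A minimal sketch of the two steps follows, assuming a bipartite disease-gene network in which a gene's degree is the number of diseases it is linked to; the network analyzed in this study may also contain other edge types. The disease and gene names are hypothetical.

```python
from collections import Counter


def gene_degrees(disease_genes):
    """Degree of each gene, taken as the number of diseases it is linked to."""
    return Counter(g for genes in disease_genes.values() for g in genes)


def rank_pairs_by_shared_genes(pairs, disease_genes):
    """Disease pairs with their shared-gene counts, largest count first."""
    counted = [(a, b, len(disease_genes[a] & disease_genes[b])) for a, b in pairs]
    return sorted(counted, key=lambda item: item[2], reverse=True)


# Hypothetical disease-to-gene associations.
disease_genes = {
    "disease_1": {"GENE_A", "GENE_B", "GENE_C"},
    "disease_2": {"GENE_B", "GENE_C"},
    "disease_3": {"GENE_C"},
}
pairs = [("disease_1", "disease_3"), ("disease_1", "disease_2")]

print(gene_degrees(disease_genes).most_common(2))
print(rank_pairs_by_shared_genes(pairs, disease_genes))
```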

Table 8.

The top ten disease pairs ordered by number of shared genes.

DOID	Disease Name	DOID	Disease Name	No. of Shared Genes
DOID:10652	Alzheimer's Disease	DOID:11476	Osteoporosis	412
DOID:11612	Polycystic Ovary Syndrome	DOID:9970	Obesity	331
DOID:10652	Alzheimer's Disease	DOID:9970	Obesity	314
DOID:11476	Osteoporosis	DOID:9970	Obesity	242
DOID:3312	Bipolar Disorder	DOID:5419	Schizophrenia	242
DOID:2841	Asthma	DOID:7148	Rheumatoid Arthritis	231
DOID:10652	Alzheimer's Disease	DOID:5419	Schizophrenia	230
DOID:10652	Alzheimer's Disease	DOID:1936	Atherosclerosis	225
DOID:9351	Diabetes Mellitus	DOID:9970	Obesity	219
DOID:10652	Alzheimer's Disease	DOID:3083	Chronic Obstructive Airway Disease	217

Q7: What are the GO terms and key pathways shared by significant disease pairs?

Motivation: Identifying the GO terms and key pathways shared by significant disease pairs is essential to uncovering the mechanisms underlying similar diseases. The identified GO terms and pathways help elucidate biological processes, reveal shared molecular mechanisms, and facilitate drug repositioning.

Results Obtained-Functional Enrichment Analysis of Key Disease Pair: To uncover the shared underlying mechanism between two similar diseases, we carried out functional enrichment analysis on the genes associated with atherosclerosis and myocardial infarction. The FDR cutoff was set to 0.05. The genes associated with Atherosclerosis and Myocardial Infarction showed significant enrichment in GO terms mainly related to Regulation of Inflammatory Response (GO:0050727), Inflammatory Response (GO:0006954), and Positive Regulation of Cytokine Production (GO:0001819) (Figure 13(A)). The pathways that showed significant enrichment include Lipid and Atherosclerosis, the Adipocytokine signaling pathway, and Pathways in Cancer (Figure 13(B)).

Figure 13.

Functional Enrichment Analysis of Atherosclerosis and Myocardial Infarction. (A-B) Functionally enriched terms are ranked in descending order according to their p-values. For the key disease pairs, the top ten GO terms in Biological Process (BP) and KEGG Pathways were used for additional analysis.
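The statistics behind an FDR-filtered enrichment analysis can be sketched as follows. The sketch assumes a one-sided hypergeometric test per term followed by Benjamini-Hochberg adjustment, which is a common choice; the enrichment tool and test used in this study are not specified here, and all counts below are hypothetical.

```python
from math import comb


def hypergeom_pvalue(n_background, n_annotated, n_query, n_overlap):
    """P(X >= n_overlap) when n_query genes are drawn from the background."""
    upper = min(n_annotated, n_query)
    total = comb(n_background, n_query)
    tail = sum(
        comb(n_annotated, i) * comb(n_background - n_annotated, n_query - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):              # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts per term: (genes annotated to the term, overlap with the query list).
terms = {"term_1": (200, 30), "term_2": (150, 4), "term_3": (50, 10)}
n_background, n_query = 20000, 300

pvals = [hypergeom_pvalue(n_background, k, n_query, x) for k, x in terms.values()]
for name, q in zip(terms, benjamini_hochberg(pvals)):
    print(name, q, "enriched" if q < 0.05 else "not enriched")
```

A term is reported as enriched when its adjusted p-value falls below the 0.05 FDR cutoff.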

5 Discussion

We computed similarity among 47 diseases using the proposed DiGeS-FN framework, which combines the best semantic, functional, and network metrics. We identified ten pairs of potentially similar diseases based on the similarity score and the count of shared genes, which offered insight into the mechanisms shared among multiple disease pairs (Tables 7 and 8). One notable example is atherosclerosis and myocardial infarction, for which we further examined the shared functional terms. Functional enrichment analysis revealed that genes associated with both diseases were significantly enriched in GO terms related to Regulation of Inflammatory Response (GO:0050727), Inflammatory Response (GO:0006954), and Positive Regulation of Cytokine Production (GO:0001819). The pathways that showed significant enrichment include Lipid and Atherosclerosis, the Adipocytokine signalling pathway, and Pathways in cancer. This strong molecular link between atherosclerosis and myocardial infarction suggests that patients diagnosed with atherosclerosis may benefit from early cardiovascular screening and preventive interventions to mitigate myocardial infarction risk.

The clinical and translational relevance of the DiGeS-FN framework lies in its ability to support disease comorbidity analysis, drug repurposing, and biomarker discovery. Beyond clinical practice, the identified disease pairs also provide direction for pharmaceutical research by highlighting shared therapeutic targets, which can facilitate drug repositioning and multi-target drug development. For example, drugs with anti-inflammatory indications approved for myocardial infarction could be prioritized for testing in atherosclerosis, and vice versa, thereby accelerating translational research and therapeutic innovation. Our findings align with several previous studies that have confirmed the relationships between the significant disease pairs:

A) Atherosclerosis (DOID:1936) and Myocardial Infarction (DOID:5844): Libby et al.⁴⁹ demonstrated that atherosclerosis is a major contributing factor to myocardial infarction. Ross⁵⁰ emphasized the inflammatory nature of atherosclerosis and its contribution to coronary artery disease (CAD) and myocardial infarction. Hansson and Hermansson⁵¹ discuss the importance of the immune system in the evolution of atherosclerosis and its adverse outcomes, such as myocardial infarction. Falk et al.⁵² provide an update on the pathological mechanisms linking atherosclerosis to acute coronary syndromes, including myocardial infarction. Davies⁵³ demonstrated that atherosclerosis is the leading factor in causing myocardial infarction.

B) Asthma (DOID:2841) and Bronchitis (DOID:6132): Another significant association identified by our proposed method is the asthma–bronchitis pair, which shows a similarity score of 0.75. This pair is a well-documented association, which supports the model's validity. Clinical studies show that these two respiratory conditions share overlapping pathophysiology, including airway inflammation, hyperresponsiveness, and early-life viral triggers.^54,55 Genetic studies further reveal shared susceptibility loci linked to inflammatory responses and lung development.⁵⁶

C) Asthma (DOID:2841) and Chronic Obstructive Airway Disease (DOID:3083): Asthma and Chronic Obstructive Airway Disease form another identified disease pair, exhibiting a similarity score of 0.299.

The recovery of known comorbidities such as asthma and COPD (commonly referred to as Asthma–COPD Overlap, ACO) further supports the reliability of DiGeS-FN. Asthma and COPD share overlapping airway inflammation, remodeling processes, and genetic predispositions, and ACO has been increasingly recognized as a distinct clinical entity with worse outcomes than either disease alone.^57–59 By correctly identifying this established pair, the model demonstrates its capacity to reproduce known biological relationships while also uncovering less obvious disease associations.

D) Polycystic Ovary Syndrome (PCOS) and Endometriosis: To demonstrate the discovery potential of DiGeS-FN beyond well-documented comorbidities, we highlight the disease pair Polycystic Ovary Syndrome (PCOS) and Endometriosis, which achieved a relatively high similarity score (0.177). Although PCOS and endometriosis are clinically distinct gynecological disorders, emerging literature suggests partial overlap in hormonal dysregulation, inflammatory pathways, and genetic susceptibility. For instance, Zondervan et al.⁶⁰ and Sapkota et al.⁶¹ report genetic correlations and shared reproductive traits among women with these conditions. However, the biological basis of their co-occurrence remains underexplored. By identifying this connection, DiGeS-FN emphasizes its ability to capture less obvious disease relationships, which may guide future research into shared mechanisms and therapeutic targets.

6 Conclusion

This study demonstrates that two diseases with a significant similarity score, though clinically distinct, may share common genetic mechanisms and induce similar biological responses in the human body. Such genetically related diseases can often be managed using similar therapeutic strategies. Similarity among diseases can be measured using three types of metrics: semantic, functional, and network. Each metric has its own importance. We found that the functional metric achieves the highest AUC among the three individual metrics. We integrated all three metrics into a novel method, DiGeS-FN, which achieves the highest AUC among the compared methods. The experimental results were evaluated on the benchmark dataset and show that the proposed approach attains an AUC of 0.81 with a strong TPR while maintaining a low FPR. Beyond the case study of atherosclerosis–myocardial infarction, DiGeS-FN also identified several clinically meaningful pairs, including asthma–bronchitis, asthma–COPD, and polycystic ovary syndrome (PCOS)–endometriosis, which are supported by prior biological and clinical evidence. The paper further explored the biological processes and key pathways associated with these pairs, with enrichment observed in inflammatory responses, lipid metabolism, adipocytokine signalling, and cancer-related pathways. Although DiGeS-FN has been evaluated on a curated benchmark dataset of 47 diseases, the framework is inherently scalable and generalizable and can be extended to larger and more diverse disease datasets by incorporating additional ontologies and updated association networks. The modular nature of DiGeS-FN also allows it to be adapted to evolving biomedical databases, making it a useful tool for precision medicine.

Despite its strong performance and improved results over traditional approaches, there remain several opportunities for further enhancement. First, the framework depends heavily on the quality and completeness of input data, including gene-disease associations and interaction networks, which may introduce bias or noise. Second, the integration strategy assumes equal reliability of semantic, network, and functional metrics, which might not hold true across all disease types. Additionally, DiGeS-FN currently focuses on pairwise disease similarity and does not capture higher-order relationships (e.g., disease triads or clusters). Future work could address these issues by incorporating dynamic weighting schemes and expanding the model to multi-disease associations. DiGeS-FN can also be extended by incorporating multi-omics data and advanced multi-view learning models, thereby enhancing its utility in precision medicine and translational research.

Footnotes

Acknowledgements

The authors would like to thank their supervisor, colleagues, and family for their valuable contributions to this study.

ORCID iD

Sonia Lamba

Consent for publication

All authors have read and approved the final manuscript and have provided consent for publication.

Funding

The author received no financial support for the research, authorship and/or publication of this article.

Declaration of conflicting interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data and material availability

The implementation of DiGeS-FN and the preprocessed data is available at .

References

Goh

Cusick

Valle

, et al. The human disease network. Proc Natl Acad Sci U.S.A 2007; 104: 8685–8690.

Agarwal

. Human disease-drug network based on genomic expression profiles. PLoS One 2009; 4: e6536.

Zhang

Jiang

, et al. The expanded human disease network combining protein-protein interaction information. Eur J Hum Genet 2011; 19: 783–788.

Lee

Park

Kay

, et al. The implications of human metabolic network topology for disease comorbidity. Proc Natl Acad Sci U.S.A. 2008; 105: 9880–9885.

Agarwal

. A pathway-based view of human diseases and disease relationships. PLoS One 2009; 4: e4346.

Lage

Karlberg

Storling

, et al. A human phenome-interactome network of protein complexes implicated in genetic disorders. Nat Biotechnology 2007; 25: 309–316.

Liu

Jiang

. Align human interactome with phenome to identify causative genes and networks underlying disease families. Bioinformatics 2009; 25: 98–104.

Gottlieb

Stein

Ruppin

, et al. PREDICT: a method for inferring novel drug indications with application to personalized medicine. Mol Syst Biol 2011; 7: 496.

Wang

, et al. Inferring the human microRNA functional similarity and functional network based on microRNA-associated diseases. Bioinformatics 2010; 26: 1644–1650.

10.

Ashburner

Ball

Blake

, et al. Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000; 25: 25–29.

11.

Pesquita

Faria

Falcao

, et al. Semantic similarity in biomedical ontologies. PLoS Comput Biol 2009; 5: e1000443.

12.

Cheng

, et al. Semfunsim: a new method for measuring disease similarity by integrating semantic and gene functional association. PLoS One 2014; 9: e99415.

13.

Ding

Cui

Qian

, et al. Calculation of similarity between 26 autoimmune diseases based on three measurements including network, function, and semantics. Front Genet 2021; 12: 758041.

14.

Resnik

. Using information content to evaluate semantic similarity in a taxonomy. In: Proceedings of the 14th international joint conference on artificial intelligence, Morgan Kaufmann Publishers Inc, 1995, pp.448–453.

15.

Jiang

Conrath

. Semantic similarity based on corpus statistics and lexical taxonomy. arXiv preprint cmp-lg, pp. 9709008, 1997.

16.

Lin

. An information-theoretic definition of similarity. In: Proceedings of the 15th international conference on Machine Learning, San Francisco, CA: Morgan Kaufmann, 1998, pp.296–304.

17.

Gong

Chen

, et al. DOSim: an R package for similarity between diseases based on Disease Ontology. BMC Bioinformatics 2011; 12: 266.

18.

Schriml

Arze

Nadendla

, et al. Disease Ontology: a backbone for disease semantic integration. Nucleic Acids 2012; 40: 940–946.

19.

Smith

Ceusters

Klagges

, et al. Relations in biomedical ontologies. Genome Biol 2005; 46: 15.

20.

Wang

Payattakool

, et al. A new method to measure the semantic similarity of GO terms. Bioinformatics 2007; 23: 1274–1281.

21.

Leacock

Chodorow

. Combining local context and WordNet similarity for word sense identification. in WordNet: An Electronic Lexical Database 1998; 49: 265–283.

22.

Wang

Zhong

, et al. Constructing disease similarity networks based on disease module theory. IEEE/ACM Trans Comput Biol Bioinform 2018; 17: 906–915.

23.

Gao

Tian

Wang

, et al. Similar disease prediction with heterogeneous disease information networks. IEEE Trans NanoBiosci 2020; 19: 571–578.

24.

Cheng

Wang

Tian

, et al. Lncrna2target v2.0: a comprehensive database for target genes of lncRNAs in human and mouse. Nucleic Acids Res 2019; 47: 140–144.

25.

Peng

Hui

Shang

. Measuring phenotype-phenotype similarity through the interactome. BMC Informatics 2017; 19: 114.

26.

Mathur

Dinakarpandian

. Automated ontological gene annotation for computing disease similarity. Summit Transl Bioinform 2010; 2010: 12–16.

27.

Mathur

Dinakarpandian

. Finding disease similarity based on implicit semantic similarity. J Biomed Inform 2012; 45: 363–371.

28.

Szklarczyk

Gable

Lyon

, et al. STRING V11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res 2019; 47: D607–D613.

29.

Shannon

Markiel

Ozier

, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 2003; 13: 2498–2504.

30.

Hidalgo

Blumm

Barabási

A-L

, et al. A dynamic network approach for the study of human phenotypes. PLoS Comput Biol 2009; 5: e1000353.

31.

Chavali

Barrenas

Kanduri

, et al. Network properties of human disease genes with pleiotropic effects. BMC Syst Biol 2010; 4: 78.

32.

Zhang

Liu

. An effective method to measure disease similarity using gene and phenotype associations. Front Genet 2019; 10: 466.

33.

Jiang

Conrath

. Semantic similarity based on corpus statistics and lexical taxonomy. In: Proc. Int. Conf. Research on Computational Linguistics (ROCLING X), Taiwan, 1997.

34.

Schlicker

Domingues

Rahnenführer

, et al. A new measure for functional similarity of gene products based on gene ontology. BMC Bioinformatics 2006; 7: 302.

35.

Zhang

S-B

Lai

J-H

. Semantic similarity measurement between Gene Ontology terms based on exclusively inherited shared information. Gene 2015; 558: 108–117.

36.

Zhang

S-B

Lai

J-H

. Exploring information from the topology beneath the Gene Ontology terms to improve semantic similarity measures. Gene 2016; 586: 148–157.

37.

Del Prete

Facchiano

Liò

. Bioinformatics methodologies for Coeliac disease and its comorbidities. Brief Bioinform 2018; 21: 355–367.

38.

Schriml

Munro

Schor

, et al. The Human Disease Ontology 2022 update. Nucleic Acids Res Jan. 2022; 50: D1255–D1261.

39.

Szklarczyk

Gable

Nastou

, et al. The STRING database in 2021: customizable protein–protein networks, and functional characterization of user-uploaded gene/measurement sets. Nucleic Acids Res 2021; 49: D605–D612.

40.

Lee

Blom

Wang

, et al. Prioritizing candidate disease genes by network-based boosting of genome-wide association data. Genome Res 2011; 21: 1109–1121.

41.

Cheng

Wang

, et al. SIDD: a semantically integrated database towards a global view of human disease. PLoS One 2013; 8: e75504.

42.

Davis

Murphy

Johnson

, et al. The comparative toxicogenomics database: update 2013. Nucleic Acids Res 2013; 41: D1104–D1114.

43.

Amberger

Bocchini

Hamosh

. A new face and new challenges for Online Mendelian Inheritance in Man (OMIM®). Hum Mutat 2011; 32: 564–567.

44.

Becker

Barnes

Bright

, et al. The genetic association database. Nat Genet 2004; 36: 431–432.

45.

Mitchell

Aronson

Mork

, et al. Gene indexing: characterization and analysis of NLM’s GeneRIFs. Proc AMIA Annu Symp 2003; 2003: 460.

46.

Wang

Zhang

, et al. Splicedisease database: linking RNA splicing and disease. Nucleic Acids Res 2012; 40: D1055–D1059.

47.

Pakhomov

McInnes

Adam

, et al. Semantic similarity and relatedness between clinical terms: an experimental study. Proc AMIA Annu Symp Nov. 2010; 2010: 572.

48.

Suthram

Dudley

Chiang

, et al. Network-based elucidation of human disease similarities reveals common functional modules enriched for pluripotent drug targets. PLoS Comput Biol 2010; 6: e1000662.

49.

Libby

Ridker

Hansson

. Progress and challenges in translating the biology of atherosclerosis. Nature 2011; 473: 317–325.

50.

Ross

. Atherosclerosis—an inflammatory disease. N Engl J Med 1999; 340: 115–126.

51.

Hansson

Hermansson

. The immune system in atherosclerosis. Nat Immunol 2011; 12: 204–212.

52. Falk, Nakano, Bentzon, et al. Update on acute coronary syndromes: the pathologists’ view. Eur Heart J 2013; 34: 719–728.

53. Davies. The pathophysiology of acute coronary syndromes. Heart 2000; 83: 361–366.

54. Gibson, et al. Comparison of airway immunopathology of eosinophilic bronchitis and asthma. Am J Respir Crit Care Med 2003; 167: 418–423.

55. Accordini, Corsico, Calciano, et al. The impact of asthma, chronic bronchitis and allergic rhinitis on all-cause hospitalizations and limitations in daily activities: a population-based observational study. BMC Pulm Med 2015; 15: 10.

56. McGeachie, Chang, et al. Predicting inhaled corticosteroid response in asthma with two associated SNPs. Pharmacogenomics J 2013; 13: 306–311.

57. Gibson, McDonald. Asthma–COPD overlap 2015: now we are six. Thorax 2015; 70: 683–691.

58. Barnes. Asthma–COPD overlap. Chest 2016; 149: 7–8.

59. Cosio, Soriano, et al. Defining the asthma-COPD overlap syndrome in a COPD cohort. Chest 2016; 149: 45–52.

60. Zondervan, Becker, Missmer. Endometriosis. N Engl J Med 2020; 382: 1244–1256.

61. Sapkota, Steinthorsdottir, Morris, et al. Meta-analysis identifies five novel loci associated with endometriosis highlighting key genes involved in hormone metabolism. Nat Commun 2017; 8: 15539.


Table 2. Statistics for all datasets utilized.

Summary of Dataset Statistics

| Dataset | Diseases | Genes | Relationships |
|---|---|---|---|
| Disease Ontology | 11000 (disease terms) | - | 18000 |
| HumanNet | - | 18459 | 977495 |
| SIDD | 2814 | 12063 | 117190 |
| Benchmark Dataset | 47 | - | 70 |