IsoDA: Isoform–Disease Association Prediction by Multiomics Data Fusion

Abstract

A gene can be spliced into different isoforms by alternative splicing, which contributes to the functional diversity of protein species. Computational prediction of gene–disease associations (GDAs) has been studied for decades. However, identifying isoform–disease associations (IDAs) at a large scale, which can decipher pathology at a more granular level, is rarely explored. The main bottlenecks are the lack of IDAs in current databases and the difficulty of fusing multilevel omics data. To bridge this gap, we propose a computational approach called Isoform-Disease Association prediction by multiomics data fusion (IsoDA) to predict IDAs. Based on the relationship between a gene and its spliced isoforms, IsoDA first introduces a dispatch and aggregation term to dispatch gene-disease associations to individual isoforms and, in reverse, to aggregate these dispatched associations to their hosting genes. At the same time, it fuses the genome, transcriptome, and proteome data by joint matrix factorization to improve the prediction of IDAs. Experimental results show that IsoDA significantly outperforms the related state-of-the-art methods at both the gene level and isoform level. A case study further shows that IsoDA credibly identifies three isoforms spliced from apolipoprotein E, which have individual associations with Alzheimer's disease, and two isoforms spliced from vascular endothelial growth factor A, which have different associations with coronary heart disease. The code of IsoDA is available at http://mlda.swu.edu.cn/codes.php?name=IsoDA

1. Introduction

Understanding the genetic mechanism and pathology of diseases helps to decipher the human genome and advance the life sciences (Wang et al., 2019; Claussnitzer et al., 2020). The discovery of genetic disease associations is very important for disease prevention, diagnosis, and treatment. Wet lab (clinical)-based methods or high-throughput biotechnologies can help identify the candidate genes associated with a particular disease, but they are still limited by low throughput or coverage and by high costs.

With rapid accumulation of multiomics data (i.e., genomics, transcriptomics, and proteomics) related to gene products and disease phenotypes, diverse computational methods have been proposed (Vanunu et al., 2010; Natarajan and Dhillon, 2014; Zhou and Skolnick, 2016; Luo et al., 2019). These computational solutions can save resources by excluding genes unlikely to be associated with diseases. These approaches build on different machine learning techniques (Sun et al., 2011; Zhou and Skolnick, 2016; Frasca, 2017), such as network propagation (Vanunu et al., 2010; Wang et al., 2011; Qian et al., 2014; Jiang, 2015), matrix factorization (Natarajan and Dhillon, 2014), data fusion (Pletscher-Frankild et al., 2015), and deep neural networks (DNNs) (Yang et al., 2018; Luo et al., 2019). They mainly use gene–disease associations (GDAs) collected from public databases [i.e., DisGeNET (Piñero et al., 2020) and OMIM (Hamosh et al., 2005)].

Integration of multilevel omics data is essential for development of high-precision predictive models. To achieve better performance, researchers further fused protein–protein interaction (PPI) data from BioGRID (Stark et al., 2006), functional gene network from HumanNet (Wu et al., 2010), RNA-seq datasets, and many others.

Existing computational solutions for predicting genetic disease associations still focus on the gene level. However, a gene can be associated with diverse diseases mainly caused by isoforms alternatively spliced from the same gene. It is reported that more than 90% of human multiexon genes undergo alternative splicing (Pan et al., 2008; Wang et al., 2008), which greatly increases the transcriptome and proteome complexity (Smith and Kelleher, 2018). The proteoforms translated from different isoforms of the same gene have different amino acid sequences and structures; thus, they may have different associations with diverse diseases. Diverse complex diseases have been found to be associated with alternative splicing, such as autism spectrum disorders (Skotheim and Nees, 2007), ischemic human heart disease (Neagoe et al., 2002), and Alzheimer's disease (AD) (Holtzman et al., 2000). Apolipoprotein E (APOE) is localized in the senile plaques, congophilic angiopathy, and neurofibrillary tangles of AD. Strittmatter et al. (1993) reported that the pathogenesis of AD may be related to different bindings in APOE. They compared the binding of the synthetic amyloid beta (beta/A4) peptide to APOE4 and APOE3 (two common isoforms of APOE) and observed that APOE4 is associated with increased susceptibility to the disease. Neagoe et al. (2002) observed a titin isoform switch in chronically ischemic human hearts, with a 47:53 average N2BA-to-N2B ratio in severely diseased coronary artery disease (CHD) transplanted hearts, but a 32:68 ratio in nonischemic transplants.

Identifying isoform–disease associations (IDAs) enables a deeper view of the molecular basis of diverse genetic diseases and helps explore precise strategies and drugs to treat diverse complex diseases. However, available IDAs are mainly detected by biological experiments and there is no public database storing sufficient IDAs for training. Therefore, traditional machine learning methods cannot be directly adopted for predicting IDAs. In fact, such a bottleneck also exists in isoform function prediction. To overcome this difficulty, some researchers have adapted multiple-instance learning (MIL) (Maron and Lozano-Pérez, 1998; Carbonneau et al., 2018) for isoform function prediction. They model the gene as a bag and the isoforms spliced from this gene as its instances and then identify the individual functions of isoforms by leveraging the known gene-level functional annotations, gene–isoform relationships, and multiple RNA-seq datasets (Eksi et al., 2013; Li et al., 2014; Chen et al., 2019; Shaw et al., 2019; Wang et al., 2020; Yu et al., 2020). These solutions mainly focus on using RNA-seq datasets and/or genomic/proteomic data; they either do not account for latent correlations between functional labels or fuse only two types of omics data.

In this study, we propose the task of predicting IDAs. Compared with the canonical GDA prediction task, the IDA prediction task is deeper and more challenging due to the lack of IDAs and the complexity of alternative splicing. With the advance of RNA-seq technology, large-scale, high-resolution, transcript-level expression data can be easily collected (Wang et al., 2009) and the isoform expression can be quantified at a more precise level. Therefore, IsoDA integrates multiple RNA-seq datasets to identify IDAs. Particularly, IsoDA introduces a regularization term to distribute known GDAs of a gene to its isoforms and, in reverse, to aggregate IDAs to the gene level using the gene-isoform relationships. Considering the incomplete GDAs, IsoDA leverages protein interaction data to replenish GDAs and constructs tissue-wise isoform coexpression networks using 298 RNA-seq datasets to account for the tissue specificity of alternative splicing. It further uses the isoform sequence data to construct another isoform functional association network and then combines these networks with adaptive weights to induce a network-regularized, multilabel linear classifier to predict IDAs. In addition, IsoDA introduces an indicator matrix into the unified objective function to differentiate the observed GDAs from unobserved ones and thus alleviates the bias toward observed ones. This study is an extension of our conference work (Huang et al., 2020), which as a showcase proposes the IDA prediction task and demonstrates the fusion of genomics and RNA-seq datasets, enabling the prediction of IDAs. In this extended version, we adopt a larger human dataset with more genes, isoforms, and diseases. We fuse more omics data (genomics, transcriptomics, and proteomics), explicitly model the interrelationships between diseases, give more details on optimizing the fusion of multilevel omics data, and conduct more comprehensive validations.
Experimental results show that IsoDA achieves better results than other competitive approaches, including two approaches for predicting GDAs (Vanunu et al., 2010; Zhou and Skolnick, 2016) and three solutions for predicting isoform functions (Li et al., 2014; Wang et al., 2020; Yu et al., 2020).

2. Related Work

Due to the lack of IDAs in public repositories, there is almost no computational solution for identifying IDAs at a large scale. From the gene–isoform relationships, the prediction of IDAs can be modeled as an MIL problem (Zhou et al., 2012; Carbonneau et al., 2018), which has been extensively applied for isoform function prediction in recent years and has a close connection with the prediction of IDAs. Unlike the widely studied gene/protein function prediction, isoform function prediction is still a tough problem. The main difficulty is the lack of functional annotation at the isoform level and the complex relationships between genes and isoforms. Existing functional genome databases [i.e., Gene Ontology (Ashburner et al., 2000) and Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa et al., 2016)] only record the functional annotation of gene products at the gene level, and contemporary molecule interaction databases [i.e., BioGRID (Chatr-Aryamontri et al., 2017) and STRING (Szklarczyk et al., 2015)] still record the interaction between proteins at the gene level.

Several teams tried to push the gene-level annotations to individual isoforms by adopting MIL (Eksi et al., 2013; Li et al., 2014; Luo et al., 2017; Chen et al., 2019; Shaw et al., 2019; Wang et al., 2020; Yu et al., 2020). These computational solutions model a gene as a bag and its spliced isoforms as instances. They typically follow the principle that a gene is positive for a functional label if at least one of its isoforms is positively annotated with that label, whereas a gene that is negative for a label has no isoform annotated with that label. To name a few, Eksi et al. (2013) adopted the multiple-instance support vector machine (miSVM) (Andrews et al., 2003) to differentiate functions of isoforms in mouse RNA-seq data. miSVM leverages the functional annotations of genes, isoform expression data, and gene–isoform associations to generate an isoform-level maximum margin classifier. Li et al. (2014) developed instance-oriented multi-instance label propagation (iMILP) to predict isoform functions. iMILP first constructs multiple isoform functional association networks, then uses gene ontology (GO) annotations of a gene to universally initialize the annotations of isoforms, and finally updates the annotations of isoforms based on the greedy combination of multiple networks and label propagation on the combined network. Luo et al. (2017) proposed a novel, sparse, simplex projection-based approach, the Weighted Logistic Regression-based MIL method (WLRM), to differentiate the functions of isoforms within the MIL framework. WLRM takes the genes annotated with the function as positive bags and the genes without the function as negative ones and then maps the original bag space to a different feature space. To alleviate the lack of ground-truth annotations at the isoform level, Shaw et al. (2019) proposed a deep learning-based method (DeepIsoFun) that combines MIL with domain adaption to predict isoform functions, which provides additional labeled training data to transfer the knowledge of gene functions to prediction of isoform functions from GO annotations and RNA-seq data. Yu et al. (2020) recently introduced an approach (IsoFun) to predict isoform functions based on birandom walks on a heterogeneous network, which comprises the isoform functional association network, GO annotations of genes, gene–gene interaction network, and the gene–isoform relationships. Chen et al. (2019) presented the deep learning-based prediction of isoform functions from sequences and expression (DIFFUSE). In the first stage, DIFFUSE designs a DNN to capture features from isoform sequences and domains; in the second stage, it uses a conditional random field (CRF) to explore the relationship between isoforms and assigns GO annotations to isoforms based on initial scores computed by the DNN. DIFFUSE trains both DNN and CRF together under a novel, semisupervised learning setting. Wang et al. (2020) recently proposed DisoFun to differentiate isoform functions with collaborative matrix factorization. DisoFun follows the main idea that the functional annotations of genes are aggregated from key isoforms. It jointly factorizes the isoform expression data matrix (derived from multiple RNA-seq datasets) and the gene-term association matrix (storing the GO annotations of genes) into low-rank matrices to explore the latent key isoforms, and it pushes the annotations to isoforms by enforcing the aggregated annotations from isoforms to be consistent with the known annotations of genes. DisoFun further leverages PPI networks and the GO hierarchy structure to replenish the annotations of genes and those of key isoforms.
These solutions mainly focus on using RNA-seq datasets (Eksi et al., 2013; Li et al., 2014; Luo et al., 2017; Shaw et al., 2019); some of them additionally use genomic data (Chen et al., 2019) or PPIs (Wang et al., 2020; Yu et al., 2020). They neglect the important latent correlations between functional labels and simply fuse two types of omics data without differentiation.

Many studies reported that isoforms are indeed associated with many complex diseases (Holtzman et al., 2000; Neagoe et al., 2002; Latorre et al., 2018), but computational solutions for IDAs are rarely reported, compared with the large-scale study of GDA prediction (Vanunu et al., 2010; Natarajan and Dhillon, 2014; Zhou and Skolnick, 2016; Luo et al., 2019). The recent progress on isoform function prediction sheds light on how to infer IDAs. In this study, we introduce a computational solution (IsoDA) by fusing multiomics data and MIL in a principled way. IsoDA integrates, with adaptive weights, multiple isoform–isoform association networks derived from multiple RNA-seq datasets and the sequence similarity network derived from nucleotides. It takes advantage of the PPI network to replenish the missing GDAs and then induces a linear classifier to push gene-level associations to individual isoforms in a coherent way. The experimental results show that IsoDA achieves better performance than not only representative, related, GDA prediction methods (Vanunu et al., 2010; Zhou and Skolnick, 2016) but also competitive, isoform function prediction solutions (Li et al., 2014; Wang et al., 2020; Yu et al., 2020). A further case study again corroborates the effectiveness of IsoDA and its advantages over these compared methods.

3. The Proposed Method

3.1. Materials and preprocessing

Suppose there are n genes, the i-th gene produces $n_{i} \geq 1$ isoforms, and the total number of isoforms is $m = \sum_{i = 1}^{n} n_{i}$ . $R_{12} \in ℛ^{n \times m}$ is the relational data matrix between n genes and m isoforms; $R_{12} (i, j) = 1$ if the i-th gene hosts the j-th isoform, otherwise $R_{12} (i, j) = 0$ .
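As a minimal sketch of this definition, the snippet below builds $R_{12}$ from per-gene isoform lists and recovers $n_{i}$ as its row sums. The function name, the NumPy representation, and the toy isoform identifiers are illustrative assumptions rather than part of the original pipeline.

```python
import numpy as np

def build_gene_isoform_matrix(isoforms_per_gene):
    """Build R12 (n x m) from a list holding, for each gene, its isoform IDs."""
    n = len(isoforms_per_gene)
    m = sum(len(isoforms) for isoforms in isoforms_per_gene)
    R12 = np.zeros((n, m))
    col = 0
    for i, isoforms in enumerate(isoforms_per_gene):
        for _ in isoforms:
            R12[i, col] = 1.0  # gene i hosts the isoform stored in column `col`
            col += 1
    return R12

# Toy example: three genes hosting 2, 1, and 3 isoforms (hypothetical IDs).
genes = [["g1.a", "g1.b"], ["g2.a"], ["g3.a", "g3.b", "g3.c"]]
R12 = build_gene_isoform_matrix(genes)
n_i = R12.sum(axis=1)  # number of isoforms per gene
```

Each column of `R12` contains exactly one nonzero entry because every isoform is hosted by a single gene.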

We adopt the widely used fragments per kilobase of exon per million fragments mapped (FPKM) values to quantify the expression of isoforms. Particularly, we downloaded 596 RNA-seq runs (298 samples in total, from different tissues and conditions) of humans from The Encyclopedia of DNA Elements (ENCODE) project (Consortium, 2012; access date: November 10, 2019). These datasets are heterogeneous in terms of library preparation procedures and sequencing platforms. Following the preprocessing done by Li et al. (2014) and Wang et al. (2020), for each tissue, we control the quality of these RNA-seq datasets and quantify the expression values of isoforms as follows:

(i) We first align the short reads of each RNA-seq dataset to the human genome (build GRCh38.90) from Ensembl using HISAT2 (v.2-2.1.0; Kim et al., 2015) and a GTF annotation file of the same build, with the option of no novel junctions.

(ii) Then, we use StringTie (v.1.3.3b; Pertea et al., 2015) to calculate the relative abundance of each transcript as FPKM. We separately compute the FPKM values of a total of 57,964 genes with 219,288 isoforms for each sample.

(iii) The FPKM values of very short isoforms are exceptionally high. Therefore, we discard the isoforms with <100 nucleotides.

(iv) To further control the quality of isoforms, we use known protein coding gene names to map those genes obtained in step (iii). Finally, we obtain 15,204 genes with 137,910 isoforms. The expression values of these isoforms are stored in the data matrix $X_{1} \in ℛ^{m \times d_{1}}$. We further normalize $X_{1}$ and, for convenience, use the normalized $X_{1}$ in subsequent experiments.

To get the available GDAs, we downloaded the GDA file and the mapping file Unified Medical Language System Unique Concept Identifier (UMLS CUI) to disease ontology (DO) (Schriml et al., 2012) vocabularies from DisGeNET (Piñero et al., 2020). Then, we directly use the available GDAs and DO hierarchy to specify the gene-term association matrix $Y \in ℛ^{n \times c}$ between n genes and c DO terms. Specifically, if a DO term s or s's descendant terms are positively associated with the gene i, then $Y (i, s) = 1$ , otherwise $Y (i, s) = 0$ . In addition, we excluded the too sparse DO terms annotated to fewer than 30 genes and the too general DO terms annotated to more than 300 genes.
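The propagation and filtering rule can be illustrated with a short sketch. It assumes the DO hierarchy is given as a child-to-parents map and the direct GDAs as a gene-to-terms map; both structures, the function names, and the toy term identifiers are hypothetical, and only the thresholds (30 and 300 genes) come from the text.

```python
def ancestors(term, parents):
    """Return all ancestors of `term` in a DAG given a child -> parents map."""
    seen, stack = set(), list(parents.get(term, ()))
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, ()))
    return seen

def propagate_and_filter(gene_terms, parents, min_genes=30, max_genes=300):
    """Propagate direct annotations to ancestor terms, then keep the terms
    annotated to at least `min_genes` and at most `max_genes` genes."""
    full = {g: set(ts) | {a for t in ts for a in ancestors(t, parents)}
            for g, ts in gene_terms.items()}
    counts = {}
    for ts in full.values():
        for t in ts:
            counts[t] = counts.get(t, 0) + 1
    kept = sorted(t for t, c in counts.items() if min_genes <= c <= max_genes)
    Y = {g: [1 if t in ts else 0 for t in kept] for g, ts in full.items()}
    return kept, Y

# Toy hierarchy with tiny thresholds so that the example stays readable.
parents = {"DO:child": ["DO:parent"], "DO:parent": ["DO:root"]}
gene_terms = {"G1": ["DO:child"], "G2": ["DO:parent"], "G3": []}
kept, Y = propagate_and_filter(gene_terms, parents, min_genes=1, max_genes=2)
```

Gene G1 is annotated only to the child term, yet after propagation it is also positive for the parent and root terms, which matches the rule that a gene is associated with a term whenever it is associated with one of its descendants.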

We collected the gene interaction data from BioGRID (https://thebiogrid.org), which is a curated biological database of genetic interactions, chemical interactions, and post-translational modifications of gene products. Let $R_{11} \in ℛ^{n \times n}$ encode the gene-level interactions, then $R_{11} (i, j) > 0$ if the gene i has a physical interaction with gene j, otherwise $R_{11} (i, j) = 0$ , and the entry weight of $R_{11} (i, j)$ is determined by the interaction strength.

We further collected the nucleotide sequences of isoforms from The National Center for Biotechnology Information (NCBI) Nucleotide database, and we adopted the conjoint triad method (Shen et al., 2007) to extract numeric features from each nucleotide sequence; this method considers three consecutive bases as a unit and calculates the frequency of each triad type. Because a nucleotide sequence comprises adenine (A), guanine (G), cytosine (C), and thymine (T), a $4 \times 4 \times 4$-dimensional frequency vector was generated to represent the sequence information of each isoform. To handle the variable lengths of nucleotide sequences of different isoforms, we further normalize the resulting isoform sequence feature data matrix $X_{2} \in ℛ^{m \times d_{2}}$.
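The triad features can be sketched as follows. The sketch assumes overlapping windows of three bases and normalization by the total number of counted triads; the text does not specify either choice, so both are assumptions, as are the function and variable names.

```python
from itertools import product

BASES = "ACGT"
TRIADS = ["".join(p) for p in product(BASES, repeat=3)]  # 4 x 4 x 4 = 64 types
INDEX = {t: k for k, t in enumerate(TRIADS)}

def triad_frequencies(seq):
    """Return the 64-dimensional, length-normalized triad frequency vector."""
    counts = [0] * len(TRIADS)
    seq = seq.upper()
    for i in range(len(seq) - 2):
        k = INDEX.get(seq[i:i + 3])  # windows with ambiguous bases are skipped
        if k is not None:
            counts[k] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(TRIADS)

# Toy sequence with seven overlapping triads, two of which are "ATG".
vec = triad_frequencies("ATGGCATGA")
```

Dividing by the number of counted triads makes vectors of isoforms with different lengths comparable, which is the purpose of the normalization mentioned above.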

3.2. IDA prediction

The lack of IDAs makes it difficult to directly apply the traditional, supervised learning methods to predict IDAs. Within the MIL framework, we leverage the obtained gene–isoform relationships $R_{12}$ and gene-level disease associations to identify the distinct disease associations of individual isoforms. Suppose $Z \in ℛ^{m \times c}$ stores the latent associations between m isoforms and c distinct DO terms, given the known associations $Y$ between n genes and c diseases and motivated by the principle that the label of a bag is responsible for at least one instance of this bag (Maron and Lozano-Pérez, 1998; Carbonneau et al., 2018), a GDA should also be responsible for at least one isoform spliced from this gene. We can obtain the aggregated GDAs from its spliced isoforms and distribute the collected gene-level GDAs (stored in $Y$) to individual isoforms spliced from the genes as follows: $Y = \boldsymbol{Λ} R_{12} Z$ (1)

where $\boldsymbol{Λ} \in ℛ^{n \times n}$ is a diagonal matrix with $\boldsymbol{Λ} (i, i) = 1 ∕ n_{i}$, and $n_{i}$ represents the number of distinct isoforms spliced from the i-th gene. $Y$ is the available gene-term association matrix. With this dispatch and aggregation objective, we can optimize $Z$ and thus predict the latent associations between m isoforms and c DO terms. Next, we can induce a linear predictor based on $Z$ as follows: $\min_{W, Z} Ω (W, Z) = {∥Z - X W∥}_{F}^{2} + {∥Y - \boldsymbol{Λ} R_{12} Z∥}_{F}^{2} + λ_{1} {∥W∥}_{F}^{2}$ (2)
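Eqs. (1) and (2) translate almost directly into matrix code. The following sketch, with illustrative function names and toy matrices that are not taken from the paper, aggregates isoform-level scores to genes and evaluates the objective of Eq. (2).

```python
import numpy as np

def aggregate_to_genes(R12, Z):
    """Eq. (1): average the isoform-level scores Z (m x c) into gene-level scores (n x c)."""
    n_i = R12.sum(axis=1)        # number of isoforms per gene
    Lam = np.diag(1.0 / n_i)     # diagonal matrix with entries 1 / n_i
    return Lam @ R12 @ Z

def objective_eq2(X, W, Z, Y, R12, lam1):
    """Eq. (2): fitting term + dispatch/aggregation term + penalty on W."""
    fit = np.linalg.norm(Z - X @ W, "fro") ** 2
    agg = np.linalg.norm(Y - aggregate_to_genes(R12, Z), "fro") ** 2
    return fit + agg + lam1 * np.linalg.norm(W, "fro") ** 2

# Toy case: gene 1 hosts isoforms 1 and 2, gene 2 hosts isoform 3.
R12 = np.array([[1, 1, 0], [0, 0, 1]], dtype=float)
Z = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]])
Y_hat = aggregate_to_genes(R12, Z)
```

In the toy case the first gene receives a score of 0.5 for the first term because only one of its two isoforms carries the association; the averaging by $1 ∕ n_{i}$ is what keeps genes with many isoforms from dominating.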

where $X \in ℛ^{m \times d}$ is the numeric feature matrix of m isoforms, which is concatenated by isoform expression feature matrix $X_{1}$ and sequence feature matrix $X_{2}$ , and $d = d_{1} + d_{2}$ . $W \in ℛ^{d \times c}$ is the coefficient matrix for the linear predictor, which maps the numeric feature matrix of m isoforms $X$ onto c distinct DO terms. The scale parameter $λ_{1}$ is added to control the complexity of the linear predictor.

By taking $Z$ as the to-be-predicted variable, we can reversely push the GDAs of n genes to m isoforms and achieve the prediction of IDAs at the isoform level. However, the collected GDAs are rather incomplete and biased, which may miss some important GDAs and lead to biased prediction of IDAs. Given this, we attempt to replenish GDAs using the interactome of genes and extend the above equation as follows:

where $F \in ℛ^{n \times c}$ stores the latent GDAs between n genes and c DO terms, $H = Y$ is the indicator matrix of the observed GDAs, and $⨀$ denotes element-wise multiplication. The term ${∥H ⨀ (F - Y)∥}_{F}^{2}$ is introduced to enforce the latent GDAs to be consistent with the collected ones and to differentiate the observed GDAs from the latent ones, thus reducing the bias toward observed ones. The term $\mathrm{tr} (F^{T} L_{11} F)$ is introduced to replenish GDAs by introducing protein-level interaction data. Here, $R_{11}$ refers to the protein interaction network matrix (as stated in the data preprocessing subsection), $L_{11} = D_{11} - R_{11}$, and $D_{11}$ is a diagonal matrix with $D_{11} (i, i) = \sum_{j = 1}^{n} R_{11} (i, j)$.
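The trace term has a standard pairwise reading: for a symmetric $R_{11}$, $\mathrm{tr} (F^{T} L_{11} F)$ equals half the sum of $R_{11} (i, j)$ times the squared distance between the association rows of genes i and j, so strongly interacting genes are pushed toward similar disease profiles. The sketch below checks this identity numerically on a toy network; the function names and the toy values are illustrative.

```python
import numpy as np

def laplacian(R):
    """Unnormalized graph Laplacian L = D - R of a symmetric weight matrix R."""
    return np.diag(R.sum(axis=1)) - R

def smoothness(F, R):
    """tr(F^T L F), the network smoothness penalty on the rows of F."""
    return np.trace(F.T @ laplacian(R) @ F)

# Toy interaction network on three genes and a toy association matrix.
R11 = np.array([[0, 1, 0], [1, 0, 2], [0, 2, 0]], dtype=float)
F = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]])

# Pairwise form of the same quantity.
pairwise = 0.5 * sum(R11[i, j] * np.sum((F[i] - F[j]) ** 2)
                     for i in range(3) for j in range(3))
```

Both expressions evaluate to 3 on this example, with the heavier edge between the second and third genes contributing twice as much as the lighter one.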

A gene generates one or more isoforms by alternative splicing, and these isoforms have diverse expression patterns across tissues (Kandoi and Dickerson, 2019; Defer et al., 2000). Based on this observation, the association networks of isoforms should be constructed from the tissue level, and more appropriate fusion of these networks can help to accurately identify the IDAs. To make full use of tissue-specific patterns of multiple RNA-seq datasets, we advocate integrating multiple isoform functional association networks from the tissue level with weights. In addition, sequence data also carry important information for prediction of IDAs, so we also construct a sequence similarity-based isoform functional association network of isoforms. To this end, we integrate multiple isoform functional association networks and extend Eq. (3) as follows

Here, $L_{22}^{(v)} = D_{22}^{(v)} - R_{22}^{(v)}$, and $R_{22}^{(v)} \in ℛ^{m \times m}$ encodes the coexpression strength induced from multiple RNA-seq datasets of the v-th tissue. We use $V = 10$ networks: 9 association networks from 9 different tissues, which are obtained by cosine similarity from the isoform expression feature data, and 1 network based on the isoform sequence feature data. $D_{22}^{(v)}$ is a diagonal matrix with $D_{22}^{(v)} (i, i) = \sum_{j = 1}^{m} R_{22}^{(v)} (i, j)$. $α = [α_{1}, α_{2}, \dots, α_{V}]$ collects the weights assigned to the V networks. $λ_{2}$ is introduced to balance the information sources from the gene level and isoform level.
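A minimal sketch of the network construction and of the weighted smoothness term is given below. It assumes that self-similarities are removed and that the networks enter the objective as a sum of $α_{v}^{p}$-weighted trace terms, which is the form implied by Eq. (20); the function names and toy data are illustrative.

```python
import numpy as np

def cosine_network(X):
    """Pairwise cosine similarity between the rows of X, with a zero diagonal."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0      # all-zero profiles keep similarity 0
    U = X / norms
    R = U @ U.T
    np.fill_diagonal(R, 0.0)     # no self-loops (an assumption)
    return R

def weighted_smoothness(Z, networks, alpha, p=2):
    """Sum over v of alpha_v^p * tr(Z^T L22^(v) Z) for the given networks."""
    total = 0.0
    for a, R in zip(alpha, networks):
        L = np.diag(R.sum(axis=1)) - R
        total += (a ** p) * np.trace(Z.T @ L @ Z)
    return total

# Toy expression profiles: isoforms 1 and 2 are coexpressed, isoform 3 is not.
X1 = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
R = cosine_network(X1)
Z = np.array([[1.0], [0.0], [0.0]])
value = weighted_smoothness(Z, [R, R], alpha=[0.5, 0.5], p=2)
```

Because FPKM values and triad frequencies are non-negative, the resulting cosine similarities are non-negative as well, so each matrix can serve directly as a weighted adjacency matrix.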

An isoform can be associated with different diseases and these diseases have some latent correlations. For example, the diseases are hierarchically organized by a directed acyclic graph in DO. It is recognized that the account of such hierarchical information can boost the performance of isoform function prediction (Wang et al., 2020; Yu et al., 2020). Here, we introduce a latent disease–disease correlation matrix $S \in ℛ^{c \times c}$ into our model. We adopt cosine similarity to construct the disease–disease association network $S$ from the available GDA data. Since the initially estimated disease–disease correlations may be incomplete and unreliable, we further optimize $S$ during the training of IsoDA and formalize the objective function of IsoDA as follows:
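A form of this objective that is consistent with the terms described above, with the fixed-point conditions in Eqs. (10) and (12), and with the derivative in Eq. (20) can be sketched as follows. It is a reconstruction under stated assumptions: the unit weights on the gene-level terms are inferred from Eq. (12), and attaching $λ_{2}$ to the isoform-network term is inferred from Eq. (20), which writes the coefficient as $λ$.

$\min_{W, Z, F, S, α} Ω (W, Z, F, S, α) = {∥Z - X W S∥}_{F}^{2} + {∥F - \boldsymbol{Λ} R_{12} Z∥}_{F}^{2} + {∥H ⨀ (F - Y)∥}_{F}^{2} + \mathrm{tr} (F^{T} L_{11} F) + λ_{2} \sum_{v = 1}^{V} α_{v}^{p} \mathrm{tr} (Z^{T} L_{22}^{(v)} Z) + λ_{1} {∥W∥}_{F}^{2}$, subject to $W, Z, F, S \geq 0$ and $\sum_{v = 1}^{V} α_{v} = 1$.

Under this reading, the first term ties the isoform features to the latent IDAs through the disease correlations $S$, the second to fourth terms dispatch, anchor, and replenish the GDAs, and the fifth term smooths $Z$ over the adaptively weighted isoform networks.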

3.3. Optimization

The optimization problem in Eq. (5) is nonconvex with respect to $W$ , $Z$ , $F$ , $S$ , and α altogether. It is difficult to seek the global optimal solutions for them at the same time. We follow the idea of alternating direction method of multipliers (Boyd and Vandenberghe, 2004) to alternately optimize one variable by fixing the other four variables in an iterative way. The detailed procedure is presented as follows. The partial derivatives of $Ω (W, Z, F, S, α)$ with respect to $W$ , $Z$ , $F$ , and $S$ are as follows:

We can then use the Karush–Kuhn–Tucker conditions (Boyd and Vandenberghe, 2004) for the non-negativity of $W$ , $Z$ , $F$ , and $S$ :

${(λ_{1} W - X^{T} Z S^{T} + X^{T} X W S S^{T})}_{i j} {[W]}_{i j} = 0$ (10)

${(F - \boldsymbol{Λ} R_{12} Z + H ⨀ F ⨀ H - H ⨀ Y ⨀ H + D_{11} F - R_{11} F)}_{i j} {[F]}_{i j} = 0$ (12)

${(- W^{T} X^{T} Z + W^{T} X^{T} X W S)}_{i j} {[S]}_{i j} = 0$ (13)

These non-negative constraints give the fixed-point relationship that the solution must satisfy. As such, we can update $W$ , $Z$ , $F$ , and $S$ using the following update rules:
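For instance, Eq. (10) has the familiar shape of a non-negative matrix factorization fixed point, which suggests moving the negative part of the gradient to the numerator and the positive part to the denominator. The sketch below is derived from Eq. (10) alone and is an assumption about the form of the rule for $W$, not a quotation of the authors' update; the function names, the small constant `eps`, and the random toy data are illustrative.

```python
import numpy as np

def update_W(W, X, Z, S, lam1, eps=1e-12):
    """Multiplicative update suggested by the fixed-point condition in Eq. (10)."""
    numer = X.T @ Z @ S.T
    denom = lam1 * W + X.T @ X @ W @ S @ S.T + eps  # eps avoids division by zero
    return W * numer / denom

def loss_W(W, X, Z, S, lam1):
    """The W-dependent part of the objective implied by Eq. (10)."""
    fit = np.linalg.norm(Z - X @ W @ S, "fro") ** 2
    return fit + lam1 * np.linalg.norm(W, "fro") ** 2

# Non-negative random toy data (8 isoforms, 4 features, 3 terms).
rng = np.random.default_rng(0)
X, Z = rng.random((8, 4)), rng.random((8, 3))
S, W = rng.random((3, 3)), rng.random((4, 3))
before = loss_W(W, X, Z, S, lam1=0.1)
for _ in range(20):
    W = update_W(W, X, Z, S, lam1=0.1)
after = loss_W(W, X, Z, S, lam1=0.1)
```

With non-negative inputs the update keeps $W$ non-negative by construction, and at a fixed point either the bracketed gradient term of Eq. (10) or the corresponding entry of $W$ vanishes.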

When $W$, $Z$, $F$, and $S$ are fixed, Eq. (5) reduces to the following problem in $α$:

Here, we adopt the Lagrange multiplier method to optimize α:

where $η$ is the Lagrange multiplier. We can take the partial derivative of $H (Z, α, η)$ with respect to $α_{v}$ and set it to 0 as follows: $\frac{\partial Ω (W, Z, F, S, α)}{\partial α_{v}} = λ p α_{v}^{p - 1} \mathrm{tr} (Z^{T} L_{22}^{(v)} Z) - η = 0$ (20) $α_{v} = {\left( \frac{η}{λ p \, \mathrm{tr} (Z^{T} L_{22}^{(v)} Z)} \right)}^{\frac{1}{p - 1}}$ (21)

Since $\sum_{v = 1}^{V} α_{v} = 1$, we can obtain the following: $α_{v} = \frac{{\left( 1 ∕ \mathrm{tr} (Z^{T} L_{22}^{(v)} Z) \right)}^{\frac{1}{p - 1}}}{\sum_{v = 1}^{V} {\left( 1 ∕ \mathrm{tr} (Z^{T} L_{22}^{(v)} Z) \right)}^{\frac{1}{p - 1}}}$ (22)
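Eq. (22) is simple enough to evaluate by hand, and a short sketch makes its behavior concrete: networks on which $Z$ is smoother (smaller trace) receive larger weights. The function name and the toy trace values are illustrative, and the traces are assumed to be strictly positive.

```python
def update_alpha(traces, p=2):
    """Eq. (22): alpha_v is proportional to (1 / trace_v) ** (1 / (p - 1))."""
    if p <= 1:
        raise ValueError("Eq. (22) requires p > 1")
    scores = [(1.0 / t) ** (1.0 / (p - 1)) for t in traces]
    total = sum(scores)
    return [s / total for s in scores]

# Toy values of tr(Z^T L22^(v) Z) for three networks.
alpha = update_alpha([2.0, 4.0, 4.0], p=2)
```

With $p = 2$ the weights are inversely proportional to the traces; larger values of $p$ flatten the weights toward the uniform distribution, whereas $p$ close to 1 concentrates the weight on the single smoothest network.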

By iteratively updating $W$, $Z$, $F$, $S$, and $α$ using Eqs. (14), (15), (16), (17), and (22), we can obtain locally optimal values of $W$, $Z$, $F$, $S$, and $α$. Algorithm 1 lists the above optimization procedure; IsoDA often converges within 50 iterations on our dataset.

Algorithm 1. IsoDA: Isoform–Disease Association Prediction by Multiomics Data Fusion

Require: $X$, $R_{11}$, $\boldsymbol{Λ}$, $R_{12}$, ${\{R_{22}^{(v)}\}}_{v = 1}^{V}$, $Y$, $p$, $λ_{1}$, $λ_{2}$, $maxIter$, and $tol$.

Ensure: $W$, $Z$, $F$, $S$, and $α$.

1: Initialize $α_{v} = 1 ∕ V$, $iter = 1$, $tol = 10^{- 2}$, and $maxIter = 60$;

2: Initialize $W$ randomly;

3: Specify $S$ as the disease–disease correlation matrix by cosine similarity;

4: Initialize $F = Y$;

5: Initialize $Z = R_{12}^{T} F$;

6: $loss^{iter} = Ω (W, Z, F, S, α)$;

7: While $iter < maxIter$ and $| δ | > tol$

8: update $W$ using Eq. (14);

9: update $Z$ using Eq. (15);

10: update $F$ using Eq. (16);

11: update $S$ using Eq. (17);

12: update $α$ using Eq. (22);

13: $loss^{iter + 1} = Ω (W, Z, F, S, α)$;

14: $δ \leftarrow loss^{iter + 1} - loss^{iter}$;

15: $iter = iter + 1$;

16: End while
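The control flow of Algorithm 1 can be sketched independently of the individual update rules. In the sketch below the update steps and the loss are supplied by the caller, and $δ$ is initialized to infinity so that the loop is entered on the first pass; that initialization, the function name, and the toy problem are assumptions made for illustration.

```python
def alternate(state, update_steps, loss_fn, max_iter=60, tol=1e-2):
    """Run the alternating loop of Algorithm 1 with caller-supplied steps."""
    loss = loss_fn(state)
    delta = float("inf")            # assumption: guarantees the first pass runs
    it = 1
    while it < max_iter and abs(delta) > tol:
        for step in update_steps:   # e.g. the updates of W, Z, F, S, and alpha
            state = step(state)
        new_loss = loss_fn(state)
        delta, loss = new_loss - loss, new_loss
        it += 1
    return state, loss, it

# Toy problem: one "variable" that is halved per pass, with a squared loss.
x, final_loss, iters = alternate(8.0, [lambda v: v / 2], lambda v: v ** 2)
```

The loop stops as soon as the change in the loss between two consecutive passes falls to the tolerance or below, or when the iteration counter reaches the maximum, whichever happens first.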

3.4. IDA/GDA prediction

Suppose $Z^{*}$ is the optimized isoform–disease association matrix. Because the ground-truth disease associations of isoforms are generally unknown, we enable a surrogate evaluation by aggregating the predicted IDAs to the gene level. Following Eq. (1), we approximate the GDA matrix as follows: $Y^{*} = \boldsymbol{Λ} R_{12} Z^{*}$ (23)

4. Experiment Results and Analysis

4.1. Experimental setup

In our article, we collect multiple RNA-seq datasets from the ENCODE project, GDA data from DisGeNET, gene interaction data from BioGRID, and sequence data of isoforms from NCBI for assessing the performance of IsoDA in predicting IDAs. The preprocessed GDAs and isoforms of the genes are listed in Table 1.

Table 1.

Statistics of Genes, Isoforms, and Gene–Disease Associations for Experiments

Genes (n)	Isoforms (m)	Diseases (c)	GDAs
12,371	26,866	3,883	673,046

GDAs, gene–disease associations.

To comparatively study the performance of IsoDA, we take three state-of-the-art isoform function prediction methods [iMILP (Li et al., 2014), IsoFun (Yu et al., 2020), and DisoFun (Wang et al., 2020)] and two GDA prediction methods [PRINCE (Vanunu et al., 2010) and Know-GENE (Zhou and Skolnick, 2016)] as the compared methods. The input parameters of these compared methods are fixed/optimized as in the original articles or shared codes. For IsoDA, we choose $λ_{1}$ and $λ_{2}$ in $\{10^{- 4}, 10^{- 3}, \dots, 10^{3}, 10^{4}\}$ and set $p = 2$. Due to the lack of IDAs, we use surrogate evaluation by aggregating the predicted IDAs to affiliated genes; this approximate evaluation was also adopted in isoform function prediction (Li et al., 2014; Yu et al., 2020). In addition, we compare IsoDA against its degenerated variants to study the contribution of its components. We further use isoforms with collected isoform–disease associations to validate the effectiveness of IsoDA.

We adopt six evaluation metrics, MicroF1, MacroF1, 1-RankLoss, Fmax, AUPRC, and AUROC, to evaluate the performance of IsoDA. MicroF1 computes the F1-score on the predictions of different DO terms as a whole; MacroF1 calculates the F1-score of each term and then takes the average value across all DO terms; RankLoss computes the average fraction of incorrectly predicted associations ranking ahead of the ground-truth associations; Fmax is the global maximum harmonic mean of recall and precision across all possible thresholds; AUPRC calculates the area under the precision–recall curve of each term and then computes the average value of these areas as the overall performance; and AUROC computes the area under the receiver operating characteristic curve of each term and then takes the average value of these areas to quantify the overall performance. The higher the values of MicroF1, MacroF1, 1-RankLoss, Fmax, AUPRC, and AUROC, the better the performance. We want to remark that these six metrics quantify the prediction results from different aspects, and it is difficult for one method to always outperform another across all these metrics.
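Three of these metrics can be sketched compactly. The code below assumes binary gene-by-term matrices for the F1 scores and real-valued scores for the ranking loss, and it counts ties between a positive and a negative term as ranking errors; these conventions, like the function names and toy matrices, are assumptions rather than the evaluation code used in the experiments.

```python
import numpy as np

def micro_macro_f1(Y_true, Y_pred):
    """MicroF1 pools all terms; MacroF1 averages the per-term F1 scores."""
    tp = (Y_true * Y_pred).sum(axis=0).astype(float)
    fp = ((1 - Y_true) * Y_pred).sum(axis=0)
    fn = (Y_true * (1 - Y_pred)).sum(axis=0)
    micro = 2 * tp.sum() / max(2 * tp.sum() + fp.sum() + fn.sum(), 1)
    per_term = 2 * tp / np.maximum(2 * tp + fp + fn, 1)
    return micro, per_term.mean()

def rank_loss(Y_true, scores):
    """Average, over genes, of the fraction of (positive, negative) term
    pairs in which the negative term is scored at least as high."""
    losses = []
    for y, s in zip(Y_true, scores):
        pos, neg = s[y == 1], s[y == 0]
        if len(pos) and len(neg):
            losses.append(np.mean(pos[:, None] <= neg[None, :]))
    return float(np.mean(losses))

# Toy data: two genes, three DO terms.
Y_true = np.array([[1, 0, 1], [0, 1, 0]])
Y_pred = np.array([[1, 0, 0], [0, 1, 1]])
scores = np.array([[0.9, 0.2, 0.1], [0.3, 0.8, 0.5]])
micro, macro = micro_macro_f1(Y_true, Y_pred)
rl = rank_loss(Y_true, scores)
```

On the toy data the first gene has one of its two positive terms ranked below a negative term, so its ranking loss is 0.5, the second gene is ranked perfectly, and the reported 1-RankLoss would be 0.75.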

4.2. Result evaluation at the gene level

We adopt fivefold cross-validation at the gene level for the experiment. The GDAs in the validation set are considered as unknown during training and prediction and only used for validation. Table 2 reports the results of IsoDA and of compared methods. IsoDA achieves better results than the compared methods across all the six evaluation metrics. MicroF1, MacroF1, AUPRC, and AUROC are disease-centric metrics, while 1-RankLoss and Fmax are gene/isoform-centric metrics. These results clearly confirm that IsoDA can more accurately predict GDAs (IDAs) from both the gene (isoforms) and DO term perspectives. IsoDA takes tissue specificity and isoform sequence data into account and fuses multiple isoform functional association networks constructed from RNA-seq datasets and isoform sequence data with adaptive weights. In contrast, iMILP only fuses functional association networks derived from RNA-seq datasets without adaptive weights. IsoFun and DisoFun concatenate the isoform expression profiles of different tissues into a single feature vector and ignore the tissue specificity. For these reasons, IsoDA gives better results than these MIL-based isoform function prediction methods. IsoDA, IsoFun, and DisoFun incorporate the important PPI data to complete GDAs, but IsoDA adds an additional indicator matrix $H$ to separately model the seen GDAs and unseen ones. As a result, the optimized GDAs are consistent with the collected ones and IsoDA is less biased toward seen ones than IsoFun and DisoFun. In addition, the completed GDAs can be distributed to individual isoforms and thus boost the prediction of IDAs. These advantages will be further confirmed by ablation study.
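One way to realize the gene-level split is sketched below. It assumes that genes, rather than individual associations, are partitioned into folds and that all GDAs of the validation genes are hidden during training; the text does not spell out the masking scheme, so this reading, the function names, and the toy matrix are assumptions.

```python
import random

def gene_level_folds(n_genes, k=5, seed=0):
    """Shuffle the gene indices and split them into k disjoint validation folds."""
    idx = list(range(n_genes))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]

def mask_validation(Y, val_genes):
    """Copy Y (a list of rows) and zero the rows of the validation genes."""
    hidden = set(val_genes)
    return [[0] * len(row) if i in hidden else list(row)
            for i, row in enumerate(Y)]

# Toy GDA matrix: 12 genes, 2 DO terms, every gene positive for the first term.
Y = [[1, 0] for _ in range(12)]
folds = gene_level_folds(12)
Y_train = mask_validation(Y, folds[0])
```

The model is then trained on `Y_train`, and the hidden rows of `Y` serve as ground truth for the aggregated predictions of the validation genes; repeating this for each fold gives the averages and standard deviations of the kind reported in Table 2.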

Table 2.

Experimental Results of Fivefold Cross-Validation

Metric	PRINCE	Know-GENE	iMILP	IsoFun	DisoFun	IsoDA
MicroF1	0.2146 ± 0.0253^●	0.4804 ± 0.0075^●	0.1954 ± 0.0148^●	0.2561 ± 0.0205^●	0.2879 ± 0.0158^●	0.6739 ± 0.0076
MacroF1	0.1759 ± 0.0204^●	0.2930 ± 0.0142^●	0.0782 ± 0.0194^●	0.1152 ± 0.0238^●	0.1035 ± 0.0115^●	0.3187 ± 0.0132
1-RankLoss	0.8243 ± 0.0262^●	0.9355 ± 0.0133^●	0.3764 ± 0.0291^●	0.7268 ± 0.0109^●	0.8914 ± 0.0037^●	0.9465 ± 0.0022
Fmax	0.2854 ± 0.0186^●	0.3269 ± 0.0061^●	0.1492 ± 0.0163^●	0.2267 ± 0.0162^●	0.2450 ± 0.0128^●	0.5437 ± 0.0059
AUPRC	0.3053 ± 0.0197^●	0.3741 ± 0.0084^●	0.0152 ± 0.0034^●	0.0745 ± 0.0118^●	0.0806 ± 0.0065^●	0.3816 ± 0.0087
AUROC	0.5846 ± 0.0139^●	0.6365 ± 0.0068^●	0.5149 ± 0.0095^●	0.6077 ± 0.0060^●	0.6223 ± 0.0082^●	0.6471 ± 0.0046

●/○ indicates that IsoDA performs significantly better/worse than the compared method, with significance assessed by pairwise t-test at the 95% level.

We also compare the performance of IsoDA with two GDA prediction methods, PRINCE (Vanunu et al., 2010) and Know-GENE (Zhou and Skolnick, 2016). PRINCE uses a network propagation strategy to predict causal genes and protein complexes that are involved in a disease of interest. Know-GENE first quantifies gene–gene mutual information using the co-occurrence of genes in GDA data and then combines the mutual information with PPI networks using a boosted tree regression method to predict GDAs. Compared with PRINCE, Know-GENE makes better use of GDAs: it integrates gene–gene mutual information calculated from GDAs and the available PPIs to predict GDAs in a knowledge-driven way, so Know-GENE outperforms PRINCE by a large margin. For similar reasons, the performance margin between IsoDA and Know-GENE is smaller than those between IsoDA and the other compared methods. From Table 2, we can observe that the isoform function prediction methods sometimes perform worse than the two GDA prediction methods. This is because isoform-level methods focus more on utilizing transcriptomics data, while the surrogate evaluations are made at the aggregated gene level instead of the target isoform level. We want to highlight that IsoDA is an inductive approach that can directly predict the associations between diseases and a new isoform, whereas the compared methods are transductive solutions that need the new isoform to be included in training before prediction.

We further applied the signed-rank test (Demšar, 2006) to compare the results of IsoDA against those of the compared methods across the six evaluation metrics; all the p-values are smaller than 0.0313. In summary, these results indicate the effectiveness of IsoDA in identifying IDAs.
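As an illustration of how such a comparison can be computed, the sketch below runs an exact two-sided Wilcoxon signed-rank test on the six mean metric values of IsoDA and Know-GENE from Table 2. It assumes that the test pairs the six metric means of two methods and that there are no zero or tied differences; the function name is hypothetical. With six pairs, the smallest attainable two-sided exact p-value is 2/64 = 0.03125, which appears consistent with the reported bound of 0.0313.

```python
from itertools import product

def signed_rank_exact_p(x, y):
    """Exact two-sided Wilcoxon signed-rank p-value.

    Assumes no zero differences and no ties among absolute differences.
    """
    diffs = [a - b for a, b in zip(x, y)]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0] * len(diffs)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w_obs = min(w_plus, w_minus)
    total = sum(ranks)
    # Under the null hypothesis every sign pattern is equally likely,
    # so enumerate all 2^n patterns and count those at least as extreme.
    extreme = 0
    for signs in product((0, 1), repeat=len(ranks)):
        wp = sum(r for r, s in zip(ranks, signs) if s)
        if min(wp, total - wp) <= w_obs:
            extreme += 1
    return extreme / 2 ** len(ranks)

# Mean values from Table 2, in the order:
# MicroF1, MacroF1, 1-RankLoss, Fmax, AUPRC, AUROC.
isoda = [0.6739, 0.3187, 0.9465, 0.5437, 0.3816, 0.6471]
know_gene = [0.4804, 0.2930, 0.9355, 0.3269, 0.3741, 0.6365]

p_value = signed_rank_exact_p(isoda, know_gene)
print(p_value)
```

Because IsoDA is ahead on every metric, all six differences share the same sign, and the test returns the minimum p-value that six pairs allow.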

4.3. Result evaluation at the isoform level

In this subsection, we further assess the performance of IsoDA at the isoform level. Due to the lack of ground-truth IDAs, we take 5568 single-isoform genes, each of which produces only one isoform in our dataset, as the test bed, and use the remaining genes and isoforms as the training set. For a single-isoform gene, the known GDAs of the gene can be attributed directly to its only isoform, which provides surrogate isoform-level labels. We follow the same setting as in the previous experiments and report the results in Figure 1. PRINCE and Know-GENE do not consider the gene–isoform relationships and can only predict the associations between genes and diseases, so their results are excluded.
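The split described above can be sketched as follows. The isoform-to-gene mapping and all identifiers are hypothetical toy values; only the rule (genes with exactly one isoform form the test bed) is taken from the text.

```python
from collections import defaultdict

def split_by_isoform_count(gene_of_isoform):
    """Return (train_genes, test_genes); test genes have exactly one isoform."""
    isoforms_by_gene = defaultdict(list)
    for isoform, gene in gene_of_isoform.items():
        isoforms_by_gene[gene].append(isoform)
    test_genes = sorted(g for g, isos in isoforms_by_gene.items() if len(isos) == 1)
    train_genes = sorted(g for g, isos in isoforms_by_gene.items() if len(isos) > 1)
    return train_genes, test_genes

# Hypothetical isoform -> gene mapping.
gene_of_isoform = {
    "isoA1": "geneA", "isoA2": "geneA", "isoA3": "geneA",
    "isoB1": "geneB",
    "isoC1": "geneC",
}

train_genes, test_genes = split_by_isoform_count(gene_of_isoform)
print(train_genes, test_genes)
```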

FIG. 1.

Performance results of IsoDA and of compared methods on predicting IDAs of isoforms spliced from single-isoform genes. IDAs, isoform–disease associations.

IsoDA again achieves better performance than the three compared methods (iMILP, IsoFun, and DisoFun) at the isoform-level disease association prediction. iMILP universally distributes GDAs to all isoforms of a gene, then only propagates IDAs on the isoform coexpression network, so it has the lowest performance. IsoFun and DisoFun leverage the protein interaction data similar to IsoDA, but they do not consider the tissue specificity of multiple RNA-seq datasets and isoform sequence data, so they both perform worse than IsoDA. By referring to Table 2 and Figure 1, we can conclude that IsoDA is indeed effective in fusing genomics, transcriptomics, and proteomic data to handle the multiplicity of predicting IDAs at the isoform level.

4.4. Case study

To further explore the reliability of IsoDA in predicting IDAs, we collect some IDAs from the PubMed literature (Strittmatter et al., 1993; Rebeck et al., 2002; Li et al., 2020). APOE2, APOE3, and APOE4 are three alternatively spliced isoforms of APOE, and AD is associated with different bindings with APOE (Strittmatter et al., 1993). It is recognized that the expression of APOE4 increases the risk of AD, while APOE2 decreases the risk. Accumulated evidence suggests that APOE4 has a detrimental effect, while APOE2 protects against AD through both amyloid-$β$ (A$β$)-dependent and -independent mechanisms (Li et al., 2020). APOE performs neuroprotective and neurotrophic functions in the normal aging brain, and APOE2 and APOE3 execute these functions more efficiently than APOE4. Therefore, individuals without APOE2 or APOE3 are at risk for AD (Rebeck et al., 2002). We report the prediction results of IsoDA and three compared methods (iMILP, IsoFun, and DisoFun) with respect to APOE in Table 3. We observe that IsoDA correctly differentiates the individual IDAs of APOE, while iMILP incorrectly predicts an association between APOE3 and AD, and both iMILP and IsoFun fail to predict the association between APOE4 and AD. In particular, IsoDA predicts that APOE2 is less positively associated with AD than APOE3, which agrees with the finding that APOE2 decreases the risk of AD more than APOE3 (Li et al., 2020).

Table 3.

Isoform–Disease Associations of Apolipoprotein E and Vascular Endothelial Growth Factor A

Gene	Isoform	Disease	Association	iMILP	IsoFun	DisoFun	IsoDA
APOE	APOE2	Alzheimer's disease	$\times$	$\times$	$\times$	$\times$	$\times$
	APOE3		$\times$	$\sqrt$	$\times$	$\times$	$\times$
	APOE4		$\sqrt$	$\times$	$\times$	$\sqrt$	$\sqrt$
VEGFA	$VEGFA_{121}$	Coronary heart disease	$\times$	$\sqrt$	$\sqrt$	$\times$	$\times$
	$VEGFA_{165}b$		$\sqrt$	$\times$	$\sqrt$	$\times$	$\sqrt$
		Accuracy		20%	60%	80%	100%

$\sqrt$ indicates that the disease is known (or predicted) to be associated with the isoform, and $\times$ means the opposite. When the predicted association probability between a disease and an isoform is in the top 3 of the total isoforms of the gene, the isoform is associated with the disease.

APOE, apolipoprotein E; VEGFA, vascular endothelial growth factor A.
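The accuracy row of Table 3 can be reproduced directly from the table entries. The short sketch below encodes $\sqrt$ as 1 and $\times$ as 0 and counts, for each method, how many of the five isoform–disease pairs match the known association; the variable names are illustrative.

```python
# Rows of Table 3, in the order:
# APOE2, APOE3, APOE4 (Alzheimer's disease); VEGFA121, VEGFA165b (coronary heart disease).
known = [0, 0, 1, 0, 1]

predicted = {
    "iMILP":   [0, 1, 0, 1, 0],
    "IsoFun":  [0, 0, 0, 1, 1],
    "DisoFun": [0, 0, 1, 0, 0],
    "IsoDA":   [0, 0, 1, 0, 1],
}

accuracy = {
    method: 100 * sum(p == k for p, k in zip(calls, known)) / len(known)
    for method, calls in predicted.items()
}
print(accuracy)
```

The resulting percentages match the last row of the table. With only five pairs, each correct or incorrect call shifts the accuracy by 20 percentage points, so these figures are best read as qualitative evidence.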

We further investigate the IDAs of the gene vascular endothelial growth factor A (VEGFA) with respect to CHD. VEGFA undergoes extensive alternative splicing and encodes isoforms with both angiogenic and antiangiogenic potential through the differential use of an alternative splice site within exon 8 (Qiu et al., 2009). Several studies (Qiu et al., 2009; Kikuchi et al., 2014; Wang et al., 2017; Latorre et al., 2018) found that two isoforms of VEGFA ($VEGFA_{121}$ and $VEGFA_{165}b$) exert opposite effects, proangiogenic and antiangiogenic, respectively. The antiangiogenic isoform $VEGFA_{165}b$ was found to be associated with CHD (Latorre et al., 2018). From the results in Table 3, we find that IsoDA more credibly predicts the individual associations between CHD and isoforms spliced from the same gene.

Based on these case results, we can conclude that IsoDA has the potential to accurately identify IDAs of isoforms spliced from the same gene.

4.5. Ablation study

To further investigate the contribution of the individual components, we design seven variants of IsoDA, which are configured as follows:

(i) IsoDA (nS) removes the disease correlation matrix $S$ from $\|Z - XWS\|_{F}^{2}$ in Eq. (5), which means that the disease–disease association network is disregarded.

(ii) IsoDA (cS) uses the initial disease correlation matrix $S$ without updating it in the iterative optimization process.

(iii) IsoDA (RNA) only uses the isoform expression data derived from multiple RNA-seq datasets.

(iv) IsoDA (Seq) only utilizes the isoform sequence data.

(v) IsoDA (RNA+Seq) concatenates the isoform expression profile feature vectors of different tissues and the isoform sequence feature vectors into a single vector, and then directly constructs a single isoform functional association network using cosine similarity.

(vi) IsoDA (n$α$) integrates multiple isoform functional association networks with equal weights.

(vii) IsoDA (nH) removes the indicator matrix $H$ in Eq. (5), so it does not account for the bias toward the observed GDAs.

All the other configurations of these variants are kept the same as IsoDA, unless specified otherwise. Figures 2–4 report the performance results of IsoDA and its variants. The experimental settings are the same as the evaluation at the gene level and we can easily observe that IsoDA achieves better performance than its variants.

FIG. 2.

Performance results of IsoDA (nS), IsoDA (cS), and IsoDA. IsoDA (nS) disregards the disease correlations, and IsoDA (cS) directly uses disease correlations estimated from initial GDAs. GDAs, gene–disease associations.

FIG. 4.

Performance results of IsoDA (nH) and IsoDA. IsoDA (nH) implicitly assumes that the observed GDAs are complete.

In Figure 2, IsoDA (nS) clearly has lower performance values than IsoDA, which considers disease correlations. This indicates that exploring latent correlations between diseases can boost the prediction of IDAs. IsoDA (cS) incorporates the estimated disease correlation matrix $S$ but does not iteratively refine it; it performs better than IsoDA (nS) but worse than IsoDA. This trend supports not only the necessity of incorporating the disease correlations but also the necessity of refining the coarse disease correlations (estimated from known GDAs) using additional biological data.

In Figure 3, IsoDA (RNA) performs worse than IsoDA (Seq), which suggests that the isoform sequence data contribute more to identifying IDAs (GDAs) than the RNA-seq datasets, since the sequence data include important functional sites and domains of isoforms. Both IsoDA (RNA) and IsoDA (Seq) have lower performance values than IsoDA (RNA+Seq), and consequently than IsoDA. This shows that fusing RNA-seq data and sequence data can boost the prediction of IDAs (GDAs). IsoDA (RNA+Seq) has lower performance values than IsoDA (n$α$), which considers the tissue specificity of alternative splicing and combines tissue-level isoform functional association networks with equal weights. This pattern supports the necessity of combining multiple isoform functional association networks at the tissue level. However, IsoDA (n$α$) gives lower performance values than IsoDA, which not only considers the tissue specificity but also integrates multiple isoform functional association networks with adaptive weights. This contrast supports the effectiveness of adaptive weights and the rationale for combining multiomics data.
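The difference between the three fusion strategies compared here can be sketched with random toy features. In this sketch, the feature dimensions, the function names, and the fixed "adaptive" weights are assumptions for illustration only; in IsoDA the weights are learned during optimization, which this example does not attempt to reproduce.

```python
import numpy as np

def cosine_network(X):
    """Cosine-similarity network between rows (isoforms) of X, with zero diagonal."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)
    A = Xn @ Xn.T
    np.fill_diagonal(A, 0.0)
    return A

def fuse(networks, weights=None):
    """Weighted sum of networks; equal weights when none are given."""
    k = len(networks)
    w = np.full(k, 1.0 / k) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()
    return sum(wi * A for wi, A in zip(w, networks))

rng = np.random.default_rng(0)
n_isoforms = 5
# Two hypothetical tissue-level expression sources and one sequence-derived source.
sources = [rng.random((n_isoforms, 8)),
           rng.random((n_isoforms, 8)),
           rng.random((n_isoforms, 20))]

# IsoDA (RNA+Seq) style: concatenate all features, then build one network.
concat_net = cosine_network(np.hstack(sources))

# IsoDA (n-alpha) style: one network per source, combined with equal weights.
per_source = [cosine_network(X) for X in sources]
equal_net = fuse(per_source)

# Full-model style: per-source networks with unequal weights (placeholder values).
adaptive_net = fuse(per_source, weights=[0.2, 0.3, 0.5])
print(concat_net.shape, equal_net.shape, adaptive_net.shape)
```

Concatenation lets the source with the most features dominate the similarity, whereas per-source networks keep each tissue's signal separate until the weighting step.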

FIG. 3.

Performance results of IsoDA (RNA), IsoDA (Seq), IsoDA (RNA+Seq), IsoDA (n$α$), and IsoDA. IsoDA (RNA) only uses RNA-seq datasets, and IsoDA (Seq) only uses the isoform nucleotide data, while IsoDA (RNA+Seq) combines these two types of data into a single feature vector. IsoDA (n$α$) integrates multiple isoform functional association networks with equal weights.

From Figure 4, we can see that IsoDA shows an obvious improvement over IsoDA (nH), which implicitly assumes that the observed GDAs are complete. In contrast, IsoDA considers the incompleteness of observed GDAs and enforces latent IDAs to be consistent with the collected ones. At the same time, it differentiates the currently observed associations from the unobserved (potential) ones. As a result, IsoDA is less biased toward observed ones. In practice, the incompleteness of associations is implicitly ignored by most GDA prediction methods and isoform function prediction methods. As a result, these compared methods and IsoDA (nH) perform worse than IsoDA.
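The general idea of an indicator matrix can be illustrated with a generic masked squared loss. This is not Eq. (5) of IsoDA: the exact way $H$ enters the objective is not reproduced here, and the loss form, names, and toy values below are assumptions. The sketch only shows why treating unobserved entries as true negatives penalizes plausible new associations.

```python
import numpy as np

def masked_fit(Z, Y, H):
    """Squared error between predictions Z and labels Y, counted only where H == 1."""
    return float(np.sum((H * (Z - Y)) ** 2))

# Toy observed gene-disease matrix (1 = observed association, 0 = unobserved).
Y = np.array([[1.0, 0.0],
              [0.0, 1.0]])
# Hypothetical predicted scores; entry (0, 1) is a plausible but unobserved association.
Z = np.array([[0.9, 0.6],
              [0.1, 0.8]])

H_observed = Y.copy()          # fit only the observed associations
H_all = np.ones_like(Y)        # treat every zero as a confirmed negative

loss_masked = masked_fit(Z, Y, H_observed)
loss_full = masked_fit(Z, Y, H_all)
print(loss_masked, loss_full)
```

With the all-ones mask, the score of 0.6 on the unobserved pair dominates the loss, which pushes a model toward the observed labels only; masking removes that pressure.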

In conclusion, the ablation study confirms the effectiveness of IsoDA in fusing genomics, transcriptomics, and proteomic data to more accurately predict IDAs. It also supports the importance of specifically considering the incomplete GDAs. IsoDA models these important factors and thus clearly obtains better performance results than these variants.

4.6. Parameter sensitivity analysis

There are two input parameters ($λ_{1}$ and $λ_{2}$) in IsoDA. $λ_{1}$ controls the complexity of the linear predictor, and $λ_{2}$ is a balance factor for the information sources from the gene level to the isoform level. We vary $λ_{1}$ and $λ_{2}$ in $\{10^{-4}, 10^{-3}, \dots, 10^{3}, 10^{4}\}$ and present the results of IsoDA under different combinations of $λ_{1}$ and $λ_{2}$ in Figure 5.

FIG. 5.

Performance results of IsoDA under different input values of $λ_{1}$ and $λ_{2}$ .

We observe that the performance of IsoDA clearly increases in Fmax, AUPRC, and AUROC when $λ_{1}$ increases from $10^{-4}$ to $10^{-2}$ and then decreases slightly as $λ_{1}$ increases further. Similarly, the performance of IsoDA first increases markedly as $λ_{2}$ increases from $10^{-4}$ to $10^{3}$ and then decreases a little as $λ_{2}$ increases to $10^{4}$. These results confirm that it is important to fuse the gene-level data and the gene–isoform relationships for predicting IDAs. Meanwhile, we find that $λ_{2}$ is more positively related to the performance of IsoDA than $λ_{1}$. The reason is that $λ_{1}$ only controls the complexity of the multilabel linear predictor, whereas $λ_{2}$ balances the information sources from the gene level to the isoform level, which plays a more important role in fusing genomics, transcriptomics, and proteomics data to improve the prediction of IDAs. Moreover, when both $λ_{1}$ and $λ_{2}$ are set to very small values, IsoDA shows the lowest performance. This observation suggests the benefit of the unified objective function for predicting IDAs and corroborates the necessity of fusing multiomics data. Based on the above analysis, we adopt $λ_{1} = 10^{-2}$ and $λ_{2} = 10^{3}$ for the experiments.
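The parameter grid described above contains 9 × 9 = 81 combinations, and a search over it can be sketched as follows. The scoring function is a hypothetical placeholder that merely peaks at the values adopted in the text; in a real run it would be replaced by training IsoDA and measuring a validation metric.

```python
import math
from itertools import product

# Candidate values 10^-4, 10^-3, ..., 10^4 for both parameters.
exponents = range(-4, 5)
combos = [(10.0 ** a, 10.0 ** b) for a, b in product(exponents, exponents)]

def toy_score(lam1, lam2):
    """Placeholder for a validation metric; NOT the real IsoDA response surface."""
    return -(abs(math.log10(lam1) + 2) + abs(math.log10(lam2) - 3))

best_lam1, best_lam2 = max(combos, key=lambda pair: toy_score(*pair))
print(len(combos), best_lam1, best_lam2)
```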

5. Conclusion

In this study, we investigate how to computationally identify IDAs, an interesting and important, but largely unexplored, topic that can uncover disease pathology at a deeper level than the well-studied GDA analysis. Our proposed approach, IsoDA, leverages genome, transcriptome, and proteome data and MIL to bypass the lack of IDAs and to distribute GDAs to individual isoforms. IsoDA considers the incompleteness of available GDAs and incorporates PPI data and the indicator matrix to complete GDAs. It further takes into account the tissue specificity of alternative splicing and adaptively combines multiple isoform functional association networks induced from multiple RNA-seq datasets at the tissue level. IsoDA performs significantly better than related competitive methods that target the identification of GDAs or isoform functions.

Footnotes

Author Disclosure Statement

No competing financial interests exist.

Funding Information

This research is supported by the National Natural Science Foundation of China (Grant Nos. 61872300, 62031003, and 62072380).

References

Andrews, Tsochantaridis, and Hofmann 2003. Support vector machines for multiple-instance learning. Neural Inform. Process. Syst. 577–584.

Ashburner, Ball C.A., Blake J.A., et al. 2000. Gene ontology: Tool for the unification of biology. Nat. Genet. 25, 25.

Boyd and Vandenberghe 2004. Convex Optimization. Cambridge University Press, Cambridge.

Carbonneau M.-A., Cheplygina, Granger, et al. 2018. Multiple instance learning: A survey of problem characteristics and applications. Pattern Recognit. 77, 329–353.

Chatr-Aryamontri, Oughtred, Boucher, et al. 2017. The BioGRID interaction database: 2017 update. Nucleic Acids Res. 45(D1), D369–D379.

Chen, Shaw, Zeng, et al. 2019. Diffuse: Predicting isoform functions from sequences and expression profiles via deep learning. Bioinformatics. 35, i284–i294.

Claussnitzer, Cho J.H., Collins, et al. 2020. A brief history of human disease genetics. Nature. 577, 179–189.

Consortium E.P. 2012. An integrated encyclopedia of DNA elements in the human genome. Nature. 489, 57.

Defer, Best-Belpomme, and Hanoune 2000. Tissue specificity and physiological relevance of various isoforms of adenylyl cyclase. Am. J. Physiol. Renal. Physiol. 279, F400–F416.

Demšar 2006. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30.

Eksi, Li H.-D., Menon, et al. 2013. Systematically differentiating functions for alternatively spliced isoforms through integrating RNA-seq data. PLoS Comput. Biol. 9, e1003314.

Frasca 2017. Gene2disco: Gene to disease using disease commonalities. Artif. Intell. Med. 82, 34–46.

Hamosh, Scott A.F., Amberger J.S., et al. 2005. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 33(S1), D514–D517.

Holtzman D.M., Bales K.R., Tenkova, et al. 2000. Apolipoprotein E isoform-dependent amyloid deposition and neuritic degeneration in a mouse model of Alzheimer's disease. Proc. Natl. Acad. Sci. USA. 97, 2892–2897.

Huang, Wang, Zhang, et al. 2020. Isoform-disease association prediction by data fusion, 44–55. In Cai, Z., Mandoiu, I., Narasimhan, G., Skums, P., and Guo, X., eds. International Symposium on Bioinformatics Research and Applications. Springer, Cham.

Jiang 2015. Walking on multiple disease-gene networks to prioritize candidate genes. J. Mol. Cell Biol. 7, 214–230.

Kandoi and Dickerson J.A. 2019. Tissue-specific mouse mRNA isoform networks. Sci. Rep. 9, 1–24.

Kanehisa, Furumichi, Tanabe, et al. 2016. KEGG: New perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 45(D1), D353–D361.

Kikuchi, Nakamura, MacLauchlan, et al. 2014. An antiangiogenic isoform of VEGF-A contributes to impaired vascularization in peripheral artery disease. Nat. Med. 20, 1464–1471.

Kim, Langmead, and Salzberg S.L. 2015. HISAT: A fast spliced aligner with low memory requirements. Nat. Methods. 12, 357.

Latorre, Pilling L.C., Lee B.P., et al. 2018. The VEGFA156b isoform is dysregulated in senescent endothelial cells and may be associated with prevalent and incident coronary heart disease. Clin. Sci. 132, 313–325.

Li, Kang, Liu C.-C., et al. 2014. High-resolution functional annotation of human transcriptome: Predicting isoform functions by a novel multiple instance-based label propagation method. Nucleic Acids Res. 42, e39.

Li, Shue, Zhao, et al. 2020. APOE2: Protective mechanism and therapeutic implications for Alzheimer's disease. Mol. Neurodegener. 15, 1–19.

Luo, Li, Tian L.-P., et al. 2019. Enhancing the prediction of disease–gene associations with multimodal deep learning. Bioinformatics. 35, 3735–3742.

Luo, Zhang, Qiu, et al. 2017. Functional annotation of human protein coding isoforms via non-convex multi-instance learning, 345–354. Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 13–17, 2017, Halifax, NS, Canada.

Maron and Lozano-Pérez 1998. A framework for multiple-instance learning. Neural Inform. Process. Syst. 570–576.

Natarajan and Dhillon I.S. 2014. Inductive matrix completion for predicting gene–disease associations. Bioinformatics. 30, i60–i68.

Neagoe, Kulke, del Monte, et al. 2002. Titin isoform switch in ischemic human heart disease. Circulation. 106, 1333–1341.

Pan, Shai, Lee L.J., et al. 2008. Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat. Genet. 40, 1413.

Pertea, Pertea G.M., Antonescu C.M., et al. 2015. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol. 33, 290.

Piñero, Ramírez-Anguita J.M., Saüch-Pitarch, et al. 2020. The DisGeNET knowledge platform for disease genomics: 2019 update. Nucleic Acids Res. 48(D1), D845–D855.

Pletscher-Frankild, Pallejà, Tsafou, et al. 2015. Diseases: Text mining and data integration of disease–gene associations. Methods. 74, 83–89.

Qian, Besenbacher, Mailund, et al. 2014. Identifying disease associated genes by network propagation. BMC Syst. Biol. 8(S1), S6.

Qiu, Hoareau-Aveilla, Oltean, et al. 2009. The anti-angiogenic isoforms of VEGF in health and disease. Biochem. Soc. Trans. 37, 1207.

Rebeck G.W., Kindy, and LaDu M.J. 2002. Apolipoprotein E and Alzheimer's disease: The protective effects of APOE2 and E3. J. Alzheimers Dis. 4, 145–154.

Schriml L.M., Arze, Nadendla, et al. 2012. Disease ontology: A backbone for disease semantic integration. Nucleic Acids Res. 40(D1), D940–D946.

Shaw, Chen, and Jiang 2019. Deepisofun: A deep domain adaptation approach to predict isoform functions. Bioinformatics. 35, 2535–2544.

Shen, Zhang, Luo, et al. 2007. Predicting protein–protein interactions based only on sequences information. Proc. Natl. Acad. Sci. USA. 104, 4337–4341.

Skotheim R.I. and Nees 2007. Alternative splicing in cancer: Noise, functional, or systematic? Int. J. Biochem. Cell Biol. 39, 1432–1449.

Smith L.M. and Kelleher N.L. 2018. Proteoforms as the next proteomics currency. Science. 359, 1106–1107.

Stark, Breitkreutz B.-J., Reguly, et al. 2006. BioGRID: A general repository for interaction datasets. Nucleic Acids Res. 34(Suppl. 1), D535–D539.

Strittmatter W.J., Weisgraber K.H., Huang D.Y., et al. 1993. Binding of human apolipoprotein E to synthetic amyloid beta peptide: Isoform-specific effects and implications for late-onset Alzheimer disease. Proc. Natl. Acad. Sci. USA. 90, 8098–8102.

Sun P.G., Gao, and Han 2011. Prediction of human disease-related gene clusters by clustering analysis. Int. J. Biol. Sci. 7, 61.

Szklarczyk, Franceschini, Wyder, et al. 2015. STRING v10: Protein–protein interaction networks, integrated over the tree of life. Nucleic Acids Res. 43(D1), D447–D452.

Vanunu, Magger, Ruppin, et al. 2010. Associating genes and protein complexes with disease via network propagation. PLoS Comput. Biol. 6, e1000641.

Wang E.T., Sandberg, Luo, et al. 2008. Alternative isoform regulation in human tissue transcriptomes. Nature. 456, 470.

Wang, Wang, Domeniconi, et al. 2020. Differentiating isoform functions with collaborative matrix factorization. Bioinformatics. 36, 1864–1871.

Wang, Sun, Yuan, et al. 2017. The different effects of VEGFA121 and VEGFA165 on regulating angiogenesis depend on phosphorylation sites of VEGFR2. Inflamm. Bowel. Dis. 23, 603–616.

Wang, Gong, Yi, et al. 2019. Predicting gene-disease associations from the heterogeneous network using graph embedding, 504–511. IEEE International Conference on Bioinformatics and Biomedicine (BIBM), November 18–21, 2019, San Diego, CA.

Wang, Gulbahce, and Yu 2011. Network-based methods for human disease gene prediction. Brief. Funct. Genomics. 10, 280–293.

Wang, Gerstein, and Snyder 2009. RNA-seq: A revolutionary tool for transcriptomics. Nat. Rev. Genet. 10, 57.

Wu, Feng, and Stein 2010. A human functional protein interaction network and its application to cancer data analysis. Genome. Biol. 11, R53.

Yang, Wang, Liu, et al. 2018. Hergepred: Heterogeneous network embedding representation for disease gene prediction. IEEE J. Biomed. Health Inform. 23, 1805–1815.

Yu, Wang, Domeniconi, et al. 2020. Isoform function prediction based on bi-random walks on a heterogeneous network. Bioinformatics. 36, 303–310.

Zhou and Skolnick 2016. A knowledge-based approach for predicting gene–disease associations. Bioinformatics. 32, 2831–2838.

Zhou, Zhang, Huang, et al. 2012. Multi-instance multi-label learning. Artif. Intell. 176, 2291–2320.