Deep Large-Scale Multitask Learning Network for Gene Expression Inference

Abstract

Gene expression profiling makes it possible to conduct many biological studies in a variety of fields due to its thorough characterization of cellular states under various experimental conditions. Despite recent advances in high-throughput technology, profiling an entire set of genomes is still difficult and expensive. Due to the high correlation between expression patterns of different genes, this problem can be solved with a cost-effective approach that collects only a small subset of genes, called landmark genes, representing the entire set of genes, and infers the remaining genes, called target genes, using a computational model. There are several shallow and deep regression models in the literature to estimate the expressions of target genes from the landmark genes. However, the shallow models mostly have limited capacity in learning the nonlinear and complex gene expression data and are prone to underfitting, and the deep models generally do not take advantage of correlation among target genes in the learning process and suffer from overfitting. Considering the gene expression inference as a multitask learning problem, we propose a new deep multitask learning algorithm to tackle these issues. Our learning framework automatically learns the correlation between target genes and uses this knowledge to improve its generalization. Specifically, we utilize a subnetwork with low-dimensional latent variables to discover the relationships between target genes and impose a seamless and easy-to-implement regularization on our deep regression model. Unlike the existing multitask learning methods that can only deal with dozens or hundreds of tasks, our algorithm is able to efficiently learn the relationships between ∼10,000 target genes and, thus, is scalable to a large number of tasks. Our proposed method outperforms the shallow and deep regression models for gene expression inference and alternative multitask learning algorithms on two large-scale datasets regardless of the network architecture.

1. Introduction

A key problem in biological research is the characterization of cellular status in different states like disorders, drug therapeutics, and genetic perturbations. Gene expression profiling provides a powerful method for systematic cellular status analysis by identifying the patterns of gene expressions. Recent advances in high-throughput technologies allow comprehensive gene expression profiles to be collected under various cellular conditions, providing useful large-scale gene expression databases (Edgar et al., 2002; Brazma et al., 2003). For example, Van't Veer et al. (2002) recognized genes that are effective in breast cancer by investigating the patterns of gene expressions in various patients. By exploring the relations of gene expressions between different categories of tumors, Stephens et al. (2012) studied the relationships between and within various types of cancer. Richiardi et al. (2015) analyzed the gene expressions in a postmortem brain tissue and demonstrated the association between resting-state functional brain networks and gene activities, which are linked to ion channels and synaptic functions. A microarray study indicates a radical shift in the level of expression of several immune-related genes in mice susceptible to influenza A virus infection (Yan et al., 2015). In response to drug effects, gene expression patterns are also studied in various tasks such as construction of the drug-target network and drug discovery (Rees et al., 2016). Moreover, the connection of single-gene mutations on some chromosomes and early onset of Alzheimer disease are examined (Alzheimer's Association, 2013).

Despite recent progress on gene expression profiling, large archives of gene expressions are still costly and difficult to collect under various experimental conditions (Nelms et al., 2016). However, previous studies have shown that gene expressions are strongly correlated, suggesting that genes have similar functions in response to different conditions (Heimberg et al., 2016; Ntranos et al., 2016; Shah et al., 2016). The single cell RNA-Seq clustering studies also reveal a common pattern of expression between intracluster genes across various cell states (Ntranos et al., 2016). Therefore, the entire set of genes can be represented by a small subset of informative genes. This assumption was used by researchers in the Library of Integrated Network-based Cell-Signature (LINCS) program,* who used Principal Component Analysis to pick ∼1000 genes that contain ∼80% of the information of whole genome data. The cost of gene-expression profiling (∼$5 per profile) can be dramatically reduced by profiling these ∼1000 genes, named landmark genes, instead of the entire genome (Peck et al., 2006). Therefore, in profiling large-scale gene expression data, a cost-effective approach is to gather the landmark genes and estimate the remaining genes, named target genes, using a computational model.

The first candidates for predicting target genes are linear regression models with different regularizations. Some attempts were made to use nonlinear models to better capture the complex patterns of gene expression profiles (Guo et al., 2014). In general, deep models have shown great versatility in learning the nonlinear patterns in biomedical data and high scalability when working with large databases. Inspired by the success of deep neural networks on several biological studies (Leung et al., 2014; Alipanahi et al., 2015; Spencer et al., 2015; Zhou and Troyanskaya, 2015; Singh et al., 2016), researchers introduced deep regression models for the task of gene expression inference (Chen et al., 2016; Ghasedi Dizaji et al., 2018). However, these deep regression models, which have several shared layers at the bottom and an exclusive layer at the top for different genes, do not use the interrelations between the target genes. In their training phase, these models consequently neglect the biological information associated with gene correlations, which contributes to their suboptimal results.

To tackle these challenges and benefit from the highly correlated target genes, we formulate the prediction of target genes from landmark genes as a multitask learning problem. In general, multitask learning algorithms seek to improve the generalization of predictors for multiple tasks using the information transferred through a joint learning system across similar tasks (Caruana, 1997). We consider each prediction of gene expression as a learning task and use the multitask learning model to capture the interrelationships between target genes as tasks to improve the prediction using this knowledge. Although there are several literature studies on the design of multitask learning algorithms for deep models (Ruder, 2017), they are mostly implemented on dozens or hundreds of tasks. Hence they are not efficient and scalable for gene expression inference with a large number of tasks (e.g., 10,000 tasks).

In this article, we propose a new multitask learning framework, denoted as Deep-LSMTL, for training deep regression networks that automatically learn the correlation between a large number of target genes (i.e., tasks) without high computational overheads. In particular, Deep-LSMTL explores task relations by clustering the task-specific parameters. In other words, Deep-LSMTL aims to approximately reconstruct the parameters of each task by a sparse combination of the other tasks' parameters, providing a seamless regularization in stochastic training of deep models. Our algorithm uses a two-layer subnetwork with a low-dimensional bottleneck to capture the nonlinear low-rank representations of task relations. Furthermore, Deep-LSMTL alleviates the common issue of negative transfer in MTL methods by transferring asymmetric knowledge across the tasks and enforcing task correlation through the latent variables rather than the parameters. We evaluate Deep-LSMTL on two large-scale gene expression datasets compared to several deep and shallow regression models. Experimental outcomes show that our proposed algorithm has significantly better results in comparison to the state-of-the-art MTL methods and deep gene expression inference networks with different network sizes and architectures. In addition, we visualize the relevance of landmark and target genes in the trained model to gain insight into gene relations. The main contributions of this article can be summarized in the following points:

We introduce a new multitask deep regression method that can address large-scale prediction tasks and is effective for the gene expression inference problem, which involves nonimage data.

We design a new seamless regularization for our deep multitask regression method to automatically learn the task interrelations by utilizing the multilayer subnetwork with low-rank latent variables.

Our empirical studies show that the proposed new model can consistently outperform existing gene expression inference approaches and alternative multitask learning algorithms on two datasets regardless of network architectures.

The rest of this article is organized as follows. In the Related Work section, we briefly review recent multitask learning algorithms and work on the gene expression inference task. In the Deep Large-Scale Multitask Learning Network section, we first revisit the general clustering-based multitask learning method and then introduce our new multitask deep regression model. After that, we show the experimental results in the Experiments section and evaluate the effectiveness of the proposed algorithm by comparing it with alternative models under different experimental conditions. In our empirical study, we also plot visualization figures to validate our method. Finally, we conclude the article in the Conclusion section.

2. Related Work

2.1. Gene expression inference problem

Finding a way to minimize the costs of gene expression profiling is an important issue in biological studies, since it is still difficult and expensive to archive whole-genome expression profiles under different perturbations and biological conditions (Nelms et al., 2016). Previous studies have shown that there is a strong correlation between gene expressions, and even a small set of genes can store extensive information. For example, Shah et al. (2016) showed that a random set of only 20 genes includes ∼50% of the information in the entire set of genes. In addition, recent RNA-seq studies support the idea that a small number of genes are adequate to store comprehensive information throughout the transcriptome (Heimberg et al., 2016; Ntranos et al., 2016).

Researchers from the LINCS program collected the GEO dataset^† based on Affymetrix HGU133A microarrays and evaluated the similarity of gene expression profiles to determine the collection of most informative genes. They conducted experiments to calculate maximum percentage of information recovered by a subset of genes based on the comparable rank in the Kolmogorov–Smirnov statistic. Given the total number of 12,063 genes, they found that a subset of only 978 genes is sufficient to restore 82% of the observed connections in the whole transcriptome (Keenan et al., 2017). These genes (i.e., landmark genes) can be used to infer the expression of the other genes (i.e., target genes).

We can naturally formulate the gene expression inference problem as a multitask learning problem; some traditional models such as linear regression with $ℓ_{1}$ -norm and $ℓ_{2}$ -norm regularizations and K-nearest neighbors (KNN) can be used to estimate the expressions of target genes given the landmark ones (Chen et al., 2016; Ghasedi Dizaji et al., 2018). There are also a few studies to detect and infer gene expressions using deep models (Chen et al., 2016; Ghasedi Dizaji et al., 2018; Kishan et al., 2019; Wang et al., 2019). Moreover, a few works focused on designing supervised models to predict different categories given gene expression profiles. For instance, Cardoso et al. (2016) introduced a 70-gene panel-based predictor to identify low-risk patients in early-stage breast cancer. Another study proposed a supervised model to predict the prognosis of the patients given the gene expression profile of tumors (Cuzick et al., 2011). Xu et al. (2020) used a Bayesian ridge regression-based model to infer gene expression profiles of multiple tissues given the blood gene expression profile. Following the success of deep models on various applications, Chen et al. (2016) proposed a multitask deep regression model based on a fully connected multilayer neural network, which achieved better experimental results than shallow and linear regression models. For analyzing the effectiveness of anticancer drug combinations, a deep network is applied on gene expression data in cell line experiments (Preuer et al., 2018). Ghasedi Dizaji et al. (2018) recently adopted generative adversarial networks (GANs) to present a semisupervised model, called SemiGAN, for inferring the gene expressions. SemiGAN uses pairs of landmark and target genes as the labeled data to learn their joint distribution and enhances the learned distribution through another set of landmark genes as the unlabeled data. In particular, SemiGAN empowers its training phase by enlarging its training set using unlabeled landmark genes and the predicted target genes as pseudo labels. Another GAN is suggested to improve the classification of cancers by augmented data synthesized by its generator network (Chaudhari et al., 2019). Transforming the single cell expression profiles into images, ScIGANs utilizes a GAN model for generating different types of cell expression to impute the dropouts of the real cells (Xu et al., 2020). While these deep inference models address the limited capacity of shallow and linear regression models, they do not exploit the task interrelationships, which carry biological knowledge about the genes, in their training phase. Considering the gene expression inference as a multitask learning problem, we propose a new MTL framework for discovering the interrelations between the target genes explicitly to boost the prediction results and enhance the generalization of the deep regression model.

2.2. Multitask learning algorithms

The primary objective of multitask learning is to increase the generalization of several predictors in a shared training phase using the knowledge transferred through the associated tasks, while the MTL methods assume that the correlated tasks lie in a low-dimensional subspace (Caruana, 1997). Based on this hypothesis, Argyriou et al. (2008) applied $ℓ_{(2, 1)}$ -norm regularization on the feature matrix to have shared features across tasks and approximated the resulting optimization problem by a convex objective function. Instead of sharing across all tasks, Kang et al. (2011) presented a method to share features only within a community of similar tasks. To relax the strict grouping of correlated tasks in real-world problems, a few works share the parameters in the overlapping sets of related tasks (Kumar and Daumé III, 2012; Maurer et al., 2013). Asymmetric multitask learning (AMTL) aims to reconstruct the parameters of each task through a linear and sparse combination of the remaining tasks using a regularization loss (Lee et al., 2016). It also ensures that unreliable task predictors with higher loss transfer less knowledge than reliable predictors with lower loss. In addition, some studies explored the general concept of regularizing parameters using the task interrelationships obtained through clustering-based approaches (Thrun and O'Sullivan, 1996; Bakker and Heskes, 2003; Evgeniou et al., 2005; Jacob et al., 2009).

Sharing multiple layers across all tasks and stacking a particular layer for each task at the top is the standard way of adopting MTL methods on deep models. Furthermore, there are multiple works on designing the architecture of deep multitask networks (Yang and Hospedales, 2016, 2017; Ruder et al., 2017). To expand AMTL to deep neural networks, Deep-AMTFL transfers asymmetric task knowledge using latent variables instead of the model parameters (Lee et al., 2018). The proposed multitask learning algorithm differs from the previous approaches because it provides an effective and scalable method for the gene expression problem with a large number of tasks, using a bottleneck network to capture the correlation between tasks.

3. Deep Large-Scale Multitask Learning Network

Considering $\mathcal{D} = \{(g_{i}^{l}, g_{i}^{t})\}_{i = 1}^{N}$ as the training set in the gene expression inference problem, we denote $g_{i}^{l} \in ℛ^{D}$ and $g_{i}^{t} \in ℛ^{M}$ as the expressions of landmark and target genes for the i-th sample. While N represents the number of samples, M and D indicate the dimensions of output and input data (i.e., number of target and landmark genes). Since $g_{i}^{t} \in ℛ^{M}$ , we are dealing with M regression tasks, where each one infers the expression of a target gene given the expression of all landmark genes. We use the following notation throughout the article unless explicitly stated otherwise. We represent scalars with lower and upper case letters (e.g., $j, M$ ), vectors with bold lowercase letters (e.g., $g, w$ ), matrices with upper case letters (e.g., $X, W$ ), and losses, sets, and functions with calligraphic letters.

3.1. Clustered multitask learning

Multitask learning algorithms typically share the relevant knowledge between tasks using a joint learning framework for the tasks. This joint learning framework mostly includes a regularization term to improve generalization of the model, as in the following objective: ${min}_{W} \sum_{i = 1}^{N} \sum_{j = 1}^{M} ℒ (w_{j}; g_{i}^{l}, g_{i j}^{t}) + ℒ_{r} (W)$ (1)

where $ℒ$ represents the loss function for each task, and $ℒ_{r}$ denotes the regularization on the shared parameters based on the relations of tasks. Note that the columns of $W \in ℛ^{D \times M}$ show task-specific parameters (i.e., $w_{j}$ ) in a shallow regression model. While the mean squared error (MSE) is the most popular regression loss, we empirically note that the mean absolute error (MAE) loss function $ℒ (W; g_{i}^{l}, g_{i}^{t}) = ∥ g_{i}^{t} - ℱ (g_{i}^{l}) ∥_{1}$ is a better choice in our objective, where $ℱ (g_{i}^{l}) = W g_{i}^{l}$ is a regression model. Many studies on diverse applications also advocate using MAE loss instead of MSE loss because of its robustness to noisy samples and outliers.
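The robustness argument for MAE can be checked on a toy example. The following is a minimal sketch, not the training code of this article: the function names `mae_loss` and `mse_loss` and all numeric values are hypothetical, and the losses are averaged over entries rather than summed as in the $ℓ_{1}$ -norm above.

```python
import numpy as np

def mae_loss(pred, target):
    """Mean absolute error, averaged over all entries."""
    return np.mean(np.abs(pred - target))

def mse_loss(pred, target):
    """Mean squared error, averaged over all entries."""
    return np.mean((pred - target) ** 2)

# Toy targets: 4 samples x 2 target genes (hypothetical values).
target = np.array([[1.0, 2.0], [1.5, 2.5], [0.5, 1.0], [2.0, 3.0]])
pred = target + 0.1          # every prediction is off by 0.1
noisy = target.copy()
noisy[0, 0] += 10.0          # one corrupted measurement

clean_mae, clean_mse = mae_loss(pred, target), mse_loss(pred, target)
noisy_mae, noisy_mse = mae_loss(pred, noisy), mse_loss(pred, noisy)

# Relative growth of each loss caused by the single outlier.
print(noisy_mae / clean_mae, noisy_mse / clean_mse)
```

A single corrupted entry inflates the squared loss far more than the absolute loss, so under MSE a few noisy profiles can dominate the gradient.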

We can simply extend shallow MTL models to deeper MTL models by sharing a set of latent features across all tasks. Considering $W = L S$ , $L \in ℛ^{D \times K}$ represents the shared parameters and $S \in ℛ^{K \times M}$ indicates the task-specific weights (Kumar and Daumé III, 2012; Argyriou et al., 2008). We can use a similar approach in deep neural networks by stacking a specific layer for each task on top of multiple shared layers between tasks. The multilayer perceptron (i.e., fully connected) network is the simplest form of a deep MTL model as $ℱ (g^{l}) = σ (\dots σ (σ (g^{l} W^{(1)}) W^{(2)}) \dots W^{(L)})$ , where the first $L - 1$ layers are shared across all tasks and the last one is a task-specific layer. Inspired by the densely connected convolutional network (DenseNet) (Huang et al., 2017), we design a more efficient architecture for the shared layers in our inference network. Denoting the features of the k-th layer as $g^{l^{(k)}}$ , where $k \in \{0, \dots, L - 1\}$ and $g^{l^{(0)}}$ is the landmark input, the output of each DenseNet layer is obtained through $g^{l^{(k + 1)}} = σ ([g^{l^{(0)}}, g^{l^{(1)}}, \dots, g^{l^{(k)}}] W^{(k + 1)})$ , where $[g^{l^{(0)}}, g^{l^{(1)}}, \dots, g^{l^{(k)}}]$ is the concatenation of features from all the previous layers. Figure 1a shows the architecture of our inference network, where each layer of DenseNet receives the features of all preceding layers as the input. Using DenseNet instead of the conventional multilayer perceptron network, we are able to decrease the number of parameters, reuse the features of preceding layers, and reduce the chance of vanishing gradients in the training process. We can therefore rewrite the objective in Eq. (1) as the following equation:

${min}_{W^{(1)}, \dots, W^{(L)}} \sum_{i = 1}^{N} ∥ g_{i}^{t} - ℱ (g_{i}^{l}) ∥_{1} + ℒ_{r} (W^{(1)}, \dots, W^{(L)})$ (2)

FIG. 1. The visualization of our Deep-LSMTL architecture. (a) This figure illustrates the architecture of our DenseNet ( $ℱ$ ), where each layer receives the features of all preceding layers as the input. The $ℓ_{1}$ -norm loss ( $ℒ$ ) is applied on the output of this network. (b) This network indicates the shallow and linear $g$ function used in Eq. (4). The crosses on some weights represent the zero diagonal elements constraint. (c) This network shows the two-layer model $g$ of Eq. (5), where $β$ and $(1 - β)$ filters are represented by the cross signs. The regularization loss ( $ℒ_{r}$ ) is applied on the output of this layer. DenseNet, densely connected convolutional network; LSMTL, large-scale multitask learning.
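The dense connectivity of the inference network can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `dense_forward`, the ReLU choice for $σ$ , and the toy layer sizes are hypothetical and are not taken from our experimental setup. The final layer is linear and acts on the concatenation of all features.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def dense_forward(g_l, hidden_weights, out_weight):
    """Forward pass of a fully connected DenseNet-style regressor.

    g_l:            (batch, D) landmark expressions
    hidden_weights: list of matrices; the k-th one has as many rows as the
                    concatenation of the input and all earlier hidden outputs
    out_weight:     task-specific linear layer applied to all features
    """
    features = [g_l]
    for W in hidden_weights:
        concat = np.concatenate(features, axis=1)   # reuse every earlier layer
        features.append(relu(concat @ W))
    return np.concatenate(features, axis=1) @ out_weight

rng = np.random.default_rng(0)
D, H, M, batch = 8, 4, 5, 3        # hypothetical toy sizes
hidden = [rng.normal(size=(D, H)), rng.normal(size=(D + H, H))]
W_out = rng.normal(size=(D + 2 * H, M))
pred = dense_forward(rng.normal(size=(batch, D)), hidden, W_out)
print(pred.shape)
```

Each hidden layer only needs to produce a small number of new features because it sees all earlier ones, which is where the parameter saving comes from.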

One way for regularizing the task-specific parameters is to enforce clustering-based constraints based on the task correlations (Thrun and O'Sullivan, 1996; Bakker and Heskes, 2003; Evgeniou et al., 2005; Jacob et al., 2009; Lee et al., 2016). The clustering constraints force the correlated tasks to have similar parameters or features, but not all of the tasks to share features. This is helpful for avoiding the problem of negative transfer in MTL methods, where uncorrelated tasks degrade the shared features of related tasks (Ruder, 2017). A successful instance of the clustering constraints is using subspace clustering for grouping the task-specific parameters. Considering the subspace clustering constraint as the regularization in Eq. (2), we define the objective as follows: ${min}_{W^{(1)}, \dots, W^{(L)}, V} \sum_{i = 1}^{N} ∥ g_{i}^{t} - ℱ (g_{i}^{l}) ∥_{1} + λ ∥ W^{(L)} - W^{(L)} V ∥_{F}^{2} + γ ∥ V ∥_{1}$ (3)

where $V \in ℛ^{M \times M}$ indicates the correlation among the M tasks by a self-representation coefficient matrix with zero diagonal elements (i.e., $v_{t t} = 0$ ). Using the regularization term, we force the model to reconstruct the parameters of every task by a sparse and linear combination of the parameters of the other tasks. This regularization encourages the parameters of each task to be a combination of the remaining tasks and explores asymmetric similarity among the tasks to alleviate the problem of negative transfer in MTL methods.
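The two regularization terms of Eq. (3) can be evaluated directly for a small example. The sketch below is illustrative only: `subspace_regularizer` is a hypothetical helper, the zero-diagonal constraint is enforced by simply dropping the diagonal of $V$ , and the toy parameters are made up.

```python
import numpy as np

def subspace_regularizer(W_last, V, lam, gamma):
    """lam * ||W - W V||_F^2 + gamma * ||V||_1 with diag(V) forced to zero."""
    V = V - np.diag(np.diag(V))            # enforce v_tt = 0
    residual = W_last - W_last @ V
    return lam * np.sum(residual ** 2) + gamma * np.sum(np.abs(V))

# Toy case: task 2's parameters are an exact copy of task 0's.
W_last = np.array([[1.0, 0.0, 1.0],
                   [2.0, 1.0, 2.0]])       # (features, M = 3 tasks)
V = np.zeros((3, 3))
V[0, 2] = 1.0                              # column 2 is rebuilt from column 0
penalty = subspace_regularizer(W_last, V, lam=1.0, gamma=0.1)

# The identity matrix is not a valid shortcut: its diagonal is removed.
trivial = subspace_regularizer(W_last, np.eye(3), lam=1.0, gamma=0.0)
```

Here only task 2 is reconstructed (from task 0), which illustrates the asymmetry: $V$ need not be symmetric, so task 0 can help task 2 without the reverse being required.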

We further multiply the features of the last hidden layer by the parameters in the second term of the loss in Eq. (3). Since our last layer has a linear activation function, we are then able to reformulate the objective in Eq. (3) as follows: ${min}_{W^{(1)}, \dots, W^{(L)}, V} \sum_{i = 1}^{N} ∥ g_{i}^{t} - ℱ (g_{i}^{l}) ∥_{1} + λ ∥ ℱ (g_{i}^{l}) - g (ℱ (g_{i}^{l})) ∥_{F}^{2} + γ ∥ V ∥_{1}$ (4)

where $ℱ (g_{i}^{l}) = [g^{l^{(0)}}, g^{l^{(1)}}, \dots, g^{l^{(L - 1)}}] W^{(L)}$ denotes the DenseNet output for the i-th sample, and $g (ℱ (g_{i}^{l})) = ℱ (g_{i}^{l}) V$ is the output of the layer stacked at the top of our DenseNet model. Figure 1b illustrates the architecture of this layer.

3.2. Deep large-scale multitask learning

There are some disadvantages to the aforementioned model, such as lack of scalability to a large number of tasks (like estimating ∼10,000 target genes from landmark genes), a large number of parameters (i.e., V ), and the shallow and linear layer $g (.)$ not capturing the complex relationships between the tasks. In addition, there is no explicit constraint to learn a low-dimensional manifold of highly correlated target gene expressions.

To tackle these issues and efficiently learn the task correlations, we replace the linear function in $g (.)$ by a two-layer network as $g (ℱ (g_{i}^{l})) = V^{(2)} σ (V^{(1)} ℱ (g_{i}^{l}))$ with the parameters $V^{(1)}$ and $V^{(2)}$ in the first and second layers. Using this trick, we not only increase the representative power of the $g (.)$ subnetwork but also decrease the number of parameters by using fewer hidden units than the number of tasks. In particular, the linear function has $M^{2}$ parameters ( $\sim 10^{4} \times 10^{4} = 10^{8}$ ), but the proposed two-layer network contains $2 M K$ free parameters, where $K ≪ M$ ( $\sim 2 \times 10^{4} \times 100 = 2 \times 10^{6}$ ). Moreover, the low-dimensional bottleneck in $g$ leads to learning a low-rank representation for the task relations as shown in the hidden layer of Figure 1c. The following equation shows the objective function of our proposed method:

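${min}_{W^{(1)}, \dots, W^{(L)}, V^{(1)}, V^{(2)}} \sum_{i = 1}^{N} ∥ g_{i}^{t} - ℱ (g_{i}^{l}) ∥_{1} + λ ∥ (1 - β) ⨀ (ℱ (g_{i}^{l}) - g (β ⨀ ℱ (g_{i}^{l}))) ∥_{F}^{2} + γ (∥ V^{(1)} ∥_{1} + ∥ V^{(2)} ∥_{1})$ (5)

where $β$ indicates a binary mask, and $⨀$ denotes the element-wise multiplication. It is worth mentioning that the second term of the objective pushes each target gene to be reconstructed by the others, leading to learning the correlation between the tasks.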

However, in the $g$ subnetwork we cannot directly force the output of each task to be reconstructed only by the others, as the subspace clustering constraint does (i.e., zero diagonal elements in V). To do so, we introduce the random $β$ mask to efficiently approximate the reconstruction process in a stochastic learning approach like SGD. In particular, we randomly mask one or a few task outputs in each training iteration (e.g., $β = [1, 0, 0, \dots, 0]$ ), compute the output of the regularization subnetwork by $g (β ⨀ ℱ (g_{i}^{l}))$ , and then only reconstruct the masked tasks according to the $(1 - β)$ filter. This approach provides a seamless and efficient regularization in deep models similar to subspace clustering regularization.
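The masked reconstruction step and the parameter saving of the bottleneck can be sketched as follows. This is a minimal illustration, not our implementation: `masked_reconstruction_loss` is a hypothetical helper, ReLU stands in for $σ$ , predictions are stored as rows (so the weights multiply from the right), the $ℓ_{1}$ penalty on the subnetwork weights is omitted, and we assume the convention that $β = 1$ marks a visible task and $β = 0$ a masked one.

```python
import numpy as np

def masked_reconstruction_loss(pred, V1, V2, beta):
    """Reconstruct the masked task outputs from the visible ones.

    pred: (batch, M) predictions of the inference network
    V1:   (M, K) and V2: (K, M) weights of the two-layer subnetwork
    beta: (M,) binary mask, 1 = visible task, 0 = masked task (assumed)
    """
    hidden = np.maximum((pred * beta) @ V1, 0.0)    # K-dimensional bottleneck
    recon = hidden @ V2
    residual = (1.0 - beta) * (pred - recon)        # score only masked tasks
    return np.sum(residual ** 2)

rng = np.random.default_rng(0)
M, K, batch = 1000, 20, 4     # toy sizes; the text uses M ~ 1e4 and K ~ 100
pred = rng.normal(size=(batch, M))
V1 = rng.normal(scale=0.01, size=(M, K))
V2 = rng.normal(scale=0.01, size=(K, M))

beta = np.ones(M)
beta[rng.choice(M, size=5, replace=False)] = 0.0    # hide 5 random tasks
loss = masked_reconstruction_loss(pred, V1, V2, beta)

linear_params = M * M             # full M x M matrix V of Eq. (4)
bottleneck_params = 2 * M * K     # V1 plus V2
print(linear_params, bottleneck_params)
```

Because a masked task never sees its own prediction at the input of $g$ , it cannot be copied through, which plays the role of the zero-diagonal constraint without storing an $M \times M$ matrix.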

4. Experiments

In this section, we analyze our model in comparison with several shallow and deep regression models on two databases. In particular, we start by describing the different settings of the experiments, evaluating our proposed model compared to the state-of-the-art models through neural networks with various structures. We also try to shed light on learned knowledge stored in the model parameters by illustrating the connections of target and landmark genes.

4.1. Alternative methods

The least square regression model is one of the most popular inference models. Its parameters are usually trained using this objective: ${min}_{W} \sum_{i = 1}^{N} ∥ g_{i}^{l} W - g_{i}^{t} ∥_{2}^{2} + λ ∥ W ∥_{p}^{2},$ (6)

where W denotes the parameters, and $λ$ is a hyperparameter balancing the weights of reconstruction and regularization losses. Setting $λ = 0$ , the model is named least square regression (i.e., LSR). If $λ \neq 0$ and $p = 2$ , the model is called least square regression with $ℓ_{2}$ -norm regularization (i.e., LSR-L2); when $λ \neq 0$ and $p = 1$ the model is named least square regression with $ℓ_{1}$ -norm regularization (i.e., LSR-L1). Note that these regularizations are used to avoid the overfitting problem in training of LSR models. Our proposed algorithm uses the MAE instead of the MSE for the regression loss and also benefits from the $ℓ_{1}$ -norm regularization but only in the $g$ network parameters.
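For the LSR and LSR-L2 baselines, Eq. (6) has a closed-form solution, which the following sketch computes on synthetic data. The helper name `fit_lsr`, the toy sizes, and the value of $λ$ are hypothetical, and the $ℓ_{2}$ case is interpreted here as the squared Frobenius norm of $W$ , which is an assumption. LSR-L1 has no closed form and needs an iterative solver, so it is not covered.

```python
import numpy as np

def fit_lsr(G_l, G_t, lam=0.0):
    """Closed-form least squares; lam > 0 gives the l2-regularized variant.

    Solves (G_l^T G_l + lam * I) W = G_l^T G_t for W of shape (D, M).
    """
    D = G_l.shape[1]
    gram = G_l.T @ G_l + lam * np.eye(D)
    return np.linalg.solve(gram, G_l.T @ G_t)

rng = np.random.default_rng(0)
N, D, M = 200, 5, 3                        # hypothetical toy sizes
G_l = rng.normal(size=(N, D))              # landmark expressions
W_true = rng.normal(size=(D, M))
G_t = G_l @ W_true + 0.01 * rng.normal(size=(N, M))

W_lsr = fit_lsr(G_l, G_t)                  # LSR
W_l2 = fit_lsr(G_l, G_t, lam=10.0)         # LSR-L2
print(np.linalg.norm(W_lsr), np.linalg.norm(W_l2))
```

The regularized solution has a smaller norm than the unregularized one, which is the shrinkage that counters overfitting.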

KNN is another well-known method for the regression problem, by which we obtain the output of test samples by averaging the k nearest samples in the training set. Furthermore, we include two deep learning methods, D-GEX (Chen et al., 2016) and SemiGAN (Ghasedi Dizaji et al., 2018), as the baseline. D-GEX model consists of a multilayer fully connected network, and SemiGAN adopts GANs in its regression model. However, our model utilizes a DenseNet architecture for its base network (i.e., $ℱ$ ).
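The KNN baseline can be written compactly. The sketch below assumes Euclidean distance between landmark profiles and uniform averaging of the neighbors; both are assumptions, since the text only states that the k nearest training samples are averaged. The function name `knn_predict` and the toy data are hypothetical.

```python
import numpy as np

def knn_predict(train_l, train_t, query_l, k):
    """Average the target profiles of the k nearest training samples."""
    # Pairwise squared Euclidean distances, shape (n_query, n_train).
    d2 = ((query_l[:, None, :] - train_l[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    return train_t[nearest].mean(axis=1)

train_l = np.array([[0.0], [1.0], [2.0], [10.0]])   # 1 landmark gene
train_t = np.array([[0.0], [2.0], [4.0], [20.0]])   # 1 target gene
pred = knn_predict(train_l, train_t, np.array([[1.1]]), k=2)
print(pred)
```

For the query 1.1, the two nearest training samples are 1.0 and 2.0, so the prediction is the mean of their targets.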

In addition, we implement multiple MTL algorithms as the learning framework in the gene expression inference problem. In the following, we briefly revisit them for our problem, but refer the readers to the original articles for more details. The CNMTL algorithm tries to group the specific parameters of tasks (i.e., last layer of the deep regression network in our case) through a regularization (Jacob et al., 2009). The following equation shows its objective, including the reconstruction loss and the regularization loss, based on the mean of parameters and the variances between and within clusters. ${min}_{W} \sum_{i = 1}^{N} \sum_{j = 1}^{M} ℒ (w_{j}, g_{i}^{l}, g_{i}^{t}) + λ_{M} ∥ \bar{W} ∥_{F}^{2} + λ_{B} \sum_{k = 1}^{L} ∥ {\bar{W}}_{k} - \bar{W} ∥_{F}^{2} + λ_{W} \sum_{k = 1}^{L} \sum_{j \in J (k)} ∥ W_{j}^{(L)} - {\bar{W}}_{k} ∥_{F}^{2}$ (7)

The second term shows the regularization term on the average of parameters with $λ_{M}$ as the hyperparameter. Assuming $\bar{W} = 1 ∕ M \sum_{j = 1}^{M} W_{j}^{(L)}$ as the average of parameters in the last layer across tasks, the third term shows the regularization term based on the variance between clusters with $λ_{B}$ as the hyperparameter. Considering ${\bar{W}}_{k}$ as the average of parameters in the last layer for the k-th cluster, the fourth term shows the regularization term based on the variance within clusters with $λ_{W}$ as the hyperparameter, where $J (k)$ is the set of tasks belonging to the k-th cluster. Setting $ℒ (W; g_{i}^{l}, g_{i}^{t}) = ∥ g_{i}^{t} - ℱ (g_{i}^{l}) ∥_{1}$ , we use the same regression loss for the CNMTL model (and also for all the following alternative models) for a fair comparison, but Deep-LSMTL uses a different regularization (i.e., the second term in Eq. 5) than CNMTL.
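The three regularization terms of Eq. (7) can be computed as in the sketch below. It is a simplification: the cluster assignment is passed in as a fixed vector, whereas the original CNMTL formulation learns the clustering, and the helper name `cluster_regularizer` and the toy parameters are hypothetical.

```python
import numpy as np

def cluster_regularizer(W_last, assignments, lam_m, lam_b, lam_w):
    """Mean, between-cluster, and within-cluster penalties on task parameters.

    W_last:      (features, M) task-specific parameters, one column per task
    assignments: (M,) cluster index of each task (assumed to be given)
    """
    w_bar = W_last.mean(axis=1)                       # average over all tasks
    penalty = lam_m * np.sum(w_bar ** 2)
    for k in np.unique(assignments):
        members = W_last[:, assignments == k]
        w_bar_k = members.mean(axis=1)                # cluster average
        penalty += lam_b * np.sum((w_bar_k - w_bar) ** 2)
        penalty += lam_w * np.sum((members - w_bar_k[:, None]) ** 2)
    return penalty

W_last = np.array([[1.0, 1.0, 3.0, 3.0]])             # 1 feature, 4 tasks
tight = cluster_regularizer(W_last, np.array([0, 0, 1, 1]), 1.0, 1.0, 1.0)
mixed = cluster_regularizer(W_last, np.array([0, 1, 0, 1]), 1.0, 1.0, 1.0)
print(tight, mixed)
```

With equal weights, grouping identical tasks together gives the lower penalty in this toy case, because the within-cluster term vanishes.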

The GO-MTL method enforces $ℓ_{1}$ -norm and Frobenius-norm regularizations on the specific and shared parameters between tasks (Kumar and Daumé III, 2012): ${min}_{W} \sum_{i = 1}^{N} \sum_{j = 1}^{M} ℒ (w_{j}, g_{i}^{l}, g_{i}^{t}) + μ ∥ w_{j}^{(L)} ∥_{1} + λ \sum_{k = 1}^{L - 1} ∥ W^{(k)} ∥_{F}^{2}$ (8)

where $μ$ and $λ$ represent the hyperparameters for the regularization terms. We also use $ℓ_{1}$ -norm loss regularization in the $g$ network parameters of Deep-LSMTL.

The AMTL algorithm imposes a reconstruction loss on the task-specific parameters as the regularization as follows (Lee et al., 2016): ${min}_{W, V} \sum_{i = 1}^{N} \sum_{j = 1}^{M} α_{j} ℒ (w_{j}, g_{i}^{l}, g_{i}^{t}) + λ ∥ W^{(L)} - W^{(L)} V ∥_{2}^{2}$ (9)

where $λ$ denotes the hyperparameter for the regularization term, and $α_{j}$ represents the coefficient that reduces the influence of hard tasks with higher loss on training compared to easy tasks with lower loss. AMTL has a similar objective to our model in regularizing the task-specific parameters, but Deep-LSMTL has a more flexible two-layer subnetwork $g$ with a computationally less expensive and easy-to-apply regularization for deep models.

To expand AMTL to deep neural networks, Deep-AMTFL transfers related task knowledge using latent variables rather than the model parameters (Lee et al., 2018): ${min}_{W, V} \sum_{i = 1}^{N} \sum_{j = 1}^{M} α_{j} ℒ (w_{j}, g_{i}^{l}, g_{i}^{t}) + μ ∥ w_{j}^{(L)} ∥_{1} + γ ∥ Z - σ (Z W^{(L)} V) ∥_{F}^{2} + λ \sum_{k = 1}^{L - 1} ∥ W^{(k)} ∥_{F}^{2}$ (10)

where $μ$ , $λ$ , and $γ$ are the hyperparameters for the regularization terms, $α_{j}$ is the coefficient according to task easiness level, and $Z$ indicates the output of the last hidden layer. AMTFL aims to regularize its model using a reconstruction loss on the shared features (only the last hidden layer). Although our regularization term can also be seen as a reconstruction loss, Deep-LSMTL applies the regularization on the predictions (not the last hidden layer features) and benefits from a nonlinear and easy-to-implement reconstruction subnetwork.

4.2. Datasets

We use three publicly available datasets in our experiments: the microarray-based GEO dataset, the RNA-Seq-based GTEx dataset, and the 1000 Genomes (1000G) RNA-Seq expression data.^‡ The GEO dataset contains 129,158 gene expression profiles corresponding to 22,268 genes, where the numbers of landmark and target genes are 978 and 21,290, respectively. These profiles are collected using the Affymetrix microarray platform. The GTEx dataset consists of 2921 profiles, which are collected using the Illumina RNA-Seq platform in the RPKM format (i.e., Reads Per Kilobase per Million). The 1000G dataset contains 462 profiles from the Illumina RNA-Seq platform in the RPKM format.

We follow the preprocessing steps in Chen et al. (2016): removing duplicate samples, joint quantile normalization, and cross-platform data matching. After removing the duplicated data, we map the expression values in the GTEx and 1000G datasets based on the quantiles computed on the GEO data, resulting in expression values in the range of 4.11 to 14.97. The expression values are then normalized to have zero mean and unit variance for each gene. After preprocessing, there are 111,009 samples in GEO, 2921 samples in GTEx, and 462 samples in 1000G, where each sample (i.e., profile) includes 943 landmark genes and 9520 target genes.
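The final per-gene standardization step can be illustrated with a minimal sketch. This is not the authors' preprocessing code: the helper name `standardize_per_gene` and the toy matrix are hypothetical, and the use of the population (rather than sample) standard deviation is an assumption, since the text does not state which one is used.

```python
from statistics import fmean, pstdev

def standardize_per_gene(profiles):
    """Scale each gene (column) to zero mean and unit variance.

    `profiles` is a list of samples, each a list of expression values.
    Assumption: population standard deviation; constant genes map to 0.
    """
    n_genes = len(profiles[0])
    columns = [[row[j] for row in profiles] for j in range(n_genes)]
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [
        [(row[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0
         for j in range(n_genes)]
        for row in profiles
    ]

# Toy example: 3 profiles x 2 genes, values inside the 4.11 to 14.97 range.
toy = [[4.5, 10.0], [6.5, 12.0], [8.5, 14.0]]
scaled = standardize_per_gene(toy)
```

In a train/validation/test setting one would typically compute the means and standard deviations on the training portion only and reuse them for the other sets; whether the authors do so is not stated here.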

We compare different methods under two settings, as suggested in Chen et al. (2016). First, we use 80% of the GEO data for training, 10% of the GEO data for validation, and the remaining 10% of the GEO data for testing. Second, we use the same 80% of the GEO data for training, the 1000G data for validation, and the GTEx data for testing. While the first case is a standard approach for selecting train, validation, and test sets, the second case is useful for validating the regression models on cross-platform prediction, since the train, validation, and test sets follow different distributions.
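The first setting can be sketched as a shuffled 80/10/10 index split. This is a hedged illustration: the function name `split_indices`, the random shuffling, and the seed are assumptions, and the exact partition used by the authors (following Chen et al., 2016) may be defined differently.

```python
import random

def split_indices(n_samples, seed=0):
    """Shuffle sample indices and split them 80/10/10 (train/val/test)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = n_samples * 8 // 10
    n_val = n_samples // 10
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder, so no sample is dropped
    return train, val, test

# 111,009 is the number of GEO samples after preprocessing.
train_idx, val_idx, test_idx = split_indices(111_009)
```

For the second setting, the same `train_idx` would be kept, while the validation and test sets would be replaced by the 1000G and GTEx samples, respectively.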

4.3. Evaluation metrics

We measure the effectiveness of inference models using two evaluation metrics: mean absolute error (MAE) and concordance correlation (CC). Denoting the predicted expressions as $\{\hat{g}_{i}^{t}\}_{i = 1}^{M}$ for the testing data $\{(g_{i}^{l}, g_{i}^{t})\}_{i = 1}^{M}$ given a certain model, the MAE measure is calculated as $\mathrm{MAE}_{j} = \frac{1}{M} \sum_{i = 1}^{M} | \hat{g}_{i j}^{t} - g_{i j}^{t} |,$ (11)

where $\mathrm{MAE}_{j}$ is the MAE for the j-th task, calculated from the predicted and ground-truth expression values $\hat{g}_{i j}^{t}$ and $g_{i j}^{t}$ across all $M$ test samples. The CC measure is computed as $\mathrm{CC}_{j} = \frac{2 ρ σ_{g_{j}^{t}} σ_{\hat{g}_{j}^{t}}}{σ_{g_{j}^{t}}^{2} + σ_{\hat{g}_{j}^{t}}^{2} + {(μ_{g_{j}^{t}} - μ_{\hat{g}_{j}^{t}})}^{2}},$ (12)

where $\mathrm{CC}_{j}$ is the CC for the j-th target gene, $ρ$ is the Pearson correlation between $g_{j}^{t}$ and $\hat{g}_{j}^{t}$, and $μ_{g_{j}^{t}}$, $μ_{\hat{g}_{j}^{t}}$ and $σ_{g_{j}^{t}}$, $σ_{\hat{g}_{j}^{t}}$ are the means and standard deviations of $g_{j}^{t}$ and $\hat{g}_{j}^{t}$, respectively. In addition to the mean values of the absolute error and CC, $\mathrm{MAE}_{mean} = \frac{1}{T} \sum_{j = 1}^{T} \mathrm{MAE}_{j}$ and $\mathrm{CC}_{mean} = \frac{1}{T} \sum_{j = 1}^{T} \mathrm{CC}_{j}$, where $T$ is the number of target genes, we report the standard deviation across the tasks for each inference model.
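Equations (11) and (12) can be written as a short sketch for a single target gene. The function names are hypothetical, and the use of population moments (dividing by the number of samples) for the variances and for the across-task standard deviation is an assumption; the product of the Pearson correlation and the two standard deviations is computed directly as the covariance.

```python
from math import sqrt
from statistics import fmean

def mae(y_true, y_pred):
    """Mean absolute error for one target gene, Eq. (11)."""
    return fmean(abs(p - t) for t, p in zip(y_true, y_pred))

def concordance_correlation(y_true, y_pred):
    """Concordance correlation for one target gene, Eq. (12).

    rho * sigma_true * sigma_pred equals the covariance, so the
    numerator is written as 2 * cov. Undefined (division by zero)
    when both vectors are constant and have equal means.
    """
    mu_t, mu_p = fmean(y_true), fmean(y_pred)
    var_t = fmean((t - mu_t) ** 2 for t in y_true)
    var_p = fmean((p - mu_p) ** 2 for p in y_pred)
    cov = fmean((t - mu_t) * (p - mu_p) for t, p in zip(y_true, y_pred))
    return 2.0 * cov / (var_t + var_p + (mu_t - mu_p) ** 2)

def summarize(per_task_values):
    """Mean and standard deviation of a metric across tasks."""
    mean = fmean(per_task_values)
    std = sqrt(fmean((v - mean) ** 2 for v in per_task_values))
    return mean, std

# Toy check: predictions shifted by +1 are perfectly correlated,
# but the mean offset lowers CC below 1.
y_true = [1.0, 2.0, 3.0]
y_pred = [2.0, 3.0, 4.0]
toy_mae = mae(y_true, y_pred)                    # 1.0
toy_cc = concordance_correlation(y_true, y_pred)  # 4/7
```

The toy case shows why CC is reported alongside the Pearson correlation's usual role: a constant offset leaves the Pearson correlation at 1 but is penalized by the squared mean difference in the denominator.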

4.4. Implementation details

We use DenseNet as the main architecture of our deep regression network, with three 9000-dimensional hidden layers. We use the leaky rectified linear unit (Leaky-ReLU) with leakiness hyperparameter $0.2$ (Maas et al., 2013) as our activation function and the Adam algorithm (Kingma and Ba, 2014) as our optimization strategy. In addition, we linearly reduce the learning rate from $1 \times 10^{-3}$ to $1 \times 10^{-5}$ over the 500 training epochs. We also set the batch size to 100 and use batch normalization (Ioffe and Szegedy, 2015) to accelerate training convergence. The parameters of all layers are initialized by the Xavier approach (Glorot and Bengio, 2010). The dropout value, $λ$, and the number of hidden units in the subspace layer are selected from $\mathrm{dropout}^{set} = \{0.05, 0.1, 0.25\}$, $λ^{set} = \{0.1, 1, 10\}$, and $\mathrm{units}^{set} = \{500, 1000, 2000\}$, respectively, based on the validation results. We implement our code in PyTorch and run the algorithm on a machine with one Titan X Pascal GPU. The details of our Deep-LSMTL model architecture are described in Table 1.
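The linear learning-rate decay can be sketched as a plain function of the epoch index. This is an illustration under stated assumptions: the name `linear_lr` is hypothetical, the rate is assumed to be updated once per epoch, and the last epoch is assumed to reach the final rate exactly.

```python
def linear_lr(epoch, n_epochs=500, lr_start=1e-3, lr_end=1e-5):
    """Learning rate at a given 0-based epoch, decayed linearly."""
    if n_epochs < 2:
        return lr_start
    fraction = epoch / (n_epochs - 1)  # 0.0 at the first epoch, 1.0 at the last
    return lr_start + fraction * (lr_end - lr_start)

schedule = [linear_lr(e) for e in range(500)]
```

In PyTorch, one way to apply such a schedule would be to pass the ratio `linear_lr(e) / 1e-3` as the multiplicative factor of `torch.optim.lr_scheduler.LambdaLR`; the source does not say which mechanism was actually used.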

Table 1.

The Architecture of Deep-LSMTL Model

Batch of input data: $g^{l} \in ℛ^{100 \times 943}$

Dense-layer, # hidden-units = 9000, Layer normalization: BN, Activation-function: Leaky-ReLU(0.2)

Dense-layer, # hidden-units = 9000, Layer normalization: BN, Activation-function: Leaky-ReLU(0.2)

Dense-layer, # hidden-units = 9000, Layer normalization: BN, Activation-function: Leaky-ReLU(0.2)

FC-layer, # output-units = 9520, Activation-function: Linear

FC-layer, # hidden-units = 500, Layer normalization: BN, Activation-function: Leaky-ReLU(0.2)

FC-layer, # output-units = 9520, Activation-function: Linear

BN, batch normalization; FC, fully connected; ReLU, rectified linear unit.
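To give a sense of scale for the layers in Table 1, the following sketch counts weights and biases. It rests on simplifying assumptions that are not taken from the source: every layer is treated as a plain fully connected layer fed only by the previous layer (dense skip connections would enlarge the input sizes), batch-normalization parameters are ignored, and the last two rows are read as the two-layer subnetwork $g$ with 500 hidden units.

```python
def fc_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

# (input size, output size) for each layer listed in Table 1.
# Assumption: each layer sees only the previous layer's output.
regression_layers = [(943, 9000), (9000, 9000), (9000, 9000), (9000, 9520)]
subnetwork_layers = [(9520, 500), (500, 9520)]

n_regression = sum(fc_params(i, o) for i, o in regression_layers)
n_subnetwork = sum(fc_params(i, o) for i, o in subnetwork_layers)

# Storage for the parameters alone in float32 (4 bytes each), in megabytes.
megabytes = (n_regression + n_subnetwork) * 4 / 1e6
```

Under these assumptions the regression network has about 256 million parameters and the subnetwork adds about 9.5 million, i.e., under 4% on top; optimizer state and activations would add to the actual GPU footprint.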

4.5. Performance comparison

In this subsection, we evaluate the performance of the methods in predicting the target gene expressions on the two datasets. Figure 2 shows the results of different models in four groups: the shallow regression models in the first part, the existing deep regression networks in the second part, the MTL algorithms applied to deep regression models in the third part, and our DenseNet baseline and Deep-LSMTL network in the fourth part. In this experiment, we set the number of parameters for the deep MTL baselines and our models according to the largest network that fits on the GPU. Consequently, Deep-GO-MTL, Deep-CNMTL, Deep-AMTFL, Deep-AMTL, and Deep-LSMTL have 8000, 4000, 5000, 7000, and 9000 hidden units, respectively.

FIG. 2.

Experimental results comparing different machine learning methods on the GEO and GTEx datasets based on the MAE and CC evaluation metrics. The empirical results of the traditional regression models are listed in the first part, and those of the previous deep inference networks are shown in the second part (these results were reported in their original articles or obtained by running their released code). The results of multitask learning methods are reported in the third part, and the results of our proposed models, using the densely connected architecture with different numbers of hidden units, are added in the fourth part. Better results correspond to lower MAE values or higher CC values. AMTL, asymmetric multitask learning; AMTFL, asymmetric multitask feature learning; CC, concordance correlation; GAN, generative adversarial network; GEO, Gene Expression Omnibus; GO-MTL, grouping and overlap multitask learning; GTEx, Genotype-Tissue Expression; KNN, K-nearest neighbors; LSMTL, large-scale multitask learning; MAE, mean absolute error; MTL, multitask learning.

The first observation is that Deep-LSMTL consistently and significantly outperforms all of the other models on both datasets according to both the MAE and CC evaluation metrics. Deep-LSMTL substantially outperforms the shallow models as expected, showing the importance of deeper networks in learning the nonlinear gene expression data. Deep-LSMTL also provides better outcomes than the existing deep inference models in the literature, confirming the benefits of using the task interrelations in our multitask learning framework. Furthermore, Deep-LSMTL not only predicts better than the alternative MTL algorithms but also needs much less GPU memory than its MTL counterparts.

Since target gene expressions are normalized, direct comparisons of the errors may not be definitive. We therefore use the $5 \times 2$ cross-validation approach (Dietterich, 1998) to test the hypothesis that Deep-LSMTL has statistically significant improvements over the alternative models. In particular, we repeat twofold cross-validation of Deep-LSMTL and the best alternative model on the GEO dataset (i.e., DenseNet) five times and use a paired Student's t-test on the MAE results. The obtained p values are below 0.05, rejecting the null hypothesis of similar distributions between each pair of the models. Therefore, we are able to verify that the improvements of Deep-LSMTL over the alternative methods are statistically significant.
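The test statistic of the $5 \times 2$ cross-validation procedure can be sketched as follows. The code implements the paired t statistic defined by Dietterich (1998), which is assumed here to be the variant used; the function name and the MAE differences are hypothetical and are not results from this study. Instead of computing a p value, the statistic is compared with the two-sided 5% critical value of Student's t distribution with 5 degrees of freedom.

```python
from math import sqrt

def five_by_two_cv_t(diffs):
    """Dietterich's 5x2cv paired t statistic.

    `diffs` holds five (fold1, fold2) pairs; each entry is the difference
    in test MAE between the two models on that fold.
    """
    if len(diffs) != 5:
        raise ValueError("expected 5 replications of 2-fold cross-validation")
    variances = []
    for d1, d2 in diffs:
        mean = (d1 + d2) / 2.0
        variances.append((d1 - mean) ** 2 + (d2 - mean) ** 2)
    # Numerator: the difference on the first fold of the first replication.
    return diffs[0][0] / sqrt(sum(variances) / 5.0)

T_CRITICAL_5DF = 2.571  # two-sided 5% critical value, Student's t, 5 d.o.f.

# Hypothetical MAE differences (baseline minus proposed), not real results.
toy_diffs = [(0.012, 0.010), (0.011, 0.013), (0.009, 0.012),
             (0.013, 0.011), (0.010, 0.012)]
t_stat = five_by_two_cv_t(toy_diffs)
significant = abs(t_stat) > T_CRITICAL_5DF
```

Using only the first difference in the numerator while pooling all five variance estimates in the denominator is the design choice of the original test, intended to keep the statistic close to a t distribution despite overlapping training sets.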

4.6. Ablation study

Although the previous experiments have validated the effectiveness of our model on large-scale multitask inference problems by providing a memory-efficient network on one GPU, we design an additional experiment that compares our proposed algorithm and the MTL methods on networks with the same architecture. In particular, we use a two-hidden-layer DenseNet as the architecture of all models in three different scenarios with 3000, 6000, and 9000 hidden units. Figure 3 shows the results of Deep-GO-MTL, Deep-CNMTL, Deep-AMTL, Deep-AMTFL, and Deep-LSMTL on both the GEO and GTEx datasets. It is worth mentioning that Deep-CNMTL and Deep-AMTFL still suffer from out-of-memory issues when they have 9000 hidden units. Figure 3 indicates that Deep-LSMTL consistently beats all the other MTL methods across the architectures. Therefore, our algorithm not only offers a more scalable method for training inference models with a large number of tasks but also achieves better outcomes when the base networks are identical.

FIG. 3.

Experimental results comparing different deep multitask learning algorithms for the gene expression inference problem on the GEO and GTEx datasets. All compared models use a two-hidden-layer DenseNet as their structure but have different numbers of hidden units in each part of the table. Better results correspond to lower MAE values or higher CC values.

In addition to examining the efficacy of Deep-LSMTL on base networks with the DenseNet architecture, we analyze Deep-LSMTL and D-GEX with an MLP base network in Figure 4. The results are reported for both models with an MLP base network containing one, two, or three hidden layers and 3000, 6000, or 9000 hidden units. Deep-LSMTL again outperforms D-GEX consistently in all architectures, confirming its capability irrespective of the base network choice.

FIG. 4.

Experimental results to compare the MAE of D-GEX and Deep-LSMTL on GEO and GTEx datasets, when the number of hidden layers varies from 1 to 3 and the number of hidden units is 3000, 6000, or 9000. The structure of both models is based on the MLP network.

We also perform another ablation study to investigate the sensitivity of our learning framework with respect to the hyperparameters. Although we are able to select the dropout probability, $λ$, and the number of hidden units based on the results on the validation set, we report the performance of Deep-LSMTL with three hidden dense layers and 9000 hidden units for $\mathrm{dropout}^{set} = \{0.05, 0.1, 0.25\}$, $λ^{set} = \{0.1, 1, 10\}$, and $\mathrm{units}^{set} = \{500, 1000, 2000\}$. Figure 5 illustrates the MAE of Deep-LSMTL for each of the hyperparameters when the remaining ones are fixed at their best values according to the validation results. The first observation is that Deep-LSMTL is not very sensitive to the values of the hyperparameters in the selected range. However, the figure shows that $λ$ has more effect on the results than the other hyperparameters, indicating the importance of the proposed regularization.
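The one-at-a-time sensitivity protocol described above can be sketched as a generator of configurations. The names `search_space`, `one_at_a_time`, and `best` are hypothetical, and the "best" values below are placeholders, since the selected setting is not stated in the text.

```python
search_space = {
    "dropout": [0.05, 0.1, 0.25],
    "lam": [0.1, 1, 10],
    "units": [500, 1000, 2000],
}

def one_at_a_time(space, best):
    """Yield (name, config) pairs that vary one hyperparameter at a time."""
    for name, values in space.items():
        for value in values:
            config = dict(best)   # copy so the best setting is not mutated
            config[name] = value  # override only the swept hyperparameter
            yield name, config

# Placeholder "best" setting; the selected values are not given in the text.
best = {"dropout": 0.1, "lam": 1, "units": 1000}
runs = list(one_at_a_time(search_space, best))
```

This yields nine configurations (three per hyperparameter) rather than the 27 of a full grid, which matches a figure that plots each hyperparameter separately.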

FIG. 5.

Ablation study of hyperparameter effects, including dropout probability (dr), the regularization hyperparameter $λ$ (lam), and number of hidden units (unit).

4.7. Visualization

To demonstrate the role of various landmark genes in our inference problem, we conduct a qualitative analysis of Deep-LSMTL by visualizing the relevance of different landmark genes in our trained model using the Layer-wise Relevance Propagation technique (Bach et al., 2015). Figure 6 illustrates the results of Deep-LSMTL with the DenseNet structure (in Table 1) on the GEO dataset. Clustering the gene expression samples into 20 groups, Figure 6a and b show the relevance scores of landmark genes w.r.t. each cluster. The results suggest different patterns for different clusters of the landmark gene expressions, consistent with the findings of earlier cancer subtype discovery and cancer landscape analyses that different groups of samples typically exhibit different expression patterns (Speicher and Pfeifer, 2015; Kandoth et al., 2013).

FIG. 6.

Visualization of the relevance score calculated for each landmark gene on GEO dataset. (a) Relevance score of landmark genes w.r.t. cluster of profiles. We grouped the gene expression profiles into 20 clusters using K-means and plot the contribution of each landmark gene to different clusters of profiles. (b) Cleaned version of landmark gene score. For each profile cluster, only the top 20 landmark genes in (a) are kept for clear visualization. (c) Relevance score of landmark genes w.r.t. cluster of target genes. We divide the 9520 target genes into 20 clusters using K-means and demonstrate the contributions of cleaned landmark genes. (d) Relevance score of landmark gene clusters w.r.t. cluster of target genes. The landmark genes are clustered into 10 clusters, and their contributions in predicting different clusters of target genes are plotted.

We further explore the relationships between target and landmark genes. Dividing the target genes into 20 clusters, we calculate the overall relevance score of landmark genes in the prediction of each target gene cluster, as shown in Figure 6c. Figure 6d provides better insights by grouping the landmark genes into 10 clusters and illustrating the connection between the clusters of target and landmark genes. While there are obvious disparities in the relevance patterns of different target gene clusters, some clusters are correlated. Previous gene cluster studies also showed similar findings on the relations of gene clusters and the structure of biosynthetic pathways and metabolites (Medema et al., 2015).
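The cluster-level aggregation and the top-20 filtering described for Figure 6 can be sketched as follows. The sketch assumes that per-sample relevance scores (e.g., from Layer-wise Relevance Propagation) and K-means cluster labels have already been computed elsewhere; averaging as the aggregation rule, ranking by raw score, and both function names are assumptions rather than details from the source.

```python
def cluster_mean_relevance(relevance, labels, n_clusters):
    """Average per-sample relevance vectors within each cluster."""
    n_genes = len(relevance[0])
    sums = [[0.0] * n_genes for _ in range(n_clusters)]
    counts = [0] * n_clusters
    for row, label in zip(relevance, labels):
        counts[label] += 1
        for j, score in enumerate(row):
            sums[label][j] += score
    return [
        [s / counts[c] if counts[c] else 0.0 for s in sums[c]]
        for c in range(n_clusters)
    ]

def keep_top_k(scores, k):
    """Zero out all but the k largest scores in one cluster's row."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    top = set(ranked[:k])
    return [s if j in top else 0.0 for j, s in enumerate(scores)]

# Toy data: 3 profiles x 3 landmark genes, two clusters, keep the top 1.
toy_relevance = [[0.9, 0.1, 0.3], [0.7, 0.3, 0.1], [0.2, 0.8, 0.4]]
toy_labels = [0, 0, 1]
means = cluster_mean_relevance(toy_relevance, toy_labels, n_clusters=2)
cleaned = [keep_top_k(row, k=1) for row in means]
```

With 20 clusters and `k=20`, the same two steps would produce a matrix analogous to the cleaned panel, in which each row keeps only its most relevant landmark genes.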

We also visualize the predictions of our model on the GTEx dataset in Figure 7, in the same way as for the GEO dataset. The figures show patterns similar to the previous outcomes. However, they are more notable because the model is trained on GEO data and predicts on GTEx data, qualitatively confirming the capability of our proposed model in capturing the relations among genes even for cross-platform prediction.

FIG. 7.

Visualization of the relevance score calculated for each landmark gene on GTEx dataset. (a) Relevance score of landmark genes w.r.t. cluster of profiles. We grouped the gene expression profiles into 20 clusters using K-means and plot the contribution of each landmark gene to different clusters of profiles. (b) Cleaned version of landmark gene score. For each profile cluster, only the top 20 landmark genes in (a) are kept for clear visualization. (c) Relevance score of landmark genes w.r.t. cluster of target genes. We divide the 9520 target genes into 20 clusters using K-means and demonstrate the contributions of cleaned landmark genes. (d) Relevance score of landmark gene clusters w.r.t. cluster of target genes. The landmark genes are clustered into 10 clusters, and their contributions in predicting different clusters of target genes are plotted.

5. Conclusion

In this study, we introduced a new MTL method for training deep inference models that estimate gene expressions. Our algorithm improves the generalization of multitask predictors by effectively discovering the task correlations. To do so, we proposed a seamless regularization for deep neural networks that is scalable to a huge number of tasks. Experimental results confirmed the effectiveness of our proposed algorithm compared with alternative models: our model consistently and significantly outperformed all counterparts on two gene expression datasets with various base network architectures. We also visualized the role of landmark genes in estimating the expressions of target genes, providing better insights into the knowledge learned by our regression model.

Footnotes

Author Disclosure Statement

No competing financial interests exist.

Funding Information

This work was partially supported by NSF IIS 1845666, 1852606, 1838627, 1837956, 1956002, 2040588, and NIH AG049371.

References

Alipanahi, Delong, Weirauch, M.T., et al. 2015. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831.

Argyriou, Evgeniou, and Pontil. 2008. Convex multi-task feature learning. Mach. Learn. 73, 243–272.

Alzheimer's Association. 2013. 2013 Alzheimer's disease facts and figures. Alzheimers Dement. 9, 208–245.

Bach, Binder, Montavon, et al. 2015. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS One. 10, e0130140.

Bakker and Heskes. 2003. Task clustering and gating for Bayesian multitask learning. J. Mach. Learn. Res. 4, 83–99.

Brazma, Parkinson, Sarkans, et al. 2003. Arrayexpress—a public repository for microarray gene expression data at the EBI. Nucleic Acids Res. 31, 68–71.

Cardoso, van't Veer, L.J., Bogaerts, et al. 2016. 70-gene signature as an aid to treatment decisions in early-stage breast cancer. N. Engl. J. Med. 375, 717–729.

Caruana. 1997. Multitask learning. Mach. Learn. 28, 41–75.

Chaudhari, Agrawal, and Kotecha. 2019. Data augmentation using MG-GAN for improved cancer classification on gene expression data. Soft Comput. 24, 11381–11391.

Chen, Li, Narayan, et al. 2016. Gene expression inference with deep learning. Bioinformatics. 32, 1832–1839.

Cuzick, Swanson, G.P., Fisher, et al. 2011. Prognostic value of an RNA expression signature derived from cell cycle proliferation genes in patients with prostate cancer: A retrospective study. Lancet Oncol. 12, 245–255.

Dietterich, T.G. 1998. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput. 10, 1895–1923.

Edgar, Domrachev, and Lash, A.E. 2002. Gene expression omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 30, 207–210.

Evgeniou, Micchelli, C.A., and Pontil. 2005. Learning multiple tasks with kernel methods. J. Mach. Learn. Res. 6, 615–637.

Ghasedi Dizaji, Wang, and Huang. 2018. Semi-supervised generative adversarial network for gene expression inference, 1435–1444. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM.

Glorot and Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks, 249–256. Proceedings of the 13th International Conference on Artificial Intelligence and Statistics.

Heimberg, Bhatnagar, El-Samad, et al. 2016. Low dimensionality in gene expression data enables the accurate extraction of transcriptional programs from shallow sequencing. Cell Systems. 2, 239–250.

Huang, Liu, Weinberger, K.Q., et al. 2017. Densely connected convolutional networks, 2261–2269. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI.

Ioffe and Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift, 448–456. International Conference on Machine Learning (ICML).

Jacob, Vert, J.-P., and Bach, F.R. 2009. Clustered multi-task learning: A convex formulation, 745–752. Advances in Neural Information Processing Systems (NIPS).

Kandoth, McLellan, M.D., Vandin, et al. 2013. Mutational landscape and significance across 12 major cancer types. Nature. 502, 333.

Kang, Grauman, and Sha. 2011. Learning with whom to share in multi-task feature learning, 521–528. International Conference on Machine Learning (ICML).

Keenan, A.B., Jenkins, S.L., Jagodnik, K.M., et al. 2017. The library of integrated network-based cellular signatures NIH program: System-level cataloging of human cells response to perturbations. Cell Systems. 6, 13–24.

Kingma and Ba. 2014. Adam: A method for stochastic optimization.

Kishan, Li, Cui, et al. 2019. GNE: A deep learning framework for gene network inference by aggregating biological information. BMC Syst. Biol. 13, 38.

Kumar and Daumé III, H. 2012. Learning task grouping and overlap in multi-task learning, 1723–1730. Proceedings of the 29th International Conference on Machine Learning (ICML), Omnipress.

Lee, Yang, and Hwang. 2016. Asymmetric multi-task learning based on task relatedness and loss, 230–238. International Conference on Machine Learning (ICML).

Lee, Yang, and Hwang, S.J. 2018. Deep asymmetric multi-task feature learning, 2956–2964. Proceedings of the 35th International Conference on Machine Learning (ICML), Stockholm, Sweden.

Leung, M.K., Xiong, H.Y., Lee, L.J., et al. 2014. Deep learning of the tissue-regulated splicing code. Bioinformatics. 30, i121–i129.

Maas, A.L., Hannun, A.Y., and Ng, A.Y. 2013. Rectifier nonlinearities improve neural network acoustic models. Proceedings of the 30th International Conference on Machine Learning (ICML), Atlanta, Georgia, Vol. 30.

Maurer, Pontil, and Romera-Paredes. 2013. Sparse coding for multitask and transfer learning, 343–351. Proceedings of the 30th International Conference on Machine Learning (ICML), Atlanta, Georgia.

Medema, M.H., Kottmann, Yilmaz, et al. 2015. Minimum information about a biosynthetic gene cluster. Nat. Chem. Biol. 11, 625.

Nelms, B.D., Waldron, Barrera, L.A., et al. 2016. Cellmapper: Rapid and accurate inference of gene expression in difficult-to-isolate cell types. Genome Biol. 17, 201.

Ntranos, Kamath, G.M., Zhang, J.M., et al. 2016. Fast and accurate single-cell RNA-Seq analysis by clustering of transcript-compatibility counts. Genome Biol. 17, 112.

Peck, Crawford, E.D., Ross, K.N., et al. 2006. A method for high-throughput gene expression signature analysis. Genome Biol. 7, R61.

Preuer, Lewis, R.P., Hochreiter, et al. 2018. Deepsynergy: Predicting anti-cancer drug synergy with deep learning. Bioinformatics. 34, 1538–1546.

Rees, M.G., Seashore-Ludlow, Cheah, J.H., et al. 2016. Correlating chemical sensitivity and basal gene expression reveals mechanism of action. Nat. Chem. Biol. 12, 109.

Richiardi, Altmann, Milazzo, A.-C., et al. 2015. Correlated gene expression supports synchronous activity in brain networks. Science. 348, 1241–1244.

Ruder. 2017. An overview of multi-task learning in deep neural networks.

Ruder, Bingel, Augenstein, et al. 2017. Learning what to share between loosely related tasks.

Shah, Lubeck, Zhou, et al. 2016. In situ transcription profiling of single cells reveals spatial organization of cells in the mouse hippocampus. Neuron. 92, 342–357.

Singh, Lanchantin, Robins, et al. 2016. Deepchrome: Deep-learning for predicting gene expression from histone modifications. Bioinformatics. 32, i639–i648.

Speicher, N.K., and Pfeifer. 2015. Integrating different data types by regularized unsupervised multiple kernel learning with application to cancer subtype discovery. Bioinformatics. 31, i268–i275.

Spencer, Eickholt, and Cheng. 2015. A deep learning network approach to ab initio protein secondary structure prediction. IEEE/ACM Trans. Comput. Biol. Bioinform. 12, 103–112.

Stephens, P.J., Tarpey, P.S., Davies, et al. 2012. The landscape of cancer genes and mutational processes in breast cancer. Nature. 486, 400.

Thrun and O'Sullivan. 1996. Discovering structure in multiple learning tasks: The TC algorithm. Int. Conf. Mach. Learn. 96, 489–497.

Van't Veer, L.J., Dai, Van De Vijver, M.J., et al. 2002. Gene expression profiling predicts clinical outcome of breast cancer. Nature. 415, 530.

Wang, He, Shah, et al. 2020. Network-based multi-task learning models for biomarker selection and cancer outcome prediction. Bioinformatics. 36, 1814–1822.

Xu, Liu, Leng, et al. 2020. Blood-based multi-tissue gene expression inference with Bayesian ridge regression. Bioinformatics. 36, 3788–3794.

Xu, Zhang, You, et al. 2020. scIGANs: Single-cell RNA-Seq imputation using generative adversarial networks. Nucleic Acids Res. 48, e85.

Yan, Wei, Deng, et al. 2015. Transcriptional analysis of immune-related gene expression in p53-deficient mice with increased susceptibility to influenza A virus infection. BMC Med Genom. 8, 52.

Yang and Hospedales. 2017. Deep multi-task representation learning: A tensor factorisation approach. International Conference on Learning Representations (ICLR).

Yang and Hospedales, T.M. 2016. Trace norm regularised deep multi-task learning.

Zhou and Troyanskaya, O.G. 2015. Predicting effects of noncoding variants with deep learning–based sequence model. Nat. Methods. 12, 931.