Abstract
Uncovering potential treatment associations between drugs and diseases is a central task in drug repositioning. However, verifying a potential treatment relation between a drug and a disease by “wet” experimental methods is time-consuming and costly. Fortunately, with the accumulation of large amounts of data and the development of machine learning, many computational methods for predicting drug-disease treatment associations have been proposed. Building a prediction model with machine learning techniques normally requires plenty of both positive and negative training samples. Biological experiments, however, can only verify whether a drug treats a disease; they cannot establish that a drug definitely does not treat a disease. Consequently, the data contain only positive and unlabeled samples. Lacking validated negative samples, most computational methods assume that the unlabeled samples are negative and randomly select some of them, together with positive samples, to train the prediction models. However, unlabeled samples are not necessarily negative: some may be positives that have not yet been confirmed by experiments. In this paper, we propose a method called PUDrDi, which directly uses the positive and unlabeled samples to train a Biased-SVM classifier. Moreover, we combine drug and disease features to represent a drug-disease pair, using chemical substructures as drug features and symptoms as disease features. The experimental results demonstrate that PUDrDi outperforms several other methods, and a case study further shows its practicality.
Introduction
De novo drug development is costly and time-consuming, and it has a low success rate. Bringing a small molecule from initial laboratory study to approval costs more than 800 million US dollars and takes about 15 years [13], and nearly 90% of small molecules may be eliminated during Phase I clinical trials [3]. The pharmacological properties of approved drugs are known, but only some of their indications have been found. Finding new indications for approved drugs, known as drug repositioning, has therefore become an alternative route for drug development: because it does not need to repeat many pharmacological experiments, it can greatly reduce cost, shorten the development cycle and improve the success rate. In recent years, the accumulation of large quantities of drug- and disease-related data and the wide application of machine learning methods have brought new opportunities for drug repositioning. There are more than 1730 approved small-molecule drugs in DrugBank [43] and more than 25000 diseases in the UMLS medical database [32], which together form tens of millions of possible drug-disease interactions. However, only 4.5% of these interactions are confirmed to be definite treatment relationships, and most drug-disease relationships are unknown [25]. Many computational methods have therefore been proposed to predict potential drug-disease treatment relationships among the large number of unknown interactions using different kinds of data.
If two drugs have similar gene expression profiles, they may have similar indications; if a drug and a disease have opposite gene expression profiles, the disease may be a potential indication of that drug [12]. Based on this assumption, many researchers have proposed methods that predict drug-disease relations from gene expression profile data. For example, Sirota et al. collected gene expression data for 100 diseases and 164 drugs and derived potential indications for these drugs, such as lung adenocarcinoma as a potential indication of cimetidine (a drug for treating peptic ulcers) [26]; Jahchan et al. queried a large compendium of gene expression profiles and identified tricyclic antidepressants as inhibitors of small cell lung cancer [30].
Drug-disease association data in the literature and in databases also make it possible to derive potential new drug-disease relations by text mining. For example, if some studies find that disease A is caused by a lack of nutrient B, and other studies report that drug C, which is used to treat another disease, is an activator of B, then literature mining and semantic inference suggest that drug C is likely to treat disease A [8, 29]. Ahlers et al. proposed a literature-based mining model that extracts semantic information from MEDLINE citations, and thereby found potential associations between antipsychotic agents and cancer [9].
In recent years, many methods have constructed networks from drug and disease data to predict drug-disease associations. Motivated by the assumption that if two diseases share some therapies, then drugs for treating one disease might also be useful for the other, Chiang and Butte proposed a guilt-by-association (GBA) method [5]. Cheng et al. proposed a network-based inference (NBI) method that exploits the topology of the network and uses a two-step resource allocation to infer associations [14]. Wang et al. built a heterogeneous graph of drugs and targets and proposed a heterogeneous graph based inference method (HGBI) to predict drug-target associations; HGBI can also be modified to predict drug-disease associations [45].
With increasingly abundant disease-related data of different kinds, many methods have been proposed that integrate multiple data sources and predict drug-disease relations with machine learning techniques. For example, Gottlieb et al. integrated different drug and disease data to construct drug-drug and disease-disease similarity measures, then predicted drug-disease interactions with a logistic regression classifier [1]; Chen et al. developed two recommendation methods that recommend diseases for drugs based on network topology information [18]. Wu et al. integrated drug similarities and disease similarities from the chemical/phenotype layer, gene layer and treatment network layer, and proposed a semi-supervised graph cut method (SSGC) to predict drug-disease associations [17]. Liang et al. integrated drug chemical, target domain and gene ontology annotation information, and proposed a Laplacian regularized sparse subspace learning method (LRSSL) to predict drug-disease associations [46].
However, biological data usually lack validated negative samples, which is a challenge for machine learning methods that need both positive and negative samples to train a prediction model. Currently, most traditional machine learning based methods treat randomly selected unlabeled samples as negative when building the model. But unlabeled samples are not necessarily negative: some may be positive samples that have not been validated yet. Treating unlabeled samples as negative may therefore introduce noise and decrease the accuracy of the model. To address this problem, learning from positive and unlabeled samples (PU learning) has emerged as an alternative, and it has been applied to bioinformatics problems such as the identification of disease genes [34] and the prediction of protein-RNA interactions [50].
In this work, we use PU learning to predict drug-disease associations. We propose a prediction method, PUDrDi, which neither simply regards the unlabeled samples as negative nor needs to find reliable negative samples. Instead, it directly takes the positive and unlabeled samples as training samples, which distinguishes it from most traditional machine learning based methods and most PU learning based methods, and then employs a Biased-SVM classifier to build the prediction model (Liu’s work found Biased-SVM to perform better than the two-step strategy) [7]. To the best of our knowledge, although PU learning has been applied to many problems, this is the first work to predict drug-disease associations using only positive and unlabeled samples. In addition, prescribing medicine according to indications and symptoms is very important in clinical treatment, yet existing computational methods seldom use symptom data to predict drug-disease associations. In this work, we therefore use disease symptom data together with drug chemical data to infer potential drug-disease associations, which differs from most traditional methods for this task. To better evaluate the performance of the methods, we use not only traditional metrics such as Precision, Recall and F1, but also a new metric
Related work
In recent years, several PU learning based methods have been proposed for problems in bioinformatics and biomedicine. Yang et al. extracted candidate positive samples from the unlabeled samples to complement the positive samples, extracted reliable negative samples from the unlabeled samples, and then integrated multiple PU learning classifiers trained on the complemented positive samples and the reliable negative samples to predict disease genes [33]. Ren et al. identified reliable negative samples and then built a classifier on the positive samples and reliable negative samples to predict conformational B-cell epitopes [20]. Lan et al. divided the data set into three parts (positive samples, reliable negative samples and likely negative samples) and then used a weighted SVM to predict drug-target interactions [44]. These methods avoid treating randomly selected unlabeled samples as negative, but they all need a step that extracts or finds reliable negative samples, and there is no guarantee that these samples are truly reliable.
On the one hand, in most bioinformatics problems that lack validated negative samples, traditional machine learning methods treat randomly selected unlabeled samples as negative when building the classification model, although the unlabeled samples are not necessarily negative and probably include many positive samples that have not been validated yet. On the other hand, most PU learning based methods partly avoid this weakness, but they need to find reliable negative samples among the unlabeled samples, and no one can be sure that the samples obtained are reliable enough. It is therefore meaningful and necessary to find another solution to such problems.
Moreover, to the best of our knowledge, the drug-disease association prediction problem has not yet been addressed with PU learning methods. Therefore, in this work, we propose a PU learning based method, PUDrDi, to predict drug-disease associations. Unlike traditional machine learning based methods, we do not simply treat randomly selected unlabeled samples as negative. Unlike most PU learning based methods, we do not need to find reliable negative samples; we directly use the positive and unlabeled samples to build the classifier. Unlike most traditional methods for predicting drug-disease associations, we use intuitive but important symptom data to represent the diseases.
Materials and methods
Pipeline of PUDrDi
The pipeline of PUDrDi is shown in Fig. 1. The main steps are as follows. (1) Data collection and vector construction. Drug-disease pairs with definite treatment relations are collected as positive samples (labeled ‘1’), and other pairs are randomly collected as unlabeled samples (labeled ‘-1’, which does not mean that they are negative). Each drug-disease pair is vectorized using chemical substructure features and disease symptom features. More details are given in the section “Data sets and data representation”. (2) Feature selection. A formula is used to score the features so that the useful ones can be selected to build the prediction model. More details are given in the section “Feature selection”. (3) Classifier training. A Biased-SVM classifier is trained after feature selection. More details are given in the sections “Classifier construction based on Biased-SVM” and “Auto-selection of parameter value”. (4) Prediction. The trained model is used to predict novel drug-disease associations. More details are given in the section “Experiments and results”.

Fig. 1. The pipeline of PUDrDi.
Data sets and data representation
In this work, the drug data were collected from DrugBank, a public database of drugs and drug targets [43]. A total of 1007 approved small-molecule drugs were collected. The drug fingerprint data were obtained from the PubChem database [37], which defines 881 fingerprint features. Each drug was encoded as an 881-dimensional binary vector in which each bit indicates the presence (1) or absence (0) of a chemical substructure fragment.
Disease data were collected from the MeSH database [10], giving 4219 diseases in all. The symptom data were obtained from Zhou’s work [47], which covers 322 symptoms. Each disease was encoded as a 322-dimensional binary vector in which each bit indicates the presence (1) or absence (0) of a symptom.
The drug-disease association data set, which is the same as Li and Lu’s gold standard data set [19], contained 799 drugs, 719 diseases and 3250 drug-disease treatment relations. Some of the diseases had no symptom data in Zhou’s symptom data set, so we excluded them. The final drug-disease association data set retained 720 drugs, 637 diseases and 2783 drug-disease treatment relations.
Each drug-disease pair was encoded as a 1203-dimensional binary vector formed by concatenating the corresponding 881-dimensional drug vector and 322-dimensional disease vector. The 2783 known drug-disease association samples were labeled ‘1’, and the other samples were labeled ‘-1’.
In the retained drug-disease relation data set, 203 drugs (28.2%) treat only one disease, and 2 drugs (Ofloxacin and Ceftriaxone) treat 21 diseases, which is the maximum; such broad-spectrum antibiotics have treatment relations with many diseases. 520 drugs (72.2%) are used to treat fewer than 5 diseases, and 664 drugs (92.2%) have treatment relations with fewer than 10 diseases. The number of diseases per drug is shown in Fig. 2. Similarly, 253 diseases (39.7%) have only one drug, and Hypertension has 72 related drugs, which is the maximum. 482 diseases (75.7%) have fewer than 5 drugs, and 574 diseases (90.1%) have fewer than 10 related drugs. The number of drugs per disease is shown in Fig. 3.
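The pair encoding can be illustrated with a minimal sketch. It assumes plain Python lists of 0/1 values; the function name `encode_pair` and the toy inputs are hypothetical and are not part of the original implementation.

```python
DRUG_DIM = 881      # PubChem substructure fingerprint bits
DISEASE_DIM = 322   # symptom bits


def encode_pair(drug_bits, disease_bits):
    """Concatenate a drug fingerprint and a disease symptom vector."""
    if len(drug_bits) != DRUG_DIM or len(disease_bits) != DISEASE_DIM:
        raise ValueError("expected 881 drug bits and 322 disease bits")
    return list(drug_bits) + list(disease_bits)


# Toy inputs (hypothetical, not taken from the data set).
drug = [0] * DRUG_DIM
drug[3] = 1          # substructure 3 is present
disease = [0] * DISEASE_DIM
disease[10] = 1      # symptom 10 is present

pair = encode_pair(drug, disease)
label = 1            # 1 = known treatment relation, -1 = unlabeled
```

In this layout the first 881 positions always describe the drug and the last 322 positions always describe the disease, so a feature index can be mapped back to a substructure or a symptom.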

Fig. 2. Number of diseases per drug.

Fig. 3. Number of drugs per disease.
Feature selection
Although more than one thousand features are used to represent a drug-disease pair, not all of them are helpful for the prediction task. We define a simple formula to score the distinguishing ability of a feature as follows:
where $f_m(i)$ denotes the value of the i-th feature in the m-th sample (drug-disease pair), $n_p$ is the number of labeled (positive) samples, and $n_u$ is the number of unlabeled samples.
The feature score measures the average enrichment of the i-th feature in the positive samples. Since the unlabeled samples are assumed to be negative with high probability, the feature score can be regarded as a measure of the usefulness of the i-th feature for the prediction task. We therefore ranked the features by score and chose the top N to build the prediction model.
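The ranking step can be sketched as follows. The scoring formula itself is not reproduced in the text, so the sketch assumes one plausible form of "average enrichment": the mean of a feature over the positive samples minus its mean over the unlabeled samples. This form and the function names are assumptions for illustration only.

```python
def feature_scores(positive, unlabeled):
    """Assumed score: mean of feature i over positives minus its mean over unlabeled."""
    n_p, n_u = len(positive), len(unlabeled)
    n_features = len(positive[0])
    scores = []
    for i in range(n_features):
        mean_p = sum(row[i] for row in positive) / n_p
        mean_u = sum(row[i] for row in unlabeled) / n_u
        scores.append(mean_p - mean_u)
    return scores


def top_n_features(scores, n):
    """Indices of the n highest-scoring features, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:n]


# Toy binary samples with three features (hypothetical data).
positive = [[1, 0, 1], [1, 1, 1], [1, 0, 0]]
unlabeled = [[0, 0, 1], [0, 1, 1], [1, 0, 1]]

scores = feature_scores(positive, unlabeled)
selected = top_n_features(scores, 2)
```

Whatever the exact formula, the selection step only needs one score per feature and keeps the N best indices.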
Classifier construction based on Biased-SVM
In our drug-disease association data set, we have only a few positive samples but a large number of unlabeled samples. Which kind of classifier is suitable for such unbalanced data with many unlabeled samples? This was the first problem we faced.
Biased-SVM is in fact a weighted SVM that was first proposed for text classification problems with few labeled samples but a large number of unlabeled samples, and it has been shown to outperform the two-stage strategy, which needs reliable negative samples [7]. Based on the hypothesis that the unlabeled samples are probably negative if the data set is large enough, Biased-SVM treats classification as a constrained optimization problem: the number of unlabeled samples predicted to be positive should be minimized under the condition that the positive samples are correctly predicted to be positive. Biased-SVM is therefore suitable for the drug-disease association prediction problem, which can be formulated as:
$$\min_{w,\, b,\, \xi} \; \frac{1}{2} w^{T} w + C_p \sum_{i:\, y_i = 1} \xi_i + C_u \sum_{i:\, y_i = -1} \xi_i \quad \text{s.t.} \quad y_i \left( w^{T} x_i + b \right) \ge 1 - \xi_i, \;\; \xi_i \ge 0 \;\; \forall i \tag{1}$$
where $x_i$ is the vector of a training sample, $y_i \in \{1, -1\}$ is the corresponding class label (label ‘1’ means a positive sample and label ‘-1’ means an unlabeled sample), $w$ and $b$ are the weight vector and bias of the decision function, $\xi_i$ is a slack variable, and $C_p$ and $C_u$ are the penalty parameters for misclassifying positive samples and unlabeled samples, respectively. In practice, $C_p$ and $C_u$ are tuned to obtain the best classification performance. If the amount of experimental data is large enough, most unlabeled samples are negative and only a few are positive, so we usually start with a larger $C_p$ and a smaller $C_u$.
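To make the role of the two penalties concrete, here is a minimal sketch that evaluates a weighted soft-margin objective for a fixed linear model. It assumes the standard Biased-SVM form (regularizer plus separately weighted hinge losses for positive and unlabeled samples) and a linear decision function; the function name and toy data are hypothetical, and this is not the SVMlight solver used in the experiments.

```python
def biased_svm_objective(w, b, X, y, c_p, c_u):
    """0.5*||w||^2 + C_p * (slack on positives) + C_u * (slack on unlabeled)."""
    reg = 0.5 * sum(wj * wj for wj in w)
    slack_p = 0.0
    slack_u = 0.0
    for xi, yi in zip(X, y):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        slack = max(0.0, 1.0 - margin)  # smallest slack satisfying the constraint
        if yi == 1:
            slack_p += slack
        else:
            slack_u += slack
    return reg + c_p * slack_p + c_u * slack_u


# Toy data: two positive and two unlabeled samples (hypothetical).
X = [[1, 0], [1, 1], [0, 1], [0, 0]]
y = [1, 1, -1, -1]
w, b = [1.0, 0.0], -0.5

value_biased = biased_svm_objective(w, b, X, y, c_p=10.0, c_u=1.0)
value_plain = biased_svm_objective(w, b, X, y, c_p=1.0, c_u=1.0)
```

With a larger penalty on positives, the same margin violations on positive samples cost ten times more than those on unlabeled samples, which pushes the optimizer toward classifying the known positives correctly.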
Biased-SVM requires the drug-disease pairs to be represented as vectors. How should a drug-disease pair be vectorized? This was the second problem. We tried several ideas, such as the tensor product of the corresponding drug vector and disease vector, but this took too much time and memory at run time. In the end, we found that representing the drug-disease pair as the concatenation of the drug vector and the disease vector is simple but effective. Therefore, in this work, each drug-disease pair was encoded as a 1203-dimensional binary vector formed by concatenating the corresponding 881-dimensional drug vector and 322-dimensional disease vector.
After vectorizing the drug-disease pairs, the number of features is large relative to the small number of positive samples. How should the useful features be selected to build the model? This was the third problem. More details are given in the section “Feature selection”.
Taking the vectorized drug-disease pairs as training samples (to keep the classes balanced, unlabeled samples are randomly selected so that their number equals the number of positive samples) and applying feature selection, we can build a drug-disease association prediction model based on Biased-SVM.
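The balanced sampling of unlabeled pairs can be sketched as below. The sketch assumes that every drug-disease pair without a known treatment relation is an eligible unlabeled sample; the function name, the seed and the toy identifiers are hypothetical.

```python
import random


def sample_unlabeled(all_pairs, positive_pairs, seed=0):
    """Randomly draw as many unlabeled pairs as there are positive pairs."""
    positives = set(positive_pairs)
    candidates = [p for p in all_pairs if p not in positives]
    return random.Random(seed).sample(candidates, len(positives))


# Toy identifiers (hypothetical, not taken from the data set).
drugs = ["d0", "d1", "d2", "d3"]
diseases = ["s0", "s1", "s2"]
all_pairs = [(d, s) for d in drugs for s in diseases]
positive_pairs = [("d0", "s0"), ("d1", "s2"), ("d3", "s1")]

unlabeled_pairs = sample_unlabeled(all_pairs, positive_pairs, seed=0)
```

Fixing the seed keeps the unlabeled set identical across runs, which matters when several methods are compared on the same samples.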
Table 1. The effect of parameters c and j on performance metric
Evaluation metrics
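In traditional classification problems, Accuracy, Precision, Recall and F1 are commonly used performance metrics. These metrics are defined as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
where TP, FP, TN and FN denote the numbers of true positive, false positive, true negative and false negative samples, respectively, and TP + FP + TN + FN equals the total number of samples.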
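As a worked example, the four traditional metrics can be computed directly from the confusion counts. The helper name and the counts below are hypothetical; returning 0.0 for an empty denominator is a convention chosen for this sketch.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, Precision, Recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for illustration.
m = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

For these counts, Accuracy is 85/100, Precision is 40/50, Recall is 40/45, and F1 simplifies to 2TP / (2TP + FP + FN) = 80/95.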
In addition, to compare classifiers more intuitively, the ROC curve, AUC (the area under the ROC curve), the P-R curve and AUPR (the area under the P-R curve) are often used. The ROC curve shows the relationship between the true positive rate (TPR, the same as Recall) and the false positive rate (FPR) of a classifier. TPR and FPR are calculated as follows:
$$\mathrm{TPR} = \frac{TP}{TP + FN}$$
$$\mathrm{FPR} = \frac{FP}{FP + TN}$$
Geometrically, the closer the ROC curve is to the upper left corner, the greater the AUC and the better the performance of the classifier. The P-R curve shows the relationship between Precision and Recall. The closer the P-R curve is to the upper right corner, the greater the AUPR and the better the performance of the classifier.
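The sketch below computes one ROC point (TPR, FPR) at a given threshold and the AUC. It assumes real-valued classifier scores and labels in {1, -1}, with unlabeled samples treated as negatives; the AUC is computed through its rank interpretation (the fraction of positive-negative pairs ordered correctly, ties counted as one half), which equals the area under the ROC curve. Function names and data are hypothetical.

```python
def tpr_fpr(scores, labels, threshold):
    """One point of the ROC curve for a given decision threshold."""
    tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 1)
    fn = sum(1 for s, l in zip(scores, labels) if s < threshold and l == 1)
    fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == -1)
    tn = sum(1 for s, l in zip(scores, labels) if s < threshold and l == -1)
    return tp / (tp + fn), fp / (fp + tn)


def auc(scores, labels):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == -1]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical scores for three positive and three negative samples.
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
labels = [1, 1, 1, -1, -1, -1]

tpr, fpr = tpr_fpr(scores, labels, threshold=0.5)
area = auc(scores, labels)
```

The pairwise loop is quadratic in the number of samples, which is fine for an illustration but would normally be replaced by a sort-based computation on large data sets.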
However, in classification problems with positive and unlabeled samples, there are no validated negative samples. To compute traditional metrics such as the F1 score, we have to treat the unlabeled samples as negative, so metrics obtained in this way cannot reflect the actual performance of the classifiers. To better measure the performance of the classifiers, literature [7] proposed a new metric,
where
Table 2. The results of performance comparison among algorithms
To automatically tune the parameters $C_p$ and $C_u$ in Equation (1), a validation set is often used to validate the performance of the final classifier. In this work, 30% of the samples were randomly selected as the validation set and the other 70% were used as the training set. The Biased-SVM classifier was implemented using the SVMlight package [40]. We set SVMlight’s parameters c = $C_u$ and j = $C_p / C_u$, so that the hyperparameters $C_p$ and $C_u$ of Equation (1) can be tuned by tuning c and j in the program. To get the best parameter values, we tried the 49 combinations of c = {0.1, 0.3, 0.5, 1, 3, 5, 10} and j = {1, 3, 5, 10, 30, 50, 100}, then automatically selected the parameter combination when metric
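The grid over c and j can be sketched as follows, using the mapping c = C_u and j = C_p / C_u stated above (so C_p = c × j). The selection function takes a caller-supplied scoring function; `toy_score` is a hypothetical placeholder, not the validation metric used in the paper, and no SVMlight call is made here.

```python
from itertools import product

C_VALUES = [0.1, 0.3, 0.5, 1, 3, 5, 10]
J_VALUES = [1, 3, 5, 10, 30, 50, 100]


def grid():
    """Yield (c, j, C_p, C_u) using c = C_u and j = C_p / C_u."""
    for c, j in product(C_VALUES, J_VALUES):
        yield c, j, c * j, c


def select_best(evaluate):
    """Return the (c, j) pair with the highest validation score."""
    return max(((c, j) for c, j, _, _ in grid()),
               key=lambda cj: evaluate(*cj))


combos = list(grid())


# Placeholder scorer (hypothetical). A real scorer would train on the 70%
# split with (c, j) and return the chosen metric on the 30% validation split.
def toy_score(c, j):
    return -abs(c - 1) - abs(j - 10)


best = select_best(toy_score)
```

Keeping the search in terms of c and j mirrors how the solver is parameterized, while the derived C_p and C_u values remain available for reporting.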
Comparison with other methods
With the parameters c and j automatically selected in subsection 4.2, we performed 5-fold cross validation on the data set. The data set was randomly divided into five equal parts; four parts were used as the training set and the remaining part was used as the test set. After each of the five parts had been used as the test set in turn, we took the average of the performance metric values over the five experiments as the final classification performance of the method.
Using 5-fold cross validation, we compared our method (PUDrDi) with two classical methods (kNN and Random Forest) and a typical drug-disease or drug-target prediction method (HGBI) [45]. To be fair, the unlabeled samples (negative samples in the other methods) were randomly selected and kept the same in all methods. In PUDrDi, the Biased-SVM was implemented using the SVMlight package with the radial basis function (RBF) as the kernel function, and the parameter gamma was set to 0.03 by grid search. In kNN, the Euclidean distance was used and the parameter k was set to 3 by grid search to achieve its best performance. Random Forest was implemented using the RandomForestClassifier function in the scikit-learn package [16], and 5-fold cross validation was conducted with the default parameter values. In HGBI, the Tanimoto similarities [41] of the drugs were calculated from the feature vectors of drug chemical substructures, the Tanimoto similarities of the diseases were calculated from the feature vectors of disease symptoms, and the parameter α was set to 0.1 by grid search.
To compare these methods more comprehensively, both traditional and new performance metrics were considered. On the one hand, traditional performance metrics (Precision, Recall, F1, AUC and AUPR) were computed, although we had to treat the unlabeled samples as negative samples to obtain these values according to their definitions. On the other hand, new metric
The experimental results are shown in Table 2. In terms of single metrics, PUDrDi has a lower Recall than kNN but a higher Recall than the other two methods, and it has a slightly lower Precision than HGBI but a higher Precision than the other two methods. From the view of comprehensive metrics, not only in traditional metrics F1, AUC and AUPR, but also in new metric
To compare the performance more intuitively, we plotted the ROC and P-R curves, which are shown in Figs. 4 and 5, respectively. The curves show that PUDrDi outperforms the other compared methods.
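The 5-fold protocol described above can be sketched with the standard library alone. The shuffling seed and the function name are assumptions; the sketch yields index lists, and averaging the per-fold metric values is left to the caller.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists so that each part is the test set once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal parts
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(20, k=5, seed=0))
```

Each sample index appears in exactly one test set, so the averaged metric uses every sample for testing exactly once.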
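For binary feature vectors such as the substructure fingerprints and symptom vectors, the Tanimoto similarity is the number of features present in both vectors divided by the number present in at least one. A minimal sketch follows; the function name and the convention of returning 0.0 for two all-zero vectors are choices made for this illustration.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two equal-length binary vectors."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


# Toy vectors (hypothetical): 2 shared features out of 4 present overall.
sim = tanimoto([1, 1, 0, 1, 0], [1, 0, 0, 1, 1])
```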
Case study
To test the practicality of PUDrDi, we examined the prediction results for the unlabeled samples in the data set. In the cross validation experiments, the unlabeled samples in each test set were labeled ‘-1’; a sample whose predicted value is greater than or equal to 0 is predicted to be positive. Whether such samples are actually positive needs to be verified further. One way is to check whether these drug-disease associations have been collected in the CTD database [4], which provides information about associations among chemicals, genes and diseases. Another way is to check whether these drug-disease treatment relationships have been reported in published papers.
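The selection of candidate associations can be sketched as follows: keep the unlabeled pairs whose predicted value is at least 0, then rank them by that value. Ranking by decreasing predicted value to obtain a top-k list is an assumption of this sketch, and the pair names and scores are hypothetical.

```python
def predicted_positives(pairs, scores, top_k=20):
    """Unlabeled pairs with predicted value >= 0, highest value first."""
    hits = [(s, p) for p, s in zip(pairs, scores) if s >= 0]
    hits.sort(key=lambda t: t[0], reverse=True)
    return [p for _, p in hits[:top_k]]


# Hypothetical unlabeled pairs and decision values.
pairs = [("drugA", "disease1"), ("drugB", "disease2"),
         ("drugC", "disease3"), ("drugD", "disease4")]
scores = [0.7, -0.2, 1.3, 0.0]

ranked = predicted_positives(pairs, scores, top_k=2)
```

The resulting list is what would then be checked against a curated database or the literature.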

Fig. 4. The ROC curves of algorithms.

Fig. 5. The Precision-Recall curves of algorithms.
Table 3. The predicted drug-disease treatment associations (Top 20)
The top 20 newly predicted drug-disease associations are shown in Table 3, where the “CTD Mark” item denotes the mark in the CTD database (“Therapeutic” or “Marker/Mechanism” means a curated association, and “Inferred” means an association inferred from curated gene interactions) and the “Literature Support” item gives the corresponding references if there is literature support.
As shown in Table 3, about half of the newly predicted associations are curated or inferred associations in the CTD database, and 90% of the treatment relations have been reported in published papers. These predicted drug-disease associations are unlabeled samples in the original data set, yet PUDrDi identifies them, which further demonstrates the practicality and effectiveness of PUDrDi.
Uncovering potential drug-disease associations is an important step in drug development, but doing so by “wet” experimental methods is time-consuming and costly. With the accumulation of large amounts of drug and disease data and the rapid development of machine learning methods, many well-designed computational methods have been proposed to predict drug-disease associations. However, these methods have some limitations. Lacking validated negative samples, most of these machine learning based methods randomly select unlabeled samples as negative ones, although the unlabeled samples in fact contain many positive samples that have not been verified yet. In other bioinformatics fields, some PU learning based methods have been proposed, but most of them need to extract reliable negative samples from the unlabeled samples, and no one can guarantee that these samples are reliable enough. In addition, prescribing medicine according to indications and symptoms is always emphasized in clinical treatment, yet existing methods seldom use the important symptom data for prediction. Intuitive symptom data, together with drug chemical structure and efficacy data, will help to screen drug-disease associations.
In this work, we propose to apply the idea of PU learning to predict drug-disease associations. A Biased-SVM classifier is trained using positive and unlabeled samples after feature selection, and a new performance metric
Acknowledgement
This work was supported by the National Science Foundation of China [61272274, 60970063]; the program for New Century Excellent Talents in Universities [NCET-10-0644]; the National Science Foundation of Jiangsu Province [BK20161249].
