Abstract
There is an urgent need to develop novel antimalarials in view of the increasing disease burden and the growing resistance of malarial parasites to the drugs currently in use. Proliferation inhibitors targeting the P. falciparum intraerythrocytic cycle are one of the important classes of compounds being explored for their potential as novel antimalarials. The Support Vector Machine (SVM) based model developed by us can facilitate rapid screening of large and diverse chemical libraries by reducing false hits and prioritising compounds before an expensive High Throughput Screening experiment is set up. The SVM model, trained with molecular descriptors of proliferation inhibitors and non-inhibitors, displayed satisfactory performance in cross validation and on an independent data set, with an average accuracy of 83% and an AUC of 0.88. The method also correctly predicted 83% to 90% of the experimentally verified inhibitors in three recently submitted P. falciparum whole cell screening datasets. In addition, it predicted several inhibitors in the National Cancer Institute diversity set, most of them similar to known inhibitors.
Introduction
Malaria is a devastating disease causing millions of deaths annually, apart from thousands of man hours lost to morbidity.1 The majority of deaths due to malaria are caused by P. falciparum, the most virulent of the species that cause the disease. The mounting resistance to, and failure of, existing first-line antimalarial drugs has exacerbated the situation, leading to an urgent need to develop novel antimalarials.
Amongst various experimental methods, cell based assays to identify growth inhibitors of P. falciparum have been one of the promising approaches for novel antimalarial drug discovery. The technique has been successful in identifying several novel chemical scaffolds with antimalarial activity.2,3 Cell based bioassays make use of living organisms, enabling the simultaneous testing of all drug targets for their viability in the presence of the test compounds. Identification of P. falciparum intraerythrocytic cycle proliferation inhibitors has gained much attention, as many of these compounds have successfully inhibited parasite growth at very low concentrations.4,5 Development of in silico models for prediction of proliferation inhibitors against P. falciparum will aid experiments aimed at identifying novel antimalarial leads. The study reported here focuses on the development of an SVM based classification method for P. falciparum proliferation inhibitors and non-inhibitors.
The striking growth and complexity of High Throughput Screening (HTS) data has increased the importance of data mining techniques that aid efficient data analysis and decision-making at crucial phases of drug discovery.6 Such techniques are often helpful in discovering meaningful patterns and rules in the screened data. These patterns form the basis for building models that are applied to prioritize compounds for the subsequent phases. Data mining methods can assist the identification of false leads at an early stage and also facilitate understanding of Structure-Activity Relationships.7 Supervised and unsupervised methods are increasingly being applied to build predictive models of compound bioactivity.8 In various studies, classification of compounds has been carried out using machine learning methods such as Decision Tree (DT), k-Nearest Neighbours (kNN), Artificial Neural Networks (ANN) and PLS Discriminant Analysis (PLS-DA), all of which have shown statistically significant performance.9 It is encouraging to note that the existing mathematical methods in the Quantitative Structure-Activity Relationship (QSAR) field are constantly being upgraded and that novel mathematical algorithms are continuously evolving. At the same time, the increasing availability of published compound assays in the PubChem database has stimulated greater interest in applying these robust methods, leading to the development of highly accurate predictive models.10–12
In recent years, SVM based classification has gained wide usage in Ligand Based Virtual Screening (LBVS), mainly due to its efficient generalization capabilities and empirical performance.13–15 SVM based ligand screening has been illustrated as an ideal tool for rapid screening of large compound libraries with enhanced hit rate and better coverage.16,17 Unlike most LBVS methods, which work on similarity based principles, SVM based classification has been shown to yield structurally diverse hits.18 In a comparative study conducted by Plewczynski et al,19 SVM with a linear kernel was found to be the best performing algorithm compared to the other methods, namely kNN, ANN, DT, Random Forest (RF) and Naïve Bayesian Classification (NBC). In summary, SVM methods have performed better than the methods discussed above, and hence we have used SVM in our studies. In previous studies, linear methods have been applied for classification of antimalarial compounds; however, there are fewer reports on the use of nonlinear methods.20,21 In this study, we have developed linear as well as nonlinear SVM models to classify compounds for anti-proliferative activity against P. falciparum.
Materials and Methods
Generation of training and independent testing set
All the molecular structures for generating SVM models were retrieved from the PubChem bioassay data corresponding to the bioassay ID “AID-1815”.4,5 The Bioassay reports 441 active compounds with the potency ranging from 0.06 μM to 14.12 μM. The assay was based on qHTS for differential inhibitors of proliferation of P. falciparum line 7G8, derived from a malarial isolate from Brazil. Further assay details are available at the PubChem bioassay database website (http://pubchem.ncbi.nlm.nih.gov/).
In preparing the training and independent test datasets, we considered all the compounds labelled as active or inactive for inclusion in the positive and negative datasets, respectively. In the bioassay, 441 compounds are reported as active (potency ranging from 0.06 μM to 14.12 μM) and 558 compounds as inactive, ie, a total of 999 compounds. Removal of redundant compounds from this set left 426 active and 533 inactive compounds. All 959 non-redundant compounds were standardized and hydrogen atoms were added using JChem 5.2.22 The data set of 959 compounds was divided such that approximately two-thirds was reserved for training and the remaining one-third was retained for independent testing, ie, excluded from the training step. Thus the training data comprised 640 compounds, while the test set comprised 319 compounds (Table 1). The SVM models described hereafter were developed on the training set of 290 active and 350 inactive compounds. Some of the potent proliferation inhibitors in the training data are shown in Table 2. The training and testing datasets are available online as supplementary material (Supplementary files 1 and 2).
Table 1. Compound dataset used in this study.
Table 2. Selected proliferation inhibitors in the training dataset.
Descriptor calculation and selection
Molecular descriptors are numeric representations of physico-chemical features extracted from various structural representations of a molecule.23 Such a quantitative representation is the result of a logical and mathematical procedure that transforms the chemical information encoded in a symbolic representation of a molecule into a useful number. In this work, a number of 0D (constitutional descriptors), 1D (functional group counts), 2D (topological, walk and path counts, connectivity indices, information indices, 2D autocorrelations, edge adjacency matrices, Burden eigenvalues, topological charge indices, eigenvalue-based indices) and 3D (Randic molecular profiles, geometric, RDF, 3D-MoRSE, WHIM, GETAWAY) descriptors were calculated using DRAGON software.24 Details of individual descriptors can be found in the reference manual of the DRAGON software. The list of descriptors used in the study for developing the different SVM models is presented in Table 3. Calculations for 0D, 1D and 2D descriptors were based on 2D structures of the compounds, whereas 3D descriptor calculations were based on single low energy conformers generated with JChem.
We generated three models based on different sets of descriptors: the first based on 0D, 1D and 2D descriptors, the second based exclusively on 2D descriptors and the third based exclusively on 3D descriptors. The total number of descriptors initially calculated was above 300 in each case. In order to reduce redundancy and noise in the training data, we reduced the number of descriptors in each case. For instance, for the model based on 0D, 1D and 2D descriptors, the total number of calculated descriptors was 383. We reduced this number to 184 by the following approach. Firstly, constant and near-constant descriptors were eliminated. Secondly, redundant descriptors were removed by the pair correlation method: the pairwise correlations for all descriptors were examined, and one of any two descriptors with a correlation coefficient r of 0.9 or higher was excluded. Finally, three different models were developed using 184 (0D, 1D, 2D), 112 (2D) and 195 (3D) descriptors, respectively. As SVM training and testing files require normalized input, we normalized our training data to the range -1 to +1.
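The descriptor reduction and scaling steps described above can be sketched in plain Python as follows. This is a minimal illustration, not the pipeline actually used in the study: the function names, the variance threshold used to decide what counts as "near-constant", the use of the absolute value of r, and the toy descriptor matrix are all assumptions made for the example. Only the 0.9 correlation cut-off and the -1 to +1 target range are taken from the text.

```python
import math


def column(matrix, j):
    """Return column j of a row-major matrix (one row per compound)."""
    return [row[j] for row in matrix]


def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def pearson_r(x, y):
    """Pearson correlation coefficient of two non-constant columns."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def select_descriptors(matrix, min_variance=1e-8, r_cutoff=0.9):
    """Return the indices of descriptor columns that survive both filters."""
    n_cols = len(matrix[0])
    # Step 1: drop constant and near-constant descriptors
    # (min_variance is an assumed threshold, not a value from the paper).
    varying = [j for j in range(n_cols)
               if variance(column(matrix, j)) > min_variance]
    # Step 2: keep a descriptor only if |r| stays below the cut-off
    # against every descriptor that has already been kept.
    selected = []
    for j in varying:
        col_j = column(matrix, j)
        if all(abs(pearson_r(col_j, column(matrix, k))) < r_cutoff
               for k in selected):
            selected.append(j)
    return selected


def scale_to_unit_range(values):
    """Linearly map a descriptor column onto the range [-1, +1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [2 * (v - lo) / (hi - lo) - 1 for v in values]


# Toy matrix: 4 compounds x 4 descriptors (illustrative values only).
# Column 1 is almost a multiple of column 0; column 2 is constant.
toy = [
    [1.0, 2.0, 5.0, 0.3],
    [2.0, 4.1, 5.0, 0.9],
    [3.0, 6.0, 5.0, 0.1],
    [4.0, 8.2, 5.0, 0.7],
]
kept = select_descriptors(toy)
scaled = scale_to_unit_range(column(toy, kept[0]))
```

On the toy matrix, the constant column and the nearly collinear column are discarded and the remaining two are kept. In practice the minimum and maximum found on the training set would normally be reused when scaling test or screening compounds, so that all sets share one scale.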
Table 3. Molecular descriptors used in the development of SVM models.
SVM algorithm
The SVM method was developed by Vapnik.25 SVM algorithms project input data into a high-dimensional feature space using kernel functions, so that an optimal plane (the maximal-margin hyperplane) can be drawn that separates the positive and negative datasets. The hyperplane depends on the choice of kernel function and on representative training examples, called support vectors. An optimized SVM classification model is generated by iterations of learning and evaluation, based on an optimized choice of training support vectors, kernel function and parameters. In developing SVM models for this work, we have used LIBSVM, freely available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.26 The SVM model built in this study is based on the C-SVC (C-Support Vector Classification) implementation in LIBSVM. As the number of features used here is less than the number of instances, we primarily used a nonlinear kernel for building the SVM model. However, models based on a linear kernel were also developed to compare their performance. A coarse grid-based optimization of the penalty parameter C and the kernel parameter γ was performed to achieve the highest classification accuracy.
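A coarse grid search over C and γ of the kind described above might look like the following sketch. It is an illustration under stated assumptions rather than the procedure used in the study: the power-of-two exponent ranges are assumed (the paper does not report the grid), `grid_search` and `toy_cv_accuracy` are hypothetical names, and the scoring function is a stand-in so that the example runs without an SVM library. The `rbf_kernel` function shows the standard radial basis function kernel, exp(-γ‖x - z‖²), which is the nonlinear kernel that γ controls.

```python
import math
from itertools import product


def rbf_kernel(x, z, gamma):
    """RBF kernel value exp(-gamma * ||x - z||^2) for two vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def coarse_grid(c_exponents, gamma_exponents):
    """All (C, gamma) pairs on a power-of-two grid."""
    return [(2.0 ** c, 2.0 ** g)
            for c, g in product(c_exponents, gamma_exponents)]


def grid_search(grid, cv_accuracy):
    """Return the (C, gamma) pair with the highest cross-validation score.

    cv_accuracy is a caller-supplied function (hypothetical here) that
    trains and cross-validates one model for a given parameter pair.
    """
    return max(grid, key=lambda pair: cv_accuracy(*pair))


# Assumed exponent ranges; these are not taken from the paper.
grid = coarse_grid(range(-5, 16, 2), range(-15, 4, 2))


def toy_cv_accuracy(C, gamma):
    """Stand-in score that peaks at C = 2^3 and gamma = 2^-5."""
    return -abs(math.log2(C) - 3) - abs(math.log2(gamma) + 5)


best_C, best_gamma = grid_search(grid, toy_cv_accuracy)
```

In a real run, `toy_cv_accuracy` would be replaced by a function that performs five-fold cross validation of a C-SVC model for the given pair; a finer grid around the best coarse point is a common follow-up step.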
Model validation
The training data set of 640 compounds was subjected to five-fold cross validation to find the best parameters C and γ, ie, those giving the highest cross validation accuracy. In five-fold cross validation, the training data is split into five folds; one fold is used for testing and the remaining four for training. This is iterated five times, such that each fold is used once as test data. The optimum values of C and γ were then used to retrain the SVM model. The performance of the models was also assessed on the independent test dataset using standard statistical measures, namely sensitivity (the percentage of correctly predicted active compounds), specificity (the percentage of correctly predicted inactive compounds) and accuracy (the percentage of correctly predicted active and inactive compounds). In addition, balanced measures, namely the Matthews Correlation Coefficient (MCC), Balanced Accuracy (BAC) and the Area Under the ROC Curve (AUC), were also computed.27 MCC = 1 indicates a perfect prediction while MCC = 0 indicates a random prediction.
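The five-fold procedure can be sketched as below. This is a minimal illustration in plain Python, not the LIBSVM implementation: `k_fold_indices`, `cross_validation_accuracy` and the `train_and_score` callback are hypothetical names, the shuffling seed is arbitrary, and the folds are not stratified by class, which is an assumption of this sketch.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle the sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validation_accuracy(n_samples, train_and_score, k=5, seed=0):
    """Pooled accuracy over k rounds of train-on-(k-1), test-on-1.

    train_and_score(train_idx, test_idx) is a caller-supplied function
    (hypothetical here) that fits a model on train_idx and returns the
    number of correct predictions on test_idx.
    """
    folds = k_fold_indices(n_samples, k, seed)
    correct = 0
    for i, test_idx in enumerate(folds):
        # Every fold except the i-th contributes to the training indices.
        train_idx = [j for f, fold in enumerate(folds) if f != i
                     for j in fold]
        correct += train_and_score(train_idx, test_idx)
    return correct / n_samples


# With the 640 training compounds, each fold holds 128 compounds.
folds = k_fold_indices(640)


def always_right(train_idx, test_idx):
    """Dummy scorer that counts every test compound as correct."""
    return len(test_idx)


perfect = cross_validation_accuracy(640, always_right)
```

Each compound is used for testing exactly once, so the pooled accuracy is the total number of correct held-out predictions divided by the training set size.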
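These evaluation measures can be mathematically expressed as:
\[ \text{Sensitivity} = \frac{TP}{TP + FN} \times 100 \]
\[ \text{Specificity} = \frac{TN}{TN + FP} \times 100 \]
\[ \text{Accuracy} = \frac{TP + TN}{TP + FN + TN + FP} \times 100 \]
\[ \text{BAC} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right) \]
\[ \text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \]
where TP is the number of true positives, FN is the number of false negatives, TN is the number of true negatives and FP is the number of false positives.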
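As a worked check, the measures named above can be computed from the four confusion-matrix counts with the short sketch below, using their standard definitions. The function name `classification_metrics` is an assumption of this example. The counts are those reported for the best model on the independent test set (117 correctly classified inhibitors and 160 correctly classified non-inhibitors); FN and FP are derived here from the test-set sizes of 136 actives (426 - 290) and 183 inactives (533 - 350).

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy, BAC and MCC from raw counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    bac = (sensitivity + specificity) / 2
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "bac": bac,
        "mcc": mcc,
    }


# Independent test set, best model: FN = 136 - 117, FP = 183 - 160.
m = classification_metrics(tp=117, fn=19, tn=160, fp=23)
for name, value in m.items():
    print(f"{name}: {value:.2f}")
```

Rounded to two decimals, these counts give a sensitivity of 0.86, a specificity and accuracy of 0.87, and an MCC of 0.73, in agreement with the values quoted for this model.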
Principal Component Analysis (PCA) and Applicability Domain (AD)
Models based on machine learning methods normally perform well for compounds that share similar properties with those in the training set. It is therefore increasingly important to define the AD of a model and to check whether the model is valid for any new molecule. The AD is the boundary defined by the descriptor space of the training data. Any new chemical compound should lie within the boundary of the chemical space of the training set in order to qualify for reliable prediction.28 Several simple and complex approaches are used to define the AD, based on range, distance, geometry or density distribution. One of the simplest and most widely applied approaches is the range-based definition of the AD with a preliminary PCA rotation.29 In the present study, we have defined the AD of the model based on Principal Component (PC) ranges and evaluated its validity on the test set and the screening dataset. This method helps to confirm whether a new compound is inside or outside the AD. A PCA based definition of the AD reduces the high dimensionality of the data (due to the large number of descriptors) and facilitates simple exploration while retaining the variation in the data. This is achieved by identifying directions, or PCs, along which there is maximal variation in the data. Each PC is expressed as a linear combination of the original descriptors. It may be noted that PCs are orthogonal to each other, and the correlation between any two PCs is zero. PCA in this study was performed using the R package.30 The PCA was carried out for the training data of the best model, ie, the one that showed the highest classification accuracy. PCA of the independent test set and of the NCI diversity set II screening set was also performed in order to validate the applicability of the model.
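The range-based check on principal components can be sketched as follows. This is an illustrative outline only: it assumes that PC scores have already been computed by whatever PCA routine is in use, and that new compounds have been projected onto the PCs derived from the training set. The function names `pc_ranges` and `inside_domain` and the toy scores are assumptions of the example.

```python
def pc_ranges(train_scores):
    """Minimum and maximum of each principal component over the training set."""
    n_components = len(train_scores[0])
    return [
        (min(row[i] for row in train_scores),
         max(row[i] for row in train_scores))
        for i in range(n_components)
    ]


def inside_domain(scores, ranges):
    """True if every PC score lies within the corresponding training range."""
    return all(lo <= s <= hi for s, (lo, hi) in zip(scores, ranges))


# Toy scores on the first three PCs (illustrative values only).
train_scores = [
    (-2.0, 0.5, 1.0),
    (1.5, -1.0, 0.2),
    (0.3, 2.0, -0.8),
]
ranges = pc_ranges(train_scores)

new_compounds = {
    "inside": (0.0, 0.0, 0.0),
    "outside": (3.1, 0.0, 0.0),  # PC1 exceeds the training maximum of 1.5
}
flags = {name: inside_domain(s, ranges) for name, s in new_compounds.items()}
```

A compound flagged as outside the ranges would be treated as unreliable for prediction and set aside before screening.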
SVM model as virtual screening tool
We used the NCI diversity set of 1364 compounds retrieved from the NCI/DTP Open Chemical Repository (http://dtp.nci.nih.gov/branches/dscb/div2_explanation.html) for virtual screening.31 The compounds were processed in the same way as the training set and descriptors were calculated. Of these, 1328 compounds were suitable for descriptor calculation, and their activity was predicted using the best SVM model.
Results and Discussion
In the present study, we have developed an SVM based model for prediction of proliferation inhibitors of P. falciparum in erythrocytes, based on the bioassay results from PubChem bioassay ID “AID 1815”. SVM models were generated using three different sets of descriptors: the first model was based on descriptors belonging to the 0D, 1D and 2D categories, the second on 2D descriptors and the third on 3D descriptors (Table 3). Nonlinear models based on the Radial Basis Function (RBF) kernel and linear models were developed for each category of descriptors. Five-fold cross validation was used to select the best kernel parameters and to evaluate the self-consistency of the data set in each case. The performance of the models was assessed using a test dataset of 319 compounds not used in the training process.
Model validation
The overall cross validation accuracy of the models is in the range of 80% to 83%, which suggests that the data are self-consistent and supports the reliability of the models (Table 4). The Area Under the ROC Curve (AUC) values for all the models (~0.88) indicate that the models perform considerably better than random classification.
Table 4. SVM model parameters and evaluation of classification performance.
Performance of model based on 0D, 1D and 2D descriptors
The best kernel parameters determined by five-fold cross validation and the corresponding results obtained with each model are given in Table 4. The model based on a set of 184 descriptors belonging to the 0D, 1D and 2D categories yielded the highest accuracy in cross validation as well as on the independent test set. The model performed consistently well, with an accuracy of 83% and an AUC of 0.88 in five-fold cross validation. On the independent test set, the model correctly classified 117 of the 136 inhibitors (86%) and 160 of the 183 non-inhibitors (87%), with an overall accuracy of 87% and an MCC of 0.73. Although the number of inactive compounds in the training set is slightly higher than the number of active compounds, almost equal sensitivity and specificity were obtained. This indicates a balanced performance of the model, with a good recognition rate and a low false prediction rate. The better accuracy of the model can be attributed to the appropriate choice of 0D, 1D and 2D descriptors, which were capable of discriminating proliferation inhibitors from non-inhibitors. The model based on these descriptors showed consistent and optimum performance when compared with the other models.
Performance of models based on 2D and 3D descriptors
The nonlinear model based on 2D descriptors showed the second best performance, with a five-fold cross validation accuracy of 82%, an overall testing accuracy of 85%, a sensitivity of 84% and an MCC of 0.69. Some of the 2D descriptors applied in this model overlap with those used in the first model. This points to a specific contribution of 2D descriptors to discriminating active from inactive compounds, with better sensitivity and specificity than the models based on 3D descriptors. The linear model based on 2D descriptors had the lowest cross validation accuracy, although its testing accuracy (81%) was comparable to that of the other models. As shown in Table 4, the model based on 3D descriptors ranks last in testing accuracy, specificity and sensitivity. The limited overall structural diversity of the compounds may be a limiting factor for the performance of shape based 3D descriptors. In general, all the models showed a balance between specificity and sensitivity, as demonstrated by the BAC. The overall BAC of the SVM models ranges from 0.80 to 0.87. In all cases, however, RBF based models outperformed the corresponding linear models in classification accuracy.
Applicability Domain
PCA was applied here to define the AD of the best model and also to map the active and inactive compounds in their respective chemical spaces. The PCs are linear combinations of the original 184 descriptors used in this study. The AD is calculated on the basis of the PC ranges: the minimum and maximum values of the principal components are set by considering all the compounds in the training data set. Figure 1 shows the first three principal components of the compounds in the training set, which have been used to define the AD of the model. The compounds in the independent testing set were also found to be within the AD (Fig. 2). In general, the PCA results reveal that the active and inactive compounds occupy different clusters in the chemical space, although there is no clear boundary between the two classes. The training data shows limited structural diversity, which constrains the sensitivity and specificity of the model. These could presumably be improved by increasing the number of diverse structures in the training set.

Figure 1. Visualization of chemical space in the training dataset. Proliferation inhibitors (blue diamonds) and non-inhibitors (red diamonds) are represented using the first three Principal Components. The figure depicts the range of Principal Components of the compounds in the training set that define the applicability domain (AD).

Figure 2. Visualization of chemical space in the testing dataset. The figure illustrates the compounds in the independent testing dataset lying within the applicability domain of the classifier.
Additional validation using ChEMBL-NTD datasets
While we were developing the models, three novel datasets of proliferation inhibitors of P. falciparum were submitted to the ChEMBL-Neglected Tropical Disease database (www.ebi.ac.uk/chemblntd) from three sources, namely the GSK TCAMS Dataset, the Novartis-GNF Malaria Box Dataset and the St. Jude Children's Research Hospital Dataset.32–35 We used these datasets to perform additional testing of the model developed by us. Prior to screening, we checked the datasets for compounds that overlap with the training dataset; the two overlapping entries found in the Novartis-GNF Malaria Box Dataset were removed. The performance of the SVM model on these datasets is shown in Table 5. The SVM model correctly predicted 89% (12082/13519), 83% (4750/5692) and 90% (1384/1535) of the experimentally verified inhibitors in the GSK, Novartis and St. Jude's datasets, respectively. These results suggest that the SVM model is also effective in identifying potential hits in virtual screening of large libraries with a reasonable AD.
Table 5. Performance of the SVM model in validating ChEMBL-NTD datasets.
Virtual screening of inhibitors
We utilised the best SVM model, based on the 0D, 1D and 2D descriptor categories, to identify further novel inhibitors in the NCI diversity collection of 1364 compounds. Only 1328 compounds passed through the descriptor calculation. First, we tested whether all 1328 compounds were within the AD of the model, using the ranges of the first three principal components of the 184 descriptors (as described in the methods). Seventy-one compounds fell outside the ranges observed for compounds in the training set, and we therefore considered them unreliable for prediction. These outliers were discarded and the remaining 1257 compounds were predicted using the best SVM model. The model predicted 580 compounds as positive and the remaining 677 as negative. The NCI diversity set contains four known proliferation inhibitors, all of which were correctly classified. The predicted compounds were prioritised according to the probability score of LIBSVM. Further, we compared the predicted active compounds with those in the training dataset to check their similarity in terms of the Tanimoto coefficient. The Tanimoto coefficient for the 580 predicted positive NCI diversity set compounds against the 290 positive training compounds ranged from 0.24 to 0.98. Some of the predicted active compounds and their corresponding maximum Tanimoto scores against the compounds in the training data are shown in Table 6.
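The similarity comparison can be illustrated with the sketch below, which computes the maximum Tanimoto coefficient of a query compound against a set of reference compounds. It is a generic illustration rather than the procedure used in the study: the text does not state which fingerprints were used, so the representation of each fingerprint as a set of "on" bit positions, the function names and the toy fingerprints are all assumptions of the example.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits.

    Defined as |A and B| / |A or B|; returns 0.0 if both sets are empty.
    """
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def max_similarity(query_fp, reference_fps):
    """Highest Tanimoto coefficient of the query against any reference."""
    return max(tanimoto(query_fp, ref) for ref in reference_fps)


# Toy fingerprints (illustrative bit positions only).
training_actives = [
    {1, 2, 3, 5},
    {7, 8},
]
predicted_hit = {1, 2, 3, 4}

best = max_similarity(predicted_hit, training_actives)
```

Here the query shares three of five distinct bits with the first reference and none with the second, so its maximum Tanimoto score is 0.6.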
Table 6. Selected virtual hits from the NCI diversity set collection.
Conclusion
The SVM model based on 184 0D, 1D and 2D descriptors of the inhibitors exhibited the highest accuracy with a lower false-hit rate. The selected molecular descriptors have sufficiently captured the features required to discriminate P. falciparum intraerythrocytic cycle proliferation inhibitors from non-inhibitors. The good performance of the optimized model on three additional validation (ChEMBL-NTD) datasets indicates that it can also be effective in selecting potential hits when screening large libraries. Several new compounds predicted as inhibitors in the NCI diversity set have shown good similarity to known proliferation inhibitors. The SVM model developed in this study is fast and precise enough to be applied to large scale screening for proliferation inhibitors of P. falciparum. Large repositories of chemical compounds, for example PubChem and ChEMBL-NTD, can be a rich source for generating such predictive models. Efforts to publish more bioassays for neglected diseases like malaria would benefit from such data mining techniques to support decision-making in drug discovery. However, the generalization capability of a model largely depends on the quality and diversity of the data set. Such models can be well suited to quick screening of potentially bioactive molecules from large chemical libraries and can facilitate lead identification.
Disclosures
This manuscript has been read and approved by all authors. This paper is unique and is not under consideration by any other publication and has not been published elsewhere. The authors and peer reviewers of this paper report no conflicts of interest. The authors confirm that they have permission to reproduce any copyrighted material.
Acknowledgements
The Department of Biotechnology (DBT, India) grant for the “Bioinformatics Infrastructure Facility” at ICGEB and the Indian Council of Medical Research (ICMR) fellowship to Sangeetha Subramaniam are duly acknowledged.
