Drug-target interaction data cluster analysis based on improving the density peaks clustering algorithm

Abstract

Since drug-target data have neither class labels nor cluster number information, they are not suitable for clustering algorithms whose predefined parameters must be determined by comparing clustering results with real class labels. Density peaks clustering (DPC) is a density-based clustering algorithm that can determine the number of clusters without requiring class labels. However, the predefined cutoff distance used in its local density limits its wide application. Therefore, this paper proposes an improved local density method based on a cutoff distance sequence that overcomes this limitation of DPC and can be successfully applied to drug-target data. We also introduce multiple-dimensional scaling based on drug and target similarity and perform intuitive graph analysis of the two most significant differentiation features. The improved algorithm identifies 6, 6, 3, and 5 clusters among the drugs of the four standard datasets Enzyme, GPCR, Ion Channel, and Nuclear Receptor, respectively, and 5, 5, 8, and 4 clusters among their targets. The drug-target clustering results of the improved algorithm are more reasonable than those of the fast K-medoids and hierarchical clustering algorithms.

Keywords

Drug-target interaction data cluster analysis density-based clustering cutoff distance sequence

1. Introduction

Bringing a new drug to the market is a highly challenging and complex process in terms of time and cost. The experimental determination of drug-target interaction (DTI) is a time-consuming, labor-intensive process with a relatively poor success rate and is limited to small-scale research [1]. Drug repositioning [2] and network pharmacology [3] have provided a theoretical basis for using computational methods to predict new drugs based on drug-target interaction data. The hypothesis is that similar drugs interact with the same targets, and similar targets interact with the same drugs [4]. Drug repositioning, the process of finding new uses for existing drugs outside the scope of their original medical indications [5], is considered a promising strategy because it exploits existing knowledge about drugs and thus provides a more rapid route to the clinic than traditional drug discovery approaches. The availability of public biomedical databases, along with the development of computational methods, has made it possible to provide useful frameworks that partially overcome the limitations of traditional experimental approaches [6] and help in finding new associations in existing drug-target interaction data. Therefore, drug-target interaction prediction via mathematical models and computational algorithms is a low-cost and efficient way to aid drug development and drug design in the pharmaceutical industry.

Cluster analysis of drug-target interaction data can, to some extent, improve the prediction performance of computational methods [7, 8, 9, 10]. Generally, clustering algorithms are categorized into hierarchical, partition-based, density-based, and grid-based algorithms [11]. Partition-based algorithms, such as K-means and K-medoids [12], need the number of clusters to be prespecified. The best-known density-based algorithm is density-based spatial clustering of applications with noise (DBSCAN) [13], which requires a predefined $\varepsilon$-neighborhood radius and a minimum number of points MinPts that must lie within the $\varepsilon$-neighborhood of each core point of a cluster. Grid-based algorithms, for instance, STING [14], are widely used in spatial data mining. The drug-target interaction data include the drug-target association relationship, drug-drug similarity (e.g., chemical structure similarity between compounds using SIMCOMP [15]) and target-target similarity (e.g., sequence similarity between proteins using a normalized version of the Smith-Waterman score [16]). They are not spatial data, nor can they provide information about the number of clusters. Therefore, researchers prefer hierarchical clustering. Liu and Johnson [17] provided a summary of popular compound clustering methods and pointed out that hierarchical clustering is used more widely. In hierarchical clustering, the number of clusters must also be prespecified directly or determined via a cutoff value, after which the hierarchical clustering tree is cut to form the clustering result. For example, Hansch and Unger [18] clustered 90 compounds successively into 5, 10, 20 and 60 clusters. Xu et al. [19] clustered target data and cut the hierarchical clustering tree with the default cluster number 6. Shi et al. [7] clustered target data and obtained clusters by applying a cutoff value of 1.1 to the hierarchical clustering tree. Therefore, in the clustering of drug-target interaction data, whether using hierarchical clustering or other clustering algorithms, the most important problem is to determine the number of clusters and provide more accurate clustering results.

In 2014, Rodriguez and Laio [20] proposed a density-based clustering algorithm called clustering by fast search and find of density peaks (DPC), which depends only on the distances between points. DPC can determine the number of clusters independently using a decision graph, identify clusters of arbitrary shape, and deal well with noise points. It is based on the assumptions that cluster center points are surrounded by neighbors with lower local density and that they are relatively distant from any points with higher local density. DPC has been widely applied in many fields, including biological data [21], social circle clustering [22], and image clustering [23]. It seems that DPC can directly solve the cluster number problem and cluster drug-target interaction data. However, the local density and distance of each point in DPC are defined with the prespecified cutoff distance $d_{c}$, and there is no rigorous way to determine it aside from a rule of thumb recommended by Rodriguez and Laio, which discourages researchers from using DPC to cluster drug-target interaction data.

There are two feasible methods to determine the optimal parameter. One depends heavily on user experience and prior knowledge of the data, and the other relies on clustering evaluation indexes computed over repeated experiments. However, the majority of evaluation indexes require real class labels, which greatly limits the application of these algorithms.

For drug-target interaction data, there are no class labels or prior class information, so DPC and its improved variants still cannot be used directly. Therefore, this paper proposes an improved DPC clustering algorithm that can be successfully applied to drug-target interaction data clustering. We use a cutoff distance sequence to improve DPC and propose the clustering by fast search and find of density peaks based on sequence (DPCS) algorithm to cluster drug-target interaction data. The main contributions of this paper are summarized as follows.

1. An improved local density calculation method is proposed based on a cutoff distance sequence that avoids predefined parameter limitations.
2. Drug-target interaction data have neither class labels nor cluster number information, and hence, clustering algorithms requiring predefined parameters are inapplicable. However, DPCS was successfully applied to drug-target data, and the number of clusters in each of the Enzyme, Ion Channel, GPCR and Nuclear Receptor datasets was determined.
3. DPCS yields satisfactory clustering outcomes that tend to be more reasonable than those of fast k-medoids (FastK) [24] and hierarchical clustering (HC) [25].
4. Since the original drug feature is the compound structure and the target feature is a specific amino acid sequence, the data are not convenient for intuitive graphical interpretation. We introduce multiple-dimensional scaling (MDS) based on drug and target similarities, and we provide an intuitive graph analysis of the two most significant differentiation features.

The remainder of this paper is organized as follows. Section 2 describes the related works. Section 3 briefly introduces DPC. Section 4 describes the proposed DPCS algorithm in detail. Section 5 verifies DPCS’s performance. Section 6 analyzes drug-target interaction data with the DPCS algorithm. Finally, Section 7 concludes the paper.

2. Related works

Many studies have attempted to overcome DPC’s limitations and have proposed improved algorithms.

The first aspect is to improve the local density, the assignment strategy, or both. Du et al. [26] used fuzzy neighborhood relationships to define the local density and developed the FNDP algorithm. Jia et al. [27] proposed an enhanced fast search and find of density peaks clustering (E-FDPC) algorithm to select hyperspectral bands, embedding learning rules based on an exponential function to adjust the cutoff distance in the local density computation. Tao et al. [28] proposed F-DPC, which uses data field theory to adaptively select $d_{c}$ for computing the local density. Zhou et al. [29] verified the sensitivity of clustering to the cutoff distance experimentally and recommended selecting the distance to the $k\text{th}$ nearest neighbor as the cutoff distance in the automatic peak detection (APD) algorithm. Geng et al. [30] proposed RECOME, which uses a density measure based on the relative k-nearest neighbors (KNN) kernel density with a parameter. Xie et al. [31] proposed fuzzy weighted KNN density peak clustering (FKNNDPC), which uses KNN and fuzzy weighted KNN to improve DPC: the improved local density is calculated from the sum of the distances to the KNN, and the assignment strategy is optimized by fuzzy weighted KNN to reduce error propagation. Liu et al. [32] proposed the shared-nearest-neighbor-based clustering by fast search and find of density peaks (SNNDPC) algorithm, which gives new definitions of SNN similarity, local density, and distance from the nearest point of larger density. Liu et al. [33] proposed adaptive density peak clustering (ADPC-KNN), which uses the KNN to compute the local density. Seyedi et al. [34] proposed the DPC-DLP algorithm, which employs the idea of KNN to compute the global cutoff parameter and the local density of each point and uses graph-based label propagation to assign labels to the remaining points. However, APD, FKNNDPC, SNNDPC and DPC-DLP only transfer the predefined parameter from $d_{c}$ to $k$, the number of nearest neighbors. Du et al. [35] used KNN and principal component analysis (PCA) to optimize DPC and proposed DPC-KNN-PCA. Its local density is an exponential function of the negative mean distance to the KNN, and $k$ is calculated as a percentage $p$ of the total number of points, i.e., $k=\lfloor p\times n\rfloor$. Therefore, DPC-KNN-PCA also transfers the predefined parameter from $d_{c}$ to $p$. Thus, APD, FKNNDPC, SNNDPC, DPC-DLP and DPC-KNN-PCA do not overcome the predefined parameter shortcoming. Zhou et al. [36] embedded kernel density estimation (KDE) to optimize the local density and proposed density peaks clustering by identifying the veins (IVDPC). However, KDE’s inherent defects seriously affect the clustering performance of IVDPC: KDE is acceptably accurate for one-dimensional (1D) or 2D data but becomes highly inaccurate for higher-dimensional or sparse data. Hence, IVDPC tends to achieve poor clustering performance on most real-world high-dimensional datasets. Another way to avoid user-specified parameters is to use fixed parameter values. An adaptive clustering algorithm based on KNN (fixed $k=$ 5) and density (ACND) was proposed by Shi et al. [37]. The fixed number addresses the susceptibility of nearest neighbors to interference from the accidental meeting of two or three outliers, i.e., the 2NN, 3NN, or 4NN distances of such outliers may be close to those of normal data points, making them hard to distinguish. Strictly speaking, this cannot be called a parameter-free algorithm, but users need not predefine parameters in practical applications, so it is approximately equivalent to one.

The second aspect is to improve the selection of cluster centers, the application, the distance measure, the clustering efficiency and so on. Jiang et al. [38] developed an enhanced DPC algorithm, called GDPC, with an alternative decision graph based on gravitation theory and nearby distance to identify centers and anomalies accurately. Moreover, to overcome weaknesses such as sensitivity to varying densities and irregular shapes, they proposed the DPC-LG algorithm [39], which improves GDPC based on logistic distribution and gravitation. Xu et al. [40] proposed a density peaks clustering algorithm based on grid (DPCG), which improves efficiency by using the CLIQUE clustering algorithm to calculate the local density. Xu et al. [41] proposed two prescreening strategies, grid-division and circle-division, to quickly find cluster centers in large-scale datasets.

3. DPC introduction

Assume that $\bm{x}_{i}\in\mathrm{R}^{m},i=1,\ldots,n$ belong to dataset $X$ . Let $d(\bm{x}_{i},\bm{x}_{j})$ represent the Euclidean distance between $\bm{x}_{i}$ and $\bm{x}_{j}$ ,

$\displaystyle d\left(\bm{x}_{i},\bm{x}_{j}\right)=\sqrt{\sum_{l=1}^{m}\left(x_{il}-x_{jl}\right)^{2}}$ (1)

DPC has the following two important variables: local density $\rho_{i}$ and distance $\delta_{i}$ . There are two definitions for $\rho_{i}$ . One counts the number of points in its neighborhood

$\displaystyle\rho_{i}=\sum_{j=1}^{n}\chi\left(d\left(\bm{x}_{i},\bm{x}_{j}\right)-d_{c}\right)$ (2)

where

$\displaystyle\chi(x)=\begin{cases}1,&x<0\\ 0,&x\geqslant 0\end{cases}$

and the other uses the Gaussian kernel local density,

$\displaystyle\rho_{i}=\sum_{j=1}^{n}\exp\left(-\frac{d\left(\bm{x}_{i},\bm{x}_{j}\right)^{2}}{d_{c}^{2}}\right)$ (3)

The distance $\delta_{i}$ is the minimum distance from $\bm{x}_{i}$ to any point with higher local density; for a point with the highest local density, it is the maximum distance to any other point,

$\displaystyle\delta_{i}=\begin{cases}\min\limits_{j:\rho_{j}>\rho_{i}}\left(d\left(\bm{x}_{i},\bm{x}_{j}\right)\right),&\exists j\ \text{s.t.}\ \rho_{j}>\rho_{i}\\ \max\limits_{j}\left(d\left(\bm{x}_{i},\bm{x}_{j}\right)\right),&\text{otherwise}\end{cases}$ (4)

DPC suggests that center points are surrounded by neighbor points with lower local density, and they are a relatively large distance from any points with higher local density. Therefore, those points where $\rho_{i}$ is as large as possible and $\delta_{i}$ is relatively large are selected as center points. To determine the number of clusters and corresponding center points conveniently, DPC introduces decision graphs and requires the user to manually select center points.
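The two DPC quantities can be sketched in a few lines of plain Python. This is only an illustrative reading of Eqs (1)–(4), not the reference implementation; the function names, the toy points and the value `d_c=0.15` are assumptions made for the example.

```python
import math

def pairwise_distances(points):
    """Euclidean distance matrix of Eq. (1), as a list of lists."""
    return [[math.dist(p, q) for q in points] for p in points]

def dpc_density_cutoff(dist, d_c):
    """Eq. (2): count the points strictly closer than d_c.
    The sum runs over all j, so each point also counts itself."""
    return [sum(1 for d in row if d < d_c) for row in dist]

def dpc_density_gaussian(dist, d_c):
    """Eq. (3): Gaussian-kernel local density."""
    return [sum(math.exp(-(d / d_c) ** 2) for d in row) for row in dist]

def dpc_delta(dist, rho):
    """Eq. (4): distance to the nearest point of higher density,
    or the largest distance from the point if no point is denser."""
    delta = []
    for i, row in enumerate(dist):
        higher = [row[j] for j in range(len(row)) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(row))
    return delta

# Hypothetical toy data (not from the paper): two tight groups on a line.
points = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (5.0, 0.0), (5.1, 0.0)]
dist = pairwise_distances(points)
rho = dpc_density_cutoff(dist, d_c=0.15)
rho_gauss = dpc_density_gaussian(dist, d_c=0.15)
delta = dpc_delta(dist, rho)
```

Plotting `delta` against `rho` gives the decision graph; points that are large in both coordinates are the candidates a user would pick as centers.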

4. DPCS algorithm

4.1 Definitions, process and explanation

DPCS calculates the local density using a cutoff distance sequence to overcome requiring a prespecified parameter. To obtain the cutoff distance sequence, we first define the maximum cutoff distance $d_{\max}$ .

Definition 1. The maximum cutoff distance $d_{\max}$ for dataset $X$ with distance $d\left(\bm{x}_{i},\bm{x}_{j}\right),\ \bm{x}_{i},\bm{x}_{j}\in X$ is defined as

$\displaystyle d_{\max}=\min_{i}\left\{d_{i}\mid d_{i}=\max_{j}d\left(\bm{x}_{i},\bm{x}_{j}\right),\ i,j=1,\ldots,n\right\}$ (5)

Thus, a neighborhood of radius $d_{\max}$ centered at the corresponding point covers all points of the dataset, and $d_{\max}$ is the smallest radius for which the neighborhood of some point contains all points. The cutoff distance sequence can then be defined as follows.

Definition 2. All distances satisfying the following condition are called the cutoff distance sequence,

$\displaystyle Q=\left\{dc_{i}|dc_{i}=i\ast h,\ h=d_{\max}/n,\ i=1,\ldots,n\right\}$ (6)

Definition 3. The local density $\rho_{i}$ of point $\bm{x}_{i}\in X$ is defined as

$\displaystyle\rho_{i}=\sum_{l=1}^{n}\left(\sum_{j=1}^{n}\chi\left(d_{ij}-dc_{l}\right)/l\right)$ (7)

where $d_{ij}=d\left(\bm{x}_{i},\bm{x}_{j}\right)$ , and $dc_{l}\in Q$ . To eliminate dimensional deviations, $\sum_{j=1}^{n}\chi\left(d_{ij}-dc_{l}\right)$ is normalized.
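Definitions 1–3 translate almost directly into code. The sketch below is a minimal plain-Python reading of Eqs (5)–(7); the function names and the five collinear toy points are assumptions, and the normalization of the neighbor count mentioned above is left out because its exact form is not specified here.

```python
import math

def max_cutoff_distance(dist):
    """Eq. (5): the smallest, over all points, of the distance to the farthest point."""
    return min(max(row) for row in dist)

def cutoff_sequence(d_max, n):
    """Eq. (6): n evenly spaced cutoffs h, 2h, ..., n*h with h = d_max / n."""
    h = d_max / n
    return [(l + 1) * h for l in range(n)]

def dpcs_density(dist):
    """Eq. (7): for each cutoff dc_l, count the points strictly inside it
    and divide that count by the cutoff index l (normalization omitted)."""
    n = len(dist)
    q = cutoff_sequence(max_cutoff_distance(dist), n)
    rho = []
    for row in dist:
        total = 0.0
        for l, dc in enumerate(q, start=1):
            total += sum(1 for d in row if d < dc) / l
        rho.append(total)
    return rho

# Hypothetical toy data (not from the paper): five equally spaced points.
points = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,)]
dist = [[math.dist(p, q) for q in points] for p in points]
d_max = max_cutoff_distance(dist)
rho = dpcs_density(dist)
```

No cutoff distance is passed in anywhere: the only input is the distance matrix, which is the point of the construction.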

From Eq. (7), the proposed $Q$ ensures that the local density of at least one point (e.g., the point corresponding to the maximum cutoff distance) is affected by all other points. Dividing by $l$ ensures that farther distances have weaker effects on the local density. As the cutoff distance in $Q$ increases, differences in the number of points within the neighborhood at the same cutoff distance are also reflected in the local density. Similar to DPC with the Gaussian kernel local density, the local density of DPCS considers the influence of all data points, i.e., each point contributes to the local density. However, the contributions differ, with more distant points contributing less. In DPCS, each point in the dataset contributes to the local density rather than only the points within a fixed neighborhood, as in DPC, FKNNDPC, SNNDPC, and DPC-DLP. The improved local density calculation embeds these two important differences through the proposed cutoff distance sequence. A clearer explanation is given by the following example.

Table 1

Example 2D dataset

$x$	$y$	class	$x$	$y$	class
1.0	1.1	1	1.5	1.6	2
0.9	1.0	1	1.4	1.5	2
1.0	1.0	1	1.5	1.5	2
1.1	1.0	1	1.6	1.5	2
1.0	0.9	1	1.5	1.4	2

Table 1 shows an example 2D dataset containing 10 points, where $x$ and $y$ represent the features and class represents the cluster label. To evaluate $\delta_{i}$ and $\rho_{i}$ (Eqs (4) and (7), respectively), we first compute $Q$ from Eqs (5) and (6) and obtain 10 distance values, as shown in Table 2, where Num represents the serial number and $d c$ represents the cutoff distance.

Table 2

Cutoff distance sequence for the dataset from Table 1

Num	1	2	3	4	5
$d c$	0.0849	0.1698	0.2548	0.3397	0.4246
Num	6	7	8	9	10
$d c$	0.5095	0.5944	0.6793	0.7643	0.8492

Let us examine the number of points in the neighborhood of each point under every cutoff distance. Table 3 shows $\bm{x}_{1}$ as an example, where neighbors represents the number of points in the neighborhood of the corresponding cutoff distance.

Table 3

Cutoff neighbor numbers for object $\bm{x}_{1}$ from Table 1

Num	1	2	3	4	5	6	7	8	9	10
neighbors	1	1	1	2	4	5	5	5	8	10

Table 3 shows that different cutoff distances may yield the same neighbors value. This is because the number of points in the neighborhood of $\bm{x}_{1}$ does not necessarily increase as the cutoff distance increases. Equation (7) shows that although $\sum_{j=1}^{n}\chi(d_{ij}-dc_{l})$ is the same, increasing $l$ decreases its contribution to the local density. This reflects the difference between the number of neighborhood points and the local density and ensures that a larger $dc_{l}$ means a smaller contribution. We calculate point distances and local densities using Eqs (4) and (7) and produce the decision graph shown in Fig. 1a. The two center points (red and blue) are clearly separated from the other points. With the center points determined, the clustering result is shown in Fig. 1b. The blue ( $\textit{class}=$ 1) and red ( $\textit{class}=$ 2) clusters are successfully separated.
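The shrinking contributions can be tallied directly from the neighbor counts in Table 3. The snippet below is only a worked illustration of the inner term of Eq. (7); it omits the normalization step, so the resulting sum is an unnormalized value and the variable names are assumptions.

```python
# Neighbor counts of x_1 for cutoffs dc_1..dc_10, copied from Table 3.
neighbors = [1, 1, 1, 2, 4, 5, 5, 5, 8, 10]

# Each count is divided by its cutoff index l, as in Eq. (7).
contributions = [count / l for l, count in enumerate(neighbors, start=1)]

# Unnormalized local density of x_1: the sum of all contributions.
rho_x1 = sum(contributions)
```

For the three cutoffs that share the count 5 (Num 6, 7 and 8), the contributions are 5/6, 5/7 and 5/8, so the same neighbor count weighs less as the cutoff grows.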

Figure 1.

Two-dimensional dataset from Table 1.

In DPCS, on the one hand, each point in the dataset contributes to the local density rather than only the points in a fixed neighborhood as in DPC, so that differences between datasets are taken into account as much as possible. On the other hand, farther points contribute less to the local density, so that differences between points within the dataset are also taken into account.

4.2 DPCS algorithm flow

The proposed improved local density Eq. (7) has no predefined cutoff distance. Therefore, compared with DPC, DPCS does not need this input item.

Algorithm DPCS

Input: Dataset $X$
Output: Cluster labels

1. Calculate the distance matrix using Eq. (1).
2. Find the maximum cutoff distance $d_{\max}$ using Eq. (5).
3. Determine the cutoff distance sequence $Q$ using Eq. (6).
4. for $i=1$ to $n$ do
5.     for $l=1$ to $n$ do
6.         Calculate the contribution of cutoff distance $dc_{l}$ to the local density $\rho_{i}$ .
7.     end for
8.     Calculate local density $\rho_{i}$ of point $\bm{x}_{i}$ using Eq. (7).
9. end for
10. for $i=1$ to $n$ in descending order of local density do
11.     Calculate object distances $\delta_{i}$ using Eq. (4).
12. end for
13. Create the decision graph.
14. Select cluster number $K$ and center points manually.
15. for $i=1$ to $n-K$ in descending order of local density do
16.     Assign remaining point $\bm{x}_{i}$ to a cluster with higher local density and the smallest distance.
17. end for
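The final assignment loop is the only step not yet covered by the equations, and it could be sketched as follows. This is a hedged interpretation: it assumes that "a cluster with higher local density and the smallest distance" means the cluster of the nearest already-labelled point of higher density, as in DPC. The function name, the toy points, the density values and the chosen centers are all hypothetical.

```python
import math

def assign_clusters(dist, rho, centers):
    """Label every point given manually chosen center indices.

    Points are visited in descending order of local density; each
    non-center point inherits the label of the nearest point that was
    visited before it (and therefore ranks at least as dense).
    """
    n = len(dist)
    labels = [-1] * n
    for k, c in enumerate(centers):
        labels[c] = k
    order = sorted(range(n), key=lambda i: -rho[i])
    for rank, i in enumerate(order):
        if labels[i] != -1:
            continue  # centers keep their own label
        # Fall back to the centers if no denser point exists yet.
        candidates = order[:rank] or list(centers)
        nearest = min(candidates, key=lambda j: dist[i][j])
        labels[i] = labels[nearest]
    return labels

# Hypothetical toy data: two groups on a line with made-up densities.
points = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,), (5.2,)]
dist = [[math.dist(p, q) for q in points] for p in points]
rho = [2, 3, 2, 2, 3, 2]
labels = assign_clusters(dist, rho, centers=[1, 4])
```

Because labels propagate from dense points to sparse ones, a single pass over the sorted points is enough once the centers are fixed.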

4.3 Complexity analysis

Clustering efficiency can be estimated from the time complexity of the algorithm. In DPC, FKNNDPC, and DPCS, the distance matrix, local density $\rho_{i}$ , and object distance $\delta_{i}$ are calculated, and the cluster labels are then assigned. Suppose there are $n$ points in the dataset, and the number of nearest neighbors in KNN is $k$ .

All three algorithms calculate the distance matrix and object distance $\delta_{i}$ with complexity $O(n^{2})$ . In the local density step, the complexity of DPC is $O(n^{2})$ . In FKNNDPC, the complexity of calculating the local density for all points is $O(kn^{2})$ when searching for the KNN points. In DPCS, we compute the maximum cutoff distance $d_{\max}$ with complexity $O(n^{2})$ and then compute the cutoff distance sequence with complexity $O(n)$ . Finally, for each point, Eq. (7) sums over $n$ cutoff distances and $n$ points, which costs $O(n^{2})$ . Therefore, the total complexity of computing the local density of all points in DPCS is $O(n^{3})$ . In the label assignment step, DPC and DPCS use the same method with complexity $O(n)$ . FKNNDPC is rather complicated and involves three strategies. Strategies 1 and 3 are simple with complexity $O(n)$ ; however, strategy 2 requires updating the recognition matrix with fuzzy KNN, and its complexity is $O(n^{2})$ ; therefore, the label assignment complexity of FKNNDPC is $O(n^{2})$ . Hence, the overall complexity of DPC and FKNNDPC is $O(n^{2})$ (treating $k$ as a constant), and that of DPCS is $O(n^{3})$ .

5. Performance verification of DPCS

5.1 Datasets

We tested DPCS on 6 classic datasets in terms of clustering accuracy (ACC) [42] and adjusted mutual information (AMI) [43] and compared the outcomes with those of the state-of-the-art DPC and FKNNDPC algorithms. All three clustering algorithms were implemented in MATLAB, and we provide the decision graphs and clustering results. The six datasets were chosen from various sources, as detailed in Table 4.

Table 4

Experimental datasets

Name	Size	Attributes	Class
Aggregation	788	2	7
Flame	240	2	2
R15	600	2	15
D31	3100	2	31
Spiral	312	2	3
S1	5000	2	15

• Aggregation [44]: seven non-Gaussian distributed clusters with significant shape differences. This is a typical dataset for clustering aggregation.

• Flame [45]: classes of inconsistent size and shape with a semi-encircling layout.

• R15 and D31 [46]: R15 data are distributed on a ring generated by 15 similar Gaussian distributions, and D31 data are generated by 31 similar Gaussian distributions. The D31 annular distribution is somewhat weaker than that of R15.

• Spiral [47]: three spiral-shaped strips; a classical verification dataset for density-based clustering.

• S1 [48]: fifteen clusters of inconsistent shape with almost overlapping boundaries.

In addition, to verify the ability of the three algorithms to deal with noise, we add Gaussian noise ( $\mu=$ 0, $\sigma=$ 0.5) to the datasets in Table 4 for clustering analysis.

5.2 Experimental results and discussion

The experimental datasets were clustered using DPC, FKNNDPC and DPCS. Figures 2–7 show the DPCS decision graphs and corresponding clustering results. Finally, the evaluation indexes of the three algorithms are compared in Table 5, which also gives the corresponding predefined parameters (Par) of DPC and FKNNDPC.
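For readers unfamiliar with ACC, the sketch below shows one common way to compute it: cluster labels are arbitrary, so the predicted clusters are first matched one-to-one to the true classes in the most favorable way. This assumes the usual best-mapping form of ACC; the exact definition used here is the one in [42]. The brute-force search over permutations is an illustrative simplification that is only practical for a handful of clusters (an assignment solver such as the Hungarian algorithm is the usual choice), and the function name and labels are made up.

```python
from itertools import permutations

def clustering_accuracy(true_labels, pred_labels):
    """Fraction of points whose predicted cluster, after the most
    favorable one-to-one relabelling, equals the true class."""
    true_ids = sorted(set(true_labels))
    pred_ids = sorted(set(pred_labels))
    # overlap[(p, t)] = number of points in predicted cluster p with true class t.
    overlap = {}
    for t, p in zip(true_labels, pred_labels):
        overlap[(p, t)] = overlap.get((p, t), 0) + 1
    # Pad with None so surplus predicted clusters may stay unmatched.
    targets = true_ids + [None] * max(0, len(pred_ids) - len(true_ids))
    best = 0
    for perm in permutations(targets, len(pred_ids)):
        matched = sum(overlap.get((p, t), 0) for p, t in zip(pred_ids, perm))
        best = max(best, matched)
    return best / len(true_labels)

# Hypothetical labels: one point of class 1 is placed in the wrong cluster.
truth = [1, 1, 1, 2, 2, 2]
prediction = [0, 0, 1, 1, 1, 1]
acc = clustering_accuracy(truth, prediction)
```

Since the measure needs `true_labels`, it can be used on the benchmark datasets of Table 4 but not on unlabelled data such as the drug-target datasets.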

In the DPCS clustering of Aggregation, 7 points near the diagonal are selected as center points in the decision graph (Fig. 2a). The clustering result is given in Fig. 2b. Although the 7 clusters differ considerably, with 4 large clusters (3 ellipsoidal and 1 inner concave) and 3 small clusters, all 7 center points are identified, and the points are clustered successfully.

Figure 2.

Aggregation dataset DPCS outcomes.

The semi-encircling shape of Flame is more obvious than those of the others. The decision graph in Fig. 3a shows that the difference between the center points and the noncenter points identified by DPCS is very significant. The clustering result in Fig. 3b shows that the two clusters are separated completely.

Figure 3.

Flame dataset DPCS outcomes.

When the number of clusters increases, DPCS also obtains good clustering performance. The decision graphs of R15 and D31 in Figs 4a and 5a, respectively, show that the distinction between center points and noncenter points is very obvious. The clustering results in Figs 4b and 5b also show that no clusters are split, and the DPCS clustering results are highly accurate.

Figure 4.

R15 dataset DPCS outcomes.

Figure 5.

D31 dataset DPCS outcomes.

Spiral is a standard validation dataset for density-based clustering algorithms. Its spiral shape makes it a great challenge for non-density-based algorithms to separate the clusters completely. The 3 center points are easily selected from the DPCS decision graph in Fig. 6a. The three spiral lines are separated completely in the clustering result in Fig. 6b, and no point is assigned to the wrong cluster.

The points of S1 distribute into different shapes (circle, ellipse, rectangle, strip, etc.), and the boundary points of different clusters are almost overlapping. The decision graph Fig. 7a of DPCS shows that 15 center points are significantly separated from noncenter points. Therefore, the number of clusters and center points are easily selected. The clustering result in Fig. 7b shows that DPCS can identify the points in each cluster correctly.

Table 5

Clustering performance metrics on the datasets from Table 4

Dataset	DPC			FKNNDPC			DPCS
	ACC	AMI	Par	ACC	AMI	Par	ACC	AMI
Aggregation	0.9975	0.9922	2	0.9975	0.9907	8	0.9962	0.9892
D31	0.9668	0.9539	2	0.9690	0.9566	9	0.9655	0.9526
Flame	1	1	5	0.9917	0.9267	5	1	1
R15	0.9967	0.9938	2	0.9933	0.9897	6	0.9967	0.9938
Spiral	1	1	2	1	1	6	1	1
S1	0.9952	0.9897	2	0.9940	0.9872	13	0.9940	0.9875
Number	5	5	–	3	2	–	3	3

Figure 6.

Spiral dataset DPCS outcomes.

Figure 7.

S1 dataset DPCS outcomes.

Since we improved the definition of local density, we pay more attention to the identification of center points in the decision graph. Figures 3–7 show that DPCS yields large differences between center points and noncenter points, which makes it easy to determine the number of clusters and select the appropriate center points. However, Fig. 2a shows two almost parallel noncenter points on the bottom right side of the light green center point. Nevertheless, DPCS still distinguishes the center and noncenter points correctly.

DPC and FKNNDPC also exhibit high clustering performance, and the differences among the clustering results can hardly be seen by eye. However, clustering evaluation indexes highlight these slight differences. Table 5 shows the evaluation indexes for the 6 datasets, where the optimal value is marked in bold type. DPC achieves the highest clustering accuracy for 5 of the 6 datasets, and both FKNNDPC and DPCS achieve the highest clustering accuracy for 3 of the 6 datasets. Although it appears that DPCS clustering performance is inferior to that of DPC, on Aggregation and S1, where DPC is best and DPCS is not, the two accuracies are very close (the difference is less than 0.01). Therefore, DPCS and DPC have comparable clustering performance. More importantly, the greatest advantage of DPCS is that users do not need to supply predefined parameter(s). This makes DPCS particularly suitable for clustering analysis of datasets without prior knowledge. Hence, DPCS has high performance and is more competitive than DPC and FKNNDPC for clustering analysis, and we propose DPCS to cluster drug-target data.

Table 6 shows the evaluation indexes for the 6 datasets with Gaussian noise ( $\mu=$ 0, $\sigma=$ 0.5). From Tables 5 and 6, the added noise decreases the clustering accuracy of the three algorithms on most of the 6 datasets, especially on the Spiral dataset (the ACC value decreases by 0.5128, 0.1411, and 0.5449 for DPC, FKNNDPC and DPCS, respectively). Among the three algorithms, the clustering performance of DPC has the fastest decline, followed by that of FKNNDPC and DPCS.

Table 6

Clustering performance metrics on the datasets from Table 4 with Gaussian noise

Dataset	DPC			FKNNDPC			DPCS
	ACC	AMI	Par	ACC	AMI	Par	ACC	AMI
Aggregation	0.9961	0.9874	2	0.9175	0.9107	8	0.9961	0.9874
D31	0.9100	0.9163	2	0.9171	0.9217	9	0.9412	0.9300
Flame	0.9000	0.5472	5	0.9833	0.8707	5	0.9833	0.8707
R15	0.9450	0.9190	2	0.7833	0.8292	6	0.9300	0.9062
Spiral	0.4872	0.0553	2	0.8589	0.7387	6	0.4551	0.0356
S1	0.9952	0.9895	2	0.9940	0.9875	13	0.9940	0.9875
Number	3	3	–	2	2	–	3	3

Table 7

Drug-target interaction datasets

Dataset	Enzyme	IC	GPCR	NR
Drug	445	210	223	54
Target	664	204	95	26
Interaction	2926	1476	635	90

Figure 8.

Enzyme dataset decision graphs.

6. Drug-target interaction data cluster analysis

6.1 Drug-target interaction datasets

Table 7 shows four drug-target datasets published by Yamanishi et al. [49] with drug-drug and target-target similarity as well as drug-target interaction relationships. They are standard datasets for many studies predicting drug-target interaction [50, 51, 52, 53]. The interactions between drugs and targets are obtained from the KEGG BRITE [54], BRENDA [55], SuperTarget [56] and DrugBank [57]. The chemical structures of the compounds come from the DRUG and COMPOUND sections in the KEGG LIGAND database [54], and chemical structure similarities between compounds are computed using SIMCOMP [15]. Amino acid sequences of the target proteins are obtained from the KEGG GENES database [54], and sequence similarities between proteins are computed using a normalized version of Smith-Waterman scores [16].

Figure 9.

GPCR dataset decision graphs.

Figure 10.

Ion Channel dataset decision graphs.

6.2 Clustering drug-target datasets with DPCS

Since there is no class label information in the drug-target data, users cannot judge the number of clusters using prior knowledge. Therefore, algorithms such as DPC and FKNNDPC, whose predefined parameters must be determined from class labels, are unsuitable, whereas the DPCS algorithm proposed here does not require a predefined parameter and is suitable for clustering and analyzing drug-target data.

We use DPCS to cluster the four drug and target datasets and compare the outcomes with FastK [24] and HC [25]. We first use DPCS to determine the number of clusters and identify the corresponding center points and then use the DPCS number of clusters as the predefined parameter for FastK and HC. The number of points within each cluster is then extracted for each clustering algorithm.

Table 8

Clustering results for drug-target datasets shown in Table 7

Dataset	Algorithm	Number of clusters		Number of objects	
		Drug	Target	Drug	Target
Enzyme	DPCS	6	5	20, 39, 31, 9, 289, 57	9, 5, 630, 9, 11
	FastK			133, 88, 30, 66, 52, 76	166, 170, 114, 107, 107
	HC			1, 6, 432, 1, 2, 3	2, 637, 4, 3, 18
GPCR	DPCS	6	5	4, 12, 12, 15, 119, 61	5, 9, 45, 2, 34
	FastK			44, 60, 38, 25, 22, 34	33, 13, 7, 9, 33
	HC			12, 12, 1, 191, 6, 1	1, 1, 7, 77, 9
IC	DPCS	3	8	7, 150, 53	9, 122, 4, 5, 26, 14, 20, 4
	FastK			85, 60, 65	14, 105, 10, 3, 20, 17, 25, 10
	HC			2, 201, 7	20, 137, 10, 10, 14, 6, 4, 3
NR	DPCS	5	4	2, 6, 5, 18, 23	11, 3, 9, 3
	FastK			18, 6, 13, 9, 8	3, 3, 12, 8
	HC			2, 16, 1, 34, 1	3, 21, 1, 1

Figure 11.

Nuclear Receptor dataset decision graphs.

The drug-target datasets provide similarity matrices, whereas the clustering algorithms require distance matrices. Therefore, we use $1-\textit{similarity}$ as the distance matrix and cluster the datasets with FastK, HC, and DPCS. Figure 8 shows the Enzyme dataset’s DPCS decision graphs with 6 center points identified for drug (Fig. 8a) and 5 for target (Fig. 8b) data.
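The conversion from similarity to distance is a single element-wise step. The sketch below illustrates it on a made-up 3×3 matrix; the function name and the similarity values are hypothetical and are not taken from the Yamanishi datasets.

```python
def similarity_to_distance(similarity):
    """Element-wise 1 - similarity for a square similarity matrix."""
    return [[1.0 - s for s in row] for row in similarity]

# Hypothetical similarity matrix (values invented for illustration).
sim = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
dist = similarity_to_distance(sim)
```

A matrix obtained this way need not satisfy the triangle inequality, but all three algorithms here operate on the pairwise values alone, so the result can be passed to them in place of a Euclidean distance matrix.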

Figure 12.

Nuclear Receptor clustering.

GPCR contains fewer drugs and targets than Enzyme. Figure 9 shows that, aside from the 6 center points (Fig. 9a; two points overlap, and the number of center points can be confirmed in Table 8), the drug data are concentrated, whereas the target data (see Fig. 9b) are more dispersed. DPCS selects 6 center points for drug and 5 for target data.

Figure 10 shows that the Ion Channel drug data center points are easily distinguished: DPCS selects 3 center points for the drug data and 8 for the target data. Nuclear Receptor has the fewest samples. Figure 11 shows that DPCS selects 5 drug and 4 target data center points, even though the target data points show no concentrated trend.

Table 8 lists the FastK, HC, and DPCS clustering results based on the number of clusters from Figs 8–11. The number of drugs or targets in each cluster is given in the Number of objects columns. HC clustering for the Enzyme, GPCR, and Nuclear Receptor datasets exhibits isolated clusters, and for the Ion Channel dataset it yields seriously unbalanced numbers of points per cluster. FastK and DPCS provide more balanced clusters than HC, with no isolated clusters in any of the four datasets. It is therefore unlikely that the isolated clusters found by HC reflect the actual structure of the data. Many FastK and DPCS clusters have similar numbers of drugs or targets, e.g., a FastK cluster with 30 and a DPCS cluster with 31 objects for the Enzyme drug data, and FastK clusters with 9, 7, and 33 objects and DPCS clusters with 9, 5, and 34 objects for the GPCR target data. These similar results from different clustering algorithms increase their credibility.
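For readers unfamiliar with decision graphs, the two quantities plotted in them can be sketched as follows. The code computes the local density and the distance to the nearest denser point as defined in the original DPC algorithm [20], with a single fixed cutoff distance `d_c`. It does not reproduce the sequence-based local density of DPCS; the function name, the tie-breaking rule, and the toy data are assumptions of this illustration.

```python
def dpc_decision_values(dist, d_c):
    """Return (rho, delta) of the original DPC for a symmetric distance matrix."""
    n = len(dist)
    # rho_i: number of other points closer than the cutoff distance d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    # Visit points from highest to lowest density; ties are broken by index
    # (an assumption, since the definition leaves ties open).
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta = [0.0] * n
    # The densest point gets its largest distance to any other point.
    delta[order[0]] = max(dist[order[0]])
    for pos in range(1, n):
        i = order[pos]
        # delta_i: distance to the nearest point that precedes i in density order.
        delta[i] = min(dist[i][j] for j in order[:pos])
    return rho, delta

# Toy data (hypothetical): six points on a line forming two groups.
points = [0, 1, 2, 10, 11, 12]
toy_dist = [[abs(a - b) for b in points] for a in points]
rho, delta = dpc_decision_values(toy_dist, d_c=1.5)

# Center points stand out with both large rho and large delta;
# here the product rho * delta is used to rank candidates.
gamma = [r * d for r, d in zip(rho, delta)]
centers = sorted(sorted(range(len(points)), key=lambda i: -gamma[i])[:2])
```

In a decision graph, each point is drawn at (rho, delta), and the center points are those lying far from the bulk of the data in the upper right region.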
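The comparison above can be checked mechanically from the cluster sizes. The sketch below uses the Nuclear Receptor drug rows of Table 8 and counts very small clusters; treating a cluster with a single object as isolated is an assumption of this example (the threshold `max_size` is hypothetical), as is the helper name `summarize`.

```python
# Cluster sizes for the Nuclear Receptor drug data, copied from Table 8.
nr_drug_sizes = {
    "DPCS": [2, 6, 5, 18, 23],
    "FastK": [18, 6, 13, 9, 8],
    "HC": [2, 16, 1, 34, 1],
}

def summarize(sizes, max_size=1):
    """Summarize one clustering by its cluster sizes.

    `max_size` is the largest size still counted as an isolated cluster;
    the default of 1 (singletons only) is an assumption of this sketch.
    """
    total = sum(sizes)
    return {
        "total": total,
        "isolated": sum(1 for s in sizes if s <= max_size),
        "largest_share": max(sizes) / total,
    }

summary = {name: summarize(sizes) for name, sizes in nr_drug_sizes.items()}
for name, info in summary.items():
    print(name, info)
```

Each row accounts for the same total number of drugs, so the three clusterings differ only in how the objects are distributed, and only HC produces single-object clusters for this dataset.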

6.3 Intuitive graph analysis based on MDS

To further analyze the clustering results of the three algorithms, we used the Nuclear Receptor dataset as an example, performed MDS, and selected the two most significant differentiation features for intuitive graph analysis.
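A minimal sketch of how two plotting coordinates can be obtained from a distance matrix is given below. It implements classical (Torgerson) MDS with power iteration in pure Python and assumes a symmetric distance matrix; whether the experiments used this variant of MDS is not stated here, so the function `classical_mds`, its parameters, and the toy points are all assumptions of the illustration.

```python
import math
import random

def classical_mds(dist, dims=2, iters=500, seed=0):
    """Embed objects in `dims` coordinates from a symmetric distance matrix."""
    n = len(dist)
    # Double-center the squared distances: B = -1/2 * J * D^2 * J.
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(r) / n for r in sq]
    total_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        # Power iteration for the dominant eigenvector of the current B.
        v = [rng.random() - 0.5 for _ in range(n)]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(v[i] * bv[i] for i in range(n))  # Rayleigh quotient
        if lam <= 0.0:
            # Non-Euclidean inputs can give a dominant negative eigenvalue;
            # this sketch simply stops in that case.
            break
        for i in range(n):
            coords[i][k] = math.sqrt(lam) * v[i]
        # Deflate B so that the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j]
    return coords

# Toy data (hypothetical): four corners of a 2 x 1 rectangle.
pts = [(1.0, 0.5), (1.0, -0.5), (-1.0, 0.5), (-1.0, -0.5)]
toy_dist = [[math.dist(p, q) for q in pts] for p in pts]
coords = classical_mds(toy_dist)
```

The two returned coordinates per object play the role of the two most significant features: plotting them and coloring each point by its cluster label gives a scatter plot of the kind used for visual comparison.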

Figure 12 shows the clustering results of the three algorithms, where a large square marks each center point and points in the same cluster share the same color. Since HC has no center points, Fig. 12c and d show no square marks. The purple, black, and blue drug points (Fig. 12c) and the red and black target points (Fig. 12d) are isolated clusters. FastK drug clustering (Fig. 12a) shows that the black, purple, and red clusters lie fairly close to each other, and DPCS clustering gathers these points into the same cluster (Fig. 12e, black points). FastK target clustering (Fig. 12b) shows that the red cluster contains two points near coordinates $(-0.5,0)$ that are far from the other red points and are in fact closer to the center point of the black cluster. DPCS, however, groups these points and their close neighbors into the same cluster (Fig. 12f, blue points).

Thus, DPCS clusters the points more reasonably without requiring predefined parameters.

7. Conclusion

This paper investigated the problem of drug-target data clustering and proposed the clustering by fast search and find of density peaks based on a sequence (DPCS) algorithm, which overcomes the limitation of requiring a predefined cutoff distance. We experimentally compared DPCS with the DPC and FKNNDPC algorithms on 6 classical datasets and showed that the improved local density calculation, which requires no predefined parameter, easily and correctly identified center points in the decision graph and provided excellent clustering performance.

We also applied the proposed DPCS clustering algorithm to four common standard drug-target datasets and compared it with FastK and HC algorithms. Without prior information, DPCS determined an appropriate number of clusters for each dataset and identified the corresponding center points. DPCS clustering was more reasonable than FastK and HC, providing more accurate auxiliary information for drug-target interaction prediction. An intuitive graph analysis of the two most significant differentiation features used in the MDS confirmed the conclusion.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (61571164, 61671188, 61871020, 61532014, 61571163 and 61671189), the Key Science and Technology Plan Project of Beijing Municipal Education Commission of China (KZ201810016019), the Fundamental Research Funds for Beijing University of Civil Engineering and Architecture (X18197 and X18203) and the National Key Research and Development Plan Task of China (2016YFC0901902-5).

Supplementary materials

The drug-target datasets are available at http://web.kuicr.kyoto-u.ac.jp/supp/yoshi/drugtarget/, and the source code of DPCS is available at https://github.com/Yu123456/DPCDrugAndTarget.

References

1. Chen, Yan C.C., Zhang et al., Drug-target interaction prediction: databases, web servers and computational models, Briefings in Bioinformatics 17 (2016), 696–712.

2. Ashburn and Thor, Drug repositioning: Identifying and developing new uses for existing drugs, Nature Reviews Drug Discovery 3 (2004), 673–683.

3. Hopkins, Network pharmacology: the next paradigm in drug discovery, Nature Chemical Biology 4 (2008), 682–690.

4. Palma, Vidal M.E. and Raschid, Drug-target interaction prediction using semantic similarity and edge partitioning, in: Proceedings of the 13th International Semantic Web Conference, New York, USA, 2014, pp. 131–146.

5. Chong C.R. and Sullivan D.J., New uses for old drugs, Nature 448 (2007), 645–646.

6. Vilar and Hripcsak, The role of drug profiles as similarity metrics: applications to repurposing, adverse effects detection and drug-drug interactions, Briefings in Bioinformatics 18 (2016), 670–681.

7. Shi, Yiu, Leung and Chin, Predicting drug-target interaction for new drugs using enhanced similarity measures and super-target clustering, Methods 83 (2015), 98–104.

8. Zhang and Zhang, Drug-target interaction prediction by integrating multiview network data, Computational Biology & Chemistry 69 (2017), 185–193.

9. Hao, Pan and Zhang, Prediction of drug-target proteins by integrating protein-protein interaction network and protein sequence similarity, Acta Biophysica Sinica 29 (2013), 695–705.

10. Gudivada B.J.A.R.C. and Jegga A.G., Computational drug repositioning through heterogeneous network clustering, BMC Systems Biology 7 (2013), S6.

11. Han, Kamber and Pei, Data mining: Concepts and techniques, Morgan Kaufmann, 2006.

12. Liu, Guo and Liu, An improved k-medoids algorithm based on step increasing and optimizing medoids, Expert Systems with Applications 92 (2018), 464–473.

13. Ester, Kriegel, Sander and Xu, A density based algorithm for discovering clusters in large spatial databases with noise, in: Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, Oregon, 1996, pp. 226–231.

14. Wang, Yang and Muntz R.R., Sting: a statistical information grid approach to spatial data mining, in: Proceedings of the 23rd International Conference on Very Large Data Bases, San Francisco, USA, 1997, pp. 186–195.

15. Hattori, Okuno, Goto and Kanehisa, Development of a chemical structure comparison method for integrated analysis of chemical and genomic information in the metabolic pathways, Journal of the American Chemical Society 125 (2003), 11853–11865.

16. Smith T.F. and Waterman, Identification of common molecular subsequences, Journal of Molecular Biology 147 (1981), 195–197.

17. Liu and Johnson D.E., Clustering and its application in multi-target prediction, Current Opinion in Drug Discovery & Development 12 (2009), 98–107.

18. Hansch and Unger S.H., Strategy in drug design cluster analysis as an aid in the selection of substituents, Journal of Medicinal Chemistry 16 (1973), 1217–1222.

19. Zhu, Liu and Cao, Quantitatively integrating molecular structure and bioactivity profile evidence into drug-target relationship analysis, BMC Bioinformatics 13 (2012), 75.

20. Rodriguez and Laio, Clustering by fast search and find of density peaks, Science 344 (2014), 1492–1496.

21. Cao, Packer, Ramani, Cusanovich, Huynh, Daza, Qiu, Lee, Furlan, Steemers, Adey, Waterston, Trapnell and Shendure, Comprehensive single-cell transcriptional profiling of a multicellular organism, Science 357 (2017), 661–667.

22. Wang, Zuo and Wang, An improved density peaks based clustering method for social circle discovery in social networks, Neurocomputing 179 (2016), 219–227.

23. Dong, Feng and Zhang, Lsi: Latent semantic inference for natural image segmentation, Pattern Recognition 59 (2016), 282–291.

24. Park H.-S. and Jun C.-H., A simple and fast algorithm for k-medoids clustering, Expert Systems with Applications 36 (2009), 3336–3341.

25. Johnson, Hierarchical clustering schemes, Psychometrika 32 (1967), 241–254.

26. Ding and Xue, A robust density peaks clustering algorithm using fuzzy neighborhood, International Journal of Machine Learning and Cybernetics 9 (2018), 1131–1140.

27. Jia, Tang, Zhu and Li, A novel ranking based clustering approach for hyperspectral band selection, IEEE Transactions on Geoscience & Remote Sensing 54 (2016), 88–102.

28. Tao and Jin, An optimal density peak algorithm based on data field and information entropy, in: Proceedings of the 2017 International Conference on Data Mining, Communications and Information Technology, ACM, 2017, p. 4.

29. Zhou, Zhang, Chen, Ning, Zhang, Feng, Liu and Luktarhan, A distance and density based clustering algorithm using automatic peak detection, in: IEEE International Conference on Smart Cloud, New York, USA, 2016, pp. 176–183.

30. Geng Y.A., Zheng, Zhuang and Xiong, Recome: A new density-based clustering algorithm using relative knn kernel density, Information Sciences 436–437 (2018), 13–30.

31. Xie, Gao, Xie, Liu and Grant, Robust clustering by detecting density peaks and assigning points based on fuzzy weighted k-nearest neighbors, Information Sciences 354 (2016), 19–40.

32. Liu, Wang and Yu, Shared-nearest-neighbor-based clustering by fast search and find of density peaks, Information Sciences 450 (2018), 200–226.

33. Liu and Yu, Adaptive density peak clustering based on k-nearest neighbors with aggregating strategy, Knowledge-Based Systems 133 (2017), 208–220.

34. Seyedi S.A., Lotfi, Moradi and Qader N.N., Dynamic graph-based label propagation for density peaks clustering, Expert Systems with Applications 115 (2019), 314–328.

35. Ding and Jia, Study on density peaks clustering based on k-nearest neighbors and principal component analysis, Knowledge-Based Systems 99 (2016), 135–145.

36. Zhou, Zhang and Zheng, Robust clustering by identifying the veins of clusters based on kernel density estimation, Knowledge-Based Systems 159 (2018), 309–320.

37. Shi, Han and Yan, Adaptive clustering algorithm based on knn and density, Pattern Recognition Letters 104 (2018), 37–44.

38. Jiang, Hao, Chen, Parmar and Li, Gdpc: Gravitation-based density peaks clustering algorithm, Physica A: Statistical Mechanics and its Applications 502 (2018), 345–355.

39. Jiang, Chen, Hao and Li, Dpc-lg: Density peaks clustering based on logistic distribution and gravitation, Physica A: Statistical Mechanics and its Applications 514 (2019), 25–35.

40. Ding and Xue, Dpcg: an efficient density peaks clustering algorithm based on grid, International Journal of Machine Learning and Cybernetics 9 (2018a), 743–754.

41. Ding and Shi, An improved density peaks clustering algorithm with fast finding cluster centers, Knowledge-Based Systems 158 (2018b), 65–74.

42. Cai and Han, Document clustering using locality preserving indexing, IEEE Transactions on Knowledge and Data Engineering 17 (2005), 1624–1637.

43. Vinh, Epps and Bailey, Information theoretic measures for clusterings comparison: is a correction for chance necessary, in: Proceedings of the 26th Annual International Conference on Machine Learning, Montreal, Quebec, Canada, 2009, pp. 1073–1080.

44. Gionis, Mannila and Tsaparas, Clustering aggregation, ACM Transactions on Knowledge Discovery from Data 1 (2007), 1–30.

45. and Medico, Flame, a novel fuzzy clustering method for the analysis of dna microarray data, BMC Bioinformatics 8 (2007), 3.

46. Veenman, Reinders and Backer, A maximum variance cluster algorithm, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (2002), 1273–1280.

47. Chang and Yeung, Robust path based spectral clustering, Pattern Recognition 41 (2008), 191–203.

48. Franti and Virmajoki, Iterative shrinking method for clustering problems, Pattern Recognition 39 (2006), 761–775.

49. Yamanishi, Araki, Gutteridge, Honda and Kanehisa, Prediction of drug-target interaction networks from the integration of chemical and genomic spaces, Bioinformatics 24 (2008), I232–I240.

50. Mei, Kwoh, Yang and Zheng, Drug-target interaction prediction by learning from local information and neighbors, Bioinformatics 29 (2013), 238–245.

51. Hao, Wang and Bryant, Improved prediction of drug-target interactions using regularized least squares integrating with kernel fusion technique, Analytica Chimica Acta 909 (2016), 41–50.

52. Liu, Miao, Zhao and Li, Neighborhood regularized logistic matrix factorization for drug-target interaction prediction, Plos Computational Biology 12 (2016), e1004760.

53. Luo, Zhao, Zhou, Yang, Zhang, Kuang, Peng, Chen and Zeng, A network integration approach for drug-target interaction prediction and computational drug repositioning from heterogeneous information, Nature Communications 8 (2017), 573.

54. Kanehisa, Goto, Hattori et al., From genomics to chemical genomics: new developments in kegg, Nucleic Acids Research 34 (2006), D354–D57.

55. Schomburg, Chang, Ebeling et al., Brenda, the enzyme database: updates and major new developments, Nucleic Acids Research 32 (2004), D431–D433.

56. Gunthers, Kuhn, Dunkel et al., Supertarget and matador: resources for exploring drug-target relationships, Nucleic Acids Research 36 (2008), D919–D922.

57. Wishart D.S., Knox, Guo A.C. et al., Drugbank: a knowledgebase for drugs, drug actions and drug targets, Nucleic Acids Research 36 (2008), D901–D906.
