Abstract
Clustering by fast search and find of density peaks (CFSFDP) was proposed to create clusters by finding high-density peaks, quickly. CFSFDP mainly based on two rules: 1) a cluster center has a high dense point and 2) a cluster center lies at a large distance from other clusters centers. The effectiveness of CFSFDP highly depends upon the cutoff distance (C d ), which is used to estimate the density of each data point. However, there is a need to provide the predefined C d . In this paper, we propose an adaptive way to estimate the accurate C d by using the characteristics of Improved Sheather-Jones (ISJ) method named as IJS-CFSFDP. ISJ method provides the best estimation for C d to measure accurate density of each data point. We perform a number of experiments on standard benchmark clustering datasets and real academic dataset of students. The evaluated clustering results on education dataset validate the IJS-CFSFDP can be used to make intelligent contents delivery system based on the capability and intelligence of the student. The experimental results on synthetic datasets show that the proposed adaptive C d method creates better clusters as compare to the CFSFDP, mean shift, affinity propagation and k-means.
Keywords
Introduction
Clustering algorithms are aimed to analyze data by organizing data into a set of disjoint categories, called clusters. Clustering has been successfully applied in different fields such as pattern recognition [1–3],astronomy [4], cyber security [5], health care [6], bioinformatics [7, 8], social networks [9], education [10, 11], and image processing [12, 13] etc. The potential efforts are also required to apply clustering in emerging fields such as big data [14, 15], IoT [16–20], and Virtual Reality [21]. Clusteringalgorithms can be categorized into different types such as, density-based [22–28], model-based [29, 30], grid-based [31, 32], hierarchical [8, 34], and partitioning [35, 39].
The k-means [35] clustering is a state-of-the-art partitioning algorithm. In k-means, data is partitioned into k clusters and these clusters are iteratively optimized. The k-means creates spherical clusters and cannot detects the outliers. The accuracy of k-means is subjected to appropriate knowledge to the number of clusters and initial selection of centroids.
Affinity propagation (AP) [36] clustering is an effective clustering algorithm that outperforms traditional and classical clustering algorithm. AP gives promising results in different applications, such as face clustering, gene detection, and air-travel routines. However, the time complexity of AP is much higher than k-means and mean shift. For clusters of arbitrary shape, AP does not perform well.
The mean shift [37] is well known kernel density estimation based clustering algorithm, which is successfully used for image segmentation, visual tracking, space analysis, and mode seeking. In mean shift there is no need of domain knowledge to do clustering. However, the mean shift depends upon the window size.
Density based clustering is a primary approach to create clusters of arbitrary shapes and to identify noise from spatial dataset. The clusters are characterized as dense regions and some points have isolated densities declared as noise or outliers. Clusters of arbitrary shapes are created by connecting densities with maximum set of density-connected points [23]. For density based clustering approaches there is only need of minimum domain knowledge to cluster the datasets [22].
DBSCAN [25] is a popular density based clustering method used to create clusters of arbitrary shapes. DBSCAN is robust to noise, well scale to large datasets, and requires minimum number of input parameter. However, it is not completely deterministic for border points, and does not perform well on highly overlapped dense regions. Correctly estimate the input parameters is also a difficult task. To overcome the problems of DBSCAN, a new density based clustering method was proposed, abbreviated as OPTICS [22]. OPTICS can create more accurate clusters in data of varying density. However, DBSCAN, and OPTICS still cannot identify some boarder points. A number of DBSCAN variants have been proposed, such as VDBSCAN [27], DVBSCAN [24], ST-DBSCAN [26], and DBCLASD [24] to improve the performance of DBSCAN.
Alex et al. proposed a density based clustering method named as clustering by fast search and find of density peaks (CFSFDP) [40]. CFSFDP has two basic rules for clusters centers: 1) cluster center has high-density as compare to its neighbors, and 2) cluster center is positioned at high distance from other clusters centers. The effectiveness of algorithm depends upon cutoff distance (C d ), which is an essential parameter to estimate better density at each point. In CFSFDP C d is selected based on the heuristic that the average number of neighbors is around 1 to 2% of the whole dataset [40]. To minimize the statistical errors of CFSFDP for estimating the densities and to identify the border points, there is a need to estimate C d adaptively instead of using 1 to 2% heuristic of neighbor points.
To overcome aforementioned problem, we propose an adaptive way to estimate C d by using the phenomena of Improved Sheather-Jones (ISJ) method [41] named as IJS-CFSFDP. ISJ provides the better estimation of data points to determine the C d . The C d can be used to 1) estimate the densities of underlaying dataset and 2)to detect the noise, and 3) refine the overlapping border regions of the clusters. Mean integrated squared error (MISE) is minimized to estimate the C d by ISJ method. Experimental results on standard benchmark clustering datasets validate the effectiveness of the proposed adaptive C d method.
The rest of this paper is organized as follows. The background knowledge is presented in Section 2. Section 3 describes the proposed method to select C d by using Improved Sheather-Jones (ISJ) method in detail. Experimental results are presented and discussed in Section 4, and finally, the concluding remarks are presented in Section 5.
Background knowledge
In this section we have presented a brief introduction of CFSFDP.
CFSFDP
CFSFDP deals with two basic properties of data points: 1) local density ρ
i
, and 2) distance of data point i to closest higher dense point represented as δ
i
. For a data point i, local density ρ
i
is calculated as follows:
C
d
represents the cutoff distance and d
ij
is distance between point i and j. The ρ
i
is the number of points that are closer than C
d
to point i. The δ
i
is minimum distance of point i with a nearest high-density point. The δ
i
can be calculated by Equation 2.
The values of δ are much larger for those points, which have the maximum density globally or locally. So the centers of the clusters are points which have the maximum values of ρ and δ as shown in Fig. 1.
Figure 1(A) contains 28 data elements plotted in a 2D space with decreasing density. The decision graph in Fig. 1(B) illustrates the points 1 and 10 are the maximum density points that lies at large distance from other points, so these can be identified as cluster centers. Points 26, 27 and 28 have high values of δ and have low values of ρ. These are isolated points and hence identified as outliers or noisepoints.
With the help of decision graph one can identify clusters centers successfully. After the identification of clusters centers, other points are assigned to clusters in a single round. The rest of data elements are assigned to the nearest cluster with respect to defined clusters centers. Furthermore, for border point’s conditions, a border region for each cluster is identified. The border region of a cluster contains a set of data points that are part of the underlying cluster and also fall in the C d radius of other cluster points. The next step is to find the maximum density in the border region of cluster, which is expressed as ρ j . The points with higher-density than ρ j are identified as cluster core points and other points are identified as halo point of the cluster (considered as outliers ornoise).
In this section we have presented a kernel density estimation and adaptive cutoff distance by Improved Sheather-Jones (ISJ) method as follows.
Kernel density estimation
The nonparametric density estimation is an important tool in statistical analysis of data. It can be used to evaluate the degree of symmetry in a dataset, multimodality, discriminant analysis, summarization of Bayesian posteriors and classification [41, 42]. The nonparametric density estimation is an alternative to parametric estimation of density. Unlike the classical approach, the nonparametric density estimation is not affected by specification bias [41], which makes it more flexible for modeling of data. The kernel density estimation (KDE) is most common density estimation approach [37, 43]. The standard method to estimate the density is introduced by a narrow Gaussian kernel (or alternative) at each point d
i
and compute the integral of all kernel values over the entire dataset [43, 45]. For KDE, identical and independent distributed samples are drawn {x1, x2, x3, . . . , x
n
} with an unknown probability density function is given as follows:
Generally, kernel density estimator places the small kernel with bandwidth K C d at each sample point x i . Alex et al. proposed to use Equation 1 or Equation 3 in CFSFDP to compute the density. However, the effectiveness of KDE depends upon the appropriate choice of C d . Mean integrated squared error (MISE) is well-studied criteria for optimal choice of smoothing parameter for quality estimation of KDE [41], which is given below:
Equation 4 can be decompose into two components: 1) integrated squared bias and 2) integrated variance. The decomposed equation is given as follow:
Where ∥f" ∥ 2 is equal to ∫ (f" (x)) 2 dx. The asymptotically optimized value of C
d
is a minimizer of the AMISE as follows:
To compute the optimal from Equation 7, one needs to estimate the ∥f" ∥ 2. Consider the problem of estimating function ∥f(k) ∥ 2 for any arbitrary integer k ≥ 1. The identity suggests two plug-in estimators [41, 46], which are given as follow:
The both estimators and estimates the same quantity for any given smoothing parameter. To make both estimators asymptotically equivalent in sense of mean squared error, the can be selected as follows:
In Equation 8 the computation of requires the estimation of ∥f(k+1) ∥ 2 which is unknown. So, each is estimated by using following expression:
The computation of needs the estimation of itself, which in turn requires the estimation of , and so on, as seen from Equation 9. The problem in computation of is to estimate the infinite sequence {. However, if is given for some l >0, then all { can be estimated recursively. The l-stage direct plug-in bandwidth selector [46] uses this idea to find optimal bandwidth. For a given integer l > 0, the l-stage plug-in method select C
d
by computing the following equation:
where the is estimated via ∥ f(l+1) ∥ 2 computing by assuming that f is a normal density with variance and mean estimated from the data [41, 46]. Because of assumptions used in l-stage plug-in method can lead to arbitrarily bad estimation of , for example, when true f is far from being Gaussian. Botev et al. [41] proposed a unique solution of the nonlinear equation:
For higher value of l, Equation 11 can either be solved with Newton’s method or fixed point iteration while considering with initial condition C d =0 [38, 41]. Equation 11 can be solved with fixed point iteration as shown in the following algorithm:
For given l > 2, initialized with , where ɛ is machine precision, and n=0; Set ; If , then stop and set ; otherwise, set n = n + 1 and repeat from step 2; Deliver the Gaussian kernel density estimator in Equation 3 evaluated at as the final estimator of f, and as the smoothing parameter for optimal estimation of
.
The improved Sheather-Jones uses O (n) operations to get an optimal bandwidth.
Results and discussion
To validate the robustness of IJS-CFSFDP, four standard synthetic clustering datasets (aggregation [47], flame [48], spiral [49], and toys problem [50]) are used in our experiments and compared with CFSFDP, mean shift, affinity propagation and k-means.
Flame dataset consists of 240 data points and comprises of 2 clusters. Figure 2(A) shows ground truth of flame dataset. Figure 2(B) shows clusters, which are formed by using C d as 2% of whole dataset. CFSFDP identify most of the core points as noise. In Fig. 2(C) clusters are formed by considering 1% neighbors exist around a point in dataset. However, the effect to decrease in C d cannot create good clusters as shown in Fig. 2(C). In Fig. 2(C) CFSFDP identified some of core points as noise points and also could not identify the compact relationship among the density connected points. Figure 2(D) shows clusters created by IJS-CFSFDP in which an optimal adaptive C d is used to estimate the accurate density of dataset. By estimating accurate density with optimal choice of C d , IJS-CFSFDP can creates better clusters and discovers density connected points in dataset, successfully.
We have also compared the effectiveness of IJS-CFSFDP on toys problem datasets. The datasets consist of 300, 1500 points having 2 clusters. Figure 4(A) shows ground truth of toys problem dataset having 1500 data points. CFSFDP identified some of the clusters core and boarder points as halo points as shown in Fig. 4(B). Furthermore, CFSFDP cannot discover a maximum density connective points to integrate different parts of a cluster. Figure 4(C) shows clusters formed by calculating the ρ of each object using IJS-CFSFDP which produces accurate clusters successfully and provides strong foundation for density connectivity. Figure 4(D) shows the toys problem dataset having 300 data points with comparatively more disperse densities. Figure 4(E) shows clusters created by CFSFDP. CFSFDP has detected more core points as noise and also could not detect effective relationship among connected densities. The appropriate selection of C d is important to discover connected densities and to detect border points. Figure 4(F) shows the accurate clusters created by IJS-CFSFDP.
Path-based spiral dataset consist of 312 data points with separation of 3 clusters spread over 2 dimensional space. CFSFDP creates same clusters as compared to IJS-CFSFDP because of spiral dataset have not overlapping densities. In our experiments we have used the same parameters used by mean shift, AP, k-means, and CFSFDP their experiments. Figure 5(A) shows the visual results on aggregation dataset. Aggregation is 2 dimensional dataset consists of 788 data points and comprises of 7 clusters. Figure 5(B) shows the clusters created by mean shift with window size of 4.6. After a number of iteration at window size of 4.6 it creates better clusters. However, the accuracy of mean shift is subject to the accurate selection of window size. The Fig. 5(C) shows seven clusters created by k-means. The k-means is effective when there is no overlapping in dataset. However, it could not create good clusters when data distribution is highly complex and overlapped. Figure 5(D) shows the clusters created by AP. The AP separates the flame dataset into 18 clusters. Figure 5(E) shows the visualization of resultant clusters on aggregation dataset using CFSFDP while Fig. 5(F) shows the visual result of clusters create by IJS-CFSFDP. CFSFDP has identified some boarder points as outliers, which are actually part of clusters. Where IJS-CFSFDP identified the border points as part of cluster, accurately.
However, the CFSFDP results as shown in Fig. 5(E) misclassify 82 points and declared as noise points at Cd = 1.860108 calculated by given method in CFSFDP. The ISJ based cutoff distance in CFSFDP method adaptively calculates the optimal cutoff distance for CFSFDP, which resultant to the better clusters as compare to k-means, mean shift, AP and CFSFDP. The mean shift is also use kernel density estimator to estimate the density of dataset like CFSFDP. However, both have deficiency to select optimal smoothing parameter to estimate accurate densities and to discover boarder points as well. The optimal C d and window size is hard to estimate for users. However, the IJS-CFSFDP gives more robust methodology to estimate adaptively optimal C d in order to get robust performance of CFSFDP.
After a numerous experiments we conclude that IJS-CFSFDP works well in small size datasets. However, in compound dataset [51] IJS-CFSFDP could not separate two red clusters as shown in Fig. 6 (B). IJS-CFSFDP merged inner small cluster into outer red cluster because its density was very low and it was very near to high dense region of outer cluster. In future, we will consider to refine this issue.
Recently, the incorporation of web based systems, social networks, ubiquitous computing, and IoT in education system has changed the entire education systems. The web based education systems are virtual form of instructions and beyond the geographical limitations. The modern education systems are generating the massive amount of data that can be used to extract the hidden intrinsic pattern. The clustering is a major technique to analyze data intelligently. Mostly, the existing educational system are static and could not satisfy the diversity of students. Clustering technique can be utilize to make the existing education systems smarter and intelligent [10, 11].
To improve the academic performance of student, we utilize the IJS-CFSFDP to group the students into unique clusters based on their GPA obtained in previous semester. With the special attention of teachers and counseling toward the weak group of students, their performance can be improve. The proposed method can also be utilize to make intelligent contents delivery system based on the capability and intelligence of the student.
Table 1 shows the comparison of C d and identified clusters points between the standard CFSFDP and IJS-CFSFDP on different datasets. The CFSFDP fails to identify most of the core points and declare them as noise points. However, in propose adaptive estimation of C d , IJS-CFSFDP can identify the core points, correctly. The accuracy is achieved because of propose adaptive estimation of C d better express the density of dataset.
Conclusion
In this paper, ISJ based adaptive optimal cutoff distance method is proposed for CFSFDP named as IJS-CFSFDP. ISJ method is an adaptive way to calculate C d of a dataset, which is further used in kernel density estimation and identify the border points in CFSFDP. ISJ gives most accurate estimation of smoothing parameter because it does not use normal references rules as compare to other state-of-the-art methods used to calculate smoothing parameter in KDE. ISJ provides minimum value of AMISE, which effectively used as C d in CFSFDP to cluster datasets adaptively. So, the problem of C d in CFSFDP is solved by proposed method. Experiments results on four benchmark datasets shows the robustness and effectiveness of the IJS-CFSFDP as compared to CFSFDP.
Footnotes
Acknowledgments
This research is sponsored by National Natural Science Foundation of China (Nos. 61171014, 61371185, 61401029, 61472044, 61472403, 61571049) and the Fundamental Research Funds for the Central Universities (Nos. 2014KJJCB32, 2013NT57) and by SRF for ROCS, SEM.
