Davies Bouldin Index based hierarchical initialization K-means

Abstract

The K-means algorithm is an effective partition-based clustering algorithm that has been widely used for clustering analysis. However, it has two main problems: how to provide an appropriate number of clusters and how to determine the initial cluster centers automatically. Many methods have been proposed to address these problems. In our previous work, we proposed a hierarchical initialization approach to determine the initial cluster centers, but it cannot provide the number of clusters automatically. In this paper, in order to determine the number of clusters automatically, we propose the Davies Bouldin Index (DBI) based hierarchical initialization K-means (DHIKM) algorithm on the basis of our previous work. The proposed algorithm integrates the DBI metric into our hierarchical K-means algorithm and can determine the number of clusters with low time cost. Experiments on UCI datasets and synthetic data demonstrate the effectiveness and feasibility of the proposed algorithm.

Keywords

K-means, Davies Bouldin Index, the number of clusters, hierarchical initialization, initial cluster centers

1. Introduction

The K-means algorithm has been widely used in clustering analysis, and many clustering algorithms, such as spectral clustering [1], are based on it. However, its disadvantage is also obvious: the clustering result largely depends on the initialization. Firstly, the number of clusters, $K$, has to be specified by the user, which is very difficult for data that cannot be inspected visually, such as high-dimensional and large-volume data. Secondly, K-means is very sensitive to the initial cluster centers, which directly affect the results on complicated clustering problems. Hence, determining the number of clusters and the initial cluster centers is the key issue for K-means, and it has become a research focus for the K-means algorithm [1].

In order to determine the initial cluster centers, we proposed a hierarchical K-means (HIKM) algorithm [2], which first samples the data from the low level to the top level and then carries out clustering from the top level to the low level. Meanwhile, the cluster centers are propagated from the top level to the low level. In this way, better initial cluster centers can be obtained with less time cost. However, how to determine the number of clusters automatically is not addressed in this approach. Existing clustering algorithms that determine the number of clusters automatically all require many control parameters in advance, such as the Iterative Self-organizing Data Analysis Techniques Algorithm (ISODATA) and its extension, the hybrid PSO-ISODATA algorithm [3], which combines PSO and ISODATA. Because they need many control parameters, they are hard to use on high-dimensional datasets. Other clustering algorithms, like the method in [4], have specific requirements on the datasets and are prone to over-fitting.

In order to determine the number of clusters, we need an appropriate evaluation metric to measure the performance of clustering. Most clustering metrics aim to maximize the similarity within a class and minimize the similarity between classes. More precisely, clustering quality increases with the similarity within a class and with the difference between classes. The Davies Bouldin Index (DBI) [5] is a popular measure that evaluates clustering performance through the separation between the $i$th and the $j$th cluster: similarity should be as large as possible within a class and as small as possible between classes. It indicates the clustering quality by the ratio of within-cluster scatter to between-cluster distance. Meanwhile, its time complexity is linear in the number of clustered vectors, which is lower than that of other internal indexes. Section 2.2 reviews comparative studies in which DBI is among the best-performing evaluation indexes. Therefore, we choose DBI as the metric of clustering performance.

Based on the evaluation ability of DBI, we propose the DBI-based Hierarchical Initialization K-means algorithm (DHIKM), which can determine both the initial cluster centers and the number of clusters automatically. The proposed algorithm reduces the data by sampling level by level and determines the number of clusters with the DBI criterion on the level where sampling ends. Clustering is then executed top-down until the original data level is reached. If DBI is applied directly to the original data, the number of clusters can also be obtained, but the time cost may be unbearable, especially for large-volume data. With our approach, the time cost of determining the number of clusters is very limited and may be ignored in some cases, which is the main contribution of this paper. Building on our previous work, the proposed method finds not only the initial cluster centers but also the number of clusters. In other words, it provides a better initialization for the K-means algorithm.

The rest of this paper is organized as follows. Section 2 presents the preliminary knowledge. Section 3 describes the proposed method and analyzes its complexity. Section 4 reports the experiments on UCI datasets and simulation datasets, together with the comparison and analysis. Finally, Section 5 draws the conclusion.

2. Preliminaries

This section introduces the preliminary knowledge behind the proposed method, namely DBI and the HIKM algorithm, which is important for understanding DHIKM.

2.1 Definition of DBI

The Davies Bouldin Index (DBI) [5] is a measure for evaluating clustering performance. Its basic idea is to evaluate the separation between the $i$th and the $j$th cluster: distances should be as large as possible between clusters and as small as possible within a cluster. DBI grows with the within-cluster scatter and shrinks as the between-cluster distance grows, so a smaller DBI indicates a better clustering.

The formula of Davies Bouldin Index is as follows:

$\textit{DBI}=\frac{1}{K}\sum_{i=1}^{K}\max\limits_{j\neq i}\frac{S_{i}+S_{j}}{d_{i,j}}$ (1)

where $S_{i}=\frac{1}{|C_{i}|}\sum_{x_{j}\in C_{i}}{\|x_{j}-v_{i}\|}$ is a measure of scatter within cluster $i$, $K$ is the number of clusters, $x_{j}$ is an $n$-dimensional feature vector assigned to cluster $i$, $v_{i}$ is the center of cluster $i$, $C_{i}$ represents cluster $i$, $\|\cdot\|$ is the Euclidean norm, and $d_{i,j}=\|v_{i}-v_{j}\|$ is the distance between the centers of clusters $i$ and $j$.
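To make Eq. (1) concrete, the following is a minimal Python sketch of the DBI computation. It is an illustration and not the implementation used in our experiments: the function name and the argument layout (points as coordinate tuples, one integer label per point, one center per cluster) are assumptions, the points are treated as unweighted, and every cluster is assumed to be non-empty.

```python
import math

def davies_bouldin_index(points, labels, centers):
    """Eq. (1): average over clusters of max_{j != i} (S_i + S_j) / d_ij."""
    k = len(centers)
    dist_sums = [0.0] * k
    counts = [0] * k
    for x, c in zip(points, labels):
        dist_sums[c] += math.dist(x, centers[c])
        counts[c] += 1
    # S_i: mean Euclidean distance of the members of cluster i to its center.
    # Assumes no cluster is empty (otherwise this divides by zero).
    scatter = [dist_sums[i] / counts[i] for i in range(k)]
    total = 0.0
    for i in range(k):
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(k) if j != i
        )
    return total / k
```

For two clusters with scatter 1 each and centers 10 units apart, the sketch gives $(1+1)/10=0.2$; tighter or better separated clusters give a smaller value.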

2.2 Why DBI?

In this subsection, we state why we choose DBI as the clustering metric. In general, clustering validation measures, which fall into two main categories (external validation and internal validation), are used to judge the quality of clustering results. Rendon et al. [6] presented a comparative study between four external indexes (F-measure, NMIMeasure [7], Entropy and Purity [8]) and five internal indexes (BIC, CH, DBI, SIL and DUNN [9]). They tested the K-means and Bisecting K-means algorithms on 12 synthetic data sets. For the Bisecting K-means algorithm, the correct rate was 86% with internal indexes and 51.9% with external indexes. For the K-means algorithm, it was 76.9% with internal indexes and 61.5% with external indexes. Moreover, DBI performed best for both algorithms. In fact, external indexes are ruled out not only by these results but also by their own requirement: the main difference between external and internal indexes is whether prior knowledge is needed. For most real applications, prior knowledge is usually not available, so external indexes are not suitable for determining the number of clusters.

Many internal validation measures, DBI among them, have been proposed in the literature. Liu et al. [10] studied 11 internal validation measures, namely RMSSTD, RS, $\Gamma$, CH, $I$, $D$, $S$, DBI, XB, SD and $S_{\textit{Dbw}}$ [9], from 5 aspects: monotonicity, noise, density, subclusters and skewed distributions. For each aspect, they generated a synthetic data set for experiments using K-means. In their study, $S_{\textit{Dbw}}$ performs best and DBI takes second place. However, $S_{\textit{Dbw}}$ does not work well with arbitrarily shaped clusters [9]. DBI scores lower on the subclusters aspect because of the initial cluster centers issue, which can be solved by our hierarchical initialization approach. In addition, DBI has better discrimination than the others. The Dunn index [11] is also an internal evaluation metric. Compared with the Dunn index, DBI is not sensitive to boundary points and can find the right number of clusters when that number is larger than 2 [12, 13].

In another paper [14], Arbelaitz et al. compared 30 cluster validity indices in many environments with different characteristics. The indices are $D$, CH, $G$, CI, DB, Sil, $D^{\textit{MST}}$, $D^{\textit{RNG}}$, $D^{\textit{GG}}$, $\textit{DB}^{\textit{MST}}$, $\textit{DB}^{\textit{RNG}}$, $\textit{DB}^{\textit{GG}}$, gD31, gD41, gD51, gD33, gD43, gD53, SDbw, CS, $\textit{DB}^{*}$, SF, Sym, SymDB, SymD, Sym33, COP, NI, SV and OS [14]. They used 3 clustering algorithms (k-means, Ward and average-linkage) to compute partitions from the datasets and replicated all the experiments using three partition similarity measures: Adjusted Rand, Jaccard and Variation of Information. They used 720 synthetic datasets and 20 real UCI datasets. After many experiments on different aspects, including noise and cluster overlap, a statistical significance analysis of the results showed that there are three main groups of indices. The indices in the first group (Silhouette, Davies-Bouldin, Calinski-Harabasz, generalized Dunn, COP and SDbw) behave better than those in the last group (Dunn and its Point Symmetry-Distance based variation, Gamma, C-Index, Negentropy increment and OS-Index), and the differences are statistically significant [14].

The time complexity of DBI is linear in the number of clustered vectors, but if DBI is applied directly to the original data, the time cost will still be very high and may be intolerable. With our hierarchical approach, however, we can calculate DBI at the top level, where the amount of data is very limited, usually no more than 1000 instances. The time cost therefore becomes bearable. Based on the above analysis, DBI is a suitable index for determining the number of clusters, and it is easy to integrate into the hierarchical initialization K-means algorithm.

2.3 HIKM

HIKM [2] is our previous work. Its main flow is as follows. First, all data are treated as weighted data, and we sample the preprocessed data level by level so as to reduce the amount of data while maintaining the information of the original data. Then, the clustering algorithm is carried out on the top level to get the initial cluster centers. Finally, these cluster centers are propagated from high level to low level so as to obtain the initial centers of the original data.

Here, we take 2-dimensional data as an example. First, we rasterize the original data, and we average the weights of the data in the grid when sampling from a low level to a high level. For non-weighted data, the weight is set to 1. The original data after rasterizing is called the 1st level, the data after one sampling is called the 2nd level, and so on until the top level, where sampling ends. Suppose that one instance $P$ at the $(k+1)$th level corresponds to four instances at the $k$th level, namely $(i,j,w_{k,i,j})$, $(i+1,j,w_{k,i+1,j})$, $(i,j+1,w_{k,i,j+1})$ and $(i+1,j+1,w_{k,i+1,j+1})$, where the first two elements are the 2D coordinates and the third one, e.g. $w_{k,i,j}$, is the weight. If $i$ and $j$ are both odd numbers, then $P$ can be represented as $(\frac{i+1}{2},\frac{j+1}{2},\frac{w_{k,i,j}+w_{k,i+1,j}+w_{k,i,j+1}+w_{k,i+1,j+1}}{4})$.
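The 2D sampling rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the grid is stored as a dictionary from 1-based integer cell coordinates to weights, cells that are absent from the dictionary are treated as having weight 0, and the function name is hypothetical.

```python
from collections import defaultdict

def sample_one_level(grid):
    """Merge each 2x2 block of cells at level k into one cell at level k+1.

    grid maps 1-based integer coordinates (i, j) to a weight.
    Cells missing from the dict count as weight 0 (an assumption).
    """
    parent = defaultdict(float)
    for (i, j), w in grid.items():
        # For odd i, both i and i+1 map to (i+1)/2; same for j.
        # Each child contributes a quarter of its weight to the average.
        parent[((i + 1) // 2, (j + 1) // 2)] += w / 4.0
    return dict(parent)
```

Applied to a full 2x2 block of unit weights, the sketch returns one cell of weight 1; a lone cell in an otherwise empty block yields weight 0.25, so sparse regions are down-weighted relative to dense ones.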

Property 1. Using the minimum sum of squared errors as the criterion, the weights can be seen as densities and a cluster center as a center of gravity.

Property 2. Based on the sampling method, the coordinates at the $k$th and $(k+1)$th levels have the following relation:

$\displaystyle 2X_{k+1}\geqslant X_{k},\qquad 2X_{k+1}-X_{k}\leqslant 1$

where $X_{k}$ represents the coordinate along one axis at the $k$th level. The above expression means that the error introduced by sampling for the same cluster center between 2 neighboring levels is no greater than 1.
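As a short worked check of Property 2, recall the sampling rule of this section: an odd coordinate $i$ and its neighbor $i+1$ at the $k$th level share the coordinate $\frac{i+1}{2}$ at the $(k+1)$th level. The two possible cases are then:

$X_{k}=i:\quad 2X_{k+1}-X_{k}=(i+1)-i=1$

$X_{k}=i+1:\quad 2X_{k+1}-X_{k}=(i+1)-(i+1)=0$

In both cases $0\leqslant 2X_{k+1}-X_{k}\leqslant 1$, which is the stated relation. For instance, the coordinates 5 and 6 both map to 3, giving the offsets 1 and 0.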

With the above sampling method, the amount of data is reduced greatly and most of the noise is filtered, while the information of the original data is well preserved. Finally, the cluster centers at the $(k+1)$th level are propagated to the $k$th level and used as its initial cluster centers, until the level of the original data is reached.

3. Algorithm description

This section will detail our algorithm, i.e. Davies Bouldin Index based hierarchical initialization K-means (DHIKM) algorithm.

3.1 Algorithm flowchart

The DHIKM algorithm first determines the number of clusters and the initial cluster centers automatically. Then, it completes K-means clustering without any control parameters. The determination of the initial cluster centers has been addressed in our previous work, so we mainly focus on determining the number of clusters. The algorithm consists of 4 main steps: data transformation, bottom-up sampling, determining the number of clusters and top-down clustering. The details are described in Algorithm 1.

Algorithm 1. Davies Bouldin Index based hierarchical initialization K-means (DHIKM)

Step 1: Execute a linear transformation on the original data so that their values are integers in the range $[1,\ldots,2^{N}]$.

Step 2: Use hierarchical sampling level by level, and ensure that the size $n$ of the final data after sampling is no less than 150 [2].

Step 3: Starting from $K=2$, carry out K-means on the top level and use Eq. (1) to calculate DBI,

$\displaystyle\textit{DBI}=\frac{1}{K}\sum_{i=1}^{K}\max\limits_{j\neq i}\frac{S_{i}+S_{j}}{d_{i,j}}$

then set $K\leftarrow K+1$ and repeat Step 3. If the DBI for $K$ is larger than that for $K-1$ for $T$ consecutive times, the iteration ends. $K-T$ is then taken as the number of clusters, and the corresponding centers are obtained at the same time. $T$ is called the size of the test window.

Step 4: If $K*20>n$, redo Step 3 using the data on the second top level after sampling.

Step 5: Execute clustering top-down, level by level, propagating the centers at the $(k+1)$th level to the $k$th level, until the original data level is reached.

Step 6: Perform inverse transform of Step 1 to get the actual initial centers.

3.2 Algorithm analysis

The linear transformation rule in Step 1 is $(2^{N}-1)(x-\textit{mi})/(\textit{ma}-\textit{mi})+1$, where ma and mi are the maximum and minimum of the coordinate, respectively. Under the Euclidean distance, a linear transformation only changes the absolute size of a distance, while relative distances remain the same. So the sums of squared errors of the original data and of the transformed data differ only by one coefficient.
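The transformation of Step 1 and its inverse in Step 6 can be sketched as follows. The function names are hypothetical, rounding to the nearest integer is an assumption (the rule above only states the linear map), and the sketch assumes that the maximum is strictly larger than the minimum.

```python
def to_grid(x, lo, hi, n_bits):
    """Map x in [lo, hi] to an integer in [1, 2**n_bits] (Step 1).

    lo and hi play the roles of mi and ma; hi > lo is assumed.
    """
    scaled = (2 ** n_bits - 1) * (x - lo) / (hi - lo) + 1
    return round(scaled)  # rounding to the nearest integer is an assumption

def from_grid(y, lo, hi, n_bits):
    """Inverse of the linear map (Step 6); y may be a non-integer center."""
    return (y - 1) * (hi - lo) / (2 ** n_bits - 1) + lo
```

With $N=4$ and a coordinate range of $[0,10]$, the minimum maps to 1 and the maximum to 16, and the inverse map sends these grid values back to 0 and 10.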

If the size of a dimension of the data is not an integer power of 2, we expand it to an integer power of 2.

The strategy for determining the best number of clusters $K$ in Step 3 is as follows. We calculate the DBI values when the number of clusters is $K$, $K+1,\ldots,K+T$, respectively. The number of clusters with the minimum DBI is the one we need. $T$ is the size of the test window. Suppose the DBI value for $X$ clusters is $D_{X}$. When $T$ consecutive DBI values are bigger than $D_{X}$, we take $D_{X}$ as the minimum.

If the average number of instances per cluster is less than 20 after sampling, we use the data on the second top level to recalculate DBI.

After that, the initial cluster centers are determined as follows. The data points are sorted by weight, and we select the first $C$ points as the initial cluster centers. Therefore, the initial centers are unique for a given number of clusters $K$, and the clustering result is unique too.
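A minimal sketch of this selection step is given below. It assumes that the points are sorted by decreasing weight, so that the heaviest (densest) cells come first; the sort direction, the function name and the data layout are assumptions of this sketch.

```python
def initial_centers(cells, c):
    """Pick the c heaviest cells as initial cluster centers.

    cells: list of (coords, weight) pairs at the top level.
    Sorting is by decreasing weight (assumed). Python's sort is stable,
    so ties keep their input order and the result is deterministic.
    """
    ranked = sorted(cells, key=lambda cell: cell[1], reverse=True)
    return [coords for coords, _ in ranked[:c]]
```

Because the ranking is deterministic, the same top-level data always give the same initial centers, which is why the clustering result does not vary between runs.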

3.3 Complexity analysis

The main time cost of DHIKM lies in the computation of DBI in Step 3 and the top-down iterative clustering in Step 5; the time cost of sampling is almost negligible.

Assume that the dimension of the dataset is $D$ and the size of each dimension is $S$. Then the upper bound of the original data scale can be expressed as $M=S^{D}$, and the upper bound of the data scale on the level where sampling ends is $M*(\frac{1}{2^{D}})^{k}$ [2]. Assuming that the number of clusters is $K$, the iterative computation for one DBI is $M*(\frac{1}{2^{D}})^{k}+Kn$. The computation of DBI is only executed on the level where sampling ends. The data scale of this level is very small compared with that of the original level, so the time cost of computing DBI is also small. For the top-down clustering iteration, the initial centers at each level are very close to the real cluster centers, and the distances in each dimension are less than one unit length, as pointed out by Property 2 in Section 2. Therefore, only a few rounds of iterations (usually about 6) are needed for convergence, which means that clustering on each level is very fast.
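To give a feeling for these bounds, the small sketch below evaluates $M*(\frac{1}{2^{D}})^{k}$ and finds how many samplings are needed to bring the bound under a target size. Here $k$ is read as the number of samplings performed, and the function names are hypothetical.

```python
def top_level_bound(s, d, k):
    """Upper bound M * (1 / 2**d)**k on the data scale after k samplings,
    where M = s**d bounds the original data scale."""
    return s ** d / (2 ** d) ** k

def levels_needed(s, d, target):
    """Smallest k whose bound is at most `target` cells."""
    k = 0
    while top_level_bound(s, d, k) > target:
        k += 1
    return k
```

For a 2-dimensional grid with $S=1024$, the bound drops from $2^{20}$ cells to 1024 after 5 samplings and to 256 after 6, which illustrates why DBI on the top level is cheap compared with DBI on the original data.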

In summary, the DHIKM algorithm needs only two control parameters. One is $N$, which determines the scale of the instances after sampling, and the other is $T$, the size of the test window for DBI. The parameter $N$ relates to the size $n$ of the final data after sampling, and $n$ in turn relates to the number of clusters $K$ obtained at last: they should satisfy $n>K*20$, which means that each cluster has no fewer than 20 instances on average, so as to guarantee enough instances for each cluster. Once these parameters are set, the algorithm executes automatically.

4. Experiments and results

In order to evaluate the performance of the algorithm, we carry out two groups of experiments. First, we test the capability of determining the number of clusters automatically on 40 datasets, which include 12 UCI data sets and 28 simulation data sets. Then, we compare our DHIKM with 3 other clustering algorithms: ISODATA [15], ASC [4] and NC-estimation [16].

Figure 1.

Simulation data sets for performance experiment of DHIKM.

4.1 Performance of determining the number of clusters

The main aim of our DHIKM is to determine the number of clusters $K$ automatically by integrating DBI into our hierarchical initialization K-means algorithm. So we test its capability of determining the number of clusters automatically on 40 datasets, which include 12 UCI data sets and 28 simulation data sets (see Fig. 1). The real UCI datasets all contain noise. Among the 28 synthetic datasets, all contain noise except one, which strictly follows Gaussian distributions. To make the test comprehensive, we choose diversified data: the number of dimensions ranges from 2 to 147, the number of instances from 150 to 5000 and the number of clusters from 2 to 20. At the same time, the data sets have different distributions, such as Gaussian, uniform and skewed distributions.

The test window size for automatically determining the number of clusters in the DHIKM algorithm is set to 3. Since the cluster centers after hierarchical sampling are unique, and the projection rule of the cluster centers from the $(k+1)$th to the $k$th level is the same throughout the top-down clustering process, the clustering result is unique, too. The results are shown in Table 1. In the column "Distribution", we use the term "mixed" to mean that the dataset cannot be described by only one kind of data distribution, which may be caused by the larger number of dimensions or instances of the dataset.

Table 1
Results of the number of clusters (k)

Datasets	Dimensions	Instances	Distribution	True number	Experiment value	Results
Dataset1	2	300	Skewed	3	3	$\surd$
Dataset2	2	1000	Uniform	4	5	$\times$
Dataset3	2	238	Mixed	3	4	$\times$
Dataset4	8	1484	Mixed	10	11	$\times$
Dataset5	35	301	Mixed	19	21	$\times$
Dataset6	2	1000	Uniform	20	22	$\times$
Dataset7	2	1000	Skewed	2	2	$\surd$
Dataset8	2	1000	Gaussian	4	4	$\surd$
Dataset9	34	351	Mixed	2	2	$\surd$
Dataset10	5	4339	Skewed	2	2	$\surd$
Dataset11	4	150	Gaussian	3	3	$\surd$
Dataset12	13	217	Mixed	3	3	$\surd$
Dataset13	7	210	Gaussian	3	3	$\surd$
Dataset14	2	238	Mixed	3	3	$\surd$
Dataset15	8	1473	Skewed	2	2	$\surd$
Dataset16	4	132	Skewed	4	4	$\surd$
Dataset17	2	3400	Gaussian	4	4	$\surd$
Dataset18	2	1000	Gaussian	4	4	$\surd$
Dataset19	2	512	Skewed	4	4	$\surd$
Dataset20	9	178	Mixed	6	6	$\surd$
Dataset21	147	168	Mixed	9	9	$\surd$
Dataset22	2	3000	Gaussian	3	3	$\surd$
Dataset23	3	800	Mixed	2	2	$\surd$
Dataset24	2	2000	Gaussian	4	4	$\surd$
Dataset26	2	185	Skewed	4	4	$\surd$
Dataset27	3	800	Mixed	8	8	$\surd$
Dataset28	2	312	Mixed	3	3	$\surd$
Dataset29	2	1000	Gaussian	4	4	$\surd$
Dataset30	4	400	Skewed	4	4	$\surd$
Dataset31	2	850	Mixed	5	5	$\surd$
Dataset32	2	900	Gaussian	6	6	$\surd$
Dataset33	2	250	Mixed	5	5	$\surd$
Dataset34	2	1500	Gaussian	6	6	$\surd$
Dataset35	2	300	Uniform	6	7	$\times$
Dataset36	2	5000	Mixed	16	17	$\times$
Dataset37	2	5000	Mixed	15	15	$\surd$
Dataset38	2	3031	Mixed	9	8	$\times$
Dataset39	3	400	Skewed	4	4	$\surd$
Dataset40	7	336	Mixed	8	8	$\surd$

As shown in Table 1, among the 40 datasets, our algorithm obtained 32 correct results and 8 incorrect results. The performance of DHIKM is good in general, but DHIKM tends to be incorrect when the number of clusters is large and the dimensions or instances are not large enough. In addition, DHIKM performs well on data sets with a Gaussian distribution but badly on data sets with a uniform distribution. When the data are uniformly distributed, it is hard for the hierarchical initialization approach to choose the initial centers, because every point has a similar weight. For the special simulation data sets (see Fig. 2), whose number of clusters is difficult to find even for a human, the experimental results indicate that our DHIKM performs better in automatically determining the number of clusters.

4.2 Comparison with other algorithms

To demonstrate the clustering capability of DHIKM, we choose 3 other algorithms for comparison, all of which can automatically determine the number of clusters. They are ISODATA, ASC [4] and NC-estimation [16].

4.2.1 Comparison with ISODATA

First, we compare our DHIKM with Iterative Self-organizing Data Analysis Techniques Algorithm (ISODATA). ISODATA gets more reasonable number of clusters by splitting and merging clusters [3]. However, it requires the following control parameters:

k: Desired number of clusters

L: Maximum number of clusters that can be merged at one time

I: Maximum number of iterations

ON: Minimum number of samples per cluster

OC: Closeness criterion

OS: Elongation criterion

We select Iris and Wine datasets to test, and the ISODATA control parameter setting is as follows [17]:

Iris: $k=$ 10; L $=$ 2; I $=$ 100; ON $=$ 10; OC $=$ 2.5; OS $=$ 0.6;

Wine: $k=$ 10; L $=$ 2; I $=$ 100; ON $=$ 1; OC $=$ 10; OS $=$ 0.001;

With these parameter settings, ISODATA obtains the correct number of clusters. The results are shown in Tables 2 and 3.

Table 2
Performance comparison between the DHIKM and ISODATA algorithms on Iris dataset

	Number of clusters k	Sum of the squared errors	Running time(s)	Purity(%)
DHIKM	3	78.85	0.3036	89.33
ISODATA	3	98.06	0.0456	78.67

Figure 2.

Two special simulation datasets.

From Tables 2 and 3, we can draw the following conclusions. ISODATA converges faster, but it has poor clustering performance and needs many control parameters. Although DHIKM runs slower than ISODATA, it has good clustering performance and needs few parameters (see Section 3.3).

Table 3

Performance comparison between the DHIKM and ISODATA algorithms on Wine dataset

	Number of clusters k	Sum of the squared errors	Running time(s)	Purity(%)
DHIKM	3	1895934.00	0.4291	96.63
ISODATA	3	6516694.00	0.0663	66.29

4.2.2 Clustering performance comparison

This section verifies whether the clustering performance is affected after the DBI measure is introduced into the HIKM algorithm. The common evaluation metrics used to measure the clustering performance are the sum of the squared errors (SSE) and purity.
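For reference, the two metrics can be computed as in the following minimal sketch, using their standard definitions. The function names and the data layout are assumptions, and purity is returned as a fraction, whereas the tables report it as a percentage.

```python
import math
from collections import Counter

def sum_squared_errors(points, labels, centers):
    """Sum of squared Euclidean distances from each point to its center."""
    return sum(math.dist(x, centers[c]) ** 2 for x, c in zip(points, labels))

def purity(labels, classes):
    """Fraction of points that carry the majority class of their cluster."""
    per_cluster = {}
    for c, y in zip(labels, classes):
        per_cluster.setdefault(c, Counter())[y] += 1
    majority = sum(max(counts.values()) for counts in per_cluster.values())
    return majority / len(labels)
```

A lower SSE means more compact clusters, while a higher purity means better agreement with the known classes. Purity requires class labels, so it is used here only for evaluation and not inside the algorithm.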

We compared the DHIKM algorithm with the HIKM method and with K-means combined directly with DBI. Since HIKM cannot automatically determine the number of clusters, we use the real number of clusters for it. For this part, we use 3 data sets: Iris, Glass and KRK, a chess endgame database that contains 28056 instances, 6 attributes and 18 clusters. The average clustering results on the three data sets over 100 runs are shown in Tables 4 to 6. We also give the average number of iterations and the running time.

Table 4
Performance comparison of the three algorithms on Iris dataset

	Number of iterations	Sum of the squared errors	Running time(s)	Purity(%)
K-means $+$ DBI	6.4	85.24	0.3322	87.07
HIKM	3.0	78.85	0.2886	89.33
DHIKM	3.0	78.85	0.3036	89.33

Table 5

Performance comparison of the three algorithms on Glass dataset

	Number of iterations	Sum of the squared errors	Running time (s)	Purity (%)
K-means $+$ DBI	15.2	357.44	2.9417	56.92
HIKM	3.0	336.27	1.8186	58.88
DHIKM	3.0	336.27	2.7385	58.88

As shown in Tables 4 to 6, the number of iterations, the sum of the squared errors and the purity are the same for the DHIKM and HIKM algorithms, which means that the introduction of DBI has no influence on the clustering result of HIKM. Although the introduction of DBI makes the time cost of DHIKM higher than that of HIKM, the running time of DHIKM is still acceptable in practice. Moreover, the running time of DHIKM is shorter than that of K-means combined directly with DBI.

Table 6

Performance comparison of the three algorithms on KRK dataset

	Number of iterations	Sum of the squared errors	Running time (s)	Purity (%)
K-means $+$ DBI	24.6	184974.13	385.26	23.2571
HIKM	11.0	181536.91	272.12	82.8199
DHIKM	11.0	181536.91	381.38	82.8199

Table 7

Number of clusters obtained by DHIKM, ASC and NC-estimation

Data sets	True number	DHIKM	ASC	NC-estimation (Sil)	NC-estimation (DBI)	NC-estimation (CH)
Iris	3	3	5	3	3	3
Wine	3	3	3	8	8	9
Dataset1	3	3	3	8	3	6
Dataset2	4	5	5	9	4	2
Dataset3	5	5	5	2	5	2
Dataset4	15	15	20	10	13	2

Figure 3.

Over-fitting result by ASC.

4.2.3 Comparison with ASC and NC-estimation

Sanguinetti et al. [4] introduced a spectral clustering algorithm that can automatically determine the number of clusters (ASC), and it is based on the Mahalanobis distance [18]. The key of ASC is the elongated K-means, which downweights distances along radial directions and penalizes distances along transversal directions. It keeps adding a center at the origin and computing another eigenvector, and this procedure is repeated until no points are assigned to the center at the origin, leaving the last cluster empty. NC-estimation [16] calculates 4 external validity indices and 8 internal validity indices, including Rand, Adjusted Rand, Mirkin, Hubert, Silhouette (Sil), DBI, Calinski-Harabasz (CH), Krzanowski-Lai, Hartigan, weighted inter-intra and Homogeneity-Separation, for each $K$ beginning with 2, and then finds the number of clusters indicated by each of these 12 indices. In our test, we only focus on the internal validity indices and only show the 3 most effective ones: Sil, DBI and CH.

We choose 6 data sets, including 2 UCI data sets and 4 simulation data sets, to compare our DHIKM with ASC and NC-estimation. The 2 UCI data sets are Iris and Wine, and the simulation data sets are the first 4 data sets in Fig. 1. The results are shown in Table 7.

Clearly, DHIKM performs much better than ASC and NC-estimation. ASC easily produces an over-fitting result when outliers are present. Figure 3 shows a clustering result by ASC in which 1 pink diamond dot and 2 red round dots are grouped into 2 additional clusters, which is obviously a bad decision.

Meanwhile, we found that ASC relies heavily on the data distribution. It has two control parameters that must be set, and one of them changes a lot with different datasets and has great influence on the clustering results. If the data distribution is regular, ASC gets the right number of clusters. If there are outliers in the data set, ASC easily runs into an over-fitting problem. Compared with ASC, our DHIKM has fewer control parameters and depends less on the data distribution. For NC-estimation, the result based on DBI is better than those of the other two indices, but still worse than that of our DHIKM.

From the comparison with ISODATA, ASC and NC-estimation, DHIKM needs fewer control parameters and performs better. Moreover, it can handle outliers well.

5. Conclusion

In this paper, we propose the Davies Bouldin Index based hierarchical initialization K-means (DHIKM) algorithm, which aims to determine the number of clusters automatically. Our experiments on UCI datasets and simulation datasets assess DHIKM in terms of whether it can automatically determine the number of clusters, its clustering performance and its practicality. DHIKM tends to get an incorrect number of clusters for uniformly distributed data. In fact, for a uniform distribution, finding the correct number of clusters is challenging even for a human. The algorithm performs well on Gaussian distributions, which are very common in real applications. Compared with other clustering algorithms that can automatically determine the number of clusters, our DHIKM gives much better results and depends less on the data distribution. Besides, it needs fewer control parameters, which means it can be used easily. Its main shortcoming is that the time cost is a little higher.

Acknowledgments

This work was supported in part by Jiangsu Natural Science Foundation (No. BK20131351), by Jiangsu Provincial Science and Technology Support Program (No. BE2014714), by the National Natural Science Foundation of China (NSFC) (No. 91220301, 61233011), by the Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions, by the 111 Project (No. B13022), and by Six talent peaks project in Jiangsu Province, by the project Funded by TCM Informatics Key Discipline of NJUCM (No. ZYYXXX-13).

References

1. Jain A.K., Data clustering: 50 years beyond k-means, Pattern Recognition Letters 31 (2010), 651–666.

2. Tang and Yang J.Y., Hierarchical initialization approach for k-means clustering, Pattern Recognition Letters 29 (2008), 787–795.

3. Cai-Hong, Qin and Shi-Bin, A hybrid pso-isodata algorithm for remote sensing image segmentation, in: Proceedings of the 2012 International Conference on Industrial Control and Electronics Engineering (ICICEE), Xi’an, China: IEEE, 2012, pp. 1371–1375.

4. Sanguinetti, Laidler and Lawrence N.D., Automatic determination of the number of clusters using spectral algorithms, in: Machine Learning for Signal Processing, 2005 IEEE Workshop on, 2005, pp. 55–60.

5. Davies D.L. and Bouldin D.W., A cluster separation measure, IEEE Transactions on Pattern Analysis and Machine Intelligence (1979), 224–227.

6. Rendon, Abundez, Arizmendi and Quiroz, Internal versus external cluster validation indexes, International Journal of Computers and Communications 5 (2011), 27–34.

7. Strehl and Ghosh, Cluster ensembles – a knowledge reuse framework for combining multiple partitions, The Journal of Machine Learning Research 3 (2003), 583–617.

8. Manning C.D. et al., Introduction to information retrieval, Cambridge University Press Cambridge 1 (2008).

9. Halkidi, Batistakis and Vazirgiannis, On clustering validation techniques, Journal of Intelligent Information Systems 17 (2001), 107–145.

10. Liu, Xiong, Gao and Wu, Understanding of internal clustering validation measures, in: 2010 IEEE 10th International Conference on Data Mining (ICDM), 2010, pp. 911–916.

11. Dunn J.C., A fuzzy relative of the isodata process and its use in detecting compact well-separated clusters, Journal of Cybernetics 3 (1973), 32–57.

12. Zalik K.R. and Zalik, Validity index for clusters of different sizes and densities, Pattern Recognition Letters 32 (2011), 221–234.

13. Kovács, Legány and Babos, Cluster validity measurement techniques, in: 6th International Symposium of Hungarian Researchers on Computational Intelligence, Citeseer, 2005.

14. Arbelaitz, Gurrutxaga, Muguerza, Pérez J.M. and Perona, An extensive comparative study of cluster validity indices, Pattern Recognition 46(1) (2013), 243–256.

15. Ball G.H. and Hall D.J., ISODATA, a novel method of data analysis and pattern classification, Technical Report, DTIC Document, 1965.

16. Wang and Peng, Cvap: validation for cluster analyses, Data Science Journal 8 (2009), 88–93.

17. Liang, Han and Han, A novel diversity measure based on geometrical relationship, in: Iet International Conference on Information Science and Control Engineering, 2012, pp. 1–5.

18. Ng A.Y., Jordan M.I. and Weiss, On spectral clustering: Analysis and an algorithm, Proceedings of Advances in Neural Information Processing Systems 14 (2002), 849–856.