Abstract
It is critical to determine the optimal number of clusters (NC) in cluster analysis. Many cluster validity indices have been proposed, such as the Silhouette index and In-group proportion index. However, these validity indices have more time complexity. From the viewpoint of sample geometry, a new internal cluster validity index for determining the optimal NC is proposed. The new index can evaluate the clustering quality of a certain clustering algorithm and determine the optimal NC for many kinds of data sets, including synthetic data sets, benchmark data sets, and real data sets. Compared with many well-known validity indices, the proposed index is more effective and efficient. Theoretical analysis and experimental results show the effectiveness and high efficiency of the new index.
Introduction
Clustering is the unsupervised classification of data points or samples into groups or clusters, such that samples in the same cluster are similar to each other and samples in different clusters are distinct. It is widely used in many fields, such as data mining, image analysis, and bioinformatics. Cluster validity, which validates the goodness of clustering results, is one of the vital issues of clustering. In general, we do some research on cluster validity by using cluster validity indices. External cluster validity index and internal cluster validity index are two main categories of cluster validity indices. The difference is whether external information is used for cluster validity. Unlike external validity index, internal validity index validate the partition without respect to external information. Since external validity index knows cluster labels in advance, it is mainly used for choosing the optimal clustering algorithm on a specific data set [1]. On the other hand, internal validity index can be used to evaluate the clustering results of a certain clustering algorithm and decide the optimal number of clusters (NC) without any additional information.
In the literature, a number of internal cluster validity indices for crisp clustering have been proposed. Among existing internal cluster validity indices, excellent performance indices include the Davies-Bouldin (DB) index [2], Silhouette (Sil) index [3], Krzanowski-Lai (KL) index [4], Weighted inter-intra (Wint) index [5, 6], and In-group proportion (IGP) index [7]. These indices have difficulty in deciding the optimal NC for many kinds of data sets correctly. As a supplement, we design a novel internal cluster validity index called the centroid-based intra-inter partition (CIP) index. The CIP index can analyse the clustering results of a clustering algorithm and decide the optimal NC for synthetic data sets, benchmark data sets, and real data sets. Theoretical analysis and experimental results demonstrate the effectiveness and high efficiency of the new index and method.
The rest of this paper is organized as follows. The related work is discussed in Section 2. A novel internal validity index is presented in Section 3. A new NC determination algorithm is described in Section 4. Experimental results of multiple types of data sets are provided in Section 5. Finally, the conclusion is provided in Section 6.
Related work
Some research has been done to apply internal validity indices to evaluate clustering quality and identify the optimal NC [8–17]. Zhao et al. [9] proposed the WB-index based on a sum-of-squares. The WB-index obtains its minimum value when the optimal NC is achieved. The relation between WB-index and the other two indices are analysed, and the experimental results shows that the WB-index works slightly better than the other two indices. Starczewski et al. [11] proposed the STR index, which is composed of cluster compactness and cluster separability. The STR index is mainly used to identify the right NC and measure the correctness of different data partition. STR could detect the knee point in the measure of clustering quality. Furthermore, although the two components have different scales, we do not need to normalize them. In [12], Yue et al. proposed the SMV index base on the separation measure. SMV could validate the clustering results of partitional clustering algorithms and identify the correct NC. Bhargavi et al. [13] proposed a novel validation approach which can dynamically terminate the clustering process to determine true clusters. The validity index uses both local proximity relationship and global cluster proximity relationship to validate the clustering process. The validation is a bottom-up approach, so it is not applicable to divisive hierarchical clustering which is a top-down approach. Shieh et al. [14] proposed a robust validity index that analyses the clustering results of subtractive clustering algorithm and obtains the optimal NC based on compactness, separation and partition. In [16], Bezdek et al. proposed a soft generalization of the C index that can validate sets of candidate partitions produced by either probabilistic or fuzzy clustering algorithms. Four generalizations based on relational transformations of the soft partition are defined. The soft C index is also suitable for validating partitions of relational data. Liu et al. [17] proposed the evaluating method for the number of uncertain clusters (ENUC) to determine the NC by combining the compactness and dispersion concepts. The diameter of the kNN boundary is used as the compactness of clusters. The boundary can directly define the intra-cluster distance. The maximum distance between clusters is used as the separateness of clusters. They calculate the ENUC criterion by integrating the aforementioned compactness and separateness. By observing the first minimum value of the ENUC, the appropriate NC can be obtained.
Novel cluster validity index
Cluster validity evaluates the clustering quality of a certain clustering algorithm. There are two major evaluation factors: the intercluster separation and the intracluster compactness. The optimal data partition should maximize the intracluster compactness and intercluster separation simultaneously. In this section, a novel internal cluster validity index, termed the centroid-based intra-inter partition (CIP), is presented to evaluate a group of clustering results from a clustering algorithm with many different clustering numbers.
Definitions of CIP index and related concepts
where k denotes the cluster label,
To reflect the intracluster compactness and intercluster separability of data, we propose the CIP index. Based on the distribution of samples, CIP takes one sample in a cluster as the research object and evaluates the validity of clustering results. Clustering similarity has two main measures: distance and similarity. Regarding distance measure, the more dissimilar two samples are, the farther one sample will be from the other sample. We adopt the Euclidean distance as the distance measure and have the following definition, where ||·|| denotes the Euclidean distance.
According to Definition 8, the above-defined (2), (3), and (6) can be defined as
To analyse the meaning of the CIP index and related concepts conveniently, we illustrate the CIP index with the distribution diagram of Fig. 1, where all samples are clustered into 4 clusters, represented by u, v, w, and j. There is a sample termed i in cluster j. In the case of intracluster structure for sample i, the distance between i and the centroid of cluster j is called the intracluster distance according to Definition 4. In the case of intercluster structure for sample i, we study the relationship between the nearest neighbor cluster and i to reflect the intercluster separability. The nearest neighbor cluster can be obtained by calculating the minimum intercluster distance for i. In Fig. 1, wc (j, i) = a, bc (j, i) = min(b, c, d) = d. Distance d corresponds to cluster w, so cluster w is the nearest neighbor cluster for sample i.

Distribution diagram of clustering structure for CIP.
The internal cluster validation often reflects three measures of clustering partitions including compactness, separation, and connectivity. These measures could be clearly analysed for the CIP index.
(1) Compactness: The cluster compactness in the CIP index is measured by the intracluster distance of a single sample, which is calculated by the distance between the sample and its cluster centroid. We use wc(j, i) defined in (3) to reflect the intracluster compactness. The small value of wc(j, i) indicates the high cluster compactness.
(2) Separation: The cluster separation in the CIP index is measured by the intercluster distance of a single sample, which is calculated by the distance between the sample and the cluster centroid of its nearest neighbor cluster. Compared with other clusters, nearest neighbor cluster can better reflect the intercluster separation of the sample. We use bc(j, i) defined in (2) to reflect the intercluster separation. The large value of bc(j, i) indicates the high cluster separation.
(3) Connectivity: The cluster connectivity relates to what extent neighboring samples are placed in the same cluster. In general, neighboring samples have great probability to be assigned to the same cluster, which means that they may have the same cluster centroid. In the CIP index, the value of wc(j, i) is related to the cluster centroid of a cluster, which may determined by neighboring samples. However, the CIP index does not reflect the connectivity measure directly.
From above analysis, we understand that the CIP index mainly reflect the compactness and separation measures. Since the compactness and separation follows the opposing trends, we use bc(j, i) – wc(j, i) to combine them into a single score. In addition, we use bc (j, i) + wc (j, i) to normalize the CIP index.
CIP(j, i) reflects the cluster validity of a single sample in a data set. To validate a whole data set, we have the following definition.
We can calculate avgCIP(m) to analyse the clustering results of the data set.
We analyse the computational process of the CIP index by comparing it to the well-known validity index, Sil [3]. The study of [10] and [18] has demonstrated that the Sil index outperforms the other existing indices.
The Sil index is designed to identify clusters that are compact and well separated. The compactness is measured based on the distance between all the points in the same cluster and the separation is based on the nearest neighbor distance. According to the definition of Sil, we have the following definitions for Sil.
We can calculate avgSil(m) to analyse the clustering results of a whole data set.
To explain the computational process of the Sil index conveniently, we use the distribution diagram of Fig. 2. In Fig. 2, all samples of the data set are clustered into four clusters, denoted by u, v, w, and j. There is a sample called i in cluster j. In ws(j, i), we need to calculate n j – 1 intra-point distances. In bs(j, i), we need to calculate n – n j inter-point distances and make m – 1 comparisons.

Distribution diagram of clustering structure for Sil.
We explain the computational process of the CIP index by using the distribution diagram of Fig. 1. Above all, to obtain m centroids of the data set, we need to calculate m means according to (1), i.e., n sums are needed. In wc(j, i), we need to calculate one intra-point distance. In bc(j, i), we need to calculate m – 1 inter-point distances and make m – 1 comparisons.
Let n be the number of samples in the data set, d the number of dimensions representing the data, and m the NC. The time complexity of Sil(j, i) is determined by the complexity of both ws(j, i) and bs(j, i). The time complexity of ws(j, i) is O(d2(n j – 1)), bs(j, i) is O(d2(n – n j )+m – 1), and Sil(j, i) is O(d2(n j – 1)+d2(n – n j )+m – 1) = O(nd2 + m). Therefore, according to (17), the total time complexity of the Sil index is O(n(nd2 + m)) = O(n2d2 + mn). The time complexity of CIP(j, i) is determined by the complexity of both wc(j, i) and bc(j, i). The time complexity of m means is O(nd), wc(j, i) is O(d2), bc(j, i) is O(d2(m – 1)+m – 1) = O(md2), and CIP(j, i) is O(nd+d2 + md2) = O(md2 + nd). Therefore, according to (13), the total time complexity of the CIP index is O(nd+n(d2 + md2)) = O(mnd2), which makes it affordable for large-scale and high-dimensional data sets.
Usually, m and d are far less than n (mlaquon and dlaquon). Thus, the time complexity of the Sil index is O(n(nd2 + m)) = O(n2) and the CIP index is O(mnd2) = O(n). Compared with the Sil index, the CIP index is less time-consuming. In addition, we list the time complexity of some validity indices in Table 1, where we can see that CIP is less time-consuming than Sil, Wint, and IGP, and the time complexity of CIP is the same as those of the other validity indices.
Time complexity of six validity indices
Regarding the properties of the CIP index, we present the following theorems and deduction.
and
Since bc(j, i)≥0 and wc(j, i)≥0, then CIP(j, i)≥– 1 and CIP(j, i)≤1. We draw the conclusion that CIP(j, i) ∈ [– 1, 1]. In particular, when bc(j, i) = 0, CIP(j, i) = – 1. When there is only one sample in cluster j, i.e., wc(j, i) = 0, CIP(j, i) = 1.
In theory, CIP(j, i) and avgCIP(m) are in the range of [– 1, 1]. In practice, When there is only one sample in cluster j, we set CIP(j, i) = 0 by convention [3]. So CIP(j, i) and avgCIP(m) are virtually in the range of [– 1, 1).
Because bc(j, i)¬ =0 and bc(j, i) > 0, according to (18), CIP(j, i) > – 1 and CIP(j, i)+1 > 0. According to (20), we have
In (21), the larger CIP(j, i) is, the smaller
Minimizing the ratio of (21) is equivalent to maximizing CIP(j, i), which is important to the determination of the optimal NC.
According to (9) and (22), we have
Therefore, according to (14) and (23), we have
According to (15) and (25), we have
According to Theorem 3 and 4, we draw the conclusion that CIP and Sil are different from each other and they have no necessary connection.

Distribution diagram of clustering structure for CIP’s properties.
According to (9), we have
and
Because
and
Therefore, according to (8), (28), (29), and (30), we have
and
CIP is a ratio and Theorem 5 shows the ranges for the numerator and denominator of CIP. According to Theorem 5, we also draw the conclusion of Theorem 1.
The CIP index analyse the clustering validity of a single sample. The higher the value of CIP is, the better the clustering quality of a single sample will be. We analyse the clustering effects by calculating the average CIP index value of all samples in the data set. The higher the average value of CIP, the better the clustering effects. The optimal NC is the NC corresponding to the maximum average CIP value. According to (13), we have the following formula, where n represents the sample number and kopt represents the optimal NC.
Based on the CIP index, a novel algorithm is presented to evaluate the clustering quality and determine the optimal NC. The CIP-based optimal NC determination (CONCD) algorithm is described in Algorithm 1.
Experimental studies
To show the performance of the CIP index and CONCD algorithm, this paper adopts the following clustering algorithms: affinity propagation (AP) [19], agglomerative hierarchical clustering (AHC) with single linkage, average linkage, and median linkage [20, 21], and divisive hierarchical clustering (DHC) with divisive analysis (DIANA) [22]. AP is a partitional clustering algorithm while AHC and DHC are hierarchical clustering algorithms. The NC cannot be directly provided as an input parameter in AP. Search methods are generally used to obtain the clustering results for a given NC. To achieve the clustering for a specified NC, Frey et al. [19] search an appropriate preference value by using a bisection method. We do experiments on synthetic data sets, benchmark data sets, and real data sets, to examine the CIP index and contrast it with other indices, such as DB, Sil, KL, Wint, and IGP.
When using the CIP index to evaluate the clustering quality, we adopt the Euclidean distance measure for all data sets. In the experiments, the NCs are in the search range of [2, kmax]. Generally, we set

Structure distribution of eleven synthetic data sets.
The AP algorithm clusters a data set based on the similarity matrix of n samples. Usually, the similarity values for AP are negative. The AP algorithm calculates the similarity value by using the squared Euclidean distance for two d-dimensional samples x p and x q in every data set, i.e., s(p, q) = – ||x p – x q ||2.
Number of clusters-index relationship table of E2 data set
Number of clusters-index relationship table of E2 data set
Data set information and the experimental results from using the validity indices to identify the optimal NC are listed in Table 3. Table 3 shows that the Sil and CIP indices can identify the correct optimal NCs for the seven synthetic data sets; the DB index is valid for the seven synthetic data sets except N3 and N4; the IGP index is valid for the E2, F3, and M3 data sets; the Wint index is valid for the K3 and M3 data sets; and the KL index is only valid for the N3 data set.
Experimental results of optimal NCs using AP for synthetic data sets
Experimental results of optimal NCs using AP for benchmark data sets

Clustering number-index relationship graph of Breast.
Data set information and the experimental results from using the validity indices to identify the optimal NCs are listed in Table 5. Table 5 shows that the CIP index can identify the correct optimal NCs for the eight real data sets; the Sil index is valid for all the real data sets except Column_2 C; the IGP index is valid for the Breast, Column_2 C, Pima, and Wdbc data sets; the DB index is valid for the Breast, Seeds, and Wdbc data sets; the KL index is valid for the Heart and Wdbc data sets; and the Wint index is only valid for the Seeds data set.
Experimental results of optimal NCs using AP for real data sets
Experimental results of optimal NCs with single linkage for synthetic data sets
Experimental results of optimal NCs with single linkage for synthetic data sets
Experimental results of optimal NCs with single linkage for real data sets
Experimental results of optimal NCs with average linkage for synthetic data sets
Experimental results of optimal NCs with average linkage for synthetic data sets
Experimental results of optimal NCs with average linkage for real data sets
Experimental results of optimal NCs with median linkage for synthetic data sets
Experimental results of optimal NCs with median linkage for synthetic data sets
Experimental results of optimal NCs with median linkage for real data sets
Experimental results of optimal NCs for seven synthetic data sets
Experimental results of optimal NCs for seven synthetic data sets
Experimental results of optimal NCs for eight real data sets
From the viewpoint of sample geometry, we propose a new cluster validity index named CIP, which is a commonly designed index based on the idea of the nearest neighbor cluster and clustering centroid. CIP can be used for validating the clustering results of a certain clustering algorithm. AP, AHC with single linkage, average linkage, and median linkage, and DIANA divisive hierarchical clustering are adopted as the clustering algorithms for our experiments. When we adopt the AP clustering algorithm, CIP identifies the correct optimal NCs for all the experimental data sets, and Sil identifies the same results except the Column_2 C data set. When we adopt the AHC with single linkage algorithm, CIP identifies the correct optimal NCs for all the experimental data sets, and Sil identifies the same results except the Sonar data set. When we adopt the AHC with average linkage algorithm, CIP and Sil identify the correct optimal NCs for all the experimental data sets. When we adopt the AHC with median linkage algorithm, CIP and Sil also identify the correct optimal NCs for all the experimental data sets. When we adopt the DHC with DIANA algorithm, CIP identifies the correct optimal NCs for all the experimental data sets, and Sil identifies the same results except the Seeds data set.
The AHC with single linkage algorithm along with the CIP index identifies the correct optimal NCs for the RA6 and J2 data sets, which may be used to decide that the clustering results of AHC with single linkage are suitable for data structures with complicated and nonconvex distribution. The AHC with average linkage or median linkage algorithm along with the CIP index identifies the correct optimal NCs for the K3 and M3 data sets, which may be used to decide that the clustering results of AHC with average linkage or median linkage are suitable for data structures with overlap distribution.
CIP and Sil excel over the other indices for choosing the optimal NC. Sil is a famous cluster validity index, and it outperforms the other existing indices in the literature. The time complexity of CIP is O(n), whereas Sil is O(n2). Contrasted with Sil, CIP is more efficient. Theoretical analysis and experimental results demonstrate that the proposed index and method analyse the clustering results effectively and are suitable for determining the optimal NC for various types of data sets. Further research work will include applications of the new index for other clustering algorithms on multiple data sets.
Footnotes
Acknowledgments
The authors would like to thank Dongdong Cheng of Chongqing University for her help in some data sets. This work was supported in part by the Fundamental Research Funds for the Central Universities (Grant JUSRP11235) and in part by the National Natural Science Foundation of China (Grant 61673193 and 61833007).
