A novel internal cluster validity index

Abstract

Determining the optimal number of clusters (NC) is critical in cluster analysis. Many cluster validity indices have been proposed, such as the Silhouette index and the In-group proportion index. However, several of these indices have high time complexity. From the viewpoint of sample geometry, a new internal cluster validity index for determining the optimal NC is proposed. The new index can evaluate the clustering quality of a given clustering algorithm and determine the optimal NC for many kinds of data sets, including synthetic data sets, benchmark data sets, and real data sets. Theoretical analysis and experimental results show that, compared with many well-known validity indices, the proposed index is more effective and efficient.

Keywords

Cluster validity index, number of clusters, affinity propagation, hierarchical clustering

1 Introduction

Clustering is the unsupervised classification of data points or samples into groups or clusters, such that samples in the same cluster are similar to each other and samples in different clusters are distinct. It is widely used in many fields, such as data mining, image analysis, and bioinformatics. Cluster validity, which assesses the goodness of clustering results, is one of the vital issues of clustering. In general, cluster validity is studied by means of cluster validity indices, which fall into two main categories: external and internal. The difference is whether external information is used for cluster validity. Unlike an external validity index, an internal validity index validates the partition without reference to external information. Since an external validity index knows the cluster labels in advance, it is mainly used for choosing the optimal clustering algorithm on a specific data set [1]. An internal validity index, on the other hand, can be used to evaluate the clustering results of a given clustering algorithm and decide the optimal number of clusters (NC) without any additional information.

In the literature, a number of internal cluster validity indices for crisp clustering have been proposed. Among the existing internal cluster validity indices, those with excellent performance include the Davies-Bouldin (DB) index [2], Silhouette (Sil) index [3], Krzanowski-Lai (KL) index [4], Weighted inter-intra (Wint) index [5, 6], and In-group proportion (IGP) index [7]. However, these indices have difficulty in deciding the optimal NC correctly for many kinds of data sets. As a supplement, we design a novel internal cluster validity index called the centroid-based intra-inter partition (CIP) index. The CIP index can analyse the clustering results of a clustering algorithm and decide the optimal NC for synthetic data sets, benchmark data sets, and real data sets. Theoretical analysis and experimental results demonstrate the effectiveness and high efficiency of the new index and method.

The rest of this paper is organized as follows. The related work is discussed in Section 2. A novel internal validity index is presented in Section 3. A new NC determination algorithm is described in Section 4. Experimental results of multiple types of data sets are provided in Section 5. Finally, the conclusion is provided in Section 6.

2 Related work

Some research has been done to apply internal validity indices to evaluate clustering quality and identify the optimal NC [8–17]. Zhao et al. [9] proposed the WB-index based on a sum-of-squares. The WB-index obtains its minimum value when the optimal NC is achieved. The relation between the WB-index and the other two indices is analysed, and the experimental results show that the WB-index works slightly better than the other two indices. Starczewski et al. [11] proposed the STR index, which is composed of cluster compactness and cluster separability. The STR index is mainly used to identify the right NC and measure the correctness of different data partitions. STR can detect the knee point in the measure of clustering quality. Furthermore, although the two components have different scales, they do not need to be normalized. In [12], Yue et al. proposed the SMV index based on the separation measure. SMV can validate the clustering results of partitional clustering algorithms and identify the correct NC. Bhargavi et al. [13] proposed a novel validation approach which can dynamically terminate the clustering process to determine true clusters. The validity index uses both the local proximity relationship and the global cluster proximity relationship to validate the clustering process. The validation is a bottom-up approach, so it is not applicable to divisive hierarchical clustering, which is a top-down approach. Shieh et al. [14] proposed a robust validity index that analyses the clustering results of the subtractive clustering algorithm and obtains the optimal NC based on compactness, separation and partition. In [16], Bezdek et al. proposed a soft generalization of the C index that can validate sets of candidate partitions produced by either probabilistic or fuzzy clustering algorithms. Four generalizations based on relational transformations of the soft partition are defined. The soft C index is also suitable for validating partitions of relational data. Liu et al. [17] proposed the evaluating method for the number of uncertain clusters (ENUC) to determine the NC by combining the compactness and dispersion concepts. The diameter of the kNN boundary is used as the compactness of clusters. The boundary can directly define the intra-cluster distance. The maximum distance between clusters is used as the separateness of clusters. They calculate the ENUC criterion by integrating the aforementioned compactness and separateness. By observing the first minimum value of the ENUC, the appropriate NC can be obtained.

3 Novel cluster validity index

Cluster validity evaluates the clustering quality of a certain clustering algorithm. There are two major evaluation factors: the intercluster separation and the intracluster compactness. The optimal data partition should maximize the intracluster compactness and intercluster separation simultaneously. In this section, a novel internal cluster validity index, termed the centroid-based intra-inter partition (CIP), is presented to evaluate a group of clustering results from a clustering algorithm with many different clustering numbers.

3.1 Definitions of CIP index and related concepts

Definition 1. Let the data set be X = {x₁, x₂, ..., x_n}, where x_i represents the ith sample. Supposing that the n samples are clustered into m clusters, we take the mean of all samples in cluster k as the centroid μ_k of cluster k, which is defined as $μ_{k} = \frac{1}{n_{k}} \sum_{p = 1}^{n_{k}} x_{p}^{(k)}$ (1)

where k denotes the cluster label, $x_{p}^{(k)}$ denotes the pth sample in cluster k, and n_k denotes the number of samples in cluster k.

Definition 2. In Definition 1, we take D(x_i, μ_k) as the dissimilarity between sample x_i and the centroid μ_k of cluster k.

Definition 3. In Definition 1, we take the minimum dissimilarity between the ith sample of cluster j and every centroid of other clusters as the minimum intercluster dissimilarity bc(j, i), which is defined as $bc (j, i) = min_{1 ⩽ k ⩽ m, k \neq j} D (x_{i}^{(j)}, μ_{k})$ (2) where j and k denote cluster labels, and $x_{i}^{(j)}$ denotes the ith sample in cluster j.

Definition 4. In Definition 1, we take the dissimilarity between the ith sample of cluster j and the centroid of cluster j as the intracluster dissimilarity wc(j, i), which is defined as $wc (j, i) = D (x_{i}^{(j)}, μ_{j}) .$ (3)

Definition 5. In Definition 1, we take the sum of the minimum intercluster dissimilarity and the intracluster dissimilarity for the ith sample in cluster j as the clustering dissimilarity bawc(j, i), which is defined as $\begin{matrix} bawc (j, i) = bc (j, i) + wc (j, i) \\ = min_{1 ⩽ k ⩽ m, k \neq j} D (x_{i}^{(j)}, μ_{k}) + D (x_{i}^{(j)}, μ_{j}) . \end{matrix}$ (4)

Definition 6. In Definition 1, we take the difference between the minimum intercluster dissimilarity and the intracluster dissimilarity for the ith sample in cluster j as the clustering deviation dissimilarity bswc(j, i), which is defined as $\begin{matrix} bswc (j, i) = bc (j, i) - wc (j, i) \\ = min_{1 ⩽ k ⩽ m, k \neq j} D (x_{i}^{(j)}, μ_{k}) - D (x_{i}^{(j)}, μ_{j}) . \end{matrix}$ (5)

Definition 7. In Definition 1, we take the ratio of the clustering deviation dissimilarity and the clustering dissimilarity for the ith sample in cluster j as the centroid-based intra-inter partition (CIP) index CIP(j, i), which is defined as $\begin{matrix} CIP (j, i) = \frac{bswc (j, i)}{bawc (j, i)} \\ = \frac{bc (j, i) - wc (j, i)}{bc (j, i) + wc (j, i)} \\ = \frac{min_{1 ⩽ k ⩽ m, k \neq j} D (x_{i}^{(j)}, μ_{k}) - D (x_{i}^{(j)}, μ_{j})}{min_{1 ⩽ k ⩽ m, k \neq j} D (x_{i}^{(j)}, μ_{k}) + D (x_{i}^{(j)}, μ_{j})} . \end{matrix}$ (6)
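Definitions 1 to 7 can be traced on a small example. The following is a minimal Python sketch, not the authors' implementation: the function names `centroid` and `cip_per_sample`, the 0-based indices, the toy data, and the choice of Euclidean distance as D are all assumptions made for illustration.

```python
import math

def centroid(points):
    """Centroid of one cluster as in (1): the coordinate-wise mean."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def cip_per_sample(clusters, dissim=math.dist):
    """Map (j, i) -> CIP(j, i) for clusters given as lists of points.

    Indices are 0-based here, unlike the 1-based indices in the text.
    `dissim` stands in for D; Euclidean distance is an assumed default.
    """
    if len(clusters) < 2:
        raise ValueError("bc(j, i) needs at least two clusters")
    mus = [centroid(c) for c in clusters]
    scores = {}
    for j, cluster in enumerate(clusters):
        for i, x in enumerate(cluster):
            wc = dissim(x, mus[j])                                         # (3)
            bc = min(dissim(x, mu) for k, mu in enumerate(mus) if k != j)  # (2)
            scores[(j, i)] = (bc - wc) / (bc + wc)                         # (6)
    return scores

# Hypothetical toy data: two clusters of two 2-D points each.
clusters = [[(0.0, 0.0), (0.0, 2.0)], [(4.0, 1.0), (6.0, 1.0)]]
scores = cip_per_sample(clusters)
print(scores[(1, 0)])  # sample (4, 1): wc = 1 and bc = 4, so CIP = 3/5
```

The sketch does not guard against bc(j, i) + wc(j, i) = 0, which would occur only if a sample coincided with both its own centroid and another centroid.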

3.2 Analysis of CIP index

To reflect the intracluster compactness and intercluster separability of data, we propose the CIP index. Based on the distribution of samples, CIP takes a single sample in a cluster as the object of study and evaluates the validity of clustering results. The closeness of two samples can be measured in two main ways: by distance or by similarity. With a distance measure, the more dissimilar two samples are, the farther apart they lie. We adopt the Euclidean distance as the distance measure and have the following definition, where ||·|| denotes the Euclidean norm.

Definition 8 [2]. In d-dimensional Euclidean space, let x_i, x_p, and x_q be vectors. D (x_i, x_p) = ||x_i – x_p|| is a distance function or metric if the following properties hold: $1) D (x_{i}, x_{p}) ⩾ 0, \text{ and } D (x_{i}, x_{p}) = 0 \text{ if } x_{i} = x_{p};$ (7) $2) D (x_{i}, x_{p}) = D (x_{p}, x_{i});$ (8) $3) D (x_{i}, x_{p}) ⩽ D (x_{i}, x_{q}) + D (x_{q}, x_{p}) .$ (9)

According to Definition 8, the above-defined (2), (3), and (6) can be defined as $bc (j, i) = min_{1 ⩽ k ⩽ m, k \neq j} ∥ x_{i}^{(j)} - μ_{k} ∥$ (10) $wc (j, i) = ∥ x_{i}^{(j)} - μ_{j} ∥$ (11) $\begin{matrix} CIP (j, i) = \frac{bc (j, i) - wc (j, i)}{bc (j, i) + wc (j, i)} \\ = \frac{min_{1 ⩽ k ⩽ m, k \neq j} ∥ x_{i}^{(j)} - μ_{k} ∥ - ∥ x_{i}^{(j)} - μ_{j} ∥}{min_{1 ⩽ k ⩽ m, k \neq j} ∥ x_{i}^{(j)} - μ_{k} ∥ + ∥ x_{i}^{(j)} - μ_{j} ∥} . \end{matrix}$ (12)

To analyse the meaning of the CIP index and related concepts conveniently, we illustrate the CIP index with the distribution diagram of Fig. 1, where all samples are clustered into 4 clusters, represented by u, v, w, and j. There is a sample termed i in cluster j. In the case of intracluster structure for sample i, the distance between i and the centroid of cluster j is called the intracluster distance according to Definition 4. In the case of intercluster structure for sample i, we study the relationship between the nearest neighbor cluster and i to reflect the intercluster separability. The nearest neighbor cluster can be obtained by calculating the minimum intercluster distance for i. In Fig. 1, wc (j, i) = a, bc (j, i) = min(b, c, d) = d. Distance d corresponds to cluster w, so cluster w is the nearest neighbor cluster for sample i.

Fig. 1

Distribution diagram of clustering structure for CIP.

The internal cluster validation often reflects three measures of clustering partitions including compactness, separation, and connectivity. These measures could be clearly analysed for the CIP index.

(1) Compactness: The cluster compactness in the CIP index is measured by the intracluster distance of a single sample, which is the distance between the sample and its cluster centroid. We use wc(j, i) defined in (3) to reflect the intracluster compactness. A small value of wc(j, i) indicates high cluster compactness.

(2) Separation: The cluster separation in the CIP index is measured by the intercluster distance of a single sample, which is the distance between the sample and the centroid of its nearest neighbor cluster. Compared with other clusters, the nearest neighbor cluster can better reflect the intercluster separation of the sample. We use bc(j, i) defined in (2) to reflect the intercluster separation. A large value of bc(j, i) indicates high cluster separation.

(3) Connectivity: The cluster connectivity relates to the extent to which neighboring samples are placed in the same cluster. In general, neighboring samples have a high probability of being assigned to the same cluster, which means that they may have the same cluster centroid. In the CIP index, the value of wc(j, i) is related to the centroid of a cluster, which may be determined by neighboring samples. However, the CIP index does not reflect the connectivity measure directly.

From the above analysis, we understand that the CIP index mainly reflects the compactness and separation measures. Since compactness and separation follow opposing trends, we use bc(j, i) – wc(j, i) to combine them into a single score. In addition, we use bc (j, i) + wc (j, i) to normalize the CIP index.

CIP(j, i) reflects the cluster validity of a single sample in a data set. To validate a whole data set, we have the following definition.

Definition 9. In Definition 1, we take the average CIP(j, i) of all samples in the data set as the avgCIP(m) function, which is defined as $avgCIP (m) = \frac{1}{n} \sum_{j = 1}^{m} \sum_{i = 1}^{n_{j}} CIP (j, i) .$ (13)

We can calculate avgCIP(m) to analyse the clustering results of the data set.

3.3 Computational issues of CIP index

We analyse the computational process of the CIP index by comparing it to the well-known validity index, Sil [3]. The studies in [10] and [18] have demonstrated that the Sil index outperforms the other existing indices.

The Sil index is designed to identify clusters that are compact and well separated. The compactness is measured based on the distance between all the points in the same cluster and the separation is based on the nearest neighbor distance. According to the definition of Sil, we have the following definitions for Sil.

Definition 10. In Definition 1, we take the minimum value of average distances between the ith sample in cluster j and samples in every other cluster as the minimum between-cluster distance bs(j, i), which is defined as $bs (j, i) = min_{1 ⩽ k ⩽ m, k \neq j} (\frac{1}{n_{k}} \sum_{p = 1}^{n_{k}} ∥ x_{i}^{(j)} - x_{p}^{(k)} ∥) .$ (14)

Definition 11. In Definition 1, we take the average distance between the ith sample in cluster j and the other samples in cluster j as the within-cluster distance ws(j, i), which is defined as $ws (j, i) = \frac{1}{n_{j} - 1} \sum_{q = 1, q \neq i}^{n_{j}} ∥ x_{i}^{(j)} - x_{q}^{(j)} ∥$ (15) where $x_{q}^{(j)}$ denotes the qth sample in cluster j, q ≠ i, and n_j denotes the number of samples in cluster j.

Definition 12. In Definition 1, we define the Sil index Sil(j, i) as follows:

$\begin{matrix} Sil (j, i) = \frac{bs (j, i) - ws (j, i)}{max (bs (j, i), ws (j, i))} \\ = \frac{min_{1 ⩽ k ⩽ m, k \neq j} (\frac{1}{n_{k}} \sum_{p = 1}^{n_{k}} ∥ x_{i}^{(j)} - x_{p}^{(k)} ∥) - \frac{1}{n_{j} - 1} \sum_{q = 1, q \neq i}^{n_{j}} ∥ x_{i}^{(j)} - x_{q}^{(j)} ∥}{max (min_{1 ⩽ k ⩽ m, k \neq j} (\frac{1}{n_{k}} \sum_{p = 1}^{n_{k}} ∥ x_{i}^{(j)} - x_{p}^{(k)} ∥), \frac{1}{n_{j} - 1} \sum_{q = 1, q \neq i}^{n_{j}} ∥ x_{i}^{(j)} - x_{q}^{(j)} ∥)} . \end{matrix}$ (16)

Definition 13. In Definition 1, we take the average Sil(j, i) of all samples in the data set as the avgSil(m) function, which is defined as $avgSil (m) = \frac{1}{n} \sum_{j = 1}^{m} \sum_{i = 1}^{n_{j}} Sil (j, i) .$ (17)

We can calculate avgSil(m) to analyse the clustering results of a whole data set.
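For comparison with CIP, Definitions 10 to 13 can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `avg_sil` and the toy data are hypothetical, and scoring a single-sample cluster as 0 is an assumed convention, since ws(j, i) in (15) is undefined when n_j = 1.

```python
import math

def avg_sil(clusters):
    """Average Sil over all samples, following (14)-(17)."""
    n = sum(len(c) for c in clusters)
    total = 0.0
    for j, cluster in enumerate(clusters):
        if len(cluster) == 1:
            continue  # assumed convention: a singleton contributes Sil = 0
        for i, x in enumerate(cluster):
            # (15): mean distance to the other n_j - 1 samples of cluster j
            ws = sum(math.dist(x, y)
                     for q, y in enumerate(cluster) if q != i) / (len(cluster) - 1)
            # (14): smallest mean distance to the samples of another cluster
            bs = min(sum(math.dist(x, y) for y in other) / len(other)
                     for k, other in enumerate(clusters) if k != j)
            total += (bs - ws) / max(bs, ws)  # (16)
    return total / n                          # (17)

# Hypothetical 1-D toy data: two well-separated pairs.
clusters = [[(0.0,), (1.0,)], [(10.0,), (11.0,)]]
value = avg_sil(clusters)
print(value)
```

Every sample needs its distance to each of the other n – 1 samples, which is the cost that the complexity analysis below attributes to Sil.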

To explain the computational process of the Sil index conveniently, we use the distribution diagram of Fig. 2. In Fig. 2, all samples of the data set are clustered into four clusters, denoted by u, v, w, and j. There is a sample called i in cluster j. In ws(j, i), we need to calculate n_j – 1 intra-point distances. In bs(j, i), we need to calculate n – n_j inter-point distances and make m – 1 comparisons.

Fig. 2

Distribution diagram of clustering structure for Sil.

We explain the computational process of the CIP index by using the distribution diagram of Fig. 1. First, to obtain the m centroids of the data set, we need to calculate m means according to (1), i.e., n sums are needed. In wc(j, i), we need to calculate one intra-point distance. In bc(j, i), we need to calculate m – 1 inter-point distances and make m – 1 comparisons.

Let n be the number of samples in the data set, d the number of dimensions representing the data, and m the NC. The time complexity of Sil(j, i) is determined by the complexity of both ws(j, i) and bs(j, i). The time complexity of ws(j, i) is O(d²(n_j – 1)), bs(j, i) is O(d²(n – n_j)+m – 1), and Sil(j, i) is O(d²(n_j – 1)+d²(n – n_j)+m – 1) = O(nd² + m). Therefore, according to (17), the total time complexity of the Sil index is O(n(nd² + m)) = O(n²d² + mn). The time complexity of CIP(j, i) is determined by the complexity of both wc(j, i) and bc(j, i). The time complexity of m means is O(nd), wc(j, i) is O(d²), bc(j, i) is O(d²(m – 1)+m – 1) = O(md²), and CIP(j, i) is O(nd+d² + md²) = O(md² + nd). Therefore, according to (13), the total time complexity of the CIP index is O(nd+n(d² + md²)) = O(mnd²), which makes it affordable for large-scale and high-dimensional data sets.

Usually, m and d are far less than n (m ≪ n and d ≪ n). Thus, the time complexity of the Sil index is O(n(nd² + m)) = O(n²) and that of the CIP index is O(mnd²) = O(n). Compared with the Sil index, the CIP index is less time-consuming. In addition, we list the time complexity of some validity indices in Table 1, where we can see that CIP is less time-consuming than Sil, Wint, and IGP, and that the time complexity of CIP is the same as that of DB and KL.

Table 1

Time complexity of six validity indices

Validity index	DB	Sil	KL	Wint	IGP	CIP
Time complexity	O(n)	O(n²)	O(n)	O(n²)	O(n²)	O(n)
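The gap between the O(n²) and O(n) entries can be made concrete by counting distance evaluations only, using the per-sample counts given above for ws, bs, wc, and bc. The sketch below is illustrative: the function names and the cluster sizes are hypothetical, and the cost of computing the centroids and of the comparisons is deliberately left out.

```python
def sil_distance_count(cluster_sizes):
    """Distances needed by Sil: (n_j - 1) + (n - n_j) per sample."""
    n = sum(cluster_sizes)
    return sum(n_j * ((n_j - 1) + (n - n_j)) for n_j in cluster_sizes)

def cip_distance_count(cluster_sizes):
    """Distances needed by CIP: 1 + (m - 1) per sample."""
    m = len(cluster_sizes)
    return sum(n_j * (1 + (m - 1)) for n_j in cluster_sizes)

# Hypothetical partition: n = 1000 samples in m = 4 equal clusters.
sizes = [250, 250, 250, 250]
sil_count = sil_distance_count(sizes)
cip_count = cip_distance_count(sizes)
print(sil_count)  # n(n - 1) = 999000
print(cip_count)  # nm = 4000
```

With n fixed, the Sil count does not depend on m, while the CIP count grows only linearly in m.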

3.4 Properties of CIP index

Regarding the properties of the CIP index, we present the following theorems and deduction.

Theorem 1. The CIP index value is in the range of [– 1, 1].

Proof. According to (6), we obtain the following formulas: $\begin{matrix} CIP (j, i) = \frac{bc (j, i) - wc (j, i)}{bc (j, i) + wc (j, i)} \\ = \frac{2 \times bc (j, i)}{bc (j, i) + wc (j, i)} - 1 \end{matrix}$ (18)

and $\begin{matrix} CIP (j, i) = \frac{bc (j, i) - wc (j, i)}{bc (j, i) + wc (j, i)} \\ = 1 - \frac{2 \times wc (j, i)}{bc (j, i) + wc (j, i)} . \end{matrix}$ (19)

Since bc(j, i)≥0 and wc(j, i)≥0, then CIP(j, i)≥– 1 and CIP(j, i)≤1. We draw the conclusion that CIP(j, i) ∈ [– 1, 1]. In particular, when bc(j, i) = 0, CIP(j, i) = – 1. When there is only one sample in cluster j, i.e., wc(j, i) = 0, CIP(j, i) = 1.

Deduction 1. The avgCIP(m) function value is in the range of [– 1, 1].

In theory, CIP(j, i) and avgCIP(m) are in the range of [– 1, 1]. In practice, when there is only one sample in cluster j, we set CIP(j, i) = 0 by convention [3]. So CIP(j, i) and avgCIP(m) are, in effect, in the range of [– 1, 1).

Theorem 2. The larger the CIP index value, the better the clustering result of the sample.

Proof. According to the standard of clustering validity, to obtain better clustering results, we expect that wc(j, i) is as small as possible and bc(j, i) is as large as possible, which means making $\frac{wc (j, i)}{bc (j, i)}$ as small as possible. In (6), when there is only one cluster in the data set, i.e., bc(j, i) = 0, CIP(j, i) = – 1. If bc(j, i) ≠ 0, according to (6), we obtain the following formula: $\begin{matrix} CIP (j, i) = \frac{bc (j, i) - wc (j, i)}{bc (j, i) + wc (j, i)} \\ = \frac{2 \times bc (j, i)}{bc (j, i) + wc (j, i)} - 1 \\ = \frac{2}{1 + \frac{wc (j, i)}{bc (j, i)}} - 1 . \end{matrix}$ (20)

Because bc(j, i) ≠ 0 and bc(j, i) ⩾ 0, we have bc(j, i) > 0; hence, according to (18), CIP(j, i) > – 1 and CIP(j, i)+1 > 0. According to (20), we have $\frac{wc (j, i)}{bc (j, i)} = \frac{2}{CIP (j, i) + 1} - 1 .$ (21)

In (21), the larger CIP(j, i) is, the smaller $\frac{wc (j, i)}{bc (j, i)}$ would be. Thus, a better clustering result of sample i can be obtained when the CIP index becomes larger.

Minimizing the ratio of (21) is equivalent to maximizing CIP(j, i), which is important to the determination of the optimal NC.
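As a supplementary illustration that is not part of the original proof, the monotonicity used in Theorem 2 can be made explicit. Write r = wc(j, i)/bc(j, i) ⩾ 0 for the ratio in (20), where the symbols r and f are introduced here only for this remark: $CIP (j, i) = f (r) = \frac{2}{1 + r} - 1, \quad f' (r) = - \frac{2}{(1 + r)^{2}} < 0 .$ Hence CIP(j, i) strictly decreases as r grows. For instance, r = 1/4 gives CIP(j, i) = 0.6, r = 1 gives CIP(j, i) = 0, and r = 4 gives CIP(j, i) = – 0.6.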

Theorem 3. In the Euclidean space, bc(j, i)≤bs(j, i).

Proof. According to (1) and (10), we have $\begin{matrix} bc (j, i) = min_{1 ⩽ k ⩽ m, k \neq j} ∥ x_{i}^{(j)} - μ_{k} ∥ \\ = min_{1 ⩽ k ⩽ m, k \neq j} ∥ x_{i}^{(j)} - \frac{1}{n_{k}} \sum_{p = 1}^{n_{k}} x_{p}^{(k)} ∥ \\ = min_{1 ⩽ k ⩽ m, k \neq j} ∥ \frac{1}{n_{k}} \sum_{p = 1}^{n_{k}} (x_{i}^{(j)} - x_{p}^{(k)}) ∥ \\ = min_{1 ⩽ k ⩽ m, k \neq j} (\frac{1}{n_{k}} ∥ \sum_{p = 1}^{n_{k}} (x_{i}^{(j)} - x_{p}^{(k)}) ∥) . \end{matrix}$ (22)

According to (9) and (22), we have $\begin{matrix} bc (j, i) = min_{1 ⩽ k ⩽ m, k \neq j} (\frac{1}{n_{k}} ∥ \sum_{p = 1}^{n_{k}} (x_{i}^{(j)} - x_{p}^{(k)}) ∥) \\ ⩽ min_{1 ⩽ k ⩽ m, k \neq j} (\frac{1}{n_{k}} \sum_{p = 1}^{n_{k}} ∥ x_{i}^{(j)} - x_{p}^{(k)} ∥) . \end{matrix}$ (23)

Therefore, according to (14) and (23), we have $bc (j, i) ⩽ bs (j, i)$ (24)

Theorem 4. In the Euclidean space, wc(j, i)≤ws(j, i).

Proof. According to (1) and (11), we have $\begin{matrix} wc (j, i) = ∥ x_{i}^{(j)} - μ_{j} ∥ \\ = ∥ x_{i}^{(j)} - \frac{1}{n_{j}} \sum_{q = 1}^{n_{j}} x_{q}^{(j)} ∥ \\ = ∥ \frac{1}{n_{j}} \sum_{q = 1}^{n_{j}} (x_{i}^{(j)} - x_{q}^{(j)}) ∥ \\ ⩽ \frac{1}{n_{j}} \sum_{q = 1}^{n_{j}} ∥ x_{i}^{(j)} - x_{q}^{(j)} ∥ \\ ⩽ \frac{1}{n_{j} - 1} \sum_{q = 1}^{n_{j}} ∥ x_{i}^{(j)} - x_{q}^{(j)} ∥ \\ = \frac{1}{n_{j} - 1} \sum_{q = 1, q \neq i}^{n_{j}} ∥ x_{i}^{(j)} - x_{q}^{(j)} ∥ . \end{matrix}$ (25)

According to (15) and (25), we have $wc (j, i) ⩽ ws (j, i)$ (26)

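According to Theorems 3 and 4, both components of CIP are bounded above by the corresponding components of Sil. Since the numerator and the denominator shrink together, no ordering between the values of CIP and Sil follows; we therefore draw the conclusion that CIP and Sil are different from each other and have no necessary connection.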
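The two inequalities can also be checked numerically. The following Python sketch is a sanity check rather than a proof: the helper names, the randomly generated clusters, and the floating-point tolerance `tol` are all assumptions made for illustration.

```python
import math
import random

def centroid(points):
    """Coordinate-wise mean of a cluster, as in (1)."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def mean_dist(x, points):
    """Average Euclidean distance from x to the given points."""
    return sum(math.dist(x, y) for y in points) / len(points)

def bounds_hold(clusters, tol=1e-12):
    """True if bc <= bs and wc <= ws for every sample of every cluster."""
    mus = [centroid(c) for c in clusters]
    others = range(len(clusters))
    for j, cluster in enumerate(clusters):
        if len(cluster) < 2:
            continue  # ws(j, i) in (15) is undefined for a singleton
        for i, x in enumerate(cluster):
            wc = math.dist(x, mus[j])                                  # (11)
            bc = min(math.dist(x, mus[k]) for k in others if k != j)   # (10)
            ws = mean_dist(x, cluster[:i] + cluster[i + 1:])           # (15)
            bs = min(mean_dist(x, clusters[k]) for k in others if k != j)  # (14)
            if bc > bs + tol or wc > ws + tol:
                return False
    return True

# Hypothetical data: three Gaussian blobs of 20 points each, fixed seed.
rng = random.Random(0)
clusters = [[(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)) for _ in range(20)]
            for cx, cy in [(0, 0), (5, 0), (0, 5)]]
ok = bounds_hold(clusters)
print(ok)
```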

Theorem 5. In the Euclidean space, -∥ μ_j - μ_w ∥ ⩽ bc (j, i) - wc (j, i) ⩽ ∥ μ_j - μ_w ∥ and bc (j, i)+ wc (j, i) ⩾ ∥ μ_j - μ_w ∥, where $w = \underset{1 ⩽ k ⩽ m, k \neq j}{arg min} ∥ x_{i}^{(j)} - μ_{k} ∥$ .

Proof. To understand the notations easily, we illustrate the proving process by using the diagram of Fig. 3. In Fig. 3, the distance between points i and e is wc(j, i), the distance between points i and t is bc(j, i), and the vectors $x_{i}^{(j)}$, μ_j, and μ_w correspond to the points i, e, and t, respectively. According to (2), the nearest cluster of sample i is cluster w, which can be calculated as follows: $w = \underset{1 ⩽ k ⩽ m, k \neq j}{arg min} ∥ x_{i}^{(j)} - μ_{k} ∥$ (27)

Fig. 3

Distribution diagram of clustering structure for CIP’s properties.

According to (9), with D the distance function of Definition 8, we have $D (x_{i}^{(j)}, μ_{j}) ⩽ D (x_{i}^{(j)}, μ_{w}) + D (μ_{w}, μ_{j}),$ (28) $D (x_{i}^{(j)}, μ_{w}) ⩽ D (x_{i}^{(j)}, μ_{j}) + D (μ_{j}, μ_{w}),$ (29)

and $D (μ_{j}, μ_{w}) ⩽ D (μ_{j}, x_{i}^{(j)}) + D (x_{i}^{(j)}, μ_{w}) .$ (30)

Because $D (x_{i}^{(j)}, μ_{w}) = bc (j, i)$ (31)

and $D (x_{i}^{(j)}, μ_{j}) = wc (j, i),$ (32)

according to (8), (28), (29), and (30), we have $bc (j, i) - wc (j, i) ⩾ - ∥ μ_{j} - μ_{w} ∥,$ (33) $bc (j, i) - wc (j, i) ⩽ ∥ μ_{j} - μ_{w} ∥,$ (34)

and $bc (j, i) + wc (j, i) ⩾ ∥ μ_{j} - μ_{w} ∥ .$ (35)

CIP is a ratio and Theorem 5 shows the ranges for the numerator and denominator of CIP. According to Theorem 5, we also draw the conclusion of Theorem 1.
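The step from Theorem 5 to Theorem 1 can be written out as follows; this is a short sketch that assumes bc(j, i) + wc(j, i) > 0, so that the ratio in (6) is defined. Combining (33) and (34) with (35) gives $| bc (j, i) - wc (j, i) | ⩽ ∥ μ_{j} - μ_{w} ∥ ⩽ bc (j, i) + wc (j, i),$ and dividing by bc(j, i) + wc(j, i) yields $| CIP (j, i) | = \frac{| bc (j, i) - wc (j, i) |}{bc (j, i) + wc (j, i)} ⩽ 1 .$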

3.5 Determination of the optimal NC

The CIP index analyses the clustering validity of a single sample. The higher the value of CIP is, the better the clustering quality of a single sample will be. We analyse the clustering effects by calculating the average CIP index value of all samples in the data set. The higher the average value of CIP, the better the clustering effects. The optimal NC is the NC corresponding to the maximum average CIP value. According to (13), we have the following formula, where n represents the sample number and k_opt represents the optimal NC. $k_{opt} = \underset{2 ⩽ m < n}{\arg \max} avgCIP (m)$ (36)

4 Optimal NC determination algorithm

Based on the CIP index, a novel algorithm is presented to evaluate the clustering quality and determine the optimal NC. The CIP-based optimal NC determination (CONCD) algorithm is described in Algorithm 1.

Algorithm 1 CONCD
Step 1: Select the numbers of clusters in the search range of [k_min, k_max].
Step 2: For m = k_min to k_max
    Step a: Use a certain clustering algorithm to cluster the sample data set.
    Step b: Use (6) to calculate the CIP index value of a single sample.
    Step c: Use (13) to calculate the average value of the CIP index.
Step 3: Use (36) to calculate the optimal NC.
Step 4: Output the validity index values, optimal NC, and clustering results.
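The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `concd`, `avg_cip`, and `single_linkage` are hypothetical, a naive single-linkage AHC stands in for "a certain clustering algorithm", the toy data are invented, and single-sample clusters are scored 0 following the convention stated after Deduction 1.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a cluster, as in (1)."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def avg_cip(clusters):
    """avgCIP(m) as in (13); a single-sample cluster scores 0 by convention."""
    mus = [centroid(c) for c in clusters]
    n = sum(len(c) for c in clusters)
    total = 0.0
    for j, cluster in enumerate(clusters):
        if len(cluster) == 1:
            continue
        for x in cluster:
            wc = math.dist(x, mus[j])
            bc = min(math.dist(x, mu) for k, mu in enumerate(mus) if k != j)
            total += (bc - wc) / (bc + wc)
    return total / n

def single_linkage(points, m):
    """Naive single-linkage AHC, merging until m clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > m:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(math.dist(p, q) for p in clusters[a] for q in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))  # b > a, so index a stays valid
    return clusters

def concd(points, k_min, k_max, cluster_fn=single_linkage):
    """Return the optimal NC and, per m, the pair (avgCIP(m), clusters)."""
    results = {}
    for m in range(k_min, k_max + 1):                  # Step 2
        clusters = cluster_fn(points, m)               # Step a
        results[m] = (avg_cip(clusters), clusters)     # Steps b and c
    k_opt = max(results, key=lambda m: results[m][0])  # Step 3, (36)
    return k_opt, results                              # Step 4

# Hypothetical toy data: three well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 0), (10, 1), (11, 0),
          (5, 10), (5, 11), (6, 10)]
k_opt, results = concd(points, 2, 4)
print(k_opt)
```

Passing the clustering routine as `cluster_fn` keeps the loop independent of the algorithm being validated, which mirrors how the index is applied to several clustering algorithms in the experiments.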

5 Experimental studies

To show the performance of the CIP index and CONCD algorithm, this paper adopts the following clustering algorithms: affinity propagation (AP) [19], agglomerative hierarchical clustering (AHC) with single linkage, average linkage, and median linkage [20, 21], and divisive hierarchical clustering (DHC) with divisive analysis (DIANA) [22]. AP is a partitional clustering algorithm, while AHC and DHC are hierarchical clustering algorithms. The NC cannot be directly provided as an input parameter in AP. Search methods are generally used to obtain the clustering results for a given NC. To achieve the clustering for a specified NC, Frey et al. [19] search for an appropriate preference value by using a bisection method. We conduct experiments on synthetic data sets, benchmark data sets, and real data sets to examine the CIP index and contrast it with other indices, such as DB, Sil, KL, Wint, and IGP.

When using the CIP index to evaluate the clustering quality, we adopt the Euclidean distance measure for all data sets. In the experiments, the NCs are in the search range of [2, k_max]. Generally, we set $k_{max} = ⌊ \sqrt{n} ⌋$ according to $k_{max} \leq \sqrt{n}$ [23–25]. The experiments include eleven 2-D synthetic data sets, whose structure distributions are provided in Fig. 4.
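As a small illustration of this search range, the sketch below computes [2, ⌊√n⌋] with integer arithmetic. The function name `nc_search_range` is hypothetical; the sample number 53 is that of the E2 data set.

```python
import math

def nc_search_range(n, k_min=2):
    """Candidate NCs from k_min to floor(sqrt(n)), inclusive."""
    k_max = math.isqrt(n)  # exact integer floor of the square root
    return range(k_min, k_max + 1)

e2_range = list(nc_search_range(53))
print(e2_range)  # candidate NCs 2 to 7 for n = 53
```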

Fig. 4

Structure distribution of eleven synthetic data sets.

5.1 Experiments using AP clustering

The AP algorithm clusters a data set based on the similarity matrix of n samples. Usually, the similarity values for AP are negative. The AP algorithm calculates the similarity value by using the squared Euclidean distance for two d-dimensional samples x_p and x_q in every data set, i.e., s(p, q) = – ||x_p – x_q||².
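The similarity matrix described above can be built as in the following sketch. The function name `similarity_matrix` and the sample points are hypothetical. The diagonal is left at the value given by the formula; in AP the diagonal entries act as preferences and would be set separately, which this sketch does not attempt.

```python
def similarity_matrix(points):
    """s(p, q) = -||x_p - x_q||^2 for every pair of samples."""
    return [[-sum((a - b) ** 2 for a, b in zip(xp, xq)) for xq in points]
            for xp in points]

# Hypothetical 2-D samples.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
s = similarity_matrix(points)
print(s[0][1])  # -(3^2 + 4^2) = -25.0
```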

1) Synthetic data sets: The experiments include the E2, F3, K3, M3, G6, N3, and N4 data sets. Regarding the E2 data set, its right NC is 2. The experimental results from using the validity indices to identify the optimal NC are listed in Table 2, where we see that the optimal NC 2 chosen by the DB, Sil, IGP, and CIP indices is correct, the KL index identifies the optimal NC 5, and the Wint index identifies the optimal NC 4. The KL and Wint indices identify the wrong optimal NC for the E2 data set.

Table 2
Number of clusters-index relationship table of E2 data set

NC	DB	Sil	KL	Wint	IGP	CIP
2	0.3788	0.6367	0.6301	0.3588	1.0000	0.6061
3	0.5332	0.5723	9.9556	0.4927	0.9833	0.5551
4	0.5963	0.4876	1.5251	0.4962	0.9559	0.4987
5	0.7319	0.4427	41.4939	0.4785	0.8730	0.4639
6	0.7324	0.4540	0.0265	0.4506	0.9606	0.4870
7	0.6997	0.4669	0.0265	0.4271	0.9663	0.4954
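The selection of the optimal NC from Table 2 can be reproduced with a few lines of Python. The values are copied from Table 2; the selection rule is an assumption made for this sketch: DB is taken at its minimum and every other index at its maximum, which matches the optimal NCs reported in the text.

```python
INDICES = ["DB", "Sil", "KL", "Wint", "IGP", "CIP"]

# Rows of Table 2 (E2 data set, AP clustering), keyed by NC.
TABLE2 = {
    2: [0.3788, 0.6367, 0.6301, 0.3588, 1.0000, 0.6061],
    3: [0.5332, 0.5723, 9.9556, 0.4927, 0.9833, 0.5551],
    4: [0.5963, 0.4876, 1.5251, 0.4962, 0.9559, 0.4987],
    5: [0.7319, 0.4427, 41.4939, 0.4785, 0.8730, 0.4639],
    6: [0.7324, 0.4540, 0.0265, 0.4506, 0.9606, 0.4870],
    7: [0.6997, 0.4669, 0.0265, 0.4271, 0.9663, 0.4954],
}

MINIMISED = {"DB"}  # assumed: lower DB is better, higher is better otherwise

def optimal_nc(table, indices, minimised):
    """Pick, for each index, the NC with the best value in its column."""
    best = {}
    for col, name in enumerate(indices):
        pick = min if name in minimised else max
        best[name] = pick(table, key=lambda nc: table[nc][col])
    return best

optimal = optimal_nc(TABLE2, INDICES, MINIMISED)
print(optimal)
```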

Data set information and the experimental results from using the validity indices to identify the optimal NC are listed in Table 3. Table 3 shows that the Sil and CIP indices can identify the correct optimal NCs for the seven synthetic data sets; the DB index is valid for the seven synthetic data sets except N3 and N4; the IGP index is valid for the E2, F3, and M3 data sets; the Wint index is valid for the K3 and M3 data sets; and the KL index is only valid for the N3 data set.

Table 3

Experimental results of optimal NCs using AP for synthetic data sets

Dataset	Sample number	Right NC	Optimal NC
			DB	Sil	KL	Wint	IGP	CIP
E2	53	2	2	2	5	4	2	2
F3	300	3	3	3	11	4	3	3
K3	1500	3	3	3	33	3	2	3
M3	1800	3	3	3	12	3	3	3
G6	600	6	6	6	18	5	4	6
N3	194	3	2	3	3	4	2	3
N4	226	4	3	4	6	3	2	4

2) Benchmark data sets: The experiments include six benchmark data sets that are from the clustering basic benchmark (http://cs.uef.fi/sipu/datasets/): R15, D31, S2, dim032, dim064, and dim128. Data set information and the experimental results from using the validity indices to identify the optimal NCs are listed in Table 4. Table 4 shows that the Sil and CIP indices can identify the correct optimal NCs for the six benchmark data sets; the DB index is valid for the R15, D31, and S2 data sets; the KL and Wint indices are valid for the dim032, dim064, and dim128 data sets; and the IGP index is invalid for the six benchmark data sets.

Table 4

Experimental results of optimal NCs using AP for benchmark data sets

Dataset	Sample number	Dimension	Right NC	Optimal NC
				DB	Sil	KL	Wint	IGP	CIP
R15	600	2	15	15	15	13	11	8	15
D31	3100	2	31	31	31	7	6	6	31
S2	5000	2	15	15	15	36	7	2	15
dim032	1024	32	16	32	16	16	16	12	16
dim064	1024	64	16	32	16	16	16	2	16
dim128	1024	128	16	32	16	16	16	3	16

3) Real data sets: The experiments include eight real data sets, which are from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/). Regarding the Breast data set, its real NC is 2. The experimental results from using the six validity indices to identify the optimal NC are shown in Fig. 5, where we see that the optimal NC 2 chosen by the DB, Sil, IGP, and CIP indices is correct, the KL index identifies the optimal NC 3, and the Wint index identifies the optimal NC 8. The KL and Wint indices identify the wrong optimal NC for the Breast data set.

Fig. 5

Clustering number-index relationship graph of Breast.

Data set information and the experimental results from using the validity indices to identify the optimal NCs are listed in Table 5. Table 5 shows that the CIP index can identify the correct optimal NCs for the eight real data sets; the Sil index is valid for all the real data sets except Column_2C; the IGP index is valid for the Breast, Column_2C, Pima, and Wdbc data sets; the DB index is valid for the Breast, Seeds, and Wdbc data sets; the KL index is valid for the Heart and Wdbc data sets; and the Wint index is only valid for the Seeds data set.

Table 5

Experimental results of optimal NCs using AP for real data sets

Dataset	Sample number	Dimension	Real NC	Optimal NC
				DB	Sil	KL	Wint	IGP	CIP
Breast	683	9	2	2	2	3	8	2	2
Banknote	1372	4	2	20	2	35	8	34	2
Column_2C	310	6	2	3	3	4	8	2	2
Heart	270	13	2	7	2	2	5	3	2
Pima	768	8	2	3	2	5	3	2	2
Seeds	210	6	3	3	3	11	3	2	3
Sonar	208	60	2	12	2	3	4	14	2
Wdbc	569	30	2	2	2	2	3	2	2

5.2 Experiments using AHC with single linkage

1) Synthetic data sets: The experiments include nine 2-D synthetic data sets. Data set information and the experimental results from using the validity indices to identify the optimal NCs are listed in Table 6. Table 6 shows that the Sil and CIP indices can identify the correct optimal NCs for the nine synthetic data sets; the IGP index is valid for the E2, P2, Q2, and J2 data sets; the Wint index is valid for the F3, RA6, and N3 data sets; the DB index is valid for the F3 and Q2 data sets; and the KL index is only valid for the F3 data set.

Table 6
Experimental results of optimal NCs with single linkage for synthetic data sets

Dataset	Sample number	Right NC	Optimal NC
			DB	Sil	KL	Wint	IGP	CIP
E2	53	2	4	2	4	4	2	2
F3	300	3	3	3	3	3	2	3
RA6	1741	6	41	6	5	6	2	6
P2	500	2	14	2	19	7	2	2
Q2	600	2	2	2	12	4	2	2
G6	600	6	12	6	2	3	2	6
N3	194	3	11	3	6	3	2	3
N4	226	4	13	4	8	3	2	4
J2	746	2	26	2	11	4	2	2

2) Real data sets: The experiments include seven real data sets, which are from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/). Data set information and the experimental results from using the validity indices to identify the optimal NCs are listed in Table 7. Table 7 shows that the CIP index can identify the correct optimal NCs for the seven real data sets; the Sil index is valid for all the real data sets except Sonar; the IGP index is valid for all the real data sets except Pima and Sonar; the Wint index is valid for all the real data sets except Pima, Sonar, and Transfusion; the DB index is valid for the Sonar and Transfusion data sets; and the KL index is valid for the Pop_fail and Wdbc data sets.

Table 7

Experimental results of optimal NCs with single linkage for real data sets

Dataset	Sample number	Dimension	Real NC	Optimal NC
				DB	Sil	KL	Wint	IGP	CIP
Breast	683	9	2	26	2	24	2	2	2
Heart	270	13	2	4	2	5	2	2	2
Pima	768	8	2	3	2	9	4	3	2
Pop_fail	540	18	2	6	2	2	2	2	2
Sonar	208	60	2	2	3	3	3	3	2
Transfusion	748	4	2	2	2	10	3	2	2
Wdbc	569	30	2	21	2	2	2	2	2

5.3 Experiments using AHC with average linkage

1) Synthetic data sets: The experiments include eight 2-D synthetic data sets. Data set information and the experimental results from using the validity indices to identify the optimal NCs are listed in Table 8. Table 8 shows that the Sil and CIP indices can identify the correct optimal NCs for the eight synthetic data sets; the DB index is valid for the F3, K3, M3, Q2, and G6 data sets; the KL index is valid for the F3 and G6 data sets; the Wint index is valid for the E2, F3, and M3 data sets; and the IGP index is valid for the E2 and Q2 data sets.

Table 8

Experimental results of optimal NCs with average linkage for synthetic data sets

Dataset	Sample number	Right NC	Optimal NC
			DB	Sil	KL	Wint	IGP	CIP
E2	53	2	5	2	3	2	2	2
F3	300	3	3	3	3	3	2	3
K3	1500	3	3	3	7	9	2	3
M3	1800	3	3	3	16	3	2	3
Q2	600	2	2	2	23	3	2	2
G6	600	6	6	6	6	3	2	6
N3	194	3	4	3	5	4	2	3
N4	226	4	5	4	6	3	2	4
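The procedure behind each "Optimal NC" entry is a sweep: cluster the data for every candidate NC, score each partition with a validity index, and keep the NC with the best score. The sketch below illustrates this with average linkage and the Sil index only, because the CIP definition belongs to Section 3.1 and is not reproduced here. It is a minimal pure-Python illustration under stated assumptions: the three-blob toy data, the naive clustering routine, the candidate range 2 to 6, and all function names are hypothetical and are not the implementation used in the experiments.

```python
import math
import random


def average_linkage(points, k):
    """Naive agglomerative clustering with average linkage, stopped at k clusters."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average of all pairwise distances between the two clusters.
                d = sum(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)  # b > a, so index a stays valid
    labels = [0] * len(points)
    for c, members in enumerate(clusters):
        for i in members:
            labels[i] = c
    return labels


def silhouette(points, labels):
    """Mean silhouette width; every pair of samples is visited, hence O(n^2)."""
    groups = {}
    for i, c in enumerate(labels):
        groups.setdefault(c, []).append(i)
    total = 0.0
    for i, p in enumerate(points):
        own = groups[labels[i]]
        if len(own) == 1:
            continue  # a singleton contributes a silhouette width of 0
        a = sum(math.dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(sum(math.dist(p, points[j]) for j in other) / len(other)
                for c, other in groups.items() if c != labels[i])
        total += (b - a) / max(a, b)
    return total / len(points)


# Toy data (an assumption): three well-separated 2-D Gaussian blobs.
random.seed(0)
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(random.gauss(cx, 0.5), random.gauss(cy, 0.5))
          for cx, cy in centers for _ in range(15)]

# Sweep the candidate NCs and keep the one with the largest Sil value.
scores = {k: silhouette(points, average_linkage(points, k)) for k in range(2, 7)}
best_nc = max(scores, key=scores.get)
print(scores)
print("optimal NC by Sil:", best_nc)
```

Replacing `silhouette` with another index (and `max` with `min` for indices that are minimized, such as DB) gives the corresponding column of the table.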

2) Real data sets: The experiments include seven real data sets, which are from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/). Data set information and the experimental results from using the validity indices to identify the optimal NCs are listed in Table 9. Table 9 shows that the Sil and CIP indices can identify the correct optimal NCs for the seven real data sets; the IGP index is valid for all the real data sets except Pop_fail, Seeds, and Sonar; the DB index is only valid for the Pima data set; the KL index is only valid for the Seeds data set; and the Wint index is invalid for all seven real data sets.

Table 9

Experimental results of optimal NCs with average linkage for real data sets

Dataset	Sample number	Dimension	Real NC	Optimal NC
				DB	Sil	KL	Wint	IGP	CIP
Breast	683	9	2	4	2	23	3	2	2
Heart	270	13	2	3	2	10	3	2	2
Pima	768	8	2	2	2	6	5	2	2
Pop_fail	540	18	2	3	2	17	4	21	2
Seeds	210	6	3	7	3	3	6	2	3
Sonar	208	60	2	10	2	7	5	3	2
Wdbc	569	30	2	3	2	4	3	2	2

5.4 Experiments using AHC with median linkage

1) Synthetic data sets: The experiments include eight 2-D synthetic data sets. Data set information and the experimental results from using the validity indices to identify the optimal NCs are listed in Table 10. Table 10 shows that the Sil and CIP indices can identify the correct optimal NCs for the eight synthetic data sets; the DB index is valid for the E2, F3, M3, Q2, and G6 data sets; the Wint index is valid for the E2, F3, K3, and N3 data sets; the IGP index is valid for the E2 and Q2 data sets; and the KL index is only valid for the N4 data set.

Table 10

Experimental results of optimal NCs with median linkage for synthetic data sets

Dataset	Sample number	Right NC	Optimal NC
			DB	Sil	KL	Wint	IGP	CIP
E2	53	2	2	2	4	2	2	2
F3	300	3	3	3	7	3	2	3
K3	1500	3	31	3	15	3	2	3
M3	1800	3	3	3	40	4	2	3
Q2	600	2	2	2	16	4	2	2
G6	600	6	6	6	12	3	2	6
N3	194	3	2	3	8	3	2	3
N4	226	4	2	4	4	3	2	4

2) Real data sets: The experiments include seven real data sets, which are from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/). Data set information and the experimental results from using the validity indices to identify the optimal NCs are listed in Table 11. Table 11 shows that the Sil and CIP indices can identify the correct optimal NCs for the seven real data sets; the IGP index is valid for all the real data sets except Heart and Seeds; the DB index is valid for the Heart, Pima, and Transfusion data sets; the Wint index is valid for the Heart and Pop_fail data sets; and the KL index is invalid for all seven real data sets.

Table 11

Experimental results of optimal NCs with median linkage for real data sets

Dataset	Sample number	Dimension	Real NC	Optimal NC
				DB	Sil	KL	Wint	IGP	CIP
Heart	270	13	2	2	2	10	2	4	2
Pima	768	8	2	2	2	26	6	2	2
Pop_fail	540	18	2	11	2	21	2	2	2
Seeds	210	6	3	14	3	4	5	4	3
Sonar	208	60	2	14	2	4	3	2	2
Transfusion	748	4	2	2	2	18	4	2	2
Wdbc	569	30	2	5	2	4	6	2	2

5.5 Experiments using DHC with DIANA

1) Synthetic data sets: The experiments include seven 2-D synthetic data sets. Data set information and the experimental results from using the validity indices to identify the optimal NCs are listed in Table 12. Table 12 shows that the Sil and CIP indices can identify the correct optimal NCs for the seven synthetic data sets; the DB index is valid for the E2, F3, M3, Q2, and G6 data sets; the Wint index is valid for the F3, M3, and N4 data sets; the IGP index is valid for the E2, M3, and Q2 data sets; and the KL index is only valid for the Q2 data set.

Table 12

Experimental results of optimal NCs for seven synthetic data sets

Dataset	Sample number	Right NC	Optimal NC
			DB	Sil	KL	Wint	IGP	CIP
E2	53	2	2	2	3	3	2	2
F3	300	3	3	3	10	3	2	3
M3	1800	3	3	3	28	3	3	3
Q2	600	2	2	2	2	3	2	2
G6	600	6	6	6	12	3	2	6
N3	194	3	2	3	6	4	2	3
N4	226	4	3	4	7	4	2	4

2) Real data sets: The experiments include eight real data sets, which are from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/). Data set information and the experimental results from using the validity indices to identify the optimal NCs are listed in Table 13. Table 13 shows that the CIP index can identify the correct optimal NCs for the eight real data sets; the Sil index is valid for all the real data sets except Seeds; the IGP index is valid for all the real data sets except Column_2C; the DB index is valid for all the real data sets except Heart and Sonar; the Wint index is valid for the Column_2C and Seeds data sets; and the KL index is only valid for the Breast data set.

Table 13

Experimental results of optimal NCs for eight real data sets

Dataset	Sample number	Dimension	Real NC	Optimal NC
				DB	Sil	KL	Wint	IGP	CIP
Breast	683	9	2	2	2	2	3	2	2
Banknote	1372	4	2	2	2	35	20	2	2
Column_2C	310	6	2	2	2	7	2	8	2
Heart	270	13	2	4	2	12	3	2	2
Pima	768	8	2	2	2	7	3	2	2
Seeds	210	6	3	3	2	12	3	3	3
Sonar	208	60	2	12	2	3	4	2	2
Wdbc	569	30	2	2	2	22	8	2	2

6 Conclusion

From the viewpoint of sample geometry, we propose a new cluster validity index named CIP, a general-purpose index based on the ideas of the nearest neighbor cluster and the clustering centroid. CIP can be used to validate the clustering results of a given clustering algorithm. AP, AHC with single linkage, average linkage, and median linkage, and DIANA divisive hierarchical clustering are adopted as the clustering algorithms in our experiments. With the AP clustering algorithm, CIP identifies the correct optimal NCs for all the experimental data sets, and Sil does the same except on the Column_2C data set. With AHC with single linkage, CIP identifies the correct optimal NCs for all the experimental data sets, and Sil does the same except on the Sonar data set. With AHC with average linkage, CIP and Sil both identify the correct optimal NCs for all the experimental data sets. With AHC with median linkage, CIP and Sil again identify the correct optimal NCs for all the experimental data sets. With DHC with DIANA, CIP identifies the correct optimal NCs for all the experimental data sets, and Sil does the same except on the Seeds data set.

AHC with single linkage, combined with the CIP index, identifies the correct optimal NCs for the RA6 and J2 data sets, which suggests that the clustering results of AHC with single linkage are suitable for data with complicated, nonconvex distributions. AHC with average linkage or median linkage, combined with the CIP index, identifies the correct optimal NCs for the K3 and M3 data sets, which suggests that the clustering results of AHC with average linkage or median linkage are suitable for data with overlapping distributions.

CIP and Sil outperform the other indices in choosing the optimal NC. Sil is a well-known cluster validity index, and it outperforms the other existing indices in the literature. The time complexity of CIP is O(n), whereas that of Sil is O(n²), so CIP is more efficient than Sil. Theoretical analysis and experimental results demonstrate that the proposed index and method analyse the clustering results effectively and are suitable for determining the optimal NC for various types of data sets. Further research will include applications of the new index to other clustering algorithms on multiple data sets.
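To give a feeling for the gap between O(n) and O(n²), the following Python sketch counts the distinct sample pairs, n(n-1)/2, whose distances enter the Sil computation, for a few sample sizes taken from the tables of Section 5. It rests on the assumption that an O(n) index performs a roughly constant amount of work per sample; the actual constant for CIP is not stated here, and the names `pairwise_distances` and `SIZES` are illustrative.

```python
def pairwise_distances(n):
    """Number of distinct sample pairs, n(n-1)/2, considered by an O(n^2) index."""
    return n * (n - 1) // 2


# Sample counts copied from the tables of Section 5.
SIZES = {"E2": 53, "Seeds": 210, "Pima": 768, "Banknote": 1372, "M3": 1800}

for name, n in SIZES.items():
    pairs = pairwise_distances(n)
    # pairs / n is the pairwise work per sample, which itself grows with n.
    print(f"{name}: n = {n}, pairs = {pairs}, pairs per sample = {pairs / n:.1f}")
```

The pairwise work per sample grows linearly with n, so the relative advantage of a linear-time index widens as the data set grows.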

Acknowledgments

The authors would like to thank Dongdong Cheng of Chongqing University for her help with some data sets. This work was supported in part by the Fundamental Research Funds for the Central Universities (Grant JUSRP11235) and in part by the National Natural Science Foundation of China (Grants 61673193 and 61833007).

References

1. Liu, Li, Xiong, Gao, Wu and Wu, Understanding and enhancement of internal clustering validation measures, IEEE Transactions on Cybernetics 43(3) (2013), 982–994.

2. Davies D.L. and Bouldin D.W., A cluster separation measure, IEEE Transactions on Pattern Analysis and Machine Intelligence 1(2) (1979), 224–227.

3. Rousseeuw P.J., Silhouettes: A graphical aid to the interpretation and validation of cluster analysis, Journal of Computational and Applied Mathematics 20(1) (1987), 53–65.

4. Dudoit and Fridlyand, A prediction-based resampling method for estimating the number of clusters in a dataset, Genome Biology 3(7) (2002), 1–21.

5. Strehl, Relationship-based clustering and cluster ensembles for high-dimensional data mining, Austin: University of Texas at Austin, 2002.

6. Wang, Li, Zhang and Guo, Experimental comparison of clusters number estimation for cluster analysis, Computer Engineering 34(9) (2008), 198–199.

7. Kapp A.V. and Tibshirani, Are clusters found in one dataset present in another dataset? Biostatistics 8(1) (2007), 9–31.

8. Liang, Zhao, Li, Cao and Dang, Determining the number of clusters using information entropy for mixed data, Pattern Recognition 45(6) (2012), 2251–2265.

9. Zhao and Fränti, WB-index: A sum-of-squares based index for cluster validity, Data & Knowledge Engineering 92 (2014), 77–89.

10. Xu, Xu and Wunsch D.C. II, A comparison study of validity indices on swarm-intelligence-based clustering, IEEE Transactions on Systems, Man, and Cybernetics–Part B: Cybernetics 42(4) (2012), 1243–1256.

11. Starczewski, A new validity index for crisp clusters, Pattern Analysis and Applications 20(3) (2017), 687–700.

12. Yue, Wang and Bao, A new validity index for evaluating the clustering results by partitional clustering algorithms, Soft Computing 20(3) (2016), 1127–1138.

13. Bhargavi M.S. and Gowda S.D., A novel validity index with dynamic cut-off for determining true clusters, Pattern Recognition 48(11) (2015), 3673–3687.

14. Shieh, Robust validity index for a modified subtractive clustering algorithm, Applied Soft Computing 22 (2014), 47–59.

15. Rojas-Thomas J.C., Santos and Mora, New internal index for clustering validation based on graphs, Expert Systems with Applications 86 (2017), 334–349.

16. Bezdek J.C., Moshtaghi, Runkler and Leckie, The generalized C index for internal fuzzy cluster validity, IEEE Transactions on Fuzzy Systems 24(6) (2016), 1500–1512.

17. Liu C.-M., Niu and Liao, Mechanisms to improve clustering uncertain data with UKmeans, Data & Knowledge Engineering 116 (2018), 61–79.

18. Arbelaitz, Gurrutxaga, Muguerza, Pérez J.M. and Perona, An extensive comparative study of cluster validity indices, Pattern Recognition 46(1) (2013), 243–256.

19. Frey B.J. and Dueck, Clustering by passing messages between data points, Science 315(5814) (2007), 972–976.

20. Duda R.O., Hart P.E. and Stork D.G., Pattern Classification (2nd ed.), John Wiley & Sons, New York, NY, 2001.

21. Gurrutxaga, Albisua, Arbelaitz, Martin J.I., Muguerza, Pérez J.M. and Perona, SEP/COP: An efficient method to find the best partition in hierarchical clustering based on a new cluster validity index, Pattern Recognition 43(10) (2010), 3364–3373.

22. Kaufman and Rousseeuw P.J., Finding groups in data: An introduction to cluster analysis, John Wiley & Sons, Hoboken, NJ, 1990.

23. Pal N.R. and Bezdek J.C., On cluster validity for the fuzzy c-means model, IEEE Transactions on Fuzzy Systems 3(3) (1995), 370–379.

24. Bezdek J.C. and Pal N.R., Some new indexes of cluster validity, IEEE Transactions on Systems, Man, and Cybernetics-Part B: Cybernetics 28(3) (1998), 301–315.

25. Iam-On, Boongoen, Garrett and Price, A link-based approach to the cluster ensemble problem, IEEE Transactions on Pattern Analysis and Machine Intelligence 33(12) (2011), 2396–2409.
