Investigating cluster validation metrics for optimal number of clusters determination

Abstract

In short, clustering is the process of partitioning a given set of objects into groups containing highly related instances. This relation is determined by a specific distance metric with which the intra-cluster similarity is estimated. Finding an optimal number of such partitions is usually the key step in the entire process, yet a rather difficult one. Selecting an unsuitable number of clusters might lead to incorrect conclusions and, consequently, to wrong decisions: the term “optimal” is quite ambiguous. Furthermore, various inherent characteristics of the datasets, such as clusters that overlap or clusters containing subclusters, will most often increase the level of difficulty of the task. Thus, the methods used to detect similarities and the parameter selection of the partition algorithm have a major impact on the quality of the groups and the identification of their optimal number. Given that each dataset constitutes a rather distinct case, validity indices are indicators introduced to address the problem of selecting such an optimal number of clusters. In this work, an extensive set of well-known validity indices, based on the approach of the so-called relative criteria, are examined comparatively. A total of 26 cluster validation measures were investigated in two distinct case studies: one in real-world and one in artificially generated data. To ensure a certain degree of difficulty, both real-world and generated data were selected to exhibit variations and inhomogeneity. Each of the indices is being deployed under the schemes of 9 different clustering methods, which incorporate 5 different distance metrics. All results are presented in various explanatory forms.

Keywords

Clustering cluster validation validity index relative criteria number of clusters

1. Introduction

The term clustering describes the unsupervised Machine Learning task, in which a set of objects is partitioned in ways that instances placed in the same group appear to be more similar [1]. These resulting groups are referred to as clusters. However, from a generalization perspective, the concept of a cluster can not be explicitly defined, in terms of both form and content, for every given dataset. This is due to the fact that what is considered a well-defined cluster or a good partition depends, to a large extent, both on the application and background of the researchers [2] and on the – quite fuzzy – concept of similarity. As a result, different metrics and views are reflected in different definitions. In a top-down perspective, clustering is the result of dividing a heterogeneous population into a series of more homogeneous subgroups [3], while in a bottom-up one, clustering is the process of forming groups via data partitioning using a given similarity criterion [4]. In both views, the key element is the mirror notions of similarity and dissimilarity. These, in practice, are determined merely by the use of a distance metric. In other words, given a how-to-measure arrangement, items that result belonging to the same group are more similar – in terms of the aforementioned metric – compared to the ones that lie in different ones. Consequentially, selecting the number of clusters, a function that is – among other things – depended on both the distance metric and the structure of the dataset as well as the method utilized, becomes crucial. Thus, determining the optimal number of clusters stands out as one – if not the most – extremely important step in the whole process.

Figure 1.

A graphical representation of different partitionings of the same dataset, using Agglomerative Clustering with Average, Complete or Single-linkage along with Euclidean or Manhattan distance metric.

Furthermore, most real-world classification tasks that utilize cluster analysis carry the problem of optimal number of cluster determination as an intrinsic and fundamental parameter, comparable – in terms of output importance – to the algorithmic procedure selection. This is obviously because – in most cases – the deployment of a specific clustering method is vital to the way the output space is shaped, having every corresponding option differentiate the partitions produced. Each algorithm will most likely alter the instances contained in each group leading to a different set of clusters. Moreover, even within a specific algorithm, changing the parameters and the representational structures of the data will also lead to output alterations, having a rather substantial impact on the way the final partitions are eventually formed [5]. A small demonstration of the above is presented in Fig. 1. This is an illustration of six different methods, all variants of the Agglomerative clustering algorithm, as applied to the same set of data. The setting uses a fixed number of clusters equal to 4. Specifically, the 6 aforementioned distinct methods are formed by the combinations of 3 different kinds of links and 2 different distance metrics. These include, on the one hand, Single-linkage, Complete-linkage and Average-linkage clustering, and, on the other, the Euclidean and Manhattan distance metrics. In each of the cases, output fluctuations reflected in the various groupings of the original dataset can be observed.

Typically, there are two ways to assign the number clusters. Either the number of clusters is determined by the user as a required initialization parameter at the beginning of the procedure, or the algorithm itself exploits inner mechanisms to automatically set the appropriate number of partitions. This selection step is – regardless of the selection mode – both fuzzy and non-singular in terms of a deterministic output choice of the number of clusters. Thus, the demand for cluster validity evaluation, and the respective attributes which emanate from this assessment affecting the overall quality of the procedure, set the direction of a rather essential line of research. Considering the difficulty of the problem as well as the general importance of clustering in data mining, the volume of related literature is – as expected – large. In addition, a reasonably extended part of this relevant literature contains a similarly substantial number of proposed indicators, all aiming at an insightful estimation of the optimal number of clusters that – ideally – works in any given dataset. There are three general ways to approach the cluster validation assessment problem, each of which is based on different evaluation criteria [6]: internal, external, and relative cluster validation. This work is structured so that it represents in a comprehensive and straightforward manner an extensive experimental study of validity indices that belong to the third of the aforementioned approaches of cluster evaluation criteria, namely the relative approach. The above categorization, however, is not meant as a definition of explicitly formulated experimental instructions. It only stands as just a, so to speak, profile of the main content of the procedure.

The main task of – the examined in this paper – relative-based cluster validity indices is to evaluate the estimated number of clusters in terms of some notion of the overall quality of the produced partitioning. They do that by exploiting the following intra-algorithmic comparative strategy: take a specific algorithm and, by modifying its parameters, create execution variations to generate various results, which will then be evaluated comparatively by contrasting the performance of each possible schema with any other. These validity indices make use of some statistical and geometric attributes of the data, exploiting distance metrics properties, as well as intra-cluster information such as compactness and isolation [7].

Now, on to the settings of the experimental process, a total of 26 validity indices for estimating the optimal number of clusters in a given dataset were evaluated. These were applied to a quite large set of methods, consisted of variations of k-means and hierarchical agglomerative clustering algorithms. This variety of methods were supplemented by a set of 5 different distance metrics, resulting to an extensive experimentation regarding proposing and selecting the best possible partitions. In the following sections elements from the results of the implementations of all the above will be presented. Starting from Chapter 2, related works mainly from the contemporary literature will be indicatively addressed. Next, the overall setting of the experimental procedure – that is the methods, distance metrics and validity indices used – will be presented. Concluding this introduction, the results of the experimental process, as well as the various emerging conclusions, will be addressed in the last sections.

2. Related work

Estimating the best number of partitions to be made in a given dataset constitutes – as already repeatedly noted in the introduction – one of the most crucial steps in almost every clustering task. In the case where the number of clusters is not to be produced by the inner structure of the method, the user has to make decisions regarding which, given the current working conditions, the optimal value may be. This decision is strongly dependent not only to the problem itself but also to the researcher’s personal biases. Over the years, several studies have focused on this particular problem with the aim of proposing novel solutions and comparing available ones [8]. Thus, both investigating and testing, as well as using a diverse set of tools capable of providing “multi-level” outputs regarding possible decisions to the optimal number of clusters, is not only beneficial but stands out as the central point in the whole process. Consequently, the comparison of the available tools by evaluating their results becomes valuable, providing insight into the performance and quality of the outputs. As a result, the task of automating the decision-making process regarding the optimal number of partitions has attracted a considerable amount of the research community’s attention.

As already noted, the types of criteria presented in the theoretical aspects of cluster validation can be divided into three basic types [9, 10]: internal, external and relative criteria. In terms of external criteria, given the different partitions that emerge from the execution, the evaluation is based on a comparison that relies on specific predefined information. On the other hand, regarding the internal approach, the evaluation of the resultant partitions is based solely on the partitions themselves, with no extra information or clustering process repetitions. Lastly, relative cluster validation techniques are based on comparisons of partitions generated by the same algorithm with different parameters and different data subsets. In the present work, all experimental executions were performed using the – implemented in the R programming language – package described in [7]. Since the strategy designed for the evaluation stage was based on comparisons of each clustering algorithm in terms of modifying its parameters, the relative validation approach was investigated [6]. A comparative study of such criteria can be found in [11, 12], works that both provide a fairly extensive experimental framework. The research, however, has also focused on both other types of evaluation criteria, namely the internal and external validation metrics. In [13] a comparative study of 5 external criteria is used in a hierarchical clustering framework, while in [14] a detailed study of 11 widely used internal clustering validation measures is presented. Another empirical comparison of internal-based evaluations can be found in [15], where 7 such validation indices are compared on a wide variety of datasets. A study focused on the comparison of both internal and external validation indices is presented in [16], using K-means and Bissecting-K-Means algorithms over 12 datasets to compare 6 internal and 4 external indicators, with results supporting the assumption that internal criteria tend to be more accurate.

Many forward steps have also been taken to propose new methods of improving the traditional ones. In [17], a consensus clustering approach, using internal evaluation criteria to improve the general outline of applying a certain clustering algorithm with various cluster configurations and select the one that maximizes a certain validity measure, is proposed. The Analytic Hierarchy Process (AHP) is utilized for selecting the best values for the number of clusters in [18], where the AHP model is formed using various information criteria. To overcome the difficulties of identifying the optimal number of clusters in mixed data problems, a generalized mechanism is presented in [9]. The procedure integrates the Renyi along with the Complement entropy. A novel stability-based algorithm that improves the ability of traditional partitioning methods to determine the way groups are formed in various datasets is presented in [20], while a new method called depth difference (DeD), for estimating the optimal number of clusters $k$ in a dataset based on data depth, is proposed in [21]. In [22], a new approach that outputs the desired information by using Meta-learning is presented. The method works by exploiting the idea that meta-features are able to improve the predictive performance of meta-models used for recommending the number of clusters. In closing, an approach that combines both a Silhouette Validity Index and the concept of Local Scaling to calculate the optimal number of clusters in a dataset using a Spectral Clustering algorithm is being proposed in [23].

All of the above, while only a few cases of current scientific production, clearly show, on the one hand, the quantity and variety of research work related to the issue, and, on the other hand, the current ongoing demand for the extension of existing tools. Below is a summary of the approach to the experimental procedure of this work, along with the data and methods used.

3. Methodology

This section provides the necessary descriptive information regarding the experimental procedure. First, information on the type and characteristics of the datasets utilized, followed by a brief discussion of the algorithms and metrics employed, is provided. Then, the design and implementation of the entire experimental setup is been addressed.

3.1 Datasets

Starting from the initial aim of being able to perform comparisons and monitor the overall behavior of every examined validation index, in a way that helps conclusions regarding the performance of each of the proposed indicators to be presented clearly, a collection of datasets was formed, consisted of two different types separated in terms of the way they were created. The first group contains datasets from real-world classification problems, which were drawn from the – publicly available – UCI repository. To conduct experiments, and in order to transform the classification problem into a clustering one, the class feature of the real-world data was not taken into account and was manually deleted. This hidden knowledge of the class, however, was a posteriori necessary for drawing conclusions about the results, a fact that will be addressed in the next section. The most important observation, though, regarding data selection was that these available datasets turned out to have a small number of actual classes. These were in the vast majority of the sets below 6, with most of them being 2 or 3, while many had a rather prohibitive length in terms of the execution time and computational resources needed. As the above facts could reduce the reliability of the experimental process and challenge the generalization of the derived results and conclusions, the creation of the second artificially constructed dataset was considered beneficial. These two general groups of datasets create the conditions to overcome the aforementioned shortcomings, contributing substantially to the reliability of the overall experimental process. It should be noted, however, that the two cases, due to the different origin of the data, were considered separately as two distinct scenarios.

3.1.1 Real-world datasets

Starting with the real-world data, a total of 20 datasets taken from the UCI repository [24] were used. This publicly available repository was selected, among other things, in order to reinforce the additional objective of easy reproduction of the results to be presented. These initial real-world datasets emerged from classification problems and contained labeled data. To be converted into a form suitable for clustering, which is an unsupervised learning task, the class label had to be removed. Table 1 provides useful information about the datasets in a rather concise but comprehensive way. Specifically, the table contains essential form attributes, like the number of instances, features, and actual classes, as well as the number of instances that belong to the most common class. The latter is an indicator of the problem’s difficulty, as classes containing a small number of instances are more unlikely to be identified correctly as clusters. It should be noted that the majority of the datasets used are imbalanced, so there are significant differences between the number of instances in different classes. Actually, this remark is also implied by the aforementioned last column in Table 1. Lastly, all the features were numerical in order to avoid the effect of different types of encoding on the final results, which is rather a matter of a separate ongoing study.

Table 1
Real-world datasets

Dataset names	Number of
	Instances	Features	Classes	PILCC ${}^{*}$
Appendicitis	106	7	2	19.81%
Balance	625	4	3	7.84%
Banknote_authentication	1372	4	2	44.46%
Breast_cancer_coimbra	116	91	2	44.83%
Bupa	345	6	2	42.03%
Cleveland	297	13	5	4.38%
Dermatology	358	34	6	5.59%
Glass	214	9	6	4.2%
Haberman	306	3	2	26.47%
Hayes_roth	160	4	3	19.38%
Iris	150	4	3	33.33%
Led7digit	500	7	10	7.4%
Monk2	432	6	2	47.22%
Phoneme	5404	5	2	29.35%
Pima	768	8	2	34.9%
Vehicle	846	18	4	23.52%
Wine	178	13	3	26.97%
Wine_quality_red	1599	11	6	0.22%
Wine_quality_white	4898	11	6	0.41%
Wisconsin	683	9	2	35%

${}^{*}$ PILCC: Percentage of instances in less common class.

3.1.2 Artificial datasets

Having the UCI datasets constituting the real-world case of this study, a total of 25 additional datasets were manually created in order to substantiate the required wider experimental framework. This was done using the “make_classification” function of the sklearnPython library [25]. The class labels were once again removed in order to meet the clustering task criteria. Since the datasets in this group were created to cover cases that had little or no representation in the original datasets, some useful information about their nature should be provided. The first issue was that the given urgent demand for differentiation in terms of the number of their appearances, their features, and their ranks, statistics depicted in Table 2 where this information is summarized, was one of the main reasons for their construction in the first place. Moreover, all the features in these datasets are informative, so that there were no features constructed by a linear combination of informative ones. Lastly, only 10 of the artificially generated datasets were balanced, while the others were made to be unbalanced, with some of their classes consisted of a relatively small number of instances, in an attempt to increase the overall complexity of the task. All essential information about datasets’s characteristics can be found in Table 2.

Table 2
Artificial datasets

Dataset names	Number of
	Instances	Features	Classes	PILCC ${}^{*}$
Dataset_1	300	5	5	20%
Dataset_2	500	15	10	10%
Dataset_3	400	7	8	12.5%
Dataset_4	2000	10	15	6.67%
Dataset_5	500	9	12	8.33%
Dataset_6	200	8	2	50%
Dataset_7	600	6	3	3.33%
Dataset_8	1500	3	4	25%
Dataset_9	150	4	2	50%
Dataset_10	2000	4	7	14.29%
Dataset_11	550	7	7	5%
Dataset_12	1000	5	8	5%
Dataset_13	400	16	5	10%
Dataset_14	200	12	3	20%
Dataset_15	900	4	5	10%
Dataset_16	500	11	2	20%
Dataset_17	150	6	9	5%
Dataset_18	100	8	2	35%
Dataset_19	300	3	4	10%
Dataset_20	250	5	3	20%
Dataset_21	500	5	17	4%
Dataset_22	300	4	14	3%
Dataset_23	200	5	10	5%
Dataset_24	350	3	7	5%
Dataset_25	400	6	16	2%

${}^{*}$ PILCC: Percentage of instances in less common class.

3.2 Methods, metrics and indicators

Before proceeding to a rather concise presentation of the methods, distance metrics, and validity indices used to calculate the optimal number of clusters, it should be noted that henceforth the validity indices may often be referred to as “indices” or “indicators” for short. In fact, this was the case in several of the above-mentioned areas of the present work.

First of all, a total of 9 algorithmic procedures which were utilized and tested throughout this survey will below be briefly mentioned. During these experimental procedures, a total of 26 indicators for estimating the optimal number of clusters in each of the datasets examined were evaluated, as applied to K-mean and Hierarchical agglomerative clustering algorithms, in terms of the aforementioned wide variety of methods. Lastly, 5 distinct distance metrics were also examined. Thus, a total of 1170 tests per dataset were taken into consideration, in a plethora of potential optimal numbers of clusters proposed.

3.2.1 Methods

Regarding the experimental setting in terms of the algorithmic procedures used, the following 9 well known methods were utilized. First, the widely used K-means method [26, 27] was utilized. Among other things, the K-means algorithm, as one of the most, if not the most frequently used technique in the clustering literature, constituted a safe performance baseline. Obviously, the apparent multiple appearance of the method in the literature makes any reference to further information about its operation unnecessary. Next, various agglomeration methods, which is a bottom-up hierarchical clustering approach, were also exploited. Specifically, the following single-link, complete-link, and minimum-variance algorithm variations were utilized:

•
Two implementations of the Ward method [28]
•
Single-linkage clustering [29, 30, 31]
•
Complete-linkage clustering [32, 33]
•
Average-linkage clustering [31]
•
The Mc-Quitty [31, 34] method
•
The Median method [35]
•
The Centroid method [31]

As a bottom-up hierarchical scheme, all the aforementioned methods share the same attribute. At the beginning of the execution, each case is considered as an autonomous cluster. During the iterations’ steps, the clusters are linked based on their similarity to form bigger clusters. The operation stops when the desired number of clusters is reached. The type of linkage represents the way the similarity between the clusters is measured, indicating the way the method itself is defined.
3.2.2 Metrics

Regarding distance metrics, we will suffice, for reasons of space and readability, with a simple reference to the following extremely well-known metrics, which were used in the experiments.

Specifically, the following 5 metrics were used:

•
Euclidean
•
Maximum
•
Manhattan
•
Canberra
•
Minkowski

3.2.3 Validity indices

As for cluster validation metrics, a very wide variety of indices were used to indicate the optimal number of clusters. The main focus of this work, in terms of the type of indicators, was the approach based on the so-called relative criteria for cluster validation. All these indicators are shown in Table 3. It should be noted that an extensive and formal presentation of all indicators is – given their number – impractical. That is due to, on the one hand, the reasonable length restrictions, and because of the fact that a simple placement of their mathematical formulas will not add anything substantial to the present work. Thus, we urge the reader to look at the referenced documents for further clarification.

Table 3
Validity indices

1	The Calinski and Harabasz index [36]
2	The Duda and Hart ratio criterion [4]
3	The Duda and Hart Pseudo $t^{2}$ index [4]
4	The Hubert and Levin C-Index [37]
5	The Beale index [38]
6	The Cubic Clustering Criterion [39]
7	Point-biserial correlation index [40]
8	The Davies and Bouldin index [41]
9	The Frey and Van Groenewoud index [42]
10	The Hartigan index [43]
11	The Ratkowsky and Lance criterion [44]
12	The Scott and Symons index [45]
13	The Marriot index [46]
14	The Ball index by Ball and Hall [47]
15	The Milligan and Cooper Trcovw index [48]
16	The Tracew index [49]
17	Friedman and Rubin index i [50]
18	Friedman and Rubin index ii [50]
19	The McClain and Rao index [51]
20	The Krzanowski and Lai index [52]
21	The Silhouette index by [53]
22	Dindex [54]
23	The Dunn index [55]
24	The Hubert’s $\Gamma$ statistic [56]
25	The SD validity index [57]
26	The SDbw validity index [58]

Figure 2.

Representation of the experimental procedure.

3.3 Experimental procedure

In this section, the general scheme of the experimental procedure will be thoroughly described. First of all, a focal point of the experimental process, as it has been previously mentioned, was that the initial unbalanced real-world datasets were supplemented by a diverse set of artificially created ones, and both real-world and artificial contained information about the actual number of classes in each dataset, as in classification problems. By converting the classification problem into a clustering one, by simply removing the class feature from the dataset, we can assume that one accepts the correspondence between the two problems even on a rather informal level: the optimal number of clusters simply becomes the number of classes. Thus, class labels provide crucial information to use in the evaluation stage, even if they are not assessed at all during the experiments.

To calculate the validity indices that estimate the optimal number of clusters, the R – based nbclust [7] package was used. Regarding experimental executions, 9 different clustering methods were exploited, and each of these methods was performed 5 times, each time incorporating a different distance metric to calculate the similarities between the instances. As a result, for every dataset, a total of 45 general runs were executed (9 algorithms $\times$ 5 metrics). As the whole process contained, in terms of data, two different scenarios, a total of 45 data sets were used, and a total of 2025 experiments were performed. In addition, it should be borne in mind that each of these executions includes the evaluation of all indicators. Also, in every single run, the algorithm was executed 29 times: the number of clusters k examined varied from $k=2$ to $k=30$ . Then every index proposes a value as the optimal number of clusters, which is also a number between 2 and 30. These arithmetic parameters were not chosen randomly. Given that the largest number of actual classes can be found in Dataset_21, which contains 17 classes, we set the maximum number of clusters equal to a relatively – but not way too much – larger number than 17. This is due to the fact that a much larger number could have made the whole execution “beyond reach”, in terms of time resources. That is, each experiment would have lasted much longer without any substantial experimental expediency.

Then, in order to measure the efficiency and the capacity of every validity indicator to suggest valid optimal values, a metric had to be specified. This was set to be the absolute value of the difference between the number of clusters proposed by each indicator and the actual number of clusters. Therefore, in each experiment, for each index, an integer indicates how far from reality the estimate for the optimal number of clusters is. In both separate scenarios (real/artificial data), the results were grouped according to the distance metrics and thus will be presented in the following section. A visual representation of the whole experimental setting is presented in Fig. 2.

4. Results

The results and some remarks arising from their analysis will now be presented in this section. With regard to the general objective in the experiments carried out, the intention was to identify which of the examined indices performed best in estimating the optimal number of clusters, and, in particular, to identify whether there were specific indices that generally performed well in the majority of datasets. Given the above, and bearing in mind the multitude of the performed tests, it’s easily inferred that the number of indicators, metrics and their combinations becomes quite restrictive in terms of a full presentation of them all. In addition, the numerous outputs and tables of results that were produced would be extremely hard to follow. These, combined with the limited space, led us to choose other more concise ways of presentation. The latter mainly involved the use of some well-known diagrammatic techniques, which tend to summarize the information and make it much easier to follow. Thus, in both case studies (Real-World, Artificial), the results were grouped by distance metric.

Moreover, here some clarification regarding the two case studies is required: they are studied separately since, in the case of the real-world datasets, the majority of the accessible – and consequently the finally selected – sets had a number of classes – i.e. number of actual clusters – less than or equal to 6. As a result, indices that tend to select fewer clusters as the ideal number of groups in each of the partitions have a clear inherent advantage which is not necessarily based on non-random events. This was the reason why, in creating the artificial data scenario, priority was given to the widest possible diversity of real clusters so that safer conclusions could be drawn about the behavior of the indicators. And, indeed, the ranking of the indicators (except for the first and the last, which remain in their respective positions in all situations) differs significantly between the two cases. However, this does not affect the demand to conduct experiments and draw conclusions from real-world data, which are rarely available in formats that meet all the requirements or cover every possible scenario. To compare the indicators, the Friedman ranking test [59] was performed together with the Bonferroni-Dunn post hoc test [60]. These scores provide a good picture of the performance of the quantities examined: a good performance equivalizes to positions in the upper places, while the worst performing indicators will fall down to the lower parts of the table. The results of the Friedman’s rankings can be found in Tables 4 and 5, for real-world and artificial data respectively.

Table 4
Real-world dataset Friedman scores

	Real
	Eyclidean		Manhattan		Maximum		Canberra		Minkowski
	Rank	Index	Rank	Index	Rank	Index	Rank	Index	Rank	Index
1st	7.34682	Ball	7.47093	Ball	7.38728	Ball	7.64525	Beale	7.34682	Ball
2nd	8.19075	McClain	8.29070	McClain	7.93931	McClain	7.68436	Ball	8.19075	McClain
3rd	9.83237	Silhouette	9.86628	Beale	9.66763	Beale	8.20950	Duda	9.83237	Silhouette
4th	9.97977	Beale	10.20058	Silhouette	9.96243	Duda	8.20950	PseudoT2	9.97977	Beale
5th	10.32081	Duda	10.21802	TrCovW	9.96243	PseudoT2	8.25978	McClain	10.32081	Duda
6th	10.38150	PseudoT2	10.28779	Duda	10.38728	Silhouette	9.83520	CCC	10.38150	PseudoT2
7th	10.58960	TrCovW	10.28779	PseudoT2	10.79769	TrCovW	11.40782	TrCovW	10.58960	TrCovW
8th	11.10116	TraceW	10.85174	Frey	11.27457	Frey	11.84078	Frey	11.10116	TraceW
9th	11.10983	Frey	11.18023	TraceW	11.68497	TraceW	11.85196	CH	11.10983	Frey
10th	12.04335	Ratkowsky	11.47965	Ratkowsky	12.13584	SDindex	12.11173	Ratkowsky	12.04335	Ratkowsky
11th	12.18208	Hartigan	12.07849	Scott	12.57225	Ratkowsky	12.15084	SDindex	12.18208	Hartigan
12th	12.57514	Scott	12.16279	CCC	12.65607	Hartigan	12.18156	Silhouette	12.57514	Scott
13th	12.77457	SDindex	12.61628	SDindex	12.94220	CCC	12.34916	DB	12.77457	SDindex
14th	13.11850	CH	12.76163	Hartigan	12.97688	Scott	12.68436	TraceW	13.11850	CH
15th	13.26012	CCC	13.06686	CH	13.12717	CH	13.09218	Hartigan	13.26012	CCC
16th	13.51445	PtBiserial	13.64244	PtBiserial	13.40173	DB	13.61732	Scott	13.51445	PtBiserial
17th	13.77457	Marriot	13.64244	Marriot	14.29769	PtBiserial	13.87989	Dunn	13.77457	Marriot
18th	14.34104	DB	14.43895	DB	14.43931	Marriot	15.57542	Marriot	14.34104	DB
19th	15.82370	Cindex	14.94477	Cindex	15.29769	Dunn	15.72067	PtBiserial	15.82370	Cindex
20th	16.13006	Dindex	16.27616	Dindex	15.86994	Dindex	16.25419	Dindex	16.13006	Dindex
21th	16.13006	Hubert	16.27616	Hubert	15.86994	Hubert	16.25419	Hubert	16.13006	Hubert
22th	16.28324	Dunn	17.15116	Dunn	15.93353	Cindex	18.26257	Rubin	16.28324	Dunn
23th	16.95087	KL	17.80814	KL	17.10405	KL	18.37989	Friedman	16.95087	KL
24th	18.67630	Rubin	19.09302	Friedman	18.60694	Rubin	18.78771	KL	18.67630	Rubin
25th	19.18786	Friedman	19.68895	Rubin	19.44798	Friedman	20.12849	Cindex	19.18786	Friedman
26th	25.38150	SDbw	25.21802	SDbw	25.25723	SDbw	24.62570	SDbw	25.38150	SDbw

Figure 3.

CD diagram: Real-world dataset.

Table 5

Artificial dataset Friedman scores

	Artificial
	Eyclidean		Manhattan		Maximum		Canberra		Minkowski
	Rank	Index	Rank	Index	Rank	Index	Rank	Index	Rank	Index
1st	7.99554	Ball	7.98430	Ball	7.85333	Ball	7.58372	Ball	7.99554	Ball
2nd	9.62946	TrCovW	9.32960	Ratkowsky	9.97333	TrCovW	9.92326	TraceW	9.62946	TrCovW
3rd	10.27009	Ratkowsky	9.69507	TrCovW	9.98667	Ratkowsky	10.16047	Ratkowsky	10.27009	Ratkowsky
4th	10.48214	TraceW	10.43722	TraceW	10.66444	TraceW	10.60698	SDindex	10.48214	TraceW
5th	10.63616	Cindex	10.62108	Hartigan	10.80222	Hartigan	10.64419	TrCovW	10.63616	Cindex
6th	10.82589	SDindex	10.98879	SDindex	11.01556	SDindex	10.83023	Scott	10.82589	SDindex
7th	10.95536	Hartigan	11.05830	Cindex	11.55778	PseudoT2	10.90000	CH	10.95536	Hartigan
8th	11.28125	Scott	11.65247	PseudoT2	11.60222	Duda	11.20233	Duda	11.28125	Scott
9th	11.67857	Duda	11.69283	Duda	11.71111	Scott	11.20233	PseudoT2	11.67857	Duda
10th	11.67857	PseudoT2	11.69283	Scott	11.76444	Beale	11.30233	McClain	11.67857	PseudoT2
11th	11.84821	Beale	11.94843	Beale	11.77556	McClain	11.34884	Hartigan	11.84821	Beale
12th	11.91964	McClain	11.96413	McClain	11.95556	Cindex	11.40465	Beale	11.91964	McClain
13th	12.32366	Marriot	12.61659	Marriot	12.13778	CH	12.22326	CCC	12.32366	Marriot
14th	12.57813	CH	12.86771	CH	12.53111	Marriot	12.96977	Friedman	12.57813	CH
15th	12.92411	Frey	13.02242	Frey	12.84444	Silhouette	13.00465	Dunn	12.92411	Frey
16th	13.55580	Silhouette	13.71749	Silhouette	13.36667	Frey	13.88372	Marriot	13.55580	Silhouette
17th	14.38616	PtBiserial	14.19058	PtBiserial	14.50444	Friedman	14.06977	Frey	14.38616	PtBiserial
18th	14.52009	Friedman	14.50000	Rubin	14.59778	Rubin	14.11395	Rubin	14.52009	Friedman
19th	15.02679	Rubin	15.10987	Friedman	15.17333	PtBiserial	14.77674	PtBiserial	15.02679	Rubin
20th	15.93304	KL	15.61659	KL	15.66667	CCC	14.86512	Silhouette	15.93304	KL
21th	16.51339	Dunn	16.21973	CCC	15.81111	DB	16.36977	KL	16.51339	Dunn
22th	16.53571	CCC	16.52466	DB	16.61111	KL	16.60698	DB	16.53571	CCC
23th	16.83929	DB	17.06726	Dunn	16.67778	Dunn	18.10000	Dindex	16.83929	DB
24th	18.52902	Dindex	18.53363	Dindex	18.47778	Dindex	18.10000	Hubert	18.52902	Dindex
25th	18.52902	Hubert	18.53363	Hubert	18.47778	Hubert	21.55814	Cindex	18.52902	Hubert
26th	23.60491	SDbw	23.41480	SDbw	23.46000	SDbw	23.24884	SDbw	23.60491	SDbw

Moreover, the null hypothesis was rejected. Note that a post-hoc analysis using the Bonferroni-Dunn test provides feedback on which indices behave similarly. That is, all indices behaved similarly in all 10 cases – bearing in mind that the results were grouped per metric and there were 5 distance metrics and 2 distinct study cases. Figure 3 shows the results as Critical Difference graphs (CD diagrams) for the case of real-world data. Here, these charts depict the top ten performing indices in respect to their Friedman rankings. The cd value is calculated, and the indices whose Friedman ranking values differ by a value less than or equal to the above are associated with a strong horizontal line in the graph. That is, given the cd – values, for each pair of indicators, the absolute value of the difference in their rankings is calculated. If this value is higher than the cd, then two indicators differ in their average value, and this is what one means by saying that they have different behaviors. In the illustrations, the indicators that are shown to be connected do not differ significantly. In Fig. 4 the CD diagrams in the case of the artificially created data are shown.

Figure 4.

CD diagram: Artificial dataset.

Both the results of the Friedman test, as well as the CD diagrams, demonstrate the overall al superiority of the Ball index over the other indices. Specifically, the Ball index ranks first in all cases except for the use of the Canberra metric in the case of real-world data, where it ranks second after Beale by a small margin. It is also quite clear, from a simple observation of the cd charts, that the Ball index shows a statistically significant difference compared to the vast majority of other indicators. In particular, in the case of artificial data its behavior is significantly different from all the others,

Figure 5.

Box-plots: Real-world dataset.

Figure 6.

Box-plots: Artificial dataset.

Figure 7.

Heatmaps of all 25 artificial datasets regarding (from left to right) K-means, Single and Ward methods.

while, in the real-world dataset, only a few connections that imply performance similarities can be spotted, with the majority of them indicating dependencies just between the Ball and McClain indices. Apart from the aforementioned dominance in both case studies, the indicators that follow in the ranking differ in relation to the use of real or artificial data. To be more specific, in the real-world senario, the leading Ball index is followed by McClain, Silhoutte, Beale and TrCovW, while in the artificially generated datasets the top five – with the exception of Ball – is usually consisted of TrCovW, Ratkowski, TraceW, and Hartigan. With regard to the indicators with the lowest performance, it is worth noting that in all cases, the SDbw index ranks last with a significant difference from the rest. Variations and rearrangements are observed in the middle positions of the table, implying a rather inconclusive performance ranking.

An additional way to provide a visual representation of the extracted results is via box-plots. Figure 5 depicts the box – plots graphs of all the examined indices, in the case of real-world data, in terms of each distance metric. The corresponding box – plot graphs for the artificially created data can be found in Fig. 6. To decode the underlying information of these plots, note that we reach for the indices with the least possible divergence from the actual classes – in terms of its recommended ideal number of clusters, i.e. those with box-plots that appear closest to zero. A value of zero indicates that the discrepancy between the number of clusters recommended by the relevant index and the actual class number is minimal. We can clearly observe the remarkable, almost universal superiority of the Ball index both in the real-world and in artificially generated datasets. This seems to be evident in all three ways in which the results have been presented so far. However, what can be deduced with reasonable certainty is that there is little evidence for the superiority of the Ball index both in the real world and in generated datasets.

Lastly, heatmaps of the kmeans, single and ward methods regarding the artificial dataset case study are provided in Fig. 7. Once again, the plots are grouped in respect to the distance metric that the methods internally utilized to measure the distances between objects. So for each one of the 3 aforementioned methods, a total of 5 heatmaps are depicted. A color scale is used to show the deviation of the estimate of each indicator from the actual number of clusters. In the presented heatmaps, the columns represent the validity indices, and the rows denote the serial numbers for each of the 25 datasets. The darker the square, the larger the estimated difference from the actual value. This means that columns with lighter shades of blue that tend to look white represent better performance than columns painted in darker shades.

5. Conclusions and future work

In this last section, the final remarks of this work are presented regarding the interpretation of the results of the experimental process, as well as an outline of the future targeting. The purpose of this paper was to compare a large number of proposed cluster validity indices, in order to – ideally – come to both safe as well as useful conclusions about their performance, regarding the – quite fuzzy – task of estimating the optimal number of clusters when a specific dataset is given. For this reason, two scenarios were designed, each of which consisted of data from different origins: the first includes existing real-world problem datasets, while the second was artificially constructed for the purposes of the current framework. In total 45 datasets were examined. In order to evaluate the performance of the validity indices in question and to produce comparable findings, a series of experiments were performed using 9 different clustering algorithms and 5 different distance metrics, resulting in 45 experimental schemes. In order to ensure a relative variation in the complexity of the subject, special attention was paid to the diversity of the datasets, both in terms of the actual number of groups and their characteristics, as well as in terms of the number of cases within the clusters. After a statistical study, the indicators were ranked based on their accuracy according to the Friedman ranking.

The results, while providing a valuable insight into which validity indicators work well in both scenarios, do not give a rigorous or definitive ranking, with the exception, as mentioned above, of the top positions in the ranking in almost all metrics. It is worth noting that if we exclude the Ball index, which seems to perform better than all the others in the vast majority of tests, the top performance scores for each of the two scenarios vary, a result that is most likely related to the number of classes, which, on average, were lower in actual data. As a result, the difficulty of the task is underlined once more, as is the need for alternative techniques for evaluating estimates of the optimal number of clusters.

Given the above, an interesting extension that incorporates the current study would be the development of validity indices groups, the estimates of which will be merged to eventually propose a final best as the optimal number of clusters [61]. Such a combination module designed to be robust and efficient, may then be used as a stand-alone indicator. A second possible future area of interest would be to use ensemble learning to construct ensembles of clustering algorithms [62, 63]. An ensemble clusterer would utilize the number of clusters supplied by the assessment method as input to build a final model by merging the individual base learners. According to the ensemble theory, in an ensemble setting that exploits accurate as well as diverse base learners, ensembles of learners would be more accurate and robust, as they tend to perform generally better than their individual base learners [64]. As a result, there is enough evidence to suggest that such a research direction would constitute a valid endeavor, resulting in methods that provide better data partitions.

References

Gentle

Kaufman

Rousseuw

. Finding Groups in Data: An Introduction to Cluster Analysis. Vol. 47, Biometrics. 1991; 788.

Estivill-Castro

. Why so many clustering algorithms. ACM SIGKDD Explor Newsl. 2002; 4(1): 65–75.

Aldenderfer

Blashfield

. Cluster Analysis [Internet]. Thousand Oaks, California; 1984. Available from: https://methods.sagepub.com/book/cluster-analysis.

Clarke

MRB

Duda

Hart

. Pattern classification and scene analysis. J R Stat Soc Ser A. 1974; 137(3): 442.

Halkidi

Vazirgiannis

Batistakis

. Quality Scheme Assessment in the Clustering Process. In: LNCS (LNAI). 2000. pp. 265–76.

Theodoridis

Koutroumbas

. Pattern Recognition. 4th editio. Academic Press; 2009.

Charrad

Ghazzali

Boiteau

Niknafs

. Nbclust: An R package for determining the relevant number of clusters in a data set. J Stat Softw. 2014; 61(6): 1–36.

Yuan

Yang

. Research on k-value selection method of k-means clustering algorithm. J. 2019; 2(2): 226–35.

Halkidi

Batistakis

Vazirgiannis

. On clustering validation techniques. J Intell Inf Syst. 2001; 17(2-3): 107–45.

10.

Brun

Sima

Hua

Lowey

Carroll

Suh

, et al. Model-based evaluation of clustering validation measures. Pattern Recognit. 2007; 40(3): 807–24.

11.

Vendramin

Jaskowiak

Campello

RJGB

. On the combination of relative clustering validity criteria. ACM Int Conf Proceeding Ser. 2013; (April 2009).

12.

Vendramin

Campello

RJGB

Hruschka

. Relative Clustering Validity Criteria: A Comparative Overview in Wiley InterScience (www.interscience.wiley.com). 2010. Available from: www.interscience.wiley.com.

13.

Milligan

Cooper

. A study of the comparability of external criteria for hierarchical cluster analysis. Multivariate Behav Res. 1986; 21(4): 441–58.

14.

Liu

Xiong

Gao

. Understanding of internal clustering validation measures. Proc – IEEE Int Conf Data Mining, ICDM. 2010; 911–6.

15.

Hämäläinen

Jauhiainen

Kärkkäinen

. Comparison of internal clustering validation indices for prototype-based clustering. Algorithms. 2017; 10(3).

16.

Rendón

Abundez

Arizmendi

Quiroz

. Internal versus external cluster validation indexes. Int J Comput Commun [Internet]. 2011; 5(1): 27–34. Available from: http://w.naun.org/multimedia/UPress/cc/20-463.pdf.

17.

Ünlü

Xanthopoulos

. Estimating the number of clusters in a dataset via consensus clustering. Expert Syst Appl [Internet]. 2019; 125: 33–9. Available from: doi: 10.1016/j.eswa.2019.01.074.

18.

Akogul

Erisoglu

. An approach for determining the number of clusters in a modelbased cluster analysis. Entropy. 2017; 19(9): 1–15.

19.

Liang

Zhao

Cao

Dang

. Determining the number of clusters using information entropy for mixed data. Pattern Recognit [Internet]. 2012; 45(6): 2251–65. Available from: doi: 10.1016/j.patcog.2011.12.017.

20.

Lord

Willems

Lapointe

Makarenkov

. Using the stability of objects to determine the number of clusters in datasets. Inf Sci (Ny) [Internet]. 2017; 393(February): 29–46. Available from: doi: 10.1016/j.ins.2017.02.010.

21.

Patil

Baidari

. Estimating the optimal number of clusters k in a dataset using data depth. Data Sci Eng [Internet]. 2019; 4(2): 132–40. Available from: doi: 10.1007/s41019-019-0091-y.

22.

Pimentel

de Carvalho

ACPLF

. A meta-learning approach for recommending the number of clusters for clustering algorithms. Knowledge-Based Syst [Internet]. 2020; 195(xxxx): 105682. Available from: doi: 10.1016/j.knosys.2020.105682.

23.

Mur

Dormido

Duro

Dormido-Canto

Vega

. Determination of the optimal number of clusters using a spectral clustering optimization. Expert Syst Appl [Internet]. 2016; 65: 304–14. Available from: doi: 10.1016/j.eswa.2016.08.059.

24.

Dua

Graff

. UCI Machine Learning Repository [Internet]. 2017. Available from: http://archive.ics.uci.edu/ml.

25.

Pedregosa

Varoquaux

Gramfort

Michel

Thirion

Grisel

, et al. Scikitlearn: Machine learning in python. J Mach Learn Res. 2011; 12: 2825–30.

26.

MacQueen

. Some methods for classification and analysis of MultiVariate observations. Proc fifth Berkeley Symp Math Stat Probab. 1967; 1: 281–97.

27.

Hartigan

Wong

. Algorithm AS 136: A k-means clustering algorithm. J R Stat Soc Ser C (Applied Stat [Internet]. 1979 Aug 24; 28(1): 100–8. Available from: https://www-jstor-org.web.bisu.edu.cn/stable/2346830.

28.

Ward

. Hierarchical Grouping to Optimize an Objective Function. Vol. 58, Journal of the American Statistical Association. 1963; 236–44.

29.

Florek

Józef

Perkal

Steinhaus

Zubrzycki

. Sur la liaison et la division des points d’un ensemble fini. Colloq Math. 1951; 2: 282–5.

30.

Gower

Ross

GJS

. Minimum spanning trees and single linkage cluster analysis. J R Stat Soc Ser C (Applied Stat [Internet]. 1969 Aug 24; 18(1): 54–64. Available from: https://www-jstor-org.web.bisu.edu.cn/stable/2346439.

31.

Sokal

Michener

, of Kansas U. A Statistical Method for Evaluating Systematic Relationships [Internet]. University of Kansas; 1958. (University of Kansas science bulletin). Available from: https://books.google.gr/books?id=o1BlHAAACAAJ.

32.

Češka

. Estimation of the mean floristic similarity between and within sets of vegetational relevés. Folia Geobot Phytotaxon. 1966; 1(3): 93–100.

33.

Defays

. An efficient algorithm for a complete link method. Comput J. 1977; 20: 364–6.

34.

McQuitty

. Similarity analysis by reciprocal pairs for discrete and continuous data. Educ Psychol Meas [Internet]. 1966 Dec 1; 26(4): 825–31. Available from: doi: 10.1177/001316446602600402.

35.

Gower

. A comparison of some methods of cluster analysis. Biometrics [Internet]. 1967 Aug 24; 23(4): 623–37. Available from: https://www-jstor-org.web.bisu.edu.cn/stable/2528417.

36.

Caliński

Harabasz

. A dendrite method for cluster analysis. Commun Stat [Internet]. 1974; 3(1): 1–27. Available from: doi: 10.1080/03610927408827101.

37.

Hubert

Levin

. A General Statistical Framework for Assessing Categorical Clustering in Free Recall. 1976; 83(6): 1072–80.

38.

Beale

EML

. Euclidean Cluster Analysis [Internet]. Scientific Control Systems Limited; 1969. Available from: https://books.google.gr/books?id=o3QdMwEACAAJ.

39.

Sarle

. SAS Technical Report A-108, Cubic Clustering Criterion. NC: SAS Institute Inc. Cary; 1983.

40.

Milligan

. An examination of the effect of six types of error perturbation on fifteen clustering algorithms. Psychometrika. 1980; 45(3): 325–42.

41.

Davies

Bouldin

. A cluster separation measure. IEEE Trans Pattern Anal Mach Intell. 1979; PAMI-1(2): 224–7.

42.

Frey

van

. A Cluster Analysis of the D2 Matrix of White Spruce Stands in Saskatchewan Based on the Maximum-Minimum Principle Author (s): T. Frey and H. van Groenewoud Published by: British Ecological Society Stable URL: https://www-jstor-org.web.bisu.edu.cn/stable/2258571 REFE. J Ecol. 2016; 60(3): 873–86.

43.

Hartigan

. Clustering Algorithms. 99th ed. USA: John Wiley & Sons, Inc.; 1975.

44.

Ratkowsky

Lance

. Criterion for determining the number of groups in a classification. Aust Comput J. 1978; 10(3): 115–7.

45.

Scott

Symons

. Clustering methods based on likelihood ratio criteria. Biometrics. 2012; 27(2): 387–97.

46.

Marriott

FHC

. Practical problems in a method of cluster analysis. Biometrics. 1971; 27(3): 501.

47.

Ball

Hall

. Isodata, A Novel Method of Data Analysis and Pattern Classification. In 1965.

48.

Milligan

Cooper

. An examination of procedures for determining the number of clusters in a data set. Psychometrika. 1985; 50(2): 159–79.

49.

Fukunaga

Koontz

WLG

. A criterion and an algorithm for grouping data. IEEE Trans Comput. 1970; C-19(10): 917–23.

50.

Friedman

Rubin

. On Some Invariant Criteria for Grouping Data Author (s): H.P. Friedman and J. Rubin Published by: Taylor & Francis, Ltd. on behalf of the American Statistical Association Stable URL: https://www-jstor-org.web.bisu.edu.cn/stable/2283767 Accessed: 02-03-2016 18: J Am Stat Assoc. 2016; 62(320): 1159–78.

51.

McClain

Rao

. CLUSTISZ: A program to test for the quality of clustering of a set of objects. J Mark Res [Internet]. 1975; 12(4): 456–60. Available from: https://www-jstor-org.web.bisu.edu.cn/stable/3151097.

52.

Krzanowski

Lai

. A criterion for determining the number of groups in a data set using sum-of-squares clustering. Biometrics. 1988; 44(1): 23.

53.

Rousseeuw

. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math [Internet]. 1987; 20: 53–65. Available from: https://www-sciencedirect-com-443.web.bisu.edu.cn/science/article/pii/0377042787901257.

54.

Lebart

Morineau

Piron

. Statistique exploratoire multidimensionnelle. Stat Explor Multidimens. 1995; 439.

55.

Dunn

. Well-separated clusters and optimal fuzzy partitions. J Cybern [Internet]. 1974; 4(1): 95–104. Available from: doi: 10.1080/01969727408546059.

56.

Hubert

Arabie

. Comparing partitions. J Classif [Internet]. 1985; 2(1): 193–218. Available from: doi: 10.1007/BF01908075.

57.

Halkidi

Batistakis

Vazirgiannis

. Clustering validity checking methods: Part II. SIGMOD Rec. 2002; 31(3): 19–27.

58.

Halkidi

Vazirgiannis

. Clustering validity assessment: finding the optimal partitioning of a data set. Proc 2001 IEEE Int Conf Data Min. 2001; 187–94.

59.

Friedman

. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J Am Stat Assoc. 1937; 32(200): 675–701.

60.

Dunn

. Multiple comparisons among means. J Am Stat Assoc. 1961; 56(293): 52.

61.

Zhou

Z-H

. Ensemble Learning. Encycl Biometrics. 2009; 270–3.

62.

Vega-Pons

Ruiz-Shulcloper

. A survey of clustering ensemble algorithms. Int J Pattern Recognit Artif Intell. 2011; 25(3): 337–72.

63.

Zhang

. Ensemble Machine Learning. Ensemble Machine Learning. 2012.

64.

Dong

Cao

Shi

. A survey on ensemble learning. Front Comput Sci [Internet]. 2020; 14(2): 241–58. Available from: doi: 10.1007/s11704-019-8208-z.

Investigating cluster validation metrics for optimal number of clusters determination

Abstract

Keywords

1. Introduction

3. Methodology

3.1 Datasets

3.1.1 Real-world datasets

Table 1 Real-world datasets

Table 2 Artificial datasets

3.2.1 Methods

• Euclidean • Maximum • Manhattan • Canberra • Minkowski 3.2.3 Validity indices

Table 3 Validity indices

4. Results

Table 4 Real-world dataset Friedman scores

References

Table 1
Real-world datasets

Table 2
Artificial datasets

•
Euclidean
•
Maximum
•
Manhattan
•
Canberra
•
Minkowski

3.2.3 Validity indices

Table 3
Validity indices

Table 4
Real-world dataset Friedman scores