Abstract
In short, clustering is the process of partitioning a given set of objects into groups containing highly related instances. This relation is determined by a specific distance metric with which the intra-cluster similarity is estimated. Finding an optimal number of such partitions is usually the key step in the entire process, yet a rather difficult one. Selecting an unsuitable number of clusters might lead to incorrect conclusions and, consequently, to wrong decisions: the term “optimal” is quite ambiguous. Furthermore, various inherent characteristics of the datasets, such as clusters that overlap or clusters containing subclusters, will most often increase the level of difficulty of the task. Thus, the methods used to detect similarities and the parameter selection of the partition algorithm have a major impact on the quality of the groups and the identification of their optimal number. Given that each dataset constitutes a rather distinct case, validity indices are indicators introduced to address the problem of selecting such an optimal number of clusters. In this work, an extensive set of well-known validity indices, based on the approach of the so-called relative criteria, are examined comparatively. A total of 26 cluster validation measures were investigated in two distinct case studies: one in real-world and one in artificially generated data. To ensure a certain degree of difficulty, both real-world and generated data were selected to exhibit variations and inhomogeneity. Each of the indices is being deployed under the schemes of 9 different clustering methods, which incorporate 5 different distance metrics. All results are presented in various explanatory forms.
Introduction
The term clustering describes the unsupervised Machine Learning task, in which a set of objects is partitioned in ways that instances placed in the same group appear to be more similar [1]. These resulting groups are referred to as clusters. However, from a generalization perspective, the concept of a cluster can not be explicitly defined, in terms of both form and content, for every given dataset. This is due to the fact that what is considered a well-defined cluster or a good partition depends, to a large extent, both on the application and background of the researchers [2] and on the – quite fuzzy – concept of similarity. As a result, different metrics and views are reflected in different definitions. In a top-down perspective, clustering is the result of dividing a heterogeneous population into a series of more homogeneous subgroups [3], while in a bottom-up one, clustering is the process of forming groups via data partitioning using a given similarity criterion [4]. In both views, the key element is the mirror notions of similarity and dissimilarity. These, in practice, are determined merely by the use of a distance metric. In other words, given a how-to-measure arrangement, items that result belonging to the same group are more similar – in terms of the aforementioned metric – compared to the ones that lie in different ones. Consequentially, selecting the number of clusters, a function that is – among other things – depended on both the distance metric and the structure of the dataset as well as the method utilized, becomes crucial. Thus, determining the optimal number of clusters stands out as one – if not the most – extremely important step in the whole process.
A graphical representation of different partitionings of the same dataset, using Agglomerative Clustering with Average, Complete or Single-linkage along with Euclidean or Manhattan distance metric.
Furthermore, most real-world classification tasks that utilize cluster analysis carry the problem of optimal number of cluster determination as an intrinsic and fundamental parameter, comparable – in terms of output importance – to the algorithmic procedure selection. This is obviously because – in most cases – the deployment of a specific clustering method is vital to the way the output space is shaped, having every corresponding option differentiate the partitions produced. Each algorithm will most likely alter the instances contained in each group leading to a different set of clusters. Moreover, even within a specific algorithm, changing the parameters and the representational structures of the data will also lead to output alterations, having a rather substantial impact on the way the final partitions are eventually formed [5]. A small demonstration of the above is presented in Fig. 1. This is an illustration of six different methods, all variants of the Agglomerative clustering algorithm, as applied to the same set of data. The setting uses a fixed number of clusters equal to 4. Specifically, the 6 aforementioned distinct methods are formed by the combinations of 3 different kinds of links and 2 different distance metrics. These include, on the one hand, Single-linkage, Complete-linkage and Average-linkage clustering, and, on the other, the Euclidean and Manhattan distance metrics. In each of the cases, output fluctuations reflected in the various groupings of the original dataset can be observed.
Typically, there are two ways to assign the number clusters. Either the number of clusters is determined by the user as a required initialization parameter at the beginning of the procedure, or the algorithm itself exploits inner mechanisms to automatically set the appropriate number of partitions. This selection step is – regardless of the selection mode – both fuzzy and non-singular in terms of a deterministic output choice of the number of clusters. Thus, the demand for cluster validity evaluation, and the respective attributes which emanate from this assessment affecting the overall quality of the procedure, set the direction of a rather essential line of research. Considering the difficulty of the problem as well as the general importance of clustering in data mining, the volume of related literature is – as expected – large. In addition, a reasonably extended part of this relevant literature contains a similarly substantial number of proposed indicators, all aiming at an insightful estimation of the optimal number of clusters that – ideally – works in any given dataset. There are three general ways to approach the cluster validation assessment problem, each of which is based on different evaluation criteria [6]: internal, external, and relative cluster validation. This work is structured so that it represents in a comprehensive and straightforward manner an extensive experimental study of validity indices that belong to the third of the aforementioned approaches of cluster evaluation criteria, namely the relative approach. The above categorization, however, is not meant as a definition of explicitly formulated experimental instructions. It only stands as just a, so to speak, profile of the main content of the procedure.
The main task of – the examined in this paper – relative-based cluster validity indices is to evaluate the estimated number of clusters in terms of some notion of the overall quality of the produced partitioning. They do that by exploiting the following intra-algorithmic comparative strategy: take a specific algorithm and, by modifying its parameters, create execution variations to generate various results, which will then be evaluated comparatively by contrasting the performance of each possible schema with any other. These validity indices make use of some statistical and geometric attributes of the data, exploiting distance metrics properties, as well as intra-cluster information such as compactness and isolation [7].
Now, on to the settings of the experimental process, a total of 26 validity indices for estimating the optimal number of clusters in a given dataset were evaluated. These were applied to a quite large set of methods, consisted of variations of k-means and hierarchical agglomerative clustering algorithms. This variety of methods were supplemented by a set of 5 different distance metrics, resulting to an extensive experimentation regarding proposing and selecting the best possible partitions. In the following sections elements from the results of the implementations of all the above will be presented. Starting from Chapter 2, related works mainly from the contemporary literature will be indicatively addressed. Next, the overall setting of the experimental procedure – that is the methods, distance metrics and validity indices used – will be presented. Concluding this introduction, the results of the experimental process, as well as the various emerging conclusions, will be addressed in the last sections.
Estimating the best number of partitions to be made in a given dataset constitutes – as already repeatedly noted in the introduction – one of the most crucial steps in almost every clustering task. In the case where the number of clusters is not to be produced by the inner structure of the method, the user has to make decisions regarding which, given the current working conditions, the optimal value may be. This decision is strongly dependent not only to the problem itself but also to the researcher’s personal biases. Over the years, several studies have focused on this particular problem with the aim of proposing novel solutions and comparing available ones [8]. Thus, both investigating and testing, as well as using a diverse set of tools capable of providing “multi-level” outputs regarding possible decisions to the optimal number of clusters, is not only beneficial but stands out as the central point in the whole process. Consequently, the comparison of the available tools by evaluating their results becomes valuable, providing insight into the performance and quality of the outputs. As a result, the task of automating the decision-making process regarding the optimal number of partitions has attracted a considerable amount of the research community’s attention.
As already noted, the types of criteria presented in the theoretical aspects of cluster validation can be divided into three basic types [9, 10]: internal, external and relative criteria. In terms of external criteria, given the different partitions that emerge from the execution, the evaluation is based on a comparison that relies on specific predefined information. On the other hand, regarding the internal approach, the evaluation of the resultant partitions is based solely on the partitions themselves, with no extra information or clustering process repetitions. Lastly, relative cluster validation techniques are based on comparisons of partitions generated by the same algorithm with different parameters and different data subsets. In the present work, all experimental executions were performed using the – implemented in the R programming language – package described in [7]. Since the strategy designed for the evaluation stage was based on comparisons of each clustering algorithm in terms of modifying its parameters, the relative validation approach was investigated [6]. A comparative study of such criteria can be found in [11, 12], works that both provide a fairly extensive experimental framework. The research, however, has also focused on both other types of evaluation criteria, namely the internal and external validation metrics. In [13] a comparative study of 5 external criteria is used in a hierarchical clustering framework, while in [14] a detailed study of 11 widely used internal clustering validation measures is presented. Another empirical comparison of internal-based evaluations can be found in [15], where 7 such validation indices are compared on a wide variety of datasets. A study focused on the comparison of both internal and external validation indices is presented in [16], using K-means and Bissecting-K-Means algorithms over 12 datasets to compare 6 internal and 4 external indicators, with results supporting the assumption that internal criteria tend to be more accurate.
Many forward steps have also been taken to propose new methods of improving the traditional ones. In [17], a consensus clustering approach, using internal evaluation criteria to improve the general outline of applying a certain clustering algorithm with various cluster configurations and select the one that maximizes a certain validity measure, is proposed. The Analytic Hierarchy Process (AHP) is utilized for selecting the best values for the number of clusters in [18], where the AHP model is formed using various information criteria. To overcome the difficulties of identifying the optimal number of clusters in mixed data problems, a generalized mechanism is presented in [9]. The procedure integrates the Renyi along with the Complement entropy. A novel stability-based algorithm that improves the ability of traditional partitioning methods to determine the way groups are formed in various datasets is presented in [20], while a new method called depth difference (DeD), for estimating the optimal number of clusters
All of the above, while only a few cases of current scientific production, clearly show, on the one hand, the quantity and variety of research work related to the issue, and, on the other hand, the current ongoing demand for the extension of existing tools. Below is a summary of the approach to the experimental procedure of this work, along with the data and methods used.
Methodology
This section provides the necessary descriptive information regarding the experimental procedure. First, information on the type and characteristics of the datasets utilized, followed by a brief discussion of the algorithms and metrics employed, is provided. Then, the design and implementation of the entire experimental setup is been addressed.
Datasets
Starting from the initial aim of being able to perform comparisons and monitor the overall behavior of every examined validation index, in a way that helps conclusions regarding the performance of each of the proposed indicators to be presented clearly, a collection of datasets was formed, consisted of two different types separated in terms of the way they were created. The first group contains datasets from real-world classification problems, which were drawn from the – publicly available – UCI repository. To conduct experiments, and in order to transform the classification problem into a clustering one, the class feature of the real-world data was not taken into account and was manually deleted. This hidden knowledge of the class, however, was a posteriori necessary for drawing conclusions about the results, a fact that will be addressed in the next section. The most important observation, though, regarding data selection was that these available datasets turned out to have a small number of actual classes. These were in the vast majority of the sets below 6, with most of them being 2 or 3, while many had a rather prohibitive length in terms of the execution time and computational resources needed. As the above facts could reduce the reliability of the experimental process and challenge the generalization of the derived results and conclusions, the creation of the second artificially constructed dataset was considered beneficial. These two general groups of datasets create the conditions to overcome the aforementioned shortcomings, contributing substantially to the reliability of the overall experimental process. It should be noted, however, that the two cases, due to the different origin of the data, were considered separately as two distinct scenarios.
Real-world datasets
Starting with the real-world data, a total of 20 datasets taken from the UCI repository [24] were used. This publicly available repository was selected, among other things, in order to reinforce the additional objective of easy reproduction of the results to be presented. These initial real-world datasets emerged from classification problems and contained labeled data. To be converted into a form suitable for clustering, which is an unsupervised learning task, the class label had to be removed. Table 1 provides useful information about the datasets in a rather concise but comprehensive way. Specifically, the table contains essential form attributes, like the number of instances, features, and actual classes, as well as the number of instances that belong to the most common class. The latter is an indicator of the problem’s difficulty, as classes containing a small number of instances are more unlikely to be identified correctly as clusters. It should be noted that the majority of the datasets used are imbalanced, so there are significant differences between the number of instances in different classes. Actually, this remark is also implied by the aforementioned last column in Table 1. Lastly, all the features were numerical in order to avoid the effect of different types of encoding on the final results, which is rather a matter of a separate ongoing study.
Real-world datasets
Real-world datasets
Having the UCI datasets constituting the real-world case of this study, a total of 25 additional datasets were manually created in order to substantiate the required wider experimental framework. This was done using the “make_classification” function of the sklearnPython library [25]. The class labels were once again removed in order to meet the clustering task criteria. Since the datasets in this group were created to cover cases that had little or no representation in the original datasets, some useful information about their nature should be provided. The first issue was that the given urgent demand for differentiation in terms of the number of their appearances, their features, and their ranks, statistics depicted in Table 2 where this information is summarized, was one of the main reasons for their construction in the first place. Moreover, all the features in these datasets are informative, so that there were no features constructed by a linear combination of informative ones. Lastly, only 10 of the artificially generated datasets were balanced, while the others were made to be unbalanced, with some of their classes consisted of a relatively small number of instances, in an attempt to increase the overall complexity of the task. All essential information about datasets’s characteristics can be found in Table 2.
Artificial datasets
Artificial datasets
Before proceeding to a rather concise presentation of the methods, distance metrics, and validity indices used to calculate the optimal number of clusters, it should be noted that henceforth the validity indices may often be referred to as “indices” or “indicators” for short. In fact, this was the case in several of the above-mentioned areas of the present work.
First of all, a total of 9 algorithmic procedures which were utilized and tested throughout this survey will below be briefly mentioned. During these experimental procedures, a total of 26 indicators for estimating the optimal number of clusters in each of the datasets examined were evaluated, as applied to K-mean and Hierarchical agglomerative clustering algorithms, in terms of the aforementioned wide variety of methods. Lastly, 5 distinct distance metrics were also examined. Thus, a total of 1170 tests per dataset were taken into consideration, in a plethora of potential optimal numbers of clusters proposed.
Methods
Regarding the experimental setting in terms of the algorithmic procedures used, the following 9 well known methods were utilized. First, the widely used K-means method [26, 27] was utilized. Among other things, the K-means algorithm, as one of the most, if not the most frequently used technique in the clustering literature, constituted a safe performance baseline. Obviously, the apparent multiple appearance of the method in the literature makes any reference to further information about its operation unnecessary. Next, various agglomeration methods, which is a bottom-up hierarchical clustering approach, were also exploited. Specifically, the following single-link, complete-link, and minimum-variance algorithm variations were utilized:
As a bottom-up hierarchical scheme, all the aforementioned methods share the same attribute. At the beginning of the execution, each case is considered as an autonomous cluster. During the iterations’ steps, the clusters are linked based on their similarity to form bigger clusters. The operation stops when the desired number of clusters is reached. The type of linkage represents the way the similarity between the clusters is measured, indicating the way the method itself is defined.
Regarding distance metrics, we will suffice, for reasons of space and readability, with a simple reference to the following extremely well-known metrics, which were used in the experiments.
Specifically, the following 5 metrics were used:
Euclidean Maximum Manhattan Canberra Minkowski
As for cluster validation metrics, a very wide variety of indices were used to indicate the optimal number of clusters. The main focus of this work, in terms of the type of indicators, was the approach based on the so-called relative criteria for cluster validation. All these indicators are shown in Table 3. It should be noted that an extensive and formal presentation of all indicators is – given their number – impractical. That is due to, on the one hand, the reasonable length restrictions, and because of the fact that a simple placement of their mathematical formulas will not add anything substantial to the present work. Thus, we urge the reader to look at the referenced documents for further clarification.
Validity indices
Validity indices
Representation of the experimental procedure.
In this section, the general scheme of the experimental procedure will be thoroughly described. First of all, a focal point of the experimental process, as it has been previously mentioned, was that the initial unbalanced real-world datasets were supplemented by a diverse set of artificially created ones, and both real-world and artificial contained information about the actual number of classes in each dataset, as in classification problems. By converting the classification problem into a clustering one, by simply removing the class feature from the dataset, we can assume that one accepts the correspondence between the two problems even on a rather informal level: the optimal number of clusters simply becomes the number of classes. Thus, class labels provide crucial information to use in the evaluation stage, even if they are not assessed at all during the experiments.
To calculate the validity indices that estimate the optimal number of clusters, the R – based nbclust [7] package was used. Regarding experimental executions, 9 different clustering methods were exploited, and each of these methods was performed 5 times, each time incorporating a different distance metric to calculate the similarities between the instances. As a result, for every dataset, a total of 45 general runs were executed (9 algorithms
Then, in order to measure the efficiency and the capacity of every validity indicator to suggest valid optimal values, a metric had to be specified. This was set to be the absolute value of the difference between the number of clusters proposed by each indicator and the actual number of clusters. Therefore, in each experiment, for each index, an integer indicates how far from reality the estimate for the optimal number of clusters is. In both separate scenarios (real/artificial data), the results were grouped according to the distance metrics and thus will be presented in the following section. A visual representation of the whole experimental setting is presented in Fig. 2.
Results
The results and some remarks arising from their analysis will now be presented in this section. With regard to the general objective in the experiments carried out, the intention was to identify which of the examined indices performed best in estimating the optimal number of clusters, and, in particular, to identify whether there were specific indices that generally performed well in the majority of datasets. Given the above, and bearing in mind the multitude of the performed tests, it’s easily inferred that the number of indicators, metrics and their combinations becomes quite restrictive in terms of a full presentation of them all. In addition, the numerous outputs and tables of results that were produced would be extremely hard to follow. These, combined with the limited space, led us to choose other more concise ways of presentation. The latter mainly involved the use of some well-known diagrammatic techniques, which tend to summarize the information and make it much easier to follow. Thus, in both case studies (Real-World, Artificial), the results were grouped by distance metric.
Moreover, here some clarification regarding the two case studies is required: they are studied separately since, in the case of the real-world datasets, the majority of the accessible – and consequently the finally selected – sets had a number of classes – i.e. number of actual clusters – less than or equal to 6. As a result, indices that tend to select fewer clusters as the ideal number of groups in each of the partitions have a clear inherent advantage which is not necessarily based on non-random events. This was the reason why, in creating the artificial data scenario, priority was given to the widest possible diversity of real clusters so that safer conclusions could be drawn about the behavior of the indicators. And, indeed, the ranking of the indicators (except for the first and the last, which remain in their respective positions in all situations) differs significantly between the two cases. However, this does not affect the demand to conduct experiments and draw conclusions from real-world data, which are rarely available in formats that meet all the requirements or cover every possible scenario. To compare the indicators, the Friedman ranking test [59] was performed together with the Bonferroni-Dunn post hoc test [60]. These scores provide a good picture of the performance of the quantities examined: a good performance equivalizes to positions in the upper places, while the worst performing indicators will fall down to the lower parts of the table. The results of the Friedman’s rankings can be found in Tables 4 and 5, for real-world and artificial data respectively.
Real-world dataset Friedman scores
Real-world dataset Friedman scores
CD diagram: Real-world dataset.
Artificial dataset Friedman scores
Moreover, the null hypothesis was rejected. Note that a post-hoc analysis using the Bonferroni-Dunn test provides feedback on which indices behave similarly. That is, all indices behaved similarly in all 10 cases – bearing in mind that the results were grouped per metric and there were 5 distance metrics and 2 distinct study cases. Figure 3 shows the results as Critical Difference graphs (CD diagrams) for the case of real-world data. Here, these charts depict the top ten performing indices in respect to their Friedman rankings. The cd value is calculated, and the indices whose Friedman ranking values differ by a value less than or equal to the above are associated with a strong horizontal line in the graph. That is, given the cd – values, for each pair of indicators, the absolute value of the difference in their rankings is calculated. If this value is higher than the cd, then two indicators differ in their average value, and this is what one means by saying that they have different behaviors. In the illustrations, the indicators that are shown to be connected do not differ significantly. In Fig. 4 the CD diagrams in the case of the artificially created data are shown.
CD diagram: Artificial dataset.
Both the results of the Friedman test, as well as the CD diagrams, demonstrate the overall al superiority of the Ball index over the other indices. Specifically, the Ball index ranks first in all cases except for the use of the Canberra metric in the case of real-world data, where it ranks second after Beale by a small margin. It is also quite clear, from a simple observation of the cd charts, that the Ball index shows a statistically significant difference compared to the vast majority of other indicators. In particular, in the case of artificial data its behavior is significantly different from all the others,
Box-plots: Real-world dataset.
Box-plots: Artificial dataset.
Heatmaps of all 25 artificial datasets regarding (from left to right) K-means, Single and Ward methods.
while, in the real-world dataset, only a few connections that imply performance similarities can be spotted, with the majority of them indicating dependencies just between the Ball and McClain indices. Apart from the aforementioned dominance in both case studies, the indicators that follow in the ranking differ in relation to the use of real or artificial data. To be more specific, in the real-world senario, the leading Ball index is followed by McClain, Silhoutte, Beale and TrCovW, while in the artificially generated datasets the top five – with the exception of Ball – is usually consisted of TrCovW, Ratkowski, TraceW, and Hartigan. With regard to the indicators with the lowest performance, it is worth noting that in all cases, the SDbw index ranks last with a significant difference from the rest. Variations and rearrangements are observed in the middle positions of the table, implying a rather inconclusive performance ranking.
An additional way to provide a visual representation of the extracted results is via box-plots. Figure 5 depicts the box – plots graphs of all the examined indices, in the case of real-world data, in terms of each distance metric. The corresponding box – plot graphs for the artificially created data can be found in Fig. 6. To decode the underlying information of these plots, note that we reach for the indices with the least possible divergence from the actual classes – in terms of its recommended ideal number of clusters, i.e. those with box-plots that appear closest to zero. A value of zero indicates that the discrepancy between the number of clusters recommended by the relevant index and the actual class number is minimal. We can clearly observe the remarkable, almost universal superiority of the Ball index both in the real-world and in artificially generated datasets. This seems to be evident in all three ways in which the results have been presented so far. However, what can be deduced with reasonable certainty is that there is little evidence for the superiority of the Ball index both in the real world and in generated datasets.
Lastly, heatmaps of the kmeans, single and ward methods regarding the artificial dataset case study are provided in Fig. 7. Once again, the plots are grouped in respect to the distance metric that the methods internally utilized to measure the distances between objects. So for each one of the 3 aforementioned methods, a total of 5 heatmaps are depicted. A color scale is used to show the deviation of the estimate of each indicator from the actual number of clusters. In the presented heatmaps, the columns represent the validity indices, and the rows denote the serial numbers for each of the 25 datasets. The darker the square, the larger the estimated difference from the actual value. This means that columns with lighter shades of blue that tend to look white represent better performance than columns painted in darker shades.
In this last section, the final remarks of this work are presented regarding the interpretation of the results of the experimental process, as well as an outline of the future targeting. The purpose of this paper was to compare a large number of proposed cluster validity indices, in order to – ideally – come to both safe as well as useful conclusions about their performance, regarding the – quite fuzzy – task of estimating the optimal number of clusters when a specific dataset is given. For this reason, two scenarios were designed, each of which consisted of data from different origins: the first includes existing real-world problem datasets, while the second was artificially constructed for the purposes of the current framework. In total 45 datasets were examined. In order to evaluate the performance of the validity indices in question and to produce comparable findings, a series of experiments were performed using 9 different clustering algorithms and 5 different distance metrics, resulting in 45 experimental schemes. In order to ensure a relative variation in the complexity of the subject, special attention was paid to the diversity of the datasets, both in terms of the actual number of groups and their characteristics, as well as in terms of the number of cases within the clusters. After a statistical study, the indicators were ranked based on their accuracy according to the Friedman ranking.
The results, while providing a valuable insight into which validity indicators work well in both scenarios, do not give a rigorous or definitive ranking, with the exception, as mentioned above, of the top positions in the ranking in almost all metrics. It is worth noting that if we exclude the Ball index, which seems to perform better than all the others in the vast majority of tests, the top performance scores for each of the two scenarios vary, a result that is most likely related to the number of classes, which, on average, were lower in actual data. As a result, the difficulty of the task is underlined once more, as is the need for alternative techniques for evaluating estimates of the optimal number of clusters.
Given the above, an interesting extension that incorporates the current study would be the development of validity indices groups, the estimates of which will be merged to eventually propose a final best as the optimal number of clusters [61]. Such a combination module designed to be robust and efficient, may then be used as a stand-alone indicator. A second possible future area of interest would be to use ensemble learning to construct ensembles of clustering algorithms [62, 63]. An ensemble clusterer would utilize the number of clusters supplied by the assessment method as input to build a final model by merging the individual base learners. According to the ensemble theory, in an ensemble setting that exploits accurate as well as diverse base learners, ensembles of learners would be more accurate and robust, as they tend to perform generally better than their individual base learners [64]. As a result, there is enough evidence to suggest that such a research direction would constitute a valid endeavor, resulting in methods that provide better data partitions.
