There are clustering algorithms (such as DBSCAN) that do not group all data into clusters, but identify some data as noise and exclude it from clusters. In the literature there are no dedicated validity measures for this kind of noise-aware clusterings. Applying the standard measures blindly (which seems to happen in the literature) yields misleading results. We revise top performing, established validity measures to cope with the results of this kind of clustering algorithms and demonstrate that such clusterings may require an additional type of validity check, assessing not only the cluster validity (separation and compactness), but also the validity of the distinction between noise and cluster instances. Additionally, we propose a balanced score, that captures both types of validity to get a holistic validity score. All proposed measures are evaluated on artificial data, mimicking the experiments of the extensive review [Arbelaitz O, Gurrutxaga I, Muguerza J et al. 2013]. The encouraging results demonstrate that the noise aware extension of the Silhouette coefficient and the Score function are least influenced by the noise level.
Cluster analysis is often used as an exploratory tool for multidimensional data, where the inherent structure cannot be assessed visually. Cluster validity measures come to the rescue and try to capture desirable properties (such as compactness and separation) in a single number, so that clustering results can be compared easily, e.g., to find the right number of clusters or to compare the results of different clustering algorithms.
While there is a well-established set of evaluation measures for classifiers (such as accuracy or -score), evaluation of clusterings is less straightforward as there is usually no ground truth information available. (Even if we have ground truth information, a supervised evaluation of clusterings is usually not suitable.1–3)
There is a plethora of cluster validity measures in the literature (30 reviewed in Arbelaitz et al.4) with varying conceptions of good clusters. While classification accuracy may be applied to evaluate any classifier, a clustering validity measure cannot be applied to the result of any clustering algorithm: Some algorithms (such as -means) assign every instance to a cluster, other algorithms (such as DBSCAN) flag noise points and exclude them from clusters. We denote such methods as noise-aware clustering algorithms. The frequently applied validity measures, however, cannot cope with this distinction and – applied blindly – interpret all noise points as if they represent one cluster on its own and thus yield useless results. It may be tempting to simply remove the data flagged as noise and apply the established validity measure to the remaining data – but this is not an option either: Suppose uniform noise and a “clustering algorithm” which removes all data but two well-separated islandsa – most validity measures would evaluate the result as being very good despite the fact that this particular clustering algorithm simply made up the clusters. A more common case is when a clustering algorithm identifies only some of the clusters in the dataset, but flags other clusters as noise – simply because their density is slightly below some threshold parameter, for instance. Again, when applying common measures to the detected clusters, such misjudgement in the resulting partition goes undetected. So the main contributions of this work are:
We draw attention to the often disregarded but yet important problem of validating clustering results from noise-aware clustering algorithms (such as DBSCAN5 and others6,7).
Four of the best-performing validity measures (from the extensive review4) are revised and hardened against data flagged as noise in the resulting clustering partitions.
Two new measures for the assessment of the distinction between noise and clusters in the resulting partitions are introduced.
Proposal of a balanced score to capture both types of validity: cluster validity (separation and compactness) and distinction between noise and clusters.
The robustness against various levels of noise is empirically evaluated (in line with review4) and the results demonstrate the necessity of such extensions for an impartial evaluation.
While the proposed measures are evaluated using synthetic data, their applicability extends to real-world areas, as noise or outliers occur in many real-world applications. In biomedical clustering, for example, due to biological variability or sensor artefacts. Similarly, in network anomaly detection, high-density clusters can indicate typical behaviour, while low-density noise can reflect security threats. In image segmentation, especially for satellite imagery or medical images, the distinction between signal and background noise is crucial. In these cases, robust clustering evaluation metrics must be able to distinguish between meaningful structure and noise, and they should avoid rewarding algorithms that impose artificial structure on non-clusterable data. This emphasises the general importance of including noise as an integral part of assessing cluster validity.
The remainder of this paper is organised as follows: in Sect. 2, we discuss the literature that is related to this problem. After that in Sect. 3, we revise the definition of a clustering to depict noise instances as well as the clustered data clearly. Sect. 4 presents the modified and new validity metrics, which are then evaluated in detail in Sect. 5. A brief overview of the evaluation volume can be found at the beginning of Sect. 5. Finally, the conclusions follow in Sect. 6.
Related work
Cluster analysis is an unsupervised machine learning technique to arrange data points into groups (called clusters) of homogenous data (similar instances belong to the same cluster, dissimilar to different clusters).
A common definition of a partition of a set is given by:
A partition of a set is a collection of non-empty and pairwise disjoint subsets of , such that . The set of all possible partitions of S is denoted by .
In the literature, the term clustering can refer to both, the process of cluster analysis as well as the output of a cluster analysis. In this work, we use the term clustering to refer to the resulting output of a cluster analysis, which is usually assumed to be a partition of the set of data instances .
(conventional clustering, e.g. Topchy et al.,9,10 Faceli et al.11)
Let be a set of instances. A clustering is a partition of , with being the set of all possible clusterings of .
Consequently, every procedure that generates a partition from the instance set could be called a clustering algorithm. Cluster validity measuresb express the validity of a clustering usually by assessing how compact the discovered clusters are and how well they are separated. This can be done in different ways. Examples are the Silhouette coefficient (sil)12, the Davies-Bouldin index (DBI),13 the Dunn-Index (DI),14 the score function (SF),15 to mention a few of the top-performers in Arbelaitz et al.4 (For better readability we briefly review these measures when proposing modifications to them in Sect. 4.)
There are many different approaches to clustering algorithms , such as hierarchical (e.g. single-linkage), partitional (e.g. -means), grid-based (e.g. CLIQUE), density-based (e.g. DBSCAN), etc. They all yield a new column storing the information to which cluster an instance belongs. Instances in the same cluster, get the same id. Almost all cluster validity measures assume a conventional clustering as the obtained output and interpret the resulting column as such.
But there are differences in the semantics of this output column. Noise-aware clustering algorithms such as DBSCAN,5 noise-clustering6 or the SNN algorithm7 reserve one particular id in this column to denote an instance has not been assigned to any cluster at all – but has been flagged as noise. As the data type of the output column did not change (only the semantics of the id), the existing measures (being part of popular libraries like sklearn) can still be invoked: There exist numerous publications that completely ignore the problem, how to evaluate clusterings that (potentially) include noise labels or at least do not address it at all (e.g. Chen et al.,16 Kertanah et al.,17 Ogbuabor and Ugwoke18). The authors of Chen et al.16 explicitly state that they use the Silhouette implementation from sklearn.metrics.silhouette_scorec to get the evaluation result. They compare clusterings from -means and DBSCAN, although the implementation from sklearn does not provide any special treatment for the noise, but treats all noise points as if they represent a cluster. Noise points are, however, by definition widely spread instead of compact. Evaluating all points with the noise label as ”one cluster” is thus meaningless and greatly distorts the assessed validity.
The authors in Zhao et al.19 propose a novel internal validity index - GSCVI - based on granular ball representations, which aims to improve the robustness of the silhouette coefficient on datasets with arbitrary cluster shapes and noise. While their method takes into account the presence of noisy instances and aims to mitigate their negative impact on cluster validation, it treats noise as a confounding factor that needs to be removed to minimise its negative influence. In contrast, our work adopts a fundamentally different perspective: rather than reducing noise, we incorporate it as a meaningful structural component, that is part of the dataset. Simply excluding all noise points is not an option, because that could artificially introduce an unjustified separation between clusters (as mentioned in the introduction): For noise-aware clusterings, the validity of the noise labels themselves need to be validated as well. This brings so-called cluster tendency measures to our attention, which do not validate discovered clusters, but check beforehand if there is a grouping tendency in the data at all. If a validity measure would rate clusters positively (after excluding some noisy data from the assessment), but no cluster tendency can be detected in the original data, the cluster validity may have been inadmissibly induced by the omission of the noise. In this work we consider the Hopkins statistic,20 which assesses the cluster tendency by generating random data (resp. a sample of the dataset) and their distance (resp. ) to nearest neighbours in the data. If clusters exists, the statistic approaches 1. Purely random data yield values close to .
With the Partitioning Davies-Bouldin Index (PDBI), the authors in Ros et al.21 propose an extension of the classic Davies-Bouldin Index that allows a finer evaluation of complex cluster structures. The basic idea is to first divide each cluster into subclusters in order to then calculate homogeneity and separation at this finer granularity. Subclusters with a low density are automatically marked as noisy and weighted with a partially reduced weighting during the evaluation. PDBI thus takes noise into account implicitly, but not by explicitly handling noise points, which are in the case of DBSCAN marked with cluster label . A direct application of PDBI to e.g. DBSCAN results is therefore not suitable without further modifications, as this can lead to noise being incorrectly evaluated as poorly separated clusters, which leads to misjudgements of real noise points. We also argue that deciding whether an instance is noise or belongs to a cluster should be the task of the clustering algorithm. The cluster validity measures should then only be responsible for evaluating this decision.
External or supervised clustering validation measures, such as the Adjusted Rand Index (ARI), are valuable tools in evaluation scenarios where a ground truth clustering is available - for example, when assessing the performance of a newly developed clustering algorithm on synthetic or labelled benchmark data. In such settings, the alignment between the algorithmic output and the known partition can be quantified directly. However, the practical use of clustering often takes place in exploratory data analysis, where no class labels are available and the underlying structure of the data is unknown. Even when some form of class label exists, it does not necessarily correspond to a natural clustering structure in the feature space (see Jeon et al.,2 Färber et al.3). Therefore, the validity measures proposed in this work are internal and tailored for real-world, unsupervised settings - especially when noise-aware clustering algorithms are used to detect structure in unlabelled, potentially noisy data.
Clustering as a twofold partition
We need to review the definition 2 of a clustering: We explicitly distinguish between noise labels and cluster labels, because their semantics differ greatly.
(clustering)
Let be a set of instances. A clustering, resulting from an arbitrary clustering algorithm, of can be represented as a twofold partition , where represents the instances that are not assigned to a cluster and is a (conventional) partition of the actually clustered data . The set of all possible clusterings of is denoted by .
A function may be denoted as a noise-aware clustering algorithm. For clustering algorithms that do not detect noise points, such as -means, and applies. With DBSCAN all data that received the noise label is gathered in . This is illustrated using the example from Figure 1. Here, the same dataset has been clustered with -means with (left) and with DBSCAN (right). The results are indicated by the colours of the instances. The clustering as a twofold partition for the result of the -means (left) is: and for DBSCAN: .
Clustering as a twofold partition - example.
Two-dimensional generated datasets.
Note that conventional algorithms may be altered easily to yield a non-empty : With fuzzy -means, membership degrees to each cluster are calculated, several clusters may share an instance of the data in equal proportions. In this case, it may not be advisable to assign this instance to a specific cluster and it could be a candidate for . Also, with -means clustering, it happens that instances form their own clusters, these may also be regarded as members of .
Validity of noise-aware clusterings
In the next section 4.1 we propose modifications sil of the Silhouette coefficient,12dbi of the Davies-Bouldin index,13gD33 of the generalised Dunn-Index 3.314 and sf of the score function,15 that can handle clusterings in the form of . These metrics were considered to be most relevant due to their good performance in Arbelaitz et al.4
A noise-aware clustering calls for a validation of the noise as well: Consider Figure 2(c) and suppose an algorithm would mark all data as noise except the data in the two circles. If declares data within both circles as clusters, validity measures would evaluate them as well separated and compact – but the separation is artifically introduced by excluding and thus not real. It is not admissible to support clusters by excluding data selectively. We thus need measures to assess the validity of the noise labels as well.
Silhouette Coefficient. The Silhouette Coefficient12 is a distance-based measure that indirectly takes into account the density of instances in the clusters. The value is calculated for each instance of the data using equation (1), with being the average distance from instance to all others in its own cluster and being the average distance to all instances to the nearest cluster. The value should therefore be high for well-separated clusters. The overall Silhouette coefficient (sil) for the entire clustering is then obtained by equation (2). The Silhouette coefficient lies between -1 and 1, values close to 1 indicate a good clustering, values less than or close to 0 indicate a poor clustering.
The revised Silhouette coefficient is based on a redefinition of in equation (3) and changes the calculation only in case of noise points (thereby preserving all properties of the Silhouette index for clustered data). So let us assume a noise-aware clustering like Figure 1(right), where points #2 and #5 were (correctly) identified as noise. If would be treated as just another cluster, we would obtain high values for and as the points are wide-spread. The closest cluster (cyan: #3, #4, #6) would yield smaller average distances, thus , . The Silhouette indices and would then be negative, indicating a poor cluster and degrading the overall Silhouette coefficient. This happens even though the clustering algorithm has correctly marked both points as noise. Instead of degrading the result, the correct identification of noise should improve the Silhouette index.
We accomplish this as follows: For noise points the value is not meaningful, because noise points do not have their own cluster. Instead we introduce as the average distance to the second nearest cluster. Very much in the spirit of the standard , now the term judges if noise point is substantially closer to its closest than its second closest cluster. If that would be the case the term approaches 1, indicating that may be better assigned to its closest cluster rather than classified as noise. On the contrary, if the closest and second closest cluster have similar distances, the fraction approaches 0, indicating that it cannot be assigned unambiguously to a cluster and its classification as noise is justified. We use the term to turn a value of 0 (confirming noise) into a good Silhouette score of 1. Now, and receive high values, contributing to an overall good Silhouette coefficient. (The additional minimum in equation (3) just assures that a highest score of 1 is obtained even for negative fractions.)
As before, the overall Silhouette score is then determined as usual using equation (2) by averaging the individual .
Davies-Bouldin Index. For determining the Davies-Bouldin (DB) Index,13 individual DB-indices are calculated for each cluster. If is the set of clusters and , then the conventional for a cluster is calculated by equation (4), with and being the intra cluster distances in cluster and and being the inter cluster distance between the two clusters. For well separated clusters their separation should be large compared to the sum of their , so the fraction should become small.
The dbi is then determined by averaging the DB as shown in equation (5). The metric can take values between 0 and , with values near 0 indicating a good clustering. For calculating the adjusted variant we use the conventional formula to calculate the for (preserving its properties for true clusters). In the other case, if , then the calculation is adjusted as displayed in equation (6) (second case).
We judge the group of noise points by their minimal distance to cluster centres: points correctly identified as noise should stay away from cluster centers. By we denote the minimal distance of a noise instance to the cluster centroid of . As this distance should be high, its reciprocal approaches 0 for true noise points. equation (6) incorporates the highest reciprocal over all clusters to capture the worst separation of noise and clustered data. The nominator of equation (6) reflects a constant: . The fraction of noise points is used as an additional factor to penalise a tendency of marking most of the data as noise. The is the maximal distance between two clusters and with and acts as a scaling factor to get the value into a comparable value range. The numerator can therefore become as large as the maximum distance between clusters. The final term becomes large if there is a lot of noise and noise instances are close to cluster centroids.
Dunn Index. The Dunn Index and its different variants14 make a heuristic estimate of the goodness of a clustering based on two values, the separation estimator and the cohesion estimator . The variants differ in how the two estimators are calculated. The value of the metric then always results from . In the variant gD33, the separation estimator shown in equation (7) and the cohesion estimator shown in equation (9) are applied. For , for each pair of clusters and the average distance between two points from these clusters is determined. is then the minimum of these values. The value for results from the average distances of all instances of a cluster to the respective centroid of the cluster. The maximum is then used as . In other words, only the worst values for cohesion and separation are considered in the Dunn index.
For the extension to noise-aware clusterings, we suggest the use of an additional separation estimator shown in equation (8). We only introduce an additional separation estimator and no cohesion estimator, since cohesion is not a meaningful indicator for the set of noise instances. We intend to assess the separation between clusters and noise points. The term measures the average distance of (a fraction of) noise points to the cluster points. Its minimum over all clusters measures the “poorest separation” of noise and a single cluster – very much as measures the poorest separation between a pair of clusters.
We use the noise neighbourhood for each cluster , which consists of the nearest of all noise instances. For our experiments, we set to 15. For the number of noise neighbours we take at least 2 instances if available. Now the average distance to each noise point in is determined for each instance in cluster . This is averaged over all instances of a cluster. The value for the separation estimator then results from the minimum across all clusters (equation (8)). This value is therefore large (and thus does not become minimal) if the nearest of the noise instances are on average far away from the cluster. For the final calculation of the (equation (10)), we use the minimum from and .
Score Function. The score function15 is a cluster evaluation metric that takes into account both the distances between (bcd, equation (11)) the cluster centroids and the global centroid and the distances between the instances within (wcd, equation (12)) a cluster and the respective cluster centroid. The metric value is calculated as shown in equation (13).
For our extension of the measure, we add another summand in the formulae for bcd and wcd for with the respective inverse ro reflect the nature of the ”cluster” . In equation (12) we add , since in the case of the noise group it is good if the mean distance to the centroid is large. In equation (11) we add , because the assumption is, that the noise centroid is near the overall centroid and here a large distance is considered good.
Noise validity measures – density assessment
Besides the customised variants of the conventional measures, we propose two novel (density-based) measures, which do not evaluate the quality of the clusters, but focus on the quality of the distinction between noise instances and cluster instances . They assess whether instances identified as noise are located in significantly less dense regions than instances that are assigned to a cluster.
These dense regions must now be defined in some way so that the density of different regions (cluster/non-cluster) can be confronted. To do so, we propose two variants: first, we divide the data space into grid cells and then compare the density of cluster and non-cluster grid cells (grid coefficient). In the second variant, a region is the direct neighbourhood of each instance and then the density of the neighbourhoods of cluster or noise instances is compared (neighbourhood ratio).
Grid Coefficient. The first metric is a grid-based coefficient that quantifies the ratio of data density in cluster grid cells to non-cluster grid cells, which is called grid-coefficient (). A cluster grid cell is a cell that contains at least one instance from . Non-cluster grid cells are cells that contain either no instances at all or only instances from . The size of a grid cell is determined dynamically and is equal to half of the minimum distance of all maximum distances between instances in the clusters. In circular clusters, a grid cell is therefore as large as the radius of the smallest cluster (in terms of intra cluster distances). The proposed coefficient is calculated using three variables, the average number of instances in non-cluster grid cells, the quotient of the number of empty grid cells and the total number of grid cells and the average number of instances in cluster grid cells. The penalty term applies in situations where there are many empty cells that potentially ”pull” the value of towards 0 (as it is likely in a high dimensional space), which could give the impression that the ratio between and is large. To limit the influence of many empty cells, is added.
In scenarios in which the data density inside the clusters is significantly higher than outside (as it should be the case in a ”good” clustering), will be comparatively high and will be significantly smaller. The range of this metric is , however, reasonable clusterings only result in a value in . The closer the metric is to 1, the better the clustering.
Neighbourhood-Ratio. The second novel metric is called neighbourhood-ratio. This metric evaluates the clustering based on the neighbourhood density inside and outside clusters. It measures the ratio between the (exponential) mean neighbourhood density within the clusters and the (exponential) mean neighbourhood density of the noise instances. Similar to the , which first determines the cell size, a radius is first determined here (also dynamically and in the same way as the cell size of the grid coefficient), in which neighbours of an instance are searched for. This neighbourhood count is determined as shown in equation (15).
The mean of over all cluster instances and the mean over all noise instances are determined and combined in equation (16). The value of ranges in , delivering (due to the exponential function) a value near 1 as soon as is substantially greater than .
As both metrics are density-based and not distance-based, they can also deal with arbitrary cluster shapes and are not distorted by individual outliers that lie far away.
Balanced score proposal
The adjusted metrics from Sect. 4.1 and the newly introduced metrics from Sect. 4.2 each address different aspects of a clustering . The conventional measures, such as the silhouette coefficient, primarily evaluate the cohesion and shape of the identified clusters (cluster assessment). The newly introduced measures, such as the grid coefficient do not evaluate the quality of the clusters, but focus on the quality of the distinction between noise instances and cluster instances . They assess whether instances identified as noise are located in significantly less dense regions than instances that are assigned to a cluster (density assessment).
For noise-aware clusterings, however, both aspects are important for assessing the overall quality of . We therefore propose the -Score (balanced), which is defined as the harmonic mean between the and the as shown in equation (17).
By taking both measures into account, the -Score allows both the cluster validity to be assessed and whether the density within clusters differs significantly from noise regions. We take here so that the value ranges of both metrics match. Semantically, this does not change the interpretation of the , as a value of as well as a value of both suggest a poor clustering.
Computational considerations
The computational complexity of the modified validity measures proposed in this work is, by design, equivalent to that of their conventional counterparts. For example, the extended and the adjusted inherit the same complexity class as the original definitions, as they rely on similar pairwise distance calculations. For very large datasets, these could be accelerated using sampling strategies.
For the , we addressed potential scalability issues in higher-dimensional spaces by implementing a sparse representation: instead of storing the full grid structure, we store only those grid cells that actually contain data instances. This significantly reduces memory usage and improves runtime performance, particularly when the number of occupied cells is small compared to the total possible number (which is usually the case for higher dimensions).
The relies on fixed-radius neighbourhood counts for each instance. These may be determined by scanning all pairwise distances (yielding a complexity of ), which is comparable to, say, the computation of the Silhouette measure.
Experimental evaluation
After describing the test data generation process in Sect. 5.1, we evaluate the performance of the adjusted validity measures in Sect. 5.2, focusing on comparing the modified metrics against their original versions and analysing their behaviour with increasing noise. We then evaluate the performance of both newly introduced metrics for density assessment by comparing their behaviour on data with clusters and randomly scattered data in Sect. 5.3. As we intend to evaluate the metrics – and not the partitions obtained from a particular clustering algorithm – Sects. 5.2-5.3 use the labels from the data generation process as the evaluated clustering . Sect. 5.4, however, demonstrates the performance and usefulness of two metrics when a noise-aware clustering algorithm (DBSCAN) is involved for determining and Sect. 5.5 illustrates the benefit of the -Score.
Data generation
We assess the validity measures using artificial datasets (following Arbelaitz et al.4): structured datasets, which have well-separated clusters, and uniformly distributed datasets, which consist of noise only. We varied all parameters to cover a wide range of scenarios: the number of clusters (2, 3 and 4 with 300 instances per cluster), dimensionality (2, 4 and 8), noise level (0 to 0.9) (a noise level of 0.3 means that an additional 30% of uniformly generated noise instances is added to the cluster data) and standard deviations of the clusters (0.3 and 0.7). The cluster centres were chosen randomly, but a minimum separation of two standard deviations was guaranteed to ensure the presence of good clusters. Even with a high level of noise, the number of instances in clusters is predominant. The uniformly distributed datasets were generated for the evaluation of the metrics and . They consist of noise only with a varying degree from 0.1 to 0.9, where a noise level of 0.1 (resp. 0.9) corresponds to (resp. ).
All datasets have similar variable domains, but potentially different types of clusters (some are elongated, some are spherical). For each parameter combination we generated five datasets with different random seeds. We ended up with 615 generated datasets (540 structured, 75 noise).dFigure 2 shows three 2-dimensional datasets. Figures 2(a) and 2(b) contain clusters with different noise levels and numbers of clusters and Figure 2(c) is a noise dataset with a noise level of .
Performance of the modified metrics (before: original version, after: proposed modification). (a) Silhouette coefficient , (b) Davies-Bouldin index , (c) generalised Dunn 3.3 , (d) Score function , (e) Dispersion Sil and (f) Dispersion DBi.
Extended cluster validity measure evaluation
First, we compare the average performance of the modified metrics against their original version. To ensure that we assess the properties of the validity measures rather than those of the clustering algorithms, we use the labels from the data generation process (as they would have been provided by a perfect clustering algorithm). The adjusted metrics receive a clustering in the form of , and for the calculation of the original metrics, the noise instances were assigned to the nearest cluster (similar to how -Means would partition the data).
Figure 3 compares the average performance of the modified metrics (green curves) against their original version (red curves) over all 540 datasets that contained clusters. The ideal case would therefore be that even if the noise level increases, the metric only degrades marginally. All original metrics (magenta) degrade already at 10% noise and continue deteriorating as the noise level increases. The performance of the customised metrics outperforms the original metrics in all cases substantially. In particular and stand out, as their values remain closest to the noise-free case, close to an ideal performance. But the other two variants ( and ) have also improved over the original version. The metric stagnates, which is due to the heuristic estimate based on the ”worst estimates” for separation and compactness ( and in Eqs. (7) and (9)). If these values remain the same (or same ratio), an increasing noise level does not change the results of the metric. This phenomenon occurs for the original as well as the modified metric. Figures 3(e) and 3(f) depict the dispersion of the sil and dbi per noise level: The dispersion for the adapted metrics is much smaller compared to the original metrics in both cases.
Noise validity measure evaluation
Next, we focus on the metrics and . As discussed in Sect. 4, they detect cases where clusters only appear if the instances between them are (falsely) claimed as noise. More generally, they investigate if the distinction between noise and cluster instances is justified (e.g. supported by the density in the dataset). The performance of these two metrics is compared against the Hopkins Statistic (HS)20 in Figure 4. All metrics were calculated on all cluster datasets (green curves, labelled (number of clusters)) and on those containing noise only (red curves, labelled ). For the partitions were derived from the data generation process again, to mimic a ”perfect” clustering algorithm. For the noise data () we used the DBSCAN algorithm with eps so that random accumulations of instances are identified as clusters and the rest as noise. As desired for this experiment, the clusters found are clearly not justified, as the density within the clusters and outside is almost identical. The expectation for a good measure is thus a high value (close to 1) for the green curves (containing clusters) and low values for the noise data (). Low values, however, differ between the measures: If the data is uniformly distributed without clusters, a low value for and corresponds to 0, while for HS it corresponds to 0.5. Therefore, although similar in shape, the green HS curve () should not approach 0.5 which indicates random noise. From Figure 4 we see that the newly proposed measures behave as desired (top left and mid), but HS already drops considerably at a noise level of (top right).
New metrics and Hopkins statistics: Performance (top) and Dispersion (bottom) (for data with clusters and pure noise ).
The dispersion of these metrics is displayed in Figure 4 (bottom). It is evident that shows the clearest results and has the lowest dispersion. Both HS curves already tend to converge at a noise level of 0.3, which does not make it a good indicator for cluster tendency in noisy environments.
Evaluating partitions of a noise-aware clustering algorithm
In the former experiments, the partition has been derived from the data generation process, we now discuss results from a scenario with labels from DBSCAN for two datasets from Figure 2. As we propose a validity measure we focus on its evaluation – rather than the evaluation of the clustering algorithm. We thus choose clustering parameter settings that are well suited to detect the structure in the data (which may otherwise be determined by hyper-parameter search). The parameter minPts was set to for every execution. The choice of eps was guided by the intrinsic structure of the synthetic data, we set eps to the mean of the half standard deviation of the clusters for every execution (also for the dataset without clusters). For the example with clusters from Figure 2(b) DBSCAN detects the clusters quite well and we obtain the results shown in Figure 5 (left). The red curve shows what happens if the result of DBSCAN is passed ignorantly to the sklearn implementation of the Silhoutte coefficient: Despite the correct identification of noise and clusters, sil drops quickly and indicates a poor result. Quite on the contrary, the proposed sil keeps the coefficient at a high level and the consistently high values of the grid measure indicates that the clusters are real (the density of noise and clusters differs considerably). Both measures degrade slowly because the increasing amount of noise instances affects them slightly.
Evaluation of DBSCAN clusterings on datasets from Figures 2(b) and 2(c).
For the noise-only example in Figure 2(c) DBSCAN identifies a few small clusters but most instances as noise and the results are depicted in Figure 5 (right). Here we excluded the noise instances (for the red curve) and evaluated the clustering only considering the clustered instances, again receiving completely misleading results: The high value of the conventional sil indicates a good clustering result. In contrast, the proposed sil is negative and correctly indicates a poor result. Additionally the grid is close to 0, indicating that there is no difference in density between noisy areas and clusters. Together both measures indicate clearly that the obtained partition does not contain good clusters.
Balanced score evaluation
Finally we demonstrate the utility of the -Score for the dataset shown in Figure 6 (6(b) and 6(c)). This data contains three groupings of instances, each with a slightly different density.
Balanced Score Evaluation. Instances assigned as noise are drawn as crosses.
On this dataset we executed the DBSCAN several times. The parameter remained the same for each execution, but the parameter eps was varied between and . The , and the -Score were calculated on the generated clusterings and the results are given in Figure 6(a). Here we see that the metrics and disagree at certain points. This is because they each rate different aspects of a clustering as good. Especially disagreement prevails with a value of eps. The corresponding clustering is shown in Figure 6(b) (note that the instances marked by red crosses on the right are noise instances). Here, the grouping on the right side is not dense enough so that it is not identified as a cluster. The rates this cluster assignment as good (but not perfect, as the clusters are not circular and dense enough), as the clusters are separated from each other and the noise instances are further away from clusters. However, the grid coefficient criticises the fact that the data density in the region of the noise instances does not differ substantially enough from the density in the cluster regions. The maximum score is achieved by the clustering shown in Figure 6(c). Here, the instances on the right-hand side are categorised as a further cluster and only the one remaining instance is identified as noise. The reaches its maximum with this assignment, as the density of the noise instance now differs significantly from the density in the clusters.
Conclusions
In this work, we addressed the limitations of existing cluster validity measures when applied to the results of clustering algorithms that identify noise. Although standard validity measures yield useless results in this case, they are nevertheless used in the literature (possibly unaware of the their inappropriateness). By modifying four widely used validity measures Silhouette Coefficient, Davies-Bouldin index, Dunn index, and Score function, we created variants of these metrics that cope well with noise-aware clusterings. To validate the division into noise and cluster instances, as proposed by the respective clustering algorithm, we additionally introduced the Grid Coefficient and Neighbourhood Ratio, that rate the clustering by evaluating whether the assignment of cluster and noise instances is justified by the respective data densities. To be able to assess both, the cluster validity and the density relation between noise and clusters, we proposed a Balanced Score that takes both into account, the adjusted Silhouette Coefficient and the Grid Coefficient. The experimental evaluation, conducted on a diverse set of artificial datasets with varying noise levels and further characteristics, demonstrates that the modified and new measures substantially outperform their conventional counterparts. We strongly recommend to use the proposed rather than the standard measures for noise-aware clustering algorithms such as DBSCAN,5 noise clustering,6 etc.
Besides the promising contributions, the proposed measures also have limitations, as the threshold and scaling parameters used in the and metrics are empirically motivated and may need tuning for specific domains. Lastly, although the proposed validity measures reduce sensitivity to noise, they still rely on distance or density assumptions which may not hold in all data manifolds. We have focussed on prominent validity measures, but in future work more conventional measures may be modified to copy with noise-aware partitions.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article.
Footnotes
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
ORCID iDs
Lea Eileen Brauner
Frank Höppner
Frank Klawonn
Notes
References
1.
von LuxburgUWilliamsonRCGuyonI. Clustering: Science or art? In: Proceedings ICML workshop on unsup. transfer learning, 2012.
2.
JeonHKuoYHAupetitM, et al.Classes are not clusters: Improving label-based evaluation of dimensionality reduction. arXiv, 2023.
3.
FärberIGünnemannSKriegelHP, et al.On using class-labels in evaluation of clusterings. In: MultiClust: 1st international workshop on discovering, summarizing and using multiple clusterings held in conjunction with KDD, 2010.
4.
ArbelaitzOGurrutxagaIMuguerzaJ, et al.An extensive comparative study of cluster validity indices. Pattern Recognit2013.
5.
EsterMKriegelHPSanderJ, et al.Density-based spatial clustering of applications with noise. In: Int. Conf. knowledge discovery and data mining, 1996.
6.
DavéRN. Characterization and detection of noise in clustering. Pattern Recognit Lett1991.
7.
ErtözLSteinbachMKumarV. Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data. In: Proceedings of the 2003 SIAM international conference on data mining 2003, pp.47–58. SIAM.
8.
AsaklyWBlecherABrennanC, et al.Set partition asymptotics and a conjecture of gould and quaintance. J Math Anal Appl2014.
9.
TopchyAJainAKPunchW. Clustering ensembles: models of consensus and weak partitions. IEEE Trans Pattern Anal Mach Intell2005.
10.
TopchyAPLawMHCJainAK, et al.Analysis of consensus partition in cluster ensemble. In: ICDM, 2004.
11.
FaceliKSakataTCde SoutoMCP, et al.Partitions selection strategy for set of clustering solutions. Neurocomputing2010.
12.
RousseeuwPJ. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math1987.
13.
DaviesDLBouldinDW. A cluster separation measure. IEEE Trans Pattern Anal Mach Intell1979.
14.
BezdekJCPalNR. Some new indexes of cluster validity. IEEE Trans Syst Man Cybern1998.
15.
SaittaSRaphaelBSmithIFC. A bounded index for cluster validity. In: IAPR Int. Conf. on Machine Learning and Data Mining in Pattern Rec, 2007.
16.
ChenMBanitaanSMalekiM, et al.Pedestrian group detection with k-means and dbscan clustering methods. In: 2022 IEEE eIT, 2022.
17.
KertanahKNurmayantiWPAiniSR, et al.Comparison of algorithms k-means and dbscan for clustering student cognitive learning outcomes in physics subject. Kappa J2023.
18.
OgbuaborGUgwokeFN. Clustering algorithm for a healthcare dataset using silhouette score value. AIRCC’s Int Journ of Comp Sc and Inf Tech2018.
19.
ZhaoPChenZXieJ, et al.A novel silhouettes cluster internal evaluation index based on granular-ball. In: 2023 8th International Conference on Cloud Computing and Big Data Analytics (ICCCBDA), 2023, pp.92–97. DOI: 10.1109/ICCCBDA56900.2023.10154684.
20.
HopkinsBSkellamJG. A new method for determining the type of distribution of plant individuals. Ann Bot1954.
21.
RosFRiadRGuillaumeS. Pdbi: A partitioning davies-bouldin index for clustering evaluation. Neurocomputing2023; 528: 178–199.