Abstract
One main problem of Fuzzy c-Means (FCM) is deciding on an appropriate number of clusters. Although methods have been proposed to address this, they all require clustering algorithms to be executed several times before the right number is chosen. The aim of this study was to develop a method for determining cluster numbers without repeated execution. We propose a new method that combines FCM and singular value decomposition. Based on the percentage of variance, this method can calculate the appropriate number of clusters. The proposed method was applied to several well-known datasets to demonstrate its effectiveness.
Introduction
Clustering is an unsupervised technique for grouping a set of objects based on their similarity. Clustering in general and Fuzzy c-Means (FCM) in particular can be applied to many real-world challenges, such as medical image segmentation [1, 2], target selection in direct marketing [3], web mining [4], managing readiness activities for enterprise resource planning [5], clustering microarray data [6], stock market prediction [7], and even cluster analysis and nonlinear mapping of soil datasets for detection of polluted sites [8].
The need for clustering continues to grow. A pivotal challenge for most basic clustering algorithms is determining the number of clusters, which greatly influences clustering results [9]. Most clustering algorithms are sensitive to the initial number of clusters, which is difficult to determine without prior information [10]. Some methods have attempted to determine the number of clusters for crisp data [11, 12, 13, 14].
Only a limited number of studies focused on determining a suitable number of centers for fuzzy clustering techniques [10, 11, 15, 13, 16]. In [15], Sun et al. proposed specifying values based on a cluster validity criterion. This algorithm splits the worst cluster into two clusters until the predefined criterion is met. Subbalakshmi et al. calculated a Silhouette score function for each cluster. The cluster with the lowest score is split, and the process is repeated until maximum average Silhouette scores are achieved [11]. In [16], Yu and Li proposed a new index for determining the number of clusters in FCM by applying a Hessian matrix derived from minimizing the objective function. They use the conditional index of this Hessian matrix as an index of stability. The main disadvantage of all these approaches is that number of clusters are derived through repetitive trials. The numbers are strongly intertwined with the clustering algorithms. As a result, the numbers identified cannot be applied to other methods. Every FCM clustering method still have to figure out their own ways to estimate numbers of clustering. The previous methods aim for finding the k value by running the clustering algorithm several times with a different k value for each time. Hence, their limitations are about executing time.
To independently identify a suitable number of clusters, this study proposes an algorithm utilizing Singular Value Decomposition (SVD) on similarity matrix of data points. SVD is known for relating data points with latten features [17, 18, 19]. In this study, the data points highly associated with the same latten features are assumed to be related. As a result, the number of features is suggested as the number of clusters. The method proposed in this study reduces execution time by approximately 85% compared with other methods.
This paper is organized as follows. In Section 2, we review some studies closely related to our topic. In Section 3, we introduce the basic FCM algorithm and SVD. We also provide a detailed presentation of our new method, which incorporates a user-defined threshold. In Section 4, numerical examples are provided to demonstrate the exactness and usefulness of our method, including empirical comparisons. In Section 5, we offer conclusions and discuss current and future applications of this research.
Related work
Fuzzy clustering has emerged as a fruitful research topic. Accordingly, a great many clustering algorithms have been developed. However, determining the number of clusters remains a critical task. In [19], Lee and Olafsson showed that the number of clusters can affect the performance of clustering algorithms. In [20] Yu et al. noted that the quality of clustering results, including factors such as noise or outliers, is also affected by the number of clusters.
In [15], Sun et al. reported an FCM-based splitting algorithm for automatically determining the number of clusters, which was calculated using a score function. With U as the fuzzy partition matrix, V as the vector of centers, and
The value of Scat (
In [11], Subbalakshmi et al. used a fuzzy Silhouette score function
In [10], Beringer and Hüllermeier reported an adaptive optimization method for adjusting the number of clusters in FCM clustering. They proposed a modification of the commonly used Xie-Beni index as a validity measure for fuzzy partitions. Each iteration of this method consists of a quality check to determine whether the cluster model might be improved by increasing (
Although fuzzy clustering and methods for determining the number of clusters have been the topic of research, to the best of our knowledge no studies have focused on using SVD and FCM in a clustering algorithm to determine the number of clusters.
In [13], Thong et al. listed three popular approaches to determining the number of clusters: scanning [19, 21, 22, 23], preprocessing [24, 25], and pruning [11, 26, 27]. To our understanding, all of these approaches require that a clustering procedure be run numerous times to check the optimality of the number of clusters.
Several studies have attempted to use both FCM and SVD in certain applications. In [28], Li et al. combined SVD, FCM, and rough set theory to diagnose faults in rotating machinery. In [29], Muliawati et al. used FCM to identify trending topics on Twitter from reduced data dimensions by SVD. Guo et al. proposed a feature-clustering approach to recognize faults in resonant grounding distribution systems, using SVD to decompose the time-frequency matrix and merging with the polarity distribution matrix to establish an amplitude-polarity feature matrix (APFM). FCM was then applied to the APFM to detect system faults by classifying feeders into two types: fault feeders and sound feeders [30]. Oliynyk et al. used FCM clustering to identify clusters with various neuronal spike shapes, then applied SVD to classify neuronal waveforms. Although FCM clustering based on SVD has been applied in their algorithms, none of these studies attempted to identify the appropriate number of clusters [31]. Our proposed method therefore seeks to determine the suitable number of clusters using a combination of FCM and SVD. Comparing to other methods, our performance can be integrated with other approaches. To overcome the issue of execution timing, this paper strives to reduce the time of clustering and makes some comparison with the benchmarking methods.
Methodology
In [16], there are two main approaches to determining the number of clusters. The first approach focuses on geometric properties (e.g., the Xie-Beni index [10]). The second approach uses the fuzzy partition concept, and its main advantage is superior data clustering performance in uncertain cases. The algorithm proposed in this paper is based on fuzzy partition.
Primary of FCM
The main aim of FCM is to cluster data points by minimizing the objective function
Where:
In Eq. (2),
where
After being clustered by FCM, the
The pseudocode of FCM is shown in Fig. 1.
Pseudocode of FCM algorithm.
Let
where
To correctly apply FCM, a reasonable
The methodology starts with initializing membership matrix
The
Applying SVD to
The percentage threshold is defined in Eq. (7). The top large singular values indicate the suitable number of clusters
where
The pseudocode of the proposed algorithm is shown in Fig. 2.
Pseudocode of the proposed algorithm.
Figure 3 provides a block diagram of the proposed method summarizing its algorithm in simplified steps.
Block diagram of the proposed algorithm.
To test the effectiveness and reliability of the proposed method, three simulations were performed on both synthesized and real datasets. To ensure the validity of results in the test simulations, all experiments were performed on the same device with the same initial parameters, and all were implemented using a 64-bit Windows 10 operating system. To evaluate the algorithms under the same development environment, all the algorithms were executed in Matlab language using an Intel Core i5 processor on a 4-GB RAM notebook.
Simulation 1
In this numerical experiment, we performed a simulation using the fcmdata.dat [20] dataset consisting of 140 instances in two dimensions. The proposed method was applied step by step. Users can set a threshold in Eq. (7) to identify the most influential clusters. The results for different thresholds are shown in Table 1.
Relationship between thresholds and number of suitable clusters
Relationship between thresholds and number of suitable clusters
Table 1 displays the number of influential clusters determined by the thresholds. When a higher threshold is set, more singular values should adequately describe the data characteristics. The number of singular values with high variance in the dataset represents the number of clusters. For example, when the threshold is set at 95%, the number of influential clusters is five. The other singular values contain only small variations, and therefore cannot be counted as having great influence in the dataset. In most cases, a threshold is set so that the percentage of variance accounts for at least 70% [32].
This simulation experiment had two main purposes. First, we checked the reliability of the proposed method’s experimental results by comparing them with three well-known cluster validity indices, namely, the Calinski-Harabasz index [33], the Silhouette index [34], and the Davies-Bouldin index [35]. Second, we compared the time consumed by other methods and by this new method through cross-testing with other criteria on both artificial and real datasets. The Fisher’s Iris dataset from the UCI machine learning repository [36] contains 150 instances and four dimensions, whereas the synthesized pathway dataset [37] has 300 instances and three dimensions. To ensure comparison under the same conditions, we used the same initial parameters by setting the number of clusters at six for all datasets. The aforementioned hardware specification was again used in conducting comparisons for the optimal number of clusters
Cross-testing between the proposed method and other approaches in the Fisher’s Iris dataset
Cross-testing between the proposed method and other approaches in the Fisher’s Iris dataset
Cross-testing between the proposed method and other approaches in a path-based dataset
Tables 2 and 3 provide an empirical comparison of validation techniques using artificial and real datasets, respectively. Two other well-known algorithms (k-means clustering and Gaussian mixture distribution clustering) were employed for cross-checking of the three aforementioned cluster validity indices including the Calinski-Harabasz index, the Davies-Bouldin index and the Silhouette index. The outputs of the numerical experiment indicated that the suitable numbers of clusters determined by our method and the other methods were largely the same. This result validates the accuracy of the proposed algorithm. We then checked whether our method outperformed the others in terms of time consumption. Figures 4 and 5 display the results.
Run time on Fisher’s Iris dataset.
Figure 4 shows the results of testing with Fisher’s Iris dataset using three methods and three criteria. Because the output of our method was the same as that of the others, we wished to know which method required less execution time. Our method consumed less time than the others. Specifically, the proposed method reduced execution time by approximately 90%.
Run time on path-based dataset.
Figure 5 shows the results of testing for time consumption with a path-based dataset. Although both our method and the others determined the same number of clusters, ours had lower time consumption than the others. Specifically, the proposed method reduced execution time by 83% to 93%.
To demonstrate the efficiency of the proposed method, we compared its time consumption with that of other methods by testing with a different number of data points. Datasets of many different sizes, from 1000 to 7000 normally distributed random data points with two dimensions, were created. The result is displayed in Fig. 6.
Comparison of time consumption between methods.
The line graph in Fig. 6 illustrates the average time consumption of the three methods for five numbers of data points. The proposed method outperformed the k-means and Gaussian mixture clustering methods. The smaller the dataset size, the lower was the variance observed. However, when the dataset size increased, the variance in time consumption increased as well. For example, the execution time for 1000 data points did not differ greatly among the three methods, whereas when the number of data points increased to 7000, the proposed method reduced time consumption by approximately 88% compared with the Gaussian mixture method, and by 75.5% compared with the k-means method. These findings indicate our proposed method can dramatically reduce the time required to determine a suitable number of clusters. The positive results of Simulation 3 suggest that the proposed method could be used with large datasets containing a massive number of data records.
Visualization of the original dataset.
Visualize the dataset after clustering with 2 clusters when the threshold is 75%.
Visualize the dataset after clustering with 3 clusters when the threshold is 85%.
Visualize the dataset after clustering with 4 clusters when the threshold is 90%.
The above simulations have shown the effectiveness of our approach in term of time consumption. When clustering the dataset, it is critical important to ensure the quality of clustering with the particular number of clusters. Thus, we visualize the clustering results then compare them to show the quality of the fcmdata.dat dataset with 140 data points and 2 features. In this simulation, we plot the original data points at the first step. Then, we plot the clusters of the dataset with different number of clusters respectively. By this way, we can visualize the quality of clustering results with different recommended number of clusters. The below graphics show the clusters in the different colours after running the proposed algorithm with different thresholds.
From the visualized results we obtained when clustering the same dataset, it is clearly to see that the clustering results are very different with various recommended number of clusters. In this dataset, the best clustering result we got with two clusters. When clustering with three or four clusters, the clustering results are not desirable enough as the data points are not distributed well in those cases. Therefore, we can check clearly the quality of clustering with the visualized figures.
Simulation 5
To validate the number of clusters, we added two more graphical simulations by using two benchmark approaches, namely, Gap statistic method [38] and Silhouette approach [34] to determine the suitable number of clusters. We use visualization techniques for understanding the statistical properties of the dataset to decide the number of clusters. The experiments are reported as follows:
Relationship between Gap values and number of clusters.
Relationship between Silhouette values and number of clusters.
Gap statistic is a well-known graphical approach to cluster evaluation involves plotting an error measurement of several numbers of clusters, and then it creates an “elbow” in the figure. The “elbow” shows the most significant decrease in error measurement. The gap statistic approach estimates the “elbow” point as the number of clusters with the largest gap value.
Based on the Fig. 11, the maximum value of the gap criterion occurs at 3 clusters. However, the value at 2 clusters is within a standard error of the maximum, so the suggested optimal number of clusters is 2.
To validate the number of clusters, we also simulate the second simulation by using graphical Silhouette clustering evaluation as follows:
From the above graphical simulations with two benchmark methods, we can conclude that the suitable number of clusters for this dataset is two. Our approach also recommends two clusters for this dataset. Therefore, our approach is a trust worthy method to determine number of clusters for fuzzy data.
In order to evaluate the results with high-dimensional and skewed dataset. An experiment is conducted the Forest Cover Type drawn from the UCI Machine Learning Repository with 54 attributes and 581012 instances. This dataset is high-dimensional, skewed and has many outliers. The simulation results show that the proposed algorithm is outperformed comparing to benchmarking methods when clustering more complicated datasets. The experimental applied the validity indices to testify the efficiency and accuracy of the proposed method which are shown in Table 4.
Cross-testing between the proposed method and other approaches in a skewed dataset
Cross-testing between the proposed method and other approaches in a skewed dataset
Cross-testing between the proposed method and other approaches in a skewed dataset
To assess the results on a skewed dataset, an experiment is conducted on the Boston_house_price dataset which is drawn from the UCI Machine Learning Repository. From the results in the Table 5, the execution time of the proposed method is comparatively faster than the other approaches with the correct number of clustering. Hence, the proposed method is consistent when clustering skewed dataset.
Simulation 8: Dataset which has many outliers
To test the dataset with outliers, we have implemented the robust testing for the proposed algorithm on three synthetic datasets with two dimensions and 1000 instances. The datasets contain three percentages of outliers of 0.1%, 0.2% and 0.3% respectively. The results in the Table 6 can help to determine how well each method performs among these three outlier percentages and whether the proposed method is better than the Calinski-Harabasz, Silhouette and Davies-Bouldin indices on computing the number of clusters. Hence, the proposed method is robust to the limited percentages of outliers.
Cross-testing between the proposed method and other approaches in the datasets with different percentages of outliners
Cross-testing between the proposed method and other approaches in the datasets with different percentages of outliners
This paper presents a new method for calculating a suitable number of clusters using a mixture of fuzzy and SVD approaches. The proposed method can reduce time consumption when determining the most effective number of clusters. The model was tested on both artificial and real datasets for validation, and empirical comparisons were also performed to prove the reliability and accuracy of the proposed method. Furthermore, the numerical experiment results indicated that the proposed method was consistent with lower time consumption. Compared with more traditional methods, the proposed method can perform faster when working with large datasets.
Future research could use this method in real-world case studies involving particular scenarios. We also plan to use the proposed method in clustering the massive datasets generated by IoT devices, medical imaging, car sensors, satellite imaging, navigation systems, wearable health-monitoring devices, appliances, and other such sources. Furthermore, the missing values in the datasets should be handled in the coming studies.
Footnotes
Acknowledgments
The authors thank the reviewers for their constructive comments and recommendations. This paper is supported by Ho Chi Minh city University of Technology and Education.
Conflict of interest
No potential conflict of interest is reported by the authors.
