Abstract
Fuzzy C-means (FCM) clustering algorithm is a widely used method in data mining. However, there is a big limitation that the predefined number of clustering must be given. So it is very important to find an optimal number of clusters. Therefore, a new validity function of FCM clustering algorithm is proposed to verify the validity of the clustering results. This function is defined based on the intra-class compactness and inter-class separation from the fuzzy membership matrix, the data similarity between classes and the geometric structure of the data set, whose minimum value represents the optimal clustering partition result. The proposed clustering validity function and seven traditional clustering validity functions are experimentally verified on four artificial data sets and six UCI data sets. The simulation results show that the proposed validity function can obtain the optimal clustering number of the data set more accurately, and can still find the more accurate clustering number under the condition of changing the fuzzy weighted index, which has strong adaptability and robustness.
Keywords
Introduction
With the continuous development of science and technology, the amount of data accumulated by all walks of life has increased dramatically. In order to mine valuable information from massive data, data mining technology came into being. As one of the important means of data mining, cluster analysis has been widely used in pattern recognition, information retrieval, image processing and so on [1, 2]. As an unsupervised learning method, clustering is to divide the similar samples into one class and dissimilar samples into different classes without prior knowledge of data [3]. The clustering algorithms can be divided into two major directions: hard clustering and fuzzy clustering. Hard clustering partitions data using either 1 or 0 method, which each data sample must be clearly assgined to one of classes, such as K-means clustering algorithm [4]. However, in reality most of the data is uncertain. For this reason, Ruspini introduced the concept of fuzzy theory [5] based on hard clustering method to produce fuzzy clustering, such as FCM clustering algorithm [6]. So that has the similarity between class and class division basis, data samples have better clustering results and become more effective, more close to the real need. FCM clustering algorithm has become one of the most widely used clustering algorithms by virtue of its simple principle, fast calculation speed and wide problem solving scope [7]. However, FCM clustering algorithm needs to be verified by clustering validity, so as to determine the optimal number of clustering and judge the quality of clustering results. Therefore, finding an appropriate clustering validity function is one of the important directions for studying clustering validity [8].
Any clustering validity function cannot apply all data sets due to the complexity of data structure and different attributes. So fuzzy clustering validity function has been innovating. At present, the exploration direction of fuzzy clustering validity functions focuses on the following two aspects:
(1) Fuzzy clustering validity function based on membership degree. In 1965, Zadeh took the lead in proposing separation degree, a measure of clustering effectiveness [9]. However, this metric cannot judge the result of clustering well. Then Bedzek first proposed the partition coefficient (V PC ) [10]. This is the first index used to describe the validity of fuzzy clustering by the degree of coincidence of data objects. According to V PC and Shannon’s theorem in Shannon’s information theory, the partition entropy (V PE ) was proposed [11]. In order to suppress the monotonically changing characteristics of V PC and V PE with the number of clusters, Dave proposed a modified partition coefficient (V MPC ) [12] and new indicators (V P ) was proposed by Chen and Links in 2004 [13] and more.
(2) Fuzzy clustering validity function based on membership degree and data set geometry structure. For example, Xie and Beni proposed the Xie-Beni validity function (V XB ) based on this idea [14]. Fakuyame-Sugeno validity function (V FS ) was proposed in 1989 [15]. Knows et al. proposed the clustering validity function (V K ) [16]. Bensaid et al. proposed the clustering validity function (V SC ) [17]. K. L. Wu and M. S. Yang proposed the clustering validity function (V PCAES ) [18]. J. S. Wang proposed the clustering validity function (V W ) [19] and L. F. Zhu et al. proposed the clustering validity function (V Z ) [20] and more.
Various validity functions have their own advantages and disadvantages, which leads to that they can only be applied to a part of data sets. However, the evaluation of clustering validity is a key part of clustering analysis, so we absorb the advantages of traditional clustering validity function, and design a new clustering validity function to make it have better stability. This paper proposes a new validity function (V HY ) The function combines the fuzzy membership matrix with the geometric structure of the data so as to improve the relationship between the function and the geometric structure of the data set, which can effectively overcome the influence of noise data, overlapping data and high-dimensional data, accurately find the optimal number of clustering, and has good robustness. Experimental simulation results show that can obtain the correct number of clustering partition on both artificial data sets and UCI data sets and the clustering results of V HY have better stability when the fuzzy weighted index is changed.
The paper is divided into five chapters. The first section mainly introduces the research background and significance. The second section introduces the FCM clustering algorithm and the composition of some clustering effectiveness functions. The third section introduces the principle of the proposed clustering effectiveness function. The fourth section is the experimental simulation and result analysis. The fifth section summarizes the full text and prospects the next work.
FCM Clustering algorithm and typical clustering validity functions
FCM clustering algorithm
The fuzzy C-means (FCM) clustering algorithm proposed by Bezdek is the most representative one of the fuzzy clustering methods, and is also the most widely used clustering algorithm. FCM clustering algorithm uses n data objects {x1x2, … , x
n
} of data X to divided into c fuzzy clustering and to find the minimum objective function, which is defined as Equation (1):
Clustering validity is a problem of finding the optimal solution of c under the condition of J m minimization. In fact, more attention must be paid to the validity of clustering if cluster analysis is to make a great contribution to engineering applications. Since J m decreases monotonically with the decrease of c, an effective partition evaluation criterion is needed [22]. The process of FCM clustering algorithm is described as follows.
Step 1: Fix cluster parameters c and fuzzy factors m (usually between 1.5 and 2.5). When m=1, FCM clustering algorithm is equivalent to K-means algorithm; When m approaches 1 indefinitely, FCM clustering algorithm tends to be more and more clustering algorithm of hardening fraction. On the contrary, when m tends to be infinite, all data objects x j and cluster center v i will coincide, while data object x j belongs to each cluster will have the same membership degree, with the value of 1/c;
Step 2: Initialize fuzzy division membership matrix.
Step 3: According to Equation (2), update the cluster center Vt+1 ={ v1, v2, …, v
c
}.
Step 4: According to Equation (3), update the fuzzy partition matrix Ut+1 = (u
ij
) c×n.
For i = 1, 2, …, candj = 1, 2, …, n.
Step 5: Calculate e = Vt+1 - V t . If e ≤ ɛ (ɛis a threshold value usually from 0.001 to 0.01), the algorithm is stopped and the final clustering result is calculated; Otherwise t = t+1 and go to Step 3.
Partition coefficient (V PC ) [10]
The V
PC
defined by Bezdek is used to measure the overlap between clusters, which is defined as Equation (4).
The main advantage of V
PC
is its simple operation, while the disadvantage is that it decreases monotonously with the increase of
Bezdek also used the V
PE
to measure the fuzziness of clustering partition, and the index was similar to V
PC
, which was defined as Equation (5):
Bezdek proved it is suitable for all probability cluster partitions. The disadvantage of this validity function is that it decreases monotonically with the increase of
V
MPC
is shown in Equation (6). The V
MPC
index optimizes the monotonely decreasing trend of V
PC
, but does not improve other defects of the V
PC
index.
V
XB
proposed by Xie and Beni is realized by Equation (7). V
XB
is the first clustering validity function to take the structure of the data set into consideration.
It is the ratio of the compactness of the intra-class to the separation of the inter-classes. Obviously, the larger the inter-class distance, the more discrete the inter-class, and the smaller the intra-class distance, the more compact the intra-class. Therefore, the minimum value of V XB indicates that the clustering result is the most effective.
V P is proposed by Chen and Linkens, and it is the vidlity index of the subtracted form. It is a validity function only concerned with membership degree, which is defined in Equation (8).
Bensaid et al. proposed V
SC
in 1996. V
SC
is the ratio of the sum of the tightness of the cluster to the sum of the separation degree, which is defined as Equation (9).
V SC replaces the measure of compactness within a class with the average sum of the whole and the average sum of the compactness within a class. The smaller the V SC value, the better the clustering.
The definition of the V
FS
function is shown in Equation (10).
V K is an effective indicator proposed by Kwon et al. It effectively counteracts the decreasing trend of V XB by adding penalty terms to the molecules of the V XB index. Like V XB , the minimum value of V K corresponds to the optimal number of clusters. V K is calculated by Equation (11):
V PCAES is a subtractive form of validity index proposed by Wu and Yang. It describes the compactness and separability of clustering by fuzzy membership function and relative value of center distance of an exponential structure. V PCAES is defined as Equation (12).
Where,
The V HF validity function is constructed from two validity function that are WL and Tang. Is definition is given as follows Equation (13). The minimum value of V HF corresponds to the optimal number of clusters.
L. F. Zhu proposed V
Z
in 2019 which is defined in Equation (14). Where, the numerator
The V
ECS
is proposed by Ouchicha in 2020, which is defined in Equation (15)
Where, in the V
ECS
, the SC was denified as
Based on the classical clustering validity functions, this paper proposes a new clustering validity function, which is defined as Equation (16):
In the comp of V
HY
the
In the sep of V
HY
, the
Apply V HY to the FCM clustering algorithm, and the algorithm procedure is shown in Fig. 1. Based on V HY , the FCM clustering algorithm procedure is described as follows.

Fowchart of FCM clustering algorithm based on V HY .
Step 1: Determine the maximum number of clustering of c
max
(
Step 2: According to the c = [2, c max ], initialize the number of clustering and the cluster center V(γ+1).
Step 3: Update the fuzzy partitioning matrix U(γ+1) and cluster center. Judge whether the ∥Vt+1 - V t ∥ is smaller than the ɛ. If ∥Vt+1 - V ∥ t < ɛ, go to the next step, otherwise continue Step 3.
Step 4: Let c increase by 1, the FCM algorithm is called to obtain the optimal solution, and calculate the value of V HY . Judge whether the c is smaller than the c max . If c < c max , go to Step 2, otherwise go to the next step.
Step 5: Select the minimum value of V HY index, that is to say min{ V HY (U, V, c) }, whichis corresponding to the clustering number c b for optimal clustering number. Output the value of V HY .
Test data sets
The selected experimental data sets are divided into four artificial data sets and six UCI data sets, as listed in Table 1. In the artificial data sets, Data_2_3 is a set of two-dimensional three-class data set with clear classification according to Gauss distribution. Data_2_3(Noise) is a two-dimensional three-class data set in which 100 noise data are added into Data_2_3. Data_2_3(overlap) is a set of data sets with overlapping data from one class to another. Data_3_6 is a data set of three-dimensional six-class that are uniformly distributed. The sample distribution of artificial data set is shown in Fig. 2.
Experimental data set
Experimental data set

Artificial data sets.
The datasets selected from UCI database. Iris dataset contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other two; the latter are not linearly separable from each other. Seeds dataset comprised kernels belonging to three different varieties of wheat: Kama, Rosa and Canadian, 70 elements each, randomly selected for the experiment. Heart dataset is a heart disease database similar to a database already present in the repository (Heart Disease databases) but in a slightly different form. WPBC dataset is about Wisconsin Prognostic Breast Cancer. Segment dataset an image data described by high-level numeric-valued attributes with 7 classes. Phoneme dataset was in use in the European ESPRIT 5516 project: ROARS. The aim of this project is the development and the implementation of a real time analytical system for French and Spannish speech recognition. The detail number of attributes, samples and classes of UCI data sets are listed in Table 1.
Seven kinds of typical clustering validity functions (V
MPC
, V
SC
, V
P
, V
XB
, V
K
, V
PCAES
, V
Z
) were selected to carry out the compare experiments to verify the validity of V
HY
. According to the experience knowledge [23], the value of the fuzzy weighted value index m range is 1.5 ⩽ m ⩽ 2.5, the number of clustering c range is
Before the comparative test, the feasibility of V HY was judged by observing the changing trend of comp and sep. The results are shown in Fig. 3. It can be seen from the Fig. 3(a)-(b) that under the condition of Data_2_3, comp and sep change greatly when c = 2, 3, 4. When c = 3, the comp and sep decrease rapidly, and the comp is smaller than the sep, so that V HY reaches the minimum. It can be seen from the Fig. 3(c)-(d) that under the condition of iris dataset, After c = 4, the change range of the comp and sep are very slow, and presents a decreasing trend, the comp value is always greater than the value of the sep, so when c = 3, the change trend of the comp and sep are the largest, so it can get that the minimum value of V HY is c = 3, and the number of iris optimal clusters is c = 3.

The varying trends of V HY , comp and sep.
Simulation results for artificial data sets (Data_2_3, Data_2_3(noise), Data_2_3(overlap) and Data_3_6) are shown in Fig. 4–7. It can be seen from Fig. 4(a)–(i) that for the clearly divided data set Data_2_3, the optimal number of clustering can be found for V MPC , V P , V XB , V K , V Z and V HY . Figure 5(a)–(i) shows that under the influence of noisy data, the optimal clustering number c = 3 can be obtained by V P , V XB , V K , V Z and V HY . Figure 6(a)–(i) shows that in the case of overlap of samples in Data_2_3(overlap) data set, only V MPC and V HY found the optimal clustering number c = 3. Figure 7(a)–(i) shows that V MPC , V P , V XB , V K , V Z and V HY can found the optimal number of clustering c = 3 when data set dimensions are added. The simulation results on artificial data sets show that only V HY can find the optimal clustering number of all four artificial data sets. V P , V XB , V K and V Z can find the optimal clustering number of three data sets. V MPC can find the optimal clustering number of two data sets. V PCAES , V SC and V ECS cannot find the optimal clustering number.

The varying trends of validity indices on Data_2_3 data set.

The varying trends of validity indices on Data_2_3(noise) data set.

The varying trends of validity indices on Data_2_3(overlap) data set.

The varying trends of validity indices on Data_3_6 data set.
These shows that the validity of V HY is better than that of these typical clustering validity functions. In addition, V HY is less affected by noise and overlapping data in the data set. The structure and sample size of artificial data set are relatively simple. In order to verify the clustering validity of V HY in complex data sets, simulation experiments are carried out on UCI data set.
The simulation results for the UCI data sets (Iris, Seeds, Heart, WPBC, Segment and Phoneme) are shown in Fig. 8–13. It can be seen from Fig. 8(a)–(i) that only V HY can get the optimal number of cluster c = 3. The second dimension and the third dimension of Iris data set have high similarity, so V P , V SC , V XB , V K , V Z V ECS can only get the optimal number of clustering c = 2, indicating that V HY can well cluster overlapping data. It can be seen from Fig. 9(a)–(i) that for Seeds data set, V P andV HY get the optimal number of clustering c = 3. Figure 10(a)–(i) shows that for Heart data set, V P , V SC , V XB , V K , V Z , V ECS and V HY can obtain the optimal number of clustering c = 2. Figure 11 shows that V SC , V P , V PCAES , V ECS and V HY can obtain the optimal number of clustering c = 2 for WPBC data set. According to the simulation results shown in Fig. 9–11, it can be found that when processing the data set of Seeds, Heart and WPBC with higher dimension, only V p and V HY can get the optimal number of clustering. It shows that V P and V HY can find the optimal clustering number better than other typical clustering validity functions in the case of data sets with high dimension. Figures 12 and 13 simulate Segment and Phoneme data set, and only V HY can get the optimal clustering number c = 2. Segment and Phoneme have higher number of data samples and dimensions, indicating that V HY clustering effect is significantly better than V MPC , V SC , V P , V XB , V K , V PCAES and V Z when the number and dimension of data sets are relatively higher. It can be seen from the simulation results in Figs. 8–13 that only V HY can find the optimal clustering number of all real data sets. On the other hand, V HY can still find the optimal number of clustering in the data set with overlapped samples, higher dimensions and larger number of sample.

The varying trends of validity indices on Iris data set.

The varying trends of validity indices on Seeds data set.

The varying trends of validity indices on Heart data set.

The varying trends of validity indices on WPBC data set.

The varying trends of validity indices on Phoneme data set.

The varying trends of validity indices on Segment data set.
The simulation results of V MPC , V SC , V P , V XB , V K , V PCAES , V Z , V ECS and V HY in the artificial data set and UCI data set respectively show that V HY is obviously superior to V MPC , V SC , V P , V XB , V K , V PCAES , V ECS and V Z in finding the optimal number of clustering. In order to clear out V HY clustering validity of observations is superior to other traditional clustering validity function, the clustering validity function values under different data sets based on V MPC , V p , V SC , V XB , V K , V Z , V PCAES , V ECS and V HY shown in Figs. 4–13 are normalized and shown in Fig. 14(a)–(j) in the same coordinate system.

Comparison of clustering validity functions after normalization.
The optimal clustering number under different clustering validity functions are listed in Table 2. The time used to simulate the data set is shown in Table 2 by combining the cluster effectiveness function and FCM algorithm respectively. Table 2 data unit is second. From the four experiments, it can be found that the computing time of V HY is shorter than some traditional clustering functions. The optimal clustering number under different clustering validity functions are listed in Table 3.
The calculation time of different clustering effectiveness functions
Optimal clustering results of different clustering validity functions on different data sets
m is the fuzzy weighted index, and the introduction of m extends the hard clustering division into the fuzzy clustering division. According to the previous experiences, the value range of m is 1.5 < m ⩽ 2.5. In this simulation experiments, select V MPC , V SC , V XB , V P , V K , V PCAES , V Z , V HY and m = 1.2, 1.5, 2.0, 3.0, 5.0, then carry out simulation experiments on the data sets listed in Table 1. The simulation results are listed in Tables 4 and 5.
Imapct of different m values on each index by using manual datasets
Imapct of different m values on each index by using manual datasets
Imapct of different m values on each index by using UCI datasets
The simulation results on Data_2_3 and Data_2_3(noise) are shown that the change of m value has no influence on V SC , V XB , V P , V K and V HY , but has little influence on V Z and V MPC , while has great influence on V PCAES . Only m = 5, V Z cannot find the optimal cluster number, and V MPC cannot find the optimal cluster number when m = 1.2 and 1.5. However, since V SC is a monotone function, the effect of changing m value on V SC is not representative. Data_2_3(overlap) results show that only V HY and V MPC can find the correct number of clustering when changing the value of m, but V HY misdetermines the clustering result as c = 2 when m = 3.0 and 5.0; V MPC misdetermines the clustering result as c = 14 and 12 when m = 1.2 and 1.5, indicating that V HY has better stability than V MPC . The results of Data_3_6 show that the optimal clustering can be found in V MPC , V XB , V P , V K , V Z and V HY under the condition of changing m, but only V HY can find the optimal clustering number in m =1.5, 2.0, 3.0 and 5.0, while V MPC , V XB , V P , V K , V Z and V HY can only find the optimal clustering number in part of m values, while V PCAES and V SC cannot find the optimal clustering number.
Simulation results on Iris data set show that only V HY and V MPC can find partially correct clustering number when changing the value of m, but V HY can find the optimal clustering number at m = 1.2, 1.5 and 2.0, while V MPC can only find the optimal clustering number at m = 3.0 and 5.0. The result on Seeds dataset shows that V MPC , V P and V HY can find partially correct number of clustering. The accuracy of V MPC and V HY is higher than V P . V HY has a small change under the influence of m, while V MPC has a big change under the influence of V MPC . Heart results show that the optimal number of clustering can be found under different values of m for both V HY and V P , while V XB , V K and V Z can only be found under partial values of m. The results of WPBC show that V XB , V PCAES , V P , V K , V Z and V HY can find part of the optimal clustering number c = 2, but V HY has a higher accuracy rate than V XB , V PCAES , V P , V K and V Z . The simulation results on Phoneme data set show that the optimal number of clusters c = 2 can be found by V HY under different values of m. The simulation results on Segment data set show that the optimal number of clusters can be found only by V HY and V XB in the case of m =2.0, and even in the case of m =3.0 and 5.0, V XB , V K adn V Z are invalid, that is to say that the minimum value cannot be found.
Therefore, the experimental results show that the proposed clustering validity function V HY under different values of m is more accurate, reliable and robust to the optimal number of clustering.
Aiming at the defects of existing validity functions, this paper proposes a new clustering validity function V HY , and carries out simulation experiments on four artificial data sets and six UCI data sets. The experimental results show that the V HY index can be used to find the optimal clustering number in the case of noise, data overlap between classes, high-dimensional data and large data samples, and V HY does not change significantly with the change of fuzzy weighted index m so as to show the good robustness.
Due to the endless emergence of data sets, it is impossible to correctly evaluate the validity of data only by relying on one clustering validity function. Moreover, in the process of research, the structure of validity function becomes more and more complex and the amount of calculation becomes larger and larger. For this reason, V HY is not applicable to all datasets, and the complex calculation process is also difficult to understand. Therefore, in the future research process, the combined clustering validity function will be studied in depth, though absorbing the advantages of each validity function, a more accurate validity evaluation method is obtained.
