Ratio component-wise design method of fuzzy c-means clustering validity function

Abstract

Fuzzy clustering is an important research field in pattern recognition, machine learning and image processing. The fuzzy C-means (FCM) clustering algorithm is one of the most common fuzzy clustering algorithms. However, it requires the number of clusters to be given in advance, so a good clustering validity index is needed to verify the clustering results. This paper presents a ratio component-wise design method for clustering validity functions based on the FCM clustering method. Six clustering validity components with different meanings are combined in ratio form to produce 49 different clustering validity functions. These functions are then verified experimentally on six UCI data sets, and the clustering validity function with the simplest structure and the best classification effect is selected by comparison. Finally, this function is compared with seven traditional clustering validity functions on eight UCI data sets. The simulation results show that the proposed validity function can better verify the classification results and determine the optimal clustering number of different data sets.

Keywords

Data mining, fuzzy c-means clustering algorithm, clustering validity function, ratio component-wise design

1 Introduction

With the continuous development of science and technology and the steady accumulation of network data, human society has entered an era of data explosion. Data mining technology emerged to extract valuable information from such massive data [1], and it has been widely used in databases, artificial intelligence, machine learning and other fields. As an important means of data mining, cluster analysis has also been developed and studied further [2, 3]. Cluster analysis is an unsupervised learning method. Its core idea is to group samples with similar attributes into one category and samples with dissimilar attributes into different categories, without prior knowledge, so that samples of the same category are as similar as possible and samples of different categories differ strongly [4]. Current clustering algorithms are mainly divided into hierarchical, partition-based, density-based and model-based clustering algorithms [5–8]. The hard-partition K-means clustering algorithm [9], one of the partition-based algorithms, divides data in an either-or manner, that is, each data sample must be assigned to exactly one class. In practice, however, most data are uncertain, and a data sample may belong to multiple classes to varying degrees [10]. In order to provide a better basis for dividing data samples that resemble more than one cluster and to be closer to real needs [11], Bezdek introduced the concept of fuzzy sets [12] into the K-means clustering algorithm and proposed the fuzzy C-means (FCM) clustering algorithm [13]. FCM has become one of the most widely used clustering algorithms because it is more consistent with real sample situations, is convenient to operate and has a wide range of applications [14].

The FCM clustering algorithm relies on a validity function to determine the best classification result. However, because data sets differ greatly in structure and properties, no single clustering validity function can handle all data sets. New clustering validity functions have therefore been proposed continually, and the topic has attracted scholars at home and abroad. Clustering validity functions are mainly divided into the following two categories.

1) Fuzzy clustering validity functions based only on membership degree. For example, Bezdek first proposed the partition coefficient (V_PC) [15] and the partition entropy (V_PE) [16] as validity functions for fuzzy clustering in 1974. V_PC and V_PE have a simple structure and a small computational cost, but they change monotonically with the number of clusters. Subsequently, Dave proposed an improved partition coefficient (V_MPC) in 1996, which suppresses the monotonic decrease of the partition coefficient with the number of clusters but does not address the other disadvantages of V_PC [17]. In 2004, Chen and Links proposed a new validity index (V_P) based on a subtraction form, which has been successfully applied to nonlinear fuzzy models, power system modeling, mechanical performance prediction and other fields [18]. In 2010, Zalik proposed a validity function (V_CO) that uses the membership degree to define overlap degree and compactness [19]. Yong-li Liu then added a separation-degree module to V_CO in 2019 and proposed the CSBM validity function [20], thus improving the stability of V_CO.

2) Fuzzy clustering validity functions based on both membership degree and the geometric structure of the data set. For example, in 1991, Xie and Beni proposed the clustering validity function (V_XB) based on a ratio operation [21], which was the first clustering validity function to consider the geometric structure of data sets and laid a solid foundation for the field. However, this validity function has two shortcomings. (1) When the clustering number c tends to the number of data samples, V_XB tends to 0. (2) When the fuzzy exponent is very large, V_XB becomes infinite. In 2005, Kuo-Lung Wu proposed a clustering validity function (V_PCAES) based on a normalized partition coefficient and an exponential operation [22]. Chi Yun et al. proposed the validity function (V_FM) in 2007 [23], which takes into account two important evaluation indicators, partition entropy and fuzzy partition factor, and defines the compactness and separation of clusters. However, its performance on noisy data sets was poor. In 2015, Chib-Hung Wu proposed a validity function (V_WL) [24] that takes into account both the membership degree and the geometric structure of the data set and better suppresses noisy data. In 2019, Ling-feng Zhu et al. proposed a new clustering validity function (V_ZLF) [25], which is composed of the ratio of compactness to separation and can accurately divide high-dimensional data sets. Liu proposed the validity function (V_IMI) in 2021 [26], which is mainly characterized by a new definition of the separation measure and uses the imbalance ratio of two clusters to enlarge the distance between their centers. In the same year, Hong-yu Wang proposed a novel validity function (V_HY) based on the ratio of intra-class compactness to inter-class separation [27].

A clustering validity function is composed of several components, and different components have different geometric meanings and play different roles in the function. Many clustering validity functions have been put forward by combining components through empirical trial and error. This subjective approach involves a degree of chance, and there is very little research on component-based design of fuzzy clustering validity functions. Therefore, the six FCM clustering performance evaluation components defined in this paper are combined systematically in ratio form, and the resulting validity functions are verified on UCI data sets so as to analyze the influence of the different components and select the best validity function among them. This function is then compared with typical FCM clustering validity functions on UCI data sets. Experimental results show that this validity function obtains correct clustering results on the UCI data sets and can accurately divide high-dimensional data and overlapping data.

2 FCM clustering algorithm

The fuzzy C-means (FCM) clustering algorithm proposed by Bezdek is the most representative fuzzy clustering algorithm at present. Its principle is to divide the set X = {x₁, x₂, ⋯, x_n} of n data objects into c fuzzy clusters by iteratively minimizing the objective function, which is defined as follows [28]: $J_{FCM} (U, V) = \sum_{i = 1}^{c} \sum_{j = 1}^{n} {(u_{ij})}^{m} {∥ x_{j} - v_{i} ∥}^{2}$ (1) where J_FCM (U, V) is the squared-error clustering criterion, and a minimizer of J_FCM (U, V) is called a least-squared-error stationary point; V = {v₁, v₂, ⋯, v_c} is the set of cluster centers for data set X; c is the number of clusters; m ∈ (1, ∞) is the fuzzy exponent, which controls the degree of fuzziness of the partition; ∥x_j - v_i∥ is the Euclidean distance between data object x_j and cluster center v_i; and u_ij (0 ⩽ u_ij ⩽ 1) is the membership degree of data object x_j in the cluster with center v_i. In Equation (1), u_ij ∈ U, where U is a c × n fuzzy partition (membership) matrix that must satisfy the following conditions, and U₀ denotes the initial membership matrix. $\sum_{i = 1}^{c} u_{ij} = 1$ for 1 ⩽ j ⩽ n, and $0 ⩽ \sum_{j = 1}^{n} u_{ij} ⩽ n$ for 1 ⩽ i ⩽ c.

The process of the FCM clustering algorithm is described as follows.

Step 1: Set the number of clusters c and the fuzzy exponent m (usually between 1.5 and 2.5).

Step 2: Initialize the cluster center matrix V and membership matrix U, and obtain initial membership matrix U₀ and initial cluster center V₀.

Step 3: According to Equation (2), update the cluster centers V_t+1 = {v₁, v₂, ⋯, v_c}, where V_t denotes the cluster centers at the current iteration and V_t+1 denotes the cluster centers at the next iteration. $v_{i} = \frac{\sum_{j = 1}^{n} {({u_{ij}}^{(t)})}^{m} \cdot x_{j}}{\sum_{j = 1}^{n} {({u_{ij}}^{(t)})}^{m}}$ (2)

Step 4: According to Equation (3), update the fuzzy partition matrix U^(t+1) = (u_ij^(t+1))_c×n, where u_ij^(t) denotes the membership at the current iteration and u_ij^(t+1) denotes the membership at the next iteration. $u_{ij} = {[\sum_{k = 1}^{c} {(\frac{∥ x_{j} - v_{i} ∥}{∥ x_{j} - v_{k} ∥})}^{2 / (m - 1)}]}^{- 1}$ (3) where i = 1, 2, ⋯, c and j = 1, 2, ⋯, n.

Step 5: Calculate e = ∥U^(t+1) - U^(t)∥. If e ⩽ ɛ (ɛ is a threshold, usually from 0.001 to 0.01), the algorithm stops and the final clustering result is output; otherwise, set t = t + 1 and go to Step 3.
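Steps 1-5 can be sketched in a few lines of NumPy. The following is a minimal illustration rather than the implementation used in the experiments: the function name `fcm`, the random initialization of U only (the centers then follow from the center update), the Frobenius norm in the stopping test, the iteration cap `max_iter` and the small floor on the distances are all assumptions made for this sketch.

```python
import numpy as np


def fcm(X, c, m=2.0, eps=1e-3, max_iter=300, seed=0):
    """Minimal fuzzy C-means sketch; returns (U, V) with U of shape (c, n)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Step 2: random initial membership matrix U0 whose columns sum to 1.
    U = rng.random((c, n))
    U /= U.sum(axis=0, keepdims=True)
    for _ in range(max_iter):
        # Step 3, Eq. (2): centers are membership-weighted means of the samples.
        W = U ** m
        V = (W @ X) / W.sum(axis=1, keepdims=True)
        # d2[i, j] = squared Euclidean distance between x_j and v_i.
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)  # assumption: avoid division by zero
        # Step 4, Eq. (3): u_ij is proportional to d2_ij ** (-1 / (m - 1)).
        inv = d2 ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=0, keepdims=True)
        # Step 5: stop once the membership matrix changes by at most eps.
        converged = np.linalg.norm(U_new - U) <= eps
        U = U_new
        if converged:
            break
    return U, V
```

Calling `fcm(X, c=3)` on an (n, d) array returns the c × n membership matrix and the c × d center matrix, which are the only inputs the validity components in Section 3 require.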

3 Ratio component-wise design method of FCM clustering validity function

3.1 Cluster validity evaluation component

Based on the constituent characteristics of typical clustering validity functions, this paper defines six cluster validity evaluation components (CPs). These components represent the compactness, variability, overlap, similarity and separation of data sets, respectively; separation is described by two components.

(1) Compactness

Equation (4) is the criterion function of the FCM clustering algorithm (Equation (1) with m = 2) and is used to define the compactness within the data set. Geometrically, it is the sum of the membership-weighted squared distances between each cluster center v_i and each data sample x_j. The smaller the value of CP₁, the closer the data samples x_j are to their corresponding cluster centers v_i, that is, the more similar and the more tightly grouped the data within each class are. $C P_{1} = \sum_{i = 1}^{c} \sum_{j = 1}^{n} (u_{ij})^{2} {∥ x_{j} - v_{i} ∥}^{2}$ (4)
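As a small illustration of Equation (4), the sketch below evaluates CP₁ for a toy partition. The function name `cp1`, the optional exponent argument `m` (Equation (4) fixes it at 2) and the toy arrays are assumptions introduced only for this example.

```python
import numpy as np


def cp1(X, U, V, m=2.0):
    """Compactness, Eq. (4): sum over i, j of u_ij^m * ||x_j - v_i||^2."""
    # d2[i, j] = squared Euclidean distance between center v_i and sample x_j.
    d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
    return float(((U ** m) * d2).sum())


# Hypothetical toy data: three 1-D samples, two centers, crisp memberships.
X = np.array([[0.0], [1.0], [10.0]])
V = np.array([[0.5], [10.0]])
U = np.array([[1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
print(cp1(X, U, V))  # 0.25 + 0.25 + 0.0 = 0.5
```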

(2) Variability

Exponential operations are highly useful in dealing with the classical Shannon entropy and in cluster analysis. Because the exponential function is monotonic, it can suppress the interference of noisy data, and it maps the distance between a data object and a cluster center into the interval (0, 1] [22]. Therefore, Equation (5) measures the distance between data x_j and cluster center v_i, relative to the spread of the sample points, in exponential form so as to represent the degree of variation of the data within a class.

Here $ɛ = (\sum_{j = 1}^{n} {∥ x_{j} - \bar{x} ∥}^{2}) / n$ (where $\bar{x} = (\sum_{j = 1}^{n} x_{j}) / n$ is the mean of the data samples) is the average squared distance between the data samples and their mean, which is used to measure the compactness of the data. The smaller the calculated value of ɛ, the more stable the data are. Similarly, the smaller the value of CP₂, the lower the degree of variation of the data in a class and the more stable the data in the class. $C P_{2} = \frac{1}{n} \sum_{i = 1}^{c} \sum_{j = 1}^{n} exp (- \frac{{∥ x_{j} - v_{i} ∥}^{2}}{ɛ})$ (5)
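Equation (5) can be checked numerically with a short sketch. The function name `cp2` and the two-point toy data set are assumptions made for illustration; the scale ɛ is computed from the samples exactly as defined above.

```python
import numpy as np


def cp2(X, V):
    """Variability, Eq. (5): (1/n) * sum over i, j of exp(-||x_j - v_i||^2 / eps)."""
    n = X.shape[0]
    x_bar = X.mean(axis=0)
    # eps is the mean squared distance of the samples from their overall mean.
    eps = ((X - x_bar) ** 2).sum() / n
    d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
    return float(np.exp(-d2 / eps).sum() / n)


# Hypothetical toy data: samples at 0 and 2, one center at 1, so eps = 1
# and both squared distances equal 1.
X = np.array([[0.0], [2.0]])
V = np.array([[1.0]])
print(cp2(X, V))  # equals exp(-1)
```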

(3) Overlap

Equation (6) represents the degree of overlap between clusters based on the membership degrees. If a data point x_k in an overlapping region belongs to the i-th cluster, its membership degree u_ik will be close to 1, while its membership degree u_jk in the j-th cluster will be close to 0. The better the overlapping data are separated, the closer |u_ik - u_jk| is to 1 and the smaller 1 - |u_ik - u_jk| becomes. Therefore, a very small value of CP₃ indicates a better classification. $C P_{3} = min_{i \neq j} (\frac{1}{n} \sum_{k = 1}^{n} (1 - | u_{ik} - u_{jk} |))$ (6)

(4) Similarity

Equation (7) is the minimum, over all clusters, of the sum of squared membership degrees of the data points x_j in the cluster with center v_i. The larger the value of CP₄, the higher the membership degrees of the data points in their corresponding clusters, and the more similar the data within each class are. $C P_{4} = min_{1 ⩽ i ⩽ c} \sum_{j = 1}^{n} {(u_{ij})}^{2}$ (7)
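The two membership-only components, Equations (6) and (7), depend on nothing but the matrix U. The sketch below computes both; the function names `cp3` and `cp4` and the toy membership matrix are assumptions, and the inner sum of Equation (6) is taken over the n data points for each pair of distinct clusters.

```python
import numpy as np
from itertools import combinations


def cp3(U):
    """Overlap, Eq. (6): min over cluster pairs of the mean of 1 - |u_ik - u_jk|."""
    c, n = U.shape
    return float(min((1.0 - np.abs(U[i] - U[j])).sum() / n
                     for i, j in combinations(range(c), 2)))


def cp4(U):
    """Similarity, Eq. (7): min over clusters of the sum of squared memberships."""
    return float((U ** 2).sum(axis=1).min())


# Hypothetical toy membership matrix: 2 clusters, 3 samples.
U = np.array([[0.9, 0.8, 0.1],
              [0.1, 0.2, 0.9]])
print(cp3(U))  # (0.2 + 0.4 + 0.2) / 3
print(cp4(U))  # min(1.46, 0.86)
```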

(5) Separation

Equations (8) and (9) are two components used to define the degree of separation between classes. CP₅ is the minimum squared distance between any two cluster centers. The larger the value of CP₅, the larger the distance between classes, that is, the better the degree of separation.

CP₆ is the mean squared distance between the cluster centers and their equilibrium point $\bar{v} = (\sum_{i = 1}^{c} v_{i}) / c$, the average of all cluster centers. Similarly, the larger the value of CP₆, the farther each cluster center is from the equilibrium point, that is, the more separated the classes are from each other. $C P_{5} = min_{i \neq k} {∥ v_{i} - v_{k} ∥}^{2}$ (8) $C P_{6} = \frac{1}{c} \sum_{i = 1}^{c} {∥ v_{i} - \bar{v} ∥}^{2}$ (9)
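Both separation components use only the cluster centers, so they can be illustrated with a center matrix alone. The function names `cp5` and `cp6` and the three collinear toy centers are assumptions chosen so that the values can be verified by hand.

```python
import numpy as np
from itertools import combinations


def cp5(V):
    """Separation, Eq. (8): minimum squared distance between two distinct centers."""
    return float(min(((V[i] - V[k]) ** 2).sum()
                     for i, k in combinations(range(len(V)), 2)))


def cp6(V):
    """Separation, Eq. (9): mean squared distance of the centers from their mean."""
    v_bar = V.mean(axis=0)
    return float(((V - v_bar) ** 2).sum() / len(V))


# Hypothetical toy centers on a line, 5 units apart.
V = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
print(cp5(V))  # 25.0
print(cp6(V))  # (25 + 0 + 25) / 3
```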

3.2 Ratio component-wise design method of clustering validity function

In order to explore the component-wise design method of clustering validity functions, the six components above are divided into two sets according to their extreme-value characteristics: Ω₁ = {CP₁, CP₂, CP₃} and Ω₂ = {CP₄, CP₅, CP₆}. The components in Ω₁ indicate a better partition when they are small (minimum valid), and the components in Ω₂ indicate a better partition when they are large (maximum valid). One, two or three distinct components are chosen from Ω₁ and summed to form the numerator, and one, two or three distinct components are chosen from Ω₂ and summed to form the denominator. The ratio of the two sums forms a new clustering validity function, as shown in Equation (10). $S = \frac{\sum_{i = 1}^{n} C P_{i}}{\sum_{j = 1}^{m} C P_{j}}$ (10) where CP_i ∈ Ω₁, CP_j ∈ Ω₂, and m and n are positive integers from 1 to 3 (in Equation (10) only, n and m count the selected components and are unrelated to the sample size n and the fuzzy exponent m). The clustering validity functions S constituted in this way share a similar structure and are all minimum valid, which is convenient for the subsequent comparative experiments.
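The composition rule of Equation (10) amounts to pairing every non-empty subset of Ω₁ with every non-empty subset of Ω₂. The sketch below enumerates these pairs and evaluates one of them; the helper names and the component values passed to `evaluate` are hypothetical, and the enumeration order is not claimed to match the numbering S₁ to S₄₉ used in the tables.

```python
from itertools import combinations

OMEGA_1 = ("CP1", "CP2", "CP3")  # minimum-valid components (numerator)
OMEGA_2 = ("CP4", "CP5", "CP6")  # maximum-valid components (denominator)


def nonempty_subsets(items):
    """All subsets of size 1..len(items), ordered by increasing size."""
    return [s for r in range(1, len(items) + 1) for s in combinations(items, r)]


def enumerate_ratios():
    """Every (numerator, denominator) pair allowed by Eq. (10)."""
    return [(num, den)
            for num in nonempty_subsets(OMEGA_1)
            for den in nonempty_subsets(OMEGA_2)]


def evaluate(ratio, values):
    """Evaluate one ratio from a dict of component values such as {"CP1": 0.5}."""
    num, den = ratio
    return sum(values[name] for name in num) / sum(values[name] for name in den)


ratios = enumerate_ratios()
print(len(ratios))  # 7 numerators * 7 denominators = 49
```

Since every ratio is built the same way, one evaluation routine serves all 49 candidates, and the smallest value of S over the tested cluster numbers indicates the preferred partition.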

3.3 Clustering data selection and simulation experiments

A total of 49 different clustering validity functions (the 7 non-empty subsets of Ω₁ combined with the 7 non-empty subsets of Ω₂) can be formed from the six components according to the rule in Equation (10). To present the experimental results clearly, the 49 validity functions are divided into 7 groups of 7. The groups are listed in Tables 2, 4, 6, 8, 10, 12 and 14, which give the name, combination rule and function structure of each validity function. According to prior knowledge [29], the fuzzy exponent satisfies 1.5 ⩽ m ⩽ 2.5 and the cluster number satisfies $2 ⩽ c ⩽ \sqrt{n}$. In this paper, m = 2 and 2 ⩽ c ⩽ 14 are used in the simulation experiments, which judge whether each clustering validity function identifies the correct number of clusters on different data sets. The real data sets selected for this experiment are the Iris, Seeds, Phoneme, Heart, Hfcr and Segment data sets from the UCI database. The number of samples, classes and attributes of each UCI data set are listed in Table 1. To observe the trends of the seven groups of clustering validity functions on the UCI data sets, their function values are plotted in a normalized coordinate system, as shown in Figs. 1–7, panels (a)–(f). In this way, the clustering effect of each validity function can be compared more intuitively. Finally, the optimal clustering number given by each validity function for each UCI data set is listed in Tables 3, 5, 7, 9, 11, 13 and 15.
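The selection procedure described here (evaluate S for each c from 2 to 14, normalize the curve for plotting, and take the c with the smallest value) can be sketched as follows. The text does not state which normalization was used, so min-max scaling to [0, 1] is an assumption, as are the function names and the made-up index values.

```python
import numpy as np


def min_max_normalize(values):
    """Scale a curve to [0, 1]; an assumed choice of normalization."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        return np.zeros_like(values)
    return (values - values.min()) / span


def optimal_c(c_values, s_values):
    """Return the cluster number with the smallest (minimum-valid) index value."""
    return int(c_values[int(np.argmin(s_values))])


c_values = list(range(2, 15))  # 2 <= c <= 14, as in the experiments
# Made-up index values with a minimum at c = 3, for illustration only.
s_values = [(c - 3) ** 2 + 1.0 for c in c_values]

normalized = min_max_normalize(s_values)
print(optimal_c(c_values, s_values))  # 3
```

Normalization changes only the vertical scale of each curve, so the position of the minimum, and therefore the selected c, is the same before and after scaling.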

Fig. 1

Change trend of first group of functions after normalization.

Fig. 2

Change trend of second group of functions after normalization.

Fig. 3

Change trend of third group of functions after normalization.

Fig. 4

Change trend of fourth group of functions after normalization.

Fig. 5

Change trend of fifth group of functions after normalization.

Fig. 6

Change trend of sixth group of functions after normalization.

Fig. 7

Change trend of seventh group of functions after normalization.

Table 1

UCI data set

Data set	Samples	Attributes	Classes
Iris	150	4	3
Seeds	210	7	3
Phoneme	300	3	2
Heart	270	13	2
Hfcr	299	13	4
Segment	2310	19	7

Table 2

First group of clustering validity functions

Function name	Constitutional rules	Function description
S ₁	$\frac{C P_{1}}{C P_{4}}$	$\frac{\sum_{i = 1}^{c} \sum_{j = 1}^{n} {u_{ij}}^{m} ∥ v_{i} - x_{j} ∥^{2}}{min_{1 ⩽ i ⩽ c} \sum_{j = 1}^{n} u_{ij}^{2}}$
S ₂	$\frac{C P_{2}}{C P_{4}}$	$\frac{\frac{1}{n} \sum_{i = 1}^{c} \sum_{j = 1}^{n} \exp (\frac{- ∥ x_{j} - v_{i} ∥^{2}}{ɛ})}{min_{1 ⩽ i ⩽ c} \sum_{j = 1}^{n} u_{ij}^{2}}$
S ₃	$\frac{C P_{3}}{C P_{4}}$	$\frac{min_{i \neq j} (\frac{1}{n} \sum_{j = 1}^{n} (1 - \| u_{ik} - u_{jk} \|))}{min_{1 ⩽ i ⩽ c} \sum_{j = 1}^{n} u_{ij}^{2}}$
S ₄	$\frac{C P_{1}}{C P_{5}}$	$\frac{\sum_{i = 1}^{c} \sum_{j = 1}^{n} {u_{ij}}^{m} ∥ v_{i} - x_{j} ∥^{2}}{min_{i \neq k} ∥ v_{i} - v_{k} ∥^{2}}$
S ₅	$\frac{C P_{2}}{C P_{5}}$	$\frac{\frac{1}{n} \sum_{i = 1}^{c} \sum_{j = 1}^{n} \exp (\frac{- ∥ x_{j} - v_{i} ∥^{2}}{ɛ})}{min_{i \neq k} ∥ v_{i} - v_{k} ∥^{2}}$
S ₆	$\frac{C P_{3}}{C P_{5}}$	$\frac{min_{i \neq j} (\frac{1}{n} \sum_{j = 1}^{n} (1 - \| u_{ik} - u_{jk} \|))}{min_{i \neq k} ∥ v_{i} - v_{k} ∥^{2}}$
S ₇	$\frac{C P_{1}}{C P_{6}}$	$\frac{\sum_{i = 1}^{c} \sum_{j = 1}^{n} {u_{ij}}^{m} ∥ v_{i} - x_{j} ∥^{2}}{\frac{1}{c} \sum_{i = 1}^{c} ∥ v_{i} - \bar{v} ∥^{2}}$

Table 3

Optimal clustering number of first group of functions for UCI data set

Data	Optimal c	S ₁	S ₂	S ₃	S ₄	S ₅	S ₆	S ₇
Iris	3	3	2	2	2	2	2	14
Seeds	3	3	2	2	2	2	2	14
Phoneme	2	2	2	2	4	2	4	14
Heart	2	2	2	2	2	2	2	13
Hfcr	4	3	2	2	4	2	2	14
Segment	7	4	2	2	7	2	2	14

Table 4

Second group of clustering validity functions

Function name	Constitutional rules	Function description
S ₈	$\frac{C P_{2}}{C P_{6}}$	$\frac{\frac{1}{n} \sum_{i = 1}^{c} \sum_{j = 1}^{n} \exp (\frac{- ∥ x_{j} - v_{i} ∥^{2}}{ɛ})}{\frac{1}{c} \sum_{i = 1}^{c} ∥ v_{i} - \bar{v} ∥^{2}}$
S ₉	$\frac{C P_{3}}{C P_{6}}$	$\frac{min_{i \neq j} (\frac{1}{n} \sum_{j = 1}^{n} (1 - \| u_{ik} - u_{jk} \|))}{\frac{1}{c} \sum_{i = 1}^{c} ∥ v_{i} - \bar{v} ∥^{2}}$
S ₁₀	$\frac{C P_{1} + C P_{2}}{C P_{4}}$	$\frac{\sum_{i = 1}^{c} \sum_{j = 1}^{n} {u_{ij}}^{m} ∥ v_{i} - x_{j} ∥^{2} + \frac{1}{n} \sum_{i = 1}^{c} \sum_{j = 1}^{n} \exp (\frac{- ∥ x_{j} - v_{i} ∥^{2}}{ɛ})}{min_{1 ⩽ i ⩽ c} \sum_{j = 1}^{n} u_{ij}^{2}}$
S ₁₁	$\frac{C P_{1} + C P_{3}}{C P_{4}}$	$\frac{\sum_{i = 1}^{c} \sum_{j = 1}^{n} {u_{ij}}^{m} ∥ v_{i} - x_{j} ∥^{2} + min_{i \neq j} (\frac{1}{n} \sum_{j = 1}^{n} (1 - \| u_{ik} - u_{jk} \|))}{min_{1 ⩽ i ⩽ c} \sum_{j = 1}^{n} u_{ij}^{2}}$
S ₁₂	$\frac{C P_{2} + C P_{3}}{C P_{4}}$	$\frac{\frac{1}{n} \sum_{i = 1}^{c} \sum_{j = 1}^{n} \exp (\frac{- ∥ x_{j} - v_{i} ∥^{2}}{ɛ}) + min_{i \neq j} (\frac{1}{n} \sum_{j = 1}^{n} (1 - \| u_{ik} - u_{jk} \|))}{min_{1 ⩽ i ⩽ c} \sum_{j = 1}^{n} u_{ij}^{2}}$
S ₁₃	$\frac{C P_{1} + C P_{2}}{C P_{5}}$	$\frac{\sum_{i = 1}^{c} \sum_{j = 1}^{n} {u_{ij}}^{m} ∥ v_{i} - x_{j} ∥^{2} + \frac{1}{n} \sum_{i = 1}^{c} \sum_{j = 1}^{n} \exp (\frac{- ∥ x_{j} - v_{i} ∥^{2}}{ɛ})}{min_{i \neq k} ∥ v_{i} - v_{k} ∥^{2}}$
S ₁₄	$\frac{C P_{1} + C P_{3}}{C P_{5}}$	$\frac{\sum_{i = 1}^{c} \sum_{j = 1}^{n} {u_{ij}}^{m} ∥ v_{i} - x_{j} ∥^{2} + min_{i \neq j} (\frac{1}{n} \sum_{j = 1}^{n} (1 - \| u_{ik} - u_{jk} \|))}{min_{i \neq k} ∥ v_{i} - v_{k} ∥^{2}}$

Table 5

Optimal clustering number of second group of functions for UCI data set

Data	Optimal c	S ₈	S ₉	S ₁₀	S ₁₁	S ₁₂	S ₁₃	S ₁₄
Iris	3	2	2	3	3	2	2	2
Seeds	3	2	2	3	3	2	2	2
Phoneme	2	4	4	2	2	2	4	4
Heart	2	2	9	2	2	2	2	2
Hfcr	4	2	2	3	3	2	4	4
Segment	7	2	2	4	4	2	7	7

Table 6

Third group of clustering validity functions

Function name	Constitutional rules	Function description
S ₁₅	$\frac{C P_{2} + C P_{3}}{C P_{5}}$	$\frac{\frac{1}{n} \sum_{i = 1}^{c} \sum_{j = 1}^{n} \exp (\frac{- ∥ x_{j} - v_{i} ∥^{2}}{ɛ}) + min_{i \neq j} (\frac{1}{n} \sum_{j = 1}^{n} (1 - \| u_{ik} - u_{jk} \|))}{min_{i \neq k} ∥ v_{i} - v_{k} ∥^{2}}$
S ₁₆	$\frac{C P_{1} + C P_{2}}{C P_{6}}$	$\frac{\sum_{i = 1}^{c} \sum_{j = 1}^{n} {u_{ij}}^{m} ∥ v_{i} - x_{j} ∥^{2} + \frac{1}{n} \sum_{i = 1}^{c} \sum_{j = 1}^{n} \exp (\frac{- ∥ x_{j} - v_{i} ∥^{2}}{ɛ})}{\frac{1}{c} \sum_{i = 1}^{c} ∥ v_{i} - \bar{v} ∥^{2}}$
S₁₇	$\frac{C P_{1} + C P_{3}}{C P_{6}}$	$\frac{\sum_{i = 1}^{c} \sum_{j = 1}^{n} {u_{ij}}^{m} ∥ v_{i} - x_{j} ∥^{2} + min_{i \neq j} (\frac{1}{n} \sum_{j = 1}^{n} (1 - \| u_{ik} - u_{jk} \|))}{\frac{1}{c} \sum_{i = 1}^{c} ∥ v_{i} - \bar{v} ∥^{2}}$
S₁₈	$\frac{C P_{2} + C P_{3}}{C P_{6}}$	$\frac{\frac{1}{n} \sum_{i = 1}^{c} \sum_{j = 1}^{n} \exp (\frac{- ∥ x_{j} - v_{i} ∥^{2}}{ɛ}) + min_{i \neq j} (\frac{1}{n} \sum_{j = 1}^{n} (1 - \| u_{ik} - u_{jk} \|))}{\frac{1}{c} \sum_{i = 1}^{c} ∥ v_{i} - \bar{v} ∥^{2}}$
S ₁₉	$\frac{C P_{1} + C P_{2}}{C P_{4} + C P_{5}}$	$\frac{\sum_{i = 1}^{c} \sum_{j = 1}^{n} {u_{ij}}^{m} ∥ v_{i} - x_{j} ∥^{2} + \frac{1}{n} \sum_{i = 1}^{c} \sum_{j = 1}^{n} \exp (\frac{- ∥ x_{j} - v_{i} ∥^{2}}{ɛ})}{min_{1 ⩽ i ⩽ c} \sum_{j = 1}^{n} u_{ij}^{2} + min_{i \neq k} ∥ v_{i} - v_{k} ∥^{2}}$
S₂₀	$\frac{C P_{1} + C P_{2}}{C P_{4} + C P_{6}}$	$\frac{\sum_{i = 1}^{c} \sum_{j = 1}^{n} {u_{ij}}^{m} ∥ v_{i} - x_{j} ∥^{2} + \frac{1}{n} \sum_{i = 1}^{c} \sum_{j = 1}^{n} \exp (\frac{- ∥ x_{j} - v_{i} ∥^{2}}{ɛ})}{min_{1 ⩽ i ⩽ c} \sum_{j = 1}^{n} u_{ij}^{2} + \frac{1}{c} \sum_{i = 1}^{c} ∥ v_{i} - \bar{v} ∥^{2}}$
S ₂₁	$\frac{C P_{1} + C P_{3}}{C P_{4} + C P_{5}}$	$\frac{\sum_{i = 1}^{c} \sum_{j = 1}^{n} {u_{ij}}^{m} ∥ v_{i} - x_{j} ∥^{2} + min_{i \neq j} (\frac{1}{n} \sum_{j = 1}^{n} (1 - \| u_{ik} - u_{jk} \|))}{min_{1 ⩽ i ⩽ c} \sum_{j = 1}^{n} u_{ij}^{2} + min_{i \neq k} ∥ v_{i} - v_{k} ∥^{2}}$

Table 7

Optimal clustering number of third group of functions for UCI data set

Data	Optimal c	S ₁₅	S ₁₆	S ₁₇	S ₁₈	S ₁₉	S ₂₀	S ₂₁
Iris	3	2	12	14	2	3	3	3
Seeds	3	2	14	14	2	3	3	3
Phoneme	2	2	14	14	4	2	2	2
Heart	2	2	9	13	2	2	2	2
Hfcr	4	2	14	14	2	4	14	4
Segment	7	2	14	14	2	7	14	7

Table 8

Fourth group of clustering validity functions

Function name	Constitutional rules	Function description
S ₂₂	$\frac{C P_{1} + C P_{3}}{C P_{4} + C P_{6}}$	$\frac{\sum_{i = 1}^{c} \sum_{j = 1}^{n} {u_{ij}}^{m} ∥ v_{i} - x_{j} ∥^{2} + min_{i \neq j} (\frac{1}{n} \sum_{j = 1}^{n} (1 - \| u_{ik} - u_{jk} \|))}{min_{1 ⩽ i ⩽ c} \sum_{j = 1}^{n} u_{ij}^{2} + \frac{1}{c} \sum_{i = 1}^{c} ∥ v_{i} - \bar{v} ∥^{2}}$
S ₂₃	$\frac{C P_{2} + C P_{3}}{C P_{4} + C P_{5}}$	$\frac{\frac{1}{n} \sum_{i = 1}^{c} \sum_{j = 1}^{n} \exp (\frac{- ∥ x_{j} - v_{i} ∥^{2}}{ɛ}) + min_{i \neq j} (\frac{1}{n} \sum_{j = 1}^{n} (1 - \| u_{ik} - u_{jk} \|))}{min_{1 ⩽ i ⩽ c} \sum_{j = 1}^{n} u_{ij}^{2} + min_{i \neq k} ∥ v_{i} - v_{k} ∥^{2}}$
S ₂₄	$\frac{C P_{2} + C P_{3}}{C P_{4} + C P_{6}}$	$\frac{\frac{1}{n} \sum_{i = 1}^{c} \sum_{j = 1}^{n} \exp (\frac{- ∥ x_{j} - v_{i} ∥^{2}}{ɛ}) + min_{i \neq j} (\frac{1}{n} \sum_{j = 1}^{n} (1 - \| u_{ik} - u_{jk} \|))}{min_{1 ⩽ i ⩽ c} \sum_{j = 1}^{n} u_{ij}^{2} + \frac{1}{c} \sum_{i = 1}^{c} ∥ v_{i} - \bar{v} ∥^{2}}$
S ₂₅	$\frac{C P_{1} + C P_{2}}{C P_{5} + C P_{6}}$	$\frac{\sum_{i = 1}^{c} \sum_{j = 1}^{n} {u_{ij}}^{m} ∥ v_{i} - x_{j} ∥^{2} + \frac{1}{n} \sum_{i = 1}^{c} \sum_{j = 1}^{n} \exp (\frac{- ∥ x_{j} - v_{i} ∥^{2}}{ɛ})}{min_{i \neq k} ∥ v_{i} - v_{k} ∥^{2} + \frac{1}{c} \sum_{i = 1}^{c} ∥ v_{i} - \bar{v} ∥^{2}}$
S ₂₆	$\frac{C P_{1} + C P_{3}}{C P_{5} + C P_{6}}$	$\frac{\sum_{i = 1}^{c} \sum_{j = 1}^{n} {u_{ij}}^{m} ∥ v_{i} - x_{j} ∥^{2} + min_{i \neq j} (\frac{1}{n} \sum_{j = 1}^{n} (1 - \| u_{ik} - u_{jk} \|))}{min_{i \neq k} ∥ v_{i} - v_{k} ∥^{2} + \frac{1}{c} \sum_{i = 1}^{c} ∥ v_{i} - \bar{v} ∥^{2}}$
S ₂₇	$\frac{C P_{2} + C P_{3}}{C P_{5} + C P_{6}}$	$\frac{\frac{1}{n} \sum_{i = 1}^{c} \sum_{j = 1}^{n} \exp (\frac{- ∥ x_{j} - v_{i} ∥^{2}}{ɛ}) + min_{i \neq j} (\frac{1}{n} \sum_{j = 1}^{n} (1 - \| u_{ik} - u_{jk} \|))}{min_{i \neq k} ∥ v_{i} - v_{k} ∥^{2} + \frac{1}{c} \sum_{i = 1}^{c} ∥ v_{i} - \bar{v} ∥^{2}}$
S ₂₈	$\frac{C P_{1} + C P_{2} + C P_{3}}{C P_{4} + C P_{5}}$	$\frac{\sum_{i = 1}^{c} \sum_{j = 1}^{n} {u_{ij}}^{m} ∥ v_{i} - x_{j} ∥^{2} + \frac{1}{n} \sum_{i = 1}^{c} \sum_{j = 1}^{n} \exp (\frac{- ∥ x_{j} - v_{i} ∥^{2}}{ɛ}) + min_{i \neq j} (\frac{1}{n} \sum_{j = 1}^{n} (1 - \| u_{ik} - u_{jk} \|))}{min_{1 ⩽ i ⩽ c} \sum_{j = 1}^{n} u_{ij}^{2} + min_{i \neq k} ∥ v_{i} - v_{k} ∥^{2}}$

Table 9

Optimal clustering number of fourth group of functions for UCI data set

Data	Optimal c	S ₂₂	S ₂₃	S ₂₄	S ₂₅	S ₂₆	S ₂₇	S ₂₈
Iris	3	3	2	2	2	14	2	3
Seeds	3	3	2	2	14	14	2	3
Phoneme	2	2	2	2	14	14	4	2
Heart	2	2	2	2	9	13	2	2
Hfcr	4	14	2	2	4	4	2	4
Segment	7	14	2	2	7	7	2	7

Table 10

Fifth group of clustering validity functions

Function name	Constitutional rules	Function description
S ₂₉	$\frac{C P_{1} + C P_{2} + C P_{3}}{C P_{4} + C P_{6}}$	$\frac{\sum_{i = 1}^{c} \sum_{j = 1}^{n} {u_{ij}}^{m} ∥ v_{i} - x_{j} ∥^{2} + \frac{1}{n} \sum_{i = 1}^{c} \sum_{j = 1}^{n} \exp (\frac{- ∥ x_{j} - v_{i} ∥^{2}}{ɛ}) + min_{i \neq j} (\frac{1}{n} \sum_{j = 1}^{n} (1 - \| u_{ik} - u_{jk} \|))}{min_{1 ⩽ i ⩽ c} \sum_{j = 1}^{n} u_{ij}^{2} + \frac{1}{c} \sum_{i = 1}^{c} ∥ v_{i} - \bar{v} ∥^{2}}$
S ₃₀	$\frac{C P_{1} + C P_{2} + C P_{3}}{C P_{5} + C P_{6}}$	$\frac{\sum_{i = 1}^{c} \sum_{j = 1}^{n} {u_{ij}}^{m} ∥ v_{i} - x_{j} ∥^{2} + \frac{1}{n} \sum_{i = 1}^{c} \sum_{j = 1}^{n} \exp (\frac{- ∥ x_{j} - v_{i} ∥^{2}}{ɛ}) + min_{i \neq j} (\frac{1}{n} \sum_{j = 1}^{n} (1 - \| u_{ik} - u_{jk} \|))}{min_{i \neq k} ∥ v_{i} - v_{k} ∥^{2} + \frac{1}{c} \sum_{i = 1}^{c} ∥ v_{i} - \bar{v} ∥^{2}}$
S ₃₁	$\frac{C P_{1} + C P_{2} + C P_{3}}{C P_{4} + C P_{5} + C P_{6}}$	$\frac{\sum_{i = 1}^{c} \sum_{j = 1}^{n} {u_{ij}}^{m} ∥ v_{i} - x_{j} ∥^{2} + \frac{1}{n} \sum_{i = 1}^{c} \sum_{j = 1}^{n} \exp (\frac{- ∥ x_{j} - v_{i} ∥^{2}}{ɛ}) + min_{i \neq j} (\frac{1}{n} \sum_{j = 1}^{n} (1 - \| u_{ik} - u_{jk} \|))}{min_{1 ⩽ i ⩽ c} \sum_{j = 1}^{n} u_{ij}^{2} + min_{i \neq k} ∥ v_{i} - v_{k} ∥^{2} + \frac{1}{c} \sum_{i = 1}^{c} ∥ v_{i} - \bar{v} ∥^{2}}$
S ₃₂	$\frac{C P_{1}}{C P_{4} + C P_{5} + C P_{6}}$	$\frac{\sum_{i = 1}^{c} \sum_{j = 1}^{n} {u_{ij}}^{m} ∥ v_{i} - x_{j} ∥^{2}}{min_{1 ⩽ i ⩽ c} \sum_{j = 1}^{n} u_{ij}^{2} + min_{i \neq k} ∥ v_{i} - v_{k} ∥^{2} + \frac{1}{c} \sum_{i = 1}^{c} ∥ v_{i} - \bar{v} ∥^{2}}$
S ₃₃	$\frac{C P_{2}}{C P_{4} + C P_{5} + C P_{6}}$	$\frac{\frac{1}{n} \sum_{i = 1}^{c} \sum_{j = 1}^{n} \exp (\frac{- ∥ x_{j} - v_{i} ∥^{2}}{ɛ})}{min_{1 ⩽ i ⩽ c} \sum_{j = 1}^{n} u_{ij}^{2} + min_{i \neq k} ∥ v_{i} - v_{k} ∥^{2} + \frac{1}{c} \sum_{i = 1}^{c} ∥ v_{i} - \bar{v} ∥^{2}}$
S ₃₄	$\frac{C P_{3}}{C P_{4} + C P_{5} + C P_{6}}$	$\frac{min_{i \neq j} (\frac{1}{n} \sum_{j = 1}^{n} (1 - \| u_{ik} - u_{jk} \|))}{min_{1 ⩽ i ⩽ c} \sum_{j = 1}^{n} u_{ij}^{2} + min_{i \neq k} ∥ v_{i} - v_{k} ∥^{2} + \frac{1}{c} \sum_{i = 1}^{c} ∥ v_{i} - \bar{v} ∥^{2}}$
S ₃₅	$\frac{C P_{1}}{C P_{4} + C P_{5}}$	$\frac{\sum_{i = 1}^{c} \sum_{j = 1}^{n} {u_{ij}}^{m} ∥ v_{i} - x_{j} ∥^{2}}{min_{1 ⩽ i ⩽ c} \sum_{j = 1}^{n} u_{ij}^{2} + min_{i \neq k} ∥ v_{i} - v_{k} ∥^{2}}$

Table 11

Optimal clustering number of fifth group of functions for UCI data set

Data	Optimal c	S ₂₉	S ₃₀	S ₃₁	S ₃₂	S ₃₃	S ₃₄	S ₃₅
Iris	3	3	14	3	3	2	2	3
Seeds	3	3	14	3	3	2	2	3
Phoneme	2	2	14	2	2	2	2	2
Heart	2	2	13	2	2	2	2	2
Hfcr	4	14	4	4	4	2	2	4
Segment	7	14	7	7	7	2	2	7

Table 12

Sixth group of clustering validity functions

Function name	Constitutional rules	Function description
S ₃₆	$\frac{C P_{1}}{C P_{4} + C P_{6}}$	$\frac{\sum_{i = 1}^{c} \sum_{j = 1}^{n} {u_{ij}}^{m} ∥ v_{i} - x_{j} ∥^{2}}{min_{1 ⩽ i ⩽ c} \sum_{j = 1}^{n} u_{ij}^{2} + \frac{1}{c} \sum_{i = 1}^{c} ∥ v_{i} - \bar{v} ∥^{2}}$
S ₃₇	$\frac{C P_{1}}{C P_{5} + C P_{6}}$	$\frac{\sum_{i = 1}^{c} \sum_{j = 1}^{n} {u_{ij}}^{m} ∥ v_{i} - x_{j} ∥^{2}}{min_{i \neq k} ∥ v_{i} - v_{k} ∥^{2} + \frac{1}{c} \sum_{i = 1}^{c} ∥ v_{i} - \bar{v} ∥^{2}}$
S ₃₈	$\frac{C P_{2}}{C P_{4} + C P_{5}}$	$\frac{\frac{1}{n} \sum_{i = 1}^{c} \sum_{j = 1}^{n} \exp (\frac{- ∥ x_{j} - v_{i} ∥^{2}}{ɛ})}{min_{1 ⩽ i ⩽ c} \sum_{j = 1}^{n} u_{ij}^{2} + min_{i \neq k} ∥ v_{i} - v_{k} ∥^{2}}$
S ₃₉	$\frac{C P_{2}}{C P_{4} + C P_{6}}$	$\frac{\frac{1}{n} \sum_{i = 1}^{c} \sum_{j = 1}^{n} \exp (\frac{- ∥ x_{j} - v_{i} ∥^{2}}{ɛ})}{min_{1 ⩽ i ⩽ c} \sum_{j = 1}^{n} u_{ij}^{2} + \frac{1}{c} \sum_{i = 1}^{c} ∥ v_{i} - \bar{v} ∥^{2}}$
S ₄₀	$\frac{C P_{2}}{C P_{5} + C P_{6}}$	$\frac{\frac{1}{n} \sum_{i = 1}^{c} \sum_{j = 1}^{n} \exp (\frac{- ∥ x_{j} - v_{i} ∥^{2}}{ɛ})}{min_{i \neq k} ∥ v_{i} - v_{k} ∥^{2} + \frac{1}{c} \sum_{i = 1}^{c} ∥ v_{i} - \bar{v} ∥^{2}}$
S ₄₁	$\frac{C P_{3}}{C P_{4} + C P_{5}}$	$\frac{min_{i \neq j} (\frac{1}{n} \sum_{j = 1}^{n} (1 - \| u_{ik} - u_{jk} \|))}{min_{1 ⩽ i ⩽ c} \sum_{j = 1}^{n} u_{ij}^{2} + min_{i \neq k} ∥ v_{i} - v_{k} ∥^{2}}$
S ₄₂	$\frac{C P_{3}}{C P_{4} + C P_{6}}$	$\frac{min_{i \neq j} (\frac{1}{n} \sum_{j = 1}^{n} (1 - \| u_{ik} - u_{jk} \|))}{min_{1 ⩽ i ⩽ c} \sum_{j = 1}^{n} u_{ij}^{2} + \frac{1}{c} \sum_{i = 1}^{c} ∥ v_{i} - \bar{v} ∥^{2}}$

Table 13

Optimal clustering number of sixth group of functions for UCI data set

Data	Optimal c	S ₃₆	S ₃₇	S ₃₈	S ₃₉	S ₄₀	S ₄₁	S ₄₂
Iris	3	3	14	2	2	2	2	2
Seeds	3	3	14	2	2	2	2	2
Phoneme	2	2	14	2	2	2	2	2
Heart	2	2	13	2	2	2	2	2
Hfcr	4	14	4	2	2	2	2	2
Segment	7	14	7	2	2	2	2	2

Table 14

Seventh group of clustering validity functions

Function name	Constitutional rules	Function description
S ₄₃	$\frac{C P_{3}}{C P_{5} + C P_{6}}$	$\frac{min_{i \neq j} (\frac{1}{n} \sum_{j = 1}^{n} (1 - \| u_{ik} - u_{jk} \|))}{min_{i \neq k} ∥ v_{i} - v_{k} ∥^{2} + \frac{1}{c} \sum_{i = 1}^{c} ∥ v_{i} - \bar{v} ∥^{2}}$
S ₄₄	$\frac{C P_{1} + C P_{2}}{C P_{4} + C P_{5} + C P_{6}}$	$\frac{\sum_{i = 1}^{c} \sum_{j = 1}^{n} {u_{ij}}^{m} ∥ v_{i} - x_{j} ∥^{2} + \frac{1}{n} \sum_{i = 1}^{c} \sum_{j = 1}^{n} \exp (\frac{- ∥ x_{j} - v_{i} ∥^{2}}{ɛ})}{min_{1 ⩽ i ⩽ c} \sum_{j = 1}^{n} u_{ij}^{2} + min_{i \neq k} ∥ v_{i} - v_{k} ∥^{2} + \frac{1}{c} \sum_{i = 1}^{c} ∥ v_{i} - \bar{v} ∥^{2}}$
S ₄₅	$\frac{C P_{1} + C P_{3}}{C P_{4} + C P_{5} + C P_{6}}$	$\frac{\sum_{i = 1}^{c} \sum_{j = 1}^{n} {u_{ij}}^{m} ∥ v_{i} - x_{j} ∥^{2} + min_{i \neq j} (\frac{1}{n} \sum_{j = 1}^{n} (1 - \| u_{ik} - u_{jk} \|))}{min_{1 ⩽ i ⩽ c} \sum_{j = 1}^{n} u_{ij}^{2} + min_{i \neq k} ∥ v_{i} - v_{k} ∥^{2} + \frac{1}{c} \sum_{i = 1}^{c} ∥ v_{i} - \bar{v} ∥^{2}}$
S ₄₆	$\frac{C P_{2} + C P_{3}}{C P_{4} + C P_{5} + C P_{6}}$	$\frac{\frac{1}{n} \sum_{i = 1}^{c} \sum_{j = 1}^{n} \exp (\frac{- ∥ x_{j} - v_{i} ∥^{2}}{ɛ}) + min_{i \neq j} (\frac{1}{n} \sum_{j = 1}^{n} (1 - \| u_{ik} - u_{jk} \|))}{min_{1 ⩽ i ⩽ c} \sum_{j = 1}^{n} u_{ij}^{2} + min_{i \neq k} ∥ v_{i} - v_{k} ∥^{2} + \frac{1}{c} \sum_{i = 1}^{c} ∥ v_{i} - \bar{v} ∥^{2}}$
S ₄₇	$\frac{C P_{1} + C P_{2} + C P_{3}}{C P_{4}}$	$\frac{\sum_{i = 1}^{c} \sum_{j = 1}^{n} {u_{ij}}^{m} ∥ v_{i} - x_{j} ∥^{2} + \frac{1}{n} \sum_{i = 1}^{c} \sum_{j = 1}^{n} \exp (\frac{- ∥ x_{j} - v_{i} ∥^{2}}{ɛ}) + min_{i \neq j} (\frac{1}{n} \sum_{j = 1}^{n} (1 - \| u_{ik} - u_{jk} \|))}{min_{1 ⩽ i ⩽ c} \sum_{j = 1}^{n} u_{ij}^{2}}$
S ₄₈	$\frac{C P_{1} + C P_{2} + C P_{3}}{C P_{5}}$	$\frac{\sum_{i = 1}^{c} \sum_{j = 1}^{n} {u_{ij}}^{m} ∥ v_{i} - x_{j} ∥^{2} + \frac{1}{n} \sum_{i = 1}^{c} \sum_{j = 1}^{n} \exp (\frac{- ∥ x_{j} - v_{i} ∥^{2}}{ɛ}) + min_{i \neq j} (\frac{1}{n} \sum_{j = 1}^{n} (1 - \| u_{ik} - u_{jk} \|))}{min_{i \neq k} ∥ v_{i} - v_{k} ∥^{2}}$
S ₄₉	$\frac{C P_{1} + C P_{2} + C P_{3}}{C P_{6}}$	$\frac{\sum_{i = 1}^{c} \sum_{j = 1}^{n} {u_{ij}}^{m} ∥ v_{i} - x_{j} ∥^{2} + \frac{1}{n} \sum_{i = 1}^{c} \sum_{j = 1}^{n} \exp (\frac{- ∥ x_{j} - v_{i} ∥^{2}}{ɛ}) + min_{i \neq j} (\frac{1}{n} \sum_{j = 1}^{n} (1 - \| u_{ik} - u_{jk} \|))}{\frac{1}{c} \sum_{i = 1}^{c} ∥ v_{i} - \bar{v} ∥^{2}}$
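The component formulas in the table above translate fairly directly into array operations. The following is a minimal sketch, not the authors' implementation: it assumes NumPy, a membership matrix `U` of shape (c, n), a default fuzzifier `m = 2` and a placeholder value for ɛ, and it reads the inner sum of CP₃ as running over the samples. The function names and the toy data are hypothetical, and at least two clusters are required.

```python
import numpy as np


def components(X, V, U, m=2.0, eps=1.0):
    """Return (CP1, ..., CP6) for data X (n, d), centers V (c, d), memberships U (c, n)."""
    n, c = X.shape[0], V.shape[0]
    # d2[i, j] = squared Euclidean distance between center v_i and sample x_j
    d2 = ((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    pairs = [(i, k) for i in range(c) for k in range(i + 1, c)]

    cp1 = float(((U ** m) * d2).sum())
    cp2 = float(np.exp(-d2 / eps).sum() / n)
    # Assumed reading of CP3: mean over samples of 1 - |u_i - u_k|, minimized over cluster pairs
    cp3 = min(float((1.0 - np.abs(U[i] - U[k])).mean()) for i, k in pairs)
    cp4 = float((U ** 2).sum(axis=1).min())
    cp5 = min(float(((V[i] - V[k]) ** 2).sum()) for i, k in pairs)
    cp6 = float(((V - V.mean(axis=0)) ** 2).sum(axis=1).mean())
    return cp1, cp2, cp3, cp4, cp5, cp6


def s44(cp):
    """(CP1 + CP2) / (CP4 + CP5 + CP6), the constitutional rule of S44."""
    cp1, cp2, _, cp4, cp5, cp6 = cp
    return (cp1 + cp2) / (cp4 + cp5 + cp6)


def s48(cp):
    """(CP1 + CP2 + CP3) / CP5, the constitutional rule of S48."""
    cp1, cp2, cp3, _, cp5, _ = cp
    return (cp1 + cp2 + cp3) / cp5


# Hypothetical toy partition: two tight, well-separated clusters in the plane
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
V = np.array([[0.0, 1.0], [10.0, 1.0]])
U = np.array([[0.9, 0.9, 0.1, 0.1], [0.1, 0.1, 0.9, 0.9]])

cp = components(X, V, U)
print(cp)
print(s44(cp), s48(cp))
```

Any other function in the table can be assembled the same way by summing the listed components in the numerator and denominator.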

Table 15

Optimal clustering number of seventh group of functions for UCI data set

Data	Optimal c	S ₄₃	S ₄₄	S ₄₅	S ₄₆	S ₄₇	S ₄₈	S ₄₉
Iris	3	2	3	3	2	3	2	12
Seeds	3	2	3	3	2	3	2	14
Phoneme	2	4	2	2	2	2	4	14
Heart	2	2	2	2	2	2	2	9
Hfcr	4	2	4	4	2	3	4	14
Segment	7	2	7	7	2	4	7	14

(1) First Group Simulation Experiment and Results Analysis

It can be seen from Fig. 1(a)-(b) that only S₁ obtains the correct clustering number c = 3 for the Iris and Seeds data sets. Fig. 1(c) shows that the clustering validity functions S₁, S₂, S₃ and S₅ accurately divide the Phoneme data set into two categories. It can be seen from Fig. 1(d) that, except for S₇, the other six functions obtain the accurate clustering result c = 2 for the Heart data set. As can be seen from Fig. 1(e)-(f), only S₄ correctly divides the Hfcr and Segment data sets; the remaining six validity functions cannot obtain the correct clustering number for these two data sets. In the comparison experiment of the first group of validity functions, none of the 7 validity functions achieves the best classification effect for all UCI data sets. S₁ correctly distinguishes four UCI data sets, S₄ correctly divides three, and S₂, S₃ and S₅ each obtain the correct clustering number for two. S₆ successfully distinguishes only one data set, and S₇ cannot distinguish any data set, which is attributed to the irrationality of its component collocation.

(2) Second Group Simulation Experiment and Results Analysis

It can be seen from Fig. 2(a)-(b) that only S₁₀ and S₁₁ obtain the correct classification number c = 3 for the Iris and Seeds data sets. As can be seen from Fig. 2(c), S₁₀, S₁₁ and S₁₂ correctly divide the Phoneme samples into two categories. Fig. 2(d) shows that, for the Heart data set, the six clustering validity functions other than S₉ obtain the correct classification result c = 2. It can be seen from Fig. 2(e)-(f) that S₁₃ and S₁₄ obtain accurate clustering numbers for the Hfcr and Segment data sets. The comparative experiment on the second group of validity functions shows that these 7 validity functions (S₈, S₉, S₁₀, S₁₁, S₁₂, S₁₃ and S₁₄) still cannot correctly divide all 6 UCI data sets. Among them, S₁₀ and S₁₁ successfully distinguish four data sets; S₁₃ and S₁₄ obtain the correct clustering results for three data sets; and S₁₂ correctly classifies the Phoneme and Heart data sets. Compared with the other validity functions, the performance of S₈ is weak, as it separates only one data set. The clustering effect of S₉ is the worst, since it cannot divide any of the 6 UCI data sets.

(3) Third Group Simulation Experiment and Results Analysis

As can be seen from Fig. 3(a)-(b), S₁₉, S₂₀ and S₂₁ correctly divide the Iris and Seeds data sets into 3 categories. According to Fig. 3(c), S₁₅, S₁₉, S₂₀ and S₂₁ obtain the optimal clustering number c = 2 for the Phoneme data set. As can be seen from Fig. 3(d), S₁₆ and S₁₇ obtain wrong clustering numbers for the Heart data set, while the remaining five validity functions divide it correctly. Finally, according to Fig. 3(e)-(f), only S₁₉ and S₂₁ obtain the correct clustering numbers for the Hfcr and Segment data sets, that is, c = 4 and c = 7 respectively. Based on the comparative experiment on the third group of validity functions, it can be concluded that S₁₉ and S₂₁ have the best clustering effect and correctly divide all six UCI data sets. S₂₀ obtains the correct clustering number for four data sets, but fails on the Hfcr and Segment data sets. The classification effect of S₁₅ and S₁₈ is poor: they distinguish only two data sets and one data set, respectively. S₁₆ and S₁₇ cannot obtain the optimal cluster number for any of the 6 UCI data sets.

(4) Fourth Group Simulation Experiment and Results Analysis

As shown in Fig. 4(a)-(b), only S₂₂ and S₂₈ obtain the optimal clustering number c = 3 for the Iris and Seeds data sets. As can be seen from Fig. 4(c), S₂₂, S₂₃, S₂₄ and S₂₈ successfully divide the Phoneme data set into two categories. As can be seen from Fig. 4(d), S₂₂, S₂₃, S₂₄, S₂₇ and S₂₈ divide the Heart data set correctly. Finally, it can be concluded from Fig. 4(e)-(f) that only three validity functions (S₂₅, S₂₆ and S₂₈) obtain the optimal clustering number c = 4 for the Hfcr data set and the correct clustering number c = 7 for the Segment data set. According to the comparative experiment on the fourth group of validity functions, S₂₈ has the best clustering effect and correctly divides all six UCI data sets. S₂₂ successfully distinguishes four UCI data sets, but cannot obtain the optimal number of clusters for the Hfcr and Segment data sets. The classification effect of S₂₃, S₂₄, S₂₅ and S₂₆ is mediocre, as each successfully distinguishes only two UCI data sets. The performance of S₂₇ is poor, since it obtains the optimal clustering result only for the Heart data set.

(5) Fifth Group Simulation Experiment and Results Analysis

As can be seen from Fig. 5(a)-(b), four validity functions (S₂₉, S₃₁, S₃₂ and S₃₅) obtain the optimal clustering number c = 3 for the Iris and Seeds data sets. According to Fig. 5(c)-(d), for the Phoneme and Heart data sets only S₃₀ obtains the wrong number of clusters, while the other six validity functions divide these two UCI data sets correctly. According to Fig. 5(e)-(f), S₃₀, S₃₁, S₃₂ and S₃₅ obtain the correct clustering numbers c = 4 and c = 7 for the Hfcr and Segment data sets, respectively. The comparative experiment on the fifth group of validity functions shows that three validity functions (S₃₁, S₃₂ and S₃₅) obtain correct clustering results for all six UCI data sets. S₂₉ successfully distinguishes four data sets, but fails to correctly divide the Hfcr and Segment data sets. S₃₀, S₃₃ and S₃₄ perform poorly among the 7 validity functions, each obtaining the optimal cluster number for only two data sets.

(6) Sixth Group Simulation Experiment and Results Analysis

As can be seen from Fig. 6(a)-(b), only S₃₆ correctly divides the Iris and Seeds data sets and obtains their optimal cluster number c = 3. As can be seen from Fig. 6(c)-(d), S₃₆, S₃₈, S₃₉, S₄₀, S₄₁ and S₄₂ all successfully divide the Phoneme and Heart data sets into two categories, and only S₃₇ obtains wrong clustering results. However, S₃₇ classifies the Hfcr and Segment data sets very well, obtaining their optimal clustering numbers c = 4 and c = 7, while the other six validity functions cannot correctly divide these two data sets, as shown in Fig. 6(e)-(f). The comparative experiment on the sixth group of validity functions shows that none of these validity functions correctly clusters all six UCI data sets. S₃₆ has the best effect and successfully distinguishes four data sets, whereas S₃₇, S₃₈, S₃₉, S₄₀, S₄₁ and S₄₂ each obtain the optimal cluster number for only two data sets.

(7) Seventh Group Simulation Experiment and Results Analysis

As can be seen from Fig. 7(a)-(b), S₄₄, S₄₅ and S₄₇ obtain the correct clustering number c = 3 for the Iris and Seeds data sets. Based on the simulation curves in Fig. 7(c), S₄₄, S₄₅, S₄₆ and S₄₇ successfully divide the samples of the Phoneme data set into two categories. As can be seen from Fig. 7(d), except for S₄₉, the other six validity functions divide the Heart data set correctly. Finally, it can be concluded from Fig. 7(e)-(f) that S₄₄, S₄₅ and S₄₈ obtain the optimal clustering numbers for the Hfcr and Segment data sets, namely c = 4 and c = 7. The comparison experiment on the seventh group of validity functions shows that S₄₄ and S₄₅ have the best performance and obtain the correct clustering results for all six UCI data sets. S₄₇ correctly divides four UCI data sets; the clustering effect of S₄₈ is slightly worse than that of S₄₇, as it successfully distinguishes three UCI data sets. S₄₆ obtains the best clustering results for two data sets, and S₄₃ correctly divides only the Heart data set. In this group, only S₄₉ obtains the wrong result for all data sets, so its clustering effect is the worst.

Through the comparative experiments on the above 7 groups of validity functions, 8 clustering validity functions that successfully divide all 6 UCI data sets are obtained, namely S₁₉, S₂₁, S₂₈, S₃₁, S₃₂, S₃₅, S₄₄ and S₄₅. Among them, S₃₁ is composed of 6 components, S₂₈, S₄₄ and S₄₅ are each composed of 5 components, S₁₉, S₂₁ and S₃₂ are each composed of 4 components, and S₃₅ is composed of only 3 components. By observing the structure of these eight validity functions, it can be found that they all contain the three components CP₁, CP₄ and CP₅, which indicates that these three components have a greater impact on the clustering performance and mathematical dimension of the clustering validity function than CP₂, CP₃ and CP₆. S₃₅ is the simplest form, composed only of CP₁, CP₄ and CP₅. Based on the prior knowledge of time and space complexity [30], it can be concluded that S₃₅ is the optimal function among the 49 clustering validity functions.
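The counting behind this selection can be checked with a short enumeration. The sketch below assumes, following the pattern of the constitutional rules in the tables, that every function places a non-empty subset of {CP₁, CP₂, CP₃} in the numerator and a non-empty subset of {CP₄, CP₅, CP₆} in the denominator; the variable names are illustrative only.

```python
from itertools import combinations


def nonempty_subsets(items):
    """All non-empty subsets of items, as frozensets."""
    return [frozenset(s) for r in range(1, len(items) + 1) for s in combinations(items, r)]


numerators = nonempty_subsets(["CP1", "CP2", "CP3"])
denominators = nonempty_subsets(["CP4", "CP5", "CP6"])
ratios = [(num, den) for num in numerators for den in denominators]

# Ratio forms that contain all three of CP1, CP4 and CP5
core = {"CP1", "CP4", "CP5"}
with_core = [(num, den) for num, den in ratios if core <= (num | den)]

# Number of components used by each such form, and the smallest form
sizes = sorted(len(num) + len(den) for num, den in with_core)
simplest = min(with_core, key=lambda r: len(r[0]) + len(r[1]))

print(len(ratios))     # 49
print(len(with_core))  # 8
print(sizes)           # [3, 4, 4, 4, 5, 5, 5, 6]
print(sorted(simplest[0]), sorted(simplest[1]))  # ['CP1'] ['CP4', 'CP5']
```

Under this assumption there are exactly 8 ratio forms containing CP₁, CP₄ and CP₅, and the only three-component one is CP₁ / (CP₄ + CP₅), which is consistent with the description of S₃₅ as the simplest form.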

4 Simulation comparison with typical clustering validity functions

Table 17 lists the number of data samples, attributes and categories of the UCI data sets selected for this experiment. The Haberman and German data sets are added in the comparison to make the experiment more persuasive.

Table 16
Typical clustering validity functions

Validity Index	Function Description	Optimal c
Modification of partition coefficient	$V_{MPC} = 1 - \frac{c}{c - 1} (1 - \frac{1}{n} \sum_{i = 1}^{c} \sum_{j = 1}^{n} {u_{ij}}^{2})$	Max
Xie and Beni	$V_{XB} = \frac{\frac{1}{n} \sum_{i = 1}^{c} \sum_{j = 1}^{n} {u_{ij}}^{m} ∥ v_{i} - x_{j} ∥^{2}}{min_{i \neq j} ∥ v_{i} - v_{j} ∥^{2}}$	Min
P-index	$V_{P} = \frac{1}{n} \sum_{j = 1}^{n} max_{i} (u_{ij}) - \frac{1}{k} \sum_{i = 1}^{c - 1} \sum_{j = i + 1}^{c} [\frac{1}{n} \sum_{k = 1}^{n} \min (u_{ik}, u_{jk})]$	Max
Partition coefficient and exponential separation	$V_{PCAES} = \sum_{i = 1}^{c} \sum_{j = 1}^{n} \frac{{u_{ij}}^{2}}{u_{mj}} - \sum_{i = 1}^{c} \exp (\frac{- min_{k \neq i} ∥ v_{i} - v_{k} ∥^{2}}{β_{T}})$	Max
Chih-Hung Wu	$V_{WL} = \frac{\sum_{i = 1}^{c} (\frac{\sum_{j = 1}^{n} {u_{ij}}^{2} ∥ x_{j} - v_{i} ∥^{2}}{\sum_{j = 1}^{n} u_{ij}})}{min_{i \neq k} ∥ v_{i} - v_{k} ∥^{2} + median ∥ v_{i} - v_{k} ∥^{2}}$	Min
FM-index	$V_{FM} = \frac{\sum_{i = 1}^{c} \sum_{j = 1}^{n} {(u_{ij} - \frac{1}{c})}^{2} ∥ v_{i} - x_{j} ∥^{2}}{n \, min_{i \neq j} ∥ v_{i} - v_{j} ∥^{2}} \times (- \frac{1}{n} \sum_{i = 1}^{c} \sum_{j = 1}^{n} [u_{ij} {log}_{a} (u_{ij})])$	Min
Zhu	$V_{ZLF} = \frac{comp}{sep} = \frac{\sum_{j = 1}^{n} \frac{1 - min_{i} u_{ij}}{\sum_{i = 1}^{c} ∥ x_{j} - v_{i} ∥}}{\sum_{k = 1}^{c} \sum_{i = 1; i \neq k}^{c} ∥ v_{i} - \bar{v} ∥ / \frac{c (c - 1)}{2}}$	Min
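To make the first two rows of the table concrete, the following is a minimal sketch of V_MPC and V_XB, assuming NumPy, a membership matrix `U` of shape (c, n) and squared Euclidean distances in the separation term. The function names and the toy partition are hypothetical and are not taken from the experiments.

```python
import numpy as np


def v_mpc(U):
    """Modified partition coefficient; larger values indicate a crisper partition."""
    c, n = U.shape
    partition_coefficient = (U ** 2).sum() / n
    return 1.0 - c / (c - 1) * (1.0 - partition_coefficient)


def v_xb(X, V, U, m=2.0):
    """Xie-Beni index: mean fuzzy compactness over the smallest squared center distance."""
    n, c = X.shape[0], V.shape[0]
    # d2[i, j] = squared Euclidean distance between center v_i and sample x_j
    d2 = ((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    compactness = ((U ** m) * d2).sum() / n
    separation = min(
        ((V[i] - V[k]) ** 2).sum() for i in range(c) for k in range(i + 1, c)
    )
    return float(compactness / separation)


# Hypothetical toy partition with two well-separated clusters
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
V = np.array([[0.0, 1.0], [10.0, 1.0]])
U = np.array([[0.9, 0.9, 0.1, 0.1], [0.1, 0.1, 0.9, 0.9]])

print(v_mpc(U))
print(v_xb(X, V, U))
```

For a fully crisp partition V_MPC reaches 1, which is why its optimum is taken at the maximum, whereas V_XB shrinks as clusters become more compact and better separated, so its optimum is taken at the minimum.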

In order to better reflect the classification effect of S₃₅, this experiment selects 7 typical clustering validity functions (V_MPC, V_XB, V_P, V_PCAES, V_WL, V_FM and V_ZLF) for a comparative test with S₃₅ on the UCI data sets. The function descriptions of these 7 typical clustering validity functions, and whether the optimal clustering number corresponds to the maximum or the minimum of each function, are listed in Table 16, where $\bar{v} = (\sum_{i = 1}^{c} v_{i}) / c$ represents the mean value of the clustering centers, the geometric meaning of the two variables $u_{mj} = min_{1 ⩽ i ⩽ c} \sum_{j = 1}^{n} {(u_{ij})}^{2}$ and $β_{T} = (\sum_{i = 1}^{c} {∥ v_{i} - \bar{v} ∥}^{2}) / c$ is consistent with the similarity and separation degree mentioned in Section 3.1, and median ∥ v_i - v_k ∥ ² represents the median distance between two clustering centers. The 8 fuzzy clustering validity functions (V_MPC, V_XB, V_P, V_PCAES, V_WL, V_FM, V_ZLF and S₃₅) are normalized and placed in the same coordinate system. The simulation results are shown in Fig. 8(a)-(h). Finally, the optimal clustering number obtained by each clustering validity function for the different UCI data sets is listed in Table 18.
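The normalization step is not specified further in the text. One plausible reading is min-max scaling of each index over the tested range of c, followed by reading off the maximizing or minimizing c. The sketch below illustrates that reading only; the helper names and the curve values are hypothetical.

```python
def min_max(values):
    """Scale a list of index values to [0, 1]; a flat curve maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def best_c(curve, mode):
    """Return the cluster number whose index value is largest ('max') or smallest ('min')."""
    pick = max if mode == "max" else min
    return pick(curve, key=curve.get)


# Hypothetical index values for c = 2..5 (not taken from the experiments)
curve = {2: 0.40, 3: 0.15, 4: 0.22, 5: 0.31}

normalized = dict(zip(curve, min_max(list(curve.values()))))
print(normalized)
print(best_c(curve, "min"), best_c(normalized, "min"))
```

Min-max scaling is monotone, so the selected cluster number is the same before and after normalization; the scaling only makes curves of different magnitude comparable in one plot.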

Fig. 8

Variation trend of normalized clustering validity functions.

Table 17

UCI dataset

Data sets	Data Numbers	Attributes	Classes
Iris	150	4	3
Seeds	210	7	3
Phoneme	300	3	2
Heart	270	13	2
Hfcr	299	13	4
Segment	2310	19	7
Haberman	306	3	2
German	1000	24	2

Table 18

Optimal clustering numbers of different validity functions for UCI data sets

Data	Optimal c	V _MPC	V _XB	V _P	V _PCAES	V _FM	V _WL	V _ZLF	S ₃₅
Iris	3	2	2	2	9	2	2	2	3
Seeds	3	2	2	3	14	2	2	2	3
Phoneme	2	4	4	4	10	4	2	4	2
Heart	2	2	2	2	14	2	2	2	2
Hfcr	4	10	8	10	14	10	2	4	4
Segment	7	3	7	3	14	3	2	2	7
Haberman	2	2	13	4	14	2	9	13	2
German	2	2	11	2	14	2	4	2	2
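The entries of Table 18 can be tallied mechanically. The short sketch below copies the table column by column and counts, for each validity function, the data sets on which the selected cluster number equals the optimal c; the dictionary and variable names are illustrative.

```python
DATASETS = ["Iris", "Seeds", "Phoneme", "Heart", "Hfcr", "Segment", "Haberman", "German"]
OPTIMAL_C = [3, 3, 2, 2, 4, 7, 2, 2]

# Columns of Table 18, in the order of DATASETS
SELECTED_C = {
    "V_MPC": [2, 2, 4, 2, 10, 3, 2, 2],
    "V_XB": [2, 2, 4, 2, 8, 7, 13, 11],
    "V_P": [2, 3, 4, 2, 10, 3, 4, 2],
    "V_PCAES": [9, 14, 10, 14, 14, 14, 14, 14],
    "V_FM": [2, 2, 4, 2, 10, 3, 2, 2],
    "V_WL": [2, 2, 2, 2, 2, 2, 9, 4],
    "V_ZLF": [2, 2, 4, 2, 4, 2, 13, 2],
    "S35": [3, 3, 2, 2, 4, 7, 2, 2],
}

# For each index, the data sets on which the selected c matches the optimal c
correct = {
    name: [d for d, got, want in zip(DATASETS, column, OPTIMAL_C) if got == want]
    for name, column in SELECTED_C.items()
}
hits = {name: len(found) for name, found in correct.items()}

for name in sorted(hits, key=hits.get, reverse=True):
    print(f"{name:8s} {hits[name]}/{len(DATASETS)}  {', '.join(correct[name])}")
```

In this tally S₃₅ matches the optimal c on all 8 data sets, while each of the typical validity functions matches on at most 3.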

It can be seen from Fig. 8(a)-(b) that S₃₅ successfully divides the Iris and Seeds data sets into 3 categories and that V_P correctly divides the Seeds data set, while the other validity functions all obtain wrong cluster numbers. This indicates that, when dealing with data sets with higher dimensions such as Iris and Seeds, S₃₅ has better classification ability than the typical clustering validity functions. Fig. 8(c) shows that S₃₅ obtains the optimal number of clusters for the Phoneme data set, that is, c = 2; according to Table 18, V_WL is the only typical validity function that also selects c = 2 here. It can be seen from Fig. 8(d) that, except for V_PCAES, which cannot correctly divide the Heart data set, the other seven validity functions obtain accurate classification results. Fig. 8(e) shows that S₃₅ and V_ZLF accurately divide the samples of the Hfcr data set into 4 categories. As can be seen from Fig. 8(f), S₃₅ and V_XB obtain the optimal classification result for the Segment data set, that is, c = 7. Across Fig. 8(e)-(f), only S₃₅ correctly classifies both the Hfcr and Segment data sets. This means that, when the number of samples and the complexity of the data set are relatively high, the clustering effect of S₃₅ is clearly better than that of the other validity functions. According to Fig. 8(g), V_MPC, V_FM and S₃₅ successfully obtain the optimal clustering number c = 2 for the Haberman data set. Finally, Fig. 8(h) and Table 18 show that V_MPC, V_P, V_FM, V_ZLF and S₃₅ correctly distinguish the German data set, while the other validity functions obtain wrong clustering results. As can be seen from the experimental results in Fig. 8(a)-(h), only S₃₅ finds the optimal number of clusters for all UCI data sets, including data sets with overlapping samples, noisy data and high dimension.

5 Conclusions

In this paper, a component-wise design method for the FCM clustering validity function is proposed based on six components that describe different characteristics of the data set. The six components are permuted and combined in ratio form, the resulting functions are tested on UCI data sets, and the validity function with the simplest structure and the best effect is selected. Finally, the selected validity function is compared with 7 typical clustering validity functions on UCI data sets. The experimental results show that the proposed validity function obtains the optimal clustering number for all 8 UCI data sets, and that its classification of data sets with high-dimensional, noisy and overlapping features is better than that of the typical validity functions. The results also indicate that the component-wise design method proposed in this paper is objective and avoids the contingency of proposing a validity function by subjective trial and error. This design method is also applicable to other components and has guiding significance for constructing new clustering validity functions. However, the component-wise design method only considers the ratio form of component combination and does not verify other mathematical forms. For this reason, future work will study the composition of components based on subtraction as well as exponential and logarithmic forms, in order to make the system more complete.

Acknowledgment

This work was supported by the Basic Scientific Research Project of Institution of Higher Learning of Liaoning Province (Grant No. LJKZ0293), and the Project by Liaoning Provincial Natural Science Foundation of China (Grant No. 20180550700).

References

1. Liu, Jiaying, et al., Data mining and information retrieval in the 21st century: A bibliographic review, Computer Science Review 34 (2019), 100193.

2. Fern, Xiaoli, and Lin, Wei, Cluster ensemble selection, Statistical Analysis and Data Mining: The ASA Data Science Journal 1.3 (2008), 128–141.

3. Bai, Liang, Liang, Jiye, and Cao, Fuyuan, A multiple k-means clustering ensemble algorithm to find nonlinearly separable clusters, Information Fusion 61 (2020), 36–47.

4. Frossyniotis, Likas, Aristidis, and Stafylopatis, Andreas, A clustering method based on boosting, Pattern Recognition Letters 25.6 (2004), 641–654.

5. WenChao, Li, Yong, Zhou, and ShiXiong, Xia, A novel clustering algorithm based on hierarchical and K-means clustering. In: 2007 Chinese Control Conference (2007), 605–609.

6. Kriegel, Hans-Peter, et al., Density-based clustering, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 1.3 (2011), 231–240.

7. Gurrutxga, Ibai, et al., SEP/COP: An efficient method to find the best partition in hierarchical clustering based on a new cluster validity index, Pattern Recognition 43(10) (2010), 3364–3373.

8. Malsiner-Walli, Gertraud, Fruhwirth-Schnater, Sylvia, and Grun, Bettina, Model-based clustering based on sparse finite Gaussian mixtures, Statistics and Computing 26(1-2) (2016), 303–324.

9. Hartigan, J.A., and Wong, M.A., A k-means clustering algorithm, Journal of the Royal Statistical Society, Series C (Applied Statistics) 28(1) (1979), 100–108.

10. Lei, Bezdek, J.C., Chan, Vinh, N.X., Romano, and Bailey, Extending information-theoretic validity indices for fuzzy clustering, IEEE Transactions on Fuzzy Systems 25(4) (2017), 1013–1018.

11. Huang, Hong, et al., Brain image segmentation based on FCM clustering algorithm and rough set, IEEE Access 7 (2019), 12386–12396.

12. Zadeh, Lotfi, Fuzzy sets. In: Fuzzy sets, fuzzy logic, and fuzzy systems: selected papers by Lotfi A Zadeh (1996), 394–432.

13. Bezdek, James, Ehrlich, Robert, and Full, William, FCM: The fuzzy c-means clustering algorithm, Computers & Geosciences 10(2–3) (1984), 191–203.

14. Nayak, Janmenjoy, Naik, Bighnaraj, and Behera, HSr, Fuzzy C-means (FCM) clustering algorithm: a decade review from 2000 to 2014, Computational Intelligence in Data Mining-Volume 2 (2015), 133–149.

15. Bezdek, J.C., and Pal, N.R., Some new indexes of cluster validity, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 28(3) (1998), 301–315.

16. Simovici, D.A., and Jaroszewicz, An axiomatization of partition entropy, IEEE Transactions on Information Theory 48(7) (2002), 2138–2142.

17. Silva, Moura, Canuto, A.M.P., Santiago, R.H.N., and Bedregal, An interval-based framework for fuzzy clustering applications, IEEE Transactions on Fuzzy Systems 23(6) (2015), 2174–2187.

18. Chen, Min-You, and Linkens, Derek, Rule-base self-generation and simplification for data-driven fuzzy models, Fuzzy Sets and Systems 142(2) (2004), 243–265.

19. Krista Rizman Žalik, Cluster validity index for estimation of fuzzy clusters of different sizes and densities, Pattern Recognition 43(10) (2010), 3374–3390.

20. Liu, Yongli, et al., A validity index for fuzzy clustering based on bipartite modularity, Journal of Electrical and Computer Engineering (2019), 2019.

21. Xie, Xuanli Lisa, and Beni, Gerardo, A validity measure for fuzzy clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence 17(6) (1991), 841–847.

22. Kuo-Lung Wu and Miin-Shen Yang, A cluster validity index for fuzzy clustering, Pattern Recognition Letters 26(9) (2005), 1275–1291.

23. Lingkui Meng and Chunchun Hu, Cluster validity index based on measure of fuzzy partition [J], Computer Engineering 33(11) (2007), 15–17.

24. , Ouyang, Chen and Lu, A new fuzzy clustering validity index with a median factor for centroid-based clustering, IEEE Transactions on Fuzzy Systems 23(3) (2004), 701–718.

25. Zhu, L.F., Wang, J.S., and Wang, H.Y., A novel clustering validity function of FCM clustering algorithm, IEEE Access 7 (2019), 152289–152315.

26. Yun Liu, Yanfang Jiang, Tao Hou and Fu Liu, A new robust fuzzy clustering validity index for imbalanced data sets, Information Sciences 547 (2021), 579–591.

27. Wang, H.Y., Wang, J.S., and Zhu, L.F., A new validity function of FCM clustering algorithm based on the intra-class compactness and inter-class separation, Journal of Intelligent & Fuzzy Systems 40(6) (2021), 12411–12432.

28. Babak Rezaee, A cluster validity index for fuzzy clustering, Fuzzy Sets and Systems 161(23) (2010), 3014–3025.

29. Askari, Noise-resistant fuzzy clustering algorithm, Granul. Comput. 6 (2021), 815–828.

30. Salar Askari, Fuzzy c-means clustering algorithm for data with unequal cluster sizes and contaminated with noise and outliers: Review and development, Expert Systems with Applications 165 (2021), 0957–4174.