A clustering algorithm for fuzzy numbers based on fast search and find of density peaks

Abstract

This paper improves the clustering by fast search and find of density peaks (CFSFDP) algorithm and extends it to fuzzy numbers (the FN-CFSFDP algorithm). With FN-CFSFDP, the classical information contained in the samples is extended to fuzzy sets, and fuzzy samples can be clustered by searching for density peaks. First, by means of error analysis, an improved Euclidean distance between fuzzy numbers is defined, and the key parameters and operating quantities, mainly the cut-off distance and the Gaussian kernel function of fuzzy samples, are introduced in detail. Next, 76 random simulations in total are performed on four sets of samples under different conditions, with different $t$ -values, sample sizes, index numbers, cluster numbers and fetching rules, and the Kappa coefficients of these simulations are calculated. Finally, the advantages and disadvantages of the proposed FN-CFSFDP algorithm are summarized and some recommendations for improvement are put forward, which can provide guidance for further investigations of fuzzy clustering algorithms on fuzzy sets.

Keywords

Fuzzy clustering on fuzzy sets, FN-CFSFDP algorithm, fuzzy number, improved Euclidean distance, Kappa coefficient

1. Introduction

Clustering is an important human behavior. Its key task is to divide a data set into several clusters according to a certain standard, such as distance or density, so as to minimize the difference among the data objects in the same cluster while maximizing the difference among the data objects in different clusters. Clustering has been extensively investigated and successfully applied in many domains such as pattern recognition, data analysis, image processing, market research, customer classification and the classification of Web documents [1]. As shown in Table 1, clustering algorithms can be roughly classified into six types. Each type includes many different clustering algorithms, and only some representative examples are listed in Table 1 for reference.

Table 1
Classification of some common clustering algorithms

Partitional clustering	Hierarchical clustering	Density-based clustering	Graph-theory clustering	Grid-based clustering	Model-based clustering
K-means [2]	CURE [7]	DBSCAN [13]	DTG [17]	STING [21]	COBWEB [24]
K-modes	ROCK [8]	RNN-DBSCAN	CAMAS [18]	WaveCluster [22]	AutoClass [25]
K-prototypes	CHAMELEON [9]	OPTICS [14]	CLICKS [19]	CLIQUE [23]	CLASSIT
K-medoids	SBAC [10]	FDC [15]	CAST [20]		SOM [26]
CLARA [3]	BIRCH [11]	GDBSCAN			
CLARANS [4]	BUBBLE [12]	CFSFDP [16]			
PCM [5]	BUBBLE Rap	DDBSCAN			
FCM [6]					

For lack of space, only a small portion of clustering algorithms is listed in Table 1. Readers interested in clustering algorithms can read the papers written by Xu and Wunsch [27]; readers who are familiar with Chinese can also read the monograph written by Zhang [28], which describes in detail many of the clustering algorithms proposed in recent years. This article proposes a clustering algorithm for samples with fuzzy numbers, namely a fuzzy numbers-based clustering algorithm by fast search and find of density peaks (hereinafter referred to as the FN-CFSFDP algorithm), which is the focus of the later sections.

Including the Introduction, this paper consists of six sections. Section 2 introduces the works related to the FN-CFSFDP algorithm proposed in this article. Section 3 introduces the related knowledge of fuzzy numbers as preparation for further analysis. In Section 4, the improved Euclidean distance between fuzzy samples is first introduced via error analysis; then some key parameters and operating quantities, including the threshold of the distance between fuzzy samples and the Gaussian kernel function of the fuzzy samples, are defined, and the detailed clustering procedure on fuzzy sets using the FN-CFSFDP algorithm is determined based on the CFSFDP algorithm; after that, the maximum time complexity of FN-CFSFDP is analyzed and the FN-CFSFDP and CFSFDP algorithms are compared in several aspects. In Section 5, 4 sets comprising 76 random simulations in total are carried out on fuzzy sets with different sample numbers and indexes. The 6 random simulations in the first 3 sets yield clustering accuracies of 74.00%, 44.00%, 71.67%, 71.67%, 71.00% and 71.00%, respectively; in addition, Kappa coefficients under different parameters are calculated for evaluating and analyzing the clustering results and the invalidation in automatic identification of cluster numbers. In the fourth set of random simulations, different mean values and variances of clustering accuracies and Kappa coefficients are obtained. 
A series of conclusions is drawn: (1) The maximum mean value of clustering accuracy is equal to 65.07% when the $t$ -parameter is equal to 1.00%; (2) The minimum mean value of clustering accuracy is equal to 57.30% when the $t$ -parameter is equal to 3.00%; (3) The minimum variance of clustering accuracy is equal to 155.800 when the $t$ -parameter is equal to 1.00%; (4) The maximum variance of clustering accuracy is equal to 239.623 when the $t$ -parameter is equal to 1.50%; (5) The maximum mean value of Kappa coefficient is equal to 0.4760 when the $t$ -parameter is equal to 1.00%; (6) The minimum mean value of Kappa coefficient is equal to 0.3595 when the $t$ -parameter is equal to 3.00%; (7) The minimum variance of Kappa coefficient is equal to 0.0350 when the $t$ -parameter is equal to 1.00%; (8) The maximum variance of Kappa coefficient is equal to 0.0567 when the $t$ -parameter is equal to 1.50%; (9) The maximum mean-clustering accuracy of datasets is equal to 75.67 in the 10th group; (10) The minimum mean-clustering accuracy of datasets is equal to 40.19 in the 7th group; (11) The maximum mean-kappa coefficient of datasets is equal to 0.6350 in the 10th group; (12) The minimum mean-kappa coefficient of datasets is equal to 0.1029 in the 7th group; (13) The algorithm has a certain degree of parameter adaptability and clustering stability; (14) Overall, the clustering accuracy and kappa coefficient of the FN-CFSFDP algorithm are lower than those of the CFSFDP algorithm in an experimental comparison with 70 runs of classical CFSFDP clustering. After that, an original quantitative evaluation method for parameter adaptability and clustering stability is proposed, and on this basis the average level of parameter adaptability and clustering stability of the FN-CFSFDP algorithm is comprehensively evaluated in the fourth simulation. 
Finally, in the last section, both the advantages and the shortcomings of the proposed FN-CFSFDP algorithm are summarized in detail, which can provide research ideas and directions for further investigation.

2. Related works

According to the membership between sample and cluster, clustering algorithms can be roughly divided into two types: classical clustering and fuzzy clustering. Classical clustering is a kind of data-based hard classification, in which samples and clusters satisfy a two-valued logic relationship. In classical clustering, each object to be identified is strictly assigned to a cluster, i.e., a sample either belongs to a cluster or does not. In other words, it is an either/or classification. However, many objects in real life do not possess such strict properties. For example, is 3 a.m. in the morning or at night? Is a 50-year-old male old or young? Are 3 hamburgers at breakfast too many or too few for Jack? If a student has the left foot outside the classroom and the right foot inside, is he/she inside or outside the classroom? None of these questions has an explicit boundary. According to Zadeh's theory [29], the above questions involve concepts with fuzzy extensions [30]. Fuzzy clustering should be conducted on such questions or objects. Unlike classical clustering, fuzzy clustering is a kind of soft classification of data. In brief, fuzzy clustering can be applied to samples that may partially belong to a certain cluster. For example, a fuzzy clustering algorithm is appropriate for a sample set in which a sample has a 70% probability of belonging to Cluster A and simultaneously a 30% probability of belonging to Cluster B. FCM and PCM, as listed in Table 1, are two typical fuzzy clustering algorithms; in particular, FCM is the most common one, and it originated from the work of Ruspini in 1969 [31]. 
Afterwards, Dunn and Bezdek extended Ruspini's idea in 1974 and 1981, proposed the basic idea of determining a fuzzy classification by minimizing an appropriately defined function, and successfully derived an iterative algorithm for calculating the membership function of the examined cluster [32, 33]. In FCM, the sum of the membership grades of each sample to all clusters equals 1. Under this constraint, the clustering results of FCM are not completely consistent with intuitive membership grades. Moreover, FCM is quite sensitive to noise and abnormal samples. To overcome these shortcomings, Krishnapuram and Keller proposed a possibilistic fuzzy clustering algorithm (PCM) [34]. Because PCM drops the above sum-to-1 constraint, the membership grades of noise and abnormal samples are quite small, which enhances the noise resistance of the cluster centers. Accordingly, PCM is more robust than FCM. However, PCM is sensitive to initial conditions and cannot adequately handle problems with serious noise. In order to reduce the sensitivity to noise, Pal et al. developed the PFCM algorithm on the basis of FCM and PCM [35], Yang and Wu proposed the PCA algorithm [36], and Zhang and Leung put forward the IPCM algorithm [37]. Other scholars have also proposed many improved algorithms, which are not introduced in detail in this paper.

For any clustering algorithm, two factors, sample type and membership, should be taken into account. Although the FCM, PCM, PCA and IPCM algorithms are fuzzy clustering algorithms (i.e., the membership between sample and cluster is fuzzy), the sample information they handle is still classical. According to both sample type and membership, clustering algorithms can be divided into four types, as shown in Table 2.

Table 2
Type of clustering algorithms

	Sample type
Membership	Classical	Fuzzy
Classical	TYPE 1	TYPE 3
Fuzzy	TYPE 2	TYPE 4

As listed in Table 2, the clustering algorithms in TYPE 1 are applicable to classical samples, in which both sample information and membership are established on classical sets; the clustering algorithms in TYPE 2 are applicable to classical samples, in which sample information is established on classical sets while the membership is established on fuzzy sets; the clustering algorithms in TYPE 3 are applicable to fuzzy samples, in which sample information is established on fuzzy sets while the membership is established on classical sets; the clustering algorithms in TYPE 4 are applicable to fuzzy samples with fuzzy membership to clusters. Accordingly, the K-means, BIRCH and DBSCAN algorithms can be regarded as TYPE 1 algorithms, while FCM and PCM fall into TYPE 2; by contrast, the clustering algorithms in TYPE 3 and TYPE 4 have rarely been investigated, in spite of some studies on the clustering of uncertain data. For example, Kriegel and Pfeifle proposed the FDBSCAN algorithm in combination with probability density functions [38]. On the basis of the DBSCAN clustering algorithm, Tepwankul and Maneewongwattana defined measurement distance and deviation using probability, sample expectation and density, and thus put forward the U-DBSCAN algorithm [39]. Erdem and Gundem divided a 2D data set into sub-data sets and acquired the final clusters by combining these sub-data sets [40]. These algorithms mainly depend on probability or dimension reduction. Although the word 'Fuzzy' appears in the related literature, these algorithms were developed on classical sets rather than fuzzy sets and thus can only be regarded as clustering on uncertain data. In the narrow sense, fuzzy clustering refers to clustering algorithms based on fuzzy mathematics, which generally use the related knowledge of fuzzy sets rather than depending purely on the knowledge of probability or interval numbers on classical sets. 
Therefore, almost no studies have been performed on fuzzy clustering algorithms on fuzzy sets in the narrow sense. In order to illustrate the characteristics of the 4 types of clustering in Table 2, an example diagram of the relationship between clusters and samples is drawn in Fig. 1. In this diagram, solid and dashed circular points represent classical samples and fuzzy samples, respectively, and are called 'Classical Points' and 'Fuzzy Points'.

Figure 1.

Four different types of clustering in different logics. a. Classical point and classical cluster in two-valued logic. b. Fuzzy point and fuzzy cluster in two-valued logic. c. Classical point and classical cluster in uncertain random logic. d. Fuzzy point and fuzzy cluster in uncertain random logic.

In practical applications, on account of the asymmetry of information and the limitations of objective measurement, people sometimes cannot acquire very accurate data. For example, a tractor manufacturing enterprise needs to finance approximately 5 million dollars; Marry is approximately 170 cm tall; Vivian spends approximately 13.5 s in the 100-meter dash. All these data are fuzzy. In order to solve this kind of problem, the concept of the fuzzy number was proposed in fuzzy mathematics. Based on the CFSFDP algorithm proposed by Rodriguez and Laio in [16], a novel clustering algorithm for fuzzy samples containing multiple fuzzy numbers under multiple indexes is investigated in this study; moreover, the effectiveness of the proposed algorithm is validated by random simulation results. One may ask why the CFSFDP algorithm, rather than another algorithm, is chosen as the basic algorithm. Take the DBSCAN algorithm as an example. The DBSCAN and CFSFDP algorithms both belong to the density-based clustering algorithms. DBSCAN places no requirement on the shape of data sets and easily discovers clusters with arbitrary shapes; it can also effectively identify noise points and outliers. These are advantages of the DBSCAN algorithm. On the other hand, the DBSCAN algorithm also has the following disadvantages [41, 42]: (1) poor adaptability to parameters and poor robustness; (2) insufficient ability to deal with high-dimensional data; (3) insufficiently accurate clustering results for data sets with uneven density distribution. In contrast, the CFSFDP algorithm does not require complex parameter settings and places no requirement on the shape of clusters. It can handle not only high-dimensional data more effectively than the DBSCAN algorithm, but also sparse data and low-dimensional data [43]. It can also automatically identify the number of clusters and clearly find the cluster centers [44]. 
In a word, the CFSFDP algorithm is more universal, has lower computational complexity and is more stable than the DBSCAN algorithm. In addition, the CFSFDP algorithm, one of the more recently proposed clustering algorithms, came out in 2014 and has great extensibility. For all the reasons above, the CFSFDP algorithm is chosen as the basic algorithm in this paper. According to the classification in Table 2, the proposed algorithm can be classified into TYPE 3. Having introduced the related works, we next introduce the FN-CFSFDP algorithm in detail.

3. Related concepts of fuzzy number

This study focuses on fuzzy numbers for clustering. The definition of the fuzzy number is derived on the basis of the fuzzy set; therefore, some concepts of fuzzy sets are introduced first.

Definition 1. (Classical set) For a domain of discourse $X$ , let $A$ be a subset of $X$ . If, $\forall x\in X$ , exactly one of $x\in A$ and $x\notin A$ is satisfied, $A$ can be regarded as a classical set. For the domain of discourse $X$ , the subset $A$ is uniquely determined by the following characteristic function $\chi_{A}$ :

$\displaystyle\chi_{A}(x)=\left\{{{\begin{array}[]{ll}1,&x\in A,\\ 0,&x\notin A,\\ \end{array}}}\right.$ (1)

where $\chi_{A}:X\to\{0,1\}$ . The characteristic function will then be extended to the range of fuzzy set ([0, 1]).

Definition 2. (Fuzzy set) Assuming $\mu_{A}$ is a mapping from the domain of discourse $X$ to a closed interval [0, 1], if $\mu_{A}:X\to[0,1]$ , $x\mapsto\mu_{A}(x)$ can be satisfied, it can be regarded that $\mu_{A}$ determines a fuzzy subset $A$ in the domain of discourse $X$ ; further, $A$ is a fuzzy set, $\mu_{A}(x)$ represents the membership function of $A$ or the membership grade of $x$ to the fuzzy set $A$ .

Definition 3. (Continuous fuzzy set) If the fuzzy subset $A$ is an infinite set in the domain of discourse $X$ , $A$ in above Definition 2 is a continuous fuzzy set. According to Zadeh expression method, $A$ can be described as:

$\displaystyle A=\int_{X}\frac{\mu_{A}(x)}{x}.$ (2)

It should be noted that $\int$ is only a kind of expression pattern rather than integral sign in ordinary sense.

Definition 4. (Convex fuzzy set) Assuming the domain of discourse $X$ is in Euclidean space and $A$ is a fuzzy subset in $X$ , the necessary and sufficient conditions for a convex fuzzy set $A$ in $X$ can be written as:

$\displaystyle\mu_{A}[k\cdot x_{2}+(1-k)\cdot x_{1}]\geqslant\mu_{A}(x_{1})\wedge\mu_{A}(x_{2}),$ (3)

where $x_{1},x_{2}\in X$ , $\mu_{A}(x)$ denotes the membership function of $A$ or the membership grade of $x$ to the fuzzy set $A$ , and $k\in[0,1]$ .

Definition 5. (Regular fuzzy set) Let $A$ be a fuzzy subset in $X$ . $A$ is a regular fuzzy set if and only if $\exists x_{0}\in X$ such that $\mu_{A}(x_{0})=1$ .

Definition 6. (Fuzzy number) If the domain of discourse $X$ is the real number field $\mathbb{R}$ and the fuzzy set $A$ in $\mathbb{R}$ is a regular convex fuzzy set (i.e., $A$ simultaneously satisfies both Definitions 4 and 5), $A$ can be regarded as a fuzzy number. If $A$ is also an infinite set, $A$ can be regarded as a continuous fuzzy number. A discrete fuzzy number is a special form of a continuous fuzzy number, i.e., if $A$ is an infinite set and satisfies certain conditions, $A$ can also be regarded as a discrete fuzzy number [45].

As stated above, a discrete fuzzy number is a special form of a continuous fuzzy number. Unless otherwise specified, the involved formulas in clustering are applicable to continuous fuzzy numbers, which will be abbreviated as fuzzy numbers hereinafter (i.e., the mentioned fuzzy numbers are all continuous fuzzy numbers in this paper).

Common fuzzy numbers include triangular fuzzy numbers, trapezoidal fuzzy numbers, normal fuzzy numbers and Cauchy fuzzy numbers. Besides, interval numbers are a special kind of fuzzy number, while fuzzy numbers are the generalization of interval numbers [30]. Subsequently, the clustering procedure is described in detail on the basis of CFSFDP, and many important operating quantities are also defined.
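As an illustration of Definitions 4 to 6, the following minimal sketch builds the membership function of a triangular fuzzy number. The helper name `triangular` and the spread of 5 cm used to model "approximately 170 cm" are assumptions made for this example and do not come from the paper.

```python
def triangular(a, m, b):
    """Membership function of a triangular fuzzy number with support (a, b) and peak m."""
    if not a < m < b:
        raise ValueError("expected a < m < b")

    def mu(x):
        if x <= a or x >= b:
            return 0.0
        if x <= m:
            return (x - a) / (m - a)
        return (b - x) / (b - m)

    return mu


# "Approximately 170 cm", modelled with a hypothetical spread of 5 cm on each side.
height = triangular(165.0, 170.0, 175.0)
print(height(170.0))  # full membership at the peak, so the set is regular
print(height(167.5))  # partial membership between the support edge and the peak
print(height(180.0))  # zero membership outside the support
```

The membership grade reaches 1 at the peak (Definition 5) and never dips between two points of higher membership (Definition 4), so the set qualifies as a fuzzy number in the sense of Definition 6.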

4. Clustering steps, maximum time complexity and comparison of FN-CFSFDP

In this section, the detailed procedure of the proposed FN-CFSFDP algorithm and the definitions of many important operating quantities are described. As stated above, the FN-CFSFDP algorithm is proposed within the framework of the CFSFDP algorithm, and its procedure is almost identical to that of CFSFDP. However, the samples for clustering are fuzzy numbers established on fuzzy sets, and many operating quantities, mainly the calculation parameters, the kernel function, the distance between fuzzy samples and the threshold of distance, differ from those of CFSFDP. All these operating quantities should be redefined.

Step 1. Initialization and pre-treatment.

4.1 Set the parameter $t$ for the determination of the threshold $d_{c}$ (i.e., the cut-off distance)

According to the general method proposed by Rodriguez and Laio in [16], $d_{c}$ should be set so that the mean number of adjacent samples around each sample accounts for approximately 1.00% $\sim$ 4.00% of the total number of samples. Here, the adjacent samples of a sample are the other samples whose distance from it is less than $d_{c}$ . Accordingly, $t\in[0.01,0.04]$ , and the value of $t$ should be determined in accordance with the specific circumstances.

4.2 Calculate the distance $d_{rt}$ between two fuzzy samples ( $\tilde{s}_{r}$ and $\tilde{s}_{t}$ ) and set $d_{rt}=d_{tr}$ ( $r<t$ )

If the information of a sample under multiple indexes is described by fuzzy numbers, the sample can be regarded as a fuzzy sample. For $M$ fuzzy samples under $N$ indexes, each fuzzy sample in the fuzzy sample set $\tilde{S}=\{\tilde{s}_{i}\}_{i=1}^{M}$ can be treated as a multi-fuzzy-vector whose elements are fuzzy numbers; accordingly, the $i$ -th fuzzy sample can be expressed as $\tilde{s}_{i}=[FN_{i}^{(j)}]_{j=1}^{N}$ . According to the definition described in Ref. [46], the Euclidean distance between the fuzzy numbers of two fuzzy samples ( $\tilde{s}_{r}$ and $\tilde{s}_{t}$ ) under the $j$ -th index, denoted as $A_{r}^{(j)}$ and $A_{t}^{(j)}$ , can be calculated as:

$\displaystyle d_{rt}^{(j)}=\left[\int_{a^{(j)}}^{b^{(j)}}\left|A_{r}^{(j)}(x)-A_{t}^{(j)}(x)\right|^{2}dx\right]^{1/2},$ (4)

where the domain of discourse $X=\mathbb{R}$ , and $[a^{(j)},b^{(j)}]$ denotes the common integral field of the membership functions of the fuzzy numbers $A_{r}^{(j)}$ and $A_{t}^{(j)}$ of the fuzzy samples $\tilde{s}_{r}$ and $\tilde{s}_{t}$ under the $j$ -th index ( $[a^{(j)},b^{(j)}]=[a_{r}^{(j)},b_{r}^{(j)}]\cup[a_{t}^{(j)},b_{t}^{(j)}]$ ). $[a_{r}^{(j)},b_{r}^{(j)}]\subset\mathbb{R}$ represents the integral field on which the membership function of the fuzzy number $A_{r}^{(j)}$ , denoted as $A_{r}^{(j)}(x)$ , can be expressed, while $[a_{t}^{(j)},b_{t}^{(j)}]\subset\mathbb{R}$ represents the integral field on which the membership function of the fuzzy number $A_{t}^{(j)}$ , denoted as $A_{t}^{(j)}(x)$ , can be expressed. In other words, the membership functions of the two fuzzy numbers $A_{r}^{(j)}$ and $A_{t}^{(j)}$ in $\mathbb{R}$ can only be expressed in $[a^{(j)},b^{(j)}]$ .
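Eq. (4) can be approximated numerically. The sketch below is only an illustration under stated assumptions: the two fuzzy numbers are taken to be triangular, the integral is approximated with the midpoint rule, and the names `triangular` and `fuzzy_distance` are hypothetical helpers rather than part of the paper.

```python
import math


def triangular(a, m, b):
    """Membership function of a triangular fuzzy number (assumed shape for this example)."""
    def mu(x):
        if x <= a or x >= b:
            return 0.0
        return (x - a) / (m - a) if x <= m else (b - x) / (b - m)
    return mu


def fuzzy_distance(mu_r, mu_t, lo, hi, steps=20000):
    """Approximate Eq. (4) on the common integral field [lo, hi] with the midpoint rule."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        total += (mu_r(x) - mu_t(x)) ** 2
    return math.sqrt(total * h)


A_r = triangular(0.0, 1.0, 2.0)
A_t = triangular(1.0, 2.0, 3.0)
# The common integral field is the union [0, 2] U [1, 3] = [0, 3].
d = fuzzy_distance(A_r, A_t, 0.0, 3.0)
print(round(d, 4))
```

For these two fuzzy numbers the integral can also be evaluated by hand: each of the three unit sub-intervals contributes 1/3, so the distance is 1, which the numerical value reproduces to within the integration error.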

Intuitively, to calculate the distance $d_{rt}$ between two fuzzy samples $\tilde{s}_{r}$ and $\tilde{s}_{t}$ , the distance between them should be calculated separately under each of the $N$ indexes; then, by aggregating with a certain operator, i.e., by summing Eq. (4) with the operator $\sum$ , the following expression can be derived:

$\displaystyle d_{rt}=\sum\limits_{j=1}^{N}d_{rt}^{(j)}=\sum\limits_{j=1}^{N}\left[\int_{a^{(j)}}^{b^{(j)}}\left|A_{r}^{(j)}(x)-A_{t}^{(j)}(x)\right|^{2}dx\right]^{1/2}.$ (5)

Although it is easy to understand, this distance can increase the error. Here, a new distance, named the improved Euclidean distance, is defined, which leads to a smaller error; this conclusion is proved in detail later. The definition of the improved Euclidean distance is as follows. For the domain of discourse $X=\mathbb{R}$ , $[a^{(j)},b^{(j)}]$ is a bounded interval. For $N$ indexes, an interval $[{a,b}]=[\wedge_{j=1}^{N}a^{(j)},\vee_{j=1}^{N}b^{(j)}]$ can be determined, where $\vee$ and $\wedge$ represent the maximizing and minimizing Zadeh operators, i.e., $\wedge_{j=1}^{N}a^{(j)}=\min\{a^{(j)}\}_{j=1}^{N}$ and $\vee_{j=1}^{N}b^{(j)}=\max\{b^{(j)}\}_{j=1}^{N}$ , so that the range $[{a,b}]$ represents the maximum common integral field. In addition, if the common integral field under $P$ of the indexes ( $1\leqslant P\leqslant N)$ is a bilateral or unilateral unbounded interval, the unbounded interval can be regarded as the unlimited extension of a bounded interval. Taking a bilateral unbounded interval as an example, when $a^{(j)}\to-\infty$ and $b^{(j)}\to+\infty$ , $[a^{(j)},b^{(j)}]\to({-\infty,+\infty})$ . To cover all cases, with $[a^{(j)},b^{(j)}]$ denoting the common integral field of two fuzzy samples under the $j$ -th index, the improved Euclidean distance between two fuzzy samples ( $\tilde{s}_{r}$ and $\tilde{s}_{t})$ , denoted as $d_{rt}$ , can be calculated as:

$\displaystyle d_{rt}=\left[\sum_{j=1}^{N}\int_{a}^{b}\left|A_{r}^{(j)}(x)-A_{t}^{(j)}(x)\right|^{2}dx\right]^{1/2}.$ (6)

Equation (6) takes full consideration of the distance between two fuzzy samples ( $\tilde{s}_{r}$ and $\tilde{s}_{t}$ ) under $N$ indexes, i.e., the distance between finite countable fuzzy numbers.
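The difference between Eqs (5) and (6) can be seen on a small example. The following sketch assumes two fuzzy samples with two triangular indexes each; the helper names (`tri`, `sq_integral`, `summed_distance`, `improved_distance`) and the midpoint-rule integration are assumptions made for illustration only.

```python
import math


def tri(a, m, b):
    """Triangular membership function (assumed shape for this example)."""
    def mu(x):
        if x <= a or x >= b:
            return 0.0
        return (x - a) / (m - a) if x <= m else (b - x) / (b - m)
    return mu


def sq_integral(f, g, lo, hi, steps=6000):
    """Midpoint-rule approximation of the integral of |f - g|^2 over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum((f(lo + (k + 0.5) * h) - g(lo + (k + 0.5) * h)) ** 2
                   for k in range(steps))


def summed_distance(s_r, s_t, bounds):
    """Eq. (5): sum of the per-index distances, each on its own field [a_j, b_j]."""
    return sum(math.sqrt(sq_integral(f, g, lo, hi))
               for f, g, (lo, hi) in zip(s_r, s_t, bounds))


def improved_distance(s_r, s_t, bounds):
    """Eq. (6): one square root over all indexes, integrated on [a, b] = [min a_j, max b_j]."""
    a = min(lo for lo, _ in bounds)
    b = max(hi for _, hi in bounds)
    return math.sqrt(sum(sq_integral(f, g, a, b) for f, g in zip(s_r, s_t)))


s_r = [tri(0.0, 1.0, 2.0), tri(10.0, 11.0, 12.0)]
s_t = [tri(1.0, 2.0, 3.0), tri(11.0, 12.0, 13.0)]
bounds = [(0.0, 3.0), (10.0, 13.0)]
d_sum = summed_distance(s_r, s_t, bounds)
d_ie = improved_distance(s_r, s_t, bounds)
print(round(d_sum, 3), round(d_ie, 3))
```

In this example each index contributes a squared distance of 1, so Eq. (5) gives approximately 2 while Eq. (6) gives approximately the square root of 2; widening the integration range to $[a,b]$ changes nothing because the membership functions vanish outside their own fields.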

Next, error analysis is conducted on Eqs (5) and (6). The total error equals the sum of the system error and the random error. With regard to Eq. (5), assuming that $e_{\Sigma}({r,t})$ , $e_{\Sigma}^{(S)}({r,t})$ and $e_{\Sigma}^{(R)}({r,t})$ denote the total error, system error and random error, respectively, the following expression can be acquired:

$\displaystyle e_{\Sigma}({r,t})=e_{\Sigma}^{(S)}({r,t})+e_{\Sigma}^{(R)}({r,t}).$ (7)

For Eq. (6), assuming that $e_{IE}({r,t})$ , $e_{IE}^{(S)}({r,t})$ and $e_{IE}^{(R)}({r,t})$ denote total error, system error and random error, respectively, the following expression can be acquired:

$\displaystyle e_{IE}({r,t})=e_{IE}^{(S)}({r,t})+e_{IE}^{(R)}({r,t}).$ (8)

It should be noted that all errors are nonnegative. To compare $e_{\Sigma}({r,t})$ and $e_{IE}({r,t})$ , since the random error cannot be artificially controlled and it is assumed that $e_{\Sigma}^{(R)}({r,t})=e_{IE}^{(R)}({r,t})$ , only $e_{\Sigma}^{(S)}({r,t})$ and $e_{IE}^{(S)}({r,t})$ need to be compared. Let $e_{j}^{(S)}({r,t})$ denote the error of $\int_{a^{(j)}}^{b^{(j)}}|A_{r}^{(j)}(x)-A_{t}^{(j)}(x)|^{2}dx$ , where $[a^{(j)},b^{(j)}]$ is the common integral field under the $j$ -th index and $[a^{(j)},b^{(j)}]\subseteq[{a,b}]$ for every $j$ . In accordance with the properties of integration, the definite integral over the regions outside the common integral field equals 0, i.e., for a single index, the error only exists within the common integral field of the membership functions, while the definite integral values outside the common integral field are all 0. Accordingly, the following expression can be derived:

$\displaystyle e\left\{\int_{a}^{b}\left|A_{r}^{(j)}(x)-A_{t}^{(j)}(x)\right|^{2}dx\right\}=e\left\{\int_{a^{(j)}}^{b^{(j)}}\left|A_{r}^{(j)}(x)-A_{t}^{(j)}(x)\right|^{2}dx\right\}=e_{j}^{(S)}({r,t}).$ (9)

The system error of Eq. (5) can be written as:

$\displaystyle e_{\Sigma}^{(S)}({r,t})=\sum_{j=1}^{N}[e_{j}^{(S)}({r,t})]^{1/2}.$ (10)

The system error of Eq. (6) can be written as:

$\displaystyle e_{IE}^{(S)}({r,t})=\left[\sum_{j=1}^{N}e_{j}^{(S)}({r,t})\right]^{1/2}.$ (11)

Since all errors are nonnegative, squaring Eq. (10) gives $[e_{\Sigma}^{(S)}({r,t})]^{2}=\sum_{j=1}^{N}e_{j}^{(S)}({r,t})+2\sum_{i<j}[e_{i}^{(S)}({r,t})\,e_{j}^{(S)}({r,t})]^{1/2}\geqslant\sum_{j=1}^{N}e_{j}^{(S)}({r,t})=[e_{IE}^{(S)}({r,t})]^{2}$ , and therefore $e_{\Sigma}^{(S)}({r,t})\geqslant e_{IE}^{(S)}({r,t})$ , which suggests that the system error of the improved Euclidean distance defined in Eq. (6) is no larger than that of the distance defined in Eq. (5). Because the random errors are assumed to be equal, it follows that $e_{\Sigma}({r,t})\geqslant e_{IE}({r,t})$ . Therefore, the distance between two fuzzy samples is calculated by the improved Euclidean distance in the FN-CFSFDP algorithm. After the definition of distance, the threshold value $d_{c}$ (i.e., the cut-off distance) should be determined.
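The comparison of Eqs (10) and (11) can be checked on concrete numbers. The per-index errors below are hypothetical values chosen only for illustration, and the function names are assumptions.

```python
import math


def system_error_sum(errors):
    """Eq. (10): sum of the square roots of the per-index errors."""
    return sum(math.sqrt(e) for e in errors)


def system_error_improved(errors):
    """Eq. (11): square root of the sum of the per-index errors."""
    return math.sqrt(sum(errors))


errors = [4.0, 9.0]  # hypothetical nonnegative per-index errors e_j
e_sigma = system_error_sum(errors)
e_ie = system_error_improved(errors)
print(e_sigma, round(e_ie, 3))
```

With these values Eq. (10) gives 2 + 3 = 5, while Eq. (11) gives the square root of 13, roughly 3.606, in line with the inequality stated above. The two agree only when at most one per-index error is nonzero.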

4.3 Determine the threshold value $d_{c}$

According to Eq. (6), the improved Euclidean distances between any two of the $M$ fuzzy samples under $N$ indexes are calculated. On account of $r<t$ and $d_{rt}=d_{tr}$ , there exist $H=M(M-1)/2$ non-repetitive distances in total, which are arranged in ascending order as $d_{1}<d_{2}<\cdots<d_{H}$ . The following expressions are set:

$\displaystyle d_{c}=d_{f({Ht})},$ (12)

$\displaystyle Ht=Mt({M-1})/2,$ (13)

$\displaystyle f({Ht})=\text{INT}({Ht}),$ (14)

where $t\in[{0.01,0.04}]$ and $f({Ht})=\text{INT}({Ht})$ denotes the integer of $H t$ after rounding off.
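Eqs (12) to (14) can be sketched as follows. The sketch makes two assumptions that the text does not settle: INT is implemented as rounding half up, and the resulting 1-based index is clamped to the range 1 to $H$ so that very small data sets still yield a valid distance. The function name `cutoff_distance` and the toy data are hypothetical.

```python
def cutoff_distance(dist, t):
    """Pick d_c = d_{f(Ht)} from the H = M(M-1)/2 non-repetitive distances."""
    m = len(dist)
    # Collect the upper triangle (r < s) and sort in ascending order: d_1 <= ... <= d_H.
    pairs = sorted(dist[r][s] for r in range(m) for s in range(r + 1, m))
    h = len(pairs)
    k = int(h * t + 0.5)      # f(Ht): rounding half up (assumption)
    k = min(max(k, 1), h)     # keep the 1-based index inside 1..H (assumption)
    return pairs[k - 1]


# Toy data: 11 one-dimensional points, with plain absolute differences standing in
# for the improved Euclidean distances of Eq. (6).
pos = [i * i for i in range(11)]
dist = [[abs(p - q) for q in pos] for p in pos]
print(cutoff_distance(dist, 0.04))
print(cutoff_distance(dist, 0.01))
```

Here $H=55$ , so $t=0.04$ gives $Ht=2.2$ and selects the second smallest distance, while $t=0.01$ gives $Ht=0.55$ and selects the smallest one.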

4.4 Calculate the density of $M$ fuzzy samples $\{\rho_{r}\}_{r=1}^{M}$ and generate the subscript sequence in descending order $\{q_{r}\}_{r=1}^{M}$

Next, using Gaussian kernel function, the density of fuzzy samples $\rho_{r}$ can be calculated as:

$\displaystyle\rho_{r}=\sum_{t=1}^{H}\text{exp}\left[-\left(\frac{d_{rt}}{d_{c}}\right)^{2}\right],$ (15)

where $H=M(M-1)/2$ . After the arrangement of $\{q_{r}\}_{r=1}^{M}$ in descending order, $\rho_{q_{1}}\geqslant\rho_{q_{2}}\geqslant\cdots\geqslant\rho_{q_{M}}$ .
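The density computation and the descending subscript sequence can be sketched as below. One assumption is made loudly here: the sum in the sketch runs over the other $M-1$ samples of each row of the distance matrix, which is one reading of the index range in Eq. (15). The function names and the toy distances are hypothetical.

```python
import math


def densities(dist, d_c):
    """Gaussian-kernel density of each sample, summing over the other samples (assumption)."""
    m = len(dist)
    return [sum(math.exp(-(dist[r][t] / d_c) ** 2) for t in range(m) if t != r)
            for r in range(m)]


def descending_order(rho):
    """Subscript sequence q such that rho[q[0]] >= rho[q[1]] >= ..."""
    return sorted(range(len(rho)), key=lambda r: rho[r], reverse=True)


# Toy data: four one-dimensional points with absolute differences as distances.
pos = [0.0, 1.0, 3.0, 10.0]
dist = [[abs(p - q) for q in pos] for p in pos]
rho = densities(dist, d_c=1.0)
q = descending_order(rho)
print([round(v, 4) for v in rho], q)
```

The sample at position 1 has two fairly close neighbours and therefore the highest density, while the isolated sample at position 10 has a density close to zero and comes last in the sequence.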

4.5 Calculate the special distance $\{\delta_{r}\}_{r=1}^{M}$ between any two among $M$ fuzzy samples and find the corresponding fuzzy sample in accordance with the serial number $\{n_{r}\}_{r=1}^{M}$

The distance set of $M$ fuzzy samples $\{\delta_{r}\}_{r=1}^{M}$ is acquired by arranging the distances $\delta_{q_{r}}$ , according to the descending order of the subscript $\{q_{r}\}_{r=1}^{M}$ . $\delta_{q_{r}}$ can be defined as:

$\displaystyle\delta_{q_{r}}=\left\{{{\begin{array}[]{ll}\min\limits_{q_{t}:t<r}\{d_{q_{r}q_{t}}\},&{r\geqslant 2,}\\ \max\limits_{q_{t}:t\geqslant 2}\{\delta_{q_{t}}\},&{r=1.}\\ \end{array}}}\right.$ (16)

When $r\geqslant 2$ , $\min_{q_{t}:t<r}\{d_{q_{r}q_{t}}\}$ denotes the minimum distance between the sample $\tilde{s}_{q_{r}}$ and the samples with greater densities in descending order of density ( $\{\tilde{s}_{q_{t}}\}_{t=1}^{r-1}$ ). It should be noted that $d_{q_{r}q_{t}}$ denotes the improved Euclidean distance. When $r=1$ , once $\{\delta_{q_{t}}\}_{t=2}^{M}$ have all been calculated, $\delta_{q_{1}}$ is set to the maximum of $\{\delta_{q_{t}}\}_{t=2}^{M}$ ( $r,t\in\{1,2,\ldots,M\}$ ). After the values for all $r\in\{{1,2,\ldots,M}\}$ have been calculated, the set is re-indexed as $\{\delta_{r}\}_{r=1}^{M}$ for the convenience of further calculation, since it then does not need to be sorted; $\{\delta_{r}\}_{r=1}^{M}$ and $\{\delta_{q_{r}}\}_{r=1}^{M}$ have identical elements. The corresponding fuzzy samples can then be found through $\{n_{r}\}_{r=1}^{M}$ , where $n_{r}$ is the serial number of the fuzzy sample that, among all samples in the fuzzy sample set $\tilde{S}=\{\tilde{s}_{i}\}_{i=1}^{M}$ with greater densities than $\tilde{s}_{r}$ , is closest to $\tilde{s}_{r}$ .
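Eq. (16) and the serial numbers $n_{r}$ can be sketched together. The function name `delta_and_nearest`, the use of -1 as a placeholder for the densest sample (which has no denser neighbour) and the toy densities are assumptions made for this example.

```python
def delta_and_nearest(dist, rho):
    """Return (delta, nearest) indexed by sample, following Eq. (16)."""
    m = len(dist)
    q = sorted(range(m), key=lambda r: rho[r], reverse=True)  # descending density
    delta = [0.0] * m
    nearest = [-1] * m  # n_r; -1 marks the densest sample (placeholder, assumption)
    for pos in range(1, m):
        r = q[pos]
        # Closest sample among those with greater density, i.e. q[0..pos-1].
        best = min(q[:pos], key=lambda s: dist[r][s])
        delta[r] = dist[r][best]
        nearest[r] = best
    if m > 1:
        # The densest sample receives the maximum of all other delta values.
        delta[q[0]] = max(delta[q[pos]] for pos in range(1, m))
    return delta, nearest


pos_1d = [0, 1, 3, 10]
dist = [[abs(p - s) for s in pos_1d] for p in pos_1d]
rho = [0.368, 0.386, 0.018, 0.0]  # hypothetical densities for the four samples
delta, nearest = delta_and_nearest(dist, rho)
print(delta, nearest)
```

In this toy example the second sample is the densest, so its $\delta$ equals the largest of the other values, while the isolated fourth sample has a large $\delta$ because its nearest denser sample is far away.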

Step 2. Determination of the serial number of the cluster center $\{m_{g}\}_{g=1}^{n_{c}}$ , where $\tilde{s}_{m_{g}}$ denotes the fuzzy sample as the center of the $m_{g}$ -th cluster.

Next, the comprehensive value $\gamma_{r}$ of $\rho_{r}$ and $\delta_{r}$ , according to which the cluster centers are determined, can be calculated as:

$\displaystyle\gamma_{r}=\rho_{r}\cdot\delta_{r}.$ (17)

This study also assumes that $\{{c_{r}}\}_{r=1}^{M}$ denotes the cluster marks of the fuzzy samples, where $c_{r}$ indicates that the $r$ -th fuzzy sample in $\tilde{S}=\{\tilde{s}_{i}\}_{i=1}^{M}$ belongs to the $c_{r}$ -th cluster. The following definition is made:

$\displaystyle c_{r}=\left\{{\begin{array}[]{ll}{k},&\tilde{s}_{r}\in\{\tilde{s}_{m_{g}}\}_{g=1}^{n_{c}}\&\{{c_{r}=k}\},\\ {-1},&{\text{Otherwise},}\\ \end{array}}\right.$ (18)

where $\tilde{s}_{r}\in\{\tilde{s}_{m_{g}}\}_{g=1}^{n_{c}}$ represents that the fuzzy sample $\tilde{s}_{r}$ is a cluster center, and $c_{r}=k$ represents that the sample $\tilde{s}_{r}$ belongs to the $k$ -th cluster. If the number of clusters is unknown in advance, it can be automatically identified based on $\gamma_{r}$ ; if the number of clusters is required in advance, it can be set in accordance with the actual conditions, i.e., $n_{c}$ can be set as a positive integer.
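Step 2 can be sketched for the case in which the number of clusters $n_{c}$ is given in advance. Choosing the $n_{c}$ samples with the largest $\gamma_{r}$ is an assumption about how Eq. (17) is used for selection; the function name `pick_centers` and the toy values are hypothetical.

```python
def pick_centers(rho, delta, n_c):
    """Compute gamma_r = rho_r * delta_r and mark the n_c largest as cluster centers."""
    gamma = [p * d for p, d in zip(rho, delta)]
    centers = sorted(range(len(gamma)), key=lambda r: gamma[r], reverse=True)[:n_c]
    labels = [-1] * len(gamma)          # Eq. (18): -1 for samples that are not centers
    for k, r in enumerate(centers, start=1):
        labels[r] = k                   # the k-th center belongs to the k-th cluster
    return gamma, centers, labels


rho = [3.0, 5.0, 1.0, 4.0, 2.0]    # hypothetical densities
delta = [1.0, 6.0, 1.5, 5.0, 0.5]  # hypothetical special distances
gamma, centers, labels = pick_centers(rho, delta, n_c=2)
print(gamma, centers, labels)
```

Only samples that combine a high density with a large distance to any denser sample obtain a large $\gamma_{r}$ , which is why the second and fourth samples are selected here.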

Step 3. Classification on the fuzzy samples that are not the cluster centers.

The fuzzy samples that are not cluster centers are traversed in descending order of density, and each sample $\tilde{s}_{q_{r}}$ is assigned to the cluster of the nearest sample with greater density. If several samples with greater densities are at identical distances from $\tilde{s}_{q_{r}}$ , it is randomly assigned to the cluster where one of these samples $\tilde{s}_{q_{t}}({t<r})$ is located. Specifically, non-center fuzzy samples are more likely to be classified into the cluster with greater density, and therefore, each cluster can be expanded layer by layer with the use of $\{n_{r}\}_{r=1}^{M}$ .
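The layer-by-layer expansion of Step 3 can be sketched as follows. The function name `assign_clusters` is hypothetical, and the sketch assumes that $\{n_{r}\}_{r=1}^{M}$ has already been computed (with $-1$ for the densest sample) and that any tie between equally distant denser samples was resolved when $n_{r}$ was chosen.

```python
def assign_clusters(rho, nearest_higher, labels):
    """Propagate cluster labels from the centers to all other samples.

    rho:            densities of the M samples.
    nearest_higher: n_r, the nearest sample with greater density (-1 if none).
    labels:         initial marks, cluster number for centers and -1 otherwise.
    """
    result = list(labels)
    # Visit samples from the densest to the sparsest, so that the denser
    # neighbour of each sample has already received its label.
    for i in sorted(range(len(rho)), key=lambda r: -rho[r]):
        if result[i] == -1 and nearest_higher[i] != -1:
            result[i] = result[nearest_higher[i]]
    return result
```

Because every sample only looks up the label of one denser sample, a single pass in density order is sufficient.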

A flowchart summarizing all steps of the FN-CFSFDP algorithm is shown in Fig. 2.

Figure 2.

Flow chart of clustering steps in FN-CFSFDP algorithm.

To further illustrate the steps of the FN-CFSFDP algorithm, pseudo code that is more detailed than the preceding flow chart is given in Table 3.

Table 3

Detailed pseudo code of FN-CFSFDP algorithm

Algorithm: FN-CFSFDP
Input:
$\tilde{S}=\{\tilde{s}_{i}\mid\tilde{s}_{i}=({FN}_{i}^{(1)},{FN}_{i}^{(2)},\ldots,{FN}_{i}^{(N)})\}_{i=1}^{M}$	// Fuzzy sample set
$t\in[{0.01,0.04}]$	// $t$ -parameters
Output:
$C=\{c_{1},c_{2},\ldots,c_{k}\}$	// Clustering results of k clusters
for each fuzzy sample $r$ do
  for each fuzzy sample $t$ do
    for each fuzzy index $j$ do
      $d_{rt}=\left[\sum_{j=1}^{N}\int_{a}^{b}\left|A_{r}^{(j)}(x)-A_{t}^{(j)}(x)\right|^{2}dx\right]^{1/2}$ ;	// Calculate all improved Euclidean distances
    end for
  end for
end for
for each fuzzy sample $r$ do
  for each fuzzy sample $t$ do
    for each fuzzy index $j$ do
      $H=M(M-1)/2$ ;	// Calculate number of non-repetitive distances
      $d_{1}<\cdots<d_{H}$ ;	// Order non-repetitive distances in ascending order
      $d_{c}=d_{f({Ht})}$ ;	// Calculate the threshold
    end for
  end for
end for
for each fuzzy sample $r$ do
  for each fuzzy sample $t$ do
    for each fuzzy index $j$ do
      for each fuzzy non-repetitive distance $H$ do
        $\rho_{r}=\sum_{t=1}^{H}\exp[-(d_{rt}/d_{c})^{2}]$ ;	// Calculate the densities of all fuzzy samples
        $\rho_{q_{1}}\geqslant\rho_{q_{2}}\geqslant\cdots\geqslant\rho_{q_{M}}$ ;	// Order all densities in descending order
      end for
    end for
  end for
end for
for each fuzzy sample $r$ do
  for each fuzzy sample $t$ do
    for each fuzzy index $j$ do
      for each descending ordered subscript $q_{r}$ do
        $\delta_{q_{r}}=\left\{\begin{array}{ll}\min_{q_{t}:t<r}\{d_{q_{r}q_{t}}\},&r\geqslant 2\\ \max_{q_{t}:t\geqslant 2}\{\delta_{q_{t}}\},&r=1\end{array}\right.$ ;	// Calculate the special distances of all fuzzy samples
      end for
    end for
  end for
end for
for each fuzzy sample $r$ do
  $\gamma_{r}=\rho_{r}\cdot\delta_{r}$ ;	// Calculate all comprehensive values of $\gamma_{r}$
end for
for each fuzzy sample $r$ do
  $c_{r}=\left\{\begin{array}{ll}k,&\tilde{s}_{r}\in\{\tilde{s}_{m_{g}}\}_{g=1}^{n_{c}}\ \&\ \{c_{r}=k\}\\ -1,&\text{Otherwise}\end{array}\right.$ ;	// Clustering and marking cluster centers
  $\tilde{s}_{q_{r}}\to\mathop{\text{cluster}(\tilde{s}_{q_{t}}\in\text{cluster}\mid t<r)}\limits_{d(\tilde{s}_{q_{r}},\tilde{s}_{q_{t}}\mid\forall t<r)\equiv L}$ ;	// Classifying samples from non-clustering centers
end for
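The threshold and density steps of Table 3 can be illustrated with the short Python sketch below. It is not the R program used in this study. Two points are assumptions: the rounding rule $f$ is taken to be the ceiling function with a 1-based position, and the Gaussian kernel is summed over the other $M-1$ samples, as in the original CFSFDP formulation.

```python
import math

def cutoff_distance(dist, t):
    """Pick d_c as the f(H*t)-th smallest of the H non-repetitive distances."""
    m = len(dist)
    pairs = sorted(dist[r][s] for r in range(m) for s in range(r + 1, m))
    h = len(pairs)                   # H = M(M-1)/2
    idx = max(1, math.ceil(h * t))   # assumed rounding rule f (1-based)
    return pairs[idx - 1]

def gaussian_density(dist, d_c):
    """Gaussian-kernel density of every sample for a given cut-off d_c."""
    m = len(dist)
    return [
        sum(math.exp(-((dist[r][s] / d_c) ** 2)) for s in range(m) if s != r)
        for r in range(m)
    ]
```

With $t$ between 1% and 4%, `cutoff_distance` selects a value near the lower end of the sorted distances, so each density is dominated by the closest neighbours of the sample.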

The maximum time complexity of the FN-CFSFDP algorithm can be analyzed by considering the least efficient case of each step. Inputting the $M$ fuzzy samples in turn is counted as a step with a maximum time complexity of $O(1)$ , and so is inputting the $t$ -parameter. $d_{rt}$ is calculated in a three-layer loop, two layers for choosing any two fuzzy samples and one layer for choosing any index; therefore, the maximum time complexity of $d_{rt}$ is $O(N\cdot M^{2})$ . The maximum time complexity of calculating the number of non-repetitive distances $H$ is $O(1)$ . The non-repetitive distances are ordered in ascending order with the sort function of the R language; whichever sorting method this function uses, the maximum time complexity of this step is taken as $O({M^{2}})$ . Calculating the threshold $d_{c}$ requires $d_{rt}$ and the $t$ -parameter; both have been worked out in the previous steps and need not be recalculated, so $d_{c}$ can be obtained directly with a maximum time complexity of $O(1)$ .

If the fuzzy density $\rho_{r}$ were calculated separately, a four-layer loop would be needed. However, the three-layer loop for $d_{rt}$ has already been executed in the previous steps, so only a one-layer loop over the non-repetitive distances is needed, and the maximum time complexity of this step is $O(H)$ , where $H$ represents the number of non-repetitive distances. After the densities are calculated, they are sorted in descending order as $\rho_{q_{1}}\geqslant\rho_{q_{2}}\geqslant\cdots\geqslant\rho_{q_{M}}$ ; in the worst case, the maximum time complexity of this step is $O({M^{2}})$ . Similarly, the calculation of the special distance $\delta_{r}$ needs the serial numbers obtained from the descending order of density; this order has been worked out in the previous step and need not be repeated. There is only a one-layer loop in this step, so the maximum time complexity is $O(M)$ . The calculation of the comprehensive values $\gamma_{r}$ requires $\rho_{r}$ and $\delta_{r}$ , both of which have been worked out in the previous steps, and needs no additional loop, so the maximum time complexity of $\gamma_{r}$ is $O(1)$ . ‘Classifying samples from non-clustering centers’ assigns each non-center fuzzy sample to the cluster of its nearest sample with greater density. This process relies on the sorted densities and the distances, which have been worked out in the previous steps and need not be sorted again; only a one-layer loop is needed, so the maximum time complexity is $O(M)$ . The maximum time complexities of the key parameters and steps in FN-CFSFDP are listed in Table 4.

Table 4

Maximum time complexity of key parameters and steps

Key parameters and steps	Maximum time complexity
$\tilde{S}=\{\tilde{s}_{i}\mid\tilde{s}_{i}=({FN}_{i}^{(1)},{FN}_{i}^{(2)},\ldots,{FN}_{i}^{(N)})\}_{i=1}^{M}$	$O(1)$
$t$	$O(1)$
$d_{rt}$	$O(N\cdot M^{2})$
$H$	$O(1)$
$d_{1}<\cdots<d_{H}$	$O({M^{2}})$
$d_{c}$	$O(1)$
$\rho_{r}$	$O(H)$
$\rho_{q_{1}}\geqslant\rho_{q_{2}}\geqslant\cdots\geqslant\rho_{q_{M}}$	$O({M^{2}})$
$\delta_{r}$	$O(M)$
$\gamma_{r}$	$O(1)$
Classifying fuzzy samples from non-clustering centers	$O(M)$

Because the whole FN-CFSFDP algorithm is executed in a sequential structure, its maximum time complexity is determined by the most expensive step, i.e., $O(N\cdot M^{2})$ . The similarities and differences between the FN-CFSFDP and CFSFDP algorithms are compared in Table 5.

Table 5

Comparison between FN-CFSFDP algorithm and CFSFDP algorithm

Algorithm name	FN-CFSFDP	CFSFDP
Object type	Fuzzy numerical type	Numerical type
Object state	Fuzzy uncertainty	Definite
Object shape	Uncertain fuzzy	Definite classic with any shapes
Mathematical basis	Fuzzy set theory	Classic set theory
Membership between objects and clusters	Belong to or not, two-valued logic	Belong to or not, two-valued logic
Ability to deal with high-dimensional data	Good	Good
Clustering measure	Improved Euclidean distance	Distance or similarity degree
Threshold	Related to $t$ -parameters	Related to $t$ -parameters
Time complexity (maximum)	$O(N\cdot M^{2})$	$O(M^{2})$

Now that all procedures of the proposed FN-CFSFDP algorithm have been introduced, random simulations are performed on several sets of data to validate its effectiveness.

5. Random simulation, evaluation of parameter adaptability and clustering stability

Since the membership functions of the commonly used triangular and trapezoidal fuzzy numbers are piecewise functions, their improved Euclidean distances have to be calculated piecewise. In order to reduce the amount of calculation, this study assumes that the information under each index can be described by normal fuzzy numbers, and therefore, the fuzzy samples are normal fuzzy samples. For $M$ normal fuzzy samples under $N$ indexes, each fuzzy sample in the normal fuzzy sample set $\tilde{S}=\{\tilde{s}_{i}\}_{i=1}^{M}$ can be treated as a multivariate normal fuzzy vector or a multivariate normal fuzzy point $\tilde{s}_{i}=[\textit{NFN}_{i}^{(j)}(\mu_{i}^{(j)},\sigma_{i}^{(j)})]_{j=1}^{N}$ . That is, for $\forall i\in\{1,2,\ldots,M\}$ and $\forall j\in\{1,2,\ldots,N\}$ , the information of the normal fuzzy sample $i$ under the index $j$ is described by the normal fuzzy number $\textit{NFN}_{i}^{(j)}(\mu_{i}^{(j)},\sigma_{i}^{(j)})$ , and the corresponding membership function is $A_{i}^{(j)}(x)$ . For the domain of discourse $X=\mathbb{R}$ , i.e., $(-\infty,+\infty)$ , the membership function can be described as:

$\displaystyle A_{i}^{(j)}(x)=\exp\left[-\left(\frac{x-\mu_{i}^{(j)}}{\sigma_{i}^{(j)}}\right)^{2}\right],$ (19)

where $\mu_{i}^{(j)}$ and $\sigma_{i}^{(j)}$ are the two parameters of the membership function $A_{i}^{(j)}(x)$ . A normal fuzzy number is determined once $\mu_{i}^{(j)}$ and $\sigma_{i}^{(j)}$ are determined, which is what is needed to generate normal fuzzy numbers in the simulation. The parameters can be acquired by many means. The method of acquiring the parameters or other relevant information required in the simulation is generally defined as the fetching rule. The rule can be arbitrarily specified in accordance with personal preference, and it can also be set as random generation. In this study, the fetching rule is random generation, i.e., the parameters are random numbers drawn from specified probability distributions.
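For membership functions of the form of Eq. (19), the integral in the distance expression of Table 3 has a closed form when it is taken over the whole domain $(-\infty,+\infty)$ : $\int(A_{1}-A_{2})^{2}dx=\sqrt{\pi/2}\,(\sigma_{1}+\sigma_{2})-\frac{2\sqrt{\pi}\,\sigma_{1}\sigma_{2}}{\sqrt{\sigma_{1}^{2}+\sigma_{2}^{2}}}\exp\left[-\frac{(\mu_{1}-\mu_{2})^{2}}{\sigma_{1}^{2}+\sigma_{2}^{2}}\right]$ . The Python sketch below evaluates this expression and checks it against a simple midpoint-rule integration. It is an illustration under the stated assumption about the integration limits, not the implementation used in this study, and all function names are hypothetical.

```python
import math

def membership(x, mu, sigma):
    """Membership function of a normal fuzzy number, Eq. (19)."""
    return math.exp(-((x - mu) / sigma) ** 2)

def squared_distance_closed_form(mu1, sigma1, mu2, sigma2):
    """Integral of (A1 - A2)^2 over the real line for two normal fuzzy numbers."""
    s1, s2 = abs(sigma1), abs(sigma2)  # only sigma^2 enters Eq. (19)
    self_terms = math.sqrt(math.pi / 2.0) * (s1 + s2)
    spread = s1 * s1 + s2 * s2
    cross = (math.sqrt(math.pi) * s1 * s2 / math.sqrt(spread)
             * math.exp(-((mu1 - mu2) ** 2) / spread))
    return self_terms - 2.0 * cross

def squared_distance_numeric(mu1, sigma1, mu2, sigma2, steps=20000):
    """Midpoint-rule check; tails beyond 10 sigma are neglected."""
    width = 10.0 * max(abs(sigma1), abs(sigma2))
    lo, hi = min(mu1, mu2) - width, max(mu1, mu2) + width
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        diff = membership(x, mu1, sigma1) - membership(x, mu2, sigma2)
        total += diff * diff * h
    return total

def improved_distance(sample_r, sample_t):
    """Distance between two fuzzy samples given as lists of (mu, sigma) pairs."""
    total = sum(squared_distance_closed_form(m1, s1, m2, s2)
                for (m1, s1), (m2, s2) in zip(sample_r, sample_t))
    return math.sqrt(max(total, 0.0))  # guard against tiny negative round-off
```

Using the closed form avoids numerical integration altogether, which is one practical reason why normal fuzzy numbers reduce the amount of calculation compared with piecewise membership functions.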

In the first set of simulation, 200 normal fuzzy samples were used and each sample includes two indexes, Index A and Index B, i.e., $M=$ 200 and $N=$ 2. These samples were artificially divided into two clusters in advance, i.e., $C=$ 2, and the $t$ -parameter was set to $t=$ 4.00%. The first 100 samples are in one cluster; the parameters $\mu$ and $\sigma$ in the membership function corresponding to the first index are drawn from two normal distributions ( $N(160,2^{2})$ and $N(2,0.16^{2})$ ), while the parameters $\mu$ and $\sigma$ in the membership function corresponding to the second index are drawn from two normal distributions ( $N(45,2^{2})$ and $N(2,0.3^{2})$ ). The remaining 100 samples are in the other cluster; the parameters $\mu$ and $\sigma$ in the membership function corresponding to the first index are drawn from two normal distributions ( $N(168,3^{2})$ and $N(3,0.2^{2})$ ), while the parameters $\mu$ and $\sigma$ in the membership function corresponding to the second index are drawn from two normal distributions ( $N(54,3.5^{2})$ and $N(3.5,0.5^{2})$ ). Here, the distributions can be arbitrarily set since they only represent one kind of fetching rule. Table 6 lists the relevant information in the first set of simulation, in which R denotes the fetching rule and T denotes the parameter type.

Table 6

Relevant information in the first set of simulation

$M=$ 200	Index
$N=$ 2		Index A		Index B
$C=$ 2		Normal fuzzy numbers		Normal fuzzy numbers
$t=$ 4.00%
Cluster 1 $p=1,\ldots,100$	T	$\mu_{p}^{(A)}$	$\sigma_{p}^{(A)}$	$\mu_{p}^{(B)}$	$\sigma_{p}^{(B)}$
	R	$N({160,2^{2}})$	$N({2,0.16^{2}})$	$N({45,2^{2}})$	$N({2,0.3^{2}})$
Cluster 2 $q=1,\ldots,100$	T	$\mu_{q}^{(A)}$	$\sigma_{q}^{(A)}$	$\mu_{q}^{(B)}$	$\sigma_{q}^{(B)}$
	R	$N({168,3^{2}})$	$N({3,0.2^{2}})$	$N({54,3.5^{2}})$	$N({3.5,0.5^{2}})$
Domain of discourse $X$		$({-\infty,+\infty})$		$({-\infty,+\infty})$
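The fetching rule of Table 6 can be sketched in Python as follows. The data layout (a dictionary of $(\mu,\sigma)$ pairs per sample), the function name `generate_first_set` and the fixed seed are assumptions made for illustration; the simulations in this study were programmed in R.

```python
import random

# (mean of mu, s.d. of mu, mean of sigma, s.d. of sigma) per cluster and index,
# taken from Table 6.
FETCHING_RULES = {
    1: {"A": (160, 2, 2, 0.16), "B": (45, 2, 2, 0.3)},
    2: {"A": (168, 3, 3, 0.2), "B": (54, 3.5, 3.5, 0.5)},
}

def generate_first_set(samples_per_cluster=100, seed=0):
    """Draw normal fuzzy samples and their true cluster labels."""
    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    samples, labels = [], []
    for cluster, rules in FETCHING_RULES.items():
        for _ in range(samples_per_cluster):
            sample = {}
            for index, (mu_mean, mu_sd, s_mean, s_sd) in rules.items():
                # Each index holds one normal fuzzy number (mu, sigma).
                sample[index] = (rng.gauss(mu_mean, mu_sd),
                                 rng.gauss(s_mean, s_sd))
            samples.append(sample)
            labels.append(cluster)
    return samples, labels
```

Changing the entries of `FETCHING_RULES`, or replacing `rng.gauss` with another generator, corresponds to changing the fetching rule.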

The second set of simulation includes 300 normal fuzzy samples in total ( $M=$ 300), in which each sample has three indexes ( $N=$ 3), namely, Index E, Index F and Index G. These 300 samples were first artificially divided into three clusters ( $C=$ 3), and the $t$ -parameter was set to 3.00%. In each cluster, the parameters $\mu$ and $\sigma$ in the membership function corresponding to each index are drawn from two normal distributions. The first 100 samples are in the first cluster, with $\mu\sim N({1.2,0.14^{2}})$ and $\sigma\sim N({0.15,0.15^{2}})$ for the first index, $\mu\sim N({7,0.15^{2}})$ and $\sigma\sim N({0.2,0.12^{2}})$ for the second index, and $\mu\sim N({3.4,0.2^{2}})$ and $\sigma\sim N({0.13,0.21^{2}})$ for the third index. The middle 100 samples are in the second cluster, with $\mu\sim N({8,0.25^{2}})$ and $\sigma\sim N({0.2,0.24^{2}})$ for the first index, $\mu\sim N({15,0.24^{2}})$ and $\sigma\sim N({0.24,0.14^{2}})$ for the second index, and $\mu\sim N({13,0.2^{2}})$ and $\sigma\sim N({0.32,0.12^{2}})$ for the third index. The remaining 100 samples are in the third cluster, with $\mu\sim N({17,0.23^{2}})$ and $\sigma\sim N({0.3,0.2^{2}})$ for the first index, $\mu\sim N({22,0.2^{2}})$ and $\sigma\sim N({0.35,0.12^{2}})$ for the second index, and $\mu\sim N({26,0.18^{2}})$ and $\sigma\sim N({0.5,0.31^{2}})$ for the third index. Table 7 lists the relevant information in the second set of random simulation, in which R denotes the fetching rule and T denotes the parameter type. To simplify the expression, $R_{C}^{j}(\mu)$ and $R_{C}^{j}(\sigma)$ denote the fetching rules of $\mu$ and $\sigma$ in the membership function corresponding to the normal fuzzy number in the $C$ -th cluster under the $j$ -th index, respectively, where $C=1,2,3$ and $j=E,F,G$ .

Table 7

Relevant information in the second set of simulation

$M=$ 300		Index
$N=$ 3		Index E		Index F		Index G
$C=$ 3		Normal fuzzy numbers		Normal fuzzy numbers		Normal fuzzy numbers
$t=$ 3.00%
Cluster 1 $p=1,\ldots,100$	T	$\mu_{p}^{(E)}$	$\sigma_{p}^{(E)}$	$\mu_{p}^{(F)}$	$\sigma_{p}^{(F)}$	$\mu_{p}^{(G)}$	$\sigma_{p}^{(G)}$
	R	$R_{1}^{E}(\mu)$	$R_{1}^{E}(\sigma)$	$R_{1}^{F}(\mu)$	$R_{1}^{F}(\sigma)$	$R_{1}^{G}(\mu)$	$R_{1}^{G}(\sigma)$
Cluster 2 $q=1,\ldots,100$	T	$\mu_{q}^{(E)}$	$\sigma_{q}^{(E)}$	$\mu_{q}^{(F)}$	$\sigma_{q}^{(F)}$	$\mu_{q}^{(G)}$	$\sigma_{q}^{(G)}$
	R	$R_{2}^{E}(\mu)$	$R_{2}^{E}(\sigma)$	$R_{2}^{F}(\mu)$	$R_{2}^{F}(\sigma)$	$R_{2}^{G}(\mu)$	$R_{2}^{G}(\sigma)$
Cluster 3 $r=1,\ldots,100$	T	$\mu_{r}^{(E)}$	$\sigma_{r}^{(E)}$	$\mu_{r}^{(F)}$	$\sigma_{r}^{(F)}$	$\mu_{r}^{(G)}$	$\sigma_{r}^{(G)}$
	R	$R_{3}^{E}(\mu)$	$R_{3}^{E}(\sigma)$	$R_{3}^{F}(\mu)$	$R_{3}^{F}(\sigma)$	$R_{3}^{G}(\mu)$	$R_{3}^{G}(\sigma)$
Domain of discourse $X$		$({-\infty,+\infty})$		$({-\infty,+\infty})$		$({-\infty,+\infty})$

It should be noted that Table 7 has the same structure as Table 6 and is listed for comparison of the simulation information. Following the clustering procedures described above, the programs for clustering were written in the R language. In accordance with the definition proposed by Al-Shammary [47], the clustering accuracy on the data set $D$ using the algorithm $f$ can be calculated as:

$\displaystyle ac\_\textit{rate}({D/f})=\frac{\sum_{i=1}^{k}{\textit{corr}\_c_{i}}}{|D|},$ (20)

where $k$ denotes the number of real classes of the data set, $\textit{corr}\_c_{i}$ denotes the number of samples that are accurately classified in the $i$ -th class and $|D|$ denotes the number of samples in the data set. A greater total of accurately classified samples, and hence a greater $ac\_\textit{rate}$ , is indicative of more favorable clustering performance. Table 8 lists the related clustering parameters and results.
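Equation (20) can be evaluated with a few lines of Python. Since cluster numbers produced by a clustering algorithm are arbitrary, the sketch first maps clusters to real classes; choosing the one-to-one mapping that maximizes the number of correctly classified samples is an assumption of this example, as the mapping rule is not specified here. The function name is hypothetical.

```python
from itertools import permutations

def clustering_accuracy(true_labels, pred_labels):
    """ac_rate of Eq. (20) under the best one-to-one cluster-to-class mapping."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(pred_labels))
    if len(clusters) > len(classes):
        raise ValueError("more clusters than real classes")
    best = 0
    # Exhaustive search is adequate for the small cluster numbers used here.
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        correct = sum(1 for t, p in zip(true_labels, pred_labels)
                      if mapping[p] == t)
        best = max(best, correct)
    return best / len(true_labels)
```

For example, with real classes `[1, 1, 1, 2, 2, 2]` and cluster marks `[2, 2, 1, 1, 1, 1]`, five of the six samples are matched after relabelling.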

Table 8

Clustering parameters and results

	$t$ (%)	$d_{c}$	Clustering accuracy (%)
The first set of simulation	4.00	$1.37\times 10^{-33}$	74.00
The second set of simulation	3.00	$8.46\times 10^{-14}$	44.00

The clustering accuracy in the first set of simulation reaches 74.00%, while that in the second set of simulation is only 44.00%. Apparently, the clustering performance in the first set of simulation is superior to that in the second set. Because the fetching rule is random, the clustering accuracy may fluctuate to a certain degree. Next, the fetching rule is changed so as to perform random simulations under different fetching rules.

In the third set of simulation, 300 normal fuzzy samples were used ( $M=$ 300) and each sample includes three indexes ( $N=$ 3), namely, Index H, Index V and Index W; the samples were divided into three clusters ( $C=$ 3). The first 100 samples are in one cluster, and the parameters $\mu$ and $\sigma$ in the membership functions corresponding to the three indexes follow the normal distributions $N({-2,0.14^{2}})$ and $N({0.15,0.15^{2}})$ , respectively. The middle 100 samples are in another cluster, and the parameters $\mu$ and $\sigma$ in the membership functions corresponding to the three indexes follow the exponential distributions $\textit{Exp}({10})$ and $\textit{Exp}({10})$ , respectively. The remaining 100 samples are in the third cluster, and the parameters $\mu$ and $\sigma$ in the membership functions corresponding to the three indexes follow the uniform distributions $U({9,15})$ and $U({0.9,1.5})$ , respectively. Table 9 lists the corresponding information in the third set of simulation, in which R and T denote the fetching rule and parameter type, respectively.

Table 9

Relevant information in the third set of simulation

$M=$ 300		Index
$N=$ 3		Index H		Index V		Index W
$C=$ 3		Normal fuzzy numbers		Normal fuzzy numbers		Normal fuzzy numbers
$t=$ 1.00%, 2.00%, 3.00%, 4.00%
Cluster 1 $p=1,\ldots,100$	T	$\mu_{p}^{(H)}$	$\sigma_{p}^{(H)}$	$\mu_{p}^{(V)}$	$\sigma_{p}^{(V)}$	$\mu_{p}^{(W)}$	$\sigma_{p}^{(W)}$
	R	$\mu\sim N({-2,0.14^{2}}),\sigma\sim N({0.15,0.15^{2}})$
Cluster 2 $q=1,\ldots,100$	T	$\mu_{q}^{(H)}$	$\sigma_{q}^{(H)}$	$\mu_{q}^{(V)}$	$\sigma_{q}^{(V)}$	$\mu_{q}^{(W)}$	$\sigma_{q}^{(W)}$
	R	$\mu\sim\textit{Exp}({10}),\sigma\sim\textit{Exp}({0.24})$
Cluster 3 $r=1,\ldots,100$	T	$\mu_{r}^{(H)}$	$\sigma_{r}^{(H)}$	$\mu_{r}^{(V)}$	$\sigma_{r}^{(V)}$	$\mu_{r}^{(W)}$	$\sigma_{r}^{(W)}$
	R	$\mu\sim U({9,15}),\sigma\sim U({0.9,1.5})$
Domain of discourse $X$		$({-\infty,+\infty})$		$({-\infty,+\infty})$		$({-\infty,+\infty})$

According to the above information, the parameter was set to $t=$ 1.00%, $t=$ 2.00%, $t=$ 3.00% and $t=$ 4.00% in turn. Four simulations in total were conducted on the third set of samples, and the related clustering accuracy and Kappa coefficient [48] were calculated. The values of $d_{c}$ and clustering accuracy were rounded to two decimal places, and the values of the Kappa coefficient were rounded to four decimal places, as shown in Table 10.

Table 10

Clustering parameters and results

$t$ (%)	$d_{c}$	Clustering accuracy (%)	Kappa coefficient
1.00	0.55	71.67	0.5750
2.00	0.62	71.67	0.5750
3.00	0.67	71.00	0.5650
4.00	0.71	71.00	0.5650

It can be observed from Table 10 that the clustering accuracy reaches 71.67% at $t=$ 1.00% and $t=$ 2.00%, and 71.00% at $t=$ 3.00% and $t=$ 4.00%. Apparently, the value of $t$ can affect the clustering accuracy, although only slightly here. In this study, the Kappa coefficient was introduced for evaluating the clustering performance. According to the evaluation criterion proposed by Landis and Koch [49], the Kappa coefficient can be classified into 6 levels so as to accurately assess clustering consistency. Table 11 lists the evaluation criterion based on the calculated Kappa coefficient.

Table 11

Evaluation criterion based on Kappa coefficient

Kappa coefficient	Non-accidental consistency degree
$<$ 0.00	Poor
0.00 $\sim$ 0.20	Slight
0.21 $\sim$ 0.40	Fair
0.41 $\sim$ 0.60	Moderate
0.61 $\sim$ 0.80	Substantial
0.81 $\sim$ 1.00	Almost perfect
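The following Python sketch shows how a Kappa coefficient can be computed and graded against Table 11. It uses the standard definition of Cohen's kappa, $(p_{o}-p_{e})/(1-p_{e})$ , and assumes that the cluster marks have already been matched to the real classes; the function names and the example labels are illustrative only.

```python
def cohen_kappa(true_labels, pred_labels):
    """Cohen's kappa for two label lists of equal length."""
    n = len(true_labels)
    classes = set(true_labels) | set(pred_labels)
    # Observed agreement: share of samples with identical labels.
    observed = sum(1 for t, p in zip(true_labels, pred_labels) if t == p) / n
    # Chance agreement: product of the marginal shares, summed over classes.
    expected = sum((true_labels.count(c) / n) * (pred_labels.count(c) / n)
                   for c in classes)
    return (observed - expected) / (1.0 - expected)

def consistency_level(kappa):
    """Grade a Kappa coefficient according to Table 11."""
    if kappa < 0.0:
        return "Poor"
    if kappa <= 0.20:
        return "Slight"
    if kappa <= 0.40:
        return "Fair"
    if kappa <= 0.60:
        return "Moderate"
    if kappa <= 0.80:
        return "Substantial"
    return "Almost perfect"

# Hypothetical example: 3 real classes of 100 samples, 215 of 300 correct.
true = [0] * 100 + [1] * 100 + [2] * 100
pred = [1] * 85 + [0] * 15 + [1] * 100 + [2] * 100
kappa = cohen_kappa(true, pred)
level = consistency_level(kappa)
```

With three real classes of equal size, the chance agreement is $1/3$ , so an accuracy of 215/300 (71.67%) gives a Kappa coefficient of 0.5750, which agrees with the first row of Table 10 and falls in the ‘Moderate’ level.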

According to the above evaluation criterion, the clustering performance under all four values of $t$ is moderate. The clustering is not poor; unfortunately, it does not reach the level of ‘Substantial’ or ‘Almost perfect’. The reason why the Kappa coefficients do not reach the best level is that the FN-CFSFDP algorithm extends the classical clustering objects to fuzzy numbers on fuzzy sets. Setting aside the concept of fuzzy number, the question can be analyzed simply from the perspective of calculation. Although only 300 fuzzy samples were used in the third set of random simulation, the fuzzy number corresponding to each index of any fuzzy sample has infinitely many states when using FN-CFSFDP. If all states are taken into account, each of the 900 fuzzy numbers of the 300 fuzzy samples under 3 indexes carries an infinite amount of information. With such a huge amount of information, even moderate clustering consistency is not easy to achieve. On the other hand, it can be found that the automatic identification of the cluster number available in CFSFDP fails in FN-CFSFDP. In this study, the variation of $\gamma$ was plotted. Under the four different $t$ -parameters, the variation curves of $\gamma$ exhibit similar tendencies and only one obvious jump point, as shown in Fig. 3, in which ${i}$ denotes the serial number of $\gamma$ .

Figure 3.

Decision graph under different $t$ -parameters. a. Decision graph under $t=$ 1.00%. b. Decision graph under $t=$ 2.00%. c. Decision graph under $t=$ 3.00%. d. Decision graph under $t=$ 4.00%.

As shown in Fig. 3, only one jump point can be easily observed. Using CFSFDP, the number of clusters is automatically identified in accordance with the number of jump points of $\gamma$ . Following this rule, only one cluster would be found in the clustering results. Because $d_{c}$ is small, the calculated $\rho$ based on $d_{c}$ is also small and the density peak is small; therefore, $\gamma$ for the identification of the cluster number is also small. Only one large jump point can be observed, and the values of $\gamma$ corresponding to the remaining samples differ only slightly. Accordingly, the other jump points cannot be observed by the naked eye. CFSFDP has certain shortcomings in automatically identifying the number of clusters. Unfortunately, this problem has not been successfully addressed by the FN-CFSFDP algorithm, and it is expected to be improved in further studies.

In order to further verify the effectiveness of the FN-CFSFDP algorithm, the fetching rules, fuzzy number types, number of samples, number of indexes and specified number of clusters of the third set of simulation (as shown in Table 9) are retained. In the previous three sets of simulations, the $t$ -parameters were only integer percentages. In order to make the granularity of the $t$ -parameters finer and the tests of the FN-CFSFDP algorithm more diversified, the $t$ -parameters in this set of simulation are extended as follows: 1.00%, 1.50%, 2.00%, 2.50%, 3.00%, 3.50% and 4.00%. On this basis, 10 groups of randomly simulated fuzzy datasets were generated in total, and each group was clustered 7 times, once under each $t$ -parameter. The clustering accuracy and Kappa coefficient of each random simulation were calculated. Moreover, the mean and variance of the clustering accuracies and the mean and variance of the Kappa coefficients were calculated for each $t$ -parameter, and the results are shown in Table 12 (including Table 12(a) to (g)). SNFD represents ‘Serial Number of Fuzzy Datasets’; CA represents ‘Clustering Accuracy’; MCA represents ‘Mean of Clustering Accuracy’; VCA represents ‘Variance of Clustering Accuracy’; KC represents ‘Kappa Coefficient’; MKC represents ‘Mean of Kappa Coefficient’; VKC represents ‘Variance of Kappa Coefficient’.

Table 12

Clustering in 10 fuzzy datasets under different 7 $t$ -parameters

(a) Clustering in 10 fuzzy datasets under $t=$ 1.00%
$t$ -parameter value	SNFD	CA (%)	KC
$t=$ 1.00%	1	53.67	0.3050
	2	79.00	0.6850
	3	50.00	0.2500
	4	45.00	0.1750
	5	70.67	0.5600
	6	70.00	0.5500
	7	75.33	0.6300
	8	75.00	0.6250
	9	56.33	0.3450
	10	75.67	0.6350
MCA $=$ 65.07%	VCA $=$ 155.800	MKC $=$ 0.4760	VKC $=$ 0.0350
(b) Clustering in 10 fuzzy datasets under $t=$ 1.50%
$t=$ 1.50%	1	53.67	0.3050
	2	79.00	0.6850
	3	49.00	0.2350
	4	45.00	0.1750
	5	71.67	0.5750
	6	72.00	0.5800
	7	34.33	0.0150
	8	75.00	0.6800
	9	56.33	0.3450
	10	75.67	0.6350
MCA $=$ 61.17%	VCA $=$ 239.623	MKC $=$ 0.4230	VKC $=$ 0.0567
(c) Clustering in 10 fuzzy datasets under $t=$ 2.00%
$t=$ 2.00%	1	67.67	0.5150
	2	79.00	0.6850
	3	52.67	0.2900
	4	45.00	0.1750
	5	66.33	0.4950
	6	72.00	0.5800
	7	34.33	0.0150
	8	78.67	0.6800
	9	56.33	0.3450
	10	75.67	0.6350
MCA $=$ 62.77%	VCA $=$ 230.289	MKC $=$ 0.4415	VKC $=$ 0.0518
(d) Clustering in 10 fuzzy datasets under $t=$ 2.50%
$t=$ 2.50%	1	67.67	0.5150
	2	79.00	0.6850
	3	52.67	0.2900
	4	45.00	0.1750
	5	67.00	0.5050
	6	72.00	0.5800
	7	34.33	0.0150
	8	47.67	0.2150
	9	56.33	0.3450
	10	75.67	0.6350
MCA $=$ 59.73%	VCA $=$ 217.872	MKC $=$ 0.3960	VKC $=$ 0.0490
(e) Clustering in 10 fuzzy datasets under $t=$ 3.00%
$t=$ 3.00%	1	67.67	0.5150
	2	56.00	0.3400
	3	52.33	0.2850
	4	45.00	0.1750
	5	67.00	0.5050
	6	72.00	0.5800
	7	34.33	0.0150
	8	47.67	0.2150
	9	55.33	0.3300
	10	75.67	0.6350
MCA $=$ 57.30%	VCA $=$ 173.010	MKC $=$ 0.3595	VKC $=$ 0.0389
(f) Clustering in 10 fuzzy datasets under $t=$ 3.50%
$t=$ 3.50%	1	67.67	0.5150
	2	58.67	0.3800
	3	52.33	0.2850
	4	45.00	0.1750
	5	66.33	0.4950
	6	72.00	0.5800
	7	34.33	0.0150
	8	53.00	0.2950
	9	55.33	0.3300
	10	75.67	0.6350
MCA $=$ 58.03%	VCA $=$ 162.790	MKC $=$ 0.3705	VKC $=$ 0.0366
(g) Clustering in 10 fuzzy datasets under $t=$ 4.00%
$t=$ 4.00%	1	67.67	0.5150
	2	58.67	0.3800
	3	68.67	0.5300
	4	45.00	0.1750
	5	66.33	0.4950
	6	72.00	0.5800
	7	34.33	0.0150
	8	53.00	0.2950
	9	55.33	0.3300
	10	75.67	0.6350
MCA $=$ 59.67%	VCA $=$ 168.782	MKC $=$ 0.3950	VKC $=$ 0.0380
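The summary rows of Table 12 can be checked with the Python standard library. The sketch below uses the values of Table 12(a); treating the reported variance as the sample variance (divisor $n-1$ ) is an assumption, adopted because it reproduces the reported VCA.

```python
from statistics import mean, variance

# CA (%) and KC columns of Table 12(a), t = 1.00%.
ca = [53.67, 79.00, 50.00, 45.00, 70.67, 70.00, 75.33, 75.00, 56.33, 75.67]
kc = [0.3050, 0.6850, 0.2500, 0.1750, 0.5600, 0.5500, 0.6300, 0.6250,
      0.3450, 0.6350]

def summarize(values):
    """Return the mean and the sample variance (divisor n - 1)."""
    return mean(values), variance(values)

mca, vca = summarize(ca)  # mean and variance of clustering accuracy
mkc, vkc = summarize(kc)  # mean and variance of Kappa coefficient
```

For Table 12(a), this gives a mean accuracy of about 65.07% with a variance of about 155.80, and a mean Kappa coefficient of 0.4760 with a variance of about 0.035.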

According to the above results in the sub-tables, the highest clustering accuracy in the 70 random simulation tests is 79.00% and the lowest is 34.33%. The mean clustering accuracy is highest at $t=$ 1.00% (65.07%) and lowest at $t=$ 3.00% (57.30%). The variance of the clustering accuracy is lowest at $t=$ 1.00% (155.800), where the clustering accuracy is most stable, and highest at $t=$ 1.50% (239.623), where it is least stable. On the other hand, the maximum Kappa coefficient in the 70 random simulation tests is 0.6850 and the minimum is 0.0150. The mean Kappa coefficient is highest at $t=$ 1.00% (0.4760), which corresponds to a moderate non-accidental consistency degree, and lowest at $t=$ 3.00% (0.3595), which corresponds to a fair non-accidental consistency degree. The variance of the Kappa coefficient is lowest at $t=$ 1.00% (0.0350), where the clustering effect of the algorithm is most stable, and highest at $t=$ 1.50% (0.0567), where it is least stable. Besides, the results of the random simulation show that, for the same fuzzy dataset, different $t$ -parameters do not necessarily lead to different clustering accuracies and Kappa coefficients. In addition, the mean clustering accuracy and the mean Kappa coefficient of each fuzzy dataset under the 7 $t$ -parameters were also calculated, as shown in Table 13 (including Table 13(a) to (j)). SNFD represents ‘Serial Number of Fuzzy Datasets’; CA represents ‘Clustering Accuracy’; KC represents ‘Kappa Coefficient’; MCAFD represents ‘Mean of Clustering Accuracy for Fuzzy Datasets’; MKCFD represents ‘Mean of Kappa Coefficient for Fuzzy Datasets’; MKCECP represents ‘Mean of Kappa Coefficient-Evaluation of Clustering Performance’ (refer to the evaluation standard in Table 11).

Table 13

Different 7 $t$ -parameters in 10 fuzzy datasets

SNFD	$t$ (%)	CA (%)	MCAFD (%)	KC	MKCFD	MKCECP
(a) First fuzzy datasets with different 7 $t$ -parameters
1	1.00	53.67	63.67	0.3050	0.4550	Moderate
	1.50	53.67		0.3050
	2.00	67.67		0.5150
	2.50	67.67		0.5150
	3.00	67.67		0.5150
	3.50	67.67		0.5150
	4.00	67.67		0.5150
(b) Second fuzzy datasets with different 7 $t$ -parameters
2	1.00	79.00	58.62	0.6850	0.5486	Moderate
	1.50	79.00		0.6850
	2.00	79.00		0.6850
	2.50	79.00		0.6850
	3.00	56.00		0.3400
	3.50	58.67		0.3800
	4.00	58.67		0.3800
(c) Third fuzzy datasets with different 7 $t$ -parameters
3	1.00	50.00	53.95	0.2500	0.3093	Fair
	1.50	49.00		0.2350
	2.00	52.67		0.2900
	2.50	52.67		0.2900
	3.00	52.33		0.2850
	3.50	52.33		0.2850
	4.00	68.67		0.5300
(d) Fourth fuzzy datasets with different 7 $t$ -parameters
4	1.00	45.00	45.00	0.1750	0.1750	Slight
	1.50	45.00		0.1750
	2.00	45.00		0.1750
	2.50	45.00		0.1750
	3.00	45.00		0.1750
	3.50	45.00		0.1750
	4.00	45.00		0.1750
(e) Fifth fuzzy datasets with different 7 $t$ -parameters
5	1.00	70.67	67.90	0.5600	0.5186	Moderate
	1.50	71.67		0.5750
	2.00	66.33		0.4950
	2.50	67.00		0.5050
	3.00	67.00		0.5050
	3.50	66.33		0.4950
	4.00	66.33		0.4950
(f) Sixth fuzzy datasets with different 7 $t$ -parameters
6	1.00	70.00	71.71	0.5500	0.5757	Moderate
	1.50	72.00		0.5800
	2.00	72.00		0.5800
	2.50	72.00		0.5800
	3.00	72.00		0.5800
	3.50	72.00		0.5800
	4.00	72.00		0.5800
(g) Seventh fuzzy datasets with different 7 $t$ -parameters
7	1.00	75.33	40.19	0.6300	0.1029	Slight
	1.50	34.33		0.0150
	2.00	34.33		0.0150
	2.50	34.33		0.0150
	3.00	34.33		0.0150
	3.50	34.33		0.0150
	4.00	34.33		0.0150
(h) Eighth fuzzy datasets with different 7 $t$ -parameters
8	1.00	75.00	61.43	0.6250	0.4293	Moderate
	1.50	75.00		0.6800
	2.00	78.67		0.6800
	2.50	47.67		0.2150
	3.00	47.67		0.2150
	3.50	53.00		0.2950
	4.00	53.00		0.2950
(i) Ninth fuzzy datasets with different 7 $t$ -parameters
9	1.00	56.33	55.90	0.3450	0.3386	Fair
	1.50	56.33		0.3450
	2.00	56.33		0.3450
	2.50	56.33		0.3450
	3.00	55.33		0.3300
	3.50	55.33		0.3300
	4.00	55.33		0.3300
(j) Tenth fuzzy datasets with different 7 $t$ -parameters
10	1.00	75.67	75.67	0.6350	0.6350	Substantial
	1.50	75.67		0.6350
	2.00	75.67		0.6350
	2.50	75.67		0.6350
	3.00	75.67		0.6350
	3.50	75.67		0.6350
	4.00	75.67		0.6350

Table 14

Different 7 $t$ -parameters in 10 classical datasets

SNCD	$t$ (%)	CA (%)	MCACD (%)	KC	MKCCD	MKCECP
(a) First classical datasets with different 7 $t$ -parameters
1	1.00	60.00	65.90	0.4000	0.4900	Moderate
	1.50	60.33		0.4050
	2.00	60.00		0.4050
	2.50	60.00		0.4050
	3.00	60.33		0.4050
	3.50	60.67		0.4100
	4.00	100.00		1.0000
(b) Second classical datasets with different 7 $t$ -parameters
2	1.00	60.00	65.90	0.4000	0.4900	Moderate
	1.50	60.33		0.4050
	2.00	60.00		0.4000
	2.50	60.00		0.4000
	3.00	60.33		0.4050
	3.50	60.67		0.4100
	4.00	100.00		1.0000
(c) Third classical datasets with different 7 $t$ -parameters
3	1.00	59.33	68.95	0.3900	0.5343	Moderate
	1.50	62.00		0.4300
	2.00	60.33		0.4050
	2.50	67.00		0.5050
	3.00	67.00		0.5050
	3.50	67.00		0.5050
	4.00	100.00		1.0000
(d) Fourth classical datasets with different 7 $t$ -parameters
4	1.00	53.33	65.09	0.3000	0.4764	Moderate
	1.50	63.67		0.4550
	2.00	57.33		0.3600
	2.50	70.33		0.5550
	3.00	70.33		0.5550
	3.50	70.33		0.5550
	4.00	70.33		0.5550
(e) Fifth classical datasets with different 7 $t$ -parameters
5	1.00	100.00	100.00	1.0000	1.0000	Almost perfect
	1.50	100.00		1.0000
	2.00	100.00		1.0000
	2.50	100.00		1.0000
	3.00	100.00		1.0000
	3.50	100.00		1.0000
	4.00	100.00		1.0000
(f) Sixth classical datasets with different 7 $t$ -parameters
6	1.00	100.00	100.00	1.0000	1.0000	Almost perfect
	1.50	100.00		1.0000
	2.00	100.00		1.0000
	2.50	100.00		1.0000
	3.00	100.00		1.0000
	3.50	100.00		1.0000
	4.00	100.00		1.0000
(g) Seventh classical datasets with different 7 $t$ -parameters
7	1.00	53.33	65.29	0.3300	0.4836	Moderate
	1.50	68.67		0.5300
	2.00	67.00		0.5050
	2.50	67.00		0.5050
	3.00	67.00		0.5050
	3.50	67.00		0.5050
	4.00	67.00		0.5050
(h) Eighth classical datasets with different 7 $t$ -parameters
8	1.00	53.33	66.81	0.3000	0.5021	Moderate
	1.50	55.00		0.3250
	2.00	55.00		0.3250
	2.50	83.33		0.7500
	3.00	79.67		0.6950
	3.50	70.67		0.5600
	4.00	70.67		0.5600
(i) Ninth classical datasets with different 7 $t$ -parameters
9	1.00	100.00	100.00	1.0000	1.0000	Almost perfect
	1.50	100.00		1.0000
	2.00	100.00		1.0000
	2.50	100.00		1.0000
	3.00	100.00		1.0000
	3.50	100.00		1.0000
	4.00	100.00		1.0000
(j) Tenth classical datasets with different 7 $t$ -parameters
10	1.00	65.00	90.00	0.4750	0.8500	Almost perfect
	1.50	65.00		0.4750
	2.00	100.00		1.0000
	2.50	100.00		1.0000
	3.00	100.00		1.0000
	3.50	100.00		1.0000
	4.00	100.00		1.0000

As shown in the sub-tables of Table 13, the maximum mean clustering accuracy, 75.67%, occurs in the 10th group of datasets, and the minimum, 40.19%, occurs in the 7th group. Likewise, the maximum mean Kappa coefficient, 0.6350, occurs in the 10th group, and the minimum, 0.1029, in the 7th group. In some datasets, different $t$ values have little or no effect on the clustering accuracy and Kappa coefficient, which indirectly illustrates that the FN-CFSFDP algorithm has a certain degree of parameter adaptability and clustering stability. If an algorithm clusters the same dataset under different values of one parameter and obtains different results, the algorithm is sensitive to that parameter. Such sensitivity limits the applicability of the algorithm and causes a certain degree of clustering instability. Conversely, if the clustering accuracy and Kappa coefficient change only slightly, or not at all, when the parameter value changes, the algorithm is insensitive to that parameter on the given dataset, and the parameter has little or no effect on the clustering results. The CFSFDP algorithm itself has good clustering stability and parameter adaptability, and the FN-CFSFDP algorithm retains this advantage.
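The link between the CA and KC columns of Table 13 can be illustrated with a short sketch. It assumes that KC is the standard Cohen's kappa [48] computed from a confusion matrix of true versus assigned clusters. The confusion matrix below is hypothetical, chosen only to give 70% accuracy with three true clusters of 100 samples each; it is not taken from the simulations, and the function names are illustrative.

```python
def clustering_accuracy(confusion):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    return correct / total


def cohen_kappa(confusion):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed agreement: share of samples assigned to their true cluster.
    p_o = sum(confusion[k][k] for k in range(n)) / total
    # Chance agreement from the row (true) and column (assigned) marginals.
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[r][c] for r in range(n)) for c in range(n)]
    p_e = sum(row_sums[k] * col_sums[k] for k in range(n)) / total**2
    return (p_o - p_e) / (1 - p_e)


# Hypothetical confusion matrix (rows: true cluster, columns: assigned cluster).
confusion = [
    [100, 0, 0],
    [30, 70, 0],
    [0, 60, 40],
]
ca = clustering_accuracy(confusion)
kc = cohen_kappa(confusion)
print(f"CA = {ca:.4f}, KC = {kc:.4f}")
```

When the three true clusters are of equal size, the chance agreement is 1/3 whatever the assigned cluster sizes, so kappa reduces to 1.5 CA - 0.5. This agrees with, for example, the pair CA = 70.00%, KC = 0.5500 in Table 13(f).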

For further comparison, we experimentally compare the FN-CFSFDP algorithm with the original CFSFDP algorithm of reference [16] in terms of clustering accuracy and Kappa coefficient. Because the simulation for the FN-CFSFDP algorithm uses normal fuzzy numbers whereas the original CFSFDP algorithm clusters classical data, the two simulations cannot be made identical; we can only keep the experimental background as close as possible, for example in the fetching rules, sample size and number of indexes. Following the experimental background of the third simulation, we set up the simulation as follows: 300 samples, 3 indexes, and 3 appointed clusters of 100 samples each, with several fetching rules. The first 100 samples in the 3 indexes are generated from the normal distribution $N({-2,0.14^{2}})$ . The middle 100 samples in the 3 indexes are generated from the exponential distribution $\textit{Exp}({10})$ . The last 100 samples in the 3 indexes are generated from the uniform distribution $U({9,15})$ . As in the fourth set of simulations, a total of 10 classical datasets are generated randomly, and each classical dataset is clustered once under each of the following $t$ -parameters: 1.00%, 1.50%, 2.00%, 2.50%, 3.00%, 3.50% and 4.00%. After clustering, using the Euclidean distance as the measure, we calculate the clustering accuracy and Kappa coefficient of each clustering, together with the mean clustering accuracy and mean Kappa coefficient of each dataset; the results are shown in Table 14 (including Table 14(a) to (j)). SNCD represents ‘Serial Number of Classical Datasets’; CA represents ‘Clustering Accuracy’; KC represents ‘Kappa Coefficient’. MCACD represents ‘Mean of Clustering Accuracy for Classical Datasets’; MKCCD represents ‘Mean of Kappa Coefficient for Classical Datasets’; MKCECP represents ‘Mean of Kappa Coefficient-Evaluation of Clustering Performance’ (refer to the evaluation standard in Table 11).
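The classical datasets described above could be generated along the following lines. This is a minimal sketch rather than the authors' code: the function name `make_classical_dataset` and the seeds are hypothetical, and $\textit{Exp}(10)$ is assumed to denote a rate of 10 (mean 0.1). If a mean of 10 is intended, `expovariate(10)` would become `expovariate(1 / 10)`.

```python
import random


def make_classical_dataset(seed=None, per_cluster=100, n_indexes=3):
    """Return (samples, labels): 3 clusters of per_cluster samples with n_indexes values each."""
    rng = random.Random(seed)
    generators = [
        lambda: rng.gauss(-2, 0.14),   # N(-2, 0.14^2); second argument is the standard deviation
        lambda: rng.expovariate(10),   # Exp(10), ASSUMED to be a rate parameter
        lambda: rng.uniform(9, 15),    # U(9, 15)
    ]
    samples, labels = [], []
    for label, draw in enumerate(generators):
        for _ in range(per_cluster):
            samples.append([draw() for _ in range(n_indexes)])
            labels.append(label)
    return samples, labels


# Ten random datasets, each to be clustered once per t-parameter.
datasets = [make_classical_dataset(seed=k) for k in range(10)]
t_values = [1.00, 1.50, 2.00, 2.50, 3.00, 3.50, 4.00]  # t-parameters in percent
```

The true labels returned alongside the samples are what a clustering accuracy or Kappa coefficient would be computed against.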

Comparing Table 14 with Table 13, the clustering accuracy and Kappa coefficient of the FN-CFSFDP algorithm are lower overall than those of the CFSFDP algorithm. This is because the FN-CFSFDP algorithm applies to continuous fuzzy numbers, specifically normal fuzzy numbers in the random simulations. Each fuzzy number has multiple states; a continuous fuzzy number such as a normal fuzzy number contains infinitely many states, and each state takes the form of a classical number (although it is essentially a fuzzy number) and participates in the definite integral computation of the improved Euclidean distance. By contrast, a classical number has only one state, so the amount of information involved in the classical computation is much less than in the fuzzy computation. Moreover, the original Euclidean distance involves no definite integral, so the classical calculation is much simpler and incurs a smaller error.

Returning to the parameter adaptability and clustering stability of the FN-CFSFDP algorithm, we can formulate a standard that quantifies the degree of both. Let ${CA}_{ij}^{(t)}$ be the clustering accuracy of the $j$ -th clustering of the $i$ -th dataset under the $t$ -parameter. Suppose that the $i$ -th dataset is clustered ${NC}_{i}^{(t)}$ times, and let $\text{Count}\left[{{CA}_{ij}^{(t)}}\right]$ denote the number of times the value ${CA}_{ij}^{(t)}$ appears among these ${NC}_{i}^{(t)}$ clusterings. Define the quantity $\vartheta_{i}^{(t)}$ as:

$\displaystyle\vartheta_{i}^{(t)}=\frac{\max\left\{\text{Count}\left[{CA}_{ij}^{(t)}\right]\right\}}{{NC}_{i}^{(t)}},$ (21)

$0\leqslant\vartheta_{i}^{(t)}\leqslant 1$ . Similarly, let ${KC}_{ij}^{(t)}$ be the Kappa coefficient of the $j$ -th clustering of the $i$ -th dataset under the $t$ -parameter. Suppose that the $i$ -th dataset is clustered ${NC}_{i}^{(t)}$ times, and let $\text{Count}\left[{{KC}_{ij}^{(t)}}\right]$ denote the number of times the value ${KC}_{ij}^{(t)}$ appears among these ${NC}_{i}^{(t)}$ clusterings. Define the quantity $\psi_{i}^{(t)}$ as:

$\displaystyle\psi_{i}^{(t)}=\frac{\max\left\{\text{Count}\left[{KC}_{ij}^{(t)}\right]\right\}}{{NC}_{i}^{(t)}},$ (22)

$0\leqslant\psi_{i}^{(t)}\leqslant 1$ . The comprehensive efficacy function of parameter adaptability and clustering stability of the $i$ -th dataset under the $t$ -parameter is defined as:

$\displaystyle f\left(\vartheta_{i}^{(t)},\psi_{i}^{(t)}\right)=\sqrt{\vartheta_{i}^{(t)}\cdot\psi_{i}^{(t)}}.$ (23)

Obviously, $f\left(\vartheta_{i}^{(t)},\psi_{i}^{(t)}\right)$ satisfies $0\leqslant f\left(\vartheta_{i}^{(t)},\psi_{i}^{(t)}\right)\leqslant 1$ . The quantitative evaluation of parameter adaptability and clustering stability can be carried out according to the value of $f\left(\vartheta_{i}^{(t)},\psi_{i}^{(t)}\right)$ , as shown in Table 15.

Table 15

Values of comprehensive efficacy function and evaluating standard

Value of comprehensive efficacy function	Evaluating standard
0.00 $\sim$ 0.33	Poor
0.34 $\sim$ 0.66	Moderate
0.67 $\sim$ 0.99	Strong
1.00	Almost perfect
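
Equations (21) to (23) and the grading of Table 15 can be sketched as follows. The sketch assumes, as the values in Table 16 suggest, that the clusterings counted for one dataset are its seven runs under the different $t$ -parameters. The function names are illustrative, and the handling of values that fall between the printed bands of Table 15 (for example 0.335) is an assumption.

```python
from collections import Counter
from math import sqrt


def max_share(values):
    """Eq. (21)/(22): count of the most frequent value divided by the number of clusterings."""
    return max(Counter(values).values()) / len(values)


def efficacy(ca_values, kc_values):
    """Eq. (23): geometric mean of the CA share and the KC share."""
    return sqrt(max_share(ca_values) * max_share(kc_values))


def grade(f):
    """Table 15 bands; the cut points between printed bands are an assumption."""
    if f >= 1.0:
        return "Almost perfect"
    if f >= 0.67:
        return "Strong"
    if f >= 0.34:
        return "Moderate"
    return "Poor"


# CA (%) and KC of the sixth fuzzy dataset, Table 13(f), for t = 1.00% to 4.00%.
ca_6 = [70.00, 72.00, 72.00, 72.00, 72.00, 72.00, 72.00]
kc_6 = [0.5500, 0.5800, 0.5800, 0.5800, 0.5800, 0.5800, 0.5800]
f_6 = efficacy(ca_6, kc_6)
print(round(f_6, 4), grade(f_6))
```

For the sixth dataset, six of the seven runs share the same accuracy and the same Kappa coefficient, so the function evaluates to 6/7, about 0.8571, which is graded Strong.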

Table 16

Evaluating results for 10 fuzzy datasets

SNFD	VCEF	CER
1	0.7143	Strong
2	0.5714	Moderate
3	0.2857	Poor
4	1.0000	Almost perfect
5	0.4286	Moderate
6	0.8571	Strong
7	0.8571	Strong
8	0.2857	Poor
9	0.5714	Moderate
10	1.0000	Almost perfect

This is only an initial evaluation standard, not an axiom; different evaluation standards can be formulated according to specific problems and needs. According to the above evaluation standard, the comprehensive evaluation results of parameter adaptability and clustering stability for the 10 groups of fuzzy datasets are shown in Table 16. SNFD represents ‘Serial Number of Fuzzy Datasets’; VCEF represents ‘Value of Comprehensive Efficacy Function’; CER represents ‘Comprehensive Evaluating Result’.

Thus, the geometric mean of the comprehensive efficacy values of the 10 groups of fuzzy datasets is 0.5995, and the arithmetic mean is 0.6571. The average level of parameter adaptability and clustering stability over the 10 groups of fuzzy datasets is therefore moderate. On this basis, it is reasonable to consider that the FN-CFSFDP algorithm has a certain degree of parameter adaptability and clustering stability. If both the number of $t$ -parameters and the number of datasets were at least 30, so that both satisfy the large-sample condition of statistics, the results of the comprehensive evaluation would be more credible.
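The two averages can be reproduced from the VCEF column of Table 16. The snippet below is only an arithmetic check; it uses the printed four-decimal values, so it is accurate only to the precision of the table.

```python
from math import prod

# VCEF values of the 10 fuzzy datasets, copied from Table 16.
vcef = [0.7143, 0.5714, 0.2857, 1.0000, 0.4286,
        0.8571, 0.8571, 0.2857, 0.5714, 1.0000]

arithmetic_mean = sum(vcef) / len(vcef)
geometric_mean = prod(vcef) ** (1 / len(vcef))

print(f"arithmetic mean = {arithmetic_mean:.4f}")
print(f"geometric mean  = {geometric_mean:.4f}")
```

Both averages fall inside the 0.34 to 0.66 band of Table 15, which is why the overall level is graded moderate.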

6. Conclusions and prospects

To meet the clustering requirements of samples with fuzzy information, this paper proposed a novel clustering algorithm for fuzzy numbers on fuzzy sets that retains the advantages of CFSFDP. However, the FN-CFSFDP algorithm still has two shortcomings. Firstly, although the samples are fuzzy numbers established on fuzzy sets, the membership between sample and cluster is still determined in the classical mathematical way. In fuzzy mathematics, calculations among multiple fuzzy sets or fuzzy numbers, mainly including fuzzy close degree, fuzzy degree and fuzzy distance, transform fuzzy quantities into classical ones, so the calculated results are classical values. Although this transformation is reasonable in fuzzy mathematics, a certain amount of information may be lost. In addition, fuzzy samples generally contain infinite information, which may lower the clustering accuracy to a certain degree compared with the results of CFSFDP. Secondly, the shortcoming of CFSFDP in automatically identifying the number of clusters has yet to be adequately addressed in the FN-CFSFDP algorithm. In view of these shortcomings, further improvements can be attempted in the following two respects. First, the relation between sample and cluster can be defined on fuzzy sets, and an FCM clustering algorithm on fuzzy numbers can be developed so that both samples and memberships are established on fuzzy sets, which may reduce the loss of information. Second, the identification of geometrical images by the naked eye should be abandoned in favor of a novel quantitative mathematical method, so that even tiny differences can be identified.

Overall, the greatest contribution of this study is to extend the clustering objects from classical sets to fuzzy sets, which can provide direction for further investigation of clustering problems on fuzzy sets.

References

1. Han J.W., Kamber and Pei, Clustering analysis, in: Data Mining: Concept and Technique, MK imprint of Elsevier, New York, 2012, pp. 478–490.
2. Macqueen, On convergence of K-means and partitions with minimum average variance, Annals of Mathematical Statistics 36(3) (1965), 1084–1084.
3. Kuang and Zhang L.C., A scheduling algorithm based on CLARA clustering, in: Proceedings of the International Conference on Green Energy and Sustainable Development, 2017, pp. 1–7.
4. R.T. and Han J.W., CLARANS: a method for clustering objects for spatial data mining, IEEE Transactions on Knowledge and Data Engineering 14(5) (2002), 1003–1016.
5. Barni, Cappellini and Mecocci, A possibilistic approach to clustering-comments, IEEE Transactions on Fuzzy Systems 4(3) (1996), 393–396.
6. Bezdek J.C., Ehrlich and Full, FCM-The fuzzy c-means clustering-algorithm, Computers & Geosciences 10(2-3) (1984), 191–203.
7. R.N., Wang X.L. and Ding J.D., Multilevel core-sets based aggregation clustering algorithm, Journal of Software 24(3) (2013), 490–506.
8. Guha, Rastogi and Shim, Rock: a robust clustering algorithm for categorical attributes, Information Systems 25(5) (2000), 345–366.
9. Wilcox et al., Simulation tests of galaxy cluster constraints on chameleon gravity, Monthly Notices of the Royal Astronomical Society 462(1) (2016), 715–725.
10. Stratman et al., Identification of critical inspection samples among railroad wheels by similarity-based agglomerative clustering, Integrated Computer-Aided Engineering 18(3) (2011), 203–219.
11. Lorbeer et al., Variations on the clustering algorithm BIRCH, Big Data Research 11 (2018), 44–53.
12. Lin et al., Application of B-KFCM algorithm in the clustering of load in substations, in: Proceedings of the International Conference on Electrical and Electronic Engineering, 2014, pp. 51–56.
13. Benmouiza and Cheknane, Density-based spatial clustering of application with noise algorithm for the classification of solar, in: Proceedings of 2016 8th International Conference on Modelling, Identification & Control, 2016, pp. 279–283.
14. Kanagala H.K. and Krishnaiah V.V.J.R., A comparative study of K-means, DBSCAN and OPTICS, in: Proceedings of 2016 International Conference on Computer Communication and Informatics, 2016.
15. Terrazas and Krasnogor, A genotype-phenotype-fitness assessment protocol for evolutionary self-assembly wang tiles design, Memetic Computing 5(1) (2013), 19–33.
16. Rodriguez and Laio, Clustering by fast search and find of density peaks, Science 344(6191) (2014), 1492–1496.
17. Roy and Mandal J.K., A delaunay triangulation preprocessing based fuzzy-encroachment graph clustering for large scale GIS, in: Proceedings of 2012 International Symposium on Electronic System Design, 2012, pp. 300–305.
18. et al., CAMAS: a cluster-aware multiagent system for attributed graph clustering, Information Fusion 37 (2017), 10–21.
19. Zaki M.J. et al., CLICKS: an effective algorithm for mining subspace clusters in categorical datasets, Data & Knowledge Engineering 60(1) (2007), 51–70.
20. Aghabozorgi et al., A hybrid algorithm for clustering of time series data based on affinity search technique, Scientific World Journal (2014), 1–12.
21. Dat N.D. et al., Sting algorithm used English sentiment classification in a parallel environment, International Journal of Pattern Recognition and Artificial Intelligence 31(7) (2017), 1–30.
22. Sheikholeslami, Chatterjee and Zhang A.D., WaveCluster: a wavelet-based clustering approach for spatial data in very large databases, VLDB Journal 8(3-4) (2000), 289–304.
23. Chrobak, Durr and Nilsson B.J., Competitive strategies for online CLIQUE clustering, in: Proceedings of 9th International Conference of Algorithms and Complexity, 2015, pp. 101–113.
24. Holmes and Pfahringer, Clustering large datasets using Cobweb and K-means in tandem, in: Proceedings of 17th Annual Australian Conference on Artificial Intelligence, 2004, pp. 368–379.
25. Zander, Nguyen and Armitage, Automated traffic classification and application identification using machine learning, in: Proceedings of the 2005 IEEE Conference on Local Computer Networks, 2005.
26. Kohonen, Self-organized formation of topologically correct feature maps, Biological Cybernetics 43(1) (1982), 59–69.
27. and Wunsch, Survey of clustering algorithms, IEEE Transactions on Neural Networks 16(3) (2005), 645–678.
28. Zhang X.C., Data Clustering, Science Press, Beijing, 2017, 1–388.
29. Zadeh L.A., Fuzzy sets, Information and Control 8(3) (1965), 338–353.
30. Chen S.L., J.G. and Wang X.G., Fuzzy sets and membership function, in: The Theory of Fuzzy Sets and Its Application, Science Press, Beijing, 2005, pp. 1–3.
31. Ruspini E.H., A new approach to clustering, Information and Control 15(1) (1969), 22–32.
32. Dunn J.C., A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters, Journal of Cybernetics 3(3) (1974), 32–57.
33. Bezdek J.C., Cluster validity, in: Pattern Recognition With Fuzzy Objective Function Algorithms, Plenum Press, New York, 1981, pp. 95–154.
34. Krishnapuram and Keller J.M., A possibilistic means algorithms, IEEE Trans on Fuzzy Systems 1(2) (1993), 98–110.
35. Pal N.R. et al., A possibilistic fuzzy c-means clustering algorithm, Journal of Cybernetics 3(3) (1974), 32–57.
36. Yang M.S. and Wu K.L., Unsupervised possibilistic clustering, Pattern Recognition 39(1) (2006), 5–21.
37. Zhang J.S. and Leung Y.W., Improved possibilistic C-means clustering algorithms, IEEE Trans on Fuzzy Systems 12(2) (2004), 209–217.
38. Kriegel H.P. and Pfeifle, Density-based clustering of uncertain data, in: Proceedings of the 11th ACM SIGKDD international Conference on Knowledge Discovery in Data Mining, 2005, pp. 672–677.
39. Tepwankul and Maneewongwattana, U-DBSCAN: a density-based clustering algorithm for uncertain objects, in: Proceedings of 2010 IEEE 26th International Conference on Data Engineering Workshops, 2010, pp. 209–217.
40. Erdem and Gundem T.I., M-FDBSCAN: A multicore density-based uncertain data clustering algorithm, Turkish Journal of Electrical Engineering and Computer Sciences 22(1) (2014), 143–154.
41. Smiti and Elouedi, DBSCAN-GM: an improved clustering method based on Gaussian means and DBSCAN techniques, in: Proceedings of the 16th International Conference on Intelligent Engineering Systems, 2012, pp. 573–578.
42. Tran T.N., Drab and Daszykowski, Revised DBSCAN algorithm to cluster data with dense adjacent clusters, Chemometrics and Intelligent Laboratory Systems 120(2) (2013), 92–96.
43. C.L. et al., Research on important places identification method based on improved CFSFDP algorithm, Application Research of Computers 34(1) (2017), 136–140.
44. Jiang L.Q. et al., Optimization of clustering by fast search and find of density peaks, Application Research of Computers 33(11) (2016), 3251–3254.
45. Wang, Han and Li, Fixed point theorems of fuzzy integer value mappings and optimization management to balance problem, International Journal of Fuzzy Systems 19(3) (2017), 829–837.
46. Zhu J.Y., Fuzzy distance and fuzzy degree, in: Non-classical Mathematics for Intelligent Systems, Huazhong University of Science and Technology Press, Wuhan, 2001, pp. 75–81.
47. Al-Shammary et al., Fractal self-similarity measurements based clustering technique for SOAP web messages, Journal of Parallel and Distributed Computing 73(5) (2013), 664–676.
48. Cohen, A coefficient of agreement for nominal scales, Educational and Psychological Measurement 20(1) (1960), 37–46.
49. Landis J.R. and Koch G.G., The measurement of observer agreement for categorical data, Biometrics 33(1) (1977), 159–174.
