A fuzzy mixed data clustering algorithm by fast search and find of density peaks

Abstract

If we endow an intelligent system with fuzzy logic, we hope that it can deal with fuzzy data, including the clustering of fuzzy data. This paper proposes a fuzzy mixed data clustering algorithm by fast search and find of density peaks (FMTD-CFSFDP), which is a development of the CFSFDP clustering algorithm. The proposed algorithm is a kind of density-based clustering method established using fuzzy sets for fuzzy mixed data. Mathematical definitions for fuzzy mixed data are presented. Combined with the definition of traditional fuzzy Euclidean distance, we defined an improved Euclidean distance for both continuous and discrete fuzzy sets with smaller error. On this basis, the weight between continuous and discrete indicators is introduced for establishing the global difference for fuzzy mixed data. Referring to the clustering procedures of the CFSFDP algorithm, a Gaussian Kernel function for fuzzy samples is calculated and the clustering procedures of our proposed algorithm are described in detail. Furthermore, four different sets of random simulations are performed, which illustrates the feasibility of the proposed algorithm.

Keywords

Fuzzy mixed data; fuzzy set; FMTD-CFSFDP algorithm; improved Euclidean distance; overall distance

1. Introduction

Clustering analysis refers to the process of dividing data objects or observed objects into various subsets in accordance with certain standards such as distance or density. Each subset represents a cluster. Clustering is a kind of generalized system, as it has three key elements: input, processing and output. When clustering, we input some objects to be clustered (sample sets or datasets), the clustering algorithm processes these objects according to certain standards (e.g., a mathematical model), and it outputs different clusters which contain different objects.

Clustering analysis is a kind of unsupervised learning process, with the aim of forming clusters such that objects within a cluster are similar to one another and dissimilar to objects in other clusters [1, 22]. Currently, clustering analysis has been widely applied in many fields, including business intelligence, biological safety, Web retrieval, evaluation and decision. Clustering analysis can serve as an independent tool or as a preprocessing step in other algorithms. According to the opinion presented by Chen [3], there are mainly six types of clustering algorithms: partitional clustering [4, 5, 6, 7], hierarchical clustering [8, 9, 10, 11], density-based clustering [12, 13, 14], grid-based clustering [15, 16], probability-model-based clustering [17, 18] and constraint-based clustering [19, 20]. Certainly, this is an artificial classification and cannot cover all clustering algorithms; some, such as graph-theory-based clustering, do not fall into any particular type [21, 22, 23]. Each algorithm has its own properties. In this article, a clustering algorithm for samples containing discrete and continuous mixed fuzzy sets is proposed, namely, a fuzzy mixed data clustering algorithm by fast search and find of density peaks (hereinafter referred to as the FMTD-CFSFDP algorithm). This will be covered in detail in later sections.

In addition to the introduction, the rest of the paper is organized into six sections. The first section introduces the related work, and the second section presents the required mathematical definitions. The third section details the steps of the FMTD-CFSFDP clustering algorithm as well as other formulas, including some analysis and comparisons. The next section presents simulations, using artificially established fuzzy mixed datasets on the basis of normal and discrete fuzzy sets. Four sets of random simulations were used, and the clustering results are presented in the fifth section. Finally, three innovative points and three shortcomings of FMTD-CFSFDP are discussed and three recommendations for improvements are proposed.

2. Related works

The algorithm proposed in this article is based on the CFSFDP algorithm which employed fuzzy clustering and combined this with the idea of fuzzy mathematics. Therefore the proposed algorithm has some overlap with CFSFDP, some original fuzzy clustering algorithms and even classical algorithms such as the DBSCAN algorithm. Here we introduce these related works.

In recent years, a popular clustering algorithm, clustering by fast search and find of density peaks (CFSFDP), was proposed by Rodriguez and Laio in 2014. This is a density-based algorithm [24], which can automatically identify the number of clusters and can be applied to data with non-spherical clusters. Many scholars have made improvements to this algorithm. By utilizing the characteristics of standard variance of the expected cluster center and the mean value of the noise cluster, Mehmood et al. proposed the Fuzzy-CFSFDP algorithm, which can improve the automatic identification of cluster centers [25]. Wan et al. raised some questions about the decision graph method in searching for cluster centers and then proposed a kind of fuzzy CFSFDP algorithm, as well as an optimized fuzzy CFSFDP algorithm based on manifold distance and standard deviation-based cutoff distance. Additionally, they employed two synthetic datasets for validating the effectiveness of their algorithm [26]. Gao et al. [27] proposed an improved algorithm, ICFS, on the basis of CFSFDP and redesigned the computational formula for the cutoff distance and the method of selecting cluster centers, which can enhance the robustness of the algorithm. Further, a novel allocation strategy of non-center points and merging/splitting processes in clusters were put forward so as to enhance the clustering precision and scalability. Shen and Zhang proposed a method to determine the position of cluster centers based on the variation rules of multilevel high-order difference, which improved the identification of cluster centers using CFSFDP [28]. Combining the CFSFDP algorithm with a hierarchy protocol, Zhang et al. effectively improved energy consumption in wireless sensor networks (WSNs) and advanced an improved CFSFDP-E algorithm by taking residual energy into account [29]. Qin et al. [30] combined terahertz time-domain spectroscopy (THz-TDS) with CFSFDP to reduce the dimension of THz spectral data using principal component analysis and developed a PCA-CFSFDP algorithm for pesticide detection.

There are many other algorithms in existence that improve upon the original. One interesting algorithm is the mixed, adaptive and optimized CFSFDP (MAO-CFSFDP) algorithm, which was proposed by Li et al. [31] for mixed datasets. Using the MAO-CFSFDP algorithm, the mixed distance with optimal weights and the optimal cutoff distance (which is also referred to as the optimal threshold value) for automatic extraction can be determined based on the concepts of Hopkins statistics, data field and entropy. For the same mixed dataset, the MAO-CFSFDP algorithm exhibited higher clustering accuracy than the $k$-prototypes algorithm. Moreover, Li and Chen addressed some practical problems using this algorithm [32], which further confirms its reliability. These algorithms, together with the algorithm proposed in this paper, will be compared in detail in Section 3. Although a lot of improvements have been made, the CFSFDP algorithm and its successors were still established on classical sets, i.e., these algorithms can be regarded as classical clustering algorithms.

However, many objects in real life possess no strict attributes, i.e., two-valued (either/or) logic is not applicable to these objects. For example, if a student is standing in the aisle outside the classroom, record this state as 1; otherwise, record the state as 0. However, if a student has one foot in the aisle and the other foot in the classroom, this state cannot be described by classical two-valued logic. As another example, different people understand the term “young people” differently. Some think that a young person is aged 18 to 30; some think that the range should be 18 to 40; and some think that young people are those older than 20 and younger than 50. Because of these different perceptions, it is difficult to determine exactly which age range should be used. The meaning of this term is fuzzy, and thus the concept cannot be described by classical two-valued logic. There are many such words or concepts, such as “good”, “very good”, “bad”, “very bad”, “delicious”, “comfortable”, and so on. Fuzziness must be taken into account in describing these words or concepts, which requires knowledge of fuzzy mathematics. According to Zadeh’s theory [33], all these objects are fuzzy objects with fuzzy characteristics. Currently, some common algorithms including PCM [34], FCM [35] and PFCM [36] cannot be regarded as fuzzy clustering algorithms on fuzzy sets. Strictly speaking, these algorithms only rely on statistical uncertainty theory (such as probability distributions and Bayesian models) and the clustering objects are not fuzzy. An intelligent system may require not only classical two-valued logic but also fuzzy logic, so that it can use clustering to process fuzzy data.

Based on the theory of fuzzy sets, this study proposes an extension of the CFSFDP algorithm for fuzzy mixed data consisting of continuous and discrete fuzzy sets, called the FMTD-CFSFDP algorithm. For a clearer understanding of the contributions made in this paper, we compared the FMTD-CFSFDP algorithm with other commonly used classical and fuzzy clustering algorithms in basic concepts and logic types in Table 1.

Table 1
Comparison between FMTD-CFSFDP algorithm and other common algorithms

	FMTD-CFSFDP	FCM	CFSFDP	DBSCAN	SOM	K-means
Object type	Mixed type	Numerical type	Numerical type	Numerical type	Numerical type	Numerical type
Object state	Fuzzy uncertainty	Definite	Definite	Definite	Definite	Definite
Mathematical theoretical basis	Fuzzy set theory	Classic set theory, probability model	Classic set theory	Classic set theory	Classic set theory	Classic set theory
Membership between objects and clusters	Belong to or not, two-valued logic	Uncertain, based on a certain probability	Belong to or not, two-valued logic	Belong to or not, two-valued logic	Belong to or not, two-valued logic	Belong to or not, two-valued logic

There are obvious differences between the FMTD-CFSFDP algorithm and other algorithms in concept and logic type. In the following sections, after we introduce all the steps of the FMTD-CFSFDP algorithm, we will compare FMTD-CFSFDP with the original CFSFDP algorithm and other improved CFSFDP algorithms in detail (see Table 5).

Briefly speaking, the FMTD-CFSFDP algorithm can satisfy the clustering requirements for samples with fuzzy mixed data. In addition to the inheritance of most of the advantages of CFSFDP, FMTD-CFSFDP still exhibits three important innovative points. Firstly, FMTD-CFSFDP expands the application of the CFSFDP algorithm from classical sets to fuzzy sets. Secondly, the data field function proposed in [37] is used to calculate the optimal threshold and reduce the dependence on parameters. Thirdly, the Euclidean distance for fuzzy sets is improved, and improved Euclidean distances for continuous and discrete fuzzy sets are defined so as to reduce the error and achieve more reasonable measurement.

3. Related mathematical definitions

This study mainly focuses on the CFSFDP algorithm for fuzzy mixed data using fuzzy sets. The mathematical definition of fuzzy mixed data should first be identified. As the name implies, fuzzy mixed data are data that are composed of both continuous and discrete fuzzy data. Next, the related concepts of fuzzy sets and fuzzy numbers as well as their logical relations will be introduced. It should be noted that fuzzy data in this paper refers to fuzzy sets rather than fuzzy numbers, which will be described in detail below.

Definition 1: Classical sets For a domain of discourse $X$ , $A$ is a subset of $X$ . If, for every $x\in X$ , exactly one of $x\in A$ and $x\notin A$ holds, $A$ can be regarded as a classical set. For the domain of discourse $X$ , the subset $A$ is uniquely determined by the following characteristic function $\chi_{A}$ , which can be written as Eq. (1):

$\displaystyle\chi_{A}\left(x\right)=\left\{\begin{array}[]{lc}1,&x\in A,\\ 0,&x\notin A,\\ \end{array}\right.$ (1)

where $\chi_{A}:X\to\{{0,1}\}$ . The characteristic function will then be extended to the range of the fuzzy set [0, 1].

Definition 2 [38]: Fuzzy sets Assuming $\mu_{A}$ is a mapping from the domain of discourse $X$ to the closed interval [0, 1], if $\mu_{A}:X\to[{0,1}]$ and $x\to\mu_{A}(x)$ can be satisfied, $\mu_{A}$ determines a fuzzy subset $A$ in the domain of discourse $X$ ; further, $A$ is a fuzzy set, and $\mu_{A}(x)$ represents the membership function of $A$ or the membership grade of $x$ to the fuzzy set $A$ .

Definition 3: Continuous fuzzy sets If the fuzzy subset $A$ is an infinite set in the domain of discourse $X$ , $A$ in the above Definition 2 is a continuous fuzzy set. According to the Zadeh expression method, $A$ can be described as Eq. (2):

$\displaystyle A=\int_{X}{\frac{\mu_{A}\left(x\right)}{x}}$ (2)

It should be noted that $\int$ is only a kind of expression pattern rather than the integral sign in the ordinary sense.

Definition 4: Convex fuzzy set Assuming that the domain of discourse $X$ is in Euclidean space and $A$ is a fuzzy subset in $X$ , the necessary and sufficient conditions for a convex fuzzy set $A$ in $X$ can be written as Eq. (3):

$\displaystyle\mu_{A}\left[k\cdot x_{2}+\left(1-k\right)\cdot x_{1}\right]\geqslant\mu_{A}\left(x_{1}\right)\wedge\mu_{A}\left(x_{2}\right)$ (3)

where $x_{1},x_{2}\in X$ , ‘ $\wedge$ ’ denotes the minimization operation of Zadeh operators, $\mu_{A}(x)$ denotes the membership function of $A$ or the membership grade of $x$ to the fuzzy set $A$ , and $k\in[{0,1}]$ .

Definition 5: Continuous regular fuzzy set With regard to $A$ , a fuzzy subset in the domain of discourse $X$ , $A$ is a continuous regular fuzzy set if and only if there exists $x_{0}\in X$ such that $\mu_{A}\left({x_{0}}\right)=1$ .

Definition 6: Continuous fuzzy number Assuming a domain of discourse $X=\mathbb{R}$ and an infinite set $A$ ( $A\in{\cal F}(\mathbb{R})$ , where ${\cal F}(\mathbb{R})$ represents all fuzzy sets in the real number domain $\mathbb{R}$ ), if $A$ is a continuous regular convex fuzzy set (i.e., $A$ simultaneously satisfies Definitions 4 and 5), $A$ can be regarded as a continuous fuzzy number.

Some common continuous fuzzy numbers include triangular fuzzy numbers, trapezoidal fuzzy numbers, normal fuzzy numbers, Cauchy fuzzy numbers and Sharp – $\Gamma$ fuzzy numbers. In addition, interval numbers are also a special kind of fuzzy numbers, and fuzzy numbers expand interval numbers [39]. Continuous fuzzy numbers are a kind of continuous fuzzy sets that simultaneously satisfy regularity and convexity.

Definition 7: Discrete fuzzy set If the fuzzy subset $A$ is a finite set in the domain of discourse $X$ , $A$ as defined in Definition 2 is a discrete fuzzy set and can be expressed using the following Zadeh distribution as Eq. (4):

$\displaystyle A=\sum_{g=1}^{h}{\frac{\mu_{A}\left({x_{g}}\right)}{x_{g}}}$ (4)

where ${\forall x}_{g}\in X$ and $A={\{{x_{g}}\}}_{g=1}^{h}$ . Here, $\sum$ and $\frac{\mu_{A}({x_{g}})}{x_{g}}$ are two sign representations rather than addition and division in mathematics.

Definition 8: Discrete fuzzy number Assuming a domain of discourse $X=\mathbb{R}$ and a finite fuzzy subset $A$ in $\mathbb{R}$ , there exist $h$ elements $x_{1},x_{2},\cdots,x_{h}\in\mathbb{R}$ that satisfy the order relation $x_{1}<x_{2}<\ldots<x_{h}$ so that the support set of $A$ (denoted as Supp $A$ ) satisfies $\text{Supp}\,A=\{x\in\mathbb{R}\mid A(x)>0\}=\{x_{1},x_{2},\ldots,x_{h}\}$ . If there exist two natural numbers $s$ and $t$ ( $1\leqslant s\leqslant t\leqslant h$ ) such that $A(x)$ satisfies the following conditions: (1) when $s\leqslant l\leqslant t$ , $A({x_{l}})=1$ ; (2) when $1\leqslant i\leqslant j\leqslant s$ , $A({x_{i}})\leqslant A({x_{j}})$ ; (3) when $t\leqslant i\leqslant j\leqslant h$ , $A({x_{i}})\geqslant A({x_{j}})$ , where $A(x)$ denotes the membership degree of the element $x$ to the finite fuzzy subset $A$ , then $A$ is a discrete fuzzy number.

It can be seen that discrete fuzzy numbers are a special kind of discrete fuzzy sets. Based on the above definitions, the relation between fuzzy sets and fuzzy numbers can be concluded: a fuzzy number is a special kind of fuzzy set, and fuzzy sets cover a wider range than fuzzy numbers. To make the algorithm more universal, fuzzy sets are used, i.e., a continuous fuzzy set is treated as continuous fuzzy data while a discrete fuzzy set is treated as discrete fuzzy data. Accordingly, unless otherwise specified herein, continuous fuzzy data refers to continuous fuzzy sets and discrete fuzzy data refers to discrete fuzzy sets.
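As an illustration of Definition 8, the following minimal Python sketch checks conditions (1) to (3) for a discrete fuzzy set given as (element, membership) pairs. The function name and the pair-list representation are assumptions made for this example only; they are not part of the original algorithm description.

```python
def is_discrete_fuzzy_number(points):
    """Check conditions (1)-(3) of Definition 8.

    points: list of (x, membership) pairs describing the support of A
    (hypothetical representation chosen for this sketch).
    """
    pts = sorted(points)                       # enforce x_1 < x_2 < ... < x_h
    xs = [x for x, _ in pts]
    mu = [m for _, m in pts]
    if len(set(xs)) != len(xs):                # support elements must be distinct
        return False
    if any(not (0.0 < m <= 1.0) for m in mu):  # support requires 0 < A(x) <= 1
        return False
    ones = [i for i, m in enumerate(mu) if m == 1.0]
    if not ones:                               # some element must have A(x) = 1
        return False
    s, t = ones[0], ones[-1]
    # (1) membership equals 1 everywhere between x_s and x_t
    plateau = all(m == 1.0 for m in mu[s:t + 1])
    # (2) membership is non-decreasing up to x_s
    rising = all(mu[i] <= mu[i + 1] for i in range(s))
    # (3) membership is non-increasing from x_t onwards
    falling = all(mu[i] >= mu[i + 1] for i in range(t, len(mu) - 1))
    return plateau and rising and falling
```

A discrete fuzzy set that fails this check (for instance, one whose memberships never reach 1) is still a valid discrete fuzzy set in the sense of Definition 7, which is why the algorithm works with fuzzy sets rather than fuzzy numbers.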

Definition 9: Fuzzy mixed data A fuzzy mixed data set (i.e., fuzzy mixed data) consists of several continuous fuzzy sets and several discrete fuzzy sets.

Based on these definitions, the detailed clustering procedures for the FMTD-CFSFDP algorithm are described below, and a series of important parameters are also provided.

4. Clustering steps, analysis and comparison of FMTD-CFSFDP

In this section, the clustering steps of the FMTD-CFSFDP algorithm are introduced in detail. In addition, the corresponding formulas of the algorithm, frameworks, flow charts, analysis of time complexity and comparisons between related algorithms are also listed.

4.1 Initialization and pretreatment

This subsection covers most of the clustering steps and mainly introduces the calculations of measurement, threshold, density and special distance. The following subsections introduce each step in detail.

4.1.1 Calculate the distance between the continuous fuzzy sets of two fuzzy samples $\tilde{s}_{r}$ and $\tilde{s}_{t}$ ( $L_{C}(r,t)$ )

This study first assumes $M$ fuzzy samples under $N$ indexes. Specifically, the $N$ indexes include $N_{1}$ quantitative indexes, whose contents are continuous fuzzy sets, and $N_{2}$ qualitative indexes, whose contents are discrete fuzzy sets ( $N_{1}+N_{2}=N$ ). Each fuzzy sample in the fuzzy sample set $\tilde{S}={\{{\tilde{s}_{i}}\}}_{i=1}^{M}$ can be regarded as a multivariate fuzzy vector or a multivariate fuzzy point and expressed as

$\displaystyle\tilde{s}_{i}=\left[\underbrace{C_{i}^{(1)}\left(\forall x\mid x\in\Omega_{i}^{(1)}\right),\ldots,C_{i}^{(N_{1})}\left(\forall x\mid x\in\Omega_{i}^{(N_{1})}\right)}_{N_{1}},\underbrace{\left\{\left(x_{ij}^{(1)}\mid\mu_{ij}^{(1)}\right)\right\}_{j=1}^{d_{1}},\ldots,\left\{\left(x_{ij}^{(N_{2})}\mid\mu_{ij}^{(N_{2})}\right)\right\}_{j=1}^{d_{N_{2}}}}_{N_{2}}\right],$

where $C_{i}^{({k_{1}})}({\forall x|{x\in\Omega_{i}^{({k_{1}})}}})$ denotes the continuous fuzzy set corresponding to the $k_{1}$-th index of the $i$-th sample ( $1\leqslant k_{1}\leqslant N_{1}$ ), $\Omega_{i}^{({k_{1}})}$ denotes the domain of discourse of the continuous fuzzy set ( $C_{i}^{({k_{1}})}\in{\cal F}({\Omega_{i}^{({k_{1}})}})$ ), ${\{{({x_{ij}^{({k_{2}})}|{\mu_{ij}^{({k_{2}})}}})}\}}_{j=1}^{d_{k_{2}}}$ denotes the discrete fuzzy set corresponding to the $k_{2}$-th index of the $i$-th sample, and $d_{k_{2}}$ denotes the number of elements in the discrete fuzzy set ( $d_{k_{2}}\geqslant 2$ ). Each sample has $N$ fuzzy sets ( $N_{1}+N_{2}=N$ ), and accordingly, the whole fuzzy sample set $\tilde{S}={\{{\tilde{s}_{i}}\}}_{i=1}^{M}$ includes $M\cdot N$ fuzzy sets, which constitute a fuzzy mixed data set (also referred to as fuzzy mixed data).

To calculate the distance between two samples, the distance between the two samples under each index should first be calculated. For the indexes with continuous fuzzy sets, assume that $C_{r}^{(j)}(x)$ denotes the membership function of the continuous fuzzy set $C_{r}^{(j)}$ of the fuzzy sample $r$ under the $j$-th index and $C_{t}^{(j)}(x)$ denotes the membership function of the continuous fuzzy set $C_{t}^{(j)}$ of the fuzzy sample $t$ under the $j$-th index, with $x_{r}^{(j)}=[{a_{r}^{(j)},b_{r}^{(j)}}]\subset\mathbb{R}$ , $x_{t}^{(j)}=[{a_{t}^{(j)},b_{t}^{(j)}}]\subset\mathbb{R}$ , and $\mathbb{R}$ as the domain of discourse. The Euclidean distance between the two continuous fuzzy sets under the $j$-th index, denoted $L_{C}^{(j)}({r,t})$ , can then be calculated as Eq. (5):

$\displaystyle L_{C}^{(j)}\left(r,t\right)=\left[\int\limits_{x_{r}^{(j)}\cup x_{t}^{(j)}}\left|C_{r}^{(j)}\left(x\right)-C_{t}^{(j)}\left(x\right)\right|^{2}dx\right]^{1/2}.$ (5)

If all $N_{1}$ indexes are quantitative, i.e., each fuzzy sample includes $N_{1}$ continuous fuzzy sets, the distance between two samples $\tilde{s}_{r}$ and $\tilde{s}_{t}$ under all $N_{1}$ quantitative indexes, denoted as $L_{C}({r,t})$ , can be calculated as Eq. (6):

$\displaystyle L_{C}\left(r,t\right)=\sum\limits_{j=1}^{N_{1}}L_{C}^{(j)}\left(r,t\right)=\sum\limits_{j=1}^{N_{1}}\left[\int\limits_{x_{r}^{(j)}\cup x_{t}^{(j)}}\left|C_{r}^{(j)}\left(x\right)-C_{t}^{(j)}\left(x\right)\right|^{2}dx\right]^{1/2}.$ (6)

Equation (6) is derived by summing Eq. (5) with the operator $\sum$ , i.e., $L_{C}({r,t})$ is equal to the sum of $L_{C}^{(j)}({r,t})$ over the $N_{1}$ indexes. This calculation of $L_{C}({r,t})$ is a two-step process, which should be improved so as to reduce systematic error. This study also assumes two bounded intervals $x_{r}^{(j)}=[{a_{r}^{(j)},b_{r}^{(j)}}]\subset\mathbb{R}$ and $x_{t}^{(j)}=[{a_{t}^{(j)},b_{t}^{(j)}}]\subset\mathbb{R}$ , and a bounded interval $x_{rt}^{(j)}=[{a_{rt}^{(j)},b_{rt}^{(j)}}]$ , where $a_{rt}^{(j)}=a_{r}^{(j)}\wedge a_{t}^{(j)}$ and $b_{rt}^{(j)}=b_{r}^{(j)}\vee b_{t}^{(j)}$ . Here, ‘ $\vee$ ’ represents the maximization operation of Zadeh operators. For the $N_{1}$ indexes, the interval $x_{rt}=[{a_{rt},b_{rt}}]$ that satisfies $x_{rt}=[{\mathop{\wedge}\limits_{j=1}^{N_{1}}a_{rt}^{(j)},\mathop{\vee}\limits_{j=1}^{N_{1}}b_{rt}^{(j)}}]$ can be regarded as the maximum public integration domain. In addition, if the integration domains $x_{P}^{(j)}$ of the fuzzy sets under $P$ indexes ( $1\leqslant P\leqslant N_{1}$ ) are bilaterally or unilaterally unbounded, the public integration domain is computed in the same way as when $x_{P}^{(j)}$ is a bounded interval, i.e., the maximum public integration domain should be searched for. An unbounded interval can be regarded as the infinite extension of a bounded interval. To cover all conditions, assuming that $x_{rt}$ denotes the maximum public integration domain of two fuzzy samples $\tilde{s}_{r}$ and $\tilde{s}_{t}$ , the improved Euclidean distance between $\tilde{s}_{r}$ and $\tilde{s}_{t}$ over the $N_{1}$ continuous fuzzy sets, denoted as $L_{C}({r,t})$ , can thus be calculated as Eq. (7):

$\displaystyle L_{C}\left(r,t\right)=\left[\int\limits_{x_{rt}}\sum\limits_{j=1}^{N_{1}}\left|C_{r}^{(j)}\left(x\right)-C_{t}^{(j)}\left(x\right)\right|^{2}dx\right]^{1/2}.$ (7)

Through error analysis, it can be seen that Eq. (7) exhibits smaller error and higher precision than Eq. (6) (see Appendix 1 for the detailed mathematical derivation).
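To make Eqs (6) and (7) concrete, the following Python sketch evaluates both distances numerically. It is only an illustration under stated assumptions: the membership functions are taken to be Gaussian-shaped (normal fuzzy sets), the integrals are approximated by a composite midpoint rule, each integration domain is taken as the interval hull $[a\wedge a', b\vee b']$ , and all function names are hypothetical.

```python
import math

def normal_membership(center, spread):
    """Membership function of a normal fuzzy set (illustrative assumption)."""
    return lambda x: math.exp(-((x - center) / spread) ** 2)

def integrate(f, a, b, steps=2000):
    """Composite midpoint rule on [a, b] (numerical stand-in for the integral)."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

def distance_eq6(funcs_r, funcs_t, domains_r, domains_t):
    """Eq. (6): sum of per-index distances, each on its own interval x_rt^(j)."""
    total = 0.0
    for fr, ft, (ar, br), (at, bt) in zip(funcs_r, funcs_t, domains_r, domains_t):
        a, b = min(ar, at), max(br, bt)
        total += math.sqrt(integrate(lambda x: (fr(x) - ft(x)) ** 2, a, b))
    return total

def distance_eq7(funcs_r, funcs_t, domains_r, domains_t):
    """Eq. (7): one integral of the summed squared differences over x_rt."""
    bounds = domains_r + domains_t
    a = min(lo for lo, _ in bounds)      # smallest lower bound over all indexes
    b = max(hi for _, hi in bounds)      # largest upper bound over all indexes

    def integrand(x):
        return sum((fr(x) - ft(x)) ** 2 for fr, ft in zip(funcs_r, funcs_t))

    return math.sqrt(integrate(integrand, a, b))

# Two samples with N_1 = 2 continuous fuzzy sets each (made-up values).
sample_r = [normal_membership(0.0, 1.0), normal_membership(5.0, 2.0)]
sample_t = [normal_membership(1.0, 1.0), normal_membership(6.0, 2.0)]
domains = [(-10.0, 20.0), (-10.0, 20.0)]
l_c_eq6 = distance_eq6(sample_r, sample_t, domains, domains)
l_c_eq7 = distance_eq7(sample_r, sample_t, domains, domains)
```

When all indexes share the same integration domain, as in this example, the value from Eq. (7) cannot exceed the value from Eq. (6), because the square root of a sum is at most the sum of the square roots.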

4.1.2 Calculate the distance between the discrete fuzzy sets of two fuzzy samples $\tilde{s}_{r}$ and $\tilde{s}_{t}$ ( $L_{D}(r,t)$ )

Assume that the indexes are discrete fuzzy sets: $D_{r}^{(k)}(x)$ denotes the membership degree of the value $x$ for the fuzzy sample $r$ under the $k$-th index, and $D_{t}^{(k)}(x)$ denotes the membership degree of the value $x$ for another fuzzy sample $t$ under the $k$-th index, whose information is also a discrete fuzzy set. If the two discrete fuzzy sets $D_{r}^{(k)}$ and $D_{t}^{(k)}$ have the same elements and ${\{{x_{g}^{(k)}}\}}_{g=1}^{h(k)}$ denotes the combined set of $D_{r}^{(k)}$ and $D_{t}^{(k)}$ , the Euclidean distance between the two discrete fuzzy sets under the $k$-th index, denoted as $L_{D}^{(k)}({r,t})$ , can be calculated as Eq. (8):

$\displaystyle L_{D}^{(k)}\left(r,t\right)=\left(\sum\limits_{g=1}^{h(k)}\left|D_{r}^{(k)}\left(x_{g}^{(k)}\right)-D_{t}^{(k)}\left(x_{g}^{(k)}\right)\right|^{2}\right)^{1/2}.$ (8)

If all $N_{2}$ indexes are qualitative, i.e., each fuzzy sample includes $N_{2}$ discrete fuzzy sets, the distance between all $N_{2}$ qualitative indexes of two fuzzy samples, denoted as $L_{D}({r,t})$ , can be calculated as Eq. (9):

$\displaystyle L_{D}\left(r,t\right)=\sum\limits_{k=1}^{N_{2}}L_{D}^{(k)}\left(r,t\right)=\sum\limits_{k=1}^{N_{2}}\left(\sum\limits_{g=1}^{h(k)}\left|D_{r}^{(k)}\left(x_{g}^{(k)}\right)-D_{t}^{(k)}\left(x_{g}^{(k)}\right)\right|^{2}\right)^{1/2}.$ (9)

Equation (9) can be acquired by conducting a summation of Eq. (8) with the use of the operator ‘ $\sum$ ’, i.e., $L_{D}({r,t})$ can be calculated by adding the Euclidean distances between the discrete fuzzy sets under $N_{2}$ indexes. The whole calculation of $L_{D}({r,t})$ includes two steps, which has a similar shortcoming to the calculation of $L_{C}({r,t})$ . Therefore, for reducing systematic error, the Euclidean distance between discrete fuzzy sets should be improved as shown in Eq. (10):

$\displaystyle L_{D}\left(r,t\right)=\left(\sum\limits_{k=1}^{N_{2}}\sum\limits_{g=1}^{h(k)}\left|D_{r}^{(k)}\left(x_{g}^{(k)}\right)-D_{t}^{(k)}\left(x_{g}^{(k)}\right)\right|^{2}\right)^{1/2}.$ (10)

By referring to the improved Euclidean distance between continuous fuzzy sets, it can also be proved that the error calculated according to Eq. (10) is smaller than that according to Eq. (9). For a detailed proof, see Appendix 1. The improved Euclidean distance between discrete fuzzy sets can thus be calculated according to Eq. (10). After the calculation of $L_{C}({r,t})$ and $L_{D}({r,t})$ , the weight $\lambda$ is introduced as the weight coefficient between the two kinds of distances.
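The difference between Eqs (9) and (10) can be seen in a small Python sketch. Each sample is represented here as a list of dictionaries, one per qualitative index, mapping an element to its membership degree. This representation, the function names and the example values are assumptions made for illustration; an element missing from one set is treated as having membership 0, whereas the text assumes that both sets share the same elements.

```python
import math

def squared_index_distance(d_r, d_t):
    """Sum of squared membership differences under one index (inside of Eq. 8)."""
    elements = set(d_r) | set(d_t)       # combined element set {x_g}
    return sum((d_r.get(x, 0.0) - d_t.get(x, 0.0)) ** 2 for x in elements)

def discrete_distance_eq9(sample_r, sample_t):
    """Eq. (9): sum over indexes of the per-index Euclidean distances."""
    return sum(math.sqrt(squared_index_distance(d_r, d_t))
               for d_r, d_t in zip(sample_r, sample_t))

def discrete_distance_eq10(sample_r, sample_t):
    """Eq. (10): square root taken once, over all indexes and elements."""
    return math.sqrt(sum(squared_index_distance(d_r, d_t)
                         for d_r, d_t in zip(sample_r, sample_t)))

# Two samples with N_2 = 2 discrete fuzzy sets each (made-up values).
sample_r = [{"low": 0.2, "mid": 1.0, "high": 0.4}, {"a": 1.0, "b": 0.5}]
sample_t = [{"low": 0.6, "mid": 0.7, "high": 0.4}, {"a": 0.2, "b": 0.5}]
l_d_eq9 = discrete_distance_eq9(sample_r, sample_t)    # 0.5 + 0.8
l_d_eq10 = discrete_distance_eq10(sample_r, sample_t)  # sqrt(0.25 + 0.64)
```

In this example the per-index squared differences are 0.25 and 0.64, so Eq. (9) gives about 1.3 while Eq. (10) gives about 0.943.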

4.1.3 Calculate the weight $\lambda$ and overall distance between two fuzzy samples ( $L(r,t)$ )

According to the above sections, the distance between two fuzzy samples $\tilde{s}_{r}$ and $\tilde{s}_{t}$ can be calculated by combining Eqs (7) and (10). $L_{C}({r,t})$ and $L_{D}({r,t})$ can be calculated from the membership functions or the membership degrees of $2N_{1}$ continuous fuzzy sets and $2N_{2}$ discrete fuzzy sets: since the distance is a measure between two samples and each sample corresponds to $N_{1}$ continuous fuzzy sets, two samples correspond to $2N_{1}$ continuous fuzzy sets, and likewise to $2N_{2}$ discrete fuzzy sets. However, the two types of distances do not necessarily contribute in the same proportion. By referring to the similarity measurement proposed by Huang in the $K$-prototypes algorithm [40], a weight $\lambda$ should be determined to calculate the weighted distance.

As stated above, there exist $N_{1}$ continuous fuzzy sets and $N_{2}$ discrete fuzzy sets. The proportions of continuous and discrete fuzzy sets can be calculated as $\lambda^{C}={N_{1}}/N$ and $\lambda^{D}={N_{2}}/N$ , respectively; accordingly, $\lambda$ can be defined as Eq. (11):

$\displaystyle\lambda={\lambda^{D}}/\lambda^{C}=N_{2}/N_{1}.$ (11)

4.1.4 Calculate the overall distance between two fuzzy samples $\tilde{s}_{r}$ and $\tilde{s}_{t}$ ( $L(r,t)$ )

Since the distance is symmetric, i.e., $L({r,t})=L({t,r})$ , the overall distance between two fuzzy samples $\tilde{s}_{r}$ and $\tilde{s}_{t}$ can be calculated as Eq. (12):

$\displaystyle L\left(r,t\right)=L_{C}\left(r,t\right)+\lambda\cdot L_{D}\left(r,t\right).$ (12)
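As a small worked example of Eqs (11) and (12), the following Python sketch combines a continuous and a discrete distance into the overall distance. The function names and the numerical values are hypothetical and chosen only for illustration.

```python
def mixing_weight(n_continuous, n_discrete):
    """Eq. (11): lambda = N_2 / N_1 (requires at least one continuous index)."""
    if n_continuous <= 0:
        raise ValueError("Eq. (11) is undefined when N_1 = 0")
    return n_discrete / n_continuous

def overall_distance(l_c, l_d, n_continuous, n_discrete):
    """Eq. (12): L(r, t) = L_C(r, t) + lambda * L_D(r, t)."""
    return l_c + mixing_weight(n_continuous, n_discrete) * l_d

# Made-up values: N_1 = 3 continuous and N_2 = 2 discrete indexes.
example_l = overall_distance(l_c=0.9, l_d=0.6, n_continuous=3, n_discrete=2)
```

With these values, $\lambda=2/3$ and $L(r,t)=0.9+(2/3)\cdot 0.6\approx 1.3$ . The guard for $N_{1}=0$ is an addition of this sketch, since Eq. (11) divides by $N_{1}$ .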

4.1.5 Determine the optimal threshold $L^{*}$

According to the method proposed by Rodriguez and Laio in [24], $L^{*}$ can be selected so as to make the average number of neighboring samples of each sample occupy 1% $\sim$ 4% of the total number of samples. It should be noted that the neighboring samples are the remaining samples with a distance from the data point lower than $L^{*}$ . Using this method, the value of $L^{*}$ should be determined based on individual practical experience, and the chosen value of $L^{*}$ can certainly affect the clustering results. In this study, unlike general methods, the minimum potential entropy in the data field of samples is used to automatically determine the optimal threshold $L^{*}$ . The present determination of the threshold $L^{*}$ partly refers to the method in [37]. Assuming that there exists a set $\tilde{S}={\{{\tilde{s}_{i}}\}}_{i=1}^{M}$ in the sample’s data space $\mathchar 22\mskip-10.0mu \lambda$ , a fuzzy sample object in $\tilde{S}$ can be treated as a physical object that propagates its sample distribution in a given task, thereby forming a data field of fuzzy samples. For any sample $\tilde{s}_{r}\in\mathchar 22\mskip-10.0mu \lambda$ , the field function $\varphi_{r}$ can be mathematically expressed as $\varphi_{r}=\sum_{t=1}^{M}{\left[{m_{t}\times K\left({\frac{\tilde{s}_{r}-\tilde{s}_{t}}{\sigma}}\right)}\right]}$ , where $\sigma$ denotes an influence factor that affects the distribution of the final potential. Here, $m_{t}$ denotes the mass of $\tilde{s}_{t}$ , $K(x)$ denotes a unit potential function that describes how the distribution of sample objects spreads to the whole data field (generally, $K(x)$ is set as a Gaussian kernel function), and $\tilde{s}_{r}-\tilde{s}_{t}$ denotes the azimuth distance between two fuzzy samples. If the sample data field is a scalar field, $m_{t}=1$ . When using a Gaussian kernel function, $\tilde{s}_{r}-\tilde{s}_{t}$ is equal to $L({r,t})$ , i.e., the overall distance between two fuzzy samples $\tilde{s}_{r}$ and $\tilde{s}_{t}$ , and the potential of each fuzzy sample, denoted $\varphi_{r}$ , can be calculated as Eq. (13):

$\displaystyle\varphi_{r}=\sum\limits_{t=1}^{M}\exp\left[-\left(\frac{L\left(r,t\right)}{\sigma}\right)^{2}\right].$ (13)

If the density of fuzzy samples (which will be described in detail in the following section) is also set as a Gaussian kernel function, $\varphi_{r}$ is equivalent to the sample density. In that case, $\sigma=L^{*}$ ; accordingly, the optimization of $L^{*}$ can be transformed into the optimization of the influence factor $\sigma$ in the data field, i.e., $\sigma$ can be optimized by searching for the minimum potential entropy. The potential entropy $H$ can be calculated as Eq. (14):

$\displaystyle H=-\sum\limits_{r=1}^{M}{\theta_{r}\cdot\text{ln}\ \theta_{r}},$ (14)

where $\theta_{r}$ denotes a normalization factor ( $0\leqslant\theta_{r}\leqslant 1$ ). $\theta_{r}$ can be defined as Eq. (15):

$\displaystyle\theta_{r}=\frac{\varphi_{r}}{\sum\limits_{t=1}^{M}\varphi_{t}}.$ (15)

By substituting Eq. (13) into Eq. (15), $\theta_{r}$ can be solved; next, by substituting $\theta_{r}$ into Eq. (14), the whole mathematical expression includes only one unknown parameter, $\sigma$ . The value of $\sigma$ corresponding to the minimum $H$ is then solved. In that case, $\sigma=L^{*}$ , i.e., $L^{*}$ can be automatically extracted.
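The procedure of Eqs (13) to (15) can be sketched in Python as follows. The text does not specify how the minimum of $H$ is searched for, so the sketch simply evaluates $H$ on a user-supplied grid of candidate values of $\sigma$ ; this grid search, the function names and the example distance matrix are assumptions of the illustration.

```python
import math

def potentials(dist, sigma):
    """Eq. (13): potential of every sample for a given influence factor sigma.

    dist is a symmetric M x M matrix (list of lists) of overall distances L(r, t).
    """
    m = len(dist)
    return [sum(math.exp(-(dist[r][t] / sigma) ** 2) for t in range(m))
            for r in range(m)]

def potential_entropy(dist, sigma):
    """Eqs (14) and (15): entropy of the normalized potentials."""
    phi = potentials(dist, sigma)
    total = sum(phi)
    thetas = [p / total for p in phi]
    return -sum(th * math.log(th) for th in thetas if th > 0.0)

def optimal_threshold(dist, candidates):
    """Pick the candidate sigma with minimum potential entropy (grid search)."""
    return min(candidates, key=lambda s: potential_entropy(dist, s))

# Toy example: four samples whose pairwise distances come from points on a line.
positions = [0.0, 1.0, 2.0, 10.0]
dist = [[abs(a - b) for b in positions] for a in positions]
l_star = optimal_threshold(dist, [0.25, 0.5, 1.0, 2.0, 4.0, 8.0])
```

For very small or very large $\sigma$ , all potentials become nearly equal and $H$ approaches its maximum $\ln M$ , so the minimum lies at an intermediate value; a finer grid or a one-dimensional optimizer could replace the coarse candidate list used here.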

4.1.6 Calculate the density of $M$ fuzzy samples (${\{{\rho_{r}}\}}_{r=1}^{M}$) and generate the subscript sequence in descending order (${\{{q_{r}}\}}_{r=1}^{M}$)

Using a Gaussian kernel function, the density of the fuzzy samples, denoted $\rho_{r}$ , can be calculated as Eq. (16):

$\displaystyle\rho_{r}=\sum\limits_{t=1}^{T}\exp\left[{-{\left({\frac{L(r,t)}{L^{*}}}\right)}^{2}}\right].$ (16)

Here, $L(r,t)=L(t,r)$ and $T=M(M-1)/2$. The subscript sequence ${\{{q_{r}}\}}_{r=1}^{M}$ arranges the densities in descending order, i.e., it satisfies $\rho_{q_{1}}\geqslant\rho_{q_{2}}\geqslant\ldots\geqslant\rho_{q_{M}}$.
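The density calculation and the descending subscript sequence can be sketched in Python as follows. This is a minimal illustration that assumes the overall distances are stored in a symmetric matrix and that the density of a sample sums the Gaussian kernel over the other $M-1$ samples; the exclusion of the self-term, the toy data, and the names `gaussian_density` and `descending_order` are assumptions made for the example. Indices are 0-based.

```python
import numpy as np

def gaussian_density(L, L_star):
    """Density rho_r of every sample from the distance matrix L (cf. Eq. (16))."""
    kernel = np.exp(-(L / L_star) ** 2)
    np.fill_diagonal(kernel, 0.0)    # assumption: the self-term t = r is skipped
    return kernel.sum(axis=1)

def descending_order(rho):
    """Subscript sequence q such that rho[q[0]] >= rho[q[1]] >= ..."""
    return np.argsort(-rho, kind="stable")

# Toy stand-in for the overall distances: gaps between 1-D points.
points = np.array([0.0, 0.1, 0.2, 3.0])
L_demo = np.abs(points[:, None] - points[None, :])
rho = gaussian_density(L_demo, L_star=0.5)
q = descending_order(rho)
print(rho, q)
```

In this toy example the middle point of the tight group has the greatest density and the isolated point has the smallest, so they appear first and last in `q`.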

4.1.7 Calculate the special distance between $M$ fuzzy samples (${\{{\delta_{r}}\}}_{r=1}^{M}$) and find the corresponding fuzzy sample according to the serial number ${\{{N_{r}}\}}_{r=1}^{M}$

The set of $M$ special distances, denoted as ${\{{\delta_{r}}\}}_{r=1}^{M}$, is obtained from the values $\delta_{q_{r}}$, which are defined as Eq. (17):

$\displaystyle\delta_{q_{r}}=\left\{\begin{array}{ll}\mathop{\min}\limits_{q_{t}:t<r}\{L(q_{r},q_{t})\},&r\geqslant 2,\\ \mathop{\max}\limits_{q_{t}:t\geqslant 2}\left\{\delta_{q_{t}}\right\},&{r=1.}\\ \end{array}\right.$ (17)

When $r\geqslant 2$, $\mathop{\min}\limits_{q_{t}:t<r}\{L(q_{r},q_{t})\}$ denotes the minimum distance between the sample $\tilde{s}_{q_{r}}$ and the samples with greater density than $\tilde{s}_{q_{r}}$, i.e., ${\{{\tilde{s}_{q_{t}}}\}}_{t=1}^{r-1}$. Here, $L({q_{r},q_{t}})$ denotes the overall distance between the sample $\tilde{s}_{q_{r}}$ and a sample $\tilde{s}_{q_{t}}$ that precedes it in the descending order of density. When $r=1$, once all values ${\{{\delta_{q_{t}}}\}}_{t=2}^{M}$ have been calculated, $\delta_{q_{1}}$ is set to the maximum value in ${\{{\delta_{q_{t}}}\}}_{t=2}^{M}$ ($r,t\in\{{1,2,\ldots,M}\}$). After ${\{{\delta_{q_{r}}}\}}_{r=1}^{M}$ is calculated, it is rewritten as the set ${\{{\delta_{r}}\}}_{r=1}^{M}$ for the convenience of subsequent calculations, since the ordering is no longer needed; ${\{{\delta_{r}}\}}_{r=1}^{M}$ and ${\{{\delta_{q_{r}}}\}}_{r=1}^{M}$ have the same elements. Accordingly, the serial numbers ${\{{N_{r}}\}}_{r=1}^{M}$ can be found, where $N_{r}$ denotes the serial number of the closest fuzzy sample among all fuzzy samples with greater density than $\tilde{s}_{r}$.
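A minimal Python sketch of Eq. (17) is given below. It assumes a symmetric distance matrix and at least two samples; the function name `special_distance`, the array name `nearest` (playing the role of the serial numbers $N_{r}$, 0-based, with $-1$ for the densest sample), and the toy data are illustrative assumptions.

```python
import numpy as np

def special_distance(L, rho):
    """Special distance delta_r (Eq. (17)) and nearest denser sample per sample."""
    M = len(rho)
    q = np.argsort(-rho, kind="stable")      # descending order of density
    delta = np.zeros(M)
    nearest = np.full(M, -1)                 # -1: no denser sample exists
    for r in range(1, M):
        denser = q[:r]                       # samples with greater density
        t = denser[np.argmin(L[q[r], denser])]
        delta[q[r]] = L[q[r], t]
        nearest[q[r]] = t
    delta[q[0]] = delta[q[1:]].max()         # case r = 1 of Eq. (17)
    return delta, nearest

# Toy stand-in for the overall distances and hypothetical densities.
points = np.array([0.0, 0.1, 0.2, 3.0, 3.1])
L_demo = np.abs(points[:, None] - points[None, :])
rho_demo = np.array([2.0, 3.0, 1.5, 1.0, 0.5])
delta, nearest = special_distance(L_demo, rho_demo)
print(delta, nearest)
```

The result is indexed by the original sample numbers, which corresponds to rewriting ${\{{\delta_{q_{r}}}\}}_{r=1}^{M}$ as ${\{{\delta_{r}}\}}_{r=1}^{M}$.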

4.2 Determine the serial number of the cluster center (${\{{m_{g}}\}}_{g=1}^{n_{c}}$), where $\tilde{s}_{m_{g}}$ denotes the fuzzy sample at the center of the $m_{g}$-th cluster

The cluster center can be determined by calculating the comprehensive measure value of $\rho_{r}$ and $\delta_{r}$ . The comprehensive measure value, denoted $\gamma_{r}$ , can be calculated as Eq. (18):

$\displaystyle\gamma_{r}=\rho_{r}\cdot\delta_{r}.$ (18)

Next, let ${\{{c_{r}}\}}_{r=1}^{M}$ be a set of marks recording the cluster in which each cluster center is located, where $c_{r}$ denotes that the $r$-th fuzzy sample in $\tilde{S}={\{{\tilde{s}_{i}}\}}_{i=1}^{M}$ belongs to the $c_{r}$-th cluster. $c_{r}$ can be defined as Eq. (19):

$\displaystyle c_{r}=\left\{\begin{array}{ll}k,&\tilde{s}_{r}\in\left\{\tilde{s}_{m_{g}}\right\}_{g=1}^{n_{c}}\wedge\left\{c_{r}=k\right\},\\ {-1,}&\text{otherwise}\\ \end{array}\right.$ (19)

where $\tilde{s}_{r}\in{\left\{{\tilde{s}_{m_{g}}}\right\}}_{g=1}^{n_{c}}$ denotes that the fuzzy sample $\tilde{s}_{r}$ is the cluster center; $c_{r}=k$ denotes that the sample $\tilde{s}_{r}$ belongs to the $k$ -th cluster; ‘ $\wedge$ ’ represents the ‘and’ relation, i.e., both conditions should be simultaneously satisfied.
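Eqs (18) and (19) can be illustrated with the short Python sketch below. It assumes that the number of clusters is designated manually and that the centers are taken as the samples with the largest $\gamma_{r}$; this selection rule, the function name `mark_centers`, and the toy values are assumptions made for the example. Cluster marks are 0-based here.

```python
import numpy as np

def mark_centers(rho, delta, n_c):
    """Comprehensive value gamma (Eq. (18)) and center marks c (cf. Eq. (19))."""
    gamma = rho * delta
    centers = np.argsort(-gamma, kind="stable")[:n_c]   # assumed rule: top n_c
    c = np.full(len(rho), -1)                # -1 marks non-center samples
    c[centers] = np.arange(n_c)              # k-th center opens cluster k
    return gamma, centers, c

# Hypothetical densities and special distances for five samples.
rho_demo = np.array([2.0, 3.0, 1.5, 1.0, 0.5])
delta_demo = np.array([0.1, 2.8, 0.1, 2.8, 0.1])
gamma, centers, c = mark_centers(rho_demo, delta_demo, n_c=2)
print(gamma, centers, c)
```

Only samples that combine a high density with a large special distance receive a large $\gamma_{r}$, which is why the two well-separated dense samples are marked as centers in this toy case.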

4.3 Classification of the fuzzy samples that are not the cluster centers

The classification of the fuzzy samples that are not cluster centers is performed in descending order of density. Each non-center fuzzy sample $\tilde{s}_{q_{r}}$ is assigned to the cluster of its closest sample with greater density, so that each cluster is expanded layer by layer with the use of ${\{{N_{r}}\}}_{r=1}^{M}$. If the distances between $\tilde{s}_{q_{r}}$ and several samples with greater densities are identical, $\tilde{s}_{q_{r}}$ is randomly assigned to the cluster of one of these samples $\tilde{s}_{q_{t}}$ ($t<r$).
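The layer-by-layer expansion can be sketched in Python as follows. The sketch assumes that the densest sample is a cluster center and that the closest denser sample of each sample has been precomputed, so ties are resolved by that precomputation rather than randomly; the function name `assign_clusters` and the toy inputs are illustrative assumptions.

```python
import numpy as np

def assign_clusters(rho, nearest, c):
    """Propagate cluster marks from the centers in descending order of density."""
    labels = np.array(c, copy=True)
    for r in np.argsort(-np.asarray(rho), kind="stable"):
        if labels[r] != -1:
            continue                         # cluster centers keep their mark
        if nearest[r] < 0:
            raise ValueError("the densest sample must be a cluster center")
        labels[r] = labels[nearest[r]]       # inherit from closest denser sample
    return labels

# Hypothetical inputs: densities, closest denser samples, and center marks.
rho_demo = np.array([2.0, 3.0, 1.5, 1.0, 0.5])
nearest_demo = np.array([1, -1, 1, 2, 3])
c_demo = np.array([-1, 0, -1, 1, -1])
labels = assign_clusters(rho_demo, nearest_demo, c_demo)
print(labels)
```

Because samples are visited from high to low density, the sample from which a mark is inherited has always been classified before it is needed.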

4.4 Analysis and comparison of the FMTD-CFSFDP algorithm

Thus, we have introduced all the clustering steps of the FMTD-CFSFDP algorithm. To present the approach more clearly, its framework is given in Table 2.

Table 2
Framework of FMTD-CFSFDP

Algorithm: FMTD-CFSFDP
Input: $M$ fuzzy samples with continuous fuzzy sets and discrete fuzzy sets; the number of indexes $N$; if the number of clusters is designated manually, the designated number of clusters $c$ (otherwise, $c$ need not be input)
Output: several clusters and clustering results
Step 1	Calculate the improved Euclidean distance between any two fuzzy samples by Eqs (7), (10)–(12)
Step 2	Calculate the optimized threshold $d_{c}$ (i.e., $L^{*}$) by Eqs (13)–(15)
Step 3	Calculate the density $\rho_{r}$ of fuzzy samples by Eq. (16) and sort the densities in descending order
Step 4	Calculate the special distance $\delta_{r}$ from the descending ordered densities and Eq. (17)
Step 5	Calculate the comprehensive value $\gamma_{r}$ by Eq. (18), find the cluster centers, and mark them by Eq. (19)
Step 6	Classify the other fuzzy samples which are not cluster centers


In addition to Table 2, the flowchart of the FMTD-CFSFDP algorithm is given in Fig. 1.

Figure 1.

Flowchart of the FMTD-CFSFDP algorithm.

In order to compare with the FMTD-CFSFDP algorithm, the framework and flowchart of the original CFSFDP algorithm are listed here. The framework of CFSFDP is shown in Table 3.

Table 3

Framework of CFSFDP

Algorithm: CFSFDP
Input: $M$ classical samples with classical data; the number of indexes $N$; if the number of clusters is designated manually, the designated number of clusters $c$ (otherwise, $c$ need not be input). The $t$-parameters are necessary to calculate the threshold, $t\in[{0.01,0.04}]$
Output: several clusters and clustering results
Step 1	Calculate the measurement between any two samples by Minkowski distance or similarity degree
Step 2	Calculate the threshold $d_{c}$ from the ascending ordered measurement and the $t$-parameters
Step 3	Calculate the density $\rho_{r}$ of the sample by the cut-off kernel function, Gaussian kernel function or exponential kernel function, and sort the densities in descending order
Step 4	Calculate the special measurement $\delta_{r}$ from the descending ordered densities and the ascending ordered measurement
Step 5	Calculate the comprehensive value $\gamma_{r}$, find the cluster centers and mark them
Step 6	Classify the other classical samples which are not cluster centers


The flowchart of CFSFDP is given in Fig. 2.

Figure 2.

Flowchart of the CFSFDP algorithm.

As the frameworks and flowcharts of the two algorithms show, FMTD-CFSFDP and CFSFDP differ little in form. The differences lie mainly in the clustering objects, the calculation of the measurement, the parameters, and the calculation of the threshold. Furthermore, to analyze the time complexity of FMTD-CFSFDP, we list the maximum time complexity of the clustering steps, parameters and key equations in Table 4.

Table 4

Maximum time complexity

Steps	Equations or sub-steps	Time complexity	Maximum time complexity
Input	$M$ samples	$O\left(1\right)$	$O\left(1\right)$
	$N$ indexes	$O\left(1\right)$
	$c$ clusters	$O\left(1\right)$
Step 1	Eq. (7)	$O\left({{N\cdot M}^{2}}\right)$	$O\left({{N\cdot M}^{2}}\right)$
	Eq. (10)	$O\left({{N\cdot M}^{2}}\right)$
	Eq. (11)	$O\left(1\right)$
	Eq. (12)	$O\left(1\right)$
Step 2	Eq. (13)	$O\left(M\right)$	$O\left(M\right)$
	Eq. (14)	$O\left(M\right)$
	Eq. (15)	$O\left(1\right)$
Step 3	Eq. (16)	$O\left(T\right)$	$O\left({M^{2}}\right)$
	$\rho_{q_{1}}\geqslant\rho_{q_{2}}\geqslant\ldots\geqslant\rho_{q_{M}}$	$O\left({M^{2}}\right)$
Step 4	$\textit{ascending}\left[{L\left({q_{r},q_{t}}\right)\mid r>t,\ 2\leqslant r\leqslant M}\right]$	$O\left({{\left({M-1}\right)}^{2}}\right)$	$O\left({{\left({M-1}\right)}^{2}}\right)$
	$\textit{descending}\left[{\delta_{q_{t}}\mid 2\leqslant t\leqslant M,\ r=1}\right]$	$O\left({{\left({M-1}\right)}^{2}}\right)$
	Eq. (17)	$O\left(M\right)$
Step 5	Eq. (18)	$O\left(1\right)$	$O\left(1\right)$
	Eq. (19)	$O\left(1\right)$
Step 6	Classify the other fuzzy samples which are not cluster centers	$O\left(M\right)$	$O\left(M\right)$
Output	Different clusters	$O\left(1\right)$	$O\left(1\right)$

Because the whole FMTD-CFSFDP algorithm is executed in a sequential manner, its maximum time complexity is that of its most expensive step, $O({{N\cdot M}^{2}})$. The time complexity of CFSFDP is $O({M^{2}})$ [41]. It can thus be seen that the FMTD-CFSFDP algorithm has a higher time complexity than the CFSFDP algorithm. Next, we list in detail some algorithms that relate to or improve on the CFSFDP algorithm. Including CFSFDP and FMTD-CFSFDP, 11 clustering algorithms are compared in Table 5.

Table 5

Comparison of related algorithms

Algorithm Name (Including Abbreviation)	Five key features/steps of the algorithm: 1) clustering objects; 2) measurement; 3) threshold; 4) density function; 5) method for identifying the number of clusters.
CFSFDP [24]	For classical data, the algorithm uses Euclidean distance as measurement, cut-off distance as the threshold, cut-off kernel as the density function, and identifies the number of clusters via decision graphs.
Fuzzy-CFSFDP [25]	For classical data, the algorithm uses Euclidean distance as measurement, cut-off distance as the threshold, cut-off kernel as the density function, and identifies the number of clusters by expected cluster centers and standard deviation of special distance $\delta_{r}$ (equivalent to Eq. (17)).
Optimized Fuzzy CFSFDP [26]	For classical data, the algorithm uses Manifold distance as measurement, then, cut-off distance based on manifold distance and standard deviation to calculate the threshold, cut-off kernel as the density function, and identifies the number of clusters via fuzzy rules.
ICFS [27]	For classical data, the algorithm uses Euclidean distance as measurement, then, cut-off distance based on manifold distance and standard deviation to calculate the threshold, using cut-off kernel as the density function, and identifies the number of clusters by a “bump point” in decreasing ordered comprehensive value $\gamma_{r}$ (equivalent to Eq. (18)).
Automatically selecting cluster centers in CFSFDP [28]	For classical data, the algorithm uses Euclidean distance as measurement, then, field function and potential entropy to calculate the threshold, a Gaussian kernel as the density function, and identifies the number of clusters by decision graphs and turning points.
CFSFDP-E [29]	For wireless sensor networks (WSN for short, the data for WSN are also classical data), the algorithm uses transmission distance as measurement, then, the algorithm uses the dynamic cut-off distance as the threshold, using density-energy defined in this article as the density function, and identifies the number of clusters by a new comprehensive value which is $\gamma_{r}$ multiplied by the “residual energy of node”.
PCA-CFSFDP [30]	For THz spectra data (THz spectra data are also classical data), the measurement is calculated from the dataset after dimension reduction with PCA; the algorithm then uses cut-off distance as the threshold, a Gaussian kernel as the density function, and identifies the number of clusters by decision graphs.
MAO-CFSFDP [31, 32]	For samples with multi-dimensional classical data, the algorithm uses weighted mixed Euclidean distance as measurement, then, field function and potential entropy to calculate the optimal threshold, a Gaussian kernel as the density function, and identifies the number of clusters by decision graphs.
ALCCE-CFSFDP [42]	For received signal strength indication (RSSI for short, the data for RSSI are also classical data), the algorithm uses Euclidean distance as measurement, then, the algorithm uses cut-off distance (which is different to CFSFDP in terms of the parameters’ value range) as the threshold, cut-off kernel as the density function, and identifies the number of clusters by decision graphs.
Uncertain GM-CFSFDP [43]	For interval data (a kind of uncertain classical data), the algorithm uses $E$ -ML distance proposed in this reference as measurement, then, the algorithm uses a grid density threshold, using the average density of the mesh, and identifies the number of clusters by decision graphs.
FMTD-CFSFDP (proposed in this article)	For fuzzy samples with fuzzy continuous sets and discrete fuzzy sets, the algorithm uses weighted mixed improved Euclidean distance (“improved Euclidean distance” for short) as measurement, then, field function and potential entropy to calculate the optimal threshold, a Gaussian kernel as the density function, and identifies the number of clusters by decision graphs.

Obviously, the main difference between FMTD-CFSFDP and the other algorithms is the type of clustering objects: the other algorithms focus on classical data, whereas FMTD-CFSFDP involves fuzzy mixed data.

5. Random simulations

Subsequently, four sets of random simulations in total were conducted, with the simulation conditions listed in Table 6.

Table 6
Simulation conditions

	Parameters	Values
Hardware	CPU	AMD Athlon (tm) II X4 630 Processor
	CPU Clock Speed	2.80 GHz
	Memory	4.00 GB
Software	OS	Windows 7 Home Basic
	Tool	R version 2.15.1 (2012-06-22)

Tables 7–10 list the relevant information for the four sets of simulations, in which R denotes the fetching rule and T denotes the parameter type. “Fetching rule” represents the rules for generating clustering objects in random simulation. For a detailed explanation of these simulations, see Appendix 2.

Table 7

Relevant information for the first set of simulations

$M=$ 200		Index ( $N_{1}=$ 2, $N_{2}=$ 1)
$N=$ 3		Index A		Index B		Index C
$C=$ 2		Normal fuzzy set		Normal fuzzy set		Discrete fuzzy set
Cluster 1	T	$\mu_{p}^{\left(A\right)}$	$\sigma_{p}^{\left(A\right)}$	$\mu_{p}^{\left(B\right)}$	$\sigma_{p}^{\left(B\right)}$	$\mu_{p,g}^{\left(C\right)}$
$p=1,\ldots,100$	R	$N\left({160,2^{2}}\right)$	$N\left({2,0.16^{2}}\right)$	$N\left({45,2^{2}}\right)$	$N\left({2,0.3^{2}}\right)$	$U\left({0,1}\right)$
Cluster 2	T	$\mu_{q}^{\left(A\right)}$	$\sigma_{q}^{\left(A\right)}$	$\mu_{q}^{\left(B\right)}$	$\sigma_{q}^{\left(B\right)}$	$\mu_{q,g}^{\left(C\right)}$
$q=1,\ldots,100$	R	$N\left({168,3^{2}}\right)$	$N\left({3,0.2^{2}}\right)$	$N\left({54,3.5^{2}}\right)$	$N\left({3.5,0.5^{2}}\right)$	$U\left({0,1}\right)$
Value range of the fuzzy set		$\left({-\infty,+\infty}\right)$		$\left({-\infty,+\infty}\right)$		$\left\{{1,2,3,4,5}\right\}$
Domain of discourse $X$		$\mathbb{R}$		$\mathbb{R}$		$\mathbb{R}$
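The fetching rules of Table 7 can be illustrated with the Python sketch below. It assumes that $N(m,s^{2})$ denotes a normal distribution with mean $m$ and standard deviation $s$, that the parameters of every fuzzy sample are drawn independently, and that the discrete fuzzy set under Index C assigns one membership degree to each of the five elements; the function name `sample_cluster`, the dictionary keys, and the seed are illustrative assumptions, and the actual generation procedure is the one described in Appendix 2.

```python
import numpy as np

def sample_cluster(rng, size, mu_a, sigma_a, mu_b, sigma_b, n_elements):
    """Draw the parameters of `size` fuzzy samples for one cluster.

    mu_a, sigma_a, mu_b and sigma_b are (mean, standard deviation) pairs.
    """
    return {
        "mu_A": rng.normal(mu_a[0], mu_a[1], size),
        "sigma_A": rng.normal(sigma_a[0], sigma_a[1], size),
        "mu_B": rng.normal(mu_b[0], mu_b[1], size),
        "sigma_B": rng.normal(sigma_b[0], sigma_b[1], size),
        # One membership degree per element of the discrete fuzzy set.
        "member_C": rng.uniform(0.0, 1.0, (size, n_elements)),
    }

rng = np.random.default_rng(0)               # hypothetical seed
cluster_1 = sample_cluster(rng, 100, (160, 2), (2, 0.16), (45, 2), (2, 0.3), 5)
cluster_2 = sample_cluster(rng, 100, (168, 3), (3, 0.2), (54, 3.5), (3.5, 0.5), 5)
samples = {key: np.concatenate([cluster_1[key], cluster_2[key]])
           for key in cluster_1}
true_labels = np.repeat([0, 1], 100)         # real cluster of each sample
print(samples["mu_A"].shape, samples["member_C"].shape)
```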

Table 8

Relevant information for the second set of simulations

$M=$ 300		Index ( $N_{1}=$ 2, $N_{2}=$ 1)
$N=$ 3		Index A		Index B		Index C
$C=$ 3		Normal fuzzy set		Normal fuzzy set		Discrete fuzzy set
Cluster 1	T	$\mu_{p}^{\left(A\right)}$	$\sigma_{p}^{\left(A\right)}$	$\mu_{p}^{\left(B\right)}$	$\sigma_{p}^{\left(B\right)}$	$\mu_{p,g}^{\left(C\right)}$
$p=1,\ldots,100$	R	$N\left({1.2,0.14^{2}}\right)$	$N\left({0.15,0.15^{2}}\right)$	$N\left({7,0.15^{2}}\right)$	$N\left({0.2,0.12^{2}}\right)$	$U\left({0,1}\right)$
Cluster 2	T	$\mu_{q}^{\left(A\right)}$	$\sigma_{q}^{\left(A\right)}$	$\mu_{q}^{\left(B\right)}$	$\sigma_{q}^{\left(B\right)}$	$\mu_{q,g}^{\left(C\right)}$
$q=1,\ldots,100$	R	$N\left({8,0.25^{2}}\right)$	$N\left({0.2,0.24^{2}}\right)$	$N\left({15,0.24^{2}}\right)$	$N\left({0.24,0.14^{2}}\right)$	$U\left({0,1}\right)$
Cluster 3	T	$\mu_{r}^{\left(A\right)}$	$\sigma_{r}^{\left(A\right)}$	$\mu_{r}^{\left(B\right)}$	$\sigma_{r}^{\left(B\right)}$	$\mu_{r,g}^{\left(C\right)}$
$r=1,\ldots,100$	R	$N\left({17,0.23^{2}}\right)$	$N\left({0.3,0.2^{2}}\right)$	$N\left({22,0.2^{2}}\right)$	$N\left({0.35,0.12^{2}}\right)$	$U\left({0,1}\right)$
Value range of the fuzzy set		$\left({-\infty,+\infty}\right)$		$\left({-\infty,+\infty}\right)$		$\left\{{1,2,3}\right\}$
Domain of discourse $X$		$\mathbb{R}$		$\mathbb{R}$		$\mathbb{R}$

Table 9

Relevant information for the third set of simulations

$M=$ 300		Index ( $N_{1}=$ 2, $N_{2}=$ 1)
$N=$ 3		Index A		Index B		Index C
$C=$ 3		Normal fuzzy set		Normal fuzzy set		Discrete fuzzy set
Cluster 1	T	$\mu_{p}^{\left(A\right)}$	$\sigma_{p}^{\left(A\right)}$	$\mu_{p}^{\left(B\right)}$	$\sigma_{p}^{\left(B\right)}$	$\mu_{p,g}^{\left(C\right)}$
$p=1,\ldots,100$	R	$N\left({0.2,0.12^{2}}\right)$	$Exp\left({12}\right)$	$U\left({7,15}\right)$	$U\left({0.2,0.4}\right)$	$U\left({0,1}\right)$
Cluster 2	T	$\mu_{q}^{\left(A\right)}$	$\sigma_{q}^{\left(A\right)}$	$\mu_{q}^{\left(B\right)}$	$\sigma_{q}^{\left(B\right)}$	$\mu_{q,g}^{\left(C\right)}$
$q=1,\ldots,100$	R	$N\left({-6,0.23^{2}}\right)$	$Exp\left(9\right)$	$U\left({1,4}\right)$	$U\left({1.02,1.54}\right)$	$U\left({0,1}\right)$
Cluster 3	T	$\mu_{r}^{\left(A\right)}$	$\sigma_{r}^{\left(A\right)}$	$\mu_{r}^{\left(B\right)}$	$\sigma_{r}^{\left(B\right)}$	$\mu_{r,g}^{\left(C\right)}$
$r=1,\ldots,100$	R	$N\left({8.6,0.15^{2}}\right)$	$Exp\left({15}\right)$	$U\left({20,24}\right)$	$U\left({0.67,1.35}\right)$	$U\left({0,1}\right)$
Value range of the fuzzy set		$\left({-\infty,+\infty}\right)$		$\left({-\infty,+\infty}\right)$		$\left\{{1,2,3,4}\right\}$
Domain of discourse $X$		$\mathbb{R}$		$\mathbb{R}$		$\mathbb{R}$

Table 10

Relevant information for the fourth set of simulations

$M=$ 300		Index ( $N_{1}=$ 1, $N_{2}=$ 2)
$N=$ 3		Index A		Index B	Index C
$C=$ 3		Normal fuzzy set		Discrete fuzzy set	Discrete fuzzy set
Cluster 1	T	$\mu_{p}^{\left(A\right)}$	$\sigma_{p}^{\left(A\right)}$	$\mu_{p,g}^{\left(B\right)}$	$\mu_{p,h}^{\left(C\right)}$
$p=1,\ldots,100$	R	$N\left({1.6,0.16^{2}}\right)$	$N\left({0.12,0.16^{2}}\right)$	$U\left({0,1}\right)$	$U\left({0,1}\right)$
Cluster 2	T	$\mu_{q}^{\left(A\right)}$	$\sigma_{q}^{\left(A\right)}$	$\mu_{q,g}^{\left(B\right)}$	$\mu_{q,h}^{\left(C\right)}$
$q=1,\ldots,100$	R	$N\left({7.8,0.31^{2}}\right)$	$N\left({0.21,0.23^{2}}\right)$	$U\left({0,1}\right)$	$U\left({0,1}\right)$
Cluster 3	T	$\mu_{r}^{\left(A\right)}$	$\sigma_{r}^{\left(A\right)}$	$\mu_{r,g}^{\left(B\right)}$	$\mu_{r,h}^{\left(C\right)}$
$r=1,\ldots,100$	R	$N\left({16.7,1.03^{2}}\right)$	$N\left({0.15,0.26^{2}}\right)$	$U\left({0,1}\right)$	$U\left({0,1}\right)$
Value range of the fuzzy set		$\left({-\infty,+\infty}\right)$		${\left\{{x_{g}}\right\}}_{g=1}^{4}$	${\left\{{x_{h}}\right\}}_{h=1}^{3}$
Domain of discourse $X$		$\mathbb{R}$		$\mathbb{R}$	$\mathbb{R}$

After the four sets of random simulations (25 runs per set), decision graphs are plotted (one run chosen from each set), as shown in Fig. 3(a)–(d). These show the comprehensive measure value $\gamma$, which is used to automatically identify the number of clusters, against the corresponding serial number of the sample $n$ after the arrangement of $\gamma$ in descending order. Specifically, the y-axis represents $\gamma$ in descending order and the x-axis is the corresponding $n$. According to the explanation in the CFSFDP algorithm, the $\gamma$ values of the cluster centers form a jump point, whereas the $\gamma$ values of non-center samples vary smoothly. Therefore, the cluster centers can be found by searching for the jump point. The specific details and methods are described in [24].

Figure 3.

Decision graphs of the simulation results after the arrangement of $\gamma$ in descending order.

The present clustering results are verified and evaluated using the definition of clustering accuracy proposed by Al-Shammary et al. in [44]. According to this definition, the clustering accuracy of the algorithm $f$ on the sample set $D$ can be calculated as Eq. (20):

$\displaystyle\textit{ac\_rate}\left({D/f}\right)=\frac{\sum_{i=1}^{k}\textit{corr}\_c_{i}}{\left|D\right|}$ (20)

where $k$ denotes the number of real clusters in the sample set, $\textit{corr}\_c_{i}$ denotes the number of correctly classified samples in the $i$-th cluster and $|D|$ denotes the number of samples. A greater value of $\textit{corr}\_c_{i}$ suggests a more favorable clustering performance. Equation (20) represents the overall clustering accuracy. However, it is difficult to determine a range of clustering accuracies that corresponds to a good clustering, as this depends on concrete conditions and clustering requirements. For example, if a task sets a man-made threshold of 50.00% for the overall accuracy of the algorithm, a clustering accuracy of more than 50.00% is considered good performance and one of less than 50.00% is considered poor performance. In theory, without considering the complexity of the algorithm, the higher the clustering accuracy, the better.
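Eq. (20) can be illustrated with a small Python sketch. Since cluster marks produced by a clustering algorithm are arbitrary, the sketch assumes that predicted clusters are matched to real clusters by the one-to-one assignment that maximizes the number of correctly classified samples, found by brute force over permutations (practical only for small $k$). This matching rule and the function name `clustering_accuracy` are assumptions of the example, not necessarily the procedure of [44].

```python
from itertools import permutations

def clustering_accuracy(true_labels, predicted_labels, k):
    """Overall clustering accuracy in the sense of Eq. (20).

    Labels are integers in range(k). The mapping from predicted to real
    clusters is chosen to maximize the number of correct samples.
    """
    best_correct = 0
    for mapping in permutations(range(k)):
        correct = sum(1 for t, p in zip(true_labels, predicted_labels)
                      if mapping[p] == t)
        best_correct = max(best_correct, correct)
    return best_correct / len(true_labels)

# Six samples in two real clusters; one sample is misclassified.
true_demo = [0, 0, 0, 1, 1, 1]
predicted_demo = [1, 1, 0, 0, 0, 0]
accuracy = clustering_accuracy(true_demo, predicted_demo, k=2)
print(f"{100 * accuracy:.2f}%")
```

In this toy case five of the six samples are correctly classified once the two predicted marks are swapped.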

Then, in accordance with the arrangement of $\gamma$ and the clustering results, the threshold value and clustering accuracy are calculated, as shown in Tables 11–14. The mean of clustering accuracy rate is abbreviated as MC. Variance of the clustering accuracy rate is abbreviated as VC.

Table 11

Summary of the clustering results for the first set of simulations

Experiment number	Optimal threshold $d_{c}$	Clustering accuracy (%)	Experiment number	Optimal threshold $d_{c}$	Clustering accuracy (%)
	0.89	52.00		0.97	50.50
	0.91	51.00		0.89	51.00
	0.94	52.00		0.93	55.00
	0.93	51.00		0.94	52.50
	0.83	54.00		0.91	55.50
	0.81	50.50		0.85	52.50
	0.90	50.00		0.87	51.50
	0.94	52.50		0.87	54.50
	0.92	52.00		0.85	55.00
	0.89	51.00		0.91	56.50
	0.84	52.00		0.89	50.00
	0.89	50.00		0.88	56.00
	0.84	50.50	MC = 52.36%	VC = 4.0525
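The MC and VC values of Table 11 can be reproduced from the 25 clustering accuracies listed there, as the following Python check illustrates. It assumes that VC is the sample variance (denominator $n-1$), which is the convention under which the tabulated value is recovered; the variable names are illustrative.

```python
from statistics import mean, variance

# Clustering accuracies (%) of the 25 runs in Table 11, in reading order.
accuracies_table_11 = [
    52.00, 50.50, 51.00, 51.00, 52.00, 55.00, 51.00, 52.50, 54.00,
    55.50, 50.50, 52.50, 50.00, 51.50, 52.50, 54.50, 52.00, 55.00,
    51.00, 56.50, 52.00, 50.00, 50.00, 56.00, 50.50,
]

mc = mean(accuracies_table_11)         # mean of the clustering accuracy
vc = variance(accuracies_table_11)     # sample variance (n - 1 denominator)
print(f"MC = {mc:.2f}%, VC = {vc:.4f}")
```

The same computation applies to the accuracies of Tables 12–14.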

Table 12

Summary of the clustering results for the second set of simulations

Experiment number	Optimal threshold $d_{c}$	Clustering accuracy (%)	Experiment number	Optimal threshold $d_{c}$	Clustering accuracy (%)
	0.38	40.67		0.40	37.00
	0.39	41.67		0.41	35.67
	0.40	36.33		0.39	37.33
	0.39	36.33		0.41	33.67
	0.40	38.67		0.40	37.33
	0.39	39.00		0.38	36.00
	0.41	37.00		0.40	39.67
	0.39	46.33		0.40	38.33
	0.40	38.00		0.39	39.00
	0.38	40.33		0.40	42.67
	0.39	38.00		0.40	34.67
	0.38	42.00		0.39	37.67
	0.39	41.00	MC = 38.57%	VC = 7.7999

Table 13

Summary of the clustering results for the third set of simulations

Experiment number	Optimal threshold $d_{c}$	Clustering accuracy (%)	Experiment number	Optimal threshold $d_{c}$	Clustering accuracy (%)
	0.88	46.33		0.79	53.67
	0.87	40.67		0.82	54.33
	0.76	41.33		0.80	48.00
	0.77	39.33		0.74	52.67
	0.79	45.00		0.71	53.67
	0.56	47.67		0.91	45.00
	0.76	38.00		0.83	40.67
	0.72	39.33		0.70	43.33
	0.77	46.33		0.84	41.33
	0.47	40.00		0.87	50.67
	0.50	45.00		0.86	56.33
	0.78	38.67		0.76	45.00
	0.80	43.67	MC = 45.44%	VC = 29.96

Table 14

Summary of the clustering results for the fourth set of simulations

Experiment number	Optimal threshold $d_{c}$	Clustering accuracy (%)	Experiment number	Optimal threshold $d_{c}$	Clustering accuracy (%)
	3.08	37.67		3.06	35.33
	3.09	36.67		3.11	37.00
	3.04	36.67		3.07	37.33
	3.02	35.00		3.13	38.00
	2.98	35.00		3.05	36.33
	3.09	36.67		3.02	38.33
	3.08	38.33		3.08	39.00
	3.10	38.33		3.08	36.67
	3.16	35.00		3.07	36.67
	3.10	39.33		3.02	35.00
	3.07	37.67		3.06	37.00
	3.06	37.67		3.08	35.00
	3.04	37.33	MC = 36.92%	VC = 1.71

Across the 100 artificial random fuzzy sample sets, the clustering accuracy spans a wide range, from more than 30.00% to less than 60.00%. This may be due to differences between the sample sets themselves, or because the clustering accuracy of the algorithm is not stable enough. The clustering accuracy of the first and third sets of simulations is higher than that of the second and fourth sets.

Using the experimental conditions of the first and third sets of simulations, two fuzzy sample sets with mixed data are generated randomly and recorded as FMTDS-1 and FMTDS-2, respectively. The clustering accuracy of FMTDS-1 is 52.50% and that of FMTDS-2 is 63.67%. To further demonstrate the clustering effect, we attempt to draw the clustering results of these two fuzzy sample sets. For a fuzzy sample composed of several continuous fuzzy sets and discrete fuzzy sets, it is difficult to draw the clustering result directly. In the simulations, however, the first index and the second index correspond to continuous fuzzy sets, for which normal fuzzy sets are used. In this case, we can use the mean of the normal fuzzy set under the first index as the horizontal axis, and the mean of the normal fuzzy set under the second index as the vertical axis to draw the clustering result. The discrete fuzzy set corresponding to the third index usually represents attributes; for all the fuzzy samples in the simulation, the elements of the discrete fuzzy sets are identical, but their degrees of membership differ. In short, the fuzzy samples possess the same types and numbers of attributes, but with different degrees of membership.

When graphing, attributes with different membership degrees could be represented by different colors, provided that enough tones are available and that they are easily distinguished by human vision. However, this requirement is difficult to satisfy, because human vision sometimes cannot distinguish the subtle differences between two similar colors. As there are hundreds of samples in each simulation, hundreds of colors would be required, which inevitably makes some colors difficult to differentiate. Moreover, because color is needed to represent the samples of different clusters in the two-dimensional graph, color cannot also be used to represent the discrete fuzzy sets without causing confusion. Therefore, the discrete fuzzy sets are omitted for the time being in this example. The elements of the discrete fuzzy sets do not differ between samples; only their membership degrees differ. If we marked the elements of the discrete fuzzy sets on the axes, all samples would therefore be the same, which cannot reflect any differences. Even if we drew the membership degrees of the elements on a numbered axis, the difference between fuzzy samples on the two-dimensional map would still be insignificant, because the membership degree lies between 0 and 1. Normal fuzzy sets do not have this problem: their elements are infinite, and the mean value is the key parameter that can represent them to some extent. Therefore, for a fuzzy sample with two continuous fuzzy sets and one discrete fuzzy set, we only draw the mean values of the two continuous fuzzy sets on the horizontal and vertical axes, and replace the fuzzy sample with the coordinate point composed of the two mean values. According to the above rules, the clustering results of FMTDS-1 and FMTDS-2 are shown in Figs 4 and 5, respectively.

Figure 4.

Clustering result for FMTDS-1.

Figure 5.

Clustering result for FMTDS-2.

Obviously, as the clustering accuracy reflects, the clustering effect of FMTDS-2 is better than that of FMTDS-1.

6. Conclusions

As shown in Fig. 3 and Tables 11–14, the CFSFDP approach of automatically identifying the number of clusters by searching for the jump points of $\gamma$ performs successfully only in the first set of simulations and fails in the other sets. As a result, the accuracy for certain simulations is not high. This suggests that, when using the FMTD-CFSFDP algorithm, the automatic identification of the number of clusters based on the jump point of $\gamma$ is unstable and easily fails. Since the improved Euclidean distances between continuous fuzzy sets or discrete fuzzy sets are calculated based on membership degrees with a range of [0, 1], the calculated improved Euclidean distances are always small. The overall distance $L$ and the threshold value $L^{*}$ are therefore small, which results in small calculated values of $\rho$ and $\delta$ (the density of fuzzy samples and the special distance, respectively). Accordingly, the acquired comprehensive measure value $\gamma$ is also small. Although it is not the only reason, the value range of the membership degrees within [0, 1] can easily result in a discrimination degree that is too low, i.e., it is difficult to determine the number of clusters based only on the visual identification of the jump point of $\gamma$. Compared with the results in [24], the clustering accuracy obtained by the FMTD-CFSFDP algorithm is lower than that of the CFSFDP algorithm, which can be attributed to the introduction of the concept of fuzzy sets. Each state (i.e., each element) corresponding to each index of the sample in the fuzzy set can be treated as a number. In other words, although each sample theoretically includes two or three fuzzy sets in the four sets of random simulations, nearly all elements in each fuzzy set are used in calculating the related parameters, i.e., a large number of elements (without limits) in any continuous fuzzy set are involved in the calculation. Therefore, the FMTD-CFSFDP algorithm processes much more information than CFSFDP on classical sets, thereby yielding lower clustering accuracy.

In summary, the FMTD-CFSFDP algorithm is applicable to the clustering of samples with fuzzy mixed data and inherits most of the advantages of the CFSFDP algorithm. The main innovation of this study lies in the extension of the CFSFDP algorithm from classical sets to fuzzy sets. Accordingly, a novel clustering algorithm for fuzzy mixed data is proposed, the mathematical definition of fuzzy mixed data is provided, and the Euclidean distance between fuzzy samples is improved so as to reduce errors.

7. Discussion

Despite the innovations mentioned in Section 5, the proposed FMTD-CFSFDP algorithm still exhibits the following three limitations. Firstly, in the FMTD-CFSFDP algorithm, fuzzy samples carry the information of the fuzzy sets, but the membership of fuzzy samples to a cluster still follows a hard division that does not consider fuzzy characteristics. Many of the calculations involve fuzzy mathematics, including the calculation of fuzzy nearness, fuzzy degrees and fuzzy distance, but these are converted into classical quantities and classical values are finally output. Although the above calculations are reasonable in fuzzy mathematics, some information is always lost during the conversion from fuzzy sets to classical sets. In particular, a hard division of fuzzy samples can reduce the clustering accuracy to a certain degree. Secondly, the improved Euclidean distance proposed in this study exhibits less error than the traditional Euclidean distance, and the overall distance is calculated after weighting. However, the distances cannot be calculated without the membership degree. The membership degree ranges from 0 to 1, which weakens the discrimination degree of the overall distance and thereby decreases the clustering accuracy compared to the results for CFSFDP. Thirdly, the proposed FMTD-CFSFDP algorithm fails to overcome the shortcoming of the CFSFDP algorithm in the visual identification of cluster numbers and makes the identification of the number of clusters more unstable (see Fig. 3; this point was noted in the simulation experiments of Section 4). This can be partly attributed to the decrease in the discrimination degree of the measure as described above. Because of this low discrimination degree, the value of $\gamma$ used for the final automatic identification of the number of clusters varies only slightly and thus cannot be effectively discriminated by visual means.

Based on the above analysis, the FMTD-CFSFDP algorithm can be further improved in the following aspects. Firstly, the membership between fuzzy samples and clusters should also be defined on fuzzy sets so that both the sample information and the memberships are established on fuzzy sets. By doing so, information loss in clustering and classification may be reduced so as to enhance clustering accuracy. Secondly, the fuzzy distance calculated from membership degrees and all related improvements should be abandoned, and a new measure that reflects the characteristics of fuzzy data should be developed. Thirdly, a novel identification method for the number of clusters should be sought to replace the visual identification based on the characteristics of $\gamma$ in the geometrical images, i.e., a set of quantized mathematical models should be proposed for the automatic identification of the number of clusters. Accordingly, the second and third shortcomings described above can be overcome and even subtle differences can be easily identified by the established mathematical model, so as to enhance the discrimination degree in clustering. Overall, the present research can provide insightful direction for further clustering on fuzzy sets.
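
As an illustration of what a quantized rule for the number of clusters might look like, the following minimal sketch picks the number of clusters at the largest relative drop in the sorted $\gamma$ values. This is only one simple possibility, assumed here for illustration and not a method proposed or evaluated in this study; the function name, the `max_clusters` parameter and the toy values are hypothetical.

```python
def count_clusters_by_largest_gap(gamma, max_clusters=10):
    """Pick the number of clusters at the largest relative drop in sorted gamma."""
    ranked = sorted(gamma, reverse=True)
    limit = min(max_clusters, len(ranked) - 1)
    best_k, best_ratio = 1, 0.0
    for k in range(1, limit + 1):
        lower = ranked[k]
        # A zero (or negative) successor means an unbounded drop after rank k.
        ratio = float("inf") if lower <= 0 else ranked[k - 1] / lower
        if ratio > best_ratio:
            best_k, best_ratio = k, ratio
    return best_k


# Toy gamma values (hypothetical): two large entries, the rest near zero.
gamma = [0.031, 0.0007, 0.029, 0.0004, 0.0009, 0.0005]
k = count_clusters_by_largest_gap(gamma)  # k == 2 for this toy list
```

Because the rule compares ratios rather than absolute differences, it returns the same answer when all $\gamma$ values are uniformly small, which is the situation described above.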

Appendix 1

With regard to Eq. (6), assuming that $e_{\Sigma}({r,t})$ , $e_{\Sigma}^{(S)}({r,t})$ and $e_{\Sigma}^{(R)}({r,t})$ represent the overall error, systematic error and random error, respectively, the following expression can be derived:

$\displaystyle e_{\Sigma}\left({r,t}\right)=e_{\Sigma}^{\left(S\right)}\left({r,t}\right)+e_{\Sigma}^{\left(R\right)}\left({r,t}\right).$

For Eq. (7), assuming that $e_{IE}({r,t})$ , $e_{IE}^{(S)}({r,t})$ and $e_{IE}^{(R)}({r,t})$ represent the overall error, systematic error and random error, respectively, the following expression can be derived:

$\displaystyle e_{IE}\left({r,t}\right)=e_{IE}^{\left(S\right)}\left({r,t}\right)+e_{IE}^{\left(R\right)}\left({r,t}\right).$

It should be noted that all the errors are nonnegative. To compare $e_{\Sigma}({r,t})$ and $e_{IE}({r,t})$ , since the random error cannot be artificially controlled, it can be assumed that $e_{\Sigma}^{(R)}({r,t})=e_{IE}^{(R)}({r,t})$ , so only $e_{\Sigma}^{(S)}({r,t})$ and $e_{IE}^{(S)}({r,t})$ need to be compared. Let $e_{j}^{(S)}({r,t})$ denote the error of $\int\limits_{X_{r}^{(j)}\cup X_{t}^{(j)}}{|C_{r}^{(j)}(x)-C_{t}^{(j)}(x)|}^{2}dx$ . For the $j$ -th index, $X_{r,t}^{(j)}$ denotes the public integration domain, with $X_{r,t}^{(j)}\subseteq X_{r,t}$ . According to the property of definite integration, for the fuzzy set $C_{r}^{(j)}$ or $C_{t}^{(j)}$ of each fuzzy sample, in a non-integration domain either $C_{r}^{(j)}(x)=0$ and $\int_{X_{t}^{(j)}}{|C_{r}^{(j)}(x)|}^{2}dx=0$ , or $C_{t}^{(j)}(x)=0$ and $\int_{X_{r}^{(j)}}{|C_{t}^{(j)}(x)|}^{2}dx=0$ . In other words, for $C_{r}^{(j)}$ and $C_{t}^{(j)}$ , the integration in a non-public integration domain can be calculated as $\int_{X_{r,t}-X_{r,t}^{(j)}}\sum_{j=1}^{N_{1}}|C_{r}^{(j)}(x)-C_{t}^{(j)}(x)|^{2}dx$ , where $(X_{r,t}-X_{r,t}^{(j)})$ represents the difference set, i.e., the integration domain $X_{r,t}$ excluding $X_{r,t}^{(j)}$ . For a single index, the error only exists in the defined integration domain, while the integrations in the non-integration domain are all equal to 0, suggesting no error. Accordingly, the following expression can be derived:

$e\left\{\int\limits_{X_{r,t}}\left|C_{r}^{(j)}(x)-C_{t}^{(j)}(x)\right|^{2}dx\right\}=e\left\{\int\limits_{X_{r}^{(j)}\cup X_{t}^{(j)}}\left|C_{r}^{(j)}(x)-C_{t}^{(j)}(x)\right|^{2}dx\right\}=e_{j}^{(S)}\left(r,t\right).$

The systematic error of Eq. (6) can then be calculated as:

$\displaystyle e_{\Sigma}^{(S)}\left(r,t\right)=\sum\limits_{j=1}^{N}\left[e_{j}^{(S)}\left(r,t\right)\right]^{1/2}.$

Based on the property of definite integration,

$\displaystyle\int\limits_{X_{r,t}}\sum_{j=1}^{N_{1}}\left|C_{r}^{(j)}(x)-C_{t}^{(j)}(x)\right|^{2}dx=\sum_{j=1}^{N_{1}}\int\limits_{X_{r,t}}\left|C_{r}^{(j)}(x)-C_{t}^{(j)}(x)\right|^{2}dx,$

and the systematic error of Eq. (7) can be calculated as:

$\displaystyle e_{IE}^{\left(S\right)}\left({r,t}\right)=\left[\sum\limits_{j=1}^{N}e_{j}^{\left(S\right)}\left({r,t}\right)\right]^{1/2}.$

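Because every $e_{j}^{(S)}({r,t})$ is nonnegative, the sum of their square roots is no smaller than the square root of their sum, so $e_{\Sigma}^{(S)}({r,t})\geqslant e_{IE}^{(S)}({r,t})$ . Since $e_{\Sigma}^{(R)}({r,t})=e_{IE}^{(R)}({r,t})$ , it follows that $e_{\Sigma}({r,t})\geqslant e_{IE}({r,t})$ . The error of the improved Euclidean distance according to Eq. (7) is therefore no larger than that calculated according to Eq. (6).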
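
The inequality between the two systematic errors can be checked numerically. The sketch below compares the two forms for arbitrary nonnegative per-index errors $e_{j}^{(S)}$; the function names and the randomly drawn values are hypothetical and serve only as an illustration of the argument.

```python
import math
import random


def summed_roots(errors):
    """Form of the systematic error of Eq. (6): sum of square roots."""
    return sum(math.sqrt(e) for e in errors)


def root_of_sum(errors):
    """Form of the systematic error of Eq. (7): square root of the sum."""
    return math.sqrt(sum(errors))


# Worked example with two indexes: sqrt(1) + sqrt(4) = 3, sqrt(1 + 4) = sqrt(5).
example = [1.0, 4.0]
gap = summed_roots(example) - root_of_sum(example)

# Random check over nonnegative per-index errors (hypothetical values).
rng = random.Random(0)
always_holds = all(
    summed_roots(errs) >= root_of_sum(errs) - 1e-12
    for errs in (
        [rng.uniform(0.0, 4.0) for _ in range(rng.randint(1, 6))]
        for _ in range(1000)
    )
)
```

With a single index the two forms coincide, so the improvement only appears when a sample has more than one index.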

Appendix 2

In the first set of simulations, 200 fuzzy samples were used, and each sample includes three indexes (i.e., $M=200$ and $N=3$ ). Specifically, two indexes (Index A and Index B) are continuous fuzzy subsets in the domain of discourse $X=\mathbb{R}$ and one index (Index C) is a discrete fuzzy subset in $X=\mathbb{R}$ (i.e., $N_{1}=2$ and $N_{2}=1$ ). These samples were artificially divided into two clusters in advance, i.e., $C=2$ . The first 100 samples form one cluster. Assuming that the two continuous fuzzy subsets are normal fuzzy sets, the parameters $\mu$ and $\sigma$ in the membership function corresponding to Index A are drawn from the two normal distributions $N({160,2^{2}})$ and $N({2,0.16^{2}})$ , and the parameters $\mu$ and $\sigma$ in the membership function corresponding to Index B are drawn from the two normal distributions $N({45,2^{2}})$ and $N({45,2^{2}})$ . With regard to Index C (a discrete fuzzy set), a set ${\{{x_{g}}\}}_{g=1}^{5}$ was first obtained by arbitrarily setting $x_{g}$ and $g$ ; assuming ${\{{x_{g}}\}}_{g=1}^{5}={\{{x_{g}=g}\}}_{g=1}^{5}$ , the membership degree of $x_{g}$ to the discrete fuzzy set (Index C) is drawn from the uniform distribution $U({0,1})$ . It should be noted that the distributions can be arbitrary; they only represent a sampling rule. The remaining 100 samples form the other cluster; the parameters $\mu$ and $\sigma$ in the membership function corresponding to Index A are drawn from the two normal distributions $N({168,3^{2}})$ and $N({3,0.2^{2}})$ , the parameters $\mu$ and $\sigma$ in the membership function corresponding to Index B are drawn from the two normal distributions $N({54,3.5^{2}})$ and $N({3.5,0.5^{2}})$ , and the membership degrees of the five elements corresponding to Index C ( ${\{{x_{g}}\}}_{g=1}^{5}=\{{1,2,3,4,5}\}$ ) to the discrete fuzzy set C are drawn from the uniform distribution $U({0,1})$ .
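
The sampling rule for one cluster can be sketched as follows, using the stated parameters of the remaining 100 samples of the first set. The sketch assumes that a normal fuzzy set has the membership function $\exp(-((x-\mu)/\sigma)^{2})$ and that a drawn $\sigma$ is made positive by taking its absolute value; both are assumptions, as are the function names and the random seed.

```python
import math
import random


def normal_membership(x, mu, sigma):
    """Membership of x in a normal fuzzy set (assumed form)."""
    return math.exp(-((x - mu) / sigma) ** 2)


def draw_sample(rng, continuous_specs, n_discrete):
    """Draw one fuzzy sample.

    continuous_specs holds, per continuous index, the (mean, sd) of the
    normal distribution for mu and the (mean, sd) for sigma.
    """
    continuous = []
    for (mu_mean, mu_sd), (sigma_mean, sigma_sd) in continuous_specs:
        mu = rng.gauss(mu_mean, mu_sd)
        # Assumption: keep the width positive by taking the absolute value.
        sigma = abs(rng.gauss(sigma_mean, sigma_sd))
        continuous.append((mu, sigma))
    # Discrete index: element g = 1..n_discrete with membership from U(0, 1).
    discrete = {g: rng.uniform(0.0, 1.0) for g in range(1, n_discrete + 1)}
    return continuous, discrete


# Index A then Index B for the remaining 100 samples of the first set.
SECOND_CLUSTER = [((168, 3), (3, 0.2)), ((54, 3.5), (3.5, 0.5))]
rng = random.Random(42)
samples = [draw_sample(rng, SECOND_CLUSTER, n_discrete=5) for _ in range(100)]
```

Each entry of `samples` pairs the $(\mu,\sigma)$ parameters of Index A and Index B with the five membership degrees of Index C.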

The second set of simulations includes 300 normal fuzzy samples in total ( $M=300$ ), in which each sample has three indexes ( $N=3$ ). As in the first set of simulations, the three indexes include two continuous fuzzy subsets in the domain of discourse $X=\mathbb{R}$ , denoted as Index A and Index B, and one discrete fuzzy subset, denoted as Index C. Accordingly, $N_{1}=2$ and $N_{2}=1$ . These 300 fuzzy samples were artificially divided into three clusters in advance (i.e., $C=3$ ). The first 100 samples form one cluster. This study again assumes that the two continuous fuzzy subsets are normal fuzzy sets; specifically, the parameters $\mu$ and $\sigma$ in the membership function corresponding to Index A are drawn from the two normal distributions $N({1.2,0.14^{2}})$ and $N({0.15,0.15^{2}})$ , and the parameters $\mu$ and $\sigma$ in the membership function corresponding to Index B are drawn from the two normal distributions $N({7,0.15^{2}})$ and $N({0.2,0.12^{2}})$ . For Index C (a discrete fuzzy set), a set ${\{{x_{g}}\}}_{g=1}^{5}$ was first obtained by arbitrarily setting $x_{g}$ and $g$ , and then, assuming ${\{{x_{g}}\}}_{g=1}^{5}={\{{x_{g}=g}\}}_{g=1}^{5}$ , the membership degree of $x_{g}$ to the discrete fuzzy set (Index C) is drawn from the uniform distribution $U({0,1})$ . The middle 100 samples form the second cluster; the parameters $\mu$ and $\sigma$ in the membership function corresponding to Index A are drawn from the two normal distributions $N({8,0.25^{2}})$ and $N({0.2,0.24^{2}})$ , the parameters $\mu$ and $\sigma$ in the membership function corresponding to Index B are drawn from the two normal distributions $N({15,0.24^{2}})$ and $N({0.24,0.14^{2}})$ , and the membership degrees of the three elements corresponding to Index C ( ${\{{x_{g}}\}}_{g=1}^{3}=\{{1,2,3}\}$ ) to the discrete fuzzy set are drawn from the uniform distribution $U({0,1})$ . The remaining 100 samples form the third cluster; the parameters $\mu$ and $\sigma$ in the membership function corresponding to Index A are drawn from the two normal distributions $N({17,0.23^{2}})$ and $N({0.3,0.2^{2}})$ , the parameters $\mu$ and $\sigma$ in the membership function corresponding to Index B are drawn from the two normal distributions $N({22,0.2^{2}})$ and $N({0.35,0.12^{2}})$ , and the membership degrees of the three elements corresponding to Index C ( ${\{{x_{g}}\}}_{g=1}^{3}=\{{1,2,3}\}$ ) to the discrete fuzzy set are drawn from the uniform distribution $U({0,1})$ .

In the third set of simulations, 300 fuzzy samples and three indexes in total are involved (i.e., $M=300$ and $N=3$ ). Specifically, these three indexes are two continuous fuzzy subsets in the domain of discourse $X=\mathbb{R}$ , denoted as Index A and Index B, and one discrete fuzzy subset in the domain of discourse $\mathbb{R}$ , denoted as Index C, i.e., $N_{1}=2$ and $N_{2}=1$ . Similarly, these fuzzy samples are artificially classified into three clusters, i.e., $C=3$ . The first 100 samples form one cluster. This study again assumes that the two continuous fuzzy subsets are normal fuzzy sets: the parameters $\mu$ and $\sigma$ in the membership function corresponding to Index A are drawn from a normal distribution $N({1.2,0.14^{2}})$ and an exponential distribution $\textit{Exp}({12})$ , while the parameters $\mu$ and $\sigma$ in the membership function corresponding to Index B are drawn from the two uniform distributions $U({7,15})$ and $U({0.2,0.4})$ ; for Index C (a discrete fuzzy set), a set ${\{{x_{g}}\}}_{g=1}^{4}$ is first obtained by arbitrarily setting $x_{g}$ and $g$ , and then, assuming ${\{{x_{g}}\}}_{g=1}^{4}={\{{x_{g}=g}\}}_{g=1}^{4}$ , the membership degree of $x_{g}$ to the discrete fuzzy set (Index C) is drawn from the uniform distribution $U({0,1})$ . The middle 100 samples form the second cluster; the parameters $\mu$ and $\sigma$ in the membership function corresponding to Index A are drawn from a normal distribution $N({-6,0.23^{2}})$ and an exponential distribution $\textit{Exp}(9)$ , the parameters $\mu$ and $\sigma$ in the membership function corresponding to Index B are drawn from the two uniform distributions $U({1,4})$ and $U({1.02,1.54})$ , and the membership degrees of the four elements corresponding to Index C ( ${\{{x_{g}}\}}_{g=1}^{4}=\{{1,2,3,4}\}$ ) to the discrete fuzzy set are drawn from the uniform distribution $U({0,1})$ . The remaining 100 samples form the third cluster; the parameters $\mu$ and $\sigma$ in the membership function corresponding to Index A are drawn from a normal distribution $N({8.6,0.15^{2}})$ and an exponential distribution $\textit{Exp}({15})$ , the parameters $\mu$ and $\sigma$ in the membership function corresponding to Index B are drawn from the two uniform distributions $U({20,24})$ and $U({0.67,1.35})$ , and the membership degrees of the four elements corresponding to Index C ( ${\{{x_{g}}\}}_{g=1}^{4}=\{{1,2,3,4}\}$ ) to the discrete fuzzy set are drawn from the uniform distribution $U({0,1})$ .
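
The third set mixes normal, exponential and uniform distributions, which the following minimal sketch reproduces for the first 100 samples. It assumes that the argument of $\textit{Exp}(\cdot)$ is a rate parameter (so $\textit{Exp}(12)$ has mean $1/12$); the text does not state the parameterization, and the function name and random seed are hypothetical.

```python
import random


def draw_third_set_sample(rng, a_mu, a_rate, b_mu_range, b_sigma_range,
                          n_discrete=4):
    """Draw one fuzzy sample with the distribution families of the third set."""
    # Index A: mu from a normal distribution, sigma from Exp(rate) (assumed).
    index_a = (rng.gauss(*a_mu), rng.expovariate(a_rate))
    # Index B: mu and sigma from two uniform distributions.
    index_b = (rng.uniform(*b_mu_range), rng.uniform(*b_sigma_range))
    # Index C: one membership degree from U(0, 1) per discrete element.
    index_c = [rng.uniform(0.0, 1.0) for _ in range(n_discrete)]
    return index_a, index_b, index_c


rng = random.Random(7)
first_cluster = [
    draw_third_set_sample(rng, (1.2, 0.14), 12, (7, 15), (0.2, 0.4))
    for _ in range(100)
]
```

If the argument were instead meant as the mean of the exponential distribution, the rate passed to `expovariate` would be its reciprocal.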

The fourth set of simulations includes 300 fuzzy samples and three indexes (i.e., $M=300$ and $N=3$ ). Of the three indexes, one is a continuous fuzzy subset in the domain of discourse $X=\mathbb{R}$ , denoted as Index A, and the other two are discrete fuzzy subsets in $X=\mathbb{R}$ , denoted as Index B and Index C, i.e., $N_{1}=1$ and $N_{2}=2$ . The 300 samples are artificially divided into three clusters ( $C=3$ ). The first 100 samples form one cluster. This study again assumes that the continuous fuzzy subset is a normal fuzzy set; specifically, the parameters $\mu$ and $\sigma$ in the membership function corresponding to Index A are drawn from the two normal distributions $N({1.6,0.16^{2}})$ and $N({0.12,0.16^{2}})$ ; for Index B (a discrete fuzzy set), a set ${\{{x_{g}}\}}_{g=1}^{4}$ is first obtained by arbitrarily setting $x_{g}$ and $g$ , and then, assuming ${\{{x_{g}}\}}_{g=1}^{4}={\{{x_{g}=g}\}}_{g=1}^{4}$ , the membership degree of $x_{g}$ to the discrete fuzzy set (Index B) is drawn from the uniform distribution $U({0,1})$ ; for Index C (a discrete fuzzy set), a set ${\{{x_{h}}\}}_{h=1}^{3}$ is first obtained by arbitrarily setting $x_{h}$ and $h$ , and then, assuming ${\{{x_{h}}\}}_{h=1}^{3}={\{{x_{h}=h}\}}_{h=1}^{3}$ , the membership degree of $x_{h}$ to the discrete fuzzy set (Index C) is drawn from the uniform distribution $U({0,1})$ . The middle 100 samples form the second cluster; the parameters $\mu$ and $\sigma$ in the membership function corresponding to Index A are drawn from the two normal distributions $N({7.8,0.31^{2}})$ and $N({0.21,0.23^{2}})$ , the membership degrees of the four elements corresponding to Index B ( ${\{{x_{g}}\}}_{g=1}^{4}=\{{1,2,3,4}\}$ ) to the discrete fuzzy set are drawn from the uniform distribution $U({0,1})$ , and the membership degrees of the three elements corresponding to Index C ( ${\{{x_{h}}\}}_{h=1}^{3}=\{{1,2,3}\}$ ) to the discrete fuzzy set are also drawn from the uniform distribution $U({0,1})$ . The remaining 100 samples form the third cluster; the parameters $\mu$ and $\sigma$ in the membership function corresponding to Index A are drawn from the two normal distributions $N({7.8,0.31^{2}})$ and $N({0.21,0.23^{2}})$ , the membership degrees of the four elements corresponding to Index B ( ${\{{x_{g}}\}}_{g=1}^{4}=\{{1,2,3,4}\}$ ) to the discrete fuzzy set are drawn from the uniform distribution $U({0,1})$ , and the membership degrees of the three elements corresponding to Index C ( ${\{{x_{h}}\}}_{h=1}^{3}=\{{1,2,3}\}$ ) to the discrete fuzzy set are drawn from the uniform distribution $U({0,1})$ .

References

1. Han J.W., Kamber and Pei, Cluster analysis: Basic concepts and methods, in: Data Mining: Concept and Technique, Fan and Meng X.F., transl., China Machine Press, Beijing, 2012, pp. 288–291.

2. Bishop C.M., Clustering algorithms, in: Neural Networks for Pattern Recognition, Clarendon Press, Oxford, 1995, pp. 187–189.

3. Chen C.T., Research on K-modes Clustering Algorithm of Dissimilarity Measure, MSc. Dissertation, Taiyuan University of Technology, 2012.

4. Macqueen, On convergence of K-means and partitions with minimum average variance, Annals of Mathematical Statistics 36(3) (1965), 1084–1084.

5. Chaturvedi, Green P.E. and Carroll J.D., K-modes clustering, Journal of Classification 18(1) (2001), 35–55.

6. Karbach et al., Model-K-prototyping at the knowledge level, Expert Systems with Application 4(2) (1992), 268–268.

7. Reynolds A.P., Richards and Rayward-Smith V.J., The application of K-medoids and PAM to the clustering rules, in: Proceedings of 2004 Intelligent Data Engineering and Automated Learning Ideal, 2004, pp. 173–178.

8. R.N., Wang X.L. and Ding J.D., Multilevel core-sets based aggregation clustering algorithm, Journal of Software 24(3) (2013), 490–506.

9. Guha, Rastogi and Shim, Rock: A robust clustering algorithm for categorical attributes, Information Systems 25(5) (2000), 345–366.

10. Wilcox et al., Simulation tests of galaxy cluster constraints on chameleon gravity, Monthly Notices of the Royal Astronomical Society 462(1) (2016), 715–725.

11. Lorbeer et al., Variations on the clustering algorithm BIRCH, Big Data Research 11 (2018), 44–53.

12. Benmouiza and Cheknane, Density-based spatial clustering of application with noise algorithm for the classification of solar, in: Proceedings of 2016 8th International Conference on Modelling, Identification & Control, 2016, pp. 279–283.

13. Bryant and Cios, RNN-DBSCAN: a density-based clustering algorithm using reverse nearest neighbor density estimates, IEEE Transactions on Knowledge and Data Engineering 30(6) (2018), 1109–1121.

14. Kanagala H.K. and Krishnaiah V.V.J.R., A comparative study of K-means, DBSCAN and OPTICS, in: Proceedings of 2016 International Conference on Computer Communication and Informatics, 2016.

15. Dat N.D. et al., Sting algorithm used English sentiment classification in a parallel environment, International Journal of Pattern Recognition and Artificial Intelligence 31(7) (2017), 1–30.

16. Sheikholeslami, Chatterjee and Zhang A.D., WaveCluster: a wavelet-based clustering approach for spatial data in very large databases, VLDB Journal 8(3-4) (2000), 289–304.

17. Holmes and Pfahringer, Clustering large datasets using Cobweb and K-means in tandem, in: Proceedings of 17th Annual Australian Conference on Artificial Intelligence, 2004, pp. 368–379.

18. Kohonen, Self-organized formation of topologically correct feature maps, Biological Cybernetics 43(1) (1982), 59–69.

19. Zhang X.P. et al., Spatial clustering with obstacles constraints by ant colony optimization and quantum particle swarm optimization, in: Proceedings of 2009 International Conference on Artificial Intelligence and Computational Intelligence, 2009.

20. Zhang X.P. et al., Spatial clustering with obstacles constraints using PSO-DV and K-Medoids, in: Proceedings of 2008 3rd International Conference on Intelligent System and Knowledge Engineering, 2008.

21. Roy and Mandal J.K., A delaunay triangulation preprocessing based fuzzy-encroachment graph clustering for large scale GIS, in: Proceedings of 2012 International Symposium on Electronic System Design, 2012, pp. 300–305.

22. et al., CAMAS: A cluster-aware multiagent system for attributed graph clustering, Information Fusion 37 (2017), 10–21.

23. Zaki M.J. et al., CLICKS: An effective algorithm for mining subspace clusters in categorical datasets, Data & Knowledge Engineering 60(1) (2007), 51–70.

24. Rodriguez and Laio, Clustering by fast search and find of density peaks, Science 344(6191) (2014), 1492–1496.

25. Mehmood et al., Fuzzy clustering by fast search and find of density peaks, in: Proceedings of 2015 International Conference on Identification, Information, and Knowledge in the Internet of Things, 2015.

26. Wan et al., Optimized fuzzy clustering by fast search and find of density peaks, in: Proceedings of 2018 IEEE 3rd International Conference on Cloud Computing and Big Data Analysis, 2018.

27. Gao et al., ICFS: An improved fast search and find of density peaks clustering algorithm, Proceedings of 14th IEEE Intl Conf on Dependable, Autonomic and Secure Comp/4th IEEE Intl Conf on Pervasive Intelligence and Comp/2nd IEEE Intl Conf on Big Data Intelligence and Comp/IEEE Cyber Sci and Technol Congress, 2016.

28. Shen Y.C. and Zhang, Automatically selecting cluster centers in clustering by fast search and find of density peaks with data field, in: Proceedings of 2017 2nd International Conference on Information Systems Engineering, 2017.

29. Zhang Y.M., Liu M.D. and Liu Q.W., An energy-balanced clustering protocol based on an improved CFSFDP algorithm for wireless sensor networks, Sensors 18(3) (2018).

30. Qin B.Y. et al., Terahertz time-domain spectroscopy combined with PCA-CFSFDP applied for pesticide detection, Optical and Quantum Electronics 49(7) (2017).

31. Chen Y.Y. and Zhang S.F., Design of mixed data clustering algorithm based on density peak, Journal of Computer Application 38(2) (2018), 483–490.

32. and Chen Y.Y., Evaluation on the annual development status of private hospitals in Mainland China based on Mao-CFSFDP clustering algorithm, Basic & Clinical Pharmacology & Toxicology 123(3) (2018), 109–110.

33. Zadeh L.A., Fuzzy sets, Information and Control 8(3) (1965), 338–353.

34. Barni, Cappellini and Mecocci, A possibilistic approach to clustering-comments, IEEE Transactions on fuzzy systems 4(3) (1996), 393–396.

35. Bezdek J.C., Ehrlich and Full, FCM-The fuzzy c-means clustering-algorithm, Computers & Geosciences 10(2-3) (1984), 191–203.

36. Pal N.R. et al., A possibilistic fuzzy c-means clustering algorithm, Journal of Cybernetics 3(3) (1974), 32–57.

37. Wang and Chen, HASTA: A hierarchical-grid clustering algorithm with data field, International Journal of Data Warehousing and Mining 10(2) (2014), 39–54.

38. George J.K. and Bo Y.A., Fuzzy sets: Basic concepts, in: Fuzzy Sets and Fuzzy Logic: Theory and Application, Patti, ed., Prentice Hall PTR, New Jersey, 1995, pp. 19–30.

39. Chen S.L., J.G. and Wang X.G., Fuzzy sets and membership function, in: The Theory of Fuzzy Sets and Its Application, Science Press, Beijing, 2005, pp. 28–42.

40. Huang, Clustering large data sets with mixed numeric and categorical values, in: Proceedings of the First Pacific-Asia Knowledge Discovery and Data Mining Conference, 1997, pp. 21–34.

41. Zhang W.K., Research on density-based hierarchical clustering algorithm, MSc. Dissertation, University of Science and Technology of China, 2015.

42. Liu, Jia and Wang H.Z., Adaptive indoor localization algorithm of based on CFSFDP in complex environment, Journal of Signal Processing 34(4) (2018), 465–475.

43. Qin and Mao Y.M., Application of uncertain GM-CFSFDP clustering algorithm in landslide hazard prediction, Computer Systems & Application 27(6) (2018), 195–201.

44. Al-Shammary et al., Fractal self-similarity measurements based clustering technique for SOAP web messages, Journal of Parallel and Distributed Computing 73(5) (2013), 664–676.
