A fuzzy clustering algorithm based on hybrid surrogate model

Abstract

Data clustering based on regression relationship is able to improve the validity and reliability of the engineering data mining results. Surrogate models are widely used to evaluate the regression relationship in the process of data clustering, but there is no single surrogate model that always performs the best for all the regression relationships. To solve this issue, a fuzzy clustering algorithm based on hybrid surrogate model is proposed in this work. The proposed algorithm is based on the framework of fuzzy c-means algorithm, in which the differences between the clusters are evaluated by the regression relationship instead of Euclidean distance. Several surrogate models are simultaneously utilized to evaluate the regression relationship through a weighting scheme. The clustering objective function is designed based on the prediction errors of multiple surrogate models, and an alternating optimization method is proposed to minimize it to obtain the memberships of data and the weights of surrogate models. The synthetic datasets are used to test single surrogate model-based fuzzy clustering algorithms to choose the surrogate models used in the proposed algorithm. It is found that support vector regression-based and response surface-based fuzzy clustering algorithms show competitive clustering performance, so support vector regression and response surface are used to construct the hybrid surrogate model in the proposed algorithm. The experimental results of synthetic datasets and engineering datasets show that the proposed algorithm can provide more competitive clustering performance compared with single surrogate model-based fuzzy clustering algorithms for the datasets with regression relationships.

Keywords

Data clustering, fuzzy clustering, regression relationship, hybrid surrogate model, engineering data

1 Introduction

With the advent of the data-intensive era, the monitoring of engineering systems is becoming increasingly comprehensive. The measured data not only record the operation process of engineering systems but also reflect their internal mechanisms. Mining these engineering data can support the design, analysis, operation, and maintenance of engineering systems. The operation state of an engineering system often changes to meet the requirements of different working conditions, so the engineering data exhibit different features. It is necessary to partition these engineering data to ensure the validity and reliability of the analysis results. However, most engineering data are not labeled, so they need to be partitioned based on their internal characteristics. Data clustering is one of the mature tools for this work and has been successfully applied in data mining tasks such as data classification [1–4], anomaly detection [5, 6], and image segmentation [7, 8]. Its task is to group the data into clusters so that the patterns within the same cluster are more similar to each other than to those in different clusters [9]. Fuzzy clustering is a branch of clustering methods developed from fuzzy set theory, and its most well-known representative is the fuzzy c-means algorithm (FCM). FCM is an objective function-based fuzzy clustering method which partitions a set of object data {x₁, … , x_n } $\subset ℝ^{d \times n}$ into c fuzzy clusters based on a computed minimizer of the fuzzy within-group least squares function:

$J (U, V) = \sum_{i = 1}^{c} \sum_{k = 1}^{n} u_{i, k}^{m} d_{i, k}^{2}$ (1)

where d_i,k represents the similarity metric that evaluates the distance between x_k and the prototype of the i-th cluster (Euclidean distance is used in the fuzzy c-means algorithm), and u_i,k is the membership that represents the degree to which x_k belongs to the i-th cluster and satisfies the following condition:

$\sum_{i = 1}^{c} u_{i, k} = 1 (k = 1, 2, \dots, n; \forall i, k : u_{i, k} \in [0, 1])$ (2)

Through minimizing (1) with constraint (2), the clustering results can be obtained. It can be found that there are two key problems in the objective function-based fuzzy clustering method as shown in Equation (1). One is the minimization of the clustering objective function, and the other one is the evaluation of the similarity between data. The first problem is usually solved by Lagrange multiplier method or heuristic algorithms [10]. For the second problem, Euclidean distance is mostly used to evaluate the similarity between data. In addition, a variety of distance metrics have been introduced to replace Euclidean distance to improve the clustering results as well. Dey [11] designed a clustering objective function based on the sum of the entropy of Euclidean distance. Carvalho [12] used Quadratic distance to replace Euclidean distance and compared the proposed algorithm with FCM through numerical cases and real-world datasets. Liu [13] introduced Mahalanobis distance into fuzzy clustering algorithm and found that Mahalanobis distance can provide better clustering results when the scale of attributes is greatly different from each other. Gueorguieva [14] proposed a method to solve the solution instability problem of Mahalanobis distance by fixing the maximum and minimum eigenvalues of the covariance matrix. Kiraly [15] proposed a fuzzy clustering algorithm based on Geodesic distance for the division and planning of map paths.

In recent years, fuzzy clustering algorithms have been widely used in engineering data analysis. Mota [16] used a fuzzy clustering algorithm to assist decision-making regarding the control of variables in compost bedded pack barns. Song [17] developed a time series segmentation method based on the framework of FCM and used it to segment the time series of a complex mechanical system. Majumder [18] utilized a fuzzy clustering algorithm for multi-sensor data fusion to reduce the imprecision and uncertainty of the sensory data. Arora [19] proposed an enhanced spatial intuitionistic fuzzy c-means algorithm for image segmentation. Lin [20] used a fuzzy clustering algorithm to mine historical meteorological data to forecast air pollutant concentrations and environmental factors. These works provide insights into the availability and potential of fuzzy clustering algorithms for benefiting engineering systems. For engineering data, the differences among the data are reflected mainly in the regression relationship, and many data mining tasks on engineering data depend on this regression relationship as well. However, most similarity metrics used in current fuzzy clustering algorithms are based on the spatial distance or the mapped high-dimensional spatial distance, which makes it difficult to ensure that the data assigned to the same cluster have a similar regression relationship. To realize data clustering based on regression relationship, the most important problem is the evaluation of the regression relationship. However, the regression relationship of engineering data is usually unknown and very complicated, so it is difficult to evaluate with explicit formulas. This limitation can be addressed by adopting surrogate models, which can build the regression relationship from a small number of samples.
The commonly used surrogate models are support vector regression (SVR), radial basis function (RBF), Kriging method (KRG), and response surface (RS). Ren [21] used SVR to model the actual flight data of an aero-engine to help its design and analysis. Li [22] used SVR to replace the finite element model of a dental implant to predict the stress at the interface between bone and implant. Serina [23] utilized RBF to replace expensive computational fluid dynamics computer simulations for the resistance optimization problems of a hydrofoil and a destroyer-type vessel. Halali [24] used RBF to estimate the pressure gradient in water-oil pipelines based on the experimental data. Lu [25] used KRG to predict the actual stiffnesses of a bridge. Lee [26] used KRG to replace the massive computer simulations in the reliability-based design optimization. Toft [27] assessed the site suitability of wind turbines based on RS and the aero-elastic simulations. Lmalghan [28] used RS to model the machining process of aluminum alloy. Through a number of engineering experiments and applications, the researchers find that no single surrogate model always performs the best for all engineering practices [29, 30]. Hybrid surrogate model, as an ensemble of multiple surrogate models, is proposed to solve this issue [31, 32]. The experimental results of numerical and engineering problems indicate that hybrid surrogate model is able to take advantage of different surrogate models to achieve higher regression accuracy. In this work, we proposed a clustering algorithm based on hybrid surrogate model (named the HSM-FC algorithm), in which the adaptive combination of multiple surrogate models is used to improve the data partition accuracy. The proposed algorithm is compared with single surrogate model-based fuzzy clustering algorithms through a number of synthetic and engineering datasets. In addition, the computational cost and convergence performance of the proposed algorithm are explored as well. 
The contributions of this work are as follows: 1) The regression relationship is used to replace the conventional similarity metrics based on spatial distance to constitute the clustering objective function, which realizes the data partition according to the regression relationship; 2) Hybrid surrogate model is used to evaluate the regression relationship. Through the adaptive combination of multiple surrogate models, the data partition accuracy is improved.

The rest of this paper is organized as follows. Section 2 describes the fuzzy c-means algorithm, the proposed algorithm, and the optimization method for its clustering objective function. In Section 3, synthetic datasets are used to test the performance of the proposed algorithm. Section 4 presents the clustering results for several engineering datasets. Conclusions are given in Section 5.

2 Preliminary

2.1 Fuzzy c-means algorithm (FCM)

Fuzzy c-means algorithm is a well-known algorithm to cluster data in unsupervised learning [33]. It partitions a given set of object data {x₁, x₂, . . . , x_n} $\subset ℝ^{d \times n}$ into c fuzzy clusters by minimizing an objective function J (U, V) as follows, $J (U, V) = \sum_{i = 1}^{c} \sum_{k = 1}^{n} u_{i, k}^{m} ∥ x_{k} - v_{i} ∥_{2}^{2}$ (3) where $x_{k} = {[x_{1, k}, x_{2, k}, \dots, x_{j, k}, \dots, x_{d, k}]}^{T}$ is an object datum, and x_j,k is the j-th attribute of x_k; v_i is the i-th cluster prototype, and the matrix of cluster prototypes is $V = [v_{i, j}] = {[v_{1}, v_{2}, . . ., v_{c}]}^{T} \in ℝ^{c \times d}$ ; m is a fuzzification parameter, m ∈ (1, ∞); ∥ · ∥ ₂ represents the Euclidean norm in $ℝ^{d}$ , and the expansion of ∥x_k - v_i ∥ ₂ is shown as follows,

$\begin{matrix} {∥ x_{k} - v_{i} ∥}_{2} \\ = \sqrt{{(x_{1, k} - v_{i, 1})}^{2} + {(x_{2, k} - v_{i, 2})}^{2} + \dots + {(x_{d, k} - v_{i, d})}^{2}} \end{matrix}$ (4)

u_i,k is the membership that represents the degree to which x_k belongs to the i-th cluster and satisfies the following condition,

$\sum_{i = 1}^{c} u_{i, k} = 1 (k = 1, 2, \dots, n; \forall i, k : u_{i, k} \in [0, 1])$ (2) and let partition matrix be $U = [u_{i, k}] \in ℝ^{c \times n}$ .

The necessary conditions for minimizing (3) with the constraint (2) result in the following iterative update formulas for the prototypes and the partition matrix:

$v_{i} = \frac{\sum_{k = 1}^{n} u_{i, k}^{m} x_{k}}{\sum_{k = 1}^{n} u_{i, k}^{m}} (i = 1, 2, . . ., c)$ (5) $u_{i, k} = {[\sum_{t = 1}^{c} {(\frac{{∥ x_{k} - v_{i} ∥}_{2}^{2}}{{∥ x_{k} - v_{t} ∥}_{2}^{2}})}^{\frac{1}{m - 1}}]}^{- 1}$ (6)

The iterations are carried out until the changes in the values of the partition matrix reported in consecutive iterations are lower than a certain predetermined threshold.
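The update formulas (5) and (6) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `fcm`, the random initialization of U, the default threshold, and the small floor on the squared distances are choices made here for the example.

```python
import numpy as np

def fcm(X, c, m=2.0, eps=1e-6, max_iter=100, seed=0):
    """Fuzzy c-means on X with shape (n, d); returns (U, V).

    U has shape (c, n) and every column sums to 1, as in constraint (2).
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Random initial partition matrix, normalized so each column sums to 1.
    U = rng.random((c, n))
    U /= U.sum(axis=0, keepdims=True)
    for _ in range(max_iter):
        Um = U ** m
        # Equation (5): prototypes are membership-weighted means of the data.
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)
        # Squared Euclidean distances d2[i, k] = ||x_k - v_i||^2.
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)  # assumed guard against division by zero
        # Equation (6), rewritten as a normalized inverse-distance power.
        inv = d2 ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=0, keepdims=True)
        # Stop when the largest change in the partition matrix is below eps.
        converged = np.abs(U_new - U).max() < eps
        U = U_new
        if converged:
            break
    return U, V
```

Writing (6) as a normalized inverse power avoids the explicit double loop over i and t while giving the same memberships.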

2.2 Hybrid surrogate model-based fuzzy clustering algorithm

In this section, a fuzzy clustering algorithm based on hybrid surrogate model is proposed. The proposed clustering objective function is defined as follows, $J (U) = \sum_{i = 1}^{c} \sum_{k = 1}^{n} \sum_{j = 1}^{s} u_{i, k}^{m} {(w_{i, j} (x_{obj, k} - \hat{x_{obj, k, i, j}}))}^{2}$ (7) where c is the number of clusters; n is the number of data; s is the number of surrogate models; w_i,j is the weight of the j-th surrogate model of the i-th cluster and satisfies the following condition. $\sum_{j = 1}^{s} w_{i, j} = 1 (i = 1, 2, \dots, c; \forall i, j : w_{i, j} \in [0, 1])$ (8)

x_obj,k is the output attribute that is selected by the user, and $\hat{x_{obj, k, i, j}}$ is the estimation by the j-th surrogate model of the i-th cluster:

$\hat{x_{obj, k, i, j}} = f_{i, j} (x_{1, k}, \dots x_{(obj - 1), k}, x_{(obj + 1), k}, \dots, x_{d, k})$ (9)

where x_1,k, … x_(obj-1),k, x_(obj+1),k, …, x_d,k are the input attributes. The surrogate models are constructed using the data with membership higher than the membership criteria θ. Formally, the proposed clustering algorithm, called hybrid surrogate model-fuzzy clustering algorithm (HSM-FC) is characterized in the following way. $\begin{matrix} J (U) = \sum_{i = 1}^{c} \sum_{k = 1}^{n} \sum_{j = 1}^{s} u_{i, k}^{m} {(w_{i, j} (x_{obj, k} - \hat{x_{obj, k, i, j}}))}^{2} \\ \sum_{j = 1}^{s} w_{i, j} = 1 (i = 1, 2, \dots, c; \forall i, j : w_{i, j} \in [0, 1]) \\ \sum_{i = 1}^{c} u_{i, k} = 1 (k = 1, 2, \dots, n; \forall i, k : u_{i, k} \in [0, 1]) \end{matrix}$ (10)
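To make the index structure of the objective function concrete, the following sketch evaluates J(U) from Equation (7) for given memberships, weights, and prediction errors. The array layout is an assumption made for this example: `D[i, j, k]` holds the absolute error of the j-th surrogate model of the i-th cluster on datum k, `U` has shape (c, n), and `W` has shape (c, s).

```python
import numpy as np

def hsmfc_objective(U, W, D, m=2.0):
    """Evaluate the clustering objective J(U) of Equation (7).

    U: (c, n) memberships, W: (c, s) surrogate weights,
    D: (c, s, n) absolute prediction errors |x_obj,k - xhat_obj,k,i,j|.
    """
    # (w_ij * D_ijk)^2 for every cluster, surrogate model, and datum.
    weighted_sq = (W[:, :, None] * D) ** 2
    # Weight by u_ik^m and sum over i, j, and k.
    return float(((U ** m)[:, None, :] * weighted_sq).sum())
```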

Proposition. The solutions of (10) are $u_{i, k} = {[\sum_{t = 1}^{c} {(\frac{\sum_{j = 1}^{s} {(w_{i, j} D_{i, j, k})}^{2}}{\sum_{j = 1}^{s} {(w_{t, j} D_{t, j, k})}^{2}})}^{\frac{1}{m - 1}}]}^{- 1}$ (11)

$w_{i, j} = {[\sum_{o = 1}^{s} \frac{\sum_{k = 1}^{n} u_{i, k}^{m} {(D_{i, j, k})}^{2}}{\sum_{k = 1}^{n} u_{i, k}^{m} {(D_{i, o, k})}^{2}}]}^{- 1}$ (12) where D_i,j,k represents $| x_{obj, k} - \hat{x_{obj, k, i, j}} |$ . The iterations are carried out until the changes in the values of the partition matrix reported in consecutive iterations are lower than a certain predetermined threshold.

Proof. In the following, we prove the iterative solutions (11)∼(12). First, with w_i,j fixed, we determine the membership degrees u_i,k. Consider the Lagrangian function

$L_{λ} = \sum_{i = 1}^{c} \sum_{k = 1}^{n} \sum_{j = 1}^{s} u_{i, k}^{m} {(w_{i, j} D_{i, j, k})}^{2} - λ (\sum_{i = 1}^{c} u_{i, k} - 1)$ (13)

where λ is the Lagrange multiplier. Setting the first derivatives of (13) with respect to u_i,k and λ to zero yields $\frac{\partial L_{λ}}{\partial u_{i, k}} = m u_{i, k}^{m - 1} \sum_{j = 1}^{s} {(w_{i, j} D_{i, j, k})}^{2} - λ = 0$ (14) $\frac{\partial L_{λ}}{\partial λ} = - (\sum_{i = 1}^{c} u_{i, k} - 1) = 0$ (15)

From (14) we obtain $u_{i, k} = {(\frac{λ}{m \sum_{j = 1}^{s} {(w_{i, j} D_{i, j, k})}^{2}})}^{\frac{1}{m - 1}}$ (16) and, by summing (16) over the clusters and considering (15), ${(\frac{λ}{m})}^{\frac{1}{m - 1}} = {[\sum_{t = 1}^{c} {(\frac{1}{\sum_{j = 1}^{s} {(w_{t, j} D_{t, j, k})}^{2}})}^{\frac{1}{m - 1}}]}^{- 1}$ (17)

Substituting (17) in (16), the u_i,k in (11) can be obtained.

Next, with u_i,k fixed, we derive w_i,j. The Lagrangian function is designed as follows

$L_{ξ} = \sum_{i = 1}^{c} \sum_{k = 1}^{n} \sum_{j = 1}^{s} u_{i, k}^{m} {(w_{i, j} D_{i, j, k})}^{2} - ξ (\sum_{j = 1}^{s} w_{i, j} - 1)$ (18)

By setting the first derivatives of (18) with respect to w_i,j and ξ to zero, the following equations can be obtained $\frac{\partial L_{ξ}}{\partial w_{i, j}} = 2 w_{i, j} \sum_{k = 1}^{n} u_{i, k}^{m} {(D_{i, j, k})}^{2} - ξ = 0$ (19) $\frac{\partial L_{ξ}}{\partial ξ} = - (\sum_{j = 1}^{s} w_{i, j} - 1) = 0$ (20)

Thus $w_{i, j} = \frac{ξ}{2 \sum_{k = 1}^{n} u_{i, k}^{m} {(D_{i, j, k})}^{2}}$ (21)

Substituting (21) into (20) gives $\frac{ξ}{2} = \frac{1}{\sum_{o = 1}^{s} \frac{1}{\sum_{k = 1}^{n} u_{i, k}^{m} {(D_{i, o, k})}^{2}}}$ (22)

By replacing (22) in (21), the w_i,j in (12) can be obtained.
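The two alternating updates can be sketched as follows. The membership update uses the form obtained by substituting the normalization constraint (15) into (16), so that each cluster's error term uses that cluster's own weights; the weight update follows (21) and (22). The function names, the array layout (`D[i, j, k]` for cluster i, surrogate model j, datum k), and the small floor `tiny` that avoids division by zero are assumptions of this sketch.

```python
import numpy as np

def update_memberships(W, D, m=2.0, tiny=1e-12):
    """Membership update: u_ik from the weighted squared errors of each cluster."""
    # E[i, k] = sum_j (w_ij * D_ijk)^2
    E = ((W[:, :, None] * D) ** 2).sum(axis=1)
    E = np.maximum(E, tiny)
    # u_ik = E_ik^(-1/(m-1)) / sum_t E_tk^(-1/(m-1))
    inv = E ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=0, keepdims=True)

def update_weights(U, D, m=2.0, tiny=1e-12):
    """Weight update: w_ij from the membership-weighted squared errors."""
    # S[i, j] = sum_k u_ik^m * D_ijk^2
    S = ((U ** m)[:, None, :] * D ** 2).sum(axis=2)
    S = np.maximum(S, tiny)
    # w_ij = (1 / S_ij) / sum_o (1 / S_io)
    inv = 1.0 / S
    return inv / inv.sum(axis=1, keepdims=True)
```

Both functions return arrays that satisfy the sum-to-one constraints by construction, so no separate projection step is needed.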

In each iteration, the surrogate models are updated based on the clustering results of the previous iteration. The j-th surrogate model of the i-th cluster is built based on the data with the corresponding membership higher than the membership criteria θ. The total procedure of the proposed HSM-FC algorithm is described as follows.

Step (1) Setting the clustering number c, the objective attribute x_obj, and the iteration number g = 1;

Step (2) Initializing the weight matrix of surrogate models W⁽⁰⁾ randomly, and generating the initial membership matrix U⁽⁰⁾ using FCM;

Step (3) For each cluster, selecting the data with membership higher than the membership criterion θ and using them as the training data;

Step (4) Checking whether any training dataset is empty; if so, creating a new membership matrix U randomly and returning to Step (3);

Step (5) Using the training data of the i-th cluster to build the corresponding surrogate models;

Step (6) Using the obtained surrogate models from Step (5) to get the responses of all the data;

Step (7) Calculating the membership matrix U^(g) and the weight matrix of surrogate models W^(g) using Equation (11) ∼ (12);

Step (8) If ∀i, k: max $| u_{i, k}^{(g)} - u_{i, k}^{(g - 1)} | < ɛ$ , then stop and get partition matrix U, otherwise set g = g + 1 and return to Step (3).

In the proposed algorithm, the membership criteria θ is updated as follows $θ = \frac{1}{c} + \frac{0.9 - \frac{1}{c}}{Max} g$ (23) where Max is the maximum iteration.
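The schedule in Equation (23) and the training-data selection of Steps (3) and (4) can be sketched as below. The function names are hypothetical, and the partition matrix is assumed to be stored as a (c, n) array.

```python
import numpy as np

def membership_criterion(g, c, max_iter):
    """Equation (23): theta grows linearly with the iteration number g."""
    return 1.0 / c + (0.9 - 1.0 / c) / max_iter * g

def select_training_indices(U, theta):
    """Step (3): for each cluster, indices of data with membership above theta."""
    return [np.flatnonzero(U[i] > theta) for i in range(U.shape[0])]

def has_empty_training_set(index_sets):
    """Step (4): True if at least one cluster has no training data."""
    return any(len(idx) == 0 for idx in index_sets)
```

For c = 2 and Max = 50, for example, θ equals 1/c = 0.5 at g = 0, increases by 0.008 per iteration, and reaches 0.9 at g = 50, so the surrogate models are trained on progressively more confidently assigned data.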

3 Experiments on synthetic datasets

3.1 Dataset generation and experiment settings

To thoroughly test the performance of the HSM-FC algorithm, 10 synthetic datasets are used in this section. Each dataset is named by the number of object data, the number of attributes, and the regression relationship. For instance, in N400A2F1, N400 denotes that the dataset contains 400 object data, which can be divided evenly into two clusters; A2 denotes that the dataset has two attributes; and F1 identifies the regression relationship among the attributes. The details of the 10 synthetic datasets are listed in Appendix A. Each synthetic dataset is generated as follows. For each cluster, the input attributes are first sampled using the Latin hypercube sampling method. For each datum, the output attribute is calculated according to the specified regression relationship, and a random value is added to it. After that, the data of the different clusters are combined to form the synthetic dataset.
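The generation procedure can be sketched as follows. The regression relationships of Appendix A are not reproduced here, so the two functions `f1` and `f2` are hypothetical placeholders; the Gaussian noise, its level, and the unit hypercube sampling range are likewise assumptions of this sketch.

```python
import numpy as np

def latin_hypercube(n, d, rng):
    """n points in [0, 1)^d with exactly one point per stratum along each axis."""
    # Each column is a random permutation of the n strata, jittered within the stratum.
    strata = np.stack([rng.permutation(n) for _ in range(d)], axis=1)
    return (strata + rng.random((n, d))) / n

def make_synthetic(n_per_cluster, n_inputs, relations, noise=0.05, seed=0):
    """Build a dataset whose clusters differ only in their regression relationship."""
    rng = np.random.default_rng(seed)
    blocks, labels = [], []
    for i, f in enumerate(relations):
        X_in = latin_hypercube(n_per_cluster, n_inputs, rng)
        # Output attribute = regression relationship + random value (Gaussian assumed).
        y = f(X_in) + noise * rng.standard_normal(n_per_cluster)
        blocks.append(np.column_stack([X_in, y]))
        labels.append(np.full(n_per_cluster, i))
    return np.vstack(blocks), np.concatenate(labels)

# Hypothetical relationships, chosen only to illustrate the N400A2 layout.
f1 = lambda X: 2.0 * X[:, 0] + 1.0
f2 = lambda X: -3.0 * X[:, 0] ** 2 + 2.0
data, true_labels = make_synthetic(200, 1, [f1, f2])
```

Because both clusters are sampled over the same input range, the clusters overlap in attribute space and can be separated only through their regression relationships.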

In the experiments, fuzzy c-means algorithm and four single surrogate model-based fuzzy clustering algorithms including support vector regression, Kriging method, radial basis function, and response surface-based fuzzy clustering algorithms (short for SVR-FC, KRG-FC, RBF-FC, and RS-FC, respectively) are first used to cluster the synthetic datasets to select the surrogate models used in the HSM-FC algorithm. The details of single surrogate model-based fuzzy clustering algorithm can be found in Ref [34] and Appendix B. The parameters are set as follows: the fuzzification parameter m is 2, the threshold value is 10^-6, the maximum iteration is 50, the membership criteria θ for the SVR-FC, KRG-FC, RBF-FC, and RS-FC algorithms is 0.5. For each dataset, the experiments are conducted 30 times, and the clustering performance is evaluated through the following cluster validity indexes.

The first cluster validity index is misclassification rate (MS), which is defined as follows: $MS = \frac{N_{error}}{N_{total}}$ (24) where N_error is the number of misclassified object data; N_total is the total number of object data. The lower MS, the higher cluster validity.
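Since cluster indices produced by a clustering algorithm are arbitrary, counting N_error requires matching predicted clusters to true classes first. The sketch below assumes that the matching is done by trying every one-to-one relabeling and keeping the one with the fewest errors, and that labels are integers in 0, ..., c-1 (for example, obtained as the cluster of maximum membership); neither detail is specified in the text above.

```python
from itertools import permutations
import numpy as np

def misclassification_rate(true_labels, pred_labels, c):
    """MS = N_error / N_total under the best one-to-one relabeling of clusters."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    best_errors = len(true_labels)
    # Try every mapping from predicted cluster index to true class index.
    for perm in permutations(range(c)):
        mapped = np.array(perm)[pred_labels]
        best_errors = min(best_errors, int((mapped != true_labels).sum()))
    return best_errors / len(true_labels)
```

The exhaustive search costs c! relabelings, which is acceptable for the two to five clusters considered in this paper.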

The second cluster validity index is the adjusted Rand index (ARI) [35]. Given a set S of n elements and two partitions of these elements, namely X ={ X₁, X₂, … , X_r } and Y ={ Y₁, Y₂, …, Y_s }, the overlap between X and Y can be summarized in a contingency table [n_ij], where each entry n_ij denotes the number of objects in common between X_i and Y_j: n_ij = |X_i ∩ Y_j|, as shown in Table 1; a_i and b_j denote the row and column sums, respectively. The adjusted Rand index is then defined by Equation (25).

Table 1

Contingency table

X ∖ Y	Y ₁	Y ₂	...	Y _s	Sums
X ₁	n ₁₁	n ₁₂	...	n _1s	a ₁
X ₂	n ₂₁	n ₂₂	...	n _2s	a ₂
...	...	...	...	...	...
X _r	n _r1	n _r2	...	n _rs	a _r
Sums	b ₁	b ₂	...	b _s	

Table 2

Average MS of the synthetic datasets

	FCM	SVR-FC	KRG-FC	RBF-FC	RS-FC	HSM-FC
N400A2F1	0.457	0.110	0.275	0.499	0.028	0.026
N400A2F2	0.495	0.161	0.288	0.499	0.072	0.060
N400A2F3	0.249	0.075	0.041	0.499	0.204	0.035
N400A3F4	0.285	0.042	0.261	0.339	0.031	0.025
N400A4F5	0.273	0.059	0.262	0.315	0.041	0.032
N400A5F6	0.037	0.003	0.038	0.343	0.003	0.003
N400A5F7	0.185	0.097	0.298	0.377	0.084	0.074
N400A6F8	0.233	0.096	0.236	0.396	0.101	0.086
N400A7F9	0.154	0.017	0.158	0.481	0.018	0.016
N400A7F10	0.269	0.005	0.269	0.473	0.040	0.005
Mean	0.263	0.066	0.213	0.422	0.062	0.036

$ARI = \frac{\sum_{i, j} (\begin{matrix} n_{ij} \\ 2 \end{matrix}) - [\sum_{i} (\begin{matrix} a_{i} \\ 2 \end{matrix}) \sum_{j} (\begin{matrix} b_{j} \\ 2 \end{matrix})] / (\begin{matrix} n \\ 2 \end{matrix})}{\frac{1}{2} [\sum_{i} (\begin{matrix} a_{i} \\ 2 \end{matrix}) + \sum_{j} (\begin{matrix} b_{j} \\ 2 \end{matrix})] - [\sum_{i} (\begin{matrix} a_{i} \\ 2 \end{matrix}) \sum_{j} (\begin{matrix} b_{j} \\ 2 \end{matrix})] / (\begin{matrix} n \\ 2 \end{matrix})}$ (25)

The closer ARI to 1, the higher cluster validity.
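Equation (25) can be computed directly from the contingency table, as in the following sketch. The function name and the convention of returning 1 when both partitions are trivial (so that the denominator vanishes) are assumptions of this example.

```python
from math import comb
import numpy as np

def adjusted_rand_index(labels_x, labels_y):
    """Adjusted Rand index of two label vectors, following Equation (25)."""
    labels_x = np.asarray(labels_x)
    labels_y = np.asarray(labels_y)
    xs, x_idx = np.unique(labels_x, return_inverse=True)
    ys, y_idx = np.unique(labels_y, return_inverse=True)
    # Contingency table n_ij = |X_i ∩ Y_j|.
    table = np.zeros((len(xs), len(ys)), dtype=int)
    np.add.at(table, (x_idx, y_idx), 1)
    sum_nij = sum(comb(int(v), 2) for v in table.ravel())
    sum_a = sum(comb(int(v), 2) for v in table.sum(axis=1))  # row sums a_i
    sum_b = sum(comb(int(v), 2) for v in table.sum(axis=0))  # column sums b_j
    expected = sum_a * sum_b / comb(len(labels_x), 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0  # assumed convention for the degenerate 0/0 case
    return (sum_nij - expected) / (max_index - expected)
```

Unlike the misclassification rate, ARI is invariant to relabeling of the clusters, so no matching step is required.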

The third cluster validity index is normalized mutual information (NMI) [36] which is defined as follows:

$NMI (X, Y) = \frac{I (X, Y)}{\sqrt{H (X) H (Y)}}$ (26) where I (·) is the mutual information metric and H (·) is the entropy metric. The closer NMI to 1, the higher cluster validity.
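A minimal sketch of Equation (26) is given below, estimating the mutual information and the entropies from the empirical joint distribution of the two label vectors. The natural logarithm is used (the base cancels in the ratio), and returning 0 when either partition has zero entropy is a convention assumed here.

```python
import numpy as np

def _entropy(p):
    """Shannon entropy of a probability vector, ignoring zero entries."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def normalized_mutual_information(labels_x, labels_y):
    """NMI(X, Y) = I(X, Y) / sqrt(H(X) H(Y)), as in Equation (26)."""
    _, x_idx = np.unique(np.asarray(labels_x), return_inverse=True)
    _, y_idx = np.unique(np.asarray(labels_y), return_inverse=True)
    # Empirical joint distribution of the two labelings.
    joint = np.zeros((x_idx.max() + 1, y_idx.max() + 1))
    np.add.at(joint, (x_idx, y_idx), 1.0)
    joint /= joint.sum()
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    hx, hy = _entropy(px), _entropy(py)
    if hx == 0.0 or hy == 0.0:
        return 0.0  # assumed convention when a partition has a single cluster
    nz = joint > 0
    mi = float((joint[nz] * np.log(joint[nz] / np.outer(px, py)[nz])).sum())
    return mi / np.sqrt(hx * hy)
```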

3.2 Experimental results and analysis

The average results of 30 runs for the 10 synthetic datasets are shown in Tables 2-4, in which the bold blue numbers represent the best results of the single surrogate model-based clustering algorithms, and the bold red numbers represent the best results among all the clustering algorithms. It can be found that the mean cluster validity indexes of the SVR-FC, KRG-FC, and RS-FC algorithms are better than those of FCM (mean MS of 0.066, 0.213, and 0.062, respectively, versus 0.263 for FCM). The clustering results of SVR-FC and RS-FC are better than those of FCM for all the synthetic datasets. The KRG-FC algorithm provides good results for low-dimensional synthetic datasets but exhibits worse clustering performance than FCM for high-dimensional synthetic datasets. This is because the Kriging method builds the regression model on a Gaussian process, which has difficulty accurately describing the regression relationship among various attributes. The performance of the RBF-FC algorithm is worse than that of FCM. For the synthetic datasets tested in this section, the number of samples is much higher than the number of attributes. In this setting the RBF solution is prone to instability, so the regression models built in the RBF-FC algorithm cannot effectively describe the regression relationship. Thus, the RBF-FC algorithm cannot provide accurate clustering results for most synthetic datasets. Similar results can be found in Tables 3 and 4.

Table 3
Average ARI of the synthetic datasets

	FCM	SVR-FC	KRG-FC	RBF-FC	RS-FC	HSM-FC
N400A2F1	0.005	0.729	0.315	0.002	0.905	0.911
N400A2F2	0.002	0.409	0.267	0.001	0.738	0.748
N400A2F3	0.250	0.763	0.811	0.000	0.352	0.860
N400A3F4	0.183	0.839	0.227	0.103	0.881	0.900
N400A4F5	0.198	0.791	0.209	0.113	0.845	0.876
N400A5F6	0.846	0.990	0.846	0.219	0.990	0.990
N400A5F7	0.381	0.648	0.143	0.146	0.713	0.726
N400A6F8	0.285	0.654	0.283	0.044	0.640	0.686
N400A7F9	0.494	0.935	0.489	0.028	0.934	0.937
N400A7F10	0.184	0.966	0.184	0.023	0.848	0.970
Mean	0.282	0.772	0.377	0.068	0.785	0.860

Table 4

Average NMI of the synthetic datasets

	FCM	SVR-FC	KRG-FC	RBF-FC	RS-FC	HSM-FC
N400A2F1	0.005	0.660	0.270	0.000	0.826	0.838
N400A2F2	0.000	0.340	0.227	0.000	0.645	0.657
N400A2F3	0.214	0.683	0.736	0.001	0.327	0.802
N400A3F4	0.226	0.760	0.231	0.158	0.809	0.832
N400A4F5	0.244	0.715	0.206	0.169	0.763	0.801
N400A5F6	0.759	0.977	0.759	0.174	0.977	0.977
N400A5F7	0.413	0.547	0.160	0.121	0.610	0.624
N400A6F8	0.341	0.555	0.339	0.035	0.538	0.588
N400A7F9	0.396	0.885	0.392	0.023	0.883	0.889
N400A7F10	0.185	0.935	0.18	0.019	0.777	0.943
Mean	0.278	0.716	0.350	0.007	0.716	0.795

From the experimental results and analysis above, it can be seen that the SVR-FC and RS-FC algorithms exhibit much better clustering performance than the other algorithms. Thus, support vector regression and response surface are used to construct the hybrid surrogate model in the proposed algorithm. The parameters of the HSM-FC algorithm are set as follows: the fuzzification parameter m is 2, the threshold value is 10^–6, the maximum iteration is 50. For each synthetic dataset, the experiments are conducted 30 times, and the average results are shown in Tables 2∼4 as well. It can be found that the HSM-FC algorithm achieves better cluster validity indexes than the SVR-FC and RS-FC algorithms for most synthetic datasets except N400A5F6 dataset, which indicates that the HSM-FC algorithm is able to take advantage of support vector regression and response surface to obtain accurate clustering results.

Table 5

Average running time of the synthetic datasets (s)

	SVR-FC	KRG-FC	RBF-FC	RS-FC	HSM-FC
N400A2F1	58.161	2.082	0.630	0.088	56.755
N400A2F2	64.535	2.236	0.804	0.128	56.888
N400A2F3	64.891	0.712	0.979	0.229	29.205
N400A3F4	87.071	0.617	0.174	0.186	78.853
N400A4F5	69.928	0.554	0.644	0.196	65.566
N400A5F6	69.981	0.798	0.446	0.126	24.294
N400A5F7	68.421	7.012	0.867	0.149	65.942
N400A6F8	72.525	1.076	0.489	0.129	70.927
N400A7F9	71.805	2.959	0.440	0.149	70.504
N400A7F10	63.314	4.890	0.893	0.301	60.970

To compare the computational cost of the HSM-FC algorithm with that of the single surrogate model-based fuzzy clustering algorithms, the average running times over 30 runs on the synthetic datasets are shown in Table 5 (CPU: Intel Core i7-10700KF, RAM: 32 GB). The SVR-FC algorithm has the longest running time among the single surrogate model-based fuzzy clustering algorithms, mainly because the computational cost of support vector regression is much higher than that of the Kriging method, radial basis function, and response surface when the sample size is significantly larger than the input dimension [37]. The proposed HSM-FC algorithm runs faster than the SVR-FC algorithm but slower than the KRG-FC, RBF-FC, and RS-FC algorithms. The hybrid surrogate model uses multiple surrogate models to evaluate the regression relationship of the data, so the regression accuracy and the per-iteration computational cost both increase. However, with the help of the hybrid surrogate model an accurate regression relationship can be obtained in fewer iterations, so the total computational cost of the proposed HSM-FC algorithm is lower than that of the SVR-FC algorithm. Finally, we consider the convergence of the HSM-FC algorithm. The convergence curves in Appendix C show that the values of the clustering objective function decrease rapidly within the initial iterations and then converge gradually. The alternating optimization method of the proposed algorithm is similar to that of FCM, so the HSM-FC algorithm can achieve convergence as discussed in Ref [38].

4 Experiments on engineering datasets

4.1 Mill dataset

The Mill dataset comes from three experiments run on a milling machine under different operating conditions [39]. Data sampled by three different types of sensors (acoustic emission sensor, vibration sensor, current sensor) are acquired at several positions, and the DC spindle motor current, AC spindle motor current, table vibration, spindle vibration, acoustic emission at table, and acoustic emission at spindle of each experiment are recorded. The initial data have 9000 samples. In this paper, we keep every 60th datum (1, 61, 121,..., 8941), and only the data of the milling process are retained. Finally, the dataset used has 300 samples and includes three classes. The details of the Mill dataset used in this section are shown in Table 6.

Table 6

Details of Mill dataset

Clusters	3
Samples	300
Attributes	6	DC spindle motor current
		AC spindle motor current
		Table vibration
		Spindle vibration
		Acoustic emission at table
		Acoustic emission at spindle

The parameters of the clustering algorithms are set as follows: the fuzzification parameter m is 2, the threshold value is 10^–6, the maximum iteration is 50, the membership criterion θ for the SVR-FC, KRG-FC, RBF-FC and RS-FC algorithms is 0.33. The DC spindle motor current is set as the output attribute, and the AC spindle motor current, table vibration, spindle vibration, acoustic emission at table, and acoustic emission at spindle are set as the input attributes. The obtained clustering results are shown in Fig. 1. It can be found that the HSM-FC algorithm produces the smallest misclassification rate, the highest adjusted rand index and normalized mutual information. The proposed algorithm is able to provide competitive clustering results for the Mill dataset.

Fig. 1

Clustering results of the Mill dataset.

Table 7

Details of Battery discharge dataset

Clusters	2
Samples	382
Attributes	3	Voltage
		Current
		Temperature

Table 8

Clustering results of the Battery discharge dataset

	FCM	SVR-FC	KRG-FC	RBF-FC	RS-FC	HSM-FC
MS	0.487	0.479	0.507	0.497	0.385	0.293
ARI	0.002	0.001	0.002	0.000	0.051	0.169
NMI	0.001	0.001	0.000	5.238E-06	0.039	0.128

4.2 Battery discharge dataset

The Battery discharge dataset is provided by the Prognostics CoE at NASA Ames [40]. The dataset records two discharging processes of a Li-Ion battery, which have 192 samples and 190 samples respectively. Each datum involves the voltage, current, and temperature of the Li-Ion battery (Table 7). The parameters of the experiments are set as follows: the fuzzification parameter m is 2, the threshold value is 10^–6, the maximum iteration is 50, the membership criterion θ is 0.5 for the single surrogate model-based fuzzy clustering algorithms. The temperature is set as the output attribute, and the voltage and current are set as the input attributes. The obtained clustering results are shown in Table 8. From this table, it can be found that the HSM-FC algorithm produces the best clustering results. The proposed algorithm is able to provide competitive clustering results for the Battery discharge dataset.

4.3 Borehole dataset

The borehole dataset comes from the water flow rate problem [41]. The dataset has 1500 samples and can be evenly divided into five clusters according to the radius of borehole. The attributes of the Borehole dataset are listed in Table 9.

Table 9

Details of the Borehole dataset

Clusters	5
Samples	1500
Attributes	8	Water flow rate
		Radius of influence
		Transmissivity of upper aquifer
		Potentiometric head of upper aquifer
		Transmissivity of lower aquifer
		Length of borehole
		Hydraulic conductivity of borehole
		Potentiometric head of lower aquifer

Fig. 2

Clustering results of the Borehole dataset.

In the experiment, the water flow rate is set as the output attribute, and the other attributes are set as the input attributes. The experimental parameters are set as follows: the fuzzification parameter m is 2, the threshold value is 10^–6, the maximum iteration is 50, the membership criterion θ is 0.2. The clustering results are shown in Fig. 2. It can be seen the misclassification rate of the HSM-FC algorithm is 0.287, which is much smaller than the other clustering algorithms, and it also achieves the best results for the cluster validity indexes ARI and NMI. The proposed HSM-FC algorithm can take advantage of different surrogate models to provide better clustering performance than the single surrogate model-based fuzzy clustering algorithms for the Borehole dataset.

4.4 Piston dataset

The Piston dataset comes from the simulation of a piston moving within a cylinder, developed by Kenett and Zacks [42]. The dataset has 2000 samples and can be evenly divided into five clusters according to the piston mass and surface area. The details of the Piston dataset are listed in Table 10.

Table 10
Details of the Piston dataset

Clusters		5
Samples		2000
Attributes	6	Cycle time
		Initial gas volume
		Spring coefficient
		Atmospheric pressure
		Ambient temperature
		Filling gas temperature

The experimental parameters are set as follows: the fuzzification parameter m is 2, the threshold value is $10^{-6}$, the maximum number of iterations is 50, and the membership criterion θ is 0.2 for the single surrogate model-based fuzzy clustering algorithms. The cycle time is set as the output attribute. The initial gas volume, spring coefficient, atmospheric pressure, ambient temperature, and filling gas temperature are set as the input attributes. The clustering results are shown in Fig. 3. The HSM-FC algorithm produces the smallest misclassification rate and the highest adjusted Rand index (ARI) and normalized mutual information (NMI), which indicates that the proposed algorithm is able to provide competitive clustering results for the Piston dataset.
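The cycle time of the piston simulation is commonly written in the closed form sketched below. It is assumed here that the dataset follows this commonly quoted form of the Kenett and Zacks model. The function and argument names are illustrative, and the specific piston masses and surface areas that define the five clusters are not given in this section.

```python
import math


def piston_cycle_time(M, S, V0, k, P0, Ta, T0):
    """Cycle time of the piston simulation (commonly quoted form, assumed).

    M: piston mass             S:  piston surface area
    V0: initial gas volume     k:  spring coefficient
    P0: atmospheric pressure   Ta: ambient temperature
    T0: filling gas temperature
    """
    A = P0 * S + 19.62 * M - k * V0 / S
    V = S / (2.0 * k) * (math.sqrt(A ** 2 + 4.0 * k * P0 * V0 * Ta / T0) - A)
    stiffness = k + S ** 2 * P0 * V0 * Ta / (T0 * V ** 2)
    return 2.0 * math.pi * math.sqrt(M / stiffness)
```

Under this assumption, fixing `M` and `S` for each cluster and sampling the other five quantities yields the five input attributes of Table 10, with the returned value as the output attribute.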

Fig. 3

Clustering results of the Piston dataset.

5 Conclusions

In this paper, a hybrid surrogate model-based fuzzy clustering algorithm is proposed. The proposed algorithm is developed within the framework of the fuzzy c-means algorithm, in which the regression relationship is used to evaluate the differences among the clusters. Several surrogate models are utilized simultaneously to describe the regression relationship through a weighting scheme. The clustering objective function is designed based on the prediction errors of the surrogate models, and an alternating optimization method is designed to minimize it and thereby obtain the memberships of the data and the weights of the surrogate models. Synthetic datasets are used to compare the performance of the single surrogate model-based fuzzy clustering algorithms. The results indicate that the support vector regression-based and response surface-based fuzzy clustering algorithms show better clustering performance than the other single surrogate model-based fuzzy clustering algorithms; therefore, support vector regression and response surface are used to construct the hybrid surrogate model in the proposed algorithm. The experimental results on the synthetic and engineering datasets indicate that the proposed algorithm can provide better clustering results than the fuzzy c-means algorithm and the single surrogate model-based fuzzy clustering algorithms for datasets with regression relationships.

Appendix A

Table A1

Details of the synthetic datasets

Dataset	Cluster	Regression relationship	Range
N400A2F1	1	y = sin(2πx/10) + 0.2sin (2πx/2.5) + rand [- 0.02, 0.02]	[0, 1]
	2	y = sin(2πx/10) + rand [- 0.02, 0.02]
N400A2F2	1	y = x sin(x)/10 + rand [- 0.02, 0.02]	[0, 1]
	2	y = (x + sin(x))/10 + rand [- 0.02, 0.02]
N400A2F3	1	$y = 8 - \frac{1}{{(x + 0.01)}^{2}} - \frac{1}{{(x + 0.3)}^{2}} - \frac{1}{{(x + 0.4)}^{2}} + rand [- 0.5, 0.5]$	[0, 1]
	2	$y = 6 - \frac{1}{(x^{2} + 0.01)} - \frac{1}{{(x - 0.6)}^{2} + 0.02} - \frac{1}{{(x - 0.8)}^{2} + 0.04} + rand [- 0.5, 0.5]$
N400A3F4	1	$y = {(x_{2} - 5.1 * \frac{x_{1}^{2}}{4 π^{2}} + \frac{5 x_{1}}{π} - 6)}^{2} + ((1 - \frac{1}{8 π}) * cos (x_{1})) + rand [- 0.5, 0.5]$	[0, 1]
	2	$y = {(x_{2} - 5.1 * \frac{x_{1}^{2}}{4 π^{2}} + \frac{5 x_{1}}{π} - 6)}^{2} + {((1 - \frac{1}{8 π}) * \cos (x_{1}))}^{2} + rand [- 0.5, 0.5]$
N400A4F5	1	y = - x₁x₂x₃ + rand [- 0.1, 0.1]	[0, 1]
	2	y = - x₁x₂x₃ - x₁x₃ - x₁x₂ - x₂x₃ + rand [- 0.1, 0.1]
N400A5F6	1	$y = (1.1 - \sum_{i = 1}^{4} α_{i} exp (- \sum_{j = 1}^{4} A_{ij} {(x_{j} - p_{ij})}^{2})) / 0.839 + rand [- 0.5, 0.5]$	[0, 1]
	2	$y = (1.1 - \sum_{i = 1}^{4} α_{i} exp (- \sum_{j = 1}^{4} {(x_{j} - p_{ij})}^{2})) / 0.839 + rand [- 0.5, 0.5]$
N400A5F7	1	$y = (\sum_{i = 1}^{4} {(\sum_{j = 1}^{3} (j^{i} + 0.2) * ({(\frac{x_{j}}{j})}^{i} - 1))}^{2}) / 120 + rand [- 0.5, 0.5]$	[0, 1]
	2	$y = (\sum_{i = 1}^{4} {(\sum_{j = 1}^{4} (j^{i} + 0.2) * ({(\frac{x_{j}}{j})}^{i} - 1))}^{2}) / 1500 + rand [- 0.5, 0.5]$
N400A6F8	1	y = 10 sin(2x₁x₂π) + 20 (x₃ - 1.5) ² + 8x₄ + 4x₅ + rand [- 0.1, 0.1]	[0, 1]
	2	y = 10 sin(x₁x₂π) + 20 (x₃ - 0.5) ² + 10x₄ + 5x₅ + rand [- 0.1, 0.1]
N400A7F9	1	$y = 1 {(x_{1} - 2 + 8 x_{2} - 8 x_{2}^{2})}^{2} + 0.25 {(3 - 4 x_{2})}^{2} + 4 \sqrt{x_{3} + 1} {(2 x_{3} - 1)}^{2} + 0.25 \sum_{i = 4}^{6} i \ln (1 + \sum_{j = 3}^{i} x_{j}) + rand [- 0.1, 0.1]$	[0, 1]
	2	$y = 0.4 {(x_{1} - 2 + 8 x_{2} - 8 x_{2}^{2})}^{2} + 0.1 {(3 - 4 x_{2})}^{2} + 0.16 \sqrt{x_{3} + 1} {(2 x_{3} - 1)}^{2} + 0.7 \sum_{i = 4}^{6} i \ln (1 + \sum_{j = 3}^{i} x_{j}) + rand [- 0.1, 0.1]$
N400A7F10	1	$y = (\sum_{i = 1}^{5} [100 {(x_{i + 1} - x_{i}^{2})}^{2} + {(x_{i} - 1)}^{2}]) * (1 + rand [- 0.025, 0.025])$	[0, 1]
	2	$y = (\sum_{i = 1}^{5} [50 {(x_{i + 1} - x_{i}^{2})}^{2} + {(x_{i} - 1)}^{2}]) * (1 + rand [- 0.025, 0.025])$

Appendix B

In this section, the single surrogate model-based fuzzy clustering algorithms are described, starting with the SVR-FC algorithm. The SVR-FC algorithm is a clustering algorithm that partitions data according to the regression relationship of the data with the help of support vector regression [34]. It clusters a given set of object data {x₁, x₂, … , x_n} $\subset ℝ^{d \times n}$ into c fuzzy clusters by minimizing the objective function J_SVR-FC

(1) $J_{SVR - FC} = \sum_{i = 1}^{c} \sum_{k = 1}^{n} u_{ik}^{m} {(x_{obj, k} - {\hat{x_{obj, k}}}_{i})}^{2}$

with the following constraint

(2) $\sum_{i = 1}^{c} u_{ik} = 1 \quad (k = 1, 2, \dots, n; \ \forall i, k : u_{ik} \in [0, 1])$

where u_ik is the membership that represents the degree to which x_k belongs to the i-th cluster; $x_{k} = [x_{1, k}, x_{2, k}, \dots, x_{d, k}]$; x_obj,k is the output attribute of x_k; ${\hat{x_{obj, k}}}_{i}$ is the corresponding estimate given by the SVR model of the i-th cluster, $SVR_{i}(x_{1, k}, \dots, x_{obj - 1, k}, x_{obj + 1, k}, \dots, x_{d, k})$; and x_1,k, … , x_obj-1,k, x_obj+1,k, … , x_d,k are the input attributes. The necessary conditions for minimizing Equation (1) under the constraint of Equation (2) yield the following partition matrix:

(3) $u_{ik} = {[\sum_{t = 1}^{c} {(\frac{{(x_{obj, k} - {\hat{x_{obj, k}}}_{i})}^{2}}{{(x_{obj, k} - {\hat{x_{obj, k}}}_{t})}^{2}})}^{\frac{1}{m - 1}}]}^{- 1}$
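As a minimal illustration of Equation (3), the following Python sketch computes the memberships from a table of squared prediction errors. The function name `update_memberships` and the small constant `eps`, which guards against division by zero when a prediction is exact, are assumptions introduced for this example and are not part of the original formulation.

```python
def update_memberships(sq_errors, m=2.0, eps=1e-12):
    """Membership update of Equation (3).

    sq_errors[i][k] is the squared prediction error of datum k under the
    surrogate model of cluster i. Returns u with u[i][k] in [0, 1] and
    sum_i u[i][k] == 1 for every k.
    """
    c = len(sq_errors)
    n = len(sq_errors[0])
    power = 1.0 / (m - 1.0)
    u = [[0.0] * n for _ in range(c)]
    for k in range(n):
        # eps is an illustrative safeguard (assumption), not in Equation (3).
        errs = [sq_errors[i][k] + eps for i in range(c)]
        for i in range(c):
            u[i][k] = 1.0 / sum((errs[i] / errs[t]) ** power for t in range(c))
    return u


# Two clusters, two data: each datum is fitted four times better by one model.
u = update_memberships([[1.0, 4.0], [4.0, 1.0]], m=2.0)
# u[0][0] is approximately 0.8 and u[1][0] approximately 0.2
```

With m = 2, the membership of a datum in a cluster is inversely proportional to its squared prediction error under that cluster's model, so the better-fitting model receives the larger membership.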

To ensure that the i-th SVR model can effectively learn the regression relationship, only the data whose membership is higher than a criterion θ are used to construct the SVR model in each iteration. The complete procedure of the SVR-FC algorithm is described as follows.

Step (1) Set the number of clusters c, the output attribute x_obj, the membership criterion θ, and the iteration counter g = 1;

Step (2) Use FCM to cluster the data, and take the obtained membership matrix U_FCM as the initial membership matrix U⁽⁰⁾;

Step (3) For each cluster, select the data whose membership is higher than θ as the training data;

Step (4) Check whether any training dataset is empty; if so, create a new membership matrix U randomly and return to Step (3);

Step (5) Use the training data of the i-th cluster to build the i-th SVR model;

Step (6) Use the obtained SVR models to predict the responses of all the data and form the n × c response matrix;

Step (7) Calculate the partition matrix $U^{(g)}$ as follows:

(4) $u_{ik} = {[\sum_{t = 1}^{c} {(\frac{∥ x_{obj, k} - {\hat{x_{obj, k}}}_{i} ∥_{2}^{2}}{∥ x_{obj, k} - {\hat{x_{obj, k}}}_{t} ∥_{2}^{2}})}^{\frac{1}{m - 1}}]}^{- 1}$

Step (8) If $\max_{i, k} | u_{ik}^{(g)} - u_{ik}^{(g - 1)} | < ɛ$, stop and output the partition matrix U; otherwise, set g = g + 1 and return to Step (3).

By replacing support vector regression with the Kriging method, radial basis function, and response surface, respectively, in Steps (5) and (6), the KRG-FC, RBF-FC, and RS-FC algorithms are obtained.
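The iteration in Steps (3) to (8) can be sketched in a few lines of dependency-free Python. This is only an illustrative sketch under stated assumptions: the surrogate is an ordinary least-squares straight line on a single input attribute (a first-order response surface, i.e., a simplified RS-FC variant rather than SVR); the initial memberships are passed in by the caller instead of being computed by FCM; an empty training set triggers a random re-draw that counts as one iteration; and all function names as well as the constant `1e-12` are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Least-squares line y = a*x + b, standing in for the surrogate model."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx > 0 else 0.0
    return a, my - a * mx


def random_memberships(c, n, rng):
    """Random partition matrix whose columns sum to 1 (Step 4)."""
    cols = [[rng.random() + 1e-9 for _ in range(c)] for _ in range(n)]
    return [[cols[k][i] / sum(cols[k]) for k in range(n)] for i in range(c)]


def regression_fuzzy_clustering(xs, ys, u0, theta=0.5, m=2.0, tol=1e-6,
                                max_iter=50, seed=0):
    rng = random.Random(seed)
    c, n = len(u0), len(xs)
    u = [row[:] for row in u0]   # Step 2 (the paper initializes with FCM)
    models = None
    power = 1.0 / (m - 1.0)
    for _ in range(max_iter):
        # Step 3: training data are the points with membership above theta.
        train = [[k for k in range(n) if u[i][k] > theta] for i in range(c)]
        # Step 4: re-draw U if any training set is empty.
        if not all(train):
            u = random_memberships(c, n, rng)
            continue
        # Step 5: one surrogate model per cluster.
        models = [fit_line([xs[k] for k in idx], [ys[k] for k in idx])
                  for idx in train]
        # Step 6: squared errors of every datum under every model
        # (1e-12 avoids division by zero; an assumption of this sketch).
        err = [[(ys[k] - (a * xs[k] + b)) ** 2 + 1e-12 for k in range(n)]
               for a, b in models]
        # Step 7: membership update of Equation (4).
        new_u = [[1.0 / sum((err[i][k] / err[t][k]) ** power for t in range(c))
                  for k in range(n)] for i in range(c)]
        # Step 8: stop when the largest membership change is below tol.
        change = max(abs(new_u[i][k] - u[i][k])
                     for i in range(c) for k in range(n))
        u = new_u
        if change < tol:
            break
    return u, models


# Two noise-free lines, y = 2x and y = -2x + 1, with a rough initial partition.
xs = [i / 19 for i in range(20)] * 2
ys = [2 * x for x in xs[:20]] + [-2 * x + 1 for x in xs[20:]]
u0 = [[0.6] * 20 + [0.4] * 20, [0.4] * 20 + [0.6] * 20]
u, models = regression_fuzzy_clustering(xs, ys, u0)
```

In this toy case, the two Euclidean point clouds overlap, but the two regression relationships differ, so the memberships sharpen toward the line that generated each point. Replacing `fit_line` with another regressor corresponds to the substitution of surrogate models in Steps (5) and (6).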

Appendix C

Acknowledgments

This work is supported by the Natural Science Foundation of Jiangsu Province (BK20210777), the Funding of Jiangsu University (20JDG068), and the National Natural Science Foundation of China (51875260).

References

1. Pham D.T. and Afify A.A., Clustering techniques and their applications in engineering, Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science 221(11) (2007), 1445–1459.
2. Kannan S.R., Ramathilagam and Chung P.C., Effective fuzzy c-means clustering algorithms for data clustering problems, Expert Systems with Applications 39(7) (2012), 6292–6300.
3. Song, Shi, Wu and Sun, A new fuzzy c-means clustering-based time series segmentation approach and its application on tunnel boring machine analysis, Mechanical Systems and Signal Processing 133 (2019), 106279.
4. De Oliveira J.V. and Pedrycz (Eds.), Advances in fuzzy clustering and its applications, John Wiley & Sons (2007).
5. Pandeeswari and Kumar, Anomaly detection system in cloud environment using fuzzy clustering based ANN, Mobile Networks and Applications 21(3) (2016), 494–505.
6. Ganapathy, Kulothungan, Yogesh and Kannan, A novel weighted fuzzy C-means clustering based on immune genetic algorithm for intrusion detection, Procedia Engineering 38 (2012), 1750–1757.
7. Yang, Chung F.L. and Shitong, Robust fuzzy clustering-based image segmentation, Applied Soft Computing 9(1) (2009), 80–84.
8. Chatzis S.P. and Varvarigou T.A., A fuzzy clustering approach toward hidden Markov random field models for enhanced spatially constrained image segmentation, IEEE Transactions on Fuzzy Systems 16(5) (2008), 1351–1361.
9. Jain A.K., Data clustering: 50 years beyond K-means, Pattern Recognition Letters 31(8) (2010), 651–666.
10. Setnes M. J. and Roubos, GA-fuzzy modeling and classification: complexity and performance, IEEE Transactions on Fuzzy Systems 8(5) (2000), 509–522.
11. Dey, Pratihar D.K. and Datta G.L., Genetic algorithm-tuned entropy-based fuzzy C-means algorithm for obtaining distinct and compact clusters, Fuzzy Optimization and Decision Making 10(2) (2011), 153–166.
12. Carvalho F.D.A., Tenório C.P. and Junior N.L.C., Partitional fuzzy clustering methods based on adaptive quadratic distances, Fuzzy Sets and Systems 157(21) (2006), 2833–2857.
13. Liu H.C., Yih J.M., Lin W.C. and Wu D.B., Fuzzy C-Means Algorithm Based on Common Mahalanobis Distances, Journal of Multiple-Valued Logic & Soft Computing (2009), 15.
14. Gueorguieva, Valova and Georgiev, M&MFCM: fuzzy c-means clustering with mahalanobis and minkowski distance metrics, Procedia Computer Science 114 (2017), 224–233.
15. Király, Vathy-Fogarassy Á. and Abonyi, Geodesic distance based fuzzy c-medoid clustering–searching for central points in graphs and high dimensional data, Fuzzy Sets and Systems 286 (2016), 157–172.
16. Mota V.C., Damasceno F.A. and Leite D.F., Fuzzy clustering and fuzzy validity measures for knowledge discovery and decision making in agricultural engineering, Computers and Electronics in Agriculture 150 (2018), 118–124.
17. Song, Shi, Wu and Sun, A new fuzzy c-means clustering-based time series segmentation approach and its application on tunnel boring machine analysis, Mechanical Systems and Signal Processing 133 (2019), 106279.
18. Majumder and Pratihar D.K., Multi-sensors data fusion through fuzzy clustering and predictive tools, Expert Systems with Applications 107 (2018), 165–172.
19. Arora and Tushir, An Enhanced Spatial Intuitionistic Fuzzy C-means Clustering for Image Segmentation, Procedia Computer Science 167 (2020), 646–655.
20. Lin Y.C., Lee S.J., Ouyang C.S. and Wu C.H., Air quality prediction by neuro-fuzzy modeling approach, Applied Soft Computing 86 (2020), 105898.
21. Ren L.H., Ye Z.F. and Zhao Y.P., A modeling method for aero-engine by combining stochastic gradient descent with support vector regression, Aerospace Science and Technology 99 (2020), 105775.
22. …, Shi, Liu and Shi, Uncertainty optimization of dental implant based on finite element method, global sensitivity analysis and support vector regression, Proceedings of the Institution of Mechanical Engineers, Part H: Journal of Engineering in Medicine 233(2) (2019), 232–243.
23. Serani, Pellegrini, Wackers, Jeanson C.E., Queutey, Visonneau and Diez, Adaptive multi-fidelity sampling for CFD-based optimisation via radial basis function metamodels, International Journal of Computational Fluid Dynamics 33(6-7) (2019), 237–255.
24. Halali M.A., Azari, Arabloo, Mohammadi A.H. and Bahadori, Application of a radial basis function neural network to estimate pressure gradient in water–oil pipelines, Journal of the Taiwan Institute of Chemical Engineers 58 (2016), 189–202.
25. …, Xu, Chen and Zhou, Prediction method of bridge static load test results based on Kriging model, Engineering Structures 214 (2020), 110641.
26. Lee, Choi K.K. and Zhao, Sampling-based RBDO using the stochastic sensitivity analysis and dynamic Kriging method, Structural and Multidisciplinary Optimization 44(3) (2011), 299–317.
27. Toft H.S., Svenningsen, Moser, Sørensen J.D. and Thøgersen M.L., Assessment of wind turbine structural integrity using response surface methodology, Engineering Structures 106 (2016), 471–483.
28. Malghan R.L., Rao, Arun Kumar, Rao S.S. and Herbert M.A., Machining parameters optimization of AA6061 using response surface methodology and particle swarm optimization, International Journal of Precision Engineering and Manufacturing 19(5) (2018), 695–704.
29. Goel, Haftka R.T., Shyy and Queipo N.V., Ensemble of surrogates, Structural and Multidisciplinary Optimization 33(3) (2007), 199–216.
30. Zhang, Chowdhury and Messac, An adaptive hybrid surrogate model, Structural and Multidisciplinary Optimization 46(2) (2012), 223–238.
31. Asgari, Moazamigoodarzi, Tsai P.J., Pal, Zheng, Badawy and Puri I.K., Hybrid surrogate model for online temperature and pressure predictions in data centers, Future Generation Computer Systems 114 (2021), 531–547.
32. Song, Lv, Li, Sun and Zhang, An advanced and robust ensemble surrogate model: extended adaptive hybrid functions, Journal of Mechanical Design 140(4) (2018), 041402.
33. Bezdek J.C., Ehrlich and Full, FCM: The fuzzy c-means clustering algorithm, Computers & Geosciences 10(2-3) (1984), 191–203.
34. Shi, Zhang, Sun and Song, A fuzzy c-means algorithm based on the relationship among attributes of data and its application in tunnel boring machine, Knowledge-Based Systems 191 (2020), 105229.
35. Santos J.M. and Embrechts, On the use of the adjusted rand index as a metric for evaluating supervised classification, in: International Conference on Artificial Neural Networks, Springer, Berlin, Heidelberg (2009), pp. 175–184.
36. Estévez P.A., Tesmer, Perez C.A. and Zurada J.M., Normalized mutual information feature selection, IEEE Transactions on Neural Networks 20(2) (2009), 189–201.
37. Smola A.J. and Schölkopf, A tutorial on support vector regression, Statistics and Computing 14(3) (2004), 199–222.
38. Hathaway R.J., Hu and Bezdek J.C., Local convergence of tri-level alternating optimization, Neural Parallel and Scientific Computations 9(3/4) (2001), 19–28.
39. Agogino and Goebel, BEST lab, UC Berkeley, "Milling Data Set", NASA Ames Prognostics Data Repository (2007).
40. Saxena and Goebel, "Turbofan Engine Degradation Simulation Data Set", NASA Ames Prognostics Data Repository (2008).
41. Gramacy R.B. and Lian, Gaussian process single-index models as emulators for computer experiments, Technometrics 54(1) (2012), 30–41.
42. Kenett and Zacks, Modern industrial statistics: design and control of quality and reliability, Duxbury Press, Pacific Grove, CA (1998).
