Abstract
Data clustering, the process of retrieving essential information from a dataset, is a significant data mining approach. In recent decades, nature-inspired optimization algorithms have been designed to solve optimization problems, particularly those arising in data clustering. However, the existing methods are not feasible for large amounts of data, as the execution time of the traditional approaches is high. Hence, an efficient and optimal data clustering scheme is designed using the devised Fractional Sail Fish-Sparse Fuzzy C-Means + Particle Whale Optimization (FSF-Sparse FCM + PWO) based MapReduce Framework (MRF) to process high-dimensional data. The proposed FSF-Sparse FCM is designed by integrating Sail Fish Optimization (SFO) with the fractional concept and Sparse FCM. The proposed MRF comprises two functions, the mapper function and the reducer function, to perform data clustering. The proposed FSF-Sparse FCM is employed in the mapper phase to compute the cluster centroids, thereby generating the intermediate data. The intermediate data is tuned in the reducer phase using Particle Whale Optimization (PWO), which integrates Particle Swarm Optimization (PSO) and the Whale Optimization Algorithm (WOA). Accordingly, the optimal cluster centroid is computed in the reducer phase using an objective function based on the DB-Index. The proposed FSF-Sparse FCM + PWO obtained the highest accuracy of 0.903 and the lowest DB-Index of 39.07.
Introduction
Big data [9,18,37] is voluminous and complex, and is characterized by factors such as veracity, value, variety, velocity, and volume; hence, the existing data processing methods are not sufficient to deal with it. A main problem in the big data environment is analyzing high-dimensional information for different large-scale and complex applications. Multi-view high-dimensional information is defined using various structures and feature spaces that are received from different sources [31]. Big data is a collection of databases with complex and large data, such that processing the data with the traditional tools of a database management system is very difficult. Some of the issues in the data processing system are search, storage [27], analysis, visualization, sharing, and capture. Nevertheless, large databases are widely used because they carry additional information compared with smaller datasets [13]. Apart from classification [39–41], clustering is the foremost data mining approach used to perform knowledge discovery, so clustering plays an active role in data analysis. In general, clustering partitions the dataset into groups of relevant objects by reducing the similarity among data objects in different clusters and increasing the similarity of objects that lie in the same cluster [36]. Data clustering is an important step in various domains, including pattern recognition, computer vision, data mining, medical fields, DNA analysis, web statistics, market analysis [21], and spatial database applications [4].
Clustering is commonly used to identify information patterns or features in subgroups of data and the structure of the dataset. Hence, cluster analysis [33] is used in different fields, such as bioinformatics, document analysis, atmospheric sciences, medical diagnosis, wireless sensor networks [6], seismic zoning, pattern recognition, and so on [36]. Data clustering is a technique used to place similar data objects together, such that similar objects are located in one group while different items are placed in different groups. It is an unsupervised learning model that assigns objects to clusters that are not specified in advance. In contrast, classification is a supervised method that allocates objects to predetermined classes [1,23]. Data clustering is used in areas such as image study, information mining, data recovery [14,15], machine learning, and statistical analysis. Clustering techniques are categorized into different types, such as the density-related approach, model-related approach, grid-related approach, and hierarchical approach [1,3]. Moreover, traditional K-means clustering is a centroid-based clustering technique that divides the data space into a structure known as a Voronoi diagram. Because of its easy parallel processing and low computational cost, the K-means clustering model is more commonly used than spectral clustering to solve the issues in large-scale data clustering [7].
In clustering approaches, the representative of a cluster lies at its center, which increases the average similarity of data within the same cluster and reduces the average distance to the remaining objects. However, most clustering approaches compute the pairwise distance to find the representative or centroid, and this process is time-consuming [2,24,36]. In general, clustering multi-view data is an NP-hard problem, and it has attracted many researchers to develop different clustering methods for different types of real-world applications. An unsupervised model for selecting features is presented in the cluster configuration for realizing the structure [28]. The multi-view affinity propagation method is designed to perform multi-view clustering and is specifically used to cluster more than two views [31,34]. Different clustering methods have been designed to perform data clustering, including [10], CLIQUE, K-means, and hierarchical clustering. K-means is the most widely used clustering model, as it reduces the intra-cluster distance [36]. The techniques developed for clustering did not offer effective execution time or selection of optimal clusters for managing big data [13]. An evolutionary programming-based clustering method is developed in [25] that clusters the data into a number of clusters [13].
The major contribution of this research is explained as follows:
The mapper function is utilized to find the intermediate data, and the intermediate data is tuned in the reducer phase using PWO based on the objective function. PWO is the integration of PSO and WOA.
The paper is organized as follows: Section 2 presents the survey of conventional data clustering methods, and Section 3 presents the proposed FSF-Sparse FCM + PWO based MRF for big data clustering. Section 4 elaborates the results and discussion, and Section 5 concludes the research.
Motivation
Literature survey
Some of the existing data clustering approaches are reviewed in this section. Kulkarni, O. et al. [17] introduced fractional fuzzy clustering along with particle whale optimization to perform big data clustering using the MapReduce framework. Here, the optimal centroids were identified using PWO. The experimentation was evaluated using two different datasets, namely skin segmentation and localization, and the method obtained better clustering accuracy. Hosseini, B. and Kiani, K. [11] introduced a distributed density-based hesitant fuzzy clustering model to solve fuzzy characteristic issues. It does not require prior knowledge of the data distribution and clusters before performing the clustering process, and the existence of outlier data does not affect the characteristics and shape of the clusters. It has less data dependency and more scalability between the nodes. It showed more robustness in the presence of noise, required less computation cost, and increased the computation speed, but it did not consider gene expression data. Shukla, A.K. and Muhuri, P.K. [29] introduced interval type-2 fuzzy sets (IT2 FSs) for data clustering. This method was used to address uncertainty in gene expression data. Validity measures, such as partition entropy, partition coefficient, and silhouette coefficient, were used to evaluate the clustering performance. However, it failed to analyze the efficiency with large datasets, and its computational complexity increased linearly with the data size. It also did not consider a fuzzy uncertainty model for objectives such as dimensionality reduction on multivariate datasets.
Tao, Q. et al. [31] introduced an intelligent weighting k-means clustering (IWKM) model using swarm intelligence. Here, the coupling degree among the clusters was defined to enlarge the dissimilarity between clusters, and different features were utilized to find the clustering objects using a weighted distance function. Swarm intelligence was used to compute the cluster centers, feature weights, and view weights with a global search. It increased the performance using precise perturbation, and the method was more effective in big data applications. However, it did not consider generative adversarial networks (GANs) or transfer learning for clustering.
Ilango, S.S. et al. [13] implemented an artificial bee colony (ABC) model to find the best cluster by minimizing the execution time on large datasets. It integrated global and local search models to balance exploitation and exploration, and it solved the optimization issues in clustering by simulating real bees. The size of the dataset was varied and mapped against the corresponding timings. It was implemented in the Hadoop environment and effectively reduced the classification error and the execution time needed to select the optimal cluster. Abdulwahab, H.A. et al. [1] introduced a data clustering model named levy flight black hole (LBH). Here, the movement of the stars was based on a step size generated using the levy distribution, and the clustering performance was increased by combining the black hole approach with levy flight. This method was evaluated on six different datasets and showed effective results in clustering data objects. It escaped local minima and explored the search space more effectively. However, it was not applied to text document processing. Huang, W. et al. [12] introduced a signal and analytic clustering model with a Hadoop framework for data clustering. It was more effective, as it distributed the data over various nodes and enabled efficient data management. This method obtained a better packet delivery ratio and throughput and offered fault tolerance under failure. It effectively reduced the communication error and processing time.
Wu, Y. et al. [36] developed a projection-based clustering model for big data. This method projected the data points onto the centroids and measured the similarity between points by computing their projections on the centroids. It achieved time complexity linear in the sample size and showed better accuracy in data clustering with limited computation time. The centroid of a cluster was determined based on the density distribution of the objects, and the method required less memory for the clustering process. Lu, W. [20] developed an improved k-means clustering algorithm for big data mining under the Hadoop parallel framework. A distributed database was used to simulate the shared memory space and to parallelize the algorithm on the Hadoop platform. It improved the accuracy; however, it did not discuss the computational time. Tripathi, A.K. et al. [32] proposed a new recommendation system using a MapReduce-based Tournament empowered Whale optimization algorithm (TWOA) to attain optimal clusters. The experimental results showed that this method can be used on large-scale datasets. However, it limits the exploration ability.
Challenges
Some of the issues faced by the conventional big data clustering techniques are explained as follows:
Dealing with distributed data is complex in big data clustering, as the clustering model needs the data to be centralized [11]. Moreover, increasing the accuracy and the robustness to noise is a great challenge in the clustering model. Developing a clustering approach that generates satisfactory cluster quality within the specified time is a major challenge [17]. The Sparse Fuzzy C-Means used in the proposed method can shrink the weights of irrelevant features to zero, and it is solved efficiently in analytic form, which reduces the noise and the complexity in clustering. To solve high dimensionality issues, different dimensionality reduction methods have been introduced; these methods use feature extraction and selection to optimize the features for the clustering approach. However, reducing the dimensionality of data for a large database is computationally intensive [36]. The sailfish optimization algorithm used in the devised method is a high-speed optimizer, which addresses this computational complexity.
Proposed fractional sail fish-sparse fuzzy C-means + particle whale optimization for big data clustering
Big data clustering is designed with MRF using the proposed FSF-Sparse FCM + PWO approach. With MRF, data clustering is performed on big data by involving two different functions, the mapper function and the reducer function. Initially, the input big data is partitioned into a number of subsets, and these subsets are passed to the mapper phase, which includes a number of mappers, where clustering is accomplished using the proposed FSF-Sparse FCM. The cluster centroids are generated in the mapper phase, and thereby the intermediate data is generated by the mapper function. The intermediate data computed in the mapper phase is fed to the reducer phase, where it is tuned using PWO to calculate the optimal cluster centroids. The computation of the optimal cluster centroids is accomplished using an objective function based on the DB-Index. Figure 1 denotes the schematic diagram of FSF-Sparse FCM + PWO.

Figure 1. Schematic diagram of the proposed FSF-Sparse FCM + PWO.
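As a rough sketch of the data flow in Figure 1 (partition, map, collect intermediate data, reduce), the following pure-Python example may help. All function names are hypothetical. `mapper_cluster` merely stands in for FSF-Sparse FCM (it averages sorted chunks of each subset), and `reducer_tune` merely stands in for PWO (it averages matching centroids across mappers); neither reproduces the proposed algorithms, and the round-robin partitioning is likewise an assumption.

```python
from typing import List

Point = List[float]


def partition(data: List[Point], num_mappers: int) -> List[List[Point]]:
    """Split the input into num_mappers subsets (round-robin, an assumption)."""
    return [data[i::num_mappers] for i in range(num_mappers)]


def mean_point(points: List[Point]) -> Point:
    """Component-wise mean of a non-empty list of points."""
    dims = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dims)]


def mapper_cluster(subset: List[Point], num_clusters: int) -> List[Point]:
    """Placeholder for FSF-Sparse FCM (assumption): sort by the first
    feature, cut into num_clusters chunks, and return each chunk's mean."""
    if len(subset) < num_clusters:
        raise ValueError("each mapper needs at least num_clusters points")
    ordered = sorted(subset, key=lambda p: p[0])
    size = len(ordered) // num_clusters
    chunks = [ordered[k * size:(k + 1) * size] for k in range(num_clusters - 1)]
    chunks.append(ordered[(num_clusters - 1) * size:])
    return [mean_point(c) for c in chunks]


def reducer_tune(intermediate: List[List[Point]]) -> List[Point]:
    """Placeholder for PWO (assumption): average the j-th centroid of every
    mapper; centroids line up because each mapper emits them in sorted order."""
    num_clusters = len(intermediate[0])
    return [mean_point([cents[j] for cents in intermediate])
            for j in range(num_clusters)]


# Toy data with two well-separated groups (illustrative values only).
data = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [1.1, 0.9],
        [8.0, 8.0], [8.2, 7.9], [7.9, 8.1], [8.1, 8.2]]
subsets = partition(data, num_mappers=2)
intermediate = [mapper_cluster(s, num_clusters=2) for s in subsets]
final_centroids = reducer_tune(intermediate)
print(final_centroids)
```

The point of the sketch is only the shape of the computation: each mapper sees one subset and emits centroids, and the reducer sees nothing but those centroids.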
MapReduce is a software framework used to perform distributed computing on a large quantity of information. The computation is expressed in terms of mapper and reducer tasks. Let the database B, which contains a number of data points described by attributes, be represented as,
Big data clustering at mapper phase using proposed FSF-sparse FCM
The divided data is given to the mapper phase, and data clustering is achieved by the proposed FSF-Sparse FCM. The total number of mappers is c, and the mappers are specified as,
The input given to the a-th mapper is specified as,
Let us consider the cluster centroids as
The attack alternation strategy of SFO is expressed as,
By substituting Eq. (16) into Eq. (13), the devised FSF-Sparse FCM is represented as,
Hence, the cluster centroids are trained by the proposed FSF-Sparse FCM, and the total number of centroids computed depends on a user-set parameter. The centroids define the clusters in such a way that data points in the same cluster have similar characteristics, while data points in different clusters have different characteristics. Moreover, the cluster centroids computed using the proposed FSF-Sparse FCM are combined to generate the intermediate data, which is the input to the reducer phase. Algorithm 1 denotes the pseudo code of the implemented FSF-Sparse FCM.

Algorithm 1. Pseudo code of FSF-Sparse FCM.
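For orientation, the centroid and membership updates of plain fuzzy C-means, the base algorithm that FSF-Sparse FCM builds on, can be sketched as below. This is not the proposed method: it has no sparse feature weights, no fractional term, and no SFO step. The function name `fcm`, the fuzzifier `m = 2`, the iteration count, and the random initialization are all assumptions made for illustration.

```python
import math
import random


def fcm(points, num_clusters, m=2.0, iters=100, seed=0):
    """Plain fuzzy C-means (illustrative baseline, not FSF-Sparse FCM)."""
    rng = random.Random(seed)
    n, dims = len(points), len(points[0])
    # Random initial memberships, normalized so each row sums to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(num_clusters)]
        total = sum(row)
        u.append([v / total for v in row])
    centroids = []
    for _ in range(iters):
        # Centroid update: mean of the points weighted by u_ij ** m.
        centroids = []
        for j in range(num_clusters):
            w = [u[i][j] ** m for i in range(n)]
            w_sum = sum(w)
            centroids.append([
                sum(w[i] * points[i][d] for i in range(n)) / w_sum
                for d in range(dims)
            ])
        # Membership update: u_ij = 1 / sum_k (d_ij / d_ik) ** (2 / (m - 1)).
        power = 2.0 / (m - 1.0)
        for i in range(n):
            dist = [max(math.dist(points[i], c), 1e-12) for c in centroids]
            for j in range(num_clusters):
                u[i][j] = 1.0 / sum((dist[j] / dist[k]) ** power
                                    for k in range(num_clusters))
    return centroids, u


# Toy subset as one mapper might receive it (illustrative values only).
subset = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [1.1, 0.9],
          [8.0, 8.0], [8.2, 7.9], [7.9, 8.1], [8.1, 8.2]]
centroids, memberships = fcm(subset, num_clusters=2)
print(sorted(centroids))
```

In the mapper phase described above, the centroids returned by each mapper would form that mapper's share of the intermediate data.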
Each mapper maps its partitioned data to cluster centroids with respect to the centroid range and generates intermediate data. The outcome of the mapper phase is fed as input to the reducer phase, and the output of the c mappers is specified as,
The reducer phase performs the reducer function to find the optimal cluster centroids using PWO. Moreover, the overall reducers considered are given as,
The optimal cluster centroids are computed in the reducer phase based on the intermediate data Q. Moreover, the cluster centroids obtained in the reducer phase are expressed as,
Solution vector
The solution is to compute optimal cluster centroids by PWO in MRF such that the solution is randomly constructed based on intermediate data obtained from mappers using the proposed FSF-Sparse FCM.
Objective function
The objective function is computed based on the DB index [16], which measures the solution quality based on the similarity among clusters. The DB index is represented as,
Here,
Here,
Here, ρ is an integer. The distance between each data point and its respective cluster centroid should be minimal, whereas the distance between cluster centroids should be large.
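To make the objective concrete, the sketch below computes the standard Davies-Bouldin index: for each cluster, the worst-case ratio of summed within-cluster scatter to between-centroid distance, averaged over clusters. The exact form used in this work, including the role of ρ, is not recoverable from this section, and the reported DB-Index values are on a different scale, so treat this as the textbook definition under assumed Euclidean distance rather than the authors' formula. The function name and toy data are hypothetical.

```python
import math


def davies_bouldin(points, labels, centroids):
    """Textbook Davies-Bouldin index with Euclidean distance (lower is better)."""
    k = len(centroids)
    # Scatter S_j: mean distance of a cluster's members to its centroid.
    scatter = []
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            s = sum(math.dist(p, centroids[j]) for p in members) / len(members)
        else:
            s = 0.0  # empty cluster contributes no scatter
        scatter.append(s)
    # R_i: worst-case similarity of cluster i to any other cluster.
    total = 0.0
    for i in range(k):
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in range(k) if j != i
        )
    return total / k


points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]

# Compact, well-separated clustering.
db_good = davies_bouldin(points, [0, 0, 1, 1], [[0.0, 1.0], [10.0, 1.0]])
# Poor clustering: each cluster spans both sides, centroids are close.
db_bad = davies_bouldin(points, [0, 1, 0, 1], [[5.0, 0.0], [5.0, 2.0]])
print(db_good, db_bad)
```

A candidate set of centroids with small within-cluster distances and large centroid separation yields a smaller index, which matches the requirement stated above.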
Algorithmic steps of PWO
PWO is the integration of PSO [35] and WOA [22], such that the PWO algorithm is employed to find the optimal centroid in the reducer phase based on the objective function. PSO is a search process in which each particle updates its velocity and position based on changes in the environment. WOA models the hunting strategy and social behavior of humpback whales. The algorithmic steps are described as below:
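As background for the two ingredients only, the standard PSO velocity and position update and the standard WOA encircling-prey update can be sketched as follows. How PWO combines them is not given in this section, so the code does not attempt the hybrid rule. Function names, the inertia and acceleration coefficients, and the choice to pass the random numbers in as arguments (for reproducibility) are assumptions.

```python
def pso_step(x, v, pbest, gbest, r1, r2, w=0.7, c1=1.5, c2=1.5):
    """Standard PSO update for one particle.

    v_new = w*v + c1*r1*(pbest - x) + c2*r2*(gbest - x);  x_new = x + v_new.
    r1 and r2 are random numbers in [0, 1] supplied by the caller.
    """
    new_v = [w * vi + c1 * r1 * (pb - xi) + c2 * r2 * (gb - xi)
             for xi, vi, pb, gb in zip(x, v, pbest, gbest)]
    new_x = [xi + vi for xi, vi in zip(x, new_v)]
    return new_x, new_v


def woa_encircle_step(x, best, a, r1, r2):
    """Standard WOA encircling-prey update for one whale.

    A = 2*a*r1 - a,  C = 2*r2,  D = |C*best - x|,  x_new = best - A*D.
    a is usually decreased linearly from 2 to 0 over the iterations.
    """
    coeff_a = 2.0 * a * r1 - a
    coeff_c = 2.0 * r2
    return [b - coeff_a * abs(coeff_c * b - xi) for xi, b in zip(x, best)]


# One-dimensional toy values chosen so the arithmetic is easy to follow.
pso_x, pso_v = pso_step(x=[0.0], v=[0.0], pbest=[1.0], gbest=[2.0],
                        r1=1.0, r2=1.0)
woa_x = woa_encircle_step(x=[0.0], best=[1.0], a=1.0, r1=0.75, r2=0.5)
print(pso_x, pso_v, woa_x)
```

In a reducer built on these updates, each position vector would encode a candidate set of cluster centroids and its fitness would be the DB-Index-based objective.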
Results and discussion
This section elaborates the results and discussion of FSF-Sparse FCM + PWO in terms of accuracy and DB-Index.
Experimental setup
The FSF-Sparse FCM + PWO model is implemented in Java with two different datasets, namely the skin dataset [30] and the localization dataset [19].
Dataset description
This section presents the dataset description that is used to implement the data clustering process.
Evaluation metrics
Performance analysis
This section describes the performance analysis of FSF-Sparse FCM + PWO based on the cluster size.
Figure 2 portrays the performance analysis of proposed FSF-Sparse FCM + PWO with skin dataset. Figure 2 a) shows the accuracy analysis by considering the cluster size. When number of clusters = 2, the accuracy of proposed FSF-Sparse FCM + PWO with mapper size = 2, 3, 4, and 5 is 0.8925, 0.8935, 0.8979, and 0.9025, respectively. When number of clusters = 3, the accuracy of proposed FSF-Sparse FCM + PWO with mapper size = 2, 3, 4 and 5 is 0.9317, 0.9030, 0.8975, and 0.8925, respectively. When number of clusters = 4, the accuracy of proposed FSF-Sparse FCM + PWO with mapper size = 2, 3, 4 and 5 is 0.8976, 0.9341, 0.9019, and 0.9083, respectively. When number of clusters = 5, the accuracy of proposed FSF-Sparse FCM + PWO with mapper size = 2, 3, 4 and 5 is 0.9221, 0.9011, 0.9027, and 0.9166, respectively.
Figure 2 b) depicts the analysis of DB-Index by considering cluster size. When number of clusters = 2, the DB-Index of proposed FSF-Sparse FCM + PWO with mapper size = 2, 3, 4 and 5 is 9.4361, 9.5243, 9.1570, and 9.6194, respectively. When number of clusters = 3, the DB-Index of proposed FSF-Sparse FCM + PWO with mapper size = 2, 3, 4 and 5 is 19.6483, 9.7808, 14.4693, and 18.7304, respectively. When number of clusters = 4, the DB-Index of proposed FSF-Sparse FCM + PWO with mapper size = 2, 3, 4 and 5 is 28.1804, 9.2787, 29.2994, and 27.5798, respectively. When number of clusters = 5, the DB-Index of proposed FSF-Sparse FCM + PWO with mapper size = 2, 3, 4 and 5 is 33.5478, 2.0184, 39.0697, and 25.9278, respectively.
Figure 3 portrays the performance analysis of FSF-Sparse FCM + PWO with localization dataset. Figure 3 a) shows the accuracy analysis based on cluster size. When number of clusters = 2, the accuracy of proposed FSF-Sparse FCM + PWO with mapper size = 2, 3, 4 and 5 is 0.8681, 0.8560, 0.8615, and 0.8560, respectively. When number of clusters = 3, the accuracy of proposed FSF-Sparse FCM + PWO with mapper size = 2, 3, 4 and 5 is 0.8560, 0.8560, 0.8878, and 0.8357, respectively. When number of clusters = 4, the accuracy of proposed FSF-Sparse FCM + PWO with mapper size = 2, 3, 4 and 5 is 0.8578, 0.9050, 0.8829, and 0.9768, respectively. When number of clusters = 5, the accuracy of proposed FSF-Sparse FCM + PWO with mapper size = 2, 3, 4 and 5 is 0.8676, 0.8760, 0.8773, and 0.9176, respectively.
Figure 3 b) shows the analysis of DB-Index by considering cluster size. When number of clusters = 2, the DB-Index of proposed FSF-Sparse FCM + PWO with mapper size = 2, 3, 4 and 5 is 11.6159, 3.5784, 2.4545, and 9.6657, respectively. When number of clusters = 3, the DB-Index of proposed FSF-Sparse FCM + PWO with mapper size = 2, 3, 4 and 5 is 27.2933, 10.3508, 11.6910, and 11.6693, respectively. When number of clusters = 4, the DB-Index of proposed FSF-Sparse FCM + PWO with mapper size = 2, 3, 4 and 5 is 29.6399, 11.7863, 39.1704, and 30.8685, respectively. When number of clusters = 5, the DB-Index of proposed FSF-Sparse FCM + PWO with mapper size = 2, 3, 4 and 5 is 41.6576, 9.1815, 23.3220, and 46.9070, respectively.
Comparative methods
The performance enhancement of the proposed model is measured by comparing it with existing methods, such as Swarm-Based Map-Reduce Framework (MKS-MRF) [16], FCM [38], Sparse FCM [8], KFCM [42], FPWhale-MRF, and FrSparseFCM-based MRF.

Figure 2. Analysis with skin dataset, a) accuracy, b) DB-index.

Figure 3. Analysis with localization dataset, a) accuracy, b) DB-index.
This section presents the comparative analysis of proposed approach by varying the number of clusters.
Figure 4 portrays the comparative analysis of proposed FSF-Sparse FCM + PWO with skin dataset. Figure 4 a) shows the analysis of accuracy based on cluster size. When number of clusters = 2, the accuracy of existing MKS-MRF is 0.756, K-Means is 0.756, FCM is 0.756, KFCM is 0.795, FPWhale-MRF is 0.879, Sparse FCM is 0.792, FrSparseFCM-based MRF is 0.892, whereas the proposed FSF-Sparse FCM + PWO obtained the accuracy of 0.898. When number of clusters = 3, the accuracy of existing MKS-MRF is 0.756, K-Means is 0.756, FCM is 0.756, KFCM is 0.756, FPWhale-MRF is 0.844, Sparse FCM is 0.792, FrSparseFCM-based MRF is 0.893, whereas the proposed FSF-Sparse FCM + PWO obtained the accuracy of 0.898. When number of clusters = 4, the accuracy of existing MKS-MRF is 0.816, K-Means is 0.756, FCM is 0.756, KFCM is 0.756, FPWhale-MRF is 0.869, Sparse FCM is 0.793, FrSparseFCM-based MRF is 0.893, whereas the proposed FSF-Sparse FCM + PWO obtained the accuracy of 0.902. When number of clusters = 5, the accuracy of existing MKS-MRF is 0.824, K-Means is 0.756, FCM is 0.756, KFCM is 0.757, FPWhale-MRF is 0.850, Sparse FCM is 0.793, FrSparseFCM-based MRF is 0.893, whereas the proposed FSF-Sparse FCM + PWO obtained the accuracy of 0.903.
Figure 4 b) depicts the analysis of DB-Index based on cluster size. When number of clusters = 2, the DB-Index of existing MKS-MRF is 36.05, K-Means is 83.20, FCM is 34.04, KFCM is 30.27, FPWhale-MRF is 16.93, Sparse FCM is 19.79, FrSparseFCM-based MRF is 16.30, whereas the proposed FSF-Sparse FCM + PWO obtained the DB-Index of 9.16. When number of clusters = 3, the DB-Index of existing MKS-MRF is 29.79, K-Means is 172.68, FCM is 187.72, KFCM is 185.26, FPWhale-MRF is 10.83, Sparse FCM is 30.43, FrSparseFCM-based MRF is 29.15, whereas the proposed FSF-Sparse FCM + PWO obtained the DB-Index of 14.47. When number of clusters = 4, the DB-Index of existing MKS-MRF is 216.91, K-Means is 159.19, FCM is 214.35, KFCM is 113.51, FPWhale-MRF is 91.86, Sparse FCM is 30.30, FrSparseFCM-based MRF is 47.05, whereas the proposed FSF-Sparse FCM + PWO obtained the DB-Index of 29.30. When number of clusters = 5, the DB-Index of existing MKS-MRF is 84.40, K-Means is 325.92, FCM is 195.96, KFCM is 267.28, FPWhale-MRF is 61.70, Sparse FCM is 145.94, FrSparseFCM-based MRF is 108.25, whereas the proposed FSF-Sparse FCM + PWO obtained the DB-Index of 39.07.
Figure 5 portrays the comparative analysis of proposed FSF-Sparse FCM + PWO with localization dataset. Figure 5 a) shows the analysis of accuracy by considering cluster size. When number of clusters = 2, the accuracy of existing MKS-MRF is 0.792, K-Means is 0.786, FCM is 0.792, KFCM is 0.792, FPWhale-MRF is 0.900, Sparse FCM is 0.756, FrSparseFCM-based MRF is 0.856, whereas the proposed FSF-Sparse FCM + PWO obtained the accuracy of 0.901. When number of clusters = 3, the accuracy of existing MKS-MRF is 0.792, K-Means is 0.792, FCM is 0.792, KFCM is 0.792, FPWhale-MRF is 0.872, Sparse FCM is 0.756, FrSparseFCM-based MRF is 0.856, whereas the proposed FSF-Sparse FCM + PWO obtained the accuracy of 0.888. When number of clusters = 4, the accuracy of existing MKS-MRF is 0.851, K-Means is 0.792, FCM is 0.792, KFCM is 0.792, FPWhale-MRF is 0.858, Sparse FCM is 0.764, FrSparseFCM-based MRF is 0.856, whereas the proposed FSF-Sparse FCM + PWO obtained the accuracy of 0.883. When number of clusters = 5, the accuracy of existing MKS-MRF is 0.820, K-Means is 0.791, FCM is 0.792, KFCM is 0.792, FPWhale-MRF is 0.850, Sparse FCM is 0.772, FrSparseFCM-based MRF is 0.856, whereas the proposed FSF-Sparse FCM + PWO obtained the accuracy of 0.877.

Figure 4. Comparative analysis using skin dataset, a) accuracy, b) DB-index.

Figure 5. Comparative analysis with localization dataset, a) accuracy, b) DB-index.
Figure 5 b) depicts the analysis of DB-Index based on cluster size. When number of clusters = 2, the DB-Index of existing MKS-MRF is 12.01, K-Means is 22.25, FCM is 88.17, KFCM is 23.07, FPWhale-MRF is 7.73, Sparse FCM is 9.08, FrSparseFCM-based MRF is 5.34, whereas the proposed FSF-Sparse FCM + PWO obtained the DB-Index of 2.45. When number of clusters = 3, the DB-Index of existing MKS-MRF is 18.98, K-Means is 18.11, FCM is 89.15, KFCM is 42.68, FPWhale-MRF is 21.65, Sparse FCM is 53.95, FrSparseFCM-based MRF is 18.07, whereas the proposed FSF-Sparse FCM + PWO obtained the DB-Index of 11.69. When number of clusters = 4, the DB-Index of existing MKS-MRF is 74.82, K-Means is 66.43, FCM is 112.34, KFCM is 217.26, FPWhale-MRF is 49.30, Sparse FCM is 33.82, FrSparseFCM-based MRF is 77.06, whereas the proposed FSF-Sparse FCM + PWO obtained the DB-Index of 39.17. When number of clusters = 5, the DB-Index of existing MKS-MRF is 115.09, K-Means is 167.56, FCM is 114.36, KFCM is 769.56, FPWhale-MRF is 69.47, Sparse FCM is 108.19, FrSparseFCM-based MRF is 86.85, whereas the proposed FSF-Sparse FCM + PWO obtained the DB-Index of 23.32.
Table 1 represents the comparative discussion of proposed FSF-Sparse FCM + PWO. When number of clusters = 5, the accuracy of existing MKS-MRF is 0.824, K-Means is 0.756, FCM is 0.756, KFCM is 0.757, FPWhale-MRF is 0.850, Sparse FCM is 0.793, FrSparseFCM-based MRF is 0.893, whereas the proposed FSF-Sparse FCM + PWO obtained the accuracy of 0.903 using skin dataset. When number of clusters = 5, the DB-Index of existing MKS-MRF is 84.40, K-Means is 325.92, FCM is 195.96, KFCM is 267.28, FPWhale-MRF is 61.70, Sparse FCM is 145.94, FrSparseFCM-based MRF is 108.25, whereas the proposed FSF-Sparse FCM + PWO obtained the DB-Index of 39.07 using skin dataset.
In the devised scheme, the mapper function transforms the subsets of big data into key-value pairs using the proposed FSF-Sparse FCM, whereas the reducer function computes the optimal cluster centroids from the intermediate data using the PWO algorithm. Sparse FCM can effectively handle high-dimensional information and has the capability to optimally select cluster centroids, and the fractional concept helps to increase the clustering performance. Thus, the performance of the devised approach is better than that of the traditional approaches.
Statistical analysis
Table 2 depicts the statistical analysis of the implemented FSF-Sparse FCM + PWO and the existing methods, such as K-Means, FCM, KFCM, FPWhale-MRF, Sparse FCM, and FrSparseFCM-based MRF. The statistical analysis is done by considering the mean and variance of the evaluation metrics, such as accuracy and DB-index.
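As an illustration of how such a summary could be computed, the snippet below takes the skin dataset accuracies quoted in the comparative analysis (number of clusters = 2 to 5) for two of the methods and reports their mean and variance. Whether Table 2 aggregates over cluster sizes, over repeated runs, or over something else is not stated here, and the use of the population variance is also an assumption, so the printed values are not claimed to match Table 2.

```python
from statistics import mean, pvariance

# Accuracy on the skin dataset for number of clusters = 2, 3, 4, 5,
# copied from the comparative analysis in the text.
skin_accuracy = {
    "FrSparseFCM-based MRF": [0.892, 0.893, 0.893, 0.893],
    "FSF-Sparse FCM + PWO": [0.898, 0.898, 0.902, 0.903],
}

# Mean and population variance per method (aggregation choice is assumed).
summary = {name: (mean(vals), pvariance(vals))
           for name, vals in skin_accuracy.items()}

for name, (mu, var) in summary.items():
    print(f"{name}: mean={mu:.5f}, variance={var:.3e}")
```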
Table 1. Comparative discussion
Table 2. Statistical analysis
Conclusion
In this research, an effective and optimal big data clustering method is designed using the proposed FSF-Sparse FCM + PWO based MRF. The proposed data clustering method involves two functions, namely the mapper function and the reducer function, to find the optimal cluster centroid for big data. The mapper function uses FSF-Sparse FCM for data clustering, and the reducer function uses PWO for tuning the intermediate data. Initially, the input data is divided into subsets, which are passed to the mapper phase for the data clustering process. The clustering of data is achieved using the proposed FSF-Sparse FCM to generate the intermediate data. The intermediate data generated by the mapper phase is tuned in the reducer phase using PWO to get the best possible cluster centroid. The optimal cluster centroid is computed by the reducer function based on the objective function designed using the DB-Index. The proposed FSF-Sparse FCM + PWO obtained the highest accuracy of 0.903 and the lowest DB-Index of 39.07. The devised data clustering approach is important in various domains, including pattern recognition, computer vision, data mining, medical fields, DNA analysis, web statistics, market analysis, and spatial database applications. The PSO algorithm used in this method has not considered geometric optimization problems to extend the application field, which will be considered in future work.
