Abstract
Big data, the primary subject of the contemporary information era, shapes the current state of economic and social ideas and the advancement of cutting-edge technology. People are immersed in a world of information, guided by the abundance of data that penetrates every element of their surroundings. Smart gadgets, the IoT, and other technologies are responsible for the explosive growth of data. Over the past few decades, organisations have struggled to store data effectively, a difficulty that stems from outdated, expensive storage technology of inadequate capacity. Meanwhile, big data demands innovative storage techniques supported by strong technology. This paper proposes a big data clustering and classification model with an improved fuzzy-based deep architecture under the MapReduce framework. First, in the pre-processing phase, the big dataset is partitioned using an improved C-Means clustering procedure. The pre-processed big data is then handled by the MapReduce framework, which involves mapper and reducer phases. In the mapper phase, data normalization takes place, followed by a feature fusion approach that combines the extracted entropy-based and correlation-based features. In the reducer phase, the outputs of all the mappers are combined to produce the final feature set. Finally, a deep hybrid model that combines a DCNN and a Bi-GRU is used for classification, and an improved score level fusion procedure is used to obtain the final classification result. The analysis shows that the proposed work is efficient in terms of classification accuracy, precision, recall, FNR, FPR, and other performance metrics.
Introduction
The availability of data increases every day due to recent technical developments. Social networking websites and sensor networks produce massive amounts of data [1, 2]. In other words, big data is generated quickly from many sources and in many formats. Big data is currently a significant research field because the data is created so quickly that it is challenging to store, process, or manage using conventional software. Big data technologies are tools that can store valuable data in different formats [3, 4]. Several analytical systems are being made available to help users analyse and store complicated data, both structured and unstructured. To obtain information from big data, many applications, models, methods, tools, and algorithms are being proposed [5, 6]. Maintaining accurate and dependable results for big data is the major objective of these technological advancements. Modern technology is additionally needed for big data to be stored and handled effectively within a constrained length of time [7].
In particular, clustering techniques are among the most effective meta-learning methods for analysing the vast amounts of data produced by contemporary devices. The fundamental objective of clustering is to organize data points into clusters so that the points within a cluster are similar in terms of certain parameters. There have been many efforts to evaluate and group data by clustering in many applications. However, the main problem with clustering algorithms is that they frequently demand a lot of computing power, particularly when processing big data. Big data systems fall into three separate categories: resources for interactive analysis, applications for stream processing, and resources for batch processing [8]. Data processing in interactive settings and real-time information interaction require interactive data analysis technologies [9].
Classification is also one of the most significant and frequently studied problems in ML. Its goal is to develop a strategy for assigning data to one of a number of predefined groups based on a collection of training sets. The SVM is one of the most intriguing ML approaches for classification, and it has been used effectively in many fields of science and engineering [10]. The emergence of big data has created difficulties for numerous machine learning techniques [11]. Additionally, there are two fundamental types of deep learning models: generative models and discriminative models. RBN, AE and DBN models are the key components of the conventional type and are frequently used for big data categorization [12]. The other type, which mostly includes CNN, RNN, DSN, and long short-term memory network models, uses order correlation or the joint statistical distribution characterizing the data to categorize the internal structure of the information or to characterize the posterior probability of the data [13, 14].
However, one of the essential elements of safe information clustering is the safety of data throughout the clustering of multi-dimensional big data, particularly unstructured and uncertain big data. Loss of information can happen during transmission, storage, and clustering itself owing to a variety of data errors and security attacks [15, 16, 17, 18]. Overcoming this problem requires efficient data classification methods. Accordingly, this paper proposes a big data classification model, and the main contributions of this work are summarized below:
Proposing an improved fuzzy C-means clustering algorithm for data partitioning, which enhances the classification performance.
Using the MapReduce framework to handle the big data, where the mapper phase handles the feature extraction and fusion process.
Introducing an improved feature fusion process, based on the weighted average fusion method, that fuses the entropy-based features and correlation-based features.
Recommending an improved score level fusion method that obtains accurate classification results by combining the results from the DCNN and Bi-GRU models.
The remainder of this paper is organized as follows: the literature review and the problem statement are presented in Section 2. The proposed big data clustering and classification model is explained in Section 3. Section 4 presents the results and discussion, and Section 5 concludes the paper.
In 2023, Shuaiyin Ma et al. [19] stated that, with the development of cutting-edge information and communication technologies, manufacturing information from the IoT has reached the scale of big data in Industry 4.0. Big data presents a challenge to conventional clustering and correlation analysis due to its size and low value density. To address this issue, clustering based on big data-driven correlation analysis was suggested as a way to increase the efficiency of resource and energy use. The industrial units with unusually high energy consumption were categorized in depth using clustering analysis. The interaction between energy supply and demand was then balanced by correlation analysis, which could decrease carbon emissions and improve sustainable competitiveness. The results of the sensitivity study indicate that, in comparison to the original analysis system, the feature extraction technique could increase the accuracy of the correlation analysis.
In 2022, Abhilasha, A et al. [20] focused on big data, which is crucial for information processing, training, manipulation, and prediction. Training on and extracting information from these big datasets could lead to limited categorization outcomes and incorrect conclusions as a result of imbalanced data. Conventional ML classifiers were able to handle unbalanced datasets, but they still had issues with overfitting, training cost, and sample classification difficulty. The work offered a novel “Self-Boosted along with Dynamic Semi-Supervised Segmentation Method” to improve categorization. The data was first preprocessed by building data segments using the Hybrid Related closest neighbour algorithm. After the data is prepared, big data categorization necessitates large amounts of storage, which raises the cost of training. Thus, a heterogeneous weight ensemble classifier was proposed to compute the information space of every data sample using its closest neighbours and to address the training cost and the problems with unstable samples through adaptive weight modification. Categorization also performs poorly as a result of sample hardness. The work therefore created a Self Managed Ensample Optimizer that improves classification results by sorting the majority samples into bins based on their hardness values. The suggested model provided balanced data with superior classification outcomes and categorized the unbalanced dataset with a high accuracy of 99%.
In 2022, Dongwei Li et al. [21] stated that cloud computing is one of the most attractive options because clustering big data frequently takes enormous processing power. However, if it is not effectively managed, the cost of computation in the cloud can be surprisingly large. In the big data clustering field, the long tail effect is widely observed, which means that a significant amount of time is frequently spent in the middle to late phases of clustering. To attain a sufficiently good precision at a minimal computing cost, the authors attempted to cut the unneeded long tail of the clustering process, and put forth a new strategy for affordable big data clustering on the cloud. The widely used k-means and EM algorithms were stopped at the earliest stage at which the necessary accuracy was achieved, with that stage learned by a regression model fitted on samples. The outcomes of tests on four well-known data sets show that both the k-means and EM algorithms could achieve great affordability in the cloud using the suggested methodology. In the experiments with k-means, which was far more effective than EM, the authors found that reaching 99% accuracy requires only 47.71% to 71.14% of the computation effort needed to achieve 100% accuracy.
In 2021, Chitrakant Banchhor and Srinivasu, N [22] noted that the efficient administration of the storage and processing of a very high volume of data is referred to as “big data handling”. Both organized and unstructured data might be handled overall using a certain strategy. The study examined the CNB, CGCNB, FCNB, and HCNB classifiers, all of which depend on the Bayesian concept. The CNB was created by extending the underlying hypothesis of the normal naive Bayes classifier with the correlation between the attributes. The CNB classifier’s efficiency was significantly improved by integrating the cuckoo search and grey wolf optimization methods, and the resulting classifier was referred to as CGCNB. Additionally, the accuracy, specificity, sensitivity, recall, and execution time of the FCNB and HCNB classifiers were taken into account when comparing their performance with that of CNB and CGCNB.
In 2022, Jayasri N.P and R. Aruna [23] concentrated on big data collected from numerous sources, including sensor data, social media, and online transaction information. It is difficult to assess the relevance of such a large collection of generated data using traditional processing methods. Given the developments and upcoming trends in the health care field, a large amount of noisy data must be evaluated to obtain important information. The study’s objective was to assess a medical dataset of diabetic patients by using a combination of cutting-edge modular decision attention networks, AR, and artificial intelligence through the MapReduce framework. In the MapReduce context, the association-based apriori algorithm takes the health data into account to produce rules. The analysis was conducted using a 50-variable diabetic machine-learning dataset from UCI. The proposed algorithm’s output was evaluated with metrics such as accuracy, precision, recall, and the F-score.
In 2021, Mario Juez-Gil et al. [24] stated that the term Big Data is used to describe datasets that are growing in both quantity and complexity at a rate never previously witnessed. The numbers of samples in the various classes are often not balanced, which is a common difficulty for classification, especially with Big Data. Because classifiers are biased towards the majority class and disregard the minority one, imbalanced classification was first studied several decades ago. Although there are now more imbalanced classification algorithms than ever before, they still concentrate on small datasets rather than on the current state of big data. The study used two well-known ensemble families (Bagging and Boosting) to conduct extensive experiments with combined classifiers in the setting of unbalanced big data classification. Ensemble performance and execution time were compared using recent Bayesian statistical techniques, and all of the experiments were conducted on a Spark cluster. One especially interesting finding of the study was that, when Big Data is used to analyse unbalanced datasets, simpler strategies outperformed more complicated ones. Some advanced techniques that appear to be required to process and lessen imbalance in normal-sized datasets failed for imbalanced Big Data due to their added complexity.
In 2022, Noha Mostafa et al. [25] stated that the use of large amounts of data in the energy industry is considered the primary component of the Energy Internet. Particularly with the incorporation of energy from renewable sources and smart grids, significant and promising problems exist. The paper discussed the advantages and difficulties of applying big data analytics to power plants that produce renewable energy. The capability to gather information and use it appropriately for improved decision-making is a significant element. A framework was created for the possible use of big data processing in electric grids and environmentally friendly power utilities. The prediction of the smart grid’s stability using multiple ML techniques was proposed as a five-step process. Using three different machine learning techniques, the long-term viability of the model was forecast with data collected from a distributed smart grid data system. According to the training outcomes, the weighted linear regression model obtained 96% accuracy with 70% training data. Likewise, the decision tree model and the random forest model obtained accuracies of 78% and 84%, respectively. The classification produced 87% with the CNN model and with the gradient-boosted decision tree model. The main constraint of this effort was that the dataset was deemed very small for big data analysis. However, the real-time event analysis and cloud computing services offered were appropriate for a big data analysis architecture.
In 2022, Fang Liu [26] first built a multi-level language collection using large-scale data technology, including the DL technique. An EKF model was then developed to create a data query system through the examination of NN algorithms. Finally, simulation tests confirmed that language bytes can be appropriately retrieved from the language database. The research demonstrates the high degree of result adaptation of the database that uses big data and the DL algorithm. In addition to being significantly superior to the other two approaches when the total number of analyses exceeded 15, it also had the simplest convergence curve. The outcomes of the simulation demonstrate the effectiveness of the DL algorithm combined with the C-type travelling wave approach for building language databases that handle branch point attribute data. Additionally, it has an appropriate matching effect on big data, which increases the database’s capacity for recognizing various linguistic systems and greatly enhances retrieval effectiveness and precision.
In 2022, Xuegong Du et al. [27] described a CNN-based big data analysis and forecasting system. To investigate the distributed data layout of big data, continuous pattern matching technology was employed, along with fused information processing of big data gathered from the cloud, comparison methods for frequent item identification, and association rule feature extraction from high-dimensional fused data. Clustering was used to categorize and process the big data collected from cloud services. Combining a CNN with a camera to identify the surrounding environment is currently a focus of research, because identifying the nearby environment with the vehicle’s hardware is challenging. However, a CNN alone cannot solve the issues of long training times and poor accuracy, and it must also analyse camera input. An improved CNN was therefore suggested. The results of the experiment reveal that the data mining accuracy of this technique is 12.43% and 21.76% greater than that of the standard approaches. Additionally, because there are fewer iteration phases, mining is more timely. The network’s architecture successfully increases both its precision and its learning speed. It has been demonstrated that the CNN provides a higher accuracy as well as a faster rate of training.
In 2019, S. K. Lakshmanaprabu et al. [28] developed an IoT-based healthcare system for big data analytics using the MapReduce process and a Random Forest Classifier. The e-health data of patients affected by various diseases was collected for big data analysis. For better classification, the optimal features were selected from the dataset using the Improved Dragonfly Algorithm. The e-health data was then classified, based on the selected features, using the Random Forest Classifier. In the experiments, the proposed model attained a high precision of 94.2%. Various other performance measures were also evaluated for the proposed model and compared with those of traditional methods. Table 1 lists some big data-related problems and difficulties.
Features and challenges of conventional insider threat detection model
Numerous complications and concerns arise during big data storage and processing due to the rapid expansion of data. Few options are available to fix these flaws and issues in a cloud environment. Pig Latin, Dryad, MongoDB, Cassandra, and MapR are a few examples of technologies that cannot tackle these problems in big data processing. Despite the aid of Hadoop and MapR, users still lack access to databases and low-level infrastructure for information processing and administration. Big Data Analytics has become an important subject in the field of data science, as numerous businesses are investing in solutions that use it to meet their monitoring, testing, data analysis, modelling, and other technical requirements. In the world of Big Data Analytics, the knowledge acquired by Deep Learning algorithms remains largely unexplored. Deep Learning has been used extensively in some Big Data areas, such as computer vision and speech recognition, to enhance classification modelling outcomes. Deep Learning is appealing as a tool for Big Data Analytics because it can extract high-level concepts and data representations from Big Data, most of which is uncontrolled data. Deep Learning can be used to better tackle Big Data issues like biased modelling, rapid knowledge retrieval, linguistic indexing, and information labelling.
Proposed big data clustering and classification model
This paper proposes a big data clustering and classification model with an improved fuzzy-based deep architecture under the MapReduce framework. The step-by-step process of the proposed model is described below.
Initially, data partitioning is carried out in the pre-processing phase using an improved C-Means clustering process. Then, the MapReduce framework, which contains the mapper and reducer phases, is applied to handle the pre-processed big data. In the mapper phase, an improved normalization process takes place, followed by the feature fusion process, which fuses entropy-based features and correlation-based features. The outputs of all the mappers are then merged in the reducer phase to obtain the combined feature set. Finally, classification is performed by a deep hybrid model that combines a Deep Convolutional Neural Network (DCNN) and a Bidirectional Gated Recurrent Unit (Bi-GRU). Here, the final classification outcome is determined by the improved score level fusion method. The overall architecture of the big data classification is illustrated in Fig. 1.
Overall architecture of the bigdata classification.
As there are different data sources, processing and cleansing data is challenging. Additionally, data sources could be incomplete or riddled, which makes the further process complicated for classification. To handle this problem, this work does partitioning of data (bigdata dataset
Improved fuzzy c-means clustering
Initially, the input big data,
Where,
A) Modified Euclidean distance
The spatial distance between the two sets of spatial vectors,
The Euclidean distance given in Eq. (2) is normalized by dividing it by the least value of the square root of the squared sum of the two space vectors. The modified Euclidean distance is defined in Eq. (3) by multiplying each vector component by a constant stability factor
The improved Euclidean distance
B) Average of Manhattan distance
The Manhattan distance measures the separation between two points in space. Rather than the straight-line distance between them, it is the total of the vertical and horizontal distances between the two points. Using this as a basis, the approach is improved to provide the improved Manhattan distance between two matrices. The Manhattan distance is calculated with the data matrix and the associated row component of the matrix. It can be obtained from the number of matrix paths, and taking the average value yields the improved Manhattan distance [31]. The computation of the Manhattan distance is given in Eq. (4). Where,
In the improved fuzzy c-means clustering process, the proposed distance measure is calculated by adding the modified Euclidean distance in Eq. (3) and the Manhattan distance in Eq. (5). Equations (6) and (7) give the improved distance measure between the object and the cluster centre.
The overall distance from Eq. (7) is then substituted into Eq. (1) to form the clusters.
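As a rough illustration of how a combined distance can be plugged into fuzzy c-means, the sketch below uses the standard membership and centre updates with a distance that adds a norm-scaled Euclidean term and a mean Manhattan term. The exact forms of Eqs. (1) to (7) are not reproduced here, so `modified_euclidean`, the `stability` parameter, the choice of the smaller vector norm as the divisor, and the evenly spaced centre initialization are all assumptions made for illustration only.

```python
import math


def modified_euclidean(x, c, stability=1.0, eps=1e-12):
    # Assumed form: Euclidean distance with each component scaled by a
    # stability factor, divided by the smaller of the two vector norms.
    diff = math.sqrt(sum((stability * (a - b)) ** 2 for a, b in zip(x, c)))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_c = math.sqrt(sum(b * b for b in c))
    return diff / max(min(norm_x, norm_c), eps)


def mean_manhattan(x, c):
    # Average of the per-component absolute differences.
    return sum(abs(a - b) for a, b in zip(x, c)) / len(x)


def combined_distance(x, c):
    # Sum of the two distance terms, as the text describes.
    return modified_euclidean(x, c) + mean_manhattan(x, c)


def fuzzy_c_means(points, n_clusters, m=2.0, n_iter=50):
    # Assumed initialization: evenly spaced points become the first centres.
    step = len(points) // n_clusters
    centres = [list(points[i * step]) for i in range(n_clusters)]
    dim = len(points[0])
    power = 2.0 / (m - 1.0)
    memberships = []
    for _ in range(n_iter):
        # Standard FCM membership update with the combined distance.
        memberships = []
        for p in points:
            d = [max(combined_distance(p, c), 1e-12) for c in centres]
            memberships.append(
                [1.0 / sum((d[j] / d[k]) ** power for k in range(n_clusters))
                 for j in range(n_clusters)]
            )
        # Standard FCM centre update: membership-weighted mean of the points.
        for j in range(n_clusters):
            w = [row[j] ** m for row in memberships]
            total = sum(w)
            centres[j] = [sum(wi * p[t] for wi, p in zip(w, points)) / total
                          for t in range(dim)]
    return centres, memberships
```

Each row of the returned membership matrix sums to one, and a hard partition can be read off by taking the cluster with the largest membership for each point.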
After the big data has been clustered, the MapReduce framework is applied to handle it. The pre-processed data contains several clusters, and each cluster is split again across multiple mappers. The mapper phase performs the improved data normalization and the feature fusion process. Here, entropy-based features and correlation-based features are fused together using the weighted average fusion method. After that, the reducer phase combines the features from all the mappers. Finally, the resulting feature set is obtained from the output of the reducer phase. The normalization and feature extraction under the MapReduce framework is depicted in Fig. 2.
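The data flow described above (clusters split across mappers, mapper outputs shuffled, one reducer combining the features) can be sketched in plain Python without a Hadoop cluster. This is only a minimal single-process analogy: the function names are hypothetical, and the per-partition scaling inside `mapper` is a placeholder for the paper's normalization and feature fusion steps, not their actual definition.

```python
from collections import defaultdict


def mapper(cluster_id, rows):
    """Process one partition and emit (key, feature_vector) pairs."""
    # Placeholder step: scale by the largest absolute value in the partition.
    scale = max(abs(v) for row in rows for v in row) or 1.0
    return [(cluster_id, [v / scale for v in row]) for row in rows]


def shuffle(mapped_outputs):
    """Group the emitted pairs by key, as the shuffle stage would."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, vector in pairs:
            grouped[key].append(vector)
    return grouped


def reducer(grouped):
    """Combine the features from all mappers into a single feature set."""
    combined = []
    for key in sorted(grouped):
        combined.extend(grouped[key])
    return combined


def run_job(clusters):
    # One mapper call per clustered partition, then shuffle and reduce.
    mapped = [mapper(cid, rows) for cid, rows in clusters.items()]
    return reducer(shuffle(mapped))
```

For example, `run_job({0: [[2.0, 4.0]], 1: [[1.0, -5.0], [5.0, 0.0]]})` processes the two partitions independently and returns their feature rows as one list.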
Feature extraction under the Map-Reduce framework.
A collection of data is transformed to be on an equivalent scale through normalization. According to the data themselves, the purpose of ML algorithms is typically to update and adjust the information to ensure it has a value in the range of 0 to 1 or between
In this paper, an improved normalization based on the tanh normalization process is applied. Tanh estimation techniques are considered an efficient and dependable normalization method. Along with converging faster than the Z-score normalization process, they are not susceptible to outliers. Thus, it produces values ranging from
Where,
From the normalized data
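For orientation, the widely used baseline tanh-estimator normalization can be sketched as follows. This is the conventional form, not the improved variant proposed in this work, whose equation is not reproduced here. The sketch assumes the plain mean and standard deviation in place of robust Hampel estimators, and the `scale` constant of 0.01 is the value commonly quoted for the baseline method.

```python
import math
import statistics


def tanh_normalise(values, scale=0.01):
    # Baseline tanh estimator: 0.5 * (tanh(scale * (x - mu) / sigma) + 1).
    # Assumption: mean and population standard deviation are used for
    # mu and sigma instead of Hampel estimators.
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values) or 1.0  # guard against constant columns
    return [0.5 * (math.tanh(scale * (v - mu) / sigma) + 1.0) for v in values]
```

With this baseline form every output lies strictly between 0 and 1, the ordering of the inputs is preserved, and an extreme value such as 100 in `[1, 2, 3, 4, 100]` is squashed rather than stretching the range of the remaining values.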
In this section, the improved feature fusion takes place, which fuses the entropy-based features and the correlation-based features. The stepwise procedure for the improved feature fusion using the weighted average fusion method is explained as follows.
Feature extraction In the improved feature fusion method, feature extraction is the first step. Initially, entropy-based features and correlation-based features are extracted from the normalized data. Then the final extracted features are subjected to the feature selection. It is explained as follows
Entropy-based features Entropy [33] is a measurement of a random variable’s unpredictability. Assume that
Since
If
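A minimal sketch of an entropy-based feature is given below, assuming the standard Shannon entropy in bits. The equal-width binning used to discretise a continuous column, and the `n_bins` parameter, are assumptions introduced for the example and are not taken from the paper's equations.

```python
import math
from collections import Counter


def shannon_entropy(values):
    # H(X) = -sum_x p(x) * log2 p(x), estimated from value frequencies.
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_feature(values, n_bins=4):
    # Assumed discretisation: equal-width bins over the observed range.
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0
    binned = [min(int((v - lo) / width), n_bins - 1) for v in values]
    return shannon_entropy(binned)
```

A column whose values fall into a single bin has zero entropy, while a column spread evenly over two bins has an entropy of one bit.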
Correlation-based features Next, the correlation-based features [34] are extracted using the Pearson correlation coefficient method. It identifies the linear inter-relationship between two variables that are measured on ratio or interval scales, and it is frequently used as a statistical measure. If the data has a continuous scale, then the correlation coefficient is in the range of -1 to +1.
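The Pearson correlation coefficient named above has a standard definition, the covariance of the two variables divided by the product of their standard deviations, which can be sketched directly. Returning 0.0 for a constant column is a convention chosen for this example, since the coefficient is undefined in that case.

```python
import math


def pearson(x, y):
    # r = sum((x - mx)(y - my)) / sqrt(sum((x - mx)^2) * sum((y - my)^2))
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        return 0.0  # assumed convention for constant inputs
    return cov / (sx * sy)
```

A perfectly increasing linear relationship gives a value of +1 and a perfectly decreasing one gives -1.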
Here, Feature Selection by using PCA In this step.2, from the extracted features, Let
Note also that there are always multiple eigenvectors, so more than one eigenvector is considered when evaluating the importance of a single feature component. To perform the feature selection, the following steps are developed.
By using the original training data, the covariance matrix of PCA is calculated. After that, all the eigen values and eigen vectors are solved. Based on the first largest eigen values Then, calculate the contribution of
Arrange the contribution of the features in descending order and it is stored by using Here, select the top Normalization of selected features Then, the selected features
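The selection steps above (covariance matrix, eigen-decomposition, per-feature contribution over the leading eigenvectors, descending sort, top-ranked features kept) might be sketched with NumPy as follows. The contribution formula is an assumption: here it is the sum of the absolute loadings of a feature over the leading eigenvectors, which may differ from the expression used in the paper. The function names are hypothetical.

```python
import numpy as np


def pca_feature_ranking(X, n_components=2):
    # Covariance matrix of the centred training data.
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(cov)
    leading = np.argsort(eigvals)[::-1][:n_components]
    # Assumed contribution: summed absolute loadings on the leading eigenvectors.
    contribution = np.abs(eigvecs[:, leading]).sum(axis=1)
    ranking = np.argsort(contribution)[::-1]
    return ranking, contribution


def select_top_features(X, top_k, n_components=2):
    # Keep the top_k features with the largest contribution.
    ranking, _ = pca_feature_ranking(X, n_components)
    keep = np.sort(ranking[:top_k])
    return X[:, keep], keep
```

Using several leading eigenvectors rather than only the first reflects the remark that more than one eigenvector should be considered when judging a single feature.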
Computation of weight After the normalization of the selected features, the computed mutual information score assigns a weight to each selected feature. Mutual information [33] measures the amount of information that one random variable contains about another random variable. In other words, the uncertainty of one random variable is reduced by knowledge of the other. Consider two random variables
Then, normalize the feature score, so that the summation of the normalized score is always equal to 1. The normalized mutual information score is defined as the ratio between the mutual information score of each feature and the summation of all mutual information score, which is mathematically expressed in Eq. (18)
Where,
Weighted Averaging Using the weighted average strength [36] of the relevant areas taken from the input and output data, the resulting fused feature is created. With the computed Normalized weight in Eq. (18), the weighted averaging of the selected as well as normalized features is performed. Each feature is multiplied by its equivalent Normalized weight and then concatenated to produce a fused and weighted feature
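The weighting and fusion steps can be illustrated with a small sketch: a mutual information score per feature, normalization of the scores so that they sum to 1 as in Eq. (18), and multiplication of each feature by its normalized weight. The discrete mutual information estimate, the uniform fallback for all-zero scores, and the function names are assumptions made for this example.

```python
import math
from collections import Counter


def mutual_information(x, y):
    # I(X;Y) = sum_{x,y} p(x,y) * log2( p(x,y) / (p(x) p(y)) ) for discrete data.
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * math.log2(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())


def normalised_weights(scores):
    # Each score divided by the sum of all scores, so the weights sum to 1.
    total = sum(scores)
    if total == 0:
        return [1.0 / len(scores)] * len(scores)  # assumed uniform fallback
    return [s / total for s in scores]


def fuse_features(rows, weights):
    # Multiply every feature by its weight; each row keeps the weighted
    # features side by side as the fused feature vector.
    return [[w * v for w, v in zip(weights, row)] for row in rows]
```

A feature that is independent of the label receives a mutual information score of zero and therefore contributes nothing to the fused vector.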
Reducer phase: the outputs of the mappers are shuffled together and passed to the reducer phase. Then, the final feature
In this work, the final process is the classification process by the combination of the DCNN classifier and the Bi-GRU classifier. Here, the final features,
DCNN
Initially, the feature
DCNN architecture.
Convolution Layer Convolution primarily employs the convolution kernel to carry out the convolution process using a sliding window approach on the input matrices. The size of the generated output matrix is determined by the size of the input matrix, the size of the convolution kernel, the stride, and the padding size, as expressed in Eq. (20).
Where,
Pooling Layer The fundamental purpose of pooling is to decrease the data dimension while preserving the most important data. There are two typical approaches to pooling. The first is maximum pooling, which outputs the highest value within the mask. The second is average pooling, which outputs the average of all the values in the filter.
Fully Connected Layer In the fully connected layer, all feature maps are transformed into a one-dimensional array that serves as the input to the fully connected network. The fully connected NN is employed last, for categorization or forecasting.
Activation Function Activation functions may be linear or non-linear. Non-linear functions can represent the data more accurately than linear ones and are therefore used more frequently in common neural networks. ReLU is the most frequently utilized non-linear function. Equation (21) gives the ReLU function, f(x) = max(0, x): the outcome is the input itself when it is positive and zero otherwise.
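The layer operations described above can be checked numerically with a short sketch. It assumes the standard output-size relation for a convolution, floor((n + 2p - k) / s) + 1 for input size n, kernel size k, stride s and padding p, together with one-dimensional, non-overlapping pooling windows. The helper names are hypothetical.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    # Standard relation: floor((n + 2 * padding - kernel) / stride) + 1.
    return (n + 2 * padding - kernel) // stride + 1


def relu(x):
    # ReLU passes positive inputs through and maps the rest to zero.
    return max(0.0, x)


def pool1d(values, size, mode="max"):
    # Non-overlapping windows of length `size`; a trailing remainder is dropped.
    windows = [values[i:i + size]
               for i in range(0, len(values) - size + 1, size)]
    if mode == "max":
        return [max(w) for w in windows]
    return [sum(w) / len(w) for w in windows]
```

For instance, a 3-wide kernel with stride 1 and padding 1 preserves the input size, while maximum and average pooling of `[1, 3, 2, 8]` with a window of 2 give `[3, 8]` and `[2.0, 5.0]` respectively.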
The GRU is a simpler version of the LSTM, and both models are enhanced RNNs with strong modelling skills for dependence over time. A reset gate
Where,
When working with the present data, models with a bi-directional structure can draw on both past and subsequent data. Bi-GRU contains two GRUs, one processing the information forward and the other processing it backwards. Only the update and reset gates are present in this bidirectional recurrent neural network. Figure 4 depicts the structure of the Bi-GRU model. The states of the two GRUs, which point in opposite directions, are used to determine the output of the Bi-GRU model. The first GRU moves forward, starting at the beginning of the data sequence, and the second GRU moves backwards, starting at the end of the sequence. This enables knowledge from the past as well as the future to affect the present state. Equation (3.3.2) describes the bi-GRU. Where,
The outputs obtained from the DCNN and Bi-GRU classifiers are therefore combined using the improved score level fusion method to get the final classification output. Figure 4 shows the Bi-GRU architecture.
Bi-GRU architecture.
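To make the bidirectional structure concrete, the following NumPy sketch runs one GRU forward and one backward over a sequence and concatenates their hidden states at each step. It uses one common formulation of the GRU cell with an update gate `z` and a reset gate `r`. Other references swap the roles of `z` and `1 - z`, and the paper's own equations are not reproduced here. The parameter names and the random initialization are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(n_in, hidden, rng):
    # Hypothetical parameter layout: W* act on the input, U* on the state.
    p = {}
    for gate in "zrh":
        p["W" + gate] = rng.normal(scale=0.5, size=(hidden, n_in))
        p["U" + gate] = rng.normal(scale=0.5, size=(hidden, hidden))
        p["b" + gate] = np.zeros(hidden)
    return p


def gru_step(x, h, p):
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h + p["bz"])   # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h + p["br"])   # reset gate
    h_new = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h) + p["bh"])
    return (1.0 - z) * h + z * h_new


def run_gru(seq, p, hidden):
    h, states = np.zeros(hidden), []
    for x in seq:
        h = gru_step(x, h, p)
        states.append(h)
    return states


def bi_gru(seq, p_fwd, p_bwd, hidden):
    # Forward pass from the start, backward pass from the end, then
    # concatenate the two hidden states at every position.
    fwd = run_gru(seq, p_fwd, hidden)
    bwd = run_gru(seq[::-1], p_bwd, hidden)[::-1]
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```

Changing only the last element of the input sequence leaves the forward half of the first output untouched but alters its backward half, which is the sense in which later data influences the representation of earlier positions.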
In this proposed work, the outcome from the hybrid classifiers is normalized by using z-improved min-max normalization. In conventional score level fusion techniques [39], the scores of the hybrid classifiers are computed separately using the Euclidean distance and then normalized using Eq. (24). After that, the scores are fused together based on the addition rule as per Eq. (25). Here,
In this work, an improved score level fusion is proposed to obtain higher accuracy in big data classification. Here, the scores of both classification models are normalized separately using min-max normalization. Eq. (26) shows the min-max normalization of the DCNN classifier,
Similarly, the min-max normalization of the Bi-GRU classifier is taken and follows the same procedure as mentioned in DCNN classifiers. It is calculated by using Eq. (31), where
Then the final fusion score is calculated as per Eq. (32), which is the combination of the normalized DCNN classifier and Bi-GRU classifier. This is the final classification outcome of the bigdata classification model.
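The baseline idea behind score level fusion, normalizing each classifier's scores separately and then combining them with the addition rule, can be sketched as below. This is the conventional scheme only. The improved normalization of Eqs. (26) to (32) is not reproduced here, and the function names and the constant-score fallback are assumptions.

```python
def min_max(scores):
    # Rescale the scores of one classifier to the range [0, 1].
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.5] * len(scores)  # assumed fallback for constant scores
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_scores(dcnn_scores, bigru_scores):
    # Addition rule: sum the separately normalized per-class scores.
    a, b = min_max(dcnn_scores), min_max(bigru_scores)
    return [x + y for x, y in zip(a, b)]


def predict(dcnn_scores, bigru_scores):
    # The predicted class is the one with the largest fused score.
    fused = fuse_scores(dcnn_scores, bigru_scores)
    return max(range(len(fused)), key=fused.__getitem__)
```

Normalizing before fusing matters because the two classifiers may emit scores on different scales. Without it, the classifier with the larger raw scores would dominate the sum.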
Experimental setup
The big data classification model was implemented in Python 3.7. The computational hardware used for this implementation was an AMD Ryzen 5 3450U processor with Radeon Vega Mobile Gfx, operating at 2.10 GHz. The system was equipped with a total of 16.0 GB of RAM, of which 13.9 GB was available for use. The big data classification analysis was carried out using the Lung Cancer dataset [40] and the UNSW-NB15 dataset [41].
Dataset1 description
Hong and Young used the lung cancer dataset to demonstrate the effectiveness of the optimal discriminant plane even in ill-posed situations. The Lung Cancer dataset is multivariate. It includes 56 attributes and 32 instances, and the associated task is classification. The class label is the first attribute. Each nominal predictive attribute has an integer value between 0 and 3. It consists of a total of 10 attributes. Each attribute is categorical and serves as a feature. The variable named class is categorical and serves as the target.
Dataset2 description
The raw network packets of the UNSW-NB15 dataset were created with the IXIA PerfectStorm tool in the Cyber Range Lab of UNSW Canberra to generate a hybrid of real modern normal activities and synthetic contemporary attack behaviours. The tcpdump utility was employed to capture 100 GB of raw traffic (e.g., Pcap files). The dataset contains nine kinds of attacks: fuzzers, analysis, backdoors, DoS, exploits, generic, reconnaissance, shellcode, and worms. The Argus and Bro-IDS tools are used, and twelve algorithms are developed to generate a total of 49 features with the class label. The UNSW-NB15_features.csv file contains a description of these features. UNSW-NB15_GT.csv is the ground truth table, and UNSW-NB15_LIST_EVENTS.csv is the event list file. The dataset was partitioned into a training set (UNSW_NB15_training-set.csv) and a testing set (UNSW_NB15_testing-set.csv). There are 175,341 records in the training set and 82,332 records in the testing set, each labelled as either attack or normal.
Performance evaluation
The evaluation of classification performance involved comparing the DCNN+Bi-GRU approach with traditional methods. This assessment considered various performance metrics, including specificity, FNR, precision, FPR, MCC, sensitivity, F-measure, NPV, and accuracy. Additionally, we conducted a comparative analysis between the DCNN+Bi-GRU method and state-of-the-art approaches like ICNN [27] and RFC+MAPREDUCE [28], as well as traditional methods such as Bi-GRU, DCNN, LSTM, RF, RNN, and GRU.
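For reference, the metrics listed above can all be derived from a binary confusion matrix. The following is a minimal sketch using the standard definitions; the function name and the example counts are hypothetical and are not taken from the experiments.

```python
import math


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts."""
    def ratio(num, den):
        # Return 0.0 for an empty denominator instead of raising.
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)   # recall, true positive rate
    specificity = ratio(tn, tn + fp)   # true negative rate
    precision = ratio(tp, tp + fp)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "npv": ratio(tn, tn + fn),     # negative predictive value
        "fpr": ratio(fp, fp + tn),     # equals 1 - specificity
        "fnr": ratio(fn, fn + tp),     # equals 1 - sensitivity
        "f_measure": ratio(2 * precision * sensitivity,
                           precision + sensitivity),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


# Hypothetical counts, for illustration only.
m = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

Because FPR and FNR are the complements of specificity and sensitivity, a model that scores high on the positive metrics necessarily scores low on these two negative metrics.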
Performance evaluation of positive measures for Dataset 1
Fig. 5 presents the positive metrics of the DCNN+Bi-GRU method in comparison with Bi-GRU, DCNN, LSTM, RF, RNN, ICNN [27], RFC+MAPREDUCE [28], and GRU for the classification of big data. For precise classification, the model should produce high positive metric scores. The primary observation is that the DCNN+Bi-GRU approach classifies big data accurately, with accuracy consistently exceeding 90%, whereas Bi-GRU, DCNN, LSTM, RF, RNN, ICNN [27], RFC+MAPREDUCE [28], and GRU yield lower accuracy values. In particular, at a training rate of 70%, the DCNN+Bi-GRU method achieved an accuracy of 94.231, whereas Bi-GRU achieved 87.360, DCNN 83.705, LSTM 79.812, RF 83.535, RNN 86.570, ICNN [27] 89.712, RFC+MAPREDUCE [28] 81.727, and GRU 85.768. Additionally, at a training percentage of 90%, the DCNN+Bi-GRU approach demonstrated a specificity of 95.630, whereas Bi-GRU, DCNN, LSTM, RF, RNN, ICNN [27], RFC+MAPREDUCE [28], and GRU all achieved lower specificity values.
The evaluation of precision and sensitivity for DCNN+Bi-GRU and the conventional methods is illustrated in Fig. 5(b) and 5(c). At a training rate of 80%, DCNN+Bi-GRU obtained a sensitivity of 96.103, whereas the conventional strategies scored lower sensitivity values, notably, Bi-GRU
Validation of DCNN+Bi-GRU versus conventional methods with respect to positive metrics.
Figure 6 compares the DCNN+Bi-GRU approach with Bi-GRU, DCNN, LSTM, RF, RNN, ICNN [27], RFC+MAPREDUCE [28], and GRU in terms of negative metrics for big data classification. For accurate classification, the negative metric values should be as low as possible. The DCNN+Bi-GRU method achieved lower negative metric values than the traditional methods. In particular, the False Positive Rate (FPR) of the DCNN+Bi-GRU approach is 4.369 at a training percentage of 90%. This value is considerably lower than those of Bi-GRU (12.070), DCNN (16.682), LSTM (14.026), RF (16.694), RNN (33.824), ICNN [27] (14.654), RFC+MAPREDUCE [28] (11.462), and GRU (9.465). Likewise, the RF algorithm produced the highest FNR, followed by LSTM and DCNN, whereas the DCNN+Bi-GRU method recorded the lowest FNR of 3.896. This improvement in big data classification can be attributed to the improved fuzzy c-means clustering, the improved normalization technique, and the combination of deep hybrid classification methods.
Validation of DCNN+Bi-GRU versus conventional methods with respect to negative metrics.
To evaluate the effectiveness of big data classification, the remaining metrics are analysed for the DCNN+Bi-GRU method in comparison with Bi-GRU, DCNN, LSTM, RF, RNN, ICNN [27], RFC+MAPREDUCE [28], and GRU, as illustrated in Fig. 7. For effective classification, the DCNN+Bi-GRU approach should achieve higher scores on these metrics. At a training rate of 70%, the F-measure of the DCNN+Bi-GRU method is 95.080, whereas Bi-GRU achieves 88.805, DCNN 84.818, LSTM 81.004, RF 84.870, RNN 87.837, ICNN [27] 90.899, RFC+MAPREDUCE [28] 89.053, and GRU 87.200. Furthermore, the NPV of the DCNN+Bi-GRU scheme surpasses that of Bi-GRU, DCNN, LSTM, RF, RNN, ICNN [27], RFC+MAPREDUCE [28], and GRU. Hence, the DCNN+Bi-GRU approach outperforms the existing methods across these metrics, which indicates its potential for the accurate classification of big data.
Validation of DCNN+Bi-GRU versus conventional methods with respect to other metrics.
Figure 8 presents a comparative analysis of ROC curves for the DCNN+Bi-GRU method and Bi-GRU, DCNN, LSTM, RF, RNN, ICNN [27], RFC+MAPREDUCE [28], and GRU for big data classification. The ROC curve is generated by plotting the true positive rate against the false positive rate, here at a learning percentage of 80%. For accurate big data classification, it is desirable for the ROC area to reach or exceed 95%. When the false positive rate was set to 1.0, the DCNN+Bi-GRU method attained a true positive rate of 0.983, which exceeded the true positive rates achieved by Bi-GRU, DCNN, LSTM, RF, RNN, ICNN [27], RFC+MAPREDUCE [28], and GRU. Hence, the ROC assessment highlights the capability of the DCNN+Bi-GRU method to classify large-scale data accurately.
ROC curve analysis.
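How an ROC curve and its area are obtained from classifier scores can be sketched as follows. This is a generic threshold sweep with trapezoidal integration, not the plotting procedure used for Fig. 8; the labels and scores are hypothetical.

```python
from typing import List, Sequence, Tuple


def roc_points(labels: Sequence[int],
               scores: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) points obtained by lowering the decision threshold."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative sample")
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every sample sharing this score so ties give one point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical labels (1 = positive) and classifier scores.
pts = roc_points([1, 1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.6, 0.4, 0.2])
area = auc(pts)
```

The area equals the fraction of positive-negative pairs in which the positive sample receives the higher score, so it summarizes the curve over all thresholds rather than at a single operating point.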
Table 2 presents the ablation analysis of the DCNN+Bi-GRU method, including models with conventional feature fusion, conventional normalization, conventional fuzzy C-means, and a model without feature extraction, all in the context of big data classification. A comprehensive ablation analysis is carried out to thoroughly assess the results of integrating or improving specific elements within the DCNN+Bi-GRU methodology. This process offers a deeper insight into the distinct contributions that these features bring to the overall performance of the DCNN+Bi-GRU framework. Furthermore, the specificity values for the different models are as follows: the DCNN+Bi-GRU scheme achieves a specificity of 0.945, the model with conventional feature fusion scores 0.900, the model with conventional normalization obtains 0.886, the model with conventional Fuzzy C-means records 0.805, and the model without feature extraction registers 0.859. In addition, the FPR of the DCNN+Bi-GRU
Ablation study on DCNN+Bi-GRU, model with conventional feature fusion, model with conventional normalization, model with conventional Fuzzy C-means and model without feature extraction
Table 3 presents the statistical evaluation of DCNN+Bi-GRU against Bi-GRU, DCNN, LSTM, RF, RNN, ICNN [27], RFC+MAPREDUCE [28], and GRU for big data classification. Each model is assessed on the minimum, average, standard deviation, median, and maximum of its accuracy. For the median, DCNN+Bi-GRU attained an accuracy of 0.948, whilst Bi-GRU, DCNN, LSTM, RF, RNN, ICNN [27], RFC+MAPREDUCE [28], and GRU presented lower accuracy values of 0.881, 0.829, 0.805, 0.769, 0.825, 0.884, 0.879, and 0.852, respectively. For the maximum, the DCNN+Bi-GRU scheme achieves an accuracy of 0.967, which surpasses the accuracy values of Bi-GRU, DCNN, LSTM, RF, RNN, ICNN [27], RFC+MAPREDUCE [28], and GRU.
Statistical assessment on accuracy
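The five summary statistics reported in this assessment can be computed from a list of per-run accuracies as in the sketch below. The use of the population standard deviation is an assumption (the text does not state which variant was used), and the input values are hypothetical.

```python
import statistics
from typing import Dict, Sequence


def summarize(values: Sequence[float]) -> Dict[str, float]:
    """Minimum, mean, standard deviation, median and maximum of the values."""
    return {
        "min": min(values),
        "mean": statistics.mean(values),
        # Population standard deviation; statistics.stdev gives the sample one.
        "std": statistics.pstdev(values),
        "median": statistics.median(values),
        "max": max(values),
    }


# Hypothetical accuracies over several runs, for illustration only.
summary = summarize([0.91, 0.93, 0.95, 0.96])
```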
Table 4 presents a comparative analysis of the K-fold evaluation conducted on the DCNN+Bi-GRU method and conventional strategies for the classification of big data. K-fold analysis, also known as K-fold cross-validation, is a method employed in the fields of machine learning and statistics to evaluate the effectiveness and consistency of a predictive model. In K-fold cross-validation, the initial dataset is partitioned into K equal-sized subsets, often called “folds.” The model is then subjected to K rounds of training and assessment, where each round employs a distinct fold as the validation set, while the remaining K-1 folds are employed for training. This method guarantees that each data point is employed for validation precisely once. Furthermore, the DCNN+Bi-GRU approach achieved the highest precision of 0.939, whereas Bi-GRU, DCNN, LSTM, RF, RNN, ICNN [27], RFC+MAPREDUCE [28], and GRU yielded lower precision scores.
Analysis on K-Fold
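The fold construction described above can be sketched as an index generator. This is a minimal illustration, assuming shuffled, non-stratified folds; the function name, the seed, and the choice of K = 5 are hypothetical, while 32 is the number of instances in the Lung Cancer dataset.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, validation_indices) for each of the K rounds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # The first (n_samples % k) folds get one extra sample, so fold sizes
    # differ by at most one when n_samples is not divisible by k.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        start += size
        yield train, val


folds = list(k_fold_indices(n_samples=32, k=5))
```

Each index appears in exactly one validation fold and in the training set of the other K-1 rounds, which is the property that lets every data point be validated exactly once.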
The difference between the two sets, given in standard error units, is measured by the
Analysis of statistical test
The Wilcoxon test has two variants: the signed-rank test and the rank-sum test. The signed-rank test compares two matched groups. The aim of the test is to determine whether two or more sets of pairs differ from one another in a statistically significant way. Table 6 shows the analysis of the Wilcoxon test. The Wilcoxon value of the proposed model is (
Analysis of Wilcoxon test
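The signed-rank variant for matched pairs can be sketched as below. The sketch computes only the positive and negative rank sums, using average ranks for ties and discarding zero differences; it does not compute a p-value and is not the procedure used to produce the reported values. The paired inputs are hypothetical.

```python
from typing import Sequence, Tuple


def wilcoxon_signed_rank(x: Sequence[float],
                         y: Sequence[float]) -> Tuple[float, float]:
    """Return (W_plus, W_minus), the rank sums of positive and negative diffs."""
    if len(x) != len(y):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied absolute differences.
        while (j + 1 < len(order)
               and abs(diffs[order[j + 1]]) == abs(diffs[order[i]])):
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # mean of the 1-based ranks i+1..j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus


# Hypothetical paired measurements for two methods.
w_plus, w_minus = wilcoxon_signed_rank([10, 12, 15, 11], [8, 13, 12, 11])
```

The test statistic is commonly taken as the smaller of the two sums; a value close to zero indicates that the differences fall predominantly on one side.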
The Friedman test is a non-parametric test for analysing randomized complete block designs. It extends the sign test to cases with more than two treatments. Table 7 shows the results of the Friedman test. There are k experimental treatments (k
Analysis of Friedman test
Figure 9 presents the positive metric assessment for the DCNN+Bi-GRU approach in comparison with Bi-GRU, DCNN, LSTM, RF, RNN, ICNN [27], RFC+MAPREDUCE [28], and GRU for the classification of big data on Dataset 2. The model should produce high positive metric scores to classify big data accurately. The DCNN+Bi-GRU technique consistently achieves accuracy above 90%, whereas the accuracy values of Bi-GRU, DCNN, LSTM, RF, RNN, ICNN [27], RFC+MAPREDUCE [28], and GRU are lower. In particular, the DCNN+Bi-GRU approach attained an accuracy of 98.43 at a training rate of 90%. In contrast, Bi-GRU achieved 82.360, DCNN 73.705, LSTM 75.812, RF 80.535, RNN 85.570, ICNN [27] 84.822, RFC+MAPREDUCE [28] 82.838, and GRU 87.878. Furthermore, the DCNN+Bi-GRU strategy showed a specificity of 95.630 at a training percentage of 90%, while Bi-GRU, DCNN, LSTM, RF, RNN, ICNN [27], RFC+MAPREDUCE [28], and GRU all had lower specificity values.
Analysis of positive metrics for dataset 2.
Figure 10 compares the DCNN+Bi-GRU technique with Bi-GRU, DCNN, LSTM, RF, RNN, ICNN [27], RFC+MAPREDUCE [28], and GRU in terms of negative metrics for big data classification. The negative metric values need to be as low as possible for big data to be classified accurately. The DCNN+Bi-GRU approach produced smaller negative metric values than the conventional approaches. Specifically, at a training percentage of 90%, the False Positive Rate (FPR) of the DCNN+Bi-GRU approach is 3.450, which is much lower than that of Bi-GRU (11.181), DCNN (13.793), LSTM (11.137), RF (12.785), RNN (24.932), ICNN [27] (14.654), RFC+MAPREDUCE [28] (11.462), and GRU (9.465). The RF, LSTM, and DCNN algorithms reached the highest FNR values, whereas the DCNN+Bi-GRU approach recorded the lowest FNR of 3.796. This improvement in big data classification can be attributed to the improved normalization method, the improved fuzzy c-means clustering, and the combination of deep hybrid classification approaches.
Analysis of negative measures for Dataset 2.
As shown in Fig. 11, the remaining metrics are analysed for the DCNN+Bi-GRU approach in comparison with Bi-GRU, DCNN, LSTM, RF, RNN, ICNN [27], RFC+MAPREDUCE [28], and GRU to assess the efficacy of big data classification. The DCNN+Bi-GRU method needs to perform better on all of these metrics for big data to be classified effectively. At a training rate of 70%, the F-measure of the DCNN+Bi-GRU approach is 96.190, whereas Bi-GRU achieves 88.805, DCNN 85.929, LSTM 81.004, RF 85.980, RNN 88.948, ICNN [27] 90.899, RFC+MAPREDUCE [28] 80.164, and GRU 88.312. Moreover, the DCNN+Bi-GRU strategy has a much higher negative predictive value (NPV) than Bi-GRU, DCNN, LSTM, RF, RNN, ICNN [27], RFC+MAPREDUCE [28], and GRU. Therefore, the DCNN+Bi-GRU strategy outperforms the state-of-the-art techniques on a wide range of metrics, which demonstrates its potential for precise big data classification.
Analysis on other measures for Dataset 2.
Table 8 shows the statistical comparison of DCNN+Bi-GRU with Bi-GRU, DCNN, LSTM, RF, RNN, ICNN [27], RFC+MAPREDUCE [28], and GRU for big data classification. The analysis examines the minimum, average, standard deviation, median, and maximum of the accuracy. For the median, DCNN+Bi-GRU achieved an accuracy of 0.9430. In contrast, Bi-GRU, DCNN, LSTM, RF, RNN, ICNN [27], RFC+MAPREDUCE [28], and GRU showed lower accuracy values of 0.8611, 0.8404, 0.8270, 0.8042, 0.8398, 0.7686, 0.8604, and 0.852, respectively. Furthermore, the DCNN+Bi-GRU scheme obtains an accuracy of 0.967 for the maximum, outperforming the accuracy values of Bi-GRU, DCNN, LSTM, RF, RNN, ICNN [27], RFC+MAPREDUCE [28], and GRU.
Statistical analysis for dataset 2
Analysis of K-fold for dataset 2
Analysis of Mann-Whitney test
K-fold analysis, also referred to as K-fold cross-validation, is a technique used to assess the efficacy and consistency of a prediction model. The original dataset is divided into K equal-sized subsets, typically referred to as “folds.” The model is then trained and evaluated for K rounds. In each round, a different fold is used as the validation set, while the remaining K-1 folds are used for training. This technique ensures that every data point is used exactly once for validation. Table 9 shows the K-fold analysis for Dataset 2. The DCNN+Bi-GRU method obtained the highest precision of 0.929, whereas lower values were obtained for Bi-GRU, DCNN, LSTM, RF, RNN, ICNN [27], RFC+MAPREDUCE [28], and GRU.
Statistical test for Mann-Whitney U test
The probability that assesses the evidence against the null hypothesis, or
Analysis of computational time
“Running time” or computation time is the amount of time needed to complete a computing task. When a computation is represented as a series of rule applications, the number of rule applications determines how long the computation takes. Table 11 shows the computational time analysis. The computational time of the proposed model is (
Computational time analysis
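Wall-clock computation time of the kind reported here can be measured with a small helper such as the one below. This is a generic sketch using the standard-library timer, not the measurement procedure of this work; the timed function is a hypothetical stand-in for a training or inference call.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Run fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Hypothetical workload standing in for a model's classification step.
result, seconds = timed(sum, range(1000))
```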
In this paper, a big data clustering and classification model is proposed with an improved fuzzy-based deep architecture under the MapReduce framework. In the pre-processing stage, the data were partitioned from the big dataset using an improved C-Means clustering method. The pre-processed big data were then handled by the MapReduce framework, which consists of mapper and reducer phases. In the mapper phase, the improved normalization method was applied, followed by a feature fusion strategy that combines correlation-based and entropy-based features. In the reducer phase, the outputs of all mappers were combined to produce an acceptable feature. Classification was performed by a deep hybrid model that combines a deep convolutional neural network with a bidirectional gated recurrent unit, and the improved score-level fusion approach was applied to obtain the final classification result. The proposed work outperforms conventional models in terms of classification precision, recall, FNR, FPR, and other performance metrics.
