Modeling of class imbalance handling with optimal deep learning enabled big data classification model

Abstract

Big data refers to data whose volume exceeds a system's processing capacity in terms of memory usage and computation time. It arises in several domains, such as healthcare, education, social networks, and e-commerce, which have progressively accumulated massive quantities of input data. Big data analytics is a major research problem that can be addressed using expert systems and deep structured architectures. In addition, data wrangling and class imbalance handling are challenging issues that need to be resolved in big data analytics. Class imbalance degrades the performance of the classification model, and handling it remains difficult because of the heterogeneous and complex structure of comparatively huge datasets. This research therefore presents a Class Imbalance Handling with Optimal Deep Learning Enabled Big Data Classification (CIHODL-BDC) framework. The core idea of the CIHODL-BDC framework is to classify big data in the Hadoop MapReduce framework. To accomplish this, the presented CIHODL-BDC model first performs a data wrangling process to convert the raw data into a usable format. Next, the CIHODL-BDC model handles the class imbalance problem using a grey wolf optimizer (GWO) with the Synthetic Minority Oversampling Technique (SMOTE). Then, the Adam optimizer with the Bidirectional Long Short Term Memory (BiLSTM) approach is applied to classify the big data. The proposed CIHODL-BDC model is evaluated on two standard datasets. The simulation outcomes show improved performance of the CIHODL-BDC approach over existing methods.

Keywords

Data wrangling, big data analytics, Hadoop MapReduce, class imbalance data handling, deep learning

1. Introduction

Big data can be described by three data features: volume, velocity, and variety [1]. Velocity refers to the speed at which data are generated and processed according to the application, while variety refers to the types and nature of the data. The enormous growth of data creates more difficulties in processing than in accessing and storing the data. Data collection has become costly, so it is essential to use the data efficiently, and more effective systems have been developed for big data processing [2]. Formally, big data means a data volume that exceeds a system's processing capacity in terms of memory usage and time consumption. Big data has been used broadly in numerous domains, such as business, medicine, and industry, which have archived huge volumes of raw data [3]. One important research issue is data analytics, which can be carried out with data mining and machine learning (ML) techniques. Big data mining is generally hard to handle with current technologies and data mining software tools because the data are large and complex [4]. With big data, the need for smart data analytic techniques, such as multi-temporal processing, image processing, data fusion, and automatic classification, increases. Parallelization techniques have been developed to scale with the available data by significantly speeding up the computations. To overcome the drawbacks of big data processing, data mining techniques have been implemented on these emerging technologies [5]. To manage the issues related to large-scale datasets, Google introduced the MapReduce framework.

ML and deep learning (DL) techniques are strongly affected by the class imbalance issue [6, 7]. This issue arises when the number of samples in one or more classes of the dataset is much smaller than in the other classes, which significantly degrades classifier performance [8]. Many studies have addressed this issue, in particular with data sampling techniques such as Random Over-Sampling (ROS), which replicates samples from the minority class, and Random Under-Sampling (RUS), which removes samples from the majority class. Such methods bias the discrimination procedure to compensate for the class imbalance ratio [9]. Data sampling techniques have significant disadvantages, such as longer training periods and overfitting, which mostly appear when minority samples are replicated [10]. Moreover, information loss is typical when many samples are removed from the majority classes, potentially discarding information that is valuable to the classifier. Consequently, more “intelligent” sampling techniques that include a heuristic mechanism have been developed.

The researchers in [11] suggest a new classification framework for big data that has two stages. In the feature selection stage, the popular Whale Optimization Algorithm (WOA) is used to obtain accurate feature sets. The second, pre-processing stage uses the SMOTE and LSH-SMOTE techniques to solve the class imbalance issue. Next, the WOA $+$ BRNN system uses the WOA to train a DL technique named BRNN. In [12], Adaptive Boosting (AdaBoost) is integrated with a CNN to create a novel ML technique, AdaBoost-CNN, that handles huge imbalanced datasets with higher precision. The authors of [13] illustrate the impact of class imbalance on classification methods. In particular, they study the effect of changing class imbalance ratios on classifier accuracy. The authors of [14] suggest the first compound framework to deal with multi-class big data problems, confronting the presence of high volumes and multiple data classes. They recommend analyzing the instance-level difficulties in every class, which helps identify the factors that make learning difficult. In [15], a distributed clustering method for imbalanced data reduction with the k-nearest neighbor (K-NN) classification method was presented. The main purpose of that study is to evaluate the actual training data to reduce the number of instances or elements. Such reduced datasets ensure fast data classification and efficient memory management with low sensitivity.

The major aim of the proposed CIHODL-BDC model is to classify big data in the Hadoop MapReduce framework. The presented CIHODL-BDC model first performs a data wrangling process to convert the raw data into a usable format. Next, the CIHODL-BDC model handles the class imbalance problem using GWO with the SMOTE technique. Then, the Adam optimizer with the BiLSTM model is applied to classify the big data. The proposed CIHODL-BDC method is evaluated on two standard datasets.

2. Depiction of the proposed model

In this work, a new CIHODL-BDC approach is introduced for big data classification. Here, Hadoop MapReduce is used to manage the big data. The proposed CIHODL-BDC method operates through a series of sub-procedures, namely data wrangling, SMOTE-based class imbalance handling, GWO-based parameter tuning, BiLSTM-based classification, and Adam-based hyperparameter optimization. Figure 1 depicts the overall process of the CIHODL-BDC model.

Figure 1.

Diagrammatic presentation of CIHODL-BDC approach.

2.1 Hadoop MapReduce

Hadoop is one of the emerging technologies and tools widely used in big data. After many years of development, the Hadoop technology stack is now applied extensively in the open-source community [16]. The building block of Hadoop is MapReduce, which is used to solve the problem of parallel analysis and operation in large-scale data scenarios. MapReduce is defined by its two key processes: Map and Reduce. Map is the mapping process, and Reduce is the aggregation process. These processes are executed simultaneously on a set of worker nodes. Each node performs the same processing on the dataset it manages, without data communication. MapReduce frees the developer from considering low-level details when designing large-scale data processing applications; the developer only has to implement the interfaces of the two processes, which considerably improves development efficiency and reduces development difficulty.
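The Map and Reduce processes described above can be illustrated with a minimal single-machine sketch in plain Python. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the toy records are assumptions made for illustration, and the two list partitions merely stand in for data blocks that a cluster would process on separate nodes.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit one (key, value) pair per input record."""
    return (record["label"], 1)

def shuffle(pairs):
    """Group the emitted values by key (done by the framework on a cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: aggregate all values that share a key."""
    return key, sum(values)

def run_job(partitions):
    # Each partition is mapped independently, with no communication
    # between partitions during the map step.
    mapped = [map_phase(r) for part in partitions for r in part]
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

# Hypothetical toy records split across two partitions.
partitions = [
    [{"label": "walking"}, {"label": "falling"}],
    [{"label": "walking"}, {"label": "lying"}, {"label": "walking"}],
]
class_counts = run_job(partitions)
print(class_counts)
```

Counting samples per class in this way is also a quick check of how imbalanced a dataset is before any resampling is applied.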

The core concept of this work is to classify big data with the Hadoop MapReduce technique. Two datasets are used: localization data and skin data. The localization data are recorded from the activities of five people wearing tags on the right ankle, left ankle, chest, and belt. The skin data are obtained from the R, G, and B values of face images taken from two databases, FERET and PAL. The collected data are then subjected to the data wrangling process, in which the data are converted and mapped from the original format to another format that is suitable for analytics. Then, the SMOTE approach is applied to handle class imbalance. Once the data are balanced, classification is performed using the BiLSTM model. Parameter optimization is then carried out with the Adam optimizer. Finally, the experimental evaluation is conducted with respect to several metrics.

2.2 Data wrangling process

Data wrangling involves converting and mapping datasets from the original format to another format in order to produce information in a form convenient for analytics. The aim of data wrangling is to ensure useful, high-quality information. Data experts typically spend more of their time on data wrangling than on data analysis. In this work, the data wrangling procedure comprises several steps, namely data transformation, missing value replacement, removal of redundant data, and data grouping. First, data transformation is performed, in which information in any format is converted into .csv format. Then, missing values in the data are filled with the mode. Next, unwanted and repeated columns or rows are removed. Finally, the dataset is grouped using the built-in function in Pandas, which splits the dataset into different groups.
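The wrangling steps listed above can be sketched with Pandas as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the column names (`tag`, `x`, `y`, `activity`), the inline CSV content, and the helper `wrangle` are hypothetical, and only repeated rows (not columns) are removed here.

```python
import io
import pandas as pd

# Hypothetical raw data already converted to CSV text.
raw_csv = io.StringIO(
    "tag,x,y,activity\n"
    "belt,1.0,2.0,walking\n"
    "belt,1.0,2.0,walking\n"   # exact duplicate of the previous row
    "chest,,3.0,walking\n"     # missing value in column x
    "chest,4.0,3.5,lying\n"
    "belt,1.0,5.0,lying\n"
)

def wrangle(df):
    """Fill missing values with the column mode, then drop repeated rows."""
    df = df.copy()
    for col in df.columns:
        if df[col].isna().any():
            # mode() ignores NaN; take the first mode if there are ties
            df[col] = df[col].fillna(df[col].mode().iloc[0])
    return df.drop_duplicates().reset_index(drop=True)

clean = wrangle(pd.read_csv(raw_csv))

# Grouping with the built-in Pandas groupby: one group per activity label.
group_sizes = clean.groupby("activity").size().to_dict()
print(clean)
print(group_sizes)
```

Filling with the mode keeps the replacement value inside the set of observed values, which is why it also works for categorical columns where a mean is undefined.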

2.3 Data handling process for class imbalance

The SMOTE approach is used to handle the class imbalance problem. SMOTE is an over-sampling algorithm that produces a synthetic sample as follows. Consider the $k$ nearest neighbors of a minority sample $x_{i}\in S_{\text{min}}$ . One of the $k$ neighbors, $\hat{x}_{i}$ , is chosen randomly, the difference between the two vectors is multiplied by a random weight $\delta$ in the interval [0, 1], and the weighted difference is added to the minority sample $x_{i}$ to generate a new synthetic sample.

$\displaystyle x_{\textit{new}}=x_{i}+\left(\hat{x}_{i}-x_{i}\right)\times\delta$ (1)

The new sample lies on the line segment connecting the two vectors. The disadvantages of SMOTE include variance and over-generalization. Sample Subset Optimization (SSO) determines the optimum balanced sample subset. Here, the GWO approach is used to determine an optimum balanced sample set.
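The interpolation in Eq. (1) can be sketched in NumPy as shown below. This is a minimal illustration, assuming Euclidean distance for the neighbor search and a brute-force distance matrix; the function name `smote`, its parameters, and the toy minority set are hypothetical and not taken from the original implementation.

```python
import numpy as np

def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic samples from the minority class (Eq. 1)."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    k = min(k, n - 1)  # a sample cannot be its own neighbor

    # Brute-force pairwise Euclidean distances (assumption: small data).
    diff = minority[:, None, :] - minority[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]

    synthetic = np.empty((n_new, minority.shape[1]))
    for s in range(n_new):
        i = rng.integers(n)                  # pick a minority sample x_i
        j = neighbours[i, rng.integers(k)]   # pick one of its k neighbors
        delta = rng.random()                 # random weight in [0, 1)
        synthetic[s] = minority[i] + (minority[j] - minority[i]) * delta
    return synthetic

# Hypothetical two-dimensional minority class.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
X_new = smote(X_min, n_new=10, k=3)
print(X_new.shape)
```

Because every synthetic point is a convex combination of two existing minority points, it always stays inside the bounding box of the minority class, which is one way to sanity-check an implementation.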

The GWO approach is based on the hunting behavior and leadership hierarchy of grey wolves, which usually live in packs [17]. Like other swarm intelligence (SI)-based meta-heuristic approaches, GWO first initializes a population. Next, each wolf updates its location in the solution space. The top three solutions are regarded as the leader wolves, alpha ( $\alpha$ ), beta ( $\beta$ ), and delta ( $\delta$ ), and the remaining solutions are the omega ( $\omega$ ) wolves. The follower wolves update their positions under the guidance of the leading wolves. The leadership, encircling, and hunting behaviors are modeled mathematically below. The encircling behavior is expressed as:

$\displaystyle D=\left|{C\times X_{l.w}^{t}-X_{w}^{t}}\right|$ (2)

$\displaystyle X_{w}^{\left({t+1}\right)}=X_{l.w}^{t}-A\times D$ (3)

Here, $t$ denotes the current iteration number, $X_{l.w}^{t}$ represents the location of a leading wolf ( $\alpha$ , $\beta$ , or $\delta$ ) at the $t^{\text{th}}$ iteration, and $X_{w}^{\left({t+1}\right)}$ indicates the location of the grey wolf in the next iteration. $D$ is the difference vector between the leader wolf and the grey wolf. The coefficient vectors $A$ and $C$ are determined as follows.

$\displaystyle A=2\times a\times\textit{rand}_{1}-a$ (4)

$\displaystyle C=2\times\textit{rand}_{2}$ (5)

$\textit{rand}_{1}$ and $\textit{rand}_{2}$ are uniform random numbers in $\left[{0,1}\right]$ . The parameter $a$ is decreased linearly from 2 to $0$ over the iterations and is determined as follows.

$\displaystyle a=2-2\left({\frac{t}{\textit{Maxiter}}}\right)$ (6)

In Eq. (6), Maxiter indicates the maximum number of iterations. In the hunting procedure, it is assumed that each of the leading wolves has good knowledge of the prey position. Accordingly, all wolves update their positions according to the positions of the leading wolves with the following formulas:

$\displaystyle D_{\alpha}=\left|{C_{1}\times X_{\alpha.w}^{t}-X_{w}^{t}}\right|$ (7)

$\displaystyle D_{\beta}=\left|{C_{2}\times X_{\beta.w}^{t}-X_{w}^{t}}\right|$ (8)

$\displaystyle D_{\delta}=\left|{C_{3}\times X_{\delta.w}^{t}-X_{w}^{t}}\right|$ (9)

Here, $X_{\alpha.w}^{t}$ , $X_{\beta.w}^{t}$ , and $X_{\delta.w}^{t}$ are the locations of the $\alpha$ , $\beta$ , and $\delta$ wolves at the $t^{\text{th}}$ iteration, and $C_{1}$ , $C_{2}$ , and $C_{3}$ are coefficient vectors determined as in Eq. (5). After obtaining the difference vectors $D_{\alpha}$ , $D_{\beta}$ , and $D_{\delta}$ , the new location of a grey wolf at iteration $(t+1)$ is computed by:

$\displaystyle X_{w}^{\left({t+1}\right)}=\frac{X_{w1}^{t}+X_{w2}^{t}+X_{w3}^{t}}{3}$ (10)

where

$\displaystyle X_{w1}^{t}=X_{\alpha.w}^{t}-A_{1}\times D_{\alpha}$ (11)

$\displaystyle X_{w2}^{t}=X_{\beta.w}^{t}-A_{2}\times D_{\beta}$ (12)

$\displaystyle X_{w3}^{t}=X_{\delta.w}^{t}-A_{3}\times D_{\delta}$ (13)

The coefficient vectors $A_{1}$ , $A_{2}$ , and $A_{3}$ are determined as in Eq. (4).
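The update rules in Eqs. (4) to (13) can be combined into a short NumPy sketch. It is an illustration only: the function name `gwo`, its arguments, the clipping to box bounds, and the best-so-far bookkeeping are assumptions, and a simple sphere function stands in for the actual cost function, which in this work is the cross-validation error.

```python
import numpy as np

def gwo(cost, dim, low, high, n_wolves=20, max_iter=200, seed=0):
    """Minimize `cost` over a box [low, high]^dim with grey wolf updates."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(low, high, size=(n_wolves, dim))  # initial population
    best_x, best_f = None, np.inf
    for t in range(max_iter):
        fitness = np.array([cost(x) for x in X])
        order = np.argsort(fitness)
        if fitness[order[0]] < best_f:                # remember the best wolf
            best_f, best_x = fitness[order[0]], X[order[0]].copy()
        leaders = X[order[:3]]                        # alpha, beta, delta
        a = 2.0 - 2.0 * t / max_iter                  # Eq. (6)
        candidates = []
        for leader in leaders:
            A = 2.0 * a * rng.random(X.shape) - a     # Eq. (4)
            C = 2.0 * rng.random(X.shape)             # Eq. (5)
            D = np.abs(C * leader - X)                # Eqs. (7)-(9)
            candidates.append(leader - A * D)         # Eqs. (11)-(13)
        # Eq. (10): average the three leader-guided positions.
        X = np.clip(np.mean(candidates, axis=0), low, high)
    return best_x, best_f

def sphere(x):
    """Stand-in cost with its minimum of 0 at the origin."""
    return float(np.sum(x ** 2))

best_x, best_f = gwo(sphere, dim=3, low=-5.0, high=5.0)
print(best_x, best_f)
```

As $a$ shrinks from 2 to 0, the magnitude of $A$ shrinks with it, so the wolves move from wide exploration around the leaders to fine exploitation near them.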

The cost function to be minimized is an error represented as follows

$\displaystyle\hat{\varepsilon}=\frac{1}{n}\sum^{f}_{i=1}\sum^{n/f}_{j=1}\left|y^{(i)}_{j}-p\left(t^{(i)}_{j}|\Theta^{(i)_{t}}x^{(i)}_{j}\right)\right|$ (14)

where $f$ is the number of folds, $n$ is the number of samples, $y_{j}^{\left(i\right)}$ is the given class label of instance $j$ from test fold $i$ , $p\left(t_{j}^{\left(i\right)}|\Theta^{(i)_{t}}x_{j}^{\left(i\right)}\right)$ is the prediction for the $j^{\text{th}}$ sample from test fold $i$ , $x_{j}^{\left(i\right)}$ is the feature vector, $\Theta^{\left(i\right)}$ denotes the model parameters obtained during training, and $t_{j}^{\left(i\right)}$ is the predicted value of the $j^{\text{th}}$ sample. The error in Eq. (14) is the $f$ -fold cross-validation error, and the subset that minimizes this error is the solution to the optimization problem. Finding the optimal instance subset comprising equal numbers of minority and majority instances is thus cast as an optimization problem. An evolutionary algorithm is therefore used to optimize the instance subset toward a global minimum of the cost function. Here, the global optimum is sought using the GWO approach.
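The $f$-fold cross-validation error of Eq. (14) can be sketched as below. The nearest-centroid classifier is only a lightweight stand-in chosen to keep the example self-contained (the classifier used in this work is the BiLSTM), and the names `fit_centroids`, `predict_centroids`, `cv_error`, and the toy data are assumptions. With 0/1 labels, the absolute difference between label and prediction is simply the misclassification indicator.

```python
import numpy as np

def fit_centroids(X, y):
    """Stand-in model: one centroid per class."""
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict_centroids(model, X):
    classes, centroids = model
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(d, axis=1)]

def cv_error(X, y, f=5, seed=0):
    """Eq. (14): mean absolute label error over f test folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), f)
    total = 0.0
    for i in range(f):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(f) if j != i])
        model = fit_centroids(X[train], y[train])   # parameters from training
        pred = predict_centroids(model, X[test])    # predictions on test fold
        total += np.abs(y[test] - pred).sum()
    return total / len(y)

# Hypothetical balanced subset: two well-separated classes of 25 points each.
base = np.column_stack([np.linspace(-1, 1, 25), np.linspace(-1, 1, 25)])
X = np.vstack([base, base + 10.0])
y = np.array([0] * 25 + [1] * 25)
err = cv_error(X, y, f=5)
print(err)
```

In a subset-optimization setting, a function of this kind would be evaluated once per candidate subset, so its cost dominates the run time of the search.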

2.4 Big data classification using BiLSTM model

Once the data are balanced, classification is performed by the BiLSTM model. The model is based on the LSTM cell, in which reading from and writing to the memory cell $c$ are controlled by a set of sigmoid gates [18]. At time step $t$ , the LSTM receives input from several sources: the current input $x_{t}$ , the previous hidden state of the LSTM units $h_{t-1}$ , and the previous memory cell state $c_{t-1}$ . The gates are updated at time step $t$ for the given inputs $x_{t}$ , $h_{t-1}$ , and $c_{t-1}$ as follows:

$\displaystyle i_{t}=\sigma\left({W_{xi}x_{t}+W_{hi}h_{t-1}+b_{i}}\right),$ (15)

$\displaystyle f_{t}=\sigma\left({W_{xf}x_{t}+W_{hf}h_{t-1}+b_{f}}\right),$ (16)

$\displaystyle o_{t}=\sigma\left({W_{xo}x_{t}+W_{ho}h_{t-1}+b_{o}}\right),$ (17)

$\displaystyle g_{t}=\phi\left({W_{xc}x_{t}+W_{hc}h_{t-1}+b_{c}}\right),$ (18)

$\displaystyle c_{t}=f_{t}\odot c_{t-1}+i_{t}\odot g_{t},$ (19)

$\displaystyle h_{t}=o_{t}\odot\phi\left({c_{t}}\right),$ (20)

Here, the optional peephole connections are not taken into account, $W$ refers to the weight matrices, and $b$ denotes the biases. $\sigma$ indicates the sigmoid function $\sigma\left(x\right)=\frac{1}{1+\text{exp}\left({-x}\right)}$ , $\phi$ represents the hyperbolic tangent $\phi\left(x\right)=\frac{\text{exp}\left(x\right)-\text{exp}\left({-x}\right)}{\text{exp}\left(x\right)+\text{exp}\left({-x}\right)}$ , and $\odot$ symbolizes the element-wise product with a gate value. The LSTM hidden output $h_{t}=\{h_{tk}\}_{k=0}^{K},h_{t}\in R^{K}$ is utilized for predicting the following words through the Softmax function using parameters $W_{s}$ and $b_{s}$ :

$\displaystyle F\left({p_{ti};W_{s},b_{s}}\right)=\frac{\exp\left({W_{s}h_{ti}+b_{s}}\right)}{\mathop{\sum}\nolimits_{j=1}^{K}\exp\left({W_{s}h_{tj}+b_{s}}\right)},$ (21)

In Eq. (21), $p_{ti}$ indicates the probability distribution over the predicted words. The main motivation for LSTM is that it learns long-term temporal dependencies and mitigates the vanishing and exploding gradient problems that conventional RNNs suffer from during backpropagation. To exploit both the previous and the subsequent context, a bidirectional model is used, in which a sequence is fed to the LSTM in forward and backward order. At every time step $t$ , the forward hidden state $\vec{h}_{t}$ is calculated from the previous hidden state $\vec{h}_{t-1}$ and the input at the current step $x_{t}$ . In addition, the backward hidden state $\overleftarrow{h}_{t}$ is calculated from the input at the current step $x_{t}$ and the following hidden state $\overleftarrow{h}_{t+1}$ . The forward and backward contextual representations, produced by $\vec{h}_{t}$ and $\overleftarrow{h}_{t}$ respectively, are concatenated into a longer vector. The fused output gives the prediction for the target sequence. Figure 2 displays the structure of the BiLSTM approach.

Figure 2.

Structure of BiLSTM.
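The LSTM cell of Eqs. (15) to (20), its bidirectional use, and the Softmax output of Eq. (21) can be sketched as an untrained NumPy forward pass. All names (`init_params`, `lstm_step`, `run_lstm`, `bilstm`, `softmax`), the stacked weight layout, the random initialization, and the choice to classify from the two end states are assumptions made for illustration rather than the configuration used in this work.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(n_in, n_hidden, rng):
    # Weights of the four gates (i, f, o, g) stacked along the first axis.
    return {
        "W_x": rng.normal(0.0, 0.1, (4 * n_hidden, n_in)),
        "W_h": rng.normal(0.0, 0.1, (4 * n_hidden, n_hidden)),
        "b": np.zeros(4 * n_hidden),
    }

def lstm_step(p, x_t, h_prev, c_prev):
    z = p["W_x"] @ x_t + p["W_h"] @ h_prev + p["b"]
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # Eqs. (15)-(17)
    g = np.tanh(g)                                # Eq. (18)
    c_t = f * c_prev + i * g                      # Eq. (19)
    h_t = o * np.tanh(c_t)                        # Eq. (20)
    return h_t, c_t

def run_lstm(p, xs):
    n_hidden = p["b"].size // 4
    h, c = np.zeros(n_hidden), np.zeros(n_hidden)
    states = []
    for x_t in xs:
        h, c = lstm_step(p, x_t, h, c)
        states.append(h)
    return np.stack(states)

def bilstm(p_fwd, p_bwd, xs):
    h_fwd = run_lstm(p_fwd, xs)              # left-to-right pass
    h_bwd = run_lstm(p_bwd, xs[::-1])[::-1]  # right-to-left pass, re-aligned
    return np.concatenate([h_fwd, h_bwd], axis=1)

def softmax(z):
    e = np.exp(z - z.max())                  # shift for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
n_in, n_hidden, n_classes, T = 3, 4, 2, 6
xs = rng.normal(size=(T, n_in))              # hypothetical input sequence
H = bilstm(init_params(n_in, n_hidden, rng), init_params(n_in, n_hidden, rng), xs)

# Summary vector: last forward state joined with first backward state.
summary = np.concatenate([H[-1, :n_hidden], H[0, n_hidden:]])
W_s, b_s = rng.normal(0.0, 0.1, (n_classes, 2 * n_hidden)), np.zeros(n_classes)
probs = softmax(W_s @ summary + b_s)         # Eq. (21)
print(H.shape, probs)
```

The backward pass is reversed again after it runs so that row $t$ of both halves refers to the same time step before concatenation.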

2.5 Hyperparameter optimization

Finally, the hyperparameters related to the BiLSTM model are tuned by the Adam optimizer. Adam is an optimization approach that can be used in place of the classical stochastic gradient descent (SGD) procedure to update the network weights iteratively from the training data [19]. The technique is highly effective for complicated problems with a large amount of data or many variables, and it has low memory requirements. Intuitively, it combines two gradient descent (GD) methods: GD with momentum and the Root Mean Square Propagation (RMSP) algorithm.

Momentum: This technique uses an ‘exponentially weighted average’ of the gradients to accelerate the GD model. Using the average allows faster convergence to the minimum.

$\displaystyle w_{t+1}=w_{t}-\alpha m_{t}$ (22)

Where,

$\displaystyle m_{t}=\beta m_{t-1}+\left({1-\beta}\right)\left[{\frac{\delta L}{\delta w_{t}}}\right]$ (23)

$m_{t}=$ aggregate of the gradients at time $t$ (initially, $m_{t}=0$ ), $m_{t-1}=$ aggregate of the gradients at time $t-1$ , $w_{t}=$ weight at time $t$ , $w_{t+1}=$ weight at time $t+1$ , $\alpha=$ learning rate, $\frac{\delta L}{\delta w_{t}}=$ derivative of the loss function $L$ with respect to the weight at time $t$ , and $\beta=$ moving average parameter (constant, 0.9).

Root Mean Square Propagation (RMSP): RMSprop is an adaptive learning-rate method that improves on AdaGrad. Instead of the cumulative sum of squared gradients used by AdaGrad, it uses an ‘exponential moving average’ of the squared gradients.

$\displaystyle w_{t+1}=w_{t}-\frac{\alpha_{t}}{(v_{t}+\varepsilon)^{\frac{1}{2}}}\ast\left[{\frac{\delta L}{\delta w_{t}}}\right]$ (24)

Where,

$\displaystyle v_{t}=\beta v_{t-1}+\left({1-\beta}\right)\ast\left[\frac{\delta L}{\delta w_{t}}\right]^{2}$ (25)

$w_{t}=$ weight at time $t$ , $w_{t+1}=$ weight at time $t+1$ , $\alpha_{t}=$ learning rate at time $t$ , $\frac{\delta L}{\delta w_{t}}=$ derivative of the loss function $L$ with respect to the weight at time $t$ , $v_{t}=$ exponential moving average of the squared past gradients, and $\varepsilon=$ a small positive constant that prevents division by zero. The Adam optimizer thus combines the strengths of the two preceding methods into an improved GD. Combining the equations of the two techniques gives

$\displaystyle m_{t}=\beta_{1}m_{t-1}+\left({1-\beta_{1}}\right)\left[{\frac{\delta L}{\delta w_{t}}}\right],\qquad v_{t}=\beta_{2}v_{t-1}+\left({1-\beta_{2}}\right)\left[\frac{\delta L}{\delta w_{t}}\right]^{2}$ (26)

Because $m_{t}$ and $v_{t}$ are initialized to zero, they are biased toward zero in the early iterations. Adam (adaptive moment estimation) therefore corrects this bias at every iteration: in place of the standard parameters $m_{t}$ and $v_{t}$ , the bias-corrected parameters $\widehat{m_{t}}$ and $\widehat{v_{t}}$ are used in the weight update.

$\displaystyle w_{t+1}=w_{t}-\widehat{m_{t}}\left({\frac{\alpha}{\sqrt{\widehat{v_{t}}+\varepsilon}}}\right)$ (27)

This optimizer is used here because of its low memory requirement and high efficiency.
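The Adam update of Eqs. (26) and (27) can be sketched in a few lines of NumPy. The bias-correction formulas $\widehat{m_{t}}=m_{t}/(1-\beta_{1}^{t})$ and $\widehat{v_{t}}=v_{t}/(1-\beta_{2}^{t})$ used below are the standard ones and are not written out in the text above; the function name `adam`, the default values, and the quadratic test loss are likewise assumptions for illustration.

```python
import numpy as np

def adam(grad, w0, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Minimize a loss given its gradient function `grad`."""
    w = np.asarray(w0, dtype=float).copy()
    m = np.zeros_like(w)   # first-moment estimate (momentum term)
    v = np.zeros_like(w)   # second-moment estimate (RMSP term)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1.0 - beta1) * g         # Eq. (26), first part
        v = beta2 * v + (1.0 - beta2) * g ** 2    # Eq. (26), second part
        m_hat = m / (1.0 - beta1 ** t)            # standard bias correction
        v_hat = v / (1.0 - beta2 ** t)
        # Eq. (27) as written above, with epsilon inside the square root.
        w = w - alpha * m_hat / np.sqrt(v_hat + eps)
    return w

# Hypothetical loss L(w) = (w - 3)^2 with gradient 2 (w - 3).
w_star = adam(lambda w: 2.0 * (w - 3.0), w0=[0.0], alpha=0.05, steps=2000)
print(w_star)
```

Many library implementations add the small constant outside the square root instead; the difference only matters when the second-moment estimate is close to zero.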

3. Results

The CIHODL-BDC technique is evaluated on two datasets: localization data and skin data. The localization data are generated from recordings of the activities of five people wearing tags on the right ankle, left ankle, chest, and belt. Dataset 1 contains 8 attributes and 164 860 samples. Each instance is the localization data for one tag, and the tag can be identified by one of the attributes. In dataset 2, the skin data are obtained from the R, G, and B values of face images taken from two databases, FERET and PAL. The number of available samples is 245 057, comprising 50 859 skin samples and 194 198 non-skin samples.

Table 1 provides the number of samples in each dataset before and after the resampling process. The table values indicate that the datasets are properly resampled by the GWO-SMOTE technique.

Table 1
Dataset details

Class names	Before resampling	After resampling
Dataset 1
Walking	32710	54005
Falling	2973	54783
Lying	54480	54480
Sitting	27244	54903
No. of samples	117407	218171
Dataset 2
Skin samples	50859	191631
Non-skin samples	194198	194198
No. of samples	245057	385829

Figure 3.

Estimation of the confusion matrices of CIHODL-BDC model (a) 70% of TR dataset-1, (b) 30% of TS dataset-1, (c) 70% of TR dataset-2, and (d) 30% of TS dataset-2.

Figure 3 illustrates the confusion matrices of the CIHODL-BDC technique on the two datasets. On 70% of the training (TR) data of dataset-1, the CIHODL-BDC model has recognized 36780 samples in the walking class, 37849 samples in the falling class, 36716 samples in the lying class, and 37265 samples in the sitting class. On 30% of the testing (TS) data of dataset-1, the CIHODL-BDC approach has recognized 15838 samples in the walking class, 16219 samples in the falling class, 15578 samples in the lying class, and 16068 samples in the sitting class. On 70% of the TR data of dataset-2, the CIHODL-BDC methodology has recognized 132955 samples in the skin class and 133575 samples in the non-skin class. On 30% of the TS data of dataset-2, 56843 samples are recognized in the skin class and 57426 samples in the non-skin class.

Figure 4.

Dataset 1 performance analysis of CIHODL-BDC model with 70% of TR data.

Figure 5.

Dataset 1 validation of CIHODL-BDC technique with 30% of TS data.

Figure 4 presents the results of the CIHODL-BDC model on dataset-1 with 70% of TR data. Here, the CIHODL-BDC approach has recognized walking class samples with $\textit{accu}_{y}$ of 98.30%, $\textit{sens}_{y}$ of 97.41%, $\textit{spec}_{y}$ of 98.59%, $F_{\textit{score}}$ of 96.58%, and AUC of 98%. In addition, the CIHODL-BDC method has recognized lying class samples with $\textit{accu}_{y}$ of 98.44%, $\textit{sens}_{y}$ of 95.97%, $\textit{spec}_{y}$ of 99.27%, $F_{\textit{score}}$ of 96.86%, and AUC of 97.62%. Also, the CIHODL-BDC system has recognized sitting class instances with $\textit{accu}_{y}$ of 98.90%, $\textit{sens}_{y}$ of 97.31%, $\textit{spec}_{y}$ of 99.48%, $F_{\textit{score}}$ of 97.80%, and AUC of 98.33%.
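Per-class measures of this kind are commonly derived from a confusion matrix in a one-vs-rest manner, as sketched below. The definitions used here (rows as actual classes, columns as predicted classes) are the usual textbook ones and are an assumption about how the reported values were computed; the function name `per_class_metrics` and the toy matrix are hypothetical and do not reproduce any figure in this paper.

```python
import numpy as np

def per_class_metrics(cm):
    """One-vs-rest metrics from a confusion matrix (rows = actual class)."""
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    tp = np.diag(cm)                 # correctly recognized samples per class
    fn = cm.sum(axis=1) - tp         # samples of the class that were missed
    fp = cm.sum(axis=0) - tp         # other samples assigned to the class
    tn = total - tp - fn - fp
    sens = tp / (tp + fn)            # sensitivity (recall)
    spec = tn / (tn + fp)            # specificity
    prec = tp / (tp + fp)            # precision
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sens,
        "specificity": spec,
        "f_score": 2 * prec * sens / (prec + sens),
    }

# Hypothetical two-class confusion matrix, not taken from the paper.
toy_cm = [[90, 10],
          [5, 95]]
metrics = per_class_metrics(toy_cm)
for name, values in metrics.items():
    print(name, np.round(values, 4))
```

In the two-class case the sensitivity of one class equals the specificity of the other, which gives a simple consistency check for reported binary results.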

Figure 6.

Estimation of CIHODL-BDC approach based on TA and VA analysis under dataset-1.

Figure 7.

Analysis of VL and TL with the help of the CIHODL-BDC model under dataset-1.

Figure 5 presents the experimental results of the CIHODL-BDC approach on dataset-1 with 30% of TS data. The CIHODL-BDC approach achieves effective classification outcomes. It has recognized walking class instances with $\textit{accu}_{y}$ of 98.36%, $\textit{sens}_{y}$ of 97.48%, $\textit{spec}_{y}$ of 98.65%, $F_{\textit{score}}$ of 96.72%, and AUC of 98.07%. In addition, the CIHODL-BDC model has recognized lying class instances with $\textit{accu}_{y}$ of 98.46%, $\textit{sens}_{y}$ of 96.02%, $\textit{spec}_{y}$ of 99.26%, $F_{\textit{score}}$ of 96.86%, and AUC of 97.64%. Similarly, the CIHODL-BDC methodology has recognized sitting class instances with $\textit{accu}_{y}$ of 98.88%, $\textit{sens}_{y}$ of 97.05%, $\textit{spec}_{y}$ of 99.49%, $F_{\textit{score}}$ of 97.76%, and AUC of 98.27%.

Figure 8.

Performance analysis of CIHODL-BDC model under dataset-1 regarding recall and precision.

Figure 9.

Estimation of ROC curve of CIHODL-BDC approach under dataset-1.

Figure 6 shows the training accuracy (TA) and validation accuracy (VA) of the CIHODL-BDC approach on dataset-1. The CIHODL-BDC system attains high values of both, and the VA is higher than the TA.

Figure 7 shows the training loss (TL) and validation loss (VL) of the CIHODL-BDC technique on dataset-1. The analysis implies that the CIHODL-BDC algorithm achieves low values of VL and TL, and the VL is lower than the TL.

Figure 8 shows the precision-recall evaluation of the CIHODL-BDC approach on test dataset-1. The CIHODL-BDC method attains high precision-recall performance.

The ROC curve of the CIHODL-BDC technique on dataset-1 is shown in Fig. 9. It demonstrates that the CIHODL-BDC system can effectively classify the four classes of the dataset.

A comparison of the CIHODL-BDC method with existing classifiers on dataset-1 is provided in Table 2. The CIHODL-BDC model shows enhanced performance over the existing approaches. With respect to $\textit{accu}_{y}$ , the CIHODL-BDC model has attained the highest $\textit{accu}_{y}$ of 98.66%. In terms of $\textit{sens}_{y}$ , the CIHODL-BDC approach has obtained the maximal $\textit{sens}_{y}$ of 97.32%, whereas the NB, CNB, GWO-CNB, and CG-CNB algorithms have achieved reduced $\textit{sens}_{y}$ of 81%, 82%, 84%, and 85%, respectively. Also, with regard to $\textit{spec}_{y}$ , the CIHODL-BDC system has offered the highest $\textit{spec}_{y}$ of 99.11%, whereas the NB, CNB, GWO-CNB, and CG-CNB models have gained lower $\textit{spec}_{y}$ of 73%, 74%, 76%, and 77%, respectively.

Table 2 lists the comparative classification outcomes of the CIHODL-BDC approach on dataset 1.

Table 2

Comparative analysis of CIHODL-BDC approach on dataset-1

Methods	Accuracy	Sensitivity	Specificity
NB	77.00	81.00	73.00
CNB	78.00	82.00	74.00
GWO-CNB	80.00	84.00	76.00
CG-CNB	81.00	85.00	77.00
CIHODL-BDC	98.66	97.32	99.11

Figure 10.

Evaluation of CIHODL-BDC model on dataset-2 with 70% of TR data.

Figure 11.

Estimation of CIHODL-BDC technique on dataset-2 with 30% of TS data.

Figure 12.

Evaluation of TA and VA of CIHODL-BDC technique under dataset-2.

Figure 13.

Estimation of TL and VL of CIHODL-BDC model under dataset-2.

Figure 14.

Evaluation of precision-recall analysis of CIHODL-BDC model under dataset-2.

Figure 10 gives a brief analysis of the CIHODL-BDC approach on dataset 2 with 70% of TR data. It shows that the CIHODL-BDC technique achieves effective classification outcomes. For instance, the CIHODL-BDC system has recognized skin class samples with $\textit{accu}_{y}$ of 98.69%, $\textit{sens}_{y}$ of 99.03%, $\textit{spec}_{y}$ of 98.34%, $F_{\textit{score}}$ of 98.68%, and AUC of 98.69%. Likewise, the CIHODL-BDC system has identified non-skin class samples with $\textit{accu}_{y}$ of 98.69%, $\textit{sens}_{y}$ of 98.34%, $\textit{spec}_{y}$ of 99.03%, $F_{\textit{score}}$ of 98.69%, and AUC of 98.69%.

Figure 11 depicts a brief outcome analysis of the CIHODL-BDC model on dataset-2 with 30% of TS data. The CIHODL-BDC system has identified skin class instances with $\textit{accu}_{y}$ of 98.72%, $\textit{sens}_{y}$ of 99.06%, $\textit{spec}_{y}$ of 98.38%, $F_{\textit{score}}$ of 98.71%, and AUC of 98.72%. Furthermore, the CIHODL-BDC system has recognized non-skin class instances with $\textit{accu}_{y}$ of 98.72%, $\textit{sens}_{y}$ of 98.38%, $\textit{spec}_{y}$ of 99.06%, $F_{\textit{score}}$ of 98.73%, and AUC of 98.72%.

The TA and VA of the CIHODL-BDC approach on dataset-2 are visualized in Fig. 12. The CIHODL-BDC system achieves high values of VA and TA. The TL and VL obtained by the CIHODL-BDC methodology on dataset-2 are shown in Fig. 13. The analysis reveals that the CIHODL-BDC approach achieves low values of VL and TL, and the VL is lower than the TL.

The precision-recall evaluation of the CIHODL-BDC methodology on dataset-2 is presented in Fig. 14. The CIHODL-BDC system attains high precision-recall performance.

Figure 15 provides the ROC curve of the CIHODL-BDC model on test dataset 2. The analysis shows that the CIHODL-BDC algorithm is able to classify the two classes of the dataset.

Table 3

Comparison of CIHODL-BDC model with baseline algorithms under dataset-2

Methods	Accuracy	Sensitivity	Specificity
NB	76.00	81.00	71.00
CNB	77.00	82.00	72.00
GWO-CNB	78.00	83.00	73.00
CG-CNB	80.00	85.00	75.00
CIHODL-BDC	98.72	98.72	98.72

Table 4

CT analysis of CIHODL-BDC model with existing approaches

Methods	Computational time (sec)
NB	7.50
CNB	8.00
GWO-CNB	7.12
CG-CNB	6.78
CIHODL-BDC	6.12

Figure 15.

ROC curve of CIHODL-BDC technique under dataset-2.

Figure 16.

Cross-validation of the CIHODL-BDC method for dataset 1.

Figure 17.

Cross-validation of the CIHODL-BDC technique for dataset 2.

Table 5

Statistical analysis of the CIHODL-BDC technique

Friedman aligned ranks test (significance level of 0.005)
Proposed CIHODL-BDC
Comparison	Statistic	Adjusted $p$ -value	Result
WOA $+$ BRNN [11] Vs CIHODL-BDC	1.78885	0.73638	H0 is accepted
AOA [19] Vs WOA $+$ BRNN [11]	1.34164	1.00000	H0 is accepted
WGW [17] Vs CIHODL-BDC	1.34164	1.00000	H0 is accepted
AOA [19] Vs WGW [17]	0.89443	1.00000	H0 is accepted
WOA $+$ BRNN [11] Vs CGWO [20]	0.89443	1.00000	H0 is accepted
CIHODL-BDC Vs CGWO [20]	0.89443	1.00000	H0 is accepted
AOA [19] Vs CIHODL-BDC	0.44721	1.00000	H0 is accepted
WGW [17] Vs WOA $+$ BRNN [11]	0.44721	1.00000	H0 is accepted
WGW [17] Vs CGWO [20]	0.44721	1.00000	H0 is accepted
AOA [19] Vs CGWO [20]	0.44721	1.00000	H0 is accepted

A comparative analysis of the CIHODL-BDC model with several classifiers on dataset 2 is provided in Table 3. The CIHODL-BDC model outperforms the other algorithms on every measure. In terms of $\textit{accu}_{y}$, the CIHODL-BDC system offers the highest value of 98.72%, whereas the NB, CNB, GWO-CNB, and CG-CNB models obtain lower $\textit{accu}_{y}$ of 76%, 77%, 78%, and 80%, respectively. With respect to $\textit{sens}_{y}$, the CIHODL-BDC model obtains a maximal value of 98.72%, while the NB, CNB, GWO-CNB, and CG-CNB techniques obtain reduced $\textit{sens}_{y}$ of 81%, 82%, 83%, and 85%. Finally, with respect to $\textit{spec}_{y}$, the CIHODL-BDC approach attains a superior value of 98.72%, whereas the NB, CNB, GWO-CNB, and CG-CNB algorithms achieve decreased $\textit{spec}_{y}$ of 71%, 72%, 73%, and 75%.
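For reference, the three measures reported in Table 3 can be computed from the entries of a binary confusion matrix. The following is a minimal Python sketch under the standard definitions; the function name `classification_metrics` and the example counts are illustrative assumptions and are not taken from the experiments.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, sensitivity, and specificity as percentages.

    tp, fp, tn, fn are the true-positive, false-positive, true-negative,
    and false-negative counts of a binary confusion matrix.
    """
    total = tp + fp + tn + fn
    accuracy = 100.0 * (tp + tn) / total
    sensitivity = 100.0 * tp / (tp + fn)   # true-positive rate
    specificity = 100.0 * tn / (tn + fp)   # true-negative rate
    return accuracy, sensitivity, specificity


# Hypothetical counts, chosen only to illustrate the call.
accu, sens, spec = classification_metrics(tp=50, fp=10, tn=30, fn=10)
print(f"accuracy={accu:.2f}% sensitivity={sens:.2f}% specificity={spec:.2f}%")
```

On an imbalanced dataset, sensitivity and specificity are worth reporting alongside accuracy, because a classifier that favors the majority class can reach high accuracy while scoring poorly on one of the other two measures.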

Finally, a computation time (CT) comparison of the CIHODL-BDC model with existing models is given in Table 4. The CNB model shows the poorest performance, with the highest CT of 8 s. The CG-CNB model reaches a moderate CT of 6.78 s, while the CIHODL-BDC model achieves the lowest CT of 6.12 s. Overall, the analysis indicates that the CIHODL-BDC model delivers effective big data classification performance compared with the other models.
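The CT values in Table 4 can also be expressed as relative reductions. The short Python sketch below does this arithmetic on the tabulated values; the variable names are illustrative assumptions.

```python
# Computation times in seconds, copied from Table 4.
ct_seconds = {
    "NB": 7.50,
    "CNB": 8.00,
    "GWO-CNB": 7.12,
    "CG-CNB": 6.78,
    "CIHODL-BDC": 6.12,
}

proposed = ct_seconds["CIHODL-BDC"]

# Percentage by which the CT of CIHODL-BDC is lower than each baseline.
reduction_pct = {
    method: 100.0 * (ct - proposed) / ct
    for method, ct in ct_seconds.items()
    if method != "CIHODL-BDC"
}

for method, pct in reduction_pct.items():
    print(f"{method}: CIHODL-BDC is {pct:.1f}% faster")
```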

3.1 Cross-validation of the designed approach using deep structured architectures

The cross-validation of the designed approach using deep structured architectures is shown in Figs 16 and 17. In both figures, the accuracy rate of the designed approach is higher than that of NB, CNB, GWO-CNB, and CG-CNB. Most of the plotted results show comparable performance across the standard metrics. The NB model attained the lowest performance because it easily falls into a local optimum and because the dataset is unbalanced.
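The text does not state the number of folds or how the folds were drawn, so the following Python sketch only illustrates k-fold cross-validation with stratified folds, which keep the class ratio similar across folds for an imbalanced dataset. The helper names `stratified_folds` and `cross_validate`, the choice of k, the toy labels, and the `evaluate` callback are all assumptions.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Split sample indices into k folds with roughly equal class ratios."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    # Deal the indices of each class round-robin across the folds.
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def cross_validate(labels, k, evaluate):
    """Average the score returned by evaluate(train_idx, test_idx) over k folds."""
    folds = stratified_folds(labels, k)
    scores = []
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / k


# Hypothetical imbalanced labels (8 majority, 4 minority samples).
labels = [0] * 8 + [1] * 4


def majority_baseline(train_idx, test_idx):
    """Toy stand-in for a classifier: always predict the majority training class."""
    train_labels = [labels[i] for i in train_idx]
    majority = max(set(train_labels), key=train_labels.count)
    return sum(labels[i] == majority for i in test_idx) / len(test_idx)


mean_accuracy = cross_validate(labels, k=4, evaluate=majority_baseline)
print(f"mean accuracy over folds: {mean_accuracy:.3f}")
```

In practice, the `evaluate` callback would train the classifier on the training indices and score it on the held-out fold; the majority-class baseline is used here only to keep the sketch self-contained.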

3.2 Statistical evaluation of the designed approach using baseline approaches

The statistical analysis of the CIHODL-BDC technique is shown in Table 5. Statistical significance is assessed in two phases: ranking and post-hoc multiple comparison. First, the ranking procedure is carried out; second, the post-hoc multiple comparisons are evaluated. The significance level is taken as 0.005, on the reasoning that a result must have a 5% or smaller probability of occurring under the null hypothesis to be statistically significant. From the empirical result, the output of the recommended model attains elevated performance.
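The adjusted p-values in Table 5 can be related to the test statistics with a short calculation. The Python sketch below assumes that each statistic is a z-score under a normal approximation and that a Bonferroni-type correction over the 10 pairwise comparisons is applied. The text does not state the adjustment procedure, so both are assumptions, as is the function name `adjusted_p_value`.

```python
import math

ALPHA = 0.005          # significance level given with Table 5
NUM_COMPARISONS = 10   # number of pairwise comparisons listed in Table 5


def adjusted_p_value(z, num_comparisons=NUM_COMPARISONS):
    """Two-sided normal p-value with a Bonferroni correction (assumed)."""
    p = math.erfc(abs(z) / math.sqrt(2.0))   # equals 2 * (1 - Phi(|z|))
    return min(1.0, num_comparisons * p)


# Distinct statistics appearing in Table 5.
for z in (1.78885, 1.34164, 0.89443, 0.44721):
    p_adj = adjusted_p_value(z)
    decision = "rejected" if p_adj < ALPHA else "accepted"
    print(f"z={z:.5f}  adjusted p={p_adj:.5f}  H0 is {decision}")
```

Under these assumptions, the largest statistic yields an adjusted p-value of about 0.736 and the remaining values are clipped to 1, so every comparison stays above the significance level and H0 is accepted in each row.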

4. Conclusion

A novel CIHODL-BDC model was introduced to classify big data. The Hadoop MapReduce framework is used to handle the large volume of data. The CIHODL-BDC model follows a series of sub-processes: data wrangling, SMOTE-based class imbalance data handling, GWO-based parameter tuning, BiLSTM-based classification, and Adam-based hyperparameter optimization. The proposed CIHODL-BDC technique was evaluated on two standard datasets, and a comparative study showed its efficacy over existing conventional techniques. The CIHODL-BDC technique can therefore serve as an effective tool for big data classification. However, effective outlier detection mechanisms still need to be developed to further improve classification performance, and this will be addressed in future work. Meta-learning approaches for automatic selection, which can support better feature selection, also require investigation. Future work will include meta-learning models for the automated selection of instance types for data sampling of each class.

References

1. Leevy, Khoshgoftaar, Bauder, Seliya. A survey on addressing high-class imbalance in big data. Journal of Big Data. 2018; 5(1): 1-30.

2. Johnson, Khoshgoftaar. Survey on deep learning with class imbalance. Journal of Big Data. 2019; 6(1): 1-54.

3. Waheed, Hassan, Aljohani, Hardman, Alelyani, Nawaz. Predicting academic performance of students from VLE big data using deep learning models. Computers in Human Behavior. 2020; 104: 106189.

4. Hannun, Rajpurkar, Haghpanahi, Tison, Bourn, Turakhia. Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network. Nature Medicine. 2019; 25(1): 65-69.

5. Kaur, Pannu, Malhi. A systematic review on imbalanced data challenges in machine learning: Applications and solutions. ACM Computing Surveys (CSUR). 2019; 52(4): 1-36.

6. Das, Datta, Chaudhuri. Handling data irregularities in classification: Foundations, trends, and future challenges. Pattern Recognition. 2018; 81: 674-693.

7. Unbalanced Big Data-Compatible Cloud Storage Method Based on Redundancy Elimination Technology. Scientific Programming. 2022.

8. Sripriya Akondi, Menon, Baudry, Whittle. Novel Big Data-Driven Machine Learning Models for Drug Discovery Application. Molecules. 2022; 27(3): 594.

9. Vuttipittayamongkol, Elyan, Petrovski. On the class overlap problem in imbalanced data classification. Knowledge-Based Systems. 2021; 212: 106631.

10. Oviedo, Ren, Sun, Settens, Liu, Hartono NTP, Ramasamy, DeCost, Tian, Romano, Gilad Kusne. Fast and interpretable classification of small X-ray diffraction datasets using data augmentation and deep neural networks. npj Computational Materials. 2019; 5(1): 1-9.

11. Hassib, El-Desouky, Labib, El-kenawy ESM. WOA + BRNN: An imbalanced big data classification framework using Whale optimization and deep neural network. Soft Computing. 2020; 24(8): 5573-5592.

12. Taherkhani, Cosma, McGinnity. AdaBoost-CNN: An adaptive boosting algorithm for convolutional neural networks to classify multi-class imbalanced datasets using transfer learning. Neurocomputing. 2020; 404: 351-366.

13. Thabtah, Hammoud, Kamalov, Gonsalves. Data imbalance in classification: Experimental evaluation. Information Sciences. 2020; 513: 429-441.

14. Sleeman WC, Krawczyk. Multi-class imbalanced big data classification on spark. Knowledge-Based Systems. 2021; 212: 106598.

15. Kamal, Ripon, Dey, Ashour, Santhi. A MapReduce approach to diminish imbalance parameters for big deoxyribonucleic acid dataset. Computer Methods and Programs in Biomedicine. 2016; 131: 191-206.

16. Abouzeid, Bajda-Pawlikowski, Abadi, Silberschatz, Rasin. HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. Proceedings of the VLDB Endowment. 2009; 2(1): 922-933.

17. Masadeh, Hudaib, Alzaqebah. WGW: A hybrid approach based on whale and grey wolf optimization algorithms for requirements prioritization. Advances in Systems Science and Applications. 2018; 18(2): 63-83.

18. Liu, Wang, Zhao, Yuan, Zhang. Automatic cardiac arrhythmia classification using combination of deep residual network and bidirectional LSTM. IEEE Access. 2019; 7: 102119-102135.

19. Nanni, Maguolo, Lumini. Exploiting Adam-like Optimization Algorithms to Improve the Performance of Convolutional Neural Networks. 2021. arXiv preprint arXiv:2103.14689.

20. Banchhor, Srinivasu. Integrating Cuckoo search-Grey wolf optimization and Correlative Naive Bayes classifier with Map Reduce model for big data classification. Data & Knowledge Engineering. 2020; 127: 101788.