Abstract
With technical advances, the volume of big data is increasing day by day, to the point that traditional software tools struggle to handle it. In addition, the presence of imbalanced data within big data is a major concern for the research community. To ensure effective management of big data and to deal with imbalanced data, this paper proposes a new optimization algorithm. Big data classification is performed using the MapReduce framework, wherein the map and reduce functions are based on the proposed optimization algorithm. The optimization algorithm, named the Exponential Bat algorithm (E-Bat), integrates the Exponential Weighted Moving Average (EWMA) with the Bat Algorithm (BA). The map function selects the features, which are then presented to the reducer module for classification using a Neural Network (NN). Thus, big data classification is performed using the proposed E-Bat algorithm-based MapReduce framework, and the experimentation is performed using four standard datasets: Breast cancer, Hepatitis, Pima Indian diabetes, and Heart disease. The experimental results show that the proposed method achieved a maximal accuracy of 0.8829 and a maximal True Positive Rate (TPR) of 0.9090.
Introduction
With the growth of science and technology, the amount of big data has grown steadily over the last few decades, attracting researchers in both academia and industry [12]. Technical and networking advancements in mobile computing, the Internet of Things, and social networks generate a huge amount of digital information. Big data is the term that describes this massive collection of data; it refers to large and complex datasets that may be either structured or unstructured. These data are generated daily and need to be analyzed in little time. Big data is characterized by three V’s: Volume, Variety, and Velocity [10]. At present, big data plays a dominant role in multiple environments, such as public administration, business organizations, scientific research, healthcare, the social networking industry, and natural resource management. Traditional database systems are incapable of storing this huge mass of data [8]. Massive amounts of information are collected using contemporary systems, and such collection is omnipresent, since most research activities involve gathering large amounts of data. For example, the Large Hadron Collider experiments are capable of producing 30 petabytes of information in a year [5].
Classification is a significant step in data mining; for example, it assists in detecting spam from a user's tweets [16]. Classification is the process of labeling all the instances of the data received from an individual user such that the collected data resemble the entire data with different features [17, 7]. In the last few years, the storage of massive collections of data has been handled using new database management and distributed computing technologies. The MapReduce [14] framework is a distributed programming model that consists of mappers and reducers for transforming the input data to cope with growing big data, and it paved the way for distributed processing engines, such as Apache Hadoop, Spark, and Flink [15], that extend MapReduce [9, 12]. Feature extraction is a significant step that ensures relevant features are selected in order to achieve better classification accuracy. The feature selection strategy plays a major role in removing irrelevant and redundant features, so that the training time is also minimized [7]. In supervised classification, knowledge extraction is enabled by a learning algorithm that classifies future input instances or patterns. A significant machine learning technique is Fuzzy Rule-Based Classification Systems (FRBCSs) [4].
Machine learning approaches play a major role in analyzing big data, and the research community is motivated by an existing simple and effective method, the K-Nearest Neighbor (KNN) algorithm. In the past few years, research communities have explored widely how to meet the rising demands of big spatial databases and data mining [9]. In [12], FRBCSs are employed with MapReduce to enable big data classification. For datasets of moderate size, the Support Vector Machine (SVM) is employed for regression and classification problems in multiple areas, such as genomics, computer vision, e-commerce, and cyber security, because of its strong generalization capability. However, using SVM to classify huge volumes of data available in text, image, or video format into meaningful classes is highly challenging [13, 11]. Various other classifiers, such as Random forest, Bayes network, Decision tree, KNN, and SVM, are employed for classifying data [7]. In addition to these methods, linear methods, like Kappa statistics [18] and K-means [19], and non-linear classifiers, like SVM, are widely employed [1].
The main aim of this research is to establish a big data classification model using an optimization algorithm. Big data classification is carried out using the MapReduce framework, which uses the proposed optimization algorithm, named the Exponential Bat (E-Bat) algorithm. The proposed algorithm is obtained by integrating EWMA with BA. The big data obtained from distributed sources is fed to the mapper phase, which performs feature selection using the E-Bat algorithm. Effective feature selection supports the classification of the data, so that the classification accuracy is enhanced. The selected features are fed to the reducer for data classification, and the reducer uses a Neural Network (NN), trained using the proposed E-Bat algorithm, to classify the data into various classes.
The main contribution of the paper is:
E-BatNN for big data classification: The proposed E-BatNN integrates EWMA into the update rule of the Bat algorithm, both for training the NN and for selecting the relevant features from big data, so that the classification accuracy is enhanced.
The paper is organized as follows: The background of big data classification is presented in Section 1, and Section 2 presents the literature review of existing big data classification methods. In Section 3, the proposed method of big data classification is presented, and Section 4 discusses the results of the proposed method. Finally, Section 5 concludes the paper.
Motivation
Literature survey
This section presents the literature review of big data classification methods, as listed below.
Ke et al. [1] developed a method for big data classification using a lightweight VGGNet that avoids the potential errors caused by excessive parameter settings and serves as a cost-effective solution. The demerit is that the method neglected the use of deep learning and optimal partitioning. Ezatpoor et al. [2] developed the MapReduce Enhanced Bitmap Index Guided (MRBIG) algorithm, which offers fast processing time, but data classification for incomplete data was unaddressed, and the method was suitable only for data environments with a moderate missing rate. Elkano et al. [3] developed a distributed MapReduce prototype generation method, CHI-PG, that improved the execution time without compromising accuracy and reduction rates, but failed to provide better results in high-dimensional environments. Elkano et al. [4] developed a Fuzzy Rule-Based Classification System (FRBCS) that did not address the problem of sizeup. The demerit is that the algorithm showed a linear relation in execution time and scaleup. Ramírez-Gallego et al. [5] developed a nearest neighbor classification method that handled the high-dimensional scenario, but drift changes in the data were neglected at the time of classification. Zhai et al. [6] modelled a fuzzy integral-based ELM ensemble that had a simple structure and was easy to implement. Its shortcoming is that it is not suitable for multi-class classification of imbalanced data. Murugan and Devi [7] developed an LR-PCA hybridization method that achieved increased detection rate and accuracy, but it was inefficient on large datasets and neglected the effects of noise in the data. Varatharajan et al. [8] developed a method combining LDA with an enhanced SVM that used reduced data for classification with better accuracy, but its performance was poor in large data environments.
Challenges
The challenges of big data classification algorithms are listed below:
Most of the existing classification methods using SVM suffer from poor accuracy, mainly because smaller partitions of the overall data distribution produce locally separating hyperplanes that conflict with the global separating hyperplane [11]. Classification performance is affected by the irrelevant and redundant features present in the large search space, a problem termed “the curse of dimensionality” [7]. K-means suffers from local optimum issues and is highly sensitive to noise and outliers. SVM-based classification suffers from several issues regarding the selection of the kernel function and the rejection of the space information from the synchronization patterns [1]. Standard learning techniques are used in distributed environments, but they fail to solve classification problems in little time [4].
The need to solve multimodal optimization objectives with highly complex and non-linear constraints pushes researchers to develop better optimization algorithms that assure globally optimal solutions without conflicting constraints. Metaheuristics pave the way for multi-objective problems, which never conclude with a single best solution; instead, metaheuristics generate a set of solutions for a better approximation. Moreover, most of the algorithms developed based on metaheuristics are suited to single-objective optimization rather than multi-objective optimization, and these existing algorithms convert multiple objectives into a single objective with the help of weights. In addition, generating solutions with better diversity is another challenge faced by existing metaheuristics. Real-world issues, like uncertainty and noise, should not affect the algorithm: it should be robust to inhomogeneity and should offer decision-makers a good basis for effective decision-making. Thus, metaheuristic algorithms contribute much to multi-objective global optimization. Keeping all this in mind, a novel metaheuristic search algorithm, called the E-Bat algorithm, is developed. The proposed E-Bat algorithm is the integration of EWMA [21] with BA [20]. This section presents the proposed E-Bat algorithm along with its algorithmic steps, and Fig. 1 shows the pseudo code of the proposed E-Bat algorithm.
Fig. 1. Pseudo code of the proposed E-Bat algorithm.
BA [20] is based on the echolocation behavior of bats, which use SONAR-like echolocation to detect prey, avoid obstacles, and locate objects in the dark. Bats use the time delay between the emission of a pulse and the received echo, together with changes in loudness, to sense their surroundings. Bats search for prey based on variations in their position and velocity, and they are able to adjust the pulse rate and wavelength of the emitted pulses. BA exhibits good convergence in the initial stage of the algorithm and can switch between the exploration and exploitation phases as it approaches the optimal location. This automatic switching is due to the changes in loudness and pulse emission rate that drive the search towards the global solution. BA has the advantages of swarm-optimization algorithms and is capable of dealing with multi-modal optimization problems. BA follows a frequency-tuning strategy that improves the diversity of the solutions, and it is capable of tackling non-linear problems, leading to accurate optimal solutions. The update rule of BA is given as,
where,
where,
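For reference, the standard BA move as it is commonly stated in the literature can be sketched as follows. This is a minimal sketch of the textbook form (frequency tuning, then velocity update, then position update) and may not match the notation of the equations referenced above. The function name `bat_update` and the defaults `f_min` and `f_max` are illustrative assumptions.

```python
import random


def bat_update(position, velocity, best, f_min=0.0, f_max=2.0, rng=random):
    """One standard Bat Algorithm move for a single bat.

    position, velocity, best: lists of floats with the same length.
    f_min, f_max: assumed frequency range (illustrative defaults).
    Returns the new position, the new velocity, and the sampled frequency.
    """
    beta = rng.random()                        # beta drawn uniformly from [0, 1)
    freq = f_min + (f_max - f_min) * beta      # frequency tuning
    # Velocity is adjusted by the offset of the bat from the best position.
    new_velocity = [v + (x - b) * freq
                    for v, x, b in zip(velocity, position, best)]
    # Position moves along the new velocity.
    new_position = [x + v for x, v in zip(position, new_velocity)]
    return new_position, new_velocity, freq


# Usage: a bat that already sits on the best position with zero velocity
# does not move, whatever frequency is sampled.
pos, vel, freq = bat_update([1.0, 2.0], [0.0, 0.0], [1.0, 2.0])
```

In the full algorithm, this global move is typically followed by a local random walk around the best solution and by updates of the loudness and pulse emission rate, which are omitted here.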
EWMA [21] is a monitoring process that averages the data so that a data point receives less weight the further it recedes in time. EWMA offers a smooth method of averaging in which the influence of a data point diminishes with time, meaning that the weights applied to the data decay exponentially. In EWMA, it is essential to know the current estimate of the variance rate, which governs the volatility of the data. The equation using EWMA is given as,
where,
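The exponential decay of the weights can be illustrated with the usual EWMA recursion, in which each new smoothed value is a blend of the current observation and the previous smoothed value. This is a minimal sketch of that general recursion and does not reproduce the paper's own notation. The function name `ewma` and the smoothing constant `alpha` are illustrative assumptions.

```python
def ewma(values, alpha):
    """Exponentially weighted moving average of a sequence.

    alpha in (0, 1] is the weight given to the newest observation; the
    weight of an observation k steps in the past decays as (1 - alpha)**k.
    The first smoothed value is initialised to the first observation.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    smoothed = []
    for x in values:
        if not smoothed:
            smoothed.append(float(x))
        else:
            smoothed.append(alpha * x + (1.0 - alpha) * smoothed[-1])
    return smoothed


# Usage: with alpha = 0.5 each step averages the new value with the
# previous smoothed value.
print(ewma([2, 4, 8], 0.5))  # [2.0, 3.0, 5.5]
```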
The proposed update rule is obtained by integrating the EWMA concept [21] into the update rule of the standard Bat algorithm [20]. The proposed algorithm inherits the advantages of EWMA and BA in finding optimal solutions. The convergence of the proposed algorithm is enhanced because the position update of a bat is based on its position in the previous iteration as well as the best position obtained so far. The effectiveness of the solutions is thereby enhanced, yielding a global optimal solution through an effective balance between the exploration and exploitation phases.
The algorithmic steps of the proposed E-Bat algorithm are depicted in this section along with the pseudo code of the algorithm.
Initialization
The population of bats (solutions) in the search space is initialized as,
where,
The fitness of each solution is evaluated using the formula shown in Eq. (17), and the solution that attains the maximum fitness is selected as the best solution.
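The initialization and best-solution selection steps can be sketched as below. This is a minimal sketch that assumes a uniform random initialization inside fixed bounds and uses a toy fitness in place of Eq. (17). The names `init_population`, `select_best`, and `toy_fitness` are hypothetical.

```python
import random


def init_population(n_bats, dim, lower, upper, seed=None):
    """Create n_bats random solutions, each a list of dim floats in [lower, upper]."""
    rng = random.Random(seed)
    return [[rng.uniform(lower, upper) for _ in range(dim)]
            for _ in range(n_bats)]


def select_best(population, fitness):
    """Return the solution with the maximum fitness (maximization problem)."""
    return max(population, key=fitness)


def toy_fitness(solution):
    """Stand-in fitness (an assumption): highest at the origin."""
    return -sum(x * x for x in solution)


# Usage: initialize five bats in three dimensions and pick the fittest one.
bats = init_population(n_bats=5, dim=3, lower=-1.0, upper=1.0, seed=0)
best = select_best(bats, toy_fitness)
```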
Update the solution based on the proposed E-Bat algorithm
The solutions are updated after the computation of the fitness, and the solution update follows Eq. (9), which exhibits the combined effect of BA and EWMA. In the position update step, there are two conditions: the first condition is based on a random number that is compared with the pulse emission rate of the bat, such that if the random number exceeds the pulse emission rate of
where,
The solutions (positions of the bats) are ranked based on fitness, and the highest-ranked solution forms the best solution,
Terminate
The iterations continue up to the maximum number of iterations, and the algorithm terminates upon generating the global optimal solution.
Big data classification using the MapReduce framework based on the proposed E-Bat algorithm
Big data classification is the need of the hour to enable effective and uncomplicated analysis of big data arriving from distributed sources. The literature presents many algorithms for dealing with big data, but these algorithms suffer from processing complexity and have difficulty with data that have missing attributes or newly added attributes. To minimize the computational time and manage the distributed data, this research uses the MapReduce framework, which addresses the complexity and computational issues. Big data from distributed sources is managed using the MapReduce framework, and the computational burden is reduced through an effective feature selection strategy that minimizes the dimension of the features. The proposed method supports classification accuracy through the effective parallelism of the servers that process the subsets of big data. The two major functions of MapReduce are the map function and the reduce function, which map the input data to relevant patterns and reduce the intermediate data of the mappers to obtain the desired output, respectively. Figure 2 shows the block diagram of the proposed method of big data classification.
Fig. 2. Block diagram of the proposed method of big data classification.
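The split, map, and reduce flow described above can be sketched in a single process as follows. This is a minimal sketch that uses a thread pool to stand in for the parallel mapper servers; it is not a Hadoop job and not the authors' Java implementation. The names `split_into_subsets`, `run_mapreduce`, `keep_two`, and `row_sums` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_subsets(rows, n_mappers):
    """Split the input rows into at most n_mappers contiguous subsets."""
    if n_mappers < 1:
        raise ValueError("n_mappers must be at least 1")
    size = max(1, -(-len(rows) // n_mappers))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def run_mapreduce(rows, n_mappers, map_fn, reduce_fn):
    """Apply map_fn to every subset in parallel, then reduce_fn to the results."""
    subsets = split_into_subsets(rows, n_mappers)
    with ThreadPoolExecutor(max_workers=n_mappers) as pool:
        # pool.map preserves the order of the subsets.
        intermediate = list(pool.map(map_fn, subsets))
    return reduce_fn(intermediate)


def keep_two(subset):
    """Toy mapper (an assumption): keep the first two attributes of each row."""
    return [row[:2] for row in subset]


def row_sums(parts):
    """Toy reducer (an assumption): merge the mapper outputs and sum each row."""
    return [sum(row) for part in parts for row in part]


# Usage: five rows handled by two mappers.
data = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15]]
result = run_mapreduce(data, 2, keep_two, row_sums)
```

In the proposed method, the mapper role would be played by E-Bat feature selection and the reducer role by the E-Bat-trained NN.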
The MapReduce Framework (MRF) exhibits high processing power because the mapper phase contains a number of servers that can operate in parallel. MRF improves the flexibility and the processing time of big data: the input big data is split into several subsets, and each mapper takes one subset and processes it to generate the desired output. Let us consider the input big data denoted as
where,
where,
Thus, the input to the
where,
The solution encoding gives a pictorial representation of the solution obtained using the proposed algorithm. In the case of feature selection, the solution vector is a vector of selected features, and in the case of big data classification, the solution vector is the set of optimal cluster centroids.
Fitness function
The fitness function is based on two factors, accuracy and the number of features, and it is formulated as a maximization problem.
where,
where, TP indicates true positive, TN is true negative, FP specifies false positive, and FN indicates false negative.
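One way to combine the two factors into a single value to maximize is sketched below. This is a hypothetical form, not the paper's fitness equation, which is not reproduced here: it assumes an equally weighted sum of the accuracy and the fraction of features left out. The function name `fitness` and the parameter `weight` are illustrative assumptions.

```python
def fitness(accuracy, n_selected, n_total, weight=0.5):
    """Hypothetical fitness to maximize: reward accuracy and small feature sets.

    accuracy: classification accuracy in [0, 1].
    n_selected: number of selected features.
    n_total: number of available features.
    weight: assumed trade-off between accuracy and compactness.
    """
    if n_total <= 0 or not 0 <= n_selected <= n_total:
        raise ValueError("need 0 <= n_selected <= n_total and n_total > 0")
    compactness = 1.0 - n_selected / n_total   # 1.0 when no feature is used
    return weight * accuracy + (1.0 - weight) * compactness


# Usage: at equal accuracy, the solution with fewer features scores higher.
small = fitness(0.9, 3, 10)
large = fitness(0.9, 8, 10)
```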
The features selected using the proposed E-Bat algorithm are given in Eq. (20). Feature selection in the mapper phase supports dimensionality reduction, such that selecting the highly significant features ensures effective classification of big data. The feature selection is encoded in binary form, and a value of ‘1’ for an attribute of a data point indicates that the feature is selected. The selected features are given as,
where,
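The binary encoding of the selected features can be illustrated as follows. This is a minimal sketch that assumes the data are held as rows of attribute values and that the mask has one bit per attribute. The names `selected_indices` and `apply_feature_mask` are hypothetical.

```python
def selected_indices(mask):
    """Indices of the attributes whose mask bit is 1."""
    return [i for i, bit in enumerate(mask) if bit == 1]


def apply_feature_mask(rows, mask):
    """Keep only the attributes marked with 1 in the binary mask."""
    keep = selected_indices(mask)
    for row in rows:
        if len(row) != len(mask):
            raise ValueError("each row needs one value per mask bit")
    return [[row[i] for i in keep] for row in rows]


# Usage: a mask of [1, 0, 1, 0] keeps the first and third attributes.
rows = [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2]]
reduced = apply_feature_mask(rows, [1, 0, 1, 0])
```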
Big data classification is carried out in the reducer phase using NN, based on the intermediate data obtained from the feature selection method in the mapper phase. NN is highly fault tolerant, is able to deal with noisy data, and is capable of handling complex patterns. The E-Bat-based NN is employed in the reducers to form the required cluster centroids, and the optimal selection of the centroids ensures effective extraction of patterns from big data, reducing the time associated with processing. The total number of reducers in the reducer phase is given as,
where,
NN follows the basic concept of neurons in the human brain and consists of an input layer, a hidden layer, and an output layer. The architecture of NN is an interconnection of neurons through activation functions. In the input layer, the inputs are processed using the parameters of the neurons, namely their weights and biases, to yield the output. The input to the NN is the intermediate data
where,
where,
The above sigmoid function is multiplied with the weights of the hidden layer as,
The final output of NN is given as,
The training of NN is performed using the proposed E-Bat optimization algorithm, which aims to determine the optimal weights that tune the NN for classifying big data. The optimal weights derived from the proposed algorithm tune the NN for deriving the optimal centroids. Big data classification using the proposed E-Bat-based NN is effective in classifying the data by deriving the optimal centroids and is capable of dealing with new data attributes arriving from distributed sources.
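The forward pass of the three-layer NN described in this section can be sketched as below. This is a minimal sketch that assumes a sigmoid hidden layer followed by a linear weighted output; the layer sizes, the linear output, and the names `sigmoid` and `forward` are illustrative assumptions. An optimizer such as E-Bat would search over the weight and bias values passed to `forward`.

```python
import math


def sigmoid(z):
    """Logistic activation."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass: input -> sigmoid hidden layer -> linear output layer.

    x: list of input features (the selected features of one data point).
    w_hidden: one weight list per hidden neuron; b_hidden: one bias each.
    w_out: one weight list per output neuron; b_out: one bias each.
    """
    hidden = [sigmoid(sum(w * xi for w, xi in zip(weights, x)) + b)
              for weights, b in zip(w_hidden, b_hidden)]
    return [sum(w * h for w, h in zip(weights, hidden)) + b
            for weights, b in zip(w_out, b_out)]


# Usage: with zero hidden weights, every hidden neuron outputs 0.5.
out = forward([1.0, 2.0],
              w_hidden=[[0.0, 0.0], [0.0, 0.0]], b_hidden=[0.0, 0.0],
              w_out=[[1.0, 1.0]], b_out=[0.0])
```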
Results and discussion
This section describes the results of the proposed method of big data classification and discusses the comparative analysis.
Experimental setup and database used
The proposed method is implemented in Java and run on a PC with the Windows 8 operating system. The experimentation is performed using four standard datasets taken from the UCI machine learning repository: Breast cancer [25], Hepatitis [24], Pima Indian diabetes [22], and Heart disease [23].
Breast cancer dataset
The breast cancer dataset consists of 9 attributes of categorical characteristics and a total of 286 instances.
Heart disease dataset
The heart disease database consists of four databases: Cleveland, Hungary, Switzerland, and VA Long Beach. Among the four available databases, the Cleveland database is employed for the experimentation. The characteristics of the database and attributes are described as multivariate, categorical, real, and integer. There are a total of 303 instances and 75 attributes.
Hepatitis dataset
The Hepatitis dataset taken from the UCI machine learning repository consists of 155 instances and 19 attributes, with categorical, real, and integer characteristics. The characteristics of the data set are multivariate.
Pima Indian diabetes dataset
The Pima Indian diabetes database consists of 768 instances and 8 attributes plus the class attribute.
Competing methods
The proposed method of big data classification is compared with the existing methods, like Fuzzy [4], KNN [5], SVM [8], and Bat
Performance metrics
The metrics employed for the analysis are accuracy and True Positive Rate (TPR), and the formulas for computing the metrics are given below.
Fig. 3. Analysis using breast cancer dataset based on a) accuracy b) TPR.
Accuracy measures the proportion of correctly classified instances and is computed as,
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP denotes the true positives, TN the true negatives, FP the false positives, and FN the false negatives.
TPR is the ratio of true positives to the total number of actual positives in the data and is given as,
$$\text{TPR} = \frac{TP}{TP + FN}$$
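Both metrics follow directly from the confusion counts: accuracy is (TP + TN) divided by all instances, and TPR is TP divided by (TP + FN). The sketch below computes them for binary labels; the function names and the convention that label 1 is the positive class are illustrative assumptions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, and FN for binary labels."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of all instances that are classified correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def true_positive_rate(tp, fn):
    """Fraction of actual positives that are classified as positive."""
    actual_positives = tp + fn
    return tp / actual_positives if actual_positives else 0.0


# Usage: eight instances, of which six are classified correctly.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(accuracy(tp, tn, fp, fn), true_positive_rate(tp, fn))  # 0.75 0.75
```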
Fig. 4. Analysis using Cleveland dataset based on a) accuracy b) TPR.
This section discusses the performance of the proposed method of big data classification based on the performance metrics.
Fig. 5. Analysis using Hepatitis dataset based on a) accuracy b) TPR.
This section discusses the analysis of the big data classification methods using the breast cancer dataset, as shown in Fig. 3. The accuracy of the methods is depicted in Fig. 3a), and the discussion of accuracy is based on varying percentages of the training data. Initially, for 50% training data, the accuracy of the methods remained high. However, on further increasing the percentage of training data, the accuracy of the methods decreases. Even though the accuracy decreases as the training percentage increases, the proposed EBatNN method achieved better accuracy than the existing methods. The accuracy of the methods, like fuzzy, KNN, SVM, BAT
Using Cleveland dataset
This section discusses the analysis of the big data classification methods using the Cleveland dataset, as shown in Fig. 4. The accuracy of the methods is depicted in Fig. 4a) based on varying percentages of the training data. Initially, for 50% training data, the accuracy of the methods remained high. However, on further increasing the percentage of training data, the accuracy of the methods decreases. Even though the accuracy decreases as the training percentage increases, the proposed EBatNN method achieved better accuracy than the existing methods. The accuracy of the methods, fuzzy, KNN, SVM, BAT
Comparative discussion
Fig. 6. Analysis using Pima Indian dataset based on a) accuracy b) TPR.
This section discusses the analysis of the big data classification methods using the Hepatitis dataset, as shown in Fig. 5. The accuracy of the comparative methods is depicted in Fig. 5a); initially, for 50% training data, the accuracy remained high. However, on further increasing the percentage of training data, the accuracy of the methods decreases. Even though the accuracy decreases as the training percentage increases, the proposed EBatNN method achieved better accuracy than the existing methods. The accuracy of the methods, fuzzy, KNN, SVM, BAT
Using Pima Indian dataset
This section discusses the analysis of the big data classification methods using the Pima Indian dataset based on accuracy and TPR, as illustrated in Fig. 6. The accuracy attained by the comparative methods is depicted in Fig. 6a), and the discussion of accuracy is based on the percentage of training data. Initially, for 50% training data, the accuracy of the methods remained high. The accuracy decreases as the training percentage increases, and the proposed EBatNN method achieved better accuracy than the existing methods. The accuracy of fuzzy, KNN, SVM, BAT
Comparative discussion
Table 1 presents the accuracy and TPR of the proposed EBatNN method and the existing methods. The accuracy of the methods, fuzzy, KNN, SVM, BAT
Conclusion
This paper presented a big data classification method aimed at meeting the rising demands of high volume, high velocity, high value, high veracity, and huge variety. Big data classification is performed using the MapReduce framework, such that the data from distributed sources is handled in parallel. The big data is analyzed by the MapReduce framework to yield the classified results, and the processing has two steps. The first step is feature extraction, which extracts the optimal features from the data using the proposed E-Bat algorithm in the mappers. The second step is classification, which is performed in the reducers using the NN. The weights of the NN are optimally tuned using the proposed EBatNN algorithm. The final output of the MapReduce framework is the classified big data, which forms the clusters for the whole big data. The experimentation is performed using four standard datasets taken from the UCI machine learning repository. The analysis of the classification methods confirms that the proposed method outperformed the existing methods, with an accuracy of 0.8829 and a TPR of 0.9090.
