Optimal Feature Selection for Big Data Classification: Firefly with Lion-Assisted Model

Abstract

This article develops a big data classification model with the aid of intelligent techniques. The Parallel Pool MapReduce framework is used for handling big data. The model involves three main phases: (1) feature extraction, (2) optimal feature selection, and (3) classification. For feature extraction, well-known techniques such as principal component analysis, linear discriminant analysis, and linear square regression are used. Since the feature vector tends to be long, choosing the optimal features is a complex task. Hence, the proposed model uses an optimal feature selection technique, referred to as the Lion-based Firefly (L-FF) algorithm, to select the optimal features. The main objective is to minimize the correlation between the selected features, so that they provide diverse information about the different classes of data. Once the optimal features are selected, a neural network (NN) classifier is adopted to classify the data with the selected features. Furthermore, the proposed L-FF+NN model is compared with traditional methods to demonstrate its effectiveness. Experimental analysis shows that the proposed L-FF+NN model is 92%, 28%, 87%, 82%, and 78% superior to state-of-the-art models such as GA+NN, FF+NN, PSO+NN, ABC+NN, and LA+NN, respectively.

Introduction

“Big Data” technology is used to overcome the challenges posed by the explosion in data volume. Modern decision support systems treat big data as a core component for storing huge volumes of data.^1,2 Because it is difficult for a decision-maker to read all these data sources, the collected data are classified under certain conditions. Big data is usually characterized by the “3V”: volume, velocity, and variety.^3,4 Big data users aspire to classify data automatically or semiautomatically. Furthermore, big data is also used in forecasting approaches, and thus has to deal with structured, unstructured, and semi-structured data.^5–7 This kind of classification is not possible with traditional classifiers, which paved the way for feature extraction-based classification of data. From the original data set,^8,9 the most relevant information is obtained and recorded in a lower-dimensional space. When a massive amount of data is fed as input to an algorithm, much of it is redundant, so the large volume of input data is converted into a reduced representation set of features, referred to as feature vectors.^10–12 Pattern recognition performs the task of correctly assigning a given input to one of the possible output classes. The pattern recognition scheme is divided into two main stages, namely feature selection and classification.^13,14

Feature selection is essential in the feature extraction phase, since the classifier cannot recognize patterns from poorly selected features. Before selecting a feature, it is essential to distinguish between classes based on specific criteria, which removes insensitive or irrelevant information.^15,16 The correlation between different data features plays a significant role in classification: if the correlation between the data features is low, the selection of the features becomes accurate. During data classification, the data are grouped under predefined classes.^17,18 The classification problem generally arises because a vast volume of data cannot feasibly be integrated in a reasonable running time.^19,20 This suggests that classification in the context of big data is a critical task. Supervised classification gathers knowledge from the data set with the intention of classifying future input patterns. The Fuzzy Rule-Based Classification System is one of the most renowned machine learning techniques.^21,22 These tools have been used in various applications such as bioinformatics, medical problems, anomaly intrusion detection, financial applications, and image processing.^23,24 However, standard learning techniques cannot cope with classification problems in a distributed environment within a reasonable time.^25,26 Therefore, these techniques need to be adapted or redesigned so that they can be applied to the distributed environment. This article classifies big data through optimal feature selection after handling the big data with the Parallel Pool MapReduce framework.

The contributions of this article are summarized as follows:

The first contribution deals with the extraction of the features with the aid of feature extraction techniques such as principal component analysis (PCA), linear discriminant analysis (LDA), and linear square regression (LSR).

In the second contribution, the Firefly (FF) algorithm is interlinked with the Lion Algorithm (LA) to form a novel algorithm termed as Lion-based Firefly (L-FF) algorithm, and this hybridized algorithm helps in selecting the optimal features. The optimal features are selected in such a way that the correlation between the selected features is minimal.

The third contribution focuses on classifying the data with the selected features. A neural network (NN) is used for this classification.

The performance of the proposed L-FF+NN method is compared with existing models such as the Genetic Algorithm-based Neural Network (GA+NN), Firefly-based Neural Network (FF+NN), and Artificial Bee Colony-based Neural Network (ABC+NN), in terms of accuracy, sensitivity, specificity, precision, false-positive rate (FPR), false-negative rate (FNR), negative predictive value (NPV), false discovery rate (FDR), F1 score, and Matthews correlation coefficient (MCC).

The rest of the article is organized as follows: The Literature Review section reviews the literature on big data feature extraction. The Architectural Representation of L-FF+NN-Based Big Data Classification Model section describes the architectural framework of the proposed L-FF+NN big data classification. The Optimal Feature Selection: A Hybrid Optimization Algorithm section presents optimal feature selection through the hybridization of LA with the FF algorithm. The Classification of Features: NN Model section describes the NN classifier, the Results and Discussion section deals with the results and discussion, and the final section concludes this research.

Literature Review

Related works

In 2016, Lei et al.¹ proposed a novel unsupervised two-layer NN and a softmax regression classifier for the intelligent fault diagnosis of machines. The main intention is to learn nonlinear information from the input samples by optimizing the cost function, thus providing good generalization ability. The outcomes showed that the proposed method achieved higher accuracy in feature extraction and accomplished intelligent fault diagnosis on big data.

In 2018, Ramírez-Gallego et al.² proposed a distributed implementation of a generic feature selection framework to provide standard feature selection on big data platforms. This method included renowned information theory-based methods with the aim of boosting performance and accuracy. The outcomes verified that the proposed model was capable of dealing with ultrahigh-dimensional data sets and a huge number of samples.

In 2018, Ke et al.³ proposed a feature learning algorithm for big data in parallel computing using Adaptive Independent Subspaces Analysis (AISA). The aim of the proposed model was to demonstrate the effectiveness of overcomplete AISA features in the classification task. Partial independent signal bases and a partial independent factorial representation were used to demonstrate the effectiveness of the overcomplete AISA. The results obtained with AISA features were compared with independent component analysis (ICA)-related features, and the outcomes verified that AISA had higher classification accuracy.

In 2017, Attigeri et al.⁵ used the support vector machine (SVM) and logistic regression machine learning algorithms to analyze feature selection and extraction algorithms for loan data. The objective was to apply extraction algorithms to a large volume of financial data and to reduce dimensionality through feature selection. The quality of the data set was enhanced with parallel distributed preprocessing, carried out on the IBM Bluemix cloud platform with a Spark notebook. The outcome showed that reducing the features significantly improved execution time without compromising accuracy.

In 2018, Zhao et al.⁶ presented a distributed subtractive clustering algorithm for efficient economic Big Data analysis. The main aims of the proposed approach were twofold: to select important attributes and to identify representative ones based on parallel clustering. The hidden patterns related to the economic development were revealed via the coupled economic feature selection and econometric model construction. The result of validation of the proposed model proved that there was superiority in the performance of the proposed model when compared with other enormous economic data models.

In 2017, Xin et al.⁷ proposed a novel evolutionary algorithm, called the Improved Crossover Operation (ICO), for enhancing precision in the presence of redundant data in feature extraction. This method improves both the local exploration and the convergence precision of the algorithm by an order of magnitude. The results verified that overfitting was diminished and that the robustness of the proposed model was high.

In 2016, Vinod and Vasudevan⁸ proposed a Highly Correlated Feature Set Selection (HCFS) algorithm to categorize the data in an efficient and an effective way. The aim of the proposed method was to offer quality features to hierarchical learning algorithm for better classification performance. The proposed model had combined the hierarchical learning approach to enhance the performance of feature extraction from a vast database of patient records. The outcome shows that filter-based feature approach with hierarchical learning algorithm for big data classification achieves better performance.

In 2018, Sisiaridis and Markowitch⁹ proposed machine learning algorithms based on artificial intelligence techniques. To enhance the accuracy of feature extraction and to diminish the security noise of the extracted features, four novel methods were used: feature selection in the case of a relatively small number of categories, leave-out single-value attributes, namespace correlation, and data correlation using the actual values. The proposed learning algorithms help in handling feature extraction and feature selection for heterogeneous data obtained from varying sources. The outcome verified that the proposed model had low computational time and reduced data complexity compared with other traditional models.

Review

The literature offers several techniques for big data classification, which are summarized in Table 1. They require further improvement because each lacks features needed for precise classification and extraction of data. The two-stage learning method based on unsupervised feature learning¹ achieved high diagnostic accuracy and consumed less computational time when classifying the images based on certain features. Beyond these advantages, it suffered from low convergence, and it classified only the labeled data, which were used for training, whereas the unlabeled data were not classified. The distributed implementation of a generic feature selection (FS) framework² was applicable to real-world data sets and was capable of dealing with ultrahigh-dimensional data sets. It suffered from imbalance in the class distribution, and it could not select the features automatically, so it required human intervention. The feature learning algorithm based on AISA³ offered high accuracy and required only a small number of training samples; its disadvantages were high complexity and low robustness in classification. Furthermore, with SVM,⁵ the time complexity of the classification task was reduced, but the associated computational cost increased exponentially as the dimension increased. In the distributed subtractive clustering algorithm,⁶ the pros were that high-dimensional data were analyzed, relevant features were detected, and irrelevant features were discarded. The cons were that the huge volume of collected data contained incomplete, incorrect, and nonstandard items, which are difficult to process.

Table 1.

Features and challenges on diverse feature selection models in big data classification

References	Adopted methodology	Features	Challenges
Lei et al.¹	Two-Stage Learning Method-based Unsupervised Learning Features	Diagnostic accuracy was high	Low Convergence
Lei et al.¹	Two-Stage Learning Method-based Unsupervised Learning Features	Less computational time	Only labeled data were classified
Ramírez-Gallego et al.²	Generic FS framework	Applicable to real-world data sets	Class distribution is imbalanced
Ramírez-Gallego et al.²	Generic FS framework	Capable of dealing with ultrahigh-dimensional data sets	Could not select the features automatically
Ke et al.³	Feature learning algorithm-based AISA	Higher accuracy	High computational complexity
Ke et al.³	Feature learning algorithm-based AISA	Requires less number of training samples	Low robustness
Attigeri et al.⁵	SVM	Time complexity of classification task is reduced	High computational cost
Zhao et al.⁶	Distributed subtractive clustering algorithm	Efficiently analyzed high-dimensional economic big data	High processing time
Zhao et al.⁶	Distributed subtractive clustering algorithm	Efficiently analyzed high-dimensional economic big data	Inconvenient for sparse data
Xin et al.⁷	ICO algorithm	The local exploration is sharply enhanced	Unable to calculate the gradient or derivative
Xin et al.⁷	ICO algorithm	Convergence precision was enhanced by an order of magnitude	Overfitting takes place
Xin et al.⁷	ICO algorithm	Convergence precision was enhanced by an order of magnitude	Robustness is low
Vinod and Vasudevan⁸	HCFS algorithm	Provided good quality feature subset	Multiple overlapping binary classes and multiclass issues take place
Vinod and Vasudevan⁸	HCFS algorithm	Better classification accuracy	
Sisiaridis and Markowitch⁹	Machine learning algorithm based on AI techniques	Reduce computational time	Low scalability
Sisiaridis and Markowitch⁹	Machine learning algorithm based on AI techniques	Reduce data complexity	Focused only on a single source data
Sisiaridis and Markowitch⁹	Machine learning algorithm based on AI techniques	Provide solutions to interoperability issues	Focused only on a single source data

AI, artificial intelligence; FS, feature selection; HCFS, Highly Correlated Feature Set Selection; ICO, Improved Crossover Operation; SVM, support vector machine.

In the ICO algorithm,⁷ local exploration was sharply enhanced, and the convergence precision was enhanced by an order of magnitude; its disadvantage was that it could not calculate the gradient or derivative. The HCFS algorithm⁸ offered better classification accuracy, but it was highly inefficient in choosing the subset of features from the original features. In the SVM classifier,⁹ the computational time and data complexity were low, and it provided solutions to interoperability issues. Its disadvantages were low scalability and a focus only on IP flow aggregation, without considering any other network service. These drawbacks provide strong motivation for developing new feature selection models.

Architectural Representation of L-FF+NN-Based Big Data Classification Model

Proposed methodology

Selecting an optimal feature subset from a high-dimensional feature set is a critical task in big data mining, as it involves the collection and processing of vast amounts of data. The classical and advanced data mining and machine learning tools currently available are not sufficient to extract the features in an optimal way. Hence, in this article, a novel big data classification model is developed with the assistance of intelligent methods. Here, the Parallel Pool MapReduce framework is used for handling big data. The data are collected from the University of California Irvine (UCI) repository, from which four data sets are used: Absenteeism at work, Dermatology, Contraceptive method choice, and Immunotherapy. The block diagram of the proposed optimal feature selection-based data classification model is illustrated in Figure 1. The proposed data classification model involves three main phases: (1) feature extraction, (2) optimal feature selection, and (3) classification. Feature extraction is a process of dimensionality reduction by which an initial set of data is reduced to more manageable groups for processing. This phase includes feature extraction techniques such as PCA, LDA, and LSR. Since the number of extracted features is found to be large, an optimal feature selection technique is necessary. This is accomplished by the proposed hybrid model combining FF and LA, called L-FF, which helps in selecting the most relevant features. Moreover, feature classification is a decisional approach in which the features of the data are grouped under a class that is specified a priori. Once the optimal features are selected, the NN classification algorithm is adopted to classify the data with the selected features. With the combination of all the above-mentioned techniques, the proposed model is termed L-FF+NN.

FIG. 1.

Block diagram of proposed optimal feature selection-based data classification model.

Feature extraction

Before data classification, the feature extraction of data is accomplished using three feature extraction techniques, namely PCA, LDA, and LSR. In general, the feature extraction is carried out with a desire of diminishing the quantity of the resources that are needed to describe a vast quantity of the data set. Moreover, the differentiation between the images is carried out using the extracted features.

Principal component analysis

It is an unsupervised learning method used for dimensionality reduction of the data set while retaining as much of the original variability of the data as possible.²⁷ PCA uses an orthogonal transformation to obtain linearly uncorrelated variables, referred to as principal components, from the correlated variables. The number of principal components is less than or equal to the number of original variables. In PCA, statistical quantities such as the mean, standard deviation (SD), covariance, and the eigenvalues and eigenvectors of a matrix are evaluated.

a. Mean: It is the average of the values of the variable throughout the distribution and is a measure of central tendency. The mean of the random variable is given in Equation (1), in which $F_{k}$, $k = 1, 2, \dots, l$, represents the values of the random variable and $l$ is their number.

$\mathrm{Mean}(\bar{F}) = \frac{1}{l} \sum_{k=1}^{l} F_{k}.$ (1)

b. Standard deviation: It measures the degree of scatter of the data. The spread is calculated as the square root of the average squared distance between each data point and the mean. The SD is given in Equation (2), in which the mean is denoted as $\bar{F}$.

$\mathrm{SD} = \sqrt{\frac{1}{l} \sum_{k=1}^{l} (F_{k} - \bar{F})^{2}}.$ (2)

c. Covariance: It is measured between two dimensions and quantifies how much the two dimensions vary together about their means. The covariance is calculated using Equation (3), where $F$ and $G$ are the random variables.

$\mathrm{Cov}(F, G) = \frac{\sum_{k=1}^{l} (F_{k} - \bar{F})(G_{k} - \bar{G})}{l}.$ (3)

d. Eigenvalues and eigenvectors of a matrix: A rectangular array of numbers, symbols, or expressions is termed a matrix, and each individual item of the matrix is called an element. Let $B$ be an $n \times n$ matrix; its eigenvalues and eigenvectors satisfy Equation (4), in which $λ$ is a scalar (the eigenvalue) and $F$ is the corresponding eigenvector. For a real symmetric matrix, the eigenvalues are real and the eigenvectors corresponding to distinct eigenvalues are orthogonal.

$[B][F] = λ[F].$ (4)

The features extracted by the PCA model are denoted as $F_{i}^{\mathrm{PCA}}$ and are represented in Equation (5), in which the number of PCA features is specified as $NP$.

$F_{i}^{\mathrm{PCA}} = \{F_{1}^{\mathrm{PCA}}, F_{2}^{\mathrm{PCA}}, \dots, F_{NP}^{\mathrm{PCA}}\}.$ (5)
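As an illustration of how the quantities in Equations (1) to (5) fit together, the following Python sketch computes PCA features from the mean, the covariance matrix, and its eigen-decomposition. It is a minimal sketch rather than the implementation used in this work: it assumes NumPy is available, and the function name `pca_features` and the toy data are hypothetical.

```python
import numpy as np


def pca_features(data, n_components):
    """Project the rows of `data` (l samples x d variables) onto the
    leading principal components. Hypothetical helper for illustration."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)                     # Equation (1), one mean per variable
    centered = data - mean
    cov = centered.T @ centered / data.shape[0]  # Equation (3), divisor l
    # Equation (4): the covariance matrix is symmetric, so eigh applies.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]  # largest variance first
    components = eigvecs[:, order]
    return centered @ components, eigvals[order]


if __name__ == "__main__":
    toy = [[1.0, 2.0, 0.0], [2.0, 4.1, 1.0], [3.0, 5.9, 0.0],
           [4.0, 8.2, 1.0], [5.0, 9.9, 0.0]]
    scores, variances = pca_features(toy, n_components=2)
    print(scores.shape, variances)
```

The returned scores play the role of $F_{i}^{PCA}$ in Equation (5), with `n_components` standing in for $NP$.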

Linear discriminant analysis

It is a renowned feature extraction and dimension reduction technique that has been used in speech recognition, face recognition, multimedia information retrieval, and so on.²⁸ The main intention of LDA is to find the optimal transformation of high-dimensional data that have been grouped into classes. The within-class scatter matrix and the between-class scatter matrix are computed to obtain the optimal discrimination projection matrix. The optimal discrimination projection matrix $S_{\mathrm{odp}}$ ²⁹ is given in Equation (6), where $B_{\mathrm{class}}$ and $W_{\mathrm{class}}$ represent the between-class scatter matrix and within-class scatter matrix, respectively. $W_{\mathrm{class}}$ is calculated using Equation (7), where $T_{J}$ is the feature vector of the data and $μ_{NJ}$ signifies the mean vector of the class to which $T_{J}$ belongs. $B_{\mathrm{class}}$ is calculated using Equation (8), in which $μ_{J}$ indicates the mean feature vector of class $J$; following the standard LDA formulation, $Q_{J}$ is taken to be the number of samples in class $J$ and $α$ the mean of all samples. Furthermore, the eigenvectors of the projection matrix $S$ are given in Equation (9), in which $D_{T} = B_{\mathrm{class}} + W_{\mathrm{class}}$ is the total scatter matrix.

$S_{\mathrm{odp}} = \arg\max_{S} \frac{S^{T} B_{\mathrm{class}} S}{S^{T} W_{\mathrm{class}} S}.$ (6)

$W_{\mathrm{class}} = \sum_{J=1}^{H} (T_{J} - μ_{NJ})(T_{J} - μ_{NJ})^{T}.$ (7)

$B_{\mathrm{class}} = \sum_{J=1}^{H} Q_{J} (μ_{J} - α)(μ_{J} - α)^{T}.$ (8)

$S = \mathrm{eig}(D_{T}^{-1} B_{\mathrm{class}}).$ (9)

Furthermore, $F_{i}^{\mathrm{LDA}}$ denotes the features extracted by the LDA model and is represented in Equation (10), in which $ND$ specifies the number of LDA features.

$F_{i}^{\mathrm{LDA}} = \{F_{1}^{\mathrm{LDA}}, F_{2}^{\mathrm{LDA}}, \dots, F_{ND}^{\mathrm{LDA}}\}.$ (10)
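The scatter matrices and the eigen-decomposition of Equations (7) to (9) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: NumPy is assumed, the function name `lda_projection` is hypothetical, $Q_{J}$ is taken as the class size and $α$ as the overall mean (the standard LDA reading), and a pseudo-inverse is used in place of $D_{T}^{-1}$ so that a singular total scatter matrix does not raise an error.

```python
import numpy as np


def lda_projection(features, labels, n_components):
    """Return LDA features and the projection matrix (hypothetical helper)."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    overall_mean = features.mean(axis=0)          # alpha in Equation (8), assumed
    d = features.shape[1]
    within = np.zeros((d, d))
    between = np.zeros((d, d))
    for cls in np.unique(labels):
        rows = features[labels == cls]
        class_mean = rows.mean(axis=0)
        diff = rows - class_mean
        within += diff.T @ diff                   # Equation (7)
        gap = (class_mean - overall_mean)[:, None]
        between += rows.shape[0] * (gap @ gap.T)  # Equation (8), Q_J = class size
    total = within + between                      # D_T, total scatter matrix
    # Equation (9): eigenvectors of D_T^{-1} B_class (pseudo-inverse, assumed).
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(total) @ between)
    order = np.argsort(eigvals.real)[::-1][:n_components]
    projection = eigvecs[:, order].real
    return features @ projection, projection


if __name__ == "__main__":
    x = [[0, 0], [1, -1], [-1, 1], [5, 5], [6, 4], [4, 6]]
    y = [0, 0, 0, 1, 1, 1]
    z, _ = lda_projection(x, y, n_components=1)
    print(z.ravel())
```

In the toy example the two classes differ along the diagonal direction, so a single discriminant component is enough to separate them.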

Linear square regression

LSR is a renowned supervised dimensionality reduction procedure and is commonly applied to extract information from the data.³⁰ The optimization problem of LSR is expressed in Equation (11), in which the label corresponding to the data is represented as $L_{e}$ and the class indicator matrix is $Y_{o} = \{y_{1}, y_{2}, \dots, y_{n'}\}$. The optimal transformation matrix $V_{ls}^{*}$, whose $K$ columns are vectors of dimensionality $d^{*} + 1$, is given in Equation (12), in which $Z^{*}$ is the matrix being transformed and $(Z^{*} Z^{*T})^{+}$ is the pseudo-inverse of $Z^{*} Z^{*T}$.

$I(V^{*}) = \min_{V^{*}} \frac{1}{2} \|V^{*T} Z^{*} - Y_{o}\|_{F}^{2}.$ (11)

$V_{ls}^{*} = (Z^{*} Z^{*T})^{+} Z^{*} Y_{o}^{T}.$ (12)

The features extracted by the LSR model are shown in Equation (13), in which the number of LSR features is represented as $NS$.

$F_{i}^{\mathrm{LSR}} = \{F_{1}^{\mathrm{LSR}}, F_{2}^{\mathrm{LSR}}, \dots, F_{NS}^{\mathrm{LSR}}\}.$ (13)
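The closed-form solution of Equation (12) can be illustrated with a short Python sketch. It is a hedged example rather than the implementation used in this work: NumPy is assumed, the function name `lsr_transform` is hypothetical, and the data matrix is augmented with a row of ones so that each column has dimensionality $d + 1$, which is one plausible reading of the $d^{*} + 1$ in the text.

```python
import numpy as np


def lsr_transform(data, labels):
    """Solve Equation (12) and return (V, V^T Z). Hypothetical helper."""
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # Z: (d+1) x n matrix; the appended row of ones is an assumption (bias term).
    z = np.vstack([data.T, np.ones(data.shape[0])])
    # Y_o: K x n class indicator matrix (1 where the sample belongs to the class).
    y = (labels[None, :] == classes[:, None]).astype(float)
    v = np.linalg.pinv(z @ z.T) @ z @ y.T        # Equation (12)
    return v, v.T @ z                            # K x n matrix of LSR features


if __name__ == "__main__":
    v, feats = lsr_transform([[0.0], [1.0], [10.0], [11.0]], [0, 0, 1, 1])
    print(v.shape, feats.argmax(axis=0))
```

Each column of the returned feature matrix holds one score per class for a sample, so taking the largest score recovers the class in this easily separable toy case.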

Furthermore, the features extracted by all three models (PCA, LDA, and LSR) are combined as in Equation (14). The combined feature set $F_{i}$ is written as in Equation (15).

$F_{i} = F_{i}^{\mathrm{PCA}} + F_{i}^{\mathrm{LDA}} + F_{i}^{\mathrm{LSR}}.$ (14)

$F_{i} = \{F_{1}, F_{2}, \dots, F_{n}\}.$ (15)
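One way to read the combination in Equations (14) and (15) is as a concatenation of the three feature sets for each sample. The sketch below shows that reading in Python; treating the combination as concatenation is an assumption, and the function name `combine_features` is hypothetical.

```python
def combine_features(pca_feats, lda_feats, lsr_feats):
    """Concatenate the per-sample feature lists of the three extractors."""
    if not (len(pca_feats) == len(lda_feats) == len(lsr_feats)):
        raise ValueError("all extractors must describe the same samples")
    # One combined feature vector per sample: PCA, then LDA, then LSR features.
    return [list(p) + list(d) + list(s)
            for p, d, s in zip(pca_feats, lda_feats, lsr_feats)]


if __name__ == "__main__":
    combined = combine_features([[0.1, 0.2]], [[0.3]], [[0.4, 0.5]])
    print(combined)
```

Under this reading, the length $n$ of the combined vector equals $NP + ND + NS$.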

The combined feature set $F_{i}$ is subjected to the hybrid L-FF algorithm, which provides the optimal features $F_{i}^{*}$. After optimal feature selection, these features are fed as input to the NN classifier for data classification.

Optimal Feature Selection: A Hybrid Optimization Algorithm

Objective function

The foremost objective of this article is to minimize the correlation between the data features while selecting the optimal features. The objective function is given in Equation (16). The correlation between two data features $F_{1}$ and $F_{2}$ is expressed in Equation (17), in which $n$ indicates the number of data features.

$N = \min[\mathrm{Correlation}].$ (16)

$\mathrm{Correlation} = \frac{n \sum F_{1} F_{2} - \sum F_{1} \sum F_{2}}{\sqrt{[n \sum F_{1}^{2} - (\sum F_{1})^{2}][n \sum F_{2}^{2} - (\sum F_{2})^{2}]}}.$ (17)
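The objective can be made concrete with a short Python sketch. It evaluates the correlation of Equation (17) in its standard Pearson form, with $n$ taken as the number of paired values, and scores a candidate feature subset by the mean absolute pairwise correlation. How the pairwise correlations are aggregated over a subset is not specified in the text, so that aggregation and the function names `correlation` and `subset_fitness` are assumptions made for illustration only.

```python
import math
from itertools import combinations


def correlation(f1, f2):
    """Pearson correlation between two equally long feature columns."""
    n = len(f1)
    s1, s2 = sum(f1), sum(f2)
    s11 = sum(a * a for a in f1)
    s22 = sum(b * b for b in f2)
    s12 = sum(a * b for a, b in zip(f1, f2))
    denom = math.sqrt((n * s11 - s1 ** 2) * (n * s22 - s2 ** 2))
    # A constant column has no defined correlation; 0.0 is returned by convention.
    return (n * s12 - s1 * s2) / denom if denom else 0.0


def subset_fitness(features, selected):
    """Mean absolute pairwise correlation of the selected columns (assumed
    aggregation); lower values correspond to the goal of Equation (16)."""
    pairs = list(combinations(selected, 2))
    if not pairs:
        return 0.0
    total = sum(abs(correlation(features[i], features[j])) for i, j in pairs)
    return total / len(pairs)


if __name__ == "__main__":
    cols = [[1, 2, 3, 4], [2, 4, 6, 8], [1, -1, 1, -1]]
    print(subset_fitness(cols, [0, 1]), subset_fitness(cols, [0, 2]))
```

In the toy example, the first two columns are perfectly correlated, so a subset containing both scores worse than a subset that pairs the first column with the third.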

Feature encoding

The features obtained by combining PCA, LDA, and LSR, as in Equation (15), are given as the solution to the proposed L-FF algorithm for selecting the optimal features. Accordingly, the extracted data features $F_{i} = \{F_{1}, F_{2}, \dots, F_{n}\}$ are selected optimally by the hybrid L-FF algorithm. Then, the selected feature vector $F_{i}^{*} = \{F_{1}^{*}, F_{2}^{*}, \dots, F_{n}^{*}\}$ is fed as input to the NN, and the classified data are obtained in terms of class labels. Figure 2 exhibits the feature encoding of the proposed model.

FIG. 2.

Feature encoding.

Firefly algorithm

The FF algorithm³¹ was developed by Xin-She Yang in 2008, inspired by the flashing behavior of fireflies. It rests on three main assumptions: (1) all FF are unisex; (2) attractiveness is directly proportional to brightness and decreases with distance; and (3) the objective function defines the brightness of an FF. The light intensity $M$ of an FF is evaluated on the basis of the distance $x$ between the fireflies, as shown in Equation (18), in which $M_{0}$ represents the original light intensity and $γ$ is the light absorption coefficient. Each FF has its attractiveness, represented as $ρ$, which decreases with the distance $x$. Equation (19) represents the attractiveness between two FF, in which $ρ_{0}$ denotes the maximum attractiveness, that is, the attractiveness at $x = 0$. Furthermore, for two FF $g$ and $h$ at positions $K_{g}$ and $K_{h}$, their distance is evaluated using Equation (20), in which $b$ represents the number of dimensions. The movement of an FF is represented in Equation (21), in which $ω$ scales the random step and $rand$ is a random number in [0, 1].

$M = M_{0} e^{-γ x}.$ (18)

$ρ(x) = ρ_{0} e^{-γ x^{v}}, \quad v \geq 1.$ (19)

$x_{gh} = \| K_{g} - K_{h} \| = \sqrt{\sum_{w=1}^{b} (K_{g,w} - K_{h,w})^{2}}.$ (20)

$K_{\mathrm{best}} = K_{g} + ρ_{0} e^{-γ x_{gh}^{2}} (K_{h} - K_{g}) + ω (rand - \frac{1}{2}).$ (21)

In Equation (21), the first term is the current position of the FF, the second term accounts for the attractiveness, and the last term describes the random movement; the result is the updated position of the FF. The pseudocode for the conventional FF algorithm is shown in Algorithm 1.

Algorithm 1: Firefly algorithm³¹
Initialize the maximum generation $Max_{g}$ and the light intensity $M_{g}$
Define the light absorption coefficient
While $(t < Max_{g})$
    For $g = 1 : n_{1}$ for all FF
        For $h = 1 : n_{2}$ for all FF
            If $(M_{h} > M_{g})$
                Move FF g toward h
            End if
            Attractiveness varies with distance x
            Evaluate new solutions and update the light intensity
        End for h
    End for g
    Rank the FF and find the current best FF
End while
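Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the implementation used in this work: the objective is minimized, so a lower objective value is treated as a brighter FF; positions are clamped to a box; the random step is drawn per dimension and shrunk slightly each generation; and all names and default parameter values (`firefly_minimize`, `rho0`, `gamma`, `omega`, `bounds`) are hypothetical.

```python
import math
import random


def firefly_minimize(objective, dim, n_fireflies=15, max_gen=40,
                     rho0=1.0, gamma=1.0, omega=0.2,
                     bounds=(-5.0, 5.0), seed=0):
    """Minimize `objective` over a box with a basic firefly search."""
    rng = random.Random(seed)
    low, high = bounds
    swarm = [[rng.uniform(low, high) for _ in range(dim)]
             for _ in range(n_fireflies)]
    cost = [objective(k) for k in swarm]
    for _ in range(max_gen):
        for g in range(n_fireflies):
            for h in range(n_fireflies):
                if cost[h] < cost[g]:  # h is brighter than g
                    # Squared distance x_gh^2, from Equation (20).
                    x2 = sum((a - b) ** 2 for a, b in zip(swarm[g], swarm[h]))
                    rho = rho0 * math.exp(-gamma * x2)  # attractiveness
                    # Equation (21), clamped to the search box (assumption).
                    swarm[g] = [
                        min(high, max(low, a + rho * (b - a)
                                      + omega * (rng.random() - 0.5)))
                        for a, b in zip(swarm[g], swarm[h])
                    ]
                    cost[g] = objective(swarm[g])
        omega *= 0.97  # assumption: gradually reduce the random step
    best = min(range(n_fireflies), key=cost.__getitem__)
    return swarm[best], cost[best]


if __name__ == "__main__":
    position, value = firefly_minimize(lambda k: sum(v * v for v in k), dim=2)
    print(position, value)
```

Because the brightest FF never moves in this scheme, the best objective value in the swarm cannot get worse from one generation to the next.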

Lion Optimization algorithm

LA³² was developed by Rajakumar in 2012, inspired by the unique social behavior of lions. The LA finds the optimal solution on the basis of two unique lion behaviors: territorial defense and territorial takeover. Territorial defense takes place between the resident males and nomadic males, whereas territorial takeover occurs between the old territorial male and the new territorial male.³²

Search procedure

The aim of search procedure is to obtain the optimal solution on the basis of the objective function.

Pride generation: The initial pride comprises $K^{\mathrm{male}}$, $K^{\mathrm{female}}$, and $K^{\mathrm{nomadic}}$, which are the male territorial lion, the female territorial lion, and the nomadic lion, respectively. The structure of $K^{\mathrm{male}}$ is represented as $K^{\mathrm{male}} = [k_{1}^{\mathrm{male}}, k_{2}^{\mathrm{male}}, \dots, k_{Y}^{\mathrm{male}}]$. Similarly, the structure of $K^{\mathrm{female}}$ is $K^{\mathrm{female}} = [k_{1}^{\mathrm{female}}, k_{2}^{\mathrm{female}}, \dots, k_{Y}^{\mathrm{female}}]$. The length of the solution vector is denoted as $Y$. The elements $k_{y}^{\mathrm{male}}$ and $k_{y}^{\mathrm{female}}$, with $y = 1, 2, \dots, Y$, are arbitrary integers that must be generated within the limits $(k_{y}^{\min}, k_{y}^{\max})$, which are the minimum and maximum limits of the solution space, respectively. Furthermore, $t(k_{y})$ is given in Equation (22), which applies to both $t(k_{y}^{\mathrm{male}})$ and $t(k_{y}^{\mathrm{female}})$. The generation of binary lions is ensured as per Equation (23).

$t(k_{y}) = q(k_{1}) \sum_{y=2}^{Y} 2^{Y-1} k_{y}.$ (22)

$q(k_{y}) = \begin{cases} 1, & \text{if } k_{1} = 0 \\ -1, & \text{otherwise} \end{cases}$ (23)

Fertility evaluation: $K^{\mathrm{male}}$ and $K^{\mathrm{female}}$ reach a global or local optimum when their fitness values become saturated. The fertility evaluation is carried out to escape local optimal solutions. The updated solutions of the female and male are $K_{\mathrm{best}}^{\mathrm{female}}$ and $K_{\mathrm{best}}^{\mathrm{male}}$, respectively. The sterility rate $S(r)$ tracks the fertility of $K^{\mathrm{female}}$, and $S(r)$ is increased by 1 after the crossover operation. $K^{\mathrm{female}}$ is updated to $K_{\mathrm{best}}^{\mathrm{female}}$ as per Equations (24) and (25), in which $y^{*}$ is a random integer generated within the interval $[1, Y]$ and $λ$ is the female update function. Furthermore, $ra_{1}$ and $ra_{2}$ are random numbers generated within the interval [0, 1].

$K_{\mathrm{best}}^{\mathrm{female}} = \min[K_{y^{*}}^{\max}, \max(K_{y^{*}}^{\min}, λ_{y^{*}})].$ (24)

$λ_{y^{*}} = K_{y^{*}}^{\mathrm{female}} + (0.1\, ra_{2} - 0.05)(K_{y^{*}}^{\mathrm{male}} - ra_{1} K_{y^{*}}^{\mathrm{female}}).$ (25)
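The female update of Equations (24) and (25) can be illustrated with the following Python sketch. It groups the terms of Equation (25) as $(0.1\, ra_{2} - 0.05)(K^{male}_{y^{*}} - ra_{1} K^{female}_{y^{*}})$, which follows the original LA formulation; the function name `update_female`, the 0-based index, and the injectable random source `rng` are assumptions made for illustration.

```python
import random


def update_female(female, male, k_min, k_max, rng=random):
    """Return a copy of `female` with one randomly chosen element updated."""
    y_star = rng.randrange(len(female))   # random position y* (0-based here)
    ra1, ra2 = rng.random(), rng.random()
    # Equation (25): female update function at position y*.
    lam = female[y_star] + (0.1 * ra2 - 0.05) * (
        male[y_star] - ra1 * female[y_star])
    updated = list(female)
    # Equation (24): keep the updated element inside the solution space.
    updated[y_star] = min(k_max[y_star], max(k_min[y_star], lam))
    return updated


if __name__ == "__main__":
    print(update_female([0.2, 0.8, 0.5], [0.9, 0.1, 0.4],
                        [0.0, 0.0, 0.0], [1.0, 1.0, 1.0],
                        random.Random(1)))
```

Only one element of the female solution changes per call, and the change is bounded by 5% of the bracketed difference in either direction.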

Mating: Through crossover and mutation, the newly generated $K^{\mathrm{male}}$ and $K^{\mathrm{female}}$ accomplish the mating process. Crossover takes place first and is followed by mutation: a single-point crossover operation generates $K^{\mathrm{cub}}$, and random mutation generates $K^{\mathrm{new}}$. At the end of this process, four new cubs $K^{\mathrm{cub}}$ are generated, namely $k_{11}^{\mathrm{cub}}$ and $k_{12}^{\mathrm{cub}}$ by crossover and $k_{21}^{\mathrm{cub}}$ and $k_{22}^{\mathrm{cub}}$ by mutation. At the end of the crossover and mutation operations, four direct cubs and four mutated cubs jointly fill the cub pool. Then, gender grouping takes place, in which the cub pool is clustered into male cubs and female cubs. K-means clustering is carried out on the cub pool for this purpose, forming $K^{zcubs}$ and $K^{scubs}$. Then, to keep the male and female cubs balanced, the sick or weak cubs are killed, which helps in renewing the pride. When a pride is updated, the age of the cubs is initialized to zero. The age of the cubs is incremented by 1 when the territorial lion defeats the nomadic lion in territorial defense. The generation of $K^{\mathrm{nomad}}$ in territorial defense follows the same procedure as the generation of $K^{\mathrm{male}}$. The strength of the entire pride is denoted as $f(K^{\mathrm{pride}})$.

Lion operation: This process removes the existing solution and replaces it with the new solution when the new solution is better.²⁵ When the age of the cub is greater than or equal to the maturity age, the territorial takeover takes place. $K^{male}$ is appended to form $K_{pride}^{male}$ and $K_{pride}^{female}$. Then, $K_{pride}^{male}$ and $K^{female}$ are appended to form $K^{zcubs}$ and $K^{scubs}$ in $K_{pride}^{female}$. The two main criteria for the selection of $K_{best}^{female}$ and $K_{best}^{male}$ are shown in Equations (26) and (27).

$f (K_{best}^{female}) < f (K_{best}^{female} (c)); \quad K_{best}^{female} (c) \neq K_{best}^{female} .$ (26)

$f (K_{best}^{male}) < f (K_{best}^{male} (c)); \quad K_{best}^{male} (c) \neq K_{best}^{male} .$ (27)

Termination: The error threshold and the maximum number of generations are represented as $T_{e}$ and $Max_{e}^{\max}$, respectively, and $Max_{e}$ denotes the current generation count. The target minimum is denoted as $f (K^{optimal})$. Termination occurs when at least one of the criteria in Equation (28) or (29) is satisfied.

$Max_{e} > Max_{e}^{\max} .$ (28)

$|f (K^{male}) - f (K^{optimal})| \leq T_{e} .$ (29)
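
The two stopping rules can be combined into a single check, sketched below. The function and argument names are hypothetical, and the bracket in Equation (29) is read here as an absolute difference, which is an assumption about the intended notation.

```python
def should_terminate(generation, max_generations, f_male, f_optimal, error_threshold):
    """Return True when either stopping rule is met."""
    # Equation (28): the generation count exceeds the allowed maximum.
    budget_exceeded = generation > max_generations
    # Equation (29): the male lion's fitness is within the error threshold
    # of the target minimum (absolute difference assumed).
    close_enough = abs(f_male - f_optimal) <= error_threshold
    return budget_exceeded or close_enough
```

For example, `should_terminate(10, 100, 0.0005, 0.0, 1e-3)` stops the search through the error criterion even though the generation budget is not exhausted.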

The LA³² and FF²⁹ are reported to have promising exploitation phases [Eq. (21) of FF and Eq. (24) of LA]. To exploit the advantages of both algorithms, the proposed algorithm adopts a recombination strategy: it combines the two updating principles according to a recombination rate, which is defined by the user based on the characteristics of the problem and the algorithm. The proposed L-FF algorithm is explained in the subsequent section.

Proposed L-FF algorithm

In general, the FF can deal with complex nonlinear, multimodal optimization problems in an efficient way. Moreover, it does not require a good initial solution to begin the iteration process. However, the FF suffers from several drawbacks. The parameters of the FF algorithm are fixed and are not altered over time. The FF does not remember the history of better positions of the other fireflies, hence there is a chance of missing the best solution. The FF cannot migrate randomly; it can move only in the direction in which the brightness increases. To address these limitations, FF has been linked with the LA, and the interlinked algorithm improves the quality of the optimal solution. Initially, in the proposed L-FF algorithm, a random number is generated. If the random value is less than 0.5, the solution is updated with the FF, and this renewal is accomplished with respect to Equation (21). Otherwise, the solution is updated by the LA, and the renewal of the best solution is accomplished with respect to Equation (24). The pseudocode for the L-FF algorithm is represented in Algorithm 2.

Algorithm 2: Proposed L-FF-based feature selection
Initialize maximum generation $Max_{g}$ and intensity of light $M_{g}$
Define the light absorption coefficient
While $(t < Max_{g})$
For $g = 1 : n$ for all FF
For $h = 1 : g$ for all FF
If $(rand < 0.5)$
Update the solution using Equation (21) by FF algorithm
else
Update the solution using Equation (24) by LA
If $(M_{h} > M_{g})$
Firefly g is moved toward h
else
Attractiveness varies with distance x
New solutions are evaluated and light intensity is updated
end
End if
end
end
End while
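
One possible reading of Algorithm 2 is sketched below in Python. Several parts are assumptions made only for illustration: Equation (21) is not reproduced in this section, so the standard firefly move with attractiveness $β_0 e^{-γ r^2}$ is used in its place; the LA branch applies the renewal of Equations (24) and (25) with the brighter firefly in the role of the male; the inner loop visits all fireflies; and the sphere function stands in for the feature-correlation objective. All function and parameter names are hypothetical.

```python
import math
import random

def sphere(x):
    """Toy objective to minimise; a stand-in for the real fitness function."""
    return sum(v * v for v in x)

def clamp(v, lo, hi):
    return min(hi, max(lo, v))

def firefly_move(xg, xh, rng, beta0=1.0, gamma=1.0, alpha=0.1):
    """Standard firefly move of g toward h (assumed form of Equation (21))."""
    r2 = sum((a - b) ** 2 for a, b in zip(xg, xh))
    beta = beta0 * math.exp(-gamma * r2)
    return [a + beta * (b - a) + alpha * (rng.random() - 0.5)
            for a, b in zip(xg, xh)]

def lion_renewal(xg, xh, rng, lo, hi):
    """Renew one coordinate of g following Equations (24) and (25)."""
    y = rng.randrange(len(xg))
    ra1, ra2 = rng.random(), rng.random()
    lam = xg[y] + (0.1 * ra2 - 0.05 * (xh[y] - ra1 * xg[y]))
    out = list(xg)
    out[y] = clamp(lam, lo, hi)
    return out

def l_ff(objective, dim=5, n=10, max_gen=30, lo=-1.0, hi=1.0, seed=0):
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n)]
    cost = [objective(x) for x in pop]      # lower cost = brighter firefly
    best_cost = min(cost)
    best = list(pop[cost.index(best_cost)])
    initial_best = best_cost
    for _ in range(max_gen):
        for g in range(n):
            for h in range(n):
                if cost[h] < cost[g]:        # firefly h is brighter than g
                    if rng.random() < 0.5:   # FF branch
                        cand = firefly_move(pop[g], pop[h], rng)
                    else:                    # LA branch
                        cand = lion_renewal(pop[g], pop[h], rng, lo, hi)
                    cand = [clamp(v, lo, hi) for v in cand]
                    pop[g], cost[g] = cand, objective(cand)
                    if cost[g] < best_cost:  # keep the best solution seen so far
                        best, best_cost = list(cand), cost[g]
    return best, best_cost, initial_best

best, best_cost, initial_best = l_ff(sphere)
```

Keeping the best solution seen so far is a design choice of this sketch; it guarantees that the returned cost is never worse than the best cost of the initial population, which addresses the memory limitation of the plain FF noted above.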

Hence, the optimally generated features are represented as $F_{i}^{*} = \{F_{1}^{*}, F_{2}^{*} \dots F_{n}^{*}\}$ . The flowchart for the proposed L-FF+NN model is shown in Figure 3.

FIG. 3.

Flowchart of the proposed L-FF+NN algorithm.

Classification of Features: NN Model

NN-based classification

The optimally selected features are fed as input to the NN classifier to classify the data. NN³¹ has a noteworthy ability to take large amounts of data as input and to process them by inferring the hidden, complex, and nonlinear relationships among the data. NN is also flexible, and hence it is utilized in numerous applications.

Furthermore, Equations (30) and (31) represent the network model of the NN, in which a hidden neuron is indexed by $a$. For hidden neuron $a$, the bias weight is represented as $wi_{ra}^{R}$. For the $m$th output neuron, the output bias weight is described as $wi_{rm}^{e}$. The weight from the $a$th hidden neuron to the $m$th output neuron is represented as $wi_{am}^{e}$, and the weight from the $o$th input to the $a$th hidden neuron is described as $wi_{oa}^{R}$. The hidden output $hi$ and the predicted output of the network $pi^{*}$ are given in Equations (30) and (31), respectively, where $AT$ indicates the activation function. The error function between the actual output $pi$ and the classified output $pi^{*}$ is described in Equation (32).

$hi = AT (wi_{ra}^{R} + \sum_{o = 1}^{n_{t}} wi_{oa}^{R} F_{i}^{*}) .$ (30)

$pi^{*} = AT (wi_{rm}^{e} + \sum_{a = 1}^{n_{h}} wi_{am}^{e} hi) .$ (31)

$Err = \underset{\{wi_{ra}^{R}, wi_{oa}^{R}, wi_{rm}^{e}, wi_{am}^{e}\}}{\arg\min} \sum_{m = 1}^{n_{o}} |pi - pi^{*}| .$ (32)
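
The network model of Equations (30) to (32) corresponds to one hidden layer followed by one output layer. The sketch below computes the forward pass and the absolute error for a single sample. The sigmoid activation, the nested-list weight layout, and all numeric values are assumptions for illustration; the article does not specify them, and training (the minimisation in Equation (32)) is not shown.

```python
import math

def sigmoid(z):
    """Assumed activation function AT."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(features, w_in, b_hidden, w_out, b_out, act=sigmoid):
    """w_in[o][a]: input o -> hidden a; w_out[a][m]: hidden a -> output m."""
    # Equation (30): hidden outputs.
    hidden = [act(b_hidden[a] + sum(w_in[o][a] * features[o]
                                    for o in range(len(features))))
              for a in range(len(b_hidden))]
    # Equation (31): predicted outputs.
    predicted = [act(b_out[m] + sum(w_out[a][m] * hidden[a]
                                    for a in range(len(hidden))))
                 for m in range(len(b_out))]
    return hidden, predicted

def abs_error(actual, predicted):
    """Sum of absolute errors over the outputs, the quantity minimised in Equation (32)."""
    return sum(abs(p - q) for p, q in zip(actual, predicted))

features = [0.5, -1.0, 2.0]                      # three selected features
w_in = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]    # 3 inputs x 2 hidden neurons
b_hidden = [0.0, 0.1]
w_out = [[0.7], [-0.6]]                          # 2 hidden neurons x 1 output
b_out = [0.05]

hidden, predicted = forward(features, w_in, b_hidden, w_out, b_out)
err = abs_error([1.0], predicted)
```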

Hence, for our data sets such as Absenteeism at work, Dermatology, Contraceptive method choice, and Immunotherapy, the classification is performed using NN with optimally selected features by the L-FF algorithm.

Results and Discussion

Experimental setup

The proposed optimal feature selection-based big data classification model was implemented in MATLAB, and the corresponding simulation results were observed. The evaluation is carried out by varying the population size over 20, 40, 60, and 80, and performance measures such as accuracy, specificity, precision, and F1 score are analyzed.

Accuracy

Accuracy is the proportion of all samples that are classified correctly. The formula for accuracy is: $ACC = \frac{TP + TN}{TP + FP + FN + TN},$ (34)

where TP refers to true positive, TN signifies true negative, FN indicates false negative, and FP is false positive.

Specificity

Specificity, also called the true negative rate (TNR), is the proportion of actual negatives that are correctly identified. $TNR = \frac{TN}{TN + FP} .$ (35)

Precision

Precision is the proportion of predicted positives that are actually positive. $Precision = \frac{TP}{TP + FP} .$ (36)

F score

The F score is otherwise known as the F₁ score or F measure. It is a measure of a test's accuracy and is defined as the harmonic mean of the precision and the sensitivity (recall) of the test, with both weighted equally.
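
The measures defined above can all be computed from the four confusion-matrix counts. The following Python sketch uses the standard definitions; the function name and the example counts are hypothetical and are not taken from the experiments.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the evaluation measures from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)   # Equation (34)
    specificity = tn / (tn + fp)                 # Equation (35), TNR
    precision = tp / (tp + fp)                   # Equation (36)
    sensitivity = tp / (tp + fn)                 # recall
    # F1: harmonic mean of precision and sensitivity.
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "specificity": specificity,
        "precision": precision,
        "sensitivity": sensitivity,
        "f1": f1,
        "fpr": 1 - specificity,   # false-positive rate
        "fnr": 1 - sensitivity,   # false-negative rate
        "fdr": 1 - precision,     # false discovery rate
    }

# Hypothetical counts: 40 TP, 45 TN, 5 FP, 10 FN.
m = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

With these counts the accuracy is 0.85, the specificity is 0.9, and the F₁ score is 80/95, or about 0.842. The complementary rates FPR, FNR, and FDR follow directly from specificity, sensitivity, and precision.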

In the current research work, the data sets such as Absenteeism at work, Dermatology, Contraceptive method choice, and Immunotherapy data set are utilized.

Absenteeism at work database

It is gathered from https://archive.ics.uci.edu/ml/datasets/Absenteeism+at+work# (last accessed December 5, 2018). Depending on the purpose of the research, the data set allows for several new combinations of attributes and attribute exclusions, or the modification of the attribute type (categorical, integer, or real).

Dermatology data

It is collected from https://archive.ics.uci.edu/ml/datasets/dermatology (last accessed December 5, 2018). This database contains 34 attributes, 33 of which are linear valued and 1 of which is nominal. In the data set constructed for this domain, the family history feature has the value 1 if any of these diseases have been observed in the family, and 0 otherwise.

Contraceptive method choice database

It is collected from https://archive.ics.uci.edu/ml/datasets/Contraceptive+Method+Choice (last accessed December 5, 2018). The samples are married women who were either not pregnant or did not know if they were at the time of interview. The problem is to predict the current contraceptive method choice (no use, long-term methods, or short-term methods).

Immunotherapy data set

It is gathered from https://archive.ics.uci.edu/ml/datasets/Immunotherapy+Dataset (last accessed December 5, 2018).

Next, the performance of the developed L-FF+NN-based data classification model is compared with the conventional models such as GA+NN, FF+NN, PSO+NN, ABC+NN, and LA+NN by analyzing measures such as accuracy, specificity, precision, FPR, FNR, NPV, FDR, F₁-score, and MCC. The overall performance analysis favors the proposed L-FF+NN model over the other existing models.

Analysis on feature selection using Absenteeism at work data set

The data collected from the Absenteeism at work database are used for the L-FF+NN-based data classification, and the results are shown in Figure 4. The evaluation is carried out by varying the population size over 20, 40, 60, and 80, and performance measures such as accuracy, specificity, precision, and F₁ score are analyzed. From Figure 4a, the accuracy at the population size of 65 for the L-FF+NN model is 1.2% better than the existing GA+NN, 0.5% and 4.1% better than FF+NN and PSO+NN, respectively, and 3% and 2.2% better than ABC+NN and LA+NN, respectively. Moreover, as shown in Figure 4b, the specificity of L-FF+NN for the population size of 60 is 3% better than GA+NN, 2% better than FF+NN, 1.5% better than PSO+NN, and 1.2% and 8% better than ABC+NN and LA+NN, respectively. From Figure 4c, the precision of the L-FF+NN model is 1.3%, 0.1%, 6%, 4%, and 3% better than the traditional models such as GA+NN, FF+NN, PSO+NN, ABC+NN, and LA+NN, respectively. The F₁ score of the L-FF+NN model is found to be 1.0% and 0.96% better than GA+NN and FF+NN, respectively, and 0.9%, 0.85%, and 0.53% better than PSO+NN, ABC+NN, and LA+NN, respectively. Table 2 presents the overall performance analysis of the proposed L-FF+NN model for the Absenteeism at work data set. From the analysis, the accuracy of the L-FF+NN model is 9.1%, 6.8%, 13%, 16.7%, and 14.3% better than the conventional models such as GA+NN, FF+NN, PSO+NN, ABC+NN, and LA+NN, respectively. In terms of sensitivity, however, the L-FF+NN value in Table 2 (0.61735) is the lowest among the compared models: it is 23.2% lower than GA+NN, 27% lower than FF+NN, and 23%, 27.5%, and 17.8% lower than PSO+NN, ABC+NN, and LA+NN, respectively. In terms of specificity, L-FF+NN is found to be 15.7%, 14.1%, and 15% better than GA+NN, FF+NN, and PSO+NN, respectively, and 12% and 8% better than the conventional ABC+NN and LA+NN models, respectively.

FIG. 4.

Analysis on the feature selection using Absenteeism at work data set focusing on: (a) accuracy, (b) specificity, (c) precision, and (d) F₁ score.

Table 2.

Overall performance analysis of the proposed feature selection over the conventional models using Absenteeism at work data set

Metrics	GA+NN³³	FF+NN³⁴	PSO+NN³⁵	ABC+NN³⁶	LA+NN³²	L-FF+NN
Accuracy	0.79907	0.81671	0.79907	0.74741	0.78163	0.87245
Sensitivity	0.80403	0.85697	0.80403	0.85216	0.75171	0.61735
Specificity	0.79807	0.80866	0.79807	0.72646	0.78761	0.92347
Precision	0.44332	0.47251	0.44332	0.38388	0.41447	0.61735
FPR	0.20193	0.19134	0.20193	0.27354	0.21239	0.07653
FNR	0.19597	0.14303	0.19597	0.14784	0.24829	0.38265
NPV	0.79807	0.80866	0.79807	0.72646	0.78761	0.92347
FDR	0.55668	0.52749	0.55668	0.61612	0.58553	0.38265
F₁-score	0.57152	0.60915	0.57152	0.52931	0.53433	0.61735
MCC	0.48861	0.54016	0.48861	0.44664	0.43766	0.54082

FPR, false-positive rate; FNR, false-negative rate; NPV, negative predictive value; FDR, false discovery rate; MCC, Matthews correlation coefficient.

Analysis on feature selection using Dermatology data set

The proposed L-FF+NN model is compared with several existing models in terms of accuracy, specificity, precision, and F₁ score for the Dermatology data, as given in Figure 5. From Figure 5a, the accuracy of the proposed L-FF+NN model at the population size of 65 is found to be 1.2%, 0.1%, and 0.4% better than GA+NN, FF+NN, and PSO+NN, respectively, and 0.3% and 0.2% better than ABC+NN and LA+NN, respectively. As shown in Figure 5b, at the population size of 80, the specificity of the proposed L-FF+NN model is 2.4%, 0.1%, 1%, 0.6%, and 0.5% superior to the conventional models such as GA+NN, FF+NN, PSO+NN, ABC+NN, and LA+NN, respectively. Furthermore, based on Figure 5c, at the population size of 80, the precision of the proposed L-FF+NN model is 1.4%, 0.1%, 0.3%, 0.6%, and 0.8% superior to the traditional models such as GA+NN, FF+NN, PSO+NN, ABC+NN, and LA+NN, respectively. Moreover, as observed from Figure 5d, at the population size of 65, the F₁ score of the proposed L-FF+NN model is found to be 4%, 0.1%, and 3.2% better than the conventional models such as GA+NN, FF+NN, and PSO+NN, respectively, and 2.1% and 1.3% better than ABC+NN and LA+NN, respectively. The overall performance analysis of the feature selection using the Dermatology data set is given in Table 3. Here, the accuracy of the L-FF+NN model is 1.3%, 0.9%, 1.7%, 1.2%, and 0.9% better than the conventional models such as GA+NN, FF+NN, PSO+NN, ABC+NN, and LA+NN, respectively. Similarly, the specificity of the proposed L-FF+NN model is 7% better than GA+NN, 0.1% and 3.4% better than FF+NN and PSO+NN, respectively, and 1.8% and 2.3% better than ABC+NN and LA+NN, respectively.
Furthermore, the NPV of the proposed L-FF+NN model is 7%, 0.1%, 3.4%, 1.8%, and 2.3% better than the conventional models such as GA+NN, FF+NN, PSO+NN, ABC+NN, and LA+NN, respectively. Moreover, in terms of FDR, the proposed L-FF+NN model is 92%, 28%, 87%, 82%, and 78% superior to the state-of-the-art models such as GA+NN, FF+NN, PSO+NN, ABC+NN, and LA+NN, respectively; that is, its FDR is lower by these proportions. Furthermore, the F₁ score of the proposed L-FF+NN model is 3.3%, 0.9%, and 1.7% better than traditional models such as GA+NN, FF+NN, and PSO+NN, respectively, and 1.1% and 0.9% better than ABC+NN and LA+NN, respectively.

FIG. 5.

Analysis on the feature selection using Dermatology data set focusing on: (a) accuracy, (b) specificity, (c) precision, and (d) F₁ score.

Table 3.

Overall performance analysis of the proposed feature selection over the conventional models using Dermatology data set

Metrics	GA+NN³³	FF+NN³⁴	PSO+NN³⁵	ABC+NN³⁶	LA+NN³²	L-FF+NN
Accuracy	0.93321	0.96486	0.94903	0.95426	0.95695	0.96581
Sensitivity	0.93637	0.93637	0.93637	0.93637	0.93637	0.93637
Specificity	0.93004	0.99335	0.9617	0.97214	0.97752	0.99525
Precision	0.93048	0.99295	0.9607	0.97111	0.97656	0.99495
FPR	0.069959	0.006648	0.038303	0.027857	0.022475	0.004748
FNR	0.063628	0.063628	0.063628	0.063628	0.063628	0.063628
NPV	0.93004	0.99335	0.9617	0.97214	0.97752	0.99525
FDR	0.069519	0.007049	0.039298	0.02889	0.02344	0.005045
F₁-score	0.93342	0.96383	0.94838	0.95342	0.95604	0.96477
MCC	0.86643	0.93124	0.89836	0.9091	0.91467	0.93324
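
The article does not state how the percentage improvements are computed. Assuming they are relative differences with respect to each baseline, and that for error-type measures such as FDR a lower value is better, the FDR row of Table 3 reproduces the reported 92%, 28%, 87%, 82%, and 78% once the results are truncated to whole percentages. The helper name `relative_gain` is hypothetical.

```python
def relative_gain(proposed, baseline, lower_is_better=False):
    """Percentage by which `proposed` improves on `baseline`."""
    if lower_is_better:
        return 100.0 * (baseline - proposed) / baseline
    return 100.0 * (proposed - baseline) / baseline

# FDR row of Table 3 (Dermatology data set).
fdr_baselines = {
    "GA+NN": 0.069519,
    "FF+NN": 0.007049,
    "PSO+NN": 0.039298,
    "ABC+NN": 0.02889,
    "LA+NN": 0.02344,
}
fdr_proposed = 0.005045  # L-FF+NN

fdr_gains = {name: relative_gain(fdr_proposed, value, lower_is_better=True)
             for name, value in fdr_baselines.items()}
truncated = [int(fdr_gains[name]) for name in fdr_baselines]
```

Because the baseline FDR values are already small, a large relative reduction corresponds to a small absolute change; for GA+NN, for instance, the FDR falls from about 0.070 to about 0.005.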

Analysis on feature selection using Contraceptive method choice data set

The data collected from the Contraceptive method choice database were analyzed. The feature selection was evaluated in terms of several performance measures by varying the population size. Figure 6 portrays the performance analysis on the Contraceptive method choice database. In terms of accuracy, the proposed L-FF+NN model is compared with conventional models such as GA+NN, FF+NN, PSO+NN, ABC+NN, and LA+NN, and the values observed for the population size of 80 are found to be 2.1% and 0.1% better than GA+NN and FF+NN, respectively, and 0.9%, 0.3%, and 0.2% better than PSO+NN, ABC+NN, and LA+NN, respectively, as shown in Figure 6a. For the population size of 65, in Figure 6b, the specificity of the proposed L-FF+NN model is found to be 1.3%, 0.1%, 0.6%, 0.2%, and 0.3% better than the traditional models such as GA+NN, FF+NN, PSO+NN, ABC+NN, and LA+NN, respectively. From Figure 6c, the precision of the proposed L-FF+NN model is 7.6%, 0.1%, 4%, 3.2%, and 1.8% superior to the conventional models such as GA+NN, FF+NN, PSO+NN, ABC+NN, and LA+NN, respectively, while setting the population size to 65. Moreover, in terms of F₁ score, with the population size set as 65 as per Figure 6d, the proposed L-FF+NN model is found to be 7%, 0.1%, 4.1%, 2.2%, and 2.8% better than the conventional models such as GA+NN, FF+NN, PSO+NN, ABC+NN, and LA+NN, respectively. The overall performance analysis of the proposed L-FF+NN on the Contraceptive method choice database is portrayed in Table 4. Here, the accuracy of the proposed L-FF+NN model is 1.8%, 0.39%, 0.6%, 0.4%, and 0.3% better than the traditional models such as GA+NN, FF+NN, PSO+NN, ABC+NN, and LA+NN, respectively.
Similarly, the MCC value of the proposed L-FF+NN model is 4.8%, 0.14%, and 2.5% better than the conventional models such as GA+NN, FF+NN, and PSO+NN, respectively, and it is 1.7% and 1.3% better than ABC+NN and LA+NN, respectively. The F₁ score of the proposed L-FF+NN model is 9.4%, 0.1%, and 2% better than the traditional models such as GA+NN, FF+NN, and PSO+NN, respectively, and it is 1.4% better than ABC+NN and 10% better than LA+NN. Furthermore, in terms of FDR, the proposed L-FF+NN model is 92%, 28%, 87%, 82%, and 78% superior to the conventional models such as GA+NN, FF+NN, PSO+NN, ABC+NN, and LA+NN, respectively. Then, there is a 1.6%, 0.4%, 0.8%, 0.5%, and 0.4% improvement in the proposed L-FF+NN model over the conventional models such as GA+NN, FF+NN, PSO+NN, ABC+NN, and LA+NN, respectively, in terms of NPV. Moreover, in terms of FPR, the proposed L-FF+NN model is found to be 93% and 28% better than GA+NN and FF+NN, respectively, and it is 87%, 82%, and 78% better than PSO+NN, ABC+NN, and LA+NN, respectively.

FIG. 6.

Analysis on the feature selection using Contraceptive method choice data set focusing on: (a) accuracy, (b) specificity, (c) precision, and (d) F₁ score.

Table 4.

Overall performance analysis of the proposed feature selection over the conventional models using Contraceptive method choice data set

Metrics	GA+NN³³	FF+NN³⁴	PSO+NN³⁵	ABC+NN³⁶	LA+NN³²	L-FF+NN
Accuracy	0.98306	0.99603	0.98954	0.99168	0.99279	0.99642
Sensitivity	0.98435	0.98435	0.98435	0.98435	0.98435	0.98435
Specificity	0.9828	0.99837	0.99058	0.99315	0.99447	0.99883
Precision	0.91964	0.99176	0.95434	0.96637	0.97269	0.9941
FPR	0.017204	0.001635	0.009419	0.00685	0.005527	0.001168
FNR	0.015647	0.015647	0.015647	0.015647	0.015647	0.015647
NPV	0.9828	0.99837	0.99058	0.99315	0.99447	0.99883
FDR	0.080364	0.008235	0.04566	0.033626	0.027308	0.005896
F₁-score	0.95089	0.98805	0.96911	0.97528	0.97849	0.9892
MCC	0.94146	0.98567	0.96299	0.97034	0.97418	0.98708

Analysis on feature selection using Immunotherapy data set

Figure 7 shows the performance of the proposed L-FF+NN model over the existing methods on the Immunotherapy data set. On setting the population size to 45, the accuracy of the proposed L-FF+NN model is 1%, 0.1%, 0.5%, 0.4%, and 0.3% better than the traditional models such as GA+NN, FF+NN, PSO+NN, ABC+NN, and LA+NN, respectively, as per Figure 7a. Moreover, in terms of specificity, from Figure 7b, the proposed L-FF+NN model is 2.1%, 0.1%, 0.5%, and 0.7% better than the traditional models such as GA+NN, FF+NN, PSO+NN, and ABC+NN, respectively, and it is 1.2% better than LA+NN at the population size of 60. Furthermore, in terms of precision at the population size of 80, the proposed L-FF+NN model is 5%, 0.1%, 1%, 0.8%, and 0.5% superior to the traditional models such as GA+NN, FF+NN, PSO+NN, ABC+NN, and LA+NN, respectively, as shown in Figure 7c. With the population size of 65, the F₁ score of the proposed L-FF+NN model is found to be 4%, 0.1%, and 2.5% better than the state-of-the-art models such as GA+NN, FF+NN, and PSO+NN, respectively, and it is 1.8% better than ABC+NN and 0.7% better than LA+NN, as shown in Figure 7d.

FIG. 7.

Analysis on the feature selection using Immunotherapy data set focusing on: (a) accuracy, (b) specificity, (c) precision, and (d) F₁ score.

Table 5 presents the performance of the proposed L-FF+NN over conventional models for feature selection on the Immunotherapy data set. In terms of accuracy, the proposed L-FF+NN model is 0.2%, 0.1%, 0.14%, 0.9%, and 0.7% better than the traditional models such as GA+NN, FF+NN, PSO+NN, ABC+NN, and LA+NN, respectively. The specificity of the proposed L-FF+NN model is found to be 0.3%, 0.012%, 0.2%, 0.14%, and 0.18% better than the traditional models such as GA+NN, FF+NN, PSO+NN, ABC+NN, and LA+NN, respectively. The precision of the proposed L-FF+NN model is found to be 0.89%, 0.02%, and 0.4% better than the state-of-the-art models such as GA+NN, FF+NN, and PSO+NN, respectively, and it is 0.3% and 0.2% better than ABC+NN and LA+NN, respectively. In terms of FPR, the proposed L-FF+NN model is found to be 93%, 28%, 87%, 82%, and 78% better than the conventional models such as GA+NN, FF+NN, PSO+NN, ABC+NN, and LA+NN, respectively.

Table 5.

Overall performance analysis of the proposed feature selection over the conventional models using Immunotherapy data set

Metrics	GA+NN³³	FF+NN³⁴	PSO+NN³⁵	ABC+NN³⁶	LA+NN³²	L-FF+NN
Accuracy	0.97007	0.97264	0.97136	0.97178	0.972	0.97272
Sensitivity	0.91875	0.91875	0.91875	0.91875	0.91875	0.91875
Specificity	0.99573	0.99959	0.99766	0.9983	0.99863	0.99971
Precision	0.99078	0.99912	0.99493	0.99631	0.99702	0.99937
FPR	0.004275	0.000406	0.00234	0.001702	0.001373	0.00029
FNR	0.081254	0.081254	0.081254	0.081254	0.081254	0.081254
NPV	0.99573	0.99959	0.99766	0.9983	0.99863	0.99971
FDR	0.009219	0.000883	0.005069	0.003692	0.002981	0.000631
F₁-score	0.9534	0.95725	0.95532	0.95596	0.95628	0.95736
MCC	0.93284	0.93897	0.9359	0.93691	0.93743	0.93915

Analysis on classification using Absenteeism at work data set

Figure 8 shows the classification analysis on the Absenteeism at work data set, in which the performance of the proposed model is evaluated against the conventional models by varying the population size. With the population size set as 5, the accuracy of the proposed model is found to be 4.2% and 2% better than the conventional models such as L-FF+KNN and L-FF+SVM, respectively, as per Figure 8a. Then, the proposed L-FF+NN model is evaluated against the conventional models in terms of specificity as shown in Figure 8b, and it is found to be 0.2% and 0.12% better than L-FF+KNN and L-FF+SVM, respectively, with the population size fixed at 3. The precision of the proposed L-FF+NN model is 5.2% better than L-FF+KNN and 0.1% better than L-FF+SVM with the population size set to 5, as per Figure 8c. From Figure 8d, the proposed L-FF+NN model is 11% and 1.1% better than the traditional models such as L-FF+KNN and L-FF+SVM, respectively, in terms of F₁ score with the population size as 5. Table 6 presents the classification analysis on the Absenteeism at work data set, where the accuracy of the proposed L-FF+NN model is 3.5% better than both L-FF+KNN and L-FF+SVM. The sensitivity of the proposed L-FF+NN is 17% superior to both L-FF+KNN and L-FF+SVM. The specificity of the proposed model is 1.9% better than both L-FF+KNN and L-FF+SVM. Furthermore, the precision of the proposed L-FF+NN model is 17% better than both L-FF+KNN and L-FF+SVM. The FPR of the proposed L-FF+NN model is 19% better than both L-FF+KNN and L-FF+SVM. In terms of FNR, there is a 19% improvement in the proposed L-FF+NN model compared with L-FF+KNN and L-FF+SVM. The proposed L-FF+NN is found to be 1.9% better than both L-FF+KNN and L-FF+SVM in terms of NPV.
Then, the FDR of the proposed L-FF+NN is evaluated to be 19% better than both L-FF+KNN and L-FF+SVM. The MCC value of the proposed L-FF+NN model is enhanced by 24% over both L-FF+KNN and L-FF+SVM.

FIG. 8.

Analysis on the classification using Absenteeism at work data set focusing on: (a) accuracy, (b) specificity, (c) precision, and (d) F₁ score.

Table 6.

Overall performance analysis of the proposed classification over the conventional models using Absenteeism at work data set

Metrics	L-FF+KNN³⁷	L-FF+SVM³⁸	L-FF+NN
Accuracy	0.84251	0.84251	0.87245
Sensitivity	0.52753	0.52753	0.61735
Specificity	0.90551	0.90551	0.92347
Precision	0.52753	0.52753	0.61735
FPR	0.094494	0.094494	0.07653
FNR	0.47247	0.47247	0.38265
NPV	0.90551	0.90551	0.92347
FDR	0.47247	0.47247	0.38265
F₁-score	0.52753	0.52753	0.61735
MCC	0.43303	0.43303	0.54082

Analysis on classification using Dermatology data set

The proposed L-FF+NN model is compared with existing models such as L-FF+KNN and L-FF+SVM by measures such as accuracy, specificity, sensitivity, and F₁-score in Figure 9. With the population size set at 4, the accuracy of the proposed L-FF+NN model is 0.3% and 0.2% better than L-FF+KNN and L-FF+SVM, respectively, as per Figure 9a. Then, from Figure 9b, the specificity of the proposed L-FF+NN model is 0.25% and 0.10% better than the traditional models such as L-FF+KNN and L-FF+SVM, respectively, at the population size of 5. The sensitivity of the proposed L-FF+NN model is 10% and 5% superior to the conventional models such as L-FF+KNN and L-FF+SVM, respectively, while setting the population size to 3 as per Figure 9c. Furthermore, from Figure 9d, with the population size of 2, the F₁-score of the proposed L-FF+NN model is found to be 4.2% and 2.55% better than L-FF+KNN and L-FF+SVM, respectively. Table 7 presents the classification analysis on the Dermatology data set. In terms of accuracy, the proposed L-FF+NN model is 33% and 1% better than the conventional models such as L-FF+KNN and L-FF+SVM, respectively. The sensitivity of the proposed L-FF+NN model is 30% better than that of L-FF+KNN. Then, the proposed L-FF+NN model is found to be 3.6% and 0.2% better than the conventional models in terms of specificity.

FIG. 9.

Analysis on the classification using Dermatology data set focusing on: (a) accuracy, (b) specificity, (c) precision, and (d) F₁ score.

Table 7.

Overall performance analysis of the proposed classification over the conventional models using Dermatology data set

Metrics	L-FF+KNN³⁷	L-FF+SVM³⁸	L-FF+NN
Accuracy	0.72491	0.95584	0.96581
Sensitivity	0.72016	0.93637	0.93637
Specificity	0.72966	0.97531	0.99525
Precision	0.72707	0.97431	0.99495
FPR	0.27034	0.024691	0.004748
FNR	0.27984	0.063628	0.063628
NPV	0.72966	0.97531	0.99525
FDR	0.27293	0.025692	0.005045
F₁-score	0.7236	0.95496	0.96477
MCC	0.44985	0.91237	0.93324

Analysis on classification using Contraceptive method choice data set

Figure 10 shows the classification analysis on the Contraceptive method choice data set. From Figure 10a, the accuracy of the proposed L-FF+NN model is 0.2% and 0.1% superior to the conventional models such as L-FF+KNN and L-FF+SVM, respectively, at the population size of 5. Then, with the population size fixed at 5, the specificity of the proposed L-FF+NN model is 0.2% and 0.1% better than the conventional models such as L-FF+KNN and L-FF+SVM, respectively. Then, from Figure 10c, the sensitivity of the proposed L-FF+NN model is 8% and 4% better than the traditional models such as L-FF+KNN and L-FF+SVM, respectively. Then, from Figure 10d, with the population size of 3, the F₁-score of the proposed L-FF+NN model is 5.3% and 2% better than L-FF+KNN and L-FF+SVM, respectively. The overall performance analysis of the data classification is shown in Table 8, where the proposed L-FF+NN model is 1% and 0.04% better than the traditional models in terms of accuracy. The F₁-score of the proposed L-FF+NN model is 3% and 1.2% better than the conventional models such as L-FF+KNN and L-FF+SVM, respectively. Then, the proposed L-FF+NN model is 90% better than L-FF+KNN and 80% better than L-FF+SVM in terms of FDR. Then, the NPV of the proposed L-FF+NN model is 1.2% and 0.2% better than the traditional models such as L-FF+KNN and L-FF+SVM, respectively. The proposed L-FF+NN model is found to be 91% and 80% better than the conventional models in terms of FPR.

FIG. 10.

Analysis on the classification using Contraceptive method choice data set focusing on: (a) accuracy, (b) specificity, (c) precision, and (d) F₁ score.

Table 8.

Overall performance analysis of the proposed classification over the conventional models using Contraceptive method choice data set

Metrics	L-FF+KNN³⁷	L-FF+SVM³⁸	L-FF+NN
Accuracy	0.9863	0.99233	0.99642
Sensitivity	0.98435	0.98435	0.98435
Specificity	0.98669	0.99393	0.99883
Precision	0.93667	0.97008	0.9941
FPR	0.013312	0.006072	0.001168
FNR	0.015647	0.015647	0.015647
NPV	0.98669	0.99393	0.99883
FDR	0.063333	0.029919	0.005896
F₁-score	0.95992	0.97716	0.9892
MCC	0.95209	0.9726	0.98708

Analysis on classification using Immunotherapy data set

Figure 11 presents the classification analysis on the Immunotherapy data set. From Figure 11a, the accuracy of the proposed L-FF+NN model is 0.2% and 0.1% better than the classifiers such as L-FF+KNN and L-FF+SVM, respectively, while setting the population size at 5. Then, with the population size fixed at 4, the specificity of the proposed L-FF+NN model is 0.2% and 0.1% better than the conventional models such as L-FF+KNN and L-FF+SVM, respectively, as shown in Figure 11b. The sensitivity of the proposed L-FF+NN model is 11% and 5% superior to the conventional models such as L-FF+KNN and L-FF+SVM, respectively, at the population size of 4 as per Figure 11c. Then, in terms of F₁-score, the proposed L-FF+NN model is found to be 4.5% and 3% better than the conventional models with the population size fixed at 3, as per Figure 11d. Table 9 presents the overall performance analysis of the proposed L-FF+NN model over conventional models for the Immunotherapy data set. The accuracy is found to be 1.6% and 0.08% better than the conventional models such as L-FF+KNN and L-FF+SVM, respectively. Then, the sensitivity of the proposed L-FF+NN model is 2% better than that of L-FF+KNN. The precision of the proposed L-FF+NN model is 2% and 0.2% better than the conventional models such as L-FF+KNN and L-FF+SVM, respectively.

FIG. 11.

Analysis on the classification using Immunotherapy data set focusing on: (a) accuracy, (b) specificity, (c) precision, and (d) F₁ score.

Table 9.

Overall performance analysis of the proposed classification over the conventional models using Immunotherapy data set

Metrics	L-FF+KNN³⁷	L-FF+SVM³⁸	L-FF+NN
Accuracy	0.9574	0.97191	0.97272
Sensitivity	0.89879	0.91875	0.91875
Specificity	0.98671	0.99849	0.99971
Precision	0.97128	0.99673	0.99937
FPR	0.013288	0.001509	0.00029
FNR	0.10121	0.081254	0.081254
NPV	0.98671	0.99849	0.99971
FDR	0.028719	0.003273	0.000631
F₁-score	0.93363	0.95615	0.95736
MCC	0.90381	0.93722	0.93915

Conclusion

In this article, a big data classification model was developed with the aid of intelligent methods. The model involved three main phases: (1) feature extraction, (2) optimal feature selection, and (3) classification. Before the feature selection and classification processes, the Parallel Pool Map reduce Framework was used for handling big data. The first phase, feature extraction, made use of well-known extraction techniques such as PCA, LDA, and LSR. Because the dimension of the data was high, the selection of the optimal features was a complex task, so the proposed model used the optimal feature selection technique referred to as L-FF to select the optimal features. The research focused on the objective of minimizing the correlation between the selected features, so that they provide diverse information regarding the different classes of data. Once the optimal features were selected, classification was carried out with NN, which classified the data effectively with the selected features. To assess the enhancement in performance, the proposed L-FF+NN model was compared with the traditional models on the Absenteeism at work, Dermatology, Contraceptive method choice, and Immunotherapy data sets in terms of accuracy, sensitivity, specificity, precision, FPR, FNR, NPV, FDR, F₁-score, and MCC. In the feature selection analysis on the Dermatology data set, the FDR of the proposed L-FF+NN model is 92%, 28%, 87%, 82%, and 78% superior to the state-of-the-art models such as GA+NN, FF+NN, PSO+NN, ABC+NN, and LA+NN, respectively.
The performance of proposed L-FF+NN over conventional models for feature selection in Dermatology model of the proposed L-FF+NN model is 33% and 1% better than the conventional models such as L-FF+KNN and L-FF+SVM in terms of accuracy. Thus, the entire experimental analysis confirms the effective performance of proposed classification over conventional methods. The interconnected proposed algorithm (L-FF) aids in enhancing the optimal solution.
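
The feature-selection objective summarized above, minimizing the correlation between the selected features, can be illustrated with a small fitness function. This is a sketch under stated assumptions, not the L-FF objective itself: it scores a candidate subset by the mean absolute pairwise Pearson correlation of its columns, and the names `pearson`, `correlation_fitness`, and the toy data are hypothetical.

```python
from itertools import combinations
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sx * sy)


def correlation_fitness(columns, selected):
    """Mean absolute pairwise correlation among the selected feature columns.

    Lower is better: an optimizer searching over `selected` (assumed here
    to be a list of column indices) would try to minimize this value.
    """
    pairs = list(combinations(selected, 2))
    if not pairs:
        return 0.0
    return mean(abs(pearson(columns[i], columns[j])) for i, j in pairs)


# Toy data: feature 1 is a scaled copy of feature 0; feature 2 is uncorrelated with it.
columns = [
    [1, 2, 3, 4],
    [2, 4, 6, 8],
    [1, -1, -1, 1],
]
print(correlation_fitness(columns, [0, 1]))  # redundant pair, close to 1
print(correlation_fitness(columns, [0, 2]))  # uncorrelated pair, close to 0
```

A metaheuristic such as firefly or lion optimization would evaluate a function of this kind for each candidate subset and keep the subsets with the lowest score.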

In future work, we will add noise to the original signals and improve the proposed algorithm to handle noisy signals and pictures with partial occlusion, illumination changes, and so on. In addition, future research will focus on deploying the model in real big data classification settings, such as business enterprise and commercial applications, for further performance improvement.

Footnotes

Author Disclosure Statement

No competing financial interests exist.

Funding Information

No funding was received for this article.
