Abstract
Breast cancer has long been a life-threatening disease and is a leading cause of death among women. The challenges of screening such tumors manually can be overcome by computer-aided diagnosis, which helps radiologists make precise decisions. Selecting significant features is crucial for prediction accuracy. This work proposes HGAHBA-DNN, a Deep Neural Network (DNN) driven by a hybrid of the Genetic Algorithm (GA) and the Honey Badger Algorithm (HBA), for concurrent optimal feature selection and parameter optimization; the selected features and parameters are then fed into the DNN to predict breast cancer. The method combines the parallel processing and efficient feedback of HBA with the strong global convergence of GA. It is evaluated on the Wisconsin Original Breast Cancer (WOBC), Wisconsin Diagnostic Breast Cancer (WDBC), and Surveillance, Epidemiology, and End Results (SEER) datasets, and its performance is validated using accuracy, precision, recall, and F1-score. The experimental results show that HGAHBA-DNN obtains accuracies of 99.42%, 99.84%, and 92.44% on the WOBC, WDBC, and SEER datasets, respectively, which is higher than the other state-of-the-art methods compared.
Keywords
Introduction
Breast cancer is one of the most common causes of death among women. It is a condition in which the cells in the breast tissue alter their structure and divide abnormally, producing a lump or mass [1]. By analyzing the risk factors and building a classification model, a case can be quickly classified as malignant or benign [2]. The World Health Organization (WHO) has reported breast cancer as a major cause of mortality. Researchers have developed many invasive and non-invasive methods for detecting breast cancer. Conventional diagnostic procedures take longer and are less accurate because of human error. Automated computer-based diagnosis overcomes these shortcomings of manual diagnosis and allows redundant procedures and biopsies to be avoided. Various machine learning (ML) approaches are used in medical diagnosis, where feature extraction is done manually. Researchers have applied many ML models to classify malignant and benign tumors, such as k-nearest neighbors (k-NN), regression, multilayer perceptrons (MLPs), Random Forest (RF), Naive Bayes (NB), decision trees (DTs), and support vector machines (SVMs).
The authors of [3] used data mining techniques and ML methods to predict breast cancer and found that the Naive Bayes (NB) approach gave better prediction accuracy. Although the reported prediction scores are respectable, there is still room for improvement through better feature engineering. The key driving force behind this work is the choice of the best features. Selecting relevant features is a crucial step for efficient classification, because more features lead to higher computational cost and redundant information. Researchers have developed many techniques to address this challenging issue [4, 5]. Evolutionary algorithms (EAs) draw inspiration from the process of natural evolution. The most common of these is the genetic algorithm [6], which works on the idea of the ‘survival of the fittest’.
Nature-inspired metaheuristic algorithms are frequently employed to tackle optimization problems and find the best solutions. Combining two metaheuristic techniques reduces time consumption, human error, and labor intensity [7, 8]. Integrating two metaheuristic algorithms in this way also enables simultaneous feature selection and parameter optimization, which supports accurate breast cancer detection. Detection accuracy is raised by addressing issues such as premature and slow convergence and by fine-tuning the parameters in line with the primary objective of selecting the ideal features and parameters. The HGAHBA-DNN model was created to overcome these difficulties in breast cancer diagnosis. First, a mutation-based analysis is used for the Genetic Algorithm (GA); then the GA combined with HBA, supported by the DNN, is used to select the best features. Although HBA has the benefit of dynamic search ability, it can become trapped in local optima as a result of the loss of population diversity, which is particularly problematic for challenging optimization problems. The proposed method uses a genetic algorithm to address this issue, so that it can converge in fewer iterations. Additionally, combining HBA with GA gives a better balance between exploitation and exploration, which boosts productivity and helps in resolving low-dimensional problems.
The major contributions of this paper are as follows: (1) An automated system is designed using a deep learning approach that classifies breast cancer on standard datasets. (2) The developed approach is implemented for feature selection and parameter optimization. (3) The optimization techniques are integrated with the proposed deep learning framework to escape from local optima. (4) HGAHBA-DNN is applied to the Wisconsin breast cancer datasets and the SEER dataset for best feature subset selection and parameter tuning. (5) The HGAHBA-DNN model is evaluated in terms of accuracy. (6) The statistical significance of the proposed work is tested using the ANOVA test.
The rest of this article is structured as follows. Section 2 discusses related work on existing methodologies. Section 3 describes the materials and methods used in this study. Section 4 presents the results and discusses the observations. Finally, Section 5 concludes with a summary of the key outcomes of the research.
Related works
Many recent approaches have focused on feature selection, and this section reviews some of the related research. The authors of [9] fused particle swarm optimization (PSO) and grey wolf optimization with k-NN to select the best features, and evaluated the model on several datasets from the UCI repository. Mafarja and Seyedali [10] developed two variants of the whale optimization algorithm (WOA) for feature selection, one with tournament and roulette wheel selection and another with crossover and mutation operators. Their algorithm outperformed other wrapper approaches in experiments on several datasets from the UCI repository. In WOA, individuals arrive at the ideal solution by modeling the behavior of encircling prey, whereas other traditional algorithms generate new solutions by taking the structural relationships of the current generation into account. Suganthi and Malathy [11] created a bat algorithm for feature selection, tested it on the WDBC dataset, and found that their method performed better. Egzi Zorarpci et al. [12] developed a hybrid feature selection method that combines Artificial Bee Colony (ABC) optimization with differential evolution, and their results showed good performance in classification tasks. ABC has also been combined with rough set theory, which analyzes the relation of the features with every class for better feature selection. The grasshopper optimization algorithm (GOA) is a recent nature-inspired algorithm that mimics the swarming behavior of grasshoppers [13]; it was designed for solving continuous optimization problems. Ant colony optimization (ACO) is generally combined with other optimization algorithms such as PSO, cuckoo search optimization, and GAs for feature selection, and such combinations have been found effective. Lakshman et al. [14] proposed a tool for detecting breast cancer from histopathology images by separating abnormal cells from normal cells.
An SVM algorithm was used in their work to extract the morphological features, and the method was validated on a benchmark image dataset, achieving a higher F-score than other conventional frameworks. Sudir et al. [15] applied ML algorithms and five different feature selection techniques to microarray datasets; in their work, MFO serves as the optimizer for classifying the disease. On the WOBC and WDBC datasets, Omotehinwa et al. [16] used Ant Lion optimization (ALO) for feature selection and MLP for classification. Niu et al. [17] determined that the size of the tumor and the condition of the lymph nodes are crucial factors in the diagnosis of cancer. On the WDBC dataset, Sanam et al. [18] presented an ML framework with GB, SVM, ANN, and MLP. A hybrid MLP was employed as the classifier, and the system achieved 99.12% accuracy. Choi et al. [19] adopted a deep learning technique, applying five different classification methods with 5-fold cross-validation. Their results improved on six existing models, including DNN, on two UCI datasets, WOBC and WDBC. Wang et al. [20] diagnosed breast cancer by combining 12 variants of SVMs in an ensemble based on the weighted area under the receiver operating characteristic curve. They conducted experiments on the WOBC, WDBC, and SEER datasets and achieved higher prediction accuracy with reduced variance.
For the classification of breast cancer, Sharma et al. [21] created an ensemble learning model based on neural networks and extra trees, which provided better accuracy when different variables were chosen. Abdar et al. [22] developed a nested ensemble model for breast cancer diagnosis on the WDBC dataset, using k-fold cross-validation with stacking and voting as classifiers. They compared their work with single classifiers and previous works and found their method to be efficient. Wang Sutong et al. [23] developed an enhanced random forest rule extraction method to derive classification rules for the analysis of cancer. Their method was evaluated on the WOBC, WDBC, and SEER datasets, where it outperformed the compared methods in accuracy. On the WDBC dataset, Naji et al. [24] employed different ML techniques, such as logistic regression, decision trees, and k-NN, to predict benign and malignant tumors with greater accuracy. Badr et al. [25] used GWO to enhance the performance of the SVM for cancer diagnosis and tested it on the WDBC dataset and Electronic Health Records. The authors observed a significant improvement in accuracy on both datasets in their experiments. Lu et al. [26] developed a gradient-boosting method using a genetic algorithm for cancer diagnosis, tested it on the SEER dataset, and achieved good results.
All of these methods share the goal of classifying diseases using hybrid classification methods. Although numerous papers are available on cancer diagnosis, diagnostic accuracy still needs to be improved. Many earlier works were examined on only one database and lack feature selection. Most earlier works also did not use any statistical analysis to support their results. The proposed HGAHBA approach selects the best features and parameters using HBA and GA and improves breast cancer classification performance. It offers a methodical means of achieving the required outcomes by combining several optimization techniques with machine learning techniques.
Proposed methodology
The best features are selected to increase the accuracy of cancer diagnosis. GA and HBA are combined in the proposed approach to escape from local optima: the GA-based HBA-DNN method selects the crucial features, and the optimized parameters are used for better classification. Among the many healthcare challenges open to the research community, one is the accurate prediction of disease outcomes. The proposed approach uses an evolutionary algorithm. The tasks carried out for prediction are (1) data pre-processing, (2) HGAHBA-DNN utilization, (3) DNN-based classification, and (4) performance calculation.
The proposed framework, shown in Fig. 1, consists of pre-processing, feature selection, parameter optimization, and classification modules integrated into one system. For pre-processing, min-max normalization is used; a hybrid of the genetic algorithm and the honey badger algorithm is then used for feature selection and parameter optimization. Finally, after the optimal features are selected, the DNN classifier is used to classify the dataset.

Workflow of the proposed framework.
Data pre-processing is a crucial stage that enhances data quality and helps extract the most significant information from the large volume of available data. In the stage preceding feature selection, the data are normalized using min-max normalization: for every feature, the minimum value is transformed to 0, the maximum value to 1, and every other value to a value between 0 and 1.
The normalized values are computed using Equation (1).
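As an illustration of the normalization described above, the following minimal sketch scales each feature column with x' = (x - min) / (max - min). The function names and the handling of constant columns (mapped to 0.0) are assumptions made for this example and are not taken from the paper.

```python
def min_max_normalize(column):
    """Scale a list of numeric values to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant feature: nothing to scale, map every value to 0.0 (assumption).
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def normalize_dataset(rows):
    """Apply min-max scaling independently to every feature (column)."""
    columns = list(zip(*rows))
    scaled = [min_max_normalize(col) for col in columns]
    return [list(r) for r in zip(*scaled)]


# Toy data: two features on very different scales.
sample = [[1, 200], [5, 400], [10, 300]]
print(normalize_dataset(sample))
```

Scaling each column separately keeps features with large raw ranges from dominating distance-based steps later in the pipeline.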
The GA is an optimization algorithm proposed by Holland, based on Darwin’s theory of biological evolution. It mimics the evolutionary process to choose optimum solutions. The GA comprises the following stages: initial population, evaluation, selection, crossover, and mutation. Its main advantage is its inherently good global search ability.
In the initial population stage, a set of chromosomes is created randomly. In the evaluation phase, a fitness function determines how close a candidate solution is to meeting the goals. In the selection phase, a significant portion of the population is selected and passed to the next generation, which in turn is evaluated with the fitness function. The crossover phase selects two parents to generate an offspring that is ideally better than the parents. In the mutation phase, some randomness is introduced, since mutant offspring may acquire better features; this helps keep the algorithm from converging prematurely to a local optimum.
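The GA stages described above can be sketched as follows for binary chromosomes, such as a feature mask. This is a generic illustration, not the paper's implementation: the tournament selection, single-point crossover, bit-flip mutation, default rates, and the name `run_ga` are all assumptions chosen for brevity.

```python
import random


def run_ga(fitness, n_genes, pop_size=20, generations=30,
           crossover_rate=0.8, mutation_rate=0.05, seed=0):
    """Minimal generational GA over binary chromosomes (maximizes fitness)."""
    rng = random.Random(seed)
    # Initial population: random bit strings.
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    # Evaluation: remember the fittest individual seen so far.
    best = max(population, key=fitness)

    def tournament():
        # Selection: the fitter of two randomly drawn individuals is chosen.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        offspring = []
        while len(offspring) < pop_size:
            p1, p2 = tournament(), tournament()
            child = p1[:]
            if rng.random() < crossover_rate:
                # Single-point crossover between the two parents.
                cut = rng.randint(1, n_genes - 1)
                child = p1[:cut] + p2[cut:]
            # Bit-flip mutation injects randomness to preserve diversity.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            offspring.append(child)
        population = offspring
        best = max(population + [best], key=fitness)
    return best


# Toy objective: maximize the number of ones in the chromosome.
print(run_ga(sum, n_genes=10))
```

In a feature selection setting the toy objective would be replaced by a classifier-based fitness, with each bit marking whether a feature is kept.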
The honey badger algorithm (HBA) was introduced by Hashim et al. [27] in 2021. It imitates the honey badger, which inhabits semi-deserts and jungles in Southwest Asia and Africa and uses its sense of smell and movement to find prey.
Initialization process: In this stage, the upper (KU) and lower (KL) boundaries are used to determine the initial set of solutions. The solutions are generated stochastically using Equation (2).
The smell intensity of each candidate, In_i, is given by Equation (3).
The density factor σ, which balances the exploration and exploitation phases, is given by Equation (4).
Update positions: At this point the candidate positions are updated to K_new through either the digging phase or the honey phase.
Digging phase: In this phase, the honey badger uses its sense of smell to locate and catch its prey, as given by Equation (5).
Honey phase: In this phase, the honey badger searches for the beehive by following the honey bird, as given by Equation (7).
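The equations referenced above are not reproduced in this text, so the following sketch follows the commonly cited formulation of HBA rather than the paper's exact Equations (2) to (7). The constants `c=2.0` and `beta=6.0`, the 50/50 choice between the digging and honey phases, the per-coordinate update, and all function names are assumptions.

```python
import math
import random


def init_population(n, lower, upper, rng):
    """Random start: K = KL + r * (KU - KL) for each coordinate."""
    return [[lo + rng.random() * (hi - lo) for lo, hi in zip(lower, upper)]
            for _ in range(n)]


def density_factor(t, t_max, c=2.0):
    """sigma shrinks with iteration t, moving from exploration to exploitation."""
    return c * math.exp(-t / t_max)


def hba_step(population, prey, t, t_max, lower, upper, rng, beta=6.0):
    """One position update of every candidate around the current best (prey)."""
    sigma = density_factor(t, t_max)
    n = len(population)
    updated = []
    for i, k in enumerate(population):
        neighbour = population[(i + 1) % n]
        # Smell intensity: source strength over squared distance to the prey.
        strength = sum((a - b) ** 2 for a, b in zip(k, neighbour))
        dist_sq = sum((p - a) ** 2 for p, a in zip(prey, k))
        intensity = rng.random() * strength / (4 * math.pi * dist_sq + 1e-12)
        flag = 1 if rng.random() <= 0.5 else -1  # search direction
        digging = rng.random() < 0.5
        k_new = []
        for j, (kj, pj) in enumerate(zip(k, prey)):
            dj = pj - kj
            if digging:
                r3, r4, r5 = rng.random(), rng.random(), rng.random()
                wobble = abs(math.cos(2 * math.pi * r4)
                             * (1 - math.cos(2 * math.pi * r5)))
                step = (flag * beta * intensity * pj
                        + flag * r3 * sigma * dj * wobble)
            else:
                # Honey phase: move towards the prey, guided by sigma.
                step = flag * rng.random() * sigma * dj
            # Keep the new position inside the search bounds.
            k_new.append(min(max(pj + step, lower[j]), upper[j]))
        updated.append(k_new)
    return updated


rng = random.Random(1)
lower, upper = [0.0, 0.0], [1.0, 1.0]
pop = init_population(5, lower, upper, rng)
print(hba_step(pop, pop[0], t=1, t_max=10, lower=lower, upper=upper, rng=rng))
```

A full optimizer would evaluate the fitness of each updated candidate, keep it only if it improves, and refresh the prey position after every iteration.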
The fitness is calculated using Equation (8). The KNN classifier [28] is used to evaluate the fitness function because of its simplicity.
The proposed HGAHBA optimization is shown in Fig. 2. Merging the HBA and GA approaches aims to improve the efficiency and accuracy of the developed framework. The parameters for the HBA and GA are listed in Table 1.
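Equation (8) is not reproduced in this text. As a hedged illustration of a KNN-based wrapper fitness, the sketch below scores a binary feature mask by a weighted sum of leave-one-out k-NN error and the fraction of features kept. This weighted form, the weight of 0.99, the choice k=3, and the function names are assumptions and may differ from the paper's definition.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    nearest = sorted(range(len(train_X)),
                     key=lambda i: math.dist(train_X[i], query))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def wrapper_fitness(mask, X, y, k=3, weight=0.99):
    """Lower is better: weighted leave-one-out k-NN error plus subset size."""
    selected = [j for j, bit in enumerate(mask) if bit]
    if not selected:
        return 1.0  # an empty subset cannot classify anything
    Xs = [[row[j] for j in selected] for row in X]
    errors = 0
    for i in range(len(Xs)):
        # Hold out sample i and classify it from the remaining samples.
        train_X = Xs[:i] + Xs[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if knn_predict(train_X, train_y, Xs[i], k) != y[i]:
            errors += 1
    error_rate = errors / len(Xs)
    return weight * error_rate + (1 - weight) * len(selected) / len(mask)


# Toy data: feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 0.9], [0.1, 0.1], [0.2, 0.8], [0.9, 0.2], [1.0, 0.85], [0.8, 0.15]]
y = [0, 0, 0, 1, 1, 1]
print(wrapper_fitness([1, 0], X, y), wrapper_fitness([0, 1], X, y))
```

On the toy data the mask that keeps only the informative feature receives the lower (better) score, which is the behaviour a wrapper-based search relies on.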

Workflow of the proposed HGAHBA approach.
Parameter settings of HGAHBA
The DNN has a similar structure to an artificial neural network with complex hierarchies. It has been proposed as a way of producing more predictive models.
The network is made up of multiple layers of computational units, usually connected in a feed-forward manner. Each neuron has direct connections to the neurons present in subsequent layers.
For a given input x, the output of hidden layer ‘m’ is h = f(b + Wx), where b denotes the bias, W is the weight matrix, and f is the nonlinear activation function, usually the sigmoid.
The output is calculated as o = g(b + Wh), where the function g can be the identity function for a regression problem. The parameter values are specified in Table 2. The weight matrix W is initialized with random values. The steps carried out in this network are as follows:
Parameter values of the DNN
The first step is model initialization, where the weights are initialized randomly.
The inputs are passed through the network in the forward direction and the output is calculated.
The weights are updated using the weight update rule until they converge.
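The three steps above can be illustrated with a small one-hidden-layer network trained by plain gradient descent. This is a sketch under stated assumptions, not the paper's model: separate parameters W1, b1, W2, b2 are used for the two layers, both activations are sigmoid, the loss is cross-entropy, and the layer size, learning rate, and epoch count are arbitrary.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W1, b1, W2, b2):
    """h = f(b1 + W1 x), o = g(b2 + W2 h), with sigmoid f and g."""
    h = [sigmoid(b + sum(w * xi for w, xi in zip(row, x)))
         for row, b in zip(W1, b1)]
    o = sigmoid(b2 + sum(w * hi for w, hi in zip(W2, h)))
    return h, o


def train(data, n_hidden=3, lr=0.5, epochs=2000, seed=0):
    rng = random.Random(seed)
    n_in = len(data[0][0])
    # Step 1: random weight initialization.
    W1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    W2 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    b2 = 0.0
    for _ in range(epochs):
        for x, target in data:
            # Step 2: forward pass.
            h, o = forward(x, W1, b1, W2, b2)
            # Step 3: gradient-descent update (cross-entropy, sigmoid output).
            delta_o = o - target
            delta_h = [delta_o * W2[j] * h[j] * (1 - h[j])
                       for j in range(n_hidden)]
            for j in range(n_hidden):
                W2[j] -= lr * delta_o * h[j]
                b1[j] -= lr * delta_h[j]
                for i in range(n_in):
                    W1[j][i] -= lr * delta_h[j] * x[i]
            b2 -= lr * delta_o
    return W1, b1, W2, b2


# Toy problem: logical OR of two binary inputs.
data = [([0.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 0.0], 1), ([1.0, 1.0], 1)]
params = train(data)
print([round(forward(x, *params)[1], 3) for x, _ in data])
```

Here training stops after a fixed number of epochs; a convergence check on the loss would be the more faithful reading of "until they converge".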
Batch sizes that are powers of 2 [29] are considered for tuning the deep neural network model. The activation functions used in our experiments are the Rectified Linear Unit (ReLU), sigmoid, tanh, and softmax, which are important factors in predicting the output of the neural network. Optimizers are a crucial component that adjusts the parameters of the neural network, such as the weights and learning rate, to minimize the loss. The optimizers considered in this work are Root Mean Square Propagation (RMSprop), Adaptive Moment Estimation (Adam), and Stochastic Gradient Descent (SGD). The learning rate is another vital parameter: a higher learning rate can cause divergence from the objective, while a lower one leads to slow learning. The number of epochs also plays an important part in training: too many epochs lead to overfitting, while too few lead to underfitting.
The pseudocode of the hybrid honey badger and genetic algorithm is presented in Algorithm 1. The merged procedure is referred to as the HBA-GA algorithm; it starts by generating a random set of feasible solutions.
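One way a metaheuristic can search over such hyperparameters is to decode a real-valued candidate vector into one choice per hyperparameter. The sketch below is illustrative only: the candidate values in `SEARCH_SPACE` and the `decode` helper are hypothetical and are not the search ranges used in the paper.

```python
# Illustrative candidate values only (assumptions), one list per hyperparameter.
SEARCH_SPACE = {
    "batch_size": [16, 32, 64, 128],  # powers of 2
    "activation": ["relu", "sigmoid", "tanh", "softmax"],
    "optimizer": ["rmsprop", "adam", "sgd"],
    "learning_rate": [0.0001, 0.001, 0.01, 0.1],
    "epochs": [50, 100, 150, 200],
}


def decode(vector, space=SEARCH_SPACE):
    """Map a real-valued vector in [0, 1] to one choice per hyperparameter."""
    config = {}
    for gene, (name, choices) in zip(vector, space.items()):
        # Scale the gene to an index and clamp so that 1.0 maps to the last choice.
        index = min(int(gene * len(choices)), len(choices) - 1)
        config[name] = choices[index]
    return config


print(decode([0.0, 0.3, 0.99, 1.0, 0.5]))
```

With this encoding, the same candidate vector can be extended with a feature mask so that features and hyperparameters are optimized together.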
The datasets used for experimentation and analysis of breast cancer in our work are the WOBC [30], WDBC [30], and SEER [31] datasets. The WOBC dataset consists of 699 observations with 10 attributes. The attribute bare nuclei has 16 missing values in this dataset. These missing values are filled with the median value of the feature, because all attributes in this dataset have continuous values. The WDBC dataset has 569 observations and 32 attributes. It contains 212 malignant cases and 357 benign cases and does not contain any missing values. The SEER program provides a reputable source of cancer data in the United States. The November 2020 submission (covering 1975 to 2017) is considered in this work; it comprises more than 10 million cancer instances with 133 attributes. Most parameters, such as tumor size, have only been included in the dataset for diagnoses after the year 2004. Therefore, only the records with a diagnosis year after 2004 are considered, and instances with unknown values are excluded from the analysis. In this work, 10 categorical attributes, 4 continuous attributes, and 3 additional attributes (vital status recode, cause of death, and survival months) were considered as predictors [19, 22], as shown in Table 3. After pre-processing and data cleaning, 137,386 instances were obtained, of which 127,276 were positive and 10,110 were negative. Random oversampling is a technique in which examples of the minority class are randomly duplicated, while in random undersampling, examples from the majority class are randomly deleted.
Random oversampling and undersampling are combined in this work to balance the classes, which results in better model performance and also helps avoid overfitting.
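A minimal sketch of combining the two resampling techniques is shown below. The paper does not state how the target class size was chosen, so `target_per_class` and the function name `balance_classes` are assumptions made for this example.

```python
import random


def balance_classes(rows, labels, target_per_class, seed=0):
    """Resample every class to target_per_class examples.

    Classes at or above the target are randomly undersampled (without
    replacement); smaller classes are randomly oversampled by duplication.
    """
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        if len(members) >= target_per_class:
            chosen = rng.sample(members, target_per_class)
        else:
            extra = rng.choices(members, k=target_per_class - len(members))
            chosen = members + extra
        out_rows.extend(chosen)
        out_labels.extend([label] * target_per_class)
    return out_rows, out_labels


# Toy data: ten majority examples (label 1) and two minority examples (label 0).
rows = list(range(12))
labels = [1] * 10 + [0] * 2
print(balance_classes(rows, labels, target_per_class=5))
```

Resampling should be applied to the training portion only, so that duplicated minority examples do not leak into the test data.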
Attributes of the SEER dataset
The dataset is first split into training and test sets, and the data are normalized during the pre-processing phase. As a baseline, the accuracy is evaluated using all the features without any feature selection. Then the features are selected using the proposed feature selection strategy and evaluated. The comparison is made in terms of time and accuracy.
In traditional parametric optimization, only one parameter is changed at a time while all other variables are held constant. This conventional approach consumes a lot of time and computational resources. A GA and an HBA are blended to find the best combination of parameters and features. HBA searches for the best individuals, while the GA prevents overfitting. The proposed HGAHBA-DNN approach improves on the traditional GA method by reducing the number of generations and improving execution time and prediction accuracy.
All computations are carried out using the Google Colab platform, a GPU-based cloud framework, and the scikit-learn software package is used for the experiments. The hyperparameters used for DNN and the optimal values obtained by the proposed hybrid GA and HBA approach are given in Table 4.
Chosen hyperparameters to be optimized and the optimal values obtained
The performance of the proposed approach is improved through a better feature selection strategy, which reduces the selection overhead in the feature set of the SEER dataset. 10-fold cross-validation is performed to validate the performance, and the proposed approach achieves higher accuracy than the state-of-the-art methods.
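For reference, the index bookkeeping behind 10-fold cross-validation can be sketched as follows. The text does not say whether the folds were stratified or shuffled, so this plain shuffled split and the helper name `k_fold_indices` are assumptions.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# Example with 569 samples, the size of the WDBC dataset.
for train_idx, test_idx in k_fold_indices(569):
    print(len(train_idx), len(test_idx))
```

Each sample appears in exactly one test fold, so averaging the ten fold scores uses every observation once for testing.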
Accuracy, recall, precision, and F1-score are used to evaluate the algorithms. They are derived from the confusion matrix, a tabular summary of the classification that shows the rates of correctly classified and misclassified instances. The accuracy of a model indicates its viability and performance and is estimated using Equation (9).
Recall is the proportion of actual positive instances that are correctly identified, as given in Equation (10).
Precision is the fraction of relevant instances among the retrieved instances, as given in Equation (11).
The F1-score (F1-measure) is the harmonic mean of recall and precision, as given in Equation (12).
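The four metrics can be computed directly from the confusion-matrix counts using their standard definitions. In this sketch, the argument names (tp, fp, fn, tn), the returned dictionary layout, and the zero fallback for empty denominators are assumptions; the counts in the example are made up and are not results from the paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, recall, precision and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1}


# Made-up counts for illustration.
print(classification_metrics(tp=40, fp=10, fn=10, tn=40))
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, which matters on imbalanced data such as SEER.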
Figure 3 shows the convergence and efficiency of the proposed HGAHBA method on the chosen datasets. The proposed method converges faster than the other methods on all three datasets. The hybrid algorithm swiftly reaches a stable optimal solution during the first iteration, whereas the other algorithms need a considerable number of iterations. The hybrid algorithm is therefore preferable because it requires less time to converge.

Convergence plot of the HGAHBA with other algorithms of (a) WOBC dataset; (b) WDBC dataset; (c) SEER dataset.
This demonstrates the advantage of hybridizing GA and HBA, particularly the ability to converge quickly to a local optimum during the early search stages. The suggested method addresses the issue that HBA typically progresses slowly in its initial iterations because exploration and exploitation are not balanced.
To demonstrate the performance of the developed feature selection algorithm, experiments are performed to compare HGAHBA with conventional algorithms. The classification accuracy achieved by the conventional algorithms was observed to be much lower than that of HGAHBA. The experimental results show that the developed model achieves better classification accuracy and reduced time complexity. The proposed HGAHBA-DNN is applied to the three chosen public breast cancer datasets, and the results are tabulated in Table 5.
Performance estimation of classifiers on the WOBC, WDBC & SEER datasets
The model accuracy and loss for all three datasets, before and after feature selection and optimization, are shown in Fig. 4. The curves make the effect of feature selection and optimization visible for each dataset.

Visualization of the proposed model accuracy and loss characteristics observed over the chosen three breast cancer datasets.
Tables 6, 7 and 8 show the results of the repeatability test (15 runs) of the various feature selection algorithms on the WOBC, WDBC and SEER datasets, respectively. The best, average, worst and standard deviation results are listed in each table. The proposed HGAHBA approach achieved the highest prediction rate.
Repeatability test on the WOBC dataset
Repeatability test on the WDBC dataset
Repeatability test on the SEER dataset
An Analysis of Variance (ANOVA) statistical test is carried out to determine the statistical significance of the results. If the p-value is less than 0.05, the results are statistically significant; otherwise, they are not. The observed p-values are below this threshold; therefore, the difference between the proposed HGAHBA method and the other methods is statistically significant.
Tables 9, 10 and 11 present the results of the ANOVA test based on the average classification accuracy observed over 15 runs of the proposed and the other feature selection techniques. The performance of the proposed work is compared with existing related works in Table 12. The suggested HGAHBA-DNN technique improves on traditional algorithms because it cuts the number of generations. It also outperforms the other conventional algorithms in prediction performance and precision. The suggested model can be applied to similar breast cancer datasets in the future.
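To make the test concrete, the sketch below computes the one-way ANOVA F statistic for several groups of scores, for example per-run accuracies of different algorithms. The numbers in the example are made up for illustration and are not results from the paper. Converting F to a p-value requires the F distribution, which the Python standard library does not provide; a statistics package such as SciPy (`scipy.stats.f_oneway`) is one way to obtain it.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    Raises ZeroDivisionError if there is no within-group variation.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up scores for three hypothetical algorithms, three runs each.
print(one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]]))
```

A large F value means the differences between group means are large relative to the run-to-run spread within each group.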
ANOVA test summary of the WOBC dataset
ANOVA test summary of the WDBC dataset
ANOVA test summary of the SEER dataset
Comparison of performance analysis value with state-of-art methodologies
*Not Applicable.
Despite the promising results achieved by our proposed HGAHBA-DNN framework for breast cancer prediction, it is important to acknowledge certain limitations. Because it performs feature selection (FS) and parameter tuning simultaneously, the suggested method needs more computational time than existing breast cancer prediction schemes.
Feature selection is one of the most challenging tasks in breast cancer analysis, and researchers have studied it intensively by considering various relevant factors. In this work, a hybrid of GA and HBA is used for feature selection and parameter optimization on the WOBC, WDBC, and SEER datasets. In the proposed HGAHBA algorithm, HBA is embedded into GA as a local search to escape from local optima and enhance search performance. The proposed method provides a promising means of feature selection, as it increases accuracy and significantly reduces computational complexity.
Data availability statement
Publicly available datasets were analyzed in this study. This data can be found at WOBC: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29; WDBC: https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic).
