Surrogate-assisted firefly algorithm for breast cancer detection

Abstract

Breast cancer is a severe disease for women health, however, with expensive diagnostic cost or obsolete medical technique, many patients are hard to obtain prompt medical treatment. Thus, efficient detection result of breast cancer while lower medical cost may be a promising way to protect women health. Breast cancer detection using all features will take a lot of time and computational resources. Thus, in this paper, we proposed a novel framework with surrogate-assisted firefly algorithm (FA) for breast cancer detection (SFA-BCD). As an advanced evolutionary algorithm (EA), FA is adopted to make feature selection, and the machine learning as classifier identify the breast cancer. Moreover, the surrogate model is utilized to decrease computation cost and expensive computation, which is the approximation function built by offline data to the real object function. The comprehensive experiments have been conducted under several breast cancer dataset derived from UCI. Experimental results verified that the proposed framework with surrogate-assisted FA significantly reduced the computation cost.

Keywords

Breast cancer detection firefly algorithm machine learning surrogate model feature selection

1 Introduction

The breast cancer seriously affected the normal functions of human organs and destroyed the normal tissue structure, which is a serious threat to women health and life. Such as chest pain, shortness of breath and difficulty in breathing. If the cancer metastasizes to the spine, there can be bone pain, and in severe cases, paralysis. The death rate is very high at the advanced stage of breast cancer. However, the reason of breast cancer is uncertain yet, thus it is hard to accurately detect the disease by the current medical technology at early stage. Many patients suffer much pains and even die because not obtained accurate result in detection early. In fact, the breast cancer can be detected by analyzing its influence factors, such as human living habits, genetic factors, environmental pollution and others. But, when analyze these influence factors, the doctor’s decision may diverse. The final decision may get a false-negative result or false-positive results caused by fatigue and inexperience doctor. Therefore, it is a good way to use the intelligent identification technology to deal with the complex analysis work and get useful information for doctor.

In the past decades, machine learning attracted a lot of attention, it obtained great effect in different fields for solving problems with unknown rules, including cancer detection. For example, Wu et al. [34] proposed a novel neural-network based algorithm, which refers to as entropy degradation method (EDM), to detect small cell lung cancer (SCLC) from computed tomography (CT) images. This research could facilitate early detection of lung cancers. Deng et al. [6] developed a novel weighted hierarchical adaptive voting ensemble (WHAVE) machine learning (ML) method for breast cancer detection, the result is more accurate than the individual ML methods. Abdar et al. [1] proposed a method to predict heart disease, which through finding a better decision tree to extract these most influence factors. Rajesh et al. [23] utilized support vector machine (SVM) to obtain higher sensitivity in classification of ECG heartbeats. Besides, data mining is also a widely used technique for information processing, which utilizes available data to find hidden but useful information that may not be directly recognizable.

Evolutionary algorithm [28, 29] has good performance in disease diagnose, which can be used to extract more essential feature information. Such as, Hassoon et al. [13] used the genetic algorithm (GA) to optimize boosted C5.0 algorithm to find rules for liver disease by considering a dataset, which can find the effective attributes in the liver disease diagnosis. The particle swarm optimization (PSO) [30] is also used to generate rules for heart disease in [3]. In addition, surrogate model method is a good way for optimizing the evolutionary algorithm, which developed along with the extensively employed for expensive black-box design optimization. Meanwhile, surrogate model method can efficiently assist the evolutionary algorithm to optimize the real-world problems with lower cost. Such as, Gong et al. [11] used surrogate model to select the best one among the multiple candidate offspring, which improve the performance of single operator-based methods in the majority of the functions. Moreover, Regis et al. [24] develops a surrogate-assisted evolutionary programming (EP) algorithm for constrained expensive black-box optimization that can be used for high-dimensional problems with many black-box inequality constraints. This proposed method is much better than a traditional EP.

In fact, breast cancer detection is a complex process, although costing long time and great money, it still may get a false results possibly. Based on these observations, we proposed a novel framework with surrogate-assisted FA, which provides a more accurate and efficient result in breast cancer detection. In this framework, the efficient FA is used to extract important feature information and dig out their hidden relation. Then the machine learning built a classifier by these extracted feature information. The classify performance is remarkable of machine learning technology, but it has expensive computation cost. Thus, we adopted the surrogate model fitting the real classification problem. The framework with surrogate-assisted FA significantly decrease the computation cost. We have conducted experiment on three breast cancer datasets derived from UCI.

The main contributions of this paper can be summarized as follows:

Using surrogate model assist the FA to extract feature of breast cancer.

Design a framework with surrogate-assisted FA for breast cancer detection (SFA-BCD) in which the FA can be replaced with other advanced FA variants, the machine learning method and surrogate model also can be replaced with else.

The rest of this paper is organized as follows. Section 2 briefly reviews the related works, including the machine learning techniques, evolutionary algorithm and surrogate model. Section 3 introduced the design method and implement procedure about our proposed disease detection framework. On the basis of different experiment results and analysis, verifying that the performance of our proposed framework is good in section 4. Finally, the conclusion of this paper has presented in section 5.

2 Preliminary works

In this section, we first briefly reviewed the contribution of the machine learning in medicine field, then introduced some application with FA in feature selection, followed by a brief explain of the effect using surrogate model to assist evolutionary algorithm.

2.1 Machine learning in cancer detection

With the increasing medicine technique, we can know that the cause of one disease always relate with many factors. For that, doctor need to deal with and analysis large patient data record before making decision. The classification is a kind of supervised data analysis technique, which is important for dealing with lager-scale data. Plenty of classification algorithms such as support vector machine (SVM), neural networks (NN), K-nearest neighbor (KNN) and etc, them have been widely used in various research fields. For example, Zheng et al. [40] adopted the K-means algorithm to identify the hidden patterns of the benign and malignant tumors respectively, then distinguish the incoming tumors using the new classifier built by the SVM, the accuracy is improved obviously. Bazazeh et al. [4] used SVM, random forest (RF) and bayesian networks (BN) to develop tools for physicians that can be used as an effective mechanism for early detection and diagnosis of breast cancer which will greatly enhance the survival rate of patients. Osareh et al. [18] utilized SVM, KNN and probabilistic neural networks (PNN) classifiers combined with signal-to-noise ratio feature ranking, sequential forward selection-based feature selection and principal component analysis feature extraction to distinguish between the benign and malignant tumours of breast. Valentini et al. [27] utilized bagged ensembles method of SVM to identify cancer disease, which can efficiently deal with the variance and the curse of dimensionality. In addition, Kong et al. [15] increase the accuracy in diagnosis and prediction of breast cancer, by using the novel jointly sparse discriminant analysis to explore and extract the key features. More detailed review about machine learning algorithm in breast cancer recognition can be found in [39]. However, using individual machine learning to identify cancer need great time because the number of cancer feature is vast frequently.

2.2 Feature selection method

At present, a real-world problems can be solved by many solutions, but requires much complex computation in most cases. For reducing the computation time and avoiding curse of dimensionality, feature selection method has obtained a lot of attention. The primary propose is that select a subset of variables from the input data, which can efficiently represent the information of input data. This technique is widely applied in many fields, such as increasing the prediction accurate in machine learning, the better understanding by extracting the more important information from data in pattern recognition. Gao et al. [10] designed a new feature redundancy term that considers the relevancy between a candidate feature and the class given each already-selected feature, which can deal with large redundant information meanwhile offering large new classification information problem. Evolutionary algorithm is a kind of efficient approach used in feature selection. Niu et al. [16] used the brain storm optimization algorithm (BSO) to identify the more important features and utilized reinforcement-behaved strategy to make feature selection. Alfarraj et al. [2] proposed a firefly gravitational ant colony optimization approach, which can successfully select the optimized feature that consists of the particular predictive analytics, the result show that it is better in sensitivity, specificity and accuracy. In addition, Emary et al. [7] introduced a system for feature selection, in which the novel FA is able to quickly find the optimal or near-optimal feature subset, therefore, the system can enhance validly the classification accuracy and reduce the attribute size. Peng et al. [21] proposed a novel composite FA (CoFA) to tackle the breast cancer recognition problem. Furthermore, there have more detailed survey about feature selection problems by evolutionary algorithm in [35].

2.3 Surrogate model technique

In order to solve real-word optimization problems, expensive function evaluations is a severe challenge, it requires great computational cost frequently. For that, the evolutionary algorithm with surrogate model has obtained a lot of attention from academia and industry. Surrogate model technique not only enhances the evolutionary algorithm but also reduce the resource consuming. For example, Yu et al. [38] introduced a surrogate-assisted hierarchical particle swarm optimizer in which the RBF model built by online data choose the best solution so that reduce expensive computation. Sun et al. [25] proposed a two-layer surrogate-assisted PSO (TLSAPSO) algorithm, in which a global and a number of local surrogate models are employed for fitness approximation. This method can reduce the number of fitness evaluations, especially computationally expensive problems. Because of useing GA and PSO to solve inverse heat conduction problem required higher computational costs than their gradient based alternatives, Vakili et al. [26] construct an approximation to the direct problem using a set of available data and the underlying physics of the problem, which can achieve low cost. Pan et al. [19] used neural network to predict the dominance relationship, which is not separately approximating the objective value of candidate solution and reference solution. In addition, Wang et al. [31] proposed a novel offline data-drive evolutionary algorithm, the bootstrap sampling technique is used to extract the train dataset, one RBF surrogate model pool is built by these train dataset, finally reduce the computational complexity through utilizing the RBF model to calculate the fitness value.

2.4 Classical FA

From the above mentioned, it is distinct that evolutionary algorithm is a good way for feature selection. FA is a simple but efficient algorithm for solving complex optimization problems in continuous search space. The flash and attractiveness is the core feature for firefly to evolve in nature, therefore, three rules are used to simplify the structure of FA from Yang et al. [36]. First, there is no gender difference for all fireflies. Second, the brighter firefly will attract the less brightness one. Third, the brightness of each firefly is depend on the fitness of objective function.

On the basis of FA, the attractiveness of each firefly is depend on its light intensity, thus the attractiveness β applied in [36] is shown as follows: $β (r) = β_{0} e^{- γ r^{2}}$ (1) where r is the Euclidean distance between two fireflies, and β₀ is the minimum attractiveness at r = 0. The parameter γ is a fixed value that usually set to 1, which indicates the light absorption coefficient. For random selected two fireflies X_i and X_j in same search space, their Euclidean distance r_ij can be calculated as follows: $r_{ij} = | | X_{i} - X_{j} | | = \sqrt{\sum_{d = 1}^{D} {(x_{id} - x_{jd})}^{2}}$ (2) where D is the problem dimension.

Any firefly X_i move to other brighter firefly X_j, the movement is conducted by $x_{id} (t + 1) = x_{id} (t) + β_{0} e^{- γ {r_{ij}}^{2}} (x_{jd} (t) - x_{id} (t)) + α ε$ (3) where X_id and X_jd is the dth dimension value of firefly i and j, respectively. Furthermore, ε is a random value and ε ∈ [-0.5, 0.5], α also is a random value that represents the step parameter and α ∈ [0, 1], t is the current iterations.

Algorithm 1 Pseudocode of FA

1: Randomly generate Np fireflies (solutions) as an initial population {X_i|i = 1, 2, . . . , Np};

2: Calculate the fitness value f of each firefly;

3: FEs = Np;

4: while FEs < MaxFEs do

5: for i = 1 : Np do

6: for j = 1 : Np do

7: if f (X_j) < f (X_i) then

8: Move firefly X_i towards X_j according to (3);

9: Calculate the fitness value of the new solution;

10: end if

11: end for

12: end for

13: FEs = FEs + Np;

14: end while

To assume the overall procedure of original FA, Algorithm 1 briefly described the main procedure of FA. f (.) represents fitness evaluation function, Np is population size, FEs is the current fitness evaluations, MaxFEs is the maximum fitness evaluations, and f (X_i) < f (X_j) indicates that firefly X_i is better than firefly X_j according to fitness value. More information about FA can be found in [9, 22].

We have observed three key points as follows:

The machine learning is widely applied in cancer detection, but still need to solve uncertainly influence factor and high dimension challenge when the data is huge.

FA is efficient for feature selection.

Surrogate model can reduce the expensive computation.

Thus, consideration of combining these approaches may be helpful for getting the better result of breast cancer detection while reducing the expensive computation time and cost. A framework with surrogate-assisted FA for cancer detection is proposed in this paper, using surrogate model assists the FA and reduce the expensive computation.

3 Methodlogy

Although combining machine learning and FA can improve the accuracy of the classifier, the computational time still is enormous. For this challenge, surrogate model is a good way to reduce the expensive computation time, which have obtained remarkable results for solving expensive and complexity optimization problems in various fields. Based on these considerations, we proposed a novel framework with surrogate-assisted FA for breast cancer detection.

3.1 Breast cancer detection framework

In this section, we proposed a novel framework with surrogate-assisted FA for breast cancer detection (SFA-BCD). The breast cancer detection framework is simple but effective, in which the machine learning is a classifier used to identify the breast cancer, the FA is used in feature selection, and the surrogate model efficiently assists FA to optimize problem.

As shown in Fig. 1, the proposed SFA-BCD framework can be divided into three main parts, i.e., breast cancer dataset collection, feature selection with FA and classification. First, the breast cancer datasets are collected, after making pre-processed, all features within dataset are ready to process. Then, FA generates offline data by obtained feature set and machine learning algorithm.

Fig. 1

Illustration of SFA-BCD framework

Algorithm 2 Feature selection by FA

1: Randomly generate Np fireflies (solutions) as an initial population {X_i|i = 1, 2, . . . , Np};

2: Calculate the fitness value f of each firefly;

3: FEs = Np;

4: while FEs < MaxFEs do

5: for i = 1 : Np do

6: ...

7: The procedure can be replaced with any individual selection strategies;

8: ...

9: if f (X_j) > f (X_i) then

10: ...

11: The procedure can be replaced with any FA variants;

12: Such as move firefly X_i towards X_j according to (3) in the original FA;

13: ...

14: Calculate the fitness value of the new solution according to (5);

15: end if

16: end for

17: FEs = FEs + Np;

18: end while

In order to built suitable surrogate model, the optimization problem needs to be properly formulated, including the specification of the fitness has been approximated by the surrogate model. In this paper, the surrogate model is built by the offline data, which can assist the FA feature selection. Finally, machine learning algorithm conducts breast cancer detection. In addition, for getting the reasonable feature subset from the complete feature set, during the feature selection process of FA, the dimension size is used as same as the number of attributes in the given dataset. All dimension variable are limited in the range [0, 1], if random value is smaller than surrogate probability, and when the variable value approach to 1, its corresponding feature is candidate to be selected in classification, the process of feature selection is formulated in (4), more detailed information can be known in Algorithm 2. If calculated by the surrogate model, there is no any process about the variable value.

$f_{ij} = {\begin{matrix} 1 \\ 0 \end{matrix} \begin{matrix} if X_{ij} > 0.5 \\ otherwise \end{matrix}$ (4) where X_ij is the dimension value of firefly individual i at dimension j.

In order to overcome the challenge that the false positive or true negative results from breast cancer detection, we employed three machine learning methods to analyse record data and output detection result, they are support vector machine (SVM), K-nearest neighbor (KNN) and Random forset (RF). For these methods, SVM kernel type is set to RBF, the number of neighbors are selected as 5 in KNN, otherwise, the number of trees in RF are set to 30. The theory of SVM is mainly inspired from statistical learning theory. Compare with other machine learning models, the major advantages of the SVM are that there is no local minima during learning and the generalization error does not depend on the dimension of the space. More information about SVM can refer [17]. The equation of the evaluation coefficients is follows: $Accuracy = (\sum_{i = 1}^{N} \frac{TP + TN}{TP + FP + TN + FN}) / N$ (5) where the Accuracy is calculated by those values: TN-True Negative, TP-True Positive, FN-False Negative, FP-False Positive. We adopted the 10-fold cross-validation method, and the N is equal to = 10.

Algorithm 3 SFA-BCD framework

1: Dataset pre-processing;

2: Collect offline data;

3: Build the surrogate function RBF and PRS according to offline data;

4: Randomly generate Np fireflies (solutions) as an initial population {X_i|i = 1, 2, . . . , Np};

5: Calculate the fitness value f of each firefly;

6: FEs = Np;

7: while FEs < MaxFEs do

8: for i = 1 : Np do

9: ...

10: The procedure can be replaced with any individual selection strategies;

11: ...

12: if f (X_j) > f (X_i) then

13: ...

14: The procedure can be replaced with any FA variants;

15: Such as move firefly X_i towards X_j according to (3) in the original FA;

16: ...

17: if Rand < SP then

18: Calculate the fitness value of the new solution according to (5);

19: else

20: Calculate the fitness value of the new solution according to (8);

21: end if

22: end if

23: end for

24: FEs = FEs + Np;

25: end while

3.2 Surrogate assisted based on data-drive

Combining the machine learning method and FA, the accuracy of cancer detection result has significantly enhanced. However, the time cost is still enormous, that not only increases expensive computation but also boosts the waiting time for patient. Surrogate model is a good way to reduce computational time while decrease cost, thus, we utilized two state-of-the-art surrogate models, polynomial response surfaces (PRS) and radial basis function (RBF). Box and Draper originally proposed PRS in [5], which can be simply built and its smoothing capability allows fast convergence of noisy functions during the search process, the second order polynomial model is formulated in (6). RBF is firstly proposed by Hardy in [12], which is a linear combinations of a radially symmetric function rely on the Euclidean distance between the sample point and predicted point, the RBF model is formulated as (7). More discussions and application on PRS and RBF can be known in the researches [8 , 37]. We used offline data to build the corresponding PRS and RBF surrogate model, the offline data comes from the evaluation records derived from the result of classifier with FA for cancer detection, that includes the individual dimension value and the final accuracy calculated by classifier. For that, an approximate function of the expensive fitness function is constructed. It can be described as:

$\tilde{y} (x) = β_{0} + \sum_{i = 1}^{N} β_{i} x_{i} + \sum_{i = 1}^{N} \sum_{j = 1}^{N} β_{ij} x_{i} x_{j}$ (6)

$\tilde{y} (x) = \sum_{i = 1}^{N} λ_{i} ϕ (| | x - x_{i} | |)$ (7)

$y = f (\tilde{y} (x), X_{i})$ (8) where $\tilde{y} (x)$ indicates the built surrogate model, X_i is the current individual and y is the approximate value of the breast cancer detection accuracy. The specific process are shown in Algorithm 3.

In the SFA-BCD framework, SP is the surrogate probability, using the classifier to calculate the fitness when the random value is smaller than the surrogate probability. Otherwise, choosing surrogate model assists the feature selection in FA by approximate function.

4 Experimental study

In order to verify the performance of the proposed breast cancer detection framework with surrogate-assisted FA (SFA-BCD), comprehensive experiments have conducted on real patient dataset, including three different breast cancer datasets. Three advanced firefly variants are also used in feature selection, which include LiFA [20], RaFA [32] and NaFA [33], all of them have obtained good performance under the benchmark function. For fairly comparing the experimental results, all classifier with FA variants independently run 10 times at each dataset, the mean accuracy are retained. The MaxFEs is set to 5E3 in the mentioned FA variants. Besides, other parameters are designed as same as its original structure. For all experiments in this paper, the 10-fold cross validation method is used to ensure the reliability of the result.

4.1 Test datasets

Three breast cancer datasets derived from UCI repository, which include breast cancer wisconsin datasets in original (BCWO) and breast cancer wisconsin dataset in diagnostic (BCWD), the other is breast cancer coimbra (COI). These datasets cover static information of participants, various habits, medical records and other characteristics. Table 1 introduces some basic information about these mentioned datasets, they are the number of instances and attributes, completeness, positive and negative. When the completeness is No, some value have lost in the dataset, the Negative indicates the number of malignant breast cancer, Positive means that the number of benign or absence breast cancer. More information can be found in Appendix.

Table 1
Breast cancer datasets

Dataset Instances Attributes Completeness Positive Negative

BCWO 699 10 No 241 458

BCWW 569 32 Yes 212 357

COI 116 10 Yes 52 64

Dataset	Instances	Attributes	Completeness	Positive	Negative
BCWO	699	10	No	241	458
BCWW	569	32	Yes	212	357
COI	116	10	Yes	52	64

4.2 Experiment result analysis

In this section, the experimental results with computation cost as well as the accuracy are presented to verify the proposed framework for breast cancer detection. For the purpose of having statistically sound conclusions, the Wilcoxon’s rank sum test results at a 0.05 significance level between proposed SFA-BCD and others are summarized, in which the symbol “-”, “+”, and “≈” represent that the performance of proposed SFA-BCD is worse than, better than and similar to that of related algorithm, respectively. Ratio coefficient is used to measure the time cost difference between the classifier with FA and proposed SFA-BCD, it is formulated as followings: $ratio = \frac{SFA_time}{FA_time}$ (9) where FA _ time is the computation cost of classifier with FA, SFA _ time is the computation cost of proposed SFA-BCD.

In Table 2, 3 and 4, the results of accuracy and time are calculated using classifier directly, the classifier with FA and the proposed SFA-BCD. The ratio represents the difference of time cost between FA and surrogate assisted FA, lower is better. For example, in Table 2, the accuracy of LiFA+SVM+RBF on BCWO dataset is 0.9748 (+,-), in which (+,-) means that LiFA+SVM+RBF better than SVM and worse than LiFA+SVM. At Table 2, we used LiFA as the feature selection method, RBF and PRS are also employed to assist the LiFA, three classifiers such as SVM, RF and KNN are adopted to classify the condition of breast cancer. From these results, the performance of proposed breast cancer detection framework is significantly better than using classifier directly. Such as the SVM classifier, the accuracy of used classifier is 0.9698, 0.7125 and 0.9787 at three datasets, but the accuracy rise to 0.9748, 0.8579 and 0.9824 in the proposed framework with RBF surrogate model, the PRS surrogate model also increases the accuracy to 0.9744, 0.859 and 0.9821. For the SVM algorithm with LiFA, the statistic results in accuracy has not remarkable difference compared with the proposed breast cancer detection framework, which means its effect in accuracy is similar with proposed SFA-BCD.

Table 2

Result obtained by LiFA for breast cancer detection at different datasets

Algorithm
Dataset	SVM		LiFA+SVM		LiFA+SVM+RBF			LiFA+SVM+PRS
	Accuracy	Time	Accuracy	Time	Accuracy	Time	Ratio	Accuracy	Time	Ratio
BCWO	0.9698	0.10	0.9755	493.97	0.9748 (+, -)	146.18	0.30	0.9744 (+, -)	149.13	0.30
COI	0.7125	0.11	0.8642	484.47	0.8579 (+, -)	194.91	0.40	0.8590 (+, -)	198.78	0.41
BCWD	0.9787	0.10	0.9840	503.10	0.9824 (+, -)	221.30	0.44	0.9821 (+, -)	220.08	0.44
	RF		LiFA+RF		LiFA+RF+RBF			LiFA+RF+PRS
	Accuracy	Time	Accuracy	Time	Accuracy	Time	Ratio	Accuracy	Time	Ratio
BCWO	0.9707	1.36	0.9797	6752.72	0.9785 (≈, -)	2051.36	0.30	0.9778 (≈, -)	2026.77	0.30
COI	0.7173	1.93	0.8691	8044.51	0.8631 (+, ≈)	3460.98	0.43	0.8643 (+, ≈)	3229.35	0.40
BCWD	0.9594	1.40	0.9787	7307.48	0.9766 (+, -)	2915.36	0.40	0.9752 (+, -)	2883.92	0.39
	KNN		LiFA+KNN		LiFA+KNN+RBF			LiFA+KNN+PRS
	Accuracy	Time	Accuracy	Time	Accuracy	Time	Ratio	Accuracy	Time	Ratio
BCWO	0.9731	0.06	0.9797	284.92	0.9795 (≈, ≈)	86.82	0.30	0.9792 (≈, ≈)	87.42	0.31
COI	0.6987	0.12	0.8467	361.36	0.8425 (+, ≈)	144.86	0.40	0.8430 (+, ≈)	145.80	0.40
BCWD	0.9664	0.06	0.9802	300.91	0.9789 (+, -)	122.18	0.41	0.9794 (+, ≈)	123.68	0.41

Table 3

Result obtained by RaFA for cancer detection at different datasets

Algorithm
Dataset	SVM		RaFA+SVM		RaFA+SVM+RBF			RaFA+SVM+PRS
	Accuracy	Time	Accuracy	Time	Accuracy	Time	Ratio	Accuracy	Time	Ratio
BCWO	0.9698	0.10	0.9757	251.83	0.9711 (≈, -)	111.19	0.44	0.9720 (+, -)	109.08	0.43
COI	0.7125	0.11	0.8646	257.37	0.8576 (+, -)	124.65	0.48	0.8562 (+, -)	122.88	0.48
BCWD	0.9787	0.10	0.9856	274.28	0.9838 (+, -)	125.58	0.46	0.9807 (+, -)	140.38	0.51
	RF		RaFA+RF		RaFA+RF+RBF			RaFA+RF+PRS
	Accuracy	Time	Accuracy	Time	Accuracy	Time	Ratio	Accuracy	Time	Ratio
BCWO	0.9707	1.36	0.9798	3906.68	0.9732 (≈, -)	1458.37	0.37	0.9742 (≈, -)	1365.92	0.35
COI	0.7173	1.93	0.8727	4484.31	0.8519 (+, -)	2112.33	0.47	0.8589 (+, -)	2042.57	0.46
BCWD	0.9594	1.40	0.9793	3884.35	0.9773 (+, ≈)	1824.38	0.47	0.9714 (+, -)	2123.04	0.55
	KNN		RaFA+KNN		RaFA+KNN+RBF			RaFA+KNN+PRS
	Accuracy	Time	Accuracy	Time	Accuracy	Time	Ratio	Accuracy	Time	Ratio
BCWO	0.9731	0.06	0.9802	156.79	0.9770 (-, -)	59.25	0.38	0.9772 (-, -)	59.35	0.38
COI	0.6987	0.12	0.8447	197.73	0.8351 (+, -)	93.47	0.47	0.8337 (+, -)	92.16	0.47
BCWD	0.9664	0.06	0.9826	163.53	0.9814 (+, ≈)	75.67	0.46	0.9738 (+, -)	90.48	0.55

Table 4

Result obtained by NaFA for breast cancer detection at different dataset

Algorithm
Dataset	SVM		NaFA+SVM		NaFA+SVM+RBF			NaFA+SVM+PRS
	Accuracy	Time	Accuracy	Time	Accuracy	Time	Ratio	Accuracy	Time	Ratio
BCWO	0.9698	0.10	0.9756	520.20	0.9751 (+, -)	155.71	0.30	0.9741 (+, -)	159.34	0.31
COI	0.7125	0.11	0.8646	485.22	0.8592 (+, ≈)	196.54	0.41	0.8580 (+, ≈)	180.76	0.37
BCWD	0.9787	0.10	0.9846	509.32	0.9840 (+, ≈)	202.57	0.40	0.9816 (+, -)	203.22	0.40
	RF		NaFA+RF		NaFA+RF+RBF			NaFA+RF+PRS
	Accuracy	Time	Accuracy	Time	Accuracy	Time	Ratio	Accuracy	Time	Ratio
BCWO	0.9707	1.36	0.9797	6704.98	0.9794 (≈, ≈)	1983.53	0.30	0.9780 (≈, -)	2015.82	0.30
COI	0.7173	1.93	0.8664	9526.02	0.8636 (+, ≈)	3293.51	0.35	0.8651 (+, ≈)	3219.88	0.34
BCWD	0.9594	1.40	0.9791	7316.77	0.9773 (+, -)	3044.55	0.42	0.9758 (+, -)	2986.54	0.41
	KNN		NaFA+KNN		NaFA+KNN+RBF			NaFA+KNN+PRS
	Accuracy	Time	Accuracy	Time	Accuracy	Time	Ratio	Accuracy	Time	Ratio
BCWO	0.9731	0.06	0.9807	291.55	0.9798 (+, ≈)	90.22	0.31	0.9792 (+, -)	88.61	0.30
COI	0.6987	0.12	0.8495	361.51	0.8460 (+, ≈)	146.13	0.40	0.8467 (+, ≈)	146.92	0.41
BCWD	0.9664	0.06	0.9809	298.73	0.9798 (+, ≈)	120.72	0.40	0.9793 (+, -)	121.68	0.41

However, it is noticeable that computation cost of proposed breast cancer detection framework is sharply decreased, value of ratio can describe clearly the time difference, smaller is better. We still used the SVM classifier as a sample to demonstrate, the time of SVM with LiFA is 493.97, 484.47 and 503.10 obtained from these three datasets. But the time of proposed framework with RBF model is 146.18, 194.91 and 221.30, respectively, the proposed framework with PRS model costs time is 149.13, 198.78 and 220.08. Obviously, the computation cost is highly decreased, the ratio is 0.30, 0.40 and 0.44 between SVM with LiFA and the proposed framework with RBF model, meanwhile, the ratio compared with the proposed framework with PRS model is 0.30, 0.41 and 0.44. These results show that surrogate-assisted LiFA reduced more than half computation cost to SVM with LiFA.

To comprehensively verify the performance of proposed SFA-BCD, we also used RF and KNN classifier to conduct breast cancer recognise. Moreover, employing RaFA and NaFA to prove that the framework is flexible, other advanced optimization algorithm could be used in this framework. Fig. 2 shows the computation cost of NaFA, NaFA+RBF, and NaFA+PRS on SVM, RF, and KNN classifier, in which the mean cost of algorithm on all three datasets is used to compete. By carefully reviewing Fig. 2, one can note that using surrogate model to assist the NaFA reduced more than half cost. From these results in Table 2, 3 and 4, we can conclude that the proposed SFA-BCD can combine with different optimization algorithms and machine learning algorithms to obtain good result, which not only enhanced the accuracy but also decreased the computation cost.

Fig. 2

Computation cost of NaFA, NaFA+RBF, and NaFA+PRS on SVM, RF, and KNN classifier.

4.3 Comparing the performance under different SP

Because the value of surrogate probability decided the times using surrogate function to calculate, we primarily analyzed the effect under different SP in this section. In this experiment, the classifier just utilized KNN, other structure is as well as above mentioned, the value of SP is set to 0.3, 0.4 and 0.5, respectively. From Table 5, the ratio is gradually rise when the SP increase, such as, from 0.30 to 0.30, then to 0.52 at the KNN with LiFA in feature selection based RBF surrogate model. However, with the time cost increasing, the accuracy is rise, from 0.9794 to 0.9795 and then to 0.9801 when SP is 0.3, 0.4 and 0.5. Because the SP is bigger, the times using KNN to calculate the accuracy is rise, the time cost also will increase.

Table 5
Result obtained by algorithms for breast cancer detection at different SP

Algorithm

SP LiFA+KNN LiFA+KNN+RBF LiFA+KNN+PRS

Accuracy Time Accuracy Time Ratio Accuracy Time Ratio

0.3 0.9794 84.99 0.30 0.9792 84.98 0.30

0.4 0.9797 284.92 0.9795 86.08 0.30 0.9792 87.42 0.31

0.5 0.9801 146.82 0.52 0.9801 146.28 0.51

RaFA+KNN RaFA+KNN+RBF RaFA+KNN+PRS

Accuracy Time Accuracy Time Ratio Accuracy Time Ratio

0.3 0.9752 59.14 0.38 0.9770 59.37 0.38

0.4 0.9802 156.79 0.9761 71.35 0.46 0.9780 76.77 0.49

0.5 0.9779 86.69 0.55 0.9782 86.72 0.55

NaFA+KNN NaFA+KNN+RBF NaFA+KNN+PRS

Accuracy Time Accuracy Time Ratio Accuracy Time Ratio

0.3 0.9797 87.89 0.30 0.9794 87.14 0.30

0.4 0.9807 291.55 0.9798 90.24 0.31 0.9792 88.61 0.30

0.5 0.9804 141.88 0.49 0.9798 142.45 0.49

5 Conclusion

This paper proposed a breast cancer detection framework with surrogate-assisted FA (SFA-BCD) using classifier to calculate the accuracy, in which the FA variants is used to feature selection, besides, using surrogate model technique assists the FA variants while reduce the computation cost. On the basis of experiment analysis, the proposed breast cancer detection framework with surrogate-assisted FA has significantly reduce the computation cost and obtained good performance.

The computation cost of the proposed framework is sharply decreased. The time of SVM with LiFA is 493.97, 484.47 and 503.10 obtained from three test datasets. But the time of proposed framework with RBF model is 146.18, 194.91 and 221.30, respectively. The proposed framework with PRS model costs time is 149.13, 198.78 and 220.08. Meanwhile, the ratio compared with the proposed framework with PRS model is 0.30, 0.41 and 0.44. These results show that using surrogate model to assist the FA variant reduced more than half computation cost to SVM with FA variant.

In addition, the framework can integrate different evaluation algorithms, machine learning algorithm and surrogate model techniques, the procedure is easy to complete. So, it can be applied in other disease detection. In the future, we will do more work in the structure design of surrogate function according to offline data.

The Matlab source code of SFA-BCD can be downloaded from H. Peng’s homepage: https://whuph.github.io/index.html.

Footnotes

Appendix

Details of the three breast cancer datasets can be found by the following links:

(BCWO) https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29

(BCWD) https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

(COI) https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Coimbra

Acknowledgments

This work is supported by the National Natural Science Foundation of China (61763019, 62041603), the Jiangxi Provincial Natural Science Foundation (20202BABL202019), and the Science and Technology Plan Projects of Jiangxi Provincial Education Department (GJJ180891).

References

Abdar

, Using decision trees in data mining for predicting factors influencing of heart disease, Carpathian Journal of Electronic Computer Engineering 8(2) (2015), 31–36.

Alfarraj

, Alzubi

and Tolba

, Optimized feature selection algorithm based on fireflies with gravitational ant colony algorithm for big data predictive analytics, Neural Computing and Applications 31(5) (2019), 1391–1403.

Alkeshuosh

A.H.

, Moghadam

M.Z.

, AlMansoori

and Abdar

, Using PSO algorithm for producing best rules in diagnosis of heart disease, In: 2017 International Conference on Computer and Aspplications (ICCA), IEEE, (2017), 306–311.

Bazazeh

and Shubair

, Comparative study of machine learning algorithms for breast cancer detection and diagnosis, In: 5th International Conference on Electronic Devices, Systems and Applications (ICEDSA), IEEE, (2016), 1–4.

Box

G.E.

and Draper

N.R.

, Empirical model-building and response surfaces, Wiley New York, (1987).

Deng

and Perkowski

, A novel weighted hierarchical adaptive voting ensemble machine learning method for breast cancer detection, In: 2015 IEEE International Symposium on Multiple-Valued Logic, IEEE, (2015), 115–120.

Emary

, Zawbaa

H.M.

, Ghany

K.K.A.

, Hassanien

A.E.

and Pârv

, Firefly optimization algorithm for feature selection, In: Proceedings of the 7th Balkan Conference on Informatics Conference, ACM, (2015), 1–7.

Fan

, Huang

and Wang

, Sparsity-promoting polynomial response surface: A new surrogate model for response prediction, Advances in Engineering Software 77 (2014), 48–65.

Fister

, Fister

Jr, , Yang

X.S.

and Brest

, A comprehensive review of firefly algorithms, Swarm and Evolutionary Computation 13 (2013), 34–46.

10.

Gao

, Hu

and Zhang

, Feature redundancy term variation for mutual information-based feature selection, Applied Intelligence 50 (2020), 1272–1288.

11.

Gong

, Zhou

and Cai

, A multioperator search strategy based on cheap surrogate models for evolutionary optimization, IEEE Transactions on Evolutionary Computation 19(5) (2015), 746–758.

12.

Hardy

R.L.

, Multiquadric equations of topography and other irregular surfaces, Journal of Geophysical Research 76(8) (1971), 1905–1915.

13.

Hassoon

, Kouhi

M.S.

, Zomorodi-Moghadam

and Abdar

, Rule optimization of boosted c5.0 classification using genetic algorithm for liver disease prediction, In: 2017 International Conference on Computer and Applications (ICCA), IEEE, (2017), 299–305.

14.

Jin

, A comprehensive survey of fitness approximation in evolutionary computation, Soft Computing 9(1) (2005), 3–12.

15.

Kong

, Lai

, Wang

and Liu

, Breast cancer discriminant feature analysis for diagnosis via jointly sparse learning, Neurocomputing 177 (2016), 198–205.

16.

Niu

, Yang

and Wang

, Feature selection using a reinforcement-behaved brain storm optimization, In: International Conference on Intelligent Computing, Springer (2019), 672–681.

17.

Noble

W.S.

, What is a support vector machine?, Nature Biotechnology 24(12) (2006), 1565–1567.

18.

Osareh

and Shadgar

, Machine learning techniques to diagnose breast cancer, In: 5th International Symposium on Health Informatics and Bioinformatics, IEEE, (2010), 114–120.

19.

Pan

, He

, Tian

, Wang

, Zhang

and Jin

, A classification-based surrogate-assisted evolutionary algorithm for expensive many-objective optimization, IEEE Transactions on Evolutionary Computation 23(1) (2018), 74–88.

20.

Peng

, He

, Deng

and Wu

, Firefly algorithm with luciferase inhibition mechanism, IEEE Access 7 (2019), 120189–120201.

21.

Peng

, Zhu

, Deng

, Yu

and Wu

, Composite firefly algorithm for breast cancer recognition, Concurrency and Computation: Practice and Experience (2020), 1–12. https:doi.org10.1002cpe.6032

22.

Peng

, Zhu

, Deng

and Wu

, Enhancing firefly algorithm with courtship learning, Information Sciences 543 (2021), 18–42.

23.

Rajesh

K.N.

and Dhuli

, Classification of ecg heartbeats using nonlinear decomposition methods and support vector machine, Computers in Biology and Medicine 87 (2017), 271–284.

24.

Regis

R.G.

, Evolutionary programming for high-dimensional constrained expensive black-box optimization using radial basis functions, IEEE Transactions on Evolutionary Computation 18(3) (2013), 326–347.

25.

Sun

, Jin

, Zeng

and Yu

, A two-layer surrogateassisted particle swarm optimization algorithm, Soft Computing 19(6) (2015), 1461–1475.

26.

Vakili

and Gadala

, Low cost surrogate model based evolutionary optimization solvers for inverse heat conduction problem, International Journal of Heat and Mass Transfer 56(1-2) (2013), 263–273.

27.

Valentini

, Muselli

and Ru

, no, Cancer recognition with bagged ensembles of support vector machines, Neurocomputing 56 (2004), 461–466.

28.

Wang

, Li

and Zhou

, An estimation of distribution algorithm for mixed-variable Newsvendor problems, IEEE Transactions on Evolutionary Computation 24(3) (2020), 479–493.

29.

Wang

, Li

, Liao

and Yan

, An ensemble learning based prediction strategy for dynamic multi-objective optimization, Applied Soft Computing 96 (2020), 106592.

30.

Wang

, Zhang

and Zhou

, A particle swarm optimization algorithm for mixed-variable optimization problems, Swarm and Evolutionary Computation 60 (2021), 100808.

31.

Wang

, Jin

, Sun

and Doherty

, O ine data-driven evolutionary optimization using selective surrogate ensembles, IEEE Transactions on Evolutionary Computation 23(2) (2018), 203–216.

32.

Wang

, Wang

, Sun

and Rahnamayan

, Firefly algorithm with random attraction, International Journal of Bio-Inspired Computation 8(1) (2016), 33–41.

33.

Wang

, Wang

, Zhou

, Sun

, Zhao

, Yu

and Cui

, Firefly algorithm with neighborhood attraction, Information Sciences 382 (2017), 374–387.

34.

and Zhao

, Small-cell lung cancer detection using a supervised machine learning algorithm, In: 2017 International Symposium on Computer Science and Intelligent Controls (ISCSIC), IEEE, (2017), 88–91.

35.

Xue

, Zhang

, Browne

W.N.

and Yao

, A survey on evolutionary computation approaches to feature selection, IEEE Transactions on Evolutionary Computation 20(4) (2015), 606–626.

36.

Yang

X.S.

, Firefly algorithm, stochastic test functions and design optimisation, International Journal of Bio Inspired Computation 2(2) (2010), 78–84.

37.

Yao

, Chen

, Huang

and van Tooren

, A surrogatebased optimization method with rbf neural network enhanced by linear interpolation and hybrid infill strategy, Optimization Methods and Software 29(2) (2014), 406–429.

38.

, Tan

, Zeng

, Sun

and Jin

, Surrogate-assisted hierarchical particle swarm optimization, Information Sciences 454 (2018), 59–72.

39.

Yue

, Wang

, Chen

, Payne

and Liu

, Machine learning with applications in breast cancer diagnosis and prognosis, Designs 2(2) (2018), 13.

40.

Zheng

, Yoon

S.W.

and Lam

S.S.

, Breast cancer diagnosis based on feature extraction using a hybrid of k-means and support vector machine algorithms, Expert Systems with Applications 41(4) (2014), 1476–1482.

Surrogate-assisted firefly algorithm for breast cancer detection

Abstract

Keywords

1 Introduction

2 Preliminary works

2.1 Machine learning in cancer detection

2.2 Feature selection method

2.3 Surrogate model technique

2.4 Classical FA

3.1 Breast cancer detection framework

4.1 Test datasets

Table 1 Breast cancer datasets Dataset Instances Attributes Completeness Positive Negative BCWO 699 10 No 241 458 BCWW 569 32 Yes 212 357 COI 116 10 Yes 52 64

Footnotes

Appendix

Acknowledgments

References

Table 1
Breast cancer datasets

Dataset Instances Attributes Completeness Positive Negative

BCWO 699 10 No 241 458

BCWW 569 32 Yes 212 357

COI 116 10 Yes 52 64