Abstract
OBJECTIVE:
To develop an ensemble deep transfer learning model based on CT images for predicting pathologic complete response (pCR) in breast cancer patients undergoing neoadjuvant chemotherapy (NAC).
METHODS:
The data were obtained from the public dataset ‘QIN-Breast’ from The Cancer Imaging Archive (TCIA). CT images were acquired before and after the first cycle of NAC. CT images of 121 breast cancer patients were used to train and test the model. Among these patients, 58 achieved a pCR and 63 showed a non-pCR based on pathological examination of surgical specimens after NAC. The dataset was split into training and testing subsets at a ratio of 7:3. In addition, the number of training samples was increased from 656 to 1,968 by image augmentation. Two deep transfer learning models, namely DenseNet201 and ResNet152V2, and an ensemble model formed by concatenating the two models were trained and tested using CT images.
RESULTS:
The ensemble model obtained the highest accuracy, 100%, on the testing dataset. It also achieved 100% recall, precision, and f1-score. This suggests that the ensemble generalizes better than either base model. Although the AUC of the proposed ensemble exceeded those of the two base models (DenseNet201 and ResNet152V2) by only 0.004 and 0.003, respectively, even a small gain in model quality can be important in medical research. t-SNE revealed that in the proposed ensemble, no points were clustered into the wrong class. These results indicate the strong performance of the proposed ensemble.
CONCLUSION:
The ensemble model predicted breast cancer response to first-cycle NAC better than either DenseNet201 or ResNet152V2 alone.
Introduction
Worldwide, breast cancer is the most common cancer in women, accounting for almost one-third of all female malignancies [1]. Previous studies indicate that various factors, including genetic and environmental factors, may be correlated with breast cancer initiation and progression [2]. Patients whose disease is detected at an early stage have a better survival rate. Therefore, diagnosing breast cancer in its early stages is critical for treatment [3]. Imaging procedures such as mammography, ultrasound (US), positron emission tomography (PET), single-photon emission computed tomography (SPECT), magnetic resonance imaging (MRI), and computed tomography (CT) can be used for diagnosing and monitoring patients [4–7]. Mammography is the modality most commonly used for screening [8]. Screening mammography employs X-ray imaging to detect breast cancer before a mass can be felt. The aim is to treat cancer earlier, when a cure is more likely.
In non-metastatic breast cancer, the primary purposes of therapy are to destroy cancer cells and prevent recurrence. Systemic breast cancer therapy can be preoperative (neoadjuvant), postoperative (adjuvant), or both [9–11]. Adjuvant chemotherapy has been shown to reduce both cancer recurrence and mortality. Neoadjuvant chemotherapy (NAC) is usually given before surgery in locally advanced breast cancers [12–14]. For some patients, NAC can increase breast conservation rates and allow for a smaller surgical resection volume [15]. Regardless of tumor volume, NAC gives a chance to evaluate the cytotoxic activity of a treatment regimen. Encouraging data suggest that clinical response-guided NAC can increase survival. After completion of NAC, the remaining tumor burden is an important prognostic indicator, and patients who obtain a pathologic complete response have excellent long-term survival [16, 17]. However, a substantial number of patients display a non-complete response to therapy. Therefore, to avoid unnecessary chemotherapy, new approaches for predicting breast tumor response to NAC before or during treatment are needed. In addition, an early prediction of non-response to NAC may allow a switch to alternative therapies [18–20].
Various imaging modalities can be used to assess response to NAC before surgery, including PET/CT and MRI. PET/CT studies reveal that decreased tumor metabolism can distinguish poor from good responders to NAC [19]. The apparent diffusion coefficient (ADC) map and Ktrans, derived from DW-MRI and DCE-MRI images respectively, have been extensively investigated as prognostic and predictive biomarkers. DCE-MRI has been examined as a surrogate biomarker for assessing NAC response in breast cancer [21]. It has been shown that the change in Ktrans, a pharmacokinetic parameter, between the pre- and post-therapy scans is the best predictor of pathologic non-response [20]. In the study by Padhani et al. [22], changes in tumor size and in the range of the Ktrans histograms were equally able to predict the final response. Additionally, it has been reported that ADC can differentiate responders from non-responders after NAC, while other studies revealed no relationship between ADC and treatment response [23].
Recently, deep learning procedures have been used extensively in medical imaging for diagnosing various diseases [24]. These procedures automatically extract features from a given image without any user intervention. Deep learning procedures can diagnose a disease and provide suitable prediction models to help physicians develop effective treatment methods [25, 26]. One of these strategies is transfer learning, a sub-branch of deep learning. Transfer learning improves learning in a new task by transferring knowledge from a related task that has already been learned [27]. To reduce detection errors, an ensemble deep transfer learning model has been proposed [28]. The ensemble learning framework follows the classic transfer learning scheme, in which a prediction model is trained on the source dataset and later refined on the target dataset, and extends it to ensemble prediction by training and refining multiple models. An ensemble of transfer learning networks may be a robust strategy because it reduces errors: the combined networks can provide better results with fewer errors than the individual networks [29].
Among the promising imaging techniques, DWI and PET/CT are commonly used in clinical oncology. They play a definite role in the assessment of tumor response to NAC, and each has its own strengths in visualizing and investigating intra-tumoral details. However, imaging devices such as DWI and PET/CT are scarce in remote places and in low- and middle-income countries. Training deep transfer learning models on CT images, with careful reference to the PET and DWI images, may help overcome these limits. To the best of our knowledge, no previous study has predicted the response to NAC in breast cancer patients using ensemble deep transfer learning models based on CT images, with careful reference to the PET images. The present study was performed to determine whether an ensemble deep transfer learning model can increase the accuracy of predicting pathologic complete response (pCR) in breast cancer patients undergoing NAC. We hypothesized that the combined models would have predictive accuracy superior to that of the single models in isolation. The main contributions of this study are as follows: the proposed model is not only helpful in predicting breast cancer response but is also able to differentiate pCR patients from non-complete response (non-pCR) patients; and the proposed model was compared with competitive models in terms of performance metrics such as accuracy, f1-score, area under the ROC curve (AUC), precision, and recall.
Materials and methods
Dataset, CT acquisition, and pre-processing
The data were obtained from the public dataset “The Cancer Imaging Archive (TCIA) collection” [30, 31]. These data include longitudinal PET/CT images collected to study response to NAC in breast cancer. In this study, only CT images were used, with careful reference to the PET images. Images were acquired before and after the first cycle of NAC. Images were obtained with a GE Discovery STE scanner (GE Healthcare, Waukesha, WI, USA). The imaging parameters used for the CT scan were as follows: tube current = 80 mAs, tube voltage = 120 kVp, and pitch = 1.675/1.
In this retrospective study, CT images of 121 breast cancer patients were used to train and test the models. Of these, 58 patients achieved a pCR, while the remaining 63 patients showed a non-pCR following NAC treatment. The pCR or non-pCR status was determined by a breast pathologist at the time of definitive surgery. Patients without any invasive cancer in the breast or lymph nodes were defined as pCR, and patients with invasive cancer were defined as non-pCR. Patients who progressed before surgery were also defined as non-pCR. The 121 patients were split into training and testing sets at a ratio of 7:3: 85 patients (656 axial slices) were assigned to the training set and 36 patients (700 axial slices) to the testing set. Of note, the slices of any one patient were never divided between the training and testing sets. In addition, the number of training samples was increased from 656 to 1968 by image augmentation. Matched breast CT and PET images for pre- and post-NAC cases are shown in Figs. 1 and 2.
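The patient-level split described above can be illustrated with a minimal sketch. It assumes a random shuffle with a fixed seed and uses hypothetical patient identifiers; the paper does not state how patients were assigned to each subset, so this is only one way to obtain a 7:3 split with no patient shared between subsets.

```python
import random


def split_by_patient(patient_ids, train_fraction=0.7, seed=0):
    """Split unique patient IDs so that no patient appears in both subsets."""
    unique_ids = sorted(set(patient_ids))
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility
    rng.shuffle(unique_ids)
    n_train = round(train_fraction * len(unique_ids))
    return set(unique_ids[:n_train]), set(unique_ids[n_train:])


# Hypothetical identifiers standing in for the 121 patients.
patients = [f"P{i:03d}" for i in range(121)]
train_ids, test_ids = split_by_patient(patients)
print(len(train_ids), len(test_ids))  # 85 36
```

All slices of a patient would then follow that patient's identifier into the training or testing set, which prevents slices from the same tumor from leaking across the split.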

Matched breast CT and PET images before (a and b) and after (c and d) the first cycle of NAC for one pCR case.

Matched breast CT and PET images before (a and b) and after (c and d) the first cycle of NAC for one non-pCR case.
Before the training process, image pre-processing was performed for all selected CT slices in which breast tumors were visible. The images were kept in DICOM format, and their format was not changed before image normalization. With validation by an experienced oncologist and careful reference to the PET images, each tumor was cropped by drawing a box around the volume of interest. The cropped volumes were then used as CNN inputs with a size of 224×224×3. Of note, the OpenCV library was used to convert each one-channel image into a three-channel image by merging copies of the single channel. To guarantee comparable voxel intensities across images, image normalization was performed so that the intensities of each slice were scaled to the range 0 to 1. Figure 3 shows the pre-processing applied to the input images.
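The normalization and channel-replication steps can be sketched as follows. This is a pure-Python illustration on nested lists, not the authors' OpenCV pipeline; it assumes min-max scaling per slice (the paper only states that intensities were normalized to between 0 and 1) and omits the resizing to 224×224. The function names and the sample intensities are hypothetical.

```python
def min_max_normalize(slice_2d):
    """Rescale a 2D slice (list of rows) to the range [0, 1]."""
    flat = [value for row in slice_2d for value in row]
    lowest, highest = min(flat), max(flat)
    if highest == lowest:  # constant slice: avoid division by zero
        return [[0.0 for _ in row] for row in slice_2d]
    return [[(value - lowest) / (highest - lowest) for value in row]
            for row in slice_2d]


def to_three_channels(slice_2d):
    """Replicate the single channel three times, giving an H x W x 3 nested list."""
    return [[[value, value, value] for value in row] for row in slice_2d]


crop = [[-100, 0], [50, 100]]  # hypothetical intensities of a cropped tumor
normalized = min_max_normalize(crop)
rgb_like = to_three_channels(normalized)
print(normalized)  # [[0.0, 0.5], [0.75, 1.0]]
```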

Tumor cropping and pre-processing.
Deep learning is widely used and has been applied successfully to image classification and analysis tasks, where an efficient feature representation of an image is learned automatically. However, to use deep learning effectively, extensive data are usually required to learn the true distribution of the image samples. When data are scarce, the data-hungry nature of deep learning makes it challenging to apply to diagnosing and predicting diseases. To overcome this problem, we used transfer learning models for feature extraction. Features were extracted from the CT images in our dataset using off-the-shelf convolutional neural network (CNN) models pre-trained on ImageNet, namely ResNet152V2 [32] and DenseNet201 [33], together with an ensemble deep transfer learning model that concatenates ResNet152V2 and DenseNet201. The ensemble is proposed to reduce detection errors by combining the outputs of independently trained transfer learning models, as shown in Fig. 4. The core structure of the employed models is explained in the following subsections.

Diagram of an ensemble deep transfer learning model by considering two models.
Deep learning is an end-to-end machine learning (ML) procedure that can automatically extract features layer by layer. Compared with manual feature selection, the features extracted by deep learning models are more abstract, include more specific information, and better represent the data. As a deep learning method, the CNN has attracted much attention because of its local perception mechanism, a design inspired by the structure of the human visual cortex [34]. Feature extraction in a CNN architecture is achieved through a sequence of convolutional, pooling, and fully connected layers. The core of a CNN model is the convolutional layer and the pooling layer. The CNN is trained with a back-propagation algorithm. In the convolution layer, the input feature maps are convolved with a kernel, and the output feature map is then produced by an activation function [35]. The kernel is a filter that is used to extract features from the images. The size and number of kernels are two key hyperparameters that define the convolution operation. The kernel size is typically 3×3 but is sometimes 5×5 or 7×7. A convolutional layer usually comprises multiple convolutional kernels, producing multiple output feature maps from the input feature map. Each has a size of (X − Y + 1) × (X − Y + 1) M, where X is the input image’s size, and Y is the kernel’s size. The operation of the convolutional layer is as follows:
where the output of the current layer is
The outputs of the convolution layers are passed through an activation function. The ReLU (Rectified Linear Unit) is the most widely used activation function in convolutional neural networks [36]. The ReLU activation function is defined as follows:
$$f(a) = \max(0, a)$$
ReLU works by thresholding values at 0. Briefly, when a < 0, it outputs 0, and when a ≥ 0, it outputs a itself, that is, a linear function of its input. Networks with ReLU train considerably faster than networks with sigmoidal (logistic) activation functions. ReLU also helps to avoid the widespread saturation that hampers gradient-based training and degrades the performance of deep learning models. In addition, by accelerating the convergence of stochastic gradient descent, ReLU reduces the training time.
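The output size (X − Y + 1) and the ReLU thresholding described above can be checked with a small sketch. It assumes a square single-channel input, a square kernel, stride 1, no padding, and no bias term, and it computes the sliding-window product without flipping the kernel, as most CNN libraries do. The image and kernel values are hypothetical.

```python
def relu(a):
    """Return 0 for negative inputs and the input itself otherwise."""
    return max(0.0, a)


def conv2d_valid(image, kernel):
    """Slide a Y x Y kernel over an X x X image (stride 1, no padding), then apply ReLU."""
    x, y = len(image), len(kernel)
    out_size = x - y + 1  # matches the (X - Y + 1) size given in the text
    output = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            total = sum(image[i + u][j + v] * kernel[u][v]
                        for u in range(y) for v in range(y))
            row.append(relu(total))
        output.append(row)
    return output


image = [[1, 2, 0, 1],
         [0, 1, 3, 1],
         [2, 1, 0, 0],
         [1, 0, 1, 2]]
kernel = [[1, 0, -1],
          [1, 0, -1],
          [1, 0, -1]]  # hypothetical vertical-edge filter
feature_map = conv2d_valid(image, kernel)
```

With a 4×4 input and a 3×3 kernel the feature map is 2×2, and the three negative responses are set to zero by ReLU.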
A pooling layer is usually inserted after a convolution layer. Pooling reduces the size of the feature maps and the number of network parameters [37]. By eliminating some connections between convolutional layers, pooling layers implement a form of spatial transformation invariance and reduce the computational complexity for upper layers. The pooling layer thus down-samples the feature maps from the previous layer and provides new feature maps with a compressed resolution. An ideal pooling procedure extracts only useful information and discards irrelevant details. The average pooling layer is used in the proposed architecture. As shown in Fig. 5, an average pooling layer performs down-sampling by dividing the input into rectangular pooling regions and computing the average value of each region. Dropout is added to regularize the convolutional networks: randomly selected neurons are ignored during training [38]. The dropped neurons make no contribution to the forward pass or to back-propagation during the training phase, which helps to avoid over-fitting. In the current study, the dropout rates are set to 0.1 and 0.2 for the two-class classification problem.

Example of average pooling operation.
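The average pooling operation can be sketched in a few lines. The 2×2 pooling region with a stride equal to the region size is an assumption made for illustration; the paper does not state the pooling size. The input values are hypothetical.

```python
def average_pool(feature_map, pool=2):
    """Non-overlapping average pooling over pool x pool regions (stride = pool)."""
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for i in range(0, height - pool + 1, pool):
        row = []
        for j in range(0, width - pool + 1, pool):
            region = [feature_map[i + u][j + v]
                      for u in range(pool) for v in range(pool)]
            row.append(sum(region) / len(region))
        pooled.append(row)
    return pooled


fmap = [[1, 3, 2, 4],
        [5, 7, 6, 8],
        [0, 2, 1, 3],
        [4, 6, 5, 7]]
pooled = average_pool(fmap)
print(pooled)  # [[4.0, 5.0], [3.0, 4.0]]
```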
The output feature maps of the final convolution layer are typically flattened into a single vector. The fully connected layers have full connections to the neurons of the preceding layer [39]. The inputs to these layers are multiplied by the layer’s weight matrix to produce the output. In our proposed model, a fully connected dense layer with a sigmoid activation function is used for two-class classification. The sigmoid function maps its input to a value between 0 and 1, which can be interpreted as a probability. This function is used in many feed-forward neural networks because of its nonlinearity and the computational simplicity of its derivative.
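The simplicity of the sigmoid derivative mentioned above comes from the identity σ'(x) = σ(x)(1 − σ(x)), so the derivative can be obtained from the already computed output. A minimal sketch (the function names are illustrative):

```python
import math


def sigmoid(x):
    """Logistic function, mapping any real input to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    """Derivative expressed through the function value: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)


print(sigmoid(0.0), sigmoid_derivative(0.0))  # 0.5 0.25
```

This direct form is adequate for moderate inputs; for very large negative inputs `math.exp(-x)` can overflow, which numerical libraries avoid with a rearranged formula.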
Figure 6 shows the architectures of DenseNet201 and ResNet152V2. In both networks, a parameter-transfer approach is used: the layers of the pre-trained network are frozen, so their weights remain constant during learning. We used weights pre-trained on the ImageNet dataset as the starting point for DenseNet201, ResNet152V2, and the ensemble model. The two architectures differ in their number of layers and number of parameters. ResNet uses only one preceding feature map, while DenseNet uses the features of all preceding convolutional blocks [40]. The structure used for both models was as follows: the last convolutional block + AveragePooling + Flatten + Dropout (0.1) + Dense (64, activation=’ReLU’) + Dropout (0.2) + Dense (64, activation=’ReLU’) + Dense (2, activation=’sigmoid’). The pre-trained model, extended with these additional layers and fine-tuned, was thus used for feature extraction. In the final dense layer, the sigmoid activation function is used for the two-class classification problem. The models were trained for 100 epochs with a batch size of 16. For fine-tuning, the Adam optimizer was used with lr = 0.0001. To prevent overfitting, regularization was performed using an early stopping criterion.

Architectures of deep transfer learning models, a and b are DenseNet201 and ResNet152V2 models.
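To give a sense of how small the added classification head is compared with the frozen backbones, the sketch below counts the trainable parameters of the Dense (64) + Dense (64) + Dense (2) stack described above. It assumes that the flattened feature vector has 1920 elements for DenseNet201 and 2048 for ResNet152V2, which holds only if the average pooling reduces each final feature map to a single value; the paper does not state the pooling size, so the resulting counts are illustrative rather than the authors' figures. Dropout layers add no parameters.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def head_params(feature_dim, hidden=(64, 64), n_classes=2):
    """Trainable parameters of the dense stack placed on top of a frozen backbone."""
    sizes = [feature_dim, *hidden, n_classes]
    return sum(dense_params(a, b) for a, b in zip(sizes, sizes[1:]))


densenet_head = head_params(1920)  # assumed flattened size for DenseNet201
resnet_head = head_params(2048)    # assumed flattened size for ResNet152V2
print(densenet_head, resnet_head)  # 127234 135426
```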
An ensemble of deep CNNs can be a powerful method for obtaining better results. It works by combining the decisions obtained from several models. Figure 7 shows the architecture of the proposed ensemble model for a two-class classification problem. The ensemble was built by concatenating two deep learning models, DenseNet201 and ResNet152V2. During tuning, fully connected layers (dense layers with 64 neurons each and dropout rates of 0.2 and 0.1) with a sigmoid activation function were added for classification. The ensemble model was trained for 20 epochs with a batch size of 64. The learning rate was set to 0.0001. Training parameters of the DenseNet201, ResNet152V2, and ensemble models are displayed in Table 1.

Proposed ensemble architecture for classification.
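The idea of an ensemble built by concatenation can be sketched as a tiny forward pass: the feature vectors of the two branches are joined into one vector, which is then passed to a dense layer with a sigmoid activation. The feature values, weights, and biases below are hypothetical, and the hidden dense and dropout layers of the actual model are omitted for brevity.

```python
import math


def sigmoid(z):
    """Logistic activation."""
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def ensemble_forward(features_a, features_b, weights, biases):
    """Concatenate the two branch outputs, then classify with a sigmoid layer."""
    merged = features_a + features_b  # list concatenation joins the two branches
    return merged, dense(merged, weights, biases, sigmoid)


feat_densenet = [0.2, 0.8]     # hypothetical DenseNet201 branch features
feat_resnet = [0.5, 0.1, 0.4]  # hypothetical ResNet152V2 branch features
weights = [[0.1] * 5, [-0.1] * 5]  # hypothetical weights for the two output units
biases = [0.0, 0.0]
merged, scores = ensemble_forward(feat_densenet, feat_resnet, weights, biases)
```

The classifier thus sees the features of both networks at once, rather than averaging two separate predictions.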
Training parameters of the DenseNet201, ResNet152V2, and ensemble models
Image augmentation techniques are used together with deep learning procedures to improve classification accuracy and avoid overfitting. Because deep learning models require large amounts of training data, the usefulness of these techniques is increasingly recognized, especially in biomedical imaging, where large amounts of labeled data are difficult or expensive to obtain. In our study, image augmentation was performed using the Keras library in Python. Scaling and translation were applied to the training images. For scaling, images were rescaled to 75% and 60% of the original image size. For translation, images were shifted by 20 percent to the left and 20 percent to the right. The number of training samples was increased from 656 to 1968 by these augmentation techniques. This is expected to increase training accuracy and to affect the classification results positively.
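A horizontal translation of the kind described above can be sketched in pure Python. The study itself used the Keras library; the function below is a stand-in written for illustration, and zero-filling the vacated pixels is an assumption, since the fill mode is not stated.

```python
def translate_horizontal(image, fraction):
    """Shift each row by `fraction` of the width (negative = left); vacated pixels become 0."""
    width = len(image[0])
    shift = int(round(abs(fraction) * width))
    if shift == 0:
        return [row[:] for row in image]
    if fraction < 0:
        return [row[shift:] + [0] * shift for row in image]
    return [[0] * shift + row[:-shift] for row in image]


image = [[1, 2, 3, 4, 5]]  # hypothetical one-row image, width 5
shifted_left = translate_horizontal(image, -0.2)
shifted_right = translate_horizontal(image, 0.2)
print(shifted_left, shifted_right)  # [[2, 3, 4, 5, 0]] [[0, 1, 2, 3, 4]]
```

The reported growth from 656 to 1968 training samples corresponds to a threefold increase in the size of the training set.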
Performance metrics
The results were then analyzed to select the best model for evaluating pCR to NAC in breast cancer patients using CT images. Moreover, we used the dimensionality reduction method “t-distributed stochastic neighbor embedding (t-SNE)” to visualize high-dimensional data by giving each data point a location in a two-dimensional map [41]. t-SNE is a relatively new dimension reduction procedure that is particularly suitable for non-linear and high-dimensional datasets. It is a manifold learning technique that models the similarities between data points as probability distributions. All experiments, including data pre-processing and analysis, were performed on the Google cloud computing service “Google Colab” (colab.research.google.com) using the Python programming language and the TensorFlow framework, version 2.4.1. To evaluate the performance of the proposed models, the area under the receiver operating characteristic (ROC) curve (AUC) was computed, and accuracy, precision, recall, and f1-score were calculated as follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$
$$\text{f1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
where TP, FP, TN, and FN represent the numbers of true positives, false positives, true negatives, and false negatives, respectively.
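These four metrics follow directly from the confusion-matrix counts, as the short sketch below shows. The counts in the example are hypothetical and are not results from this study; returning 0 when a denominator is zero is a convention chosen here for illustration.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and f1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=8, fp=2, tn=9, fn=1)
perfect = classification_metrics(tp=10, fp=0, tn=10, fn=0)
```

When FP and FN are both zero, all four metrics equal 1.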
Models’ analysis
Table 2 shows the performance of the proposed ensemble, DenseNet201, and ResNet152V2. The three models attained comparable results, and both DenseNet201 and ResNet152V2 achieved high accuracy. However, there was still room for improvement, which motivated the ensemble model. The proposed ensemble, which combines DenseNet201 and ResNet152V2, surpassed both base models with 100% accuracy. It also attained 100% precision, indicating that pCR and non-pCR cases were correctly differentiated. The results show that the proposed ensemble achieves high specificity, meaning that it produced no false-positive predictions on the testing set.
Prediction of performance results obtained from three CNN models
The accuracy and loss values of the models (training and testing) are given in Figs. 8 and 9. The two base models, DenseNet201 and ResNet152V2, were trained for up to 100 epochs, and the proposed ensemble for up to 20 epochs. In addition, an early stopping mechanism was applied to the training process of all models: if the validation accuracy reached 1, training was stopped. For the proposed ensemble, training was stopped at the ninth epoch by the early stopping criterion. For all three models, the accuracy on the testing data was higher than on the training data. The highest testing accuracy was obtained with the ensemble model. The loss of this model also decreased faster than that of DenseNet201 and ResNet152V2.

The values of accuracy for two models DenseNet201 and ResNet152V2, and the proposed ensemble.

The values of loss for two models DenseNet201 and ResNet152V2, and the proposed ensemble.
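The early stopping rule described above, stopping as soon as the validation accuracy reaches 1, can be sketched as a simple scan over the per-epoch validation accuracy. The accuracy history below is hypothetical and was chosen only so that the rule triggers at the ninth epoch, as reported for the ensemble; in Keras such a rule would typically be written as a custom callback, which is not shown here.

```python
def epochs_until_stop(val_accuracies, target=1.0):
    """Return the 1-based epoch at which training stops, or the last epoch if never reached."""
    for epoch, accuracy in enumerate(val_accuracies, start=1):
        if accuracy >= target:
            return epoch
    return len(val_accuracies)


# Hypothetical validation accuracies for a run of up to 20 epochs.
history = [0.81, 0.88, 0.93, 0.95, 0.97, 0.98, 0.99, 0.99, 1.0, 1.0]
stopped_at = epochs_until_stop(history)
print(stopped_at)  # 9
```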
The data were visualized with the t-SNE method. Figure 10 shows that in the proposed ensemble, no points were clustered with the wrong class. These results indicate the strong performance of the proposed ensemble and the usefulness of t-SNE for visualizing it. As shown in Fig. 11, a confusion matrix was used to evaluate each binary classifier. The confusion matrix displays the effect of FP and FN predictions on model performance. It shows that the proposed ensemble produced no FP or FN predictions.

Data visualization with the t-SNE method for original images, two DenseNet201 and ResNet152V2 models, and the proposed ensemble.

Confusion matrix analysis for two DenseNet201 and ResNet152V2 models and the proposed ensemble.
As shown in Fig. 12, the AUC was also computed for all binary classifiers. The proposed ensemble performs better than the base models, giving good separability between the two classes. This suggests that the ensemble generalizes better than the individual models. Although the AUC of the proposed ensemble exceeded those of the two base models (DenseNet201 and ResNet152V2) by only 0.004 and 0.003, respectively, even a small gain in model quality can be important in medical research.

ROC plots for two DenseNet201 and ResNet152V2 models, and the proposed ensemble.
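The AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, which gives a compact way to compute it without plotting the ROC curve. The sketch below uses this pairwise definition, counting ties as one half; the labels and scores are hypothetical and are not taken from this study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive-negative pairs ranked correctly (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 1, 0, 0, 0]
perfect_auc = roc_auc(labels, [0.9, 0.8, 0.7, 0.3, 0.2, 0.1])
imperfect_auc = roc_auc(labels, [0.9, 0.8, 0.25, 0.3, 0.2, 0.1])
```

An AUC of 1 means that every positive case is scored above every negative case; a single misranked pair out of nine lowers it to 8/9 in this example.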
Predicting response to NAC in breast cancer can change the treatment protocol. Simply put, if the response can be predicted at the beginning of chemotherapy, an alternative treatment could be applied [18, 22]. In this study, an ensemble deep transfer learning model using CNNs was proposed for predicting response to NAC in breast cancer based on CT images. Clinicians may use the proposed model to differentiate pCR patients from non-pCR patients. To the best of our knowledge, this study is the first report to combine CT images with deep transfer learning and an ensemble model to predict pCR following NAC in breast cancer. We hypothesized that the ensemble model would have predictive accuracy superior to that of the single models in isolation. Our results show that the ensemble model predicted pCR with greater accuracy (AUC = 1) than either DenseNet201 (AUC = 0.996) or ResNet152V2 (AUC = 0.997) in isolation.
Kassani et al. [42] developed a deep learning-based procedure using descriptor features extracted by CNN models and a pooling operation to classify histological breast cancer images. Their architecture using the Xception model yielded an average classification accuracy of 92.50%. Our study differs in the deep learning procedures evaluated, the accuracy obtained, and the type of image used. Ypsilantis et al. [43] investigated the challenging problem of predicting patients’ response to NAC from a single 18F-FDG PET scan acquired before treatment. They achieved an average sensitivity of 80.7% and specificity of 81.6%. In our study, the ensemble model achieved 100% accuracy. Adoui et al. [44] presented a deep learning model for predicting breast cancer response to NAC based on DCE-MRI. A total of 723 slices extracted from 42 breast cancer patients who underwent NAC were used to train the model. Their architecture predicted pCR to NAC with an AUC of 0.91 using combined pre- and post-NAC images. In the present study, CT images were used, and a higher AUC was obtained (AUC = 1 with the ensemble model). We conclude that the ensemble model could be used to differentiate pCR patients from non-pCR patients. Choi et al. [45] used PET/CT and MRI images to predict response to NAC in advanced breast cancer. A deep learning model was presented and compared with conventional methods, and ROC analysis was used to assess performance in differentiating pCR from non-pCR. The highest AUC, 0.805, was obtained for ΔSUV. In the current study, additional performance metrics were reported, and a higher AUC was achieved, with the ensemble model reaching the maximum value of 1.
This study had some limitations, which can be addressed in future research. The limited size of the available patient dataset constrains the training and learning capacity of the developed models. This work can also be extended by adding risk and survival prediction for pCR and non-pCR patients to support healthcare planning and management.
Conclusions
In the current study, an ensemble deep transfer learning model was designed for predicting response to NAC in breast cancer using CT images. The ensemble model predicted breast cancer response to first-cycle NAC better than either DenseNet201 or ResNet152V2 alone. Looking ahead, such a model may enable clinicians to tailor therapy to the individual patient.
Funding
No funding was received for this study.
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
None.
Acknowledgments
None.
