Abstract
Breast cancer is a widespread and significant health concern among women globally. Accurately categorizing breast cancer is essential for effective treatment, ultimately improving survival rates. In recent years, deep learning (DL) has emerged as a widely adopted approach for precise medical image classification, showing promise in this domain. However, despite the DL models proposed in the literature for automated classification of breast cancer (BC) histopathology images, achieving high accuracy remains challenging. Minor modifications to pre-trained models and simple training strategies can further enhance model accuracy. Following this approach, this paper proposes an anti-aliased filter in a pre-trained ResNet-34 and a novel three-step training process to improve BC histopathology image classification accuracy. The training involves systematically unfreezing layers and imposing additional constraints on the rate of change of learnable parameters. In addition, four-fold on-the-fly data augmentation enhances model generalization. The Ada-Hessian optimizer adjusts learning rates based on first- and second-order gradients to improve convergence speed. The training process uses a large batch size to minimize the training loss associated with batch normalization layers; gradient accumulation achieves this large batch size even with limited GPU memory. Collectively, these strategies minimize training time while maintaining or improving the accuracy of BC histopathology image classification models. In the experimental implementation, the proposed architecture achieves superior results compared to recent existing models, with an accuracy of 98.64%, recall of 98.98%, precision of 99.35%, F1-score of 99.17%, and MCC of 97.36% for binary classification. Similarly, the model achieves an accuracy of 95.01%, recall of 95.01%, precision of 94.95%, F1-score of 94.94%, and MCC of 93.42% for the eight-class categorization of BC images.
Introduction
As per the Global Cancer Statistics (GCS) 2020 [1], female breast cancer is the most prevalent cancer worldwide, with 2.3 million new cases reported in 2020. Breast cancer accounts for around 11.7% of new cases and 6.9% of deaths in the total cancer statistics. Therefore, early detection and proper treatment significantly impact survival rates. Various conventional methods, such as palpation, mammography, ultrasound, magnetic resonance imaging (MRI), and thermography, have been used for initial breast cancer screening. In cases of suspicion, tissue from the malignancy-prone area is extracted using various biopsy techniques. Subsequently, histopathology slides are prepared by staining with two dyes, Hematoxylin and Eosin (H&E). Hematoxylin stains the nuclei in the tissue dark blue, enabling quick identification of the nuclei’s size and shape. Eosin, in turn, stains the cytoplasm pink, revealing the volume of the cytoplasm and its morphological changes [2]. Professional pathologists then inspect the samples under a microscope at different magnification factors to assess tumor staging and grading. This information is vital in deciding the best possible treatment. However, analyzing such histopathological samples is tedious, time-consuming, and prone to errors.
In the past few years, significant progress has been made in reducing the workload of pathologists by converting histological samples into high-resolution digital images with different magnification factors. This approach allows for easy identification of the nuclei and cytoplasm containing morphological changes. Furthermore, viewing these images by changing magnification factors improves the detection and identification of the cancer stage. To simplify this process further, the pathologist’s experience is exploited to extract the most relevant handcrafted features from the images using advanced image processing algorithms and training a machine learning model. This machine learning model serves as an automatic, computer-aided diagnostic (CAD) tool for classifying histopathology images, reducing the pathologist’s effort. However, the accuracy of the tool depends on the pathologist’s ability to identify relevant details in histopathology images.
In the modern era, deep learning methods, particularly Deep Convolutional Neural Network (DCNN) models for image classification tasks, have proven to be a superior alternative to conventional image processing. A DCNN’s ability to automatically learn complex features from annotated datasets makes it an obvious choice for the medical domain. However, training models from scratch requires significant computational resources, a large amount of data, and considerable time [3]. To address these issues, a new architecture is proposed using a pre-trained model with suitable training strategies. To summarize, the main objectives of this article are to propose a new architecture that addresses the issues associated with training DCNN models from scratch, to utilize a pre-trained model with suitable training strategies, and to highlight the potential of these strategies in reducing the computational cost and improving accuracy. The main contributions are: (1) proposing a modified ResNet-34 with an anti-aliased filter to automate the classification of BC histopathology images into two and eight classes, regardless of image magnification; (2) applying global contrast normalization to reduce intra-contrast variation and four-fold on-the-fly data augmentation to improve model generalization; (3) proposing a three-step training strategy for effective utilization of Transfer Learning (TL) and Fine-Tuning (FT) in model training; (4) improving the proposed model’s convergence by using the Ada-Hessian optimizer; and (5) exploiting the accumulation of gradient normalization technique to increase the batch size from 32 to 128 without additional GPU resources, reducing model loss during training.
The structure of this paper is as follows: Section 2 analyzes related works in the field, providing a comprehensive review of the existing literature. Section 3 describes the proposed methodology in detail, explaining the approach to address the challenges associated with breast cancer image classification. Section 4 presents the experiments conducted to evaluate the effectiveness of the proposed methodology, including the dataset used, experimental setup, and performance metrics. Section 5 includes the results obtained from the experiments and provides a detailed discussion. Section 6 compares the proposed model with state-of-the-art models. Finally, Section 7 presents ablation studies, and Section 8 concludes the paper by summarizing the key contributions of the proposed methodology, discussing its limitations, and outlining possible directions for future research.
Related works
This section discusses various classification architectures for BC histopathology images using the Break-His dataset in magnification-dependent (MD) and magnification-independent (MI) settings. MD refers to a model that takes the magnification factor of the histopathology images into account. In contrast, MI refers to a model that does not consider the magnification factor. Most of the initial works are related to MD.
F.A. Spanhol et al. [4] extracted six types of features using conventional texture descriptors, including local binary patterns (LBP) [5], completed local binary patterns (CLBP) [6], and local phase quantization (LPQ) [7], and trained four machine learning classifiers for binary classification: 1-nearest-neighbour (1-NN), quadratic discriminant analysis (QDA), SVM, and random forest (RF). The accuracies ranged from 80% to 85% across magnification factors. These models belong to conventional machine learning. N. Bayramoglu et al. [8] proposed a simple DCNN with three convolution layers with kernel sizes of 7x7, 5x5, and 3x3. The model reported average accuracies ranging from 82.1% to 84.63% across four magnification factors. One reason for the low accuracy could be that the DCNN, being sparse, was unable to learn more intrinsic features from the dataset. Similarly, in another work, F.A. Spanhol et al. [9] used AlexNet [10], a comparatively dense deep learning-based architecture. Patches of sizes 32x32 and 64x64 were first extracted from the images using sliding-window and random strategies, and the model was trained at the patch level for binary classification. Subsequently, an image-level prediction was obtained by fusing the outcomes of the individual patches of a given image using the sum, product, and max rules. This improved the classification accuracy by around 6% compared to their previous work. A. Nahid et al. [11] undertook a comparative study of three custom deep neural network models (CNN, LSTM, and CNN+LSTM) using a support vector machine (SVM) and softmax as classification layers. Intrinsic features from histopathology images were extracted using k-means and mean-shift clustering algorithms to train the models. The highest accuracy of 91% was reported for the CNN at 200x magnification with mean-shift clustering and SVM for binary classification.
However, this experiment could not explain the variation in performance for different magnifications from model to model.
Zhongyi et al. [12] implemented both binary and eight-class classification using a class structure-based deep convolutional neural network (CSDCNN) with data augmentation. Accuracies of around 93.5±2.0% for eight-class and 96.5±2.5% for binary classification were reported. Bardou et al. [13] proposed a custom CNN trained from scratch, with five convolutional layers, two dense layers, and softmax. The model was trained with data augmentation techniques for binary and eight-class classification. Afterward, to boost model accuracy, ten predictive outcomes of the model were ensembled. The resulting accuracy varied from 96.15% to 98.33% for the binary case and from 83.31% to 88.23% for the eight-class case. The model thus attained decent accuracy in the binary case, but left room for improvement in the eight-class case. Sharma et al. [14] concluded that a model using features learned by a pre-trained CNN performed better than a model trained on handcrafted features. Moreover, VGG-16 + SVM with data augmentation achieved an accuracy varying from 91.23% to 93.97% for eight-class.
Wang et al. [15] proposed a hybrid model using a CNN and a Gated Recurrent Unit (GRU) network for binary classification and attained an accuracy of 86.21%. Similarly, Jia Li et al. [16] used a pyramid gray-level co-occurrence matrix (PGLCM) and the incremental generalized learning (IBL) concept to train the model for binary classification and reported an average accuracy of 90%. Seo et al. [17] addressed binary classification with a different approach: they proposed a novel Primal-Dual Multi-Instance SVM that aims to identify the abnormal regions that can predict malignancy in a given image. However, the accuracy is relatively low, ranging from 85.80% to 89.10%. Zhou et al. [18] applied a Resolution Adaptive Network with an SVM to two- and eight-class classification. However, the accuracy varies considerably across magnification factors. Similarly, Joshi et al. [19] used Xception with custom classification layers for binary classification at a magnification factor of 40x and reported 93.33% accuracy. Pandey et al. [20] implemented a two-stage classification approach using a pre-trained Xception. In the first stage, the model classifies the input into one of the two binary categories. In the second stage, it classifies the input into the individual sub-categories. The accuracies range from 98.13% to 99% for the first stage and from 91.03% to 94.69% for the second stage.
For the MI task, Shallu et al. [21] carried out a comparative study of three CNN architectures, VGG-16, VGG-19, and ResNet-50, comparing transfer learning on the pre-trained models with training the models from scratch to generate features. Subsequently, logistic regression was used as a classification layer to classify the features into benign and malignant categories. The pre-trained VGG-16 attained the best accuracy of 92.60% for binary classification. Dabeer et al. [22] trained a custom CNN for the binary case and achieved an accuracy of 93.45%.
In contrast, Boumaraf et al. [23] proposed a block-wise fine-tuning strategy on a pre-trained ResNet-18 for MI and MD. Global contrast normalization with 3-fold data augmentation was used for the two- and eight-class categories. The average accuracies for the two- and eight-class problems were 98.84% and 92.15%, respectively. However, the classification accuracy in the eight-class case was relatively low compared with the binary case. Moreover, Zhong et al. [24] observed that down-sampling of features in CNN models leads to an aliasing effect that impedes model stability and robustness. They proposed an anti-aliased filter called blur-pool to mitigate the aliasing effect.
Based on the above study, the existing models are less accurate for the eight-class problem than for the binary one. Hence, improving the model’s accuracy for the eight-class problem is necessary. Models trained with transfer learning and fine-tuning perform far better than those trained from scratch. On-the-fly data augmentation offers greater flexibility than pre-generated augmented data. Moreover, model performance improves as model size increases. Further, introducing an anti-aliased filter with fixed weights at appropriate locations improves model robustness and stability, which leads to better performance. In addition, the choice of optimizer can considerably reduce the model’s training time. Hence, this article proposes a modified ResNet-34, slightly denser than ResNet-18 [23], with state-of-the-art training strategies.
A Modified ResNet-34 with Anti-aliased filter
This section discusses the proposed model architecture, briefly introduces the Break-His dataset, and describes the pre-processing and data augmentation methods used to improve model performance. It also discusses the training strategies followed to reduce training time and enhance model performance.
Proposed model
A modified ResNet-34 [25] model is proposed to handle binary and eight-class classification tasks. The model incorporates stacks of residual blocks, each reducing the feature size by half and doubling the number of features compared to the previous block. The residual blocks are of two types: Residual_Block-0 with a skip connection and Residual_Block-1 with down-sampling. The skip connection eases the flow of information within the residual block by adding the input of Residual_Block-0 to the output of its convolutional layers before the final ReLU. Residual_Block-1, with its down-sampling path, matches the dimensions of the skip connection to the output of the residual block. Equation (1) shows the mathematical representation of the Residual_Block-0 with a skip connection:

Residual Blocks.
On the other hand, Equation (2) depicts the Residual_Block-1 with down-sampling.
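For reference, both residual mappings can be written in the standard ResNet form. This is a hedged reconstruction rather than the authors’ exact notation: $x$ denotes the block input, $\mathcal{F}$ the stacked convolution and batch-normalization layers, and $\mathcal{D}$ the down-sample path, all of which are symbols assumed here for illustration.

```latex
\begin{align}
y &= \mathrm{ReLU}\big(\mathcal{F}(x) + x\big)
  && \text{Residual\_Block-0 (skip connection)} \tag{1} \\
y &= \mathrm{ReLU}\big(\mathcal{F}(x) + \mathcal{D}(x)\big)
  && \text{Residual\_Block-1 (down-sampling)} \tag{2}
\end{align}
```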
The stippled regions in Figure 1b indicate the modified Conv2D layers in Residual_block-1 and the Down-sample sections. Finally, the pre-trained ResNet-34 was modified in the residual blocks as stated earlier, and the existing classification layers were altered to match the number of classes in the task at hand, either 2 or 8.
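The idea behind an anti-aliased (blur-pool style) down-sampling step can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors’ implementation: the 3x3 binomial kernel, the reflect padding, and the function name `blur_pool2d` are assumptions made for the example.

```python
import numpy as np

def blur_pool2d(x: np.ndarray, stride: int = 2) -> np.ndarray:
    """Low-pass filter a 2D feature map with a fixed 3x3 binomial kernel, then subsample."""
    k1d = np.array([1.0, 2.0, 1.0])
    kernel = np.outer(k1d, k1d)
    kernel /= kernel.sum()                 # fixed (non-learnable) weights that sum to 1
    padded = np.pad(x, 1, mode="reflect")  # assumed padding mode
    h, w = x.shape
    blurred = np.zeros_like(x, dtype=float)
    for i in range(3):
        for j in range(3):
            blurred += kernel[i, j] * padded[i:i + h, j:j + w]
    return blurred[::stride, ::stride]

# A checkerboard is the highest-frequency pattern a pixel grid can hold.
checker = np.indices((8, 8)).sum(axis=0) % 2
naive = checker[::2, ::2]                        # plain stride-2 subsampling
smooth = blur_pool2d(checker.astype(float))      # blur first, then subsample
print(naive.mean(), smooth.mean())
```

In this toy case the plain subsampling keeps only the zero-valued pixels, so the pattern aliases to a constant 0, while the blurred version preserves the true average intensity of 0.5.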
The basic architecture can be divided into three main components: Initial layers, a stack of residual blocks, and Classification layers. Figure 2 provides a detailed description of the proposed model, including the dimensions of input and output features at each stage.

Proposed Model.
The initial layers include the input layer, 2D-convolutional layer, batch normalization, ReLU activation, and max-pooling.
Input layer. It accepts an input image of size 224x224x3 and passes it to the subsequent layer.
2D-convolutional layer. It applies a 2D convolution operation to the input image with a 7x7 kernel, a stride of 2, and padding of 3, resulting in an output feature map of size 112x112x64.
Batch Normalization. This normalization process ensures that the inputs to each layer have zero mean and unit variance, thus reducing the internal covariate shift [26].
ReLU. The Rectified Linear Unit is represented by Equation (5): $f(x) = \max(0, x)$.
Max-pooling. Max pooling is a down-sampling operation that reduces the spatial dimensions of feature maps while preserving dominant features. It selects the maximum value within rectangular pooling windows. Max pooling provides translation invariance and reduces computational complexity in CNNs. In this case, a pooling operation with a 3x3 kernel size and a stride of 2 yields an output size of 56x56x64.
Block-64. Block-64 consists of three instances of residual block-0. The output of each residual block-0 is added to its input through the skip connection and passed through a ReLU activation function. Since residual block-0 is utilized in this block, the input and output features have dimensions of 56x56x64.
Block-128. Block-128 comprises one Residual block-1 and three Residual block-0 instances. Residual block-1 halves the feature size and doubles the number of features compared to the previous block, giving 28x28x128. The down-sample path aligns the input from the previous block with this output so that the two can be added. The Residual block-0 instances are then used to learn complex features. In Fig. 2, the label “3-times” in the Block-128 section indicates that these blocks are repeated three times. The same logic applies to Block-256 and Block-512, which produce 14x14x256 and 7x7x512, respectively.
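The spatial sizes quoted for the initial layers follow from the usual output-size formula for convolution and pooling. The short sketch below checks the arithmetic; the pooling padding of 1 is an assumption (it is the standard ResNet setting and is not stated in the text), and `conv_out_size` is a helper name introduced only for this example.

```python
def conv_out_size(n: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution or pooling layer (floor mode)."""
    return (n + 2 * padding - kernel) // stride + 1

# 7x7 convolution, stride 2, padding 3, applied to a 224x224 input
after_conv = conv_out_size(224, kernel=7, stride=2, padding=3)
# 3x3 max-pooling, stride 2; padding=1 is assumed, not given in the text
after_pool = conv_out_size(after_conv, kernel=3, stride=2, padding=1)
print(after_conv, after_pool)  # 112 56
```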
Classification layers
The output from Block-512 is passed through the classification layers, each serving a specific function:
Adaptive Average Pooling. This operation is similar to 2D max-pooling, but instead of selecting the maximum value, it calculates the average value within the filter region 7x7. It reduces the output feature of Block-512 from a size of 7x7x512 to 1x1x512.
Fully connected layer (FC-256). The fully connected layer adjusts its learnable weights to perform the classification task effectively. It takes the output from adaptive average pooling and generates a 256-dimensional feature vector.
ReLU. This layer takes the input from the previous layer and replaces negative values with zeros.
Dropout. Dropout regularization randomly zeroes activations during training to prevent the co-adaptation effect [27]. In the proposed model, a dropout probability of 0.5 is used.
Fully connected layer (FC-2 or FC-8). Depending on the classification problem, this fully connected layer receives the output of the dropout layer and produces either 2 or 8 output scores.
Log SoftMax layer. In a classification problem, softmax provides the probability scores of images belonging to specific categories. Applying the logarithm function over softmax helps penalize outliers in the dataset, thus improving model stability. Softmax maintains output values between 0 and 1, while the logarithm of softmax returns negative values. Mathematically, it is given by Equation (6).
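As an illustration of the log-softmax operation, $\log \mathrm{softmax}(z)_i = z_i - \log \sum_j e^{z_j}$, the following NumPy sketch uses the numerically stable form that subtracts the maximum score first. The logits are hypothetical values chosen for the example.

```python
import numpy as np

def log_softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    shifted = z - z.max(axis=-1, keepdims=True)  # subtracting the max avoids overflow in exp
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

logits = np.array([2.0, 0.5, -1.0])  # hypothetical class scores
log_probs = log_softmax(logits)      # all values are <= 0
probs = np.exp(log_probs)            # recovers the softmax probabilities
```

Exponentiating the result recovers probabilities that sum to one, which is why the log-softmax outputs are always non-positive.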
The Break-His dataset [4] is used for training, testing, and evaluation of the proposed model. It consists of 7909 labeled surgical open biopsy (SOB) histopathological images from 24 patients in the benign and 58 patients in the malignant category. Each category is further divided into four sub-categories. The benign category comprises Adenosis (A), Fibroadenoma (FA), Tubular adenoma (TA), and Phyllodes tumor (PT). Similarly, the malignant category comprises Ductal carcinoma (DC), Lobular carcinoma (LC), Papillary carcinoma (PC), and Mucinous carcinoma (MC). Besides images captured at 40x, the dataset also contains images that show the tumor region in the 40x images magnified to 100x, 200x, and 400x. This results in a total of 2480 benign and 5429 malignant color images in Portable Network Graphics (PNG) format with 700x460 pixels. Table 1 shows the summary of the dataset. The experiments involve binary and eight-class classification. In the binary case, the model is intended to predict whether a given histopathology image is benign (label=0) or malignant (label=1). Similarly, for eight-class classification, the model classifies the image into one of the benign and malignant sub-categories, which are labeled from 0 to 7.
Summary of Break-His dataset. [28]
The dataset provides H&E-stained histopathological images. A small amount of colour variation from image to image is evident due to the H&E staining process. In addition, image acquisition under different environmental conditions leads to possible intra- and inter-contrast changes among the images. To alleviate the possible performance degradation due to such contrast changes, global contrast normalization (GCN) was applied [29]. It brings uniformity to the contrast of the images. If the given image has a height of H, a width of W, and a channel count of 3 (RGB), then the GCN image is given by Equation (7).
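A minimal NumPy sketch of GCN is given below, assuming the widely used formulation in which the mean over all H x W x 3 values is subtracted and the result is divided by the image’s contrast (its standard deviation). The parameters `scale`, `lam`, and `eps`, and their default values, are assumptions for illustration and may differ from the settings used in the paper.

```python
import numpy as np

def global_contrast_normalize(img, scale=1.0, lam=0.0, eps=1e-8):
    """Subtract the mean intensity and divide by the contrast computed over all H*W*3 values."""
    x = img.astype(np.float64)
    x = x - x.mean()
    contrast = np.sqrt(lam + np.mean(x ** 2))
    return scale * x / max(contrast, eps)   # eps guards against division by zero on flat images

rng = np.random.default_rng(0)
# Random stand-in for a 700x460 RGB histopathology image
image = rng.integers(0, 256, size=(460, 700, 3), dtype=np.uint8)
normalized = global_contrast_normalize(image)
```

After normalization every image has zero mean and unit contrast, which is what removes the image-to-image contrast differences described above.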
Figures 3a and 3c show samples of the dataset at a magnification of 40x for each sub-category of benign and malignant images before applying GCN. The corresponding images of the benign and malignant sub-categories after applying GCN are shown in Figs. 3b and 3d. The variation in contrast is reduced considerably after applying GCN.

Sample histopathology images at magnification factor 40x, before (data source: BreakHis [28]) and after applying global contrast normalization (GCN). (a) Benign sub-categories before GCN. (b) Benign sub-categories after GCN. (c) Malignant sub-categories before GCN. (d) Malignant sub-categories after GCN.
An augmentation technique can improve the model’s performance, even with limited training data. This technique can be applied by enlarging the train data before training or augmenting different image transformations while training (on-the-fly). The former increases the data and the model training time and requires extra storage. On the other hand, the latter is performed during the training process itself. Instead of creating a separate augmented dataset, the original training data is randomly transformed or modified on-the-fly during each epoch. Hence, on-the-fly data augmentation provides flexibility in choosing different augmentation techniques, allowing the model to learn from a broader range of transformations. To take advantage of on-the-fly data augmentation, the Albumentations package [30] is used to generate three variants of image transformations: horizontal flipping, random rotation 90°, and vertical flipping. Additionally, to increase the randomness of the transformations, the “one_of” method available in the package is used. This method allows selecting one of the transformations with a predetermined probability. Setting the probability value to 0.75 will choose either one of the three transformations or the original image, eventually producing a four-fold on-the-fly data augmentation pipeline. Figure 4(a) shows the original image of the benign class, which has been resized to 224x224. The corresponding horizontal flip, random rotation, and vertical flip are depicted in Fig. 4(b), (c), and (d), respectively.
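The selection logic of the four-fold pipeline can be sketched with plain NumPy. This is a stand-in for illustration and does not use the Albumentations API; the function name `augment`, the uniform choice among the three transforms, and the restriction of the rotation to a non-zero multiple of 90° are assumptions.

```python
import numpy as np

def augment(image: np.ndarray, rng: np.random.Generator, p: float = 0.75) -> np.ndarray:
    """With probability p apply exactly one of three transforms; otherwise return the image unchanged."""
    if rng.random() >= p:
        return image                                       # original image (probability 1 - p)
    choice = rng.integers(3)
    if choice == 0:
        return np.fliplr(image)                            # horizontal flip
    if choice == 1:
        return np.rot90(image, k=int(rng.integers(1, 4)))  # rotate by 90, 180 or 270 degrees
    return np.flipud(image)                                # vertical flip

rng = np.random.default_rng(0)
image = np.arange(224 * 224 * 3).reshape(224, 224, 3)      # stand-in for a resized image
outputs = [augment(image, rng) for _ in range(1000)]
unchanged = sum(np.array_equal(o, image) for o in outputs) / len(outputs)
```

With p = 0.75 and a uniform choice among three transforms, each of the four outcomes (original, horizontal flip, rotation, vertical flip) is drawn about a quarter of the time, so `unchanged` should be close to 0.25.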

Data Augmentation (a): Original Image (Class: Benign, Subclass: Adenosis, Mag: 100x) resized to 224x224 (b) Horizontal Flip (c)Random Rotate 90° (d) Vertical Flip.
Although a large dataset can enhance the efficiency of deep learning models, obtaining large amounts of data in the medical domain is often challenging. To address this, a four-fold on-the-fly data augmentation pipeline is employed to strengthen the model’s efficiency by providing adequate training data. The model contains batch normalization layers, which require a large batch size to minimize training loss [31]. However, a large batch size requires a substantial amount of computational resources. To mitigate this, the “accumulation of gradient normalization” [32] technique is employed to increase the effective batch size from 32 to 128 without additional computational resources. Additionally, a dropout rate of 0.5 in the final layers and a weight decay of 0.1 help reduce model over-fitting.
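The effect of accumulating gradients over four mini-batches of 32 can be illustrated on a toy problem. The sketch below uses a linear least-squares model in NumPy as a stand-in for the network (an assumption made to keep the example self-contained) and shows that the averaged mini-batch gradients equal the gradient of a single batch of 128.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 5))   # 128 samples = one large batch
y = rng.normal(size=128)
w = np.zeros(5)                 # toy model parameters

def grad_mse(Xb, yb, w):
    """Gradient of the mean squared error over a batch with respect to w."""
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

full_grad = grad_mse(X, y, w)   # gradient computed on all 128 samples at once

micro_batch, accum_steps = 32, 4            # 32 * 4 = 128
accumulated = np.zeros_like(w)
for step in range(accum_steps):
    sl = slice(step * micro_batch, (step + 1) * micro_batch)
    # Scale each mini-batch gradient so that the running sum is a mean
    accumulated += grad_mse(X[sl], y[sl], w) / accum_steps
# Only at this point would the optimizer take a single update step
```

In a PyTorch training loop the same pattern is typically obtained by dividing the loss by the number of accumulation steps, calling `backward()` on every mini-batch, and calling the optimizer’s `step()` and `zero_grad()` only after every fourth mini-batch.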
For model training, a second-order optimizer, Ada-Hessian [33], with a Hessian power of 0.5 is used. Ada-Hessian accounts for both the gradient and the curvature of the loss function when updating the model parameters during backpropagation. As a result, the model converges in fewer epochs. Since the proposed model uses a Log Softmax layer as its final layer, a negative log-likelihood loss (NLLLoss) function is used, which ensures a positive loss value. As shown in Table 1, the dataset is imbalanced in both the binary and eight-class categories, with the number of ductal carcinoma images being more than three times that of the other sub-categories. To prevent model bias toward the majority classes, a weight vector inversely proportional to the number of images in each class is applied to the NLLLoss. If $x_i$ represents the number of images in the $i$th category of an $n$-class classification problem, the weight vector is given by Equation (9).
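One way to build such a weight vector and apply it to the negative log-likelihood is sketched below. The normalization (weights proportional to $1/x_i$ and summing to one) is an assumption, since several equivalent scalings exist; the whole-dataset counts of 2480 benign and 5429 malignant images are used purely for illustration, and the helper names are hypothetical.

```python
import math

def inverse_frequency_weights(counts):
    """Weights proportional to 1/x_i, normalized to sum to 1 (the normalization is an assumption)."""
    inv = [1.0 / c for c in counts]
    total = sum(inv)
    return [v / total for v in inv]

def weighted_nll(log_probs, targets, weights):
    """Weighted mean negative log-likelihood: sum(w_t * -log p_t) / sum(w_t)."""
    num = sum(-weights[t] * lp[t] for lp, t in zip(log_probs, targets))
    den = sum(weights[t] for t in targets)
    return num / den

weights = inverse_frequency_weights([2480, 5429])  # benign, malignant

# Hypothetical log-probabilities for two samples with true labels 0 and 1
log_probs = [[math.log(0.9), math.log(0.1)], [math.log(0.2), math.log(0.8)]]
loss = weighted_nll(log_probs, [0, 1], weights)
```

The minority (benign) class receives the larger weight, so its errors contribute more to the loss. The weighted mean shown here mirrors the default mean reduction of PyTorch’s `NLLLoss` when a `weight` tensor is supplied.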
In addition, a three-step training process and learning rate constraints on Block-512 can effectively utilize transfer learning and fine-tuning strategies to minimize training time while enhancing accuracy. The systematic approach of the three-step training is explained below.
Transfer learning
The proposed model employs the pre-trained weights of ResNet-34 up to Block-512 and randomly initializes all learnable parameters in the classification layers. Subsequently, all trainable parameters of the model up to Block-256 are frozen, and the learning rate for Block-512 is restricted to $10^{-3}$. Through this process, the model’s classification layers gradually learn relevant features of breast cancer histopathology images rather than abruptly updating weights during initial training. The model training starts with a learning rate (LR) of $10^{-2}$, which is reduced by half every ten epochs (50). The model’s performance is validated with the test dataset at the end of each epoch to assess the best model state and prevent overfitting. Finally, the best state of the proposed model, with the highest validation accuracy during transfer learning, is saved for further processing.
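The step-wise learning-rate decay can be written as a one-line function. The sketch below assumes that “(50)” denotes a budget of 50 epochs per training step, which is an interpretation rather than something stated explicitly; `step_lr` is a helper name introduced for the example.

```python
def step_lr(initial_lr: float, epoch: int, step_size: int = 10, gamma: float = 0.5) -> float:
    """Learning rate at a given epoch when it is halved every `step_size` epochs."""
    return initial_lr * gamma ** (epoch // step_size)

# Schedule for the first training step, starting from 1e-2 (50 epochs assumed)
schedule = [step_lr(1e-2, epoch) for epoch in range(50)]
```

In PyTorch, the same behaviour is available through `torch.optim.lr_scheduler.StepLR` with `step_size=10` and `gamma=0.5`.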
Fine-tuning-1
In fine-tuning-1, all trainable parameters up to Block-256 are frozen, and the learning rate restrictions on Block-512 are removed. The model is then initialized with the previously saved model state and trained with an initial learning rate (LR) of $10^{-3}$, which is reduced by half every ten epochs (50). Finally, the model’s best validation accuracy state is saved for the next training step.
Fine-tuning-2
In this step, all trainable parameters in the proposed model are unfrozen, and the model is initialized with the previously saved model state. The model is trained with an initial learning rate (LR) of $10^{-4}$, which is reduced by half every ten epochs (50). Finally, the model’s state corresponding to the best validation accuracy in this step is saved for model evaluation.
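The freezing and learning-rate settings of the three steps can be expressed with PyTorch parameter groups. The following is a minimal sketch under stated assumptions: the three `nn.Linear` modules are tiny stand-ins for the ResNet-34 stages and the classification layers, the names `configure`, `early`, `block512`, and `head` are hypothetical, and SGD is used as a placeholder because Ada-Hessian is not part of core PyTorch.

```python
import torch
from torch import nn

# Tiny stand-ins for the backbone stages and the head (the real model uses ResNet-34 blocks)
model = nn.ModuleDict({
    "early": nn.Linear(8, 8),     # everything up to Block-256
    "block512": nn.Linear(8, 8),  # Block-512
    "head": nn.Linear(8, 2),      # classification layers
})

def configure(step: str):
    """Set requires_grad flags and return optimizer parameter groups for one training step."""
    freeze_early = step in ("transfer", "finetune1")
    for p in model["early"].parameters():
        p.requires_grad = not freeze_early
    base_lr = {"transfer": 1e-2, "finetune1": 1e-3, "finetune2": 1e-4}[step]
    # The extra learning-rate constraint on Block-512 applies only during transfer learning
    block512_lr = 1e-3 if step == "transfer" else base_lr
    groups = [
        {"params": model["block512"].parameters(), "lr": block512_lr},
        {"params": model["head"].parameters(), "lr": base_lr},
    ]
    if not freeze_early:
        groups.append({"params": model["early"].parameters(), "lr": base_lr})
    return groups

# Placeholder optimizer; the paper uses Ada-Hessian instead of SGD
optimizer = torch.optim.SGD(configure("transfer"), lr=1e-2)
```

Calling `configure("finetune1")` or `configure("finetune2")` and rebuilding the optimizer, after reloading the best saved state, would reproduce the settings of the later steps.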
Experiments
This section briefly explains the experimental procedure of the proposed model’s training, validation, and evaluation. The various performance parameters used to assess the model are also discussed.
Model training, validation and evaluation
Figure 5 illustrates the training and validation process of the proposed model. Global contrast normalization is applied to each image in the Break-His dataset. The test dataset is created by selecting every fifth image from the dataset, while the remaining images form the training dataset. As a result, the test dataset includes images from the benign and malignant categories, their sub-categories, and all magnification factors in proportion to the training dataset. The dataset is thus divided into training and testing sets containing 80% and 20% of the images, that is, 6327 and 1582 images, respectively. Table 2 presents the distribution of the training and testing datasets.
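The every-fifth-image split can be sketched as follows. The integer list stands in for the ordered list of image paths, and starting the selection at the first image is an assumption; it is the offset that reproduces the reported 6327/1582 counts for 7909 images.

```python
def every_fifth_split(items):
    """Send every fifth item (starting with the first) to the test set and the rest to training."""
    test = [x for i, x in enumerate(items) if i % 5 == 0]
    train = [x for i, x in enumerate(items) if i % 5 != 0]
    return train, test

image_ids = list(range(7909))   # stand-in for the ordered list of image paths
train_ids, test_ids = every_fifth_split(image_ids)
print(len(train_ids), len(test_ids))  # 6327 1582
```

If the image list is ordered by category, sub-category, and magnification, this systematic selection keeps the test set proportional to the training set along all three dimensions.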

Training and Validation of the Proposed model.
Distribution of train and test for both classifications
Additionally, separate data pipelines are employed to train and validate the proposed architecture. During training, the pipeline resizes each image to 224x224 and applies the four-fold on-the-fly data augmentation to generate one of the proposed image transformations. The former ensures compatibility with the proposed model’s input size, while the latter addresses data scarcity. For the test dataset, the pipeline only resizes each image to 224x224, and the result is used for validation and evaluation. After each training epoch, the model accuracy and loss are validated on the test dataset to avoid over-fitting. In the proposed model, the second fully connected (FC) layer in the classification layers has a variable number of neurons, depending on the classification task. For binary classification (benign or malignant), the number of neurons is set to two. In the case of eight-class classification, the model predicts one of the following classes: A, FA, TA, PT, DC, LC, MC, and PC. Accordingly, the number of neurons in the second FC layer is set to eight. The experiments were conducted on an Intel® Core™ i9-9900K CPU @ 3.60 GHz with a 16 GB Nvidia Quadro RTX 5000 GPU, 32 GB RAM, a 1 TB HDD, and Ubuntu 20.04. The model was implemented using the PyTorch package [34].
The performance of the classification task is measured using several metrics, including the confusion matrix, accuracy, precision, recall, F1-measure, and the Matthews correlation coefficient (MCC). Table 3 summarizes the parameters used to evaluate the efficiency of the proposed model. The evaluation parameters are calculated with the scikit-learn machine learning library [35]; for the eight-class problem, macro averaging is used.
Evaluation parameters used to assess the model performance
Note. $t_p$ = True Positive, $t_n$ = True Negative, $f_p$ = False Positive, and $f_n$ = False Negative.
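For the binary case, these metrics follow directly from the four confusion-matrix counts using their standard definitions. The sketch below is plain Python rather than the scikit-learn calls used in the experiments, and the counts passed in are illustrative values, not results from the paper.

```python
import math

def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Standard binary classification metrics computed from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "mcc": (tp * tn - fp * fn)
        / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }

# Illustrative counts only, not results from the paper
scores = binary_metrics(tp=90, tn=80, fp=10, fn=20)
```

For the eight-class problem, macro averaging computes precision, recall, and F1 per class in a one-vs-rest fashion and then takes the unweighted mean across the eight classes.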
This section evaluates the performance of the proposed model on two-class and eight-class magnification-independent breast cancer histopathology images. Initially, the model is trained using the training dataset for each epoch. At the end of each epoch, the training accuracy is compared to the validation accuracy to detect over-fitting. The state of the model that achieves the best validation accuracy in each stage of the three-step training process is saved for initializing the model in the next step. This process helps the model avoid over-fitting.
Figures 6 and 7 show the model’s accuracy and loss for the two classification problems. The regions separated by three vertical lines represent the model’s training and validation status in each epoch of the three-step training process, namely transfer learning, fine-tuning-1, and fine-tuning-2. As the number of epochs increases, accuracy improves and loss decreases in each training step. Each vertical line intersecting the training curve marks the state that corresponds to the best validation accuracy in that step. For binary classification, the model took 35 epochs in transfer learning to raise accuracy from 71% to 91%, while the loss decreased from 0.6 to 0.25. In fine-tuning-1, after an additional 26 epochs, model accuracy reached 94% with a loss of 0.18. For eight-class classification, the accuracy rose from 40% to 86% and the loss fell from 1.6 to 0.4 in merely 21 epochs in the first step. In fine-tuning-1, the accuracy reached 92% with a corresponding loss of 0.26 after another 49 epochs. This indicates the effectiveness of the proposed additional learning-rate constraints in the transfer learning stage. At the end of fine-tuning-2, the models achieved accuracies of 98.6% and 95.01%, respectively. Moreover, the three-step training process took only 2 h 40 min for each classification task.

Model accuracy and loss for Binary classification.

Model accuracy and loss for Eight-class classification.
Furthermore, the smoothness of the accuracy and loss curves indicates that the proposed anti-aliased filter ensures robustness and stability during training. The exponential decay of the loss curves demonstrates the effectiveness of Ada-Hessian in updating model parameters, leading to improved accuracy and decreased loss. Table 4 summarizes the computational complexity of the proposed model. Approximately 60% of the total parameters are updated during the first two training stages, which account for two-thirds of the epochs. Consequently, the model consumes significantly less time during the transfer learning and fine-tuning-1 phases. Moreover, the model size is approximately 154.5 MB, and it requires 4.18 GFLOPs. Additionally, it processes 784 and 791 frames per second, with latencies of 4.567 ms and 4.432 ms, for binary and eight-class classification, respectively.
Computational complexity Analysis
After the model was trained, its performance on the test dataset was evaluated using confusion matrices for binary and eight-class classification. The binary confusion matrix provides insight into the model’s ability to distinguish between benign and malignant tumors, whereas the eight-class matrix assesses its ability to differentiate among the malignant and benign sub-categories. The confusion matrices for the two problems are shown in Fig. 8a and 8b. In binary classification, the model made incorrect predictions for seven benign and eleven malignant images out of 1582 test images. In eight-class classification, 79 images were misclassified. This indicates that the model performed better in binary than in eight-class classification. Figure 8c further demonstrates that, despite the superior performance of the binary classifier, the eight-class classifier exhibited good discriminative ability when its predictions are read as benign versus malignant: only two benign and three malignant cases were misclassified.

Confusion Matrices for Binary and Eight-class classification.
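The misclassification counts read from the confusion matrices can be converted to accuracies with simple arithmetic, as in the short Python sketch below. It assumes that the same 1582-image test set underlies all three matrices; the function name `accuracy_from_errors` is illustrative only.

```python
def accuracy_from_errors(total_images, misclassified):
    """Accuracy as the fraction of test images that were classified correctly."""
    return (total_images - misclassified) / total_images


TEST_IMAGES = 1582  # test-set size reported in the text

binary_acc = accuracy_from_errors(TEST_IMAGES, 7 + 11)    # seven benign + eleven malignant errors
eight_class_acc = accuracy_from_errors(TEST_IMAGES, 79)   # eight-class errors
# Eight-class predictions read as benign vs. malignant (two benign + three malignant errors).
collapsed_acc = accuracy_from_errors(TEST_IMAGES, 2 + 3)

print(f"binary: {binary_acc:.2%}")
print(f"eight-class: {eight_class_acc:.2%}")
print(f"eight-class read as binary: {collapsed_acc:.2%}")
```

Under that assumption, 18 errors correspond to 98.86% and 79 errors to 95.01%, which agrees with the accuracies reported for the magnification-independent task.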
Figure 9 shows the category-wise totals versus the correctly predicted classifications, with B representing Benign and M representing Malignant. Misclassifications are fewer in binary than in eight-class classification. Among the misclassified images, LC (28 out of 125), DC (20 out of 690), and PT (13 out of 89) have the highest numbers of incorrect classifications. Moreover, misclassifications were more prevalent among malignant sub-categories than benign sub-categories, with DC and LC being the most frequently misclassified malignant sub-types. Of the 79 misclassifications, 45 belonged exclusively to DC and LC. Among the benign sub-categories, nine PT images were misclassified as FA.

Total and Correctly Predicted for Binary and Eight-class.
The primary reason for the high misclassification rate of DC and LC is the presence of images from patient ID 13412 in both categories within the Break-His dataset. Specifically, 15 of the 26 misclassified DC images and 16 of the 19 misclassified LC images belong to this patient ID. These images account for approximately 37% of the total misclassifications. Additionally, the model erroneously predicted eight images from another patient ID, 15570, as LC instead of DC. Further analysis revealed that for only four of the 79 misclassified images was the correct class absent from the top-2 predictions; for the remaining 75, the correct class was within the top-2 predicted classes. The histogram in Fig. 10 illustrates the distribution of differences between the top-2 predicted classes and the actual types for the 79 misclassified images. Most images exhibited only a marginal difference, with values ranging from 0.02 to 0.86.

Top-2 Accuracy analysis for eight-class.
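A top-2 analysis of this kind can be sketched in a few lines of Python. The sketch below is an assumption about how such an analysis might be computed, not the procedure used in the experiments: it interprets the "difference" as the gap between the top-1 predicted probability and the probability assigned to the true class, and the names `top_k_classes` and `top2_report` as well as the toy probabilities are hypothetical.

```python
def top_k_classes(probs, k=2):
    """Indices of the k highest-probability classes, best first."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]


def top2_report(prob_rows, labels):
    """Return (top-1 accuracy, top-2 accuracy, probability gaps of misclassified samples)."""
    top1_hits = 0
    top2_hits = 0
    gaps = []  # top-1 probability minus true-class probability, misclassified samples only
    for probs, label in zip(prob_rows, labels):
        ranked = top_k_classes(probs, k=2)
        if ranked[0] == label:
            top1_hits += 1
        else:
            gaps.append(probs[ranked[0]] - probs[label])
        if label in ranked:
            top2_hits += 1
    n = len(labels)
    return top1_hits / n, top2_hits / n, gaps


# Toy three-class example; every sample has true class 0.
rows = [[0.7, 0.2, 0.1], [0.4, 0.5, 0.1], [0.1, 0.3, 0.6]]
top1_acc, top2_acc, gaps = top2_report(rows, [0, 0, 0])
print(top1_acc, top2_acc, gaps)
```

If the four images whose true class falls outside the top-2 are counted against the 1582 test images, the corresponding top-2 accuracy would be about 99.75%.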
The proposed model for magnification-independent (MI) classification was evaluated on the two-class and eight-class problems and achieved accuracies of 98.86% and 95.01%, respectively. Table 5 shows the evaluation metrics of the proposed architecture on MI for both problems. The metrics, including accuracy, precision, recall, F-measure, and MCC, indicate that the model’s performance is consistent within each classification task. Similarly, Fig. 12 shows the precision-recall (PR) curves for binary and eight-class classification. The comparatively low area under the PR curve (AUC-PR) of 0.9375 for LC further indicates that the replication of images has a more significant impact on LC than on DC. A high AUC-PR score suggests that the proposed model is robust to the choice of decision threshold.
The model evaluation on MI for two and eight-class problem

Precision-Recall Curve for Binary

Precision-Recall Curve for Eight-Class
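For readers who wish to reproduce a per-class AUC-PR value, one common estimate is the average precision, which sums the precision at every rank where a positive sample is retrieved. The Python sketch below assumes this step-wise definition and a one-vs-rest labelling for the class of interest; the text does not state how AUC-PR was computed in the experiments, and the function name `average_precision` and the example scores are hypothetical.

```python
def average_precision(scores, labels):
    """Step-wise area under the precision-recall curve for one class.

    scores: predicted probability of the class of interest for each sample.
    labels: 1 if the sample belongs to that class, 0 otherwise.
    Tied scores are ranked in input order; no special tie handling is applied.
    """
    positives = sum(labels)
    if positives == 0:
        raise ValueError("at least one positive label is required")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / positives


# Hypothetical one-vs-rest scores for a single class.
ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
print(f"average precision: {ap:.4f}")
```

A class whose positives are consistently ranked above its negatives obtains a value of 1.0, so a lower value for one sub-category reflects positives of that class being outranked by images of other classes.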
Furthermore, the proposed model was evaluated for magnification-dependent (MD) classification by treating each magnification factor (40x, 100x, 200x, and 400x) in the test dataset as a separate subset. Table 6 shows the evaluation metrics on MD for both classifications. In binary classification, the model accuracy varies from 98.08% to 99.26%, while in eight-class classification the highest accuracy is observed at 40x and the lowest at 200x. Overall, the model achieved an accuracy of 98.86% in binary classification and 95.01% in eight-class classification.
The model evaluation on MD for binary and eight-class
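A magnification-dependent evaluation of this kind amounts to grouping the test predictions by magnification factor and scoring each group separately. The following Python sketch illustrates the idea under the assumption that each test image carries a magnification tag; the function name `accuracy_by_group` and the toy labels are hypothetical.

```python
from collections import defaultdict


def accuracy_by_group(groups, y_true, y_pred):
    """Return a dict mapping each group (e.g. magnification factor) to its accuracy."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, truth, pred in zip(groups, y_true, y_pred):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}


# Toy data: magnification tag, true label, and predicted label per image.
magnification = ["40x", "40x", "100x", "100x", "200x", "400x"]
y_true = ["B", "M", "M", "B", "M", "B"]
y_pred = ["B", "M", "M", "M", "M", "B"]

per_magnification = accuracy_by_group(magnification, y_true, y_pred)
for factor, acc in per_magnification.items():
    print(f"{factor}: {acc:.2%}")
```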
Table 7 compares the proposed model’s accuracy with recent works on MD. Compared with the six most recent counterparts, including [14, 23], and [20], the proposed model gives better accuracy. In binary classification, it improves on [17, 18], and [19] by 3.4% to 12%. In addition, the proposed model’s accuracy, precision, recall, and F1-score are consistent with the findings of [23] and [20]. In eight-class classification, accuracy improves on [14, 23], and [20] by up to 3.1%. For 100x and 200x, the model’s accuracy is lower than that of [18]; however, our results are consistent across magnification factors, indicating the model’s robustness to magnification.
Comparing performance with MD on recent methods proposed in the literature
Similarly, Table 8 compares the model’s accuracy with recent works on MI. In binary classification, the proposed model improves accuracy by 7% with respect to [21] and [22]. Compared with [23], it improves accuracy by about 3% in eight-class classification and marginally in the binary case. Hence, the proposed model performs consistently on both MI and MD.
Comparing performance with MI on recent methods proposed in the literature
To evaluate the effectiveness of different components and techniques in improving the performance of the ResNet-34 model, we conducted an ablation study using binary and eight-class classification for MI.
The baseline model, ResNet-34 (Scratch), achieved an accuracy of 85.43% for binary and 82.16% for eight-class classification. Adding the proposed four-fold on-the-fly data augmentation improved performance, with accuracy increasing to 86.89% for binary and 82.76% for eight-class. Using a pre-trained ResNet-34 as the starting point (transfer learning) raised the accuracies to 91.27% and 85.86%, respectively. Replacing plain transfer learning with the three-step training approach yielded even higher accuracies of 92.46% and 86.12%, respectively. Combining data augmentation with transfer learning on the pre-trained ResNet-34 led to accuracies of 94.04% and 88.94%. Finally, combining data augmentation with three-step training instead of transfer learning raised the accuracies to 96.34% and 90.22%.
Similarly, repeating these steps with the anti-aliased filter included, as shown in Table 10, produced a progressive improvement in accuracy. The final configuration gave the best accuracy, demonstrating the effectiveness of the proposed methodology.
Analysis of MI for Binary on other Models
Analysis of MI for Eight on other Models
This paper presented a modified deep-learning architecture based on the pre-trained ResNet-34 for magnification-independent binary and eight-class classification of BC histopathology images. The architecture incorporated a fixed-weight Gaussian filter to address the aliasing effect, which improved model stability and robustness. Furthermore, the additional learning-rate constraint in the proposed three-step training allowed smoother updating of learnable parameters, enhancing model accuracy. Additionally, using a large batch size reduced the frequency of batch normalization updates, resulting in decreased training loss. Moreover, applying global contrast normalization reduced intra-contrast variation, and four-fold on-the-fly data augmentation enhanced the model’s generalization and accuracy. Notably, using a Hessian power of 0.5 in Ada-Hessian helped reach the optimal solution in fewer epochs; this optimizer considers both first- and second-order derivatives of the loss during back-propagation. The results show that the proposed model achieved an accuracy of 98.86% for binary and 95.01% for eight-class classification on the magnification-independent task. Moreover, the model showed consistent performance on the magnification-dependent task. Overall, the results obtained with the proposed model were better than those of previous works in the literature using the same dataset, especially in eight-class classification. The proposed strategy can be applied to other medical modalities and is well suited to medical applications.
However, the proposed model’s evaluation is limited to a specific dataset, which may restrict its generalizability to different datasets or medical modalities. While the model outperforms previous works on the same dataset, further validation on diverse datasets is necessary to enhance its reliability and applicability in various medical applications. Additionally, image repetition within the DC (ductal carcinoma) and LC (lobular carcinoma) sub-categories of the malignant class, specifically for patient ID 13412, adversely affects the accuracy of the eight-class classification. Addressing and mitigating the impact of this repetition is essential to improve the classification accuracy for these sub-categories.
To enhance the proposed model, future research can focus on treating the problem as a multi-label classification task. This approach would enable capturing the diverse characteristics present in BC histopathology images by treating each pathological feature as a separate label. Additionally, evaluating the model on diverse datasets and medical modalities would provide valuable insights into its generalizability and performance across different settings, enhancing its robustness in real-world scenarios.
Competing interests
The authors declare that they have no competing interests.
Ethical approval and consent to participate
Not applicable.
