Vision transformer-convolution for breast cancer classification using mammography images: A comparative study

Abstract

Breast cancer is a significant global health concern, which makes early detection critical for effective treatment. While convolutional neural networks (CNNs) have been the dominant approach for analysing medical images, recent interest has emerged in leveraging vision transformers (ViTs) for medical data analysis. This study aimed to conduct a comprehensive comparison of three systems, a self-attention transformer (VIT), a compact convolution transformer (CCT), and a TokenLearner transformer (TVIT), for binary classification of mammography images into benign and cancerous tissue. Thorough experiments were performed using the DDSM dataset, which consists of 5970 benign and 7158 malignant images. The accuracy of the proposed models was evaluated, yielding 99.81% for VIT, 99.92% for CCT, and 99.05% for TVIT. Additionally, the study compared these results with current state-of-the-art performance metrics. The findings demonstrate how convolution-attention mechanisms can effectively contribute to the development of robust computer-aided systems for diagnosing breast cancer. Notably, the proposed approach achieves high-performance results while also minimizing the computational resources required and reducing decision time.

Keywords

Breast cancer (BC), convolutional networks (CNN), computer-aided diagnosis (CAD), vision transformers (VIT), mammography DDSM

1. Introduction

Breast cancer is recognized as one of the most significant health challenges impacting women worldwide. According to data from the International Agency for Research on Cancer (IARC), breast cancer affected approximately 7.8 million women in the five years prior to 2020. In 2020, an estimated 2.3 million women worldwide were diagnosed with breast cancer, resulting in 685,000 fatalities, as illustrated in Fig. 1. Early diagnosis remains crucial in addressing this disease and devising timely treatment plans [1]. In clinical practice, various techniques, such as histopathology [2], CT scan images [3] and ultrasonography imaging, are employed to detect breast cancer [4]. Mammography images [5] can provide greater sensitivity in distinguishing cancerous masses. These images are typically categorized into benign and malignant lesions, providing more precise and detailed diagnostic information [6]. Pathologists are essential in the identification and classification of tissue samples used in the diagnosis of breast cancer, and microscopic examination is one of the important methods for diagnosing BC [7]. However, manual diagnosis is labor intensive and susceptible to mistakes, especially when performed by inexperienced pathologists [8]. Computer-aided diagnosis based on convolutional networks has become the dominant system in medical vision tasks [9]. In practical applications, CNNs have achieved significant success. A typical CNN architecture is composed of various layers, the most common being convolution, pooling, fully connected, and dropout layers, together with the rectified linear unit (ReLU) activation function [10]. The investigated tasks include extracting and segmenting breast cancer tumour lesions, as well as classifying breast tissue to determine whether it is benign or malignant [11].

Figure 1.

The global incidence of breast cancer in different countries worldwide in 2020.

Recently, there has been an increase of interest in using self-attention approaches in computer-aided diagnosis (CAD) due to the success of vision transformers (ViTs) [12]. The attention mechanism based on self-attention layers used in transformers splits the image into multiple patches, linearly embeds them, and then applies attention directly to this collection of patches [13]. This mechanism offers three main advantages. First, it excels in capturing long-range relationships between image pixels. Second, its capacity for adaptive modelling through the dynamic computation of self-attention weights enables the network to effectively highlight relevant image regions while suppressing noise, contributing to improved object recognition and feature extraction. Finally, the inherent ability of the attention mechanism to provide saliency maps offers valuable insights into the specific areas of interest within an image, making it an indispensable tool for enhancing the interpretability and performance of computer vision models [14]. However, vision transformers have some drawbacks: they can be quite demanding in terms of computing resources for training and testing, and they have many parameters, which makes the model more complicated [15]. Recently, there has been a growing interest in combining CNN models with ViTs to overcome these challenges [16]. In medical applications, both ViTs and CNNs face difficulties in achieving the best performance, especially when little data are available. The complex structure of ViTs and the long training times of CNNs can be particularly challenging. On the other hand, the fusion of ViT-CNN models is more effective at reducing the decision time and resources required for training [9].

Our objective is to evaluate and compare the performance of vision transformers that use convolution layers for feature extraction, trained with three different optimizers (Adam, Adamax, and SGD), in order to increase classification accuracy. The primary goal of this method is to significantly reduce training time and minimize the number of parameters for medical data analysis. Our study presents three notable contributions, as outlined below:

•

We explore whether the self-attention transformer model can be used for mammogram imaging tasks, and discuss how it can be adjusted and improved for medical image analysis.

•

We show that combining CNNs with ViTs can greatly enhance how medical images are analysed.

•

We introduce the TokenLearner transformer model, which stands out from current methods. This model offers accurate performance and low decision time, especially when dealing with limited data.

Our comparative study utilized an attention technique to divide breast images into patches, with a specific focus on patch size within tumor regions for creating mammography patches. Additionally, our study integrates attention techniques and convolution layers for classifying mammography images, offering a novel approach that could improve performance in medical data analysis for computer-aided diagnosis. The subsequent sections of this paper are organized as follows: A background on the mammogram images is presented in Section 2. A summary of relevant studies on this topic is given in Section 3. Our suggested self-attention method is explained in depth in Section 4. The results of our empirical evaluations, which were carried out with the use of the DDSM database, are presented and discussed in Section 5. A thorough description of the results is provided in Section 6. Possible directions for further study and development are described in Section 7.

Table 1

Breast cancer incidence worldwide in 2022

Population	Number	ASR (World)	Crude rate	Cum risk
Africa	198 553	40.5	28.2	4.3
Latin America and the Caribbean	220 124	52.0	65.1	5.6
Northern America	306 307	95.1	162.5	10.3
Europe	557 532	75.6	144.3	8.1
Oceania	28 507	91.5	130.5	9.9
Asia	985 817	34.3	43.4	3.7
Total	2 296 840	46.8	58.7	5.1

2. Background

In this section, we offer essential background information crucial for comprehending the subsequent content of the paper.

2.1 Mammography

Mammography is a method used for the early detection of breast cancer, employing low-dose X-rays to capture images of the interior of the breast. The technique makes potential signs of malignancy visible, such as masses, microcalcifications, asymmetries, and distortions. Figure 2 illustrates a mammography system as an example. The breast is compressed between the compression plate and the breast support, and low-intensity X-ray beams are then directed onto the breast from above using an X-ray tube. Mammography is widely adopted as the primary screening method for breast cancer due to its speed, cost-effectiveness, and minimal requirement for skilled operators [17].

Figure 2.

An overview of the mammography setup, where a woman’s breast is placed on the breast support and an X-ray beam projection is generated in the X-ray tube (courtesy: Wikipedia).

Each breast undergoes imaging from two distinct viewpoints, namely the Craniocaudal View (CC) and the Mediolateral Oblique (MLO) in mammography-based screening programs, as depicted in Fig. 3a. Sample mammograms from these viewpoints are shown in Figs 3b and 3c.

Figure 3.

Mammography projection views used in breast cancer screening studies: (a) shows the direction of the two most commonly used view points to produce mammograms, (b) CC view, and (c) MLO view.

Upon identifying areas of concern, radiologists may conduct further image projections, such as enlarged views or mediolateral views. Traditionally, screen-film mammography (SFM) utilizes photographic films to capture breast scans. However, with advancements in imaging technology, high-quality full-field digital mammography (FFDM) is now employed, allowing for immediate viewing on computers. To address the limitations of standard 2D mammography, pseudo-3D digital breast tomosynthesis (DBT) has gained popularity as a novel imaging technique. In DBT, the X-ray tube traverses a narrow arc above the breast, capturing multiple low-dose X-ray images (Fig. 4).

Figure 4.

Schematic procedure of DBT, showing the movement of the x-ray tube to acquire images at different angles. Note that the geometry differs between manufacturers.

These exposures are post-processed to generate pseudo-3D breast volumes, with each slice representing a different depth of the breast tissue. Typically, slices are spaced at 1 mm intervals, facilitating improved visualization of interior breast tissue and reducing the overlapping effects observed in conventional mammography. Various DBT geometries and acquisition settings exist, each exhibiting different performance in detecting masses and microcalcifications. Different slices of a breast DBT volume are shown in Fig. 5. Despite the increasing adoption of DBT, X-ray mammography remains the gold standard imaging modality for breast cancer screening due to its speed and cost-effectiveness [1].

Figure 5.

Sample DBT volume: Different slices of DBT volume traversing from left to right.

Given the intricate nature of breast anatomy and the challenges associated with cancer detection using mammography, there has been widespread recognition of the need for enhanced screening methods. Consequently, scientists and researchers have been diligently working to develop improved mammography techniques to facilitate early cancer detection. Regular mammograms remain one of the most effective diagnostic tools available to physicians, often detecting breast cancer up to three years before it becomes palpable.

3. Related works

The use of vision transformers with attention mechanisms in medical image processing has become increasingly popular, especially in computer-aided diagnosis. In 2021, Matsoukas Christos et al. questioned whether it is time to switch from CNNs to transformers for medical images [9]. They suggested that transformers can replace CNNs with minimal effort, especially for small medical datasets. In 2022, Henry Emerald U et al. explored whether the impact of transformers on computer vision extends to medical imaging [12]. Their paper compared the performance of transformers and CNNs across different medical imaging methods. In 2023, He Kelei et al. aimed to highlight the use of transformers in medical image analysis [18]. They emphasized that existing transformer methods can be easily applied to various medical imaging tasks with few modifications. In recent studies, Abimouloud M et al. [19] introduced vision transformer (ViT) architectures utilizing various optimizers, such as Adam, Adamax, and SGD, on the DDSM dataset. They achieved accuracies of 99.89% with Adamax, 99.96% with Adam, and 99.81% with SGD. The corresponding training times were 633.73 seconds for Adamax, 942.01 seconds for Adam, and 698.91 seconds for SGD. Boudouh, Saida Sarra et al. [20] proposed a method in which several filters were applied during the mammography preprocessing stage. Then, they utilized six transfer learning models for feature extraction and training: InceptionResNetV2, EfficientNetB7, DenseNet201, MobileNetV2, ResNet152V2, and VGG16. Their models achieved accuracies of 99.83%, 98.19%, 97.29%, 99.42%, 99.01%, and 99.83%, respectively. Ayana Gelan et al. [21] developed a method for classifying breast masses via mammography using the DDSM dataset. Their approach involved training from scratch and transfer learning using ResNet, EfficientNet, InceptionNet, and vision transformers such as the Swin Transformer and Pyramid Vision Transformer.
Their vision-transformer-based transfer-learning models all achieved an area under the receiver operating characteristic curve (AUC) of 1 $\pm$ 0, with a training time of 2900 seconds. Wessam et al. [22] introduced a comprehensive framework utilizing various deep learning models, such as InceptionV3, DenseNet121, ResNet50, VGG16, and MobileNetV2, along with data augmentation. They applied a modified U-Net model for mammogram segmentation and classification. Their methodology attained a highest accuracy of 99.43%. In the InceptionV3 system, which incorporates segmentation, classification, and data augmentation, the segmentation process took 1.2134 seconds. Oza, Parita et al. [23] employed pretrained CNN models, including VGG-16, ResNet-50, Inception-V3, and EfficientNet-B7, along with data augmentation, to extract features from the MIAS and CBIS-DDSM datasets. Their proposed model achieved an accuracy of 99.8% on the CBIS-DDSM dataset. Li Hua et al. [24] proposed a two-model approach consisting of a convolutional neural network (CNN) and a recurrent neural network (RNN). The model comprises two branch networks, with two modified ResNet architectures used to extract breast mass features. Their model achieved an accuracy, recall, and AUC of 94.70%, 94.10%, and 96.80%, respectively, on the DDSM dataset.

4. Materials and methods

In this section, we present an overview of the proposed systems, starting with a description of the DDSM dataset used for model development. Subsequently, we define the VIT, CCT and TVIT models, employing three different optimizers: Adam, Adamax, and SGD. The primary goal of these steps is to assess the effectiveness of the TVIT model compared to that of the VIT model in accurately classifying lesions in DDSM images as either malignant or nonmalignant. A block diagram of the proposed methods is shown in Fig. 6.

Table 2
Details of vision transformer model variants

Model	Image size	Patch size	Layers	Heads
Our proposed adapted ViT	224 $\times$ 224	14	8	4

Figure 6.

Block diagram of the proposed approach.

4.1 Dataset

In this study, we utilized the Digital Database for Screening Mammography (DDSM) [25], which is publicly available at https://data.mendeley.com/datasets/ywsbh3ndr8/2, to evaluate the effectiveness of our systems in identifying breast cancer masses. This dataset contains a total of 13,128 images, of which 5,970 depict benign tissues and 7,158 depict malignant tissues. The images are stored in JPEG format and have dimensions of 1888 $\times$ 2964 pixels. Figure 7 displays a sample image from the dataset.

4.2 Vision transformer

The Vision Transformer (ViT) architecture is used in our study to classify breast cancer mammograms. This architecture is designed to analyse the intricate features present in images. The process, depicted in Fig. 8, begins by resizing the original images from 227 $\times$ 227 pixels to 224 $\times$ 224 pixels to meet ViT’s requirements. These resized images are then partitioned into smaller patches, each measuring 14 $\times$ 14 pixels. This restructuring results in 256 flattened 2D image patches denoted as $x_{p}\in\mathbb{R}^{N\times(P^{2}\cdot C)}$ , where $H$ and $W$ are the image height and width, $C$ is the number of channels, $P$ is the patch size, and $N$ is the number of patches, calculated by

$\displaystyle N=\frac{\textit{HW}}{P^{2}}$ (1)
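As a minimal sketch of Eq. (1), assuming non-overlapping square patches and an image stored as nested Python lists, the patch count and the flattened patch matrix of shape $N\times(P^{2}\cdot C)$ could be computed as follows. The function names `num_patches` and `split_into_patches` are illustrative and do not come from the study.

```python
def num_patches(height, width, patch_size):
    """Number of non-overlapping P x P patches, N = H*W / P^2 (Eq. 1)."""
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by the patch size")
    return (height * width) // (patch_size ** 2)


def split_into_patches(image, patch_size):
    """Split an H x W x C image (nested lists) into N flattened patches.

    Each returned patch is a flat list of length P*P*C, so the result has
    shape N x (P^2 * C).
    """
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat = []
            for row in image[top:top + patch_size]:
                for pixel in row[left:left + patch_size]:
                    flat.extend(pixel)
            patches.append(flat)
    return patches


# Patch count for the 224 x 224 input and 14 x 14 patches used in the text.
n_vit = num_patches(224, 224, 14)

# Toy 4 x 4 RGB image whose pixel value encodes its (row, col) position.
toy = [[[r, c, 0] for c in range(4)] for r in range(4)]
toy_patches = split_into_patches(toy, patch_size=2)
```

For the 224 $\times$ 224 input with $P=14$, `n_vit` evaluates to 256, matching the patch count stated above.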

The steps of the vision transformer mechanism are as follows:

•

Each patch is flattened and projected into a higher-dimensional space $D$ through a trainable linear projection, forming embedded patch images, also known as patch embeddings. Additionally, a learnable class embedding $x_{\text{class}}$ is included in the sequence of embedded patches, serving as input for the transformer encoder block.

•

The transformer encoder block includes a residual (skip) connection that adds the original input to the output of multihead attention. The attention function computes the dot product between Q and the transpose of K, which is then scaled down by $\sqrt{d_{K}}$ , where $d_{K}$ is the dimension of K. The scaled scores are passed through a softmax function and used to weight V, as depicted in Eq. (2). Q, K, and V are projected H times, where H denotes the number of heads.

The results from these steps are then forwarded to the MLP head layer.

•

The MLP head leverages the outputs from the transformer encoder layers to generate a probability distribution over labels, facilitating the prediction of the image class. The architecture implemented in this work is described in Table 2.

$\displaystyle\text{attention}(Q,K,V)=\text{softmax}\left(\frac{QK^{T}}{\sqrt{d_{K}}}\right)V$ (2)
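The following is a minimal single-head sketch of Eq. (2) in plain Python, written for illustration rather than taken from the study's implementation. It assumes Q, K, and V are already projected and are given as lists of token vectors; the toy values are arbitrary.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention, softmax(Q K^T / sqrt(d_K)) V.

    Q, K: lists of vectors of length d_K; V: list of value vectors.
    Returns (output, weights), where each row of weights sums to 1.
    """
    scale = math.sqrt(len(K[0]))
    weights = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights.append(softmax(scores))
    d_v = len(V[0])
    output = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(d_v)]
        for row in weights
    ]
    return output, weights


# Three toy tokens with d_K = 2 (values are illustrative only).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, attn = attention(Q, K, V)
```

Each row of `attn` is a probability distribution over the input tokens, so every output vector is a weighted average of the rows of V. In multihead attention this computation is repeated once per head on separately projected Q, K, and V.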

Table 3

Details of CCT model variants

Model	Image size	Patch size	VIT layers	CONV layers
Our proposed adapted ViT	224 $\times$ 224	16	2	2

Figure 7.

DDSM dataset images.

4.3 Compact convolutional transformers (CCT)

Compact convolutional transformers (CCT) offer a new perspective in image analysis [26]. The core of CCT lies in its architecture, outlined in Fig. 9. Unlike traditional methods, CCT replaces the patch and embedding block of VIT with a basic convolution block. This block comprises a standard convolution, rectified linear unit (ReLU) activation, and a max pooling layer, which allows for enhanced flexibility in handling input images of various sizes.

For instance, given an image with dimensions represented as $x\in\mathbb{R}^{H\times W\times C}$ , the initial processing step applies the Conv2d, ReLU, and MaxPool operations in sequence, resulting in the output $x_{0}=\text{MaxPool}(\text{ReLU}(\text{Conv2d}(x)))$ . Moreover, by integrating this convolution block, CCT offers adaptability, as it is not constrained to input resolutions strictly divisible by a predefined patch size. Additionally, the convolution and max pooling operations can overlap, potentially improving performance by incorporating inductive biases. This flexibility allows the CCT to preserve local spatial information, further enhancing its capabilities in image analysis.
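To make the convolutional tokenizer concrete, here is a small pure-Python sketch of the Conv2d, ReLU, MaxPool sequence on a single-channel image. The kernel size (3), convolution padding (1), and pooling settings (size 3, stride 2, padding 1) are assumptions chosen for illustration and are not reported in the text; the helper names are hypothetical.

```python
def pad(grid, amount, fill):
    """Surround a 2D grid with `amount` rows and columns of `fill`."""
    width = len(grid[0]) + 2 * amount
    rows = [[fill] * width for _ in range(amount)]
    rows += [[fill] * amount + list(r) + [fill] * amount for r in grid]
    rows += [[fill] * width for _ in range(amount)]
    return rows


def windows(grid, size, stride):
    """Yield, row by row, the size x size windows taken with `stride`."""
    for top in range(0, len(grid) - size + 1, stride):
        yield [
            [row[left:left + size] for row in grid[top:top + size]]
            for left in range(0, len(grid[0]) - size + 1, stride)
        ]


def conv2d(image, kernel, stride=1, padding=1):
    """Single-channel 2D convolution (cross-correlation form)."""
    k = len(kernel)
    return [
        [sum(win[i][j] * kernel[i][j] for i in range(k) for j in range(k))
         for win in row]
        for row in windows(pad(image, padding, 0.0), k, stride)
    ]


def relu(fmap):
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool(fmap, size=3, stride=2, padding=1):
    """Max pooling; windows overlap because stride < size."""
    return [
        [max(v for line in win for v in line) for win in row]
        for row in windows(pad(fmap, padding, float("-inf")), size, stride)
    ]


def conv_tokenizer(image, kernels):
    """x0 = MaxPool(ReLU(Conv2d(x))), flattened into a token sequence."""
    maps = [max_pool(relu(conv2d(image, k))) for k in kernels]
    h, w = len(maps[0]), len(maps[0][0])
    return [[m[r][c] for m in maps] for r in range(h) for c in range(w)]


# A 7 x 7 input: 7 is not divisible by common patch sizes such as 2 or 4.
image = [[float(r * 7 + c) for c in range(7)] for r in range(7)]
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
negated = [[0, 0, 0], [0, -1, 0], [0, 0, 0]]
tokens = conv_tokenizer(image, [identity, negated])
```

The 7 $\times$ 7 toy input yields a 4 $\times$ 4 grid of tokens, one feature per kernel, which illustrates why this tokenizer does not require the input resolution to be a multiple of a patch size.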

Figure 8.

Vision transformer self-attention architecture.

Figure 9.

Architecture of the compact convolution transformer.

4.4 Token learner

The TokenLearner model operates by first taking an image-shaped tensor as its initial input. This tensor then undergoes processing through multiple convolutional layers, generating distinct spatial attention maps to capture specific spatial features. Spatial attention mechanisms enable the model to focus on relevant regions within the input, which is essential for retaining important information. Subsequently, elementwise multiplication is employed to accentuate the importance of specific regions while de-emphasizing others. The resulting product is then aggregated using a pooling operation to condense the information into a pooled representation. The following is a breakdown of the TokenLearner architecture:

•
Input tensor: The module takes an image-shaped tensor $X\in\mathbb{R}^{T\times H\times W\times C}$ as input.
•
Convolutional layers: The tensor undergoes processing through multiple single-channel convolutional layers, each capturing specific spatial features and generating a distinct spatial attention map. This multi-layered approach enhances the model’s expressivity, enabling it to capture intricate details and patterns in the input.
•
Spatial attention: Spatial attention allows the network to focus on relevant regions within the input, which is crucial for retaining important information. This ensures that the model emphasizes relevant areas, especially when working with a reduced number of patches.
•
Elementwise multiplication: The attention maps are applied to the original input through elementwise multiplication, accentuating the importance of specific regions while de-emphasizing others.
•
Pooling operation: The resulting product from elementwise multiplication is aggregated using a pooling operation, condensing the information.
•
Output: The output is a pooled representation treated as input to the transformer encoder block. This summary features a significantly reduced number of patches compared to the image-splitting technique in the base self-attention model VIT.
•
TokenLearner module: This module combines convolutional layers, spatial attention, and pooling to efficiently process input images, providing a condensed yet informative representation with reduced training time and processing.

Figure 10.
Token vector generation from input patches (courtesy: [27]).
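The attention, multiplication, and pooling steps listed above can be sketched as follows. This is an illustrative simplification, not the study's implementation: it handles a single frame, takes the spatial positions as a flat list of $H\cdot W$ vectors, and receives the attention logits directly, whereas in the actual module they are produced by the convolutional layers. The sigmoid squashing and the function name `token_learner` are assumptions.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def token_learner(features, attention_logits):
    """Pool N spatial feature vectors into S tokens.

    features: list of N vectors (the H*W positions, each of length C).
    attention_logits: S lists of N scores, one spatial map per output token.
        In the real module these come from convolutional layers; here they
        are supplied directly as a stand-in.
    """
    n = len(features)
    channels = len(features[0])
    tokens = []
    for logits in attention_logits:
        # Spatial attention map with values in (0, 1).
        weights = [sigmoid(v) for v in logits]
        # Elementwise multiplication of the map with the input features.
        weighted = [[w * x for x in vec] for w, vec in zip(weights, features)]
        # Global average pooling over the N spatial positions.
        pooled = [sum(vec[c] for vec in weighted) / n for c in range(channels)]
        tokens.append(pooled)
    return tokens


# Four spatial positions with two channels each (illustrative values).
features = [[4.0, 8.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
logits = [
    [20.0, -20.0, -20.0, -20.0],  # map that selects position 0 only
    [0.0, 0.0, 0.0, 0.0],         # map that weights every position by 0.5
]
learned = token_learner(features, logits)
```

Here four spatial positions are reduced to two tokens, so the transformer encoder that follows processes a sequence of length S instead of N. This reduction is the source of the lower training cost discussed for TVIT.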

5. Experiment and results

5.1 Dataset splitting

This paper aims to efficiently and accurately diagnose breast cancer (BC) using the DDSM dataset. The dataset was split into two segments for evaluating the effectiveness of the classification model: 80% of the dataset (10,502 images) was designated for the training phase, while the remaining 20% (2,626 images) was set aside for testing. Training was performed on an HP Z8 G4 workstation with the following configuration:

•
Memory (RAM): 96.00 GB
•
Processor: Intel(R) Xeon(R) Silver 4108 CPU @ 1.80 GHz
•
Graphic Processing Unit: (GeForce RTX 2080 Ti, GeForce RTX 3090)
•
System type: 64-bit operating system, x64 processor.
•
Software: Python 3.11.
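The 80/20 split described at the start of this subsection can be sketched with the standard library alone. The shuffling, the fixed seed, and the tuple stand-ins for image files are assumptions made for illustration; the text does not state how images were assigned to each part.

```python
import random


def train_test_split(items, train_fraction=0.8, seed=0):
    """Shuffle a list and split it into train and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical stand-ins for image files: 5,970 benign and 7,158 malignant.
samples = ([("benign", i) for i in range(5970)]
           + [("malignant", i) for i in range(7158)])
train, test = train_test_split(samples)
```

With 13,128 images, truncating 0.8 $\times$ 13,128 gives 10,502 training images and leaves 2,626 for testing, which matches the counts reported above.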

5.2 Dataset preprocessing

Ensuring a balanced dataset is crucial for creating a reliable and fair classification model. When dealing with imbalanced datasets, such as those often encountered in medical contexts, there is a risk of poor generalization and greater chance of misclassification. To address this challenge with the DDSM dataset, we implemented data augmentation as a key strategy to overcome data scarcity. The specific details of our data augmentation approach can be found in Table 4. Additionally, the hyperparameters used in training our model are listed in Table 5.

Table 4
Data Augmentation Parameters

Data augmentation technique	Value
Rotation range	5
Width shift range	0.1
Height shift range	0.1
Zoom range	0.001
Fill mode	0.001

Table 5

Hyperparameter settings

Hyperparameter	Value
Batch size	32
Number of epochs	20
Optimizer	Adam
Loss function	Binary crossentropy
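Since the study compares Adam, Adamax, and SGD, the sketch below shows how the three update rules differ on a one-parameter toy problem. It is not the training code used in the study: the learning rate of 0.1, the default decay rates $\beta_1=0.9$ and $\beta_2=0.999$, and the quadratic objective are all assumptions made for illustration.

```python
import math


def sgd_step(theta, grad, state, lr=0.1):
    """Plain gradient descent: move against the gradient."""
    return theta - lr * grad


def adam_step(theta, grad, state, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: bias-corrected first and second moment estimates."""
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * grad
    state["v"] = b2 * state["v"] + (1 - b2) * grad * grad
    m_hat = state["m"] / (1 - b1 ** state["t"])
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps)


def adamax_step(theta, grad, state, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """Adamax: replaces Adam's second moment with an infinity-norm estimate."""
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * grad
    state["u"] = max(b2 * state["u"], abs(grad))
    step_size = lr / (1 - b1 ** state["t"])
    return theta - step_size * state["m"] / (state["u"] + eps)


def minimise(step, theta=1.0, steps=100):
    """Minimise f(theta) = theta^2 (gradient 2 * theta) with the given rule."""
    state = {"t": 0, "m": 0.0, "v": 0.0, "u": 0.0}
    for _ in range(steps):
        theta = step(theta, 2.0 * theta, state)
    return theta


results = {
    name: minimise(fn)
    for name, fn in [("SGD", sgd_step), ("Adam", adam_step),
                     ("Adamax", adamax_step)]
}
```

All three rules drive the parameter towards the minimum at zero, but Adam and Adamax normalise the step by a running estimate of the gradient magnitude, so their first step has size close to the learning rate regardless of the gradient scale.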

5.3 Evaluation metrics

To find the best training configuration for the vision transformer models, three optimizers (Adam, Adamax, and SGD) were tested. The performance of the models was assessed using the confusion matrix, sensitivity, specificity, accuracy, precision, area under the ROC curve (AUC), F1-score, total training time, and time per epoch. These metrics are defined in Eqs. (3)-(10), where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.

$\displaystyle\textit{Accuracy}(\%)=\frac{\textit{TP}+\textit{TN}}{\textit{TP}+\textit{TN}+\textit{FP}+\textit{FN}}$ (3)

$\displaystyle\textit{Sensitivity}(\%)=\frac{\textit{TP}}{\textit{TP}+\textit{FN}}$ (4)

$\displaystyle\textit{Precision}(\%)=\frac{\textit{TP}}{\textit{TP}+\textit{FP}}$ (5)

$\displaystyle\textit{Specificity}(\%)=\frac{\textit{TN}}{\textit{TN}+\textit{FP}}$ (6)

$\displaystyle\textit{F1score}(\%)=\frac{2\times[\textit{Precision}\times\textit{Sensitivity}]}{\textit{Precision}+\textit{Sensitivity}}$ (7)

$\displaystyle\textit{AUC}=\int_{0}^{1}\textit{TPR}\;d(\textit{FPR}),\quad\textit{TPR}=\textit{Sensitivity},\quad\textit{FPR}=1-\textit{Specificity}$ (8)

$\displaystyle\textit{Total\_Training\_Time}(s)=\textit{end\_time}-\textit{start\_time}$ (9)

$\displaystyle\textit{Avg\_Training\_Time\_per\_epoch}=\frac{\textit{Total\_Training\_Time}}{\textit{Num\_epochs}}$ (10)
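As a sketch of how the count-based metrics and the per-epoch time can be computed, the helper below takes confusion-matrix counts and returns percentages. The counts in the example are hypothetical values that sum to the 2,626 test images; they are not taken from the paper's confusion matrices, and the function names are illustrative.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, precision, specificity, and F1 as percentages."""
    sensitivity = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "accuracy": 100 * (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": 100 * sensitivity,
        "precision": 100 * precision,
        "specificity": 100 * tn / (tn + fp),
        "f1": 100 * 2 * precision * sensitivity / (precision + sensitivity),
    }


def avg_time_per_epoch(total_seconds, num_epochs):
    """Average training time per epoch in seconds."""
    return total_seconds / num_epochs


# Hypothetical confusion-matrix counts for a 2,626-image test set.
example = classification_metrics(tp=1425, tn=1196, fp=3, fn=2)

# Per-epoch time for a total of 8421.03 s over 20 epochs.
cct_epoch_seconds = avg_time_per_epoch(8421.03, 20)
```

Dividing the CCT total of 8421.03 s by the 20 epochs listed in Table 5 gives about 421.05 s per epoch.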

Table 6

Performance comparison metrics of the optimizers

Model	Optimizer	Accuracy	Precision	Sensitivity	Specificity	AUC	F1 score
VIT	Adam	99.81%	99.80%	99.81%	99.81%	99.99%	99.80%
	Adamax	99.54%	99.53%	99.54%	99.54%	99.99%	99.53%
	SGD	99.81%	99.81%	99.80%	99.80%	99.99%	99.80%
CCT	Adam	99.81%	99.93%	99.72%	99.92%	100.0%	99.83%
	Adamax	99.92%	99.86%	100.0%	99.83%	100.0%	99.93%
	SGD	99.09%	99.79%	98.53%	99.75%	99.93%	99.16%
TVIT	Adam	98.55%	98.54%	98.53%	98.53%	99.90%	94.50%
	Adamax	99.05%	99.01%	99.06%	99.06%	99.90%	99.04%
	SGD	80.88%	80.78%	80.58%	80.58%	91.40%	80.66%

Table 7

The best performance in binary classification for each model

Model	VIT	CCT	TVIT
Total training time (s)	819.77	8421.03	533.90
Avg training time per epoch (s)	40.98	421.05	26.69
Accuracy	99.81%	99.92%	99.05%
Precision	99.80%	99.86%	99.01%
Sensitivity	99.81%	100%	99.06%
Specificity	99.81%	99.83%	99.06%
AUC	99.99%	100%	99.90%
F1 score	99.80%	99.93%	99.04%
Best optimizer	Adam	Adamax	Adamax

Figure 11.

Confusion matrices of the VIT model with the Adam (1), Adamax (2), and SGD (3) optimizers, respectively.

5.4 Results

In this section, we conduct a comprehensive performance review of models, with three different optimizers: Adam, Adamax, and SGD. To provide a clear overview of our findings, the optimal performances of our proposed systems are meticulously detailed in Table 6.

Based on the results presented in Table 7, the CCT coupled with the Adamax optimizer emerges as the top-performing model, achieving an accuracy of 99.92%. VIT paired with the Adam optimizer achieves an accuracy of 99.81%, while TVIT paired with the Adamax optimizer achieves 99.05%. TVIT with Adamax also has the shortest training time, completing in 533.90 seconds. These results highlight the effectiveness and efficiency of the systems across various optimization techniques, offering valuable insights into their respective strengths and performance metrics. The confusion matrices of the models with the Adam (1), Adamax (2), and SGD (3) optimizers on the DDSM dataset are illustrated in Figs 11, 12, and 13.

Figure 12.

Confusion matrices of the CCT model with the Adam (1), Adamax (2), and SGD (3) optimizers, respectively.

Figure 13.

Confusion matrices of the TokenLearner model with the Adam (1), Adamax (2), and SGD (3) optimizers, respectively.

Figure 14.

Illustration of the ROC curves of VIT, CCT, and TVIT with best optimizer.

6. Discussion

In this study, we introduced three systems, the self-attention vision transformer (VIT), the compact convolution transformer (CCT), and the Token Learner (TVIT), trained with different optimizers for the early diagnosis of breast cancer using the DDSM dataset. These systems demonstrated impressive performance in classifying breast cancer masses.

To mitigate overfitting, we employed data augmentation to generate additional images within the same dataset. Previous studies have focused mainly on classifying the DDSM breast cancer dataset using CNN models or traditional machine learning algorithms. However, our study emphasized feature extraction from tissue images within the DDSM dataset using attention techniques with convolution layers in the proposed systems. A significant contribution of this study was the comparison of three optimizers (Adam, Adamax, and SGD) for minimizing the loss function and improving training efficiency. The optimizer determines how the learnable parameters (weights) of the network are updated, given the learning rate, to reduce the loss. This technique introduces a new approach combining self-attention transformers and convolutional models, enabling accurate computer-aided breast cancer diagnosis with fewer computational resources and shorter training times, particularly for medical image datasets. As shown in Table 7, CCT achieves the highest classification accuracy among all tested models, 99.92%, when utilizing the Adamax optimizer. The self-attention vision transformer achieves an accuracy of 99.81% with the Adam optimizer, while the Token Learner achieves 99.05% accuracy with the Adamax optimizer. Adamax gives the highest accuracy for CCT and TVIT, whereas for VIT both Adam and SGD (99.81%) outperform Adamax (99.54%), as reported in Table 6. When considering the training time for each model, the Token Learner achieves the lowest training time, taking only 533.90 seconds. VIT follows with a training time of 819.77 seconds.

However, it is important to note that CCT requires a much longer training time of 8421.03 seconds. The CCT model combines convolution layers with an attention mechanism, which may contribute to its longer training time. The Token Learner model uses the same mechanisms in its architecture but requires less training time. This could be attributed to the tokenization technique employed in the Token Learner model, which dynamically highlights relevant regions in the input patches. This technique reduces the number of patches used in training, resulting in lower training time and complexity compared to the CCT model. The experiments show that there are still opportunities to enhance the performance of transformer-based models in breast cancer mammography, where accuracy is crucial. Our study suggests that transformer-based convolution models could benefit from better mechanisms for aggregating information. Models such as the Token Learner, which uses tokenization and attention maps to process image input efficiently at any resolution, should be considered. Finally, compared with recent research, this method produced notably improved results, demonstrating the effectiveness of combining ViTs and CNNs for detecting breast tumors in mammography images. In a comparison with models from prior studies, our best model (CCT) achieved the highest accuracy among the compared methods, as illustrated in Table 8.

Table 8
Assessment of the performance of previously published works on mammography for breast cancer detection using the DDSM dataset

Reference	Method	Optimizer	Performance	Time
Boudouh, Saida Sarra et al. [20]	InceptionResNetV2		ACC 99.83%
	EfficientNetB7		ACC 98.19%
	DenseNet201		ACC 97.29%
	MobileNetV2		ACC 99.42%
	ResNet152V2		ACC 99.01%
	VGG16		ACC 99.83%
Ayana Gelan et al. [21]	vision-transformer-based transfer-learning		(AUC) of 1 $\pm$ 0	2900 seconds
Abimouloud M et al. [19]	VIT		ACC 99.81%	633.73 seconds
Wessam et al. [22]	InceptionV3		ACC 99.43%	12134 seconds
Oza, Parita et al. [23]	pretrained CNN model		ACC 99.8%
Li Hua et al. [24]	CNN-RNN		ACC 94.70%
Our systems	VIT	Adam	ACC 99.81%	819.77 seconds
	CCT	Adamax	ACC 99.92%	8421.03 seconds
	TVIT	Adamax	ACC 99.05%	533.90 seconds

7. Conclusion

This approach introduces new hybrid transformer-convolution models, offering an advanced computer-aided analysis for diagnosing breast cancer through mammography images. This innovative approach significantly reduces the need for computational resources and training time, especially when dealing with large medical image datasets. The following is a summary of our experiments:

• The hybrid attention convolution approach introduces an innovative method that combines the strengths of both techniques in medical visual diagnosis.

• Tokenization mechanisms provide a promising solution for medical scenarios with limited training data and restricted computational resources.

In future studies, we plan to explore a wider range of optimization techniques and fine-tune hyperparameters. Additionally, integrating multimodal data and validating our system with real patient data could further enhance diagnostic accuracy and clinical applicability. We also aim to investigate strategies for faster decision-making with larger datasets, in order to continue improving our system’s efficiency and effectiveness in medical image analysis.

Funding

This study was not funded. The authors have no relevant financial or non-financial interests to disclose.

Availability of data and materials

The dataset analysed during the current study is available at: https://data.mendeley.com/datasets/ywsbh3ndr8/2.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Footnotes

Conflict of interest

The authors declare that they have no conflicts of interest.

References

1. Agarwal et al., Computer aided detection for breast lesion in ultrasound and mammography (2019).

2. Maalej, Mezghani, Elleuch, Kherallah et al., Transfer learning and data augmentation for improved breast cancer histopathological images classifier, International Journal of Computer Information Systems & Industrial Management Applications 15 (2023).

3. Appari N.V.L. and Kanojia M.G., Soft computing and image processing techniques for COVID-19 prediction in lung CT scan images, Int. J. Hybrid Intell. Syst. 18 (2022), 111–131. doi: 10.3233/HIS-220009.

4. Rustam, Hapsari and Solihin, Optimal cervical cancer classification using gauss-newton representation based algorithm, Vol. 2168 (AIP Publishing, 2019).

5. Bharati, Podder and Mondal, Artificial neural network based breast cancer screening: a comprehensive review, International Journal of Computer Information Systems & Industrial Management Applications 12 (2020).

6. Al-Jabbar, Alshahrani, Senan E.M. and Ahmed I.A., Multi-method diagnosis of histopathological images for early detection of breast cancer based on hybrid and deep learning, Mathematics 11 (2023), 1429.

7. Das, Conjeti, Roy A.G., Chatterjee and Sheet, Multiple instance learning of deep convolutional neural networks for breast histopathology whole slide classification, IEEE, 2018, pp. 578–581.

8. Zewdie E.T., Tessema A.W. and Simegn G.L., Classification of breast cancer types, sub-types and grade from histopathological images using deep learning technique, Health and Technology 11 (2021), 1277–1290.

9. Matsoukas, Haslum J.F., Söderberg and Smith, Is it time to replace CNNs with transformers for medical images? arXiv preprint arXiv:2108.09038 (2021).

10. Mohamed E.A., Rashed E.A., Gaber and Karam, Deep learning model for fully automated breast cancer detection system from thermograms, PLoS One 17 (2022), e0262349.

11. Mohamed E.A., Gaber, Karam and Rashed E.A., A novel CNN pooling layer for breast cancer segmentation and classification from thermograms, PLoS One 17 (2022), e0276523.

12. Henry E.U., Emebob and Omonhinmin C.A., Vision transformers in medical imaging: A review, arXiv preprint arXiv:2211.10043 (2022).

13. Zhu, Cheng, Zhang, Lin and Dai, An empirical study of spatial attention mechanisms in deep networks, 2019, pp. 6688–6697.

14. Dosovitskiy et al., An image is worth 16 × 16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020).

15. Mehta and Rastegari, MobileViT: light-weight, general-purpose, and mobile-friendly vision transformer, arXiv preprint arXiv:2110.02178 (2021).

16. Kadri, Bouaziz, Tmar and Gargouri, CrossViT wide residual squeeze-and-excitation network for Alzheimer’s disease classification with self attention ProGAN data augmentation, Int. J. Hybrid Intell. Syst. 17 (2021), 163–177. doi: 10.3233/HIS-220002.

17. Benlamoudi et al., Deep Neural Networks Improve Radiologists’ Performance in Breast Cancer Screening, Ph.D. thesis, University of Ouargla.

18. et al., Transformers in medical image analysis, Intelligent Medicine 3 (2023), 59–78.

19. Abimouloud M.L., Bensid, Elleuch, Aiadi and Kherallah, Mammography breast cancer classification using vision transformers, 23rd International Conference on Intelligent Systems Design and Applications (ISDA 2023) (December 11–13, 2023).

20. Boudouh S.S. and Bouakkaz, New enhanced breast tumor detection approach in mammogram scans based on pre-processing and deep transfer learning techniques, Multimedia Tools and Applications (2023), pp. 1–22.

21. Ayana et al., Vision-transformer-based transfer learning for mammogram classification, Diagnostics 13 (2023), 178.

22. Salama W.M. and Aly M.H., Deep learning in mammography images segmentation and classification: Automated CNN approach, Alexandria Engineering Journal 60 (2021), 4701–4709.

23. Oza, Sharma and Patel, Breast lesion classification from mammograms using deep neural network and test-time augmentation, Neural Computing and Applications 36 (2024), 2101–2117.

24. Niu and Zhang, Classification of breast mass in two-view mammograms via deep learning, IET Image Processing 15 (2021), 454–467.

25. Huang M.-L. and Lin T.-Y., Dataset of breast mammography images with masses, Data in Brief 31 (2020), 105928.

26. Hassani et al., Escaping the big data paradigm with compact transformers, arXiv preprint arXiv:2104.05704 (2021).

27. Ryoo, Piergiovanni, Arnab, Dehghani and Angelova, TokenLearner: Adaptive space-time tokenization for videos, Advances in Neural Information Processing Systems 34 (2021), 12786–12797.
