Abstract
BACKGROUND:
Content-based image retrieval (CBIR) systems are vital for managing the large volumes of data produced by medical imaging technologies. They enable efficient retrieval of relevant medical images from extensive databases, supporting clinical diagnosis, treatment planning, and medical research.
OBJECTIVE:
This study aims to enhance CBIR systems’ effectiveness in medical image analysis by introducing the VisualSift Ensembling Integration with Attention Mechanisms (VEIAM). VEIAM seeks to improve diagnostic accuracy and retrieval efficiency by integrating robust feature extraction with dynamic attention mechanisms.
METHODS:
VEIAM combines Scale-Invariant Feature Transform (SIFT) with selective attention mechanisms to emphasize crucial regions within medical images dynamically. Implemented in Python, the model integrates seamlessly into existing medical image analysis workflows, providing a robust and accessible tool for clinicians and researchers.
RESULTS:
The proposed VEIAM model demonstrated an impressive accuracy of 97.34% in classifying and retrieving medical images. This performance indicates VEIAM’s capability to discern subtle patterns and textures critical for accurate diagnostics.
CONCLUSIONS:
By merging SIFT-based feature extraction with attention processes, VEIAM offers a discriminatively powerful approach to medical image analysis. Its high accuracy and efficiency in retrieving relevant medical images make it a promising tool for enhancing diagnostic processes and supporting medical research in CBIR systems.
Keywords
Introduction
As an information retrieval approach, Content-Based Image Retrieval (CBIR) provides a way to handle and search massive visual data sets. We live in a digital world where images abound on everything from social media to e-commerce websites. As a result, effective ways to categorize, search for, and retrieve these images are essential [1, 2]. CBIR fills this need by making use of an image’s inherent content rather than only metadata or written descriptions. The core functionality of CBIR is to analyze image content and retrieve similar images based on visual characteristics including color, texture, shape, and spatial arrangement. Unlike standard text-based retrieval systems, which depend on image tags or keywords, CBIR searches for images by their visual qualities, which makes it well suited to situations where textual descriptions are absent or inadequate.
In the 1970s and 1980s, pioneering work in computer vision and image processing laid the groundwork for what would later become known as CBIR. However, CBIR did not become popular or practical until digital images were widely used in the late 20th and early 21st centuries. Today, CBIR is used in many fields, such as medical imaging, fashion, multimedia retrieval, surveillance, art and cultural heritage preservation, and satellite image analysis [3–5]. A CBIR system is built from several essential components: feature extraction, feature representation, similarity assessment, indexing, and retrieval algorithms. Feature extraction derives useful visual information from images, such as color histograms, texture patterns, and shape descriptors. The retrieval procedure relies on these features to compare and match images.
Feature representation encodes and organizes the extracted features into a format that can be stored and retrieved efficiently. The needs of the application dictate the choice of approach, which may include deep learning-based representations, bag-of-visual-words models, or vector quantization. Similarity measurement determines how close a query image is to each database image [6–8]. Common similarity metrics include Euclidean distance, cosine similarity, and feature-specific metrics such as the Earth Mover’s Distance for histograms and the Structural Similarity Index for texture images. Indexing speeds up retrieval, which is especially important in massive image databases. It entails categorizing images according to their visual characteristics and grouping them into a structured index, so that relevant images can be retrieved quickly when users submit queries. Common indexing strategies include clustering algorithms, tree-based structures, and hashing approaches.
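As a rough illustration of the similarity measurement step, the sketch below computes Euclidean distance and cosine similarity between feature vectors and ranks a small database against a query. It assumes NumPy is available; the function names and the toy feature values are hypothetical and are not taken from the VEIAM implementation.

```python
import numpy as np


def euclidean_distance(a, b):
    """Euclidean (L2) distance between two feature vectors; smaller is more similar."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.linalg.norm(a - b))


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors; larger is more similar."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return float(np.dot(a, b) / denom)


def rank_by_distance(query, database):
    """Return database row indices ordered from nearest to farthest (brute force)."""
    database = np.asarray(database, dtype=float)
    dists = np.linalg.norm(database - np.asarray(query, dtype=float), axis=1)
    return np.argsort(dists, kind="stable")


# Toy feature vectors (hypothetical values, for illustration only).
database = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
query = np.array([1.0, 0.0])
ranking = rank_by_distance(query, database)
```

A brute-force ranking like this scales linearly with database size, which is why the indexing strategies mentioned above matter for large collections.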
Retrieval techniques control how the image database is searched for appropriate images in response to user queries [9, 10]. These approaches range from basic ones, such as nearest-neighbor search, to more complex ones, such as relevance ranking algorithms, query expansion, and relevance feedback loops. Bridging low-level visual cues and the high-level semantic notions perceived by humans is one of the main challenges in CBIR. Computers excel at processing and analyzing pixel-level information, but the deeper semantic meaning of an image is often beyond their reach. This semantic gap causes problems for CBIR systems: when people search for images, they may have particular semantic ideas or visual preferences in mind, and low-level features might not capture them [11].
In order to tackle this difficulty, researchers have investigated multiple methods. Some of these methods include using deep learning techniques for feature learning and semantic embedding, integrating textual metadata, or adding semantic image annotations. Specifically, Convolutional Neural Networks (CNNs) and other deep learning architectures have made it possible to directly extract high-level, semantically relevant characteristics from images, which has completely transformed CBIR [12, 13]. In addition to capturing complex visual patterns, these learnt characteristics also encode semantic information that is more in line with how humans see things. Evaluation and benchmarking are also important parts of CBIR. In order to evaluate CBIR systems fairly, it is necessary to have strong assessment criteria and benchmark datasets. Common criteria for evaluation include recall, precision, F-measure, and mean average precision. To evaluate CBIR algorithms across different domains and applications, standardized testbeds are provided by benchmark datasets like ImageNet, CIFAR, or COCO. Figure 1 shows a chest X-ray image.
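To make the evaluation criteria concrete, the following sketch computes precision, recall, F-measure, and (mean) average precision for ranked retrieval results. It is a minimal illustration in plain Python that assumes each retrieved list contains distinct item identifiers; the function names are hypothetical.

```python
def precision_recall_f1(retrieved, relevant):
    """Precision, recall and F1 for one query (retrieved items assumed distinct)."""
    retrieved = list(retrieved)
    relevant = set(relevant)
    hits = sum(1 for item in retrieved if item in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


def average_precision(ranked, relevant):
    """Average of the precision values at each rank where a relevant item appears."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(rankings, relevants):
    """Mean of the per-query average precision over a set of queries."""
    scores = [average_precision(r, rel) for r, rel in zip(rankings, relevants)]
    return sum(scores) / len(scores) if scores else 0.0
```

Precision and recall ignore the order of the results, whereas average precision rewards rankings that place relevant images near the top.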

Chest X-Ray Image.
The availability of large-scale annotated datasets, the proliferation of deep learning techniques, and the increasing demand for intelligent image search and retrieval solutions have all contributed to substantial developments in CBIR in recent years. Across a broad variety of applications, deep learning-based CBIR systems have often outperformed classic methods based on handcrafted features. Research has recently shifted away from classic CBIR paradigms and toward multimodal retrieval, which seeks to integrate several modalities (e.g., images, text, and audio) to provide more expressive and thorough retrieval capabilities [14–16]. The development of tailored and context-aware retrieval systems, which cater to specific demographics, interests, and other contextual aspects, has also improved both the user experience and the relevance of the returned images. In the future, CBIR has great potential to transform visual data interaction by providing easier, faster, and more tailored access to large image collections in a variety of fields. With the rapid advancements in computer vision, machine learning, and data analytics, CBIR is well positioned to continue shaping the future of visual information retrieval and interpretation.
VisualSift Ensembling is a CBIR method that uses ensemble learning strategies to improve the precision and resilience of image retrieval systems [17, 18]. Its foundation is a collection of separate CBIR models, called base learners, whose findings are aggregated to make collaborative decisions and achieve better performance. The ensemble learning paradigm rests on the observation that different base learners may have complementary strengths and limitations, or may excel at capturing distinct parts of the underlying data distribution. The purpose of VisualSift Ensembling is to increase retrieval performance by reducing variance, mitigating individual model biases, and bringing together disparate learners into a coherent ensemble.
Despite their importance, existing CBIR systems in medical image analysis often face challenges related to feature extraction, relevance ranking, and computational efficiency. Addressing these challenges requires the development of innovative approaches that leverage advanced techniques in image processing, machine learning, and artificial intelligence [19, 20].
In this study, we propose VEIAM, a novel approach designed to enhance CBIR systems in medical image analysis. VEIAM integrates the robustness of SIFT-based feature extraction with selective attention mechanisms to dynamically highlight and prioritize relevant regions within medical images. Implemented in Python, VEIAM offers a flexible and efficient solution for medical image analysis, enabling clinicians and researchers to retrieve and analyze relevant medical images with high accuracy and efficiency.
VEIAM combines SIFT-based feature extraction with selective attention mechanisms to enhance the efficiency and accuracy of medical image retrieval. One of its key strengths is its comprehensive image preprocessing pipeline. By integrating noise reduction using Wiener filtering and edge detection via a zero-crossing detector, VEIAM improves the quality of the input medical images and prepares them for subsequent analysis. VEIAM also gives clinicians and researchers a tool for efficiently retrieving and analyzing relevant medical images from vast databases. By leveraging the content characteristics of the images, it supports diagnosis, treatment planning, and research, and it lets practitioners access pertinent medical imagery quickly. Additionally, VEIAM incorporates Gray-Level Co-occurrence Matrix (GLCM) feature extraction, a technique that captures textural characteristics within the images. The GLCM features enhance the discriminative power of the feature representation, enabling more precise analysis of medical images and contributing to diagnostic accuracy and efficiency.
In summary, VEIAM represents a significant advancement in the field of medical image analysis, offering a holistic solution to the challenges faced by existing CBIR systems. By seamlessly integrating cutting-edge preprocessing techniques, selective attention mechanisms, and advanced feature extraction methods, VEIAM paves the way for enhanced diagnostic capabilities and more efficient healthcare practices.
Related work
Due to the subjective nature of human perception and the imprecision of image annotations, retrieving images via a textual query is challenging. One way around these problems is to pay more attention to the images themselves than to their textual descriptions. Traditional feature extraction methods require expert knowledge to choose among a limited set of feature types, and they are highly sensitive to changes in imaging settings. Deep feature extraction using Convolutional Neural Networks (CNNs), which can learn feature representations automatically, is a solution to these issues. In [21], the feature extraction performance of multiple pre-trained CNN models is thoroughly compared. Datasets of men’s footwear and women’s clothes are analyzed using the VGG16, VGG19, InceptionV3, Xception, and ResNet50 models for feature extraction. SVM, Random Forest, and K-Nearest Neighbors classifiers are then applied to the extracted features. The results of the feature extraction and image retrieval experiments show that the VGG19, Inception, and Xception features work effectively, leading to a respectable 97.5% efficiency in image categorization. The findings are further supported by comparing the effectiveness of image retrieval with the derived features and similarity metrics. The authors also compared the accuracy of features extracted by the chosen pre-trained CNN models with results obtained using traditional classification approaches on the CIFAR-10 dataset.
When working with device-generated multimedia and image processing techniques, retrieving images from a database that are similar to the user’s query requires a significant amount of CPU resources. Because pixel-wise image matching introduces significant pattern, storage, and angle variations, a traditional annotation-based image retrieval system cannot deliver consistent results. Here, the CBIR technique is frequently preferred. CBIR is a powerful technique that quickly quantifies the degree to which database images resemble the query image [22]. CBIR takes the query image and uses it to find visually similar images in a massive database. It extracts features from the query image, compares them with those of the images in the database, and then retrieves images with the same or comparable features. The work in [22] presents a hybrid deep learning and machine learning-based CBIR system that uses transfer learning with one machine learning model, KNN, and two pre-trained deep learning models, ResNet50 and VGG16. The authors employed transfer learning to extract image features with the two deep learning models. The KNN model and the standard geometric distance are used to determine the degree of image similarity. They built a web interface to display the retrieved similar images, and the model’s performance, measured by precision, reached 100%. Digital libraries, historical research, fingerprint recognition, and crime prevention are just a few of the CBIR-reliant applications that could benefit from the suggested system’s performance.
Medical imaging is crucial for diagnosing and treating a broad variety of diseases, since it provides doctors with vital information about the inside of the body for clinical analysis and treatment decisions. The fast growth in medical diagnosis has led to massive databases of medical images, within which it can be challenging to find comparable images. A method for dealing with this issue using deep learning and CNNs is described in [23]. This method employs optimization techniques and deep learning in an effort to improve the accuracy and efficiency of content-based medical image retrieval (CBMIR). The suggested model has two main parts: (a) training and (b) testing. During training, the authors perform pre-processing, extract features, and select the best features. Before being stored in the database, the images undergo pre-processing that includes the Gaussian filter, CLAHE, and Gaussian smoothing. Afterwards, VGG19 and the Inception V3 CNN model are used to extract deep features from the database images. The extracted features are then merged, and the best features are chosen using the Coyote-Moth Optimization Algorithm (CMOA), which conceptually combines the conventional moth-flame optimization (MFO) algorithm and the coyote optimization algorithm (COA).
The semantic gap between low-level and high-level characteristics can be reduced by describing visual content using deep learning techniques. However, CNNs are biased toward textures and pay little attention to the overall shape of objects. This does not match human perception, and it may forfeit the benefits of combining deep and low-level features. The sublimated deep feature histogram (SDFH), a new approach to image retrieval presented in [24], therefore uses sublimated deep features, which integrate global object shape and color characteristics, rather than basic deep features. Its main contributions are: 1) An orientation-selective feature was introduced to simulate the orientation-selection process. This mitigates the unintended consequences of texture bias while accurately depicting the overall shape of the object. 2) A color perceptual feature was devised to address the fact that deep features ignore color. This gives a more nuanced representation by incorporating color signals into the deep features. 3) A transfer learning technique known as gain whitening learning was proposed, and the orientation-selective and color perception mechanisms were successfully emulated to give a compact yet efficient representation. Using a pre-trained CNN model on popular benchmark datasets, comparative studies showed that sublimated deep features can deliver retrieval performance that is competitive with state-of-the-art methods. These findings shed light on the inner workings of the primary visual cortex (V1), which are the foundation of image recollection. In addition, this approach is more consistent with human perception than the alternatives.
Users now save a wealth of visual data due to the proliferation of social media, cellphones, and other forms of instantaneous digital communication. Because of this, image retrieval has been a hot topic among academics over the last ten years. In image retrieval, the goal is to find, within a large image library, the images that are most comparable to the query sample in concept and content. Several image retrieval algorithms based on feature engineering and on deep features have been suggested. In most cases, deep learning-based approaches outperform methods that rely on handcrafted features, even though the latter tend to have a shorter runtime. In [25], the output of residual blocks in deep neural networks is used in conjunction with handcrafted features during the feature fusion phase to improve the efficiency of image retrieval. The efficiency of feature-producing layers was investigated in typical deep networks, including residual, conventional sequential convolution, and bottleneck layers. The goal is to make image retrieval systems that rely on handcrafted features more efficient. To achieve this, the classification layers of popular deep networks are removed. Feature vectors are then created by converting the output of feature generation layers at different depths, with the help of a flatten layer, into numerical features that can be used in retrieval systems. A number of popular spatial handcrafted features were also used, including texture, color, and frequency wavelet features. The result is a new hybrid feature set that combines deep CNN features with traditional feature engineering methods to improve image retrieval. The efficiency of the suggested technique is measured by accuracy and recall on benchmark datasets such as Corel-1k and Corel-5k. The accuracy achieved by the suggested approach was 96.68% on Corel-1k and 94.56% on Corel-5k.
When comparing bottleneck and sequential classical convolution layers to a combination of residual block feature maps and handmade features, the findings demonstrated that the latter improved the image retrieval system’s ultimate effectiveness. A comparison with cutting-edge deep learning as well as machine learning methods reveals that the suggested approach outperforms them in terms of recall and accuracy.
The related studies outlined above reveal several limitations in image retrieval methodologies. First, the subjective nature of human perception and imprecise image annotations make it difficult to retrieve images accurately via textual queries. Traditional feature extraction approaches require professional knowledge and are sensitive to changes in imaging conditions, which limits their usefulness and generalizability. The computational complexity involved in finding comparable images, particularly with device-generated multimedia, can also tax resources. Moreover, despite breakthroughs in deep learning approaches, a semantic gap remains between low-level and high-level features in visual content description, affecting the completeness of image representations. In medical image retrieval, Content-Based Medical Image Retrieval (CBMIR) methods aim to improve accuracy, but their effectiveness depends on factors such as pre-processing and feature selection. Finally, the performance variability of image retrieval systems that balance handcrafted and deep features highlights the need for robust methodologies adaptable to diverse datasets and scenarios. Overall, these limitations underscore the necessity for continued research to enhance the efficiency and effectiveness of image retrieval systems.
Methodology
In our methodology, we introduced VEIAM, a pioneering approach aimed at enhancing CBIR systems in the domain of medical image analysis. VEIAM combines SIFT-based feature extraction with selective attention mechanisms to achieve this goal. The process begins with meticulous image preprocessing, which includes techniques such as noise reduction using Wiener filtering and edge detection with a zero-crossing detector. These steps ensure that the input images are refined and prepared for subsequent analysis. VEIAM then leverages the extracted features along with attention mechanisms to efficiently retrieve and analyze relevant medical images from large databases. By capturing intricate textural characteristics using GLCM feature extraction, VEIAM enhances the discriminative power of the feature representation.
Data preprocessing
Dataset collection
The efficacy of the proposed CBMIR methodology was evaluated experimentally on an extensive dataset of chest X-ray images that includes COVID-19 cases [26]. KNN and ResNet50 were used to develop a novel hybrid deep learning and machine learning-driven CBIR system [27]. Curated for academic use, this dataset contains not just the chest X-ray images but also important metadata about each one. This metadata records disease classes and subclasses, with an emphasis on respiratory conditions. Specifically, the dataset contains 584 chest X-ray images showing cases of COVID-19 infection, along with another 120 images showing other disorders. To keep the evaluation objective and reduce the potential for bias caused by the unequal number of samples in each class, only 33 of the COVID-19 images were chosen for the experimental examination. A total of 152 images were used in this study, divided into three main disease classes with various subcategories. This curation supports a more robust evaluation and more nuanced insights into the CBMIR methodology’s efficacy across varied disease presentations. Figure 2 displays a sample from the dataset.

Dataset Sample.
To ensure consistency and facilitate effective analysis of medical images within the proposed CBMIR technique, fundamental preprocessing steps such as image resizing and normalization are needed. Image resizing ensures that all images in the dataset have consistent dimensions by converting them to a specified resolution. This standardization is essential for medical imaging, since images might come from many sources and modalities; it makes fair comparison between images possible and improves computing efficiency. Resizing images to a uniform size also simplifies subsequent processing stages such as feature extraction and similarity computation, and it mitigates potential distortions or discrepancies arising from differences in resolution. With specified dimensions, images can be easily integrated into the CBMIR system, which also guarantees compatibility and coherence during retrieval.
Normalization, in turn, brings image pixel intensities into a uniform range. Its goals are to improve the consistency of image attributes across samples and to reduce the impact of lighting fluctuations. To eliminate biases caused by variations in brightness or contrast levels, normalization techniques usually scale pixel values to a preset interval, such as [0, 1] or [-1, 1]. Normalization is essential in medical imaging for improving the discriminative power of image features, since small changes in intensity can communicate important diagnostic information. By guaranteeing that pixel intensities share a common scale, normalization enhances the extraction of useful image descriptors and promotes more trustworthy retrieval results.
Image scaling and normalization are crucial preprocessing processes in the CBMIR technique. They help make medical images more standardized, more comparable, and better represented for analysis later on. The systematic application of these techniques allows researchers to address potential sources of bias and variability. This lays a strong groundwork for medical CBIR and the extraction of features that are significant for diagnosis.
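The two preprocessing steps described in this subsection can be sketched as follows. This is a minimal NumPy illustration, not the pipeline used in the study: the nearest-neighbor resampling, the min-max scaling rule, and all function names are assumptions made for the example.

```python
import numpy as np


def resize_nearest(image, out_height, out_width):
    """Resize a 2-D grayscale image to a fixed size with nearest-neighbor sampling."""
    image = np.asarray(image)
    in_height, in_width = image.shape
    # Map every output row/column back to a source row/column index.
    rows = np.arange(out_height) * in_height // out_height
    cols = np.arange(out_width) * in_width // out_width
    return image[rows[:, None], cols[None, :]]


def normalize_unit(image):
    """Min-max scale intensities to [0, 1]; a constant image maps to all zeros."""
    image = np.asarray(image, dtype=float)
    lo, hi = image.min(), image.max()
    if hi == lo:
        return np.zeros_like(image)
    return (image - lo) / (hi - lo)


def normalize_symmetric(image):
    """Min-max scale intensities to [-1, 1] (a constant image maps to -1)."""
    return 2.0 * normalize_unit(image) - 1.0
```

In practice an interpolating resize (bilinear or bicubic) from an imaging library would usually be preferred over nearest-neighbor sampling, which is used here only to keep the sketch dependency-free.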
Noise reduction using Wiener filtering
Wiener filtering is an adaptive method for reducing noise in images by minimizing the mean square error between the filtered image and the underlying noise-free image. In the CBMIR approach, Wiener filtering is essential for medical image preprocessing because it reduces noise while maintaining diagnostic information. When the noise properties may vary in space or spectrum across the image, Wiener filtering is a good choice because, unlike traditional linear filtering methods, it considers both the signal and the noise characteristics. An essential parameter governs the filtering process: the local signal-to-noise ratio (SNR) within the image. By adaptively adjusting filter coefficients based on the estimated SNR, and thereby leveraging statistical properties of the image and noise, Wiener filtering achieves strong noise suppression while avoiding distortion of visual characteristics. Because of its adaptive nature, Wiener filtering can reduce noise artifacts without causing significant blurring or feature loss, in contrast to fixed linear filters.
There are a number of benefits to using Wiener filtering in medical imaging applications, where precise diagnosis and treatment planning depend on high-quality images. First, by using knowledge of the noise characteristics specific to each modality, Wiener filtering can adapt its denoising approach to the distinct noise profiles seen in various medical imaging modalities (e.g., X-ray, CT, and MRI). This flexibility supports robust performance across varied imaging settings and enhances the dependability of subsequent image analysis tasks, such as feature extraction and classification.
Important for clinical interpretation of medical images, Wiener filtering also does a great job of preserving fine anatomical structures and subtle disease abnormalities. Wiener filtering allows for more precise and clinically relevant image analysis by adaptively modifying filter parameters according to local image content; this efficiently suppresses noise while preserving texture detail and edge sharpness. In medical image retrieval applications, this capacity is especially useful because the performance of retrieval relies on the accuracy of image representations and the discriminative power of retrieved features. Within the context of the CBMIR framework, noise reduction by Wiener filtering is an adaptive and high-tech method for improving medical image quality and interpretability. Wiener filtering uses statistical image and noise features to effectively suppress noise while keeping diagnostic information intact. This paves the way for medical contexts to use content-based image retrieval and analysis in the future.
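One common way to realize this kind of locally adaptive behavior is the pixelwise Wiener filter built from local means and variances, sketched below with NumPy. This is an illustrative assumption about the filter form, not the exact filter used in the study; the window size, the noise estimate (the mean of the local variances), and the function names are all choices made for the example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def local_mean(image, size):
    """Mean over a size x size neighborhood, with reflect padding (size must be odd)."""
    pad = size // 2
    padded = np.pad(image, pad, mode="reflect")
    windows = sliding_window_view(padded, (size, size))
    return windows.mean(axis=(-2, -1))


def adaptive_wiener(image, size=3, noise_variance=None):
    """Pixelwise adaptive Wiener filter driven by local signal and noise statistics."""
    image = np.asarray(image, dtype=float)
    mean = local_mean(image, size)
    var = np.maximum(local_mean(image ** 2, size) - mean ** 2, 0.0)
    if noise_variance is None:
        # Assumption: estimate the noise power as the average local variance.
        noise_variance = float(var.mean())
    # Gain is near 1 where local variance dominates the noise (edges, detail)
    # and near 0 in flat regions, where the output falls back to the local mean.
    numerator = np.maximum(var - noise_variance, 0.0)
    denominator = np.maximum(var, noise_variance)
    safe_denominator = np.where(denominator > 0, denominator, 1.0)
    gain = numerator / safe_denominator
    return mean + gain * (image - mean)
```

The gain term plays the role of the local SNR discussed above: smoothing is strong where the signal variance is low relative to the noise and weak where structure is present.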
One typical technique in image processing, and in medical image analysis within the CBMIR methodology in particular, is edge detection using a zero crossing detector. This method finds edges in an image by pinpointing locations where the sign of the Laplacian changes.
To help extract useful structural information from medical images, the zero crossing detector uses this property to localize edges accurately while limiting false positives. The zero crossing detector operates on the output of edge detection filters such as the Laplacian of Gaussian (LoG), which combines Gaussian smoothing with edge detection. A strong LoG response in an area of sudden intensity change suggests that an edge is present. However, noise and texture fluctuations in images can cause misleading responses and discontinuities in the LoG filter’s output. The zero crossing detector addresses this problem by locating candidate edge sites at the points where the filter output changes sign. Figure 3 depicts the architecture of the proposed model.

Architecture of Proposed Model.
Edge detection with a zero crossing detector is a multi-step process. The first step is to preprocess the image so that edges are more visible and noise is reduced. The next step is to apply the LoG filter to the preprocessed image, which highlights areas with high-frequency components and thereby identifies possible edges. The zero crossing detector then examines the LoG filter’s output for locations where the sign of the Laplacian changes. A major benefit of the zero crossing detector is its capacity to localize edges with sub-pixel accuracy, which allows structural boundaries inside medical images to be demarcated precisely. This property is very helpful when precise localization of pathological abnormalities or anatomical features is crucial for clinical diagnosis and therapy planning. Because it is tolerant of noise and texture, the zero crossing detector can handle medical images from many sources and modalities. Within the CBMIR framework, edge detection using a zero crossing detector is therefore a useful method for obtaining structural information from medical images: it finds points where intensities change quickly and localizes edges accurately, allowing for the creation of reliable edge maps.
3.1.4.1. Laplacian of Gaussian (LoG) filtering
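The equation for this step does not appear in the text, so the sketch below falls back on the standard continuous LoG definition, LoG(x, y) = -(1 / (pi * sigma^4)) * (1 - (x^2 + y^2) / (2 * sigma^2)) * exp(-(x^2 + y^2) / (2 * sigma^2)), sampled on a discrete grid. The value of sigma, the kernel size rule, the zero-sum correction, and the function names are assumptions made for illustration, not parameters reported by the study.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def log_kernel(sigma, size=None):
    """Sample the Laplacian of Gaussian on a square grid of odd side length."""
    if size is None:
        size = 2 * int(np.ceil(3 * sigma)) + 1  # cover roughly +/- 3 sigma
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    r2 = x ** 2 + y ** 2
    kernel = (-(1.0 / (np.pi * sigma ** 4))
              * (1.0 - r2 / (2.0 * sigma ** 2))
              * np.exp(-r2 / (2.0 * sigma ** 2)))
    # Force the truncated kernel to sum to zero so flat regions give no response.
    return kernel - kernel.mean()


def log_filter(image, sigma=1.0):
    """Filter a 2-D image with the LoG kernel, using reflect padding at the borders."""
    image = np.asarray(image, dtype=float)
    kernel = log_kernel(sigma)
    pad = kernel.shape[0] // 2
    padded = np.pad(image, pad, mode="reflect")
    windows = sliding_window_view(padded, kernel.shape)
    # The kernel is symmetric, so correlation and convolution coincide here.
    return np.einsum("ijkl,kl->ij", windows, kernel)
```

Across a step edge the filtered response changes sign, which is the property the zero crossing stage relies on.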
3.1.4.2. Zero crossing detection
Here, ZC(x, y) is the binary zero crossing map indicating edge locations, LoG(x, y) is the LoG response at pixel (x, y), and LoG(x + 1, y) and LoG(x, y + 1) denote the LoG responses at the neighboring pixels.
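The symbols defined here suggest a rule of the following form: ZC(x, y) = 1 when LoG(x, y) and LoG(x + 1, y), or LoG(x, y) and LoG(x, y + 1), have opposite signs, and 0 otherwise. The sketch below implements that reading with NumPy. The exact rule is an assumption inferred from the definitions, and the optional `min_jump` parameter for suppressing weak crossings is hypothetical and not part of the source.

```python
import numpy as np


def zero_crossings(response, min_jump=0.0):
    """Binary map of sign changes between each pixel and its right/lower neighbor.

    `response` is indexed as response[y, x], so x + 1 is the next column
    and y + 1 is the next row.
    """
    response = np.asarray(response, dtype=float)
    zc = np.zeros(response.shape, dtype=bool)

    # Compare each pixel with its horizontal neighbor (x + 1, y).
    left, right = response[:, :-1], response[:, 1:]
    zc[:, :-1] |= (left * right < 0) & (np.abs(left - right) > min_jump)

    # Compare each pixel with its vertical neighbor (x, y + 1).
    upper, lower = response[:-1, :], response[1:, :]
    zc[:-1, :] |= (upper * lower < 0) & (np.abs(upper - lower) > min_jump)

    return zc.astype(np.uint8)
```

Requiring a minimum jump across the crossing is one simple way to discard sign changes caused by small noise-driven fluctuations around zero.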
3.1.4.3. Edge location
3.1.5. Region of interest (ROI) extraction
Region of interest (ROI) extraction is an important part of the CBMIR approach for medical image processing. It isolates particular regions within an image that are pertinent to the current analytical or diagnostic task. Because medical images often contain a large number of anatomical structures and clinical abnormalities, ROI extraction is crucial for directing computing resources to the areas that matter to researchers and clinicians. Isolating and analyzing ROIs improves the efficacy and efficiency of subsequent image analysis and retrieval tasks, which in turn can support better diagnosis and treatment planning. The first phase of ROI extraction is usually to pinpoint the areas of an image that are clinically relevant. This can be done with automated algorithms that have been trained to identify particular traits or anomalies, or by drawing on existing knowledge of the anatomy or pathology under study. Once possible ROIs have been identified, further refinement can better define their borders and eliminate structures or artifacts that are not important. For this refinement, morphological operations, edge detection, or segmentation may be used to fine-tune the region borders according to texture, shape, or intensity.
In the context of medical image retrieval, the purpose of ROI extraction is to reduce computational cost and improve the relevance of retrieved images by focusing on clinically significant areas. To diagnose lung disease from chest X-rays, for instance, ROI extraction could entail separating the lung fields or regions with aberrant opacities that suggest disease. With ROI extraction, the retrieval system can prioritize images containing similar pathological features, which improves the accuracy and clinical relevance of search results.
3.1.5.1. Intensity–based segmentation
Where ROI (x, y) is the binary ROI mask indicating the presence of the region of interest at pixel (x, y). I (x, y) represents the intensity of the image at pixel (x, y). T is a threshold value used to segment the image into foreground (ROI) and background.
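The thresholding rule described above can be illustrated with a short Python sketch. It assumes that a pixel belongs to the ROI when its intensity is strictly greater than T; whether the original formulation uses a strict or non-strict comparison is not recoverable from the text. The function name `roi_mask`, the demo image, and the threshold of 128 are hypothetical.

```python
def roi_mask(image, threshold):
    """Binary ROI mask: 1 where I(x, y) > T (assumed comparison), else 0."""
    return [[1 if pixel > threshold else 0 for pixel in row] for row in image]

# Hypothetical 2 x 3 grayscale image with intensities in 0..255
image_demo = [
    [10, 200, 180],
    [20, 30, 220],
]
mask_demo = roi_mask(image_demo, threshold=128)
```

In practice T is often chosen from the image histogram rather than fixed by hand, since a single global value rarely suits every modality.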
3.1.5.2. Edge detection for boundary delineation
In addition, ROI extraction makes it easier to incorporate domain-specific knowledge into the CBMIR architecture, which lets medical professionals customize retrieval systems for particular clinical tasks. To identify tumors in brain MRIs, for example, ROI extraction could entail segmenting tumor regions according to intensity or texture features, enabling the targeted retrieval of images showing comparable tumor traits. This customization makes retrieval results more relevant and easier to interpret, so doctors can make better judgments with the help of the retrieved images. In summary, the CBMIR methodology relies on ROI extraction to identify and isolate clinically significant regions within images. Well-chosen regions of interest allow practitioners to simplify image analysis and retrieval, boost the relevance and accuracy of retrieved images, and provide doctors with useful insights for treatment planning and diagnosis. ROI extraction is therefore essential for improving patient care and advancing medical imaging.
3.1.5.3. Region boundary refinement
Medical image analysis makes use of the robust feature extraction technique known as GLCM to measure the spatial correlations between pixel intensities and textural qualities. With GLCM, you may learn a lot about the structure and texture of tissues or lesions by counting how often pairs of pixel intensity values appear at certain spatial offsets. Subtle changes in tissue microstructure, such as those caused by tumors, edema, or fibrosis, can be better characterized with this method. A GLCM calculation boils down to building a matrix where each entry stands for the frequency with which two intensity values, separated by a given distance and orientation in the image, appear together. The GLCM can be used to determine a wide variety of texture characteristics, such as homogeneity, contrast, correlation, and energy. All the information regarding the image’s textural features can be found in these descriptors, which include things like the degree of spatial correlation, the regularity of texture patterns, and the intensity variation.
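The GLCM construction and the four descriptors named above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the matrix is built for a single offset and is not symmetrized, the image is already quantized to a small number of gray levels, and "energy" is taken to be the sum of squared entries (some libraries report its square root instead). The function names and the demo image are hypothetical, not taken from this study.

```python
import math

def glcm(image, levels, dx=1, dy=0):
    """Count co-occurrences of gray levels (i, j) at offset (dx, dy)
    and normalize the counts to probabilities."""
    counts = [[0] * levels for _ in range(levels)]
    height, width = len(image), len(image[0])
    total = 0
    for y in range(height):
        for x in range(width):
            nx, ny = x + dx, y + dy
            if 0 <= nx < width and 0 <= ny < height:
                counts[image[y][x]][image[ny][nx]] += 1
                total += 1
    return [[count / total for count in row] for row in counts]

def texture_features(p):
    """Contrast, homogeneity, energy, and correlation of a normalized GLCM."""
    n = len(p)
    cells = [(i, j, p[i][j]) for i in range(n) for j in range(n)]
    contrast = sum(v * (i - j) ** 2 for i, j, v in cells)
    homogeneity = sum(v / (1 + (i - j) ** 2) for i, j, v in cells)
    energy = sum(v * v for _, _, v in cells)
    mean_i = sum(i * v for i, _, v in cells)
    mean_j = sum(j * v for _, j, v in cells)
    std_i = math.sqrt(sum(v * (i - mean_i) ** 2 for i, _, v in cells))
    std_j = math.sqrt(sum(v * (j - mean_j) ** 2 for _, j, v in cells))
    covariance = sum(v * (i - mean_i) * (j - mean_j) for i, j, v in cells)
    # Correlation is undefined for a constant image; report 0.0 in that case
    correlation = covariance / (std_i * std_j) if std_i and std_j else 0.0
    return {
        "contrast": contrast,
        "homogeneity": homogeneity,
        "energy": energy,
        "correlation": correlation,
    }

# Hypothetical image quantized to two gray levels (0 and 1)
image_demo = [
    [0, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
]
p_demo = glcm(image_demo, levels=2, dx=1, dy=0)
features_demo = texture_features(p_demo)
```

For this demo image and a horizontal offset of one pixel, six pixel pairs are counted, and the resulting contrast is 1/3 and the homogeneity 5/6. Real pipelines typically average the descriptors over several offsets and orientations.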
There are numerous modalities and clinical tasks that benefit from GLCM feature extraction in medical imaging. As an example, GLCM-based texture analysis can help characterize bone density in radiographic images like X-rays, allowing for the identification of osteoporosis or bone metastases locations based on textural differences. Magnetic resonance imaging (MRI) also makes use of GLCM features to categorize various types of tissue, such as normal and diseased myocardium, according to differences in composition and texture.
A strength of GLCM-based feature extraction is its ability to capture fine-grained textural details that might be missed by conventional image processing methods or the human eye. By quantifying the spatial relationships between pixel intensities, GLCM provides robust and reproducible quantitative measurements of texture that can serve as biomarkers for illness diagnosis, prognosis, and treatment response assessment. GLCM features also complement other imaging biomarkers, such as intensity histograms and shape-based features, in characterizing tissue properties and pathological states. In short, GLCM feature extraction is an effective and flexible method for medical image analysis that provides quantitative measurements of image texture and spatial correlations, and GLCM-based texture descriptors help researchers and physicians learn more about tissue microstructure in support of diagnosis, prognosis, and treatment planning.
VEIAM is a novel method that takes CBIR systems to the next level by combining the robustness of SIFT-based feature extraction with the discriminative power of attention processes. A comprehensive approach for recovering visually comparable images from massive datasets, VEIAM integrates attention processes to highlight prominent image regions with several SIFT-based features retrieved from images. To improve the efficiency of visual content-based image representation and retrieval, this combination uses attention processes and SIFT-based feature descriptors, which complement each other.
SIFT is a widely used technique for obtaining distinctive and stable features from images. SIFT descriptors capture local image features such as edges, corners, and texture patterns while remaining robust to changes in scale, rotation, and illumination. To improve the discriminative strength of the image representation, VEIAM extracts numerous SIFT-based feature descriptors from each image, which allows it to leverage the diversity of visual information acquired at different scales and orientations. This ensemble of SIFT-based features provides a detailed, multi-dimensional image representation that improves the accuracy and robustness of image retrieval.
SIFT-based feature extraction
When extracting features from images, VEIAM also uses attention algorithms to dynamically highlight and prioritize certain areas. The model is able to enhance the discriminative strength of the extracted features by honing in on relevant regions of the image while suppressing irrelevant background information. Attention processes improve image retrieval by increasing the feature descriptors’ representational capacity through the careful assignment of relevance weights to various image regions according to their visual saliency.
Integrating the attention mechanism with SIFT-based feature extraction in VEIAM involves several steps. First, several SIFT-based feature descriptors are extracted from every image, capturing local image features across a range of scales and orientations. These feature descriptors are then merged or concatenated to form a complete representation of the image’s content. Next, attention is applied to the combined feature representation to select features and weight them dynamically according to their importance. In the resulting attention-weighted feature representation, the discriminative image areas are brought to the forefront, while noise and irrelevant background information are suppressed.
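The weighting step can be illustrated with a generic softmax attention pooling over a set of local descriptors. The sketch below is an assumption-laden illustration rather than VEIAM's actual formulation, which is not specified in this section: relevance scores are taken to be dot products between each descriptor and a hypothetical query vector, and toy 2-dimensional vectors stand in for real 128-dimensional SIFT descriptors.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(descriptors, query):
    """Weight each descriptor by softmax(descriptor . query) and sum them.

    `query` is a hypothetical relevance vector; in a trained model it
    would be learned rather than set by hand.
    """
    scores = [sum(d_k * q_k for d_k, q_k in zip(d, query)) for d in descriptors]
    weights = softmax(scores)
    dim = len(query)
    pooled = [sum(w * d[k] for w, d in zip(weights, descriptors)) for k in range(dim)]
    return weights, pooled

# Toy descriptors standing in for SIFT vectors
descriptors_demo = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_demo = [2.0, 0.0]
weights_demo, pooled_demo = attention_pool(descriptors_demo, query_demo)
```

Descriptors that align with the query receive larger weights, so the pooled vector is dominated by the regions judged most relevant, which mirrors the suppression of background information described above.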
The ability of VEIAM to capture both the global context of the image and local discriminative information is a major strength. A comprehensive image of the image’s content is provided by the SIFT-based feature ensemble, which incorporates both surface-level visual characteristics and deeper semantic information. At the same time, attention mechanisms enhance the discriminative ability of the feature representation by allowing the model to zero in on particular areas of the image that are especially relevant to the retrieval task. By combining global and local data, VEIAM is able to retrieve visually related images with great accuracy and relevance.
Additionally, VEIAM is highly adaptable and can be modified to fit various application domains and retrieval tasks. Because SIFT-based feature extraction captures a wide range of visual patterns and structures, VEIAM is able to handle diverse image datasets. At the same time, the attention mechanisms can be adjusted to prioritize different visual features according to the needs of the retrieval task, so the model works well in many kinds of situations. This capacity for adjustment and customization makes VEIAM a powerful solution for many different types of CBIR applications. In summary, VEIAM (VisualSift Ensembling Integration with Attention Mechanisms) provides a flexible and powerful method for retrieving images based on their content. By combining several scale-invariant SIFT-based feature descriptors with attention-driven feature selection, it yields a discriminatively strong model. This integration allows VEIAM to capture both the global image context and the local discriminative information, which makes image retrieval more accurate and more relevant. Medical imaging, online shopping, and media retrieval are just a few examples of the many CBIR applications that could benefit from VEIAM’s flexibility and customization features.
The novelty of our work lies in the integration of multiple cutting-edge techniques to address key challenges in CBIR systems within the realm of medical image analysis. Our approach, VEIAM, represents a significant departure from conventional methods and introduces several innovative components. Firstly, VEIAM incorporates SIFT-based feature extraction, a technique renowned for its robustness in capturing distinctive visual features regardless of variations in scale, rotation, or illumination. By leveraging SIFT, our approach ensures the extraction of high-quality features from medical images, laying a strong foundation for subsequent analysis. Secondly, VEIAM introduces selective attention mechanisms into the CBIR framework. These mechanisms dynamically highlight and prioritize relevant regions within the images, allowing the model to focus on clinically significant areas while filtering out noise and irrelevant background information. This selective attention mechanism enhances the discriminative power of the feature representation, thereby improving the accuracy and relevance of image retrieval tasks. Furthermore, VEIAM integrates advanced image preprocessing techniques, including noise reduction using Wiener filtering and edge detection with a zero-crossing detector. By incorporating these preprocessing steps, our approach ensures that the input images are refined and enhanced prior to feature extraction, thereby improving the quality and relevance of the extracted features. Additionally, VEIAM employs GLCM feature extraction to capture intricate textural characteristics within the images. This technique provides valuable texture descriptors such as contrast, correlation, energy, and homogeneity, enabling more nuanced analysis of medical images and enhancing the model’s ability to distinguish between different tissue types and pathological conditions. 
Overall, the novelty of VEIAM lies in its holistic approach to medical image analysis, which combines state-of-the-art feature extraction techniques with selective attention mechanisms and advanced image preprocessing methods. By integrating these components, VEIAM offers a comprehensive solution to the challenges faced by existing CBIR systems, paving the way for improved diagnostic accuracy and efficiency in healthcare settings.
Results and discussions
The VEIAM model was implemented in Python on a Windows-based system. The system’s hardware setup comprised a 5.10 GHz Intel® Core™ Ultra 9 processor 185H with 24 MB of cache, 8 GB of RAM, and a GeForce RTX 4070 graphics card to accelerate complex computations. Python was selected as the implementation language because of its many benefits, such as its user-friendliness, library support, and flexibility in integrating the different parts of the model. Python libraries such as TensorFlow, PyTorch, and scikit-learn contributed significantly to the development of VEIAM’s CNNs and other machine learning models.
The system’s hardware, including the Intel® Core™ Ultra 9 processor and the GeForce RTX 4070 graphics card, was selected to give the best possible performance when running the model. The processor’s fast clock speed and multi-core capabilities allowed complex operations inside VEIAM, such as feature extraction, attention mechanisms, and classification, to be computed efficiently. In addition, computationally intensive tasks such as building deep learning models and performing matrix operations were executed more quickly on the GeForce RTX 4070 graphics card, which is known for its powerful GPU architecture and parallel processing capabilities.
The system made use of the CPU cores, RAM, and graphics card, among other hardware resources, to parallelize computations and minimize processing time while the model was being executed. This made sure that VEIAM could quickly sort through mountains of medical image data, use attention processes, extract useful features, and process the data efficiently.
Overall, the combination of the Python programming language and a high-performance hardware configuration allowed the VEIAM model to be implemented and executed efficiently on a robust, performance-optimized platform. By utilizing state-of-the-art methods in medical image analysis, researchers and practitioners were able to obtain precise results and gain significant insights for the purpose of diagnosis and treatment planning.
The first step in the medical image analysis was the careful selection and collection of a dataset suited to the task at hand, such as COVID-19 infection diagnosis from chest X-rays. This dataset, which covers a wide variety of cases, including respiratory conditions other than COVID-19, was hand-picked to guarantee its suitability for academic use. The dataset was designed to minimize bias and ensure objectivity, providing a rigorous basis for subsequent analysis. It included 584 chest X-ray images showing COVID-19 cases and an additional 120 images showing various other illnesses.
The acquired images were preprocessed to improve their quality and standardize their format before any useful analysis could be performed. To make comparison and analysis easier, we used image resizing to make sure all the photographs were the same size. Then, in order to make the images more comparable and remove any biases caused by variations in brightness or contrast, normalization techniques were used to make the pixel intensities uniform. In addition, noise reduction techniques such as Wiener filtering were employed to improve image clarity and minimize artefacts, guaranteeing that the data used for further analysis was clean and trustworthy.
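The resizing and intensity normalization steps can be sketched in dependency-free Python as follows. The interpolation method and normalization range used in this study are not stated, so nearest-neighbor resizing and min-max scaling to [0, 1] are assumptions made for illustration, as are the function names and the demo image. Wiener filtering is omitted here; a library implementation would normally be used for that step.

```python
def resize_nearest(image, new_height, new_width):
    """Resize a 2-D image by nearest-neighbor sampling (assumed method)."""
    height, width = len(image), len(image[0])
    return [
        [image[r * height // new_height][c * width // new_width]
         for c in range(new_width)]
        for r in range(new_height)
    ]

def normalize(image):
    """Min-max scale pixel intensities to the range [0, 1]."""
    flat = [pixel for row in image for pixel in row]
    low, high = min(flat), max(flat)
    span = (high - low) or 1  # avoid division by zero on constant images
    return [[(pixel - low) / span for pixel in row] for row in image]

# Hypothetical 2 x 4 grayscale image
image_demo = [
    [0, 50, 100, 150],
    [200, 250, 200, 150],
]
resized_demo = resize_nearest(image_demo, new_height=2, new_width=2)
normalized_demo = normalize(resized_demo)
```

Applying the same target size and the same scaling to every image is what makes the later texture and SIFT features comparable across the dataset.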
After preprocessing, the next step was to extract useful features for classification from the preprocessed images. GLCM, a widely used tool in medical image analysis, was applied to assess the spatial correlations between pixel intensities and the textural features of the images. GLCM allowed useful texture descriptors such as homogeneity, contrast, correlation, and energy to be extracted, which were vital in differentiating between various types of tissue and disease states.
A new approach known as VEIAM was introduced to further improve the feature representation’s discriminative power. This novel method integrated selective attention processes with the reliability of feature extraction based on the SIFT. In order to improve the relevance and accuracy of image retrieval and classification tasks, VEIAM optimized the feature representation by dynamically highlighting and prioritizing important areas within the images.
At last, the images were prepared for classification using CNNs and other machine learning methods with the extracted features and optimized representations. Algorithms were taught to distinguish between positive and negative COVID-19 images by training on the retrieved features, which allowed them to discover patterns and correlations in the data. Insights gained from this categorization method helped doctors with diagnosis and therapy planning by providing useful diagnostic data. To summarize, there were several important processes in medical image analysis, such as cleaning and preparing the dataset, extracting features, integrating advanced methods like VEIAM, and finally, classifying the images using machine learning algorithms. All of the steps were critical in making the analysis as accurate and effective as possible, which allowed us to draw useful conclusions from medical imaging for diagnosis and treatment planning.
A thorough comparison of the performance metrics (accuracy, precision, recall, and F1 score) for several deep learning models is presented in Table 1 and Fig. 4. For classification tasks such as image recognition, these metrics are crucial for gauging the efficacy and robustness of machine learning models. The results of each model are examined below. ResNet-50, which is well known for its depth and skip connections, achieves an accuracy of 86.3%. Its solid precision, recall, and F1 score demonstrate the model’s potential to categorize images effectively across several classes. VGG-16 achieves an accuracy of 83.9%, slightly lower than ResNet-50. Nevertheless, many deep learning practitioners find VGG-16 to be a useful starting point because of its simplicity and ease of comprehension.
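For readers less familiar with these four metrics, the following Python sketch shows how they are computed from the counts of a binary confusion matrix using the standard definitions. The counts in the example are hypothetical and are not drawn from Table 1.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts: 90 true positives, 10 false positives,
# 5 false negatives, 95 true negatives
metrics_demo = classification_metrics(tp=90, fp=10, fn=5, tn=95)
```

With these hypothetical counts the accuracy is 0.925 and the precision 0.9. Because the F1 score is the harmonic mean of precision and recall, it penalizes a model that achieves one at the expense of the other.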

Performance Comparison for Various Models.
Performance comparison for various models
Compared with the two models above, InceptionV3 stands out with an accuracy, recall, and F1 score of 89.2%. Its strong performance in image classification tasks is attributable, in part, to its inception modules, which allow it to extract features at various scales. MobileNet, which focuses on efficiency on mobile and embedded devices, achieves an accuracy of 85.6%. Its lightweight design makes it appropriate for resource-constrained applications, although it sacrifices some accuracy compared with larger models such as InceptionV3. DenseNet’s 91.8% accuracy shows how well it captures feature dependencies through dense layer connections, outperforming the preceding models. This architecture is robust in image classification tasks, yielding higher precision, recall, and F1 score. With a precision of 93.4%, EfficientNet improves the performance metrics further. By employing compound scaling, EfficientNet balances the use of model resources, which in turn leads to increased accuracy and better generalization across datasets.
To improve the model’s representational power, ResNeXt adds a cardinality parameter; it achieves 94.2% accuracy. Because of this, it outperforms conventional ResNet architectures in terms of performance metrics like F1 score, recall, and precision. Xception surpasses expectations with an astounding accuracy of 95.7%. It draws inspiration from the inception module but utilizes depthwise separable convolutions. The model’s high F1 score, recall, and precision prove that depthwise separable convolutions are a powerful tool for accurately expressing spatial dependencies. The significance of domain-specific customization is highlighted by the fact that bespoke CNN surpasses several pre-trained architectures with an accuracy of 96.5%. It achieves better performance across all measures by customizing the architecture and training technique to the task at hand.
With an accuracy of 97.2%, VEIAM (Proposed) proves to be the most advanced model compared to its predecessors. What this means is that the suggested design or approach brings new optimizations or features that greatly improve classification performance. By contrasting their results, we can see how many different types of deep learning models exist, each with its own set of advantages and disadvantages. Different models have different priorities; some aim for efficiency, while others try to maximize accuracy no matter the cost to computing power. Finding the best model to fit a specific job and available resources requires an understanding of these trade-offs. Innovations and improvements in model designs and training methods also keep expanding the limits of what can be accomplished in image classification jobs.
Table 2 and Fig. 5 provide a comprehensive evaluation of several ensemble learning approaches based on accuracy, precision, recall, and F1 score. By combining various base models, ensemble approaches enhance predictive performance by capitalizing on the diversity of the individual models. The results and implications of each ensemble method are examined below. Bagging, short for bootstrap aggregating, achieves a respectable accuracy of 82.9%. Bagging improves generalization and reduces overfitting by training multiple base models on bootstrapped subsets of the training data and combining their predictions, as its solid precision, recall, and F1 score indicate. Boosting methods, such as AdaBoost and Gradient Boosting, further improve predictive performance, reaching an accuracy of 85.4%. To improve the ensemble’s performance, boosting trains base models sequentially, with each model focusing on the cases misclassified by earlier models. AdaBoost and Gradient Boosting handle classification tasks well, as seen in their precision, recall, and F1 score.
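The bagging procedure described above (bootstrap resampling followed by majority voting) can be illustrated with a small, dependency-free Python sketch. The base learner here is a one-feature threshold rule chosen purely for brevity; the base models, data, ensemble size, and function names are all hypothetical and do not reproduce the experiments reported in Table 2.

```python
import random
from collections import Counter

def fit_stump(samples):
    """Fit the rule 'predict 1 if x > t' with the fewest training errors."""
    values = sorted(x for x, _ in samples)
    # Candidate thresholds: just below the smallest value, then each value
    candidates = [values[0] - 1.0] + values

    def errors(t):
        return sum((x > t) != bool(label) for x, label in samples)

    return min(candidates, key=errors)

def fit_bagging(samples, n_models=15, seed=0):
    """Train each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    thresholds = []
    for _ in range(n_models):
        bootstrap = [rng.choice(samples) for _ in samples]
        thresholds.append(fit_stump(bootstrap))
    return thresholds

def predict_bagging(thresholds, x):
    """Combine the stumps' predictions by majority vote."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]

# Hypothetical, well-separated one-dimensional training data
train_demo = ([(float(v), 0) for v in range(1, 6)]
              + [(float(v), 1) for v in range(11, 16)])
ensemble_demo = fit_bagging(train_demo)
```

Each stump sees a slightly different resample of the data, so individual thresholds vary, while the vote across stumps is more stable than any single one. This is the variance-reduction effect that bagging relies on.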
Performance of Ensemble Learning Methods

Performance of Ensemble Learning Methods.
One well-liked ensemble method that uses decision trees, Random Forest, gets an accuracy of 87.1%. Random Forest improves performance and generalizability by reducing the error caused by individual decision trees by training them on randomly selected feature subsets and then averaging their predictions. Notable examples of advanced gradient boosting implementations include XGBoost (92.5% accuracy), LightGBM (94.1% accuracy), and CatBoost (95.8% accuracy). These techniques build upon conventional gradient boosting by introducing optimizations and algorithmic upgrades, which in turn raise the accuracy and efficiency of predictions. High precision, recall, and F1 score are achieved by XGBoost, LightGBM, and CatBoost, demonstrating their usefulness in various categorization tasks. Stacking, a meta-ensemble technique that uses a meta-learner to integrate the predictions of numerous base models, produces remarkable performance with a 96.3% accuracy rate. Effectively leveraging the capabilities of distinct models, stacking improves performance across all measures by learning to blend the diverse predictions of base models.
Lastly, with an accuracy of 97.2%, VEIAM (Proposed) remains at the top of the performance chart. In this comparison, the suggested ensemble approach outperforms the meta-ensemble technique and even the advanced implementations of gradient boosting. The precision, recall, and F1 score achieved by VEIAM (Proposed) demonstrate its robustness and efficacy in classification tasks. By combining the strengths of numerous base models, ensemble learning provides effective strategies for enhancing predictive performance. The methods show a steady improvement in predictive performance, starting from Bagging and Boosting and progressing to more complex gradient boosting implementations such as XGBoost, LightGBM, and CatBoost. By efficiently combining the varied predictions of base models, stacking, a meta-ensemble strategy, improves performance further. Nonetheless, the suggested VEIAM ensemble method achieves the best results of the methods compared. Overall, ensemble approaches remain a strong choice for difficult classification problems in many fields.
Table 3 and Figs. 6 and 7 show a comparison of the parameter counts and file sizes of different deep learning models. Model size is a significant concern, particularly in deployment contexts such as mobile devices or edge computing environments, where resources may be constrained. The characteristics and size implications of each model are examined below. Among the first CNN designs, LeNet stands out for its small size of 8.7 MB and comparatively low parameter count of 2.3 million. Developed by Yann LeCun in the early 1990s, LeNet was a pioneering CNN and is still very lightweight compared with newer models. AlexNet, a groundbreaking CNN architecture and the winner of the 2012 ImageNet Large Scale Visual Recognition Challenge, has a larger size of 240.5 MB and a higher parameter count of 61 million. Although it was larger, AlexNet proved that deep learning could handle image classification tasks effectively, opening the door to further improvements.
Model Size Comparison

Number of Parameters over Various Models.

Size Comparison over Various Models.
The idea of inception modules was first proposed by GoogleNet, which is also known as Inception v1. These modules allow for efficient flow of information at different scales. Suitable for deployment in diverse situations, GoogleNet achieves a decent balance between model complexity and size with a parameter count of 6.8 million and a size of 28.3 MB. With a size of 102.4 MB and a parameter count of 25.6 million, ResNet-50 is a member of the ResNet family renowned for its deep design and residual connections. Research and practical applications alike have embraced ResNet-50 due to its manageable size and depth. VGG-16 is 553.6 MB in size and features a huge parameter count of 138.4 million. Its design is uniform and it uses modest 3×3 convolutional filters. While VGG-16 delivers outstanding performance, its enormous size makes it less ideal for resource-constrained deployments.
With 95.2 MB of space and 23.8 million parameters, InceptionV3 is an upgraded version of GoogleNet. The inception modules and efficient architecture of InceptionV3 provide a reasonable trade-off between size and speed, making it useful for a wide range of tasks. When it comes to mobile and embedded devices, MobileNet is all about efficiency and small model sizes. With just 16.8 MB in size and 4.2 million parameters, MobileNet manages to be both lightweight and relatively effective. With a size of 80 MB and a parameter count of 20 million, DenseNet forms feed-forward connections between all of its layers. Because of its small size and high connectivity density, DenseNet is applicable to a wide range of deployment scenarios.
With 5.3 million parameters and a size of 21.2 MB, EfficientNet is well known for its compound scaling mechanism, which improves efficiency. By optimizing the model architecture across multiple scales, EfficientNet achieves good performance at a relatively small size. VEIAM (Proposed) has 11.7 million parameters and a size of 46.8 MB. With its improved performance and manageable size, VEIAM is a good choice for environments with limited resources. The practicality of deploying deep learning models in real-world settings is greatly affected by model size. Models with many parameters, such as VGG-16, perform well, but they may be too large to use in settings with limited resources. In contrast, efficient and lightweight models such as EfficientNet and MobileNet are well suited to edge computing platforms and mobile devices. Finding an optimal trade-off between model size and performance will remain an important factor in deep learning model creation and deployment as the field advances.
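The relationship between parameter count and file size can be made explicit with a short calculation. The sketch below assumes that each parameter is stored as a 32-bit float (4 bytes) and that 1 MB equals 10^6 bytes; how the sizes in Table 3 were actually measured is not stated, so this is an interpretation rather than the authors' procedure. Under these assumptions the calculation reproduces several of the reported sizes, including those of VEIAM, ResNet-50, VGG-16, and MobileNet, while a few others (for example, GoogleNet at 28.3 MB for 6.8 million parameters) differ slightly from the estimate.

```python
BYTES_PER_FLOAT32 = 4  # assumed storage format for each parameter

def model_size_mb(num_parameters, bytes_per_parameter=BYTES_PER_FLOAT32):
    """Estimate model size in MB (10**6 bytes) from the parameter count."""
    return num_parameters * bytes_per_parameter / 1_000_000

# Parameter counts as quoted in the text
reported_parameters = {
    "VEIAM": 11_700_000,
    "ResNet-50": 25_600_000,
    "VGG-16": 138_400_000,
    "MobileNet": 4_200_000,
}
estimated_sizes = {name: model_size_mb(count)
                   for name, count in reported_parameters.items()}
```

The same relationship shows why reduced-precision storage is attractive for constrained deployments: storing parameters in 16 bits instead of 32 would halve each estimate.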
Table 4 and Fig. 8 compare numerous deep learning models according to their parameter counts and sizes. Each of these models represents a distinct architecture with its own set of advantages and disadvantages, and they are employed in a wide variety of contexts. The characteristics and implications of the parameter count and size of each model are compared below. The MLP (Multilayer Perceptron), a fundamental neural network design, is 32.4 MB in size and includes 8.1 million parameters. While MLPs have many applications and are commonly employed for classification and regression, their simplicity limits their capacity to detect intricate patterns in data. CNNs excel at image recognition and other tasks that require the extraction of spatial hierarchies of features. With a size of 58.8 MB and a parameter count of 14.7 million, CNNs are well suited to a variety of visual tasks.
Model Parameters Comparison

Model Parameters Comparison.
RNNs (Recurrent Neural Networks) capture sequential dependencies in data, which benefits applications such as natural language processing and time series analysis. With 12.5 million parameters and a size of 50 MB, RNNs provide a compact design for processing input sequentially. An LSTM (Long Short-Term Memory) network is a kind of RNN that uses memory cells to mitigate the vanishing gradient problem and remember long-term dependencies. LSTMs, which have a size of 77.2 MB and a parameter count of 19.3 million, offer improved handling of sequential data with long-range dependencies. GRU (Gated Recurrent Unit) networks are another kind of RNN; they are computationally more efficient than LSTMs because of their simplified gating mechanisms. With a size of 71.6 MB and 17.9 million parameters, GRUs provide a good balance between performance and efficiency in sequential data processing.
BiLSTM (Bidirectional LSTM) networks extend LSTMs by processing sequences in both the forward and backward directions, which improves context awareness. Despite their enhanced effectiveness, BiLSTMs are more complex, with a size of 130.4 MB and a parameter count of 32.6 million. The self-attention mechanism of Transformer architectures allows input sequences to be processed in parallel, which transformed natural language processing. With a size of 183.2 MB and a parameter count of 45.8 million, Transformers support advanced tasks such as machine translation and text generation. GPT (Generative Pre-trained Transformer) models are pre-trained on massive text datasets, allowing for flexible and context-aware language generation. With 125.4 million parameters and a file size of 501.6 MB, GPT models provide remarkable scalability in language generation.
BERT (Bidirectional Encoder Representations from Transformers) models, which are pre-trained on large-scale corpora using bidirectional context, achieve state-of-the-art results across diverse natural language understanding tasks. Although BERT models provide strong performance, they are extremely resource-intensive to train and deploy because of their 918.8 MB size and 229.7 million parameters. VEIAM (Proposed) stands out for its efficient design; it has a size of 46.8 MB and 11.7 million parameters. VEIAM’s competitive performance, achieved despite its small size, demonstrates its efficacy while preserving computational efficiency. The comparison highlights the trade-offs among model complexity, performance, and computing resources. Larger models such as GPT and BERT require a lot of computing power to train and deploy, even though they provide state-of-the-art performance. Conversely, smaller models that combine efficiency with performance, such as VEIAM, are well suited to settings where resources are limited. Identifying the optimal model architecture for each task while taking resource limitations into account will remain a significant challenge for deep learning researchers.
The execution times of different deep learning models are compared in Table 5 and Fig. 9. Inference speed directly affects the user experience, making execution time a key parameter in latency-sensitive or real-time applications. The execution time of each model is discussed below. With an execution time of 12.5 seconds, LeNet, a relatively lightweight CNN, is the fastest model in the comparison. LeNet was developed by Yann LeCun in the early 1990s, and its simplicity and efficiency make it well suited to tasks that require real-time performance. AlexNet takes 24.8 seconds to run, yet it achieves very high accuracy. Despite its pioneering role in showing that deep learning is useful for image classification, AlexNet has longer inference times than more modern architectures because of its larger size and computational complexity. With an inference time of 18.2 seconds, GoogleNet (or Inception v1) strikes a middle ground between model complexity and execution time. Its efficient architecture and inception modules allow it to attain competitive performance while keeping execution times reasonable.
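The timing procedure behind these figures is not described here. As an illustration only, the sketch below shows one common way to measure execution time in Python: a few warm-up runs followed by repeated timed runs, reporting the median. The names `measure_execution_time` and `dummy_predict` are hypothetical, and `dummy_predict` is a stand-in for a real model's forward pass, not part of VEIAM.

```python
import statistics
import time


def measure_execution_time(predict, batch, warmup=2, repeats=5):
    """Return the median wall-clock time in seconds of predict(batch)."""
    # Warm-up runs absorb one-off costs such as lazy initialisation or caching.
    for _ in range(warmup):
        predict(batch)

    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict(batch)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def dummy_predict(batch):
    # Hypothetical stand-in for a model's forward pass.
    return [sum(pixel * pixel for pixel in image) for image in batch]


batch = [[0.5] * 1024 for _ in range(32)]
elapsed = measure_execution_time(dummy_predict, batch)
print(f"median execution time: {elapsed:.6f} s")
```

Reporting the median rather than a single run reduces the influence of occasional slow runs caused by other processes on the machine.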
Table 5. Model execution time comparison.

Fig. 9. Execution time comparison of various models.
The ResNet-50 model, known for its deep architecture and residual connections, takes longer to run, at 36.4 seconds. Its depth adds to the computational complexity and inference time, but it delivers better overall performance, in part because residual connections mitigate the vanishing gradient problem. With its small convolutional filters and uniform design, VGG-16 has the longest execution time in the comparison, 42.9 seconds. Despite its strong performance, VGG-16's larger size and computing demands lead to longer inference times than more efficient designs. InceptionV3, an upgraded version of GoogleNet, runs in 29.7 seconds. Its inception modules and efficient architecture keep execution time moderate, making it suitable for a wide range of applications. MobileNet, which is designed for mobile and embedded devices, has a faster execution time of 15.6 seconds. By prioritizing model size and efficiency, MobileNet delivers competitive performance with relatively fast inference, which suits contexts with limited resources.
DenseNet, known for its dense connectivity between layers, has an execution time of 32 seconds. Despite this dense connectivity, DenseNet provides competitive performance with tolerable inference times for many tasks, including those where model interpretability is critical. EfficientNet, which uses compound scaling to optimize the model architecture, achieves an execution time of 19.8 seconds. It balances model complexity and efficiency, achieving competitive performance with relatively fast inference, which makes it suitable for a wide range of applications. VEIAM (Proposed) has a moderate execution time of 28.3 seconds. Given its improved performance and acceptable execution time, VEIAM is a good choice for applications that must balance speed and performance. This comparison contrasts model complexity, performance, and runtime. Deeper and larger models, such as ResNet-50 and VGG-16, have longer inference times but better performance. Conversely, more efficient designs, such as EfficientNet and MobileNet, are well suited to real-time or latency-sensitive applications because they balance inference speed with performance. Improving model architectures to achieve faster inference will remain an important focus of deep learning research.
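The execution times quoted in the text can be ranked directly to see where VEIAM falls. The sketch below uses only the values reported above; the variable names are illustrative. VEIAM ranks sixth of the ten models: it is slower than LeNet, MobileNet, GoogleNet, EfficientNet, and AlexNet, and faster than InceptionV3, DenseNet, ResNet-50, and VGG-16.

```python
# Execution times in seconds, as reported in the text.
execution_time_s = {
    "LeNet": 12.5,
    "AlexNet": 24.8,
    "GoogleNet": 18.2,
    "ResNet-50": 36.4,
    "VGG-16": 42.9,
    "InceptionV3": 29.7,
    "MobileNet": 15.6,
    "DenseNet": 32.0,
    "EfficientNet": 19.8,
    "VEIAM (Proposed)": 28.3,
}

# Sort from fastest to slowest.
ranked = sorted(execution_time_s.items(), key=lambda item: item[1])
reference = execution_time_s["VEIAM (Proposed)"]

# 1-based position of VEIAM in the fastest-to-slowest ordering.
veiam_rank = [name for name, _ in ranked].index("VEIAM (Proposed)") + 1

for rank, (name, seconds) in enumerate(ranked, start=1):
    ratio = seconds / reference
    print(f"{rank:2d}. {name:<17} {seconds:5.1f} s  ({ratio:.2f}x VEIAM)")
print(f"VEIAM rank: {veiam_rank} of {len(ranked)}")
```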
This work introduces VEIAM, a novel approach that significantly advances CBIR systems in medical image analysis. By integrating SIFT-based feature extraction with selective attention mechanisms, VEIAM demonstrates superior performance in retrieving relevant medical images from large databases. Through meticulous image preprocessing and innovative feature extraction techniques such as GLCM analysis, VEIAM ensures the extraction of high-quality features that capture intricate textural characteristics crucial for accurate diagnosis. Our results showcase VEIAM’s effectiveness in improving diagnostic accuracy and efficiency in healthcare settings, highlighting its potential to revolutionize medical image analysis. Among the models evaluated, VEIAM stands out as the most accurate, achieving an impressive accuracy rate of 97.2%. This surpasses the performance of established deep learning models such as ResNet-50, VGG-16, and InceptionV3, indicating the superiority of VEIAM in accurately retrieving and classifying medical images based on their content characteristics.
Looking ahead, there are several avenues for future research and development in this field. Firstly, further optimization and refinement of VEIAM’s components, such as fine-tuning attention mechanisms and exploring alternative feature extraction methods, could enhance its performance even further. Additionally, expanding the scope of VEIAM to accommodate other modalities beyond chest X-rays, such as MRI or CT scans, would broaden its applicability and utility in diverse medical imaging scenarios. Furthermore, integrating VEIAM with emerging technologies such as deep learning algorithms could unlock new possibilities for image analysis and interpretation. Lastly, collaborative efforts to validate VEIAM’s efficacy in clinical practice and real-world healthcare settings would be crucial for its widespread adoption and impact on patient care.
In summary, VEIAM represents a promising step forward in the field of medical image analysis, with ample opportunities for continued innovation and advancement in the future.
Author contribution
The authors confirm contribution to the paper as follows: Study conception, Methodology, Investigation, Writing - Original Draft, Supervision: Ramesh Babu Durai.C; Software, Validation, Formal analysis: R. Sathesh Raaj; Resources, Data Curation, Writing - Review & Editing: Sindhu Chandra Sekharan; Visualization, Project administration, Funding acquisition: V.S. Nishok.
Conflicts of interest
The authors declare no conflict of interest.
Data availability
All data analysed during this study are included in this article.
