Revolutionizing tumor detection and classification in multimodality imaging based on deep learning approaches: Methods, applications and limitations

Abstract

BACKGROUND:

The emergence of deep learning (DL) techniques has revolutionized tumor detection and classification in medical imaging, with multimodal medical imaging (MMI) gaining recognition for its precision in diagnosis, treatment, and progression tracking.

OBJECTIVE:

This review comprehensively examines DL methods in transforming tumor detection and classification across MMI modalities, aiming to provide insights into advancements, limitations, and key challenges for further progress.

METHODS:

Systematic literature analysis identifies DL studies for tumor detection and classification, outlining methodologies including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and their variants. Integration of multimodality imaging enhances accuracy and robustness.

RESULTS:

Recent advancements in DL-based MMI evaluation methods are surveyed, focusing on tumor detection and classification tasks. Various DL approaches, including CNNs, YOLO, Siamese Networks, Fusion-Based Models, Attention-Based Models, and Generative Adversarial Networks, are discussed with emphasis on PET-MRI, PET-CT, and SPECT-CT.

FUTURE DIRECTIONS:

The review outlines emerging trends and future directions in DL-based tumor analysis, aiming to guide researchers and clinicians toward more effective diagnosis and prognosis. Continued innovation and collaboration are stressed in this rapidly evolving domain.

CONCLUSION:

Conclusions drawn from literature analysis underscore the efficacy of DL approaches in tumor detection and classification, highlighting their potential to address challenges in MMI analysis and their implications for clinical practice.

Keywords

Multimodal medical image, deep learning, MRI, CT, PET, fusion, segmentation, image analysis, GANs

1 Introduction

Medical imaging plays an indispensable role in the diagnosis and treatment of various diseases. Magnetic resonance imaging (MRI), computed tomography (CT), positron emission tomography (PET), and single-photon emission computed tomography (SPECT) are the modalities most commonly used in clinical practice. These modalities offer complementary information about the shape and characteristics of the organs and tissues of the body being scanned. However, reading and integrating information from several imaging modalities can be difficult because of differences in imaging protocols, image resolution, and noise characteristics. Analyzing multimodality medical images with conventional machine learning (ML) methods requires features to be designed manually, an approach that is limited by the intricacy and variety of medical data.

Physicist Alan Cormack [1] hypothesized that scanning a body from multiple angles could best extract the information it contained, although the idea could not be tested at the time because of the limitations of computers [2]. Multimodality tomography in the field of diagnostic imaging originated in 1966 with the first prototype of CT-SPECT, which acquired images of the patient’s breast [3]. Although closely related, functional (SPECT and PET) and morphological (CT and MRI) tomographic imaging were developed independently. Godfrey Hounsfield is undoubtedly the central figure in the development of CT, as he developed a prototype and built the first CT scanner for clinical use [4]. Few medical discoveries have received such immediate and enthusiastic acceptance. Hounsfield and Cormack jointly received the Nobel Prize in Medicine in 1979. In the 1990s, it was increasingly recognized that combining morphological and functional information was highly valuable, which prompted serious consideration of how to achieve this combination. Two approaches were adopted: images acquired at different times are fused using digital image manipulation techniques, or images acquired simultaneously are merged automatically [5].

The problem with method selection in clinical diagnostic imaging is that the highest-sensitivity methods have relatively low resolution, while the high-resolution methods have relatively low sensitivity. In recent years, the idea of using multiple modalities in combination has gained popularity, and researchers have realized that great advantage can be gained by exploiting the complementary capabilities of different imaging modalities together. Combining imaging modalities can produce a synergistic effect in which the combined output surpasses the individual contribution of each modality, enhancing the overall efficacy and efficiency of the imaging process [6]. The idea of combining imaging technologies went mainstream with the advent of the first successful commercial fused devices. The first fused PET-CT device, developed in 1998 by Townsend and colleagues in collaboration with Siemens Medical, became commercially available in 2001. Time magazine named “Biograph” one of the “Inventions of the Year” in 2000, and it was a success. By 2003, all major medical device manufacturers, including General Electric (GE), Philips, CTI, and Siemens, had integrated PET-CT devices available. In the following years, PET-CT sales grew so rapidly that by 2006 there were virtually no sales of standalone PET devices [7]; nearly every PET scanner sold was part of a multimodality system. The next wave of innovation is in PET-MRI fused devices, which promise improved patient safety and imaging capability over PET-CT. Although research on PET-MRI devices began alongside that on PET-CT, the economic and engineering challenges of combining the two modalities slowed development, and the first commercial PET-MRI prototype for a human-scale hybrid scanner was released by 2007 [8–10]. With the rise of hybrid technology, these new instruments have created a wave in probe design and development as investigators continuously discover new ways to maximize the clinical benefits of hybrid instrument technology [11, 12].

In healthcare facilities, medical images are analyzed predominantly by specially trained experts such as radiologists and physicians. However, given the wide variation in pathology and the potential fatigue of human experts, researchers and clinicians have begun to take advantage of computer-assisted interventions. Advances in computational medical image analysis may not have kept pace with advances in medical imaging technology; nevertheless, the advent of ML has brought notable improvements. In applying ML, finding or learning informative features that best describe the regularities or patterns in the data plays an important role in various tasks in medical image analysis. Traditionally, meaningful or task-relevant features were designed mostly by human experts based on their knowledge of the target domains, making it difficult for non-experts to exploit ML techniques for their studies [13]. In the meantime, attempts have been made to learn sparse representations based on predefined dictionaries, potentially learned from training samples. Parsimonious representation of data is a fundamental problem in many fields of science: the simplest explanation of a given observation should be preferred over a complicated one. Sparsity-inducing penalization and dictionary learning have demonstrated the validity of this approach for feature representation and feature selection in medical image analysis [14–18]. It should be noted that the sparse representation and dictionary learning methods described in the literature still find informative patterns or regularities in data with a shallow architecture, which limits their representational power. Deep learning (DL) [19] overcomes this constraint through automatic feature engineering. Instead of relying on manually extracted features, DL requires only a set of data, with minimal pre-processing if necessary, and then discovers a representation of the information in a self-taught manner [20]. The burden of feature engineering has therefore shifted from humans to computers, allowing non-experts to use ML and DL effectively in their research, particularly in medical image analysis.

DL networks are specifically designed to learn and extract hierarchical data representations. DL draws its inspiration from the intricate structure and function of the human brain, and its ultimate objective is to emulate the cognitive and reasoning processes exhibited by humans. Recently, DL has gained considerable attention and popularity within the scientific community and in various other domains. Aside from medical practice, DL is utilized extensively in domains such as computer vision, natural language processing, speech recognition, and reinforcement learning. DL has demonstrated exceptional performance in tasks such as image classification, object detection, machine translation, and sentiment analysis. These accomplishments have propelled DL to the forefront of research and established it as a potent tool for tackling complex problems across numerous disciplines [21].

DL uncovers complex structures within high-dimensional data, making it well-suited for applications in medical image analysis. Litjens et al. [22] thoroughly surveyed DL in medical image analysis. Lundervold et al. [23] focus on MRI images, and Liu et al. [24] on ultrasound images. Here, we discuss DL-based analysis of images in MMI. Recent advances in ML, particularly DL, are helping to identify, classify, and quantify patterns in medical images. Central to these advances is the use of hierarchical feature representations learned solely from data rather than hand-designed features based on domain-specific knowledge. DL is fast becoming the state of the art, increasing efficiency in various medical applications [25].

Extracting complex features from medical images using DL models, such as convolutional neural networks (CNNs), You Only Look Once (YOLO), and recurrent neural networks (RNNs), leads to better segmentation, registration, and classification outcomes. DL has also revolutionized the creation of automated, computer-aided systems for disease diagnosis and treatment planning, which may reduce the workload of healthcare personnel and improve patient outcomes. Our earlier ML and DL research has examined these techniques across various fields, including computer vision, natural language processing, medical image analysis, segmentation, and classification, as well as automated medical image analysis and computer-aided diagnosis and prediction, with the aim of enhancing the accuracy and efficiency of existing methods [26–39].

DL-based approaches are currently considered a promising alternative for analyzing multiple modalities of medical images [40–43]. Compared to single-modality images, multi-modal images help to extract features from different views and bring complementary information, contributing to better data representation and discriminatory power of the network. As pointed out in Ref. [44], CT images can be used to diagnose muscle and bone disorders, such as bone tumors and fractures, while MR images offer good soft tissue contrast without radiation. Functional images, such as PET, lack anatomical characterization but provide quantitative metabolic and functional information about diseases. MRI sequences acquired with different parameters, such as T1-weighted (T1), contrast-enhanced T1-weighted (T1c), T2-weighted (T2), and fluid-attenuated inversion recovery (FLAIR) images, provide further complementary information. T2 and FLAIR are suited to detecting the tumor together with peritumoral oedema, while T1 and T1c are suited to detecting the tumor core without peritumoral oedema. Therefore, applying multi-modal images can reduce information uncertainty and improve clinical diagnosis and segmentation accuracy [45]. Several widely used multi-modal medical images are shown in Fig. 1.

Fig. 1

The multi-modal medical images, (a)–(c) are the commonly used multi-modal medical images and (d)–(g) are the different sequences of brain MRI [40].
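One common way to present several co-registered MRI sequences to a network is to normalize each sequence separately and stack them as input channels. The sketch below illustrates that idea only; it is not taken from any of the cited studies. The function names, the channel order, the z-score normalization, and the synthetic volumes are all assumptions made for illustration.

```python
import numpy as np

def zscore(volume, eps=1e-8):
    """Normalize one sequence to zero mean and unit variance."""
    return (volume - volume.mean()) / (volume.std() + eps)

def stack_sequences(sequences, order=("T1", "T1c", "T2", "FLAIR")):
    """Stack co-registered MRI sequences into a (channels, D, H, W) array.

    `sequences` maps a sequence name to a 3D volume. All volumes are
    assumed to be already registered to a common grid (hypothetical setup).
    """
    shapes = {sequences[name].shape for name in order}
    if len(shapes) != 1:
        raise ValueError("all sequences must share one shape (co-registered)")
    channels = [zscore(sequences[name].astype(np.float64)) for name in order]
    return np.stack(channels, axis=0)

# Synthetic stand-ins for real volumes; sizes and intensities are illustrative.
rng = np.random.default_rng(0)
sequences = {
    name: rng.normal(loc=100.0, scale=20.0, size=(4, 8, 8))
    for name in ("T1", "T1c", "T2", "FLAIR")
}
multimodal_input = stack_sequences(sequences)
print(multimodal_input.shape)
```

Normalizing each sequence on its own keeps one sequence's intensity scale from dominating the others, since MRI intensities are not calibrated across sequences.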

This article surveys DL-based architectures such as CNN- and YOLO-based models, Siamese networks, fusion-based models, attention-based models, and generative adversarial networks (GANs) for multimodal medical image (MMI) analysis. We discuss recent developments in DL-based techniques and models for analyzing medical images from multiple modalities while highlighting potential research directions for future work. We exemplify specific applications of DL-based approaches, such as tumor detection and characterization, and provide background literature on disease progression tracking. Because the field is too diverse to cover every aspect of multimodal imaging, we focus on commonly used multimodality medical imaging techniques and on the role of DL in tumor detection and classification. The article highlights the higher performance achieved by DL-based medical image analysis in the most widely used multimodality imaging techniques, such as PET-CT, SPECT-MRI, SPECT-CT, ultrasound-photoacoustic, PET-MRI, and CT-MRI, while specifically focusing on PET-CT and PET-MRI for tumor detection and classification. Moreover, we draw researchers’ attention to the advantages and limitations of such approaches compared to traditional ML-based techniques. The analysis of multimodality images has passed through several generations. The approaches for MMI analysis are categorized as conventional methods, ML, and DL, as depicted in Fig. 2.

Fig. 2

Three generations of MMI analysis.

Novel ML, AI, and specifically DL concepts are highlighted in this article with regard to multimodality medical image analysis. The article contributes to this field in several ways:

• This article contributes a survey of current state-of-the-art DL-based methods in MMI analysis for tumor detection and classification, including recent developments and trends.

• Our survey pins down gaps and challenges in the present literature, such as limitations of existing models and techniques, areas where more research is needed, and challenges related to data availability, interpretability, and generalizability.

• We provide guidance on best practices for DL-based approaches to MMI analysis, including recommendations for preprocessing, model architecture, and evaluation metrics. The article also highlights benchmark datasets that are commonly used for evaluating models in the field.

• This article provides an accessible and comprehensive introduction to DL-based MMI analysis and is a valuable educational resource for researchers and practitioners new to the field.

• By identifying gaps and challenges and highlighting best practices and benchmark datasets, this review article can help guide future research efforts in the field toward areas of greatest need and potential impact.

The survey is organized as follows: Section 2 provides background on commonly used multimodal medical imaging modalities and their clinical applications, the motivation and objectives of this survey, and an introduction to the use of DL in MMI analysis. Section 3 covers the survey search strategy and study selection criteria. Section 4 presents multimodal image classification and segmentation and some common challenges faced in multimodal medical image segmentation. Section 5 surveys DL architectures in MMI analysis for tumor detection and characterization. Section 6 compares and discusses the results. Common challenges, future directions, and emerging trends are highlighted in Section 7. Finally, Section 8 provides the concluding remarks of this survey. An overview of the survey is shown in Fig. 3.

Fig. 3

Organization of the survey paper.

2 Background

Several imaging modalities are used in medical imaging, each with its own strengths and limitations. The choice of modality for acquiring a patient’s images for disease diagnosis depends on the patient’s condition, the organ to be imaged, and the availability of imaging equipment. Multiple imaging modalities are utilized to acquire the necessary information about a patient’s anatomy, physiology, and pathology. Each imaging modality offers distinct advantages and is commonly used in clinical scenarios.

Multi-modal medical imaging uses multiple imaging techniques to capture information about patient anatomy or pathology; one example is the pair of whole-body PET-CT studies of a 68-year-old male undergoing treatment for small cell lung cancer reported in [46]. Each imaging modality offers unique insights into specific aspects of the body, such as structure, function, metabolism, or blood flow. Multi-modal imaging requires integrating images acquired with different modalities such as MRI, CT, PET, SPECT, ultrasound, and others. Healthcare providers can better grasp the patient’s condition and make more accurate diagnoses by integrating several modalities.

Utilizing multi-modal medical images offers several advantages in clinical practice and research. It gives further information regarding the architecture, physiology, metabolism, and function of tissues or organs. Combining modalities provides a more thorough and accurate assessment of patient health, which augments diagnostic capabilities and makes healthcare systems more efficient and effective.

One prevalent application of MMI lies in its utilization to identify and describe a wide range of diseases. In the oncology domain, for instance, the combination of MRI and PET scanning holds great potential in accurately pinpointing and staging tumors. This is accomplished by simultaneously visualizing anatomical structures and metabolic activity, enhancing the precision of diagnostic procedures [47–49]. Likewise, merging CT and SPECT images can prove immensely valuable in cardiovascular disease, facilitating healthcare professionals in obtaining highly detailed and comprehensive information about vascular and myocardial perfusion. As a result, this enables a more effective diagnosis and evaluation of such afflictions.

Another area where MMI finds application is image-guided intervention and surgical planning, a crucial component of contemporary medical practice. Combining preoperative imaging data from modalities such as CT, MRI, and PET gives surgeons a comprehensive understanding of the target region and its surrounding anatomy. This augmented visualization enhances the precision and safety of surgeries and assists in planning these interventions. Multimodal imaging also extends beyond the operating room to the evaluation and monitoring of treatments, where it plays a pivotal role in gauging the efficacy and progression of interventions over time. By capturing the dynamic changes that occur in the target regions after treatment, multi-modal imaging gives healthcare providers the insights needed to make well-informed decisions regarding patient care and to fine-tune treatment strategies accordingly. Consequently, the integration of MMI into clinical workflows confers numerous advantages and can significantly enhance patient outcomes.

The integration and analysis of MMI often involve advanced image registration, fusion, and data analysis techniques to effectively combine and extract meaningful information from different modalities. Image segmentation is pivotal in healthcare because it significantly impacts precise information extraction. Accurate and precise multimodal image segmentation is critical for clinical diagnosis, treatment planning, and monitoring of various illnesses, including tumor detection, categorization, and progression tracking.
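As a concrete illustration of the fusion step mentioned above, the sketch below blends two co-registered slices with a pixel-level weighted average, one of the simplest fusion rules. It is a minimal sketch under stated assumptions: the slices are assumed to be already registered, and the function names, the min-max scaling, the weight `alpha`, and the toy arrays are illustrative choices rather than a method taken from the surveyed literature.

```python
import numpy as np

def min_max(image, eps=1e-8):
    """Rescale intensities to approximately the range [0, 1]."""
    lo, hi = image.min(), image.max()
    return (image - lo) / (hi - lo + eps)

def weighted_fusion(anatomical, functional, alpha=0.5):
    """Pixel-level fusion of two co-registered images.

    `alpha` is the weight of the anatomical image (for example CT or MRI);
    the functional image (for example PET) receives weight 1 - alpha.
    """
    if anatomical.shape != functional.shape:
        raise ValueError("images must be registered to the same grid")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * min_max(anatomical) + (1.0 - alpha) * min_max(functional)

# Toy 2x2 "slices"; the values are made up for illustration.
ct_slice = np.array([[0.0, 100.0], [200.0, 400.0]])
pet_slice = np.array([[4.0, 0.0], [2.0, 1.0]])
fused = weighted_fusion(ct_slice, pet_slice, alpha=0.5)
print(fused)
```

Learned fusion strategies in DL models replace the fixed weight with parameters estimated from data, but the requirement that the inputs share a common grid is the same.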

Medical imaging paradigms in multimodal healthcare vary depending on clinical needs, with no ideal solution. PET-CT combines metabolic and anatomical data, valuable for cancer staging, neurological, and cardiovascular diseases. PET-MRI enhances tumor detection, and SPECT-CT offers precise localization. Various multimodal techniques like ultrasound-photoacoustic imaging, and Positron Emission Tomography-Magnetic Resonance Spectroscopy (PET-MRS) aid in tumor characterization. PET-MRS is utilized in tumor detection to provide comprehensive insights into metabolic and molecular processes within tissues. While PET offers functional imaging by detecting the distribution of radiolabeled tracers, MRS provides biochemical information by measuring the concentrations of metabolites. Together, PET-MRS enhances tumor detection by correlating metabolic changes with tissue characteristics, aiding in diagnosis, treatment planning, and monitoring of therapeutic responses [50]. Clinical applications, advancements, and challenges are discussed, emphasizing ongoing research to refine and expand techniques’ utility, particularly with the integration of DL methods. The evolving landscape of multimodal imaging continues to shape diagnostic approaches, improving patient care across various medical fields [51–60]. Some common multimodality imaging techniques and their familiar uses are shown in Table 1.

Table 1
Common multimodality imaging techniques and their familiar uses

Modality name	Familiar uses
PET-MRI	Detecting and characterizing tumors, neurological disorders, and cardiovascular diseases.
PET-CT	Tumor detection, staging, and monitoring treatment response.
SPECT-CT	Valuable for various medical fields, including oncology, neurology, cardiovascular, and infection imaging.
MRI-DTI	Researching brain conditions and neurodegenerative diseases.
Ultrasound and photoacoustic	Characterization of tumors and vascular abnormalities.
MRI-MRS	Particularly useful in the brain; can help distinguish between benign and malignant lesions.
Ultrasound-Doppler imaging	Widely used in obstetrics, cardiology, vascular medicine, and various other diagnostic purposes.
CTA	Valuable for diagnosing vascular conditions and assessing vascular anatomy in various body parts.
Multiparametric MRI	Prostate cancer detection and characterization.
Endoscopy-fluorescence imaging	Early tumor detection, especially for conditions like colorectal cancer.
CXR-CT	Lung cancer screening.

2.1 Motivation

As noted earlier, medical imaging is crucial in the diagnosis, prognosis, therapy, and monitoring of numerous diseases and medical conditions. While traditional medical image analysis methods have demonstrated some effectiveness, they often rely heavily on manually extracted features and hand-crafted algorithms. Unfortunately, this approach is time-consuming, subjective, and prone to human error. With the rise of DL, medical image analysis has advanced significantly, leading to automated algorithms capable of more accurate interpretation of complicated medical images [61]. The MMI approach provides complementary and comprehensive insights into a patient’s condition. However, because of the inherent complexity, heterogeneity, and high dimensionality of these data sets, effectively harnessing the potential of MMI data presents a substantial challenge [62–64].

DL techniques have demonstrated immense potential in tackling these challenges and have exhibited exceptional performance in many medical image analysis tasks [65]. These methods can extract intricate patterns and relationships from multimodal imaging data, ultimately enhancing accuracy in diagnosis, treatment planning, and patient prognosis. The primary objective of this review article is to provide readers with a comprehensive overview of recent advancements in DL-based MMI analysis methods. By examining how diverse imaging modalities are integrated, this review seeks to show how DL algorithms can exploit the complementary information from different modalities, thereby improving the overall accuracy and robustness of medical image analysis.

Moreover, this review article examines the challenges that arise in MMI analysis. One of the main obstacles is the inherent heterogeneity of the data: its diversity and variability make it difficult to integrate, interpret, and draw meaningful conclusions from. The explainability of DL models is another open issue in MMI analysis. Understanding the underlying mechanisms and decision-making processes of DL models is crucial for their effective and responsible deployment in medical imaging. Overall, this review article aims to shed light on the problems faced in tumor detection and classification in the MMI field and to give insight into potential remedies and future possibilities.

2.2 Objectives

The objectives of this review article on DL-based approaches for MMI analysis are as follows. The first objective is to offer a thorough review of the cutting-edge DL methods utilized in MMI analysis. The article examines in depth how several imaging modalities, including, but not limited to, MRI, CT, PET, and ultrasound imaging, are combined and fused. It aims to present a detailed understanding of how DL can effectively analyze MMI by examining the advancements in DL architectures, algorithms, and training strategies.

Secondly, the review article addresses the challenges and limitations of MMI analysis, specifically in tumor detection and classification. It discusses the intrinsic heterogeneity and unpredictability of multimodal imaging data, which present difficulties for data collection, fusion, and analysis. By highlighting these challenges, the article aims to inspire further research and innovation in developing solutions to overcome them and to encourage the creation of standardized benchmarks and datasets for MMI analysis.

Furthermore, the article outlines potential future directions and emerging trends in DL-based MMI analysis for improving patient outcomes. It explores avenues for incorporating domain knowledge into DL models, leveraging transfer learning techniques to adapt models to new domains and diseases, and integrating clinical data for a holistic analysis. By presenting these future directions, the article aims to provide researchers and practitioners with insights and guidance to advance the field further and drive improvements in MMI analysis.

2.3 DL-based MMI analysis

DL-based multimodal imaging has shown considerable promise in various clinical applications across medical specialties. This technology harnesses the capacity of deep neural networks (DNNs) to examine and integrate data from various imaging modalities, substantially improving diagnostic accuracy and providing critical insights for medical practice. Specialties such as neuroimaging, cancer imaging, and cardiovascular imaging stand to benefit from this technique.

It has also greatly impacted many other medical domains, including molecular imaging, musculoskeletal imaging, gastrointestinal imaging, lung imaging, ophthalmology, dentistry, maxillofacial imaging, obstetrics, and gynecology. This technology can transform how many diseases and disorders are diagnosed and tracked, ultimately improving patient outcomes and healthcare services worldwide. Researchers and practitioners are encouraged to consult the literature compiled in Table 2 for an overview of existing research on DL techniques employed in multimodal imaging.

Table 2
Various applications of DL-based MMI analysis

Article	Year	Architecture	Task	Modality	Application	Achievements
Heung et al. [17]	2015	Multi-task learning in a hierarchical fashion	Alzheimer’s classification	MRI and PET	Neuroimaging	Superior performance to the state-of-the-art methods
M. Liu et al. [170]	2018	3D-CNNs	Alzheimer’s classification	MRI and PET	Neuroimaging	Classification performance
Giesel et al. [171]	2019	U-net based DNNs	Tumor segmentation	Multimodal MRI	Oncology	Segmentation performance
Esther et al. [66]	2022	nnU-net based MMDL	Segmentation	Echocardiography and CMR	Cardiovascular imaging	Performance
Laquan et al. [55]	2020	FCN	Tumor segmentation	PET and CT	Molecular imaging	Segmentation accuracy
Zhongliang et al. [67]	2021	CNN	Tumor progression	PET and CT	Gastrointestinal imaging	Segmentation accuracy
Hilmizen et al. [172]	2020	Fusion of ResNet50 and VGG16	Lung function	CT scan and X-ray	Pulmonary imaging	Improved performance
Anthony et al. [141]	2020	Review of various DL techniques	Performance study of various DL applications			Application performance

Multimodal imaging has substantially improved the evaluation of heart function, notably by combining data from echocardiography, MRI, and CT scans. This comprehensive approach gives doctors a full view of the patient’s heart condition and allows them to make better-informed treatment decisions [66]. Moreover, DL technology in cancer imaging has provided new opportunities for researchers to examine PET-CT and PET-MRI data, allowing molecular processes and malignant tumors to be observed. Laquan Li et al. (2020) proposed a method that uses fully convolutional networks (FCN) to accurately segment tumors from PET-CT images, further showcasing the potential of DL in this field [55].

Neurotransmitter imaging: Multimodal molecular imaging can facilitate the examination of neurotransmitter activity and its correlation with neurological disorders.

Orthopedics: DL models integrate MRI and CT data to enhance the evaluation of musculoskeletal injury, skeletal health, and surgical planning. Mouchess Maria et al. (2006) employed multimodal imaging techniques to evaluate tumor progression and bone resorption in mouse models of neuroblastoma. After implanting luciferase-expressing human neuroblastoma cells in the femur, they monitored tumor growth and bone loss using radiography, bioluminescence, micro-CT, and MRI. The administration of zoledronic acid inhibited tumor growth and prevented bone loss, while high-resolution MRI could detect distant metastases [67].

Liver disease assessment: DL-based multi-modal medical imaging for liver disease assessment relies on algorithms that integrate and analyze data from diverse imaging techniques, enhancing accuracy and diagnostic capability. Xue Zhongliang et al. (2021) addressed liver lesion segmentation using a deep CNN in combination with multi-modal PET and CT scans. They proposed a model that improves the interaction between features from different modalities, merges feature maps of varying resolutions, and introduces a similarity loss function to enforce consistency. This model outperformed baseline techniques in liver tumor segmentation [68].

The detection of lung cancer is significantly enhanced through the analysis of combined PET and MRI scans using DL algorithms [56]. Fetal development is monitored using ML-based and DL-based analysis of multimodal ultrasound and MRI data, which provides comprehensive insights into fetal development and supports the detection of abnormalities during pregnancy [69].

Clinical practice is being transformed by DL-based multimodal imaging, which improves diagnostic accuracy, treatment planning, and patient care across various medical specialities. Anticipated advancements in DL algorithms and imaging technology are poised to extend its clinical application further.

DL has a significant role in tumor detection and classification across diverse multimodal imaging modalities. Analyzing multimodal data with DL models enhances diagnostic accuracy, facilitates personalized treatment plans, and enables early detection, thereby revolutionizing cancer management and improving patient outcomes. M. Attique Khan et al. (2020) [70] studied automated brain tumor classification using DL, which addresses radiologists’ challenges. The method incorporates contrast stretching, DL feature extraction, joint learning, and feature fusion. They validated their study on BraTS datasets and achieved high accuracies: 97.8%, 96.9%, and 92.5% for BraTS2015, BraTS2017, and BraTS2018, respectively.

G. Murtaza et al. (2020) [71] reviewed DL-based breast cancer classification across various medical imaging modalities. Their review systematically analyzes 49 studies, covering imaging modalities, datasets, preprocessing techniques, neural network architectures, and performance metrics. It highlights challenges and future research directions in this domain, serving as a resource for both beginners and advanced researchers in multimodality medical imaging. Elias Hossain et al. (2022) [72] proposed a strategy for brain tumor segmentation using 3D Magnetic Resonance Imaging (MRI) and Computed Tomography (CT) scans with a 3D U-Net design and ResNet50. The ResNet50 achieved 98.96% accuracy, and the 3D U-Net scored 97.99%. Additionally, image fusion and a specific loss function improved segmentation accuracy. The models were validated using various metrics and integrated into a web server for practical deployment in healthcare settings.

3 Proposed survey search strategy and study selection criteria

In constructing this survey, we followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines to ensure a methodical and thorough literature review. This approach provides a standardized framework that promotes transparency, precision, and replicability of the systematic review. We searched the existing literature through the academic search engine Google Scholar, using carefully chosen search terms to retrieve research on tumor segmentation and classification, with particular emphasis on multi-modal imaging. The search query combines method terms, including but not limited to “deep learning,” “convolutional neural networks,” and “machine learning,” with abstract keywords such as “lesion,” “cancer,” and “tumor,” along with “segmentation,” “classification,” “multimodal imaging,” “multimodal therapy,” “PET-CT,” “SPECT-CT,” “PET-MRI,” “multimodality image analysis,” “PET-CT in multimodality imaging,” “tumor detection using PET-MRI imaging,” and similar terms for other modalities. The objective of this query is to capture the connections between these computational techniques and their potential applications in medicine. Figures 4 and 5 show the year-wise distribution of the chosen literature on DL-based tumor detection, lesion segmentation, and classification approaches.
The distribution in Fig. 5 is based on a Google Scholar query over all articles mentioning the words “Deep learning” and a specific modality, e.g., “PET-CT”. Figure 6 depicts the distribution of the most frequently used techniques, and Fig. 7 shows the region-wise distribution of the selected research papers. Several efficient methods have been proposed for MMI analysis, and survey papers in the literature have reviewed recent work on DL techniques for multimodal imaging, as presented in Table 3.

Fig. 4

DL-based approaches for tumor detection and classification focus on PET-CT, PET-MRI, and SPECT-CT multimodalities.

Fig. 5

Year-wise distribution of the reviewed literature, based on a Google Scholar query over all articles mentioning the words “Deep learning” and a specific modality, e.g., “PET-CT”.

Fig. 6

Most often practiced technique.

Fig. 7

Region-wise distribution of the selected research papers.

Table 3

Summary of related survey articles on MMI analysis

Category investigating articles:	Essential characteristic	Weaknesses
Liling Peng et al., 2023 [173]	•Examined the effectiveness of Fluorine-18 fluorodeoxyglucose PET-MRI in conjunction with chest CT in swiftly detecting malignancy in a substantial population of individuals without symptoms and underlined the increasing significance of PET-MRI in medical imaging.	•The study used short follow-up periods.
		•Considering limited sample sizes, retrospective designs, and short follow-up periods is imperative when analyzing the outcomes.
Kai Jannusch et al., 2022 [174]	•The aim of the study was to evaluate the clinical relevance of missed lung nodules at the initial staging of breast cancer patients in PET-MRI compared with CT.	•The study suggests considering supplemental low-dose chest CT after neoadjuvant therapy for backup. However, it does not provide guidelines or criteria for when and how this supplemental CT should be conducted.
Z. Xue et al., 2021 [68]	•Model for liver tumor segmentation using a PET-CT scans dataset. •Combine feature maps of different resolutions to derive spatially varying fusion maps and enhance the lesions information. •Introduce a similarity loss function for consistency constraint	•It does not specify the performance metrics used.
		•It does not elaborate on the specific baseline methods used for comparison.
		•Does not provide insights into the interpretability of the Deep learning model.
		•Does not discuss the clinical validation of the model.
General tutorials:
(Piñeiro-Fiel et al., 2021 [175]; Guglielmo et al., 2021 [176]; Magadza and Viriri, 2021 [177])	•Provide a general introduction to non-invasive imaging methods (MR, CT, PET) and their role in cancer management.	•Algorithmic and architectural perspective analysis was ignored.
	•Discuss quantitative image analysis in screening, diagnosis, tumor characterization, and treatment response of patients.
		•Do not explore the existing solution critically considering performance analysis.
Survey Articles:
Ole Martin et al., 2020 [178]	•The purpose was to investigate differences between PET-MRI and PET-CT in lesion detection and classification in oncologic whole-body.	•The study was performed over a small sample size of 1,003 oncologic examinations from 918 patients.
		•Does not provide detailed information on the clinical relevance of these findings and changes in Tumor, Nodes, and Metastasis staging.
Lucia Baratto et al., 2022 [179]	•Provide a review of the current utilization of PET and MRI imaging-based AI models in pediatric oncology.	•The article’s findings and their implications have yet to undergo thorough critical analysis.
		•While it does provide instances of AI-facilitated image processing, it disregards potential constraints, impediments, or imperfections in the technology.
(Yousefirizi et al., 2022 [180]; Ren et al., 2021 [121]; Sadaghiani et al., 2021 [181])	•Focus on aspects other than PET-CT, such as PET alone, the fusion of PET-MRI or PET-CT/MRI, and others.	•Focus on the specific (PET, PET-MRI, PET-CT/MRI) modality aspects.
		•Might comprehensively discuss that aspect including AI-based models.
Proposed survey	•Give a thorough overview of automated tumor lesions and classification methods in PET-MRI, PET-CT, and other multimodalities.
	•Unlike the general tutorial, the suggested survey critically examined current work and comprehensively addressed its algorithmic and network architectural design features.
	•Unlike the tutorial and survey, which focus on certain components (detection, classification, progression), this survey focuses on DL-based segmentation approaches for multimodalities.
	•Data set characteristics and open databases on multimodalities are also given, as is assessment (data annotation and measurement metrics).
	•Current issue highlights and proposals for potential future paths have been highlighted.

This survey discusses the key parameters used to explore recent studies on tumor detection and classification, and on a key aspect of image analysis, i.e., image segmentation in multimodality imaging. Table 4 describes the parameters used to explore the reviewed studies, which are classified into five categories: architecture design, dataset characteristics, tumor type, model performance (Perf.), and implementation (Imp.) code availability. Under the architecture design category, we discuss the backbone network, the shortcut connections, the convolution module, and the attention mechanism used with the backbone model. We also consider the loss function (distribution-based, region-based, or hybrid) and the optimizer used during model training. Based on the architectural design, we classified the existing DL models into CNNs and YOLO-based models, Siamese networks, fusion-based models, attention-based models, and generative adversarial networks (GANs). These categories are subdivided into several types, including CNN, FCN, SegNet, RNet, UNet, TSN, CSN, RSN, efusion, ARNN, LSTM, VGAN, cGAN, PGAN, InfoGAN, and their variants. A summary of the classification of the reviewed literature is shown in Fig. 8.

Table 4

Details of the parameters used to explore DL models

Parameters	Functionality
Architecture design	Model	•Backbone network: CNN, FCN, U-Net, etc.
		•Connections: Residual con, Skip con, Dense con etc.
		•Convolution module: Dilated conv, Separable conv etc.
		•Attention and pooling method: Pyramid pooling, Attention.
	Loss function	•Name of loss function:
		∘ CE (Distribution-based)
		∘ Dice (Region-based)
		∘ Focal + Dice (Hybrid)
	Optimizer	•Stochastic gradient descent (SGD), ADAM, Adadelta, L-BFGS, etc.
Dataset characteristics	Pre/Post processing	•Pre/Post-processing involvement:
		∘ Yes (✓)/No (×)
	Data	•Availability of dataset:
		∘ Public
		∘ Private
		∘ Both used (*)
	Study population	•Size of Dataset:
		∘ Number of patients
		∘ Training images
		∘ Testing images
		∘ Validation images
	Cross testing	•Multi-center data involved
		•Cross-validation involved
		∘ Yes (✓)/No (×)
	Data augmentation	•Data augmentation involved
		•Data augmentation method
		∘ Yes (✓)/No (×)
Focused area	Tumor	•Anatomical interest ∘ Breast, brain, lungs, head and neck and bones
Performance	Dice similarity Coefficient	•Performance score
Access	Code	∘ Availability of code and supplementary materials
		∘ Yes (✓)/No (×)
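
The loss-function families listed in Table 4 can be illustrated with a minimal sketch. It assumes binary segmentation with per-voxel foreground probabilities given as flat Python lists; the function names, the weighting parameter `alpha`, and the choice of cross-entropy plus Dice as the hybrid are illustrative assumptions, not the formulation of any reviewed study.

```python
import math


def cross_entropy(probs, targets, eps=1e-7):
    """Distribution-based loss: mean binary cross-entropy over voxels."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)


def soft_dice_loss(probs, targets, eps=1e-7):
    """Region-based loss: one minus the soft Dice overlap."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)


def hybrid_loss(probs, targets, alpha=0.5):
    """Hybrid loss: weighted sum of the two terms (alpha is an assumed weight)."""
    return (alpha * cross_entropy(probs, targets)
            + (1.0 - alpha) * soft_dice_loss(probs, targets))


# Toy example: four voxels, two of which belong to the tumor.
targets = [1, 0, 1, 0]
good = [0.9, 0.1, 0.8, 0.2]
poor = [0.4, 0.6, 0.3, 0.7]
print(hybrid_loss(good, targets), hybrid_loss(poor, targets))
```

The region-based term depends only on overlap, so it is less sensitive to the large number of background voxels than the distribution-based term, which is one reason hybrids are used for small lesions.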

Fig. 8

DL-based multimodality image analysis architecture.

In terms of dataset features, we reviewed the dataset’s public availability as well as the study population, which includes the total data size and the training, testing, and validation data sizes used to demonstrate the usefulness of the proposed methodologies. The dataset’s heterogeneity in tumor structure, size, and location is also taken into account. We also evaluate the cross-testing feature, which accounts for the dataset’s high variability and complexity. Data from a single site or from several sites may be used in cross-testing. We label a dataset as cross-testing if it comes from several centers and some of the data is used for training and some for testing. We also examine whether data augmentation and pre/post-processing approaches have been employed to improve tumor segmentation performance. The type of cancer or the anatomical region of interest is also summarized.
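
One possible way to realize a multi-center cross-testing split is to hold out whole centers for testing. The sketch below assumes each sample carries a `center` field; the field name, the sample records, and the center-wise hold-out rule are hypothetical and stricter than the labeling criterion described above.

```python
def split_by_center(samples, test_centers):
    """Hold out whole centers so that no site contributes to both sets."""
    train = [s for s in samples if s["center"] not in test_centers]
    test = [s for s in samples if s["center"] in test_centers]
    return train, test


# Hypothetical patient records from three imaging centers.
samples = [
    {"id": "p01", "center": "A"},
    {"id": "p02", "center": "A"},
    {"id": "p03", "center": "B"},
    {"id": "p04", "center": "C"},
    {"id": "p05", "center": "C"},
]

train, test = split_by_center(samples, test_centers={"C"})
print([s["id"] for s in train], [s["id"] for s in test])
```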

Evaluation measures such as the Dice similarity coefficient (DSC), intersection over union (IoU), accuracy, sensitivity, precision, recall, and the C-index were employed to describe model performance. DSC and IoU assess the overlap between predicted and ground-truth segmentations. Accuracy, sensitivity, precision, and recall are computed from the segmentation model’s true and false positives and negatives. The C-index is used to examine the consistency between predicted risk and survival status in a survival prediction task (Harrell et al., 1996). There is no gold standard for assessing model performance; instead, most models are assessed using a variety of criteria. DSC is a frequently used segmentation metric that measures the similarity between predictions and ground truth. Consequently, we report the DSC value in the comparison tables, and the results for additional measures are discussed when examining the individual studies in the relevant sections. Finally, we indicate whether the implementation code of each study is publicly available. Publicly available code serves as a baseline, improves reproducibility, and supports transfer learning for building more robust models.
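
The overlap and classification measures named above can be computed from the voxel-wise confusion counts. The following is a minimal sketch assuming binary masks flattened to lists of 0 and 1; the function names and the toy masks are illustrative.

```python
def confusion_counts(pred, truth):
    """Return (TP, FP, FN, TN) for two binary masks of equal length."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    return tp, fp, fn, tn


def dice(pred, truth):
    """DSC = 2TP / (2TP + FP + FN); defined as 1.0 when both masks are empty."""
    tp, fp, fn, _ = confusion_counts(pred, truth)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


def iou(pred, truth):
    """IoU = TP / (TP + FP + FN); defined as 1.0 when both masks are empty."""
    tp, fp, fn, _ = confusion_counts(pred, truth)
    denom = tp + fp + fn
    return tp / denom if denom else 1.0


def sensitivity(pred, truth):
    """Sensitivity (recall) = TP / (TP + FN)."""
    tp, _, fn, _ = confusion_counts(pred, truth)
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(pred, truth):
    """Specificity = TN / (TN + FP)."""
    _, fp, _, tn = confusion_counts(pred, truth)
    return tn / (tn + fp) if (tn + fp) else 0.0


def precision(pred, truth):
    """Precision = TP / (TP + FP)."""
    tp, fp, _, _ = confusion_counts(pred, truth)
    return tp / (tp + fp) if (tp + fp) else 0.0


pred = [1, 1, 0, 0, 1, 0]
truth = [1, 0, 0, 0, 1, 1]
print(dice(pred, truth), iou(pred, truth))
```

For a single pair of masks the two overlap measures are related by DSC = 2·IoU / (1 + IoU), so they rank individual segmentations identically even though their values differ.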

4 Multimodal image segmentation and classification

DL-based techniques have become effective tools for assessing clinical images from different modalities in recent years. DL has improved the quality and efficiency of medical image analysis by automatically learning key features from raw image data, with significant gains in image registration, image segmentation, and image classification [21–25].

4.1 Image classification

In recent years, DL-based image classification methods have progressed considerably, and they have been applied to classification tasks in multimodal medical imaging with promising outcomes. One noteworthy contribution is the TransMed model proposed by Yin Dai et al. (2021), which focuses on the classification of multimodal medical images. By combining the capabilities of CNNs and transformers, TransMed addresses limitations of purely CNN-based approaches. On two distinct datasets, it surpassed the CNN-based models, with reported improvements of 10.1% and 1.9%, respectively. This result underscores the potential of TransMed and suggests that it may be applicable to other medical image analysis tasks [73].

M. Xiao et al. (2020) [74] address the challenges in diagnosing and classifying gliomas, the most common brain tumors. They introduce two convolutional neural network models: a 2D ResNet-based model for pathology image classification and a 3D DenseNet-based model for MRI image classification. These models achieved first place in the CPM-RadPath-2019 challenge for classifying different grades of gliomas. K. Takahashi et al. (2022) [75] explore the utility of DL models in improving the accuracy of PET-CT image classification for breast cancer (BC). Using images with multiple degrees of PET maximum-intensity projection (MIP), DL models trained on 400 images showed promising results, outperforming radiologists in sensitivity and specificity.

Jiapeng Zhang et al. (2022) [76] introduce a DL-based method for classifying multiple organ-specific cancers using PET/CT images, aiming to assist radiologists in cancer screening. The proposed method incorporates a modality fusion module to fuse PET and CT images with segmented multi-organ information. Grayscale transformation and a double-level V-net are used for organ segmentation in low-dose CT images, addressing data annotation challenges and enhancing image context information. The classifier achieves an F-score of 82.3% across six classes, demonstrating its potential to aid radiologists in cancer screening.

4.2 Image segmentation

DL-based approaches have shown significant promise in improving the precision and efficiency of segmentation in the MMI field. Attention mechanisms in particular have emerged as a notable development, improving segmentation accuracy by selectively concentrating on pertinent features. Guo Zhe et al. (2019) contributed a supervised, DL-based MMI analysis method that segments soft tissue sarcoma lesions using MRI, CT, and PET. Their investigation revealed that fusing modalities at different network levels can yield superior outcomes compared with analyzing single-modal images alone. This finding provides useful guidance for developing multi-modal image analysis techniques [45].

Multimodal medical image segmentation (MMIS) is a vital challenge in obtaining relevant information from several imaging data sources. Advances in this subject help to enhance clinical decision-making and provide a better understanding of complicated biological processes. MMIS is the process of characterizing and delineating distinct anatomical structures or regions of interest that are visible in medical images acquired with a variety of imaging modalities. It requires image processing and analysis methods that extract information-rich data from each modality and then integrate these data to generate a precise and comprehensive segmentation. The primary objective of MMIS is to enhance the accuracy and comprehensiveness of medical image analysis by harnessing the complementary advantages of different imaging modalities.

MMIS is challenging because image characteristics, intensity distributions, and spatial resolution vary across modalities. However, it also holds great potential for harnessing each modality’s diverse and complementary information, thereby enabling more precise and adaptable segmentation outcomes, as supported by earlier investigations [40, 77].

Several approaches have been developed for MMIS. One common approach is to perform modality-specific segmentations independently and then fuse the results using registration techniques to ensure spatial alignment. Another approach is to directly incorporate the information from multiple modalities into a unified segmentation framework by exploiting the similarities and differences in intensity patterns.
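
The two strategies just described can be contrasted in a minimal sketch. It assumes the PET and CT data are already co-registered and flattened to equal-length lists; the function names, the fusion weight, and the threshold are illustrative assumptions rather than the procedure of any cited study.

```python
def early_fusion(pet, ct):
    """Input-level fusion: pair co-registered voxels into two-channel features
    that a single segmentation model would consume."""
    if len(pet) != len(ct):
        raise ValueError("modalities must be co-registered to the same grid")
    return [(p, c) for p, c in zip(pet, ct)]


def late_fusion(prob_pet, prob_ct, weight_pet=0.5, threshold=0.5):
    """Decision-level fusion: average the per-modality foreground probabilities
    produced by independent models, then threshold to a binary mask."""
    fused = [weight_pet * a + (1.0 - weight_pet) * b
             for a, b in zip(prob_pet, prob_ct)]
    return [1 if f >= threshold else 0 for f in fused]


# Toy example with three voxels.
features = early_fusion([2.1, 0.3, 1.4], [40.0, -900.0, 35.0])
mask = late_fusion([0.9, 0.2, 0.6], [0.7, 0.1, 0.3])
print(features, mask)
```

Early fusion lets one model learn cross-modality interactions directly, whereas late fusion keeps the modality-specific models independent and only requires that their outputs be spatially aligned.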

ML and DL techniques have also been used effectively in MMIS. These techniques leverage the diverse and complementary information derived from the various modalities to train models that segment structures of interest precisely and accurately. In particular, CNNs and other advanced architectures are frequently employed because they can learn intricate nonlinear associations between input images and their corresponding segmentations [77–79].

The segmentation outcomes derived from MMIS possess many practical and scholarly applications within clinical practice and research. These applications span a broad spectrum of disciplines, encompassing treatment planning, image-guided intervention, surgical navigation, disease diagnosis, and treatment response monitoring. In order to facilitate the process of quantitative analysis, accurate volumetric measurement, tracking of disease progression, and the effective utilization of computer-aided diagnostic systems, it becomes imperative to acquire precise and dependable segmentation results.

MMIS is a formidable task, yet it is of paramount importance within medical imaging. It capitalizes on the advantages of various imaging modalities to construct comprehensive and precise segmentations, which in turn aid diagnosis, treatment planning, and patient management. The continuous advancement of algorithms and technology in this field has played a significant role in the progress of medical imaging, ultimately leading to better patient care and prognostic capabilities.

4.2.1 Common challenges in multi-modal medical image segmentation

Multi-modal medical image segmentation poses several challenges due to the complexities of integrating and analyzing data from multiple imaging modalities [40, 80]. Some of the key challenges include:

Intensity heterogeneity: Each modality may have its intensity range, distribution, and contrast characteristics. This can make it difficult to define consistent intensity thresholds or models for segmentation across modalities.

Modality complementarity: While multi-modal images offer complementary information, effectively utilizing this information is challenging. The issue of successfully merging information from several modalities to increase segmentation accuracy remains to be investigated and is an active research area.

Class imbalance: Medical image segmentation often involves imbalanced class distributions, where certain structures or regions of interest are smaller or less frequent than others. Handling class imbalance and ensuring accurate segmentation of both major and minor structures is crucial.

Limited annotated data: Annotated training data for MMIS is often limited, and producing annotations is time-consuming. The availability of large-scale, diverse, and well-annotated multimodal datasets is essential for training accurate and robust segmentation models.

Computational complexity: When dealing with vast amounts of data and complicated segmentation methods, the computational requirements for processing MMI might be quite high. Efficient algorithms and hardware resources are needed to handle the computational demands of MMIS.

Validation and evaluation: Due to the lack of ground truth annotations for all modalities, evaluating the performance of MMIS approaches might be difficult. Developing reliable evaluation metrics and validation strategies that account for the modalities’ differences is important.
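
Two of the challenges listed above, intensity heterogeneity and class imbalance, are commonly mitigated with simple preprocessing and weighting steps. The sketch below assumes flat lists of voxel intensities and integer labels; per-modality z-score normalization and inverse-frequency class weights are shown as generic examples, not as methods taken from the cited studies.

```python
import statistics


def zscore(values):
    """Normalize one modality to zero mean and unit variance so that PET and
    CT intensities land on a comparable scale."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def inverse_frequency_weights(labels):
    """Weight each class by n / (k * count), so rare classes weigh more."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    n, k = len(labels), len(counts)
    return {c: n / (k * count) for c, count in counts.items()}


ct_normalized = zscore([-1000.0, 0.0, 1000.0])    # CT-like intensity range
weights = inverse_frequency_weights([0] * 9 + [1])  # 9 background, 1 tumor
print(ct_normalized, weights)
```

Such class weights can be used to scale the per-voxel terms of a distribution-based loss so that small lesions are not dominated by background.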

5 Related survey of multimodal DL architectures

In medical image analysis, numerous DL architectures have been designed to meet the distinctive requirements of analyzing images acquired through various modalities. By integrating information from multiple modalities within one framework, these models can extract complementary information and support accurate and comprehensive analysis. To illustrate the range of these multi-modal DL architectures, this section presents several noteworthy examples.

5.1 2D & 3D CNNs and YOLO based models

In this section, we discuss segmentation approaches that employ DL models, specifically CNN, FCN, SegNet, RNet, U-Net, VNet, WNet, and YOLO-based models, for tumor detection and classification. The advantages of CNNs over traditional ML models can be summarized as follows: (1) minimal preprocessing of the input data is required, (2) feature engineering is automated, and (3) the number of learnable parameters is reduced (Jose Dolz et al., 2018). The CNN is one of the most widely used DL models for the detection, classification, and segmentation of medical images (J. Sun et al., 2021). 3D CNNs were developed to assess medical images obtained from volumetric modalities such as MRI and CT. Their fundamental benefit is the capacity to directly extract information about the spatial and temporal properties inherent in three-dimensional image data. As a result, 3D CNNs have often outperformed their two-dimensional counterparts. This improvement is exemplified by the work of Mohammed Oves and colleagues (2023), who used DL to address the challenge of reliably classifying large-scale medical imaging data, including two-dimensional and three-dimensional images. Their proposed deep volume classification network combines multiple depth features and attained an area under the curve (AUC) of 93.66% [81, 82].

Zhao et al. (2018) [83] presented a 3D FCN for lung tumor co-segmentation in PET-CT, which included multi-task learning and feature fusion modules. In the multi-task learning module, a VNet-style architecture was used to extract high-dimensional features from each imaging modality (PET and CT). The feature fusion module accepts the feature maps as input and re-extracts features using a fusion network of four convolution layers. The proposed model was compared against the state-of-the-art (SOTA) VNet (Milletari et al., 2016 [84]), WNet (Xu et al., 2018 [85]), and fuzzy c-means clustering approaches using a lung cancer dataset of 84 patients. According to the results, the proposed co-segmentation approach achieved a DSC of 0.850 with classification and volume error values of 0.330 and 0.150, respectively.

YOLO detects objects and localizes their positions within an image or video frame. Tao Zhou et al. (2023) [86] proposed a Cross-modal Cross-scale Global-Local Attention YOLOV5 lung tumor detection model (CCGL-YOLOV5). They trained the model using stochastic gradient descent (SGD) with a localization loss as the cost function. On a multimodal lung tumor PET-CT dataset, the CCGL-YOLOV5 model achieved an accuracy of 97.83%, a recall of 97.39%, an average accuracy of 96.67%, an F1 score of 97.61%, and a frames-per-second (FPS) value reported as 98.59%. The experimental results indicate that CCGL-YOLOV5 outperforms the conventional models it was compared with.

Moreau et al. (2021) [87] proposed the UNet-based UNet_BL and UNet_FL models for segmenting metastatic breast cancer lesions and monitoring treatment response, with longitudinal whole-body PET-CT scans as input. The baseline model UNet_BL was trained on baseline PET-CT images to segment baseline lesions, while the follow-up model UNet_FL was trained on four inputs: two for follow-up PET and CT, two for baseline PET, and the output of UNet_BL as the baseline lesion segmentation. The proposed models were evaluated on the EPICURE_seinmeta dataset, consisting of 60 patients with 60 baseline and 104 follow-up images. The results show that UNet_BL and UNet_FL achieved mean DSC values of 0.66 and 0.58, respectively.

Joonas Liedes et al. (2022) [88] studied the applicability of a convolutional neural network for detecting and auto-delineating head and neck squamous cell carcinoma (HNSCC) from PET-MRI data. They employed a previously described U-Net model built with the TensorFlow Keras framework (version 2.5.0) in Python 3.7.10. The model was trained to identify the primary tumor and any metastases in the PET-MRI slices. In addition, models trained only with PET slices and models trained with augmented PET-MRI slices were created. The PET-MRI model attained a precision of 0.71 at a 9-pixel classification fidelity, with specificity and sensitivity of 0.68 and 0.77, respectively. The PET-only model was more prone to false positives, although its sensitivity was comparable to that of the PET-MRI model. Models trained using augmented PET-MRI data yielded sensitivity, specificity, and precision values of 0.53, 0.77, and 0.65, respectively. A summary of the CNN- and YOLO-based tumor detection models is presented in Table 5.

Table 5
CNNs and YOLO architectures for tumor detection

Ref	Anatomic target	Modality	Sample division	Cross valid.	DL Architecture	Cost function	Optimizer	Performance	Pre/Post	Augmentation	Data/Code
Zhou et al. (2023)	Lung	PET-CT	Total = 3437	✓	CCGL-YOLOV5	Localization loss	SGD	Acc = 97.83%	×	×	×
			Train = 2058
			Test = 1379
Zhao et al. (2018)	Lung	PET-CT	Total = 84	✓	3D FCN: VNet	Weighted loss	Adam	DSC = 0.76	✓	✓	×
			Train = 48
			Test = 36
Adel Kermi et al. (2018)	Brain	MRI Volumes	Total = 285	×	2D CNN: UNet	WCE+GDL	SGD	DSC = 0.868	✓	✓	✓
			Train = 228
			Test = 57
Moreau et al. (2021)	Breast	PET-CT	Total = Train = Test =	✓	3D UNet: skip con.	CE+multi class dice	Adam	DSC = 0.580	✓	✓	×
Joonas Liedes et al. (2022)	Head and neck	PET-MRI	Total = 356	✓	CNN UNet: skip con.	Binary cross-entropy (BCE)	Adam	DSC = 0.72	✓	✓	×
			Train = 290
			Test = 66
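
The metrics reported in this subsection (DSC, sensitivity, specificity, precision, and F1 score) follow standard definitions that are easy to reproduce. The following is a minimal sketch using only the Python standard library; the function names and the toy inputs are illustrative assumptions and are not taken from any of the cited studies.

```python
def dice_coefficient(pred, truth):
    """Dice similarity coefficient (DSC) between two flat binary masks (0/1 lists)."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same length")
    intersection = sum(p * t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def detection_metrics(tp, fp, tn, fn):
    """Sensitivity (recall), specificity, precision and F1 from confusion counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Toy example (hypothetical values): six-pixel masks and made-up confusion counts.
predicted_mask = [0, 1, 1, 1, 0, 0]
reference_mask = [0, 0, 1, 1, 1, 0]
dsc = dice_coefficient(predicted_mask, reference_mask)
scores = detection_metrics(tp=8, fp=2, tn=18, fn=2)
```

Because DSC counts only overlapping foreground pixels, it is less affected by a large background than plain accuracy, which is one reason the two kinds of figures in Table 5 are not directly comparable.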

5.2 Siamese networks

The Siamese network is a neural network architecture that has emerged as a viable solution for image registration and the alignment of different modalities. It consists of two identical subnetworks that share a single set of weights, each encoding one of two input images into a fixed-length feature vector. Because the weights are shared, the two feature vectors lie in the same feature space, so comparing them gives a measure of the similarity between the two images.
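
The shared-weight idea can be illustrated with a very small sketch. It assumes a single linear layer with ReLU as the encoder and cosine similarity as the comparison; both are simplifications chosen for illustration, and the names `encode`, `siamese_similarity`, and `SHARED_WEIGHTS` are hypothetical rather than taken from the cited works.

```python
import math


def encode(image, weights):
    """Map a flat image (list of floats) to a fixed-length feature vector
    with one linear layer followed by ReLU."""
    return [max(0.0, sum(w * x for w, x in zip(row, image))) for row in weights]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def siamese_similarity(image_a, image_b, shared_weights):
    # Both branches use the SAME weights, so the two feature vectors
    # live in one feature space and can be compared directly.
    features_a = encode(image_a, shared_weights)
    features_b = encode(image_b, shared_weights)
    return cosine_similarity(features_a, features_b)


# Hypothetical 3x4 weight matrix: 4 input pixels -> 3 features.
SHARED_WEIGHTS = [
    [0.5, -0.2, 0.1, 0.3],
    [-0.1, 0.4, 0.2, -0.3],
    [0.2, 0.1, -0.4, 0.5],
]

patch = [0.9, 0.1, 0.4, 0.7]
other = [0.1, 0.9, 0.8, 0.0]
same_score = siamese_similarity(patch, patch, SHARED_WEIGHTS)
diff_score = siamese_similarity(patch, other, SHARED_WEIGHTS)
```

In practice the encoder would be a trained convolutional network and the comparison is often learned as well; the sketch only shows why weight sharing makes the two embeddings comparable.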

Ahmed Sabeeh Yousif et al. (2022) [89] address the need for accurate disease diagnosis in medical imaging through multi-modal image fusion. Their approach combines sparse representation and orthogonal matching pursuit (OMP) with a Siamese convolutional neural network (SCNN). By integrating these methods, the authors aim to improve pixel positioning, refine sparse features, and reduce undesirable artifacts, with the goal of producing better fusion results than previously used methods.

Zhaisheng Ding et al. (2021) [90] studied tumor detection from PET-MRI. They introduced a framework that combines the local extrema scheme (LES) with a Siamese network to improve efficiency and overcome limitations of existing tumor detection methods. In their experiments, the proposed approach outperformed the methods it was compared with. Ning Xiao et al. (2022) [91] proposed a Siamese Pyramid Fusion Network (SPFN), which introduces a feature pyramid transformation into the Siamese convolutional neural network to extract multi-scale information from fused PET and CT images for lung tumor detection. They concluded that the proposed fusion-based Siamese method is competitive in terms of quality improvement and information retention for PET-CT. A summary of the Siamese network-based models is presented in Table 6.

Table 6
Siamese Networks architecture models for tumor detection/classification

Ref	Anatomic target	Modality	Sample division	Cross valid.	DL Architecture	Cost function	Optimizer	Performance	Pre/Post	Augmentation	Data/Code
Yousif et al. (2022)	Brain	CT and MRI	Total = 8320	×	2D CNN: SCNN-fusion, sparse coding and OMP	Soft-Max	SGD	VIF = 0.9992	×	×	✓
			Train = 6320
			Test = 2000
Zhaisheng Ding et al. (2022)	Brain	PET-MRI	Total = 220	✓	LES-Siamese net	soft-max	SGD	Acc = 97.66	×	×	×
			Train = 120
			Test = 12
Z. Diao et al. (2023) [52]	Head and neck and liver tumor	PET-CT	Total = 51P	×	Siamese semi-disentanglement network, backed by GAN	cross-entropy loss	Adam	DSC = 0.718	×	×	✓
			Train = 36P
			Test = 15P
Ning Xiao et al. (2022)	Lung	PET-CT	Total = 840	✓	SPFN	structural similarity loss	Xavier	Inf. Entro. = 0.076	✓	×	✓
			Train = 672
			Test = 168

5.3 Fusion-based models

Fusion-based models are designed to merge the information extracted from different modalities, which can substantially improve the precision and reliability of medical image analysis. “Image fusion generates an informative composite image, which can promote the performance of subsequent computer vision tasks. In this domain, multimodal medical image fusion has drawn increasing attention due to its significant clinical applications, including tumor segmentation, cell classification, neurological research, and treatment strategies for recurrent high-grade gliomas” [92]. Wei Tang et al. (2022) [92] proposed MATR, a DL-based multimodal medical image fusion method built on a multiscale adaptive Transformer, and applied it to SPECT-MRI images from the Harvard database. They compared MATR qualitatively and quantitatively against seven representative state-of-the-art methods and reported that each of the seven showed several drawbacks relative to MATR. The multi-modal deep neural network (MDNN) is a well-known example of a fusion-based model, noted for its capacity to integrate different modalities at multiple layers within the network. Its relevance to medical image analysis is supported by many studies [34, 38, 40, 43, 63, 78, 81, 82, 89, 93–95].

Lei Bi et al. (2021) [96] proposed a recurrent fusion network (RFN) for automatic PET-CT tumor segmentation. The RFN consists of multiple recurrent multi-modality downsampling (RMD) and upsampling (RMU) processes connected via interconnect link modules (ILMs), with ResNet, DenseNet, and 3D-UNet backbones. According to the authors, the considerable improvement of RFN over 3D-UNet (> 5% in DSC) suggests that RFN can alleviate the training initialization limitation of 3D-based FCNs. A. Kumar et al. (2019) [54] compared their CNN with fused-input (FS), multi-branch (MB), and multi-channel (MC) baselines and found that it had a significantly higher foreground detection accuracy (99.29%, p < 0.05) than the fusion baselines (FS: 99.00%, MB: 99.08%, and MC: 98.92%) and a significantly higher Dice score (63.85%) than recent PET-CT tumor segmentation methods. Sebastian Jinu et al. (2022) [97] proposed a technique for fusing PET-MRI images using the YUV color space and the wavelet transform, with the aim of integrating complementary information from both modalities for enhanced diagnosis. Their comparative analysis suggests that the Dmey wavelet at decomposition level 3 with the maximum fusion rule yields the best results for brain tumor detection, and their quality assessment indicates promising outcomes for medical image fusion. A summary of the fusion network-based models is presented in Table 7, which also includes the additional articles [98–100].

Table 7
Fusion-based network architecture models for tumor detection/classification

Ref	Anatomic target	Modality	Sample division	Cross valid.	DL Architecture	Cost function	Optimizer	Performance	Pre/Post	Augmentation	Data/Code
Lei Bi et al. (2021)	Lung	PET-CT	Total = 356 Train = 247 Test = 109	✓	RFN-18, RFN-50, RFN-101, RMD, RMU backed with ResNet, DenseNet and 3D-UNet backbones	Pixel-wise cross-entropy	Adam	RFN to 3D-UNet >5% in DSC	✓	✓	✓
A. Kumar et al. (2019)	Lungs, mediastinum and brain	PET-CT	Total = 50 Train = 40 Test = 10	✓	2D CNN skip con., fusion maps. Other fusions FSs, MB and MC	Categorical cross-entropy	Mini-batch, SGD	99.89% on tumor	✓	✓	✓
Dakai Jin et al. (2019)	Esophageal	PET-CT, PET-RTCT	Total = 110 Train = 80 Test = 30	✓	Progressive semantically nested network (PSNN)	Dice Loss	Adam	Imp. DSC from 0.654 to 0.764	×	✓	×
Meidi Chen et al. (2022)	Parathyroid	SPECT-CT	Total = 40 Train = 32 Test = 8	✓	CNN, 3D fusion	Cross-entropy and smooth-L1	SGD	DSC = 0.822	✓	×	✓
Huai Chen et al. (2020)	Nasopharyngeal	MRI (T1, CET1 and T2)	Total = 149 Train = 112 Test = 37	✓	multi-modality MRI fusion network++multi-MLP+stdPool+self-transfer,	Dice loss	Adam	meanDSC = 72.38	✓	✓	✓
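
The maximum fusion rule mentioned for wavelet-based PET-MRI fusion can be sketched in a few lines. The sketch assumes one common reading of the rule, namely that at each position the coefficient with the larger absolute value is kept; the cited study may define it differently. The wavelet decomposition and reconstruction steps are omitted, and the nested lists `pet_coeffs` and `mri_coeffs` are hypothetical stand-ins for one sub-band of each modality.

```python
def max_fusion(coeffs_a, coeffs_b):
    """Fuse two equally sized 2D coefficient arrays by keeping, per position,
    the coefficient with the larger magnitude (ties keep the first input)."""
    same_rows = len(coeffs_a) == len(coeffs_b)
    if not same_rows or any(len(ra) != len(rb) for ra, rb in zip(coeffs_a, coeffs_b)):
        raise ValueError("inputs must have the same shape")
    return [
        [a if abs(a) >= abs(b) else b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(coeffs_a, coeffs_b)
    ]


# Hypothetical 2x2 sub-band coefficients from each modality.
pet_coeffs = [[0.2, -0.9], [0.0, 0.4]]
mri_coeffs = [[-0.5, 0.3], [0.1, 0.4]]
fused = max_fusion(pet_coeffs, mri_coeffs)
```

Selecting by magnitude rather than by signed value preserves strong edges of either sign, which is why this variant is often preferred for detail sub-bands.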

5.4 Attention-based models

In medical imaging, attention-based models have been studied extensively as a way to focus selectively on the most relevant features in images acquired from different modalities. These models use attention mechanisms to direct their focus toward the regions of an image that carry the most information for a given task. By identifying and prioritizing these regions, attention-based models have improved the performance and precision of medical image analysis and interpretation [101].
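
A minimal sketch of this weighting step is given below. It assumes dot-product scoring against a query vector followed by a softmax, which is one common attention formulation and not necessarily the one used in the cited models; `regions` and `query` are hypothetical toy values.

```python
import math


def softmax(scores):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(region_features, query):
    """Score each region against the query, normalize the scores into attention
    weights, and return the weights with the attention-weighted feature sum."""
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in region_features]
    weights = softmax(scores)
    context = [
        sum(w * feat[i] for w, feat in zip(weights, region_features))
        for i in range(len(query))
    ]
    return weights, context


# Three image regions with 2-D features; the second one matches the query best.
regions = [[0.1, 0.0], [2.0, 1.0], [0.2, 0.1]]
query = [1.0, 1.0]
weights, context = attend(regions, query)
```

The weights sum to one, so the output is a convex combination of the region features that is dominated by the regions most relevant to the query.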

M. Fallahpoor et al. (2023) [102] reviewed the use of AI and DL on PET-CT images in oncology, neurology, cardiology, and other emerging medical fields. Their review underscores the power of DL in PET-CT imaging, with successful applications in lesion detection, tumor segmentation, and disease classification, and gives special attention to attention-based models. Xiao Yang et al. (2022) [103] presented a multi-modality relation attention network with consistency regularization (MMRACR-net) for breast tumor classification. The network includes a multi-modality relation attention module (MMRAM) and a classification consistency module (CCM). The MMRAM explores the correlation between the two modalities to learn attentive features for each. The CCM enforces classification consistency by reducing the difference between the classifications obtained from the diffusion-weighted imaging (DWI) and apparent diffusion coefficient (ADC) image modalities. A summary of the attention-based models is presented in Table 8, which also includes the additional articles [104, 105].

Table 8
Attention-Based Networks architecture models for tumor detection/classification

Ref	Anatomic target	Modality	Sample division	Cross valid.	DL Architecture	Cost function	Optimizer	Performance	Pre/Post	Augmentation	Data/Code
R. Hussein et al. (2022)	Brain	PET-MRI	Total = 126	✓	multimodal enc.-decoder attention guided CNN	Voxel-wise reconstruction loss+Perceptual loss	Nesterov Adam	SSIM = 0.924	✓	✓	×
			Train = 105
			Test = 21
L. Chen et al. (2022)	Lung	PET-CT	Total = 250	✓	3-D CNN (CenterNet), Gau. kernel, DLA-34	Heatmap, Offset and Object Loss	Adam	Sensitivity = 0.96	✓	×	×
			Train = 5F
			Test = 5F
Xiao Yang et al. (2022)	Breast	DWI, ADC –MRI	Total = 145	✓	MMRACR-net, MMRAM and CCM	Classif. Cons.+cross-entropy	Adam	Acc = 86.7%	×	✓	×
			Train = 115
			Test = 30

5.5 Generative adversarial networks (GANs)

GANs are commonly employed in image-to-image translation, where the objective is to convert images between distinct modalities. A basic GAN comprises two subnetworks, a generator and a discriminator, which are trained together. The generator drives the translation by creating an image that closely resembles the target modality, while the discriminator is trained to distinguish the generated image from a genuine one. This adversarial interplay enables GANs to produce visually coherent and contextually accurate translations across modalities.
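
The opposing objectives of the two subnetworks can be written down compactly. The sketch below assumes the standard binary cross-entropy formulation with the non-saturating generator loss; the studies summarized in this section may use other adversarial losses. The inputs are discriminator output probabilities, and the function names are illustrative.

```python
import math


def bce(prob, label):
    """Binary cross-entropy for one prediction; prob is clipped to avoid log(0)."""
    eps = 1e-12
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def discriminator_loss(d_real, d_fake):
    """d_real / d_fake: discriminator probability of 'real' for a genuine
    and a generated image. The discriminator wants d_real -> 1, d_fake -> 0."""
    return bce(d_real, 1) + bce(d_fake, 0)


def generator_loss(d_fake):
    """Non-saturating generator loss: the generator wants d_fake -> 1."""
    return bce(d_fake, 1)


# A confident, correct discriminator has a low loss ...
confident = discriminator_loss(d_real=0.9, d_fake=0.1)
# ... while one that cannot tell real from generated sits at 2 * ln(2).
undecided = discriminator_loss(d_real=0.5, d_fake=0.5)
```

Training alternates between lowering `discriminator_loss` with the generator fixed and lowering `generator_loss` with the discriminator fixed, which is the adversarial interplay described above.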

In MMI analysis, GANs have become a powerful and influential technology. They make it possible to synthesize realistic and coherent multimodal images: the generator is trained to create synthetic samples that can almost be mistaken for real images, while the discriminator is trained to distinguish genuine from synthetic samples. GANs have many potential applications in multimodal analysis. One of the main ones is multimodal image synthesis, in which missing modalities are generated to give a more complete view of a patient’s status and so support a more thorough assessment of the patient’s condition. Another is domain adaptation: GANs can learn domain-invariant representations that help transfer knowledge from one modality to another or from a source domain to a target domain. This is particularly helpful when data differ across imaging modalities, scanners, or institutions, or when annotated data are scarce. Overcoming these obstacles and building links with other disciplines will help advance the field of MMI analysis [106–108].

A. Liebgott et al. (2019) [109] used a 2D conditional GAN on PET-CT images together with three preprocessing techniques: contrast-limited adaptive histogram equalization (CLAHE), histogram matching (HM), and Gaussian blurring (GB). They achieved an average mean square error (MSE) of 0.7107. A summary of the GAN-based models is presented in Table 9, which also includes the additional articles [51, 53, 110–112].

Table 9
GANs architecture models for tumor detection/classification

Ref	Anatomic target	Modality	Sample division	Cross valid.	DL Architecture	Cost function	Optimizer	Performance	Pre/Post	Augmentation	Data/Code
A. Liebgott et al. (2019)	Brain	PET-CT	Total = 202 Train = 173 Test = 29	×	2D Cond. GAN, CLAHE+HM+GB	Adversarial loss	Adam	MSE = 0.7107	✓	✓	×
Z. Huang et al. (2022)	Nasopharyngeal	PET-CT	Total = 1299 Train = 1040 Test = 259	✓	GAN+UNet, TG-Net	Dice loss	expectation maximization	DSC = 0.9135	×	✓	✓
K. Cao et al. (2020)	Skin and Bones	PET-CT	Total = 100 Train = 80 Test = 20	✓	GAN	Cross entropy loss	Adam	MSE = 37.735	✓	✓	✓
A.B. Choen et al. (2019)	Liver	PET-CT	Total = 60 Train = 37 Test = 23	×	FCN and GAN	Std. uptake value (SUV)+Cr. entropy	Adam	TPR = 94.6 %	×	✓	×
Y. Li et al. (2023)	Lung	PET-CT	Total = 23 Train = 18 Test = 5	✓	GAN	Adversarial loss	Adam	Improved SNR = 10dB	×	×
K.T. Oh et al. (2021)	Brain white matter lesions	PET-CT	Total = 50 Train = 35 Test = 15	✓	GAN	GAN loss+distance map	Minibatch	DSC = 0.751	✓	✓	×

6 Results comparison and discussion

Various studies have demonstrated the effectiveness of DL-based approaches in the analysis of MMI. They report that the precision of disease detection, segmentation, and classification has improved across diverse imaging modalities. Comparisons with conventional techniques frequently highlight the superior performance of DL and its potential to support more comprehensive and personalized patient care despite the intricacies and disparities of MMI.

Our review of the literature shows that models such as U-Net, ResNet, DenseNet, and their variants are widely used in medical imaging for segmentation, detection, and classification. Nonetheless, the effectiveness of these models depends on the specific task, the dataset used, and numerous other factors.

According to our findings, 2D CNN- and YOLO-based models perform well on PET-CT data sets, whereas Siamese networks achieve good scores on PET-MRI and CT-MRI data sets. Fusion- and attention-based models are widely employed in PET-CT, SPECT-CT, and PET-MRI multimodality imaging analysis, with superior results for PET-CT and SPECT-CT. GAN-based models are mostly employed for the analysis of PET-CT data sets.

6.1 Most frequently used modalities and DL architecture in MMI analysis

According to our findings, PET-CT is one of the most widely used multimodality imaging techniques in healthcare for tumor detection, segmentation, and classification, while fusion- and GAN-based DL architectures are the most frequently used for multimodality imaging analysis. Table 10 shows the distribution of modalities and DL architectures across the selected articles published from 2019 to 2023.

Table 10
Distribution of selected articles on multimodality imaging and DL architectures (2019–2023)

Modality		Articles published
		2019	2020	2021	2022	2023
	PET-CT	5	3	6	8	4
	PET-MRI	2	2	1	4	–
	SPECT-CT	1	–	–	2	–
	CT-MRI	1	1	–	1	–
DL Architecture	CNN	2	–	2	1	2
	YOLO	–	–	–	–	1
	Siamese networks	–	–	4	1
	Fusion based models	3	3	4	4	2
	Attention based models	–	–	–	3	1
	GANs	3	1	2	1	1
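
The ranking stated in this subsection can be checked by summing the rows of Table 10. The short script below does this, treating each dash as zero. The Siamese row in the table lists only four cells, so its alignment with the years is unknown; only its total is used, which is an assumption that the missing cell is a dash.

```python
# Per-year counts for 2019-2023, copied from Table 10 (dash = 0).
MODALITY_COUNTS = {
    "PET-CT": [5, 3, 6, 8, 4],
    "PET-MRI": [2, 2, 1, 4, 0],
    "SPECT-CT": [1, 0, 0, 2, 0],
    "CT-MRI": [1, 1, 0, 1, 0],
}

ARCHITECTURE_COUNTS = {
    "CNN": [2, 0, 2, 1, 2],
    "YOLO": [0, 0, 0, 0, 1],
    # Only four cells are given in the table; year alignment is unknown.
    "Siamese networks": [0, 0, 4, 1],
    "Fusion based models": [3, 3, 4, 4, 2],
    "Attention based models": [0, 0, 0, 3, 1],
    "GANs": [3, 1, 2, 1, 1],
}


def ranked_totals(counts):
    """Return (name, total) pairs sorted from most to least articles."""
    totals = {name: sum(per_year) for name, per_year in counts.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


modality_ranking = ranked_totals(MODALITY_COUNTS)
architecture_ranking = ranked_totals(ARCHITECTURE_COUNTS)
```

With these figures PET-CT accounts for 26 of the modality entries, and fusion-based models (16) and GANs (8) lead the architecture list, consistent with the statement above.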

6.2 Accuracy improvement

Numerous studies have documented substantial improvements in the accuracy of multimodal image analysis with DL compared with traditional approaches. For example, Tohidul Islam et al. (2019) focused on brain tumor segmentation, where multimodal DL achieved an accuracy of 98%, surpassing traditional techniques [113]. Further results from that study are presented in Table 11.

Table 11
Tumor segmentation performance of DL-based models from various authors (mean and standard deviation over five-fold cross-validation). The best results for each metric are shown in bold

Network architecture	Global accuracy	Mean accuracy	Mean IoU	Weighted IoU	Mean BF
Guo et al. [45]	0.9814±0.008	0.9531±0.007	0.8305±0.006	0.9654±0.008	0.8824±0.007
Wang et al. [182]	0.9825±0.008	0.9560±0.007	0.8384±0.006	0.9688±0.007	0.8873±0.007
Zhou et al. [183]	0.9753±0.008	0.9557±0.008	0.8305±0.007	0.9684±0.008	0.8855±0.008
Choudhury et al. [184]	0.9810±0.008	0.9515±0.007	0.8375±0.006	0.9650±0.007	0.8955±0.007
Sun et al. [185]	0.9780±0.010	0.9435±0.009	0.8190±0.008	0.9650±0.009	0.8385±0.008
Tohidul Islam et al. [113]	0.9849±0.009	0.9579±0.008	0.8410±0.009	0.9706±0.010	0.8986±0.009
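
Four of the five columns in Table 11 can be derived from a pixel-level confusion matrix. The sketch below uses the usual definitions of global accuracy, mean (per-class) accuracy, mean IoU, and IoU weighted by class size; it is an assumption that the cited studies used exactly these definitions. The boundary F1 (BF) score is omitted because it requires contour matching, and the toy matrix is hypothetical.

```python
def segmentation_scores(confusion):
    """confusion[i][j] = number of pixels of true class i predicted as class j.
    Assumes every class occurs at least once in the reference labels."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    true_pos = [confusion[i][i] for i in range(n)]
    actual = [sum(confusion[i]) for i in range(n)]  # pixels per true class
    predicted = [sum(confusion[i][j] for i in range(n)) for j in range(n)]

    class_accuracy = [tp / a for tp, a in zip(true_pos, actual)]
    # IoU = TP / (TP + FP + FN) = TP / (actual + predicted - TP)
    iou = [tp / (a + p - tp) for tp, a, p in zip(true_pos, actual, predicted)]

    return {
        "global_accuracy": sum(true_pos) / total,
        "mean_accuracy": sum(class_accuracy) / n,
        "mean_iou": sum(iou) / n,
        "weighted_iou": sum(a * v for a, v in zip(actual, iou)) / total,
    }


# Toy two-class example: background (100 pixels) and tumor (50 pixels).
toy_confusion = [[90, 10], [5, 45]]
toy_scores = segmentation_scores(toy_confusion)
```

Weighted IoU is dominated by the largest class, which is why it sits close to global accuracy in Table 11, whereas mean IoU gives small tumor regions the same weight as the background.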

6.3 Intramodality fusion

DL methods that integrate data from several imaging modalities typically outperform single-modality methods in performance and accuracy. Guo et al. (2018) used DL to identify soft tissue sarcomas in MMI comprising MRI, CT, and PET. To improve segmentation, the researchers proposed an image fusion architecture, depicted in Fig. 9. Despite potential challenges with robustness, the authors found that the best results were achieved by fusing information at the feature level [77]. They also demonstrated the efficacy of their approach with sample label maps generated by a type-I fusion network that incorporated PET, CT, and T2 images (Fig. 10). The visualization shows that the CNN model trained on the combined data predicted the contour of the tumor area with high precision.

Fig. 9

Illustration of the structure for (a) type-i fusion networks, (b) type-ii fusion network and (c) type-iii fusion network. The yellow arrows indicate the fusion location [45].

Fig. 10

Contour line of the ground truth annotation (yellow line) and labelmap (red line) overlaid on the T2-weighted MR image from one randomly selected subject [77].
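
The difference between fusing at the feature level and fusing at a later stage can be illustrated with toy linear classifiers. This is a sketch under strong simplifying assumptions: a logistic unit stands in for each network, the feature vectors and weights are hypothetical, and no mapping to the type-I, type-II, and type-III networks of Fig. 9 is implied.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def linear(features, weights, bias=0.0):
    return sum(w * f for w, f in zip(weights, features)) + bias


def feature_level_fusion(pet_features, ct_features, weights):
    """Concatenate the per-modality features, then apply ONE classifier,
    so the classifier can weigh PET and CT evidence jointly."""
    fused = pet_features + ct_features  # list concatenation
    return sigmoid(linear(fused, weights))


def decision_level_fusion(pet_features, ct_features, pet_weights, ct_weights):
    """Classify each modality separately, then average the two probabilities."""
    p_pet = sigmoid(linear(pet_features, pet_weights))
    p_ct = sigmoid(linear(ct_features, ct_weights))
    return 0.5 * (p_pet + p_ct)


# Hypothetical per-voxel features from each modality.
pet_voxel = [0.8, 0.1]
ct_voxel = [0.3, 0.6]
p_feature = feature_level_fusion(pet_voxel, ct_voxel, [1.0, -0.5, 0.7, 0.2])
p_decision = decision_level_fusion(pet_voxel, ct_voxel, [1.0, -0.5], [0.7, 0.2])
```

Feature-level fusion lets a single model learn interactions between modalities, whereas decision-level fusion keeps the modalities independent until the final step, which can be more robust when one modality is missing or corrupted.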

According to Nahed Tawfik et al. (2021) [114], intramodality fusion improves performance in multimodality imaging. Their survey highlights the value of combining images from different imaging modalities to enhance image quality and clinical applicability, and discusses how medical image fusion captures complementary information from modalities such as MRI, PET, and CT to address their individual limitations. Although existing fusion methods have shown promising results, the authors emphasize the need for further advances to tackle emerging challenges. The survey is a useful resource for researchers developing more effective fusion techniques for medical imaging applications.

Manoj Diwakar et al. [115] introduced a method for multi-modality medical image fusion based on shearlet-domain processing. The input images are decomposed with the non-subsampled shearlet transform (NSST); the base and detail layers are fused using a local extrema (LE) approach and a co-occurrence filter (CoF), and the high-frequency components are fused using the sum-modified Laplacian (SML) to preserve edges. Subjective and objective evaluations on multi-modal medical image datasets show that the proposed method preserves edges better than existing techniques.

Niharika S. D'Souza et al. (2024) [116] proposed a multiplexed-graph approach that uses a graph neural network to enable task-informed reasoning and effectively fuse information from multiple modalities. Evaluation on benchmark datasets and on clinical data for tuberculosis treatment outcome prediction and autism spectrum disorder classification showed robust performance improvements over state-of-the-art methods.

Accurate breast cancer prediction and prognosis are very important for treatment planning and for patients’ quality of life. Integrating multi-modal data such as genomics and pathology images enhances predictive accuracy. Existing approaches face two challenges: the Kronecker product technique is computationally expensive, and methods often overlook modality-specific relations. To address these challenges, Honglei Liu et al. (2024) [117] proposed an attention-based multi-modal network that efficiently captures both intra-modality and inter-modality relations without high-dimensional features. Their method outperforms existing approaches in breast cancer prognosis prediction.

6.4 Segmentation and localization

DL has proven highly effective in tasks such as segmentation and lesion localization. Guo Zhe et al. (2019) set out to accurately delineate the lesion contour of soft tissue sarcoma, a significant challenge in medical imaging. They found that DL-based segmentation surpasses other existing methods, particularly when multi-modal images are used and fusion is integrated into the network [45]. Fig. 11 compares the segmentation results of single-modality and multimodality networks, and Fig. 12 summarizes the Dice coefficients of the three fusion network types, with single-modality networks shown for reference. These results underline the benefits of DL-based segmentation for lesion contouring and localization. Using multi-modal images together with fusion techniques in DL networks is an effective way to improve the precision of lesion segmentation, and thereby diagnosis and treatment planning for patients with soft tissue sarcoma. The study underscores the potential of DL to improve the clinical prognosis of sarcoma patients. Overall, DL-based segmentation has emerged as a valuable tool in medical imaging and a promising avenue for further research and development.

Fig. 11

Contour line of the ground truth annotation (red line) and segmentation result (yellow line). (a) Single-modality network on T2. (b) Multimodality network on T2 + PET (Type-I). (c) Multimodalities network on T2 + PET+CT. (d)–(f) 3-D surface visualization of the segmentation results in (a)–(c) [45].

Fig. 12

Box chart for the statistics (median, first/third quartile and the min/max) of the DICE coefficient across 50 subjects. Red box stands for network train and test on Type-I network, blue box stands for Type-II network, and green stands for Type-III. Performances of single-modality network are shown as gray boxes to the left for reference [45].

6.5 Choice of deep learning model and multimodality selection

The choice among PET-CT, PET-MRI, and SPECT-CT for medical image analysis depends on the specific clinical scenario, the information required, and the inherent strengths and limitations of each modality. Each modality has distinct advantages and is typically chosen according to the clinical question at hand.

PET-CT: Combines functional and anatomical information in a single scan and is excellent for cancer imaging, as it provides metabolic information (PET) along with detailed anatomical localization (CT). It is widely used for cancer staging, treatment planning, and monitoring. Its shortcomings are high radiation exposure due to the CT component and limited soft tissue contrast compared with MRI.

PET-MRI: Provides functional information (PET) and excellent soft tissue contrast (MRI), with no ionizing radiation exposure from the MRI component. It is well suited to imaging soft tissues and neurological conditions. Its shortcomings are longer scan times than PET-CT, limited availability, and higher cost.

SPECT-CT: Combines functional information (SPECT) with anatomical localization (CT) and is widely used in nuclear medicine for applications such as bone scans and myocardial perfusion imaging, with a lower radiation dose than PET-CT. Its limitations are lower spatial resolution than PET and limited soft tissue imaging compared with MRI.

Selection criteria:

Cancer imaging: For comprehensive cancer imaging, especially in staging and treatment monitoring, PET-CT is often preferred because it combines metabolic and anatomical information.

Neurological imaging: PET-MRI is valuable for neurological conditions where soft tissue contrast is crucial. The MRI component adds no ionizing radiation, and the combination provides detailed information about brain structures and functions.

Bone scans: SPECT-CT is commonly used for bone scans, providing functional information about bone metabolism along with precise anatomical localization.

Soft tissue evaluation: Where soft tissue evaluation is a priority and ionizing radiation is a concern, PET-MRI may be preferred.

The final choice depends on the clinical indication, the information required, and the balance between the advantages and disadvantages of each modality. Technological progress and ongoing research may shift preferences toward one modality over another in particular clinical scenarios. Healthcare professionals should carefully consider the requirements of each patient when selecting the most suitable imaging modality.

The choice of a DL architecture for detecting and classifying tumors in multimodal imaging depends on several factors, such as the characteristics of the data, the complexity of the task, and the available resources. Different architectures have shown success in different scenarios. A brief overview of each type follows:

Convolutional Neural Networks: CNNs are well-suited for image-based tasks and have been highly successful in tasks such as image classification and segmentation. They automatically learn hierarchical features from the input data, making them effective for capturing patterns in medical images. CNNs can be applied to each modality individually or to fused multimodal images for tumor detection and classification.

Siamese Networks: Siamese Networks are designed for tasks like image similarity and dissimilarity. They can be used for comparing and contrasting different regions within the same or across different modalities. Siamese Networks can be beneficial when comparing corresponding regions in multimodal images to identify tumor characteristics.

Fusion-based Networks: Fusion-based networks combine information from multiple modalities to enhance overall performance. They leverage complementary information from different modalities, improving the model’s ability to detect and classify tumors. When dealing with multimodal imaging, fusion-based networks can effectively combine PET, CT, MRI, or other modalities to provide a more comprehensive analysis.

Attention-based Models: Attention mechanisms allow the model to focus on specific regions of interest, improving interpretability and performance. They are useful for emphasizing relevant information in multimodal images for tumor detection. Attention-based models can be applied to highlight important features in each modality or to guide the fusion process in multimodal tumor classification.

Generative Adversarial Networks: GANs can be used for generating synthetic data, potentially augmenting limited datasets in medical imaging. They may aid in enhancing the quality of images, which can be valuable for training more robust tumor detection models. GANs can be used for data augmentation or for generating realistic images that simulate various tumor characteristics.

Choosing the Best Model: The choice depends on the nature of the imaging data, the availability of labeled samples, and the specific requirements of the task. For multimodal imaging, fusion-based networks and attention-based models are often beneficial to leverage the complementary information from different modalities. Siamese Networks might be useful for tasks involving direct comparisons between regions or structures in different modalities. Data availability, computational resources, and interpretability considerations should also guide the selection. Ultimately, the “best” model depends on the specific context and requirements of the tumor detection and classification task in multimodality imaging. Experimentation and comparative evaluations on the specific dataset of interest are often necessary to determine which model performs optimally.
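
As a rough illustration of how attention can guide the fusion of modality-specific outputs, the sketch below weights per-modality tumor probabilities with softmax-normalized relevance scores. It is a minimal sketch, not a method taken from any of the reviewed studies: the function names, the probabilities, and the relevance scores are all hypothetical, and in a trained network the scores would be learned rather than fixed by hand.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fusion(modality_probs, relevance_scores):
    """Fuse per-modality tumor probabilities with attention weights.

    modality_probs: modality name -> tumor probability from that branch.
    relevance_scores: modality name -> unnormalized relevance score
    (hypothetical values; a real model would learn them).
    """
    names = sorted(modality_probs)
    weights = softmax([relevance_scores[n] for n in names])
    fused = sum(w * modality_probs[n] for w, n in zip(weights, names))
    return fused, dict(zip(names, weights))


# Illustrative numbers only.
probs = {"PET": 0.90, "CT": 0.60, "MRI": 0.75}
scores = {"PET": 2.0, "CT": 0.5, "MRI": 1.0}
fused_prob, attn = attention_fusion(probs, scores)
print(f"fused tumor probability: {fused_prob:.3f}")
print({name: round(w, 3) for name, w in attn.items()})
```

Because the weights sum to one, the fused probability always lies between the lowest and highest single-modality estimate; with equal scores the scheme reduces to plain averaging.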

6.6 Clinical validity

The clinical effectiveness of DL models is of great significance because it determines their applicability and dependability. A study by Yinhao Wu et al. (2022) on the classification of skin cancer demonstrated the potential and practicality of DL in clinical practice. Nonetheless, further validation is needed to establish the robustness and reproducibility of these models, because their performance can vary across patient cohorts; a comprehensive evaluation and validation process is therefore required to ensure clinical utility [118]. Recent interest in AI, machine learning, and DL in cardiovascular (CV) medicine promises personalized care through advanced CV imaging applications [119]. DL models can analyze several imaging modalities simultaneously, improving tumor diagnostic accuracy by exploiting the additional data that each modality provides. In oncology, for example, combining MRI and PET images gives a more thorough understanding of tumor characteristics, and DL algorithms can accurately delineate tumor boundaries across MRI, CT, and PET images, facilitating precise radiotherapy planning [120–124].

A study by Ahmed Hosny et al. (2022) [125] presents a comprehensive clinical validation strategy for DL models that segment primary non-small-cell lung cancer (NSCLC) tumors and lymph nodes from PET-CT images. It includes interobserver and intraobserver benchmarking, primary validation on multiple datasets, functional validation, and end-user testing. Results demonstrate improved segmentation accuracy over interobserver benchmarks, consistency with intraobserver benchmarks, and equivalent radiation dose coverage compared to expert segmentations. Despite variations across datasets, the models show potential clinical utility by reducing segmentation time and interobserver variability. This study highlights the importance of evaluating clinical utility beyond geometric segmentation metrics.
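
To make the idea of an interobserver benchmark concrete, the sketch below computes the Dice similarity coefficient, a common geometric segmentation metric, between two expert masks and between a model mask and each expert. The masks are toy, flattened binary lists invented for illustration; they are not data from the study, and the study's actual metrics and protocol may differ.

```python
def dice(mask_a, mask_b):
    """Dice similarity coefficient between two binary masks (flat 0/1 lists)."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(a and b for a, b in zip(mask_a, mask_b))
    size_sum = sum(mask_a) + sum(mask_b)
    # Two empty masks agree perfectly by convention.
    return 1.0 if size_sum == 0 else 2.0 * intersection / size_sum


# Hypothetical flattened tumor masks (1 = tumor voxel).
expert_1 = [0, 1, 1, 1, 1, 0, 0, 0]
expert_2 = [0, 0, 1, 1, 1, 1, 0, 0]
model = [0, 1, 1, 1, 1, 1, 0, 0]

interobserver = dice(expert_1, expert_2)
model_vs_experts = (dice(model, expert_1) + dice(model, expert_2)) / 2.0
print(f"interobserver Dice:     {interobserver:.3f}")
print(f"model vs. experts Dice: {model_vs_experts:.3f}")
```

A model whose agreement with the experts is at least as high as the experts' agreement with each other meets the interobserver benchmark on this metric, although, as the study stresses, geometric agreement alone does not establish clinical utility.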

A study by Zhaoshuo Diao et al. (2021) [126] focuses on enhancing the clinical validity of tumor delineation in PET-CT images. Existing segmentation methods often fail to account for uncertainty and consume significant computing resources. To address this, the proposed evidence fusion network (EFNet) reduces uncertainty by producing separate PET and CT outputs and combining them through evidence fusion. EFNet simplifies the network architecture, achieving improved segmentation results and better efficiency than existing methods. Experimental validation on soft-tissue sarcoma and lymphoma datasets demonstrates significant improvements in segmentation accuracy, enhancing the clinical utility of PET-CT imaging for tumor delineation.
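
The exact fusion rule used by EFNet is not described here, so the sketch below substitutes Dempster's rule of combination, a classical way of fusing evidence from two sources, purely as an illustration of how combining PET and CT outputs can reduce uncertainty. The mass values and the function name are hypothetical assumptions.

```python
def dempster_combine(m1, m2):
    """Combine two mass functions over {tumor, background} with Dempster's rule.

    Each mass function assigns belief to "tumor", "background", and
    "unknown" (the whole frame), and its three values sum to 1.
    """
    # Mass assigned to contradictory conclusions by the two sources.
    conflict = m1["tumor"] * m2["background"] + m1["background"] * m2["tumor"]
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    norm = 1.0 - conflict
    fused = {}
    for label in ("tumor", "background"):
        fused[label] = (
            m1[label] * m2[label]
            + m1[label] * m2["unknown"]
            + m1["unknown"] * m2[label]
        ) / norm
    fused["unknown"] = m1["unknown"] * m2["unknown"] / norm
    return fused


# Hypothetical per-voxel evidence from the PET and CT branches.
pet = {"tumor": 0.7, "background": 0.1, "unknown": 0.2}
ct = {"tumor": 0.5, "background": 0.2, "unknown": 0.3}
fused = dempster_combine(pet, ct)
print({label: round(mass, 3) for label, mass in fused.items()})
```

In this toy case the fused mass on "unknown" is lower than in either input, which is the sense in which fusing separate modality outputs can reduce uncertainty.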

DL algorithms can analyze large volumes of multimodal images rapidly, potentially reducing the time required for image interpretation. This can improve workflow efficiency in clinical practice, allowing radiologists to focus on complex cases or patient care. Amirhossein Sanaat et al. (2021) [127] report that DL techniques, specifically CycleGAN, successfully synthesize full-dose PET images from low-dose scans, helping to reduce radiation exposure and improve patient comfort. The study demonstrates comparable diagnostic quality and tumor detectability between synthesized and full-dose images, validating the efficacy of DL in PET-CT.

Bonardel Gerald et al. (2022) [128] evaluate the clinical validity of a DL-based denoising algorithm, SubtlePET, applied to PET/CT images acquired at reduced count statistics compared to regular acquisition. Phantom and patient images from three different PET-CT scanners were assessed. Results indicate that SubtlePET effectively denoised images acquired at reduced count statistics while maintaining lesion detectability and SUVmax values. The study suggests that SubtlePET could be applied in clinical practice for PET-CT acquisitions with reduced injected doses, in line with European recommended injected dose guidelines, without compromising diagnostic confidence.
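
For readers unfamiliar with SUVmax, the sketch below shows how a body-weight standardized uptake value is computed and how the maximum over a lesion might be compared between a reference and a denoised reconstruction. All voxel values, the dose, and the weight are hypothetical; the sketch assumes a decay-corrected injected dose and a tissue density of 1 g/mL, and it does not reproduce the study's measurements.

```python
def suv(activity_kbq_per_ml, injected_dose_mbq, body_weight_kg):
    """Body-weight SUV: tissue activity divided by (injected dose / body weight).

    Assumes the dose is already decay-corrected and 1 mL of tissue weighs 1 g.
    """
    dose_kbq = injected_dose_mbq * 1000.0
    weight_g = body_weight_kg * 1000.0
    return activity_kbq_per_ml / (dose_kbq / weight_g)


def suv_max(lesion_voxels, injected_dose_mbq, body_weight_kg):
    """Highest SUV among the voxels of a lesion."""
    return max(suv(v, injected_dose_mbq, body_weight_kg) for v in lesion_voxels)


# Hypothetical lesion voxel activities (kBq/mL) for the same patient.
lesion_reference = [12.0, 25.0, 18.5]
lesion_denoised = [12.4, 24.0, 18.9]

ref = suv_max(lesion_reference, injected_dose_mbq=350.0, body_weight_kg=70.0)
den = suv_max(lesion_denoised, injected_dose_mbq=350.0, body_weight_kg=70.0)
print(f"SUVmax reference: {ref:.2f}, denoised: {den:.2f}")
print(f"relative difference: {100.0 * (den - ref) / ref:.1f}%")
```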

A study by Yngve Mardal et al. (2021) [129] evaluates the clinical validity of convolutional neural networks (CNNs) for automatically delineating gross tumor volumes (GTV) in head and neck cancer (HNC) patients using FDG-PET/CT images. By comparing CNN-generated delineations to manual ones by specialists and introducing new structure-based evaluation metrics, the study demonstrates the accuracy and reliability of CNN models in identifying GTVs. Importantly, the study emphasizes the significance of PET/CT imaging in improving delineation accuracy, with PET/CT-based CNN models outperforming those based solely on CT images. Overall, the findings underscore the clinical utility of CNNs in enhancing tumor delineation in HNC patients, providing valuable insights for improving treatment planning and patient outcomes.

6.7 DL network computational demands

DL achieves high accuracy but requires substantial computational and memory resources during both training and inference. To meet these demands, cloud computing has become a conventional approach, although it presents obstacles concerning data migration. To speed up DL inference, specialized hardware such as ASICs (for example, Google’s TPU and ShiDianNao) and FPGA-based DNN accelerators has emerged, delivering exceptional energy efficiency.

In 2019, Chen Jiashi and her colleagues published a review of the application, implementation, and training of DL with edge computing. The study addresses challenges in system performance, network technology, benchmarking, and privacy. Edge computing is viewed as a very promising way to meet the computational and low-latency needs of DL on edge devices. In this context, it is important to examine the techniques used to preserve privacy when large volumes of data are transferred between edge devices and the cloud. Examining these techniques offers insight into how to alleviate privacy concerns and integrate DL at the network edge, which should support wider adoption of DL in edge computing environments [130]. Table 12 gives an overview of these strategies, with examples from various scenarios.

Table 12
Overview of selected studies on privacy-preserving inference

Work	DNN model	Main ideas	Key metrics
Wang et al. [186]	MobileNets, GoogleNet, others	Introduce noise to offloaded data and subsequently train DNN on the perturbed dataset.	accuracy, energy, memory
CryptoNets [187]	5- and 9- layer NN	homomorphic encryption	latency, communication size
MiniONN [188]	CNN	Secure computation with homomorphic encryption, pairwise computation	accuracy, latency, communication size
DeepSecure [189]	custom DNN and CNN, LeNet, others	Pairwise computation	latency
Chameleon [190]	5-layer CNN	Secure computation with encrypted data collaboration	latency, communication size
GAZELLE [191]	Custom CNN	Secure computation using homomorphic encryption and two-party computation, considering their respective advantages and limitations	latency, communication size
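
The noise-based idea in the first row of Table 12 can be sketched as follows: an edge device perturbs its features before offloading them, so that the cloud never receives the exact values. This is a minimal sketch under stated assumptions, not the scheme of Wang et al.: the choice of Laplace noise, the scale, and the function name are all hypothetical.

```python
import random


def perturb_features(features, scale, seed=None):
    """Add zero-mean Laplace noise to each feature before it leaves the device.

    The difference of two independent exponential variates with rate
    1/scale follows a Laplace distribution with that scale.
    """
    rng = random.Random(seed)
    rate = 1.0 / scale
    return [x + rng.expovariate(rate) - rng.expovariate(rate) for x in features]


# Hypothetical intermediate features computed on the edge device.
edge_features = [0.12, 0.87, 0.45, 0.33]
offloaded = perturb_features(edge_features, scale=0.05, seed=42)
print([round(v, 3) for v in offloaded])
```

A larger scale hides the original values more thoroughly but degrades accuracy, which is why the approach in the table also trains the DNN on perturbed data.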

6.8 Interpretable AI

Interpretability remains a concern. Some studies, like Saad Bin Ahmed et al. (2022) on explainable AI (XAI) in chest X-ray analysis, raise issues crucial for clinical acceptance. There is an overall lack of universally accepted quantitative evaluation metrics for XAI techniques, so additional research in this direction is needed [131].

6.9 Transfer learning

Transfer learning techniques, as discussed by Yi Li et al. (2023), enable knowledge transfer from one dataset to another, reducing the need for large annotated datasets. In their work, they presented a unique multimodal dataset for glaucoma diagnosis and classification (GMNNnet) and proposed a new multimodal neural network that uses transfer learning to diagnose and classify glaucoma from little training data. GMNNnet performs well in capturing complex glaucoma information from multiple modalities, offering promising results for accurate diagnosis and classification of glaucoma [132]. Parvin Razzaghi et al. (2022) proposed a multimodal deep transfer learning model for brain tumor detection that significantly outperforms comparable approaches. Their study shows that the multimodal feature encoder and multimodal domain adaptation technique successfully learn and transfer knowledge, as shown in Table 13 [133].

Table 13
Comparison of reported Dice coefficients for various methods on the figshare dataset. Base models are evaluated with and without modality knowledge and adaptation techniques: Base 1 uses ResNet50 and U-Net as backbone frameworks for classification and segmentation, and in Base 2 data is input into the network without incorporating modality-specific information

Method	Meningioma	Glioma	Pituitary
Base 1 [133]	0.8867	0.6133	0.8235
Base 2 [133]	0.8756	0.6502	0.8099
Isunuri et al., [192]	0.8243	0.6077	0.7847
SB Kumar et al., [193]	0.8997	0.6554	0.8395
Deep transfer learning (without multimodal) [133]	0.8997	0.6802	0.8452
Multimodal deep transfer learning [133]	0.9459	0.7241	0.9108

Transfer learning and domain adaptation approaches help models cope with limited data by allowing them to exploit information learned from comparable datasets or modalities. These strategies reuse what previously trained models have learned on one task or domain to improve performance on a related task or domain for which little data is available. They have become instrumental in addressing data scarcity and domain shift and in improving the generalization of DL models in MMI analysis.
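
One common form of transfer learning, feature extraction with a frozen backbone, can be sketched in a few lines: the weights learned on the source task stay fixed, and only a small new head is trained on the scarce target data. The sketch below is a toy, dependency-free illustration; the "pretrained" weights, the four target samples, and all names are hypothetical, and real systems would use a deep network and a DL framework instead.

```python
import math

# Hypothetical "pretrained" backbone weights: frozen, never updated below.
FROZEN_W = [[0.8, -0.2], [0.1, 0.9]]


def frozen_features(x):
    """Backbone from the source task: a fixed linear layer followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_W]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(samples, labels, epochs=500, lr=0.5):
    """Fit only a new logistic-regression head on top of the frozen features."""
    weights, bias = [0.0, 0.0], 0.0
    feats = [frozen_features(x) for x in samples]
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            p = sigmoid(sum(w * fi for w, fi in zip(weights, f)) + bias)
            err = p - y  # gradient of the cross-entropy loss w.r.t. the logit
            weights = [w - lr * err * fi for w, fi in zip(weights, f)]
            bias -= lr * err
    return weights, bias


def predict(x, weights, bias):
    f = frozen_features(x)
    return sigmoid(sum(w * fi for w, fi in zip(weights, f)) + bias)


# A deliberately tiny target dataset (1 = tumor, 0 = no tumor).
samples = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
labels = [1, 1, 0, 0]
head_w, head_b = train_head(samples, labels)
print([round(predict(x, head_w, head_b), 3) for x in samples])
```

Only three parameters (two head weights and a bias) are fitted to the target data, which is why this strategy can work when annotated samples are scarce.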

6.10 Benchmark datasets

Benchmark datasets are frequently used to evaluate DL models in MMI and play a pivotal role in assessing and comparing model performance. Table 14 lists commonly used datasets and their purposes. These datasets cover a wide range of medical imaging modalities and clinical applications, making them an invaluable resource for developing and evaluating DL models in multimodal imaging research. Researchers commonly depend on them to gauge the effectiveness and accuracy of algorithms for tasks including classification, segmentation, and the diagnosis of tumors and other diseases. Using these benchmark datasets, researchers can assess the strengths and weaknesses of their DL models and refine and optimize their performance.

Table 14
Benchmark datasets commonly used for evaluating deep learning models for multimodality imaging

Benchmark datasets	Purpose
The Cancer Imaging Archive (TCIA)	TCIA offers a variety of cancer-related multimodal imaging datasets, such as CT, MRI, and PET scans, for cancer detection, diagnosis, and treatment evaluation.
Medical Image Computing and Computer Assisted Intervention (MICCAI) Datasets	MICCAI regularly hosts challenges with publicly available multimodal imaging datasets, including challenges related to brain imaging, cardiac imaging, and more.
Multimodal Brain Tumor Image Segmentation Challenge (BRATS)	BRATS provides multimodal brain tumor datasets, including MRI modalities like T1, T1c, T2, and FLAIR, for tumor segmentation tasks.
PhysioNet	PhysioNet offers various physiological and biomedical datasets, including multimodal data like ECG, EEG, and imaging data for research in cardiology and neurology.
National Institutes of Health (NIH) Chest X-ray Dataset	This dataset includes chest X-ray images, associated radiology reports, and CT scans, making it useful for multimodal research in lung diseases.
Multimodal Whole Slide Imaging (WSI) Databases	Databases like the CAMELYON challenge datasets provide digitized pathology slides and additional imaging modalities, such as fluorescence and brightfield microscopy.
Functional MRI of the Brain (fMRIB) Software Library (FSL) Datasets	FSL offers datasets for functional MRI studies that include multimodal imaging data for brain function analysis.
National Library of Medicine (NLM) –Medical Image Dataset	NLM provides various medical image datasets, including multimodal data, for tasks like image segmentation, classification, and retrieval.
ImageCLEF Medical Image Retrieval Challenge	ImageCLEF offers challenges with medical image datasets, often involving multimodal content, for tasks like image retrieval and classification.
Multi-Modality Whole Heart Segmentation (MM-WHS) Challenge	This challenge focuses on multimodal cardiac imaging data for whole heart segmentation tasks, including MRI and CT.
Open Access Series of Imaging Studies (OASIS)	OASIS offers multimodal brain imaging datasets, including MRI and PET scans, for Alzheimer’s disease (AD) research.
Multimodal Spine Image Dataset (MSID)	MSID includes MRI, CT, and X-ray images for spine-related research and medical image analysis.

Using DL-based methods for MMI analysis has numerous potential advantages compared to traditional ML methods. DL models learn intricate features from medical images automatically, eliminating manual feature engineering. This characteristic has proven exceptionally beneficial in medical imaging, where images often contain substantial noise and are affected by various imaging artefacts. DL models can handle noise and variability, enhancing MMI task performance. Moreover, DL models can effectively integrate information from different imaging modalities, improving accuracy and diagnostic performance.

This article examines the evolving landscape of DL in MMI tumor analysis, highlighting the significance of DL and the progress made in this domain. By leveraging DL, healthcare workers can merge information from many imaging techniques, such as MRI, CT, PET, and others. This integration improves the ability of DL models to provide accurate tumor diagnoses. This emerging trend marks a substantial advancement in MMI analysis, offering great potential for more precise and comprehensive patient care.

7 Challenges, future directions and emerging trends

Although DL-based MMI analysis methods have potential advantages over traditional machine learning-based methods, DL also has certain limitations. First, training DL models requires substantial amounts of labelled data, which can be difficult to acquire for rare diseases or multiple patterns. Second, DL models are computationally demanding: they require high-performance computing resources such as GPUs or TPUs, which incurs costs. Third, DL models are prone to overfitting and underfitting. Finally, DL models are difficult to explain and are often perceived as “black boxes”, because the process by which they reach their predictions is hard to understand.

7.1 Main challenges in MMI analysis

MMI analysis presents challenges due to data heterogeneity, limited annotated datasets, and the need for interpretability. Integrating diverse imaging modalities while preserving their unique features is complex. Domain adaptation is required to handle variations between datasets. The lack of generalization and computational complexity are additional hurdles [62, 63, 80]. Some of the key challenges, as in Fig. 13, are as follows:

Fig. 13

Challenges in multimodal medical image analysis.

Data heterogeneity: Different imaging techniques capture different aspects of anatomy and physiology, resulting in variations in image quality, contrast, and noise. Integrating and analyzing these disparate data types requires specific algorithms and models that can manage multimodal data heterogeneity [134].

Limited availability of annotated multimodal datasets: Annotating medical images with ground truth data takes time and resources, especially for multimodal datasets. The scarcity of large-scale annotated datasets across multiple imaging modalities hinders development and evaluation of DL models for MMI analysis [135].

Interpretability and explainability: DL models are often black-box models, particularly those based on complex architectures such as DNNs and GANs. Understanding the decisions made by these models and explaining their predictions to clinicians and patients is challenging. In medical settings, interpretability and explainability are crucial for gaining the trust and acceptance of AI-driven image analysis [135, 136].

Integration of multimodal information: Effectively fusing information from several imaging modalities while preserving their distinct properties is challenging. Developing multimodal fusion algorithms that leverage complementary information from different modalities without sacrificing critical features remains a major open problem [137].

Domain adaptation: Variations in imaging protocols, hardware, and patient populations can lead to domain shifts when working with data from different sources or institutions. Adapting DL models to handle these domain shifts and maintain performance across diverse datasets is a critical challenge in MMI analysis [138].

Limited generalization: DL models trained on large datasets may not generalize well to new patients or clinical scenarios. The generalization performance of multimodal models must be carefully validated to ensure robust and reliable performance in real-world clinical settings [92].

Computational complexity: DL models for multimodal analysis can be computationally intensive, requiring significant computational resources and time for training and inference. Efficient model architectures and optimization techniques are essential to make these models practical for clinical use [139].

Addressing these challenges requires collaborative efforts among researchers, clinicians, and policymakers. Advances in data sharing, standardization of imaging protocols, and the development of explainable and interpretable AI or DL models will play a vital role in overcoming these challenges and unlocking the full potential of MMI analysis for improved patient care.

7.2 Main challenges in DL models

Despite the promising potential of DL models in multimodality imaging, several challenges exist, including the need for large and diverse datasets for training, interpretability of DL-based predictions, and integration into existing clinical workflows. Additionally, DL models may encounter difficulties in handling data heterogeneity, variability in imaging protocols, and generalization to unseen patient populations.

7.2.1 Data quality and quantity

To achieve optimal performance, DL models necessitate substantial amounts of superior-quality data. Nevertheless, medical imaging datasets frequently suffer from limitations in both size and quality, resulting in overfitting and decreased generalization of models [30, 31, 34, 39, 40–43].

7.2.2 Explainability

The complexity of DL models makes them difficult to interpret, and therefore difficult for clinicians to trust and understand. Techniques for visualizing and explaining model output are needed to overcome this problem [140].

7.2.3 Standardization and reproducibility

At present, the application of DL-based methods in medical image analysis lacks standardization, which makes it difficult to compare the outcomes of different studies and to replicate findings. There is a pressing need for robust standardization of data collection, pre-processing, and model development [141].

Clinical Translation: Despite promising outcomes in research environments, the clinical translation of DL-based medical image analysis methods remains nascent. Further investigations are needed to validate the performance of these methods in clinical settings and to address concerns about regulatory approval and integration with established clinical workflows.

7.2.4 Multimodal image registration

Multimodal image registration is a critical challenge in the application of multimodal medical imaging. Images acquired from different modalities (MRI, CT, PET, etc.) must be aligned with one another to enable accurate comparison and fusion of information. Inaccurate registration can lead to misalignment of anatomical structures or features, compromising the quality and reliability of subsequent applications such as image fusion, segmentation, or quantitative analysis. Robust and accurate multimodal image registration techniques are therefore very important for the effectiveness and reliability of multimodal medical imaging. Progress can come from advancing DL methods for various modalities, integrating biomechanical models, enabling real-time registration, quantifying uncertainty, and incorporating artificial intelligence. Robust frameworks should be developed, and methods must be validated for clinical use to address these challenges and improve accuracy in medical imaging applications [142–144].
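
Because the same tissue appears with different intensities in different modalities, multimodal registration typically relies on a similarity measure that does not assume matching intensities. Mutual information is a classical choice. The sketch below is a toy one-dimensional illustration, not a clinical method: the two intensity profiles are hypothetical, the search is restricted to integer translations, and real registration works on 3D volumes with richer transformations.

```python
import math
from collections import Counter


def mutual_information(a, b):
    """Mutual information (in nats) between two equally long discrete signals."""
    n = len(a)
    joint = Counter(zip(a, b))
    pa, pb = Counter(a), Counter(b)
    return sum(
        (c / n) * math.log((c / n) / ((pa[x] / n) * (pb[y] / n)))
        for (x, y), c in joint.items()
    )


def best_shift(fixed, moving, max_shift):
    """Exhaustive search for the integer shift that maximizes mutual information."""

    def overlap_mi(shift):
        pairs = [
            (fixed[i], moving[i + shift])
            for i in range(len(fixed))
            if 0 <= i + shift < len(moving)
        ]
        return mutual_information([p[0] for p in pairs], [p[1] for p in pairs])

    return max(range(-max_shift, max_shift + 1), key=overlap_mi)


# Hypothetical "CT-like" profile: 0 = air, 1 = soft tissue, 2 = bone.
fixed = [0, 0, 0, 1, 1, 2, 2, 2, 1, 1, 1, 0, 0, 0, 0, 0]
# Hypothetical "MRI-like" profile of the same anatomy: soft tissue is bright (3),
# bone is dark (1), and the whole profile is displaced by two samples.
moving = [0, 0, 0, 0, 0, 3, 3, 1, 1, 1, 3, 3, 3, 0, 0, 0]

print("estimated shift:", best_shift(fixed, moving, max_shift=4))
```

The search recovers the displacement even though the intensity mapping between the two profiles is not monotonic, a situation in which a simple intensity difference or correlation would be unreliable.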

7.3 Future directions and emerging trends

From the literature we presented on the role of DL in MMI analysis, we identified some emerging trends and provided some insight into future directions for research development in this field. Some of them are presented below, and we hope that this information will serve as a guide for future researchers. An overview of this guidance and emerging trends can be found in.

7.3.1 Incorporation of domain knowledge

The incorporation of domain knowledge is a critical part of MMI analysis, where medical practitioners’ experience and domain-specific information may be used to improve the accuracy and interpretability of DL models. This integration ensures that the models align with medical principles and constraints, making the analysis more clinically relevant and reliable [145]. Several ways domain knowledge can be incorporated include:

Designing Task-Specific Architectures: Domain experts can guide the design of DL architectures tailored to specific medical image analysis tasks. For example, understanding anatomical structures and disease characteristics can inform the development of specialized organ segmentation or lesion detection networks [146].

Annotating Training Data: Domain experts can annotate training data, providing accurate and detailed labels that capture clinically significant features. High-quality annotations are essential for training robust models that generalize to new data.

Feature Selection and Extraction: Medical knowledge can aid in identifying relevant features and modalities that are most informative for the task at hand. This can reduce the dimensionality of the data and enhance model efficiency.

Loss Function Design: Domain knowledge can inform the choice and formulation of loss functions, emphasizing clinically relevant metrics and penalizing errors that may have more severe consequences in medical practice.

Interpretability and Explainability: Domain knowledge can be used to develop interpretability techniques, allowing clinicians to understand the model’s decisions and gain insights into how the analysis aligns with medical expertise.

Handling Uncertainty: Medical domain knowledge can help address uncertainties in multimodal data and model predictions, common in real-world medical scenarios.

Validation and Clinical Adoption: Involving domain experts in the validation process and clinical adoption of DL models ensures that the analysis aligns with medical best practices and meets the needs of healthcare professionals [147].

DL models can become more accurate, reliable, and clinically relevant by incorporating domain knowledge into MMI analysis. Collaboration between AI researchers and medical experts is essential to bridge the gap between cutting-edge technology and real-world medical applications, leading to improved patient outcomes and enhanced healthcare practices.
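
The loss-function item in the list above can be illustrated with the Tversky loss, a generalization of the Dice loss whose two coefficients weight false positives and false negatives separately. Choosing it here is an assumption made for illustration, as are the coefficient values and the toy masks; the reviewed studies may use other losses.

```python
def tversky_loss(pred_probs, target_mask, alpha=0.3, beta=0.7, eps=1e-7):
    """Tversky loss for binary segmentation.

    alpha weights false positives and beta weights false negatives;
    alpha = beta = 0.5 gives the Dice loss. Setting beta > alpha
    penalizes missed tumor voxels more than spurious ones.
    """
    pairs = list(zip(pred_probs, target_mask))
    tp = sum(p * t for p, t in pairs)
    fp = sum(p * (1 - t) for p, t in pairs)
    fn = sum((1 - p) * t for p, t in pairs)
    index = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return 1.0 - index


# Hypothetical flattened masks (1 = tumor voxel).
target = [1, 1, 1, 1, 0, 0, 0, 0]
missed = [1, 1, 0, 0, 0, 0, 0, 0]  # under-segmentation: two false negatives
extra = [1, 1, 1, 1, 1, 1, 0, 0]   # over-segmentation: two false positives

for name, pred in (("missed", missed), ("extra", extra)):
    dice_like = tversky_loss(pred, target, alpha=0.5, beta=0.5)
    fn_heavy = tversky_loss(pred, target, alpha=0.3, beta=0.7)
    print(f"{name}: Dice loss {dice_like:.3f}, FN-weighted loss {fn_heavy:.3f}")
```

Relative to the Dice loss, the false-negative-weighted setting raises the loss of the under-segmented prediction and lowers that of the over-segmented one, encoding the clinical judgment that missing tumor tissue is the costlier error.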

7.3.2 Explainable AI

Explainable AI (XAI) is the field concerned with techniques and frameworks that provide clear and comprehensible explanations for the decisions and predictions of AI systems. In complex AI models, such as DNNs, the internal mechanisms are often opaque, making it difficult to understand how the model draws its inferences. This lack of transparency can be a significant obstacle, especially in critical sectors such as healthcare, finance, and autonomous systems, where users need confidence in the decision-making processes of AI.

XAI’s primary objective is to bridge the gap between the complexity of AI models and human understanding by providing meaningful explanations for the outcomes a model produces. This enables users, including researchers, clinicians, regulators, and end users, to see the rationale behind AI predictions and to understand more deeply how a model arrives at a specific decision. Research on XAI models is expected to increase trust in, and acceptance of, multi-modal analysis in clinical practice [45, 148].

7.3.3 Integration of clinical data

The primary objective of DL-based medical image analysis is to improve patient treatment outcomes. Integrating clinical data into medical image analysis requires fusing patient-specific information (including medical history, laboratory findings, and clinical evaluations) with data from diverse imaging techniques. This combination is critical for improving the precision and contextual understanding of medical image interpretation, allowing healthcare providers to make better-informed diagnoses, treatment plans, and prognoses. By incorporating clinical data, multi-modal analysis becomes a holistic strategy that fosters personalized and patient-centric healthcare decisions. To accomplish this objective, such analysis must be integrated with clinical decision-making processes, including treatment planning and monitoring [149].

7.3.4 Explainable and interpretable DL models

Explainable and interpretable DL models are AI algorithms that can offer reasonable explanations for their predictions and actions. The inner workings of classic DL models, such as DNNs, can be difficult to interpret, resulting in a “black-box” scenario in which it is unknown how the model arrived at a given output. Explainable and interpretable DL models seek to overcome this issue by providing human-readable explanations for their results: besides providing predictions, the model can indicate which elements or patterns in the input data influenced the decision. Such models can improve MMI analysis by offering insights into how the model integrates input from multiple imaging modalities. By determining which features or modalities contribute most to a prediction, clinicians can discover possible mistakes or biases and fine-tune the model. This transparency allows for better decision-making and higher trust in AI-driven analysis, leading to more accurate and trustworthy patient diagnoses and treatment plans [150].
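One simple way to estimate which modality matters most for a prediction is an ablation (occlusion-style) test: replace one modality at a time with a neutral baseline and record how much the model's score drops. The sketch below is a minimal illustration under stated assumptions. `toy_model` is a hypothetical stand-in for a trained multimodal network, and its weights are invented for the example. The reviewed studies may use other attribution methods.

```python
def toy_model(inputs):
    """Hypothetical stand-in for a trained multimodal network: a fixed linear score."""
    weights = {"PET": 0.7, "CT": 0.2, "MRI": 0.1}  # invented for illustration
    return sum(weights[m] * sum(v) / len(v) for m, v in inputs.items())

def modality_importance(model, inputs, baseline=0.0):
    """Return the score drop observed when each modality is replaced by a constant baseline."""
    reference = model(inputs)
    importance = {}
    for name in inputs:
        ablated = dict(inputs)  # shallow copy so the original input is untouched
        ablated[name] = [baseline] * len(inputs[name])
        importance[name] = reference - model(ablated)
    return importance

features = {"PET": [1.0, 1.0], "CT": [1.0, 1.0], "MRI": [1.0, 1.0]}
scores = modality_importance(toy_model, features)
print(max(scores, key=scores.get))
```

A large score drop suggests the model relies heavily on that modality. It does not by itself show that the reliance is clinically justified, which is where expert review comes in.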

7.3.5 Federated learning

Federated learning (FL) is a decentralized approach to model training that enables collaboration across multiple institutions without sharing patient data centrally. Each institution trains a model locally on its own data, and only model updates, rather than raw data, are shared with a central server. This approach addresses data privacy concerns, as patient data remains secure within each institution’s boundaries. FL encourages collective intelligence by incorporating knowledge from several sources, resulting in more robust and accurate models while protecting data privacy. Although FL addresses data privacy, most existing FL work focuses on unimodal data. Multimodal FL is emerging as a way to enhance system performance, as surveyed in [151].
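The server-side step described above can be sketched with federated averaging (FedAvg), a widely used aggregation rule. The review does not say which rule the surveyed systems use, so treat this as one common choice. The function name and the flat-list representation of model parameters are assumptions made for brevity.

```python
def federated_average(client_weights, client_sizes):
    """Sample-size-weighted average of client parameter vectors (FedAvg aggregation).

    client_weights: one list of floats per institution (locally trained parameters).
    client_sizes:   number of local training samples per institution.
    """
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]

# Two hospitals share only their parameters; raw images never leave the site.
global_weights = federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
print(global_weights)  # [2.5, 3.5]
```

Weighting by sample count gives larger sites more influence on the global model. Real deployments typically add secure aggregation or differential privacy on top of this step.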

7.3.6 Meta-learning

Meta-learning, sometimes called “learning to learn,” is a promising technique that teaches models how to learn from various tasks or domains, helping them adapt quickly to new tasks or data distributions with less data. In MMI analysis, meta-learning may be used to create models that generalize across multiple imaging modalities and medical situations [152]. By learning common representations or features from varied tasks, the model becomes more resilient and can adapt to new modalities or patient groups even when data is limited. Meta-learning approaches could increase the efficiency and accuracy of MMI analysis, making it a vital area of study with the potential to improve patient care and speed medical discoveries. Advances in multimodal fusion techniques will enable better integration of data from many imaging modalities. Aishik Konwer et al. (2023) [153] observe that not all modalities are consistently available in medical imaging and propose a meta-learning approach that enhances modality-agnostic representations, improving brain tumor segmentation even with limited data.
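To illustrate the "learning to learn" idea, the sketch below uses a Reptile-style update on a toy problem. This is a swapped-in, deliberately simple algorithm and is not the method of [153]. Each "task" is a one-parameter quadratic loss with its own target, and the function names, learning rates, and task targets are all illustrative assumptions.

```python
def inner_sgd(theta, target, lr=0.1, steps=3):
    """Adapt theta to one task by gradient descent on the loss (theta - target)^2."""
    for _ in range(steps):
        theta -= lr * 2.0 * (theta - target)  # gradient of the quadratic loss
    return theta

def reptile(task_targets, theta=0.0, meta_lr=0.1, rounds=400):
    """Reptile-style meta-training: move the shared initialization toward each task's adapted value."""
    for r in range(rounds):
        target = task_targets[r % len(task_targets)]  # cycle through tasks deterministically
        adapted = inner_sgd(theta, target)
        theta += meta_lr * (adapted - theta)
    return theta

# Meta-train on two toy tasks, then compare adaptation to a task from each start point.
meta_init = reptile([2.0, 4.0])
error_from_meta = abs(inner_sgd(meta_init, 4.0) - 4.0)
error_from_scratch = abs(inner_sgd(0.0, 4.0) - 4.0)
print(error_from_meta < error_from_scratch)
```

The learned initialization settles between the task optima, so a few gradient steps suffice for either task. That is the property meta-learning aims for when a new modality or patient group offers only limited data.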

7.3.7 Generative models

Generative models play an important role in MMI analysis by generating synthetic data that closely matches real medical images. GANs, which consist of a generator and a discriminator network, are among the best-known generative model architectures. In MMI analysis, GANs may be used to address data scarcity by generating additional samples for underrepresented modalities or expanding the training dataset [154, 155]. GANs also support modality translation, transforming images from one modality to another and thereby enabling cross-modality analysis and correlation. Researchers may use generative models to extend datasets, improve model generalization, and investigate multiple imaging modalities without requiring substantial data collection and annotation. These models have significant capacity to improve the accuracy and robustness of medical image analysis, with implications for personalized medicine and patient care. Nonetheless, ensuring the quality and fidelity of the generated images remains a persistent challenge.
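For reference, the interplay between the generator and the discriminator is usually written as the standard minimax objective below. The symbols are the conventional ones and are not defined elsewhere in this review: G is the generator, D the discriminator, p_data the distribution of real images, and p_z the noise prior. Individual studies may use modified losses.

```latex
\min_{G}\max_{D}\; V(D,G)
  = \mathbb{E}_{x\sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z\sim p_{z}(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator is trained to raise V by telling real images from synthetic ones, while the generator is trained to lower it by producing images the discriminator accepts as real.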

Limited data availability is a challenge in medical imaging. Training DL models with augmentation GANs is one option for multimodality imaging with limited data, as discussed by Chaobin Xu et al. (2024) [154]. Their Scarce Data Augmentation Generative Adversarial Nets (Scarcity-GAN) tackles data scarcity by selecting similar features from extra datasets, enhancing defect detection. The model, which incorporates an Encoder-Decoder framework and a Fusion Patch-Embedding module, accurately locates defects. Extensive experiments are reported to demonstrate Scarcity-GAN’s superior performance and generalizability over state-of-the-art models, suggesting the potential of GANs for training DL models with limited data in medical imaging.

7.3.8 Clinical integration of AI models

Healthcare professionals can benefit from AI-driven diagnostic assistance, treatment planning, and patient monitoring by integrating AI into clinical settings [156]. AI models may help with illness detection and diagnosis, individualized therapy suggestions, and timely treatment response. Clinical integration also entails dealing with data privacy, governance, and regulatory compliance issues. Building confidence in AI-driven healthcare applications requires ensuring patient data is securely handled and utilized responsibly. Ultimately, successful clinical integration of MMI analysis empowers healthcare professionals with powerful AI tools, leading to improved patient outcomes, more efficient healthcare processes, and better healthcare delivery [157–159].

7.3.9 Multi-task learning

Multi-task learning (MTL) trains DL models to perform multiple related tasks concurrently using a shared set of parameters. Instead of training a separate model for each task, MTL uses knowledge shared between tasks to improve performance on each one. In MMI analysis, MTL allows different imaging modalities to be examined, or different tasks such as image segmentation, classification, and registration to be performed, within a unified framework. By learning jointly from multiple tasks and data sources, the model can extract more flexible and generalized representations from multi-modal input [160].

MTL offers several advantages in medical image analysis, including higher efficiency, reduced risk of overfitting, and better generalization. It can also be particularly useful when annotated data for individual tasks is limited. MTL can therefore strengthen DL models for MMI analysis, producing more precise and comprehensive results and accelerating the development of AI-driven solutions in healthcare. For example, Huihui Yu et al. (2024) [161] report that a novel self-supervised multi-task learning framework improves learned representations and outperforms state-of-the-art methods on chest X-ray datasets.
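The shared-parameter idea can be sketched as hard parameter sharing: one encoder feeds several task-specific heads, and the training objective is a weighted sum of the per-task losses. Everything in the sketch is an illustrative assumption (a single ReLU layer, scalar regression heads, squared-error losses, and the task names "seg" and "cls"). Real segmentation and classification heads would be far larger.

```python
def shared_encoder(x, w_shared):
    """Shared representation: one ReLU layer reused by every task head."""
    return [max(0.0, sum(wi * xi for wi, xi in zip(row, x))) for row in w_shared]

def head(h, w_head):
    """Task-specific linear head applied to the shared representation."""
    return sum(wi * hi for wi, hi in zip(w_head, h))

def multitask_loss(x, targets, w_shared, heads, task_weights):
    """Weighted sum of per-task squared errors computed from one shared encoding."""
    h = shared_encoder(x, w_shared)
    losses = {t: (head(h, heads[t]) - targets[t]) ** 2 for t in heads}
    total = sum(task_weights[t] * losses[t] for t in heads)
    return total, losses

total, per_task = multitask_loss(
    x=[1.0, 2.0],
    targets={"seg": 3.0, "cls": 0.0},
    w_shared=[[1.0, 0.0], [0.0, 1.0]],
    heads={"seg": [1.0, 1.0], "cls": [1.0, -1.0]},
    task_weights={"seg": 1.0, "cls": 0.5},
)
print(total, per_task)
```

Because gradients from every task flow through `w_shared`, each task acts as a regularizer for the others, which is the mechanism behind the reduced overfitting mentioned above.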

7.3.10 Uncertainty estimation

Uncertainty estimation in MMI analysis involves assessing the uncertainty associated with the predictions of AI models. In healthcare, uncertainty plays a pivotal role because of the inherent complexity and variability of medical data. Uncertainty falls into two primary categories: 1) aleatoric (stochastic) uncertainty, which arises from noise in the data, and 2) epistemic uncertainty, which arises from limited knowledge of the model [162, 163].

In MMI analysis, uncertainty estimation becomes particularly valuable when dealing with limited or noisy data, domain shifts between different imaging modalities or patient populations, and ambiguous cases. By measuring uncertainty, AI models can attach confidence intervals to their predictions, allowing healthcare practitioners to make better-informed judgments, flag cases that require additional evaluation, and avoid potentially misleading outcomes. Uncertainty estimation is a critical component in building dependable AI systems for medical image processing, enhancing the reliability and safety of AI-powered clinical applications, and improving patient care [164–166].
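One common way to separate the two categories named above (stochastic, often called aleatoric, and epistemic) uses several stochastic predictions for the same case, for example from an ensemble or from Monte Carlo dropout. The entropy of the averaged prediction gives the total uncertainty, the average entropy of the individual predictions gives the aleatoric part, and the difference gives the epistemic part. The sketch assumes class-probability vectors as input, and the function names are illustrative.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def decompose_uncertainty(member_probs):
    """Split predictive uncertainty into (total, aleatoric, epistemic).

    member_probs: one class-probability vector per ensemble member or dropout sample.
    """
    n = len(member_probs)
    k = len(member_probs[0])
    mean = [sum(m[c] for m in member_probs) / n for c in range(k)]
    total = entropy(mean)                                   # entropy of the averaged prediction
    aleatoric = sum(entropy(m) for m in member_probs) / n   # average per-member entropy
    return total, aleatoric, total - aleatoric              # remainder is the epistemic part

# Members agree that the case is ambiguous: uncertainty is mostly aleatoric.
print(decompose_uncertainty([[0.5, 0.5], [0.5, 0.5]]))
# Members are individually confident but disagree: uncertainty is mostly epistemic.
print(decompose_uncertainty([[1.0, 0.0], [0.0, 1.0]]))
```

High epistemic uncertainty suggests the model has seen too little similar data, which makes it a natural trigger for referring a case to a clinician.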

7.3.11 Other emerging technologies

Healthcare textiles are an emerging technology that comes in a variety of shapes and functions [167]. In multimodality imaging, healthcare textiles can enhance patient comfort, reduce infection risks, minimize artifacts, and integrate smart technologies, leading to improved imaging performance and diagnostic accuracy.

Quantum machine learning (QML) enhances healthcare by optimizing drug discovery, genomic analysis, and medical imaging processes [168]. The utilization of QML can improve multimodality imaging through enhanced image reconstruction, automated segmentation, and personalized treatment planning.

8 Conclusion

This review article comprehensively investigates recent advancements in DL-based MMI analysis methods. These techniques cover tumor detection, segmentation, and classification. The article explores specific DL models and techniques for analyzing diverse medical imaging modalities while emphasizing their strengths and limitations compared to conventional ML methods.

Moreover, this review presents specific instances where DL-based methods have been employed in MMI analysis, such as disease classification, and summarizes the findings of these applications. MMI plays an essential role in improving the understanding, diagnosis, and treatment of a variety of diseases. By combining information from distinct imaging modalities, clinicians and researchers can better comprehend the patient’s condition, enhancing patient management and prognosis.

Researchers are presently exploring future directions and emerging trends to surmount the hurdles encountered in this field. Explainable and interpretable DL models strive to enhance AI models’ transparency, enabling clinicians to navigate the problem-solving process more effectively. Furthermore, integrating patient-specific data with multimodal imaging data through clinical data integration further enhances AI-driven analysis, facilitating more personalized and context-aware healthcare decisions.

Transfer learning techniques reuse knowledge acquired by existing models to improve performance when datasets are limited. Generative models permit data augmentation and modality translation, addressing the challenge of insufficient data and facilitating the analysis of diverse modalities. Meta-learning, in turn, focuses on constructing models that can quickly adapt to new tasks or data distributions with less data, increasing the model’s generalizability.

As DL-based MMI analysis continues to advance, the combination of these approaches will continue to shape the trajectory of AI-driven healthcare. By identifying and addressing difficulties, exploring fresh methodologies, and embracing emerging trends, AI models will steadily improve in accuracy, interpretability, and applicability in clinical settings, to the benefit of patients and healthcare workers.

A comprehensive examination of DL-based analytical methods in MMI is a valuable resource for researchers, practitioners, and decision-makers. This review presents the remarkable progress in the field and emphasizes the pressing need for further research, validation, and incorporation of ethical considerations, particularly as these technologies move towards broader clinical adoption. In essence, the combination of DL technology with MMI has the potential to transform healthcare by providing more precise, rapid, and personalized diagnostic and treatment options. It is therefore important that all relevant stakeholders understand the significance and influence of this integration.

Footnotes

Acknowledgments

We gratefully acknowledge the support received for this work from the following funding sources:

• This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (2021-0-00755, Dark data analysis technology for data scale and accuracy improvement).

• Furthermore, this work was partially supported by the National Research Foundation of Korea (NRF), funded by the Korean government (MSIT) (No. RS-2023-00243034).

I acknowledge the use of ChatGPT (https://chat.openai.com/), which helped refine my work’s academic language and accuracy.

Figures Copyright Permissions

Some of the figures are reused from published articles online. Copyright permissions to reuse these figures were obtained under License Number 5742810476253 from IEEE Xplore.

References

Cormack

A.M.

, Representation of a function by its line integrals with some radiological, Journal of Applied Physics 34(9) (1963), 2908–2913.

Bonmatí

L.M.

Sopena

Bartumeus

Sopena

, Multimodality imaging techniques, Contrast Media & Molecular Imaging 5(4) (2010), 180–189.

Kuhl

D.E.

Hale

Eaton

W.L.

, Transmission scanning: a useful adjunct to conventional, Radiology 87(2) (1966), 278–284.

Hounsfield

G.N.

, Computerized transverse axial scanning (tomography): Part 1. Description of system, British Journal of Radiology 46(552) (1973), 1016–1022.

Townsend

D.W.

, Dual-modality imaging: combining anatomy and function, Journal of Nuclear Medicine 49(6) (2008), 938–955.

Angelique

, Multimodality imaging probes: design and challenges, Chemical Reviews 110(5) (2010), 3146–3195.

Townsend

D.W.

Beyer

Blodgett

T.M.

, PET/CT scanners: a hardware approach to image fusion, Semin Nucl Med 33(3) (2003), 193–204.

Pomper

M.G.

Gelovani

J.G.

, Molecular imaging in oncology 1st ed, New York: CRC Press Informa Healthcare, 2008.

Ell

P.J.

, The contribution of PET/CT to improved patient management, The British Journal of 79(937) (2006), 32–36.

10.

Tsukamoto

Ochi

, PET/CT today: system and its impact on cancer diagnosis, Annals of Nuclear Medicine 20 (2006), 255–267.

11.

Cherry

S.R.

Louie

A.Y.

Jacobs

R.E.

, The integration of positron emission tomography with magnetic resonance imaging, Proceedings of the IEEE 96(3) (2008), 418–418.

12.

Bisdas

Fougere

C.L.

Ernemann

, Hybrid MR-PET in neuroimaging, Clinical Neuroradiology 25 (2015), 275–281.

13.

Dinggang

Suk

, Deep learning in medical image analysis, Annual Review of Biomedical Engineering 19 (2017), 221–248.

14.

Shao

Gao

Guo

Shi

Yang

Shen

, Hierarchical lung field segmentation with joint shape and appearance sparse learning, IEEE Transactions on Medical Imaging 33(9) (2014), 1761–1780.

15.

Wang

Chen

K.C.

Gao

Shi

Liao

Shen

S.G.

Yan

Lee

P.K.

Chow

Liu

N.X.

Xia

J.J.

Shen

, Automated bone segmentation from dental CBCT images using patch-based sparse representation and convex optimization, Medical Physics 41(4) (2014), 043503_1–043503_14.

16.

Yap

P.H.

Zhang

Shen

, Multi-tissue decomposition of diffusion MRI signals via L0 sparse-group estimation, IEEE Transactions on Image Processing 25(9) (2016), 4340–4353.

17.

Suk

H.I.

Lee

S.W.

Shen

Initiative

A.D.N.

, Deep sparse multi-task learning for feature selection in Alzheimer’s disease diagnosis, Brain Structure and Function 221 (2016), 2569–2587.

18.

Chen

Juttukonda

Benzinger

Rubin

B.G.

Lee

Y.Z.

Lin

Shen

Lalush

Hongyu

A.N.

, Probabilistic air segmentation and sparse regression estimated pseudo CT for PET/MR attenuation correction, Radiology 275(2) (2015), 562–569.

19.

Schmidhuber

, Deep learning in neural networks: An overview, Neural Networks 61 (2015), 85–117.

20.

LeCun

Bengio

Hinton

, Deep learning, Nature 521 (2015), 436–444.

21.

Rguibi

Hajami

Zitouni

Elqaraoui

Bedraoui

, Cxai: Explaining convolutional neural networks for medical imaging diagnostic, Electronics 11(11) (2022), 1775.

22.

Litjens

Kooi

Bejnordi

B.E.

Setio

A.A.A.

Ciompi

Ghafoorian

, Laak

J.A.V.D.

, Ginneken

B.V.

, Ciompi

C.I.

, Sánchez

, A survey on deep learning in medical image analysis, Medical Image Analysis 42 (2017), 60–88.

23.

Lundervold

A.S.

Lundervold

, An overview of deep learning in medical imaging focusing on MRI, Zeitschrift Für Medizinische Physik 29(2) (2019), 102–127.

24.

Liu

Wang

Yang

Lei

Liu

S.X.

Wang

, Deep learning in medical ultrasound analysis: a review, Engineering 5(2) (2019), 261–275.

25.

Shen

Suk

, Deep learning in medical image analysis, Annual Review of Biomedical Engineering 19 (2017), 221–248.

26.

Hussain

A.S.M.H.D.

, Computer-Aided Osteoporosis Detection from DXA imaging, Computer Methods and Programs in Biomedicine 173 (2019), 87–107.

27.

Hussain

Han

S.M.

Kim

T.S.

, Automatic hip geometric feature extraction in DXA imaging using regional random forest, Journal of X-ray Science and Technology 27(2) (2019), 207–236.

28.

Hussain

Al-antari

A.M.

Al-masni

S.M.H.A.M.

Kim

T.S.

, Femur segmentation in DXA imaging using a machine learning decision tree, Journal of X-ray Science and Technology 26(5) (2018), 727–746.

29.

Hussain

Khan

M.A.

Abbas

Naqvi

R.A.

Mushtaq

M.F.

Rehman

Nadeem

, Enabling smart cities with cognition based intelligent route decision in vehicles empowered with deep extreme learning machine, CMC-Computers Materials & Continua 26(1) (2021), 141–156.

30.

Hussain

Naqvi

R.A.

Loh

W.K.

Lee

, Deep learning in DXA image segmentation, CMC-Computers Materials & Continua 66(3) (2020), 2587–2598.

31.

Nazir

A.I.A.J.H.M.D.H.T.

Naqvi

R.A.

, Retinal image analysis for diabetes-based eye disease detection using deep learning, Applied Sciences 10(18) (2020), 6185_2–6185_21.

32.

Naqvi

M.A.K.R.A.

Malik

Saqib

Alyas

Hussain

, Roman urdu news headline classification empowered with machine learning, CMC-Computers Materials & Continua 65(2) , 1221–1236.

33.

Naqvi

R.A.

Hussain

Loh

W.K.

, Artificial Intelligence-based semantic segmentation of ocular regions for biometrics and healthcare applications, CMC-Computers Materials & Continua 66(1) (2021), 715–732.

34.

Siddiqui

S.Y.

Naseer

Khan

M.A.

Mushtaq

M.F.

Naqvi

R.A.

Hussain

Haider

, Intelligent Breast Cancer Prediction Empowered with Fusion and Deep Learning, CMC-Computers Materials & Continua 67(1) (2021), 1033–1049.

35.

Khan

M.A.

Abdullah

Akram

Naqvi

R.A.

Mehmood

Hussain

Soomro

T.A.

, A Scale Normalized Generalized LoG Detector Approach for Retinal Vessel Segmentation, IEEE Access 9 (2021), 44442.

36.

Hussain

Naqvi

R.A.

Abbas

Khan

M.A.

Sohail

Hussain

, Trait based trustworthiness assessment in human-agent collaboration using multi-layer fuzzy inference approach, IEEE Access 9 (2021), 73561–73574.

37.

Iqbal

Naqvi

R.A.

Atif

Hanif

M.A.K.M.

Abbas

Hussain

, On the image encryption algorithm based on the chaotic system, DNA encoding and castle, IEEE Access 9 (2021), 118253–118270.

38.

Akram

Adnan

Asif

Imran

S.M.A.

Yasir

M.N.

Naqvi

R.A.

Hussain

, Exploiting the multiscale information fusion capabilities for aiding the leukemia diagnosis through white blood cells segmentation, IEEE Access 10 (2022), 48747–48760.

39.

Ishaq

Raza

Rehar

Abadeen

S.Z.

Hussain

Naqvi

R.A.

Lee

S.W.

, Assisting the human embryo viability assessment by deep learning for in vitro fertilization, Mathematics 11(9) (2023), 1–17.

40.

Zhou Tongxue Ruan

S.C.S.

, A review: Deep learning for medical image segmentation using multi-modality fusion, Array 3(4) (2019), 100004.

41.

Korot

Guan

Ferraz

Wagner

S.K.

Zhang

Liu

Faes

, et al., Code-free deep learning for multi-modality medical image classification, Nature Machine Intelligence 3(4) (2021), 288–298.

42.

Fourcade

Khonsari

R.H.

, Deep learning in medical image analysis: A third eye for doctors, Journal of Stomatology, Oral and Maxillofacial Surgery 120(4) (2021), 279–288.

43.

Islam

K.T.

Wijewickrema

Leary

S.O.

, A deep learning-based framework for the registration of three dimensional multi-modal medical images of the head, Scientific Reports 11(1) (2021), 1–13.

44.

Bhatnagar

Q.J.

Liu

, A new contrast based multimodal medical image fusion framework, Neurocomputing 157 (2015), 143–152.

45.

Guo

Huang

Guo

, Deep learning-based image segmentation on multimodal medical imaging, IEEE Transactions on Radiation and Plasma Medical Sciences 3(2) (2019), 162–169.

46.

Bushberg

J.T.

Seibert

J.A.

Leidholdt

E.M.

Boone

J.M.

, The Essential Physics of Medical Imaging (3rd ed.), North America: Lippincott Williams & Wilkins, 2011.

47.

Sotoudeh

Sharma

Fowler

K.J.

McConathy

Dehdashti

, Clinical application of PET/MRI in oncology, Journal of Magnetic Resonance Imaging 44(2) (2016), 265–276.

48.

Zhang

Xiao

Tan

, Correlation between 18F-FDG PET CT SUV and symptomatic or asymptomatic pulmonary tuberculosis, Journal of X-ray Science and Technology 27(5) (2019), 899–906.

49.

Wang

Liu

, Three-dimensional structure tensor based PET/CT fusion in gradient domain, Journal of X-Ray Science and Technology 27(2) (2019), 307–319.

50.

Kim

Satter

Reed

Fadell

Kardan

, A novel, integrated PET-guided MRS technique resulting in more accurate initial diagnosis of high-grade glioma, The Neuroradiology Journal 29(3) (2016), 193–197.

51.

Cao

Feng

Kim

, Improving PET-CT image segmentation via deep multi-modality data augmentation, in: Machine Learning for Medical Image Reconstruction: Third International Workshop, MLMIR 2020, Lima, Peru, 2020.

52.

Diao

Jiang

Shi

Yao

Y.D.

, Siamese semi-disentanglement network for robust PET-CT segmentation, Expert Systems with Applications 223 (2023), 119855.

53.

Huang

Tang

Chen

Wang

Shen

Zhou

Wang

Fan

Liang

, TG-Net: Combining transformer and GAN for nasopharyngeal carcinoma tumor segmentation based on total-body uEXPLORER PET/CT scanner, Computers in Biology and Medicine 148 (2022), 105869.

54.

Kumar

Fulham

Feng

Kim

, Co-learning feature fusion maps from PET-CT images of lung cancer, IEEE Transactions on Medical Imaging 39(1) (2019), 204–217.

55.

Zhao

Tan

, Deep learning for variational multimodality tumor segmentation in PET/CT, Neurocomputing 392 (2020), 277–295.

56.

Ohno

Koyama

Lee

H.Y.

Yoshikawa

Sugimura

, Magnetic resonance imaging (MRI) and positron emission tomography (PET)/MRI for lung cancer staging, Journal of Thoracic Imaging 31(4) (2016), 215–227.

57.

Kang

Xia

Skudder-Hill

Yin

Wang

, Magnetic Resonance Imaging (MRI) and Positron Emission Tomography (PET)/Computed Tomography Features of Atypical Teratoid/Rhabdoid Tumors: Case Series and Review, Journal of Child Neurology 37(12–14) (2022), 1003–1009.

58.

Yang

Cong

Kalra

Wang

, Sinogram-based attenuation correction in PET/CT, Journal of X-Ray Science and Technology 24(1) (2016), 9–22.

59.

Hussain

Al-Antari

Al-Masni

Han

Kim

, Femur segmentation in DXA imaging using a machine learning decision tree, Journal of X-ray Science and Technology 26(5) (2018), 727–746.

60.

Wang

Zhang

Ding

Chen

Jiang

Shi

Bai

Ren

, A modularly designed fluorescence molecular tomography system for multi-modality imaging, Journal of X-Ray Science and Technology 32(2) (2015), 147–156.

61.

Budd

Robinson

E.C.

Kainz

, A survey on active learning and human-in-the-loop deep learning for medical image analysis, Medical Image Analysis 71 (2021), 403–415.

62.

Mora

D.A.L.

Lagos

L.A.

Montserrat

I.C.

, Estorch, Future Challenges of Multimodality Imaging, Molecular Imaging in Oncology 216 (2020), 905–918.

63.

Zhang

Y.D.

Dong

Wang

S.H.

Yao

Zhou

, Advances in multimodal data fusion in neuroimaging: Overview, challenges, and novel orientation, Information Fusion 64 (2020), 149–187.

64.

Liao

Huang

Luo

Fan

, Analysis of misdiagnosis and 18F-FDG PET/CT findings of lymph node tuberculosis, Journal of X-Ray Science and Technology 30(5) (2022), 941–951.

65.

, Deep learning in multimodal medical image analysis, in: Health Information Science HIS 2019, Xi’an, China, October 18–20, 2019.

66.

Puyol

A.E.

Sidhu

B.S.

Gould

Porter

Elliott

M.K.

Mehta

Rinaldi

C.A.

King

A.P.

, A multimodal deep learning model for cardiac resynchronisation therapy response prediction, Medical Image Analysis 79 (2022), 102465.

67.

Mouchess

M.L.

Sohara

Nelson

M.D.

DeClerck

Y.A.

Moats

R.A.

, Multimodal imaging analysis of tumor progression and bone resorption in a murine cancer model, Journal of Computer Assisted Tomography 30(3) (2006), 525–534.

68.

Xue

Zhang

Zhu

Shen

Shah

S.A.A.

Bennamoun

, Multi-modal co-learning for liver lesion segmentation on PET-CT images, IEEE Transactions on Medical Imaging 40(12) (2021), 3531–3542.

69.

Heinsalu

Williams

Ranjan

Zampieri

C.A.

Uus

Robinson

E.C.

Rutherford

M.A.

Story

Hutter

, Predicting Preterm Birth Using Multimodal Fetal Imaging, in: Uncertainty for Safe Utilization of Machine Learning in Medical Imaging, and Perinatal Imaging, Placental and Preterm Image Analysis, Strasbourg, France, 2021.

70.

Khan

Ashraf

Alhaisoni

Damaševičius

Scherer

Rehman

Bukhari

, Multimodal brain tumor classification using deep learning and robust feature selection: A machine learning application for radiologists, Diagnostics 10(8) (2020), 565.

71.

Murtaza

Shuib

Abdul Wahab

Mujtaba

Nweke

Al-garadi

Zulfiqar

Raza

Azmi

, Deep learning-based breast cancer classification through medical imaging modalities: state of the art and research challenges, Artificial Intelligence Review 52 (2020), 1655–1720.

72.

Hossain

Al Jannat

Huda

Alsharif

Faragallah

Eid

Rashed

, Brain Tumor Auto-Segmentation on Multimodal Imaging Modalities Using Deep Neural Network, Computers, Materials & Continua 72(3) (2022), 4509–4523.

73.

Dai

Gao

Liu

, Transmed: Transformers advance multi-modal medical image classification, Diagnostics 11(8) (2021), 1384.

74.

Jia

, Brain tumor classification with multimodal MR and pathology images, in: Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries: 5th International Workshop, BrainLes 2019, Held in Conjunction with MICCAI 2019, Shenzhen, China, 2020.

75.

Takahashi

Fujioka

Oyama

Mori

Yamaga

Yashima

Imokawa

Hayashi

Kujiraoka

Tsuchiya

Oda

, Deep learning using multiple degrees of maximum-intensity projection for PET/CT image classification in breast cancer, Tomography 8(1) (2022), 131–141.

76.

Zhang

Wang

Liu

Tang

Wang

, Multiple organ-specific cancers classification from PET/CT images using deep learning, Multimedia Tools and Applications 81(12) (2022), 16133–16154.

77.

Guo

Huang

Guo

, Medical image segmentation based on multi-modal convolutional neural network: Study on image fusion schemes, in: IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pp. 903, Washington, D.C., 2018.

78.

Dolz

Gopinath

Yuan

Lombaert

Desrosiers

Ayed

I.B.

, HyperDense-Net: a hyper-densely connected CNN for multi-modal image segmentation, IEEE Transactions on Medical Imaging 38(5) (2018), 1116–1126.

79.

Zhang

Yang

Tian

Shi

Zhong

Zhang

, Modality-aware mutual learning for multimodal medical image segmentation, in Medical Image Computing and Computer Assisted Intervention–MICCAI, Strasbourg, France, 2021.

80.

Alam

Rahman

S.U.

, Challenges and solutions in multimodal medical image subregion detection and registration, Journal of Medical Imaging and Radiation Sciences 50(1) (2019), 24–30.

81.

Owais

Cho

S.W.

Park

K.R.

, Volumetric Model Genesis in Medical Domain for the Analysis of Multimodality 2D/3D Data based on the Aggregation of Multilevel Features, IEEE Transactions on Industrial Informatics Early access, pp. 1–13, 2023.

82.

Rahim

El-Sappagh

Ali

Muhammad

Ser

J.D.

Abuhmed

, Prediction of Alzheimer’s progression based on multimodal Deep-Learning-based fusion and visual Explainability of time-series data, Information Fusion 92 (2023), 363–388.

83.

Zhao

Tan

, Tumor co-segmentation in PET/CT using multi-modality fully convolutional neural network, Physics in Medicine & Biology 64(1) (2018), 015011.

84.

Milletari

Navab

Ahmadi

S.A.

, V-net: Fully convolutional neural networks for volumetric medical image segmentation, in: IEEE 2016 Fourth International Conference on 3D Vision (3DV), Stanford, California, USA, 25–28 Oct, 2016.

85.

Xue

Zhang

Zhu

Shen

Shah

S.A.A.

Bennamoun

, Multi-modal co-learning for liver lesion segmentation on PET-CT images, IEEE Transactions on Medical Imaging 40(12) (2021), 3531–3542.

86.

Zhou

Liu

Wang

, CCGL-YOLOV5: A cross-modal cross-scale global-local attention YOLOV5 lung tumor detection model, Computers in Biology and Medicine 165 (2023), 107387.

87.

Moreau

Rousseau

Fourcade

Santini

, et al., Automatic segmentation of metastatic breast cancer lesions on 18F-FDG PET/CT longitudinal acquisitions for treatment response assessment, Cancers 14(1) (2021), 101.

88.

Liedes

Hellström

Rainio

Murtojärvi

Malaspina

Hirvonen

Klén

Kemppainen

, Automatic segmentation of head and neck cancer from PET-MRI data using deep learning, Journal of Medical and Biological Engineering (2023), 1–9.

89.

Yousif

A.S.

Omar

Sheikh

U.U.

, An improved approach for medical image fusion using sparse representation and Siamese convolutional neural network, Biomedical Signal Processing and Control 72 (2022), 103357.

90.

Ding

Zhou

Hou

Liu

, Siamese networks and multi-scale local extrema scheme for multimodal brain medical image fusion, Biomedical Signal Processing and Control 68(11) (2021), 102697.

91.

Xiao

Yang

Qiang

Zhao

Hao

Lian

Li

, PET and CT image fusion of lung cancer with Siamese pyramid fusion network, Frontiers in Medicine 9 (2022), 792390.

92.

Tang

Liu

Duan

, MATR: Multimodal medical image fusion via multiscale adaptive transformer, IEEE Transactions on Image Processing 31 (2022), 5134–5149.

93.

Azam

M.A.

Khan

K.B.

Salahuddin

Rehman

Khan

S.A.

Khan

M.A.

Kadry

Gandomi

A.H.

, A review on multimodal medical image fusion: Compendious analysis of medical modalities, multimodal databases, fusion techniques and quality metrics, Computers in Biology and Medicine 144 (2022), 105253.

94.

Tan

Tiwari

Pandey

H.M.

Moreira

Jaiswal

A.K.

, Multimodal medical image fusion algorithm in the era of big data, Neural Computing and Applications (2020), 1–21.

95.

Zhou

Ruan

Canu

, A review: Deep learning for medical image segmentation using multi-modality fusion, Array 3 (2019), 100004.

96.

Fulham

Liu

Song

Feng

D.D.

Kim

, Recurrent feature fusion learning for multi-modality PET-CT tumor segmentation, Computer Methods and Programs in Biomedicine 203 (2021), 106043.

97.

Sebastian

King

, Comparative analysis and fusion of MRI and PET images based on wavelets for clinical diagnosis, International Journal of Electronics and Telecommunications 68(4) (2022), 867–873.

98.

Jin

Guo

T.Y.

Harrison

A.P.

Xiao

Tseng

C.K.

Lu

, Accurate esophageal gross tumor volume segmentation in PET/CT using two-stream chained 3D deep network fusion, in: Medical Image Computing and Computer Assisted Intervention–MICCAI, Shenzhen, China, 2019.

99.

Chen

Qiao

Chen

Huang

, Multimodal fusion network for detecting hyperplastic parathyroid glands in SPECT/CT images, IEEE Journal of Biomedical and Health Informatics 27(3) (2022), 1524–1534.

100.

Chen

Yin

Liu

Gong

Wang

, MMFNet: A multi-modality MRI fusion network for segmentation of nasopharyngeal carcinoma, Neurocomputing 394 (2020), 27–40.

101.

Liu

Huang

Z.A.

Zhu

Wong

K.C.

Tan

K.C.

, Attention-like multimodality fusion with data augmentation for diagnosis of mental disorders using MRI, IEEE Transactions on Neural Networks and Learning Systems, Early Access, 2022.

102.

Fallahpoor

Chakraborty

Pradhan

Faust

Barua

P.D.

Chegeni

Acharya

, Deep learning techniques in PET/CT imaging: A comprehensive review from sinogram to image space, Computer Methods and Programs in Biomedicine, Pre-proof (2023), 107880.

103.

Yang

Song

Nie

Qiao

Shi

Yin

, Multi-modality relation attention network for breast tumor classification, Computers in Biology and Medicine 150 (2022), 106210.

104.

Hussein

Shin

Zhao

Guo

Davidzon

Moseley

Zaharchuk

, Brain MRI-to-PET synthesis using 3D convolutional attention networks, arXiv preprint arXiv:2211.12082, 2022.

105.

Chen

Liu

Shen

Liu

Zhao

Zhu

, Multimodality Attention-Guided 3-D Detection of Nonsmall Cell Lung Cancer in 18F-FDG PET/CT Images, IEEE Transactions on Radiation and Plasma Medical Sciences 6(4) (2021), 421–432.

106.

Chen

Wei

Li

, TarGAN: target-aware generative adversarial networks for multi-modality medical image translation, in: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, 2021.

107.

Zhu

Huang

Zhang

Zeng

Kong

Zhou

, DualMMP-GAN: Dual-scale multi-modality perceptual generative adversarial network for medical image segmentation, Computers in Biology and Medicine 144 (2022), 105387.

108.

Yuan

Wei

Wang

Tasdizen

, Unified generative adversarial networks for multimodal segmentation from unpaired 3D medical images, Medical Image Analysis 64 (2020), 101731.

109.

Liebgott

Hinderer

Armanious

Bartler

Nikolaou

Gatidis

Yang

, Prediction of FDG uptake in Lung Tumors from CT Images Using Generative Adversarial Networks, in: 2019 27th European Signal Processing Conference (EUSIPCO), IEEE, A Coruña, Spain, 2019.

110.

Ben Cohen

Klang

Raskin

S.P.

Soffer

Ben-Haim

Konen

, Cross-modality synthesis from CT to PET using FCN and GAN networks for improved automated lesion detection, Engineering Applications of Artificial Intelligence 78 (2019), 186–194.

111.

Zheng

Wang

Zhou

Zhang

Song

Jiang

, Fully Convolutional Transformer-Based GAN for Cross-Modality CT to PET Image Synthesis, in: International Workshop on Computational Mathematics Modeling in Cancer Analysis, Vancouver, BC, Canada, 8 October 2023.

112.

K.T.

Kim

B.S.

Lee

Yun

Yoo

S.K.

, Segmentation of white matter hyperintensities on 18F-FDG PET/CT images with a generative adversarial network, European Journal of Nuclear Medicine and Molecular Imaging 48 (2021), 3422–3431.

113.

Islam

K.T.

Wijewickrema

O’Leary

, A deep learning framework for segmenting brain tumors using MRI and synthetically generated CT images, Sensors 22(2) (2022), 523.

114.

Tawfik

Elnemr

Fakhr

Dessouky

Abd El-Samie

, Survey study of multimodality medical image fusion methods, Multimedia Tools and Applications 80 (2021), 6369–6396.

115.

Diwakar

Singh

Shankar

, Multi-modal medical image fusion framework using co-occurrence filter and local extrema in NSST domain, Biomedical Signal Processing and Control 68 (2021), 102788.

116.

D’Souza

Wang

Giovannini

Foncubierta-Rodriguez

Beck

Boyko

Syeda-Mahmood

, Fusing modalities by multiplexed graph neural networks for outcome prediction from medical data and beyond, Medical Image Analysis 93 (2024), 103064.

117.

Liu

Shi

Wang

, Multi-modal fusion network with intra-and inter-modality attention for prognosis prediction in breast cancer, Computers in Biology and Medicine 168 (2024), 107796.

118.

Chen

Zeng

Pan

Wang

Zhao

, Skin cancer classification with deep learning: a systematic review, Frontiers in Oncology 12 (2022), 1–20.

119.

Kocyigit

Grimm

Griffin

Cheng

, Applications of artificial intelligence in multimodality cardiovascular imaging: a state-of-the-art review, Progress in Cardiovascular Diseases 63(3) (2020), 367–376.

120.

Yang

Fan

Zhu

Wang

, Convex hull matching and hierarchical decomposition for multimodality medical image registration, Journal of X-Ray Science and Technology 23(2) (2015), 253–265.

121.

Ren

Eriksen

J.G.

Nijkamp

Korreman

S.S.

, Comparing different CT, PET and MRI multi-modality image combinations for deep learning-based head and neck tumor segmentation, Acta Oncologica 60(11) (2021), 1399–1406.

122.

McKenzie

Santhanam

Ruan

O’Connor

Cao

Sheng

, Multimodality image registration in the head-and-neck using a deep learning-derived synthetic CT as a bridge, Medical Physics 47(3) (2020), 1094–1104.

123.

Yang

Cui

Bai

Gong

, RA-SIFA: Unsupervised domain adaptation multi-modality cardiac segmentation network combining parallel attention module and residual attention unit, Journal of X-Ray Science and Technology 29(6) (2021), 1065–1078.

124.

Sangeetha Francelin

Daniel

Anita Rose

Pugalenthi

, Deep learning supported disease detection with multi-modality image fusion, Journal of X-Ray Science and Technology 29(3) (2021), 411–434.

125.

Hosny

Bitterman

Guthier

Qian

Roberts

Perni

Saraf

Peng

Pashtan

Kann

, Clinical validation of deep learning algorithms for radiotherapy targeting of non-small-cell lung cancer: an observational study, The Lancet Digital Health 4(9) (2022), e657–e666.

126.

Diao

Jiang

Han

Yao

Shi

, EFNet: evidence fusion network for tumor segmentation from PET-CT volumes, Physics in Medicine & Biology 66(20) (2020), 205005.

127.

Sanaat

Shiri

Arabi

Mainta

Nkoulou

Zaidi

, Deep learning-assisted ultra-fast/low-dose whole-body PET/CT imaging, European Journal of Nuclear Medicine and Molecular Imaging 48 (2021), 2405–2415.

128.

Bonardel

Dupont

Decazes

Queneau

Modzelewski

Coulot

Le Calvez

Hapdey

, Clinical and phantom validation of a deep learning based denoising algorithm for F-18-FDG PET images from lower detection counting in comparison with the standard acquisition, EJNMMI Physics 9(1) (2022), 36.

129.

Moe

Groendahl

Tomic

Dale

Malinen

Futsaether

, Deep learning-based auto-delineation of gross tumour volumes and involved nodes in PET/CT images of head and neck cancer patients, European Journal of Nuclear Medicine and Molecular Imaging 48 (2021), 2782–2792.

130.

Chen

Ran

, Deep learning with edge computing: A review, Proceedings of the IEEE 107(8) (2019), 1655–1674.

131.

Ahmed

S.B.

Oba

R.S.

Ilie

, Explainable-AI in Automated Medical Report Generation Using Chest X-ray Images, Applied Sciences 12(22) (2022), 1–19.

132.

Han

Zhong

Guo

, A transfer learning-based multimodal neural network combining metadata and multiple medical images for glaucoma type diagnosis, Scientific Reports 13(1) (2023), 1–13.

133.

Razzaghi

Abbasi

Shirazi

Rashidi

, Multimodal brain tumor detection using multimodal deep transfer learning, Applied Soft Computing 129(109631) (2022), 1–11.

134.

Guan

Liu

, Domain adaptation for medical image analysis: a survey, IEEE Transactions on Biomedical Engineering 69(3) (2021), 1173–1185.

135.

Müller

Unay

, Retrieval from and understanding of large-scale multi-modal medical datasets: a review, IEEE Transactions on Multimedia 19(9) (2017), 2093–2104.

136.

Papadimitroulas

Brocki

Chung

N.C.

Marchadour

Vermet

Gaubert

Eleftheriadis

, Artificial intelligence: Deep learning in oncological radiomics and challenges of interpretability and data harmonization, Physica Medica 83 (2021), 108–121.

137.

Songdechakraiwut

Shen

Chung

, Topological learning and its application to multimodal brain network integration, in: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021, Strasbourg, France, 2021.

138.

Dorent

Kujawa

Ivory

Bakas

Rieke

Joutard

Glocker

, CrossMoDA challenge: Benchmark of cross-modality domain adaptation techniques for vestibular schwannoma and cochlea segmentation, Medical Image Analysis 83 (2023), 102628.

139.

Zhang

Liu

Wang

Liu

Song

, SWTRU: star-shaped window transformer reinforced U-net for medical image segmentation, Computers in Biology and Medicine 150 (2022), 105954.

140.

Kohoutová

Heo

Cha

Lee

Moon

Wager

T.D.

Woo

C.-W.

, Toward a unified framework for interpreting machine-learning models in neuroimaging, Nature Protocols 15(4) (2020), 1399–1435.

141.

Yao

A.D.

Cheng

D.L.

Pan

Kitamura

, Deep learning in neuroradiology: a systematic review of current algorithms and approaches for the new wave of imaging technology, Radiology: Artificial Intelligence 2(2) (2020), e190026:1–6.

142.

Zhu

Zhou

Fan

, Advances and challenges in multimodal remote sensing image registration, IEEE Journal on Miniaturization for Air and Space Systems 4(2) (2023), 165–174.

143.

Boveiri

H.K.R.J.R.

Mehdizadeh

, Medical image registration using deep neural networks: a comprehensive review, Computers & Electrical Engineering 87 (2020), 106767.

144.

Haskins

Kruger

Yan

, Deep learning in medical image registration: a survey, Machine Vision and Applications 31(8) (2020), 1–18.

145.

Chen

Wang

Niu

Liu

Gong

, Domain knowledge powered deep learning for breast cancer diagnosis based on contrast-enhanced ultrasound videos, IEEE Transactions on Medical Imaging 40(9) (2021), 2439–2451.

146.

Zoetmulder

Gavves

Caan

Marquering

, Domain-and task-specific transfer learning for medical segmentation tasks, Computer Methods and Programs in Biomedicine 214 (2022), 106539.

147.

Marwaha

Landman

Brat

Dunn

Gordon

, Deploying digital health tools within large, complex health systems: key considerations for adoption and implementation, NPJ Digital Medicine 5(13) (2022), 1–7.

148.

Aslam

Khan

I.U.

Mirza

AlOwayed

Anis

F.M.

Aljuaid

R.M.

Baageel

, Interpretable machine learning models for malicious domains detection using explainable artificial intelligence (XAI), Sustainability 14(12) (2022), 7375.

149.

Ethier

J.F.

Curcin

Barton

McGilchrist

M.M.

Bastiaens

Andreasson

Rossiter

Zhao

Arvanitis

T.N.

Taweel

Delaney

B.C.

Burgun

, Clinical data integration model, Methods of Information in Medicine 54(1) (2015), 16–23.

150.

Lisboa

P.J.G.

Saralajew

Vellido

Fernández-Domenech

Villmann

, The coming of age of interpretable and explainable machine learning models, Neurocomputing 535 (2023), 25–39.

151.

Che

Wang

Zhou

, Multimodal federated learning: A survey, Sensors 23(15) (2023), 6986.

152.

Zhao

Wang

Y.K.I.

, Multimodality in meta-learning: A comprehensive survey, Knowledge-Based Systems 250 (2022), 108976.

153.

Konwer

Bae

Chen

Prasanna

, Enhancing modality-agnostic representations via metalearning for brain tumor segmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2023.

154.

Cui

Wang

Zheng

Zhang

Chen

, Scarcity-GAN: Scarce data augmentation for defect detection via generative adversarial nets, Neurocomputing 566 (2024), 127061.

155.

Alzubaidi

Bai

Al-Sabaawi

Santamaría

Albahri

Al-dabbagh

Fadhel

Manoufali

Zhang

Al-Timemy

Duan

, A survey on deep learning tools dealing with data scarcity: definitions, challenges, solutions, tips, and applications, Journal of Big Data 10(1) (2023), 46.

156.

Karalis

, The Integration of Artificial Intelligence into Clinical Practice, Applied Biosciences 3(1) (2024), 14–44.

157.

Lipkova

Chen

Barbieri

Shao

Vaidya

Chen

Zhuang

Williamson

Shaban

, Artificial intelligence for multimodal data integration in oncology, Cancer Cell 40(10) (2022), 1095–1110.

158.

Merkow

Soin

Long

Cohen

Saligrama

Bridge

Yang

Kaiser

Borg

Tarapov

Lungren

, CheXstray: a real-time multi-modal monitoring workflow for medical imaging AI, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Vancouver, Canada, 2023.

159.

Dana

Agnus

Ouhmich

Gallix

, Multimodality imaging and artificial intelligence for tumor characterization: current status and future perspective, Seminars in Nuclear Medicine 50(6) (2020), 541–548.

160.

Zhao

Wang

Che

Bao

, Multi-task deep learning for medical image computing and analysis: A review, Computers in Biology and Medicine 153 (2023), 106496.

161.

Dai

, Self-supervised multi-task learning for medical image analysis, Pattern Recognition 150 (2024), 110327.

162.

Kurz

Hauser

Mehrtens

Krieghoff-Henning

Hekler

Kather

Fröhling

von Kalle

Brinker

, Uncertainty estimation in medical image classification: systematic review, JMIR Medical Informatics 10(8) (2022), 36427.

163.

Mehta

Shui

Arbel

, Evaluating the fairness of deep learning uncertainty estimates in medical image analysis, Medical Imaging with Deep Learning 227 (2024), 1453–1492.

164.

Harry

, The Future of Medicine: Harnessing the Power of AI for Revolutionizing Healthcare, International Journal of Multidisciplinary Sciences and Arts 2(1) (2023), 36–47.

165.

Patil

Shankar

, Transforming healthcare: harnessing the power of AI in the modern era, International Journal of Multidisciplinary Sciences and Arts 2(1) (2023), 60–70.

166.

Khan

Shiwlani

Qayyum

Sherani

Hussain

, AI-powered healthcare revolution: an extensive examination of innovative methods in cancer treatment, Jurnal Multidisiplin Ilmu 3(1) (2024), 87–98.

167.

Roy

Ashmika

, Textile Products in Healthcare: Innovations, Applications, and Emerging Trends, in: Emerging Technologies for Health Literacy and Medical Practice, IGI Global, 2024, 288–314.

168.

Ullah

Garcia-Zapirain

, Quantum Machine Learning Revolution in Healthcare: A Systematic Review of Emerging Perspectives and Applications, IEEE Access 12 (2024), 11423–11450.

169.

OpenAI, GPT-3.5, OpenAI, Microsoft Corporation, [Online]. Available: https://chat.openai.com/. [Accessed 11 09 2023].

170.

Liu

Cheng

Wang

Alzheimer’s Disease Neuroimaging Initiative

, Multi-modality cascaded convolutional neural networks for Alzheimer’s disease diagnosis, Neuroinformatics 16 (2018), 295–308.

171.

Kermi

Mahmoudi

Khadir

M.T.

, Deep convolutional neural networks using U-Net for automatic brain tumor segmentation in multimodal MRI volumes, in: Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries: 4th International Workshop, Granada, Spain, 2019.

172.

Hilmizen

Bustamam

Sarwinda

, The multimodal deep learning for diagnosing COVID-19 pneumonia from chest CT-scan and X-ray images, in: 2020 IEEE 3rd International Seminar on Research of Information Technology and Intelligent Systems (ISRITI), Yogyakarta, Indonesia, 2020.

173.

Peng

Liao

Zhou

Zhong

Jiang

Wang

, et al., [18F] FDG PET/MRI combined with chest HRCT in early cancer detection: a retrospective study of asymptomatic subjects, European Journal of Nuclear Medicine and Molecular Imaging 50 (2023), 3723–3734.

174.

Jannusch

Bruckmann

N.M.

Geuting

C.J.

Morawitz

Dietzel

Rischpler...

Kirchner

, Lung Nodules Missed in Initial Staging of Breast Cancer Patients in PET/MRI—Clinically Relevant? Cancers 14(13) (2022), 3454.

175.

Piñeiro-Fiel

Moscoso

Pubul

Ruibal

Á.

Silva-Rodríguez

Aguiar

, A systematic review of PET textural analysis and radiomics in cancer, Diagnostics 11(2) (2021), 380.

176.

Guglielmo

Marturano

Bettinelli

Gregianin

Paiusco

Evangelista

, Additional value of PET radiomic features for the initial staging of prostate cancer: A systematic review from the literature, Cancers 13(23) (2021), 6026.

177.

Magadza

Viriri

, Deep learning for brain tumor segmentation: a survey of state-of-the-art, Journal of Imaging 7(2) (2021), 19.

178.

Martin

Schaarschmidt

Demircioglu

Heusch

Quick

H.H.

Forsting

Antoch

M.G.

Herrmann

, PET/MRI versus PET/CT for whole-body staging: results from a single-center observational study on 1,003 sequential examinations, Journal of Nuclear Medicine 61(8) (2020), 1131–1136.

179.

Baratto

Wang

Y.R.J.

Theruvath

Sarrami

A.H.

Sheybani

Hawk

K.E.

Daldrup-Link

, PET and MRI imaging-based AI models in pediatric oncology, Journal of Nuclear Medicine 63(2) (2022), 2723.

180.

Yousefirizi

Decazes

Amyar

Ruan

Saboury

Rahmim

, AI-based detection, classification and prediction/prognosis in medical imaging: towards radiophenomics, PET Clinics 17(1) (2022), 183–212.

181.

Sadaghiani

M.S.

Rowe

S.P.

Sheikhbahaei

, Applications of artificial intelligence in oncologic 18F-FDG PET/CT imaging: a systematic review, Annals of Translational Medicine 9(9) (2021), 823.

182.

Wang

Ourselin

Vercauteren

, Automatic Brain Tumor Segmentation Based on Cascaded Convolutional Neural Networks With Uncertainty Estimation, Frontiers in Computational Neuroscience 13 (2019), 1–13.

183.

Zhou

Ding

Wang

Tao

, One-Pass Multi-task Convolutional Neural Networks for Efficient Brain Tumor Segmentation, in: 21st International Conference on Medical Image Computing and Computer Assisted Intervention—MICCAI 2018, Granada, Spain, September 16–20, 2018.

184.

Roy Choudhury

Vanguri

Jambawalikar

S.R.

Kumar

, Segmentation of Brain Tumors Using DeepLabv3+, in: 4th International Workshop on Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries, Granada, Spain, September 16, 2018.

185.

Sun

Peng

Guo

, Segmentation of the multimodal brain tumor image used the multi-pathway architecture method based on 3D FCN, Neurocomputing 423 (2021), 34–45.

186.

Wang

Zhang

Bao

Zhu

Cao

P.S.

, Not just privacy: Improving performance of private deep learning in mobile cloud, in: KDD ’18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, United Kingdom, August 19–23, 2018.

187.

Gilad-Bachrach

Dowlin

Laine

Lauter

Naehrig

Wernsing

, CryptoNets: Applying neural networks to encrypted data with high throughput and accuracy, in: 33rd International Conference on Machine Learning, New York, NY, USA, 2016.

188.

Liu

Juuti

Asokan

, Oblivious neural network predictions via MiniONN transformations, in: 2017 ACM SIGSAC Conference on Computer and Communications Security, Texas, USA, October 2017.

189.

Rouhani

B.D.

Riazi

M.S.

Koushanfar

, DeepSecure: Scalable provably-secure deep learning, in: 55th Annual Design Automation Conference, San Francisco, California, June 24–29, 2018.

190.

Juvekar

Vaikuntanathan

Chandrakasan

, GAZELLE: A low latency framework for secure neural network inference, in: 27th USENIX Security Symposium (USENIX Security 18), Santa Clara, CA, USA, August 15–17, 2018.

191.

Liu

Cao

Luo

Chen

Vokkarane

Yunsheng

Chen

Hou

, A new deep learning-based food recognition system for dietary assessment on an edge computing service infrastructure, IEEE Transactions on Services Computing 11(2) (2018), 249–261.

192.

Venkateswarlu Isunuri

Kakarla

, Fast brain tumour segmentation using optimized U-Net and adaptive thresholding, Automatika: Journal for Control, Measurement, Electronics, Computing and Communications 61(3) (2020), 352–360.

193.

Kumar

S.B.

Panda

Agrawal

, Brain magnetic resonance image tumor detection and segmentation using edgeless active contour, in: IEEE 2020 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT), Kharagpur, India, 1–3 July 2020.