Abstract
Recent technological developments in the medical domain demand further advances to address the problem of early disease detection. The recent pandemic has also accelerated progress in the medical domain through online consultation, in which physicians assess different diseases using clinical reports and medical images. A similar process is adopted in developing a Visual Question Answering (VQA) system for the medical field. This paper discusses existing medical VQA datasets, appropriate techniques, suitable quantitative metrics, real-time challenges, and the implementation of one VQA approach with its algorithms and performance evaluation. The medical VQA datasets collected from multiple sources are presented from different perspectives (organ-wise, plane-wise, modality type and abnormality type) for better understanding and visualization. The techniques used in VQA are then grouped and explained based on their evolution, the complexity of the dataset and the need for semantics in understanding the questions. In addition, a VQA approach using VGGNet and LSTM is implemented for the existing and improved datasets and analyzed with accuracy and BLEU score metrics. The improved datasets, created through dataset reduction and augmentation, yielded better performance than the existing datasets. Finally, the challenges of the medical VQA domain are examined in terms of datasets, combining techniques, and modifying the parameters of existing performance metrics for future research.
Introduction
Visual Question Answering (VQA) combines computer vision and natural language processing. It takes an image and one or more natural language questions about the image as input, and generates an answer in natural language as output. The amount of information that must be retrieved from images in textual form is rapidly increasing. Therefore, Natural Language Processing (NLP) and computer vision are both important for retrieving and analyzing an image based on text. VQA is used to (i) assist partially sighted people in reading restaurant menus, following instructions on how to cook a packaged meal, or determining whether their clothes match, (ii) help in medical decision-making and assist medical students in evaluating their domain-specific knowledge, and (iii) help the general public answer questions about road access and travel using satellite images during natural disasters. VQA datasets are broadly classified into three types: natural (indoor and outdoor images), abstract (cartoon and gaming images), and medical (images of different organs and modalities) [1–3]. This paper focuses on medical VQA datasets, the techniques used in their implementation and their real-time challenges.
The motivations of this paper are as follows: (i) A comprehensive interpretation of medical VQA datasets, techniques and challenges helps researchers understand the current status of the medical VQA domain and plan further development based on the research gaps discussed. (ii) The evolution of the VQA-MED datasets from 2018 to 2021, together with the VQA-RAD dataset, is discussed along with an interpretation of the datasets based on multiple factors. These inferences help to identify the characteristics of the datasets and to improve them, which in turn influences model performance. (iii) Recent and efficient techniques for a particular dataset can be identified from the categorization of the approaches, which helps in choosing an appropriate technique for the type of dataset. (iv) Some of the datasets have an imbalanced number of samples and give poor performance. Improving such data may encourage further research.
A medical VQA system has become a necessity in recent times because it supports clinical decision-making and improves patient engagement. For instance, patients in Africa can access their health-related data (both structured and unstructured) via a patient portal. However, the difficulty of finding domain-specific medical experts to answer their questions has motivated the development of medical VQA systems. In the African nation of Cameroon, for instance, not all hospitals have specialists such as radiologists, gynecologists, and cardiologists. Patients are consequently forced to seek the opinion of a general physician who has minimal domain-specific knowledge, which can result in an incorrect diagnosis in certain cases. To address this issue, a medical VQA system has been developed to answer the patient’s queries with correct information when they need it. Further, the clinician’s confidence is increased by treating the system-generated answer as a second opinion. Overall, advances in medical VQA have greatly reduced disparities in a country like Cameroon, with its small number of medical experts. To achieve this, substantial research in the medical VQA domain needs to be carried out. Hence, this paper has the following objectives: (i) to comprehensively interpret medical VQA datasets and techniques; (ii) to improve the dataset, based on the inferences drawn from its interpretation, through data reduction and augmentation approaches that address the data imbalance problem; (iii) to develop a medical VQA model for the original, reduced, augmented, and augmented reduced datasets and validate the resulting model using performance metrics; and (iv) to summarize the challenges in terms of dataset acquisition, performance metrics, model generation, and computing efficiency. The novelty of this work lies in addressing the second and fourth objectives.
In addition, a small number of researchers worldwide are working in this area and publishing their work in conferences and journals. Fig. 1 depicts, as a sunburst chart in chronological order, the research papers presented at conferences and published in journals that relate to medical VQA datasets, techniques, and performance metrics.

Sunburst chart: a count of medical VQA research papers.
The medical VQA datasets used in those papers cannot be created artificially and need to be sourced from hospitals or clinics; hence, dataset collection is a major challenge. Since 2018, an external forum named ImageCLEF has supported the field by providing real-time medical datasets and organizing tasks to encourage researchers. The ImageCLEF VQA-MED 2018, 2019, 2020 and 2021 datasets have 2,866, 4,200, 5,000 and 5,500 radiology images and 6,413, 15,292, 5,000 and 5,500 question-answer pairs, respectively. As a result, a total of 13,066 radiology images and 27,705 question-answer pairs are currently available, counting once the 4,500 samples that the 2021 dataset shares with the 2020 dataset. These datasets are described along with the VQA-RAD dataset in Section 2.
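As a quick arithmetic check, the totals quoted above match the per-year counts when the 4,500 samples that VQA-MED 2021 reuses from VQA-MED 2020 are counted only once. That reading of the totals is an assumption, and the variable names below are illustrative.

```python
# Per-year counts as quoted in the text.
images = {2018: 2866, 2019: 4200, 2020: 5000, 2021: 5500}
qa_pairs = {2018: 6413, 2019: 15292, 2020: 5000, 2021: 5500}

# Assumption: the 4,500 samples shared by the 2020 and 2021 datasets
# are counted once in the reported totals.
SHARED_2020_2021 = 4500

total_images = sum(images.values()) - SHARED_2020_2021
total_qa_pairs = sum(qa_pairs.values()) - SHARED_2020_2021

print(total_images, total_qa_pairs)  # 13066 27705
```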
The datasets were used to design and develop medical VQA systems with appropriate deep neural network techniques. These techniques enable the model to progressively learn features from the data, especially unstructured data, at multiple levels in order to learn complex patterns. The deep neural learning techniques for image feature extraction include the Convolutional Neural Network (CNN) [4] and pre-trained models such as VGGNet [5], ResNet [6, 46] and CoTNet152 [27]. The deep neural learning techniques for text feature extraction include the Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM) [7], Gated Recurrent Unit (GRU) [8], Gated Linear Unit (GLU) [9], Bidirectional Encoder Representations from Transformers (BERT) [10], and BioBERT [11]. Alongside these image and text feature extraction techniques, Multimodal Compact Bilinear Pooling (MCB) [12], Multimodal Factorized Bilinear Pooling Network (MFB) [6], Multimodal Factorized High-order Pooling (MFH) [11], Support Vector Machine (SVM) [14] or Feed Forward Neural Network (FFNN) [26] are used for feature concatenation. Other compositional VQA techniques include the Stacked Attention Network (SAN) [14], Bilateral Branch Network (BBN) [15], and Bilinear Attention Network (BAN) [16].
The VQA models developed for different medical datasets using the appropriate techniques are validated with performance metrics. Section 3 explains those metrics and discusses their benefits and computation. In addition, the Ground Truth Answer (GTA) of each image is provided by human annotators, as given in Equation 1, for use in analyzing VQA models. The $ACC_{GTA}$ reaches 100% if at least three human annotators give exactly the same answer for an image.
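The consensus rule described above resembles the accuracy measure commonly used in VQA benchmarks, where an answer counts as fully correct once at least three annotators agree with it. A minimal sketch, assuming that form; the function name `consensus_accuracy` and the lower-casing used for exact matching are illustrative and not taken from Equation 1:

```python
def consensus_accuracy(predicted, human_answers, required=3):
    """Score one predicted answer against several human answers.

    Assumed form: min(matches / required, 1.0), so the score reaches 1.0
    (100%) once at least `required` annotators gave exactly this answer.
    """
    normalized = predicted.strip().lower()
    matches = sum(1 for answer in human_answers
                  if answer.strip().lower() == normalized)
    return min(matches / required, 1.0)


# Three of the four annotators agree with the prediction, so the score is 1.0.
print(consensus_accuracy("axial", ["axial", "Axial", "axial", "sagittal"]))
```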
The rest of the paper is structured as follows. Section 2 discusses the medical VQA datasets and their inferences. Section 3 describes the performance metrics used to analyse the model quantitatively, while Section 4 explains different approaches to medical VQA, with their merits and demerits. Section 5 outlines a medical VQA system and its design. Section 6 explains the experiments and results of the medical VQA model using quantitative metrics. Section 7 discusses the challenges of existing medical VQA systems and suggests possible solutions to overcome them. Finally, the conclusion and future work are summarized in Section 8.
This section discusses the four medical VQA-MED datasets from ImageCLEF, as well as another dataset named VQA-RAD. Sample images from different categories of the VQA-MED and VQA-RAD datasets are shown in Figs. 2 and 3, respectively. Moreover, these medical VQA datasets are described in Table 1.

Samples from the ImageCLEF VQA-MED datasets.

Samples from the VQA-RAD dataset.
Dataset description
The medical VQA-MED dataset developed by Hasan et al. in 2018 contains radiology images along with their captions, which were extracted from PubMed Central articles and were part of the ImageCLEF 2017 caption prediction task [21]. Two expert human annotators were assigned to check and validate the question-answer pairs associated with the radiology images in two passes. In the first pass, syntactic and semantic correctness was checked by the first annotator. In the second pass, the validation and test sets were evaluated for clinical relevance by the second annotator, a medical expert. The finalized corpus consists of 2,866 medical images associated with 6,413 question-answer pairs.
VQA-RAD dataset
The VQA-RAD dataset overcomes two drawbacks of the VQA-MED 2018 dataset: firstly, a number of irrelevant questions and, secondly, image sequences and 3D reconstructions, both of which are rarely used in direct patient care.
In [12], Lau et al. constructed this dataset, which focuses on CT/MRI images of the head, chest, and abdomen. The question types relate to planes, modalities, organ systems, abnormalities, colors, and sizes, as shown in Fig. 3. The answer types are open-ended and closed-ended. This dataset consists of 315 images associated with 3,515 Question-Answer (QA) pairs, of which 1,515 are free-form and 733 are rephrased questions. The remaining 1,267 questions are framed and have corresponding free-form and rephrased questions.
VQA-MED 2019 dataset
The VQA-MED 2019 dataset shows an increase in the number of samples and their diversity, and it is more organized than the 2018 dataset. This is achieved by selecting relevant radiology images from the MedPix database with filters corresponding to captions, planes, categories, modalities, localities, and diagnostic methods. The VQA-MED 2019 dataset consists of samples related to sixteen planes, ten organs, and thirty-six modalities. As a result, the overall dataset size has increased to 4,200 images, each associated with 3 to 4 questions, along with 12,792 QA pairs [18]. Further, a medical doctor and a radiologist performed a manual double validation of the test answers and corrected ten questions, which is 3% of the total test set size.
VQA-MED 2020 dataset
The VQA-MED 2020 dataset focuses on abnormality-type samples of different organs, planes, and modalities. The ImageCLEF forum created this dataset automatically by (i) applying filters to select relevant images and associated annotations, (ii) creating patterns to generate questions and answers, and (iii) selecting relevant medical images from the MedPix database based on their captions, localities, and diagnostic methods. The final list has 330 medical problems, where each problem occurs at least 10 times in the created VQA data. This dataset consists of 5,000 radiology images and 5,000 QA pairs that are divided among training, validation, and test sets [19]. The most common medical problems and their frequencies in this dataset are pulmonary embolism (114), acute appendicitis (109), angiomyolipoma (68), osteochondral (63), adenocarcinoma of the lung (60), and sarcoidosis (58).
VQA-MED 2021 dataset
The VQA-MED 2021 dataset [20] focuses primarily on abnormality-type questions for different categories such as organs, planes, and modalities, as in the previous year. This dataset consists of 5,500 samples, of which 4,500 are from the VQA-MED 2020 dataset, and the complete dataset is divided into training, validation, and test sets. Compared with 2020, the quality of the 2021 data shows a marked improvement because the reference answers of the test set were validated by a medical doctor.
Dataset interpretation and analysis
The interpretation and analysis of the medical VQA datasets are given in Tables 2 to 5 and Fig. 4 for better understanding. The use of these datasets is analyzed in terms of advantages and disadvantages in Table 2.
Advantages and disadvantages of medical VQA datasets
Brief inferences from the medical VQA datasets in terms of questions
Brief inferences from the medical VQA datasets in terms of answers
Maximum-occurring questions and answers with their frequencies

Radar graph for answers of VQA-MED datasets.
The questions with the maximum and minimum length are tabulated, along with their vocabulary size, in Table 3. The questions are of two types, open-ended and closed-ended. Closed-ended questions begin with ‘is’, ‘are’ and ‘does’, while open-ended questions begin with ‘what’, ‘which’, and ‘how’. Of the two question types, those beginning with ‘what’ and ‘is’ account for the maximum-length and minimum-length questions, respectively.
The longest and shortest answers, along with their vocabulary sizes, are computed and tabulated in Table 4. It is inferred that most of the longest answers offer information related to abnormalities, while the vocabulary size for all the shortest answers is one. Most answers in the datasets are closed-ended, and the number of samples with open-ended and closed-ended answers is also depicted.
Table 5 shows the most frequently occurring questions and answers, along with their respective frequencies. The “what” question type (open-ended) and the answer “pulmonary embolism” occur frequently in most of the medical VQA datasets.
The frequently occurring answers and their frequencies are illustrated in Fig. 4 for the four VQA-MED datasets using radar graphs. In each graph, the answer with the maximum frequency reaches the outer circle and, as frequency decreases, the respective answers fall on the inner circles. From Fig. 4(b) for the VQA-MED 2019 dataset, it is inferred that “t2” (the least frequently occurring answer) has the smallest elevation, in the innermost circle, while “axial” (the most frequently occurring answer) has the largest elevation and touches the outer circle.
The performance of the VQA model is validated by means of the quantitative metrics discussed in this section. The system-generated answer is evaluated against the ground-truth answers using performance metrics such as the Bilingual Evaluation Understudy (BLEU) score, Word-Based Semantic Similarity (WBSS), and accuracy. All three performance metrics range from 0.0 (completely different results) to 1.0 (the ideal result). The BLEU score [24] is calculated by comparing the n-grams in the system-generated answer with the n-grams in the ground truth answer, as given in Equation 2. The WBSS score is computed based on the Wu-Palmer Similarity (WUPS) (Wu and Palmer, 1994), with the WordNet ontology in the back end, as given in Equation 3. In 2015, Antol et al. [1] stated that accuracy measures the exact match between model-generated and ground truth answers; it is computed based on Equation 4.
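Assuming that Equations 2 to 4 follow the usual definitions of these metrics, and using the symbols defined in this section, the three measures can be written in their standard forms as:

```latex
\begin{align}
p_n &= \frac{\sum_{S \in C} \sum_{\text{n-gram} \in S} \mathrm{Count}_{\mathrm{clip}}(\text{n-gram})}
            {\sum_{S' \in C} \sum_{\text{n-gram}' \in S'} \mathrm{Count}(\text{n-gram}')} \tag{2} \\
\mathrm{WUP}(C_1, C_2) &= \frac{2 \cdot \mathrm{depth}(LCS)}
            {\mathrm{depth}(C_1) + \mathrm{depth}(C_2) + 2 \cdot \mathrm{depth}(LCS)} \tag{3} \\
\mathrm{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \tag{4}
\end{align}
```

Here $\mathrm{Count}_{\mathrm{clip}}$ limits each n-gram count in the system-generated answer to its count in the ground truth answer. The full BLEU score then combines the $p_n$ values through a geometric mean with a brevity penalty.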
In Equation 2, $C$ represents the corpus, $S$ the sentences in the corpus, n-gram a contiguous sequence of n items in a sentence, and $p_n$ the BLEU n-gram precision (i.e., it counts the n-gram matches for every hypothesis sentence $S$ in the corpus $C$). In Equation 3, $C_1$ and $C_2$ represent two concept nodes in the WordNet taxonomy. The Lowest Common Subsumer (LCS) is the common parent of $C_1$ and $C_2$ with the minimum node distance. Here, depth($C_1$) indicates the number of nodes from $C_1$ to the LCS node, depth($C_2$) the number of nodes from $C_2$ to the LCS node, and depth(LCS) the number of nodes from the LCS node to the root node. In Equation 4, TP, TN, FP, and FN represent true positive, true negative, false positive, and false negative, respectively. The numerator represents the number of correctly classified samples and the denominator denotes the total number of samples.
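The three metrics can be sketched in a few lines of code. This is a minimal illustration that assumes the standard definitions (clipped n-gram precision for BLEU, Wu-Palmer similarity from the three depths described above, and accuracy from the confusion counts). The function names are illustrative, whitespace tokenization is a simplification, and the brevity penalty of the full BLEU score is omitted.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_precision(hypotheses, references, n):
    """Corpus-level clipped n-gram precision p_n (no brevity penalty)."""
    matched = total = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_counts = Counter(ngrams(hyp.split(), n))
        ref_counts = Counter(ngrams(ref.split(), n))
        # Clip each hypothesis n-gram count by its count in the reference.
        matched += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total += sum(hyp_counts.values())
    return matched / total if total else 0.0


def wu_palmer(depth_c1, depth_c2, depth_lcs):
    """Wu-Palmer similarity; depth_c1 and depth_c2 are distances to the LCS."""
    denominator = depth_c1 + depth_c2 + 2 * depth_lcs
    # All depths zero means both concepts are the root itself.
    return 2 * depth_lcs / denominator if denominator else 1.0


def accuracy(tp, tn, fp, fn):
    """Fraction of correctly classified samples."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


# "axial" appears twice in the hypothesis but once in the reference,
# so only one occurrence is credited: 2 matches out of 3 unigrams.
print(ngram_precision(["axial plane axial"], ["axial plane"], 1))
```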
This section explains the deep neural network techniques used in developing medical VQA models. Existing VQA approaches are categorized into joint embedding, hybrid and compositional, depending on the uniqueness of the techniques. In the joint embedding approach, individual techniques are used for image and text feature extraction, while feature concatenation is undertaken by a single architecture. In the hybrid VQA approach, an attention mechanism is used alongside joint embedding. The third category is the compositional approach, in which feature extraction and concatenation are executed by a single architecture. These approaches are listed in Tables 6, 7 and 8 with respect to the datasets used and the performance metrics applied under the different categories. The merits, demerits and usage of the different VQA approaches are given in Table 9.
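The joint embedding idea can be sketched with toy vectors: features from separate image and text extractors are concatenated and passed to a single classifier over candidate answers. The feature values, weights and answer list below are hypothetical placeholders rather than outputs of VGGNet or LSTM, and the linear-plus-softmax classifier is an assumed simplification.

```python
import math


def linear(weights, bias, x):
    """Apply one dense layer: one score per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(scores):
    """Convert scores into probabilities in a numerically stable way."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_embedding_answer(image_features, question_features,
                           weights, bias, answers):
    """Concatenate both feature vectors and classify over the answers."""
    fused = list(image_features) + list(question_features)
    probabilities = softmax(linear(weights, bias, fused))
    best = max(range(len(answers)), key=probabilities.__getitem__)
    return answers[best], probabilities


# Hypothetical 2-d image features, 2-d question features, two answers.
answer, probs = joint_embedding_answer(
    [1.0, 0.0], [0.0, 1.0],
    weights=[[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]],
    bias=[0.0, 0.0],
    answers=["axial", "sagittal"],
)
print(answer)  # axial
```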
A comparison and analysis of different medical VQA techniques in the joint embedding VQA approach
A comparison and analysis of different medical VQA techniques in the hybrid VQA approach
A comparison and analysis of different medical VQA techniques in the compositional VQA approach
Merits, demerits and the usage of medical VQA approaches
Researchers worldwide have applied different joint embedding VQA methods to the four medical VQA datasets. Some of the significant methods are discussed in this section.
Allaouzi et al. developed a VQA model using a combination of VGG16, LSTM and decision tree, resulting in a BLEU score of 0.054 because the decision tree-based classification approach ignored word order information. However, Peng et al. achieved an improved BLEU score of 0.108 using ResNet, LSTM and MFB, since the model captured and integrated fine-grained contextual information effectively for the VQA-MED 2018 dataset.
For the VQA-MED 2019 dataset, Allaouzi et al. used the Greedy Search Algorithm (GSA) to predict the next word based on the previously generated words; the final answer is built by recursively calling the model, which resulted in a BLEU score of 0.624. Al-Sadi et al. created the VQA model using five sub-models (organs, planes, abnormalities, modalities, and answer prediction). Of these, the first four are image classification models pre-trained on the VGG16 network, while the last is developed with the LSTM, based on the text, with repetitive questions for all categories. The results show that accuracy for planes, organs, and modalities ranges from 65% to 72% with improvisation, while the abnormality model accuracy is as low as 18%, which reduces the overall performance and therefore requires further attention. Lubna et al. [25] constructed a VQA classification model that answers only modality-based questions for different radiology images using the CNN and obtained the highest accuracy of 83.8%.
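Greedy search decoding, as described above, can be sketched as a loop that repeatedly asks a model for next-word scores and appends the highest-scoring word until an end token appears. The scoring table `TOY_SCORES` and the `<start>`/`<end>` tokens are hypothetical stand-ins for a trained model and its vocabulary. The toy scorer looks only at the last word, although the decoder passes it the full history.

```python
def greedy_decode(next_word_scores, start="<start>", end="<end>", max_len=10):
    """Build an answer word by word, always taking the top-scoring word."""
    words = [start]
    for _ in range(max_len):
        scores = next_word_scores(words)      # scores given the words so far
        best_word = max(scores, key=scores.get)
        if best_word == end:
            break
        words.append(best_word)
    return " ".join(words[1:])


# Hypothetical next-word scores standing in for a trained model.
TOY_SCORES = {
    "<start>": {"pulmonary": 0.7, "acute": 0.3},
    "pulmonary": {"embolism": 0.9, "<end>": 0.1},
    "acute": {"appendicitis": 0.8, "<end>": 0.2},
    "embolism": {"<end>": 1.0},
    "appendicitis": {"<end>": 1.0},
}


def toy_next_word_scores(words):
    """Return the score table for the most recent word."""
    return TOY_SCORES[words[-1]]


print(greedy_decode(toy_next_word_scores))  # pulmonary embolism
```

Greedy search keeps only the single best word at each step, so it is fast but can miss answers whose first word is not the locally best choice.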
The GLU and Multi-Layer Perceptron (MLP) methods were applied to the VQA-MED 2020 dataset, and BioBERT to the VQA-MED 2021 dataset. The GLU technique narrowed down the predicted answer to some extent, but it was not adequate for handling questions related to abnormality types and hence resulted in a BLEU score of 0.350. The MLP classified the abnormality category effectively by treating it as a multi-class image classification problem with an attention score for each region; as a result, it achieved a BLEU score of 0.441. BioBERT extracted semantic information from the dataset and in this way enhanced the overall performance of the model, with a BLEU score of 0.402. Liu et al. [26] concluded that image retrieval (retrieving the answer for a sample in the test set based on the maximum cosine similarity with the samples in the training set) performs poorly on datasets other than VQA-RAD, VQA-MED 2018 and VQA-MED 2019. Zhang et al. [27] represented the medical image as a type point and combined it with the QA pairs, because this improved the reasoning ability across different modalities.
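The retrieval strategy mentioned above, which returns the answer of the training sample with the maximum cosine similarity to the test sample, can be sketched as follows. The feature vectors and answers are hypothetical, and how the features are extracted is left open.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_answer(test_vector, train_vectors, train_answers):
    """Return the answer of the most similar training sample and its score."""
    similarities = [cosine_similarity(test_vector, v) for v in train_vectors]
    best = max(range(len(similarities)), key=similarities.__getitem__)
    return train_answers[best], similarities[best]


# Hypothetical 2-d features for two training samples and one test sample.
answer, score = retrieve_answer(
    [0.9, 0.1],
    train_vectors=[[1.0, 0.0], [0.0, 1.0]],
    train_answers=["axial", "sagittal"],
)
print(answer)  # axial
```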
The overall inferences from Table 6 are as follows: (i) The role of LSTM is not limited to text feature extraction but extends to feature concatenation. (ii) Both Turner et al. [28] and Sarrouti [4] used the same combination of techniques and attained reasonable BLEU scores of 0.550 and 0.441 on two different datasets, VQA-MED 2019 and 2020, respectively. (iii) Al-Sadi et al. [29, 30] used VGG16 for the VQA-MED 2019 and 2020 datasets and incorporated data augmentation techniques in the later work.
Hybrid VQA approach
In the hybrid VQA approach, an attention mechanism such as co-attention or Bahdanau attention is used with the joint embedding approach, as shown in Table 7. These attention mechanisms are used alongside image feature extraction or after feature concatenation because they can access every feature and select specific features from a sequence to produce an answer. The co-attention mechanism generated attentional features for both image and text, based on the attention weights of different grid locations [13]. The MFH with a co-attention mechanism learned the relatively important information from each part of the visual and textual features and ignored the irrelevant information [37]. Bahdanau attention maps the relationship between local image features when the mapping between the global image features and the text is insufficient [38]. In 2022, Wang et al. [39] stated that cross-modal self-attention captures the long-range dependencies between the image and the QA pairs, which helps in building a better feature representation that focuses on significant information. Tascon-Morales et al. [40] focused on answer consistency instead of improving the VQA architecture or reducing data-related limitations, because improving answer consistency (the answers to sub-questions agreeing with the answer to the main question) increases the performance of the model. Zhu et al. [41] stated that introducing semantic attention brought a certain degree of improvement in answer prediction.
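The common core of these attention mechanisms is to score every image region against the question, normalize the scores into weights, and form a weighted sum of region features. The sketch below uses plain dot-product scoring, which is a simplification chosen for brevity. It is not the additive scoring of Bahdanau attention or the co-attention used in the cited works, and the vectors are hypothetical.

```python
import math


def soft_attention(query, regions):
    """Weight image regions by their dot-product relevance to the query."""
    scores = [sum(q * r for q, r in zip(query, region)) for region in regions]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]   # stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(regions[0])
    # Context vector: attention-weighted sum of the region features.
    context = [sum(w * region[d] for w, region in zip(weights, regions))
               for d in range(dim)]
    return weights, context


# Hypothetical question vector and two image-region feature vectors.
weights, context = soft_attention([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0]])
print(weights)
```

The first region aligns with the question vector, so it receives the larger weight and dominates the context vector.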
The overall inference from Table 7 is that Liu et al. [8, 38] used a pre-trained model along with GRU and an attention mechanism for the VQA-MED 2019 and 2020 datasets. In the 2020 work, they upgraded the pre-trained model based on the characteristics of the datasets and applied seq2seq techniques as the fusion mechanism.
Compositional VQA approach
The compositional VQA approach uses advanced and unique techniques like the Multimodal Compact Bilinear Pooling (MCB), Stacked Attention Network (SAN), Bilateral Branch Network (BBN), Skeleton-based Sentence Matching (SSM) and Bilinear Attention Network (BAN). These techniques are arranged in terms of datasets and performance metrics in Table 8, and the inferences are discussed in this section.
In [12], Lau et al. observed that the MCB is able to differentiate the variance in the predicted answers from that of similar questions in the VQA-RAD dataset, and it attained a BLEU score of 0.468. Abacha et al. [14] incorporated the MCB for the VQA-MED 2018 dataset, owing to its multimodal representation capability, and compared it with the SAN. The MCB reduced the number of parameters, but its higher-dimensional computation increased its complexity. Moreover, the SAN outperformed the MCB because it queries an image multiple times and infers the answer progressively. In addition, fine-tuning the model prevented overfitting and improved performance slightly. Going one step further, Do et al. [16] integrated the SAN with the Multiple Meta-model Quantifying (MMQ) process, and the resulting model learned essential information from images and meta-models to answer challenging questions in the VQA-RAD and PathVQA (Pathology) datasets.
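The idea behind compact bilinear pooling can be sketched as follows: each feature vector is projected with a count sketch, and the two sketches are combined by circular convolution, which equals a count sketch of the full outer product without ever forming it. This is a minimal sketch under stated assumptions: the helper names, output dimension and seeds are illustrative, and the convolution is computed directly rather than with the FFT that practical implementations typically use.

```python
import random


def make_sketch_params(input_dim, output_dim, seed):
    """Draw a random bucket index and a random sign for every input index."""
    rng = random.Random(seed)
    buckets = [rng.randrange(output_dim) for _ in range(input_dim)]
    signs = [rng.choice((-1, 1)) for _ in range(input_dim)]
    return buckets, signs


def count_sketch(x, buckets, signs, output_dim):
    """Project x to output_dim by adding signed entries into buckets."""
    out = [0.0] * output_dim
    for value, bucket, sign in zip(x, buckets, signs):
        out[bucket] += sign * value
    return out


def circular_convolution(a, b):
    """Direct O(d^2) circular convolution of two equal-length vectors."""
    d = len(a)
    return [sum(a[j] * b[(k - j) % d] for j in range(d)) for k in range(d)]


def compact_bilinear_pooling(image_vec, text_vec, output_dim=8, seed=0):
    """Fuse two feature vectors into one output_dim-sized vector."""
    h1, s1 = make_sketch_params(len(image_vec), output_dim, seed)
    h2, s2 = make_sketch_params(len(text_vec), output_dim, seed + 1)
    return circular_convolution(
        count_sketch(image_vec, h1, s1, output_dim),
        count_sketch(text_vec, h2, s2, output_dim),
    )


# Hypothetical 3-d image features fused with 2-d question features.
fused = compact_bilinear_pooling([1.0, 2.0, 3.0], [4.0, 5.0])
print(len(fused))  # 8
```

The fused vector has a fixed size chosen independently of the product of the two input dimensions, which is how the parameter count of the following layer is kept small.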
Liao et al. [44] incorporated the SSM where questions with a similar structure are summarized into a unified backbone to simplify the question-comprehension segment. According to Chen et al. [45], long-tailed distributions of candidate answers in the VQA-MED 2020 dataset can be addressed by the BBN with a cumulative learning strategy. Their approach constructed a retrieval-based answer section by computing the cosine similarity between the input sample and all the training samples of the top five categories. Taking a step ahead, Eslami et al. [16] used a BBN ensemble to reduce overfitting in the training and validation data of the VQA-MED 2021 dataset by effectively modeling the imbalanced long-tailed distribution.
The proposed VQA system
The proposed VQA system comprises four stages, namely dataset improvement, medical VQA model generation, answer prediction and performance analysis, as shown in Fig. 5.

Proposed medical VQA system.
The VQA-MED 2020 and 2021 datasets are used in the proposed work. ImageCLEF focused on abnormality-type questions in these datasets; however, the overall model performance was low due to data imbalance. To deal with this difficulty, data reduction and augmentation approaches are applied to these datasets in different combinations. The dataset is then divided into training and test sets for VQA model generation and prediction.
The dataset is reduced in two ways: (i) eliminating the data pertaining to the least contributing labels, and (ii) identifying the labels whose sample count exceeds the required average count and reducing their samples. The resulting dataset is named the Reduced Dataset (RD). The dataset is augmented in two ways: (i) identifying samples of a similar type in other medical VQA datasets and adding them, and (ii) applying augmentation techniques, namely mixup and label smoothing, to generate data. The resulting dataset is named the Augmented Dataset (AD). Augmenting the dataset after reduction gives better results because the reduced dataset consists of effective samples; the result is named the Augmented Reduced Dataset (ARD).
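The reduction and augmentation steps can be sketched as below. The threshold `min_count`, the use of the integer average as the per-label cap, and the fixed mixing coefficient are assumptions made for illustration, since the exact criteria are not specified here. Label smoothing and mixup follow their standard formulations; in practice the mixup coefficient is usually sampled from a Beta distribution.

```python
from collections import Counter


def reduce_dataset(samples, min_count):
    """Drop rare labels, then cap each remaining label at the average count.

    `samples` is a list of (item, label) pairs; `min_count` is an assumed
    threshold below which a label is treated as least contributing.
    """
    counts = Counter(label for _, label in samples)
    kept = {label for label, c in counts.items() if c >= min_count}
    if not kept:
        return []
    cap = max(1, sum(counts[label] for label in kept) // len(kept))
    seen = Counter()
    reduced = []
    for item, label in samples:
        if label in kept and seen[label] < cap:
            seen[label] += 1
            reduced.append((item, label))
    return reduced


def smooth_labels(one_hot, epsilon=0.1):
    """Label smoothing: move epsilon of the mass to a uniform distribution."""
    k = len(one_hot)
    return [(1 - epsilon) * y + epsilon / k for y in one_hot]


def mixup(x1, y1, x2, y2, lam):
    """Mixup: convex combination of two samples and of their label vectors."""
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y


# Hypothetical imbalanced label distribution: 6 x "a", 2 x "b", 1 x "c".
toy = [(i, "a") for i in range(6)] + [(i, "b") for i in range(2)] + [(0, "c")]
reduced = reduce_dataset(toy, min_count=2)
print(Counter(label for _, label in reduced))
```

In the toy example the label "c" is dropped as too rare, and "a" is capped at the average count of the remaining labels.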
The medical VQA model generated using the joint embedding approach (VGGNet and LSTM) is explained in Algorithms 1 and 2. Algorithm 1 explains the generation of the VQA model from the training set and comprises encoder and decoder phases. Algorithm 2, a sub-part of Algorithm 1, explains word vector generation using an embedding method. Answer prediction for the test set based on the generated VQA model is detailed in Algorithm 3. The predicted answers are evaluated using two performance metrics, namely accuracy and BLEU score.
The performance analysis of the medical VQA model for the VQA-MED 2020 and 2021 datasets is given in Table 10. In this table, OD, RD, AD and ARD represent the Original Dataset, Reduced Dataset, Augmented Dataset and Augmented Reduced Dataset, respectively, and ATD represents the augmentation of the VQA-MED 2020 test set with the VQA-MED 2021 dataset.
VQA-MED datasets Vs Performance metrics for the proposed model
From Table 10, it is inferred that the ARD of VQA-MED 2020 and the ATD of VQA-MED 2021 yielded better BLEU scores, 0.330 and 0.227 respectively, than the original datasets. The accuracy on the improved datasets also increased by 0.6% and 2.4% for the VQA-MED 2020 (ARD) and 2021 (ATD) datasets, respectively, compared with the original datasets. Compared with Mohamed et al. [7], the proposed model improved accuracy and BLEU score by 0.3% and 0.4%, respectively, on the VQA-MED 2020 dataset. The accuracy and BLEU score for the Reduced Dataset (RD) and Augmented Dataset (AD) were also computed, but they are lower than those of the ARD and ATD.
In Table 11, the number of correctly classified samples for each medical VQA dataset is tabulated along with the percentage improvement. From the table, it is inferred that the number of correctly classified samples increased by 0.6% and 1.4% for the improved datasets of VQA-MED 2020 (ARD) and 2021 (ATD), respectively, compared with the original datasets. The percentage increase is calculated from the number of correctly classified samples (Table 11) relative to the total number of samples (Table 1) for both datasets.
Number of correctly classified samples Vs datasets
From the investigations conducted, the following inferences are made. (i) The VQA-MED datasets from 2018 to 2021 were improved each year based on the knowledge gained from the previous year’s dataset. (ii) Across the datasets, the length of the questions ranges from 3 to 28 words and the length of the answers ranges from 1 to 26 words. (iii) The maximum frequency of a question with respect to a class ranges from 14 to 1126, and that of an answer from 74 to 1558. (iv) The three approaches suit different scenarios depending on the characteristics of the dataset: joint embedding is chosen when there is a clear understanding of the dataset, the hybrid approach is chosen when the answer to the question requires features from more than one region of the image, and the compositional approach can be selected when the characteristics of the dataset are not completely understood but its performance needs to be evaluated. (v) Dataset reduction and augmentation improve the performance of the model. For instance, the performance of the model increased by 0.6% and 1.4% for the VQA-MED 2020 and 2021 datasets, respectively.
Moreover, different challenges in medical VQA datasets, techniques, performance metrics and computational efficiency are discussed in Table 12 with possible solutions, as applicable. Finally, the proposed system has been compared with other medical VQA works.
Challenges and solutions in medical VQA domain
Besides, the proposed work is compared with that of Lin et al. [47] and Al-Sadi et al. [48]. In contrast to Lin et al. [47], this paper analyzes the medical VQA datasets and interprets significant information from them, which is then considered in creating the VQA model. Additionally, limitations in terms of datasets, performance metrics, and techniques are included in each section. While Al-Sadi et al. [48] worked with the VQA-MED 2019 dataset alone in 2021, our work deals with the VQA-MED 2020 and 2021 datasets and uses suitable augmentation techniques to maximize the performance of the model.
This paper addresses four perspectives: (i) a comprehensive interpretation of medical VQA datasets and techniques; (ii) improvement of the dataset, based on the inferences drawn from its interpretation, using data reduction and augmentation approaches; (iii) generation of a medical VQA model for the original and improved datasets and comparison of the results using quantitative metrics; and (iv) a summary of the challenges in medical VQA dataset acquisition, techniques, performance metrics and memory requirements, which paves the way for further development in the medical VQA domain. Five medical VQA datasets are studied and significant information is inferred from their samples, which helps in understanding the datasets and paves the way for their improvement. The dataset is improved by data reduction and augmentation approaches to enhance performance. The medical VQA models developed for these datasets are compared and analysed using quantitative metrics. From the results, it is inferred that the Augmented Reduced Dataset (ARD) and the Augmentation of the VQA-MED 2020 Test set with the VQA-MED 2021 Dataset (ATD) give an improved accuracy of 0.6% and 2.4% compared with the VQA-MED 2020 and 2021 datasets, respectively. The challenges in medical VQA datasets are discussed along with their possible solutions. The interpretation of the datasets, the categorization of approaches, the development of a medical VQA model for five datasets, and the analysis of research gaps in terms of challenges distinguish this paper from other medical VQA review and survey papers.
In future work, we plan to address the challenges in the medical VQA domain and incorporate advanced medical VQA techniques to improve the performance of the model.
