Visual Recalibration and Gating Enhancement Network for Radiology Report Generation

Abstract

Automatic radiology report generation is an important application of artificial intelligence in health care. It helps doctors produce comprehensive diagnostic reports and alleviates the heavy workload of medical professionals. However, generating radiology reports faces two challenges: (1) visual and textual data bias and (2) long-distance dependencies. To tackle these issues, we design a visual recalibration and gating enhancement network (VRGE), which consists of a visual recalibration module (VRM) and a gating enhancement module (GEM). Specifically, the VRM enhances the recognition of abnormal features in lesion areas of medical images. The GEM dynamically adjusts the contextual information in the report through gating mechanisms, focusing on capturing professional medical terminology in the report text. Experiments on the public IU X-Ray dataset show that the VRGE outperforms existing models.

1. INTRODUCTION

Medical imaging is pivotal in diagnosing and treating a wide range of diseases. Traditionally, clinical imaging reports rely on the extensive medical knowledge and practical experience of health care professionals. Writing an accurate, high-quality report usually requires an experienced imaging expert to provide detailed information and analysis. However, because professional knowledge differs among doctors, producing reports that are consistent with the radiological images is a major challenge. In addition, doctors usually need 10 minutes or longer to write a single medical imaging report (Alfarghaly et al., 2021; Li et al., 2018).

The increasing number of medical images adds to the already heavy report-writing workload of doctors. Variations in workload and in the condition of individual doctors may lead to time-consuming processes and inconsistent report quality. To address these challenges, automatic medical imaging report generation has become an urgent task in clinical practice. It not only reduces the burden on health care professionals but also accelerates the clinical workflow, thereby improving the overall quality and standardization of health care (Hou et al., 2023a).

Recently, deep learning has made remarkable strides in image captioning, producing precise short descriptions of natural images. This success has ignited considerable research interest in radiology report generation. Many existing methods (Jing et al., 2018b; Liang et al., 2017; Yu et al., 2016) adopt image-captioning models with a traditional encoder-decoder architecture: in the encoding phase, visual features are extracted from medical images by Convolutional Neural Networks (CNNs); in the decoding phase, the report is generated by a Transformer (Liu et al., 2021a; Lu et al., 2017). Automatically generating radiology reports involves producing detailed, professional, terminology-rich long text that covers all findings, both normal and abnormal, in the medical images. Although radiology report generation has made substantial progress, it still faces the following challenges:

Visual and textual data bias: In medical imaging datasets, normal images outnumber abnormal ones. Moreover, as shown in Figure 1, the abnormal region occupies only a small part of a radiological image, resulting in an imbalanced distribution of visual information. When writing a radiology report, doctors usually describe all the observations in the image. Because of this imbalance, generated reports contain few descriptions of abnormal findings and many similar statements about normal findings, resulting in visual and textual data bias (Hou et al., 2023a; Li et al., 2022; Liu et al., 2021b).

Long-distance dependency problem: Generating radiology reports requires a detailed description of each region in the medical image. However, because abnormal feature information is unevenly distributed, the model struggles with long-distance dependencies and has difficulty associating lesion information and capturing global context. As a result, it is hard to achieve sufficient accuracy and coherence in the generated reports (Hou et al., 2023b).

FIG. 1.

A chest X-ray image and its report. Red represents organs, whereas blue with underline represents descriptions of abnormal parts.

This article proposes a visual recalibration and gating enhancement network (VRGE) to tackle the above challenges. We introduce a visual recalibration module (VRM) to enhance the visual features extracted from medical images and improve the VRGE’s sensitivity to abnormal characteristics. In addition, to relieve text bias and model long-text dependencies, we adopt a gating enhancement module (GEM) in the Transformer decoder to associate disease information effectively and integrate contextual information from radiology reports. Our specific contributions are as follows:

• The VRGE employs the VRM, which improves the model’s ability to identify crucial parts and diseased areas in radiological images by re-adjusting and enhancing the extracted visual features.

• The GEM associates and integrates contextual information from input sequences to capture key disease feature representations in medical reports, which relieves the long-distance dependency problem.

• Comprehensive experiments demonstrate that the VRGE model achieves state-of-the-art performance compared with other baselines on the public IU X-Ray dataset.

2. RELATED WORK

2.1. Image captioning

The image captioning task aims to generate natural language descriptions from the information extracted from images. Early methods usually adopt a CNN-RNN structure, extracting image features with a CNN and generating a single-sentence description with a Recurrent Neural Network (RNN). Vinyals et al. (2015) designed the Neural Image Caption (NIC) model, which uses a CNN as the image encoder and an RNN as the decoder to generate semantically rich descriptions. To handle long-term dependencies in RNNs, later methods introduce structures such as long short-term memory networks (LSTMs) and gated recurrent units. These structures mitigate the vanishing and exploding gradient problems, making the models more capable.

Recently, attention mechanisms have been introduced into image captioning models to focus on crucial areas in the image, providing a basis for better expression of semantic information. Xu et al. (2015) proposed hard and soft attention mechanisms over CNN image features, thereby improving the coherence and relevance of the generated descriptions. Unlike image captioning, radiology report generation needs to consider the essential features in the image, analyze abnormalities and lesion locations in detail, and generate more professional diagnostic descriptions.

2.2. Medical report generation

Different from image captioning, medical report generation draws on all the content in the medical image (normal and abnormal structural findings) to generate a detailed text report with professional medical terms rather than a single-sentence description. Many existing methods generate radiology reports by aligning text and image information. Jing et al. (2018b) used a co-attention mechanism to locate sub-region features in medical images and generate corresponding descriptions. Chen et al. (2020) introduced a memory-driven module in the Transformer decoder layer, which stores critical information to enhance the generation process. Chen et al. (2022) further proposed memory query and memory response modules to align text and image information for radiology report generation. Liu et al. (2021) designed a contrastive attention model to extract contrastive image information and capture abnormal image features. Some methods also use additional prior knowledge to help models better understand the contextual information of medical reports. For example, Zhang et al. (2020) constructed a knowledge graph of chest findings and propagated information through graph convolutional networks, which improves the model’s ability to detect and analyze disease characteristics.

3. METHOD

Radiology report generation integrates textual and visual data. We use artificial intelligence methods to mine critical information from the input radiological medical images $Img = \{i_{1}, i_{2}, \dots, i_{N_{I}}\}$ and generate the corresponding radiology report $Y = \{y_{1}, y_{2}, \dots, y_{N_{R}}\}$, where $N_{I}$ is the number of radiology medical images and $N_{R}$ is the number of tokens in the report.

Next, we elaborate on the proposed VRGE, which aims to generate accurate radiology reports automatically. The VRGE involves four essential modules: the visual embedding module (VEM) extracts visual features from the medical images; the VRM recalibrates the visual features to enhance the representation of crucial content; the GEM captures and integrates contextual information to ensure the coherence and accuracy of reports; and the report generator (RG) produces the final radiology report. The overall structure of the VRGE is shown in Figure 2.

FIG. 2.

This is the overall structure of visual recalibration and gating enhancement network, which contains four core modules: Visual embedding module (VEM), Visual recalibration module (VRM), Gating enhancement module (GEM) and report generator (RG). CNNs, Convolutional Neural Networks.

3.1. Visual features representation

The visual features representation utilizes the VEM and the VRM. These components concentrate on the lesion part and enhance the detection of abnormal visual characteristics within radiological images. This approach is designed to mitigate the impact of data bias, allowing for more subtle and accurate feature extraction from medical images.

3.2. Visual embedding module

The VEM takes radiological medical images $Img$ as input, each with a dimension of $3 \times 224 \times 224$. We use a CNN as the visual encoder to derive visual feature representations from the medical images. To obtain rich visual feature representations, we adopt ResNet-101 pre-trained on ImageNet as the backbone network.

After forward propagation through ResNet-101, the medical image $Img$ yields convolutional feature maps. We extract the visual features from the last convolutional layer of the network and represent them as a sequence of vectors $V = \{v_{1}, v_{2}, \dots, v_{N_{I}}\}$, $v_{i} \in \mathbb{R}^{d}$. The patch features are concatenated row by row, following their order in the image, to form a sequence for subsequent analysis:

$V = \{v_{1}, v_{2}, \dots, v_{N_{I}}\} = f_{vem}(Img)$ (1)

where $f_{vem}(\cdot)$ refers to the VEM.
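The row-by-row flattening described above can be illustrated with a minimal Python sketch. It assumes that each spatial position of the final feature map is treated as one patch vector, so that a 2048 x 7 x 7 map would give 49 vectors of length 2048. The function name `flatten_feature_map` and the toy input are hypothetical and are not taken from the authors' implementation.

```python
def flatten_feature_map(feature_map):
    """Turn a C x H x W feature map into a row-major sequence of H*W vectors of length C."""
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    sequence = []
    for row in range(height):          # rows are visited top to bottom
        for col in range(width):       # patches within a row, left to right
            sequence.append([feature_map[c][row][col] for c in range(channels)])
    return sequence


# Toy stand-in for the real feature map: 2 channels on a 2 x 3 spatial grid.
toy_map = [
    [[1, 2, 3], [4, 5, 6]],        # channel 0
    [[10, 20, 30], [40, 50, 60]],  # channel 1
]
patches = flatten_feature_map(toy_map)
print(len(patches), patches[0], patches[-1])
```

Each element of `patches` plays the role of one $v_{i}$ in Eq. (1).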

3.3. Visual recalibration module

Following VEM, we derive a sequence of visual features from radiological imaging. Subsequently, a VRM is utilized to enhance the capability in recognizing crucial structures and lesion positions within medical images.

Because of visual data bias and the uneven distribution of visual features, normal visual features dominate, whereas abnormal parts occupy a relatively small proportion. Moreover, as the number of layers in a deep neural network increases, low-level image features tend to be diminished in the uppermost layers.

To address this issue, residual blocks (RBs) introduce a skip connection mechanism that delivers input features directly to subsequent layers, preserving rich semantic information. To diagnose and analyze radiological medical images more accurately, the VRM uses RBs to capture critical visual features, especially when balancing normal and abnormal visual features. With RBs, the VRGE can focus on abnormal visual features, which improves the model’s performance. The VRM thus helps address data bias and imbalanced feature distribution and strengthens the model’s expressive ability in deep network structures, enabling it to capture critical information in medical images.

We feed $V = \{v_{1}, v_{2}, \dots, v_{N_{I}}\}$ extracted by the VEM into the RBs:

$T = \{t_{1}, t_{2}, \dots, t_{N_{I}}\} = \mathrm{ReLU}(f_{res}(V))$ (2)

where $f_{res}(\cdot)$ refers to the residual function.

Additionally, the VRM incorporates channel attention (CA) and spatial attention (SA) to boost the VRGE’s capacity to extract abnormal features. Through squeeze, excitation, and scaling operations, CA dynamically strengthens the attention paid to crucial channels of the input visual features. It facilitates the adaptive learning of lesion-related visual features, improving their discernibility and expressiveness for disease-related information in medical images. The features $T = \{t_{1}, t_{2}, \dots, t_{N_{I}}\}$, $t_{i} \in \mathbb{R}^{d}$, are passed to the channel attention, which performs the following operations:

$\hat{T} = Pool_{val}(T)$ (3)

$P = fc_{2}(fc_{1}(\hat{T}))$ (4)

$M_{CA} = \sigma(P)$ (5)

where $M_{CA} \in \mathbb{R}^{d}$ is the final output of the channel attention, and $Pool_{val}(\cdot)$, $fc(\cdot)$, and $\sigma(\cdot)$ denote variance pooling, a fully connected layer, and the sigmoid activation, respectively.
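The channel attention steps can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the authors' code: the sigmoid in Eq. (5) is read as acting on the output of Eq. (4), variance pooling is taken to be the population variance over spatial positions, and no activation is placed between the two fully connected layers because Eq. (4) shows none. All function names and the toy weights are hypothetical.

```python
import math


def variance_pool(channel):
    """Squeeze step: population variance of one channel's spatial responses."""
    mean = sum(channel) / len(channel)
    return sum((x - mean) ** 2 for x in channel) / len(channel)


def linear(vector, weights, bias):
    """Fully connected layer; weights holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weights, bias)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(features, w1, b1, w2, b2):
    """features: list of channels, each a flat list of spatial responses."""
    pooled = [variance_pool(ch) for ch in features]  # Eq. (3)
    hidden = linear(pooled, w1, b1)                  # fc_1
    logits = linear(hidden, w2, b2)                  # fc_2, Eq. (4)
    return [sigmoid(z) for z in logits]              # Eq. (5)


# Toy input: channel 0 is constant (variance 0), channel 1 varies (variance 1).
features = [[1.0, 1.0, 1.0, 1.0], [0.0, 2.0, 0.0, 2.0]]
gates = channel_attention(
    features,
    w1=[[1.0, 1.0]], b1=[0.0],               # 2 channels -> 1 hidden unit
    w2=[[0.0], [2.0]], b2=[0.0, 0.0],        # 1 hidden unit -> 2 channel gates
)
print(gates)
```

With these toy weights the varying channel receives the larger gate, which is the behavior the excitation step is meant to produce.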

The spatial attention mechanism focuses on local lesion areas in medical images. Enhancing the attention paid to abnormal regions enables the model to explore the details and microstructure of medical images more deeply. At the same time, spatial attention helps the model integrate contextual information across scales, which helps the VRGE understand image content comprehensively and improves lesion detection:

$M_{SA} = Conv_{3 \times 3}(T)$ (6)

where $M_{SA} \in \mathbb{R}^{d}$ is the final output of the spatial attention, and $Conv_{3 \times 3}(\cdot)$ refers to a $3 \times 3$ convolution layer.

We combine the outputs of CA and SA to obtain the final VRM features:

$M = V \oplus ((M_{CA} \oplus M_{SA}) \otimes T)$ (7)

where $\oplus$ is element-wise addition and $\otimes$ is element-wise product. $M = \{m_{1}, m_{2}, \dots, m_{N_{I}}\}$, $m_{i} \in \mathbb{R}^{d}$, is the final output of the VRM.
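As a worked instance of Eq. (7), the sketch below applies the combination to short toy vectors. It assumes that all four inputs have already been brought to the same shape, so that every operation is purely element-wise; the function name `recalibrate` and the numbers are hypothetical.

```python
def recalibrate(v, t, m_ca, m_sa):
    """Eq. (7): M = V + (M_CA + M_SA) * T, element-wise on equal-length vectors."""
    if not (len(v) == len(t) == len(m_ca) == len(m_sa)):
        raise ValueError("all inputs must have the same length")
    return [vi + (ca + sa) * ti for vi, ti, ca, sa in zip(v, t, m_ca, m_sa)]


v = [1.0, 2.0, 3.0]      # original visual features (skip path)
t = [0.5, 0.0, 2.0]      # residual-block output
m_ca = [0.5, 0.5, 0.5]   # channel attention weights
m_sa = [0.5, 1.0, 0.0]   # spatial attention weights
m = recalibrate(v, t, m_ca, m_sa)
print(m)  # [1.5, 2.0, 4.0]
```

The outer addition of $V$ acts as a skip path: where the attention-weighted term is zero, as in the second element, the original feature passes through unchanged.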

3.4. Gating enhancement module

Integrating a GEM in the decoder aims to improve the Transformer’s context-related understanding. GEM addresses text data bias and medical reports’ long-distance dependencies. GEM effectively tackles the challenge of long text dependencies by introducing information and state gates and captures contextual information from text reports. GEM identifies the imbalanced distribution of lesion information in the report and associates these details to improve the ability of VRGE to extract key disease details.

In the decoding process, the word embeddings of the radiology report are first input to the masked multi-head attention within the decoder. This step derives contextual representations $E = \{e_{1}, e_{2}, \dots, e_{N_{R}}\}$ for the position of each word in the report. Once these contextual representations are obtained, the decoder proceeds to the GEM. This sequential flow allows the model to capture nuanced information regarding the position and context of each word in the medical report.

An information gate regulates the flow of information and amplifies the significance of lesion-related content. The computation is as follows:

$E = MHA(y_{1}, \dots, y_{N_{R}})$ (8)

$E^{l} = \sigma(W_{info} \cdot [y_{i}, e_{i}] + b_{info})$ (9)

$C = Act(Nor(linear(Y))) \odot E^{l}$ (10)

where $C$ is the context representation produced by the information gate, and $linear(\cdot)$, $Nor(\cdot)$, $Act(\cdot)$, $\odot$, and $\sigma(\cdot)$ refer to the linear operation, normalization operation, activation function, element-wise multiplication, and sigmoid function, respectively. $W_{info}$ and $b_{info}$ are a learnable weight matrix and bias term, respectively.

Subsequently, the context representation $C$ is processed by the state gate. The state gate dynamically adjusts the encoder state based on the relevance of the current position in the medical report:

$S^{'} = Y \odot \sigma(W_{sta} \cdot [y_{i}, c_{i}] + b_{sta})$ (11)

where $W_{sta}$ and $b_{sta}$ denote a learnable weight matrix and bias term, respectively.

The output of the GEM is:

$G = C \oplus S^{'}$ (12)

where $G$ represents the output of the GEM.
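The gating computation for a single report position can be sketched as below. Several choices are assumptions, since the text does not fix them: the bracket notation is read as concatenation, $Act$ is taken to be tanh, $Nor$ is taken to be layer normalization without learnable parameters, and the attention context $e_{i}$ of Eq. (8) is passed in as a given vector. The parameter names in `params` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(weights, bias, vector):
    """W . x + b with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weights, bias)]


def layer_norm(vector, eps=1e-5):
    mean = sum(vector) / len(vector)
    var = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(var + eps) for x in vector]


def gem_step(y, e, params):
    """One position of the GEM: y is the token embedding, e its attention context."""
    # Information gate, Eq. (9); y + e is list concatenation [y_i, e_i].
    info_gate = [sigmoid(z) for z in affine(params["W_info"], params["b_info"], y + e)]
    # Gated context, Eq. (10); tanh stands in for the unspecified activation.
    candidate = [math.tanh(z) for z in layer_norm(affine(params["W_lin"], params["b_lin"], y))]
    c = [a * g for a, g in zip(candidate, info_gate)]
    # State gate, Eq. (11).
    state_gate = [sigmoid(z) for z in affine(params["W_sta"], params["b_sta"], y + c)]
    s = [yi * g for yi, g in zip(y, state_gate)]
    # Output, Eq. (12).
    return [ci + si for ci, si in zip(c, s)]


# Toy parameters: zero gate weights make both gates equal to 0.5 everywhere.
params = {
    "W_info": [[0.0] * 4, [0.0] * 4], "b_info": [0.0, 0.0],
    "W_lin": [[1.0, 0.0], [0.0, 1.0]], "b_lin": [0.0, 0.0],
    "W_sta": [[0.0] * 4, [0.0] * 4], "b_sta": [0.0, 0.0],
}
g = gem_step([1.0, -1.0], [0.3, 0.7], params)
print(g)
```

With learned gate weights, positions whose context is more relevant would receive gate values closer to 1 and therefore contribute more to $G$.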

3.5. Report generator

Regarding the report generator, we utilize a Transformer encoder and Transformer decoder (Vaswani et al., 2017). In the decoder, we incorporate the GEM to merge positive features and vital details from radiological contents. The GEM is designed to associate and integrate contextual information, addressing the challenge of long-text dependencies.

3.5.1. Transformer encoder

In this part, we describe how the Transformer encoder processes the recalibrated visual features $M = \{m_{1}, m_{2}, \dots, m_{N_{I}}\}$ from the VRM. The visual features $M$ are encoded by the Transformer encoder, providing robust support for diagnosis and analysis. Each encoding layer incorporates a multi-head self-attention mechanism (MHA), a fully connected feed-forward network, residual connections, and layer normalization. The MHA mechanism captures crucial disease representations.

Initially, the recalibrated visual feature sequence $M = \{m_{1}, m_{2}, \dots, m_{N_{I}}\}$ undergoes linear transformations to derive $Q \in \mathbb{R}^{G_{q} \times d_{k}}$, $K \in \mathbb{R}^{G_{k} \times d_{k}}$, and $V \in \mathbb{R}^{G_{k} \times d_{k}}$:

$Q = M W_{q}, \quad K = M W_{k}, \quad V = M W_{v}$ (13)

where $W_{q}, W_{k}, W_{v}$ denote the learnable weight matrices.

Then, we use the attention mechanism to capture key abnormal representations in the recalibrated visual features:

$Att_{i}(Q, K, V) = softmax\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V$ (14)

Then, the results of the multiple attention heads are concatenated as follows:

$H = MHA(Q, K, V) = Concat(Att_{1}, Att_{2}, \dots, Att_{N_{I}}) W^{o}$ (15)

where $H$ is the final output of the MHA, $Concat(\cdot)$ is the concatenation operation, $W^{o}$ is a learnable parameter, and $\sqrt{d_{k}}$ is the scaling factor.
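The scaled dot-product attention of Eq. (14) can be written out for a single head in a few lines of plain Python. The sketch omits the projections of Eq. (13), the concatenation over heads, and $W^{o}$; the function names and toy matrices are hypothetical.

```python
import math


def softmax(row):
    peak = max(row)                           # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [x / total for x in exps]


def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V for matrices given as lists of row vectors."""
    d_k = len(k[0])
    out = []
    for q_row in q:
        scores = [
            sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k)
            for k_row in k
        ]
        weights = softmax(scores)
        out.append([
            sum(w * v_row[j] for w, v_row in zip(weights, v))
            for j in range(len(v[0]))
        ])
    return out


# A zero query scores every key equally, so the output is the mean of the value rows.
q = [[0.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 0.0], [3.0, 4.0]]
out = attention(q, k, v)
print(out)  # [[2.0, 2.0]]
```

Dividing by $\sqrt{d_{k}}$ keeps the scores in a range where the softmax does not saturate as the key dimension grows.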

Following the MHA, the final output $M^{'}$ is obtained via residual connection and layer normalization:

$M^{'} = Ln(Res(H + M))$ (16)

where $Res(\cdot)$ denotes the residual connection, and $Ln(\cdot)$ represents layer normalization.

3.5.2. Transformer decoder

We incorporate the GEM into the Transformer decoder. The GEM focuses on pathology-related context within the report text sequence, enhancing the VRGE’s capability to obtain valuable text features and generate detailed descriptions:

$y_{t} = f_{d}(m_{1}^{'}, \dots, m_{N_{I}}^{'}, GEM(y_{1}, \dots, y_{t-1}))$ (17)

where $f_{d}(\cdot)$ is the decoder layer.

Given the above steps, the generation of the radiology report can be formulated as:

$p(Y \mid Img) = \prod_{t=1}^{N_{R}} p(y_{t} \mid y_{1}, y_{2}, \dots, y_{t-1}, Img)$ (18)

where the final output of the radiology report is $Y = \{y_{1}, y_{2}, \dots, y_{N_{R}}\}$.
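Reading Eq. (18) as the usual autoregressive factorization, in which each token is conditioned on the tokens before it, the report probability is a product of per-step probabilities. In practice such products are commonly evaluated as sums of logarithms to avoid numerical underflow on long reports. The sketch below shows this with made-up step probabilities; the function name is hypothetical, and the article does not state which training objective is used.

```python
import math


def sequence_log_prob(step_probs):
    """log p(Y | Img) as the sum of log p(y_t | y_1..y_{t-1}, Img) over all steps."""
    return sum(math.log(p) for p in step_probs)


# Hypothetical probabilities assigned to the three tokens of a very short report.
step_probs = [0.5, 0.25, 0.8]
log_p = sequence_log_prob(step_probs)
print(log_p, math.exp(log_p))  # exp(log_p) is close to 0.5 * 0.25 * 0.8 = 0.1
```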

4. EXPERIMENT

4.1. Datasets

Following the previous approach (Chen et al., 2020), we evaluate our model on the publicly available IU X-Ray dataset.

IU X-Ray: It is a widely used radiological imaging dataset collected by Indiana University, comprising 7,470 images and 3,955 associated reports. For the majority of patients, the dataset provides a medical report together with the corresponding frontal and lateral images. The images cover various disease categories, such as pneumonia, tuberculosis, pneumothorax, and abnormal lung texture. The reports contain impressions, findings, and other details.

4.2. Evaluation metrics

To evaluate the reports generated by the VRGE, we use several metrics: BLEU-n ($n \in \{1, 2, 3, 4\}$) (Papineni et al., 2002), Metric for Evaluation of Translation with Explicit ORdering (METEOR) (Banerjee and Lavie, 2005), and Recall-Oriented Understudy for Gisting Evaluation (ROUGE-L) (Lin, 2004). BLEU-n assesses the n-gram similarity between the ground-truth reports and the generated reports. METEOR not only focuses on word-level accuracy but also considers phrase-level matching between generated and ground-truth reports. ROUGE-L measures the overlap between the reference text and the generated text through their longest common subsequence and is recall-oriented. Together, these metrics provide complementary perspectives on the report-generating capabilities of the VRGE model.
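To make the ROUGE-L computation concrete, the sketch below scores a candidate sentence against a reference using the longest common subsequence (LCS) and the F-measure form of the metric. The weighting parameter `beta` is a free choice; the default of 1.2 used here is an assumption and is not taken from this article. The example sentences are invented.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.2):
    """LCS-based F-measure; beta > 1 weights recall more heavily than precision."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


reference = "the heart size is normal".split()
candidate = "heart size is within normal limits".split()
lcs = lcs_length(candidate, reference)   # "heart size is normal" -> 4
score = rouge_l(candidate, reference)
print(lcs, round(score, 4))
```

Because the LCS does not require the matched words to be adjacent, the inserted word "within" does not break the match, which is what distinguishes ROUGE-L from n-gram metrics such as BLEU.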

4.3. Baseline models

To demonstrate the model’s efficacy, we compare VRGE with other models:

• NIC* (Vinyals et al., 2015): It uses a CNN to extract comprehensive visual features from the image and an RNN to generate the output text.
• Show Attend and Tell (SA&T)* (Xu et al., 2015): It introduces an attention-based method for generating descriptions, using hard and soft attention to extract valuable image elements for generating elaborate semantic characterizations.
• Hybrid Retrieval-Generation Reinforced Agent (HRGR-Agent) (Li et al., 2018): This model adopts a hierarchical retrieval mechanism and reinforcement learning, combining a high-level retrieval module with a low-level generation module to generate structurally balanced radiology reports.
• CoAttention radiology report generation model (COATT) (Jing et al., 2018b): It designs a multi-task learning framework, which introduces a co-attention mechanism to locate abnormal regions in medical images and a hierarchical LSTM architecture to generate rich text reports.
• Generating radiology reports via memory-driven transformer (R2Gen) (Chen et al., 2020): It adds a relational memory module to the Transformer decoder to record key information from previous generation steps, enhancing decoding and producing semantically rich radiology reports.
• Cross-modal Memory Networks (R2GenCMN) (Chen et al., 2022): It introduces memory query and response modules to associate text and image information more effectively, improving the consistency and richness of generated radiology reports.
• Posterior-and-Prior Knowledge Exploring-and-Distilling approach (PPKED) (Liu et al., 2021a): This model integrates multi-domain knowledge extractors to explore prior and posterior knowledge and alleviate textual and visual data bias, thereby enhancing the consistency and richness of generated reports.

4.4. Implementation details

We use ResNet-101 pre-trained on ImageNet as the visual encoder. Each medical image is resized to $224 \times 224$, and the medical image embedding dimension extracted by VRM is $d = 7 \times 7 \times 2048$. We adopt the Transformer architecture with a hidden size of 512 and randomly initialized parameters. During training, we employ the Adam optimizer (Kingma and Ba, 2015). The learning rate of the visual extractor is set to 0.00005, whereas all other parameters use a learning rate of 0.0001.

We set the beam size to 3 to balance the quality of generated reports and computational efficiency. All experiments are run on an NVIDIA 3080 GPU, with a maximum of $100$ training epochs and a batch size of $8$. These settings are chosen so that the model maintains high report quality at a reasonable computational cost.
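The settings reported in this section can be collected into a small configuration object. The sketch below is only an illustration: the class name, the field names, and the `visual_extractor.` parameter prefix are hypothetical, while the values are the ones stated above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    image_size: int = 224              # input images are 224 x 224
    hidden_size: int = 512             # Transformer hidden size
    lr_visual_extractor: float = 5e-5  # learning rate of the visual extractor
    lr_other: float = 1e-4             # learning rate of all other parameters
    beam_size: int = 3                 # beam width at inference time
    max_epochs: int = 100              # maximum number of training rounds
    batch_size: int = 8


def learning_rate_for(param_name, config, visual_prefix="visual_extractor."):
    """Pick the learning rate for a parameter by its (hypothetical) name prefix."""
    if param_name.startswith(visual_prefix):
        return config.lr_visual_extractor
    return config.lr_other


config = TrainConfig()
print(learning_rate_for("visual_extractor.layer4.weight", config))
print(learning_rate_for("decoder.gem.w_info", config))
```

Using a smaller learning rate for the pre-trained visual extractor than for the randomly initialized parts is a common way to avoid overwriting the pre-trained features early in training.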

5. EXPERIMENT AND ANALYSIS

5.1. Main experiments

We compare the proposed VRGE model with the baselines on the IU X-Ray dataset. As shown in Table 1, the VRGE achieves state-of-the-art performance. Compared with the PPKED model, the VRGE improves BLEU-n ($n \in \{1, 2, 3, 4\}$) by 1.6, 2.8, 1.9, and 1.2 percentage points, respectively. These results underscore the effectiveness of our model in filtering out irrelevant information and prioritizing the generation of precise and thorough symptom descriptions.

Table 1.
Comparison of Experimental Results with Other Models on the IU X-Ray

Model	BL-1	BL-2	BL-3	BL-4	MET	RG-L
NIC*	0.216	0.124	0.087	0.066	-	0.306
SA&T*	0.339	0.251	0.168	0.118	-	0.323
HRGR	0.438	0.298	0.208	0.151	-	0.322
COATT	0.455	0.288	0.205	0.154	-	0.369
R2Gen	0.470	0.304	0.219	0.165	0.187	0.371
CMN	0.475	0.309	0.222	0.170	0.191	0.375
PPKED	0.483	0.315	0.224	0.168	0.190	0.376
VRGE	0.499	0.343	0.243	0.180	0.218	0.411

BL-n, BLEU-n; MET, METEOR; RG-L, ROUGE-L.

Moreover, our model improves the other evaluation metrics. Its METEOR score exceeds that of PPKED by 2.8 percentage points, and its ROUGE-L score by 3.5 percentage points, indicating that the VRGE produces more accurate, fluent, and coherent disease descriptions.
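The reported gains are absolute differences between the Table 1 scores, expressed in percentage points. The short check below recomputes them from the PPKED and VRGE rows; the dictionary layout and function name are only illustrative.

```python
# Scores copied from the PPKED and VRGE rows of Table 1.
TABLE_1 = {
    "PPKED": {"BL-1": 0.483, "BL-2": 0.315, "BL-3": 0.224, "BL-4": 0.168, "MET": 0.190, "RG-L": 0.376},
    "VRGE": {"BL-1": 0.499, "BL-2": 0.343, "BL-3": 0.243, "BL-4": 0.180, "MET": 0.218, "RG-L": 0.411},
}


def gains_in_points(ours, baseline):
    """Absolute score difference per metric, in percentage points."""
    return {metric: round((ours[metric] - baseline[metric]) * 100, 1) for metric in ours}


gains = gains_in_points(TABLE_1["VRGE"], TABLE_1["PPKED"])
print(gains)
```

For example, the BLEU-1 gain is (0.499 - 0.483) x 100 = 1.6 points, which corresponds to a relative improvement of about 3.3% over PPKED.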

5.2. Ablation experiments

To validate the effectiveness of the VRM and the GEM, we perform dedicated experiments for each module on the IU X-Ray dataset. We isolate the impact of each module through controlled experiments, which allows us to assess its contribution to the quality and accuracy of the generated medical reports.

5.3. Effect of the VRM

To assess the efficacy of the VRM, we remove it from the VRGE model and evaluate the impact with BLEU-n, METEOR, and ROUGE-L. As shown in Table 2, after removing the VRM, BLEU-n ($n \in \{1, 2, 3, 4\}$) decreased by 2.4, 2.0, 1.0, and 0.6 percentage points, respectively. Moreover, METEOR and ROUGE-L decreased by 3.1 and 1.9 percentage points, respectively.

Table 2.
Result of the Ablation Experiment on the IU X-Ray

Model	BL-1	BL-2	BL-3	BL-4	MET	RG-L
BASE	0.396	0.254	0.179	0.135	0.164	0.342
+VRM	0.492	0.331	0.243	0.180	0.198	0.368
+GEM	0.475	0.323	0.233	0.174	0.219	0.392
VRGE	0.499	0.343	0.243	0.180	0.223	0.411

The boldface represents the highest performance.

This experiment demonstrates the importance of the VRM. Its removal degrades performance, especially on the Bilingual Evaluation Understudy (BLEU) and METEOR metrics, indicating that the module is crucial for capturing abnormalities in radiological medical images, enhancing abnormal visual features, and alleviating data bias. These results highlight the module’s contribution to the model’s overall performance and support its effectiveness in generating radiology reports.

5.4. Effect of the GEM

To verify the effectiveness of the GEM, we remove it from the VRGE model and evaluate the impact on model performance. The results in Table 2 support the efficacy of the GEM. After removing the GEM, BLEU-1, BLEU-2, BLEU-3, and BLEU-4 decreased by 0.7, 1.2, 1.1, and 1.1 percentage points, respectively. Moreover, METEOR and ROUGE-L are 2.0 and 4.3 percentage points lower than those of the VRGE, respectively. The drop is most pronounced on METEOR and ROUGE-L, which indicates that the GEM uses contextual information effectively to generate coherent and consistent radiology reports. These results suggest that adding the GEM improves the fluency of language generation and helps with long-distance dependencies, supporting the model’s understanding of report context and accurate report generation.

6. CASE STUDY

To further examine the feasibility of the VRGE model in clinical practice, we randomly select three cases from the IU X-Ray dataset for in-depth analysis, as shown in Figure 3. We compare the reports generated by the VRGE on the test set with those generated by the Transformer and R2GenCMN methods and with the ground truth. In the comparison, red marks descriptions of organs, and blue underlining marks descriptions that are close to the ground truth.

FIG. 3.

Cases of the IU X-ray generated reports. Our results are compared with the ground truth, the Transformer method and R2GenCMN method. Red indicates organs, and blue underline indicates that each model is close to the description of ground truth.

For the first case, the report generated by VRGE aligns more closely with the ground truth, unlike the Transformer model, which fails to recognize “left infrahilar region.” Notably, VRGE can identify crucial abnormal findings in radiological images, such as “Calcified lymph identified,” a detail overlooked by the other two models. This underscores the emphasis of the proposed VRGE model on effective anomaly detection in medical images.

For the second case, the report generated by the VRGE describes the content of the medical image more thoroughly than the other two models. Furthermore, our model successfully identifies “no focal consolidation pneumothorax,” a detail overlooked by the Transformer model.

For the last case, VRGE generates a more comprehensive report than the other methods. Notably, it identifies abnormal findings that are absent from the Transformer and R2GenCMN reports, such as “slight vascular capitation.”

7. FUTURE WORK

Recently, deep learning has demonstrated remarkable success in radiology report generation. However, further research is needed in several key areas, notably model interpretability and data privacy and security. Radiologists find it difficult to trust systems whose outputs cannot be explained. As these systems become more sophisticated, it becomes essential for their results and reasoning processes to be interpretable. This interpretability can promote confidence in the generated reports and enhance the model’s utility in clinical decision-making. Moreover, because radiology reports often include sensitive information, robust privacy protection measures are imperative.

In conclusion, while deep learning has shown promising results in radiology report generation, ongoing research efforts should prioritize enhancing model interpretability and implementing robust privacy measures. This approach ensures the responsible and ethical deployment of these technologies in the health care domain, promoting both efficacy and patient trust.

8. CONCLUSION

This article presents VRGE, a model that generates radiology reports automatically. We develop a dedicated visual recalibration module to improve the VRGE’s ability to capture essential lesion features. This module strengthens the VRGE’s attention to crucial visual areas within medical images and mitigates visual data bias. Additionally, our approach incorporates a GEM, which aggregates pertinent contextual elements. By extracting text descriptions associated with lesions, this module enables the model to filter out irrelevant information, addressing long-distance dependency issues and textual data bias. Experiments on the IU X-Ray dataset show that the VRGE model outperforms previous baseline models.

Footnotes

AUTHOR DISCLOSURE STATEMENT

The authors declare they have no conflicting financial interests.

FUNDING INFORMATION

This work is supported by a grant from the Natural Science Foundation of China (No. 62072070).

Table 2. Results of the Ablation Experiment on the IU X-Ray Dataset

Model   BL-1    BL-2    BL-3    BL-4    MET     RG-L
BASE    0.396   0.254   0.179   0.135   0.164   0.342
+MVR    0.492   0.331   0.243   0.180   0.198   0.368
+CGM    0.475   0.323   0.233   0.174   0.219   0.392
VRGE    0.499   0.343   0.243   0.180   0.223   0.411

BL-n, BLEU-n; MET, METEOR; RG-L, ROUGE-L.
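
To make the rows of Table 2 easier to compare, the short sketch below computes the absolute gain of each variant over the BASE row. The scores are copied from the table as printed; the helper name `gain_over` and the dictionary layout are assumptions introduced only for this example.

```python
# Scores copied from Table 2 (IU X-Ray ablation), in column order.
METRICS = ["BL-1", "BL-2", "BL-3", "BL-4", "MET", "RG-L"]
TABLE_2 = {
    "BASE": [0.396, 0.254, 0.179, 0.135, 0.164, 0.342],
    "+MVR": [0.492, 0.331, 0.243, 0.180, 0.198, 0.368],
    "+CGM": [0.475, 0.323, 0.233, 0.174, 0.219, 0.392],
    "VRGE": [0.499, 0.343, 0.243, 0.180, 0.223, 0.411],
}


def gain_over(reference: str, variant: str) -> dict:
    """Return the per-metric absolute difference (variant minus reference)."""
    ref_scores = TABLE_2[reference]
    var_scores = TABLE_2[variant]
    # Round to three decimals to match the precision reported in the table.
    return {
        metric: round(var - ref, 3)
        for metric, ref, var in zip(METRICS, ref_scores, var_scores)
    }


if __name__ == "__main__":
    for name in ("+MVR", "+CGM", "VRGE"):
        print(name, gain_over("BASE", name))
```

The differences are absolute score differences on the 0 to 1 scale used in the table, so a value of 0.010 corresponds to one point when scores are quoted as percentages.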