Abstract
BACKGROUND:
Accurate volumetric segmentation of primary central nervous system lymphoma (PCNSL) is essential for assessing and monitoring the tumor before radiotherapy and for treatment planning. Manual segmentation is tedious and subject to inter- and intra-observer variability, while existing automatic segmentation methods under-segment PCNSL because of the complex and multifaceted nature of the tumor.
OBJECTIVE:
To address the challenges of small size, diffuse distribution, poor inter-layer continuity along the same axis, and a tendency toward over-segmentation in PCNSL segmentation on brain MRI, we propose an improved attention module based on nnU-Net for automated segmentation.
METHODS:
We collected 114 T1 MRI images of patients at Huashan Hospital, Shanghai, and randomly split the 114 cases into 5 distinct training and test sets for 5-fold cross-validation. To delineate PCNSL efficiently and accurately, we proposed an improved attention module based on nnU-Net with 3D convolutions, batch normalization, and residual attention (res-attention) to learn the tumor region information. Additionally, multi-scale dilated convolution kernels with different dilation rates were integrated to broaden the receptive field. We further used attentional feature fusion with 3D convolutions (AFF3D) to fuse the feature maps generated by the multi-scale dilated convolution kernels and thereby reduce under-segmentation.
RESULTS:
Compared to existing methods, our attention module improves the ability to distinguish diffuse and edge-enhanced tumor types, and the broadened receptive field captures tumor features of various scales and shapes more effectively, achieving a Dice Similarity Coefficient (DSC) of 0.9349.
CONCLUSIONS:
Quantitative results demonstrate the effectiveness of the proposed method in segmenting PCNSL. To our knowledge, this is the first study to introduce attention modules into deep learning for segmenting PCNSL on brain magnetic resonance imaging (MRI), which facilitates the localization of PCNSL before radiotherapy.
Introduction
Primary central nervous system lymphoma (PCNSL) is a type of extra-nodal non-Hodgkin lymphoma (NHL) with involvement restricted to the brain. It accounts for about 1% of NHLs and 3–4% of intracranial tumors [1]. The 5-year overall survival (OS) rate of PCNSL reported internationally is only 20–30% [2]. Radiation and chemotherapy may significantly improve the survival of PCNSL patients [3], particularly when chemotherapy agents are administered prior to radiation therapy. Segmentation of PCNSL is important for localizing the tumor before radiotherapy so that the target area can be delineated accurately [4].
The segmentation of PCNSL is very challenging. Unlike other brain tumors, PCNSL tumors are more dispersed and varied in shape, with significant morphological differences between layers. In addition, multiple lesions are common in PCNSL patients and usually affect the deep tissues of the brain [5, 6]; some lesions are small and exhibit prominent surrounding edema, lateral ventricle compression, and an evident space-occupying effect [7]. Figure 1 illustrates these differences between PCNSL and the Brain Tumor Segmentation (BraTS) Challenge dataset.

An example of (a) PCNSL and (b) BraTS2021. The PCNSL data are spatially dispersed, with small, discontinuous tumors that are generally surrounded by edema.
Deep learning-based methods have been effective in segmenting brain tumors, including gliomas and meningiomas [8]. In competitions such as BraTS [9], neural-network-based approaches [10, 11] have achieved good results in brain tumor segmentation tasks. Following the idea of ResNet, Xiao et al. [12] added skip connections to U-Net. This operation increased the network’s depth, which can prevent overfitting and improve model accuracy. Zhou et al. [13] integrated feature maps from different levels in U-Net++ and introduced deep supervision that allows for pruning, significantly reducing the parameter count of large deep networks while maintaining acceptable accuracy.
Vijay et al. [14] further improved U-Net with Spatial Pyramid Pooling (SPP) to extract information from various downsampling blocks, thereby increasing the scope of reconstruction. Another line of work provides a standardized platform with pre-processing, training, and evaluation tools. This so-called nnU-Net (‘no-new-Net’) [15] includes various 3D CNN architectures and supports advanced techniques such as instance normalization (IN). Overall, nnU-Net is a robust, self-configuring solution for biomedical image segmentation that adapts well across diverse tasks. Luu et al. [16] expanded the network of nnU-Net by incorporating axial attention and introduced the IN operation. With these enhancements, they ranked first in the BraTS2021 Challenge. Although the U-Net-based [17] techniques mentioned above have great encoding capacity, the constrained receptive fields of their convolution kernels make it difficult for them to effectively learn the correlation of tumor-area information between different layers of a sample [18].
To better focus the model on the regions to be segmented, attention modules have been proposed to enhance the model’s ability to extract features from important areas in the feature maps. For instance, the Convolutional Block Attention Module (CBAM) [19] attends separately to the channel and spatial dimensions through a channel attention module and a spatial attention module, respectively; the Squeeze-and-Excitation (SE) module [20] adaptively recalibrates channel-wise feature responses by explicitly modeling interdependencies between the channels; the concurrent spatial and channel Squeeze-and-Excitation (scSE) module [21] is designed to be added to any existing architecture without changing the original structure and without adding much computational cost; the Global Context (GC) module [22] aggregates a global context descriptor over all spatial positions and fuses it back into the features of every position; and axial attention [23] treats the input as a multi-dimensional tensor and operates separately along each axis, which allows it to be used with much larger inputs. All these methods have been widely used in other brain tumor segmentation tasks but not for PCNSL. Because PCNSL consists mostly of scattered tumors, it is difficult for these attention modules to capture all the small tumors. Furthermore, the gray level of the tumor center is similar to that of normal tissue, so under-segmentation is likely to occur. Finally, the tumor shapes in different layers differ considerably, which makes it challenging to learn the overall image features with 2D convolutions.
To address the abovementioned problems, we introduced, for the first time in PCNSL segmentation, an attention module based on multi-scale dilated convolution feature fusion, which effectively avoids the under-segmentation seen in other segmentation methods. The main contributions are: i) We proposed using dilated convolutions in the spatial attention module, which enlarge the receptive field and capture a more extensive range of spatial information so that the model can learn more edge and texture features of the tumor. ii) We developed a new attentional feature fusion method with 3D convolutions (AFF3D) in the spatial attention module that adaptively fuses the features of the spatial attention maps. This method effectively reduces under-segmentation by simultaneously considering local and global features, which enhances feature selection and information extraction. iii) We compared our method with existing attention methods. Experimental results demonstrated the improvement of our method in PCNSL segmentation on magnetic resonance imaging (MRI) images.
As in nnU-Net [24], the 3D U-Net widely applied in medical image segmentation is used as the basic structure in this work. The network has an encoder-decoder structure with skip connections linking the two pathways and deep supervision. We added our improved attention module to every layer in the encoder and decoder. Figure 2 shows the model diagram of the segmentation network:

The architecture of our segmentation network, comprising the nnU-Net-3D network and the improved attention module.
Inspired by CBAM [19], we designed an improved attention module that includes a channel attention module, which focuses the model on informative feature channels, and a spatial attention module, which highlights important spatial locations.
We collected 114 T1 MRI images of patients at Huashan Hospital, Shanghai, China. The study was conducted in accordance with the Declaration of Helsinki and was approved by the Ethics Research Committee of HuaShan Hospital, School of Medicine, Fudan University (approval No. KY 2015-256). Written informed consent was obtained from all participants before inclusion in the study. The T1 MRI volumes and their corresponding label volumes had sizes of 288×288×180, 448×512×176, 512×512×232, and 512×512×232 voxels, with spacings of 0.833333×0.833333×1, 0.488281×0.488281×1, 0.4688×0.4688×0.8, and 0.488281×0.488281×1, respectively. To improve computational efficiency, the T1 volumes and their labels were automatically cropped and resampled by nnU-Net to the same size of 128×128×128 voxels with a spacing of 1.875×1.875×1.40, which gave the best experimental performance. The labels were annotated by 2 senior physicians for evaluation.
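As a quick consistency check on these numbers, the physical extent of a volume along each axis is the voxel count multiplied by the voxel spacing. The sketch below assumes the spacings are in millimetres (the text does not state the unit); the helper name `extent` is hypothetical.

```python
# Physical extent per axis = number of voxels x voxel spacing.
# Assumption: spacings are in millimetres; the source does not state the unit.

def extent(size, spacing):
    """Return the physical extent along each axis, rounded to one decimal."""
    return tuple(round(n * s, 1) for n, s in zip(size, spacing))


# Native acquisitions reported in the text, as (size, spacing) pairs.
native = [
    ((288, 288, 180), (0.833333, 0.833333, 1.0)),
    ((448, 512, 176), (0.488281, 0.488281, 1.0)),
    ((512, 512, 232), (0.4688, 0.4688, 0.8)),
    ((512, 512, 232), (0.488281, 0.488281, 1.0)),
]
# Common grid reported for training.
target = ((128, 128, 128), (1.875, 1.875, 1.40))

for size, spacing in native:
    print(size, "->", extent(size, spacing))
print("target ->", extent(*target))
```

Under that assumption the 128×128×128 grid spans about 240 in-plane, the same as the first and third native acquisitions, which is what one would expect from resampling to a coarser spacing rather than from cropping alone.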
Channel attention in the improved attention module
Unlike the channel attention in CBAM, we replaced the fully connected layers (nn.Linear) with 3D convolutions (nn.Conv3d). This allows the improved attention module to learn the relationships between different positions in 3D space. We also replaced ReLU with LeakyReLU in the channel attention. LeakyReLU retains gradient information for negative inputs, which deals more effectively with noise in the input data and helps the network converge better and exhibit greater robustness. These two improvements make the segmentation model more suitable for handling 3D data. The architecture of the channel attention is shown in Fig. 3.

The architecture of the channel attention. The feature map F is input to two adaptive poolings: AdaptiveAvgPooling3D and AdaptiveMaxPooling3D. The output features are processed by the 3D multilayer perceptron (MLP3D), added, and passed through the activation function S to obtain the result of channel attention $M_{CA}$.
We first aggregated the spatial information of a feature map F by using both 3D average pooling (AdaptiveAvgPool3D) and 3D max pooling (AdaptiveMaxPool3D), generating two different spatial context descriptors, $F_{avg}$ and $F_{max}$. The adaptive poolings automatically adapt to the size of the input features. The average-pooled and max-pooled features were each forwarded to a shared network, MLP3D, and the two MLP3D outputs were added together. After the sigmoid S, we obtained the channel attention map $M_{CA}$, which can be written as:

$$M_{CA}(F) = S\big(\mathrm{MLP3D}(\mathrm{AdaptiveAvgPool3D}(F)) + \mathrm{MLP3D}(\mathrm{AdaptiveMaxPool3D}(F))\big) = S\big(\mathrm{MLP3D}(F_{avg}) + \mathrm{MLP3D}(F_{max})\big)$$
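The channel attention described above can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' implementation: the class name `ChannelAttention3D`, the reduction ratio of 4, the 1×1×1 kernel size, and the absence of bias terms are all assumptions, since the text only specifies that nn.Conv3d and LeakyReLU replace nn.Linear and ReLU.

```python
import torch
import torch.nn as nn


class ChannelAttention3D(nn.Module):
    """Channel attention: a shared MLP3D over avg- and max-pooled descriptors."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = max(channels // reduction, 1)   # reduction ratio is an assumed value
        self.avg_pool = nn.AdaptiveAvgPool3d(1)  # F -> F_avg, shape (N, C, 1, 1, 1)
        self.max_pool = nn.AdaptiveMaxPool3d(1)  # F -> F_max, shape (N, C, 1, 1, 1)
        # Shared MLP3D: Conv3d layers instead of nn.Linear, LeakyReLU instead of ReLU.
        self.mlp3d = nn.Sequential(
            nn.Conv3d(channels, hidden, kernel_size=1, bias=False),
            nn.LeakyReLU(negative_slope=0.01),
            nn.Conv3d(hidden, channels, kernel_size=1, bias=False),
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # M_CA(F) = S(MLP3D(F_avg) + MLP3D(F_max))
        return torch.sigmoid(self.mlp3d(self.avg_pool(f)) + self.mlp3d(self.max_pool(f)))


if __name__ == "__main__":
    x = torch.randn(2, 8, 16, 16, 16)           # (N, C, D, H, W)
    m_ca = ChannelAttention3D(channels=8)(x)
    print(m_ca.shape)                            # per-channel weights
    print((m_ca * x).shape)                      # weights broadcast over D, H, W
```

The returned map has one weight per channel and broadcasts over the spatial axes when multiplied with the input feature.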
To further improve the performance of our improved attention module, we introduced dilated convolutions [25], attentional feature fusion (AFF) [26] and attention maps in spatial attention.

The architecture of the MS-CAM and AFF3D, where C, H and W denote channel, height and width, respectively. Two branches with different scales are used to extract channel attention weights. One branch uses global average pooling to extract attention for global features, while the other branch directly uses point-wise convolution to extract channel attention for local features.
AFF3D is built on the multi-scale channel attention module (MS-CAM), an attention fusion function that computes a combination of local and global attention. The input feature map was processed through a local attention module and a global attention module, each of which comprises a sequence of layers (3D convolution, batch normalization, and ReLU activation). The outputs of these modules were combined by addition and then passed through a sigmoid function to form the final attention weight map.
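A 3D version of the MS-CAM weight computation might look like the sketch below. The names `MSCAM3D` and `_bottleneck`, the reduction ratio, and the exact layer order inside each branch (point-wise convolution, batch normalization, ReLU, point-wise convolution, batch normalization) are assumptions modeled on the 2D module; how AFF3D then applies the weight map to its input is not detailed in this section and is left out.

```python
import torch
import torch.nn as nn


def _bottleneck(channels: int, hidden: int) -> list:
    """Conv3d -> BN -> ReLU -> Conv3d -> BN, all with 1x1x1 (point-wise) kernels."""
    return [
        nn.Conv3d(channels, hidden, kernel_size=1),
        nn.BatchNorm3d(hidden),
        nn.ReLU(inplace=True),
        nn.Conv3d(hidden, channels, kernel_size=1),
        nn.BatchNorm3d(channels),
    ]


class MSCAM3D(nn.Module):
    """Local + global channel attention, fused by addition and a sigmoid."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = max(channels // reduction, 1)  # reduction ratio is an assumed value
        # Local branch: point-wise convolutions on the full-resolution map.
        self.local_att = nn.Sequential(*_bottleneck(channels, hidden))
        # Global branch: global average pooling first, then the same layer types.
        self.global_att = nn.Sequential(
            nn.AdaptiveAvgPool3d(1), *_bottleneck(channels, hidden)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local = self.local_att(x)      # (N, C, D, H, W)
        global_ = self.global_att(x)   # (N, C, 1, 1, 1), broadcast on addition
        return torch.sigmoid(local + global_)


if __name__ == "__main__":
    module = MSCAM3D(channels=4).eval()  # eval mode: BN uses running statistics
    with torch.no_grad():
        weights = module(torch.randn(2, 4, 8, 8, 8))
    print(weights.shape)
```

Note that in training mode the global branch applies batch normalization to a 1×1×1 map, so it needs more than one sample per batch.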
We combined dilated convolutions, AFF3D and attention maps in the spatial attention, as shown in Fig. 5. First, the average and maximum values of the input F were calculated along the channel dimension, and the two results were concatenated along the channel dimension to form a new feature map. This feature map was then passed through each of the dilated convolutional layers, resulting in a series of outputs. These outputs vary in spatial scale because the convolution kernel sizes and dilation rates differ. Next, each convolutional output was passed through an AFF3D instance for a self-attention operation, yielding a series of self-attention outputs. These self-attention outputs were then concatenated along the channel dimension and summed to produce the output, which was stored in the attention maps. Here AFF3D accomplished the global and local feature fusion for each dilated convolution, while the concatenation and summation of the attention maps produced the cross-scale fused feature $F_d$. This self-attention module enhanced important features and suppressed unimportant ones. Finally, a sigmoid activation function S was applied to $F_d$ to constrain the output between 0 and 1, giving the final spatial attention map $M_{SA}(F)$. In short, the spatial attention is computed as:

$$M_{SA}(F) = S(F_d), \qquad F_d = \sum_{i} \mathrm{AFF3D}\big(\mathrm{DilatedConv}_i([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)])\big)$$

The architecture of the improved spatial attention that comprises the dilated convolutions, AFF3D, and attention maps for the feature fusion. First, the input feature map F is processed separately through AvgPooling and MaxPooling, then input into dilated convolutions of different scales. Afterwards, the produced feature maps of different scales are fused in AFF3D. All the feature maps are then stored in the Attention map, undergo concatenation and summation to produce $F_d$, and go through a sigmoid activation to obtain the final $M_{SA}(F)$.
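To see why dilation broadens the receptive field, recall that a kernel of size k with dilation rate d spans d(k−1)+1 voxels along each axis while keeping only k weights per axis. The sketch below tabulates this span, and the padding that preserves the feature-map size, for kernel sizes 3, 5, 7 and 9. The dilation rate of 2 is an assumed value, since the rates are not listed in the text, and both function names are hypothetical.

```python
def effective_kernel(kernel_size: int, dilation: int) -> int:
    """Voxels spanned along one axis by a dilated kernel: d * (k - 1) + 1."""
    return dilation * (kernel_size - 1) + 1


def same_padding(kernel_size: int, dilation: int) -> int:
    """Padding per side that keeps output size equal to input size (stride 1, odd k)."""
    return dilation * (kernel_size - 1) // 2


ASSUMED_DILATION = 2  # hypothetical; the dilation rates are not given in the text

for k in (3, 5, 7, 9):
    span = effective_kernel(k, ASSUMED_DILATION)
    pad = same_padding(k, ASSUMED_DILATION)
    print(f"kernel {k}: spans {span} voxels per axis, padding {pad}")
```

Using this padding for every branch keeps all multi-scale outputs the same shape, which is what allows them to be summed into a single fused map.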
The improved attention module integrates the improved channel attention and spatial attention modules described above and provides an additional improvement: res-attention. When res-attention was set to true, the network performed a residual connection, adding the input directly to the output of the improved attention module, which enables the network to learn more useful feature representations. We also added BN after the input to make the model training process more stable and to avoid exploding or vanishing gradients.
As shown in Fig. 6, we multiplied the input F by the output of the channel attention module and passed the result through the spatial attention module. Its output was again multiplied by the input. This product was then added to the feature from the res-attention skip connection to obtain the final refined features $F_{final}$:

The architecture of the improved attention module. It comprises the improved spatial attention, the channel attention, the BN and the res-attention skip connection. The input feature first goes through BN, then separately through spatial attention and channel attention. It is finally added to the result of the res-attention skip connection to obtain the refined feature.
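One possible reading of this composition is written out below as a sketch. It assumes the CBAM convention, in which the spatial attention map multiplies the channel-refined feature, and it assumes that the skip connection adds the original input F; neither detail is fixed by the description above. BN denotes batch normalization and ⊗ denotes element-wise multiplication with broadcasting.

```latex
\begin{aligned}
\hat{F} &= \mathrm{BN}(F) \\
F' &= M_{CA}(\hat{F}) \otimes \hat{F} \\
F'' &= M_{SA}(F') \otimes F' \\
F_{final} &= F'' + F \quad \text{(res-attention enabled; otherwise } F_{final} = F''\text{)}
\end{aligned}
```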
Network architecture and parameter settings
The encoder of nnU-Net comprises 5 levels of convolutional layers with strided-convolution downsampling. The decoder follows the same structure, with transposed-convolution upsampling followed by convolutions at each level. LeakyReLU with a slope of 0.01 and batch normalization were applied after every convolution layer, and the loss function is the sum of the Dice loss and the cross-entropy loss.
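For intuition, a combined Dice and cross-entropy objective for a binary foreground mask can be sketched in plain Python as below. This is one common formulation and not necessarily the exact one used in the experiments: implementations differ in smoothing terms and constant offsets. The function name `dice_ce_loss` and the smoothing constant `eps` are assumptions.

```python
import math


def dice_ce_loss(probs, targets, eps=1e-6):
    """Soft Dice loss + binary cross-entropy over flattened voxels.

    probs:   predicted foreground probabilities in [0, 1]
    targets: ground-truth labels (0 or 1), same length as probs
    """
    # Binary cross-entropy, averaged over voxels (probabilities clamped for log).
    ce = -sum(
        t * math.log(max(p, eps)) + (1 - t) * math.log(max(1 - p, eps))
        for p, t in zip(probs, targets)
    ) / len(probs)
    # Soft Dice coefficient computed on probabilities rather than hard labels.
    intersection = sum(p * t for p, t in zip(probs, targets))
    dice = (2 * intersection + eps) / (sum(probs) + sum(targets) + eps)
    return ce + (1 - dice)


labels = [1, 0, 1, 0]
print(dice_ce_loss([0.99, 0.01, 0.99, 0.01], labels))  # confident and correct
print(dice_ce_loss([0.01, 0.99, 0.01, 0.99], labels))  # confident and wrong
```

The Dice term is driven by overlap and is therefore less sensitive to the large background class than cross-entropy alone, which matters when lesions are small.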
We implemented all experiments based on PyTorch 1.12.1 [29] with CUDA 11.8 and cuDNN 8.5.0 [30]. Both training and testing were performed with one 24 GB NVIDIA RTX 3090 GPU. The networks were trained for 1000 epochs using the stochastic gradient descent optimizer with a momentum of 0.99. The Dice Similarity Coefficient (DSC) on the validation set of the current fold was used to monitor the training progress. The initial learning rate was 0.01 and was decayed following a polynomial schedule:
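For reference, the default polynomial schedule in the nnU-Net framework multiplies the initial rate by (1 − epoch/max_epochs)^0.9. The sketch below assumes that this default, including the exponent 0.9, is the schedule meant here; the function name `poly_lr` is hypothetical.

```python
def poly_lr(epoch: int, max_epochs: int = 1000, initial_lr: float = 0.01,
            exponent: float = 0.9) -> float:
    """Polynomially decayed learning rate for a given epoch.

    The exponent 0.9 is an assumption (the nnU-Net default), not stated in the text.
    """
    return initial_lr * (1 - epoch / max_epochs) ** exponent


for epoch in (0, 250, 500, 750, 999):
    print(epoch, f"{poly_lr(epoch):.6f}")
```

The rate starts at 0.01, decreases monotonically, and reaches zero at the final epoch.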
During training, data augmentation was applied to improve generalization. It consisted of random rotation and scaling, elastic deformation, additive brightness augmentation, and gamma scaling. The optimization objective is the sum of the binary cross-entropy loss and the Dice loss, calculated at the final output. We randomly split the 114 cases into 5 distinct training and test sets for 5-fold cross-validation. Each training set contained the NIfTI (.nii) files of T1 MRI from 92 cases (training 70%, validation 10%), and the corresponding test set contained the NIfTI files of T1 MRI from the other 22 cases (20%).
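The splitting procedure can be sketched as follows. The case identifiers, the random seed, and the way the validation subset is carved out of the non-test cases (one eighth of them, i.e. about 10% of all cases) are assumptions for illustration; the function name `five_fold_splits` is hypothetical.

```python
import random


def five_fold_splits(case_ids, n_folds=5, val_fraction=0.125, seed=0):
    """Partition cases into n_folds test folds; split the rest into train/val."""
    ids = list(case_ids)
    random.Random(seed).shuffle(ids)                 # reproducible shuffle
    folds = [ids[i::n_folds] for i in range(n_folds)]
    splits = []
    for k in range(n_folds):
        test = folds[k]
        rest = [c for j, fold in enumerate(folds) if j != k for c in fold]
        n_val = round(len(rest) * val_fraction)      # about 10% of all cases
        splits.append({"train": rest[n_val:], "val": rest[:n_val], "test": test})
    return splits


cases = [f"case_{i:03d}" for i in range(114)]        # hypothetical identifiers
for k, split in enumerate(five_fold_splits(cases)):
    print(k, len(split["train"]), len(split["val"]), len(split["test"]))
```

Because 114 is not divisible by 5, an even partition yields four test folds of 23 cases and one of 22, so the 92/22 figure in the text matches one fold exactly and the others to within one case.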
Analysis in dilated convolution
We evaluated different combinations of dilated convolution kernels of 3×3×3, 5×5×5, 7×7×7, 9×9×9, and 11×11×11 (abbreviated as 3, 5, 7, 9, and 11 in Table 1, respectively). These schemes consist of dilated convolutions of varying sizes and quantities. The feature maps they generate are combined directly through summation, whereas 4convs+AFF3D fuses the feature maps through AFF3D. The combination of 3, 5, and 7 resulted in the lowest DSC. When the 9 kernel was added, the DSC improved by 0.80%. However, when the 11 kernel was also added, the DSC decreased by 0.53% and the Jaccard index decreased by 0.85%. The 4convs+AFF3D method achieved the highest DSC of 93.49% and the second-lowest 95% Hausdorff distance (HD-95).
Ablation study of comparison of the number of dilated convolution kernels across five-fold cross-validation
The qualitative analysis also shows that the dilated convolutions reduced the mis-segmentation. From Fig. 7, it can be observed that there are under-segmentations in the yellow circled area and over-segmentations in the blue circled area of the baseline (BL). With the use of dilated convolutions, this situation has gradually improved in 3convs and 4convs (Columns 3 and 4). However, when the quantity of dilated convolution kernels increases to 5 (Column 5), the results become worse with under-segmentations in the yellow circled area. Incorporation of AFF3D addresses the problems of over-segmentation in the blue circle and under-segmentation in the yellow circle. Hence, the best results are obtained with 4 dilated convolution kernels and AFF3D (Column 6).

Segmentation of PCNSL with different setups in dilated convolutions. The segmentation results (indicated by the red masks) show that, except for ‘4convs+AFF3D’ in column 6, all methods produce inaccurate segmentation, especially in the areas marked by yellow and blue circles. From top to bottom: the sagittal view, a zoomed-in view of the segmented focal region, and 3D stereograms highlighting the positions marked by the blue and yellow circles in the sagittal view.
An ablation study aims to validate the impact of key components of a system, model, or theory on the overall system performance. It does so by systematically removing or modifying specific parts of the system to observe how these changes affect functionality, performance, or behavior of the system [31]. We evaluated each improvement of the proposed improved attention module. Table 2 shows the experimental results: the BL + res-attention & BN + dilated convolutions + AFF3D achieves the highest DSC of 93.49%, highest Jaccard index of 87.84% and highest precision of 92.51%, higher than BL by 1.87%, 4.49% and 1.58%, respectively. The HD-95 of BL + res-attention & BN + dilated convolutions is 11.85 mm, lower than BL by 2.95%.
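The overlap metrics reported here follow the standard definitions from true positives (TP), false positives (FP) and false negatives (FN): DSC = 2TP/(2TP+FP+FN), Jaccard = TP/(TP+FP+FN), and precision = TP/(TP+FP). A minimal sketch on flattened binary masks is given below; the function name `overlap_metrics` and the toy masks are hypothetical, and HD-95 is omitted because it requires surface distances.

```python
def overlap_metrics(pred, truth):
    """DSC, Jaccard index and precision for two flattened binary masks."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    return {
        "dsc": 2 * tp / (2 * tp + fp + fn),
        "jaccard": tp / (tp + fp + fn),
        "precision": tp / (tp + fp),
    }


# Toy masks: 3 true positives, 1 false positive, 1 false negative.
pred = [1, 1, 1, 0, 0, 0, 1, 0]
truth = [1, 1, 0, 0, 0, 1, 1, 0]
m = overlap_metrics(pred, truth)
print(m)
# For a single case the two overlap scores are linked by J = D / (2 - D).
print(m["dsc"] / (2 - m["dsc"]))
```

Applying J = D/(2 − D) to the mean DSC of 93.49% gives about 87.8%, close to the reported mean Jaccard index of 87.84%; the identity holds per case, so averages over cases need not satisfy it exactly.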
Ablation study of res-attention & BN, dilated convolutions and AFF3D across five-fold cross-validation
In Fig. 8, Column 1 represents the initial appearance of the tumor. The tumor area in Column 1 is small; consequently, the weight in the red frame is relatively low. In Columns 2 and 3, the tumor area gradually increases until it reaches its maximum size, and the weight within the red frame increases significantly. It can be clearly observed in Columns 1 and 5 of (a), (b) and (c) that the tumor region and boundaries appear overly blurred compared to (d). The BL + res-attention & BN + dilated convolutions + AFF3D in (d) assigns higher weights than (a), (b) and (c).

Feature maps generated by the same convolution of (a) BL, (b) BL + res-attention & BN, (c) BL + res-attention & BN + dilated convolutions, and (d) BL + res-attention & BN + dilated convolutions + AFF3D, respectively. The red frame marks the same region in each image, selected to observe how the assigned weight varies in the gradual transition between normal tissue and the tumor across slices. The weight assigned to the region within the red frame (the number in the bottom right corner of each image) shows that the feature maps of (d) are much clearer than those of (a), (b), and (c).
We compared our improved attention module with several commonly used attention modules, including SE, scSE, CBAM, GC, and axial attention, using three-fold cross-validation with 300 training epochs. The results are as follows:
It can be seen in Table 3 that the improved attention module proposed in this paper achieves the highest DSC and Jaccard index, higher than the second-place method CBAM [19] by 1.10% and 0.98%. Figure 9 compares the segmentation results of the edge enhanced tumors.
Quantitative comparison with commonly used attention methods on the PCNSL dataset with nnU-Net-3D across three-fold cross-validation

PCNSL segmentation results from different attention methods. The segmentation results (as the red masks indicate) show that our improved attention module (column 8) did not produce any over-segmentation compared to the other methods. From top to bottom: the transverse view, the sagittal view, and the coronal view.
In the bottom row of Fig. 9, under-segmentation occurs in the results produced by SE, CBAM, and scSE. In the middle row, only the method proposed in this paper (Column 8) and scSE avoid under-segmentation; the other methods fail to segment the middle region of the edge-enhanced tumor. In the top row, only our improved attention module achieves complete segmentation, while the other methods all show under-segmentation. BL avoids under-segmentation in the top row, but SE and CBAM show under-segmentation in all three rows. This experiment indicates that our improved attention module is superior to the other attention modules.
As seen in the top row of Fig. 10, SegResnet and Diff-UNet produce over-segmentation in the lower part of the tumor center; in the middle row, only our improved attention module+nnU-Net and MedSegDiff do not neglect the small tumor in the right region; in the bottom row, only our improved attention module+nnU-Net accurately segments the small tumor at the center.

PCNSL segmentation results from different segmentation methods. The segmentation results (as the red masks indicate) show that our method (column 7) can segment the tiny tumors which all other methods neglect in the bottom row. From top to bottom: the transverse view, the sagittal view, and the coronal view.
We compared SegResnet, Diff-UNet, MedSegDiff, Swin UNETR, nnU-Net-3D and our improved attention module. As shown in Table 4, our method obtains a DSC of 93.49%, higher than the second-place method nnU-Net-3D (also as the BL) [15] by 1.87%.
Analysis of operations in attentive transformation modules across five-fold cross-validation
The main findings of this work are: i) the proposed improved attention module improves the performance of PCNSL segmentation; ii) the multi-scale dilated convolutions enlarge the receptive field of the model; iii) AFF3D fuses the feature maps generated by the multi-scale dilated convolutions, so the model can assign higher weights to tumor regions; and iv) our method outperforms other attention modules and segmentation networks in resolving under-segmentation and the omission of small diffuse tumors. Our proposed fusion network demonstrates superior segmentation performance compared to existing methods.
Analysis in dilated convolution and ablation study
The segmentation results in rows 3 and 4 of Fig. 7 indicate that, except for 4convs+AFF3D, the other methods exhibit serious over-segmentation in the 3D segmentation results, which is more pronounced than in the 2D results. The dilated convolutions give the model a multi-scale receptive field, thereby enhancing the feature learning capability of our segmentation model. As the number and size of dilated convolution kernels increase, the segmentation results gradually improve. However, excessively increasing the number of dilated convolution kernels can negatively impact the segmentation ability. Specifically, adding a fifth dilated convolution (11×11×11) causes the model to focus excessively on edge information and neglect the correlation within tumor areas. This overemphasis results in under-segmentation of the areas circled in yellow and over-segmentation of the areas circled in blue (Fig. 7, rows 3 and 4, Column 5). The results in Table 1 reveal that 4convs is the optimal choice for this experiment. After AFF3D is incorporated, the feature maps of 4convs are fused, enabling the model to better capture the regional relevance of diffuse tumors.
The ablation results in Table 2 demonstrate the effectiveness of each improvement in the improved attention module. Relative to BL, BN and res-attention make gradient backpropagation stable and robust. The increase in DSC brought by the dilated convolutions is due to the model obtaining a larger, multi-scale receptive field, which allows it to learn more detailed feature information about small tumor regions. The higher weight in Fig. 8 (d) indicates that dilated convolutions of different sizes can capture more details and relevance in the sample images and benefit the segmentation model’s performance. After AFF3D fuses the feature maps produced by the dilated convolutions, the weights assigned to the tumor parts increase further.
Under-segmentation of different attention modules
The PCNSL segmentation results in Fig. 9 imply that, except for our improved attention module, all these attention modules lack the ability to segment the center region of the edge-enhanced tumor. Among them, SE overly emphasized the features of certain tumor enhancement areas and neglected other tumor features. Hence, the results in the top row of Fig. 9 imply that SE is even worse than BL because of the under-segmentation of the tumor center. CBAM’s spatial attention module aims to weight the features in the spatial dimension to capture global and local context information. However, excessive spatial attention caused the model to focus too heavily on global features of the image and neglect local details. This excessive focus leads to under-segmentation of edge-enhanced tumor regions or boundaries in Column 7 of Fig. 9.
Axial attention is a self-attention module that operates along the different axes of an image, and GC attempts to capture global context information of the image to aid the segmentation task; axial attention and GC do not show under-segmentation in the cross-sectional view. However, global context information is not always beneficial to specific local details and boundary information. Axial attention and GC focused excessively on the global context and neglected essential details of local areas. This excessive focus on the global context led to insufficiently fine-grained segmentation of local features and caused under-segmentation in the middle and bottom rows, Column 4 of Fig. 9.
The spatial attention in the scSE module weights different spatial positions of the image to capture global and local context information. However, when the spatial attention focused excessively on subtle differences and noise in the local context information, the model ignored the main features related to the tumor and produced over-segmentation at boundary areas in the top and bottom rows, Column 7 of Fig. 9.
In Table 3, our improved attention module achieves the highest DSC and Jaccard index. This demonstrates the advantages of our improved attention module. Specifically, the dilated convolutions can assist the model to simultaneously focus on both local and global information. Moreover, AFF3D can help the model to assign a higher weight to the tumor region and capture subtle information in the sample image with a larger receptive field. As a result of these improvements, the over-segmentation has been reduced.
Diffuse tumor segmentation analysis of different segmentation methods
Figure 10 reveals that these segmentation methods have difficulty segmenting the small diffuse tumors in the sagittal and coronal views:
SegResnet achieves decent results when dealing with the BraTS dataset, where tumors generally have larger areas and fewer dispersed tumors. However, when dealing with diffuse tumors in PCNSL, the SE and residual parts could not analyze the continuity and correlation information in the top row of Fig. 10.
MedSegDiff and Diff-UNet are both diffusion-based segmentation models, which exhibit robustness to noise. However, their results on 3D images are not as good as on 2D images. Moreover, diffusion models often suffer from over-smoothing, leading to the loss of fine details in the segmentation of diffuse tumors in the bottom row of Fig. 10.
Swin UNETR achieves good results in BraTS 2021 because its transformer encoder uses shifted windows to calculate self-attention and extract features at different resolutions. Although Swin UNETR contains an attention module, the under-segmentation in the middle row of Fig. 10 indicates that it lacks the ability to learn interlayer feature correlations for the dispersed small tumors in PCNSL. The pre-processing and post-processing of images by nnU-Net can improve the precision of segmentation. However, the under-segmentation in the middle and bottom rows of Fig. 10 reveals that nnU-Net lacks an attention module that can capture detailed information about diffuse tumors.
Table 4 indicates that our method addresses the issues of diffuse tumor segmentation described above and achieves the best results. Our model uses dilated convolutions of various sizes to generate a multi-scale receptive field, which allows the model to focus more accurately on diffuse tumors. Using 3D convolutions in the channel attention enhances the model’s ability to analyze the internal correlation and interlayer continuity of 3D MRI images. These two enhancements increase the weight assigned to the diffuse tumor portion, thereby improving segmentation precision. Compared with other segmentation methods, our improved attention module proved superior in segmenting diffuse tumors on MRI images of PCNSL.
Future work
Due to the rarity of PCNSL, there is no public dataset, and research on PCNSL segmentation is limited. Pennig et al. [36] used a 3D convolutional neural network trained on gliomas to provide segmentation of PCNSL comparable to manual segmentation; it is the only previous work that used a deep learning-based method to segment PCNSL. Compared with their study, the improved attention module proposed in this paper compensates for the lack of attention, enabling the model to learn more feature information about the tumor region. In addition, the nnU-Net used in this paper performs better than the 3D CNN model used in their work in terms of preprocessing, data augmentation, and post-processing. This study presents a novel attention-based model for PCNSL segmentation that outperforms existing methods and potentially offers better outputs for PCNSL diagnosis using MR images. However, we acknowledge the ongoing need to enlarge our dataset with more samples and with diverse modalities. Moreover, since PCNSL lesions are relatively dispersed and small, the extensive annotation work poses a notable challenge that needs to be addressed. In future work, we will collect more multimodal PCNSL data samples and perform multi-objective annotation to optimize our segmentation model and attention module.
Conclusion
In this study, we presented a new feature-fusion-attention-based segmentation model to segment PCNSL. By incorporating multi-scale dilated convolutions and using the AFF3D module for multi-scale fusion, the proposed method enables the model to focus more on the location, region and multi-scale information of the tumor, and the correlation between local features and global spatial features is used to guide the segmentation task. The experiments showed that the proposed method provides better results, both quantitatively and qualitatively, than the compared methods in the PCNSL segmentation task. To our knowledge, this work is the first to introduce an attention module for automatic segmentation of PCNSL on MRI images, and it may be of clinical use for PCNSL imaging.
Footnotes
Acknowledgments
The authors are grateful for the funding support from the Fujian Province Science and Technology Innovation Joint Fund, the Pudong New Area Science and Technology Development Fund, and the Shanghai Municipal Alliance for Clinical Competence Improvement and Advancement in Neurosurgery.
Conflict of interest
The authors have no relevant conflicts of interest to disclose.
