FourierFilter irregular attention U-Net with multi-receptive field fusion for cell nucleus segmentation

Abstract

Cell nucleus segmentation plays a significant role in computer-aided systems for cancer diagnosis. However, nuclear images are characterized by nuclei of different sizes, overlap, adhesion, and similarity between nuclei and other structures, which makes the task challenging. To adjust and enhance the feature learning ability of the network, this paper proposes a FourierFilter Irregular Attention U-Net (FFIA-UNet), which contains a FourierFilter Irregular Attention (FFIA) module and a multi-receptive field fusion (MRF) module. The FFIA module learns deeper characteristics by exploiting frequency information and deformable convolution. The MRF module improves the learning of fuzzy edges and irregular shapes via multiple dilated convolutions. Experiments on three datasets show that the proposed FFIA-UNet achieves state-of-the-art performance; on DSB2018, the Dice score and mIoU reach 0.929 and 0.885, respectively. Furthermore, extensive ablation experiments demonstrate the efficacy of the modules.

Keywords

Nucleus segmentation; FourierFilter Irregular Attention; deformable convolution; multi-receptive field

1 Introduction

Cell morphology in histopathological images provides key information for cancer diagnosis, treatment planning, and survival analysis [1]. Owing to the strong feature learning capability of deep learning, many researchers have developed deep models to analyze histopathology images [2]. Nonetheless, nuclear image segmentation poses additional challenges compared with natural images, such as the disorganized distribution of nuclei and the poor contrast between nuclei and background. Researchers have therefore proposed more effective deep learning segmentation methods tailored to the characteristics of nuclear images.

Similar to the segmentation of natural images, nucleus segmentation methods can be divided into two-stage and one-stage approaches according to the training process of the model. Mask R-CNN [3], a highly influential two-stage model, has also been used for nucleus segmentation. Furthermore, SPANet [4] treats the nucleus as a clustering center, and BRPNet [5] generates region proposals based on boundaries. However, the large number of parameters and the complexity of these models, combined with the small size of medical datasets, increase the risk of over-fitting.

A multitude of one-stage nucleus segmentation techniques have since been developed. According to their network structures, these methods can be categorized as U-Net-type and Transformer-type. U-Net [6] is a structure specifically designed for medical segmentation tasks and can achieve excellent segmentation results even with small amounts of data. As a result, U-Net and its variants [7, 8] have been used to segment nuclei with remarkable outcomes. UNet++ [9] uses a nested U-Net structure to increase accuracy, but at the cost of more parameters. To address the limitations of plain skip connections, UNet3+ [10] develops a full-scale skip connection architecture that captures multi-scale information while reducing model parameters. By stacking two U-Net networks and adding VGG-19 as a new encoder, Double-UNet [11] deepens the network and increases segmentation accuracy. DCSAU-Net [12] uses a multi-path attention approach to recover features for improved segmentation. Attention U-Net [13] introduces attention gate modules as a practical way to improve model sensitivity and accuracy. ResUNet [14, 15] adds residual links in the encoder and decoder phases to improve model performance. To further enhance performance, ResUNet++ [16] incorporates Test-Time Augmentation and Conditional Random Field techniques. R2U-Net [17] combines recurrent and residual connections in a novel framework. In addition, the lesion-attention pyramid network (LAPN) [18] enhances the learning ability of the network by integrating subnetworks with different resolutions, and MSRF-Net [19] applies a multi-scale residual fusion network to biomedical image segmentation.

In recent years, the Transformer has achieved strong results in computer vision, and it has therefore also been applied to cell nucleus segmentation. TransUNet [20], which combines U-Net with the Transformer, has attained leading performance in several medical image tasks. Additionally, encoder-decoder-based segmentation approaches [21–24] make use of the Transformer's features. LViT [22] combines text with a Vision Transformer, extracting both text and image information to improve nucleus segmentation accuracy. However, LViT does not apply local text information to segmentation, which leads to low segmentation accuracy in local regions. In addition, a spatial dependence multi-task Transformer (SDMT) network [25] has been proposed for MRI segmentation and landmark localization. Unfortunately, the Transformer is ill-suited to nucleus segmentation because of the limited data available [26]. Moreover, these networks usually ignore irregular shapes and blurred edges.

The Fourier transform has been an important tool in digital image processing for decades. For vision problems, the Fourier transform is widely used in deep learning techniques [27]. Some works use the convolution theorem to speed up CNNs with the fast Fourier transform (FFT) [28], while others use the discrete Fourier transform to map images to the frequency domain and exploit the frequency information to improve performance [27]. Fast Fourier convolution (FFC) [29] performs convolutions in the frequency domain by substituting a local Fourier unit for the convolution used in CNNs. Additionally, the Fourier transform has been employed in place of self-attention in the Transformer to improve the computational efficiency of training and inference while better capturing global frequency-domain information and establishing long-range semantic relationships. Compared with self-attention in the Transformer, the Fourier transform can reduce both the number of parameters and the computational cost.

This paper proposes a FourierFilter Irregular Attention U-Net (FFIA-UNet), which combines the U-Net structure with two designed modules, the multi-receptive field fusion (MRF) module and the FourierFilter Irregular Attention (FFIA) module, to better learn irregular shapes and distinguish edges. Every level of the encoder phase contains the FFIA module, which uses learnable filters and deformable convolution to exchange information globally in the frequency domain, recalibrate the feature map, and enhance the learning of irregular shapes. Because nuclei are tiny, dense, and prone to adhesion and overlap, we propose a feature fusion mechanism, MRF, which sits between the two convolution layers of each encoder stage to enhance the learning ability. Moreover, because the up-sampling operation is not learnable, a predictable spatial misalignment arises when integrating information from multiple stages. We therefore also employ the MRF module to transfer the feature information from each encoding stage to the matching decoding stage. These two plug-and-play modules yield a framework that is simple to use. Tests conducted on four datasets demonstrate that the proposed FFIA-UNet reaches state-of-the-art performance. The contributions are as follows:

A FourierFilter Irregular Attention (FFIA) module is proposed to fit the irregular shape. This module is plug-and-play, realized by fast Fourier transform and depthwise deformable convolution.

A multi-receptive field fusion (MRF) module is proposed to improve the capacity to learn spatial feature information.

The designed FFIA-UNet achieves state-of-the-art performance on the 2018 Data Science Bowl, MoNuSeg and TNBC datasets.

2 Methods

An overview of the proposed network is shown in Fig. 1. We construct the multi-receptive field fusion (MRF) module and the FourierFilter Irregular Attention (FFIA) module, which are placed in the encoder and the skip connections.

Fig. 1

The overall structure of our FFIA-UNet. (a) The FFIA-UNet; (b) FourierFilter Irregular Attention module (FFIA); (c) Multi-receptive field fusion module (MRF).

2.1 The detail of FFIA-UNet

Our FFIA-UNet retains the encoder-decoder structure of U-Net. The FFIA module, shown in Fig. 1, comes after the downsampling operation and aims to minimize information loss by adjusting the learning emphasis of the feature map using the FourierFilter and deformable convolution. Over the four downsampling stages, the features pass through the Fourier transform and its inverse, moving between the spatial and frequency domains, so that the recovered features carry richer edge information. To extract richer features, the MRF module is inserted between the two standard convolutions of each stage. Additionally, the MRF module uses multi-scale features to transfer feature information from each encoder stage to the matching decoder stage, thereby reducing feature misalignment.

We keep the original bilinear interpolation for upsampling and the original convolution operations in the decoder phase unchanged. After two convolutions for feature fusion, the bilinear interpolation step doubles the size of the feature map and the channel number is compressed. Lastly, a pointwise convolution yields the predicted segmentation result.

2.2 FourierFilter irregular attention (FFIA) module

In medical images, the object typically looks very similar to the surrounding background tissue and takes on an irregular shape. Deformable convolution and attention mechanisms offer comparable benefits in terms of adaptive feature selection and receptive field adjustment. Furthermore, the Fourier transform converts the image into the frequency domain, where the analysis is not tied to the irregular shapes. Therefore, this work presents a FourierFilter Irregular Attention (FFIA) module to better fit the nucleus shape in the feature maps of the network. The FFIA module, depicted in Fig. 1(b), consists of the FourierFilter module (FF), depthwise convolution (DWC), deformable convolution (DFC), and pointwise convolution (PWC).

The FF module receives the features as input, converts them to the frequency domain using the fast Fourier transform, and then uses the inverse Fourier transform to return the features to the spatial domain. Subsequently, the features undergo additional calibration via deformable convolution. Before applying DFC, we employ DWC to extract more characteristics from the feature maps produced by the Fourier filtering. PWC is then used to extract the spatial feature information from the feature maps produced by the deformable convolution. Afterwards, as in other image attention mechanisms, we obtain an attention map and combine it with the input feature map to adjust the learning direction of the network. More precisely, the FFIA module can be written as $Y_{FFIA} = ϰ ⨂ (PWC (DFC (DWC (FF (ϰ))))),$ (1) where ϰ and Y_FFIA represent the input and output features of the FFIA operation, and ⨂ represents the element-wise product. The FFIA module combines the advantages of the frequency domain, deformable convolution, and attention. The Fourier operation and deformable convolution are described in detail below.

Fourier Filter operation

The fast Fourier transform converts the image from the spatial domain to the frequency domain, allowing for additional angles of observation and analysis. Physically, the Fourier transform converts the grayscale distribution function of an image into its frequency distribution function, which reflects the intensity of grayscale change in the image. Regions where the gray level changes slowly correspond to low frequencies, and regions where it changes sharply correspond to high frequencies. The 2D fast Fourier transform (FFT) along the spatial dimensions is performed to convert the input $ϰ \in R^{H \times W \times D}$ to the frequency domain: $X = F [ϰ] \in C^{H \times W \times D},$ (2) where F [·] denotes the 2D FFT. Note that X is a complex-valued tensor and represents the spectrum of ϰ. We can then modulate the spectrum by multiplying it with a learnable filter $K \in C^{H \times W \times D}$: $\tilde{X} = K ⊙ X,$ (3) where ⊙ is the element-wise multiplication (also known as the Hadamard product). The filter K is called the Fourier filter, which can represent an arbitrary filter in the frequency domain.

Finally, we adopt the inverse fast Fourier transform (IFFT) to transform the modulated spectrum $\tilde{X}$ back to the spatial domain: $\hat{ϰ} \leftarrow F^{- 1} [\tilde{X}],$ (4) where $\hat{ϰ}$ denotes the updated features and F^-1 [·] denotes the IFFT. The Fourier filter layer is inspired by the frequency filters of digital image processing [30], and the Fourier filter K can be viewed as a collection of learnable frequency filters for the various hidden dimensions. By the convolution theorem, the Fourier filter layer is equivalent to a depthwise global circular convolution with a filter size of H × W.
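As a numerical illustration of Eqs. (2)-(4) and of the equivalence with a global circular convolution, the following minimal Python sketch filters a single-channel 4 × 4 map. It is not the authors' implementation: the naive `dft2` routine stands in for the FFT, the names `dft2`, `fourier_filter` and `circular_conv2` are hypothetical, and the filter K is derived here from a fixed spatial kernel rather than learned.

```python
import cmath

def dft2(x, inverse=False):
    """Naive 2D DFT of an H x W list of lists (illustration only, not a fast FFT)."""
    h, w = len(x), len(x[0])
    sign = 1 if inverse else -1
    out = [[0j] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            total = 0j
            for m in range(h):
                for n in range(w):
                    angle = 2 * cmath.pi * (u * m / h + v * n / w)
                    total += x[m][n] * cmath.exp(sign * 1j * angle)
            out[u][v] = total / (h * w) if inverse else total
    return out

def fourier_filter(x, K):
    """Eqs. (2)-(4) for one channel: transform, multiply by K element-wise, invert."""
    h, w = len(x), len(x[0])
    X = dft2(x)                                                     # Eq. (2)
    X_mod = [[K[u][v] * X[u][v] for v in range(w)] for u in range(h)]  # Eq. (3)
    x_hat = dft2(X_mod, inverse=True)                               # Eq. (4)
    # Real input and a filter built from a real kernel give a real result.
    return [[value.real for value in row] for row in x_hat]

def circular_conv2(x, k):
    """Global circular convolution with an H x W spatial kernel k."""
    h, w = len(x), len(x[0])
    return [[sum(k[a][b] * x[(i - a) % h][(j - b) % w]
                 for a in range(h) for b in range(w))
             for j in range(w)] for i in range(h)]

# Toy feature map and spatial kernel (assumed values, for illustration).
x = [[1.0, 2.0, 0.0, 1.0],
     [0.0, 1.0, 3.0, 2.0],
     [2.0, 0.0, 1.0, 1.0],
     [1.0, 1.0, 0.0, 4.0]]
k = [[0.5, 0.25, 0.0, 0.25],
     [0.1, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0],
     [0.1, 0.0, 0.0, 0.0]]

K = dft2(k)                      # frequency-domain filter of the spatial kernel
y_freq = fourier_filter(x, K)    # frequency-domain route
y_conv = circular_conv2(x, k)    # spatial-domain route
max_diff = max(abs(a - b) for ra, rb in zip(y_freq, y_conv) for a, b in zip(ra, rb))
print(max_diff)
```

Here `max_diff` stays at the level of floating-point round-off, which is the convolution theorem at work. In the network, K would instead be a learnable complex tensor with one H × W filter per channel.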

Deformable convolution

In a neural network, the convolution operation is based on a regular receptive field. The receptive field of a 3 × 3 convolution kernel is a square of 9 pixels. Complex targets, such as the nucleus, appear with varying sizes and roundish forms depending on the nature and location of the lesion. As a result, adapting the receptive field of the convolution kernel to such shapes helps in nuclear segmentation. Unlike regular convolution, deformable convolution [31] introduces learnable offsets into the receptive field. Hence the receptive field of deformable convolution follows the irregular shape of the target region rather than the fixed grid of regular convolution. This design allows it to learn the edge and contour information of shapes more effectively, and it has performed well in certain image segmentation tasks. We therefore briefly introduce deformable convolution below.

Let x and y be the input and output feature maps of a conventional convolution layer, respectively. Each location p₀ on y is expressed as $y (p_{0}) = \sum_{p_{n} \in R} w (p_{n}) \cdot x (p_{0} + p_{n}),$ (5) where $R = \{(- 1, - 1), (- 1, 0), \dots, (0, 1), (1, 1)\}$ denotes the receptive field, w (·) is the weight of the kernel, and p_n enumerates the locations in $R$.

Deformable convolution [31], shown in Fig. 2, first learns offsets through an additional convolution and adds them to the regular sampling coordinates. This process can be expressed as $y (p_{0}) = \sum_{p_{n} \in R} w (p_{n}) \cdot x (p_{0} + p_{n} + \Delta p_{n}) = \sum_{p_{n} \in R} \sum_{q} w (p_{n}) G (q, p_{0} + p_{n} + \Delta p_{n}) \cdot x (q),$ (6) where {Δp_n|n = 1, 2, ⋯ , N} are the offsets and $N = | R |$. Via this operation, the regular receptive field $R$ is augmented with offsets. Because the offsets Δp_n are typically fractional, the sampled values are computed via bilinear interpolation with kernel G (· , ·), where q enumerates all integral spatial locations in the feature map x.

Fig. 2

Illustration of 3 × 3 deformable convolution [31].
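To make Eq. (6) concrete, the following minimal sketch evaluates a 3 × 3 deformable convolution at a single output location in plain Python. It is an illustration under stated assumptions rather than the authors' code: the helper names are hypothetical, the offsets are supplied by hand instead of being predicted by an additional convolution, and samples falling outside the feature map are treated as zero.

```python
import math

# R in Eq. (5): the regular 3 x 3 sampling grid, as (row, col) displacements.
GRID = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

def bilinear_sample(x, py, px):
    """Sum over q of G(q, p) * x(q) for a fractional location p = (py, px)."""
    h, w = len(x), len(x[0])
    y0, x0 = math.floor(py), math.floor(px)
    value = 0.0
    # Only the four integer neighbours of p have a non-zero bilinear weight.
    for qy in (y0, y0 + 1):
        for qx in (x0, x0 + 1):
            if 0 <= qy < h and 0 <= qx < w:  # assumption: zero outside the map
                weight = max(0.0, 1 - abs(py - qy)) * max(0.0, 1 - abs(px - qx))
                value += weight * x[qy][qx]
    return value

def deform_conv_at(x, weights, offsets, p0):
    """Eq. (6) at one output location p0 = (row, col)."""
    total = 0.0
    for (dy, dx), w_n, (oy, ox) in zip(GRID, weights, offsets):
        total += w_n * bilinear_sample(x, p0[0] + dy + oy, p0[1] + dx + ox)
    return total

# Toy 5 x 5 feature map with a linear ramp, and an averaging kernel (assumed values).
x = [[float(5 * i + j) for j in range(5)] for i in range(5)]
weights = [1.0 / 9.0] * 9

regular = deform_conv_at(x, weights, [(0.0, 0.0)] * 9, (2, 2))  # zero offsets: Eq. (5)
shifted = deform_conv_at(x, weights, [(0.5, 0.5)] * 9, (2, 2))  # half-pixel offsets
print(regular, shifted)
```

With zero offsets the result reduces to the regular convolution of Eq. (5). With a uniform half-pixel offset, every sample lands between four pixels and is interpolated, so the output shifts accordingly.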

2.3 Multi-receptive feature fusion (MRF) module

Owing to the dense, microscopic targets commonly found in nuclear images, problems such as adhesion and overlap can easily result in the loss of feature information during downsampling. One effective solution is to expand the receptive field. This study therefore proposes a multi-receptive field fusion (MRF) module, shown in Fig. 1(c). The module is placed between the two 3 × 3 convolution layers of the encoder, and multi-receptive fused feature maps are obtained by extracting features with different receptive fields. The MRF module can effectively handle abnormal nuclei of different shapes and sizes through multiple dilated convolutions with different dilation factors, so that the output of each convolution covers a larger range of information. The module contains three dilated convolutions with dilation factors of 1, 3, and 5, respectively, as shown in Fig. 3, where the first dilated convolution is identical to an ordinary 3 × 3 convolution. Dilated convolutions with larger dilation factors capture global contextual semantic information, while those with smaller dilation factors capture detailed internal structural information, which avoids excessive pixel loss and improves the segmentation of image details and edges.

Fig. 3

3 × 3 convolution with different dilated ratios.
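The sampling patterns in Fig. 3 can be enumerated with a short sketch. The function names below are hypothetical, and the formula for the spanned window, k + (k − 1)(d − 1), is the standard one for a k × k kernel with dilation factor d.

```python
def dilated_taps(dilation, kernel_size=3):
    """Sampling offsets (dy, dx) of a kernel_size x kernel_size dilated kernel."""
    half = kernel_size // 2
    return [(dy * dilation, dx * dilation)
            for dy in range(-half, half + 1)
            for dx in range(-half, half + 1)]

def effective_kernel_size(dilation, kernel_size=3):
    """Side length of the window spanned by the dilated kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

# Dilation factors used by the MRF module.
for rate in (1, 3, 5):
    taps = dilated_taps(rate)
    print(rate, len(taps), effective_kernel_size(rate), taps[0], taps[-1])
```

Each of the three factors keeps nine taps, so the number of weights does not grow, while the spanned window widens from 3 × 3 to 7 × 7 and 11 × 11.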

In addition, there is usually a predictable spatial deviation between the feature maps obtained through the downsampling operation of the encoder and those obtained through the upsampling operation of the decoder. Directly using element-wise addition or channel concatenation for feature fusion can disrupt predictions around object boundaries. Therefore, this work transfers the feature information from each encoder stage to the corresponding decoder stage through the MRF module and reduces feature misalignment by extracting multi-scale features. In addition, to reduce parameters and improve computational speed, the proposed MRF module adopts depthwise convolution. This process can be represented as $Y_{MRF} = DDWC (DDWC (DDWC (X, 1), 3), 5),$ (7) where X and Y_MRF are the input and output of the MRF module, and DDWC (z, i) denotes the dilated depthwise convolution with dilation rate i on input z, with i = 1, 3, 5.
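Reading Eq. (7) literally as three 3 × 3 dilated depthwise convolutions applied in sequence with stride 1, the receptive field and weight count can be worked out as in the sketch below. The function names and the channel count of 64 are assumptions for illustration only, and biases are ignored.

```python
def stacked_receptive_field(dilations, kernel_size=3):
    """Receptive field of stride-1 dilated convolutions applied in sequence."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d  # each layer widens the field by (k - 1) * d
    return rf

def conv_weights(channels, kernel_size=3, depthwise=True):
    """Weights of one layer with equal input and output channels (no bias)."""
    per_filter = kernel_size * kernel_size
    return per_filter * channels if depthwise else per_filter * channels * channels

print(stacked_receptive_field((1, 3, 5)))                # configuration of Eq. (7)
print(conv_weights(64), conv_weights(64, depthwise=False))  # 64 channels is assumed
```

Under these assumptions the cascade with rates 1, 3, 5 sees a 19 × 19 window, and a depthwise layer needs a factor of C fewer weights than a standard convolution with C input and C output channels.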

2.4 Loss

As the loss function, we employ the Dice loss [32]. It is based on the Dice coefficient, which measures how similar two samples are to one another, and is stated as follows: $DiceLoss = 1 - \frac{2 | Y_{Pre} \cap Y_{GT} |}{| Y_{Pre} | + | Y_{GT} |},$ (8) where |Y_Pre| and |Y_GT| represent the number of elements of the prediction Y_Pre and the ground truth Y_GT, and |Y_Pre ∩ Y_GT| represents the size of the intersection between Y_Pre and Y_GT.
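Eq. (8) can be sketched in a few lines for flattened masks. This is a minimal illustration, not the training code: the smoothing constant `eps`, which guards against division by zero on empty masks, is an assumption that does not appear in Eq. (8), and soft predictions in [0, 1] are handled by replacing the intersection with a sum of products.

```python
def dice_loss(pred, target, eps=1e-7):
    """Dice loss of Eq. (8) for flattened masks with values in [0, 1].

    eps is an assumed smoothing term, not part of Eq. (8).
    """
    intersection = sum(p * t for p, t in zip(pred, target))  # |Y_Pre ∩ Y_GT|
    total = sum(pred) + sum(target)                          # |Y_Pre| + |Y_GT|
    return 1.0 - (2.0 * intersection + eps) / (total + eps)

pred = [1, 1, 0, 0]
target = [1, 0, 0, 0]
loss_partial = dice_loss(pred, target)    # one of two predicted pixels is correct
loss_perfect = dice_loss(target, target)  # identical masks
print(loss_partial, loss_perfect)
```

For the toy masks the intersection is 1 and the two masks contain 3 foreground pixels in total, so the loss is close to 1 − 2/3, while identical masks give a loss of zero.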

3 Experiments and results

3.1 Datasets and training details

Datasets. We evaluate our method on the 2018 Data Science Bowl, MoNuSeg and TNBC cell datasets. Besides, we also add experiments on another medical image segmentation dataset (CVC-ClinicDB).

2018 Data Science Bowl (DSB) [9] contains 670 manually labeled nucleus segmentation images. These images were acquired under various conditions and differ in cell type, magnification, and imaging modality (brightfield and fluorescence). The purpose of this dataset is to test the capacity of algorithms to generalize across these variations. In this work, seventy percent of the images are used for training, twenty percent for validation, and ten percent for testing.
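As a small worked example of the 70/20/10 split on 670 images, the sketch below partitions image indices using integer arithmetic. The random shuffle, the seed, and the function name are assumptions; the text does not state how images were assigned to each subset.

```python
import random

def split_indices(n_images, seed=0):
    """Shuffle indices and split them 70/20/10 (shuffle and seed are assumed)."""
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)
    n_train = n_images * 7 // 10   # integer arithmetic avoids float rounding
    n_val = n_images * 2 // 10
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

train, val, test = split_indices(670)
print(len(train), len(val), len(test))
```

For 670 images this yields 469 training, 134 validation, and 67 test images, with no image shared between subsets.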

The MoNuSeg [33] dataset comes from a challenge that involved 32 teams and more than 80 researchers from institutions in different regions. The training set consists of 30 images with 21,623 annotated nuclei, and the test set consists of 14 images. The images come from multiple organs.

The TNBC [34] dataset contains a large number of annotated cells, including normal epithelial and myoepithelial breast cells (located in ducts and lobules), invasive cancer cells, fibroblasts, endothelial cells, adipocytes, macrophages, and inflammatory cells (lymphocytes and plasma cells). The TNBC dataset consists of 50 images, with a total of 4022 annotated cells. The maximum number of cells in one sample is 293, and the minimum is 5. On average, each sample has 80 cells, with a high standard deviation of 58.

CVC-ClinicDB [16] is the official dataset for the training phase of the MICCAI 2015 Colonoscopy Video Automatic Polyp Detection Challenge. The database consists of 612 static images extracted from colonoscopy videos, which come from 29 different sequences. The image size is 288 × 384. Each image is accompanied by a ground truth mask to identify the area covered by the polyps in the image.

Metrics. Five commonly used measures are adopted to assess segmentation performance: mIoU (mean Intersection over Union), recall, accuracy, precision, and Dice score. Among these, the Dice score and mIoU measure the similarity between sets. The Dice coefficient assesses the similarity between the predicted and ground-truth masks, while mIoU averages the ratio of the intersection to the union of the ground truth and the prediction.
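The five measures can be computed from pixel-wise confusion counts, as in the following minimal sketch for binary masks. The function names are hypothetical, and averaging the IoU over images is one common reading of mIoU that is assumed here, since the exact averaging is not specified in the text.

```python
def confusion_counts(pred, gt):
    """True/false positives and negatives for flattened binary masks."""
    tp = sum(1 for p, g in zip(pred, gt) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(pred, gt) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(pred, gt) if p == 0 and g == 1)
    tn = sum(1 for p, g in zip(pred, gt) if p == 0 and g == 0)
    return tp, fp, fn, tn

def segmentation_metrics(pred, gt):
    """Per-image metrics; assumes both masks contain some foreground."""
    tp, fp, fn, tn = confusion_counts(pred, gt)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "dice": 2 * tp / (2 * tp + fp + fn),
        "iou": tp / (tp + fp + fn),
    }

# Two toy "images" as flattened masks (assumed values).
pred_a, gt_a = [1, 1, 0, 0, 1, 0, 0, 0], [1, 0, 0, 0, 1, 1, 0, 0]
pred_b, gt_b = [1, 1, 0, 0], [1, 1, 0, 0]

m_a = segmentation_metrics(pred_a, gt_a)
m_b = segmentation_metrics(pred_b, gt_b)
mean_iou = (m_a["iou"] + m_b["iou"]) / 2  # assumed: mIoU as the mean over images
print(m_a, mean_iou)
```

For a single mask the two overlap measures are linked by Dice = 2·IoU / (1 + IoU), which is why they tend to rank methods in the same order.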

Implementation and Training Details. We implement our model with the PyTorch 1.10.0 framework on a single NVIDIA 3090 Tensor Core GPU, an 8-core CPU, and 24GB RAM. All models are trained using the Adam optimizer with a learning rate of 1e-4. The batch size and number of epochs are set to 16 and 150, respectively. Images are resized to 256 × 256 during training. Additionally, we adjust the learning rate using ReduceLROnPlateau.
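The idea behind a plateau-based learning rate schedule can be sketched in plain Python as below. This is a simplified stand-in for illustration, not PyTorch's ReduceLROnPlateau itself (for example, it has no improvement threshold or cooldown), and the `factor` and `patience` values are assumptions because the text does not report them. Only the initial learning rate of 1e-4 comes from the text.

```python
def plateau_schedule(val_losses, lr=1e-4, factor=0.1, patience=10):
    """Learning rate per epoch under a simplified reduce-on-plateau rule.

    factor and patience are assumed values, not taken from the paper.
    """
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in val_losses:
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:  # no improvement for more than `patience` epochs
            lr *= factor
            bad_epochs = 0
        history.append(lr)
    return history

# Toy validation losses that stop improving after the second epoch.
lrs = plateau_schedule([1.0, 0.9, 0.9, 0.9, 0.9], patience=2)
print(lrs)
```

In the toy run the rate stays at 1e-4 while the loss improves and drops by the assumed factor once the loss has stalled for more than `patience` epochs.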

3.2 Performance comparison

Comparison on 2018 DSB Dataset. We compare our results against the reference models listed in Table 1. The majority of these methods are U-Net variants, and some also incorporate Vision Transformers. As Table 1 shows, our results consistently outperform the benchmark models. Our FFIA-UNet achieves 7.7, 4.2, 2.9, 0.9, and 6.1 percentage points higher mIoU, Dice score, accuracy, recall, and precision, respectively, than the original U-Net. Compared with SSFormer-L, it also improves mIoU and the Dice score by 2.1 and 0.6 percentage points, respectively.

Table 1
Results on the DSB dataset, where “–” denotes that no relevant data is provided in references.

Method	Accuracy	Precision	Recall	Dice-score	mIoU
U-Net [6]	0.955	0.872	0.920	0.887	0.808
Unet++ [9]	0.955	0.874	0.918	0.886	0.814
Unet3+ [10]	0.957	0.889	0.909	0.893	0.825
DoubleU-Net [11]	0.955	0.876	0.927	0.889	0.817
DCSAU-Net [12]	0.960	0.914	0.924	0.914	0.850
Attention-UNet [13]	0.953	0.870	0.918	0.887	0.816
ResUNet++ [16]	0.954	0.900	0.903	0.894	0.822
R2U-Net [17]	0.956	0.884	0.911	0.891	0.822
MSRF-Net [19]	0.965	0.902	0.940	0.922	0.857
TransUNet [20]	0.954	0.900	0.906	0.895	0.821
FANet [35]	0.968	0.920	0.922	0.918	0.857
LeViT-UNet [22]	0.953	0.889	0.888	0.882	0.808
SSFormer-L [23]	–	–	–	0.923	0.861
DuAT [21]	–	–	–	0.926	0.870
FFIA-UNet (Ours)	0.974	0.933	0.933	0.929	0.885

We present five examples with predictions from our network and the benchmark models in Fig. 4. As highlighted by the red boxes in the last column, our method successfully segments nuclei in regions where the previous methods, marked with green boxes, do not. Our approach integrates deeper semantic and frequency-domain features: learnable filters exchange information globally among the tokens in the Fourier domain, and deformable convolution further calibrates the frequency-domain features, which enriches the extracted features and highlights edges. The nuclei segmented by our approach are distinct and essentially accurate. The two black images in the fourth row indicate that the corresponding two models did not segment any nucleus. The number of nuclei in this image is large, but each nucleus is small, making segmentation more difficult.

Fig. 4

Visual comparison with the benchmark techniques on the DSB Dataset.

Comparison on MoNuSeg Dataset. We compare our results with the benchmark models on the MoNuSeg dataset, as reported in Table 2. Our FFIA-UNet again attains state-of-the-art performance, raising the Dice and mIoU scores to 0.815 and 0.702, respectively. Relative to U-Net, both the mIoU and Dice scores increase by more than 6%. In addition, our method outperforms LViT-T by 3.9 and 2.1 percentage points in mIoU and Dice, respectively.

Table 2

Results on the MoNuSeg

Method	Accuracy	Precision	Recall	Dice-score	mIoU
U-Net [6]	0.845	0.769	0.820	0.765	0.629
Unet++ [9]	0.827	0.751	0.811	0.770	0.630
Unet3+ [10]	0.837	0.759	0.817	0.773	0.639
DoubleU-Net [11]	0.830	0.758	0.807	0.770	0.635
Attention-UNet [13]	0.833	0.761	0.801	0.767	0.635
ResUNet++ [16]	0.834	0.761	0.803	0.768	0.637
R2U-Net [17]	0.829	0.752	0.811	0.771	0.632
TransUNet [20]	0.832	0.784	0.802	0.785	0.651
LViT-T [22]	–	–	–	0.804	0.673
LViT-L [22]	–	–	–	0.810	0.682
UCTransNet [24]	–	–	–	0.799	0.667
DoubleUNet-DCA [36]	–	–	–	0.781	0.660
MedT [37]	–	–	–	0.796	0.662
GTUNet [38]	–	–	–	0.793	0.659
MDM [40]	–	–	–	0.810	–
FFIA-UNet (Ours)	0.883	0.815	0.822	0.815	0.702

The visualization results on the MoNuSeg dataset are shown in Fig. 5. It can be seen that the accuracy of the model in cell nucleus segmentation is further improved, and the occurrence of missed and erroneous segmentation is further reduced. This indicates that the model segments small targets well, extracts richer features, and better distinguishes cell nuclei.

Fig. 5

Visual comparison on the MoNuSeg Dataset.

Comparison on TNBC Dataset. We also compare our method with the benchmark models on the TNBC dataset, as shown in Fig. 6. Our FFIA-UNet again attains state-of-the-art performance on all five metrics, raising the Dice and mIoU scores to 0.898 and 0.828, respectively.

Fig. 6

Comparison results on the TNBC Dataset.

Two examples of visualization results on the TNBC dataset are shown in Fig. 7. The segmentation produced by our model is closer to the ground truth (GT) than that of the other models. In the second row, the other models clearly miss nuclei, while FFIA-UNet segments them more accurately. The model better distinguishes foreground from background in nuclear images and effectively separates the nuclei from other structures.

Fig. 7

Visual comparison on the TNBC Dataset.

3.3 Ablation study

We set up several ablation experiments to justify the design of the FFIA and MRF modules. All the ablation experiments are conducted on the 2018 DSB dataset.

Ablation experiment for the composition of FFIA module. Our FFIA module consists of the FourierFilter, depthwise convolution, deformable convolution, and pointwise convolution. We ran four experiments, each removing one of these four blocks. As shown in Table 3, removing any of them lowers both Dice and mIoU. Although recall improved after omitting the Fourier filter in the first experiment, both Dice and mIoU, which are critical for segmentation tasks, dropped slightly. In the third experiment, removing deformable convolution decreased all metrics, with the largest drop in Dice and mIoU, demonstrating that deformable convolution plays an important role in the attention and is better suited to the shape of the nucleus.

Table 3
Ablation Study of FFIA module, where “w.o.” denotes “without”

Method	Acc	Precision	Recall	Dice	mIoU
w.o. FF	0.980	0.918	0.935	0.927	0.875
w.o. DWC	0.964	0.920	0.933	0.922	0.873
w.o. DFC	0.982	0.924	0.916	0.915	0.869
w.o. PWC	0.980	0.937	0.926	0.927	0.882
Ours	0.984	0.933	0.929	0.929	0.885

Ablation experiment on dilated rate configuration of MRF module. To verify the influence of the size and number of receptive fields on the segmentation task, we designed an ablation experiment, whose results are shown in Table 4. In the table, "1, 3, 5" indicates that the MRF module has three dilated convolution layers with dilation rates of 1, 3, and 5 in turn. Likewise, "2, 4, 6, 8" indicates four dilated convolution layers with dilation rates of 2, 4, 6, and 8 in sequence. As can be seen, the configurations {1, 3, 5} and {1, 3, 5, 7} perform comparably, but the latter adds an additional layer and more parameters.

Table 4

Dilated rate configuration of MRF module

rate	Acc	Precision	Recall	Dice	mIoU
1, 3	0.983	0.905	0.947	0.923	0.874
2, 4	0.980	0.915	0.934	0.928	0.877
1, 3, 5	0.984	0.933	0.929	0.929	0.885
2, 4, 6	0.982	0.938	0.926	0.924	0.883
1, 3, 5, 7	0.982	0.929	0.937	0.931	0.884
2, 4, 6, 8	0.981	0.911	0.951	0.930	0.877

Ablation experiments on FFIA and MRF modules. We design seven experiments to examine the FFIA and MRF modules, with results shown in Table 5. "MRF in Encoder" indicates that the MRF module is used in the encoder stage, and "MRF in skip" means that the MRF module is used in place of the plain skip connection. Table 5 shows that adding even a single module improves on U-Net, and that the mIoU and Dice scores improve most when all modules are employed.

Table 5

Ablation experiments of module effectiveness, where ✓ and × indicate whether the module is employed respectively

FFIA	MRF in Encoder	MRF in skip	Accuracy	Precision	Recall	Dice	mIoU
×	×	×	0.955	0.872	0.920	0.887	0.808
✓	×	×	0.979	0.904	0.931	0.914	0.858
×	✓	×	0.979	0.907	0.923	0.908	0.841
×	✓	✓	0.982	0.919	0.925	0.915	0.857
✓	×	✓	0.979	0.905	0.931	0.915	0.860
✓	✓	×	0.983	0.928	0.933	0.927	0.883
✓	✓	✓	0.984	0.933	0.929	0.929	0.885

Model parameter comparison. We compare the number of parameters and the computational cost of the models, as shown in Fig. 8. The number of parameters (Params) of FFIA-UNet is 32.95M and its computational cost (FLOPs) is 56.87G. In both respects FFIA-UNet remains broadly comparable to the compared models, with only a slight increase. The parameters and FLOPs of UNet++ are much higher than those of the others, indicating that it improves performance by increasing computational complexity. Together with the precise and clear segmentation results, this shows that the proposed FFIA-UNet achieves a significant performance improvement without substantially increasing the computing load.

Fig. 8

Comparison of the number of parameters and FLOPs.

Visualization of Class Activation Mapping for Models. To illustrate what the model learns, we visualize the class activation map after the downsampling stage. As shown in Fig. 9, our model focuses more on the nuclei themselves rather than spreading to the surrounding areas.

Fig. 9

Visualization of class activation mapping.

Ablation experiment with other attention modules. To better illustrate the efficacy of the proposed attention mechanism, we also compared it with other popular attention approaches, including SENet [41] and CBAM [42]. The "DA" in the second to last row represents deformable attention, which mainly uses deformable convolution to obtain attention information. Table 6 shows that SENet, despite being the earliest of these attention mechanisms, still performs strongly and ranks second. The other attention methods perform slightly worse than SENet. Our FFIA outperforms SENet, demonstrating that attention based on the Fourier filter and deformable convolution is more effective at adjusting the learning direction of the network.

Table 6

Comparison with other attention modules

Method	Acc	Precision	Recall	Dice	mIoU
w.o attention	0.955	0.872	0.920	0.887	0.808
SE [41]	0.980	0.931	0.933	0.928	0.873
CBAM [42]	0.979	0.909	0.946	0.924	0.865
A2-Net [43]	0.964	0.887	0.892	0.880	0.805
scSE [44]	0.980	0.905	0.935	0.916	0.863
SK-Net [45]	0.975	0.930	0.916	0.917	0.856
CC-Net [46]	0.979	0.923	0.936	0.926	0.869
ECA-Net [47]	0.976	0.893	0.955	0.920	0.857
DA-UNet [48]	0.974	0.909	0.925	0.911	0.844
Ours	0.984	0.933	0.929	0.929	0.885
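For reference, the Dice and IoU columns can be computed from a pair of binary masks as sketched below. This is a minimal sketch using the standard definitions, not the evaluation code of this paper: the function name `dice_and_iou`, the smoothing constant `eps`, and the toy masks are assumptions, and the reported mIoU is assumed to be an average of such IoU values.

```python
import numpy as np


def dice_and_iou(pred, target, eps=1e-7):
    """Dice coefficient and IoU for two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    dice = (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
    iou = (inter + eps) / (union + eps)
    return float(dice), float(iou)


# Toy masks: 2 overlapping pixels, 3 foreground pixels in each mask.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]

dice, iou = dice_and_iou(pred, target)
print(f"Dice = {dice:.4f}, IoU = {iou:.4f}")
```

For a single pair of masks the two metrics are related by Dice = 2·IoU / (1 + IoU); this identity need not hold for values averaged over many images.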

Ablation experiment on another medical image segmentation dataset. To demonstrate the generalization performance of our model, we also conducted experiments on CVC-ClinicDB, a medical image dataset that does not involve cell nucleus segmentation; the results are shown in Table 7. The Dice score and mIoU of the proposed FFIA-UNet are 0.906 and 0.841, respectively, the best among the compared networks. In contrast, the segmentation results of the Transformer-based TransUNet are not satisfactory, with a Dice score and mIoU of 0.867 and 0.799, respectively. This is mainly due to the small size of the CVC-ClinicDB dataset and the small number of segmentation targets, combined with the complexity of the TransUNet model, which makes it prone to overfitting.

Table 7

Results on the CVC-ClinicDB

Method	Accuracy	Precision	Recall	Dice	mIoU
U-Net [6]	0.984	0.882	0.893	0.872	0.809
Unet++ [9]	0.984	0.919	0.859	0.876	0.811
Attention-UNet [13]	0.986	0.904	0.901	0.895	0.835
ResUNet++ [16]	0.982	0.870	0.853	0.854	0.781
DoubleU-Net [11]	0.986	0.892	0.912	0.896	0.836
R2U-Net [17]	0.978	0.880	0.847	0.841	0.765
Unet3+ [12]	0.984	0.907	0.885	0.892	0.827
TransUNet [20]	0.982	0.876	0.873	0.867	0.799
FFIA-UNet (Ours)	0.990	0.907	0.910	0.906	0.841

Figure 10 shows the qualitative results on the CVC-ClinicDB dataset. The segmentation result of the proposed FFIA-UNet is the closest to the ground-truth (GT) image. The other models, shown in the other rows, exhibit misclassification and blurred edges. This is mainly because the distinction between foreground and background is unclear in medical images, and the other networks rely only on features extracted in the spatial domain. Our method relies on the fast Fourier transform to convert spatial-domain features into the frequency domain, which helps distinguish foreground from background and yields better segmentation results.

Fig. 10

Visual comparison on the CVC-ClinicDB dataset.
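To make the role of the frequency domain more concrete, the sketch below uses the fast Fourier transform to split an image into a low-frequency part (smooth regions) and a high-frequency part (edges). It only illustrates the general idea and is not the FFIA module itself: the function `split_frequencies`, the fixed circular mask, and the synthetic disk image are assumptions, whereas a learned filter would replace the fixed mask in a trainable network.

```python
import numpy as np


def split_frequencies(image, radius):
    """Split a 2D image into low- and high-frequency parts.

    A circular mask of the given radius around the zero frequency
    selects the low band; everything outside it forms the high band.
    """
    h, w = image.shape
    spectrum = np.fft.fftshift(np.fft.fft2(image))  # zero frequency at the center
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    low_mask = dist <= radius
    low = np.fft.ifft2(np.fft.ifftshift(spectrum * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(spectrum * ~low_mask)).real
    return low, high


# Synthetic "nucleus": a bright disk on a dark 32 x 32 background.
yy, xx = np.ogrid[:32, :32]
img = ((yy - 16) ** 2 + (xx - 16) ** 2 <= 36).astype(float)

low, high = split_frequencies(img, radius=4)
print("max |low + high - img| =", float(np.abs(low + high - img).max()))
```

The two parts sum back to the original image, so filtering in the frequency domain amounts to reweighting these components before returning to the spatial domain.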

4 Conclusion

In this paper, we propose a FourierFilter Irregular Attention (FFIA) module and a multi-receptive field fusion (MRF) module. The former exchanges information globally among tokens in the Fourier domain and adaptively extracts features from the backbone network, and the latter fuses features from multiple receptive fields. Our FFIA-UNet achieves state-of-the-art performance on the 2018 DSB, MoNuSeg and TNBC datasets. With only a slight increase in the number of parameters, the Dice score and mIoU on the DSB dataset reach 0.929 and 0.885, respectively. On MoNuSeg and TNBC, the Dice score reaches 0.815 and 0.83, respectively. The necessity and effectiveness of the modules were further confirmed by the ablation experiments.

The model in this paper also has some limitations. First, the feature learning ability of the network backbone needs to be further enhanced. Second, frequency-domain information is important and widely used in computer vision, but this article uses it only in the simplest way. Third, the experiments were conducted on public datasets, which differ from clinical data.

In follow-up work, we plan to use a Mamba structure to extract global information and to exploit frequency-domain information more efficiently with the wavelet transform.

Acknowledgment(s)

This work is supported by the Natural Science Starting Project of SWPU (No. 2022QHZ023, 2022QHZ013), the Sichuan Provincial Department of Science and Technology Project (No. 2022NSFSC0283), the Sichuan Scientific Innovation Fund (No. 2022JDRC0009), the Key Research and Development Project of Sichuan Provincial Department of Science and Technology (No. 2023YFG0129), and the Key Laboratory of Internet Natural Language Intelligent Processing in Sichuan Provincial Higher Education Institutions (No. INLP202202). In addition, we also thank the High-Performance Computing Center, Southwest Petroleum University for its support.

Conflict of interest statement

No potential conflict of interest was reported by the authors.

References

1. Nasir E.S., Parvaiz and Fraz M.M., Nuclei and glands instance segmentation in histology images: a narrative review, Artificial Intelligence Review 56(8) (2023), 7909–7964.
2. Kumar, Bhatt, Vimal, et al., Automated white corpuscles nucleus segmentation using deep neural network from microscopic blood smear, Journal of Intelligent & Fuzzy Systems 42(2) (2022), 1075–1088.
3. Vuola A.O., Akram S.U. and Kannala, Mask-RCNN and U-Net Ensembled for Nuclei Segmentation, in: IEEE International Symposium on Biomedical Imaging (ISBI), 2019, 208–212.
4. Koohbanani N.A., Jahanifar, Gooya, et al., Nuclear Instance Segmentation Using a Proposal-Free Spatially Aware Deep Learning Framework, in: International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), 2019, 622–630.
5. Song, Tan, Jiang, et al., Accurate Cervical Cell Segmentation from Overlapping Clumps in Pap Smear Images, IEEE Transactions on Medical Imaging 36(1) (2017), 288–300.
6. Ronneberger, Fischer and Brox, U-Net: Convolutional Networks for Biomedical Image Segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015, 234–241.
7. Ding and Wang, Efficient Unet with depth-aware gated fusion for automatic skin lesion segmentation, Journal of Intelligent & Fuzzy Systems 40(5) (2021), 9963–9975.
8. [...], Fu, Zhang, et al., Adaptive multi-scale feature fusion based U-net for fracture segmentation in coal rock images, Journal of Intelligent & Fuzzy Systems 42(4) (2022), 3761–3774.
9. Zhou, Siddiquee M.M.R., Tajbakhsh, et al., UNet++: A Nested U-Net Architecture for Medical Image Segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2018, 3–11.
10. Huang, Lin, Tong, et al., Unet 3+: A full-scale connected unet for medical image segmentation, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, 1055–1059.
11. Mubarrat S.T. and Chowdhury, 2-deep convolutional neural network in medical image processing, in: Handbook of Deep Learning in Biomedical Engineering, 2021, 25–60.
12. [...], Ma, He, et al., DCSAU-Net: A deeper and more compact split-attention U-Net for medical image segmentation, Computers in Biology and Medicine 154 (2023), 106626.
13. Oktay, Schlemper, Folgoc L.L., et al., Attention u-net: Learning where to look for the pancreas, arXiv, 2018, 1804.03999.
14. Xiao, Lian, Luo, et al., Weighted res-unet for high-quality retina vessel segmentation, in: International Conference on Information Technology in Medicine and Education (ITME), 2018, 327–331.
15. Ajilisa O.A., Jagathy Raj V.P. and Sabu M.K., Segmentation of thyroid nodules from ultrasound images using convolutional neural network architectures, Journal of Intelligent & Fuzzy Systems 43(1) (2022), 687–705.
16. Jha, Smedsrud P.H., Riegler M.A., et al., ResUNet++: An Advanced Architecture for Medical Image Segmentation, in: IEEE International Symposium on Multimedia (ISM), 2019, 225–230.
17. Alom M.Z., Hasan, Yakopcic, et al., Recurrent Residual Convolutional Neural Network based on U-Net (R2U-Net) for Medical Image Segmentation, arXiv, 2018, 1802.06955.
18. [...], Jiang, Zhang, et al., Lesion-attention pyramid network for diabetic retinopathy grading, IEEE International Symposium on Multimedia (ISM) 126 (2022), 102259.
19. Srivastava, Jha, Chanda, et al., MSRF-Net: A Multi-Scale Residual Fusion Network for Biomedical Image Segmentation, IEEE Journal of Biomedical and Health Informatics 26(5) (2022), 2252–2263.
20. Chen, Lu, Yu, et al., TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation, arXiv, 2021, 2102.04306.
21. Tang, Huang, Wang, et al., DuAT: Dual-aggregation transformer network for medical image segmentation, arXiv, 2022, 2212.11677.
22. [...], Li, et al., LVIT: language meets vision transformer in medical image segmentation, IEEE Transactions on Medical Imaging, 2023: Accepted paper.
23. Wang, Huang, Tang, et al., Stepwise Feature Fusion: Local Guides Global, in: International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), 2022, 110–120.
24. Wang, Cao, Wang, et al., UCTransNet: Rethinking the Skip Connections in U-Net from a Channel-Wise Perspective with Transformer, in: Thirty-Sixth AAAI Conference on Artificial Intelligence, 2022, 2441–2449.
25. [...], Lv, Li, et al., SDMT: Spatial Dependence Multi-Task Transformer Network for 3D Knee MRI Segmentation and Landmark Localization, IEEE Transactions on Medical Imaging 42(8) (2023), 2274–2285.
26. Liu, Lin, Cao, et al., Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, in: IEEE/CVF International Conference on Computer Vision (ICCV), 2021, 9992–10002.
27. Lee J.H., Heo, Kim K.R., et al., Single-image depth estimation based on Fourier domain analysis, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, 330–339.
28. [...], Xue, Zhu, et al., FALCON: A Fourier Transform Based Approach for Fast and Secure Convolutional Neural Network Predictions, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, 8702–8711.
29. Chi, Jiang and Mu, Fast fourier convolution, in: Advances in Neural Information Processing Systems (NeurIPS), 2020, 4479–4488.
30. Pitas, Digital image processing algorithms and applications, IEEE Signal Processing Magazine 18(2) (2000), 58.
31. Dai, Qi, Xiong, et al., Deformable Convolutional Networks, in: IEEE International Conference on Computer Vision (ICCV), 2017, 764–773.
32. Milletari, Navab and Ahmadi, V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation, in: International Conference on 3D Vision (3DV), 2016, 565–571.
33. Kumar, Verma, Anand, et al., A Multi-Organ Nucleus Segmentation Challenge, IEEE Transactions on Medical Imaging 39(5) (2020), 1380–1391.
34. Peter, Marick, Fabien and Thomas, Segmentation of nuclei in histopathology images by deep regression of the distance map, IEEE Transactions on Medical Imaging 38(2) (2018), 448–459.
35. Tomar N.K., Jha, Riegler M.A., et al., FANet: A Feedback Attention Network for Improved Biomedical Image Segmentation, IEEE Transactions on Neural Networks and Learning Systems 34(11) (2023), 9375–9388.
36. Ates G.C., Mohan and Celik, Dual Cross-Attention for medical image segmentation, Engineering Applications of Artificial Intelligence 126 (2023), 107139.
37. Valanarasu J.M.J., Oza, Hacihaliloglu, et al., Medical Transformer: Gated Axial-Attention for Medical Image Segmentation, in: International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), 2021, 36–46.
38. [...], Wang, et al., GT U-Net: A U-Net Like Group Transformer Network for Tooth Root Segmentation, in: International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), 2021, 386–395.
39. Wazir and Fraz M.M., HistoSeg: Quick attention with multi-loss function for multi-structure segmentation in digital histology images, arXiv, 2022, 2209.00729.
40. Pan, Chen and Shi, Masked Diffusion as Self-supervised Representation Learner, arXiv, 2023, 2308.05695.
41. [...], Shen and Sun, Squeeze-and-Excitation Networks, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, 7132–7141.
42. Woo, Park, Lee, et al., CBAM: Convolutional Block Attention Module, in: European Conference on Computer Vision (ECCV), 2018, 3–19.
43. Chen, Kalantidis, Li, et al., A²Nets: Double Attention Networks, in: Advances in Neural Information Processing Systems (NeurIPS), 2018, 350–359.
44. Roy A.G., Navab and Wachinger, Concurrent Spatial and Channel ‘Squeeze & Excitation’ in Fully Convolutional Networks, in: International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), 2018, 421–429.
45. [...], Wang, Hu, et al., Selective Kernel Networks, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, 510–519.
46. Huang, Wang, Wei, et al., CCNet: Criss-Cross Attention for Semantic Segmentation, IEEE Transactions on Neural Networks and Learning Systems 45(6) (2023), 6896–6908.
47. Wang, Wu, Zhu, et al., ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, 11531–11539.
48. Xiao, Pan and Zhang, DA-UNet: Deformable Attention U-Net for Nucleus Segmentation, in: International Conference on Computer, Vision and Intelligent Technology Proceedings, 2023, 11531–11539.