Abstract
BACKGROUND:
Convolutional neural networks have had a profound effect on cardiac image segmentation. However, the diversity of medical imaging equipment introduces domain shift, which is a challenge for cardiac image segmentation.
OBJECTIVE:
To address the domain shift in multi-modality cardiac image segmentation, this study investigates and tests an unsupervised domain adaptation network, RA-SIFA, which combines a parallel attention module (PAM) and a residual attention unit (RAU).
METHODS:
First, the PAM is introduced into the generator of RA-SIFA to fuse global information, which reduces the domain shift from the aspect of image alignment. Second, the shared encoder adopts the RAU, a residual block built on a spatial attention module, to alleviate the insensitivity of convolution layers to spatial position. The RAU thus further reduces the domain shift from the aspect of feature alignment. By combining image and feature alignment, RA-SIFA realizes unsupervised domain adaptation (UDA) and addresses the domain shift of cardiac image segmentation in a complementary manner.
RESULTS:
The model is evaluated on the MM-WHS2017 dataset. Compared with SIFA, the Dice of the new RA-SIFA network improves by 8.4% and 3.2% on CT and MR images, respectively, while the average symmetric surface distance (ASD) is reduced by 3.4 mm and 0.8 mm on CT and MR images, respectively.
CONCLUSION:
The study results demonstrate that our new RA-SIFA network can effectively improve the accuracy of whole-heart segmentation from CT and MR images.
Introduction
The heart is one of the most important organs of the human body, yet many people around the world suffer from cardiovascular diseases. According to World Health Organization (WHO) statistics, an estimated 23.6 million people will die from cardiovascular disease by 2030 [1]. Cardiovascular disease has become a leading cause of death worldwide. The diagnosis and treatment of cardiac disease are therefore of great importance.
The cardiac function of patients can be evaluated through parameters of the physiological structure, and accurate segmentation of cardiac structures is essential for the quantitative analysis of physiological function. In recent years, deep learning (DL) methods have been commonly used in cardiac image segmentation. Compared with conventional segmentation methods, DL methods offer better accuracy and efficiency. DL-based cardiac segmentation methods fall into three categories: supervised, semi-supervised and unsupervised segmentation.
Supervised segmentation relies on training with labeled images and has achieved satisfactory results in cardiac image segmentation. Liao et al. [2] developed MMTLNET for three-dimensional whole-heart segmentation, which optimizes feature extraction by introducing an attention module and a U-Net backbone network. Payer et al. [3] proposed a multi-label whole-heart segmentation network based on two FCNs: one locates the center of the whole-heart substructures, and the other performs precise segmentation.
Semi-supervised segmentation means that only part of the training data is labeled. Jiang et al. [4] proposed a domain adaptation framework for tumor segmentation, which translates CT images into MRI images through adversarial learning. Li et al. [5] proposed a two-teacher model that integrates intra-domain and inter-domain adaptation and learns prior knowledge of the source domain data through knowledge distillation to achieve efficient multi-modality cardiac segmentation. The collaborative feature ensembling adaptation framework (CFEA) proposed by Liu et al. [6] has achieved excellent results in fundus image segmentation. Valindria et al. [7] designed a dual-stream encoder-decoder structure that learns the relevant information of a multi-modality dataset to improve multi-organ segmentation performance.
Unsupervised domain adaptation is usually used for multi-modality medical image segmentation and includes feature alignment, image alignment and the combination of both. Image alignment usually uses the CycleGAN [8] framework for image conversion to reduce appearance differences. Zhang et al. [9] developed a cross-modality image segmentation framework that can extract effective features from unpaired images. Chartsias et al. [10] realized efficient segmentation of cardiac images by translating CT images into MRI images with the CycleGAN framework. Cai et al. [11] proposed a cross-modality 2D/3D image segmentation method, which segments several non-rigid organs by maintaining shape consistency. Feature alignment usually uses loss functions and adversarial learning to constrain the features of the two domains. Wang et al. [12] proposed a glaucoma screening and diagnosis network named pOSAL, which guides the network to segment fundus images with a perceptible segmentation loss and obtains better performance. Chen et al. [13] developed a fundus image segmentation framework named IOSUDA, which achieves feature alignment by extracting complementary features from the two domains. However, because these methods consider only image alignment or only feature alignment, they extract insufficient cross-modality information and fail to solve the domain shift problem.
Combining image alignment and feature alignment makes full use of their respective advantages. Chen et al. [14] proposed an anatomical regularized representation learning method (ARL-GAN) that introduces adversarial learning for multi-modality cardiac sub-structure segmentation and skull segmentation. It effectively retains the anatomical structure information when synthesizing cross-modality images. Dong et al. [15] developed a new unsupervised semantic lesion transfer model for endoscopic lesion segmentation, which combines image alignment with feature alignment to alternately explore transferable domain-invariant information and reduce domain shift.
Although the above-mentioned supervised and semi-supervised segmentation models show good segmentation performance, they depend mainly on large annotated datasets. At present, datasets are annotated mainly by medical experts, which is inefficient. Furthermore, manual annotation tends to introduce subjectivity, which leads to poor generalization ability of the model [16]. An unsupervised model can be trained without labeled data, which saves labor and is robust.
Many models have achieved advanced results in medical image segmentation. However, a model trained on data from one modality cannot be directly applied to data from another modality with a different distribution. Due to the complexity of the cardiac structure, the appearance of cardiac images differs significantly between modalities and between patients, so cardiac images in different modalities follow different distributions. These differences cause domain shift, which limits the performance of multi-modality cardiac image segmentation [17]. An unsupervised method does not require manually labeled data, and reducing the domain shift improves the generalization ability and portability of the model. To address these challenges, a UDA framework that combines the PAM with the RAU for multi-modality whole-heart segmentation is proposed. RA-SIFA is designed based on CycleGAN, and co-training is carried out by connecting two GAN models end to end. One realizes the conversion from source to target domain images, and the other realizes the image segmentation and the reconstruction from the target to the source domain. The cycle-consistency loss keeps the mapping between the two domains consistent. The main contributions of this paper are as follows:
(1) A parallel attention module, which consists of two attention modules and a residual connection, is designed to extract contextual background information and capture inter-channel dependencies. The output features of the two attention modules and the residual connection are summed to obtain features with better discriminative ability. The details at each position of the generated image are coordinated with distant details, and the resolution of the generated image tends to be better. Thus, the PAM makes the source domain data more similar to the target data and reduces the domain shift from the aspect of image alignment.
(2) A residual attention unit is proposed, as shown in Fig. 3(c), which integrates the spatial attention module and the residual block. It alleviates the insensitivity of the convolutional layer to spatial position, and the network retains unique information while capturing relevant information, which improves the performance of the network from the aspect of feature alignment.
(3) A UDA multi-modality cardiac image segmentation network, RA-SIFA, combining the PAM with the RAU is proposed, which performs domain adaptation by combining image and feature alignment. The complementary information between MRI and CT images is used to perform bidirectional cross-modality adaptation.
Method
To address the domain shift in multi-modality cardiac images, this paper proposes an unsupervised domain adaptation framework based on the SIFA [18] network. As shown in Fig. 1, the network equipped with the PAM and the RAU is named RA-SIFA. The model maintains the consistency of anatomical structure between images of the two modalities (CT⟶MRI, MRI⟶CT) by combining image alignment and feature alignment, thereby realizing bidirectional learning between source domain images xA and target domain images xB. Adversarial training is then used to reduce the difference between the two modalities and improve the performance of RA-SIFA. RA-SIFA is mainly composed of a generator GB, discriminators (DA, DB, Dpi), a shared encoder E, a decoder U and a classifier Ci. The generator GB converts xA into the target-like image xA→B. DB discriminates whether the image xA→B is real or not. DA discriminates whether the image reconstructed by the generator (E, U) comes from xA→B or xB. Dpi discriminates between the segmentation prediction maps of samples from the two domains. The shared encoder E combined with the decoder U forms a new generator GA, which converts xB into the source-like image xB→A or reconstructs xA→B into the source-domain image xA→B→A. Ci is a pixel-level classifier, which is cascaded with the shared encoder E to segment the target domain image.

The framework of the RA-SIFA model.
The generator GB, the discriminator DB, the shared encoder E and the decoder U constitute a CycleGAN for the mutual conversion of the two modality images. Blue and orange arrows represent the source and target domain data flows, respectively. p : xA→B denotes the prediction for the target-like image xA→B, and p : xB denotes the prediction for the target domain image.
Because imaging instruments rely on different principles, images of different modalities differ in visual appearance. The generator GB reduces the domain shift at the image-alignment level by converting xA into xA→B. Details of the generator are shown in Fig. 2(a). The encoding structure uses three convolutional blocks, nine residual blocks and one attention module for feature extraction. Each convolutional block includes a convolution layer, an Instance Normalization layer and a rectified linear unit (ReLU) layer. The convolution layers use kernel sizes of 7×7 and 3×3, and the three convolution blocks have 32, 64 and 128 kernels, respectively. Instance Normalization normalizes each image over its spatial dimensions H and W, which speeds up the convergence of the model while keeping the images independent of one another. Through its identity connection, the residual block retains the underlying features, fuses feature information across network layers and reduces vanishing gradients during back propagation.
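The normalization and activation steps of one convolutional block can be illustrated with a minimal NumPy sketch. It assumes a channels-last layout (N, H, W, C) and omits the learnable scale and shift that framework implementations usually add; the function names and the constant `eps` are illustrative and not taken from the paper.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Normalize every (sample, channel) map over its spatial axes H and W.

    x has shape (N, H, W, C). eps is an assumed stability constant.
    """
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def relu(x):
    """Rectified linear unit."""
    return np.maximum(x, 0.0)

# Toy feature maps standing in for the output of a convolution layer.
rng = np.random.default_rng(0)
features = rng.normal(loc=3.0, scale=2.0, size=(2, 8, 8, 4))

normed = instance_norm(features)   # statistics computed per image, per channel
activated = relu(normed)
```

Because the statistics are computed separately for each image, the result for one image does not depend on the other images in the batch.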

Structure of the generator GB and PAM.
In addition, the PAM is introduced between the encoding and decoding structures. As a supplement to the convolution blocks, the PAM improves the diversity of features and establishes long-distance dependencies between image regions. The PAM, shown in Fig. 2(b), is composed of two parallel GCNet [19] blocks and a residual connection. The GCNet block works in three steps: context modeling, channel dependence and feature fusion. The context modeling step is inspired by the non-local module [20]; it gathers global information and supplements useful semantic information. Following the SENet block [21], the transform step captures the dependence between channels. Layer Normalization (LN) [22] enhances the generalization ability of the network. Finally, the extracted context features are merged with the shallow features through addition. The PAM structure is defined as Formula 1.
where i is the index of query positions, j enumerates all possible positions, and x and z represent the input and output, respectively. W represents a linear transformation matrix, LN denotes the added layer normalization, and Np is the number of positions in the feature map.
The PAM used in the generator extracts global contextual information while reducing computational complexity. Compared with a cascaded attention module, the PAM lets the network make full use of context information and further enhances the diversity of features. By coordinating the information between distant pixels, the PAM makes xA→B and xB more similar.
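The following NumPy sketch illustrates how two parallel global-context branches and a residual connection could be combined. It follows the general GCNet recipe (softmax attention over positions, a bottleneck transform with LN and ReLU, then addition). The weight names `w_k`, `w_v1`, `w_v2`, the reduction ratio and the way the two branches are summed are assumptions for illustration, not the paper's Formula 1.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def layer_norm(v, eps=1e-5):
    return (v - v.mean()) / np.sqrt(v.var() + eps)

def gc_branch(x, w_k, w_v1, w_v2):
    """One global-context branch; returns a context vector of shape (C,).

    x is a feature map flattened to (Np, C), with Np positions and C channels.
    """
    attn = softmax(x @ w_k)        # (Np,) attention weights over positions j
    context = attn @ x             # (C,) weighted sum of all positions
    hidden = np.maximum(layer_norm(w_v1 @ context), 0.0)  # bottleneck, LN, ReLU
    return w_v2 @ hidden           # (C,) channel-wise context

def pam(x, branch_a, branch_b):
    """Assumed fusion: input plus the outputs of two parallel branches."""
    return x + gc_branch(x, *branch_a) + gc_branch(x, *branch_b)

rng = np.random.default_rng(0)
num_positions, channels, reduction = 16, 8, 4   # illustrative sizes

def random_branch():
    return (rng.normal(size=channels),
            rng.normal(size=(channels // reduction, channels)),
            rng.normal(size=(channels, channels // reduction)))

x = rng.normal(size=(num_positions, channels))
z = pam(x, random_branch(), random_branch())
```

In this sketch each branch produces one context vector that is broadcast to every position, which is why the cost grows linearly with the number of positions rather than quadratically as in a full non-local block.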
The decoder structure is composed of two deconvolution blocks with a kernel size of 3×3 and a convolution block with a kernel size of 1×1. The deconvolution block, which has a similar structure to the convolution block, integrates the features and restores the image resolution. Finally, the target domain image is obtained through the convolution layer with a kernel size of 1×1.
The shared encoder E plays multiple roles in the framework. First, the shared encoder E and the decoder U constitute the generator GA, which reconstructs xA. Second, the shared encoder E and the classifier Ci form the segmentation module for cardiac image segmentation. The shared encoder E extracts domain-invariant features by combining the useful information of the two modalities, so the network reduces the domain shift from the aspect of feature alignment. As shown in Fig. 3(a), the shared encoder E is mainly composed of the convolution block, max-pooling layer, residual attention unit, dilation residual block and residual dilation padding block. The structure of the convolution block is consistent with that in the generator GB. The dilation residual block includes two convolution layers with a dilation rate of 2, which enlarges the receptive field. By combining the original residual block with the spatial attention module, the RAU extracts more useful information. The structure of RAU is shown in Fig. 3(b).
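The effect of the dilation rate on the receptive field can be checked with a short calculation. The sketch below assumes 3×3 kernels and stride 1, neither of which is stated for the dilation residual block, and the helper names are illustrative.

```python
def effective_kernel(kernel_size, dilation):
    """Spatial extent covered by a dilated kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

def receptive_field(layers):
    """Receptive field of stacked stride-1 convolutions.

    layers is a list of (kernel_size, dilation) pairs.
    """
    rf = 1
    for kernel_size, dilation in layers:
        rf += effective_kernel(kernel_size, dilation) - 1
    return rf

# Two stacked 3x3 convolutions (kernel size assumed), without and with dilation 2.
plain = receptive_field([(3, 1), (3, 1)])     # 5
dilated = receptive_field([(3, 2), (3, 2)])   # 9
```

Under these assumptions, two dilated layers see a 9×9 neighborhood instead of 5×5 with the same number of weights.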

The structure diagram of shared encoder E and residual attention unit.
The spatial attention module includes two convolutional layers with the kernel size of 1×1, one batch normalization (BN) layer, one ReLU layer and one sigmoid layer. The structure of RAU is shown in Fig. 3(c).
By introducing the spatial attention module into the residual block, the model alleviates the insensitivity of the convolution layer to spatial location and enhances the information propagation between layers. The model thus gains better generalization ability and more accurate feature extraction. A max-pooling layer is then cascaded after the residual block, so that the dimensions of the output feature are doubled, which compensates for the information lost in the down-sampling process. The shared encoder E based on the RAU can adaptively integrate local features and global dependencies.
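A minimal NumPy sketch of a spatial attention map applied inside a residual connection is given below. The layer order (1×1 convolution, BN, ReLU, 1×1 convolution, sigmoid), the single-channel attention map, the single 1×1 projection used as a stand-in for the residual branch, and the fusion rule `x + f * a` are all assumptions made for illustration; the paper's exact wiring is shown only in its Fig. 3.

```python
import numpy as np

def conv1x1(x, w):
    """A 1x1 convolution on an (H, W, C_in) map is a per-pixel matrix product."""
    return x @ w

def batch_norm(x, eps=1e-5):
    # Simplified stand-in: normalize each channel over all pixels of one map.
    mean = x.mean(axis=(0, 1), keepdims=True)
    var = x.var(axis=(0, 1), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(f, w1, w2):
    """Return an (H, W, 1) map with one weight in (0, 1) per spatial position."""
    hidden = np.maximum(batch_norm(conv1x1(f, w1)), 0.0)
    return sigmoid(conv1x1(hidden, w2))

def residual_attention_unit(x, w_res, w1, w2):
    f = np.maximum(conv1x1(x, w_res), 0.0)  # stand-in for the residual branch
    a = spatial_attention(f, w1, w2)
    return x + f * a, a                      # identity path plus reweighted features

rng = np.random.default_rng(0)
height, width, channels = 6, 6, 8            # illustrative sizes
x = rng.normal(size=(height, width, channels))
w_res = rng.normal(size=(channels, channels)) * 0.1
w1 = rng.normal(size=(channels, channels // 2)) * 0.1
w2 = rng.normal(size=(channels // 2, 1)) * 0.1

out, attention = residual_attention_unit(x, w_res, w1, w2)
```

The attention map assigns a different weight to each pixel, which gives the otherwise position-agnostic convolution features a position-dependent scaling.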
In the model,
To make the distributions of the two image modalities more similar, the cycle-consistency loss [8] is used, which can be defined as Formula 3.
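For orientation, the cycle-consistency loss in the CycleGAN formulation [8] is usually written as below. The symbols GA, GB, E, U, xA and xB follow the text above; the use of the L1 norm comes from [8] rather than from this section, so the expression should be read as an assumed form and not as the paper's exact Formula 3.

```latex
\mathcal{L}_{cyc}(G_A, G_B) =
  \mathbb{E}_{x_A}\big[\lVert G_A(G_B(x_A)) - x_A \rVert_1\big]
  + \mathbb{E}_{x_B}\big[\lVert G_B(G_A(x_B)) - x_B \rVert_1\big],
\qquad G_A = U \circ E
```

Here G_A(G_B(x_A)) corresponds to the reconstructed image xA→B→A described above.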
For the segmentation network {E, Ci}, we use a mixed loss function to segment the image xA→B. The mixed loss can be defined as Formula 4:
where H(·) represents the cross-entropy loss, and Dice(·) represents the Dice loss.
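A mixed loss of this kind can be sketched in NumPy as follows. The unweighted sum of the two terms, the per-class averaging of the Dice term and the parameter `dice_weight` are assumptions; the paper's Formula 4 may weight or aggregate the terms differently.

```python
import numpy as np

def cross_entropy(y_onehot, probs, eps=1e-7):
    """Mean pixel-wise cross-entropy; the last axis holds the classes."""
    clipped = np.clip(probs, eps, 1.0)
    return float(-(y_onehot * np.log(clipped)).sum(axis=-1).mean())

def dice_loss(y_onehot, probs, eps=1e-7):
    """One minus the soft Dice score averaged over classes."""
    axes = tuple(range(y_onehot.ndim - 1))      # all axes except the class axis
    intersection = (y_onehot * probs).sum(axis=axes)
    denominator = y_onehot.sum(axis=axes) + probs.sum(axis=axes)
    dice_per_class = (2.0 * intersection + eps) / (denominator + eps)
    return float(1.0 - dice_per_class.mean())

def mixed_loss(y_onehot, probs, dice_weight=1.0):
    # Assumed combination: cross-entropy plus a weighted Dice loss.
    return cross_entropy(y_onehot, probs) + dice_weight * dice_loss(y_onehot, probs)

# Toy 2x2 label map with three classes.
labels = np.array([[0, 1], [2, 1]])
y = np.eye(3)[labels]                   # one-hot labels, shape (2, 2, 3)

perfect_loss = mixed_loss(y, y.copy())              # prediction equals the labels
uniform_loss = mixed_loss(y, np.full(y.shape, 1/3)) # uninformative prediction
```

The cross-entropy term penalizes each pixel independently, while the Dice term measures region overlap per class, which helps when some structures occupy few pixels.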
We train the overall network in an end-to-end way. The training sequence of the network is GB → DB → E → Ci → U → DA → Dpi. The objective function of our framework is shown as Formula 5:
where
Dataset and preprocessing
In this paper, the public dataset provided by the multi-modality whole-heart segmentation (MM-WHS2017) challenge is used for network training and testing. The dataset consists of 40 unpaired whole-heart images of 40 patients from different clinical sites [2], including 20 MRI and 20 CT images with ground truth masks. We divide the heart into four substructures: ascending aorta (AA), left atrial blood chamber (LAC), left ventricular blood chamber (LVC) and left ventricular myocardium (MYO). The data of each modality are divided into training and test data at a ratio of 4 : 1, and the averaged results of five-fold cross-validation are reported in this paper.
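A 4 : 1 split with five folds over the 20 volumes of one modality can be sketched as follows. The contiguous, unshuffled assignment of volumes to folds is an assumption, since the text does not say how volumes are assigned, and the function name is illustrative.

```python
def five_fold_splits(n_items=20, n_folds=5):
    """Return a list of (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_items))
    fold_size = n_items // n_folds
    splits = []
    for k in range(n_folds):
        test = indices[k * fold_size:(k + 1) * fold_size]
        train = [i for i in indices if i not in test]
        splits.append((train, test))
    return splits

# 20 volumes per modality -> 16 for training and 4 for testing in every fold.
splits = five_fold_splits()
```

Splitting at the volume level, as here, keeps all slices of one patient in the same subset.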
The limited amount of data leads to insufficient training of the network and thus to over-fitting, so the data need to be augmented. The raw MRI scans include the heart and other surrounding organs, whereas the raw CT scans cover the cardiac area. Following Chen et al. [18], we first manually crop the cardiac region. We then slice the original 3D data and crop it into 256×256 coronal slices for training our 2D model. The processed MRI and CT images contain 12,000 and 9,600 slices, respectively. In the experiments, the 2D MR data are divided into 9,600 slices for training and 2,400 for testing, and the 2D CT data into 8,400 slices for training and 1,200 for testing. Data augmentation, including rotation, scaling and affine transformation, is used to reduce over-fitting.
Implementation details
The method is implemented with Anaconda and TensorFlow (version 1.14.0) on a GeForce RTX 2080Ti graphics processing unit (GPU) with 11 GB of memory. The Adam optimizer is used with a learning rate of 2e-4, the weight parameters A and B are set to 10, and the batch size is set to 8. Training runs for 22,000 iterations and takes about 14 hours.
Evaluation metrics
To evaluate segmentation performance, we use the Dice coefficient and the average symmetric surface distance (ASD), where A represents the prediction map and B represents the ground truth. A higher Dice value indicates better segmentation performance; conversely, a higher ASD value indicates worse segmentation performance. Dice measures the voxel-wise overlap between the predicted and reference volumes and ranges from 0 to 1. The Dice can be defined as Formula 6:
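Under the usual set-overlap definition, Dice(A, B) = 2|A ∩ B| / (|A| + |B|). The sketch below computes it for masks given as sets of voxel coordinates; the function name and the convention of returning 1.0 for two empty masks are assumptions.

```python
def dice_coefficient(pred_voxels, truth_voxels):
    """Dice overlap between two masks given as collections of voxel coordinates."""
    a, b = set(pred_voxels), set(truth_voxels)
    if not a and not b:
        return 1.0  # assumed convention when both masks are empty
    return 2.0 * len(a & b) / (len(a) + len(b))

# Toy masks with four voxels each, two of which are shared.
pred = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)}
truth = {(0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 1)}

score = dice_coefficient(pred, truth)   # 2 * 2 / (4 + 4) = 0.5
```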
ASD (average symmetric surface distance, also known as ASSD or AvgD) measures the average distance between the surface of the predicted mask and that of the ground truth. The ASD can be defined as Formula 7:
where S(A) represents the set of surface voxels of A, and min_{b∈S(B)} ∥a − b∥_2 represents the shortest distance from voxel a to the surface S(B).
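The symmetric average over both surfaces can be sketched as follows. The surfaces are passed in directly as lists of voxel coordinates, since extracting S(A) from a mask is outside the scope of the sketch, and the `spacing` parameter (voxel size in mm) is an assumed addition for converting voxel indices to millimetres.

```python
import math

def directed_distances(source, target):
    """Shortest Euclidean distance from every source point to the target set."""
    return [min(math.dist(a, b) for b in target) for a in source]

def average_symmetric_surface_distance(surface_a, surface_b, spacing=(1.0, 1.0, 1.0)):
    """Mean of all nearest-surface distances, taken in both directions."""
    def to_mm(point):
        return tuple(c * s for c, s in zip(point, spacing))

    sa = [to_mm(p) for p in surface_a]
    sb = [to_mm(p) for p in surface_b]
    if not sa or not sb:
        raise ValueError("both surfaces must contain at least one voxel")
    total = sum(directed_distances(sa, sb)) + sum(directed_distances(sb, sa))
    return total / (len(sa) + len(sb))

# Two toy surfaces that are two voxels apart along the last axis.
surface_pred = [(0, 0, 0), (0, 1, 0)]
surface_truth = [(0, 0, 2), (0, 1, 2)]

asd_voxels = average_symmetric_surface_distance(surface_pred, surface_truth)
asd_mm = average_symmetric_surface_distance(surface_pred, surface_truth,
                                            spacing=(1.0, 1.0, 0.5))
```

The brute-force nearest-neighbor search is quadratic in the number of surface voxels; it is kept here for clarity rather than speed.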
Experimental result analysis of RA-SIFA
To verify the effectiveness of the parallel attention module and the residual attention unit, ablation experiments are conducted on the multi-modality whole-heart segmentation (MM-WHS2017) dataset, with SIFA as the baseline. The network that combines SIFA with the RAU is named R_SIFA, and the network that combines SIFA with the PAM is named A_SIFA.
Figure 4 shows the visual segmentation results for cardiac MRI and CT test images in the ablation experiment. The first two rows show the segmentation results for MRI test images when CT is the source and MRI is the target, and the last two rows show the segmentation results for CT test images when MRI is the source and CT is the target. With the RAU and PAM added, RA-SIFA segments the contours of the cardiac substructures in both modalities more accurately than the baseline model. In general, the segmentation results of the RA-SIFA model are closer to the ground truth. Table 1 and Table 2 show the segmentation results for each experimental setting, which verify that the PAM and RAU are effective. Compared with the SIFA model, the R_SIFA model increases the average Dice score on CT and MRI images by 6.8% and 1.3%, respectively, which indicates that the RAU in the shared encoder extracts detailed information effectively. The PAM is then introduced to form the RA-SIFA model, and the mean Dice scores on CT and MRI images increase by a further 1.6% and 1.9%, respectively, which indicates that the PAM in the generator extracts contextual information effectively.

Visualized comparison segmentation results of experiments for cardiac MRI images (rows 1–2) and CT images (rows 3–4). The cardiac structures of AA, LAC, LVC and MYO are represented in red, green, orange and blue respectively. From left to right, the original test image, SIFA segmented image, R_SIFA segmented image, A_SIFA segmented image, RA_SIFA segmented image and ground truth are shown respectively.
The performance comparison of experiments
Figure 5 shows the Dice and ASD evaluation results of the four networks on MRI images (left) and CT images (right). The average Dice and average ASD of both R_SIFA and A_SIFA are better than those of SIFA, which shows the effectiveness of the PAM and RAU. RA-SIFA obtains better results than R_SIFA and A_SIFA, which indicates that the PAM and RAU together extract the relevant information between the two modalities effectively.

Dice and ASD evaluation results.
To verify the effectiveness of the proposed UDA method in segmenting multi-modality cardiac images, we compare RA-SIFA with other advanced UDA methods, namely CycleGAN [8], SIFA [18], SynSeg-Net [23], AdaOutput [24], CyCADA [25] and SASAN [26], using the same training and test data. We evaluate the bidirectional domain adaptation between MRI and CT images, and the segmentation results are shown in Table 3 and Table 4.
Performance comparison with other advanced methods on CT cardiac images
Performance comparison with other advanced methods on MRI cardiac images
Tables 3 and 4 show the segmentation results of our method and other unsupervised domain adaptation methods on the CT and MRI datasets, respectively. The results show that our method achieves competitive performance. Compared with previous work [18], the average Dice obtained by our method on CT images increases by 8.4%, and the average ASD is reduced by 3.40 mm. On MRI images, the average Dice increases by 3.2% and the average ASD is reduced by 0.80 mm, which demonstrates the effectiveness of RA-SIFA in reducing the domain shift.
The results clearly show that adaptation from the MRI to the CT domain achieves better results than adaptation from the CT to the MRI domain. A possible reason is that, owing to the different imaging principles, adaptation from the CT to the MRI domain is more difficult than adaptation from the MRI to the CT domain. Therefore, CT images are more suitable for organ segmentation tasks than MRI images.
Some studies have explored the feasibility of cross-modality UDA in image segmentation [13, 14]. Fruitful results have been achieved by applying UDA from a richly labeled source domain to an unlabeled target domain. To improve the performance of cross-modality cardiac segmentation, this paper proposes an unsupervised domain adaptation framework, RA-SIFA, which combines a parallel attention module and a residual attention unit for multi-modality cardiac segmentation. Through image alignment and feature alignment, accurate multi-modality cardiac segmentation is achieved. The parallel attention module in the generator takes full advantage of context information to make the source domain image more similar to the target, which reduces the domain shift from the aspect of image alignment. The residual attention unit in the shared encoder alleviates the insensitivity of the convolutional layer to spatial position, so the network retains unique information while capturing relevant information, which reduces the domain shift from the aspect of feature alignment.
Experimental results show that the segmentation performance of the mutual mapping between MRI and CT is greatly enhanced. We therefore conclude that bidirectional UDA between CT and MRI can be realized with RA-SIFA. However, our method still has some deficiencies that need further improvement and research. In future work, we will focus on the low segmentation accuracy on MRI images.
Conflicts of interest
All authors declare that they have no conflicts of interest to this work. We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.
Ethics approval
All authors declare that the work described in this paper does not violate the ethics of our experiments.
Consent to participate
All authors have read and approved contents of this paper.
Funding
The study is supported in part by the key specialized research and development program of Henan Province (202102210170).
