Abstract
BACKGROUND:
Deformable image registration (DIR) plays an important part in many clinical tasks, and deep learning has made significant progress in DIR over the past few years.
OBJECTIVE:
To propose a fast multiscale unsupervised deformable image registration (referred to as FMIRNet) method for monomodal image registration.
METHODS:
We designed a multiscale fusion module to estimate large displacement fields by combining and refining the deformation fields of three scales. A spatial attention mechanism was employed in the fusion module to weight the displacement field pixel by pixel. In addition to mean square error (MSE), we added a structural similarity (SSIM) measure during the training phase to enhance the structural consistency between the deformed images and the fixed images.
RESULTS:
Our registration method was evaluated on EchoNet, CHAOS and SLIVER, and achieved improvements in SSIM, NCC and NMI scores. Furthermore, we integrated FMIRNet into segmentation networks (FCN, UNet) to boost the segmentation task on a dataset with few manual annotations in our joint learning frameworks. The experimental results indicated that the joint segmentation methods improved Dice, HD and ASSD scores.
CONCLUSIONS:
Our proposed FMIRNet is effective for large deformation estimation, and its registration capability is generalizable and robust in joint registration and segmentation frameworks to generate reliable labels for training segmentation tasks.
Introduction
As a fundamental task in medical image analysis, deformable image registration (DIR) has been used in many medical applications such as image segmentation [3, 8]. DIR is the process of establishing dense nonlinear spatial correspondences Φ between moving images
Traditional DIR approaches, such as Demons [21], LDDMM [6] and SyN [1], align images by solving a computationally expensive iterative optimization problem. Classical DIR algorithms are therefore time-consuming and impractical in clinical applications.
Recently, deep learning was introduced to image registration for its high efficiency. DIR methods based on deep learning treat image registration as a mapping from input to output, and seek the best deformation vector fields by optimizing a similarity objective function on the training dataset. Popular metrics currently include mean square error (MSE), cross correlation (CC), normalized cross correlation (NCC), mutual information (MI) and normalized mutual information (NMI). Because reliable ground truth deformation fields are difficult to obtain, unsupervised deformable image registration (UDIR) schemes are becoming popular.
DIR algorithms based on deep learning have achieved significant success. However, large deformation estimation remains a key and difficult problem for deformable image registration. There are two kinds of methods for large deformation decomposition, that is, learning a large deformation field by combining several subfields. The first kind is cascade-based methods, which decompose the warping process into a multistage operation. The entire network is formed by stacking multiple identical subnetworks that estimate the deformation field in a recursive manner. The second kind is coarse-to-fine methods, which employ multiscale features to estimate a deformation field. The deformation components at each scale of multiscale-based methods are unequal [10]: the low scale deformation field enjoys large receptive fields to deal with large deformation components, while the high scale deformation field enjoys small receptive fields to deal with small deformation components.
In this paper, we propose a fast multiscale network (FMIRNet) for unsupervised deformable image registration of monomodal medical images. Our multiscale module is designed for large deformation estimation. We first obtain three displacement fields at different scales, then feed them to our proposed multiscale fusion module, where spatial attention is employed to refine the deformation field. Additionally, we integrate FMIRNet into Joint Registration and Segmentation (JRS) frameworks to improve segmentation tasks with few annotations. In the JRS methods, FMIRNet generates annotations for images without ground truth labels.
The main contributions of this paper can be summarized as follows:
(1) We propose a multiscale unsupervised deformable image registration method for large displacement estimation;
(2) We design a multiscale fusion module for combining and refining multiscale displacement fields;
(3) We integrate the FMIRNet into segmentation networks to form joint frameworks, where the FMIRNet is used to generate reliable labels from the few available annotations for supervised training to boost the segmentation task;
(4) We conduct extensive experiments and ablation studies on the registration and joint learning networks to validate the effectiveness of our work.
Deep learning technologies have recently been applied to medical image registration. In supervised methods for deformable image registration, ground truth deformation fields for supervised training are either obtained by traditional approaches or synthesized by manual deformation transformations. FlowNet [4] used synthetic data for training and estimating pixel-level loss. Sokooti et al. [20] augmented their dataset using random displacement vector fields (DVFs). Fan et al. [5] used the deformation fields obtained by LCC-Demons [15] and SyN [1] to train a CNN. Supervised transformation estimation has made real-time and robust registration possible. However, ground truth information is difficult to collect, and it is crucial to ensure that simulated data are sufficiently similar to clinical data. These challenges make supervised registration impractical.
Recently, unsupervised registration methods, which use image-wise similarities instead of synthesized deformation fields during training, have gradually become a research hotspot. Balakrishnan et al. [2] were the first to formulate registration as a parametric function and modeled it with a convolutional neural network, in which a spatial transformer network (STN) [11] was used to reconstruct one image from another. Sheikhjafari et al. [19] exploited learned latent representations as input to a network composed of eight fully connected layers to obtain the transformation for deformable registration in 2D cardiac cine MR volumes. However, unsupervised registration methods based on similarity metrics are sensitive to local optima during optimization [17], so these methods may fail to estimate large displacements in complex deformation fields [12].
Recent efforts have been devoted to handling the estimation of large displacements by decomposing a large deformation field into a combination of several subfields in either a multistage or a multiscale manner. Zhao et al. proposed the Volume Tweening Network (VTN) [27] to decompose the large deformation into a series of deformation subfields using several cascaded CNNs. In subsequent research, Zhao et al. [26] enlarged the number of cascades and demonstrated that performance improves as the number of cascades increases. UDRSNet, proposed by Wang et al. [22], adopted the same approach as VTN and used a structural similarity measure to align images. Mok et al. [16] proposed LapIRN, which takes a Laplacian image pyramid as input to estimate and refine deformation fields. Kang et al. [12] developed Dual-PRNet for coarse-to-fine deformation field estimation.
Method
Multiscale unsupervised registration
Figure 1 illustrates our proposed multiscale unsupervised network for deformable image registration (taking 2D images as an example). We denote the moving image as

Flowchart of the proposed FMIRNet.
As shown in Fig. 1(a), we employed a ResUnet [22, 25] to model three dense nonlinear transformations (Φ1, Φ2 and Φ3) from multiscale image pairs ([
We proposed a spatial attention module to combine and refine the information contained in the multiscale displacement fields. As illustrated in Fig. 1(b), our attention mechanism was introduced after the displacement estimation module. We first upsampled Φ2 by a factor of 2 (denoted as Φ'2) and Φ3 by a factor of 4 (denoted as Φ'3). The preliminary fused features can then be obtained as follows:
Next, we applied the spatial attention module to Φp. Our attention module was composed of a Conv layer followed by a Sigmoid activation layer. The Sigmoid layer was configured to generate spatial weights W for each element in Φp. The learned attention W was applied to Φp by W ⊗ Φp. Finally, we obtained the multiscale fused deformation field as follows:
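The gating step W ⊗ Φp alone (not the final fusion equation) can be illustrated with a minimal pure-Python sketch. It assumes a 1 × 1 convolution with hypothetical weights and bias, since the kernel size and learned parameters of the Conv layer are not given here; the function and variable names are likewise illustrative.

```python
import math


def sigmoid(z):
    """Logistic function mapping a logit to a weight in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def spatial_attention(field, weights, bias):
    """Gate a 2D displacement field pixel by pixel.

    field   : H x W grid of (dx, dy) tuples (the preliminary fused field)
    weights : (w_dx, w_dy), hypothetical 1x1 convolution weights (assumption)
    bias    : hypothetical 1x1 convolution bias (assumption)
    Returns (attention, gated), where attention is the H x W map W and
    gated[y][x] = W[y][x] * field[y][x].
    """
    attention, gated = [], []
    for row in field:
        w_row, g_row = [], []
        for dx, dy in row:
            # 1x1 convolution over the two displacement channels, then Sigmoid
            w = sigmoid(weights[0] * dx + weights[1] * dy + bias)
            w_row.append(w)
            g_row.append((w * dx, w * dy))
        attention.append(w_row)
        gated.append(g_row)
    return attention, gated


# Toy 2 x 2 preliminary fused field (values are made up for illustration)
phi_p = [[(0.0, 0.0), (2.0, -1.0)],
         [(4.0, 3.0), (-1.0, 0.5)]]
W, phi_gated = spatial_attention(phi_p, weights=(0.5, 0.5), bias=0.0)
```

Because the Sigmoid output lies strictly between 0 and 1, each displacement vector keeps its direction and is only rescaled in magnitude.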
Recent segmentation methods using deep learning technologies often require massive numbers of manually annotated labels, which are labor intensive and expensive to obtain. Our proposed multiscale unsupervised deformable registration method can transfer a moving image to a fixed image, so we employed this ability to map the available annotated labels from a moving domain to a fixed domain where annotations are not provided.
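The label-transfer idea can be sketched as warping an integer label map with a dense displacement field. This is a minimal illustration, assuming a backward-warping convention (each output pixel samples the moving label at its own position plus the displacement) and nearest-neighbor sampling so that label values stay integral; the function name `warp_labels` is hypothetical and the interpolation used in FMIRNet is not specified here.

```python
def warp_labels(labels, flow):
    """Warp an integer label map with a dense displacement field.

    labels : H x W list of ints (annotation of the moving image)
    flow   : H x W list of (dx, dy); output pixel (x, y) samples the moving
             label at (x + dx, y + dy) (backward warping, an assumption)
    Out-of-range sample positions are clamped to the image border.
    """
    h, w = len(labels), len(labels[0])
    warped = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            # Nearest-neighbor sampling keeps labels as valid class indices
            sx = min(max(int(round(x + dx)), 0), w - 1)
            sy = min(max(int(round(y + dy)), 0), h - 1)
            warped[y][x] = labels[sy][sx]
    return warped


moving_label = [[0, 0, 0],
                [0, 1, 0],
                [0, 0, 0]]
# Uniform displacement: every output pixel looks one pixel to its left
shift_flow = [[(-1.0, 0.0)] * 3 for _ in range(3)]
deformed_label = warp_labels(moving_label, shift_flow)

zero_flow = [[(0.0, 0.0)] * 3 for _ in range(3)]
unchanged_label = warp_labels(moving_label, zero_flow)
```

With the uniform shift, the labeled pixel moves one column to the right; with a zero field, the label map is returned unchanged.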
As shown in Fig. 2, we integrated the registration network into a segmentation network to form a unified framework similar to [12, 24]. We learned the two tasks simultaneously, using the proposed multiscale architecture as the registration network and FCN [14] and UNet [18], respectively, as the segmentation networks.

Joint registration and segmentation framework, where the proposed FMIRNet is used as the registration subnetwork (RegNet). FMIRNet is applied to warp moving images and their labels; the moving images and their labels, together with the warped images and their labels, are then used as inputs to the segmentation subnetwork for supervised training.
Figure 2 provides a detailed explanation of the proposed joint learning framework. Given a moving
Our joint learning method is designed to boost segmentation with few annotated samples. For each 3D volume selected from CHAOS, we first divided the volume into n groups of 2D image slices, then chose the transverse-section slice as the moving image and the remaining slices as fixed images. In our training phase, only the moving image has annotated labels; the deformed labels were generated by warping
In our model, the deformed image is represented as
Formally, the registration loss was defined as:
The segmentation network takes
Data and pre-processing
We first evaluated our proposed multiscale registration network on 2D ultrasound images selected from EchoNet. EchoNet consists of videos of parasternal long-axis echocardiography. For each video, we selected two frames as moving and fixed images to obtain 1362 image pairs, and divided them into training, validation, and testing sets in a 7:2:1 ratio. The 2D CT registration experiments and the joint registration and segmentation experiments were then performed on CT images selected from the CHAOS competition. The CT dataset contains 20 volumes. For each CT volume, we divided the volume into n sets of 2D slices in order, then selected the transverse-section slice as the moving image and the remaining slices as fixed images (including adjacent slices) to obtain a set of image pairs. After that, all images were resampled to 128 × 128, and we randomly divided the data set into 738, 210, and 104 image pairs for the training, validation, and test sets, respectively. For 3D registration, we performed experiments on 400 pairs of scans selected from the CT dataset SLIVER [9], and divided them into training, validation, and testing sets in a 7:2:1 ratio. The resolution of the scans in this dataset is 128 × 128 × 128.
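A 7:2:1 split of the image pairs can be sketched as below. This is only an illustration: the shuffling, the fixed seed, and the rule of assigning the rounding remainder to the test set are assumptions, and the function name `split_pairs` is hypothetical.

```python
import random


def split_pairs(pairs, ratios=(7, 2, 1), seed=0):
    """Shuffle image pairs and split them into train/validation/test subsets.

    The training and validation sizes are floored; whatever remains goes to
    the test subset (an assumption about how rounding is handled).
    """
    items = list(pairs)
    random.Random(seed).shuffle(items)  # fixed seed for reproducibility
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


# Indices stand in for the 1362 EchoNet image pairs
train_set, val_set, test_set = split_pairs(range(1362))
```

The three subsets are disjoint and together cover every pair exactly once.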
Implementation details
The network training process was implemented using the PyTorch toolkit and optimized with the Adam optimizer at a learning rate of 1e-4. The regularization parameters were set to α = 1e-7, β = 0.1 for CT images and α = 1e-5, β = 1 for ultrasound images, so that the magnitude of each loss term was the same. The models were trained for 40 epochs with a batch size of 4. The experimental setup included an NVIDIA GeForce RTX 3090 GPU and an Intel Core i9-10920X CPU @ 3.50 GHz.
Evaluation metrics
To evaluate the registration performance of our proposed multiscale unsupervised network, we adopted normalized cross-correlation (NCC), normalized mutual information (NMI) and SSIM. NCC measures the degree of correlation between two images; the higher the NCC value, the better the registration. NMI values lie within the range [0, 1]; larger values are better.
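As an illustration of the correlation metric, the following minimal sketch computes a global zero-mean NCC between two equally sized 2D images stored as nested lists. The global (whole-image) form and the function name `ncc` are assumptions; the exact variant used in the experiments (for example, a windowed local NCC) is not specified here.

```python
import math


def ncc(image_a, image_b):
    """Global zero-mean normalized cross-correlation of two 2D images.

    Returns a value in [-1, 1]; 1 means the intensities are perfectly
    linearly related with a positive slope.
    """
    xs = [v for row in image_a for v in row]
    ys = [v for row in image_b for v in row]
    if len(xs) != len(ys):
        raise ValueError("images must contain the same number of pixels")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mean_x) ** 2 for x in xs)
                    * sum((y - mean_y) ** 2 for y in ys))
    # A constant image has zero variance; report zero correlation
    return num / den if den > 0 else 0.0


fixed = [[0.0, 1.0, 2.0],
         [3.0, 4.0, 5.0]]
brighter = [[2.0 * v + 3.0 for v in row] for row in fixed]  # same structure
inverted = [[-v for v in row] for row in fixed]             # reversed contrast

score_same = ncc(fixed, brighter)
score_inverted = ncc(fixed, inverted)
```

Because the mean is subtracted and the result is normalized, the score is unchanged by a positive linear rescaling of intensities, whereas inverting the contrast flips its sign.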
Dice, Hausdorff Distance (HD) and Average Symmetric Surface Distance (ASSD) are considered to measure the performance of the segmentation subnetwork in our joint learning framework. A higher Dice score between two regions denotes a higher degree of their overlap, meaning higher segmentation performance. A smaller value of HD denotes high proximity between the segmentation prediction and ground truth. ASSD measures the averaged mismatch between the surfaces of two volumes.
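Two of these segmentation metrics can be sketched on small binary masks as follows. This is a simplified illustration: the Hausdorff distance is computed over all foreground pixels rather than extracted surfaces, distances are in pixel units, and the function names are hypothetical. ASSD is omitted because it requires surface extraction.

```python
import math


def foreground(mask):
    """Return the set of (row, col) coordinates of non-zero pixels."""
    return {(r, c) for r, row in enumerate(mask)
            for c, v in enumerate(row) if v}


def dice(pred, truth):
    """Dice overlap 2|A ∩ B| / (|A| + |B|) between two binary masks."""
    a, b = foreground(pred), foreground(truth)
    if not a and not b:
        return 1.0  # two empty masks are treated as a perfect match
    return 2.0 * len(a & b) / (len(a) + len(b))


def hausdorff(pred, truth):
    """Symmetric Hausdorff distance between two non-empty foreground sets."""
    a, b = foreground(pred), foreground(truth)
    if not a or not b:
        raise ValueError("both masks need at least one foreground pixel")

    def directed(src, dst):
        # Largest distance from a source point to its nearest target point
        return max(min(math.hypot(s[0] - d[0], s[1] - d[1]) for d in dst)
                   for s in src)

    return max(directed(a, b), directed(b, a))


prediction = [[1, 1, 0],
              [0, 0, 0]]
reference = [[0, 1, 1],
             [0, 0, 0]]
dice_score = dice(prediction, reference)
hd_value = hausdorff(prediction, reference)
```

In this toy case the masks share one of their two foreground pixels each, and the farthest unmatched pixel lies one pixel away from the other mask.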
Results
Results of registration
We chose the unsupervised registration networks VoxelMorph [2] and UDRSNet [22] as baselines for our proposed model. We also compared our algorithm with state-of-the-art traditional registration algorithms, including Elastix [13] and SyN [1]. The quantitative results are presented in Table 1. FMIRNet performs better in terms of SSIM, NCC, and NMI scores; for example, FMIRNet improved SSIM by 0.05 and NCC by 0.06 over UDRSNet. We attribute these improvements to our multiscale fusion module for refining the deformation field.
Results of different methods on EchoNet. Standard deviations across instances are in parentheses
We took three pairs of EchoNet images as examples and showed the visualization results of different methods in Fig. 3. From Fig. 3, we can see that the proposed FMIRNet method achieves an impressive improvement in the registration results, compared to the competing methods.

Registration results of different methods for deforming source images (left) to target images (right) on EchoNet.
In addition to ultrasound images, we also conducted registration experiments on CT images. Table 2 gives a summary of the quantitative results on CHAOS. The visualization results can be found in Fig. 4.
Results of different methods on CHAOS. Standard deviations across instances are in parentheses

Samples of registration results of our joint learning framework on CHAOS.
We further evaluated the performance of our joint registration and segmentation framework using the proposed FMIRNet, which can be integrated into a 2D segmentation network. FCN and UNet were adopted as segmentation networks in our joint framework to verify the universality of our registration method. Figure 4 shows some registration visualization samples. FMIRNet not only aligns the moving image to the fixed image well, but also generates a reliable annotation label for the deformed image. These generated images and deformed labels help us to train a better segmentation network. In the inference stage, the segmentation subnetworks in our joint framework are used independently.
The segmentation results are shown in Table 3 and Fig. 5. FCN denotes a segmentation network trained in a supervised manner on images with annotations. RFCN is a joint registration and segmentation network in which the registration task employs our proposed FMIRNet; RUNet is defined analogously. Correspondingly, VUNet and VFCN use VoxelMorph as the registration subnetwork in the joint framework. From Table 3 and Fig. 5, we can observe that both RFCN and RUNet achieve performance improvements compared to FCN and UNet. The reason is that our joint registration and segmentation method has access to more images with reliable annotations, generated by the registration subnetwork.
Results of different methods of segmentation task on CHAOS. Standard deviations across instances are in parentheses

Samples of segmentation results of our joint learning framework on CHAOS.
In Table 4, we compare FMIRNet with its variants FMIRNet-1 and FMIRNet-2. FMIRNet-1 estimates only Φ1 as the deformation field, while FMIRNet-2 employs Φ1 and Φ2 to estimate the deformation field. The results indicate that our proposed multiscale module learns a more accurate and reliable spatial transformation.
Ablation study of multiscale module. Standard deviations across instances are in parentheses
Ablation studies of the regularization weight α values. Standard deviations across instances are in parentheses
Results of different methods on SLIVER. Standard deviations across instances are in parentheses

Samples of 3D registration results on SLIVER.
In the joint framework, the registration subnetwork is adopted to generate new labels from labels manually annotated by experts. The segmentation network in the joint framework was therefore trained using the few annotated labels together with the deformed labels, while FCN and UNet were trained using only the few annotated labels. The better segmentation performance of the joint methods on test data thus indicates that the registration network generates reliable and accurate annotations for unlabeled slices, that is, the learned spatial transformations are accurate and reliable.
We can draw two conclusions from the data in Table 7. On the one hand, the proposed FMIRNet reliably generates accurate labels for unlabeled fixed images, as RFCN (RUNet) is trained on moving labels and deformed labels and has better segmentation performance than FCN (UNet). On the other hand, the proposed FMIRNet generalizes well within the joint registration and segmentation framework, as the segmentation performance of both RUNet and RFCN improved.
Ablation studies of different r on CHAOS. Standard deviations across instances are in parentheses
In our multiscale deformation field estimation module, Φp is computed as in Equation 2, which is based on the assumption that Φ1, Φ'2 and Φ'3 have the same influence on the final deformation field Φ. However, this simple assumption may not hold in practical applications, as the low scale deformation field enjoys large receptive fields, while the high scale deformation field enjoys small receptive fields. In future research, we plan to incorporate channel attention to adaptively learn the weights of displacement fields at various scales.
The dataset used in our proposed joint registration and segmentation frameworks is composed of 2D slices of 3D volume data. Between different 2D slices, the liver exhibits positional movement and varying sizes, and these factors all affect the performance of the registration task. According to Equation 6 and Table 7, the larger r is, the greater the correlation between the moving and fixed images and the higher the accuracy of the registration task, which provides more reliable annotation labels for the segmentation task.
We designed the joint registration and segmentation frameworks by integrating FMIRNet into FCN (denoted as RFCN) and UNet (denoted as RUNet). Experimental results show that both RFCN and RUNet improve Dice, HD and ASSD scores, which supports the generalization ability and robustness of FMIRNet for boosting segmentation with few annotated labels. This suggests that FMIRNet could, in principle, be integrated into other segmentation networks as a plug-and-play block.
Currently, the proposed FMIRNet is designed for 2D registration tasks. 3D registration tasks are more difficult and complicated than 2D registration tasks. The proposed FMIRNet can be extended to 3D registration by directly replacing the convolution layers and spatial transformation layers with their 3D counterparts. However, due to the influence of other organs around the liver, the registration performance deteriorates noticeably. In future research on 3D registration, we plan to design region perception modules to restrict the deformation regions.
Conclusion
In this paper, we propose a fast multiscale unsupervised network (FMIRNet) for monomodal deformable image registration in an end-to-end manner. Compared to a single scale registration scheme, FMIRNet improves SSIM, NCC and NMI scores. The good registration ability of FMIRNet inspired us to integrate it into the joint registration and segmentation framework to boost segmentation with few annotated labels. Using FCN and UNet as segmentation networks, respectively, the joint methods RFCN and RUNet achieve gains in Dice, HD, and ASSD scores compared to using FCN and UNet alone. Still, FMIRNet has several limitations in its current implementation. First, FMIRNet is designed for monomodal image registration tasks, so efficiently registering multimodal images is a direction for future work. Second, FMIRNet is designed for 2D registration tasks and has slightly inferior performance on 3D registration tasks, so improving its 3D performance also deserves exploration in the future.
