Enhancing 3D medical image registration with cross attention, residual skips, and cascade attention

Abstract

At the core of Deep Learning-based Deformable Medical Image Registration (DMIR) lies a network that compares features in two images to identify their mutual correspondence, which is necessary for precise image registration. In this paper, we use three novel techniques to improve the registration process and enhance the alignment accuracy between medical images. First, we propose cross attention over multiple layers of image pairs, allowing us to extract the correspondences between them at different levels and improve registration accuracy. Second, we introduce a skip connection with residual blocks between the encoder and decoder, helping information flow and enhancing overall performance. Third, we propose cascade attention with residual block skip connections, which enhances information flow and strengthens feature representation. Experimental results on the OASIS and LPBA40 datasets show the effectiveness and superiority of our proposed mechanism. These contributions improve unsupervised-learning-based 3D DMIR, with potential implications for clinical practice and research.

Keywords

Deformable medical image registration; similarity measures; deep learning; convolutional neural networks

1. Introduction

Deformable Medical Image Registration (DMIR) serves as a key tool in medical image processing and analysis. Its primary objective is to align a moving image with a reference image by establishing voxel-wise displacement correspondences. This facilitates the matching and alignment of comparable anatomical structures across images. However, achieving precise registration presents a considerable challenge.

In the domain of medical image registration, traditional methods have long been employed to align and fuse different medical images for diagnostic and treatment purposes. These methods encompass a range of techniques such as intensity-based methods, feature-based methods, and landmark-based methods. Intensity-based methods seek to optimize similarity measures between pixel intensities, often using optimization techniques like gradient descent [1]. Feature-based methods leverage identifiable points, edges, or corners in the images to establish correspondences and facilitate alignment [2]. Landmark-based methods involve selecting distinctive anatomical landmarks to guide the registration process [3]. These traditional approaches have provided valuable insights and laid the foundation for the development of more advanced registration techniques.

Learning-based techniques use machine learning, deep learning, and artificial intelligence to make the process more accurate. Deep learning models, such as generative adversarial networks (GANs) and convolutional neural networks (CNNs), have demonstrated unparalleled capabilities in extracting relevant features and learning complex spatial transformations [4, 5, 6, 7].

Deformable medical image registration employing deep learning techniques is a topic of active research. Approaches for training networks can generally be categorized into two main groups: supervised and unsupervised learning techniques. Supervised learning techniques necessitate training the registration model using image pairs with known displacement and deformation fields. However, obtaining these fields typically involves prior registration through conventional methods. Consequently, the supervised learning models struggle to surpass the accuracy of traditional methods [8]. To address the limitations imposed by the requirement for labeled data in supervised learning, many researchers have turned their attention to unsupervised learning approaches for image registration.

Jaderberg et al. [9] introduced the spatial transformer network (STN), which supports backpropagation through neural networks and transforms the image directly by using a deformation field. STN has played an important role in medical image registration. Balakrishnan et al. [10] combined a U-shaped network [11] with STN to develop a new network called VoxelMorph, which performs atlas-based registration of brain MRI scans in an unsupervised manner. A single registration pass is enough to align image regions that are less deformed, while multiple registration passes are required to align regions that are more deformed. Creating 3D models is crucial in the fields of computer-aided design (CAD), computer-aided engineering (CAE), and computer-aided manufacturing (CAM). HRLT3D, a hierarchical reinforcement learning approach, was used for the reconstruction of 3D shapes [12]. DeepCSR, a 3D deep learning framework for MRI-based cortical surface reconstruction, outperforms FreeSurfer and FastSurfer in accuracy, precision, and speed. Its continuous approach and hypercolumn features enable efficient high-resolution reconstruction, which is promising for medical studies and healthcare applications [13].

ViT-V-Net was introduced for 3D medical image registration [14]. In ViT-V-Net, the vision transformer was employed as the final downsampling step of the encoder, alongside a CNN for the earlier downsampling steps. Unsupervised learning methods with performance comparable to traditional algorithms have been presented [15]. They all calculate the similarity between the fixed volume and the warped moving volume, while the gradients backpropagate through the differentiable warping operation. Most of the proposed networks lack efficiency when dealing with complex deformations, especially those with large displacements. For DMIR, transformers use the same attention mechanism [16] as in single-image tasks, which focuses on relevance within one image but ignores correspondences between image pairs. For fine registration, capturing the correspondence between the moving and fixed images therefore remains a challenge for transformers. XMorpher applied cross attention within a transformer structure, but it only captures correspondences between feature layers of the fixed and moving images at the same level, which limits the alignment accuracy of the fixed and warped moving images. The skip connections used in UNet [17, 18] are not optimally efficient, and the benefit they bring falls short of expectations. Further improvements are necessary to increase the effectiveness of skip connections in medical image registration. The recursive cascade network introduced by Zhao et al. [19] showed that using a sequence of networks can improve image registration. However, in this setup, each part of the network needs both the original image and the image being registered, and there is no proper relationship between the processing steps. As a result, parts of the image that are already well aligned are processed again, which adds little and can slow things down. VTN [20] also used a cascade, but it suffered from training difficulty and complexity. To address these challenges, we use the following components:

− To capture the correspondences between moving and fixed images in both the encoder and decoder, we introduce cross attention at different levels, which enhances the alignment of fixed and moving images. By considering correspondences across multiple layers, we enhance registration accuracy.
− By combining skip connections with residual blocks, our method aims to strengthen skip connections in medical image registration tasks. This integration enables an efficient flow of information across the layers of the encoder and decoder, supporting better alignment and higher accuracy in 3D DMIR.
− To handle the challenges related to training and complexity in cascade-based techniques, we introduce residual block skip connections into the cascade architecture, which increases the efficiency and performance of the 3D medical image registration process.

2. Background and related work

This section gives a short overview of the existing literature and research tools in the field of medical image registration.

2.1 Deformable registration process

In conventional volume registration, one of the volumes, either the moving (source) or the fixed (target), is adjusted to align with the other. Significant differences in brain anatomy due to natural variations among individuals and variations in their health conditions lead to substantial variability among subjects. Deformable registration plays an important role in enabling the comparison of these anatomical structures across different scans. This proves to be exceptionally useful for studying variations within different populations and for tracking how brain anatomy changes over time, particularly in individuals with medical conditions. The process of deformable registration typically involves two key steps: initially, an affine transformation is used to achieve a broad alignment, and then a more complex deformable transformation is applied to allow for greater flexibility. Our main emphasis lies in the latter phase, specifically in establishing a comprehensive and nonlinear correspondence for every voxel point.

The majority of deformable registration algorithms in use today employ an iterative approach that involves fine-tuning a transformation by minimizing an associated energy function [21, 22].

Consider two images, one fixed ( $f$ ) and the other moving ( $m$ ), and let $ϕ$ represent the registration field that links the coordinates of $f$ to those of $m$ . The task at hand can be formulated as an optimization problem in the following manner:

ϕ = \operatorname*{arg\,min}_{ϕ} L (f, m, ϕ) = \operatorname*{arg\,min}_{ϕ} (L_{sim} (f, m \circ ϕ) + λ L_{smooth} (ϕ)),

(1)

where $m \circ ϕ$ denotes $m$ transformed by $ϕ$ . The function $L_{sim} (\cdot, \cdot)$ quantifies the similarity between two images, and $L_{smooth} (\cdot)$ enforces regularization. The parameter $λ$ controls the trade-off between similarity and regularization. In our formulation, we compute $ϕ$ as $ϕ = Id + u$ , where $Id$ represents the identity transform and $u$ is the displacement field, a concept introduced by Bajcsy et al. in their work on multiresolution image registration [23]. Diffeomorphic transformations are employed to model $ϕ$ , enabling a smooth transition from one image to another through the manipulation of a velocity vector field. This process ensures that the transformation preserves the topology of objects in the image. It also guarantees that the transformation is invertible, so the altered image can be mapped back to the original one. Common metrics for $L_{sim}$ (image similarity) include intensity mean squared error, mutual information, and cross-correlation [24, 25], which are calculated as:

1) Mean Squared Error (MSE)

Mean Squared Error (MSE) is a commonly used metric in DMIR. It averages the squared differences between corresponding pixel intensities in the fixed image and the aligned moving image; lower values indicate better alignment. It is calculated as:

\mathrm{MSE} = \frac{1}{N} \sum_{i = 1}^{N} (I_{f} (x_{i}) - I_{m} (x_{i}))^{2},

(2)

where $I_{f}$ is the fixed image, $I_{m}$ is the moving image, $x_{i}$ represents the spatial coordinates of the $i$-th pixel, and $N$ is the number of pixels.
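Equation (2) can be checked on a toy example. The following is a minimal NumPy sketch, not the authors' implementation; the function name `mse` and the 2 × 2 × 2 volumes are hypothetical and chosen only for illustration.

```python
import numpy as np

def mse(fixed, moving) -> float:
    """Mean squared intensity difference over all N voxels (Eq. 2)."""
    fixed = np.asarray(fixed, dtype=np.float64)
    moving = np.asarray(moving, dtype=np.float64)
    if fixed.shape != moving.shape:
        raise ValueError("images must have the same shape")
    return float(np.mean((fixed - moving) ** 2))

# Hypothetical toy volumes: every voxel differs by 2, so the MSE is 2^2 = 4.
fixed = np.zeros((2, 2, 2))
moving = np.full((2, 2, 2), 2.0)
print(mse(fixed, moving))  # 4.0
```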

2) Cross-correlation

Cross-correlation measures similarity in DMIR by quantifying the degree of linear relationship between corresponding pixel intensities in the fixed and moving volumes. It is formulated as:

\mathrm{CC} = \frac{\sum_{i = 1}^{N} (I_{f} (x_{i}) - {\bar{I}}_{f}) (I_{m} (x_{i}) - {\bar{I}}_{m})}{\sqrt{\sum_{i = 1}^{N} {(I_{f} (x_{i}) - {\bar{I}}_{f})}^{2} \sum_{i = 1}^{N} {(I_{m} (x_{i}) - {\bar{I}}_{m})}^{2}}},

(3)

where $I_{f}$ and $I_{m}$ represent the fixed and moving images, respectively, $x_{i}$ represents the spatial coordinates of a pixel, $I_{f} (x_{i})$ and $I_{m} (x_{i})$ are the intensity values of the fixed and moving images at pixel $x_{i}$ , and ${\bar{I}}_{f}$ and ${\bar{I}}_{m}$ are the mean intensity values of the fixed and moving images.
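As a rough illustration of Equation (3), the sketch below computes the global cross-correlation with NumPy. The function name `cross_correlation` is hypothetical, and raising an error for constant images is a choice made here, not something stated in the text.

```python
import numpy as np

def cross_correlation(fixed, moving) -> float:
    """Global normalized cross-correlation of two same-sized images (Eq. 3)."""
    f = np.asarray(fixed, dtype=np.float64).ravel()
    m = np.asarray(moving, dtype=np.float64).ravel()
    f_c = f - f.mean()  # I_f(x_i) - mean(I_f)
    m_c = m - m.mean()  # I_m(x_i) - mean(I_m)
    denom = np.sqrt(np.sum(f_c ** 2) * np.sum(m_c ** 2))
    if denom == 0.0:
        raise ValueError("CC is undefined when an image is constant")
    return float(np.sum(f_c * m_c) / denom)

# Hypothetical toy volume; a positive linear intensity change keeps CC at 1.
volume = np.arange(8, dtype=np.float64).reshape(2, 2, 2)
print(cross_correlation(volume, 3.0 * volume + 5.0))
```

The value is unchanged by a positive linear rescaling of intensities, which is why CC tolerates global brightness and contrast differences better than MSE.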

3) Mutual Information

Mutual information measures the amount of information shared between the two images by relating their intensity distributions. By maximizing the mutual information, DMIR techniques can find an accurate alignment between the images, which increases registration accuracy.

\mathrm{MI} = \sum_{x, y} p_{f, m} (x, y) \log (\frac{p_{f, m} (x, y)}{p_{f} (x) p_{m} (y)}),

(4)

where $x$ and $y$ range over the intensity values of the fixed and moving images, $p_{f, m} (x, y)$ is the joint probability distribution of the fixed and moving images at ( $x$ , $y$ ), $p_{f} (x)$ is the marginal probability distribution of the fixed image at $x$ , and $p_{m} (y)$ is the marginal probability distribution of the moving image at $y$ .
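Equation (4) needs estimates of the joint and marginal distributions. One common way to obtain them, assumed here and not prescribed by the text, is a joint intensity histogram. The function name `mutual_information` and the `bins` parameter are hypothetical.

```python
import numpy as np

def mutual_information(fixed, moving, bins: int = 32) -> float:
    """Histogram-based estimate of Eq. 4 (natural logarithm)."""
    f = np.asarray(fixed, dtype=np.float64).ravel()
    m = np.asarray(moving, dtype=np.float64).ravel()
    joint, _, _ = np.histogram2d(f, m, bins=bins)
    p_fm = joint / joint.sum()             # joint distribution p_{f,m}(x, y)
    p_f = p_fm.sum(axis=1, keepdims=True)  # marginal p_f(x), shape (bins, 1)
    p_m = p_fm.sum(axis=0, keepdims=True)  # marginal p_m(y), shape (1, bins)
    outer = p_f @ p_m                      # p_f(x) * p_m(y) for every (x, y)
    nz = p_fm > 0                          # skip empty bins (0 * log 0 := 0)
    return float(np.sum(p_fm[nz] * np.log(p_fm[nz] / outer[nz])))

# Hypothetical two-level images: identical images share log(2) nats,
# statistically independent ones share none.
a = np.array([0.0, 0.0, 1.0, 1.0])
b = np.array([0.0, 1.0, 0.0, 1.0])
mi_same = mutual_information(a, a, bins=2)
mi_indep = mutual_information(a, b, bins=2)
```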

A smooth transformation is obtained by using regularizers that enforce consistency or that act directly on the displacement vector field [26, 27, 28, 29].

Traditional methods for image registration optimize the deformation field separately for each pair of images. This approach becomes costly when dealing with a large number of volumes, such as in population-wide analyses.

In learning-based registration, instead of individually optimizing deformation fields for each pair, we perform a global optimization of shared parameters. This idea is similar to “amortization,” a method used in different areas [30, 31]. Once the global function has been learned, we obtain the deformation field for a specific pair of images, expressed as a displacement vector field, by evaluating the function on that pair.

2.2 Medical image registration (deep learning-based)

The adoption of learning-based models, particularly convolutional neural networks (CNNs), across diverse domains like image segmentation, classification, and reconstruction has paved the way for increased interest in CNN-based registration methods. This heightened attention is primarily driven by the remarkable performance achieved by CNNs in these areas. The classification of these methods into two categories, namely supervised learning and unsupervised learning, is contingent on the training approach employed.

1) Supervised learning techniques

In these methodologies, it is essential to have access to ground-truth deformation vector fields. These fields are usually generated using established classical registration techniques [32, 33, 34]. Yang et al. [32] introduced an encoder-decoder network designed to predict deformation fields on a patch-wise basis. However, the effectiveness of these methods in image registration relies heavily on the accuracy of the provided ground-truth data. These techniques often demand carefully constructed ground-truth deformation fields and involve complex pre-processing steps. In real-world scenarios, obtaining such high-quality ground-truth data and executing these complex pre-processing tasks can be quite challenging.

2) Unsupervised learning techniques

Researchers have introduced unsupervised learning methods to overcome the shortcomings of supervised learning techniques. These approaches aim to improve image registration by minimizing the loss between the transformed image and a predefined reference image. Krebs et al. [35] introduced an unsupervised learning model that employs a low-dimensional stochastic representation of deformation. This approach minimizes the KL divergence between two image distributions. Balakrishnan et al. [15] introduced a 3D medical image registration method that operates in a pairwise fashion. They incorporated a Convolutional Neural Network (CNN) featuring a spatial transform layer (STL). The parameters of this layer are trained using the normalized cross-correlation function. In the context of large-volume image registration, Vos et al. [36] introduced a comprehensive framework for both affine and non-rigid image registration. Lei et al. [37] proposed a multi-scale unsupervised learning technique known as MS-DIRNet. This approach incorporates both global and local registration networks.

However, these approaches have limitations in ensuring consistency: the estimated mapping can lose integrity, which leads to a folding problem. To solve this problem, diffeomorphic integration layers were introduced [38]. It is important to note that applying the constraint during the inference phase can add extra complexity.

Cross-attention (CA) is a commonly employed variant of self-attention (SA), often used for both inter- and intra-modal tasks in computer vision [39, 40]. Its applicability to image registration has also been explored [41, 42]. What sets CA apart from SA is that its queries come from one input while its keys and values come from another. In XMorpher [43], CA is applied between feature layers of the moving and fixed volumes at the same level.

Cascade approaches find application across diverse domains within computer vision. For instance, in the context of pose estimation, cascaded pose regression iteratively enhances pose predictions acquired from supervised training data [44]. Additionally, they play a role in expediting object detection through cascaded classifiers [45]. Cascade architectures have proven advantageous in the field of deep learning as well. One notable example is the deep deformation network, which employs a cascading approach across two stages to forecast deformations for landmark localization [46]. These cascade principles extend their benefits to a spectrum of applications, such as object detection [47], 3D image reconstruction for MRIs [48], liver segmentation [49], and mitosis detection [50].

In the field of medical image registration and deep learning, many researchers use residual networks (ResNets). ResNets receive considerable attention because they capture detailed image features and learn complicated patterns well. First introduced by He et al. [51], ResNets have transformed deep learning: they address the problem of gradients vanishing during training by using skip connections.

We build upon these concepts and expand CA, cascade, and ResNets to perform 3D volume registration, and enhance registration accuracy.

3. Methodology

Our model is based on deep learning networks. It extracts and matches features from the fixed and moving images to improve the registration of the input image pairs. In this section, we present the foundational framework of our model. We leverage cross-attention mechanisms across multiple layers in both the encoder and decoder components, which is instrumental in achieving precise alignment between the fixed and moving images. The residual block used in the skip connection helps the flow of information between the encoder and decoder. The cascade operates iteratively, continually refining its representations, and the residual skip connections carry information between these iterations. This interplay between the cascade and the residual skip connections eases training and contributes to the model’s overall effectiveness. The complete flowchart of our proposed approach is shown in Figure 1.

Figure 1.

Overall architecture of our model, which consists of: (a) Residual blocks used in the skip connections for a uniform flow of information between encoder and decoder, with cross-attention applied over different layers of the fixed and moving images. (b) Cascade attention with residual skip connection, which helps pass information through a series of stages. (c) Residual block, which supports the flow of information between layers of the encoder and decoder. (d) Cross attention between different layers. (e) The feature fusion module, which comprises four cross-attention blocks that share parameters to facilitate mutual correspondences. (f) Working structure of the cross-attention transformer.

3.1 Cross attention over multi layers (CAML)

Our cross-attention mechanism builds on the XMorpher technique. We apply cross attention between feature layers of the fixed and moving images at the same level as well as at different levels, which is why we call it cross attention over multi layers, as shown in Figure 1d. When applying cross attention between layers of different levels, we use max-pooling to handle the differing sizes of the feature layers of the fixed and moving images. This helps the fixed and moving images share information and improves alignment accuracy. This cross attention consists of:

1) Cross-attention assisted feature integration block

As depicted in Figure 1e, the corresponding features $T_{f}$ and $T_{m}$ from parallel sub-networks mutually attend through four cross-attention transformer blocks by changing input order. In these four transformers, two are used for the same level of feature layers of both moving and fixed images, while the other two are used for different levels of feature layers. Outputs, influenced by each other’s attention, return to the original channels for deeper interaction. This process repeats k times for empowering information exchange, which helps in enhancing the registration of medical images.

2) Cross-attention transformer for mutual attention

By using the attention mechanism, the cross-attention transformer block calculates new feature tokens from input feature b to feature s as shown in Figure 1f. In this mechanism, base windows set $S_{ba}$ and searching windows set $S_{se}$ are formed. Further, each base window is extended to the query, and each searching window is extended to both key and value. Window head cross-attention calculates cross-attention between these windows, while the obtained attention enriches the base window.

3.1.1 Local window diversity and enhancement of image alignment

1) Window division (WD) and window area division (WAD)

In WD and WAD, we split the input feature tokens, represented as ‘ $s$ ’ and ‘ $b$ ’, into windows of different sizes. As shown in Figure 2I, WD divides the tokens $b$ into the base window set $S_{b a}$ with a size of $n \times h \times w \times d$ , and WAD enlarges the window dimensions using enhancement factors $α$ , $β$ , and $γ$ . To obtain the same number of windows as $S_{ba}$ , WAD uses a sliding window with a stride matching the base window size, so that the searching window set $S_{se}$ has a size of $n \times α h \times β w \times γ d$ . The cross-attention transformer can therefore calculate cross-attention between feature tokens of different sizes.
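The window construction described above can be sketched in NumPy. This is an illustrative reading under stated assumptions, not the authors' code: the function names `base_windows` and `search_windows` are hypothetical, the token grid is assumed to be divisible by the base window size, and zero padding at the volume border is an assumption the text does not specify.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def base_windows(tokens, win):
    """WD: split an (H, W, D, C) token grid into n non-overlapping h*w*d windows."""
    H, W, D, C = tokens.shape
    h, w, d = win
    if H % h or W % w or D % d:
        raise ValueError("token grid must be divisible by the window size")
    x = tokens.reshape(H // h, h, W // w, w, D // d, d, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)   # (nH, nW, nD, h, w, d, C)
    return x.reshape(-1, h * w * d, C)     # (n, s, C) with s = h*w*d

def search_windows(tokens, win, scale):
    """WAD: enlarged windows of size (a*h, b*w, g*d), stride = base window size."""
    h, w, d = win
    a, b, g = scale
    sh, sw, sd = a * h, b * w, g * d
    # Zero-pad so each enlarged window is centred on its base window (assumption).
    pad = [((sh - h) // 2, (sh - h) - (sh - h) // 2),
           ((sw - w) // 2, (sw - w) - (sw - w) // 2),
           ((sd - d) // 2, (sd - d) - (sd - d) // 2),
           (0, 0)]
    padded = np.pad(tokens, pad)
    views = sliding_window_view(padded, (sh, sw, sd), axis=(0, 1, 2))
    views = views[::h, ::w, ::d]           # stride equal to the base window size
    views = np.moveaxis(views, 3, -1)      # (nH, nW, nD, sh, sw, sd, C)
    return views.reshape(-1, sh * sw * sd, tokens.shape[-1])  # (n, mu*s, C)

# Hypothetical 4x4x4 token grid with 2 channels, 2x2x2 base windows, factors 2.
tokens = np.arange(4 * 4 * 4 * 2, dtype=np.float64).reshape(4, 4, 4, 2)
S_ba = base_windows(tokens, (2, 2, 2))
S_se = search_windows(tokens, (2, 2, 2), (2, 2, 2))
```

Both sets contain the same number of windows, and each searching window holds $μ = α β γ$ times as many tokens as its base window.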

2) Multi-scale window fusion (MWF)

The main function of MWF is to connect the query with the key-value pairs, as shown in Figure 2II. The query is derived from the base windows, while both the keys and values are derived from the searching windows. The output is a sum of the values, each weighted by how well its key aligns with the query. To cover different representation subspaces, MWF uses multi-head attention. The formula for calculating cross attention is given as:

MWF (Q_{b a}, K_{s e}, V_{s e}) = softmax (\frac{Q_{b a} K_{s e}^{T}}{\sqrt{d}}) V_{s e}

(5)

where $Q_{b a}$ , $K_{s e}$ , and $V_{s e}$ are the query, key, and value matrices, respectively. $Q_{b a} \in R^{n \times s \times c}$ is the linear projection of $S_{b a}$ , and $K_{s e}, V_{s e} \in R^{n \times μ \cdot s \times c}$ are linear projections of $S_{s e}$ . Here, $s = h \times w \times d$ , $μ = α \times β \times γ$ , and $c$ is the dimension of each feature token.
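Equation (5) can be illustrated with a single-head NumPy sketch. It is not the authors' implementation: the function name `window_cross_attention`, the projection matrices `W_q`, `W_k`, `W_v`, and the random inputs are hypothetical, and the scaling dimension $d$ in Equation (5) is taken here to be the projected channel width.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def window_cross_attention(S_ba, S_se, W_q, W_k, W_v):
    """Single-head version of Eq. 5 applied independently to each window pair."""
    Q = S_ba @ W_q                                   # (n, s, c): queries from base windows
    K = S_se @ W_k                                   # (n, mu*s, c): keys from searching windows
    V = S_se @ W_v                                   # (n, mu*s, c): values from searching windows
    scale = np.sqrt(Q.shape[-1])                     # assumed meaning of sqrt(d)
    weights = softmax(Q @ K.transpose(0, 2, 1) / scale, axis=-1)  # (n, s, mu*s)
    return weights @ V, weights                      # output has shape (n, s, c)

# Hypothetical sizes: n = 8 windows, s = 8 base tokens, mu = 8, c = 4 channels.
rng = np.random.default_rng(0)
n, s, mu, c = 8, 8, 8, 4
S_ba = rng.standard_normal((n, s, c))
S_se = rng.standard_normal((n, mu * s, c))
W_q, W_k, W_v = (rng.standard_normal((c, c)) for _ in range(3))
out, weights = window_cross_attention(S_ba, S_se, W_q, W_k, W_v)
```

Each base token receives a convex combination of the values in its searching window, so the output keeps the base window shape regardless of $μ$.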

Figure 2.

(I) Creation of base and searching windows using window partitions. (II) Cross-attention between the base and searching windows, calculated using W-MCA.

3.2 Residual block with skip connection (RBSC)

Our model adopts the U-shaped architecture [11], a popular design for medical image registration because it captures both global and local information. The encoder and decoder are the two primary parts of this architecture. The encoder extracts features from the input images, whether fixed or moving. The decoder then uses these extracted features to reconstruct the output, allowing for precise alignment.

We add skip connections between corresponding levels of the encoder and decoder and equip them with ResNet-style residual blocks, as shown in Figure 1a. ResNet is well known for mitigating the vanishing gradient problem: its residual blocks help gradients propagate during training. Each residual block has two convolutional layers with a kernel size of 3 $\times$ 3 $\times$ 3, which extract spatial information from the input. Furthermore, the skip connections within the residual blocks encourage feature reuse and propagation by facilitating a smooth flow of information between layers. The element-wise addition carried out by the skip connections also lets the model integrate low-level detail with high-level semantic features, producing stronger representations. By promoting feature reuse across several layers, this design improves the model’s ability to detect complex patterns in medical images and helps reduce the risk of overfitting. By combining skip connections and residual blocks with the U-shaped topology, our model uses both global and local features and ensures effective information transfer between the encoder and decoder, which improves its performance in medical image registration.
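The role of the element-wise addition can be made explicit with a short derivation. The notation is an assumption introduced for illustration: $x$ is the feature entering the block, $W_{1}$ and $W_{2}$ stand for the two 3 $\times$ 3 $\times$ 3 convolutions, and $\sigma$ is an unspecified nonlinearity, since the text does not name the activation.

```latex
y = x + F(x), \qquad F(x) = W_{2} * \sigma(W_{1} * x)

\frac{\partial y}{\partial x} = I + \frac{\partial F}{\partial x}
```

Because the Jacobian contains the identity term $I$ , the gradient reaching $x$ keeps a direct path even when $\partial F / \partial x$ is small, which is the usual explanation for why residual blocks ease gradient propagation.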

3.3 Cascade-based attention with residual skip connection (CARSC)

Our methodology for enhancing medical image registration integrates two key components: cascade attention and residual skip connections. Cascade attention, as illustrated in Figure 1b, refines the representation of the images by selectively focusing on relevant features and suppressing irrelevant ones, thereby improving the overall alignment. The residual skip connection facilitates the flow of information and gradient propagation during training. By incorporating skip connections, our model ensures that information from earlier stages is preserved and efficiently transmitted to subsequent stages, enabling better convergence during optimization. To balance registration accuracy and computational efficiency, our model executes the registration procedure only twice, which limits the computational burden without sacrificing the quality of the output.
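The text does not state how the results of the two registration passes are combined. One common convention in cascaded registration, assumed here purely for illustration, is to warp the moving image successively, which is equivalent to warping once with the composed field. The 1D sketch below uses hypothetical helper names (`warp_1d`, `compose_1d`) and a toy ramp image; it is not the authors' cascade.

```python
import numpy as np

def warp_1d(image, u):
    """Sample `image` at x + u(x) with linear interpolation, i.e. image∘(Id + u)."""
    x = np.arange(image.size, dtype=np.float64)
    return np.interp(x + u, x, image)

def compose_1d(u_first, u_second):
    """Displacement equivalent to warping with u_first, then with u_second."""
    x = np.arange(u_first.size, dtype=np.float64)
    return u_second + np.interp(x + u_second, x, u_first)

# Hypothetical toy data: a linear ramp image and two smooth displacement fields
# that vanish at the borders, so no sample falls outside the image.
N = 32
x = np.arange(N, dtype=np.float64)
moving = 2.0 * x + 1.0
u1 = 1.5 * np.sin(np.pi * x / (N - 1))    # stand-in for the first pass
u2 = -0.8 * np.sin(np.pi * x / (N - 1))   # stand-in for the second pass

two_pass = warp_1d(warp_1d(moving, u1), u2)     # warp twice in sequence
one_pass = warp_1d(moving, compose_1d(u1, u2))  # warp once with the composed field
```

For the ramp image the two results agree up to floating-point error; for general images they agree only approximately, because each linear interpolation introduces some smoothing.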

3.4 Unsupervised loss function

The unsupervised loss function $L_{us}$ assesses the model’s performance using only the input volumes and the registration field it generates. It consists of two components: $L_{sim}$ , which penalizes differences in appearance, and $L_{smooth}$ , which penalizes local spatial variations in the deformation field $ϕ$ .

L_{us} (f, m, ϕ) = L_{sim} (f, m \circ ϕ) + λ L_{smooth} (ϕ)

(6)

We investigated the influence of the regularization parameter $λ$ and conducted experiments with two widely used functions for $L_{sim}$ . The first is the mean squared voxelwise difference, which is especially effective when the images $f$ and $m$ exhibit similar image intensity distributions and local contrast:

MSE (f, m \circ ϕ) = \frac{1}{| Ω |} \sum_{p \in Ω} [f (p) - [m \circ ϕ] (p)]^{2}

(7)

The second is the local cross-correlation between $f$ and $m \circ ϕ$ , which is more robust to the intensity variations often encountered across different scans and datasets [25].

Let $\hat{f} (p)$ and $[\hat{m} \circ ϕ] (p)$ represent local mean intensity images. Specifically, $\hat{f} (p)$ is calculated as $\frac{1}{n^{3}} \sum_{p_{i}} f (p_{i})$ , where $p_{i}$ iterates over an $n^{3}$ volume centered around $p$ , with $n = 9$ in our experimental setup. The local cross-correlation of $f$ and $m \circ ϕ$ is mathematically expressed as:

CC (f, m \circ ϕ) = \sum_{p \in Ω} \frac{{(\sum_{p_{i}} (f (p_{i}) - \hat{f} (p)) ([m \circ ϕ] (p_{i}) - [\hat{m} \circ ϕ] (p)))}^{2}}{(\sum_{p_{i}} {(f (p_{i}) - \hat{f} (p))}^{2}) (\sum_{p_{i}} {([m \circ ϕ] (p_{i}) - [\hat{m} \circ ϕ] (p))}^{2})}

(8)

A higher cross-correlation (CC) value indicates a more accurate alignment, resulting in the following loss function:

L_{sim} (f, m, ϕ) = - CC (f, m \circ ϕ) .

(9)
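Equations (8) and (9) can be sketched in NumPy as follows. This is a simplified illustration, not the training code: the function names are hypothetical, only windows that fit entirely inside the volume are used (the text does not describe border handling), a small `eps` is added to the denominator for numerical stability, and the toy example uses $n = 3$ on a 6 × 6 × 6 volume to stay small.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def local_cc(fixed, warped, n: int = 9, eps: float = 1e-8) -> float:
    """Sum of squared local cross-correlations over n*n*n windows (Eq. 8)."""
    f = np.asarray(fixed, dtype=np.float64)
    w = np.asarray(warped, dtype=np.float64)
    fw = sliding_window_view(f, (n, n, n))   # one n^3 window per centre voxel p
    ww = sliding_window_view(w, (n, n, n))
    axes = (-3, -2, -1)
    f_c = fw - fw.mean(axis=axes, keepdims=True)   # f(p_i) - local mean of f at p
    w_c = ww - ww.mean(axis=axes, keepdims=True)   # same for the warped image
    num = np.sum(f_c * w_c, axis=axes) ** 2
    den = np.sum(f_c ** 2, axis=axes) * np.sum(w_c ** 2, axis=axes)
    return float(np.sum(num / (den + eps)))

def similarity_loss(fixed, warped, n: int = 9) -> float:
    """L_sim = -CC (Eq. 9): better alignment gives a lower loss."""
    return -local_cc(fixed, warped, n=n)

# Hypothetical toy volumes.
rng = np.random.default_rng(0)
vol = rng.random((6, 6, 6))
other = rng.random((6, 6, 6))
cc_same = local_cc(vol, vol, n=3)                 # 4^3 = 64 windows, each term ~1
cc_rescaled = local_cc(vol, 2.0 * vol + 1.0, n=3) # unaffected by linear intensity change
cc_other = local_cc(vol, other, n=3)
```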

To ensure that the displacement field $ϕ$ remains physically realistic and smooth, we introduce a diffusion regularizer. While minimizing $L_{sim}$ encourages $m \circ ϕ$ to closely match $f$ , it can potentially lead to a displacement field $ϕ$ that lacks smoothness. Our diffusion regularizer is applied to the spatial gradients of the displacement $u$ .

L_{smooth} (ϕ) = \sum_{p \in Ω} | | \nabla u (p) | |^{2}

(10)

To estimate the spatial gradients, we utilize differences between adjacent voxels. In particular, for the gradient components $\nabla u (p) = (\frac{\partial u (p)}{\partial x}, \frac{\partial u (p)}{\partial y}, \frac{\partial u (p)}{\partial z})$ , we approximate $\frac{\partial u (p)}{\partial x}$ as $u (p_{x} + 1, p_{y}, p_{z}) - u (p_{x}, p_{y}, p_{z})$ , and employ analogous approximations for $\frac{\partial u (p)}{\partial y}$ and $\frac{\partial u (p)}{\partial z}$ .
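The forward-difference approximation of Equation (10) can be written compactly in NumPy. In this sketch the displacement field is assumed to be stored as an array of shape (X, Y, Z, 3), and the last voxel along each axis, which has no forward neighbor, is simply omitted; both choices and the function name `diffusion_regularizer` are assumptions made for illustration.

```python
import numpy as np

def diffusion_regularizer(u) -> float:
    """Sum of squared forward differences of the displacement field (Eq. 10)."""
    u = np.asarray(u, dtype=np.float64)
    dx = u[1:, :, :, :] - u[:-1, :, :, :]   # u(p_x + 1, p_y, p_z) - u(p_x, p_y, p_z)
    dy = u[:, 1:, :, :] - u[:, :-1, :, :]   # analogous difference along y
    dz = u[:, :, 1:, :] - u[:, :, :-1, :]   # analogous difference along z
    return float(np.sum(dx ** 2) + np.sum(dy ** 2) + np.sum(dz ** 2))

# Hypothetical fields on a 3 x 4 x 5 grid.
constant = np.ones((3, 4, 5, 3))            # rigid shift: no penalty
ramp = np.zeros((3, 4, 5, 3))
ramp[..., 0] = np.arange(3.0)[:, None, None]  # x-displacement grows by 1 per voxel in x
penalty_constant = diffusion_regularizer(constant)
penalty_ramp = diffusion_regularizer(ramp)  # 2 * 4 * 5 unit differences = 40
```

A uniform translation is not penalized at all, whereas any spatial variation of the displacement contributes quadratically.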

4. Experiments

4.1 Data and preprocessing

To evaluate the proposed method in an atlas-based setting, we use brain MR scans from two datasets: 450 T1-weighted brain MR scans from the OASIS dataset [52] and 40 brain MR scans from the LPBA40 dataset [53]. The OASIS subjects span an age range of 18 to 96 years and include 100 individuals with mild to moderate Alzheimer’s disease. Atlas-based registration is widely used in image analysis to establish correspondence between an atlas and moving images. On the OASIS dataset, our evaluation is based on the segmentation of 28 anatomical structures, listed in Table 1. On the LPBA40 dataset, we use the segmentation of 56 anatomical structures, hand-drawn by experts, as the basis for our evaluation. To improve registration accuracy and computational efficiency, all images are resized to 160 $\times$ 192 $\times$ 224 voxels with an isotropic voxel size of 1 mm $\times$ 1 mm $\times$ 1 mm. We partition the OASIS dataset into 255, 20, and 150 volumes for training, validation, and testing, respectively. For training we adopt a subject-to-subject registration strategy: pairing the 255 training scans yields a total of 64,770 image pairs. The LPBA40 dataset serves as an independent test set for cross-dataset validation. To ensure a fair evaluation, we randomly select five volumes from each of the OASIS and LPBA40 test sets as fixed atlases. The remaining 145 volumes from OASIS and 35 volumes from LPBA40 are then registered as moving images to these fixed atlases. The test results presented in this paper are averaged over the resulting 725 combinations for OASIS and 175 combinations for LPBA40. This protocol is intended to give a robust and unbiased assessment of our registration approach across datasets.
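The pair counts quoted above follow from simple counting. The short check below assumes that training uses ordered pairs of distinct scans (our reading of the subject-to-subject strategy) and that every moving test volume is registered to every atlas; the helper names are hypothetical.

```python
def ordered_pairs(n_scans):
    """Number of ordered (moving, fixed) pairs of distinct scans."""
    return n_scans * (n_scans - 1)

def atlas_pairs(n_atlases, n_moving):
    """Number of (moving, atlas) combinations when each moving volume
    is registered to each fixed atlas."""
    return n_atlases * n_moving

train_pairs = ordered_pairs(255)        # 255 * 254 training pairs
oasis_test_pairs = atlas_pairs(5, 145)  # 5 atlases, 145 moving volumes
lpba40_test_pairs = atlas_pairs(5, 35)  # 5 atlases, 35 moving volumes

print(train_pairs, oasis_test_pairs, lpba40_test_pairs)
```

Under these assumptions the three values are 64,770, 725, and 175, matching the figures reported in the text.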

Table 1
The OASIS dataset’s 28 anatomical structures for testing.

Label	Name	Label	Name
2	Left-cerebral-white-matter	24	CSF
3	Left-cerebral-cortex	28	Left-ventral-DC
4	Left-lateral-ventricle	41	Right-cerebral-white-matter
7	Left-cerebellum-white-matter	42	Right-cerebral-cortex
8	Left-cerebellum-cortex	43	Right-lateral-ventricle
10	Left-thalamus	46	Right-cerebellum-white-matter
11	Left-caudate	47	Right-cerebellum-cortex
12	Left-putamen	49	Right-thalamus
13	Left-pallidum	50	Right-caudate
14	3rd-ventricle	51	Right-putamen
15	4th-ventricle	52	Right-pallidum
16	Brain-stem	53	Right-hippocampus
17	Left-hippocampus	54	Right-amygdala
18	Left-amygdala	60	Right-ventral-DC

4.2 Baseline techniques

We compare our method with two traditional approaches, SyN [25] and Elastix [54], and with five learning-based algorithms: VoxelMorph [15], ViT-V-Net [14], TransMorph [55], XMorpher [43], and TransMatch [56]. For the learning-based algorithms, we followed the hyperparameter settings provided by the authors and trained the models from scratch on the dataset we used. For the traditional approaches, we adjusted the parameters to balance registration time and effectiveness.

4.3 Test

In our evaluation, we assess the overlap between the segmentation of the warped moving image and that of the fixed image by calculating Dice scores. We also assess how well each method preserves diffeomorphic properties by measuring the percentage of voxels with non-positive Jacobian determinants in the displacement field. Furthermore, we use topology change (TC) as an additional metric, analyzing the differences in the segmented images before and after spatial transformation. Finally, we measure the average registration time per image pair.

4.4 Implementation

We implemented the proposed method in PyTorch 1.10.0 and trained it with CUDA support (cu111). We used the Adam optimizer with a learning rate of $10^{-4}$. Training ran for one epoch of 64,770 batches, each consisting of one pair of images.

4.5 Evaluation metrics

We assess the registration techniques with four evaluation metrics: the Dice score (Dice), the Jacobian determinant ($|J_{\phi}|$), topology change (TC), and time complexity (T), defined as follows:

(1) Dice

The Dice score, also called the Dice Similarity Coefficient (DSC), measures the overlap between the fixed image and the warped moving image (in practice, between their segmentations). It is defined as:

$$ \mathrm{Dice}(I_{f}, I_{m}) = \frac{2 \times | I_{f} \cap I_{m} |}{| I_{f} | + | I_{m} |} \tag{11} $$

where $I_{f}$ and $I_{m}$ denote the fixed image and the warped moving image, respectively. The closer the Dice score is to 1, the better the registration result.
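As an illustration of Eq. (11), the sketch below computes the Dice score for one anatomical label from two integer label volumes. The function name `dice_score`, the NaN convention for a structure absent from both volumes, and the toy volumes are assumptions made for the example, not details taken from the paper.

```python
import numpy as np

def dice_score(seg_fixed, seg_warped, label):
    """Dice overlap of one label between two segmentation volumes (Eq. 11)."""
    f = seg_fixed == label
    m = seg_warped == label
    denom = f.sum() + m.sum()
    if denom == 0:
        # Assumed convention: the structure is absent from both volumes.
        return float("nan")
    return float(2.0 * np.logical_and(f, m).sum() / denom)

# Toy example: two 32-voxel slabs that share 16 voxels.
seg_fixed = np.zeros((4, 4, 4), dtype=int)
seg_warped = np.zeros((4, 4, 4), dtype=int)
seg_fixed[:2] = 1      # slices 0 and 1
seg_warped[1:3] = 1    # slices 1 and 2
toy_dice = dice_score(seg_fixed, seg_warped, label=1)  # 2 * 16 / (32 + 32)
```

In a multi-structure evaluation such as the 28 OASIS labels, one plausible choice is to average this per-label score over all labels and all test pairs.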

(2) $|J_{\phi}|$

The Jacobian matrix describes the local behaviour of the spatial transformation between the fixed and moving images. It is defined as:

$$ J = \nabla T (x) \tag{12} $$

where $\nabla$ is the gradient operator and $T (x)$ is the transformation applied to voxel coordinate $x$ of the moving image. We report the percentage of voxels whose Jacobian determinant $|J_{\phi}|$ is non-positive.
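The percentage of voxels with a non-positive Jacobian determinant can be sketched as follows. This is an illustrative version under stated assumptions: the transformation is taken to be $T(x) = x + u(x)$ for a displacement field $u$ stored as an array of shape `(3, X, Y, Z)` in voxel units, and derivatives are estimated with `numpy.gradient` (central differences in the interior) rather than any particular scheme from the paper. The function names are hypothetical.

```python
import numpy as np

def jacobian_determinant(disp):
    """Determinant of J = I + grad(u) at every voxel.

    disp: displacement field u of shape (3, X, Y, Z), in voxel units,
    so that T(x) = x + u(x) (an assumed parameterisation).
    """
    jac = np.empty(disp.shape[1:] + (3, 3))
    for i in range(3):
        grads = np.gradient(disp[i])  # [d u_i/dx, d u_i/dy, d u_i/dz]
        for j in range(3):
            jac[..., i, j] = grads[j] + (1.0 if i == j else 0.0)
    return np.linalg.det(jac)

def percent_nonpositive_jacobian(disp):
    """Percentage of voxels where det(J) <= 0, i.e. where folding occurs."""
    det = jacobian_determinant(disp)
    return float(100.0 * np.mean(det <= 0))

# Identity transform: no displacement, det(J) = 1 everywhere.
identity = np.zeros((3, 4, 4, 4))

# Fully folded toy transform: u_x = -2x gives T_x = -x, so det(J) = -1.
x = np.arange(4, dtype=float)[:, None, None] * np.ones((4, 4, 4))
folded = np.zeros((3, 4, 4, 4))
folded[0] = -2.0 * x
```

The identity field yields 0% and the reflected field yields 100%, which bracket the values a real registration field would produce.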

(3) Topology change (TC)

TC measures the change in the relationships and arrangement between different structures in the fixed and moving images during their alignment.

(4) Time-complexity (T)

T reflects the efficiency of the algorithm when aligning fixed and moving images and indicates how the computational effort grows with image complexity. In practice, we report the average registration time per image pair.

5. Results

5.1 Registration performance comparison

Table 3 compares the registration performance of two traditional registration methods, five recent learning-based baselines, and our model. The traditional approaches show the lowest Jacobian score (% of $|J_{\phi}| \leq 0$), i.e., the fewest folded voxels. However, they need tens of seconds per image pair (29.78 s to 58.02 s), which limits their practical usefulness. The six learning-based approaches complete registration in well under one second. Among them, our model achieves the highest Dice score on both datasets (0.765 on OASIS and 0.735 on LPBA40) and the lowest Jacobian score of the learning-based methods, although its registration time (0.5434 s on OASIS) is the longest in that group.

Table 2
Anatomical regions in LPBA40 for cross-dataset validation.

Label	Name	Label	Name
21	L superior frontal gyrus	65	L inferior occipital gyrus
22	R superior frontal gyrus	66	R inferior occipital gyrus
23	L middle frontal gyrus	67	L cuneus
24	R middle frontal gyrus	68	R cuneus
25	L inferior frontal gyrus	81	L superior temporal gyrus
26	R inferior frontal gyrus	82	R superior temporal gyrus
27	L precentral gyrus	83	L middle temporal gyrus
28	R precentral gyrus	84	R middle temporal gyrus
29	L middle orbitofrontal gyrus	85	L inferior temporal gyrus
30	R middle orbitofrontal gyrus	86	R inferior temporal gyrus
31	L lateral orbitofrontal gyrus	87	L parahippocampal gyrus
32	R lateral orbitofrontal gyrus	88	R parahippocampal gyrus
33	L gyrus rectus	89	L lingual gyrus
34	R gyrus rectus	90	R lingual gyrus
41	L postcentral gyrus	91	L fusiform gyrus
42	R postcentral gyrus	92	R fusiform gyrus
43	L superior parietal gyrus	101	L insular cortex
44	R superior parietal gyrus	102	R insular cortex
45	L supramarginal gyrus	121	L cingulate gyrus
46	R supramarginal gyrus	122	R cingulate gyrus
47	L angular gyrus	161	L caudate
48	R angular gyrus	162	R caudate
49	L precuneus	163	L putamen
50	R precuneus	164	R putamen
61	L superior occipital gyrus	165	L hippocampus
62	R superior occipital gyrus	166	R hippocampus
63	L middle occipital gyrus	181	Cerebellum
64	R middle occipital gyrus	182	Brainstem

Table 3

Quantitative evaluation of proposed framework against other methods on the OASIS and LPBA40 datasets.

	OASIS				LPBA40
Method	DSC	Jacobian	Time (s)	TC	DSC	Jacobian	Time (s)	TC
No Registration	0.612 $\pm$ 0.225	–	–	–	0.610 $\pm$ 0.224	–	–	–
Elastix [54]	0.678 $\pm$ 0.238	$<$ 0.1	58.02	1.143	0.667 $\pm$ 0.271	$<$ 0.1	52.17	1.13
SyN [25]	0.682 $\pm$ 0.153	$<$ 0.1	43.03	1.013	0.671 $\pm$ 0.163	$<$ 0.1	29.78	1.09
VoxelMorph [15]	0.732 $\pm$ 0.049	$<$ 0.6	0.3106	0.3427	0.688 $\pm$ 0.018	$<$ 0.43	0.248	0.543
ViT-V-Net [14]	0.744 $\pm$ 0.158	$<$ 0.8	0.2461	0.4407	0.690 $\pm$ 0.036	$<$ 0.4	0.186	0.660
TransMorph [55]	0.749 $\pm$ 0.169	$<$ 0.5	0.2730	0.6130	0.692 $\pm$ 0.034	$<$ 0.42	0.245	0.592
XMorpher [43]	0.755 $\pm$ 0.014	$<$ 0.3	0.2920	1.024	0.701 $\pm$ 0.022	$<$ 0.24	0.230	0.661
TransMatch [56]	0.760 $\pm$ 0.022	$<$ 0.35	0.3723	1.041	0.716 $\pm$ 0.031	$<$ 0.42	0.206	0.621
Ours	0.765 $\pm$ 0.020	$<$ 0.15	0.5434	1.043	0.735 $\pm$ 0.023	$<$ 0.13	0.42	1.06

5.2 Cross-data-set verification

Table 3 also presents the cross-dataset results on LPBA40 for the two conventional algorithms and the six learning-based techniques, including ours. As Table 2 shows, LPBA40 is more challenging than OASIS because it requires registration of a larger number of regions (56 versus 28), which lowered the performance of all registration strategies. Nevertheless, our method remains the most accurate registration technique on this more challenging dataset, with a Dice score of 0.735. These findings indicate the stability and adaptability of our method across registration settings.

5.3 Visual analysis

In this section, we provide visual comparisons of our proposed framework with seven other state-of-the-art deformable medical image registration techniques.

Figure 3 shows the registration outcomes for a specific image pair: the moving and fixed images, the results obtained by each registration method, and the generated registration field. Compared with the other approaches, our technique shows closer alignment with the fixed image.

Figure 3.

Qualitative outcomes regarding the registration precision achieved by our approach in comparison to the baseline methods.

Figure 4.

Visualization of registration outcomes: A random pair of registration images from the OASIS test set was chosen for experimentation with six different registration algorithms. The registered images and corresponding masks were visualized in three dimensions and three specialized medical sections. The results from our model indicate that the registered image closely resembles the fixed image in the primary registration area.

In Figure 4, after registration with our model, the three-dimensional contour of the warped mask aligns closely with the fixed mask. Furthermore, two-dimensional slices in the coronal, axial (cross-sectional), and sagittal views show that the image registered by our model closely resembles the fixed image.

5.4 Ablation study

We conducted ablation experiments and recorded the Dice Similarity Coefficient (DSC), the Jacobian coefficient (% of $|J_{\phi}| \leq 0$), the registration time, and the GPU memory used for training in order to assess the three components of our model: Residual Block with Skip Connection (RBSC), Cascade-based Attention with Residual Skip Connection (CARSC), and Cross Attention over Multi Layers (CAML). Each component is introduced into the network in turn to confirm its impact. Table 4 reports the ablation results on the OASIS dataset.

Looking first at single-component effects, i.e., removing one component from the full model, RBSC had the greatest effect on DSC, up to 0.6% (0.765 for the full model versus 0.759 without RBSC). With a maximum impact of 0.5%, CAML had the second-largest effect on DSC. CARSC had the largest effect on voxel folding: it resulted in a 1.6% rise in the Jacobian coefficient (0.134 without CARSC versus 0.15 with it).

We then removed two components at a time to see how pairs of components affected our model. Removing RBSC together with CARSC changed DSC by 1.7%, whereas removing CARSC with CAML or RBSC with CAML changed it by 1.3% and 1.0%, respectively. Thus, the combination of RBSC and CARSC had the greatest effect on DSC. This pair also had the highest influence on the Jacobian coefficient, up to 4.0%; CARSC with CAML had the second-greatest effect, up to 3%.

In terms of time and memory, CARSC added the most cost to our model, and CAML added the least.

Table 4
Ablation experiment results of our model.

RBSC	CARSC	CAML	Dice score	Jacobian	Time (s)	Memory (MB)
✓	✓	✗	0.760 $\pm$ 0.012	$<$ 0.143	0.4810	14803
✗	✓	✓	0.759 $\pm$ 0.013	$<$ 0.14	0.4503	14492
✓	✗	✓	0.761 $\pm$ 0.015	$<$ 0.134	0.4121	14015
✗	✗	✓	0.748 $\pm$ 0.011	$<$ 0.11	0.3190	11966
✓	✗	✗	0.752 $\pm$ 0.017	$<$ 0.12	0.3497	12276
✗	✓	✗	0.755 $\pm$ 0.018	$<$ 0.14	0.3879	12757
✓	✓	✓	0.765 $\pm$ 0.020	$<$ 0.15	0.5434	16536
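The ablation effects discussed in this section can be recomputed from Table 4. The sketch below assumes that each quoted percentage is the difference between the full model and the ablated variant multiplied by 100, and that the Jacobian entries, which the table reports as upper bounds, are used as point values; both are our reading of the text rather than a stated procedure.

```python
# Rows of Table 4: (RBSC, CARSC, CAML) -> (Dice score, Jacobian upper bound)
TABLE4 = {
    (True, True, False): (0.760, 0.143),
    (False, True, True): (0.759, 0.14),
    (True, False, True): (0.761, 0.134),
    (False, False, True): (0.748, 0.11),
    (True, False, False): (0.752, 0.12),
    (False, True, False): (0.755, 0.14),
    (True, True, True): (0.765, 0.15),
}
NAMES = ("RBSC", "CARSC", "CAML")

def ablation_deltas(table):
    """Map each set of removed components to (Dice drop, Jacobian drop),
    both expressed as (full - ablated) * 100 and rounded to one decimal."""
    full_dice, full_jac = table[(True, True, True)]
    deltas = {}
    for config, (dice, jac) in table.items():
        removed = tuple(n for n, used in zip(NAMES, config) if not used)
        if removed:  # skip the full model itself
            deltas[removed] = (
                round((full_dice - dice) * 100, 1),
                round((full_jac - jac) * 100, 1),
            )
    return deltas

deltas = ablation_deltas(TABLE4)
for removed, (d_dice, d_jac) in sorted(deltas.items(), key=lambda kv: -kv[1][0]):
    print(" + ".join(removed), d_dice, d_jac)
```

Under this reading, removing RBSC and CARSC together gives the largest changes (1.7 for Dice and 4.0 for the Jacobian coefficient), in line with the values quoted in the text.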

6. Conclusion

This paper presents an unsupervised deep learning-based algorithm for the registration of 3D medical images. We extensively tested and verified our model on the publicly accessible OASIS dataset. After training on OASIS, we tested the model on another publicly accessible dataset, LPBA40, to verify its performance across datasets and to better assess its robustness and applicability. The experimental results support the effectiveness of our model: the approach increases the accuracy of 3D medical image registration and enhances the reliability of the registration process. The novelty of this paper lies in cross-attention over multiple layers, residual blocks in the skip connections, and cascade-based attention with residual skip connections. In future research, there is potential to expand and refine our framework for other 3D applications such as 3D polygon meshes [57] and monocular 3D object detection [58]. If better training strategies emerge, they could complement or replace the contributions described in this paper and further improve the accuracy and quality of the registration process.

Acknowledgments

Funding for this research has been provided by the Fundamental Research Foundation of Shenzhen under Grant JCYJ2023080810570512.

References

1. Klein, Staring, Murphy, Viergever M.A., Pluim J.P.W., elastix: A toolbox for intensity-based medical image registration, IEEE Transactions on Medical Imaging 29(1) (2010), 196–205. doi: 10.1109/TMI.2009.2035616.

2. Parra N.A., Rigid and non-rigid point-based medical image registration, PhD thesis, Florida International University, 2009.

3. Makela, Clarysse, Sipila, Pauna, Pham Q.C., Katila, Magnin I.E., A review of cardiac image registration methods, IEEE Transactions on Medical Imaging 21(9) (2002), 1011–1021.

4. Gong, Chen, Zeng, Zhang, Generative adversarial networks in medical image processing, Current Pharmaceutical Design 27(15) (2021), 1856–1868.

5. Mahapatra, Antony, Sedai, Garnavi, Deformable medical image registration using generative adversarial networks, in: 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), IEEE, 2018, pp. 1449–1453.

6. Singh N.K., Raza, Medical image generation using generative adversarial networks: A review, Health informatics: A computational perspective in healthcare, 2021, 77–96.

7. Singh N.K., Raza, Medical image generation using generative adversarial networks, arXiv preprint arXiv:2005.10687, 2020.

8. Haskins, Kruger, Yan, Deep learning in medical image registration: A survey, Machine Vision and Applications 31(1–2) (2020). doi: 10.1007/s00138-020-01060-x.

9. Jaderberg, Simonyan, Zisserman et al., Spatial transformer networks, Advances in Neural Information Processing Systems 28 (2015).

10. Balakrishnan, Zhao, Sabuncu M.R., Guttag, Dalca A.V., An unsupervised learning model for deformable medical image registration, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9252–9260.

11. Ronneberger, Fischer, Brox, U-net: Convolutional networks for biomedical image segmentation, in: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III 18, Springer, 2015, pp. 234–241.

12. Fan, Yan, 3D reconstruction based on hierarchical reinforcement learning with transferability, Integrated Computer-Aided Engineering 30 (2023), 1–13. doi: 10.3233/ICA-230710.

13. Cruz R.S., Lebrat, Bourgeat, Fookes, Fripp, Salvado, DeepCSR: A 3D Deep Learning Approach for Cortical Surface Reconstruction, in: 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), 2021, pp. 806–815. doi: 10.1109/WACV48630.2021.00085.

14. Chen, Frey E.C., Vit-v-net: Vision transformer for unsupervised volumetric medical image registration, arXiv preprint arXiv:2104.06468, 2021.

15. Balakrishnan, Zhao, Sabuncu M.R., Guttag, Dalca A.V., VoxelMorph: A learning framework for deformable medical image registration, IEEE Transactions on Medical Imaging 38(8) (2019), 1788–1800.

16. Zhang, Pei, Zha, Learning dual transformer network for diffeomorphic registration, in: Medical Image Computing and Computer Assisted Intervention – MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, Springer, 2021, pp. 129–138.

17. Zhou, Rahman Siddiquee M.M., Tajbakhsh, Liang, Unet++: A nested u-net architecture for medical image segmentation, in: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, Proceedings 4, Springer, 2018, pp. 3–11.

18. Huang, Lin, Tong, Zhang, Iwamoto, Han, Chen Y.-W., Unet 3+: A full-scale connected unet for medical image segmentation, in: ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2020, pp. 1055–1059.

19. Zhao, Dong, Chang E.I. et al., Recursive cascaded networks for unsupervised medical image registration, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 10600–10610.

20. Zhao, Lau, Luo, Eric, Chang, Unsupervised 3D end-to-end medical image registration with volume tweening network, IEEE Journal of Biomedical and Health Informatics 24(5) (2019), 1394–1404.

21. Thirion J.-P., Image matching as a diffusion process: an analogy with Maxwell’s demons, Medical Image Analysis 2(3) (1998), 243–260.

22. Sotiras, Davatzikos, Paragios, Deformable medical image registration: A survey, IEEE Transactions on Medical Imaging 32(7) (2013), 1153–1190.

23. Bajcsy, Kovačič, Multiresolution elastic matching, Computer Vision, Graphics, and Image Processing 46(1) (1989), 1–21.

24. Viola, Wells III W.M., Alignment by maximization of mutual information, International Journal of Computer Vision 24(2) (1997), 137–154.

25. Avants B.B., Epstein C.L., Grossman, Gee J.C., Symmetric diffeomorphic image registration with cross-correlation: Evaluating automated labeling of elderly and neurodegenerative brain, Medical Image Analysis 12(1) (2008), 26–41.

26. Leow A.D., Yanovsky, Chiang M.-C., Lee A.D., Klunder A.D., Becker J.T., Davis S.W., Toga A.W., Thompson P.M., Statistical properties of Jacobian maps and the realization of unbiased large-deformation nonlinear image registration, IEEE Transactions on Medical Imaging 26(6) (2007), 822–832.

27. Kabus, Klinder, Murphy, van Ginneken, Lorenz, Pluim J.P., Evaluation of 4D-CT lung registration, in: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2009: 12th International Conference, London, UK, September 20–24, 2009, Proceedings, Part I 12, Springer, 2009, pp. 747–754.

28. Greer, Kwitt, Vialard F.-X., Niethammer, ICON: Learning regular maps through inverse consistency, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3396–3405.

29. Shen, Vialard F.-X., Niethammer, Region-specific diffeomorphic metric mapping, Advances in Neural Information Processing Systems 32 (2019).

30. Kim, Wiseman, Miller, Sontag, Rush, Semi-amortized variational autoencoders, in: International Conference on Machine Learning, PMLR, 2018, pp. 2678–2687.

31. Cremer, Duvenaud, Inference suboptimality in variational autoencoders, in: International Conference on Machine Learning, PMLR, 2018, pp. 1078–1086.

32. Yang, Kwitt, Styner, Niethammer, Quicksilver: Fast predictive image registration – a deep learning approach, NeuroImage 158 (2017), 378–396.

33. Rohé M.-M., Datar, Heimann, Sermesant, Pennec, SVF-Net: learning deformable image registration using shape matching, in: Medical Image Computing and Computer Assisted Intervention – MICCAI 2017: 20th International Conference, Quebec City, QC, Canada, September 11–13, 2017, Proceedings, Part I 20, Springer, 2017, pp. 266–274.

34. Cao, Yang, Wang, Xue, Wang, Shen, Deep learning based inter-modality image registration supervised by intra-modality similarity, in: Machine Learning in Medical Imaging: 9th International Workshop, MLMI 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 16, 2018, Proceedings 9, Springer, 2018, pp. 55–63.

35. Krebs, Mansi, Mailhé, Ayache, Delingette, Unsupervised probabilistic deformation modeling for robust diffeomorphic registration, in: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, Proceedings 4, Springer, 2018, pp. 101–109.

36. De Vos B.D., Berendsen F.F., Viergever M.A., Sokooti, Staring, Išgum, A deep learning framework for unsupervised affine and deformable image registration, Medical Image Analysis 52 (2019), 128–143.

37. Lei, Wang, Liu, Patel, Curran W.J., Liu, Yang, 4D-CT deformable image registration using multiscale unsupervised deep learning, Physics in Medicine & Biology 65(8) (2020), 085003.

38. Dalca A.V., Balakrishnan, Guttag, Sabuncu M.R., Unsupervised learning for fast probabilistic diffeomorphic registration, in: Medical Image Computing and Computer Assisted Intervention – MICCAI 2018: 21st International Conference, Granada, Spain, September 16–20, 2018, Proceedings, Part I, Springer, 2018, pp. 729–738.

39. Chen C.-F.R., Fan, Panda, Crossvit: Cross-attention multi-scale vision transformer for image classification, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 357–366.

40. Wang, Yang, Zuo, Shen H.T., Cross-modal attention with semantic consistence for image-text matching, IEEE Transactions on Neural Networks and Learning Systems 31(12) (2020), 5412–5425.

41. Liu, Zuo, Han, Xue, Prince J.L., Carass, Coordinate translator for learning deformable medical image registration, in: International Workshop on Multiscale Multimodal Medical Imaging, Springer, 2022, pp. 98–109.

42. Song, Chao, Guo, Turkbey, Wood B.J., Sanford, Wang, Yan, Cross-modal attention for multi-modal image registration, Medical Image Analysis 82 (2022), 102612.

43. Shi, Kong, Coatrieux J.-L., Shu, Yang, Xmorpher: Full transformer for deformable medical image registration via cross attention, in: Medical Image Computing and Computer Assisted Intervention – MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022, Proceedings, Part VI, Springer, 2022, pp. 217–226.

44. Dollár, Welinder, Perona, Cascaded pose regression, in: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE, 2010, pp. 1078–1085.

45. Felzenszwalb P.F., Girshick R.B., McAllester, Cascade object detection with deformable part models, in: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE, 2010, pp. 2241–2248.

46. Zhou, Chandraker, Deep deformation network for object landmark localization, in: Computer Vision-ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part V 14, Springer, 2016, pp. 52–70.

47. Cai, Vasconcelos, Cascade r-cnn: Delving into high quality object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6154–6162.

48. Schlemper, Caballero, Hajnal J.V., Price A.N., Rueckert, A deep cascade of convolutional neural networks for dynamic MR image reconstruction, IEEE Transactions on Medical Imaging 37(2) (2017), 491–503.

49. Ravishankar, Venkataramani, Thiruvenkadam, Sudhakar, Vaidya, Learning and incorporating shape models for semantic segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2017, pp. 203–211.

50. Chen, Dou, Wang, Qin, Heng, Mitosis detection in breast cancer histology images via deep cascaded networks, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 30, 2016.

51. Zhang, Ren, Sun, Deep Residual Learning for Image Recognition, 2015.

52. Marcus D.S., Wang T.H., Parker, Csernansky J.G., Morris J.C., Buckner R.L., Open Access Series of Imaging Studies (OASIS): Cross-sectional MRI Data in Young, Middle Aged, Nondemented, and Demented Older Adults, Journal of Cognitive Neuroscience 19(9) (2007), 1498–1507. doi: 10.1162/jocn.2007.19.9.1498.

53. Shattuck D.W., Mirza, Adisetiyo, Hojatkashani, Salamon, Narr K.L., Poldrack R.A., Bilder R.M., Toga A.W., Construction of a 3D probabilistic atlas of human cortical structures, Neuroimage 39(3) (2008), 1064–1080.

54. Klein, Staring, Murphy, Viergever M.A., Pluim J.P., Elastix: A toolbox for intensity-based medical image registration, IEEE Transactions on Medical Imaging 29(1) (2009), 196–205.

55. Chen, Frey E.C., Segars W.P., Transmorph: Transformer for unsupervised medical image registration, Medical Image Analysis 82 (2022), 102615.

56. Chen, Zheng, Gee J.C., TransMatch: A Transformer-based Multilevel Dual-Stream Feature Matching Network for Unsupervised Deformable Image Registration, IEEE Transactions on Medical Imaging, 2023.

57. Fan, Song, TPNet: A novel mesh analysis method via topology preservation and perception enhancement, Computer Aided Geometric Design 104 (2023), 102219. doi: 10.1016/j.cagd.2023.102219. https://www-sciencedirect-com-443.web.bisu.edu.cn/science/article/pii/S0167839623000511.

58. Naiden, Paunescu, Kim, Jeon, Leordeanu, Shift R-CNN: Deep Monocular 3D Object Detection With Closed-Form Geometric Constraints, in: 2019 IEEE International Conference on Image Processing (ICIP), 2019, pp. 61–65. doi: 10.1109/ICIP.2019.8803397.
