Abstract
BACKGROUND:
Adaptive radiotherapy (ART) aims to address anatomical modifications appearing during the treatment of patients by modifying the planning treatment according to the daily positioning image. Clinical implementation of ART relies on the quality of the deformable image registration (DIR) algorithms included in the ART workflow. To translate ART into clinical practice, automatic DIR assessment is needed.
OBJECTIVE:
This article aims to estimate spatial misalignment between two head and neck kilovoltage computed tomography (kVCT) images by using two convolutional neural networks (CNNs).
METHODS:
The first CNN quantifies misalignments between 0 mm and 15 mm and the second CNN detects and classifies misalignments into two classes (poor alignment and good alignment). Both networks take pairs of patches of 33x33x33 mm
RESULTS:
The quantification CNN reaches a mean error of 1.26 mm (
CONCLUSION:
The performances of the networks indicate the feasibility of using CNNs for an agnostic and generic approach to misalignment quantification and detection.
Keywords
Introduction
Over the course of radiotherapy treatment, which can last several weeks, the patient anatomy may evolve slowly (e.g., the patient may experience weight loss, tumor shrinkage, and changes in position with respect to the planning computed tomography (CT)) and/or be subject to sudden interfraction variations [1]. These anatomical modifications can lead to differences between the planned dose and the delivered dose to target volumes and organs at risk. The adaptive radiotherapy (ART) concept aims to address this issue by modifying a radiotherapy treatment plan according to the daily positioning images.
The ART concept, while introduced many years ago [2], is still not fully integrated into daily radiotherapy practice for several reasons [3]. One of these reasons is the use of deformable image registration (DIR), mainly for contour and/or dose deformation between planned and in-room positioning images [4]. Indeed, implementing ART into a clinical routine is a challenging task relying notably on the quality of the DIR algorithms included in the ART workflow/software. The DIR quality assessment must be performed at the software initial commissioning but also for each patient to ensure that contours/doses are correctly deformed [5], so-called patient-specific DIR evaluation. The assessment can be performed manually with various tools [6], but as it is a time-consuming task, a manual evaluation for each patient and each fraction cannot be realistically performed routinely in radiotherapy departments because of the time spent performing these manual operations. Instead, an automatic tool for DIR evaluation would be beneficial to the clinical translation of ART.
Recent works aim to address the issue of automatic DIR quality assessment to avoid using manual annotations and validations. As stated before, common methods for evaluating a DIR still rely on manual annotations or delineations that, if they provide ground-truth data, are not possible when considering clinical routine. Some automatic methods based on image intensity rely on image similarity metrics [6]. However, such metrics are not necessarily correlated with DIR quality, as shown by Rohlfing et al. [7]. Automatic landmark extractions and automatic contouring have also been investigated [6, 8], but by introducing contouring uncertainty, the DIR quality is still difficult to assess.
Deep learning methods for radiotherapy and medical imaging have developed increasingly over recent years for various tasks [9, 10]. They have been shown to outperform and outpace state-of-the-art methods in various medical imaging tasks, including DIR error estimation. In automatic DIR assessment based on image intensity, deep learning methods have arisen in recent years, because deep learning methods enable the modeling of complex nonlinear functions, e.g., modeling a relationship between a similarity measure and a DIR error.
To our knowledge, few papers have explored the use of deep-learning to quantify DIR errors, which is confirmed by a recent scoping review dedicated to the automatic estimation of DIR errors [11]. In 2017, Neylon et al. used a deep neural network to establish a connection between image similarity and registration error [12]. They demonstrated the effectiveness of the method, but without providing a means to use it in clinical routine and validating it only on synthetic data. Convolutional neural networks (CNN) were used in another article for estimation of nonlinear registration errors, in which the authors also proposed the use of heat-maps to give the user visual information about the quality of a registration [13]. The performance of the network was evaluated on chest CT scans from public databases. Sokooti et al. used convolutional Long Short-Term Memory (LSTM) to predict registration misalignment, tested on chest CT scans [14]. More recently, an article based on the use of VoxelMorph [15] proposed an approach to predict the variance of a given deformable vector field [16]. This was still based on chest CT scans, and included the dosimetric impact of the DIR uncertainty in the context of adaptive proton therapy.
This paper aims to present a supervised learning approach using CNNs with residual functions to make mono-modal DIR error estimation for head and neck (H&N) cancer treatments. To our knowledge, this is the first time that a deep-learning-based method has been used to quantify DIR errors on H&N CT patient images. In addition, we propose to translate our academic method into a clinical workflow, by providing warning flags when a DIR produces errors above a threshold, which could be defined following consensual expert recommendations, such as the American Association of Physicists in Medicine (AAPM) [5, 17].
Materials and methods
Database
Training set
The training database consists of synthetically deformed pairs of images modeling pairs of moving and fixed images with a fully known registration error. A synthetic dataset makes up for the complexity of collecting ground truths for such a problem.
The training sets consist of 11 planning CT scans of adult patients previously treated at ICANS (Institut de Cancérologie de Strasbourg) for various head-and-neck cancers. This study is a retrospective study performed in accordance with the ethical standards of the hospital. The patient gave informed consent for research publication. The 2.5 mm CT axial images were obtained at 140 kV on a GE Optima 580 RT CT (General Electric, USA). Patients were immobilized with a thermoplastic mask (CIVCO, USA) and low-density head supports (ORFIT, Belgium). Figure 1 shows the immobilization and positioning systems used. The training sets consist of pairs of CTs modeling a fixed image
Contentions used in clinical practice. (Left) CT image where the thermoplastic mask is visible. (Right) Picture of a patient wearing a thermoplastic mask.
The error map is the true labels fed to the networks during training. For the quantification network, the error map is used as it is, and for the classification network, the error map is transformed into a binary label map by setting a threshold at 5 mm so that every voxel where the modeled errors are greater than 5 mm will be assigned the value ‘1’, while voxels with errors less than 5 mm will be assigned the value ‘0’. Figure 2 shows an example of a pair of CTs with its error map and its binary map.
Example of a pair of CTs in axial view and its corresponding error map and binary map. (a) Original CT in axial view. (b) Deformed CT in axial view. (c) The error map of the deformation. (d) The binary map used as label map for classification which is the result of applying a threshold to the error map.
The test dataset consists of H&N CT scans with annotated landmarks. These CTs are not seen by the networks during training to assess the quality of the networks’ prediction on real deformations. The test set consists of CTs of locally advanced oropharyngeal carcinoma patients enrolled in the prospective ARTIX phase III trial (ARTIX study NCT01874587). Three patients were retrospectively selected, and each patient had six CTs, amounting to 18 CTs. The image acquisition is performed under the same conditions as for the training set. The CTs are taken once a week for every week of the patients’ treatment. The voxel size is different for every CT. The slice thickness is 2.5 mm for every CT, and the in-plane voxel dimensions are between 0.98 mm and 2.38 mm for both dimensions. To evaluate the networks, 10 landmarks were annotated in every CT by six different annotators: one senior and one junior radiation therapist (both of them being H&N specialists) and experienced medical physicists/therapists. The landmarks are located mostly in hard tissues, with three landmarks in soft tissue locations (landmarks 3, 4 and 6; see Table 1), thus enabling us to evaluate the networks on both types of tissues. Table 1 shows the landmark locations and the interoperator errors with a range of one standard deviation. Figure 3 gives a visualization of the landmark location on a patient. Indeed, this variability is crucial to properly evaluate the results given by the networks: manual annotations are the reference for DIR evaluation [5], and their errors are to be compared against the network errors. The interoperator errors are calculated by first computing the center of mass of each set of six landmarks for one CT and for one location. Then, the distance of each landmark to the center of mass is computed. The six distances are averaged together. Such calculation is done for each CT and for each location, and all results obtained are then averaged to compute location-specific interoperator errors.
Landmark location and the corresponding annotation errors
Landmark location and the corresponding annotation errors
Landmarks location visualized on a CT (coronal and sagittal views).
The landmark evaluation is done exclusively on pairs of corresponding landmarks with a TRE under 15 mm after the rigid registration step. The result is 2500 pairs.
The TRE is a physical distance measured in millimeters. Since the CTs have variable voxel sizes due to the clinical context and the networks do not have prior knowledge of the scales, all CTs in the database have been resampled to have a voxel size of 1x1x1 mm
Block-diagram of the workflow of the intended application. The CNN aims to quantify and classify the quality of the DIR between two kVCT of the same patient performed at different times. CNN: convolutional neural network. DIR: deformable image registration. H&N: head and neck. kVCT: kilovoltage computed tomography.
A block diagram of the idea behind the method is given in Fig. 4. The CNN architecture is described in Fig. 5. The method is a patch-based approach using only the image intensity information to assign to each voxel in an image an estimation of the DIR error. Such an agnostic method does not rely on specific DIR algorithms, nor does it rely on any data other than a pair of images described by their intensities.
The proposed method computes a DIR error estimation for every voxel in the image domain. Considering two images to be registered,
The TRE requires prior knowledge of the true transformation
The proposed methods investigated in this article consist of two CNNs: one that quantifies the TRE and one that classifies the TRE. For the quantification network, the output is a discrete value representing the millimetric TRE estimated by the network. For the classification network, the output is a classification of the TRE in one of two classes, which are TRE ranges: in the first class, TREs are between 0 and 5 mm, while in the second class, TREs are greater than 5 mm and less than 15 mm. Such classes describe well-registered (first-class) voxels and poorly registered (second-class) voxels according to norms established for ART [5, 17].
As previously explained, synthetic deformations are used as ground truths for training and manually annotated landmarks for evaluating the networks.
Network training
A total of 128 000 pairs of patches were extracted from 10 pairs of CTs. All patches are picked from
The CNN architectures proposed is made up of two stages of two convolutional layers followed by a max-pooling layer, followed by three fully connected layers to estimate/classify the error between the pair of patches given in input.
the patient’s body, and none are picked from the background using the binary masks generated for each CT. The pairs of patches used to train the quantification network are extracted randomly among the CTs so that there is the same number of patches for the intervals 0 mm to 5 mm, 5 mm to 10 mm and 10 mm to 15 mm. The pairs of patches used to train the classification networks are extracted so that half of the pairs of patches belongs to the first class and the other half belongs to the second class. Both networks were trained on a system with an Intel Xeon 12 core, 32 GB of memory, and an Nvidia GeForce RTX 2070 Ti with 11 GB of GPU memory. Each neural network was trained for 120 epochs with a batch size of 64 pairs.
To evaluate the networks on the annotated landmarks, a registration step is needed. With six CTs per patient, 15 pairs of CTs are possible per patient; thus, there are a total of 45 pairs of CTs. Each pair of CTs is rigidly registered on the fly by using Simple Elastix [19] with base parameters for rigid registration. The rigid registration uses the CT information and is not based on the landmarks being available. It is then assumed that the errors between corresponding landmarks after the rigid registration step consist mostly of the elastic components, which are what was modeled and what was fed to the networks by using the B-spline modeling approach.
Results
Quantification and classification results are first given for the synthetic dataset with computed DIR errors and then for the dataset with manually annotated landmarks as ground truth.
The quantification network achieves a loss of 0.8455 mm at the end of the training. On independent synthetic data, the loss is 1.4621 mm. Figure 6 shows the network applied to a pair of CTs with a synthetic deformation: visually, the network accords well with the modeled error on the patient’s body but shows visual artifacts due to the patch base approach: the structure of the b-spline is very smooth, but, proceeding voxel per voxel, the network cannot render the b-spline. The classification network achieves a loss of 0.2724 at the end of the training with an accuracy of 92.35%. The loss and the accuracy are computed using 100,000 pairs of patches randomly extracted with the pair of CTs left out of the training set. Then, the loss amounts to 0.4867 for an accuracy of 87.56%.
(a) True error map of the synthetic deformation in axial view. (b) True error map super-imposed on the original CT. (c) Error map estimated by the network. (c) Estimated error map super-imposed on the original CT.
Results of the landmark-based evaluation. (a) Histogram of the error of the landmark TRE. (b) Histogram of the relative error made by the quantification network on the landmarks. (c) Results of the classification network with the vertical line representing the boundary between the two classes in terms of TRE. (d) Results of the quantification network w.r.t. the true TRE. Solid line is the identity. Dashed lines are the identity 
Concerning the dataset with manual landmarks used as ground truths, Fig. 7a first shows that the true TREs are not balanced at all like the training set: TREs less than 8 mm are overrepresented. Indeed, in clinical scenarios, the DIR errors do not follow a uniform law. The quantification network evaluated on the pair of corresponding landmarks reaches a mean error of 1.75 mm with respect to the true error, and the network error ranges from 0.0 mm to 5.15 mm with a standard deviation of 1.26 mm. Figure 7b shows that the relative network error has a zero mean on the landmark set; this finding indicates that there is no bias toward under/overestimation of the error. Figure 7d also shows that the lower the true TRE is, the higher the network error; this finding indicates poor performances on low deformations, but for higher amplitudes of deformation, the quantification network has an accurate prediction. Indeed, the absolute errors made by the network seem to be independent of the amplitude of the deformations.
The classification network achieves an accuracy of 79.32% on the landmark set. The results for the classification network are shown in Fig. 7c and highly suggest that the network suffers from lower performances on lower deformations but has good accuracy on higher deformations; this suggestion is confirmed by the confusion matrix obtained:
Table 2 summarizes the results with respect to the landmark locations. Landmark location does not seem to affect the performances of the networks. There is no apparent correlation between the performances of the networks; e.g., the quantification network performs best on the C2 dens location (1.12 mm), while here, the classification network has an accuracy of 78.94%, which is close to the average accuracy of 79.32%. This performance is far from the best performance achieved by the network on the vertebral location with an accuracy of 83.11%, while here, the quantification network achieves a loss of 1.79 mm, which is barely more than the average network error (1.75 mm).
Results of the networks for each landmark location
Today, all areas of the health care environment are subject to research in artificial intelligence, especially in medical imaging and oncology [20, 21, 22, 23, 24]. Deformable registration algorithms are a main research area, since they are implemented today in many medical imaging software used in clinical routines [25, 26]. The AAPM recommends that deformation registration algorithms be used as part of a strict quality assurance process, which can be split into two parts [5]. The first is the commissioning step, which aims to determine the limits of use of the registration algorithm. This first step can be performed using digital and/or physical phantoms, as reported by numerous examples in the literature [27, 28, 29, 30, 31]. The second part concerns patient-specific validation during clinical practice; the purpose of this validation is to ensure that the precision of the deformation is sufficient for the intended clinical purpose. Several methods exist based on the contribution of an external operator or on the use of tools automatically quantifying errors from images or deformation vectors [6]. In the context of the ART approach, the purpose is to verify the deformable multimodal registration between positioning and planning imagery for all treatment sessions for all patients; this verification represents a very important number of daily checks. In this context, the contribution of an external operator, whether through visual validation or the addition of contours and landmarks, seems unthinkable in clinical routines because such a contribution is too time-consuming. Automatic methods based on the automatic generation of landmarks and contours seem difficult to implement since these methods add an additional level of uncertainty.
This paper explored a 3D-CNN approach to estimate a DIR error between H&N CTs by using two methods: classification and quantification of the error by using TREs. TREs measure physical distances between two points and, in the present case, are a direct measure of the deformation between two voxels and thus a direct measure of the DIR quality. The performances of the networks show that they lack generality in regard to low deformations (
As stated in the introduction, deep-learning based networks have already been proposed to automatically quantify DIR uncertainty [12, 13, 14, 16]. The global classification accuracy of our network is 79.3%, knowing that it is about 70 and 90% respectively for deformations lower than 5 mm and between 5 and 15 mm. In Looki et al., the classification accuracy is 87%, but the extent of deformations is not given, so it’s difficult to establish a strict comparison [14]. We obtained a quantification mean error of 1.75 mm evaluated on the pair of corresponding landmarks, while Eppenhof et al. obtained a 0.66 mm root mean square error [13]. However, the maximum TRE present in their dataset was 4 mm, while we evaluated TRE up to 15 mm. For the other publications, results given in graphs and/or for specific regions makes it difficult to compare results. However, we observe that the order of magnitude of our results is comparable with those in the literature.
The networks developed can be applied to a clinical routine: the classification network can be used to trigger warnings if voxels belonging to an organ-at-risk, a target volume or, more generally, any user-defined anatomical area classified as second class. Nevertheless, the results for both networks achieve lower accuracy for small TREs (
A limitation of our method is that the evaluation does not consider that some voxels belong to homogeneous or uncontrasted areas, leading to a lack of contextual information. There would be a need to characterize such voxels, as such modeling would lead to a more accurate assessment of methods evaluating DIR quality. During the making of the networks, there was an attempt to evaluate such voxels by using the synthetic dataset. Considering that third-order B-spline deformations should be easy enough to register for a research DIR algorithm, such as Elastix, pairs of patches the same size as the one used in the networks were registered with an elastic registration algorithm. It was further considered that if the deformation vector at the central voxel in the fixed patch referential (which is an output of the registration) was greater than the size of a voxel, then the voxel should be considered falling into a class describing voxels where an accurate registration is not possible due to the lack of context. This method required to register hundreds of thousands of patches to build another dataset, which could then be used as training data for a specific network. However, given the poor results obtained (more than half of the voxels belonging to the patient’s body in a CT fell into the class we described) and the overly time-consuming nature of the task, the investigation led where stopped.
This article investigated only mono-modal DIR error estimation, but in regard to ART, a clinical translation would require multimodal DIR error estimation. Further work should be directed toward multimodality and other anatomical locations to develop a more generic approach fitted for the ART routine. Addressing the multimodality problem would require simulating other modalities, which is another trending topic in radiotherapy [9], or training and testing networks on modality-independent features [6, 33].
The results presented in this paper prove the validity of the use of deep-learning methods for automatic DIR evaluation. This automatic method is particularly adapted to the ART context, where a manual evaluation is too time-consuming. However, it remains to be proven that the results are sufficient to lead to a clinical translation. This evaluation should be performed by applying our method on complete datasets of daily positioning images of patients treated in radiotherapy. With geometric and dosimetric metrics, it would then be possible to verify if the tools we have developed in this paper can effectively identify areas where the quality of DIR is poor from a clinical point of view.
Conclusion
This article presented automatic methods for estimating DIR errors. The methods presented in this article provide a generic approach using only the image intensity information and a limited number of CTs. If further developments, as those suggested, are required to achieve robust automatic systems for a clinical translation, patient-specific methods assessing DIR errors, such as the one presented here, are important steps to fulfill a clinical need in the automation of patient-specific tasks for various domains.
Footnotes
Acknowledgments
The authors wish to thank the landmark annotators for their valuable work: Dr. Sebastian Guihard, senior radiation therapist at the Institut de Cancérologie de Strasbourg (France); Dr. Joffrey Perruisseau-Carrier, chief of head and neck university clinic at the University Hospital of Strasbourg (France); and Corinne Renaud, Anthony Marcot, and Mathilde Bigot, medical physicists/therapists at the Institut de Cancérologie de Strasbourg (France).
Conflict of interest
The authors declare that they have no conflict of interest.
Funding
This work was partly funded by France Life Imaging (grant ANR-11-INBS-0006) and the region Grand Est (France).
