Abstract
The dose-volume histogram (DVH) is an important tool for evaluating the quality of a radiation treatment plan, and it can be predicted from the distance-volume spatial relationship between planning target volumes (PTVs) and organs at risk (OARs). However, prediction accuracy is still limited by the complicated calculation process and the omission of detailed spatial geometric features. In this paper, we propose a spatial geometric-encoding network (SGEN) that incorporates 3D spatial information into an efficient 2D convolutional neural network (CNN) for accurate DVH prediction in esophageal radiation treatments. 3D computed tomography (CT) scans, 3D PTV scans and 3D distance images are used as the multi-view input of the proposed model. The dilation convolution based
Keywords
Introduction
High-quality treatment plans are crucial for esophageal radiation therapy, which aims to maximize the radiation dose delivered to the PTV while minimizing the dose to OARs [3,5,25]. The DVH is a histogram relating radiation dose to tissue volume in radiation therapy planning [4], and it is the most commonly used tool for comparing doses from different treatment plans. Planners currently rely on their previous experience, which makes the planning process subjective, iterative, and susceptible to variation between planners. Accurate and automatic DVH prediction would therefore help to obtain high-quality treatment plans efficiently.
In the past decades, DVH prediction has attracted considerable attention from researchers in both academia and industry. The spatial geometry between the PTV and OARs is complex. The challenge of DVH prediction is how to model this spatial geometry and handle the complex relationship between geometric and dosimetric information [17].
A common approach to representing the geometric information between the PTV and OARs in DVH prediction is to use manually defined feature descriptors. The most popular descriptors are the distance-to-target histogram (DTH) [33,36] and the overlap volume histogram (OVH) [30,31]. The DTH encodes the spatial relationship between the OARs and the PTV, and it is equivalent to the OVH when the Euclidean form of the distance function is used. Although DTH/OVH-based methods have made progress, one concern is that they pay little attention to the geometric shapes of the organs and the PTV, which have been shown to be one of the important factors affecting dose coverage [34]. However, it is very difficult to represent geometric features accurately with manually defined descriptors.
Recently, several studies have explored dosimetric features as a new avenue for DVH prediction. Ma et al. [23,24] used the original dosimetric parameters as the dosimetric representation and combined it with geometric features. Cao et al. [1] proposed individual dose-volume histograms (IDVH), extracted from individual conformal beams in different directions, to represent dosimetric information. As with geometric representations, manually designing an effective dosimetric descriptor is difficult.
Building on these feature representations, many machine learning algorithms have been used to learn the mapping from features to the DVH. For example, the correlation between DTH and DVH has been modeled by principal component analysis (PCA) and support vector regression (SVR) [36]. Zhang et al. [35] proposed an ensemble method that combines the strengths of various linear regression models, including stepwise, lasso, elastic net, and ridge regression. Fan et al. [6] used a kernel density estimation algorithm. Inspired by the great success of deep neural networks in representation learning [8,9], a few deep learning-based approaches have been proposed for DVH prediction. For instance, Jiao et al. [16] utilized a generalized regression neural network [28], Cao et al. [1] adopted a gated recurrent unit-based recurrent neural network, and Ma et al. [24] used a deep convolutional neural network. Although these deep learning-based methods have achieved good results, the deep model is only used to map pre-defined features to the DVH.
Moreover, several studies compute the 2D DVH by first predicting the 3D radiotherapy dose distribution. Specifically, the inputs of the model are contoured CT images and the outputs are 3D radiotherapy dose distributions; the 2D DVH is then computed based on the mapping relationship between 2D and 3D dose distributions. For example, Gronberg et al. [11] presented a 3D densely connected U-Net with dilated convolutions to model the input contoured CT images. Kearney et al. [18] introduced a generative adversarial network (GAN) to model multimodal inputs (CT, PTV, and OARs). Although this kind of method is effective, annotating 3D dose distributions is very difficult. To the best of our knowledge, no deep learning method learns the optimal feature representation for 2D DVH prediction directly from the original image data.
To address the above issues, we propose a novel Spatial Geometric-Encoding Network (SGEN), which incorporates 3D spatial information into a 2D CNN and predicts the DVH for esophageal radiation treatments effectively and accurately in a unified framework. To achieve this goal, three types of image data are used as inputs to the model: 3D CT scans, 3D PTV scans and 3D distance images. These three kinds of image data complement each other in describing the geometric relationship between the PTV and OARs. Meanwhile, to avoid the high computational cost of applying 3D CNNs to 3D inputs, the dilation convolution based multi-scale concurrent spatial and channel squeeze & excitation (msc-SE) structure is integrated into our network, which speeds up training while enhancing the representation of multi-scale spatial geometric information. In addition, the proposed method is compared with conventional DVH prediction methods in terms of the mean absolute error (MAE) of the DVH for different OARs on 200 intensity-modulated radiation therapy (IMRT) esophageal treatment plans.
The main contributions of this work can be summarized as follows:
A spatial geometric-encoding network based on CNN is proposed to predict DVH of OARs for esophageal radiation treatments. It can effectively learn the mapping from the multi-view data to DVH in an end-to-end manner.
A dilation convolution based multi-scale concurrent spatial and channel squeeze & excitation (msc-SE) structure is proposed, which not only maintains comprehensive spatial information at a lower computational cost but also extracts the features of organs at different scales effectively.
Extensive experiments have been conducted on an esophageal cancer benchmark dataset. The results demonstrate the effectiveness of the proposed method.
We note that Liu et al. [22] also applied a CNN to DVH prediction by handling the contours of the PTV and OARs directly in an end-to-end way. The main differences between our method and [22] are three-fold. First, the task is different: [22] focused on nasopharyngeal cancer, while our method is applied to esophageal cancer. Second, the input is different: we consider three types of images (3D CT scans, 3D PTV scans and 3D distance images) as input simultaneously, while [22] only models the organ contour image. Finally, the structure of the CNN is different: [22] proposed a general connected residual deconvolution network, while our backbone adopts the dilation convolution based multi-scale concurrent spatial and channel squeeze & excitation (msc-SE) structure, which gives the proposed network a better ability to extract spatial geometric features.
The remainder of this paper is organized as follows. We present the details of the proposed methods in Section 2. Section 3 provides the experimental results and analysis. Finally, we conclude the paper in Section 4.
The proposed method
In this section, we first introduce the input and output of the proposed method. Then, we present the model details. Finally, we provide the implementation details of the proposed method.

Examples of input data: (a) computed tomography, (b) binarized PTV scan, (c) heat map of distance image.
To provide detailed spatial distance features from each OAR to the PTV, we calculated distance images from the binarized PTV scans; these images give the shortest distance from each voxel to the boundary of the PTV. As shown in Fig. 1, to make full use of the complementary characteristics of the different types of data, CT scans, PTV scans and distance images are used as the input of the proposed SGEN. The size of each image is
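As a minimal sketch of how such a distance image could be derived from a binarized PTV scan, the following dependency-free Python computes, for every voxel, the Euclidean distance to the nearest PTV boundary voxel. The function names, the 6-connectivity boundary definition, the unsigned distance, and the `spacing` parameter are assumptions made for illustration; the brute-force search is only practical for small volumes, and a library distance transform would normally be used instead.

```python
import math
from itertools import product

# Six face-connected neighbour offsets (z, y, x).
_NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def ptv_boundary(mask):
    """Return (z, y, x) of PTV voxels with at least one face neighbour outside the PTV.

    Neighbours that fall outside the volume are treated as non-PTV (assumption).
    """
    nz, ny, nx = len(mask), len(mask[0]), len(mask[0][0])
    boundary = []
    for z, y, x in product(range(nz), range(ny), range(nx)):
        if not mask[z][y][x]:
            continue
        for dz, dy, dx in _NEIGHBOURS:
            k, j, i = z + dz, y + dy, x + dx
            in_volume = 0 <= k < nz and 0 <= j < ny and 0 <= i < nx
            if not in_volume or not mask[k][j][i]:
                boundary.append((z, y, x))
                break
    return boundary


def distance_image(mask, spacing=(1.0, 1.0, 1.0)):
    """Unsigned Euclidean distance from every voxel to the nearest PTV boundary voxel."""
    boundary = ptv_boundary(mask)
    if not boundary:
        raise ValueError("mask contains no PTV voxels")
    nz, ny, nx = len(mask), len(mask[0]), len(mask[0][0])
    sz, sy, sx = spacing
    dist = [[[0.0] * nx for _ in range(ny)] for _ in range(nz)]
    for z, y, x in product(range(nz), range(ny), range(nx)):
        dist[z][y][x] = min(
            math.sqrt(((z - bz) * sz) ** 2 + ((y - by) * sy) ** 2 + ((x - bx) * sx) ** 2)
            for bz, by, bx in boundary
        )
    return dist


# Tiny example: a single PTV voxel in the middle of a 1 x 1 x 5 volume.
example = distance_image([[[0, 0, 1, 0, 0]]])
```

Whether the distance inside the PTV should be zero, positive, or signed is a design choice that this sketch does not settle.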
Architecture of the model
Overview
The general framework of the proposed SGEN is shown in Fig. 2. To obtain a more comprehensive geometric feature representation, 3D CT scans, 3D PTV scans and 3D distance images are fed into a 2D CNN based feature extractor. For each input, the 2D CNN has the same structure but separate parameters, so that the output features of the different 2D CNNs lie in a common space. The features of the different inputs are concatenated to obtain a high-level representation. To avoid overfitting, an autoencoder is introduced to compress the original DVH feature vector. A fully connected layer then predicts the compressed DVH feature vector, rather than the original DVH feature vector, from the high-level representation. Finally, the decoder of the pre-trained autoencoder recovers the original DVH feature vector from the compressed one. The framework thus includes two sub-networks, a 2D CNN based feature extractor and an autoencoder, which are detailed below.

The general framework of the proposed SGEN method. 3D CT scans, 3D PTV scans and 3D distance images are fed into a 2D CNN based feature extractor. For each input, the structure of the 2D CNN is the same but the parameters are different. The features of the different inputs are concatenated to obtain a high-level representation. Then, a fully connected layer is used to predict the compressed DVH feature vector. Finally, the decoder of a pre-trained autoencoder is used to recover the original DVH feature vector from the compressed DVH feature vector.
The backbone of the proposed 2D CNN based feature extractor is a four-layer DenseNet-like network [14]. As illustrated in Fig. 2, each layer includes a dense block, a transition block and an msc-SE module.
Dense block As shown in Fig. 3, each dense block includes two
Transition block The transition block includes a

Overview of a branch of the dense block.

Illustration of the proposed msc-SE module. It contains s-SE (spatial squeezing-and-excitation), c-SE (channel squeeze-and-excitation) and dilation convolution based multi-scale feature fusion blocks.
msc-SE module The biggest difference between our feature extractor and the original DenseNet is the msc-SE module, which is proposed to obtain more representative features. In medical image processing with 3D input, a 3D CNN module is often used to extract image features. However, training a 3D CNN is difficult because of the large computational cost. Instead of directly implementing a 3D CNN module, the spatial and channel squeeze-and-excitation (sc-SE) module [27] was developed, which can acquire intrinsic spatial geometric information at a lower computational cost. Moreover, because different organs have different scales, a good feature representation should be scale-invariant. To fuse both spatial-wise and channel-wise information within local receptive fields and to handle different organ scales, the msc-SE (dilation convolution based multi-scale concurrent spatial and channel squeeze & excitation) module is proposed and applied to each layer. As shown in Fig. 4, the proposed msc-SE has three components: s-SE, c-SE and multi-scale fusion blocks.
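To make the two recalibration paths concrete, the following is a rough, dependency-free sketch of a concurrent spatial and channel squeeze-and-excitation step on a small feature tensor stored as nested lists of shape [channels][height][width]. It illustrates only the general sc-SE idea: the weight names (`w1`, `w2`, `w_spatial`), the omission of biases, and the combination of the two paths by element-wise addition are assumptions, and the dilation-based multi-scale fusion block of msc-SE is not included.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def channel_se(u, w1, w2):
    """c-SE path: squeeze each channel to its mean, then gate whole channels.

    w1 has shape [hidden][channels], w2 has shape [channels][hidden].
    """
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in u]
    hidden = [max(0.0, sum(w * z for w, z in zip(row, squeezed))) for row in w1]  # ReLU
    gates = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w2]
    return [[[g * v for v in row] for row in ch] for g, ch in zip(gates, u)]


def spatial_se(u, w_spatial):
    """s-SE path: a 1x1 convolution over channels yields one gate per location."""
    height, width = len(u[0]), len(u[0][0])
    gate = [
        [sigmoid(sum(w_spatial[c] * u[c][i][j] for c in range(len(u)))) for j in range(width)]
        for i in range(height)
    ]
    return [[[gate[i][j] * ch[i][j] for j in range(width)] for i in range(height)] for ch in u]


def sc_se(u, w1, w2, w_spatial):
    """Combine both paths by element-wise addition (an assumed combination rule)."""
    c_out = channel_se(u, w1, w2)
    s_out = spatial_se(u, w_spatial)
    return [
        [[a + b for a, b in zip(row_c, row_s)] for row_c, row_s in zip(ch_c, ch_s)]
        for ch_c, ch_s in zip(c_out, s_out)
    ]


# Two channels of size 2 x 2 with hypothetical weights (hidden size 1).
features = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, -1.0], [2.0, 5.0]]]
recalibrated = sc_se(features, w1=[[0.5, -0.5]], w2=[[1.0], [-1.0]], w_spatial=[0.3, 0.7])
```

Because both gates are computed with pooling and 1x1 operations, the recalibration adds little cost on top of the 2D convolutions that produce the feature maps.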
The spatial squeezing-and-excitation (s-SE) block was originally proposed for 2D image segmentation [7]. The s-SE block concentrates on the important spatial locations and ignores the irrelevant ones. As illustrated in Fig. 4, in the s-SE block, a series of feature maps generated after the dense and transition blocks can be converted to
The channel squeezing-and-excitation (c-SE) block was originally proposed for 2D image classification [13]. The c-SE block models channel inter-dependencies by emphasizing important channels while suppressing less important ones. For input feature
The
On the basis of
The DVH is a two-dimensional plot that gives the volume fraction of an OAR receiving at least a given dose. In our work, 100 points are sampled evenly from each cumulative DVH, and each point consists of a volume fraction value and a dose value. A 100-dimensional volume feature vector
Autoencoders are well suited to nonlinear dimensionality reduction [29]. The autoencoder used in our work is shown in Fig. 5. Its encoder and decoder are structurally symmetric: the encoder is composed of seven encoding layers and the decoder of seven decoding layers. By pre-training each layer, the autoencoder reduces a 100-dimensional DVH feature vector to a 5-dimensional one (the dimension was determined experimentally).
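For illustration, a cumulative DVH sampled at evenly spaced dose levels could be computed as in the sketch below. The function name, the choice of sampling dose levels evenly between zero and a given maximum dose, and the equal-voxel-volume assumption are hypothetical and are not taken from the description above.

```python
def cumulative_dvh(doses, max_dose, n_points=100):
    """Sample a cumulative DVH for one organ.

    doses: per-voxel doses inside the organ (all voxels assumed to have equal volume).
    Returns (dose_levels, volume_fractions), each of length n_points, where
    volume_fractions[i] is the fraction of voxels receiving at least dose_levels[i].
    """
    if not doses:
        raise ValueError("the organ contains no voxels")
    if n_points < 2:
        raise ValueError("n_points must be at least 2")
    step = max_dose / (n_points - 1)
    levels = [i * step for i in range(n_points)]
    total = len(doses)
    fractions = [sum(1 for d in doses if d >= level) / total for level in levels]
    return levels, fractions


# Four voxels sampled at four dose levels (0, 10, 20, 30).
levels, fractions = cumulative_dvh([0.0, 10.0, 20.0, 30.0], max_dose=30.0, n_points=4)
# fractions == [1.0, 0.75, 0.5, 0.25]
```

With the default `n_points=100`, the list of volume fractions has the same length as the 100-dimensional feature vector described above.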

Overview of the proposed autoencoder structure. The blue box represents the 100-dimensional feature vector as the original input, the coding layer is represented by a gray box, and the decoding layer is represented by a black box. The number below the box represents the dimension of the feature vector in each layer, and the orange box represents a 100-dimensional DVH feature vector reconstructed by the decoder.

Flowchart of the implementation.
The flowchart of the implementation is shown in Fig. 6. In the training stage, we first use the DVH vectors to train the autoencoder, which yields the DVH encoder and decoder. The encoder is then used to compute the compressed DVH vector corresponding to each original DVH vector. Finally, the 3D CT scans, 3D PTV scans and 3D distance images are used as data and their corresponding compressed DVH vectors as labels to train the DVH prediction model. The whole SGEN model includes both the DVH autoencoder and the prediction model. In the test stage, the 3D CT scans, 3D PTV scans and 3D distance images are first fed into the pre-trained DVH prediction model to predict the compressed DVH vector. This compressed DVH vector is then fed into the pre-trained DVH decoder to obtain the final predicted DVH vector. The MAE is calculated between the predicted DVH vectors and their corresponding ground-truth DVH vectors.
To demonstrate the effectiveness of the proposed method, we conduct extensive experiments on our constructed dataset. We first compare the proposed SGEN method with three typical baselines to evaluate its performance. We then provide further analysis of the SGEN method, including a convergence investigation and the impact of the different components of our framework. It is worth noting that the purpose of this section is to demonstrate the effectiveness of the proposed SGEN rather than to achieve state-of-the-art results by all means.
Dataset and settings
A total of 200 esophageal IMRT treatment plans were collected and used for the evaluation. All IMRT plans were clinically approved for esophageal radiotherapy. All plans used the same setting (6 MV photon beam with gantry angles
The entire network is trained in PyTorch on an NVIDIA V100 GPU. The initial learning rate is set to 0.001 and Adam [19] is used as the optimizer. Training is considered finished when the training-set error has not decreased for 10 epochs.
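The stopping rule can be sketched as a simple patience counter. The function below is a hypothetical illustration that reads "does not decrease" as "no new minimum of the training error"; it operates on a recorded list of per-epoch errors instead of a live training loop.

```python
def stopping_epoch(train_errors, patience=10):
    """Return the 1-based epoch at which training would stop, or None.

    Training stops once the training error has failed to fall below its best
    value so far for `patience` consecutive epochs.
    """
    best = float("inf")
    stale_epochs = 0
    for epoch, error in enumerate(train_errors, start=1):
        if error < best:
            best = error
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                return epoch
    return None


# The error improves for three epochs and then plateaus for ten more.
errors = [1.0, 0.9, 0.8] + [0.8] * 10
stop = stopping_epoch(errors)  # stops at epoch 13
```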
Evaluation metric
Following the ICRU 83 report [10], the evaluated indexes include the average dose, fractional lung volumes (V5, V10, V20, V30), fractional heart volumes (V30, V35), the maximal dose to the spinal cord (D2), etc. Five-fold cross-validation was used for performance evaluation: the 200 plans were divided into 5 groups, and in each experiment one group was taken as test data while the other four groups were used as training data. A total of 5 experiments were carried out, and the mean value and standard deviation were calculated as the final experimental results. The MAE of the endpoints sampled from DVHs was used to evaluate the accuracy of the prediction models:
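As a rough illustration of this evaluation protocol, the sketch below assumes the usual definition of MAE as the mean of the absolute differences between predicted and clinical values. The function names, the contiguous unshuffled fold assignment, and the use of the sample standard deviation are assumptions made for the example.

```python
import statistics


def mae(predicted, reference):
    """Mean absolute error between two equally long sequences of values."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


def k_fold_splits(n_samples, n_folds=5):
    """Yield (train_indices, test_indices) for contiguous, non-overlapping folds."""
    bounds = [n_samples * k // n_folds for k in range(n_folds + 1)]
    for k in range(n_folds):
        lo, hi = bounds[k], bounds[k + 1]
        test = list(range(lo, hi))
        train = [i for i in range(n_samples) if i < lo or i >= hi]
        yield train, test


def summarize(fold_scores):
    """Mean and sample standard deviation of the per-fold scores."""
    return statistics.mean(fold_scores), statistics.stdev(fold_scores)


# 200 plans split into five folds of 40 test plans and 160 training plans each.
splits = list(k_fold_splits(200))
```

In practice the plans would typically be shuffled before being assigned to folds.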
Comparison with baselines
To verify the effectiveness of our proposed method, we compare it with three typical baselines. The first is a traditional machine learning method, PCA+SVR [36]. The second is a deep learning method, 2D-Unet [26]; that is, 2D-Unet is used as the backbone in our SGEN. The last is 3D-Unet [2]. The performance comparisons are reported in Table 1 and Fig. 7, and the parameters of the three backbone networks used in our DVH prediction framework are reported in Table 2. From Fig. 7, Table 1 and Table 2 we can conclude that (1) the deep learning based methods are far better than the traditional method, and (2) the proposed network structure is effective.
Performance comparison of the proposed SGEN with baselines and with or without the msc-SE structure, in terms of MAE scores on the constructed benchmark dataset. The best score is shown in boldface.
Figure 8 shows the training loss curve of the proposed SGEN against the number of training epochs. The loss decreases almost monotonically and smoothly, and becomes stable after 1000 epochs, which illustrates that the proposed method can be trained efficiently by standard gradient descent.
Impact of different components
Impact of msc-SE structure
Compared with the general DenseNet, the biggest difference in our network is the msc-SE structure. To evaluate its impact, we compare our network with a feature extractor without the msc-SE module. The results are also reported in Table 1 and Fig. 7. We observe that the DVH predicted by the network with the msc-SE module is closer to the clinically approved DVH. We attribute this to two reasons: (1) the sc-SE block is able to fuse both spatial-wise and channel-wise information within a pixel-wise context, and (2) the proposed stacked dilation based multi-scale feature fusion block can handle organs at different scales.
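The effect of dilation on scale can be seen in a one-dimensional toy example. The sketch below is illustrative only: the dilation rates (1, 2, 4) and the kernel size are hypothetical and are not the settings of the msc-SE module. It shows that a dilated kernel covers a wider span with the same number of weights, and that stacking dilated layers grows the receptive field quickly.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Valid (unpadded) 1D cross-correlation with `dilation` samples between taps."""
    span = (len(kernel) - 1) * dilation + 1  # input samples covered by one output
    return [
        sum(k * signal[start + t * dilation] for t, k in enumerate(kernel))
        for start in range(len(signal) - span + 1)
    ]


def receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 convolutions with the given dilation rates."""
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field


signal = [1, 2, 3, 4, 5]
dense = dilated_conv1d(signal, [1, 1, 1], dilation=1)   # [6, 9, 12]
sparse = dilated_conv1d(signal, [1, 1, 1], dilation=2)  # [9]
field = receptive_field(3, [1, 2, 4])                   # 15
```

Three stacked 3-tap layers without dilation would only reach a receptive field of 7, which is why dilation is a cheap way to cover organs of different sizes.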
Impact of multi-view inputs
The input of the proposed SGEN combines three terms, which provide complementary information. To investigate the impact of these terms on the performance of the proposed method, we developed and evaluated six variations of SGEN: SGEN without 3D CT scans and 3D distance images (SGEN1), SGEN without 3D PTV scans and 3D distance images (SGEN2), SGEN without 3D CT scans and 3D PTV scans (SGEN3), SGEN without 3D distance images (SGEN4), SGEN without 3D CT scans (SGEN5), and SGEN without 3D PTV scans (SGEN6). The optimisation procedure for these six variations is similar to that of the proposed SGEN.
Table 3 shows the performance comparison of SGEN and its six variations on the constructed benchmark dataset. The full SGEN performs best, which indicates that all three input terms in our framework contribute to the final DVH prediction accuracy. We can also see that SGEN4, SGEN5 and SGEN6 outperform SGEN1, SGEN2 and SGEN3 by a large margin, which demonstrates that the information in CT scans, PTV scans and distance images is complementary. Based on the above analysis, we find that modeling 3D CT scans, 3D PTV scans and 3D distance images simultaneously is a valuable strategy for DVH prediction.
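The six variations are exactly the non-empty proper subsets of the three inputs, which the short sketch below makes explicit. The dictionary lists the inputs that each variation keeps, as read from the descriptions above; the identifiers are illustrative names only.

```python
from itertools import combinations

INPUTS = ("CT", "PTV", "distance")

# Inputs retained by each variation.
VARIANTS = {
    "SGEN1": ("PTV",),             # without CT and distance images
    "SGEN2": ("CT",),              # without PTV and distance images
    "SGEN3": ("distance",),        # without CT and PTV scans
    "SGEN4": ("CT", "PTV"),        # without distance images
    "SGEN5": ("PTV", "distance"),  # without CT scans
    "SGEN6": ("CT", "distance"),   # without PTV scans
}


def proper_subsets(inputs=INPUTS):
    """All non-empty subsets that leave out at least one input, smallest first."""
    return [
        combo
        for size in range(1, len(inputs))
        for combo in combinations(inputs, size)
    ]


# The ablation covers every single-input and every two-input configuration.
covered = {frozenset(kept) for kept in VARIANTS.values()}
complete = covered == {frozenset(s) for s in proper_subsets()}
```

SGEN1 to SGEN3 are the single-input models and SGEN4 to SGEN6 the two-input models, which is the grouping used when the results are compared.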

Comparison of clinical DVHs and predicted DVHs of (a) left lung, (b) right lung, (c) heart, and (d) spinal cord by different models. The solid black line represents the DVH calculated from the clinically approved plan, while the dashed lines in different colors are calculated from the different prediction models.
Beam angle information is not modeled in the proposed SGEN. In IMRT, the number of radiation fields and the gantry angles also affect the dose distribution of a radiotherapy plan. For example, two organs at risk may lie at the same spatial distance from the target area yet receive different dose distributions, because one organ may be in the beam and the other out of it. This means that the beam angle plays an important role in modeling the dose-distance relationship. In the future, we will explore how to use beam angle information to achieve more accurate DVH prediction.
The parameters of three backbone networks used in our DVH prediction framework

The training loss curves of SGEN versus the different number of training epochs on the constructed dataset.
Performance comparison of the proposed SGEN and its six variations in terms of MAE scores on the constructed benchmark dataset. The best score is shown in boldface.
In this paper, we proposed a new approach (SGEN) for DVH prediction of OARs. The proposed method handles 3D CT scans, 3D PTV scans and 3D distance images directly and simultaneously. Meanwhile, a dilation convolution based multi-scale concurrent spatial and channel squeeze & excitation (msc-SE) structure and an autoencoder structure are presented, which enable our method to extract the spatial geometric features of the PTV and OARs quickly and accurately. Extensive experimental results on the constructed esophageal benchmark dataset, together with a comprehensive analysis, demonstrate the effectiveness of the proposed DVH prediction strategy and its superior performance compared with typical baselines.
Acknowledgements
This work was partially supported by the National Natural Science Foundation of China (No. 61902104, 11975312), the Anhui Provincial Natural Science Foundation (No. 2008085QF295, 1908085J25, 1808085MF209) and the University Natural Science Research Project of Anhui Province (No. KJ2020A0651).
Conflict of interest
The authors have no conflict of interest to report.
