Abstract
BACKGROUND:
Automatic segmentation of individual tooth roots is a key step in reconstructing three-dimensional dental models from cone beam computed tomography (CBCT) images, which is of great significance for orthodontic, implant, and other dental diagnosis and treatment planning.
OBJECTIVES:
Currently, tooth root segmentation is mainly done manually because the tooth root and the alveolar bone have similar gray levels in CBCT images. This study aims to explore a deep learning based algorithm for automatic tooth root segmentation in CBCT axial image sequences.
METHODS:
We proposed a new automatic tooth root segmentation method based on a deep learning U-Net with attention gates (AGs). Because adjacent slices in a CBCT sequence are strongly correlated, a recurrent neural network (RNN) was applied to extract both intra-slice and inter-slice context. To develop and test the method, 24 sets of CBCT sequences containing 1160 images were used to train the network, and 5 sets of CBCT sequences containing 361 images were used to test it.
RESULTS:
On the testing dataset, the segmentation accuracy measured by intersection over union (IOU), Dice similarity coefficient (DICE), average precision rate (APR), average recall rate (ARR), and average symmetric surface distance (ASSD) was 0.914, 0.955, 95.8%, 95.3%, and 0.145 mm, respectively.
CONCLUSIONS:
The study demonstrates that the new method, which combines an attention U-Net with an RNN, yields promising results for automatic tooth root segmentation and has the potential to improve segmentation efficiency and accuracy in future clinical practice.
Introduction
CBCT is a non-invasive, low-radiation imaging technique that is widely used in dental diagnosis and treatment [1, 2]. Automatic tooth segmentation in CBCT images is of great significance: it is a prerequisite for tasks such as tooth feature extraction, tooth recognition, tooth measurement, and tooth layout, which in turn can be built into computer-aided orthodontic systems, dental implant systems, and dental surgery navigation.
The anatomical structure of a tooth is shown in Fig. 1. Within tooth segmentation, automatic segmentation of the root is more challenging than that of the crown, partly because of the accuracy that medical use requires. Root information nevertheless has important practical value: obtaining it before dental treatment can improve the success rate. H. Watanabe et al. [3] studied the correlation of the miniscrew failure rate with root proximity, insertion angle, bone contact length, and bone density, and their results showed that root proximity was the factor that most affected miniscrew failure. Shingo Kuroda et al. [4] reached the same conclusion. Wook-Jae Yoon et al. [5] found that damage to natural adjacent teeth during implant placement may adversely affect those teeth and may contribute to implant failure, and they recommended measuring the slope of the adjacent teeth accurately before surgery. In orthodontic treatment, the shape and location of the roots are important factors to consider in order to prevent root resorption. Masato Nishioka et al. [6] and Glenn T. Sameshima et al. [7] concluded that abnormal root shape was a significant factor in root resorption, and Glenn T. Sameshima et al. [7] also found that root length was correlated with root resorption.

Sagittal (a) and axial (b) views of a CBCT scan, and the anatomical structure of a tooth.
At present, tooth root segmentation is mainly done manually in clinical practice: the teeth are first found by threshold segmentation, and a technician then interactively removes the noise and the connected regions between adjacent teeth to obtain the complete boundary of each single root. The whole process takes the dental technician a considerable amount of time because each patient has dozens of teeth and each tooth spans dozens of images.
Root segmentation is difficult for three main reasons. First, CBCT has poor contrast and a low signal-to-noise ratio. Second, the root edges are blurred and weak because the tooth roots and the neighboring alveolar bone have similar intensity. Third, adjacent teeth are very close together and the structure of the root is complex.
In the literature, tooth root segmentation methods for CBCT images fall into two categories: traditional methods and deep learning algorithms. Several traditional methods have been developed, most of them based on level sets. In 2010, H. Gao et al. [8] proposed an improved level set with shape and intensity priors, using a single level set to track the tooth root contour. In 2015, Gan et al. [9] applied a hybrid level set model to segment teeth in CBCT images, and in 2018 they [10] used a global convex level set model to extract the connected region of the tooth and alveolar bone, and then separated the teeth from the alveolar bone with the Radon transform and a local level set model. Yuanjun Wang et al. [11] (2019) proposed another improved level set method: they used Gan’s method for the tooth crowns and, for the roots, enhanced the contrast between the sockets and the tooth roots with a linear gray-level stretch and then applied a narrow-band approach to control the update of the level set function.
Although these level set based methods performed well on their CBCT images, they required the user to manually choose the starting slice and select a seed point for each tooth in that slice. Moreover, because the roots and sockets have similar contrast in CBCT images, it is difficult to design an energy function that controls the number of iterations, so the number of iterations has to be adjusted manually for each tooth.
The watershed algorithm has also been used for teeth segmentation. Somayeh et al. [12] (2018) first chose an appropriate image from the CBCT slices, then used morphological operations and the Canny edge detector to obtain tooth markers, and finally applied a marker-controlled watershed to obtain the segmentation results. However, the boundary obtained by the Canny detector included not only the root but also the cranium and alveolar bone. In general, it is difficult for traditional algorithms to achieve accurate root segmentation automatically.
Automatic segmentation methods based on deep learning, such as fully convolutional networks (FCN) and U-Net [13, 14], have developed rapidly in recent years. For CBCT image segmentation, Jun Ma et al. [15] (2019) segmented tooth roots automatically with a CNN combined with a level set method and achieved a Dice coefficient of 0.88. Miao Gou et al. [16] (2019) used U-Net and a level set method to segment teeth in CT images and obtained 60% accuracy. For CBCT volumetric segmentation, P. Macho et al. [17] (2018) used two similar 3D CNNs based on V-Net, a low-resolution model and a high-resolution model. The low-resolution model located the region of interest (ROI) of the teeth in the input 3D CBCT cephalic samples; the high-resolution model was then applied to the output of the low-resolution model to produce a fine-grained segmentation, with an average Dice coefficient of 92%. M. Ezhov et al. [18] (2019) used V-Net for coarse-to-fine volumetric segmentation and reported a binary voxel-wise intersection over union (IoU) of 0.94 between the ground truth volumetric mask and the model prediction. U-Net is an effective segmentation network, and its results can be improved further by modifying the architecture. Ozan Oktay et al. [19] (2018) proposed adding attention gates (AGs) to U-Net models; the AGs learn to focus on the region of interest and suppress irrelevant regions, and the resulting model performed better than U-Net on segmentation tasks.
Our work focused on automatic segmentation of tooth roots across the entire CBCT image sequence, including incisors, canines, and molars, a task that few works have addressed. We used the U-Net with AGs as the basic framework and combined it with a recurrent neural network, because sequential images carry context information. We tested the proposed network on the entire CBCT axial image sequences of the test patients.
Attention U-Net
In this paper, the Attention U-Net [19] (AttU-Net) was adopted as the basic framework for CBCT tooth root segmentation. The Attention U-Net has a structure similar to U-Net [20], with four down-sampling steps followed by four up-sampling steps, and attention gates that connect the up-sampled feature maps with the corresponding skip feature maps. The Attention U-Net that we used is shown in Fig. 2. The input and output images were 512×512, and all convolution kernels were 3×3 except the final 1×1 convolution. The attention gate operation is shown in Fig. 3. The attention map weights the feature map: during training, it assigns a larger weight to the target region and a smaller weight to the background region. For tooth root segmentation, the attention unit therefore keeps the network focused on the tooth root region.
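For reference, the additive attention gate described by Oktay et al. [19] can be summarized as below. This is a sketch of that published formulation, not a statement of the exact configuration used here. The symbols are introduced for illustration: x_i is the skip feature vector at pixel i, g_i is the gating signal from the coarser decoder level, W_x, W_g and ψ are 1×1 convolution weights, σ_1 is ReLU and σ_2 is the sigmoid.

```latex
\begin{aligned}
q_i &= \psi^{\top}\,\sigma_1\!\left(W_x^{\top} x_i + W_g^{\top} g_i + b_g\right) + b_\psi,\\
\alpha_i &= \sigma_2(q_i), \qquad \alpha_i \in [0, 1],\\
\hat{x}_i &= \alpha_i \, x_i .
\end{aligned}
```

The coefficient α_i is the attention weight: values near 1 pass the skip feature through, and values near 0 suppress it before it is concatenated with the up-sampled features.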

The Attention U-net structure.

Attention Gate operation.
AttU-Net is a 2D network that extracts intra-slice context but no inter-slice information. A bi-directional convolutional LSTM (BDC-LSTM) can distil 3D context from the features extracted by a 2D segmentation network. Because the shape and position of a tooth root change little between adjacent slices, we applied BDC-LSTM after AttU-Net to extract the inter-slice information of the tooth root sequence in the CBCT images.
Recurrent neural networks are effective models for processing sequential data, especially the long short-term memory (LSTM) network [21], which maintains a self-connected internal state that acts as “memory.” Its input, forget, and output gates mitigate the vanishing gradient problem when information is propagated over long sequences. This ability allows the LSTM to attain strong performance in analyzing sequential data. The formulation of LSTM is defined as follows:
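A common way to write these gate updates is given below. It assumes the variant without peephole connections and a conventional notation: x_t is the input, h_t the hidden state, c_t the cell state, σ the logistic sigmoid, and ⊙ the element-wise product. The weight and bias names are illustrative.

```latex
\begin{aligned}
i_t &= \sigma\left(W_{xi} x_t + W_{hi} h_{t-1} + b_i\right),\\
f_t &= \sigma\left(W_{xf} x_t + W_{hf} h_{t-1} + b_f\right),\\
o_t &= \sigma\left(W_{xo} x_t + W_{ho} h_{t-1} + b_o\right),\\
\tilde{c}_t &= \tanh\left(W_{xc} x_t + W_{hc} h_{t-1} + b_c\right),\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t,\\
h_t &= o_t \odot \tanh(c_t).
\end{aligned}
```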
The convolutional LSTM (C-LSTM) [22] was designed to analyze multi-dimensional images across the temporal domain, replacing the vector products of the LSTM with convolution operators. The C-LSTM is formulated as follows:
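In the same assumed notation as a standard LSTM without peephole connections, the convolutional variant can be sketched by replacing each matrix product with a convolution, written ∗. Here x_t, h_t and c_t are feature maps rather than vectors, and the kernel names are illustrative.

```latex
\begin{aligned}
i_t &= \sigma\left(W_{xi} * x_t + W_{hi} * h_{t-1} + b_i\right),\\
f_t &= \sigma\left(W_{xf} * x_t + W_{hf} * h_{t-1} + b_f\right),\\
o_t &= \sigma\left(W_{xo} * x_t + W_{ho} * h_{t-1} + b_o\right),\\
\tilde{c}_t &= \tanh\left(W_{xc} * x_t + W_{hc} * h_{t-1} + b_c\right),\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t,\\
h_t &= o_t \odot \tanh(c_t).
\end{aligned}
```

Because the gates are computed by convolution, the spatial layout of the feature maps is preserved from one step to the next.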

The C-LSTM block structure. c_t and h_t are the cell state and hidden state at time t; they are passed on through the C-LSTM cells.
The bi-directional C-LSTM [23] (BDC-LSTM) extracts information from neighboring images in two opposite directions. It extends the LSTM by placing two C-LSTM layers in opposite directions, one covering the image before the current image and the other the image after it, and then concatenating their feature maps. The detailed structure is shown in Fig. 5. The BDC-LSTM was applied to the feature maps x_{t-1}, x_t, and x_{t+1}, which were treated as inputs along the temporal domain. These feature maps were extracted from the penultimate layer of the AttU-Net; the output features of the two directions were concatenated into 32 channels, and one convolutional layer then produced the segmentation result. The BDC-LSTM encoded the input image sequence and then decoded it to produce the output, which was the segmentation result of the input image at time t.
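One way to summarize this bi-directional arrangement is sketched below. The two operators CLSTM with arrows are hypothetical names for the forward and backward C-LSTM layers, [ · ; · ] denotes channel-wise concatenation, and the sigmoid output activation is an assumption made for a binary mask rather than a detail stated in the text.

```latex
\begin{aligned}
\overrightarrow{h}_t &= \mathrm{CLSTM}_{\rightarrow}\left(x_{t-1},\, x_t\right),\\
\overleftarrow{h}_t &= \mathrm{CLSTM}_{\leftarrow}\left(x_{t+1},\, x_t\right),\\
\hat{y}_t &= \sigma\!\left(W_{\mathrm{out}} * \left[\overrightarrow{h}_t \,;\, \overleftarrow{h}_t\right] + b_{\mathrm{out}}\right).
\end{aligned}
```

Each direction processes its neighboring slice first and the current slice last, so both hidden states describe slice t while carrying context from one side.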

The BDC-LSTM structure. x_{t-1}, x_t, and x_{t+1} are the input feature maps along the temporal domain.
The BDC-LSTM network was combined with the Attention U-Net as shown in Fig. 6. The images I_{t-1}, I_t, and I_{t+1} were each passed through the AttU-Net to obtain 32 feature maps per image; these feature maps were then fed to the BDC-LSTM, which output the segmentation result for the image I_t.
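As an illustration of how consecutive slices might be grouped before being passed through such a network, the sketch below builds one (I_{t-1}, I_t, I_{t+1}) triplet per slice. The function name `slice_triplets` is hypothetical, and the handling of the first and last slice (repeating the boundary slice) is an assumption, since the text does not say how the ends of a sequence are treated.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def slice_triplets(slices: Sequence[T]) -> List[Tuple[T, T, T]]:
    """Return (previous, current, next) for every slice in an axial sequence.

    Assumption: at either end of the sequence the missing neighbour is
    replaced by the boundary slice itself (edge replication).
    """
    n = len(slices)
    triplets: List[Tuple[T, T, T]] = []
    for t in range(n):
        prev_slice = slices[max(t - 1, 0)]      # clamp at the first slice
        next_slice = slices[min(t + 1, n - 1)]  # clamp at the last slice
        triplets.append((prev_slice, slices[t], next_slice))
    return triplets
```

Each triplet yields exactly one output mask, for the middle slice, so a sequence of n slices produces n predictions.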

The AttU-Net + BDC-LSTM network structure. The 32 feature maps of each of the images I_{t-1}, I_t, and I_{t+1} were input into the BDC-LSTM structure.
Our framework was implemented in PyTorch 0.4.1. The experiments were run on a Windows 10 platform with a 12GB NVIDIA GTX 1080 Ti GPU, using CUDA 9.0 and cuDNN v7 for GPU acceleration. The research data, comprising the original CBCT image sequences and the tooth labels drawn by a dental technician, were provided by Shanghai JG Digital Orthodontics Technology CO. LTD.
Data distribution
Twenty-four patient cases containing 1160 images formed the training dataset, and 5 cases with 361 images were selected as the test dataset. The test dataset covered the same types of images as the training dataset, with varying sharpness, contrast, and tooth shapes, in order to test automatic segmentation performance. The CBCT images were divided into three classes, as shown in Fig. 7: (a) only the roots in the maxilla, (b) both the roots and crowns, and (c) only the roots in the mandible. All three kinds of images occur in every case and have different characteristics: (a) the molar roots in the maxilla have three branches and a complex topological structure; (b) the crowns are not surrounded by alveolar bone and have higher contrast with soft tissue; (c) the roots in the mandible have similar contrast to the surrounding structures. The number of images in each class is shown in Table 1.

Tooth sequence classification shown on the CBCT sagittal plane.
Number of images in each of the three classes
The original images had different sizes (from 368×368 to 776×776), and the regions of interest were all located in the upper-middle part of the image. Images larger than 512×512 were cropped to 512×512 by removing pixels from the left, right, and bottom; images smaller than 512×512 were padded to 512×512 by adding zero-valued pixels to the left, right, and bottom.
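The cropping and padding rule can be sketched as follows, using plain nested lists so the example has no dependencies. The function name `fit_to_square` is hypothetical. The even split of the horizontal margin between left and right, with any odd pixel going to the right, is an assumption; the text states only that the left, right, and bottom are modified, which keeps the top rows in place.

```python
from typing import List

Image = List[List[int]]


def fit_to_square(image: Image, size: int = 512) -> Image:
    """Crop or zero-pad a 2D image to size x size, keeping the top rows fixed."""
    width = len(image[0]) if image else 0
    out: Image = []
    # Vertical crop: keep only the top `size` rows (extra rows at the bottom are dropped).
    for row in image[:size]:
        if width > size:
            # Horizontal crop: remove an equal margin on the left and right.
            left = (width - size) // 2
            new_row = list(row[left:left + size])
        else:
            # Horizontal pad: add zeros on the left and right.
            left = (size - width) // 2
            new_row = [0] * left + list(row) + [0] * (size - width - left)
        out.append(new_row)
    # Vertical pad: append zero rows at the bottom until the image is square.
    out.extend([0] * size for _ in range(size - len(out)))
    return out
```

Keeping the top edge fixed matches the observation that the region of interest lies in the upper-middle part of the image.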
Optimizer and loss function

Segmentation results of the AttU-Net + BDC-LSTM neural network (yellow lines) and the ground truth (red lines).
Adam was used as the optimizer, with β1 = 0.9 and β2 = 0.99. The learning rate was first set to 1e-4 and was changed to 1e-5 after 100 epochs.
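The step change in learning rate can be expressed as a small helper. This is a sketch only: the function name `learning_rate` is hypothetical, and treating epochs as zero-based, so that epochs 0 to 99 use the initial rate, is an assumption.

```python
def learning_rate(epoch: int,
                  initial_lr: float = 1e-4,
                  reduced_lr: float = 1e-5,
                  switch_epoch: int = 100) -> float:
    """Return the learning rate for a zero-based epoch index.

    The first `switch_epoch` epochs use `initial_lr`; later epochs use `reduced_lr`.
    """
    return initial_lr if epoch < switch_epoch else reduced_lr
```

In PyTorch, the stated momentum parameters would typically be passed as `torch.optim.Adam(params, lr=1e-4, betas=(0.9, 0.99))`, with the rate above written to each parameter group at the start of every epoch.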
The loss function was Binary Cross Entropy.
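For a predicted foreground probability p and a label y in {0, 1}, binary cross entropy is -[y log p + (1 - y) log(1 - p)], averaged over pixels. A minimal dependency-free sketch follows; the function name and the clamping constant `eps` are illustrative choices, not details from the paper.

```python
import math
from typing import Sequence


def binary_cross_entropy(probs: Sequence[float],
                         targets: Sequence[int],
                         eps: float = 1e-7) -> float:
    """Mean binary cross entropy over flattened pixel probabilities and 0/1 labels."""
    if len(probs) != len(targets) or not probs:
        raise ValueError("probs and targets must be non-empty and of equal length")
    total = 0.0
    for p, y in zip(probs, targets):
        # Clamp to avoid log(0) for fully confident predictions.
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)
```

A prediction of 0.5 everywhere gives a loss of ln 2, about 0.693, regardless of the labels, which is a convenient sanity check early in training.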
We first trained the AttU-Net with a learning rate of 1e-3, then fixed the weights of the trained AttU-Net and trained the BDC-LSTM with a learning rate of 1e-3. Finally, we loaded the weights of both parts and trained the whole network end to end with a learning rate of 1e-4.
Results
Qualitative analysis

(a) The original image, (b) The segmentation result, (c) The zoomed-in area of the teeth. The yellow lines are the segmentation results and the red lines are the ground truth.
Our segmentation network produced tooth segmentations that were almost identical to the ground truth. Fig. 8 shows automatic segmentation results for some of the test images; the yellow lines are the segmentation results, the red lines are the ground truth, and the two almost coincide.
Roots in the shallow maxilla have a regular shape and are separated by bone, but the alveolar bone tightly surrounds the teeth. The segmentation network located the teeth accurately and automatically and segmented them effectively, as shown in Fig. 9. The boundaries in these results were almost the same as those drawn manually.
The molar roots in the maxilla gradually split into three branches as the root deepens. The image sequence and automatic segmentation results for one molar root are shown in Fig. 10.

(a) Segmentation results for one molar root (yellow lines) and the ground truth (red lines), (b) The reconstruction of the molar root.

(a) The original images, (b) The output segmentation results, (c) The red lines are the ground truth and the yellow lines are the segmentation results.
The trained network segmented the teeth automatically, and the boundaries of the segmented teeth were very similar to the manual labels, as shown in Fig. 11.
The roots in the mandible
The segmentation results for roots in the shallow mandible are shown in Fig. 12; their contours were almost the same as those drawn manually. The segmentation result for a molar in the mandible is shown in Fig. 13.

(a) The original image of the roots in the shallow mandible. (b) The output segmentation results. (c) The red lines are the ground truth and the yellow lines are the segmentation results.

(a) The segmentation results of a molar root in the mandible with the yellow lines and the ground truth with red lines, (b) The reconstruction of the segmented molar roots.
The results of different evaluation metrics

(a) The test images with an intraoral metal implant in the teeth, (b) The segmentation results, (c) Our segmentation results (yellow lines) and the ground truth (red lines).

(a) Original image, (b) The first layer attention map of AttU_net, (c) The second layer attention map of AttU_net.

The segmentation results of U-net with green lines and the segmentation results of AttU-net with blue lines.

(a) The segmentation results of adjacent layer images by AttU-net, (b) The segmentation results of AttU-net + BDC_lstm.
To evaluate the difference between the segmentation result and the ground truth, Intersection over Union (IOU), Average Precision rate (APR), Average Recall Rate (ARR), Average Dice similarity coefficient (ADSC) and Average symmetric surface distance (ASSD) were used as evaluation metrics.
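The four overlap-based metrics can all be derived from pixel counts of true positives, false positives, and false negatives. The sketch below uses nested lists of 0/1 values. The function name `overlap_metrics` is hypothetical, and returning 1.0 when a denominator is zero (for example, when both masks are empty) is a convention assumed here. ASSD is omitted because it requires extracting boundary pixels and computing physical distances between them.

```python
from typing import Dict, Sequence

Mask = Sequence[Sequence[int]]


def overlap_metrics(pred: Mask, truth: Mask) -> Dict[str, float]:
    """Compute IoU, Dice, precision, and recall for two binary masks of equal shape."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1      # predicted tooth, labelled tooth
            elif p:
                fp += 1      # predicted tooth, labelled background
            elif t:
                fn += 1      # missed tooth pixel

    def ratio(num: int, den: int) -> float:
        # Assumed convention: an empty denominator counts as a perfect score.
        return num / den if den else 1.0

    return {
        "iou": ratio(tp, tp + fp + fn),
        "dice": ratio(2 * tp, 2 * tp + fp + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
    }
```

For a single pair of masks, Dice and IoU are related by Dice = 2·IoU / (1 + IoU). The reported averages of 0.914 and 0.955 are roughly consistent with this relation, since 2 × 0.914 / 1.914 ≈ 0.955.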

(a) The segmentation results of AttU-Net, with over-segmented roots and alveolar bone noise, (b) The segmentation results of AttU-Net + BDC-LSTM.
The teeth with metal implants
The tooth region could also be segmented accurately when the test image contained an intraoral metal implant, as shown in Fig. 14.
Attention unit
After training, the attention map had higher values in the tooth region and at the tooth boundaries, so the network could focus on the teeth, as shown in Fig. 15. The U-Net with AGs performed better on the root boundaries, as shown in Fig. 16.
BDC_lstm
A 2D segmentation network extracts only intra-slice context. In the AttU-Net results, some teeth were under-segmented in one slice while being segmented better in the adjacent slice, as shown in Fig. 17. BDC-LSTM distils 3D context from the extracted 2D context, which reduced tooth over-segmentation and alveolar bone noise, as shown in Fig. 18.
Comparison with other methods
Comparison with the CNN method
We compared the automatic tooth root segmentation results of the CNN proposed by Jun Ma et al. [15] with those of our segmentation network, as shown in Fig. 19. The CNN method could not segment the teeth completely, whereas our network did not need a level set step to adjust the tooth contours. The Dice coefficients of Jun Ma’s method and our method for all the test case sequences are shown in Table 3.

The segmentation results for the three classes of images by Jun Ma et al. (green lines) and by our method (yellow lines): (a) Roots in the mandible, (b) Both roots and crowns, (c) Roots in the maxilla.
Comparison with Evaluation metrics ADSC and ASSD
The classic U-Net results showed over-segmented and missed roots, whereas AttU-Net + BDC-LSTM performed better on these teeth, as shown in Fig. 20. For the segmentation of dental crowns, our method performed very close to U-Net. The Dice coefficients of U-Net and our method for all the test case sequences are shown in Table 3. Compared with the traditional U-Net and Jun Ma’s CNN, our method achieved better performance on the tooth roots.

(a) The segmentation results of U_net with green lines, (b) The segmentation results of ours with yellow lines.
In our experiments, the network was not effective in two cases. The first is roots deep in the maxilla and mandible: Figure 21 shows that the network missed or under-segmented small tooth roots, which resulted in a low Dice coefficient. The second is roots with contrast similar to that of the surrounding periodontal ligament: Figure 22 shows a slice from a test case in which the teeth in the brown circles are surrounded by alveolar bone of similar contrast, and the segmentation shows errors at the positions indicated by the blue arrows.

(a) The roots missed in the deep mandible, (b) The roots missed in the deep maxilla.

The teeth in the brown circles are surrounded by soft structures with similar contrast and are close together; the proposed segmentation network over-segmented these teeth at the positions indicated by the blue arrows.
This work focuses on the automatic segmentation of tooth roots in CBCT axial images with deep learning. Its main contribution is to increase the accuracy of detecting dental roots, which has the potential to reduce dental technicians’ workloads to some extent. Specifically, we constructed AttU-Net + BDC-LSTM, which applies C-LSTM within the attention U-Net structure to improve tooth root segmentation performance. However, the work still has limitations. The segmentation network did not work well on images that show tooth roots deep in the maxilla and mandible, or on images in which part of the tooth root is missing. In future work, we will explore more effective networks and increase the number of samples to improve the robustness of the network for tooth root segmentation.
Acknowledgments
This study is supported by the National Science Foundation of China Grant (No. 81301286), the Application and Basic Research project of Sichuan Province (No. 2019YJ0055), and the Enterprise Commissioned Technology Development Project of Sichuan University (No. 18H0832). Our data are provided by Niansong Ye and Zhenyan Xie of Shanghai JG Digital Orthodontics Technology CO. LTD.
