Abstract
Thyroid nodule segmentation is an indispensable part of the computer-aided diagnosis of thyroid nodules from ultrasound images. However, segmenting nodules from ultrasound images remains challenging because of low contrast, high noise, diverse appearance, and the complex structure of thyroid nodules. Proper detection of nodules therefore requires considerable clinical experience and expertise. To reduce the effort required of doctors at the diagnosis stage, we utilized several convolutional neural network architectures: an Encoder-Decoder architecture, the U-Net architecture, and the Res-UNet architecture. To handle the complexity of the residual blocks, we also proposed three hybrid Res-UNet architectures that reduce the number of residual connections. The experimental analysis of the segmentation models supports the viability of residual learning in the U-Net architecture. Hybrid models that use a minimum number of residual connections provide efficient segmentation frameworks comparable to the Res-UNet architecture at a lower computational cost. The experimental results indicate that all the segmentation models based on residual learning and U-Net can accurately delineate nodules without human intervention. These models help to reduce dependence on operators and can act as a decision tool for the radiologist.
Introduction
The thyroid gland is an endocrine gland located at the front of the neck. Because of its essential role in the human body, diagnosing and treating thyroid disease is important. Thyroid nodules are solid or cystic lumps in the thyroid gland formed by a growth of cells, and they may be benign or malignant. Nodules can be caused by many factors, including iodine deficiency, overgrowth of normal thyroid tissue, or thyroid cancer. They are common in women and in older populations. Thyroid nodules are mostly benign, with a malignancy rate of 4.5-6%. Thyroid cancer has increased rapidly over recent decades. Reports estimate that the incidence of thyroid cancer has risen faster than that of any other cancer, at 4.5% per year over the last ten years [1]. In the United States, 52,070 cases of thyroid cancer were diagnosed in 2019, of which 2,170 resulted in death [1]. The corresponding numbers in Canada are equally significant: the Canadian Cancer Statistics estimated that 8,200 Canadians (6,100 women and 2,100 men) would be diagnosed with thyroid cancer in 2019, resulting in 230 deaths [2]. In India, the incidence of thyroid cancer increased from 2.4 to 3.9 in women and from 0.9 to 1.3 in men. In most patients, the cancer is confined to the thyroid gland at the time of diagnosis [3], but 30% of patients show metastatic thyroid cancer, which affects the lungs or other organs. With accurate and timely diagnosis, the survival rate of thyroid cancer is higher than that of any other cancer [3].
Different imaging modalities like Computed Tomography, Magnetic Resonance Imaging (MRI), and Ultrasound Imaging are widely used for thyroid disease diagnosis. The most recommended thyroid nodule diagnosis approach is ultrasonography due to its features like real-time, inexpensive, non-invasive, non-radioactive, and accurate examination. Likewise, several ultrasonographic features are strongly related to the histopathological characteristics of the nodule.
However, the relatively low quality and speckle noise of ultrasound images make the organ tissues appear fuzzy and inhomogeneous. This often makes ultrasound image analysis difficult for radiologists, and the reliability of the diagnosis depends heavily on the radiologist’s experience and expertise. Due to the complex structure of the thyroid nodule, only an expert sonographer can detect it correctly.
Segmentation is a highly relevant task in medical image analysis that can facilitate computer-assisted diagnosis, interventions, and the extraction of quantitative features from ultrasound images. In thyroid nodule diagnosis, estimating size, shape, and volume is crucial for risk assessment and for deciding whether to perform a Fine Needle Aspiration Cytology (FNAC) biopsy. The features that indicate an increased risk of malignancy include hypoechoic solid composition, a taller-than-wide shape, irregular margins, extrathyroidal extension, and microcalcifications. Nodules with these features should be referred for FNAC analysis. Conversely, sub-centimeter nodules identified by ultrasound are not referred for fine needle aspiration because they are unlikely to be clinically significant malignancies. Proper segmentation of nodules from ultrasound images is essential for estimating and analyzing these features. The segmentation of thyroid nodules from ultrasound images is complex due to poor contrast between anatomical structures, the granular speckle pattern, non-uniform luminance, and noise. The lack of a clear edge between the thyroid nodule and other anatomical structures makes it challenging to extract the boundaries accurately. The wide variety of complex structures and background textures, together with significant variation in the size, shape, and intensity distribution of thyroid nodules, makes the segmentation even harder. Therefore, it is not easy to distinguish nodules from normal thyroid tissue even when they are noticeable. Manual segmentation is well known to be time-consuming, tedious, and subject to large inter-observer variability. Semi-automatic segmentation methods solve the problem only partially because some user interaction is still needed. This limits the widespread application of Computer-Aided Diagnosis (CAD) systems to thyroid ultrasound images.
Therefore, driven by clinical needs and related applications, it is imperative to develop fully automatic segmentation methods that reduce operator dependency.
Many researchers have proposed segmentation techniques to detect thyroid nodules from ultrasound images [4, 5]. These include radial basis function neural networks, variable background active contours, and localization-based active contours [6, 7]. Most of these algorithms require a manually drawn boundary, referred to as a seed, to initiate the segmentation [3]. A seeded boundary is a rough estimate of the nodule boundary drawn by a user on the B-mode image [1]. It is not easy to adapt these traditional contour-based, shape-based, and region-based methods to the thyroid nodule segmentation scenario [8, 9]. Machine learning and deep learning models have therefore been considered for detecting thyroid nodule regions. Recently, deep learning has achieved superior performance in many computer vision tasks and has proved its efficiency in various learning tasks [10, 11]. Deep learning algorithms have taken advantage of improvements in GPU computing power to build larger and more complex neural networks capable of segmenting nodule structures from thyroid ultrasound images. With their automatic feature extraction ability, deep learning techniques have proven capable where other approaches fail to reach their potential [12]. The segmentation accuracy of thyroid nodules from ultrasound images has greatly improved thanks to the ability of deep learning techniques to handle complex conditions [13, 14]. To gain this capability, however, the network typically requires many annotated cases for training [15]. Collecting such a large dataset of annotated cases is often very difficult in medical image processing.
We have utilized several deep convolutional neural network architectures for the segmentation of thyroid nodules, starting from a simple Encoder-Decoder architecture. To improve the segmentation capability, we incorporated residual connections and U-Net architectures into the segmentation model [16, 17]. However, these large-scale, high-precision models still need a long processing time, even when running on the most sophisticated modern GPUs [18]. In practice, moreover, many terminals have lower computational capability and storage capacity than advanced GPUs, making it challenging to deploy large, high-precision models on them. Efficient segmentation architectures that are low in computational cost, fast at inference, and memory friendly are therefore often required. To achieve this, we utilized hybrid segmentation architectures for the thyroid nodule segmentation scenario. In this paper, we introduce several convolutional neural network architectures based on residual learning and U-Net to segment thyroid nodules from ultrasound images. The hybrid architectures reduce the number of residual connections in the standard Res-UNet architecture without affecting the segmentation capability [18]. All these models aim to keep a similar segmentation efficiency while reducing the computational complexity and memory requirement.
The remainder of this paper is organized as follows. The proposed approach is explained in Section 2. Section 3 describes the experimental framework, and the results obtained are discussed in Section 4. Section 5 presents the conclusion.
Methodology
Here, we utilized several convolutional neural network architectures for the semantic segmentation of thyroid nodules based on residual learning and U-Net architecture. The overall architecture of the semantic segmentation system is given in Fig. 1.

Overall Architecture of the proposed system
Semantic segmentation is the process of grouping together the parts of an image that belong to the same category. It is also called pixel-level image classification: each pixel in an image is assigned to one of a set of predetermined categories. Recently, deep convolutional networks have been applied to segment images semantically. In semantic segmentation, it is essential to use low-level details while retaining high-level semantic information to obtain satisfactory results. However, training such a deep neural network is difficult, especially when only limited training samples are available.
Training a good deep learning model for medical image segmentation requires a large dataset, which is often difficult to obtain. Here, for the segmentation of thyroid nodules from ultrasound images, we tried different convolutional neural network architectures. To demonstrate the effectiveness of residual learning, we also included architectures based on the simple Encoder-Decoder and U-Net designs. Given a set of 2D ultrasound images and the location of the nodule in each image, we trained several CNN-based segmentation models to segment thyroid nodules. Each of these architectures is discussed in detail below.
The Encoder-Decoder architecture is a fully convolutional neural network that consists of a contracting path and an expansive path. It takes the whole image as input and learns high-level semantic image features through the encoder. It then gradually recovers the spatial dimension through a series of transposed convolution layers in the decoder. Finally, it predicts a pixel-wise segmentation output. The contracting path consists of several encoders, and the expansive path consists of several decoders.
The contracting layers mainly extract feature information from the high-dimensional images and are responsible for downsampling. Each convolutional block consists of two convolutional layers with ReLU (rectified linear unit) as the activation function, followed by a max-pooling layer. ReLU looks and acts like a linear function but is in fact nonlinear, which allows complex relationships in the data to be learned. It also reduces training time considerably compared with other activation functions. The following equation describes ReLU as a function of x: the output equals the input when x is positive and is 0 otherwise.
$$f(x) = \max(0, x)$$
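As a small illustration of the activation described above, ReLU and its derivative can be written in a few lines. This is only a scalar sketch; the function names are illustrative, and taking the derivative as 0 at x = 0 is a convention assumed here.

```python
def relu(x: float) -> float:
    """Rectified linear unit: passes positive inputs through, zeroes the rest."""
    return x if x > 0 else 0.0


def relu_grad(x: float) -> float:
    """Derivative of ReLU; the value at x = 0 is taken as 0 by convention here."""
    return 1.0 if x > 0 else 0.0


samples = [-2.0, -0.5, 0.0, 0.5, 2.0]
activations = [relu(v) for v in samples]
print(activations)
```

Because the derivative is exactly 1 for positive inputs, the activation itself does not shrink the gradient on active units, which is one reason it tends to train faster than saturating activations.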
Here, we consider five encoder blocks in the encoding path and five decoder blocks in the decoding path. Figure 2 represents Encoder-Decoder architecture utilized for the thyroid nodule segmentation.

Encoder-Decoder Architecture
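The effect of five encoder and five decoder blocks on the spatial resolution can be traced with simple bookkeeping. The sketch below assumes that every encoder block ends with a 2 × 2 max-pooling of stride 2 and that every decoder block doubles the side length; the 256-pixel input matches the image size used in preprocessing, while the function names are hypothetical.

```python
from typing import List


def contracting_sizes(input_size: int, num_blocks: int) -> List[int]:
    """Side length of the feature map after each encoder block.

    Assumes each block ends with a 2x2 max-pooling of stride 2.
    """
    sizes = []
    size = input_size
    for _ in range(num_blocks):
        size //= 2
        sizes.append(size)
    return sizes


def expansive_sizes(bottleneck_size: int, num_blocks: int) -> List[int]:
    """Side length after each decoder block, assuming each one doubles it."""
    sizes = []
    size = bottleneck_size
    for _ in range(num_blocks):
        size *= 2
        sizes.append(size)
    return sizes


down = contracting_sizes(256, 5)
up = expansive_sizes(down[-1], 5)
print(down)
print(up)
```

Under these assumptions the decoder restores the original 256 × 256 resolution, so the predicted mask aligns pixel for pixel with the input image.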
The Encoder-Decoder architectures can achieve satisfactory results for nodules with less complex structures. But, they were unable to achieve better segmentation results when dealing with complex nodule structures. This is because the depth of the model in the Encoder-Decoder architecture usually causes vanishing gradient problems.
Olaf Ronneberger et al. proposed the U-Net architecture, which evolved from the fully convolutional neural network and is efficient for developing a segmentation model from small datasets [21]. We believe the architecture of U-Net helps to alleviate the training problem associated with small sample sizes and the gradient flow issues in semantic segmentation. Long skip connections in U-Net provide a shortcut for gradient flow to the shallow layers. However, they do not alleviate the vanishing gradient problem in deep neural networks. The intuition is that copying low-level features to the corresponding high level creates a path for information propagation, making it much easier for the gradient to propagate between low and high levels. This eases backward propagation during training and supplements the high-level semantic features with finer low-level details.
The expansive path and the contracting path are connected to each other in U-Net. The network merges the features of corresponding encoder and decoder blocks, which supplies the boundary information that would otherwise be missing during expansion and allows edge information to be recognized successfully. The architecture that we utilized to develop the thyroid nodule segmentation model, with its encoder and decoder blocks, is given in Fig. 3. At each upsampling stage, the spatial size of the feature map is doubled while the number of feature channels is halved. Each upsampled feature map is then merged with the feature map from the corresponding downsampling stage of the contracting path, which helps to restore lost boundary information. These horizontal skip connections help the architecture to segment complex nodule structures effectively.

U-Net Architecture
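The merging of encoder and decoder features can be illustrated by tracking tensor shapes along the expansive path. This is a rough sketch under stated assumptions: the channel counts (16 to 512) are hypothetical and not taken from the paper, merging is done by channel concatenation, and the convolutions after each merge reduce the channels back to the encoder's count.

```python
from typing import List, Tuple

Shape = Tuple[int, int]  # (side length, number of channels)


def expansive_path(bottleneck: Shape, skips: List[Shape]) -> List[Tuple[int, int, int]]:
    """Trace (size, channels after concatenation, channels after convolutions).

    `skips` lists the encoder feature maps saved before each pooling,
    ordered from the shallowest to the deepest block.
    """
    size, channels = bottleneck
    trace = []
    for skip_size, skip_channels in reversed(skips):
        # Upsampling doubles the spatial size and halves the channels.
        size, channels = size * 2, channels // 2
        if size != skip_size:
            raise ValueError("skip and upsampled maps must have the same size")
        merged_channels = channels + skip_channels  # channel concatenation
        channels = skip_channels  # assumed output width of the decoder convolutions
        trace.append((size, merged_channels, channels))
    return trace


# Hypothetical channel widths for a 256 x 256 input and five encoder blocks.
skips = [(256, 16), (128, 32), (64, 64), (32, 128), (16, 256)]
trace = expansive_path((8, 512), skips)
for row in trace:
    print(row)
```

The size check makes explicit why the skip connections are "horizontal": features can only be concatenated with decoder maps at the same resolution.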
The U-Net architecture is well suited to the pixel-to-pixel prediction required in semantic segmentation, which involves smooth and gradual transitions from the image to the segmentation mask. However, it still has some challenges. Even though stacking more layers in a U-Net-based architecture can improve the model’s efficiency, deeper networks are harder to train than shallow ones: as the number of layers increases, performance saturates and then degrades rapidly. Beyond a certain depth, adding layers harms the performance of the model and leads to the vanishing gradient problem, in which the weights of the first layers are not updated correctly through backpropagation. As the error gradient is backpropagated to earlier layers, repeated multiplication makes the gradient small, which results in poor convergence of the network. This can be addressed effectively with a deep residual network built from residual blocks, which has shown improved accuracy and faster convergence on segmentation tasks. He et al. proposed this segmentation model based on residual learning and U-Net, termed Res-UNet [22, 23]. The Res-UNet network combines the residual module with the U-Net network and can effectively overcome the gradient dispersion, or vanishing gradient, problem caused by deepening the network. Each double convolution block in the U-Net architecture is replaced with a residual block. Res-UNet improves the accuracy of the segmentation network, speeds up training, and leads to faster convergence. It also mitigates the vanishing and exploding gradient problems present in deep architectures [22, 23]. The skip connections between layers make the network easier to optimize and propagate the gradient through the network [24].
The contracting path and expansive path in the Res-UNet consist of a series of stacked residual units. The basic structure of a residual block, and how it differs from a simple double convolution block, is shown in Fig. 4. A residual unit can combine Batch Normalization (BN), ReLU activations, and convolutional layers in several ways. He et al. evaluated the impact of these combinations and suggested the full pre-activation design shown in Fig. 4. We therefore used full pre-activation residual units to build our semantic segmentation model based on Res-UNet [23]. A residual unit can be generalized as:
$$y_l = h(x_l) + F(x_l, W_l), \qquad x_{l+1} = f(y_l)$$
where $x_l$ and $x_{l+1}$ are the input and output of the $l$-th residual unit, $F$ is the residual function with weights $W_l$, $h$ is the identity mapping, and $f$ is the activation function.
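The repeated-multiplication argument can be made concrete with a scalar toy model. The sketch assumes each layer contributes the same local derivative d, so a plain stack scales the gradient by d per layer, while a layer with an identity shortcut, y = x + F(x), scales it by 1 + d. The value d = 0.01 and the depth of 30 are arbitrary illustrative choices.

```python
def chained_gradient(local_derivative: float, depth: int, residual: bool) -> float:
    """Gradient magnitude after backpropagating through `depth` scalar layers.

    Without a shortcut each layer multiplies the gradient by d; with an
    identity shortcut the factor becomes 1 + d.
    """
    factor = 1.0 + local_derivative if residual else local_derivative
    grad = 1.0
    for _ in range(depth):
        grad *= factor
    return grad


plain = chained_gradient(0.01, 30, residual=False)
with_shortcut = chained_gradient(0.01, 30, residual=True)
print(plain, with_shortcut)
```

In this toy setting the plain gradient collapses toward zero while the shortcut version stays close to 1, which is the intuition behind using residual blocks in deep segmentation networks. Real networks have matrix-valued derivatives that vary per layer, so this is only a qualitative picture.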

Architecture of a Residual unit
Res-UNet combines the strengths of both U-Net and ResNet. This combination brings two advantages. First, the residual units ease the training of the network. Second, the skip connections within a residual unit and between the low and high levels of the network facilitate information propagation without degradation.
We utilized a 5-level architecture of Res-UNet for the segmentation of thyroid nodules from thyroid ultrasound images, as shown in Fig. 5. As in U-Net, the architecture of the Res-UNet network includes mainly three parts: encoding, bridge, and decoding blocks. These blocks are built with residual units, consisting of two convolution blocks and identity mapping [25]. Each convolution block includes a Batch Normalization layer, ReLU activation layer, and a convolution layer. The identity mapping connects the input and output of the unit [25]. As in previous U-Net architecture, here also we take five encoder blocks that indicate the depth of the model [22, 24].

Residual U-Net
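The structure of a residual unit described above (two convolution blocks, each applying Batch Normalization, ReLU, and a convolution, plus an identity mapping) can be sketched on a 1-D signal in plain Python. This is a toy stand-in, not the Keras implementation used in the experiments: the normalization is computed over a single signal rather than a batch, there are no learned scale or shift parameters, and all names are hypothetical.

```python
import math
from typing import List


def batch_norm(x: List[float], eps: float = 1e-5) -> List[float]:
    """Normalize to zero mean and unit variance (stand-in for a BN layer)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def relu(x: List[float]) -> List[float]:
    return [max(0.0, v) for v in x]


def conv1d_same(x: List[float], kernel: List[float]) -> List[float]:
    """1-D cross-correlation with zero padding; output length equals input length."""
    half = len(kernel) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(x):
                acc += w * x[j]
        out.append(acc)
    return out


def conv_block(x: List[float], kernel: List[float]) -> List[float]:
    """Full pre-activation ordering: BN, then ReLU, then convolution."""
    return conv1d_same(relu(batch_norm(x)), kernel)


def residual_unit(x: List[float], kernel1: List[float], kernel2: List[float]) -> List[float]:
    """Two pre-activation convolution blocks plus the identity mapping."""
    residual = conv_block(conv_block(x, kernel1), kernel2)
    return [a + b for a, b in zip(x, residual)]


signal = [0.2, 0.5, 0.1, 0.9, 0.4]
identity_out = residual_unit(signal, [0.0, 0.0, 0.0], [0.0, 0.0, 0.0])
learned_out = residual_unit(signal, [0.1, 0.5, 0.1], [0.2, 0.3, 0.2])
print(identity_out)
print(learned_out)
```

With all-zero kernels the unit returns its input unchanged, which shows why a residual unit can fall back to an identity mapping when the extra layers are not needed.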
The Res-UNet architecture improves the performance of the network compared with the U-Net model. One critical drawback of the Res-UNet model, however, is that switching all the convolutional blocks to residual blocks makes the network too complicated and prone to overfitting the training data. It also increases the model complexity and the number of parameters. Motivated by this challenge, in this section we propose a comprehensive set of architectures for the effective segmentation of thyroid nodules.
Residual U-Net with alternate residual connections
Instead of switching all the convolutional blocks to residual blocks, we replaced alternate convolutional blocks in the U-Net architecture with residual blocks. This halves the number of residual connections within the network and thereby reduces its complexity without affecting the segmentation capability. Gradient flow is still supported by the remaining residual connections within the residual blocks and by the long skip connections between encoder and decoder blocks. Reducing the number of residual connections lowers the model complexity and makes the model better suited to real-time applications. The entire architecture of the network is given in Fig. 6.

Residual UNet with alternate residual connections
Replacing all double convolution blocks with residual blocks makes the network more complex and prone to overfitting the training data, and it may increase the number of model parameters and the model size. Instead of switching all double convolution blocks to residual blocks, we retain the residual blocks in the contracting path and keep double convolution blocks in the expansive path. This architecture improves gradient flow by using a minimum number of residual connections together with long skip connections between encoder and decoder blocks. Removing the residual blocks from the decoder path helps to reduce the model complexity. The entire architecture of the network is given in Fig. 7.

Residual UNet with Residual Connections in the encoder architecture
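The difference between the full Res-UNet and the two reduced variants described so far can be summarized as a block layout. The sketch below assumes five encoder and five decoder blocks and, for the alternate variant, that residual blocks start at the first encoder block; both the starting position and the variant names are assumptions made for illustration.

```python
from typing import List

ENCODER_BLOCKS = 5
DECODER_BLOCKS = 5


def block_layout(variant: str) -> List[str]:
    """Return the block type used at each position, encoder first."""
    names = [f"enc{i + 1}" for i in range(ENCODER_BLOCKS)]
    names += [f"dec{i + 1}" for i in range(DECODER_BLOCKS)]
    layout = []
    for index, name in enumerate(names):
        if variant == "res_unet":
            residual = True
        elif variant == "alternate":
            residual = index % 2 == 0  # every other block keeps its shortcut
        elif variant == "encoder_only":
            residual = name.startswith("enc")
        else:
            raise ValueError(f"unknown variant: {variant}")
        layout.append("residual" if residual else "double_conv")
    return layout


def residual_count(variant: str) -> int:
    return block_layout(variant).count("residual")


for variant in ("res_unet", "alternate", "encoder_only"):
    print(variant, residual_count(variant), block_layout(variant))
```

Under these assumptions both reduced variants keep half of the residual blocks; they differ only in where the remaining shortcuts are placed.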
Here, we utilized a hybrid architecture that adds vertical skip connections between subsequent encoder blocks of the Res-UNet architecture and, at the same time, eliminates the residual connections within the decoder blocks. This approach further explores the effect of skip connections between subsequent encoder blocks by connecting them directly with each other. The input of each new residual block in the contracting path is the concatenation of the output of the current residual block and the outputs of the previous residual blocks, so the output of each residual block in the contracting path is passed to each subsequent block. Feature maps are aggregated by depth concatenation. A max-pooling operation is applied after concatenation, and the result is fed to the next encoding block. In all the Res-UNet architectures discussed so far, skip connections exist only within the local residual blocks; here, we extend them across subsequent blocks in the encoder path. With residual connections within each encoder block and skip connections across subsequent encoder blocks, the gradient can propagate smoothly throughout the network. This improves the discrimination capability of the network and speeds up convergence. Notably, the connections can recover the spatial information lost in the downsampling operations and leverage the location information propagated from earlier layers of the network to achieve better segmentation results.
This approach promotes the information exchange between upsampling and downsampling paths in the entire Res-UNet architecture. The entire architecture of the network is given in Fig. 8.

Residual U-Net with connections in the consecutive encoder blocks
Dataset
Many research solutions exist for the thyroid nodule detection and characterization problem. Most published approaches are based on private datasets, which makes their performance evaluation and comparison difficult. It is also challenging to collect a large amount of data because medical image collection is time-consuming and expensive. Therefore, we decided to work with a public dataset, DDTI (Digital Database of Thyroid Images), which contains ultrasound images of thyroid nodules. This dataset was collected and published by Pedraza et al. in 2015 [26]. It has rarely been used in previous studies of thyroid nodule segmentation [14, 27], so it is not easy to compare the segmentation performance of our proposed method with previous approaches. The database includes a set of B-mode ultrasound images with a complete annotation and diagnostic description of suspicious thyroid lesions by expert radiologists. Several types of lesions, such as thyroiditis, cystic nodules, adenomas, and carcinomas, are included in the dataset, and accurate lesion delineation is provided in XML format. The diagnostic description of malignant lesions was confirmed by histopathological analysis [28]. The region of interest (ROI) is drawn by retrieving the specified pixel locations from the XML file and identifying the bounding polygon represented by these pixels. We randomly split the dataset 85:15 at the patient level to create independent training and test sets. The training data was further split 90:10 to create an independent validation set. The splits were stratified to maintain the same proportion of cancer cases in the training, validation, and test sets. The total numbers of images in the training, validation, and test sets were 2520, 280, and 147, respectively.
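A patient-level stratified split of the kind described above can be sketched with the standard library alone. The patient identifiers and labels below are synthetic placeholders, the function name is hypothetical, and the actual experiments may have used a different tool; the point is only that patients, not images, are assigned to splits, and that each label group is divided separately.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def stratified_patient_split(
    labels: Dict[str, int], held_out_fraction: float, seed: int
) -> Tuple[List[str], List[str]]:
    """Split patient IDs so that each label keeps roughly the same proportion."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for patient_id in sorted(labels):
        by_label[labels[patient_id]].append(patient_id)
    kept, held_out = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        cut = round(len(group) * held_out_fraction)
        held_out.extend(group[:cut])
        kept.extend(group[cut:])
    return kept, held_out


# Synthetic example: 200 hypothetical patients, one in four labelled malignant.
labels = {f"patient_{i:03d}": int(i % 4 == 0) for i in range(200)}
train_val, test = stratified_patient_split(labels, 0.15, seed=0)
train, val = stratified_patient_split({p: labels[p] for p in train_val}, 0.10, seed=1)


def positive_fraction(patients: List[str]) -> float:
    return sum(labels[p] for p in patients) / len(patients)


print(len(train), len(val), len(test))
print(positive_fraction(train), positive_fraction(val), positive_fraction(test))
```

All images of a patient then follow that patient into a single split, which prevents near-duplicate frames from the same examination from appearing in both training and test data.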
Preprocessing
Even though the use of ultrasound in the medical field is well established, it still suffers from several shortcomings, including acquisition noise from equipment, ambient noise from the environment, background tissues, low contrast, and asymmetric illumination. Image enhancement techniques are applied to accentuate certain image features for subsequent analysis or display. In addition, the properties of the images in the public dataset vary widely because of the diversity of image sources. All images are therefore passed through a preprocessing pipeline to improve the efficiency and performance of the model. A useful segmentation model can be obtained only after this noise has been reduced in the ultrasound images.
The data obtained from different sources have different dimensions. In the first stage of preprocessing, images with different dimensions were scaled to a standard size of 256 × 256 pixels. To improve the contrast of the ultrasound images, contrast limited adaptive histogram equalization (CLAHE) is used. The histogram of an image is a graphical representation of the probability distribution of the gray values in the digital image. Histogram equalization (HE) spreads the image intensity values over the full range [0,1], leading to an image with higher contrast. In CLAHE, the enhancement is applied to small regions of the image called tiles rather than to the entire image. In OpenCV, the default tile grid size is 8 × 8, and we kept this value. We used a clip limit of 5.
The usefulness of ultrasound imaging is degraded by signal-dependent noise known as speckle noise, which is caused by wave interference. Speckle noise in ultrasound images makes the segmentation process more complicated. Image filtering is commonly used to reduce noise in ultrasound images and so improve the segmentation results. We applied several filtering techniques to enhance the ultrasound images and evaluated them in terms of the BRISQUE score. The BRISQUE scores for different filter sizes of each filtering technique are plotted in Fig. 9. For the BRISQUE score, a smaller value indicates better perceptual quality. On this basis, we chose the bilateral filter [29]. We also normalized all intensities to [0,1].
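To make the idea of spreading intensity values concrete, the sketch below implements plain global histogram equalization on a flat list of 8-bit gray values. It is deliberately not CLAHE: there is no tiling and no clip limit. In OpenCV, CLAHE itself is available through `cv2.createCLAHE(clipLimit=..., tileGridSize=...)`, and its `apply` method performs the tiled, clipped version of this mapping. The function name and sample values here are illustrative.

```python
from typing import List


def equalize_histogram(pixels: List[int], levels: int = 256) -> List[int]:
    """Global histogram equalization for a flat list of integer gray values."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    # Cumulative distribution of the gray values.
    cdf = []
    running = 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    total = len(pixels)
    if total == cdf_min:  # constant image: nothing to spread
        return list(pixels)
    scale = (levels - 1) / (total - cdf_min)
    return [round((cdf[p] - cdf_min) * scale) for p in pixels]


low_contrast = [100, 100, 101, 102, 102, 103, 104, 104]
equalized = equalize_histogram(low_contrast)
print(equalized)
```

The narrow input range 100 to 104 is stretched to cover the full 0 to 255 range while the ordering of gray values is preserved. CLAHE applies the same principle per tile and limits the histogram peaks so that noise in homogeneous regions is not over-amplified.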

Evaluation of different filtering technique
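The edge-preserving behaviour that makes the bilateral filter attractive for speckle reduction can be shown on a 1-D signal. This is a minimal sketch with hypothetical parameter values, not the configuration used in the experiments: each output sample is a weighted average of its neighbours, where the weight combines spatial closeness with similarity in intensity.

```python
import math
from typing import List


def bilateral_filter_1d(
    signal: List[float], radius: int, sigma_space: float, sigma_range: float
) -> List[float]:
    """Smooth a 1-D signal while keeping sharp intensity steps."""
    out = []
    for i, center in enumerate(signal):
        weighted_sum = 0.0
        weight_total = 0.0
        for j in range(max(0, i - radius), min(len(signal), i + radius + 1)):
            spatial = math.exp(-((i - j) ** 2) / (2 * sigma_space ** 2))
            tonal = math.exp(-((center - signal[j]) ** 2) / (2 * sigma_range ** 2))
            weight = spatial * tonal
            weighted_sum += weight * signal[j]
            weight_total += weight
        out.append(weighted_sum / weight_total)
    return out


# A noisy step edge with intensities already normalized to [0, 1].
noisy_step = [0.02, 0.00, 0.03, 0.01, 0.98, 1.00, 0.97, 0.99]
smoothed = bilateral_filter_1d(noisy_step, radius=2, sigma_space=1.5, sigma_range=0.1)
print([round(v, 3) for v in smoothed])
```

Samples on the far side of the step receive almost no weight because their intensities differ strongly from the centre sample, so the small fluctuations are averaged out while the boundary stays sharp. For a nodule contour this is the desirable property.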
All experiments were conducted on a PC with an Intel(R) Core(TM) i7-7700HQ CPU @ 2.80 GHz, 16 GB of RAM, and an NVIDIA GeForce GTX 1080 Ti GPU. All algorithms were implemented in Python 2.7 using the 64-bit Anaconda distribution on Windows. The models were implemented using the OpenCV, scikit-learn, Keras, and TensorFlow libraries.
Training
To estimate model performance, we used the same experimental configuration for all the models, which is given in Table 1. The batch size is set to 32. The Adam optimizer is used with a learning rate of 6 × 10^-4. Based on previous work, we also adopted a learning rate adjustment strategy [30, 31].
Various Hyper Parameters and their Value for segmentation architectures
In biomedical applications, only a limited number of samples are available in datasets, which causes overfitting in large neural networks. The size of the dataset therefore needs to be increased with data augmentation to obtain better performance. Data augmentation applies operations such as reflection, rotation, scaling, and translation to every input image. Here, we adopted random horizontal flipping, multi-scale resizing, and random cropping to a fixed size for each input image during training to prevent overfitting and improve generalization. In total, 2800 images were obtained from the 833 original images. The data augmentation was performed using the Augmentor Python package.
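For segmentation, every geometric augmentation has to be applied identically to the image and to its mask. The sketch below shows random horizontal flipping and random cropping on small nested lists. It does not use the Augmentor API; the function names, the 0.5 flip probability, and the toy 6 × 6 image are assumptions for illustration.

```python
import random
from typing import List, Tuple

Grid = List[List[int]]


def horizontal_flip(grid: Grid) -> Grid:
    """Mirror every row left to right."""
    return [row[::-1] for row in grid]


def crop(grid: Grid, top: int, left: int, size: int) -> Grid:
    """Cut a square window of side `size` starting at (top, left)."""
    return [row[left:left + size] for row in grid[top:top + size]]


def augment_pair(image: Grid, mask: Grid, crop_size: int, rng: random.Random) -> Tuple[Grid, Grid]:
    """Apply the same random flip and crop to an image and its mask."""
    if rng.random() < 0.5:
        image, mask = horizontal_flip(image), horizontal_flip(mask)
    top = rng.randint(0, len(image) - crop_size)
    left = rng.randint(0, len(image[0]) - crop_size)
    return crop(image, top, left, crop_size), crop(mask, top, left, crop_size)


# Toy 6x6 image whose mask marks the pixels with value >= 25.
image = [[10 * r + c for c in range(6)] for r in range(6)]
mask = [[1 if value >= 25 else 0 for value in row] for row in image]
aug_image, aug_mask = augment_pair(image, mask, crop_size=4, rng=random.Random(0))
print(aug_image)
print(aug_mask)
```

Sharing the flip decision and the crop offsets between the two inputs keeps each mask pixel aligned with the image pixel it annotates.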
Loss function
When the data is not sufficient for training, the network is prone to overfitting. Choosing an appropriate loss function is essential for guiding the pixel-wise predictions of the network and avoiding overfitting. In thyroid ultrasound images there is a class imbalance between foreground and background pixels. To address this class imbalance, we adopt the focal loss, which is given in Equation 5:
$$FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$$
where $p_t$ is the predicted probability of the true class of a pixel, $\alpha_t$ is the class weighting factor, and $\gamma$ is the focusing parameter.
It is essential to assess a model after it has been developed. A critical issue with the segmented images is the high class imbalance between the number of pixels in each class. We use the Dice coefficient, precision, recall, and mean Intersection over Union (mIoU) to evaluate the segmentation results of the different models.
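A per-pixel binary focal loss can be sketched as follows, using the standard form -alpha_t (1 - p_t)^gamma log(p_t), where p_t is the predicted probability of the true class. The defaults alpha = 0.25 and gamma = 2 are commonly used values assumed here for illustration; the values used in the experiments are not stated in the text.

```python
import math
from typing import Sequence


def binary_focal_loss(
    y_true: Sequence[int],
    y_prob: Sequence[float],
    alpha: float = 0.25,
    gamma: float = 2.0,
    eps: float = 1e-7,
) -> float:
    """Mean focal loss over pixels; y_prob is the predicted foreground probability."""
    total = 0.0
    for target, prob in zip(y_true, y_prob):
        prob = min(max(prob, eps), 1.0 - eps)  # avoid log(0)
        p_t = prob if target == 1 else 1.0 - prob
        alpha_t = alpha if target == 1 else 1.0 - alpha
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(y_true)


easy_pixel = binary_focal_loss([1], [0.9])  # confident and correct
hard_pixel = binary_focal_loss([1], [0.1])  # confident and wrong
print(easy_pixel, hard_pixel)
```

The factor (1 - p_t)^gamma shrinks the contribution of pixels that are already classified well, so the many easy background pixels do not dominate the gradient. Setting gamma to 0 reduces the expression to a class-weighted cross-entropy.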
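The four metrics can all be derived from the pixel-level confusion counts of a predicted mask against its ground truth. The sketch below works on flattened binary masks. Computing mIoU as the mean of the foreground and background IoU is an assumption about the definition used, and returning 1.0 when a denominator is zero is a convention chosen here.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(pred: Sequence[int], truth: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (TP, FP, FN, TN) for two flattened binary masks."""
    tp = fp = fn = tn = 0
    for p, t in zip(pred, truth):
        if p == 1 and t == 1:
            tp += 1
        elif p == 1 and t == 0:
            fp += 1
        elif p == 0 and t == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def safe_ratio(numerator: float, denominator: float) -> float:
    """Ratio that returns 1.0 for an empty denominator (convention assumed here)."""
    return numerator / denominator if denominator > 0 else 1.0


def segmentation_metrics(pred: Sequence[int], truth: Sequence[int]) -> Dict[str, float]:
    tp, fp, fn, tn = confusion_counts(pred, truth)
    iou_foreground = safe_ratio(tp, tp + fp + fn)
    iou_background = safe_ratio(tn, tn + fp + fn)
    return {
        "dice": safe_ratio(2 * tp, 2 * tp + fp + fn),
        "precision": safe_ratio(tp, tp + fp),
        "recall": safe_ratio(tp, tp + fn),
        "miou": (iou_foreground + iou_background) / 2,
    }


pred = [1, 1, 1, 0, 0, 0, 0, 0]
truth = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = segmentation_metrics(pred, truth)
print(metrics)
```

In this example there are two true positives, one false positive, and one false negative, so Dice, precision, and recall all equal 2/3, while the foreground IoU is only 0.5. This illustrates that IoU penalizes boundary errors more strongly than Dice.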
Results and discussions
Comparative analysis: Different deep learning-based architecture
All the proposed architectures were tested extensively on real ultrasound samples from the DDTI dataset. This section presents the performance of the segmentation models discussed above. Visualization results for all the segmentation architectures on DDTI test samples are given in Fig. 10, Fig. 11, Fig. 12, and Fig. 13. In each figure, image (a) is the test sample and image (b) is the ground truth. Images (c) to (h) show the segmentation results from the Encoder-Decoder, U-Net, Res-UNet, hybrid Res-UNet-1, hybrid Res-UNet-2, and hybrid Res-UNet-3 architectures, respectively. For each segmentation mask, the Dice Similarity Coefficient with respect to the ground truth is given alongside the corresponding figure.

Segmentation Results from Encoder-Decoder, U-Net, Res-UNet, Res-UNet-1, Res-UNet-2, Res-UNet-3 architectures along with its Dice Similarity Coefficient

Segmentation Results from Encoder-Decoder, U-Net, Res-UNet, Res-UNet-1, Res-UNet-2, Res-UNet-3 architectures along with its Dice Similarity Coefficient

Segmentation Results from Encoder-Decoder, U-Net, Res-UNet, Res-UNet-1, Res-UNet-2, Res-UNet-3 architectures along with its Dice Similarity Coefficient

Segmentation Results from Encoder-Decoder, U-Net, Res-UNet, Res-UNet-1, Res-UNet-2, Res-UNet-3 architectures along with its Dice Similarity Coefficient
Encoder-Decoder architectures can achieve satisfactory segmentation results on nodules with less complex structures, but the model does not segment some of the complex nodules in the test dataset well. When we increased the number of layers to capture the high-level semantic features of these complex nodules, the network failed to provide good segmentation results even on simple nodule structures. This may be due to the vanishing gradient problem in deep neural network architectures discussed earlier.
The U-Net architecture handles these issues to some extent by adding long skip connections between the encoder and the decoder. The improvement in the segmentation results can be seen in Fig. 10 and Fig. 11. However, the vanishing gradient problem remains. Although the segmentation results of U-Net are better than those of the Encoder-Decoder architecture, as shown in Table 2 and Fig. 11, the U-Net is still not sensitive enough to capture the finer details. The white areas segmented by the U-Net are irregular because the spatial information of shallow layers is ignored. The performance of the U-Net based segmentation network is not satisfactory for nodules in low-contrast regions or for images that contain noise and artifacts (Fig. 10, Fig. 11). Both the Encoder-Decoder and the U-Net architectures failed to provide good segmentation maps for these nodule structures, and neither could accurately mark the contours at the nodule boundaries, particularly in low-contrast regions.
Performance of the proposed segmentation architectures for thyroid nodule dataset
In contrast, the architectures based on residual learning provide better segmentation performance on these images. The segmentation results of the Res-UNet based architectures were closer to the ground truth and effectively extracted the appropriate nodule boundaries; that is, they were more exact than those of U-Net and Encoder-Decoder. Res-UNet combines the advantages of U-Net and ResNet and uses residual learning for gradient propagation. For small and irregular nodules, Res-UNet and Hybrid Res-UNet-2 perform better than the other architectures, as shown in Fig. 10 and Fig. 11. For large nodules, all the Res-UNet based architectures give better performance, as shown in Fig. 12 and Fig. 13. All of these residual learning-based segmentation models segment the nodule contours accurately where the U-Net and Encoder-Decoder architectures failed to give accurate boundaries, as is evident from Figs. 10, 11, 12, and 13. This may be because the multiple connections from different convolution blocks in the encoder help the network learn more complex features and smooth the gradient flow through the network.
Our findings suggest that all the architectures based on Res-UNet, especially Hybrid Res-UNet-2, are effective, efficient, accurate, and robust. For small and irregular nodules, the segmentation results of Hybrid Res-UNet-2 are comparable to those of Res-UNet. In almost all other cases, the segmentation results of all the hybrid models are comparable to those of the Res-UNet architecture.
First, all the architectures were run to evaluate the performance of the Res-UNet based architectures. The thyroid nodule segmentation capability of each architecture was compared on a test set of 115 samples. For the quantitative analysis, Table 2 summarizes the segmentation performance of each architecture under each evaluation metric; the average IoU, Dice, precision, and recall are used for the comparison. Furthermore, all evaluation metrics of the proposed methods, especially the Res-UNet based architectures, were in the appropriate range without significant variation, indicating that the methods are robust and flexible and can yield reliable results even in the presence of various artifacts. Residual U-Net outperforms U-Net by 0.0824 in mIoU, 0.0472 in Dice, 0.0380 in precision, and 0.0546 in recall. Likewise, all the Res-UNet based architectures provided similar performance in terms of Dice, mIoU, precision, and recall, as is evident in Table 2.
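The role of residual learning in gradient propagation can be made explicit with a short derivation. This is a sketch under the assumption that each shortcut is an identity mapping with no activation applied after the addition, which may differ from the exact block design used here. The symbols $x_l$ (input to block $l$), $F$ (the residual branch with weights $W_l$), and $\mathcal{L}$ (the training loss) are introduced only for this illustration.

```latex
% Assumption: identity shortcut, no activation after the addition.
\begin{align}
  x_{l+1} &= x_l + F(x_l; W_l) \\
  x_{n}   &= x_m + \sum_{i=m}^{n-1} F(x_i; W_i), \qquad n > m \\
  \frac{\partial \mathcal{L}}{\partial x_m}
          &= \frac{\partial \mathcal{L}}{\partial x_n}
             \left( 1 + \frac{\partial}{\partial x_m}
                    \sum_{i=m}^{n-1} F(x_i; W_i) \right)
\end{align}
```

The constant term 1 in the last line means that part of the gradient reaches the shallow block $m$ directly, without passing through any weight layer, so it is unlikely to vanish even when the gradients through the residual branches are small.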
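As an illustration of how the four metrics relate to one another, the following minimal sketch computes IoU, Dice, precision, and recall from a predicted and a ground-truth binary mask, and averages them over a set of images. The function names, the flat-list mask representation, and the smoothing constant `eps` are assumptions made for this example; they are not taken from the experimental code.

```python
def confusion_counts(pred, truth):
    """Count true positives, false positives, and false negatives for two
    equal-length flat binary masks (1 = nodule pixel, 0 = background)."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of pixels")
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    return tp, fp, fn


def segmentation_metrics(pred, truth, eps=1e-7):
    """Return IoU, Dice, precision, and recall for one image.
    eps (an assumed smoothing term) avoids division by zero on empty masks."""
    tp, fp, fn = confusion_counts(pred, truth)
    return {
        "iou": tp / (tp + fp + fn + eps),
        "dice": 2 * tp / (2 * tp + fp + fn + eps),
        "precision": tp / (tp + fp + eps),
        "recall": tp / (tp + fn + eps),
    }


def mean_metrics(pairs):
    """Average each metric over an iterable of (pred, truth) mask pairs."""
    per_image = [segmentation_metrics(p, t) for p, t in pairs]
    return {k: sum(m[k] for m in per_image) / len(per_image)
            for k in per_image[0]}


# Toy example: 2 overlapping pixels, 1 false positive, 1 false negative.
example = segmentation_metrics([1, 1, 1, 0, 0, 0], [0, 1, 1, 1, 0, 0])
```

For a single image, Dice and IoU are monotonically related (Dice = 2·IoU / (1 + IoU)), which is why the two metrics rank individual segmentations the same way even though their averages over a test set can differ.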
The performance comparison of the Encoder-Decoder architecture, U-Net, Residual U-Net, and the three hybrid architectures is illustrated with box plots in Fig. 14. The median lines of the box plots of all the Res-UNet based architectures lie at the same level. Likewise, all the Residual U-Net based architectures have high and similar whiskers and comparatively smaller boxes, indicating better and more consistent results.

Performance analysis of each architecture in terms of mIoU, Dice, precision, and recall with box plots
The number of parameters, the training time, and the average test time per sample of each model are given in Table 3, which lists the average computational time required for training in hours and for testing in seconds. All the experiments were carried out on the same experimental framework discussed earlier. Table 3 shows that the training and testing times of the Res-UNet architecture are significantly lower than those of the U-Net and Encoder-Decoder architectures. At the same time, the number of parameters of the Res-UNet architecture increases substantially. Too many residual connections may increase the number of parameters and the total number of calculations needed, but they lead to faster convergence. Both effects, the larger parameter count and the faster convergence, are visible in Table 3.
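To make the reading of a box plot concrete, the sketch below computes the quantities that such a plot displays (median, quartiles, whiskers, and outliers) from a list of per-image scores. It assumes the common convention of whiskers at 1.5 times the interquartile range and quartiles obtained by linear interpolation; the function name and the sample scores are hypothetical and are not results from the experiments.

```python
import statistics


def boxplot_summary(scores):
    """Return the statistics a box plot displays for a list of scores.
    Assumes the usual 1.5 * IQR rule for whiskers and outliers."""
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [s for s in scores if low_fence <= s <= high_fence]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": iqr,                      # height of the box
        "whisker_low": min(inside),      # lowest score inside the fences
        "whisker_high": max(inside),     # highest score inside the fences
        "outliers": sorted(s for s in scores
                           if s < low_fence or s > high_fence),
    }


# Hypothetical per-image Dice scores, for illustration only.
summary = boxplot_summary([0.20, 0.80, 0.82, 0.84, 0.86, 0.88])
```

A small `iqr` corresponds to a short box, meaning that most images are segmented with similar quality, while entries in `outliers` correspond to individual images on which the model performs unusually badly or well.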
Comparison of different models in terms of computing time and number of parameters
Instead of replacing all convolution blocks with residual blocks, we can reduce the number of residual connections in several ways without affecting the gradient propagation. A minimum number of residual connections may help to reduce the complexity and memory requirement without affecting the segmentation capability. This is the motivation for the hybrid Res-UNet architectures. The performance of the proposed hybrid Res-UNet architectures is comparable to that of Res-UNet, and their segmentation efficiency is similar to that of domain experts, as shown in Table 2. The number of parameters can be reduced significantly with these hybrid architectures without affecting the training time. Res-UNet and Hybrid Res-UNet-3 have more parameters than the other networks: residual connections within the convolution blocks, together with long skip connections between subsequent convolution blocks, make the network complex and result in a large number of parameters. Hybrid Res-UNet-1 and Hybrid Res-UNet-2 have fewer parameters than Res-UNet and Hybrid Res-UNet-3 because, instead of applying a residual connection to every convolutional block, we minimized the number of residual connections in these two models. U-Net and the Encoder-Decoder architecture have fewer parameters than the other models. The variations in training time and test time per sample among the Res-UNet based architectures are small enough to be ignored.
Table 3 shows that Res-UNet and all the proposed hybrid Res-UNet architectures need less time to train and test than the U-Net and Encoder-Decoder architectures. This is because the skip connections within the convolutional blocks make gradient propagation easier and lead to faster convergence. Hybrid Res-UNet-1 and Hybrid Res-UNet-2 took less time than Res-UNet and Hybrid Res-UNet-3. All the hybrid models provide segmentation capability similar to that of Res-UNet. Residual connections both within a block and between different convolution blocks lead to faster convergence, but this configuration needs more training and test time than the other Res-UNet based models. Hybrid Res-UNet-3 is able to detect nodule boundaries more precisely even in the presence of artifacts and low illumination, so considering the improved performance, its running time is acceptable. The implemented networks based on Res-UNet leverage the strengths of Res-UNet and U-Net while keeping the model simple.
Fig. 15 shows the learning curves of all the architectures, starting from U-Net. All the Res-UNet based models have smoother curves than the U-Net model. In all the Res-UNet based models, the loss converges faster and is stable after 60 epochs. The Dice coefficient also becomes stable within 60 epochs. The smooth learning and loss curves of the Res-UNet based architectures indicate that training is easier for these models, so residual learning improves training convergence.
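One way to see where the extra parameters of a residual block can come from is to count them. The sketch below compares a plain block of two stacked convolutions with the same block plus a 1 x 1 projection on the shortcut. The projection shortcut, the 3 x 3 kernel size, the channel widths, and the omission of batch-normalization parameters are all assumptions made for illustration; the actual counts of the evaluated models are those reported in Table 3.

```python
def conv_params(c_in, c_out, k=3, bias=True):
    """Number of trainable parameters in a k x k 2D convolution."""
    return k * k * c_in * c_out + (c_out if bias else 0)


def plain_block_params(c_in, c_out, k=3):
    """Two stacked k x k convolutions, as in a typical U-Net block."""
    return conv_params(c_in, c_out, k) + conv_params(c_out, c_out, k)


def residual_block_params(c_in, c_out, k=3):
    """The plain block plus an assumed 1 x 1 projection on the shortcut.
    An identity shortcut (c_in == c_out) adds no parameters."""
    shortcut = conv_params(c_in, c_out, k=1) if c_in != c_out else 0
    return plain_block_params(c_in, c_out, k) + shortcut


# Hypothetical encoder stage that widens 64 channels to 128.
plain = plain_block_params(64, 128)        # 221440
residual = residual_block_params(64, 128)  # 229760
extra = residual - plain                   # 8320 added by the shortcut
```

Under these assumptions, every block that changes the channel count pays for its shortcut, so applying residual connections to fewer blocks, as the hybrid variants do, removes that per-block overhead.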

Loss and Accuracy Plots for training and Validation
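A statement such as "the loss is stable after a given epoch" can be checked numerically. The following sketch returns the first epoch from which a curve stays inside a band of a chosen width until the end of training. The function name, the tolerance, and the sample loss values are hypothetical; this is one possible criterion, not the one used to read the curves in the figure.

```python
def first_stable_epoch(curve, tol=0.01):
    """Return the 1-based epoch from which all remaining values of `curve`
    lie within a band of width `tol`, or None for an empty curve."""
    if not curve:
        return None
    lo = hi = curve[-1]
    start = len(curve) - 1
    # Walk backwards from the last epoch, widening the band as needed.
    for i in range(len(curve) - 1, -1, -1):
        lo, hi = min(lo, curve[i]), max(hi, curve[i])
        if hi - lo > tol:
            break
        start = i
    return start + 1


# Hypothetical validation-loss values, for illustration only.
loss = [0.90, 0.60, 0.40, 0.31, 0.305, 0.30, 0.302]
epoch = first_stable_epoch(loss, tol=0.02)
```

Applying the same tolerance to the curves of different models gives a like-for-like comparison of how quickly each one settles.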
Too many residual connections may increase the number of parameters and the total number of calculations needed. Instead of replacing all convolution blocks with residual blocks, a minimum number of residual connections may help to achieve comparable performance with less complexity. The performance of all the proposed hybrid Res-UNet architectures is almost the same as that of Res-UNet, and the segmentation efficiency of all these models is comparable to that of domain experts.
We analyzed the segmentation capability of the hybrid Res-UNet architectures by comparing them with the U-Net and Res-UNet architectures under the same experimental setup described above. The experimental results demonstrate that the proposed method is robust and segments nodule regions more precisely than the U-Net and Encoder-Decoder architectures, even in the most complex images. It is also capable of achieving segmentation performance similar to that of Res-UNet with minimum computational complexity.
We carried out the experimental analysis on the public DDTI dataset, which includes thyroid nodules that differ in size, shape, texture, and location. Most published works are based on private datasets that are not available for experimentation, so it is difficult to compare the performance of our approach directly with these existing approaches. Table 4 tabulates the segmentation performance of various techniques taken from the literature. In [33], X. Ying et al. proposed a segmentation model based on FCN and VGG-net. In [11], the authors used an improved DeepLab v3+, which can extract more details from thyroid nodule images, and obtained an accuracy of 97.91% and a Dice coefficient of 94.08%. In [8], R. Liu et al. proposed a faster region-based CNN approach for thyroid nodule detection. In [34], J. Ding et al. used a residual U-Net with an attention gate mechanism to segment thyroid nodules from ultrasound images. Comparing these results with those obtained in our experimental analysis (Table 2) demonstrates the efficiency of residual learning and U-Net architectures for thyroid nodule segmentation.
Performance analysis of state-of-the-art thyroid nodule segmentation methods for 2D ultrasound images
Segmenting thyroid nodules from thyroid ultrasound images has various clinical applications. The size and volume of a nodule can be estimated only after it has been segmented [3]. Estimating these parameters is vital because they are used to select nodules for the FNAC (Fine Needle Aspiration Cytology) procedure and are highly significant in determining the malignant characteristics of a nodule; for example, lobulated or irregular margins and a taller-than-wide shape are associated with an increased risk of malignancy. Segmentation can also help to identify cystic components inside the nodule. This research addresses the computer-aided nodule detection problem by systematically developing a segmentation model that aids clinicians in detecting thyroid nodules. This work should aid future research on the computer-aided diagnosis of thyroid nodules, as it provides techniques for the accurate detection of thyroid nodules. Additionally, the proposed detection framework is fully automatic and does not require any human interaction. The algorithm can also be used in the continuing education of sonographers and can act as a second opinion for radiologists and clinicians [3].
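As an example of how size-related features can follow from a segmentation, the sketch below derives the area, the bounding-box height and width, and a taller-than-wide flag from a binary mask. It assumes that image rows run along the depth (anteroposterior) direction of a transverse scan and that the pixel spacing is known; the function name, the default spacing values, and the use of a bounding box as a proxy for the nodule diameters are illustrative assumptions, not part of the proposed framework.

```python
def nodule_measurements(mask, mm_per_pixel_x=0.1, mm_per_pixel_y=0.1):
    """Estimate simple size features from a binary mask given as a list of
    rows (1 = nodule pixel). Pixel spacings are assumed values in mm.
    Returns None if the mask contains no nodule pixels."""
    coords = [(r, c) for r, row in enumerate(mask)
              for c, v in enumerate(row) if v]
    if not coords:
        return None
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    # Bounding-box extents serve as a crude proxy for the two diameters.
    height_mm = (max(rows) - min(rows) + 1) * mm_per_pixel_y
    width_mm = (max(cols) - min(cols) + 1) * mm_per_pixel_x
    return {
        "area_mm2": len(coords) * mm_per_pixel_x * mm_per_pixel_y,
        "height_mm": height_mm,
        "width_mm": width_mm,
        "taller_than_wide": height_mm > width_mm,
    }


# Toy 4 x 3 mask with 0.5 mm pixels, for illustration only.
toy_mask = [
    [0, 1, 0],
    [0, 1, 0],
    [1, 1, 0],
    [0, 1, 0],
]
features = nodule_measurements(toy_mask, 0.5, 0.5)
```

Volume estimation would additionally require measurements from a second scan plane, which a single two-dimensional mask cannot provide.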
Limitations and future scope
This research focused only on developing a computer-aided detection tool that semantically segments thyroid nodules from thyroid ultrasound images. Many aspects still need to be studied to achieve better accuracy, performance, and clinical applicability. The present work suggests a few directions and challenges for researchers to explore further in the thyroid image analysis domain. Feature extraction and classification can be incorporated into the study to develop a computer-aided diagnosis system, which would be more beneficial for thyroid nodule diagnosis.
The proposed approach assumes that there is one nodule per image. For images that contain two nodules delineated by the sonographer, we divided the image so that each part includes only one nodule. In addition, the scarcity of annotated ultrasound data has been an obstacle in the computer-aided sonographic diagnosis of thyroid nodules. This limitation becomes particularly relevant when implementing CAD systems for thyroid nodule detection and classification, where large datasets are required to characterize the location and texture of the thyroid nodule. Large datasets are essential for implementing and validating a new CAD system [2], and their scarcity poses a significant obstacle to realizing the full potential of deep learning-based techniques [2]. Although publicly available datasets with manual thyroid annotations exist, the number of thyroid cases is limited to a couple of hundred. A large, comprehensive dataset must be collected for the development of effective CAD systems.
An efficient segmentation model should also consider a three-dimensional view of the nodule, that is, the nodule seen from multiple angles, compression levels, and orientations. Likewise, prior information about the nodule location is needed to capture a two-dimensional image from which the nodule is segmented. During live thyroid ultrasound scanning, the sonographer adjusts the probe to view the thyroid nodule from different directions and angles and at different compression levels. After scanning, the sonographer manually segments the nodule from the captured image and, in doing so, can use the prior information gained from these probe motions to trace the nodule boundary. In the proposed approach, the algorithm had access only to single two-dimensional planar images for the development and implementation of the model.
Future work aims at extending our detection framework into a detection and characterization framework. This will provide physicians with a more comprehensive diagnostic model that assists them in risk assessment and characterization. Likewise, the residual blocks in the proposed architectures can be replaced by dense blocks, which would allow the model to combine the advantages of the proposed method with the strengths of deep convolutional neural networks. This could assist clinicians in handling images with low or inhomogeneous contrast. We also plan to evaluate the proposed segmentation network architectures on other imaging modalities and on images from different domains.
Conclusion
In this work, we proposed a set of fully automated frameworks for thyroid nodule segmentation from ultrasound images, centered on residual learning and the standard U-Net architecture. The experiments demonstrated that the proposed methods are robust and segment thyroid nodules from ultrasound images more precisely than the other methods, even in the most complex images. All the architectures were evaluated for segmentation performance on thyroid nodule images; their performance reached the level of experts and shows potential for clinical applicability. All the architectures based on residual learning and U-Net provided satisfactory performance in terms of mIoU, Dice coefficient, precision, and recall. The performance of the three hybrid approaches is comparable to that of Res-UNet with the minimum computational requirement.
