Abstract
BACKGROUND:
Calcification is an important criterion for classification between benign and malignant thyroid nodules. Deep learning provides an important means for automatic calcification recognition, but it is tedious to annotate pixel-level labels for calcifications with various morphologies.
OBJECTIVE:
This study aims to improve accuracy of calcification recognition and prediction of its location, as well as to reduce the number of pixel-level labels in model training.
METHODS:
We proposed a collaborative supervision network based on attention gating (CS-AGnet), which was composed of two branches: a segmentation network and a classification network. The reorganized two-stage collaborative semi-supervised model was trained under the supervision of all image-level labels and few pixel-level labels.
RESULTS:
The results show that although our semi-supervised network used only 30% (289 cases) of the pixel-level labels for training, its calcification recognition accuracy reached 92.1%, which is very close to the 92.9% achieved by deep supervision with 100% (966 cases) of the pixel-level labels. The CS-AGnet focuses the model’s attention on calcification objects and thus achieves higher accuracy than other deep learning methods.
CONCLUSIONS:
Our collaborative semi-supervised model performs well in calcification recognition while reducing the number of manually annotated pixel-level labels. Moreover, it may serve as a useful reference for object recognition in medical datasets with few labels.
Keywords
Introduction
Ultrasound has become the primary method for detecting thyroid nodules in clinical practice due to its convenience and efficiency [1]. Thyroid nodules with microcalcifications, a taller-than-wide shape, irregular borders and marked hypoechogenicity have a high risk of malignancy [2]. Calcification is present in 32.1% to 38.6% of thyroid nodules [3]. Automatic recognition of calcification can therefore improve the efficiency of distinguishing benign from malignant nodules.
However, it is not easy to recognize and locate calcifications. To the best of our knowledge, most existing methods are based on traditional image processing algorithms. For example, Chen et al. [4] proposed the Calcification Index (CI) to quantitatively characterize the calcification in thyroid nodules. The method proposed by Li et al. [5] segmented the boundary of the nodule by applying the GVF Snake algorithm to CI images, which alleviated the tedious work of manually annotating the nodule boundary. Choi et al. [6] proposed a semi-automatic method using Otsu’s algorithm. In these traditional algorithms, calcification was determined by single or multiple brightness thresholds. Although most calcifications have high brightness, using high brightness as the only criterion may misjudge other hyperechoic areas or highlighted cystic tissues due to the inherent defects of ultrasound images.
In recent years, deep learning has been gradually adopted to develop computer-aided detection and diagnosis (CAD) schemes for medical images [7–9], including deep learning models for automatic thyroid nodule recognition and diagnosis using ultrasound images [10]. However, there are very few studies on the calcification feature of thyroid nodules. At present, there is only one deep learning method [11] for the extraction of calcification features, which was proposed by our team in 2018. The method used the Alexnet network as the basic model and built a segmentation network by layer-wise up-pooling and de-convolution. The high-level features extracted by the segmentation network were transferred directly to the FC (Fully Connected) layer of the Alexnet classification network to determine whether there was a calcification object in the nodules. Although this method improved the accuracy of calcification detection, its prediction of calcification location was relatively rough, and the model applied deep supervision, which requires pixel-level labels for all samples. When the dataset is large, annotating pixel-level labels for calcification is tedious and error-prone.
For calcification localization, it is important to construct a segmentation network with superior feature extraction. After the milestone Fully Convolutional Network (FCN) was proposed for image segmentation [12], the U-net, which inherited its architecture, also achieved promising performance [13]. Since then, many improved segmentation models based on U-net have performed better than the original U-net in medical image segmentation [14, 15]. In the object recognition of medical images, the network is expected to highlight relevant features and suppress irrelevant features during feature extraction. The attention gating (AG) proposed in [16] can focus the network’s attention on the object in the input image. A convolutional neural network (CNN) embedded with AG not only performs well in image classification, but also improves the performance of object detection and segmentation in medical images [17].
In this study, we proposed a two-stage collaborative semi-supervised model based on AG, which can achieve performance close to deep supervision under the collaborative supervision of image-level labels and few pixel-level labels. Moreover, the model pays more attention to calcification features than other methods, and can accurately indicate the location of calcification.
The rest of the paper is organized as follows. Section 2 describes the proposed method and material used in this method. Section 3 shows the semi-supervised experimental results with different percentages of pixel-level labels. The discussion in Section 4 compares the semi-supervised results with the experimental results of other calcification recognition methods. The conclusions are presented in Section 5.
Methodology
CS-AGnet architecture
The CS-AGnet, which is based on U-net, contains two branches: a segmentation network and a classification network. As shown in the detailed architecture in Fig. 1, the two branches share the feature extractor in module A. The combination of module A and module B forms the segmentation network, while the combination of module A and module C forms the classification network. There is no connection between module B and module C. The segmentation network needs to be supervised by pixel-level labels, whereas the classification network only needs image-level labels. Therefore, the CS-AGnet was trained under the collaborative supervision of the pixel-level labels in module B and the image-level labels in module C. This enables the feature extractor to learn from pixel-level labels and image-level labels simultaneously, which continuously enhances the feature extraction capability of module A. Since a pixel-level label carries more information, it is more important than an image-level label for predicting calcification location. Thus, it is particularly important to construct a segmentation network with powerful feature extraction capabilities.

The architecture of the proposed CS-AGnet model, which contains one input branch (Module A) and two output branches (Module B and Module C).
In the CS-AGnet, the segmentation network was named DS-AGnet (Deep Supervised Attention Gated network), and it took the first four blocks (10 convolutions and 4 max-pooling layers) of VGG16 [18] as the contracting path. Deepening the network moderately can improve the ability of feature extraction. Considering the balance between the network parameters and performance, the number of filters from Block1 to Block4 was 32, 64, 128 and 256, respectively, which is half that of the original U-net. Two convolution layers with 512 filters formed the bridge between the contracting path and the expansion path. The skip connections integrated high-level features with low-level features through AG, so that the model can highlight the feature response of salient regions and suppress the response in irrelevant regions.
Due to the cascaded convolutions and the network’s nonlinearity, the loss of spatial details may lead to wrong recognition of calcification. As a solution, we added pyramid input and output to the middle layers of DS-AGnet. By injecting a pyramid input image before each max-pooling layer in the contracting path, the model can access different kinds of category details at different scales. The deep supervision strategy introduced in [19] forces the network features to be semantically discriminative at each scale, which ensures that the AGs can influence the response to the image foreground content. We then added a 1×1×1 convolution and a sigmoid activation function on each decoder of DS-AGnet to form a pyramid output, which compensated for the loss of small calcifications in cascaded convolutions and ensured the learning of position-aware features.
Classification branch
The classification branch is composed of the shared feature extraction module A and module C. The “category prediction” of this branch refers to the model’s determination of the presence or absence of calcification features in the input thyroid nodule image. Module C integrated the features of multi-level Grid Attention Gating (Grid-AGs) to jointly predict a result. A specific aggregation scheme was needed between the multi-level attention features to obtain a satisfactory output, because simple aggregation strategies may not force the network to learn the most useful gating mechanism. The feature aggregation in module C used the fine-tuning mode [17]. It is slightly different from the fine-tuning of transfer learning, as it allows the fusion and fine-tuning of multiple-scale predictions in the network. The one-dimensional vectors obtained by the weighted average of the salient features on the feature map channels were fed into an FC layer and soft-max to obtain a category prediction at each scale. We then concatenated these three category results and added a new FC layer and soft-max on top of the concatenated features to fine-tune the weights of the FC layers of the three Grid-AG features for an accurate joint prediction. In this way, the network pays attention to subtle differences that can only be observed at specific scales.
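The aggregation described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`attended_vector`, `joint_prediction`), the way the attention map is normalized into averaging weights, and the layer shapes are assumptions made only for illustration.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def attended_vector(features, alpha):
    """Attention-weighted average over positions: (H, W, C), (H, W) -> (C,)."""
    weights = alpha / alpha.sum()          # assumed normalization of the attention map
    return np.einsum("hw,hwc->c", weights, features)

def joint_prediction(feature_maps, attention_maps, scale_fcs, final_fc):
    """Per-scale FC + soft-max, then one FC + soft-max over the concatenation.

    scale_fcs: list of (W, b), W of shape (n_classes, C_scale)
    final_fc:  (W, b), W of shape (n_classes, n_classes * n_scales)
    """
    per_scale = [
        softmax(W @ attended_vector(f, a) + b)
        for f, a, (W, b) in zip(feature_maps, attention_maps, scale_fcs)
    ]
    W, b = final_fc
    return softmax(W @ np.concatenate(per_scale) + b), per_scale
```

In this sketch the final FC layer sees only the per-scale class probabilities, so training it jointly with the per-scale FC layers corresponds to the fusion and fine-tuning of multiple-scale predictions mentioned in the text.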
Grid attention gating
The AG was first proposed in [16]. Jo et al. introduced an improved Grid-AG based on AG, which performed well on medical images without training a large number of additional parameters [17, 20]. We embedded AGs in the CS-AGnet to focus on the salient areas related to calcification and to minimize the misjudgment of other highlighted areas.
In Grid-AG, the compatibility score
The detailed schematic diagram of Grid-AG is also shown in Fig. 2. The global feature g encodes the most representative and distinguishing features related to the object. Therefore, the compatibility score would highlight the features in

A flow diagram of Grid Attention Gate block.
The σ2 in Grid-AG normalized the compatibility score to attention coefficient
Due to the presence of speckle noise in ultrasound images, a linear compatibility function usually cannot perform well on highly heterogeneous calcification. When calculating the compatibility score in Grid-AG, $W_f \in \mathbb{R}^{C_{int} \times C}$, $W_g \in \mathbb{R}^{C_{int} \times C_g}$ and $\Psi \in \mathbb{R}^{C_{int}}$ are linear transformations, $b_\psi \in \mathbb{R}$ and $b_g \in \mathbb{R}^{C_{int}}$ are bias terms, and $\sigma_1$ is the nonlinear activation function ReLU: $\sigma_1(x) = \max(0, x)$. The network can learn the nonlinear relationship between vectors through $W_f$, $W_g$ and $\sigma_1$ in Grid-AG, which makes the network more expressive. Moreover, $W_f$ enables the fine-scale features of the selected layer to focus on learning distinguishing features, instead of generating signals compatible with $g$. In the network, $\{W_g, b_g\}$, $\{\Psi, b_\psi\}$ and $\{W_f\}$ were implemented as convolution layers with a kernel size of 1×1; the last one had no bias.
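To make the roles of these parameters concrete, the following NumPy sketch assumes the additive form used for Grid-AG in [17], in which the compatibility score at each position is $\Psi^{T}\sigma_1(W_f f + W_g g + b_g) + b_\psi$. Two further assumptions are made because the text does not state them here: $\sigma_2$ is taken to be a sigmoid, and the gating signal is taken to be already resampled to the spatial grid of the fine-scale features. The function name `grid_attention_gate` is illustrative.

```python
import numpy as np

def grid_attention_gate(f, g, W_f, W_g, b_g, psi, b_psi):
    """Additive attention gate over a feature map (illustrative sketch).

    f:   (H, W, C)    fine-scale features from the skip connection
    g:   (H, W, C_g)  gating signal, assumed already resampled to H x W
    W_f: (C_int, C), W_g: (C_int, C_g), b_g: (C_int,)
    psi: (C_int,), b_psi: scalar
    Returns (gated features of shape (H, W, C), attention map of shape (H, W)).
    """
    # A 1x1 convolution is a per-pixel linear map over channels.
    q = np.einsum("kc,hwc->hwk", W_f, f)          # W_f has no bias
    q = q + np.einsum("kc,hwc->hwk", W_g, g) + b_g
    q = np.maximum(q, 0.0)                        # sigma_1: ReLU
    c = q @ psi + b_psi                           # compatibility score per pixel
    alpha = 1.0 / (1.0 + np.exp(-c))              # sigma_2: sigmoid (assumed)
    return f * alpha[..., None], alpha
```

Multiplying the skip-connection features by the attention map is what lets the gate pass salient regions and attenuate the rest.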
In the collaborative-supervised training scheme, the CS-AGnet was trained under the supervision of real image-level labels and pseudo pixel-level labels for all the training samples. The pseudo pixel-level labels were the predictions of the trained DS-AGnet, and the DS-AGnet was trained only under the supervision of a few samples with pixel-level labels. The schematic diagram of the proposed Grid-AG based collaborative-supervised training scheme is shown in Fig. 3. Since the CS-AGnet embedded Grid-AGs, it can combine the global attention of the image-level label with the fine attention of the pseudo pixel-level label. The collaborative supervision gradually tunes the feature extractor while combining attention features.

A schematic diagram of two-stage collaborative-supervised training scheme.
The collaborative supervision training scheme can be described as the following two stages. Stage one: training the DS-AGnet with the deep-supervised dataset, which was composed of a certain percentage of images, randomly selected from all training data, and their pixel-level labels. Stage two: training the CS-AGnet under collaborative supervision, where the image-level labels were the ground truth of the entire dataset and the pseudo pixel-level labels were the predictions of the trained DS-AGnet.
In most thyroid nodule images, the calcification occupies only a small part of the nodule, so there is an imbalance between the number of background pixels and object pixels. For the calcification segmentation in DS-AGnet, the weight of false negative (FN) detections should be higher than that of false positive (FP) detections to improve the recall rate. The focal loss, which introduces a modulating exponent, can alleviate the class imbalance between positive and negative samples better than the Dice loss, which weights FP and FN equally [21]. However, it is difficult to balance precision and recall due to the small ROI in medical images.
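The two stages can be sketched as a short orchestration function. This is only an outline under stated assumptions: `train_ds_agnet` and `train_cs_agnet` are hypothetical callables standing in for the actual training routines, the trained DS-AGnet is assumed to be callable on a single image, and the subset size is computed by integer (floor) division.

```python
import random

def two_stage_training(images, image_labels, pixel_labels, percent,
                       train_ds_agnet, train_cs_agnet, seed=0):
    """Two-stage collaborative supervision (illustrative outline).

    pixel_labels maps a sample index to its mask; only the sampled subset is read.
    train_ds_agnet(images, masks) -> callable that predicts a mask for one image
    train_cs_agnet(images, image_labels, pseudo_masks) -> trained model
    """
    n = len(images)
    rng = random.Random(seed)
    subset = sorted(rng.sample(range(n), n * percent // 100))

    # Stage one: deep supervision on the small pixel-labelled subset.
    ds_agnet = train_ds_agnet([images[i] for i in subset],
                              [pixel_labels[i] for i in subset])

    # Pseudo pixel-level labels for every training image.
    pseudo_masks = [ds_agnet(img) for img in images]

    # Stage two: real image-level labels plus pseudo pixel-level labels.
    cs_agnet = train_cs_agnet(images, image_labels, pseudo_masks)
    return cs_agnet, subset
```

Passing the trainers in as arguments keeps the sketch independent of any particular deep learning framework.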
The improved Tversky similarity index based on Dice score could maintain a flexible balance between FP and FN in case of severe data imbalance:
It is specified that the parameter γ varies from 1 to 3. It can be seen from formula (3) that the FTL changes slightly with high Tversky index. In the segmentation of calcification, the FTL ensures the pixels in the small calcification areas contribute equally to the loss of network training as pixels in the large background areas. We set the original parameter value: α = 0.7, β = 0.3,
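Because the formulas themselves are not reproduced here, the sketch below assumes the commonly used focal Tversky form: a Tversky index TI = TP / (TP + α·FN + β·FP) and a loss FTL = (1 − TI)^(1/γ). With α = 0.7 on FN and β = 0.3 on FP, missed calcification pixels are penalized more than false alarms, in line with the weighting described above. The function names, the smoothing constant `eps` and the use of soft (probabilistic) counts are assumptions; γ is left as a required argument.

```python
import numpy as np

def tversky_index(y_true, y_pred, alpha=0.7, beta=0.3, eps=1e-7):
    """Soft Tversky index; alpha weights false negatives, beta false positives."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    tp = np.sum(y_true * y_pred)
    fn = np.sum(y_true * (1.0 - y_pred))
    fp = np.sum((1.0 - y_true) * y_pred)
    return (tp + eps) / (tp + alpha * fn + beta * fp + eps)

def focal_tversky_loss(y_true, y_pred, gamma, alpha=0.7, beta=0.3):
    """Assumed form FTL = (1 - TI) ** (1 / gamma), with gamma in [1, 3]."""
    ti = tversky_index(y_true, y_pred, alpha, beta)
    return max(0.0, 1.0 - ti) ** (1.0 / gamma)
```

With γ = 1 the loss reduces to the plain Tversky loss 1 − TI, which gives a simple sanity check for an implementation.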
The pyramid outputs of the DS-AGnet and CS-AGnet, except for the final output layer, were trained under the FTL to prevent the total loss from being over-suppressed. The activation function of each pyramid output layer was sigmoid.
Dataset
Since the detection of thyroid nodules in ultrasound images is already a mature technique [23], the images used in these experiments were the minimum square areas containing intact thyroid nodules. The dataset included 1,207 ultrasound images of thyroid nodules, of which 551 had calcification and 656 had no calcification. Before being fed into the network, the images were resized to 192×192. We augmented the normalized dataset by random horizontal and vertical flipping in order to improve the robustness of the model and reduce over-fitting.
The samples and labels of the thyroid ultrasound image dataset are shown in Fig. 4. The pixel-level label of an image was a binary mask of the same size, where 1 was calcification and 0 was background. The image-level label was one-hot encoded. The pixel-level and image-level labels used in the network training were annotated under the guidance of experienced radiologists. In order to test the number of pixel-level labels needed for collaborative semi-supervision, we proportionally sampled 10% to 60% of the images and the corresponding pixel-level labels from all the training data to build the deep-supervised dataset for DS-AGnet.
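A minimal sketch of the flipping augmentation is given below. It assumes that each flip is applied independently with probability 0.5 (the text does not state the probability) and that any pixel-level mask is flipped together with its image so that the two stay aligned; the function name `random_flip` is illustrative.

```python
import numpy as np

def random_flip(image, mask, rng):
    """Randomly flip an image and its mask horizontally and/or vertically."""
    if rng.random() < 0.5:                      # assumed flip probability
        image, mask = image[:, ::-1], mask[:, ::-1]   # horizontal flip
    if rng.random() < 0.5:
        image, mask = image[::-1, :], mask[::-1, :]   # vertical flip
    return image, mask

rng = np.random.default_rng(0)
image = np.zeros((192, 192), dtype=np.float32)
mask = np.zeros((192, 192), dtype=np.uint8)
aug_image, aug_mask = random_flip(image, mask, rng)
```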

The images and labels of the ultrasound thyroid nodule image dataset. The images of the dataset are the smallest square images containing the entire nodule area (a(1) and a(2)), b(1) and b(2) are its pixel-level labels of calcification features, and c(1) and c(2) are the image-level labels of the nodules with or without calcification, which are one-hot encoding.
Adam combines the advantages of the AdaGrad and RMSProp algorithms and converges quickly in model optimization through an adaptive learning rate [24]. The DS-AGnet and CS-AGnet were trained by Adam optimizers with learning rates of 1e-4 and 5e-5, respectively, and a learning rate decay of 1e-6 for both. We performed 5-fold cross-validation on the thyroid nodule ultrasound dataset. All the experiments were performed on an NVIDIA GeForce GTX 1070.
Experimental results
Attention results of Grid-AGs
For the Grid-AGs in DS-AGnet, the Grid-AG next to the top of the encoder is the first, and the Grid-AG closest to the final output is the third. Fig. 5 shows the attention maps of all Grid-AGs when predicting calcification. Specifically, it shows that the first two Grid-AGs pay more attention to the surrounding tissues, while the third Grid-AG’s attention on calcification is the most accurate and finest. The first Grid-AG notices most of the hyperechoic areas including calcification, forcing the network to differentiate between calcification and other suspected areas. The second Grid-AG pays attention to the surroundings of the calcification. Therefore, the network learns to determine the calcification boundary, which allows it to indicate the calcification morphology more accurately in the prediction. All the attention maps analyzed in the following are from the third Grid-AG, which shows the recognition performance of CS-AGnet more objectively.

Attention maps of all Grid-AGs in CS-AGnet.
The recognition results of the calcification feature in thyroid nodules by CS-AGnet are shown in Fig. 6. It can be seen that the attention and segmentation prediction of CS-AGnet indicate the location of calcification accurately, even for a low-contrast calcification feature or when other interfering structures with higher brightness are present in the nodule. For the nodule without calcification shown in the fifth column, the attention is displayed as a long bright bar at the edge of the image, although the model predicts correctly. This may be because the model is expected to focus on the salient area related to the object, so attention is forced to move to the edge of the image when there is no object. This aimless attention is easy to distinguish as it appears as a straight bar, and it has little effect on the results.

The CS-AGnet’s recognition results for calcification in thyroid nodules. The first row shows the input images, the second row the pixel-level labels of calcification, the third row the attention maps, the fourth row the outputs of the segmentation branch, the fifth row the prediction scores of the classification branch, and the sixth row the classification results.
In order to illustrate the effectiveness of the proposed two-stage collaborative supervision model, accuracy, sensitivity and specificity are selected as the standard metrics to evaluate the performance of calcification recognition. They are calculated from the numbers of true positives (TP), true negatives (TN), FP and FN, as shown in formulas (4) to (6). The area under the receiver operating characteristic (ROC) curve (AUC) is also adopted to evaluate the robustness of the model. The ROC curve is drawn by sweeping a series of cutoff values (decision thresholds) of the binary classifier, with the true positive rate as the ordinate and the false positive rate as the abscissa. The predicted score used in the ROC curve calculation is produced by the last soft-max of the classification branch in module C. In this experiment, the thyroid nodules with calcification were regarded as positive and the others as negative.
According to the general theory of deep learning, the performance of the model is positively correlated with the percentage of pixel-level labels. Since we performed 5-fold cross-validation for the experiment, there are a total of 966 (1207 × 4/5, rounded) training samples. We tested the results when the percentage of pixel-level labels was 10% (96 cases), 20% (193 cases), 30% (289 cases), 40% (386 cases), 50% (483 cases), 60% (579 cases) and 100% (966 cases), respectively. A percentage of 10% means that only 10% (96 cases) of the training samples and their pixel-level labels were used to train DS-AGnet. The other 90% (870 cases) of the samples and the previous 10% (96 cases), all with image-level labels, were used together in the second stage of CS-AGnet training. A percentage of 100% indicates that the model was trained under the deep supervision of all pixel-level and image-level labels. Table 1 shows the calcification recognition performance of CS-AGnet as the number of pixel-level labels used for the DS-AGnet training increases. Note that the image-level labels remain at 100% throughout.
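The three metrics follow the standard definitions: accuracy = (TP + TN) / (TP + TN + FP + FN), sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The short sketch below computes them from binary labels and predictions, with 1 denoting a nodule with calcification. The function name and the toy labels are illustrative only, and the function assumes that both classes are present.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity for binary labels (1 = calcification)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # true positive rate (recall)
        "specificity": tn / (tn + fp),   # true negative rate
    }

# Toy example: 3 TP, 1 FN, 3 TN, 1 FP.
metrics = classification_metrics([1, 1, 1, 0, 0, 0, 0, 1],
                                 [1, 1, 0, 0, 0, 1, 0, 1])
```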
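The case counts quoted above can be reproduced with a few lines of arithmetic. The sketch assumes that the training-set size is rounded to the nearest integer and that subset sizes are rounded down, which is an inference from the reported numbers rather than a stated procedure.

```python
def pixel_label_count(n_train, percent):
    """Number of pixel-labelled training cases, rounded down (assumed)."""
    return n_train * percent // 100

n_total = 1207
n_train = round(n_total * 4 / 5)      # 965.6 rounds to 966 training samples
counts = {p: pixel_label_count(n_train, p) for p in (10, 20, 30, 40, 50, 60, 100)}
remaining_at_10 = n_train - counts[10]
```

Here `counts` maps each percentage to the number of cases used to train DS-AGnet, and `remaining_at_10` is the number of samples that contribute only image-level labels at the 10% setting.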
Experimental results of CS-AGnet under the collaborative supervision of different percentages of pixel-level labels
The results showed that when the percentage of samples with pixel-level labels in the DS-AGnet training was 30%, the recognition accuracy of CS-AGnet was 92.1%, which was close to the 92.9% of deep supervision. When the percentage reached 50%, the accuracy further improved to 92.7%. However, when the percentage was only 10%, the CS-AGnet did not perform well, with an accuracy of 85.8% and an AUC of 0.919. Moreover, when the percentage of pixel-level labels increased to 20%, the accuracy of calcification recognition only increased to 87.6%.
Figure 7 shows the ROC curves of the collaborative-supervised CS-AGnet when the DS-AGnet was trained under the supervision of different percentages of pixel-level labels. As the percentage increased from 10% to 60%, the AUC increased from 0.919 to 0.972. In particular, the AUC value is considered a more reliable metric than accuracy when the sample distribution is skewed [25]. It is easy to see from the curves that when the percentage of the deep-supervised dataset in the DS-AGnet reached 30% or more, the recognition performance of CS-AGnet was significantly improved, and the ROC curves were very close to each other. In conclusion, once the pixel-level labels reached a certain percentage, the calcification recognition performance of the collaborative-supervised model remained stable around the performance of deep supervision.
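For readers who want to reproduce an AUC value from classification scores, the sketch below uses the rank-based identity that the area under the ROC curve equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name and the toy scores are illustrative, and the quadratic pair loop is only suitable for small test sets.

```python
def auc_from_scores(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: one of the four positive/negative pairs is ranked wrongly.
toy_auc = auc_from_scores([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```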

ROC curves of CS-AGnet under the collaborative supervision of different percentages of pixel-level labels.
Figure 8 shows the attention maps of the CS-AGnet under the collaborative supervision of different percentages of pixel-level labels. It can be seen from the attention maps obtained with collaborative deep supervision that the area noticed by CS-AGnet is almost the same as the calcification location in the ground truth. In collaborative semi-supervision, as the percentage of pixel-level labels increased, the attention of CS-AGnet to calcification came closer and closer to the real location. When the percentage of pixel-level labels was 10%, the network recognized hyperechoic areas in the ultrasound image that resembled calcification features as calcification. Moreover, calcifications with lower brightness were usually difficult to recognize. Once the percentage of pixel-level labels reached 30%, the CS-AGnet’s attention covered most of the calcification in the ground truth, as with deep supervision.

Attention maps of CS-AGnet trained with different percentages of pixel-level labels. In addition to the input images and the pixel-level labels, the last four columns show the attention maps of the CS-AGnet when pixel-level labels account for 10% (96 cases), 30% (289 cases), 50% (483 cases), and 100% (deep supervision with 966 cases) of the training set, respectively.
In order to evaluate the performance of CS-AGnet, we compared it with several related networks mentioned in previous papers (Table 2). The experimental results of all the methods in Table 2 were obtained by running them on the thyroid nodule dataset described in Section 3.1. It is noteworthy that the result of the CS-AGnet in Table 2 was obtained by semi-supervision with only 30% (289 cases) of the pixel-level labels. It can be seen that the semi-supervised CS-AGnet achieved the best results in accuracy and AUC.
The accuracy, sensitivity, specificity and AUC of CS-AGnet and other comparative methods in calcification recognition (0%, 30% and 100% indicate the percentage of pixel-level labels used by that method).
VGG_U-net was also a collaborative-supervised model, which has the same network layers as CS-AGnet but without embedded Grid-AGs. Alexnet-Seg was the model for calcification extraction proposed in [11]. AG-Sononet-FT-8 was an AG-based model for weakly-supervised object detection and recognition in fetal ultrasound scanning plane detection; we tested its performance in calcification recognition. The tests of VGG-16 [18] and ResNet-50 [26] were aimed at verifying whether classification networks that perform well on natural images are also suitable for the classification of ultrasound images. Since calcification recognition is essentially a simple binary classification problem, the feature extraction of VGG and ResNet was sufficient. Even if the network is further deepened, as in InceptionV3 [27], the performance could not be significantly improved.
ROC curves of the test results for all experimental models mentioned above are shown in Fig. 9. Figure 10 shows their attention maps, where the attention maps of the networks without AG are obtained by CAM [28]. Among the deep-supervised models, VGG_U-net performed better than Alexnet-Seg. This is because the collaborative-supervised training scheme can significantly improve the feature extraction ability compared to directly transferring the features learned by the segmentation network to the classification network. The CS-AGnet with Grid-AGs further improves the accuracy of calcification recognition over VGG_U-net. Most importantly, it focuses the network’s attention on calcification features more accurately under the supervision of only 30% of the pixel-level labels.

Comparison of ROC curves between CS-AGnet and other methods for calcification recognition in thyroid nodules.

Attention maps of CS-AGnet and other thyroid nodule calcification recognition methods. In addition to the input image and pixel-level ground truth, the last four columns are the attention maps of CS-AGnet, VGG_U-net, AG-Sononet-FT-8 and VGG-16.
In a conventional network, a larger dataset generally yields better model performance. The two-stage CS-AGnet achieves good performance on a very small pixel-level dataset at the cost of more complex model training. At present, most deep learning models are trained in one stage, and the two-stage training scheme can be regarded as repeating the one-stage training twice. For small medical objects, annotating pixel-level labels is tedious, so the size of the dataset that can be obtained is small. In this paper, two-stage training on a smaller dataset with pixel-level labels (289 cases) is used to approach the performance of one-stage training on the entire dataset (966 cases). By analogy, it may be possible to achieve the performance of one-stage training on 3000 samples through two-stage training on 1000 samples.
AG-Sononet-FT-8, VGG-16 and ResNet-50 are all calcification recognition models trained only under the weak supervision of image-level labels. As seen in Table 2, their performance is inferior to that of the deep supervision models using pixel-level labels. It can be seen from the attention maps that although AG-Sononet-FT-8 can notice most calcifications under the supervision of image-level labels, its accuracy needs to be improved. VGG-16 and ResNet-50 performed better than AG-Sononet-FT-8 in terms of accuracy and AUC, but their attention to calcifications was more blurred. The reason for this phenomenon is that VGG-16 and ResNet-50 obtain classification results by predicting the approximate location of calcification. Compared with these two methods, AG-Sononet-FT-8 needs to find a finer location of calcification objects when predicting. Generally, since the location predicted by these three weakly-supervised models is not accurate enough and the recognition accuracy is slightly lower, it is difficult to apply them to the calcification analysis of a thyroid CAD system.
In the brightness-based methods CI [4] and Otsu [6], since the highlighted hyperechoic areas and cystic tissue in ultrasound images would affect the calcification recognition result, their calcification recognition performance is worse than the methods of neural network. In conclusion, the two-stage semi-supervised CS-AGnet we proposed showed the most accurate results. It not only achieved significantly improved calcification recognition performance by only adding 30% pixel-level labels but also paid more attention to the calcification.
In this study, we proposed the CS-AGnet, a Grid-AG based collaborative supervision model for calcification recognition in ultrasound images of thyroid nodules. The CS-AGnet combines the global attention and fine attention of the model under the supervision of image-level labels and pseudo pixel-level labels, and it achieves performance close to that of deep supervision. Therefore, the collaborative semi-supervised model is valuable for deep learning research on medical images. This model is also useful for the recognition and segmentation of small objects in medical images when pixel-level labels are insufficient. Compared with some existing neural network methods and traditional image processing methods, the CS-AGnet not only has higher recognition accuracy, but also predicts the calcification location more accurately, which will improve the reliability of the model in a CAD system.
Nevertheless, the proposed method has some limitations. For example, it has difficulty predicting the location of low-contrast, frequent and scattered calcifications accurately, and it may misjudge the bright cystic tissue in negative samples. As the clinical recognition of those samples mainly depends on the experience of radiologists, we plan to enlarge the training set with more complex samples, which should enhance the empirical learning of the model. In future work, the size and morphology of the predicted calcification features can be quantitatively analyzed. In addition, distinguishing the types of calcification in benign and malignant nodules is important for the diagnosis of thyroid nodules.
Acknowledgments
This study is supported by the Application and Basic Research project of Sichuan Province (No.2019YJ0055), the Enterprise Commissioned Technology Development Project of Sichuan University (No.18H0832), and the Achievement Conversion and Guidance Project of Chengdu Science and Technology Bureau (No.2017-CY02-00027-GX). Our images were provided by China-Japan Friendship Hospital (Beijing 100029) and Highong Intellimage Medical Technology (Tianjin) Co., Ltd.
