Deep Learning–Based Computer-Aided Diagnosis System for Localization and Diagnosis of Metastatic Lymph Nodes on Ultrasound: A Pilot Study

Abstract

Background:

The presence of metastatic lymph nodes is a prognostic indicator for patients with thyroid carcinomas and is an important determinant of clinical decision making. However, evaluating neck lymph nodes requires experience and is labor- and time-intensive. Therefore, the development of a computer-aided diagnosis (CAD) system to identify and differentiate metastatic lymph nodes may be useful.

Methods:

From January 2008 to December 2016, we retrieved clinical records for 804 consecutive patients with 812 lymph nodes. The status of all lymph nodes was confirmed by fine-needle aspiration. The data were split into a training set (263 benign and 286 metastatic lymph nodes), a validation set (30 benign and 33 metastatic lymph nodes), and a test set (100 benign and 100 metastatic lymph nodes). Using the VGG-Class Activation Map model, we developed a CAD system to localize and differentiate the metastatic lymph nodes. We then evaluated the diagnostic performance of this CAD system in our test set.

Results:

In the test set, the accuracy, sensitivity, and specificity of our model for predicting lymph node malignancy were 83.0%, 89.0%, and 77.0%, respectively. The CAD system clearly detected the locations of the lymph nodes, which not only provided identifying data, but also demonstrated the basis of its decisions.

Conclusion:

We developed a deep learning–based CAD system for the localization and differentiation of metastatic lymph nodes from thyroid cancer on ultrasound. This CAD system is highly sensitive and may be used as a screening tool; however, as it is relatively less specific, the screening results should be validated by experienced physicians.

Introduction

Artificial intelligence–based computer-aided diagnosis (CAD) systems are rapid and highly reproducible, and are therefore suitable for labor-intensive work (1,2). As such systems are not affected by interobserver variation unless there is significant manual interaction with users, health care providers could use them to obtain second opinions. CAD systems that use deep learning to classify medical images have already been applied in various diagnostic imaging fields (3–5).

In particular, in the field of thyroid imaging, a thyroid CAD system proposed by Chang et al. can differentiate malignant from benign thyroid nodules at accuracy levels similar to those obtained by radiologists (6). Commercially available thyroid CAD systems have since been introduced (e.g., S-Detect for Thyroid, AM-CAD) (7,8). Notably, Choi et al. integrated S-Detect for Thyroid into an ultrasound (US) machine for real-time diagnosis (7). In the initial prospective study, S-Detect demonstrated a sensitivity similar to that of an experienced radiologist for thyroid cancer detection, although the radiologist achieved a superior specificity. S-Detect could be used to assist the real-time assessment of malignancy risk and decisions regarding fine-needle aspiration during US examinations.

Despite the introduction of thyroid CAD systems, to our knowledge, a CAD system for the detection of lymph node metastases in the neck has not yet been developed. Both diagnosis of thyroid cancer and detection of metastatic lymph nodes are important (9,10), as decisions regarding the surgical extent rely on the latter (11). Lymph node metastasis is also a prognostic indicator associated with patient survival, local recurrence, and distant metastasis (12,13). However, lymph node metastasis detection requires additional experience. Several international guidelines have emphasized the clinical significance of metastatic lymph nodes from thyroid cancers (14–16). Therefore, this study aimed to develop a deep learning–based CAD system to detect metastatic lymph nodes and to validate the diagnostic performance of this system.

Materials and Methods

Patients and datasets

This retrospective study protocol was approved by the ethics committee of our institutional review board. Informed consent for US and US-guided biopsy was obtained from all patients prior to each procedure. Because heterogeneous US images can lead to overfitting, split-sample validation was used to separate the data into training, validation, and test sets (Table 1). From January 2008 to December 2015, 612 lymph nodes from 604 patients (293 benign lymph nodes from 293 patients, 319 metastatic lymph nodes from 311 patients) were consecutively examined. The cases were divided into a training data set (263 benign and 286 metastatic lymph nodes) and a validation data set (30 benign and 33 metastatic lymph nodes). From January to December 2016, we also enrolled 200 lymph nodes (100 benign and 100 metastatic lymph nodes) as a test data set.

Table 1.

Demographic Data for the 804 Patients

Characteristics	Training set and validation set	Test set
Patients
n	604	200
Age, years, range (mean)	13–84 (44.3)	10–81 (55.2)
Male, n (%)	185 (30.6)	54 (27.0)
Female, n (%)	419 (69.4)	146 (73.0)
Lymph nodes
n	612	200
Benign, n (%)	293 (47.9)	100 (50.0)
Malignancy, n (%)	319 (52.1)	100 (50.0)
Right, n (%)	308 (50.3)	96 (48.0)
Left, n (%)	304 (49.7)	104 (52.0)
Diameter, mean (range)	0.90 cm (0.2–2.5 cm)	0.82 cm (0.2–2.5 cm)
Level 1, n	11	8
Level 2, n	128	43
Level 3, n	160	45
Level 4, n	140	40
Level 5, n	56	24
Op-bed or level 6, n	117	40

In this study, we included both preoperative and postoperative lymph nodes. In the preoperative setting, we included only lateral neck metastatic lymph nodes, whereas in the postoperative setting we included lateral neck metastases and thyroid bed recurrences. All lymph nodes were confirmed by fine-needle aspiration and/or washout thyroglobulin (Tg) analyses. Washout Tg levels were considered positive only if they were higher than the serum Tg level.

Data preprocessing

US images of the thyroid and lymph nodes contain various sources of noise, such as blood vessels and adipose and muscle tissues. Therefore, we applied a data augmentation strategy to these images so that our model would focus on the lymph nodes rather than on noise (17). To augment our US images while preventing overfitting of the model, each image was rotated by a random angle within ±15 degrees, and noisy background regions were randomly cropped out while keeping the lymph node in the image. Next, we resized all of the augmented images to 224 × 224 pixels to standardize the distance scale, and input the images into the convolutional neural network (CNN) model (18). As this step could be performed without label information, we also applied it to the validation and test sets used for model evaluation.
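The augmentation step can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function name `sample_augmentation`, the bounding-box representation of the lymph node, and the uniform sampling are hypothetical. Only the ±15 degree rotation range and the 224 × 224 pixel output size come from the text. The sketch samples augmentation parameters only; applying them to pixel data would require an image library.

```python
import random

TARGET_SIZE = 224      # network input side length in pixels (from the text)
MAX_ANGLE_DEG = 15.0   # rotation range of +/-15 degrees (from the text)


def sample_augmentation(image_w, image_h, node_box, rng):
    """Draw one random rotation angle and one crop that keeps the node.

    node_box is (left, top, right, bottom) in integer pixel coordinates.
    This representation is an assumption made for the sketch, which also
    ignores the small shift of the box caused by the rotation itself.
    """
    left, top, right, bottom = node_box
    angle = rng.uniform(-MAX_ANGLE_DEG, MAX_ANGLE_DEG)
    # Each crop edge is drawn between the image border and the node border,
    # so background is trimmed at random while the node is always retained.
    crop_left = rng.randint(0, left)
    crop_top = rng.randint(0, top)
    crop_right = rng.randint(right, image_w)
    crop_bottom = rng.randint(bottom, image_h)
    # Scale factors that map the crop onto the fixed network input size.
    scale_x = TARGET_SIZE / (crop_right - crop_left)
    scale_y = TARGET_SIZE / (crop_bottom - crop_top)
    return {
        "angle": angle,
        "crop": (crop_left, crop_top, crop_right, crop_bottom),
        "scale": (scale_x, scale_y),
    }
```

Calling the function repeatedly with the same image and node box yields a set of differently rotated and cropped views, each of which would be resized to 224 × 224 pixels before being passed to the network.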

Deep neural network

A typical CAD system is a type of classification model used to identify abnormal findings in medical images (5,19). However, recent studies have used the CNN model to evaluate segmentation techniques that automatically detect lesions. Overall, weakly supervised learning has attracted increasing attention because it can be used to detect the location of a meaningful object from an attention heatmap in the absence of location information (20–22). Zhou et al. proposed a method called class activation mapping, which uses global average pooling (GAP), and showed that objects can be located clearly without location information (23). This methodology allows classification and segmentation in a single model without location information; accordingly, we can infer the association of the localized region with the predicted label. Given these advantages, we used CNN-GAP to determine the locations of lymph nodes and to differentiate benign from malignant nodes.
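For reference, class activation mapping as described by Zhou et al. (23) can be written as follows. The notation is chosen here for illustration and does not appear in this article: $f_k(x, y)$ is the activation of feature map $k$ of the last convolutional layer at position $(x, y)$, $Z$ is the number of spatial positions, $w_k^c$ is the weight connecting pooled feature $k$ to class $c$, $S_c$ is the class score, and $M_c$ is the class activation map.

```latex
\begin{align}
F_k &= \frac{1}{Z} \sum_{x,y} f_k(x, y) \\
S_c &= \sum_k w_k^c F_k \\
M_c(x, y) &= \sum_k w_k^c f_k(x, y) \\
S_c &= \frac{1}{Z} \sum_{x,y} M_c(x, y)
\end{align}
```

The last line follows by substituting the first into the second and exchanging the sums. It shows that the class score is the spatial average of the map, which is why $M_c(x, y)$ indicates how much each position contributes to the prediction of class $c$.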

Training and diagnostic performance evaluation

We trained a CNN-GAP model that could simultaneously predict metastatic lymph nodes and their locations (Fig. 1). We initialized the network parameters with a model pretrained on ImageNet. We set the learning rate to 0.001 and decreased it by a factor of 10 when no further improvement was seen in the validation set accuracy. Training continued until the validation set accuracy began to decrease. To perform localization, we cropped the important region, defined as the area where the CNN-GAP network exhibited the maximum activation value, and set it as the lymph node location.
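The learning-rate and stopping rules can be illustrated with a short sketch that replays a sequence of validation accuracies. The initial rate of 0.001 and the tenfold reduction come from the text. The function name `replay_schedule`, the tolerance `min_delta`, and the exact interpretation (reduce the rate when accuracy is unchanged, stop at the first decrease) are assumptions made for illustration.

```python
def replay_schedule(val_accuracies, initial_lr=0.001, factor=10.0, min_delta=1e-6):
    """Replay per-epoch validation accuracies through the training rules.

    Returns the learning rate used in each epoch and the index of the epoch
    at which training stops.
    """
    lr = initial_lr
    best = float("-inf")
    lrs = []
    for epoch, acc in enumerate(val_accuracies):
        lrs.append(lr)
        if acc > best + min_delta:
            # Validation accuracy improved: keep the current learning rate.
            best = acc
        elif acc < best - min_delta:
            # Validation accuracy decreased: stop training at this epoch.
            return lrs, epoch
        else:
            # No further improvement: divide the learning rate by `factor`.
            lr /= factor
    return lrs, len(val_accuracies) - 1
```

For example, with validation accuracies of 0.80, 0.85, 0.85, 0.90, and 0.88, the rate is reduced after the third epoch (no improvement) and training stops at the fifth epoch (accuracy decreased).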

FIG. 1.

Workflow of data learning. Simultaneous classification and localization of a metastatic lymph node using our convolutional neural network–class activation mapping–based computer-aided diagnosis system.

When used as the model input, the validation and test data sets were also augmented as described for the preprocessing step. The model provided a prediction for every augmented input image, regardless of the malignant status. For each lymph node, the model selects as its final label whichever of benign or malignant accounts for the larger proportion of these predictions, and reports that proportion as a reliability score. Using the validation set (n = 63) and test set (n = 200), we evaluated the diagnostic performance of the CAD system, including the accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV).
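The label selection and reliability score can be sketched as a majority vote over the predictions for the augmented copies of one lymph node image. The function name `aggregate_predictions` and the expression of the score as a percentage are assumptions made for illustration.

```python
from collections import Counter


def aggregate_predictions(labels):
    """Return the majority label and the share of predictions that agree with it.

    labels holds one predicted label ("benign" or "malignant") per augmented
    copy of the same lymph node image.
    """
    counts = Counter(labels)
    final_label, votes = counts.most_common(1)[0]
    # Reliability expressed as a percentage (an assumption of this sketch).
    reliability = 100.0 * votes / len(labels)
    return final_label, reliability
```

If 40 of 55 augmented copies are predicted as malignant, the final label is malignant with a reliability of about 72.7. With an odd number of copies and two classes, ties cannot occur.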

Results

Diagnostic performance

We measured the accuracy, sensitivity, specificity, PPV, and NPV of our model in both validation and test data sets to evaluate the diagnostic performance for the classification of metastatic lymph nodes (Table 2). The accuracy of the model was 96.8% in the validation set and 83.0% in the test set. The respective sensitivity and specificity rates were 93.9% and 100% in the validation set and 89.0% and 77.0% in the test set. The respective PPV and NPV were 100.0% and 93.8% in the validation set and 79.5% and 87.5% in the test set.

Table 2.

Performance Evaluation of Computer-Aided Diagnosis System for Validation and Test Data Sets

Samples	Accuracy	Sensitivity	Specificity	Positive predictive value	Negative predictive value
Validation samples (n = 63)	96.8%	93.9%	100.0%	100.0%	93.8%
Test samples (n = 200)	83.0%	89.0%	77.0%	79.5%	87.5%
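The five measures follow directly from the counts of true and false positives and negatives. The sketch below shows the arithmetic. The counts in the usage note are back-calculated here from the reported percentages and set sizes, and are not taken from the article; the function name `diagnostic_metrics` is hypothetical.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Compute diagnostic performance measures (in percent) from counts.

    tp/fn: metastatic nodes predicted as malignant/benign.
    tn/fp: benign nodes predicted as benign/malignant.
    """
    total = tp + fn + tn + fp
    return {
        "accuracy": 100.0 * (tp + tn) / total,
        "sensitivity": 100.0 * tp / (tp + fn),
        "specificity": 100.0 * tn / (tn + fp),
        "ppv": 100.0 * tp / (tp + fp),
        "npv": 100.0 * tn / (tn + fn),
    }
```

For the test set of 100 benign and 100 metastatic nodes, a sensitivity of 89.0% and a specificity of 77.0% correspond to 89 true positives, 11 false negatives, 77 true negatives, and 23 false positives. These counts give an accuracy of 166/200 = 83.0%, a PPV of 89/112 = 79.5%, and an NPV of 77/88 = 87.5%.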

Reliability score of the model

The representation of model uncertainty is an important performance evaluation indicator in machine learning (24,25). As our model evaluates the presence of a metastatic lymph node multiple times (55 times) on the augmented input images, we can evaluate how consistently the model predicts the outcome as a measure of uncertainty. We used “reliability” to refer to the consistency of the model predictions. We plotted a density graph of the reliability scores to compare the consistency between correct and incorrect cases (Fig. 2A). The median difference in reliability scores between the correct and incorrect cases was 22. The two reliability score distributions were compared using the Kruskal–Wallis test, which showed a significant difference (p < 0.001). In other words, the more consistently the model predicts the label across the augmented images, the more likely the final label is to be correct. We also measured the accuracy, sensitivity, and specificity of the model while sequentially increasing the threshold of reliability (Fig. 2B). Pearson's correlations between the threshold of the reliability score and the model performance parameters of accuracy, sensitivity, and specificity were 0.9516, 0.9391, and 0.9158, respectively. As the threshold for reliability increased, the model performance improved but sample coverage decreased (Fig. 2C). Therefore, less reliable results should be filtered out.
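The trade-off between performance and sample coverage can be sketched by keeping only the cases whose reliability score reaches a threshold and recomputing accuracy on the retained cases. The function name `performance_by_threshold` and the tuple layout are assumptions made for illustration, and the data in the usage note are hypothetical.

```python
def performance_by_threshold(cases, thresholds):
    """Accuracy and coverage of the cases retained at each reliability threshold.

    cases is a list of (reliability, predicted_label, true_label) tuples.
    Returns one (threshold, accuracy, coverage) row per threshold; accuracy
    is None when no case reaches the threshold.
    """
    rows = []
    for threshold in thresholds:
        kept = [case for case in cases if case[0] >= threshold]
        coverage = len(kept) / len(cases)
        if not kept:
            rows.append((threshold, None, coverage))
            continue
        correct = sum(1 for _, predicted, truth in kept if predicted == truth)
        rows.append((threshold, correct / len(kept), coverage))
    return rows
```

With four hypothetical cases scored 95, 90, 60, and 55, of which only the case scored 60 is misclassified, a threshold of 50 retains all cases (accuracy 0.75, coverage 1.0), whereas a threshold of 80 retains two cases (accuracy 1.0, coverage 0.5).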

FIG. 2.

Reliability characteristics and performance of the model. (A) Difference in the reliability score distributions between correct and incorrect cases (Kruskal–Wallis test, p < 0.001; median difference 22; chi-squared 18.69). (B) Accuracy, sensitivity, and specificity of the model as a function of the reliability threshold. (C) Sample coverage as a function of the reliability threshold.

Localization of the lymph node

Because our model is designed to distinguish benign from metastatic lymph nodes, it is expected to focus on the lymph node areas of an image when predicting metastasis. Therefore, we overlaid an attention heatmap drawn using GAP on the US image to allow us to infer the location of the lymph node (26). A map of a predicted malignant lymph node would lose its original shape and echogenic tumor focus (Fig. 3A). The GAP-drawn attention heatmap emphasizes the location of infiltration. The lymph node tumor textural signal is known to correspond with malignancy (27). Next, only a part of a benign lymph node was extracted and used as the model input to draw a heatmap (Fig. 3B). The purpose of this experiment was to examine in detail how the convolutional neural network (CNN) internally classifies the presence of metastasis in a lymph node, rather than to assess localization. In the attention heatmap, the model emphasized the fatty hilum, a feature that the model identified by itself, and correctly classified the node as benign (Fig. 3B). A fatty hilum is mainly characteristic of benign lymph nodes (28,29). Therefore, the model focuses on the medically and biologically important parts of an input image to classify the node as benign or malignant. The attention heatmap can also be used to infer the location of the lymph node. To this end, we cropped the important region containing the maximum CNN-GAP network activation value and drew a rectangle on the US image to show the location of the lymph node. The predicted location corresponded to the location of the actual lymph node as drawn by the radiologist.
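One way to turn an attention heatmap into a rectangle is sketched below. The article states only that the region containing the maximum activation was cropped. The cutoff at a fraction of the peak value, the 4-connected region growing, and the function name `locate_from_heatmap` are assumptions made for illustration, not the procedure used in the study.

```python
def locate_from_heatmap(heatmap, fraction=0.5):
    """Bounding box of the connected high-activation region around the peak.

    heatmap is a list of rows of non-negative activation values.
    Returns (left, top, right, bottom) as inclusive cell indices.
    """
    n_rows, n_cols = len(heatmap), len(heatmap[0])
    # Locate the cell with the maximum activation.
    peak = max(
        ((r, c) for r in range(n_rows) for c in range(n_cols)),
        key=lambda rc: heatmap[rc[0]][rc[1]],
    )
    cutoff = fraction * heatmap[peak[0]][peak[1]]
    # Grow a 4-connected region from the peak over cells at or above the cutoff.
    region = {peak}
    stack = [peak]
    while stack:
        r, c = stack.pop()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            inside = 0 <= nr < n_rows and 0 <= nc < n_cols
            if inside and (nr, nc) not in region and heatmap[nr][nc] >= cutoff:
                region.add((nr, nc))
                stack.append((nr, nc))
    row_idx = [r for r, _ in region]
    col_idx = [c for _, c in region]
    return (min(col_idx), min(row_idx), max(col_idx), max(row_idx))
```

Because the region is grown from the peak, a second, isolated area of high activation elsewhere in the image does not enlarge the rectangle. The resulting cell indices would still need to be scaled to the pixel coordinates of the US image before drawing.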

FIG. 3.

Attention heatmap derived using the global average pooling method. (A) Malignant case wherein the attention heatmap focuses on the infiltrated lymph node. (B) Benign case involving an enlarged lymph node wherein the attention heatmap focuses on fatty hilum activation.

Discussion

In this study, we developed a deep learning–based CAD system to localize and differentiate metastatic lymph nodes from thyroid cancer on US. Using our test set, we demonstrated that our CAD system exhibited an acceptable diagnostic performance, with 83.0% accuracy, 89.0% sensitivity, and 77.0% specificity. This CAD system also provided a reliability score that allowed us to judge the reliability of the predicted labels. Moreover, this system used weakly supervised learning to determine the location of the lymph node and infer important areas related to metastasis on US images. To our knowledge, this is the first study involving the development and validation of a deep learning–based CAD system specific for lymph node metastases of thyroid cancers. We expect that future CAD systems with better performance could be useful for ruling out metastatic lymph nodes on US and may provide a simple method for decisions regarding fine-needle aspiration biopsy (30).

A recent meta-analysis of metastatic lymph node studies reported that US achieved a sensitivity of 71% [95% confidence interval (CI) 57–82%] and specificity of 85% [95% CI 64–95%] (31). In comparison, CT achieved a sensitivity of 70% [95% CI 59–80%] and specificity of 89% [95% CI 81–94%]. In that meta-analysis, the summarized sensitivity of combined CT/US (69%) was significantly higher than that of US alone (51%, p = 0.011). When we compared the diagnostic performance of our CAD system with those in the meta-analysis, our CAD system achieved a better sensitivity (89.0% vs. 71%) but lower specificity (77.0% vs. 85%) (31). This comparison suggests that our highly sensitive CAD system is useful as a screening tool; however, the low specificity suggests that the screening results should be validated by experienced clinicians. We note that a direct comparison of our study with the meta-analysis is of limited validity because of differences between the studies. First, the meta-analysis applied a level-by-level analysis to the included studies, whereas we performed a node-by-node analysis. Second, we enrolled only lateral neck lymph nodes screened during preoperative examinations. Therefore, the exclusion of central neck lymph nodes observed during the same examination may have improved the diagnostic performance in our study. Consistent with our findings, the meta-analysis indicated that US exhibits better sensitivity and specificity in the lateral neck than in the central neck. Third, the classification performance of our model depends on its ability to localize the lymph node. For this reason, the accuracy, sensitivity, and specificity measured on our test data may be underestimated.

Generally, CAD systems have several advantages over human evaluation (32). For US images, human evaluation is subjective and dependent on the operator's experience (33,34). By contrast, a CAD system provides consistent predictions for the same input, which could potentially eliminate the obstacle of interobserver variability. Moreover, a CAD system can be extended in the future (35). Briefly, the CAD system could be upgraded to incorporate accumulating reports on false positive results.

Neural network-based methods are usually referred to as “black box” models because it is difficult to determine the internal relationship between the predicted label and input feature (36). However, as our method simultaneously predicts the label and constructs the attention heatmap, we can partially infer the important parts of the image when the model predicts a lymph node metastasis. Moreover, it was difficult to detect lymph nodes on US images during the development of our CAD system. Various sources of US noise, such as blood vessels, the trachea, the esophagus, and adipose and muscle tissue, interfere with the detection of true lymph nodes. Therefore, we found it effective to extract only the lymph node region for use as a training set and thus increase the performance of our model. The use of the region of interest as an input has already been implemented in S-Detect for Thyroid, and the application of this technique to our model would improve the performance (7). Real-time application of our model on US machines would also improve its usefulness. Additionally, although the current diagnostic performance of our model is good, we have not yet performed external validation. In the future, we will build on this pilot study by training the model on more data and evaluating its performance using external validation data sets.

The limitations of this study include its retrospective nature and the resulting dependence on the composition of the data. Additionally, the quality of the images had some variability because the US examinations were performed by multiple physicians; however, all procedures were performed under the supervision of experienced faculty members in order to assure adequate lymph node imaging. For development of the CAD system, we enrolled only lesions that were confirmed by fine-needle aspiration cytology, and we performed a node-by-node study rather than a level-by-level study. Since central lymph nodes were evaluated only in the postoperative setting, this may also influence the results. Furthermore, the sonographic features of thyroid bed recurrences can be significantly different.

In conclusion, we developed a deep learning–based CAD system for the localization and differentiation of lymph node metastases from thyroid cancer on US images. This highly sensitive CAD system may be useful as a screening tool; however, its low specificity suggests that the results should be validated by experienced clinicians. Additionally, the clinical application of this CAD system requires further validation in a large population study.

Footnotes

Acknowledgments

The authors thank Lee Sun Ah and Chung Han Cheol for their contributions to this study.

Author Disclosure Statement

This research received no specific grant from any funding agency in the public, commercial or not-for-profit sectors. No competing financial interests exist.

References

1. Doi. 2007. Computer-aided diagnosis in medical imaging: historical review, current status and future potential. Comput Med Imaging Graph, 31:198–211.

2. van Ginneken, Novak. 2012. Computer-aided diagnosis. Proc SPIE, 8315:831501–831583.

3. Gulshan, Peng, Coram, Stumpe, Wu, Narayanaswamy, Venugopalan, Widner, Madams, Cuadros. 2016. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA, 316:2402–2410.

4. Greenspan, van Ginneken, Summers. 2016. Guest editorial deep learning in medical imaging: Overview and future promise of an exciting new technique. IEEE Trans Med Imaging, 35:1153–1159.

5. Esteva, Kuprel, Novoa, Ko, Swetter, Blau, Thrun. 2017. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542:115–118.

6. Chang, Paul, Kim, Baek, Choi, Ha, Lee, Shin, Kim. 2016. Computer-aided diagnosis for classifying benign versus malignant thyroid nodules based on ultrasound images: A comparison with radiologist-based assessments. Med Phys, 43:554.

7. Choi, Baek, Park, Shim, Kim, Shong, Lee. 2017. A computer-aided diagnosis system using artificial intelligence for the diagnosis and characterization of thyroid nodules on ultrasound: initial clinical assessment. Thyroid, 27:546–552.

8. , Chen, Ho, Tai, Wang, Chen, Chang. 2016. Quantitative analysis of echogenicity for patients with thyroid nodules. Sci Rep, 6:35632.

9. Mazzaferri. 2007. Management of low-risk differentiated thyroid cancer. Endocr Pract, 13:498–512.

10. Hay. 2007. Management of patients with low-risk papillary thyroid carcinoma. Endocr Pract, 13:521–533.

11. Wong, Hynes. 2006. Lymphatic or hematogenous dissemination: how does a metastatic tumor cell decide? Cell Cycle, 5:812–817.

12. Adam, Pura, Goffredo, Dinan, Reed, Scheri, Hyslop, Roman, Sosa. 2015. Presence and number of lymph node metastases are associated with compromised survival for patients younger than age 45 years with papillary thyroid cancer. J Clin Oncol, 33:2370–2375.

13. Mazzaferri, Jhiang. 1994. Long-term impact of initial surgical and medical therapy on papillary and follicular thyroid cancer. Am J Med, 97:418–428.

14. Haugen. 2017. 2015 American Thyroid Association management guidelines for adult patients with thyroid nodules and differentiated thyroid cancer: what is new and what has changed? Cancer, 123:372–381.

15. Russ, Bonnema, Erdogan, Durante, Ngu, Leenhardt. 2017. European Thyroid Association guidelines for ultrasound malignancy risk stratification of thyroid nodules in adults: The EU-TIRADS. Eur Thyroid J, 6:225–237.

16. Shin, Baek, Chung, Ha, Kim, Lee, Lim, Moon, Na, Park, Choi, Hahn, Jeon, Jung, Kim, Kwak, Lee, Park, Sung, Korean Society of Thyroid, Korean Society of. 2016. Ultrasonography diagnosis and imaging-based management of thyroid nodules: revised Korean Society of Thyroid Radiology consensus statement and recommendations. Korean J Radiol, 17:370–395.

17. Roth, Lu, Liu, Yao, Seff, Cherry, Kim, Summers. 2016. Improving computer-aided detection using convolutional neural networks and random view aggregation. IEEE Trans Med Imaging, 35:1170–1181.

18. Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke, Rabinovich. 2015. Going deeper with convolutions. Presentation at IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, 2015.

19. Gulshan, Peng, Coram, Stumpe, Wu, Narayanaswamy, Venugopalan, Widner, Madams, Cuadros, Kim, Raman, Nelson, Mega, Webster. 2016. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA, 316:2402–2410.

20. Pathak, Krahenbuhl, Darrell. 2015. Constrained convolutional neural networks for weakly supervised segmentation. In: Proceedings of the IEEE International Conference on Computer Vision, pp 1796–1804.

21. Song, Girshick, Jegelka, Mairal, Harchaoui, Darrell. 2014. On learning to localize objects with minimal supervision. arXiv preprint arXiv:1403.1024.

22. Selvaraju, Cogswell, Das, Vedantam, Parikh, Batra. 2016. Grad-cam: Visual explanations from deep networks via gradient-based localization. https://arxiv.org/abs/1610.02391v3.

23. Zhou, Khosla, Lapedriza, Oliva, Torralba. 2016. Learning deep features for discriminative localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 2921–2929.

24. Krzywinski, Altman. 2013. Points of significance: importance of being uncertain. Nat Methods, 10:809–810.

25. Ghahramani. 2015. Probabilistic machine learning and artificial intelligence. Nature, 521:452–459.

26. Lin, Chen, Yan. 2013. Network in network. arXiv preprint arXiv:1312.4400.

27. Matsubayashi, Kawai, Matsumoto, Mukuta, Morita, Hirai, Matsuzuka, Kakudoh, Kuma, Tamai. 1995. The correlation between papillary thyroid carcinoma and lymphocytic infiltration in the thyroid gland. J Clin Endocrinol Metab, 80:3421–3424.

28. Sohn, Kwak, Kim, Moon, Kim. 2010. Diagnostic approach for evaluation of lymph node metastasis from thyroid cancer using ultrasound and fine-needle aspiration biopsy. AJR Am J Roentgenol, 194:38–43.

29. Bruneton, Balu-Maestro, Marcy, Melia, Mourou. 1994. Very high frequency (13 MHz) ultrasonographic examination of the normal neck: detection of normal lymph nodes and thyroid nodules. J Ultrasound Med, 13:87–90.

30. Kim E-K, Park, Chung, Oh, Kim, Lee, Yoo. 2002. New sonographic criteria for recommending fine-needle aspiration biopsy of nonpalpable solid nodules of the thyroid. Am J Roentgenol, 178:687–691.

31. Suh, Baek, Choi, Lee. 2017. Performance of CT in the preoperative diagnosis of cervical lymph node metastasis in patients with papillary thyroid cancer: a systematic review and meta-analysis. AJNR Am J Neuroradiol, 38:154–161.

32. Doi. 2007. Computer-aided diagnosis in medical imaging: historical review, current status and future potential. Comput Med Imaging Graph, 31:198–211.

33. Kim, Lee, Nam, Kim, Moon, Yoon, Han, Kwak. 2017. Ultrasound texture analysis: Association with lymph node metastasis of papillary thyroid microcarcinoma. PLoS One, 12:e0176103.

34. Lee, Yoon, Seo, Kim, Baek, Lim, Cho, Yun. 2018. Intraobserver and interobserver variability in ultrasound measurements of thyroid nodules. J Ultrasound Med, 37:173–178.

35. , Baek, Na. 2017. Risk stratification of thyroid nodules on ultrasonography: current status and perspectives. Thyroid, 27:1463–1468.

36. Augasta, Kathirvalavakumar. 2012. Reverse engineering the neural networks for rule extraction in classification problems. Neural Process Lett, 35:131–150.