Abstract
Background:
An initial clinical assessment is described of a new, commercially available, computer-aided diagnosis (CAD) system using artificial intelligence (AI) for thyroid ultrasound, and its performance is evaluated in the diagnosis of malignant thyroid nodules and categorization of nodule characteristics.
Methods:
Patients with thyroid nodules with a decisive diagnosis, whether benign or malignant, were consecutively enrolled from November 2015 to February 2016. An experienced radiologist reviewed the ultrasound image characteristics of the thyroid nodules, while another radiologist assessed the same thyroid nodules using the CAD system, which provided ultrasound characteristics and a diagnosis of whether nodules were benign or malignant. The diagnostic performance of the experienced radiologist and the CAD system, and their agreement on ultrasound characteristics, were compared.
Results:
In total, 102 thyroid nodules from 89 patients were included; 59 (57.8%) were benign and 43 (42.2%) were malignant. The CAD system showed sensitivity similar to that of the experienced radiologist (90.7% vs. 88.4%, p > 0.99), but lower specificity and a lower area under the receiver operating characteristic curve (AUROC) (specificity: 74.6% vs. 94.9%, p = 0.002; AUROC: 0.83 vs. 0.92, p = 0.021). Classifications of the ultrasound characteristics (composition, orientation, echogenicity, and spongiform) by the radiologist and the CAD system were in substantial agreement (κ = 0.659, 0.740, 0.733, and 0.658, respectively), while the margin showed fair agreement (κ = 0.239).
Conclusion:
The sensitivity of the CAD system using AI for malignant thyroid nodules was as good as that of the experienced radiologist, while specificity and accuracy were lower than those of the experienced radiologist. The CAD system showed an acceptable agreement with the experienced radiologist for characterization of thyroid nodules.
Introduction
Computer-aided diagnosis (CAD) is a newly developed technology for the characterization of thyroid lesions on US. Preliminary investigative studies have shown the intrinsic value of thyroid US CAD systems (9–11), with the systems demonstrating excellent diagnostic accuracy for thyroid malignancies. However, these studies used noncommercial CAD systems that were developed independently by research teams. In these investigative studies, the patients with nodules were not enrolled within a clinical setting, and the studies did not assess the contribution of the thyroid US CAD system to the diagnostic performance of a radiologist. Additionally, these studies did not compare the diagnostic performance of the thyroid US CAD systems with that of radiologists.
Following the commercialization of the first thyroid US CAD system using artificial intelligence (AI; S-Detect for Thyroid, Samsung Medison Co., Seoul, South Korea), the clinical role of the system was prospectively evaluated by comparing its performance in the diagnosis of malignant thyroid nodules and classification of US characteristics with that of an experienced radiologist, using pathologic and radiologic results for reference.
Materials and Methods
This prospective study was supported by a grant and provision of equipment from Samsung Medison Co. in Seoul, South Korea. The authors had full control of the data and the information submitted for publication. The prospective study protocol was reviewed and approved by the Institutional Review Board of the authors' hospital. Written informed consent to undergo the US protocol was obtained from all patients before each examination. The methods were conducted in accordance with the Standards for Reporting Diagnostic Accuracy (STARD) statement (12).
Patients
Study patients were recruited from a tertiary referral hospital in Seoul, Korea, between November 2015 and February 2016. Potentially eligible patients were those requiring US for follow-up or preoperative evaluation of thyroid nodules. Patients older than 18 years of age who had a thyroid nodule with a decisive diagnosis were eligible for this study. Patients with a small thyroid nodule <5 mm or with thyroid nodules without decisive diagnosis were not eligible. Decisive diagnosis consisted of a malignant or benign diagnosis. A malignant diagnosis was made when malignancy was confirmed on the surgical specimen or by core-needle biopsy (CNB) histology or FNA cytology. A diagnosis of a benign nodule was made when any one of the following criteria was met: (i) confirmation using a surgical specimen; (ii) benign CNB histology or FNA cytology findings; or (iii) US findings of very low suspicion (spongiform or partially cystic nodules without any sonographic feature in the low, intermediate, or high suspicion patterns) or benign (pure cystic nodules) (1).
US image acquisition and analysis
US examinations were performed using an RS80A US system (Samsung Medison Co., Seoul, South Korea) equipped with a linear high-frequency probe (frequency range 3–12 MHz). The real-time CAD system software using AI (S-Detect for Thyroid; Samsung Medison Co.) was integrated into the US system. This real-time CAD software provided two points to indicate the top-left and bottom-right of a region of interest (ROI) box enclosing a thyroid nodule on the US system. On the basis of the given box, the software calculated the contour of the mass to distinguish it from normal thyroid tissue (segmentation). US findings of the segmented mass, including size (maximum diameter), composition (solid, partially cystic, or cystic), shape (oval-to-round or irregular), orientation (parallel or non-parallel), margins (well-defined, ill-defined, or spiculated), echogenicity (hyperechoic/isoechoic or hypoechoic/marked hypoechoic), and spongiform are quantified into computerized values, and presented as features to describe the thyroid nodule. Consequently, the software automatically displays the features of the mass in real time, and presents a diagnosis as to whether the nodule is possibly benign or malignant (Fig. 1).

Representative case of malignant (
Two separate US image analysis sessions were performed. The first session was performed by one radiologist using the CAD system, while the second session was performed without the CAD system by a different radiologist with 20 years of experience in performing US thyroid examinations. Neither of the reviewers had any information regarding a patient's clinical history, previous imaging results, or previous biopsy results. In the first US image analysis session, one radiologist drew the ROI box enclosing a thyroid nodule on transverse US images, and evaluated the quality of the nodule segmentation, US findings, and nodule diagnosis provided by the CAD system. Nodule segmentation was performed once per nodule. Nodule segmentation was assessed visually and classified into one of three categories: (i) excellent—the segmented part completely matched the nodule; (ii) satisfactory—although not perfect, the segmented volume was still representative of the nodule, and the maximum contour mismatch between the overlay and nodule was visually estimated to not exceed 30%; and (iii) poor—part of the nodule was segmented, but the segmented contour was not representative of the nodule (estimated mismatch >30%) (13,14). The first two categories (excellent and satisfactory categories) were regarded as successful nodule segmentation. In the second image analysis session, another experienced radiologist examined thyroid nodules on transverse US images, and evaluated the following features: composition (solid, partially cystic, or cystic); shape (oval-to-round or irregular); orientation (parallel or non-parallel); margins (well-defined, ill-defined, or spiculated); echogenicity (hyperechoic/isoechoic or hypoechoic/marked hypoechoic); spongiform; presence of echogenic dots suggestive of microcalcifications; and presence of macrocalcifications. 
The US criteria for malignant nodules included non-parallel orientation, spiculated margin, marked hypoechogenicity, and the presence of microcalcifications (6,15).
Outcome measures
The primary outcomes included the performance of the CAD system for diagnosis of malignant thyroid nodules compared with that of an experienced radiologist. The secondary outcome measures included the diagnostic performance of the CAD system for malignant thyroid nodules with a maximum transverse diameter >1 cm, inter-observer agreement of US findings between the CAD system and experienced radiologist, and the quality of the CAD system nodule segmentation.
Data and statistical analysis
The data are presented as means and standard deviations for continuous variables, and as the number of patients and nodules for categorical variables. For the CAD system and experienced radiologist, the sensitivity, specificity, positive predictive value, negative predictive value, and accuracy of thyroid malignancy diagnosis (all nodules and malignant thyroid nodules >1 cm across the maximum transverse diameter) were calculated. McNemar's test was used to compare the diagnostic sensitivity and specificity of the CAD system and the experienced radiologist. Additionally, the areas under the receiver operating characteristic curve (AUROC) for the CAD system and experienced radiologist were compared using the method of DeLong et al. (16). Cohen's kappa coefficient and proportional agreement were used to analyze the inter-observer agreement on each of the US thyroid nodule findings. A Cohen's kappa value of 0.00–0.20 implied slight agreement, 0.21–0.40 implied fair agreement, 0.41–0.60 implied moderate agreement, 0.61–0.80 implied substantial agreement, and 0.81–0.99 implied almost perfect agreement (17). Fisher's exact test was used to compare successful nodule segmentation according to the final diagnosis (benign vs. malignant). All statistical analyses were performed using MedCalc for Windows v15.0 (MedCalc Software, Ostend, Belgium), and a p-value of <0.05 was considered statistically significant.
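The paired comparison of sensitivity and specificity described above can be illustrated with a minimal Python sketch of an exact (binomial) McNemar test. The function name `mcnemar_exact` and the discordant-pair counts in the usage lines are hypothetical: the study reports only marginal results, not the paired table, and the statistical package used by the authors may compute the test differently.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant-pair counts.

    b: nodules classified correctly by reader A but incorrectly by reader B
    c: nodules classified incorrectly by reader A but correctly by reader B
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Probability of a split at least as extreme as the observed one (one tail)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts for illustration only (not reported in the study).
p_value = mcnemar_exact(13, 1)
print(f"p = {p_value:.4f}")
```

Only the discordant pairs enter the test; nodules on which both readers agree do not affect the p-value, which is why two readers with similar marginal sensitivity can still yield p close to 1.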
Results
Of 172 participants who underwent US for a thyroid nodule and who were assessed for eligibility, 83 (48.3%) patients were identified as being ineligible for this study. The reasons for exclusion were a small thyroid nodule (<5 mm) in 14 patients, and thyroid nodules without a decisive diagnosis in 69 patients. A total of 89 patients (mean age, 45.3 years; range 25–76 years) with 102 thyroid nodules (benign: n = 59, 57.8%; malignant: n = 43, 42.2%) were included in this study (Table 1). Among the benign nodules, the diagnosis was based on US findings for nine nodules (spongiform: n = 4; pure cystic nodules: n = 5), on benign CNB histology or FNA cytology findings for 48 nodules, and on a surgical specimen for two nodules. Among the malignant nodules (n = 43), the diagnosis was confirmed from a surgical specimen for 17 nodules and was based on malignant CNB histology or FNA cytology findings for the other 26 nodules. All malignant nodules were papillary thyroid carcinomas.
Note: Data are means ± standard deviations, unless otherwise specified.
CNB, core-needle biopsy; FNA, fine-needle aspiration; US, ultrasound.
Primary outcomes
Performance of the CAD system and radiologist in the diagnosis of thyroid malignancy
Table 2 lists the performance measures of the CAD system and experienced radiologist in the diagnosis of thyroid malignancy. The experienced radiologist demonstrated a diagnostic specificity for thyroid malignancy that was significantly higher than that of the CAD system (94.9% vs. 74.6%, respectively; p = 0.002), but there was no statistical difference in diagnostic sensitivity between the two (sensitivity: 88.4% vs. 90.7%, respectively; p > 0.99). The AUROC for diagnosis of thyroid malignancy was significantly higher for the radiologist than for the CAD system (0.92 vs. 0.83, respectively; p = 0.021).
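The reported percentages can be cross-checked with a short Python sketch. The 2×2 counts below are an assumption: they are back-calculated from the reported sensitivity and specificity together with the 43 malignant and 59 benign nodules, not copied from Table 2. The names `Counts` and `summarize` are illustrative. For a reader that gives a single benign-or-malignant call, the empirical ROC curve has one operating point, so its trapezoidal area reduces to (sensitivity + specificity) / 2.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    tp: int  # malignant nodules called malignant
    fn: int  # malignant nodules called benign
    tn: int  # benign nodules called benign
    fp: int  # benign nodules called malignant


def summarize(c: Counts) -> dict:
    """Standard diagnostic accuracy measures from a 2x2 table."""
    sens = c.tp / (c.tp + c.fn)
    spec = c.tn / (c.tn + c.fp)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "ppv": c.tp / (c.tp + c.fp),
        "npv": c.tn / (c.tn + c.fn),
        "accuracy": (c.tp + c.tn) / (c.tp + c.fn + c.tn + c.fp),
        # Single operating point: trapezoidal AUROC = (sens + spec) / 2
        "auroc": (sens + spec) / 2,
    }


# Counts back-calculated from the reported percentages (an assumption).
cad = Counts(tp=39, fn=4, tn=44, fp=15)
radiologist = Counts(tp=38, fn=5, tn=56, fp=3)

for name, counts in (("CAD", cad), ("Radiologist", radiologist)):
    row = summarize(counts)
    print(name, {key: round(value, 3) for key, value in row.items()})
```

Under these assumed counts, the sketch reproduces the sensitivity, specificity, and AUROC values quoted in the text for both readers.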
Numbers in square brackets are the confidence intervals.
McNemar's test was used to compare the diagnostic sensitivity and specificity of the CAD system and radiologist.
Areas under the ROC curve between the CAD system and radiologist were compared using the method of DeLong et al. (16).
ROC, receiver operating characteristic.
Secondary outcomes
Diagnostic performance of the CAD system and radiologist for thyroid malignancies >1 cm
Table 2 shows the diagnostic performance of the CAD system and experienced radiologist in the diagnosis of thyroid malignancies >1 cm across their maximum transverse diameter. The radiologist's diagnostic specificity for thyroid malignancy was significantly higher than that of the CAD system (97.4% vs. 71.8%, respectively; p = 0.006), but there was no statistical difference in diagnostic sensitivity between the radiologist and the CAD system (sensitivity: 92.9% vs. 100%, respectively; p > 0.99). The AUROC for the radiologist's diagnoses of thyroid malignancy was higher than that for the CAD system, although the difference was not statistically significant (0.95 vs. 0.86, respectively; p = 0.084).
Inter-observer variability of US characteristics between the CAD system and radiologist
A summary of the inter-observer variability in US characteristics between the CAD system and radiologist is shown in Table 3. Substantial agreement (κ = 0.61–0.80) was seen for all characteristics except the margin, which showed fair agreement (κ = 0.239). The kappa value for shape could not be calculated because all nodules had an oval-to-round shape. Except for composition and margin, proportional agreement was >80%.
Numbers in parentheses are standard errors.
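As a minimal sketch of how the agreement statistics are obtained, the Python code below computes Cohen's kappa from a square agreement table and maps it to the bands given in the statistical analysis section. The function names and the example table are hypothetical; the per-cell counts behind Table 3 are not available in the text. The second example shows why kappa is undefined when both readers place every nodule in the same category, as happened for shape.

```python
def cohen_kappa(table):
    """Cohen's kappa for a square table (rows: reader A, columns: reader B).

    Returns None when chance agreement equals 1, in which case kappa is undefined.
    """
    n = sum(sum(row) for row in table)
    k = len(table)
    observed = sum(table[i][i] for i in range(k)) / n
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / n ** 2
    if expected == 1.0:
        return None
    return (observed - expected) / (1 - expected)


def agreement_band(kappa):
    """Map kappa to the bands used in this study."""
    if kappa < 0:
        return "below chance (outside the listed bands)"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Hypothetical 2x2 agreement table for 102 nodules (illustration only).
example = [[40, 5], [7, 50]]
kappa = cohen_kappa(example)
print(round(kappa, 3), agreement_band(kappa))

# Every nodule rated oval-to-round by both readers: kappa cannot be computed.
print(cohen_kappa([[102, 0], [0, 0]]))
```

Proportional agreement is simply the `observed` term; kappa discounts it by the agreement expected from the marginal totals alone, which is why a characteristic can show high proportional agreement but only a modest kappa.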
Quality of CAD system nodule segmentation
Excellent, satisfactory, and poor segmentations were observed in 20.6% (n = 21), 66.7% (n = 68), and 12.7% (n = 13) of nodules, respectively. For benign thyroid nodules, excellent, satisfactory, and poor segmentations were observed in 15.3% (n = 9), 66.1% (n = 39), and 18.6% (n = 11) of nodules, respectively. For malignant thyroid nodules, excellent, satisfactory, and poor segmentations were observed in 27.9% (n = 12), 67.4% (n = 29), and 4.7% (n = 2) of nodules, respectively. Successful nodule segmentation was significantly more frequent with malignant thyroid nodules than it was with benign thyroid nodules (p = 0.04). Representative cases are detailed in Figure 1.
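The comparison of segmentation success between benign and malignant nodules can be reproduced from the counts above with a minimal Python sketch of a two-sided Fisher's exact test. The function name `fisher_exact_two_sided` is illustrative, and the two-sided rule used here (summing all tables no more probable than the observed one) is an assumption about how the reported p-value was computed.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # Sum every table that is no more probable than the observed one;
    # the small tolerance guards against floating-point ties.
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


# Rows: benign, malignant. Columns: successful (excellent + satisfactory), poor.
p_value = fisher_exact_two_sided(48, 11, 41, 2)
print(f"p = {p_value:.3f}")
```

Successful segmentation here pools the excellent and satisfactory categories, giving 48 of 59 benign nodules and 41 of 43 malignant nodules.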
Discussion
In this prospective study, an initial clinical assessment was performed of the diagnostic performance of a commercial CAD system using AI for evaluation of malignant thyroid nodules on US. The study demonstrates that the sensitivity of the CAD system in detecting thyroid malignancies is similar to that of an experienced radiologist. However, the CAD system showed lower specificity and accuracy than the experienced radiologist. Inter-observer variability of US characteristics between the CAD system and radiologist showed relatively good agreement (substantial agreement, except for margin).
Compared with CAD systems for lung and breast cancer, only a few studies have evaluated CAD systems for diagnosing thyroid malignancy on US (9–11). These exploratory studies showed excellent accuracy (98.3–100%). However, they did not reflect a true clinical setting, as the systems were independently developed by research teams, and study enrollment was not based on clinical practice. The standalone performance of these CAD systems does not necessarily predict their influence on radiologists when used in clinical practice (18). Additionally, these previously documented thyroid US CAD systems were not practical for real-time use, and required complex computer analysis and substantial processing time (9–11). The thyroid US CAD system used in this study was installed within the US system, and allowed the use of CAD in a real-time clinical setting. Therefore, real-time decision making on the necessity for FNA is possible with the present system. This system would be easier to use in routine practice because of its simplicity and the reduced analysis time. A few studies have assessed the diagnostic performance of a commercialized CAD system for radiologists characterizing breast lesions, and they demonstrated that a breast US CAD system appears to be a useful tool for improving the diagnosis of malignant lesions for less experienced radiologists, but has little added value for experienced radiologists (19). From the results of the present prospective clinical study, it is assumed that the thyroid US CAD system does not provide added diagnostic value for an experienced radiologist, as the CAD system demonstrated lower specificity and accuracy than the experienced radiologist. However, the thyroid US CAD system did provide considerable sensitivity (90.7%) and negative predictive value (91.7%), which were similar to those of the experienced radiologist.
Considering the time required for interpretation of US thyroid images, an experienced radiologist could possibly save time if the computer output was used as a second opinion, with the radiologist making the final decision in the positive prediction cases identified by the CAD system. Additionally, the high sensitivity and negative predictive values at the level of an experienced radiologist would be useful in practice for ruling out disease. Thus, malignant thyroid nodules could be excluded at the level of an experienced radiologist if the result was considered to be benign by the thyroid US CAD system. Therefore, this system could help to reduce the rate of unnecessary biopsies requested by inexperienced radiologists. However, the relatively low specificity (74.6%) should be taken into account when using the thyroid US CAD system. Future studies on the system, under the use of observers with various levels of experience, are necessary to determine the clinical validity.
Analysis of inter-observer variability in the nodule characteristics of composition, orientation, echogenicity, and spongiform aspects showed substantial agreement between the CAD system and the experienced radiologist, while fair agreement was found for the margin definitions. The previous literature shows variability between radiologists in the classification of the US characteristics of thyroid nodules. Those studies demonstrated fair agreement for margin, as in the present study, less agreement for echogenicity, and similar or less agreement than was found in the present study for the other US characteristics (6–8). Therefore, it can be assumed that the inter-observer agreement between the thyroid US CAD system and the experienced radiologist might be similar to or higher than that between radiologists. Further studies are needed to validate the inter-observer variability between the thyroid US CAD system and other radiologists.
Successful nodule segmentations were observed in 87.3% (89/102) of nodules in the present study. Poor nodule segmentation occurred more frequently with benign nodules (n = 11; 18.6%) than with malignant nodules (n = 2; 4.7%), and the difference was statistically significant (p = 0.04). Among those nodules with poor segmentation, all of the malignant nodules were diagnosed as such on the thyroid US CAD system, and 54.5% of the benign nodules (6/11) were also diagnosed as malignant. Therefore, it is speculated that poor segmentation on the thyroid US CAD system may increase the false-positive rate, without affecting the false-negative rate. There is currently a requirement for observer modification of poor segmentation to reduce the false-positive rate and increase the specificity of the CAD system; future technical improvements to the segmentation would be of substantial benefit.
This study has several limitations. First, most of the enrolled thyroid malignancies (97.7%; 42/43) were classical papillary thyroid carcinomas. US findings of a follicular variant of thyroid carcinoma or follicular carcinoma, as well as other malignancies such as medullary carcinoma and lymphoma, are somewhat different from those of classical papillary carcinoma (20–24). Therefore, future studies including the various types of thyroid malignancies are needed. Second, the value of the CAD system was not evaluated for thyroid nodules with indeterminate cytologic results (atypia of undetermined significance/follicular lesion of undetermined significance) because thyroid nodules without decisive diagnosis, including indeterminate cytologic results, were excluded. Thus, future studies are needed to assess the role of the CAD system for indeterminate cytologic results. Third, the present thyroid US CAD system could not evaluate the calcification of thyroid nodules. Further technical developments are needed to improve the performance of the CAD system in this respect. Fourth, the added value of the thyroid US CAD system to the radiologist was not evaluated; only the diagnostic performance was compared between them. As computer outputs can be utilized by radiologists in daily practice but cannot replace them, future study on the added value of the thyroid US CAD system to observers of various levels of experience is necessary to determine clinical validity. Fifth, this study was performed in a single center in South Korea, which means that there could be some selection bias. Additionally, the mean size of the nodules was small (1.2 cm), which means results may not be transferable to a setting where nodules of larger size are encountered. Hence, large-scale multicenter studies with nodules of larger size are needed in the future to validate and generalize the findings.
In conclusion, the sensitivity and negative predictive value of the US CAD system using AI for malignant thyroid nodules were as good as those of an experienced radiologist, while specificity and accuracy were lower than those of the experienced radiologist. The CAD system showed acceptable agreement with the experienced radiologist for characterization of thyroid nodules. Thus, the system is useful for ruling out thyroid malignancy on US, may supply an easy method for determining whether an FNA biopsy is required, and may reduce the necessity for such biopsies. The exact clinical role of the thyroid US CAD system needs to be validated in multicenter studies with observers of various levels of experience and various patient groups.
Footnotes
Acknowledgments
This work was supported by a grant and provision of equipment from Samsung Medison Co.
Author Disclosure Statement
This work was supported by a grant and provision of equipment from Samsung Medison Co.
