Abstract
Background:
Thyroid ultrasound (US) is the first-line diagnostic tool for assessing thyroid nodules and guiding their management. Despite its importance, US is highly subjective and depends heavily on the skill of the performer. Few reports have evaluated thyroid US performance, and even fewer have examined observer variability in US assessment. Therefore, we evaluated inter- and intraobserver variations in US assessment and diagnosis of thyroid nodules among four radiologists and estimated their diagnostic accuracy.
Methods:
A total of 204 thyroid nodules in 144 patients were reviewed. There were 89 malignant and 115 benign cases. Four radiologists with more than 5 years of experience independently reviewed US images twice at 6-week intervals. Echogenicity, composition, margin, shape, calcification, vascularity, and final assessment were evaluated. Inter- and intraobserver variations were determined with Cohen's kappa statistics, and accuracy was calculated.
Results:
For interobserver variations, echogenicity showed fair agreement (κ = 0.34); composition, margin, calcification, and final assessment had moderate agreement (κ = 0.59, 0.42, 0.58, and 0.54, respectively); shape and vascularity showed substantial agreement (κ = 0.61 and 0.64, respectively). For intraobserver variability, almost all showed substantial agreement (κ > 0.61). Overall sensitivity, specificity, positive predictive value, negative predictive value, and accuracy for the four radiologists were 88.2%, 78.7%, 76.2%, 89.6%, and 82.8%, respectively.
Conclusions:
Experienced radiologists showed more than a moderate degree of agreement in US assessment of thyroid nodules, and their final assessments were highly accurate.
Introduction
Materials and Methods
The institutional review board approved this study.
Study population
From January 2001 to October 2007, more than 20,000 patients underwent US-guided FNAB at Severance Hospital of Yonsei University Health System, Seoul, Korea. Initially, patients who underwent thyroidectomy after preoperative thyroid FNAB were included. All US examinations were performed by radiologists with 1 to 10 years of experience. To reduce the influence of the machine factor, we included only patients examined with an iU 22 or HDI 5000 machine (Philips Medical Systems, Bothell, WA) using a 5–12 MHz linear array transducer, and their thyroid US had to include both transverse and longitudinal scans. In the case of benign nodules, patients had to have been followed up for more than 4 years (47–52 months; mean 49.52 months) after the nodule was confirmed as benign on the initial FNAB. Additionally, the nodules must not have shown any morphological changes, or they had to be proven benign on repeat FNAB. Finally, an investigator (S.H.C.) randomly selected 204 nodules in 144 patients; she did not take part in reviewing the images.
The 144 patients included 125 women and 19 men. There were 89 cases of malignancy (89/204, 43.6%) and 115 benign nodules (115/204, 56.4%). All malignant and 81 benign nodules were confirmed by operation. The remaining 34 benign nodules were stable during the follow-up period. Ninety-three nodules were <10 mm, 65 were 10–19 mm, 29 were 20–29 mm, and 17 were ≥30 mm in maximum nodule diameter.
Review of US images
An investigator (S.H.C.) selected 2–4 representative gray-scale and color Doppler images from each US examination on the picture archiving and communication system (PACS; GE Medical System, Milwaukee, WI). She converted the images into JPEG files and arranged them in random order on PowerPoint XP (Microsoft, Redmond, WA) slides. A total of 204 slides were made.
Four radiologists (E.K.K., J.Y.K., M.J.K., and E.J.S.) with 12, 8, 6, and 9 years of experience, respectively, in breast and thyroid imaging reviewed the slides individually on a liquid crystal display (LCD) monitor. No clinical information about the nodules was given. To evaluate interobserver variation, the radiologists selected and recorded the descriptors and final assessments for each nodule (Table 1).
Final assessment was divided into four categories: benign, probably benign, low suspicious malignancy, and suspicious malignancy. All cases were categorized and subcategorized for the descriptors. For vascularity, 123 cases were assessed. To evaluate intraobserver variation, all four radiologists reviewed the slides again after 6 weeks using the same method. No explanation of the descriptors was given and all reviewers evaluated the nodules by their own criteria.
Statistical analysis
Kappa statistics were calculated using SAS (MAGREE SAS Macro program) to determine the proportion of inter- and intraobserver agreement beyond that expected by chance (10). The method for estimating an overall kappa value in cases of multiple observers and categories is based on the work of Landis and Koch (11). In addition to calculating the kappa value for each descriptor, we regrouped margins into “well circumscribed” and “not circumscribed” (the latter comprising “irregular or spiculated” and “microlobulated”) and recalculated the kappa value. For color Doppler images, 123 cases were reviewed and evaluated. For the final assessment, we calculated the kappa value for each of the four categories, then regrouped them into a benign group (“benign” and “probable benign”) and a malignancy group (“low suspicious malignancy” and “suspicious malignancy”) and recalculated the kappa value for the two groups. A value of κ = 1.0 corresponds to complete agreement; 0, agreement no better than chance; and less than 0, agreement worse than chance. Landis and Koch suggested that a kappa value ≤0.20 indicates slight agreement; 0.21–0.40, fair agreement; 0.41–0.60, moderate agreement; 0.61–0.80, substantial agreement; and 0.81–1.00, almost perfect agreement (11).
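As an illustration of the multi-observer kappa and the Landis and Koch bands described above, the following is a minimal Python sketch of a Fleiss-type kappa. It is not the MAGREE macro used in the study and may differ from it in detail; the function names, the `COLLAPSE` mapping, and the four example nodules are hypothetical and do not come from the study data.

```python
from collections import Counter

# Landis and Koch bands as quoted in the text: (upper bound, label).
BANDS = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
         (0.80, "substantial"), (1.00, "almost perfect")]

# Regrouping of the four final assessment categories into two groups.
COLLAPSE = {
    "benign": "benign",
    "probable benign": "benign",
    "low suspicious malignancy": "malignancy",
    "suspicious malignancy": "malignancy",
}

def fleiss_kappa(ratings):
    """Fleiss-type kappa; ratings[i] holds one label per rater for nodule i."""
    n_subjects, n_raters = len(ratings), len(ratings[0])
    counts = [Counter(row) for row in ratings]
    # Observed agreement: share of agreeing rater pairs, averaged over nodules.
    p_obs = sum(
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ) / n_subjects
    # Chance agreement from the overall share of each category.
    totals = Counter()
    for cnt in counts:
        totals.update(cnt)
    p_exp = sum((t / (n_subjects * n_raters)) ** 2 for t in totals.values())
    return (p_obs - p_exp) / (1 - p_exp)

def landis_koch(kappa):
    """Map a kappa value to the agreement label quoted in the text."""
    if kappa < 0:
        return "worse than chance"
    for upper, label in BANDS:
        if kappa <= upper:
            return label
    return "almost perfect"

# Hypothetical final assessments by four raters for four nodules.
ratings = [
    ["benign", "probable benign", "benign", "probable benign"],
    ["suspicious malignancy", "suspicious malignancy",
     "low suspicious malignancy", "suspicious malignancy"],
    ["probable benign", "probable benign", "probable benign", "benign"],
    ["low suspicious malignancy", "suspicious malignancy",
     "low suspicious malignancy", "low suspicious malignancy"],
]
kappa_four = fleiss_kappa(ratings)
kappa_two = fleiss_kappa([[COLLAPSE[r] for r in row] for row in ratings])
print(round(kappa_four, 2), landis_koch(kappa_four))
print(round(kappa_two, 2), landis_koch(kappa_two))
```

In this toy data the raters disagree only within the benign or the malignancy group, so the two-group kappa is higher than the four-category kappa, which is the same direction of change the regrouping analysis is designed to detect.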
Finally, we calculated diagnostic indices for each reviewer including sensitivity, specificity, positive predictive value (PPV) for low suspicious malignancy and suspicious malignancy groups, negative predictive value (NPV) for benign and probable benign groups, and accuracy based on data from their first round of review.
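The diagnostic indices named above follow from a 2 × 2 table of US assessment (malignancy group versus benign group) against the reference diagnosis. The Python sketch below shows the arithmetic. The pooled counts are an assumption: they were back-calculated for illustration from the overall percentages reported in the abstract, taking 4 readers × 89 malignant and 4 readers × 115 benign nodules, and they are not reported in the paper. The function name is hypothetical.

```python
def diagnostic_indices(tp, fp, tn, fn):
    """Standard indices from a 2 x 2 table (test positive = malignancy group)."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),   # positive predictive value
        "npv": tn / (tn + fn),   # negative predictive value
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

# ASSUMED pooled counts (not from the paper): 356 malignant reads, 460 benign reads.
pooled = diagnostic_indices(tp=314, fp=98, tn=362, fn=42)
percentages = {name: round(100 * value, 1) for name, value in pooled.items()}
print(percentages)
```

With these assumed counts the rounded percentages match the overall values given in the abstract (88.2%, 78.7%, 76.2%, 89.6%, and 82.8%), which suggests, but does not confirm, that the overall figures were pooled across the four readers.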
Results
Inter- and intraobserver variation
A summary of interobserver variability for each descriptor and final assessment category is shown in Table 2. We obtained a relatively high degree of interobserver agreement for US description of thyroid nodules. Except for echogenicity (Fig. 1), moderate to substantial agreement was seen (κ = 0.41–0.80, Fig. 2). Overall agreement for shape was high (κ = 0.61 and 0.58) and for echogenicity was low (κ = 0.34 and 0.45) (Fig. 1). For composition of nodules, moderate agreement was obtained (κ = 0.59 and 0.58) overall, but agreement for “solid” was substantial (κ = 0.71 and 0.68). In assessing margins, overall agreement was moderate (κ = 0.42 and 0.43) when we used three descriptors: well circumscribed, microlobulated, and irregular or spiculated (κ = 0.53 and 0.61, 0.35 and 0.33, 0.23 and 0.17, respectively). However, when we categorized margins into well circumscribed and not circumscribed, overall agreement was moderate and substantial (κ = 0.53 and 0.61).
FIG. 1. A 46-year-old woman with a papillary carcinoma.
FIG. 2. A 52-year-old woman with an adenomatous hyperplasia.
Table 2 notes: Margin analysis between well circumscribed and not circumscribed (microlobulated and irregular or spiculated) margins. Final assessment analysis between benign (benign and probable benign) and malignancy (low suspicious malignancy and suspicious malignancy). SE, standard error.
In assessing vascularity, overall agreement was moderate (κ = 0.46 and 0.44), but in calcification, which is known to be related to malignant nodules, overall agreement was slightly higher (κ = 0.58 and 0.57).
In assessing the final category, overall agreement was moderate (κ = 0.54 and 0.57). The individual kappa values were as follows: benign (κ = 0.25 and 0.44), probable benign (κ = 0.66 and 0.78), low suspicious malignancy (κ = 0.46 and 0.45), and suspicious malignancy (κ = 0.49 and 0.44). However, overall agreement rose to substantial, close to almost perfect (κ = 0.72 and 0.79), when we classified the final category into just benign and malignant lesions.
Intraobserver variability among the four radiologists for each descriptor and final assessment is summarized in Table 3. Substantial to almost perfect agreement was achieved for most of the descriptors and the final assessment. In assessing margins, kappa values increased when margins were classified into two groups (well circumscribed and not circumscribed) compared with three categories (well circumscribed, microlobulated, and irregular or spiculated). In the final assessment in particular, three radiologists showed almost perfect agreement when only the benign and malignant groups were used.
Table 3 notes: Margin analysis between well circumscribed and not circumscribed (microlobulated and irregular or spiculated) margins. Two-group analysis between benign (benign and probable benign) and malignancy (low suspicious malignancy and suspicious malignancy).
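Intraobserver agreement compares one radiologist's first and second reading of the same nodules, which is the two-rating case of kappa. The Python sketch below is illustrative only: it is not the SAS procedure used in the study, and the function name, the `TWO_GROUP` mapping, and the six example margin ratings are hypothetical.

```python
from collections import Counter

def cohen_kappa(first, second):
    """Cohen's kappa for two sets of ratings of the same nodules."""
    n = len(first)
    # Observed agreement: fraction of nodules rated identically in both rounds.
    p_obs = sum(a == b for a, b in zip(first, second)) / n
    # Chance agreement from each round's own category frequencies.
    c1, c2 = Counter(first), Counter(second)
    p_exp = sum(c1[c] * c2[c] for c in set(first) | set(second)) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp)

# Regrouping of the three margin descriptors into two groups.
TWO_GROUP = {
    "well circumscribed": "well circumscribed",
    "microlobulated": "not circumscribed",
    "irregular or spiculated": "not circumscribed",
}

# Hypothetical margin ratings by one reader, first and second round.
round1 = ["well circumscribed", "microlobulated", "irregular or spiculated",
          "well circumscribed", "microlobulated", "well circumscribed"]
round2 = ["well circumscribed", "irregular or spiculated", "irregular or spiculated",
          "well circumscribed", "microlobulated", "microlobulated"]

kappa_three = cohen_kappa(round1, round2)
kappa_two = cohen_kappa([TWO_GROUP[r] for r in round1],
                        [TWO_GROUP[r] for r in round2])
print(round(kappa_three, 2), round(kappa_two, 2))
```

In this toy example one of the two disagreements is between microlobulated and irregular or spiculated, so it disappears after regrouping and kappa rises, mirroring the pattern reported for margins in Table 3.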
Accuracy of each radiologist
Individual data are summarized in Table 4. We calculated sensitivity, specificity, PPV for low suspicious malignancy and suspicious malignancy groups, NPV for benign and probable benign groups, and accuracy. All radiologists achieved more than 80% accuracy and more than 85% NPV. The malignancy rate for the first and overall final assessment of each radiologist is shown in Table 5.
PPV, positive predictive value; NPV, negative predictive value.
Discussion
Thyroid nodules are a relatively common medical problem, with a prevalence of 19–67% (1). US is a widely accepted diagnostic tool for evaluating thyroid nodules because it is an easily accessible, inexpensive, and accurate modality. Recently, the diagnostic accuracy of US for thyroid nodules has improved markedly with the advent of high-resolution US equipment. There have been many reports on predicting malignant thyroid nodules. Microcalcifications, marked hypoechogenicity, irregular margins, taller-than-wide shape, some macrocalcifications, and vascular pattern have been shown to suggest malignancy (3,4,6,12,13). These characteristics are seen individually or in combination, and physicians performing US examinations assess the nodule based on their experience and knowledge. However, US examination is very subjective and operator dependent, and accuracy differs between examiners. For these reasons, there have been many studies of observer variability or accuracy in breast disease and other fields (14–16), in which interobserver agreement was good to substantial. However, there is only one report of interobserver agreement for benign thyroid nodules, although there have been several reports on US volumetry of thyroid nodules and phantoms (7–9,17).
In that report, agreement in US assessment of benign thyroid nodules was good to very good. In the case of volumetry, the reported interobserver variation of thyroid nodule US measurement was approximately 50%, in contrast to a phantom study or three-dimensional sonography. In this study, we evaluated inter- and intraobserver variability in US assessment of thyroid nodules by radiologists specialized in thyroid imaging and estimated their diagnostic accuracy and performance. The descriptors are listed in Table 1.
For the final assessment category, we used four classifications: benign, probable benign, low suspicious malignancy, and suspicious malignancy. These classifications are used in the Breast Imaging Reporting and Data System (BI-RADS®) of the American College of Radiology (18). In the BI-RADS final assessment, there are six categories, from category 1 to 6. Excluding known malignant lesions (category 6), the malignancy rate increases from category 2 to 5. We modified this assessment category for thyroid US. To our knowledge, there have been no reports of the malignancy rates for thyroid US assessment categories, but our study showed that US assessment by experienced radiologists helped to predict malignancy. In our study, the malignancy rates for each category were as follows: 88.1–100% for suspicious malignancy, 65.9–78.7% for low suspicious malignancy, 6.6–14.3% for probable benign, and 0–5.9% for benign. Further study of the US assessment category system, its validation, and its goals will be needed in the future.
In our study, mostly moderate to substantial agreement was obtained in inter- and intraobserver variability to assess thyroid nodules (Tables 2 and 3). The four radiologists who reviewed the US images work at the same institution, and they have been taught a uniform approach to the sonographic description of thyroid nodules. We believe that this may have had a positive effect on the results of our study.
For echogenicity, fair agreement was obtained (κ = 0.34). This was the lowest level of agreement of the US attributes that were assessed. This may be due to the fact that in assessing echogenicity, the background parenchyma is used as the basis for comparison. There can be some difficulty if there is a diffuse thyroid disease such as thyroiditis or calcifications within the nodule (Fig. 1). In addition, the heterogeneity of thyroid nodules makes it difficult to define a single grade of echogenicity.
In assessing the margins, overall agreement was moderate (κ = 0.42 and 0.43) when categorized into three groups: well circumscribed, microlobulated, and irregular or spiculated. However, overall agreement increased (κ = 0.53 and 0.61) when the margins were classified into well circumscribed and not circumscribed. Clinically, microlobulated and irregular margins are both regarded as suspicious US findings, so distinguishing between them does not affect the final assessment category. Therefore, they can be classified together as not circumscribed.
In our study, the most interesting result was that the kappa values for the descriptors suggestive of malignancy (marked hypoechogenicity, solid composition, and microcalcifications) were higher than average (Table 2). These findings suggest that experienced radiologists are fully aware of the important criteria for discriminating malignant from benign thyroid nodules. Moreover, observer agreement was highest when the final assessment was grouped into benign and malignancy (κ = 0.72), and the radiologists' diagnostic accuracy rates were higher than 80% (Table 4). This probably means that experienced and specialized radiologists have effectively developed their own diagnostic criteria based on their experience. Additionally, intraobserver agreement was higher than interobserver agreement, indicating that the radiologists apply their own criteria consistently. Overall PPV was 76.2%, which was higher than the 75% previously reported by Nóbrega et al. (19), and this suggests that the radiologists who participated in our study performed well.
Our study has some limitations. First, our findings reflect only a comparison between experienced radiologists with faculty careers spanning 6–12 years. Comparison with and among inexperienced radiologists will therefore be needed, as will analysis of how level of experience relates to agreement and accuracy. Second, this study was done using still images, not real-time images. The results may differ if the assessment is made directly on the patient in real time while the operator is performing the scan.
In conclusion, four radiologists with at least 6 years of experience who were fully aware of the imaging features of thyroid cancer showed relatively good inter- and intraobserver agreement in US assessment of thyroid nodules. Additionally, when compared with histologic reports, their diagnostic performance was highly accurate. This conclusion emphasizes the importance of US operators' expertise for an appropriate final diagnosis.
Footnotes
Disclosure Statement
The authors declare that no competing financial interests exist.
