Abstract
Background:
Single-center trials have demonstrated a moderate-to-substantial level of interobserver agreement in the evaluation of ultrasound (US) features of thyroid nodules. Multicenter studies on US agreement, however, are scarce, and data on intraobserver agreement are limited. The aim of this study was to assess inter- and intraobserver agreement between different thyroid centers and different specialists.
Methods:
A blinded analysis of 100 electronically recorded thyroid nodule US images was conducted in three large-volume thyroid centers by seven radiologists and endocrinologists. The evaluation was repeated after randomization 4 months later. The following US characteristics were evaluated: composition, echogenicity, margins, intranodular echogenic spots, vascularity, and shape. Thyroid nodules were also classified according to AACE/ACE/AME, EU-TIRADS, ATA, and ACR-TIRADS US classifications. Intra- and interobserver agreement was calculated using cross-tabulation expressed as mean Cohen's Kappa.
Results:
Interobserver agreement for US features: the K-coefficient was 0.53 for composition, 0.47 for echogenicity, 0.46 for intranodular vascularity, 0.33 for margins, and 0.42 for shape. For echogenic foci, the K-coefficient was 0.47 for microcalcifications, 0.38 for macrocalcifications, and 0.11 for the subcategory comet-tail artifacts. Operators were uncertain about the definition of hyperechoic foci in 16% of cases and described them as “hyperechoic foci of uncertain significance.” Interobserver Cohen-K for US classification systems was 0.44 for AACE, 0.42 for ACR-TIRADS, 0.39 for EU-TIRADS, and 0.34 for ATA. Intraobserver agreement: the K-coefficient for nodule US features was 0.62 for intranodular vascularity, 0.58 for composition, 0.60 for echogenicity, 0.54 for macrocalcifications, 0.55 for microcalcifications, 0.47 for comet tails, 0.39 for margins, and 0.35 for shape. Intraobserver Cohen-K for US classification systems was 0.54 for AACE, 0.49 for ACR-TIRADS, 0.38 for ATA, and 0.33 for EU-TIRADS.
Conclusions:
Intraobserver reproducibility for thyroid nodule US reporting and US classification systems appears fairly adequate, while the interobserver agreement between different centers is lower than that assessed in single-center trials. The reporting and rating ability of thyroid US examiners still appears inconsistent. A unified lexicon of thyroid US features, a simplified method of classification, and dedicated training in the description of thyroid US findings may increase the observers' agreement and the predictive value of US classification systems in real-world practice.
Introduction
Thyroid nodules are a common clinical finding, being detected in 19–67% of the general population (1,2). As most thyroid lesions are benign and may be safely managed with a surveillance program, the main goal of their initial assessment is the identification of the minority of nodules that could harbor a clinically significant cancer (3).
Thyroid ultrasound (US) examination is widely used as the first diagnostic tool for the management of thyroid nodular disease because it may demonstrate the presence of a few well-established findings suspicious for malignancy (4,5). In the majority of cases, however, thyroid sonography reveals less clear-cut US features with a low level of clinical predictivity. For these reasons, major endocrine and radiological societies produced US classification systems that were reported to provide a fairly good prediction of malignancy and, potentially, a more accurate selection of the lesions to be submitted to fine needle aspiration (FNA) biopsy (6–10).
Single-center trials demonstrated a good level, from moderate to substantial, of interobserver agreement in the evaluation of the US features of thyroid nodules (3,5,11–15). Yet, thyroid US is a rather subjective imaging method and is dependent on the specific expertise of the operators and the quality of their US equipment (5,11,16). Therefore, wider multicenter studies are needed to establish the actual clinical usefulness of US classification systems and their applicability to real-world practice. Unfortunately, studies on the level of agreement between different thyroid centers are scarce, and data on intraobserver agreement for thyroid nodule features are limited (5,11–16).
The aim of this study was to assess the inter- and intraobserver agreement between different thyroid centers and different specialists (endocrinologists and radiologists with specific thyroid expertise) in the evaluation of thyroid nodule US features and the definition of the US classification scores.
Materials and Methods
One hundred thyroid nodules, either solitary or in multinodular goiter, referred for surgery to the Regina Apostolorum Thyroid Center, underwent preoperative assessment with two similar state-of-the-art US machines (Esaote Twin, Genoa, Italy) equipped with a 5–15 MHz linear transducer. Sample size was determined according to the guidelines proposed by Cantor (17), which indicated a minimum sample size of 50 cases. US images were acquired during preoperative neck evaluation by two operators with specific thyroid expertise who did not participate in the subsequent trial. The scanning protocol included both transverse and longitudinal real-time multiplane imaging of each thyroid lesion. Static gray-scale and color Doppler images were saved to PACS, and their labeling was removed before the trial. A blinded analysis of the electronically recorded US images was separately conducted in three high-volume centers for thyroid diseases (Regina Apostolorum, Rome; Catholic University, Rome; and Santa Maria Nuova, Reggio Emilia, Italy) by seven thyroid imaging experts (two radiologists and five endocrinologists) with at least 15 years of experience.
The mean nodule size was 2.7 cm (range 0.6–8.0 cm). There were 30 (30%) malignancies with a mean size of 2.2 cm (range 0.6–5.3 cm), of which 24 (80%) were papillary carcinomas, 5 (16.7%) were follicular carcinomas, and 1 (3.3%) was a medullary cancer. The benign nodules had a mean size of 2.9 cm (range 0.9–8.0 cm). The structure of the nodules, according to the EU-TIRADS Lexicon (8), was as follows: solid: 64, predominantly solid: 22, predominantly cystic: 9, spongiform: 3, and cystic: 2.
Before starting the trial, all the participants received the EU-TIRADS lexicon for the definition of thyroid nodule US findings and the schemes of the four US classification systems under evaluation (8). After 1 month their understanding of the reporting modalities of the trial was assessed by an external monitor (A.P.).
For each set of images, the experts were requested to fill in a multiple-answer electronic questionnaire based on the EU-TIRADS lexicon and, subsequently, to stratify the risk of malignancy of the lesions on the basis of the 5-class ACR-TIRADS, ATA, and EU-TIRADS systems and the 3-class AACE/ACE/AME US classification (7–10). While the ACR-TIRADS classification is based on points, the other US systems are based on patterns.
The following US characteristics, defined in accordance with the EU-TIRADS terminology (8), were evaluated: composition (solid, predominantly solid, predominantly cystic, and cystic); echogenicity (hyperechoic, isoechoic, mildly and deeply hypoechoic); margins (well-defined, ill-defined, microlobulated, and spiculated); vascularity (no vascular signals, perinodular or slight intranodular flow, and marked intranodular flow); echogenic foci (microscopic, macroscopic, continuous or interrupted egg shell calcifications, and comet-tail artifacts); and shape (round/oval, taller than wide) (8). In case of uncertainty, the experts were also required to define the hyperechoic foci as “spots of uncertain significance” (7).
After 4 months, the experts performed a second blinded evaluation of the same images after their randomization in a different order. No consultation was allowed, and the blinding of the trial was supervised. The study flow chart is summarized in Figure 1. This retrospective observational study received institutional review board approval and followed the tenets of the Declaration of Helsinki.

Figure 1. Flowchart of the study.
Statistical analysis
Statistical analysis was performed using the Statistical Package for the Social Sciences (IBM-SPSS), release 21.0. Continuous variables are expressed as mean ± standard deviation, while categorical variables are displayed as frequencies.
The interobserver agreement was calculated using cross-tabulation expressed in Cohen's Kappa. Kappa values were evaluated, according to the standard proposed by Landis and Koch, as follows: 0–0.20 poor, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, and 0.81–1.0 almost perfect agreement (18).
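As an illustration of the statistic described above, the following is a minimal Python sketch of Cohen's Kappa computed from two observers' categorical ratings, together with the agreement bands listed in the text. It is not the SPSS procedure used in the study. The function names are hypothetical, averaging over all observer pairs is only one possible reading of "mean Cohen's Kappa", and labeling negative values as "poor" is an assumption, since the text does not define a band below 0.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two raters' categorical labels on the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(a)
    # Observed proportion of items on which the two raters agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


def mean_pairwise_kappa(ratings_by_rater):
    """Average kappa over all rater pairs (an assumed reading of 'mean kappa')."""
    pairs = list(combinations(ratings_by_rater, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


def agreement_band(kappa):
    """Map kappa to the bands listed in the text (assumption: below 0 is 'poor')."""
    if kappa <= 0.20:
        return "poor"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"
```

For example, `agreement_band(0.53)` returns `"moderate"`, matching the label given to the composition result in this study.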
For each US feature and classification system, the distribution of the different grades of concordance was also calculated. Concordance was defined as the frequency of nodules for which at least six out of seven observers agreed. Mean Cohen's K and mean concordance were calculated from each repeated intrarater observation (test–retest reliability). The intraobserver agreement was estimated as the percentage of answers that coincided between the two examinations performed by the experts.
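The concordance and percentage-agreement measures defined above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the study's actual computation: the function names are hypothetical, and "agreement" is taken to mean that the observers chose the same category for a nodule.

```python
from collections import Counter


def modal_agreement(ratings_for_item):
    """Number of observers who chose the most frequent category for one nodule."""
    return Counter(ratings_for_item).most_common(1)[0][1]


def concordance_rate(ratings_by_item, min_agree=6):
    """Fraction of nodules on which at least `min_agree` observers agree."""
    hits = sum(modal_agreement(r) >= min_agree for r in ratings_by_item)
    return hits / len(ratings_by_item)


def percent_agreement(first, second):
    """Percentage of identical answers between two readings by one observer."""
    if len(first) != len(second) or not first:
        raise ValueError("readings must be non-empty and of equal length")
    same = sum(x == y for x, y in zip(first, second))
    return 100.0 * same / len(first)
```

With seven observers, `min_agree=6` corresponds to the "at least six out of seven" threshold, and `min_agree=7` would count only unanimous ratings.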
Results
The level of interobserver agreement for the different US features of thyroid nodules is reported in Table 1.
Table 1. Interobserver Agreement for the Ultrasound Features and Ultrasound Classification Systems of Thyroid Nodules
Mean concordance: defined as a consistent definition by seven out of seven observers (100%) and by six out of seven observers (86%).
US, ultrasound.
The mean K-coefficient was 0.53 (moderate level of agreement) for composition, 0.47 (moderate) for echogenicity, and 0.46 (moderate) for intranodular vascularity. For the echogenic foci, the K-coefficient was 0.47 (moderate) for microcalcifications, 0.38 (fair) for macrocalcifications, and 0.11 (poor) for comet-tail artifacts. The operators were uncertain about the conclusive definition of the hyperechoic foci in 16% of cases and preferred to describe them as “hyperechoic foci of uncertain significance.”
Finally, the K-coefficient was 0.33 (fair) for the description of the margin of the nodules and was 0.42 (moderate) for the shape (oval vs. round).
The interobserver mean Cohen-K for US classification systems was 0.44 for AACE (moderate level of agreement), 0.42 (moderate) for ACR-TIRADS, 0.39 (fair) for EU-TIRADS, and 0.34 (fair) for ATA.
The results of the intraobserver analysis are reported in Table 2 as the mean K-coefficient and the mean of the concordance degree.
Table 2. Intraobserver Agreement for the Ultrasound Features and Ultrasound Classification Systems of Thyroid Nodules
The K-coefficient for nodule US features was 0.62 (substantial level of agreement) for intranodular vascularity, 0.58 (moderate-substantial) for composition, 0.60 (moderate-substantial) for echogenicity, 0.54 (moderate) for macrocalcifications, 0.55 (moderate) for microcalcifications, 0.47 (moderate) for comet tails, 0.39 (fair) for margins, 0.35 (fair) for taller-than-wide shape, and 0.35 (fair) for the shape (oval vs. round). The mean Cohen-K coefficient for US classification systems was 0.54 (moderate level of agreement) for AACE, 0.49 (moderate) for ACR-TIRADS, 0.38 (fair) for ATA, and 0.33 (fair) for EU-TIRADS.
Discussion
Several thyroid nodule US classification systems from different scientific societies are currently available for the evaluation of the risk of malignancy and the indication for cytological assessment (7–10). A few recent studies demonstrated that the main US classification systems have a high predictive value for malignancy in high-risk US categories and that they are effective for ruling out the indication for FNA in low US risk nodules (3). The 3- and the 5-category classifications released by the ETA, ATA, AACE/ACE/AME, American College of Radiology, and SKSTR (7–10,19) showed a rather similar diagnostic accuracy and a substantial interobserver agreement (Table 3) in single-center trials (3,5,11,14). However, a few problems and some potentially limiting aspects should be considered. The available data derive mostly from single-center trials, generally performed by examiners with similar training, while the evidence from controlled multicenter studies is at present limited (5,14,15). Thyroid US examination, moreover, is an operator-dependent diagnostic procedure and is influenced by the specific expertise of the operator and the quality of the US equipment, both relevant for the accurate definition of some characteristics. Finally, the intraobserver agreement has not, until now, been systematically assessed and has been evaluated in either radiological or endocrinological services. This issue is relevant because the concordance of the same operator in repeated examinations could be less than optimal when reporting a few subtle differences in US findings, such as the presence of ill-defined, spiculated, or lobulated margins or the actual source of intranodular echogenic spots.
Table 3. Interobserver Agreement for the Different Ultrasound Classification Systems in Multi- and Single-Center Studies
Interobserver agreement is expressed with Cohen's K.
The present study addressed the inter- and intraobserver agreement in the definition of the main thyroid nodule US findings using four major US classification systems.
Interobserver agreement
The level of agreement among operators from different thyroid centers in the description of US features of thyroid nodules was only partially satisfactory, as it ranged from fair to moderate, only partly confirming the results from former single-center studies. The lowest level of consistency was found for the characteristics of margins, with a Cohen's K = 0.33, similar to the previously reported data, which ranged from 0.13 to 0.61 (13,20). The agreement on the presence of microcalcifications and on the composition of the nodules was rather lower than the values found in former single-center trials, which ranged from 0.51 to 0.54 and from 0.62 to 0.81, respectively (11–13,21). The low level of agreement for microcalcifications, interrupted rim calcifications, and shape is in accordance with a recent trial on the predictivity of the ATA classification (14).
The interobserver agreement for the four thyroid nodule US classification systems was also not completely satisfactory. The AACE/ACE/AME and ACR-TIRADS had a moderate level of agreement (Cohen's K = 0.44 and 0.42, respectively), while the ATA and EU-TIRADS demonstrated only a fair interobserver concordance (Cohen's K = 0.34 and 0.39, respectively). These results appear less acceptable for clinical practice than those found in former single-center trials, as reported in Table 3.
Not surprisingly, these results are in accordance with the studies concerning the discrepancy in the US assessment of breast masses with the BI-RADS US lexicon (19,20,22,23). In analogy with thyroid US data, the interobserver agreement in breast lesions ranged from 0.32 to 0.37 for margins, from 0.36 to 0.41 for the echo pattern, and from 0.48 to 0.51 for the presence of microcalcifications. The choice of an indeterminate report for intranodular echoic spots could represent a warning about the possible presence of a potentially malignant finding and could be considered part of an intermediate risk category. The performance rate was lower for less trained operators, but was improved by a dedicated training, similarly to what was reported for thyroid nodule classifications (11,16).
Intraobserver agreement
Intraobserver reproducibility was higher than interobserver agreement but was not satisfactory for reliable use in clinical practice. The agreement was, on average, moderate for nodule composition, echogenicity, and microcalcifications (Cohen's K = 0.58, 0.60, and 0.55, respectively) and only fair for the definition of margins (Cohen's K = 0.39). Similarly, the intraobserver agreement for the US classification systems was moderate for AACE/ACE/AME and ACR-TIRADS (Cohen's K = 0.54 and 0.49, respectively) and only fair for the ATA and EU-TIRADS classifications (Cohen's K = 0.38 and 0.33, respectively).
As a whole, the results of the present study demonstrate that the interobserver agreement between thyroid US experts operating in different centers ranges from fair to moderate for both the definition of the single US features and the rating of the nodules according to the US classification systems.
Variability in evaluating thyroid nodule US features was highest for margins and echogenic foci, except for macrocalcifications, confirming the data reported in a recent study that demonstrated a Cohen-K value ranging from 0.25 to 0.39 (4). The classifications with fewer and less articulated classes showed a better inter- and intraobserver reliability than the more complex ones. As the description of specific features may markedly modify the rating of the risk of malignancy, the definition of these findings should be carefully characterized, and the option of an “indeterminate report” should be considered for high-risk findings (such as the presence or absence of microcalcifications) in case of operator uncertainty.
As intraobserver reproducibility was better than interobserver reproducibility, even if barely adequate for clinical practice, the operators demonstrated that their personal criteria of reporting and classification for thyroid nodule US features were consistent, although in incomplete agreement with those of the examiners from the other centers.
Limits of the study
Even though the study was designed to prevent major bias, a few limitations are present. Observers were blinded to the conclusive pathological findings, but they were aware that all the nodules under examination were submitted to surgery and had pathologic confirmation. In their everyday practice, the observers mostly used the AACE/ACE/AME and the EU-TIRADS systems. Despite the preliminary training, this uneven initial familiarity with the different classification methodologies could have influenced the agreement results. The use of video clips might have improved the study by making it closer to the real conditions of clinical practice. Therefore, a randomized study comparing the agreement between static images and clips would be useful for future considerations. Finally, the study provides information about results obtained by expert thyroid US operators, but the agreement among less experienced sonographers could be different and, possibly, less satisfactory (24).
In conclusion, the present study highlights that, even among experts, the interobserver agreement among multiple centers is low and that more work is needed. In the community and in centers without specific thyroid expertise, the situation is probably even less satisfactory.
To improve the inter- and intraobserver agreement, a universal lexicon and classification system should be released by the major scientific societies of the field and a consensus conference should be held in the different professional organizations. Moreover, even if we did not specifically address this issue, on the basis of the present and previous studies assessing the predictivity of classification systems, a 4-tier system might provide an appropriate balance between the complexity of the classification and the accuracy of risk scoring. Finally, on the basis of initial results (25), the use of artificial intelligence appears to be a promising tool for the improvement of the diagnostic accuracy and consistency of thyroid US reporting.
Footnotes
Author Disclosure Statement
No competing financial interests exist.
Funding Information
No funding was received for this article.
