Abstract
Introduction:
The British Thyroid Association ultrasound classification (U-classification) is a risk stratification model that grades thyroid nodules from U2 to U5 based on their sonographic appearance. Variability between ultrasound operators when U-scoring is reported in the literature, and some evidence of it was found in the author’s department. The aim of this study was to investigate whether there is significant disagreement in the department and to identify potential reasons for variability.
Methods:
Eight operators, radiologists and sonographers, were recruited to grade 33 thyroid nodules (TNs) and answer a tick box questionnaire using the British Thyroid Association lexicon. The inter-operator variability for the U-categories, indication for fine-needle aspiration biopsy and ultrasound features was assessed using Fleiss’ kappa and Gwet-AC1. The operators’ accuracy was measured against the most experienced operator in the department using Cohen’s kappa and percentage agreement.
Results:
Fair agreement (Fleiss’ K = 0.21) was obtained between the participants when U-scoring (U2–5). Fair-to-moderate agreement was noted between sonographers (K = 0.40). Significant variability was demonstrated between radiologists (p > 0.05). Indication for fine-needle aspiration biopsy reached fair to almost substantial agreement (radiologists’ AC1 = 0.34, sonographers’ AC1 = 0.58, overall AC1 = 0.41). No significant variability was measured for echogenicity (K = 0.29), composition (K = 0.33), shape (K = 0.58), margin (K = 0.45), halo (K = 0.34) and vascularity (K = 0.44). Accuracy reached fair agreement (mean Cohen’s K = 0.29) and moderate agreement (mean AC1 = 0.53) for the U-categories and fine-needle aspiration biopsy, respectively. Radiologists demonstrated lower accuracy.
Conclusion:
No significant inter-rater variability in U-scoring or recommending fine-needle aspiration biopsy was demonstrated between all the operators in the department. Radiologists showed significant variability in U-scoring and lower accuracy. Reliability and accuracy could be improved by addressing those problematic categories and features identified with this study.
Keywords
Background
Thyroid nodules (TNs) are often discovered by the patient palpating a lump in the neck or from imaging tests such as computed tomography (CT scan) and magnetic resonance imaging (MRI).1,2 It is reported that up to 68% of the asymptomatic population have a TN, but only 7%–15% of these would be classified as malignant. 3 The high incidence of TNs has led to the need for a clear strategy to differentiate between those that require surgical removal and those that can be managed conservatively.3,4
Kim et al. 4 conducted a prospective study on 155 non-palpable nodules to identify new sonographic criteria for recommending a fine-needle aspiration biopsy (FNAB). These findings included the presence of micro-calcification within the nodule, hypo-echogenicity, an irregular or lobulated margin, and a shape that is taller than wide. 4 However, because none of these criteria could be used on its own to predict malignancy, various ultrasound (US) stratification models were developed.5–7 These models combine the presence of the sonographic features in order to determine the TN’s risk category.
According to the British Thyroid Association (BTA) 8 model, for example, TNs classified as U2 are likely to be benign, whereas TNs classified as U4–U5 have a high suspicion of malignancy. The category U3 is assigned to TNs that are indeterminate, often due to overlapping features. While likely benign nodules do not require further evaluation, indeterminate or suspicious TNs are usually sent for FNAB. 8
Compared with other US stratification models, the BTA accompanied their document with a graphic (Figure 1) that was designed to simplify the scoring process and reduce its variability among US operators. 8 Significant variability in TN classification could, in fact, lead to contradictory management advice from US operators, with some recommending an FNAB, for example, and others recommending follow-up or discharge of the patient.

Figure 1. BTA U-Classification 2014, reproduced from the literature. 8
The evaluation of inter-operator reliability is undoubtedly an important aspect of any research that introduces a new measurement method. 9 Bunting et al. 10 argued that the evaluation of the inter-operator reliability should also be performed for well-established methods as a part of quality control.
In original research, Weller et al. 11 investigated the inter-operator reliability of the BTA model by performing a retrospective assessment on 73 patients with five operators (four radiologists and one senior sonographer) and found substantial agreement between the participants in U-scoring (Fleiss’ K = 0.73).
Couzins et al., 12 in contrast, studied the inter- and intra-operator reliability on 20 TNs between 14 US operators with different levels of experience. Although no statistically significant variability existed when U-scoring, the study demonstrated only slight to fair agreement between all the participants and within the experienced group only (K = 0.19–0.34).
Nichols et al. 13 emphasised that the assessment of the inter-operator reliability is particularly important with new members of staff and when there is a change in the structure of the service provided. Following the National Institute for Health and Care Excellence (NICE) recommendation 14 to fast-track patients presenting with a neck and thyroid lump, the author’s hospital has recently initiated a service in which patients are referred directly from primary care to radiology for their initial evaluation.
An FNAB is usually performed on the same day as the US examination by radiologists or a consultant sonographer, and the results are sent to the referrer within 2 weeks. However, for this service to work, the requirements include good communication between the teams involved, evaluation of the proficiency of US trainees and audits. A feedback system should also be put in place. 15
In this regard, there were a few instances where sonographers disagreed with one another when discussing TN cases. The disagreement centred not only on the nodule’s classification but also on specific US features. It was less clear, however, whether this disagreement existed among radiologists who were regularly performing FNABs. There was also little information on the source and relevance of this disagreement and how it affected our service.
The evaluation of inter-rater reliability is not commonly discussed or carried out in clinical practice, possibly due to busy clinical services, lack of time, insufficient knowledge to perform statistical analysis, or underestimation of its importance. This assessment can aid in the identification of areas for improvement, direct staff training where it is needed and reassure staff.9,10
The aim of this study was therefore to assess the inter-rater reliability when scoring TNs according to the BTA model to evaluate whether there was significant disagreement in the department. The objectives of this study were to assess the overall inter-operator reliability for the U-categories (U2–5), for the indication of an FNAB and for each US feature (echogenicity, composition, etc.). Furthermore, because a high inter-rater reliability would only indicate ‘consistency’ between raters and it does not demonstrate whether the raters use the model correctly, the departmental accuracy measured against the most experienced US operator in the department (gold standard) was also calculated.
Materials and method
Study population and image interpretation
In accordance with the revised edition of Governance Arrangements for Research Ethics Committees (GAfREC 2021), 16 this project did not require the National Research Ethics Committee (NREC) approval.
The study took place in the radiology department of a district general hospital. Four consultant radiologists and four experienced sonographers, who perform most of the thyroid scans in the department, were recruited with written informed consent. All the participants had at least 5 years of practical experience in interpreting thyroid US. FNAB was routinely performed by all the recruited radiologists. Two of the sonographers also had at least 12–18 months’ experience in training to perform FNAB.
The TN cohort, comprising 33 nodules from 33 patients (age > 18), was retrospectively selected from the Trust’s picture archiving and communication system (PACS) by the author and the most experienced US operator with the BTA model in the department (gold standard). All these nodules were larger than 10 mm in diameter (mean size of 25 mm) because FNAB is not performed or recommended for TNs measuring less than this size in our department.
The TNs’ images were chosen from US scans performed in our department between January 2019 and January 2022, and consisted of a transverse image with and without colour Doppler for each nodule. The US scans were performed with a Canon Aplio I700 and I800 using 10- and 12-MHz linear array transducers. The images were saved as JPEG and randomly arranged in PowerPoint (Microsoft Corporation, Redmond, WA, USA) to generate eight PowerPoint files, one for each operator. All participants were given a tick box questionnaire, devised by the author strictly using the BTA lexicon, and a printed copy of the BTA graphic (Figure 1). The participants were asked to evaluate the images independently, under the same controlled reporting conditions that they would implement for routine reporting, and to report on the category (U2–5) and sonographic appearance of each nodule.
Statistical analysis
Data from the questionnaires were organised in tables using Microsoft Excel, version 2012. Statistical analyses were undertaken using SPSS (IBM, Chicago, IL, USA) and R Software (version 2016, R Core Team). The inter-operator reliability was calculated using the K-statistic to adjust for agreement due to chance, in particular Fleiss’ 17 extension for more than two raters and variables.
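As a rough illustration of the Fleiss’ kappa calculation mentioned above, the following Python sketch computes the statistic from a table of labels. It is not the SPSS or R procedure used in the study: the function name `fleiss_kappa` and the toy scores are hypothetical, and the sketch assumes every nodule is scored by the same number of raters.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a list of subjects, each a list with one label per rater.

    Assumes every subject was scored by the same number of raters (>= 2).
    """
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    counts = [Counter(row) for row in ratings]

    # Observed agreement: share of agreeing rater pairs, averaged over subjects.
    p_bar = sum(
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ) / n_subjects

    # Chance agreement: sum of squared overall category proportions.
    p_e = sum(
        (sum(c[cat] for c in counts) / (n_subjects * n_raters)) ** 2
        for cat in categories
    )

    if p_e == 1.0:  # every rating falls in one category; kappa is undefined
        return float("nan")
    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical toy data: 4 nodules scored by 3 raters (not study data).
toy_scores = [
    ["U2", "U2", "U2"],
    ["U3", "U3", "U4"],
    ["U5", "U5", "U5"],
    ["U3", "U4", "U4"],
]
kappa = fleiss_kappa(toy_scores, ["U2", "U3", "U4", "U5"])
print(round(kappa, 3))
```

For these toy scores the observed agreement is 2/3 and the chance agreement is 1/4, giving a kappa of 5/9 (about 0.56).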
The confidence interval (CI), percentage agreement (PA) and Gwet-AC1 agreement were also calculated. Gwet-AC1 addresses the paradox problem of the kappa statistic and is indicated when the prevalence of a category in the population is skewed.18,19 A p-value of <0.05 was considered statistically significant.
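The kappa paradox can be seen on a small invented example. The Python sketch below compares Fleiss’ kappa with Gwet’s AC1 on skewed yes/no data; the helper names and the toy FNAB decisions are hypothetical, the observed agreement is taken as the share of agreeing rater pairs (an assumption about how agreement is summarised, not necessarily the PA definition used in the study), and every nodule is assumed to have the same number of raters.

```python
from collections import Counter


def pairwise_agreement(ratings, categories):
    """Mean share of agreeing rater pairs per subject (observed agreement)."""
    n_raters = len(ratings[0])
    total = 0.0
    for row in ratings:
        c = Counter(row)
        total += (sum(c[cat] ** 2 for cat in categories) - n_raters) / (
            n_raters * (n_raters - 1)
        )
    return total / len(ratings)


def category_shares(ratings, categories):
    """Overall proportion of ratings that fall in each category."""
    flat = Counter(label for row in ratings for label in row)
    n = sum(len(row) for row in ratings)
    return [flat[cat] / n for cat in categories]


def fleiss_kappa(ratings, categories):
    p_a = pairwise_agreement(ratings, categories)
    # Chance agreement grows towards 1 when one category dominates.
    p_e = sum(p * p for p in category_shares(ratings, categories))
    return (p_a - p_e) / (1.0 - p_e)


def gwet_ac1(ratings, categories):
    p_a = pairwise_agreement(ratings, categories)
    shares = category_shares(ratings, categories)
    # Gwet's chance agreement shrinks when one category dominates.
    p_e = sum(p * (1.0 - p) for p in shares) / (len(categories) - 1)
    return (p_a - p_e) / (1.0 - p_e)


# Hypothetical toy data: 4 nodules, 4 raters, almost all "yes" (not study data).
toy_fnab = [
    ["yes", "yes", "yes", "yes"],
    ["yes", "yes", "yes", "no"],
    ["yes", "yes", "yes", "yes"],
    ["yes", "yes", "yes", "yes"],
]
labels = ["yes", "no"]
pa = pairwise_agreement(toy_fnab, labels)
kappa = fleiss_kappa(toy_fnab, labels)
ac1 = gwet_ac1(toy_fnab, labels)
print(round(pa, 3), round(kappa, 3), round(ac1, 3))
```

In this toy case the raters agree on 87.5% of pairs, yet kappa is slightly negative (about -0.07) because the skewed prevalence inflates its chance term, while AC1 stays high (about 0.86).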
For the interpretation of the reliability coefficients, the author used the scale proposed by Landis and Koch. 20 A K-value between 0 and 0.20 corresponded to slight agreement, between 0.20 and 0.40 to fair agreement, between 0.40 and 0.60 to moderate agreement, between 0.60 and 0.80 to substantial agreement and above 0.80 to almost perfect agreement. 20 To evaluate the departmental accuracy when U-scoring, the author used the PA and Cohen’s kappa between the scores of each assessor and those of the expert gold standard.
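The accuracy measures and the interpretation scale can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names and toy scores are hypothetical, a value lying exactly on a band boundary is assigned to the lower band (the text does not specify this), and the 'poor' label for negative values comes from the original Landis and Koch scale.

```python
from collections import Counter


def percentage_agreement(rater, gold):
    """Share of nodules on which the rater matches the reference scores."""
    return sum(r == g for r, g in zip(rater, gold)) / len(gold)


def cohen_kappa(rater, gold):
    """Cohen's kappa between one rater and the reference scores."""
    n = len(gold)
    p_o = percentage_agreement(rater, gold)
    rater_counts, gold_counts = Counter(rater), Counter(gold)
    # Chance agreement from the two sets of marginal proportions.
    p_e = sum(
        (rater_counts[cat] / n) * (gold_counts[cat] / n)
        for cat in set(rater) | set(gold)
    )
    return (p_o - p_e) / (1.0 - p_e)


def landis_koch(value):
    """Map a coefficient to a Landis and Koch band.

    Assumption: the upper bound of each band is inclusive.
    """
    if value < 0:
        return "poor"
    bands = [(0.20, "slight"), (0.40, "fair"),
             (0.60, "moderate"), (0.80, "substantial")]
    for upper, label in bands:
        if value <= upper:
            return label
    return "almost perfect"


# Hypothetical toy data: one rater against the reference on 6 nodules.
gold_scores = ["U2", "U2", "U3", "U4", "U5", "U5"]
rater_scores = ["U2", "U3", "U3", "U3", "U5", "U4"]
pa = percentage_agreement(rater_scores, gold_scores)
kappa = cohen_kappa(rater_scores, gold_scores)
print(pa, round(kappa, 3), landis_koch(kappa))
```

Here the rater matches the reference on 3 of 6 nodules (PA = 0.5), chance agreement is 8/36, and kappa is 5/14 (about 0.36), which falls in the fair band.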
Results
Of the 33 TNs included in this study, 11 (33%) were classified as U2, 6 (18%) as U3, 8 (24%) as U4 and 8 (24%) as U5 by the expert gold standard. The questionnaires were completed and returned to the author by all the invited participants.
Inter-operator reliability
U-classification
Fair agreement with a Fleiss’ K-value of 0.21 was obtained between the eight participants when U-scoring (Table 1). Moderate agreement was noted between sonographers (K = 0.40), and significant variability was demonstrated between radiologists, as agreement due to chance could not be excluded for this group (p = 0.10). U5 was the most agreed upon category (K = 0.56), and U3 and U4 were the least agreed upon (K = 0.12 and 0.19, respectively).
Table 1. Overall inter-operator reliability for the U2–5 categories and indication for FNAB.
FNAB: fine-needle aspiration biopsy.
Recommendation for FNAB
Moderate agreement was reached between all the participants when deciding whether the TNs needed an FNAB (Gwet-AC1 = 0.41). Moderate to substantial agreement was noted when considering the sonographers’ group only (AC1 = 0.58), and fair-to-moderate agreement was demonstrated for the group of radiologists (AC1 = 0.34).
Agreement on US features
Overall fair agreement (K = 0.29) was obtained for echogenicity with the most agreement for the ‘Isoechoic’ feature (K = 0.45).
When considering the composition, raters assigned two or more features on multiple occasions; these cases were regarded as a fourth category (‘mixed’) to account for this in the calculation of the K-value (Table 2). There was fair overall agreement (K = 0.33), with ‘solid’ being the most agreed upon feature (K = 0.44). Variability for the ‘micro-cystic/spongiform’ feature was significant, as agreement due to chance could not be excluded (p > 0.05).
Table 2. Inter-operator reliability of the BTA U-classification lexicon.
BTA: British Thyroid Association; PA: percentage agreement; NA: not available.
The inter-rater reliability data for ‘equivocal echogenic foci’, ‘peripheral eggshell calcification’, ‘disrupted peripheral calcification’, ‘micro-calcification’ and ‘globular calcification’ were skewed by the prevalence of TNs with none of these features; hence, the Fleiss’ K was not calculated for these (Table 2). It appears, however, that there was almost perfect agreement between all the participants in determining when the TN did not have any of these features (mean PA = 89% and AC1 = 0.86).
Shape and margin reached the highest agreement among all the US features, with K-values of 0.58 and 0.45, respectively. Fair (K = 0.34) and moderate (K = 0.43) agreement was reached for ‘halo’ and ‘vascularity’, respectively (Table 3).
Table 3. Inter-operator reliability of the BTA U-classification lexicon.
BTA: British Thyroid Association; PA: percentage agreement.
Operators’ accuracy
U-classification accuracy
The participants’ accuracy was measured against the expert gold standard. The overall agreement for the categories U2–5 was fair, with a mean Cohen’s kappa of 0.29 (Table 4). Sonographers demonstrated the highest agreement with the gold standard (mean PA = 54.25% and Cohen’s K = 0.36). Radiologists’ mean PA and Cohen’s K-values were 44% and 0.22; however, for two radiologists, agreement due to chance could not be excluded (p > 0.05), suggesting significant variability compared to the gold standard.
Table 4. Operators’ accuracy against the gold standard.
FNAB: fine-needle aspiration biopsy.
Accuracy on recommendation for FNAB
The overall accuracy for FNAB recommendation was moderate (mean AC1 = 0.53). For the sonographers’ group, the mean PA was 77.27% and the mean AC1 was 0.60, while for the radiologists’ group, the PA was 69.7% and the AC1 was 0.46. Agreement due to chance could not be excluded for one radiologist (p = 0.07).
Discussion
Although the level of agreement for the U-categories was only fair (K = 0.21) in the department, this appears in line with the study conducted by Couzins et al., 12 where the agreement for the experienced group was K = 0.24–0.34. This, however, is considerably lower than that reported by Weller et al. 11 (K = 0.73). This difference could be due to the fact that in Weller et al., 11 some of the participants were head and neck specialists and therefore had much more experience with thyroid scanning. Nevertheless, there was better agreement in deciding when a TN needed an FNAB (AC1 = 0.41), and this was somewhat reassuring as the main purpose of the BTA U-classification model is to triage TNs for FNAB. It was, however, interesting to note that when the sonographers’ group was considered on its own, the agreement in U-scoring and recommending an FNAB was higher: moderate (K = 0.40) and moderate to substantial (AC1 = 0.58), respectively. Surprisingly, the variability within the group of radiologists when U-scoring was significant, as agreement by chance could not be excluded (p = 0.10). This may be of concern considering that radiologists perform most of these assessments in the department.
The lower agreement within the radiologists’ group could be due to differences in training. While all the sonographers had their qualification and training from the same institution and their preceptorship under the guidance of the same consultant sonographer, the radiologists have different training backgrounds, with some using other stratification models before joining our department. Radiologists’ variability in recommending FNAB (AC1 = 0.34), however, while not significant, showed potential for improvement when compared to the sonographers’ group.
The agreement on echogenicity (K = 0.29) is relatively low when compared with other studies available in the literature. Kim et al., 21 Koltin et al. 22 and Lim-Dunham et al. 23 reported Fleiss’ K-values of 0.50, 0.46 and 0.54, respectively. It is worth noting, however, that these studies were conducted using a different lexicon to that of the BTA guideline, which for the echogenicity category includes six individual features (isoechoic, ?hypoechoic, hypoechoic, very hypoechoic, mildly hyperechoic and markedly hyperechoic) compared to the four individual features (‘anechoic’, ‘isoechoic or hyperechoic’, ‘hypoechoic’ and ‘very hypoechoic’) of the TIRADS classification model on which most of these studies are based. The lower agreement could be due to the higher number of features that the US operators would need to agree on. To the best of the author’s knowledge, no previous study has investigated the inter-rater reliability of the BTA lexicon.
Composition also had fair agreement (K = 0.33); when considering the individual features, ‘cystic change’ and ‘solid’ had the highest agreement (K = 0.44). However, although defining a TN as cystic or solid would not significantly change the patient’s management (Figure 1), it seems that the term ‘cystic change’ is being used by the operators to describe all nodules that are predominantly cystic or that have only a small cystic component (26% of the cases were described with ‘cystic change’). It is not clear what this term means in the BTA guideline, for example whether it denotes only TNs with a cystic component of more than 50% or also those with less. This is the case for the TN in Figure 2(c), which four operators described as having ‘cystic change’, two as ‘spongiform’ and two as ‘solid’. The guideline makes it clear, however, that the structure of the cystic component must be evaluated for irregularity, which would raise the suspicion of the nodule. 6

Figure 2. (a) Four operators described this TN with ‘peripheral eggshell calcification’ and four with ‘disrupted eggshell calcification’, (b) four operators described this as with ‘equivocal foci’, two with micro-calcification and two with ‘globular calcification’ and (c) two operators described this as spongiform, four as cystic and two as solid.
The K-value could not be obtained for ‘equivocal echogenic foci’, ‘micro-calcification’, ‘globular calcification’, ‘peripheral eggshell calcification’ and ‘disrupted peripheral calcification’ due to the skewed prevalence of TNs with none of these features; however, it is interesting to note that in TNs like that in Figure 2(b), where a relatively small hyperechoic focus was present, four of the eight US operators described this as having an equivocal focus, two as having globular calcification and two as having micro-calcification. As with ‘cystic change’ versus ‘solid’, differentiation between ‘equivocal echogenic foci’, ‘micro-calcification’ and ‘globular calcification’ would not significantly change patient management according to the BTA model, as TNs with these features will require FNAB (Figure 1). However, agreement on these features, and thus differentiation, can be improved by reaching an agreement on the size that distinguishes ‘micro-calcification’ from ‘globular calcification’, for example. Bias could be the reason for disagreement involving the term ‘equivocal foci’, as colloids and foci of calcification with their shadowing artefact are better distinguished with real-time US.
The data for ‘peripheral eggshell calcification’ and ‘disrupted eggshell calcification’ suggest that operators use these terms interchangeably. These terms, however, have different meanings according to the BTA guideline. While peripheral eggshell calcification can be either continuous or discontinuous, disrupted eggshell calcification indicates rim calcifications with a small extrusive soft tissue component (Figure 1). For the TN in Figure 2(a), for example, four of the operators described it as having ‘peripheral eggshell calcification’ and four as having ‘disrupted eggshell calcification’. It is worth noting that TNs with disrupted eggshell calcification would require FNAB, as they are suspicious according to the BTA model (Figure 1). While this TN was classified as U2 by the gold standard, only 50% of the raters agreed, and among those who reported this TN with ‘disrupted peripheral calcification’, only 50% would send it for biopsy. This example showed that operators disagreed not only on the meaning of the terms but also on the use of the BTA model in categorising TNs with this feature.
Similarly to the study conducted by Hoang et al., 24 shape had relatively high agreement when compared with all the other features (K = 0.58). A shape that is taller than wide is known to have high sensitivity for malignancy, and the presence of this feature warrants further investigation. 8 Of the 264 cases (8 operators × 33 nodules), 61 were reported as taller than wide; however, 11 (18%) of these were classified as U2 (5 by sonographers and 6 by radiologists), which indicates that the operators gave more weight to other features for unclear reasons.
Departmental accuracy
When assessing the departmental accuracy against the expert gold standard, the study demonstrated only fair agreement (mean Cohen’s K = 0.29, with two radiologists demonstrating significant variability) for the U2–5 categories. In particular, it appears that the operators over-assigned U3 and under-assigned U4–U5 (Table 5). Better accuracy was noted for all the operators in recommending an FNAB (AC1 = 0.53, with one radiologist demonstrating significant variability).
Table 5. Number of cases and number of TNs scored per category by at least one operator.
Radiologists downgraded more TNs (13) compared to sonographers (6) when an FNAB was needed, while sonographers sent more TNs for FNAB (10 compared to 9 from radiologists) when it was not needed (Table 5). It is difficult to explain why the radiologists downgraded more TNs to U2, although this could have been due to previous experience with other stratification models, such as TIRADS, that do not consider some features such as vascularity significant enough to change the patient’s management.4,5 Nevertheless, this highlighted the need for more training on the BTA model.
Need of training
When deciding to conduct this study, the author was aware from both the literature review and anecdotal evidence that some degree of inter-rater variability when U-scoring would be expected in the department. The literature, however, showed that training and practice can help reduce operators’ variability.20,21
While the study did not demonstrate significant inter-operator variability when U-scoring and recommending an FNAB between all the operators, the data showed room for improvement, particularly for the group of radiologists, which showed significant inter-rater variability in assigning U2–5 and poorer agreement with the gold standard.
This study illustrated the need for the department to conduct meetings and establish consensus on the meaning of certain terms that are part of the BTA lexicon. Specifically, there is a need to conduct training on the U3 and U4 features and to review the meaning of the terms ‘?hypoechoic’, ‘cystic change’, ‘disrupted peripheral calcification’, ‘cystic/spongiform’ and ‘intranodular vascularity’, and the difference between ‘micro-calcification’ and ‘globular calcification’.
In this regard, some studies have shown that regular consensus discussions and training can reduce variability between the operators in the department.21,25,26 Seifert et al., 25 for example, demonstrated that consensus reading sessions significantly improved the inter-rater reliability between four US specialists. The authors concluded by recommending these sessions for ‘synchronisation’ purposes to improve the reliability within the team. 25
Kim et al. assessed the accuracy and inter-operator reliability of seven residents on thyroid US pre- and post-‘man-to-man’ training. The study showed that while the four residents who received man-to-man training demonstrated almost perfect agreement with the faculty radiologist, the others showed lower agreement. 21 Other studies suggested that reviewing sets of US images of TNs with radio-pathological correlation for all the U3–5 cases would familiarise the US operators with the BTA classification, improving their performance. 26
Limitations
This study, however, was not free from limitations. First, it was a relatively small retrospective analysis, and the distribution of some of the US features was unequal. Only a few TNs were ‘markedly hyperechoic’ or had ‘peripheral calcification’ or ‘globular calcification’, for example. This could have been improved by selecting more TNs in order to represent all the features with low prevalence.
Second, it could be argued that providing operators with only a couple of static images of the TN eliminates the key real-time aspect of US, which would limit the assessment and detection of certain US features such as ‘equivocal echogenic foci’ and ‘colloids’, which are usually better detected on real-time US. However, because most of the US features can still be assessed through static images and because TNs are reviewed during follow-up examinations by comparing static images, conducting the research project in this way was thought to be valid and still useful.
Conclusion
The aim of this study was to investigate whether there was significant disagreement in the department when U-scoring.
The study did not demonstrate significant inter-observer variability in U-scoring or recommending FNAB between all the operators in the department; however, it did demonstrate scope for improvement, particularly for the radiologists’ group which showed significant variability in U-scoring and poorer accuracy when compared with the expert gold standard.
Reliability and accuracy could be improved by addressing the identified problematic categories and features. This study established a benchmark against which further audits can be performed to measure training effectiveness. The author will present the outcome of this study to the department and help them in organising discussions and training on the aspects highlighted in this study.
Footnotes
Acknowledgements
N.R. thanks Mrs AG and Mr SJS and the whole US department at YDH for their support and for participating in this study.
Contributors
NR.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship and/or publication of this article.
Ethics approval
In accordance with the revised edition of Governance Arrangements for Research Ethics Committees (GAfREC 2021), this project did not require the National Research Ethics Committee (NREC) approval. This study was part of the final dissertation module towards the MSc degree at UWE (2022/2023). The article has been adapted according to BMUS requirements. Ethical approval was sought and gained from UWE and the hospital’s Research Department at YDH (07/07/2022–Local Audit 524). Approval to access patient data was gained from the hospital’s Caldicott Guardian (23/06/2022).
Informed consent
Written informed consent was not sought for this study as it was a ‘retrospective study on anonymised images’.
Guarantor
NR.
