Abstract
Background:
Neck ultrasound (NUS) is currently seen as a main component of follow-up of differentiated thyroid cancer (DTC) and is usually performed regardless of non-stimulated thyroglobulin (Tg) levels. The aim of this study was to determine whether there is a clinical benefit from such a routine NUS in DTC patients.
Methods:
A retrospective database study was conducted of 3176 cervical ultrasound exams performed in 773 patients between June 15, 1996, and July 1, 2012. The accuracy of ultrasound results was assessed based on the results of further diagnostic and/or therapeutic procedures within six months of a particular ultrasound.
Results:
Of the NUS exams, 2199 were classified as true negative, 216 as true positive, 69 as false negative, and 692 as false positive; the false-positive exams occurred in 339 (43.9%) individual patients, 170 of whom were low risk. Thus, overall sensitivity, specificity, positive predictive value, negative predictive value, and accuracy [confidence interval] were 75.8% [70.1–81.5%], 76.1% [74.3–77.8%], 23.8% [18.1–29.5%], 97.0% [96.2–97.7%], and 76.0% [74.3–77.7%], respectively. No significant differences were found between low- and high-risk patients. There were no significant differences between patients with an undetectable and a low detectable (<1 μg/L) Tg level. However, these two groups both showed significantly lower positive predictive value and higher negative predictive value than patients with a Tg ≥1 μg/L. From January 2007 onwards, true-positive and false-negative neck ultrasounds were no longer observed in patients with Tg <1 μg/L.
Conclusion:
After total thyroidectomy and 131I ablation, neck ultrasound should be reserved only for anti-Tg antibody negative patients with a Tg level of ≥1 μg/L.
Introduction
For assessment of non-stimulated Tg levels, 1 μg/L is an often-recommended threshold for considering the test result pathologic. Historically, this threshold goes back to the time when the functional sensitivity of Tg assays was at this level. In the 1990s, it was shown that the addition of cervical ultrasound improved diagnostic sensitivity (2). However, since then, the functional sensitivity of Tg assays has improved considerably (3,4). Despite this, whether cervical ultrasound still has an additional value when employing Tg assays with a functional sensitivity well below 1 μg/L has not been studied extensively.
NUS is a highly operator-dependent technique (5). In DTC patients with a low or intermediate risk of recurrence in accordance with the American Thyroid Association (ATA) guidelines, it was recently shown that false-positive NUS occurred at a very high frequency (6,7).
An alternative follow-up strategy, practiced in some countries where ultrasound is not as readily available in everyday practice, consists of performing NUS only in high-risk patients, patients with elevated Tg levels, or patients with a clinically or otherwise motivated suspicion of residual or recurrent thyroid cancer (8). Using such a strategy, NUS is omitted in the majority of patients.
The aim of the present study was to determine whether there is a clinical benefit from routine cervical ultrasound in DTC patients, regardless of non-thyrotropin (TSH)-stimulated Tg levels, as measured with sensitive Tg assays with a functional sensitivity <1 μg/L, after total thyroidectomy and 131I ablation.
Materials and Methods
Database
The Department of Nuclear Medicine of the University of Würzburg, a tertiary referral center for DTC, established its Thyroid Cancer Database in 1980. This database allows for larger prospective longitudinal scientific population studies in DTC patients (9–16). Data are recorded by trained medical documentation specialists for each patient visit. Collected data include basic pathology data, the results of diagnostic and therapeutic procedures, as well as findings of laboratory, imaging, and further diagnostic procedures at follow-up visits.
The database, as part of a larger local system of oncologic databases, is maintained, updated, and analyzed with approval of and continuous monitoring by the local medical ethical committee. At their first visit to the hospital, patients or their parents/guardians are asked to give written consent for the recording and anonymized analysis of their data.
Patients and investigations
In this retrospective database study, records were selected of all outpatient non-TSH-stimulated follow-up procedures after 131I ablation without evidence of antithyroglobulin antibody interference performed since the introduction of a Tg assay with a functional sensitivity of <1 μg/L in the hospital on June 15, 1996. Data collection ended at the moment of data extraction on July 1, 2012.
Treatment
All patients underwent total thyroidectomy. After surgery, all patients included in the present study received 131I ablation. After surgery and 131I ablation, TSH-suppressive levothyroxine treatment was initiated and upheld throughout follow-up. Lifelong follow-up took place at half-yearly intervals for the first five years of unremarkable follow-up and yearly intervals thereafter. Follow-up consisted at a minimum of clinical examination, Tg measurement, and ultrasound of the neck. In addition, 6–12 months after initial radioiodine therapy, patients underwent diagnostic 131I whole-body scans and Tg measurement after withdrawal of levothyroxine or, in later years, after exogenous stimulation with recombinant human TSH. Until 2008, this was repeated at least once more within the first two years after diagnosis. Additional imaging procedures such as 18F-2-fluoro-2-deoxyglucose positron emission tomography (FDG-PET) scans, computed tomography (CT) scans, or magnetic resonance imaging (MRI) scans were performed if necessary. If indicated based on the results of follow-up examinations, further 131I therapies with 7000 MBq (189 mCi) or, in a few selected patients, with a dosimetrically determined (17) activity were given. 131I therapy was abandoned if the cancer became 131I refractory, as defined by a failing biochemical and/or structural response or the diagnosis of 131I-negative metastases. In such patients, further management was performed on an individualized basis.
Laboratory analysis
For Tg measurement, assays were purchased from Henning (later called B.R.A.H.M.S.; Thermo Fisher Scientific B.R.A.H.M.S. GmbH, Henningsdorf, Germany). Functional sensitivity of the assays was 0.3 μg/L for the assay Dynotest Tg (18), which was used from June 15, 1996, to July 8, 2001, and 0.2 μg/L for the assay Dynotest Tg Plus (19), which was used from July 9, 2001, onwards.
The presence of antibody interference was assessed in the earlier part of the study using recovery measurement. In later years, this was supplemented by the direct measurement of antithyroglobulin antibodies using the VARELISA method (Thermo Fisher Scientific Varelisa Systems, Freiburg, Germany).
Ultrasound
At each follow-up visit, patients would receive at least a NUS in addition to Tg measurement. During the study period, two different devices were primarily used. From the start of the study to September 2008, a Siemens Sonoline Elegra (Siemens Medical Systems, Erlangen, Germany) was used. From September 2008 onwards, a GE Logiq e9 ultrasound (GE Healthcare, Munich, Germany) was used. Both these machines were equipped with a linear 12 MHz ultrasound probe for NUS. In addition, from the start of the study until October 2004, a Siemens Sonoline Prima (Siemens Medical Systems) equipped with a linear 7.5 MHz NUS probe was also used. From October 2004 onwards, this was replaced by a Siemens Sonoline G50 (Siemens Medical Systems), also equipped with a linear 7.5 MHz probe. It was not recorded in the database which ultrasound machine was used for a particular examination.
NUS was performed by scanning the neck, including the central/paratracheal and lateral anterior regions of the neck between the mastoid bone and the clavicles, on both sides with a transducer frequency of at least 7.5 MHz. Examinations were performed by all physicians of the Department of Nuclear Medicine of the University Hospital Würzburg, which at any time included at least five residents and three or four board-registered nuclear medicine physicians. Resident physicians were always supervised by an experienced nuclear medicine physician.
A pathologic ultrasound finding was dealt with on an individual basis, also weighing up other factors such as initial disease stage, histology, and response to therapy. Briefly, in case of highly suspicious lymph nodes (i.e., lymph nodes with one or more of the following characteristics: a diameter >1 cm, a non-oval shape, a missing fat-hilus sign, a more echo-intense signal, or an increased perfusion as determined by color Doppler sonography) or a high degree of suspicion of local recurrence, a fine-needle aspiration biopsy was performed. In most other cases, the strategy consisted of watchful waiting with repetition of the cervical ultrasound after three to six months. Alternatively, a diagnostic 131I whole-body scan was performed earlier than originally scheduled. More rarely, FDG-PET/CT or, in earlier days, Tc-99m-MIBI scans would be used in case of suspected 131I negative disease.
Analysis
Initial risk group definition was performed based on the 2006 European Consensus (20), in which the very-low-risk and low-risk groups were combined into a single “low-risk” group. In brief, patients with a maximal tumor diameter of up to 4 cm without lymph node or distant metastases and without any extrathyroidal extension were classified as low risk, whereas all other patients were classified as high risk.
Within the database, the results of ultrasound examinations are recorded as either “pathologic” or “normal.” Unfortunately, no further information was stored for pathologic scans, so it is not possible to assess which specific lesions were involved in each case. Generally speaking, atypical lymph nodes, that is, those not configured ovally but rather more spherically, with a denser or irregular structure on ultrasound, or those with a short-axis diameter >1 cm, were all classified as pathologic. Vascularized nodules in the thyroid bed were also coded as pathologic.
The accuracy of each NUS recording was assessed by a single board-certified nuclear medicine physician (F.A.V.), who has extensive experience in the follow-up of DTC, based on the results of further diagnostic and/or therapeutic procedures within six months of a particular ultrasound, as recorded in the database. As the database only had dichotomous records of the NUS results without further information such as size, vascularization of the lesions, and so on, only the summary assessment of “normal” or “pathologic” was assessed for accuracy. All assessments were entirely based on the database record, and no re-evaluation of ultrasound images or other investigations (e.g., 131I whole-body scans) was performed, since the goal of the present study was to assess the data as employed in clinical practice rather than based on blinded expert review. Moreover, no other clinical information was available to the reviewer.
The period of six months was chosen to allow for further confirmatory procedures. As the focus of the present study was to investigate the diagnostic accuracy at the moment of recording rather than the predictive power of NUS, no information from further follow-up studies was assessed because it was assumed that after more than six months, new lesions, which were not detectable at the time of the ultrasound procedure, could have grown.
A scan was deemed true positive if further supporting evidence in the form of histology, cytology, radioiodine scanning, or other ancillary studies was found within six months. A scan was deemed false positive if no such supporting evidence was found and/or the lesion in question had disappeared on a follow-up ultrasound performed within six months. A scan was deemed true negative if, after a negative ultrasound, no other structural evidence of disease was found (Tg levels were not considered structural evidence of disease). A scan was deemed false negative if, within six months of a negative ultrasound, evidence of disease in the form of histology, cytology, radioiodine scanning, or other ancillary exams was found.
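The four definitions above reduce to a two-by-two decision rule, which can be sketched as follows. This is a minimal illustration rather than the procedure actually used for the database review; the function name and the flag `evidence_within_six_months` are hypothetical. A lesion that disappeared on a follow-up ultrasound within six months is treated here as a case without supporting evidence.

```python
from enum import Enum


class Outcome(Enum):
    TRUE_POSITIVE = "true positive"
    FALSE_POSITIVE = "false positive"
    TRUE_NEGATIVE = "true negative"
    FALSE_NEGATIVE = "false negative"


def classify_scan(nus_pathologic: bool, evidence_within_six_months: bool) -> Outcome:
    """Classify a single NUS exam.

    nus_pathologic: the dichotomous database record ("pathologic" vs "normal").
    evidence_within_six_months: hypothetical flag, True if histology, cytology,
    radioiodine scanning, or other ancillary studies showed structural disease
    within six months of the exam. Tg levels alone do not count as evidence.
    """
    if nus_pathologic:
        if evidence_within_six_months:
            return Outcome.TRUE_POSITIVE
        return Outcome.FALSE_POSITIVE
    if evidence_within_six_months:
        return Outcome.FALSE_NEGATIVE
    return Outcome.TRUE_NEGATIVE
```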
Based on this classification, values were calculated for sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV), as well as accuracy for the whole population and various subgroups, including confidence intervals (CI) for each of these values. Differences between subgroups with regard to these values were considered significant if the CI showed no overlap.
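The non-overlap criterion for significance can be illustrated with a short sketch. The function names are hypothetical, and the intervals in the usage lines are the specificity values reported in the Results for patients with undetectable Tg levels in periods 1, 2, and 3.

```python
def intervals_overlap(ci_a, ci_b):
    """Return True if two closed intervals (low, high) share at least one point."""
    low_a, high_a = ci_a
    low_b, high_b = ci_b
    return low_a <= high_b and low_b <= high_a


def is_significant(ci_a, ci_b):
    """Apply the rule used here: a difference is significant if the CIs do not overlap."""
    return not intervals_overlap(ci_a, ci_b)


# Specificity CIs (%) for undetectable Tg, as reported for periods 1, 2, and 3.
period_1 = (73.1, 77.9)
period_2 = (70.3, 77.3)
period_3 = (79.8, 90.1)

print(is_significant(period_1, period_2))  # intervals overlap, so not significant
print(is_significant(period_2, period_3))  # no overlap, so significant
```

Non-overlap of confidence intervals is a conservative criterion: two values can differ significantly in a formal test even when their intervals overlap slightly.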
Results
Records of 3176 complete outpatient visits consisting of at least the measurement of TSH and Tg as well as cervical ultrasound in 773 individual patients were retrieved. Details on the included patients' examinations can be found in Table 1.
Data are given as number of patients or as median (range).
Of the 3176 ultrasound examinations, 2199 were classified as true negative, 216 as true positive, and 69 as false negative. Furthermore, 692 NUS exams were classified as false positive; these occurred in 339 (43.9%) individual patients, 170 of whom were initially classified as low risk.
Further expanded diagnostic procedures in the present patient population included 94 fine-needle biopsies, 77 radiological procedures, including CT and MRI, as well as 70 PET/CT scans. After a NUS was classified as pathologic, 733 examinations in 359 individual patients were followed by active ultrasound surveillance and Tg measurement only.
Overall this results in values for sensitivity, specificity, PPV, and NPV, as well as accuracy [CI] of cervical ultrasound in the post-131I ablation DTC follow-up of 75.8% [70.1–81.5%], 76.1% [74.3–77.8%], 23.8% [18.1–29.5%], 97.0% [96.2–97.7%], and 76.0% [74.3–77.7%], respectively. Remarkably, the number of false-positive ultrasound examinations outnumbers true-positive ones by a ratio of 3.2:1.
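The point estimates follow directly from the four exam counts using the standard definitions, as the sketch below shows. The function name is hypothetical, and the confidence intervals are not reproduced because the interval method is not specified in the text.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Compute standard diagnostic accuracy measures from a 2x2 table of exam counts."""
    total = tp + fp + tn + fn
    return {
        "sensitivity": tp / (tp + fn),   # detected disease / all disease
        "specificity": tn / (tn + fp),   # correct negatives / all without disease
        "ppv": tp / (tp + fp),           # confirmed positives / all positive exams
        "npv": tn / (tn + fn),           # confirmed negatives / all negative exams
        "accuracy": (tp + tn) / total,   # correct exams / all exams
    }


# Counts reported for the whole study population (3176 exams).
metrics = diagnostic_metrics(tp=216, fp=692, tn=2199, fn=69)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")

# Ratio of false-positive to true-positive exams.
print(f"FP:TP = {692 / 216:.1f}:1")
```

Rounded to one decimal place, this gives 75.8%, 76.1%, 23.8%, 97.0%, and 76.0%, and a false-positive to true-positive ratio of 3.2:1, in line with the values reported above.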
Ultrasound results in relation to initial risk
Table 2 summarizes the NUS results in low- and high-risk patients. The differences in terms of sensitivity, specificity, PPV, NPV, and accuracy between the low- and high-risk groups are only marginal and not significant. Furthermore, the rate of false-positive and false-negative findings is comparable in each of the risk groups.
Data shown are % of the patient group, as well as the values for sensitivity, specificity, PPV, and NPV, with confidence intervals for these values given in square brackets. False-negative values are underlined; false-positive ones underlined and italic.
DTC, differentiated thyroid carcinoma; PPV, positive predictive value; NPV, negative predictive value.
Ultrasound results in relation to Tg levels
Table 3 displays the NUS results in patients according to Tg levels. Sensitivity, specificity, PPV, NPV, and accuracy for patients with a Tg below the functional sensitivity of the assay and those with a Tg between the functional sensitivity and 1 μg/L are comparable. Furthermore, these groups have similar rates of true-positive, false-negative, and false-positive findings, with false-positive findings outnumbering true-positive ones by ratios of roughly 13:1 and 8.2:1, respectively. This contrasts with the findings in patients with a Tg ≥1 μg/L, where the PPV is significantly better and the NPV is significantly worse than in both groups with a Tg <1 μg/L. Furthermore, the relative number of false-positive findings is lower and the relative number of true-positive ones is higher, resulting in a ratio of roughly 1:1. Changing the threshold to higher Tg levels (e.g., 5 μg/L or 10 μg/L) does not significantly improve these findings (data not shown).
Data shown are values for sensitivity, specificity, PPV, and NPV, with confidence intervals for these values between square brackets. False-negative values are underlined; false-positive ones are underlined and italic.
FS, functional sensitivity; Tg, thyroglobulin.
Ultrasound results over time
To analyze the influence of technical improvements over time, the study cohort was divided into three time periods, using the introduction of an improved Tg assay on July 9, 2001, as the endpoint of period 1 and the introduction of a new generation ultrasound machine on September 1, 2008, as the endpoint of period 2. Period 3 ended at the end of data collection. Although there were no significant differences between periods 1 and 2, there were a number of significant differences between these two periods and period 3. In contrast to these earlier periods, no true-positive or false-negative findings were seen in patients with low (<1 μg/L) or undetectable Tg levels in period 3. Especially in patients with undetectable Tg levels (period 1: 75.5% [73.1–77.9%]; period 2: 73.8% [70.3–77.3%]; period 3: 85.3% [79.8–90.1%]), this results in a significantly improved specificity, whereas sensitivity for patients with low or undetectable Tg levels cannot be calculated for period 3 due to a lack of positive results. Furthermore, in period 3, there was a reduction in the relative number of false-positive findings in patients with low or undetectable Tg levels and in patients with Tg levels ≥1 μg/L (period 1: 27.0%; period 2: 25.0%; period 3: 14.6%), although this did not result in significant changes in the PPV, as the relative number of true-positive findings was also reduced.
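The division into periods can be sketched as a simple date lookup. The function and constant names are hypothetical. It is also an assumption that an exam on a cut-off date belongs to the later period; for the first cut-off this matches the assay dates given in the Methods (Dynotest Tg until July 8, 2001, Dynotest Tg Plus from July 9, 2001).

```python
from datetime import date

STUDY_START = date(1996, 6, 15)   # introduction of the sensitive Tg assay
PERIOD_1_END = date(2001, 7, 9)   # introduction of the improved Tg assay
PERIOD_2_END = date(2008, 9, 1)   # introduction of the new-generation ultrasound machine
STUDY_END = date(2012, 7, 1)      # data extraction


def study_period(exam_date: date) -> int:
    """Return 1, 2, or 3 for an exam date within the study window.

    Assumption: exams on a cut-off date are assigned to the later period.
    """
    if not (STUDY_START <= exam_date <= STUDY_END):
        raise ValueError("exam date outside the study window")
    if exam_date < PERIOD_1_END:
        return 1
    if exam_date < PERIOD_2_END:
        return 2
    return 3


print(study_period(date(2007, 1, 15)))  # falls in period 2
```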
Discussion
The present results clearly show that patients with a low or undetectable Tg level are much more likely to be classified as false positive than as false negative by NUS, potentially causing a large number of additional follow-up visits, as well as more invasive investigations as a result of such findings. Furthermore, in more recent years, there has been a clear tendency toward a loss of clinical usefulness of NUS in patients with low or undetectable Tg levels, as true-positive results in such patients have become exceedingly rare. In fact, in the present population, no such result was recorded after January 2007.
The results are in full agreement with those recently reported by two studies by Yang et al. (6,7). They found that in the course of follow-up of 90 ATA intermediate risk of recurrence DTC patients, 51 (57%) patients had at least one false-positive NUS examination. Although a direct comparison between the patient populations is difficult, as the ATA risk of recurrence system cannot be applied directly to the patient database, the results in both low- and high-risk patients are very similar to the ones described by Yang et al. In their second study, the same authors recently reported on ATA low risk of recurrence patients. In a population of 171 such patients followed with serial NUS, 67% had at least one false-positive ultrasound result, whereas only 1.2% of patients had true-positive results. Again, the present results are very similar in nature, although the false-positive rate in this study is somewhat lower, possibly because of the inclusion of patients who did not undergo 131I ablation in the study by Yang et al. (7). Furthermore, the comparison of these two studies from the United States also shows similar findings as the present results between low- and high-risk patients, and no major difference in the performance of NUS seems to occur between different risk groups.
Furthermore, the present results for the earlier time periods are in full agreement with those published earlier by Pacini et al. (2). In 2003, they described obtaining true-positive NUS results in 1.6% of 250 patients with a non-TSH-stimulated Tg <1 μg/L after total thyroidectomy and 131I ablation.
However, since 2003, the results have shifted. The reduction in the number of false-negative and true-positive results in patients with low or undetectable Tg levels to non-existent levels cannot be readily explained based on the present data. An effect associated with the change in Tg assays can effectively be excluded, as in both period 1 and period 2, similar rates of false-positive findings were obtained. Although the improvement of ultrasound technology in the form of a new machine used in period 3 may explain the lack of false-negative as well as the reduction in the number of false-positive results, it cannot account for the lack of true-positive results in patients with a low or undetectable Tg. The latter is more likely the result of a multifactorial process involving a number of different areas in the care of DTC patients. Among these is the well-documented epidemiological shift toward a lower stage at diagnosis that occurred during the last two decades. This will reduce the risk of pathologic findings occurring after surgery and 131I ablation. None of the many different effects that may play a role can, however, be quantified with regard to the present findings.
The present study has several weaknesses, which mostly are inherent to its retrospective design. First of all, the present results are likely influenced by the relatively large number of physicians who have carried out the NUS examinations in the time frame of the present study. As ultrasound is a highly observer-dependent modality (5), this might potentially introduce a bias when compared to the situation where only a few or even one expert longitudinally repeats such examinations. Nonetheless, considering that the false-positive rate is somewhat lower than that reported by Yang et al. (6,7), this does not appear to have excessively influenced the result, or at least no more than in other recognized centers of excellence. Furthermore, the larger number of physicians involved in the NUS exams for the present study may in fact make the results of the present study more transferable to clinical practice, where patients are often evaluated by sonographers and physicians with variable levels of expertise.
In addition, techniques in DTC care have evolved over time, potentially affecting the assessment of the accuracy of NUS, such as improvements in MRI techniques or the introduction of SPECT/CT (21,22) and PET/CT (23). These technological improvements can both prove and disprove results in a noninvasive manner, and the effect on NUS accuracy may ultimately be neutral. These technological improvements may also contribute to the changes in results over time, as described above.
Since the database was only filled with a one-word summary of the examination, namely “pathologic” or “normal,” the retrospective nature of the study and the highly observer-dependent methodology of NUS make it impossible to expand further on the precise nature and localization of the pathological findings. This introduces a certain amount of bias, as it is conceivable that patients will undergo further diagnostic procedures due to a lesion that was falsely identified as positive by NUS, which may reveal further DTC lesions in different localizations that were missed using NUS.
Of course, the present results only apply to DTC patients similar to the ones included in this study. In patients who are positive for Tg antibody (24), patients who did not undergo total thyroidectomy, or patients who never received 131I ablation after total thyroidectomy, NUS remains an indispensable tool for monitoring of signs of residual or recurrent disease in the neck, as Tg levels are either not reliable or less sensitive in such patients. Furthermore, the present results only relate to non-TSH-stimulated Tg levels, and recommendations cannot be made on the additional value of cervical ultrasound in the setting of TSH-stimulated Tg measurement. However, given that NUS does not provide additional information in patients with undetectable or low non-stimulated Tg levels, it is very likely that NUS will not add further information to stimulated Tg values that are low or undetectable.
The present results support that follow-up monitoring can be reduced to measuring non-stimulated Tg levels in the majority of DTC patients who underwent 131I ablation and who have very low or undetectable Tg levels. In particular, no additional information is to be expected from NUS, as the likelihood of achieving a false-positive result requiring further diagnostic evaluation is an order of magnitude higher than the likelihood for a true-positive result. However, in patients who have a Tg level ≥1 μg/L, NUS is a useful diagnostic tool to noninvasively identify DTC lesions in the neck with an acceptable false-positive rate. Therefore, it seems reasonable to base follow-up primarily on the measurement of serum Tg levels using a sensitive assay, as is already recommended in some countries (8). NUS should primarily be performed once a Tg level ≥1 μg/L is encountered. Thus, no relevant clinical findings will be missed, but the number of false-positive findings is likely to be greatly reduced.
Conclusion
After total thyroidectomy and 131I ablation, NUS should in general be reserved for anti-Tg antibody negative patients with a Tg level of ≥1 μg/L, regardless of initial risk. Patients with a Tg <1 μg/L have an extremely low rate of true-positive NUS, but they have a considerable rate of false-positive findings. In contrast, patients with a Tg level ≥1 μg/L may benefit from NUS evaluations, given a much better balance between true- and false-positive findings.
Footnotes
Author Disclosure Statement
F.A.V. was a consultant to Bayer Healthcare and Sanofi-Genzyme and has received speaker honoraria from Diasorin and Sanofi-Genzyme. M.L. was a consultant for AstraZeneca, Bayer Healthcare, Sanofi-Genzyme, and Sobi and has received speaker honoraria and research support from Sanofi-Genzyme, Henning, and Merck. L.G. has received compensation as a member of the scientific advisory board of Roche Diagnostics and Sanofi-Genzyme and speaker for Roche Diagnostics, Sanofi-Genzyme, BRAHMS Thermo Fisher, and Siemens Healthcare. The other authors declare that they have no competing financial interests pertaining to this study.
