Challenges in Developing Recommendations Based on Low-Quality Evidence in Thyroid Guidelines

Abstract

Clinical practice guidelines (CPGs) provide guidance informed by evidence and clinical experience, with the ultimate intention of improving health outcomes. In many areas of medicine, including thyroidology, the best available evidence to guide clinical decision-making is often of low quality (i.e., based on observational research that is subject to important limitations, in the absence of high-quality randomized trials). This relates, in large part, to the relative paucity of funding supporting clinical trials in our field. Furthermore, certain practice patterns have been established as convention for decades without formal trials meeting modern standards (e.g., the use of beta blockers in thyrotoxicosis); these are unlikely to be studied in future trials if their benefit is widely accepted and financial resources for trials are limited.

CPG panels are thus frequently faced with the challenge of formulating clinically meaningful recommendations when the relevant clinical evidence is of low quality. This occurs in situations where the best available evidence consists of studies with important design, execution, or reporting limitations, which can be further complicated by the presence of conflicting study results.

Clinical decision-making in the context of low-quality evidence is challenging, and often subject to controversy. Clinically meaningful CPG recommendations require guideline panelist judgment with a transparent explanation of evidence uncertainties and reasonable alternative options. In general, CPG panels must weigh the clinically important risks and benefits of interventions, as reported in primary research or systematic reviews/meta-analyses, and consider the strengths and limitations of the body of research, in developing a recommendation for implementation in clinical practice. The experience and background of CPG panelists, as well as other factors, may also influence a recommendation, and in particular the strength of the recommendation.

Formal frameworks, such as the “Grading of Recommendations, Assessment, Development, and Evaluation” (GRADE) system (1), the American College of Physicians (ACP) system (2), or others, are used by CPG panels to rate both the quality of evidence (reflecting certainty in estimates of benefits, harms, or other outcomes) and the strength of recommendations (which has implications for the implementation of the recommendation in clinical practice). The utilization and reporting of an explicit process to rate evidence and recommendations are in keeping with current standards for trustworthy guidelines, as originally outlined by the Institute of Medicine (3).

The popular GRADE system, which has been studied the most extensively (4), was conceived and refined by the GRADE Working Group over the last two decades, with the goal of “developing an optimal system of rating of quality of evidence and determining strength of recommendations for clinical practice guidelines” (1). GRADE developers have endorsed the use of this system in CPGs, systematic reviews, and health technology assessments (1).

The ACP approach, which was originally developed for use in ACP guidelines and shares some core concepts with GRADE, has fewer categories of quality of evidence and fewer dimensions for consideration in formulating the strength of a recommendation than the currently used version of GRADE (2). In contrast to GRADE, the original ACP system enables rating of recommendations as “Insufficient Evidence to Determine Net Benefits or Risks” and does not include a category of best practice statements (2). The ACP's guideline development committee has, however, now fully adopted the GRADE system in its guidelines (with several minor modifications) (5).

The American Thyroid Association (ATA) has used the original ACP system, GRADE, and other systems in various guideline iterations (6). As clinicians with prior experience in developing thyroid CPGs, our goal is to reflect on some of the challenges we faced in formulating meaningful CPG recommendations when the best available pertinent evidence is of low quality.

CPGs produced by professional organizations have been subject to some criticism of their methodologic approaches (7). Examples of organizations whose clinical guidance documents have been criticized for their methods include the following: the American College of Cardiology (8), the Endocrine Society (9), the World Health Organization (WHO) (10,11), UpToDate (12), the American College of Chest Physicians (13), the American College of Gastroenterology (14), the American Thoracic Society (15,16), and most recently the ATA (17). A common criticism from methodologists is that “guideline panels frequently insist on formulating unjustifiably strong recommendations” (18), which is inferred to be in the context of low-quality evidence.

In the field of thyroidology, Bautista-Orduno et al. recently published a critique (17) of five recent ATA CPGs (19–23), in which the authors reported their judgments on whether CPG recommendations were concordant with “GRADE guidance.” Bautista-Orduno et al. suggested that many of the strong recommendations based on low-quality evidence should have been rated as “weak” recommendations or classified as ungraded “best practice” statements (17). These authors suggested that more than half of the strong recommendations based on low-quality evidence (89/151, 59%) should have been labeled as “best practice” statements.

Best practice statements are a component of the GRADE system, but not of the ACP system, and they generally refer to statements based on indirect evidence. As per an official GRADE working group publication, “best practice” statements should be “seldom” used, but they may be justified if “after consideration of all relevant outcomes and potential downstream consequences,” their implementation is expected to “result in large net positive consequences,” there is “a well-documented and clear rationale connecting the indirect evidence,” and collecting and summarizing the evidence are judged to be “a poor use of a guideline panel's limited time and energy” (24). It is also important to recognize that the GRADE working group has indicated that they are “uniformly concerned about the inappropriate use of good practice statements,” and “some members” were “so concerned they feel GRADE is unwise to provide guidance for such statements” (24). Moreover, Bautista-Orduno et al. (17) reviewed one ATA guideline that used the GRADE system (23) and four ATA guidelines that used the original ACP system (19–22); the criticism regarding concordance with GRADE guidance is therefore not directly applicable to the four guidelines that did not use the GRADE framework. Furthermore, some of the judgments of the authors encouraging the liberal use of best practice statements may also not be concordant with official GRADE guidance, which advises to “seldom” use this category. The distinction between the original ACP system and GRADE may be a source of confusion, as authors from the same group have recently incorrectly reported that the five ATA guidelines discussed above utilized GRADE (25). It may be informative to reflect in greater detail on the experiences of CPG panels in applying the GRADE framework, including the recent experience of the ATA hyperthyroidism CPG Chair (D.S.R.) (23).

The GRADE system for rating of quality of evidence and strength of recommendations is complex and may be challenging for guideline panels to reliably operationalize. Sinclair et al. from the WHO have indicated that some guideline panel members reported “uncertainties about how to apply the GRADE approach when the evidence was of very low quality or when the recommendation seemed obviously common sense” (26). Furthermore, Sinclair et al. suggested that the WHO departmental staff expressed uncertainties regarding the “technicalities and bureaucracy” of GRADE, whereas the WHO Guidelines Review Committee members had uncertainties regarding a “transparent, systematic, and explicit process” (26). Also reflecting on the WHO experience, Barbui et al. reported that “problems with reliability in GRADE are particularly problematic when several raters are involved in the development of recommendations, which is often the case for development of guidelines covering a broad range of conditions” (27).

There are also challenges in using the GRADE approach to convey a nuanced recommendation that may be highly dependent on contextual factors. Gärtner et al. recently suggested that a limitation of the current GRADE format is that it dichotomizes recommendations as either “strong” or “weak” and either “for” or “against” an intervention, which inherently creates difficulty in the explicit statement and justification of multiple medically reasonable options (28). Moreover, even after training from expert GRADE methodologists, the level of inter-rater agreement among guideline panelists in categorizing “strong” or “weak recommendations” is at best fair (κ coefficient 0.39) (29). Thus, there are known challenges in reproducibly applying the GRADE approach, which are particularly evident in formulating and rating recommendations based on low-quality evidence when multiple stakeholders with differing perspectives are involved.

GRADE methodologists have explored some of the challenges faced by guideline authors in using their framework. In a qualitative interview study that included WHO CPG panelists, GRADE methodologists suggested that panelists make recommendations “inconsistent with GRADE guidance” due to “limitations in their understanding of GRADE” as well as “skepticism” about the value of GRADE with implications for implementation of weak/conditional recommendations, political considerations, and high certainty in benefits (10). However, a counterargument could be made that some GRADE methodologists may also have limitations in their understanding of clinical context. The challenges in implementing GRADE reported by other organizations may also be generalizable to some extent to thyroidology.

The experience of a guideline chair (D.S.R.) from the 2016 ATA Guidelines for Diagnosis and Management of Hyperthyroidism and Other Causes of Thyrotoxicosis (23) illustrates some of the challenges of using GRADE when the best available evidence is of low quality. Bautista-Orduno et al. (17) reported that, in their judgment, many strong recommendations had been misclassified as strong rather than weak according to the GRADE framework. While one can attribute some of these misclassifications to the subjective interpretation of the use of “paradigmatic situations” allowed by GRADE (for strong recommendations based on low-quality evidence), one can also question whether this represents a limitation of the GRADE system in providing a readily operationalized framework. According to Bautista-Orduno et al. (17), the GRADE system would not distinguish between two recommendations that were supported only by low-quality evidence, one of which was enthusiastically endorsed by all members of the guidelines task force and widely used in clinical practice (e.g., the example used in Table 2 of their article: “Patients with symptomatic thyrotoxicosis due to painless thyroiditis should be treated with beta-adrenergic-blocking drugs to control symptoms”), and another that was not the routine practice of all task force members and was offered as a clinical consideration (e.g., “In patients who are at increased risk for complications due to worsening of hyperthyroidism, resuming MMI [methimazole] 3–7 days after RAI [radioactive iodine] administration should be considered”) (17). In these examples, GRADE guidance could be considered inconsistent with the clinical judgment/experience of CPG panelists. Furthermore, in this context of strong beliefs based on clinical experience and established standards of care, the panelists expressed some skepticism regarding the implications for clinical implementation of a “weak” recommendation.

Some skepticism about the clinical implementation of a weak recommendation may be warranted, as reflected in our (B.R.H., A.M.S.) experience developing the most recently published iteration of the ATA guidelines on management of thyroid nodules and differentiated thyroid cancer (DTC) in adults (22). In applying the ACP system in these guidelines, we assigned a weak recommendation based on low-quality evidence to the use of adjuvant radioactive iodine treatment (RAI, RAIT) in intermediate-risk thyroid cancer (Recommendation 51d: “RAI adjuvant therapy should be considered after total thyroidectomy in ATA intermediate-risk level DTC patients”) (22). Although extensive explanation was provided in the text and a related table, a subsequent European consensus article critiquing this CPG indicated, “We would favor retaining the longstanding, widely applied practice of postoperative RAIT in low-risk and intermediate risk DTC patients until these studies (referring to ongoing randomized trials) reach conclusions suggesting that practice should be changed” (30). The controversy ultimately culminated in a dedicated meeting of representatives of multiple international thyroid and nuclear medicine societies in Martinique, and one of the major conclusions of that group was that “major gaps in knowledge and evidence regarding optimal use of I-131 therapy should be addressed with properly designed prospective studies” (31). Thus, there was little doubt among relevant stakeholders that the best available evidence was of relatively low quality, but there was substantial disagreement regarding the strength of any recommendation (with wording implications) regarding use of the intervention (particularly given the long-standing, established standard of care in some specialties and regions). Although the need for higher quality research is universally acknowledged, the issue remains controversial. This example illustrates how CPG panels may also be criticized for not making a strong recommendation for an established approach in the presence of low-quality evidence, and may justify some skepticism among CPG panelists about the clinical utility of a weak recommendation.

It is also important to acknowledge several published methodologic critiques of the GRADE system. Some methodologists have suggested that the challenges in the appropriate and consistent implementation of the GRADE framework may stem from limitations inherent within the framework (32–34). Mercuri and Gafni have suggested the following problems with GRADE: (a) the absence of a theoretical and/or empirical basis for the framework components, (b) a lack of clarity in the presented criteria for determining the quality of evidence and strength of recommendations, and (c) a lack of clarity in defining how to operationalize and integrate key criteria/components (32–34). Among GRADE methodologists, some concerns have been expressed with respect to the appropriateness of the use of “best practice” statements within evidence-based CPGs (24). Furthermore, the constant evolution of GRADE, which may be confusing to users (particularly when a guideline is in the process of being developed), has complicated the operationalization of this framework.

In conclusion, despite the advances that have been made in CPG methodology in recent years, significant challenges remain in formulating clinically useful recommendations when the best available evidence is of low quality. Our ultimate goal should be to generate clinically useful guidelines that improve the health outcomes of patients. In keeping with ATA policies (6), we believe that it is important for CPG panels to include all relevant stakeholders, including clinicians from relevant professions, methodologists, and patient stakeholders, and that representation should be determined with attention to gender equity and diversity.

In our experience, it is not uncommon for guideline panelists from varied backgrounds to disagree on important issues, particularly where the evidence foundation is of low quality. Contextual factors are also important to consider, and clinically relevant nuanced recommendations may be challenging to formulate using existing methodologic frameworks. We believe that the choice between categorizing a statement as a best practice statement (in GRADE) and making a strong recommendation (based on low-quality evidence) is open to interpretation and may not necessarily impact clinical care, as the same course of action is recommended in both cases, and the low quality of evidence is clear. We believe it is more important for CPG panelists to carefully deliberate upon the evidence (with associated uncertainties) and important clinical contextual factors in a transparently reported manner, rather than spending excessive time and energy on the technicalities of such labeling. Furthermore, more research is needed to determine whether the use of any particular evidence/recommendation framework in CPG development is associated with improved health outcomes. Finally, it is important to acknowledge that CPG recommendations are informed by both evidence and experience, and that the experiences of various stakeholders (e.g., clinicians from various disciplines or regions of practice, methodologists, and patients) may vary substantially, which in turn may contribute to different viewpoints on recommendations.

Footnotes

Author Disclosure Statement

A.M.S., E.K.A., A.C.B., B.R.H., E.P., D.S.R., R.C.S., and J.J. were chairs or members of the writing groups for the referenced ATA CPGs. B.R.H., P.A.K., and D.S.R. have participated as members, and A.M.S. and J.J. as cochairs, of the ATA Guidelines and Statements Task Force or subsequent Committee. P.A.K. was Editor-in-Chief of Thyroid, the journal in which the referenced CPGs were published. R.C. is the methodologist working with ATA CPGs that are currently in development.

Funding Information

No funding was received.

References

1. Guyatt, Oxman, Schünemann, Tugwell, Knottnerus. 2011. GRADE guidelines: a new series of articles in the Journal of Clinical Epidemiology. J Clin Epidemiol, 64:380–382.

2. Qaseem, Snow, Owens, Shekelle, Clinical Guidelines Committee of the American College of Physicians. 2010. The development of clinical practice guidelines and guidance statements of the American College of Physicians: summary of methods. Ann Intern Med, 153:194–199.

3. Committee on Standards for Developing Trustworthy Clinical Practice Guidelines, Board on Health Care Services, Institute of Medicine. 2011. Clinical Practice Guidelines We Can Trust. National Academies Press, Washington, DC. Available at http://www.nap.edu (accessed April 12, 2020).

4. Ansari, Tsertsvadze, Moher. 2009. Grading quality of evidence and strength of recommendations: a perspective. PLoS Med, 6:e1000151.

5. Qaseem, Kansagara, Lin, Mustafa, Wilt, Clinical Guidelines Committee of the American College of Physicians. 2019. The Development of Clinical Guidelines and Guidance Statements by the Clinical Guidelines Committee of the American College of Physicians: update of methods. Ann Intern Med, 170:863–870.

6. Sawka, Carty, Haugen, Hennessey, Kopp, Pearce, Sosa, Tufano, Jonklaas. 2018. American Thyroid Association guidelines and statements: past, present, and future. Thyroid, 28:692–706.

7. Guyatt, Agoritsas, Lytvyn, Siemieniuk, Vandvik. 2019. BMJ rapid recommendations: creating tools to support a revolution in clinical practice guideline adoption. Canad J Gen Intern Med, 14:6–12.

8. Tricoci, Allen, Kramer, Califf, Smith. 2009. Scientific evidence underlying the ACC/AHA clinical practice guidelines. JAMA, 301:831–841.

9. Brito, Domecq, Murad, Guyatt, Montori. 2013. The Endocrine Society guidelines: when the confidence cart goes before the evidence horse. J Clin Endocrinol Metab, 98:3246–3252.

10. Alexander, Gionfriddo, Li, Bero, Stoltzfus, Neumann, Brito, Djulbegovic, Montori, Norris, Schünemann, Thabane, Guyatt. 2016. A number of factors explain why WHO guideline developers make strong recommendations inconsistent with GRADE guidance. J Clin Epidemiol, 70:111–122.

11. Alexander, Bero, Montori, Brito, Stoltzfus, Djulbegovic, Neumann, Rave, Guyatt. 2014. World Health Organization recommendations are often strong based on low confidence in effect estimates. J Clin Epidemiol, 67:629–634.

12. Agoritsas, Merglen, Heen, Kristiansen, Neumann, Brito, Brignardello-Petersen, Alexander, Rind, Vandvik, Guyatt. 2017. UpToDate adherence to GRADE criteria for strong recommendations: an analytical survey. BMJ Open, 7:e018593.

13. Agoritsas, Neumann, Mendoza, Guyatt. 2017. Guideline conflict of interest management and methodology heavily impacts on the strength of recommendations: comparison between two iterations of the American College of Chest Physicians Antithrombotic Guidelines. J Clin Epidemiol, 81:141–143.

14. Meyer, Bowers, Wayant, Checketts, Scott, Musuvathy, Vassar. 2018. Scientific evidence underlying the American College of Gastroenterology's clinical practice guidelines. PLoS One, 13:e0204720.

15. Schumacher, Nguyen, Makam. 2019. Evidence-based medicine and the American Thoracic Society guidelines-reply. JAMA Intern Med, 179:1004–1005.

16. Schumacher, Nguyen, Deshpande, Makam. 2019. Evidence-based medicine and the American Thoracic Society Clinical Practice Guidelines. JAMA Intern Med, 179:584–586.

17. Bautista-Orduno, Dorsey-Trevino, Gonzalez-Gonzalez, Castillo-Gonzalez, Garcia-Leal, Raygoza-Cortez, Gionfriddo, Rodriguez-Gutierrez. 2020. American Thyroid Association Guidelines are Inconsistent with GRADE—A meta-epidemiological study. J Clin Epidemiol, 123:180–188.e2.

18. Rabi, Kunneman, Montori. 2020. When guidelines recommend shared decision-making. JAMA, 323:1345–1346.

19. Smallridge, Ain, Asa, Bible, Brierley, Burman, Kebebew, Lee, Nikiforov, Rosenthal, Shah, Shaha, Tuttle. 2012. American Thyroid Association guidelines for management of patients with anaplastic thyroid cancer. Thyroid, 22:1104–1139.

20. Jonklaas, Bianco, Bauer, Burman, Cappola, Celi, Cooper, Kim, Peeters, Rosenthal, Sawka. 2014. Guidelines for the treatment of hypothyroidism: prepared by the American Thyroid Association Task Force on thyroid hormone replacement. Thyroid, 24:1670–1751.

21. Alexander, Pearce, Brent, Brown, Chen, Dosiou, Grobman, Laurberg, Lazarus, Mandel, Peeters, Sullivan. 2017. Guidelines of the American Thyroid Association for the diagnosis and management of thyroid disease during pregnancy and the postpartum. Thyroid, 27:315–389.

22. Haugen, Alexander, Bible, Doherty, Mandel, Nikiforov, Pacini, Randolph, Sawka, Schlumberger, Schuff, Sherman, Sosa, Steward, Tuttle, Wartofsky. 2016. 2015 American Thyroid Association Management guidelines for adult patients with thyroid nodules and differentiated thyroid cancer: the American Thyroid Association guidelines task force on thyroid nodules and differentiated thyroid cancer. Thyroid, 26:1–133.

23. Ross, Burch, Cooper, Greenlee, Laurberg, Maia, Rivkees, Samuels, Sosa, Stan, Walter. 2016. American Thyroid Association guidelines for diagnosis and management of hyperthyroidism and other causes of thyrotoxicosis. Thyroid, 26:1343–1421.

24. Guyatt, Alonso-Coello, Schünemann, Djulbegovic, Nothacker, Lange, Murad, Akl. 2016. Guideline panels should seldom make good practice statements: guidance from the GRADE Working Group. J Clin Epidemiol, 80:3–7.

25. Castillo-Gonzalez, Dorsey-Trevino, Gonzalez-Gonzalez, Garcia-Leal, Bautista-Orduño, Raygoza, Gionfriddo, Ospina NMS, Rodriguez-Gutierrez. 2020. A deeper analysis in thyroid research: a meta-epidemiological study of the American Thyroid Association clinical guidelines. PLoS One, 15:e0234297.

26. Sinclair, Isba, Kredo, Zani, Smith, Garner. 2013. World Health Organization guideline development: an evaluation. PLoS One, 8:e63715.

27. Barbui, Dua, van Ommeren, Yasamy, Fleischmann, Clark, Thornicroft, Hill, Saxena. 2010. Challenges in developing evidence-based recommendations using the GRADE approach: the case of mental, neurological, and substance use disorders. PLoS Med, 7:e1000322.

28. Gärtner, Portielje, Langendam, Hairwassers, Agoritsas, Gijsen, Liefers, Pieterse, Stiggelbout. 2019. Role of patient preferences in clinical practice guidelines: a multiple methods study using guidelines from oncology as a case. BMJ Open, 9:e032483.

29. Kumar, Miladinovic, Guyatt, Schünemann, Djulbegovic. 2016. GRADE guidelines system is reproducible when instructions are clearly operationalized even among the guidelines panel members with limited experience with GRADE. J Clin Epidemiol, 75:115–118.

30. Luster, Aktolun, Amendoeira, Barczyński, Bible, Duntas, Elisei, Handkiewicz-Junak, Hoffmann, Jarząb, Leenhardt, Musholt, Newbold, Nixon, Smit, Sobrinho-Simões, Sosa, Tuttle, Verburg, Wartofsky, Führer. 2019. European Perspective on 2015 American Thyroid Association Management Guidelines for Adult Patients with Thyroid Nodules and Differentiated Thyroid Cancer: Proceedings of an Interactive International Symposium. Thyroid, 29:7–26.

31. Tuttle, Ahuja, Avram, Bernet, Bourguet, Daniels, Dillehay, Draganescu, Flux, Führer, Giovanella, Greenspan, Luster, Muylle, Smit JWA, Van Nostrand, Verburg, Hegedüs. 2019. Controversies, consensus, and collaboration in the use of ¹³¹I therapy in differentiated thyroid cancer: a joint statement from the American Thyroid Association, the European Association of Nuclear Medicine, the Society of Nuclear Medicine and Molecular Imaging, and the European Thyroid Association. Thyroid, 29:461–470.

32. Mercuri, Gafni. 2018. The evolution of GRADE (part 3): a framework built on science or faith? J Eval Clin Pract, 24:1223–1231.

33. Mercuri, Gafni. 2018. The evolution of GRADE (part 2): still searching for a theoretical and/or empirical basis for the GRADE framework. J Eval Clin Pract, 24:1211–1222.

34. Mercuri, Gafni. 2018. The evolution of GRADE (part 1): is there a theoretical and/or empirical basis for the GRADE framework? J Eval Clin Pract, 24:1203–1210.