Reimagining the Humble but Mighty Pen: Quality Measurement and Naturalistic Decision Making

Abstract

Much of the health system’s avoidable spending may be driven by doctors’ decision making. Past studies demonstrated potentially consequential and costly inconsistencies between the actual decisions that clinicians make in daily practice and optimal evidence-based decisions. This commentary examines the “best practices regimen” through the lens of the quality measurement movement. Although quality measures have proliferated via public reporting and pay-for-performance programs, evidence for their impact on quality of care is scant; the cost of care has continued to rise; and the environment for clinical decisions may not have improved. Naturalistic decision making offers a compelling alternative conceptual frame for quality measurement. An alternative quality measurement system could build on insights from naturalistic decision making to optimize doctors’ and patients’ joint decisions, improve patients’ health outcomes, and perhaps slow the growth of health care spending in the future.

Keywords

health care delivery, decision making, naturalistic decision making

The late Dr. John Eisenberg was fond of saying that the most expensive instrument in medicine is a doctor’s pen. The pen has since been replaced by keyboards and other devices, but more than 30 years later, doctors’ decisions still influence not only patient outcomes but also much of health care spending in advanced economies (Eisenberg, 1986). More crucially, doctors’ decisions shape the effectiveness of that spending in producing better population health. In 2017, the United States spent more than $3.1 trillion on health care services, but between one fifth and one third of that health care spending may be avoidable (Berwick & Hackbarth, 2012). Much of that avoidable spending is initiated by the doctor’s pen.

In the 1980s, landmark studies by researchers at the RAND Corporation and University of California, Los Angeles (UCLA), demonstrated a gap between clinical care as delivered and optimal clinical practice (Chassin et al., 1986). For several prevalent tests and procedures, the RAND-UCLA appropriateness method used a rigorous modified Delphi approach involving carefully selected and balanced panels of physicians rating hundreds of very specific clinical indication scenarios. These panels combined the best available scientific evidence, expert guidance, and clinical practice experience to differentiate appropriate and inappropriate care for specific indications and also identified situations of uncertainty (Ayanian, Landrum, Normand, Guadagnoli, & McNeil, 1998; Leape, Park, Kahan, & Brook, 1992; Phelps, 1993). Two central findings from RAND studies were replicated globally (Runciman et al., 2012). First, there are surprisingly large areas of uncertainty in clinical practice: under many scenarios, the balance of risks and benefits of treatments, tests, and procedures is difficult to judge because of incomplete evidence and disagreement among panelists. Second, and more important, there are costly inconsistencies between the actual decisions of clinicians in daily practice and the optimal decisions. Furthermore, these inconsistencies occur even within areas of practice characterized by strong evidence and high levels of consensus within review panels.

In the wake of these and other studies, researchers pursued many approaches to improving clinical decisions. To assist time-strapped clinicians, evidence-based clinical guidelines were developed to replace textbooks and better keep pace with a rapidly evolving scientific evidence base. Measuring adherence to guideline-based recommendations came to define the effectiveness of clinician decisions. Falzer (2018) describes this “best practices regimen” that, conceptually at least, equates adherence to guidelines with adherence to practice standards defined by experts. But generally, guideline developers, clinicians, and even courts adjudicating malpractice cases have rejected the notion that guidelines are clinical practice standards. Instead, guidelines have been considered advisory to professional judgment. Another group of researchers, noting the imprecise nature of many guideline statements, developed a different tool that may more properly define the “best practices regimen”: quality measures.

The quality measurement movement of the 1990s adapted aspects of the RAND-UCLA appropriateness method to translate scientific evidence, guideline recommendations, and expert opinion into rigorous quality measures. Quality measures—specifically, clinical process measures—were intended not as guidance but as tools for explicitly evaluating whether clinical services were delivered (or not) to groups of patients with well-specified clinical characteristics. The National Committee for Quality Assurance was an early pioneer of quality measures to compare health insurance plans. The organization was a response to new managed care capitation payment approaches that many feared would lead to denials of effective care. Early measures focused on the use of preventive services, such as cervical cancer screening, breast cancer screening, influenza vaccination, and beta-blocker prescription after heart attack. For these measures, populations who could benefit were carefully specified, and those with exceptions were excluded (e.g., hysterectomy, prior mastectomy, vaccine allergy, asthma).
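To make the structure of a clinical process measure concrete, the following is a minimal sketch of how a breast cancer screening rate might be computed for a group of patients. It is an illustration only, not the specification used by any measurement program: the `Patient` fields, the function name, and the age range of 50 to 69 are hypothetical assumptions. The only element taken from the text is the idea that the eligible population is carefully specified and that patients with an exception (here, prior mastectomy) are excluded from the denominator.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    # All field names are hypothetical, chosen for illustration.
    patient_id: str
    age: int
    sex: str                 # "F" or "M"
    prior_mastectomy: bool   # exception named in the text
    screened: bool           # indicated service was delivered


def breast_cancer_screening_rate(patients, min_age=50, max_age=69):
    """Return (numerator, denominator, rate) for a process measure.

    The age bounds are assumed values, not an official specification.
    """
    # Denominator: patients in the specified population, minus exclusions.
    eligible = [
        p for p in patients
        if p.sex == "F"
        and min_age <= p.age <= max_age
        and not p.prior_mastectomy
    ]
    # Numerator: eligible patients who received the indicated service.
    numerator = sum(1 for p in eligible if p.screened)
    denominator = len(eligible)
    rate = numerator / denominator if denominator else None
    return numerator, denominator, rate


patients = [
    Patient("a", 55, "F", False, True),
    Patient("b", 60, "F", False, False),
    Patient("c", 62, "F", True, False),   # excluded: prior mastectomy
    Patient("d", 45, "F", False, False),  # outside the assumed age range
    Patient("e", 58, "M", False, False),  # outside the specified population
    Patient("f", 66, "F", False, True),
]

numerator, denominator, rate = breast_cancer_screening_rate(patients)
print(numerator, denominator, rate)
```

In this sketch, three of the six patients remain in the denominator and two of them were screened. The result depends as much on who is counted as eligible as on the care delivered, which is why the specification of populations and exclusions matters so much.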

Initially, quality measure developers avoided several pitfalls of the best practices regimen that Falzer (2018) describes. For example, they avoided rote translation of guidelines into quality measures, limiting the quality measures to situations in which clinicians broadly agreed that the indicated services should be delivered. In addition, inclusion boundaries might be narrower than those allowed by guidelines to avoid misclassifying patients as eligible for a test or treatment (e.g., excluding older age groups from asthma measures to avoid including people with emphysema). As implemented, quality measures focused less on achieved performance levels and more on comparisons of populations, drawing on the observation that variation in practice quality often exceeded (a) differences that one might plausibly expect based on characteristics of the included populations or (b) plausible differences in clinical decision making in the aggregate (Schneider, Zaslavsky, & Epstein, 2002; Trivedi, Zaslavsky, Schneider, & Ayanian, 2006). Improvements in preventive services, chronic care, and health outcomes appeared to follow the introduction of quality measures for conditions such as diabetes, heart attack, heart failure, and pneumonia, although causality was difficult to establish (Lindenauer et al., 2007; Trivedi, Zaslavsky, Schneider, & Ayanian, 2005).

Not long after, policy makers saw quality measures as a basis for public reporting and pay-for-performance programs. Hundreds of quality measures were developed, and in the rush, some quality measure developers began to cut corners. Limitations of scientific evidence were ignored. Quality measure development panels included a wider variety of stakeholders, but some newcomers had limited experience in quality measure construction. Rigorous panel methodologies were abandoned because of cost. The financial interests of industry stakeholders began to influence measure development. Government grant support from the scientifically oriented Agency for Healthcare Research and Quality was replaced by contracts from the Centers for Medicare and Medicaid Services that demanded measures for mandated reporting and pay-for-performance programs. Limitations of available data sources coupled with aggressive timelines produced intuitively appealing but flawed measures, and unintended consequences followed. For example, a measure of timely initiation of antibiotics for patients with pneumonia led many emergency departments to start antibiotics for anyone with upper respiratory symptoms that might turn out to be pneumonia. As measurement programs have implemented measures based on incomplete data, ignoring the complexity and contextual nuances of clinical decision making based on the patient’s history, physical exam, and diagnostic results, skepticism about quality measurement programs has grown even among those who believe that quality measures can be useful (Blumenthal & McGinnis, 2015; Lindenauer et al., 2014).

Is the quality of care better now than in the 1980s? Evidence suggests that quality has improved for some conditions, such as diabetes, cardiovascular events, and stroke. Did quality measures and the “best practices regimen” contribute to those gains? Perhaps, but there are reasons to be skeptical. Evidence of improvement related to quality measurement and pay-for-performance programs has been scant (Mendelson et al., 2017). The cost of care, a key concern since at least the 1980s, has only grown. It is even plausible that quality measurement and pay-for-performance programs have backfired. Clinician decisions that once fell short of best available evidence may now also be distorted by measurement-driven payment incentives. Clinicians striving for measure adherence may order clinically inappropriate care to raise scores on performance measures (e.g., ordering unnecessary antibiotics, screening colonoscopies, or other tests). Evidence suggests that they may game measurement systems through deliberate exclusion of patients from denominators or miscoding of patient risk factors (Harris et al., 2016).

If measures are incomplete, can naturalistic decision making offer a compelling alternative conceptual frame for quality measurement (Falzer, 2018)? I believe it can. A focus on the actual decision tasks that master clinicians perform highlights two shortcomings of current measurement approaches. First, patients’ preferences and their social contexts are not well represented in quality measurement systems even though preferences are integral to clinical decisions. Current quality measures do not adequately reflect professionals’ tailoring of diagnostic or treatment recommendations to a patient’s unique circumstances. Second, for each patient, conditions, diseases, and context evolve over time. As a simple example, high blood pressure and depression may be easily managed for a time, only to suddenly deteriorate because of divorce, job loss, or a change of insurance coverage. Astute clinicians may adjust accordingly, but their less skilled colleagues may not. Quality measurement programs of today do not evaluate this responsiveness to changes over time.

The current quality measurement regimen is precarious in practice and in concept. In a recent paper, colleagues and I began to reimagine it (McGlynn, Schneider, & Kerr, 2014). The insights of naturalistic decision making may be highly relevant to our reimagined quality measurement approach, which accounts not only for best evidence and expert opinion but also for patients’ goals and priorities, as well as care plans developed jointly by patients and clinicians. Under the new measurement approach, patients and clinicians jointly evaluate and plan care, documenting this decision-making process with yet-to-be-designed electronic care-planning tools. The technical, data-related, and logistical demands of new care-planning tools are not trivial, but new digital capabilities are emerging quickly, and such a tool could replace traditional medical records, which do a poor job of capturing the naturalistic decision process. User-friendly devices and interfaces; the networks to gather, exchange, and process needed data; and machine learning to augment and assist decisions will enable better and more relevant decisions at the point of care. Developing a digital care-planning tool is a substantial investment, but if a doctor’s pen is the most expensive instrument in medicine, then investing in a new instrument that could optimize doctors’ and patients’ joint decisions and improve patients’ health outcomes would be wise, especially if it can slow the growth of health care spending in the future.

Eric C. Schneider is senior vice president for policy and research at The Commonwealth Fund, a private philanthropy that conducts independent research on health policy. Prior to joining the Fund in 2015, Dr. Schneider, a practicing general internist, held the RAND distinguished chair in health care quality at RAND Corporation and faculty positions at Harvard Medical School and the Harvard T.H. Chan School of Public Health.

References

Ayanian, J. Z., Landrum, M. B., Normand, S. L. T., Guadagnoli, & McNeil, B. J. (1998). Rating the appropriateness of coronary angiography—do practicing physicians agree with an expert panel and with each other? New England Journal of Medicine, 338(26), 1896–1904.

Berwick, D. M., & Hackbarth, A. D. (2012). Eliminating waste in US health care. JAMA, 307(14), 1513–1516. doi:10.1001/jama.2012.362

Blumenthal, & McGinnis, J. M. (2015). Measuring vital signs: An IOM report on core metrics for health and health care progress. JAMA, 313(19), 1901–1902. doi:10.1001/jama.2015.4862

Chassin, M. R., Brook, R. H., Park, R. E., Keesey, Fink, Kosecoff, & Solomon, D. H. (1986). Variations in the use of medical and surgical services by the Medicare population. New England Journal of Medicine, 314(5), 285–290.

Eisenberg (1986). Doctor’s decisions and the cost of medical care. Ann Arbor, MI: Health Administration Press Perspectives.

Falzer, P. R. (2018). Naturalistic decision making and the practice of health care. Journal of Cognitive Engineering and Decision Making, 12(3), 178–193.

Harris, A. H., Chen, Rubinsky, A. D., Hoggatt, K. J., Neuman, & Vanneman, M. E. (2016). Are improvements in measured performance driven by better treatment or “denominator management”? Journal of General Internal Medicine, 31(Suppl. 1), 21–27. doi:10.1007/s11606-015-3558-1

Leape, Park, R. E., Kahan, J. P., & Brook, R. H. (1992). Group judgments of appropriateness: Effect of panel composition. Quality Assurance in Health Care, 4(2), 151–159.

Lindenauer, P. K., Lagu, Ross, J. S., Pekow, P. S., Shatz, Hannon, . . . Benjamin, E. M. (2014). Attitudes of hospital leaders toward publicly reported measures of health care quality. JAMA Internal Medicine, 174(12), 1904–1911. doi:10.1001/jamainternmed.2014.5161

Lindenauer, P. K., Remus, Roman, Rothberg, M. B., Benjamin, E. M., & Bratzler, D. W. (2007). Public reporting and pay for performance in hospital quality improvement. New England Journal of Medicine, 356(5), 486–496.

McGlynn, E. A., Schneider, E. C., & Kerr, E. A. (2014). Reimagining quality measurement. New England Journal of Medicine, 371(23), 2150–2153.

Mendelson, Kondo, Damberg, Low, Motuapuaka, Freeman, & Kansagara (2017). The effects of pay-for-performance programs on health, health care use, and processes of care: A systematic review. Annals of Internal Medicine, 166(5), 341–353. doi:10.7326/M16-1881

Phelps, C. E. (1993). The methodologic foundations of studies of the appropriateness of medical care. New England Journal of Medicine, 329(17), 1241–1245.

Runciman, W. B., Hunt, T. D., Hannaford, N. A., Hibbert, P. D., Westbrook, J. I., Coiera, E. W., & Braithwaite (2012). CareTrack: Assessing the appropriateness of health care delivery in Australia. Medical Journal of Australia, 197(10), 549.

Schneider, E. C., Zaslavsky, A. M., & Epstein, A. M. (2002). Racial disparities in the quality of care for enrollees in Medicare managed care. JAMA, 287(10), 1288–1294. doi:joc11037

Trivedi, A. N., Zaslavsky, A. M., Schneider, E. C., & Ayanian, J. Z. (2005). Trends in the quality of care and racial disparities in Medicare managed care. New England Journal of Medicine, 353(7), 692–700. doi:10.1056/NEJMsa051207

Trivedi, A. N., Zaslavsky, A. M., Schneider, E. C., & Ayanian, J. Z. (2006). Relationship between quality of care and racial disparities in Medicare health plans. JAMA, 296(16), 1998–2004. doi:10.1001/jama.296.16.1998