Abstract

In a recent article, Elliott and colleagues (2020) evaluated the reliability of individual differences in task-based functional MRI (fMRI) activity and found reliability to be poor. They concluded that “commonly used task-fMRI measures generally do not have the test-retest reliability necessary for biomarker discovery or brain–behavior mapping” (p. 801). This is an important and timely effort, and we applaud it for spotlighting the need to evaluate the measurement properties of fMRI. Large samples combined with pattern-recognition techniques have made translational applications finally seem within reach. As the field gets serious about using the brain to predict behavior and health outcomes, reliability will become increasingly important.
However, along with their findings and constructive criticism comes the potential for overgeneralization. Though Elliott et al. focused on arguably the most limited fMRI measure for biomarker development (the average response within individual brain regions), the article has garnered media attention that mischaracterizes its conclusions. One headline reads, “every brain activity study you’ve ever read is wrong” (Cohen, 2020). The causes of anti-fMRI sentiments are not our concern here, but it is important to specify the boundary conditions of Elliott et al.’s critique. As they suggest and we show below, fMRI can exhibit high test-retest reliability when multivariate measures are used. These measures, however, were not evaluated by Elliott et al., despite being commonly used for biomarker discovery (Woo, Chang, et al., 2017). Thus, their conclusions do not apply to all “common task-fMRI measures” but to a particular subset that does not represent the state of the art. Moreover, there are multiple use cases for fMRI biomarkers (FDA-NIH Biomarker Working Group, 2016), many of which do not require high test-retest reliability (cf. Elliott et al.; Fig. 1a).

Fig. 1. Use cases for functional MRI (fMRI) biomarkers and an example of test-retest reliability. Among the seven major categories of biomarkers defined by the U.S. Food and Drug Administration (a), some (i.e., predictive, risk, and prognostic biomarkers) are designed to measure variation between individuals (e.g., to measure traitlike variables such as risk for depression, trait anxiety, or vulnerability to drug overdose). These biomarkers depend on measuring stable interindividual differences and thus require long-term test-retest reliability, which is typically estimated by calculating the intraclass correlation coefficient (ICC) for continuous variables or Cohen’s κ for binary variables. Other biomarkers (i.e., safety, pharmacodynamic, monitoring, and response biomarkers) rely on the ability to measure variation within an individual across time, mental or physiological states, or treatment doses. Detecting within-person states relies less on stable individual differences than on stable mappings between measure and state (in fMRI, between the brain and mental states and outcomes) with large and consistent effect sizes, referred to as task reliability (Hedge et al., 2018). This depends on low within-person measurement error (e.g., mean squared error, MSE) and can be measured with ICC or κ. For biomarkers related to dynamic states, other characteristics that increase test-retest reliability, including between-person heterogeneity and long-term stability across time, can be irrelevant or even undesirable. Reliability (b) is shown for a multivariate signature of risk for cardiovascular disease (figure adapted from Gianaros et al., 2020). The brain images depict significant pattern weights fitted on brain responses to affective images that positively (warm colors) and negatively (cool colors) contribute to the prediction of a marker of preclinical atherosclerosis. The scatterplot (with best-fitting regression line) depicts data used to estimate split-half reliability (N = 338).
Test-retest reliability estimates summarized by Elliott et al. reflect several limitations of the studies in their sample. These studies had (a) small sample sizes; (b) little data per participant (as little as 5 min); (c) single-task rather than composite-task measures, which can be more reliable (Gianaros et al., 2017; Kragel, Kano, et al., 2018); and (d) variable test-retest intervals, up to 140 days in the Human Connectome Project (HCP) data. In addition, the estimates were limited to activity in individual brain regions.
Multivariate measures optimized using machine learning can have high test-retest reliability (Woo & Wager, 2016). Elliott et al. acknowledged this possibility but did not provide quantitative examples. Examining some benchmarks from recent studies reveals that the situation is not nearly so dire for task-fMRI as Elliott et al. concluded. For example, Gianaros et al. (2020) identified patterns predictive of risk for cardiovascular disease using an emotional picture-viewing task and the HCP Emotion task. The same-day test-retest reliability of these measures was good to excellent (Spearman-Brown rs = .82 and .73, Ns = 338 and 427, respectively; Fig. 1b). In contrast, test-retest reliabilities of individual regions (e.g., amygdala) were much lower (rs = .11–.27). In a second example, we assessed the same-day test-retest reliability of the neurologic pain signature, a neuromarker for evoked pain, in eight fMRI studies (N = 228; data from Geuter et al., 2020; Jepma et al., 2018). Reliability was good to excellent in all studies (Fig. S1a in the Supplemental Material available online). Other multivariate measures show strong evidence for test-retest reliability across longer time intervals (Zuo & Xing, 2014). For example, Drysdale et al. identified four distinct fMRI biotypes for depression (total N > 1,000). Test-retest analyses across 4 weeks showed 90% agreement of biotype classifications (Fig. S1b in the Supplemental Material).
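To make the split-half estimates above concrete, the following is a minimal sketch of how a Spearman-Brown-corrected split-half reliability could be computed, and of why a score pooled over many features can be more reliable than any single feature. It is not the analysis pipeline of the cited studies. The data are simulated, and every name and setting here (`simulate_halves`, the uniform `weights`, the noise level, the sample size) is an illustrative assumption.

```python
import numpy as np


def spearman_brown(r_half):
    """Step up a half-length correlation to an estimate of full-length reliability."""
    return 2.0 * r_half / (1.0 + r_half)


def split_half_reliability(scores_a, scores_b):
    """Correlate scores from two halves of the data, then apply Spearman-Brown."""
    r_half = np.corrcoef(scores_a, scores_b)[0, 1]
    return spearman_brown(r_half)


def simulate_halves(n_subjects=300, n_features=200, noise_sd=3.0, seed=0):
    """Simulated data (assumption): every feature is a stable subject-level
    trait plus independent noise, measured once in each half."""
    rng = np.random.default_rng(seed)
    trait = rng.normal(size=(n_subjects, 1))
    half_a = trait + rng.normal(scale=noise_sd, size=(n_subjects, n_features))
    half_b = trait + rng.normal(scale=noise_sd, size=(n_subjects, n_features))
    return half_a, half_b


half_a, half_b = simulate_halves()

# Hypothetical pattern weights: a plain average over features. A trained
# multivariate signature would use fitted weights instead.
weights = np.full(half_a.shape[1], 1.0 / half_a.shape[1])

single_feature = split_half_reliability(half_a[:, 0], half_b[:, 0])
pattern_score = split_half_reliability(half_a @ weights, half_b @ weights)

print(f"single feature: {single_feature:.2f}")
print(f"pattern score:  {pattern_score:.2f}")
```

In this idealized setting, pooling averages away independent noise, so the pattern score is far more reliable than one feature. Noise in real fMRI data is spatially correlated, so the gain from pooling would typically be smaller than in this simulation.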
Functional MRI has promise for measuring individual differences, yet it may be best suited to develop biomarkers that detect dynamic states. Functional MRI patterns can reveal specific brain states: for example, whether a person is viewing a face (Haxby et al., 2001), replaying a memory (Momennejad et al., 2018), paying attention (Rosenberg et al., 2016), engaging in self-regulation (Cosme et al., 2020), or experiencing pain (Wager et al., 2013). This can be done reliably across long timescales. For example, Kamitani and Tong (2005) developed multivariate patterns that predict the orientation of a line viewed by participants. These patterns predicted line orientation 31 to 40 days after the initial prediction with similar performance (0.7 to 1 degree of error). As a second example, we reanalyzed HCP data presented by Elliott et al. and found that a multivariate model designed to identify the engagement of face (vs. shape) processing had excellent reliability across 4 weeks (Fig. S1c in the Supplemental Material). These examples and many others (Finn et al., 2015; Zuo & Xing, 2014) show that fMRI can yield measures that reliably detect the emergence of brain states across time.
A common belief is that all biomarkers measure traits and thus require high test-retest reliability, but we argue that this is a misconception. The U.S. Food and Drug Administration identifies seven major categories of biomarkers (Fig. 1a). Some, such as prognostic biomarkers for future disease and predictive biomarkers for treatment response, rely on individual differences in stable traits and require long-term test-retest reliability. Others measure states and require low within-person measurement error but not necessarily high test-retest reliability (Bland & Altman, 1996). For example, diagnostic biomarkers for disease states—such as COVID-19—need not be stable across weeks to months as the disease state changes. The same is true for safety, monitoring, and pharmacodynamic biomarkers that track changes in pathophysiological states. Measures of fMRI show promise in detecting and monitoring disorder-related brain processes (Duff et al., 2020; Rosenberg et al., 2016; Woo, Schmidt, et al., 2017). Although a full discussion of use cases is beyond the scope of this Commentary (but see Davis et al., 2020), fMRI measures can be sufficiently sensitive, specific, and reliable for many uses.
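The distinction between low within-person measurement error and high test-retest reliability can be illustrated with a small simulation. The sketch below is not an analysis from this Commentary. It assumes a simple model in which each score is a stable person-level value plus independent session noise, and it uses a one-way random-effects ICC as one common choice of estimator. The function names and parameter values are hypothetical.

```python
import numpy as np


def icc_oneway(data):
    """One-way random-effects ICC(1,1) and within-subject SD.

    `data` is a subjects x sessions array of scores.
    """
    n, k = data.shape
    subject_means = data.mean(axis=1, keepdims=True)
    ms_between = k * np.sum((subject_means - data.mean()) ** 2) / (n - 1)
    ms_within = np.sum((data - subject_means) ** 2) / (n * (k - 1))
    icc = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    return icc, np.sqrt(ms_within)


def simulate(between_sd, error_sd=1.0, n_subjects=500, n_sessions=2, seed=0):
    """Simulated scores (assumption): stable person value + session noise."""
    rng = np.random.default_rng(seed)
    true_score = rng.normal(scale=between_sd, size=(n_subjects, 1))
    noise = rng.normal(scale=error_sd, size=(n_subjects, n_sessions))
    return true_score + noise


# Same measurement error in both cases; only between-person spread differs.
icc_het, sw_het = icc_oneway(simulate(between_sd=3.0))  # heterogeneous sample
icc_hom, sw_hom = icc_oneway(simulate(between_sd=0.3))  # homogeneous sample

print(f"heterogeneous: ICC = {icc_het:.2f}, within-subject SD = {sw_het:.2f}")
print(f"homogeneous:   ICC = {icc_hom:.2f}, within-subject SD = {sw_hom:.2f}")
```

Under these assumptions the within-subject SD is the same in both samples, yet the ICC is high only when people differ substantially from one another. A measure can therefore track within-person states precisely while still showing a low ICC in a homogeneous sample.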
Ultimately, reliability is not a fixed property of an assay, let alone a whole measurement technology. It depends on the tasks, samples, and measures extracted from them (Streiner, 2003). Other criticisms of fMRI (Eklund et al., 2016; Vul et al., 2009) have been used to issue blanket condemnations. We caution against such overgeneralization and propose that the findings of Elliott et al. be considered a lower bound on the reliability of fMRI. The upper bound is high and remains to be fully explored. We agree with the summary recommendations for future fMRI research made by Elliott et al. and are optimistic that new methods designed to optimize reliability (Dubois & Adolphs, 2016) while keeping construct validity in mind (Kragel, Koban, et al., 2018) will continue to fuel fMRI research on biomarker development.
Supplemental Material
Supplemental material, sj-pptx-1-pss-10.1177_0956797621989730 for Functional MRI Can Be Highly Reliable, but It Depends on What You Measure: A Commentary on Elliott et al. (2020) by Philip A. Kragel, Xiaochun Han, Thomas E. Kraynak, Peter J. Gianaros and Tor D. Wager in Psychological Science
Footnotes
Acknowledgements
We thank Bogdan Petre for assistance with data analysis.
Transparency
Action Editor: John Jonides
Editor: Patricia J. Bauer
Author Contributions
P. A. Kragel and T. D. Wager conceived the project and data-analysis plan. P. A. Kragel, X. Han, and T. E. Kraynak performed the analyses. P. A. Kragel and T. D. Wager wrote the manuscript. T. D. Wager and P. J. Gianaros designed, implemented, and oversaw data collection and generation of the research protocol. All the authors contributed to the revision of the manuscript and approved the final manuscript for submission.
