Abstract
The past 40 years have generated numerous insights regarding errors in human reasoning. Arguably, clinical practice is the domain of applied psychology in which acknowledging and mitigating these errors is most crucial. We address one such set of errors here, namely, the tendency of some psychologists and other mental health professionals to assume that they can rely on informal clinical observations to infer whether treatments are effective. We delineate four broad, underlying cognitive impediments to accurately evaluating improvement in psychotherapy—naive realism, confirmation bias, illusory causation, and the illusion of control. We then describe 26 causes of spurious therapeutic effectiveness (CSTEs), organized into a taxonomy of three overarching categories: (a) the perception of client change in its actual absence, (b) misinterpretations of actual client change stemming from extratherapeutic factors, and (c) misinterpretations of actual client change stemming from nonspecific treatment factors. These inferential errors can lead clinicians, clients, and researchers to misperceive useless or even harmful psychotherapies as effective. We (a) examine how methodological safeguards help to control for different CSTEs, (b) delineate fruitful directions for research on CSTEs, and (c) consider the implications of CSTEs for everyday clinical practice. An enhanced appreciation of the inferential problems posed by CSTEs may narrow the science–practice gap and foster a heightened appreciation of the need for the methodological safeguards afforded by evidence-based practice.
A clinically depressed client obtains psychotherapy; 2 months later, she is free of serious symptoms. Was her improvement due to the treatment?
The correct answer is “We don’t know.” On the one hand, ample data demonstrate that scientifically supported psychotherapies can alleviate many mental health difficulties (Barlow, 2004), so the client’s improvement may well stem at least partly from the intervention. On the other hand, as most mental health professionals know, we cannot draw valid conclusions regarding a treatment’s effectiveness in the absence of methodological safeguards against errors in inference, such as well-validated outcome measures, randomized control groups, and blinded observations (Gambrill, 2012). Yet even seasoned clinicians and researchers can easily fall prey to the error of concluding that a treatment worked when the evidence for this inference is insufficient. They can commit this mistake when evaluating the effectiveness of treatment for a given client, the effectiveness of a specific school or modality of psychotherapy, or both.
This error in reasoning can be found in published research as well. In numerous articles, authors have interpreted client improvement following an intervention (even in the absence of differences from a no-treatment control group) as evidence for treatment efficacy (e.g., Leins et al., 2007). For example, in a study of psychological treatment for outpatients with severe depression, a research team randomized participants to receive either cognitive-behavioral or interpersonal therapy and found broadly equivalent improvement in both groups. Despite the absence of a no-treatment or placebo-control condition, the authors concluded that “both therapies are equally effective for depression” (Luty et al., 2007, p. 496; see also p. 500). More recently, in a randomized controlled study comparing psychoanalytic therapy with cognitive-behavioral therapy for bulimia nervosa, which likewise contained no no-treatment or placebo-control condition, the authors concluded that “Both treatments had substantial effects on global eating disorder psychopathology and general psychopathology” (Poulsen et al., 2014, p. 114).
In this article, we explain why the error of inferring that a treatment is effective on the basis of inadequate evidence is widespread, understandable, and problematic for clinical inference. We contend that a number of mental health professionals are insufficiently cognizant of the manifold reasons why ineffective or even harmful treatments can appear effective to the unaided eye. Because of this inadequate recognition, some clinicians and researchers may dismiss or minimize the need for evidence-based practice (Sackett, Rosenberg, Gray, Haynes, & Richardson, 1996; Straus et al., 2010).
Evidence-Based Practice and Causes of Spurious Therapeutic Effectiveness
Evidence-based practice is a threefold framework for clinical practice that is often conceptualized as a three-legged stool. These legs comprise (a) research findings regarding the efficacy and effectiveness of psychotherapies, (b) clinical expertise, and (c) client values and preferences (Norcross, Beutler, & Levant, 2007; Spring, 2007). Evidence-based practice is not synonymous with empirically supported therapies (ESTs), which are merely one set of operationalizations of the research leg of the evidence-based practice stool (Westen, Novotny, & Thompson-Brenner, 2005). ESTs are interventions that have been demonstrated to work better than no treatment (or an alternative treatment) for specific disorders in independently replicated (a) controlled between-subject designs or (b) controlled single-subject designs, namely, those in which participants serve as their own controls (Barlow, Hayes, & Nelson, 1984; Chambless & Hollon, 1998). Although the scientific status of ESTs is controversial (for diverse viewpoints, see Beutler, 2004; Castelnuovo, 2010; Chambless & Ollendick, 2001; Herbert, 2003; and Westen, Novotny, & Thompson-Brenner, 2004), acceptance of the need for the research prong of evidence-based practice does not hinge on agreement with the criteria for or specific lists of ESTs, such as those proposed by Division 12 (Society of Clinical Psychology) of the American Psychological Association (see http://www.div12.org/empirically-supported-treatments/).
The research leg, which is the component of evidence-based practice most pertinent to our arguments, incorporates control groups, within-subject designs, blinding, randomization, and other methodological bulwarks against inferential mistakes. In ways that have often not been adequately appreciated or articulated, these research safeguards are frequently nonintuitive. When viewed in this light, the much decried science–practice gap (Baker, McFall, & Shoham, 2008; Lilienfeld, Lynn, & Lohr, 2003; Tavris, 2003) and the resistance to evidence-based practice that often accompanies it (Lilienfeld, Ritschel, Lynn, Cautin, & Latzman, 2013) are not entirely surprising.
Although there are multiple sources of the science–practice gap (for discussions, see Lilienfeld et al., 2013; Shafran et al., 2009; Ritschel, 2005; and Stewart, Chambless, & Baron, 2011), we focus on one key contributor here: the myriad reasons why individuals can be led to conclude that psychotherapy is effective even when it is not. We term these sources of inferential error causes of spurious therapeutic effectiveness (CSTEs). Because of an insufficient recognition of CSTEs, psychologists may assume that they can rely on informal clinical observations of client change during and after treatment to gauge whether interventions are effective.
We do not contend that informal clinical observations of client improvement are never accurate; they frequently are. Nor do we argue that such observations are useless or should be disregarded, as they are at times helpful signposts of change in treatment. As noted earlier, substantial evidence attests to the efficacy and effectiveness of a broad swath of psychotherapies for many mental health conditions, including mood, anxiety, sleep, sexual, and eating disorders, as well as some personality disorders, such as borderline personality disorder (Roth & Fonagy, 2005; Wampold, 2001; Weisz, Weiss, Han, Granger, & Morton, 1995). Hence, clinicians’ inferences of client improvement during and after psychological treatment are surely correct in many instances. Moreover, the social cognition literature demonstrates that numerous forms of intuitive thinking, such as heuristic processing, are often adaptive in real-world settings (Gigerenzer & Gaissmaier, 2011).
At the same time, the histories of medicine and psychology demonstrate that informal inferences of change in treatment, subjectively compelling as they may be, are often mistaken (Garb, 1998; Grove & Meehl, 1996). Our overarching message is that because of CSTEs, unsystematic clinical observations of client change are rarely trustworthy guides by themselves for inferring treatment effectiveness.
Goals of the Article
Numerous articles have canvassed the magnitude and sources of the science–practice gap (e.g., Baker et al., 2008; McHugh & Barlow, 2010; Stewart et al., 2011). We do not intend to retread that well-traveled ground here. Instead, in light of relatively recent developments concerning (a) the implications of heuristics and biases for clinical practice (e.g., Crumlish & Kelly, 2009; Kahneman, 2011; Stanovich & West, 2008), (b) iatrogenic (i.e., psychologically harmful) effects in psychotherapy (e.g., Bootzin & Bailey, 2005; Dimidjian & Hollon, 2010; Lilienfeld, 2007), and (c) challenges to the dissemination of evidence-based practice (Lilienfeld et al., 2013; Stewart et al., 2011), we address the more specific and largely neglected question of what kinds of inferential errors in psychotherapy render evidence-based practice imperative.
The movement toward evidence-based practice has been contentious in many quarters, in part because some authors have taken issue with the premise that evidence derived from randomized controlled trials, controlled single-subject experiments, and other systematic research designs should be accorded higher priority than clinical experience when selecting treatments. Indeed, some scholars have proposed that “practice-based evidence,” namely, therapeutic practice informed by thoughtful clinical observations, should be accorded roughly equal weight to traditional evidence-based practice (Barkham, Hardy, & Mellor-Clark, 2010; Green & Latchford, 2012; Stricker, 2003). For example, Chwalisz (2003) lobbied for expanding the definition of evidence to encompass clinical observations and clinical consensus (see also Hoshmand & Polkinghorne, 1992). Similarly, while acknowledging that “practical knowledge” (viz., knowledge acquired from clinical observations of what does and does not work in treatment) is fallible, Bohart (2005) maintained that it is “evidence-based” (p. 46) and should be valued as a legitimate source of inferences for therapeutic effectiveness.
We view these assertions with decided ambivalence. On the one hand, clinical observations can sometimes (a) usefully guide therapists’ choices of interventions during treatment, (b) serve as springboards for the development of new models of treatment, and (c) inform the feasibility and transportability of scientifically based interventions to real-world settings. On the other hand, for reasons that we will explicate, the proposition that practice-based observations should be accorded comparable weight to the results of controlled clinical trials in treatment selection underestimates the inferential dangers stemming from CSTEs. With this background in mind, our goals are threefold. First, we aim to demonstrate that inadequate appreciation of the inferential threats posed by CSTEs is partly a by-product of natural cognitive processes that render it difficult for clinicians, clients, and researchers to accurately perceive and evaluate therapeutic change. In addition, we delineate four broad obstacles to scientific thinking—naive realism, confirmation bias, illusory causation, and the illusion of control—that underpin many or most CSTEs. The distinction between these overarching cognitive impediments and specific CSTEs themselves may not be entirely clear-cut. Nevertheless, we posit that these domain-general impediments lay the cognitive groundwork for more specific errors in inferring the existence or meaning of changes in treatment.
Second, we present a taxonomy of 26 CSTEs, divided into three categories, that can contribute to the appearance of therapeutic effectiveness in its objective absence. These three classes of CSTEs comprise influences that generate (a) the perception of client change in its actual absence, (b) misinterpretations of actual client change stemming from extratherapeutic factors, and (c) misinterpretations of actual client change stemming from nonspecific treatment factors. Some CSTEs operate at the level of individual clients, others at the level of groups of clients, and still others at both levels. Several writers in the medical literature have provided partial lists of artifacts that can make ineffective medical treatments seem effective (e.g., Beyerstein, 1997; Hall, 2011; Hartman, 2009; Kienle & Kiene, 1997), but no comparable list exists for psychotherapies; nor have previous authors provided a taxonomy of these artifacts.
Third, we outline how specific research methods help to control for, although not necessarily eliminate, CSTEs as sources of erroneous conclusions in treatment. Although these research methods are by no means new, to our knowledge their role in helping to rule out differing CSTEs has not been explicitly articulated. We also discuss how certain CSTEs continue to pose unresolved challenges to psychotherapy researchers and point to fruitful areas for further research on CSTEs and methods for attenuating their influence. In this respect, our analysis has heuristic value in that it points to gaps to be filled in extant psychotherapy methodology to minimize CSTEs. Just as important, we discuss how a better appreciation of CSTEs can inform everyday clinical practice. By promoting thoughtful consideration of explanations for client change above and beyond improvement due to therapy itself, CSTEs can assist clinicians with becoming better clinical scientists. Finally, we demonstrate that our discussion of CSTEs bears important implications for models of clinical training.
Overarching Cognitive Impediments to Evaluating Therapeutic Change
We submit that the principal reason why some mental health professionals may not appreciate sufficiently the problems posed by CSTEs is that scientific thinking does not come naturally to the human species (McCauley, 2011; Wolpert, 1992). Such thinking must be learned and practiced assiduously, because it often requires us to question our commonsense intuitions, such as our propensity to perceive meaningful causal relations in their absence (Bloom & Weisberg, 2007; Cromer, 1993). Most errors in judgment arising from CSTEs probably reflect rapid and intuitively plausible perceptions and interpretations that do not sufficiently consider alternative explanations of client change.
One telltale sign of the counterintuitive nature of scientific thinking is the history of the concept of the control group. Contrary to what many psychologists might assume, this concept is a relatively recent development in scientific history (Bull, 1959; Dehue, 2000; Gehan & Lemak, 1994). Examples of controlled trials of medication surfaced only as recently as the 18th century, with the first arguably conducted by James Lind, who in 1747 famously divided sailors with scurvy onboard a British ship into several groups, finding that only those who received citrus juice improved (Manzi, 2012). Yet even Lind’s discovery was apparently resisted, as the British Navy waited a full half-century before stocking lemon juice on its vessels (Bull, 1959). It was not until Coover and Angell (1907) advocated for the role of untreated groups in evaluating training programs in educational psychology that a formal exposition of the control group concept in social science appeared in print. The notion of the randomized control group is even more recent, emerging in the published literature in the 1920s (Dehue, 2005). Moreover, it was not until the 1950s that prominent authors (e.g., Eysenck, 1952; Meehl, 1955) began to call for randomized controlled trials of psychotherapy (Cautin, 2008).
As noted earlier, we contend that four broad cognitive impediments underlie many or most CSTEs, which we view as specific instantiations of these impediments in the context of psychotherapy. We next lay out these impediments, followed by a brief discussion of their implications for therapists’ perceptions of effectiveness.
Naive realism
Naive realism (Ross & Ward, 1996; Segall, Campbell, & Herskovits, 1966) is a concept imported into psychology from philosophy. Also termed commonsense realism or direct realism, naive realism is the ubiquitous assumption that the world is precisely as we see it. A plethora of phrases in everyday life attest to the power of naive realism in our thinking: “Seeing is believing,” “I saw it with my own eyes,” “I’ll believe it when I see it,” and “What you see is what you get.” In a related vein, Kahneman (2011) referred to a core principle of intuition as “WYSIATI”: What You See Is All There Is, an assumption that dovetails with naive realism. This heuristic (mental shortcut) leads us to focus on what is most obvious in our environments while ignoring subtler background information.
Naive realism is erroneous because the world is not exactly as we perceive it, a point illustrated vividly by visual illusions (Chabris & Simons, 2010) and enshrined in the time-honored psychological distinction between sensation and perception (Coren, 2003). What we perceive is constrained by external reality, but it is also influenced by our expectations, biases, and interpretations (“apperceptions”; Morgan & Murray, 1935). To a substantial extent, “believing is seeing” as much as the converse (Gilovich, 1991).
Naive realism bears important implications for the evaluation of psychotherapy outcome. It can lead clinicians, researchers, and others to assume that they can rely on their intuitive judgments—“I saw the change with my own eyes”—to infer that an intervention is effective (Ghaemi, 2009; Lilienfeld, Lohr, & Olatunji, 2008). As a consequence, these individuals may (a) misperceive change when it does not occur, (b) misinterpret it when it does, or (c) both.
One example of the overreliance on naive realism comes from Arnold Shapiro, producer of the 1978 Academy Award–winning documentary Scared Straight!, who responded to scientific criticisms of Scared Straight interventions. These interventions attempt to frighten adolescents at high risk for crime out of criminal careers by bringing them to prisons and introducing them to inmates. Shapiro defended Scared Straight programs by insisting that “I’m seeing it [the change following Scared Straight programs] with my own eyes, I’m there for every one of those shoots” (Harrison, 2011, p. 2). However, data from controlled studies suggest that Scared Straight is not merely ineffective but probably harmful, in that it produces a heightened risk for antisocial behavior (Petrosino, Turpin-Petrosino, & Buehler, 2005). In another example, Healy (2002) wrote in an article, subtitled “Evidence-Based Psychiatry,” that “When treatments work, the condition being treated vanishes, and we don’t need randomized controlled trials to see this happening” (p. 1). Yet the condition being treated may disappear for a plethora of reasons other than the intervention. Contra Healy’s implication, randomized controlled trials and other rigorous designs are indeed needed to exclude rival hypotheses for observed change.
Naive realism also reminds us of an easily forgotten principle: Change following therapy is not equivalent to change because of therapy. Conflating the two is a logical error known as the post hoc, ergo propter hoc (after this, therefore because of this) fallacy (Finocchiaro, 1981). Conversely, this error can also lead individuals to equate deterioration following a treatment with deterioration because of the treatment (Lilienfeld, 2007), thereby overestimating the iatrogenic effects of certain interventions. The post hoc, ergo propter hoc fallacy underscores the point that pre–post studies of interventions are problematic (T. D. Wilson, 2011). Fortunately, as we will discover, there are multiple ways to compensate for the limitations of pre–post designs. Investigations using such designs are especially suspect when the “pre” data derive from retrospective assessments. For example, the much ballyhooed Consumer Reports study (Seligman, 1995) of 4,100 magazine subscribers who had participated in psychotherapy revealed that most felt that they had been helped by it. Yet, as numerous critics (e.g., Jacobson & Christensen, 1996; Mintz, Drake, & Crits-Christoph, 1996) pointed out, these data are difficult to interpret, because the study neglected to control for many potential confounds that may have led to improvement even without therapy.
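The gap between change following therapy and change because of therapy can be illustrated with a minimal simulation. The sketch below is not drawn from any study cited here; it assumes hypothetical symptom scores in which each person has a stable level plus random fluctuation, and in which people enter treatment only when their observed score is at or above a cutoff. The treatment is inert by construction, and all names and numbers (`simulate_pre_post`, `entry_cutoff`, the means and standard deviations) are illustrative assumptions.

```python
import random
import statistics


def simulate_pre_post(n_clients=2000, entry_cutoff=25.0, seed=0):
    """Mean pre-to-post change for an inert treatment and for untreated controls.

    Hypothetical model: observed score = stable level + random fluctuation.
    People enter the study only when their observed score is high.
    """
    rng = random.Random(seed)
    treated_change, control_change = [], []
    while len(treated_change) + len(control_change) < n_clients:
        stable = rng.gauss(20.0, 5.0)          # long-run symptom level
        pre = stable + rng.gauss(0.0, 5.0)     # score at intake
        if pre < entry_cutoff:
            continue                           # not distressed enough to seek help
        post = stable + rng.gauss(0.0, 5.0)    # treatment adds nothing to this score
        change = post - pre
        # Random assignment to the inert treatment or to no treatment
        if rng.random() < 0.5:
            treated_change.append(change)
        else:
            control_change.append(change)
    return statistics.mean(treated_change), statistics.mean(control_change)


treated, control = simulate_pre_post()
print(f"Mean change, inert treatment: {treated:.2f}")
print(f"Mean change, no treatment:    {control:.2f}")
```

Under these assumptions, the treated group's scores drop on average even though the treatment does nothing, because people were selected at a high point in their fluctuation. Only the comparison with the untreated group, whose scores drop by a similar amount, reveals that the apparent improvement is not attributable to the intervention.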
The history of medicine offers a powerful cautionary tale regarding the hazards of naive realism (Bigby, 1998). Most medical scholars agree that the history of physical treatments administered prior to about 1890 is essentially tantamount to the history of the placebo effect. Along with ineffective medications, such interventions as bloodletting, blistering, purging, and leeching were routinely prescribed and presumed to be beneficial based on little more than informal clinical observations (Grove & Meehl, 1996; see Belofsky, 2013, for a review of bizarre but widely accepted medical practices through the ages). Even today, medicine has its share of ineffective interventions. A recent meta-analysis estimated that 40% of widely used medical procedures (e.g., intensive glucose lowering in Type 2 diabetes, induction of hypothermia for intracranial aneurysms) are useless or harmful (Prasad et al., 2012).
The history of psychiatry is similarly replete with useless or harmful interventions, many of which were endorsed by experts of the era yet strike us as inhumane today. Such “treatments” as spinning chairs, tranquilizing chairs, and cold water were ubiquitous in early American psychiatry. As another example, insulin coma therapy, introduced by Manfred Sakel in 1933, was used widely to treat schizophrenia throughout the 1930s and 1940s. This procedure involved administering increasingly high doses of insulin to induce a hypoglycemic state, followed by a coma and sometimes convulsions. Early clinical reports described encouraging results. Its high morbidity and mortality rates notwithstanding, insulin therapy spread rapidly throughout Europe, the United States, Japan, and Australia (James, 1992). This wave was unceremoniously interrupted by an article in the Lancet by Bourne (1953), who concluded there was no evidence that insulin coma therapy was effective. As Jones (2000) noted, many psychiatrists published rebuttals to Bourne’s article: “Their tone was typified by remarks such as ‘it is clinical experience that counts here, despite all figures to the contrary’” (p. 148). By the late 1950s, insulin coma therapy had been all but abandoned (Shapiro & Shapiro, 1997).
Prefrontal lobotomy, which earned the principal developer of the procedure in humans, Portuguese neurologist Egas Moniz, the Nobel Prize in Physiology or Medicine in 1949, offers another telling example. One practitioner of this technique insisted that “I am a sensitive observer, and my conclusion is that a vast majority of my patients get better as opposed to worse after my treatment” (see Dawes, 1994, p. 48), a view echoed by many of his contemporaries (Diefenbach, Diefenbach, Baumeister, & West, 1999). Later research, however, revealed lobotomy to be essentially worthless and to be associated with many disastrous psychological and neurological side effects (Valenstein, 1986).
Confirmation bias
A second cognitive impediment to appreciating the need for controls in psychotherapy research is confirmation bias. Confirmation bias is the deeply ingrained and commonly exercised tendency to seek out evidence consistent with one’s hypotheses and to deny, dismiss, or distort evidence that is not (Lilienfeld, Ammirati, & Landfield, 2009; Nickerson, 1998). Although confirmation bias is a cognitive phenomenon, it can be fueled by desires to find supportive evidence for our beliefs, a propensity termed “motivated reasoning” (Kunda, 1990). Because clinicians want their clients to improve, they can be driven to perceive change in its absence.
Confirmation bias can foster a propensity toward illusory correlation (not to be confused with illusory causation; see next section), which is the perception of a statistical association in its absence (Chapman & Chapman, 1967; Hamilton & Gifford, 1976). Specifically, confirmation bias can predispose clinicians to attend to the “hits” and forget the “misses” (Garb, Lilienfeld, & Fowler, 2008; Gilovich, 1991) and thereby overestimate the extent to which their interventions are associated with subsequent improvement. Imagine a therapist who engages from time to time in confrontational tactics with a client. Even though these tactics are ineffective for his client’s presenting problem, the therapist may attend to and recall the sessions in which the client was doing better and neglect and forget the sessions in which the client was not doing better or was doing worse. As a consequence, the therapist may conclude that his use of confrontation was consistently followed by client improvement, even though it was not. In contrast, if the therapist were to monitor his client’s symptoms systematically, this erroneous inference would presumably be less likely.
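The confrontation example can be made concrete with a small simulation. The sketch below is an illustration under stated assumptions rather than a model taken from the literature cited above: improvement is generated independently of whether the tactic was used, and the therapist is assumed to recall “hits” (tactic followed by improvement) with a higher probability than any other kind of session. The function names and the recall probabilities are hypothetical.

```python
import random


def phi(a, b, c, d):
    """Phi coefficient for a 2x2 table.

    a = tactic used, improved      b = tactic used, not improved
    c = tactic not used, improved  d = tactic not used, not improved
    """
    denom = ((a + b) * (c + d) * (a + c) * (b + d)) ** 0.5
    return (a * d - b * c) / denom if denom else 0.0


def simulate_recall(n_sessions=4000, p_recall_hit=0.9, p_recall_other=0.4, seed=1):
    """Compare the association in a systematic log with the association in memory."""
    rng = random.Random(seed)
    logged = {(t, i): 0 for t in (True, False) for i in (True, False)}
    recalled = dict(logged)
    for _ in range(n_sessions):
        tactic = rng.random() < 0.5
        improved = rng.random() < 0.5   # independent of the tactic by construction
        logged[(tactic, improved)] += 1
        # Assumed selective memory: hits are recalled more often than other sessions
        p_recall = p_recall_hit if (tactic and improved) else p_recall_other
        if rng.random() < p_recall:
            recalled[(tactic, improved)] += 1

    def table_phi(table):
        return phi(table[(True, True)], table[(True, False)],
                   table[(False, True)], table[(False, False)])

    return table_phi(logged), table_phi(recalled)


logged_phi, recalled_phi = simulate_recall()
print(f"Association in the systematic log: {logged_phi:.2f}")
print(f"Association in recalled sessions:  {recalled_phi:.2f}")
```

In this sketch the full session log shows essentially no association between the tactic and improvement, whereas the selectively recalled sessions show a positive one. The contrast mirrors the point that systematic monitoring makes the erroneous inference less likely.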
Illusory causation
Scottish philosopher David Hume (1748) maintained that humans are prone to perceiving causal relations in their absence. Two centuries later, Michotte (1945) argued that our propensity to perceive causal relations between events, even those that are causally unrelated, comes to us as naturally as does our propensity to perceive color. Research on illusory causation, or the propensity to perceive a spurious causal relation between two associated variables, bears out these contentions.
Laboratory evidence for illusory causation dates at least to the work of Koffka (1935), who showed observers two points of light in a dark room. When the points moved apart, perceivers tended to attribute causality to the dot on which they happened to be focusing, even if it was stationary. Koffka’s findings suggest that we are more likely to attach causal significance to the object of our attention while ignoring competing evidence. Later research demonstrated that illusory causation extends to social interactions. When observers are positioned physically so as to attend primarily to one partner in a two-person conversation, they regard him or her as more interpersonally influential than the other partner (Taylor & Fiske, 1975; see also McArthur & Solomon, 1978).
There are two potential explanations for illusory causation that are not mutually exclusive (McArthur, 1980). The first is perceptual: Individuals tend to attribute causality to whatever stimulus is most vivid and prominent in their visual fields and to accord less causal import to what lies in the visual background (Lassiter, Geers, Munhall, Ploutz-Snyder, & Breitenbecher, 2002). The second is cognitive: Individuals recall more information about stimuli that are prominent in their visual foregrounds than in their visual backgrounds (Taylor & Fiske, 1978). With the aid of an availability heuristic, by which we gauge the probability of an event by using its accessibility in memory (Tversky & Kahneman, 1974), we come to view the former stimuli as more influential.
Because of illusory causation, therapists, researchers, clients, and external observers may leap to the conclusion that a treatment exerted a causal effect on the client when it did not (Sloman, 2005). The client’s improvement within therapy sessions is plainly visible to the clinician, whereas rival explanations for this improvement (e.g., events occurring to the client outside of sessions, placebo effects, changes in cognitive biases over the course of treatment) rarely are. As a consequence, these explanations may be assigned less weight.
Illusion of control
A related error is the illusion of control, or the propensity to overestimate our ability to influence events (Langer, 1975). For example, when money is at stake, most people prefer to select a lottery ticket or roll a die themselves rather than leave these actions to others, even though the outcome is determined by chance in every case. This illusion may predispose therapists to believe that they possess more causal power over client outcomes than they do. The illusion of control is especially likely when the individual in question (a) is personally involved in the behaviors, (b) is familiar with the situation at hand, (c) is aware of the desired outcome, and (d) has a history of previous success at the task (Thompson, 1999). Most or all of these criteria presumably apply to the modal psychotherapist. Indeed, when interventions are consistently followed by improvement, treatment providers may conclude that they are the active causal agents when they are not (Matute, Yarritu, & Vadillo, 2011).
Implications of cognitive impediments for clinicians’ self-perceptions and predictions
These four broad cognitive impediments may help to explain why some therapists overestimate their positive client outcomes. In this respect, they appear to be no different from professionals in many other fields, including college professors (Cross, 1977), physicians (Hodges, Regehr, & Martin, 2001), and political pundits (Tetlock, 2005), all of whom tend to hold an overly charitable view of their effectiveness (Dunning, Heath, & Suls, 2004). In a sample of 129 therapists in private practice (26.4% psychologists), the average clinician rated him- or herself at the 80th percentile of all therapists in terms of effectiveness and skills; 25% of respondents placed themselves at the 90th percentile. None rated themselves as below average. Moreover, the typical therapist in the sample estimated the rate of client deterioration in his or her caseload to be 3.7% (Walfish, McAlister, O’Donnell, & Lambert, 2012). In fact, numerous studies have indicated that about 10% of clients become worse following psychotherapy (Boisvert & Faust, 2002; Lilienfeld, 2007).
Other evidence dovetails with these results. In a sample of 49 psychotherapists in college counseling centers, clinicians markedly overestimated their rates of positive client outcomes (91%) relative to their actual positive outcomes (40%), as ascertained by a standardized symptom measure. Furthermore, although therapists predicted that only 3 out of a total of 550 clients (0.5%) in their collective caseloads would deteriorate, outcome data revealed that 40 (7.3%) did so (Hannan et al., 2005). Taken together, these findings suggest that many or most psychotherapists perceive client improvement where none has occurred and fail to perceive client deterioration where it has.
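The size of these discrepancies is easier to appreciate when the arithmetic is laid out. The short sketch below uses only the figures reported above from Hannan et al. (2005); the variable and function names are ours.

```python
def rate_percent(count, total):
    """Return count/total as a percentage."""
    return 100.0 * count / total


total_clients = 550
predicted_deteriorated = 3
observed_deteriorated = 40

predicted_rate = rate_percent(predicted_deteriorated, total_clients)
observed_rate = rate_percent(observed_deteriorated, total_clients)
underestimate_factor = observed_deteriorated / predicted_deteriorated

print(f"Predicted deterioration: {predicted_rate:.1f}%")                   # 0.5%
print(f"Observed deterioration:  {observed_rate:.1f}%")                    # 7.3%
print(f"Observed / predicted:    {underestimate_factor:.1f} times")        # 13.3 times

# Positive outcomes: estimated 91% versus observed 40%
estimated_positive, observed_positive = 91, 40
print(f"Overestimate of positive outcomes: "
      f"{estimated_positive - observed_positive} percentage points")       # 51
```

In other words, deterioration occurred roughly 13 times as often as the therapists had predicted, and estimated positive outcomes exceeded measured positive outcomes by 51 percentage points.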
Summary
In summary, four overarching cognitive biases—naive realism, confirmation bias, illusory causation, and illusion of control—probably contribute to the difficulty of accurately evaluating change in psychotherapy, as well as to an insufficient appreciation of the inferential difficulties posed by CSTEs, which we view as more specific instantiations of these four broad biases within the domain of psychotherapy. These broad biases may also contribute to clinician overconfidence, inadvertent neglect of adverse client outcomes, and an undue reliance on unguided clinical experience (see also Groopman, 2007).
Causes of Spurious Therapeutic Effectiveness: A List and Taxonomy
As noted earlier, we refer to the manifold ways in which people can be misled into believing that a treatment is working when it is not as causes of spurious therapeutic effectiveness (CSTEs). We next briefly describe 26 CSTEs that can deceive individuals into concluding that ineffective or even harmful psychotherapies are effective. We regard this list of CSTEs as provisional and subject to improvement pending further research. Hence, for heuristic purposes, we adopt a “splitting” rather than a “lumping” approach (see Mayr, 1981, for a discussion of the splitting–lumping dichotomy in classification) toward CSTEs, electing to subdivide them into distinct categories when there is research support for doing so. The advantage of a splitting approach is that certain CSTEs can later be combined into broader categories if evidence demonstrates that they are merely variants of the same inferential error.
In distinguishing among CSTEs, we part company with authors who have placed most or all CSTEs under the overarching rubric of placebo effects (e.g., Offit, 2010; Shapiro & Shapiro, 1997). For example, Novella (2008, 2010) defined placebo effects as “including everything other than a physiological response to a biologically active treatment” (p. 33) and operationalized them as “the treatment effect measured in the placebo arm of a clinical trial” (p. 33). Under placebo effects, Novella included such artifacts as regression to the mean, observer biases, demand characteristics, and expectancy effects. This expansive conceptualization has two shortcomings. First, it conflates the response following a placebo (the placebo response) with the response to a placebo (the placebo effect) and thereby commits the post hoc, ergo propter hoc error (Ernst & Resch, 1995; Kirsch, 2013). Many of the symptomatic changes that occur in a study’s placebo arm can arise from variables other than the placebo itself. Second, this conceptualization is overly inclusive, because it does not distinguish among the myriad sources of erroneous therapeutic effectiveness.
We stake no claim to our list’s comprehensiveness, but it provides a helpful starting point for conceptualizing the numerous challenges that confront clinicians, researchers, and clients when gauging psychotherapeutic effectiveness. Although all of the CSTEs we describe have been the subject of research on perceptions of change following interventions or experimental manipulations, several of these CSTEs (e.g., response-shift bias) have not, to our knowledge, been investigated with respect to psychotherapy per se. Nevertheless, there is no a priori reason why these latter CSTEs cannot produce the illusion of change following psychological treatment as well.
Overview of the taxonomy of CSTEs
We divide our proposed CSTEs into three overarching categories (see Table 1). The distinctions between these categories are conceptual, not empirical. First, some CSTEs, which we term Category 1 CSTEs, can lead individuals, including clinicians, researchers, and other observers, to misperceive change in its actual absence. In these cases, clients are not changing, although individuals erroneously perceive them to be changing. The problem of Category 1 CSTEs is underscored by a recent quotation from eminent psychiatrist Robert Spitzer, who 9 years earlier (Spitzer, 2003) had endorsed the effectiveness of “conversion therapies” for homosexuality on the basis of self-reported improvement from clients. In a widely publicized retraction of his conclusions, Spitzer (2012) acknowledged that there was no way to determine whether these perceptions of change were accurate. As Spitzer told a reporter (Carey, 2012, p. B1), “I knew this was a problem, a big problem, and one I couldn’t answer. How do you know someone has really changed?”
Table 1. Causes of Spurious Therapeutic Effectiveness and Research Safeguards Against Them
Note: CSTEs in each category have one safeguard in common and then, usually, additional specific safeguards.
Category 1 CSTEs are highly heterogeneous, as some (e.g., CSTE Numbers 1 through 4; see following section) probably exert their initial effects primarily on clients’ perception of change, whereas others (e.g., CSTE Numbers 7 through 9 and 11) probably exert their initial effects primarily on clinicians’ perceptions of change. Still others (e.g., CSTE Numbers 10 and 13) probably exert their initial effects on both clients’ and clinicians’ perceptions. Nevertheless, because psychotherapy is a process of bidirectional influence between client and clinician (Marmar, 1990), most or all Category 1 CSTEs can eventually come to deceive both treatment recipient and treatment provider. Hence, these distinctions are unlikely to be clear-cut, and they will require empirical corroboration and potential revision.
Category 2 CSTEs can lead individuals, in most cases both clinicians and clients, to misattribute actual client change stemming from extratherapeutic factors to the active treatment per se. These factors include life events that occur outside of treatment and changes in the client’s psychological condition that are causally independent of treatment. In the case of Category 2 CSTEs, clients are improving, but their improvement bears no relation to either the specific or nonspecific effects of the treatment. Instead, the intervention is incidental to client change.
Category 3 CSTEs can lead individuals, again usually both clinicians and clients, to misattribute actual client change stemming from nonspecific effects of the treatment (e.g., provision of hope) to the specific effects of this treatment (see Wampold, 2001). In the case of Category 3 CSTEs, clients are improving, as they are in Category 2 CSTEs. In Category 3 CSTEs, however, this change is a consequence of common factors shared with most or all effective psychological treatments; little or none of the improvement is attributable to the specific treatment. Category 3 CSTEs are readily overlooked because they are highly correlated with the provision of the treatment. As a consequence of these CSTEs, clinicians and researchers may conclude that their hypothesized mechanisms of therapeutic effectiveness are corroborated when they are not, because the mechanisms actually producing change (e.g., placebo effects) are shared by most if not all effective treatments.
Whether one regards Category 3 CSTEs as artifacts or as active agents of therapeutic change hinges largely on one’s hypotheses regarding the mechanisms of improvement. If one believes that a given psychotherapy works because of specific processes that are not shared with other treatments, Category 3 CSTEs are best regarded as artifacts that can predispose to spurious inferences regarding the causes of change. In contrast, if one believes that a given psychotherapy works because of common factors that are shared with most or all effective interventions (e.g., Frank & Frank, 1961; Wampold, 2001), then the sources of change comprising Category 3 CSTEs are best regarded as valid causes of improvement in their own right. Indeed, the long-standing interest in psychotherapy integration largely reflects a desire to identify cross-cutting mechanisms that operate across many treatments (Goldfried, 2010). Hence, we caution readers against regarding Category 3 CSTEs as extraneous influences to be automatically minimized or eliminated in research, as from the standpoint of scholars who argue for the primacy of common factors in psychotherapy, such influences play a pivotal role in treatment effectiveness (e.g., Messer & Wampold, 2002).
A fourth category of inferential errors that we do not explicitly address comprises erroneous inferences regarding the mechanisms of change in a given psychotherapy. As a consequence of this class of errors, researchers and therapists may conclude that a treatment is operating via specific mechanism X when it is actually operating via specific mechanism Y (see Kazdin, 2007). In such cases, the clinical improvements are due to specific mechanisms of the treatment but not to the specific mechanisms posited by the treatment’s proponents. For example, scholars continue to debate whether cognitive-behavioral therapy works by modifying cognitions, as posited by most of its proponents (Hofmann, 2008), or by alternative mechanisms, such as increases in reinforcing activities or extinction of maladaptive thoughts and emotions (Jacobson & Christensen, 1996; Longmore & Worrell, 2007). Because the inferential errors in this fourth class involve (a) an erroneous inference regarding the specific cause(s) of treatment effectiveness rather than (b) an erroneous inference of treatment effectiveness, we do not categorize them as CSTEs. In this respect, these errors differ from Category 3 CSTEs, which involve the error of attributing specific effectiveness to a treatment that does not contain specific active ingredients.
Category 1 CSTEs: Erroneous perceptions of client change in its actual absence
1. Illusory placebo effects. Illusory placebo effects arise when expectations of improvement lead clients to believe that an attribute or condition improves in the absence of genuine changes on specified outcome measures (Wechsler et al., 2011). Illusory placebo effects differ from placebo effects in that the former do not involve genuine change (hence, individuals harbor the illusion that they have improved when they have not), whereas the latter do.
In a clever study (Greenwald, Spangenberg, Pratkanis, & Eskenazi, 1991), experimenters switched audiotapes containing subliminal messages so that people who thought they listened to audiotapes designed to enhance memory actually listened to audiotapes designed to enhance self-esteem, and vice versa. Participants came away believing that their memory or self-esteem, as the case may be, had improved in response to the tape they believed they had heard rather than in response to the tape they had actually heard. In fact, on objective tests of memory and self-esteem, all of the tapes were ineffective. The illusory placebo effect demonstrates that expectations and implicit theories can lead people to perceive, or at least report, imaginary changes in their behaviors, thoughts, and feelings (see also Nisbett & Wilson, 1977).
2. Palliative benefits. Psychotherapy sometimes makes clients feel better about their difficulties but exerts little or no effect on these difficulties (Beyerstein, 1997). Echoing this point, Albert Ellis (2003) underscored the importance of distinguishing “feeling better” from “getting better” in psychotherapy. For example, an antisocial client may enter therapy distressed about his repeated marital infidelity and leave therapy less distressed but with an unaltered risk for future infidelity. As Alpert (2012) observed, “Therapy sessions can work like spa appointments: They can be relaxing but don’t necessarily help solve problems” (p. SR5).
One could justifiably contend that palliative changes can themselves be therapeutic in some instances, especially if distress regarding one’s behaviors is a treatment target. Yet especially for clients whose behaviors routinely engender interpersonal distress for other individuals, such as those with narcissistic or antisocial personality disorders (American Psychiatric Association, 2013), the problem behaviors themselves are often the foci of the intervention. In these cases, alleviating client distress may actually be countertherapeutic. For example, some authors have argued that psychological treatment often makes psychopaths worse (Rice, Harris, & Cormier, 1992), although the research support for this contention is admittedly equivocal (D’Silva, Duggan, & McCarthy, 2004).
3. Confusing insight with improvement. Some clients may achieve greater insight into their difficulties over the course of therapy. Although such insight may not be linked to improvements in objective treatment outcomes, clients may believe that they have achieved progress merely because they can now conceptualize and verbalize their problems in greater richness and detail. In this example, insight is unrelated to improvement and thus constitutes a CSTE. If, however, the acquisition of insight per se were a therapeutic goal, then acquiring insight (even in the absence of change in signs and symptoms) would not constitute a CSTE.
There are two separable issues here, both of which bear on the veracity of insight as opposed to its clinical utility. First, the insights obtained in psychotherapy may sometimes be illusory, reflecting subjectively compelling but erroneous causal stories (Taleb, 2007). To the extent that humans are “meaning-making” beings (Bruner, 1990), insight may at times prove helpful in constructing a framework within which to better comprehend themselves and others. Indeed, some specious insights acquired in treatment may improve clients’ mood or behavior, at least in the short term, by affording them a sense of understanding and control over their problems (see Jopling, 2001, for a discussion of “placebo insights” in treatment), but others may be therapeutically inert or harmful (Jopling, 2008).
Second, even if the insights accrued in therapy are veridical, they may not guarantee or even facilitate improvement. For example, a client with a specific phobia of dogs may come to recognize that his fears originated with a frightening dog attack and that he is now negatively reinforcing these fears by avoiding dogs. Yet if he is unwilling to confront his fears during therapy by engaging in systematic in vivo exposure to dogs, his symptoms are unlikely to abate (Wachtel, 1987). Nor is insight always necessary for improvement (Bloom, 1994). In one study of psychoanalytic treatment, half of 42 patients were rated as better adjusted at the conclusion of therapy although few were judged to exhibit increased insight into their “core conflicts” (Bachrach, Galatzer-Levy, Skolnikoff, & Waldron, 1991).
4. Retrospective rewriting of pretreatment functioning. In some cases, clients may persuade themselves that they have improved by misremembering their initial level of functioning as worse than it was (Ross, 1989). Such biased memories may stem from clients’ implicit expectations of change during therapy. In one study, researchers randomly assigned university students to either a study skills course designed to improve their grades or to a no-intervention control condition and measured their study skills and grades before and after the intervention. The study skills class was apparently useless, as it failed to improve students’ grades. Yet students in the experimental condition perceived the intervention as effective, because they misremembered their initial study skills as worse than they were (Conway & Ross, 1984). Similarly, evidence suggests that at least some of the change commonly attributed to “posttraumatic growth”—psychological improvement following trauma—may actually be due to derogation of individuals’ pretrauma selves (Frazier et al., 2009; McFarland & Alvaro, 2000). Retrospective rewriting of pretreatment functioning may sometimes also occur during psychotherapy, especially when clients harbor strong expectations of improvement.
Such retrospective rewriting may transpire even when individuals are asked to evaluate their long-standing personality traits. In an elegant series of studies, A. E. Wilson and Ross (2001) found that individuals frequently described their current selves more favorably than their past selves, largely because they derogated their past selves. This tendency was especially pronounced when participants cared about the traits being judged. These results dovetail with longitudinal data demonstrating that the correlations between actual and perceived change in personality traits are only modest (Robins, Noftle, Trzesniewski, & Roberts, 2005). A study of 290 undergraduates tracked across 4 years of college found that participants retrospectively overestimated the extent to which they had become more extraverted over time, perhaps consistent with the cultural narrative that students become more outgoing and socially adept in college (Robins, Fraley, Roberts, & Trzesniewski, 2001). Such findings suggest that retrospective self-evaluations of change in long-term therapy may be suspect, especially if they reflect implicit beliefs regarding the direction and nature of change (Ross, 1989).
5. Response-shift bias. A related phenomenon, response-shift bias, occurs when an intervention leads individuals to change “their evaluation standard with regard to the dimension measured” (G. S. Howard, 1980, p. 93; see also Bray, Maxwell, & Howard, 1984; G. S. Howard & Dailey, 1979). This shift, which is of particular concern for researchers or clinicians using self-report measures, can occur when an intervention leads clients to reconceptualize their initial levels of a specific psychological trait. In contrast to retrospective rewriting of pretreatment functioning, which reflects a memorial change, this CSTE reflects an alteration in one’s “implicit scale” for measuring a trait (McLeod, 2001). Response-shift bias can cause individuals to either underestimate or overestimate the effects of a psychological intervention, depending on the direction of the shift.
For example, an excessively self-critical spouse may enter couples therapy concerned that she is to blame for problems in her marriage; on self-report and interview measures, she initially rates herself as narcissistic and anger-prone. During treatment, she may come to realize that her verbally abusive and overbearing husband is primarily responsible for their marital conflicts and that her levels of self-centeredness and resentment are no higher than the average person would experience in a similarly trying situation. Even though her levels of these two problematic traits have not changed over the course of treatment, her trait scores on standardized measures may decline from pretest to posttest, leading the therapist (and often the client herself) to conclude erroneously that the treatment has lowered her self-centeredness and hostility. In a sense, the treatment has exerted an impact, but on the client’s conceptualization of her traits rather than on the traits themselves.¹
6. Reduction in cognitive biases. Successful treatment for depression and similar conditions may attenuate certain cognitive biases, such as those tied to self-criticism and perception of one’s level of impairment (Whisman, Miller, Norman, & Keitner, 1991). Although a reduction in such distortions is often a legitimate treatment target per se, it may engender the spurious appearance of improvement on other measures. For example, depression is often marked by overreporting of the features of associated psychopathology. As a consequence, an intervention that diminishes the intensity of the cognitive biases often associated with depression (e.g., magnification of one’s weaknesses) may lead to decreases in the reported severity of co-occurring problems (e.g., social adjustment), even when these problems have remained objectively unchanged (Morgado, Smith, LeCrubier, & Widlocher, 1991).
7. Demand characteristics. Demand characteristics occur when clients or research participants adjust their behavior, including self-reported behavior, in accord with what they believe to be the therapists’ or investigators’ hypotheses (Orne, 1962). The treatment rationale provided by clinicians can convey potent demand characteristics to patients regarding treatment and thereby shape their attributions, expectations, emotions, and actions (Addis & Carpenter, 2000; McReynolds & Tori, 1972). In one study, participants informed that thoughts precede affect in response to images (i.e., a cognitive therapy rationale) were more likely to report thoughts first compared with participants informed that affect precedes thoughts. Differences between the two rationales were especially apparent in response to highly arousing images (Kanter, Kohlenberg, & Loftus, 2004) and were maintained at a 1-week follow-up (Busch, Kanter, Sedivy, & Leonard, 2007).
Moreover, clients are often motivated to tell their therapists what they believe their therapists want to hear; they may also be motivated to persuade themselves that they have improved. Hathaway (1948) used the term “hello–goodbye” effect to describe clients’ propensity to present themselves as worse than they actually are at the outset of treatment and better than they actually are at the conclusion of treatment. As a consequence of this phenomenon, therapists and other observers may conclude that client improvement occurred in its absence.
Similarly, hypnosis researchers have identified a “holdback effect” when participants are tested sequentially in nonhypnosis and hypnosis conditions. One of the implicit demands of hypnosis is to behave as a “good” hypnotic subject should, or at least as this role is understood by the participant (Orne, 1962). The holdback effect can arise when participants are not hypnotized during an initial baseline trial but know they will be hypnotized in the following trial. In such cases, they may deliberately “hold back” from fully responding when they are not hypnotized to demonstrate gains on the later hypnosis trial, thereby presenting themselves as good hypnotic subjects (Braffman & Kirsch, 1999; Zamansky, Scharf, & Brightbill, 1964).
8. The therapist’s office error. What we term the therapist’s office error is the mistake of confusing clients’ in-session behavioral presentation with out-of-session improvement. Clients’ behavior within the cloistered confines of the therapist’s office may not reflect their behavior or functioning outside of treatment (Holmes, 1971; Magaret, 1950). This error may sometimes lead clinicians to underestimate treatment effectiveness, as when adequately functioning clients use psychotherapy sessions as opportunities to express their pent-up negative emotions (see Nichols & Efran, 1985).
In other cases, however, the therapist’s office error may contribute to overestimates of treatment effectiveness. For example, clients with social anxiety disorder (social phobia) involving apprehension of interpersonal rejection who are initially anxious in treatment may grow more comfortable with the therapist over time, leaving the therapist (and perhaps clients themselves) with the misleading impression that they are experiencing improvement in social anxiety symptoms. Yet these clients may merely be exhibiting stimulus discrimination, learning to respond less anxiously to the psychotherapist or others who provide them with unconditional acceptance but not to the very people they find most interpersonally threatening. Indeed, studies of behavior therapy for anxiety disorders sometimes point to a stimulus generalization gradient from the therapist’s office to the outside world, reflecting marked improvements in the former setting followed by decrements upon treatment termination (Gruber, 1971; see Lynch, Chapman, Rosenthal, Kuo, & Linehan, 2006, for a discussion of real-world generalization strategies in dialectical behavior therapy). These findings underscore the need to ensure that the client’s anxiety-provoking behaviors are addressed in real-world settings during treatment.
The therapist’s office error may pose a particular challenge for psychoanalytic therapies, which rely heavily on the therapist–client transference as the engine of change. In many respects, one can conceptualize transferences as reflecting interpersonal expectancies (Westen, 1998). Accordingly, if clients do not generalize their positive transference reactions toward the therapist to others, their long-term improvements may be limited (Holmes, 1971).
9. Retest artifacts. The retest artifact (Loranger, Lenzenweger, Gartner, & Susman, 1991) is the tendency of scores on psychopathology indices to decline spuriously upon their second administration. This artifact may be especially likely with measures characterized by a skip-out structure, such as many structured and semistructured interviews. Clients may realize that if they say “no” to many questions, they will have a much briefer and less emotionally distressing experience than if they say “yes” to them, generating the false appearance of improvement. In other cases, clients may deny more symptoms during the second assessment if they learn that the questions concern sensitive behaviors, like drug use or antisocial activities. Indeed, evidence suggests that this artifact may be especially pronounced for measures of socially undesirable characteristics (Jorm, Duncan-Jones, & Scott, 1989). Although the retest artifact has not received the research attention it merits, data suggest that it may be more of a threat to the validity of short-term than long-term assessments of personality disorder features (Lenzenweger, 1999; Samuel et al., 2011).
10. Unknowable outcomes in the control condition. A largely unappreciated reason for erroneous inferences of therapeutic effectiveness is the absence of information regarding the “hypothetical counterfactual” (Dawes, 1994): our inability to know what would have occurred had we not intervened. Because clinicians in routine practice settings are necessarily unaware of how their clients would have fared in a control condition, they cannot gauge the extent to which the improvement they observed might have occurred in the absence of treatment or in the presence of an alternative treatment. Clients are certainly subject to the same epistemic limitation.
An illustrative example derives from research on critical incident stress debriefing (CISD), which is widely used to decrease the risk of posttraumatic stress symptoms among trauma-exposed victims. Controlled research demonstrates that CISD is ineffective and perhaps iatrogenic (Litz, Gray, Bryant, & Adler, 2002; McNally, Bryant, & Ehlers, 2003). Yet many people who have undergone CISD are convinced that it was effective (Carlier, Voerman, & Gersons, 2000). A study by Mayou, Ehlers, and Hobbs (2000) offers intriguing insights into this paradox. These investigators evaluated the 3-year outcome of 61 patients who had experienced traffic accidents; some had been randomly assigned to receive CISD and others to receive no intervention. Among other measures, participants completed the Impact of Events Scale (IES; M. J. Horowitz, Wilner, & Alvarez, 1979), an index measure of posttraumatic stress symptoms. As is evident from Figure 1, high-scoring IES participants who received CISD improved between the pretreatment baseline and the 3-year follow-up. Yet remarkably, high-scoring IES participants who received no intervention at all improved even more. These findings suggest that CISD can impede natural healing processes (McNally et al., 2003). They also help us to understand why so many people are persuaded that CISD is efficacious even though it is not. Specifically, trauma-exposed individuals who receive CISD do improve, but not because of the treatment. To the contrary, they probably would have improved even more had they received no treatment at all.

Figure 1. The effects of critical incident stress debriefing on posttraumatic stress symptoms among traffic accident victims. Note the striking difference in trajectories between high scorers who did and did not receive the intervention. Both groups improved, but the group that received the intervention would have improved more had they received no intervention at all. From Mayou et al. (2000). Reprinted with permission.
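The logic of the hypothetical counterfactual can be illustrated with a small simulation. The following Python sketch is not a model of the Mayou et al. (2000) data: the function name `simulate_groups` and every numerical value (the size of natural recovery, the amount by which a hypothetical treatment impedes recovery, and the noise level) are assumptions chosen purely for illustration. It shows how a pre-post comparison within the treated group alone can suggest benefit, whereas a comparison with an untreated group reveals that the treatment impeded recovery.

```python
import random


def simulate_groups(n=500, natural_recovery=10.0, treatment_effect=-4.0, seed=2):
    """Return mean pre-to-post symptom reduction for treated and untreated groups.

    All values are hypothetical. A negative treatment_effect means the
    treatment impedes the recovery that would have occurred anyway.
    """
    rng = random.Random(seed)

    def mean_reduction(effect):
        # Each client's symptom reduction = natural recovery + treatment effect + noise.
        reductions = [
            natural_recovery + effect + rng.gauss(0.0, 5.0) for _ in range(n)
        ]
        return sum(reductions) / n

    treated = mean_reduction(treatment_effect)
    untreated = mean_reduction(0.0)  # the counterfactual a clinician never observes
    return treated, untreated


treated, untreated = simulate_groups()
print(f"Mean symptom reduction, treated:   {treated:.1f}")
print(f"Mean symptom reduction, untreated: {untreated:.1f}")
```

Under these assumptions, an observer who sees only the treated group observes a positive mean symptom reduction and has no basis for recognizing that untreated clients improved more.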
11. Selective attrition. This CSTE differs from others we have described in that it operates not at the level of individual clients but at the level of all clients in a clinician’s caseload. Selective attrition refers to the fact that clients who drop out of therapy are not a random subsample of all clients. Research demonstrates that clients who are not improving are especially likely to leave psychotherapy (Garfield, 1994; Tehrani, Krussel, Borg, & Munk-Jørgensen, 1996; see also Swift & Greenberg, 2012). As a result, therapists may conclude erroneously that their treatments are effective merely because their remaining clients are those who have improved. One problem that has long bedeviled the evaluation of Alcoholics Anonymous and similar 12-step interventions for substance disorders is the high level of client dropout from this intervention, often approaching 40% following 1 year (Kelly & Moos, 2003). The clients who remain in these treatments after several years are generally faring better than when they began, but they are unrepresentative of the clients who initially enrolled. The clients who dropped out may not have been helped or may have even been harmed by the intervention.
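A brief simulation may help to show how selective attrition alone can create the appearance of effectiveness. The Python sketch below assumes a completely inert treatment (symptom change is random noise centered on zero) and assumes that clients who are not improving drop out more often than those who are. The function name `simulate_caseload` and the dropout probabilities are hypothetical values chosen for illustration, not estimates from the studies cited above.

```python
import random


def simulate_caseload(n_clients=1000, dropout_if_worse=0.6,
                      dropout_if_better=0.2, seed=0):
    """Simulate symptom change under an inert treatment with selective dropout."""
    rng = random.Random(seed)
    all_changes, completer_changes = [], []
    for _ in range(n_clients):
        # Positive values mean improvement; the true average change is zero.
        change = rng.gauss(0.0, 1.0)
        all_changes.append(change)
        # Assumption: clients who are not improving are likelier to drop out.
        p_drop = dropout_if_better if change > 0 else dropout_if_worse
        if rng.random() >= p_drop:
            completer_changes.append(change)
    return all_changes, completer_changes


def mean(values):
    return sum(values) / len(values)


all_changes, completers = simulate_caseload()
print(f"Clients enrolled: {len(all_changes)}, clients remaining: {len(completers)}")
print(f"Mean change, all enrolled clients: {mean(all_changes):+.2f}")
print(f"Mean change, remaining clients:    {mean(completers):+.2f}")
```

Under these assumptions, the clients who remain show a clearly positive average change even though the treatment does nothing, which is the pattern a therapist would observe when judging effectiveness from the remaining caseload alone.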
12. Compliance bias. A cognate problem of selection bias can arise even among clients who remain in treatment. Compliance bias occurs when differences among clients in their adherence to treatment recommendations are confounded with variables that predict outcome, such as motivation to improve or conscientiousness (Grodstein, Clarkson, & Manson, 2003; Petitti, 1994). One well-known case of such bias comes from the 1970s Coronary Drug Project, which examined the effects of clofibrate, a cholesterol-lowering medication, on heart disease (Coronary Drug Project Research Group, 1975). When the investigators detected no significant effect of clofibrate versus placebo on cardiovascular outcomes, they conducted internal analyses of regular versus irregular clofibrate users. When they did, they found that only 15% of regular clofibrate users (those who had taken 80% or more of their pills) had died of heart disease compared with 25% of irregular users, seeming to suggest a positive effect of the medication. Yet when the researchers compared regular versus irregular users of the placebo, the results were virtually identical (Dawes, 2001; Taubes, 2007). Presumably, a third variable, such as health consciousness, accounted for both (a) more diligent adherence to physicians’ recommendations and (b) better cardiovascular outcomes.
Research on cognitive-behavioral therapy reveals that clients who comply with extrasession homework assignments display better treatment outcomes than those who do not (Mausbach, Moore, Roesch, Cardenas, & Patterson, 2010). Similarly, evidence suggests that clients who practice meditation regularly in studies of compassion-based meditation training exhibit better outcomes than clients who do not (Pace et al., 2009). Because of compliance bias, unwary psychotherapists may notice that some of their clients comply with their prescribed interventions more than do others, find that the former clients display superior treatment outcomes, and conclude that these interventions were effective. Yet individual differences in client treatment adherence may merely be a proxy for another variable, such as treatment motivation or emotional resilience, which in turn is linked to enhanced psychological health.² Moderator analyses, which examine whether interventions are especially beneficial for certain clients (Kazdin, 2007; Kraemer, Wilson, Fairburn, & Agras, 2002), may be helpful in this regard, as levels of compliance can be treated as continuous moderators of outcome.
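The third-variable account of compliance bias can be made concrete with a simulation. In the Python sketch below, a latent trait (labeled `health_conscious`) raises the probability of adherence and lowers the probability of a bad outcome, while the treatment itself is inert: the outcome probability does not depend on trial arm. The function name `simulate_trial` and all probabilities are hypothetical and were not taken from the Coronary Drug Project or any other study; the sketch is intended only to reproduce the qualitative pattern in which adherent participants fare better in both the drug and the placebo arms.

```python
import random


def simulate_trial(n_per_arm=5000, seed=1):
    """Return bad-outcome rates by arm and adherence for an inert treatment."""
    rng = random.Random(seed)
    results = {}
    for arm in ("drug", "placebo"):
        # For each adherence group, track [number of bad outcomes, group size].
        counts = {"adherent": [0, 0], "nonadherent": [0, 0]}
        for _ in range(n_per_arm):
            # Latent third variable (hypothetical prevalence of 50%).
            health_conscious = rng.random() < 0.5
            p_adhere = 0.8 if health_conscious else 0.4
            # The outcome depends only on the latent trait, never on the arm.
            p_bad_outcome = 0.12 if health_conscious else 0.28
            group = "adherent" if rng.random() < p_adhere else "nonadherent"
            counts[group][1] += 1
            if rng.random() < p_bad_outcome:
                counts[group][0] += 1
        results[arm] = {g: bad / n for g, (bad, n) in counts.items()}
    return results


for arm, rates in simulate_trial().items():
    print(f"{arm:8s} adherent: {rates['adherent']:.1%}   "
          f"nonadherent: {rates['nonadherent']:.1%}")
```

Because adherence tracks outcome equally in both arms, comparing adherent with nonadherent clients within a single treatment says little about whether the treatment works; only the comparison between arms bears on that question.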
13. Selective attention to client outcomes. Confirmation bias (Nickerson, 1998), illusory correlation (Chapman & Chapman, 1967), and allied cognitive errors may lead clinicians to attend selectively to certain outcome variables while ignoring or minimizing others. Specifically, psychotherapists may unwittingly “cherry-pick” the outcome variables on which clients are improving. For example, because of diagnostic overshadowing (Garb, 1998), therapists may focus unduly on apparent improvement in dramatic client signs and symptoms, such as psychotic features or aggressive behaviors, while neglecting lack of improvement or deterioration in less overt signs and symptoms, such as depressed mood, anxious ruminations, or anger (Zimmerman & Mattia, 1999). As a consequence, they may conclude that clients have improved when they have exhibited no change across multiple clinically important domains. Clients may fall victim to the same error: They may engage in “selective symptom monitoring” (Pennebaker & Skelton, 1981), focusing on symptoms that they expect to change while neglecting or underattending to others.
14. Selective memory for client outcomes. The past several decades of psychological research leave scant doubt that memory is fallible (Loftus, 1993; Lynn & Nash, 1994) and that most of us preferentially recall information consistent with our hunches and desires (Walker, Skowronski, & Thompson, 2003). As a consequence, clinicians may be more likely to recall positive than negative client signs and symptoms during and after treatment, potentially resulting in overestimates of treatment effectiveness.³
15. Selective interpretation of client outcomes. Confirmation bias and similar cognitive errors may predispose to selective interpretation of the clients’ difficulties during and after treatment. The more ambiguous the outcome variables rated by clinicians, the larger the potential for biases in their ratings (Markin & Kivlighan, 2007; Westen & Weinberger, 2005). Hence, clinicians who are motivated to perceive improvement in their clients may interpret ambiguous symptoms (e.g., increased anger toward a spouse in marital therapy, heightened emotional processing of painful childhood memories) as evidence of treatment success.
Category 2 CSTEs: Misinterpretations of actual client change stemming from extratherapeutic factors
16. Spontaneous remission. Spontaneous remission refers to the tendency for disorders to resolve on their own (Beyerstein, 1997). Early reports by Eysenck (1952) of spontaneous remission rates of 70% or more among psychiatric patients were almost surely overestimates (Rachman, 1973). Nevertheless, later data point to nontrivial rates of spontaneous remission in outpatient samples (Chadwell & Howell, 1979; Lambert, 1976). For example, Posternak and Zimmerman (2000) reported a spontaneous remission rate of 52% among patients with major depressive disorder. The rates of spontaneous remission among children and adolescents with psychopathology, including behavioral problems, approach or exceed 40% (Harrington, Whittaker, Shoebridge, & Campbell, 1998; Jacobson & Christensen, 1996; McCullough, 2000).
The longer people remain in therapy, the greater the opportunity for extratherapeutic factors, including natural healing processes, coping, social support, and positive experiences in everyday life, to contribute to observed or perceived improvement (Jacobson & Christensen, 1996). Moreover, when frequent spontaneous remissions happen to coincide with the administration of specific interventions, client and clinician alike may fall prey to illusory causation, coming to believe that the interventions are producing the spontaneous remissions (Blanco, Barberia, & Matute, 2014).
Spontaneous remission may be partly accounted for by what Alexander and French (1946) termed the “corrective emotional experience,” a positive affective occurrence that ameliorates the detrimental impact of early negative life events (Bridges, 2006). Although Alexander and French emphasized the role of corrective emotional experiences in psychotherapy, such events (e.g., finding a loving partner in the aftermath of an abusive relationship) surely occur in everyday life as well. In the words of psychoanalyst Karen Horney (1945), “life itself still remains a very effective psychotherapist” (p. 240).
17. History. A related extratherapeutic factor that can contribute to the erroneous inference of a therapeutic effect is what Campbell and Stanley (1963) termed history: widely shared events transpiring outside of the treatment setting. A client who is experiencing severe life stressors due to a poor economy or recent natural disaster may improve when the impact of these events on his (and his friends’ and loved ones’) financial and personal life has dissipated. The clinician may erroneously attribute improvement during therapy to the treatment itself rather than to the salubrious changes in the client’s everyday life.
18. Cyclical nature of some disorders. Another extratherapeutic factor that can be linked to short-term improvement is the cyclical nature of many disorders (Beyerstein, 1997). In contrast to spontaneous remission, which refers to substantial amelioration in or disappearance of a condition per se, this CSTE refers to a transient shift into the benign phase of a condition characterized by a recurrent course. Like many medical conditions, such as multiple sclerosis, arthritis, and gastrointestinal problems, many psychological disorders have their “ups and downs.” In disorders that are cyclical, people often improve, periodically or over the long term, without intervention. For example, in cyclothymic and bipolar disorders, which are characterized by affective, interpersonal, and behavioral lability, an ineffective treatment implemented over a lengthy period will have ample opportunities to coincide with upticks that likely would have occurred regardless of treatment. Accordingly, clinicians may infer that therapy is responsible for improvement when positive changes are instead induced by fluctuations in the disorders’ natural course. One likely reason for the popularity of unvalidated and fringe treatments for autism spectrum disorders, such as secretin (a polypeptide hormone synthesized from the intestines of pigs) and sensory-motor integration therapy, is the fact that the corollary symptoms of this condition (e.g., aggression, self-injurious behavior, social interaction difficulties) often wax and wane over brief time periods (Romanczyk, Arnstein, Soorya, & Gillis, 2003), leading observers to mistake short-term behavioral changes for beneficial treatment effects.
19. Self-limiting nature of disorder episodes. Like the acute exacerbations of many physical disorders, the episodes of some psychological disorders tend to be self-limiting. A treatment may appear to exert a beneficial effect on a disorder episode that has run its natural course (Beyerstein, 1997). For example, the median duration of a depressive episode is approximately 13 weeks (Solomon et al., 2010), and some untreated episodes remit or improve substantially without any intervention (Kirsch & Sapirstein, 1998). In other cases, certain disorders themselves may be short-lived. For example, short-term drug-induced psychiatric conditions, such as amphetamine intoxication or alcohol withdrawal delirium (American Psychiatric Association, 2013), wane in intensity once the active physiological effects of the substance (or the withdrawal effects of the substance) have subsided.
20. Regression to the mean. It is a statistical fact of life that extreme scores tend to become less extreme upon retesting, a phenomenon known as regression toward the mean (Kruger, Savitsky, & Gilovich, 1999). By mathematical necessity, regression to the mean will occur whenever the correlation between pretest and posttest scores is less than unity (Salsburg, 2001); such regression will be especially pronounced when measures are of low reliability. If a patient presents to therapy as severely depressed, chances are reasonably high that he or she will be less depressed (or at least report lower levels of depression on standardized outcome measures) in a few weeks, even in the absence of treatment.
Regression to the mean is an especially thorny problem in evaluating the effectiveness of psychotherapy in real-world settings, because most patients enter treatment when their symptoms are most extreme and hence when regression effects are maximized (Gilovich, 1991). Similarly, antisocial children and adolescents may not be referred to treatment until their behaviors become unbearable to teachers or parents (Costello & Janiszewski, 1990). Some authors have conjectured that most of the variance commonly attributed to placebo effects in controlled trials of medication is actually due to regression effects (McDonald, Mazzuca, & McCabe, 1983). Moreover, regression effects may sometimes be misinterpreted as spontaneous remission (Campbell & Kenny, 1999).
Rendering this CSTE especially problematic are findings that humans are prone to nonregressive predictions (Dawes, 1986; Nisbett & Ross, 1980). That is, we do not sufficiently compensate for regression to the mean when predicting behavior from Time A to Time B. Perhaps because of the representativeness heuristic, a mental shortcut characterized by the assumption that “like goes with like” (Tversky & Kahneman, 1974), we expect behavior at Time A and B to be similar. Hence, whenever we detect a difference between Time A and Time B behavior, we tend to attribute this change to spurious extraneous factors, such as the effects of treatment, rather than to statistical regression (an error known as the regression fallacy; Kahneman, 1965). As Campbell and Kenny (1999) commented, “it seems likely that regression toward the mean leads people to believe in the efficacy of the scientifically unjustified regimens. . . . Many a quack has made a good living from regression toward the mean” (p. 48).
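The statistical argument about regression to the mean can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not an analysis from the literature cited here: the sample size, the pretest-posttest correlation of .60, and the rule of treating the top 10% of pretest scorers as "clients" are all hypothetical choices. No treatment occurs between the two assessments, so any apparent improvement is purely statistical.

```python
import random
import statistics

def simulate_regression(n=20000, reliability=0.6, top_fraction=0.10, seed=1):
    """Simulate two untreated assessments of the same stable trait.

    Each observed score is a true score plus independent measurement error,
    scaled so that observed scores have unit variance and the
    pretest-posttest correlation equals `reliability`.
    """
    rng = random.Random(seed)
    true_sd = reliability ** 0.5
    error_sd = (1 - reliability) ** 0.5
    people = []
    for _ in range(n):
        true_score = rng.gauss(0, true_sd)
        pre = true_score + rng.gauss(0, error_sd)
        post = true_score + rng.gauss(0, error_sd)  # nothing happens in between
        people.append((pre, post))
    # Hypothetical selection rule: "clients" are the most extreme pretest scorers
    people.sort(key=lambda p: p[0], reverse=True)
    clients = people[: int(n * top_fraction)]
    pre_mean = statistics.fmean(p[0] for p in clients)
    post_mean = statistics.fmean(p[1] for p in clients)
    return pre_mean, post_mean

pre_mean, post_mean = simulate_regression()
print(f"pretest mean of extreme group:  {pre_mean:.2f}")
print(f"posttest mean of extreme group: {post_mean:.2f}")
```

Under these assumptions, the posttest mean of the selected group should fall close to the reliability multiplied by its pretest mean, even though no one was treated. Raising `reliability` toward 1 shrinks the apparent "improvement," in line with the point that regression is most pronounced for unreliable measures.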
21. Maturation. A source of erroneous inferences of therapeutic efficacy, especially among children and adolescents, is maturation: improvement owing to naturally occurring psychological growth (Cook & Campbell, 1979). For example, children and young adolescents with high levels of what appear to be certain pre-psychopathic features, such as poor impulse control, low frustration tolerance, and defiance, may improve on their own because levels of these characteristics often diminish with the passage of time, especially when they are early appearing (Edens, Skeem, Cruise, & Cauffman, 2001). Such maturation can mislead clinicians into concluding that their treatment was responsible for declines in the levels of these and other externalizing problems. Psychological growth may be a source of mistaken therapeutic conclusions even among adult clients. For example, some patients with borderline personality disorder may improve over long stretches of time without treatment (Shea et al., 2009).
22. Multiple treatment interference. When clients seek out a treatment, they often obtain other interventions simultaneously (Kendall, Butcher, & Holmbeck, 1999), a confound known as multiple treatment interference or co-intervention bias. Some of these adjunctive interventions may be formal treatments, such as antidepressants or marital therapy. Others may be informal “treatments,” such as exercise, which has generally been found in controlled studies to be effective for alleviating depression (Fremont & Craighead, 1987; Penedo & Dahn, 2005), or confiding in trusted friends or religious figures. Multiple treatment interference renders it difficult or impossible to attribute client change conclusively to the active ingredients of the intervention of choice.
23. Initial misdiagnosis. Even the best trained diagnosticians are fallible (Beyerstein, 1997; Garb, 1998; Groopman, 2007). For example, relatively normal individuals undergoing temporary life stressors are at times mistakenly diagnosed as psychopathological; when they are later examined, they have improved but not necessarily because of the treatment. The same may hold for clients with acute medical disorders that are misdiagnosed as psychiatric conditions. For example, acute intermittent porphyria has been called “a great imitator” (Morrison, 1997, p. 155) and is occasionally mistaken for bipolar disorder and other cyclical emotional conditions. If this medical condition resolves on its own, which it sometimes does (Loftus & Arnold, 1991), an unwary clinician may mistakenly conclude that a treatment targeted for a manic episode was beneficial.
Category 3 CSTEs: Misinterpretations of actual client change stemming from nonspecific treatment factors
24. Placebo effects. The omnipresent placebo effect has been defined in multiple ways, but it is traditionally regarded as improvement resulting from the mere expectation of improvement (Beecher, 1955; S. Horowitz, 2012; Steer & Ritschel, 2010). By instilling hope and the conviction that one can rise above life’s challenges, virtually any credible treatment can be at least somewhat helpful for combating demoralization (Frank & Frank, 1961), which is a central component of many psychological disorders (Tellegen et al., 2003). Admittedly, importing the placebo concept into the domain of psychotherapy is fraught with complexities given that at least some of the efficacy of psychological treatment probably derives from expectancies of improvement (Kirsch, 2005; Lambert, 2005). Nevertheless, because such expectancies presumably cut across most or all effective psychotherapies, they can lead clinicians and researchers to conclude that the specific ingredients of a treatment are efficacious when they are inert.
In the case of medication, some research suggests that up to 80% of the effects of antidepressants on clinical depression, especially when it is mild or moderate, may be attributable to placebo effects (Kirsch, 2005; Kirsch & Sapirstein, 1998; but see Coyne, 2012; Klein, 1998, for different views). Placebos generally exert their most potent effects on subjective reports, such as depression, pain, and nausea, rather than on largely objective indices, such as assays of cancer, heart disease, or other organic illnesses (Hróbjartsson & Gotzsche, 2001).
Placebo effects appear to play an important role in the efficacy of psychotherapy, too. Estimates of placebo effects in psychotherapy, typically obtained by comparing treatment outcomes from attention-placebo control groups with those of wait-list control groups, are on the order of d = 0.40, or about half of the typical effect size yielded by active therapies (Grissom, 1996; Lambert, 2005; Lambert & Ogles, 2004). Moreover, meta-analyses indicate that the estimated efficacy of psychotherapy is considerably smaller when it is compared with an attention-placebo control group than with a wait-list control group (Baskin, Tierney, Minami, & Wampold, 2003; Bowers & Clum, 1988), suggesting that some of the potency of psychological treatment derives from the nonspecific effects of expectancies.
Still, ascertaining the precise magnitude of placebo effects in psychotherapy is difficult and arguably impossible given the absence of a perfect psychological treatment analogue to a pill placebo (Kirsch, 2005). Moreover, even control conditions designed to be active placebos may not control fully for the effects of expectancies, as these placebos are often less plausible than the active interventions against which they are pitted (Boot, Simons, Stothart, & Stutts, 2013).
Placebo effects should not be confused with other nonspecific effects of treatment (Kienle & Kiene, 1997; cf. Novella, 2010), such as those of empathy and support (Nathan, Stuart, & Dolan, 2003). The causal role of these nonspecific factors is controversial. On the one hand, the therapeutic alliance is modestly and positively associated (average r = .22) with therapeutic improvement (Baldwin, Wampold, & Imel, 2007; Orlinsky, Rønnestad, & Willutzki, 2004). This finding has led some scholars to contend that the therapeutic alliance is a causal agent in psychotherapeutic change. On the other hand, relatively few therapy outcome studies account for the temporal relation between the alliance and improvement, precluding relatively clear-cut inferences of causality (Kazdin, 2007). Several investigations that have incorporated assessments of therapeutic alliance and symptom change at multiple therapeutic time points suggest that a positive alliance typically follows symptom change, not vice versa (DeRubeis, Brotman, & Gibbons, 2005; DeRubeis & Feeley, 1990); but other studies have arrived at different conclusions (Horvath, Del Re, Flückiger, & Symonds, 2011; Norcross & Lambert, 2006). In light of this mixed evidence, we do not class these nonspecific factors as CSTEs.
25. Novelty effects. Clients may improve, especially at the outset of treatment, because they are excited by the prospect of receiving a new intervention (Fraenkel & Wallen, 1993; Marino & Lilienfeld, 2007). Novelty effects probably overlap with placebo effects in some cases, but the former typically operate largely or exclusively during the initiation of treatment.
Psychotherapy outcome data suggest that about 15% of patients improve between the initial phone call from the clinician and the first session (K. I. Howard, Kopta, Krause, & Orlinsky, 1986). At least some of this improvement probably stems from the anticipation of receipt of a novel treatment. Moreover, for many conditions, including major depression and eating disorders, perhaps 60% to 80% of clinical improvement occurs by the fourth session (Ilardi & Craighead, 1999; G. T. Wilson, 1999; but see Tang & DeRubeis, 1999). Much of the early change in psychotherapy may similarly reflect clients’ reactions to a new intervention that offers the promise of change, although some of it may also stem from placebo or regression effects. Novelty effects may account in part for meta-analytic findings that the effect sizes for the efficacy of some psychotropic medications, including second-generation antipsychotics and antidepressants, have been highest shortly following their introduction, only to dissipate with time (Lehrer, 2010; Leucht, Arbter, Engel, Kissling, & Davis, 2009), although other factors (e.g., enrollment of progressively milder patients in medication studies, publication bias) may also be at play.
26. Effort justification. Because clients often devote substantial time, energy, effort, and money to treatment, they may feel a need to justify this investment. They may do so by persuading themselves that the therapy was beneficial, a phenomenon termed effort justification (Cooper, 1980; Cooper & Axsom, 1982). In one study, college students with snake phobic symptoms improved equally when receiving exposure therapy and when performing strenuous physical exercises (e.g., running quickly in place). The latter “treatment” required considerable effort and presumably led to a need to rationalize this effort (Axsom & Cooper, 1985). Effort justification may be a particularly challenging interpretative problem for long-term insight-oriented therapies, especially those lasting decades, because of the enormous financial, time, and emotional investment involved.
Summary
These 26 CSTEs are a helpful springboard for examining why certain inert or harmful treatments (or treatment ingredients) may appear to be effective. Our list is only a starting point, however, because CSTEs almost surely comprise only one set of sources for incorrect inferences regarding treatment effectiveness. Other sources include the fact that clinicians are often extremely busy and are therefore forced to make rapid decisions in complex and information-rich environments. Moreover, as noted earlier, another source of erroneous inferences comprises incorrect hypotheses regarding the specific mechanisms of a treatment. We encourage additional research on other potential CSTEs, as well as on shared processes that may underpin superficially different CSTEs.
Like many rival hypotheses in psychology (Huck & Sandler, 1979), CSTEs are readily overlooked because they are nonintuitive. In addition, they are less perceptually obvious than the easily observed impact of client change and therefore are likely to recede into the causal background (Lilienfeld et al., 2008). As a consequence, some clinicians may assume erroneously that they can dispense with the research leg of evidence-based practice and replace it with informal clinical observation.
Research Methods as Safeguards Against Causes of Spurious Therapeutic Effectiveness
A key point that is not emphasized sufficiently in education in clinical psychology and allied disciplines is that systematic research designs, including both between-subject and single-subject designs, are needed to minimize CSTEs as rival hypotheses for client improvement (Lilienfeld et al., 2008; T. D. Wilson, 2011). In many respects, the existence of CSTEs offers the most potent raison d’être for evidence-based practice, although to our knowledge this crucial point has never been made explicitly (but see Lilienfeld et al., 2013; and Stewart et al., 2011, for discussions of sources of resistance toward evidence-based practice).
Specifically, without randomized controlled trials, well-controlled quasi-experimental studies, systematic single-subject designs, and other research methods as partial safeguards against CSTEs, there is no way to ascertain whether client change was due to the intervention as opposed to a wealth of extraneous factors. Randomized controlled trials are not strict “gold standards,” as they do not remove all potential sources of error (Wachtel, 2010). Nevertheless, analyses of the medical literature suggest that treatment designs based on random assignment tend to yield more replicable results than do those based on quasi-experimental or naturalistic designs (Ioannidis, 2005), probably at least in part because the former help to eliminate more CSTEs as rival explanations for improvement.
As a consequence of their superior control over CSTEs, randomized controlled trials and rigorous single-subject designs justifiably occupy the highest rungs of evidentiary certainty in the evidence-based practice hierarchy (Ghaemi, 2009). Nevertheless, designs lower in this hierarchy, such as quasi-experimental and naturalistic methods, can also play valuable roles in research inference, as they help to protect investigators against certain CSTEs (Wachtel, 2010). Moreover, such designs are often indispensable in the early phases of treatment development, as they allow researchers to collect preliminary data that can shape the development of novel interventions. In turn, these interventions, if feasibly implemented and empirically promising, can later be tested in more rigorously controlled trials.
In the next section, we sketch out how widely used methodological procedures in psychotherapy outcome research help to eliminate or minimize CSTEs. Our exposition is instructive rather than exhaustive; we focus only on the most crucial research safeguards and most crucial CSTEs (again see Table 1 for these and additional methodological safeguards against CSTEs).
Protecting against Category 1 CSTEs
Well-validated outcome indicators
Well-validated and largely objective outcome measures help to rule out all Category 1 CSTEs, because these CSTEs can engender the false appearance of change in its absence. For example, well-validated indicators of depression or anxiety help to reduce, although not eliminate, illusory placebo effects and palliative effects in controlled trials of major depression and anxiety disorders. To be effective safeguards against Category 1 CSTEs, well-validated outcome indicators should be sensitive not only to client symptoms but also to client impairment. Such indicators are also useful as protections against Category 1 CSTEs in controlled single-subject designs. In contrast, demand characteristics can be especially difficult to rule out as sources of erroneous clinical inference. Nevertheless, outcome measures that are low in reactivity (Weiss & Weisz, 1990), such as extrasession behavioral data or unobtrusive behavioral observations, are at least partial antidotes against this CSTE. Collateral reports from outside informants (e.g., friends, significant others), which can supply “social validation” (Kazdin, 1977), can be useful in ruling out the confusion of insight with improvement, retrospective rewriting of pretreatment functioning, response shift bias, the therapist’s office error, and similar CSTEs. Specifically, these reports can assist clinicians and investigators with excluding the hypothesis that client-perceived change in symptoms is (a) limited to behaviors within therapy sessions, (b) illusory, or (c) both.
Pretreatment measures
Collecting measures of client psychological status at pretreatment is especially helpful for ruling out one specific Category 1 CSTE, namely, retrospective rewriting of pretreatment functioning. Specifically, such measures can assist in excluding the hypothesis that clients are merely misremembering their initial adjustment as worse than it actually was, thereby leading to spurious inferences of improvement. If these measures do not rely exclusively on self-report ratings, they can also help to eliminate response-shift biases as explanations for apparent improvement.
Blinding of observers
Blinded observations in controlled clinical trials control partially for several additional Category 1 CSTEs, especially those stemming from confirmation bias and illusory correlation (i.e., selective attention, memory, and interpretation of client outcomes). When external evaluators are fully blinded, they cannot subtly and selectively perceive, recall, or interpret ambiguous symptom changes as a function of treatment assignment. For example, blinded observers in a randomized controlled trial of cognitive-behavior therapy versus a wait-list control for generalized anxiety disorder are less likely to differentially elicit or cherry-pick indicators of improvement (e.g., reports of less frequent worrying) in the treatment condition.
Nevertheless, these Category 1 CSTEs may be difficult to eliminate entirely. Because therapy outcome studies cannot be strictly double-blinded (i.e., clients and clinicians know who is receiving treatment), confirmation bias can still affect ratings of improvement by clients and clinicians. Moreover, even the blinding of external observers in psychotherapy trials is rarely infallible, as these evaluators can often surmise treatment assignment at above-chance levels (Carroll, Rounsaville, & Nich, 1994). Assessing potential violations of blinding by asking evaluators to guess treatment conditions and using this variable as a covariate in analyses can be a helpful safeguard against selective perception, memory, and interpretation of client change. Nevertheless, such covariate analyses may underestimate treatment differences (especially when based on guesses made at the conclusion of treatment), because above-chance guessing could stem from evaluators’ accurate observations of differential improvement across conditions (Carroll et al., 1994; Rickels, Lipman, Fisher, Park, & Uhlenhuth, 1970).
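Before any covariate analysis, a simple first check on blinding integrity is to ask whether evaluators' guesses about treatment assignment beat chance. The following is a minimal sketch, assuming two equally sized arms (so that chance accuracy is .50) and hypothetical counts; the function name `p_at_least` is illustrative and is not drawn from the studies cited above.

```python
from math import comb

def p_at_least(correct, total, chance=0.5):
    """Exact one-sided binomial probability of at least `correct` right
    guesses out of `total` if evaluators were guessing at the `chance` rate."""
    return sum(
        comb(total, k) * chance**k * (1 - chance) ** (total - k)
        for k in range(correct, total + 1)
    )

# Hypothetical data: evaluators guessed the arm correctly for 34 of 50 clients
p_value = p_at_least(34, 50)
print(f"P(at least 34 of 50 correct by chance) = {p_value:.4f}")
```

A small probability suggests that blinding was compromised, but, as noted above, it cannot by itself distinguish leaked assignment information from evaluators' accurate observation of differential improvement.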
Intent-to-treat analyses
Intent-to-treat (ITT) analyses (Hollis & Campbell, 1999) help to rule out one key Category 1 CSTE, namely, selective attrition. By examining outcomes of all participants enrolled in clinical trials, including dropouts, ITT analyses minimize erroneous inferences of improvement stemming from the fact that clients who leave treatment prematurely are often unrepresentative of those who initially enrolled (Tehrani et al., 1996). In contrast to clients who remain in treatment, those who drop out of treatment tend to be lower functioning and more psychologically disturbed (Swift & Greenberg, 2012), although in a minority of cases they comprise clients who have improved and no longer perceive themselves as requiring treatment (Baekeland & Lundwall, 1975; Tehrani et al., 1996). Because client dropout introduces selection biases, ITT analyses help to avoid misestimating, and typically overestimating, treatment effects.
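The contrast between a completers-only analysis and an ITT analysis can be made concrete with a toy data set. This is a minimal sketch with invented scores. The ITT principle itself requires only that everyone enrolled be analyzed; carrying the baseline score forward for dropouts is one simple convention for handling their missing outcomes and is an assumption of this example, not a method endorsed in the sources cited above.

```python
from statistics import fmean

# Hypothetical single-arm data: (pretreatment score, posttreatment score).
# Higher scores mean more severe symptoms; None means the client dropped out.
clients = [
    (30, 14), (28, 15), (26, 12), (32, 18),  # completers
    (35, None), (38, None), (36, None),      # dropouts, more severe at intake
]

def completer_improvement(data):
    """Mean pre-to-post improvement among clients who finished treatment."""
    return fmean(pre - post for pre, post in data if post is not None)

def itt_improvement(data):
    """Mean improvement across everyone enrolled, carrying the baseline
    score forward for dropouts (that is, assuming they did not change)."""
    return fmean(pre - (post if post is not None else pre) for pre, post in data)

print("completers only:", completer_improvement(clients))
print("intent to treat:", round(itt_improvement(clients), 2))
```

In this invented example, the completers-only estimate is noticeably larger than the ITT estimate because the more severe clients who left contribute no data to it.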
Protecting against Category 2 CSTEs
Randomization to treatment conditions
Randomization to treatment conditions helps to address the inferential errors generated by Category 2 CSTEs, which produce changes stemming from extraneous factors outside of treatment. To be clear, well-executed randomized controlled trials do not eliminate Category 2 CSTEs, which still arise in these investigations and can deceive observers in the absence of randomized control groups. Nevertheless, the randomization process helps to exclude Category 2 CSTEs as rival explanations for therapeutic effectiveness, because these CSTEs are equally likely in sizeable experimental and control groups. Given the law of large numbers, these CSTEs should no longer account for between-group differences in randomized controlled trials provided that clinical trials are adequately powered (Hsu, 1989). For example, in a randomized controlled trial, spontaneous remission, history, regression to the mean, maturation, and multiple treatment interference occur frequently among individuals assigned randomly to both treatment and no-treatment (or alternative treatment) conditions. Nevertheless, proper randomization ensures that these CSTEs tend to be equalized across the active treatment and comparison arms.
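The logic of randomization can be sketched in a short simulation. Everything here is hypothetical: the treatment is inert by construction, each simulated client has a personal chance of spontaneous remission drawn from an arbitrary range, and the arm size is chosen only to make the law of large numbers visible.

```python
import random

def run_inert_trial(n_per_arm=5000, seed=7):
    """Randomize clients to an inert 'treatment' or to a control condition."""
    rng = random.Random(seed)
    n_total = 2 * n_per_arm
    # Each client has a personal chance of remitting without any treatment
    remission_chance = [rng.uniform(0.1, 0.7) for _ in range(n_total)]
    # Random assignment: shuffle client indices; the first half is "treated"
    order = list(range(n_total))
    rng.shuffle(order)
    arms = {"treatment": order[:n_per_arm], "control": order[n_per_arm:]}
    # The treatment is inert, so outcomes depend only on remission_chance
    return {
        arm: sum(rng.random() < remission_chance[i] for i in members) / n_per_arm
        for arm, members in arms.items()
    }

rates = run_inert_trial()
print(rates)
print("between-group difference:", round(rates["treatment"] - rates["control"], 3))
```

Both arms show substantial improvement, which an observer of the treated arm alone could misread as a treatment effect, yet the between-group difference stays near zero. With much smaller arms the difference fluctuates more widely, which is why adequate power matters.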
Repeated measurements
In both between-subject and controlled single-subject experiments, repeated measurements across the course of treatment can help to rule out history and other extratherapeutic influences as sources of improvement in therapy (Laurenceau, Hayes, & Feldman, 2007). If one observes changes in treatment at multiple time points rather than at only one time point following an extratherapeutic event (e.g., initiation of a romantic relationship), the likelihood that such events—rather than the therapeutic intervention—are contributing to improvement is minimized (such observations are also useful for ruling out novelty effects, a Category 3 CSTE). In the context of single-subject designs, multiple baseline designs—especially those in which the intervention is applied to different behaviors in a temporal sequence—can help to rule out history and other extratherapeutic factors as rival explanations for change during treatment (Engel & Schutt, 2012; Nock, Michel, & Photos, 2007). If one consistently observes change in different behaviors at different time points, the likelihood that extratherapeutic factors account for the improvement is minimized. Finally, long-term follow-up measurements can be helpful in excluding CSTEs arising from the cyclical and self-limiting nature of certain disorders, as such assessments can ensure that improvements in signs and symptoms are not transient.
Minimizing and estimating measurement error
The use of pre- and posttreatment indicators with high reliability will minimize regression to the mean, as this statistical phenomenon is most pronounced when measures contain substantial amounts of nonsystematic (random) measurement error. Particularly in quasi-experimental treatment studies, investigators should be circumspect in their use of extreme-groups designs (in which participants are selected on the basis of very high pretreatment scores), as such designs are especially likely to yield high levels of regression effects. Researchers can also estimate, and control statistically for, regression effects in treatment outcome studies (see Barnett, van der Pols, & Dobson, 2005, for a discussion).
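As a rough illustration of how a regression effect might be estimated, the standard linear prediction from classical test theory gives the expected retest score, absent any true change, as the population mean plus the test-retest correlation times the pretest deviation from that mean. The sketch below applies that formula; it assumes equal pretest and posttest variances, uses invented numbers, and is a simplification rather than the specific procedure described in the source cited above. The function names are illustrative.

```python
def expected_retest(pre_score, population_mean, reliability):
    """Expected retest score if nothing truly changed, given the
    test-retest correlation (`reliability`) of the measure."""
    return population_mean + reliability * (pre_score - population_mean)

def regression_adjusted_change(pre_score, post_score, population_mean, reliability):
    """Observed improvement minus the part expected from regression alone.
    Assumes higher scores indicate more severe symptoms."""
    observed = pre_score - post_score
    expected_from_regression = pre_score - expected_retest(
        pre_score, population_mean, reliability
    )
    return observed - expected_from_regression

# Hypothetical client: pretest 30, posttest 20, population mean 12, reliability .80
print(round(expected_retest(30, 12, 0.8), 1))
print(round(regression_adjusted_change(30, 20, 12, 0.8), 1))
```

The lower the reliability, the larger the share of observed improvement that regression alone would be expected to produce for a client who entered treatment with an extreme score.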
Protecting against Category 3 CSTEs
Common factor control groups
Systematic controls for common therapeutic factors help to control for Category 3 CSTEs, which involve the misattribution of client change to specific therapeutic ingredients when common factors (e.g., expectancies for improvement) are actually operative. For example, as observed earlier, attention-placebo control groups (Paul, 1966) can minimize expectancies for change as an explanation for group differences in outcome. Nevertheless, attention-placebo control groups are unlikely to eliminate entirely the threats posed by Category 3 CSTEs, because even well-constructed common factor control conditions are rarely as plausible as active treatment conditions (Baskin et al., 2003; Boot et al., 2013; O’Leary & Borkovec, 1978). Given the difficulty and perhaps impossibility of equating common factor control conditions with active treatment conditions on expectancies for change, hope, and treatment credibility, researchers and clinicians should ideally measure these factors at different points in treatment (in the case of novelty effects, at the outset of treatment). These variables can be treated as covariates in analyses, again bearing in mind that such statistical controls can underestimate treatment effects if expectancies and treatment credibility in part contribute to treatment efficacy.
Inclusion of measures of proposed mediators
The demonstration that a proposed mediator of treatment outcome accounts statistically for client improvement supports, although does not prove, the contention that this mediator is the underlying mechanism of change (see Kazdin & Nock, 2003, for conditions in which mediation offers especially compelling evidence for change mechanisms). In this regard, mediational tests can be helpful for excluding Category 3 CSTEs. Specifically, converging findings that a given psychotherapy appears to operate via a hypothesized mediator that is largely specific to that intervention (e.g., changes in maladaptive cognitions, cognitive defusion, increase in social reinforcement) minimize the likelihood that this intervention is operating exclusively via common mechanisms shared by most or all treatments, such as placebo effects.
Summary
Methodological techniques in psychotherapy outcome research help to control for CSTEs, and certain methods are especially suited for excluding different CSTEs. The need to minimize CSTEs using between-subject and single-subject research designs offers the most compelling rationale for evidence-based practice. Our discussion also points to important gaps in methodology for attenuating the influence of CSTEs as well as fruitful directions for future research. As is evident from our analysis, Category 3 CSTEs are especially difficult to eliminate as erroneous sources of improvement, because equating active treatment and attention-placebo control conditions on expectancies and treatment credibility is often difficult or impossible. Hence, one key direction for future psychotherapy research will be the development of placebo conditions that are closely matched to treatment conditions on credibility (see Boot et al., 2013). Moreover, because psychotherapy trials cannot be conducted in a genuinely double-blind fashion, certain Category 1 CSTEs, especially selective attention, memory, and interpretation of client outcomes, are difficult to eradicate, particularly for client and clinician reports of improvement. The development of largely objective measures that are less susceptible to these and other observer biases is therefore an important direction for future psychotherapy outcome research. In the case of all CSTE categories, researchers are well advised to heed the methodological maxim that if one cannot remove a source of error, one should attempt to measure it. For example, by systematically assessing expectancies during treatment, investigators can strive to rule out rival hypotheses concerning client improvement and thereby draw more valid inferences regarding treatment effects.
Conclusions and Future Directions
The oft-lamented gap between science and practice in clinical psychology is in large measure a clash of epistemologies (McHugh, 1994). In particular, this schism reflects deep-seated differences of opinion regarding the place of controlled research versus intuition in clinical decision making (Lilienfeld et al., 2013; Tavris, 2003). Our central thesis is that the science–practice gap and the accompanying reluctance of some psychologists—clinicians and researchers alike—to adopt evidence-based practices rarely reflect a willful disregard of evidence per se. Instead, this reluctance stems largely from an erroneous belief that the evidence supplied by informal clinical observations of client change tends to be as trustworthy as the evidence supplied by the methodological safeguards comprising the research prong of evidence-based practice (Spring, 2007). When viewed through this lens, the science–practice gap is not fundamentally a disagreement about whether evidence is important in ascertaining therapeutic effectiveness: It is a difference of opinion about which kinds of evidence should be accorded priority in clinical decision making.
Implications for the role of intuition in clinical decision making
Clinical intuitions and informal observations play invaluable roles in psychotherapy, especially in hypothesis generation (Chambless, in press). For example, the spark that ignited Aaron Beck’s seminal theorizing regarding cognitive-behavioral therapy originated from his observations of a client who seemed anxious during sessions. After Beck, who was trained psychoanalytically, suggested to her that her anxiety reflected discomfort with unconscious sexual impulses, she replied politely that she felt nervous because she was concerned she was boring him. This experience inspired Beck to explore his clients’ unstated thoughts and assumptions, culminating in the development of what he initially termed cognitive therapy (Smith, 2009). Moreover, clinical impressions of change during treatment are sometimes accurate and should be regarded as fallible but potentially informative signposts to be corroborated by more systematic evidence. At the same time, our analysis is a reminder that clinical observations are often poorly suited to detecting and evaluating the sources of improvement in treatment. The evidence we have reviewed demonstrates that (a) throughout history, ineffective and harmful mental health treatments have routinely been perceived as effective; (b) psychotherapists frequently and substantially overestimate the rates of positive outcomes in their clients (Hannan et al., 2005); and (c) many sources can contribute to the erroneous impression of therapeutic effectiveness in its absence.
One potential response to our arguments is that CSTEs are less of an impediment for highly experienced psychotherapists, who can gradually learn to distinguish accurate from inaccurate inferences of treatment effectiveness. Nevertheless, research across multiple domains reveals that the conditions for the acquisition of intuitive expertise are highly constrained. Intuitive expertise tends to emerge only in “high-validity environments”—those in which feedback is relatively objective, consistent, and immediate (Dawes, 1994; Kahneman & Klein, 2009; Tracey, Wampold, Lichtenberg, & Goodyear, 2014). None of these conditions apply to typical psychotherapy, a “low-validity environment” in which feedback to clinicians is often ambiguous (e.g., detecting whether a client is less anxious than in the previous session can be challenging, and detecting whether such change is due to the intervention itself is even more so), inconsistent (e.g., a client may appear improved in one session but not in the succeeding session), and delayed (e.g., clinicians may need to wait weeks or months before discovering whether a client improved following an intervention). Hence, the literature on expertise effects provides scant reason to expect the accuracy of intuitions concerning therapeutic effectiveness to improve with experience. Research on the relation between the amount of therapeutic experience and accuracy of clinical judgments offers few (Spengler et al., 2009) or virtually no (Garb, 1998, 2005) additional grounds for optimism.
Implications for prioritizing sources of evidence
These points bear noteworthy implications for the weighting of the three legs of evidence-based practice: research evidence, clinical expertise, and client preferences and values (Spring, 2007). Although some authors contend or imply that these three prongs should be accorded approximately equal weight in clinical decision making (e.g., American Psychological Association Task Force on Evidence-Based Practice, 2006), our analysis suggests that this ecumenical approach may be misguided. As we have seen, controlled research on treatment efficacy is better suited than unguided clinical judgment to ruling out manifold rival explanations for improvement, a finding that accords with the superior replicability of medical findings derived from randomized controlled trials compared with less rigorously controlled trials (Ioannidis, 2005). Hence, when well-replicated treatment outcome data conflict with clinical impressions of improvement, we should generally default to the former (Baker et al., 2008).
Implications for everyday clinical practice
Our arguments point to useful suggestions for everyday clinical practice as well (see Table 1). Because of Category 1 CSTEs, clinicians can be led to conclude that client change has occurred when it has not. One underutilized corrective to this problem is the periodic administration of outcome measures, such as the Outcome Questionnaire-45 (Lambert, Lunnen, Umphress, Hansen, & Burlingame, 1994), throughout treatment. These measures can alert clinicians to instances in which they may be erroneously perceiving improvement in its absence or overlooking deterioration. Some Category 1 CSTEs, especially those stemming from confirmation bias on the part of both clinician and client, can be minimized by collecting systematic data from outside informants. In some cases, as in the treatment of anxiety disorders, these CSTEs can also be minimized by collecting psychophysiological data (e.g., autonomic responsivity to anxiety-provoking events) over the course of treatment.
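One way to make periodic outcome scores less vulnerable to impressionistic reading is to ask whether an observed change exceeds what measurement error alone could produce. The sketch below uses the widely taught reliable change index (usually attributed to Jacobson and Truax), which is not discussed in this article and is offered here only as an illustration. The normative standard deviation, the reliability coefficient, and the client scores are hypothetical placeholders, not published values for the Outcome Questionnaire-45 or any other instrument.

```python
import math


def reliable_change_index(pre, post, sd_norm, reliability):
    """Change score divided by the standard error of a difference score."""
    sem = sd_norm * math.sqrt(1.0 - reliability)  # standard error of measurement
    se_diff = math.sqrt(2.0) * sem                # error in a pre-post difference
    return (post - pre) / se_diff


def classify(rci, lower_is_better=True, cutoff=1.96):
    """Label a change as reliable only if it exceeds the cutoff in either direction."""
    signed = -rci if lower_is_better else rci
    if signed >= cutoff:
        return "reliable improvement"
    if signed <= -cutoff:
        return "reliable deterioration"
    return "no reliable change"


# Hypothetical instrument properties (assumptions, not published norms).
SD_NORM = 20.0
RELIABILITY = 0.84

# Hypothetical clients on a symptom scale where lower scores are better.
small_drop = classify(reliable_change_index(80, 74, SD_NORM, RELIABILITY))
large_drop = classify(reliable_change_index(80, 50, SD_NORM, RELIABILITY))
large_rise = classify(reliable_change_index(80, 105, SD_NORM, RELIABILITY))

print(small_drop, large_drop, large_rise, sep="\n")
```

A 6-point drop that a clinician might read as progress falls within measurement error under these assumed values, whereas the larger shifts do not. Even a reliable change says nothing about its cause, so Category 2 and 3 CSTEs remain in play.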
Although many Category 2 and 3 CSTEs are difficult to rule out in everyday clinical practice, clinicians can often enhance the accuracy of their inferences of treatment effectiveness by inquiring about these CSTEs systematically. For example, by monitoring clients’ use of informal adjunctive “interventions,” such as exercise, herbal remedies, and confiding in valued friends, clinicians can become more cognizant of multiple treatment interference as a potential explanation for changes in their clients’ clinical status.
Our analysis also reminds clinicians to be attuned to the possibility that some client characteristics may serve as moderators of certain CSTEs, thereby affecting their likelihood. For example, as noted earlier, people with depression may be especially prone to symptom overreporting and hence to the false appearance of improvement on adjunctive symptoms following treatment (Morgado et al., 1991). Similarly, because individuals with high levels of negative emotionality, especially trait anxiety, are prone to attend selectively to psychological symptoms (Suls & Howren, 2012), clinicians should be alert to the possibility that declines in negative emotionality over the course of treatment could predispose them to spurious inferences of declines in other psychological symptoms. In addition, although efforts to identify a “placebo-prone personality” have met with mixed success, some evidence raises the possibility that optimists are more likely than pessimists to respond to positive expectancies (Geers, Helfer, Kosbab, Weiland, & Landry, 2005) and hence may be especially prone to engendering certain CSTEs, especially placebo and novelty effects.
Implications for clinical psychology education and training
Our analysis implies that CSTEs and the research safeguards against them that we have delineated should be emphasized in the education and training of all would-be psychologists and other mental health professionals, as well as in the continuing education of current mental health professionals. Although we are unaware of survey data on how often CSTEs are discussed in graduate courses in the mental health professions, there is reason to believe that such coverage is often minimal. Most standard psychotherapy handbooks (e.g., Corsini & Wedding, 2010; Koocher, Norcross, & Hill, 2005; Meyer & Deitsch, 1996) accord scant attention to the overarching problem of CSTEs or to specific CSTEs themselves, such as placebo effects, spontaneous remission, and regression to the mean. Moreover, to our knowledge, no continuing education course approved by the American Psychological Association has ever focused on CSTEs.
In our view, the inferential problems posed by CSTEs, and, equally important, the ways in which research safeguards compensate for them, should be mandatory components of training for all mental health professionals. Exposing students to the long and sordid history of failed but widely espoused treatments in psychology and psychiatry may be especially helpful as a didactic device. In addition, a thoughtful consideration of CSTEs should be integrated routinely into clinical supervision and case presentations. For example, when reviewing client improvement over the course of treatment, supervisors should encourage trainees to carefully consider (a) rival explanations for such improvement other than, or in addition to, the intervention itself, (b) cognitive biases that may lead to false inferences in this regard, and (c) safeguards against drawing erroneous inferences concerning the existence and sources of improvement. Nevertheless, because it is not known whether instruction in CSTEs enhances therapeutic outcomes, we call for research on this question. The absence of such evidence notwithstanding, education regarding CSTEs may diminish resistance to evidence-based practice among future and current clinicians, as such knowledge provides a persuasive rationale for reliance on research designs to gauge therapeutic efficacy and effectiveness (Lilienfeld et al., 2013).
Our arguments also bear implications for training models in clinical psychology. For example, the local clinical scientist model (Stricker & Trierweiler, 1995), adopted by many or most scholar-professional (Psy.D.) programs, encourages clinicians to operate as scientists within the miniature laboratory of the clinical setting, carefully observing client behaviors in response to interventions, generating hypotheses about them, and testing these hypotheses with additional interventions. In principle, these are laudable goals. Nevertheless, CSTEs raise largely unappreciated challenges for the implementation of the local clinical scientist model, because they render it difficult to draw reasonably clear-cut conclusions regarding treatment effectiveness for individual clients. Hence, although local (“idiographic”) clinical science certainly has its merits, it cannot substitute for nomothetic clinical science derived from randomized controlled trials, single-subject designs, and other systematic research methods.
At the same time, idiographic clinical science is hardly a dead end, so clinicians need not despair. Although clinicians operating in the context of individual clients cannot exclude many alternative explanations for client improvement, especially Category 3 CSTEs, they can nonetheless evaluate client change through the prism of CSTEs. In this way, they can become more alert to alternative explanations for change. For example, as noted earlier, clinicians can monitor client change systematically across sessions, thereby permitting them to minimize illusory placebo effects; they can solicit information from informants regarding clients’ out-of-session behaviors to minimize the therapist’s office error; they can attend diligently to all relevant client outcomes to avoid inadvertent cherry-picking of signs and symptoms; they can be alert to the fact that client improvements over time may reflect regression to the mean, history, and other artifacts; they can attend to potential client characteristics that may moderate the likelihood of CSTEs; and so on. In this respect, they can adopt a scientific mind-set while bearing in mind that various sources of inferential error cannot be completely eliminated. Hence, the local clinical scientist model, although not an adequate substitute for scientist-practitioner or clinical science models of training (see Baker et al., 2008), is a helpful reminder that clinicians should continually operate as “detectives” who strive to identify potential rival sources of improvement and who (a) minimize these sources when they can and (b) bear them in mind as inferential constraints when they cannot.
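Regression to the mean, one of the artifacts listed above, is easy to underestimate until it is seen in numbers. The following simulation is a minimal sketch under assumed values: every simulated client has a stable "true" severity, each assessment adds independent measurement noise, and no treatment effect of any kind is applied. The scale, the noise level, and the intake cutoff are all hypothetical.

```python
import random
import statistics as st

rng = random.Random(2014)
N_CLIENTS = 20000

# Stable underlying severity plus independent noise at each assessment.
true_severity = [rng.gauss(50, 10) for _ in range(N_CLIENTS)]
intake = [t + rng.gauss(0, 10) for t in true_severity]
followup = [t + rng.gauss(0, 10) for t in true_severity]  # no treatment effect

# People tend to enter treatment when symptoms are at their worst,
# modeled here as selecting only clients at or above an intake cutoff.
CUTOFF = 65
selected = [(i, f) for i, f in zip(intake, followup) if i >= CUTOFF]

intake_mean = st.fmean([i for i, _ in selected])
followup_mean = st.fmean([f for _, f in selected])
apparent_improvement = intake_mean - followup_mean

print(f"clients selected:     {len(selected)}")
print(f"mean intake score:    {intake_mean:.1f}")
print(f"mean follow-up score: {followup_mean:.1f}")
print(f"apparent improvement: {apparent_improvement:.1f}")
```

Because clients were selected for extreme intake scores, their follow-up scores drift back toward the population mean even though nothing was done to them. A clinician who sees only the selected clients would observe an average "improvement" that is entirely artifactual.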
Limitations of our analysis
Our analysis is limited in at least three respects. First, our review leaves unresolved the question of how often each of the 26 CSTEs we have identified contributes to erroneous inferences in actual clinical practice. As in many domains of psychology, one must distinguish “can” from “does” in discussions of causality (McCall, 1977). The fact that a CSTE can lead to incorrect inferences of therapeutic effectiveness does not tell us how often it does so. Research examining therapists’ knowledge and understanding of CSTEs, both in the abstract and in real-world practice, would be a useful starting point in addressing this question.
Second, we have focused only on inferential errors that apply to everyday clinical practice. We have not examined the many methodological decisions that can generate spurious inferences of treatment effectiveness in research studies of all kinds. For example, the file-drawer effect (Rosenthal, 1979), which is the bias against submitting negative results for publication, and outcome reporting bias (Chan & Altman, 2005), which is the propensity to cherry-pick data on dependent measures that yield positive results, can lead to overestimates of treatment efficacy. Recent data also raise the possibility of a disconcertingly high prevalence of “p-hacking,” that is, analyzing data, or peeking repeatedly at already collected data, until p values fall just below .05 (Masicampo & Lalande, 2012). More directly relevant to clinical practice, some researchers also contend that the use of wait-list control groups contributes to overestimates of psychotherapy efficacy, because clients in these groups may deteriorate as they await treatment, experience “resentful demoralization” (Cook & Campbell, 1979) as a consequence of not receiving treatment afforded to other individuals, or both. Nevertheless, the evidence for this assertion is mixed (e.g., S. A. Elliott & Brown, 2002). Still other authors argue that “treatment as usual” conditions, which often serve as control groups in psychotherapy outcome designs, are best conceived of as “intent to fail” conditions, defined (perhaps tendentiously) as “pseudotreatments designed specifically as control groups to prove the superiority of the investigator’s preferred treatment and that have no theoretical rationale or are delivered by graduate students who know they are administering treatment that is not supposed to work” (Westen & Bradley, 2005, p. 267). If these critics are correct, some standard psychotherapy outcome designs may overestimate the efficacy of beneficial treatments or generate the mistaken conclusion that inefficacious treatments are efficacious.
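The cost of one form of p-hacking mentioned above, peeking repeatedly at accumulating data, can be shown with a short simulation. This is a sketch under simplifying assumptions rather than a model of any actual trial: the null hypothesis is true in every simulated study, the outcome has a known standard deviation of 1 (so a z test applies), and the number of looks and the sample sizes are arbitrary choices.

```python
import random
from statistics import NormalDist

ALPHA = 0.05


def z_test_p(sample):
    """Two-sided p value for H0: mean = 0, assuming a known SD of 1."""
    n = len(sample)
    z = (sum(sample) / n) * n ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


def run_study(rng, max_n=100, look_every=10, peek=True):
    """Return True if the study ends with a 'significant' result.

    With peek=True the analyst tests after every `look_every`
    observations and stops at the first p < ALPHA. With peek=False
    the analyst tests once, at the planned sample size.
    """
    data = []
    for _ in range(max_n):
        data.append(rng.gauss(0, 1))  # the true effect is zero
        if peek and len(data) % look_every == 0 and z_test_p(data) < ALPHA:
            return True
    return z_test_p(data) < ALPHA


rng = random.Random(0)
N_SIMS = 2000
peeking_rate = sum(run_study(rng, peek=True) for _ in range(N_SIMS)) / N_SIMS
fixed_rate = sum(run_study(rng, peek=False) for _ in range(N_SIMS)) / N_SIMS

print(f"false-positive rate, single planned test: {fixed_rate:.3f}")
print(f"false-positive rate, ten interim peeks:   {peeking_rate:.3f}")
```

With a single planned test, the false-positive rate stays near the nominal 5%. Stopping at the first significant interim look inflates it well beyond that, even though no individual test was computed incorrectly.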
Third, some readers might contend that our core arguments are rendered effectively moot by the Dodo Bird verdict, named after the Dodo Bird in Lewis Carroll’s Alice’s Adventures in Wonderland, who proclaimed following a race that “Everybody has won, and all must have prizes.” This verdict posits that all psychotherapies are (a) effective and (b) equivalent in their effectiveness, both overall and for all disorders (Luborsky, Singer, & Luborsky, 1975; Shedler, 2010; Wampold et al., 1997). If the Dodo Bird verdict is correct, the reasoning continues, CSTEs are of little or no concern because all treatments work, and work equally well (see Stewart et al., 2011).
Nevertheless, the Dodo Bird verdict has historically referred to a rough equivalence in the effectiveness of different schools of therapy (e.g., psychodynamic, cognitive-behavioral) rather than to a precise equivalence in the efficacy of all specific treatments (e.g., Smith, Glass, & Miller, 1980). Moreover, the assertion that all therapies are of equal efficacy, either overall (a main effects hypothesis) or for all conditions (an interactional hypothesis), is difficult to sustain (Lilienfeld, 2014; but see Wampold, 2001, for a more sanguine perspective on the Dodo Bird verdict). For example, well-replicated data indicate that exposure-based therapies are more efficacious than other treatments for at least some anxiety-related disorders (e.g., obsessive-compulsive disorder) and that behavioral therapies are more efficacious than nonbehavioral therapies for child and adolescent behavioral problems (Chambless & Ollendick, 2001; Hunsley & Di Giulio, 2002). A meta-analysis by Tolin (2010) similarly revealed that behavioral and cognitive-behavioral therapies are more efficacious than other therapies for anxiety and mood disorders. Further calling into question the Dodo Bird verdict are findings that at least some interventions, such as CISD, are at best ineffective and perhaps harmful (Lilienfeld, 2007; McNally et al., 2003). Even Bruce Wampold, a prominent proponent of the Dodo Bird verdict, acknowledges that this conclusion applies only to “bona-fide” therapies, namely, those based on sound psychological principles, delivered by well-trained psychotherapists, and laid out explicitly in manuals or other publications (Wampold & DeFife, 2010).
Furthermore, our discussion of CSTEs is relevant not only to “schools” of psychotherapy but also to specific therapeutic techniques, many of which transcend diverse treatment modalities. That is, many CSTEs can predispose observers to false inferences regarding the effectiveness of specific techniques delivered within a therapy session, such as an interpretation of a client’s statement, a piece of advice imparted to a client, or a role-play exercise between client and clinician. Hence, even setting aside the contested Dodo Bird verdict, the inferential problems posed by CSTEs in the psychotherapy context remain.
Closing thoughts
The challenges posed by CSTEs are not grounds for pessimism, let alone nihilism, among clinical scientists in practice or research settings, as the inferential mistakes associated with them are to some extent surmountable. Nevertheless, CSTEs underscore the pressing need to inculcate humility in clinicians, researchers, and students (McFall, 1991). We are all prone to neglecting CSTEs, not because of a lack of intelligence but because of inherent limitations in human information processing (Kahneman, 2011). As a consequence, all mental health professionals and consumers should be skeptical of confident proclamations of treatment breakthroughs in the absence of rigorous outcome data (Dawes, 1994; Lilienfeld et al., 2003). CSTEs are potent reminders that although our intuitions are at times accurate, they can be misleading. When evaluating treatment effectiveness, our intuitions may fail to account for numerous rival hypotheses for change that are difficult or impossible to detect without the aid of finely honed research safeguards. As a consequence, CSTEs highlight the inherent limits of our knowledge as applied to the individual client and should impel us to be mindful of our propensities toward overconfidence. Science, which is a systematic approach to reducing uncertainty in our inferences (McFall & Treat, 1999; O’Donohue & Lilienfeld, 2007), is ultimately our best prescription against being deceived by inadequate evidence.
Acknowledgements
The authors thank Sean Carey and Ben Johnson for their valuable assistance with compiling references.
Declaration of Conflicting Interests
The authors declared that they had no conflicts of interest with respect to their authorship or the publication of this article.
