Abstract
Overlap-based measures are increasingly applied in the synthesis of single-subject research. This article considers two criticisms of overlap-based metrics, specifically that they do not measure magnitude of effect and do not adequately correspond with visual analysis. It is argued that these criticisms are based on fundamental misconceptions regarding the nature of effect sizes and their appropriate interpretation in single-subject research. Suggestions for considerations in evaluating single-subject research studies are offered, including the need to separately consider experimental control and magnitude of effect.
The meta-analysis of traditional group studies has broad acceptance, and increasingly sophisticated statistical methods have been developed (Borenstein, Hedges, Higgins, & Rothstein, 2009). This approach has formed the methodological cornerstone of the movement toward evidence-based medicine (see Higgins & Green, 2009) and has been widely used in special education (Kavale & Forness, 1999) and education more generally (Hattie, 2009). Nevertheless, much research in special education relies on single-subject research designs. Although methodology for the quantitative synthesis of such research dates back more than 20 years and some procedures, such as percentage of nonoverlapping data (PND; Scruggs, Mastropieri, & Casto, 1987b), are in reasonably wide use, there remains considerable controversy over their application (e.g., Haardorfer, 2010; Kratochwill et al., 2010; Wolery, Busick, Reichow, & Barton, 2010). Quantitative summaries of single-subject research offer several possible advantages over traditional narrative reviews, including identification of effects that may be too subtle to detect in a single study, estimation of the robustness of an intervention, reduction of subjectivity in summarizing research, and examination of moderator variables (Critchfield, Newland, & Kollins, 2000).
An effect size is “a value that reflects the magnitude of the treatment effect or (more generally) the strength of a relationship between two variables” (Borenstein et al., 2009, p. 3). Central to the process of meta-analysis is the calculation of a standardized effect, which can be used to provide a measurement-independent metric to estimate a summary effect, evaluate consistency across studies, and allow examination of moderator variables that may affect outcomes (e.g., participant age or intervention intensity). Several approaches to the calculation of effect size in single-subject studies have been proposed.
Perhaps the most obvious approach is to use a corollary of the standardized mean difference metrics that are often used in analyzing continuous variables in traditional group designs (see Busk & Serlin, 1992). Within group comparison designs, this family of metrics typically standardizes the difference between the means of two groups by a measure of dispersion, often the pooled standard deviation across both groups. There are, however, a number of significant problems with this approach when applied to single-subject research. First, in traditional group designs, differences are standardized by the between-individual variation, whereas in single-subject research, differences are standardized by the within-individual variation. It is reasonable to expect that within-individual variation would typically be more constrained than variation across individuals, resulting in relatively inflated effect sizes (Leong & Carter, 2008). That is, metrics comparing groups at one point in time will not have the same scale as those describing change in one individual over time (Beretvas & Chung, 2008). In addition, where change is standardized by baseline data variation, problems arise with zero baselines (division by zero) and baselines with very limited variability. Importantly, these measures are based on assumptions such as independence and normality of distribution, which are often violated in single-subject data series (Parker, Vannest, & Brown, 2009). Finally, means often fail to adequately characterize data sets in single-subject research (Parker, Vannest, & Davis, 2011).
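The difficulties with zero and low-variability baselines can be made concrete with a minimal sketch. The function name baseline_smd and all data values are hypothetical, and standardizing by the baseline sample standard deviation is assumed here as only one of several possible variants of this family of metrics.

```python
from statistics import mean, stdev


def baseline_smd(baseline, treatment):
    """Difference in phase means divided by the baseline standard deviation.

    Returns None when the baseline shows no variability, because the
    ratio is then undefined (division by zero).
    """
    sd = stdev(baseline)  # sample SD; requires at least two baseline points
    if sd == 0:
        return None
    return (mean(treatment) - mean(baseline)) / sd


# Hypothetical data: one treatment series compared against three baselines.
treatment = [6, 7, 8, 7]

flat = baseline_smd([0, 0, 0, 0], treatment)      # zero baseline: undefined
stable = baseline_smd([2, 2, 2, 3], treatment)    # very limited variability
variable = baseline_smd([0, 4, 1, 4], treatment)  # same mean, more variability
```

In this illustration, the stable and variable baselines share the same mean, so the raw change in level is identical. The standardized value is nevertheless about four times larger for the stable baseline (9.5 compared with roughly 2.3), and it cannot be computed at all for the zero baseline.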
Another approach to meta-analysis of single-subject research has been to use regression-based procedures (e.g., Allison & Gorman, 1993; Center, Skiba, & Casey, 1985), but there are problems with such methodologies. One particular issue is that single-subject baseline data series are short on average, and regression estimates that are based on limited numbers of data points are highly suspect (Allison & Gorman, 1993). The fundamental obstacle to these approaches, however, is their reliance on assumptions, most specifically that of independence of data (Wolery et al., 2010). Regression-based approaches have not been extensively used in single-subject meta-analysis.
One approach that potentially overcomes the difficulties associated with assumptions underlying parametric tests involves overlap-based measures. The progenitor of these approaches is PND, first described by Scruggs et al. (1987b), but a number of other metrics have been proposed such as percentage exceeding the median (PEM; Ma, 2009), improvement rate difference (IRD; Parker et al., 2009), and percentage of all nonoverlapping data (PAND; Parker, Hagan-Burke, & Vannest, 2006). In essence, these measures dichotomize data and examine the degree of overlap between baseline and intervention phases. For example, the PND measure counts the percentage of data points in the treatment phase that exceed the most extreme baseline point in the expected direction of change. PND is the most widely used metric (Beretvas & Chung, 2008) although applications of more recently developed measures such as PEM (Ma, 2009; Preston & Carter, 2009), IRD (Ganz et al., 2011; Reynhout & Carter, 2011), and PAND (Schneider, Goldstein, & Parker, 2008) are appearing in the literature. Overlap measures have the additional advantages that they are relatively simple to calculate (Parker, Vannest, & Davis, 2011) and are reasonably conceptually transparent.
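The simplicity of these calculations can be illustrated with a minimal sketch of PND and PEM as described above. The function names, the increase parameter, and the data are hypothetical. Treating points that tie with the criterion value as overlapping is an assumption of this sketch.

```python
from statistics import median


def pnd(baseline, treatment, increase=True):
    """Percentage of treatment points beyond the most extreme baseline point."""
    if increase:
        threshold = max(baseline)
        beyond = [x for x in treatment if x > threshold]
    else:
        threshold = min(baseline)
        beyond = [x for x in treatment if x < threshold]
    return 100 * len(beyond) / len(treatment)


def pem(baseline, treatment, increase=True):
    """Percentage of treatment points beyond the baseline median."""
    mid = median(baseline)
    if increase:
        beyond = [x for x in treatment if x > mid]
    else:
        beyond = [x for x in treatment if x < mid]
    return 100 * len(beyond) / len(treatment)


# Hypothetical AB data series where an increase is the expected change.
baseline = [2, 3, 5, 3, 2]
treatment = [4, 6, 7, 5, 8, 9]

pnd_value = pnd(baseline, treatment)  # 4 of 6 points exceed the baseline maximum of 5
pem_value = pem(baseline, treatment)  # all 6 points exceed the baseline median of 3
```

For these data, PND is approximately 66.7% and PEM is 100%. The two metrics dichotomize the same data against different criteria and can therefore yield different values.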
Despite these advantages, overlap measures have been subject to criticism on a number of grounds (Haardorfer, 2010; Salzberg, Strain, & Baer, 1987; White, 1987) with Wolery et al. (2010) recently arguing that these measures should be abandoned. Two particular criticisms are that these types of metrics cannot be considered effect sizes as they fail to measure magnitude (Haardorfer, 2010; Wolery et al., 2010) and that they do not correspond sufficiently with visual judgments to offer a useful approach to synthesizing single-subject research (Wolery et al., 2010). These criticisms will now be examined and it will be argued that they are based on fundamental misconceptions regarding the nature of effect sizes and their appropriate interpretation in single-subject research. Some general suggestions for an approach to the evaluation of single-subject research will then be offered.
Measurement of Magnitude
Wolery et al. (2010) argued that overlap-based measures “are not an estimate of the magnitude of the effects between conditions” (p. 24). This claim is echoed by Haardorfer (2010) in reference to the PND statistic in noting that “it is not an effect size measure; it does not measure the magnitude of an effect” (p. 127). Furthermore, Haardorfer pointed out that Scruggs, Mastropieri, and Casto (1987a) explicitly stated that they viewed PND as a measure of convincingness of effect, rather than magnitude. While Scruggs et al. may not have considered PND an effect size metric, many others certainly have regarded it as such (e.g., Beretvas & Chung, 2008; Parker, Vannest, & Davis, 2011).
The argument forwarded by Wolery et al. (2010) and Haardorfer (2010) appears to be based on a circumscribed view of the terms magnitude and effect size. As previously noted, effect sizes are by definition values that reflect the magnitude of a relationship between two variables. There is a wide variety of different measures of effect size. In traditional group meta-analysis, a variety of effect sizes may be used, including unstandardized differences and standardized mean differences (e.g., Cohen’s d, Hedges’ g) for continuous data. In the latter case, magnitude is reflected in the standardized difference between group means. Correlational metrics may also be used (Lipsey & Wilson, 2001), reflecting the magnitude of association between paired data. Furthermore, when dichotomous data are examined, measures such as the odds ratio, risk difference, and intervention rate difference may be used (Borenstein et al., 2009). In the example of the odds ratio, the magnitude of difference between groups is reflected in the ratio of the odds of an event, such as passing a test, occurring in one group to the odds of that event occurring in a second group. Not only are all these measures considered effect sizes that reflect magnitude of difference but formulas are also available to allow conversion between metrics when studies use different types of endpoints (e.g., dichotomous and continuous; Borenstein et al., 2009).
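The way in which dichotomous metrics express magnitude can be shown with a small worked sketch. The function names and the counts are hypothetical and chosen purely for illustration. The event is passing a test, as in the example above.

```python
def odds_ratio(events_a, total_a, events_b, total_b):
    """Ratio of the odds of the event in group A to the odds in group B.

    Undefined (division by zero) if every member of group A, or no
    member of group B, experiences the event.
    """
    odds_a = events_a / (total_a - events_a)
    odds_b = events_b / (total_b - events_b)
    return odds_a / odds_b


def risk_difference(events_a, total_a, events_b, total_b):
    """Difference between the proportions experiencing the event."""
    return events_a / total_a - events_b / total_b


# Hypothetical counts: 30 of 40 pass in group A and 20 of 40 pass in group B.
or_value = odds_ratio(30, 40, 20, 40)       # (30/10) / (20/20)
rd_value = risk_difference(30, 40, 20, 40)  # 0.75 - 0.50
```

With these counts, the odds ratio is 3.0 and the risk difference is 0.25. Both values index the magnitude of the difference between groups, although the underlying data are simply binary.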
Overlap-based effect size measures may be considered similar to dichotomous metrics (e.g., odds ratio) in the sense that data are treated as binary (overlapping or nonoverlapping) with the criterion for overlap defined by the specific measure (e.g., most extreme baseline value, median value). In fact, IRD is a direct extension of the IRD metric used in group research (Parker & Hagan-Burke, 2007). It would certainly be legitimate to argue that overlap-based measures may be insensitive to underlying variability under some circumstances, such as when there is no overlap between baseline and intervention phases or where baseline data reach a floor or ceiling when using the PND metric. Nevertheless, it appears to be incorrect to suggest that magnitude is not measured and consequently that these metrics do not constitute effect size measures.
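The two circumstances of insensitivity just noted can be sketched for the PND metric. The function name, the increase parameter, and all data values are hypothetical, and points that tie with the most extreme baseline value are assumed to count as overlapping.

```python
def pnd(baseline, treatment, increase=True):
    """Percentage of treatment points beyond the most extreme baseline point."""
    if increase:
        count = sum(1 for x in treatment if x > max(baseline))
    else:
        count = sum(1 for x in treatment if x < min(baseline))
    return 100 * count / len(treatment)


# Case 1: no overlap. A modest and a very large change in level both
# produce the maximum value, so PND cannot distinguish between them.
baseline = [10, 12, 11, 13]
modest = pnd(baseline, [14, 15, 14, 16])
large = pnd(baseline, [40, 45, 50, 48])

# Case 2: floor. The behavior is targeted for reduction and one baseline
# point is already zero, so no treatment point can fall below it.
floor = pnd([6, 0, 7, 5], [1, 0, 0, 0, 1], increase=False)
```

In this sketch, both nonoverlapping series return 100%, whereas the floor case returns 0% even though most treatment points are well below the typical baseline level.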
Correspondence With Visual Judgment
The efficacy of single-subject research has traditionally been judged by visual inspection, despite long-standing concern regarding lack of reliability between judges, even experts (Fisch, 1998). Beretvas and Chung (2008) have argued that effect size summaries in single-subject research “provide an alternative to visual inspection” (p. 129). A persistent criticism of nonoverlap effect size metrics is that they do not necessarily correspond with visual judgments regarding treatment effects for some data sets. For example, White (1987) raised the issue of the failure of PND to address obvious baseline trends in his commentary on the initial description of the technique. Recently, in considering overlap-based effect size measures, Wolery et al. (2010) argued that “a major criterion for judging the utility and rigor of these methods is to determine the extent to which they agree with judgments of visual analysts” (p. 19). Wolery et al. compared correspondence of several overlap measures with visual judgments about whether change was present in AB sequences. Using the criteria for PND, which may not be appropriate to other measures, Wolery et al. reported total errors between 13.2% and 22.3% when compared with data series identified as having treatment effects or no treatment effects by visual judgment. Based on this analysis, they held that overlap-based metrics are inadequate as they fail to detect all the relevant characteristics of time series data, specifically trend and variability. The question that arises is whether this is an appropriate or reasonable standard for the judgment of overlap measures. To address this question, it is helpful to consider the distinction between experimental control and effect size in traditional group research.
In traditional group research, the ability of a study to adequately demonstrate experimental control is substantially a product of the a priori features of research design, such as randomization and blinding. That is, a well-designed study offers a good probability of concluding that any observed effects are likely to be a result of the intervention rather than extraneous uncontrolled variables. Nevertheless, some threats to internal validity may arise after the commencement of research. For example, differential attrition in groups may compromise the researchers’ ability to draw causal inference about intervention effects (Campbell & Stanley, 1963). In addition, despite randomization, pretest differences between groups may compromise interpretation, particularly when participants are highly heterogeneous (Gersten et al., 2005) or when group size is small. Once data collection is complete, researchers typically use inferential statistical tests to determine whether results as extreme as those observed are probable if the null hypothesis is true. It should be stressed that at this point, we know nothing about the magnitude of any effect, just whether our design will allow strong causal inference about observed effects (or lack thereof) and whether any observed effect is sufficiently large to be unlikely due to chance on the assumption that the null hypothesis is true. In contrast, effect sizes do provide us with an index of the magnitude of a treatment effect. The term effect size does not imply causation and effect sizes are independent of demonstrations of experimental control (Allison & Gorman, 1993). That is, a large effect size may be estimated even when convincing demonstration of experimental control can be excluded on the basis of fatal design flaws. Conversely, demonstration of experimental control may be unequivocal, but the calculated effect size may be very small.
The same basic issues may be seen to apply with single-subject research. That is, demonstration of experimental control is conceptually discrete from that of the magnitude of treatment effects, although in practice, the issues become somewhat more blurred. Careful attention to experimental design can reduce threats to internal validity and significantly enhance the ability to demonstrate experimental control in single-subject research. Nevertheless, even with due diligence to planning the design of a study, there remains the possibility that characteristics of data, such as baseline trends or instability, may still compromise ability to draw causal inferences. Ideally, baseline data should not exhibit trend or substantial variability (Kazdin, 1982) and researchers should attempt to establish a steady behavior state before intervention (Kennedy, 2005). In applied research, however, it may be undesirable to delay intervention for extended periods (Kennedy, 2005). Furthermore, in some cases, the issue becomes moot as intervention effects may be so powerful that they overwhelm baseline trends and variability. Thus, researchers ultimately need to make a subjective judgment about the degree of acceptable baseline trend and variability, given the anticipated magnitude of treatment effects (Kazdin, 1982).
Wolery et al. (2010) asserted that lack of correspondence between overlap measures and visual judgment undermines the validity of these metrics. The key issue in assessing this claim is what the judges were asked to evaluate—experimental control, magnitude of treatment effects, or both? Wolery et al. specifically asked, “Did a change exist in the data from Condition 1 to Condition 2?” This clearly addressed the question of experimental control rather than magnitude of effect and is characteristic of the type of judgments examined in many studies addressing visual analysis of single-subject research (e.g., Carter, 2009; Kahng et al., 2010; Matyas & Greenwood, 1990; Ottenbacher, 1986). Obviously, it would be expected that on average, larger magnitude effects would tend to lead to more confident judgments regarding control, so the issues are related but are not inextricably linked. Large effect sizes are possible without experimental control, and conversely, small effect sizes are possible where control is unambiguous.
The criticism of Wolery et al. (2010) that overlap-based measures do not sufficiently correspond to visual judgments conflates two separate but correlated issues: experimental control and magnitude of effects. In fact, the findings of Wolery et al. may be seen as somewhat predictable. That is, a substantial degree of correspondence between visual judgment of experimental control and measured magnitude of effect is expected, but that correspondence is less than perfect. Nevertheless, depreciating the value of overlap metrics on the basis of their comparison with visual judgment focusing primarily on demonstrated experimental control (in contrast to magnitude) is fundamentally inappropriate. More appropriate comparisons between visual judgment of magnitude of effect and overlap measures (Ma, 2006; Parker & Hagan-Burke, 2007) have yielded degrees of correlation that are arguably of less concern than the levels of correspondence between visual judges with regard to experimental control (e.g., DeProspero & Cohen, 1979; Ottenbacher, 1986). It should, however, be noted that more recent research has suggested that judgments regarding control may be more consistent than previously thought (Carter, 2009; Kahng et al., 2010).
Considerations in Evaluating Efficacy in Single-Subject Research
In evaluating single-subject studies, two sets of judgments need to be made. First, a judgment needs to be made about experimental control. This includes an assessment of whether the research design used can allow the possibility of adequate demonstrations of control and then whether such control has actually been observed. Horner et al. (2005) suggested that at least three clear demonstrations of experimental control at different time points should be the minimum standard in single-subject research studies. Assuming the basic design features are such that reasonable inferences regarding causation are possible, appraisal needs to be made regarding whether presented data do demonstrate experimental control. This analysis will need to consider immediacy of treatment effects, change in level, change in trend, and variability in data. Furthermore, complex judgments of multiple phases in a study may need to be made to make an overall assessment regarding experimental control (Wolery et al., 2010), and visual inspection remains the accepted method to make such judgments (Kratochwill et al., 2010). Second, a judgment needs to be made about the magnitude of an effect, and overlap-based effect size metrics appear to offer a viable option. These measures, however, need to be interpreted in the context of experimental control. Where such control cannot be demonstrated or is not demonstrated, measures of magnitude of effect are not meaningful.
One approach might be to only proceed to data aggregation when experimental control is clearly demonstrated, but such a strategy is inherently problematic as it effectively means that only the “winners” are counted. In effect, more nuanced use and interpretation of effect sizes may be appropriate with consideration of whether experimental control has been demonstrated in the first instance and the magnitude of the effect in the second. For example, one approach to data aggregation might involve partialing or removal of studies with experimental designs that prohibit adequate demonstration of experimental control, consistent with the “best evidence” approach advocated by Slavin (1987). The calculated effect size for the remaining studies needs to be interpreted in relation to the proportion of data series in which clear experimental control has been demonstrated. Impressive effect sizes in the absence of high levels of demonstrated experimental control should be depreciated. Given that effect sizes and experimental control both need to be considered in quantitative synthesis of single-subject research, a possible direction for future research might be to examine how these two factors can be incorporated into a single metric.
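The reporting approach outlined above might be sketched as follows. This is an illustration under stated assumptions rather than an established procedure: the record fields (design_adequate, control_shown, pnd), the function name summarize, the use of the median as the summary statistic, and all values are hypothetical.

```python
from statistics import median

# Hypothetical data series drawn from a set of reviewed studies.
series = [
    {"id": "A", "design_adequate": True, "control_shown": True, "pnd": 100.0},
    {"id": "B", "design_adequate": True, "control_shown": False, "pnd": 90.0},
    {"id": "C", "design_adequate": False, "control_shown": False, "pnd": 95.0},
    {"id": "D", "design_adequate": True, "control_shown": True, "pnd": 70.0},
    {"id": "E", "design_adequate": True, "control_shown": False, "pnd": 80.0},
]


def summarize(series):
    """Report magnitude and experimental control as two separate values."""
    # Step 1: remove series whose design prohibits demonstration of control.
    retained = [s for s in series if s["design_adequate"]]
    # Step 2: summarize magnitude across all retained series, not only
    # those in which control was demonstrated (avoids counting only "winners").
    effect = median(s["pnd"] for s in retained)
    # Step 3: report the proportion of retained series with demonstrated control.
    control = sum(s["control_shown"] for s in retained) / len(retained)
    return {
        "n_retained": len(retained),
        "median_pnd": effect,
        "proportion_with_control": control,
    }


summary = summarize(series)
```

For these illustrative values, four series are retained, with a median PND of 85.0 and control demonstrated in half of them. A reviewer would accordingly interpret the apparently large effect with caution.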
Harvey et al. (2009) noted that choice of effect size metrics is an extensively debated issue and that “each time a new metric is proposed, its advocate typically emphasises its virtue by pointing out flaws in other metrics” (p. 71). Taking a more positive viewpoint, it could be argued there has been rapid development of metrics as well as increasing debate regarding their statistical properties and relative merits. This has included the development of overlap metrics that may address baseline trend (Parker, Vannest, & Davis, 2011; Parker, Vannest, Davis, & Sauber, 2011) as well as extension of existing measures to examine reversals in addition to the conventional AB phase comparisons (Parker et al., 2009). Issues still remain to be adequately addressed, such as possible inflation of some effect sizes by the extended collection of treatment data (Haardorfer, 2010; Wolery et al., 2010) and the synthesis of alternating treatment designs, which do not primarily rely on baseline comparisons. It is currently uncertain as to which metric or combination of metrics will ultimately provide the most appropriate outcome measure in single-subject research. Until such time as the picture becomes clearer, researchers may be well advised to consider the use of multiple measures to triangulate findings (Beretvas & Chung, 2008). Convergence of findings regarding the magnitude of effects will increase confidence in conclusions drawn by reviewers (Reynhout & Carter, 2011).
Aggregation of data presents inherent risks, in particular the risk of loss of information on idiosyncratic response to intervention (Critchfield et al., 2000; Salzberg et al., 1987). Nevertheless, generalities are important in science, no single finding is authoritative, and narrative and quantitative reviews can make converging contributions to understanding (Critchfield et al., 2000).
Conclusion
Two criticisms of overlap-based effect size metrics are addressed in this article and it is argued that both arise from misconceptions regarding the nature of effect sizes and what they tell us. Overlap-based measures do reflect the magnitude of changes between baseline and treatment, but they do so by dichotomizing data. Furthermore, comparison of visual judgment regarding experimental control and effect sizes inappropriately conflates the issues of control and magnitude, which are to some extent independent. Overlap-based measures can describe the magnitude of effects but cannot tell us whether effects are functionally related to interventions. This requires assessment of the basic features of experimental design as well as the analysis of data to evaluate experimental control, typically involving visual inspection.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
