Abstract
Twenty-five years ago, we proposed the use of the PND (percentage of nonoverlapping data) statistic for quantitative synthesis (or meta-analysis) of single-subject research. This procedure was controversial from its beginning, with criticism divided between those maintaining that no quantitative method should be used to summarize single-subject research and those suggesting that other methods may be preferable. Since that time, more than 40 research summaries have been published using the PND statistic or its variants, and a smaller number have been published using other methods. We argue that the PND method has proven to be very useful over time for this purpose, though other methods have also contributed. These procedures to date have led to more systematic, objective summaries of single-subject research. We discuss implications of present knowledge for future applications of research synthesis.
In 1983, shortly after receiving our doctoral degrees, we accepted research positions at Utah State University’s Exceptional Child Center (now Center for Persons with Disabilities). Margo accepted a postdoctoral position on the federally funded Early Intervention Research Institute (EIRI), and Tom accepted a position with other federal research grants, with a smaller involvement in EIRI projects. EIRI was being directed by Dr. Karl White and Dr. Glen Casto. Karl White had recently received his PhD from the University of Colorado, studying under Dr. Gene Glass, who had been working on methods for synthesizing quantitative research, a technique he referred to as “meta-analysis” (Glass, 1976). In that article, Glass described his own efforts to summarize the effects of psychotherapy, and also described in some detail Karl’s meta-analysis of the relationship between academic achievement and socioeconomic status (see White, 1982).
One of the major tasks of EIRI was to summarize all research literature on the effects of early intervention, which was accomplished using meta-analysis techniques (Casto & Mastropieri, 1986; White, 1985). For each study, standardized “effect sizes” (ES) were calculated, typically, by subtracting the control from the experimental posttreatment mean, and dividing the difference by the control group standard deviation. These effect sizes were then combined and averaged across various levels of study characteristics (e.g., age at onset of treatment, parent involvement, degree of structure), as described by Glass, McGaw, and Smith (1981). These reports were not without their critics, including one response written by Strain and Smith (1986). These two authors criticized the Casto and Mastropieri meta-analysis on a number of grounds. One significant criticism, which could not be denied, was that it had deliberately excluded single-subject research design studies. It is difficult to draw a complete picture of the effects of early intervention, Strain and Smith argued, with this substantial source of evidence ignored. Soon afterward, the EIRI Advisory Panel made that same argument.
There was, of course, no justification for ignoring all single-subject studies. The simple fact was that single-subject research was not thought to contain the type of data that could be converted to means and standard deviations comparable to those of group design studies, so they were eliminated. However, in response to feedback, we were asked by the EIRI Directors to create a procedure for integrating all single-subject literature on early intervention with students with disabilities.
Reviewing Literature Systematically
In implementing meta-analysis procedures, we relied heavily on the suggestions made by Jackson (1980) regarding optimal reviews of the literature. Jackson argued that reviews of literature should be as systematic, reliable, and replicable as any original research report, and consistent criteria should be applied in all cases. These standards should include the following:
1. Define and delimit the topic, specifying carefully what aspects of the topic will be covered, and what will not be covered.
2. Cite previous reviews, and describe how the present review will provide additional information.
3. Cite procedures for obtaining research reports, so others could replicate these procedures, and obtain the same reports.
4. Cite common dependent and independent variables across studies, so that these can be easily and systematically compared.
5. Describe the covariation between study outcomes and study characteristics (i.e., moderator variables), so readers will know, for example, a certain treatment is most effective at certain grade levels, or after a certain amount of training.
6. Support the conclusions of the review with empirical data, so it will be clear how the review contributed to overall conclusions.
7. Describe how outcomes were assessed. Reviewers should state criteria for assessing study outcomes, rather than, for example, simply restating the authors’ conclusions (see also American Psychological Association [APA], 2010, pp. 251–252).
In reviewing a body of literature, most of Jackson’s (1980) criteria can be met without a great deal of difficulty. However, at least two of these criteria (5 and 7) require the identification of a study outcome that can fairly be applied across studies, and this problem soon became the focus of our efforts.
An Appropriate Outcome Metric
We initially did not know what we would use for our outcome metric, but we did know what standards it would need to meet. First, we wanted an outcome that could be calculated relatively easily, and would result in a high degree of reliability among scorers. An outcome metric without high interscorer reliability would be doomed from the beginning. Second, we needed the outcome metric to be appropriate to the largest possible number of single-subject research reports. There would be little use for a metric that could not be applied fairly in the great majority of cases. Third, we wanted the outcome metric to be meaningful and easily understandable to the consumers of the research report. Because behavioral researchers and practitioners seemed most likely to read such reports, we hoped to be able to present the outcomes in a way meaningful to them. We did not want the outcome metric to have only esoteric meaning, understandable only to a few. Finally, we wanted the metric to be valid, in that it would faithfully represent the outcomes of the individual research reports. To achieve all these standards, we considered a number of different possibilities, on the basis of advice from experts and our own reasoning.
Standardized Effect Size
One of the first considerations was simply to calculate a standardized mean-difference effect size, as described by Glass (1976), where in this case the mean score across baseline observations is subtracted from the mean score across treatment observations, the difference then divided by the standard deviation of the baseline observations. One possible advantage is that the outcomes of single-subject synthesis reports would be of the same general type as effect size data obtained from group experimental research. We did identify, however, a number of drawbacks with the application of this method. The most significant problem was the fact that many if not most single-subject studies use only a relatively small number of observations per phase, often with five or fewer data points in baseline phases (see Huitema, 1985). Such small numbers can result in unreliable or disproportional effect sizes. A second concern is that many of the phases may contain data with autoregressive components (i.e., scores are not independent of each other), which could also complicate effect size estimates. In our own trials, for example, we routinely obtained effect sizes larger than 3 standard deviations, sometimes as high as 7 standard deviations. These values are not easily understandable. For example, an effect size of 2 standard deviations suggests about 98% of the treatment data exceed the mean of the baseline phase. When there are only 6 or 7 treatment observations, this is not particularly meaningful, and even less meaningful in the case of a 7 standard deviation treatment effect. So although we believed standardized mean difference effect sizes could reasonably be calculated, we concluded that the validity and meaningfulness of the calculation were problematic.
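The calculation described above can be sketched as follows. This is an illustrative sketch only, not the procedure used in the original trials: the function name and the data values are hypothetical, and the use of the sample (n − 1) standard deviation is an assumption. It shows how a short, stable baseline can produce a very large effect size, and it checks the "about 98%" figure for an effect size of 2 under an assumed normal distribution.

```python
from statistics import NormalDist, mean, stdev


def baseline_sd_effect_size(baseline, treatment):
    """(treatment mean - baseline mean) / baseline standard deviation."""
    sd = stdev(baseline)  # sample SD (n - 1 denominator); an assumption
    if sd == 0:
        # A flat baseline (e.g., all zeros) leaves the ratio undefined.
        raise ValueError("baseline has no variability; effect size is undefined")
    return (mean(treatment) - mean(baseline)) / sd


# Hypothetical chart: five baseline and six treatment observations.
baseline = [2, 3, 2, 3, 2]
treatment = [9, 10, 11, 10, 9, 10]

es = baseline_sd_effect_size(baseline, treatment)
print(f"effect size = {es:.2f} baseline SDs")

# Under normality, the share of a distribution shifted upward by 2 SDs
# that lies above the baseline mean is Phi(2), i.e., about 98%.
share_above = NormalDist().cdf(2.0)
print(f"share above baseline mean at ES = 2: {share_above:.1%}")
```

With only five baseline points and little baseline variability, the ratio in this hypothetical example is far larger than the values typically seen in group research, which is the interpretive problem described above.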
Regression Based Effect Size
Another possibility was to calculate the slopes of each phase and determine the extent to which changes in slope characterize study outcomes. Unfortunately, many similar problems emerged. Baseline or treatment phases with only a small number of observations can reveal highly inconsistent slope projections and result in seemingly arbitrary outcome values. Furthermore, many data displays did not present obvious or consistent slope effects, limiting our ability to apply this metric in the largest number of cases. Another problem with calculating trends or slopes is that behavioral charts often do not space data equally across time intervals, ignoring gaps from, for example, weekends or absences. In such cases, regression calculations will not reflect the data accurately.
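The spacing problem can be illustrated with a small sketch. The data, the variable names, and the helper function are hypothetical, and ordinary least squares is assumed as the slope estimator. The same four scores yield different slopes depending on whether they are regressed on session number (as a chart with equally spaced points implies) or on the calendar day on which each session occurred.

```python
def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


# Hypothetical phase with four observations.
scores = [2, 4, 6, 8]
session_index = [1, 2, 3, 4]   # equal spacing, as drawn on many charts
calendar_day = [1, 2, 5, 6]    # a weekend falls between sessions 2 and 3

slope_by_session = ols_slope(session_index, scores)
slope_by_day = ols_slope(calendar_day, scores)
print(f"slope per session: {slope_by_session:.2f}")
print(f"slope per calendar day: {slope_by_day:.2f}")
```

Neither slope is wrong in itself; the point is that a reviewer working from a published chart usually cannot tell which time base applies.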
Rating Scale
Another possibility, suggested to us by several experts, was to create a rating scale where trained observers would consider a number of evaluative criteria (e.g., level change, slope change, variability), and then provide an effectiveness rating on a scale, based on a particular scoring rubric. We created these rubrics and used them for evaluating single-subject charts. Unfortunately, we were unable to create a scale that resulted in sufficient interrater reliability, a finding long reported by others (e.g., Jones, Weinrott, & Vaught, 1978). We did achieve some reliability with a 3-point scale (effective, partially effective, and ineffective; see Mastropieri & Scruggs, 1985–1986), but we found that a 3-point scale did not provide sufficient discrimination for our purposes.
Overlapping Data
After a number of previous efforts, we settled on a measure of between-phase nonoverlapping data, because it was the one measure that met all of the standards we had established for an appropriate outcome metric. The proportion of treatment phase data that exceed baseline observations is a very strong indicator of treatment effects. Tawney and Gast (1984) stated, “Generally, the lower the percentage of overlap, the greater the impact the intervention has on the target behavior” (p. 164). Kazdin (1978) suggested, “If performance during an intervention phase does not overlap with performance during the baseline phase when these data points are plotted over time, the effects usually are regarded as reliable. The replication of nonoverlapping distributions during different treatment phases strongly argues for the effects of treatment” (p. 637).
For a measure of nonoverlapping data, we calculated the proportion of data observed in treatment phases that did not overlap data observed in the baseline phases. For example, if 9 of 10 treatment observations exceeded the highest (or lowest, depending on the intended treatment effect) baseline value, this would be calculated as 90% nonoverlapping data. In the case of ABAB or reversal designs, we calculated the combined data overlap across the two AB phases. For example, if 7 of 9 observations from the first B phase, and 5 of 7 observations from the second B phase, exceeded prior baseline levels, we would calculate (7 + 5)/(9 + 7) = 12/16 = .75, or a PND of 75.0% (see Scruggs, Mastropieri, & Casto, 1987a). We chose to use “nonoverlapping” rather than “overlapping” because it allowed us to state outcomes positively, with higher values representing more effective treatments. We chose the extreme baseline value, rather than, for example, the mean or median for comparison, because we felt there was more meaning in an absolute measure of overlap and because use of a mean or median comparison would result in more instances of 100% nonoverlapping data, limiting discrimination among outcomes.
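The calculation just described can be sketched in a few lines. The function names and the data values are hypothetical, chosen only to reproduce the two worked examples (9 of 10, and 7 of 9 combined with 5 of 7). The sketch assumes that a treatment observation equal to the extreme baseline value counts as overlapping, and it does not implement any of the coding conventions that govern when a PND should or should not be scored.

```python
def count_nonoverlapping(baseline, treatment, increase=True):
    """Count treatment points beyond the most extreme baseline point."""
    if increase:
        reference = max(baseline)
        return sum(1 for x in treatment if x > reference)
    reference = min(baseline)
    return sum(1 for x in treatment if x < reference)


def pnd(phase_pairs, increase=True):
    """PND (as a percentage) pooled over one or more (baseline, treatment) pairs."""
    nonoverlapping = 0
    total = 0
    for baseline, treatment in phase_pairs:
        nonoverlapping += count_nonoverlapping(baseline, treatment, increase)
        total += len(treatment)
    return 100.0 * nonoverlapping / total


# Hypothetical AB design: 9 of 10 treatment points exceed the highest baseline point.
ab = [([1, 2, 3, 2, 1], [4, 5, 3, 6, 7, 5, 6, 8, 7, 9])]
print(pnd(ab))    # 90.0

# Hypothetical ABAB design: 7 of 9 and 5 of 7 exceed their prior baselines.
abab = [
    ([2, 3, 4, 3], [5, 6, 4, 7, 8, 3, 6, 7, 9]),
    ([3, 4, 5], [6, 7, 5, 8, 4, 9, 7]),
]
print(pnd(abab))  # 75.0
```

Passing `increase=False` applies the same logic to interventions intended to reduce a behavior, comparing each treatment point with the lowest baseline value.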
When it came to choosing a name for our outcome metric, we settled on “PND” (for “percentage of nonoverlapping data”), because it is more directly meaningful than, for example, a Greek letter, and emphasizes its simplicity of calculation.
In a number of trial applications, we found that the percentage of nonoverlapping data could very easily be calculated (even when the precise observation values are not certain), with a high degree of reliability. We also found that it correlated strongly with observer ratings, when the reliability of the ratings was also high (Mastropieri & Scruggs, 1985-1986). We determined that the PND, though not a direct measure of other considerations such as slope and variability, was in many instances sensitive to slope and variability (Scruggs et al., 1987a; Scruggs & Mastropieri, 1998). That is, more variability in baseline usually led to a lower PND score; inappropriate baseline slopes were accounted for by conventions (e.g., charts with baseline trends in the therapeutic direction of change were not included); and appropriate trends in treatment phases usually led to higher PND scores. Finally, we were very pleased with the meaningfulness of the metric, especially as compared with statistical alternatives. For example, if we were to conclude that a particular treatment resulted in an average of 92% of treatment data that exceeded baseline observations, that value should be easily interpretable for behavioral researchers as a generally effective treatment. On the other hand, treatments that failed to raise performance over baseline levels in more than half the cases would be less likely to be regarded as effective.
Applications
Having identified an outcome metric that seemed to satisfy most of our criteria, we began to apply this metric to our original task, the quantitative synthesis of single-subject research in early intervention. Using standard search procedures, we identified 68 reports of research in which single-subject methodology was applied to evaluate interventions on children with disabilities, 66 months of age or younger, at the beginning of the intervention. Rather than summarize all these studies simultaneously, we determined for our purposes it would be better if we were to sort the studies by intervention area and then evaluate smaller, more homogeneous data sets. We found that the 68 studies fit relatively easily into four categories: social withdrawal, conduct disorders, early language interventions, and general developmental functioning. At this beginning point in the methodology, we were very concerned that the overall results would make sense in light of the conclusions of the original authors of these reports and previous reviews, and yet provide a synthesis across studies that went beyond previous reviews because of the quantitative metrics used. We took this to be a measure of the validity of the method, and we would have been concerned had our conclusions not paralleled, at least in many respects, the conclusions of others.
We were also concerned our outcome metric would make sense with respect to the data presented in each of the original reports. To ensure this, we coded a PND score for each data display, then determined whether it appeared to be a fair representation of apparent treatment effects. We found we were easily able to do this, with the application of a few coding conventions to be used under specific circumstances (see Scruggs et al., 1987a, pp. 28-31), for example, when
a decreasing slope in the second baseline phase of a reversal or ABAB design, due to extinction processes during the second baseline, resulted in nearly all subsequent treatment data overlapping, yet the treatment appeared effective. In such cases, we used the first baseline as a better measure of comparison for the second treatment phase.
a baseline trend was observed in the direction of intended treatment effects. Such intervention effects are extremely difficult to interpret using any method (see, e.g., Kazdin, 1978), so we did not calculate PND scores in these cases.
a small number of outlying observations in baseline compromised treatment outcomes. That is, a single zero data point in baseline on an intervention intended to decrease behavior can result in a 0% PND score. We scored PND when there were several extreme observations and did not calculate a score when there were three or fewer and/or less than 33% ceiling or floor data points.
In fact, we encountered these problems in only a small proportion of cases. Furthermore, no research synthesis method we know of is able to include all relevant studies, and many studies may be excluded in meta-analysis of group research because of, for example, insufficient numbers of observations, floor or ceiling effects that limit variability, or insufficient data presented for calculating outcomes. We agreed no PND should be calculated and applied that did not fairly represent the observed treatment outcome. These conventions have been modified by others in some cases (sometimes to deal with special considerations in a specific data set; see Schlosser, Lee, & Wendt, 2008), or possibly not followed in others (Wolery, Busick, Reichow, & Barton, 2010). Using alternative conventions could possibly lead to inconsistent findings, and not following specific conventions for calculating outcomes can lead to invalid results.
Our first synthesis effort was conducted in the area of social withdrawal (Mastropieri & Scruggs, 1985-1986). We were pleased with the outcomes of this investigation, especially as reliability of coding was given at .94. Furthermore, the PND scores correlated .68 to .74 with our more subjective 3-point rating scale. Finally, our conclusions appeared sensible given the conclusions of the original research reports; for example, (a) reinforcement was effective in increasing interactions between classroom peers and withdrawn students, (b) unprompted modeling was less effective, (c) withdrawn students were far more likely to initiate peer interactions when directly reinforced, and (d) generalization was less effective and less frequently attempted, although near transfer effects were higher than far transfer effects.
Next, we summarized single-subject research in the area of conduct disorders (Scruggs, Mastropieri, Cook, & Escobar, 1986). The findings from this synthesis also seemed valid, in that tangible reinforcement was associated with the overall highest outcomes (PND = 100%), followed by punishment/time out (75.5%), and with differential attention or praise associated with lower overall outcomes (13.5%). These findings were robust across different levels of other variables (e.g., older vs. younger children, different types of settings, different handicapping conditions, different types of intervenors). The overall generalization effect was weak (44.5%).
Third, we summarized research reports on language interventions (Scruggs, Mastropieri, Forness, & Kavale, 1988). This synthesis supported the effectiveness of reinforcement and direct instruction. Maintenance effects were high, but generalization effects again were lower (62.5%), although higher with specific generalization training techniques. Because the total number of studies was not great (N = 20), we were able to provide much narrative information about individual studies that supported the quantitative results.
Our final early intervention synthesis fell under the general heading of developmental functioning, and included such topics as rumination, physical behavior, and feeding behavior (Scruggs, Mastropieri, & McEwen, 1988). Again, general principles of behavioral treatments (e.g., reinforcement) were validated; the lowest effects were for no reinforcement. Across studies, effects for specific generalization training were again higher than for “train and hope” methods (Stokes & Baer, 1977).
Overall, we felt we had accomplished the task required of us by the Directors of the Early Intervention Research Institute. We had been able to consolidate a large and important corpus of research, and to draw general conclusions that were faithful to the original research reports, and compatible with previous reviews of this literature. In addition, the use of a valid outcome metric allowed us to compare study outcomes objectively with study characteristics, such as setting, intervenor, type of treatment, generalization procedures; or age, gender, or handicapping condition of participants. Although conclusions from these reports were not directly comparable with the findings of group investigations reported by Casto and Mastropieri (1986), they nonetheless complemented the latter’s findings, and added substantively to the cumulative body of early intervention research in special education.
Later Synthesis Efforts Using the PND
Since our initial synthesis efforts, there have been a substantial number of efforts to summarize single-subject research on various topics using a variety of procedures, but most commonly using the PND as an outcome metric. Many topics have been examined, including self-injurious behavior (Schlosser & Goetze, 1992), social skills (Mathur, Kavale, Quinn, Forness, & Rutherford, 1998), self-determination (Algozzine, Browder, Karvonen, Test, & Wood, 2001), reading instruction for individuals with significant cognitive disabilities (Browder, Wakeman, Spooner, Ahlgrim-Delzell, & Algozzine, 2006), and writing research (Rogers & Graham, 2008). Schlosser et al. (2008) identified 45 research synthesis reports employing the PND statistic published between 1985 and 2008, and concluded, “the PND is undisputedly the most widely field-tested outcome metric for [single-subject experimental designs] to date. The data from this review also provide solid evidence that the PND can be produced reliably across more than one coder” (p. 174). Schlosser et al., however, pointed out variability in reporting of single-subject synthesis efforts, and argued for more consistently applied methods, not dissimilar to arguments applied to group experimental meta-analysis, which also have been reported inconsistently (Mostert, 2001).
Maggin, O’Keeffe, and Johnson (2011) identified 68 research reports in which single-subject research was synthesized quantitatively. Of these 68 reports, 48 (70.6%) used the PND metric and/or one or more of its variations, whereas the other 20 used one of several alternatives, including standardized mean difference, regression methods, multilevel modeling, or interrupted time-series analysis. These authors also argued for more consistency in methodology for summarizing single-subject research, making several suggestions and concluding, “a systematic approach for aggregating research findings needs to be developed” (p. 122).
Critiques of the PND
In spite of its apparent advantages, the PND method of synthesizing single-subject research reports has been carefully evaluated and commented upon, as is appropriate with a new methodology. One early critique, voiced by Salzberg, Strain, and Baer (1987), argued against any method of data aggregation for reviews of single-subject research, expressing preference for qualitative reviews of the literature. As an example, they offered a narrative review of six studies they regarded as preferable. We responded (Scruggs, Mastropieri, & Casto, 1987c) that descriptive information also can be included in quantitative reviews. However, in their review, Salzberg et al. (1987) neglected (a) to identify common independent and dependent variables, (b) to describe the covariation of study outcomes with study characteristics, or (c) to identify their methods for identifying study outcomes. They also failed to explain how these criteria would be applied in a qualitative review.
PND and Treatment Magnitude
Some (e.g., Wolery et al., 2010) have reasoned that the PND does not provide an estimate of the magnitude of experimental effects and is therefore inappropriate. Allison and Gorman (1993) argued, “the very heart of meta-analysis is the quantification of the magnitude of effects. The PND is therefore inappropriate for meta-analytic use” (pp. 623-624). And, in fact, we have previously downplayed the PND as an exact measure of treatment magnitude (e.g., Scruggs, Mastropieri, & Casto, 1987b). However, standard effect sizes calculated directly from single-subject data are often unreliable for reasons stated previously, and therefore cannot be considered reliable measures of treatment magnitude.
Our previous discussions, however, did not mean to suggest that PND has no relevance to treatment magnitude. In fact, relevant statistics texts have characterized effect sizes in terms of degree of overlap of distributions. For example, Hinton (2004) stated, “a large effect size indicates only a small overlap between distributions, whereas a small effect size indicates a large overlap between the distributions” (p. 101). Cohen (1988) argued, “it is possible to define measures of overlap (U) associated with d which are intuitively compelling and meaningful” (p. 21). He provided a table of equivalence between effect sizes and proportion of overlapping data. However, his calculation involved a measure of the distributions in both treatment and control conditions that overlap each other. A procedure more relevant to the PND was recently provided by Grice and Barrett (2011), who calculated proportion of treatment condition observations that exceeded control condition observations, using a standard normal z-table. By these calculations, an effect size of 1.0 would be equivalent to 38.3% nonoverlapping data; an effect size of 2.0 would be equivalent to 68.3% nonoverlapping data. Such procedures indicate the close relation between data overlap and effect size. However, some caution should be exercised in making translations between effect size and PND, for two reasons: first, the effect size → nonoverlapping data calculations assume normality and equivalent variance in the two distributions, which may or may not be true of single case data (nor, of course, is it always true of group data). Second, derived effect sizes from PND scores would be interpreted very differently than effect sizes from group data. For example, a “medium” effect size (Cohen, 1988) of .5 would be equivalent to 19.7% nonoverlapping data, which would be considered far from effective in a single-subject chart. 
We are not arguing here that equivalent effect sizes should be routinely reported from PND data; we are suggesting, however, that there is a clear and tangible relation between PND and effect magnitude.
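The three correspondences quoted above (effect sizes of .5, 1.0, and 2.0) are all reproduced by the expression 2Φ(d/2) − 1, where Φ is the standard normal cumulative distribution function. That expression is inferred here from the quoted values and is an assumption, not a statement of the exact procedure of Grice and Barrett (2011). The function name is hypothetical, and the sketch inherits the normality and equal-variance assumptions noted above.

```python
from math import erf, sqrt


def nonoverlap_from_d(d):
    """Return 2*Phi(d/2) - 1 as a percentage (Phi = standard normal CDF)."""
    # 2*Phi(x) - 1 equals erf(x / sqrt(2)); here x = d / 2.
    return 100.0 * erf(d / (2.0 * sqrt(2.0)))


for d in (0.5, 1.0, 2.0):
    print(f"d = {d}: {nonoverlap_from_d(d):.1f}% nonoverlapping")
# d = 0.5: 19.7% nonoverlapping
# d = 1.0: 38.3% nonoverlapping
# d = 2.0: 68.3% nonoverlapping
```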
Alternative Procedures
Some proposed alternatives to the PND metric involved the calculation of a standardized mean-difference effect size (e.g., DuPaul & Eckert, 1997; Gorman-Smith & Matson, 1985). Busk and Serlin (1992), for example, suggested calculating effect sizes using one of three separate procedures, depending on assumptions made by the reviewer. In one case, sums of squares are calculated for within-subject and treatment effects, which are then used to yield a mean square residual. The square root of this mean square residual becomes the denominator of the effect size, with the baseline-treatment mean difference used as the numerator.
Unfortunately, many of these procedures used only two phases per study for their analysis, losing much potentially valuable data. Furthermore, the relatively small number of observations in single-subject studies is problematic. For example, Huitema (1985) analyzed 10 years of data displays from the Journal of Applied Behavior Analysis and found the median number of baseline data points to be 5, with the mode being 3 to 4. With such a small number of observations, calculation of reliable standard deviations is extremely problematic, resulting in effect sizes as high as 13.73 standard deviation units (Gorman-Smith & Matson, 1985), a value extremely difficult to interpret or compare reasonably with other effect size values. Another problem in such calculations is that baseline floor effects (i.e., many zero observations) limit variability and necessarily inflate effect size calculations. For example, Beeson and Robey (2006), using the method suggested by Busk and Serlin (1992), calculated effect sizes on a multiple baseline design, which yielded a weighted mean effect size of d = 9.59. Examination of the chart presented in the article reveals obvious floor effects (22 of the 29 baseline observations across the three participants are zero) that have clearly limited variability and artificially inflated the effect size.
Other models such as the “regression-based” approach of Allison, Faith, and Franklin (1995, p. 283) are based on a concern that autocorrelation in baseline observations (or slopes) is not considered, and employ a regression equation of baseline data to predict treatment values. In a 10-step calculation procedure, the observed values are subtracted from the predicted values, and these residuals are used to compute “detrended” data, from which an F ratio is calculated, which is then converted to a standard effect size. One problem with this method is that reliable slopes are very difficult to calculate with only four, five, or fewer observations, a very common occurrence in single-subject research (Scruggs & Mastropieri, 1994b). Allison and Gorman (1993) acknowledged, “any regression estimate based on so few data points [i.e., 3–4] must be highly suspect” (p. 629; their own hypothetical examples contained baselines with 20 observations each). Allison et al. (1995) reported that when their predicted scores revealed percentage values greater than 100% or frequencies less than zero, they replaced the predicted values with the upper or lower limit, that is, 100% or 0. Nevertheless, these procedures are likely also to yield highly questionable outcomes, such as the 11.04 effect size reported by Allison et al. (p. 289). Swanson and Sachse-Lee (2000), who considered only the last three data points per phase, used a correction procedure that somewhat reduced the effect size estimates, but they still found many of the obtained effect sizes to be greater than 3.00, and removed these (20 studies) from their analysis (see also Reid, Trout, & Schwartz, 2005). Skiba, Casey, and Center (1985-1986) also limited effect sizes to 3.00 standard deviations and calculated effect sizes for slope, level, and combined effects. Some of their conclusions (e.g., feedback was more effective in special education settings, and reinforcement more effective in general education settings) were described by the authors as “unexpected” (p. 475); some conclusions “seemed seriously at odds with empirical data and theory” (p. 476).
It is true that substantial variability exists at present regarding methods for synthesizing single-subject research, including coding procedures as well as choice of outcome metric. Suggestions by Schlosser et al. (2008) and Maggin et al. (2011) may be helpful in resolving some of these issues. Schlosser et al. (2008) suggested that reviewers using the PND method (a) clearly present coding conventions (e.g., for inappropriate baseline trends), and how often they were applied; (b) calculate separate effects for treatment, maintenance, and generalization outcomes; (c) describe how outcomes were coded for different experimental designs; (d) describe how data were aggregated across studies; and (e) describe how often valid PND scores could not be calculated, to aid future refinements in the method. Maggin et al. (2011) suggested that all quantitative reviewers of single-subject research (a) report their method of data extraction and phase comparison; (b) employ multiple indices, including visual analysis; (c) increase the use of regression techniques, including multilevel modeling, to further evaluate their effectiveness; (d) provide information on internal validity of studies reviewed; and (e) provide detailed information on study characteristics, including participant information, to address external validity. In both cases, the suggestions of these researchers go beyond the scope of this article to describe in detail; the reader is referred to the articles themselves for further information. Schlosser, Wendt, and Sigafoos (2007) have provided a list of considerations for appraisal of single-subject synthesis reports.
It also should be considered, however, that meta-analysis of group experimental research has also had its critics, from the very beginning (Eysenck, 1978; Strain & Smith, 1986), and variability in conducting and reporting these meta-analyses has been frequently commented on (e.g., Moher, Tetzlaff, Tricco, Sampson, & Altman, 2007; Mostert, 2001).
Comparison of Overlap Models to Visual Analysis
Wolery et al. (2010) recently compared visual inspection procedures with four variations of the PND model: (a) the PND as described here; (b) the PDO2 method, based on a comparison of each baseline data point with all treatment data points (Parker & Vannest, 2007, cited in Wolery et al., 2010); (c) the PEM, which calculates the percentage of treatment data points that exceed the median baseline value (Ma, 2006); and (d) the PEM-T, which calculates the proportion of treatment data that overlaps a baseline trend projected through the treatment phase (Wolery, Busick, Reichow, & Barton, 2008, cited in Wolery et al., 2010). Each of these methods was used to calculate a measure of between-phase data overlap from 120 adjacent data phases selected from the Journal of Applied Behavior Analysis. Each value was compared with a qualitative evaluation of the perceived effect, where judges concluded a change did, or did not, occur across phases. The authors rated all graphs, without specific instructions to define “change,” and chose “yes” or “no.” These judgments were then compared with the four measures of data overlap, using general guidelines described by Scruggs and Mastropieri (1998): 90% or greater = very effective; 70% to 89% = effective; less than 70% = questionable or ineffective. The authors concluded that “error percentages” (defined as disagreement with visual analysis) were unacceptably high for all data overlap methods. In addition, they identified examples in which measures of data overlap may not yield an appropriate score. Wolery et al. (2010) concluded that these measures of overlap should be abandoned in the future, presumably in favor of a better (but unnamed) metric that would simultaneously consider consistency of effects; all data presented, including treatment to baseline effects; magnitude of effects; and changes in level, trend, and variability. Furthermore, this hypothetical metric should be in agreement with visual inspection, should not violate assumptions of serial dependency, and should allow for analysis of moderator variables.
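Two of the overlap measures compared in that study, and the general guidelines applied to them, can be sketched as follows. This is a minimal illustration under stated assumptions: PND is taken as the percentage of treatment data points beyond the most extreme baseline point, PEM follows the description above, and none of the coding conventions for baseline trends or outliers are implemented. The function names, the `increase_expected` flag, and the data are hypothetical.

```python
from statistics import median


def pnd(baseline, treatment, increase_expected=True):
    """Percentage of treatment points beyond the most extreme baseline point."""
    if not baseline or not treatment:
        raise ValueError("both phases need at least one observation")
    if increase_expected:
        extreme = max(baseline)
        beyond = sum(1 for y in treatment if y > extreme)
    else:
        extreme = min(baseline)
        beyond = sum(1 for y in treatment if y < extreme)
    return 100.0 * beyond / len(treatment)


def pem(baseline, treatment, increase_expected=True):
    """Percentage of treatment points beyond the baseline median."""
    if not baseline or not treatment:
        raise ValueError("both phases need at least one observation")
    mid = median(baseline)
    if increase_expected:
        beyond = sum(1 for y in treatment if y > mid)
    else:
        beyond = sum(1 for y in treatment if y < mid)
    return 100.0 * beyond / len(treatment)


def guideline_label(score):
    """General guideline bands (90+, 70 to 89, below 70) applied as hard cutoffs."""
    if score >= 90:
        return "very effective"
    if score >= 70:
        return "effective"
    return "questionable or ineffective"


# Hypothetical phases; one high baseline point (5) sits well above the baseline median (3).
baseline = [2, 3, 5, 3]
treatment = [4, 6, 7, 8, 6, 9]
print(round(pnd(baseline, treatment), 1), guideline_label(pnd(baseline, treatment)))
print(round(pem(baseline, treatment), 1), guideline_label(pem(baseline, treatment)))
```

On these hypothetical data the PND is 83.3% and the PEM is 100%, so the same pair of phases falls into different guideline bands depending on the metric. A score of 69% would also be labeled "questionable or ineffective" by the same hard cutoff.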
Wolery et al. (2010) raised some interesting points regarding use of data overlap methods, including PND, for research synthesis. However, there are a number of problems with this investigation, which ultimately limit the strength of its conclusions:
1. Wolery et al. (2010) based all comparisons on their model of visual inspection as a kind of absolute criterion, even though many investigations over the years have demonstrated problems with reliability of visual inspection. (See Danov & Symons, 2008; DeProspero & Cohen, 1979; Gottman & Glass, 1978; Jones et al., 1978; Kazdin, 1978; and Ximenes, Manolov, Solanas, & Quera, 2009; Kahng et al., 2010, reported higher reliabilities, but used hypothetical data.) In fact, in the data phases examined by Wolery et al. (2010), the raters agreed barely 3 times in 4, with nearly one fourth of the data excluded for this reason, even though only a 2-point scale (yes/no) was used. In addition, they included any adjacent phases (not only baseline-treatment), used only two phases per comparison, and excluded charts that contained fewer than three data points. In contrast, PND values have been calculated with interscorer reliabilities as high as .96 (e.g., Scruggs & Mastropieri, 1994a).
2. Wolery et al. (2010) may have miscalculated PND. In our original article (Scruggs et al., 1987a), we specified circumstances in which the PND would be modified or not calculated (e.g., inappropriate baseline trends, outliers in baseline data). Because Wolery et al. made no mention of applying these conventions (p. 20), they may have calculated PND inappropriately in at least some cases.
3. These authors used our general guidelines for evaluating outcomes as an absolute criterion. We had previously offered (e.g., Scruggs & Mastropieri, 1998) some considerations for evaluating PND scores (e.g., >90% = very effective). However, we only intended this as a very general guideline (e.g., “scores of 50 to 70 have been considered questionable,” p. 224) rather than an absolute cutoff to be applied indiscriminately. By this standard, a PND score of 69% would be considered an “error” if the judges had rated that a change had occurred. This is even more problematic in metrics such as the PEM, which calculates overlap from the median baseline data point, and could not be expected to deliver outcomes with the same meaning as the PND. Finally, as we reported previously, when a correlation is calculated between PND and qualitative rating based on visual inspection, much stronger relationships have been observed, as they were in the Mastropieri and Scruggs (1985–1986) investigation.
4. After their conclusion that the “error percentage” was unacceptably high, Wolery et al. listed several circumstances in which the PND score may lead to inaccurate conclusions, including changes in trend across condition, trending baseline that continues across phases, and outlier effects in baseline that obscure treatment effects. What they did not mention is that we had already identified these same issues in our original 1987a paper (providing very similar charts as examples), and provided conventions for dealing with these appropriately. If Wolery et al. disregarded these conventions in calculating PND (see no. 2, described previously), it is likely they miscalculated at least some PND values. It is also true that we encountered these issues relatively rarely, certainly far less often than the nearly 25% of comparisons these authors excluded because the judges disagreed on whether a behavior change had occurred.
5. Wolery et al. also suggested that PND is influenced by the number of data points per phase, where for example a PND score drops to 67% when one of three total data points does not exceed baseline values. Our response to this concern is simply that more data points in a particular phase result in a more convincing effect (e.g., 9 of 10 nonoverlapping data points) and lead to an appropriately higher PND score. Regarding their suggestion that the researcher “can simply continue to collect data” (p. 24) to increase the size of the effect, this would only be true if the treatment were truly more effective than indicated by the first few data points, and in such cases, more data would appear more convincing. Furthermore, in our previous research synthesis reports, no relation between PND and number of treatment data points was observed. Scruggs et al. (1988) reported a correlation of r = .054, ns, between PND and number of treatment data points, whereas Scruggs and Mastropieri (1994a) reported a correlation of r = –.07, ns (see also Scruggs & Mastropieri, 1998).
6. Wolery et al. proposed that a new metric be developed, which simultaneously considers at least seven characteristics, as described previously. Although it does seem unlikely such a metric will emerge, this may become more possible if future single-subject research studies include sufficient data. Finally, their arguments against overlap methods fail to acknowledge similar limitations of traditional meta-analysis of group experimental research: several different types of effect size exist (e.g., Glass’s Δ, Hedges’s g, Cohen’s d), conventions are similarly suggested for calculating these effect sizes under different circumstances (e.g., Glass et al., 1981; Hedges & Olkin, 1985), and accurate or appropriate effect sizes are not codeable in many instances (e.g., insufficient number of observations, floor or ceiling effects, insufficient data presented). Having conducted both single-subject and group experimental (e.g., Casto & Mastropieri, 1986; Mastropieri & Scruggs, 1989; Scruggs, Mastropieri, Berkeley, & Graetz, 2010) research syntheses, we have found that coding decisions and outcome calculations are difficult and exacting processes, regardless of methodology.
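The point that group-design meta-analysis also offers several competing effect sizes can be made concrete with a short sketch. The formulas below are common textbook definitions, assumed here for illustration; naming conventions vary across sources, and the correction factor in `hedges_g` is the usual small-sample approximation rather than the exact form. The function names and scores are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def glass_delta(treatment, control):
    """Mean difference divided by the control group standard deviation."""
    return (mean(treatment) - mean(control)) / sqrt(variance(control))


def cohen_d(treatment, control):
    """Mean difference divided by the pooled standard deviation."""
    n_t, n_c = len(treatment), len(control)
    pooled_var = (
        (n_t - 1) * variance(treatment) + (n_c - 1) * variance(control)
    ) / (n_t + n_c - 2)
    return (mean(treatment) - mean(control)) / sqrt(pooled_var)


def hedges_g(treatment, control):
    """Cohen's d multiplied by the approximate small-sample correction 1 - 3/(4*df - 1)."""
    df = len(treatment) + len(control) - 2
    return cohen_d(treatment, control) * (1 - 3 / (4 * df - 1))


# Hypothetical posttest scores with unequal group variances.
treatment = [12, 14, 16, 18, 20]
control = [10, 11, 12, 13, 14]
print(round(glass_delta(treatment, control), 2))  # 2.53
print(round(cohen_d(treatment, control), 2))      # 1.6
print(round(hedges_g(treatment, control), 2))     # 1.45
```

The same two groups yield values ranging from about 1.45 to about 2.53, depending only on which standardizer and correction the reviewer chooses.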
Validity of Results
Finally, all critics of the PND method to our knowledge have yet to make the argument we personally would have found most convincing: that the appropriate application of this method has led to inaccurate or incorrect conclusions. For example, all of our own synthesis efforts have accurately identified a number of effective treatments such as direct instruction and reinforcement; and have generally agreed with the original authors, as well as previous reviews (see Scruggs & Mastropieri, 2001). Overall, we concluded generalization effects were studied in the minority of cases, and typically resulted in modest outcomes, as is commonly reported. In our research synthesis of generalization effects (Scruggs & Mastropieri, 1994a), we concluded that the overall outcome for generalization was modest (PND = 62.2%) compared with maintenance (76.7%) and treatment (90.2%) effects. We also concluded that edible reinforcement was associated with the overall highest treatment effects, followed in turn by tangible or token reinforcement, social reinforcement, and no reinforcement. The lowest overall generalization effects (45.1%) were associated with “train and hope” methods (Stokes & Baer, 1977), whereas other methods (e.g., indiscriminable contingencies, peer mediation, multiple exemplars) were associated with much higher scores. We also reported that near transfer effects were associated with higher PND scores than far transfer effects. Such results, although informative, are far from surprising and underline principles of applied behavior analysis that are generally accepted (see also, e.g., Algozzine et al., 2001; Rogers & Graham, 2008). If the PND method of research synthesis is incorrect or inappropriate, why does it lead to such sensible conclusions?
The Future
Future efforts in quantitative synthesis of single-subject research, we believe, will be directed toward more detailed and more systematic reporting of study characteristics and coding procedures, and further explorations of alternative methods of calculating study outcomes. To this extent, it seems likely that further exploration of regression-based approaches, including multilevel modeling, will be undertaken in the future (e.g., Maggin, Briesch, & Chafouleas, 2010). Although we have expressed some skepticism of the utility of these methods (based more on their appropriateness for the single-subject data that presently exist in the literature than on the methods themselves), further exploration is certainly warranted. As these are explored, we hope they will be undertaken with actual research data, rather than with theoretical analysis or with data simulations. As suggested by Schlosser et al. (2008), “When metrics are discussed in terms of their theoretical strengths and weaknesses alone, divorced from issues of implementation and application, we jeopardize the capability of a particular metric to realize these strengths or perhaps minimize weaknesses, whatever they may be” (p. 184).
In this way, synthesis procedures can be evaluated for the extent to which they result in conclusions that are reliable and meaningful, and faithfully represent the research being reviewed. If these considerations are followed, we believe the future of research in this area will be very beneficial. We also hope that as additional information is presented, general agreement will be reached in the future on the preferred methods of synthesizing single-subject research.
Summary and Conclusions
Our original purpose for the use of the PND metric was met by 1988. However, since that time there have been many efforts to summarize single-subject research, mostly using the PND, but also employing a variety of alternative methods. In addition, there has been much discussion of methods for synthesizing single-subject research. Recent reviews, appropriately, have called for more systematic application of research integration procedures, and further investigation of alternative methods. Although it would be desirable to identify an outcome metric that would be directly comparable to the standardized mean-difference effect size of group experimental data, this in our view will be unlikely until single-subject research generally includes many more observations and is presented more consistently across studies.
Frequently missing from criticism of methods for quantitative research synthesis, interestingly, is criticism of traditional, subjective methods for reviewing research. In existing traditional reviews of research, it is common to find the topic was not defined and delimited, previous reviews were not reviewed, search procedures for research reports were not specified, common dependent and independent variables were not reported, moderator variables were not described, outcome measures were not specified, or conclusions of the review were not supported with conclusions of the original research. It should be remembered that meta-analysis in general came about as an attempt to provide more consistent, systematic, and objective reviews of the literature than had previously been conducted. In our view, this purpose has largely been accomplished. That limitations can be identified in any quantitative research integration procedure should not be taken as evidence that subjective, qualitative reviews themselves are without flaws.
Twenty-five years since our original efforts, quantitative synthesis of single-subject research using the PND method has continued to deliver coherent, valid summaries of relevant research, in a wide variety of subject areas. Although we never intended that it necessarily be the only method for summarizing such research, we still believe that, used appropriately, it remains the most versatile and meaningful of the various methods proposed and has led to the most sensible conclusions to date. Nevertheless, alternative procedures do exist, and no doubt will continue to be proposed. Given the present variability in methodology, we would like to reemphasize a point we made earlier: however the research synthesis is accomplished, authors should be certain to link the conclusions of the review to the conclusions of the original research, and to previous, similar reviews. Where discrepancies exist, authors should identify specifically the reasons for them. In this way, the validity of the review procedures can be established and the value of quantitative research synthesis confirmed.
Footnotes
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
