Assessing Consistency in Single-Case A-B-A-B Phase Designs

Abstract

Previous research has introduced several effect size measures (ESMs) to quantify data aspects of single-case experimental designs (SCEDs): level, trend, variability, overlap, and immediacy. In the current article, we extend the existing literature by introducing two methods for quantifying consistency in single-case A-B-A-B phase designs. The first method assesses the consistency of data patterns across phases implementing the same condition, called CONsistency of DAta Patterns (CONDAP). The second measure assesses the consistency of the five other data aspects when changing from baseline to experimental phase, called CONsistency of the EFFects (CONEFF). We illustrate the calculation of both measures for four A-B-A-B phase designs from published literature and demonstrate how CONDAP and CONEFF can supplement visual analysis of SCED data. Finally, we discuss directions for future research.

Keywords

single-case experimental designs, effect sizes, consistency, statistical analysis, visual analysis

Introduction

Single-case experimental designs (SCEDs) involve the repeated measurement of a single case that is being exposed to different levels of at least one manipulated variable (Kennedy, 2005; Kratochwill et al., 2010; Onghena & Edgington, 2005). SCEDs have a long history in the behavioral sciences (see Barlow, Nock, & Hersen, 2009; Kazdin, 2011; Kratochwill & Levin, 2015) and recently have received a strong impetus by the publication of guidelines and standards for more general implementation and reporting in various scientific disciplines (Kratochwill et al., 2010, 2013; Shamseer et al., 2016; Tate, Perdices, Rosenkoetter, McDonald, et al., 2016; Tate, Perdices, Rosenkoetter, Shadish, et al., 2016; Tate et al., 2013; Vohra et al., 2016). The application of SCEDs has risen steadily over the years (Michiels, Heyvaert, Meulders, & Onghena, 2017; Shadish & Sullivan, 2011; Smith, 2012), making SCEDs now a popular research methodology in the educational sciences (e.g., Gast, 2010; Horner et al., 2005; Kennedy, 2005; Tankersley, Harjusola-Webb, & Landrum, 2008), clinical psychology (e.g., Barlow et al., 2009; Morgan & Morgan, 2001), sport and exercise psychology (e.g., Barker, McCarthy, Jones, & Moran, 2011), and the health sciences (e.g., Morgan & Morgan, 2009).

Visual Versus Statistical Analysis of SCEDs

As the number of applications of SCEDs has increased over the years, the repertoire of available analytical techniques has steadily grown as well, leading to intensified discussions regarding the appropriateness of these techniques (Manolov & Moeyaert, 2017). Generally, one can distinguish between statistical and visual analysis of SCED data, and there has been an ongoing debate about the superiority of either of the two approaches (e.g., Heyvaert & Onghena, 2014; Kazdin, 2011; Kratochwill & Levin, 2015). Traditionally, visual analysis is the primary method for analyzing data obtained from SCEDs (Heyvaert, Wendt, Van den Noortgate, & Onghena, 2015; Kennedy, 2005; Lane & Gast, 2014; Smith, 2012). Conducting visual analysis of SCED data “refers to the viewing and inspection of all available data (i.e., for all sessions in each condition) plotted on a line graph (i.e., time series data), and making determinations about behavior changes based on the visible data characteristics” (Ledford, Lane, & Severini, 2018, p. 4). The What Works Clearinghouse guidelines for SCEDs (Kratochwill et al., 2010) recommend inspecting six features of the data when performing visual analysis: level, trend, variability, immediacy of the effect, overlap, and consistency of data patterns across similar phases. These data aspects “are [visually] examined to determine the extent to which a meaningful change in the behavior occurred and the extent to which this change can be attributed to the independent variable [. . .]” (Kahng et al., 2010, p. 35).
The advantages of visual analysis include the following: It allows for the observation of abrupt as well as subtle changes over time (Lane & Gast, 2014); it is transparent when appropriate guidelines are followed and these guidelines are referred to explicitly (e.g., Ledford et al., 2018); it is self-explanatory, easily manageable, and it is widely accepted (Barker et al., 2011; Barton et al., 2016; Kennedy, 2005); large intervention effects are easily detectible, clinically insignificant effects are disregarded, and Type I error rates can be reduced (Baer, 1977; Harrington, 2013). Furthermore, visual analysis is response-guided, which allows researchers to make needed changes (e.g., phase changes) while maintaining experimental control (Barton et al., 2016). However, the validity of visual analysis has also been seriously questioned over the years. Some of the major drawbacks of visual analysis include low interrater agreement (Heyvaert & Onghena, 2014; Park, Marascuilo, & Gaylord-Ross, 1990), a lack of clear decision rules (Bulté & Onghena, 2009; Perdices & Tate, 2009), serial dependency in the data misguiding visual judgment (Matyas & Greenwood, 1990), insensitivity of visual analysis along with high Type II error rates (Ottenbacher, 1990), and finally—directly contradicting the proponents of visual analysis—increased Type I error rates (Harrington, 2013; Heyvaert & Onghena, 2014). Harrington and Velicer (2015) therefore conclude that visual analysis “is prone to bias and should not be used as a stand-alone analytical method” (p. 181).

A meta-analysis by Ninci, Vannest, Willson, and Zhang (2015) showed that the use of precise operational definitions of the data aspects that are important for visual analysis, together with other methodological and procedural variables, can increase the interrater agreement among visual analysts of SCED data. Various operational definitions and quantifications have been developed over the years to assess the data aspects suggested in the What Works Clearinghouse guidelines (Kratochwill et al., 2010). These operational definitions and quantifications are referred to as effect size measures (ESMs). Table 1 offers an overview of ESMs for each data aspect. The overview is based on a review of the available literature on ESMs for SCEDs, but we do not claim comprehensiveness. As Table 1 indicates, several data aspects such as level, trend, and overlap have been studied more extensively than others. For example, it seems that there are many more proposals for quantifying overlap than there are proposals for quantifying immediacy.

Table 1.

Overview of ESMs for Each Data Aspect.

Data aspect	ESMs	Key references
Level	R_n statistic for multiple-baseline designs	Revusky (1967)
Level	Standardized mean difference (SMD)	Gingerich (1984)
Level	Mean baseline reduction (MBR)	O’Brien and Repp (1990), Olive and Smith (2005)
Level	Pooled standardized mean difference (pSMD)	Cohen (1992), Beretvas and Chung (2008), Manolov and Moeyaert (2017)
Level	Busk and Serlin’s d-statistic	Busk and Serlin (1992), Beeson and Robey (2006)
Level	Level and slope treatment effect (LSTE)	Allison and Gorman (1993)
Level	f² using variance ratios	Kromrey and Foster-Johnson (1996)
Level	Absolute mean difference (AMD)	Edgington and Onghena (2007), Bulté and Onghena (2008)
Level	MASCD-g (requires more than one case)	Shadish, Rindskopf, and Hedges (2008)
Level	Slope and level change (SLC)	Solanas, Manolov, and Onghena (2010)
Level	The Hedges, Pustejovsky, and Shadish d-statistic for SCDs (requires at least three cases)	Shadish, Hedges, and Pustejovsky (2014)
Level	Phase median difference (PMD)	Wilbert (2014)
Trend	Theil–Sen regression	Theil (1950), Sen (1968)
Trend	Semiaverage method	Parsonson and Baer (1978)
Trend	Split-middle technique	e.g., Kazdin (1982), Lane and Gast (2014)
Trend	Gorsuch’s trend analysis	Gorsuch (1983), Manolov and Solanas (2008)
Trend	Piecewise regression	Center, Skiba, and Casey (1985)
Trend	LSTE	Allison and Gorman (1993)
Trend	Ordinary least squares (OLS)	e.g., Kromrey and Foster-Johnson (1996), Huitema and McKean (2000)
Trend	SLC	Solanas et al. (2010)
Trend	Generalized least squares (GLS)	Maggin et al. (2011)
Variability	Semi-interquartile range coefficient (semi-IQR)	e.g., Bogle and Harris (1994)
Variability	f² using variance ratios	Kromrey and Foster-Johnson (1996)
Variability	Range and/or standard deviation	Kratochwill et al. (2010)
Variability	Stability envelopes	Lane and Gast (2014)
Overlap	Percentage of nonoverlapping data (PND)	Scruggs, Mastropieri, and Casto (1987), Schlosser, Lee, and Wendt (2008)
Overlap	Percentage of zero data (PZD)	Scotti, Evans, Meyer, and Walker (1991), Harvey, Boer, Meyer, and Evans (2009)
Overlap	Percentage of data exceeding the median (PEM)	Ma (2006)
Overlap	Percentage of all nonoverlapping data (PAND)	Parker, Hagan-Burke, and Vannest (2007)
Overlap	Pearson’s phi	Parker and Hagan-Burke (2007)
Overlap	Relative success rate (RSR)	Parker and Hagan-Burke (2007)
Overlap	Nonoverlap of all pairs (NAP)	Parker and Vannest (2009)
Overlap	Improvement rate difference (IRD)	Parker, Vannest, and Brown (2009)
Overlap	Percentage of data exceeding median trend (PEM-T)	Wolery, Busick, Reichow, and Barton (2010)
Overlap	Tau-U	Parker, Vannest, Davis, and Sauber (2011)
Immediacy	Immediate treatment effect index (ITEI)	Michiels, Heyvaert, Meulders, and Onghena (2017)
Immediacy	Piecewise regression	Center et al. (1985)
Immediacy	Bayesian unknown change-point models	Natesan and Hedges (2017)

Note. ESMs = effect size measures; MASCD = meta-analysis of single-case design; SCD = single-case design; LSTE = Level and slope treatment effect; SLC = Slope and level change.

The advocates of statistical analysis of SCED data have argued that ESMs, confidence intervals, and statistical tests have the advantage of producing identical results independently of who performs the analysis (Heyvaert & Onghena, 2014; Park et al., 1990). Further advantages include that Type I and Type II error rates are better accounted for (Matyas & Greenwood, 1990) and that quantification of the effect(s) allows for easier comparison within and between studies (Brossart, Parker, Olson, & Mahadevan, 2006). Notwithstanding this controversy regarding what method is superior, there seems to be a growing awareness in the field that the two approaches are best used concurrently (e.g., Bulté & Onghena, 2012; Kromrey & Foster-Johnson, 1996; Manolov & Moeyaert, 2017; Michiels et al., 2017; Parker & Brossart, 2003; Perdices & Tate, 2009). If we want the visual and statistical analyses to complement and support each other, developing ESMs for each data aspect that is visually analyzed can be beneficial. As Ninci et al. (2015) conclude, “The levels of reliability in regard to visually analyzed ratings are often considered unacceptable. Including effect sizes provides a means of interpreting the reliability and generalizability of results; this can be useful for the acceptance of single-case research methods [. . .]” (p. 536).

The Lack of a Measure for Consistency

As Table 1 indicates, several ESMs have been offered for five out of the six data aspects suggested by the What Works Clearinghouse guidelines (Kratochwill et al., 2010). Using statistical and visual analysis in such a complementary way can greatly strengthen the conclusions drawn and increase the acceptance in the scientific community (Michiels et al., 2017). However, to the best of our knowledge, no quantification exists yet for expressing the degree of consistency in SCED data.

We believe it is worthwhile for the SCED community to clearly delineate their use of the concept “consistency” and then to consider quantifications of the degree of consistency in SCED data. This clarification and possible quantification is beneficial because “consistency,” as such, is a broad and ambiguous concept that is defined and applied in many ways across scientific disciplines. In psychometrics, for example, internal consistency refers to the extent to which the items of a test jointly measure the same construct (Cronbach, 1951; Henson, 2001); in statistics, consistency is the property of an estimator to converge toward the true population value if the number of measurements increases (e.g., Newey & McFadden, 1994); in logic, a theory is consistent if none of its statements are contradictory (Audi, 1999); and in epidemiology, consistency is one of the Bradford Hill criteria for inferring a causal relationship (Hill, 1965).

However, with respect to the analysis of SCED data, “consistency” has a very specific meaning:

“Consistency of data in similar phases” involves looking at data from all phases within the same condition (e.g., all “baseline” phases; all “peer-tutoring” phases) and examining the extent to which there is consistency in the data patterns from phases with the same conditions. The greater the consistency, the more likely the data represent a causal relation. (Kratochwill et al., 2010, p. 18)

As this definition highlights, consistency plays a key role in establishing a causal link between the manipulation of the independent variable(s) and the dependent variable in SCEDs (Baer, 1977). The What Works Clearinghouse definition likely goes back to the guidelines for visual analyses by Horner et al. (2005). Horner et al. explain that visual analysts have to judge the “consistency of data patterns across multiple presentations of intervention and nonintervention conditions” (p. 171). In spite of not labeling it as consistency, one of the earliest definitions of this data aspect was given by Parsonson and Baer (1978): “While assessment of data within phases and between adjacent phases forms a major part of the visual analytic process, judgment of the congruity of data across experimentally similar phases is also important” (p. 128). These definitions circumscribe consistency as a data aspect that has to be assessed between phases implementing the same manipulation of the independent variable.

Other researchers describe consistency in the light of replicating a potential effect. As Kazdin (1982) briefly notes, the establishment of an effect through visual analysis of SCEDs depends among others on “the consistency of the effect across phases or baselines, depending on the particular design” (p. 237). Following this line of thought Barker et al. (2011) explain that “a treatment effect is inferred when replication is consistent” (p. 158). More recently, Ledford et al. (2018) have argued that consistency involves both data patterns between similar conditions and between different conditions:

Consistency refers to the extent to which data patterns are the same within like conditions (e.g., in both baseline conditions in an A–B–A–B design; in baseline conditions for all participants in a multiple baseline across participants design) and the extent to which changes (in level, trend, or variability) are the same for each potential demonstration of effect. In SCD research, the critical factor in determining a functional relation is the consistency of behaviour change between conditions; consistent but small changes in level between conditions are superior to inconsistent changes of larger magnitude. (pp. 6-7, emphasis in original)

Based on this definition, the current article proposes two major approaches to quantify consistency in SCEDs. First, we propose to quantify consistency as the extent to which data patterns are the same within similar conditions, using the Manhattan distance (MD) between data points. Next, we propose to quantify consistency of each potential demonstration of an effect, using a metameasure of the other five data aspects. Finally, we show how these consistency measures can support the visual analysis of SCED data. Recently, a first proposal for assessing the consistency of effects visually has been published by Manolov (2018). He proposes multilevel estimates of the variance across effects as visual aids for assessing consistency in the context of multiple-baseline designs. This approach is, however, not applicable to A-B-A-B designs as the variance would be calculated based on only two data points (one for each change from A phase to B phase). To demonstrate our methods to assess consistency in A-B-A-B phase designs, we use four examples from published articles. The underlying rationale for focusing on A-B-A-B phase designs is threefold. First, it has been argued that A-B-A-B phase designs are perhaps the most widely known form of SCEDs (Barlow et al., 2009). This is also reflected in the emphasis on A-B-A-B phase designs in the What Works Clearinghouse guidelines. Second, A-B-A-B phase designs are more rigorous than, for example, A-B and A-B-A phase designs by controlling the flaws present in these designs (Barlow et al., 2009). Finally, an A-B-A-B phase design has two similar conditions of each manipulation of the independent variable and offers three potential demonstrations of an effect. The A-B-A-B design is therefore the minimum design in which consistency within participants can be assessed as each phase and phase change from baseline to intervention occurs twice. 
Any developed measures for consistency in A-B-A-B phase designs can then be expanded to other forms of SCEDs.

Four Examples of A-B-A-B Phase Designs

The first data set (see Figure 1) was retrieved from Yuen (1993). He used an A-B-A-B phase design to investigate the efficacy of the purposeful use of an additional template in a woodworking task for a woman with cortical blindness. The dependent variable was productivity, measured as the number of usable brackets outlined by the woman divided by the maximum number of usable brackets that could be outlined, expressed as a percentage from 0 to 100. The data set contains 29 measurement occasions. The second data set (see Figure 2) was retrieved from a methodological article by Heyvaert and Onghena (2014). This data set consists of 27 measurement occasions for a male participant on the 15-item Impact of Event Scale (IES; Horowitz, Wilner, & Alvarez, 1979), with a possible range of 0 to 75, to evaluate a new treatment program to decrease posttraumatic stress symptoms. The third data set (see Figure 3) was retrieved from the original What Works Clearinghouse guidelines (Kratochwill et al., 2010), whose authors constructed a hypothetical data set with a range from 0 to 100 to illustrate guidelines for visually analyzing SCED data. This data set contains 41 measurement occasions. The fourth data set (see Figure 4) was retrieved from Mackay, McLaughlin, Weber, and Derby (2001). The authors used an A-B-A-B phase design to study the effectiveness of a precision request procedure to decrease the noncompliance of a child with disabilities. The dependent variable was measured with count data as the number of noncompliant behaviors per day for 20 days. For the Yuen, Kratochwill et al., and Mackay et al. data sets, raw data were unavailable. They were recovered from the published graphs using “GetData Graph Digitizer” Version 2.26 (Fedorov, 2013). Raw data and descriptive statistics for all four data sets are available in the appendix.

Figure 1.

A-B-A-B phase design retrieved from Yuen (1993).

Figure 2.

A-B-A-B phase design retrieved from Heyvaert and Onghena (2014).

Figure 3.

A-B-A-B phase design retrieved from Kratochwill et al. (2010).

Figure 4.

A-B-A-B phase design retrieved from Mackay et al. (2001).

Consistency Between Similar Phases: Operationalizations Based on MD

As both the definitions by Ledford et al. (2018) and Kratochwill et al. (2010) about consistency across similar phases were proposed in the context of visual analysis, they leave considerable space for interpretation on how to quantify this data aspect. In other scientific domains, the MD has a long-standing tradition in assessing similarity between data patterns (Cha, 2007). In analogy-based estimation, for example, the MD is a widely used similarity measure that takes into account the distance between pairs of projects (Chiu & Huang, 2006). A difference between the MD and competing similarity measures is that it computes the sum of the absolute differences rather than their squares, as is the case, for example, with the Euclidean distance (Kokare, Chatterji, & Biswas, 2003). An advantage of this is that it reduces the excessive influence of outlying data points, as two time series can be similar even if one has an outlying data point. A conceptual advantage of MD is that it calculates the distance between two points if the only paths you can take are parallel to the axes (Sherwood, Perelman, Hamerly, & Calder, 2002), as shown by the dashed lines parallel to the y-axis in Figure 5. MD can also be applied to assess consistency in single-case phase designs: If the data patterns within similar conditions are more or less consistent, the MD between data points occurring at paired moments in time (for example, the MD between the first measurement in A1 and the first measurement in A2, and so on) should be low. The vertical dashed lines in Figure 5 show the MD between seven paired measurement occasions of the B1 and B2 phases in the Yuen data set.

Figure 5.

Example of Manhattan distance for seven paired measurement occasions from the B1 and B2 phases of the Yuen data set.

We compare scores at paired moments in time to evaluate if the two data patterns evolve consistently over time. To obtain the MD between two phases, we simply sum up the absolute differences (i.e., the vertical dashed lines in Figure 5). The MD then equals:

MD = \sum_{i = 1}^{n} | x_{i} - y_{i} | .

(1)

As we calculate the MD separately for the A and B phases, $x_{i}$ represents values from either the A1 or B1 phase and $y_{i}$ values from the corresponding A2 or B2 phase. The index i denotes the position of the paired observations within their phases. For example, for i = 1 in Figure 5, $| B 1_{1} - B 2_{1} |$ would equal $| 83 - 95 | = 12$. The MD for the two phases shown in Figure 5 can accordingly be calculated as: $| 83 - 95 | + | 78 - 88 | + | 87 - 97 | + | 78 - 97 | + | 86 - 92 | + | 93 - 100 | + | 90 - 97 | = 71$. An obvious limitation of this approach is that the MD increases with the number of measurement occasions. Dividing by the number of paired measurement occasions, n, adjusts the MD for this problem. The mean Manhattan distance (MMD) then equals:

MMD = \frac{1}{n} \sum_{i = 1}^{n} | x_{i} - y_{i} | .

(2)

The MMD for the example in Figure 5 is then 71/7 = 10.14.
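The arithmetic of Equations 1 and 2 can be checked with a minimal Python sketch. The function names `manhattan_distance` and `mean_manhattan_distance` are hypothetical and chosen only for illustration; the seven paired values are the ones used in the worked example above.

```python
def manhattan_distance(x, y):
    """Sum of absolute differences between paired observations (Equation 1)."""
    if len(x) != len(y):
        raise ValueError("both phases must have the same number of observations")
    return sum(abs(a - b) for a, b in zip(x, y))


def mean_manhattan_distance(x, y):
    """MD divided by the number of paired observations n (Equation 2)."""
    return manhattan_distance(x, y) / len(x)


# Seven paired B1/B2 values from the worked example in the text.
b1 = [83, 78, 87, 78, 86, 93, 90]
b2 = [95, 88, 97, 97, 92, 100, 97]

md = manhattan_distance(b1, b2)
mmd = mean_manhattan_distance(b1, b2)
print(md, round(mmd, 2))  # 71 10.14
```

The length check mirrors the requirement that every measurement occasion in one phase has a counterpart in the other.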

Adjusting for Unequal Phase Lengths

Both the MD and MMD require identical phase lengths. This means that for each measurement occasion in the A1 or B1 phase there has to be a corresponding measurement occasion in the A2 or B2 phase. We propose two approaches for cases in which the phases differ in length. The first approach is to omit data points in the longer phase to reduce it to the same length as the shorter phase. In the example depicted in Figure 5, we omitted the last two data points of the B1 phase to reduce it to the same length as the B2 phase and then calculated the MMD for the remaining seven paired measurement occasions. However, if the two phases differ greatly in length, this approach to calculating MMD runs the risk of omitting a lot of data points.

An alternative is to calculate an MD distance measure between every possible sequence of observations in the longer phase—that is equal to the length of the shorter phase—and the shorter phase. The advantage of this approach is that it uses all data which is generally considered as a desirable feature of SCED ESMs (Maggin et al., 2011). For example, the B1 phase in the Yuen data set contains nine data points and the B2 phase seven data points. A first pairing of sequences would be Observations 1 to 7 in each phase, another would be Observations 1 to 7 in B2 and Observations 2 to 8 in B1, and finally Observations 1 to 7 in B2 and Observations 3 to 9 in B1. The number of possible pairings of sequences k then equals:

k = (n_{L} - n_{S}) + 1,

(3)

where $n_{L}$ represents the number of measurement occasions in the longer phase and $n_{S}$ represents the number of measurement occasions in the shorter phase. As mentioned previously, the B1 phase in the Yuen data set contains nine data points and the B2 phase seven data points. The number of possible pairings is then equal to (9 – 7) + 1 = 3. Figure 6 depicts the three pairings of sequences of equal length using the same two phases as in Figure 5 but with the last two data points of the B1 phase added.

Figure 6.

All three pairings of possible sequences of equal length for the B1 and B2 phases of the Yuen data set.

By consecutively shifting the shorter phase by one measurement occasion to the right, we obtain all three possible sequences of equal length. Next, we can calculate the MMD for each possible pairing of sequences. The obtained MMDs can then be averaged across the number of compared sequences by summing the MMDs and dividing by k. The overall MMD (OMMD) across all possible comparisons then equals:

OMMD = \frac{1}{k n_{S}} \sum_{j = 1}^{k} \sum_{i = 1}^{n_{S}} | x_{i j} - y_{i j} | .

(4)

The additional index j denotes the paired sequences. For the sequences depicted in Figure 6, j can, for example, take the values 1, 2, or 3. In Figure 6, the OMMD is equal to (10.14 + 9.43 + 6.71) / 3 = 8.76; the sum of the three MMDs divided by the number of paired sequences that are compared. If two phases have an identical length—as it is, for example, the case with B1 and B2 in the Kratochwill et al. data set and all phases in the Mackay et al. (2001) data set—the MMD and OMMD yield identical results. Both measures are independent of the number of observations and comparisons as we divide by n and k, respectively.
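The sliding-window logic behind Equations 3 and 4 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `overall_mmd` is hypothetical, and the unequal-length example uses invented toy values rather than any of the four published data sets.

```python
def overall_mmd(phase_1, phase_2):
    """Average the MMD over every window of the longer phase (Equation 4)."""
    # Identify the longer and the shorter phase; order of arguments is free.
    if len(phase_1) >= len(phase_2):
        longer, shorter = phase_1, phase_2
    else:
        longer, shorter = phase_2, phase_1
    n_s = len(shorter)
    k = len(longer) - n_s + 1  # number of possible pairings (Equation 3)
    total = 0.0
    for j in range(k):
        window = longer[j:j + n_s]  # j-th sequence of length n_s
        total += sum(abs(a - b) for a, b in zip(window, shorter))
    return total / (k * n_s)


# Toy values, invented for illustration only.
toy_long = [1, 2, 3, 4]
toy_short = [2, 2, 2]
toy_ommd = overall_mmd(toy_long, toy_short)  # windows give MD 2 and 3, so 5 / (2 * 3)

# With equal phase lengths k = 1, so the OMMD reduces to the MMD.
b1 = [83, 78, 87, 78, 86, 93, 90]
b2 = [95, 88, 97, 97, 92, 100, 97]
equal_length_ommd = overall_mmd(b1, b2)
print(round(toy_ommd, 2), round(equal_length_ommd, 2))
```

Dividing the grand total by k times the shorter phase length gives the same result as averaging the k separate MMDs, because every window has the same length.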

Adjusting for the Unit of the Measurement Scale

Both measures, however, still depend on the unit of the measurement scale. The Heyvaert and Onghena data set used, for example, a measurement scale ranging from 0 to 75, while both the Yuen and Kratochwill et al. data sets used measurement scales ranging from 0 to 100 and the Mackay et al. (2001) data set used count data. Both the MMD and OMMD still have to be interpreted in the light of the variability of the data patterns. To remedy this, we propose dividing the OMMD by the pooled standard deviation of the two phases (either A1 and A2 or B1 and B2) as suggested in Van den Noortgate and Onghena (2008). The scale invariant CONDAP for all phases within the same condition then equals:

CONDAP = \frac{OMMD}{\sqrt{\frac{(n_{S} - 1) \times S D_{S}^{2} + (n_{L} - 1) \times S D_{L}^{2}}{n_{S} + n_{L} - 2}}} .

(5)
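A minimal sketch of Equation 5 is given below. It assumes that the standard deviations are sample standard deviations (denominator n − 1), which is what the pooled formula suggests but is not stated explicitly. The function names `overall_mmd`, `pooled_sd`, and `condap` are hypothetical, and the example values are invented toy data.

```python
from math import sqrt
from statistics import variance  # sample variance (n - 1 denominator)


def overall_mmd(phase_1, phase_2):
    """OMMD: mean absolute paired difference over all windows of the longer phase."""
    longer, shorter = sorted((phase_1, phase_2), key=len, reverse=True)
    n_s = len(shorter)
    k = len(longer) - n_s + 1
    total = sum(
        abs(a - b)
        for j in range(k)
        for a, b in zip(longer[j:j + n_s], shorter)
    )
    return total / (k * n_s)


def pooled_sd(phase_1, phase_2):
    """Pooled standard deviation of two phases (denominator of Equation 5)."""
    n_1, n_2 = len(phase_1), len(phase_2)
    pooled_var = ((n_1 - 1) * variance(phase_1)
                  + (n_2 - 1) * variance(phase_2)) / (n_1 + n_2 - 2)
    return sqrt(pooled_var)


def condap(phase_1, phase_2):
    """OMMD expressed in pooled standard deviation units (Equation 5)."""
    sd = pooled_sd(phase_1, phase_2)
    if sd == 0:
        # Assumption: with no variability in either phase the ratio is undefined.
        raise ValueError("pooled standard deviation is zero")
    return overall_mmd(phase_1, phase_2) / sd


# Toy values, invented for illustration only: pooled SD is 1, OMMD is 5/6.
toy_condap = condap([1, 2, 3, 4], [2, 2, 2])
print(round(toy_condap, 2))
```

How to handle a pooled standard deviation of zero is not addressed in the text; raising an error here is only one possible choice.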

The MMD, OMMD, and CONDAP for each data set can be found in Table 2.

Table 2.

MMD, OMMD, and CONDAP to Assess Consistency Between Similar Phases.

Data	Measure	A1A2	B1B2
Yuen (1993)	MMD	6.25	10.14
Yuen (1993)	OMMD	7.67	8.76
Yuen (1993)	CONDAP	1.46	1.66
Heyvaert and Onghena (2014)	MMD	4.83	3.20
Heyvaert and Onghena (2014)	OMMD	4.00	4.07
Heyvaert and Onghena (2014)	CONDAP	1.18	1.41
Kratochwill et al. (2010)	MMD	9.90	7.10
Kratochwill et al. (2010)	OMMD	9.65	7.10
Kratochwill et al. (2010)	CONDAP	1.14	0.66
Mackay, McLaughlin, Weber, and Derby (2001)	MMD	3.80	2.40
Mackay, McLaughlin, Weber, and Derby (2001)	OMMD	3.80	2.40
Mackay, McLaughlin, Weber, and Derby (2001)	CONDAP	1.11	1.43

Note. MMD = mean Manhattan distance; OMMD = overall MMD; CONDAP = CONsistency of DAta Patterns.

Interpretation of CONDAP

Table 2 shows how OMMD becomes scale invariant when converted to CONDAP. CONDAP expresses the distance between two phases in units of standard deviations. If two data patterns from similar conditions are perfectly consistent (i.e., identical), the value of any MD-based measure will be zero. The reference number against which the MD-based consistency measures should be judged is thus zero, meaning that the closer the CONDAP is to zero, the more consistent the two data patterns are. As a distance measure, CONDAP is therefore essentially a measure of inconsistency. For example, initially we found an OMMD of 4.07 for the B1/B2 comparison of the Heyvaert and Onghena data set and an OMMD of 7.10 for the B1/B2 comparison of the Kratochwill et al. data set. As 4.07 is closer to zero than 7.10, we might conclude that the B1/B2 data patterns are more consistent in the Heyvaert and Onghena data set when only looking at the absolute value. However, by converting the OMMD to CONDAP, we see that the standardized consistency is higher in the Kratochwill et al. data set (CONDAP = 0.66) when compared with the Heyvaert and Onghena data set (CONDAP = 1.41). Similarly, the baseline OMMD in the Mackay et al. (2001) data set was initially higher than the intervention phase OMMD. Standardizing OMMD to CONDAP shows that the baselines in the Mackay et al. data set are actually more consistent than the intervention phases. These examples are in line with visual analysis as will be shown later. CONDAP is insensitive to the unit of the measurement scale. The Heyvaert and Onghena data set utilized a measurement scale ranging from 0 to 75, whereas the Kratochwill et al. data set used a measurement scale ranging from 0 to 100 and the Mackay et al. data set used count data. Furthermore, CONDAP is sensitive to differences in central tendency between two phases. For example, the B1 and B2 phases in the Yuen data set seem to be similar at first sight.
However, due to the noticeable difference in central tendency, the CONDAP for this comparison is the highest found in all example data sets; as such, a difference in level is a sign of inconsistency. Based on a systematic review of applied A-B-A-B phase designs published over the past 50 years (Tanious, De, Michiels, Van den Noortgate, & Onghena, 2018), we offer the following guidelines for interpreting the amount of consistency: very high, 0 ≤ CONDAP ≤ 0.5; high, 0.5 < CONDAP ≤ 1; medium, 1 < CONDAP ≤ 1.5; low, 1.5 < CONDAP ≤ 2; very low, CONDAP > 2.
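The interpretation guidelines can be expressed as a simple lookup, sketched below. The function name `condap_label` is hypothetical, and assigning a CONDAP of exactly 1.5 to the medium band is an assumption made so that the bands cover every non-negative value.

```python
def condap_label(condap):
    """Map a CONDAP value to the verbal consistency bands given in the text."""
    if condap < 0:
        raise ValueError("CONDAP is a distance measure and cannot be negative")
    if condap <= 0.5:
        return "very high"
    if condap <= 1.0:
        return "high"
    if condap <= 1.5:  # assumption: 1.5 itself is treated as medium
        return "medium"
    if condap <= 2.0:
        return "low"
    return "very low"


# B1/B2 CONDAP values as reported in Table 2.
for name, value in [("Yuen", 1.66), ("Heyvaert and Onghena", 1.41),
                    ("Kratochwill et al.", 0.66), ("Mackay et al.", 1.43)]:
    print(name, condap_label(value))
```

Because CONDAP measures distance, smaller values map to higher consistency labels.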

Consistency Between Adjacent Phases: Operationalizations Based on ESMs

Next to consistency between similar conditions, we can further assess the consistency for each potential demonstration of an effect. As the definition by Ledford et al. (2018) highlights, consistency of potential demonstrations of an effect involves an assessment of several data aspects for the amount of behavior change between conditions. Therefore, we call the comparison of ESMs for the five data aspects CONsistency of the EFFects (CONEFF). In fact, Barton, Lloyd, Spriggs, and Gast (2018) argue that this assessment of consistency is the primary factor when drawing conclusions about the existence of a functional relation. In the following, we want to outline how to summarize the consistency between potential demonstrations of an effect for each data aspect separately.

Each time when a phase change occurs from baseline to intervention or vice versa, it is possible to calculate the amount of change in these five data aspects: level, trend, variability, immediacy, and overlap. If the demonstration of an effect is consistent, so should be the changes in these five data aspects between A1 and B1 on one hand and A2 and B2 on the other. As Barton et al. (2018) point out, “Consistency also applies to behavior change across conditions. For example, the immediacy and magnitude of behavior change should be consistent each time similar condition changes occur” (p. 194). In an A-B-A-B design, two similar condition changes occur when changing from baseline to intervention. This is not to say that these are the only moments at which an effect can be demonstrated. We agree with Barlow et al. (2009) that an effect can also be demonstrated when changing from intervention back to baseline. However, a phase change from baseline to intervention is conceptually different from a change from intervention back to baseline. In addition, the only phase change that occurs twice in an A-B-A-B design is from baseline to intervention. Therefore, it is possible to assess the CONEFF for each of the five data aspects between these two demonstrations of an effect. As Barton et al. (2018) put it, “The purpose of SCD research is to determine if behavior change occurs when the intervention is introduced, and whether the behavior change can be replicated” (p. 190). Two steps are involved in calculating the CONEFF measures for A-B-A-B phase designs. In a first step, one needs to calculate an ESM for each of the five data aspects separately for each pair of adjacent AB phases. 
Based on a literature review (see Table 1), we chose the following ESMs for each data aspect: pooled standardized mean difference (pSMD; level), difference in ordinary least squares (OLS; trend), variance ratios (SD; variability), nonoverlap of all pairs (NAP; overlap), and immediate treatment effect index (ITEI; immediacy). An excellent overview of the main characteristics of several of these techniques is given by Manolov and Moeyaert (2017). We have chosen these five operationalizations for several reasons. First, pSMD, OLS, and SD are well-established measures in statistical theory. These measures are familiar even to researchers with little experience in conducting and analyzing data from SCEDs. Second, NAP offers several advantages over competing nonoverlap measures. It can be easily calculated by hand for shorter time series. Furthermore, NAP can be directly calculated from intermediate output of the nonparametric Mann–Whitney U and correlates well with the familiar R² (Parker & Vannest, 2009). Third, ITEI directly follows the recommendations given in the What Works Clearinghouse guidelines. As the calculation of CONEFF is generic, researchers might, however, choose other ESMs if they prefer to do so.
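
To make the by-hand calculation of NAP concrete, the following Python sketch counts, over all pairs of one baseline and one intervention observation, how often the intervention observation improves on the baseline observation, with ties counted as half. The function name, the `decrease_is_improvement` flag, and the data values are illustrative assumptions and are not taken from the example data sets.

```python
from itertools import product


def nap(phase_a, phase_b, decrease_is_improvement=False):
    """Nonoverlap of all pairs: share of (A, B) pairs in which B improves on A.

    Ties contribute 0.5 to the count of improving pairs.
    """
    if not phase_a or not phase_b:
        raise ValueError("both phases need at least one observation")
    score = 0.0
    for a, b in product(phase_a, phase_b):
        if a == b:
            score += 0.5
        elif (b < a) if decrease_is_improvement else (b > a):
            score += 1.0
    return score / (len(phase_a) * len(phase_b))


# Hypothetical data: a target behavior that should decrease under intervention.
baseline = [12, 14, 11, 13, 12]
intervention = [8, 7, 11, 6, 5]
print(nap(baseline, intervention, decrease_is_improvement=True))
```

With these hypothetical values, 24 of the 25 pairs show improvement and one pair is tied, so the function returns 24.5 / 25 = .98.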

pSMD, ITEI, and NAP take into account data from adjacent phases by default. To calculate changes in trend, we employed the OLS method described in Kromrey and Foster-Johnson (1996), which calculates separate trend lines for the A and B phases and obtains an effect size from the associated R² measures. For changes in variability, we calculated the variance ratios between each A phase and the following B phase. For instructions on how to calculate each of these five effect sizes, the interested reader is referred to the key references listed in Table 1. Table 3 gives an overview of the results of this first step in the columns labeled “A1B1” and “A2B2.”

Table 3.

Consistency as a Measure of Replicability.

Data set	Data aspect	A1B1	A2B2	Absolute difference
Yuen (1993)	Level	4.16	7.55	3.39
	Trend	1.94	0.06	1.88
	Variability	1.56	1.04	0.52
	Overlap	1.00	1.00	0.00
	Immediacy	25.33	28.00	2.67
Heyvaert and Onghena (2014)	Level	2.01	3.96	1.95
	Trend	0.08	0.01	0.07
	Variability	2.81	1.17	1.64
	Overlap	0.95	1.00	0.05
	Immediacy	5.67	10.67	5.00
Kratochwill et al. (2010)	Level	4.20	4.60	0.40
	Trend	0.47	0.09	0.38
	Variability	3.12	1.25	1.87
	Overlap	1.00	1.00	0.00
	Immediacy	36.00	39.00	3.00
Mackay, McLaughlin, Weber, and Derby (2001)	Level	2.15	2.99	0.84
	Trend	0.28	0.00	0.28
	Variability	4.14	4.38	0.24
	Overlap	0.98	1.00	0.02
	Immediacy	5.33	7.33	2.00

Note. Level: Pooled standardized mean difference (pSMD); Trend: Ordinary least squares (OLS); Variability: Variance ratio (SD); Overlap: Nonoverlap of all pairs (NAP); Immediacy: Immediate treatment effect index (ITEI).

In a second step, the absolute difference between “A1B1” and “A2B2” can be calculated for each data aspect. The closer this number is to zero, the more consistent the replication of an effect is for that data aspect.
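
As an illustration of this second step, the sketch below takes the “A1B1” and “A2B2” values reported in Table 3 for the Heyvaert and Onghena (2014) data set and computes the absolute difference per data aspect. The dictionary layout and the function name `coneff` are assumptions made for this example only.

```python
# Effect sizes (A1B1, A2B2) for the Heyvaert and Onghena (2014) data set, copied from Table 3.
heyvaert_onghena = {
    "Level": (2.01, 3.96),
    "Trend": (0.08, 0.01),
    "Variability": (2.81, 1.17),
    "Overlap": (0.95, 1.00),
    "Immediacy": (5.67, 10.67),
}


def coneff(effect_sizes):
    """Absolute difference between the two demonstrations, per data aspect."""
    return {
        aspect: round(abs(first - second), 2)
        for aspect, (first, second) in effect_sizes.items()
    }


for aspect, difference in coneff(heyvaert_onghena).items():
    print(f"{aspect}: {difference:.2f}")
```

The printed values (1.95, 0.07, 1.64, 0.05, and 5.00) match the “Absolute difference” column of Table 3 for this data set.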

Linking Statistical and Visual Analysis

Visual analysis of SCED data remains frequently used and is sometimes the only analytical technique used in determining a functional relationship between the experimental manipulation(s) and the dependent variable(s) (Manolov & Moeyaert, 2017). Kennedy (2005) argues, for example, that a graphical display of the data is “the most revealing way of analyzing the data and provides the most information to the viewer” (p. 192). However, in light of the known shortcomings of visual analysis, we strongly recommend using visual and statistical analysis of SCEDs concurrently. This will provide even more detailed and contextualized information to the reader and fellow researchers. As Kratochwill and Levin (2014) point out, such a dual analytical approach should be widely applied: “For the future, we envision more widespread application of quantitative analyses, as critical adjuncts to visual analysis, in both primary single-case intervention research studies and literature reviews in the behavioral, educational, and health sciences” (p. 231). Using the four example data sets, we first want to show how CONDAP can be used to supplement the visual analytical process. Subsequently, we will show how CONEFF can be used in a similar way.

CONDAP

At first, a visual analyst might examine the extent to which there is consistency in the data patterns from all baselines. The A1 phase of the Yuen data set (see Figure 1) shows a clear negative trend. It appears that the A2 phase initially shows a negative trend as well; this trend, however, does not continue after the first three data points, and the remaining data points show no clear trend. Such trends, as well as the variability in the data patterns, can make it difficult to see what is happening (Morley, 2018). In addition, the A2 phase has five data points more than the A1 phase, making it even more difficult to assess the degree of consistency between the two phases by merely relying on visual analysis. The CONDAP for this comparison equals 1.46, indicating medium consistency. The A1 phase of the Heyvaert and Onghena data set consists of six data points and the data pattern is W-shaped for the first five data points. The first six measurements of the A2 phase show a similar W-shaped pattern, but the last three measurements show a pattern that is not reflected in the A1 phase. The CONDAP for this comparison equals 1.18, indicating medium consistency as well, but higher than for the Yuen data set. In contrast to the previous two data sets, in the Kratochwill et al. data set the A1 (11 data points) and A2 (10 data points) phases differ only by one observation in length. The A1 phase initially shows a negative trend, which is not reflected in the A2 phase. However, the remaining measurements of both data patterns resemble each other with a positive trend. Furthermore, both data patterns seem to have similar levels. The CONDAP for this comparison equals 1.14, indicating medium consistency similar to the Heyvaert and Onghena data set. The Mackay et al. (2001) data set has an equal number of data points in A1 and A2 (five data points each).
The shape of the two baseline data patterns is nearly identical with an initial decrease, subsequent increase, and a final decrease in the therapeutic direction in the last two data points. But there is a noticeable difference in level between the two phases. The CONDAP of 1.11 indicates medium consistency similar to the previous two data sets. If we rank order the CONDAPs for the A1/A2 comparisons of the example data sets, we obtain the following order (from most consistent to least consistent): Mackay et al., Kratochwill et al., Heyvaert and Onghena, and Yuen.

Subsequently, a visual analyst might examine the extent to which there is consistency in the data patterns of all intervention phases. The B1 phase of the Yuen data set contains nine data points and initially shows a clear W-shaped form. This W-shaped pattern is repeated in a somewhat distorted form in the beginning of the B2 phase, which only contains seven data points. However, the last two data points of the B1 phase seem to show an increasing trend, whereas the last two data points of the B2 phase show a decreasing trend. Furthermore, there is a noticeable difference in level between the two data patterns. The CONDAP for this comparison equals 1.66, indicating low consistency. The B1 phase of the Heyvaert and Onghena data set contains five data points and the B2 phase contains seven data points. At first sight, the two data patterns do not seem to resemble each other much. However, a closer visual inspection reveals, for example, that the last three data points of each phase show nearly identical patterns. Furthermore, both data patterns show decreasing trends at first with subsequent increases. However, there also seems to be a difference in level. The CONDAP for this comparison equals 1.41, indicating medium consistency. The B1 and B2 phases of the Kratochwill et al. data set have identical lengths (10 data points). Both data patterns show a decreasing trend over time. The B1 phase shows somewhat more variability, but overall the two data patterns resemble each other strongly. The CONDAP for this comparison equals .66, indicating high consistency. The intervention phases of the Mackay et al. (2001) data set also have an identical number of data points (five each). Both patterns develop roughly similarly over time, with an initial decrease followed by an increase in the contra-therapeutic direction. In the B1 phase, there is, however, at first a slight increase and no decrease for the last measurement, contrary to the B2 phase.
In addition, there is a visually apparent difference in level. The CONDAP of 1.43 therefore indicates medium consistency. If we rank order the CONDAPs for the B1/B2 comparisons of the example data sets, we obtain the following order (from most consistent to least consistent): Kratochwill et al., Heyvaert and Onghena, Mackay et al., and Yuen. The ordering of CONDAP might prove especially useful in cases where the consistency between data patterns is not readily distinguishable by mere visual inspection of the graphed data.
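
The two rank orders reported above can be reproduced with a few lines of Python. The CONDAP values are the ones stated in the text; the dictionary layout and the helper name `rank_by_consistency` are assumptions made for this illustration.

```python
# CONDAP values reported in the text for the baseline (A1/A2) and intervention (B1/B2) comparisons.
condap = {
    "Yuen": {"A1/A2": 1.46, "B1/B2": 1.66},
    "Heyvaert and Onghena": {"A1/A2": 1.18, "B1/B2": 1.41},
    "Kratochwill et al.": {"A1/A2": 1.14, "B1/B2": 0.66},
    "Mackay et al.": {"A1/A2": 1.11, "B1/B2": 1.43},
}


def rank_by_consistency(values, comparison):
    """Order data sets from most to least consistent (lower CONDAP = more consistent)."""
    return sorted(values, key=lambda name: values[name][comparison])


print(rank_by_consistency(condap, "A1/A2"))
print(rank_by_consistency(condap, "B1/B2"))
```

Because a CONDAP of zero indicates perfect consistency, sorting in ascending order yields the ranking from most to least consistent.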

CONEFF

As Morley (2018) points out, some data sets produce obvious effects: “For example in an A-B-A-B design when the phases are stable, with little variability or trend, when the intervention produces a large and immediate effect in the response and where withdrawing and reintroducing the intervention produces similar effects” (p. 88). In these instances, a consistent replication of the effect can be easily detected through visual inspection of the graphed data. However, in instances where one or more of the data aspects are not stable, the CONEFF measures can be a valuable supplement in support of visual analysis.

For example, the Yuen data set shows a perfect replication of nonoverlap for both potential demonstrations of the effect (absolute difference = 0). In addition, each introduction of the intervention results in a large change of level, with the pSMD above four and seven, respectively. The second introduction of the intervention leads, however, to a larger change in level (absolute difference = 3.39). Similarly, the intervention results in an immediate treatment effect with each introduction (ITEI > 25 in both cases). The absolute difference for immediacy equals 2.67 between both replications. The A1 phase shows a clear downward trend that is reversed in the B1 phase (OLS = 1.94). Between the A2 and B2 phases, there is only a minimal change in trend (OLS = .06). The variance ratio for the first demonstration, between A1 and B1 (1.56), is higher than the variance ratio for the second demonstration, between A2 and B2 (1.04). A variance ratio so close to 1 indicates only a minimal change in variability between the two phases.

Similar to the Yuen data set, the Heyvaert and Onghena data set shows a nearly perfect replication for nonoverlap. The absolute difference between the NAPs for both comparisons is only .05. Furthermore, visual inspection of the graph reveals that neither of the replications results in a noticeable change in trend. The absolute difference in the OLS measure is only .07. In addition, each intervention phase results in a decrease of level compared with the previous baseline. It can be seen that this decrease in level is, however, stronger for the second replication. The absolute difference of 1.95 between both pSMDs strengthens this visually apparent conclusion. Similarly, the immediacy of the treatment effect is visually apparent at both phase changes from baseline to intervention. The second introduction, however, results in a larger immediate treatment effect. The absolute difference between the ITEIs is equal to 5. The variance ratio between the first baseline and intervention phase equals 2.81. This change in variability might be hard to detect merely by visual analysis due to the one outlying data point and the fact that the B1 phase only contains five data points. The second introduction of the intervention leads to a much smaller change in variability with the variance ratio only being 1.17. The overall absolute difference in variance ratios is equal to 1.64.

It is visually apparent that both introductions of the intervention in the Kratochwill et al. data set lead to a complete nonoverlap. Therefore, the absolute difference between both NAPs equals 0. Visual analysis of the graphed data furthermore reveals that the intervention results in a decrease of level in the target behavior both times. The absolute difference between the two pSMDs is only .40, indicating that the changes in level are similar for both replications. Both introductions of the intervention lead to an immediate decrease in target behavior. The absolute difference between both ITEIs is only 3, indicating that the visually apparent immediate decrease in target behavior is similar for both replications. In addition, a negative trend in both intervention phases can be seen with each preceding baseline not showing a clear trend. The absolute difference in both OLS measures is only .38. The change in variability between both replications is visually less apparent. As the CONEFF, however, reveals, the first introduction of the intervention and the preceding baseline have a variance ratio of 3.12, whereas the second introduction of the intervention and the second baseline have a variance ratio of only 1.25. The absolute difference in variability between both replications thus equals 1.87.

Finally, the Mackay et al. (2001) data set also shows a near-perfect replication of nonoverlap with NAP = .98 and 1, respectively. However, both baselines show a trend in the therapeutic direction before introduction of the intervention. This raises the question of whether the decrease in target behavior during the intervention phase is just a continuation of the baseline trend. As the OLS measures indicate, there is only a small change in trend between A1 and B1 (.28) and no change in trend between A2 and B2 (.00), which might indeed indicate that the trends from the baseline phases just continue into the intervention phases. The consistent and visually apparent changes in level (pSMD = 2.15 and 2.99) and immediacy (ITEI = 5.33 and 7.33) with each introduction of the intervention should thus be interpreted with caution. Furthermore, each introduction of the intervention leads to quite large changes in variability with both variance ratios above four.

Discussion

Consistency is the only data aspect suggested in the What Works Clearinghouse guidelines for visual analysis of SCEDs that has not yet been formally operationalized. The present article addressed this gap in the existing literature by introducing two new measures to assess the degree of consistency in single-case A-B-A-B phase designs, CONDAP and CONEFF. CONDAP was introduced as a quantification of the degree of consistency between data patterns of phases implementing the same condition. CONEFF was introduced as a measure to assess the consistency between both potential replications of an effect in A-B-A-B phase designs by systematically assessing each of the other five data aspects for each phase change from baseline to intervention. Using four example data sets from published literature, we first introduced a step-by-step guide on how to calculate CONDAP. Starting with a situation in which both phases have the same number of data points, we introduced the MMD as a means of obtaining the average MD between all paired observations of two data points. We then introduced the OMMD for situations in which the two phases differ in length. Finally, we converted the OMMD to CONDAP to make the measure scale invariant. Subsequently, we introduced CONEFF as a means of quantifying the changes in each data aspect with each introduction of the intervention. It was demonstrated how CONEFF can be calculated with previously validated ESMs for each data aspect to obtain a complete picture of the data set. As the calculation of CONEFF is generic, researchers might also choose other ESMs presented in Table 1 for each data aspect to calculate CONEFF according to the a priori hypotheses.
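
The first two building blocks recapitulated above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: MMD is taken as the mean absolute difference between paired observations, and OMMD is read as the average MMD over every contiguous placement of the shorter phase along the longer one, which is an interpretation of "all possible sequences of equal length". The function names and data are hypothetical, and the final conversion from OMMD to CONDAP is not shown because its formula is not restated in this section.

```python
def mmd(x, y):
    """Mean Manhattan distance between two equally long data patterns."""
    if len(x) != len(y) or not x:
        raise ValueError("patterns must be non-empty and of equal length")
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def ommd(x, y):
    """Overall MMD: average MMD over all placements of the shorter pattern.

    Assumption: placements are contiguous windows of the longer pattern.
    """
    short, long_ = (x, y) if len(x) <= len(y) else (y, x)
    if not short:
        raise ValueError("patterns must be non-empty")
    n_windows = len(long_) - len(short) + 1
    total = sum(mmd(short, long_[i:i + len(short)]) for i in range(n_windows))
    return total / n_windows


# Hypothetical phases of unequal length.
a1 = [1, 2]
a2 = [1, 2, 3]
print(ommd(a1, a2))  # windows [1, 2] and [2, 3] give MMDs of 0 and 1
```

When both phases have the same number of data points there is only one placement, so `ommd` reduces to `mmd`.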

Both measures were presented in light of the growing consensus in the field that visual and statistical analysis of SCED data are best used concurrently, an issue which Kratochwill and Brody (1978) already addressed 40 years ago in this journal. The present study was a first attempt to systematically incorporate all six data aspects of the What Works Clearinghouse guidelines in a comprehensive analysis encompassing visual and statistical assessment of the data. As Kahng et al. (2010) pointed out, quantifications are of paramount importance as visual analysis does not quantify the magnitude of potential effects:

In general, when raters evaluate whether intrasubject data have met criteria for demonstrating experimental control for research or clinical purposes, it is more likely that visual inspection produces a dichotomous decision (i.e., experimental control either is or is not demonstrated, rather than the degree to which experimental control has been demonstrated). (p. 43)

CONEFF is an important measure to get a complete picture of each aspect of the data at hand rather than just focusing on one data aspect, unless this has explicitly been hypothesized in advance. Reporting the CONEFF of potential demonstrations of an effect has at least two advantages. First, it increases the transparency of the results and thereby facilitates reproducing the results in independent studies (Wicherts et al., 2016). Second, it prevents researchers from simply picking the results that show the desired outcome: “for instance, the researcher could report only a subset of many analyses that showed the researcher’s most desirable results” (Wicherts et al., 2016, p. 9). Similarly, CONDAP assesses the similarity between data patterns overall, rather than just focusing on favorable data aspects.

Similar to many of the popular nonoverlap techniques, CONDAP is distribution free and requires minimal data assumptions. MD-based measures are furthermore intuitive, straightforward, and easy to implement (Ding, Trajcevski, Scheuermann, Wang, & Keogh, 2008). It has been shown that CONDAP not only conforms with the logic and conclusions of visual analysis, but also exceeds the means of visual analysis by systematically quantifying the degree of consistency and offering a tool to compare consistency between data sets. The CONDAP values found in the example data sets ranged from 0.66 to 1.66. The lowest possible CONDAP is zero, which indicates perfect consistency. It should also be noted that CONDAP can only be calculated if there is variability in at least one of the data patterns, that is, at least one of the standard deviations is not equal to zero. If both of the standard deviations are equal to zero, we recommend using MMD if the two phases have the same number of data points and OMMD in case of unequal phase lengths. Finally, the consistency between A1 and A2, and by extension the obtained CONDAP, might be affected by an incomplete return to baseline levels. As no intervention has taken place before the first baseline phase, we might expect the consistency between baselines to be lower than the consistency between intervention phases because each intervention phase follows after a preceding baseline phase.

Limitations and Future Research

As a demonstration and small-scale field test of CONDAP and CONEFF, this article has several limitations. First of all, the sample size of only four published data sets does not allow for broad generalization of the two measures. However, focusing on these four data sets in-depth allowed us to demonstrate how statistical and visual analysis can maximally benefit from one another. Second, both CONDAP and CONEFF are substantially new measures for a data aspect that has not previously been quantified within the single-case community. As such, the performance of these measures cannot yet be compared with other operationalizations aimed at quantifying the degree of consistency in SCED data. Such comparisons can contribute toward further validation of the interpretational guidelines for CONDAP and help in establishing interpretational guidelines for CONEFF. Similarly, a cross validation of CONDAP and CONEFF with assessments by visual analysts can further strengthen the validity of both measures. However, we anticipate that this article will stimulate further research in this area. Another limitation concerns the design of the studies included in the demonstration. The demonstration of CONDAP and CONEFF was restricted to A-B-A-B phase designs. Future research could focus on extending the proposed measures to other SCED applications in which consistency is desirable including phase designs with more than four phases and multiple-baseline designs. As multiple-baseline designs follow the logic of A-B comparisons across behaviors or participants, CONDAP and CONEFF can be used without major modifications. For example, in a multiple baseline across participants design, we can compare the consistency of all baseline and experimental phases between subjects. In applications with more than four phases (e.g., A-B-A-B-A-B), it is possible to compare the consistency across all baseline phases and all experimental phases.

Both CONDAP and CONEFF offer several further potential avenues for future research besides the ones already mentioned. As CONDAP and CONEFF are the first measures to quantify the degree of consistency in single-case A-B-A-B phase designs, one potential avenue for future research is to reanalyze published studies with these new measures. As CONDAP is scale invariant, it allows for comparing the consistency of results between studies. Second, CONDAP and CONEFF, as such, might be incorporated as a test statistic in multiple randomization tests, which at the same time allows for obtaining nonparametric confidence intervals for the degree of consistency. These procedures are, for example, described in Edgington and Onghena (2007), Heyvaert and Onghena (2014), and Michiels et al. (2017). The null hypothesis in this scenario would be that there is no consistency in the data patterns of similar phases in case of CONDAP or that there is no consistency in the demonstrations of an effect in case of CONEFF. To test this hypothesis, the p value can be obtained by locating the observed test statistic in the randomization distribution given all permissible randomizations. If the proportion of test statistics showing equal or higher consistency than the observed one is smaller than or equal to 5%, the null hypothesis that there is no consistency can be rejected. In addition, a better understanding is needed of how CONDAP is affected by missing scores.
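
The decision rule described in this paragraph can be sketched as follows. The sketch assumes that the test statistic is one for which smaller values mean higher consistency (as with CONDAP or a CONEFF absolute difference) and that the statistic has already been computed for every permissible randomization. The function name and the values in the example are hypothetical.

```python
def consistency_p_value(observed, randomization_distribution):
    """Proportion of permissible randomizations at least as consistent as the observed one.

    Assumes smaller statistics indicate higher consistency.
    """
    if not randomization_distribution:
        raise ValueError("the randomization distribution must not be empty")
    at_least_as_consistent = sum(
        1 for statistic in randomization_distribution if statistic <= observed
    )
    return at_least_as_consistent / len(randomization_distribution)


# Hypothetical statistics for four permissible randomizations, including the observed one.
distribution = [0.5, 0.9, 1.2, 1.4]
p = consistency_p_value(0.5, distribution)
print(p, "reject" if p <= 0.05 else "do not reject")
```

With only four permissible randomizations, as in this toy example, the smallest attainable p value is .25, so the null hypothesis could never be rejected at the 5% level; at least 20 permissible randomizations are needed for that.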

Contrary to CONDAP, CONEFF utilizes previously validated ESMs. Therefore, future research should address other challenges than validating CONEFF as a measure. It has to be noted, however, that the different ESMs differ in their sensitivity. For example, a perfect replication of nonoverlap is easier to achieve than a perfect replication of level as most overlap measures suffer from a ceiling effect. Furthermore, we have yet to develop a scale invariant ESM to assess immediacy, as the ITEI is currently the only ESM used in CONEFF that cannot be compared across studies. One important consideration in the field should be the incorporation of CONEFF in standard reporting tools. Reporting the CONEFF of the data set can greatly increase the acceptability and credibility of SCED research findings within and outside the single-case research community. Example R code for calculating the ESMs needed to assess the CONEFF and a generic R function to calculate CONDAP are available in the digital attachment. In addition, many of the analyses presented in this article, including the construction of graphs, can also be executed by practitioners with little to no programming knowledge using the single-case data analysis shiny app available at https://tamalkd.shinyapps.io/scda (De, Michiels, Vlaeyen, & Onghena, 2017). Similar to CONDAP, a challenge for future research into metaconsistency in SCED data concerns the application of this measure beyond A-B-A-B phase designs.

Conclusion

This article introduced two measures to assess consistency in SCED data: CONDAP and CONEFF. CONDAP can be used to assess the consistency between data patterns implementing the same manipulation of the independent variable(s). It is an MD-based measure that calculates the overall average MD between all possible sequences of equal length of the two phases. CONEFF can be used to assess the consistency between potential replications of an effect. An assessment of the CONEFF requires the calculation of separate ESMs for level, trend, variability, overlap, and immediacy for each shift from baseline to intervention. Both measures have been shown to be valuable supplements to the visual analytical process. We hope to see CONDAP and CONEFF in the future as part of a holistic approach to analyzing SCED data encompassing statistical and visual analysis of each data aspect.

Appendix

Table A8.

Descriptive Statistics for Data From Mackay, McLaughlin, Weber, and Derby (2001).

Phase	M	Median	SD	Range
A1	12.60	11.00	4.22	10.00
B1	5.60	6.00	2.07	5.00
A2	9.20	9.00	2.39	6.00
B2	3.60	4.00	1.14	3.00
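
As a rough check, the variance ratios for this data set can be approximated from the standard deviations in Table A8. The sketch assumes that the variance ratio is the baseline variance divided by the variance of the following intervention phase; the function name is hypothetical.

```python
# Standard deviations per phase, copied from Table A8.
sd = {"A1": 4.22, "B1": 2.07, "A2": 2.39, "B2": 1.14}


def variance_ratio(sd_baseline, sd_intervention):
    """Ratio of baseline variance to intervention variance (assumed direction)."""
    return (sd_baseline / sd_intervention) ** 2


first = variance_ratio(sd["A1"], sd["B1"])
second = variance_ratio(sd["A2"], sd["B2"])
print(round(first, 2), round(second, 2), round(abs(first - second), 2))
```

The results, roughly 4.16 and 4.40, land close to but not exactly on the values of 4.14 and 4.38 reported in Table 3; the gap is plausibly due to the standard deviations above being rounded to two decimals.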

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Author Biographies

René Tanious is a doctoral candidate at the Faculty of Psychology and Educational Sciences at KU Leuven, Belgium. His current research interests include single-case experimental designs, development of effect size measures, and combining statistical and visual analysis for single-case experimental designs.

Tamal Kumar De is a doctoral candidate at the Faculty of Psychology and Educational Sciences, KU Leuven, Belgium. His research interests include single-case experiments, randomization tests and missing data.

Bart Michiels is a postdoc at the Faculty of Psychology and Educational Sciences, KU Leuven, Belgium. His research interests include single-case experiments, randomization tests, meta-analysis, and development of statistical software in R.

Wim Van den Noortgate is a professor of Statistics at the Faculty of Psychology and Educational Sciences of the KU Leuven. His major research interests include meta-analysis, multilevel analysis and learning analytics.

Patrick Onghena is a professor of Research Methodology and Statistics at the Faculty of Psychology and Educational Sciences, KU Leuven, Belgium. His major research interests include single-case experimental designs, distribution-free statistical inference, the methodology of systematic reviews, mixed methods research, and research on the teaching of statistics.

References

Allison

D. B.

Gorman

B. S.

(1993). Calculating effect sizes for meta-analysis: The case of the single case. Behaviour Research and Therapy, 31, 621-631. doi:10.1016/0005-7967(93)90115-B

Audi

(1999). The Cambridge dictionary of philosophy (2nd ed.). Cambridge, UK: Cambridge University Press.

Baer

D. M.

(1977). Perhaps it would be better not to know everything. Journal of Applied Behavior Analysis, 10, 167-172. doi:10.1901/jaba.1977.10-167

Barker

McCarthy

Jones

Moran

(2011). Single-case research methods in sport and exercise psychology. New York, NY: Routledge.

Barlow

D. H.

Nock

M. K.

Hersen

(2009). Single case experimental designs: Strategies for studying behavior change (3rd ed.). Boston, MA: Pearson.

Barton

E. E.

Ledford

J. R.

Lane

J. D.

Decker

Germansky

E. S.

. . . Kaiser

(2016). The iterative use of single case research designs to advance the science of EI/ECSE. Topics in Early Childhood Special Education, 36, 4-14.

Barton

E. E.

Lloyd

B. P.

Spriggs

A. D.

Gast

D. L.

(2018). Visual analysis of graphic data. In Ledford

J. R.

Gast

D. L.

(Eds.), Single case research methodology: Applications in special education and behavioral sciences (3rd ed., pp. 179-214). New York, NY: Routledge.

Beeson

P. M.

Robey

R. R.

(2006). Evaluating single-subject treatment research: Lessons learned from the aphasia literature. Neuropsychology Review, 16, 161-169. doi:10.1007/s11065-006-9013-7

Beretvas

S. N.

Chung

(2008). A review of meta-analyses of single-subject experimental designs: Methodological issues and practice. Evidence-Based Communication Assessment and Intervention, 2, 129-141. doi:10.1080/17489530802446302

10.

Bogle

S. M.

Harris

C. M.

(1994). Measuring prescribing: The shortcomings of the item. British Medical Journal, 308, 637-640. doi:10.1136/bmj.308.6929.637

11.

Brossart

D. F.

Parker

R. I.

Olson

E. A.

Mahadevan

(2006). The relationship between visual analysis and five statistical analyses in a simple AB single-case research design. Behavior Modification, 30, 531-563. doi:10.1177/0145445503261167

12.

Bulté

Onghena

(2008). An R package for single-case randomization tests. Behavior Research Methods, 40, 467-478. doi:10.3758/BRM.40.2.467

13.

Bulté

Onghena

(2009). Randomization tests for multiple-baseline designs: An extension of the SCRT-R package. Behavior Research Methods, 41, 477-485. doi:10.3758/BRM.41.2.477

14.

Bulté

Onghena

(2012). When the truth hits you between the eyes: A software tool for the visual analysis of single-case experimental data. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 8, 104-114. doi:10.1027/1614-2241/a000042

15.

Busk

P. L.

Serlin

(1992). Meta-analysis for single case research. In Kratochwill

T. R.

Levin

J. R.

(Eds.), Single-case research design and analysis: New directions for psychology and education (pp. 187-212). Hillsdale, NJ: Lawrence Erlbaum.

16.

Center

B. A.

Skiba

R. J.

Casey

(1985). A methodology for the quantitative synthesis of intra-subject design research. Journal of Special Education, 19, 387-400. doi:10.1177/002246698501900404

17.

Cha

S.-H.

(2007). Comprehensive survey on distance/similarity measures between probability density functions. International Journal of Mathematical Models and Methods in Applied Sciences, 1, 300-307.

18.

Chiu

N.-H.

Huang

S.-J.

(2006). The adjusted analogy-based software effort estimation based on similarity distances. The Journal of Systems and Software, 80, 628-640. doi:10.1016/j.jss.2006.06.006

19.

Cohen

(1992). A power primer. Psychological Bulletin, 112, 155-159. doi:10.1037/0033-2909.112.1.155

20.

Cronbach

L. J.

(1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334. doi:10.1007/BF02310555

21.

T. K.

Michiels

Vlaeyen

J. W.

Onghena

(2017). Shiny SCDA [Computer software]. Retrieved from https://ppw.kuleuven.be/mesrg/software-and-apps/shiny-scda

22.

Ding

Trajcevski

Scheuermann

Wang

Keogh

(2008). Querying and mining of time series data: Experimental comparison of representations and distance measures. Proceedings of the VLDB Endowment, 1, 1542-1552. doi:10.14778/1454159.1454226

23.

Edgington

E. S.

Onghena

(2007). Randomization tests. Boca Raton, FL: Chapman & Hall/CRC.

24.

Fedorov

(2013). GetData graph digitizer [Computer software]. Retrieved from http://getdata-graph-digitizer.com/

25.

Gast

D. L.

(2010). Single subject research methodology in behavioral sciences. New York, NY: Routledge.

26.

Gingerich

W. J.

(1984). Methodological observations on applied behavioral science. The Journal of Applied Behavioral Science, 20, 71-79. doi:10.1177/002188638402000113

27.

Gorsuch

R. L.

(1983). Three methods for analyzing limited time-series (N of 1) data. Behavioral Assessment, 5, 141-154.

28. Harrington, & Velicer, W. F. (2015). Comparing visual and statistical analysis in single-case studies using published studies. Multivariate Behavioral Research, 50, 162-183. doi:10.1080/00273171.2014.973989

29. Harrington, M. A. (2013). Comparing visual and statistical analysis in single-subject studies (Open Access Dissertations). Retrieved from http://digitalcommons.uri.edu/oa_diss

30. Harvey, S. T., Boer, Meyer, L. H., & Evans, I. M. (2009). Updating a meta-analysis of intervention research with challenging behaviour: Treatment validity and standards of practice. Journal of Intellectual & Developmental Disability, 34, 67-80. doi:10.1080/13668250802690922

31. Henson, R. K. (2001). Understanding internal reliability estimates: A conceptual primer on coefficient alpha. Measurement and Evaluation in Counseling and Development, 34, 177-189.

32. Heyvaert, & Onghena (2014). Analysis of single-case data: Randomization tests for measures of effect size. Neuropsychological Rehabilitation, 24, 507-527. doi:10.1080/09602011.2013.818564

33. Heyvaert, Wendt, Van den Noortgate, & Onghena (2015). Randomization and data-analysis items in quality standards for single-case experimental studies. The Journal of Special Education, 49, 146-156. doi:10.1177/0022466914525239

34. Hill, A. B. (1965). The environment and disease: Association or causation? Journal of the Royal Society of Medicine, 58, 295-300. doi:10.1177/0141076814562718

35. Horner, R. H., Carr, E. G., Halle, McGee, Odom, & Wolery (2005). The use of single-subject research to identify evidence-based practice in special education. Exceptional Children, 71, 165-179. doi:10.1177/001440290507100203

36. Horowitz, Wilner, & Alvarez (1979). Impact of event scale: A measure of subjective stress. Psychosomatic Medicine, 41, 209-218. doi:10.1097/00006842-197905000-00004

37. Huitema, B. E., & McKean, J. W. (2000). Design specification issues in time-series intervention models. Educational and Psychological Measurement, 60, 38-58. doi:10.1177/00131640021970358

38. Kahng, S. W., Chung, K.-Y., Gutshall, Pitts, S. C., Kao, & Girolami (2010). Consistent visual analysis of intrasubject data. Journal of Applied Behavior Analysis, 43, 35-45. doi:10.1901/jaba.2010.43-35

39. Kazdin, A. E. (1982). Single-case research designs: Methods for clinical and applied settings. New York, NY: Oxford University Press.

40. Kazdin, A. E. (2011). Single-case research designs: Methods for clinical and applied settings (2nd ed.). New York, NY: Oxford University Press.

41. Kennedy (2005). Single-case designs for educational research. Boston, MA: Pearson.

42. Kokare, Chatterji, B. N., & Biswas, P. K. (2003, March). Comparison of similarity metrics for texture image retrieval. TENCON 2003, Conference on Convergent Technologies for Asia-Pacific Region, Bangalore, India. doi:10.1109/TENCON.2003.1273228

43. Kratochwill, T. R., & Brody, G. H. (1978). Single subject designs: A perspective on the controversy over employing statistical inference and implications for research and training in behavior modification. Behavior Modification, 2, 291-307. doi:10.1177/014544557823001

44. Kratochwill, T. R., Hitchcock, Horner, R. H., Levin, J. R., Odom, S. L., Rindskopf, D. M., & Shadish, W. R. (2010). Single-case design technical documentation. Retrieved from https://ies.ed.gov/ncee/wwc/Docs/ReferenceResources/wwc_scd.pdf

45. Kratochwill, T. R., Hitchcock, Horner, R. H., Levin, J. R., Odom, S. L., Rindskopf, D. M., & Shadish, W. R. (2013). Single-case intervention research design standards. Remedial and Special Education, 34, 26-38. doi:10.1177/0741932512452794

46. Kratochwill, T. R., & Levin, J. R. (2014). Meta- and statistical analysis of single-case intervention research data: Quantitative gifts and a wish list. Journal of School Psychology, 52, 231-235. doi:10.1016/j.jsp.2014.01.003

47. Kratochwill, T. R., & Levin, J. R. (2015). Single-case research design and analysis: New directions for psychology and education. New York, NY: Routledge.

48. Kromrey, J. D., & Foster-Johnson (1996). Determining the efficacy of intervention: The use of effect sizes for data analysis in single-subject research. The Journal of Experimental Education, 65, 73-93. doi:10.1080/00220973.1996.9943464

49. Lane, J. D., & Gast, D. L. (2014). Visual analysis in single case experimental design studies: Brief review and guidelines. Neuropsychological Rehabilitation, 24, 445-463. doi:10.1080/09602011.2013.815636

50. Ledford, J. R., Lane, J. D., & Severini, K. E. (2018). Systematic use of visual analysis for assessing outcomes in single case design studies. Brain Impairment, 19, 4-17. doi:10.1017/BrImp.2017.16

51. Ma, H.-H. (2006). Quantitative synthesis of single-subject researches: Percentage of data points exceeding the median. Behavior Modification, 30, 598-617. doi:10.1177/0145445504272974

52. Mackay, McLaughlin, T. F., Weber, & Derby, K. M. (2001). The use of precision requests to decrease noncompliance in the home and neighborhood: A case study. Child & Family Behavior Therapy, 23, 41-50.

53. Maggin, D. M., Swaminathan, Rogers, H. J., O’Keeffe, B. V., Sugai, & Horner, R. H. (2011). A generalized least squares regression approach for computing effect sizes in single-case research: Application examples. Journal of School Psychology, 49, 301-321. doi:10.1016/j.jsp.2011.03.004

54. Manolov (2018). Linear trend in single-case visual and quantitative analyses. Behavior Modification, 42, 684-706. doi:10.1177/0145445517726301

55. Manolov, & Moeyaert (2017). Recommendations for choosing single-case data analytical techniques. Behavior Therapy, 48, 97-114. doi:10.1016/j.beth.2016.04.008

56. Manolov, & Solanas (2008). Comparing N = 1 effect size indices in presence of autocorrelation. Behavior Modification, 32, 860-875. doi:10.1177/0145445508318866

57. Matyas, T. A., & Greenwood, K. M. (1990). Visual analysis of single-case time series: Effects of variability, serial dependence, and magnitude of intervention effects. Journal of Applied Behavior Analysis, 23, 341-351. doi:10.1901/jaba.1990.23-341

58. Michiels, Heyvaert, Meulders, & Onghena (2017). Confidence intervals for single-case effect size measures based on randomization test inversion. Behavior Research Methods, 49, 363-381. doi:10.3758/s13428-016-0714-4

59. Morgan, D. L., & Morgan, R. K. (2001). Single-participant research design: Bringing science to managed care. American Psychologist, 56, 119-127. doi:10.1037//0003-066X.56.2.119

60. Morgan, D. L., & Morgan, R. K. (2009). Single-case research methods for the behavioral and health sciences. Thousand Oaks, CA: SAGE.

61. Morley (2018). Single-case methods in clinical psychology: A practical guide. New York, NY: Routledge.

62. Natesan, & Hedges, L. V. (2017). Bayesian unknown change-point models to investigate immediacy in single case designs. Psychological Methods, 22, 743-759. doi:10.1037/met0000134

63. Newey, W. K., & McFadden, D. L. (1994). Large sample estimation and hypotheses testing. In Engle & McFadden (Eds.), Handbook of econometrics (Vol. 4, pp. 2113-2245). Amsterdam, The Netherlands: North-Holland.

64. Ninci, Vannest, K. J., Willson, & Zhang (2015). Interrater agreement between visual analysts of single-case data: A meta-analysis. Behavior Modification, 39, 510-541. doi:10.1177/0145445515581327

65. O’Brien, & Repp, A. C. (1990). Reinforcement-based reductive procedures: A review of 20 years of their use with persons with severe or profound retardation. Journal of the Association for Persons With Severe Handicaps, 15, 148-159. doi:10.1177/154079699001500307

66. Olive, M. L., & Smith, B. W. (2005). Effect size calculations and single subject designs. Educational Psychology, 25, 313-324. doi:10.1080/0144341042000301238

67. Onghena, & Edgington, E. S. (2005). Customization of pain treatments: Single-case design and analysis. The Clinical Journal of Pain, 21, 56-68. doi:10.1097/00002508-200501000-00007

68. Ottenbacher, K. J. (1990). When is a picture worth a thousand p values? A comparison of visual and quantitative methods to analyze single subject data. The Journal of Special Education, 23, 436-449. doi:10.1177/002246699002300407

69. Park, H.-S., Marascuilo, & Gaylord-Ross (1990). Visual inspection and statistical analysis in single-case designs. The Journal of Experimental Education, 58, 311-320. doi:10.1080/00220973.1990.10806545

70. Parker, R. I., & Brossart, D. F. (2003). Evaluating single-case research data: A comparison of seven statistical methods. Behavior Therapy, 34, 189-211. doi:10.1016/S0005-7894(03)80013-8

71. Parker, R. I., & Hagan-Burke (2007). Single case research results as clinical outcomes. Journal of School Psychology, 45, 637-653. doi:10.1016/j.jsp.2007.07.004

72. Parker, R. I., Hagan-Burke, & Vannest (2007). Percentage of all non-overlapping data (PAND): An alternative to PND. The Journal of Special Education, 40, 194-204. doi:10.1177/00224669070400040101

73. Parker, R. I., & Vannest, K. J. (2009). An improved effect size for single-case research: Nonoverlap of all pairs. Behavior Therapy, 40, 357-367.

74. Parker, R. I., Vannest, K. J., & Brown (2009). The improvement rate difference for single-case research. Exceptional Children, 75, 135-150. doi:10.1177/001440290907500201

75. Parker, R. I., Vannest, K. J., Davis, J. L., & Sauber, S. B. (2011). Combining nonoverlap and trend for single-case research: Tau-U. Behavior Therapy, 42, 284-299. doi:10.1016/j.beth.2010.08.006

76. Parsonson, & Baer (1978). The analysis and presentation of graphic data. In Kratochwill (Ed.), Single subject research (pp. 101-166). New York, NY: Academic Press.

77. Perdices, & Tate, R. L. (2009). Single-subject designs as a tool for evidence-based clinical practice: Are they unrecognised and undervalued? Neuropsychological Rehabilitation, 19, 904-927. doi:10.1080/09602010903040691

78. Revusky, S. H. (1967). Some statistical treatments compatible with individual organism methodology. Journal of the Experimental Analysis of Behavior, 10, 319-330. doi:10.1901/jeab.1967.10-319

79. Schlosser, R. W., Lee, D. L., & Wendt (2008). Application of the percentage of non-overlapping data (PND) in systematic reviews and meta-analyses: A systematic review of reporting characteristics. Evidence-Based Communication Assessment and Intervention, 2, 163-187. doi:10.1080/17489530802505412

80. Scotti, J. R., Evans, I. M., Meyer, L. H., & Walker (1991). A meta-analysis of intervention research with problem behavior: Treatment validity and standards of practice. American Journal of Mental Retardation, 96, 233-256.

81. Scruggs, T. E., Mastropieri, M. A., & Casto (1987). The quantitative synthesis of single-subject research: Methodology and validation. Remedial and Special Education, 8, 24-33. doi:10.1177/074193258700800206

82. Sen, P. K. (1968). Estimates of the regression coefficient based on Kendall’s Tau. Journal of the American Statistical Association, 63, 1379-1389. doi:10.1080/01621459.1968.10480934

83. Shadish, W. R., Hedges, L. V., & Pustejovsky, J. E. (2014). Analysis and meta-analysis of single-case designs with a standardized mean difference statistic: A primer and applications. Journal of School Psychology, 52, 123-147. doi:10.1016/j.jsp.2013.11.005

84. Shadish, W. R., Rindskopf, D. M., & Hedges, L. V. (2008). The state of the science in the meta-analysis of single-case experimental designs. Evidence-Based Communication Assessment and Intervention, 2, 188-196. doi:10.1080/17489530802581603

85. Shadish, W. R., & Sullivan, K. J. (2011). Characteristics of single-case designs used to assess intervention effects in 2008. Behavior Research Methods, 43, 971-980. doi:10.3758/s13428-011-0111-y

86. Shamseer, Sampson, Bukutu, Schmid, C. H., Nikles, Tate, . . . CENT Group. (2016). CONSORT extension for reporting N-of-1 trials (CENT) 2015: Explanation and elaboration. Journal of Clinical Epidemiology, 76, 18-46. doi:10.1016/j.jclinepi.2015.05.018

87. Sherwood, Perelman, Hamerly, & Calder (2002). Automatically characterizing large scale program behavior. ACM SIGARCH Computer Architecture News, 30, 45-57. doi:10.1145/635506.605403

88. Smith, J. D. (2012). Single-case experimental designs: A systematic review of published research and current standards. Psychological Methods, 17, 510-550. doi:10.1037/a0029312

89. Solanas, Manolov, & Onghena (2010). Estimating slope and level change in N = 1 designs. Behavior Modification, 34, 195-218. doi:10.1177/0145445510363306

90. Tanious, De, T. K., Michiels, Van den Noortgate, & Onghena (2018, August 16). Consistency in single-case A-B-A-B phase designs: A systematic review. doi:10.31234/osf.io/62t7w

91. Tankersley, Harjusola-Webb, & Landrum, T. J. (2008). Using single-subject research to establish the evidence base of special education. Intervention in School and Clinic, 44, 83-90. doi:10.1177/1053451208321600

92. Tate, R. L., Perdices, Rosenkoetter, McDonald, Togher, Shadish, W. R., & Vohra (2016). The Single-Case Reporting guideline In BEhavioural Interventions (SCRIBE) 2016: Explanation and elaboration. Archives of Scientific Psychology, 4, 1-9. doi:10.1037/arc0000026

93. Tate, R. L., Perdices, Rosenkoetter, Shadish, W. R., Vohra, Barlow, D. H., & Wilson (2016). The Single-Case Reporting guideline In BEhavioural interventions (SCRIBE) 2016 statement. Aphasiology, 30, 862-876. doi:10.1080/02687038.2016.1178022

94. Tate, R. L., Perdices, Rosenkoetter, Wakim, Godbee, Togher, & McDonald (2013). Revision of a method quality rating scale for single-case experimental designs and N-of-1 trials: The 15-item Risk of Bias in N-of-1 Trials (RoBiNT) Scale. Neuropsychological Rehabilitation, 23, 619-638. doi:10.1080/09602011.2013.824383

95. Theil (1950). A rank-invariant method of linear and polynomial regression analysis. Proceedings of the Koninklijke Nederlandse Akademie Wetenschappen, Series A Mathematical Sciences, 53, 386-392.

96. Van den Noortgate, & Onghena (2008). A multilevel meta-analysis of single-subject experimental design studies. Evidence-Based Communication Assessment and Intervention, 2, 142-151. doi:10.1080/17489530802505362

97. Vohra, Shamseer, Sampson, Bukutu, Schmid, C. H., Tate, & CENT Group. (2016). CONSORT extension for reporting N-of-1 trials (CENT) 2015 statement. Journal of Clinical Epidemiology, 76, 9-17. doi:10.1016/j.jclinepi.2015.05.004

98. Wicherts, J. M., Veldkamp, C. L., Augusteijn, H. E., Bakker, van Aert, R. C., & van Assen, M. A. (2016). Degrees of freedom in planning, running, analyzing, and reporting psychological studies: A checklist to avoid p-hacking. Frontiers in Psychology, 7, Article 1832. doi:10.3389/fpsyg.2016.01832

99. Wilbert (2014). Using the SCDA package (0.8) for analysing single and multiple case AB designs. Retrieved from http://www.uni-potsdam.de/fileadmin01/projects/inklusion/PDFs/SCDA_Documentation.pdf

100. Wolery, Busick, Reichow, & Barton, E. E. (2010). Comparison of overlap methods for quantitatively synthesizing single-subject data. The Journal of Special Education, 44, 18-28. doi:10.1177/0022466908328009

101. Yuen, H. K. (1993). Improved productivity through purposeful use of additional template for a woman with cortical blindness. The American Journal of Occupational Therapy, 47, 105-110. doi:10.5014/ajot.47.2.105