A Visual Aid and Objective Rule Encompassing the Data Features of Visual Analysis

Abstract

Visual analysis of single-case research is commonly described as a gold standard, but it is often unreliable. Thus, an objective tool for applying visual analysis is necessary, as an alternative to the Conservative Dual Criterion, which presents some drawbacks. The proposed free web-based tool enables assessing change in trend and level between two adjacent phases, while taking data variability into account. The application of the tool results in (a) a dichotomous decision regarding the presence or absence of an immediate effect, a progressive or delayed effect, or an overall effect and (b) a quantification of overlap. The proposal is evaluated by applying it to both real and simulated data, obtaining favorable results. The visual aid and the objective rules are expected to make visual analysis more consistent, but they are not intended as a substitute for the analysts’ judgment, as a formal test of statistical significance, or as a tool for assessing social validity.

Keywords

single-case design visual analysis trend variability visual aid

Visual analysis is strongly advocated and commonly used in single-case research (Lane & Gast, 2014; Ledford, Lane, & Severini, 2018; Maggin, Cook, & Cook, 2018; Parker, Cryer, & Byrns, 2006). Visual analysis allows assessing several data features to identify the presence of a functional relation between intervention and target behavior. According to the What Works Clearinghouse (WWC) standards (Kratochwill et al., 2010, see also Appendix A in Institute of Education Sciences, 2018), the data features on which visual analysis focuses are level, trend, overlap, variability, and immediacy for an A-B comparison. Moreover, for demonstrating a functional relation, it is crucial to assess the consistency of the results across several A-B comparisons within a study (Ledford, Barton, Severini, & Zimmerman, 2019). Several authors have described how visual analysis proceeds in relation to these data features. For instance, Ledford et al. (2018) emphasize the difference between using visual analysis for a formative purpose, focusing on the within-phase data pattern for deciding how to proceed, and using visual analysis for a summative purpose, comparing adjacent phases. As per Ledford et al. (2019), an important part of the visual analysis is the comparison of the obtained with the expected data patterns. Such a comparison is related to the expectation of an immediate or delayed effect and also to the potential confounding introduced by an improving baseline trend (Maggin et al., 2018). Furthermore, it has been highlighted that the magnitude of effect has to be quantified, once a functional relation has been established by means of visual analysis (Ledford et al., 2019).

Several authors (e.g., Harrington & Velicer, 2015; Vannest & Ninci, 2015; Vannest, Peltier, & Haas, 2018) and even journal editors (Karazsia, 2018) have advocated for the need to use visual and quantitative analysis jointly. Actually, adding trend lines, computing means and medians, and overlap indices are already part of the visual analysis as presented by Lane and Gast (2014). Moreover, a meta-analysis on the performance of visual analysts has identified the use of structured criteria as a factor in improved reliability (Ninci, Vannest, Willson, & Zhang, 2015). Thus, in relation to the importance of structured criteria, in the following sections of the current text, we first present the justification for the need of a new visual aid tool for making visual analysis more objective. Second, we describe the tool that we propose and provide a justification for its elements. We also clearly state what the tool is capable of achieving and what it is not suitable for. Third, we evaluate the tool by applying it to real data and to simulated data. Finally, we provide more general recommendations regarding the assessment of intervention effectiveness.

The Need for a New Tool

A visual aid tool is necessary on the basis of three pieces of evidence. First, the graphical representation of the data in any given study may not meet the requirements regarding the Y axis (Dart & Radley, 2017) and the ratio of the X-to-Y axes according to the number of data points depicted (Radley, Dart, & Wright, 2018), which can affect the result of the visual analysis. Second, there is evidence that visual analysis terms and procedures are used inconsistently (Barton, Meadan, & Fettig, 2019). Third, the agreement between visual analysts has been shown to be suboptimal (see the meta-analysis by Ninci et al., 2015, and later works: Diller, Barry, & Gelino, 2016; Wolfe, Seaman, & Drasgow, 2016). The evidence of the lack of consistency can be expected, given that for many of the data features assessed visually, there are no formal objective rules. For instance, Maggin, Briesch, and Chafouleas (2013) developed a protocol for applying the WWC Standards for visual analysis, but this protocol still includes multiple requirements of subjective judgments (e.g., Is the level discriminably different between the first and last three data points in adjacent phases? Is there an overall change in trend between baseline and treatment phases?). More recently, Wolfe, Barton, and Meadan (2019) proposed systematic protocols that ask for similar subjective evaluations (e.g., Project the trend of the first baseline phase into the first intervention phase. Is the level, trend, or variability in the intervention phase different than what you would predict based on the baseline data?). One issue is that the relevant data features can be defined operatively in several different ways: There are at least five ways to estimate trend suggested for single-case data (Manolov, 2018b) and at least nine different quantifications of overlap (Parker, Vannest, & Davis, 2011). Another issue is that the data patterns from different conditions can always be expected to be somewhat different or, stated otherwise, they cannot be expected to be exactly identical. This poses the question about how much different the data patterns should be to call them “different.” A third issue, in relation to the assessment of level, trend, variability, and overlap, is that there are no clear indications about how to proceed when the assessment of the different data features does not converge. On one hand, there are no formal indications regarding which of these data aspects to prioritize. On the other hand, it is that it is not clear whether a difference in only one of these data features (level, trend, or variability) is sufficient as a demonstration of a basic effect (to be replicated across at least three attempts), regardless of the evaluation of the other two data features.

In summary, to avoid making the visual judgments dependent on the (potentially inadequate) graphical representation, to clearly identify and report the tools used for assessing each data aspect, and to improve the agreement between visual analysts, a visual aid is necessary. In what follows, we justify why a new tool is necessary, taking into account the already existing tools.

For within-phase analysis, a relevant tool encompassing level, trend, and variability are stability envelopes (Lane & Gast, 2014; Ledford et al., 2018). According to this tool, stability in level, as opposed to variability, is defined as at least 80% of the data being within an envelope constructed on the basis of the median ±25% of the median. For assessing trend stability, the envelope is constructed likewise around the split-middle trend line. There are several issues with such a proposal. First, the specific values proposed are completely arbitrary. This is well illustrated by the fact that a previous text (Gast & Spriggs, 2010) referred to the envelope being constructed using ±20% of the median, whereas Lobo, Moeyaert, Cunha, and Babik (2017) suggest constructing the envelope using ±15% of the median. Second, no measure of the actual variability in the data is used for defining the envelope. Third, the split-middle trend line has been shown to be outperformed by a tri-split trend line (Parker, Vannest, & Davis, 2014).

For between-phases comparison, a well-known tool, highlighted by Kratochwill, Levin, Horner, and Swoboda (2014), is the Conservative Dual Criteria (CDC; Fisher, Kelley, & Lomas, 2003). In the CDC, the level is represented by the mean and trend is modeled via the split-middle method. Afterward, both the mean and the split-middle trend line are projected from the baseline phase into the intervention phase for comparison. The projections are adjusted conservatively in the direction of the expected effect, by shifting them a quarter of a standard deviation. On a positive side, the CDC has been shown to improve visual analysts’ decisions (Wolfe & Slocum, 2015; Young & Daly, 2016), although it has been stated that “[t]he validity of the CDC when data patterns produce disagreement among experts is unknown” (Wolfe, Seaman, Drasgow, & Sherlock, 2018, p. 349). In that sense, the CDC is not flawless. A first set of issues of the CDC are the sensitivity of the mean and the standard deviation to outlying values, especially if the data series are short, as well as the previously mentioned limitation of the split-middle trend line. Second, the variability of the data is not considered or represented visually beyond the conservative adjustment of the projected lines. Third, a quantification of the magnitude of difference between the projections and the actual data is not obtained. Fourth, no distinction is made between immediate and progressive/delayed effects. Fifth, it has been stated that more research is warranted on the application of CDC to data with baseline trends (Wolfe et al., 2018). Sixth, the CDC is only applicable to designs in which phases are compared (e.g., multiple-baseline design [MBD], A-B-A-B), but not to designs with rapidly changing conditions (i.e., alternating treatments designs [ATDs]). In the following section, we describe a new proposal and justify why we consider it a suitable alternative to the CDC.

The Visual Aid Implying an Objective Rule (VAIOR)

In this section, we provide an in-depth presentation of a new proposal for a visual aid. First, we comment on the usefulness of the proposal, while also clearly pointing at its limits. Second, we describe the steps it includes, when applied to MBD and reversal designs. Third, we justify the different elements of the proposal. Fourth, we describe how the proposal can be applied to ATDs. Finally, we provide general remarks on the use of the proposal. In a later section, there is an application to real and to simulated data.

Rationale for VAIOR: What It Is and What It Is Not

We refer to the proposal as VAIOR, an abbreviation of “visual aid implying an objective rule,” because the aim is to provide objective operative definitions for several terms such as level, trend, variability, and overlap, as well as an objective rule for determining whether a basic effect is present in an A-B comparison. VAIOR is designed to be useful both in situations in which an immediate effect is expected and when a delayed/progressive effect is expected. Thus, VAIOR aims to include information about all the data features that are commonly assessed when performing visual analysis: level, trend, variability, overlap, and immediacy. Overall, VAIOR is created as a tool for summative analysis. Nevertheless, a formative visual analysis (as described in Barton et al., 2019) is also possible, because the baseline projections that VAIOR entails can be extended for each successively obtained intervention phase measurement. This can serve for detecting early on any clearly insufficient or undesired effects of the intervention.

It is also important to state clearly what VAIOR is not (aimed or expected to achieve). First, an application of VAIOR to a single A-B comparison cannot be understood as an evidence for a functional relation, as several replications of the basic effect are necessary. Second, VAIOR is not a statistical test of a null hypothesis about no intervention effect. In that sense, VAIOR does not model or estimate the potential serial dependence¹ in the data, because there is no p value obtained on the basis of an assumption of independent data. (In contrast, the binomial test used after the CDC, Fisher et al., 2003, does assume that the measurements are independent.) Third, VAIOR does not include or represent a mastery criterion (e.g., 80% or more completion of a predefined criterion for several successive measurements, Kipfmiller et al., 2019). Relatedly, VAIOR does not quantify the number of sessions required until a mastery criterion is reached. Thus, VAIOR does not provide this potentially useful information for evaluating clinical significance and it cannot be used to distinguish between participants who are fast at achieving a criterion but fail to maintain performance, from participants who are slower but better at maintaining what is learned (Ledford et al., 2016). Fourth, VAIOR is not a measure of social validity (Horner et al., 2005), as it does not inform about whether the difference between conditions is practically significant, whether the behavior after the intervention is the normative range, or whether the intervention is acceptable and feasible in real settings. In summary, VAIOR is not meant as a substitute for either the visual analysis of all relevant information or the assessment of social validity. Finally, it has been said that the CDC serves a similar purpose as VAIOR and presents all the limitations mentioned in the current paragraph (and more).

Description of VAIOR for Multiple-Baseline and Reversal Designs

The application of VAIOR to each A-B comparison in the context of a strong design such as an MBD or a reversal (A-B-A-B) design entails nine steps. These steps are first presented and afterward justified in more detail.

Step 1: Estimate baseline trend using the Theil–Sen method (Sen, 1968).

Step 2: Estimate the amount of variability around the baseline trend, using the median of absolute deviations (MAD). In VAIOR, MAD is defined as the median of the absolute deviations from the predicted baseline values according to the trend line fitted using the Theil–Sen method.

Step 3: Construct a variability band² based on baseline trend plus/minus the MAD value.

Step 4: Project the baseline trend and variability band into the treatment phase.

Step 5: Choose the focus of the analysis (overall effect, immediate effect or cumulative growth) before data collection and on the basis of prior literature and/or the characteristics of the intervention and the behavior and to explicitly state this choice in the report, before presenting the results. If an immediate effect is expected, the first three data points in the intervention phase are used in the analysis (as in Kratochwill et al., 2010). If a cumulative growth, progressive effect (i.e., change in slope), or delayed effect is expected, the focus of the analysis can be the last three intervention phase data points, similar to the mean baseline reduction quantification (Olive & Smith, 2005) and to the 3-point decision rule (Riley-Tillman & Burns, 2009; see also Levin, Ferron, & Gafurov, 2017, for a proposal on delayed effects). In the absence of expectations for immediate or progressive/delayed effect, the comparison includes all intervention phase data points.

Step 6: Perform the main comparison between the A and the B phase. We here propose comparing the actually obtained intervention phase measurements to the variability band (rather than to the trend line), focusing on the upper or lower limit, according to the therapeutic aim.

Step 7: Summarize the results from the A-B comparison as a percentage. In this step, we quantify the percentage of data points in the intervention phase that are improved above or below the variability band projection; this would be an ordinal quantification consistent with nonoverlap indices.

Step 8: Summarize the results from the A-B comparison in terms of a dichotomous decision regarding whether a positive effect is present or not. The rules for such a decision would be as follows. If all of the first three intervention phase data points (when an immediate effect is expected) or all of the last three intervention data points (when a progressive or delayed effect is expected) improve the variability band projection, this would be counted as a “positive result,” in dichotomous terms. Where treatment outcome is more uncertain and all intervention data points are used, a slightly different definition of a positive result occurs on the basis of logic and preliminary simulation results (presented later). A positive treatment effect or result is identified only when the proportion of intervention phase data points improving the variability band projection is either 100% (as for immediate and progressive/delayed effects) or is at least twice the proportion of baseline data points that are outside the baseline variability band, whichever of the two quantities is smaller. The underlying idea is that if the trend line tightly fits or better reflects the baseline data (i.e., few and small deviations from this line are observed), fewer intervention data points outside of the projected variability band are needed to demonstrate an effect. That is, the reference for comparison would be clearer. On the contrary, if the fit of the trend line to the baseline data is not that good (i.e., many baseline data points out of the variability band), then stronger evidence is necessary.

Step 9: Assess the consistency of the results across replications of the A-B comparison. An experimental effect cannot be claimed on the basis of a single A-B comparison, in absence of replication (Kennedy, 2005) and the WWC Standards stipulate a minimum of three attempts to demonstrate an effect for MBDs and reversal designs. In terms of within-study replication, Maggin et al. (2013) suggest a minimal ratio of 3:1 of positive effects to no effects or negative effects. We consider that this recommendation is reasonable, albeit apparently not based on any empirical evidence, and it is easily applicable to VAIOR when its dichotomous rule is used.

In terms of how much evidence is sufficient, Lanovaz and Rapp (2016) developed a proposal for an objective rule based on the binomial distribution, although it refers to replications across studies. If we borrow their idea about using the binomial distribution, it is possible to define a rule for within-study replications. Specifically, the probability of success only due to chance could be set to be equal to the false positives rate for VAIOR (results presented later in the text). For instance, for phases with five measurements, independent data, and for an immediate effect, the false positives rate is 0.22. If there are five within-study replications, the probability of more than two positive results only due to chance is 0.0744, whereas the probability more than three positive results only due to chance is 0.0097. Thus, having four or five positive results would be enough evidence. If there are four within-study replications, the probability of more than two positive results only due to chance is 0.0356 suggesting sufficient evidence. Similar rules can be derived, for other phase lengths and assuming autocorrelated data. Nevertheless, it has to be underlined that the question answered by using the binomial distribution is whether the number of positive results can be expected to happen only by chance or there is something more. This is not necessarily the same question as whether there is consistent evidence regarding the presence of an intervention effect.

Interpretation of the results

To reduce the complexity of the proposal, a freely available, easy to use website was developed: https://manolov.shinyapps.io/TrendMAD. It marks with different colors the data points that improve the variability band and the ones that do not, to aid the visual inspection. Moreover, there is a quantification of the percentage of data points improving the variability band and a dichotomous indication of whether a “positive result” was obtained for an immediate effect, for a delayed/progressive effect, and for an overall effect, according to the criteria established.

Justification of the Elements of VAIOR for Multiple-Baseline and Reversal Designs

Overall, VAIOR entails a comparison between projected and actual intervention phase data, which is well aligned with the WWC Standards (Kratochwill et al., 2010; see also Ledford et al., 2019). A similar projection is included in the CDC. Thus, VAIOR is well aligned with the logic of prediction on the basis of the baseline data and predictable change in the target behavior (Riley-Tillman & Burns, 2009). VAIOR builds on that idea, including and representing error (variability) visually through the variability band and provides an objective rule.

Data features

Regarding trend, the use of the Theil–Sen method is justified for single-case data (Tarlow, 2017; Tarlow & Brossart, 2018; Vannest, Parker, Davis, Soares, & Smith, 2012) and is potentially more widely known as part of robust statistics (Wilcox, 2012). Theil–Sen is also resistant to outliers (unlike ordinary least squares estimation) and to the ceiling and floor effects, as cautioned against by Wolery, Busick, Reichow, and Barton (2010).

Regarding the fact that a trend line is represented rather than a mean line, it has to be noted that a trend line is a more comprehensive estimate of data than a mean/median line representing level. Trend can include the mean/median line as case of trend with a zero slope but the reverse is not true. Projecting a trend line avoids the need for checking several detrending options (Carlin & Costello, 2018) before obtaining the main quantification.

Regarding variability, VAIOR takes it into account in order not to assume that the trend line is a perfect representation of the data (Tarlow & Brossart, 2018). Moreover, including trend lines only may not be sufficient to ensure appropriate performance of novice visual analysts, with data variability and the presence of extreme values being relevant factors (Nelson, Van Norman, & Christ, 2017). VAIOR does represent data variability apart from trend, whereas fitting a straight line using the Theil–Sen method ensures that the influence of outliers is reduced. This is also well aligned with Wolery et al.’s (2010) comment that an analytical procedure needs to consider all data features. Furthermore, the variability band represented in VAIOR is based on a measure of variability of the data obtained, in contrast to other arbitrary rules such as the stability envelope (Lane & Gast, 2014; Lobo et al., 2017).

Regarding overlap, it is quantified after trend is taken into account, as recommended (Kennedy, 2005; Parker, Vannest, Davis, & Sauber, 2011; Wolery et al., 2010).

Focus of the analysis

We consider that the quantifications should reflect the type of effect expected before gathering the data (as has been recommended for randomization tests, Heyvaert & Onghena, 2014). Moreover, the quantification of the immediacy of the effect avoids dealing with far-away projections of the baseline trend considering the limitations of such projections (Manolov, Solanas, & Sierra, 2018; Parker et al., 2011), as well as a likely ceiling. In addition, when a comparison between actual and predicted data is performed only for the first three data in intervention phase, it becomes less relevant whether the straight line is the best possible representation of the baseline trend.

Outcomes of the application of VAIOR to an A-B comparison

We consider that a dichotomous indication regarding the presence/absence of a basic effect is warranted, because the application of visual analysis, in general, may be expected to lead to such a dichotomous decision. Regarding the percentage outcome, it would represent the number of intervention phase measurements that do not overlap with the expected range of intervention phase data in case the baseline trend continued. Thus, it would also be a useful quantification, especially if a quantitative integration of results across studies is desired.

Consistency

Regarding the overall assessment of the presence of an intervention effect, a distinction needs to be made between MBDs and reversal designs. In an MBD, all the comparisons follow the A-B sequence and the application of VAIOR is straightforwardly repeated for each tier. Thus, VAIOR is applicable to designs dealing with nonreversible behaviors. In an A-B-A-B design, one of the three comparisons follows the B-A sequence. This has led some authors to suggest omitting the B₁-A₂ comparison (Parker & Vannest, 2012; Scruggs & Mastropieri, 1998). Although this comparison is conceptually different, omitting it would lead to having only two attempts to demonstrate the basic effect. As an alternative, VAIOR can be applied to the B₁-A₂ comparison by fitting the trend line to the B₁ data and projecting it, together with the variability band, into the A₂ phase. If a researcher is not willing to consider the B₁-A₂ comparison, three attempts to demonstrate an effect would only be available in an A-B-A-B-A-B design. In any case, VAIOR is applicable to reversible behaviors.

Description and Justification of VAIOR for ATDs

The application of VAIOR to an ATD requires six steps and they are described in this section. Only justifications specific to the application to ATDs are included.

Referring to the analysis of ATDs in general, it has been stated that the “[d]ifferentiation in data paths is the primary means by which data are evaluated” (Ledford et al., 2018, p. 11), considering that the data paths are formed by the lines connecting the measurements belonging to the same condition. This differentiation can be considered as a difference in level or trend between the conditions compared (Wolery, Gast, & Ledford, 2014; trend is especially stressed by Janosky, Leininger, Hoerger, & Libkuman, 2009). Alternatively, the differentiation can be conceptualized as nonoverlap between the data paths (Barlow, Nock, & Hersen, 2009), understood as the lines corresponding to different conditions not crossing (Tate & Perdices, 2019). The visual structured criterion proposed by Lanovaz, Cardinal, and Francis (2019) and the procedure called ALIV (Manolov & Onghena, 2018) both refer to this way of analyzing ATD data. Another option is to perform comparisons between adjacent data points (Ledford et al., 2019). The specific application of the percentage of nonoverlapping data described by Wolery et al. (2014) and the procedure called ADISO (Manolov & Onghena, 2018) reflect this second kind of ATD data analysis. The steps for applying VAIOR, described next, are related to the comparison of data paths and not to the comparison of adjacent data points.

Step 1: Estimate Theil–Sen trend for the Condition A data. In this case, the independent variable in the Theil–Sen method (i.e., the measurement occasions) is not coded 1, 2, 3, . . ., n_A, but represents the actual measurement session numbers for Condition A. For instance, if there are 10 measurement occasions, five per condition, and Condition A takes place in sessions 1, 4, 6, 8, and 9, these would be the values of the independent variable for estimating the trend line.

Step 2: Compute MAD on the basis of the difference between actual Condition A data points and predicted ones, represented by the trend line.

Step 3: Construct variability bands on the basis of trend plus/minus the MAD value. These variability bands, as is also the case for the trend line, refer only to the measurement occasions in the limits between the first and the last Condition A measurement occasion (i.e., between Sessions 1 and 9 in the current example). Thus, for some of the Condition B data points (e.g., Measurement Occasion 10), there will be no comparison possible. The idea not to extrapolate is also present in the visual structured criterion (Lanovaz et al., 2019) and in ALIV (Manolov & Onghena, 2018).

Step 4: Perform the main comparison between the A and the B phase. The Condition B data points are compared with the relevant limit of the variability band.

Step 5: Summarize the results from the A-B comparison as a percentage. Specifically, the percentage of Condition B data points that improve the variability band around the Condition A data path is computed. This percentage refers to the total number of Condition B data points actually compared (i.e., Measurement Occasions 2, 3, 5, and 7 in our example).

Step 6: Summarize the results from the A-B comparison in terms of a dichotomous decision regarding whether a positive effect is present or not. The rule or requirement here is a 100% improvement or at least twice the proportion of baseline values out of the variability band, whichever is smaller. ATDs with at least five sessions per condition (Wolery et al., 2014) and five repetitions of the alternating sequence (Kratochwill et al., 2010) are considered sufficient for demonstrating a functional relation. Therefore, one application of VAIOR to such a data sequence would be sufficient for demonstrating a functional relation, without the need to refer to additional rules for deciding whether there is consistency in the effect of the intervention.

Overall, the strength of the application of VAIOR to ATD data is that it identifies data patterns with clear distinction between the data paths even in presence of some minimal overlap (e.g., a single Condition B data point is within the variability band). Moreover, the variability band identifies situations in which the consistent differences are minimal or practically equivalent (e.g., when all or almost all Condition B data points are within the variability band, despite being superior to the Condition A data points).

Interpretation of the results

The website https://manolov.shinyapps.io/TrendMAD marks with different colors the data points that improve the variability band and the ones that do not, to aid the visual inspection. Moreover, there is a quantification of the percentage of data points improving the variability band and a dichotomous indication of whether the criterion for a “positive result” was met.

A Caution Regarding the Use of VAIOR

A relevant question is whether VAIOR can be considered applicable to any kind of data pattern. Given that the comparison performed is strongly related to the baseline trend, it is crucial to evaluate how well the trend line fits the baseline data in either MBDs, A-B-A-B designs, or ATDs. Quantitative indicators include the mean absolute scaled error (Hyndman & Koehler, 2006) being less than 1 (Manolov, 2018b), the coefficient of variation being less than 10% (Mendenhall & Sincich, 2012), and the trend stability envelope including at least 80% of the data (Lane & Gast, 2014). The general idea is that in case the baseline data are too variable to fit (and in MBDs and reversal designs, to project) any meaningful trend line, then this VAIOR procedure is likely to be less useful. Actually, data with great variability may reflect uncontrolled sources of error in the environment or measure. In such cases, data analysis may not be justified at all.

Overall, there are three aspects requiring human judgment and professional experience: (a) the assessment of whether the fit is reasonable, using numerical and visual criteria; (b) the assessment of whether the subsequent projection (with the corresponding variability band) is reasonable or it leads to impossible values; and (c) the focus of the analysis: on an immediate, overall, or progressive/delayed effect. This is well aligned with the evidence-based practice movement where evidence-based treatments are evaluated in context (Vannest & Davis, 2013) and should not be completely eliminated from the application of the visual aid that we propose.

Evaluation of the Proposal

In this section, we first present the application of VAIOR to real data, using different single case experimental designs (SCEDs). Afterward, a more thorough evaluation is presented using generated data with known underlying parameters.

Illustration With Real Data

Illustrating an A-B comparison, Figure 1 represents the data from Allen, Vatland, Bowen, and Burke (2015): The target behavior is ordering food in this tier of the MBD. The output is obtained via the website https://manolov.shinyapps.io/TrendMAD. The right panel includes VAIOR added to the data: The shaded area represents the variability band constructed on the basis of the Theil–Sen baseline trend plus/minus the MAD value. In the website, the colors of the intervention phase data points reflect improvement over the projected variability band (green), or over the projected baseline trend (yellow), or no improvement over either of the projections (red).

Figure 1.

Application of the proposal to an A-B-comparison; data on ordering food as target behavior gathered by Allen et al. (2015).

Figure 1 shows that, in terms of an immediate effect, there is 100% improvement over the upper limit of the variability band. In contrast, for a delayed or progressive effect, none of the last three intervention phase values would represent an improvement. Regarding the overall effect, VAIOR compares the percentage of baseline data points outside the baseline variability band (44%: four of the nine values are outside the baseline variability band) and, specifically, the criterion requires 88% of the intervention data points (i.e., at least 21 out of 23 values) to improve the upper limit of the variability band. Only 21.74% (5/23) of the intervention measurements show such an improvement and thus the criterion is not met.

Illustrating an ATD, Figure 2 represents the data from Taipalus, Hixson, Kanouse, Wyse and Fursa (2017) for Students 1 and 2. For Student 1 (upper panel of Figure 2, obtained via https://manolov.shinyapps.io/TrendMAD), the proposal helps ignore the one potentially outlying value. This visual indicates that one of the conditions (the “ball” condition in Taipalus et al., 2017) is clearly superior to the chair condition with 100% of the values of the ball condition outside the variability band constructed on the values of the chair condition. For Student 2 (lower panel of Figure 2), the proposal suggests that only 25% of the intervention phase data points represent an improvement in the “ball” condition, whereas 75% represent a deterioration.

Figure 2.

Application of the proposal to a comparison phase of an alternating treatments design; data gathered by Taipalus et al. (2017) for Student 1.

Simulation Study

Data generation model: A-B comparisons

For data generation, we used the model described by Huitema and McKean (2000): $y_{t} = β_{0} + β_{1} T + β_{2} D_{t} + β_{3} D_{t} (T - (n_{A} + 1)) + ε_{t}$ . This model includes a term for general trend (T), taking values from 1 to the total number of measurements contained in the comparison (n_A measurements in baseline phase plus n_B measurements in the intervention phase), with its magnitude being defined by the parameter β₁ (set to 0 and 0.2). The model also allows specifying an immediate and sustained change in level, via the dummy variable D_t (n_A times 0 followed by n_B times 1), with its magnitude being defined by the simulation parameter β₂ (set to 0, 1, 2, and 3). The effect sizes were chosen to cover the range of small effects (0 to 1), moderate effects (1 to 2.5), and large effects (greater than 2.5), as suggested by Harrington and Velicer (2015), and they were also used in previous simulations (Ferron, Moeyaert, Van den Noortgate, & Beretvas, 2014; Levin et al., 2017). Another possibility is to model change in slope, via the interaction term between general trend and the dummy variable, $D_{t} (T - (n_{A} + 1))$ , and the corresponding magnitude, β₃ (set to 0, 0.2, and 0.4; 0.2 was used by Moeyaert, Rindskopf, Onghena, & Van den Noortgate, 2017 and Moeyaert, Ugille, Ferron, Beretvas, & Van den Noortgate, 2013). For the error term, ε_t, we specified a first-order autoregressive model, ε_t = u_t + ρ₁ u_t _{– 1}, in which the degree of autocorrelation is specified in ρ₁ (set to 0, 0.2, and 0.4, as in Ferron et al., 2014, and Swan & Pustejovsky, 2018, also well aligned with the average autocorrelations reported by Shadish & Sullivan, 2011: 0.191 for phase change with reversal, 0.320 for MBD, and −0.010 for ATD), and the random disturbance u_t follows a standard normal distribution. The phase lengths included were 5, 7, and 10, similar to previous simulations (e.g., Ferron et al., 2014; Pustejovsky, 2019; Swan & Pustejovsky, 2018), also reflecting the finding (Shadish & Sullivan, 2011) that almost half of the initial baselines included less than five measurements. Using the R software, we ran 10,000 simulations, for each condition defined by the combination of simulation parameters.

The fact that A-B comparisons were simulated does not entail that the simulation is relevant only for A-B designs, given that the A-B comparisons are the core of more complex and more methodologically appropriate single-case designs (Pustejovsky, 2019). The overall assessment of intervention effectiveness entails pooling the evidence from the separate A-B comparisons.

Data generation model: ATD with restricted randomization

The same models and parameter values for β₁, β₂, β₃, and ρ₁ were used, as for A-B comparisons. However, the way in which the dummy variable D_t is defined is different, given that we introduced the common restriction (Onghena & Edgington, 1994; Wolery et al., 2014) of having no more than two consecutive administrations of the same condition. Thus, also included were the sequences that can be obtained in ATDs with randomized blocks (see Onghena & Edgington, 2005). The number of measurements per condition (n_A = n_B) was set to 6 and 8, values also used in previous simulations on ATD (Lanovaz et al., 2019; Manolov, 2018a).

Data analysis

For the 10,000 simulated data sets for A-B comparisons, we performed comparisons to the projected upper variability band line, considering that the treatment effect simulated was an increase. Specifically, we tallied the number of iterations for which the proposal suggested (a) an immediate effect: 100% improvement of the first three intervention phase data points; (b) a progressive effect: 100% improvement of the last three intervention phase data points; and (c) an overall effect: 100% improvement or at least twice the proportion of baseline values out of the variability band, whichever is smaller. For the conditions in which there was no intervention effect simulated, an effect detected by the proposal would be a “false positive.” For the conditions in which there was an intervention effect simulated, an effect detected by the proposal would be a “true positive.” We are not using here the terms Type I error or statistical power, given that the proposal does not perform a test of statistical significance.

For A-B comparisons, the performance of the proposal was compared with three other similar options. First, the CDC (Fisher et al., 2003) was applied. The CDC entails obtaining the mean line and the split-middle trend line for the baseline data and projecting them into the subsequent intervention phase. Both lines are shifted upward or downward (according to the effect expected) 0.25 times the standard deviation of the baseline data. Afterward, the number of intervention phase data points that improve both lines is tallied. Finally, the binomial distribution is used to obtain the probability of obtaining as many improved intervention data points. A probability equal to or smaller than .05 is considered a positive result.

Second, the trend stability envelope (Lane & Gast, 2014) was constructed around the split-middle trend line for the baseline data. This envelope is based on the trend line plus/minus 25% of the baseline median. Afterward, the envelope is projected into the intervention phase. If the proportion of intervention phase data points included in the projected variability band is less than 80%, then we can conclude that the intervention phase data do not follow the baseline trend. Note that Lane and Gast (2014) suggest using the trend stability envelope for within-phase evaluation of variability, whereas we here adapted it for a between-phases assessment. Moreover, we required more than 20% of the intervention phase values being not just outside the envelope, but specifically above the upper line, as an increase of the target behavior was simulated. Third, we defined a variability band around the split-middle trend line on the basis of the variability in the baseline data, as quantified as the interquartile range of the differences between the actual values and the predicted ones from trend line (Manolov, Sierra, Solanas, & Botella, 2014). According to this option, if there is at least one intervention phase value above the variability band defined by the trend line plus/minus 1.5 times the interquartile range (as in the boxplot rule for detecting outliers), a “positive” result is obtained.

For the ATDs, we focused on the assessment of an overall effect according to the proposal: 100% improvement or at least twice the proportion of baseline values out of the variability band, whichever is smaller. It is less clear how to obtain a split-middle trend line for an ATD and, therefore, a comparison between the proposal and the previously mentioned alternatives was not compared for this kind of design.

Results and discussion of the simulation study

VAIOR, in its three different versions, shows lower false positive rates in identifying effects for the A-B-phase comparisons than the median-based and interquartile range–based trend stability envelopes (see Table 1). Nevertheless, in general, all procedures (including VAIOR) produce false positives rates that would be deemed completely unacceptable in case we were interpreting the results as Type I error rates (in which case a rate similar to .05 would be expected). The exception is the CDC, which yields acceptable false positive rates only when there is no general trend. It should be noted that only the CDC could be considered a statistical test, because it uses a probability model. The results for the VAIOR are especially good for longer series (i.e., n_A and n_B = 10) for which it even outperforms the CDC.

Table 1.

Relative Frequency of False Positives for A-B-Comparisons.

Procedure	Phase lengths n_A = 5, n_B = 5. Autocorrelation (ρ₁) and general trend (β₁)
Procedure	ρ₁ = 0; β₁ = 0	ρ₁ = .2; β₁ = 0	ρ₁ = .4; β₁ = 0	ρ₁ = 0; β₁ = 0.2	ρ₁ = .2; β₁ = 0.2	ρ₁ = .4; β₁ = 0.2
CDC	.020	.033	.051	.120	.128	.148
Md envelope	.611	.608	.597	.571	.577	.571
IQR envelope	.456	.488	.507	.469	.487	.498
Overall	.305	.330	.341	.306	.321	.341
Immediate	.222	.235	.258	.215	.239	.256
Delayed	.292	.300	.339	.283	.306	.327
Procedure	Phase lengths n_A = 7, n_B = 7. Autocorrelation (ρ₁) and general trend (β₁)
Procedure	ρ₁ = 0; β₁ = 0	ρ₁ = .2; β₁ = 0	ρ₁ = .4; β₁ = 0	ρ₁ = 0; β₁ = 0.2	ρ₁ = .2; β₁ = 0.2	ρ₁ = .4; β₁ = 0.2
CDC	.005	.012	.026	.111	.128	.140
Md envelope	.666	.650	.637	.622	.619	.602
IQR envelope	.521	.546	.532	.512	.522	.532
Overall	.204	.232	.258	.217	.242	.265
Immediate	.164	.193	.214	.163	.182	.207
Delayed	.259	.292	.314	.255	.289	.315
Procedure	Phase lengths n_A = 10, n_B = 10. Autocorrelation (ρ₁) and general trend (β₁)
Procedure	ρ₁ = 0; β₁ = 0	ρ₁ = .2; β₁ = 0	ρ₁ = .4; β₁ = 0	ρ₁ = 0; β₁ = 0.2	ρ₁ = .2; β₁ = 0.2	ρ₁ = .4; β₁ = 0.2
CDC	.005	.012	.029	.198	.213	.232
Md envelope	.666	.649	.629	.593	.583	.587
IQR envelope	.525	.524	.529	.507	.514	.530
Overall	.055	.076	.104	.059	.079	.104
Immediate	.106	.135	.173	.114	.136	.166
Delayed	.216	.250	.296	.217	.250	.298

Note. Overall, immediate, and delayed effects refer to the three versions of the proposal made in the current text. CDC = conservative dual criterion; Md = median; IQR = interquartile range.

Autocorrelation inflates the false positives rates for the CDC and VAIOR, but their performance is still better than the one of the other two procedures. General trend inflates the false positives rates for the CDC. Therefore, VAIOR seems to be a reasonable alternative to the CDC if conservative judgment is desired.

The rates of true positives for VAIOR are lower than for the median-based and interquartile range–based trend stability envelopes and higher than for the CDC in A-B-phase comparisons (see Table 2). The proposal performs best when looking for an immediate effect (when such an effect is present, that is, when β₂ ≠ 0) and when looking for a delayed effect (when there is a change in slope, that is, when β₃ ≠ 0). Even when looking for an overall effect, the proposal detects the change in level more frequently than the CDC. The CDC was not designed to deal with progressive effects (i.e., change in slope) and it does not detect them frequently enough. Therefore, VAIOR is better than the CDC according to this criterion. Even though the true positives rates for the proposal were not as high as for the trend stability envelopes, these latter procedures were excessively liberal and should not be recommended.

Table 2.

Relative Frequency of True Positives for A-B-Comparisons for Detecting Immediate and Sustained Change in Level (β₂ = 1 or β₂ = 2 or β₂ = 3) or a Change in Slope (β₃ = 0).

Procedure	Phase lengths n_A = 5, n_B = 5. No autocorrelation (ρ₁ = 0) or trend (β₁ = 0)
Procedure	β₂ = 1; β₃ = 0	β₂ = 2; β₃ = 0	β₂ = 3; β₃ = 0	β₂ = 0; β₃ = 0.2	β₂ = 0; β₃ = 0.4
CDC	.188	.543	.782	.059	.104
Md envelope	.799	.920	.976	.690	.759
IQR envelope	.656	.829	.927	.544	.635
Overall	.504	.709	.846	.366	.426
Immediate	.430	.680	.845	.244	.268
Delayed	.455	.634	.776	.374	.466
Procedure	Phase lengths n_A = 7, n_B = 7. No autocorrelation (ρ₁ = 0) or trend (β₁ = 0)
Procedure	β₂ = 1; β₃ = 0	β₂ = 2; β₃ = 0	β₂ = 3; β₃ = 0	β₂ = 0; β₃ = 0.2	β₂ = 0; β₃ = 0.4
CDC	.123	.475	.755	.036	.093
Md envelope	.844	.945	.987	.771	.869
IQR envelope	.722	.867	.949	.646	.773
Overall	.425	.670	.843	.306	.403
Immediate	.396	.680	.881	.194	.221
Delayed	.430	.630	.787	.424	.620
Procedure	Phase lengths n_A = 10, n_B = 10. No autocorrelation (ρ₁ = 0) or trend (β₁ = 0)
Procedure	β₂ = 1; β₃ = 0	β₂ = 2; β₃ = 0	β₂ = 3; β₃ = 0	β₂ = 0; β₃ = 0.2	β₂ = 0; β₃ = 0.4
CDC	.204	.633	.845	.115	.306
Md envelope	.856	.960	.992	.826	.941
IQR envelope	.736	.888	.960	.717	.880
Overall	.222	.478	.742	.122	.180
Immediate	.368	.695	.907	.142	.176
Delayed	.418	.638	.809	.541	.835

Note. The results correspond to data with no autocorrelation (ρ₁ = 0) or trend (β₁ = 0). Overall, immediate, and delayed effects refer to the three versions of the proposal made in the current text. CDC = conservative dual criterion; Md = median; IQR = interquartile range.

Regarding ATDs with restricted randomization, the false positive rates (as reported in Table 3) are sufficiently low to ensure that the use of the proposal does not lead to liberal conclusions. VAIOR performs better at detecting effects when the magnitude of difference is large (i.e., when β₂ ≥ 2, a difference of two standard deviations between the conditions). This is also true in cases where the difference between the conditions increases with time (especially for β₃ = 0.4).

Table 3.

Relative Frequency of False Positives (β₂ = 0 and β₃ = 0) and True Positives (β₂ ≠ 0 or β₃ ≠ 0) for Detecting an Overall Effect in Alternating Treatments Designs.

Data points per condition	Autocorrelation (ρ₁) and general trend (β₁). No effect—false positives
Data points per condition	ρ₁ = 0; β₁ = 0	ρ₁ = .2; β₁ = 0	ρ₁ = .4; β₁ = 0	ρ₁ = 0; β₁ = 0.2	ρ₁ = .2; β₁ = 0.2	ρ₁ = .4; β₁ = 0.2
n_A = 6, n_B = 6	.043	.047	.042	.046	.044	.042
n_A = 8, n_B = 8	.008	.005	.005	.008	.007	.005
Data points per condition	Autocorrelation (ρ₁) and general trend (β₁). β₂ = 1—true positives for change in level
Data points per condition	ρ₁ = 0; β₁ = 0	ρ₁ = .2; β₁ = 0	ρ₁ = .4; β₁ = 0	ρ₁ = 0; β₁ = 0.2	ρ₁ = .2; β₁ = 0.2	ρ₁ = .4; β₁ = 0.2
n_A = 6, n_B = 6	.280	.295	.299	.287	.294	.298
n_A = 8, n_B = 8	.133	.130	.130	.133	.132	.123
Data points per condition	Autocorrelation (ρ₁) and general trend (β₁). β₂ = 2—true positives for change in level
Data points per condition	ρ₁ = 0; β₁ = 0	ρ₁ = .2; β₁ = 0	ρ₁ = .4; β₁ = 0	ρ₁ = 0; β₁ = 0.2	ρ₁ = .2; β₁ = 0.2	ρ₁ = .4; β₁ = 0.2
n_A = 6, n_B = 6	.695	.721	.732	.699	.733	.734
n_A = 8, n_B = 8	.563	.575	.563	.561	.582	.569
Data points per condition	Autocorrelation (ρ₁) and general trend (β₁). β₂ = 3—true positives for change in level
Data points per condition	ρ₁ = 0; β₁ = 0	ρ₁ = .2; β₁ = 0	ρ₁ = .4; β₁ = 0	ρ₁ = 0; β₁ = 0.2	ρ₁ = .2; β₁ = 0.2	ρ₁ = .4; β₁ = 0.2
n_A = 6, n_B = 6	.939	.948	.954	.940	.949	.953
n_A = 8, n_B = 8	.908	.915	.913	.905	.917	.913
Data points per condition	Autocorrelation (ρ₁) and general trend (β₁). β₃ = 0.2—true positives for change in slope
Data points per condition	ρ₁ = 0; β₁ = 0	ρ₁ = .2; β₁ = 0	ρ₁ = .4; β₁ = 0	ρ₁ = 0; β₁ = 0.2	ρ₁ = .2; β₁ = 0.2	ρ₁ = .4; β₁ = 0.2
n_A = 6, n_B = 6	.365	.357	.375	.359	.368	.375
n_A = 8, n_B = 8	.274	.288	.276	.289	.285	.276
Data points per condition	Autocorrelation (ρ₁) and general trend (β₁). β₃ = 0.4—true positives for change in slope
Data points per condition	ρ₁ = 0; β₁ = 0	ρ₁ = .2; β₁ = 0	ρ₁ = .4; β₁ = 0	ρ₁ = 0; β₁ = 0.2	ρ₁ = .2; β₁ = 0.2	ρ₁ =.4; β₁ = 0.2
n_A = 6, n_B = 6	.694	.703	.709	.693	.708	.709
n_A = 8, n_B = 8	.638	.643	.652	.631	.645	.656

General Discussion

VAIOR is intended to be a user-friendly objective rule leading to the same results when used by any visual analyst. In that sense, it builds on the logic of stability envelopes and the CDC, but encompassing more data features. In terms of applicability, VAIOR can be used not only with MBDs and reversal designs but also with ATDs, unlike the CDC and unlike recent analytical proposals (e.g., Carlin & Costello, 2018; Pustejovsky, 2018; Tarlow & Brossart, 2018).

Regarding performance, it is noteworthy that the CDC only performs better (lower false positives than VAIOR) when there is zero trend, or when data series are short (i.e., n_A = n_B = 5). In contrast, VAIOR produces lower false positives rates for longer series and higher true positives rates (especially for progressive effects). Furthermore, the high true positive rates for larger effects are well aligned with the use of visual analysis for identifying successful interventions with large effects (Lanovaz & Rapp, 2016).

Recommendations Regarding the Evaluation of Intervention Effects

When assessing intervention effectiveness and the quality of the evidence obtained, the characteristics of the design have to be considered. Several methodological quality scales can be consulted for desirable design and study features (Zimmerman et al., 2018). Despite the focus of the current text on visual analysis, social validity has to be assessed as well (Horner et al., 2005), an aspect that is yet to receive sufficient attention (Snodgrass, Chung, Meadan, & Halle, 2018).

Apart from visual analysis and any potential dichotomous decision regarding the presence or absence of an effect, it is important to quantify the magnitude of difference between conditions, because a convergence between several procedures increases the confidence in the conclusions regarding intervention effectiveness (Lobo et al., 2017; Vannest et al., 2018). Thus, we recommend to always compute an effect size index, apart from using visual aids, to avoid publication bias (Shadish, Zelinsky, Vevea, & Kratochwill, 2016). Although there is no consensus currently on the effect size to use in the SCED context (Busse, McGill, & Kennedy, 2015; Kratochwill et al., 2010; Tate et al., 2013), there are several options available, such as nonoverlap indices (Parker et al., 2011; Vannest & Ninci, 2015), response ratios (Pustejovsky, 2018), between-case standardized mean differences (Odom, Barton, Reichow, Swaminathan, & Pustejovsky, 2018), and multilevel models (Ferron, Bell, Hess, Rendina-Gobioff, & Hibbard, 2009). (See Gage & Lewis, 2013, and Manolov & Moeyaert, 2017, for summary reviews.)

Finally, it is important to fully report (Tate et al., 2016) all the characteristics of the participant, setting, and intervention, the characteristics of the design (e.g., whether a multiple-baseline is concurrent or not), the visual inspection rules followed, the quantifications of magnitude of effect, and the assessment of social validity.

Limitations and Future Research

Regarding the potential limitations of VAIOR, it is not intended as a test for statistical significance, clinical significance, or as a substitute for visual analysis or for the assessment of social validity. It could be argued that the fact that VAIOR entails several steps may reduce the ease to follow all calculations. Nevertheless, the CDC (Fisher et al., 2003) and current visual protocols also include multiple steps (Wolfe et al., 2019). Moreover, the website developed (see https://manolov.shinyapps.io/TrendMAD) incorporates all the calculations, which could be summarized in three steps: fitting the baseline trend line, projecting it into the intervention phase, and constructing around it a variability band. Furthermore, marking the improving data points in green, marking the deteriorating data points in red, and marking in yellow the points exceeding the trend line but not the variability band makes the visual inspection easier.

In addition, the use of the outcomes of VAIOR for a quantitative integration of results across studies (i.e., a meta-analysis) present some challenges. On one hand, if the rule for a dichotomous decision regarding the presence or absence of a positive effect is used, the only option for quantitative integration is vote-counting (Bushman & Wang, 2009). On the other hand, if the percentage nonoverlap outcome is used, it can be combined as done for other nonoverlap measures previously (e.g., Schlosser, Lee, & Wendt, 2008). However, Wolery et al. (2010) argue that overlap indices present important limitations such as not considering level, trend, and variability, and not measuring the magnitude of effect, especially in relation to ceiling effects once 100% nonoverlap is achieved. In relation to VAIOR, the percentage overlap is obtained once level, trend, and variability are considered, but the ceiling effect is still possible. To circumvent the drawbacks of nonoverlap measures, a quantification is possible, in the context of VAIOR, using the average difference between the projected relevant limit of the variability band and the actual intervention phase data, as in the Mean phase difference (Manolov & Solanas, 2013). Nevertheless, it has to be taken into account that VAIOR was developed mainly as a visual aid and not as an effect size index, similar to how the CDC is used.

In terms of the limitations of the simulation study, the evidence obtained using generated data is constrained to the conditions actually tested: a continuous (Normal) model for the random term u_t, immediate and sustained change in level, linear general trend, and change in slope without modifying the linearity of the data pattern. Furthermore, the conditions simulated could have been broader in terms of series lengths and values of the autocorrelation parameter, but we wanted to keep the study feasible and similar to previous simulations.

A relevant future research would be to perform a field test with real data for obtaining the typical percentage values for VAIOR, as was done for the Nonoverlap of all pairs (Parker & Vannest, 2009) and for Tau-U (Parker et al., 2011). Moreover, a future study could compare the usability of VAIOR versus the CDC when training visual analysts.

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Notes

Author Biographies

Rumen Manolov, PhD, is a researcher and lecturer at the Faculty of Psychology at the University of Barcelona. His investigation is focused on single-case designs, specifically, testing and proposing data analytical techniques, as well as developing and promoting R code and user-friendly Shiny applications for single-case data analysis.

Kimberly Vannest, PhD, is a professor in special education and research faculty fellow at Texas A&M University. Her research interests involve the prevention and treatment of emotional and behavioral disorders in children and youth, the identification of evidence-based practices, and single-case experimental design methodology.

References

Allen

K. D.

Vatland

Bowen

S. L.

Burke

R. V.

(2015). Parent-produced video self-modeling to improve independence in an adolescent with intellectual developmental disorder and an autism spectrum disorder: A controlled case study. Behavior Modification, 39, 542-556. doi:10.1177/0145445515583247

Barlow

D. H.

Nock

M. K.

Hersen

(2009). Single case experimental designs: Strategies for studying behavior change (3rd ed.). Boston, MA: Pearson.

Barton

E. E.

Meadan

Fettig

(2019). Comparison of visual analysis, non-overlap methods, and effect sizes in the evaluation of parent implemented functional assessment based interventions. Research in Developmental Disabilities, 85, 31-41. doi:10.1016/j.ridd.2018.01.007

Bushman

B. J.

Wang

M. C.

(2009). Vote-counting procedures in meta-analysis. In Cooper

Hedges

L. V.

Valentine

J. C.

(Eds.), The handbook of research synthesis and meta-analysis (2nd ed., pp. 207-220). New York, NY: Russell Sage Foundation.

Busse

R. T.

McGill

R. J.

Kennedy

K. S.

(2015). Methods for assessing single-case school-based intervention outcomes. Contemporary School Psychology, 19, 136-144. doi:10.1007/s40688-014-0025-7

Carlin

M. T.

Costello

M. S.

(2018). Development of a distance-based effect size metric for single-case research: Ratio of distances. Behavior Therapy, 49, 981-994. doi:10.1016/j.beth.2018.02.005

Dart

E. H.

Radley

K. C.

(2017). The impact of ordinate scaling on the visual analysis of single-case data. Journal of School Psychology, 63, 105-118. doi:10.1016/j.jsp.2017.03.008

Diller

J. W.

Barry

R. J.

Gelino

B. W.

(2016). Visual analysis of data in a multielement design. Journal of Applied Behavior Analysis, 49, 980-985. doi:10.1002/jaba.325

Ferron

J. M.

Bell

B. A.

Hess

M. R.

Rendina-Gobioff

Hibbard

S. T.

(2009). Making treatment effect inferences from multiple-baseline data: The utility of multilevel modeling approaches. Behavior Research Methods, 41, 372-384. doi:10.3758/BRM.41.2.372

10.

Ferron

J. M.

Moeyaert

Van den Noortgate

Beretvas

S. N.

(2014). Estimating causal effects from multiple-baseline studies: Implications for design and analysis. Psychological Methods, 19, 493-510. doi:10.1037/a0037038

11.

Fisher

W. W.

Kelley

M. E.

Lomas

J. E.

(2003). Visual aids and structured criteria for improving visual inspection and interpretation of single-case designs. Journal of Applied Behavior Analysis, 36, 387-406. doi:10.1901/jaba.2003.36-387

12.

Fox

(2016). Applied regression analysis and generalized linear models (3rd ed.). London, England: SAGE.

13.

Gage

N. A.

Lewis

T. J.

(2013). Analysis of effect for single-case design research. Journal of Applied Sport Psychology, 25, 46-60. doi:10.1080/10413200.2012.660673

14.

Gast

D. L.

Spriggs

A. D.

(2010). Visual analysis of graphic data. In Gast

D. L.

(Ed.), Single subject research methodology in behavioral sciences (pp. 199-233). London, England: Routledge.

15.

Harrington

Velicer

W. F.

(2015). Comparing visual and statistical analysis in single-case studies using published studies. Multivariate Behavioral Research, 50, 162-183. doi:10.1080/00273171.2014.973989

16.

Heyvaert

Onghena

(2014). Randomization tests for single-case experiments: State of the art, state of the science, and state of the application. Journal of Contextual Behavioral Science, 3, 51-64. doi:10.1016/j.jcbs.2013.10.002

17.

Horner

R. H.

Carr

E. G.

Halle

McGee

Odom

Wolery

(2005). The use of single-subject research to identify evidence-based practice in special education. Exceptional Children, 71, 165-179. doi:10.1177/001440290507100203

18.

Huitema

B. E.

McKean

J. W.

(2000). Design specification issues in time-series intervention models. Educational and Psychological Measurement, 60, 38-58. doi:10.1177/00131640021970358

19.

Hyndman

R. J.

Koehler

A. B.

(2006). Another look at measures of forecast accuracy. International Journal of Forecasting, 22, 679-688. doi:10.1016/j.ijforecast.2006.03.001

20.

Institute of Education Sciences. (2018). What works clearinghouse standards handbook version 4.0. Retrieved from https://ies.ed.gov/ncee/wwc/Docs/referenceresources/wwc_standards_handbook_v4.pdf

21.

Janosky

J. E.

Leininger

S. L.

Hoerger

M. P.

Libkuman

T. M.

(2009). Single subject designs in biomedicine. Dordrecht, The Netherlands: Springer.

22.

Karazsia

B. T.

(2018). Editorial: New instructions for single-subject research in the Journal of Pediatric Psychology. Journal of Pediatric Psychology, 43, 585-587. doi:10.1093/jpepsy/jsy039

23.

Kennedy

C. H.

(2005). Single-case designs for educational research. Boston, MA: Pearson.

24.

Kipfmiller

K. J.

Brodhead

M. T.

Wolfe

LaLonde

Sipila

E. S.

Bak

M. S.

Fisher

M. H.

(2019). Training front-line employees to conduct visual analysis using a clinical decision-making model. Journal of Behavioral Education. Advance online publication. doi:10.1007/s10864-018-09318-1

25.

Kratochwill

T. R.

Hitchcock

Horner

R. H.

Levin

J. R.

Odom

S. L.

Rindskopf

D. M.

Shadish

W. R.

(2010). Single-case designs technical documentation. Retrieved from https://ies.ed.gov/ncee/wwc/Docs/ReferenceResources/wwc_scd.pdf

26.

Kratochwill

T. R.

Levin

J. R.

Horner

R. H.

Swoboda

C. M.

(2014). Visual analysis of single-case intervention research: Conceptual and methodological issues. In Kratochwill

T. R.

Levin

J. R.

(Eds.), Single-case intervention research: Methodological and statistical advances (pp. 91-125). Washington, DC: American Psychological Association.

27.

Lane

J. D.

Gast

D. L.

(2014). Visual analysis in single case experimental design studies: Brief review and guidelines. Neuropsychological Rehabilitation, 24, 445-463. doi:10.1080/09602011.2013.815636

28.

Lanovaz

Cardinal

Francis

(2019). Using a visual structured criterion for the analysis of alternating-treatment designs. Behavior Modification, 43, 115-131. doi:10.1177/0145445517739278

29.

Lanovaz

M. J.

Rapp

J. T.

(2016). Using single-case experiments to support evidence-based decisions: How much is enough? Behavior Modification, 40, 377-395. doi:10.1177/0145445515613584

30.

Ledford

J. R.

Barton

E. E.

Hardy

J. K.

Elam

Seabolt

Shanks

. . . Kaiser

(2016). What equivocal data from single case comparison studies reveal about evidence-based practices in early childhood special education. Journal of Early Intervention, 38, 79-91. doi:10.1177/1053815116648000

31.

Ledford

J. R.

Barton

E. E.

Severini

K. E.

Zimmerman

K. N.

(2019). A primer on single-case research designs: Contemporary use and analysis. American Journal on Intellectual and Developmental Disabilities, 124, 35-56. doi:10.1352/1944-7558-124.1.35

32.

Ledford

J. R.

Lane

J. D.

Severini

K. E.

(2018). Systematic use of visual analysis for assessing outcomes in single case design studies. Brain Impairment, 19, 4-17. doi:10.1017/BrImp.2017.16

33.

Levin

J. R.

Ferron

J. M.

Gafurov

B. S.

(2017). Additional comparisons of randomization-test procedures for single-case multiple-baseline designs: Alternative effect types. Journal of School Psychology, 63, 13-34. doi:10.1016/j.jsp.2017.02.003

34.

Lobo

M. A.

Moeyaert

Cunha

A. B.

Babik

(2017). Single-case design, analysis, and quality assessment for intervention research. Journal of Neurologic Physical Therapy, 41, 187-197. doi:10.1097/NPT.0000000000000187

35.

Maggin

D. M.

Briesch

A. M.

Chafouleas

S. M.

(2013). An application of the What Works Clearinghouse standards for evaluating single-subject research: Synthesis of the self-management literature base. Remedial and Special Education, 34, 44-58. doi:10.1177/0741932511435176

36.

Maggin

D. M.

Cook

B. G.

Cook

(2018). Using single-case research designs to examine the effects of interventions in special education. Learning Disabilities Research & Practice, 33, 182-191. doi:10.1111/ldrp.12184

37.

Manolov

(2018a). A simulation study on two analytical techniques for alternating treatments designs. Behavior Modification. Advance online publication. doi:10.1177/0145445518777875

38.

Manolov

(2018b). Linear trend in single-case visual and quantitative analyses. Behavior Modification, 42, 684-706. doi:10.1177/0145445517726301

39.

Manolov

Moeyaert

(2017). How can single-case data be analyzed? Software resources, tutorial, and reflections on analysis. Behavior Modification, 41, 179-228. doi:10.1177/0145445516664307

40.

Manolov

Onghena

(2018). Analyzing data from single-case alternating treatments designs. Psychological Methods, 23, 480-504. doi:10.1037/met0000133

41.

Manolov

Sierra

Solanas

Botella

(2014). Assessing functional relations in single-case designs: Quantitative proposals in the context of the evidence-based movement. Behavior Modification, 38, 878-913. doi:10.1177/0145445514545679

42.

Manolov

Solanas

(2013). A comparison of mean phase difference and generalized least squares for analyzing single-case data. Journal of School Psychology, 51, 201-215. doi:10.1016/j.jsp.2012.12.005

43.

Manolov

Solanas

Sierra

(2018). Extrapolating baseline trend in single-case data: Problems and tentative solutions. Behavior Research Methods. Advance online publication. doi:10.3758/s13428-018-1165-x

44.

Mendenhall

Sincich

(2012). A second course in statistics: Regression analysis (7th ed.). Boston, MA: Prentice Hall.

45.

Moeyaert

Rindskopf

Onghena

Van den Noortgate

(2017). Multilevel modeling of single-case data: A comparison of maximum likelihood and Bayesian estimation. Psychological Methods, 22, 760-778. doi:10.1037/met0000136

46.

Moeyaert

Ugille

Ferron

Beretvas

Van den Noortgate

(2013). Modeling external events in the three-level analysis of multiple-baseline across-participants designs: A simulation study. Behavior Research Methods, 45, 547-559. doi:10.3758/s13428-012-0274-1

47.

Nelson

P. M.

Van Norman

E. R.

Christ

T. J.

(2017). Visual analysis among novices: Training and trend lines as graphic aids. Contemporary School Psychology, 21, 93-102. doi:10.1007/s40688-016-0107-9

48.

Ninci

Vannest

K. J.

Willson

Zhang

(2015). Interrater agreement between visual analysts of single-case data: A meta-analysis. Behavior Modification, 39, 510-541. doi:10.1177/0145445515581327

49.

Odom

S. L.

Barton

E. E.

Reichow

Swaminathan

Pustejovsky

J. E.

(2018). Between-case standardized effect size analysis of single case designs: Examination of the two methods. Research in Developmental Disabilities, 79, 88-96. doi:10.1016/j.ridd.2018.05.009

50.

Olive

M. L.

Smith

B. W.

(2005). Effect size calculations and single subject designs. Educational Psychology, 25, 313-324. doi:10.1080/0144341042000301238

51.

Onghena

Edgington

E. S.

(1994). Randomization tests for restricted alternating treatments designs. Behaviour Research and Therapy, 32, 783-786. doi:10.1016/0005-7967(94)90036-1

52.

Onghena

Edgington

E. S.

(2005). Customization of pain treatments: Single-case design and analysis. The Clinical Journal of Pain, 21, 56-68.

53.

Parker

R. I.

Cryer

Byrns

(2006). Controlling baseline trend in single-case research. School Psychology Quarterly, 21, 418-443. doi:10.1037/h0084131

54.

Parker

R. I.

Vannest

K. J.

(2009). An improved effect size for single-case research: Nonoverlap of all pairs. Behavior Therapy, 40, 357-367. doi:10.1016/j.beth.2008.10.006

55.

Parker

R. I.

Vannest

K. J.

(2012). Bottom-up analysis of single-case research designs. Journal of Behavioral Education, 21, 254-265. doi:10.1007/s10864-012-9153-1

56.

Parker

R. I.

Vannest

K. J.

Davis

J. L.

(2011). Effect size in single-case research: A review of nine nonoverlap techniques. Behavior Modification, 35, 303-322. doi:10.1177/0145445511399147

57.

Parker

R. I.

Vannest

K. J.

Davis

J. L.

(2014). A simple method to control positive baseline trend within data nonoverlap. The Journal of Special Education, 48, 79-91. doi:10.1177/0022466912456430

58.

Parker

R. I.

Vannest

K. J.

Davis

J. L.

Sauber

S. B.

(2011). Combining nonoverlap and trend for single-case research: Tau-U. Behavior Therapy, 42, 284-299. doi:10.1016/j.beth.2010.08.006

59.

Pustejovsky

J. E.

(2018). Using response ratios for meta-analyzing single-case designs with behavioral outcomes. Journal of School Psychology, 68, 99-112. doi:10.1016/j.jsp.2018.02.003

60.

Pustejovsky

J. E.

(2019). Procedural sensitivities of effect sizes for single-case designs with directly observed behavioral outcome measures. Psychological Methods, 24, 217-235. doi:10.1037/met0000179

61.

Radley

K. C.

Dart

E. H.

Wright

S. J.

(2018). The effect of data points per x- to y-axis ratio on visual analysts evaluation of single-case graphs. School Psychology Quarterly, 33, 314-322. doi:10.1037/spq0000243

62.

Riley-Tillman

T. C.

Burns

M. K.

(2009). Evaluating educational interventions: Single-case design for measuring response to intervention. London, England: The Guilford Press.

63.

Schlosser

R. W.

Lee

D. L.

Wendt

(2008). Application of the Percentage of Non-Overlapping Data (PND) in systematic reviews and meta-analyses: A systematic review of reporting characteristics. Evidence-Based Communication Assessment and Intervention, 2, 163-187. doi:10.1080/17489530802505412

64.

Scruggs

T. E.

Mastropieri

M. A.

(1998). Summarizing single-subject research: Issues and applications. Behavior Modification, 22, 221-242. doi:10.1177/01454455980223001

65.

Sen

P. K.

(1968). Estimates of the regression coefficient based on Kendall’s tau. American Statistical Association Journal, 63, 1379-1389.

66.

Shadish

W. R.

Kyse

E. N.

Rindskopf

D. M.

(2013). Analyzing data from single-case designs using multilevel models: New applications and some agenda items for future research. Psychological Methods, 18, 385-405. doi:10.1037/a0032964

67.

Shadish

W. R.

Rindskopf

D. M.

Hedges

L. V.

Sullivan

K. J.

(2013). Bayesian estimates of autocorrelations in single-case designs. Behavior Research Methods, 45, 813-821. doi:10.3758/s13428-012-0282-1

68.

Shadish

W. R.

Sullivan

K. J.

(2011). Characteristics of single-case designs used to assess intervention effects in 2008. Behavior Research Methods, 43, 971-980. doi:10.3758/s13428-011-0111-y

69.

Shadish

W. R.

Zelinsky

N. A. M.

Vevea

J. L.

Kratochwill

T. R.

(2016). A survey of publication practices of single-case design researchers when treatments have small or large effects. Journal of Applied Behavior Analysis, 49, 656-673. doi:10.1002/jaba.308

70.

Snodgrass

M. R.

Chung

M. Y.

Meadan

Halle

J. W.

(2018). Social validity in single-case research: A systematic literature review of prevalence and application. Research in Developmental Disabilities, 74, 160-173. doi:10.1016/j.ridd.2018.01.007

71.

Solomon

B. G

. (2014). Violations of assumptions in school-based single-case data: Implications for the selection and interpretation of effect sizes. Behavior Modification, 38(4), 477-496. doi:10.1177/0145445513510931

72.

Swan

D. M.

Pustejovsky

J. E.

(2018). A gradual effects model for single-case designs. Multivariate Behavioral Research, 53, 574-593. doi:10.1080/00273171.2018.1466681

73.

Taipalus

A. C.

Hixson

M. D.

Kanouse

S. K.

Wyse

R. D.

Fursa

(2017). Effects of therapy balls on children diagnosed with attention deficit hyperactivity disorder. Behavioral Interventions, 32, 418-426. doi:10.1002/bin.1488

74.

Tarlow

(2017). An improved rank correlation effect size statistic for single-case designs: Baseline corrected Tau. Behavior Modification, 41, 427-467. doi:10.1177/0145445516676750

75.

Tarlow

K. R.

Brossart

D. F.

(2018). A comprehensive method of single-case data analysis: Interrupted Time-Series Simulation (ITSSIM). School Psychology Quarterly, 33, 590-603. doi:10.1037/spq0000273

76.

Tate

R. L.

Perdices

(2019). Single-case experimental designs for clinical research and neurorehabilitation settings: Planning, conduct, analysis, and reporting. London, England: Routledge.

77.

Tate

R. L.

Perdices

Rosenkoetter

McDonald

Togher

Vohra

(2016). The Single-Case Reporting Guideline In BEhavioural Interventions (SCRIBE) 2016: Explanation and elaboration. Archives of Scientific Psychology, 4, 10-31. doi:10.1037/arc0000027

78.

Tate

R. L.

Perdices

Rosenkoetter

Wakim

Godbee

Togher

McDonald

(2013). Revision of a method quality rating scale for single-case experimental designs and n-of-1 trials: The 15-item Risk of Bias in N-of-1 Trials (RoBiNT) Scale. Neuropsychological Rehabilitation, 23, 619-638. doi:10.1080/09602011.2013.824383

79.

Vannest

K. J.

Davis

(2013). Synthesizing single-case research to identify evidence based treatments. In Cook

B. G.

Tankersley

Landrum

T. J.

(Eds.), Evidence-based practices (Advances in learning and behavioral disabilities, Vol. 26) (pp. 93-119). Bingley, UK: Emerald.

80.

Vannest

K. J.

Ninci

(2015). Evaluating intervention effects in single-case research designs. Journal of Counseling & Development, 93, 403-411. doi:10.1002/jcad.12038

81.

Vannest

K. J.

Parker

R. I.

Davis

J. L.

Soares

D. A.

Smith

S. L.

(2012). The Theil–Sen slope for high-stakes decisions from progress monitoring. Behavioral Disorders, 37, 271-280. doi:10.1177/019874291203700406

82.

Vannest

K. J.

Peltier

Haas

(2018). Results reporting in single case experiments and single case meta-analysis. Research in Developmental Disabilities, 79, 10-18. doi:10.1016/j.ridd.2018.04.029

83.

Wilcox

R. R.

(2012). Introduction to robust estimation and hypothesis testing (3rd ed.). London, England: Academic Press.

84.

Wolery

Busick

Reichow

Barton

E. E.

(2010). Comparison of overlap methods for quantitatively synthesizing single-subject data. The Journal of Special Education, 44, 18-29. doi:10.1177/0022466908328009

85.

Wolery

Gast

D. L.

Ledford

J. R.

(2014). Comparative intervention designs. In Gast

D. L.

Ledford

J. R.

(Eds.), Single subject research methodology: Applications in special education and behavioral sciences (pp. 297-345). London, England: Routledge.

86.

Wolfe

Barton

E. E.

Meadan

(2019). Systematic protocols for the visual analysis of single-case research data. Behavior Analysis in Practice. Advance online publication. doi:10.1007/s40617-019-00336-7

87.

Wolfe

Seaman

M. A.

Drasgow

(2016). Interrater agreement on the visual analysis of individual tiers and functional relations in multiple baseline designs. Behavior Modification, 40, 852-873. doi:10.1177/0145445516644699

88.

Wolfe

Seaman

M. A.

Drasgow

Sherlock

(2018). An evaluation of the agreement between the conservative dual-criterion method and expert visual analysis. Journal of Applied Behavior Analysis, 51, 345-351. doi:10.1002/jaba.453

89.

Wolfe

Slocum

T. A.

(2015). A comparison of two approaches to training visual analysis of AB graphs. Journal of Applied Behavior Analysis, 48, 472-477. doi:10.1002/jaba.212

90.

Young

N. D.

Daly

E. J.

III . (2016). An evaluation of prompting and reinforcement for training visual analysis skills. Journal of Behavioral Education, 25, 95-119. doi:10.1007/s10864-015-9234-z

91.

Zimmerman

K. N.

Ledford

J. R.

Severini

K. E.

Pustejovsky

J. E.

Barton

E. E.

Lloyd

B. P.

(2018). Single-case synthesis tools I: Comparing tools to evaluate SCD quality and rigor. Research in Developmental Disabilities, 79, 19-32. doi:10.1016/j.ridd.2018.02.003