Abstract
There has been growing interest in using statistical methods to analyze data and estimate effect size indices from studies that use single-case designs (SCDs), as a complement to traditional visual inspection methods. The validity of a statistical method rests on whether its assumptions are plausible representations of the process by which the data were collected, yet there is evidence that some assumptions—particularly regarding normality of error distributions—may be inappropriate for single-case data. To develop more appropriate modeling assumptions and statistical methods, researchers must attend to the features of real SCD data. In this study, we examine several features of SCDs with behavioral outcome measures in order to inform development of statistical methods. Drawing on a corpus of over 300 studies, including approximately 1,800 cases, from seven systematic reviews that cover a range of interventions and outcome constructs, we report the distribution of study designs, distribution of outcome measurement procedures, and features of baseline outcome data distributions for the most common types of measurements used in single-case research. We discuss implications for the development of more realistic assumptions regarding outcome distributions in SCD studies, as well as the design of Monte Carlo simulation studies evaluating the performance of statistical analysis techniques for SCD data.
Introduction
Single-case designs (SCDs) serve as a cornerstone of research on behavior modification and also play an important role in certain areas of special education (Gast & Ledford, 2018), clinical and school psychology (Kazdin, 2011), and communication sciences (Byiers, Reichle, & Symons, 2012), among other fields. To draw conclusions in studies that use SCDs, researchers have traditionally relied upon visual inspection of graphed data (Horner & Swoboda, 2014; Smith, 2012). However, there has also long been interest in using statistical methods to analyze data from SCDs, and a large and growing array of statistical approaches are now available (for a recent review, see Manolov & Moeyaert, 2017).
Development of statistical methods for SCD data has been bolstered as researchers have increasingly sought to use systematic reviews and research syntheses of SCDs to inform evidence-based practice (e.g., Horner et al., 2005; Kratochwill & Stoiber, 2002; Wong et al., 2015). Researchers have developed methods for quantifying the magnitude of treatment effects, in the form of effect size indices and meta-analytic models, in order to synthesize findings from collections of SCD studies (Pustejovsky & Ferron, 2017). Production of systematic reviews and meta-analyses of SCD studies has also increased markedly over the past two decades (Jamshidi et al., 2018; Maggin, O’Keeffe, & Johnson, 2011). Thus, there is a need to scrutinize the statistical analysis methods used in such reviews, as well as methods used to analyze data from individual SCDs.
The validity of a statistical method hinges on whether its assumptions are plausible representations—or at least reasonable approximations—of the process by which the data were collected. Many well-known statistical methods for SCD data, such as the effect size methods proposed by Center, Skiba, and Casey (1985) and others (e.g., Gorman & Allison, 1996) and more recent approaches based on hierarchical linear models (e.g., Pustejovsky, Hedges, & Shadish, 2014; Van den Noortgate & Onghena, 2003, 2008), make the assumption that model errors are normally distributed and have constant variance. However, the most common types of outcome measurements used in SCD studies are frequency counts or percentages (Shadish & Sullivan, 2011), which may have characteristics that are not well modeled by normal distributions. Methodologists have therefore highlighted the need for statistical models that are more appropriate for the characteristics of outcome measures commonly used in SCDs (Shadish, 2014).
Recent work has begun to investigate other distributional models for SCD data. Most such work has focused on the use of binomial distributions for outcomes in the form of percentages or proportions (e.g., Moeyaert, Ferron, Beretvas, & Van den Noortgate, 2014; Rindskopf, 2014) or Poisson distributions for outcomes in the form of frequency counts (e.g., Declercq et al., 2018; Shadish, Kyse, & Rindskopf, 2013). Models based upon these distributions are well known and readily available in statistical software, yet these models also make strong assumptions of their own—assumptions that have not been thoroughly investigated in SCD data. A smaller amount of work has examined generalized linear and non-linear models with less strict distributional assumptions (Shadish et al., 2013; Shadish, Zuur, & Sullivan 2014; Swan & Pustejovsky, 2019), but these recent developments have yet to see many applications.
One of the most common approaches to measuring outcomes in SCDs is through systematic direct observation (SDO) of behavior (Ayres & Ledford, 2014; Kahng, Ingvarsson, Quigg, Seckinger, & Teichman, 2011), which entails observing a behavior in time and recording a quantitative summary of its features, using a system such as frequency counting, duration recording, momentary time sampling, or partial interval recording. SDO is considered a hallmark of single-case research (Horner et al., 2005) and a robust literature exists on the properties of different SDO procedures (Lane & Ledford, 2014; Pustejovsky & Runyon, 2014).
Despite the importance of SDO, only a few studies have considered the implications of SDO procedures for statistical modeling of SCDs. A recent exception is Yoder, Ledford, Harbison, and Tapp (2018), who compared two different methods of estimating behavioral frequency from partial interval data. In closely related work, Pustejovsky and Swan (2015) developed effect size estimation methods specifically tailored for partial interval data. Finally, Pustejovsky (2019) used a model called the alternating renewal process (ARP) to simulate behavioral data and study the effects of different SDO procedures on effect size magnitude. The ARP model is distinctive because its formulation emulates the physical process of observing a behavior and recording SDO data. As a result, the model is very flexible and can be used to generate frequency counting data, duration recording, or interval recording data. However, such flexibility comes at the cost of substantial additional complexity. Further investigation is therefore warranted to determine whether the added complexity is necessary and appropriate for modeling real SCD data.
Monte Carlo Simulations
Monte Carlo simulation is one of the primary tools used by methodologists to assess the performance of new statistical methods. Monte Carlo simulation involves simulating artificial data based on a model with known characteristics, applying one or more statistical procedures, and comparing the resulting estimates with the true features of the model (Morris, White, & Crowther, 2019). Repeating this process many times allows researchers to investigate or compare the performance characteristics (such as bias, accuracy, or Type I error) of different procedures.
Monte Carlo simulations have been used extensively in developing and evaluating methods for analysis of SCD data (e.g., Ferron, Bell, Hess, Rendina-Gobioff, & Hibbard, 2009; Moeyaert, Ugille, Ferron, Beretvas, & Van den Noortgate, 2013; Pustejovsky et al., 2014). Simulations are particularly advantageous for studying the robustness of statistical methods because artificial data can be generated based on a model that does not conform to the assumptions of the procedures being studied (e.g., Moeyaert, Ugille, Ferron, Beretvas, & Van den Noortgate, 2016; Petit-Bois, Baek, Van den Noortgate, Beretvas, & Ferron, 2016). Simulations have even been used to examine the performance of analytic methods that are not based on specific distributional assumptions, such as the non-overlap effect sizes for SCDs (e.g., Pustejovsky, 2019; Tarlow, 2017). However, Monte Carlo simulations are inherently limited by the models and conditions used to generate artificial data. In order for a simulation to yield findings that are relevant in practice, it must be based on a data-generating model that captures essential features of data that arise in real studies.
The Present Study
As the variety of statistical models for SCDs expands, and as researchers assess and compare methods using Monte Carlo simulations, it is critical that these efforts be informed by the features of real SCD data. In the present study, we address this need by surveying the characteristics of empirical SCD data, focusing in particular on the study designs, procedures used to collect outcome data, and distributional features of the resulting data.
We limit the scope of the survey in three important ways. First, we survey SCD studies included in extant systematic reviews of the single-case literature. This approach helps to ensure that included studies investigated interventions and outcomes for which it is plausible that researchers would be interested in applying statistical models (including effect sizes and meta-analysis, in particular). In contrast, previous reviews of the characteristics of SCDs have surveyed all available SCD studies published in certain journals (Hammond & Gast, 2010; Smith, 2012) or in a single year (Shadish & Sullivan, 2011) and therefore may have included idiosyncratic studies that would never be subjected to statistical analysis. Second, we focus on SCDs that measured behavioral outcomes using SDO, excluding SCDs studies focused solely on academic performance or other types of outcomes, where measurement considerations may be quite different. Third, we limit the analysis to SCD data from initial baseline phases only, so as to avoid the complexity of modeling changes in outcomes after interventions are introduced. Including data from intervention phases would necessitate making further assumptions about how to best quantify changes in outcomes between baseline and intervention phases—important questions, but ones that warrant separate investigation. Under these limitations, the present study is guided by three research questions:
What are the basic features of SCDs included in extant systematic reviews of the single-case literature?
In SCDs included in extant systematic reviews, which procedures are used to measure behavioral outcomes?
For the most common types of measurements, what are the distributional features of baseline data in such SCDs?
The remainder of the paper proceeds as follows. In the next section, we briefly review several distributional models that have been proposed for SCD data and illustrate the key distinguishing features of the models. The following section details the sample of included studies, coding procedures, and data analysis methods. We then present our results, followed by a discussion of limitations and implications.
Distributional Models
Some of the most well-known statistical methods for SCD data are based on regression models with errors that are assumed to be normally distributed and homoscedastic (Center et al., 1985; Gorman & Allison, 1996). Methodologists have also developed hierarchical linear models for analyzing individual SCDs (Van den Noortgate & Onghena, 2003), estimating effect size indices (Pustejovsky et al., 2014), or synthesizing data across multiple SCDs (Van den Noortgate & Onghena, 2008). These models involve assumptions similar to single-level regression models, including normality and homoscedasticity of the lowest-level errors.
Because most SCD outcome data come in the form of frequency counts or proportions, Shadish (2014) argued for the importance of considering other, non-normal distributional models for SCD data, such as the Poisson distribution for frequency counting data and the binomial distribution for proportions. The Poisson distribution has long been used as a model for frequency counts of animal or human behavior observed over time (e.g., Altmann, 1974). For example, Shadish et al. (2013) used the Poisson distribution to model data from DiCarlo and Reid (2004), who measured the number of initiations of independent pretend-play behaviors of young children with disabilities. The binomial distribution is useful as a model for a variable representing the number of occurrences out of a fixed, known number of trials, where each trial is independent of the others. For example, a binomial distribution might be used to model the number of times a participant looks at an adult instructor in response to a bid for joint attention, as in Taylor and Hoch (2008). Researchers have begun to develop models for SCD data based on these distributions, such as the generalized linear mixed models described by Shadish et al. (2013), Moeyaert et al. (2014), and Rindskopf (2014). Recently, Declercq et al. (2018) used Monte Carlo simulations to investigate the robustness properties of hierarchical linear models with normal error distributions when the true data-generating process involved Poisson distributions.
In models such as these, there are at least two important considerations for choosing distributional assumptions. One consideration is the mean level of the outcome, which influences the extent to which Poisson or binomial distributions are distinct from a normal distribution. In particular, a Poisson distribution with a large mean is very closely approximated by a normal distribution (Johnson, Kemp, & Kotz, 2005). Similarly, a binomial distribution with the mean near the middle of its range (i.e., mean near 50%) or with a large number of trials will also closely resemble a normal distribution (Johnson et al., 2005). Outside of these conditions, however, the Poisson and binomial distributions become more distinct from the normal distribution. Consequently, models based on normal error assumptions might be less robust when the outcomes are counts with low mean or percentage data based on a small number of trials.
A second important consideration with these distributions is how the variability of the distribution is related to its mean. Figure 1 depicts each of these three distributions, each with mean values of 2, 5, or 9. Assuming homoscedasticity, the mean and variance of the normal distribution are unrelated, so that changes in the mean do not affect the degree of variability. In contrast, the Poisson distribution has the property that its variance is equal to its mean. Consequently, a Poisson distribution with a larger mean will necessarily also have higher variability. The middle panel of Figure 1 illustrates this relationship, where it can be seen that the distribution with a mean of 9 has a wider spread than those with smaller mean values. The variability of a binomial distribution is also related to its mean, but in a different way than with the Poisson. Letting m denote the mean of a distribution, a binomial distribution with size

Distributions with means of 2, 5, and 9. The binomial distribution has size 10.
In practice, it is quite common for empirical data to exhibit some mean–variance relationship, but not necessarily the exact relationships entailed by Poisson or binomial distributions. Generalized linear models based on the quasi-likelihood framework (Fox, 2008; McCullagh & Nelder, 1989) provide one way to model count or proportion data under weaker distributional assumptions. Quasi-likelihood models are specified in terms of a mean–variance relationship that involves an additional parameter
where m = E(Y). If
Fox (2008) recommended using these quasi-likelihood models routinely due to the prevalence of over- and underdispersion in empirical data. However, the extent and range of dispersion in SCD data remains to be explored.
One alternative to the conventional Poisson or binomial distributions is the ARP model (Pustejovsky, 2015; Rogosa & Ghandour, 1991). The ARP model is formulated to describe a behavior that occurs episodically over time, such as bouts of stereotypy of self-injurious behavior during a therapy session (e.g., Meany-Daboul, Roscoe, Bourret, & Ahearn, 2007). Under the model, it is assumed that the lengths of time in between bouts of behavior and the lengths of behavioral episodes follow some probability distributions (such as exponential or gamma distributions). The stream of behavior is observed for a specified length of time, and an outcome is calculated by applying an SDO procedure such as frequency counting, momentary time sampling, or partial interval recording. Rogosa and Ghandour (1991) studied the ARP as a model for classroom behaviors such as teacher praise or time that a student is engaged or on-task.
A primary advantage of the ARP is its flexibility. Because it is a model for a stream of behavior, it can be used to simulate the effects of different SDO procedures, such as comparing momentary time sampling to partial interval recording or comparing data based on longer or shorter periods of observation (Meany-Daboul et al., 2007; Pustejovsky & Runyon, 2014). The ARP can also be used to simulate data that approximately conform to the assumptions of the quasi-poisson or quasi-binomial models, with varying degrees of dispersion. To illustrate, Figure 2 depicts three sets of frequency counting data simulated based on an ARP. Each distribution has a mean of 5, but the distributions have different spreads due to varying the degree of dispersion. The distribution with dispersion of 1 is identical to the Poisson distribution with mean 5, as depicted in the middle panel of Figure 1.

Frequency count data simulated from ARP with mean of 5 and varying levels of dispersion. Dispersion of 1 corresponds to a poisson distribution.
One of the main challenges in using the ARP is its complexity. Although its formulation as a model for a stream of observed behavior emulates the physical process of SDO, this also makes it difficult to study analytically. The distributions of summary measurements (such as frequency counts or proportions of intervals) do not generally have simple mathematical forms. The model can nonetheless be used to simulate artificial data based on assumptions about behavior streams and measurement procedures (Pustejovsky & Runyon, 2014). However, if Monte Carlo simulations based on the ARP are to be relevant for practice, the assumptions required of the simulation model should be informed by the features of actual SCD studies, including details of how outcomes are measured and the distributional properties of those outcomes. One of the goals of the present study is to assemble and summarize such details, so that future methodological studies can make more realistic, empirically informed assumptions.
Methods
Data Sources
We drew on data from seven completed systematic reviews of SCD studies with behavioral outcomes. Table 1 describes characteristics of these reviews, including the populations, interventions, and outcomes of included primary studies, as well as whether primary studies in each review were screened based on the What Works Clearinghouse (WWC) standards for SCDs (Kratochwill et al., 2013). Shogren, Faggella-Luby, Bae, and Wehmeyer (2004) synthesized the literature on choice-making interventions for students with autism. Gage, Lewis, and Stichter’s study (2012) is a systematic review examining the effectiveness of functional behavioral assessment-(FBA) based interventions for students with or at risk for an emotional or behavioral disorder. Maggin, Pustejovsky, and Johnson (2017) examined school-based group contingency interventions intended to reduce problem behaviors. Barton, Sweeney, and Gossett (2016) reported a systematic review of interventions intended to increase the frequency or enhance the quality of prosocial play behaviors. Dart, Collins, Klingbeil, and McKinley (2014) included studies on peer management interventions targeting non-academic behaviors in school settings. Verschuur, Didden, Lang, Sigafoos, and Huskens (2014) synthesized studies on pivotal response treatment for children with autism spectrum disorders. Ledford, King, Harbin, and Zimmerman (2016) reviewed studies on social skills interventions for children with autism. In the three most recent systematic reviews (Barton et al., 2016; Ledford et al., 2016; Maggin et al., 2017), primary studies were screened and included in the review only if they met the WWC standards with or without reservations. In the remaining reviews, inclusion criteria were defined in terms of population, intervention, and outcome characteristics, but studies were not excluded based on methodological quality criteria.
Characteristics of Included Systematic Reviews.
Coding
The first authors of six of the seven included studies shared the underlying data from their reviews with us. In the case of Verschuur et al. (2014), we had access to a database from an incomplete systematic review that contained many of the same studies included in Verschuur et al. (2014). We then identified and coded the studies from the Verschuur et al. (2014) review that were not in the existing database. We compiled the data from all seven systematic reviews into a relational database and coded additional characteristics of all included studies.
Data from SCDs have a multilevel, hierarchical structure. Each primary study in the included reviews involved one or more participants. For some types of SCDs, such as multiple baselines across participants, there is a one-to-one correspondence between participants and data series (i.e., the tiers of the design). However, it is also possible that multiple data series are measured on a given individual, such as in a multiple baseline across settings or in a treatment reversal design in which multiple outcomes are recorded for each session. In such cases, series are nested within participants. Finally, each data series consists of data points collected across multiple sessions, so that sessions are nested within data series. We built a relational database that followed this hierarchical structure and coded characteristics at each of four levels: the study level, the participant level, the series level, and the session (data point) level.
Study characteristics
We coded the type of design implemented in each study, following the typology described by Gast and Ledford (2018). Basic design types included AB (i.e., pre-experimental), multiple baseline across participants, multiple baseline across settings or behaviors, multiple probe across participants, multiple probe across settings or behaviors, treatment reversal, alternating treatment, adapted alternating treatment, or changing criterion. For designs that incorporated features of multiple design types, we classified the study according to the predominant basic type while also noting the adaptations or variations. For example, we classified a multiple baseline across participants where one tier of the design included treatment reversals as a multiple baseline across participants.
Participant characteristics
At the participant level, we extracted the pseudonym used to describe each case in the primary study. Additionally, some designs involved outcomes assessed on whole groups (such as an entire classroom of students), while other designs involved outcomes assessed at the individual level. We therefore classified the participants as individuals or aggregates.
Series characteristics
The majority of the coding pertained to series-level characteristics. For each data series, we coded several aspects of the outcome measurement procedures, including the class of measurement procedure, details of the measurement procedure, length of observation session, and outcome metric. Measurement procedures fell into two broad categories: those for measuring free operant behavior and those for measuring restricted operant behavior. SDO procedures for free operant behavior included continuous recording (i.e., durational recording), momentary time sampling (MTS), event counting, partial interval recording (PIR), and whole interval recording (WIR). Procedures for measuring restricted operant behavior included assessing successes on a fixed or variable number of stimuli/tasks, response latencies, and task checklists. A few studies used outcomes measured using self-reported or clinician-reported rating scale measures, typically assessing psychological constructs. For interval recording methods, we coded the length of the observation intervals and the amount of recording time in between observation intervals (e.g., 15 s PIR, with 5 s space in between intervals).
For data series in which outcomes were measured through SDO of free operant behavior, the length of the observation session may influence the variability of the measurements, with longer sessions expected to produce more stable measures. We therefore coded the reported length of observation sessions for data series measured using interval recording, MTS, event counting, or continuous recording. If sessions varied in length, we used the average length or midpoint of the range of session lengths as reported by study authors. Finally, we coded the metric in which the outcome was reported, so that we could standardize the format of outcomes for purposes of analysis. Possible metrics included percentages (on a scale of 0–100%), proportions (on a scale of 0–1), number of successes out of a fixed total, event counts with no natural upper limit, standardized rates of event frequency, or miscellaneous other metrics.
Session characteristics
In order to examine the distributional characteristics of baseline outcomes, we needed data on session-level outcome measurements, session sequence, and phases. For six of the seven reviews, the original authors provided session-level outcome data and phase information, which we cleaned, verified against graphs of the data in the source articles, and read into our database. For the Verschuur et al. (2014) review, we followed this process for a portion of the studies for which we had access to outcome data. For the remaining studies, we extracted outcome data from the primary source articles using the Web Plot Digitizer tool (Rohatgi, 2015). We coded each individual data point extracted from a plot with a session number and a label for the phase of the study. The data provided from Dart et al. (2014) included only initial baseline phases.
Data Analysis
Our main approach to data analysis involved calculating descriptive summary statistics, including percentages, means, medians, and interquartile ranges (IQRs) for characteristics of the included studies, participants, and data series.
To analyze the features of the distribution of baseline outcomes, we restricted the sample in three ways. First, we limited the sample to data from initial baseline phases only. Second, we also limited the sample to series in which the initial phase of the study was a baseline (non-intervention). This resulted in the exclusion of 53 data series from 13 studies. Finally, we restricted our analysis to the two most common types of measurements used in single-case research: event frequency counts and proportion-based measures of free operant behavior (i.e., PIR, WIR, MTS, and continuous recording).
For event frequency counts, we first converted all of the data series to a common metric. Specifically, for outcomes reported as standardized rates per minute, we multiplied the outcome data points by the session length to obtain frequency counts. We then calculated the sample mean and sample variance of each baseline data series, examined the empirical distribution of baseline means, and created scatterplots of the sample mean and sample variance in order to investigate their relationship.
For measures of free operant behavior reported as percentages or proportions, we first converted all of the series to the common metric of proportions. We then calculated the sample mean and sample variance of each baseline data series. Because different series were measured using different numbers of intervals and the variance of a proportion depends on the number of intervals, sample variances had to be rescaled. We put them on a common metric by multiplying each sample variance by the corresponding number of intervals used to measure the outcome for that data series. For scaling purposes, we treated continuous recording data as equivalent to 5 s MTS, so that the number of intervals was equal to 12 times the length of the observation session in minutes. If outcomes exactly followed a binomial distribution, then the expected scaled variance would be equal to
We carried out all calculations and created figures and tables using R (Version 3.6.0; R Core Team, 2019). R packages used in creating the figures included ggplot2 (Wickham, 2016), ggridges (Wilke, 2018), cowplot (Wilke, 2019), and colorspace (Zeileis, Hornik, & Murrell, 2009). The complete raw data and R code for replicating our analysis are available at https://osf.io/n9jud/.
Results
Study Design Characteristics
Table 2 reports the distribution of study design types both overall and by review. Across reviews, nearly half (48%) of the 303 primary studies used multiple baselines across participants. The next most common design types were treatment reversals (34%) and multiple baselines across behaviors (10%), with the remaining types used in less than 10% of studies. It is notable that the frequency of different design types varied by systematic review. In the pivotal response training and social skills reviews, across-participant multiple baseline and multiple probe designs were predominant, and the object play review consisted entirely of such designs. In contrast, treatment reversal designs comprised a large majority of primary studies in the choice-making, functional behavior assessment, and group contingency reviews.
Distribution of Study Designs by Review.
Note: MBP = multiple baseline across participants; TR = treatment reversal; MBB = multiple baseline across behaviors or settings; MPP = multiple probe across participants. Other designs include AB, alternating treatment, adapted alternating treatment, and multiple probes across behaviors or settings. FBA = functional behavior assessment; PRT = pivotal response training.
Table 3 reports further basic features of the included studies. Across reviews, the majority of studies included between two and four unique participants (mean = 3.1, median = 3). The distribution of the number of participants was generally consistent across the reviews, although there were a few studies that included a larger number of participants. Across reviews, 8% of studies included six or more participants, including a doctoral dissertation by Thorne (2005) with 12 participants and a study by Laski, Charlop, and Schreibman (1988) with nine parent–child dyads.
Study Characteristics by Review.
Note: IQR = interquartile range; FBA = functional behavior assessment; PRT = pivotal response training.
On average across reviews, one or two data series were measured per participant, again with little variation across most reviews. The exception was the pivotal response training review, which included an average of 3.10 data series per participant. This review also included the largest number of participants per study (mean = 4.4, median = 3).
Across reviews, initial baseline phases included an average of data points per series (median = 7, IQR = 5–12). As depicted in Figure 3, the distribution of baseline phase lengths was strongly right-skewed, with the longest baseline phase consisting of sessions. Of 1,765 data series that began with a baseline phase, 112 series (from 24 unique studies) had 30 or more observations in the initial phase.

Distribution of initial baseline phase lengths across seven systematic reviews of SCDs.
Measurement Procedures
Table 4 reports the distribution of procedures used to measure outcomes, both overall and in each of the seven reviews. Across reviews, the most common measurement procedure was PIR (36% of data series), followed by event counting (23%) and successes out of a fixed number of trials (20%). Notably, PIR was the most common procedure in five out of the seven reviews and was used in more than half of all data series in the functional behavior assessment and pivotal response training reviews. Among the 650 data series measured using PIR, the majority used 10 s interval lengths (61%), followed by 30 s intervals (18%). Only 6% of data series used PIR with interval lengths of less than 10 s. WIR and MTS interval lengths were similar to those used with PIR. Specifically, 10 s intervals were used in 84% of the 69 data series measured using WIR and 47% of the 86 data series measured using MTS.
Distribution of Measurement Procedures by Review.
Among procedures for measuring restricted operant behavior, successes out of a fixed or variable number of trials were common overall, but their use was not uniformly frequent across all reviews. Rather, these procedures were most common in the reviews of social skills training, where they comprised 45% of all data series, and pivotal response training, where they comprised 24% of all data series. In the other reviews, these procedures were rarely used.
When using SDO to measure free operant behavior, session length may have an influence on the variability (or stability) of the measurements. Figure 4 depicts the distribution of session lengths among the 1,233 data series that used SDO to measure free operant behavior and where session length was reported. Across all reviews, sessions lasting longer than 30 min were uncommon, occurring in less than 5% of data series. Overall, observation sessions had a median length of 10 min (IQR = 10–15). In four out of the seven reviews, 10 min was also the modal session length. In the review of choice-making interventions, the median session length was somewhat longer at 15 min (IQR = 14–15), and still longer, at 20 min (IQR = 15–30), in the group contingency review.

Distribution of session lengths (in minutes) in data series measuring free operant behaviors. Curves represent smoothed kernel density estimates. Dots indicate the quartiles of the distribution for each review. Session lengths over 30 min are not depicted.
Baseline Outcomes: Event Frequency Counts
Turning to the properties of baseline outcome measurements, we first examined the properties of event frequency counting data. In doing so, we distinguish between positive-valence outcomes (where increasing the behavior is desirable) and negative-valence outcomes (where decreasing the behavior is desirable), due to substantial differences in their mean frequency.
Figure 5 depicts the distribution of mean frequencies and scatterplots of the mean–variance relationship for positive-valence (left-hand plots) and negative-valence (right-hand plots) outcomes. Several trends are apparent. First, positive-valence outcomes have quite low average baseline frequency, with a median of 0.83 events per session (IQR = 0–3.4). Of the 311 data series with positive-valence event count outcomes, fully 27% had mean baseline levels of 0.01 or less—effectively zero. Most data series with zero baselines came from the review of social skills interventions.

Scatterplots of variance versus sample mean baseline frequency for event count outcomes, with marginal distributions of mean baseline frequency. Left-hand plots (purple) depict positive-valence outcomes. Right-hand plots (green) depict negative-valence outcomes. Each point represents one data series, with size corresponding to baseline phase length. Dashed lines represent unit slopes, where variance is equal to mean. Blue curves depict local linear regressions of variance as a function of mean. Dots along the horizontal axis indicate the quartiles of the distribution.
Second, compared to positive-valence outcomes, negative-valence outcomes have higher, and more variable, average baseline frequency. The median baseline frequency of these series was 11 events per session (IQR = 4.9–24), and very few series had baseline levels near zero.
Third, both positive- and negative-valence frequency counts exhibited positive mean–variance relationships. Series with larger sample means tended to also have larger sample variances. This pattern suggests that distributions in which the mean and variance are connected (such as the Poisson or ARP) may be useful in modeling or simulating single-case count data.
Fourth, for positive-valence outcomes, sample variances of the baseline series tended to exceed the sample means, indicating overdispersion of the distributions. For negative-valence outcomes, dispersion was around 1 for baseline frequencies of less than 15, but tended to exceed 1 for higher baseline frequencies. If outcomes followed a Poisson distribution, for which the mean and variance are equal, we would expect that sample variances would be similar to sample means. However, across both types of outcomes, the patterns suggest that it may be useful to use distributions that have varying degrees of dispersion—and particularly, in which overdispersion is possible—for modeling or simulating single-case data.
Baseline Outcomes: Proportions
In examining the distributions of outcomes measured as proportions, we focused on those that were measures of free operant behavior, collected using PIR, MTS, continuous recording, or WIR. Figure 6 depicts the distribution of baseline mean outcomes for data series measured using each of these procedures, with the distribution of positive-valence outcomes on the left and negative-valence outcomes on the right. For both positive- and negative-valence outcomes, the distribution of mean baseline levels spanned the full range from 0 to 1. Beyond that, however, the two types of outcomes had distinctive distributions. Positive-valence outcomes have fairly low mean levels, with median levels ranging from 0.05 for WIR and 0.06 for PIR to 0.17 for continuous recording. Surprisingly, the distribution of mean baseline outcomes measured using MTS was near the center of the range, with a median of 0.53 (IQR = 0.38–0.63). Across the 591 data series with positive-valence outcomes measured using any of the four procedures, 25% had mean baselines very close to zero (0.01 or less). In contrast, the bulk of the data series with negative-valence outcomes fell close to the center of the range. For instance, with the most commonly used procedure of PIR, the median baseline level was 0.36 (IQR = 0.20–0.59). Of the data series with negative-valence outcomes, none had a baseline level lower than 0.01.

Distributions of mean baseline proportions for partial interval recording (PIR), momentary time sampling (MTS), continuous recording (CR), and whole interval recording (WIR) data series. Left-hand plots depict positive-valence outcomes. Right-hand plots depict negative-valence outcomes. Dots along the horizontal axis indicate the quartiles of the distribution.
Figure 7 includes scatterplots of the scaled sample variance versus the sample mean proportions for series measured using each of the four procedures. Positive- and negative-valence outcomes are depicted in the same plot because they showed very similar patterns of results and because there were only a few series with negative-valence outcomes measured by WIR or CR. Just as with the frequency counting data, there are apparent functional relationships between the mean and variance of each type of outcome: larger scaled variances tend to occur at intermediate mean levels while smaller scaled variances occur at mean levels closer to the extremes of 0 and 1. If outcomes followed a binomial distribution, we would expect that sample variances would fall around

Scatterplots of sample variance versus sample mean baseline proportions for partial interval recording, momentary time sampling, continuous recording, and whole interval recording data series. Each point represents one data series, with size corresponding to baseline phase length. Curves depict local linear regressions of variance as a function of mean.
Discussion
This paper has examined several aspects of the study designs, operational procedures, and outcome data in SCDs, focusing on designs that are included in existing systematic reviews for behavioral interventions. Notably, the corpus of studies included in this survey is drawn from several fields, ranging from school psychology (Dart et al., 2014) to behavioral disorders (Gage et al., 2012) to autism and developmental disorders (Ledford et al., 2016; Verschuur et al., 2014). Although the seven included systematic reviews are only a convenience sample, we believe that the range of topics and settings of the included studies provides some reason to expect that our findings hold more broadly—generalizing to the literature base of SCD studies with behavioral outcomes that would be considered eligible for inclusion in systematic reviews. We would not, however, generalize to the single-case literature base on academic outcomes (e.g., Burke, Boon, Hatton, & Bowman-Perrott, 2015), teacher process outcomes, or other outcomes where SDO of behavior is less commonly used.
Findings regarding the types and basic features of SCDs are largely consistent with past reviews of the single-case literature. Past systematic reviews of SCDs have also found that multiple baselines are the single most common type of SCD (Hammond & Gast, 2010; Shadish & Sullivan, 2011; Smith, 2012) and that baseline phases include an average of approximately 11 observations (Smith, 2012). These previous reviews of SCD characteristics all examined samples defined by inclusion in specific journals (Hammond & Gast, 2010; Smith, 2012) or publication in a single year (Shadish & Sullivan, 2011). Using a sample defined by inclusion in a systematic review, we have highlighted that the prevalence of different design types varies across systematic reviews, with treatment reversal designs playing a more prominent role in some areas (e.g., Maggin et al., 2017) than in others (e.g., Verschuur et al., 2014).
Consistent with past reviews on use of SDO measurement systems (Adamson & Wachsmuth, 2014), we found that use of PIR is quite common—even used in two-thirds of data series in the review of FBA (Gage et al., 2012). This pattern is troubling because it is well known that PIR does not provide a clearly interpretable measure of behavioral frequency or duration (Lane & Ledford, 2014). There are theoretical illustrations that use of PIR can create misleading patterns of results, even making a beneficial intervention appear to be harmful or vice versa (Pustejovsky & Swan, 2015), as well as some empirical evidence that use of PIR leads to exaggerated evidence of intervention efficacy (Radley, O’Handley, & Labrot, 2015). Some recent work has begun to develop statistical methods tailored specifically to PIR data (Pustejovsky & Swan, 2015; Yoder et al., 2018), but further research efforts in this direction are needed.
Previous work has noted that outcomes in SCD studies are commonly in the form of counts, percentages, or proportions (Shadish & Sullivan, 2011) and has suggested use of Poisson and binomial distributions in models for such data (Shadish, 2014). In light of the range of average baseline levels—as well as the apparent mean–variance relationships—observed in the present review, further development of models for non-normal outcomes does seem warranted. However, Poisson and binomial distributions may be insufficiently flexible to capture the features of real SCD data. We found that baseline frequency count data and proportion data exhibited a wide range of dispersions, often exceeding the levels of variance that would be expected under Poisson or binomial models.
One alternative to such distributional models is the ARP, a highlight flexible model that emulates the physical process of SDO and that can be used to simulate data with over- or underdispersion. Use of the ARP does, though, require specifying the length of observation sessions and the type of procedure used to record data. Such assumptions can be informed by the empirical distribution of these procedural choices, as we have reported. Other alternative models that could be investigated further include the beta-binomial, double binomial, and double Poisson distributions (Efron, 1986; Johnson et al., 2005), all of which can accommodate varying degrees of dispersion. Future simulation studies exploring such models can use the distribution of mean levels and dispersions reported here to inform the range of parameters examined.
Findings from this review do have several important limitations. As we have noted, our analysis is based on a convenience sample of systematic reviews of single-case studies with behavioral outcomes, and does not necessarily generalize to studies in other research areas or with other types of outcomes. Furthermore, our analytic approach was limited to descriptive statistics, without quantifying the extent of uncertainty in the findings. This issue is most salient for the results regarding outcome distributions, where part of the observed variation in the distribution of outcome levels and dispersions is due purely to sampling variation. In ongoing work, we are exploring generative statistical models, such as generalized additive models for location and scale (Rigby & Stasinopoulos, 2005) that would make it possible to parse out study-level, participant-level, and session-level (sampling) variation.
Another limitation of our analytic approach is that we have focused only on the two simplest features—the mean and variance—of the baseline outcome distribution. Other aspects of baseline data patterns—including time trends, autocorrelation, and interrater agreement—are also important to consider when developing models for SCDs (Shadish, 2014). Past surveys of the degree of serial dependence in single-case data (e.g., Shadish & Sullivan, 2011) have found moderate levels of positive autocorrelation, but the approach in these analyses is based on models involving normal error distributions. Given the prevalence of non-normal count and proportion outcomes, combined with the fairly short data series of most single-case data, estimating and modeling autocorrelation is challenging and further methodological research is needed. Further, assessing interrater agreement and attaining high levels are considered important markers of rigorous single-case research (Horner et al., 2005; Kratochwill et al., 2013), but little work has explored how to account for interrater agreement in statistical models for single-case data.
In examining the outcome measurement procedures most commonly used in the included SCDs, one particularly notable pattern was that the systematic reviews varied in the types of measurement procedures most commonly used. To the extent that the choice of effect size measures and statistical models for SCDs depends on the properties of the outcome measurements, different methods might well be needed for reviews in different areas. Rather than aiming to identify one single effect size measure or statistical model that is ideally suited for analysis or meta-analysis of SCDs, a better goal may be to develop methods that are tailored to the properties of the data under analysis. In doing so, researchers should carefully attend to the properties of the study designs and measurement procedures that were used to collect the data.
Footnotes
Acknowledgements
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by Grant R305D160002 from the Institute of Educational Sciences, U.S. Department of Education. The opinions expressed are those of the authors and do not represent the views of the Institute or the U.S. Department of Education.
