Abstract
Single-case-design researchers rarely used statistics in the past, but that is changing. In this article, I review the rapidly developing state of statistical analyses for single-case designs, including effect sizes, multilevel models, and Bayesian analyses. No analysis meets all the desiderata for an optimal single-case-design analysis, but this may be remedied in the near future. Single-case-design researchers will have incentives to use these analyses as they become more user-friendly and beneficial.
The single-case design (SCD) is one of the oldest and most respected design traditions for testing the effects of a treatment (Sidman, 1960; Skinner, 1938), especially in some areas of psychology and education (Shadish & Sullivan, 2011; Smith, 2011). A case is exposed to treatment, or not, sequentially in a time-series design, and a treatment effect is inferred if the outcome changes as predicted when treatment is present, but not otherwise. Figure 1 shows results from one example of an SCD (Lambert, Cartledge, Heward, & Lo, 2006). In this study, the treatment was aimed at reducing pupils’ disruptive behaviors in two classrooms, and visual analysis of Figure 1 suggests that it worked (which SCD researchers call a functional relationship). Figure 1 shows a reversal (ABAB) design, but other variants of SCDs include the multiple-baseline design, the alternating-treatment design, the changing-criterion design, and designs that combine these four types (Shadish & Sullivan, 2011). In psychology and education, SCDs are used in such diverse fields as applied behavior analysis, autism, behavior modification and therapy, child and family therapy, developmental disorders, exceptional children, intellectual disabilities, language disorders, learning disabilities, neuropsychological rehabilitation, school psychology, special education, speech therapy, and sport and exercise psychology (Shadish & Sullivan, 2011). SCDs are also used in medicine, in which they are called N-of-1 designs, to study asthma, arthritis, fibromyalgia, medication choices, pain treatment, surgery, traumatic brain injury, and attention deficit hyperactivity disorder (Gabler, Duan, Vohra, & Kravitz, 2011).

Figure 1. Number of intervals of disruptive behavior recorded during single-student responding (SSR) and response-card (RC) conditions. Each row of graphs represents data for one student observed over at least 26 sessions. The SSR condition was a no-treatment condition, RC was the treatment condition, and both conditions were run twice to form an ABAB design. Successful treatment is indicated by fewer intervals of disruptive behavior during RC than during SSR. Reprinted from “Effects of Response Cards on Disruptive Behavior and Academic Responding During Math Lessons by Fourth-Grade Urban Students,” by M. C. Lambert, G. Cartledge, W. L. Heward, and Y. Lo, 2006, Journal of Positive Behavior Interventions, 8, pp. 94–95. Copyright 2006 by Sage. Reprinted with permission.
Researchers, practitioners, and policymakers recognize the ability of SCDs to contribute to evidence-based practice reviews. The What Works Clearinghouse and the American Psychological Association’s Division 16 Task Force on Evidence-Based Interventions in School Psychology have recommended acceptance of SCDs for evidence-based practice reviews (Kratochwill et al., 2010; Kratochwill & Stoiber, 2002). The U.S. Department of Education’s National Center for Special Education Research allows SCDs to be used instead of randomized experiments for some efficacy studies. Oxford University’s Centre for Evidence-Based Medicine, which works closely with the British Medical Journal, accepts N-of-1 trials as evidence about effects of medical interventions and ranks them at the same level as randomized experiments (Howick et al., 2011). Medical researchers are extending the Consolidated Standards of Reporting Trials (CONSORT) standards for reporting randomized experiments to reporting standards for N-of-1 trials (Shamseer et al., 2012).
However, an obstacle to full acceptance of SCDs among those outside the SCD community is the general failure to use statistics to analyze SCDs. Dating back to Skinner (1938), most SCD researchers have followed a visual-analysis tradition in which inferential statistics are eschewed (Kratochwill, Levin, Horner, & Swoboda, in press)—despite the use of statistics having been proposed by members of the SCD community for almost as long. Some objections to using statistics are principled, especially those based on skepticism that statistics can capture the nuances an SCD researcher considers when judging whether a functional relationship exists (Parker & Vannest, 2012). More pragmatic reasons are habit, lack of advanced statistical curricula in SCD training programs, absence of consensus about which type of analysis is best, and a dearth of incentives to use statistics.
Yet this avoidance is changing because of (a) pressure from evidence-based-practice communities to speak the statistical language in common use; (b) expectations from granting agencies and journal editors for quantitative work in addition to visual analysis; (c) generational changes within the SCD communities; (d) greater collaboration between SCD researchers and statisticians that corrects misperceptions and builds common agreements about the role of statistics; (e) the development of a cadre of statisticians who are interested in the intellectual challenges of SCD analysis, who have ready access to virtually all published SCD data because they have the ability to digitize published graphs, and who are therefore already developing and conducting analyses and meta-analyses of SCD data (Shadish, in press-b); and (f) increasing recommendations to use statistics by various entities providing standards for SCDs (Smith, 2011). Consequently, the past 10 years have seen great progress toward formally developed statistics capable of addressing the distinctive needs of SCD research (Shadish, in press-a). In this article, I will review and evaluate the most recent and promising developments.
A Review of Current Developments
As much as past SCD statistics need critical review, I cannot provide that in this article. Instead, my focus is on current advances and the reasons for them.
Standardized effect sizes
Formally developed standardized effect sizes are of current interest because they tend to be the coin of the realm in evidence-based practice reviews, especially when meta-analysis is used to decide what works, or when results from SCDs are compared with results from between-groups studies, such as randomized experiments. The phrase “formally developed” means the effect sizes are derived analytically using statistical theory. A standardized effect size remains on the same measurement scale no matter what outcome measure is used. Examples are d statistics, odds ratios, and correlation coefficients. Because evidence-based practice reviews nearly always combine results over very different outcome measures, standardized effect sizes are needed.
Past proposals for a standardized effect size in SCD research are intriguing and intuitive. Overlap statistics are a good example (Parker, Vannest, & Davis, 2011). To judge whether a treatment worked, SCD researchers visually examine the overlap of treatment and baseline observations for each case (Kratochwill et al., in press). As shown in Figure 1, for example, the treatment used by Lambert et al. (2006) reduced students’ disruptive behavior relative to baseline levels, resulting in little overlap of treatment and baseline observations. However, past SCD effect sizes lacked formal development from clear assumptions in statistical theory. As a result, their confidence intervals and significance tests are of unclear validity or are nonexistent, their power to detect effects is unknown, and we know little about how they perform in the face of variation in the number of observations per phase (phase length), in how observations change systematically over time even in the absence of treatment (trend), in how correlated observations within each case might be (autocorrelation), and in the outcome-measurement metrics (e.g., count, percentage, normally distributed data).
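To make the idea of overlap concrete, the following is a minimal sketch of one commonly described overlap index, nonoverlap of all pairs (NAP): the proportion of all baseline-treatment pairs of observations in which the treatment observation shows improvement, with ties counted as half. The function name, the `lower_is_better` argument, and the data are hypothetical and are not taken from Lambert et al. (2006). The sketch also illustrates the limitation noted above, because it yields a single number with no standard error and ignores trend and autocorrelation.

```python
from itertools import product


def nap(baseline, treatment, lower_is_better=True):
    """Nonoverlap of all pairs: share of (baseline, treatment) pairs that
    show improvement, counting tied pairs as half an improvement."""
    if not baseline or not treatment:
        raise ValueError("both phases need at least one observation")
    score = 0.0
    for b, t in product(baseline, treatment):
        if t == b:
            score += 0.5          # tie: half credit
        elif (t < b) == lower_is_better:
            score += 1.0          # treatment observation is better
    return score / (len(baseline) * len(treatment))


# Hypothetical counts of disruptive intervals for a single case.
baseline = [8, 7, 9, 6, 8]
treatment = [2, 3, 1, 6, 2]
print(nap(baseline, treatment))
```

A value near 1 indicates almost no overlap between phases, and a value near 0.5 indicates that treatment observations are no better than baseline observations on average.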
In comparison, consider the recent d statistic that Hedges, Pustejovsky, and Shadish (2012, 2013) developed. Its standard error is formally derived from clear distributional assumptions, it takes the autocorrelation into account, it has power analyses, and simulations have shown how it performs over a variety of circumstances common in SCD research. SPSS macros with graphical user interfaces (GUIs) are available to estimate this d and its power, and demonstrations show how to use the d with meta-analytic statistics that cannot be applied to past SCD effect sizes (Shadish, Hedges, et al., 2013; Shadish, Hedges, & Pustejovsky, in press). It is explicitly derived from normal-distribution theory and explicitly assumes no trend, which are indeed limitations, but at least we know the kind of data to which it does apply! This work can be (and is being) extended to non-normally distributed data with trend. In contrast, past effect sizes are applied to all kinds of outcome distributions, to data with or without trend and with or without autocorrelation, with little thought to whether such applications are valid. The latter is, of course, remediable. Given how intuitive some past effect sizes are for SCD research, we hope that more formal statistical development of them occurs.
Analyses based on regression models
A second approach to the analysis of SCDs is regression or its extensions, such as multilevel models. Regression is formally developed statistically for many purposes relevant to SCD research, it can account for trend and autocorrelation, and it can assume non-normal outcome distributions that are appropriate for the count and rate measures that predominate in SCD research—although SCD researchers have rarely taken advantage of all these possibilities. Problems with past regression analyses are that they have been applied only within each case even though they could analyze all cases simultaneously, they do not easily provide statistical tests for whether cases differ significantly from each other, they usually assume trend is linear, and they do not typically produce a standardized effect-size estimate.
Multilevel models can easily address the first two of these problems (Moeyaert, Ferron, Beretvas, & Van den Noortgate, in press; Shadish, Kyse, & Rindskopf, 2013). For the data shown in Figure 1, past approaches would produce nine separate regression equations, one for each case, and stop there. Multilevel models would produce the same nine regression equations, but then extract and analyze the regression coefficients from each of these nine models (e.g., the nine regression coefficients for the treatment effect). So, they can test both the significance of the average of these coefficients (e.g., the average treatment effect over nine cases) and whether the nine regression coefficients differ significantly from each other (e.g., whether the treatment effect varies significantly over cases, although the power of that test can be low given the small number of cases often used in SCDs; see Shadish, Kyse, & Rindskopf, 2013).
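The two questions a multilevel model addresses (what is the average treatment effect, and how much does it vary over cases?) can be illustrated with a deliberately crude two-stage summary. This is only a sketch under strong assumptions and is not a multilevel model: it uses a raw difference in phase means as each case's effect, ignores trend and autocorrelation, and provides no significance tests. The case labels and data are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical data: for each case, (baseline observations, treatment observations).
cases = {
    "A1": ([8, 7, 9, 8], [3, 2, 2, 1]),
    "A2": ([6, 7, 6, 5], [4, 3, 4, 3]),
    "A3": ([9, 9, 8, 10], [2, 1, 3, 2]),
}


def case_effect(baseline, treatment):
    """Stage 1: effect for one case as a raw difference in phase means."""
    return mean(treatment) - mean(baseline)


def summarize(cases):
    """Stage 2: average the case-level effects and describe their spread."""
    effects = {name: case_effect(b, t) for name, (b, t) in cases.items()}
    values = list(effects.values())
    return effects, mean(values), stdev(values)


effects, average_effect, between_case_sd = summarize(cases)
print(effects)
print(average_effect, between_case_sd)
```

A genuine multilevel model estimates both stages jointly, so that the uncertainty in each case's coefficient is carried into the estimate of the average effect and of the between-case variation.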
Unfortunately, SCD researchers who use regression and multilevel models (both are parametric models) nearly always assume that trend does not exist or is linear (always increasing or decreasing in a straight line) as opposed to nonlinear (as seen when, e.g., a delayed effect occurs but then affects the outcome rapidly and, finally, reaches a ceiling beyond which no further improvement occurs). If trend is not linear, regression coefficients and standard errors will be wrong if linearity is assumed. This is a major problem, because parametric methods require the researcher to correctly identify the form of the trend, but the researcher rarely knows that form. Recent semiparametric regression models (e.g., generalized additive models) help solve this problem. Semiparametric regression combines the usual parametric predictors (e.g., treatment) with nonparametric smoothing techniques (e.g., for trend) that allow the data to suggest the form of the trend. Examples suggest that trend in SCDs is often nonlinear, and that these models can be practical and useful for the analysis of SCDs (Shadish, Zuur, & Sullivan, 2014). Figure 2 is a graph taken from Shadish et al. (2014) illustrating results from an application of generalized additive models to the Lambert et al. (2006) data shown in Figure 1. The analysis allowed trend for each case to have a different linear or nonlinear functional form. For example, Case 6 has no significant trend, Case 2 has significant linear trend, and other cases have significant nonlinear trends that are very wiggly (e.g., Case 9).
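The idea of letting the data suggest the form of the trend can be illustrated with a simple nonparametric smoother. The sketch below is a Nadaraya-Watson kernel smoother with a Gaussian kernel, which is a swapped-in technique chosen for brevity; it is not a generalized additive model and is not the analysis reported by Shadish et al. (2014). It smooths a single hypothetical series and omits the parametric treatment term that a semiparametric model would include. The function name, bandwidth, and data are assumptions.

```python
import math


def kernel_smooth(times, values, bandwidth=2.0):
    """Nadaraya-Watson estimate at each observed time (Gaussian kernel)."""
    if len(times) != len(values) or not times:
        raise ValueError("times and values must be non-empty and equal in length")
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    smoothed = []
    for t0 in times:
        # Observations close in time to t0 receive the largest weights.
        weights = [math.exp(-0.5 * ((t - t0) / bandwidth) ** 2) for t in times]
        total = sum(weights)
        smoothed.append(sum(w * y for w, y in zip(weights, values)) / total)
    return smoothed


# Hypothetical series: a delayed drop followed by a floor (nonlinear trend).
sessions = list(range(1, 13))
counts = [8, 8, 7, 8, 8, 7, 5, 3, 2, 2, 1, 2]
trend = kernel_smooth(sessions, counts, bandwidth=1.5)
print([round(v, 2) for v in trend])
```

A smaller bandwidth follows the data more closely and produces a wigglier curve, whereas a larger bandwidth approaches a flat line. Generalized additive models choose the amount of smoothing from the data rather than fixing it in advance.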

Figure 2. A multipanel scatter plot of a generalized additive model semiparametric regression analysis in which trend varies as either linear or nonlinear for each case. Black dots are original raw data, solid lines are smoothed predicted values, and dashed lines are 95% confidence intervals around those predicted values. Data were drawn from Lambert, Cartledge, Heward, and Lo (2006); each of the nine plots represents a case shown in Figure 1 (1 = A1, 2 = A2, 3 = A3, 4 = A4, 5 = B1, 6 = B2, 7 = B3, 8 = B4, 9 = B5). Reprinted from “Using Generalized Additive (Mixed) Models to Analyze Single Case Designs,” by W. R. Shadish, A. F. Zuur, and K. J. Sullivan, 2014, Journal of School Psychology, 52, p. 160. Copyright 2014 by the American Psychological Association. Reprinted with permission.
Bayesian statistics
The most recent development is the application of Bayesian statistics to SCD data (Rindskopf, in press; Swaminathan, Rogers, & Horner, in press). Bayesian methods can produce more valid results in the presence of small sample sizes that are characteristic of SCDs (Shadish & Sullivan, 2011). Bayesian methods have other advantages when compared with the usual (frequentist) analyses. For example, they take into account prior evidence and beliefs about parameters and can adjust those priors given data from a new SCD study. They provide more natural interpretations of results, although this is not obvious without careful comparison of Bayesian and frequentist interpretations. Rindskopf (in press) provides a readable introduction to Bayesian statistics as they relate to SCD data. Programs for Bayesian statistics have a steep learning curve, but the effort is worthwhile, and more user-friendly interfaces will appear (e.g., Woodward, 2011).
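The way a prior is adjusted by new data can be shown with a textbook conjugate example. The sketch below assumes that session counts follow a Poisson distribution with an unknown mean and that prior belief about that mean is expressed as a gamma distribution (shape and rate). Under those assumptions, the posterior is again a gamma distribution whose shape is increased by the sum of the counts and whose rate is increased by the number of sessions. This is an illustration only, not the model used in the work cited above; the prior values and data are hypothetical.

```python
def update_gamma_poisson(prior_shape, prior_rate, counts):
    """Return the (shape, rate) of the gamma posterior for a Poisson mean."""
    if prior_shape <= 0 or prior_rate <= 0:
        raise ValueError("prior shape and rate must be positive")
    return prior_shape + sum(counts), prior_rate + len(counts)


def gamma_mean(shape, rate):
    """Mean of a gamma distribution parameterized by shape and rate."""
    return shape / rate


# Hypothetical prior: about 6 disruptive intervals per session, held weakly
# (it carries the weight of roughly one session of data).
prior = (6.0, 1.0)
treatment_counts = [2, 3, 1, 2]  # hypothetical treatment-phase counts

posterior = update_gamma_poisson(*prior, treatment_counts)
print(gamma_mean(*prior), gamma_mean(*posterior))
```

With only four sessions, the posterior mean moves most of the way from the prior mean toward the observed mean; a prior with a larger rate would pull the estimate back more strongly.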
Swaminathan et al. (in press) have also proposed a d statistic that seems to produce estimates consistent with the more formally developed Hedges et al. (2012, 2013) d statistic. The same methods could be used to obtain a d statistic from multilevel and generalized additive models.
Criteria for the Ideal SCD Data Analysis
In prior work (Shadish, in press-b), I suggested that the ideal statistical analysis for SCD data would:
Accurately model trend.
Accurately model error structures with either autocorrelation or random effects.
Produce a standardized effect size suitable for meta-analysis.
Correctly model outcome variable distributions (e.g., normal, binomial, Poisson).
Be accompanied by appropriate power analyses.
Be accessible to SCD researchers through macros, GUIs, and examples with syntax.
The importance of the last criterion cannot be overstated. Many clinical scientists will understandably use simple statistical programs even if they are not state of the art (e.g., Bloom, Fischer, & Orme, 2009; Borckardt et al., 2007).
Table 1 shows how well the main analyses discussed above deal with each of the six criteria. No method meets all of the criteria. The cells in the table necessarily oversimplify, of course. For instance, some multilevel modeling programs have a point-and-click GUI, but most do not; power analyses exist for both regression and multilevel modeling, but they have not been specifically implemented for SCDs; and proposals for standardized effect sizes exist for some regression analyses, but not all.
Table 1. A Sample of Current Approaches to the Analysis of Single-Case Designs and Their Capabilities
Note: A plus sign (+) means that the method clearly meets a given criterion. A question mark (?) means some implementations of the method may meet the criterion. The absence of either of these two, indicated by a dash, means the method generally does not meet the criterion.
Conclusions and Future Directions
A serious conceptual concern for both quantitative (Molenaar, 2004) and visual analysts (Perone, 1999; Sidman, 1960) is a tension between nomothetic and idiographic assumptions in statistics. Nomothetic analyses focus on generalizations over individuals; idiographic approaches look at the uniqueness of individuals (Cone, 1986). Some analyses discussed in this article are more nomothetic, assuming a common process over cases (e.g., Hedges et al.’s d; much multilevel modeling), but for some idiographic researchers, this assumption is at best an empirical question and at worst never useful. Yet some analyses (e.g., Hoeppner, Goodwin, Velicer, Mooney, & Hatsukami, 2008; Shadish et al., 2014) may meet both idiographic and nomothetic needs, and the trade-offs between those needs require more attention (Shadish, in press-b).
As this concern illustrates, in a field where many flowers currently bloom, who can predict the one on which we will finally settle our gaze? What I want, I think, is a user-friendly program dedicated to SCD research that has multilevel, semiparametric, and Bayesian capabilities; that produces a standardized effect size appropriate to the outcome distribution; and that produces publication-quality graphics (best of all, make it free!). That would meet or exceed all of my desiderata for an ideal analysis. Swaminathan et al. (in press) are working on a program with many such characteristics. Others will surely compete in that arena. More work needs to be completed before such a program can be realized fully, particularly the formal statistical development of effect sizes for non-normally distributed SCD data. Such work is in progress.
Others may want something else, and a special tension may exist between the needs of SCD clinicians and practitioners and those of SCD researchers. Furthermore, over the next few years, new methods may appear, or older methods may be resurrected and improved. For example, overlap statistics have great intuitive appeal, and researchers are working to overcome their limitations (Parker et al., 2011). Collaboration with statisticians might allow for their more formal statistical development. Similarly, traditional time-series analyses could be used to analyze SCDs (Gorsuch, 1983), but seemingly promising starts (e.g., Velicer, 1994) have languished for lack of further attention. Yet even with such uncertainties, the next 5 years will likely see an emerging consensus on an ideal analytic method for SCDs. We are already close to that goal.
Footnotes
Declaration of Conflicting Interests
The author declared no conflicts of interest with respect to the authorship or the publication of this article.
Funding
This research was supported in part by Grants R305D100046 and R305D100033 from the U.S. Department of Education’s Institute for Educational Sciences and by Grant 143118 from the University of California Office of the President to the University of California Educational Evaluation Consortium. The opinions expressed are those of the author and do not represent the views of the University of California, the Institute for Educational Sciences, or the U.S. Department of Education.
