Abstract
Measuring treatment effects when an individual’s pretreatment performance is improving poses a challenge for single-case experimental designs. It may be difficult to determine whether improvement is due to the treatment or due to the preexisting baseline trend. Tau-U is a popular single-case effect size statistic that purports to control for baseline trend. However, despite its strengths, Tau-U has substantial limitations: Its values are inflated and not bound between −1 and +1, it cannot be visually graphed, and its relatively weak method of trend control leads to unacceptable levels of Type I error wherein ineffective treatments appear effective. An improved effect size statistic based on rank correlation and robust regression, Baseline Corrected Tau, is proposed and field-tested with both published and simulated single-case time series. A web-based calculator for Baseline Corrected Tau is also introduced for use by single-case investigators.
It is when the patients recover that we are faced with the post hoc propter hoc dilemma. Is the result due to the remedy or to the vis medicatrix naturae [healing power of nature]?
The recovering patient poses a considerable challenge to the applied researcher. One is left wondering, is the treatment intervention responsible for recovery or was the recovery due to the improvement already underway? Randomized controlled trials (RCTs) address this inferential problem by comparing treatment and nontreatment (control) samples. However, many applied researchers and practitioners lack the resources necessary to conduct controlled clinical experiments, but they are nonetheless eager to demonstrate the efficacy of their treatments (Morgan & Morgan, 2001). Single-case time-series experiments are an increasingly popular method of scientific inquiry (Smith, 2012), and, like RCTs, they can demonstrate the causal effects of treatments (APA Presidential Task Force on Evidence-Based Practice, 2006; Barlow & Hersen, 1984; D. T. Campbell & Stanley, 1963; Sidman, 1960). Unlike RCTs, well-designed single-case studies may be implemented rapidly, often for a fraction of the cost of a “large-N” controlled experiment. Addressing the inferential problem posed by the improving subject in a time-series experiment requires special tactics of design and analysis.
Investigators may design a single-case experiment in such a way as to address baseline trend, the deterministic pattern of responding in the baseline phase (A phase) of a time-series experiment. By withdrawing and reintroducing treatment (ABAB design) or by systematically altering the type or “dose” of treatment (functional manipulation), the investigator may be able to demonstrate a treatment effect above and beyond the level of response expected from baseline trend (Kazdin, 1982; Sidman, 1960). Unfortunately, practical and ethical concerns often limit an investigator’s capacity to manipulate or withdraw treatment in this way (Barlow & Hersen, 1984).
When investigators cannot exert experimental control over baseline trend, they may instead rely on statistical control. Considerable effort has recently been directed to address baseline trend in the statistical analysis of single-case data (e.g., Parker, Cryer, & Byrns, 2006). The most common strategy of baseline trend control involves estimating a baseline trend parameter (i.e., the rate of change underway in the pretreatment individual) followed by the correction or removal of that trend from both baseline and treatment phases. Advocates of this approach suggest that this strategy allows one to compare pretreatment and posttreatment performance without the confounding influence of baseline trend. Proposed baseline trend correction methods have addressed linear (e.g., Allison & Gorman, 1993), monotonic (e.g., Parker, Vannest, Davis, & Sauber, 2011), and stochastic (Manolov & Solanas, 2009) trends. No method has demonstrated clear superiority over other methods.
This article suggests an improvement to one popular method of baseline correction and effect size measurement, Tau-U (Parker, Vannest, Davis, & Sauber, 2011), which was adapted from Kendall’s (1938) rank order correlation statistic. Although it has unique strengths, Tau-U has some limitations that have yet to be explored critically. This article will identify some of Tau-U’s weaknesses and offer an improved method of baseline trend correction and effect size measurement, Baseline Corrected Tau.
Rank Correlation Methods
Ordinary least squares (OLS) models of varying levels of sophistication have been recommended for the analysis of interrupted time-series single-case data (e.g., Allison & Gorman, 1993; Center, Skiba, & Casey, 1985-1986; Faith, Allison, & Gorman, 1996; Huitema & McKean, 2000). However, evidence suggests that data often violate the assumptions of those methods (Solomon, 2014). In brief time-series experiments, rank order methods are desirable because they make fewer assumptions about the underlying distributions of data than OLS methods, and they are relatively robust to outliers and serially dependent data (i.e., autocorrelation). They may also be conceptually translatable to investigators who are better acquainted with parametric methods: For example, one might describe a Kruskal–Wallis test as the nonparametric equivalent of an ANOVA.
Kendall (1962) noted that his nonparametric rank correlation coefficient, Tau, could be used to test for statistically significant differences between two samples by testing their homogeneity, and that, in this special case, Tau was essentially a Wilcoxon’s test (Wilcoxon, 1945), also called a Mann–Whitney U test (Mann & Whitney, 1947). When used to compare two groups in this way, Tau essentially tests the null hypothesis that the rank order scores of two samples represent the same population and are homogeneous—It is assumed that rank scores of the samples would be heterogeneous if they were sampled from different populations. Tau can be used similarly to a t test, where a rank correlation is calculated between the scores of two samples and a dummy code variable. In a simple AB single-case design, Tau can similarly be used to correlate the time-series observations with a dummy code phase variable (i.e., A phase = 0, B phase = 1). To illustrate, consider a hypothetical participant in a single-case study of psychotherapy for depression. Suppose the participant obtained Beck Depression Inventory (BDI; Beck, Ward, Mendelson, Mock, & Erbaugh, 1961) scores of {9, 6, 11, 5} in the baseline phase and {4, 7, 8, 7, 3, 7, 1} in the treatment phase. The scores {9, 6, 11, 5 / 4, 7, 8, 7, 3, 7, 1} may be rank correlated with the dummy code variable {0, 0, 0, 0 / 1, 1, 1, 1, 1, 1, 1}, with the result of Tau = −0.314, p = .294, suggesting that, although the treatment phase was associated with lower depression scores, the effect is not sufficient to reject a conventional null hypothesis (p > .05).
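The dummy-code calculation above can be reproduced in a few lines. The following Python sketch is illustrative only: it assumes the tie-corrected form of Kendall's coefficient (often called tau-b), which the text does not name explicitly, and the function name kendall_tau_b is hypothetical. Under that assumption it recovers the Tau value reported for the BDI example; no p value is computed.

```python
from itertools import combinations
from math import sqrt


def sign(v):
    """Return -1, 0, or +1 according to the sign of v."""
    return (v > 0) - (v < 0)


def kendall_tau_b(x, y):
    """Tie-corrected Kendall rank correlation (assumed variant: tau-b)."""
    pairs = list(combinations(list(zip(x, y)), 2))
    # S: concordant pairs minus discordant pairs
    s = sum(sign(x2 - x1) * sign(y2 - y1) for (x1, y1), (x2, y2) in pairs)
    # Pairs tied on x or on y shrink the denominator
    ties_x = sum(x1 == x2 for (x1, _), (x2, _) in pairs)
    ties_y = sum(y1 == y2 for (_, y1), (_, y2) in pairs)
    n0 = len(pairs)  # n(n - 1) / 2
    return s / sqrt((n0 - ties_x) * (n0 - ties_y))


# BDI scores from the hypothetical participant in the text
scores = [9, 6, 11, 5, 4, 7, 8, 7, 3, 7, 1]
phase = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]  # A phase = 0, B phase = 1

tau = kendall_tau_b(phase, scores)
print(round(tau, 3))  # -0.314
```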
Kendall (1962) illustrated this dummy code procedure with Tau (see Chapters 3 and 13); however, Parker, Vannest, and Davis (2011) stated that “Because KRC [Kendall’s rank correlation] is not designed for dummy-coded variables, the Tau value output will not be correct” when applied to single-case data (p. 313). They proposed a change in the calculation of the denominator in Kendall’s Tau equation and termed this variant “Taunonoverlap” or “Tau-UA vs. B” (Parker, Vannest, Davis, & Sauber, 2011). This effect size is also quite similar to the Nonoverlap of All Pairs (NAP; Parker & Vannest, 2009), as noted by Parker, Vannest, and Davis (2011): “Taunovlap = (Pos – Neg) / Pairs, whereas NAP = (Pos + .5 × Ties) / Pairs” (p. 313). The limitations of this change in the Tau formula are discussed in the next section.
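The two quoted formulas can be compared directly on the BDI example. The sketch below is a minimal illustration, not the original authors' implementation; it assumes that "Pos" counts baseline-treatment pairs in which the treatment score is higher, "Neg" counts pairs in which it is lower, and "Pairs" is the product of the phase lengths. The function name pairwise_counts is hypothetical.

```python
def pairwise_counts(baseline, treatment):
    """Count between-phase pairs where the treatment score is higher,
    lower, or tied relative to the baseline score (assumed definitions)."""
    pos = sum(b > a for a in baseline for b in treatment)
    neg = sum(b < a for a in baseline for b in treatment)
    ties = sum(b == a for a in baseline for b in treatment)
    return pos, neg, ties


baseline = [9, 6, 11, 5]
treatment = [4, 7, 8, 7, 3, 7, 1]

pos, neg, ties = pairwise_counts(baseline, treatment)
pairs = len(baseline) * len(treatment)

tau_novlap = (pos - neg) / pairs      # (Pos - Neg) / Pairs
nap = (pos + 0.5 * ties) / pairs      # (Pos + .5 x Ties) / Pairs

print(pos, neg, ties, pairs)          # 8 20 0 28
print(round(tau_novlap, 3), round(nap, 3))
```

Because the denominator is the number of between-phase pairs only, the first statistic is larger in magnitude here than the Tau value obtained from Kendall's unaltered formula.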
Unfortunately, the method introduced by Kendall (1962), and later altered by Parker, Vannest, and Davis (2011), does not directly account for preexisting baseline trend, as do OLS methods (e.g., Allison & Gorman, 1993; Faith et al., 1996; Huitema & McKean, 2000). The results of a Tau analysis may be grossly distorted if a participant’s performance is improving or deteriorating independently of treatment, because rank scores are likely to reflect the preexisting baseline trend and not the effect of treatment. Parker, Vannest, and Davis suggested a further alteration to Kendall’s method, termed Tau-U, to control for baseline trend: Instead of correlating scores with a [0 / 1] dummy code phase variable, it was suggested that a specially coded phase variable be used, in which the baseline phase is reverse-ordered by time and every intervention phase value is set to the first time value of that phase. In this way, the correlation retains the influence of within-baseline trend, but reverses its direction, and contrasts the baseline phase with the intervention phase via a nonoverlapping dummy code variable. To illustrate, returning to the hypothetical data above, the Tau (or “Taunonoverlap”) dummy-coded phase variable {0, 0, 0, 0 / 1, 1, 1, 1, 1, 1, 1} becomes {4, 3, 2, 1 / 5, 5, 5, 5, 5, 5, 5} in a Tau-U analysis. The rationale for this method was further expanded by Parker, Vannest, Davis, and Sauber (2011). To summarize their approach, the trend of the baseline phase observations is retained in Tau-U, but its direction is reversed; the phases are also contrasted to one another as with a dummy-coded phase variable, because all intervention phase variable values exceed all baseline phase variable values. As with “Taunonoverlap,” Parker, Vannest, and Davis stated the special Tau-U dummy-coding procedure yields incorrect Tau output, and an alteration is substituted in the denominator of Kendall’s formula.
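The special phase coding described above is simple to construct. The following sketch builds it from the two phase lengths; the function name tau_u_phase_code is hypothetical and the code is only an illustration of the coding rule as it is described in the text.

```python
def tau_u_phase_code(n_a, n_b):
    """Tau-U phase variable: baseline time values in reverse order,
    followed by the first intervention time value repeated n_b times."""
    reversed_baseline = list(range(n_a, 0, -1))   # n_a, ..., 2, 1
    repeated_first_b = [n_a + 1] * n_b            # first B-phase time value
    return reversed_baseline + repeated_first_b


# Four baseline and seven treatment observations, as in the BDI example
print(tau_u_phase_code(4, 7))  # [4, 3, 2, 1, 5, 5, 5, 5, 5, 5, 5]
```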
Tau-U is an increasingly popular method for understanding single-case data due to its considerable strengths: It has few distributional assumptions and good statistical power, and is conceptually similar to the correlational methods familiar to most investigators. The article that developed the rationale for Tau-U (Parker, Vannest, Davis, & Sauber, 2011) has been cited by more than 200 published articles. The method’s popularity may also be due to its accessibility: The authors offered a free web-based Tau-U calculator for single-case researchers (Vannest, Parker, & Gonen, 2011; http://www.singlecaseresearch.org/calculators/Tau-U), which has been cited by more than 50 publications. Yet, despite its popularity and accessibility, Tau-U’s limitations have received relatively little consideration. These limitations will be described below, and an improved method of single-case measurement based on these critiques is offered.
Limitations of Tau-U
Inconsistent Terminology
Tau-U was developed as not one, but a family of related rank correlation coefficients for single-case research (Parker, Vannest, Davis, & Sauber, 2011). All Tau-U methods use Kendall’s rank correlation statistic to measure (a) within-phase trend, (b) between-phase independence, or (c) both. The Tau-U variants may be differentiated with subscripts, for example, Tau-UTrend A, Tau-UA vs. B, Tau-UA vs. B−Trend A, and so on. Although these labels are useful, a review of Tau-U literature indicates they are rarely used in practice. Even Tau-U’s originators use the term “Tau-U” inconsistently, in one instance describing “Tau-U, which combines nonoverlap between phases with trend from within the intervention phase” (i.e., Tau-UA vs. B+Trend B; Parker, Vannest, Davis, & Sauber, 2011, p. 284), in another instance describing it as “nonoverlap after controlling for Phase A trend” (i.e., Tau-UA vs. B−Trend A; Parker, Vannest, & Davis, 2011, p. 314), and in yet another instance describing Tau-U as “a method for measuring data non-overlap between two phases” (i.e., Tau-UA vs. B; Vannest et al., 2011).
The lack of consistent terminology is a limitation of Tau-U, but one that is easily addressed. To clearly report effect sizes, authors may wish to familiarize themselves with the Tau-U variants and clarify which statistics are reported. This is of particular importance when Tau-U single-case results are aggregated via meta-analysis within or across studies.
As stated above, this article uses “Tau-U” to refer to the coefficient calculated as the A versus B phase contrast controlling for trend in Phase A (i.e., Tau-UA vs. B−Trend A). This coefficient was selected for study as it explicitly addresses the problem of baseline trend in single-case data, which is one of the purported strengths of Tau-U as an effect size measure (Parker, Vannest, & Davis, 2011). This variant was also selected because it can be calculated via the popular web-based Tau-U calculator, which does not currently permit the calculation of the other Tau-U effect size that controls Phase A trend (i.e., Tau-UA vs. B + Trend B−Trend A). Because of its accessibility to applied researchers via the online calculator, it is assumed this Tau-U coefficient is most widely used as a method of baseline correction and effect size measurement.
Tau-U Estimates Are Inflated and not Bound Between −1 and +1
Perhaps the most troubling limitation of Tau-U is that it is not arithmetically bound between the conventional limits of −1 and +1, as are most other correlation coefficients. This is due to the formula alteration introduced by its authors, who stated that Tau outputs would be incorrect if dummy-coded phase variables were used; although, as noted above, Kendall (1962) demonstrated that dummy code variables could be used with Tau. When the alteration in formula is used, Taunonoverlap and Tau-U values are inflated, and Tau-U values may no longer fall within the conventional bounds of correlation coefficients.
To understand this limitation, a brief overview of Tau’s arithmetic is helpful. The rank order coefficient is calculated by making pairwise comparisons of two ranking variables, X and Y (in a time-series experiment, these variables could represent time and the outcome/dependent variable). Essentially, Tau is a ratio of S / D, where the score S is a value indicating the number of agreeing/disagreeing pairs between X and Y; the denominator D is the maximum possible value of S given the lengths of X and Y. If there are no tied ranks and if X and Y agree perfectly (X = Y), then S will equal the total number of pairwise comparisons, so that Tau = S / D = 1.00. If X and Y disagree perfectly (X = −Y), then Tau = S / D = −1.00. If there is no relationship between X and Y (e.g., X and Y are independent random variables), then the score S will equal zero, and Tau = S / D = 0.00.
Kendall’s score S is easily calculated by hand for small data sets, and D is easily calculated when there are no tied rank scores. However, in the presence of ties, the calculation of D becomes more complex. Fortunately, all popular statistical software packages perform these calculations automatically. Without addressing the arithmetic, S and D can assume a maximum value of (n)(n − 1) / 2 when there are no ties, n being the number of rank scores in X and Y. Tied ranks reduce the total number of pairwise comparisons, lowering the maximum values of S and D.
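A short sketch may make the roles of S and D concrete. It assumes the common tie-corrected denominator, D = sqrt((n0 − tx)(n0 − ty)), where n0 = n(n − 1) / 2 and tx and ty are the numbers of pairs tied on X and on Y; the text does not state which tie correction is intended, so this choice, the function name kendall_s_and_d, and the small data sets are all illustrative assumptions.

```python
from itertools import combinations
from math import sqrt


def sign(v):
    return (v > 0) - (v < 0)


def kendall_s_and_d(x, y):
    """Return Kendall's score S and a tie-corrected denominator D (assumed form)."""
    pairs = list(combinations(list(zip(x, y)), 2))
    s = sum(sign(x2 - x1) * sign(y2 - y1) for (x1, y1), (x2, y2) in pairs)
    ties_x = sum(x1 == x2 for (x1, _), (x2, _) in pairs)
    ties_y = sum(y1 == y2 for (_, y1), (_, y2) in pairs)
    n0 = len(pairs)  # n(n - 1) / 2 pairwise comparisons
    return s, sqrt((n0 - ties_x) * (n0 - ties_y))


time = [1, 2, 3, 4, 5]

# No ties, perfect agreement: S and D both equal n(n - 1) / 2 = 10
print(kendall_s_and_d(time, [1, 2, 3, 4, 5]))   # (10, 10.0)

# No ties, perfect disagreement: S = -10, so S / D = -1
print(kendall_s_and_d(time, [5, 4, 3, 2, 1]))   # (-10, 10.0)

# One tied pair in Y lowers both S and D
s_tied, d_tied = kendall_s_and_d(time, [1, 2, 2, 3, 4])
print(s_tied, round(d_tied, 3))
```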
If one ranking variable, say, X, is dichotomous (e.g., a dummy-coded phase variable) with nA + nB = n rank scores, S cannot exceed the product nA × nB. Thus, Parker, Vannest, and Davis’s (2011) Taunonoverlap coefficient is still bound between −1 and +1, even with the formulaic adjustment suggested by the authors, substituting (nA × nB) for D. This substitution essentially discards any within-phase variation, resulting only in between-phase comparisons. This may be a useful strategy in some experimental designs; however, it is argued that including within-phase variation offers important insights about the reliability and consistency of observed data. To illustrate, consider the interrupted time series {3, 3, 3, 3, 3 / 8, 8, 8, 8, 8} and {2, 1, 4, 5, 3 / 7, 6, 9, 8, 10}, which are both correlated with the phase variable {0, 0, 0, 0, 0 / 1, 1, 1, 1, 1}. In both time series, the phases have means of 3 and 8, and the phases are entirely nonoverlapping. Taunonoverlap, which ignores the variability of the second time series, produces the effect size 1.00 for both data sets. By comparison, Tau gives the results 1.00 and 0.745, respectively. By substituting (nA × nB) for D, Taunonoverlap effect size values will tend to be inflated (though still falling within conventional bounds), with the implication that Taunonoverlap discriminates poorly between data patterns that are highly variable versus highly consistent. This “ceiling effect” was documented by Parker, Vannest, Davis, and Sauber (2011), although it was not explicitly linked to the alteration in Kendall’s formula.
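The contrast between the two denominators can be checked numerically. The sketch below computes S once and divides it either by nA × nB or by a tie-corrected denominator (assumed here to be the tau-b form, which reproduces the values given above). The function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def sign(v):
    return (v > 0) - (v < 0)


def score_and_ties(x, y):
    """Return S, the number of pairs, and the pairs tied on x and on y."""
    pairs = list(combinations(list(zip(x, y)), 2))
    s = sum(sign(x2 - x1) * sign(y2 - y1) for (x1, y1), (x2, y2) in pairs)
    ties_x = sum(x1 == x2 for (x1, _), (x2, _) in pairs)
    ties_y = sum(y1 == y2 for (_, y1), (_, y2) in pairs)
    return s, len(pairs), ties_x, ties_y


def tau_nonoverlap(phase, scores):
    """S divided by nA x nB (the altered denominator)."""
    s, _, _, _ = score_and_ties(phase, scores)
    n_a, n_b = phase.count(0), phase.count(1)
    return s / (n_a * n_b)


def tau_b(phase, scores):
    """S divided by the tie-corrected denominator (assumed variant)."""
    s, n0, ties_x, ties_y = score_and_ties(phase, scores)
    return s / sqrt((n0 - ties_x) * (n0 - ties_y))


phase = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
consistent = [3, 3, 3, 3, 3, 8, 8, 8, 8, 8]
variable = [2, 1, 4, 5, 3, 7, 6, 9, 8, 10]

for series in (consistent, variable):
    print(tau_nonoverlap(phase, series), round(tau_b(phase, series), 3))
# 1.0 1.0
# 1.0 0.745
```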
Despite its insensitivity, Taunonoverlap is still bounded between the conventional limits of −1 and +1; however, the same is not true for Tau-U with baseline correction. As noted above, the maximum value of S is nA × nB when a dummy-coded phase variable is used. However, Parker, Vannest, and Davis (2011) stated that Tau-U uses a specially coded phase variable, where the baseline phase is reverse-ordered by time and the treatment phase is the first time value of the phase repeated. When this special phase variable is used, S can exceed the value of (nA × nB), and as a result, Tau-U may exceed the conventional bounds of −1 and +1. To illustrate, consider a hypothetical situation where an unwanted behavior, such as the frequency of a patient’s self-injury behavior, is increasing steadily prior to treatment; then, immediately upon initiating treatment, the frequency of self-injury quickly drops to zero. This might give us the hypothetical observations {1, 2, 3, 4, 5 / 2, 1, 0, 0, 0}. The investigator would reasonably want to control for the increasing baseline trend. Using the Tau-U baseline control method, the observed scores would be contrasted with the specially coded phase variable {5, 4, 3, 2, 1 / 6, 6, 6, 6, 6}. Normally, this correlation would result in Tau = −0.829, reasonably indicating a large negative effect of treatment on self-injury. However, Parker, Vannest, and Davis suggested this output is incorrect. Instead, the S value, −31, is divided by the product of the phase lengths. The alteration yields the effect size Tau-U = −31 / (5 × 5) = −31 / 25 = −1.24. This unexpected result may be verified with the online Tau-U calculator (Vannest et al., 2011; http://www.singlecaseresearch.org/calculators/Tau-U).
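The out-of-bounds result can also be verified without the online calculator. The following sketch recomputes S for the self-injury example and divides it by both denominators; the tie-corrected denominator is assumed to be the tau-b form, and all function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def sign(v):
    return (v > 0) - (v < 0)


def score_and_denominator(x, y):
    """Return Kendall's S and a tie-corrected denominator (assumed tau-b form)."""
    pairs = list(combinations(list(zip(x, y)), 2))
    s = sum(sign(x2 - x1) * sign(y2 - y1) for (x1, y1), (x2, y2) in pairs)
    ties_x = sum(x1 == x2 for (x1, _), (x2, _) in pairs)
    ties_y = sum(y1 == y2 for (_, y1), (_, y2) in pairs)
    n0 = len(pairs)
    return s, sqrt((n0 - ties_x) * (n0 - ties_y))


observed = [1, 2, 3, 4, 5, 2, 1, 0, 0, 0]      # self-injury frequencies
tau_u_code = [5, 4, 3, 2, 1, 6, 6, 6, 6, 6]    # Tau-U phase variable
n_a, n_b = 5, 5

s, d = score_and_denominator(tau_u_code, observed)

print(s)                       # -31
print(round(s / d, 3))         # -0.829 (Kendall's unaltered formula)
print(s / (n_a * n_b))         # -1.24  (denominator replaced by nA x nB)
```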
The fact that Tau-U is not bound between −1 and +1 raises questions about how it should be interpreted and compared with other effect size measures. Tau-U is neither a true correlation coefficient nor a percentage of nonoverlapping data controlling for trend (per Parker, Vannest, Davis, & Sauber, 2011). It is therefore suggested that Tau-U retain Kendall’s original formula without the added step of reducing the denominator, though retaining the original Tau formula would in many cases yield less impressive effect sizes.
It is noted that in their lengthier exposition of the Tau-U family of effect size indices, Parker, Vannest, Davis, and Sauber (2011) recommended the formula adjustment only for Tau-UA vs. B (i.e., “Tau-Unonoverlap” in Parker, Vannest, & Davis, 2011). As stated above, this adjustment will not yield out-of-bounds results in this case but will reduce the sensitivity of the statistic. For other Tau-U indices, Parker, Vannest, Davis, and Sauber offer other methods of hand calculating the denominator, D, that will not yield out-of-bounds results. Unfortunately, this adds to the confusion discussed above regarding inconsistent terminology. This article will demonstrate additional statistical limitations of Tau-U besides lack of conventional bounds; however, for investigators who wish to use some variant of Tau-U in light of those limitations, it is recommended that they use the formulas presented in Parker, Vannest, Davis, and Sauber rather than the methods in Parker, Vannest, and Davis or in the online calculator (Vannest et al., 2011), which both have the out-of-bounds problem.
Tau-U Data Cannot be Graphically Visualized
As noted by Parker, Vannest, and Davis (2011), the method of baseline trend correction used in Tau-U cannot be meaningfully represented on a visual graph. This is a considerable limitation of the Tau-U method, given the strong historical and practical ties between single-case experimental research and visual analysis. As many as four out of every five published single-case studies rely on visual analysis instead of statistical analysis (Brossart, Parker, Olson, & Mahadevan, 2006). When investigators have no visual representation of how their time-series data are altered via statistical test, the analysis becomes a kind of “black box” that offers little opportunity for insight and interpretation. Black-box methods also make it harder for investigators to know when the method might be inappropriate for their data.
Tau-U Baseline Trend Correction Is Affected by Experimental Phase Length
A point emphasized by its authors is that the Tau-U method of baseline trend correction is conservative compared with correction methods that assume a linear trend (e.g., Allison & Gorman, 1993; Center et al., 1985-1986; White & Haring, 1980). Parker, Vannest, and Davis (2011) stated this is because the degree of correction is determined by the number of observations in the baseline phase—However, the degree of correction is in fact determined by the ratio of baseline phase length to experimental phase length.
To illustrate, consider an AB phase design with perfect positive monotonic trend, with the ranks {1, 2, 3, 4, 5 / 6, 7, 8, 9, 10}. It is noted that the level of monotonic trend present in the baseline phase is statistically significant, Tau = 1.00, p = .03. For this series, Tau-U = 0.60, p = .18. Yet, if the investigator made additional experimental phase observations to record the series of ranks {1, 2, 3, 4, 5 / 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}, the effect size would increase to Tau-U = 0.80, p = .02. Although the baseline is identical in both examples, the Tau-U effect size has changed in both magnitude and statistical significance. Parker, Vannest, and Davis (2011) noted that the baseline trend control is conservative because “its effect is limited by Phase A length” (p. 317). This is not strictly true—The effect of baseline trend correction in Tau-U is influenced by both the length of the baseline phase and of the experimental phase.
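The dependence on experimental phase length is easy to demonstrate. The sketch below computes Tau-U as S divided by nA × nB, using the specially coded phase variable, for the two series above; only effect sizes are computed, not p values, and the function name tau_u is hypothetical.

```python
from itertools import combinations


def sign(v):
    return (v > 0) - (v < 0)


def kendall_s(x, y):
    """Kendall's score S: concordant minus discordant pairs."""
    return sum(sign(x2 - x1) * sign(y2 - y1)
               for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2))


def tau_u(baseline, treatment):
    """Tau-U with baseline trend control: S / (nA x nB), using the
    reverse-ordered baseline coding described in the text."""
    n_a, n_b = len(baseline), len(treatment)
    phase_code = list(range(n_a, 0, -1)) + [n_a + 1] * n_b
    return kendall_s(phase_code, list(baseline) + list(treatment)) / (n_a * n_b)


baseline = [1, 2, 3, 4, 5]
short_b = tau_u(baseline, [6, 7, 8, 9, 10])
long_b = tau_u(baseline, list(range(6, 16)))

print(short_b)  # 0.6
print(long_b)   # 0.8
```

The baseline is identical in both calls; only the number of treatment observations differs.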
This feature of Tau-U has troubling implications. Suppose, for example, an investigator found his or her experimental treatment effect size was reduced by baseline trend. Rather than reporting this disappointing result, the investigator could conceivably collect additional treatment phase observations until Tau-U increased to a desirable level of effect and statistical significance; yet, they could nonetheless claim their Tau-U analysis “controlled for baseline trend.” Although it is true that “ignoring positive baseline trend risks erroneous conclusions about the cause of change” (Parker, Vannest, Davis, & Sauber, 2011, p. 286), a method that purports to correct for baseline trend but does so only slightly carries equally great risks. This risk is compounded by the limitation described above that Tau-U’s method of baseline correction cannot be visually graphed and is a “black box”: There are few opportunities for the investigator to verify their “baseline corrected” results through visual inspection or other analysis.
Theil–Sen Estimator
A useful approach to single-case data analysis is the development of complementary methods of baseline trend correction and effect size measurement that would be statistically sound, visualizable, and capable of being implemented together or separately depending on the presence of trend. The Theil–Sen estimator, coupled with Kendall’s Tau, offers one solution. Theil (1950) developed a robust, nonparametric regression coefficient later expanded by Sen (1968), termed the Theil–Sen estimator. In a Theil–Sen regression of two variables (for example, the observed scores and time values of a time series), the slopes are calculated for all pairs of data points; the Theil–Sen slope estimate, b, is the median of these slopes. A y-axis intercept may also be calculated as the median value of (y − bx) for each data point (Siegel, 1982). As Sen and others (Vannest, Parker, Davis, Soares, & Smith, 2012; Wilcox, 1998) pointed out, the estimator performs well in small, non-normal samples, which tend to violate the assumptions of least squares regression.
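The estimator as described above (median of pairwise slopes, then median of y − bx) can be sketched as follows. The function name theil_sen and the small data set, which contains one deliberate outlier, are hypothetical and serve only to illustrate the estimator's robustness.

```python
from itertools import combinations
from statistics import median


def theil_sen(x, y):
    """Return (slope, intercept): the median of all pairwise slopes and
    the median of y - slope * x across data points."""
    slopes = [(y2 - y1) / (x2 - x1)
              for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2)
              if x2 != x1]                      # skip pairs with equal x
    slope = median(slopes)
    intercept = median([yi - slope * xi for xi, yi in zip(x, y)])
    return slope, intercept


# Hypothetical series: y = 2x except for one extreme final observation
time = [1, 2, 3, 4, 5]
scores = [2, 4, 6, 8, 100]

slope, intercept = theil_sen(time, scores)
print(slope, intercept)  # 2.0 0.0
```

An ordinary least squares line fitted to the same five points would be pulled strongly toward the outlier, whereas the median-based slope is unaffected by it.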
The Theil–Sen estimator has, in this instance, a convenient relationship to Kendall’s Tau (it is sometimes referred to as the Kendall–Theil robust line; for example, Granato, 2006). The Tau rank correlation of x (time) and the y (outcome variable) residuals from a Theil–Sen regression will always approach or equal zero. Thus, in a time series, monotonic trend may be removed by calculating the residuals from a Theil–Sen regression. Put another way, the residuals of a Theil–Sen regression contain the variance not explained by a linear trend, and those residuals will themselves have a monotonic trend, or Tau value, of approximately zero. Theil–Sen regression has been recommended for the removal of monotonic trend from time-series data (Yue, Pilon, Phinney, & Cavadias, 2002), and it has been recommended for single-case data analysis specifically (Vannest, Davis, & Parker, 2013; Vannest et al., 2012).
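The relationship between the Theil–Sen line and Tau can be illustrated with a short, hypothetical declining series. The sketch below assumes the tie-corrected tau-b form of Kendall's coefficient and uses made-up data; it shows a strong monotonic trend before detrending and a Tau close to zero afterward.

```python
from itertools import combinations
from math import sqrt
from statistics import median


def sign(v):
    return (v > 0) - (v < 0)


def kendall_tau_b(x, y):
    """Tie-corrected Kendall rank correlation (assumed variant)."""
    pairs = list(combinations(list(zip(x, y)), 2))
    s = sum(sign(x2 - x1) * sign(y2 - y1) for (x1, y1), (x2, y2) in pairs)
    ties_x = sum(x1 == x2 for (x1, _), (x2, _) in pairs)
    ties_y = sum(y1 == y2 for (_, y1), (_, y2) in pairs)
    n0 = len(pairs)
    return s / sqrt((n0 - ties_x) * (n0 - ties_y))


def theil_sen(x, y):
    """Median pairwise slope and median of y - slope * x."""
    slopes = [(y2 - y1) / (x2 - x1)
              for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2)]
    slope = median(slopes)
    return slope, median([yi - slope * xi for xi, yi in zip(x, y)])


# Hypothetical declining series (not taken from any study)
time = [1, 2, 3, 4, 5, 6]
scores = [12, 10, 11, 7, 8, 4]

slope, intercept = theil_sen(time, scores)
residuals = [y - (intercept + slope * t) for t, y in zip(time, scores)]

tau_before = kendall_tau_b(time, scores)
tau_after = kendall_tau_b(time, residuals)
print(slope)                  # -1.5
print(round(tau_before, 3))   # -0.733
print(round(tau_after, 3))    # close to zero
```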
Theil–Sen regression may also be used to correct for baseline trend in interrupted time-series designs. The method outlined here uses an approach similar to both parametric (e.g., Allison & Gorman, 1993; Huitema, 2000) and nonparametric (e.g., extended celeration line [ECL], White & Haring, 1980; mean phase difference [MPD], Manolov & Solanas, 2013) trend control measures: A baseline trend line is projected into the experimental phase, and then an effect size is calculated from the residuals of both phases. The Theil–Sen method is preferable to OLS regression because it is robust to outliers, does not assume normally distributed data, and yields less extreme effect size estimates (Brossart, Parker, & Castillo, 2011).
Baseline Corrected Tau: A Proposed Method
Rather than integrating baseline correction and effect size measurement into the same analysis, this article proposes a method that uses two separate but complementary statistical procedures. If monotonic baseline trend is present, a Theil–Sen regression is used to remove it from baseline and experimental phases. Tau is then used to estimate the homogeneity of phases, that is, the effect size. A decision tree summarizing this analytic process is presented in Figure 1, and the steps are described below.

Figure 1. Decision tree for conducting and interpreting a Baseline Corrected Tau analysis.
Step 1: Is There Statistically Significant Monotonic Baseline Trend?
Before baseline trend is corrected, it should be determined whether baseline trend exists or not. If data are “corrected” when no baseline trend exists, important variance essential to the analysis will nearly always be discarded, leading the investigator to erroneous conclusions. The statistical significance of monotonic baseline trend may be evaluated with Tau by calculating the rank order correlation between the baseline observations and a time variable (essentially calculating Tau for the A phase data only). If the null hypothesis of no baseline trend is rejected, the investigator should proceed to remove baseline trend from the series (Step 2); if baseline trend is not statistically significant, the investigator should skip to the effect size measurement (Step 3).
Step 2: Remove Monotonic Baseline Trend
To remove statistically significant baseline trend, a Theil–Sen regression of the baseline observations (y) on a time (x) variable is performed. The residuals of the entire series (both phases) are calculated from the baseline phase Theil–Sen regression line. This procedure will result in a new baseline monotonic trend that equals or approaches zero.
Step 3: Tau Effect Size Analysis
An effect size may be calculated with Kendall’s (1962) method. A dummy code phase variable is rank correlated with either the original observed data or with the baseline corrected residuals from Step 2 to yield a Tau value.
Step 4: Interpret Results
The final Tau effect size will be bounded between −1 and +1. The effect size indicates the strength and direction of the effect of treatment (i.e., phase change) on observed scores, controlling for any statistically significant baseline trend. Tau values greater than zero indicate a positive association between treatment and outcome variable, accounting for any baseline trend. For Tau values less than zero, the opposite is true. A p value may be interpreted with the null hypothesis of no relationship between treatment and outcome after correcting for baseline trend (if necessary).
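Steps 1 through 3 can be sketched end to end. The following Python code is a minimal illustration under stated assumptions, not the web-based calculator's implementation: it assumes the tau-b form of Kendall's coefficient, and for Step 1 it substitutes an exact permutation test of S (practical only for short baselines) for whichever significance test an investigator's software provides, so its p values may differ from those of other software. All function names and the alpha parameter are hypothetical.

```python
from itertools import combinations, permutations
from math import sqrt
from statistics import median


def sign(v):
    return (v > 0) - (v < 0)


def kendall_s(x, y):
    return sum(sign(x2 - x1) * sign(y2 - y1)
               for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2))


def kendall_tau_b(x, y):
    pairs = list(combinations(list(zip(x, y)), 2))
    ties_x = sum(x1 == x2 for (x1, _), (x2, _) in pairs)
    ties_y = sum(y1 == y2 for (_, y1), (_, y2) in pairs)
    n0 = len(pairs)
    return kendall_s(x, y) / sqrt((n0 - ties_x) * (n0 - ties_y))


def permutation_p(x, y):
    """Two-sided exact permutation p value for S (assumed stand-in test)."""
    observed = abs(kendall_s(x, y))
    perms = list(permutations(y))
    return sum(abs(kendall_s(x, p)) >= observed for p in perms) / len(perms)


def theil_sen(x, y):
    slopes = [(y2 - y1) / (x2 - x1)
              for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2)]
    slope = median(slopes)
    return slope, median([yi - slope * xi for xi, yi in zip(x, y)])


def baseline_corrected_tau(baseline, treatment, alpha=0.05):
    n_a, n_b = len(baseline), len(treatment)
    time = list(range(1, n_a + n_b + 1))
    series = list(baseline) + list(treatment)
    phase = [0] * n_a + [1] * n_b
    # Step 1: is there statistically significant monotonic baseline trend?
    corrected = permutation_p(time[:n_a], list(baseline)) < alpha
    if corrected:
        # Step 2: fit Theil-Sen to the baseline only, then take residuals
        # of the whole series from that projected line
        slope, intercept = theil_sen(time[:n_a], list(baseline))
        series = [y - (intercept + slope * t) for t, y in zip(time, series)]
    # Step 3: rank correlate the (possibly corrected) series with the phase
    return kendall_tau_b(phase, series), corrected


# Self-injury example from the text: rising baseline, so trend is removed
tau_1, corrected_1 = baseline_corrected_tau([1, 2, 3, 4, 5], [2, 1, 0, 0, 0])
# BDI example from the text: no significant baseline trend, so none is removed
tau_2, corrected_2 = baseline_corrected_tau([9, 6, 11, 5], [4, 7, 8, 7, 3, 7, 1])

print(corrected_1, round(tau_1, 3))  # True -0.845
print(corrected_2, round(tau_2, 3))  # False -0.314
```

Both results fall within −1 and +1. This sketch does not handle degenerate input, such as a series in which every value is tied.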
Applying Baseline Corrected Tau to Published Single-Case Data
To demonstrate these steps, data were adapted from a participant in the Watkins et al. (2007) study of cognitive-behavioral therapy (CBT) for depression. An interrupted time series is illustrated in Figure 2, with the participant’s recorded BDI scores across a baseline and weekly therapy treatment.

Figure 2. Uncorrected data of a weekly treatment for depression.
First, a Tau analysis is conducted to determine whether there is a statistically significant monotonic baseline trend. A rank correlation of the baseline BDI scores with their respective time values suggests a trend; Tau = −0.751, p = .031. Second, a Theil–Sen regression is conducted, and residuals are calculated for both phases. The Theil–Sen line is displayed in Figure 2. The residuals are replotted in Figure 3 for illustrative purposes—This shows the participant’s performance with the trend estimated by the Theil–Sen regression removed. The monotonic trend for the corrected baseline is now Tau = 0.000.

Figure 3. Baseline corrected data from Figure 2.
From the residuals, a Baseline Corrected Tau analysis may be performed. The residual scores are rank correlated with a dummy code phase variable to yield an effect size; Baseline Corrected Tau = .690, p < .001. This would suggest a positive relationship between phase and depression score after controlling for baseline trend, that is, the BDI scores in the treatment phase were higher than expected after removing baseline trend from the series. This result should be considered thoughtfully: It may not be the case that the treatment made the participant worse, even though that is what the statistical results seem to suggest upon first inspection. Whenever the baseline correction (Theil–Sen) line crosses the floor or ceiling of the measurement scale, all the scores that follow will tend to increase the magnitude of the effect size, even if the participant reaches a total recovery. Thus, one should not necessarily interpret the above result as evidence that the treatment made the participant worse beyond the changes predicted by baseline trend. However, the results do show that the participant was most likely in the process of recovering prior to initiating treatment (recall the baseline trend, Tau = −0.751, p = .031), and the treatment itself did not demonstrate an effect beyond the change accounted for by baseline trend.
The Baseline Corrected Tau result of no treatment effect may also be compared with other effect size indices that control for baseline trend. In addition to Baseline Corrected Tau, the scores from Figure 2 were analyzed with Tau-U and three other effect size indices that are described in greater detail in the “Method” section of this article: ECL (White & Haring, 1980), a least squares regression model proposed by Allison and Gorman (1993), and MPD (Manolov & Solanas, 2013). All four additional methods suggest the presence of a decreasing baseline trend. ECL, the Allison and Gorman method, and MPD show positive-direction effect sizes, similar to Baseline Corrected Tau: ECL = 1.000 (p < .001); Allison–Gorman R² = 88.2%, level change = 0.896 (p = .762), and slope change = 2.40 (p < .001); and MPD = 16.01 BDI points (MPD is an unstandardized measure). However, Tau-U gave the opposite result: Tau-U = −0.659, p = .023. Contrary to the other three methods, this Tau-U result would suggest the treatment was negatively associated with depression score, that is, it had a statistically significant desirable effect.
This case illustrates the counterintuitive results Tau-U may produce, which occur largely because of the limitations discussed above. Baseline Corrected Tau aims to improve on these limitations, and the case presented above suggests it may under some circumstances agree with other established baseline control methods better than Tau-U. To further explore Baseline Corrected Tau’s performance, it was compared with Tau-U and the three other effect size indices, both in a field test with 65 published single-case time series and in a Monte Carlo simulation study.
Autocorrelation
Unlike randomly sampled group data, time-series data are not assumed to be independently distributed. It is intuitively clear that an individual’s performance on Monday would correlate to some degree with performance on Tuesday, and therefore, those data points should not be subjected to statistical analyses which assume otherwise. What is less clear is how to account for this serial dependence in interrupted time-series effect size measurement. The issue of autocorrelation has been a particularly contentious one in the single-case statistical methodology literature (e.g., see the special issue of Behavioral Assessment; Baer, 1988; Busk & Marascuilo, 1988; Huitema, 1985, 1988; Sharpley & Alavosius, 1988; Suen, 1987; Suen & Ary, 1987; Wampold, 1988).
Time-series analysis is a framework of statistical methods used to model processes or systems as they change over time. Box–Jenkins models, a type of time-series analysis (Box & Jenkins, 1970), have been recommended for single-case studies because of their utility in other social sciences (e.g., Glass, Willson, & Gottman, 1975). These include autoregressive (AR), moving average (MA), autoregressive-moving average (ARMA), and autoregressive integrated moving average (ARIMA) models. Although Box–Jenkins models would solve the challenges posed by autocorrelation in single-case data, single-case experiments nearly always fall short of the number of data points (approximately 50) needed to conduct these analyses. Single-case studies also frequently use count data, which are not expected to meet the normality assumptions of Box–Jenkins models.
In lieu of formal time-series analysis, single-case researchers have pursued one of two strategies to address autocorrelation. Some have advocated “cleansing” autocorrelation from data using an ARIMA model (e.g., Parker et al., 2006), and others have focused on identifying effect size statistics that are robust to autocorrelation (e.g., Manolov & Solanas, 2008). Parker, Vannest, Davis, and Sauber (2011) considered both strategies, and found that Tau-U was fairly robust to autocorrelation, but could be combined with data cleansing. In the following section, it will be demonstrated that Baseline Corrected Tau is similarly robust to autocorrelation. “Cleansing” brief interrupted time series (also called “prewhitening” or “backcasting”) is not recommended because Baseline Corrected Tau is sufficiently robust and because ARIMA cleansing may overfit to brief time series and remove more variance than is accounted for by serial dependence. 3
Meta-Analysis With Baseline Corrected Tau
An obstacle to wider adoption of single-case experimental designs is the lack of statistically sound methods for meta-analysis (Shadish, Rindskopf, & Hedges, 2008; Smith, 2012). Ideally, a meta-analytic method for single-case data sets would weight effects by their inverse variances and thus allow for calculation of a conventional meta-analytic mean effect size. Kendall (1962) discussed Tau’s variance under various distributional assumptions, but showed that it cannot exceed $\frac{2}{N}(1 - \tau^2)$, where N is the number of observations.
Baseline Corrected Tau: Web-Based Calculator
A web-based calculator (Tarlow, 2016) was developed to make Baseline Corrected Tau as accessible as possible to single-case investigators. The calculator is available at http://www.ktarlow.com/stats/tau. It is hoped that by making the method available to investigators, it can be more easily reviewed and implemented.
To use the Baseline Corrected Tau calculator, the user begins by inputting his or her interrupted time-series (AB) data. The calculator then essentially guides the user through the decision tree outlined in Figure 1. First, the user tests for baseline trend. Baseline trend results are displayed as a Tau value with corresponding p value. Then, depending on whether a conventional null hypothesis of no baseline trend is rejected (p < .05), the calculator will make a recommendation to the user whether to estimate an effect size using an uncorrected Tau analysis, or with Baseline Corrected Tau. The user then chooses which effect size to calculate and the corresponding result is displayed. The calculator also provides additional output if Baseline Corrected Tau is selected, including the Theil–Sen intercept and slope coefficients with the corrected Phase A and Phase B data points (i.e., the robust regression residuals) used in the Tau analysis. The Baseline Corrected Tau calculator also yields standard errors for all Tau effect size results. To combine effects via meta-analysis, the investigator would square the standard errors for the Tau variance, and then weight each effect by the inverse variance.
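The inverse-variance weighting described above can be illustrated with a short Python sketch. The function name and the example values are hypothetical, and the pooled standard error shown is the conventional fixed-effect result, which is an assumption here rather than something the calculator reports.

```python
def pooled_tau(taus, standard_errors):
    """Inverse-variance weighted mean of Tau values and its standard error."""
    # Squaring each standard error gives the variance; the weight is its inverse.
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    mean_tau = sum(w * t for w, t in zip(weights, taus)) / total_weight
    pooled_se = (1.0 / total_weight) ** 0.5  # fixed-effect assumption
    return mean_tau, pooled_se


# Hypothetical effects from three single-case studies.
mean_tau, pooled_se = pooled_tau([0.20, 0.45, 0.60], [0.25, 0.15, 0.30])
```

Effects estimated with smaller standard errors receive more weight, so the pooled value is pulled toward the most precisely estimated studies.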
Plan for Study
Comparison of Tau-U and Baseline Corrected Tau
Based on the theoretical issues discussed above, the Baseline Corrected Tau effect size statistic is expected to outperform Tau-U in several ways when evaluated with published and simulated single-case data sets. First, Tau-U is hypothesized to yield out-of-bounds results when applied to published data sets, with effect sizes less than −1 or greater than +1. When applied to the same data sets that yield out-of-bounds Tau-U results, Baseline Corrected Tau is also hypothesized to give more interpretable “in-bounds” effect size estimates. Second, Baseline Corrected Tau is expected to control for baseline trend more effectively than Tau-U due to Tau-U’s relatively weak method of trend correction. Third, Baseline Corrected Tau is hypothesized to be robust to autocorrelation due to its similarity to Tau-U, which is known to perform well under moderate levels of serial dependency. Fourth, because of their similar approaches to modeling and rank correlation, Baseline Corrected Tau and Tau-U effect sizes are expected to correlate highly, suggesting that Baseline Corrected Tau is a reasonable alternative for Tau-U that will provide investigators with results that are on average statistically and distributionally similar, but more interpretable on a case-by-case basis. To test these hypotheses, published single-case data sets with unknown parameters and simulated data sets with known parameters will be evaluated with both statistics.
Comparison With Other Effect Size Indices
To further contextualize the Baseline Corrected Tau and Tau-U results, three additional single-case statistics were selected for comparison via the published and simulated data. There are numerous single-case effect size indices of varying complexity, flexibility, and practical accessibility (J. M. Campbell, 2004; Gorman & Allison, 1996; Kratochwill & Levin, 1992; Parker et al., 2005; Shadish, 2014). Few methods account for baseline trend, particularly among the nonparametric effect size statistics that lack the distributional assumptions which limit regression-based methods (Parker, Vannest, & Davis, 2011). In addition to Baseline Corrected Tau and Tau-U, the three statistics identified, which include some type of baseline trend control, were ECL (White & Haring, 1980), the Allison–Gorman R2 regression model (Allison & Gorman, 1993; Faith et al., 1996), and MPD (Manolov & Solanas, 2013). These measures were included in all analyses to determine which effect size indices (if any) were superior in terms of their distributions (e.g., lack of ceiling or floor effects), ability to control baseline trend, robustness to autocorrelation, and performance with very brief time series.
Method
Statistical Measures of Effect Size Identified for Study
There are dozens of effect size statistics available to single-case investigators, and no one statistic has demonstrated clear superiority over other methods. A single-case investigator’s selection of effect size measure typically involves weighing the accessibility and assumptions of each method against the research question, experimental design, and data structure (Smith, 2012). Only a handful of methods account for baseline trend, with the majority of single-case statistics assuming a stable pattern of responding prior to treatment, an assumption that may not be tenable in some areas of study (Solomon, 2014). Baseline Corrected Tau and Tau-U both model baseline trend, via nonparametric robust regression and reverse data coding, respectively. Three additional effect size statistics were identified for comparison with Baseline Corrected Tau and Tau-U, with each modeling trend in a different way.
ECL
White and Haring (1980) proposed a “split middle” technique for analyzing interrupted single-case data. First, the baseline phase data points are divided into two halves by time, which gives the points (Xi, Yi) in the first half of the baseline and (Xj, Yj) in the second half of the baseline. A trend line is then fit to the medians of each half, that is, the points (median(Xi), median(Yi)) and (median(Xj), median(Yj)).
Allison–Gorman R2
Allison and Gorman (1993) proposed an OLS regression–based procedure for modeling the effect size of interrupted time-series (AB) single-case designs; their model was adapted from the Center et al. (1985–1986) study. First, a regression line is fitted to the baseline phase data and projected into the experimental phase; residuals from both phases are then calculated. This is essentially Steps 1 and 2 of calculating Baseline Corrected Tau described above; however, in Allison and Gorman’s technique, least squares regression rather than a robust nonparametric estimator (i.e., Theil–Sen) is used. The baseline corrected residuals are then used in a multiple regression that models level change and slope change effects. R2 is reported as the effect size; it describes the percentage of variance accounted for by slope change and level change treatment effects after controlling for baseline trend. The adjusted R2 value was used in all analyses to account for small sample bias (Faith et al., 1996). Adjusted R2 values less than zero were changed to zero (n = 9). In some of the following analyses, a negative sign was assigned to R2 when the overall effect of slope and level change was in the negative direction (e.g., decrease in level and trend of performance). The direction of effect was relevant to some analyses, such as correlations between the five effect size statistics, because the other four methods provide coefficients with meaningful signs, unlike Allison–Gorman R2, which yields only positive numbers (as a percentage of explained variance). A positive or negative sign was assigned to R2 based on the direction of the d effect size value calculated with Allison and Gorman’s method (otherwise, the d value was disregarded in the following analyses as it is interchangeable with R2).
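The procedure may be easier to follow as code. The Python sketch below is one reading of the steps just described, not a verified reproduction of Allison and Gorman's model: the slope change predictor is assumed to be the phase dummy multiplied by [T − (nA + 1)], the function names are hypothetical, and the sign assignment based on d is omitted.

```python
def ols_line(x, y):
    """Least squares slope and intercept for a single predictor."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return slope, my - slope * mx


def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def allison_gorman_r2(phase_a, phase_b):
    """Adjusted R2 for level and slope change after OLS baseline detrending."""
    n_a, n = len(phase_a), len(phase_a) + len(phase_b)
    times = list(range(1, n + 1))
    scores = list(phase_a) + list(phase_b)
    slope, intercept = ols_line(times[:n_a], phase_a)     # baseline trend line
    resid = [y - (intercept + slope * t) for t, y in zip(times, scores)]
    # Predictors: intercept, phase dummy D, and slope change [T - (nA + 1)] * D.
    design = [[1.0, float(t > n_a), (t - (n_a + 1)) * float(t > n_a)]
              for t in times]
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * e for r, e in zip(design, resid)) for i in range(3)]
    coef = solve(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coef, r)) for r in design]
    mean_resid = sum(resid) / n
    ss_total = sum((e - mean_resid) ** 2 for e in resid)
    ss_error = sum((e - f) ** 2 for e, f in zip(resid, fitted))
    r2 = 1.0 - ss_error / ss_total if ss_total else 0.0
    adjusted = 1.0 - (1.0 - r2) * (n - 1) / (n - 3)       # two predictors
    return max(adjusted, 0.0)                             # negative values set to zero
```

The sketch requires at least two observations in each phase so that the baseline line and the slope change term can both be estimated.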
MPD
Similar to the methods above, Manolov and Solanas’s (2013) MPD estimates an effect size by comparing observed B phase scores with scores predicted by baseline trend. MPD estimates baseline trend by taking the average of the first-order differences of the A phase scores. This procedure results in nA−1 difference scores, calculated as nt+1 − nt. Similar first-order differencing methods have demonstrated usefulness in other single-case data applications (Manolov & Solanas, 2009; Solanas, Manolov, & Onghena, 2010). After estimating the baseline trend, b, via differencing, a new B phase data series,
Published Data Series Identified for Study
The five effect size statistics were evaluated with real and simulated single-case data sets. First, a convenience sample of 65 published single-case data sets was selected. Experiments identified for inclusion were recent studies of anxiety and depression treatments. Anxiety and depression were identified as useful outcome variables because their studies often have baselines exhibiting variability and/or trend, whereas studies that focus on behavioral acquisition or extinction are more likely to have flat baselines. Outcome measures used in the sampled data sets included assessment instruments such as the BDI (Beck et al., 1961), Positive and Negative Affect Schedule (PANAS; Watson, Clark, & Tellegen, 1988), and the Hospital Anxiety and Depression Scale (HADS; Zigmond & Snaith, 1983). Raw data were extracted from published graphs using GetData Graph Digitizer (2013). All extracted data sets were visually compared with the original published graph to confirm that data were extracted correctly.
Comparison of Effect Size Indices With Published Data Sets
Baseline Corrected Tau, Tau-U, ECL, R2, and MPD were calculated for the 65 published single-case data sets. In studies with more than two phases (e.g., ABAB), effect sizes were calculated only for the first AB phase contrast. In studies with multiple outcome variables, only the depression or anxiety outcome variable was used. In studies that incorporated both depression and anxiety measures (n = 18), the depression measure was used and the anxiety data were discarded; this was done so participants with both depression and anxiety outcome data were not overrepresented in the analyses. The depression measure retained in these studies was the BDI, one of the most widely used and well-understood measures of psychological outcomes in clinical settings (Beck, Steer, & Carbin, 1988; Dozois, Dobson, & Ahnberg, 1998); in all 18 cases with both depression and anxiety measures, the BDI was the more psychometrically validated of the two assessments used.
To determine the relative agreement of the five effect size statistics, their values on the 65 data sets were initially compared with correlation analyses. To identify possible floor and ceiling effects, the distributions of each effect size were evaluated with boxplots. Similar to the analysis in Parker, Vannest, Davis, and Sauber (2011), probability scatterplots were created for each of the five effect size distributions—In each figure, the 65 effect size values were plotted against their percentile rank. Ceiling effects are often visually apparent with these probability distribution plots when many data sets yield the same maximum possible value.
Autocorrelation: Monte Carlo Simulations
Monte Carlo simulations were conducted to determine how the five baseline control effect sizes (Baseline Corrected Tau, Tau-U, ECL, R2, MPD) performed in the presence of autocorrelation, which is known to distort some estimates of effect size, particularly of parametric methods that make strict assumptions about the independence of data (Manolov & Solanas, 2008).
The general approach outlined by Manolov and Solanas (2008) was implemented. First, random time series were generated with known degrees of variance and autocorrelation. Those random time series were then blended with intercept- and slope-change coefficients to simulate the data patterns of an AB single-case design. Each AB time series produced with this simulation model therefore incorporated predetermined phase length, variance, autocorrelation, slope, and level parameters. The simulated time series were then evaluated with the five effect size statistics. The final result of this procedure was a distribution of effect size values for each combination of simulation parameters and effect size statistic.
Simulations were conducted in R (R Core Team, 2014) with the following steps:
1. Select nA and nB (lengths of phases), so that N = nA + nB.
2. Select parameters β0 (baseline level), β1 (baseline trend), β2 (level change), and β3 (slope change) for the four-parameter interrupted time-series model (Huitema & McKean, 2000):
$$Y_t = \beta_0 + \beta_1 T_t + \beta_2 D_t + \beta_3 [T_t - (n_A + 1)] D_t + \epsilon_t,$$
where Tt is the value of the time variable at time t, Dt is a dummy code variable for phase, and [Tt − (nA + 1)]Dt is a slope change term.
3. Select φ, the lag-1 autoregressive coefficient.
4. Generate a series of N + 50 numbers using the equation
$$\epsilon_t = \varphi \epsilon_{t-1} + a_t,$$
where at is an independent, normally distributed process with unit variance, giving a series ϵt with a lag-1 autocorrelation equal to φ.
5. Eliminate the first 50 numbers in the series ϵt.
6. Obtain series Yt from the four-parameter model and autocorrelated error ϵt.
7. Calculate effect size indices (Tau-U, Baseline Corrected Tau, ECL, Allison–Gorman R2, MPD) from simulated series Yt.
8. Conduct 5,000 simulations from Steps 4 through 7.
9. For each of the five effect sizes, calculate the mean of the 5,000 values.
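The simulation steps listed above can be sketched as follows. The study's simulations were run in R; this Python version is an illustrative assumption, with hypothetical function names, a first-order autoregressive error started at zero, and a simple difference in phase means standing in for the five effect size statistics.

```python
import random


def ar1_errors(n, phi, rng, burn_in=50):
    """Return n AR(1) errors after discarding burn_in start-up values."""
    errors, previous = [], 0.0
    for _ in range(n + burn_in):
        previous = phi * previous + rng.gauss(0.0, 1.0)  # unit-variance shocks
        errors.append(previous)
    return errors[burn_in:]


def model_values(n_a, n_b, b0, b1, b2, b3):
    """Deterministic part of the four-parameter interrupted time-series model."""
    values = []
    for t in range(1, n_a + n_b + 1):
        d = 1 if t > n_a else 0  # dummy code: 0 in Phase A, 1 in Phase B
        values.append(b0 + b1 * t + b2 * d + b3 * (t - (n_a + 1)) * d)
    return values


def simulate_ab(n_a, n_b, b0, b1, b2, b3, phi, rng):
    """One simulated AB series, returned as (phase_a, phase_b)."""
    noise = ar1_errors(n_a + n_b, phi, rng)
    series = [m + e for m, e in zip(model_values(n_a, n_b, b0, b1, b2, b3), noise)]
    return series[:n_a], series[n_a:]


def mean_effect(effect_size, n_sims, seed, **params):
    """Average an effect size function over n_sims simulated series."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sims):
        phase_a, phase_b = simulate_ab(rng=rng, **params)
        total += effect_size(phase_a, phase_b)
    return total / n_sims


def mean_shift(phase_a, phase_b):
    """Placeholder effect size: difference between phase means."""
    return sum(phase_b) / len(phase_b) - sum(phase_a) / len(phase_a)


# Level change with stable baseline, moderate positive autocorrelation.
result = mean_effect(mean_shift, 5000, seed=1, n_a=5, n_b=5,
                     b0=0.0, b1=0.0, b2=1.0, b3=0.0, phi=0.3)
```

Any function that accepts the two phases and returns a single number can be passed in place of `mean_shift`, so the same loop serves each effect size statistic and each combination of parameters.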
For Step 1 above, simulations were conducted with three sets of phase lengths: both phases very brief (nA = nB = 5), longer B phase (nA = 5, nB = 10), and both phases longer (nA = nB = 10). These simulated series were tested under four different models using the equation in Step 2. Those models were no effect with stable baseline (β0 = .0, β1 = .0, β2 = .0, β3 = .0), no effect with baseline trend (β0 = .0, β1 = .3, β2 = .0, β3 = .0), level change with stable baseline (β0 = .0, β1 = .0, β2 = 1.0, β3 = .0), and level change and slope change with baseline trend (β0 = .0, β1 = .3, β2 = 1.0, β3 = .3). For Step 3, the autoregressive coefficient,
Monte Carlo simulation results were verified by comparing results of the Allison–Gorman R2 effect size coefficient with similar results published in Manolov and Solanas (2008). A literature review did not reveal any comparable simulation studies for Tau-U, ECL, or MPD.
Baseline Trend: Monte Carlo Simulations
A second set of simulations was conducted using the same methods outlined above. However, the second Monte Carlo study was designed to determine how the five effect size statistics performed under differing degrees of baseline trend when no treatment effect is present. As in the autocorrelation study, three sets of phase lengths were simulated. The baseline slope coefficient (β1) was set at values ranging from 0 (no baseline trend) to 1 (moderate baseline trend), in increments of 0.1. The level change (β2) and slope change (β3) parameters were set to zero. In this set of simulations, autocorrelation was assumed to be zero.
Power Table for Baseline Trend Detection With Tau
Following the analyses above, a Tau power table for detecting monotonic trend in the baseline phase was created to supplement the present study and to serve as a tool for investigators who wish to use Baseline Corrected Tau to analyze their single-case data. Creation of the power table followed the simulation methods described above. Each cell of the power table represents a combination of baseline phase length, nA, and linear baseline trend, β1. For each cell, 5,000 time series were simulated from nA and β1, and Tau was calculated for each time series. The percentage of simulated time series yielding statistically significant Tau results was recorded in the power table. Thus, the power table indicates the likelihood of detecting a statistically significant baseline trend given the length of the baseline phase and the degree of trend. Autocorrelation was assumed to be zero.
Results
Comparison of Effect Size Indices With Published Data Sets
The convenience sample of 65 published single-case AB data sets was analyzed with the five selected effect size statistics. The effect sizes model baseline trend in different ways: monotonic (Tau-U), robust regression (Baseline Corrected Tau), “split middle” (ECL), OLS regression (Allison–Gorman R2), and first-order differencing (MPD). The absolute values of all effect sizes were used in figures to enhance the interpretability of results. The relative magnitudes of effects produced by each method were of primary interest rather than their direction, positive or negative. The directions of effects were determined arbitrarily, not by the efficacy of treatment but by the particular outcome variable in each study (e.g., a treatment effect could be measured as increasing wellness or decreasing depression).
The boxplot in Figure 4 illustrates the range and distribution of effect sizes produced by each of the five methods when applied to the same 65 published time series. Some data patterns are immediately clear from the boxplot. For example, 11 of the Tau-U effect sizes (17%) exceeded the conventional limit of 1.00, whereas Baseline Corrected Tau produced smaller effect sizes overall. ECL effect sizes were negatively skewed, whereas MPD effects were positively skewed.

Figure 4. Boxplot of five effect size statistics for 65 published AB phase contrasts.
The probability distribution of 65 data sets is presented for each effect size in Figure 5. Ideally, probability plots will have a roughly diagonal distribution with no gaps, floor, or ceiling, indicating the statistic’s strong discriminability (Parker, Vannest, Davis, & Sauber, 2011). ECL demonstrated a severe ceiling effect, with 44 time series (68%) earning an effect of ECL = 1.00. A relatively minor floor effect is evident in the probability plot for Allison–Gorman R2, with nine time series (14%) earning an effect of R2 = .00. Baseline Corrected Tau, Tau-U, and MPD demonstrated good discriminability with approximately diagonal probability distributions, although, as noted above, Tau-U yielded a high number of out-of-bounds values.

Figure 5. Probability distributions of five effect size statistics for 65 published AB phase contrasts.
Table 1 presents a Spearman correlation matrix of the five effect size statistics for the 65 analyzed time series. Baseline Corrected Tau and Tau-U results had the strongest association (ρ = .80, p < .001), and MPD and R2 results were also highly associated (ρ = .79, p < .001). Baseline Corrected Tau and Allison–Gorman R2 results had the weakest association (ρ = .50, p < .001). Although the five statistics take different approaches to baseline trend correction and effect size measurement, Table 1 correlations suggest these methods tend to agree at least moderately well on average.
Table 1. Spearman Correlation Matrix of Effect Size Statistics for 65 Published AB Phase Contrasts.
Note. All correlations statistically significant, p < .001. ECL = extended celeration line; MPD = mean phase difference.
Tau-U and Baseline Corrected Tau are compared in Figure 6. Although they were highly correlated, the range of effect size values produced by each statistic is quite different. Tau-U gave a larger result than Baseline Corrected Tau on 53 (82%) time series. The value of Baseline Corrected Tau was larger on 11 (17%) time series. The two statistics produced equal results on one (1%) time series.

Figure 6. Tau-U and Baseline Corrected Tau effect sizes for 65 AB single-case data sets.
Monte Carlo Simulations: Baseline Trend and Autocorrelation
Figure 7 presents the results of Monte Carlo simulations designed to evaluate the performance of each effect size under no treatment effect with differing levels of baseline trend. Simulations were conducted across a range of baseline trend values and with three different sets of phase lengths. Ideally, under the “no effect” simulation model, the statistics should yield an effect size of zero across all combinations of phase length and trend. The approximately flat plots produced by ECL, R2, and MPD suggest those statistics controlled baseline trend well; however, R2 results appeared to vary as a function of phase length and had nonzero values for all conditions tested. Baseline Corrected Tau controlled trend with a sufficiently long baseline phase, although it failed to control trends with very brief time series. Tau-U failed to control trend in every condition tested, even with relatively small degrees of trend and relatively long baselines.

Figure 7. Effect of simulated baseline trend and phase length on effect size (no treatment effect).
Figure 8 presents the results of the Monte Carlo simulations testing the effects of autocorrelation on the five effect size statistics. Statistics were applied to time series generated from four simulation models: no effect, no effect with baseline trend, level change, and level change plus trend change with baseline trend. The nonparametric methods, Baseline Corrected Tau, Tau-U, ECL, and MPD, were robust to autocorrelation under a variety of level change and slope change effects. They were fairly stable even when autocorrelation was greater than the degree typically seen in published single-case studies (Shadish & Sullivan, 2011; Solomon, 2014). When extreme amounts of autocorrelation were present, the nonparametric statistics generally demonstrated a shallow inverted-U distribution, where large negative or positive levels of autocorrelation resulted in slight reductions in effect size. This result should ameliorate concern that autocorrelation will lead to inflated effect size estimates—In fact, extreme autocorrelation appears to attenuate the estimated effect size in the nonparametric and nonoverlap statistics. Conversely, the OLS R2 measure was highly sensitive to autocorrelation. As Manolov and Solanas (2008) demonstrated, simulation results showed a positive, nearly linear relationship between the autoregressive coefficient and R2, with higher autocorrelation resulting in higher effect size.

Figure 8. Effect of lag-1 autocorrelation on effect size.
Power Table for Baseline Trend Detection With Tau
When using Baseline Corrected Tau, the decision of whether or not to correct for trend depends on the detection of a statistically significant trend in the baseline phase (Figure 1). Table 2 illustrates the statistical power of Tau as a function of baseline phase length and amount of trend. The likelihood of detecting a statistically significant trend for correction increases with both the length of the baseline phase and the degree of trend. When nA < 7, Tau failed to detect even high degrees of trend. Conversely, when nA ≥ 10, Tau detected a range of trend values.
Table 2. Power Table for Detecting Baseline Trend: Baseline Corrected Tau.
Note. Statistical power values were estimated from Monte Carlo simulations. All simulation models assumed error residuals with no autocorrelation, unit variance, and a normal distribution. Each combination of nA and β1 was simulated 5,000 times to estimate power.
Power values below this point are greater than .995.
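A single cell of such a power table could be estimated along the following lines. This Python sketch rests on several assumptions that are not taken from the study: the p value is the exact two-sided value from the permutation distribution of Kendall's S (valid when there are no tied scores), α is .05, and the function names are hypothetical.

```python
import random
from itertools import combinations


def inversion_counts(n):
    """counts[k] = number of orderings of n untied scores with k inversions."""
    counts = [1]
    for m in range(2, n + 1):
        new = [0] * (len(counts) + m - 1)
        for k, c in enumerate(counts):
            for shift in range(m):
                new[k + shift] += c
        counts = new
    return counts


def tau_p_value(series):
    """Exact two-sided p value for Kendall's S against time (no ties assumed)."""
    n = len(series)
    s = sum((later > earlier) - (later < earlier)
            for earlier, later in combinations(series, 2))
    max_s = n * (n - 1) // 2
    counts = inversion_counts(n)
    # An ordering with k inversions has S = max_s - 2k.
    extreme = sum(c for k, c in enumerate(counts) if abs(max_s - 2 * k) >= abs(s))
    return extreme / sum(counts)


def trend_power(n_a, beta1, n_sims=5000, alpha=0.05, seed=1):
    """Share of simulated baselines with a statistically significant Tau."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # Linear trend plus independent, unit-variance normal error.
        series = [beta1 * t + rng.gauss(0.0, 1.0) for t in range(1, n_a + 1)]
        hits += tau_p_value(series) < alpha
    return hits / n_sims
```

With five observations, only a perfectly monotonic baseline reaches p < .05 under this exact test (p = 2/120), which is consistent with the low power observed for very brief baselines.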
Discussion
The purpose of the present study was to investigate the performance of Tau-U with baseline trend correction under a range of real and simulated single-case data sets. A theoretical review of Tau-U indicates several limitations that have not been previously discussed in depth. Tau-U was expected to yield “out-of-bounds” results (not limited between −1 and +1) because of an alteration to Kendall’s Tau equation when controlling for baseline trend; however, it was not previously clear how often and to what degree out-of-bounds results limit the interpretability of Tau-U. The method of monotonic trend correction used by Tau-U was also described as conservative by its originators, but it was not clear whether the weak method of trend correction sufficiently corrects when trend is in fact present in brief time-series data.
A new method, Baseline Corrected Tau, was introduced as an improved rank correlation statistic that could be used instead of Tau-U. The present study compared Baseline Corrected Tau and Tau-U with real and simulated single-case data to explore the questions of result interpretability (e.g., out-of-bounds results), trend correction, robustness to autocorrelation, and performance with time series of varying lengths. Three other single-case effect size statistics were identified that controlled for baseline trend with parametric, nonparametric, and stochastic methods. ECL, Allison–Gorman R2, and MPD were analyzed with Tau-U and Baseline Corrected Tau, and their strengths and limitations were identified.
Tau-U Yields “Out-of-Bounds” Values
As hypothesized, Figures 5 and 6 illustrate that 11 of the 65 analyzed data sets (17%) produced an effect of Tau-U > 1.00. These results raise doubts about the interpretability of Tau-U as a correlational effect size or as a percentage of nonoverlap controlling for baseline trend. The high number of out-of-bounds results suggests this phenomenon is not merely an arithmetic quirk of Tau-U but a practical problem for single-case investigators.
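A small worked example shows how such values arise. The Python sketch below assumes the commonly cited formulation of Tau-U with baseline trend correction, in which the baseline trend score S_A is subtracted from the phase contrast score S_AB and the difference is divided by the number of between-phase pairs, nA × nB. The function names and data are hypothetical.

```python
from itertools import combinations


def sign(v):
    """Return -1, 0, or +1 according to the sign of v."""
    return (v > 0) - (v < 0)


def tau_u(phase_a, phase_b):
    """Tau-U with baseline trend correction (assumed formulation)."""
    # Phase contrast: every Phase A score compared with every Phase B score.
    s_ab = sum(sign(b - a) for a in phase_a for b in phase_b)
    # Baseline trend: every earlier Phase A score compared with every later one.
    s_a = sum(sign(later - earlier)
              for earlier, later in combinations(phase_a, 2))
    return (s_ab - s_a) / (len(phase_a) * len(phase_b))


# Decreasing baseline followed by complete nonoverlap in the treatment phase.
print(tau_u([5, 4, 3, 2, 1], [6, 7, 8, 9, 10]))  # 1.4
```

Here S_AB = 25 and S_A = −10, so the numerator (35) exceeds the 25 pairs counted in the denominator. Under this formulation, any baseline trend opposite in direction to a strong phase contrast can push the statistic past 1.00.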
Tau-U Fails to Correct for Baseline Trend
Tau-U failed to correct for baseline trend across a wide range of simulated conditions, yielding effect sizes as large as Tau-U = 0.80 when the true effect was zero. Figure 7 shows that no combination of baseline trend and phase length produced results with an acceptable level of baseline trend correction.
Tau-U was presented as a desirable effect size measure because it uses a “conservative” method of baseline trend correction (Parker, Vannest, & Davis, 2011; Parker, Vannest, Davis, & Sauber, 2011). The Tau-U method of monotonic trend control tends to make smaller adjustments to data before estimating an effect size than do the other four effect size statistics, which utilize linear models of baseline trend. Conservative baseline trend correction is a desirable feature when the investigator is most concerned about overcorrecting baseline trend and incorrectly describing a marginally effective treatment as ineffective (Type II error); overcorrection in these cases would result in a higher rate of false negatives and lower statistical power.
However, the problem of baseline trend in single-case research has almost unanimously been described as a threat of false positives, not false negatives. Vannest, Davis, and Parker (2013) described the problem of baseline trend in this way:
When you determine that a baseline has a positive trend or a trend in the direction of the desired behavior change (that is, decreasing aggression) the analysis will ideally use a method which corrects for trend. This way, the effect you determine is more accurate because it is adjusted for the performances documented prior to the intervention onset. (p. 47, emphasis added)
And Parker et al. (2006) stated succinctly,
A positively trended baseline undermines the validity of the comparison between baseline and intervention phases, encouraging overly optimistic conclusions. (p. 421)
Thus, the greatest risk posed by baseline trend is incorrectly identifying a treatment effect where none exists (Type I error).
The relative risks of Type I versus Type II errors, both statistical and philosophical, are thoroughly reviewed elsewhere (e.g., Cohen, 1969). However, the consensus in single-case research literature would suggest that, in regard to baseline trend, Type I error poses a greater risk to the investigator. Quite simply, a “conservative” control method would, in the presence of a positive baseline trend, yield liberal (i.e., large) effect size estimates—a problematic result when one is most concerned with false positives. Tau-U’s relatively weak method of baseline trend control combined with the formulaic adjustment which inflates effect size results (discussed above) will therefore tend to produce “overly optimistic conclusions.”
Nonparametric Effect Size Statistics Are Robust to Autocorrelation
Monte Carlo analyses demonstrated that the nonparametric effect size statistics (Baseline Corrected Tau, Tau-U, ECL, and MPD) were not substantially affected by serially dependent data (see Figure 8). The OLS regression model (R2) was sensitive to autocorrelated data. For this reason, OLS regression methods are not recommended for single-case analysis unless serial dependence is adequately modeled (e.g., Huitema & McKean, 1998). A large number of observations (n ≥ 50) would permit either the accurate “cleansing” of autocorrelation (e.g., Parker et al., 2006) or time-series analysis, though those methods are not without unique challenges (e.g., Sivo & Willson, 2000; Velicer & Harrop, 1983).
Ceiling and Floor Effects
The probability plots in Figure 5 provide a visual illustration of the effect size indices’ relative strengths and limitations on a sample of published single-case data sets. The distribution of ECL effect sizes demonstrates a pronounced ceiling effect, with 44 out of 65 AB phase contrasts yielding an effect size of ECL = 1.00. This poor sensitivity is a known limitation of nonoverlap methods such as ECL (Allison & Gorman, 1993; Vannest, Davis, & Parker, 2013; Wolery, Busick, Reichow, & Barton, 2010), and for this reason, ECL is not recommended for use in single-case research. The floor effect demonstrated by Allison–Gorman R2 was relatively minor. No floor or ceiling effects were apparent for the analyzed data in the Baseline Corrected Tau, Tau-U, and MPD plots.
Baseline Corrected Tau: An Alternative to Tau-U With Limitations
For the sample of 65 AB phase contrasts, the effect sizes estimated by Baseline Corrected Tau were highly correlated with those of Tau-U (ρ = .80, p < .001); however, Baseline Corrected Tau offered improvements on many of Tau-U’s limitations. Baseline Corrected Tau values did not exceed the conventional bounds of a correlation or percentage-based effect size statistic. By using a less conservative baseline control method, it more reliably controlled for baseline trend. Baseline Corrected Tau may also be preferable to single-case investigators because its method of baseline correction is easily graphed, whereas Tau-U’s monotonic trend control method cannot be visualized. Baseline Corrected Tau also shares many of Tau-U’s strengths: As a nonparametric rank correlation statistic, it has few of the distributional assumptions of OLS methods and is robust to autocorrelation.
Baseline Corrected Tau did demonstrate one notable limitation during the simulation study. Figure 7 illustrates how Baseline Corrected Tau did a poor job of controlling baseline trend when the baseline phase was very brief (n = 5). Recall that, using the decision tree in Figure 1, the investigator will only correct for baseline trend when there is sufficient evidence of trend, that is, when the monotonic trend of the baseline phase observations is statistically significant. In one sense, this is seen as a strength of Baseline Corrected Tau. Methods such as ECL and MPD will always assume there is a baseline trend and will fit a correction line (via “split-middle” or regression) regardless of how confident one is that a trend actually exists. Those methods may be susceptible to producing inaccurate or distorted results when applied to brief time series because trend is detected (and corrected) when it should not be. This tendency to overcorrect in brief time series is precisely the problem that Tau-U was intended to solve with its conservative approach to correction. However, as the results of the simulation study demonstrate, when there is an insufficient number of observations in the baseline phase to detect statistically significant trend, Baseline Corrected Tau will overestimate the size of treatment effect when trend is present (and when trend is in the direction of expected treatment outcome).
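The conditional logic described above can be sketched in a few lines. This is an illustration under stated assumptions and not the calculator's implementation: the baseline trend test uses a normal approximation to Kendall's S without a tie correction, the correction line is a Theil–Sen fit to the baseline extended through the treatment phase, the effect size is Kendall's tau-b between a phase indicator and the (possibly corrected) data, and α = .05. All function names and data are hypothetical.

```python
from itertools import combinations
from math import erfc, sqrt
from statistics import median


def kendall_s(x, y):
    """Kendall's S: concordant minus discordant pairs."""
    s = 0
    for i, j in combinations(range(len(x)), 2):
        prod = (x[j] - x[i]) * (y[j] - y[i])
        s += (prod > 0) - (prod < 0)
    return s


def kendall_tau_b(x, y):
    """Kendall's tau-b, which adjusts the denominator for ties in x and y."""
    pairs = list(combinations(range(len(x)), 2))
    ties_x = sum(1 for i, j in pairs if x[i] == x[j])
    ties_y = sum(1 for i, j in pairs if y[i] == y[j])
    denom = sqrt((len(pairs) - ties_x) * (len(pairs) - ties_y))
    return kendall_s(x, y) / denom if denom else 0.0


def trend_p_value(y):
    """Two-sided p for monotonic trend (assumed normal approximation, no tie correction)."""
    n = len(y)
    var_s = n * (n - 1) * (2 * n + 5) / 18
    return erfc(abs(kendall_s(range(n), y)) / sqrt(2 * var_s))


def theil_sen(t, y):
    """Median of pairwise slopes, with the median residual as intercept."""
    slope = median((y[j] - y[i]) / (t[j] - t[i])
                   for i, j in combinations(range(len(t)), 2))
    intercept = median(yi - slope * ti for ti, yi in zip(t, y))
    return slope, intercept


def baseline_corrected_tau(a, b, alpha=0.05):
    """Return (tau, corrected); correct only if the baseline trend is significant."""
    y = list(a) + list(b)
    corrected = trend_p_value(a) < alpha
    if corrected:
        slope, intercept = theil_sen(range(len(a)), a)
        # Subtract the extended baseline trend line from both phases.
        y = [yi - (intercept + slope * t) for t, yi in enumerate(y)]
    phase = [0] * len(a) + [1] * len(b)
    return kendall_tau_b(phase, y), corrected


# Long baseline: the trend is detected and removed, leaving no effect.
print(baseline_corrected_tau(list(range(1, 9)), list(range(9, 17))))  # (0.0, True)
# Brief, noisy baseline: the trend goes undetected and tau stays large (about 0.745).
print(baseline_corrected_tau([2, 1, 3, 5, 4], [6, 7, 8, 9, 10]))
```

In the second call the five baseline observations trend upward, but the trend test is not significant, so no correction is applied and the continuing trend is counted as a treatment effect.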
The limitation of Baseline Corrected Tau is therefore a limitation of its statistical power. As Figure 7 illustrates, when there is a sufficiently large trend, a sufficiently large number of baseline phase observations, or a combination of both, the method quite effectively corrects for baseline trend. Unfortunately, the investigator will often not know the true degree of baseline trend, nor will he or she always have control over the number of baseline observations recorded. To address this limitation, a power table for detecting baseline trend with Baseline Corrected Tau is presented in Table 2. This power table is limited by its own assumptions of variance and autocorrelation; however, it offers a useful heuristic reference for single-case investigators. Investigators may use this table to assess the power of their design under given assumptions and, when power is unacceptably low, they may decide to increase the length of their baseline phase or conduct a Tau analysis without baseline correction and report their low power in their results. By combining power analysis with the complementary trend control and effect size measurement procedures, Baseline Corrected Tau offers the single-case investigator a flexible option for data analysis despite this limitation.
As with all single-case effect size statistics based on data rank or overlap, Baseline Corrected Tau may do a poor job of distinguishing between large experimental effects with little or no between-phase overlap. For example, the Tau values for two single-case data sets [1, 1, 1, 1, 1 / 2, 2, 2, 2, 2] and [1, 1, 1, 1, 1 / 5, 5, 5, 5, 5] are the same, because the cross-phase ranks of both data sets are identical. When investigators must distinguish between large nonoverlapping effects, it may be useful to select a measure that accounts for these differences, such as a regression-based method or a standardized mean difference (such as MPD).
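The two data sets in this example can be checked directly. The sketch below assumes Kendall's tau-b computed between a phase indicator (0 = baseline, 1 = treatment) and the observations; the helper names are hypothetical. It also reports the raw mean difference, which does distinguish the two data sets.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def kendall_tau_b(x, y):
    """Kendall's tau-b: (concordant - discordant) / sqrt((n0 - ties_x) * (n0 - ties_y))."""
    pairs = list(combinations(range(len(x)), 2))
    s = 0
    for i, j in pairs:
        prod = (x[j] - x[i]) * (y[j] - y[i])
        s += (prod > 0) - (prod < 0)
    ties_x = sum(1 for i, j in pairs if x[i] == x[j])
    ties_y = sum(1 for i, j in pairs if y[i] == y[j])
    denom = sqrt((len(pairs) - ties_x) * (len(pairs) - ties_y))
    return s / denom if denom else 0.0


def phase_tau(a, b):
    """Tau between a 0/1 phase indicator and the combined observations."""
    phase = [0] * len(a) + [1] * len(b)
    return kendall_tau_b(phase, list(a) + list(b))


small = ([1, 1, 1, 1, 1], [2, 2, 2, 2, 2])
large = ([1, 1, 1, 1, 1], [5, 5, 5, 5, 5])

print(phase_tau(*small), phase_tau(*large))                    # 1.0 1.0
print(mean(small[1]) - mean(small[0]), mean(large[1]) - mean(large[0]))  # 1 4
```

Both data sets reach the maximum tau because only the ordering of observations across phases enters the statistic; the fourfold difference in the size of the shift is visible only in a measure expressed in the original metric.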
MPD Is a Practical Alternative
MPD performed well in a variety of simulation conditions. It was robust to autocorrelation and controlled baseline trend in a “no effect” simulation model. MPD’s unstandardized metric may also be useful to practitioners who wish to report improvement in the metric of the original outcome measure, rather than as a correlation or percentage of overlap.
Although MPD is well-suited for practitioners, it may be less useful for researchers. MPD effect sizes are not easily combined across studies (a strength of Baseline Corrected Tau, Tau-U, and R²). Its standardized form also yielded very large values in the present study, raising questions about interpretability. For example, eight out of 65 analyzed published data sets (12%) had a standardized MPD effect size greater than 4. Similar to other standardized mean difference single-case statistics, MPD results do not appear to conform to any established set of interpretive benchmarks (Parker et al., 2005). Nonetheless, MPD was a promising method when compared with other single-case statistics that modeled baseline trend.
Limitations of Study
The present study was limited by (a) the single-case statistics selected for study, (b) the sample of published data sets analyzed, and (c) the scope of the simulation models. First, only five effect size statistics were included out of the considerable array of methods available to single-case investigators. Only methods that modeled baseline trend were included, and as a result, many new and promising statistics were excluded (e.g., Shadish, 2014). The analyses and simulations included in this study could be readily extended to other effect size statistics, particularly if de-trending methods (such as Theil–Sen regression) are coupled with newer, more sophisticated measures.
The present study was also limited to a convenience sample of relatively homogeneous anxiety- and depression-treatment research. This field was selected because it may benefit from improved baseline control methods more than other areas of research, such as behavior therapy, where stable baselines are more common. However, limiting the sample to depression and anxiety treatment studies necessarily limited the sample of analyzed data sets to n = 65. It is possible that with a larger sample of published data sets, unidentified strengths and limitations of the five statistical measures under review would be more apparent.
The simulation models used in the present study were also limited in several ways. Only linear trend patterns were modeled, and future investigations would benefit from studying nonlinear trend as well (particularly as most trend correction methods use some form of linear model). Only lag-1 autoregressive autocorrelations were modeled; effect size statistics might perform differently under higher order, moving average, or integrated autocorrelation data patterns. The simulation studies also did not systematically manipulate error variance in the way that baseline trend and autocorrelation were manipulated. Investigators would greatly benefit from an investigation of how well different statistical measures perform under different amounts of data variability (regardless of trend, level, autocorrelation, etc.).
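For readers who wish to extend the simulations, a series with linear trend, a level change at the phase boundary, and lag-1 autoregressive errors can be generated as follows. This is a minimal sketch and not the simulation code used in the present study; the function name, parameter names, and default values are all assumptions chosen for illustration.

```python
import random


def simulate_ab_series(n_a, n_b, slope=0.0, effect=0.0, phi=0.0, sd=1.0, seed=None):
    """Generate one AB series: linear trend + level shift + AR(1) error.

    slope  : change per observation across both phases (linear trend)
    effect : level change added to every treatment-phase observation
    phi    : lag-1 autoregressive coefficient of the error term
    sd     : standard deviation of the white-noise innovations
    """
    rng = random.Random(seed)
    series = []
    error = 0.0
    for t in range(n_a + n_b):
        # AR(1): the current error carries over a fraction phi of the previous error.
        error = phi * error + rng.gauss(0.0, sd)
        level = effect if t >= n_a else 0.0
        series.append(slope * t + level + error)
    return series[:n_a], series[n_a:]


# A "no effect" condition with baseline trend and moderate autocorrelation.
baseline, treatment = simulate_ab_series(5, 10, slope=0.5, effect=0.0, phi=0.3, seed=1)
print(len(baseline), len(treatment))  # 5 10
```

Error variance could be manipulated through `sd`, and nonlinear trend could be explored by replacing the `slope * t` term with another function of `t`.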
Conclusion
Tau-U has several strengths as a single-case effect size statistic: It assesses both within-phase trend and between-phase differences, it is robust to autocorrelation, and it makes few distributional assumptions and is therefore applicable to a wide range of single-case experimental designs. Tau-U is increasingly popular and has been used in dozens of published single-case studies. However, despite its popularity, Tau-U’s limitations have gone relatively unexplored. This article demonstrated several limitations of Tau-U that may be highly relevant for single-case investigators. The Tau-U method of baseline correction yields effects that are difficult to interpret because they are not bounded between −1 and +1. In the convenience sample of 65 published data sets, 17% had “out of bounds” Tau-U effect sizes. Tau-U’s method of baseline trend control is also a “black box” method that cannot be meaningfully visualized. Most importantly, Tau-U’s weak method of monotonic baseline correction failed to adequately control baseline trend across a wide range of simulated conditions, leading to increased Type I error. The conservative method of baseline control in fact leads to liberal estimates of effect size and encourages overly optimistic conclusions about the efficacy of experimental treatments. Because of these limitations, caution is recommended to single-case investigators who use Tau-U to better understand their experimental effects through statistical analysis. Investigators who use Tau-U should understand the statistic and its limitations thoroughly in order to correctly interpret results; they should also forgo the alteration to Kendall’s Tau formula, which leads to out-of-bounds results.
Baseline Corrected Tau was proposed as an improved rank correlation effect size for single-case experimental designs. This new statistic shares many of Tau-U’s strengths and improves upon its limitations. Although Baseline Corrected Tau results are highly correlated with Tau-U’s, the statistic is bounded within conventional limits and it can be visualized in an easily interpretable way. It also outperformed two other popular single-case effect size measures when field-tested on published and simulated time series. The MPD statistic (Manolov & Solanas, 2013) was also promising, especially for practitioners who wish to understand their treatment effects in a readily interpretable way. Investigators should consider MPD when selecting an appropriate effect size measure. Baseline Corrected Tau’s utility may be maximized with two resources provided with this article: a power analysis table for interpreting results (Table 2) and a web-based calculator (http://www.ktarlow.com/stats/tau; Tarlow, 2016), reducing the need for specialized statistical software or syntax. Baseline Corrected Tau is recommended to single-case investigators as a flexible and superior alternative to Tau-U when autocorrelation or baseline trend may be present in their time-series data.
Footnotes
Acknowledgements
The author would like to thank Vanessa Laird for her consultation on an early version of this article and for her assistance with data preparation and analysis. The author also wishes to acknowledge the reviewer whose feedback greatly improved the quality of this article.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
