Abstract
Researchers and practitioners frequently use curriculum-based measures of reading (CBM-R) within single-case design (SCD) frameworks to evaluate the effects of reading interventions with individual students. Effect sizes (ESs) developed specifically for SCDs are often used as a supplement to visual analysis to gauge treatment effects. The degree to which measurement error associated with academic measures like CBM-R influences said ESs has not been fully explored. We used simulation methodology to evaluate how common magnitudes of error influenced the consistency and accuracy of outcomes from two nonparametric SCD ESs, percentage of data exceeding baseline trend and TauU. After accounting for other data characteristics, measurement error accounted for a statistically and practically significant amount of variance in the consistency and accuracy of outcomes from both ESs. This article suggests that the psychometric properties of academic measures are important to consider when interpreting ESs from SCDs.
A cornerstone of delivering effective individualized interventions within multi-tiered systems of support is the capability of formatively assessing and modifying instructional programs in response to student progress (Fuchs, Fuchs, & Stecker, 2010). Special educators and school psychologists could use single-case designs (SCDs) to measure students’ response to instruction, particularly for students receiving intensive interventions within multi-tiered systems of support (Barnett, Daly, Jones, & Lentz, 2004). Briefly, to use an SCD, performance is measured repeatedly in the absence and then the presence of an intervention. Changes in level (average performance), trend (increase or decrease in observations across time), and variability (spread of observations) between baseline and intervention phases are evaluated, most often visually, to determine whether the implementation of the intervention was functionally related to the change in performance (Horner, Swaminathan, Sugai, & Smolkowski, 2012; Kratochwill et al., 2010). Due in part to the well-established pattern of inconsistency associated with visual analysis when certain data characteristics are present (Lieberman, Yoder, Reichow, & Wolery, 2010; Ninci, Vannest, Wilson, & Zhang, 2015), calls from policy makers to identify evidence-based interventions (Every Student Succeeds Act, 2015; No Child Left Behind Act, 2002), and an increased desire to objectively compare intervention effects across SCD studies (Allison & Gorman, 1993; Kratochwill & Levin, 2014; Shadish, 2014), researchers have developed statistical methods, specifically SCD-based effect sizes (ESs), to quantify treatment outcomes.
The purpose of this study was to explore the extent to which measurement error associated with curriculum-based measurement of oral reading (CBM-R), a common reading assessment used within SCDs, influences the consistency and accuracy of outcomes from two SCD ESs. The impact of measurement error on ESs used in between-group studies (Hedges, 1981) and meta-analyses (Schmidt & Hunter, 2015) is well-known. Because the maximum test-retest or alternate form reliability is r = 1.0, measurement error decreases the observed treatment effects. Researchers have yet to examine the influence of the reliability and precision of observed academic scores when quantifying intervention outcomes in SCD research. Common SCD ESs are based upon the percentage of between-phase overlap rather than mean differences, and measurement error could result in an over- or underestimation in the amount of overlap. Investigating the extent to which the consistency and accuracy of SCD ESs are impacted by the reliability of common measures used within these designs is important to (a) assist educational professionals in making correct interpretations about student progress and (b) provide empirical guidelines for researchers who seek to quantify treatment effects within SCD research. Before describing the current project in more detail, we briefly review two common ESs used with academic measures and unique characteristics of CBM-R that may influence the consistency and accuracy of outcomes from SCD ESs.
ES Types
Shadish (2014) categorized SCD ESs into three categories: (a) standardized mean differences (including nonparametric overlap statistics), (b) regression-based estimators (including multilevel modeling), and (c) Bayesian approaches. Opinions regarding the most appropriate type of ES (or whether they are needed at all; Wolery, 2013) vary considerably among researchers. It is likely that standardized ESs, particularly overlap indices, are the most accessible (i.e., easy to calculate) for practitioners and applied researchers. Yet, early overlap metrics have severe shortcomings, including the tendency to be heavily influenced by extreme values and trend in baseline phases (Parker, Vannest, & Davis, 2011). As a result, we evaluated the performance of two common overlap ESs that account for baseline trend and are less influenced by one aberrant data point: Percentage of Data Exceeding Baseline Trend (PEBT) and TauU. Both approaches are recommended for practitioners to identify evidence-based interventions (e.g., Vannest, Davis, & Parker, 2013).
PEBT
The percentage of data exceeding a median trend (PEM-T; Wolery, Busick, Reichow, & Barton, 2010) incorporates two SCD ESs: percentage exceeding the median (Ma, 2006) and the extended celeration line (White & Haring, 1980). If baseline trend is not present, PEM-T is equivalent to the percentage exceeding the median (Ma, 2006). To use PEM-T, baseline trend is estimated and a line of best fit is extended into the intervention phase. Next, the proportion of observations in the intervention phase above the projected trend line is calculated. The resulting proportion is subtracted from .50 to account for chance. Rescaling the obtained value:
Wolery et al. (2010) and White and Haring (1980) used the split-middle method to determine the baseline trend, but any type of trend line is sufficient (Parker, Vannest, & Davis, 2011). We used ordinary least squares (OLS) trend lines to summarize baseline trend (which is computationally complex but easily carried out using standard computer spreadsheet software) for two major reasons. First, predictions based on OLS trends are generally more accurate than those based on the split-middle method even for short data series (Good & Shinn, 1990). Second, the vast majority of published recommendations suggest the use of OLS methods for estimating trend in CBM-R data (Ardoin, Christ, Morena, Cormier, & Klingbeil, 2013). We refer to the resulting ES as the percentage of data exceeding baseline trend (PEBT) to note this key difference from PEM-T. Use of OLS within our calculation of PEBT assumes that baseline trend is linear.
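The core of the PEBT calculation described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names `ols_line` and `pebt_proportion` are hypothetical, sessions are assumed to be equally spaced and numbered from 1, and an observation is counted only when it falls strictly above the projected line. The chance correction and rescaling step is deliberately left out because its formula is not reproduced in the text.

```python
def ols_line(x, y):
    """Return (intercept, slope) of the least-squares line through the points (x, y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def pebt_proportion(baseline, intervention):
    """Proportion of intervention scores above the projected baseline OLS trend.

    Assumption: sessions are equally spaced and numbered 1, 2, 3, ...
    Assumption: scores exactly on the projected line do not count as exceeding it.
    """
    base_sessions = list(range(1, len(baseline) + 1))
    intercept, slope = ols_line(base_sessions, baseline)
    first_intervention_session = len(baseline) + 1
    above = 0
    for offset, score in enumerate(intervention):
        projected = intercept + slope * (first_intervention_session + offset)
        if score > projected:
            above += 1
    return above / len(intervention)


# Hypothetical WRCM scores: the baseline rises by 2 WRCM per session.
baseline = [40, 42, 44, 46, 48]
intervention = [55, 51, 60, 55, 63]
proportion = pebt_proportion(baseline, intervention)
```

In this hypothetical series the projected baseline values for the five intervention sessions are 50, 52, 54, 56, and 58, so three of the five intervention observations exceed the trend line.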
TauU
TauU is a family of ES indices that integrate both trend and nonoverlap, with values ranging from −1 to 1 across all forms (Parker, Vannest, Davis, & Sauber, 2011). The full index represents the percentage of nonoverlapping data minus the percentage of overlapping data plus the improvement in Phase B, controlling for Phase A. The full TauU index may be the most relevant for academic outcomes because practitioners are often interested in quantifying intervention effects on the trend and level of student performance. Practitioners and researchers can upload their data to a website to calculate TauU quickly and free of charge (www.singlecaseresearch.org).
TauU is based on the Mann-Whitney U test and Kendall’s tau. It differs from PEBT in that it allows for the control of baseline trends monotonically rather than using ordinary least squares regression. Another major benefit of TauU is its statistical power, which often exceeds 115% of that of parametric tests when data do not conform to parametric assumptions (Parker, Vannest, Davis, & Sauber, 2011). The full TauU index was calculated using the steps presented in Parker, Vannest, Davis, and Sauber (2011). First, using columns for scores and phases (0 = baseline and 1 = intervention), order the baseline observations from last to first while maintaining the actual score values from the intervention phase. Second, conduct a Kendall’s rank correlation analysis on the resulting data to obtain the test statistic (S). Third, calculate the number of pairs for the design, [n(n − 1)] / 2. Fourth, calculate TauU (S / number of pairs).
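The four steps above can be sketched in a few lines. This is one reading of the procedure, not the code used in the study: the names `kendall_s` and `tau_u` are hypothetical, "order the baseline observations from last to first" is interpreted as reversing the baseline series before appending the intervention series in its original order, and no tie correction is applied.

```python
from itertools import combinations


def kendall_s(values):
    """Kendall's S: concordant minus discordant pairs, with position as the time variable."""
    s = 0
    for i, j in combinations(range(len(values)), 2):
        diff = values[j] - values[i]
        s += (diff > 0) - (diff < 0)  # +1 if increasing, -1 if decreasing, 0 if tied
    return s


def tau_u(baseline, intervention):
    """Full TauU following the four steps in the text (assumed interpretation)."""
    # Step 1: reverse the baseline, keep the intervention scores in their actual order.
    series = list(reversed(baseline)) + list(intervention)
    # Step 2: Kendall's S on the combined series.
    s = kendall_s(series)
    # Step 3: number of pairs for the design, n(n - 1) / 2.
    n = len(series)
    pairs = n * (n - 1) // 2
    # Step 4: TauU = S / number of pairs.
    return s / pairs


# Hypothetical WRCM scores with a flat baseline and an improving intervention phase.
flat_baseline_result = tau_u([10, 10, 10], [20, 21, 22])
# Same intervention-phase improvement, but the baseline was already trending upward.
rising_baseline_result = tau_u([1, 2, 3], [4, 5, 6])
```

Reversing the baseline means that an improving baseline trend contributes negatively to S, which is how this construction penalizes improvement that was already under way before the intervention. In the two hypothetical cases, the rising baseline yields the smaller value.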
Academic Interventions and SCDs
Traditional psychometric concepts of reliability and validity are important considerations for SCDs (Cone, 1977; Gast, 2014). Our review of how these concepts extend to SCDs relies heavily upon the work of Kazdin (2011) and Gast (2014). For SCDs, reliability can refer to reliability of effect, reliability of measurement, or procedural reliability. In this article, we sought to investigate how reliability of measurement, or the consistency of observed scores, influenced ES outcomes. Subsumed within consistency of measurement, we also investigated accuracy. Accuracy of measurement is the degree to which the observed behavior or scores approximate the “true” behavior under natural conditions (Gast, 2014). That is, accuracy asks whether the data that were collected reflect the student’s actual level of behavior. In this study, we investigated how reliability of measurement influenced the degree to which ESs approximated a true effect, or an ES without error.
From a generalizability theory perspective, error (or unreliability) in measurement arises from different sources (Cronbach, Gleser, Nanda, & Rajaratnam, 1972). Documenting interobserver agreement, a standard component for most SCD studies (Smith, 2012), provides evidence of whether inconsistencies between observers are a source of measurement error. High interobserver agreement suggests that an academic skill measure was scored correctly. Interobserver agreement does not provide any information regarding error associated with the instruments used to measure academic skills. Although the development of SCD ESs has been rapid, the relationship between the reliability or precision of individual observations and the consistency and accuracy of measured academic outcomes has not been explored.
Excessive variability poses a threat to data-evaluation validity (Kazdin, 2011). Historically, within-phase variability was attributed to factors such as differences in the environment from one measurement occasion to the next (e.g., observing a student in math class versus recess), differences in the internal state of the student (e.g., before or after taking medication), or differences in the intensity or fidelity of intervention implementation. Depending on the behavior one is measuring, the presence of high within-phase variability suggests that the researcher has not achieved sufficient experimental control, and interpretations of the relationship between the intervention and the dependent measure are made with limited internal validity (Gast, 2014; Kazdin, 2011). For academic interventions, an added source of within-phase variability, and thus a threat to data-evaluation validity, may be the reliability and precision of scores used to measure student performance. More empirical research is needed to determine how the technical adequacy of academic measures used within SCDs, including CBM-R, influences ES estimates.
CBM-R
CBM-R was an outgrowth of activities carried out at the University of Minnesota by Stan Deno and colleagues to help special educators measure, evaluate, and refine their instructional practices (Deno, 1985, 2003). To use CBM-R, educators administer passages of connected text, calculate the number of words read correct per minute (WRCM), plot the resulting value on a time series graph, and depending on the upward or downward pattern of WRCM scores across time, make an instructional change (i.e., implement a phase change; Deno, 1986, 1990). Currently, CBM-R is also used as a screening tool to identify students who may need additional support (Kilgus, Methe, Maggin, & Tomasula, 2014), a means to monitor regular education student response to instruction (Fuchs, Fuchs, & Speece, 2002), and as a method to evaluate systemic curriculum and instructional reforms (Cummings, Stoolmiller, Baker, Fien, & Kame’enui, 2015).
Improvement in WRCM is thought to reflect improvement in reading fluency, or oral reading rate, which is a critical skill for reading comprehension (i.e., the goal of literacy instruction; National Reading Panel, 2000). Robust empirical evidence suggests WRCM, as measured by CBM-R, is strongly related to and highly predictive of broader measures of reading comprehension (Fuchs, Fuchs, Hosp, & Jenkins, 2001; Reschly, Busch, Betts, Deno, & Long, 2009). The desirable psychometric properties of CBM-R, coupled with the simplicity of scoring passages, the minimal amount of time required to collect data, and the capability to repeatedly measure performance, are why the tool is often cited as a uniquely appropriate assessment to measure intervention effects across relatively brief periods of time.
Measurement Issues Associated With CBM-R
A key assumption when interpreting CBM-R time series data is that changes in performance across time are a result of instructional efficacy (Ardoin, Roof, Klubnick, & Carfolite, 2008). Within an SCD framework, the critical assumption is that between-phase changes are related to the manipulation of the independent variable (e.g., implementation of an academic intervention). Although CBM-R was designed for repeated measurement, evidence of reliability for groups of students (e.g., Deno, Fuchs, Marston, & Shin, 2001) does not provide evidence that changes in CBM-R scores accurately capture growth for an individual student (Ardoin et al., 2013). Accurate estimation of changes in the dependent variable of interest for an individual is fundamental to SCD research.
Differences in WRCM scores across time are not solely attributable to meaningful changes in oral reading rate. The manner in which probes are constructed (Christ & Ardoin, 2009; Hintze & Christ, 2004), differences in difficulty among passage sets (Betts, Pickart, & Heistad, 2009; Cummings, Park, & Bauer Schaper, 2013; Francis et al., 2009), the type of passage (e.g., expository vs. narrative; O’Keeffe, Bundock, Kladis, Yan, & Nelson, 2017), the manner in which instructions are delivered (Christ, White, Ardoin, & Eckert, 2013; Colon & Kranzler, 2006), the setting in which data are collected (Derr & Shapiro, 1989), and errors committed by data collectors (Cummings, Biancarosa, Schaper, & Reed, 2014) all influence the observed WRCM for a student at any given point in time. Further, it is likely that idiosyncratic differences in the testing environment and disposition of the student may contribute to minor fluctuations in performance. Among the previously mentioned factors, characteristics of probe sets consistently account for the majority of unwanted variability in oral reading rate (Ardoin & Christ, 2009; Poncy, Skinner, & Axtell, 2005).
The standard error of measurement (SEM) for CBM-R approximates 5 to 15 WRCM (Christ & Silberglitt, 2007). Thus, if a student earned a WRCM score of 50 on a given day, assuming that data were collected with high-quality passages in a highly standardized manner (i.e., SEM = 5), using a 95% confidence interval (CI), that student’s true WRCM may be as low as 40 or as high as 60 WRCM. The presence of within-phase variability in SCDs may provide meaningful information to better understand, and potentially refine, the intervention in question. With CBM-R, such fluctuations obscure meaningful interpretations of student improvement and are generally viewed as unwanted variability, or error.
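The interval in the example above follows from the usual normal-theory approximation, observed score ± 1.96 × SEM. The short sketch below applies it to the three SEM values considered in this article. The function name `wrcm_confidence_interval` is hypothetical, and the 1.96 multiplier assumes normally distributed measurement error.

```python
def wrcm_confidence_interval(observed, sem, z=1.96):
    """Return (lower, upper) bounds of a CI around an observed WRCM score.

    Assumption: measurement error is normally distributed, so z = 1.96
    corresponds to a 95% interval.
    """
    margin = z * sem
    return observed - margin, observed + margin


# Observed score of 50 WRCM under the three SEM magnitudes discussed in the text.
intervals = {sem: wrcm_confidence_interval(50, sem) for sem in (5, 10, 15)}
```

With SEM = 5 the interval spans roughly 40 to 60 WRCM, as in the example. With SEM = 15 it widens to roughly 21 to 79 WRCM, which illustrates how quickly a single observation becomes uninformative as SEM grows.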
When an educator interprets a WRCM score, they are not only interested in how quickly a student read that particular passage. Rather, they infer from that observation the student’s oral reading rate across all types of texts, and even more broadly, how that intervention is improving broad reading competence (Christ, Van Norman, & Nelson, 2016). Thus, in addition to being sensitive to small changes in performance across time, CBM-R must also measure said changes with a high level of reliability and precision. However, the computation of SCD ESs, particularly those reviewed earlier in the article, does not account for the imprecision of individual WRCM observations. As a result, it is unclear whether the consistency and accuracy of outcomes from either ES differ as a function of measurement error.
Purpose
Researchers have developed novel metrics and strategies to adjust group-based ESs to account for measurement error in the meta-analysis literature (Hunter & Schmidt, 2004). The impact of measurement error, as well as strategies to account for it, has received substantially less attention in the SCD ES literature. The purpose of the current project was to explore the extent to which measurement error associated with individual WRCM scores influenced the consistency and accuracy of outcomes from two SCD ESs. To accomplish this, we used simulation methodology to generate true and observed WRCM scores across a variety of conditions. Conditions for simulations were based upon descriptive and multilevel analyses of 88 AB graphs identified as part of an extensive literature review for another project. To assess the consistency of ES outcomes, we calculated the variability of observed ES results for each unique combination of conditions. To assess the accuracy of ES outcomes, we calculated root mean square error (RMSE) values between ESs based upon observed and true scores. The following research questions framed this study:
Method
Overview
Data for the simulation were generated and analyzed in several steps. To ensure we simulated conditions that were meaningful to researchers and practitioners, we conducted a comprehensive review of the research literature to identify multiple-baseline design studies that used the WRCM metric as an outcome variable. After data extraction, graphs were analyzed using multilevel analysis. In addition, descriptive statistics were computed to measure typical levels of autocorrelation and data collection durations. From that information, true scores were generated across 729 unique conditions. Next, observed scores were generated in batches of 9,000 for each of the 729 unique conditions by adding a random error term with a predefined autocorrelation structure to each true score. ESs were computed from true and observed scores. RMSE values between ESs based upon true and observed scores were calculated to assess the accuracy of outcomes. SDs of observed outcomes for each unique condition were also calculated to measure the consistency of results. Finally, quantile logistic regression models were estimated to assess the degree to which measurement error influenced the consistency and accuracy of ES outcomes while statistically controlling for other data characteristics.
Literature Review
We extended the search procedures from Ross and Begeny (2014) to find single-case intervention research that used WRCM as the dependent variable. Ross and Begeny (2014) identified 23 multiple-baseline-across-participants studies (66 AB comparisons) published between 1995 and 2010. We searched the PsycINFO and ERIC databases for studies published between 2011 and 2016 using the following search terms: oral reading; prosody; reading fluency; fluency with modeling; paired reading; assisted reading; fluency with peer tutoring; fluency with peer assisted; fluency with peer assisted learning strategies; fluency with previewing; fluency with paired reading; fluency with passage preview; fluency with Read Naturally; fluency with Reader’s Theatre; reading automaticity; fluency with fast start; fluency with phase drill; fluency with the fluency assessment system; fluency with everybody reads. This search, plus our review of the references of each returned article, resulted in 76 studies that were screened for inclusion.
The inclusion criteria were that the study used (a) WRCM as an outcome measure and (b) a multiple-baseline design. The first and third authors independently reviewed and coded 10 studies. There was one disagreement regarding the WRCM metric. The study in question calculated WRCM by dividing the number of words within a book by the number of minutes a student took to read that book. The authors discussed the disagreement and decided that the study should be excluded because the WRCM metric needed to be based upon an administration of an oral reading fluency passage. No disagreements were observed regarding the multiple-baseline design criterion. After meeting, the first and third authors independently reviewed another 10 studies with 100% agreement on both inclusion criteria. The third author screened the remainder of the studies, with the entire process resulting in the overall exclusion of 57 of the 76 studies.
The graphs from each of the 19 remaining studies were evaluated using guidelines set forth by Kratochwill et al. (2010) and Parker et al. (2005). We elected to use the criteria from Parker et al. (2005) for the total number of observations to limit the impact of imprecise slope estimates on the PEBT ES, as CBM-R progress monitoring research suggests that slope estimates based upon fewer than six observations contain extremely high levels of measurement error (Christ, 2006; Christ, Zopluoglu, Monaghen, & Van Norman, 2013). Studies were excluded if they contained fewer than five baseline data points (n = 2), fewer than five intervention phase data points (n = 2), or fewer than 14 data points across both phases (n = 9). Thus, the updated search identified an additional six studies published between 2011 and 2016 (14 AB comparisons; see the appendix for an overview of each additional study). We included AB graphs that showed no or negative treatment effects from the identified studies, unlike Ross and Begeny (2014), because including those comparisons may provide a more accurate picture of applied practice and would generate more meaningful parameters for simulations. This resulted in an additional eight AB comparisons from the 23 studies in the Ross and Begeny (2014) article. Taken together, we included a total of 29 studies and the multilevel analysis contained 88 total AB comparisons.
Data extraction
Each graph was saved as a separate jpeg file and data were extracted using the program WebPlotDigitizer Version 3.10 (Rohatgi, 2016). The program enabled us to extract numeric values for each data point on each graph. We then created a spreadsheet that contained raw data, session number, and intervention phase for each graph. Values were rounded to the nearest whole number.
Parameters for Data Generation
We sought to evaluate the influence of SEM (5, 10, or 15 WRCM) on ES outcomes while controlling for other data characteristics that may also influence results. Among the identified characteristics were baseline level, the difference in level between baseline and treatment phases (treatment effect level), baseline slope, the difference in slope between baseline and treatment phases (treatment effect slope), the number of observations collected during the baseline phase, the number of observations collected during the intervention phase, and the magnitude of autocorrelation present in the data series. With the exception of SEM, we identified typical values for each of these data characteristics through descriptive and multilevel analysis.
Descriptive analysis
The average number of baseline observations collected among the 88 AB comparisons was 11 (SD = 4). A data collection schedule of one observation per day (assuming a 5-day school week) was the most common data collection schedule. Thus, 11 observations corresponded to roughly 2.2 weeks (SD = 0.80 weeks). The average number of intervention observations was 14 (SD = 7), which corresponded to 2.8 weeks (SD = 1.40 weeks). We estimated common levels of autocorrelation by applying the Cochrane-Orcutt procedure to each graph. The average first-order autoregressive autocorrelation was equal to .19 (SD = .32). To identify levels for later data generation, the mean value for each characteristic as well as 1 SD above and below that value (Table 1) were calculated.
Table 1. Parameters to Generate True and Observed Words Read Correct per Minute Scores.
Note. Level and standard error of measurement (SEM) values correspond to words read correct per minute (WRCM) scores. Slope values represent increases in WRCM per week. For parameters derived via multilevel and descriptive analyses, Level 2 represents the fixed effect or mean value, respectively. Level 1 is −1 SD from that fixed effect or mean and Level 3 is +1 SD from that fixed effect or mean. SEM values were selected based upon common levels used in previous curriculum-based measurement simulation studies (e.g., Christ, Zopluoglu, Monaghen, & Van Norman, 2013) and values observed from analyses of large extant data sets (e.g., Christ & Silberglitt, 2007).
Multilevel analysis
We estimated a two-level multilevel model where WRCM scores were nested within participants following steps outlined by Moeyaert, Feron, Beretvas, and Van den Noortgate (2014) using the nlme package (Pinheiro, Bates, DebRoy, Sarkar, & R Core Team, 2016) with the computer program R (R Core Team, 2016). More specifically, we estimated fixed and random effects for baseline level, treatment effect on level, baseline slope, and treatment effect on slope (Table 2). The use of multilevel analysis enabled us to identify typical levels of student performance while controlling for error and in turn enabled us to define ranges of true scores for our study. When estimating the model, we assumed a first-order autoregressive residual variance structure. We were unable to model different levels of residual variance and autocorrelation between phases due to convergence issues. After estimating the model, we selected conditions for data generation by adding and subtracting 1 SD from each fixed effect (Table 1).
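For orientation, a two-level piecewise model of the kind described above is commonly written as shown below. The notation is an assumption for illustration and does not appear in the article: Time is the session index, Phase is a 0/1 indicator for the intervention phase, and TimeB is time elapsed since the start of the intervention. The exact centering of the time variables used in the study is not stated here.

```latex
% Level 1: observation i nested within participant j
y_{ij} = \beta_{0j} + \beta_{1j}\,\mathrm{Time}_{ij}
       + \beta_{2j}\,\mathrm{Phase}_{ij}
       + \beta_{3j}\,\mathrm{Phase}_{ij}\,\mathrm{TimeB}_{ij} + e_{ij},
\qquad
e_{ij} = \rho\, e_{(i-1)j} + \varepsilon_{ij}

% Level 2: fixed effect plus a participant-specific random effect
\beta_{kj} = \gamma_{k0} + u_{kj}, \qquad k = 0, 1, 2, 3
```

Under this parameterization, the fixed effects correspond to baseline level, baseline slope, treatment effect on level, and treatment effect on slope for k = 0, 1, 2, and 3, respectively. The first-order autoregressive residual structure is represented by the ρ term.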
Table 2. Results of Multilevel Analysis From AB Graphs (n = 88) to Identify Conditions for Simulations.
Note. A first-order autoregressive residual variance structure was assumed while fitting the multilevel model. Modeling heterogeneous residual variance and autocorrelation between phases caused convergence issues during preliminary model building; thus, homogenous residual variance and autocorrelation were assumed in baseline and intervention phases.
Author specified
Based upon the multilevel analysis, the residual variance of WRCM scores was approximately 100 (SD = 10 WRCM; Table 2), which coincides with values typically observed in the research literature, and is consistent with the qualitative descriptor of a “good” or “typical” data set in previous CBM-R simulation studies (Christ, 2006; Christ, Zopluoglu, et al., 2013). To better understand the influence of SEM on ES outcomes, we also selected residual values that represented ideal or “very good” quality data sets (SEM = 5), as well as suboptimal or “poor” quality data sets (SEM = 15; Table 1).
Data Generation
True scores were generated for 729 unique conditions (3 [Baseline level] × 3 [Treatment effect level] × 3 [Baseline slope] × 3 [Treatment effect slope] × 3 [Baseline sessions] × 3 [Intervention sessions]). Observed scores were created in two steps (example syntax is available from the first author upon request). First, a random error term with an M = 0, SD = 5, 10, or 15 WRCM, and first-order autoregressive structure with
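A minimal sketch of this kind of data generation is given below. It is not the authors' syntax. The function names are hypothetical, the baseline level, slopes, and treatment effects are made-up illustrative values, and the parameterization of the autoregressive error (innovations scaled so that the marginal SD of the error equals the SEM) is an assumption. The sketch follows the description in the Overview, in which observed scores are formed by adding an autocorrelated error term to each true score.

```python
import random


def true_scores(base_level, base_slope, effect_level, effect_slope,
                n_base, n_int, sessions_per_week=5):
    """Error-free WRCM series for one AB condition; slopes are in WRCM per week."""
    scores = []
    for session in range(n_base + n_int):
        weeks = session / sessions_per_week
        value = base_level + base_slope * weeks
        if session >= n_base:
            # Assumption: the treatment adds an immediate level change plus a slope change.
            weeks_in_intervention = (session - n_base) / sessions_per_week
            value += effect_level + effect_slope * weeks_in_intervention
        scores.append(value)
    return scores


def ar1_errors(n, sem, rho, rng):
    """First-order autoregressive errors with mean 0 and marginal SD equal to sem."""
    innovation_sd = sem * (1 - rho ** 2) ** 0.5  # keeps the marginal SD at sem
    errors = [rng.gauss(0, sem)]
    for _ in range(n - 1):
        errors.append(rho * errors[-1] + rng.gauss(0, innovation_sd))
    return errors


def observed_scores(true, sem, rho, rng):
    """Observed score = true score + autocorrelated measurement error."""
    errors = ar1_errors(len(true), sem, rho, rng)
    return [t + e for t, e in zip(true, errors)]


rng = random.Random(2024)  # fixed seed so the sketch is reproducible
# 11 baseline and 14 intervention sessions and rho = .19 are the averages reported
# in the text; the level, slope, and effect values are hypothetical.
true = true_scores(base_level=50, base_slope=1.0, effect_level=10,
                   effect_slope=2.0, n_base=11, n_int=14)
observed = observed_scores(true, sem=10, rho=0.19, rng=rng)
```

Repeating the last line many times for the same `true` series would yield the replications within one condition, from which ESs based on observed scores can be compared with the single ES based on true scores.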
Outcome Measures
PEBT and TauU ESs were calculated for each case with true and observed scores using procedures described in the introduction. To assess the consistency of ES results, we calculated the SD of ES outcomes calculated with observed WRCM scores for each unique combination of data conditions. Because the true ES for any given combination of conditions was fixed, variability in outcomes would suggest inconsistency in ES results. To evaluate the accuracy of each ES across data collection conditions, we calculated RMSE values between observed and corresponding true cases using the formula below:
where yi was the ES based upon true scores for a given set of conditions and
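The two outcome measures described in this section can be sketched as follows. The formula itself is not reproduced in the text, so the sketch assumes the standard definition of RMSE (the square root of the mean squared difference between each observed-score ES and the true-score ES) and assumes the sample SD for consistency. The function names and the example values are hypothetical.

```python
import math
from statistics import stdev


def consistency(observed_es):
    """SD of ESs computed from observed scores within one condition.

    Assumption: sample SD (n - 1 denominator).
    """
    return stdev(observed_es)


def rmse(observed_es, true_es):
    """Root mean square error between observed-score ESs and the fixed true-score ES."""
    squared_errors = [(es - true_es) ** 2 for es in observed_es]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical ES values from four replications of one condition.
observed_es = [0.55, 0.70, 0.40, 0.65]
true_es = 0.60
condition_sd = consistency(observed_es)
condition_rmse = rmse(observed_es, true_es)
```

Because the true ES is fixed within a condition, the SD captures only the spread of observed outcomes, whereas RMSE also reflects any systematic departure of the observed outcomes from the true value.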
Analyses
To determine the extent to which SEM influenced the consistency and accuracy of ES outcomes, we conducted descriptive and inferential analyses. First, we calculated mean, SD, skew, and kurtosis estimates for the SD of observed ESs as well as RMSE for each level of SEM. Next, we used logistic quantile regression to estimate the impact of each independent variable on the consistency and accuracy of outcomes for each ES.
Logistic quantile regression is an extension of quantile regression that allows for appropriate modeling of bounded continuous outcomes (Bottai, Cai, & McKeown, 2010), such as those calculated from the ESs evaluated in this study. Continuous outcomes bounded within a fixed interval often resemble probabilities (Bottai et al., 2010). Therefore, outcomes can be converted to a logit and analyses can take place using procedures (e.g., logistic regression) that require fewer assumptions than those associated with multiple regression. Instead of calculating odds from probabilities, as is typical for binary outcomes in logistic regression, the outcome (y) can be converted to a logit using the formula:
The outcome (SD or RMSE values), now in logits, is analyzed using typical quantile regression procedures. By using quantile regression, separate equations can be estimated for any conditional quantiles observed in the outcome of interest. Therefore, predictions that are outside the bounds of possible values do not occur. The resulting coefficients are interpreted in the same manner as traditional logistic regression. In addition, by estimating separate regression equations for each quantile, the degree to which the relation between the predictors and outcome differs across levels of the outcome (e.g., values near the median versus in the tails) can be investigated. For all inferential analyses, the numbers of baseline and intervention sessions were centered at their minimum values.
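The logit conversion for a bounded outcome can be sketched as below. The sketch assumes the transformation commonly used in logistic quantile regression, log((y − lower) / (upper − y)), where lower and upper are the bounds of the outcome. The function names, the bounds of 0 and 1, and the small offset `eps` that keeps values on a boundary finite are all assumptions for illustration, not details taken from the study.

```python
import math


def bounded_logit(y, lower, upper, eps=1e-6):
    """Map a bounded outcome y in [lower, upper] onto the real line."""
    # Assumption: values on a boundary are nudged inward so the log is defined.
    y = min(max(y, lower + eps), upper - eps)
    return math.log((y - lower) / (upper - y))


def inverse_bounded_logit(z, lower, upper):
    """Map a logit back to the original bounded scale."""
    return (lower + upper * math.exp(z)) / (1 + math.exp(z))


# Hypothetical RMSE value on an outcome assumed to be bounded by 0 and 1.
z = bounded_logit(0.20, lower=0.0, upper=1.0)
back = inverse_bounded_logit(z, lower=0.0, upper=1.0)
```

Because the inverse transformation always returns a value between the bounds, any quantile predicted on the logit scale maps back to a permissible outcome, which is the property the paragraph above refers to.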
Results
Reliability of Data Extraction
The first and third authors extracted WRCM scores from each graph independently. The range of differences in rounded WRCM scores was −2 to 2. More than 97% of rounded scores were exactly the same between authors.
Verification of ES Calculation
To verify that the code for ES calculation was accurate, the first author calculated TauU using an online resource (www.singlecaseresearch.org) for a random sample of 20 cases with true scores and 20 cases with observed scores. Estimates from the code written for this study were identical to those from the website to four decimal places (the maximum number reported on the website). The first author also calculated PEBT manually using Microsoft Excel for the same 40 cases. Results based upon manually counting observations in Excel were identical to results from the code written for the present study.
Descriptive Results
Initial descriptive statistics suggest that across all conditions, the consistency of outcomes from each ES decreased as SEM increased (Table 3). For instance, the average SD of observed ES outcomes for PEBT was .06 (SD = .10) when SEM was equal to 5 WRCM and .16 (SD = .13) when SEM was equal to 15 WRCM. The average SD of observed ES outcomes for TauU was .12 (SD = .08) when SEM was equal to 5 and .18 (SD = .10) when SEM was equal to 15.
Table 3. Descriptive Statistics for Consistency and Accuracy of Effect Size Outcomes.
Note. SEM = standard error of measurement; consistency = SD of effect sizes based upon observed words read correct per minute scores; accuracy = root mean square error values between effect size outcomes based upon true and observed words read correct per minute scores.
The accuracy of ES outcomes decreased as SEM increased (Table 3). The average RMSE value when SEM was equal to 5 WRCM for PEBT was .07 (SD = .13) compared to .21 (SD = .18) when SEM was equal to 15. The same pattern was observed for TauU. When SEM was equal to 5 WRCM, the average RMSE was equal to .32 (SD = .22) and when SEM was equal to 15 WRCM, RMSE was equal to .47 (SD = .28).
Inferential Results
Descriptive statistics are helpful to gain a sense of the general pattern of results. However, it is unclear whether these differences are statistically significant or whether they would persist after accounting for different data characteristics (e.g., baseline slope). Given the sheer number of combinations of independent variables in the present study, inferential analyses are a more efficient method to determine the degree to which SEM influenced the consistency and accuracy of outcomes while statistically controlling for other data characteristics.
Consistency
Table 4 presents the results for consistency for each ES. Given the sheer number of models estimated, we elected to use a conservative p value (p < .001) to gauge statistical significance. Considering PEBT, the number of baseline sessions and treatment effect on trend were consistently the largest negative predictors of consistency. Although the influence of baseline sessions remained relatively stable across quantile levels (e.g.,
Table 4. Results of Quantile Logit Regression for SD of Observed Effect Sizes (Consistency).
Note. ° Coefficient not statistically significant at the p < .001 level. All other coefficients statistically significant at the p < .001 level. BL = baseline; T = treatment; Auto = autocorrelation; SEM = standard error of measurement.
Consistency results for TauU are also presented in Table 4. Unlike PEBT, baseline sessions had an inconsistent relationship with SDs across quantiles. At the .50 quantile, the coefficient was not statistically significant at the p < .001 level, and at .05 and .95 quantiles
Accuracy
Results for accuracy analyses are presented in Table 5. Controlling for other predictors, baseline sessions and treatment effect on trend were the strongest negative predictors of RMSE values across quantiles for PEBT. The influence of baseline sessions was relatively stable across quantiles, (median
Table 5. Results of Quantile Logit Regression for Root Mean Square Error Values (Accuracy).
Note. ° Coefficient not statistically significant at the p < .001 level. All other coefficients statistically significant at the p < .001 level. BL = baseline, T = treatment, Auto = autocorrelation, SEM = standard error of measurement.
Similar to PEBT, baseline level was not a statistically or practically significant predictor of RMSE values across quantiles for TauU. In contrast to PEBT, treatment effect on slope was not consistently negatively related to RMSE values. At the .05 and .25 quantiles, coefficients were positive and not statistically significant. However, coefficients were negative and statistically significant at the .50, .75, and .95 quantiles (
Discussion
The purpose of this study was to evaluate the degree to which measurement error influenced the consistency and accuracy of outcomes from two common overlap-based SCD ESs. Despite the proliferation of studies demonstrating the use of SCD ESs, few, if any, have evaluated the degree to which outcomes from those ESs are influenced by the measurement error of skills being assessed, including academic skills. The results of this study suggest that even after controlling for various data characteristics, measurement error accounts for a statistically significant amount of variance in the consistency and accuracy of outcomes. We briefly review the results of this study in the next section. We also discuss limitations of the present study and ideas for future research.
Prior to discussing the results, we would like to reiterate that we selected ESs that practitioners and applied researchers use frequently or have the capability to use without sophisticated software. Nevertheless, the influence of autocorrelation should not be ignored in the present results. Various ESs have been developed to account for autocorrelation, but few, if any, can be feasibly implemented without a high level of familiarity with statistical software or access to unpublished technical reports. Researchers should continue to explore ways to control autocorrelation while computing SCD ESs, but do so with the end goal of developing an easy-to-use interface for SCD users (Shadish, 2014).
Consistency
In this study, consistency quantifies the magnitude of random errors associated with ES outcomes for a given set of conditions. SDs greater than 0 are indicative of unreliability. As one may expect, conditions with larger magnitudes of SEM tended to have less consistent ES outcomes (larger SDs). Inferential analyses indicated that a statistically significant proportion of the variability in ES outcomes could be explained by SEM magnitude, even after controlling for things like the number of observations collected during baseline and intervention phases.
Reliability, like consistency, is important to ascertain so that we can be confident that if we were to collect data in highly similar conditions using highly similar instruments, we would arrive at the same conclusion regarding the effects of an intervention. One can interpret SDs as a proxy for confidence intervals around observed ESs. An SD of .10 in this study suggests that for any given set of data collection conditions, one could reasonably assume (using a 68% CI) an ES outcome would differ by ±.10. If, for instance, we calculated TauU, and the resulting ES was .70, and we were to calculate the ES again collecting data with highly similar procedures, we may obtain an ES as low as .60 or as high as .80 using a 68% CI. Positive coefficients in Table 4 suggest that increases in that factor (e.g., increasing SEM) decrease the reliability of outcomes. Conversely, negative coefficients suggest that increasing that factor (e.g., baseline observations) increases the reliability of outcomes. For PEBT, reliability can be improved by collecting more baseline observations and controlling measurement error. Collecting more observations during the intervention phase and controlling measurement error increased the reliability of outcomes for TauU.
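The interval reading of the SD described above amounts to simple arithmetic, sketched here in Python. The function name `es_interval` and the `multiplier` parameter are hypothetical. The sketch assumes that ES outcomes are roughly normally distributed around the observed value, so that one SD on either side covers about 68% of outcomes, and it ignores the fact that both ESs are bounded.

```python
def es_interval(observed_es, sd, multiplier=1.0):
    """Return (lower, upper) bounds around an observed effect size.

    multiplier=1.0 corresponds to roughly a 68% interval under an
    approximate-normality assumption; 1.96 would give roughly 95%.
    """
    half_width = multiplier * sd
    return observed_es - half_width, observed_es + half_width


# The worked example from the text: TauU of .70 with an SD of .10.
lower, upper = es_interval(0.70, 0.10)
print(f"68% interval: {lower:.2f} to {upper:.2f}")
```

Under the same assumption, a wider interval (for example, a multiplier of 1.96) would make the uncertainty around a single observed ES look considerably larger.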
The difference in the importance of baseline observations between PEBT and TauU is likely related to differences in the manner in which each ES accounts for baseline trend. Computation of PEBT involves calculating and projecting a monotonic linear trend line based upon baseline data into the intervention phase. The standard error of the slope of OLS trend lines from CBM-R data is largely influenced by the duration of data collection and the number of observations collected (Christ, 2006; Christ, Zopluoglu, et al., 2013). For PEBT, projected trend lines based upon fewer observations are likely to contribute to the inconsistency of observations that exceed that line. Conversely, computation of TauU places no assumptions on the functional form of baseline trend and does not require the computation of a potentially unstable trend line.
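To make the dependence of PEBT on the baseline trend line concrete, the following Python sketch fits a line to the baseline phase, projects it across the intervention phase, and reports the proportion of intervention observations that fall above it. Several details are assumptions made for illustration: the use of an ordinary least squares line, equally spaced sessions numbered from 1, a strict "greater than" rule for exceeding the line, and the function names and example scores. It is not the code written for the study.

```python
def ols_line(x, y):
    """Return (slope, intercept) of the ordinary least squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def pebt(baseline, intervention):
    """Proportion of intervention scores above the projected baseline trend.

    Requires at least two baseline observations so a line can be fitted.
    """
    sessions = list(range(1, len(baseline) + 1))
    slope, intercept = ols_line(sessions, baseline)
    first_intervention_session = len(baseline) + 1
    exceeding = 0
    for offset, score in enumerate(intervention):
        projected = intercept + slope * (first_intervention_session + offset)
        if score > projected:
            exceeding += 1
    return exceeding / len(intervention)


# Hypothetical WRCM scores, not data from the study.
print(pebt([40, 42, 44], [50, 47, 55, 60]))
```

With only three baseline points, changing a single baseline score by a few WRCM noticeably shifts the projected line, which is the instability the paragraph above describes.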
The finding that larger treatment effects on trend resulted in improvement in the reliability of outcomes for both ESs suggests that interventions that produce more rapid improvement will yield more consistent outcomes. Given that researchers often advocate for the use of SCD ESs to detect treatment effects that are statistically but not clinically significant (Moeyaert et al., 2014), it is important to note that such outcomes are likely less reliable. Effects that do not appear clinically significant, perhaps based upon visual analysis, should be interpreted with an additional level of caution.
Accuracy
As with results for consistency, as measurement error increased, the accuracy of outcomes decreased (i.e., RMSE values increased). RMSE represented the average difference between observed ESs and true ESs for a given set of unique conditions. A useful context to interpret RMSE values is the signal-to-noise ratio (Cronbach & Gleser, 1964). RMSE represents the typical amount of noise, or error, obscuring a true score, or signal. A ratio of 2:1 suggests that twice as much signal is present relative to error. Recommendations regarding the necessary ratio of signal to noise vary across fields. Staying with the 2:1 example, such a criterion requires that a score be twice as large as the corresponding RMSE value to be certain that behavior change is not purely an artifact of measurement error. To illustrate, consider an RMSE of .20. This would suggest that an ES would need to be at least .40 before we could detect it with a high degree of accuracy. For the present study, higher RMSE values mean that larger ESs are necessary to detect meaningful change.
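The 2:1 illustration can be expressed as a one-line rule, sketched below in Python and applied to the average RMSE values reported in the descriptive results. The function name `minimum_detectable_es` is hypothetical, and the 2:1 ratio is only the example criterion used in the text, not a recommended threshold.

```python
def minimum_detectable_es(rmse, signal_to_noise=2.0):
    """Smallest ES that meets the chosen signal-to-noise criterion."""
    return signal_to_noise * rmse


# Average RMSE values from the descriptive results, by ES and SEM (in WRCM).
average_rmse = {
    ("PEBT", 5): 0.07,
    ("PEBT", 15): 0.21,
    ("TauU", 5): 0.32,
    ("TauU", 15): 0.47,
}

for (effect_size, sem), rmse in average_rmse.items():
    threshold = minimum_detectable_es(rmse)
    print(f"{effect_size}, SEM = {sem}: ES of at least {threshold:.2f}")
```

Read this way, the same 2:1 criterion implies a much higher threshold for TauU than for PEBT at a given SEM, although these figures rest on condition-wide averages and should be treated as rough guides.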
Inferential results suggest that after statistically controlling for other data characteristics, modeling SEM led to a statistically significant increase in explaining the variability of the accuracy of ES outcomes. As with consistency, inferential results suggest that certain data collection practices promote accurate computation of treatment effects. Namely, for PEBT, collecting more baseline observations and controlling measurement error increased the accuracy of ES outcomes. For TauU, increasing the number of intervention observations and controlling measurement error also increased the accuracy of outcomes. Similar to the findings from the consistency of outcomes, increases in the treatment effect on trend improved the accuracy of outcomes. Thus, it seems that caution is again warranted when claims regarding the statistical significance of interventions are made based upon ES outcomes without clear clinical significance. Such findings support the use of statistical analyses as a supplement to visual analysis (e.g., Horner et al., 2012). In fact, What Works Clearinghouse standards (Kratochwill et al., 2010) indicate that the quantification of ESs should only occur when visual analysis indicates that a functional relation between the IV and DV exists.
Controlling Measurement Error
A consistent recommendation to improve the consistency and accuracy of ES outcomes thus far has been to control measurement error. Common sources of unwanted variability in CBM-R performance were reviewed at the outset of the article. From that information, one can conclude that when using WRCM as an outcome, researchers and practitioners would be advised to use passage sets constructed by vendors versus randomly sampling grade level text from books, collect and score data using consistent procedures, and collect data with minimal levels of distraction. Although such recommendations are somewhat commonplace when considering precision of growth estimates from CBM-R data (e.g., Christ, White, et al., 2013), the present investigation suggests that measurement error also influences interpretations of CBM-R data collected within more traditional SCDs.
Limitations
Several potential limitations are worth noting in the present investigation. As with any simulation study, the relevance of findings to applied practice depends on the degree to which parameters reflect typical conditions. To that end, we attempted to base our simulations on conditions observed in practice by conducting an extensive literature review and series of descriptive and inferential analyses. Nevertheless, we only included published studies in said analyses. We did, however, use inclusion criteria (Parker et al., 2005) that were less stringent than current standards (Kratochwill et al., 2010) to capture more studies that may reflect data collected in practice. Similarly, a number of factors unique to each study may have influenced results, and we could neither control them nor account for them in our simulation. Factors such as the type of probe used to measure WRCM, the intensity and frequency of the intervention, settings in which data were collected, who collected the data, and other idiosyncratic differences between students and examiners may account for differences in WRCM performance between studies and consequently influence ES outcomes. Last, the criterion used in this study to assess the accuracy of outcomes (i.e., a “true” ES) carries with it a high degree of internal validity at the expense of external validity. In practice, a true ES is not known. Studies that investigate the influence of measurement error on the accuracy of ES outcomes based upon other socially relevant criteria seem worthwhile.
Future Directions
The results of this study speak to the performance of SCD ESs used with oral reading rate data. Researchers and practitioners use SCDs to evaluate interventions targeting a host of other academic skills. Future studies should explore the consistency and accuracy of SCD ESs as a function of measurement error when assessing improvement in other academic domains. Similarly, the methodology employed in this study could be used to evaluate the influence of measurement error on other types of SCD ESs. For instance, it is unclear whether measurement error influences regression-based or standardized mean difference ESs to the same degree as the overlap-based metrics used in this study. Such ESs may not be as influenced by measurement error. Advanced approaches such as generalized least squares (Maggin et al., 2011) appear especially promising and seem worthy of investigation. In addition, we evaluated the performance of ESs to measure outcomes from individual participants. SCD ESs, particularly metrics based upon multilevel modeling (e.g., Moeyaert et al., 2014), are used to compare outcomes between subjects and studies. The degree to which measurement error influences those types of outcomes is not known.
Conclusion
Academic journals and grant agencies have expressed a keen interest in the development and refinement of SCD ESs. The results of this article suggest careful consideration is warranted when using such ESs to measure academic outcomes. One could argue that the ESs and issues explored in this article are only of interest to researchers. However, practitioners are likely to compare ESs prior to selecting an intervention. If practitioners select an intervention assuming they will observe a similar ES reported in a study or meta-analysis and instead experience markedly different results, they may be less likely to turn to the research literature for intervention recommendations in the future, which only serves to increase the research-to-practice gap (Carnine, 1995). If measurement error is a factor that influences the consistency and accuracy of ES outcomes, then researchers need to take steps to account for it when reporting intervention effects from SCDs. In addition, studies that build upon the results reported here to develop guidelines to promote reliable and accurate interpretations for practitioners are needed. We would like to be clear that we are not discouraging the use of SCDs in academic intervention research or as a means to identify effective interventions in schools. Rather, we hope that this study, as well as other empirical work, will serve as a catalyst to develop more empirically validated practices to measure academic outcomes within SCD frameworks.
Appendix
Participant Demographic Information, Characteristics of the Dependent Variable, Intervention Format, and Analysis Method for Each of the Newly Identified Studies in Klingbeil, Van Norman, McLendon, Ross, and Begeny (2017). Information for the Other 23 Studies Can Be Found in Ross and Begeny (2014).
| Article | Grade or age of participants | Behavioral and/or academic difficulties in addition to difficulty with grade-level reading | Reading level of passages used for dependent variable | Intervention format | Analysis method |
|---|---|---|---|---|---|
| Albers and Hoffman (2012) | Third-grade students | Participating students were English Language Learners who read below grade level | Third-grade passages | One-to-one | Visual inspection; Pre- and postintervention accuracy percentages; percentage of nonoverlapping data; mean scores for both phases; ordinary least squares (OLS) linear regression for each phase |
| Hua et al. (2012) | Students were enrolled in a postsecondary education program | Students had reading difficulties and were diagnosed with autism and an intellectual disability | Instructional-level passages (3rd and 6th grades) | One-to-one | Visual inspection; Pre- and postintervention accuracy percentages; percentage of nonoverlapping data; mean scores for both phases; ordinary least squares (OLS) linear regression for each phase |
| Lo, Cooke, and Starling (2011) | Second-grade students | Students mastered basic decoding skills and read below benchmark level | First- and second-grade passages | One-to-one | Visual inspection; trend analysis slope values for both phases; mean scores for both phases |
| Musti-Rao, Lo, and Plati (2015) | First-grade students | Students were identified as being at-risk or having some risk for reading failure | Sight words from Fry’s word list | iPad application | Visual inspection; conservative dual-criterion method for determining intervention effect; mean scores for both phases; percentage of nonoverlapping data (PND) |
| Vasquez III and Slocum (2012) | Fourth-grade students | Students scored below the 20th percentile on a reading achievement test and needed supplemental instruction | DIBELS ORF passages | Reading intervention had multiple components. At least one component included one-to-one online tutoring | Visual inspection; mean scores for both phases; slope values for both phases; percentage of nonoverlapping data (PND) |
| Walcott, Marett, and Hessel (2014) | First- and second-grade students | Students exhibited attention and reading problems | Grade-level passages | Computer-aided instruction | Visual inspection; mean scores for both phases; slope values for both phases; percentage of nonoverlapping data (PND); effect size using the nonoverlap of all pairs (NAP) index |
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
