Abstract
We conducted this systematic review to map the literature and classify the evidence-based status of teacher-directed strategies to increase students’ opportunities to respond (OTR) during whole-group instruction across the K-12 continuum. Specifically, we conducted this review to determine whether OTR could be classified as an evidence-based practice according to the Council for Exceptional Children’s Standards for Evidence-Based Practices in Special Education. We examined the extent to which 21 included studies addressed quality indicators and evidence-based practice standards using a modified, weighted criterion for methodologically sound studies. Three studies met all eight quality indicators, and 11 studies met or exceeded 80% of quality indicators following a weighted criterion to define methodologically sound studies. Results indicated the teacher-directed OTR strategy of response cards in K-12 school settings to be a potentially evidence-based practice. Educational implications, limitations, and future directions are discussed.
Research has demonstrated academic engagement to be a critical predictor of students’ school achievement (Brophy & Good, 1986). In large-group situations, teachers’ implementation of instructional strategies is an important determinant of engagement for students who engage in challenging behavior (Downer, Rimm-Kaufman, & Pianta, 2007). Students who are not academically engaged may become passive learners, give up easily on tasks, and become anxious, withdrawn, or angry about school—leading to unsuccessful school experiences (Montague & Bergeron, 1997). As such, it is important for teachers to use high-leverage practices to promote active student engagement to facilitate success (McLeskey et al., 2017). Collectively, these practices promote safe, positive learning environments that foster academic engagement and decrease disruption. One practice shown to be effective for students who persistently engage in behavior challenges (e.g., students with emotional or behavior disorders [EBD]) is increasing students’ opportunities to respond (OTR; Adamson & Lewis, 2017).
Increasing Students’ Opportunities to Respond
Use of OTR includes procedures for (a) presenting materials, (b) asking students questions at a high rate, (c) promoting rapid student response through various modalities (e.g., verbal, gestural, textual), and (d) providing immediate feedback. OTR can be teacher-mediated (e.g., choral responding), technology-mediated (e.g., gaming), or peer-mediated (e.g., peer-tutoring). Ideally, teachers present students with multiple and varied OTR during a lesson at a brisk pace, but not so rapid that students are unable to participate (Sutherland & Wehby, 2001). By making simple shifts during instructional activities, teachers can promote and support the engagement of multiple students. In addition to being associated with higher rates of on-task behavior and lower rates of disruption for students with EBD (Sutherland & Wehby, 2001), OTR strategy can promote fluency and automaticity in basic skills of any content area and be used to formatively assess students’ proficiency with material (Lane, Menzies, Ennis, & Oakes, 2015).
Teacher-delivered OTR strategy comprises three main elements: (a) identifying the content or skills to be targeted, (b) preparing an extensive set of questions or prompts that offer students practice with the material, and (c) leading the session with a high rate of questioning, rapid student responding, and immediate teacher feedback (Lane et al., 2015). A variety of student response formats can be utilized, including verbal (e.g., choral responding), physical (e.g., thumbs up or down, response cards), and electronic (e.g., clickers).
In 1987, the Council for Exceptional Children (CEC) recommended that OTR occur (a) four to six times per min for new material, with students responding with 80% accuracy, and (b) eight to 12 times per min for review material, with students responding with 90% accuracy. Stichter and colleagues (2009) suggested an optimal rate of 3.5 OTR per min. This rate is supported by results suggesting slight differences in students’ on-task behavior at three and five OTR per min (Sainato, Strain, & Lyon, 1987). Naturally occurring rates of OTR tend to fall below recommended levels, with a reported average of 2.61 per min (SD = 0.66; Stichter et al., 2009).
Establishing an Evidence Base
Evidence-based practices (EBP) can refer to a process or an instructional technique (Cook, Cook, & Collins, 2016). For instance, the process of EBP considers instructional decision-making based on the best available evidence, professional judgment, and preferences and needs of students (Spencer, Detrich, & Slocum, 2012), whereas EBPs are strategies, practices, or programs that (a) are supported by a body of high-quality, peer-reviewed, experimental research and (b) have undergone a systematic evidence-based review and been classified as evidence based (Cook et al., 2016). Given mandates, such as the Every Student Succeeds Act (2015), charging schools to provide high-quality instruction for all students, critical instructional techniques are appraised both for methodological quality and for magnitude of effect to identify EBPs.
Quality Appraisals
In 2005, Horner et al. and Gersten et al. introduced standards for identifying EBPs in special education using single-case research design (SCRD) and group-comparison designs, respectively. Lane, Kalberg, and Shepcaro (2009) field-tested SCRD standards (Horner et al., 2005). Following this initial application, Lane et al. suggested initial standards to classify EBPs may be too conservative a standard in determining “what works.” Lane et al. raised concerns that overly rigorous criteria may lead to the unintended consequence of having too few EBPs for use. As a result, they recommended using an 80% criterion for identifying studies as methodologically rigorous (and, therefore, eligible to be considered when classifying the evidence base of instructional techniques), rather than a 100% criterion across all quality indicators (QIs). In 2014, CEC proposed new Standards for Evidence-Based Practices in Special Education (hereafter referred to as Standards for EBPs), which also required that studies meet 100% of QIs to be considered methodologically sound and included in EBP reviews. Recently, reviews have begun to apply Lane et al.’s weighted criterion to the Standards for EBPs (Common, Lane, Pustejovsky, Johnson, & Johl, 2017; Ennis, Royer, Lane, & Griffith, 2017).
Quantifying the Evidence Base
The emergence of EBP as a priority in education, both as a process and in the identification of instructional techniques, has placed increased emphasis on not only the methodological rigor (e.g., QIs) but also the magnitude of the effect across rigorous (e.g., methodologically sound; CEC, 2014) studies. SCRD has historically emphasized visual analysis to assess and report the effects of treatments, and many scholars have been skeptical of whether syntheses employing statistical analyses can capture nuances of SCRD (Shadish, Hedges, Horner, & Odom, 2015). The extent to which SCRD effect sizes (e.g., between-case standardized mean difference [BC-SMD]) and other quantitative indices (e.g., percentage of nonoverlapping data [PND]) adequately estimate the direction and magnitude of functional relations remains a subject of continued interest and debate (Ledford & Gast, 2018).
Effect sizes
Even in non-meta-analytic reviews (e.g., EBP reviews), effect sizes are useful in comparing results across rigorous studies that could otherwise not easily be compared (CEC, 2014; Shadish et al., 2015). Effect sizes can be calculated within and across studies and be standardized or nonstandardized. Standardized effect sizes put study results on a scale with the same meaning across studies (e.g., standardized mean difference, risk ratios, odds ratios; Shadish et al., 2015) and are particularly important when examining a body of evidence comprising a range of methodologies (e.g., SCRD, group-comparison designs).
Ideally, effect sizes are metrics that can be validly compared across studies using various designs (Pustejovsky, 2018). Hedges, Pustejovsky, and Shadish (2012, 2013) introduced BC-SMD, an effect size for SCRD directly comparable to standardized mean difference effect sizes used in group-comparison designs. BC-SMD is based on a hierarchical model for the within-case and between-case variation in the dependent variable (DV) captured in SCRD employing withdrawal/reversal (ABk) design, multiple-probe design, and multiple-baseline design (MBD) with three or more cases (Shadish et al., 2015). Although comparable, metrics from SCRD are relatively new and tend to be larger than those from group designs and should be interpreted with caution (Barton, Pustejovsky, Maggin, & Reichow, 2017).
More recently, Pustejovsky (2018) introduced the log response ratio (LRR), another effect size for SCRD, that is not constrained by number of cases and is less constrained by design. The LRR is a within-case effect size (ESWC) particularly well suited for single-case demonstration designs, with behavioral outcomes measured through systematic direct observation (Pustejovsky, 2015, 2018). Although there may not be a consensus on whether and which metrics should be used to discern magnitude of effect in SCRD, there is growing consensus that, when such metrics are used, scholars must justify their selection of effect sizes or quantitative indices in the context of the set of studies to be synthesized (Maggin, Lane, & Pustejovsky, 2017).
Opportunities to Respond: Lessons Learned
Reviews examining specific OTR strategies include examination of response cards (Horn, 2010; Randolph, 2007; Schnorr, Freeman-Green, & Test, 2015) and choral responding (Haydon, Marsicano, & Scott, 2013). Randolph (2007) meta-analyzed studies examining response cards and found statistically significant effect sizes for achievement (d = 1.08), as well as substantial increases in student participation (47.70%) and decreases in off-task behavior (34.34%). Horn (2010) extended this review of response cards to students with disabilities and offered initial evidence for considering response cards as an EBP using Horner et al.’s (2005) guidelines. Although Horn provided descriptive information and concluded guidelines for EBP were met, a methodological quality appraisal of included studies was not reported.
Haydon et al. (2013) conducted a review of the literature comparing choral and individual responding. Findings suggested choral responding resulted in higher levels of active student responding and on-task, appropriate behavior, as well as decreases in students’ disruptive and inappropriate behaviors. More recently, Schnorr et al. (2015) offered the first methodological appraisal of an OTR strategy and examined response cards in elementary settings. Results indicated sufficient support for response cards as an EBP with a moderate level of evidence for increasing OTR for elementary students. Yet, like previous reviews, their review focused on a specific OTR strategy and not the full range of methods possible for student responding during whole-group OTR (e.g., choral responding, clickers).
MacSuga-Gage and Simonsen (2015) examined varying modalities of teacher-delivered OTR, with results indicating choral responding resulted in positive academic and behavioral outcomes across students when compared with individual responding. They found no studies conclusively examined differential effects of OTR rates nor identified the optimal rate of teacher-delivered OTR. All studies exploring the impact of increased rates of OTR demonstrated positive outcomes for students with and without disabilities, including increased correct responses, student participation, and on-task behavior and decreased off-task and disruptive behavior. Yet, MacSuga-Gage and Simonsen’s study did not evaluate the methodological rigor necessary for classifying the evidence base of teacher-delivered OTR.
Purpose
We conducted the current EBP review to examine the effectiveness of teacher-delivered OTR strategies during whole-group instruction across the K-12 continuum. Specifically, we (a) mapped descriptive characteristics of included studies, (b) appraised the methodological rigor of included studies, (c) determined the evidence-based classification of OTR strategy, and (d) described the magnitude of effects of OTR across methodologically sound studies.
Method
Article Selection Procedures
Article procurement was conducted independently by two or more authors at each step and included electronic, hand, and ancestral searches of the literature, initially conducted in Spring 2016 and again in Winter 2017, with searches concluding in December 2017. The electronic search included four databases: ERIC, ProQuest Research Libraries, PsycArticles, and PsycINFO. The following search string was used to identify potential records: all(“Choral Respon*”) OR all(“signal* system*”) OR all(“individual white board”) OR all(“student response system*”) OR all(“clicker*”) OR all(“communication cups*”) OR all(“response card*”) OR all(“Opport* to respond*”) OR all(“active student respond*”), NOT all(“higher education” OR “medical students” or “college students” or “adult education” or “distance learning” or “community college” or “college” OR “undergraduate”).
Ancestral searches occurred for all included articles, as well as for other literature reviews examining OTR (Haydon et al., 2013; Horn, 2010; MacSuga-Gage & Simonsen, 2015; Randolph, 2007; Schnorr et al., 2015). Hand searches were conducted for journals with two or more included studies (Behavioral Disorders, Education and Treatment of Children, Journal of Applied Behavior Analysis, Journal of Positive Behavior Interventions, and Preventing School Failure) from 1979 to 2017 (including online first), beginning the search from the first published study (McKenzie & Henry, 1979). Primary and secondary coders independently read titles and abstracts of each article to determine whether the full article should be read to further evaluate its eligibility. When a disagreement occurred between coders, the article was read in full and a consensus model was used until agreement was achieved. See Figure 1 for the identification and inclusion process.

Figure 1. Article procurement flow diagram.
Inclusion Criteria
We used a binary coding scheme of met/not met to determine whether studies met the inclusion criteria. First, all studies had to be conducted using group comparison or SCRD (CEC, 2014). Second, studies needed to include a teacher-delivered method of increasing students’ OTR (e.g., choral responding, signals such as thumbs up/down, communication or signaling cups, response cards, student response system, clickers) as the independent variable (IV; MacSuga-Gage & Simonsen, 2015). As such, interventions targeting peer-mediated strategies (e.g., classroom-wide peer-tutoring, Greenwood, Delquadri, & Hall, 1989; numbered heads together, Maheady, Mallette, Harper, & Sacca, 1991) were not included in this review. Third, the study’s intervention needed to be teacher directed during whole-group instruction toward K-12 children and youth. Interventions could take place in general or special education classrooms. Fourth, studies needed to include at least one student-level academic or behavior outcome DV. Finally, studies not written in English or not published in peer-reviewed journals were excluded.
Coding Procedures
Descriptive coding
To provide descriptive context, we mapped the literature by coding description of practice, context and settings, participants, intervention agent, implementation fidelity, internal validity, outcome measures/DV, and data analysis. Inter-rater agreement (IRA) was 94.89%.
Quality indicator coding
To appraise the methodological quality of included studies, two authors independently coded every article using the eight categories of QIs in the Standards for EBP (full descriptions to follow). Across these QIs, coding components included either the 22 items for SCRD studies or the 24 items for group-comparison studies (CEC, 2014). We used a coding protocol developed by Lane, Common, Royer, and Muller (2014). The first and second authors were trained to reliability at 85% or higher across three or more consecutive articles not included in this review. Average IRA across four training articles was 90.90% (SD = 6.43).
Given the methodological quality of a study exists on a continuum, ranging from no methodological rigor to strong methodological rigor, we followed recommendations by Lane et al. (2009) to report the degree to which each QI was met by using a weighted coding scheme. Rather than using an absolute coding scheme (QI met/QI not met), we allowed each component constituting an indicator that was present to contribute partially. We coded each component within an indicator as met (1), not met (0), or not applicable (NA). For each QI, the number of components met within each indicator (range: 1-6) was summed and divided by the total number of components scored. Components coded as not applicable were dropped from the denominator. Weighted scores could therefore take any value from 0 to 1 (rather than only 0 or 1). Disagreements were resolved through a consensus process. IRA across studies was 92.01% (SD = 0.09) and 93.38% (SD = 0.08) across components.
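The weighted scoring rule described above can be illustrated with a minimal Python sketch. The function name and the use of None to stand for a not-applicable component are assumptions made for illustration; they are not taken from the coding protocol itself.

```python
from typing import Optional, Sequence


def weighted_qi_score(components: Sequence[Optional[int]]) -> Optional[float]:
    """Weighted score for one quality indicator (hypothetical helper).

    Each component is 1 (met), 0 (not met), or None (not applicable).
    Not-applicable components are dropped from the denominator.
    """
    scored = [c for c in components if c is not None]
    if not scored:
        # Assumption: an indicator with no applicable components gets no score.
        return None
    return sum(scored) / len(scored)


# Example: an indicator with three components, one met, one not met,
# and one not applicable, receives a weighted score of 1/2.
print(weighted_qi_score([1, 0, None]))
```

Under an absolute scheme the same indicator would simply be coded as not met; the weighted scheme credits the component that was present.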
Methodological Quality Indicators
1.0 Context and setting
This indicator included one component. To meet 1.1 context/setting description, investigators needed to describe critical features of the context or setting relevant to the review (CEC, 2014). This component was considered met if at least one setting/context feature (e.g., region, type of school/classroom) was described (Lane et al., 2014).
2.0 Participants
This indicator included two components. To meet 2.1 participant description, investigators needed to describe participant demographics relevant to the review (CEC, 2014). This component was met if at least one demographic element (e.g., age, gender) was reported (Lane et al., 2014). To meet 2.2 participant disability/at-risk status, investigators needed to describe participants’ disability or risk status and method of determination (CEC, 2014; Lane et al., 2014). We did not require risk status to be reported when the whole class was the unit of analysis (Ennis et al., 2017). We considered the following as insufficient: (a) global definitions, such as behavioral disabilities, and (b) vague descriptions that were not described with replicable precision, such as teacher nomination (Lane et al., 2014). This component was considered nonapplicable for studies not including participants with disability/at-risk status.
3.0 Intervention agent
This indicator included two components. To meet 3.1 role description, investigators needed to describe intervention agent’s role (e.g., researcher, teacher; CEC, 2014). To meet 3.2 training description, investigators needed to report information on how intervention agent(s) received training and how investigators checked for understanding (e.g., trained to criterion, role-play). Furthermore, if the intervention agent was both a teacher and an author, author affiliation and/or authors’ notes were used to reasonably determine the extent to which the author was competent in OTR strategy (e.g., designed intervention as part of guided study, theses, or dissertation process).
4.0 Description of practice
This indicator included two components. To meet 4.1 intervention procedure description, investigators needed to provide details with replicable precision (CEC, 2014). For 4.2 materials description, investigators needed to include a description of materials needed to implement the intervention or offer accessible references providing this information (CEC, 2014). The second component was considered nonapplicable to studies not requiring materials (Cook et al., 2015).
5.0 Implementation fidelity
This indicator included three components. To meet 5.1 implementation fidelity, investigators needed to assess and report implementation fidelity using direct, reliable measures of adherence. To meet 5.2 dosage or exposure assessed/reported, investigators needed to assess and report implementation fidelity related to dosage or exposure to treatment conditions (CEC, 2014). This was considered met by reporting length of time of intervention or how long the intervention was in place (e.g., available from time-series line graph). Finally, to meet 5.3 assessed across relevant elements and/or throughout study, investigators needed to (a) assess and report implementation fidelity regularly and throughout the intervention (e.g., beginning, middle, and end), and (b) specify when, where, and for whom fidelity was assessed and report fidelity (Cook et al., 2015). This was considered present if any mention of assessing implementation fidelity occurred across different time points of the intervention. Studies did not have to report a measure of fidelity for each condition if an aggregated measure across conditions was reported. If neither adherence (5.1) nor dosage (5.2) was assessed, 5.3 was not applicable (CEC, 2014).
6.0 Internal validity
This indicator included six components, three shared by SCRD and group-comparison designs (6.1, 6.2, and 6.3), with three additional components specific to SCRD (6.5, 6.6, and 6.7) and three specific to group-comparison designs (6.4, 6.8, and 6.9). To meet 6.1 IV systematically manipulated, investigators were required to control and systematically manipulate the IV (CEC, 2014) and measure treatment fidelity of intervention (Lane et al., 2014). To meet 6.2 baseline description, investigators needed to describe baseline or control/comparison group conditions. To meet 6.3 no or limited access to IV during baseline, investigators needed to explicitly state or measure that nonintervention conditions did not have exposure to intervention (Lane et al., 2014). To meet 6.4 group assignment, investigators needed to describe assignment to group, which must have involved unit of analysis (e.g., participants, schools) being assigned randomly, nonrandomly and matched, or nonrandomly with meaningful differences identified and statistically controlled. To meet 6.5 three demonstrations of experimental effect, investigators must have employed a design that allowed for the possibility of three demonstrations or replications of an experimental effect at three different time points (CEC, 2014). To meet 6.6 baseline: minimum three data points and established pattern, investigators needed to include at least three baseline data points unless justified by the study author (CEC, 2014). This component was not applicable to SCRD not requiring baseline (e.g., alternating treatment designs [ATDs]), although if baseline was included, this component was assessed. To meet 6.7 controls for threats to internal validity, investigators must have employed an accepted SCRD (Ledford & Gast, 2018) with procedural integrity (Lane et al., 2014). To meet 6.8 overall attrition, overall attrition needed to be low across groups (e.g., <30% in a 1-year study; CEC, 2014). Finally, to meet 6.9 group attrition, differential attrition between groups needed to be low (e.g., ≤10%) or controlled for (CEC, 2014).
7.0 Outcome measures/dependent variables
This indicator included six components, of which the first five (7.1-7.5) applied to both SCRD and group-comparison designs, and one additional component was specific to group-comparison design. To meet 7.1 socially important, investigators needed to discuss (e.g., introduction or discussion) the social significance of the goals, social appropriateness of the procedures, and/or social importance of the effects and/or explicitly measure and report social validity (Lane et al., 2014). To meet 7.2 description of DV measures, investigators needed to define and describe each DV and use a valid measurement system (CEC, 2014). To meet 7.3 reports effects of the intervention on all measures, investigators needed to report the effects of the intervention across all outcome measures (CEC, 2014). To meet 7.4 measured repeatedly (minimum three data points per phase), investigators needed to measure outcomes with appropriate frequency and timing (e.g., minimum of three data points per phase [e.g., ABk, MBD, changing criterion design]; at least four repetitions of alternating sequence [e.g., ATD]; Ledford & Gast, 2018). For 7.5 adequate interobserver agreement (IOA), investigators needed to provide evidence of adequate IOA by meeting minimal standards (i.e., IOA ≥80%, κ ≥60%; CEC, 2014) across participants and DVs. This component was considered met for aggregated data if the study stated IOA occurred across participants or conditions, and if averages met specified levels and any reported range did not fall below 60% IOA (Lane et al., 2014). Finally, for group-comparison designs only, 7.6 validity was considered met if investigators reported either (a) adequate validity coefficients or (b) outcomes adequately represented content measured (i.e., content validity; CEC, 2014).
8.0 Data analysis
This indicator included two components specific to group-comparison design (8.1, 8.3) and one component specific to SCRD (8.2). To meet 8.1 data analytic techniques, group-design studies needed to employ (a) statistical analysis procedures generally recognized as appropriate for comparing change in the performance of two or more groups, or (b) atypical procedures that were justified and explained. To meet 8.2 graph clearly represents outcome data, SCRDs needed to clearly represent outcome data for all student outcome measures by providing graphs that allowed for the possibility of visual analysis (e.g., examine level, trend, and stability within and across conditions). Finally, to meet 8.3 effect sizes, studies needed to report effect sizes or provide data from which appropriate effect sizes could be calculated.
Evaluation Procedures for Determining Evidence-Based Practices
To answer questions related to classifying the evidence base for OTR strategies, we restricted included studies by including only demonstration (e.g., ABk, MBD) and comparison designs evaluating and comparing OTR strategy as an IV with other IVs different from OTR strategy. As such, studies comparing multiple variations of OTR strategy were quality appraised but excluded in evaluating and classifying the evidence base.
Classifying methodologically sound studies
CEC (2014) defined methodologically sound studies as meeting all of the QIs across components. We utilized a modified criterion (Lane et al., 2009) and defined methodologically sound as studies meeting 80% or more of all eight QIs. A weighted criterion to define methodologically sound articles is based on the logic that rigor exists as a continuum (e.g., no rigor, some rigor, high rigor) rather than as a dichotomy (present, absent). We acknowledge this does not strictly adhere to Standards for EBPs’ recommendation for classifying effects of studies.
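As a rough illustration of the modified criterion, the sketch below flags a study as methodologically sound when its eight weighted QI scores average 0.80 or higher. Treating "80% or more of all eight QIs" as the mean of the eight weighted scores is our reading for the purpose of this example, and the function name is hypothetical.

```python
from typing import Sequence


def is_methodologically_sound(weighted_scores: Sequence[float],
                              threshold: float = 0.80) -> bool:
    """Apply the weighted 80% criterion to one study (hypothetical helper).

    weighted_scores holds one value between 0 and 1 for each of the
    eight quality indicators.
    """
    if len(weighted_scores) != 8:
        raise ValueError("expected one weighted score per quality indicator (8)")
    # Assumption: the criterion is applied to the mean weighted score.
    return sum(weighted_scores) / len(weighted_scores) >= threshold


# A study fully meeting six indicators, half-meeting one, and missing one
# averages 6.5 / 8 = 0.8125 and is retained; under the absolute (100%)
# criterion it would be excluded.
print(is_methodologically_sound([1, 1, 1, 1, 1, 1, 0.5, 0.0]))
```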
Classifying study effect
To classify effects of group-comparison studies deemed methodologically sound, we followed the recommendations of Standards for EBPs: negative effect if ES ≤ −0.25, mixed/neutral effect if −0.25 < ES < 0.25, and positive effect if ES ≥ 0.25. We calculated Hedges’s g for studies that did not report effect sizes but reported information from which effect sizes could be calculated (e.g., M, SD, n; t, df). We dropped studies with inconsistent reporting of information necessary to calculate effect sizes (e.g., different t values).
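For readers who wish to reproduce this step, the following Python sketch shows one common way to compute Hedges’s g from summary statistics or from a t value, and then to apply the cutoffs above. The small-sample correction J = 1 − 3/(4df − 1) is the usual approximation; the function names are hypothetical, and values strictly between the two cutoffs are treated as mixed/neutral.

```python
import math


def hedges_g(m1: float, sd1: float, n1: int,
             m2: float, sd2: float, n2: int) -> float:
    """Hedges's g from group means, standard deviations, and sample sizes."""
    df = n1 + n2 - 2
    pooled_sd = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df)
    d = (m1 - m2) / pooled_sd
    return (1 - 3 / (4 * df - 1)) * d  # small-sample correction


def hedges_g_from_t(t: float, n1: int, n2: int) -> float:
    """Hedges's g from an independent-samples t value and group sizes."""
    df = n1 + n2 - 2
    d = t * math.sqrt(1 / n1 + 1 / n2)
    return (1 - 3 / (4 * df - 1)) * d


def classify_group_effect(es: float) -> str:
    """Classify a group-comparison effect size using the +/-0.25 cutoffs."""
    if es <= -0.25:
        return "negative"
    if es >= 0.25:
        return "positive"
    return "mixed/neutral"


# Illustrative (made-up) summary statistics for two groups of 20 students.
g = hedges_g(10.0, 2.0, 20, 9.0, 2.0, 20)
print(round(g, 2), classify_group_effect(g))
```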
To classify effects of SCRDs deemed methodologically sound and with three or more cases, we employed visual analysis to discern positive, neutral or mixed, or negative effects, based on (a) number and proportion of participants for whom a functional relation was established and (b) direction of functional relation (CEC, 2014). The presence of a functional relation was evaluated independently by two authors examining graphed data within and across phases for changes in level (e.g., low, moderate, or high), trend (e.g., increasing, decreasing, or flat), and stability (stable, variable; Ledford & Gast, 2018). Studies were determined to have positive effects if (a) at least 75% of cases demonstrated a functional relation between the IV and therapeutic changes in the DV, (b) there was no evidence of counter-therapeutic effects, and (c) remaining cases were neutral or mixed (i.e., no negative effects). Studies were determined as having negative effects if at least 75% of their cases demonstrated a functional relation between IV and unfavorable changes in DV (e.g., counter-therapeutic effects). Studies were determined as having mixed or neutral effects if they qualified as having neither positive nor negative effects. IRA between visual analysis coders was 100%.
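The decision rule for SCRD studies can be sketched as follows. Visual analysis itself is a judgment made by trained coders and is not automated here; the sketch only takes each case’s visual-analysis outcome as input. The three outcome labels and the function name are hypothetical, and the 75% figure is treated as a minimum.

```python
from typing import Sequence


def classify_scrd_effect(case_outcomes: Sequence[str]) -> str:
    """Classify one SCRD study from per-case visual-analysis outcomes.

    Each case is labeled "therapeutic" (functional relation in the desired
    direction), "counter-therapeutic" (functional relation in the undesired
    direction), or "none" (no functional relation; neutral or mixed).
    """
    n_cases = len(case_outcomes)
    if n_cases < 3:
        raise ValueError("the rule is applied only to studies with three or more cases")
    prop_pos = case_outcomes.count("therapeutic") / n_cases
    prop_neg = case_outcomes.count("counter-therapeutic") / n_cases
    if prop_pos >= 0.75 and prop_neg == 0:
        return "positive"
    if prop_neg >= 0.75:
        return "negative"
    return "mixed/neutral"


# Three of four cases show a therapeutic functional relation and the
# remaining case is neutral, so the study is classified as positive.
print(classify_scrd_effect(["therapeutic", "therapeutic", "therapeutic", "none"]))
```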
Classifying the evidence base
According to the Standards for EBP, for a strategy, practice, or program to be considered evidence based, it must (a) be supported by two methodologically sound group-comparison studies with random assignment to groups and unit of analysis aligned with unit of assignment, positive effects, and at least 60 participants across studies; four methodologically sound group-comparison studies with random assignment but unit of analysis not aligned with unit of assignment or with nonrandom assignment to groups, positive effects, and at least 120 participants across studies; or five methodologically sound SCRDs with positive effects and at least 20 participants across studies; or (b) meet at least 50% of criteria for two or more of the study designs (CEC, 2014, p. 8). In addition, no methodologically sound studies can have negative effects, and the ratio of positive to neutral/mixed effects must be 3:1 or greater. See CEC’s (2014) Standards for EBPs for classification requirements for potentially evidence-based, mixed evidence, insufficient evidence, and negative effects.
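A simplified sketch of the evidence-based classification rule is given below. It covers only the evidence-based category, not the other categories in the Standards. The design labels, the Study record, and in particular the way the 50% rule is operationalized (half of both the study count and the participant count within a design) are assumptions for illustration and should be checked against CEC (2014).

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Study:
    design: str        # "group_random_aligned", "group_other", or "scrd" (hypothetical labels)
    effect: str        # "positive", "mixed/neutral", or "negative"
    participants: int


# Minimum number of sound studies with positive effects, and minimum
# total participants across those studies, for each design category.
CRITERIA = {
    "group_random_aligned": (2, 60),
    "group_other": (4, 120),
    "scrd": (5, 20),
}


def is_evidence_based(studies: Sequence[Study]) -> bool:
    """Check the evidence-based category for a set of methodologically sound studies."""
    if any(s.effect == "negative" for s in studies):
        return False
    positives = [s for s in studies if s.effect == "positive"]
    n_neutral = sum(s.effect == "mixed/neutral" for s in studies)
    if n_neutral and len(positives) / n_neutral < 3:
        return False  # positive-to-neutral/mixed ratio below 3:1
    full, half = 0, 0
    for design, (min_studies, min_n) in CRITERIA.items():
        subset = [s for s in positives if s.design == design]
        k, n = len(subset), sum(s.participants for s in subset)
        if k >= min_studies and n >= min_n:
            full += 1
        if k >= min_studies / 2 and n >= min_n / 2:
            half += 1  # assumed reading of "at least 50% of criteria"
    return full >= 1 or half >= 2


# Five sound SCRD studies with positive effects and 20 participants in total.
print(is_evidence_based([Study("scrd", "positive", 4) for _ in range(5)]))
```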
Determining Magnitude of Effect
To complement visual analysis of methodologically sound SCRD studies with three or more cases, we additionally calculated between-case (ESBC) and within-case (ESWC) effect sizes. We digitized published graphs of methodologically sound studies using the data extraction software WebPlotDigitizer (Version 3.12; Rohatgi, 2015). Digital data were extracted independently by one of two authors, cleaned and formatted for statistical software, and checked for reliability against original graphs by a second author prior to analyses. Data without clearly marked legend keys or titles explaining the data (e.g., DV, unit of analysis) were dropped from these analyses. Effect sizes were screened using Grubbs’s test for outliers in R (Komsta, 2011). Omnibus effect sizes were not calculated across articles; effect sizes are reported by article (BC-SMD) or case (LRR).
Between-case effect sizes
We selected BC-SMD (Hedges et al., 2012, 2013) for SCRDs. BC-SMD technical requirements include SCRDs that (a) use MBD, multiple probe, or ABk designs; (b) contain three or more cases; and (c) assume no trend (Shadish et al., 2015). We calculated BC-SMD using the online BC-SMD calculator developed by Pustejovsky (2016). We used the restricted maximum likelihood estimation method and specified (a) fixed effect and random effect for the baseline phase (i.e., permitted intercept [level] across all baseline phases to be different from zero and vary across cases, respectively) and (b) fixed effect for the treatment phase level (i.e., permitted the intercept [level] across all intervention phases to vary from the baseline phase level). Furthermore, we specified random effect for the treatment phase (i.e., permitted treatment effect to vary across cases). Finally, following recommendations from Valentine, Tanner-Smith, Pustejovsky, and Lau (2016), we assumed treatment effects to be constant across cases by omitting random effects for treatment phase level. Not enough information was reported in group-comparison studies to calculate Hedges’s g. We employed Shadish, Zelinsky, Vevea, and Kratochwill’s (2016) descriptive quartiles, which divided 74 previously published BC-SMD estimates into four groups for interpretation: 0 to 0.36 = nominal effect, 0.37 to 0.97 = small effect, 0.98 to 1.86 = medium effect, and ≥1.87 = large effect.
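The descriptive quartiles used for interpretation can be expressed as a small lookup, sketched below. Rounding to two decimals before comparison and rejecting negative estimates are assumptions of this sketch, because the benchmarks above are stated only for non-negative values reported to two decimals; the function name is hypothetical.

```python
def interpret_bc_smd(estimate: float) -> str:
    """Map a BC-SMD estimate onto the descriptive quartile labels."""
    value = round(estimate, 2)
    if value < 0:
        # Assumption: the benchmarks are not defined for negative estimates.
        raise ValueError("benchmarks are stated for non-negative estimates only")
    if value <= 0.36:
        return "nominal"
    if value <= 0.97:
        return "small"
    if value <= 1.86:
        return "medium"
    return "large"


# An illustrative estimate of 1.20 falls in the third quartile.
print(interpret_bc_smd(1.20))
```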
Within-case effect sizes
We selected LRR, a parametric single-case effect size, as our ESWC over regression-based metrics. Although regression-based metrics account for trend, they carry additional technical constraints related to (a) an insufficient number of data points in the initial condition to predict accurately, and (b) too much (i.e., instability) or not enough (i.e., zero baselines) variance in baseline to accurately predict performance in adjacent conditions. LRR quantifies the proportionate change for an individual case across two adjacent conditions (e.g., A-B). It was important to select an ESWC flexible enough to be (a) calculated across a range of studies that do not meet the technical requirements of BC-SMD (Common et al., 2017), and (b) used to draw conclusions about each case separately, a hallmark of SCRD (Ledford & Gast, 2018; Shadish et al., 2015). ESWC (comparisons are made within individual participants) differs conceptually from ESBC (comparisons are made between average performance across participants). As such, ESWC and ESBC cannot be compared directly (Shadish et al., 2015). This limits the extent to which ESWC can be utilized in quantitative reviews examining both group comparison and SCRDs.
The technical requirements of LRR assume the pattern of behavior within each phase lacks time trends (e.g., stable from session to session). When applied to DVs on a scale of 0% to 100%, LRR requires all outcomes to be defined in the same direction of therapeutic change. All but one DV in this review consisted of student-level outcomes with a therapeutic direction being upward; thus, one DV (percentage off-task) was recoded to percentage on-task (i.e., 100 − % off-task = % on-task). LRRs were calculated using the online single-case effect size calculator (Pustejovsky, 2017). We followed recommendations set forth by Pustejovsky (2017) for ABk design studies with multiple A-B comparisons: we estimated LRR for each pair of adjacent phases and averaged those estimates into a single summary effect size for each case (Pustejovsky, 2018). Furthermore, we excluded cases from LRR calculations under the following conditions: (a) either phase in an A-B contrast had fewer than three data points, (b) there was zero responding within a baseline phase, or (c) there was near-zero responding in a phase followed by a ceiling effect in the next phase. For ATD, LRR was adopted and treated as an ABk design (e.g., A-B, A-C, A-D, B-C, B-D; Zelinsky & Shadish, 2018). For interpretation, LRR effect sizes employ directionality, with negative values of LRR corresponding to decreases, values of zero corresponding to no changes, and positive values corresponding to increases (Pustejovsky, 2015). To aid in the interpretation of LRR, percentage change was calculated from the LRR parametric using the following formula: 100 × [exp(LRR) − 1].
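The steps described above (recoding, screening, estimating LRR for each adjacent A-B pair, averaging, and converting to percentage change) can be sketched as follows. This is an illustrative sketch only, not the online calculator used in this review: all function names and data are hypothetical, the LRR is the basic uncorrected form (natural log of the treatment-phase mean over the baseline-phase mean), any small-sample corrections the calculator may apply are not reproduced, a simple unweighted average across A-B pairs is assumed, and exclusion rule (c) is omitted because it requires judgment about near-zero and ceiling responding.

```python
import math
from statistics import mean


def recode_off_task(percent_off_task):
    """Reverse a 0-100% off-task series so higher values are therapeutic (on-task)."""
    return [100 - value for value in percent_off_task]


def is_excluded(baseline, treatment, min_points=3):
    """Apply exclusion rules (a) and (b) from the text to one A-B contrast."""
    too_short = len(baseline) < min_points or len(treatment) < min_points
    zero_baseline = all(value == 0 for value in baseline)
    return too_short or zero_baseline


def log_response_ratio(baseline, treatment):
    """Uncorrected LRR: ln(mean of treatment phase / mean of baseline phase)."""
    return math.log(mean(treatment) / mean(baseline))


def percent_change(lrr):
    """Percentage change implied by an LRR: 100 * [exp(LRR) - 1]."""
    return 100 * (math.exp(lrr) - 1)


# Hypothetical A-B-A-B case recorded as percentage off-task (not data from any study).
off_task = {
    "A1": [60, 70, 65],
    "B1": [20, 25, 15],
    "A2": [55, 60, 65],
    "B2": [10, 20, 15],
}
on_task = {phase: recode_off_task(series) for phase, series in off_task.items()}

# Estimate LRR for each adjacent A-B pair that passes screening, then average.
contrasts = [("A1", "B1"), ("A2", "B2")]
estimates = [
    log_response_ratio(on_task[a], on_task[b])
    for a, b in contrasts
    if not is_excluded(on_task[a], on_task[b])
]
summary_lrr = mean(estimates)
print(f"LRR = {summary_lrr:.2f}, percentage change = {percent_change(summary_lrr):.1f}%")
```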
Results
Descriptive Characteristics of Included Studies
Twenty-one studies were included and published across nine unique journals from 1979 to 2017 (see Figure 1). One study employed a group-comparison design and 20 studies employed an SCRD (ABk = 13, ATD = 6, within-subject cross-over design = 1). Studies included participants from K to 11th grade. Ten studies took place in an elementary school, five in middle school, and four in high school. Two studies did not specify the school level, but specified classroom grade: third grade (McKenzie & Henry, 1979) and fifth grade (Munro & Stephenson, 2009). See Table 1 for additional information pertaining to context and setting. Although all studies were implemented during whole-class instruction, not all students were selected as participants for data recording. Across studies, 166 students participated in SCRD studies and 52 students were assigned to two theoretically comparable treatment groups in a group-comparison study. The predominant practice examined was response cards (k = 13; 61.90%), followed by verbal or nonverbal choral responding (k = 5; 23.81%), mixed-mode responding (k = 3; 14.29%), and student response systems/clickers (k = 2; 9.52%).
Descriptive Results of Included Articles.
Note. A-B-A or A-B-A-B or A-B-A-B-C or A-B-C-B-C = withdrawal or reversal design; ADHD = attention-deficit/hyperactivity disorder; AET = academic engaged time; ATD = alternating treatment design; CEI = critical events index (Walker & Severson, 1992); CFI = combined frequency index (Walker & Severson, 1992); DD = developmental delay; DO = direct observation; DV = dependent variable; EBD = emotional or behavioral disorder; ELL = English language learner; ES = elementary school; ESL = English as a second language; FRL = free and/or reduced lunch; GE = general education; HS = high school; ID = intellectual disability; LD = learning disability; MS = middle school; NA = nonapplicable; NHT = numbered heads together; OHI = other health impaired; OTR = opportunities to respond; RC = response cards; SCC = self-contained classroom; SED = serious emotional disturbance; SL = speech or language impairment; SSBD = Systematic Screening for Behavioral Disorders (Walker & Severson, 1992); — = not reported.
Methodological Quality Indicators
Results of the methodological quality appraisal (CEC, 2014) are provided in Figure 2. Three studies (Adamson & Lewis, 2017; Haydon, Musti-Rao, & Alter, 2017; Messenger et al., 2017) met all eight QIs (mode = 4; range: 3-8). Eleven studies (see Figure 2) met 80% or more of the QIs, which we defined as being methodologically sound (M = 6.73; SD = 0.93; range: 4.97-8.00).

Scatter box plot of quality indicators (Council for Exceptional Children, 2014) of included studies.
Evaluation of the Practice
Four methodologically sound SCRD studies included three or more cases and examined the effectiveness of OTR strategy (n = 17; Adamson & Lewis, 2017; Clarke, Haydon, Bauer, & Epperly, 2016; Munro & Stephenson, 2009; Wood, Mabry, Kretlow, Lo, & Galloway, 2009). Two methodologically sound studies were excluded from classifying the evidence base of the OTRs because they employed an ATD comparing more than one variation of an OTR strategy (Haydon et al., 2017; Messenger et al., 2017). Munro and Stephenson (2009) demonstrated positive effects of response cards on student-initiated responses. Wood et al. (2009) demonstrated positive effects of response cards on students’ on-task behavior and participation. Clarke et al. (2016) demonstrated positive effects of response cards on student responding. Adamson and Lewis (2017) demonstrated positive effects of response cards when contrasted with class-wide peer-tutoring and guided notes on students’ academic engaged time. Thus, teacher-delivered OTR strategy—specifically response cards—during whole-group instruction meets criteria for being a potentially EBP when a weighted criterion was used to define methodological rigor. See Table 2 for a summary of visual analysis of methodologically sound studies with three or more cases.
Visual Analysis and Between-Case Standardized Mean Difference of Methodologically Sound Single-Case Design Studies With Three or More Cases Meeting Technical Requirements
Note. BC-SMD = between-case standardized mean difference effect size; DV = dependent variable; Est. = estimate; SE = standard error; NA = nonapplicable; CI = confidence interval.
Between-case effect sizes
We calculated four BC-SMD estimates for three methodologically sound studies meeting the technical requirements (Clarke et al., 2016; Munro & Stephenson, 2009; Wood et al., 2009; see Table 2). A Grubbs’s test for outliers identified an effect size of 33.38 from Clarke et al. (2016) as an outlier (G = 1.38, p = .16). Upon visual inspection of the normal probability plot, an effect size of 14.76 from Wood et al. (2009) was also identified as an outlier. We confirmed these effect sizes were not errors and dropped them from our analysis because the large effect sizes were artifacts of near-zero responding followed by a ceiling effect (Zelinsky & Shadish, 2018). Munro and Stephenson (2009) examined the effects of response cards, suggesting large effects (BC-SMD = 2.60, SE = 1.45; 95% confidence interval [CI]: [1.10, 7.78]) on student-initiated response opportunities. Wood et al. (2009) examined the effects of response cards, which also demonstrated large effects (BC-SMD = 3.27, SE = 0.38; 95% CI: [2.55, 4.05]) on students’ on-task behavior (originally coded as off-task and reversed).
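As an illustration of the outlier screen, the Grubbs statistic is the largest absolute deviation from the sample mean divided by the sample standard deviation. The sketch below is written in Python rather than the R package used in this review, and the function name is hypothetical. Applied to the four BC-SMD estimates named in this section, it should reproduce the reported G of 1.38; the p value is not reproduced here.

```python
from statistics import mean, stdev


def grubbs_statistic(values):
    """Return (G, extreme value), where G = max |x - mean| / sample SD."""
    center = mean(values)
    spread = stdev(values)  # sample standard deviation (n - 1 denominator)
    extreme = max(values, key=lambda x: abs(x - center))
    return abs(extreme - center) / spread, extreme


# The four BC-SMD estimates reported in this section.
bc_smd_estimates = [33.38, 14.76, 2.60, 3.27]
g, extreme = grubbs_statistic(bc_smd_estimates)
print(f"G = {g:.2f} for the most extreme estimate ({extreme})")
```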
Within-case effect sizes
We considered 16 LRR estimates from four methodologically sound studies meeting the technical requirements to calculate LRR. An additional nine cases were dropped for not meeting the technical requirements due to near-zero responding/ceiling effect, and one case was dropped for having fewer than three data points within an A-B contrast (see Table 3). A Grubbs’s test for outliers identified an effect size of 1.75 from Munro and Stephenson (2009) as an outlier (G = 1.65, p = .71). Upon visual inspection of the normal probability plot, an effect size of 1.72 from Adamson and Lewis (2017) was also identified as an outlier. We confirmed these effect sizes were not errors and dropped them from our analysis because the large effect sizes were artifacts of near-zero responding followed by a ceiling effect. Munro and Stephenson (2009) examined effects of response cards, which demonstrated between 252.54% (LRR = 1.26, SE = 0.06, 95% CI: [1.14, 1.38]) and 274.34% (LRR = 1.32, SE = 0.08, 95% CI: [1.17, 1.47]) increase in student-initiated responding between hand-raising and response cards for two cases. Wood et al. (2009) examined effects of response cards, which demonstrated between 120.34% (LRR = 0.79, SE = 0.28, 95% CI: [0.25, 1.32]) and 235.35% (LRR = 1.21, SE = 0.31, 95% CI: [0.60, 1.82]) increase in on-task behavior between hand-raising and response cards across four cases. Adamson and Lewis (2017) compared effects of response cards against class-wide peer-tutoring and guided notes. Response cards demonstrated between 56.83% (LRR = 0.45, SE = 0.27, 95% CI: [−0.09, 0.98]) and 89.65% (LRR = 0.64, SE = 0.43, 95% CI: [−0.20, 1.48]) increase when compared against guided notes across three cases. Response cards demonstrated between 15.03% (LRR = 0.14, SE = 0.40, 95% CI: [−0.65, 0.93]) and 27.12% (LRR = 0.24, SE = 0.33, 95% CI: [−0.41, 0.90]) increase when compared against class-wide peer-tutoring across three cases.
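The percentage-change values above follow from the LRR estimates through 100 × [exp(LRR) − 1]. The sketch below recomputes them from the LRR values reported in this section and applies the same transformation to a confidence interval. The function names are hypothetical, and transforming the interval bounds in this way is an assumption made for illustration rather than a step described in the original analysis.

```python
import math


def percent_change(lrr):
    """Percentage change implied by an LRR: 100 * [exp(LRR) - 1]."""
    return 100 * (math.exp(lrr) - 1)


def percent_change_interval(lower, upper):
    """Apply the same transformation to both bounds of an LRR confidence interval."""
    return percent_change(lower), percent_change(upper)


# (study, LRR, reported percentage change), all taken from the text above.
reported = [
    ("Munro & Stephenson (2009)", 1.26, 252.54),
    ("Munro & Stephenson (2009)", 1.32, 274.34),
    ("Wood et al. (2009)", 0.79, 120.34),
    ("Wood et al. (2009)", 1.21, 235.35),
    ("Adamson & Lewis (2017), vs. guided notes", 0.45, 56.83),
    ("Adamson & Lewis (2017), vs. guided notes", 0.64, 89.65),
    ("Adamson & Lewis (2017), vs. class-wide peer-tutoring", 0.14, 15.03),
    ("Adamson & Lewis (2017), vs. class-wide peer-tutoring", 0.24, 27.12),
]

for study, lrr, reported_pct in reported:
    computed = percent_change(lrr)
    print(f"{study}: LRR = {lrr:.2f}, computed = {computed:.2f}%, reported = {reported_pct:.2f}%")

# Example: the 95% CI [0.25, 1.32] reported for one Wood et al. (2009) case.
low, high = percent_change_interval(0.25, 1.32)
print(f"Interval on the percentage scale: {low:.1f}% to {high:.1f}%")
```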
Within-Case Effect Sizes of Methodologically Sound Studies With Three or More Cases Meeting Technical Requirements of LRR.
Note. LRR = log response ratio effect size; Est. = estimate; SE = standard error; CI = confidence interval; ABk = withdrawal/reversal design; ATC = active student responding; A1 = baseline; RC = response card; CWPT = class-wide peer-tutoring; GN = guided notes; Sk = Student 1, Student 2, or Student 3.
Discussion
Increasing students’ OTR is a high-leverage practice teachers can use to facilitate school success by increasing student engagement and decreasing challenging behavior (Lane et al., 2015; McLeskey et al., 2017). Across included studies, teachers implemented OTR as a part of general classroom management (e.g., Armendariz & Umbreit, 1999), as well as to offer additional support to students at-risk for EBD (e.g., Haydon et al., 2010; Messenger et al., 2017). Using Standards for EBP, we employed a modified definition for methodologically sound studies (i.e., 80% or more weighted criterion; Lane et al., 2009) to identify articles that were sufficiently methodologically rigorous (Common et al., 2017). Across 21 studies, 11 studies met our criterion for being methodologically sound, with three studies meeting 100% of CEC’s QIs across components. Findings of this review indicated the majority (52.38%) of studies examining OTR strategies (e.g., choral responding, clickers, response cards) were methodologically rigorous.
Four methodologically sound studies with three or more cases (n = 17) examined the effectiveness of OTR strategy—specifically response cards—and demonstrated positive effects on student-initiated responses (Munro & Stephenson, 2009), on-task behavior (Wood et al., 2009), active student responding (Clarke et al., 2016), and academic engaged time (Adamson & Lewis, 2017). Thus, teacher-delivered OTR strategies, and specifically response cards, meet Standards for EBPs requirements as a potentially EBP following a modified definition of rigor. We note teacher-delivered OTR strategies would have been classified as having insufficient evidence had we followed CEC’s (2014) definition of rigor.
We also calculated ESBC and ESWC for methodologically sound SCRD studies with three or more cases. Although three of the four studies met the methodological qualitative appraisals and technical standards for BC-SMD, two estimates were dropped from analysis for being outliers. Similarly, three of the four studies met the methodological qualitative appraisals and technical standards for LRR, with two estimates also being dropped from analysis for being outliers. These results are similar to other studies showing patterns of inflated effect sizes in SCRD (Barton et al., 2017; Zelinsky & Shadish, 2018). Examining the magnitude effect of OTR in SCRDs presented unique challenges because the field has not yet come to consensus on which ESBC and ESWC should be employed. Specifically, few SCRD effect sizes are designed for use with ATD, and floor-to-ceiling effects caused many cases screened for LRR to fail the technical requirements and produced inflated ESBC when employing BC-SMD.
Overall, results from this EBP review are similar to Schnorr et al.’s (2015) review in which the authors found sufficient support for response cards as an EBP. We extended their work and found a range of research examining OTR strategies (e.g., response cards, choral responding) to be methodologically rigorous and effective. We also examined the magnitude effect of teacher-delivered OTR across methodologically sound studies with three or more cases; effect sizes for the studies meeting the technical requirements of BC-SMD and LRR were large and in a therapeutic direction. These findings are consistent with our visual analysis.
Limitations and Future Directions
We encourage consideration of the following limitations and recommendations for future research when interpreting these findings. First, OTR strategies include a broad range of teacher-, peer-, and technologically mediated practices. We evaluated the evidence base of teacher-driven strategies to increase students’ OTR. Future reviews are needed to examine the methodological quality of peer-mediated and technologically mediated OTR strategies. In addition, more research is needed to examine the empirical support for choral responding, clickers, and varied modes of responding. OTR strategy is particularly well suited for promoting fluency and automaticity in content knowledge, which are associated with increased gains in engagement, academic achievement, and desired student behaviors. Future research is needed to explore such student outcome effects across varying configurations (e.g., effects of student and classroom characteristics as well as various OTR strategies).
Second, in this review, we included all ATD studies to map the literature, including quality appraisal of the methodological rigor. Unlike demonstration designs (e.g., ABk and MBD), which specifically examine the efficacy of the IV, ATDs are comparison designs, which address questions related to which IV is more effective (Ledford & Gast, 2018). As such, two ATD studies that compared different, albeit similar, IVs considered to be an OTR strategy were dropped from our evaluation of the evidence base (Haydon et al., 2017; Messenger et al., 2017). Furthermore, BC-SMD and LRR were initially designed for demonstration designs, although Zelinsky and Shadish (2018) posited comparison designs might be adopted to allow A-B contrasts. Effect sizes for SCRD employing comparison designs should be interpreted with caution, as more research is needed to examine their theoretical and technical constraints.
Third, CEC’s (2014) Standards for EBPs classifies SCRDs as having positive, neutral or mixed, or negative effects based in part on “the number and proportion of participants in a study for whom a functional relationship between IV and the DV was established” (p. 7). In this review, a number of methodologically sound studies reported aggregate class-wide data (e.g., Cavanaugh, Heward, & Donelson, 1996; George, 2010) and were excluded from visual analysis and further consideration in supporting the evidence base. Future research is necessary to explore the extent to which SCRDs reporting aggregate data should be included for visual inspection and considered when classifying the evidence base of instructional practices.
Fourth, whereas BC-SMD is theoretically on the same scale as Cohen’s d and Hedges’s g, early work has shown effect sizes for SCRD (i.e., ESBC, ESWC) to be much larger than d or g in group studies (Barton et al., 2017). This is consistent with theoretical expectations. For example, visual analysis of SCRD allows for the detection of large effects better than small or moderate effects. Thus, there is a strong publication preference for studies with larger effects (Barton et al., 2017; Shadish et al., 2016). More research is needed to examine the extent to which BC-SMD, d, and g are truly on the same scale, and, if so, whether they should be interpreted similarly or differently within systematic reviews that examine the magnitude effect across group comparison and SCRD research articles.
Finally, in this systematic review we included only studies published in peer-reviewed journals. As such, generalization may be limited due to the omission of theses, dissertations, and other studies that may have included null outcomes and/or included methodological decisions that may have prevented their publication. Also, although all studies reported effects across outcome measures, not all student outcome measures were graphed in original studies (e.g., reporting outcomes in tabular form). This led to the exclusion of some outcome data for further visual and statistical analysis. Furthermore, several graphs were dropped from analysis due to underreporting and inconsistent reporting of information necessary to interpret the data. Future researchers should ensure all necessary information is reported in studies to facilitate synthesis in systematic reviews.
Summary
Our goal was to classify the evidence base of teacher-delivered strategies to increase students’ OTR across the K-12 continuum during whole-class instruction. We applied Standards for EBPs (2014) utilizing a modified definition to identify methodologically sound studies. Eleven studies met our modified criterion of 80% or more of the QIs. Four of these studies included three or more cases (n = 17) and demonstrated positive effects. Effect sizes demonstrated large magnitude effects in the therapeutic direction. We, therefore, classify teacher-directed OTR strategy in K-12 school settings to be a potentially EBP.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
