Abstract
Active supervision is a proactive, low-intensity strategy to minimize challenging behaviors and increase desired behaviors. To examine the evidence base of this strategy, we applied the Council for Exceptional Children’s (CEC) Standards for Evidence-Based Practices in Special Education to the body of research exploring the impact of active supervision with Pre-K–12 students in traditional school settings. In this systematic literature review, we identified seven peer-reviewed, single-case design, treatment-outcome studies meeting inclusion criteria. All studies met a ≥80% weighted criterion of CEC’s quality indicators. These seven studies included 15 cases aggregated at the school, classroom, or grade level, collectively involving 1,686 participants. Three studies included three or more cases and demonstrated positive effects across primary dependent variables (with one study showing neutral effects on a secondary dependent variable). Based on available evidence and using CEC criteria, we determined active supervision to be a potentially evidence-based practice. We discuss implications, limitations, and future directions.
A shared goal among educators is to deliver high-quality instruction while providing a safe and positive learning environment for all students. In recent years, schools and researchers have shifted toward prioritizing proactive efforts to support students who engage in challenging behavior to increase student engagement and promote prosocial behaviors. Educators meet this charge by (a) providing students with engaging pedagogy encompassing high-leverage practices (HLPs) and developmentally appropriate instruction, (b) building and fostering teacher–student and peer relationships, and (c) implementing effective proactive and reactive behavior management strategies as part of three-tiered models of prevention (Horner & Sugai, 2015; McLeskey et al., 2017). For example, consider the recent emphasis placed on infusing HLPs into pre-service training and professional development. These are essential practices designed to empower educators to support diverse student learners across multiple educational contexts (McLeskey et al., 2017). HLPs target students’ social and emotional well-being, as well as student behaviors, to promote success both within and beyond the classroom. Implementation of HLPs includes adopting an instructional approach for academics, behavioral, and social-emotional competencies through a continuum of evidence-based practices (EBPs) delivered in a caring, respectful, and culturally relevant manner (McLeskey et al., 2017).
Another way educators have shifted toward proactive efforts is through the implementation of three-tiered models (e.g., Positive Behavior Interventions and Supports [PBIS; Horner & Sugai, 2015]; comprehensive, integrated, three-tiered model of prevention [Ci3T; Lane, Oakes, & Menzies, 2014]). Tiered models include progressively more intensive assistance: preventing challenges from occurring (Tier 1; for all students), reversing challenges (Tier 2; for some students), and reducing challenges (Tier 3; for few students; Lane & Walker, 2015). For students demonstrating needs beyond Tier 1, educators provide additional supports at Tier 2 (e.g., check in/check out; Hawken, MacLeod, & Rawlings, 2007; targeted reading and writing interventions) and/or Tier 3 (e.g., functional assessment-based interventions [FABI; Common, Lane, Pustejovsky, Johnson, & Johl, 2017]). Through tiered models, teachers can employ universal supports efficiently (e.g., low-intensity behavior management strategies such as active supervision) to reduce problem behavior and facilitate greater access to the curriculum.
Simonsen and colleagues (2015) summarized effective proactive and reactive behavior support strategies to empower teachers. Examples of proactive, preventative strategies include low-intensity strategies, such as active supervision, proximity, increased opportunities to respond, prompts, and precorrections (Simonsen et al., 2015). Instructional feedback and behavior-specific praise are similarly low in intensity to implement and promote future probability of desired behaviors. Low-intensity strategies are particularly well suited for schools implementing three-tiered models with a shared commitment of installing HLPs across the tiers to increase student engagement and decrease disruptive behavior (Lane, Menzies, Ennis, & Oakes, 2015).
Recent reviews examining effective, efficient, low-intensity strategies—such as precorrection (Ennis, Royer, Lane, & Griffith, 2017) and instructional choice (Royer, Lane, Cantwell, & Messenger, 2017)—investigated the extent to which these practices can be considered EBPs following the Council for Exceptional Children’s (CEC; 2014) Standards for Evidence-Based Practices in Special Education (hereafter referred to as Standards for EBP) using a modified ≥80% weighted criterion to essentially allow for “partial credit” when discerning methodologically sound studies (Lane, Kalberg, & Shepcaro, 2009). In a systematic literature review of precorrection, Ennis and colleagues (2017) identified 10 studies with results suggesting precorrection is an EBP when using the weighted coding criterion. In a systematic review of instructional choice, Royer et al. (2017) identified 26 studies with results suggesting insufficient evidence available to classify instructional choice as an EBP using Standards for EBP. These reviews offer important findings for educators as they seek effective and efficient strategies to incorporate into regular school practices to meet students’ multiple needs. Like precorrection and instructional choice, active supervision is another antecedent-based strategy focusing on changing the environment to prevent undesirable behaviors from occurring (Cooper, Heron, & Heward, 2007). To our knowledge, the current evidence on active supervision has yet to be examined.
Supervision and monitoring of students are essential strategies for ensuring student safety and promoting on-task behaviors. Teachers monitor student behaviors by watching all students to look for desired behaviors and assess the rhythm and flow of activities (Doyle, 1989). Early work on teacher effectiveness (Brophy, 2010; Kounin, 1970) highlighted the importance of a teacher’s monitoring of students’ engagement and activities in the classroom. Since then, empirical studies have shown that active supervision can reduce behavior problems during transitions to and from classrooms (Haydon, DeGreg, Maheady, & Hunter, 2012) and tardiness to class (Johnson-Gros, Lyons, & Griffin, 2008), and that supervision more broadly (e.g., observation and proximity) supports positive shifts in student performance in academic and behavioral domains (Oswald, Safran, & Johanson, 2005; Sariscsany, Darst, & van der Mars, 1995; Schuldheisz & van der Mars, 2001).
Prior to the development of active supervision, researchers examined the impact of various teacher supervision practices on student behavior. For example, Sariscsany et al. (1995) studied the effects of three teacher behaviors on students’ on-task behavior during gym: (a) skill-based feedback delivered near the student, (b) skill-based feedback delivered from a far distance from the student, and (c) no skill-based feedback with the teacher at a distance from the student. Using an alternating treatment design, authors found skill-based feedback interventions increased two of three students’ on-task behavior. Extending the work of Sariscsany et al. (1995), Schuldheisz and van der Mars (2001) studied the impact of teacher behavior (e.g., movement, skill-based feedback) on individual students’ physical activity in a physical education class. Using an A-B-A-B withdrawal design across four gym classes, results demonstrated a positive impact on physical activity level following the introduction of teacher verbal prompting and movement.
Active Supervision
Active supervision is a low-intensity antecedent-based strategy employing teacher behaviors, such as scanning, moving, and interacting with students, to reduce problem behaviors and increase desired behaviors (Colvin, Sugai, Good, & Lee, 1997). Simonsen, Fairbanks, Briesch, Myers, and Sugai (2008) conducted a systematic review to identify evidence-based classroom management practices using a criterion of three or more supporting empirical articles. Authors identified and described 20 practices, including active supervision. Simonsen and colleagues (2008) described active supervision as a practice to support teaching, monitoring, and reinforcing behavioral expectations. However, the scope of this review did not include an assessment of methodological quality of studies or the degree of impact on student outcomes.
In the initial study, Colvin et al. (1997) examined the effect of active supervision and precorrection on the behavior of elementary-aged students transitioning between school settings (e.g., from cafeteria to classroom). Authors found a decrease in problem behavior after introducing an active supervision/precorrection intervention. Subsequent articles analyzed effects of active supervision when combined with other strategies (e.g., with precorrection; Haydon & Kroeger, 2016; Lewis, Colvin, & Sugai, 2000) or in isolation (Johnson-Gros et al., 2008). To date, studies have focused on schoolwide (Colvin et al., 1997), classwide (De Pry & Sugai, 2002), and grade-level (Franzen & Kamps, 2008) efforts. These studies demonstrate the versatility of active supervision as a stand-alone strategy or packaged as a component in an intervention along with other low-intensity strategies (e.g., precorrection). Also, these studies suggest active supervision is a viable primary prevention practice for supporting all students, including those demonstrating challenging behavior, across a range of settings (e.g., recess, classroom). Specifically, active supervision is an effective strategy that is easy to learn and implement in multiple contexts.
Although many studies have investigated effects of active supervision on student behavior, to date, there have been no systematic reviews of its literature including both a methodological quality appraisal and its effect on student outcomes. This review extends the work of Simonsen et al. (2008) by using the Standards for EBP—similar to reviews of precorrection and instructional choice—to provide researchers and practitioners with valuable information regarding the status of active supervision as an EBP and where additional research is needed.
Purpose
We used CEC’s (2014) Standards for EBP to determine the extent to which studies included in this systematic review align with quality indicators (QIs) and current EBP standards. We used a modified ≥80% weighted criterion (Lane et al., 2009) to identify methodologically sound studies to determine the quality and rigor of the active supervision literature. We examined evidence for the practice and magnitude of the impact of active supervision on Pre-K–12 student outcomes using visual analysis and between- and within-case effect sizes.
Method
Article Procurement
We conducted a multistep process to identify articles, beginning with electronic searches followed by ancestral and hand searches (inclusion criteria to follow). Two authors independently conducted electronic searches with the following Boolean search terms: (a) active supervision, (b) active supervision AND praise, and (c) active supervision AND (precorrection OR pre-correction). Authors searched the following databases: Academic Search Complete, Education Abstracts, ERIC, JSTOR, PsycARTICLES, PsycINFO, and Psychology and Behavioral Science Collection. We included terms relating to precorrection, as this strategy is often embedded in active supervision treatment interventions (Colvin et al., 1997), to increase the likelihood of finding all relevant articles for this systematic review. Searches were limited to full-text, peer-reviewed, English-language articles. We identified 116 articles across searches with 100% agreement between two authors. We entered resulting titles and abstracts in a Microsoft Excel file to organize the screening, eligibility, and inclusion processes (Wolery, Lane, & Common, 2018). We determined a priori that, if the two authors disagreed and could not reach consensus, a third author (senior scientist) would inform the final decision.
Two authors screened titles and abstracts for possible inclusion. To screen titles and abstracts, authors coded each article (n = 116) using 1 (met inclusion) or 0 (did not meet inclusion). This process yielded seven possible articles, including five single-case design studies and two pre–post design studies. Next, authors read each article in full to assess eligibility. Five of the seven initial articles were included in the present review.
Following the same screening, eligibility, and inclusion processes, two authors began an ancestral search by independently reviewing each article’s references to identify potential articles to be included in this review. Authors reviewed 173 references, leading to identification of four potentially eligible articles that were read in full. Two additional articles met criteria.
Finally, hand searches were conducted if journals identified in the aforementioned procedures featured two or more included articles. One journal met this criterion: Education and Treatment of Children. In the hand search, we reviewed issues published from the year of the first included article (1997) to the year in which the hand search was conducted (2017). Authors reviewed 687 articles, of which one article was read in full for eligibility but did not meet inclusion criteria.
Across our systematic search, we found most articles during the electronic search (n = 5; 71.43%); hand and ancestral searching contributed two articles (28.57%). Seven studies were identified for this review. Inter-rater agreement (IRA) was conducted across steps (see Figure 1).

Figure 1. Preferred reporting items for systematic reviews and meta-analyses flow diagram.
Inclusion Criteria
Studies needed to meet three inclusion criteria. First, studies needed to be empirical and examine the efficacy of active supervision using an experimental group comparison or single case research design (CEC, 2014). Following Colvin et al. (1997), we defined active supervision as educators engaging in the specific and overt behaviors of scanning, moving, and interacting to prevent problem behavior and promote rule-following behavior. This included (a) established expectations, (b) frequent scanning of context, (c) positive interactions (verbal and nonverbal precorrections and prompts), (d) reinforcement of desired behavior, and when necessary (e) correction to support student success (De Pry & Sugai, 2002; Haydon & Scott, 2008). We included studies of active supervision in isolation or with one or two other strategies that emphasized components within active supervision (e.g., precorrection, reinforcement). Second, the dependent variable(s) (DV) must have included one or more student-level academic or behavioral outcomes (e.g., engagement/time on-task; inappropriate behaviors). Third, the intervention needed to occur in a traditional Pre-K–12 setting (e.g., general education, resource, self-contained classroom). Studies were excluded if they employed a design not capable of demonstrating experimental control (e.g., AB-designs; Oswald et al., 2005), were not written in English, or were not published in a peer-reviewed journal.
Coding Procedures
Training
The first author was taught how to code for QIs using the Standards for EBP by two senior authors, one of whom developed the initial coding rubric and training procedures. Both senior authors have applied this quality appraisal coding protocol in published reviews (e.g., Common et al., 2019). Training consisted of independently coding three training articles (not included in this review) with ≥85% inter-rater agreement (IRA). IRA was calculated by dividing the number of agreements by the total number of agreements plus disagreements and multiplying by 100. Across training and subsequent QI coding for this review, we used a QI coding protocol for single-case and group-comparison design methodology (Lane, Common, Royer, & Muller, 2014). During training, mean IRA between the first and second authors was 89.39% (SD = 2.38; range = 86.36–90.48).
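The IRA formula described above can be illustrated with a minimal sketch. The function name and the example codes are hypothetical and serve only to show the arithmetic; they are not the authors' coding data.

```python
def interrater_agreement(coder_a, coder_b):
    """Point-by-point agreement: agreements / (agreements + disagreements) * 100."""
    if len(coder_a) != len(coder_b):
        raise ValueError("Both coders must rate the same set of items")
    if not coder_a:
        raise ValueError("No items to compare")
    agreements = sum(1 for a, b in zip(coder_a, coder_b) if a == b)
    disagreements = len(coder_a) - agreements
    return 100.0 * agreements / (agreements + disagreements)


# Hypothetical dichotomous codes (1 = met, 0 = not met) for 22 components.
coder_1 = [1] * 22
coder_2 = [1] * 19 + [0] * 3  # disagrees with coder_1 on three components

print(round(interrater_agreement(coder_1, coder_2), 2))  # 19 of 22 agree -> 86.36
```

Under this formula, each disagreement on a 22-component protocol lowers IRA by roughly 4.5 percentage points.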
Descriptive coding
During QI coding, one author coded articles for descriptive characteristics based on each of the CEC (2014) QI: (a) context and setting, (b) student and intervention agent demographics, (c) descriptions of the practice, (d) implementation fidelity, (e) internal validity, (f) outcomes measures, and (g) data analysis. These characteristics illustrate the context of the present evidence base and suggest future research directions for examining active supervision. A second coder verified the descriptive coding by comparing the first coder’s results to the content in the articles; IRA (calculated by dividing the sum of the initial agreements across studies by the sum of the initial agreements and disagreements and multiplying by 100) was 100%.
Quality indicator coding
After training to criterion, two authors independently appraised each article’s methodological rigor using Standards for EBP. The eight QIs and scoring procedures are described in the following sections. Twenty-two components for single-case design studies were coded, as no group comparison studies were included. Each component was coded dichotomously (0 = not met, 1 = met). The two authors then compared their coding, discussed disagreements, and came to consensus. It was determined a priori a third author would review coding and render a decision if a consensus could not be reached. Point-by-point agreement for each component was calculated for an average IRA by indicator. IRA was calculated by dividing the initial agreements across studies by the sum of the initial agreements and disagreements and multiplying by 100. Across studies, IRA was 88.90% (SD = 9.02; range = 77.27–100). Average IRA by QI was 88.42% (SD = 10.91; range = 71.43–100). Initial IRA prior to consensus or third author consultation is reported below by QI. Final agreement was 100% across indicators.
QI 1.0. Context and setting
To meet this indicator, a study needed to describe salient characteristics of the intervention context. This could refer to location, region, school setting (e.g., public elementary school), or setting within the school (e.g., third-grade inclusive classroom). If the study described information on at least one context/setting variable, this indicator was considered met. IRA was 100%.
QI 2.0. Participants
QI 2.1 refers to the description of participants. QI 2.2 refers to describing the risk status of participants and how this determination was made. To meet the first component, the study needed to describe at least one demographic variable of participants (e.g., age, gender, ethnicity). To meet the second component, the study needed to describe the process for determining participants’ disability/risk status (e.g., standardized assessment, interdisciplinary team, screening tool). We did not require risk status to be reported for cases aggregated at the school, classroom, or grade level, because disability/risk eligibility is not typically a requirement for whole-school/classwide participation. It was desirable (although not required) to describe the participant selection/inclusion criteria for how class(es) or school(s) were selected. Average IRA for each component was 85.71%.
QI 3.0. Intervention agent
QI 3.1 refers to describing intervention agent(s) within the study. QI 3.2 refers to describing the specific training or qualifications necessary to implement the intervention. To meet QI 3.1, a study must have described at least one demographic variable of the intervention agent (e.g., role, years of experience, age, race). To meet QI 3.2, a study must have described the training procedures required for the intervention agent to deliver the intervention and how the intervention agent achieved them (e.g., role play, check for understanding). Average IRA was 71.43% (range = 57.14–85.71).
QI 4.0 Description of practice
QI 4.1 refers to providing a description of practice, and QI 4.2 refers to describing any intervention materials necessary for implementing the intervention. QI 4.1 was met if intervention procedures and intervention agent actions were described with replicable detail. Studies may also have cited outside sources reporting this information. Similarly, QI 4.2 was met if a study described intervention materials with replicable detail. If the intervention materials described were specific to particular components of a package intervention (e.g., precorrection) unrelated to implementing active supervision and no materials were needed for active supervision, this component was considered not applicable. Average IRA was 75.00% (range = 50.00–100).
QI 5.0 Implementation fidelity
QI 5.1 was met if a study used a direct measure of fidelity (e.g., a self-report) and reported the results quantifiably. QI 5.2 was met if a study reported fidelity in dosage or exposure to the intervention. This could have been achieved by reporting the length of time the intervention was implemented (e.g., 25 min daily at recess) in conjunction with how long the intervention was in place (e.g., as depicted in a graph). QI 5.3 was met if fidelity data were taken regularly throughout the intervention (e.g., percentage of sessions across all phases). We considered this met if fidelity data were provided in an aggregate form with authors indicating data were collected with sufficient frequency throughout (e.g., beginning, middle, end). Average IRA was 95.24% (range = 85.71–100).
QI 6.0 Internal validity
QI 6.1 was met if studies reported experimental control over the independent variable (IV). We determined this component could only be met if implementation fidelity (QI 5.1) was also met, as implementation fidelity is necessary to determine whether experimental control was truly achieved (Common et al., 2019). QI 6.2 was met if a study described baseline conditions in detail (e.g., who did what to whom, and under what conditions; Lane, Wolery, Reichow, & Rogers, 2006). QI 6.3 was met if participants in baseline, withdrawal, or control conditions had little or no access to the intervention. This component was considered met if the study reported how active supervision was not present during nonintervention phases (e.g., baseline, withdrawal) and/or reported implementation fidelity during nonintervention phases to demonstrate active supervision components were low. QI 6.5 was met if a study’s design allowed for three demonstrations of experimental effect. QI 6.6 was met if studies had at least three data points in baseline conditions. Exceptions could be made if a reasonable justification was provided (e.g., behavior was harmful and placed child at risk), but it was insufficient to state withdrawal had fewer than three data points for reasons of convenience (e.g., teacher preference of not having problem behaviors). Further, a baseline condition was not required (although recommended) in alternating treatment designs (Ledford & Gast, 2018). QI 6.7 was met if the study controlled for common threats to internal validity using a valid single case research design (Ledford & Gast, 2018). The study must also report implementation fidelity (QI 5.1), as the lack of implementation fidelity is a threat to internal validity. Average IRA was 85.71% (range = 71.43–100).
QI 7.0 Outcome measures/dependent variables
QI 7.1 was met if a study demonstrated the social validity of the outcomes either through social validity survey data or through justifications presented in the introduction or discussion. QI 7.2 was met if a study described DVs with replicable detail. QI 7.3 was met if all outcomes measured were reported. QI 7.4 was met if the frequency and timing of outcomes were reported appropriately (e.g., at least three data points in each phase). QI 7.5 was satisfied by reporting acceptable reliability of indices (e.g., IOA ≥ 80% or kappa ≥ 60%). Average IRA was 94.29% (range = 85.71–100).
QI 8.0 Data analysis
QI 8.2 was met if a study included an appropriate single-case graph (e.g., line graph) clearly representing outcome data for all student-level DVs. A study needed to provide a graph to allow for visual analysis of level, trend, and stability within and across conditions. Average IRA was 100%.
Evaluation Procedures for Determining Evidence-Based Practice Status
We applied the Standards for EBP to determine the extent to which active supervision in traditional school settings across the Pre-K–12 continuum qualified as an EBP using a modified criterion for operationalizing methodologically sound studies. In applying the weighted criterion, each component within an indicator was equally weighted (Lane et al., 2009). For example, QI 2.0 had two components, with each weighing 0.50; whereas QI 5.0 had three components, each weighing 0.33. This allowed for the possibility of partial credit by allowing components met to contribute toward the score for that indicator, rather than an absolute coding that gives a QI a score of zero if any component within the indicator was not met. Following QI appraisal, studies meeting the modified ≥80% weighted criterion were considered methodologically sound rather than using CEC’s (2014) original, more conservative absolute QI coding criterion. We selected the weighted criterion a priori to allow for the inclusion of high-quality studies as methodologically sound despite not meeting all components across all QIs (Ennis et al., 2017; Royer, Lane, Dunlap, & Ennis, 2018). We considered these high-quality studies sufficiently rigorous to be included in the evidence-based decision-making process (Lane et al., 2009).
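The weighted criterion can be sketched as follows. This is a minimal illustration under two stated assumptions that the text does not spell out: the study-level percentage is taken as the mean of the eight indicator scores, and not-applicable components are dropped from an indicator's denominator. All names and the example codes are hypothetical.

```python
def qi_score(components):
    """Proportion of applicable components met within one quality indicator.

    components: list of 1 (met), 0 (not met), or None (not applicable).
    Assumption: not-applicable components are excluded from the denominator.
    """
    applicable = [c for c in components if c is not None]
    if not applicable:
        raise ValueError("Indicator has no applicable components")
    return sum(applicable) / len(applicable)


def weighted_percentage(study):
    """Assumption: study-level score is the mean of indicator scores, times 100."""
    scores = [qi_score(components) for components in study.values()]
    return 100.0 * sum(scores) / len(scores)


def absolute_percentage(study):
    """Absolute coding: an indicator counts only if every component is met."""
    scores = [1.0 if qi_score(c) == 1.0 else 0.0 for c in study.values()]
    return 100.0 * sum(scores) / len(scores)


# Hypothetical single-case study coded on 22 components across eight QIs.
study = {
    "QI 1": [1],
    "QI 2": [1, 1],
    "QI 3": [1, 0],              # one of two components met -> 0.50
    "QI 4": [1, 1],
    "QI 5": [1, 1, 0],           # two of three components met -> 0.67
    "QI 6": [1, 1, 1, 1, 1, 1],
    "QI 7": [1, 1, 1, 1, 1],
    "QI 8": [1],
}

print(round(weighted_percentage(study), 2))  # 89.58, meets the 80% criterion
print(round(absolute_percentage(study), 2))  # 75.0, would not meet it
```

The hypothetical study shows why the choice of criterion matters: the same codes clear the 80% threshold with partial credit but fall short under absolute coding.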
Following Standards for EBP guidelines for classifying effects across studies using single case research designs, we evaluated methodologically sound studies with three or more cases demonstrating positive, neutral or mixed, or negative effects. Studies were classified as having positive effects if 75% of cases demonstrated a functional relation between the IV and therapeutic changes in the DV, there was no evidence of counter-therapeutic effects, and remaining cases showed neutral or mixed effects (i.e., no negative effects). Studies were considered as having negative effects if 75% of its cases demonstrated a functional relation between IV and unfavorable changes in DV (i.e., counter-therapeutic effects). Studies were considered as having mixed or neutral effects if the study did not qualify as having positive or negative effects. The presence of a functional relation was evaluated by examining data within and across phases for changes in level, trend, and stability (Ledford & Gast, 2018). Two authors independently performed visual analysis. IRA for classification of study effects was calculated by dividing the sum of the initial agreements across cases (within studies) by the sum of the initial agreements and disagreements and multiplying by 100. IRA was 100%.
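The classification rules in this paragraph can be expressed as a short decision function. This is a sketch only: the labels are hypothetical, and the 75% figure is read here as "at least 75% of cases", which is an assumption about how the threshold is applied.

```python
def classify_study(case_effects):
    """Classify a study from per-case visual-analysis outcomes.

    case_effects: one label per case, each being
      "therapeutic" (functional relation, change in the desired direction),
      "counter"     (functional relation, change in the undesired direction),
      "none"        (no functional relation demonstrated).
    """
    n_cases = len(case_effects)
    if n_cases < 3:
        raise ValueError("Only studies with three or more cases are classified")
    therapeutic = case_effects.count("therapeutic")
    counter = case_effects.count("counter")
    # Positive: at least 75% therapeutic and no counter-therapeutic cases.
    if therapeutic / n_cases >= 0.75 and counter == 0:
        return "positive"
    # Negative: at least 75% counter-therapeutic.
    if counter / n_cases >= 0.75:
        return "negative"
    # Everything else falls into the residual category.
    return "neutral or mixed"


print(classify_study(["therapeutic", "therapeutic", "therapeutic", "none"]))  # positive
print(classify_study(["therapeutic", "therapeutic", "none"]))  # neutral or mixed
```

Note that with exactly three cases, a single case without a functional relation drops the study below the 75% mark (two of three is about 67%).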
We applied the Standards for EBP to the evidence base for active supervision to determine the extent to which the practice can be considered an EBP. To be considered an EBP, the practice must have (a) five methodologically sound single case studies demonstrating positive effects across 20 or more participants, (b) a ratio of three studies with positive effects to one study with neutral or mixed effects, and (c) no studies with negative effects. A potentially evidence-based practice must have (a) between two and four single case studies demonstrating methodological rigor with positive effects, (b) a ratio of two studies with positive effects to one study with neutral or mixed effects, and (c) no studies with negative effects (CEC, 2014).
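The two classification rules for single-case evidence can be sketched as below. The function covers only the two categories described in this paragraph; the residual label, the argument names, and the example counts are hypothetical, and the ratios are read as minimums, which is an assumption.

```python
def classify_practice(n_positive, n_neutral_or_mixed, n_negative, n_participants):
    """Apply the EBP and potentially-EBP rules for single-case studies.

    Counts refer to methodologically sound studies only.
    """
    if n_negative == 0:
        # Ratios are checked by cross-multiplying to avoid dividing by zero.
        meets_3_to_1 = n_positive >= 3 * n_neutral_or_mixed
        meets_2_to_1 = n_positive >= 2 * n_neutral_or_mixed
        if n_positive >= 5 and n_participants >= 20 and meets_3_to_1:
            return "evidence-based practice"
        if 2 <= n_positive <= 4 and meets_2_to_1:
            return "potentially evidence-based practice"
    return "not classified by this sketch"


# Hypothetical evidence base: three positive studies, none neutral or negative.
print(classify_practice(3, 0, 0, n_participants=30))
# -> potentially evidence-based practice
```

Because the potentially evidence-based category is capped at four positive studies, an evidence base needs at least five sound positive studies and 20 participants to move to the higher classification.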
Determining Magnitude Effect
Data extraction
To calculate within- and between-case effect sizes, we digitized published graphs for all methodologically sound studies using data extraction software (WebPlotDigitizer; Version 3.12; Rohatgi, 2017). This enabled each data point to be extracted digitally, exported, and saved into a spreadsheet for analysis. Shadish et al. (2009) demonstrated data extraction software to be highly reliable with extracted data being nearly identical to original data. Digital data were extracted by one author and cleaned for effect size calculations by converting the file to .xlsx file format, creating headers (case identification, condition, x-value, y-value, and therapeutic direction), rounding ordinates to two decimal places, and confirming the extracted values fell within a possible value range. Next, a second author compared extracted data to original articles with 100% reliability of extracted graphs.
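The cleaning steps described above (rounding ordinates and checking the plausible value range) might look like the following sketch. The authors worked in a spreadsheet; this Python version, its function name, field names, and default range of 0 to 100 are all hypothetical.

```python
def clean_extracted_points(rows, y_min=0.0, y_max=100.0):
    """Round each ordinate to two decimals and confirm it lies in [y_min, y_max].

    rows: iterable of (case_id, condition, x_value, y_value, direction) tuples,
    mirroring the five headers named in the text. x-values are left untouched.
    """
    cleaned = []
    for case_id, condition, x_value, y_value, direction in rows:
        y_rounded = round(y_value, 2)
        if not y_min <= y_rounded <= y_max:
            raise ValueError(
                f"{case_id}, session {x_value}: y={y_rounded} "
                f"outside [{y_min}, {y_max}]"
            )
        cleaned.append({
            "case": case_id,
            "condition": condition,
            "x": x_value,
            "y": y_rounded,
            "direction": direction,
        })
    return cleaned


# Hypothetical digitized points for one case.
raw = [
    ("Class 1", "A", 1, 12.3456, "decrease"),
    ("Class 1", "B", 2, 4.9981, "decrease"),
]
for point in clean_extracted_points(raw):
    print(point["condition"], point["y"])
```

Raising an error on an out-of-range value, rather than clipping it, keeps a digitizing slip visible so the point can be rechecked against the published graph.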
Within-case effect sizes
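Within-case effect sizes, such as log response ratio (LRR; Pustejovsky, 2018), are metrics conceptualizing proportionate change for an individual case across two adjacent conditions (e.g., A-B). The LRR effect size parameter (Pustejovsky, 2018) is defined as
$$\psi = \ln\left(\frac{\mu_B}{\mu_A}\right),$$
where $\mu_A$ and $\mu_B$ denote the mean level of the outcome in the baseline (A) and treatment (B) phases, respectively.
We calculated LRR for all methodologically sound studies using the online single case effect size calculator (Pustejovsky, 2017), and percentage change in Microsoft Excel. The percentage change formula (Pustejovsky, 2018) is defined as
$$\%\ \text{change} = 100\% \times \frac{\mu_B - \mu_A}{\mu_A} = 100\% \times \left(e^{\psi} - 1\right).$$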
LRR assumes the pattern of behavior within each phase is stable from session to session (i.e., lacks time trends). When applied to DVs on a scale of 0% to 100%, LRR requires all outcomes to be defined in the same direction of therapeutic change to be consistent across cases and studies. All DVs consisted of student-level outcomes with therapeutic direction being downward, thus no student outcome data required recoding. For reversal and withdrawal design studies with multiple A-B comparisons, we followed recommendations from Pustejovsky (2014) and estimated LRR for each pair of adjacent phases, then combined those estimates by averaging into a single summary effect size for each case (Pustejovsky, 2018). We did not analyze A-B contrasts with fewer than three data points or zero responding within a phase. That is, if A1-B1 and B2 had three or more data points but A2 only had two data points, A2-B2 contrast would be dropped from further analysis, and LRR would be reported for A1-B1 only. LRR effect sizes employ directionality, with negative values of LRR corresponding to decreases, positive values corresponding to increases, and values of zero corresponding to no changes (Pustejovsky, 2014). Percentage change is an intuitive way to interpret magnitude of effect from baseline to treatment level.
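The decision rules in this paragraph can be sketched in code, taking LRR as the natural log of the ratio of the treatment-phase mean to the baseline-phase mean. This is a simplified illustration, not the online calculator's implementation: it omits any small-sample correction the calculator may apply, and the function names and data are hypothetical.

```python
import math


def log_response_ratio(phase_a, phase_b):
    """Uncorrected LRR for one A-B contrast, or None if the contrast is excluded."""
    # Exclude contrasts where either phase has fewer than three data points.
    if len(phase_a) < 3 or len(phase_b) < 3:
        return None
    mean_a = sum(phase_a) / len(phase_a)
    mean_b = sum(phase_b) / len(phase_b)
    # Exclude contrasts with zero responding in a phase (log is undefined).
    if mean_a <= 0 or mean_b <= 0:
        return None
    return math.log(mean_b / mean_a)


def case_summary_lrr(contrasts):
    """Average the usable LRR estimates across a case's adjacent A-B pairs."""
    estimates = [log_response_ratio(a, b) for a, b in contrasts]
    usable = [e for e in estimates if e is not None]
    return sum(usable) / len(usable) if usable else None


def percentage_change(lrr):
    """Convert an LRR to percentage change from baseline."""
    return 100.0 * (math.exp(lrr) - 1.0)


# Hypothetical A-B-A-B case: the second baseline has only two points,
# so the A2-B2 contrast is dropped and only A1-B1 contributes.
a1, b1 = [20, 22, 18], [10, 9, 11]
a2, b2 = [19, 21], [8, 10, 9]

lrr = case_summary_lrr([(a1, b1), (a2, b2)])
print(round(lrr, 3))                     # -0.693
print(round(percentage_change(lrr), 1))  # -50.0
```

The negative sign reflects a decrease, which is the therapeutic direction for the outcomes in this review.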
Between-case effect sizes
Between-case standardized mean difference (BC-SMD) effect sizes, unlike within-case effect sizes, are comparable with standardized mean differences from between-group experimental designs (e.g., Cohen’s d). This allows BC-SMDs to be compared across a wide range of studies employing between-case effect size metrics. To date, BC-SMDs (Hedges, Pustejovsky, & Shadish, 2012, 2013) can be calculated for single-case studies using multiple baseline or reversal/withdrawal designs containing three or more cases. We calculated BC-SMDs for all methodologically sound studies using either a multiple baseline or reversal/withdrawal design containing three or more cases by entering the extracted data in the online BC-SMD calculator (Pustejovsky, 2016). We used the restricted maximum likelihood (REML) estimation method and specified both a fixed effect and random effect for the baseline phase to permit the intercept (level) across all baseline phases to be different from zero and vary across cases, respectively. We further specified a fixed effect for the treatment-phase level to permit the intercept (level) across all intervention phases to vary from the baseline-phase level. We did not specify a random effect for the treatment phase, which would have permitted the treatment effect to vary across cases. Following recommendations from Valentine, Tanner-Smith, Pustejovsky, and Lau (2016), we assumed treatment effects were constant across cases by omitting the random effect for the treatment-phase level.
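One way to write the model specification described above is sketched below. The notation is an assumption introduced for illustration: $y_{ij}$ is the outcome for case $j$ at session $i$, $T_{ij}$ is 1 in treatment phases and 0 in baseline, and any autocorrelation structure for the errors is left unspecified.

```latex
\[
  y_{ij} = \beta_0 + \beta_1 T_{ij} + u_{0j} + e_{ij},
  \qquad u_{0j} \sim N(0, \tau_0^2), \quad e_{ij} \sim N(0, \sigma^2)
\]
\[
  \delta = \frac{\beta_1}{\sqrt{\tau_0^2 + \sigma^2}}
\]
```

Here $\beta_0$ is the fixed baseline level, $u_{0j}$ is the random baseline effect that lets that level vary across cases, and $\beta_1$ is the fixed treatment-phase effect. No random term accompanies $\beta_1$, which corresponds to treating the treatment effect as constant across cases. Under this sketch the BC-SMD $\delta$ scales the treatment effect by the combined between-case and within-case standard deviation, which is what places it on the same scale as a standardized mean difference from a group design.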
Shadish, Zelinsky, Vevea, and Kratochwill (2016) examined two previously published samples of BC-SMDs and used quartiles to indicate small (g = 0.373; 25th percentile), medium (g = 0.978; 50th percentile), and large (g = 1.874; 75th percentile) effects. We descriptively interpreted BC-SMD effects as follows: small (0.37 to 0.98), medium (0.98 to 1.87), large (≥ 1.87). BC-SMDs are desirable because they (a) are directly comparable with effect sizes calculated for group comparison studies, (b) are interpreted on the same magnitude scale, and (c) allow, for the first time, magnitude of effect to be contrasted across groups (Shadish, Hedges, Horner, & Odom, 2015). This allows effect sizes calculated between cases within a study to be comparable with other between-case effect sizes (e.g., BC-SMDs for single-case and Cohen’s d for group comparison studies).
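The descriptive benchmarks can be expressed as a simple lookup, sketched below. Three choices are assumptions made for illustration rather than part of the original method: magnitude is judged on the absolute value of the estimate (because therapeutic change is downward, estimates are negative), a value falling exactly on a shared boundary is assigned to the higher category, and values under 0.37 receive the hypothetical label "below small".

```python
def classify_bc_smd(g):
    """Map a BC-SMD estimate to a descriptive magnitude label.

    Thresholds follow the rounded quartile benchmarks (0.37, 0.98, 1.87).
    Using abs(g) and assigning boundary values to the higher category
    are illustrative assumptions.
    """
    size = abs(g)
    if size >= 1.87:
        return "large"
    if size >= 0.98:
        return "medium"
    if size >= 0.37:
        return "small"
    return "below small"  # hypothetical label for values under the 25th percentile

# Hypothetical estimates, not values from the reviewed studies.
labels = [classify_bc_smd(g) for g in (-2.5, 1.2, -0.5, 0.1)]
```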
Results
Mapping the Literature
Descriptive characteristics
Seven studies were included and published across six unique journals from 1997 to 2016. Two studies examined active supervision without precorrection. Johnson-Gros et al. (2008) examined active supervision in isolation, and Franzen and Kamps (2008) examined active supervision as part of a multicomponent recess intervention. Five studies examined active supervision with precorrection. Two studies examined active supervision with only precorrection (Colvin et al., 1997; Lewis et al., 2000), two studies examined active supervision and precorrection with explicit timing (Haydon et al., 2012; Haydon & Kroeger, 2016), and one study examined active supervision and precorrection with daily data review (De Pry & Sugai, 2002). Four studies took place in an elementary school setting (Colvin et al., 1997; De Pry & Sugai, 2002; Franzen & Kamps, 2008; Lewis et al., 2000), and three studies took place in secondary settings (Haydon et al., 2012; Haydon & Kroeger, 2016; Johnson-Gros et al., 2008). Three studies investigated schoolwide interventions (Colvin et al., 1997; Johnson-Gros et al., 2008; Lewis et al., 2000), three studies investigated classwide interventions (De Pry & Sugai, 2002; Haydon et al., 2012; Haydon & Kroeger, 2016), and one study investigated a grade-level intervention (Franzen & Kamps, 2008). Three studies took place within schoolwide PBIS frameworks (De Pry & Sugai, 2002; Franzen & Kamps, 2008; Haydon & Kroeger, 2016). Five studies used teachers as the intervention agents for active supervision (De Pry & Sugai, 2002; Franzen & Kamps, 2008; Haydon et al., 2012; Haydon & Kroeger, 2016; Johnson-Gros et al., 2008); one study used paraprofessionals, classroom teachers, and the building principal (Colvin et al., 1997); and one study used playground monitors (Lewis et al., 2000). No studies investigated effects of active supervision on individual students. Across studies, aggregate data were provided across 15 cases, which collectively involved 1,686 participants. 
Five studies reported the region in the United States in which the study occurred (Colvin et al., 1997; Franzen & Kamps, 2008; Haydon et al., 2012; Haydon & Kroeger, 2016; Johnson-Gros et al., 2008). All seven studies described the school setting (e.g., rural, suburban, urban). See Table 1 for descriptive characteristics.
Table 1. Descriptive Characteristics of Reviewed Studies.
Note. This table is an extension from the original work mapping the literature on precorrection (Ennis et al., 2017) with an emphasis on active supervision. CEC = Council for Exceptional Children; DV = dependent variable; K = Kindergarten; FRL = free and reduced-price lunch; IOA = interobserver agreement; ODR = office discipline referral; STEM = science, technology, engineering, math; SW-PBIS = schoolwide positive behavior interventions and supports; SV = social validity.
Methodological quality indicators
Results of the quality appraisal for active supervision employing the Standards for EBP are reported by indicator. See Figure 2 for QIs by component for included studies.

Figure 2. Scatter box plot of quality indicators (CEC, 2014) of included single-case design articles.
QI 1.0 Context and setting
All seven studies met 1.0 in full. This indicator included one component: 1.1 Context/Setting description.
QI 2.0 Participants
All seven studies met 2.0 in full. This indicator included two components: 2.1 Participant description and 2.2 Participant disability/at-risk status.
QI 3.0 Intervention agent
Five studies (71.43%) met 3.0 in full. This indicator included two components: 3.1 Role description and 3.2 Training description. All seven studies met 3.1 Role description. Five studies (71.43%) met 3.2 Training description.
QI 4.0 Description of a practice
All seven studies met 4.0 in full. This indicator included two components: 4.1 Intervention procedure description and 4.2 Materials description. All seven studies met 4.1 Intervention procedure description. Five studies (71.43%) met 4.2 Materials description. These studies included and described intervention materials specific to active supervision or its implementation (e.g., daily graphed data of teacher’s implementation of core components). The remaining two studies (28.57%) did not include intervention or implementation materials specific to active supervision; as such, this component was considered not applicable.
QI 5.0 Implementation fidelity
All seven studies met 5.0 in full. This indicator included three components: 5.1 Implementation fidelity assessed/reported, 5.2 Dosage or exposure assessed/reported, and 5.3 Assessed across relevant elements/throughout study.
QI 6.0 Internal validity
Three studies (42.86%) met 6.0 in full. All seven studies met 6.1 IV systematically manipulated and 6.2 Baseline description. Five studies (71.43%) met 6.3 No or limited access to IV during baseline. Six studies (85.71%) met 6.5 Three demonstrations of experimental effect. Six studies (85.71%) met 6.6 Baseline: minimum three data points and established pattern. Four studies (57.14%) met 6.7 Controls for threats to internal validity.
QI 7.0 Outcome measures/dependent variables
Five studies (71.43%) met 7.0 in full. This indicator included five components. All seven studies met 7.1 Socially important, 7.2 Description of DV measures, 7.3 Reports effects of the intervention on all measures, and 7.5 Adequate interobserver agreement. Five studies (71.43%) met 7.4 Measured repeatedly.
QI 8.0 Data analysis
All seven studies met 8.0 in full. This indicator included one component: 8.2 Graph clearly represents outcome. All studies included a line graph allowing for the possibility of visual analysis of student-level DVs.
Evaluating the Evidence Base
One study (Franzen & Kamps, 2008) met all eight QIs, with remaining studies meeting between six and eight indicators (mode = 7). All seven studies met or exceeded ≥80% of the applicable QIs following a weighted coding scheme to define studies as methodologically sound (M = 7.56, SD = 0.38; range = 6.83–8.00). We identified three studies including three or more cases (Colvin et al., 1997; Franzen & Kamps, 2008; Lewis et al., 2000). According to Standards for EBP, studies with three or more cases can be assessed to determine positive, mixed/neutral, or negative effects across cases through visual analysis. Colvin et al. (1997) demonstrated positive effects across settings targeting students’ problem behavior using a multiple baseline design. Lewis et al. (2000) demonstrated positive effects across recesses targeting students’ problem behavior during unstructured activities and neutral effects across recesses during structured activities using a multiple baseline design. Franzen and Kamps (2008) demonstrated positive effects across grades at recess targeting students’ inappropriate behavior using a multiple baseline design. No methodologically sound studies demonstrated a negative effect. Based on this quality appraisal, active supervision met criteria for being a potentially evidence-based practice across three methodologically sound single-case studies, but only when a weighted criterion was used (Lane et al., 2009).
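To illustrate why the choice between absolute and weighted coding matters, the sketch below assumes one common reading of the weighted scheme: each QI earns partial credit equal to the proportion of its applicable components met, and a study is methodologically sound when its summed score reaches 80% of the number of QIs (6.4 of 8). The scoring rule as coded, the function names, and the example study are assumptions for illustration, not a reproduction of the coding protocol used in this review.

```python
def weighted_qi_score(components_met):
    """Sum of per-QI partial credit.

    `components_met` maps each QI label to a tuple of
    (components met, applicable components).
    """
    return sum(met / applicable for met, applicable in components_met.values())

def absolute_qi_count(components_met):
    """Number of QIs met in full (all-or-nothing coding)."""
    return sum(1 for met, applicable in components_met.values() if met == applicable)

def is_sound(score, n_indicators=8, threshold=0.80):
    """A study is sound when its score reaches the threshold share of all QIs."""
    return score >= threshold * n_indicators

# Hypothetical study: misses one component of QI 3.0 and one of QI 6.0.
study = {
    "1.0": (1, 1), "2.0": (2, 2), "3.0": (1, 2), "4.0": (2, 2),
    "5.0": (3, 3), "6.0": (5, 6), "7.0": (5, 5), "8.0": (1, 1),
}
weighted = weighted_qi_score(study)   # 6 + 0.5 + 5/6, about 7.33
absolute = absolute_qi_count(study)   # 6 of 8 QIs met in full
```

Under these assumptions, the hypothetical study clears the 80% bar with weighted coding (about 7.33 of 8) but not with absolute coding (6 of 8, or 75%).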
Determining Magnitude of Effect
Within-case effect sizes
Table 2 displays LRR effect-size estimates for each case and each outcome across methodologically sound studies. For interpretation, LRR effect sizes employ directionality, with negative values of LRR corresponding to decreases, positive values corresponding to increases, and values of zero corresponding to no changes. All seven studies demonstrated negative values (range = −0.09 to −1.80) across DVs/settings, with one exception: Lewis et al. (2000) demonstrated slight positive values when examining active supervision during structured recess activities (range = 0.07–0.39) and negative values during unstructured recess activities (range = −1.08 to −1.52). Overall, these findings suggest magnitude effects in the therapeutic direction (decreases in undesirable behaviors) with percentage change for outcomes ranging from −8.61% to −83.47%.
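The link between an LRR value and percentage change can be made explicit with a short sketch. It uses the standard conversion, percentage change = 100 × (exp(LRR) − 1), which reproduces the endpoints reported above; the function name is hypothetical.

```python
import math

def lrr_to_percent_change(lrr):
    """Convert a log response ratio to percentage change from baseline."""
    return 100 * (math.exp(lrr) - 1)

# Endpoints of the reported LRR range across DVs/settings.
smallest = round(lrr_to_percent_change(-0.09), 2)  # -8.61
largest = round(lrr_to_percent_change(-1.80), 2)   # -83.47
```

An LRR of zero converts to 0% change, and positive LRR values convert to percentage increases.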
Table 2. Quality Appraisal and Study Effect.
Note. Due to fewer than three data points within a phase, the following A-B contrasts were dropped from aggregate WC-LRR calculations: De Pry and Sugai (2002) and Haydon and Kroeger (2016). Abs = absolute coding; BC-SMD = between-case standardized mean difference (Hedges, Pustejovsky, & Shadish, 2012, 2013); CI = confidence interval; Est. = estimate; GE = general education classroom; ODR = office discipline referral; QI = Quality indicator; WC-LRR = within-case log response ratio (Pustejovsky, 2014); Wghtd = weighted coding.
Between-case effect sizes
Table 2 displays BC-SMD effect size estimates for outcomes across methodologically sound studies using a withdrawal/reversal design or multiple baseline design with three or more cases. Colvin et al. (1997) examined the effects of active supervision on students’ problem behavior entering school, entering the cafeteria, and exiting school. The BC-SMD estimate across settings was −2.18 (SE = 0.49; 95% CI [−3.19, −1.34]), suggesting a large effect. Lewis et al. (2000) examined the effects of active supervision on students’ problem behavior during structured and unstructured activities across three recess periods. The BC-SMD estimate across recesses during unstructured activities was −2.04 (SE = 0.32; 95% CI [−2.69, −1.43]), suggesting a large effect. Conversely, the BC-SMD estimate across recesses during structured activities was 0.06 (SE = 0.26; 95% CI [−0.44, 0.56]), suggesting a negligible effect, below the 0.37 benchmark for a small effect and with a confidence interval that includes zero. Franzen and Kamps (2008) examined effects of active supervision on students’ problem behavior during recess across three grade levels. The BC-SMD estimate across grade levels during recess was −2.13 (SE = 0.28; 95% CI [−2.69, −1.61]), suggesting a large effect. These effect sizes yielded similar patterns of effect as determined using visual analysis.
Discussion
Using Standards for EBP, this systematic review appraised the quality of the literature on active supervision to determine to what extent the practice can be classified as an EBP and the magnitude of its effectiveness on student outcomes. To our knowledge, this is the first systematic literature review of active supervision, though other effective, efficient, low-intensity strategies have recently been appraised (e.g., Royer et al., 2017). The current review identified seven single-case design studies that (a) were experimental, (b) reported at least one student-level outcome, and (c) took place with student(s) in a K–12 traditional school setting. We identified all studies in peer-reviewed journals. Using a modified definition for methodologically sound studies (i.e., weighted criterion), results from this systematic review indicate active supervision is a potentially evidence-based practice, with both within-case (i.e., LRR) and between-case (i.e., BC-SMD) effect sizes indicating moderate to large effects in reducing student problem behavior. Active supervision may be an EBP when applied in tandem with broader interventions, such as precorrection and reinforcement.
Findings align with previous reviews of classroom management practices classifying active supervision as an evidence-based strategy (e.g., Simonsen et al., 2015). Our results are similar to other recent systematic reviews on low-intensity behavior management strategies (e.g., Common et al., 2018). In each review, the researchers employed the weighted criterion to discern methodologically sound studies, drawing on concerns first raised by Lane et al. (2009) that overly rigorous criteria may lead to the unintended consequence of excluding strong studies and ultimately having fewer effective strategies, practices, and programs classified as evidence based. This reflects a growing trend in systematic reviews employing this criterion to emphasize how methodological rigor exists on a continuum ranging from studies having little or no methodological rigor to studies demonstrating superior methodological rigor (Common et al., 2019).
Implications for Practice and Research
For practitioners, there are several lessons to be learned. Findings provide evidence to support active supervision when used in isolation (e.g., Johnson-Gros et al., 2008) or as a component of an intervention package (e.g., active supervision, precorrection, and explicit timing; Haydon & Kroeger, 2016). Practitioners can employ this effective, efficient, low-intensity strategy with greater confidence to reduce problem behavior in a variety of school and classwide contexts. Given all studies met QI 5.0 Implementation fidelity, active supervision is a strategy educators can learn and implement to positively affect student behaviors. This level of prevention is well aligned with typical HLPs at Tier 1, such as developing and reinforcing schoolwide behavior expectations (Lane, Oakes, & Menzies et al., 2014), which can be employed to support students at the school and classroom level.
Furthermore, practitioners can learn to implement active supervision with a minimal investment of time and resources. Studies in this review described low-intensity training processes typically involving brief presentations for faculty and staff or conversations with a teacher (Colvin et al., 1997; De Pry & Sugai, 2002). For example, Haydon and Kroeger (2016) described a training process where teachers were provided with a definition of active supervision. Researchers then role-played examples and nonexamples of active supervision. At the request of the teachers, the researchers provided a script.
For practitioners implementing active supervision schoolwide, this review provides several considerations. First, active supervision can be successfully taught during a building-wide meeting (Colvin et al., 1997; Johnson-Gros et al., 2008). Second, staff can use active supervision by learning to implement its component parts (e.g., scanning, moving, interacting). Third, practitioners must know whether the practice is in place and what stakeholders believe about the intervention; this review demonstrates the feasibility of collecting treatment integrity and social validity data to address these issues (Haydon & Kroeger, 2016; Johnson-Gros et al., 2008).
Limitations and Future Directions
There are several limitations worthy of consideration when interpreting the findings of this review. First, patterns in the included studies may limit the external validity of these studies. For example, most studies included in this review did not investigate active supervision in isolation. It was frequently paired with another low-intensity behavior management strategy (Colvin et al., 1997; De Pry & Sugai, 2002; Franzen & Kamps, 2008; Haydon et al., 2012; Haydon & Kroeger, 2016; Lewis et al., 2000). This makes it difficult to interpret the role of active supervision in producing the results in each study. Indeed, only one study examined the efficacy of active supervision in isolation (Johnson-Gros et al., 2008). Future research may study active supervision as a sole treatment (e.g., withdrawal/reversal or multiple baseline design) or with other treatments using a comparative design to evaluate the unique impact of active supervision in conjunction with other treatment components (Ledford & Gast, 2018). Similarly, many components of active supervision are typical practices of educators (e.g., scanning) that may be difficult to fully control for, leading to possible confounding variables (e.g., contamination) during nonintervention conditions (e.g., baseline, withdrawal). As such, it may be difficult to completely withdraw a strategy like active supervision during a baseline phase because its components incidentally overlap with regular school practices. In our QI coding process, this made QI 6.3 Limited or no access to intervention difficult to interpret for this review. In addition to monitoring treatment integrity across conditions, future component analyses may be helpful in determining which facets of active supervision (e.g., proximity, establishing expectations, interaction) are most effective in preventing challenging behavior. This would provide a stronger evidence base to bolster the internal and external validity of active supervision.
Second, there was a large range for the average IRA by QI (M = 88.42%, SD = 10.91; range = 71.43–100). Although the overall average IRA by article was high (M = 88.90%, SD = 9.02; range = 77.27–100), the range of IRA by QI was negatively affected by QI 3.0 Intervention agent (71.43%) and QI 4.0 Description of practice (75.00%). This may in part be due to the denominator ranging from five to seven due to the small number of included studies. For example, if there was a single disagreement, that component would be scored 87.50%; if there were two disagreements, that component would be scored 71.43%. Initial coding discrepancies occurred most frequently for QI 3.2 Training description and QI 4.2 Materials description. These disagreements, in addition to disagreements regarding QI 2.0 Participants and 6.3 No or limited access to IV during baseline, contributed to a low IRA for De Pry and Sugai (2002; IRA = 77.27%). Although this range indicates a limitation, 100% agreement was achieved through the consensus model and confirmation with a third author. Throughout the coding process, we clarified our interpretations of each QI specific to active supervision, as the training articles used to train to criterion were not related to active supervision. The training process may have been improved by using similar articles reflecting package interventions with a focus on assessing the methodological quality of a specific component of the package. Future reviewers are encouraged to clarify how to operationalize each component as it relates to the practice under investigation.
Third, this review may demonstrate a bias toward peer-reviewed, high-quality, published studies. Other studies examining active supervision may have found null or negative effects; these studies may not have been published and would therefore not be found through our systematic search. Academic journals must strive to publish studies with null or negative effects, yet with high methodological rigor, so researchers can fully understand the impact teaching practices have on students.
Fourth, during the coding process, we considered QI 2.2 (participant disability/at risk) to be met if the study was conducted as a schoolwide, classwide, or grade-level intervention. We considered the identification of the aggregate case as the unit of analysis targeted for the intervention as the process for identifying students’ risk status (need of intervention), thereby obviating the necessity for identifying and reporting students with disabilities through other methods. This interpretation has the potential to artificially inflate the total QI scores of these studies. Future research is needed to determine (a) the methodological rigor of describing participant disability/at-risk status, but not the inclusion criteria for how schools or classes were selected for intervention, and (b) how individual characteristics—such as participants’ disability/at risk status—moderate the effectiveness of active supervision at the individual level.
Fifth, during the evaluation of study effect, we employed both visual analysis (was there an effect?) and effect sizes (what was the magnitude of that effect?). For visual analysis, we followed guidelines for conducting visual analysis within- and between-conditions as recommended by Ledford and Gast (2018) to dichotomously determine the presence or absence of study effect. To complement visual analysis, we used both within-case (LRR) and between-case (BC-SMD) effect sizes. Although all seven studies met technical requirements of LRR, only three studies met technical requirements for BC-SMD (Colvin et al., 1997; Franzen & Kamps, 2008; Lewis et al., 2000). As such, fewer than half the included studies’ magnitude effects can be compared with group-design research at this time. Future research may consider classifying functional relations as small, medium, or large to allow further comparison with other metrics—such as effect sizes—to complement visual analysis (Moeyaert, Zimmerman, & Ledford, 2018). Future research on active supervision meeting the technical requirements of BC-SMD is also needed.
Sixth, all studies included in this review examined effects of active supervision using aggregated school, classroom, or grade levels. To date, no studies have examined how active supervision can support individual students. We encourage future research to explore the efficacy of active supervision at the level of the individual as part of Tier 1 practices or used in a more intensive active-supervision intervention (e.g., Tier 2) for students who may require more intentional supports. This line of inquiry is important for teachers searching for ways to effectively support students who engage in challenging behavior.
Finally, although the evidence for active supervision was found to be methodologically high in rigor, there was a small number of studies identified in this review (n = 7). More research is needed to examine active supervision’s efficacy in isolation and as part of multi-component interventions. Additional studies are required to establish active supervision as an EBP. For example, more research is needed in other instructional contexts. Many studies in this review took place in classroom, playground, or transition contexts. Other settings (e.g., cafeteria or common gathering areas) warrant future investigation. Similarly, no study examined active supervision in preschool contexts. At the time of this review, the only barrier to determining active supervision as an EBP was the number of qualifying studies demonstrating positive effects.
Summary
Active supervision is an efficient, low-intensity strategy that shows promise for reducing problem behavior and promoting appropriate behavior. This review applied the Standards for EBP to the literature on active supervision obtained through a systematic literature search. This search yielded seven studies, all of which were identified as methodologically sound using a modified weighted criterion. Three studies included three or more cases and demonstrated positive effects for the primary DVs. Based on these analyses, we determined active supervision to be a potentially evidence-based practice.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This systematic review of the literature was supported in part by an Office of Special Education Programs Preparation of Leadership Personnel (CFDA 94.325D, PR/Award number H325D160080) to the Department of Special Education at the University of Kansas.
