Abstract
Implementation of research-based, Tier 1 behavior management strategies can be monitored to provide data-driven feedback and to support treatment integrity. The Measure of Active Supervision and Interaction (MASI) was developed to measure four behavior management practices (i.e., Praise, Correction, References to Behavior Expectations, Active Supervision) using systematic direct observation. This study was designed to address research questions related to reliability and validity by applying the MASI to evaluate staff behavior in seven out-of-school time programs. Findings indicate that two raters can complete the MASI with high agreement. Ratings are attributable largely to desirable sources of variance, and content validators positively rate the measure. Results are nonsignificantly correlated with established implementation measures for Positive Behavior Interventions and Supports.
To encourage prosocial student behavior, education professionals (e.g., teachers, paraprofessionals, out-of-school time [OST] staff) employ research-based, Tier 1 behavior management practices (Epstein, Atkins, Cullinan, Kutash, & Weaver, 2008; Newcomer, Colvin, & Lewis, 2009; Simonsen, Fairbank, Briesch, Myers, & Sugai, 2008). Successful implementation of these Tier 1 practices requires adults to establish three to five behavior expectations that are defined across settings and activities and explicitly taught to students (Simonsen et al., 2008), and provide high levels of behavior-specific praise and low rates of correction (e.g., Sutherland, Wehby, & Copeland, 2000). High levels of behavior-specific praise and low levels of corrections are associated with improved on-task behavior (Partin, Robertson, Maggin, Oliver, & Wehby, 2009; Sutherland et al., 2000), while the use of consistent expectations as a component of Positive Behavior Interventions and Supports (PBIS) has been found to increase compliance and reduce rates of problem behavior in classrooms, hallways, and schools generally (Bradshaw, Koth, Thornton, & Leaf, 2009; Leedy, Bates, & Safran, 2004). To encourage high rates of praise and references to behavior expectations and simultaneously minimize the use of correction, educators engage in active supervision by moving around the setting, observing, and interacting with students (Colvin, Sugai, Good, & Lee, 1997). Active supervision is associated with lower rates of problem behavior (Colvin et al., 1997; Lewis, Colvin, & Sugai, 2000). In spite of ample evidence supporting their effectiveness, these practices are applied inconsistently in schools and OST settings, thereby limiting their potential to affect student outcomes (Reddy, Fabiano, Dudek, & Hsu, 2013b; Ruberto, 2015).
Evaluating the Implementation of Tier 1 Behavior Management Practices
To support the delivery of effective practices, it is necessary to understand the extent and circumstances under which they are implemented, which can be done through treatment integrity assessment. As this is an emerging research area, there are few well-established implementation measures, even for behavior management strategies (Collier-Meek, Fallon, & Gould, accepted; Gresham, 2014; Sanetti & Kratochwill, 2009). In the sections that follow, we summarize the multistep process for developing treatment integrity tools and review two types of behavior management measures (i.e., self-report tools and observational measures) before describing gaps in the literature.
Tools Developed Per Treatment Integrity Assessment Guidelines
Current recommendations for treatment integrity assessment can be described in five steps: operationalize intervention components, consider varied dimensions (e.g., adherence, quality), select an assessment method (e.g., observation, permanent product review), determine a rating format (e.g., dichotomous, Likert-type scale), and sum ratings for a total (Collier-Meek, Fallon, Sanetti, & Maggin, 2013; Gresham, Dart, & Collins, 2017). Researchers apply these procedures to assess implementation of behavior management practices; indeed, interobserver and intraobserver agreement data provide initial evidence for the reliability of the resulting data (Collier-Meek, Sanetti, & Boyle, 2016; Gresham et al., 2017).
Beyond reliability, student outcome data provide initial evidence of the convergent validity of treatment integrity measures when higher levels of behavior management and student prosocial behavior are correlated as theoretically expected (e.g., significant positive correlations; Collier-Meek et al., 2016). In spite of this evidence in support of validity, Sanetti and Collier-Meek (2014) evaluated multiple treatment integrity measures and noted method bias and contextual variations; estimates varied depending on the observation method, current classroom activity, and other factors (e.g., student behavior, observation timing). Thus, the ability of the current treatment integrity guidelines to steer the development of measures that produce data that sufficiently and soundly assess key aspects of behavior management remains unclear.
Established Self-Report Tools
Several self-report tools have been developed to facilitate implementers’ assessment of Tier 1 behavior management. PBIS practitioners frequently use the Classroom Management Self-Assessment Revised (Simonsen, Fairbank, Briesch, & Sugai, 2006), a teacher self-report tool that supports reflection on and improvements in classroom management implementation. This tool includes space to tally positive and negative student contacts to calculate a ratio of positive-to-negative interactions, as well as 10 items pertaining to classroom management practices (e.g., classroom structure, expectations, active engagement) which are rated as present or absent. With the Classroom Ecology Checklist, the implementer rates the extent to which specific behavior management practices aligned with six domains are present (i.e., no, somewhat, yes; Reinke, Herman, & Sprick, 2011). This checklist was designed to be used in conjunction with other data sources (e.g., implementer interview, observation) in the context of the Classroom Check Up, an established method of consultative support (Reinke, Lewis-Palmer, & Merrell, 2008). Although these measures are valuable for self-reflection as one component of treatment integrity assessment, findings consistently indicate that implementers overestimate their level of implementation, suggesting that self-appraisal may not be an appropriate method to estimate treatment integrity (Wickstrom, Jones, LaFleur, & Witt, 1998).
Established Observation Measures
Observational measures of teacher classroom practices with well-established psychometric properties include the Classroom Assessment Scoring System (CLASS; Pianta, La Paro, & Hamre, 2008) and the Classroom Strategies Scale–Observer Form (CSS-OF; Reddy, Fabiano, Dudek, & Hsu, 2013a). The CLASS posits a multilevel latent structure and measures the quality of classroom processes, including various features of student–teacher interactions. Items are rated on a 7-point scale and produce scores along three dimensions (emotional supports, classroom organization, and instructional supports) after an extended period of observation (Pianta & Hamre, 2009). The CSS-OF is a measure of instructional and behavior management practices that includes frequency counts, Likert-type scaling, and checklist items that are rated following two 30-min observations (Reddy et al., 2013a). The CLASS and CSS-OF are both general and comprehensive measures of teacher implementation and classroom characteristics, and they are used periodically to evaluate classroom practices (including Tier 1 behavior management) and facilitate data-driven professional development.
Gaps in Tier 1 Behavior Management Implementation Assessment Literature
As is the case with all assessment decisions, selecting an instrument to examine Tier 1 behavior management practices requires consideration of context, purpose, and feasibility. Both the CLASS (Pianta et al., 2008) and CSS-OF (Reddy et al., 2013a) are excellent candidates, as they were systematically developed, demonstrate sound psychometric properties, and provide reliable estimates of practice. To facilitate classroom teachers’ self-assessment and planning for improvement, existing self-report measures (e.g., Classroom Management Self-Assessment Revised; Simonsen et al., 2006) could be appropriate and feasible tools, despite concerns about the accuracy of self-report (e.g., Wickstrom et al., 1998). A range of tools developed under existing treatment integrity guidance might help estimate intervention implementation; however, the means and form of assessment depend on assessment and context factors (e.g., purpose, target behaviors, observation opportunities and duration, activities underway; Sanetti & Collier-Meek, 2014). There remains a need for a flexible, sensitive, and formative way to evaluate delivery of discrete behavior management practices that can be utilized across settings, within and beyond the classroom (hallways, playgrounds, OST). Whereas the above-described, established observational measures produce sound estimates of overall classroom management and teacher–student relations, emerging research employs systematic direct observation (SDO) to evaluate more discrete implementer behaviors. In this case, specific practices (e.g., praise statements) serve as target behaviors for instruction, observation, and measurement within a data-driven paradigm (Simonsen, MacSuga, Fallon, & Sugai, 2013). Research on the psychometric properties of SDO approaches is limited and additional work is needed to evaluate the reliability, validity, feasibility, and utility of data collected via SDO to assess key behavior management practices.
SDO of Tier 1 Behavior Management Implementation
SDO is a well-established, flexible measurement method wherein behavior is observed during a specified time period and systematic data collection procedures are applied to evaluate the specific dimensions of target behaviors (Cooper, Heron, & Heward, 2007; Suen & Ary, 1989). As it is not always possible or feasible to conduct continuous observation, studies often use time sampling or interval recording, during which the occurrence or nonoccurrence of a target behavior during a specified interval is coded according to specific decision rules (e.g., coded if present during entire interval; Cooper et al., 2007). Researchers have demonstrated that momentary time sampling, partial interval recording, and whole interval recording produce varying levels of accuracy depending on the presence of mixed intervals, where behavior both occurs and does not occur during a single interval (e.g., academically engaged and unengaged during a 30-s interval; Suen & Ary, 1989). Recent work focuses on sources of variance that emerge across these three methods (see Johnson, Chafouleas, & Briesch, 2017) and the reliability of behavior estimates when averaged over multiple time periods (see Ferguson, Briesch, Volpe, & Daniels, 2012).
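To make the three decision rules concrete, the following minimal sketch scores a hypothetical second-by-second behavior record under momentary time sampling, partial interval recording, and whole interval recording. The record, the 15-s interval length, and the function names are illustrative assumptions, not data or procedures from the studies cited above.

```python
from typing import List


def score_intervals(stream: List[bool], interval_len: int, method: str) -> List[bool]:
    """Code each interval of a second-by-second behavior record.

    stream       -- True for each second in which the target behavior occurs
    interval_len -- interval length in seconds
    method       -- "momentary", "partial", or "whole"
    """
    scored = []
    for start in range(0, len(stream) - interval_len + 1, interval_len):
        chunk = stream[start:start + interval_len]
        if method == "momentary":
            scored.append(chunk[-1])   # behavior present at the moment the interval ends
        elif method == "partial":
            scored.append(any(chunk))  # behavior present at any point in the interval
        elif method == "whole":
            scored.append(all(chunk))  # behavior present throughout the interval
        else:
            raise ValueError(f"unknown method: {method}")
    return scored


def prevalence(scored: List[bool]) -> float:
    """Percentage of intervals coded as an occurrence."""
    return 100.0 * sum(scored) / len(scored)


# Hypothetical 60-s record: behavior occurs during seconds 0-19 and 40-49,
# so three of the four 15-s intervals are mixed intervals.
record = [True] * 20 + [False] * 20 + [True] * 10 + [False] * 10
true_prevalence = 100.0 * sum(record) / len(record)
estimates = {
    method: prevalence(score_intervals(record, 15, method))
    for method in ("momentary", "partial", "whole")
}
print(true_prevalence, estimates)
```

In this toy record the behavior occupies 50% of the time; partial interval recording codes every interval as an occurrence, whole interval recording codes only the one unmixed interval, and momentary time sampling happens to match the true value. The match is specific to this example and should not be read as a general property.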
SDO has often been applied to evaluate student behaviors (Cooper et al., 2007) such as academic engagement (Johnson et al., 2017) and disruptive and off-task behaviors (Shapiro, 2011). As the application and measurement of behavior management strategies such as praise, references to behavior expectations, and correction can be defined and directly observed, they can be evaluated using SDO methodology (Cooper et al., 2007). To evaluate the effectiveness of varied implementation supports within a single case methodology, investigators have monitored teachers’ rates of specific praise (e.g., Simonsen et al., 2013), frequency of student interactions (Colvin et al., 1997), and ratios of praise to corrective statements (Ruberto, 2015). In these investigations, initial evidence of reliability was established through acceptable levels of interobserver agreement (Ruberto, 2015; Simonsen et al., 2013); however, additional research is needed to evaluate the soundness of SDO for implementer behaviors.
Purpose of Study
This study addresses a gap in the literature by detailing the development and initial investigation of the Measure of Active Supervision and Interaction (MASI). Inasmuch as the MASI applies SDO (i.e., momentary time sampling, frequency count) to assess the implementation of four discrete behavior management practices (i.e., Praise, Corrections, References to Behavior Expectations, and Active Supervision), it emerges from and reflects elements of existing and related measurement traditions; however, it focuses on the enactment of four specific behaviors that comprise key components of Tier 1 behavior management in nonclassroom settings. We attempted to appraise the reliability and validity of ratings from the MASI and to evaluate the utility of this measure among OST staff. Specifically, we sought to address three research questions:
Method
Participants and Context
Participants were involved in the study in two distinct phases: content validation and observations. Initial measure development and content validation preceded the reliability and validity appraisal, which was conducted across several OST settings. We used two program-wide measures of implementation as part of the overall appraisal of implementation and expected the collective performance of individual OST staff to be related to, yet distinct from, the summary scores on program-wide implementation (divergent validity).
Content validation
Prior to initiating the study, we recruited five researchers and one practitioner to participate in content validation of the MASI. Content validators were recruited based on their expertise and/or experience in OST programs and PBIS. Four (66.7%) held doctoral degrees in special education and school psychology, and two (33.3%) held master’s degrees in social work and human development and family studies. Two (33.3%) were involved in a larger OST and PBIS project (Farrell & Collier-Meek, 2014) but were not involved in measure development. All content validators were female.
Observations
This study was conducted within the implementation of a PBIS intervention (Positive Behavior Support in Out-of-School Time, Positive BOOST [Behavior in Out-of-School Time]; Farrell & Collier-Meek, 2014) across seven distinct programs in a northeastern state. In vivo observations were conducted of OST professionals (N = 147). No demographic information on the OST professionals is available. All OST programs were funded by 21st Century Learning Community Grants; thus, these programs involved academic enrichment, recreation activities or social-emotional learning, and family literacy activities for students attending high-poverty school districts. Programs were operated by public schools (n = 4), local nonprofit organizations (n = 2), and a charter school (n = 1). The average OST program operated for 12.8 hr per week (range = 10–16) for a total of 34.9 weeks during the school year (range = 30–40). Additional information about the OST sites and town characteristics is presented in Table 1.
Table 1. OST Program Sites, Observation Frequency and Percentage, and Town Characteristics.
Note. OST = out-of-school time.
a Data retrieved from https://nces.ed.gov/ccd/districtsearch/index.asp. b Data retrieved from http://datacenter.kidscount.org/
Within the study, three raters completed each MASI; two were female graduate students and one was a female undergraduate student. All were enrolled in education, psychology, and human development programs at a university in the Northeast and were specifically trained to participate in the implementation and activities research. The two graduate students had prior training and experience with SDO, whereas the undergraduate student had no relevant training or experience prior to beginning the study. Raters were actively involved in Positive BOOST, a project to incorporate PBIS into OSTs (see Farrell & Collier-Meek, 2014). Two raters were present at 27.21% of observations (n = 40).
Measures
MASI
As suggested above, the MASI was developed as a measure of four distinct components of implementation of Tier 1 behavior management practices among individual OST providers. A single administration of the MASI occurs during a 60-min observation interval divided into three 20-min observation periods. Each observation period concerns an individual OST professional and thus results in data corresponding to individual staff, enabling overall estimates of staff behavior during the observation interval. Prior to the onset of the interval, raters record general information including program name, setting, activity, number of students present, and rater name. The rater then randomly selects three of the OST professionals present to be observed. The order of observation is also random. If three or fewer professionals are present for the observation, then only random ordering of observations takes place. Data pertaining to each professional are collected within four sections that incorporate the evaluation of five distinct behaviors (see Table 2). The MASI summarizes observations in these four areas and does not contain a summary or overall score, as the component behaviors are discrete and not theorized to contribute to a larger construct per se.
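The selection rule described above can be sketched as follows. This is a minimal illustration that assumes the rater has a list of the professionals present; the function name, the seed parameter, and the staff labels are hypothetical and are not part of the MASI materials.

```python
import random
from typing import List, Optional


def select_professionals(present: List[str], k: int = 3,
                         seed: Optional[int] = None) -> List[str]:
    """Return up to k professionals in a random observation order.

    If k or fewer professionals are present, everyone is observed and only
    the order is randomized; otherwise k are drawn without replacement.
    """
    rng = random.Random(seed)  # seed is only for reproducible illustration
    if len(present) <= k:
        order = list(present)
        rng.shuffle(order)
        return order
    # random.sample draws without replacement and returns the draws in
    # selection order, so the result is both the sample and its ordering.
    return rng.sample(present, k)


# Hypothetical staff labels
staff = ["Staff A", "Staff B", "Staff C", "Staff D", "Staff E"]
chosen = select_professionals(staff, seed=1)
small_site = select_professionals(["Staff A", "Staff B"], seed=1)
print(chosen, small_site)
```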
Table 2. Measure of Active Supervision Behavior Definitions, Examples, Nonexamples, and Assessment Method.
Note. OSTP = out-of-school time professional.
We define Move, Scan, and Interact (MSI) as an OST professional moving throughout the space, scanning student behavior, or interacting with students, and evaluate it using momentary time sampling at 15-s intervals within a 10-min observation period. Praise, Correction, and Reference to Behavior Expectations (called Behavior Expectations from this point forward) are evaluated using a frequency count for a 10-min interval. We define Praise as the OST professional providing praise or otherwise acknowledging the student for desired behaviors and Correction as reprimanding or redirecting student(s) when undesired behavior is exhibited. Behavior Expectations are defined as an OST professional referencing program behavior expectations when engaging with student(s). After two 10-min observation periods occur, the observer makes two types of summative ratings. First, Praise, Correction, Behavior Expectations, and Nuisance Behaviors are evaluated through checklist ratings of adherence to specific behavioral components. That is, the rater checks (or does not check) items on a brief list of narrative descriptors. These ratings are intended to provide additional illustrative detail to complement and provide context for the quantitative data (frequency counts) provided when using the MASI to provide feedback. The definitions for Praise, Correction, and Behavior Expectations remain the same, while Nuisance Behavior is defined as undesired behaviors exhibited by students that indicate a mild disruption, have limited impact, and are not dangerous or escalating. Finally, the rater may record any relevant narrative notes about the observation. After all three OST professionals have been observed and all data have been recorded, the rater summarizes the overall findings of the observation by behavior and across the professionals. The final version of the MASI is included in the appendix.
BOQ-OST
The Benchmarks of Quality (BOQ; Kincaid, Childs, & George, 2010) is a measure intended to assist PBIS teams to identify areas of strength and areas of implementation in need of improvement. With permission, we adapted the BOQ, aligning it with the OST context. The original school-based coaches version of the BOQ includes 10 sections with 53 specific items to be rated on a rubric with scales from 0 to 1 and 0 to 4 based on operationally defined criteria (Kincaid et al., 2010). Internal consistency, reliability, and validity analyses indicated that the BOQ can produce data with adequate psychometric properties (see Cohen, Kincaid, & Childs, 2007). The systematic adaptation of the BOQ-OST included the development of an additional section, “Set the Stage”; three items that reflect the teaching of expectations in the setting (e.g., posting of expectations, explicit instruction, reinforcing routines); and the adjustment of item wording across the measure to reflect the OST context. For instance, references to “students” were changed to “participants,” and “classroom systems” was changed to “setting specific systems” to correspond to the OST context.
SET-OST
Similarly, the School-Wide Evaluation Tool (SET; Sugai, Lewis-Palmer, Todd, & Horner, 2001) was adapted with permission for this study to be aligned with the OST context. The original SET includes seven sections that include 30 items. To complete the SET, the evaluator reviews permanent products (e.g., handbooks, behavioral data) and completes systematic observations and interviews guided by the SET procedures. Based on the information gathered, items are rated and a total score is obtained. Internal consistency, reliability, and validity analyses have indicated that the SET can produce data with adequate psychometric properties (see Horner et al., 2004). The systematic adaptation of the System-Wide Evaluation Tool in Out-of-School Time (SET-OST) for this study included slight modifications to the data collection procedures and the adjustment of the item wording to reflect the OST context. For example, the suggested permanent products for review were expanded to include additional materials appropriate to OSTs (e.g., program handbook). Wording revisions included changing references from “students” to “participants,” and “referrals to the office” was changed to “referrals to the OST coordinator.”
Procedures
We developed the MASI by generating items consistent with the Tier 1 behavior management literature, reflecting the core components of PBIS, and aligned with the existing evidence on measurement. Initial content validity appraisal further informed and assisted the refinement of the MASI format and items, including operational definitions of the four key behaviors. Once the MASI was finalized, raters were trained to use the measure and then completed observations in OST programs. The content validation, measure training, and observations are described below.
Content validation
To provide evidence of content validity, five researchers and one practitioner familiar with OST programs and PBIS reviewed the initial version of the MASI. To do so, reviewers responded to six questions on a 7-point Likert-type scale ranging from strongly disagree (1) to strongly agree (7) using SurveyMonkey, an online survey website. Questions addressed (a) the clarity of items, (b) the clarity of directions, (c) the feasibility of the directions and procedures, (d) the alignment of the measure with PBIS, (e) the appropriateness of the measure for the elements of implementation assessed, and (f) the appropriateness of the measure for OST programs (McCoach, Gable, & Madura, 2013). In addition to general feedback, reviewers provided specific comments in an open response format. These ratings are described in the “Results” section. Based on the responses and feedback of the reviewers, several changes were made to the MASI. First, the title of the measure was revised to be specific to active supervision and interaction, rather than PBIS as a whole. This change was prompted by a reviewer who indicated that the original name was too broad, as the measure only addressed some aspects of PBIS. Second, directions were revised to clarify aspects of the measure that the reviewers indicated were confusing. This change was prompted by a relatively lower rating regarding the directions. Finally, minor copy edits noted by the reviewers were addressed.
Rater training
All three observers who participated in this study underwent training to criterion in the MASI prior to engaging in data collection. Training in the MASI included a didactic introduction to SDO generally and the MASI in particular. Orientation to SDO included instruction on types of behavior and varied decision rules for direct observation. Then, raters were introduced to the MASI (a) sections and format, (b) behavior definitions including examples and nonexamples, (c) procedures (i.e., randomly picking staff and then completing the measure), and (d) ratings (i.e., momentary time sampling, frequency count, checklist ratings). Raters then completed multiple ratings of a video clip of students and an OST professional from an actual OST program. Percentage agreement was calculated for momentary time sampling and checklist ratings by behavior, while total agreement was calculated by behavior for frequency count data. Once 90% agreement was achieved for all rating types and behaviors, raters were deemed ready to utilize the measure in programs.
Observations
MASI observations occurred within the context of a larger study designed to support OST program implementation of Positive BOOST (Farrell & Collier-Meek, 2014), an approach to staff development that includes PBIS curricula and materials adapted for the OST context. Within the present study, different intensities of implementation support were delivered to OST leadership as part of an effort to evaluate the level of training needed for successful adoption. Most observations occurred following initial implementation of Positive BOOST (65.98% in total sample; 72.50% in paired sample with two raters). Throughout study phases, raters conducted observations at consistent days and times, though at varied frequencies across programs (see Table 1). In the total sample, most observations occurred in a classroom (42.85%) or cafeteria (23.12%) with an average of 15.37 participants (SD = 13.79) present. Participants engaged in activities such as homework (32.65%), athletics/games (27.21%), or academic content (12.24%). In the paired sample, most observations occurred in a classroom (37.50%) or cafeteria (35.00%) with an average of 13.42 students (SD = 10.23) engaged in activities such as homework (15.00%), athletics/games (45.00%), or academic content (15.00%).
As described earlier, before each observation, raters randomly selected three OST professionals to observe using a random number generator. The raters then completed the form background information, reviewed the behavior definitions as needed, and prepared their stopwatches. For each of the three OST professionals, the raters independently recorded the (a) prevalence of MSI behavior using momentary time sampling for 10 min; (b) frequency of Praise, Correction, and Behavior Expectations behavior for 10 min; (c) checklist ratings of Praise, Correction, Behavior Expectations, and Nuisance Behaviors; and (d) any relevant narrative notes. Following the completion of the MASI, the raters summarized their observations. The SET-OST and BOQ-OST were completed at three time points for each program by the same raters.
Analyses
To evaluate the utility of the MASI, reliability and validity analyses were conducted.
Reliability
As described by Hintze (2005), the reliability of data derived from direct observation instruments can be characterized and assessed in multiple ways. Two of the most relevant for research situations are (a) interobserver agreement and (b) intraobserver reliability. We conducted both on the paired sample (n = 40 observations with two raters).
Interobserver agreement
We calculated three interobserver agreement indices. Percent agreement (i.e., frequency of intervals with agreement divided by the total number of intervals) was calculated to evaluate the agreement across raters and observation sessions for frequency count and momentary time sampling data. For frequency count data, exact agreement for the entire 10-min interval was required for the ratings to be considered in agreement. Then, the total number of observations with agreement was divided by the total number of observations. Coefficient Kappa (Cohen, 1960) was calculated to evaluate the agreement across raters for the momentary time sampling data. Two-way intraclass correlations for a single rater, also referred to as ICC(C,1) (Shrout & Fleiss, 1979), were calculated for frequency count data.
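The three indices can be illustrated with a short sketch. The function names and all data below are hypothetical, and the ICC function assumes the standard two-way, consistency, single-rater formula computed from the mean squares of a targets-by-raters table with two raters; it is a sketch under those assumptions rather than the analysis code used in this study.

```python
from collections import Counter
from typing import Sequence


def percent_agreement(a: Sequence, b: Sequence) -> float:
    """Percentage of paired codes (or paired counts) that match exactly."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a: Sequence, b: Sequence) -> float:
    """Chance-corrected agreement for two raters' categorical codes."""
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Expected agreement if the raters coded independently at their own base rates
    p_expected = sum(count_a[c] * count_b[c] for c in set(a) | set(b)) / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)


def icc_c1(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-way, consistency, single-rater ICC for two raters."""
    n, k = len(a), 2
    grand = (sum(a) + sum(b)) / (n * k)
    row_means = [(x + y) / k for x, y in zip(a, b)]
    col_means = [sum(a) / n, sum(b) / n]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((c - grand) ** 2 for c in col_means)
    ss_total = sum((v - grand) ** 2 for v in list(a) + list(b))
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)


# Hypothetical momentary time sampling codes (1 = occurrence) from two raters
mts_r1 = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
mts_r2 = [1, 1, 0, 0, 0, 0, 1, 1, 1, 1]
mts_agreement = percent_agreement(mts_r1, mts_r2)
mts_kappa = cohen_kappa(mts_r1, mts_r2)

# Hypothetical praise counts for four observations; rater 2 is always one higher
counts_r1 = [2, 0, 5, 3]
counts_r2 = [3, 1, 6, 4]
count_agreement = percent_agreement(counts_r1, counts_r2)
count_icc = icc_c1(counts_r1, counts_r2)
print(mts_agreement, round(mts_kappa, 3), count_agreement, count_icc)
```

The count example is deliberately extreme: exact agreement is 0% because the counts never match, yet the consistency ICC is 1.0 because the two raters order the observations identically. This shows why the indices answer different questions and are reported together.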
Intraobserver reliability
Although percent agreement, Kappa, and ICC values provide distinct estimates for the degree of consistency between raters, these calculations do not account for the unique structure of the data collected for this study; to wit, two to three randomly selected professionals were observed per observation period. Individual professional characteristics, circumstances during the overarching observation session, and the interaction between these variables may each influence the resulting ratings. For example, an OST program is observed on Monday, Tuesday, and Wednesday by two simultaneous raters, and three distinct, randomly selected professionals are observed on each day. The Monday observation session may have been a particularly difficult day due to a large number of substitute professionals being present within the program. In addition, the rater may be feeling particularly sympathetic and provide ratings that are distinct from those of the other rater. Finally, there may be interactions between these potential sources of variance; raters may rate specific types of professionals in different ways, creating a rater by professional interaction that influences resulting ratings. Thus, the day of the observation (session, or s), the behavior of each OST professional being rated (professional, or p), the rater (r), and the interaction between these variables may be expected to influence the rating that was given to each OST professional.
To address the complex nature of these data alongside the multiple potential sources of variance in ratings and their interactions, we used variance partitioning analyses to determine the extent to which variance in ratings was influenced by desirable (e.g., actual variations in the behavior of the object of measurement) versus undesirable (e.g., rater variations) sources (Briesch, Swaminathan, Welsh, & Chafouleas, 2014). Specifically, a nested two-facet model, (p:s) × r, was utilized to determine the amount of variance in a given dependent variable attributable to (a) OST professionals, who were nested within observation session, and (b) rater, treated as randomly selected, and their interactions. It is critical to note that the nested structure of the data precludes an ability to disentangle the effect of professional from the interaction between professional and session. The measurement model was applied to ratings of MSI, Praise, and Correction behaviors as scored using the MASI. Given the extremely limited variance observed in the Behavior Expectations ratings, these data were not subjected to variance partitioning analyses. We transformed all ratings into prevalence/rate for analyses; MSI was expressed as the percentage of intervals scored as an occurrence of the target behavior, while the frequency counts of Praise and Correction behaviors were each divided by the observation duration (i.e., 10 min).
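The prevalence/rate transformation can be written out in a few lines. The values below are hypothetical; the only figures taken from the text are the 15-s sampling interval and the 10-min observation period, which together imply 40 momentary time sampling intervals per period.

```python
from typing import Sequence


def msi_prevalence(interval_codes: Sequence[int]) -> float:
    """Percentage of intervals scored as an occurrence of MSI (1 = occurrence)."""
    return 100.0 * sum(interval_codes) / len(interval_codes)


def rate_per_minute(count: int, minutes: float = 10.0) -> float:
    """Frequency count divided by the observation duration in minutes."""
    return count / minutes


# 10 min sampled every 15 s gives 600 / 15 = 40 intervals
n_intervals = (10 * 60) // 15

# Hypothetical observation: MSI scored in 30 of 40 intervals,
# 7 praise statements and 2 corrections during the 10-min count
codes = [1] * 30 + [0] * 10
msi_pct = msi_prevalence(codes)
praise_rate = rate_per_minute(7)
correction_rate = rate_per_minute(2)
print(n_intervals, msi_pct, praise_rate, correction_rate)
```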
Validity
To evaluate the extent to which the MASI data were reflective of the concepts purportedly being assessed (Shadish, Cook, & Campbell, 2002), three types of validity were assessed. Content validity was evaluated through analysis of the content validation data. Convergent validity and discriminant validity were evaluated through comparisons of MASI, SET-OST, and BOQ-OST data. The SET-OST and BOQ-OST were selected for this comparison because program-wide implementation relies on the treatment integrity of individual staff. Therefore, we expected a relationship between these data sources. These types of validity and the associated analyses are described further below.
Content validity
Content validity involves whether the items are representative of the broader concept that they purport to measure (Hintze, 2005). The items from the content validation were used to provide an initial assessment of content validity. Specifically, content validators rated whether the measure was (a) well aligned with PBIS, (b) appropriate for elements of implementation assessed, and (c) appropriate for OST programs on a 7-point Likert-type scale from very much disagree to very much agree. Furthermore, content validators rated (a) whether the items were clear and understandable, (b) whether directions were clear and understandable, and (c) whether directions were feasible and appropriate for the measure on a 7-point Likert-type scale from very much disagree to very much agree.
Convergent validity
Convergent validity is an aspect of construct validity that examines whether items correlate as expected with particular variables (Hintze, 2005). Convergent and discriminant validity (described next) were evaluated using the Spearman correlation coefficient (i.e., Spearman’s rho) due to the ordinal nature of SET and BOQ ratings, as well as the nonnormal distributions of mean MASI ratings; correlation coefficients were applied to mean ratings from the MASI as paired with individual ratings from the SET-OST and BOQ-OST. These analyses were conducted using MASI ratings derived from the total sample (N = 147 observations). Means were calculated to provide a single summative rating against which to compare data from each administration of the SET-OST and BOQ-OST. Mean ratings were calculated for each of the four core MASI behaviors for the relevant date ranges aligning to each SET-OST and BOQ-OST administration, resulting in 24 rows of observations aligned with the three SET-OST and BOQ-OST administrations across eight programs (3 × 8 = 24).
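For readers unfamiliar with the statistic, Spearman's rho is the Pearson correlation computed on ranks, with tied values assigned the mean of their ranks. The sketch below is a plain-Python illustration under that definition; the function names and the five paired values are hypothetical and are not data from this study.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical rows: mean MASI Praise rating paired with a section score.
mean_praise = [1.0, 2.5, 3.0, 4.2, 5.1]
section_score = [10, 30, 20, 40, 50]
example_rho = spearman_rho(mean_praise, section_score)
```

Because only the ordering of values enters the calculation, the coefficient is unaffected by the skew in the mean MASI ratings noted above.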
We hypothesized that MSI would be modestly correlated with SET-OST sections (a) system for rewarding expectations and (b) system for responding to violations, as well as BOQ-OST sections (a) effective procedures for dealing with discipline, (b) expectations and rules developed, (c) set the stage, (d) reward/recognition program established, and (e) setting-specific overall. We hypothesized that Praise would be modestly correlated with SET-OST sections (a) expectations defined, (b) expectations taught, and (c) system for rewarding expectations, as well as BOQ-OST sections (a) effective procedures for dealing with discipline, (b) expectations and rules developed, (c) set the stage, (d) reward/recognition program established, (e) lesson plans for teaching expectations/rules and routines, and (f) setting-specific overall. We hypothesized that Correction would be negatively correlated with the SET-OST section, a system for responding to violations, as well as BOQ-OST section, effective procedures for dealing with discipline. We hypothesized that Behavior Expectations would be modestly correlated with SET-OST sections (a) expectations defined, (b) expectations taught, and (c) system for rewarding expectations, as well as BOQ-OST sections (a) expectations and rules developed, (b) reward/recognition program established, (c) lesson plans for teaching expectations/rules and routines, and (d) setting-specific systems. For all these comparisons, modest correlations were expected to capture the relationship between individual staff treatment integrity and program-wide implementation.
Discriminant validity
Discriminant validity is an aspect of construct validity that involves whether an item is not correlated with variables that it should not be correlated with (Hintze, 2005). Overall, we hypothesized that MSI, Praise, Correction, and Behavior Expectations would not be correlated with SET-OST sections related to (a) monitoring and decision-making, (b) management, and (c) broad support, as well as BOQ-OST sections (a) PBIS leadership, (b) staff commitment, (c) data entry and analysis plan, and (d) evaluation.
Results
Following overall descriptive statistics, the reliability and validity of the data collected using the MASI, as evaluated based on content validation, variance partitioning analyses, and comparisons with other measures, are described below.
Descriptive Statistics
Descriptive statistics across both samples are presented in Table 3. For the total sample (N = 147), the mean percentage of intervals of MSI was 89.64% (SD = 14.49). In the total sample, OST professionals, on average, used Praise (M = 3.28) and Corrections (M = 3.46) at about the same level, but infrequently referred to Behavior Expectations (M = 0.28). For the paired sample (n = 40), the mean percentage of intervals of MSI was 90.70% (SD = 14.61). OST professionals, on average, praised 3.91 times and provided 3.25 corrections during the observations. In the paired sample, OST professionals, on average, praised (M = 4.15) slightly more often than they provided corrections (M = 3.98), but relatively infrequently referred to behavior expectations (M = 0.95).
Table 3. Descriptive Statistics and Reliability Across Measure of Active Supervision and Interaction Variables.
Note. ICC = intraclass correlations.
Reliability
Interobserver agreement
In evaluating the paired sample for interobserver agreement, overall percentage agreement findings indicate that the MASI was independently completed by two raters with high rates of agreement (see Table 3). MSI was completed with 82.5% agreement, frequency behaviors ranged from 100.0% (Behavior Expectations) to 82.5% (Correction) agreement, and behavior characteristics were rated above 90% agreement, with seven of the behavior characteristics at 100.0% agreement. A kappa coefficient of .755 was observed for MSI, indicating moderate levels of agreement beyond those expected from chance. The two-way, single-person ICC values for consistency between raters, calculated for ratings of 40 professionals over 14 observation sessions, suggested that the frequency ratings for Praise (ICC = .994, 95% CI = [.989, .997]), Correction (ICC = .983, 95% CI = [.969, .991]), and Behavior Expectations (ICC = 1.000) were conducted with a high degree of consistency.
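To make the two agreement indices concrete, the sketch below computes percentage agreement and Cohen's kappa for two raters. It is an illustration only: the function names and the ten paired codes are hypothetical, and the unit of comparison (e.g., intervals or whole observations) is an assumption, because the scoring unit is not specified in this section.

```python
def percent_agreement(rater_a, rater_b):
    """Percentage of paired codes on which the two raters agree."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty sequences of equal length")
    matches = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return 100.0 * matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Agreement beyond chance: (p_o - p_e) / (1 - p_e)."""
    n = len(rater_a)
    p_observed = percent_agreement(rater_a, rater_b) / 100.0
    categories = set(rater_a) | set(rater_b)
    # Chance agreement from each rater's marginal proportions per category.
    p_expected = sum(
        (list(rater_a).count(c) / n) * (list(rater_b).count(c) / n)
        for c in categories
    )
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical paired codes (1 = behavior scored, 0 = not scored).
codes_a = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
codes_b = [1, 1, 1, 1, 1, 0, 0, 0, 0, 1]
example_agreement = percent_agreement(codes_a, codes_b)
example_kappa = cohens_kappa(codes_a, codes_b)
```

In this hypothetical case the raters agree on 80% of codes, yet kappa is noticeably lower because more than half of that agreement would be expected by chance given the marginal proportions. The same logic explains why a kappa value can sit well below the corresponding percentage agreement.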
Intraobserver reliability
In evaluating the paired sample for intraobserver reliability, results of variance partitioning analyses suggested that for the MSI, Praise, and Correction variables, sources of rating variance were generally attributed to the behavior of the OST professional being observed (which, due to its nesting within session, cannot be disentangled from the effect of session on the professional; see Table 4). Variance in ratings of MSI was completely attributable to the professional:session facet (100%), while 91% of variance in Praise ratings was attributable to the professional:session facet. Rating variance in Correction behavior was chiefly attributable to two sources: professional:session (54.2%) and session (44.6%), with a small amount of variance attributable to the residual term (1.2%). No variance in ratings for any of the three analyzed behaviors was attributable to the rater facet, which is consistent with the high agreement indices observed.
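For reference, the percentages above can be read against the conventional generalizability theory decomposition for a (p:s) × r design with random facets. The notation below is the standard textbook form and is offered as an assumption about how the components were defined; the exact terms estimated in this study are those reported in Table 4.

```latex
% Total observed-score variance for professionals (p) nested in sessions (s),
% crossed with raters (r). The p:s term confounds p with the p x s interaction,
% and the last term confounds the remaining interactions with residual error.
\sigma^2(X_{psr}) = \sigma^2_{s} + \sigma^2_{p:s} + \sigma^2_{r}
                  + \sigma^2_{sr} + \sigma^2_{(p:s)r,e}

% Percentage of variance attributable to a given effect \alpha:
\%\,\mathrm{Var}(\alpha) = 100 \times
  \frac{\sigma^2_{\alpha}}{\sigma^2_{s} + \sigma^2_{p:s} + \sigma^2_{r}
        + \sigma^2_{sr} + \sigma^2_{(p:s)r,e}}
```

Under this reading, the Correction result corresponds to the professional:session, session, and residual terms accounting for 54.2%, 44.6%, and 1.2% of the total, respectively, with the rater terms estimated at zero.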
Table 4. Variance Component Estimates and Percentages of Variance for (p:s) × r Model.
Validity
Content validity
The content validation items provided an assessment of content validity. Content validators indicated that they generally agreed items were clear and understandable (M = 6.33, SD = 0.81), slightly agreed that directions were clear and understandable (M = 5.00, SD = 1.26), and agreed directions were feasible and appropriate for the measure (M = 6.00, SD = 0.63). Furthermore, content validators indicated that they agreed the measure was well aligned with PBIS (M = 6.00, SD = 1.54), agreed with the appropriateness of the elements of implementation assessed (M = 6.33, SD = 1.02), and agreed to very much agreed that the measure was appropriate for an OST program (M = 6.5, SD = 0.83).
Convergent validity
Spearman’s rank-order correlation coefficients between the MASI and BOQ-OST and SET-OST are presented in Table 5. After correcting for familywise error with the Holm method, no correlations were statistically significant. The lowest corrected p value observed was .160 (“Praise” with “BOQ: Reward/recognition program established,” rho = .595). We hypothesized that MSI would be modestly correlated with SET-OST sections (a) system for rewarding expectations and (b) system for responding to violations, as well as BOQ-OST sections (a) effective procedures for dealing with discipline, (b) expectations and rules developed, (c) set the stage, (d) reward/recognition program established, and (e) setting-specific overall. Correlation analyses indicate that MSI was not significantly correlated with SET-OST sections (a) expectations taught (.540) and (b) system for rewarding expectations (.457), or with the SET overall score (.441). MSI ratings were also not significantly correlated with BOQ-OST sections (a) set the stage (.441) and (b) reward/recognition program established (.494).
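The Holm step-down correction referenced above can be sketched as follows. This is a generic illustration of the procedure, not the study's analysis code; the function name and the four example p values are hypothetical.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p values in the original order.

    The smallest p value is multiplied by m, the next by m - 1, and so on;
    adjusted values are capped at 1 and forced to be non-decreasing.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical family of four uncorrected p values.
raw_p = [0.01, 0.04, 0.03, 0.20]
adjusted_p = holm_adjust(raw_p)
```

With a large family of comparisons the multiplier applied to the smallest p value is correspondingly large, which is why a sizable rho can remain nonsignificant after correction when many correlations are tested on few rows.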
Table 5. Spearman Correlations Between the Measure of Active Supervision and Interaction and the System-Wide Evaluation Tool–OST and BOQ–OST.
Note. OST = Out-of-School Time; BOQ = Benchmark of Quality; PBS = Positive Behavior Support.
ᵃMeasures were adapted for the OST program context with permission from the original authors. ᵇExpected to be modestly correlated to provide evidence of convergent validity. ᶜExpected to not be correlated to provide evidence of discriminant validity.
We hypothesized that Praise would be modestly correlated with SET-OST sections (a) expectations defined, (b) expectations taught, and (c) system for rewarding expectations, as well as BOQ-OST sections (a) effective procedures for dealing with discipline, (b) expectations and rules developed, (c) set the stage, (d) reward/recognition program established, (e) lesson plans for teaching expectations/rules and routines, and (f) setting-specific overall. Correlation analyses indicate that Praise was not significantly correlated with SET-OST sections (a) expectations taught (.477) and (b) system for rewarding expectations (.566). Praise ratings were also not significantly correlated with BOQ-OST sections set the stage (.441), lesson plans for teaching expectations/rules and routines (.468), setting-specific systems (.530), and evaluation (.443), or with the BOQ overall (.485).
We hypothesized that Correction would be negatively correlated with the SET-OST section system for responding to violations, as well as the BOQ-OST section effective procedures for dealing with discipline. Correlation analyses indicate that Correction was not significantly correlated with the SET-OST section system for responding to violations (–.416) or with the BOQ-OST section set the stage (.563).
We hypothesized that Behavior Expectations would be modestly correlated with SET-OST sections (a) expectations defined, (b) expectations taught, and (c) system for rewarding expectations, as well as BOQ-OST sections (a) expectations and rules developed, (b) reward/recognition program established, (c) lesson plans for teaching expectations/rules and routines, and (d) setting-specific systems. Correlation analyses indicate that Behavior Expectations was not significantly correlated with the SET-OST section monitoring and decision-making (.402) or the BOQ-OST section data entry and analysis (.494).
Discriminant validity
Correlations between the MASI and BOQ-OST and SET-OST are presented in Table 5. Overall, we hypothesized that MSI, Praise, Correction, and Behavior Expectations would not be correlated with SET-OST sections related to (a) monitoring and decision-making, (b) management, and (c) broad support, as well as BOQ-OST sections (a) PBIS leadership, (b) staff commitment, (c) data entry and analysis plan established, and (d) evaluation. Correlation analyses indicate that SET-OST sections management and broad support, as well as BOQ-OST section staff commitment, did not demonstrate significant correlations with MSI, Praise, Correction, and Behavior Expectations (see Table 5). Behavior Expectations was not significantly correlated with monitoring and decision-making and data entry and analysis plan established.
Discussion
The use of research-based Tier 1 behavior management practices, such as high rates of praise, use of behavior expectations, and low levels of correction, is associated with positive outcomes for students (Bradshaw et al., 2009; Newcomer et al., 2009). Unfortunately, education professionals such as OST professionals and teachers rarely deliver these strategies consistently and require ongoing implementation support (Reddy et al., 2013b; Ruberto, 2015). To provide that support, ongoing assessment of Tier 1 behavior management implementation is needed, and some emerging tools (Gresham et al., 2017) and strongly supported, classroom-focused measures are available (Pianta & Hamre, 2009). Initial research has utilized SDO methodology to feasibly and flexibly evaluate implementer behavior (e.g., Simonsen et al., 2013). To this end, we developed the MASI to measure OST professionals’ Praise and Correction Statements, References to Behavior Expectations, and Active Supervision, and conducted observations by multiple raters in seven OST programs. Findings suggest that the MASI can be completed by two raters with high agreement; ratings are attributable to desirable sources of variance for most behaviors; content validators positively rate the measure constructs and clarity; and results were not significantly correlated with components of the SET-OST and BOQ-OST.
Interobserver reliability analyses suggest that the MASI data reported here were completed with high levels of agreement. Intraobserver analyses, conducted using variance partitioning analyses, suggested that the majority of variance in ratings for MSI and Praise was attributable to the professional and/or the interaction between the professional and the session wherein they were observed. For Correction behaviors, just under half of the variance in ratings was attributable to the session independent of the professional and the professional/session interaction. That is, aspects of the session during which the observation took place (e.g., Monday afternoon, math day, new room) were almost as influential on rating variance as the professional within the session. In other words, the ratings of Correction may be influenced by the overall session nearly as much as by the person who is expressing the behavior and that person’s interaction with the session.
The extremely limited variance in the Behavior Expectations behavior precluded its use in variance partitioning analyses, and suggests that this variable requires additional attention in order for Behavior Expectation frequency count data to be used in this measure. Further development should focus on examination of the Behavior Expectation behavior definition, and whether Behavior Expectation is better characterized as a state behavior (e.g., better measured using time sampling) than as an event behavior (e.g., using frequency counts). Furthermore, the low frequency and limited variability in the Behavior Expectation data may have affected the interobserver agreement of the behavior characteristics ratings related to this construct. Ratings of Expectations Posted and Expectations Reinforced had only modest levels of agreement per Kappa. It is possible that revisions to the Behavior Expectation definition and measurement could have a commensurate impact on adjusting the agreement of these ratings.
Evidence for the validity of the MASI was collected through initial content validation and correlations between the MASI and measures of PBIS implementation adapted for OST settings. Content validators agreed that the items were clear, appropriate for the setting, and aligned with specific behavior management practices and PBIS. After correcting for multiple comparisons, correlation coefficients between the MASI and the SET-OST and BOQ-OST did not indicate a significant relationship between the rankings of results from MSI, Praise, Correction, and Behavior Expectations and specific factors of the PBIS implementation measures. In general, these correlations were in the expected directions, providing evidence of convergent and divergent validity. However, some modest and unexpected correlations (e.g., MSI with SET-OST Expectations Taught, Praise with BOQ-OST Evaluation) may suggest a more general relationship between the specific practices on the MASI (e.g., Praise, Active Supervision) and aspects of PBIS implementation than initially expected. Nevertheless, given that none of the correlations were determined to be significant, these results are extremely tentative and potentially no different from zero. Thus, evidence for the convergent and discriminant validity of data derived from the MASI when compared with results from the SET-OST and BOQ-OST is still absent at this time. Future research may evaluate the correspondence between MASI scores and other measures of behavior management. It would be expected that the correlations between the MASI and staff-level measures would be higher than the correlations reported here between the MASI and SET-OST and BOQ-OST data.
Limitations
This initial assessment of the MASI has limitations. The three raters in this study were enrolled at a research-oriented university. Although the rater training is documented here and may be replicated by others, findings may not be generalizable to different types of raters (e.g., OST leaders) and OST contexts. Future research should document the reliability of data collected via the MASI by other raters. In addition, the OST professionals evaluated here were involved in a larger project evaluating training and support of Positive BOOST implementation. No OST professional demographic data were collected. Because this investigation was conducted within a larger study, the participant data may not be representative of wider OST populations and, furthermore, the varied phases of the Positive BOOST project may have influenced OST professional behavior. Future studies may utilize the MASI to evaluate OST staff behavior in programs unassociated with Positive BOOST or in other settings in which students participate, such as school.
Although the limited number of observations utilized in this study precluded more fine-grained analyses, future research should consider the role of program implementation upon the validity and reliability of data derived from the MASI, as well as consider the use of a design that would permit the examination of individual professional-level variance disentangled from session. Also, the quantitative data produced by the MASI were the focus of this study, and the suitability and information provided by the checklist ratings were not evaluated. Future research should evaluate the extent to which the checklist ratings and narrative recording are reliable, valid, and informative.
Last, the MASI data were compared with ratings of the SET and BOQ that were adapted for the OST setting. Although prior analyses have indicated that the original measures can produce data with adequate psychometric properties in school settings (Cohen et al., 2007; Horner et al., 2004), the OST adaptations used in this study have not been assessed in this way. Furthermore, the a priori hypotheses for convergent and divergent validity analyses between the MASI and the SET and BOQ were identified by the authors alone. Future research could include an expert panel not otherwise involved in the research to provide their impressions of the expected relationships and evaluate the relationship between the MASI and other measures that include items with behavior management practices (e.g., CLASS, Pianta & Hamre, 2009; CSS-OF, Reddy et al., 2013a).
Implications for Research
Despite the importance of Tier 1 behavior management strategies to prevent and address problem behavior (Kern & Clemens, 2007; Simonsen et al., 2008), there is relatively limited research on related implementation measures, outside of the CLASS (Pianta & Hamre, 2009) and CSS-OF (Reddy et al., 2013a), comprehensive measures of classroom instructional and behavioral practices. SDO is typically applied to evaluate student behavior (Suen & Ary, 1989), but may also have utility in the assessment of teacher behavior (e.g., Colvin et al., 1997; Simonsen et al., 2013). As applied here, some evidence suggested that the MASI was an appropriate measure, particularly related to the assessment of Praise and Correction Statements as well as Active Supervision. Additional research is needed to refine the Behavior Expectation definition and measurement, which might provide insight about why this behavior was rated with comparatively less agreement. Further research may also evaluate the use of the MASI in settings outside of OST programs, such as classrooms, because the constructs that the measure assesses are likely relevant to settings outside of OST (Newcomer et al., 2009). Research could also assess how sensitive to change MASI data are and, in doing so, evaluate the utility of this measure for providing feedback to implementers about their behavior. Furthermore, the findings on the MASI suggest that SDO may be applied to other adult implementation behaviors, such as prompting and providing choices, although additional research is needed. Overall, SDO may be a promising methodology for future treatment integrity assessment research.
Implications for Practice
These findings provide some evidence to support the use of the MASI to evaluate OST staff implementation of behavior management practices, with the exception of references to Behavior Expectations. Implementation of research-based behavior management practices is critical, yet doing so consistently is challenging (e.g., Reddy et al., 2013b); accordingly, it is important to monitor implementation regularly. To address this need for monitoring, the measure might provide one option for assessing key Tier 1 strategies and providing targeted performance feedback or support. That is, the MASI could be used on a regular basis to facilitate data-driven performance feedback for staff and ensure consistent implementation of these Tier 1 strategies. Whereas it seems likely that individual staff performance and program-level implementation are related, the exact character of those relations is not clear.
Appendix
MASI-OST/Observation of OST Professional (OSTP) OST #l/Code:

| OST program: | Setting: | Observer 1 |
| Number of students present: | Activity: | Observer 2 (NA) |

Plan to randomly select three OSTPs to observe and record observations separately. First, select OSTPs using random number generator. Complete above background information. Review behaviors and definitions. Then, (a) complete momentary time sampling of MSI in 15 s intervals, and (b) take a frequency count of reinforcement, correction, and behavior expectations for 10 continuous minutes. Immediately following the administration, review behavior characteristics (in italics) and record if they were present during the 10 min. Write any clarifying narrative notes. Summarize observations on page 5.

SYSTEMATIC DIRECT OBSERVATIONS Start Time:

Move, Scan, Interact (MSI): OSTP actively moving throughout the space, scanning student behavior, or interacting with student(s).

| Interval | 0:15 | 0:30 | 0:45 | 1:00 | 1:15 | 1:30 | 1:45 | 2:00 | 2:15 | 2:30 | 2:45 | 3:00 | 3:15 | 3:30 | 3:45 | 4:00 | 4:15 | 4:30 | 4:45 | 5:00 |
| MSI | | | | | | | | | | | | | | | | | | | | |
| Interval | 5:15 | 5:30 | 5:45 | 6:00 | 6:15 | 6:30 | 6:45 | 7:00 | 7:15 | 7:30 | 7:45 | 8:00 | 8:15 | 8:30 | 8:45 | 9:00 | 9:15 | 9:30 | 9:45 | 10:00 |
| MSI | | | | | | | | | | | | | | | | | | | | |

FREQUENCY OBSERVATIONS Start Time:

| Reinforcement (Reinforce/Be positive): OSTP praises or acknowledges student(s) for desired behaviors. | Correction: OSTP reprimands, corrects student(s) when undesired behavior is exhibited. | Behavior Expectations (BE): OSTP references behavior expectations when engaging with student(s). |
| Frequency 10 min | Frequency 10 min | Frequency 10 min |
| ❑ Specific: identifies skill/behavior student exhibited | ❑ Specific: identifies skill/behavior student exhibited | ❑ BE posted in area of activity (if indoors) |
| ❑ Immediate: provided asap following desired behavior | ❑ Immediate: provided asap following undesired behavior | ❑ BE adherence reinforced: students praised for adherence |
| ❑ Appropriate: to student, setting, behavior exhibited | ❑ Redirection: Accompanied by redirection | Nuisance Behaviors: Undesired behaviors, mild disruption, not dangerous, not escalating, limited impact. Note: no frequency data collected for nuisance behavior. |
| ❑ Delivered across many students in program | ❑ Brief duration: correction is less than 30 s | |
| ❑ Refers to behavior expectations and/or routines | ❑ Praise follows shift to desired behavior | |
| | ❑ Refers to behavior expectations and/or routines | ❑ / NA  Ignored: no attention given to nondesired behaviors |
| Narrative notes: | | ❑ / NA  Praise is delivered to students engaged in appropriate behaviors |
| | | ❑ / NA  Praise immediately follows shift from non-desired to desired behavior |
| | | ❑ / NA  Different responses to nuisance and problem behavior |
Note. MASI = Measure of Active Supervision and Interaction; OST = out-of-school time.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The authors gratefully acknowledge the support of the Connecticut State Department of Education (CSDE), in particular, Shelby Pons, LMSW in the Office of Health/Nutrition, Family Services and Adult Education; Betsy Leborious, Kaitlyn O’Leary, Kimberly Brewer, and Gerald Barrett at the Capital Region Education Council (CREC); and the UConn Center for Applied Research in Human Development. The opinions expressed are those of the authors and do not represent views of the CSDE or CREC.
