Abstract
Treatment integrity is an important concern in treatment centers but is often overlooked. Performance feedback is a well-established approach to improving treatment integrity, but is underused and undervalued. One way to increase its value to treatment centers may be to reveal unrealized benefits for the observer who collects the performance feedback data. This “observer effect” could increase the value of performance feedback and promote more consistent evaluation of treatment integrity. The purpose of this investigation was to evaluate the observer effect on treatment integrity. Five supervisors who worked in a day treatment center were asked to collect performance feedback data on staff members’ integrity in following a standard treatment protocol that supervisors were also expected to follow. Results showed an immediate and marked improvement in treatment integrity in three supervisors who collected but never received performance feedback. For two supervisors, this effect was reversed and replicated. Implications are discussed.
Treatment integrity is an important concern in treatment settings. It is described as the degree to which a treatment is implemented as intended (Gresham, 1989; Yeaton & Sechrest, 1981), and good outcomes tend to follow when an effective treatment is implemented with a high degree of integrity (Fixsen, Naoom, Blasé, Freidman, & Wallace, 2005). When treatments are implemented with high integrity, consultants and program supervisors who design interventions are able to draw valid conclusions regarding functional relations between the intervention and the targeted response (i.e., internal validity), allowing them to determine necessary modifications to promote client outcomes. High integrity can also provide accurate information regarding the extent to which the intervention can generalize to other settings, individuals, or contexts (i.e., external validity), which is an essential component for maintaining client progress. Conversely, poor treatment integrity can hinder data-based decision making and jeopardize client programming and behavior plans. Furthermore, poor treatment integrity can cause problematic behaviors in students and affect academic performance, social skill development, and adaptive functioning (Gresham, 2005).
A number of factors can contribute to poor treatment integrity, including the complexity of the intervention, the time and materials required to implement the intervention, the number of people involved, and the motivation of teachers (Gresham, 1989, 1998). Additional factors include perceived effectiveness (i.e., the extent to which a person believes a treatment has an impact on a targeted response) and social validity (Lane, Bocian, MacMillan, & Gresham, 2001). Furthermore, research has shown experienced educators to be as vulnerable to poor integrity as novice staff (e.g., Codding, Livanis, Pace, & Vaca, 2008; Hagermoser Sanetti, Luiselli, & Handler, 2007) because of less frequent supervision and irregular feedback, which can contribute to the likelihood of treatment “drift,” or the gradual alteration of a protocol (Peterson, Homer, & Wonderlich, 1982). The myriad of factors related to poor treatment integrity highlights the importance of consistently monitoring those who are responsible for implementing interventions.
Performance feedback is one strategy that has been demonstrated to be an effective way to increase levels of treatment integrity across staff with different levels of experience. By summarizing and conveying information about an individual’s performance relative to treatment protocols or rules (Brinko, 1993), performance feedback can be used to identify any nonprescribed alterations that have been made by comparing the observed behavior with the desired behavior. Performance feedback has been shown to effectively improve treatment integrity for behavior plans (Noell et al., 2005) and academic interventions (Noell et al., 2000), providing a way to correct and improve staff skills and treatment integrity across a range of protocols. Moreover, it has been used to improve treatment integrity in a variety of settings, including general education classrooms (Reinke, Lewis-Palmer, & Martin, 2007; Reinke, Lewis-Palmer, & Merrell, 2008; Witt, Noell, LaFleur, & Mortenson, 1997), inclusion and special education classrooms (Codding et al., 2008; Codding, Feinburg, Dunn, & Pace, 2005; Hagermoser Sanetti et al., 2007), and residential institutions (e.g., Fleming & Sulzer-Azaroff, 1992; Greene, Willis, Levy, & Bailey, 1978; Panyan, Boozer, & Morris, 1970; Quilitch, 1975).
Despite repeated demonstrations of performance feedback as a well-researched solution, administrators often do not support regular assessment of treatment integrity (Cochrane & Laux, 2008). Time and effort have been identified as common barriers to assessing treatment integrity, along with cost and labor constraints (Cochrane & Laux, 2008; Perepletchikova, Hilt, Chereji, & Kazdin, 2009). Indeed, investigators have illustrated how time-consuming performance feedback can be, spending up to 40 min on individual observations and 10 to 12 min delivering verbal feedback (e.g., Codding et al., 2005; Codding et al., 2008; Hagermoser Sanetti et al., 2007).
In an effort to reduce the time and effort associated with collecting data for performance feedback, some investigators have looked to reduce the length of the observations. For example, Reinke and colleagues (2007, 2008) demonstrated the effectiveness of 10- to 20-min assessments as alternatives to lengthier observations. They showed that visual performance feedback, which does not require any verbal exchange and simply displays performance on a graph, was as effective as verbal feedback and less time-consuming.
Interestingly, most of the performance feedback research has focused on the effects on the individual who is observed and receives feedback (e.g., Codding et al., 2005; Codding et al., 2008; Hagermoser Sanetti et al., 2007; Quilitch, 1975; Witt et al., 1997). Little attention has been directed toward examining the reactive effects on the observer who is responsible for collecting the performance feedback data. However, there is reason to believe that the actual process of observing and collecting performance feedback data may produce an observer effect, that is, a type of reactivity that leads to the improvement of the treatment integrity of the person collecting that data. Nelson and Hayes (1981) proposed that the entire process of recording behavior is responsible for causing this type of reactivity in the observer. Their claim was supported by research from self-monitoring literature that demonstrated reactivity even when the targeted behavior and self-recording did not occur (e.g., Lipinski, Black, Nelson, & Ciminero, 1975; Maletzky, 1974) as well as studies that occasioned reactivity with the presence of self-recording devices, even when self-recording no longer occurred (e.g., Broden, Hall, & Mitts, 1971; Maletzky, 1974).
Direct evaluation of this observer effect has been conducted in studies of occupational safety. Alvero and colleagues have demonstrated marked improvements in the safety behavior of observers after the observers were involved in conducting observations of the safety performance of others (Alvero & Austin, 2004; Alvero, Rost, & Austin, 2008). However, these studies were conducted in analogue settings and used confederates as the observees rather than actual employees. Thus, questions remain about the extent to which observer effects might occur in actual applied settings with actual employees.
To explore the possibility of an observer effect in applied settings, Burke, Howard, Peterson, Peterson, and Allen (2012) looked at the impact of observing conducted by supervisors in a day treatment program. The primary purpose of the project was to evaluate whether visual performance feedback improved staff members’ treatment integrity in using behavior-specific praise (BSP). However, a secondary purpose was to evaluate whether the supervisors who conducted the performance observations changed their own treatment integrity as a result of being an observer. In this study, the supervisors, who were responsible for delivering treatment to children as well as for supervising less-experienced staff (i.e., assistants and aides), were asked to collect performance feedback data on staff use of BSP with children in the program. After the supervisors submitted the data to an administrator, the administrator converted the data into visual performance feedback (i.e., graphed data) and presented the data to the staff member. Results showed that the treatment integrity of the staff did improve after receiving feedback. In addition, the results showed that the treatment integrity of the supervisors who had collected the data also improved even though they never received any feedback on their performance. This evaluation of the observer effect must be interpreted with caution, in large part because there were only two supervisors and there was not a clear demonstration of experimental control. As a result, further investigation is warranted. Finding an observer effect for a supervisor who collects performance feedback data could provide treatment settings with a practical and time-efficient approach to improving treatment integrity in two employees simultaneously.
The purpose of the current investigation was to extend the research on the observer effect by evaluating its potential impact in an applied, day treatment setting for children. Specifically, the aim of this investigation was to determine whether a supervisor’s treatment integrity improves as a result of the supervisor collecting data on a staff member’s treatment integrity.
Method
Setting
This study was conducted in a midsized midwestern city at a day treatment program that serves children aged 18 months to 8 years referred for problematic behavior (see Burke, Kuhn, Peterson, Peterson, & Badura Brack, 2010, for more information about this setting). At the day treatment program, there is a strong focus on building social competencies through a combination of verbal reinforcement (Maag, 2001), modeling (Bandura & McDonald, 1963), problem solving and social skill instruction (Gresham, Sugai, & Horner, 2001), and a contingency-based token economy (Axelrod, 1971; Christophersen, Arnold, Hill, & Quilitch, 1972; Wolf, Giles, & Hall, 1968). Delivery of praise is used in conjunction with a contingency-based token economy to reinforce appropriate behaviors, whereas a timeout hierarchy is used to respond to inappropriate behavior (e.g., noncompliance, verbal aggression, physical aggression). Additional skill instruction includes teaching adaptive replacement skills that address skill deficits specified by the parent in each child’s Individualized Treatment Program.
Research and program evaluation are integral components of the day treatment program. Observations and evaluations occur frequently to assess the need for changes or modifications, and all employees are expected to participate in research projects and collect data as a condition of employment. Thus, it was neither unusual to ask employees to collect data, nor was it unusual for additional people to be present in classrooms as observers. Participants in the study had experience collecting frequency data and were typically instructed to collect daily data on a variety of behaviors of children in the program. Participants also had knowledge of the importance of treatment integrity, which was first addressed in the initial staff training and then addressed with feedback from the administrator in the natural setting as needed. Although reactivity in children occasionally occurred (e.g., smiling, waving, inquiries), these attention-seeking behaviors were typically ignored by the researchers involved in this study.
At the time of this study, observations were conducted in classrooms that were separated by a single wall or shelves in an open-bay environment. Five different classrooms were used in this study and included children ranging from 2 years to 6 years of age. Two classrooms with children older than 6 years of age were excluded from this study due to the differences in classroom structure (e.g., individual desk work, group therapy sessions).
The hierarchy of employees involved in this investigation included one administrator, five supervisors, and seven staff. The role of the administrator was to oversee supervisor and staff performance and provide training, support, and feedback. The administrator monitored employee–child interactions throughout the day through unscheduled observations across all classrooms. Each classroom had one assigned supervisor and a number of rotating staff members. The description for supervisors, who were the participants in this study, can be found in the “Participants” section. Staff members were responsible for assisting the supervisor by interacting with children and implementing behavior management protocols in the same manner as the supervisor.
Participants
Five individuals employed as supervisors at the day treatment program served as participants. The supervisors were responsible for implementing an academic curriculum, conducting daily activities, and providing direct behavioral treatment to six to eight children in a group setting with the support of staff members. In addition to curricula planning and implementation, the supervisors were responsible for monitoring staff treatment integrity and for occasionally observing and collecting performance feedback data on staff. These participants were female, Caucasian, and had an average age of 25 years (range = 22-31). Participants had either some college experience or a college degree and worked at the program for an average of 28 months (range = 10-55). All participants were blind to the purpose of the study.
Primary Dependent Variable
BSP
Delivery of BSP was chosen as the primary dependent variable due to its specific importance at the program and its general role in shaping appropriate behavior (Bernhardt & Forehand, 1975; Eyberg & Robinson, 1982; Filcheck, McNeil, & Herschell, 2001; Forehand & King, 1977). BSP was operationally defined as a verbal description of a child’s positive behavior indicating approval or affirmation. As such, specific indication of approval (e.g., “I like how you are waiting patiently!”) and descriptions of behavior (e.g., “You are waiting patiently!”) were included as BSP. In the literature, these differing types of attention are considered BSP and prosocial attends, respectively, and are described to play different roles in shaping (i.e., reinforcing successive approximations of a target behavior) child behavior. However, both types of attention are considered BSP at the day treatment program and were included in the operational definition of BSP.
Frequency of BSP was defined by number of behaviors specified as well as number of children clearly addressed. For example, “Sally, thank you for sitting quietly and keeping your hands in your lap,” was coded as two BSP while “I like how Mary, Julie, and Seth are waiting patiently,” was coded as three BSP. Alternatively, BSP that was presented to a group in a general format (e.g., “Everyone is doing such a nice job playing quietly.”) was coded as a single BSP due to ambiguity that could occur when individual names were not explicitly stated. Participants and their staff were responsible for delivering BSP to children. Frequency of BSP was collected and converted to rate per minute.
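The coding rule and rate conversion described above can be sketched in a few lines of Python. This is a minimal illustration, not the instrument used in the study: the function names are hypothetical, and multiplying behaviors by named children for statements that combine several of each is an assumption, because the text gives examples only for one child with several behaviors, several children with one behavior, and general group statements.

```python
def count_bsp(behaviors: int, named_children: int) -> int:
    """Code one praise statement as a number of BSP instances.

    named_children == 0 means the statement addressed the group in general
    (e.g., "Everyone is ..."), which is treated as a single addressee.
    ASSUMPTION: a statement with several behaviors AND several named
    children is coded as behaviors * children; the source text does not
    describe that case.
    """
    if behaviors < 1 or named_children < 0:
        raise ValueError("need at least one behavior and a non-negative child count")
    addressees = max(named_children, 1)
    return behaviors * addressees


def rate_per_minute(frequency: int, minutes: float) -> float:
    """Convert a frequency count to a rate per minute."""
    if minutes <= 0:
        raise ValueError("observation length must be positive")
    return frequency / minutes


# The three examples given in the text:
sally = count_bsp(behaviors=2, named_children=1)     # two behaviors, one child
trio = count_bsp(behaviors=1, named_children=3)      # one behavior, three children
everyone = count_bsp(behaviors=1, named_children=0)  # general group statement
print(sally, trio, everyone)
```

For instance, a hypothetical count of 25 BSP in a 10-min observation would convert to `rate_per_minute(25, 10)`, or 2.5 per minute.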
Secondary Dependent Variables
General Praise
Frequency of general praise was collected and converted to rate per minute to determine the extent to which data collection influenced specific and general praise. General praise was operationally defined as any verbal or nonverbal indication of approval of child behavior that did not explicitly describe a behavior. Examples of general praise included positive physical contact (e.g., high-fives, fist bumps, pats on the back), affirmations (e.g., “That’s right!”), nonspecific positive acknowledgment (e.g., “Good job.”), and delivery of tokens that were not accompanied by BSP. Like BSP, frequency of general praise was determined by number of names specified or number of positive physical contact or tokens delivered. For example, “Good job, everyone” was coded as a single general praise statement while “Good job, Sally and Susie” was coded as two general praise statements. Participants and staff were responsible for delivering general praise.
Timeout
Frequency of timeout was collected to calculate praise-to-correction ratios. A timeout hierarchy was used at the program to respond to noncompliance, rule breaking, and aggression. Levels were communicated verbally and/or with gestures (e.g., “Floor timeout”; point to the floor) and were presented in accordance with the level of violation. For example, noncompliance with a rule was responded to with a 10-s timeout, whereas a more serious violation (e.g., aggression) was responded to with a lengthier timeout. Timeout was operationally defined as any level of timeout presented by an adult and was recorded as such. Participants and staff were responsible for implementing timeout.
Praise-to-Correction Ratios
Praise-to-correction ratios were calculated to provide a relative context for delivery of BSP. Praise-to-correction ratios were calculated by summing the frequency of BSP and general praise statements delivered in each observation and dividing the sum by the frequency of any level of timeouts delivered during that same time period.
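As a rough sketch of this calculation, the ratio can be written as a short Python function. The function name is hypothetical, and returning `None` when no timeouts occurred is an assumption; the text does not say how observations without timeouts were handled.

```python
from typing import Optional


def praise_to_correction_ratio(bsp: int, general_praise: int,
                               timeouts: int) -> Optional[float]:
    """(BSP + general praise) / timeouts for a single observation.

    ASSUMPTION: the ratio is reported as undefined (None) when no
    timeouts were delivered, since division by zero is not possible.
    """
    if min(bsp, general_praise, timeouts) < 0:
        raise ValueError("frequencies must be non-negative")
    if timeouts == 0:
        return None
    return (bsp + general_praise) / timeouts


# Hypothetical observation: 30 BSP, 10 general praise statements, 4 timeouts.
print(praise_to_correction_ratio(30, 10, 4))  # 10.0
```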
Treatment Acceptability
After this study was concluded, participants and the administrator were asked to anonymously complete a modified Treatment Evaluation Inventory–Short Form (TEI-SF) to measure procedural acceptability (Kelley, Heffer, Gresham, & Elliott, 1989). Kelley et al. (1989) found that the TEI-SF is an internally consistent and valid measure, and researchers have modified the TEI-SF to evaluate individualized interventions (e.g., Borrego, Ibanez, Spendlove, & Pemberton, 2007; Teng, Woods, & Twohig, 2006; Woods & Twohig, 2002). This nine-item survey was modified by changing the statements to reflect the targeted protocol in this investigation while maintaining the original sentence structure. The TEI-SF included items such as “I liked the processes used in this procedure” and “I would be willing to use this procedure to increase staff use of other treatment strategies.” Each item was rated on a 5-point Likert-type scale, ranging from strongly disagree = 1 to strongly agree = 5, with a reverse score for Item 6 (i.e., strongly disagree = 5, strongly agree = 1). TEI-SF scores range from 9 to 45, with higher scores indicating higher treatment acceptability. Researchers have suggested that scores above 27 indicate treatment acceptability (Borrego et al., 2007; Teng et al., 2006; Woods & Twohig, 2002). After the TEI-SFs were completed and collected, the ratings for each item were added together, and the sum was reported as the total score (see Miltenberger, Wagaman, & Arndorfer, 1996; Teng et al., 2006; Woods & Twohig, 2002).
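The scoring procedure described here (nine items rated 1 to 5, Item 6 reverse scored, ratings summed, totals above 27 read as acceptable) can be illustrated with a brief Python sketch. The function and constant names are hypothetical, and the ratings in the example are invented for illustration only.

```python
from typing import Sequence

REVERSE_ITEMS = {6}        # Item 6 is reverse scored (1 <-> 5, 2 <-> 4)
ACCEPTABILITY_CUTOFF = 27  # totals above this value suggest acceptability


def score_tei_sf(ratings: Sequence[int]) -> int:
    """Sum nine 1-5 ratings after reverse scoring Item 6."""
    if len(ratings) != 9:
        raise ValueError("the TEI-SF has nine items")
    total = 0
    for item_number, rating in enumerate(ratings, start=1):
        if not 1 <= rating <= 5:
            raise ValueError("ratings must be between 1 and 5")
        total += 6 - rating if item_number in REVERSE_ITEMS else rating
    return total


def is_acceptable(total: int) -> bool:
    """Apply the 'above 27' guideline described in the text."""
    return total > ACCEPTABILITY_CUTOFF


# Hypothetical respondent who rates every item 4 (Item 6 becomes 2).
total = score_tei_sf([4] * 9)
print(total, is_acceptable(total))  # 34 True
```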
Experimental Design
This investigation used a multiple baseline with reversal design (Kazdin, 1982). In accordance with this design, baseline phases across all observers began concurrently. When a stable rate of BSP was observed in baseline for one participant, that participant began observing and collecting data on a staff person while all other participants remained in baseline. When a favorable change in BSP rates was demonstrated in the first participant and a stable baseline was observed in a second participant, data collection was introduced to the second participant while the other participants remained in the baseline condition, and so forth. Stability was determined by clinical judgment, knowledge of research design, and visual interpretation of trend. Data collection was then withdrawn to examine reversibility and reintroduced to observe replication effects. This study received approval from a university Institutional Review Board (IRB). In accordance with IRB regulations, this study did not require consent because of the preexisting use of performance feedback data collection in the setting. However, all participants were debriefed at the conclusion of the study and provided verbal consent for their data to be used in this study without personal identifiers.
Procedure
Baseline
Baseline data were collected on all participants to assess a current estimate of treatment integrity in the delivery of BSP. Across all phases, data were collected by researchers as unobtrusively as possible (e.g., sitting in the corner of the room). Typical activities occurring during observations included classroom-based activities such as circle time, snack time, story time, arts and crafts, and individual-center play. Activities excluded for observation were bathroom time and recess. Typically, one to two staff members were present in the classroom during observations.
A minimum of three 10-min observations, conducted by the primary or secondary research data collector, occurred across separate days to acquire a sample of typical delivery of BSP. Ten minutes were chosen for an observation time interval based on previous performance feedback studies completed by Reinke and colleagues (2007, 2008). It was understood that in an applied treatment setting like the day treatment program, events could interrupt observations (e.g., scheduled breaks, end of shift, or excessive problematic behavior of a child). If an observation was interrupted, the participant waited up to 10 min for the observation to resume; however, if the observation did not resume within 10 min or if the sample was less than 8 min (i.e., less than 80% of the total possible observation), the observation was considered invalid. This guideline was used across all phases.
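The validity guideline for interrupted observations can be expressed as a simple check. The sketch below is illustrative only: the function and parameter names are hypothetical, and treating a pause of exactly 10 min as "resumed within 10 min" is an assumption about a boundary case the text does not address.

```python
def is_valid_observation(observed_minutes: float,
                         pause_minutes: float = 0.0,
                         min_minutes: float = 8.0,
                         max_pause_minutes: float = 10.0) -> bool:
    """Return True when an observation meets the stated guideline.

    An observation is invalid if it did not resume within 10 min of an
    interruption, or if fewer than 8 min (80% of the planned 10 min)
    were sampled.
    ASSUMPTION: a pause of exactly max_pause_minutes still counts as
    having resumed in time.
    """
    resumed_in_time = pause_minutes <= max_pause_minutes
    long_enough = observed_minutes >= min_minutes
    return resumed_in_time and long_enough


print(is_valid_observation(10.0))                     # uninterrupted, full length
print(is_valid_observation(8.0, pause_minutes=5.0))   # short pause, 8 min sampled
print(is_valid_observation(7.5))                      # too short
print(is_valid_observation(9.0, pause_minutes=12.0))  # pause too long
```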
Intervention—Data Collection
Data collection included three primary components that were intended to enhance the salience of the BSP protocol and thereby improve participant treatment integrity. The first component was a preintervention meeting with the administrator and the participant to discuss the importance of delivering BSP. Because data collection was implemented twice and withdrawn once with each participant to examine reversibility of effects, each participant had two preintervention meetings. The second component of data collection was a data sheet to be used for collecting data on staff performance that listed an operational definition of the BSP protocol, examples of BSP, and an explicit definition of treatment integrity. The third component was actual data collection by the participant on a staff member’s treatment integrity to the BSP protocol.
Data collection—Component 1
Introduction of data collection began with the preintervention meeting. The preintervention meeting was designed to address the importance of adhering to the BSP protocol. Each participant attended one preintervention meeting alone with the administrator in a private, quiet room (note this preintervention meeting only occurred on the first day of intervention). The administrator used a script designed by the investigators to prompt her to emphasize specific details of the BSP protocol. The components in the script were similar to initial employee training that occurred at the start of employment and emphasized for employees the importance of delivering BSP at a high rate. The script comprised seven components that were derived from the empirical literature on praise and included the operational definition of BSP and examples, rationales describing the importance of BSP, the importance of collecting data on staff use of BSP, and the expectation that staff should be delivering at least four BSP statements per minute. This standard was derived from the praise research (see description of primary dependent variable). Additional components included stating the purpose of the meeting (i.e., asking participants to collect data), reviewing the data sheet to be used, and providing an opportunity for participants to demonstrate their understanding by providing five examples of BSP.
The primary investigator (PI) listened to all meetings from an adjacent room, where she was able to use a live audio monitoring device and record inclusion or omission of the components listed on the script. The presence of the PI in the adjacent room was known to the administrator and unknown to the participant.
Data collection—Component 2
The second component of data collection was a data sheet that listed examples and the operational definition of BSP, as well as the staff goal to deliver BSP at least 4 times each minute. The data sheet was visibly present in all preintervention meetings while the administrator emphasized important aspects of the protocol. In addition to examples of BSP and the expectation that staff deliver at least four BSP statements per minute, the data sheet included seven areas for participants to complete each day of data collection. These areas prompted participants to tally the frequency of behaviors observed, total the frequency of behaviors observed, record the date, record the time of day, record the staff who was observed, record the number of children present in the room, and record the activity the staff member was engaged in while the observation was occurring.
Data collection—Component 3
The final component of data collection involved participants collecting treatment integrity data. Participants received a blank copy of the data sheet from the administrator and the instruction to collect data on the frequency of BSP of a specified staff member. Participants were told to focus on recording the targeted staff member’s frequency of BSP for the duration of the observation and were encouraged to be discreet while collecting data to reduce staff reactivity. Data collection was exclusively focused on BSP and did not include other types of attention (e.g., general praise, timeout). Data were collected on the data sheet in 1-min intervals across 5 min, and participants were instructed to collect and return the data within 30 min of receiving the data sheet. Typical activities occurring during observation times matched baseline activities. At the completion of the 5-min observation, participants submitted the data sheet to the administrator by placing it face down on her desk or in a folder on her desk. The administrator transferred these data to a visual performance feedback graph and provided visual performance feedback to the observed staff member within 10 min. Then, the primary or secondary investigator collected data on the participant for 10 min. This routine continued for the extent of the data collection phase.
Withdrawal of Intervention
In this phase, the participant was not provided a data collection form and was not instructed to collect performance feedback data on staff. Data collection on the participant continued for 10 min each day.
Return to Intervention
Data collection was reintroduced to determine replication effects. This phase began with a second preintervention meeting with the administrator, who used an abbreviated script with three components to emphasize the most important details of the protocol. This script included the importance of BSP, the goal of delivering at least four BSP statements per minute, and the request for participants to collect data on staff with the data sheet provided. This script was modified from the first to avoid redundancy (to maintain the salience of the components) while continuing to emphasize the most important aspects of the protocol (e.g., treatment integrity goal).
Following this meeting, participants received instruction to resume data collection on staff. Participants were provided a data collection sheet each day they were asked to collect data and followed the same guidelines described in the intervention section. The length of this phase was shorter than other phases due to limited availability to continue collecting data at the research site.
Debriefing and Consent
All participants were individually debriefed by the primary and secondary data collectors at the conclusion of the study in a quiet, private room at the facility. Participants were informed of the research hypothesis, shown their individual data without personal identifiers, and provided the opportunity to ask questions about the investigation and comment on data collection. To protect participants from coercion (i.e., the use of threat or punishment to motivate individuals to engage in a specific behavior; Sidman, 2001), they were informed that no one outside the research team was privy to personal identifiers, including the administrator and owners of the program, and assured that withdrawing their data from this study would have no impact on their job. All participants provided verbal consent for their data to be used in this study.
Reliability
The PI was responsible for coding dependent measures. A secondary investigator was trained to 90% reliability by the PI with in vivo observations and served as a second observer who independently collected reliability data for 29% of all observations. In the absence of the PI, the secondary data collector independently coded dependent measures. Interobserver agreement (IOA) was calculated by determining the number of agreements between observers on the occurrence and nonoccurrence of target behaviors, dividing that number by the number of agreements plus disagreements, and multiplying by 100. IOA of BSP across sessions was 92% (range = 79-100). IOA was also calculated for all combined dependent measures (i.e., BSP, general praise, and timeout) and was 88% (range = 70-100).
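The IOA calculation is the standard percentage-agreement formula: agreements divided by agreements plus disagreements, multiplied by 100. A minimal Python sketch follows; the function name is hypothetical, and the counts in the example are invented for illustration and are not data from the study.

```python
def interobserver_agreement(agreements: int, disagreements: int) -> float:
    """Percentage agreement: agreements / (agreements + disagreements) * 100."""
    if agreements < 0 or disagreements < 0:
        raise ValueError("counts must be non-negative")
    total = agreements + disagreements
    if total == 0:
        raise ValueError("at least one scored interval is required")
    return 100.0 * agreements / total


# Hypothetical session: observers agreed on 23 of 25 scoring opportunities.
print(interobserver_agreement(23, 2))  # 92.0
```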
Independent Variable Adherence (Treatment Integrity)
Traditionally, the term treatment integrity, when used in research studies, has referred to the extent to which the independent variable was implemented as intended. However, in this study, the term treatment integrity was already assigned to the dependent variable, that is, the degree to which the BSP protocol was implemented as intended by employees. Because the term treatment integrity was already being used to describe the primary dependent variable, a different term, independent variable adherence, was used to describe the extent to which the independent variable was implemented as intended.
Independent variable adherence was collected on preintervention meetings by listening to the meeting with an audio monitoring device that was installed in a room adjacent to the meeting room. The administrator was informed of the PI’s presence in the adjacent room; however, participants were not. Independent variable adherence was calculated by dividing the number of components that were stated by the administrator by the number of components in the script, and multiplying by 100. Independent variable adherence on preintervention meetings was 100% for all participants. IOA was collected with a secondary observer on two occasions (40%) and was 100%.
Independent variable adherence was also collected on participant completion of the data sheet. In this study, the data sheet included seven components that participants were responsible for completing each day of data collection. Independent variable adherence was calculated on each participant’s completion of the data sheet by dividing the number of completed components by the total number of components and multiplying by 100. Overall independent variable adherence was 92% but varied across participants between 62.9% (Cara) and 100% (Betty, Dina, and Effie).
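Both adherence measures in this section use the same percentage: components delivered or completed, divided by total components, multiplied by 100. The Python sketch below illustrates it under stated assumptions. The function name is hypothetical, the example counts are invented, and pooling components across several days of data sheets is an assumption about how a participant's overall figure might be aggregated.

```python
from typing import Sequence


def adherence_percent(completed_per_sheet: Sequence[int],
                      components_per_sheet: int = 7) -> float:
    """Percentage of components completed, pooled over all sheets.

    ASSUMPTION: sheets from different days are pooled (total completed
    divided by total possible) rather than averaged sheet by sheet.
    """
    if not completed_per_sheet:
        raise ValueError("at least one sheet is required")
    for completed in completed_per_sheet:
        if not 0 <= completed <= components_per_sheet:
            raise ValueError("completed count out of range")
    possible = components_per_sheet * len(completed_per_sheet)
    return 100.0 * sum(completed_per_sheet) / possible


# Hypothetical participant who completed 7, 6, and 5 of 7 components
# on three days of data collection.
print(round(adherence_percent([7, 6, 5]), 1))  # 85.7
```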
Results
Baseline
Figure 1 displays rate of BSP per minute for each participant during the investigation. During baseline, BSP rates for all participants were low and consistently below the criterion of delivering at least four BSP statements per minute. Baseline extended for various lengths across the five participants (range = 3-17 days), and all participants displayed decreasing levels of BSP prior to intervention.

Figure 1. Daily rate of behavior-specific praise by participants.
Intervention—Data Collection
When data collection was implemented, there was a marked and sustained increase in the level of BSP for three participants. Change in BSP was initially demonstrated by Anna on Day 4, while rates for Betty and Effie, who were still in baseline, remained unchanged. Effects were replicated (i.e., BSP rates increased) when treatment was introduced with Betty on Day 6, and with Effie on Day 16.
Two other participants did not show improvements in the level of BSP when treatment was introduced. Cara did not demonstrate any noticeable change in BSP level during intervention or for the remainder of the investigation. Dina showed no change in BSP level during intervention; however, there was an apparent increasing trend in BSP rates during intervention compared with baseline.
Withdrawal of Intervention
When data collection was terminated, a decrease in the rate of BSP was observed with two participants. Betty and Effie showed lower levels of BSP when the intervention was removed. In contrast, Anna continued to exhibit fairly stable levels of BSP, which continued to remain higher than baseline levels. Dina’s rate of BSP showed a change in trend but little change in the overall level of BSP.
Return to Intervention
When data collection was reintroduced with Betty and Effie, their responses to the intervention were very similar to their initial responses to the intervention, with marked changes in the level and trend of BSP. Anna’s BSP delivery rate also increased in level and changed direction, showing a favorable increasing trend. The same effect was not observed with Dina, who continued to deliver BSP at levels similar to her withdrawal phase.
General Praise
Frequency of general praise was collected to determine differential effects of the intervention on BSP versus general praise. In baseline, the mean rate of general praise was about one praise statement per minute across all participants. This rate remained at about one praise statement per minute when data collection was introduced and either remained stable or decreased when the intervention was withdrawn and reintroduced. In general, results indicate that the levels of general praise were relatively low and stable across all phases for all participants (Table 1).
Table 1. Mean Rate of General Praise for Each Participant by Phase.
Note: DC = data collection intervention. Numbers in bold represent the mean rate of general praise statements delivered each day across phases.
Praise-to-Correction Ratios
Median praise-to-correction ratios are presented by phase in Table 2. During baseline, Anna, Betty, Cara, and Effie showed low to moderate praise-to-correction ratios (range = 1.8:1-3.9:1). Cara’s praise-to-correction ratio remained low across all phases of the investigation. In contrast, Anna, Betty, and Effie showed large increases in praise-to-correction ratios (range = 8.7:1-9.8:1) when the intervention was introduced. Dina’s praise-to-correction ratio was high in baseline (median = 6.6:1) but also increased when data collection was introduced (median = 10:1). When data collection was terminated, Anna, Betty, and Dina showed decreases in praise-to-correction ratios, while Effie’s praise-to-correction ratio continued to increase. Reintroduction of the data collection intervention resulted in increased ratios for Anna, Betty, and Effie.
Table 2. Median Praise-to-Correction Ratio for Each Participant by Phase.
Note: DC = data collection intervention. Numbers represent praise-to-correction medians for each phase (n:1) and were calculated by dividing the sum of behavior-specific praise statements and general praise statements by the number of timeouts delivered.
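As a rough illustration of the ratio described in the note above, the following sketch computes a daily praise-to-correction ratio as (BSP statements + general praise statements) / timeouts and then takes the median across the days in a phase. Two points are assumptions rather than details reported in the study: that the median is taken over daily ratios, and that days with zero timeouts are excluded because the ratio is undefined. All counts shown are hypothetical.

```python
from statistics import median
from typing import Optional


def praise_to_correction(bsp: int, general: int, timeouts: int) -> Optional[float]:
    """Daily ratio n in n:1, i.e. (BSP + general praise) / timeouts.

    Returns None when no timeouts were delivered (assumed handling;
    the ratio is undefined in that case).
    """
    if timeouts == 0:
        return None
    return (bsp + general) / timeouts


def phase_median_ratio(days: list) -> Optional[float]:
    """Median of the defined daily ratios within one phase (assumed method)."""
    ratios = [praise_to_correction(b, g, t) for (b, g, t) in days]
    defined = [r for r in ratios if r is not None]
    return median(defined) if defined else None


# Hypothetical phase of three days: (BSP count, general praise count, timeouts).
hypothetical_phase = [(10, 5, 3), (12, 4, 2), (9, 3, 4)]
result = phase_median_ratio(hypothetical_phase)
print(f"{result}:1")
```

With these hypothetical counts the daily ratios are 5:1, 8:1, and 3:1, so the phase median is 5:1.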
Treatment Acceptability
Three participants and the administrator completed and returned the modified TEI-SF. Researchers have suggested that scores above 27 indicate treatment acceptability (Borrego et al., 2007; Teng et al., 2006; Woods & Twohig, 2002). Results showed that three participants found the treatment to be acceptable with scores of 31, 34, and 37 (maximum score = 45). Participant responses were anonymous. The administrator completed the TEI-SF with a score of 38, also indicating a positive view of the procedure. These scores compare favorably with other TEI-SF scores for other popular treatments, such as positive reinforcement and response cost (Jones, Eyberg, Adams, & Boggs, 1998).
Discussion
The results of this investigation demonstrated that data collection can be an effective way to improve treatment integrity in day treatment supervisors without intervening with them directly. Baseline data confirmed that treatment drift can occur with more experienced staff (Codding et al., 2008; Hagermoser Sanetti et al., 2007), as all participants showed low levels of BSP in baseline, regardless of the length of the baseline. When data collection was introduced, three participants showed marked, stable, and replicable changes. In addition, levels of responding were reversed and recovered with two participants, whose BSP rates decreased when the intervention was removed and increased when the intervention was reintroduced. There were also corresponding improvements in praise-to-correction ratios. Finally, data collection was found to be an acceptable approach to improve treatment integrity.
These results are important because they extend the findings of Burke et al. (2012) and confirm that the observer effect can enhance the treatment integrity of supervisors collecting data for performance feedback to staff. Note that this effect was not dependent on the participants delivering the performance feedback to others or receiving feedback themselves. The participants only collected the data on others, allowing a program administrator to actually deliver the performance feedback. This is an important finding because it suggests that it was the activities associated with observing that were responsible for the change. Thus, these data provide support for the notion that performance feedback can improve the treatment integrity of those who simply collect the performance data as well as those who receive the performance feedback.
Although the participants increased BSP levels when data collection was occurring, they did not reach the target rate of four BSP statements per minute. It is unclear how to interpret these data as we do not know how “much” treatment integrity with BSP is necessary to retain optimal child behavior outcomes. It would be beneficial to evaluate the minimum BSP dosage needed for favorable outcomes in day treatment programs for children.
It is interesting to note that the influence on BSP did not affect rates of general praise. One might have expected that the observer effect would also produce changes in delivery of general praise because of topographical similarities (e.g., “Good job sitting quietly!” vs. “Good job!”) and because general and specific praise occupy a similar response class (i.e., both are used to reinforce appropriate behavior). However, levels of general praise remained low and stable across all participants throughout the investigation. This is a notable finding because it suggests that the observer effect may be limited to the target of the observations.
The concomitant changes in praise-to-correction ratios are interesting although challenging to interpret. The literature is replete with praise-to-correction ratio recommendations, ranging from 3:1 (Shores, Gunter, & Jack, 1993) to 5:1 (Hart & Risley, 1995), with an outlier of 10:1 (Nafpaktitis, Mayer, & Butterworth, 1985). Elementary school teachers who established a 4:1 praise-to-correction ratio in their classrooms had better student outcomes than their colleagues with lower ratios (Burke, Oats, Ringle, O’Neill Fichtner, & DelGaudio, 2011). Other researchers (Sawka, McCurdy, & Mannella, 2002; Sutherland, Wehby, & Copeland, 2000) and national organizations such as Boys Town (see Connolly, Down, Criste, Nelson, & Tobias, 1997) and the National Association of School Psychologists (see Sawka-Miller & Miller, 2007) have endorsed a 4:1 ratio. Surprisingly, these recommendations lack well-established empirical support. The absence of an evidence-based praise-to-correction ratio makes it difficult to determine the most effective ratio; however, there is a general sense that relative rate of reinforcement is essential, especially in a day treatment setting, where a high praise-to-correction ratio may be more critical to improving the behavior of children who are referred for aggressive and disruptive behaviors.
Two participants showed little change in behavior when data collection was introduced. Although data collection was designed to enhance the salience of the BSP protocol, it is possible that other controlling stimuli in the classroom were more salient for Cara and Dina. For example, although the majority of children at the day treatment program engaged in noncompliant and inappropriate behavior, a number of the children assigned to Cara’s classroom were particularly aggressive. As a result, the need for Cara to immediately intervene to address these high rate behaviors may have been more salient than following the rule to deliver BSP at a high rate. Likewise, midway through the study, Dina indicated that she was planning to leave her job, perhaps making the importance of treatment integrity less salient for her.
While stimulus salience and reactivity are hypothesized to have played important roles in the observer effect, observational learning may have also been involved. Researchers have demonstrated the effects of learning by observing others, ranging from simple imitation (Baer, Peterson, & Sherman, 1967) to complex imitation (Bandura, 1965). Because participants were instructed to observe staff, it is possible that the participants’ improvements were, at least in part, a function of imitating the BSP performance of the staff they observed. Given that some staff had better BSP rates than their supervisors, even in baseline, this seems a credible alternative explanation. However, even in this conceptualization, one cannot rule out that it was the actual recording of the data, not just the observation, that mediated the observer effect.
It is difficult to determine which aspect of data collection was responsible for influencing participant responding. Data collection included a preintervention meeting emphasizing the BSP target, a data sheet with reminders about the target rate of BSP, and finally, the actual process of observing and coding BSP by a staff person. Thus, the observer effect may have been the result of any one of these components alone or the result of some combination of these components. In addition, the effective components may have been different for different observers. A component analysis would be necessary to determine which element(s) of data collection are necessary to produce an observer effect on treatment integrity.
There are numerous benefits to using data collection to improve treatment integrity in an applied setting. In this investigation, all necessary components of the intervention (i.e., preintervention meetings, assignments to observe specified staff, and data collection on staff) were completed by employees working in the day treatment program. This feature was important because it illustrated the feasibility of data collection, independent of researcher involvement. Furthermore, performance feedback data were collected by supervisors, which showed that individuals other than researchers or administrators were capable of observing and taking data on staff. This reduced the amount of effort required by the administrator to directly evaluate treatment integrity and might enhance the view of data collection as a viable way for other administrators to improve treatment integrity.
A final advantage of data collection is the minimal time requirement. Lack of time has been noted as one of the main problems with measuring and improving treatment integrity (Cochrane & Laux, 2008). In this study, observers collected performance feedback data on staff for 5 min each day. Graphing and showing staff individual visual performance feedback data took about 30 s to complete. The initial preintervention meetings held between the administrator and observers were 4 to 10 min long (M = 6.3 min) and the second preintervention meetings were 1 to 2 min long (M = 1.3 min). Had the intervention continued, the format of the second preintervention meeting would have served as the model for all subsequent preintervention meetings with observers. In sum, the entire observation and feedback procedure required about 6 min each day, with an additional 1 to 6 min for preintervention meetings. Based on time alone, this may be a more time-efficient tool to assess and improve treatment integrity in treatment settings, which could increase the reinforcing value of data collection.
Data collection was rated as a highly acceptable method to improve treatment integrity. These positive results suggest that there is potential for this type of intervention to be incorporated into standard practice and meet the needs of clinicians; however, it is possible that the three observers who completed the TEI-SF were the same observers who responded favorably to data collection (i.e., Anna, Betty, and Effie). It is also possible that only positive results were submitted due to the small number of observers and potential discomfort in providing poor ratings. Replications of this study in other settings would be valuable to further assess treatment acceptability and determine the extent to which these results affect the use of performance feedback to regularly evaluate treatment integrity.
Conclusions regarding the generality of these data must be made with caution. First, this intervention was implemented in a unique treatment environment that specialized in increasing appropriate behavior with children who showed high rates of noncompliance, aggression, and disruptive behavior. Second, participants had experience collecting performance feedback data on staff. In other treatment environments (e.g., special education classroom, residential setting), supervisors may not have the experience to adequately collect data on other employees. Thus, the extent to which the observer effect would be experienced in other settings is unknown. Third, the observer effect was not demonstrated with all five participants. It is possible that the observer effect would only be experienced with certain types of employees with certain types of learning histories.
There are a number of limitations in this study. First, the data do not reflect consecutive days. Although considerable effort was made to observe participants every day, it was impossible to conduct continuous observation sessions in this setting due to weekends, sick days, and vacation days. Second, participants were scheduled for shifts that varied in start time each day, and it is not clear how this may have affected response to intervention. In addition, intervention phases were relatively short for most observers and because no follow-up data were collected, the long-term effects of data collection are unclear. Finally, the researchers were aware of the research hypothesis, which could have influenced data collection despite consistently high IOA. Future research should replicate this study to determine internal and external validity.
The replication of the observer effect in this study offers only limited generality of the effect. Thus, additional replications with day treatment supervisors would be beneficial, as would continued exploration of the generality of the observer effect with different protocols (e.g., punctuality) and in a variety of settings (e.g., special education classrooms, general education classrooms, residential settings). It would be helpful to determine whether repeated exposure to data collection affects the salience of the targeted protocol and to identify how frequently a “booster” meeting is needed to maintain acceptable levels of treatment integrity. It would also be valuable to assess how many protocols can be simultaneously targeted with data collection while maintaining an observer effect to explore the potency when multiple protocols are targeted. Finally, future research should replicate this study with different groups of individuals (e.g., coworkers, parents) to determine the generality of the intervention and the extent to which the observer effect can improve and maintain high levels of treatment integrity.
In sum, this study contributes to the need for practical and time-efficient approaches to improving treatment integrity and illustrates the potential for examining interventions in applied settings. Vollmer (2011) recently discussed the expansion of behavior analysis to applied settings and noted that while “we need not sacrifice the logic of our methods . . . we must adapt (translate) our methods in order to have a say in resolving the most socially relevant problems of our time” (pp. 33-34). It is the responsibility of researchers to modify empirically supported treatment to fit the needs of clinicians and increase the reinforcing value of using evidence-based practices. In this way, clinicians will be better supported by researchers and have the tools to consistently and reliably evaluate and improve integrity in treatment settings.
Footnotes
Acknowledgements
The authors wish to thank Janie Peterson, Roger Peterson, and Jessica Wachtler for their continuous support of this project.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported in part by Project #8188 from the Maternal and Child Bureau (Title V, Social Security Act), Health Resources and Services Administration, Department of Health and Human Services and in part by grant 90DD0533 from the Administration on Developmental Disabilities (ADD), Administration for Children and Families, Department of Health and Human Services.
