Abstract
We describe challenges in the 6-year longitudinal cluster randomized controlled trial (CRCT) of Positive Action (PA), a social–emotional and character development (SECD) program, conducted in 14 low-income, urban Chicago Public Schools. Challenges pertained to logistics of study planning (school recruitment, retention of schools during the trial, consent rates, assessment of student outcomes, and confidentiality), study design (randomization of a small number of schools), fidelity (implementation of PA and control condition activities), and evaluation (restricted range of outcomes, measurement invariance, statistical power, student mobility, and moderators of program effects). Strategies used to address the challenges within each of these areas are discussed. Incorporation of lessons learned from this study may help to improve future evaluations of longitudinal CRCTs, especially those that involve evaluation of school-based interventions for minority populations and urban areas.
For youth to be successful in school and in life, they need to acquire social skills, build character, improve mental and physical health, and avoid problem behaviors (Coalition for Evidence-Based Policy, 2002; M. T. Greenberg et al., 2003). Programs designed with these goals—including those with a focus on social–emotional learning (Durlak, Weissberg, Dymnicki, Taylor, & Schellinger, 2011; Weissberg & O’Brien, 2004) and positive youth development (Catalano, Berglund, Ryan, Lonczak, & Hawkins, 2004; Elias et al., 2015; Lerner, Phelps, Forman, & Bowers, 2009; Snyder & Flay, 2012)—can not only foster social and emotional skill development but also help prevent multiple problematic health-related behaviors, including bullying (Afshar & Kenny, 2008) and substance use (M. T. Greenberg et al., 2003; Zins & Elias, 2006).
Despite these encouraging findings, it is widely recognized that school-based prevention efforts present many challenges (e.g., M. T. Greenberg et al., 2003). We discuss challenges faced during a school-based prevention cluster randomized controlled trial (CRCT) relating to study logistics, study design, program fidelity, and program evaluation, and share our strategies for addressing each challenge. Discussions of limitations associated with social intervention research are growing in popularity in the program evaluation literature (D. Greenberg & Barnow, 2014; Jaycox et al., 2006; Ong-Dean, Hofstetter, & Strick, 2010). For example, D. Greenberg and Barnow (2014) discussed eight common flaws of randomized controlled trials (RCTs) in evaluation of social programs; some of these flaws are also discussed in the present article. Additionally, Jaycox et al. (2006) discussed challenges related to recruitment, implementation, and dissemination of results. Many of the challenges and issues discussed in previous articles focus on the planning and implementation stages of RCTs. This article adds to the existing literature by considering a more comprehensive set of challenges faced in all stages of an RCT, as well as discussing some challenges not addressed in prior papers (e.g., measurement invariance in a longitudinal trial). Our goal is to provide an informative case example rather than a broad overview and synthesis of the literature. In alignment with this goal, we provide a description of issues or challenges faced, our strategy for addressing them, and sources for more in-depth discussion. Furthermore, it has been emphasized that contextual influences are critical to consider in social intervention research and that many concerns may be specific to particular settings (Trickett & Beehler, 2013).
This article adds to understanding in this area by considering challenges that pertain to trials conducted in one particular type of setting—urban, low-income schools. The importance of considering challenges associated with this context is underscored by findings suggesting more limited effectiveness for programs when implemented and evaluated in low-income, urban school settings (Farahmand, Grant, Polo, Duffy, & DuBois, 2011).
The Positive Action (PA) Program
PA is a comprehensive, school-wide, social–emotional and character development (SECD) program grounded in theories of self-concept (Purkey, 1970; Purkey & Novak, 1970), particularly Self-Esteem Enhancement Theory (SET; DuBois, Flay, & Fagen, 2009), and is also consistent with integrative and social–ecological theories of health behaviors such as the social learning theory (Bandura, 1977), problem behavior theory (Jessor & Jessor, 1977), and theory of triadic influence (TTI; Flay & Petraitis, 1994; Flay, Snyder, & Petraitis, 2009). In line with SET, PA includes a classroom-based curriculum that brings into conscious awareness the motivation for self-esteem, while also teaching the skills needed for adaptive means of feeling good about oneself (e.g., self-control). In line with the TTI, a range of ecological supports (e.g., school-wide climate development, family classes) provide social reinforcement and validation for positive behaviors in both school and nonschool settings.
The PA program includes PreK–12 curricula, of which the K–8 portion was tested in the present trial. The scoped and sequenced classroom curricula consist of over 140 age-appropriate lessons per grade taught for 15–20 min, 4 days/week, for Grades K–6, and 70 lessons, taught 2–3 days/week, for Grades 7 and 8. The core curricula for all grades consist of the following six units: self-concept; PAs for body (physical health) and mind (learning); and social and emotional PAs focusing on getting along with others and on managing, being honest with, and continually improving oneself. In addition to the student core curricula, the program includes teacher, counselor, and family training as well as school-wide climate development activities. A community component is also included in the PA program; however, funding for the present study did not allow for its implementation. All teachers and staff implementing PA in the current trial were provided 2–3 hr of on-site training at the beginning of each year.
Several evaluations have indicated significant favorable effects of PA on several outcomes. Most notably, two CRCTs, including the present trial, have found favorable effects for students in schools receiving PA on academic achievement (Bavarian et al., 2013; Snyder et al., 2010), positive youth development indicators (Lewis, Vuchinich, et al., 2016), health behaviors (Bavarian, Lewis, Acock, et al., 2016), emotional health (Lewis, DuBois, et al., 2013), self-esteem (Silverthorn et al., 2016), social environment (Bavarian, Lewis, Silverthorn, DuBois, & Flay, 2016), fewer school disciplinary incidents (Lewis, Schure, et al., 2013; Snyder et al., 2010), and lower involvement in problematic health behaviors, including substance use, violence (Lewis, Schure, et al., 2013; Li et al., 2011), and sexual activity (Beets et al., 2009).
Brief Overview of Trial
Setting, Design, School Recruitment, and Sample
The Chicago trial of PA sought to extend the knowledge base on the long-term effectiveness of SECD programs (Durlak et al., 2011) for urban, minority students, as well as to examine the effectiveness of such programs during the transition to adolescence. This matched-pair CRCT (Murray, 1998) of PA was implemented in high-poverty, K–8 schools in Chicago, with a largely minority (87%) student population. Students in seven matched pairs of schools were followed beginning in Grade 3 (Fall 2004 and Spring 2005), and at six additional times (waves) over 6 years: beginning and end of Grade 4 (Fall 2005 and Spring 2006), end of Grade 5 (Spring 2007), beginning and end of Grade 7 (Fall 2008 and Spring 2009), and end of Grade 8 (Spring 2010). The uneven distribution of waves between Grades 5 and 8 was due to a gap in funding. Sampling and recruitment of schools took place during spring 2004 and are described in detail elsewhere (Flay, 2012; Ji, DuBois, Flay, & Brechling, 2008). Initial funding called for only five pairs (10 schools); additional funding was secured for two more pairs (14 schools total).
Measures
Student, teacher, and parent (or primary caregiver) reports were used to assess multiple outcomes throughout the duration of the CRCT. Measures were selected by a team of researchers from the sites participating in the Social and Character Development Research Consortium trial as well as members of the Consortium. Researchers and members attempted to identify measures of high reliability and validity in previous research with diverse samples of elementary school students (Social and Character Development Research Consortium, 2010). Collectively, student report measures assessed aspects of SECD (e.g., prosocial interactions, self-control, honesty), self-esteem, emotional health (e.g., anxiety), problem behaviors (e.g., substance use, violence), and school performance. Parents and teachers responded to similar measures, which also were used to obtain information about students’ contexts outside of school (i.e., home and neighborhood; parent report), students’ classroom behavior, academic motivation and performance, and parental school involvement (teacher report). Information on study measures can be found in the ClinicalTrials.gov posting by Flay (2012). The trial also was designed to test for intervention effects on school-level outcomes assessed via archival records of school-level absenteeism, disciplinary referrals and suspensions, and reading and math standardized test scores. These data, reported at the end of each academic year, were obtained from the Chicago Public Schools website once they became public. For the trial, we utilized these data from all study years as well as several academic years prior to the trial in order to establish a reliable baseline.
Human Subjects Approvals
This trial was approved by institutional review boards at the University of Illinois at Chicago and Oregon State University, the Research Review Board at Chicago Public Schools, and the Public/Private Ventures Institutional Review Board for Mathematica Policy Research (MPR; the latter because the trial was part of a group of trials for which MPR collected some of the data).
Challenges and Solutions
Table 1 provides a summary of the challenges discussed in this article and strategies used to address them in the present trial. The challenges encountered in the trial are presented in four groupings: logistics, design, fidelity, and evaluation. Strategies and suggestions for addressing each challenge are incorporated into the discussion of the challenge.
Table 1. Summary of Challenges and Strategies.
Note. PA = Positive Action.
Logistics of the CRCT
Recruitment of schools and assignment to the control condition
Recruiting schools to participate in an intervention trial can present a wide range of challenges. Initially, researchers may encounter difficulties that stem from lacking established ties with schools and thus being perceived as “outsiders” (Ji et al., 2008). To help overcome such potential barriers, our strategy was to have a yearlong planning phase during which we engaged in activities that allowed us to develop rapport with school administrators (e.g., visiting schools, holding meetings). Research staff spent approximately 20 hr per school during this phase. Recruitment and matching of schools are detailed elsewhere (Ji et al., 2008). Of the 68 schools that came to an informational meeting, 18 agreed to be a part of the study on the understanding that they would be randomly assigned to either receive PA or be in the wait-listed control condition. Funding allowed for the inclusion of 14 schools; the remaining 4 were kept as replacement schools in the event of school attrition. Additionally, we needed to be prepared to handle the reluctance of principals to agree to participate in the trial in the event that their schools were not assigned to the condition of their choice. For this trial, the control condition was the least desired, so as a strategy we offered an annual stipend to control schools to support their participation in the trial, in recognition that data collection was a substantial and potentially disruptive effort. The research team, not the principals negotiating their schools' participation, initiated the stipend/incentive structure. Control schools received US$1,000 per year for each year of the trial. We also included a provision to provide the PA program (including training) to all control schools at no charge at the end of the trial (Ji et al., 2008). PA schools received a stipend of US$5,000 per year.
Treatment schools had a higher stipend to support ongoing technical support for implementation and a series of one-time incentives to fund school-level activities supporting implementation. In addition to receiving program materials, treatment schools also received 2–3 hr of annual training and incentives or tokens of appreciation for completion of assessments on students and implementation reports.
Retention of schools
The 18 schools that agreed to be in the study were randomized and no schools were lost because of randomization to the control condition (Ji et al., 2008). The four replacement schools were retained throughout the trial in case of school attrition. However, we did not lose any schools during the trial, giving us a school-level N = 14. This accomplishment seems likely to be attributable, in part, to the above-described strategies and solutions (e.g., rapport, stipends). We also had a research coordinator (generally a graduate student) to support schools in the study. All schools signed a memorandum of understanding (MOU); school personnel who were new to the schools during the trial also signed the MOU. Finally, we emphasized maintaining positive working relationships with schools, particularly administrators and key support staff, and recommend the use of this strategy. This included periodic check-ins; flexible problem-solving to accommodate the needs and preferences of individual schools with respect to activities such as structuring of program implementation and scheduling of data collection sessions; and providing thank-you cards and small tokens of appreciation (e.g., food treats for teachers and other staff) at key junctures throughout the trial. As with the planning phase, approximately 20 hr per school was spent on these activities.
Consent and survey completion rates
Gaining consent from a high proportion of parents from urban areas or of ethnic minority backgrounds is often a challenge, potentially because of a lack of parent involvement (National Center for Educational Statistics, 2004), risk level of the sample (Rojas, Sherrit, Harris, & Knight, 2008), or cultural differences between the researchers and participants (Rodríguez, Rodríguez, & Davis, 2006). In the present trial, we obtained parental consent at the start of the study when students were in Grade 3 (assent from students was also required at each wave, but did not prove to be a significant challenge for this trial, as assent rates averaged more than 95%). At baseline, parents of 79% of students provided consent. Parental consent for students joining the study at later waves was obtained at those times; these consent rates ranged from 65% to 78% for Waves 2–5. It was also necessary to reconsent all students at Wave 6 based on receipt of additional funding, which allowed the original trial to be extended. Consent rates were lower at this later stage of the study (≈58–64% for Waves 6–8), which is consistent with previous research indicating a drop in rate of consent at higher grade levels (Ji, Pokorny, & Jason, 2004; Thompson, 1984). Parental consent that was obtained lasted through the length of the study. That is, parents did not have to provide consent at every wave. One key limiting factor in achieving a high rate of consent in school-based intervention research is simply getting forms returned from parents. In order to address this challenge, our strategy was to provide incentives. Parents were offered US$10 for returning consent forms in early waves, regardless of their consent decisions. Although consent rates were likely higher as a result of this incentive, we made a point of communicating that the incentives were not conditional on a “yes” decision, but rather on simply returning a completed form.
We also offered pizza parties for classrooms (and gift certificates for teachers) with 90% or higher returned consent forms (Ji et al., 2006). Other strategies include establishing a relationship with school personnel, having student assistants to help contact parents, and having clear consent forms (Blom-Hoffman et al., 2009; Fletcher & Hunter, 2003).
To help ensure high rates of survey completion among consented students and their parents and teachers, we offered financial incentives for completion of their respective surveys at each wave (US$20 for parents, which was increased to US$40 at the final wave; US$50 for teachers; and a US$5 coupon to a local restaurant for students, which was increased to US$10 at later waves). The increased incentives for parents increased survey completion rates (e.g., 50% at Wave 5 for a US$20 incentive to 73% at Wave 8 for a US$40 incentive).
Assessment of student outcomes
A majority of the student-level outcomes were assessed via self-report, potentially leading to a method bias (Podsakoff, MacKenzie, Lee, & Podsakoff, 2003). Self-reports in particular are susceptible to social desirability biases; students might exaggerate their participation in high-risk behaviors in order to feel as if they fit in with their peers. Alternatively, they might underreport such behaviors knowing society’s negative views on behaviors such as substance use and bullying. Of additional concern is the possibility that students in PA schools might report more desirable behaviors because these are emphasized as a key goal of the program; such a tendency could artificially inflate estimates of program effects. Other research suggests that such biases are of minimal concern (Bachman, Johnston, & O’Malley, 1996; Elliott, 1994; Krohn, Thornberry, Gibson, & Baldwin, 2010; Spoth et al., 2007). As one strategy for addressing this possibility, care was taken during data collection throughout the trial to provide instructions that encouraged students to respond honestly and that normalized reporting of undesirable behaviors. Specifically, we had the following instructions at the beginning of the survey: “You give the answer that is MOST TRUE FOR YOU. Please remember that there are no right or wrong answers—we just want your honest opinions.” As a further strategy to guard against possible self-report bias, we supplemented with teacher and parent reports. These reports had strong internal consistency; however, the correlations with student reports on similar outcomes were low to moderate. There are a few possible reasons for this. Teachers do not know their students well at the beginning of the school year, so their ratings at that time are subject to lack of both reliability and validity for this reason, thus compromising the utility of any change scores based on them. 
Also, students have different teachers each year and, in addition, teacher turnover was high for schools in the current trial. As for parents, given their relatively low levels of formal education, the potential also exists for their reports to have less than optimal levels of reliability and validity. This is particularly so given that strategies most likely to be useful in countering such a possibility (e.g., personal interviews) could not be incorporated into our trial due to resource constraints. Finally, teachers and parents are rating behaviors based on what they see and where they see it (e.g., school and home, respectively), which are different environments and also differ from the student’s experience. So while these reports may be reliable for short-term changes, they are not as reliable for looking at behavior change as in this trial. We recommend supplementing with the previously noted use of school-level archival records data as additional measures of intervention impact.
A related issue concerns the confidentiality of student survey responses. To the extent that students do not feel safe in the privacy of their responses, for example, they may fail to share sensitive information (e.g., involvement in substance use or other risk behaviors). To address this concern, research staff, rather than teachers, administered surveys, and assurances of confidentiality of responses were part of the assent process. Additionally, we used a system of coding ID numbers on surveys rather than having names. To further ensure confidentiality for students, teachers were asked to stay seated at their desks and not walk around the classroom during survey administration. This also reduced teacher burden in that they were not responsible for data collection. Future researchers may also want to consider electronic data collection such as utilizing school computer labs or iPads as these may be useful in addressing confidentiality concerns of students (e.g., needing to hand in a paper survey that includes their responses to sensitive questions), while also easing the burden of survey collection.
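One way such an ID-coding system might work is sketched below. This is an illustrative assumption, not the procedure used in the trial: the function name `assign_study_ids`, the six-digit ID format, and the student labels are all hypothetical. The idea is simply that surveys carry only a random ID, while the name-to-ID crosswalk is stored separately with restricted access.

```python
import secrets


def assign_study_ids(student_names, digits=6):
    """Return a crosswalk {name: study_id} of unique, random numeric IDs.

    Hypothetical helper: the IDs carry no information about the student,
    so a completed survey cannot be linked to a name without the crosswalk.
    """
    crosswalk = {}
    used = set()
    for name in student_names:
        # Draw random IDs until one is found that has not been issued yet.
        while True:
            study_id = f"{secrets.randbelow(10 ** digits):0{digits}d}"
            if study_id not in used:
                break
        used.add(study_id)
        crosswalk[name] = study_id
    return crosswalk


if __name__ == "__main__":
    # Placeholder labels stand in for a class roster.
    roster = ["student one", "student two", "student three"]
    for name, study_id in assign_study_ids(roster).items():
        print(name, "->", study_id)
```

In practice the crosswalk would be kept apart from the survey data so that analysts work only with the coded IDs.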
Design of the CRCT
Randomizing a small number of schools/sites
The small number of clusters (i.e., schools) in this study presents concerns both for statistical conclusion validity (i.e., low statistical power; Shadish, Cook, & Campbell, 2002) and the potential that randomizing a small number of units would leave differences between treatment and control schools at baseline (Murray, Varnell, & Blitstein, 2004). Other threats included school-level attrition and failed randomization. To aid in randomization, we used a matched-pair design, a form of blocking in that the blocks are matched pairs. Exact matching can be difficult; but the use of a distance-matrix method for matching (Schochet & Novak, 2003) led to very tight matching in this case. Lack of equivalence on covariates at baseline could decrease power and precision, thereby creating a threat to statistical conclusion validity. Matching and establishing baseline equivalency reduces variability within pairs. Similar to the randomized block design, this approach reduces variability within treatment conditions and potential confounding, producing a better estimate of treatment effects (Imai, King, & Nall, 2009; Ivers et al., 2012; Rhodes, 2014).
Matching was implemented using multiple school-level archival variables including ethnicity, attendance rate, truancy rate, number of students per grade, information about school crime rates, percentage of parents reported to demonstrate school involvement, percentage of teachers employed by the school who met minimal teaching standards and percentage of students who met or exceeded criteria for passing the Illinois State Achievement Test (ISAT), received free lunch, and enrolled in or left school during the academic year (Ji et al., 2008). The use of a range of measures in the matching process helped to ensure baseline equivalency between treatment and control groups on school- and student-level outcomes. Equivalency tests at the school level revealed no statistically significant differences between the treatment and control group schools on any of the matching variables at baseline or at any of the three other times tested (Ji et al., 2008; Lewis, Bavarian, et al., 2012).
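The logic of matching schools and then randomizing within pairs can be sketched as follows. This is a minimal illustration under stated assumptions rather than the procedure used in the trial: a simple greedy nearest-neighbor pairing on standardized covariates stands in for the distance-matrix method cited above, and the school labels, the three covariates, and their values are hypothetical.

```python
import math
import random
from statistics import mean, pstdev

# Hypothetical schools: (attendance rate, % free lunch, % meeting ISAT).
SCHOOLS = {
    "A": (92.0, 88.0, 41.0),
    "B": (91.5, 90.0, 40.0),
    "C": (85.0, 97.0, 28.0),
    "D": (84.0, 96.0, 30.0),
    "E": (95.0, 75.0, 55.0),
    "F": (94.0, 78.0, 53.0),
}


def standardize(schools):
    """Z-score each covariate so no single variable dominates the distance."""
    names = list(schools)
    n_covariates = len(next(iter(schools.values())))
    columns = [[schools[n][j] for n in names] for j in range(n_covariates)]
    mus = [mean(c) for c in columns]
    sds = [pstdev(c) or 1.0 for c in columns]  # guard against zero variance
    return {
        n: [(schools[n][j] - mus[j]) / sds[j] for j in range(n_covariates)]
        for n in names
    }


def greedy_pairs(schools):
    """Repeatedly pair the two closest unmatched schools (Euclidean distance)."""
    z = standardize(schools)
    unmatched = sorted(z)
    pairs = []
    while len(unmatched) >= 2:
        a, b = min(
            ((p, q) for i, p in enumerate(unmatched) for q in unmatched[i + 1:]),
            key=lambda pq: math.dist(z[pq[0]], z[pq[1]]),
        )
        pairs.append((a, b))
        unmatched.remove(a)
        unmatched.remove(b)
    return pairs


def randomize(pairs, seed=None):
    """Within each matched pair, assign one school to PA and one to control."""
    rng = random.Random(seed)
    assignment = {}
    for a, b in pairs:
        treated = rng.choice((a, b))
        control = b if treated == a else a
        assignment[treated] = "PA"
        assignment[control] = "control"
    return assignment


if __name__ == "__main__":
    matched = greedy_pairs(SCHOOLS)
    print(matched)
    print(randomize(matched, seed=2004))
```

Because assignment is random only within pairs, each pair contributes exactly one school to each condition, which is what keeps the two conditions balanced on the matching variables.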
Furthermore, although random assignment from matched pairs of clusters (e.g., schools) cannot guarantee equivalence of the nested subjects (e.g., students; Giraudeau & Rivaud, 2009), of the total 74 student-, parent-, and teacher-reported scales tested for baseline equivalency, only 8 showed significant differences, 4 favoring control students and 4 favoring PA students. The low number of statistically significant differences and their varying directions suggest that the matching and randomization were successful and that threats to internal validity were minimized.
Statistical power
Due to funding agency decisions and funding constraints, only 14 schools could be included in the trial (increased from an initial 10 schools required by the funder). Current literature suggests that a sample size of seven school pairs would usually not be adequate for multilevel modeling (Hox, 2010) or for detecting small or moderate effects (Bingenheimer & Raudenbush, 2004). Our strategies of a matched-pairs design, analytic procedures appropriate to the distributions of outcomes, and assessment of outcomes over repeated occasions all served to help improve the level of statistical power (Raudenbush, Martinez, & Spybrook, 2007; Shadish et al., 2002) in addition to securing more funding to increase the number of schools from 10 to 14. Table 2 displays the minimum detectable effect size (MDES; Bloom, 1995) for the difference between intervention and control schools in multilevel models under several conditions. These estimates were calculated using the Optimal Design (Plus Version 3.0) software (Spybrook et al., 2011) and the “Cluster Randomized Trials with person-level outcomes” and “Repeated measures” options within the program. As shown in Table 2, MDES values are smaller for the larger N, smaller intraclass correlations (ICCs), and one- versus two-tailed tests. Increasing the number of clusters (i.e., schools) from 10 to 14 improved the two-tailed MDES values to levels similar to those for one-tailed values with N = 10. Improving power to be adequate for two-tailed tests of significance was particularly desirable given concerns about using one-tailed tests in assessing intervention effects (Ringwalt, Paschall, Gorman, Derzon, & Kinlaw, 2011). A number of methodological factors (e.g., number of students per group, repeated measures correlations) are important considerations in planning CRCTs (for a detailed discussion, see Murray et al., 2004).
Table 2. Minimal Detectable Effect Sizes in a Cluster Randomized Controlled Trial.
Note. ICC = intraclass correlation. All estimates are derived using the Optimal Design software (Spybrook et al., 2011) assuming .80 as a desired level of statistical power and a Type I error rate of .05. The original number of clusters, based on initial funding, was 10; additional funding allowed for four more clusters, for a final total of 14.
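The qualitative pattern described for the MDES can be illustrated with a simplified calculation. The sketch below is not a reproduction of the Optimal Design estimates: it assumes a two-level design with equal allocation and no covariates, ignores the gains from matching and repeated measures, and uses a normal approximation to the multiplier instead of a t distribution with few degrees of freedom (so it will tend to understate the MDES when the number of clusters is small). The function name `mdes` and the cluster size of 50 students are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def mdes(n_clusters, cluster_size, icc, alpha=0.05, power=0.80,
         tails=2, p_treated=0.5):
    """Approximate minimum detectable effect size (in SD units).

    Simplifying assumptions (not those of the trial's own calculations):
    two-level cluster randomized design, no covariates, no matching,
    and a normal approximation to the critical values.
    """
    z = NormalDist()
    # Sum of the critical value for the test and the quantile for power.
    multiplier = z.inv_cdf(1 - alpha / tails) + z.inv_cdf(power)
    # Variance of the standardized treatment-effect estimate.
    variance = (icc + (1 - icc) / cluster_size) / (
        p_treated * (1 - p_treated) * n_clusters
    )
    return multiplier * sqrt(variance)


if __name__ == "__main__":
    for n_clusters in (10, 14):
        for icc in (0.05, 0.10):
            for tails in (1, 2):
                value = mdes(n_clusters, cluster_size=50, icc=icc, tails=tails)
                print(f"J={n_clusters} ICC={icc:.2f} tails={tails}: {value:.3f}")
```

Under these assumptions the function reproduces the direction of the relationships noted above: the MDES shrinks as the number of clusters grows, as the ICC falls, and when a one-tailed rather than a two-tailed test is used.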
Fidelity
Implementation of PA
Getting teachers to deliver a prevention program curriculum regularly and with fidelity also can present an array of formidable challenges (e.g., organizational capacity, staff characteristics; Durlak & DuPre, 2008; Malloy et al., 2014). Further, many factors play a role in program implementation, such as integration into school activities, training, and supervision of the program (Payne & Eckert, 2009). Given the increasing demands on schools and teachers to focus on traditional academic subject instruction in particular, we anticipated that the support of school administrators and social support among colleagues would likely both be critical to ensuring implementation of the PA program.
With this concern in mind, our strategy was to ask principals to obtain the buy-in of teachers before agreeing to participate in the study. Providing more specific expectations for principals’ efforts in this regard and asking for documentation of some requisite level of teacher buy-in would have been desirable, and is recommended for future trials. We provided an incentive (e.g., a free lunch) to teachers to regularly complete brief implementation reports (completed at the end of each unit of the PA program, about every 6 weeks). A member of the research team worked with the program developer to provide ongoing technical support for implementation (e.g., visits to schools approximately every 2 weeks). Additionally, representatives from all schools were brought together annually to share experiences with each other (e.g., professional learning communities; Mullen & Schunk, 2010) and the program developer. Finally, as noted earlier, the program developer provided yearly staff trainings for each school. Training for the program was a challenge as well. We took advantage of in-service days and also paid for substitutes using grant funds. The amount of training was based on what the schools would accommodate and therefore likely compromised fidelity of implementation to some unknown degree. Resources for implementation support limited our capacity to ensure that 100% of teachers and other staff participated in trainings or, if absent or later arriving to the school, received make-up training. However, by structuring the training as part of required in-service/preparation days, attendance was very high. Furthermore, when a teacher or staff person joined a school midyear, the implementation coordinator made every effort to meet individually with them to orient them to the program and cover content similar to that covered in the beginning-of-year trainings. The periodic trainings conducted during each school year provided further opportunities for training.
Both teacher reports and student reports were used for the process evaluation. Teachers were asked about the number of lessons they taught and how much they made adaptations to the curriculum, if at all. Students rated their engagement with the program (“I like PA”; “I plan to use PAs when I grow up”). We attempted to obtain weekly reports of implementation activities (e.g., PA lessons taught) from teachers in treatment schools, but this was found to be too burdensome, so we relied on the six unit reports (e.g., self-concept, PAs for body, etc.). In addition to teacher reports, we recommend using observations or taping sessions to assess implementation quality. We did not use these methods in the present trial, as this was logistically too difficult to utilize effectively within the resource constraints of the study and the contexts of the participating schools.
Consistent with the issues and dynamics we had anticipated, an implementation study found that teachers who had delivered PA reported that although they saw SECD programs as beneficial, they found it difficult to prioritize implementing the PA program given pressures emanating from regional administrators and the school system’s central office to devote classroom instructional time to academics. As we had expected, teachers also reported that support from colleagues was critical to implementation (Fagen et al., 2015).
More generally, indices pertaining to implementation (e.g., teacher description of amount and quality of PA activities in the classroom, perceived effectiveness of the activities, student reports of exposure to and attitudes toward the program) tended to show variability across schools, especially in early years, with improvements over time (Bickman et al., 2009). By the end of Year 6, one school was implementing at only a moderate level of fidelity, three at a moderate to high level, and three at high levels (Jarpe-Ratner et al., 2013). Although neither the levels nor the consistency of implementation achieved are ideal, it seems likely that shortcomings in this aspect of the trial would have been notably more pronounced in the absence of the above-described strategies.
Control school activities
The transition from laboratory to field settings is not a smooth one (Hulleman & Cordray, 2009), and there are a variety of sources of infidelity that can impact the theoretically expected differences between treatment and control conditions. One such source of infidelity often ignored is what happens in control schools (e.g., Sloboda et al., 2008). A requirement of study participation was that schools had not previously utilized PA or a similar SEL/SECD intervention, so that the PA program effects would not be confounded with other programs. However, neither this provision nor offering the PA program to control schools upon completion of the study period served to prevent some control schools from using some SECD-oriented activities similar to those of the PA program during the study period; such a prohibition would, in fact, raise serious ethical considerations and likely would have inhibited school recruitment efforts.
Teachers completed surveys about whether they used SECD-like activities in their classroom, including specifics on the target domain (e.g., peace promotion, character education), program name (if they used a program), strategies at the classroom and school level for promoting SECD, and attitudes toward promoting SECD. These surveys were completed annually; the completion rate was over 85% at all waves (Social and Character Development Research Consortium, 2010). Some control schools did indeed report using SECD-oriented activities. We do not know the vigor of the activities (level of implementation, school support, etc.), other than that control schools reported a high quantity of SECD-oriented activities. However, treatment teachers reported engaging in more SECD-like activities and professional development than control teachers, and they reported higher enthusiasm for these activities (Social and Character Development Research Consortium, 2010). It is possible, then, that the reported effects of the PA program, as indexed by differences between conditions on outcome measures (i.e., effect sizes), are understated because of the SECD-related activities occurring in control schools.
We recommend assessing the level of treatment-like activities in control schools and including this stipulation in an MOU with schools. By doing so, we were able to have some idea of the extent of this bias. In principle, such assessments could then be utilized in testing models of intervention impact, such as complier average causal effect models (Stuart, Perry, Le, & Ialongo, 2008), although the present trial lacked a sufficient number of schools to do so.
Evaluation
Restricted range of outcomes
Most questionnaires were from existing, validated measures. Some had to be modified, informed by pilot testing and a general concern with item content being appropriate for students in a range of grades (from third to eighth). Even so, some censoring or “floor” and “ceiling” effects (restricted range) were observed for some of these measures (e.g., respect for parents). In addition, other outcomes were relatively rare (e.g., extreme forms of violence). As a result, for some measures the standard assumptions of a basic linear regression analysis did not hold. Applying linear regression in such cases can lead to misleading estimates of program effects (Long & Freese, 2006). Our strategy for this issue has been to utilize generalized linear mixed models appropriate to the distribution of each outcome. Researchers should be prepared to analyze outcomes that are not normally distributed using models suited to their distributions, such as binary or categorical (Cohen, Cohen, West, & Aiken, 2013; Long & Freese, 2006; Rabe-Hesketh & Skrondal, 2012), Poisson (Olsen & Schafer, 2001; Rabe-Hesketh & Skrondal, 2012), and censored (Joreskog, 2002; Meng & Schenker, 1999; Skrondal & Rabe-Hesketh, 2004) models.
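As a minimal sketch of how restricted range might be screened for before choosing a model, the snippet below computes the share of responses at the lowest and highest scale points. The function names, the example scores, and the 15% flagging threshold are illustrative assumptions, not part of the trial's analyses.

```python
from collections import Counter


def floor_ceiling_share(scores, scale_min, scale_max):
    """Return (share at floor, share at ceiling) among observed scores."""
    observed = [s for s in scores if s is not None]  # drop missing responses
    if not observed:
        raise ValueError("no observed scores")
    counts = Counter(observed)
    n = len(observed)
    return counts[scale_min] / n, counts[scale_max] / n


def flag_restricted_range(scores, scale_min, scale_max, threshold=0.15):
    """Flag floor/ceiling effects; the threshold is an assumed rule of thumb."""
    floor, ceiling = floor_ceiling_share(scores, scale_min, scale_max)
    return {"floor": floor >= threshold, "ceiling": ceiling >= threshold}


# Hypothetical responses on a 1-4 scale (None marks a missing response).
example_scores = [4, 4, 3, 4, 4, None, 2, 4, 4, 3]
example_flags = flag_restricted_range(example_scores, scale_min=1, scale_max=4)
```

A measure flagged in this way would be a candidate for a censored or categorical model rather than a linear one.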
Measurement invariance
Measurement invariance establishes that the meaning of a construct (measure) is similar across groups (e.g., boys and girls, conditions) or across time (e.g., age; Geiser, 2012; Geldhof & Stawski, 2016; Pentz & Chou, 1994). Many measures, however, are designed to be stable over time. Measurement invariance may not be a reasonable expectation in a longitudinal study when developmental change is expected in addition to change due to intervention effects for the treatment group (Geldhof et al., 2014). In this trial, our analyses to date have revealed that several measures do not exhibit strong invariance across times of measurement (Lewis, Vuchinich, et al., 2016). We have no a priori reason, however, to expect that such noninvariance was differentially applicable to the intervention condition, such as might occur if PA affected youths’ interpretation of items, a consideration which supports the meaningfulness of our treatment effect estimates (Geldhof et al., 2014). Additionally, given the comprehensive nature of PA, measures were used that provided relatively global assessments of constructs (e.g., overall tendencies toward negative affect rather than assessments of particular types of feelings). Such measures seem less subject to problematic variation over time than those that are more specific and thus more prone to developmental change. If measurement invariance does not hold, groups cannot be directly compared, as the items have different meanings between the groups; these qualitative distinctions in the items should be explored and discussed (e.g., the intervention may have changed the treatment group’s understanding of a construct). An in-depth discussion of measurement invariance is beyond the scope of this article; however, we recommend that researchers test measures for invariance when it is reasonable to expect invariance.
Further, we recommend that researchers test for changes due to development versus program effects (Geldhof & Stawski, 2016; Lawrence & Blair, 2003; Pentz & Chou, 1994).
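As a brief sketch of what the levels of invariance refer to, consider a standard common factor model for the response of student i to item j at wave t. The notation is assumed here for illustration and is not taken from the trial's analyses.

```latex
\begin{align*}
  % Measurement model: intercept, loading, latent score, residual
  y_{ijt} &= \tau_{jt} + \lambda_{jt}\,\eta_{it} + \varepsilon_{ijt} \\
  % Weak (metric) invariance: loadings equal across waves
  \lambda_{jt} &= \lambda_{j} \quad \text{for all } t \\
  % Strong (scalar) invariance: loadings and intercepts equal across waves
  \lambda_{jt} &= \lambda_{j}, \quad \tau_{jt} = \tau_{j} \quad \text{for all } t
\end{align*}
```

Under strong invariance, differences in observed item means across waves can be attributed to differences in the latent means. When intercepts or loadings differ by wave, observed change mixes true change with change in how the items function.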
Student mobility
Although all 14 schools were retained throughout the trial, the student population attending these schools was highly mobile. Within low-income, urban schools, mobility is a common occurrence (Tobler & Komro, 2011). Student mobility poses challenges for efficacy trials of longitudinal school-based programs. One such challenge is the difficulty of inferring whether the observed effects are due to the intervention or are a result of differential attrition. A strategy proposed to overcome this difficulty is to conduct a cluster-focused intent-to-treat analysis (Brown et al., 2008; Vuchinich, Flay, Aber, & Bickman, 2012). This approach acknowledges the focus on schools and follows all schools randomized to condition to trial endpoint, regardless of how well the intervention is implemented (or not) in treatment schools. It also requires collecting and analyzing data from all students who are in the appropriate grade cohort in the schools at the time of each assessment. In accordance with these considerations, we assessed students who entered schools after the beginning of the trial (joiners), but did not follow individual students who stopped attending the study schools (leavers). This is similar to the repeated cross-sectional design used in community-level research (Murray, 1998), except that we surveyed the population of students present at each time rather than taking a sample of them.
From the standpoint of students, across time they could be considered a “dynamic” (i.e., changing) grade cohort. Table 3 shows the number of students present at each wave, as well as the number of students who remained in later waves of the trial; there were no differences by condition. At the student level, the total sample size across all eight waves was 1,170, with approximately half that number being present at each wave. Only 131 (21%) of the 624 students in the original Grade 3 cohort remained at Grade 8, illustrating the high mobility of the low-income, urban students in our sample. Additionally, as noted above, parental consent rates declined during the middle-school grades; at the same time, Chicago school enrollment was decreasing. For all of these reasons, the student sample size at Wave 8 (363) was smaller than at Wave 1 (624). The average number of waves of data provided per student was 3.1.
Table 3. Mobility of Students by Study Condition.
Note. The increase in mobility rates after Wave 5 may be partially explained by the time difference between Wave 5 and Wave 6 representing one school year plus two summer breaks of mobility (end of Grade 5 to beginning of Grade 7) and the transition from elementary to middle school grades. Joiners in the Fall of 2005 and 2008 were considered as joiners at the end of the school year (Spring 2006 and 2009, respectively). This table shows the analysis sample. There may be small variations between this table and tables in papers (and reports) from Institute of Education Sciences and the Social and Character Development Research Consortium because some students may have only been present for the “multisite” or “site-specific” days of data collection. Additionally, some students may not have been present for any data collection and therefore only have teacher-reported data.
Potential moderators of program effectiveness
Programs like PA may have different effects for boys and girls, for students at different levels of risk, or for students as well as entire schools that differ in other ways. Moderation analyses are necessary to test for such possibilities, and these are now becoming more common in the literature (e.g., Bierman, Nix, Greenberg, Blair, & Domitrovich, 2008). Indeed, several programs have found gender differences in social–emotional outcomes and health-related behaviors (e.g., Bierman et al., 2010; Flay, Graumlich, Segawa, Burns, & Holliday, 2004; Taylor, Liang, Tracy, Williams, & Seigle, 2002). Other potential moderators include implementation (e.g., Durlak et al., 2011; Wilson & Lipsey, 2007), behavioral risk (e.g., Ellickson, McCaffrey, Ghosh-Dastidar, & Longshore, 2003; Wilson & Lipsey, 2007), and environmental factors (e.g., Bierman et al., 2010; Hughes, Cavell, Meehan, Zhang, & Collie, 2005). In addition, race or ethnicity moderation may be of interest. It is important to consider potential moderators prior to data collection to ensure that these moderators are assessed in the measurement tools.
To date, in the Chicago trial of PA, gender differences have been found on only a limited and inconsistent basis (Bavarian et al., 2013; Lewis, Schure, et al., 2013). Further, using student race or ethnicity as a moderator was not possible because of extreme confounding with aggregate racial/ethnic composition at the school level. Median odds ratio (MOR) for Black versus non-Black schools ranged from 6.39 to 20.30 (ICCs = .53 to .75), showing that student race/ethnicity largely covaried with school (MOR is preferable to ICCs with binary outcomes; Merlo et al., 2006). Indeed, three pairs of schools were >99% Black, two pairs were 75% Hispanic, and two pairs were mixed (50% Hispanic, 31% Black, 9% White, and 9% Asian). Ideally, we would test for moderation by school-level factors such as student racial/ethnic composition, baseline levels of student academic achievement and attendance, and so forth. However, with only 14 total schools in the trial, such analyses are not feasible due to their extremely low statistical power. To date, no moderation analyses by implementation or risk factors have been conducted in the present trial.
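To make the relation between the MOR and the ICC concrete, the sketch below uses the formula from Merlo et al. (2006), MOR = exp(sqrt(2 × school-level variance) × 0.6745), together with the latent-variable ICC for a logistic model, which fixes the student-level variance at π²/3. That the reported ICCs were computed on this latent scale is an assumption; the function names are illustrative.

```python
import math
from statistics import NormalDist

# Student-level residual variance on the latent logistic scale.
LOGISTIC_RESIDUAL_VARIANCE = math.pi ** 2 / 3
# 75th percentile of the standard normal distribution (about 0.6745).
Z_75 = NormalDist().inv_cdf(0.75)


def mor_from_variance(school_variance):
    """Median odds ratio implied by a school-level random-intercept variance."""
    return math.exp(math.sqrt(2 * school_variance) * Z_75)


def variance_from_mor(mor):
    """Invert the MOR formula to recover the school-level variance."""
    return (math.log(mor) / Z_75) ** 2 / 2


def latent_icc(school_variance):
    """ICC on the latent scale of a two-level logistic model."""
    return school_variance / (school_variance + LOGISTIC_RESIDUAL_VARIANCE)


for mor in (6.39, 20.30):
    variance = variance_from_mor(mor)
    print(f"MOR={mor:.2f} variance={variance:.2f} ICC={latent_icc(variance):.2f}")
```

Under these assumptions, MORs of 6.39 and 20.30 correspond to latent-scale ICCs of about .53 and .75, which agrees with the range reported above.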
Of particular interest in highly mobile populations, program effects might have differed based on whether students stayed in the same school for the duration of the program, left the school during the study, or joined the school during the study. Because the trial involved random assignment at the cluster (school) level, data were collected not only from students present at the start of the trial but also from those who joined the school after the trial began. We thus have been able to explore whether program impacts differed depending on student mobility pattern, as indicated by whether data were present or missing for a student at each wave. A challenge, however, is that there are 256 possible patterns of student mobility (two possible states, present or absent, at each of eight waves: 2^8 = 256).
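The count of possible patterns can be illustrated with a short sketch that encodes each pattern as a string of eight presence indicators. The encoding and the function names are illustrative assumptions rather than the trial's actual data format.

```python
from itertools import product

N_WAVES = 8


def all_mobility_patterns(n_waves=N_WAVES):
    """Enumerate every present (1) / absent (0) sequence across the waves."""
    return ["".join(bits) for bits in product("01", repeat=n_waves)]


def pattern_for(waves_present, n_waves=N_WAVES):
    """Encode one student's pattern from the set of waves (1-based) with data."""
    return "".join(
        "1" if wave in waves_present else "0" for wave in range(1, n_waves + 1)
    )


patterns = all_mobility_patterns()
# A hypothetical student with data at Waves 1-3 only.
early_leaver = pattern_for({1, 2, 3})
# The all-absent pattern is enumerated here, but a student with no data at
# any wave would not appear in an analysis sample.
```

With this many patterns and roughly 1,170 students, most individual patterns would contain too few students to analyze separately, which motivates grouping similar patterns.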
An approach to analyzing missing data (mobility) patterns is latent class analysis (LCA; Beunckens, Molenberghs, Verbeke, & Mallinckrodt, 2008; Lin, McCulloch, & Rosenheck, 2004; Roy, 2003), as LCA allows for the identification of classes of students with similar patterns of missingness (Marsh, Lüdtke, Trautwein, & Morin, 2009). Drawing on this work, we conducted an LCA to identify subgroups of individuals (Flay, 2012; Lewis, Bavarian, et al., 2017). The results of this analysis revealed five distinct patterns of student mobility during the trial: (1) stayers (average study duration of 5.72 years, 13%), (2) temporary participants (present for Grade 4 and/or 5 only; average study duration of 1.30 years; 16%), (3) late joiners (average study duration of 1.38 years; 25%), (4) early leavers (average study duration of .94 years; 22%), and (5) late leavers (average study duration of 3.23 years; 24%). We have used these patterns as a grouping variable to test for mobility as a moderator of program effects—that is, examining whether program effects varied by mobility pattern.
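As a sketch of the model underlying this kind of analysis, a latent class model for binary presence indicators can be written as follows. The notation, the use of one binary indicator per wave, and the local independence assumption are illustrative; the exact specification used in the trial is not given here.

```latex
\begin{align*}
  % y_{iw} = 1 if student i has data at wave w, 0 otherwise
  % \pi_c: proportion of students in class c; \rho_{cw}: P(present at wave w | class c)
  P(\mathbf{y}_i) &= \sum_{c=1}^{C} \pi_c \prod_{w=1}^{8}
      \rho_{cw}^{\,y_{iw}} \left(1 - \rho_{cw}\right)^{1 - y_{iw}} \\
  % Posterior probability that student i belongs to class c
  P(c \mid \mathbf{y}_i) &= \frac{\pi_c \prod_{w=1}^{8}
      \rho_{cw}^{\,y_{iw}} \left(1 - \rho_{cw}\right)^{1 - y_{iw}}}{P(\mathbf{y}_i)}
\end{align*}
```

Students are typically assigned to the class with the highest posterior probability, and the number of classes C is chosen by comparing model fit across candidate solutions.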
To date, these analyses have not revealed significant differences in estimated program effects by mobility pattern (e.g., Lewis, Vuchinich, et al., 2016). Ideally, researchers also should collect and utilize data on auxiliary variables that may help to predict or explain patterns of missingness or mobility in school-based CRCTs, as these aid in the imputation process (Collins, Schafer, & Kam, 2001; Enders, 2008; Graham, 2003). For this trial, information was collected on several family variables that could help to explain mobility, including prior moving history, job stability, employment status, and the likelihood of moving during the next year. In future studies, we plan to incorporate these variables into our analyses. Additionally, presence or absence at a particular wave is based on whether the student had any data on any outcome at the wave. The student may have been absent on the days of data collection; that is, they may have technically been “present” that school year or wave, but just not for data collection. A more accurate approach would be to use attendance data; however, only aggregated (not individual student) data were available for our use. Researchers and evaluators in schools may want to examine the possibility of gaining access to student-level records such as attendance for various analyses.
Discussion
Although a number of challenges were encountered in our school-based cluster randomized trial, we were able to address each of them (at least partially) using practical strategies that may be useful in other school-based CRCTs. It is worth noting that with these strategies incorporated, the trial was able to meet many of the standards of evidence for prevention programs and policies put forth by the Society for Prevention Research (Flay et al., 2005; Gottfredson et al., 2015), such as efficacy Standard 2 regarding measurement of the outcomes and follow-up, and efficacy Standard 5 regarding the reporting of all outcomes, regardless of direction or significance. Further, this program is included in the What Works Clearinghouse, Blueprints for Healthy Youth Development, and CrimeSolutions.gov as a model program. In line with these strengths, multiple papers reporting the intervention impact from this trial have been accepted for publication in a wide range of peer-reviewed journals (Bavarian, Lewis, Acock, et al., 2016; Bavarian et al., 2013; Lewis, Bavarian, et al., 2012; Lewis, DuBois, et al., 2013; Lewis, Schure, et al., 2013; Lewis, Vuchinich, et al., 2016). Our experience suggests that moving to an effectiveness trial in low-income schools for a multifaceted, and thus relatively complex, intervention such as PA would be likely to present serious and intensified challenges to achieving an adequate level of fidelity. To help offset this concern, it could be desirable to build into the intervention itself some of the external supports that were incorporated into the current trial (e.g., an implementation coordinator who is affiliated with the program, not the school, to work with the schools).
In this trial, we were quite successful in addressing a number of challenges that were unique to, or at least present in more pronounced ways within, urban, low-income schools. School-based CRCTs of prevention and promotion programs such as PA can be expected to play a critical role as we move to effectiveness and dissemination trials using more population- or setting-based approaches (Flay, 1986; Glasgow, Lichtenstein, & Marcus, 2003). Longitudinal CRCTs face many challenges, and the goal of this article was to elucidate several of these challenges and corresponding strategies for addressing them. The strategies described were not fully successful in all instances. For example, the levels of implementation fidelity achieved were clearly not ideal. Likewise, not all potential strategies discussed could be utilized in this trial due to inherent limitations such as the relatively small number of schools involved. Lastly, to reiterate, this article does not provide a broad synthesis of the literature on CRCTs and the challenges faced in such trials. Such a paper would be informative, however, and clearly would benefit from greater availability of papers such as the present one that provide in-depth consideration of issues faced in specific trials. In this way, it will be possible to build a robust knowledge base to inform not only the science but also the “art” of conducting rigorous and informative CRCTs in different settings.
Footnotes
Authors’ Note
Kendra M. Lewis, Niloofar Bavarian, Marc Schure, and Margaret Malloy were affiliated with Oregon State University during initial preparation of this article. Joseph Day was with University of Illinois at Chicago. The Social and Character Development (SACD) research program includes multiprogram evaluation data collected by MPR and complementary research study data collected by each grantee. The findings reported here are based only on the Chicago portion of the multiprogram data and the complementary research data collected by the University of Illinois at Chicago and Oregon State University (Brian Flay, Principal Investigator) under the SACD program. The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the Institute of Education Sciences, CDC, MPR, or every Consortium member, nor does mention of trade names, commercial products, or organizations imply endorsement by the U.S. Government. Correspondence concerning this article should be addressed to Kendra Lewis or Brian Flay.
Acknowledgments
We would like to thank Robert Duncan for assistance with analyses. We are extremely grateful to the participating Chicago Public Schools (CPS), their principals, teachers, students, and parents. We thank the CPS Research Review Board and Office of Specialized Services, especially Drs. Renee Grant-Mitchell and Inez Drummond, for their invaluable support of this research.
Declaration of Conflicting Interests
The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: The research described herein was conducted using the program, the training, and technical support of Positive Action, Inc., in which Brian Flay’s spouse holds a significant financial interest. Issues regarding conflict of interest were reported to the relevant institutions and appropriately managed following the institutional guidelines.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This project was funded by grants from the Institute of Education Sciences (IES), US Department of Education: R305L030072, R305L030004, and R305A080253 to the University of Illinois, Chicago (2003-05) and Oregon State University (2005-12). The initial phase (R305L030072), a component of the Social and Character Development (SACD) Research Consortium, was a collaboration among IES, the Centers for Disease Control and Prevention’s (CDC) Division of Violence Prevention, Mathematica Policy Research Inc. (MPR), and awardees of SACD cooperative agreements (Children’s Institute, New York University, Oregon State University, University at Buffalo-SUNY, University of Maryland, University of North Carolina-Chapel Hill, and Vanderbilt University).
