Abstract
The logic, theory, and practice of large-scale evaluation were once limited to large federal initiatives. However, with the advent of regularly collected performance measures and the often multisite implementation of quality improvement efforts, there is an opportunity to adopt large-scale evaluation methods in local and regional evaluation efforts. While ineffective programs show little variation in their ineffectiveness, effective programs generally show a range of effects. A central task of large-scale evaluation is to describe and explain why the same program, implemented in multiple settings, produces different effects. By attending to variation attributable to settings, activities, outputs, and participants, and by documenting the conditions in which programs achieve greater and lesser success, large-scale evaluation supports the needs of decision-makers who are choosing whether to implement an evidence-based program. In addition to knowing a program is effective, decision-makers want to know whether it is appropriate for their situation and what facilitates or impedes effective implementation and bears on the program’s ultimate effectiveness. This article presents the different methods and approaches appropriate for effectively and efficiently constructing and executing a large-scale evaluation that will provide decision-makers with the evidence they need for evidence-informed adoption of effective programs.
The Role of Large-Scale Evaluation in Evaluation Practice
Program evaluation has undergone considerable change in the past 50 years. Emerging from an amalgam of concerns stemming from the 1960s Great Society investments in education, income maintenance, housing, health, and criminal justice (Shadish, Cook, & Leviton, 1991), evaluation has become a diverse and heterogeneous field with a surfeit of models and methods supporting this endeavor (Stufflebeam & Coryn, 2014). According to Sechrest and Figueredo (1993), these various approaches arise out of the social ecology in which evaluation operates, and different evaluation models and methods gain or lose dominance in response to that ecology. After reviewing the landscape of evaluation, these authors advocate a change in emphasis “from a focus on ruling out rival causes or plausible rival hypotheses to…[the] production of evidence that makes our favored cause a plausible explanation for the data” (p. 665).
A large-scale evaluation, in which the implementation of a program, strategy, or practice is assessed across multiple sites or settings, provides an opportunity to produce this plausible evidence by estimating the consistency of effectiveness, and the factors influencing effectiveness estimates across the multiple settings. The discussion that follows introduces the reasoning supporting these assertions, describes the defining features of a large-scale evaluation, and provides an overview of how commonly used methods of evaluation are applied when conducting a large-scale evaluation. The discussion is necessarily broad and conceptual; citations provide much of the additional detail necessary to conduct a credible large-scale evaluation. Additionally, the national evaluation of the Safe Schools/Healthy Students (SS/HS) initiative is cited throughout as an example of many of the concepts introduced below. This evaluation focused on how the 57 grantees of the 2005, 2006, and 2007 SS/HS cohorts implemented the initiative and used standardized change scores to document their achievement of seven outcome measures.
The SS/HS initiative was a collaboration among the U.S. Departments of Education, Health and Human Services, and Justice that, since 1999, served over 13 million youth and provided more than US$2 billion in funding and other resources to 386 communities in all 50 states. The initiative funded local education agencies (LEAs) to create partnerships with local law enforcement, juvenile justice agencies, social service and mental health agencies, and other community organizations to create an integrated network of activities, programs, services, and policies to enhance coordinated service delivery. This coordinated approach addressed the concern that it was inefficient for agencies and community organizations to focus only on their own issues and outcomes and that their effectiveness in addressing youth problem behaviors was being compromised. Consistent with Jessor and Jessor’s (1977) theory of problem behavior, SS/HS was hypothesized to improve school climate and access to behavioral health services, while reducing student alcohol, tobacco, and marijuana use, and rates of student-perceived and student-experienced school violence (Modzeleski et al., 2012; Rollison et al., 2012).
The conceptual approach taken in this article asserts that a principal goal of large-scale evaluation is to create generalizable knowledge to inform scalability. The strategies used to achieve this emphasis assess setting conditions, factors that influence implementation, and their moderating effect on program effectiveness across multiple settings. Using the natural or planned variation in implementation, large-scale evaluations provide an opportunity to objectively test what works, for whom, and under what conditions. While a well-conducted, large-scale evaluation describes the program and its variation across settings and tests the overall effectiveness of a program, the value and strength of large-scale evaluation lies in its ability to discern the plausible reasons why the same program—implemented in multiple sites—returns different effectiveness results. Using the tools of systematic inquiry, and holding equally the ideals of respect for people and the responsibilities for social and public welfare, a large-scale evaluation provides evidence on the merit, worth, and value of social and behavioral interventions intended to improve the human condition (American Evaluation Association, 2004).
For a considerable period of time, the vast majority of program evaluations have been concerned with documenting the implementation and impact of a single program or intervention conducted in a single setting. However, increasingly, we observe opportunities to evaluate interventions that are administered in multiple settings. For example, an intervention trial to improve the nutritional value and quality of food that diabetics eat can take place in several medical clinics or physicians’ offices scattered throughout a defined catchment area. Likewise, a smoking or drug prevention trial can take place in several school districts scattered throughout a region or state, with some schools assigned to treatment and others to control conditions. Colloquially, these evaluations have been dubbed large-scale evaluations and have historically been the domain of large federally funded endeavors that reflect the public health agenda (e.g., Healthy People 2000 and 2020). Consistent with the implementation of these multisite programs, mandated collection of quality improvement (performance) data in many domains and the advent and increased availability of big data (Dinov, 2016) will provide opportunities to adapt large-scale evaluation methods and questions to many local and regional evaluation efforts.
Characteristics of Large-Scale Evaluations
What distinguishes a large-scale evaluation from normal “run-of-the-mill” program evaluation is that the analysis focus shifts from describing the intervention and its impact on individual recipients to accounting for intervention effects on recipients given the activities, contexts, and characteristics of the settings implementing the intervention. These “contextual factors” play an active role in determining procedurally how the intervention is implemented, how well it is implemented, how closely trainers adhere to the program content (fidelity), and whether adaptations are made (cultural and otherwise) to improve program fit. Context may also influence the case mix of participants exposed to the intervention, the resources (fiscal and otherwise) available to the intervention, and may strongly influence participant recruitment and retention. In this respect, a large-scale program evaluation provides an opportunity to examine how variations in implementation, program delivery, and program participation all have the potential to affect program outcomes.
To wit, a rigorous program evaluation generally includes a rich description of the program, program participants, and detailed information on the instructional modalities tied to the intervention. We also often encounter some type of table of specifications or logic model explicating how the intervention is hypothesized to work. Subsequent analyses are structured to assess program efficacy on designated “treatment” recipients relative to comparators (controls). Additional analyses may demonstrate how intervention recipient characteristics, competencies, and lifetime and/or intervention experiences interact with outcome success. In contrast, a large-scale evaluation may involve the same types of analyses; however, because it collects data on the multiple settings in which the intervention is implemented, it also has the capacity to systematically examine other features that impinge on a program’s effectiveness. This effort can include identifying how implementation features, activities, strategies, and conduct vary across implementation settings, identifying reasons for this variation, and assessing whether and how these variations influence intervention effectiveness (e.g., Glasgow, Lichtenstein, & Marcus, 2003; see also Glasgow et al., 2012). Put simply, the goal of large-scale evaluation is not only to estimate the overall effectiveness of an intervention but to identify features associated with both implementation and intervention success, which may more accurately portend generalizability of the intervention (Donaldson, Graham, & Hansen, 1994).
To illustrate the emphasis of large-scale evaluations, consider the situation where a school adopts an evidence-based drug prevention program. The program may be delivered by multiple teachers or by the same teacher to multiple classes. It may be delivered to different grades, at different times of day, or in different rooms. Different teachers may emphasize different content or have differing beliefs regarding the value and intent of the intervention. Not only will teachers vary in their support for implementation, but organizational characteristics can also influence program receptivity, something experienced with the diffusion of any innovative intervention (Rogers, 2003). Research shows that individual differences in teachers’ background and training as well as their level of “buy in” (Hunter, Elias, & Norris, 2001) can influence implementation quality (e.g., Beets et al., 2008), and this is especially true for drug prevention programs (e.g., Fagan & Mihalic, 2003; Mihalic, Fagan, & Argamaso, 2008). Each of these unique intersections of host setting, delivery agent, and recipient may be considered a partial replication of the intervention since the intervention experience will likely differ by classroom composition, student and teacher energy levels, and even the content delivered (e.g., Malloy et al., 2015).
All of these setting features create a unique intervention experience for students, and each intersection may produce a different estimate of program efficacy. A large-scale approach to this evaluation would include not only specifying what student features influenced effectiveness but would also identify and describe the conditions under which the program was delivered with the greatest fidelity (implementation quality), what challenged or facilitated effective implementation, and what steps were taken to ameliorate these conditions. Likewise, the same evaluation also needs to consider what features, activities, and characteristics of implementation were associated with greater demonstrated program effectiveness. These issues have now become front and center in the discussion regarding implementation science (e.g., Pas, Waasdorp, & Bradshaw, 2015; Payne, 2009) and are a major focus on bridging the science to practice gap. In the long term, results of these and related studies will facilitate the scaling and dissemination of evidence-based programs by highlighting the conditions in which programs have the greatest potential for success as they tackle the problems of scaling up (e.g., Botvin & Griffin, 2010).
Furthermore, there is now tremendous attention paid to the “replication crisis” that exists in psychology, medicine, and science more generally (Begley & Ellis, 2012; Begley & Ioannidis, 2015; Ioannidis, 2005; Open Science Collaboration et al., 2015; Prinz, Schlange, & Asadullah, 2011). It would thus seem prudent to actively seek opportunities to incorporate standards for replication into the primary goals of establishing what works, for whom, and under what conditions (e.g., Valentine et al., 2011). Large-scale evaluation approaches provide an opportunity to examine the stability of findings, identify implementation features and organizational characteristics associated with greater and lesser levels of effectiveness, as well as barriers that may diminish and facilitators that may increase program effectiveness and influence the ability of innovative interventions to scale up. 1
Methods for Large-Scale Evaluations
When evidence-based programs are implemented with fidelity, they almost always return stronger results than poorly implemented programs (Mihalic, 2004). However, “innovations are seldom implemented as planned” (Berman & McLaughlin, 1976, p. 347) and are typically adapted to meet local needs (Rogers, 2003). Large-scale evaluations can provide an opportunity to assess (a) what conditions necessitate adaptation, (b) what adaptations might be expected in delivering the intervention, (c) whether the adaptations improve or compromise the delivery of core intervention components, and (d) whether and which adaptations affect outcomes. Documenting adaptations, the extent to which they align with local conditions, and their influence on the fidelity of intervention delivery is a central concern in any large-scale evaluation. The factors affecting successful implementation vary across a range of considerations and several theories, models, and frameworks have emerged to orient and guide implementation scientists (Nilsen, 2015). Depending on the goals of the evaluation and the needs of decision-makers, one or more of these approaches will sensitize the evaluator to the range of implementation influences and which are likely central or peripheral to the evaluation’s goals.
Theory-Driven Large-Scale Evaluation
While the focus on large-scale evaluation may differ somewhat from traditional program evaluation, large-scale evaluation activities generally adhere to the principles of theory-driven evaluations (Chen, 1990). Beginning with a conceptual framework and logic model (McLaughlin & Jordan, 1999), the large-scale evaluator identifies the various inputs, activities, and outputs, which are expected to influence the near-term, intermediate, and long-term outcomes of the intervention (Hawe, Shiell, & Riley, 2009; see also Durlak, 2013). A well-specified logic or conceptual model is both theory-driven and evidence-based and describes fully the range of predictor, intermediate, and outcome constructs for which measures must be identified. Thus, the conceptual or logic model should specify not only the constructs or variables associated with intervention effectiveness (at the recipient unit level) but also those associated with the effective implementation of activities and the execution of outputs (at the host setting and delivery level).
In general, implementation models capture similar information, although the content varies somewhat by field, application, and available data. Inputs, activities, and outputs specify the constructs which comprise factors that influence implementation effectiveness. In prevention science, for the most part, inputs include the internal (organizational and operational features), external (laws, policies, and local conditions), and program recipient characteristics over which implementers and the evaluator have little or no control (e.g., individual difference variables like self-efficacy). The inputs for the 57 LEA grantees of the 2012 SS/HS national evaluation included measures of existing collaborative mechanisms, community characteristics, pre-grant system resources, and LEA’s grants operating environment. Activities tend to be categorical and reflect intervention components—what actions are planned that are intended to produce (mediate) the intervention result. Activities in the SS/HS national evaluation included both structural (e.g., decision-making structure) and functional (e.g., communication) measures of grants operations and the adoption of evidence-based practices. Outputs are often counts of activities, and describe the amount, frequency, quality, and/or intensity of activities, and can often be thought of as moderators of activity effectiveness. SS/HS outputs included measures of comprehensive programs and activities, enhanced services, and the levels of coordination and service integration achieved by each grantee.
Very few social or behavioral interventions directly influence long-term outcomes—the intended goal(s) of the intervention. Most social and behavioral interventions attempt to change the risk and/or protective factors associated with the long-term outcome (Coie et al., 1993; see also Derzon, 2000). These interventions seek to interrupt the hypothesized causal chain by addressing near-term outcomes (such as recipients’ knowledge, attitudes, or beliefs), which are expected to alter some intermediate outcome (e.g., risky behavior) reducing, in turn, the display of an undesired long-term behavior (i.e., drug use). 2 Near-term program outcomes in the SS/HS evaluation included improvements in school climate and progress toward sustainability, while the long-term outcomes involved the magnitude of change over time in seven grant goals (i.e., reducing past 30-day student alcohol, tobacco, and marijuana use; reducing the percentage of students perceiving or experiencing violence; and increasing the percentage of students receiving school and/or community mental health services). LEAs provided annual cross-sectional means and variance estimates for the three substance use measures and frequencies for the remaining measures. Schools used local instruments and protocols to collect outcome data and those data were standardized by the national evaluation team using meta-analytic procedures (Derzon et al., 2012).
In general, the causal mediation sequence is rarely modeled and data on near-term and intermediate outcomes are rarely available when using institutional data sets. However, when the budget allows the collection of data on these near- and intermediate outcomes, there is much to be learned from documenting the correspondence, or lack thereof, between these outcomes. Near-term outcomes describe the effectiveness of the intervention, while modeling the causal mediation sequence tests the theory connecting those near-term with long-term outcomes that are the intervention goal (Suchman, 1967). Mediation analysis (MacKinnon, Fairchild, & Fritz, 2007) and decision analytic modeling (Keeney, 1982; Petrou & Gray, 2011) provide methods for documenting and assessing transitions between near-term, intermediate, and long-term outcomes.
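As a rough illustration of the mediation logic described above, the following sketch estimates the paths of a single-mediator model by ordinary least squares. It is a minimal example under stated assumptions, not the procedure used in any evaluation cited here: the function name `mediation_paths`, the variable names, and the data are all hypothetical, and no standard errors or significance tests are computed.

```python
from statistics import mean


def _cov(u, v):
    """Sample covariance of two equal-length sequences (variance when u is v)."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)


def mediation_paths(x, m, y):
    """Least-squares path estimates for x -> m -> y plus a direct x -> y path."""
    vx, vm = _cov(x, x), _cov(m, m)
    cxm, cxy, cmy = _cov(x, m), _cov(x, y), _cov(m, y)
    denom = vx * vm - cxm ** 2           # zero only if x and m are perfectly collinear
    a = cxm / vx                          # effect of exposure on the mediator
    b = (cmy * vx - cxy * cxm) / denom    # effect of the mediator on y, holding x fixed
    direct = (cxy * vm - cmy * cxm) / denom
    return {"a": a, "b": b, "indirect": a * b, "direct": direct, "total": cxy / vx}


# Hypothetical data: 0/1 program exposure, a near-term score, a long-term outcome.
exposure = [0, 0, 0, 0, 1, 1, 1, 1]
near_term = [2, 3, 2, 4, 5, 4, 6, 5]
long_term = [10, 9, 11, 8, 6, 7, 5, 7]
paths = mediation_paths(exposure, near_term, long_term)
print(paths)
```

In this least-squares setting the total effect equals the direct effect plus the indirect (a × b) effect, which gives a quick arithmetic check on the estimates. An applied analysis would add interval estimates for the indirect effect, for example by resampling.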
Measures
When implementing the same evidence-based program across settings, program features can often be operationalized by form (i.e., the specific components or features of the intervention). When the intervention being evaluated is less structured or is “complex” (Craig et al., 2017), as many large initiatives are, programmatic features may need to be operationalized according to their functions. These are the “steps in the change process that the elements are purporting to facilitate” (Hawe, Shiell, & Riley, 2004, p. 1562). For example, the SS/HS national evaluation collected both structural (what did they do) and functional (the extent to which implemented activities achieved the goal of the planned activity) measures of partnership functioning. Intervention components may be collected for descriptive purposes, but often functional coding of the role these elements play in interventions may simplify analysis across disparate and complex interventions.
Measures can be derived from a variety of sources ranging from key informant interviews to secondary archival data. One of the advantages of large-scale evaluation is that, with the exception of outcome data associated with participant performance, most measurement is at the organizational level and is readily available or relatively inexpensive to obtain. On the other hand, program staff are generally busy, so minimally intrusive measures should always be considered. For example, the SS/HS national evaluation reviewed grant applications for data on existing collaborative relations and system resources (such as existing investments to meet student needs and pre-grant prevention programs) and employed census data to assess community sociodemographics. The key to good measurement in evaluation is to have measures that are valid, reliable, and sensitive to differences across all organizations participating in the evaluation. A well-designed survey can supply much of the necessary data, but many large-scale evaluations also call for and warrant the use of mixed methods in which both qualitative and quantitative data are collected (Tashakkori & Teddlie, 2010).
Serial mixed-method approaches, in which qualitative and quantitative data are collected sequentially (Johnson & Turner, 2003; Plano Clark & Ivankova, 2015), can be particularly informative in large-scale evaluations. For example, earlier collected qualitative data can identify emergent themes and stakeholder theories of effectiveness (Chen & Turner, 2012). Items capturing these insights can be added to the logic model and later assessed for breadth, magnitude, and their relationship with intervention impact through either a well-designed survey or structured qualitative data collection. Alternatively, quantitative data can inform qualitative data collection. To assess partnership functioning, the SS/HS national evaluation team deployed a survey to collect partners’ perceptions of functioning within their SS/HS partnerships, used the results of that survey to rank order grantees, and conducted focused interviews with the 10 highest and 10 lowest performing sites to identify six distinguishing themes (Merrill et al., 2012).
While many large-scale evaluations define their outcome measures and directly collect those outcome data, some will rely on locally collected outcome (performance) data. If part of a performance monitoring system, these local data may be useful in establishing preintervention trends but may take a variety of forms across implementation settings. In the absence of standardized outcome data, meta-analysis provides multiple methods for standardizing locally collected data to a common metric (e.g., Borenstein, 2009; Fleiss, 1994). Occasionally, the large-scale evaluator will have some input regarding how outcome data are collected and reported. If so, the least processed data may offer the evaluator the greatest flexibility and may represent the least burden on implementing organizations. If individual-level data are collected and rolled up, then the same standards for cleaning and quality assurance can be applied. When only summary (group-level) estimates are available, also requesting simple summary statistics (such as means and standard deviations for interval data, frequencies for binary outcomes) for the intervention and comparison group, if relevant, can often help to understand findings. Findings incorporating more evidence (e.g., difference-in-differences metrics, regression-derived partial or semipartial coefficients) may obscure relations within the evidence that may be hard to detect across multiple sites. It should be noted that local capacity to collect, manage, organize, and report data may be extremely limited and should be assessed early in the evaluation design.
Moreover, if locally collected performance data are used in the evaluation, it is critical to assess the quality of these data and incorporate that assessment in any analyses of these data. In general, data quality is better if the requested data are meaningful to providers and collected by a disinterested third party. Common local “adjustments” may include alterations in how measures are defined or captured, or adjusting their findings to only include those treated in what is supposed to be an intention-to-treat evaluation. Other adjustments may include using different procedures to identify comparison samples, documenting performance in different measurement periods, using different criteria to define success or failure, and other adjustments to meet local demands and local evaluator preferences. Reasons for these decisions may be legitimate (e.g., an inability to reprogram an existing information technology system) or less benign (e.g., altering inclusion or eligibility criteria to improve success rates), and the large-scale evaluator rarely has the ability to enforce changes or reverse these local decisions. They can, however, document the local choices and statistically control or adjust for them in subsequent analysis.
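The standardization of locally collected summary statistics to a common metric, discussed above, can be sketched as follows. This is a minimal example of two widely used effect size computations (a bias-corrected standardized mean difference for means and standard deviations, and a log odds ratio for frequencies); the function names and input values are hypothetical, and the sketch is not a description of the procedures used by the SS/HS national evaluation team.

```python
import math


def hedges_g(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Bias-corrected standardized mean difference and its sampling variance."""
    df = n_1 + n_2 - 2
    pooled_sd = math.sqrt(((n_1 - 1) * sd_1 ** 2 + (n_2 - 1) * sd_2 ** 2) / df)
    d = (mean_1 - mean_2) / pooled_sd
    var_d = (n_1 + n_2) / (n_1 * n_2) + d ** 2 / (2 * (n_1 + n_2))
    j = 1 - 3 / (4 * df - 1)              # small-sample correction factor
    return j * d, j ** 2 * var_d


def log_odds_ratio(events_1, n_1, events_2, n_2):
    """Log odds ratio and its sampling variance from two groups' frequencies."""
    cells = [events_1, n_1 - events_1, events_2, n_2 - events_2]
    if 0 in cells:                         # continuity correction for empty cells
        cells = [c + 0.5 for c in cells]
    a, b, c, d = cells
    return math.log((a * d) / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d


# Hypothetical site reports: one site supplies means and SDs, another supplies counts.
g, var_g = hedges_g(mean_1=2.1, sd_1=1.0, n_1=120, mean_2=2.4, sd_2=1.1, n_2=115)
lor, var_lor = log_odds_ratio(events_1=18, n_1=150, events_2=30, n_2=160)
print(g, var_g, lor, var_lor)
```

Each function returns both the effect size and its variance, since the variance is what later analyses use to weight sites. Effect sizes on different metrics (for example, `g` and `lor`) still need to be converted to a single metric before they are combined.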
Outcome Data Considerations
The data produced by a large-scale evaluation can occur in multiple levels (if participant- and setting-level data are available) or take shape as summary statistics (when only group performance data are available). In general, participant-level data offer the greatest opportunity to assess what works for whom and under what conditions, but individual-level data introduce considerable complexity both for the analysis and for evaluation management. For example, because students’ educational information is federally protected, 3 data-use agreements are often necessary, confidentiality must be assured, and if the evaluation is federally sponsored, clearance from the Office of Management and Budget will be required. These requirements may create delays for obtaining baseline (pretreatment) data but may allow closer tailoring of the data available to directly address the goals of the intervention (e.g., obtaining academic grades and student characteristics for the evaluation of a school-based delinquency prevention program).
There are generally fewer restrictions on performance and other summary data since they rarely contain personally identifiable information. In addition, many settings are already routinely collecting and aggregating both individual (e.g., grades) and site (e.g., school absenteeism and truancy rates) performance data. If so, it may be possible to establish preintervention trends, which can be used to strengthen causal inference. However, if locally provided summary data are used, they should be closely examined because measure specification (how a measure is defined, implemented, calculated, and reported) may differ across implementing organizations. Data should be reviewed early and often to confirm with the data provider that abrupt changes in magnitude or direction or notable changes in effect size variance are not attributable to improper measurement, faulty programming, or changes in participant eligibility. Data supplied using different metrics can often be harmonized (see Hansen et al.’s chapter, this issue); however, such adjustments should always be tested in the analysis to ensure they do not introduce unanticipated bias.
Research Designs
There is a plethora of research designs available for outcome evaluations (Reichardt, 2011), and this does not change dramatically merely because multiple sites are being used. In general, however, experimental designs with randomization of individuals to treatment and comparison conditions provide the strongest evidence for causal claims, while one-group pre–post observational studies offer the weakest evidence (Flay et al., 2005). Nonetheless, a consistent replicated finding obtained from multiple pre–post designs may be a good indicator of likely, and likely generalizable, effectiveness. Regardless of the design, unless the large-scale evaluator is required to estimate the significance of each setting’s result, having more implementation settings with fewer subjects per setting will provide greater utility than having few settings each with large samples (Brown & Liao, 1999; Cohn & Becker, 2003).
Experimental Designs
Experimental and quasi-experimental designs are defined by the randomization of individuals or groups to condition, respectively. Randomization seeks to achieve initial probabilistic equivalence (covariate balance) and estimates program effects by contrasting the outcomes achieved by those exposed to the intervention with a control group that has not been exposed to the intervention (Holland, 1986; Rubin, 2005). In the absence of attrition, randomized designs provide the strongest evidence of a causal relationship but can be difficult to implement in many applied settings (Derzon, 2014). Nonetheless, when randomization is possible and attrition avoided, it is considered the “gold standard” and most powerful design for making causal inferences (see, e.g., Hansen et al. and also Mason & Fleming, this issue). It should be recognized, however, that regardless of design, implementation should be assessed across sites and its impact on findings tested.
Unique, perhaps, to large-scale evaluation is the opportunity to randomize not only at the participant level but also at the implementation setting level using a factorial design (Grannemann & Brown, 2017; Somers, Collins, & Maier, 2014). In a factorial design, experimental units (implementation settings) are randomly assigned to a unique intersection of two or more factors (e.g., interventions, intervention components, and levels of components) allowing the systematic testing of the effects of factors and interactions on outcome(s). First introduced as “complex experiments” (Fisher, 1926), factorial designs offer an efficient method to systematically assess the effect of multiple independent variables and their interactions on outcomes. This scenario of mixed or combined intervention effects is a ubiquitous condition in much applied research. Because the design explicitly assigns units to comparator conditions (e.g., Caldwell et al., 2012; Strecher et al., 2008), factorial designs obviate many of the challenges of identifying an appropriate comparison group for quasi-experimental large-scale evaluations (Collins, Dziak, Kugler, & Trail, 2014).
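The random assignment of implementation settings to the cells of a factorial design can be sketched in a few lines. The example below is illustrative only: the function `assign_factorial`, the two factors ("coaching" and "booster_sessions"), and the site labels are hypothetical and are not drawn from any study cited above.

```python
import itertools
import random


def assign_factorial(sites, factors, seed=0):
    """Randomly assign sites to the cells of a full factorial design.

    Sites are shuffled and then dealt across the cells in turn, so cell
    sizes differ by at most one.
    """
    names = list(factors)
    cells = list(itertools.product(*(factors[n] for n in names)))
    shuffled = list(sites)
    random.Random(seed).shuffle(shuffled)   # seeded so the assignment is reproducible
    return {
        site: dict(zip(names, cells[i % len(cells)]))
        for i, site in enumerate(shuffled)
    }


# Hypothetical 2 x 2 design: each factor is either absent or present at a site.
factors = {"coaching": (False, True), "booster_sessions": (False, True)}
sites = [f"site_{k:02d}" for k in range(1, 9)]
assignment = assign_factorial(sites, factors, seed=2024)
for site in sorted(assignment):
    print(site, assignment[site])
```

With eight sites and four cells, each combination of factor levels receives two sites. Recording the seed alongside the assignment allows the allocation to be audited later.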
Nonexperimental Designs
Frequently, the large-scale evaluator will be faced with a predefined coterie of implementation sites for which comparators must be found. This design creates challenges for the large-scale evaluator by introducing the potential for both selection bias (the sites selected to deliver the intervention are systematically different from sites not selected to deliver the intervention) and bias in how participants in intervention and nonintervention sites perform on the designated outcomes. Several procedures have come into vogue in recent years to mitigate these potential sources of bias, including propensity score matching or weighting, doubly robust estimation (Bang & Robins, 2005), and difference-in-differences designs (Ashenfelter, 1978; Ashenfelter & Card, 1985). Imbens and Wooldridge (2009) provide a comprehensive review of these methods.
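Of the procedures named above, the difference-in-differences contrast is the simplest to show arithmetically: the change observed in comparison sites is subtracted from the change observed in intervention sites. The sketch below assumes site-level summary values; the function name and all values are hypothetical.

```python
from statistics import mean


def difference_in_differences(pre_treated, post_treated, pre_comparison, post_comparison):
    """Mean change in intervention sites minus mean change in comparison sites."""
    treated_change = mean(post_treated) - mean(pre_treated)
    comparison_change = mean(post_comparison) - mean(pre_comparison)
    return treated_change - comparison_change


# Hypothetical site-level proportions of students reporting past 30-day use.
estimate = difference_in_differences(
    pre_treated=[0.30, 0.28, 0.32],
    post_treated=[0.24, 0.23, 0.25],
    pre_comparison=[0.31, 0.29],
    post_comparison=[0.29, 0.27],
)
print(round(estimate, 3))
```

Here the intervention sites decline by six percentage points and the comparison sites by two, so the estimate is a four-point larger decline in the intervention sites. The estimate is interpretable as a program effect only if the two groups would have followed parallel trends in the absence of the intervention.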
Interrupted time series are often used to assess the impact of environmental or policy-level interventions and may be implemented in multiple settings (e.g., Biglan, Ary, & Wagenaar, 2000). In the absence of alternative explanations, differences in trends prior to and following the introduction of a policy change are attributed to the intervention. If multiple settings return similar time series results, our confidence in attributing the finding to the policy is reinforced. Similarly, in cases where a comparison group cannot be identified or convened, the inherent capacity of large-scale evaluations to document the replicability of pre–post change scores across multiple implementation settings reduces the probability that a third “unnamed” factor caused an observed effect. A result which replicates across multiple settings may allow a researcher to make a more robust causal claim even from these relatively weaker research designs.
For example, using treatment-only pre–post change scores, the SS/HS national evaluation documented that within-LEA change in one outcome was typically not associated with measured change in other outcomes (refuting the program theory), but that LEAs consistently and significantly increased access to school- and community-based mental health services while reducing the numbers of students experiencing violence. Across LEAs, SS/HS did not produce significant improvements in any of the other outcomes (Derzon et al., 2012).
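The interrupted time series logic described above, with replication across settings, can be sketched as follows. This is a deliberately simplified version that projects the pre-intervention trend forward and averages the gap between observed and projected values; a full segmented regression would also model a change in slope and autocorrelation. The series and the `its_effect` helper are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def its_effect(series, start):
    """Mean gap between observed post-period values and the projected
    pre-period trend. `start` is the index of the first post-period point."""
    intercept, slope = fit_line(list(range(start)), series[:start])
    gaps = [
        series[t] - (intercept + slope * t)
        for t in range(start, len(series))
    ]
    return sum(gaps) / len(gaps)


# Hypothetical monthly rates for two settings; the policy begins at index 4.
settings = {
    "setting_a": [10.0, 11.0, 12.0, 13.0, 11.0, 12.0, 13.0, 14.0],
    "setting_b": [20.0, 20.0, 20.0, 20.0, 17.0, 17.0, 17.0, 17.0],
}
effects = {name: its_effect(s, start=4) for name, s in settings.items()}
print(effects)
```

Similar estimates across settings with different baseline levels and trends are what make a third, unnamed explanation less plausible.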
Although homogeneous results (indicating that results across multiple implementation settings vary no more than would be expected from sampling error) would be a satisfying finding, they are not the typical result observed in large-scale evaluations of effective social and behavioral interventions. As hypothesized in the logic model, results across settings might be expected to vary due to differences in internal and external context and sample characteristics, as well as variations in intervention delivery, staff experience, and a myriad of other factors that can moderate program effectiveness. To explicate reasons for this variation, we turn now to a brief overview of analysis considerations frequently encountered in large-scale evaluations.
Analysis Considerations
Effectiveness estimates from multiple sites will typically vary in magnitude, and their distribution should be tested to see whether it shows greater variation than would be expected from sampling error. If so, then the impact of setting, sample, and implementation options and characteristics on these heterogeneous results can be tested. However, between-setting heterogeneity raises a conundrum for the large-scale evaluator. Evaluators are warned against capitalizing on chance relations in their data and are admonished to specify their theory in advance and test only the theory detailed in their models. Given the cost, time, and effort that go into a large-scale evaluation, and the amount of data typically captured by the evaluation, limiting analysis to one test of one theory seems both wasteful and an unnecessary burden on data providers. In all but the largest of evaluations, the number of variables identified in the logic model as potential effect moderators will invariably exceed the degrees of freedom available for modeling. For these reasons, proper data mining approaches should be considered, and the results of these analyses presented with the appropriate caveats and/or statistical adjustments to guard against both Type I and Type II errors.
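One statistical adjustment of the kind mentioned above is control of the false discovery rate when many candidate moderators are tested. The sketch below implements the Benjamini-Hochberg step-up procedure; it is offered as one common option rather than the adjustment the cited evaluations used, and the p values are invented.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a list of booleans, in the original order, that is True where
    the corresponding hypothesis is rejected at false discovery rate q.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * q.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_k = rank
    # Reject every hypothesis ranked at or below max_k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected


# Hypothetical p values from five exploratory moderator tests.
p_values = [0.039, 0.001, 0.20, 0.008, 0.041]
flags = benjamini_hochberg(p_values, q=0.05)
print(flags)
```

Compared with a Bonferroni correction, this procedure gives up less power, which matters when the concern is Type II as well as Type I error.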
Multilevel Modeling
If both participant- and setting-level data are available, multilevel models should be used to statistically address the nesting of participants within implementation settings (Hox, Moerbeek, & van de Schoot, 2010; Raudenbush & Bryk, 2002). Multilevel models simultaneously analyze data at each level of measurement, taking into account the various dependencies introduced at each level. While the methods for modeling multilevel data are quite advanced, constructing appropriate models requires careful consideration, given the typically large number of variables and interactions between variables in the model. In the words of Hox and Maas (2005), “Ideally, a multilevel theory should specify which variables belong to which level and which direct effects and cross-level interaction effects can be expected” (p. 786). This is often easier said than done as large-scale evaluators typically identify and measure a multitude of outcome moderators.
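As an illustration of what such a specification involves, a generic two-level model with one participant-level predictor and one setting-level predictor can be written as below. The notation follows common textbook usage and is an assumption of this sketch, not a model taken from the evaluations discussed here: $Y_{ij}$ is the outcome for participant $i$ in setting $j$, $X_{ij}$ is a participant-level variable, and $W_j$ is a setting-level variable.

```latex
\begin{align}
Y_{ij} &= \beta_{0j} + \beta_{1j} X_{ij} + e_{ij},
  & e_{ij} &\sim N(0, \sigma^2) \\
\beta_{0j} &= \gamma_{00} + \gamma_{01} W_j + u_{0j} \\
\beta_{1j} &= \gamma_{10} + \gamma_{11} W_j + u_{1j},
  & (u_{0j}, u_{1j})^\top &\sim N(\mathbf{0}, \mathbf{T})
\end{align}
```

Here $\gamma_{01}$ is the direct effect of the setting-level variable on the setting mean, and $\gamma_{11}$ is the cross-level interaction: the extent to which the setting characteristic changes the participant-level relationship. Each additional moderator adds terms at one or both levels, which is why the number of plausible models grows quickly.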
Meta-Analysis
If only group-level (summary) data are available, analysis is greatly simplified. If summary data are provided, subject characteristics, if they are thought to moderate the outcome, can be represented by proportions (e.g., proportion male). Alternatively, it may be possible to arrange for findings broken out by key subject moderators (e.g., deriving separate estimates for participants with high or low sensation seeking). Meta-analysis provides a generalizable approach to analyzing group-level data (Cooper, Hedges, & Valentine, 2009). In addition to estimating an overall effect of the intervention across settings (the grand mean), meta-analysis estimates the homogeneity of the distribution of setting results. If there is greater variability than expected from sampling error (heterogeneity), results can be modeled using either hierarchical meta-regression or statistically compared using stratified analysis. Essentially all the variation exceeding sampling error for past 30-day tobacco use and perceived risk of violence in the SS/HS national evaluation was explained using meta-regression (Derzon et al., 2012). There is an emerging consensus that random-effects models are almost always appropriate, but for some limited applications, fixed-effect models may be applicable (Borenstein, Hedges, Higgins, & Rothstein, 2010). Finally, in the presence of dependent outcome measures (e.g., multiple outcomes, repeated measures designs), robust variance analysis should be considered (Hedges, Tipton, & Johnson, 2010).
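The homogeneity test and random-effects pooling described above can be sketched with the DerSimonian-Laird method-of-moments estimator, one widely used option among several. The effect sizes and variances below are invented, and the function name is hypothetical.

```python
def random_effects_meta(effects, variances):
    """DerSimonian-Laird random-effects meta-analysis.

    `effects` are setting-level effect estimates and `variances` their
    sampling variances. Returns the pooled estimate, its standard error,
    Cochran's Q with its degrees of freedom, tau^2, and I^2.
    """
    k = len(effects)
    w = [1.0 / v for v in variances]
    fixed_mean = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)

    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(wi * (yi - fixed_mean) ** 2 for wi, yi in zip(w, effects))
    df = k - 1

    # Between-setting variance, truncated at zero.
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)

    # Share of total variation attributable to heterogeneity.
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0

    # Random-effects weights add tau^2 to each sampling variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = (1.0 / sum(w_star)) ** 0.5
    return {"pooled": pooled, "se": se, "Q": q, "df": df,
            "tau2": tau2, "I2": i2}


# Hypothetical standardized effects from three implementation settings.
result = random_effects_meta([0.10, 0.30, 0.50], [0.01, 0.01, 0.01])
print(result)
```

Q is referred to a chi-square distribution with k - 1 degrees of freedom; when it exceeds the critical value, the excess variation is the target of the meta-regression or stratified analysis described above.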
Bayesian and Alternative Approaches
Two additional analysis approaches, Bayesian analysis and a set theoretic approach known as Qualitative Comparative Analysis (QCA), merit mention. Bayesian models combine prior knowledge, expressed as a prior probability distribution, with the observed data to estimate the posterior distribution of the quantities of interest (Gelman et al., 2013; Kruschke, 2015). This allows the analyst to make probability statements about particular end points (e.g., what is the probability the intervention reduced tobacco use by 5% over the evaluation period?), which are quite attractive to and understood by policymakers. QCA uses set theory to identify the necessary and/or sufficient conditions in which settings succeed (or fail) and can identify the collinear conditions and features of interest to policymakers in scaling and disseminating effective practices (Rihoux & Ragin, 2009).
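The kind of probability statement described above can be illustrated with the simplest conjugate case, a normal prior combined with a normally distributed estimate. All numbers are invented, and the sketch reads "reduced tobacco use by 5%" as a reduction of at least 5 percentage points, which is an assumption; applied Bayesian evaluations typically use richer models fit by simulation.

```python
from statistics import NormalDist


def posterior_normal(prior_mean, prior_sd, estimate, se):
    """Normal-normal conjugate update for a single effect.

    Combines a normal prior with a normally distributed estimate by
    precision weighting and returns the posterior as a NormalDist.
    """
    prior_prec = 1.0 / prior_sd ** 2
    data_prec = 1.0 / se ** 2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * estimate)
    return NormalDist(post_mean, post_var ** 0.5)


# Hypothetical inputs: change in tobacco use prevalence, as a proportion.
# A skeptical prior centered on no change; observed change of -0.08.
posterior = posterior_normal(
    prior_mean=0.0, prior_sd=0.10, estimate=-0.08, se=0.02
)

# Posterior probability that the reduction is at least 5 percentage points.
prob_at_least_5_points = posterior.cdf(-0.05)
print(round(posterior.mean, 4), round(prob_at_least_5_points, 3))
```

The result is a direct statement about the effect given the data and the prior, which differs from a p value and is the feature that makes such summaries accessible to decision-makers.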
Qualitative Data
It would be presumptuous to propose a single method for the qualitative analysis of observational and narrative data collected from interviews, focus groups, key informants, and stakeholders. Analysis frames can be tightly linked to the approach adopted (Creswell & Poth, 2017), and most evaluators have a favored approach. Nonetheless, large-scale evaluations offer excellent opportunities for case-study approaches (Eisenhardt, 1989) while grounded theory (Glaser & Strauss, 2017), with its emphasis on inductively creating categories and constructs from systematic data collection, may be particularly useful in a sequential exploratory mixed-method evaluation. For example, qualitative methods can highlight stakeholder-identified barriers and facilitators for both implementation and program effectiveness not included in the original logic model, and these hypothesized moderators can then be systematically tested in quantitative models. Many mixed method evaluations conduct parallel analyses, identifying barriers and facilitators of implementation using qualitative methods and estimating intervention impact using quantitative methods. If, however, the qualitative data are systematically collected across all implementing organizations, the large-scale evaluator can convert this narrative data to numeric values for use in formal statistical analyses (Newman, Onwuegbuzie, & Hitchcock, 2015).
Final Considerations
Large-scale evaluations are typically associated with major initiatives in which decision-makers are increasingly asking for frequent feedback not just on implementation progress but also on initiative effectiveness. This use of performance monitoring data for promoting continuous quality improvement is referred to in health care as rapid-cycle evaluation (Shrank, 2013) and represents a significant opportunity to which large-scale evaluators can contribute. Establishing a web-accessible performance dashboard where users can compare their settings’ performance and activities on various metrics against peer averages may facilitate the use of evidence for service improvement. While the approach typically calls for using the long-term outcomes common for performance monitoring, a theory-driven large-scale evaluation may be better served by initially focusing on and providing feedback for near-term outcomes. As initiatives mature and become part of the natural fabric of implementation, monitoring the extent to which the intervention achieves its intended intermediate and long-term outcomes becomes increasingly appropriate.
With the exception of the experimental factorial design presented above, none of the designs discussed will produce causal evidence of the impact of setting, subject, and implementation features or components on evaluation outcomes. In the absence of random assignment to the features being modeled, the large-scale evaluator may only claim that a feature or the level of a feature is associated with greater or lesser success. Nonetheless, performance measurement and the intense focus on outcome measurement in recent years have sensitized practitioners to the need to document that adopted programs produce their intended outcomes. Program developers want evidence that the programs they develop are effective and will generalize to other implementation settings. However, 20 years of meta-analytic inquiry have demonstrated that seemingly similar programs and interventions often produce disparate effectiveness estimates. By attributing performance differences to the theoretical, substantive, subject, and methodological factors that cause effectiveness estimates to vary, large-scale evaluation provides a unique opportunity to cast light on when, for whom, and under what conditions an intervention is effective. Perhaps even more importantly, large-scale evaluation can provide empirical support for the claim that a program proven effective in one setting will likely be effective in others.
Implications for Evaluation Practice
The dominance of Western critical rational thinking dictates that reason, evidence, and the scientific method provide the basis for informed decision-making and rational action. And while the production of evidence grows at a remarkable pace (Bornmann & Mutz, 2015), there is little evidence that the quality or quantity of that output has appreciably contributed to actionable knowledge (Bornmann, 2013). This state of affairs may be particularly acute in social and behavioral outcomes evaluation, where enthusiasm for and support of interventions are often divorced from the measured effectiveness of those interventions. Three good examples of this trend can be found in the independent evaluations of DARE (Ennett, Tobler, Ringwalt, & Flewelling, 1994), Scared Straight (Petrosino, Turpin-Petrosino, & Buehler, 2003), and the US$1.2 billion Department of Education-funded afterschool program 21st Century Community Learning Centers (Dynarski, 2015). Each of these programs remains popular despite considerable credible evidence that they fail to significantly improve their intended outcomes. Moreover, adoption of interventions with demonstrable effectiveness is often slow or nonexistent (Derzon, 2015), and despite the critical importance of implementation for determining effectiveness (e.g., Durlak & DuPre, 2008), guidance for how and for whom evidence-based programs should be implemented is often lacking (Ahmad, Boutron, Dechartres, Durieux, & Ravaud, 2010; Glasziou, Meats, Heneghan, & Shepperd, 2008).
Through its explicit attention to the vicissitudes and consequences of context and implementation, large-scale evaluation addresses several of the needs of decision-makers in recognizing under what conditions interventions are most likely to be effective and can highlight the factors, features, and program characteristics necessary for successful implementation and those which are associated with greater program success. Just knowing that a program is effective is insufficient evidence for effective dissemination and successful adoption of evidence-based programs (Derzon, 2015). Large-scale evaluation provides the means and methods for systematically constructing the information that evidence-based program adopters seek when deciding what program will work for them, in their situation, and that can specifically meet their needs.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
