Building an Assessment Argument to Design and Use Next Generation Science Assessments in Efficacy Studies of Curriculum Interventions

Abstract

Evaluators must employ research designs that generate compelling evidence related to the worth or value of programs, in which assessment data often play a critical role. This article focuses on assessment design in the context of evaluation. It describes the process of using the Framework for K-12 Science Education and Next Generation Science Standards (NGSS) to design assessments to evaluate the efficacy of a curricular intervention. The new science standards present a significant challenge to assessment designers and evaluators because these standards emphasize the integration of disciplinary core ideas, practices, and crosscutting concepts. This article presents the structure of a validity argument for such uses from an evidence-centered design perspective and unpacks the design decisions in developing and implementing these assessments in an efficacy study of a project-based science curriculum. Implications for designing NGSS-aligned assessments for program evaluation purposes are discussed.

Keywords

program evaluation, assessment design, science education, evidence-centered design

The Framework for K-12 Science Education (Quinn, Schweingruber, & Keller, 2012) presents an ambitious vision for improving teaching and learning in science. The Next Generation Science Standards (NGSS; NGSS Lead States, 2013) embody a key part of that vision by specifying challenging standards that all students are expected to meet. Realizing the vision of the Framework for K-12 Science Education will require more than state adoption of the standards. It will require changes across the educational system to support implementation, including changes to curriculum, instruction, teacher professional development, and assessment.

With respect to assessment, the changes will depend principally on the purposes for which assessments are used. Program evaluation is one purpose for assessment. In science education, for example, programs may include curricula, professional development, and focused interventions that target specific populations of students, either in or out of school. Since the Sputnik era, policy makers have provided funding to develop and test these kinds of programs in science education, on the premise that such programs provide necessary support for teachers to meet new and ambitious goals for student learning (Atkin & Black, 2003). Evaluations of these programs require assessments that are aligned to standards, in order to test this core premise.

While many of the challenges of developing assessments of NGSS are similar to those for other disciplines, some are distinctive. For example, the tasks must do more than elicit students’ knowledge of facts. They must elicit evidence of knowledge in use—that is, how students use knowledge of disciplinary core ideas, skill in science and engineering practice, and application of crosscutting concepts of science (e.g., patterns, systems, and cause and effect relationships) to explain scientific phenomena (Pellegrino, Wilson, Koenig, & Beatty, 2013). These knowledge-in-use learning goals are articulated as performance expectations in the NGSS. Each performance expectation incorporates all three dimensions of knowledge in use by asking students to apply disciplinary knowledge and make connections to a crosscutting concept as they engage in a science or engineering practice. Since no single performance expectation fully encompasses a core idea, NGSS assessment designers also must make choices about which “bundles” of performance expectations to assess (Pellegrino et al., 2013). They must also consider whether and how to integrate disciplinary core ideas, science and engineering practices, and crosscutting concepts within rubrics and score reports. Existing approaches to assessment have not measured the “three-dimensional” learning expressed in performance expectations, so new kinds of assessments are needed (Pellegrino et al., 2013).

Evaluation uses of assessment information present designers with additional challenges. Assessments must be instructionally sensitive and also permit fair comparisons among different treatments of the same standards. Evaluators must employ research designs that generate compelling evidence related to the worth or value of programs, of which assessment data are one source among many (National Research Council, 2002). Evidence about implementation and implementation contexts is a critical component of program evaluation (Century, Rudnick, & Freeman, 2010). If programs are not implemented with integrity to the principles of the program’s design, inferences about the potential efficacy of the program may not be valid.

In this article, we present an approach to designing assessments for the purpose of evaluating curriculum interventions intended to support implementation of NGSS. We outline the structure of a validity argument within an evidence-centered design (ECD; Mislevy & Haertel, 2006) framework. ECD is a principled approach to assessment design that provides a framework for developing evidence of construct validity. We illustrate our approach to assessment design with assessments used to evaluate a middle school science curriculum in a cluster randomized trial in a large urban district. We describe the iterative process of analysis and refinement to develop assessments in Earth science and physical science that incorporate the practice of developing and using models (NGSS Science and Engineering Practice 2).

Evaluation Context

The context for our work is an efficacy study of a commercially available middle school project-based science curriculum. Project-Based Inquiry Science (PBIS) is a comprehensive 3-year curriculum sold and distributed through It’s About Time Publishing. The curriculum comprises 8- to 10-week units in life, physical, and Earth science, spanning Grades 6 through 8. Most units were developed in the context of learning sciences research projects, notably the Center for Learning Technologies in Urban Schools program (Singer, Marx, Krajcik, & Clay-Chambers, 2000) and the Learning by Design project (Kolodner et al., 2003). The full curriculum became available in 2009.

The PBIS units that are the focus of our evaluation are in the areas of physical science (energy) and Earth science (processes that shape Earth’s surface). PBIS presents challenges in which students investigate phenomena and apply concepts to answer a driving question or to achieve a design challenge. The driving question or challenge typically targets a core idea in science, and the activities within each unit provide students with multiple occasions for investigating as scientists would—through observations, asking questions, designing and carrying out experiments, building and using models, constructing explanations, and so forth. In this way, PBIS emphasizes a knowledge-in-use perspective (Duschl, Schweingruber, & Shouse, 2007)—teaching a few core ideas and integrating science and engineering practices.

At the time that we began our preparations for the evaluation study, available assessments did not integrate core ideas and science and engineering practices in the manner addressed in PBIS and intended by the Framework for K-12 Science Education and NGSS. We were faced, then, with developing our own measures. We aimed to systematically attend to the blending of content, practices, and crosscutting concepts to design NGSS-aligned assessment tasks that PBIS students would have an opportunity to learn. The ECD framework served as a central strategy to articulate an assessment argument that persuasively unpacks performance expectations into a coherent association of learning goals, descriptions of tasks to elicit those goals, and expressions of student performances to serve as evidence for proficiency (Mislevy & Haertel, 2006).

Conceptual Framework

Assessment is a form of reasoning from evidence in which observations of students’ actions and artifacts are used to support inferences about what they know and can do (Pellegrino, Chudowsky, & Glaser, 2001). Validation of assessments entails developing a coherent, compelling argument that supports these inferences (Kane, 1992). In modern psychometric theory, construct validity—the degree to which a test measures what it claims to measure—serves as an overarching frame for evaluating the strength of assessment arguments (Messick, 1995). Developing validity evidence for next generation science assessments presents challenges, including specifying claims about what integrated knowledge of big ideas, practices, and crosscutting concepts looks like and identifying the evidence from student responses to tasks that can support claims about their learning (Pellegrino, 2013).

Developing a fair assessment that is aligned to the NGSS is also challenging because few students today have had an adequate opportunity to learn according to the vision of the Framework for K-12 Science Education and NGSS. To evaluate the merit of programs, inferences from measures must allow for fair comparisons of a target program to compelling alternative programs (Ruiz-Primo, Shavelson, Hamilton, & Klein, 2002). In addition, data on implementation and professional development are needed, because when programs are not implemented well, inferences about the potential efficacy of a program may not be valid (Love, 2004). These data are also useful for developing and testing hypotheses about the conditions under which programs can work (Means & Penuel, 2005).

Figure 1 shows a basic structure for organizing validity evidence for an assessment when the intended use is evaluating programs. This structure is represented as a set of interlocking arguments—the assessment argument, shown in the bottom half of the figure, and the use argument, in the upper half. At the center, connecting the two arguments, is a claim about what students know and can do.

Figure 1.

General structure of an assessment argument for evaluation use.

For any assessment argument of student learning, assessors need data on student actions, such as their responses to tasks, in order to judge the strength of the claim (Pellegrino et al., 2001). Whether the data support the claim depends upon a number of warrants that must be backed by additional evidence: Were the tasks adequate to elicit what students know and can do? Can student responses be reliably categorized or scored? Can students with knowledge in the domain understand task directions and perform well? Do they do better than students with less understanding? Data concerning the assessment situation, too, are needed, such as the length of time students had to complete the assessment, and the stress they might be under to perform well. An important source of evidence related to claims about student proficiency is what Messick (1989) termed the external component of validity. This could include evidence that individuals’ test scores are more strongly correlated with scores on other measures of the same construct than with scores on different constructs. It might also include evidence that measures are sensitive to the effects of instruction. Importantly, the comparisons made should be grounded in a theory of a construct; otherwise, correlations of scores on the measure being developed with other measures are difficult to interpret. Evidence of the external component of validity is likely to qualify claims about individuals, because “no single test is a pure exemplar of the construct but contains variants due to other constructs and method contaminants” (Messick, 1989, p. 48). Even if data from student performance, warrants, and backing for warrants are compelling, assessors still must construct an argument for the use of assessment data for particular purposes (Messick, 1994; Shepard, 1993).

In evaluation, a typical claim for program efficacy and effectiveness is framed as an answer to a question about the efficacy or effectiveness of a program, as shown in the upper half of Figure 1. Data (evidence) on the treatment effect that would support that claim come not from one student but from large samples of students. Whether data support this claim about program effectiveness depends in part on warrants, such as the strength of the argument about the outcome measure to support claims about individual learning. It also depends on the strength of the research design to support causal inferences (Shadish, Cook, & Campbell, 2002), the proximity of the content and tasks of the assessment to content and tasks presented to students in the program (Ruiz-Primo et al., 2002), the nature and persuasiveness of the counterfactual to the potential users of evaluation results (Morgan & Winship, 2007), the quality of implementation or the achieved relative strength of the intervention (Cordray & Pion, 2006), and evidence regarding threats to internal and external validity (Shadish et al., 2002). Finally, a compelling argument considers possible qualifiers or counterarguments that would lead to different claims about effectiveness. For example, perhaps students performed better on assessments, but they did so because they spent more time on task content than the comparison group.

This broader structure of interlocking arguments underlies the design of the assessments. Although the validity evidence we present in this article focuses principally on supports for claims about students (the assessment argument), our design choices reflected the intended use of the assessment data from the start. We considered, for example, not only what claims about students we wanted to investigate but also how to make the assessment a test for judging the efficacy of the curriculum. The assessments had to be sensitive to instructional differences but not “overaligned” to the content that treatment students encountered, which could reduce the credibility of our judgments. The process of developing the assessment tasks we present in this article underscores the complexity and challenges of developing assessments of the performance expectations of the NGSS, whether these assessments are for use in evaluation or other purposes.

From Performance Expectations to an Assessment Argument: Unpacking the Complexities

The NGSS performance expectations provide a starting point for assessment design, yet moving from these to an actual assessment that will be valid for the purposes of evaluating a program requires building a coherent assessment argument that elaborates the claims, evidence, and reasoning that support the desired inferences about student learning. The following performance expectation serves as an example:

MS-PS1-4. Develop a model that predicts and describes changes in particle motion, temperature, and state of a pure substance when thermal energy is added or removed. [Clarification Statement: Emphasis is on qualitative molecular-level models of solids, liquids, and gases to show that adding or removing thermal energy increases or decreases kinetic energy of the particles until a change of state occurs. Examples of models could include drawings and diagrams. Examples of particles could include molecules or inert atoms. Examples of pure substances could include water, carbon dioxide, and helium.]

This performance expectation provides considerable detail, such as how the science practice of modeling can be integrated with the core idea, structure and function of matter. The crosscutting concept of cause and effect is central, as students are required to use their models to think about how thermal energy affects particle motion. The clarification statement provides details about the kinds of models that can be expected and appropriate substances to use in items.

The NGSS offer good guidance about the science domain and broad expectations; however, a number of design decisions remain that must be formed into a logical argument to promote consistency in the design of the tasks within an assessment, and particularly so for assessments to inform program evaluation. For example, when we think about the claims we want to be able to make about what students know and can do, is there something qualitatively different about students’ understanding and use of models when they are using them to predict versus describe a phenomenon? With respect to evidence in student work, what are the qualities that differentiate a strong model from a weaker one, and what kinds of changes in particle motion need to be described—in drawing, in writing? What is the level of detail with which atoms or molecules need to be represented in students’ models? Several questions also may be raised with respect to how student work will be scored. For instance, will there be multiple rubrics or a single rubric to capture the dimensions of core idea, modeling, and crosscutting concept for items associated with this performance expectation? In short, using a performance expectation for assessment design requires further specification. Table 1 describes these decisions in relation to the warrants needed in the assessment argument (Figure 1).

Table 1.

Questions to Guide Assessment Design Decisions.

Warrants	Design Decisions
Relevance of claims and aspects of proficiency	What opportunities are there in the curriculum intervention and in the business-as-usual curriculum to learn these core ideas, practices, and crosscutting concepts? What do we know about how children learn and develop? What state science standards are teachers accountable to? Given responses to the questions mentioned previously, which performance expectations (PEs) should be targeted in the assessment? What are performance levels associated with the targeted core ideas, practices, and/or crosscutting concepts?
Adequacy of tasks	What are the design principles that guide the elicitation of intended claims/PEs?^a Do observations of student performances provide evidence of the claims? If not, how should tasks be revised? Do experts agree that tasks are well aligned to claims?
Scoring	Are rubrics and scoring guides capturing the range of evidence related to the claims? Are raters able to consistently and reliably score tasks?
Item and test performance	What statistical approaches will be used to make inferences about claims from scores? Is the assessment a reliable measure of the claim(s)?

^aIn this project, we focused on core ideas, science practices, and crosscutting concepts in the Framework for K-12 Science Education (Quinn et al., 2012), because the Next Generation Science Standards (NGSS) were not available at the time we needed to develop our assessments.

An ECD Approach to Inform Decisions

ECD is one approach that facilitates addressing these design decisions. ECD requires up-front specification of student and measurement models to promote coherence in the design of tasks and rubrics and the interpretation of student performances. ECD involves an interdisciplinary codesign team including experts in assessment design, science education, science content, and psychometrics. ECD is also an iterative process: Design specifications are refined as new information is discovered from the performance of items and tasks.

Relevance of Claims and Aspects of Proficiency

One of the central research questions in the PBIS efficacy study is: To what extent do PBIS students perform better than non-PBIS students on measures of learning? We needed to develop measures that had potential to be instructionally sensitive in both conditions; thus, we wanted to align our claims to standards—in this case, the core ideas, practices, and crosscutting concepts in the Framework for K-12 Science Education. To ensure that we included items that were fair to both conditions, we had to consider curriculum learning goals, state standards, and research on how students learn. In ECD, this phase of design is called domain analysis.

Curriculum learning goals

The process of identifying curriculum learning goals involved reviewing the curriculum units. In some cases, learning goals were explicit statements about content knowledge (e.g., energy exists in different forms and can be changed from one form to another). We also reviewed activities to determine whether students had opportunities to engage in particular science practices. Based on the curriculum analysis, the team selected several physical science core ideas, an Earth science core idea, and the scientific practice of modeling as areas of focus for the assessments. Selecting core ideas that are taught in the curriculum reflects a basic principle of fairness: Students cannot be expected to know what they have not had opportunity to learn. The learning goals of the curriculum also figure in the evaluation use argument, and in a way that requires thinking about alignment (or overalignment) to the treatment curriculum, as well as to the theoretical coherence of different treatments.

State standards

A challenge for our own study was that the NGSS had not yet been released or adopted by any state. While the district had great interest in the Framework for K-12 Science Education (Quinn et al., 2012) and the then-forthcoming NGSS, the district and its science teachers were accountable to the state science standards. Thus, attention to the current state standards was an important consideration in our assessment design.

Research on how students learn

Emerging learning progressions research (e.g., Schwarz et al., 2009) offered insights into tasks teachers can present to students that can support students’ engagement in modeling. This literature stresses the benefits of having students construct and manipulate their own models, as opposed to working with preprepared models. It operationalizes the practice of modeling to include the following elements: (a) constructing models; (b) using models to make predictions or explain processes or phenomena; (c) comparing, critiquing, and evaluating models; and (d) revising models to better account for evidence. Moreover, this literature focuses on the progression in terms of students’ abilities to engage with models in these various ways. Specific research on the practice within our targeted disciplines was also reviewed (e.g., Rivet & Kasten’s 2012 construct-based assessment in Earth science).

Defining claims and performance levels

Our work to define claims and performance levels focused on the practice of developing and using models, as we intended for this practice to be central in both the physical and Earth science assessments. On the basis of the research literature and Framework for K-12 Science Education, we defined three claims: (a) ability to construct a model and use the model to explain or make predictions about a phenomenon, (b) ability to evaluate the quality of the model for explaining a phenomenon, and (c) ability to use a given model to make a prediction about a phenomenon.

Two successive field trials conducted in the first 2 years of this study (2011–2013) informed the development of a construct map to describe performance levels for the practice of modeling (Table 2). The construct map promotes coherence in the way levels of proficiency related to modeling can be described across all assessment tasks in both content domains. Given that the target of our assessments is sixth-grade students, the levels on the construct map span skills that, at the lowest levels, would be the focus of upper elementary school and, at the highest levels, are expected in middle school (NGSS Lead States, 2013, Appendix F).

Table 2.

Developing and Using Models Construct Map.

Levels	Level Descriptors
4	The student recognizes models as a representation that can explain why a phenomenon is observed or that can be used to make predictions about the phenomenon. Model captures all mechanistic features of the observable and unobservable phenomena.
3	The student recognizes models as a representation that can explain why a phenomenon is observed or that can be used to make predictions about the phenomenon. Model captures some mechanistic features of the observable and unobservable phenomena.
2	The student recognizes models as a representation that can explain why a phenomenon is observed or that can be used to make predictions about the phenomenon. Model attends primarily to macroscopic, observable, or surface features with emerging understanding of mechanistic features.
1	The student conceives of a model as an analogy or an explicit representation of phenomena that is visible. Student-constructed model or student evaluation of a given model attends only to relationships among macroscopic, observable, or surface features to explain a phenomenon.
0	The student does not demonstrate any understanding of scientific models. Student-constructed model or student evaluation of a given model includes no appropriate relationships based on core ideas (mechanistic, surface, or otherwise).
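For analysts who want to attach the construct map to scored data, the levels can be held in a simple lookup structure. The following is a minimal sketch in Python, not the project's scoring tooling: the names `MODELING_LEVELS` and `describe_level` are hypothetical, and the descriptors are abbreviated paraphrases of Table 2.

```python
# Hypothetical encoding of the Table 2 construct map.
# Descriptors are abbreviated paraphrases, not the official wording.
MODELING_LEVELS = {
    4: "Model as explanatory/predictive; captures all mechanistic features",
    3: "Model as explanatory/predictive; captures some mechanistic features",
    2: "Model as explanatory/predictive; mostly surface features, emerging mechanism",
    1: "Model as analogy or visible representation; surface features only",
    0: "No understanding of scientific models demonstrated",
}


def describe_level(score: int) -> str:
    """Return the abbreviated descriptor for a construct-map level (0-4)."""
    if score not in MODELING_LEVELS:
        raise ValueError(f"score must be one of {sorted(MODELING_LEVELS)}")
    return MODELING_LEVELS[score]
```

Keeping a single shared mapping of this kind is one way to reflect the coherence the construct map is meant to provide across tasks in both content domains.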

Adequacy of Tasks

Designing tasks to elicit the intended claims

The design pattern (Table 3) describes the argument underlying the design of our assessments—how we intended to elicit students’ ability to engage in the practice of modeling using physical science and Earth science core ideas. The attributes in a design pattern specify the kinds of observations that can provide evidence bearing on a claim, as well as the characteristic and variable features of task situations.

Table 3.

Elements of a Design Pattern for Developing and Using Models.

Attribute	Description
Claims: The primary claims about what students know and can do	C1. Ability to construct a model and use the model to explain or make predictions about a phenomenon. C2. Ability to evaluate the quality of the model for explaining a phenomenon. C3. Ability to use a given model to make a prediction about a phenomenon.
Additional knowledge, skills, and abilities (AKSAs): Other knowledge, skills, and abilities that may be required to demonstrate the claims	AKSA 1. Knowledge that a model explains or predicts. AKSA 2. Declarative knowledge related to core ideas. AKSA 3. Ability to construct a response in drawing or writing.
Potential observations (PO): Student performances that constitute evidence of claims	PO1. Appropriate application of scientific concepts to construct a model (using drawings and words) that explains why the phenomenon occurs. (Physical science example: Student accurately constructs a representation of liquid water molecules and explains why water as a liquid can flow and change its shape to fit a container.) PO2. Accurate description of similarities and differences between a model and a phenomenon. (Earth science example: Student identifies accurate similarities and differences between a cracked egg model and a scientist’s model of Earth’s surface/interior/geologic processes.) PO3. Use of a model to make a reasonable prediction about a phenomenon. (Earth science example: Given a representation that shows the Hawaiian islands, the current location of the hot spot, and the direction of plate movement, the student correctly predicts the location of the next volcano and appropriately justifies his or her prediction using the model.)
Characteristic task features (CF): Aspects of tasks that are necessary in some form to elicit desired evidence	CF1. All phenomena for which a model is developed must be observable or fit available evidence. CF2. Models provided in stimulus materials must illustrate a process or why a phenomenon exists. CF3. All items must elicit core ideas as defined in the Framework for K-12 Science Education (Quinn, Schweingruber, & Keller, 2012).
Variable task features (VF): Aspects of tasks that can be varied in order to shift difficulty or focus	VF1. Drawing required: none, add to existing picture, construct model from scratch. VF2. Format of “real-world” phenomenon presented: image, data, text, or combination. VF3. Disciplinary core idea targeted in model: physical science, Earth science. VF4. Function of the model: to explain a mechanism underlying a phenomenon, to describe/predict a phenomenon, to generate data. VF5. Scale of mechanistic relationships in model: observable-macro, unobservable-micro, unobservable-macro.
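Because a design pattern fixes the claims and enumerates the allowable settings of each variable feature, a task specification can be checked against it mechanically. The sketch below is one hypothetical way to do so in Python. The class `TaskSpec`, the field names, and the shortened value labels are our own illustrative choices rather than part of the published design pattern, and the example settings do not describe any actual study task.

```python
from dataclasses import dataclass
from typing import List

# Allowed settings for VF1-VF5, transcribed (with shortened labels) from Table 3.
VARIABLE_FEATURES = {
    "drawing": {"none", "add to existing picture", "construct from scratch"},
    "phenomenon_format": {"image", "data", "text", "combination"},
    "core_idea": {"physical science", "Earth science"},
    "model_function": {"explain mechanism", "describe/predict", "generate data"},
    "scale": {"observable-macro", "unobservable-micro", "unobservable-macro"},
}

CLAIMS = {"C1", "C2", "C3"}


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical record of the design choices made for a single task."""
    claim: str
    drawing: str
    phenomenon_format: str
    core_idea: str
    model_function: str
    scale: str

    def problems(self) -> List[str]:
        """List every field whose value falls outside the design pattern."""
        issues = []
        if self.claim not in CLAIMS:
            issues.append(f"claim={self.claim!r}")
        for feature, allowed in VARIABLE_FEATURES.items():
            value = getattr(self, feature)
            if value not in allowed:
                issues.append(f"{feature}={value!r}")
        return issues


# Illustrative settings only.
example = TaskSpec("C1", "construct from scratch", "image",
                   "Earth science", "explain mechanism", "unobservable-macro")
print(example.problems())  # an empty list means the spec fits the pattern
```

A check like this covers only the enumerated features; whether a task satisfies the characteristic features (CF1 to CF3) still requires expert judgment.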

The Framework for K-12 Science Education does not provide specific guidance about, nor does it clearly differentiate between, latent knowledge and skills, student performances, and task features with respect to modeling. The design pattern schema was essential in this regard. Because the structure of a design pattern implicitly contains the structure of an argument in general, completing the design pattern simultaneously renders explicit the relationships in an assessment argument for developing and using models as a practice. By investing in defining an assessment argument around a science practice, we were well positioned to apply the approach to assessment of modeling in both content domains. Importantly, while the claims in the design pattern highlight developing and using models, it is evident in the description of the potential observations and characteristic task features that core ideas and modeling must be blended in tasks. Over successive refinements to the design pattern, we incorporated the third dimension from the Framework for K-12 Science Education: crosscutting concepts. Variable Feature 5 now highlights how models may address micro- or macro-level relationships.

Using observations of student performance to inform task design

The design pattern provided a common approach for developers to implement modeling tasks and rubrics. In designing tasks, the team had to keep several constraints in mind. We had two sequential class periods (approximately 90 min total) for each assessment. Assessments needed to be delivered via paper and pencil, because it was not feasible to design and deliver technology-based assessments in the efficacy study. Using lab equipment also was not feasible because the project was not in a position to purchase and ship materials. Thus, we had the challenge of needing to design assessments that elicited modeling with more limited resources. As a consequence, however, the resulting assessments are ones that can be feasibly implemented in evaluation projects with modest resources.

The task in Figure 2 illustrates several lessons learned with respect to designing tasks to measure the three dimensions of core ideas, science practices, and crosscutting concepts. The task targets Claim 1, “Ability to construct a model and use the model to explain a phenomenon.” In early piloted versions of this task, students were asked to complete only Parts C and D; however, we found that prompting students to explain did not, by itself, elicit their understanding of the mechanism (model) that accounts for why the oldest rock was farther from the plate boundary. For middle school students, the additional scaffolding provided by the earlier parts of the task was important so that students understood the kind of evidence we expected in their responses. Thus, many of the tasks on our Earth science and physical science assessments have multiple parts. We also revised the item based on expert review, a process described in greater detail below. The expert review revealed a key discrepancy between theories of mantle convection presented in curricula today and modern geoscientists’ ideas about the role of heating from Earth’s core in driving convection currents.

Figure 2.

Earth science task.

Expert review

As another source of validity evidence, we asked experts who served on committees to develop the Framework for K-12 Science Education and/or NGSS to review items and rubrics. For each item and its rubric, reviewers responded to the following questions: (a) Does the item assess a concept targeted by the core idea? (b) Does the item assess the practice of modeling? and (c) For ratings of “yes” or “partial” to the previous questions, how well does the item address the integration of content and practice? Raters were not required to resolve discrepancies. They each brought expertise unique and critical to the review of the items, and thus, it was important for our purposes to capture variation. Perhaps not surprisingly, constructed response (CR) items were judged to elicit modeling better than multiple-choice (MC) items (Figures 3 and 4). In most cases, raters indicated that the core idea and modeling were well integrated. Several multiple-choice items were not well aligned and were candidates for significant revision.

Figure 3.

Alignment ratings of physical science multiple choice (MC) and constructed response (CR) items.

Figure 4.

Alignment ratings of earth science multiple choice (MC) and constructed response (CR) items.

Scoring and Evaluation of Rubrics

During scoring that took place for the piloting and field testing phases, we considered whether rubrics captured the range of evidence about the claims. We also examined the extent to which raters were able to consistently and reliably score tasks. These activities informed refinements to items and rubrics. An experienced psychometrician on our team provided feedback on rubrics, designed scoring sessions, and developed analyses of item performance. We recruited current and former middle school science teachers to serve as scorers. During early scoring sessions, they reviewed samples of student work with the rubrics as designed and added detail to clarify or refine rubrics. After scoring all papers, they provided recommendations on how items and/or scoring rubrics may be revised to better elicit the desired evidence.

Internal consistency

In the first field test in Spring 2011, we attempted to distinguish between content and science practice in each CR task to better understand the relationship between these constructs. Two rubrics were developed for each item: one that foregrounded content knowledge and the other that foregrounded developing and using models. When scorers subsequently attempted to distinguish the use of modeling from the conceptual knowledge required to understand the underlying problem in student responses, they were able to identify discontinuities between the item prompt and the performance expectations identified in the modeling rubrics. This fine-grained focus on modeling was useful in helping the designers think more specifically about how students utilize modeling to solve science problems, and how that might be articulated in student responses. This scoring approach helped to identify elements of content and practice targeted by each item. It also helped ensure that both were present in each task, but it required bundling of the two CR scores for each item to fulfill the assumption of item independence in the item response theory analyses.
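The bundling step described above can be illustrated with a minimal sketch. It assumes, hypothetically, that each CR item received one score from the content rubric and one from the modeling rubric and that the bundle is simply their sum; the function name and item labels are illustrative and are not drawn from the study's scoring files.

```python
def bundle_scores(content, modeling):
    """Combine two rubric scores per item into one polytomous score.

    Assumption: the bundled score is the simple sum of the content and
    modeling scores, so each task contributes one score to the IRT model
    rather than two scores that depend on the same response.
    """
    if content.keys() != modeling.keys():
        raise ValueError("both rubrics must cover the same items")
    return {item: content[item] + modeling[item] for item in content}


# Hypothetical scores for one student on three CR items.
content_scores = {"CR1": 2, "CR2": 1, "CR3": 0}
modeling_scores = {"CR1": 1, "CR2": 1, "CR3": 0}
bundled = bundle_scores(content_scores, modeling_scores)
```

Treating the two rubric scores as separate items would count the same student response twice, which is the dependence that bundling is meant to avoid.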

One-parameter unidimensional partial credit models were used to examine item and test qualities for each domain (physical science and Earth science). A comparison of the test information functions (TIF) provided by 6 CR items from the Energy (physical science) assessment scored for content, modeling practice, or combined in a bundled model indicated that the bundled model provided significantly more information than either the content or practices models (TIF(θ_C) ≤ 7, TIF(θ_P) ≤ 3, TIF(θ_B) ≤ 15; Kennedy, 2012a),¹ as did a comparison of models for the 8 CR items from the Earth science assessment (TIF(θ_C) ≤ 6, TIF(θ_P) ≤ 4, TIF(θ_B) ≤ 12; Kennedy, 2012b). A TIF value of 10 is analogous to a reliability coefficient of .90 at a particular proficiency level (Samejima, 1994). Going forward, blended content-modeling rubrics for each item (see Table 4) were used to distinguish among levels of sophistication with respect to both content knowledge and modeling. For complete credit in the most complex items, students must construct models that include scientifically accurate content knowledge and describe how their model helps to explain a phenomenon.
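To make the link between test information and reliability concrete, the following sketch computes category probabilities for a partial credit item, sums item information into a test information value, and converts that value to an approximate reliability. It rests on two assumptions that are not stated in the article: the proficiency scale has unit variance, so that reliability at a given proficiency is approximately 1 − 1/TIF, and the step difficulties shown are invented for illustration. The function names are hypothetical.

```python
import math


def pcm_probs(theta, deltas):
    """Category probabilities for one partial credit item.

    `deltas` holds the step difficulties; an item with m steps has
    m + 1 score categories (0..m).
    """
    # Cumulative sums of (theta - delta_j); category 0 has exponent 0.
    exponents = [0.0]
    for delta in deltas:
        exponents.append(exponents[-1] + (theta - delta))
    top = max(exponents)  # subtract the max for numerical stability
    weights = [math.exp(e - top) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


def item_information(theta, deltas):
    """Item information, which for this model equals the score variance."""
    probs = pcm_probs(theta, deltas)
    mean = sum(k * p for k, p in enumerate(probs))
    return sum((k - mean) ** 2 * p for k, p in enumerate(probs))


def total_information(theta, items):
    """Test information: the sum of item informations at theta."""
    return sum(item_information(theta, deltas) for deltas in items)


def reliability_from_information(tif):
    """Approximate local reliability, assuming proficiency variance of 1."""
    return 1.0 - 1.0 / tif


# Hypothetical step difficulties for two four-category CR items.
items = [[-1.0, 0.0, 1.0], [-0.5, 0.5, 1.5]]
info_at_zero = total_information(0.0, items)
```

Under the unit-variance assumption, a TIF of 10 gives the .90 cited in the text, and the reported ceilings of 7, 3, and 15 for the Energy assessment would correspond to roughly .86, .67, and .93.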

Table 4.

Blended Core Idea—Practice Rubric for Earth Science Task.

Score Point	Descriptors for Parts A, B, C, and D
+1	A: Arrows are next to (or in) the magma on both sides angled up toward the crust, and then away from the magma and (maybe) down toward the bottom of the picture (the convection cycle). Arrows must be drawn in the mantle. All arrows drawn only up or down are not acceptable
+1	B: Student explains that convection in the mantle drags/moves/pulls the two plates apart. (Student must talk about convection causing the plates to move, not just that the magma is “coming” or “pushing” up)
+1	C: Xs are on both sides of the drawing, on the outermost edge of the crust. X’s can be placed anywhere along outermost edge of crust, including multiple X’s lining the outermost edge of crust
+1	D: Older rock is dragged (moves) away from where the magma pushes up (plates move apart from the boundary). Or magma is coming up between the plates and filling the gap with new rock
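The additive structure of the rubric in Table 4 can be sketched as follows. This is only an illustration of how one point per part accumulates into a total score from 0 to 4; the data structure and function name are hypothetical, and the judgment of whether a part earns credit remains with the human rater.

```python
RUBRIC_PARTS = ("A", "B", "C", "D")


def total_score(part_credit):
    """Sum the +1 credits a rater awarded across Parts A-D."""
    missing = [p for p in RUBRIC_PARTS if p not in part_credit]
    if missing:
        raise ValueError(f"no rating recorded for parts: {missing}")
    return sum(1 for p in RUBRIC_PARTS if part_credit[p])


# Hypothetical response: the convection arrows and explanation earn
# credit, but the oldest rock is misplaced and not explained.
example = {"A": True, "B": True, "C": False, "D": False}
score = total_score(example)
```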

Interrater reliability

A factor related to item quality is the degree to which different raters interpret student responses similarly. Scoring sessions typically involved four to five raters, and approximately 10% of all items were scored by the same raters. On both the physical and Earth science assessments, intraclass correlation coefficient (ICC) analyses were conducted. The ICC is used as a measure of the reliability of the scores obtained from different raters for the same student response. A two-way mixed effects model with absolute agreement was used to produce average-measures ICC coefficients. A one-way model would have been used if raters were randomly chosen to score individuals. In the present study, a fixed set of raters was used, so the two-way model can assess systematic deviations among the raters. A mixed effects model is used because while the raters are fixed and do not vary randomly, the students (respondents) are considered randomly drawn from the population of science students. Absolute agreement is used because exact scores are important rather than correlated scores, and average measures are used because scores from multiple raters are averaged together for each student. No student in the study is scored by only one rater. Interrater reliability was at least .80 on all items except for 2 Earth science items (Kennedy, 2012a, 2012b).
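The ICC variant described above can be sketched from the mean squares of a two-way analysis of variance. The sketch below is a generic textbook computation of the average-measures, absolute-agreement coefficient, not the study's own analysis code; the function name and the small rating tables are hypothetical.

```python
def icc_absolute_average(ratings):
    """Average-measures ICC with absolute agreement for a two-way layout.

    `ratings` is a list of rows, one per student response, each holding
    the scores from the same k raters in the same order.
    """
    n = len(ratings)
    k = len(ratings[0])
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete n x k table with n, k >= 2")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Mean squares from the two-way ANOVA decomposition.
    ms_rows = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_cols = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ss_error = sum(
        (ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Differences in rater means (ms_cols) count against absolute agreement.
    return (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)


# Hypothetical tables: identical scores, and one rater who is always
# one point more lenient than the other.
identical = [[0, 0], [2, 2], [4, 4]]
lenient = [[0, 1], [2, 3], [4, 5]]
```

In the second table the two raters rank the responses identically, yet the coefficient falls below 1 because absolute agreement penalizes the constant one-point offset. This is the sense in which exact scores, rather than merely correlated scores, matter.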

Item and Test Performance

Analyses of item and test performance included evaluation of item fit and difficulty parameters and test reliability. The fit of items on each assessment to a unidimensional “content + modeling” scale in each content area was examined to identify potential construct-irrelevant variance. Each model was calibrated using bundled scores for the CR items only, then a second model was estimated that added the multiple-choice items with anchored CR item parameters. Unweighted mean squares (outfit), which tended to deviate more than the weighted mean squares (infit), for the Earth science items ranged from 0.83 to 1.26 on one test form, and from 0.86 to 1.24 for the second form. For physical science, which combined items from both forms using common item equating, outfit ranged from 0.85 to 1.12. These mean square values indicate good fit for the items on the different tests (see Wright, Linacre, Gustafson, & Martin-Lof, 1994). Wright and colleagues (1994) indicate that mean square values of less than 0.6 or greater than 1.4 indicate a practically significant amount of misfit for partial credit items, while values less than 0.7 or greater than 1.3 indicate practically significant misfit for multiple-choice items. Examination of difficulty parameters (item thresholds) can reveal gaps in representation of the construct at particular proficiencies. Our analyses revealed that more items assessed higher-than-average levels than lower levels on the Energy “content + modeling” construct and that there were relatively large gaps on the proficiency scale among the lower level items (Kennedy, 2012a). For Earth science, our analysis focused on improving measurement of the modeling practices aspect of the construct. Our findings indicated a need for more items to assess proficiencies in the middle range on the construct (Kennedy, 2012b).
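The fit statistics discussed above can be illustrated with a short sketch. For simplicity it assumes a dichotomous Rasch item with known proficiencies and difficulty, whereas the study's CR items were polytomous; the function names and inputs are hypothetical. The screening ranges are the ones quoted in the text from Wright and colleagues (1994).

```python
import math


def rasch_prob(theta, difficulty):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def fit_mean_squares(responses, thetas, difficulty):
    """Return (outfit, infit) mean squares for one dichotomous item."""
    sq_resid, variances = [], []
    for x, theta in zip(responses, thetas):
        p = rasch_prob(theta, difficulty)
        sq_resid.append((x - p) ** 2)
        variances.append(p * (1.0 - p))
    # Outfit: unweighted mean of squared standardized residuals.
    outfit = sum(r / v for r, v in zip(sq_resid, variances)) / len(responses)
    # Infit: information-weighted, so outlying responses count for less.
    infit = sum(sq_resid) / sum(variances)
    return outfit, infit


# Acceptable ranges quoted in the text from Wright et al. (1994).
FIT_RANGES = {"partial_credit": (0.6, 1.4), "multiple_choice": (0.7, 1.3)}


def is_misfit(mean_square, item_type):
    """Flag a mean square that falls outside the range for its item type."""
    low, high = FIT_RANGES[item_type]
    return mean_square < low or mean_square > high


# Hypothetical item answered by four students of average proficiency.
outfit, infit = fit_mean_squares([1, 0, 1, 0], [0.0, 0.0, 0.0, 0.0], 0.0)
```

Because outfit is unweighted, a few unexpected responses from students far from the item's difficulty can move it substantially, which is consistent with the observation that outfit tended to deviate more than infit.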

The 2012 revised version of the physical and Earth science assessments exhibited substantial improvements over the initial pilot version, primarily with regard to test reliability and test information. Most of the items worked very well to inform a unidimensional measure combining conceptual science knowledge with scientific practice and together produced a test reliability coefficient of at least .80 for both assessments. The analyses also indicated that while scoring at the item-part level produces consistent scores among multiple raters, bundled models provided better fit of the data to a unidimensional proficiency scale. A primary concern with the revised set of items was the large number of difficult items. Three of the 13 physical science CR items, for instance, had no responses in the highest scoring category (N = 164).

Prior to use in the main study, item prompts in both assessments, particularly for items found to be very difficult, were reevaluated to better align to the practice of modeling (as described earlier) and revised for clarity so that students would know what is expected to earn full credit. In revising scoring guides, we worked to improve coherence between the rubrics and the modeling construct map, including mappings such as the example in Table 5.

Table 5.

Mapping of Blended Rubric and Developing and Using Models Construct Map.

Total Score, see Table 4	Level on Modeling Construct Map, see Table 2	Rationale
4	4	Student is able to construct an accurate model using drawing and writing that shows convection currents as the mechanism to explain the phenomenon of plate movements at a divergent zone. The explanation also includes an understanding about why older rock can be found further away from the plate boundary
3	3	Student’s model is mostly complete based on drawings and writings. Some aspects of the mechanism of convection currents are present. Details may be missing regarding aspects of convection currents or reasoning about where older rock can be found
2	2	Student’s model in drawing and writing is partial, with minimal evidence of the understanding of the mechanism of convection
1	1	Student’s model in writing and drawing attends primarily to aspects of the phenomenon of plate movement and/or the location of older rocks. Evidence for the understanding of the mechanism of convection is not present
0	0	No evidence of modeling

Assembling the Use Argument: Conceptualizing and Analyzing Treatment Strength

In this section, we highlight design decisions with respect to aspects of the evaluation use argument that are particularly relevant to the assessment argument and to the challenges unique to evaluation of curriculum materials, namely, the analysis of achieved relative treatment strength. Treatment strength refers to the theoretical coherence of the treatment and how much of a treatment participants are expected to experience (Cordray & Pion, 2006; Sechrest & Yeaton, 1979). This is important because some treatments are likely to be so weak as to have little chance for success (Sechrest, West, Phillips, Redner, & Yeaton, 1979). Relative treatment strength refers to the difference between the strength of a proposed treatment and that of a comparison treatment (e.g., typical practice); analysis includes components that are unique and essential to the treatment, as well as essential nonunique components (Cordray & Pion, 2006).

Program implementers change programs when they implement them in real settings. Sometimes, a treatment may not produce a significant effect because the integrity of the program has been compromised by those changes (Sechrest et al., 1979). Hence, it is critical for evaluation researchers to develop evidence about the achieved relative treatment strength or the strength as realized in implementation. In an evaluation use argument, achieved relative treatment strength is an important qualifier to inferences about the efficacy of programs.

In conceptualizing relative treatment strength for the PBIS evaluation study, our team faced a number of dilemmas. Early in the study, we did not know where we would be conducting our study. It was thus impossible to know what curriculum materials teachers in the comparison condition would be using. Without knowing what curriculum materials were in play, we could not compare the theoretical coherence of PBIS with comparison materials. Even after we selected a district for the study, we discovered through survey research that almost all teachers regularly supplemented the district-adopted textbook with other materials. We did not have access to these materials to analyze their coherence, either.

Another dilemma particular to NGSS is that the developers of PBIS did not create curriculum materials aligned to either the Framework for K-12 Science Education or NGSS. For instance, while there are several opportunities for students to engage in modeling, other science practices are less prominent. In addition, the disciplinary core ideas focal in units chosen for the study (due to alignment to the state standards where we conducted the study) did not align fully to the disciplinary core ideas of the Framework for K-12 Science Education. Our situation is not unique, and many evaluation researchers will face this dilemma in the future.

In response to these dilemmas, we constructed a curriculum theory of action focused on what we hypothesized would differentiate teaching and learning with PBIS from teaching and learning with textbooks with few opportunities for direct investigation of phenomena. We first gathered input from curriculum designers about what they considered the key “active ingredients” of the curriculum. We also analyzed the district-adopted textbook to better understand how it presented content, including the opportunities provided for students to engage in science practices. Finally, we used the Framework for K-12 Science Education to identify the kinds of opportunities hypothesized to link to student learning.

Our constructs for implementation measures focused on both unique and nonunique program elements hypothesized to be essential for improving teaching and learning. Data sources included weekly teacher logs, an annual teacher survey, observations, teacher assignments, and associated student work products. These sources will provide evidence of the achieved relative treatment strength for PBIS. If those data indicate that achieved relative treatment strength is low, then any conclusions regarding the efficacy of the program must be qualified. The teacher assignments and associated student work data collections are essential for helping us to understand just how different implementations of PBIS and comparison materials are.
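One way to summarize achieved relative treatment strength from such data is as a standardized difference between conditions on an implementation measure. The sketch below takes that approach as an assumption; it is not presented as the analysis planned for this study, and the function name and the log-based index values are hypothetical.

```python
import math
import statistics


def relative_strength_index(treatment, comparison):
    """Standardized difference in an implementation measure between arms.

    Assumption: each list holds one implementation score per teacher
    (for example, a log-based index of modeling opportunities), and the
    difference in means is scaled by the pooled standard deviation.
    """
    n_t, n_c = len(treatment), len(comparison)
    pooled_var = (
        (n_t - 1) * statistics.variance(treatment)
        + (n_c - 1) * statistics.variance(comparison)
    ) / (n_t + n_c - 2)
    diff = statistics.mean(treatment) - statistics.mean(comparison)
    return diff / math.sqrt(pooled_var)


# Hypothetical weekly-log indices for three teachers in each condition.
pbis_teachers = [4, 5, 6]
comparison_teachers = [1, 2, 3]
index = relative_strength_index(pbis_teachers, comparison_teachers)
```

An index near zero would suggest that students in the two conditions had similar opportunities, in which case a null effect on the assessments would say little about the curriculum as designed.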

Importantly, our measures for analyzing achieved relative treatment strength are linked to the student learning assessments described in this article. Specifically, protocols focus on the same disciplinary core ideas and practices as the assessments. They assess opportunities students have to engage in the science practice of modeling, in the context of participation in activities focused on disciplinary core ideas in physical science and Earth science that are included on the assessment. The linkages will allow us to analyze evidence related to the instructional sensitivity of the assessments. We plan to examine the association between opportunity to learn and student outcomes among comparison groups, where teachers are using a variety of instructional materials, including the district-adopted textbook. Within the broader experiment, we can also test whether and how much teachers’ engagement of students in practices to teach core ideas mediates any treatment effects. This analysis has the potential, too, to contribute to the field’s understanding of how engagement in practices relates to student learning.

Discussion

Our assessment design process entailed many different considerations with respect to content and practices, context for the study, and the program we were evaluating. The validity argument considered claims, data, warrants, and qualifiers for an assessment argument and related these to a use argument to promote valid claims about program efficacy or effectiveness. In the context of the NGSS, these interlocking arguments became more complex because not only core ideas but also science practices and crosscutting concepts needed to be considered.

Our assessment design decisions were guided by ECD, which provided structure for defining and refining claims of performance expectations to develop tasks that blend core ideas, practices, and crosscutting concepts. ECD approaches were very helpful in building consensus among the design team about what it means to assess modeling. Both the design pattern and construct map promoted coherence in the design of modeling tasks. We iterated on task designs, developing and using evidence from scoring sessions and item modeling to refine construct maps, tasks, and rubrics. Cycles of early field testing were critical in providing evidence about the extent to which items were capturing the critical range of ideas related to the claims.

This study has two important limitations. First, the assessments focused on only particular components of the practice of developing and using models. Constrained by administration conditions, we were not able to easily elicit other components, such as model testing and model revision. Assessing these components may be better supported in formative classroom administrations, where students have more time to develop and revise their models, and with technology-supported assessments, where conducting investigations and making revisions on the basis of data are possible. Nonetheless, this study demonstrates that, even under limited administration conditions, we were able to elicit aspects of this complex science practice. Second, the pilot assessments were administered with teachers implementing a single set of curriculum materials. The assessments need to be further tested using additional sets of curricular interventions.

At this time, the evaluation study of the PBIS curricular intervention is ongoing, and we are still developing evidence related to the use argument. The study is examining weekly online classroom logs as evidence of teachers’ implementation of two curricular units, to understand teachers’ enactment and the frequency with which they engaged students in science practices. Analyses of classroom video will provide evidence of how teachers’ orchestration of class discussions, specifically talk moves² during whole class discussion, shapes opportunities for students to engage in modeling.

Implications for Evaluation

Program evaluators need assessments that yield evidence to support inferences about student progress toward the NGSS performance expectations. An adequate evaluation of a program purporting to be “NGSS-aligned” requires assessments that include tasks that measure the integration of core ideas, practices, and crosscutting concepts. Most widely available assessments today, however, measure either content knowledge or science practices in isolation (Pellegrino et al., 2013). Thus, simply taking science measures “off the shelf” is not an adequate strategy for developing evidence for claims about program effectiveness.

Our aim in this article was to begin to articulate a model for designing NGSS assessments for program evaluation. As assessment development guided by the NGSS proceeds, it is not enough to ask questions about student performance data and about evidence supporting the adequacy of tasks, scoring, and test performance. It is also imperative to anticipate and design for the intended purposes of the assessment from the start.

While many of the assessment design decisions were specific to NGSS and the PBIS curriculum, this article provides an approach for linking assessment and use arguments that may be applied in other evaluations of curricular interventions and to support the design of new curricular interventions. Tools like design patterns are critical in building consensus among the design team about how to design tasks to elicit evidence of claims. Although the PBIS efficacy study did not examine claims of program effectiveness for specific populations of students, such as English learners or students with disabilities, design patterns also can facilitate explication of skills and supports needed for particular populations of students. For example, in documenting the additional knowledge, skills, and abilities for reading that may be required in science assessments, developers can make principled decisions about whether to employ multiple representations for English learners or scaffolds for students with disabilities. With respect to design, ECD may be useful for developing assessments that guide the creation or adaptation of coherent sequences of instructional experiences that can help all students meet challenging performance expectations, as is recommended in various “backward design” techniques.

These kinds of ECD-based schemas also may serve to build coherence in assessment systems. For example, a tool like the modeling design pattern that we developed is agnostic to purpose, and thus may be adapted to inform the design of modeling assessments for formative purposes. Thus, the up-front investment of generating focal points for collaboration and shared agreement about how to assess learning goals can have long-term benefits in establishing coherence within an assessment system.

Footnotes

Author’s Note

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This material is based upon work supported by the National Science Foundation under Grant Number DRL-1020407.

Notes

References

Atkin, J. M., & Black (2003). Inside science education reform: A history of curricular and policy change. New York, NY: Teachers College Press.

Century, Rudnick, & Freeman (2010). A framework for measuring fidelity of implementation: A foundation for shared language and accumulation of knowledge. American Journal of Evaluation, 31, 199–218.

Cordray, D. S., & Pion, G. M. (2006). Treatment strength and integrity: Models and methods. In R. R. Bootzin & P. E. McKnight (Eds.), Strengthening research methodology: Psychological measurement and evaluation (pp. 103–124). Washington, DC: American Psychological Association.

Duschl, R. A., Schweingruber, H. A., & Shouse, A. W. (Eds.). (2007). Taking science to school: Learning and teaching science in grades K-8. Washington, DC: National Academies Press.

Furtak, E. M., Thompson, Braaten, & Windschitl (2012). Learning progressions to support ambitious teaching practices. In A. C. Alonzo & A. W. Gotwals (Eds.), Learning progressions in science: Current challenges and future directions (pp. 405–434). Rotterdam, the Netherlands: Sense Publishers.

Kane, M. T. (1992). An argument-based approach to validity. Psychological Bulletin, 112, 527–535.

Kennedy, C. A. (2012a). PBIS energy instrument psychometric analyses. Unpublished internal document, SRI, Menlo Park, CA.

Kennedy, C. A. (2012b). New PBIS earth science analysis. Unpublished internal document, SRI, Menlo Park, CA.

Kolodner, J. L., Camp, P. J., Crismond, Fasse, B. B., Gray, J. T., Holbrook, … Ryan (2003). Problem-based learning meets case-based reasoning in the middle-school science classroom: Putting learning-by-design into practice. Journal of the Learning Sciences, 12, 495–547.

Love (2004). Implementation evaluation. In J. S. Wholey, H. P. Hatry, & K. E. Newcomer (Eds.), Handbook of practical program evaluation (2nd ed., pp. 63–97). San Francisco, CA: Jossey-Bass.

Means, & Penuel, W. R. (2005). Research to support scaling up technology-based educational innovations. In Dede, J. P. Honan, & L. C. Peters (Eds.), Scaling up success: Lessons from technology-based educational improvement (pp. 176–197). San Francisco, CA: Jossey-Bass.

Messick (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York, NY: Macmillan.

Messick (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23, 13–23.

Messick (1995). Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50, 741–749.

Michaels, & O’Connor (2011). Talk science primer. Cambridge, MA: TERC.

Mislevy, R. J., & Haertel, G. D. (2006). Implications of evidence-centered design for educational testing. Educational Measurement: Issues and Practice, 25, 6–20.

Morgan, S. L., & Winship (2007). Counterfactuals and causal inference. London, England: Cambridge University Press.

National Research Council. (2002). Scientific research in education. Washington, DC: National Academy Press.

NGSS Lead States. (2013). Next generation science standards: For states, by states. Washington, DC: National Academies Press.

Pellegrino, J. W. (2013). Proficiency in science: Assessment challenges and opportunities. Science, 340, 320–323.

Pellegrino, J. W., Chudowsky, & Glaser (2001). Knowing what students know: The science and design of educational assessment. Washington, DC: National Academies Press.

Pellegrino, J. W., Wilson, Koenig, J. A., & Beatty, A. S. (2013). Developing assessments for the Next Generation Science Standards. Washington, DC: National Academies Press.

Quinn, Schweingruber, & Keller (Eds.). (2012). A framework for K-12 science education: Practices, crosscutting concepts, and core ideas. Washington, DC: National Academies Press.

Rivet, A. E., & Kastens, K. A. (2012). Developing a construct-based assessment to examine students’ analogical reasoning around physical models in earth science. Journal of Research in Science Teaching, 49, 713–743. doi:10.1002/tea.21029

Ruiz-Primo, M. A., Shavelson, R. J., Hamilton, L. S., & Klein (2002). On the evaluation of systemic science education reform: Searching for instructional sensitivity. Journal of Research in Science Teaching, 39, 369–393.

Samejima (1994). Estimation of reliability coefficients using the test information function and its modifications. Applied Psychological Measurement, 18, 229–244.

Schwarz, C. V., Reiser, B. J., Davis, E. A., Kenyon, Acher, Fortus, … Krajcik (2009). Developing a learning progression for scientific modeling: Making scientific modeling accessible and meaningful for learners. Journal of Research in Science Teaching, 46, 632–654.

Sechrest, L. B., West, S. G., Phillips, M. A., Redner, & Yeaton, W. H. (1979). Some neglected problems in evaluation research: Strength and integrity of treatments. In L. B. Sechrest, S. G. West, M. A. Phillips, Redner, & Yeaton (Eds.), Evaluation studies review annual (Vol. 4, pp. 15–35). Beverly Hills, CA: Sage.

Sechrest, L. B., & Yeaton, W. H. (1979). Strength and integrity of treatments in evaluation studies. Washington, DC: National Criminal Justice Reference Service.

Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston, MA: Houghton-Mifflin.

Shepard, L. A. (1993). Evaluating test validity. In Darling-Hammond (Ed.), Review of Research in Education (Vol. 19). Washington, DC: American Educational Research Association.

Singer, Marx, R. W., Krajcik, & Clay-Chambers (2000). Constructing extended inquiry projects: Curriculum materials for science education reform. Educational Psychologist, 35, 165–178.

Wright, B. D., Linacre, J. M., Gustafson, J. E., & Martin-Lof (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8, 370.