Constructing Aligned Assessments Using Automated Test Construction

Abstract

We describe an innovative automated test construction algorithm for building aligned achievement tests. When the algorithm is incorporated into the test construction process, along with other procedures for building reliable and unbiased assessments, the resulting tests are much more valid than those produced by current test construction practices. The test construction process is efficient and versatile in its application. The target domain could be state content standards, a particular curriculum, or even an intervention. The algorithm, which draws on Surveys of Enacted Curriculum (SEC) procedures, can be employed in the context of English language arts and reading, mathematics, or science. We demonstrate the algorithm’s use and benefits for constructing tests, guiding item writing, and creating aligned subtests from an existing test.

Keywords

achievement tests, algorithm, alignment, assessments, test construction

Despite the requirements of the No Child Left Behind Act (NCLB) and the theories underlying standards-based educational reforms, the tests used to measure student mastery of state content standards under NCLB have been weakly aligned to those standards (Polikoff, Porter, & Smithson, 2011). Aside from the obvious inefficiencies posed by this disconnect, misalignment carries with it a number of practical problems. For one thing, poorly aligned tests send confusing messages to teachers about the content they should be teaching. Misalignment also undermines the clear communication of content in the standards. And misaligned tests fail to provide teachers with robust information about the extent to which their instruction has helped students learn core content.

The need for increased alignment has taken on new urgency as educators transition to the Common Core State Standards (CCSS). The U.S. Department of Education is funding two major consortia to develop assessments that are aligned to the CCSS by the 2014–15 academic year.

In what follows, we present an algorithm for automated test construction that yields more tightly aligned assessments. The use of the algorithm makes the validity argument of alignment or content validity a more transparent and replicable part of item writing and item selection in test construction. The resulting tests are much more highly aligned, and therefore more valid, than those produced by current test construction procedures. One of the primary benefits of our algorithm is that the target domain is generalizable; the domain could be content standards, a specific set of curriculum materials, or an intervention designed to improve participants’ knowledge. Similarly, our approach allows for numerous types of items, including multiple choice, short answer, and extended response. Our algorithm is efficient and versatile, and it produces more tightly aligned assessments than those created by existing test construction techniques. When use of the alignment algorithm is added to test construction procedures that ensure reliability and lack of bias, the result is much better information on student achievement levels.

We first situate our work in the commonly accepted standards for testing. Next, we describe the study that motivated the work. Finally, we introduce our algorithm and illustrate its uses. We conclude with implications for policy and practice.

A Brief History of the Call for Alignment

The Standards and Alignment

According to Linn (2006), the Standards for Educational and Psychological Testing “are widely recognized as the most authoritative statement of professional consensus regarding the development and evaluation of educational and psychological tests” (p. 27). Although the Standards do not contain the term alignment (American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME], 1999), they do refer to the concept of alignment in various ways.

In the section “[Validity] evidence based on test content,” the Standards state that “plausible rival hypotheses can often be generated by considering whether a test measures less or more than its proposed construct. Such concerns are referred to as construct underrepresentation and construct irrelevant variance” (p. 10), and both contribute to lack of alignment between the test and its target (Liu & Fulmer, 2008; Martone & Sireci, 2009).

Typically, when constructing student achievement tests, the first step is to specify a framework for the test that describes the scope of the domain to be measured. The next step is to establish test specifications. The third step, according to the Standards, is to assemble items into a test: “relevant consideration in item selection may include the content quality and scope, the weighting of items and subdomains, and the appropriateness of the items selected for the intended population of test takers” (AERA, APA, NCME, 1999, p. 39). Standard 3.11 states, “Test developers should document the extent to which the content demand of a test represents the defined domain and test specifications” (p. 45).

And finally, Standard 13.3 calls for the concept of alignment—as well as clear evidence of it—in the following passage, again without using the exact term:

When a test is used as an indicator of achievement in an instructional domain or with respect to specified curriculum standards, evidence of the extent to which the test samples the range of knowledge and elicits the processes reflected in the target domain should be provided. Both tested and target domains should be described in sufficient detail so their relationship can be evaluated. The analyses should make explicit those aspects of the target domain the test represents as well as those aspects that it fails to represent. (AERA, APA, NCME, 1999, p. 145)

Alignment in NCLB and CCSS

At least since the enactment of NCLB, alignment has been the preferred term to describe the strength of the relationship between what is actually tested and what test makers want to test. The law requires states to align assessments with academic content standards.

The next generation of standards is the CCSS (Common Core State Standards Initiative, 2010a, 2010b) in English language arts and mathematics. As we mentioned in the introduction to this article, the U.S. Department of Education is funding two multistate testing consortia to develop assessments aligned to the CCSS and ready for use by the 2014–15 academic year.

Under NCLB, the U.S. Department of Education requires states to provide evidence that their assessments are aligned to their content standards. However, NCLB does not define alignment precisely. In response, researchers have developed several procedures for measuring alignment (Ananda, 2003; Bhola, Impara, & Buckendahl, 2003; Olson, 2003; Porter, 2002; Rothman, 2003).

Measuring Alignment

Webb (2002) developed the most widely used method to investigate test-standards alignment. Webb’s procedure relies on expert judgment of the relationship between the assessment and the target content standards in four areas; judgments of alignment are based on surpassing minimum standards in each area. Webb’s procedure is applicable only to investigating the degree of alignment of a particular assessment to a particular set of content standards; it does not produce a single alignment index.

The second most commonly used method to investigate alignment was developed by Porter and colleagues (Gamoran, Porter, Smithson, & White, 1997; Porter, Floden, Freeman, Schmidt, & Schwille, 1988). This procedure produces an index of alignment ranging from 0 to 1. It can be used to investigate the degree of content alignment between any two statements of content, including content standards, assessments, curriculum materials, and teacher reports of the enacted curriculum (Porter, 2006). Because it is generalizable, and because it yields a single index of alignment, we adopted Porter’s alignment procedure as the basis of our test construction algorithm.

The Problem

The problem that led us to create the test construction algorithm was the need to develop student achievement tests and teacher knowledge tests for a yoked pair of large randomized field trials to test education interventions. One intervention involved the use of cognitive science principles (analogical and diagrammatic reasoning, and misconceptions) to modify middle school science curricula (Bransford & Schwartz, 199l; Chi, 2005; Hegarty, Kriz, & Cate, 2003). The other intervention used professional development to improve teacher content knowledge. The work involved testing the effects of the two interventions against each other and against a business-as-usual control.

We built two 90-school randomized field trials, one using Holt curriculum materials and the other using Full Option Science System (FOSS) curriculum materials. For each trial, the interventions targeted three curriculum units.

Two research questions guided our work: First, does the content knowledge intervention improve teacher content knowledge as compared to the control group? Second, which of the two interventions results in the greatest improvements to student achievement? To answer the first question, we needed to develop six tests of teacher content knowledge, one for each unit in the professional development intervention. To answer the second question, we needed to build six tightly aligned student achievement tests.

Our Approach

Our approach was to collect two populations of high-quality (reliable and unbiased) previously used science items, one for teacher content knowledge and one for student achievement, and to draw items from the populations to construct the 12 tests. Items for the student assessments came from National Assessment of Educational Progress science assessments, publicly released state assessments, and the My Pals Are Here Science Tests (Hwa, 2009). Items for the teacher assessments came from the Educational Testing Service’s Praxis exams for middle school science and other Praxis items, as well as the Diagnostic Science Assessments for Middle School Teachers (Saderholm, Ronau, Brown, & Collins, 2010), the Geosciences Concept Inventory (Libarkin & Anderson, 2005), and the Misconceptions-Oriented Standards-Based Assessment Resources for Teachers (Sadler, Coyle, Cook-Smith, & Miller, 2007).

Describing the Content Using the Surveys of Enacted Curriculum

The Porter (2002) alignment measure begins by using a standard language to describe the content of the target domain and the available test items. The procedures are known as the Surveys of Enacted Curriculum (SEC) and are available through the Wisconsin Center for Education Research. Content languages are available for English language arts and reading, mathematics, and science. Each content language has two dimensions: topics and categories of cognitive demand. Topics are fine-grained (e.g., transport of cellular material), and each fine-grained topic is nested within a coarse-grained topic (in this case, components of living systems). In all, the SEC language for science encompasses 211 fine-grained topics; when crossed with the five categories of cognitive demand, the SEC makes 1,055 content distinctions to describe K–12 science content. We describe the topics and cognitive demand categories in the online appendix. While the SEC frameworks have five levels of cognitive demand and approximately 200 topics, the algorithm will work on any framework for describing content that results in formal replicable distinctions among different content in a content domain. Content can be defined using distinctions at any “grain size” using one, two, or more dimensions of content types. We favor descriptions of content that result in distinctions in what is and is not taught that predict variance in measured value added to student achievement.

Each of the content languages has evolved over time. The goal was to create languages inclusive enough to capture virtually all of the science, math, and English language arts and reading taught in K–12 U.S. schools. The distinctions made among topics and categories of cognitive demand reflect distinctions teachers say they make in deciding what to teach and what not to teach. Over time, teachers, curriculum experts, and researchers have tested the languages, and changes have been incorporated (Porter, 2002).

Analyzing the Content

Porter’s content analysis techniques are described in detail elsewhere (Porter, Polikoff, Zeidner, & Smithson, 2008); here we provide a brief overview, as the techniques apply to our pair of randomized trials.

For the units and the achievement items collected, we asked content experts to independently analyze the material. For each of the smallest chunks of material being analyzed (e.g., an item), each content analyst determined which of the cells (defined by the intersection of a topic and cognitive demand) in the language were implied by the content. Analysts were restricted to selecting no more than three cells per item. If an analyst placed an item in multiple cells, the score points associated with that item were divided equally between those cells. The materials being content analyzed were weighted equally unless there was an indication of differential weighting (e.g., a 2-point test item gets twice as much weight as a 1-point item); for pieces of material in the six units, there was no other source of information from which to create differential weighting, so each was weighted 1.0.

After our content experts analyzed all the material for a unit, we converted frequencies in cells to proportions by dividing by the total number of pieces of instructional content analyzed. We then formed a matrix of proportions from the data of each content analyst, and averaged these across content analysts. The end result for each unit is a 211-by-5 matrix of proportions, which, across all cells, sums to 1.0.
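The bookkeeping described above can be illustrated with a minimal Python sketch. It is not the authors' software: the function names, the dictionary representation of cells, and the example topic and demand labels are all assumptions made for illustration. The sketch divides by the total weight analyzed, which equals the number of pieces of content when every piece is weighted 1.0.

```python
from collections import defaultdict

def rater_proportions(items):
    """Turn one analyst's codings into cell proportions.

    `items` is a list of (cells, weight) pairs. `cells` holds one to three
    (topic, cognitive_demand) tuples; `weight` is the item's score points
    (1.0 when no differential weighting is known).
    """
    totals = defaultdict(float)
    for cells, weight in items:
        if not 1 <= len(cells) <= 3:
            raise ValueError("each item must be placed in one to three cells")
        share = weight / len(cells)  # split score points equally across cells
        for cell in cells:
            totals[cell] += share
    grand_total = sum(totals.values())
    return {cell: value / grand_total for cell, value in totals.items()}

def average_over_raters(rater_matrices):
    """Average the cell proportions of several analysts, cell by cell."""
    cells = set().union(*rater_matrices)
    n = len(rater_matrices)
    return {c: sum(m.get(c, 0.0) for m in rater_matrices) / n for c in cells}

# Hypothetical codings of the same two items by two analysts.
rater_a = rater_proportions([
    ([("cells", "recall")], 1.0),
    ([("cells", "recall"), ("cells", "explain")], 1.0),
])
rater_b = rater_proportions([
    ([("cells", "explain")], 1.0),
    ([("cells", "explain")], 1.0),
])
unit_matrix = average_over_raters([rater_a, rater_b])
```

Cells that no analyst selected are simply absent from the dictionary, which is equivalent to a proportion of zero in the full topic-by-demand matrix.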

Measuring Alignment

To investigate the alignment of a test to a unit, we compared the matrix for the unit to the matrix of proportions for the test. The alignment index is defined by the formula:

Alignment = 1 - \frac{\sum_i | x_i - y_i |}{2},

where x_i = the assessment proportion in cell i and y_i = the target unit proportion in cell i (Porter, 2002). The index ranges from 0 (no alignment) to 1.0 (perfect alignment). The SEC and its alignment index have been used to investigate the alignment of state assessments to state standards (Polikoff et al., 2011), the alignment of state standards to one another (Porter, Polikoff, & Smithson, 2009) and to the CCSS (Porter, McMaken, Hwang, & Yang, 2011a, 2011b), and the alignment of instructional content as delivered by teachers to both standards and assessments (Polikoff, 2012).
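As a concrete illustration of the index, the following Python sketch computes it for two sets of cell proportions. The function name and the dictionary representation (cell label mapped to proportion, with missing cells treated as zero) are assumptions for illustration only.

```python
def alignment_index(test, target):
    """Return 1 minus half the summed absolute differences in cell proportions.

    `test` and `target` map cell labels to proportions that each sum to 1.0.
    A cell missing from either mapping counts as a proportion of 0.
    """
    cells = set(test) | set(target)
    total_abs_diff = sum(abs(test.get(c, 0.0) - target.get(c, 0.0)) for c in cells)
    return 1.0 - total_abs_diff / 2.0

# Hypothetical example: half of the test's content falls in the target's cells.
target = {"A": 0.5, "B": 0.5}
test = {"A": 0.75, "C": 0.25}
example = alignment_index(test, target)  # differences 0.25 + 0.5 + 0.25 = 1.0
```

Here the index is 0.5: the test shares 0.5 of its proportional mass with the target (all of it in cell A), and the remaining mass is either over-weighted in A or placed in a cell the target does not cover.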

Testing Data Quality

There are several indications of the SEC procedure’s validity and reliability. Porter (2002) and Porter et al. (2008) examined reliability among raters and found that, on average, the generalizability coefficient when four raters were used was .80 or higher when examining standards or tests.

Tests of an earlier version of the SEC content grid (using a sample of 62 teachers in 12 districts) showed high agreement between teacher reports of their instruction, classroom observations, and daily logs on the days observed (correlations of .70 and .80). Agreement between daily logs aggregated to a full school year and end-of-semester teacher surveys was also strong (most correlations between .60 and .80) (Porter et al., 1993). Further, the tool has been used to predict student achievement gains (Gamoran et al., 1997) and to estimate the effects of a professional development intervention (Porter, Smithson, Blank, & Zeidner, 2007). Using the SEC to quantify the alignment of teacher-reported instruction to a student achievement test using teacher/class as the unit of analysis, alignment uniquely explained more than 25% of the between-class variance in achievement gains (Gamoran et al., 1997). Using alignment as a dependent variable in a randomized field trial of a math and science professional development intervention, the alignment index was correlated .6 to .7 across 2 years, and the intervention group showed a .22 standard deviation increase in alignment resulting from the treatment (Porter et al., 2007).

The most telling pieces of this array of reliability and validity evidence for the SEC are (a) the strength of the relationship between the alignment index and student achievement gains (Gamoran et al., 1997) and (b) the statistically significant effect size of .22 on the alignment index in the randomized field trial of the professional development intervention (Porter et al., 2007). How could the index predict achievement gains or detect an effect in an experiment if (a) teacher self-report of the content of their instruction wasn’t valid, (b) content analyses weren’t valid, and (c) the index didn’t capture alignment between the two?

The Test Construction Algorithm: An Overview

We describe the test construction algorithm in detail in the Appendix. In brief, the procedure comprises seven steps:

1. Content analyze the target domain.

2. Decide the test length in terms of number of score points.

3. Content analyze each item in the population of items.

4. Calculate the alignment of each item to the target.

5. Select the most aligned item (if a tie, randomly select from among tied items¹) and adjust the target to account for content of the item.

6. Recalculate the alignment for each item.

7. Repeat Steps 5 and 6 until enough items have been selected to generate a test of the desired length.
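The selection loop in these steps can be sketched in Python as follows. This is a simplified stand-in, not the procedure given in the Appendix: instead of adjusting the target after each selection, the sketch recomputes the alignment of the test-so-far plus each candidate item against the unchanged target, which is one plausible way to account for content already selected. The item format (`name`, `points`, `cells`), the function names, and the example pool are hypothetical.

```python
import random

def alignment_index(test, target):
    """1 minus half the summed absolute differences in cell proportions."""
    cells = set(test) | set(target)
    return 1.0 - sum(abs(test.get(c, 0.0) - target.get(c, 0.0)) for c in cells) / 2.0

def pooled_proportions(items):
    """Cell proportions for a set of items, weighting each item by its score points."""
    totals = {}
    for item in items:
        for cell, share in item["cells"].items():
            totals[cell] = totals.get(cell, 0.0) + share * item["points"]
    grand_total = sum(totals.values())
    return {cell: value / grand_total for cell, value in totals.items()}

def build_test(pool, target, length, seed=0):
    """Greedily pick items until `length` score points are reached."""
    rng = random.Random(seed)
    remaining = list(pool)
    selected, points = [], 0
    while points < length:
        # Only consider items that still fit within the desired test length.
        candidates = [it for it in remaining if points + it["points"] <= length]
        if not candidates:
            break
        scored = [(alignment_index(pooled_proportions(selected + [it]), target), it)
                  for it in candidates]
        best = max(score for score, _ in scored)
        tied = [it for score, it in scored if best - score < 1e-12]
        choice = rng.choice(tied)  # random tie-break among equally aligned items
        selected.append(choice)
        remaining.remove(choice)
        points += choice["points"]
    final = alignment_index(pooled_proportions(selected), target) if selected else 0.0
    return selected, final

# Hypothetical target and item pool (cell proportions within each item sum to 1).
target = {("t1", "recall"): 0.6, ("t2", "explain"): 0.4}
pool = [
    {"name": "a", "points": 1, "cells": {("t1", "recall"): 1.0}},
    {"name": "b", "points": 1, "cells": {("t2", "explain"): 1.0}},
    {"name": "c", "points": 1, "cells": {("t3", "recall"): 1.0}},
    {"name": "d", "points": 1, "cells": {("t1", "recall"): 0.5, ("t3", "recall"): 0.5}},
]
chosen, score = build_test(pool, target, length=2)
names = [item["name"] for item in chosen]
```

Because each choice is never revisited, a loop of this kind is greedy and is not guaranteed to find the most aligned set of items for a given pool and length.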

For a desired allocation of test formats across item types (e.g., 10 multiple choice, 10 open response), the algorithm could be constrained such that, after it has selected 10 items of one type, that type could no longer be selected. In the trials that motivated us to build the algorithm, we limited student achievement tests to 18 multiple choice items and teacher content knowledge tests to 20 multiple choice items. Table 1 shows the alignment of each of the 12 tests to their targets. For the student achievement tests, alignment ranged from a low of 0.42 to a high of 0.70, with a median of 0.54. For the teacher content knowledge tests, the alignment ranged from a low of 0.32 to a high of 0.54, with a median of 0.41.

Table 1

End of Unit Student and Teacher Achievement Test Alignments With Target Domains

	Holt			FOSS
	Cells	ITM	IRE	EH	WW	DOL
Student Achievement
Alignment	0.53	0.58	0.70	0.42	0.55	0.46
Agree max	0.74	0.85	0.80	0.80	0.79	0.76
Disagree max	0.89	0.95	0.93	0.93	0.92	0.91
Agree adjusted	0.70	0.68	0.88	0.53	0.70	0.61
Disagree adjusted	0.58	0.61	0.75	0.45	0.60	0.51
Teacher Knowledge
Alignment	0.32	0.36	0.54	0.38	0.43	0.44
Agree max	0.62	0.71	0.89	0.62	0.85	0.81
Disagree max	0.91	0.93	0.98	0.91	0.95	0.95
Agree adjusted	0.52	0.51	0.61	0.61	0.51	0.54
Disagree adjusted	0.35	0.39	0.55	0.42	0.45	0.46

Note. Alignment is the unadjusted alignment index of the assessment. Agree max is the maximum possible alignment assuming complete agreement among content analysts. Disagree max is the maximum possible alignment assuming complete disagreement among content analysts. Agree adjusted is the raw alignment index adjusted for the maximum alignment given complete agreement. Disagree adjusted is the raw alignment index adjusted for the maximum alignment given complete disagreement. ITM = Introduction to Matter; IRE = Inside the Restless Earth; EH = Earth History; WW = Weather and Water; DOL = Diversity of Life.

How Large Are These Alignment Indices?

The size of the alignment index is not only a function of how well the items in the population of items fit the target domain (i.e., a unit), but also of (a) the breadth of the target domain as reflected in the number of cells covered and (b) the number of items to be included in the test (Fulmer, 2011; Polikoff & Fulmer, in press). The broader the target and the fewer the items, the lower the maximum possible alignment. We include proof of these claims in the Appendix.

Given these facts, we sought to adjust the obtained alignment indices to calculate maximum possible alignment using simulations. Each simulation assumed four raters and three cells per item per rater, and that an infinite item pool was available that covered all the cells in the target in every possible combination. One simulation assumed complete rater agreement and the other complete rater disagreement. The two assumptions bound the maximum possible alignment given a target domain (e.g., unit) and a test length. The assumption of complete rater agreement is a lower bound, and the assumption of complete rater disagreement is an upper bound. Generalizability studies suggest that the maximum possible alignment typically lies closer to the lower bound (Porter et al., 2008). The simulation methods, which we describe in more detail in the Appendix, result in maximum alignment indices that can be used to adjust our raw alignment indices. For instance, if we obtained a raw alignment value of .5, but the maximum possible alignment was .8, then the adjusted alignment index is .5/.8 or .625.
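The adjustment itself is a simple ratio, as the following Python sketch shows. The function name and the input check are assumptions for illustration; the maximum would come from the simulations described in the Appendix, which are not reproduced here.

```python
def adjusted_alignment(raw, max_possible):
    """Scale a raw alignment index by the maximum attainable alignment."""
    if not 0.0 < max_possible <= 1.0:
        raise ValueError("max_possible must lie in (0, 1]")
    return raw / max_possible

# The worked example from the text: a raw index of .5 against a maximum of .8.
example = adjusted_alignment(0.5, 0.8)  # approximately 0.625
```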

Table 1 shows maximum possible alignments for each of our boundary assumptions. To illustrate, we interpreted these values for the Holt Cells and FOSS Earth History units. Given complete rater agreement, the maximum alignment of an 18-item test was .74 for the Cells curriculum and .80 for the Earth History curriculum. The maximum alignments of the 20-item teacher tests for Cells and Earth History were each .62. The alignment index adjusted for the maximum alignment (i.e., the raw value divided by the maximum value) for the Cells student achievement test was .70; for the Earth History assessment, .53. Adjusted alignment indices for the teacher knowledge tests were .52 for Cells and .61 for Earth History.

Under the assumptions of complete rater disagreement, the maximum alignment of an 18-item test was .89 for the Cells curriculum and .93 for the Earth History curriculum. The maximum alignment of the 20-item teacher tests to professional development for both the Cells and Earth History was .91. Again, given that the maximum for each of the targets is less than 1.0, the actual alignments are larger than the raw values imply. The alignment index adjusted for the maximum was .58 for the Cells student achievement test and .45 for the Earth History assessment. For the teacher tests, the adjusted alignments were .35 for Cells and .42 for Earth History.

Another way to judge the size of a test’s alignment to its target domain is normatively. Polikoff et al. (2011) investigated the degree of alignment between student achievement tests and state content standards in the NCLB era, when states were required to have student achievement tests aligned to their content standards and provide evidence of their alignment. That study found that for the nine states for which science alignment data were available, the average alignment was 0.26, with a range from 0.16 to 0.37. Clearly, the methods used to align those state science tests with their corresponding content standards yielded substantially lower alignment than our automated test construction algorithm. Of course, the comparison is not exact. The six target units represent narrower targets than state grade-specific science standards, which raises the maximum possible alignment for our tests. At the same time, the end-of-unit tests were limited to 18 multiple-choice items and so were much shorter than state science tests, which lowers it.

Other Uses of the Test Construction Algorithm

In addition to using our algorithm to create highly aligned assessments, we used it to build subtests and to measure teacher knowledge.

Building subtests

In the same pair of randomized field trials we described earlier, a research question asked about the combined effects of the three units on a state-administered science test. We hypothesized that the cognitive science and teacher content knowledge interventions might be sufficiently robust to show effects on the state science test; however, because the state science test was written against state science standards and the interventions included only three science units, such effects would certainly be small.

We obtained item-level student performance data on the state science test for each student participating in one or more of the Holt target units. While one analysis used student total score on the state science test as a dependent variable in our impact analysis, we also used the algorithm to construct subtests from the state test, one aligned to each of the three target units. For the subtests, however, we did not know the length in advance. As before, we content analyzed the state test items using the SEC framework. We ran the algorithm for tests of varying lengths, starting with one item and increasing to the full number of items on the state test. We examined how the subtest alignment to the target unit changed as a function of test length, and we chose the shortest length where adding more items did not further increase alignment of the subtest to the target unit.
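The length rule described above can be sketched in Python as follows. The sketch assumes the algorithm has already been run at every length and that "did not further increase alignment" means the shortest length attaining the highest alignment observed; both the function name and the example values are hypothetical.

```python
def choose_subtest_length(alignment_by_length):
    """Return the shortest subtest length with the highest alignment.

    `alignment_by_length[k - 1]` is the alignment of the k-item subtest.
    """
    if not alignment_by_length:
        raise ValueError("need at least one alignment value")
    best = max(alignment_by_length)
    return alignment_by_length.index(best) + 1  # first occurrence, 1-based length

# Hypothetical alignment values for subtests of one to six items.
curve = [0.10, 0.15, 0.21, 0.25, 0.25, 0.22]
length = choose_subtest_length(curve)  # the 4-item subtest is the shortest at 0.25
```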

The state science test had 64 items. Using the algorithm to build a subtest aligned to the Cells unit, alignment increased as subtest length increased from one item up to seven items and decreased thereafter. The alignment of the seven-item subtest for Cells was 0.25. For the Introduction to Matter unit (ITM), alignment increased to a subtest of eight items with alignment of 0.26. For the Inside the Restless Earth unit (IRE), alignment increased to a subtest of 10 items with alignment of 0.24. We allowed subtests to share common items; no items were common across all three subtests, but Cells and IRE had two items in common, and IRE and ITM had five items in common.

Measuring teacher knowledge

A third use of the alignment algorithm is illustrated by how we constructed the teacher content knowledge tests for the units. This set of tasks involved two variants in the algorithm. The first variant was that there was no set of curriculum materials to content-analyze to define the target domain; rather, science content experts created and delivered professional development to improve teachers’ content knowledge for each unit. Instead of content-analyzing materials, we asked each of the four content professional development experts to complete the SEC survey, indicating what topics and levels of cognitive demand they included in their professional development. We used the results of these four surveys to define the target domain.

The second variant was motivated by the fact that the population of items was not strong enough to produce a reasonably aligned teacher content knowledge test. The content of the item pool can be represented graphically using something analogous to a topographical map, where topics represent the North/South dimension and cognitive demand represents the East/West dimension (Porter et al., 2007). The SEC data are nominal scale, but the graphs are nevertheless accurate at the intersection of a particular topic and a particular cognitive demand. Figure 1 shows two topographical maps, one for content in the Cells teacher content knowledge target domain and one for the content of the constructed teacher content knowledge assessment. Comparing the two graphs indicates where there is content in the domain that is not represented by the items, leading to less than perfect alignment. For example, several topics in the target domain contain the “communicate” cognitive demand, but no items on the constructed test contain that demand. Figure 1 also illustrates that because items may cover multiple cells, the algorithm can select items that are aligned to the target while also assessing content not in the target. This occurred with communicate organ systems, which is covered on the test but not in the target domain.

Figure 1.

Topographical maps comparing the content of the target domain and the constructed teacher content knowledge test for the Cells unit

To address the weak item pool available for the teacher test, we asked the four experts who designed and delivered the professional development to write items to fill identified content gaps. The new items were content-analyzed using SEC procedures to see whether the writers produced items in the correct cells. This proved to be a difficult task. Developers wrote 14 items for six targeted cells. Twelve of the 14 items were not placed in the target cell by any of the content analysts. The problem was matching cognitive demand levels, not topics. Our experience in guiding item writing provides yet another example of how important it is to build formal replicable consideration of alignment into the test construction process. Alignment does not happen easily.

Automated Test Construction

Our algorithm has been programmed for automated test construction. Schmeiser and Welch (2006) attribute the first automated test construction to Theunissen (1985). According to Schmeiser and Welch (2006), automated test construction is “not intended to replace the role of test developers in the test-assembly process” (p. 335). They note that automated test assembly processes are limited by “(a) quality and size of the item bank, (b) reasonableness of the constraints and targets, and (c) degree to which the content and technical codes can be defined.” Although the first two limitations apply to our algorithm, the SEC procedures address the challenge of defining content and technical codes.

Most automated test construction work focuses on building tests of maximum reliability and/or multiple parallel forms (Armstrong, Jones, & Wang, 1994; Luecht, 1998). Our algorithm could be embedded in these techniques in a way that would combine alignment with other goals, such as reliability.

Discussion: Limitations and Possibilities

We have described an algorithm for building student achievement tests that are much more closely aligned, and therefore more content valid, than those resulting from current test construction procedures. The algorithm is used in conjunction with test construction procedures that ensure reliability and lack of bias. The result is much better information on student achievement than is currently available. We illustrated the algorithm’s use for constructing student achievement tests and teacher content knowledge tests, for guiding item writing, and for creating aligned subtests from an existing test. This last use has considerable application in studies of education interventions, where one source of information on student achievement comes from state or district testing programs. For all of these applications, the degree of test alignment against a target domain is a function of breadth of content coverage in the target domain, quality of content coverage in the item pool, and test length.

Limitations of the Algorithm

One limitation is that the algorithm is based on the SEC conception of content defined at the intersection of topics and cognitive demand. The SEC content language has evolved over time and has been used successfully for a variety of purposes; nevertheless, the SEC topics may be insufficiently fine-grained, or perhaps too fine-grained, and the distinctions among cognitive demands may be insufficient to capture the distinctions needed. In particular, the SEC content conception does not capture learning progressions as described by Corcoran, Mosher, and Rogat (2009), or content structure as described by Webb (2006). Whether these are important content distinctions for purposes of content alignment remains to be seen.

A second limitation is that the utility of the algorithm depends on the quality of the target domain description and the quality of the content coverage of the items in the available pool; however, this limitation applies to any effort to build aligned assessments.

The alignment algorithm is also limited by its exclusive focus on content alignment. An achievement test needs to be more than content valid. The test must also be reliable, unbiased, and without floor and ceiling effects for the target testing population. This implies items that are clear and a test that is within the bounds of reasonable burden (AERA, APA, NCME, 1999). To the extent all items in the pool are of high quality based on field tests, the resulting aligned test built using the algorithm should not only have strong validity due to alignment but strong reliability as well.

Finally, the algorithm does not necessarily result in a maximally aligned test given the target domain, test length, and item pool. After an item is selected, the inclusion of the item in the test is not revisited as subsequent items are selected. The set of items that work maximally well together for the alignment criterion could be different and yield an even more aligned test (Guyon & Elisseeff, 2003; Temlyakov, 2011). Still, the tests resulting from application of the algorithm are more aligned to the target domain than almost all student achievement tests reported in the literature.
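The point that forward selection need not be optimal can be illustrated with a small sketch. It is not the authors' implementation: the three-item pool, the target proportions, and the function names are hypothetical, and alignment is computed as the overlap between test and target cell proportions (the sum of cell-wise minimums), with each item contributing 1/n of the test.

```python
from itertools import combinations

def alignment(items, target, n):
    """Overlap between the target and a test of length n built from `items`."""
    test = {}
    for item in items:
        for cell, p in item.items():
            # Each item contributes 1/n of total test content.
            test[cell] = test.get(cell, 0.0) + p / n
    return sum(min(test.get(c, 0.0), t) for c, t in target.items())

def greedy_select(pool, target, n):
    """Add one item at a time; earlier choices are never revisited."""
    chosen, remaining = [], list(pool)
    for _ in range(n):
        best = max(
            remaining,
            key=lambda name: alignment([pool[k] for k in chosen + [name]], target, n),
        )
        chosen.append(best)
        remaining.remove(best)
    return chosen

def exhaustive_select(pool, target, n):
    """Check every subset of size n (feasible only for tiny pools)."""
    return max(
        combinations(pool, n),
        key=lambda names: alignment([pool[k] for k in names], target, n),
    )

# Hypothetical target and item pool (cell proportions within each item sum to 1).
target = {"A": 0.45, "B": 0.45, "C": 0.10}
pool = {
    "P": {"A": 1.0},
    "Q": {"B": 1.0},
    "R": {"A": 0.5, "B": 0.4, "C": 0.1},
}

greedy = greedy_select(pool, target, n=2)
best = exhaustive_select(pool, target, n=2)
greedy_alignment = alignment([pool[k] for k in greedy], target, 2)
best_alignment = alignment([pool[k] for k in best], target, 2)
```

In this constructed case the greedy pass picks item R first because it has the best single-item overlap, and that choice then blocks the better pair P and Q. Exhaustive search quickly becomes impractical for realistic item pools, which is one reason a forward-selection approach is attractive despite this limitation.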

Possibilities of the Algorithm

Our algorithm could make several important contributions to test construction and to educational research. For one, with appropriate training, these procedures might guide item writing for achievement tests to make alignment a forethought of test construction. Currently, professional item writers construct achievement tests to meet test specifications. Only after the test is developed and, often, already in use does anyone use an alignment procedure to see whether the test is aligned to the target content standards. At that point, it’s usually too late; the test already exists. A better method would be to use the criteria of alignment to guide item writing.

Another contribution may be in constructing parallel forms of a test. Each test is a sample of items from a population of items, and parallel forms are to be representative samples from the same population of items. Where two parallel forms are desired, enough items for two forms might be selected from the population of items by the algorithm, and items might be randomly assigned to forms. Where forms are written over time, they need to be equated. Ideally, an infinite number of parallel forms would in aggregate be perfectly aligned to the target. The alignment algorithm could guide this process.

A third contribution is that the algorithm is easily programmed, making test construction automatic. The procedure is versatile, applicable in a variety of contexts for various target domains and various item populations and formats.

With the alignment algorithm for automated test construction, future work can focus on improving the quality of the definition of target domains. For example, CCSS writers can focus on making their content standards clearer, and the testing consortia can focus on making their tests more aligned to the CCSS. Similarly, with the algorithm, work can proceed on writing better pools of items from which to construct tests. With better-defined target domains and higher quality item pools, we can do a much better job constructing tests that are closely aligned to content. This should yield more content-valid and instructionally sensitive tests (Polikoff, 2010). Finally, future work may allow for a more complete automated test construction algorithm—one that formally incorporates the goals of reliability and fairness with content alignment.

Appendix

Surveys of Enacted Curriculum Science Survey: The Surveys of Enacted Curriculum (SEC) are procedures for describing content in English language arts & reading, mathematics, and science; they are available through the Wisconsin Center for Education Research. Each content language has two dimensions: topics and categories of cognitive demand.

For science, when the 211 fine-grained topics are crossed with the five categories of cognitive demand, the language makes 1,055 content distinctions for describing K–12 science content.

Details of the test construction algorithm: We use an example to explain how we calculate the cell proportions for each item based on test length. Imagine that you need to construct a test with 20 items and that you are using four content raters. Each item is a multiple-choice item worth 1 point, and each item therefore comprises 5% of the total test content. Rater 1 content-analyzes the first item to be in Cell A; Rater 2 puts the item in both Cell B and Cell C; Rater 3 puts the item in both Cell A and Cell B; and Rater 4 puts the item in Cell C (of course, this is just one example). Cell proportions of the item are then calculated as follows: for Cell A, there are 1.5 frequencies (1 frequency from Rater 1 and .5 frequency from Rater 3); for Cell B, there is 1 frequency (.5 frequency each from Raters 2 and 3); and for Cell C, there are 1.5 frequencies (.5 frequency from Rater 2 and 1 frequency from Rater 4). Next, these cell frequencies are divided by the number of raters (i.e., four): 37.5% of the item’s content is in each of Cells A and C, and 25% is in Cell B. Finally, these proportions are multiplied by 5%, representing the total contribution of this item to the final test. Thus, for Cells A and C, the value is 0.375 * 0.05 = 0.0188 (rounded), and for Cell B the value is 0.25 * 0.05 = 0.0125. More generally, for each rater, the score points assigned to an item are evenly divided across the cells that the rater identifies the item as reflecting. These cell frequencies are then summed across raters and divided by the product of the number of raters and the number of items on the test.
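The calculation above can be sketched in a few lines of Python. The function name and data layout are illustrative assumptions rather than part of the original procedure, and the sketch assumes every item is worth 1 point, as in the example.

```python
from collections import defaultdict

def item_cell_weights(ratings, n_items):
    """Cell weights for one 1-point item.

    ratings: one list of cells per rater (hypothetical data layout).
    n_items: number of items on the test.
    """
    freq = defaultdict(float)
    for cells in ratings:
        # Each rater's single point is split evenly across the chosen cells.
        share = 1.0 / len(cells)
        for cell in cells:
            freq[cell] += share
    n_raters = len(ratings)
    # Sum across raters, then divide by (number of raters * number of items).
    return {cell: f / (n_raters * n_items) for cell, f in freq.items()}

# The worked example: four raters, a 20-item test.
ratings = [["A"], ["B", "C"], ["A", "B"], ["C"]]
weights = item_cell_weights(ratings, n_items=20)
```

For the example item, this gives 0.01875 for Cells A and C and 0.0125 for Cell B, and the three weights sum to 0.05, the item's share of the 20-item test.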

Maximum possible alignment: The broader the target and the fewer the items, the lower the maximum possible alignment. For a highly constrained situation, a simple proof is possible. Let N equal the number of “objectives” in the target domain and n the number of items on the test. Assume that (a) N is always greater than n (i.e., there are more objectives in the target than items on the test) and (b) each objective in the target and each item on the test is content-analyzed into only one unique cell (i.e., each curriculum objective and item falls into a different cell and only that one cell). Given these assumptions, the upper limit on the alignment index is n/N. For example, if there are 100 objectives in the target and 10 items on the test, the maximum possible alignment is 0.1. This can be seen because each test item, comprising 10% of total test content, can overlap with at most one objective (and each objective comprises 1% of total target content). Thus, maximum alignment equals 10 items times .01, or .10. The maximum possible alignment is almost always less than 1.0, in part because the target is a domain of all content a student is to learn, and the items on a test represent only a sample of all possible items.
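The n/N bound can be checked numerically. The sketch below assumes the alignment index takes the usual SEC form, 1 minus half the sum of absolute differences between test and target cell proportions; that formula is not restated in this appendix, and the cell labels are hypothetical.

```python
def alignment_index(test, target):
    """Assumed index: 1 - (sum over cells of |test - target|) / 2."""
    cells = set(test) | set(target)
    diff = sum(abs(test.get(c, 0.0) - target.get(c, 0.0)) for c in cells)
    return 1.0 - diff / 2.0

def overlap(test, target):
    """Equivalent form when both sets of proportions sum to 1."""
    return sum(min(test.get(c, 0.0), t) for c, t in target.items())

N, n = 100, 10  # objectives in the target, items on the test

# Assumption (b): every objective and every item sits in its own single cell.
target = {f"cell{i}": 1.0 / N for i in range(N)}
# Best case: each of the n items lands in a different target cell.
test = {f"cell{i}": 1.0 / n for i in range(n)}

ai = alignment_index(test, target)
ov = overlap(test, target)
```

Both `ai` and `ov` come out at n/N = 0.1 (up to floating-point rounding): each item can recover at most the .01 of target content held by the single cell it occupies.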

In simulations that we used to calculate maximum alignment, we relaxed the highly constraining assumptions in the preceding proof. Each of our two simulations assumed that an infinite item pool was available that covered all the cells in the target in every possible combination, given four content analysts who each placed each item in three cells. One simulation assumed complete rater agreement and the other complete rater disagreement. The two assumptions bound the maximum possible alignment given a target domain and a test length; the assumption of complete rater agreement is a lower bound, while complete rater disagreement is an upper bound.

Item weights per cell were calculated to be 1 over the number of items on the test times the maximum number of cells an item could be placed into. For our six student achievement tests, there were 18 items, four raters, and three cells each rater could assign to an item. Assuming complete rater disagreement, the item weight per cell is 1/216 [18 items * 4 raters * 3 cells per rater]. For complete rater agreement, the minimum item weight per cell is 1/54 [18 items * 3 cells per rater]. Assuming complete agreement removes the dependency of the alignment on the number of raters.

Our simulations took the content matrix for a target domain and sorted the cells by the size of their proportions. The cell with the largest proportion was “selected” for the test, and the content matrix describing the target was re-centered by subtracting the minimum item weight (1/216 for no agreement, 1/54 for complete agreement) from the chosen target cell. If the largest cell proportion was smaller than the minimum item weight (e.g., a cell proportion of 0.01 in the case of complete agreement), the cell proportion was reduced to 0 and the procedure continued. The target domain cell proportions were re-sorted and the cell with the new largest proportion was selected. This process was repeated until 18 simulated items (20 for teacher tests) had been selected. The alignment index for the constructed tests is algebraically equivalent to the sum of the weights subtracted from the target (i.e., total coverage of the target provided by the items).
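A minimal sketch of this simulation loop follows. It is one reading of the description rather than the authors' code. In particular, it assumes that selecting n items corresponds to n × (cells per item) subtractions of the minimum item weight, where cells per item is 3 under complete agreement and 12 (4 raters × 3 cells) under complete disagreement. The uniform 100-cell target is a hypothetical input.

```python
def simulate_max_alignment(target, n_items, cells_per_rater, n_raters=1):
    """Greedy upper-limit simulation.

    n_raters=1 models complete rater agreement (the rater count drops out);
    n_raters=R models complete disagreement among R raters.
    """
    cells_per_item = cells_per_rater * n_raters
    weight = 1.0 / (n_items * cells_per_item)  # minimum item weight per cell
    remaining = dict(target)
    covered = 0.0
    # Assumption: one subtraction per cell slot, n_items * cells_per_item in all.
    for _ in range(n_items * cells_per_item):
        # Equivalent to re-sorting and selecting the largest remaining cell.
        cell = max(remaining, key=remaining.get)
        # If the cell is smaller than the weight, it is reduced to 0.
        take = min(weight, remaining[cell])
        remaining[cell] -= take
        covered += take
    # Total coverage of the target, i.e., the simulated alignment.
    return covered

# Hypothetical target: 100 cells, each holding 1% of the content.
uniform_target = {f"cell{i}": 0.01 for i in range(100)}

agree = simulate_max_alignment(uniform_target, n_items=18, cells_per_rater=3)
disagree = simulate_max_alignment(
    uniform_target, n_items=18, cells_per_rater=3, n_raters=4
)
```

With this uniform target, the agreement run covers 54 cells at .01 each, while the disagreement run spreads smaller weights over more cells and covers more of the target, consistent with agreement giving the lower bound and disagreement the upper bound.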

Authors

ANDREW PORTER is George & Diane Weiss professor and dean of the Graduate School of Education at the University of Pennsylvania, 3700 Walnut Street, Philadelphia, PA 19104; andyp@gse.upenn.edu. His research focuses on psychometrics, applied statistics, and the effects of education policy.

MORGAN S. POLIKOFF, PhD, is an assistant professor at the University of Southern California Rossier School of Education, 3470 Trousdale Parkway, Los Angeles, CA 90089; polikoff@usc.edu. His research focuses on standards, assessment, and accountability policy.

KATHERINE M. BARGHAUS, PhD, is a research associate at the University of Pennsylvania, Graduate School of Education, 3700 Walnut Street, Philadelphia, PA 19104; barghaus@upenn.edu. Her research interests include psychometrics, applied quantitative methods, and early childhood education.

RUI YANG is a PhD candidate at the University of Pennsylvania, 3700 Walnut Street, Philadelphia, PA 19104; 215-898-7019. His research focuses on measurement and impact evaluation.

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Ananda (2003). Rethinking issues of alignment under No Child Left Behind. San Francisco: WestEd.

Armstrong, R. D., Jones, D. H., & Wang (1994). Automated parallel test construction using classical test theory. Journal of Educational Statistics, 19(1), 73–90.

Bhola, D. S., Impara, J. C., & Buckendahl, C. W. (2003). Aligning tests with states’ content standards: Methods and issues. Educational Measurement: Issues and Practice, 22(3), 21–29.

Bransford, J. D., & Schwartz, D. L. (1999). Rethinking transfer: A simple proposal with multiple implications. In Iran-Nejad & Pearson, P. D. (Eds.), Review of research in education (pp. 61–101). Washington, DC: American Educational Research Association.

Chi, M. T. H. (2005). Commonsense conceptions of emergent process: Why some misconceptions are robust. Journal of the Learning Sciences, 14, 161–199.

Common Core State Standards Initiative. (2010a). Common Core State Standards for English language arts & literacy in historical/social studies, science, and technical subjects. Retrieved from http://www.corestandards.org/assets/CCSSI_ELA%20Standards.pdf

Common Core State Standards Initiative. (2010b). Common Core State Standards for mathematics. Retrieved from http://www.corestandards.org/assets/CCSSI_Math%20Standards.pdf

Corcoran, Mosher, F. A., & Rogat (2009, May). Learning progressions in science: An evidence-based approach to reform. Philadelphia: Consortium for Policy Research in Education (CPRE).

Fulmer, G. W. (2011). Estimating critical values for strength of alignment among curriculum, assessments, and instruction. Journal of Educational and Behavioral Statistics, 36(3), 381–402.

Gamoran, Porter, A. C., Smithson, & White, P. A. (1997). Upgrading high school mathematics instruction: Improving learning opportunities for low-achieving, low-income youth. Educational Evaluation and Policy Analysis, 19(4), 325–338.

Guyon & Elisseeff (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157–1182.

Hegarty, Kriz, & Cate (2003). The roles of mental animations and external animations in understanding mechanical systems. Cognition & Instruction, 21, 325–360.

Hwa, K. S. (2009). My Pals Are Here! Science Tests Primary 5 & 6. Singapore: Marshall Cavendish Education.

Linbarkin, J. C., & Anderson, S. W. (2005). Assessment of learning in entry-level geoscience courses: Results from the geosciences concept inventory. Journal of Geoscience Education, 53, 394–401.

Linn, R. L. (2006). The standards for educational and psychological testing: Guidance in test development. In Downing, S. M., & Haladyna, T. M. (Eds.), Handbook of test development (pp. 27–38). Mahwah, NJ: Lawrence Erlbaum Associates.

Liu & Fulmer (2008). Alignment between the science curriculum and assessment in selected NY state regents exams. Journal of Science Education and Technology, 17(4), 373–383.

Luecht, R. M. (1998). Computer-assisted test assembly using optimization heuristics. Applied Psychological Measurement, 22(3), 224–236.

Martone & Sireci, S. G. (2009). Evaluating alignment between curriculum, assessment, and instruction. Review of Educational Research, 79(4), 1332–1361.

Olson (2003, Spring). Standards and tests: Keeping them aligned. Research Points: Essential Information for Education Policy, 1(1), 1–4.

Polikoff, M. S. (2010, Winter). Instructional sensitivity as a psychometric property of assessments. Educational Measurement: Issues and Practice, 29(4), 3–14.

Polikoff, M. S. (2012). The association of state policy attributes with teachers’ instructional alignment. Educational Evaluation and Policy Analysis, 34(3), 278–294.

Polikoff, M. S., & Fulmer, G. W. (in press). Refining methods for estimating critical values for an alignment index. Journal for Research on Educational Effectiveness.

Polikoff, M. S., Porter, A. C., & Smithson (2011). How well aligned are state assessments of student achievement with state content standards? American Educational Research Journal, 48(4), 965–995.

Porter, A. C. (2002). Measuring the content of instruction: Uses in research and practice. Educational Researcher, 31(7), 3–14.

Porter, A. C. (2006). Curriculum assessment. In Green, J. L., Camilli, & Elmore, P. B. (Eds.), Handbook of complementary methods in education research (3rd ed., pp. 141–160). Washington, DC: American Educational Research Association.

Porter, Floden, Freeman, Schmidt, & Schwille (1988). Content determinants in elementary school mathematics. In Grouws, D. A., & Cooney, T. J. (Eds.), Perspectives on research on effective mathematical teaching (pp. 96–113). Hillsdale, NJ: Lawrence Erlbaum Associates. (Also Research Series 179, East Lansing, MI: Michigan State University, Institute for Research on Teaching.)

Porter, A. C., Kirst, M. W., Osthoff, E. J., Smithson, J. L., & Schneider, S. A. (1993). Reform up close: An analysis of high school mathematics and science classrooms. Final report to the National Science Foundation on Grant No. SAP-8953446 to the Consortium for Policy Research in Education. Madison: University of Wisconsin-Madison, Consortium for Policy Research in Education.

Porter, A. C., McMaken, Hwang, & Yang (2011a). Common Core standards: The new U.S. intended curriculum. Educational Researcher, 40(3), 103–116.

Porter, A. C., McMaken, Hwang, & Yang (2011b). Assessing the Common Core standards: Opportunities for improving measures of instruction. Educational Researcher, 40(4), 186–188.

Porter, A. C., Polikoff, M. S., & Smithson (2009). Is there a de facto national intended curriculum? Evidence from state content standards. Educational Evaluation and Policy Analysis, 31(3), 238–268.

Porter, A. C., Polikoff, M. S., Zeidner, & Smithson (2008). The quality of content analyses of state student achievement tests and content standards. Educational Measurement: Issues and Practice, 27(4), 2–14.

Porter, A. C., Smithson, Blank, & Zeidner (2007). Alignment as a teacher variable. Applied Measurement in Education, 20(1), 27–51.

Rothman (2003). Imperfect matches: The alignment of standards and tests. Unpublished paper commissioned by the National Research Council, Committee on Test Design for K-12 Science Achievement, Washington, DC.

Saderholm, Ronau, Brown, T. E., & Collins (2010). Validation of the diagnostic teacher assessment of mathematics and science (DTAMS) instrument. School Science and Mathematics, 110(4), 180–192.

Sadler, Coyle, Cook-Smith, & Miller (2007). Misconceptions-oriented standards-based assessment resources for teachers (MOSART). Cambridge, MA: Harvard College.

Schmeiser, C. B., & Welch, C. J. (2006). Test development. Education Measurement, 4, 307–353.

Temlyakov (2011). Greedy approximation. Cambridge, UK: Cambridge University Press.

Theunissen, T. J. J. M. (1985). Binary programming and test design. Psychometrika, 50, 411–420.

Webb, N. L. (2002). Alignment study in language arts, mathematics, science, and social studies of state standards and assessments for four states. Washington, DC: Council of Chief State School Officers.

Webb, N. L. (2006). Identifying content for student achievement tests. In Downing, S. M., & Haladyna, T. M. (Eds.), Handbook of test development (pp. 155–180). Mahwah, NJ: Lawrence Erlbaum.