Abstract
Standardized tests are regularly used as education system monitoring tools to compare the average performance of students living in different states or belonging to different subgroups (e.g., defined by race/ethnicity, sex, or parental income) and to track their progress over time. This article describes some uses and design features of tests in system monitoring contexts. We provide the example of the National Assessment of Educational Progress (NAEP), the only large-scale system monitoring test in the United States. The availability of NAEP data, in turn, has facilitated the construction of the Stanford Education Data Archive (SEDA), a publicly available database that can be used to describe patterns of achievement for nearly all school districts in the United States. Here, we discuss progress in and challenges to the use of standardized tests as system monitoring tools.
The U.S. public education system has the ambitious goal of providing a high-quality education to approximately 50 million students enrolled in K–12 schools every year (McFarland et al. 2017). Maintaining this system is a large public expense: spending on K–12 education has remained at roughly 4.0 percent of gross domestic product (GDP) (or, as of this writing, $707 billion) since 1965 (Snyder et al. 2018). Monitoring the education system requires tracking a wide array of information about the inputs and outcomes of the system. Standardized tests are regularly used as education system monitoring tools to compare the average performance of students living in different states or belonging to different subgroups (e.g., defined by race/ethnicity, sex, or parental income) and to track their progress over time. Although there are many valued goals and outcomes of the education system that cannot be measured with standardized tests, test scores can (and do) serve as useful indicators of whether the education system is achieving the goal of having all students learn relevant academic content and skills.
In this article, we discuss progress in and challenges to the use of standardized tests as system monitoring tools. We use the example of the National Assessment of Educational Progress (NAEP), the only large-scale system monitoring test in the United States, and describe how the availability of NAEP data has facilitated the construction of the Stanford Education Data Archive (SEDA), a unique, publicly available database that can be used to describe patterns of achievement among nearly all public school districts in the United States.
We first explain how test results can be used for system monitoring. These uses are different from classroom-based assessments (providing educators with information about their own students’ performance; see Shepard, this volume) and testing for accountability (which ties test score performance to explicit consequences—see Hanushek, this volume; Loeb and Byun, this volume). After briefly discussing how the NAEP has evolved, we highlight design features that make the NAEP well-suited for system monitoring. Finally, we describe a new development in system monitoring, in which data from multiple testing programs are combined to yield more detailed information about the education system, and provide the example of SEDA.
Standardized Tests for System Monitoring
The education system has many important goals, which include supporting all students in their learning of varied subject matter and academic skills (e.g., mathematics, reading, and science) and developing in them an array of important social and behavioral capacities (e.g., civic engagement, effort, and ambition). When used with other indicators of educational inputs and outcomes, carefully designed standardized tests can be effective tools to measure the knowledge and skills that students have gained. Other commonly used indicators of educational opportunity include per-pupil expenditures, teacher qualifications, and high school graduation rates; test scores complement these indicators by providing more direct information about what students know and can do. On the other hand, unlike indicators that are based on directly observable quantities, test scores often represent unobservable constructs; therefore, scores are indirect indicators, or approximations, of the underlying quantities of interest. Designing a test for system monitoring, then, requires decisions about how to select and define the constructs, and how to design a test or tests to measure them. Ideally, these decisions will align with the overall goals of the education system by emphasizing the content and skills deemed most important.
As with any measurement of educational performance, test scores become valuable indicators not when they are generated but based on how they are used and interpreted to provide insight into the education system. Professional assessment standards require that test uses and interpretations be explicitly stated and evaluated using theoretical and empirical evidence in a process known as validation (American Educational Research Association [AERA], American Psychological Association [APA], and National Council on Measurement in Education [NCME] 2014; Kane 2013). Validation begins by specifying an interpretation and use argument—or IUA (Kane 2013)—that delineates the intended test uses and interpretations, as well as key assumptions or claims underlying them. A complete IUA will also incorporate a “theory of action” that articulates both how these uses and interpretations are expected to lead to desired outcomes and what unintended consequences of test use may need to be considered (Cronbach 1988; Messick 1995; Haertel 2013; Kane 2013; Shepard 1993; Bennett 2010; Linn 1989). After specifying an IUA, the next step of validation is to draw on theory and evidence to evaluate whether the interpretations and uses specified in the IUA are warranted. For example, experts may gather evidence to show that the content matches specified standards and difficulty level, that all students are able to demonstrate their knowledge equally (i.e., a lack of bias), or that decisions based on the test scores lead to more desirable outcomes than could otherwise be achieved. Validity thus refers to the degree to which this theory and evidence support proposed interpretations and uses of test results articulated in the IUA (AERA, APA, and NCME 2014), and is a property of score interpretations and uses, rather than an inherent property of the test or testing procedure.
There are many possible uses of test score data for system monitoring. We consider three, drawing on the framework developed by Haertel (2013). First, when reported publicly, test score data can be used to influence public opinion or shape the dialogue around the public education system. Second, test scores can be used to better understand how the education system is functioning or to direct resources, by identifying regions, organizations, or groups that are performing above or below expectations. Third, scores can be used as an outcome measure, that is, the dependent variable, in program evaluations. They can inform comparisons among different educational strategies or policies and help to identify practices that increase educational achievement and equity (for the general principles of education modeled as a production function, see Hanushek 2007; or Murnane et al. 2000).
These uses require making distinct (but often overlapping) inferences and interpretations. At a minimum, system monitoring test data should support accurate descriptive interpretations about aggregate academic achievement with enough precision to describe subgroup and geographic variation. Some uses may also require judgments about whether performance meets prespecified academic standards (for discussion of achievement levels in NAEP, see, e.g., National Academies of Sciences, Engineering, and Medicine 2017). Shaping public perception, for example, may involve comparing scores to a benchmark that indicates acceptable and unacceptable performance. Whether to allocate resources to regions or subgroups may involve decisions based on perceived need or performance relative to expectations. The third use—policy or program evaluation—often relies on being able to make credible causal inferences about observed variation in test scores. Many intended uses will require additional knowledge or data about the context in which the test scores were generated (e.g., Haertel 1989).
The intended uses of system monitoring tests need to be specified and documented in an IUA. A thorough IUA will also attend to potential unintended uses and unintended effects (especially negative effects) of using the scores. When system monitoring test score data are reported publicly, for example, they may be perceived as high-stakes or consequential, potentially leading to distortions of the scores or of the education system itself, such that the scores less accurately measure the quantities of interest (Campbell 1979). One example of a distortion is “score inflation” (when increases in test scores are not indicative of meaningful increases in learning) due to inappropriate test preparation or outright cheating (Jacob 2007; Koretz 1988, 2017). Another is the unintended narrowing of the curriculum that places undue emphasis on tested content or skills at the expense of other valued outcomes. Finally, test scores may also be used or interpreted in unanticipated ways. Although the IUA may not be able to specify these unanticipated uses and interpretations in advance, care should be taken to track them over time and to address them as they arise.
A single test is unlikely to be valid for all intended uses. However, as we discuss in the next section, the NAEP is a good example of an informative tool for a range of system monitoring uses, and includes design elements that reduce the likelihood of distortions.
The National Assessment of Educational Progress
The NAEP is intended to serve as an “independent monitor of student academic achievement in the United States at the elementary and secondary levels” for the benefit of many audiences: the U.S. public, elected representatives, policy-makers, educators, and business leaders (Fabrizio, Friedman, and Garrison 2013, 3–4). The NAEP is the longest-running continuous system monitoring test in the United States and provides achievement data at different ages for different demographic subgroups (i.e., by race/ethnicity, sex, socioeconomic status, and for students with disabilities and students who are English language learners) living in large urban districts, states, and the nation. A brief history of the NAEP illustrates the significant challenges a national system monitoring test is likely to face, while highlighting the progress that has been made in support of its use for system monitoring.
The NAEP began, in part, as a response to the report on Equality of Educational Opportunity (the “Coleman Report”; Coleman et al. 1966) and trends on the Scholastic Aptitude Test (SAT), both of which highlighted troubling features of the U.S. educational system: large disparities in test score performance on one hand, and declining SAT scores on the other (see also Vinovskis, this volume). However, each of these data sources had important limitations that made such interpretations problematic. For example, unwillingness to participate in testing led to a low response rate for the Coleman Report (only 65 percent of the intended participants were tested) such that subgroup comparisons were imprecise. For the SAT, the changing and nonrepresentative sample of test takers meant that changes in average scores could be driven by changes in the test-taking population and not by meaningful changes in performance. These troubling patterns, coupled with unreliable data, created the demand for better system monitoring tests (Beaton et al. 2011).
To meet this demand, the first NAEP was administered in 1969, providing nationally representative assessment data for content areas of citizenship, science, and writing for 17-year-old students still in school. In 1970 and 1972, the NAEP began testing nationally representative samples of students aged 9, 13, and 17 in math and reading, respectively. For each assessment, results were reported separately by region and demographic subgroups to monitor whether all students were being provided equal educational opportunities (Beaton et al. 2011).
The National Assessment Governing Board (NAGB) was formed in 1989 and was charged with setting many of the NAEP’s policies. These policies include defining the content and format of the assessments (the “assessment frameworks”), setting achievement levels (what level of achievement qualifies as “proficient”), and reporting of the results (see also Braun and Singer, this volume). By law, board members must be bipartisan, and the board must include multiple stakeholders, such as educational measurement experts, educators, and community members. These membership rules help to ensure that the NAEP remains independent and reflective of diverse goals. To oversee the NAEP, the NAGB works alongside the National Center for Education Statistics (NCES) and the NAEP contractor selected to administer the NAEP (as of this writing, and since the mid-1980s, the Educational Testing Service). This organizational structure has had important political and practical consequences. The inclusion of the NAGB allows important policies to be set by an independent group. Further, by placing direct oversight of the NAEP within the NCES (a department within the federal government), the NAEP has been provided with a stable and consistent source of funding. Finally, by contracting a testing company to carry out the actual assessment, the technical design and administration of the NAEP have been, and continue to be, at the forefront of large-scale assessment methodology.
During the 1980s, there was growing support for reporting state-level NAEP results, along with debate about the virtues of this expansion to the program. State-level testing would allow for more detailed descriptions but raised concerns that the test could undermine local governance, including standard setting, curricular development, and accountability (Stedman 2009; Vinovskis 2001, this volume; OTA 1992). Nevertheless, federal commitment to state-level data collection persisted, in part because of the publication and importance of A Nation at Risk, which presented a dire account of the U.S. educational system (National Commission on Excellence in Education 1983; Vinovskis 2001).
With the formation of the NAGB and the push for state-level NAEP results, the 1990s brought two further developments in the NAEP. First, the NAGB developed new assessment frameworks to reflect contemporary curricular standards and educational goals and decided to assess representative samples of students by grade level instead of age group. These changes resulted in a new test score scale: scores based on the new assessments could not be compared to scores based on the original assessments. The new assessments are now referred to as the “main” NAEP and have continued to the current day. The original math and reading tests for national samples of students at different age levels (now referred to as the long-term trend NAEP or NAEP-LTT) have also continued and have remained relatively unchanged since their start more than 50 years ago.
Second, under the guidance of the NAGB, the first state-level NAEP results were reported based on the 1990 (math) and 1992 (reading) administrations of the main NAEP. While participation and reporting of state-level results, often referred to as the “state” NAEP, was initially voluntary, universal state participation took effect in the 2002–2003 school year, when the No Child Left Behind (NCLB) legislation tied state participation in grades four and eight math and reading to Title I funding (Riddle 2008). In 2002, the NAEP also launched the Trial Urban District Assessment (TUDA), which reports NAEP scores for a small number of large urban districts (six in 2002 and twenty-seven by 2017).
Some of the most important responsibilities of the NAGB include determining the content that will be assessed, who will be assessed, how often to assess, and at what level of granularity the data should be reported. In particular, the NAGB is careful about what and how content-specific achievement is evaluated, as these content areas reflect beliefs about the importance of those content areas relative to others. The lack of common content or curricular standards across states makes the selection of tested content an inherently political process and the tested content of the NAEP a source of debate (Mazzeo, Lazer, and Zieky 2006). While the original NAEP initially assessed students in areas of citizenship, science, and writing, it quickly changed focus to math and reading. Similarly, the main-, state-, and TUDA-NAEP have focused on math and reading but continue to add new content areas, including arts, civics, economics, geography, science, technology and engineering literacy, U.S. history, and writing (many of which are only reported at the national level).
Using NAEP for System Monitoring
We describe three design elements of the NAEP intended to support its use for system monitoring. We then illustrate how NAEP data can be used to provide high-level summaries and comparisons of student performance across states and over time.
Stability and comparability
One might ask, Why continue the NAEP given the rise in other tests, such as state accountability tests? Comparability and stability are two key reasons. Although nearly every student completes math and reading accountability tests in grades three through eight each year, scores on these tests are not generally comparable across states or across long periods of time. Each state selects its own accountability test, and many states regularly change the selected test over time. While some states have recently administered the same assessments as part of the Common Core assessment consortia, differences in administration procedures exist across states and participation is declining. The NAEP, therefore, provides the only standard benchmark to monitor U.S. student achievement both over time and across states (see also Hanushek, this volume, for reference to problems with state variation in standards and reporting, especially under the NCLB).
As state and national content standards and curricula have evolved, the NAEP has been faced with the challenge of maintaining content relevance by updating the assessment frameworks while also maintaining comparability by not changing them too much. As already described, these competing aims came to a head in the early 1990s when new assessments were started under the “main” NAEP umbrella, in addition to the math and reading NAEP-LTT assessments that continue today. Designers of the assessments have been careful to preserve the comparability of scale scores across years and states (Nellhaus, Behuniak, and Stancavage 2009). Changes to the NAEP in terms of test format and student inclusion since 1990 have been cross-validated to ensure preservation of test score trends using so-called trend or bridge studies, with occasional breaks in the trends (Beaton and Zwick 1990; Haertel 2016; Hedges and Vevea 1997; Hedges and Bandeira de Mello 2013).
Group-score estimation
One of the NAEP’s goals is to provide high-level information about the performance of the education system. The focus on aggregate instead of individual performance is, in part, to ensure compliance and reduce behaviors (e.g., cheating) that would undermine the validity of the scores. Assessments specifically designed to yield measurements of aggregate (but not individual) achievement are sometimes referred to as “group-score assessments” (Mazzeo, Lazer, and Zieky 2006). As with other group-score assessments, the NAEP relies on a process known as matrix sampling (Mazzeo, Lazer, and Zieky 2006), a technique for providing each student participating in the assessment with only a sample of test items.
This design has important benefits for system monitoring. First, because each student completes only a small number of items, students can be presented with more complex constructed-response items—questions for which students supply written responses to prompts (Mazzeo, Lazer, and Zieky 2006). Second, matrix sampling produces reliable estimates for large groups such as states, but not for individual students or smaller groups such as schools (Mislevy et al. 1992). While this limits the inferences that can be made from the data, it also means that no direct decisions can be made based on the test scores at these levels, hence deterring some potential distortions and unintended uses described above. In fact, under Title III of the NAEP Act, reporting of results for individual students or teachers is prohibited, as is using the NAEP data to reward or sanction individuals, teachers, schools, or districts.
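The matrix sampling idea described above can be illustrated with a small sketch. This is not the NAEP's actual booklet design; the function names, the number of item blocks, and the rotation scheme are assumptions chosen only to show how each student can see a fraction of the item pool while every block of items is still administered to a known share of the sample.

```python
from collections import Counter
from itertools import combinations, cycle


def assign_booklets(student_ids, n_blocks, blocks_per_booklet=2):
    """Rotate students through every combination of item blocks.

    Hypothetical scheme for illustration: a booklet is a tuple of block
    indices, and students are assigned booklets in a repeating cycle.
    """
    booklets = list(combinations(range(n_blocks), blocks_per_booklet))
    return dict(zip(student_ids, cycle(booklets)))


def block_coverage(assignment):
    """Count how many students were administered each item block."""
    return Counter(block for booklet in assignment.values() for block in booklet)


# 12 hypothetical students; an item pool split into 4 blocks, 2 blocks per student.
assignment = assign_booklets(range(12), n_blocks=4)
coverage = block_coverage(assignment)
```

In this toy setup each student answers only half of the item pool, yet every block is taken by six of the twelve students, so performance on every block can still be summarized at the group level.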
Many steps are taken to reduce potential bias of NAEP test scores due to factors such as test content, statistical artifacts, and sampling. For example, in a process known as differential item functioning (DIF) analysis, analysts perform content and statistical reviews to ensure that test items are not biased against subgroups of test takers (Mislevy et al. 1992; Zwick and Ercikan 1989). Detailed background surveys are administered to students so that background variables can be incorporated into the analytic procedures to ensure that subgroup means are accurately estimated despite the complex sampling designs (Mazzeo, Lazer, and Zieky 2006). Because the NAEP relies on complex sampling, accurate comparisons across groups or states require that the sampling plans be faithfully implemented. As such, accommodations and inclusion rates across states are carefully monitored to ensure samples accurately represent the target populations.
Dissemination and interpretation
NAEP results are disseminated in many ways and to many groups: the general public, researchers, policy-makers, and education agencies. Reports range from general “report cards” to detailed data files for researchers. Designers of the NAEP have struggled to create summaries of student performance that are accurate and easily interpreted (Beaton and Allen 1992; Jaeger 1998). Originally, results were presented item-by-item, reporting the percentage of students answering each item correctly by student subgroup. With the development of item response theory (IRT) techniques, it became possible to summarize performance across many items on a common score scale. Because numeric scores produced by IRT do not have a natural interpretation (in contrast, for example, to the high school graduation rate), criterion-referenced score interpretations have been developed to describe what students scoring at different levels know and can do in the content area.
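As a rough illustration of why a common IRT scale is more informative than item-by-item percentages, the sketch below uses a two-parameter logistic (2PL) item response function. The text does not say which IRT model the NAEP uses, so the 2PL form, the function names, and all parameter values here are assumptions for illustration only.

```python
import math


def p_correct(theta, discrimination, difficulty):
    """2PL item response function: probability of a correct answer given theta."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


def expected_percent_correct(theta, items):
    """Expected percent correct on a list of (discrimination, difficulty) items."""
    return 100.0 * sum(p_correct(theta, a, b) for a, b in items) / len(items)


# Two hypothetical item sets that differ in difficulty.
easy_items = [(1.0, -1.0), (1.2, -0.5)]
hard_items = [(1.0, 0.5), (1.2, 1.0)]

theta = 0.0  # the same hypothetical proficiency level for both item sets
easy_pct = expected_percent_correct(theta, easy_items)
hard_pct = expected_percent_correct(theta, hard_items)
```

The same proficiency level yields a higher expected percent correct on the easier items than on the harder ones, so percent-correct figures depend on which items were administered, whereas the proficiency value itself does not.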
To facilitate accurate interpretation of NAEP results, the Elementary and Secondary Education Act (ESEA) tasked the NAGB with “identifying appropriate achievement goals” on the NAEP (National Academies of Sciences, Engineering, and Medicine 2017). The NAGB developed a set of achievement levels for each grade level, defining Basic, Proficient, and Advanced performance. A psychometric process known as “standard-setting” was then used to identify cut scores on the NAEP scales indicating the score needed to attain each level. Achievement level results have been reported since 1992, and their reporting is mandated for all states in math and reading under NCLB. These achievement levels are controversial for a number of reasons: the cut scores are ultimately arbitrary, undue attention to proficiency (or any) thresholds is undesirable, and the use of terms like “proficient” conveys that “100 percent proficiency is the only rhetorically acceptable goal” (Haertel et al. 2012, 34). However, the most recent commissioned evaluation from the National Academies of Sciences found that the standard setting process conformed to professional standards at the time and recommended the current achievement levels remain in place (National Academies of Sciences, Engineering, and Medicine 2017; for additional discussion, see, e.g., Bourque 2009; Haertel et al. 2012). While these achievement levels officially remain in a “trial” mode due to questions about their accuracy, reliability, and validity for some intended uses, they have become fixtures of the NAEP reporting and are a central part of the use and interpretation of NAEP results.
Uses of NAEP data
Data produced by the NAEP have been used to describe and evaluate academic performance and to study the effectiveness of large-scale education policies. Figure 1 illustrates how data from the NAEP can be used to describe changes in student performance, by demographic and age subgroups, over time. The figure presents trends in national average achievement in math and reading, by subgroup, over the past 25 years using the main-NAEP assessment data. In both math and reading, these data show that average achievement has increased for all subgroups during this period (the year the tests were introduced varies by subject and grade). The data also show that the average levels of achievement in any given year differ for students of different racial/ethnic backgrounds. Taking grade four math as an example, black and Hispanic students in 1990 performed substantially lower, on average, than their white peers. These data also indicate that although these “achievement gaps” have declined over time, they remain a persistent feature of the U.S. educational system.

Figure 1. National Average NAEP Math and Reading Scores, by Grade and Subgroup, 1990–2015
The horizontal lines in each plot represent the score needed to reach the basic, proficient, and advanced achievement levels of the NAEP. Including these levels demonstrates how NAEP scores can be used to make judgments about whether U.S. students are meeting prespecified standards determined by the NAGB in collaboration with content and test design experts. Figure 1 shows that average subgroup scores rarely reach the proficient level. This high bar for proficiency has gained public attention and is often criticized for being unrealistic (see also Tucker 2018). We note, however, that these average scores do not show the substantial variation within groups and years: in any given year, there are many individual students scoring at or above the proficient level. These performance benchmarks may lead to judgments about whether the education system is failing or succeeding.
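The point about within-group variation can be made concrete with a small calculation. The sketch assumes, purely for illustration, that scores within a group are normally distributed; the mean, standard deviation, and cut score below are hypothetical values, not NAEP results.

```python
from statistics import NormalDist


def share_at_or_above(cut_score, group_mean, group_sd):
    """Share of a normally distributed group scoring at or above a cut score."""
    return 1.0 - NormalDist(mu=group_mean, sigma=group_sd).cdf(cut_score)


# Hypothetical group whose average falls below a hypothetical proficient cut score.
share = share_at_or_above(cut_score=249.0, group_mean=240.0, group_sd=30.0)
```

Under these assumed values the group average sits below the cut score, yet roughly 38 percent of students in the group score at or above it.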
Finally, researchers have been encouraged to use NAEP data for descriptive and causal research about large-scale policies. An example is the use of state-level NAEP data to evaluate whether the introduction of test-based accountability policies led to increased student achievement in U.S. schools (e.g., Burstein 1984; Carnoy and Loeb 2002; Dee and Jacob 2011). More recently, researchers have used individual-level NAEP research data files to evaluate the effects of school finance reforms on academic achievement (Lafortune, Rothstein, and Schanzenbach 2018).
New Developments in System Monitoring
Though the NAEP has made strides to provide more granular data—transitioning from collecting only national data to collecting test score data for states and large urban districts—there is likely to be meaningful variation in student achievement at finer geographic and time scales that the NAEP cannot measure. With the proliferation of new test score data, from state accountability tests to computerized interim and benchmark tests, there are new possibilities for combining scores from multiple data sources to study this variation. Many of these assessments, however, are not designed for system monitoring, and their use for this purpose poses challenges. Here we describe one example of how researchers have combined data from the NAEP with state accountability test score data to provide higher-resolution student achievement data in a publicly available platform: SEDA (Reardon et al. 2018).
SEDA has its roots in prior efforts to use NAEP for comparisons across different tests through a technique known as test linking (Kolen and Brennan 2010; Feuer et al. 1999). Because states use distinct tests and set unique proficiency standards on these tests, the NAEP has served as a common metric for comparisons of the stringency of each state’s proficiency standards. A series of reports by the NCES has mapped each state’s fourth- and eighth-grade proficiency standards onto the NAEP math and reading scales (e.g., Bandeira de Mello, Rahman, and Park 2018; McLaughlin, Gallagher, and Stancavage 2004; Braun and Qian 2007). NAEP data have also been linked to international assessments such as the Trends in International Mathematics and Science Study (TIMSS) and the Programme for International Student Assessment (PISA) (National Center for Education Statistics 2013), both to predict how foreign students would perform on the NAEP and to predict how students in U.S. jurisdictions participating in the NAEP would perform on international assessments (e.g., Phillips 2007; Hambleton, Sireci, and Smith 2009; Lim and Sireci 2017).
SEDA uses the NAEP to link state test score data in a different way and for a different purpose. Using district-level scores from each state’s accountability test contained in the U.S. Department of Education’s EDFacts database (EDFacts 2015), the mean and standard deviation of achievement in each district are mapped onto the NAEP scale to facilitate comparisons of average district-level achievement across states. SEDA contains achievement data for third through eighth graders in math and reading on a common scale, by student race and sex, for nearly every public U.S. school district during the 2008–2009 through 2014–2015 academic years. SEDA also contains estimates of the average within-cohort change in test scores from third through eighth grade for each district. SEDA thus provides information about the educational opportunity that districts provide students in the form of average achievement at each grade level as well as the average “growth” in test scores across grades. While this measure of change is not a direct measure of student growth due to the cross-sectional and aggregate nature of the data, we refer to these estimates as growth estimates. Additional details about the construction of SEDA can be found in the technical documentation (Fahle et al. 2018) and elsewhere (e.g., Reardon, Kalogrides, and Ho 2018; Reardon et al. 2017).
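One simple way to picture such a mapping is a linear transformation. The sketch below assumes that a district's mean and standard deviation are first expressed in units standardized within its state (state mean 0, standard deviation 1) and are then rescaled with that state's NAEP mean and standard deviation. This is a simplified assumption for illustration, not a description of SEDA's actual procedure, which is documented in Fahle et al. (2018) and Reardon, Kalogrides, and Ho (2018); all names and numbers are hypothetical.

```python
from typing import NamedTuple


class LinkedEstimate(NamedTuple):
    mean: float
    sd: float


def link_to_naep_scale(district_mean_z, district_sd_z, state_naep_mean, state_naep_sd):
    """Linearly map state-standardized district estimates onto the NAEP scale."""
    return LinkedEstimate(
        mean=state_naep_mean + state_naep_sd * district_mean_z,
        sd=state_naep_sd * district_sd_z,
    )


# Two hypothetical districts with the same within-state standing (+0.5 SD)
# in states whose hypothetical NAEP averages and spreads differ.
district_a = link_to_naep_scale(0.5, 0.9, state_naep_mean=245.0, state_naep_sd=30.0)
district_b = link_to_naep_scale(0.5, 0.9, state_naep_mean=235.0, state_naep_sd=28.0)
```

Both districts sit half a standard deviation above their own state average, but on the common scale their means differ by 11 points because their states differ. That is the kind of cross-state comparison that state test scores alone cannot support.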
These data allow researchers to study variation in achievement and educational opportunity in ways not possible with either the NAEP or state accountability data alone. As an example, Figure 2 shows the average fourth-grade math achievement from 2009 to 2015 for U.S. public school districts, linked to the main NAEP score scale. The figure highlights considerable within-state and between-district variability in average achievement that is not apparent from the state-level NAEP data. Combining the SEDA data with information about local context allows researchers to pursue descriptive and causal analyses about patterns of educational opportunity. For example, through linking SEDA data to U.S. Department of Housing and Urban Development data on lead contamination, Sorensen et al. (2018) estimate the causal effect of lead exposure on student achievement. In another example, the SEDA data are used to describe variation in early childhood learning opportunities (as measured by average district achievement in third grade) relative to average growth in test scores from third to eighth grade (Reardon 2018). This analysis shows that although average achievement in third grade is highly correlated with family socioeconomic conditions in a school district, average growth is much less highly correlated with either average third grade achievement or socioeconomic conditions.

Figure 2. Map of District-Level Average Achievement, Stanford Education Data Archive
Validity concerns
Repurposing instructional or accountability tests for system monitoring entails using tests in ways they were not originally intended or designed to be used. These new uses require validation to ensure that they are warranted. That is, evidence needs to be gathered to support the use of scores on instructional or accountability tests for system monitoring purposes. Linking state tests using the NAEP, for example, requires assumptions about comparability that must be evaluated for the purpose of system monitoring. Multiple studies evaluating the methodology and results in SEDA have already been reported (e.g., Reardon, Kalogrides, and Ho 2018; Reardon et al. 2017), and SEDA documentation has been careful to articulate some of the intended and unintended interpretations of these data. However, research articulating and evaluating the assumptions underlying the uses of these data should continue.
Unintended or unanticipated uses and effects should also be studied. In the case of SEDA, publicly reporting district-level test scores using NAEP-based linkages might have adverse consequences, either by increasing the apparent stakes on the NAEP or by changing other behaviors (e.g., public reporting of school- and district-level test score data may increase residential segregation [Lareau and Goyette 2014]). Finally, the public nature of these data could raise privacy concerns or allow for potential misinterpretation or misuse of the data. For example, the SEDA achievement estimates have been interpreted as measures of “school quality” (Bui and Dougherty 2017) despite caveats that the measures in SEDA reflect a combination of both school and neighborhood influences.
Carrying out validity research for complex testing programs and systems such as the NAEP or SEDA is challenging. For example, although the NAEP follows many standards for best practices, and numerous studies have evaluated specific aspects of the NAEP, an overall validity framework with specific IUAs has not been consistently articulated and maintained (Buckendahl et al. 2009). This may be due, in part, to the multifaceted nature of the NAEP, as well as its broad and evolving intended uses and interpretations. Another challenge is determining the responsible party for carrying out the (often costly) validity research. This challenge highlights the benefit of having funding through a federal system such as the NCES, which has contracted with organizations to conduct validity research on behalf of the NAEP. In the case of SEDA, responsibility may fall on those constructing the database, rather than those designing the underlying tests. However, such efforts would still need to be financially supported, for example by grant-making agencies. Despite these challenges, validity research should be a priority for system monitoring testing programs and data systems.
Conclusion
First, for the purpose of monitoring how student achievement has changed over time and varies among states or subgroups, there are currently no good substitutes for the NAEP. Its quality, longevity, stability, and comparability make it a unique resource. By providing the data to the public, including data files to researchers, the NAEP enables analysis of changes over time, among states, and among subgroups, as well as evaluation of certain large-scale policies. The substantial work devoted to designing and evaluating the NAEP testing procedures and the oversight provided by the NAGB have also made the NAEP a trusted and respected source of information about student achievement.
Second, combining test score data from multiple extant sources is a promising avenue for expanding the use of tests for system monitoring. States and school districts currently administer many tests in addition to the NAEP. Technology makes it relatively simple, from a purely practical standpoint, to combine and report these data. SEDA is one example. It illustrates how combining the NAEP data with state accountability testing data, coupled with transparent information about the data construction process and limitations, can provide more detailed descriptions of student achievement on a national scale. Because it may be easier to redirect resources or implement change at a local level, these finer-grained data available from sources such as SEDA could allow for more targeted and effective educational management practices. At the same time, care should be taken to provide guidance about the intended uses and interpretations of these test scores, along with validity evidence.
Finally, deciding what and who should be tested, and how often, remains an important challenge for those designing and using system monitoring tests. Many valued educational outcomes cannot be measured by standardized tests. Standardized tests should thus be used as one indicator among many in a comprehensive system and should be chosen to complement the other indicators. Even among constructs that can be measured by standardized tests or surveys, however, it is neither practical nor desirable to attempt to measure all of them. Trade-offs will need to be made that balance the continuation of tests in core content areas for measuring progress, with the recognition that nontraditionally tested skills, such as so-called social-emotional skills, are important early predictors of long-term outcomes (Deming 2017; Heckman and Rubinstein 2001; Jackson 2018) and therefore may warrant attention. These decisions should be made deliberately, and with adequate documentation of their intended and unintended effects, to inform further progress in the design and use of tests for system monitoring.
Footnotes
Note:
The authors thank Jack Buckley, Ed Haertel, Andrew Ho, and Sean Reardon for helpful comments, as well as two of the volume editors, Amy Berman and Michael J. Feuer, and participants of the NAEd/AAPSS ANNALS Assessment Workshop. All errors are our own. Authors are listed in alphabetical order.
Notes
Erin M. Fahle is an assistant professor in the Department of Administrative and Instructional Leadership at the St. John’s University School of Education. Her research focuses on describing and explaining variation in educational opportunity across the United States and has appeared in Educational Researcher and AERA Open.
Benjamin R. Shear is an assistant professor in the Research and Evaluation Methodology program in the School of Education at the University of Colorado Boulder. His research focuses on the uses of educational tests and applied statistical issues in educational measurement and psychometrics, particularly those relevant to validity and validation.
Kenneth A. Shores is an assistant professor in Human Development and Family Studies at Pennsylvania State University. He has published in such journals as American Journal of Sociology and Education Finance and Policy on topics of racial/ethnic test score inequality and the effects of school finance reform.
