Abstract
Early presidents of the American Educational Research Association were leaders in the testing movement. Their intentions were to improve education by means of testing, which included both IQ and achievement tests. Early measurement experts acknowledged in scholarly articles that IQ tests could not measure inherited ability of groups with vastly different opportunities to learn, and yet ability testing was promoted as a beneficial means for matching instruction to individual differences until the insights of the civil rights era in the 1960s. Standard achievement measures were developed in large part to allow valid comparisons across school systems and over time, but the representations of learning that were adequate 100 years ago came to have distorting effects on teaching and learning. Today’s young psychometricians have opportunities to create new assessments in partnership with curriculum experts, but they should remain alert to the ways that well-intentioned assessment systems have been corrupted in the past.
Testing as a means to improve education is a strong theme in the history of American education research. In fact, in the earliest years, the development of tests and use of their findings might even have been thought to be the entirety of education research. In recounting this history—and the contributions of AERA presidents to it—I wish to address myself particularly to young scholars. It is important for psychometricians and education researchers, more generally, to know where we come from, to understand the good intentions and harms from the past, and to recognize the distinct legacies from both the IQ testers and the achievement measurers.
In framing the task in this way, I risk what historians refer to as the sin of “presentism.” When we interpret history from the vantage of present-day values and concepts, we fail to understand the past in its own historical context. Lynn Hunt (2002), President of the American Historical Association, warned that presentism fosters a shortsighted sense of moral superiority; she argued instead for maintaining a more difficult “fruitful tension between present concerns and respect for the past” (p. 2). To cover a hundred years of testing history in short space, I organized the relevant set of presidential speeches into three major time periods. To acknowledge the presentism tension, I first provide within each period the perspectives from that time. This is followed by brief commentary, given what we know now. Ideas from the separate time periods are necessarily overlapping, but they sort roughly as follows: I. The Early Years: Social Efficiency and Administrative Progressivism (1916–1929); II. Middle Years: Expansion of Testing and Establishment of the Educational Measurement Profession (1930–1969); and III. Recent History: High-Stakes Testing and Accountability (1970–2015).
The Early Years: Social Efficiency and Administrative Progressivism (1916–1929)
The beginnings of testing in the early 20th century had two distinct threads. One focused on the measurement of IQ, following from Binet’s 1905 development of an “intelligence” test to identify differences between children with normal mental functioning and those in need of special assistance. The other aimed at developing objective measures of achievement, traceable to Rice’s 1897 comparative study of common spelling words. The Stone Arithmetic Test, published in 1908, was the first of many “standard” tests developed by E. L. Thorndike and his students to correct the “scandalous” unreliability (and noncomparability) of teachers’ examinations revealed by a number of early education research studies. Both IQ and achievement tests served well the purposes of the larger social efficiency movement and of educational reformers, whom David Tyack (1974) in The One Best System called the “administrative progressives.” Both types of measures were essential to the development of education as a science.
In industry, Taylorism and the scientific management movement emphasized eliminating waste and increasing efficiency. According to Tyack (1974), administrative progressives, whose program was distinct from John Dewey’s democratic and transformative progressivism, sought to bring these same benefits of science and efficiency into schools to establish a new, best system for education. To better address the needs of the urban poor, corrupt systems would be replaced with a centralized, corporate model of school organization presided over by elite and altruistic school boards together with professional superintendents. Administrative progressives criticized the traditional rote, undifferentiated, recitation curriculum, inhumane punitive conditions, and poorly trained teachers. In their place, John Franklin Bobbitt’s social efficiency would eliminate waste by matching curricula to the needs of an industrial society. Enlightened reforms (child labor laws and compulsory education) plus waves of immigration put enormous pressures on schools, which were addressed by new, highly differentiated curricula and the use of tests to classify students.
Frank W. Ballou was the first president (1915–1916) of the National Association of Directors of Educational Research (NADER), which we take to be the parent organization of today’s American Educational Research Association. Each of the first six presidents was a major contributor to the testing movement. I have selected papers by Ballou (1916) and Haggerty (1921) for attention here, because they illuminate the thinking of the time about achievement and intelligence testing, respectively.
Ballou’s (1916) title, “Improving Instruction through Education Measurement,” sets the tone, in fact, for the next 100 years. Measurement was needed not merely to study education, but as a means to intervene to improve education. A familiar pattern of steps was recommended:
The quality of educational results must be measured by the best available standard tests.
Results must be analyzed and suggestions for improvement made where results are unsatisfactory.
After reasonable time has elapsed, “similar standard tests must be repeated to determine what effect, if any, the suggestions have had on instruction” (p. 355).
Ballou was the director of the Department of Educational Investigation and Measurement in Boston, and a good portion of his paper was a status report on the progress being made in Boston to implement these steps in each of several subject areas. In spelling, for example, the department had found that teachers had identified as difficult “words which elementary school pupils should not be expected to spell” (p. 356), words such as “diaphragm, dilatory, equilibrium, fictitious, impenetrability,” and so forth.
As a means of economizing the time of teacher and pupil, and improving the pupil’s ability to spell, the department is preparing a course of study for each grade, which shall consist primarily of words used voluntarily by normal pupils in their written work. (p. 356)
In arithmetic, the Courtis Standard Tests, authored by NADER’s third president, had been administered in Boston beginning in 1912. Annual administration of this test made it possible for Ballou (1916) to describe the progress being made in implementing an entire testing program. “Objective standards of achievement in the four fundamental operations in arithmetic” had been established “on the basis of the median achievement” (p. 359) calculated for all the pupils tested. Formal reports were prepared for teachers, principals, and the superintendent, allowing each to make comparisons to the standards. However, “detailed comparison of the work of one teacher with that of another, which is likely to arouse controversy or hard feeling, has been studiously avoided” (p. 360). Self-comparison to standards was regarded as one means to improve instruction, but, in addition, systematic practice materials were introduced to provide more targeted drill for the students most in need. Using the Courtis results from the spring of 1915, Ballou was able to document greater speed and accuracy of performance for the group of schools that had been using the test and practice materials longest. Emerging education research methods can be seen in Ballou’s assurance that each group of schools had been selected “to represent the varying conditions found in the city … so that one group of schools possesses no coherent superiority over another group” (p. 363).
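The arithmetic behind a median-based standard of the kind Ballou described is simple, and a minimal sketch may help make it concrete. The scores below are invented for illustration, and the function names `median_standard` and `gap_from_standard` are hypothetical; nothing here reproduces the actual Courtis scoring procedures or the Boston data.

```python
from statistics import median

# Hypothetical scores (examples worked correctly per pupil); invented for illustration only.
citywide_scores = [4, 6, 7, 8, 9, 10, 12]
classroom_scores = [5, 7, 9]


def median_standard(scores):
    """Return the median score, treated here as the 'standard' for a grade."""
    return median(scores)


def gap_from_standard(group_scores, standard):
    """Return how far a group's median falls above (+) or below (-) the standard."""
    return median(group_scores) - standard


standard = median_standard(citywide_scores)
gap = gap_from_standard(classroom_scores, standard)
print(f"standard = {standard}, classroom gap = {gap}")
```

A teacher or principal could make this kind of self-comparison to the standard without any teacher-to-teacher comparison, which is the use Ballou reported favoring.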
The formal title of M. E. Haggerty’s address in 1921 was “Recent Developments in Measuring Human Capacities,” but he began by saying that a more suitable title might be “the Inadequacy of Intelligence” (p. 241). Acknowledging the burgeoning of group IQ testing following World War I, he remarked as follows:
The avidity with which the educational public has seized upon group intelligence examinations is both encouraging and alarming. It confirms our faith that such tests meet a real need in school work, but it also raises a doubt as to the existence of a wholesome critical attitude of mind toward the proper selecting of tests and the proper use of test results. (p. 241)
Haggerty recognized that the intention of intelligence tests was “to make possible important remedial measures in teaching and school organization” (p. 242), and he documented dramatic acceleration in the sale of such tests. He applauded in particular the contribution of two other early NADER presidents, B. R. Buckingham and Walter S. Monroe, who had introduced the idea of using intelligence measures as something like a value-added control for research studies, called an “achievement quotient” (p. 244), which he considered to be essential when trying to evaluate which part of achievement outcomes could rightfully be credited to school programs.
Haggerty (1921) also recognized the limitations of intelligence tests, not because they did not measure native intelligence, but rather because “success in school work and success in life are not determined by intelligence alone” (p. 245). He reported on a comparative study conducted by two of his students in which qualities such as industry, loyalty, honesty, sympathy, tactfulness, and cheerfulness were as important in determining success as intellective factors such as efficiency, attentiveness, prudence, and adaptability. He also enumerated cases of pupils scoring well on intelligence tests but doing poorly in school, and vice versa. Further evidence of the inadequacy of intelligence tests was found in larger scale studies where the distribution of scores for unsuccessful populations, such as 500 women in a state reformatory, “was found to be practically coextensive” (p. 249) with that of the general population. In summarizing the state of the field, Haggerty’s stance was not to abandon intelligence testing, but like Terman and Buckingham, he thought it necessary to also pursue the development of scales to measure non-intelligence factors predictive of success.
Commentary
It is important to understand that standardized achievement testing and IQ testing grew up together in the U.S.; and as I have suggested, this expansive testing movement was central to the development of education research in the early years. In his history of education research from 1918 to 1927, Walter Monroe (1928) chronicled the tremendous burgeoning in the number of doctors’ theses in education, in the establishment of research bureaus in city school systems and in teacher-training institutions, in the publication of books on statistical methods and educational measurement, and in the attention to measurement studies in still-new educational psychology and educational research journals. As part of his research, Monroe sent surveys to all of the leading test publishers and, as a result, estimated that “not less than thirty million copies of standardized tests and scales are now being used annually” (p. 114). He thought that the number might be closer to 40 million if complete data were available. “About 25 per cent of these are intelligence tests and 75 per cent tests of achievement” (p. 114).
Reading presidential papers and various historical accounts written in this period, one can sense the excitement and faith in the progress being made possible by new technologies. As obvious as it may seem, careful attention to learning goals and efforts to define the objects of measurement were important innovations (Thorndike, 1922), as was the effort to create standardized instruments that would allow for sufficient comparability to study the effectiveness of educational interventions across jurisdictions and over time. This period saw the development of the “objective or new-type” test (Ruch, 1929) and, along with it, a “sampling theory of examinations” (p. 56) that laid out the idea of a content domain from which short, objective item types could sample more broadly and essay questions more intensively. The fundamental ideas of test reliability and validity were also developed in this period.
While recognizing the importance of sound achievement testing principles that derive from this time, I nonetheless have two major concerns regarding achievement testing in the early years. First, the tests that were built reflected very narrow conceptions of learning (though clearly more appropriate to that time than now), and second, as a field, achievement measurers were alarmingly sure of their scientific accuracy. In his history, Monroe (1928) identified “faith in objective methods” (p. 46) as a salient feature of education research alongside the many technical developments. “In a number of reports of educational research, there is evidence that the author believed that if his data were objective, the conclusions were indisputable” (p. 47). Monroe cited examples where authors “know definitely” because of the objectivity of their methods and where they labeled their findings as “strictly scientific,” “which appears to mean that the conclusions were independent of the opinions or prejudices of the investigators” (p. 48). Monroe assured us later that the “worship” of objective methods had reached its peak in 1922 or 1923, and he hoped, I think falsely, that by the time of his writing there was “a growing recognition of the limitations of objective methods” (p. 48).
Of course, even in the early days, there were critics of standardized and objective tests, and it is instructive to see how these worries were considered but refuted by holders of the dominant view. The fourth NADER president, B. R. Buckingham (1920), argued that fact questions were sufficient to measure historical ability even if historical ability is constituted by information, thought, and character development. Reflecting his conception of knowledge, he extended the term “facts” to include not only “dates, persons, and places but also as to causation, relationship, political and social movements, economic developments” (p. 168), all of which he thought the adept pupil would know by heart. He also argued, based on his own and recently published correlations between fact and thought questions, that one could use a mathematical equation to predict scores on thought questions from fact question scores without ever having to administer thought questions. Another leader in the field, Ben D. Wood (1923), made a similar argument:
There is not as much opposition between “information” and “reasoning” as some teachers would have us believe. … [F]acts are not only a legitimate and undoubted aspect of thinking, … they can be acquired, retained, and reproduced only by thinking, only by organizing material in a logical and systematic manner. (p. 162)
Every experimental study thus far made and reported has shown a very high relationship between measurement of information in a field and the intelligence or ability to think in the material of that field. (p. 163)
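To make Buckingham’s proposal concrete, the kind of “mathematical equation” he had in mind can be sketched as a simple least-squares prediction line. This is an illustration only: the source does not give his equation, the scores are invented, and the names `fit_line`, `correlation`, and `predict_thought` are hypothetical.

```python
from statistics import mean

# Hypothetical paired scores for five pupils; invented for illustration only.
fact_scores = [10, 12, 15, 18, 20]
thought_scores = [8, 9, 12, 13, 16]


def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line predicting y from x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def correlation(x, y):
    """Return the Pearson correlation between x and y."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / (sxx * syy) ** 0.5


slope, intercept = fit_line(fact_scores, thought_scores)
r = correlation(fact_scores, thought_scores)


def predict_thought(fact_score):
    """Predict a thought-question score from a fact-question score alone."""
    return slope * fact_score + intercept


print(f"r = {r:.2f}, predicted thought score for a fact score of 16: {predict_thought(16):.1f}")
```

Even when the correlation is high, such a line yields only a predicted score for each pupil. It says nothing about what a pupil would actually have done on thought questions that were never administered.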
This false confidence in achievement tests as sufficient measures of learning or as adequate proxies to be used in judging educational programs is a popularized belief from those early days that still haunts us today. These examples also illustrate a pattern among measurement experts: they acknowledged complaints about the adequacy of their measures but did not change course in their promulgation of tests and test batteries as effective tools in the service of educational improvement. This was true for both achievement measures and IQ tests, used initially as controls when evaluating the effectiveness of educational programs, but then increasingly as placement tests.
The history of intelligence testing and its connection with the American eugenics movement in the early 20th century has been told many times (Chapman, 1988; Cronbach, 1975). IQ testing did not create racism in America, but the theory of innate differences in merit among social groups was embraced because of prevailing beliefs. Moreover, the claim of scientific objectivity fed a system of beliefs, favoring nativist rather than opportunity explanations for attainments, that still today affects public discourse and the beliefs of educators. That is why contemporary antiracism scholarship necessarily focuses so heavily on understanding privilege and why effective educational interventions find it necessary to address the problem of “deficit thinking” (Valencia, 1997).
In addition to reifying racist beliefs, IQ testing created further harm by assigning low-scoring students to low-track and vocational classrooms. Terman (1919, 1922) reported IQ differences observed among various occupational groups, college students, businessmen, semiskilled, and unskilled laborers and argued in turn that different curricula and methods of instruction should be provided to children identified as “gifted,” “bright,” “average,” “slow,” and “special.” Cronbach (1975) summarized the reasoning and effects of tracking by IQ as follows:
When the tests determined who would enter the college preparatory program and before that determined who would go into the “fast” section of an early grade, the tests began to determine fates. The testers’ sorting process was to shield the child destined to be a worker from the rigors of an academic curriculum. Such a plan would reduce distaste for schooling, prevent failure, and retain him in school longer. Testers said that the IQ was constant; hence to make decisions early was merciful and just. (p. 11)
In his more detailed history, Tyack (1974) documented the particularly devastating effects of these beliefs and school policies on Black Americans, whom he called “victims without ‘crimes’” (p. 217). Even when some acknowledgement was made that social conditions might account for lack of success in both school and employment, “low mentality” was also pointed to as a cause, and regardless, administrative progressives saw it as the responsibility of the system to prepare Black children for their place in the existing social order.
Cronbach’s (1975) analysis of this early period is particularly useful in helping us understand why beliefs about native ability and a highly differentiated school system could persist virtually unchallenged until the 1970s. My simplified summary is as follows. Early enthusiasts of IQ testing joined eugenicists and taught the public extreme views regarding the objectivity of the tests, heritability of IQ, and racial differences. In contrast, Cronbach documents that, even at the time, scientists were more careful about their claims when talking amongst themselves. He cites Freeman (1923), who published “a referendum of psychologists,” in which Yerkes and Terman among other leaders admitted that it was illogical to draw conclusions about the inherited ability of groups with marked differences in training or experience. A similar point can be made about Philip A. Boyer, whose contribution as AERA president is considered in the next section. Published in 1920, his PhD thesis provides an extensive example of scientific management applied to schools. Boyer goes on at length comparing differences between blacks and whites on illiteracy rates, unstable marriages, unsanitary housing conditions, low wages, working mothers, and so forth. At one point he admits that these findings could be attributed to slavery and “the obstacles placed in (the black man’s) way after freedom” (p. 35), but in the end Boyer subscribes to the idea of “transient” racial superiority (p. 36) and the need to minimize academic subjects for certain capacity groups so as to ensure the “development of proper attitudes towards hygiene, vocation and home-making” (p. 114).
The point is that even when researchers and measurement specialists qualified their nativist claims, they never suggested that their caveats should lead to a change in the curriculum-differentiation policies they had launched. This is an important insight when considering the contributions of AERA presidents throughout the middle years, which I take up next. They tend to describe progress in the development of the measurement field, with a few asides to the existence of controversy, of which they themselves are not a part.
Middle Years: Expansion of Testing and Establishment of the Educational Measurement Profession (1930–1969)
Given a shared rationale, this middle-years set of presidential papers might have been grouped with the early-years presidents. However, by 1933 when AERA president John L. Stenquist reported on “Recent Developments in the Use of Tests,” the prevalence of standardized testing programs in schools had become so extensive that the field was no longer characterized by the newness of the testing movement. Instead, the central role of testing in schools was taken for granted and attention was turned to developing a broader array of tests and contexts of application and to formalizing methods of test construction and scoring. Historian Daniel Resnick (1982) characterized the period after 1935 as a time of growth, adaptation, and “burgeoning of multiple assessment testing” (p. 189). A key example was the extension of ability testing to college selection with the development of the Scholastic Aptitude Test in 1926. The Iowa Test of Basic Skills developed by E. F. Lindquist during the 1930s was an expanded test battery made available to all of the schools in Iowa on a voluntary basis and eventually to school districts in other states.
This middle-years period was also a time when measurement experts established themselves as a group of specialists with interests focused on technical issues. A small group calling themselves the National Association of Teachers of Educational Measurement met for the first time in 1938 and founded the organization that would subsequently be named the National Council on Measurement in Education (NCME). In 1936, to address the need for more adequate training, the American Council on Education’s standing Committee on Measurement and Guidance sponsored an elementary handbook on The Construction and Use of Achievement Examinations (Hawkes, Lindquist, & Mann, 1936); and the more imposing Educational Measurement handbook edited by E. F. Lindquist was published in 1951. The first test standards were published by the APA for psychological tests in 1954 and for achievement tests in 1955 by AERA and NCME.
Rather than delve into any one president’s contribution more deeply, I attempt here to give a sense of this period of expansion and technical development by enumerating each of the several foci selected by different groups of presidents. Not surprisingly, the first three presidents in this period, Stenquist (1931–1932), Rankin (1933–1934), and Boyer (1935–1936), most directly reflect the thinking of the earlier period, adopting in particular its vocabulary of “individual differences” and “homogeneous grouping.” Rankin’s (1931) review of research on ability grouping, for example, includes a generally favorable survey of 500 superintendents from which he identified arguments both for and against grouping by ability. The superintendents’ arguments against homogeneous grouping were primarily affective rather than concerned with lack of access to interesting academic content. Rankin summarized research supporting the use of IQ tests for grouping. He recommended that teacher judgments and nonintellectual traits be used in addition to test scores when making placements and that students be grouped based on aptitude or achievement in specific subject areas rather than on the basis of general ability. Boyer’s (1939) article on “Educational Adjustments to Individual Differences” acknowledged that grouping by ability was complex and did not always result in improved achievement. He still spoke of arranging students by brightness, but unlike his 1920 thesis, the article made no explicit mention of race.
Ralph W. Tyler, a 20th-century giant in the fields of curriculum, evaluation, and testing (Madaus & Stufflebeam, 1989), was never president of AERA, which exposes one of the limitations of relying on presidential papers to understand the history of testing and assessment during this period. One of Tyler’s many contributions had been to specify learning objectives and to evaluate whether existing tests assessed those objectives adequately. Tyler (1936) explicitly called out the distinction between test questions focused on recall versus “higher mental processes.” In 1940, president T. R. McConnell (1941–1942) wrote a very technical paper, focused on item discrimination indices and intercorrelations between subtests; unlike Tyler, he was unable to find evidence of the differences between fact questions versus application and inference questions.
Examples of the expansion of testing during the middle part of the century are seen in papers by AERA presidents Alvin C. Eurich (1945–1946) and Arthur E. Traxler (1950–1951). Eurich’s (1944) review article described the development of psychological tests to classify officers and enlisted men for the U.S. Navy during World War II. Traxler (1952) reviewed tests for selection into graduate schools, most particularly the Graduate Record Examinations and the Miller Analogies Test. Walter W. Cook (1958) wrote a paper for NCME about “What Teachers Should Know About Measurement.” Cook’s list included knowledge of percentile ranks and item-discrimination calculations as well as consideration of the mental processes involved in answering test questions. Julian C. Stanley’s presidential address was entitled “Reliability Revisited,” but there are no known copies. Stanley’s contributions in measurement were most famously associated with his use of the SAT to identify precocious youth.
Chester Harris wrote an overview for the 1962 Review of Educational Research issue on “Educational and Psychological Testing.” His is perhaps the one slightly critical paper among all the papers written in this time period. In summarizing the progress that had been made since Binet, Simon, and Terman, he noted that “we have a considerable amount of machinery” but nevertheless “seem to be studying many of the same problems in the same way” (p. 103). Harris, an expert in factor analysis, acknowledged that it was a good thing that the field had learned not to “announce the existence of an aptitude, mental ability, or personality trait on the basis of naming a factor derived from some conveniently available set of test responses” (p. 104). Harris opined, however, that “tests of general mental ability and tests of special aptitudes appear to have settled into rather comfortable ruts” (p. 105). He lamented the “utilitarian emphasis” (p. 106) of the field whereby ability and achievement tests were pursued for their predictive utility without any attention as to why they predicted or what effective interventions might disrupt those predictions.
Commentary
The difficulties I encountered in trying to draw a boundary between the middle and present-day periods in testing and assessment and in trying to place the addresses by Robert L. Ebel (1973) and Robert L. Thorndike (1975) are a fruitful source of commentary. Although Ebel and Thorndike acknowledged issues and changing conceptions from the present era, their tone and optimism are from the past. Ebel noted that intelligence tests have done “much harm educationally and socially” (p. 9). But he was content with a caution that we should not draw inferences about biological differences between cultural groups. Thorndike—the son of Edward Thorndike—gave an extended account covering the past 70 years of IQ testing, focusing mostly on the meaning of the standard score scale over time. His treatment is largely technical, although at the end he mentions that “subculture differences may inhibit performance,” so “any rigid specification of level of test performance as the basis for decision or action … seems unwise and perhaps pernicious” (p. 7). Both Ebel and Thorndike seem to see their role as one of issuing appropriate cautions. The psychologists and educational measurement experts at the start of the century had rolled up their sleeves and engaged with the administrative progressives in developing tests as a means to change the educational system for the good as they saw it. With the establishment midcentury of a measurement profession, these specialists no longer had to engage much with how their tests were being used in schools. None argued that ability measures should not be used.
The set of presidential papers in this middle period does not afford much attention to achievement testing. In my presidential address (Shepard, 2000) considered later, I made the case that one problem with the item formats of standardized achievement tests is that they carried forward conceptions of learning from the earliest part of the century. Ralph Tyler and others worried about whether tests could adequately assess higher-order applications of subject matter knowledge and principles, but theirs was not the dominant view. Besides, the behaviorists who held sway in the 1950s, 1960s, and 1970s were satisfied if low-level skills were given the most attention, because they believed that the basics had to be mastered before going on to higher-level thinking. In an earlier article (Shepard, 1991), I reviewed the thinking behind programmed instruction and mastery learning to explain why it was acceptable from a behaviorist perspective to drill on test items and then retest on nearly identical items.
A final concern regards Walter Cook’s (1958) advice for teachers. His idea that classroom teachers needed to learn descriptive statistics, reliability coefficients, and how to interpret standardized test batteries is representative of how “Tests and Measurements” courses were taught to preservice teachers for most of the century. Learning objectives were to be specified in terms of expected changes in behavior, and to measure these objectives, teachers should know the advantages and disadvantages of various types of objective and essay questions for constructing classroom tests, mostly for purposes of grading. This canon was never officially repudiated by measurement specialists. It was not until much later, beginning in the 1980s, that what teachers should know about assessment began to be challenged by subject matter specialists, who sought to develop more authentic and substantively oriented assessment strategies grounded in instructional activities (Shepard, 2006). During the middle years there was no talk of formative assessment, although it was expected—without further elaboration—that teachers would know how to individualize instruction on the basis of test score results.
Recent History: High-Stakes Testing and Accountability (1970–2015)
The final period of testing and assessment history, from 1970 to the present day, is characterized by a dramatic decline in ability testing and a concurrent steep rise in use of standardized achievement tests to hold students and schools accountable. As Cronbach (1975) noted, “There is a tide in the affairs of issues” (p. 11), meaning that some ideas gain traction, because of the Zeitgeist of a period, that would be ignored or ridiculed at a different time. The long-standing complaints against the validity of IQ tests—in the face of unequal opportunities to learn—and the harm of sorting students into dead-end school placements came to the fore during the civil rights era. The Civil Rights Act was passed in 1964, and under its authority, IQ testing was challenged both for employment and placement in special education. In 1972, in Larry P. v. Riles, the court agreed with claims of racial bias in California’s use of IQ tests, based on evidence that smaller proportions of Black students were identified as mentally retarded in states where other criteria were used such as achievement tests and teacher evaluations. In his decision, Judge Peckham also agreed that Black children erroneously classified as EMR suffered irreparable harm from the social stigma of the EMR label and from a self-fulfilling, limiting curriculum focused on social skills and grooming instead of academic skills.
As they had in the early 1900s, measurement experts once again became active in helping educators and policy leaders think about how to use tests. When Cronbach was AERA president in 1964–1965, he was not writing about testing, but he did so significantly in 1975 in his examination of Arthur Jensen and the longer-term history of IQ testing cited above. He was also a member of the National Research Council (NRC) Committee on Ability Testing, which included leading psychometricians and statisticians, Lyle Jones, Melvin Novick, Mary Tenopyr, and John Tukey, along with relevant experts in each of the social sciences, including learning researcher Lauren Resnick. The NRC Ability Testing report (Wigdor & Garner, 1982) was cautious, attempting to balance the need for comparative data—which was possibly less biased than other indicators—against evidence of past harms from test misuse. They gave an example of a man, whose father was a mathematician, who scored 700 on the GRE mathematics test, compared to a woman of working-class background who scored 650. The committee argued that the woman could be the more exceptional candidate. They used this example to illustrate why the test was useful in identifying this woman as a very capable candidate in a pool of many other applicants but also why strict top-down selection based on test scores alone was unwarranted. The committee recommended against the use of tests as the sole criterion for selection decisions and against the use of rigid cutoff scores for special education placements. A similarly important NRC report, also published in 1982, addressed assessment issues associated with the overrepresentation of minorities in classes for the mentally retarded (Heller, Holtzman, & Messick, 1982).
The civil rights movement that greatly diminished ability testing also generated renewed interest in achievement testing—because resources for equity would require accountability. The equity agenda of the 1960s launched numerous federally sponsored social programs focused on preschool education, job training, health care, and housing. The first Elementary and Secondary Education Act (ESEA) of 1965, aimed at providing greater educational opportunities for low-income students, also created the field of educational evaluation (Worthen & Sanders, 1973) and a quid pro quo bargain that federal dollars would come with a demand for outcome measurement. Several big ideas follow from the ethos and political discourse of this period (Shepard, 2008). First was the focus on outcomes, which in educational measurement was taken up in the form of criterion-referenced testing. Second, conceptions of accountability again brought business management models into education under names such as management by objectives (MBO) and the program evaluation and review technique (PERT) (Wise, 1978). Third was the idea that evaluation data would be used as a means to leverage change. Robert Kennedy expected that parents could use the Title I evaluation mandate as a “whip” or a “spur” (McLaughlin, 1975) to gain a better education for their children.
None of the AERA presidents who addressed measurement issues in this contemporary period considered ability testing. Instead, we each located our work in the context of achievement testing and the predominating theme of accountability. Regarding this context, a few other major testing milestones from this epoch should be noted. In 1969, Ralph Tyler led the effort to establish the National Assessment of Educational Progress (NAEP). Tyler distinguished objective-referenced tests from norm-referenced tests designed for sorting and argued that NAEP was needed to provide census-like data to monitor achievement trends over time. Beginning with the SAT test score decline from the mid-1960s onward, tests became both the messengers of crisis and the means to institute reform, through successive waves of test-based accountability: minimum-competency tests in the 1970s, back-to-basics tests in the 1980s, standards-based reforms in the 1990s, and the No Child Left Behind (NCLB) Act in the 2000s. Each new wave brought with it a ratcheting up of both standards and stakes attached to test results (Shepard, 2008).
For his presidential address in 1978, W. James Popham invited former president Robert Ebel to a debate in which Popham presented the argument for criterion-referenced measurements and Ebel made the case for norm-referencing. Both agreed that familiar norm-referenced standardized tests covered broader and more general subject matter domains than did criterion-referenced tests. Popham saw this as a weakness, however, which he attributed to the profit motive of test companies and their need to make their tests marketable to a broad array of school districts. This breadth and the need to sell tests led to vague descriptions of test content, which meant there could be mismatches between what was tested and what was taught. In addition, norm-referenced tests were likely to be insensitive to instruction because psychometric criteria caused the elimination of very easy items. Ebel countered that standards for learning, at a particular grade for example, are implicitly normative. He also offered several criticisms that were specific to the ways that criterion-referenced testing was being implemented, namely that such tests focused primarily on minimums, discretized separate learning targets, and dichotomized learning as mastery or nothing. Popham admitted that criterion-referenced tests were a recent phenomenon, and hence there did not yet exist many outstanding examples. Neither Ebel nor Popham considered whether the multiple-choice formats used to build both types of measures were adequate for representing valued learning goals.
A decade later, president Nancy S. Cole (1990) was very much concerned with the substance of what tests measured and what that implied about “Conceptions of Educational Achievement.” Although Cole’s address was more about educational goals than about testing, she provided a useful history of behavioral learning theories that had come to dominate psychology and education. Behaviorism’s focus on mastery of discrete skills and well-specified behaviors, manifest in the criterion-referenced testing movement, was consonant with a political environment that decried students’ lack of basic skills, lack of core knowledge, or ignorance about facts of the Civil War. During the 1990s, a quite different conception of achievement with roots in philosophy and recent progress in cognitive psychology would focus instead on higher-order thinking skills and advanced knowledge.
Ten years later, I took up this same history (Shepard, 2000). I tried to explain the mismatch between existing standardized achievement tests and contemporary visions of teaching and learning. Put simply, most multiple-choice-only standardized tests are an anachronism. They carried forward representations of knowledge and theories of learning from the past. For example, teaching the test would not be a problem if you believed that “tests are isomorphic with learning” or that “test-teach-test” (Shepard, 2000, pp. 5–6) is an ideal learning routine. These were very much the assumptions of mid-20th-century behaviorists and programmed-instruction experts (Shepard, 1991). In contrast, beginning in the 1980s, cognitive scientists and subject matter experts in each of the disciplines had contributed to fundamental reconceptualizations of what it meant to become expert and how such intellectual abilities are socially and culturally developed. I cited the many efforts under way to develop more extended and authentic assessment tasks to better represent challenging learning goals. With an accompanying image of Darth Vader, I summarized the familiar research literature on the negative effects of high-stakes testing, especially curriculum distortion and test score inflation, meaning that scores were going up without there being real or sustained learning. To protect classroom assessment from the negative effects of high-stakes accountability testing, I argued for a shift in classroom practices that would enable the use of formative assessment in ongoing learning interactions. What I described was development of a learning culture, not a new testing program.
Robert L. Linn’s address in 2003, “Accountability: Responsibility and Reasonable Expectations,” was in the earliest years of NCLB. He identified the features of an accountability system necessary to ensure improved education and contrasted these characteristics with NCLB requirements. First, responsibility must be shared for accountability systems to be effective. This means that in addition to holding teachers and administrators responsible, policy makers have a responsibility “to provide the means – both instructional resources and professional development – for teachers and students to meet the expectations of the accountability system” (p. 3). Linn also repeated the evidence of curriculum distortion that comes with high-stakes accountability and argued that “what counts” needs to be more broadly defined. Most famously, Linn analyzed NCLB’s Adequate Yearly Progress (AYP) requirements and explained why requiring schools to bring 100% of students to demanding proficiency levels in 12 years was unrealistic. He foretold that “targets that were plucked out of the air and dropped into the legislation” (p. 12) would “do more to demoralize educators than to inspire them” (p. 10). Linn argued instead that policymakers could still set ambitious targets but they should be informed by an “existence proof” (p. 4). For example, it would be ambitious, indeed, but realistic to select as a goal the growth rate achieved by the top 10% of Title I schools.
Fittingly as the last presidential address on testing, Eva L. Baker entitled her paper in 2007, “The End(s) of Testing.” She told the same story of gaps and improvement and classifications of schools that can’t be interpreted if the validity of tests is largely unexamined. Baker went further and considered the many mitigations that had already been tried piecemeal to repair broken accountability systems. These included: multiple measures, measures of opportunity to learn, “having tests worth teaching to,” formative assessment, less-is-more fewer standards, and technology-based assessments. Instead of patching the present system, Baker proposed that we achieve a more balanced accountability system simply by reducing the amount of testing in the elementary years—where there was little hope of changing an embedded, skill-focused system—and then turning our attention to opening up more personalized opportunities for secondary students to engage with and demonstrate accomplishments through a system of Qualifications. Similar to Al Shanker’s (1988) merit badge idea and contemporary digital badge systems (Alliance for Excellent Education, 2013), Baker’s Qualifications were envisioned to be more than warmed-over performance assessments. They would be “validated accomplishments, obtained inside or outside of school,” “with integrated goals, tasks, learning experiences, criteria, and tests” (p. 313). Baker concluded that “unless we find something tangible, beyond a test score, that engages and fulfills students and teachers, education will continue to shrink and shrivel” (p. 315).
Concluding Commentary
Gathering good data is a critical part of any social science endeavor; and in education, well-constructed measures of achievement can further both individual student learning and program improvement. NAEP is an example of a large-scale testing program with broader and more open-ended representations of subject-matter learning than typical state and commercial achievement tests. It is also low stakes because accountability punishments are not tied to NAEP results. Because of NAEP, we were able to document the gains of Blacks in the South after desegregation and gains in basic skills in the 1980s at the expense of higher-order thinking and problem-solving abilities. NAEP results have even been a resource in revealing spurious state test score gains as in the case of the Texas Miracle (Klein, Hamilton, McCaffrey, & Stecher, 2000). NAEP has also provided trustworthy comparative data in countless studies examining natural experiments among state policies, in much the way that Rice and his contemporaries had sought comparable data to evaluate school programs.
I can also offer an example of how assessment tools can be instrumental in a research and development cycle aimed at transforming instructional practices. Early on in the field of physics education research, Eric Mazur (1997) discovered that his students, who could solve difficult quantitative problems, could not answer simpler qualitative questions about the same physical concept. This caused him to change both the nature of his examinations and his methods of instruction. Today, discipline-based education researchers (DBER) in physics, chemistry, engineering, biology, geoscience, and astronomy are beginning to take up findings from cognitive science to design more interactive-learning environments and to integrate both formative and summative assessment practices that are more conducive to student learning (Kober, 2015). Instead of memory-focused quizzes and algorithmic exams, formative assessment practices are valued because they surface student thinking and enable just-in-time, responsive teaching. Importantly, this entire body of work from research on undergraduate science learning is not part of an installed accountability regime. Potentially the DBER community could have greater success in enacting instructional and assessment reforms—grounded in research on learning—than K-12 learning-informed reforms under way since the 1990s but continually hobbled by existing belief systems and testing mandates.
Formally designed assessments and quantitative summaries of learning outcomes can be a force for good. Psychometric models and assessment design principles from the field of educational measurement joined together with the knowledge-base of cognitive psychology (Pellegrino, Chudowsky, & Glaser, 2001) are, in fact, a resource for the development of assessments in DBER described above. Examined honestly, however, the history of educational measurement cannot be a proud one—because too often narrow, technical contributions have been pursued without checking to verify whether the claimed benefits to education have, indeed, been accomplished.
Measurement in K-12 schools has been the cause of two great harms: the sorting of students who then received diminished opportunities and the cheapening of academic learning because of the constraints of standardized test formats. One hundred years ago, psychologists and education researchers together created pernicious systems of tracking; and for 50 years, these test-based practices survived with hardly any expert protest. Thank goodness that beginning in the 1970s educational measurement experts finally stood up and explained the limitations of “objective” tests and the harm of dead-end placements. Nonetheless, tracking continues to this day and is exacerbated by accountability structures. Measurement experts were slower to recognize the negative effects following from the constraints of standardized test formats and eventually did so only because of the evidence offered by subject matter colleagues.
The formative assessment practices I described in 2000 had deep roots in research on learning. Specific strategies from cognitive research such as attending to prior knowledge, substantive feedback focused on ways to improve, and instructional extensions to teach for transfer, can be taken up within a more encompassing view of learning based on activity theory and Lave and Wenger’s (1991) communities of practice. But this was not to be. Whatever the excitement of researchers and educators might have been about the possibilities of formative assessment in the year 2000, it was extinguished by the passage of NCLB in 2001. The pressures of unreasonable AYP targets described by Linn (2003) led to massive investments by school districts in interim assessments (Bulkley, Oláh, & Blanc, 2010), sometimes falsely advertised as serving the interactive formative practices supported by research. Because of the costs and sheer numbers of items involved to administer interim measures every 6 weeks, the quality of items and the representation of learning thus encapsulated in interim tests are often worse than on typical state assessments.
At the same time, NCLB also undercut the progress that had been made during the 1990s in improving the substantive quality of state tests. Led most prominently by the National Council of Teachers of Mathematics (1989), subject matter experts in mathematics, science, literacy, and social studies developed curriculum standards in the 1990s focused on deep learning—thinking, reasoning, problem solving, and conceptual understanding—instead of rote learning and algorithmic skills. These more ambitious learning goals required new, more open-ended forms of assessment, such as portfolios, performance assessments, and constructed response items, which a number of states had begun to implement. Unfortunately, NCLB mandates greatly increased the amount and cost of testing because of the number of grades required, and as a result some states that had moved in the direction of extended response questions reversed that decision (Olson, 2005). The well-known Maryland School Performance Assessment Program, for example, had to be replaced because its matrix sampling approach did not produce individual student scores (Baltimore Sun, 2003).
For education researchers the only way out of this morass is to try once again to convince policymakers of the corrupting effects of high-stakes accountability. Negative side effects from standardized testing are widely recognized, but the difference between proponents and opponents of test-based accountability is that the magnitude of the problem is perceived very differently. Perhaps we are coming to a place finally where the negative consequences can be seen to outweigh any hoped-for good. If we could roll back the pervasive pressures of accountability systems, measurement experts could play a role in supporting efforts to design new curricula and teacher professional development. But real hopes for transformative and equitable instructional innovations will have a greater chance for success if they are led by subject matter experts and learning scientists working with testing experts, not by testing experts working alone. Given the failure of high-stakes tests to drive meaningful change over several decades, perhaps policymakers can be persuaded to accept a sparser, less punitive framework of tests to track system progress and then to redirect resources to develop curricular and assessment resources that would not be reported beyond the classroom.
Once aware of our history, young scholars who want to be psychometricians might ask themselves what they would need to know—beyond statistical modeling—to be able to make contributions to the field like Cronbach, Messick, and Linn. Such principled and insightful contributions require a deep understanding of both constructs and contexts—of learning, if that is what is intended to be assessed, and of the educational context in which an instrument is to be used. There are exciting opportunities in domain-based education research that would allow measurement experts to work in immersive environments with learning scientists and subject matter experts to devise new means for eliciting and representing learning progress in ways more consistent with this century’s conceptions of knowledge and disciplinary practices. Small-scale curriculum projects especially hold greater promise for preserving the integrity of intended improvements than large-scale projects imposed from the top on too short a time line (Shepard, 2015). Young scholars as well as senior experts should be aware of how previously well-intentioned assessment systems have been corrupted and should adopt an ongoing research and development stance, whereby assessment innovations are tested to see if hoped-for benefits are realized and, if not, are further revised and evaluated.
