Abstract
This article investigates the persistent and changing elements of educational testing and assessment from 1920 to the present day. By examining the addresses and texts of American Educational Research Association presidents, I show a continuing focus on schools, from early experiments and development up through applications in accountability systems. Continuing topics include sources of test content and uses of tests for equity, effectiveness, support of teaching, and comparisons of alternative methods through experiments or references to standards. Although early writers appeared very close to school practices, later discussions expanded implications for policy uses.
This is an essay about testing and assessment, documenting across almost 10 decades the perspectives of a few selected American Educational Research Association (AERA) presidents, their emphases, and influences on the future. A large fraction of AERA presidents (by my estimate, more than 25% of them) included assessment and testing topics in at least part of their scholarly portfolio. Today, a substantial number of AERA members continue to contribute to assessment and measurement literature. (See American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 2014, or review any program of the AERA annual meetings.) Over the years, this community has invested in techniques to design and control the types and uses of tests so as to guide positive impact on education. Concepts of validity, quality, fairness, and utility have been essential elements of their studies.
At the outset of research and development (R&D) in testing, there was experimentation to learn what testing was and how it could be developed and then applied in education. In a relatively brief time, the use of assessment gained broad acceptability in school systems. As social science flourished, tests were used as dependent variables to measure the effects of new instructional options and programs, enabling the expansion of education research. The article will consider these transitions as reflected by the presidential texts.
Terms
I use the terms assessment and testing interchangeably to refer to systematic observations of student performance inferred either from products of examinees (such as essays or marked answer sheets) or from processes they demonstrate or use to accomplish outcomes. Measurement includes the particular analytic and technical approaches used to create and vet tests.
Testing Today
Quantitative indicators are found everywhere in today’s societies. In no way limited to education, tests play major roles in modern life. In a variety of fields, such as clinical psychology, they diagnose or verify psychological states (American Psychiatric Association, 2013). In the workplace, tests screen applicants for a broad array of competences, from physical strength or agility to general skills and specific knowledge and procedures presumed needed for job success (www.siop.org). In the professions, they certify the accomplishments of graduates of professional schools, such as in law (www.americanbar.org) or medicine (www.ama-assn.org).
Test results in education are valued because they purport to open a window showing students’ types and levels of knowledge and skills and, by inference, the effectiveness of the educational process to which they were exposed. A perennial contention is whether the window is open wide enough or how it may distort our view.
Review Strategy
My selection of presidential texts drew from the early years of AERA as well as from addresses in more recent times and spans almost a full century. I chose the specific texts to trace the sources of current positions about test development and use, and to find any intellectual dead ends and twists in the development of continuity.
It will be obvious to any reader that I am not an educational historian. Nonetheless, I attempt to situate the texts in brief sketches of the times in which they were written, with extra details for more distant periods. The background I describe is intended to remind the reader a bit about our history as well as to highlight educational developments.
Although tests are now found virtually everywhere in education, I limit myself in these few pages to external and formal (rather than teacher-made) assessments employed in typical precollegiate educational settings. Intended for purposes of accountability, program evaluation, or measuring progress, the results of these tests are claimed by proponents to inform and improve learning practices and educational policies. Note that these are the exact type of assessments now attracting heat and fire from vocal politicians, parents, educators, media, and scholars. The current situation is in part a consequence of various provisions of the Elementary and Secondary Education Act of 1965 (ESEA; 1974), whose most recent and softened iteration can be found in its 2015 reauthorization (Every Student Succeeds Act, 2015). Even this restricted set of tests, focused on mathematics and English language arts, invariably delivers far less than is promised in their names. Assessments are burdened by unsustainable claims for meaningful impact and improvement as well as by old and troubling assumptions. No wonder they engender disagreement, suspicion, and disappointment and have been periodic targets of dissent.
Over the years, external assessments moved in and out of supporting teachers’ judgments and now functionally substitute for that judgment in the halls of bureaucracy. From countless policy, media, and public reports, one could say that test results now embody the common definition of educational expectations at different grades for school subjects. To me, success on tests is greatly overvalued.
Transitions Over Time
I think the reader will discern the evolutionary path of tests in the minds of the referenced AERA presidents. They are generally an optimistic lot, perhaps choosing to turn the public moment of their presidency to analyze and advocate (and often overstate) the benefits of testing—but also to warn of its limits. In the course of their expositions, they appear to sometimes ignore the world around them—as scholars might—but expectations of the presidential addresses are always focused on scholarship. When considering their texts, I struggled to give an accurate and reasonably comprehensive examination of each address but at the same time find strands of work that persisted or changed. (In the course of the reading, I found that I also omitted contributions of major figures in the field, e.g., Lindquist, because their work was not strongly referenced in the presidential texts.)
Therefore, the reader should be alert to topics and approaches that recur and change throughout the pieces. These include, among others, the sources of test content, technical issues in test development, and issues of comparability and reporting. Presidents consider the uses of tests for system policy or evaluation and, of course, for accountability. In classrooms, tests are thought to motivate students, to sort them into different groups, and to improve learning, either by being embedded in instruction itself or by assisting teachers in their instructional tasks. It is easy to see the transition from the original technical development and use of an idea to its role in a broader political environment. Performance standards are an example of a technique that changed as tests moved into greater and greater public consciousness.
The Outset
Well before AERA began, tests were already a topic of interest, and lights and shadows of early work impact today’s perspectives. Horace Mann, the father of the common school, developed tests in order to determine how well students could perform various required tasks. Mann’s tests included explicit behavioral tasks extracted from school textbooks and classroom teaching activities (Phelps, 2007). They covered solving computational problems, spelling, and reading. Another early developer, Alfred Binet (1903), known for studies leading to intelligence testing, began his design by assembling a set of reasonable everyday tasks that could be used with young children. E. L. Thorndike (1914, 1915) examined the specifics of spelling and arithmetic among other subjects in the early years of the 20th century. Yerkes (1929) and Brigham (1923), among the authors of the influential Army Alpha examinations, used elements of literacy for their ability measures. They, along with Thorndike and others, also embraced the idea of preordained mental capacity, a position that they supported with World War I Army testing results. These findings, when arrayed by subgroups of native-born Whites, immigrants, and Blacks, gave rise to misbegotten ethnic and racial inferences. In the early part of the 20th century, these positions were mainstream (Yerkes was president of the American Psychological Association and went on to work at the National Research Council). Woodrow Wilson also openly discussed biases in the context of immigration policy. One will find these ideas embedded in the text of the first AERA president considered here, B. R. Buckingham.
President B. R. Buckingham (1920)
Social Context
To begin the AERA presidential journey, let’s start in 1920, a year lost in history for almost every reader, seemingly as remote as Revolutionary times. As we know, the United States was in the throes of change resulting from the Industrial Revolution and the related unprecedented waves of immigration. Buckingham was born only 13 years after the Civil War. The relative closeness of the U.S. agrarian past to his presidency is revealed in the order of census tables for 1920 (U.S. Census Bureau, 1922), giving prominence to states’ weather (hottest and coldest days) and irrigation and drainage summaries.
The total U.S. population was over 100 million, or less than one third of today’s count. Census interviewers personally visited households and in their demographic reports summarized age, gender, and household membership. The population’s “racial” distribution was reported using percentages by state in a four-category system: White (native-born or foreign-born), non-White, and other. For contemporary comparison, the 2010 census reported data for more than 60 different self-identified groups (U.S. Census Bureau, 2011). Although the census is normally used for redistricting, the 1920 census was not so applied because the dramatic population shifts from rural to urban America were creating rancorous political conflict and led, as a favor to Southern states that would lose congressional members, to the deferral of reapportionment.
Like today, although for somewhat different reasons, great social concern focused on immigration policy. The Army tests assessed over 1.7 million men to determine fitness for military duty, and their developers made clear their views of the unacceptability of immigrants from southern and eastern Europe, who came with lower literacy rates.
Thus, in the period around Buckingham’s presidency, anxiety was voiced about the decrease in immigration from northern Europe and the large numbers arriving from eastern and southern Europe, whose lower literacy rates were presumed to reflect less intelligence. (Much of Asian immigration had been all but eliminated by the Chinese Exclusion Act of 1882.) The surrounding logic was reported in Facts for American Education Week (National Education Association of the United States, 1922), then a publication of the Research Division of the National Education Association, the predecessor of AERA. Greater literacy, as a proxy for ability, would allow immigrants more rapidly to contribute to the economy. In the Immigration Act of 1917, provisions for literacy testing were included. Literacy tests for immigrants required them to read 30 common words in their native language.
Supporters of literacy tests also raised concerns about U.S.-born citizens who lived in “alien conclaves” (National Education Association of the United States, 1922) where they spoke their own language and held cultural values claimed to be antagonistic to American values. Suspicion of immigrants was widespread. Differential literacy rates between Blacks and Whites in the South were telling as well. Margo (1990) reports illiteracy rates of Southern Blacks at 26% and of Whites around 6%. Access to schooling for children ages 5 to 8 favored Whites only by about 10%, suggesting that the quality of teaching and poorer schools, as opposed to more odious hypotheses, might be an explanation for lower literacy rates of Black students.
Overview and Highlights of Text
The 1920 president, B. R. Buckingham, published the substance of his address with a coauthor, W. S. Monroe (who was AERA president in 1917). In their Journal of Educational Research article (Buckingham & Monroe, 1920), they advocated testing as a means to address inequality as it played out in urban and rural schools. Although they spent a good part of the article justifying their measures for the purpose of literacy assessment, their actual work also embodied elements found later in educational testing. For example, they used literacy and arithmetic as core content. They created a way to standardize scoring and reporting. They incorporated intelligence testing in their battery in part to find a basis to “equate” or at least compare the students in teachers’ classrooms. They also drew implications of their studies for educational improvement.
They supposed that differing teacher characteristics, education, and pay in urban and rural schools would affect levels of student literacy. They noted that the number of days in the school year also differentiated urban from rural schools. The authors’ interest was to determine the degree of literacy differences, and they illustrated the use of tests to understand the literacy problem and lead to its solution. Because teachers in rural settings had less education and frequently taught multiple grades and subjects in one-room schools, Buckingham and Monroe (1920) proposed standardized or common measures, designed to be easily administered, to clarify the extent of achievement differences.
Achievement and Intelligence Tests
Their study tested two school subjects, reading and arithmetic, administered in grade spans. They packaged an intelligence test with subject matter tests in order to assess mental capacity and to compare results to observed performance on subject exams.
Test Content and Administration
They report details of test development and administration. The intelligence test included subtests in analogies, arithmetical problems, sentence vocabulary, substitution, verbal ingenuity, arithmetical ingenuity, and synonym-antonym. Sections of this test were borrowed from work by Cameron (1921) and Pressey (1920) (Buckingham & Monroe, 1920, p. 524). Achievement tests were divided by grade span, Part 1 for Grades 3 through 5 and Part 2 for Grades 6 through 8. Part 1 used the Silent Reading Test developed by Monroe and provided scores on rate and comprehension. Eight subtests on arithmetic for Grades 3 through 5 are summarized as four combinations and four sample examples for each arithmetic operation. In Grades 6 through 8, Monroe revised his Silent Reading Test, and the arithmetic subtests included column addition, long multiplication, long division, subtraction, addition and subtraction of fractions, multiplication and division of fractions, and division of decimals (Buckingham & Monroe, 1920, p. 525). I provide this detail to allow the reader to compare contemporary test content with that used in the Buckingham and Monroe (1920) study. The authors also describe a procedure for bundling the three tests together for ease of administration and report that the instructions were carefully developed to be similar for all exams.
Innovations That Persisted Over Time
Buckingham and Monroe attended to issues still central to our agenda today. They created standard cut scores for achievement tests to support their comparability across settings and tests. Their argument for creating such standards was based on the unfairness of comparisons of within-class scores. They were particularly concerned that such comparisons would not be meaningful for smaller schools in rural settings where there might be only two or three students at each grade. They also created a less useful artifact, the achievement quotient (AQ).
The AQ related each student’s observed subject matter achievement to the mental age derived from the intelligence test scores. This combined quotient was to serve as an index for individuals. They enumerated multiple uses for the AQ, first claiming that it “affords a basis of prophecy” (Buckingham & Monroe, 1920, p. 527), as test results were to be systematically used to sort and place students. For example, they averred that intelligence combined with achievement should be used to determine which children were allowed to progress to richer curricula or to a regular track and which were counseled to take less demanding academic courses or to complete different levels of schooling. In addition to classifying student differences for future educational access, Buckingham and Monroe (1920) asserted that their standardization of scores allowed grade-level progression or benchmarks to be developed, another familiar topic in the contemporary testing world.
A second use of this index beyond placement was to make judgments about teachers and schools.
If results are below standard, the teacher and the school are likely to be criticized. This may be wholly unjust. The general level of mentality among the children may be below normal by even a greater amount than the scores in subject matter tests are below standard. (Buckingham & Monroe, 1920, p. 532)
Yet, because Buckingham and Monroe believed in the value of intelligence tests, as distinct from achievement tests, they went on to argue that teachers should be faulted if their students’ achievement was below the potential described by intelligence. Here they set up the future concepts of under- or overachievement, an idea that long persisted. (In my school life, I had been labeled both.) Over- and underachievement were roundly criticized on technical and conceptual bases by William Angoff (1971), who also referred to earlier, detailed analyses by John Flanagan (1951).
Presaging the quantitative emphasis found in the future of tests, research, and use, Buckingham and Monroe (1920) opined that their greatest contribution lay in the analyses of the scores (standard setting) and in the predictive relationship implied by their AQ.
Practical Uses
The authors offered a set of options for broader, practical applications of their study. For example, they recommended that the tests serve an accountability function (although not using that term) to determine the effectiveness of teachers, educational systems, and their superintendents.
Their discourse strongly implied that tests could make a difference in education. They suggested that achievement content for tests should come from well-known textbooks, following the practice of earlier writers. They championed putting exams into an easy-to-administer form, so that they could be used by less well-prepared teachers (especially in the rural areas, where importing proctors to administer exams would be impossible). They thought that information from test results could be used to improve instruction and to estimate the quality of teaching. They were very efficiency minded and saw their examination as serving multiple purposes.
In the efficiency domain, Buckingham (1920) also wrote about the best ways to develop measures of teaching efficacy in the domain of history. In this piece, while recognizing the importance of understanding and using historical ideas, he argued against testing the application of ideas in favor of testing historical knowledge. His proposal rested again on the efficiency of knowledge exams, and he justified his position by referring to the strong correlation between knowledge and more complex application essays. Beyond efficiency, his suggestions indicate he saw history proficiency as a broad construct that conflated historical knowledge with its use and interpretation. His views may very well have been appropriate for the time, if, as still persists in some assessments today, content accuracy is the dominant contributor to essay rubrics. The theme of efficiency will continue through much of testing, where simpler, less expensive proxies are substituted for harder-to-assess outcomes.
Summary and Implications
Buckingham and Monroe engaged in empirical studies, and the study they reported has resonance for the future. First, they addressed inequality (between rural and urban students, where “rural” also included a large component of Black students). Second, they conducted careful development and documented their administrative procedures. They explored reporting issues using their AQ as an index that included mental capacity (see Brigham, 1923; Yerkes, 1923). They thought that the AQ would be useful for predicting performance and sorting students into groups intended to benefit from simpler or more advanced curricula. The fallacy of the AQ is a stark limitation to their work and to successive studies using IQ as an indicator of capacity to learn.
On the more positive side, they advocated the use of test results for a variety of reasonable purposes, including empirically based contrasts of settings, support for instructional improvement, and benchmarks to determine student progress and to make comparisons among groups using external standards of performance. They also considered using test results to make judgments about teacher competency as well as about the effectiveness of educational systems or superintendents. All of these functions can be found in the works of later scholars and reflected in part in the texts of AERA presidents reviewed in this document, including the next, John Stenquist.
President John Stenquist (1933)
The year that AERA president John Stenquist wrote his article (1933), the world, our country, and education were suffering economically. The United States was deep into the Great Depression, Franklin Delano Roosevelt had just been elected to his first term as president, and Adolf Hitler was appointed chancellor of Germany. Economics played a big part in the run-up to World War II. In 1933, millions of Americans were on the road looking for work, the country was experiencing a 25% unemployment rate, and approximately 20,000 schools were closed for lack of tax revenues and other funds.
Thus economic factors were paramount and overwhelmed many educational options. Policy choices to meet funding shortfalls included closing schools, moving to larger class sizes, shortening the school years, and using social promotion to move students either to graduation or to drop out. School closures hit Black students and others in the South most severely, and population movement continued from the rural South to cities (U.S. Census Bureau, 1930). The invention and wider distribution of the automobile provided a method to reduce costs by supporting the consolidation of schools, resulting in larger units for more students yet employing fewer teachers. These facts combined to create a climate for increasing agitation by teachers to maintain their jobs and the quality of education for students.
Overview of Text
In 1933, Stenquist wrote on the topic of “Recent Developments in the Use of Tests,” published in the Review of Educational Research. At the outset, he noted that tests had moved from a novelty to a major feature of schooling. He commented, “Today the use of educational tests has become almost as commonplace as that of textbooks” (Stenquist, 1933, p. 48). Much of his article was given to summarizing the large accumulation of technical literature on testing, with the mission to show the contemporary place of testing in education. He catalogued a number of significant individual studies and cited compilations of research and bibliographies. He listed findings of influential researchers, including Kelley (1930), Pressey and Pressey (1922), Thurstone (1932), and Thorndike (1915), among others. He organized his topic by the uses of tests.
Specific Test Uses
Stenquist was also an advocate of efficiency, important in this period, as well as of the behavioral science supporting it. He outlined four uses of tests, for which he provided examples and commentary shown in Table 1.
Table 1. Stenquist’s List of Testing Applications
Source. Stenquist (1933, p. 50).
For each use, he presented examples and embedded critiques, noting under the first use that intelligence tests were originally applied to classify students but that achievement tests were also in contemporary use for that purpose, a change from the Buckingham and Monroe days. As Buckingham and Monroe had predicted, tests were also used to evaluate the effectiveness of practices and policies. Group classification practices had led to a greater interest in individual differences and new classroom organizations.
Stenquist (1933) addressed instructional research in individualization, where tests were used as dependent measures. For example, he described empirical support for the Winnetka plan, an extension of John Dewey’s work by Carleton Washburne (1932). In this individualization plan, social and emotional factors and creativity were added to regular school subjects as desired outcomes, precursors of present-day interests. Features of the Winnetka plan depended on initial tests (pretests) and prescribed self-instruction and self-testing as efficient means to ameliorate student difficulties. Stenquist reported studies of students’ progress and subsequent achievement that favored this form of individualized instruction, anticipating midcentury efforts in programmed instruction. On the basis of empirical findings, he supported the development and further use of diagnostic tests for individualization of students rather than sorting students into homogeneous groups for instruction. For instance, on the matter of assignment to groups, Stenquist reported results that showed no advantage for homogeneous grouping and a great deal of overlap among groups. Grouping remained a matter of contention in the late 20th century, when similar findings were reported (Oakes, 1985). Aspects of the Winnetka experiment are echoed in contemporary claims for the value of formative assessment and the design of diagnostic and adaptive technologies.
Uses in Administrative Policies
Stenquist (1933) organized his discussion of various “administrative policies” to include studies of class size and social promotion. These applications were his only allusions to the economic difficulties facing the country and the educational system, difficulties that required greater numbers of students in each class. He also advocated teachers’ use of examination results to determine the promotion of students, a practice that would, at the same time, eliminate the evils of “courtesy promotions” (Stenquist, 1933, p. 53), a method to move students out of school.
From Measures to Guidance
As part of the discussion of the curriculum focus of tests, Stenquist (1933) suggested that tests should function as more than just measures of progress or outcomes. He proposed that the tests themselves should be used to decide the content and sequence of particular curricula, a major conceptual shift from earlier work that used textbooks and teachers as common foundations of test development. Although he listed objectives in his table, he noted (with greater honesty than sometimes found today), “The availability of standardized tests controlled to a large extent what objectives of education were objectively measured” (Stenquist, 1933, p. 56). He continued later that “criticism about the difference between what was taught and what was tested resulted in recommendations to prepare standardized tests covering more of the objectives and to utilize objective test(s) in the measurement of the entire list of acceptable objectives” (Stenquist, 1933, p. 56). The reader will recognize that the source of test content continues to evolve in later years. Stenquist noted positively that Ralph Tyler (1931) had suggested a more inclusive set of objectives in his then-active studies of progressive curricula. Tyler (1949) in his later work developed an integrated system for creating and “screening” goals and objectives, instructional opportunities, and testing and evaluation. Tyler’s work has remained influential to the present.
Practical Uses
At the time of Stenquist’s AERA presidency, there were also efforts to use tests in other practical contexts, not unlike those described by Buckingham and Monroe. For instance, Stenquist (1933) reported test use to supplement teachers’ judgments about students, with the comment that teachers’ grades became more reliable when objective rather than essay tests were employed for that purpose. Stenquist also considered grading on the curve, a practice he thought might work for very large groups but found a less suitable idea for smaller classrooms. Notwithstanding his views, grading on a curve was frequently used in the late 20th century by both precollegiate and postsecondary instructors irrespective of class size. Foreshadowing the pervasiveness of test results in today’s data-driven world, he reported that supervisors had begun to use tests to manage all parts of the education system, a clear extension of control by results.
Research Uses of Tests
On the research front, Stenquist (1933) pointed to experimental comparisons of interventions, for instance, different problem-solving approaches, that used as dependent measures tests of various topics and age ranges. Stenquist described how tests could shed light on learning and cited the utility of studies of student errors conducted by Pressey (1925), Morton (1925a, 1925b, 1925c), and Streitz (1924). These error analyses were presumably designed to help teachers pinpoint problems in order to overcome student learning difficulties. In particular, he promoted the benefit of scientific and more precise testing to assist teachers in determining which students possessed particular difficulties, connecting backward to Buckingham and Monroe and forward, once again, to item analyses, buggy intelligent tutors, formative assessment, and adaptive learning systems.
Summary and Implications
Stenquist’s (1933) writing showed that he strongly embraced efficiency, behavioral science, and control. He implied that a massive investment in testing was to be an important part of the solution to educational problems. Only in a small way did he incorporate (in his address) mention of progressive education, the other major development of the prior 20 years. He cited contemporary efforts by Washburne and Tyler. But the tenor of his extensive review and advocacy for empiricism contrasted with the then-contemporary work by Hutchins (1936), who cautioned against the overuse of social science when considering the important intellectual goals of education.
He illustrated the continuing investment in educational testing by AERA and psychological researchers. Stenquist’s compendium included a growing archive of empirical research on tests. Conducted in schools using tests as both independent and dependent variables, these studies worked in concert with other attempts to solve practical educational problems. Stenquist’s categories of uses can be found today in somewhat different form. He noted the value of testing in the design of curriculum, in the assessment of students’ status and progress, in the development of individualized learning programs, in the evaluation of systems, and as support for teaching and new instructional approaches.
Presidents W. James Popham (1978) and Robert L. Ebel (1973)
Rationale
Although much fascinating work in testing occurred in the period from 1933 forward (including new analytical techniques, extensions of research on individual differences, studies of test use in World War II, and Lindquist’s creation of test design, scoring, and analysis technologies), I skip decades ahead for practical, intellectual, and personal reasons. I chose 1978 in part because the debate format chosen by president W. James Popham handily gives us the viewpoints of two AERA presidents with illustrious careers in measurement (the other being Robert L. Ebel, president in 1973). The debate was especially interesting not only for its adversarial nature but because the two presidents exemplified in their remarks near end points on the continuum of positions about proper test development and use. Moreover, their disparate positions reflected back to earlier views discussed by Buckingham and Monroe, and Stenquist. Last, I was there at the debate, and I wonder if this summary captures the strength of the disagreement and the charged atmosphere that day.
Social Context
In the period leading up to 1978, the Depression ended with the U.S. entry into World War II. Following the war’s conclusion, an unprecedented expansion of education occurred with the GI Bill. The Sputnik satellite launch by the Union of Soviet Socialist Republics (USSR) stimulated the United States to pass the National Defense Education Act (1958), an immense spur to education in curriculum, in studies and improvements of teaching, and in the expansion of R&D in instruction, product development, and later, computerized instruction. The Cold War between the USSR and the West led to less successful military actions in Korea and Vietnam, conflicts disproportionately fought by poorer, less educated young men. Assassinations shook the country with the losses of John F. Kennedy, Martin Luther King Jr., Robert Kennedy, and Malcolm X.
These events impacted education at all levels. The draft of young men into military service and the power of television’s war and civil rights coverage combined to produce a broad-based, antiauthority movement. Its core embodied antiwar, pro–civil rights, and women’s equality protests. These protests began with higher education students but spread to secondary school pupils as well. Although cliché now, the slogan “peace, love, and rock and roll” labeled one side of the political conflict. For the younger reader, here are two realities of almost 40 years ago. In 1978, (a) chaos reigned in the Middle East, and (b) technology awareness, which had begun 9 years earlier with the moon landing, accelerated as new forms of computational support pushed it into more rapid change.
Testing Marches On
As protests in education pushed against the standard authority of institutions, many teachers embraced more cooperative, democratic classroom practices. Perhaps as reaction, educational testing became more firmly entrenched as a way to assert institutional control.
Most of K–12 education was moving toward even greater investment in tests. At the national level, education programs became the focus of the evaluation and accountability associated with the passage of the first ESEA, and new federal programs for the most part focused on improving performance of underserved students, for instance, Head Start and Follow Through. Evaluations compared effects of competing Head Start and Follow Through models over extended periods, that is, 1969 to 1977 (Hubbell, 1983; Stebbins, St. Pierre, Proper, Anderson, & Cerva, 1977). As a result, the evaluation enterprise expanded and with it, more interest was engendered in methods to conduct these studies. For many evaluations and policy investigations, results of tests were essential to their conclusions. To assess the status of schools and to evaluate the impact of new programs, states and local districts also regularly administered standardized tests in excess of federal requirements, a practice not unknown today. Comparisons of reading models with back-to-basics and competency-based programs were conducted at the national, state, and large-city levels. Taken together, evaluation and testing sustained extensive scholarship, propelled graduate training, and generated new employment options for researchers in both university and non-university settings.
Initiatives at many universities moved research-based instructional and assessment programs into schools and set the stage for the growth of computer-based instruction. These programs, at their outset, constituted variants of the test-teach-test approach discussed by Stenquist and, more recently, applied in programmed instruction in the 1950s and 1960s.
An emphasis on behavioral curriculum objectives and instruction was prevalent but drew strong criticism from those affiliated with transformations of progressive education. These groups supported emerging alternative or open education. Behavioral approaches were augmented by psychologically oriented research in information-processing models of learning (Gagne, 1985). This work paralleled a reconsideration of cognitive models applied to learning and teaching, as earlier advocated by Bruner (1966), catalyzed especially by the research of early computer scientists. Perhaps ironically, as one major historical cradle of progressive education, the University of Chicago developed initiatives that contributed strongly to the testing movement. First, Ralph Tyler’s (1949) work in curriculum gave guidance on developing objectives, learning experiences, and assessments. Second, the Taxonomy of Educational Objectives (Bloom, Engelhart, Furst, Hill, & Krathwohl, 1956) was also a Chicago product. Its authors created a model that ordered illustrative test items along a continuum from recall and recognition to higher-order thinking.
Bloom’s Taxonomy exerted powerful influences on expectations for instruction and tests. Adoption reviews of standardized tests were conducted to determine the taxonomy levels represented by different test items. Critics alleged that despite claims of measuring advanced skills for general constructs, far too many test items actually required only the recognition of facts to get the correct answer. Both instruction and assessment began to consider higher-order skills embedded in content rather than as general, stand-alone measures of critical thinking or creativity. Although such work has continued to the present time, only relatively recently have the ideas of cognitive complexity and deeper learning resurfaced in the testing world, outside of performance testing (Pellegrino & Hilton, 2012).
The Setup
The debate at the 1978 presidential session pitted the current president, W. James Popham, against past president Robert L. Ebel (who served in 1973). Their then-hot topic was the comparative merits and utilities of norm-referenced or criterion-referenced testing. From one perspective, their arguments captured the differences between more behavioral and goal-focused assessments and construct-oriented large-scale standardized tests, the latter much supported for their economy and familiarity to educational policymakers. Differing in views of test design, analysis, support for teaching, and ways to report results, the two scholars took clear, opposing positions (Ebel, 1978; Popham, 1978).
Norm-Referenced Tests
Robert L. Ebel, a major figure in standardized test development and analysis, contended that the value of standardized tests resided in their claim to measure broader abilities rather than disconnected pieces of knowledge. Norm-referenced tests, Ebel maintained, could be demonstrated to assess important constructs of school learning, such as mathematics ability or reading comprehension. Moreover, results could be clearly interpreted by comparing a student’s results to those attained by other students. These results could be framed as grade equivalence, that is, by comparison with the average score attained by presumably comparable students at the same grade level. Other reporting metrics used standard scores, showing how an individual’s results differed from the mean by converting scores to stanines or percentiles. Score meaning, then, derived from the place a person’s score fell in the distribution of other examinees. Recall that this reporting tradition had roots in early measures designed to select, assign, or place students, where the best students or groups might be selected for limited special opportunities or admission to further education. It was familiar and somewhat intuitive.
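To make the norm-referenced reporting metrics concrete, the following minimal Python sketch places one score within a norm group and reports it as a percentile rank, a standard (z) score, and a stanine. The norm group values and the function names are hypothetical, chosen only for illustration. The stanine here is approximated from the z score (mean 5, standard deviation 2, clipped to 1 through 9), whereas operational testing programs typically assign stanines from percentile bands.

```python
import math
from statistics import mean, pstdev


def percentile_rank(score, norm_scores):
    """Percentage of the norm group scoring below `score`, counting ties as half."""
    below = sum(s < score for s in norm_scores)
    ties = sum(s == score for s in norm_scores)
    return 100.0 * (below + 0.5 * ties) / len(norm_scores)


def z_score(score, norm_scores):
    """Distance of `score` from the norm-group mean, in standard deviation units."""
    return (score - mean(norm_scores)) / pstdev(norm_scores)


def stanine(score, norm_scores):
    """Approximate stanine: rescale z to mean 5, SD 2, round half up, clip to 1..9."""
    band = math.floor(2.0 * z_score(score, norm_scores) + 5.5)
    return max(1, min(9, band))


# Hypothetical norm group of raw scores (an assumption, not data from any study).
norm_group = [10, 20, 30, 40, 50]

for raw in (30, 50):
    print(raw, percentile_rank(raw, norm_group), stanine(raw, norm_group))
```

In every case the meaning of the score comes from its position relative to other examinees, not from a description of what the examinee can do.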
Criterion-Referenced Tests
The criterion-referenced claims were very different. As Glaser (1963), another AERA president with expertise in learning and testing, had written earlier, the key elements of criterion-referenced tests were how content should be selected for the test and how test scores should be interpreted. To the first point, criterion-referenced tests attempt to delimit or parameterize the tested domain, a process illustrated later in important studies by Hively, Patterson, and Paige (1968). Explicit boundaries on a domain (using an analog of set theory) would identify content that should be taught and tested—contrasted with content that was out of bounds and so not to be tested. Although such boundary rules worked well in mathematics, they were somewhat strained when applied to the humanities (where enumeration of works or examples to be learned substituted for more elegant boundary statements). Both rules and examples of “fair test content” were argued to be the legitimate focus of teaching, educational programs, and tests. Norm-referenced tests were better known, as they were older and by implication more credible (Ebel, 1978, p. 3), but the reader may also remember that very early test development practices described clearly and operationally the task on the test (Binet, 1903; Thorndike, 1914, 1915), similar to criterion-referenced practices.
Popham advocated task specificity in the design and development of criterion-referenced tests. In these tests, score meaning was derived from Glaser’s two features. First, sampling from within the well-described (bounded) domain was important for the design of instruction and relevant assessments. Second, the distribution of content was divided into different criterion levels intended to specify operational and ordinal differences of expertise for different bands of scores. Data would be reported as the percentage of students attaining particular achievement levels. Norm-referenced interpretations, by contrast, grounded their results in comparisons to other people (or perhaps to other schools) to distinguish the better from the worse. Proponents argued that criterion-referenced tests could answer the question: how much of the content specified does the examinee know or know how to do? Yet, most criterion-referenced tests made no real attempt to sample the universe of items within their boundaries.
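The criterion-referenced style of reporting can be sketched in a few lines of Python. The level labels, the cut scores, and the student scores below are all hypothetical assumptions made for illustration; they are not drawn from Popham, Glaser, or any actual assessment program. The sketch only shows the form of the report: each score is classified against fixed cut scores, and results are summarized as the percentage of students at each level.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical achievement levels and percent-correct cut scores (assumptions).
LEVELS = ["below basic", "basic", "proficient", "advanced"]
CUTS = [40, 60, 80]  # a score equal to a cut counts as reaching that level


def level_for(score):
    """Classify one percent-correct score against the fixed cut scores."""
    return LEVELS[bisect_right(CUTS, score)]


def level_report(scores):
    """Percentage of students at each achievement level."""
    counts = Counter(level_for(s) for s in scores)
    return {level: 100.0 * counts[level] / len(scores) for level in LEVELS}


# Hypothetical class of eight students.
class_scores = [35, 40, 59, 60, 79, 80, 85, 90]
report = level_report(class_scores)
for level in LEVELS:
    print(f"{level}: {report[level]:.1f}%")
```

Unlike a percentile rank, a student's classification here does not change when other students score higher or lower; it changes only when the cut scores or the student's own score change.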
The Discussion
Ebel’s (1978) arguments for norm-referenced measures amplified some of his 1973 AERA address; in particular, his view was that any test is more useful and understandable when it measures broad outcomes rather than particular, discrete collections of items. This approach was supported by a wealth of traditional psychometric procedures built upon normal distributions.
Popham (1978) contended that the reporting of results in terms of student norming distributions too strongly influenced the details of test design and development, like the tail wagging the dog. He suggested that ideal items for norm-referenced interpretation would have a difficulty of about 0.5 (for a given student, a 50-50 chance of getting the right answer) in order to generate sufficient variation for norm-referenced interpretations. He argued that this desired difficulty level would compel test developers to discard items on which many students did well. Over time, those test items that reflected well-taught objectives would be deleted from the examination because they did not contribute to desired test variance. What would be left were items that fit a normal distribution and not those that actually reflected good instruction experienced by the students. As a result, the testing (or accountability) system would underrepresent the actual competency of students and, by inference, the quality of schools. It should be said that there was considerable disagreement about this premise in general, voiced in the debate venue.
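The arithmetic behind the 0.5 difficulty target can be illustrated with a short Python sketch. For an item scored right or wrong, the variance of item scores is p(1 - p), where p is the proportion of examinees answering correctly, and this quantity peaks at p = 0.5. The item names and difficulty values below are hypothetical, and the ranking rule is a deliberate simplification of item selection, offered only to show the mechanism Popham described, not to settle the disagreement noted above.

```python
def item_variance(p):
    """Variance of a right/wrong item answered correctly by proportion p."""
    return p * (1.0 - p)


# Hypothetical items with their proportion-correct values (assumptions).
items = {
    "well_taught_item": 0.95,  # nearly everyone answers correctly
    "middling_item": 0.50,
    "hard_item": 0.20,
}

# Rank items by how much score spread each contributes, largest first.
ranked = sorted(items, key=lambda name: item_variance(items[name]), reverse=True)

for name in ranked:
    print(name, round(item_variance(items[name]), 4))
```

Under this simplified rule, the item that almost all students answer correctly contributes the least variance and would be the first candidate for removal, which is the dynamic at the center of Popham's argument.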
Popham (1978) also noted that tests with very general specifications were harder to understand for teachers, as broad constructs could have various legitimate meanings and interpretations. To support the economics of norm-referenced tests, the general nature of broadly described constructs, for example, mathematics ability, was needed, said Popham. It was his belief that broad test topics could be marketed more widely because potential buyers find their local goals subsumed by these descriptions. Popham’s earlier work had noted the similarities of norm-referenced achievement tests and aptitude tests, which were more akin to the earlier sort-and-classify test uses.
Popham (1978) explained that for important purposes, in particular, evaluation and instructional improvement, criterion-referenced tests “properly fashioned” could provide greater clarity for teachers and, by extension, for learners. Of course, “properly fashioned” was a condition that applied to both approaches. It is fair to say that the domain parameterization (or enumeration) desired by Popham was not then a reality, perhaps as much of an aspiration as that of fully measuring general constructs with a limited number of survey items. Only recently has parameterization been widely possible through computational support.
The differences in approach were well formulated by the two debaters. Many in opposing camps during and following the debate countered the positions of each debater, and the dispute continues in slightly different form today. Popham’s position, then as now, is that tests should provide operational and meaningful targets appropriate for teaching.
Fought to a Draw
In summary, the Popham-Ebel argument hinged on different perspectives on the quality and utility of assessments. Ebel brought to bear the wealth of traditional psychometric approaches, then the stuff of graduate courses in measurement and of the guidelines used by commercial test developers. Popham sought to link operationally what a teacher could actually do legitimately to improve student scores. Succinctly, his position was that teachers needed very clear guidance on what to teach, just as Glaser argued the need for clear instructional guidelines. The debaters both addressed practicalities of the reporting systems of norm-referenced and criterion-referenced tests.
Epilogue
And what has happened since 1978? When confronted with a survey-type norm-referenced test, many districts and schools began to purchase materials (often sold by the same test vendor) for students to practice items just like those they would encounter on the test. The greater clarity achieved by criterion-referenced (and standards-based) tests turned out to be no protection; the use of practice test materials has also increased as greater consequences were attached to criterion-referenced test results. These practice materials may undermine both kinds of tests, that is, those of broad constructs and specific standards, particularly if practice focuses strongly on test formats. Furthermore, explicit periods of practice for tests give ammunition to those who believe that the schools’ curricula are limited to a few criteria of quality and that instruction spends too much time on test-related activities. Schools may also use interim assessments designed to predict outcomes and formative assessments intended to uncover and remedy student difficulties.
Accountability and inferences about school effectiveness strongly depend on test results. Cheating is a real concern and corporations now exist with the mission to detect and protect against cheating on external examinations. Popham’s view was that teachers should know what is expected of them, and thus, by inference, they should not rely on special practice windows or unethical practices. When accountability consequences depend on test results, it is argued that interpretation of results is invariably corrupted (Campbell, 1976; Koretz, 2008).
The Next 25 Years
In the interim, between 1978 and the address by president Robert L. Linn in 2003, a hybrid form of tests emerged, one that incorporated elements of both parts of the Ebel-Popham debate. Many tests were designated as criterion-referenced tests but included items developed in the norm-referenced tradition, that is, survey items of a broader construct. However, the results, as Popham earlier worried, were reported in terms of “criterion-referenced” achievement levels, that is, as percentage of examinees attaining each achievement level. As Popham predicted, assessments might seem to have criterion-referenced benefits, but their actual items did not invariably reflect important, teachable tasks.
On the policy front during the period from the Popham presidency to the last address by Robert Linn, accountability and school effectiveness indicators have relied more and more on formal assessments with unknown sensitivity to instruction, although there have been recent studies of scalable methods to link the two. Until there are clearer crosswalks between scalable indicators of instructional quality and test results, test results as measures of school effectiveness continue to raise concerns. Because of murky relationships between instruction and tests, questions about procedures for how to set criteria or achievement levels have been matters of technical and policy discussion.
President Robert L. Linn (2003)
As the years approach the present, the need for history recedes and my personal connections grow stronger. I worked with Jim Popham as a student and colleague at the University of California, Los Angeles, in the 1960s and ’70s (together, we promoted bumper stickers saying “Help Stamp Out Non-Behavioral Objectives”). From the mid-1970s until his recent death, I was one of the many colleagues, friends, and admirers of Robert L. Linn, the 2003 AERA president. Known to the world as Bob, he wrote his AERA address after observing the period of growing accountability pressure in the prior three decades.
The year of Bob Linn’s presidency, 2003, was memorable for many reasons. The attack on 9/11 was still fresh in our minds. Bob and I had traveled in 2002 to Washington, D.C., because the No Child Left Behind (NCLB; 2001) reauthorization of ESEA was beginning to be implemented.
NCLB
This authorization continued a decade-long increase in the emphasis on tests as prime indicators in accountability plans. Based on the earlier report of the National Council on Education Standards and Testing (1992), where the development of a narrative about voluntary national standards and assessment began, the Improving America’s Schools Act of 1994 (IASA) enacted ESEA legislation that required all students to take assessments, not only those in federally supported compensatory programs. IASA also was written to add multiple measures and noncognitive indicators to accountability requirements, in part to reduce the sole dependence on test results. Although intended to include ideas such as opportunity to learn in classrooms, access to learning materials, and outcome measures more unique to individuals, in the final IASA rule making, additional indicators were limited to archival information, such as absences and dropouts. In many states with IASA, accountability began to have clear consequences, with labels as well as dispatches of help to schools that were “underperforming.” Policy activism occurred through the work of the National Educational Goals Panel (1999), which listed expectations for performance subsequently included in NCLB. The provisions of NCLB amplified requirements, including that there be adequate representation of all subgroups on testing day, and subpar performance in any category set a school up for various degrees of improvement efforts and, if unsuccessful, for a label as a failing school.
The Text
Bob Linn (2003) aptly took as his subject accountability in the new ESEA legislation. His presidential address targeted details of the accountability provisions of NCLB. Linn did not focus on test content, development models, or analyses, surely among his scholarly areas of achievement. Nor did he examine test content as it enhanced or inhibited teaching and learning. His concerns were larger. He began his talk by calling again for shared responsibility among those contributing to accountability systems, including researchers, along with those who mandated and those who are acted upon by accountability. He recognized that features and effects of accountability systems had communal sources and were not simply political in inspiration.
Linn (2003) moved to discuss some particulars of NCLB accountability provisions. He noted his clear concern about the narrowing of curriculum to match what was actually tested. He cited Lorrie Shepard (1990), a past AERA president, and Stecher and Hamilton (2002), among others who had conducted research in this area. He again referred to Campbell’s law on the corruption of public indicators (Campbell, 1976).
Goals and Achievement Levels
By far, Linn’s (2003) longest thesis addressed the potentially damaging effects of using poorly rationalized, preordained benchmarks and progress indicators. NCLB called for states to provide year-by-year improvement plans showing how the percentage of proficient students would rise until all children attained proficiency by 2014. By providing an annual formula for achieving this goal, each state made explicit its plan for adequate yearly progress (AYP).
With regard to the goals to be attained, Linn (2003) raised the need for an “existence proof” (p. 4), explaining, “That is, we should not set a goal [both 100% attainment and increasing performance standards] for all schools that is so high that no school has yet achieved it.” He also summarized the likely effects of NCLB provisions that permitted each state to adopt its own content standards and assessments, noting that disparities in rigor among states would occur because the federal government “is reluctant to get into the business of specifying content coverage” (Linn, 2003, p. 5).
Linn (2003) first demonstrated that state proficiency levels had varied meanings based on the rigor of different states’ standards and assessments. For illustrative states, he compared the percentages of students reaching achievement levels by mapping selected state percentages against a common benchmark, each state’s attainment on the National Assessment of Educational Progress (NAEP). Linn had been the cochair of the National Academy of Education’s Study on State by State Use of National Assessment and the chair of the Design and Analysis Committee of the NAEP.
In his analysis of state assessments and NAEP, Linn (2003) made two important points. The first was that some states may demonstrate relatively good performance against their own standards but show no or smaller increases on NAEP. The NAEP achievement levels would be the basis upon which to compare disparate assessments and achievement levels in states, although he allowed that NAEP achievement levels are “clearly ambitious, perhaps too ambitious” (Linn, 2003, p. 6). Second, he provided data to show that given available rates of change in AYP for reading and math, extreme acceleration in growth would need to occur to meet any 100% proficiency goal. Accountability consequences were to ensue when schools fell behind their blueprint for AYP. Some states began with small percentages of growth at the outset of their AYP plan and predicted (and deferred) large annual increases for years closer to the 2014 goal. “Such rapid acceleration would be nothing short of miraculous,” Linn (2003, p. 6) commented. Even at the outset, it was clear that the 100% proficiency aim was a political and not an educational inspiration. As it turned out, waivers in the out years exempted states from adhering to their unrealistic AYP plans. Some states had successfully gamed a bad system.
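The acceleration problem Linn described is a matter of simple arithmetic, sketched here in Python. All numbers are hypothetical assumptions (a state starting at 40% proficient in 2002), not figures from Linn's address or from any state plan. The sketch compares a steady trajectory toward 100% proficiency by 2014 with a back-loaded plan that schedules small gains early and defers the remainder to the final years.

```python
def required_annual_gain(current_pct, current_year, target_pct=100.0, target_year=2014):
    """Percentage points per year needed to reach the target on a straight line."""
    years_left = target_year - current_year
    if years_left <= 0:
        raise ValueError("target year must be after the current year")
    return (target_pct - current_pct) / years_left


def backloaded_final_gain(start_pct, start_year, early_gain, early_years):
    """Annual gain needed in the remaining years after a slow early period."""
    pct_after_early = start_pct + early_gain * early_years
    return required_annual_gain(pct_after_early, start_year + early_years)


# Hypothetical state: 40% proficient in 2002 (an assumption for illustration).
steady = required_annual_gain(40.0, 2002)
deferred = backloaded_final_gain(40.0, 2002, early_gain=1.0, early_years=8)

print(f"steady plan: {steady:.1f} points per year")
print(f"back-loaded plan, final years: {deferred:.1f} points per year")
```

Under these assumed numbers, the back-loaded plan requires annual gains in its last four years that are more than twice the steady rate and thirteen times its own early rate, which conveys why such deferred trajectories struck Linn as implausible.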
Linn’s Contribution
A number of features of the Linn (2003) address are notable. First, he was able to put complex assessment issues into commonsense context and language, a job especially valued because his own scholarly reputation backed up his use of less complicated discourse. Through relevant data, he showed that unrealistic accountability policy came from politicians, policymakers, researchers, and other educators, together accepting a common improvement narrative without detailing the means by which real change could be accomplished.
Furthermore, for most of his professional life, as those who knew him would attest, Linn valued balanced opinion and plain talk. He avoided hyperbole or rash predictions, preferring to write analyses that relied on data to make his points. In this piece, directed to a group broader than his measurement colleagues, he again marshaled data to support his positions and revealed in his choice of language the strength of his beliefs. By using rare modifiers (for him)—such as remarkable, ambitious, and miraculous—in an obviously sardonic tone, he underscored the strength of his disagreement with the new accountability provisions. Note that Linn and others actively and futilely sought to influence the NCLB provisions while the law was under development by Congress.
Linn’s concerns remain relevant today. Comparability among interpretations of state achievement scores continues as a problem. In the light of the return of states to the accountability provisions in the newly authorized Every Student Succeeds Act, the problems of benchmarking and progress Bob Linn described will doubtless both persist and expand.
Conclusions
In this essay, I have written about testing as seen through the views of selected AERA presidents. My goal was to share perspectives that presaged subsequent work in educational assessment. I had hoped to uncover ideas that could make new contributions.
Testing is a decision-oriented enterprise (see Cronbach & Suppes, 1969). The earliest writers, Buckingham and Monroe, focused their attention on using measures for multiple purposes: to compare settings (and personnel), to assess students’ skills in school subjects for the purposes of determining who is best “fit” for various curricula or other educational offerings, to assist teachers in finding learning difficulties, and to compare (the efficacy of) superintendents and systems. They were attuned to the need to use data to make comparisons. The context surrounding the first writers was replete with notions of fixed mental capacity and, more odiously, group differences. Buckingham and Monroe perhaps more innocently argued to integrate subject matter and intelligence tests together, with their now-discounted AQ. In their view, intelligence tests were also supposed to provide expectations about what individuals could achieve and to serve as an adjustment for classroom effectiveness, a role now sometimes attempted with student background indicators.
In the earliest days of assessment work, the focus for achievement was school learning, using content derived from textbooks. A little more than a decade later, Stenquist noted that test use had expanded and could (when summarized in objectives) be considered as the source of curriculum itself rather than the other way around. He clustered functions of tests to include administrative and classroom uses, applied research and evaluation in studying interventions, and embedded tests into instruction to serve diagnosis and confirmation of learning.
The debate between Popham and Ebel illustrated serious differences in opinion that grew from the alternative description of constructs, measures, and development and reporting of results. Ebel was happy with general constructs and surveys to assess understanding, whereas Popham argued that clarity was needed so that test descriptions could guide teachers’ classroom instruction. It is true today that both kinds of assessments show gains after initial administrations. What is not yet clear is what causes this growth.
Linn’s address reflected on the accountability uses to which standards and reporting were put in NCLB. Earlier discussions of equity and of improvement of learning and results (AYP) were in part redefined in the NCLB accountability system. Linn implied that politics as much as education drove NCLB requirements.
Yet despite the many differences among the presidents and the eras they represent, consistent attention to similar test purposes propelled R&D and policy across the century to assess students, to evaluate schools, to contribute to improvement, and to reduce inequality in the forms available to the writers at the time. It is also apparent that the presidents were personally invested in R&D in order to impact education positively, and none apparently saw assessment or testing as a sufficient intervention to bring about educational improvement, especially for those students in need of help. Although the research endeavor is frequently criticized as remote from reality, notice that each president kept schools and students central to his address. The purposes of assessment have not much changed over the years, although many methods of design, analysis, and reporting have been transformed.
There is no doubt that the U.S. education system remains frustrated in its attempts to raise the academic proficiencies of its students, and this review suggests that the investment in testing as a reform centerpiece is not near to doing the job alone. Perhaps the integration of instruction and annual outcome measurement as described in the report of the Gordon Commission (2013) and the impact of concrete innovations using technology-based learning systems, games, and simulations will result in a better future.
From the writing of the presidents, I note that all have addressed education in the largest sense. If we are to partially realize their goals and integrate assessment into stronger, productive roles, specialists in assessment, measurement, and evaluation will make greater effort to connect to other major constituencies working to improve education.
