Abstract
This essay briefly describes some of the early AERA presidents who were empiricists, several of them directors of research, and how their work connects with some of the issues of design, measurement, analysis, and interpretation today. Beginning with the first president of AERA, a number of presidents through the late 1940s are highlighted, as well as how their work spoke to the times they lived in and how its impact continues through this centennial year. The history stops with the 1950s, which ushered in a new era of significant innovations in research methodology and statistical techniques, strong criticisms of the quality of education research, and a growing recognition of the importance of education in promoting U.S. economic development and social reform. The intent of this essay is to discuss how the legacy of education research, unfamiliar to many of us, shaped how we approach issues of testing, measurement, and student ability—some for the best, others fueling debates and inertia for sustainable reform.
Prologue
One of the major findings of the Alfred P. Sloan Study of Youth and Social Development in the 1990s was the historically high ambitions of high school students and the overwhelming number of them, regardless of their gender, race and ethnicity, and socioeconomic backgrounds, who expected to receive a bachelor’s degree (Csikszentmihalyi & Schneider, 2000; Schneider & Stevenson, 1999). In an attempt to understand what contextual conditions influenced this rise in expectations, my colleague, David Stevenson, and I pored over 1960s studies of the educational aspirations of adolescents, most memorably, The Adolescent Society: The Social Life of the Teenager and Its Impact on Education by James S. Coleman (1961). Our rereading of this book motivated us to examine other empirical studies of adolescents within specific communities from the 1920s through the 1960s. The ones that looked potentially valuable across the decades included Middletown: A Study in Contemporary American Culture, by Robert and Helen Lynd (1929); Children of the Great Depression: Social Change in Life Experience, by Glen Elder (1974); Elmtown’s Youth: The Impact of Social Classes on Adolescents, by August B. Hollingshead (1949); and Growing Up in River City, by Robert Havighurst (Havighurst, Bowman, Liddle, Matthews, & Pierce, 1962). 1 Our primary interest was to locate the original data used in these studies and compare them to the extensive national longitudinal data we had amassed from several data sets to gain new insights into what changes in the family, high school, and community may have driven the educational aspirations of young people.
Networking with colleagues and friends, we attempted to learn if any of these earlier data files still existed. Our contacts yielded considerable information about unlocatable data and communities that were unexpectedly negative toward some of the researchers, primarily because of perceived inaccurate and biased descriptions of their children. However, among this set of studies, the one by Havighurst seemed particularly promising. Havighurst had been an eminent education professor in the Department of Human Development at the University of Chicago and was regarded as one of the leading theorists in the study of human development. Colleagues were confident that the data existed somewhere in one of the university’s social science buildings. Determined, we scoured the bowels of the basement of the infamous “Judd Hall,” pulling out huge wire baskets of the papers of luminary education researchers such as former American Educational Research Association (AERA) presidents Benjamin Bloom (1965–1966), William S. Gray (1932–1933), and Philip Jackson (1989–1990), but were unable to find the Havighurst data. From Judd, I went on to other basements, porches, and unused old laboratories in psychology and human development, only to come up empty-handed. After my last dismal scavenging attempts, as I sat in my office looking quite forlorn, one of my students came in and asked why I looked so sad. I told him that despite what colleagues had said, it seemed that the Havighurst data had vanished. He looked at me quite quizzically and asked if I had tried the library. The
The boxes contained the actual data files of the young people of River City, their IQ scores, grades, and Havighurst’s constructed measures of socioeconomic class. Spending a year in River City in the 1950s, he had collected a storehouse of field notes and observational data trying to understand social mobility and inequality in one American town. Census data showed that River City was surprisingly representative of the population of the United States at that time.
The Havighurst data files provided a deep historical picture of how researchers in the 1950s collected information, analyzed it, and used it to inform, describe, and understand a society in transition. As with the other authors in this presidential tribute, I was curious if I would find similar strategies in the work of earlier presidents and how they conceptualized problems, formulated measures, and used evidence to describe a societal trend, a successful program, or a change in policy. The work of Havighurst helped me understand and come to appreciate the perspective earlier researchers took in making sense of their worlds and their commitment to advancing research to improve education practice and policy.
Clearly, earlier AERA presidents did not have the sophisticated statistical techniques that we have today, but their interest in accountability of student learning, instructional practices, and program evaluation and policy are relevant to us. The following vignettes describe briefly some of the early presidents who were empiricists, several of them directors of research, and how their work connects with some of the issues of design, measurement, analysis, and interpretation today. Beginning with the first president of AERA, I highlight a number of presidents through the late 1940s and how their work spoke to the times they lived in and its impact through this centennial year. I stopped here as the 1950s ushered in a new era of significant innovations in research methodology and statistical techniques, strong criticisms of the quality of education research, and a growing recognition of the importance of education in promoting U.S. economic development and social reform (Ravitch, 1985; Vinovskis, 2009). 2 These major shifts in education research are described in several excellent histories, speeches, and the work of AERA presidents since that time. My intent is to show how the legacy of education research, unfamiliar to many of us, shaped how we approach issues of testing, measurement, and student ability—some for the best, others fueling debates and inertia for sustainable reform.
Defining the Focus of Education Research
The first president of AERA (1915–1916) was Frank Ballou, who at the time was the director of the Department of Educational Investigation and Measurement in Boston, Massachusetts. He defined the primary purpose of education research as being for the improvement of instruction. 3 As he writes: In this age of new undertakings and of new definitions of old ones, we cannot be reminded too often that every administrative agency, every special teacher or supervisor, all educational equipment, in fact, everything pertaining to the public school system is fundamentally for the purpose of making effective the instruction of children. (Ballou, 1916, p. 354)
What is perhaps most noteworthy is the idea that student learning can be measured and used to evaluate the performance of teachers, a concept that has been a central focus of education research for a hundred years. Recent policy efforts to evaluate teacher performance on the basis of standardized tests are hardly a new phenomenon, and their advocates, as in prior times, are primarily those in policy positions. The rationale Ballou underscores for the value of measuring student learning in relation to teacher performance is that the information obtained from assessing the work of teachers is a public good that should be shared throughout the educational system from the superintendent to the principal to the teacher for purposes of identifying and remediating “ineffective instruction.”
The case for measuring teacher effectiveness in these earlier times was not unlike some of the arguments researchers often use in the service of forging better indicators of teacher accountability. However, it was how this initiative should be conducted that appears quite different from the process and outcome of teacher evaluation methods that have met with caution from AERA (2015) and the American Statistical Association (2014). Ballou recommends obtaining achievement information from the “best available standard tests” to set the benchmark on what students should be learning. It is against this bar that teachers’ performance should be assessed. After the results of the students’ tests are analyzed, he maintains, “suggestions must be made for improvement where results are unsatisfactory” and given to the teachers in informal conferences or in groups in printed reports but not distributed or widely disseminated in the form of official orders, in contrast to the public punitive sanctions backed by some in recent policy recommendations. When a reasonable time has elapsed, the students are to be tested again to determine what effect the suggestions had on performance (pre- and post-measures were promoted even then).
Given that tests to measure student performance were in an early stage of development, Ballou urged school personnel to begin by examining some topics, primarily arithmetic, as it can be measured more easily from one time period to another. Some of the other subjects taught to students at the time included “copying,” spelling, geography, and penmanship. One concern he had with spelling assessments was that there were “too many words, which are practically useless to the pupils.” Spelling, he advised, should center on words used voluntarily by “normal” pupils in their writing, and a reasonable list of such words should be identified by a committee of teachers for this purpose. Results for geography showed that “little of the same test taught in the sixth grade remains in the minds of eighth grade, high school, or normal school” (Ballou, 1916, p. 357). The solution was to find the minimum essentials of common facts of geography and teach only those facts. Ballou had serious apprehensions regarding teaching the same elements over and over again, opening the conversation about what we commonly refer to as “drill and kill.”
The recommendations made by Ballou were quite specific, and some of his general approaches to standardized achievement measurement continue in part today. Arithmetic, in Ballou’s school district, was measured by the Courtis Standard Tests, introduced in Boston in 1912, which were voluntarily administered in Grades 6, 7, and 8. The testing used a rollout process in which more students were added over time, and eventually all of Boston’s 70 elementary schools were tested for a total of 55,277 students. This is quite an impressive student sample for 1915.
Scores were reported as medians (with distinctions made between those items attempted and those successfully completed) and reported by speed and accuracy. The inclusion of speed was to assess the ability to copy quickly; while we do not measure copying, the idea of speed and the use of timed tests are still part of many of our assessments. The value of speed has been reinforced in recent studies that consider speed as the ratio of time to effort in carrying out tasks of word recognition (see Perfetti, 2007).
Comparisons in speed and accuracy were made between schools by grade. Differences were calculated between the highest performing school within grades and other schools and grades. Measuring within-school and between-school differences: doesn’t that sound familiar to those who use hierarchical linear modeling (HLM; Raudenbush & Bryk, 2002)? Although he lacked today’s statistical techniques, Ballou aimed to account for within-school variation when examining differences across schools. His fundamental conclusion was that scientific measurement of educational results is both possible and practical.
Testing and Controversies
The use of educational testing to measure student learning and teacher and school accountability continued into the next decade, gaining wider acceptance along with a more explicit rationale for its value. Several years after Ballou, Stuart Courtis, a supervisor of education research in Detroit, Michigan, became AERA president (1917–1918). Courtis’s presidential address could not be found, but his position on testing was evident in an article he wrote in which he counters the opinions of an English teacher who criticizes standardized tests for English (Parker & Courtis, 1919). The teacher regards a new scale used to measure English competence in composition as inadequate because of the nature of writing, which is distinct from other academic subjects, such as mathematics, where precision can be achieved.
Courtis disputes the English teacher’s assertions and distinguishes between supervision and teaching: supervision is concerned with the efficiency of the general process, teaching with the development of the individual child. He avows that tests of competency, in this instance English, are useful for making comparisons across systems. Their ultimate value lies in the benefit to the teacher, not the student, so that the teacher can measure the ability of the students when she receives them and when she sends them on to the next teacher.
Supporting Ballou’s position, Courtis argues that a teacher needs to know the amount of change she has produced in her students over the semester. Courtis reasons that the teacher, in an ideal situation, should be responsible for her own performance, but there is always the problem of opinion, stemming from inadequate measurement and the personal bias of supervisors. Recognizing that test measures are crude, he nevertheless maintains that it is far better to set upon a path where performance is judged on the basis of “exact knowledge obtained from careful experiments and scientific measurement” than on personal bias: Measurement is no mechanical system for grinding the life out of teachers, but a tool by the proper use of which problems may be solved, inefficiency eliminated, and the teacher set free to work for and to achieve as never before, those higher beauties and graces of their subjects which are now intangible only because we have so little exact knowledge of them. (Parker & Courtis, 1919, p. 217)
The “Roaring Twenties” created its own roar in educational testing and measurement. M. E. Haggerty, AERA president (1920–1921), discusses in his presidential address one of the most widespread reforms of the time, the use of intelligence tests, examining both their benefits and potential misuse. 4 Haggerty (1921) credits the advances in ability testing to World War I during which intelligence tests were developed to determine the “mental capacity” of soldiers. Within three years after the first intelligence tests were introduced in 1920, over 4 million students in U.S. public schools had been tested (perhaps it is not so unexpected that some policymakers believe it is possible to scale within three years of development). Reporting on the accuracy of these tests, Haggerty describes a correlation between these intelligence tests and school achievement ranging from .60 to .80 and some at the .90 level. He heralds the development of these tests as meeting the criteria of “discriminative capacity, reliability, significance, and adequate standards of comparison” (p. 243). 5 Intelligence tests, he concludes, question the use of school achievement tests for measuring school effectiveness; for how can a school system be held accountable if the students do not have the ability to learn the material?
Despite his support and affirmation of advances made in intelligence tests, Haggerty (1921) argues that these tests are not satisfactory and that work needs to continue on “sources of error and determining the specific uses to which particular tests are best adapted” (p. 243). Measures of intelligence or ability, he contends, are not necessarily good indicators of how successful a student will be in school, playing only a small predictive part in the student’s later life success. Haggerty stresses the value of personal characteristics that weigh into success, presaging our own recent attention to what we now sometimes consider “character,” including what he terms nonintelligence traits such as industry, loyalty, honesty, tactfulness, sympathy, and cheerfulness. Character traits that detract from school success were identified as self-assertion, pride, conceit, jealousy, quarrelsomeness, suggestibility, and intolerance. The solution, according to Haggerty, was the construction of tests that combine both ability and achievement in order to credit the school for “capitalizing and paying dividends on whatever intelligence investment it may have” (p. 244).
The 1920s initiated the call for standardized tests to include intelligence and achievement measures, a process that continues through this century. However, over time, more and more educators and policymakers came to view achievement as the preferred choice for measuring student performance and teacher and school accountability. Intelligence was seen as unalterable (neuroscience confirming otherwise; see Immordino-Yang, 2016) but achievement could change with support, resources, and opportunities for learning. This focus on achievement and its relationship to what students are actually learning remains a key issue in education research, with the recent Gordon Commission (2013) providing recommendations on what and how to assess student performance.
Haggerty’s paper could be considered one of the first mixed-methods studies in education research, beginning with a quantitative analysis of correlations between achievement and ability tests and followed by a rich description of case studies that clarify how intelligence tests may misinterpret students’ abilities to succeed in school and life. Specifically, he examined high school students, including those placed in state reformatories and those in opportunity schools for the gifted, as well as students with high intelligence scores receiving low marks and those with low intelligence scores receiving high marks. Emphasizing the importance and value of obtaining measures of students’ school attitudes, he also noted the difficulty in separating such characteristics from other achievement and ability factors likely to influence each other.
The talk also discussed problems of grades that use rating scales to measure the scholarship, intelligence, and industriousness of students. Haggerty cites his own research indicating that teachers use scales differentially in rating their students, showing little difference between ability and scholarship, and that students are influenced by the group in which they are assessed. These early education researchers, at least among this group of presidents, were seriously thinking about how social context, like peer groups, can influence student learning. Provocative and forward thinking, Haggerty raised the idea that variation in success was not only due to ability but also to social and emotional skills and character, which are now gaining major traction in assessing students’ potential. In many ways, his ideas foreshadow advances being made in statistical methods (e.g., structural equation modeling [SEM] and HLM) that are allowing us to estimate the effects of personal moderating and contextual mediating variables on student achievement.
Testing Models, Ability Grouping, and Experiments
The testing movement of the 1920s also led to an increase in the number of districts and schools using ability grouping in the 1930s. One of the key review pieces on why ability grouping may or may not be advantageous for teaching and learning was written by Paul T. Rankin, AERA president (1933–1934). Using data from the U.S. Office of Education, he described how ability grouping grew in popularity from elementary school through college. A survey of 500 superintendents that he cites shows that ability grouping was seen much more as an advantage than a disadvantage. It is in his work that we find the relationship between ability grouping and the individualization of instruction.
Rankin (1931) maintained that children’s school performance differs by such characteristics as age and sex, which creates considerable challenges in organizing students into ability groups. His ideas regarding the polemics of the advantages and disadvantages of ability grouping might have come directly from the future, from the reviews by Oakes (1985) and Gamoran (1992). Among the advantages of ability grouping are that there is less differentiation of the curriculum, the need for individual attention is somewhat curtailed, students are more likely to compete with each other, and the work is set at a pace that allows for a real challenge. Disadvantages included hampering lower ability students’ motivation to do well and leading them to see themselves as failures. Even in this early work, concerns are expressed that ability grouping is undemocratic and creates class distinctions. A student may have difficulty in one subject but not another, lower ability groups are more difficult for teachers to handle, and few teachers have the competencies to improve the performance of such students. Rankin also raised worries about the permanence of grouping and its sustaining long-term effects: Should struggling students receive more instructional time? And what type of enrichment should be given to those more academically able?
The individual differences in Rankin’s argument include physical, mental, educational, and emotional differences but not contextual ones such as family resources; that focus would have to wait two more decades to be seriously considered as a deterrent to educational opportunities (Coleman et al., 1966). Teacher judgments of student abilities are viewed as reliable, whereas intelligence tests are viewed as more suspect, relying on single measures, taken from group intelligence tests, lacking attention to maturity, and given to the misclassification of some students. The state of intelligence testing identified by Rankin could easily be applied to the achievement tests of today, including problems of item differentiation, validity and reliability of factors for combined scores, and the need to consider performance in relation to teacher judgments.
The question of ability grouping was controversial enough that experiments were conducted by multiple school districts in which students were either homogeneously grouped or assigned into mixed groups. The language for the experiments uses the term segregated rather than treatment. One criticism by Rankin was that it would be difficult to compare the two groups because they did not have the same teacher (a problem later solved by Rubin, 1974, and others). Pages and pages are devoted to different experiments across the country where researchers examined the differences between homogeneous groups and mixed classes, concluding that there is a slight advantage to homogeneous grouping and that teachers prefer it. It is easy to understand why this practice achieved such widespread adoption in schools across grades K–16 and also eventually sparked considerable debate on its effectiveness (Hallinan, 1994).
Not all AERA presidents were concerned only with the K–12 system. AERA president W.W. Charters (1930–1931), writing an Editorial Comment in the Journal of Higher Education (Charters, 1931), underscores the importance of college selection, suggesting students need to be properly advised when considering college so that those “undermatched” lacking in ability and achievement are not encouraged to attend places where they are unlikely to succeed. The college curriculum, much like that of the high school, is uneven in its instructional quality and subjective grading and is greatly in need of improvement and reform. This presents considerable challenges to students, especially those who may have ability and achievement scores below the average of other students in the college. However, low admission scores alone, while partially predictive, are not foolproof, and current models of selection appear inadequate. He recommends that colleges provide to high school students and parents information on how students with different abilities and achievement scores fare with respect to graduation. The need for transparency and his critique of the inadequacy of the formulae for college admission might easily have been written in 2015.
Methods of Research and AERA’s Legacy
For those who are uncertain where education rests in the hierarchy of the disciplines, Carter Good, AERA president (1940–1941), provides clarity to this debate in the first sentence of his AERA talk by stating that education is an organized discipline and that the content of what has been learned is a function of development of research procedures for “problem solving.” He emphasizes that it is the complexity of education problems that has required the use of a variety of research procedures from other disciplines, such as philosophy, mathematics, statistics, sociology, psychology and experimental science, history, and economics. His paper centers on the types of data collection that researchers can use, a list that could easily have come directly from the Common Guidelines for Education Research and Development constructed by IES and NSF.
With respect to quantitative analysis, which he describes as a normative survey approach, he includes not only assessments and questionnaires but also interviews, observations, extant data, and analyses of documentary materials. He states that a recent development is educational experimentation, the ideal goal of which is “to hold all factors constant except a single variable which is manipulated in order to measure its effect” (Good, 1940, p. 86), to be conducted in laboratories or classrooms. Citing Fisher (1937), Good (1940) acknowledges problems of establishing control conditions and present limitations of measurement and statistical techniques while maintaining that these methods hold promise for studying instructional procedures. What he describes as a case study closely overlaps with the design parameters of most empirical studies today. In the 1940s, the science of education research was exploding, but the terminology was most likely unfamiliar to more experienced researchers whose understanding of causal factors, generalizability, and control conditions was limited. This fact did not go unnoticed by Good, who recognizes the importance of funding and the need for more of a “cooperative attack.”
By the middle of the 1940s, research methods had hit their full stride, and AERA president Wrightstone (1944–1945), with advisors such as Paul Rankin, George Stoddard, and Ralph Tyler, was engaged in evaluation research (some over a six-year period) using data from activity classrooms with
It is important to underscore the close relationship that existed between psychology and education research initiatives that profoundly influenced the type of problems and standard statistical methods used by scholars from the 1940s through the 1980s. In the 1940s, psychologists, mathematicians, and statisticians found themselves recruited for the World War II effort (Ellenberg, 2014). 6 State-of-the-art multifaceted testing regimes, regarded as more complex than earlier assessments (with new methods for constructing item validity and reliability), were implemented to sort soldiers into different classes of service. After the war, such testing regimes became an acceptable practice for matching age-related skills with particular ability levels, viewed especially by elite postsecondary institutions as reliably predictive of educational and economic success (Lemann, 2000). Tests of various types continued to grow in importance, not only as measures of probabilistic achievement but in areas of psychological well-being and social development.
While early education researchers and leaders perceived the value of testing for creating opportunities for students of different talents and abilities and the improvement of instructional practice, the 1940s and the war effort cast a somewhat different shadow on assessment, a shadow that continued through the 1950s and the Cold War. Students were assessed into different instructional programs (i.e., tracks) presumably matched to their ability and their expected work and education trajectories. Instructional improvement became a secondary priority to advancing students’ schooling careers based on the ability level of the classes to which the teachers were assigned. This tension between matching ability (measured by ability or achievement tests often implicitly designed to measure ability) with educational opportunity and providing greater equality of educational opportunity regardless of ability (compromised by social and economic resources) continues to tug at our educational system. And teachers are often caught in the middle, evaluated on their skills at improving the performance of all students irrespective of explicit or implicit student achievement matching practices. Matching or mismatching (as some have suggested) versus creating educational opportunities has the potential to separate our work from the purpose of our education research legacy.
Epilogue
Understanding the motivation that inspired several of the early education scholars can help us to clarify what we value as researchers not only in terms of assessment tools and methodology but in the service of advancing student learning and instructional practice across the education spectrum. From AERA’s beginnings, education researchers have been willing to take advantage of new technologies such as intelligence tests, but they also greeted them with some skepticism, providing examples that children could learn if they had the support, resources, and learning opportunities. This perspective is worth reiterating in the conduct of current research efforts and in the training of future scholars.
Assessment, often formative in nature, was meant to let individual teachers know where their classes stood in comparison with the established standard for the grade; if the class scored below the average, the solution was the implementation of instructional tutoring. Reading the papers of these early scholars, one finds it indeed regrettable that the purpose of teacher evaluation today in many states and districts involves sanctions and dismissal rather than early diagnostic prevention and remediation of teacher performance. Instead of the public and often misinterpreted analysis of teacher performance that characterizes many failing schools today, early researchers advocated a process of evaluation that was to be confidential, and comparisons among teachers were to be cautiously avoided. Most importantly, educational measurement was a constructive cyclical process for improving educational results by targeting student assessment and using it to improve teacher performance and student learning through better pedagogical practice.
Early education researchers struggled with many of the same problems we continue to work on today, whether enhancing learning for all students or improving the quality of practice. Although they were committed to using the most up-to-date methods and statistical practices of the time, their results could be viewed as relatively simple when compared to the outputs we can now derive from merely a tap on a menu-driven software package. Yet there is something akin to an epiphany, an “aha” moment, in calculating a regression analysis by hand with matrix algebra or having to think very hard before committing one’s ideas to nonerasable onion skin paper as Havighurst did when writing his notes on River City. Our education research legacy helps us to value why studying certain problems is important for learning and instruction and how our research can be conducted to ensure high quality, reproducibility, and relevance. One can only wonder, especially with the major changes in technology, what three generations from now will think of our attempts at assessing learning and instruction for all students. Hopefully, it will be with respect and appreciation for our efforts despite how relatively unsophisticated our methods and priorities might seem to them.
