Abstract

A key step in preparing this volume was an authors’ workshop (convened in September 2018). For each draft article, invited discussants provided technical commentary, advice on editorial style, and suggestions for how to reinforce the theme of educational assessment as a useful form of empirical evidence. The ensuing discussions were, themselves, immensely useful—to the authors and editors—as we went about preparing final versions.
In pursuit of answering the fundamental question motivating this project, we invited two distinguished policy-makers to the workshop, and later engaged them in a lengthy interview to elicit their views on evidence generally and on testing and assessment specifically, in their respective “real worlds” of policy, practice, and research. Carl Cohn and Rebecca Maynard each have decades of experience at the local, state, and federal levels.
An edited version of our two-hour conversation follows (CC = Carl Cohn; RM = Rebecca Maynard).
Editors: We want to start with a general question about the uses of scientific evidence—by which we mean research-based, empirical, systematic knowledge—in policymaking. Not necessarily limited to educational assessment, how have you found research to be useful, and what have you done to promote its use?
As an urban superintendent, I was constantly trying to figure out how to motivate and inspire the adults who work with kids. Sometimes superintendents speak a lot about kids; I approached the job from the point of view that the only way we are going to get a better outcome for kids is by making sure that there is a trusting relationship with those who actually work with kids all day long. There was not a lot of evidence to support that during my 12+ years as an urban superintendent, but then along came Bryk and Schneider and their study of how elementary schools actually got better in Chicago, which found that the building of relational trust was the key feature that improved performance.
At the time that this research came out, I was an urban leadership professor at Claremont preparing the next generation of school leaders who are very much focused on social justice, spending a lot of time talking about kids, and determining how you really change outcomes for kids. You change outcomes for kids by building this relational trust in a school system. I would work with the mentees that I was coaching and their boards of education on how to build this relational trust, which includes winning over teachers in an atmosphere that is often very contentious in terms of labor management. This research-based evidence on relational trust made my work of introducing other leaders and policy-makers to these theories much more compelling.
In my experience, the most useful research provides clear, contextualized answers to well-defined questions—not the researcher’s interpretations of how a bit of evidence should be used to change policy or practice. Let me give you an example. It is useful to know the differences in test score distributions or means across population groups, how reliably those means and distributions are measured, and even what factors may be predictive of or strongly associated with those differences. But as a policy-maker or a practitioner, it is typically not helpful to have the researchers say, “You know, all the rural districts are really doing horribly. Perhaps we should cut their funding or take over their schools?” That might be the researcher’s conclusion, but that is not a conclusion that is supported by evidence. It is a bridge too far to have the byline of a research article be the researcher’s conclusions and recommendations as opposed to the evidence-based facts, whether descriptive, predictive, or causal, and the logical conclusions of those facts.
The researcher-based products that are most useful are ones that are clear about what has been studied and what has been learned, and less prescriptive of what it means for policy or practice. The reason is that most of us on the research side are not deeply involved in the questions of practice or policy that are being decided, and we are not well schooled in the nuances of that work. Furthermore, even if we were involved in the policy world when writing our papers and doing our research, contexts change over time. It is generally a distraction to have a lot of emphasis in research papers on particular policies or practices, unless, in fact, that was the single, explicit purpose of the study. For example, consider the question about the effect of a negative income tax on labor force participation. That is a very concrete question, and it could be answered quite well with experimental evaluations, like those conducted in the 1960s and 1970s. But it should still not be in the researcher’s purview to assert her judgment that it would be a good thing to implement a negative income tax because, for example, the downward pressure on labor force participation (“drag”) is not very high. That is a policy decision that entails many considerations beyond the drag on labor force participation. It is a distraction in the research to conflate the scientific evidence and the policy prescription.
Editors: A variety of assessment data are collected as part of programs of research, and the results are reported in various forms, including research articles and agency reports. How have you looked at and understood those data and made use of them for policy?
In the early 1990s, as a new superintendent in Long Beach, California, one of the things we were concerned about was early literacy. We spent quite a bit of time looking at the available evidence and decided to go in the direction of giving some kids more time to acquire literacy skills. We developed a mandatory summer school program: five weeks of required summer school for all those kids who were not reading at grade level at the end of third grade. We identified all students who were not deemed proficient on the state standardized test. We then used a benchmark book and had a teacher sit with each individual child to determine what level they were at and whether they qualified for the summer school. It was a combination of standardized testing and clinical assessment between a teacher and student.
The Board of Education and I were in sync that this was an important program. There was substantial evidence that early literacy is absolutely critical. Our hunches, though, were that this required summer school would end up with 75–80 percent males of color, and that is exactly what happened. We used the available evidence to implement an evidence-based program. Part of it was saying to the community that we were going to go the extra mile based on the evidence. We also framed it in a pluralist way as opposed to a redistributive way. Obviously this was a multi-million-dollar program that required additional school district resources, but it was framed this way: any youngster who sits with a teacher and doesn’t meet this benchmark book threshold is going to go to this mandatory summer school. It was not a program for African American and Latino males. It was a program for anybody eligible. It is an example of how you can decide you want to shake up the system, do something new, and then use evidence to get there. The program was overwhelmingly embraced by the entire community and was launched as part of a number of improvement efforts in Long Beach, which ultimately caught the attention of the secretary of education, the attorney general, and President Clinton.
A good example for me was a large-scale experimental evaluation recommended to determine if participation in a supported employment program would substantially improve employment and earnings of mentally challenged young adults. The programs in the study received funding to do their very best to prepare the intellectually challenged individuals for unsubsidized jobs. Thus, not surprisingly, staff instinctively wanted to prioritize serving those who were most likely to become employed following training. Yet a strong test of the effectiveness of the program required that we be able to monitor and compare employment outcomes for comparable groups of mentally challenged young adults who went through the program and who did not. So, the challenge was to convince the program staff to refocus their targeting goals to encompass a representative set of youth who they would be willing to serve, should the program funding allow them to enroll more youth. We were successful in convincing the program staff that, absent evidence of the program’s effectiveness, there likely would not be continued funding for this type of program. Furthermore, we argued that if we did not expand the participant pool in the study sample to include young adults who program staff judged to be less likely to show good outcomes, it would be impossible to produce convincing evidence that the program would be effective if run at a larger scale (and, thus, by definition, served a group the program staff were less confident would succeed). It turned out that youth with IQ scores in the band that staff would have chosen to serve, a group then referred to as borderline for “mental retardation” (now referred to as intellectual disability), were not the group who benefitted from the program. Rather, it turned out that the group that benefitted were those somewhat more challenged—in the group just below those the program would have prioritized.
This illustrates what happens when using an assessment for a purpose for which it was not really designed, in this case the use of an assessment as a proxy indicator of work performance. People had made judgments based on some practical wisdom. The IQ test was predictive of outcomes, but not of the program’s impact. Had we followed closely the judgments of the professionals, whose focus was on participant outcomes rather than on program impacts and where the program’s “bang for the buck” might be greatest, we would never have tested the program with the very subset of mentally challenged young adults for whom it was most effective. Mind you, outcomes for the less challenged group of individuals were considerably better than those for the more challenged group; however, the program ended up helping only the more challenged individuals. This is an example of an off-label use of an assessment: taking a test of intellectual ability and applying it as a measure of workforce aptitude or readiness. We were able to reach this conclusion because researchers were working with practitioners to test the effectiveness of the program for individuals across a spectrum of IQ levels using rigorous experimental methods.
Editors: Can you provide examples of how assessments might impede policy-makers?
Putting on my hat as a California State Board of Education member, we fought with the federal government over the issue of identification of low-performing schools based exclusively on standardized test results. We argued that identifying low-performing schools has been part of school reform for the past two decades, and it does not get you much. You need to come up with a way to identify school systems and work on improvement of those systems. These schools are not isolated. They are part of a system, and so the thrust of what we as a State Board of Education wanted to do was focus on the system. This requires multiple measures of accountability that go beyond mere standardized test results. We fought for several years with the U.S. Department of Education over this approach, with support of our governor. This is certainly an example of how assessments can impede policymaking. We were determined to make a huge statement about fixing the school systems where individual failing schools reside, so that improvements would be sustained over time by capturing what we know about continuous improvement in organizations. Isolating individual schools for individual treatment contributed very little to systemic improvement.
Following on Carl’s example, assessments sometimes divert attention from the issues of consequence in explaining particular outcomes. Persistently low-performing districts, for example, tend not to mimic in any substantial way the ones that are uniformly high performing. They are different in so many ways that efforts would be better spent trying to unpack what is similar, what is different, and who the outliers are. Are there demographics and contextual factors that tend to characterize low- and high-performing schools? We need to work to understand these contextual differences instead of using assessments to place schools in a pecking order.
Another example would be in admissions testing. For instance, there are programs where you need a minimum ACT, SAT, or GRE score to get in. Most places, however, now seem to realize that absolute thresholds are not that helpful. In my university we have these discussions every year, and we have people who point out examples of people with low GRE scores who do well, and vice versa.
Editors: Following on those examples, can you talk a bit about the more general problem of unintended effects of using test results for so-called high-stakes decisions?
I saw this problem but in a particular context different from the way teachers or school leaders may cope with the pressures of testing. I’m referring to how federal-level policy professionals might collect and report data required as part of resource distribution decisions. Allocations of federal “school turnaround” money, for example, came with expectations or requirements that recipients of funds would meet performance benchmarks. Any time you tie funding to performance, it tends to divert attention from getting the job done to reporting on the job you’re doing. Some of that is necessary, but making it “high stakes” invites people to worry more about the aspects of their job that relate to those performance markers and not necessarily to the aspects of the job that relate to getting the children to the desired levels of proficiency.
I had a three-year hiatus on the faculty at USC between being superintendent in Long Beach and becoming superintendent in San Diego. I was fascinated with the initial questions from teachers about what I thought of “value added.” I thought I had established a pretty clear record in a 10-year run in Long Beach that firing your way to excellence wasn’t the approach, but it was interesting how this sort of new phenomenon had gotten into the equation. No matter what your track record was, they simply wanted to know what do you, Mr. New Superintendent, think of value-added. Is that the way you will hold teachers accountable? Like so many things, this had a way of distracting the organization from its fundamental work: if you believe (as I do) that the most important way school systems get better is by improving the professional development of teachers, this was a major distraction that absolutely took time away from building the trust and confidence for everybody to roll up their sleeves and go to work on high-quality professional development. This new alleged evidence had a way of distracting the entire organization.
Editors: Thinking of your respective roles in educational policy and leadership, is evidence of such distortions or other unintended consequences sufficient as an argument to stop using tests?
Not all testing. I think that’s a particular enactment of how you use testing. Most teachers and educators would argue that some form of testing is actually critical to getting the job done. You need testing for diagnostics and accountability, but it really matters how the test is constructed, how the scales are constructed, and how they’re used in practice.
It’s an excellent question. A group of us served as advisors to the Harvard Urban Superintendents Program, and so we would get together twice a year for meetings and discussions. The chair of the advisory group was the late Beverly Hall, then the Atlanta superintendent. After the Atlanta scandal broke, I was absolutely shocked. It was a huge object lesson in how a major system can go wrong while totally perseverating on test scores. But I don’t believe even a case like that makes a sufficient argument to abolish testing altogether; rather, it teaches that we have to spend more time figuring out how to avoid unintended responses by educators and students.
Editors: On balance, does the use of testing in American education do more to describe or to perpetuate disparities?
I have thought quite a bit about this, and I would say that testing has probably been neutral or, possibly, slightly negative. But I think there is a way to do it right. We have conflated our purposes. For purposes of monitoring disparities, looking at trends over time, and understanding performance of the overall system, we do not need to test every child, in every grade three through eight and twice in high school, every year. That is totally overkill. There is another purpose of testing, which is to know where individual students are and to be able to do the kind of responsive programming that Carl’s been talking about having done in Long Beach. For this purpose, you need to know how each child is doing relative to benchmarks. Some students may be doing so well that they do not really need full testing every year. And some students may be doing so poorly that you should be conducting more targeted testing and testing more frequently than once a year so that you can keep them on track programmatically. I think we have done some of the right things—but in a suboptimal way—and I think that has also led to some misuse of data.
If we look at things like our obsession with teacher value-added, the evidence shows rather convincingly that those measures for individual teachers are not very reliable. Teachers have classes that have quite different profiles of students. Everybody who has been in a classroom understands that the ability of a teacher to teach children in the classroom is not just a function of how he or she runs the class: it also depends on the capacities, the personalities, the situation in the classroom, the size of the class, and the distribution of performance levels among students in the class. Do you have a nice mix of high and low performers, mostly low performers, or mostly high performers? How many disruptive children are in the class? All these things factor in. The evidence is clear that a single value-added measure for a teacher is not a good indicator of his or her ability because there are so many other things coming into play.
Looking at No Child Left Behind (NCLB), we began disaggregating the student data by race in Long Beach a full decade before NCLB. Our research office was populated by people from the CRESST Center at UCLA. I took the view that all districts were doing the same in terms of disaggregating the data. Well, I found out, when I was talking with a former commissioner of education in Texas, that the superintendents in east Texas would never disaggregate the data by race. And I guess my reaction to that is: Is that a national problem or is that a Texas problem? And do you get to better long-term, sustained improvement in this country when you get people at the local level to do the right thing and to come to grips with the challenges of closing the achievement gap? My brother is a civil rights lawyer, and he and I have this debate all the time. But it seems to me with regard to long-term motivation to improve at the level closest to kids, that is what you want to get to rather than the eager fight to push back against the federal government. Isn’t the better part of valor convincing people at lower levels to absolutely do the right thing? And disaggregating that data by race is the right thing to do.
Editors: What do you believe should be the highest priorities for evaluating the current state of assessment, including its future development, uses, and policy implications?
When I was superintendent at Long Beach, Lynn Winters, from the CRESST Center at UCLA, joined our team as the assistant superintendent of research and evaluation. In this role, she went out to schools and said to classroom teachers that “we can help you to build assessments that are a better measure of what you are actually doing with kids than the SAT-9 does.” That was an amazing breakthrough that won over classroom teachers. It showed them that the central office was not out to “get them,” but instead wanted to work in partnership with them. Coming up with assessment designs that build on the capacity of hard-working teachers is, to me, the critical outcome that we want. We should think about refining the design and uses of assessment to be more like the medical field: looking for the right dose, at the right time, for the right patient.
Carl addressed what is really useful at the local level to get the job done with the children. I believe there are two other purposes of assessments that call for something a bit different. First, as a nation we should have some overall standards that could serve a national monitoring role: expectations that build on what educators tell us a child of a certain age and profile should be expected to know, perform, and do. We need to hold ourselves accountable as a nation to answer the questions, “Are we investing enough in education? Are we getting the outcomes we need? How should we be distributing resources?” In the policy space, having a consistent set of data that allows us to look across the country at students of different ages and profiles would help with such questions and would be important for guiding those policy decisions.
Second, we would benefit greatly from having systematic longitudinal data on student performance that could be linked with other forms of data for addressing research questions. Neither of these purposes requires assessing every child every year. The data can be sample-based and have a little more parsimony in the items that are included. For example, they would not need to be of such a fine-grain level that they could be simultaneously used to determine the instructional needs of individual students. That is a whole different level of granularity, and you would need local involvement and clinical assessments for those types of decisions. We could build on NAEP’s [National Assessment of Educational Progress] strong foundation as well as our national longitudinal data surveys to build an efficient, effective monitoring system.
Editors: Did what you heard at the authors’ workshop add to or modify your thinking about how policy-makers should or could appropriately use assessments?
I learned from all the papers. As one example, I was impressed with Jennifer Frey’s discussion about special education. I have worn a couple of hats with regard to that, first as the federal court monitor for the special education consent decree in Los Angeles, and then these last three years as the head of a state agency, the California Collaborative, where the California dashboard identified 163 districts that needed assistance because of the performance of students with disabilities. Her discussion of getting to the type of precision that medicine has (the right dose for the right patient at the right time) and her point that all education ought to be special were among the things that really caught my attention.
I also feel that the work by William Penuel and Douglas Watkins on research-practice partnerships, describing how folks actually sit down with a large urban district and talk about how they can better support student learning and build on classroom experience, is the real future. I see research-practice partnerships as an important way forward in all of this work: a way to bring together the practitioners and the researchers to explore problems and challenges that need addressing.
I pulled out three themes from the workshop, two of which we have talked about here. Several papers talked directly or indirectly about the fact that policy-makers at different levels have different interests in assessments. Another important takeaway is that the most commonly used standardized assessments and the ones we have been investing heavily in are useful for large-scale monitoring and for hypothesis-generation research, but they tend not to be that useful for more fine-grained issues. A third theme, which we have talked a little about, is the unintended consequences of using assessments “off-script,” for purposes they weren’t really designed for.
Editors: Is there any final point you would like to make?
We really did not discuss the fact that we have been investing heavily in theory-based professional development interventions with teachers to improve teaching and practice; yet for the most part, these investments are not moving the markers of teacher performance that rely on student assessments. What this says to me is that it may be more challenging than we have acknowledged to improve student achievement through in-service teacher professional development or, possibly, these assessments are not good markers of what needs to change to achieve better long-run outcomes for students. It may be that still the best we have is that principals know their good teachers from their weak teachers; it may be that there are just too many dimensions to codify. Either way, we probably should rethink how we prepare and evaluate our teachers.
More of these conversations that actually build on the expertise of both practitioners and researchers are necessary for future breakthroughs in what I call a long, hard slog. There are no known shortcuts or magic bullets that get us there on a timeline that the media and the critics of public schools will find acceptable, but we must continue in this quest for what works with our most vulnerable and deserving students.
Notes
Carl A. Cohn is professor emeritus at Claremont Graduate University. For the past seven years, he has served as a California State board member and executive director of the California Collaborative for Educational Excellence. Previously, he was superintendent of schools in the Long Beach and San Diego school systems. Assessment-related work includes past service as the cochair of California’s Academic Accountability Performance Task Force, on the National Assessment Governing Board (NAGB), and as a member and chair of the ACT Board of Directors, and current service on the board of the Center for Assessment.
Rebecca A. Maynard is a professor of education and social policy at the University of Pennsylvania. She is a leading expert in the design and conduct of randomized controlled trials in the areas of education, workforce development and career pathways, and social policy. She has conducted influential methodological research, including codeveloping PowerUP! to support efficient sample designs for causal inference studies, and she has been influential in advancing the development and application of research synthesis methods and cost and cost-effectiveness analyses. From 2010 through 2012, she served as commissioner of the National Center for Education Evaluation and Regional Assistance at the Institute of Education Sciences (IES), where she oversaw the Institute’s evaluation initiatives, the What Works Clearinghouse, the Regional Education Laboratories, and the National Library of Education.
