Abstract
Educational policies such as Race to the Top in the USA affirm a central role for testing systems in government-driven reform efforts. Such reform policies are often referred to as the global education reform movement (GERM). Changes observed with the GERM style of testing demand socially engaged validity theories that include consequential research. The article revisits the Standards and Kane’s interpretive argument (IA) and argues that the role envisioned for consequences remains impoverished. Guided by theory of action, the article presents a validity framework that targets policy-driven assessments and incorporates a social role for consequences. The framework proposes a coherent system that makes explicit the interconnections among policy ambitions, testing functions, and the levels/sectors that are affected. The article calls for integrating consequences into technical quality documentation, demands a more realistic delineation of stakeholders and their roles, and compels engagement in policy research.
Educational reform efforts are observed around the world, for example, Australia (Brindley, 2001), Canada (Klinger, DeLuca, & Miller, 2008), China (Jin, 2011), the UK (Broadfoot, 1996), Germany (Rupp & Vock, 2007), and the USA (Wixson, Dutro, & Athan, 2003; Chalhoub-Deville, 2009a; Deville & Chalhoub-Deville, 2011). Governmental reform efforts assert that education is central for economic advancement, international competitiveness, and social order (Haertel & Herman, 2005). These educational reform initiatives increasingly employ high-stakes accountability tests to improve the so-called quality of education. Reform-based testing involves claims that reach well beyond score interpretation and use at the individual or group level. It also entails system (e.g., educational, economic, and social) change claims. These expansive claims present us with challenges not attended to in current validity theories. What is lacking are structures that make explicit the interconnections among policy stipulations, testing capabilities, and those impacted – at the individual, group, and societal levels. Also needed are scholarly engagements that acknowledge the social dimension of validity.
Although validity remains grounded in score interpretation and use, its theoretical foundations as well as scope are being challenged. Messick’s (1989) hold on the profession is waning and Kane (2006), who authored the most recent chapter in the influential Educational Measurement, is advancing ideas that are spawning theoretical and practical changes. While Messick (1989) grounded validity in construct judgments, Kane (2006), in response to concerns about Messick’s unwieldy approach, proposed shifting the framework of validity to interpretive argument (IA) justification. (In 2013, Kane replaced the term IA with interpretation/use argument (IUA). This change is discussed later in the article.) In the second language (L2) field, this shift is clearly represented in the works of Bachman and Palmer (2010), Chapelle, Enright, and Jamieson (2008, 2010), and Chapelle (2012). The present article reflects on construct-based and IA/IUA-anchored validity formulations and discusses the contentions surrounding the inclusion of consequences in validity research. The article contends that validity conceptualizations and the role they afford to consequences are inadequate for addressing the needs of reform-based testing. The article also brings forth concerns about validity practices that are disconnected from policy research.
The article includes four main sections. The first section describes the push in recent years to use testing as part of educational reform agendas. While the present article addresses primarily reform-driven assessments in the USA, issues discussed pertain to educational reform assessments/accountability tests generally. In light of the expanded role for assessment, the article calls for a reexamination of the scope of validity frameworks. The second section reviews construct- and argument-based validity issues and the continued struggle to integrate consequences into validation research. The discussion transitions in the third section to focus on emerging research under the heading of theory of action (TOA), which calls for a reform-informed validity framework. TOA delves into the interconnections of policy mandates, testing research, and societal consequences. The final section of the paper presents a socially grounded conceptualization of validity, which integrates consequential research. The framework offers a coherent structure of measurement and TOA arguments at the individual, aggregate, and educational/societal levels. The framework also addresses consequential research and responsibilities at various stages of reform policy formulation, assessment system development, and ongoing assessment administration.
Educational reform and testing
Reform-driven educational initiatives have become a global phenomenon, often referred to as the global education reform movement (GERM) (Sahlberg, 2011, http://pasisahlberg.com/global-educational-reform-movement-is-here/). In the USA, GERM has taken shape in federal policies such as No Child Left Behind (NCLB) (2002) and, more recently, Race to the Top (RTTT) (authorized under the American Recovery and Reinvestment Act of 2009). On July 24, 2009, when he announced the RTTT initiative, President Obama stated:
America will not succeed in the 21st century unless we do a far better job of educating our sons and daughters … And the race starts today … if you set and enforce rigorous and challenging standards and assessments; if you put outstanding teachers at the front of the classroom; if you turn around failing schools – your state can win a Race to the Top grant that will not only help students outcompete workers around the world, but let them fulfill their God-given potential. (U.S. Department of Education, 2009, www.whitehouse.gov/the-press-office/fact-sheet-race-top)
RTTT is grounded in consortia testing intended to drive reform. The function of these assessments moves beyond traditional roles of placement, selection, end-of-grade, certification, and so forth, at the individual level. The RTTT testing systems are intended to drive teacher effectiveness and school performance. Targeting different levels of education quality is a commonly observed phenomenon in reform assessments. As Haertel and Herman (2005) write, educational reform tests perform two functions: “[o]ne function is sorting and selecting, comparing students to one another for purposes of placement or selection. The second is improving the quality of education. At times, these two categories overlap … these two broad functions recur again and again” (p. 5). Additionally, RTTT seeks systemic reforms to help ensure that students graduate ready for college and career, which, ultimately, are intended to drive economic growth (www.achieve.org/college-and-career-readiness). In short, reform-driven RTTT moves beyond the traditional interest in individual student scores. It includes far-reaching goals that mandate attention to aggregate and system performance. Investigations of the validity of such assessment programs cannot shy away from aggregate and social system considerations.
Another distinctive feature of reform-driven initiatives is their level of involvement in designing the assessment programs. RTTT mandates many of the design features to which the testing consortia have to attend. It requires the alignment of tests with Common Core State Standards (CCSS), a federally sponsored project (see Deville & Chalhoub-Deville, 2011). RTTT also specifies design elements for formative and summative assessments, online administration and automated scoring, instructionally relevant tasks, as well as enhanced feedback and reporting. RTTT requires attention to professional development, the adoption of common protocols of accommodations for ELLs, as well as a unified approach to identifying ELLs across consortia states. An argument could be made that government specifications are understandable given the ambitious goals and the budget allocated, over $4 billion. Nevertheless, this involvement changes traditional research responsibilities allocated to test developers versus users.
RTTT consortia grant applications are evaluated by reviewers (educational professionals invited by the government to perform proposal reviews) based on the implementation of a Theory of Action (TOA) validation process, that is, how the consortium’s goals, processes, and deliverables (e.g., professional development, standards, assessments, instruction) are integrated into a coherent system that will enhance individual and institutional performance to achieve college- and career-readiness. TOA is effectively mandating the validation not only of traditional individual score interpretation and use but also the overall effectiveness of a system in achieving its intended educational-economic-social goals. This TOA requires that measurement and language testing scholars delve into mostly unfamiliar validity research. The latter part of the paper addresses TOA.
RTTT reform features, validity requirements, and intervention in assessment design are observed with GERM efforts in general (Sahlberg, 2011). GERMs tend to do the following:
seek to fix public education;
promote a uniform approach to education, for example, outcomes- and standards-based education;
emphasize core subjects such as mathematics and literacy, often at the expense of other content areas;
give preeminence to standardized achievement testing programs;
authorize governments to dictate design features of testing programs;
direct instructional practices to increase achievement scores;
gear attention to L2 learners toward a quick transition to monolingual educational systems;
model structures for improvement after corporate approaches;
target expansive reform goals; and
compel a socially grounded research approach to document impact intended – and unintended.
Although a broad critique of RTTT and similar GERM policies is beyond the scope of this article (readers are encouraged to consult, e.g., Byrnes (2005), Mathis (2010), Menken (2008), and Ravitch (2010) for a critique of US policies targeting educational reform), the following remarks are offered to illustrate the nature of some of the concerns. The focus on select subject areas contributes to the reallocation of resources and potentially the neglect of other areas. This imbalance is likely to impoverish an educational program and compromise intended goals of college and career readiness. Additionally, while performance may be adequately measured in terms of a specified assessment domain such as CCSS, the question remains with regard to the representation of such standards in the upper educational systems and the varied work domains. Also problematic are practices that emphasize English language proficiency, often at the expense of students’ first languages. These practices serve to undermine other language initiatives, which are focused on increasing foreign language proficiency.
RTTT and GERM assessments have goals with broad claims at the individual, group, and social system level. Our traditional validity frameworks are inadequate for guiding the evidential and consequential research needed by different stakeholders. Additionally, professionals in measurement and language testing have typically shied away from engaging in policy research. The present article seeks to reframe validity theory and practices to engage better with claims embedded in reform-based testing. Emerging thinking in this area seeks to expand the validity arguments, emphasizes socially centered validity explorations that comprise investigations of consequences, and encourages engagement in policy research. The article discusses validation issues that merit serious attention as the L2 testing profession pushes forward with reform-driven testing.
Validity contentions: An overview of recent history
Validity theory
Two primary sources of information, published approximately every 10–20 years, serve as representative indicators of where the validity discussion is at any given point in time. An argument could be made that forward-thinking validity conceptualizations are presented in the Educational Measurement editions. Authors of these chapters (as they have appeared in the various editions: Cureton, 1951; Cronbach, 1971; Messick, 1989; and Kane, 2006) have been in the forefront of contemplating what educational and psychological scholarship, psychometric developments, policy considerations, social values, and assessment practices entail for more effective validity conceptualizations. A more consensus-based approach to validity is embodied in the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1954/1955, 1966, 1974, 1985, 1999, and 2014; the manuscript references the 1999 edition because the 2014 edition appeared after the manuscript had been accepted and was in production). The Standards are published after extensive deliberations and a widely sought input process. The Standards are endorsed by a variety of organizations, universities, and associations, including the International Language Testing Association. Educational Measurement (1989) and the Standards (1999) are central to where thinking is in the professions at large, and the two sources, until recently, have grounded validity theory and practice in constructs and test scores. Kane (2006, 2012, 2013), however, replaces construct validity with argument-based validation.
Messick (1989) and the Standards (1999) have called for a unitary, score-focused, and construct-grounded approach to validity. Messick portrays validity as an evaluative judgment of the extent to which theoretical arguments and empirical evidence support interpretations of scores as proposed by test purpose(s) and uphold defensible uses/actions undertaken. Additionally, in Messick’s progressive matrix, the construct is at the core of the interpretation and use functions, whether the basis for the justification is evidential or consequential. Kane (2012) writes that “The unified model of construct validity was conceptually elegant, but not very practical” (p. 7). He adds that the model is “conceptually rich and suggestive, but it is not easy to implement effectively, because it does not provide a place to start, guidance on how to proceed, or criteria for gauging progress and deciding when to stop” (p. 8). Such concerns about practicality have been raised by other researchers (Lissitz & Samuelsen, 2007; Sireci & Parker, 2006).
Kane (2006) builds on his research in operational settings (e.g., Kane, Crooks, & Cohen, 1999) to advance an argument-based approach to validity. He prefers the use of the term validation, as it denotes a more applied orientation to the process. IA (as a reminder, IA references Kane’s interpretive argument) specifies the network of inferences embedded in test score interpretation and use for a given assessment. The chain of IA inferences includes claims that pertain to areas such as scoring, generalization, extrapolation, and implication. In language testing, implication is labeled explanation (Chapelle et al., 2008) or theory-based interpretation (Chapelle, 2012). It is interesting that Chapelle et al. (2008) and Bachman and Palmer (2010) also emphasize utilization claims, which consider decisions and impact based on scores. (This addition, it can be argued, points to the relatively circumscribed role that consequences play in Kane’s IA. More on this issue later when discussing Kane’s 2013 publication.) This chain or network of IA inferences becomes the agenda pursued in validation. Validity includes arguments and evidence that support the plausibility, coherence, and completeness of the IA specified. Validation is an evaluation of “the coherence and completeness of this interpretation/use argument and of the plausibility of its inferences and assumptions” (Kane, 2013, p. 1). The approach is “uniform in that it is consistent across applications, but it is responsive to differences in the proposed interpretations and uses, as well as differences in populations and contexts” (Kane, 2012, p. 10). In other words, the overall approach stays constant, while the claims embedded in a network of inferences can vary to accommodate the goals of an assessment. (See Bachman and Palmer, 2010, for excellent examples of how inferences are to be parsed into claims and arguments.)
Kane’s work has been well received in L2 testing. Chapelle et al. (2010) characterize Kane’s representation of validity as follows: “The argument-based approach was designed to retain the generality inherent in the unified model (Messick, 1989) while proposing a more straightforward approach to validation efforts” (p. 9). Kane’s IA is at the heart of major textbook publications such as Bachman and Palmer (2010) and research projects such as Chapelle et al. (2008, 2009). Kane delivered the Samuel Messick Memorial Lecture at LTRC in 2010 and a related paper has been published in Language Testing (2012).
Consequences in validity
Notions represented in the terms impact, consequences, backwash, and washback are widely discussed in the literature, both in language testing and in measurement. In L2 testing, see, for example, Bachman and Palmer (1996, 2010), Cheng (2008), Hamp-Lyons (1997), Kunnan (2004), McNamara and Roever (2006), Shohamy (2001), Wall (2005), and Wall and Alderson (1993). In measurement, see, for example, Messick (1989), Mehrens (1997), Shepard (1997), Moss (1998), and Kane (2006, 2012, 2013). Chalhoub-Deville (2009b) argues that these are interrelated terms that can be used interchangeably. The variation in use may be a matter of tradition in a discipline. For example, impact tends to be employed in language testing. Consequences is a more prevalent label in the measurement literature. Backwash and washback seem to be favored in instruction and classroom management. The choice of term may also reflect a geographic preference. While washback seems to be popular in the USA, backwash is used more in Europe. Given the article’s focus on educational testing and the validity literature in the measurement field, the term consequences appears more frequently in the present article.
It is commonly acknowledged that test score interpretation and use inevitably have consequences. Broadly, consequences refer to the effects of testing outcomes – intended and unintended, positive or negative – on different stakeholders, including students, teachers, administrators, as well as larger societal systems (Cheng, 2008; Fulcher, 2014; Hubley & Zumbo, 2011; Kunnan, 2004; Shohamy, 2001). Discussions of consequences focus on a number of highly contentious issues, which include whether the evaluation of consequences should be integral to validation; what aspects of consequences are to be investigated; and who is responsible for engaging in evaluating consequences and at what point. These issues are considered next.
Mehrens (1997) wrote: “Whatever the eventual outcome of the debate about the term consequential validity, it is evident that currently there is no agreement about the wisdom of its usage” (p. 18). Almost ten years later, Brennan essentially reported no change in this outlook. “Perhaps the most contentious topic in validity is the role of consequences” (Brennan, 2006, p. 8). However, what is increasingly contested in the literature is not whether test consequences deserve attention, but whether it is feasible to include them in validity theory and research (Kane, 2013). The contention in measurement is mirrored in arguments observed in language testing. While researchers like Shohamy (2001) and Lynch (2001) argue for a critical examination of test functions and consequences as part of validation, Bachman (2005) deems such endeavors as impractical in validity research. Davies (2008) entrusts codes of ethics and of practice with social and political accountability, an approach viewed with skepticism by McNamara and Roever (2006). McNamara and Roever write: “we are uncertain whether the codification of ethical principle will have any measureable impact on the field” (p. 253). To these contentions, I add that separating consequences from validity suggests to test developers and other professionals that such research is not integral to the technical quality of the assessment program and can be pushed more easily aside or allocated to some stakeholder.
Kane (2006) and McNamara (2006) credit Messick (1989) with making explicit the role of consequences in validity theory. Messick integrates consequences into his progressive matrix and argues “that scores are always associated with value implications, which are a basis for score meaning and action (interpretation and use), and which connect construct validity, consequences, and policy decisions” (Chalhoub-Deville, 2009a, p. 120). Messick, nevertheless, offers a delimited role for consequences in validity. Specifically, Messick maintains that “If the adverse social consequences are empirically traceable to sources of test invalidity, then the validity of the test use is jeopardized. If the social consequences cannot be so traced … then the validity of the test use is not overturned” (Messick, 1989, p. 88). In another publication, he explains: “It is not that adverse social consequences of test use render the use invalid but, rather, that adverse social consequences should not be attributable to any source of test invalidity, such as construct underrepresentation or construct-irrelevant variance” (Messick, 1995, p. 748). Messick seems to argue that adverse consequences are to be construed as part of validation if, and only if, they are related to construct misrepresentation.
Messick’s (1989, 1995) views are in line with those presented in the Standards (AERA et al., 1999):
evidence about consequences may be directly relevant to validity when it can be traced to a source of invalidity such as construct underrepresentation or construct-irrelevant components. Evidence about consequences that cannot be so traced – that in fact reflects valid differences in performance – is crucial in informing policy decision but falls outside the technical purview of validity. (p. 16)
This quotation asserts that consequences are a validity concern only to the extent that they are directly related to issues of construct documentation. This circumscribed attention to construct-related consequences is illustrated in Part II of the Standards, where engagement remains technical and avoids the educational–social impact of a testing program. (Part II of the Standards, Fairness in Testing, addresses fairness and bias as well as equitable considerations for special population groups, including individuals of diverse linguistic backgrounds.) Mostly, Fairness in Testing speaks to the technical quality of the assessment instrument as well as construct-related explorations of interpretations. The Standards relegate policy-related investigations to evaluation research.
The delimited orientation to consequences, which is articulated by Messick and formalized in the Standards, mirrors publications in language testing. McNamara and Roever (2006) point out that the research available in language testing seems to focus more on the technical aspects of validity, that is, “Using evidence in support of claims: test fairness” (p. 14). Studies on “The overt social context of testing” (p. 14) (i.e., social, educational, economic, and policy explorations) are undertaken less. McNamara and Roever reason that this orientation to validity is “heavily marked by its origins in the individualist and cognitively oriented field of psychology” (p. 9), a characterization echoed by Haertel (cited in Sireci, 2013). Haertel holds that measurement has its roots in psychology, which focuses on individual differences. Reform assessments demand that we alter this cognitive, individualistic orientation to validity. They call for a social orientation to validity research.
Validity theory is beginning to alter in ways that accommodate the realities of reform assessments. Kane (2006, 2012, 2013) has increasingly advocated for a more expanded role for consequences in validity research. Kane (2013) argues that consequential research, which addresses score interpretation, is necessary but not sufficient to document the quality of assessment results. Kane contends that bad decisions can be made even when score interpretation is sound. “The evaluation of test score uses requires an evaluation of the consequences of the proposed uses, and negative consequences can render a score use unacceptable” (Kane, 2013, p. 46). To formalize the need to pay attention to both the interpretation and use of scores in consequential research, Kane proposes an adjustment to the IA term. He revises the IA term “to give interpretations and uses equal billing” (Kane, 2013, p. 2). Kane replaces the term IA with IUA (as a reminder, IUA references Kane’s interpretation/use argument). (Kane’s 2012 Language Testing article had not yet employed that terminology.) Declaring consequences integral to validity theory and relabeling the argument as IUA is a significant development.
IUA’s promise of this broadened engagement in consequential research, however, is not met. An analysis of the content and word count of Kane’s (2013) article shows that the discussion remains focused primarily on issues of score interpretation. Rollins (2013) performs a rudimentary word use frequency analysis of Kane’s article. Rollins reports, as shown in Table 1, that the most frequently used root word in Kane’s article is interpret, with a frequency count of 996. In comparison, use appears 345 times (about 35% as often as interpret) and consequence 190 times (less than 20% as often as interpret). A look at the lesser-used words is also revealing. Words such as policy and fairness barely make an appearance in the article. A reading of the article shows that Kane’s discussion of consequences remains retrospective and engaged in the polemic of the role of consequences in validity. Although this is obviously an important pursuit, it offers little information to advance the knowledge base needed in reform-based testing research.
Table 1. Kane (2013): A sample from most to least frequently used word roots/concepts.
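The kind of root-word tally described above, and the ratios that follow from the reported counts, can be sketched in a few lines of Python. This is only an illustration: Rollins's (2013) actual procedure is not described here, so the prefix-matching rule and the function names `root_counts` and `relative_to` are assumptions introduced for this example, and the only data used are the three counts reported in the text.

```python
import re
from collections import Counter


def root_counts(text, roots):
    """Count tokens that begin with each root.

    Prefix matching is a crude, assumed stand-in for stemming; it will
    also pick up unrelated words that happen to share the prefix.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for token in tokens:
        for root in roots:
            if token.startswith(root):
                counts[root] += 1
    return counts


def relative_to(counts, baseline):
    """Express each count as a share of the baseline root's count."""
    return {root: n / counts[baseline] for root, n in counts.items()}


# Counts reported in the text for Kane (2013); no other data are assumed.
reported = {"interpret": 996, "use": 345, "consequence": 190}
shares = relative_to(reported, "interpret")

for root, share in shares.items():
    print(f"{root}: {share:.1%} of the count for 'interpret'")
```

On the reported counts, use occurs roughly 35% as often as interpret and consequence roughly 19% as often, which is the imbalance the argument relies on.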
In 2008, McNamara characterized Kane’s – and Messick’s – contributions as follows: the “lack of an appropriate model of the broader social context in which to consider the social and political functions of tests is true of work in validity theory in general, even in progressive theories such as Messick and Kane” (p. 423). This characterization holds true for Kane’s latest publications as well. Kane does not address how research needs to be framed to accommodate issues beyond score interpretability, construct-related consequences, and individual scores. It is only when we consider Kane’s collaborative work with regard to TOA (Bennett, Kane, & Bridgman, 2011) that we gain insights into how validity research can incorporate a social orientation to consequences. The next section explores TOA and the innovations offered.
In conclusion, while consequences are acknowledged as relevant to testing practices, disagreement prevails on whether they should be part of validity. The dominant perspective has been to restrict the scope of validity to semantic, construct-related research and explorations of fairness and bias. This view of validity has tended to exclude arguments and investigations that pertain to sociopolitical consequences. IUA ratifies investigations of consequences as part of validity research. This is significant given the historic debate and reform testing needs. Kane’s (2013) substantive contribution to guide systematic explorations of social consequences, however, remains inadequate. The few TOA publications that have begun to emerge offer a promising orientation to the role of consequences in validity. Next the article presents the tenets of TOA, which target the validity research needed with GERM assessments.
Theory of Action (TOA)
TOA is explicit about the need to integrate consequential research into validity explorations associated with GERM and RTTT testing. Bennett et al. (2011) write as follows: “in addition to evaluations of score meaning and score-based use, the intended impact of the assessment implementation and its evaluation must, at least in this [reform-driven K-12] context, play a major role in validating testing programs” (p. 5). TOA underscores research into consequences at various levels. TOA demands attention not only to individual scores but also to aggregate results, for example, teachers, administrators, and schools as well as the educational–social contexts of testing. Bennett (2010) maintains that TOA research “gives greater prominence to the effects of the assessment system on individuals and institutions as well as to the underlying mechanisms behind those effects” (p. 71). TOA calls for a social perspective of consequential research to include not only intended but also unintended outcomes. Data is to be collected “from key stakeholders (students, parents, teachers, and administrators) documenting how assessment results are used, noting both intended and unintended consequences of score use” (Bennett et al., 2011, p. 4). TOA-motivated changes for a validity conceptualization are articulated in the next section, in Table 2, “Conceptualizing consequences within validity in reform-driven testing.”
Table 2. Conceptualizing consequences within validity in reform-driven testing.
TOA acknowledges that reform assessments are not some independent measure, introduced to account for individual performance. As observed with RTTT, assessments are part of the intervention for socioeconomic change. Accordingly, a TOA demands that traditional notions of technical quality documentation be expanded to include explorations of the extent to which a testing program helps achieve intended reform outcomes at various individual, aggregate, and social levels. As Sabatini, Bennett, and Deane (2011) remark, in a TOA “the assessment system as an intervention becomes a key part of what it means to demonstrate technical quality. Technical quality as such is not just instrument functioning; it is also the impact (negative and positive) of instrument use on students, teachers, classroom practice, school functioning, and the larger education system as a whole” (p. 14). Within this perspective, a social orientation to validity becomes critical.
Bennett et al. (2011) suggest a two-part interpretive argument in conceptualizing TOA validation: a measurement argument, which is a more apt label for Kane’s IA/IUA, and a TOA argument, which focuses on “the impact claims and the mechanisms through which that impact is expected to occur” (Bennett et al., 2011, p. 6). TOA targets the potentially causal links between proposed efforts and intended outcomes, including those that extend beyond construct-related interpretations and uses. TOA includes the following (Bennett et al., 2011):
The components of the assessment system and a logical and coherent rationale for each component, including backing for that rationale in research and theory
The interpretive claims that will be made from assessment results
The intended effects of the assessment system
The action mechanisms designed to cause the intended effects
Potential unintended effects and what will be done to mitigate them. (p. 6)
These emerging TOA delineations facilitate a more socially aware validity probing of GERM-style testing.
TOA urges scholarly engagement that begins to address McNamara’s (2008) concerns about validity conceptualizations in language testing:
The main difficulties impeding progress in this area [discussions of consequences in validity theory] are a reluctance on the part of the language-testing profession to seriously engage with language testing as a social and political practice, and the lack of an adequate theory of the social context in which tests find their place even with the discussions of the social dimensions of language tests. (pp. 422–423)
Scholars such as McNamara call on researchers to consider the sociopolitical orientation of reform assessments. TOA promises to offer a foundation for such an orientation. TOA demands that we amend technical quality documentation to address the interconnections among policy goals, accountability testing, and those impacted. TOA urges that we expand validity research to accommodate consequential investigations at the aggregate and system levels. TOA scholarship is in its infancy, and a critical evaluation of its contribution is premature. The extent to which TOA will develop to guide the type of systemic, sociopolitical scholarship necessitated by GERM assessments remains to be seen.
Consequences and validity revisited: A framework for GERM assessment systems
A reform-driven framework
Table 2 presents a reform-informed, socially grounded framework for conceptualizing consequences within validity. The framework involves a general structure that invites claims associated with the specific nature of an assessment system. The table can target score as well as educational–social context claims. Table 2 emphasizes that policy formulation, assessment design/development, and the ongoing administration of a testing program are part of a coherent validation program. The framework brings together in a more transparent and coherent fashion the IA/IUA approach (Kane, 2006, 2012, 2013) and TOA nascent contributions (Bennett, 2010; Bennett et al., 2011; Sabatini et al., 2011). Additionally, the framework makes clear the need to pursue multiple validity arguments, which are broadly represented as individual, aggregate, and socio-educational score interpretations and uses. The measurement argument deals with individual- and aggregate-level specifications of claims, such as group differences/DIF, rater bias, and accommodation effectiveness. Examples of a TOA argument associated with system-level investigations include the effect of assessment on curriculum (narrowing/expanding), the impact of flagging schools for remediation on school functioning, and the relationship between the inclusion of L2 students in assessments and college graduation rates.
Table 2, as suggested by Kane (2006), distinguishes between the assessment development and the operational (the testing program is live) stages of validation. At the design and development stage, developers are typically engaged in operationalizing the construct, elaborating their specifications, creating items/tasks, and conducting research to examine and refine assessment content, administration, and other related practices. Development also includes the delineation of claims associated with test purpose(s) and intended score interpretations/uses. At this stage, efforts should simultaneously be expanded to articulate and invest in the research of plausible unintended interpretations/uses/decisions in the individual, aggregate, and larger educational–social contexts. Consequential research should pay attention to relationships among the various levels as well as system products and services (e.g., test preparation materials, teacher professional development, curricula guidelines, and communication plans). At the development stage, the validation evidence tends to be confirmationist in nature. Validation information obtained is used to revise and improve the quality of the assessment and related program products and services. Once the assessment is live, validation research is expanded to include critical investigative studies. At this operational stage, research is more of a fault-finding type of appraisal. Unintended and negative types of consequences observed are pursued as a part of validation.
Also integral to a framework that incorporates consequences into validity is the communication of documentation. Test developers and researchers have typically developed arguments and framed them to address their testing colleagues. In the published literature, Chapelle (2012) raises awareness about the need to pay attention to the diverse “audience for validity arguments” (p. 26). The call with this socially grounded framework is for a consideration of the needs of other stakeholders (e.g., policy makers, teachers, and parents) associated with and impacted by the testing system. In conclusion, the proposed framework renders the ensuing technical quality research for a testing program even more complex. Nevertheless, given the ambitious scope of GERM-style assessment, it is reasonable to demand a broad validity agenda. Table 2 offers a preliminary procedure for integrating consequences in validity. Further validity scholarship is needed to articulate consequential research programs and related analytic tools that address policy-driven accountability assessments.
Policy
Reform-driven testing mandates policy engagement. Publications in this area, however, are scant. Menken (2008) writes: “The acknowledgement in research of the intersection between testing and language policy is very recent; overall there is very little research on this critical topic” (p. 410). Additionally, the limited publications that engage in policy research tend to be either critical of policies/testing altogether (see discussion of realists and constructivists in Fulcher, 2014) or reactive, whereby a policy initiative has unfolded and the related testing program is operational (Byrnes, 2005; Ravitch, 2010). Table 2 embeds the reactive policy approach in the last two sections: “Assessment Program & Arguments Under Development/Developed.” A reactive approach to investigations of consequences is important, but should not be the only mode of research engagement. It is not clear yet to what extent TOA envisions a priori engagement at the policy level. A TOA, however, does suggest research that includes “[p]otential unintended effects and what will be done to mitigate them” (Bennett et al., 2011, p. 6). This point could be interpreted to mean that a research agenda should address unintended consequences and does not call upon researchers to engage in policy formation. As Table 2 depicts, anticipatory policy-engaged research is integral to validation agendas associated with reform-driven testing. The “Policy & Arguments Under Development” section in Table 2 draws attention to policy-engaged anticipatory research.
Reform initiatives seek accountability testing ultimately to improve the performance of various societal sectors. These initiatives are monumental in terms of their impact on individuals, groups, and systems. It is critical that initiative deliverables and processes be subjected to rigorous vetting and piloting before they are enacted. Chalhoub-Deville (2009a, 2009b) discusses social impact assessment (SIA), which addresses a proactive approach to policy research. SIA is “the process of assessing or estimating, in advance, the social consequences that are likely to follow from specific policy actions or project development … SIA as a process and methodology has the potential to contribute greatly to the planning process” (Burdge & Vanclay, 1996, p. 59). SIA entails working with policy makers upfront to inform policy design. SIA demands that language testers engage in the scholarship of reform policy design to enhance the effectiveness of testing programs. Policy theories, research procedures, and communication systems are needed to investigate and address potential (intended/unintended, positive/negative) consequences at the policy preparation stages.
Policy research, similar to the development of testing specifications and the piloting of instruments, requires a program of investigations that commences at the design level. Table 2 integrates SIA into the framework in the section “Policy (Includes an Assessment Program) & Arguments Under Development.” The anticipatory policy part of the validity framework mirrors the two assessment sections. Simulation studies, ethnographic research, and focus groups, among others, could be undertaken to investigate the variety of claims delineated. A confirmationist approach to validation evidence is emphasized at this stage. Evidence obtained is used to revise and strengthen the effectiveness of the policy under development to drive reform. Policy research continues during the development of the assessment program as well as when assessments are operational. This ongoing research is represented with the arrow, which cuts across the assessment sections. The validation evidence obtained during this ongoing process tends to be critical in nature. Research may call for adjustments in the policy mandate given data obtained during the process of assessment development and piloting as well as the information that surfaces once the assessment is operational (see Chalhoub-Deville, 2009a for a discussion of such information with NCLB).
Arguments for engaging in policy research continue to appear in our published literature. Nevertheless, language testing and measurement professionals are reluctant to engage. The scholarship of policy planning and policy-driven assessments, therefore, continues to be scant. Engaging in research of reform assessment policies is likely to mean that “language testing will become a broader, more diverse field because it will be unlikely that any single researcher or group of researchers will be able to be sufficiently on top of the range of relevant intellectual fields” (McNamara & Roever, 2006, p. 254). The cost of potentially losing a unified field must be weighed against the benefit of being “more socially and intellectually responsive” (McNamara & Roever, 2006, p. 254) to the demands that GERM assessment imposes. Given the scope of reform initiatives, which are driving the language assessment enterprise, the L2 testing field must engage if it is to remain relevant.
Role allocation
Test developers have assumed responsibility for the research of consequences as they directly relate to issues of the construct. Additionally, and as the Standards (AERA et al., 1999) state, “If a test is used in a way that has not been validated, it is incumbent on the user to justify the new use, collecting new evidence if necessary” (p. 18). Kane (2013) offers a more nuanced stance. He writes: “test users generally are in the best position to identify unintended consequences, but test publishers also have the responsibility for the consequences of uses they explicitly or implicitly advocate” (p. 58). This fixed, a priori allocation of roles for consequential research is not tenable with GERM testing (Chalhoub-Deville, 2009a, 2009b; Nichols & Williams, 2009). For example, and as already described with RTTT, the federal government and various state agencies, typically viewed as users, increasingly direct the development, interpretation, and use of accountability assessment systems. Reform testing blurs the lines between test-developer and test-user groups. Issues such as role conflation necessitate a more flexible approach to the allocation of responsibilities for consequential research. Figure 1, an adaptation of a configuration by Nichols and Williams (2009), offers one such approach. It is worth noting here that emerging TOA publications do not yet elaborate on the allocation of roles for different stakeholder groups. Although TOA calls upon test developers to engage in an expanded role for documenting consequences, it does not explicitly address the fluid roles test developers and users play in reform-driven assessments.

Figure 1. Role and responsibility allocation for consequences in validity.
Figure 1 brings together various elements to help systematize traditional and less defined roles to undertake research into consequences. The scope of the construct/domain (e.g., whether it is related to a proficiency construct, instructional curriculum, standards/frameworks, work domain) is one core feature for allocating responsibility for consequential research. The proximal–distal (intended/unintended interpretation and use of assessment results) continuum is also useful in defining differential responsibilities. The figure identifies a larger role for test developers in the upper-right section, where the construct/domain definition is intended to be broad in scope and score interpretation/use is in line with intended claims. On the other hand, the figure allocates more responsibility to test users in the lower-left section, where the scope of the construct is limited but score interpretation and use move well beyond intended inferences. Outside these traditional areas/research roles, developers and users have to confer about how to address consequences. The figure integrates a zone of negotiated responsibility (ZNR) to accommodate unallocated responsibilities for consequential research. A ZNR denotes that circumstances exist and/or arise whereby test developers and users need to confer about and address consequential research. Circumstances such as role conflation with RTTT necessitate deliberations. Under such circumstances, stakeholders have a shared responsibility to discuss capabilities to fund or undertake research for consequences.
The time element and related stages are also important in defining roles and allocating responsibility. Responsibilities alter with the passage of time. For example, over time and with a successful operational testing program, users tend to devise new interpretations of/uses for results, typically towards the distal end of the continuum. Test developers cannot hold on to the stance that such interpretations/uses were not originally intended. What might once have been characterized as distal has become common practice, which needs to be accommodated in the research agenda. Developers are likely to accrue financial benefits from these new interpretations/uses, because they tend to involve an increase in test administrations. Unless developers seek to stop unauthorized practices, an argument could be made that developers “implicitly advocate” (Kane, 2013, p. 58) these practices. Test developers have to assume research responsibility for this expanded interpretation/use of results and/or enter into negotiations to undertake necessary research. In Figure 1, the expanded ZNR as time passes denotes these increased demands for negotiating responsibilities for consequential research.
As argued above, at the policy development stage, language testing researchers need to undertake SIA investigations of the potential consequences of reform processes and deliverables. Figure 1 directs attention to anticipated roles and responsibilities when a policy is under development. Research at this stage considers roles for different developer–user groups to undertake investigations of likely consequences for a testing program and related materials/services. Feasibility studies are undertaken at the policy formulation phase to specify roles under different circumstances and speculate about areas that will need to be negotiated. As part of fleshing out roles and responsibilities, procedures must also be put in place to assist with inevitable future negotiations. The scholarship needed to engage in defining roles and allocating responsibility for consequential research at the policy development phase is practically nonexistent in the field. The article is a call to language testers to rectify this state of affairs.
The field’s depiction of role allocation for consequential research in validity is inadequate. The literature presents a delimited scope of responsibility and rather fixed roles. Publications in the area also do not accommodate the changing functions of assessments and the validity research responsibilities they require. Figure 1 depicts a more flexible approach to delineating roles and responsibilities for consequential research. Ideas presented are preliminary and in need of rigorous development. Scholarship is called for to systematize practices for defining stakeholder groups, delineating roles, apportioning responsibilities, and elaborating communication/negotiation systems. Research should delve into these issues and consider changing circumstances as they pertain to individual, aggregate and larger societal levels of consequences.
Concluding remarks
GERM systems comprise ambitious educational–social goals with broad claims. Construct judgments and IA justification, which have anchored validity research in recent history, are inadequate, in terms of their theoretical foundations and scope of research, for addressing GERM testing demands. Reform policies and related accountability assessments require a coherent system that makes explicit the interconnections among policy goals, testing functions, validity research, and the groups/systems impacted. The article presents a framework that embeds consequences in validity conceptualization and pushes for considering a social dimension to consequences. The framework includes interrelated arguments: measurement and TOA arguments to address consequences at the individual, aggregate, and educational–social levels. The article prompts engagement in policy research and urges researchers to undertake investigations of policy proposals to inform more reasoned reform policies and testing practices. The article also presents a figure to enable a more flexible allocation of roles and delineation of responsibilities for consequential research. Ideas presented, especially with regard to policy, are in need of rigorous development. In conclusion, the contentions the article presents merit critical consideration as language testing scholarship catches up with the spread of reform-driven assessment programs.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
