Abstract
This review presents two case studies that illustrate how multivariate statistical modeling applies to the specific goal of improving educational assessment. The first case study involves the development of a new large-scale English language proficiency assessment system (called the English Language Proficiency Assessment for the 21st Century; ELPA21). The second application concerns efforts to quantify student progress in learning using conditional growth models, a topic of current debate about assessment policy. A popular measure, Student Growth Percentile (SGP), is explored through the lens of multivariate statistical analysis. It is concluded that collaboration between researchers and practice stakeholders can improve assessments that benefit student learning.
Tweet
Innovative, close collaboration between research and practice will improve educational assessment systems to support student learning.
Key Points
Research and development in statistics and psychometrics have provided increasingly sophisticated measurement models to better assess constructs in the social and behavioral sciences.
Computational algorithms have advanced to allow routine use of previously intractable, high-dimensional multivariate models.
The practice of educational assessment lags behind the research and development.
Close collaborations between research universities and state education agencies can improve assessment systems that benefit students and other stakeholders.
Introduction
Over the last several decades, innovative models have been developed in the research literature on multivariate statistics in the psychological, educational, and health-related sciences. For statisticians, these models belong to the family of item response theory (IRT) models, which are nonlinear random-effects hierarchical models for discrete multivariate data. For interested readers, Cai, Choi, Hansen, and Harrell (2016) provide a recent technical overview of developments in IRT. For practitioners, policymakers, and other stakeholders, these models offer greater precision, interpretability, and validity.
Development of ELPA21
Background
Roughly one out of 10 students in K-12 education in the United States belongs to the English Learner (EL) population (Ruiz Soto, Hooker, & Batalova, 2015). As changing demographics would suggest, the EL population is growing rapidly. Providing all children equal access to high-quality education continues to occupy the attention of policy makers, experts, teachers, and the community. The 2015 reauthorization of the Elementary and Secondary Education Act of 1965, known as the Every Student Succeeds Act (ESSA), outlines current federal legal provisions on content and achievement standards, assessment systems, evaluation and accountability, and program improvement, among other topics.
ESSA and related federal guidance brought EL issues to the forefront of the attention of the field of educational measurement. State education agencies are required to have English language proficiency (ELP) standards that correspond to the rigorous academic content standards that the states must develop or adopt so that the students are prepared for college and careers in a rapidly shifting economic landscape. In addition, states are required to administer annual assessments for the EL students that align to their ELP standards and that assess (a) the listening, speaking, reading, and writing language domains; (b) overall proficiency; and (c) achievement in comprehension. Indicators on students’ progress toward attaining English proficiency must also be included in the states’ accountability systems (Lyons & Dadey, 2017). For the individual students, initial qualification for additional English language development (ELD) services is determined by a screening assessment that also covers the various language domains.
In response to federal requirements and tied to the adoption of new ELP standards, a multi-State collaborative (led by Oregon) worked together on the task of developing a reliable, valid, accessible, and fair ELP assessment system. With grant funding from the federal government, the English Language Proficiency Assessment for the 21st Century (ELPA21) was developed, field tested, and fully implemented in the 2015-2016 school year.
In the ELP standards adopted by ELPA21 states (Council of Chief State School Officers, 2014), proficiency is described as the capacity to use the English language to understand and communicate ideas, knowledge, and information. This integrated view is essential for students to benefit from classroom instruction, to engage in grade-appropriate curricular practices (e.g., a typical science classroom for a fourth grader), and to succeed in the future. Accordingly, assessment of ELP needs to be accomplished by collecting student responses that require language skills emerging within social and interactive processes (inherent in normal classroom instruction)—preferably in a technologically-enhanced environment where the test is delivered online.
The ELPA21 states recognized that language acquisition is often nonlinear and individual-specific. Defining proficiency in terms of standing on an overall composite or summary score may mask individual variation in the path to proficiency (i.e., variations of the profile of performance across domains) and hide important information for proper instructional planning.
Since 2012, the ELPA21 Consortium has solicited technical advice and thought partnership from a number of university-based researchers and research institutions, including the Understanding Language Initiative at Stanford, the National Center on Education Outcomes (NCEO) at University of Minnesota, and the National Center for Research on Evaluation, Standards & Student Testing (CRESST) at University of California, Los Angeles (UCLA). The collaboration between researchers and states led to several methodological breakthroughs in the underlying statistical foundation of the ELPA21 system that we will review. More extensive technical documentation on the development of ELPA21 is available elsewhere (ELPA21, 2016a, 2016b, 2017).
Background on Standard Practice in Educational Assessment
Traditional educational tests, such as the end-of-year summative assessments given by states in subjects such as math, are long and homogeneous in content. This, along with the legacy influence of classical test theory, results directly in a heavy emphasis on the concept of unidimensionality in the practice of educational measurement. That is, tests have typically aimed to assess one concept, such as overall math ability, rather than more fine-grained skills. For nearly a century, educational tests have all been made in the same image with respect to their technical aspects.
To elaborate: An educational test elicits performance from test takers by presenting them a series of questions or stimuli (test items). Carefully chosen test items, and enough of them, can provide evidence regarding knowledge, proficiency, achievement, or other constructs that the test builders or users desire to measure. Because each test taker answers each question, their data can be arrayed in a matrix (like a spreadsheet grid). The matrix—with test takers (cases) as rows, crossed by item scores (variables) as columns— comprises the raw data. This is the multivariate data set that the educational statistician has to work with in order to study the properties of the test as well as to devise methods for appropriately scoring the responses.
As one can imagine, the item scores will be correlated because they are akin to repeated measurements from the same individual. To put the correlations among the item responses to productive use, educational statisticians have relied on an old idea that can be traced at least as far back as Spearman (1904): unidimensionality. Item responses are posited to be correlated precisely because there is one underlying common source of influence across all the items. In other words, all the observed covariation is fully accounted for by this underlying latent variable, which represents the concept (such as knowledge) that the test aims to measure. This single source of influence is further presumed to vary across individuals according to a continuous distribution, often the Gaussian (normal) curve. This is not a terrible assumption to make when one is developing a math content assessment for end-of-year use, or if one is interested in general aptitude measures.
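The unidimensional idea can be made concrete with a small simulation. The sketch below assumes a two-parameter logistic item response function, which is one common choice and is not specified in the text; the item parameters, sample size, and function names are hypothetical. It draws one latent proficiency per student, generates 0/1 item scores in a students-by-items matrix, and checks that two items are positively correlated only because they share that single latent source.

```python
import math
import random

def p_correct(theta, a, b):
    """Assumed 2PL item response function: P(correct | proficiency theta)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def simulate_responses(n_students, items, seed=0):
    """Return a students-by-items matrix (list of rows) of 0/1 item scores."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_students):
        theta = rng.gauss(0.0, 1.0)  # one normally distributed proficiency
        row = [1 if rng.random() < p_correct(theta, a, b) else 0
               for a, b in items]
        data.append(row)
    return data

def correlation(x, y):
    """Pearson correlation between two equal-length score columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
    sxx = sum((u - mx) ** 2 for u in x)
    syy = sum((v - my) ** 2 for v in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical (discrimination, difficulty) pairs for three items.
ITEMS = [(1.2, -0.5), (1.0, 0.0), (1.5, 0.5)]
responses = simulate_responses(2000, ITEMS)
item1 = [row[0] for row in responses]
item2 = [row[1] for row in responses]
r12 = correlation(item1, item2)
print("correlation between item 1 and item 2:", round(r12, 3))
```

Given the proficiency, the item scores in this sketch are generated independently, so any correlation in the matrix comes from the shared latent variable.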
Upon this idea of unidimensional tests, psychometricians have built elaborate tools for quantifying the technical quality of the items and tests. They aim to address such questions as whether the test is reliable, how it should be scored, whether it can predict the variables or performance that it is meant to predict, whether it can correctly separate groups of individuals, and so on. Psychometricians also hope to understand how an item relates to student proficiency, so as to answer such questions as at which level of proficiency a student would be able to solve the test item most of the time, whether an item would differentially favor subgroups of test takers, and whether a test taker can correctly answer the item by guessing or by other test-taking strategies unrelated to proficiency. There is a large research literature on psychometrics. The edited volume Educational Measurement (Brennan, 2006), a routinely consulted reference book, is 779 pages long and weighs 5 pounds. Given the centrality of unidimensionality in psychometric research and practice, it is not surprising that unidimensional IRT models emerged as the currently predominant modeling framework underlying large-scale educational measurement in practice.
Language assessment, though, deviates from the unidimensional assumption in obvious ways. A lay person can immediately tell that listening, speaking, reading, and writing are distinct (though related) proficiencies. Thus, the developers of ELPA21 faced from the beginning a choice. Should they build four separate tests that are individually unidimensional, or should they build one test that is multidimensional? If multidimensional, how would they build it and score it?
Multidimensional IRT
Standard unidimensional IRT models specify a single underlying latent (conceptual) variable to represent the proficiency being assessed. How to employ this model in the practice of educational measurement is routinely taught in graduate-level training programs and used by assessment programs across the country. A multidimensional IRT (MIRT) model, on the other hand, is considerably more complex. By definition, test items in a multidimensional model may be influenced by one or more latent proficiencies. These latent proficiencies may also be correlated.
To visualize the complexity, a model that is closest in spirit to what the ELPA21 developers would like can be represented in the form of a graph (Figure 1). The nodes in the graph represent variables. The squares are observed variables (scores on actual items), and the circles are latent variables (underlying proficiencies). The single-headed arrows provide clues as to the direction of influence. This model posits that variations in underlying proficiencies cause differences in performance on the test items. The four major language domains are represented jointly in a single model. Importantly, the domains are modeled as correlated variables because, although separable, test takers who perform well on one tend to perform well on others, but not redundantly so. Furthermore, some items may depend on more than one proficiency (e.g., listen and speak, or read and write). Within this modeling framework, these interactive items (often technology enhanced) can simply co-exist with other items that may primarily only require a single proficiency.

Figure 1. An MIRT model for English language proficiency.
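To illustrate how an item can depend on more than one proficiency, the following sketch uses a generic multidimensional logistic item response function. The functional form, the loadings, and the student values are assumptions made for illustration; the text does not state the exact item model used operationally for ELPA21. The correlation among domains would enter through the joint distribution of the four proficiencies, which is not simulated here.

```python
import math

def mirt_p_correct(thetas, loadings, intercept):
    """Assumed multidimensional 2PL: logistic of a weighted sum of proficiencies."""
    z = intercept + sum(a * t for a, t in zip(loadings, thetas))
    return 1.0 / (1.0 + math.exp(-z))

# Order of proficiencies: Listening, Speaking, Reading, Writing.
student = (0.5, -1.0, 0.8, 0.0)          # hypothetical latent proficiencies

reading_only = (0.0, 0.0, 1.3, 0.0)      # item that loads on a single domain
listen_and_speak = (0.9, 1.1, 0.0, 0.0)  # interactive item, two domains

p_reading = mirt_p_correct(student, reading_only, 0.0)
p_interactive = mirt_p_correct(student, listen_and_speak, 0.0)
print("reading item:", round(p_reading, 3))
print("listen-and-speak item:", round(p_interactive, 3))
```

A zero loading means the item does not depend on that domain, so single-domain and interactive items can sit in the same model simply by having different loading patterns.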
In contrast, the traditional unidimensional IRT model would require the ELPA21 development team to work instead with the series of models represented in Figure 2. These unidimensional models treat each of the language domains independently. The items are required to depend only on a single proficiency, which is not consistent with how the new ELP standards are structured, as per our earlier discussion. Interactive items that depend on two or more proficiencies would not be possible, despite the consensus among practitioners that in authentic classroom environments, the ELs are constantly engaged in the interactive modes of language use.

Figure 2. A traditional approach to scaling an English language proficiency assessment: Four unidimensional IRT models. Domains are treated as independent (uncorrelated), and items are influenced by only one domain.
Why, then, would the field of educational measurement be so enamored with the unidimensional concept and modeling framework? One explanation is computational: It is less complicated and more convenient. To the statistician, a model must be fitted to data in order to learn or estimate key parameters of the model, and to quantify the degree of uncertainty in the predictions from the model. Unfortunately, for IRT models, the nonlinearity (owing to the fact that item scores are discrete rather than continuous), the multivariate nature of the data, and the presence of latent variables together cast the “curse of dimensionality” on these models. Historically, numerical algorithms and computational tools (implemented in statistical software programs) were developed to handle only a few latent variables, and most easily, one. The computational challenge of MIRT was once so great that operational assessment programs in education, which have tight deadlines, could not afford the risk of having no result to report, either because estimation would take more time than is tolerable or because the algorithms involved in estimating the model parameters would simply fail to produce meaningful answers.
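A rough sense of the "curse of dimensionality" comes from counting evaluation points. If the integral over the latent variables were approximated on a fixed grid, the number of grid points would grow exponentially with the number of latent dimensions. The sketch below assumes 21 points per dimension, an arbitrary illustrative value; the text does not say which numerical methods were used historically.

```python
def grid_points(points_per_dimension, n_dimensions):
    """Number of points in a full grid over all latent dimensions."""
    return points_per_dimension ** n_dimensions

# One latent variable versus two versus the four language domains.
for d in (1, 2, 4):
    print(d, "dimension(s):", grid_points(21, d), "grid points")
```

Every grid point requires evaluating the likelihood of each response pattern, so the work per estimation cycle scales with this count.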
In the last two decades, significant new developments in flexible MIRT modeling frameworks, as well as new algorithms and software tools, have emerged in the research literature (e.g., Cai, 2010a, 2010b, 2010c; Cai, Yang, & Hansen, 2011; Edwards, 2005). These approaches now enable the specification of realistically complex MIRT models for operational use, with assurances that the resulting parameter estimates would be quickly available and of high quality. The ELPA21 development team took advantage of these modeling advances and chose the MIRT approach to assessment development and analysis.
Innovative Scoring and Classification Procedures
The operational ELPA21 model contains multiple latent dimensions representing proficiency in the language domains. The model depicted in Figure 1 would directly generate a set of four scores per student in the Listening, Speaking, Reading, and Writing domains, along with statements regarding the uncertainty of measurement in the scores. To the users of the scores, performance levels or, in other words, categorical statements about proficiency might be more meaningful and easier to interpret. Through a process known as Standard Setting (e.g., Hambleton & Pitoniak, 2006), the ELPA21 team adopted a set of four cut scores in each language domain to divide the proficiency estimates into five levels. To go from the domain-specific performance level determination to an overall determination of ELP, then, is a nontrivial task.
Traditionally, users of ELP assessments have tended to use a compensatory approach. In this approach, the domain scores are combined (sometimes with preassigned weights) into a composite score. A cut score is developed for the composite, and students whose composite scores exceed the cutoff value are declared proficient. While simple, this approach is not without its own problems. First, the compensatory scoring for the composite implies that one could compensate for lower proficiency in one domain by a correspondingly higher proficiency level in another domain. Whether this should be allowed in the determination of overall proficiency is a question worth discussing, but the use of the composite score itself has already removed the individual differences across domains that may be worth examining. Second, the composite score is less likely to lead to useful instructional planning. If a student does not meet the proficiency cut score, there is no information about the student’s relative performance on the domains left in the composite.
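The loss of profile information under compensatory scoring can be seen in a small example. The scores, equal weights, and cut score below are hypothetical and are not taken from any operational assessment.

```python
def composite(domain_scores, weights=None):
    """Weighted composite of domain scores; equal weights by default."""
    if weights is None:
        weights = [1.0 / len(domain_scores)] * len(domain_scores)
    return sum(w * s for w, s in zip(weights, domain_scores))

# Domain order: Listening, Speaking, Reading, Writing (hypothetical scores).
student_a = (60, 60, 60, 60)   # even performance across domains
student_b = (85, 35, 85, 35)   # strong receptive, weak productive skills
CUT = 55                       # hypothetical composite cut score

for name, scores in (("A", student_a), ("B", student_b)):
    total = composite(scores)
    print(name, total, "proficient" if total > CUT else "not proficient")
```

Both students receive the same composite and the same decision, even though their instructional needs differ sharply.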
The ELPA21 system adopted a different approach. After the domain scores have been divided into the performance levels, the overall determination of proficiency is derived from the profile of domain levels directly. For instance, a student could have achieved a 4-2-4-3 combination (Listening, Speaking, Reading, Writing, in that order). This student is relatively more proficient in receptive language, but the proficiencies in the productive domains are only emerging. To be proficient, a student must achieve Level 4 or 5 in every domain (2^4 = 16 possible profiles). An emerging learner is similarly defined as having Level 1 or 2 in every domain (also 16 possible profiles). With five levels in each of four domains, there are 5^4 = 625 possible profiles in total; the remaining 593 profiles (625 – 32) are grouped together into an intermediate proficiency category termed progressing.
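The profile-based rule described above is simple enough to express directly. The sketch below enumerates every possible profile of four domain levels and applies the rule as stated in the text; the function and variable names are illustrative.

```python
from collections import Counter
from itertools import product

def overall_determination(profile):
    """Classify a (Listening, Speaking, Reading, Writing) profile of levels 1-5."""
    if all(level >= 4 for level in profile):
        return "proficient"
    if all(level <= 2 for level in profile):
        return "emerging"
    return "progressing"

# Enumerate all combinations of five levels across four domains.
all_profiles = list(product(range(1, 6), repeat=4))
counts = Counter(overall_determination(p) for p in all_profiles)

print(len(all_profiles), dict(counts))
print(overall_determination((4, 2, 4, 3)))
```

The enumeration reproduces the counts given in the text (16 proficient, 16 emerging, 593 progressing), and the 4-2-4-3 example falls in the progressing category.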
This approach retains useful information regarding performance that would be necessary to provide support to ELs. A relatively simple rule determines overall proficiency; teachers and students can relate to it. Also, the approach leaves open the possibility for ongoing evaluation of the appropriateness of cut scores for each domain without introducing unnecessary changes to the overall method.
Policy Implication
The primary lesson that one could draw from the experience of developing ELPA21 assessments is that effective collaboration and building of trust between researchers and state education agencies are generally conducive to successful implementation of innovative methodological approaches. Availability of resources to support the collaboration and effective state leadership would remove constraints that limit innovation.
Student Growth Percentiles (SGPs)
Background
Federal education legislation has led to various uses of student test scores in accountability systems that the states are required to implement. One topic that caught the attention of both states and researchers is the issue of measuring student growth in achievement. The SGP (e.g., Betebenner, 2009) has become a widely used method of quantifying growth. SGPs, once aggregated to the classroom level, are often used in teacher evaluations.
The SGP addresses the following statistical problem. Imagine that one could put together a list of students who all achieved the same score on the prior year’s state assessment (P) and that the students are assessed again at the end of the current school year. Each student’s current year performance (C) could be expressed in terms of percentile among the group with equal prior year score. Obviously for two students with the same P, the student who scored higher this year will have a higher percentile rank. This conditional percentile is therefore imbued with the growth interpretation.
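The conditional percentile idea can be sketched literally: collect the students with the same prior score and rank the current score within that group. This is a minimal illustration with made-up scores and a mid-rank convention for ties, both of which are assumptions. Operational SGP implementations typically estimate the conditional percentiles with quantile regression rather than by exact matching on the prior score.

```python
def conditional_percentile(prior, current, records):
    """Percentile rank of `current` among students whose prior score equals `prior`.

    `records` is a list of (prior_score, current_score) pairs. Ties count as
    half (mid-rank convention, an assumption made for this sketch).
    """
    peers = [c for p, c in records if p == prior]
    if not peers:
        raise ValueError("no students with this prior score")
    below = sum(1 for c in peers if c < current)
    ties = sum(1 for c in peers if c == current)
    return 100.0 * (below + 0.5 * ties) / len(peers)

# Hypothetical (prior year, current year) scale scores.
records = [(300, 310), (300, 320), (300, 330), (300, 340), (310, 320)]

print(conditional_percentile(300, 330, records))
print(conditional_percentile(300, 310, records))
```

Two students with the same prior score of 300 receive different percentiles solely because of their current-year scores, which is the source of the growth interpretation.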
While over 25 states have adopted the SGP as a growth measure, it is by no means clear whether the SGP’s technical properties are well understood by the users. In particular, it is not clear whether the SGP is reliable, given that observed test scores are known to contain measurement error. Measurement error in test scores is analogous to slight variations in instrument readings over repeated measurement of physical quantities. Test scores can fluctuate, even if the true proficiency has not changed. With long and well-designed tests that average over many items, measurement error can be reduced but not eliminated. The reliability of the SGP has a specific meaning in the present context: It refers to whether an SGP based on observed student assessment performance accurately reflects the true conditional growth percentile of a particular student. An index of reliability quantifies the degree to which the SGP may be corrupted by measurement error. If the index is close to zero, the SGP estimate is totally unreliable. If it is close to 1.0, preferably well above .90, the SGP may be considered reliable.
Reliability of SGPs
Following Lockwood and Castellano (2015) as well as Monroe and Cai (2015), the SGP can be recast as a statistic derived from an MIRT model. Figure 3 depicts this MIRT model. Student performance on test items for the current year and at least one prior year are linked in a longitudinal data set. Most current state assessments make the assumption of unidimensionality, as indicated by a single circle representing student achievement in each year. While assuming unidimensionality is not ideal given the preceding discussion, it is reflective of current practice in building mandatory state assessments. The two latent dimensions representing prior and current year achievement are correlated. The MIRT model provides an estimate of the conditional distribution of C given P directly. From the estimate of the conditional distribution, one could directly derive not only the SGP but also, more importantly, an estimate of the uncertainty associated with the SGP estimate. Monroe and Cai (2015) describe the procedure in detail.

Figure 3. An MIRT model for estimating conditional growth.
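To show what deriving a percentile from a conditional distribution involves, the sketch below assumes that prior and current latent achievement are standardized and jointly normal with correlation rho. Under that assumption, C given P = p is normal with mean rho * p and variance 1 - rho^2, so the conditional percentile is a normal tail area. The value of rho and the function names are hypothetical, and this is not the full procedure of Monroe and Cai (2015), which also accounts for measurement error in the item responses.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def latent_sgp(prior, current, rho):
    """Conditional percentile of `current` given `prior` for standardized
    bivariate normal latent achievement with correlation `rho` (assumed)."""
    cond_mean = rho * prior
    cond_sd = math.sqrt(1.0 - rho ** 2)
    return 100.0 * normal_cdf((current - cond_mean) / cond_sd)

RHO = 0.7  # hypothetical year-to-year correlation

print(latent_sgp(1.0, 0.7, RHO))   # current equals the conditional mean
print(latent_sgp(0.0, 1.0, RHO))   # above expectation for an average prior
```

A student whose current achievement equals the conditional mean lands at the 50th percentile, regardless of how high or low the prior achievement was.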
A benefit of MIRT modeling is that reliability indices are readily available. Monroe and Cai (2015) developed such an index, whose interpretation is consistent with that of other reliability indices used in educational measurement. The MIRT model also provides a conceptual framework for examining the conditions that influence the reliability of SGPs.
Monroe and Cai (2015) found that under conditions typically found in state end-of-year assessments, the reliability of SGPs tends to be unacceptably low (in the .60 range). They also found that the practice of including more than one year’s worth of prior achievement data in SGP calculations does not improve its reliability and may in some cases make it worse.
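A small simulation can give intuition for why measurement error affects a conditional percentile more than it affects the scores themselves. The sketch below is not the reliability index developed by Monroe and Cai (2015). It assumes normally distributed latent achievement, equal score reliability in both years, and hypothetical values for the year-to-year correlation and reliability. It then reports the squared correlation between percentiles computed from error-free achievement and percentiles computed from error-laden scores, as a rough stand-in for reliability.

```python
import math
import random

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def conditional_pct(prior, current, rho, var):
    """Percentile of `current` given `prior` for zero-mean bivariate normal
    scores with common variance `var` and correlation `rho`."""
    resid_sd = math.sqrt(var * (1.0 - rho ** 2))
    return 100.0 * normal_cdf((current - rho * prior) / resid_sd)

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
    sxx = sum((u - mx) ** 2 for u in x)
    syy = sum((v - my) ** 2 for v in y)
    return sxy / math.sqrt(sxx * syy)

def simulate(n=5000, rho=0.8, score_reliability=0.9, seed=1):
    """Squared correlation between true and observed conditional percentiles."""
    rng = random.Random(seed)
    err_sd = math.sqrt(1.0 / score_reliability - 1.0)  # true variance is 1
    obs_var = 1.0 / score_reliability                  # true plus error variance
    obs_rho = rho * score_reliability                  # attenuated correlation
    true_pct, obs_pct = [], []
    for _ in range(n):
        p = rng.gauss(0.0, 1.0)
        c = rho * p + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        xp = p + rng.gauss(0.0, err_sd)   # observed prior-year score
        xc = c + rng.gauss(0.0, err_sd)   # observed current-year score
        true_pct.append(conditional_pct(p, c, rho, 1.0))
        obs_pct.append(conditional_pct(xp, xc, obs_rho, obs_var))
    return pearson(true_pct, obs_pct) ** 2

r2 = simulate()
print("squared correlation, true vs. observed percentiles:", round(r2, 2))
```

The conditional percentile depends on the part of current achievement not predicted by prior achievement. That part has a small variance when the two years are highly correlated, so errors in both scores loom large relative to it.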
Policy Implication
With the popularity of the SGP and the stakes attached to it by including SGP-related measures in accountability calculations, it would be prudent for states to examine SGP’s technical properties. Current research suggests that the SGP can be highly unreliable at the individual level. To base individual feedback on such an unreliable measure is not going to lead to effective planning and improvement. Alternatives to the SGP are warranted.
Conclusion
Most rational decision-making processes involve some analysis of data or evidence. While multivariate statistical procedures may seem esoteric, they very frequently occupy central roles in important education policy discussions, particularly assessment policy. Their continued improvement and freedom from misuse are not solely the responsibility of researchers. Experience from developing a large-scale assessment system (ELPA21) suggests that to overcome implementation barriers and to translate new research into practical improvements, collaboration between researchers and state education agencies is critical. And sometimes simple ideas are not necessarily the best, as the research on SGPs shows. In both cases, modern psychometric techniques address current problems in educational measurement. We also urge researchers in education, particularly educational assessment researchers, to consider policy implications of their work and to keep in mind that the assessment should inform and support instruction.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
