Abstract
K–12 English language proficiency tests that assess multiple content domains (e.g., listening, speaking, reading, writing) often have subsections based on these content domains; scores assigned to these subsections are commonly known as subscores. In today’s accountability-oriented educational environment, testing programs face increasing customer demands for the reporting of subscores in addition to total test scores. Although reporting subscores can provide much-needed information for teachers, administrators, and students about proficiency in the test domains, a major drawback of subscores is their lower reliability compared with that of the test as a whole. In addition, viewing language domains as if they were not interrelated, and reporting subscores without considering the relationships between domains, may contradict the theory of language acquisition.
This study explored several methods of assigning subscores to the four domains of a state English language proficiency test, including classical test theory (CTT)-based number correct, unidimensional item response theory (UIRT), augmented item response theory (A-IRT), and multidimensional item response theory (MIRT), and compared the reliability and precision of these different methods across language domains and grade bands. The first two methods assessed proficiency in the domains separately, without considering the relationship between domains; the last two methods took into consideration relationships between domains. The reliability and precision of the CTT and UIRT methods were similar and lower than those of A-IRT and MIRT for most domains and grade bands; MIRT was found to be the most reliable method. Policy implications and limitations of this study, as well as directions for further research, are discussed.
English language learners (ELLs) form the fastest growing educational subgroup in the nation. The percentage of public school students in the United States who were ELLs was higher in the school year 2013–14 (9.3%, or an estimated 4.5 million students) than in 2003–04 (8.8%, or an estimated 4.2 million students) and 2012–13 (9.2%, or an estimated 4.4 million students) (US Department of Education, 2016). Statewide English language proficiency (ELP) testing programs are required by the No Child Left Behind Act (NCLB, 2001) to assess ELLs’ progress toward ELP annually and to report diagnostic language assessment information for examinees that allows parents, teachers, and administrators to understand and address the specific academic needs of ELLs. Title III of NCLB requires states to establish ELP content standards and to use a single ELP test to assess students’ progress in and mastery of these standards in four domains: Reading, Writing, Speaking, and Listening. Results from the annual administration of ELP tests are used to report on students’ progress in and attainment of ELP (National Research Council [NRC], 2011). Boals et al. (2015) outline several roles of ELP assessments: clarifying important school-based language expectations; informing the educational community about how students are progressing in English language development (ELD); and providing the information needed to ensure accountability to federal civil rights mandates so that ELL students receive the educational support services they need and are entitled to receive. Depending on how they are designed, implemented, and acted on, assessments can help ensure equitable educational opportunities for ELLs.
States may use assessments developed by consortia (e.g., ACCESS developed by the WIDA consortium; ELDA developed by the LEP-SCASS consortium), commercially developed tests (such as LAS Links or SELP), or tests they developed themselves. Overall, in the 2009–10 school year, states used approximately 19 different proficiency tests (NRC, 2011). However, these tests shared a number of features: all assessed the four domains specified by the legislation (Listening, Speaking, Reading, and Writing), assessed academic language, were standards-based, and were aligned with the language demands of the states’ core academic content standards. From the technical perspective, some tests consist strictly of multiple-choice (MC) items, and others consist of a combination of MC and constructed response (CR) items. Nearly all the tests report scores for each of the four domains, an overall composite score summarizing performance in all four domains, and a comprehension score that is a composite of performance on the Listening and Reading tests. The composite scores are not computed consistently across states: some are based on equally weighted subscale scores and others on unequally weighted ones. Where differential weighting is used, it reflects states’ priorities with regard to those aspects of ELP in the four domains that are acquired first and those domains that are critical to succeeding in school. States also differ in the number of performance categories for ELLs and the cut scores necessary to achieve each category.
States’ methods of identification and reclassification of ELLs vary, with some states using only the scores on an ELP assessment (including subscores), and others also including information from core content tests and input from school personnel and parents. Similarly, states use different criteria for exiting Title III. Bailey and Carroll (2015) identify four general decision rules that states may use regarding exiting a student from a Title III program: conjunctive, in which a student needs to pass on all indicators (which in some states may include core classes as well as ELP assessment); compensatory, in which a student needs to pass only on some indicators; complementary, in which a student needs to pass on either one or another indicator; and mixed, which is a combination of the models mentioned above. In conjunctive models, the reliability of all subscores is critical, because the appropriateness of reclassification hinges on the subscale with the lowest reliability, whereas in compensatory and complementary models, the lower reliability of one subscore can be compensated for by the higher reliability of another. (It should be noted that we are using the term “domain” to denote an area of language proficiency such as Listening, Reading, Writing, or Speaking, the term “subscale” to denote the portion of a test that covers a specific domain, and the word “subscore” to refer to the score on a subscale.)
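The first three decision rules can be sketched in a few lines of Python. This is only an illustration of the logic as described above, not any state's actual exit procedure; the function names, the indicator names, and the `min_passed` threshold are hypothetical.

```python
def conjunctive(passed):
    """Exit only if every indicator is passed."""
    return all(passed.values())


def compensatory(passed, min_passed):
    """Exit if at least `min_passed` indicators are passed.

    The threshold is an assumption; the source only says "some indicators".
    """
    return sum(passed.values()) >= min_passed


def complementary(passed, first, second):
    """Exit if either one of two named indicators is passed."""
    return passed[first] or passed[second]


# Hypothetical student who reaches the required level on three of four subscales.
student = {"listening": False, "speaking": True, "reading": True, "writing": True}

exits_conjunctive = conjunctive(student)
exits_compensatory = compensatory(student, min_passed=3)
exits_complementary = complementary(student, "listening", "reading")
```

Under the conjunctive rule the single failed indicator blocks the exit, which is why the least reliable subscore carries the most weight in that kind of decision.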
The NRC report stresses that the tests themselves are more similar than different between the states; the differences are in the ways they are used for identification and reclassification of ELLs. Although that is true, test differences (including, for example, different content coverage within language domains, the types of tasks and items used to test the domains, the resulting assessment reliability overall and in the subscales, and the methods of subscores and total score calculation) result in the specific interpretation and consequent use of those tests. For that reason, in order to obtain a full picture of the factors affecting the decision-making process regarding ELL identification and classification, it is necessary to evaluate these assessment characteristics.
From the technical perspective, two main scoring methods are used for ELP assessments: classical test theory (CTT) and item response theory (IRT). The main difference between the two is in the assumptions used to assign scores to the examinees. CTT is based on the idea that a person’s observed score on a test (which is often the sum of the scores for all items answered correctly) is the sum of a true score (error-free score) and an error score. IRT, on the other hand, is based on the premise that the probability of a correct response to an item is a function of the person’s trait (such as ELD) and item parameters (difficulty, discrimination, and guessing) (Hambleton & Jones, 1993).
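To make the contrast concrete, the sketch below computes a CTT number-correct score and an IRT response probability for a dichotomous item. It assumes the common three-parameter logistic form without a scaling constant; the function and parameter names are illustrative and are not taken from the study.

```python
import math


def ctt_number_correct(item_scores):
    """CTT observed score: the sum of the item scores (1 = correct, 0 = incorrect)."""
    return sum(item_scores)


def three_pl_probability(theta, a, b, c):
    """Probability of a correct response under a three-parameter logistic model.

    theta: person trait level
    a: item discrimination, b: item difficulty, c: guessing (lower asymptote)
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Illustrative values only.
observed_score = ctt_number_correct([1, 0, 1, 1])
p_correct = three_pl_probability(theta=0.0, a=1.0, b=0.0, c=0.2)
```

When the trait level equals the item difficulty, the probability sits halfway between the guessing parameter and one.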
CTT and unidimensional IRT (UIRT) treat assessment subscales essentially as four separate tests based on the assumption that the trait being measured is unidimensional in nature. Figure 1 illustrates the similarities between CTT and UIRT in the lack of use of information about subscale relationships.

Figure 1. Test scoring with CTT and UIRT.
One issue with reporting separate subscores for an assessment using the above-mentioned methods is that these subscores often have substantially lower reliability than the full assessment owing to the small number of items on the subscales (Haberman, 2008; Haberman, Sinharay, & Puhan, 2009; Sinharay, Puhan, & Haberman, 2011). Consequently, a larger share of the variance in such a subscore reflects random error rather than actual student performance.
The treatment of language proficiency assessments as a combination of several unidimensional assessments poses another problem: the view of language proficiency as the additive function of unrelated domains lacks theoretical support. It has been firmly established that proficiency in one language domain interacts with the proficiency in all other domains (Alderson, 2007; Chapelle, 2011; Solano-Flores & Trumbull, 2003). Building an assessment on assumptions that are not theoretically supported diminishes the strength of the argument for the assessment’s validity. Additionally, from the psychometric standpoint, underestimating the number of assessment dimensions may lead to increased errors of measurement and the possibility of making incorrect inferences about a student’s ability (Walker & Beretvas, 2003).
Although there is theoretical support for the interaction between domains, the exact degree and direction of their relationship remain unclear (Chiappe & Siegel, 2006; Farnia & Geva, 2013). For example, it is widely believed that vocabulary and oral skills influence reading ability (Geva, 2006; Gottardo, 2002); at the same time, some studies have shown that it is the print exposure that explains significant variance in children’s growth in vocabulary, verbal fluency, and general knowledge (Chiappe & Siegel, 2006). Whereas the variation in students’ word recognition skills can generally explain most of the variation in reading achievement from kindergarten to third grade, oral language proficiency and higher-order comprehension processes predict greater shares of the variance in reading achievement from grade 4 onward (e.g., Catts, Hogan, & Adlof, 2005; Vellutino, Scanlon, Small, & Tansman, 1991). Similarly, Farnia and Geva (2013) emphasize that the nature of the predictors of reading comprehension changes over time. Several theories of interaction between writing and reading across time exist. A study by Davis and Bryant (2006), for example, found that at 7 years of age, the ability to read influenced the ability to write, but then the causality of this relationship was reversed at an older age, and became much less significant by the age of 10. The relationship between word recognition and spelling is often relatively more important early on, but more structural aspects of text knowledge become more significant with older and more proficient readers.
In order to address the issue of low subscore reliability, scoring procedures were developed that take into consideration relationships between subscales, with the resulting subscores being more reliable. One such procedure is augmented IRT (A-IRT), and another one is multidimensional IRT (MIRT). These methods also reflect more accurately the theoretical underpinnings of ELD, including non-uniform growth in the language domains and the changing relationships between domains along the ELD continuum.
The idea behind using A-IRT is to borrow information from some other source collateral to examinee responses (such as scores from other subscales) in order to reduce error (Wainer et al., 2001). The assumption behind the MIRT approach is that each subscore represents a distinct trait, and the final score reflects a mixture of related multidimensional traits. Item parameters are calculated from the factor loadings of each item on each subscale and the estimated covariance matrix of the subscores, and then the person parameter is calculated based on these values (Reckase, 2009). Therefore, the MIRT approach to calculating subscores, like the augmented approaches, also takes advantage of shared information across subscores to improve the reliability of these estimates. However, instead of borrowing that information from the subscores of other persons in the group, MIRT borrows it from the person’s own scores on other subscales. Figure 2 illustrates the similarities between A-IRT and MIRT in the use of information about subscale relationships.
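The simplest case of "borrowing" information is Kelley's regressed estimate for a single subscale, which pulls an observed score toward the group mean in proportion to the subscale's unreliability. The sketch below shows only that univariate case on an observed-score scale; it is not the study's A-IRT procedure, and the function name and numbers are illustrative.

```python
def kelley_estimate(observed, group_mean, reliability):
    """Kelley's regressed true-score estimate for one subscale.

    The estimate is a weighted average of the person's observed score and the
    group mean; the weight on the observed score is the subscale reliability.
    """
    return reliability * observed + (1.0 - reliability) * group_mean


# Illustrative values: the less reliable the subscale, the stronger the pull
# toward the group mean.
low_reliability_estimate = kelley_estimate(observed=20.0, group_mean=15.0, reliability=0.64)
high_reliability_estimate = kelley_estimate(observed=20.0, group_mean=15.0, reliability=0.95)
```

Augmented scoring extends this idea to several subscales at once, so that the estimate for one subscale also draws on the others.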

Figure 2. Test scoring under A-IRT and MIRT.
To summarize the comparison between the four subscore methods investigated here, the difference between CTT and UIRT is in the framework these theories use to assign a score to the examinee: in CTT, the observed score is the sum of a true score and an error score, whereas in IRT, the score is derived from a likelihood function based on the probability that a person of a certain ability will correctly answer an item of a certain difficulty. The similarity between CTT and UIRT lies in the fact that neither one takes into consideration the relationships between the subscales. The similarity between UIRT, A-IRT, and MIRT is that they use the same framework for assigning scores, namely, item response theory. The difference between UIRT and the other two IRT models (A-IRT and MIRT) is that UIRT does not take into consideration the relationships between subscales when estimating subscores and the total score, whereas A-IRT and MIRT do. The difference between A-IRT and MIRT is in the way they make use of the information about relationships between subscales.
This study compares the augmented and multidimensional methods of subscore reporting for four subscales (Listening, Speaking, Reading, and Writing) with the unidimensional methods in terms of subscore reliability and precision for K–12 ELLs. In addition, the study examines the changes in reliability and precision of subscores between subscales at the same grade levels, and within subscales across grade levels. The study builds on research by Longabach (2015) and improves on one feature of that study’s data analysis, which was carried out using a combination of several software packages. The present analysis was carried out using only the R package mirt (Chalmers, 2012; code available upon request), which not only simplified calculations and parameter estimation but also made them more consistent from one method of subscore reporting to another.
A number of previous studies compared the reliability of different methods of subscore reporting (de la Torre, Song, & Hong, 2011; de la Torre & Patz, 2005; Wang, Chen, & Cheng, 2004; Yao & Boughton, 2007); however, none have been performed on a language proficiency assessment. Also, this study used real data, whereas many previous studies’ data were simulated (e.g., de la Torre, Song, & Hong, 2011; de la Torre & Patz, 2005; Edwards & Vevea, 2006; Yao & Boughton, 2007). Although there are some advantages to using simulated data, simulations do not allow one to examine the irregularities that occur in real data. The uniformity of correlations between subscales across all examinees, grades, and ability levels that is common in simulated data is unlikely to occur in real data. Additionally, simulated data are usually generated from an assumed model that the data then follow, which is not the case with real data (de la Torre & Song, 2009).
Methods
Participants
The participants for the study were the approximately 44,000 K–12 students in state public schools who took the ELP assessment in 2013 (Table 1). Of these students, 48% were female and 52% were male. The majority of students claimed Spanish as their home language (81%); the next most frequent home languages were Vietnamese and Arabic.
Table 1. Number of assessed ELL K–12 students by grade level.
Instrument
The data for this study were collected from the state ELP assessment administered in February–May 2013. Generally, any student identified as an ELL based on a prior year’s administration of the state assessment or another commercially available assessment is required to take the assessment. In addition, a student new to the district and whose home language is not English needs to take the assessment. An ELL student may exit an English for Speakers of Other Languages (ESOL) program by achieving a “fluent” performance level on all four subscales (Reading, Writing, Listening, and Speaking) and the total composite score of the assessment for two consecutive years.
The subscales consisted of a combination of multiple-choice (MC) and constructed response (CR) items. The Speaking subscale items were administered individually to students. Students responded orally; the tasks included answering short questions; elaborating on a question; and describing what is happening in a picture or a picture sequence. The examiner scored the response to each question immediately after it was given. The Listening subscale items were presented orally; students responded by following directions or answering questions on paper. The tasks included following directions; identifying beginning, middle, and ending word sounds; distinguishing between a grammatically correct sentence and an incorrect one; and answering comprehension questions based on a story read to the students. For the Reading subscale, the students read the items from the booklets and answered the related questions on paper. The tasks included identifying rhyming words; completing cloze sentences; identifying synonyms/antonyms; selecting correct word definitions; distinguishing between fact and opinion; identifying analogies; and answering questions related to a story read by the students. The Writing subscale for K–1 grade levels consisted of MC Writing items only. The tasks included writing letters/numbers based on an oral prompt; completing cloze sentences; correctly rewriting sentences with syntactic errors; identifying correctly spelled words; and writing word labels to describe a picture. For grades 2–12, the Writing section was split 50–50 between an MC Writing section and a Writing Rubric section. The MC section included identifying grammatically correct use of parts of speech, punctuation, and syntax; and identifying synonyms/antonyms. For the Writing Rubric section of the Writing subscale, students were asked to write several short essays based on either a picture or a written prompt, which were then scored by a human rater.
Four performance levels were adopted for the assessment: (1) beginning, (2) intermediate, (3) advanced, and (4) fluent. The same test items for all four subscales were administered within a grade band (K–1, 2–3, 4–5, 6–8, and 9–12), but different (age-appropriate) items were administered for each grade band. The cut scores, however, were set separately for seven levels: K, 1, 2, 3, 4–5, 6–8, and 9–12. Differential weights were assigned to the subscales based on the seven levels, with the highest weights assigned to Speaking and Listening for younger students, and to Reading, Writing, and Listening for older students. The number of items per subscale ranged from nine to 28, with the number of possible points per subscale ranging from 18 to 31.
Data analysis
The assessment was scored using CTT, UIRT, A-IRT, and MIRT; reliability was estimated for each scoring method. CTT scores were calculated by adding up the numbers of correct item answers. Because the test combines MC and CR items, the 2PL generalized partial credit model (GPCM; Muraki, 1992) was used to score the items under the UIRT framework. This model expresses the probability of selecting a particular response category over the previous one. A-IRT subscores were calculated by using a regression-like formula, where the weights (β coefficients) depended on the covariances between subscores and the reliability of subscores. For MIRT, the multidimensional version of the generalized two-parameter partial credit model (M-2PPC) described by Yao and Schwarz (2006) was used; it was designed to model the interaction of persons with items that are scored with more than two categories. Each subscale contributed some degree of unique variance to the overall proficiency trait. A confirmatory IRT model was fitted, in which the number of dimensions was specified, but the correlations between the dimensions were unconstrained and estimated by the model.
In addition, a compensatory model was used, in which a stronger ability can compensate for a weaker one. The choice of a compensatory model was based on the assumption that it describes how students approach language assessment items better than a noncompensatory model does. Compensatory models assume that examinees use one of several alternative strategies for answering an item correctly, rather than using, for example, one set of skills only for Speaking, and a different set of skills only for Listening. These models are most successfully used for estimating the parameters of items that involve combinations of attributes or skills (Embretson & Yang, 2013). Noncompensatory models, on the other hand, assume that examinees must master all the skills necessary to answer an item correctly. Consequently, these models are more commonly used to describe cognitive traits where it is necessary to execute a series of steps successfully in a specific order, such as in testing mathematical abilities. In addition, noncompensatory models present severe estimation challenges owing to their need to estimate a separate difficulty parameter for each item on each dimension (Wang & Nydick, 2015).
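The category probabilities of the generalized partial credit model mentioned above can be sketched as follows. This uses one common parameterization of Muraki's model (a single discrimination per item and one step difficulty per score point) and is not the study's mirt code; the function name and item values are illustrative assumptions.

```python
import math


def gpcm_probabilities(theta, a, steps):
    """Category probabilities for one item under a generalized partial credit model.

    theta: person trait level
    a: item discrimination
    steps: step difficulties b_1..b_m for an item scored 0..m
    Returns a list of m + 1 probabilities, one per score category.
    """
    # Cumulative sums of a * (theta - b_v); the sum for category 0 is defined as 0.
    cumulative = [0.0]
    for b in steps:
        cumulative.append(cumulative[-1] + a * (theta - b))
    # Subtract the maximum before exponentiating for numerical stability.
    largest = max(cumulative)
    weights = [math.exp(z - largest) for z in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


# Illustrative three-step item (scores 0 to 3).
probs = gpcm_probabilities(theta=0.5, a=1.2, steps=[-0.5, 0.3, 1.0])

# Probability of category 2 given that the response is in category 1 or 2.
adjacent = probs[2] / (probs[1] + probs[2])
```

The `adjacent` quantity is the sense in which the model describes choosing one category over the previous one: it follows a two-parameter logistic curve in theta with the corresponding step difficulty. With a single step, the model reduces to the two-parameter logistic model for a dichotomous item.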
Reliability was conceptualized as Cronbach’s alpha for CTT scoring, and as a ratio of observed to estimated scores for IRT scoring methods. Precision was conceptualized as the standard error of measurement (SE), which was calculated as the product of the standard deviation of the observed scores and the square root of one minus the reliability. Comparing the reliability and precision of CTT- and IRT-derived scores can be less than straightforward, as these two frameworks conceptualize reliability differently. To mitigate these differences, IRT-based subscore reliability and precision were derived in a way that is conceptually and formulaically similar to the CTT methods. IRT-derived SE was averaged across all score levels.
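The two CTT quantities described here can be written out directly. The sketch below uses the standard formula for Cronbach's alpha and the SE formula stated above; the function names and the small data set are illustrative and are not taken from the assessment.

```python
import math
import statistics


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of examinee rows, each a list of item scores."""
    n_items = len(responses[0])
    item_variances = [
        statistics.variance([row[i] for row in responses]) for i in range(n_items)
    ]
    total_variance = statistics.variance([sum(row) for row in responses])
    return n_items / (n_items - 1) * (1.0 - sum(item_variances) / total_variance)


def standard_error(sd_observed, reliability):
    """Standard error of measurement: SD of observed scores times sqrt(1 - reliability)."""
    return sd_observed * math.sqrt(1.0 - reliability)


# Hypothetical responses of five examinees to four items (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
alpha = cronbach_alpha(responses)

# Illustrative values: an observed-score SD of 3.0 and a reliability of .64.
se = standard_error(sd_observed=3.0, reliability=0.64)
```

Because SE depends on both the score spread and the reliability, two subscales with the same reliability can still differ in SE if their score scales differ.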
Results
Subscore variability
Subscore variability was generally reduced for all subscales and grade levels from CTT and UIRT to A-IRT and MIRT, with A-IRT and MIRT having very comparable and fairly low variability. A-IRT subscores of an individual were closer to the mean of the group subscores on each subscale, since A-IRT estimates them by using average scores of all the individuals in the group on that same subscale. On the other hand, MIRT subscores of an individual for a given subscale were more like that individual’s subscores for other subscales, since MIRT estimates them by using the individual’s scores on other subscales and the relationships between the subscales. Figures 3–6 illustrate the subscores of the same five randomly selected individuals in grade 1 for all subscales obtained by the four different methods (CTT, UIRT, A-IRT, and MIRT).

Figure 3. CTT subscore profiles.

Figure 4. UIRT subscore profiles.

Figure 5. A-IRT subscore profiles.

Figure 6. MIRT subscore profiles.
Subscore correlations
Correlations between subscores derived by different methods for each grade level and subscale indicate a high level of consistency between estimation methods, ranging from .71 to .99. Generally, the highest correlations were noted between A-IRT and MIRT. Correlations between subscales for CTT and UIRT methods were very similar across grade levels; correlations increased consistently when A-IRT and MIRT methods were used. This relationship was expected, as the latter two methods are based on using information from all subscores to derive the subscores for the subscale of interest. Within pairs of subscales, subscore correlations for all methods were almost always higher in grade level 9–12 than in grade K, indicating that subscales become more interrelated as the students get older and become more proficient; however, the increase in the magnitude of the correlation was not always consistent between grades and between methods. It appears, however, that the correlation between Reading and MC Writing was most consistently high across grades, with an average of .72 for CTT, .69 for UIRT, .95 for A-IRT, and .96 for MIRT (Table 2), indicating a strong positive relationship between these subscales.
Table 2. Correlations between subscales within each method of subscore estimation for all grade levels.
Reliability
Differences in reliability between subscore methods
We found that MIRT was the most reliable method of subscore reporting, closely followed by A-IRT, for all subscales and grade levels. On average, A-IRT and MIRT had substantially higher reliability across all grades and all subscales (.89 and .95, respectively) compared to the reliability of UIRT and CTT (.79 for both). In addition, although the reliability of subscores varied substantially between subscales within a given grade for CTT and UIRT, it was more uniform (the reliability values for subscores were closer together) in A-IRT and MIRT.
Differences in reliability between subscales’ scores
If one were to rank the average reliability of scores for different subscales across grades, fairly consistently Writing Rubric had the highest reliability (ranging from .89 for CTT and UIRT to .96 for MIRT), followed by Speaking (ranging from .87 for CTT and UIRT to .97 for MIRT) for all methods except for A-IRT (in which case it was Reading), then followed by Reading (ranging from .82 for UIRT to .97 for MIRT), MC Writing (ranging from .70 for UIRT to .95 for MIRT), and Listening (ranging from .64 for CTT to .92 for MIRT). The least reliable subscale scores received the highest increase in reliability when A-IRT and MIRT were used. Listening was consistently the least reliable subscale across all grades and all methods, and consequently it received the most significant boost in reliability when scored with A-IRT and MIRT (Table 3). Other subscales, whose reliability was higher, received a smaller increase in reliability when scored with A-IRT and MIRT. For example, Writing Rubric was consistently the subscale with the highest reliability across all grades, and consequently received the smallest increase in reliability when scored with A-IRT and MIRT (Table 4). MC Writing (Table 5), Reading (Table 6), and Speaking (Table 7) subscales’ reliability received a moderate increase in the course of A-IRT and MIRT subscore assignment. Between grades, within-subscale reliability on average was increasing for all subscore methods and for all subscales from grade K to grade levels 9–12; however, that increase was neither uniform nor linear, decreasing in some grade levels and increasing in others.
Table 3. Reliability and standard error (SE) for the Listening subscale.
Table 4. Reliability and standard error (SE) for the Writing Rubric subscale.
Table 5. Reliability and standard error (SE) for the MC Writing subscale.
Table 6. Reliability and standard error (SE) for the Reading subscale.
Table 7. Reliability and standard error (SE) for the Speaking subscale.
Both A-IRT and MIRT use information from other subscales’ scores to estimate the scores for the subscale of interest. In MIRT, the higher the covariance between a specific latent trait and the subscale score of interest, the more impact that latent trait has on the augmented subscale score of interest. In A-IRT, the degree of impact of other subscales’ scores on the scores of the subscale of interest is based on the correlations between subscores and their reliability. The subscore with the highest reliability and the highest correlation with the scores on the subscale of interest has the biggest impact on the subscale of interest. It is reasonable to expect that a highly reliable subscale will have the highest impact on itself. However, if the subscale of interest has very low reliability, the next most reliable subscale that is highly correlated with it may have the highest impact.
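A two-subscale simplification shows how these weights can shift. The sketch assumes standardized observed scores and the multivariate regression form of augmentation, in which the weight matrix is the true-score covariance matrix multiplied by the inverse of the observed covariance matrix; the closed-form weights below follow from that assumption for two subscales. The function name and values are illustrative, and this is not the study's estimation code.

```python
def augmentation_weights(r, rel_target):
    """Regression weights for the target subscale in a two-subscale augmentation.

    r: observed correlation between the two subscales (|r| < 1)
    rel_target: reliability of the subscale of interest
    Returns (weight on the target's own score, weight on the other subscale's score).
    Assumes standardized scores and r**2 no larger than the product of the two
    reliabilities, so that the implied true-score covariance matrix is valid.
    """
    denom = 1.0 - r * r
    w_self = (rel_target - r * r) / denom
    w_other = r * (1.0 - rel_target) / denom
    return w_self, w_other


# Illustrative values with an observed correlation of .60.
reliable_self, reliable_other = augmentation_weights(r=0.6, rel_target=0.9)
unreliable_self, unreliable_other = augmentation_weights(r=0.6, rel_target=0.5)
```

With a reliability of .90 the subscale's own score dominates; with a reliability of .50 the other subscale receives the larger weight. In this simplification the other subscale's reliability enters only through the observed correlation, which measurement error attenuates, and with a correlation of zero the own-score weight reduces to the subscale's reliability.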
Standard error of measurement (SE)
Just as reliability increased from CTT and UIRT to A-IRT and MIRT across subscales and grade levels, SE decreased. SE was higher for CTT and UIRT (1.54 and .45, respectively) than for A-IRT and MIRT (.39 and .17, respectively). Across grade levels, for CTT, average SE was highest for the Reading subscale (1.82), followed by the Listening subscale (1.79), Speaking (1.63), MC Writing (1.38), and lowest for the Writing Rubric (1.09). For IRT methods, the highest SE was on the Listening subscale (.43), followed by the MC Writing (.37), Reading (.32), Speaking (.29), and the lowest was on the Writing Rubric subscale (.27), indicating that the Listening and Reading subscales had the least precision in measurement, and the Writing Rubric was consistently the subscale measured with the most precision. Within subscales between grade levels for all subscore reporting methods, the SE generally decreased from grade K to grade levels 9–12 for most subscales, although, as with reliability, this decrease was not uniform. This finding indicates that subscores generally became more precise in higher grades, or, in other words, one could be more certain about assigning a specific score to students on a given subscale (Tables 3–7).
Conclusion
The reliability of CTT was found to be similar to that of UIRT. Some previous studies either confirm that the reliability of the two methods is very similar or suggest that the reliability of UIRT may be somewhat higher than that of CTT (Haberman & Sinharay, 2010; Shin, 2007; Xu & Stone, 2012; Yao & Boughton, 2007). Although these two methods differ in their key assumptions, they are similar in the sense that they treat each subscale as a separate test and do not use any additional information about student ability, internal or external to the test, and so the reliability of these two approaches tends to be similar. In addition, when the number of items in a subscale is large, as was the case with the present assessment of language proficiency, these two methods often tend to have similar reliability.
The reliability of A-IRT was consistently found to be higher than that of UIRT. This finding was supported by a number of previous studies (de la Torre & Song, 2009; Haberman & Sinharay, 2010; Wainer et al., 2001; Wang, Chen & Cheng, 2004; Yao & Boughton, 2007; Skorupski, 2008). MIRT was consistently found to be more reliable than the non-augmented methods of subscore estimation (CTT and UIRT). In assessments where correlations between subscales are negligible, the use of MIRT as opposed to non-augmented methods would not make much difference in reliability. However, in assessments where correlations between subscales are larger, MIRT is consistently found to have higher reliability.
The reliability of the Writing Rubric was consistently highest and the SE was lowest across subscore methods. On the other hand, the reliability of the Listening scale was consistently lowest and the SE was highest. Also, the reliability was increasing and the SE was decreasing, although not uniformly, within subscales from kindergarten to grade levels 9–12. Although the difference in the number of items can affect reliability, Listening was not the subscale with the lowest number of items. The number of items changed very slightly across grade levels, with an average increase of 1.6 items and two points per subscale from kindergarten to grade levels 9–12. Student population homogeneity remained fairly similar across grade levels, with standard deviations of CTT-calculated scores ranging from 2.27 to 3.33 for Listening, from 4.23 to 5.27 for Reading, from 3.78 to 6.63 for Speaking, from 2.21 to 3.35 for MC Writing, and from 3.21 to 3.48 for Writing Rubric.
A possible explanation for the fluctuation of reliability and SE between subscales is the difference in the appropriateness of the tasks used for language domain assessment. The effect of task is central to test performance (Bachman, 2002; In’nami & Koizumi, 2009); tasks that are appropriate for the assessment of a specific domain are more likely to produce more reliable and more defensible scores. Although a number of studies criticize language assessment tasks for all domains on various grounds related to the inappropriate representation of the construct in question, some authors (e.g., Brindley & Slatyer, 2002; Brown, 2004; Khoii & Paydarnia, 2001; Shin, 2007) specifically draw attention to the fact that listening skills, while critical to ELD, are not very well understood and are hard to assess. Difficulty in finding appropriate tasks to assess the construct of listening could have resulted in this subscale having the lowest reliability, whereas the Writing Rubric tasks were probably more appropriate.
The differences in reliability and SE within subscales across grade levels are probably a result of the interaction of task characteristics and test-taker attributes, such as age and age-related factors. A number of authors stress that young ELLs may interact with assessments in ways that are quite different from those of older students owing to a variety of developmental reasons (Butler, 2016; Hasselgreen, 2013; McKay, 2006; Rea-Dickins, 2000). For example, their memory differs from that of older students not only quantitatively but also qualitatively, which may alter the way in which they remember a passage read to them. Hands-on tasks facilitate their thinking to a greater degree than they do for older students. They may have difficulty integrating information from different sources. All these factors significantly influence and regulate assessment formats, modalities, and procedures. Tasks that are regularly used for assessing adults may not work well for assessing young learners. However, even if a task is considered developmentally appropriate and works well in the classroom environment, young learners may not be able to exhibit consistently in an assessment environment what they are able to do in a non-assessment context (Butler, 2016). For this reason, it may be challenging to obtain reliable information about their domain proficiency.
Policy implications
Subscore reliability, and consequently the reliability of the total score on the assessment of ELP, is of key importance to the ability to make appropriate decisions regarding ELL identification, instruction, placement, and reclassification. Ultimately, assessing ELLs is intended to improve their education, and the realization of this goal depends on the degree to which assessments accurately measure their true proficiency, and the degree to which assessment results are designed for the appropriate measurement purpose (Sireci & Faulkner-Bond, 2015). Low score reliability and the consequent misclassification of students may result either in failing to recognize that ELLs need services or in assigning services to students who do not need them. Both can be harmful to the educational progress of ELLs. If an ELL is exited from Title III too early, they may have difficulty in making progress in core content classes. At the same time, a number of researchers voice concern over the growing number of students spending extended time in Title III (Bailey & Carroll, 2015). For these students, the inability to exit Title III owing to the structure of exit rules and assessment characteristics, despite their readiness to do so from the ELD perspective, often results in low levels of school persistence, including dropping out of school, and reduced access to college education (Kim, 2011).
The state in question establishes fluency on all subscales and overall fluency as the criteria for exiting the ELL program – a decision rule referred to as conjunctive. For states using this type of decision rule, the reliability of individual subscales is critical, since the entire decision about exiting may hinge on the measurement precision of the least reliable evidence (Bailey & Carroll, 2015). As some of the subscales carry more weight than others in the calculation of the total score, the reliability of those subscales that carry the most weight becomes even more important. In addition, the fact that the state requires a student to score as “fluent” for two consecutive years to exit the English as a second language (ESL) program places additional emphasis on the need for subscore reliability. Suppose that a student whose “true” classification is “fluent” receives a subscore that classifies him or her as “fluent” one year, and then as “other than fluent” the next year owing to low subscore reliability; this situation will delay the student’s exit from the program by at least two years.
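To make the conjunctive rule and the two-consecutive-years requirement concrete, the following Python sketch encodes both and shows how a modest per-subscale classification error can compound. The domain names, cut scores, function names, and the assumption that classification errors are independent across subscales and years are illustrative simplifications, not the state's actual scoring rules.

```python
from typing import Mapping, Sequence


def is_fluent(subscores: Mapping[str, float],
              cut_scores: Mapping[str, float],
              overall: float,
              overall_cut: float) -> bool:
    """Conjunctive rule: fluent only if the overall score and every subscale meet their cuts."""
    return overall >= overall_cut and all(
        subscores[domain] >= cut for domain, cut in cut_scores.items()
    )


def can_exit(yearly_fluent: Sequence[bool], years_required: int = 2) -> bool:
    """Exit requires a 'fluent' classification in each of the most recent consecutive years."""
    recent = list(yearly_fluent)[-years_required:]
    return len(recent) == years_required and all(recent)


def p_exit_on_time(p_correct: float, n_subscales: int, n_years: int = 2) -> float:
    """Chance that a truly fluent student is classified fluent on every subscale in every year.

    Assumes independent classification errors (a simplification for illustration).
    """
    return p_correct ** (n_subscales * n_years)


# Hypothetical cut scores for four domains (not the state's actual values).
cuts = {"Listening": 10, "Speaking": 12, "Reading": 15, "Writing": 14}
year1 = is_fluent({"Listening": 11, "Speaking": 13, "Reading": 16, "Writing": 15}, cuts, 55, 50)
year2 = is_fluent({"Listening": 9, "Speaking": 13, "Reading": 16, "Writing": 15}, cuts, 53, 50)

print(can_exit([year1, year2]))                 # one low subscale in year 2 blocks exit
print(round(p_exit_on_time(0.95, 4), 3))        # compounded chance across 4 subscales x 2 years
```

Even with a hypothetical 95% chance of correct classification on each subscale in each year, the compounded chance of being classified fluent everywhere for two years running is considerably lower, which is why the least reliable subscale matters so much under a conjunctive rule.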
It is important to keep in mind that the type of inferences we can make from subscores depends on the method by which the subscores were derived. Although the psychometric goal of the assessment under the conjunctive system may be to produce subscores with the highest possible reliability, the subscores that do have the highest reliability are those produced by methods that use information from other subscales. Therefore, these methods may not be theoretically acceptable under the conjunctive system, which seeks to assess proficiency within each domain independently of other domains. In addition, A-IRT estimates subscores for individual students by “borrowing” information from the subscores of other students in the group, which makes decisions about individual proficiency based on these subscores less justifiable. On the other hand, using subscore reporting methods that borrow information from other subscales within a conjunctive system of reclassification may serve as a trade-off between the stringency of conjunctive methods and the higher reliability of compensatory methods.
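The idea of borrowing information from the group can be illustrated with Kelley's regressed true-score estimate from classical test theory, which pulls an observed score toward the group mean in proportion to the score's unreliability. This is a simplified analogue offered for intuition only; it is not the A-IRT procedure used in this study, and all numbers below are hypothetical.

```python
def kelley_estimate(observed: float, reliability: float, group_mean: float) -> float:
    """Kelley's regressed estimate: a weighted mix of the observed score and the group mean.

    The lower the reliability, the more the estimate leans on the group mean.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return reliability * observed + (1.0 - reliability) * group_mean


# Hypothetical values: a student scores 10 on a subscale whose group mean is 15.
observed_score = 10.0
group_mean = 15.0

for rel in (0.9, 0.6):
    estimate = kelley_estimate(observed_score, rel, group_mean)
    print(f"reliability={rel:.1f} -> estimated true score={estimate:.1f}")
```

With the less reliable subscale, the estimate moves further from the student's own observed score and closer to the group mean, which illustrates why basing individual decisions on group-informed estimates can be harder to justify.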
MIRT estimates subscores for individual students by “borrowing” information from the student’s own scores on other subscales. Therefore, this method may be more appropriate for making reliable decisions regarding a given individual’s proficiency for the purpose of placement. Although MIRT seems to be the most appropriate method of subscore estimation owing to its high reliability, its drawback is that an individual student’s subscale scores are made less distinct from one another, thus obscuring the student’s strengths and weaknesses in individual subscales, which may be a factor in instructional planning. In addition, since both A-IRT and MIRT subscores depend on the impact of other subscores on the subscore of interest, and the relationships between subscales change from grade to grade within the same domain, these subscores may be harder to interpret and compare between grades.
Overall, the selection of a subscore method requires a number of complex decisions that involve evaluating the effects and consequences of selecting a specific method. Critical to the process is the consideration of two trade-offs: between accuracy and precision on the one hand and the relative ease of subscore calculation and interpretation on the other, and between subscore reliability and distinctiveness. No single method of assessment and scoring is going to be suitable for all types of decision making in ELL assessment. However, it is important to evaluate the impact of scoring mechanisms to be able to make an informed decision regarding which procedure fits best with the ultimate goal of providing optimal educational opportunities to ELLs.
Limitations
One limitation of this study is the use of real data, which precluded us from knowing the true ability of the examinees. It also precluded us from examining separately the impact on subscale reliability of such factors as student ability, age, number of students, number of subscales, correlation between subscales, and number of items in a subscale. Another limitation of using a real data set is that it is difficult to know how these results could be generalized to other tests. For example, tests with a different number of dimensions, different item types and formats, or different correlational structures between subscales may behave quite differently from what we observed in this study.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
