Applying unidimensional and multidimensional item response theory models in testlet-based reading assessment

Abstract

This study examined the relative effectiveness of the multidimensional bi-factor model and multidimensional testlet response theory (TRT) model in accommodating local dependence in testlet-based reading assessment with both dichotomously and polytomously scored items. The data used were 14,089 test-takers’ item-level responses to the testlet-based reading comprehension section of the Graduate School Entrance English Exam (GSEEE) in China administered in 2011. The results showed that although the bi-factor model was the best-fitting model, followed by the TRT model, and the unidimensional 2-parameter logistic/graded response (2PL/GR) model, the bi-factor model produced essentially the same results as the TRT model in terms of item parameter, person ability and standard error estimates. It was also found that the application of the unidimensional 2PL/GR model had a bigger impact on the item slope parameter estimates, person ability estimates, and standard errors of estimates than on the intercept parameter estimates. It is hoped that this study might help to guide test developers and users to choose the measurement model that best satisfies their needs based on available resources.

Keywords

Bi-factor model, item response theory, local dependence, reading assessment, testlet response theory model

Introduction

The majority of reading assessments, whether in a first language or second/foreign language, comprise sets of passages with a group of items pertaining to each passage. Such passages, usually called testlets¹ (Wainer & Kiely, 1987), are used for many reasons, among which the principal one is higher testing efficiency for test takers (Thissen, Steinberg, & Mooney, 1989). With several items embedded in a testlet, test takers need not waste a considerable amount of time and energy in processing a long passage just to answer a single item. Despite such great strength, this testing format poses a threat to item analysis because items within a testlet often violate the local independence assumption of item response theory (IRT) (Sireci, Thissen, & Wainer, 1991; Wang & Wilson, 2005a), which stipulates that the probability of responding to an item is statistically independent of the probability of responding to any other item in the same test conditional on the test-taker’s ability (Embretson & Reise, 2000). In other words, the violation of the local independence assumption indicates that after the impact of the measurement construct is partialed out from the item scores, non-negligible correlation still exists between the items (Yen, 1993).
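The local independence assumption described above can be stated compactly. The formulation below is the standard textbook form rather than notation taken from the cited sources; the symbols $U_i$ (the response to item $i$), $u_i$ (its observed value), $\theta$ (the test-taker's ability), and $n$ (the number of items) are introduced here for illustration only.

$$
P(U_1 = u_1, \ldots, U_n = u_n \mid \theta) = \prod_{i=1}^{n} P(U_i = u_i \mid \theta)
$$

Under this notation, local dependence between two items $i$ and $j$ corresponds to a non-zero conditional covariance, $\mathrm{Cov}(U_i, U_j \mid \theta) \neq 0$, which is the residual correlation that remains after the measured construct has been partialed out.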

Sources of local item dependence in reading assessments

In reading assessment, the most heatedly discussed potential sources of local item dependence are as follows: (a) common item types; (b) the same subskill measured by different items; and (c) a single passage followed by a set of items.

The item type effect has been discussed in the literature more often in the context of multidimensionality than of local item dependence. However, as pointed out by Andrich, Humphry, and Marais (2012), multidimensionality constitutes one generic source of local item dependence; therefore, the use of different item types within a test (e.g., multiple-choice and constructed-response items) might not only introduce unmodeled variation that can be attributed to a secondary dimension of item type (Linacre, 1998), but might also lead to local dependence among items of the same type. Previous empirical research (Kobayashi, 2002; Shohamy, 1984) has suggested that various item types measure different aspects of reading comprehension and somewhat different constructs. In addition, some item types inherently suffer from a higher probability of violating the local independence assumption. A typical example in language assessment is the gap-filling item type, in which test takers are required to select the correct answers from a list of options to fill in blanks left by missing words. A wrong response to one item might result in further wrong responses to subsequent items. The “item-chaining effect” (Wang, Cheng, & Wilson, 2005) of this item type poses a greater threat to the local independence assumption than many other item types do.

The same subskill measured by different items constitutes another major source of local item dependence and multidimensionality in reading assessment. Although it is important to operationalize the reading construct through attempting taxonomies of reading subskills in reading syllabus design (Munby, 1978) and test specifications development (Alderson, 1990a; Lumley, 1993), this practice has triggered heated debate and speculation among language testers about the existence of reading subskills (Alderson, 1990a, 1990b, 1995; Alderson & Lumley, 1995; Lumley, 1993), the divisibility of reading subskills (Davis, 1968; Johnston, 1983; Song, 2008), as well as the possibility of identifying a single dominant subskill measured by an item (Alderson, 1990a, 1995; Alderson & Lukmani, 1989). Empirical studies investigating subskill-related local dependence and multidimensionality of reading assessment have also provided mixed or even conflicting results. For instance, using the DIMTEST procedure and NOHARM analysis, Schedl, Gordon, Carey, and Tang (1996) reported that different subskills of reading items did not bring multidimensionality to the construct of the reading comprehension section of TOEFL. In line with this, Lee (1998, 2004) applied IRT-based Q₃ statistics to an EFL (English as a foreign language) reading comprehension test in Korea and concluded that no significant level of local dependence related to subskills was detected. On the other hand, a study conducted by Jang and Roussos (2007) on the reading section of TOEFL revealed a multidimensionality effect related to reading subskills, by utilizing confirmatory analyses coupled with exploratory cluster analyses and content analyses.

The passage effect has been the major focus for most research on local dependence and multidimensionality in reading assessment because local dependence is almost innate in passage-based reading assessment. When several items are based on a common passage, test takers with diverse background knowledge of the passage, differential skills specific to the passage, or various other motivational factors concerning the passage (DeMars, 2006) might have differential understanding of the passage and hence differential item response behavior. In this way, item responses within a testlet are not only contingent upon test-takers’ reading proficiency measured by the test, but also related to a second trait relevant to the passage. Thus test-takers’ responses to items within the same testlet are more highly correlated than their responses to items across different testlets. Findings of empirical research in this area have consistently shown that items sharing a common passage indeed exhibit local dependence and multidimensionality (e.g., DeMars, 2006; Jang & Roussos, 2007; Lee, 1998, 2004; Rijmen, 2010; Thissen et al., 1989; Zhang, 2010), regardless of the specific test analyzed and/or the specific data analysis technique used.

Approaches to cope with local item dependence

In order to address the violation of the local independence assumption, a number of approaches have been suggested by researchers in the field of educational measurement. One intriguing approach to circumvent this problem is to group together items within a testlet, treat the testlet as a single polytomous item, and apply a unidimensional polytomous IRT model (Lee, 1998; Lee, Kolen, Frisbie, & Ankenmann, 2001; Thissen et al., 1989; Wainer, 1995). One potential caveat of this approach, however, is the loss of item-level information (Wainer, Bradlow, & Du, 2000; Yen, 1993), especially when a testlet is composed of a large number of dependent items (So, 2010; Zhang, 2010). Transforming the items within a testlet into a single polytomous item discards whatever information is contained in the test-takers’ precise pattern of item responses.

An alternative is to model testlet items with multidimensional IRT models, either the bi-factor model (Gibbons & Hedeker, 1992) or the testlet response theory model (Wainer, Bradlow, & Wang, 2007). The use of the bi-factor model has been preferred more by researchers in the area of factor analysis than those in IRT modeling. The bi-factor model originates from confirmatory factor analysis for continuous responses (Holzinger & Swineford, 1937) and has been extended to IRT analysis for both dichotomously scored items by Gibbons and Hedeker (1992) and for polytomously scored items by Gibbons et al. (2007). It incorporates both a primary dimension related to all items and secondary dimensions pertaining to subsets of items only; thus each item response in the bi-factor model depends on both the primary dimension and one of the secondary dimensions. Specifically, each item has a non-zero value for the discrimination parameter in the direction of the primary dimension and also at most a non-zero value for the discrimination parameter corresponding to one of the testlet dimensions. The other discrimination parameters related to the other testlets are fixed to zero.
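The loading structure described above can be sketched in a few lines of Python. This is a minimal illustration only, assuming a slope-intercept 2PL parameterization for dichotomous items; the function names, the argument names, and the three five-item testlets are hypothetical and are not taken from the cited sources.

```python
import math


def bifactor_pattern(testlet_sizes):
    """Build a 0/1 loading pattern for a bi-factor structure.

    Rows are items. Column 0 is the primary dimension and columns 1..K are
    the testlet-specific (secondary) dimensions. A 1 marks a slope that is
    freely estimated; a 0 marks a slope fixed to zero.
    """
    n_dims = 1 + len(testlet_sizes)
    pattern = []
    for k, size in enumerate(testlet_sizes, start=1):
        for _ in range(size):
            row = [0] * n_dims
            row[0] = 1  # every item loads on the primary dimension
            row[k] = 1  # and on exactly one testlet dimension
            pattern.append(row)
    return pattern


def bifactor_2pl_prob(a_primary, a_testlet, intercept, theta_primary, theta_testlet):
    """Probability of a correct response for one dichotomous item
    (slope-intercept form, assumed here for illustration)."""
    z = a_primary * theta_primary + a_testlet * theta_testlet + intercept
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical example: three testlets of five items each.
pattern = bifactor_pattern([5, 5, 5])
```

Each row of `pattern` contains exactly two non-zero entries, which mirrors the statement that an item response depends on the primary dimension and on one secondary dimension only.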

A variety of testlet response theory (TRT) models have been proposed in the IRT literature to capture the local dependence of items in testlets from different perspectives, including the Bayesian random-effects testlet models (Bradlow, Wainer, & Wang, 1999; Wainer & Wang, 2000; Wainer et al., 2000; Wainer et al., 2007) from the perspective of “an interaction between a testlet and persons”, the Rasch testlet models (Wang & Wilson, 2005b) from the perspective of “multidimensionality”, and the three-level, one-parameter testlet model (Jiao, Wang, & Kamata, 2005) from the perspective of “contextual effects on items nested within a testlet” (Jiao, Wang, & He, 2013). Among them, the most-cited one in the literature is the Bayesian random-effects testlet model (Bradlow et al., 1999; Wainer & Wang, 2000; Wainer et al., 2000; Wainer et al., 2007), which has also been most widely used in large-scale language assessments (e.g., Eckes, 2013; Rijmen, 2010; Zhang, 2010). As opposed to the bi-factor model where separate discrimination parameters are assigned to the primary and secondary dimensions, the Bayesian random-effects testlet model usually imposes constraints on the relationship between the primary slope and the secondary slopes. Based on different theoretical rationales, three types of constraints might be placed. The first type of constraint, typical in a plethora of Bayesian random-effects testlet models (Bradlow et al., 1999; Wainer & Wang, 2000; Wainer et al., 2000; Wainer et al., 2007), stipulates that there is a proportional relationship between the primary slope and the secondary slopes, implying that items discriminating well on the primary dimension should also discriminate well on the secondary dimensions. 
This supposition, however, does not seem to comply with the reality as it indicates that, for instance, in the case of testlet-based reading comprehension tests, items that discriminate well on test-takers’ reading ability are also more heavily influenced by the secondary dimensions related to the passage. To address this problem, Li, Bolt, and Fu (2006) proposed an alternative TRT model in which an inverse relationship – the second type of constraint – was imposed on the primary slope and the secondary slopes, in accordance with the rationale that an item discriminating well on the primary dimension should have less discrimination power on the secondary dimensions. Apart from this, Li et al. (2006) also introduced a third type of TRT model that constrains all slope estimates to be the same within a testlet and compared its fit to that of the first two types of TRT models as well as the bi-factor model (Gibbons & Hedeker, 1992), finding that the bi-factor model yielded the best model–data fit.

In addition to constraints on the relationship between the primary slope and the secondary slopes, the Bayesian random-effects testlet models also vary in terms of the constraints placed on the variance estimates of the secondary dimensions. Bradlow et al.’s (1999) TRT model, the first attempt to model test-takers’ responses to testlet-based items, assumes that the variances of all testlets are equal by constraining the proportionality constants to be the same across all testlets, while others (Wainer & Wang, 2000; Wainer et al., 2000; Wainer et al., 2007; Wang, Bradlow, & Wainer, 2002) allow the proportionality constants to vary across testlets, rendering the model less parsimonious but enabling researchers to know the relative contribution of each testlet factor.

The main difference between the bi-factor model and Bayesian random-effects testlet model resides in parameter constraints/model parsimony and model identification. Owing to the additional constraints placed, the Bayesian random-effects testlet model is generally more parsimonious than the bi-factor model. This is one of the major advantages of using the Bayesian random-effects testlet model over the bi-factor model because with fewer item parameters estimated in the Bayesian random-effects testlet model, estimation efficiency can be improved. Moreover, practitioners always strive to arrive at simpler models if they are available and the model fit does not deteriorate too much. One thing that should be noted is that, although generally no constraint is placed on the relationship between the primary slope and the secondary slopes in the bi-factor model, when a testlet only consists of two items, some constraint needs to be placed in the bi-factor model on the two secondary dimension slope parameters. A typical way is to constrain the absolute value of bi-factor slopes of the two items to be equal (Liu & Thissen, 2012). In terms of model identification, the additional constraints on the slope estimates in the Bayesian random-effects testlet model permit the estimations of the variances of the secondary dimensions, whereas in the bi-factor model the variances of the secondary dimensions are typically set to 1 for identification purposes. The mathematical definitions of the Bayesian random-effects testlet model and bi-factor model for both dichotomously and polytomously scored items are provided by Wainer et al. (2007) and Cai, Yang, and Hansen (2011) respectively.
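The contrast between the two models can be illustrated for a dichotomous item in slope-intercept form. The equations below are a sketch under assumed notation, not a reproduction of the definitions in the cited sources: $U_{ij}$ is the response of person $j$ to item $i$, item $i$ belongs to testlet $k$, $\theta_{j0}$ is the primary trait, $\theta_{jk}$ is the testlet-specific trait, $a_{i0}$ and $a_{ik}$ are the primary and secondary slopes, $c_i$ is the intercept, and $\sigma_k$ is the testlet-specific proportionality constant.

$$
\text{Bi-factor:}\quad \mathrm{logit}\, P(U_{ij} = 1 \mid \theta_{j0}, \theta_{jk}) = a_{i0}\,\theta_{j0} + a_{ik}\,\theta_{jk} + c_i
$$

$$
\text{TRT:}\quad \mathrm{logit}\, P(U_{ij} = 1 \mid \theta_{j0}, \theta_{jk}) = a_{i0}\,\theta_{j0} + a_{i0}\,\sigma_k\,\theta_{jk} + c_i
$$

With $\theta_{jk}$ standardized to unit variance in both expressions, the TRT model appears as a bi-factor model with the restriction $a_{ik} = \sigma_k\, a_{i0}$. One secondary slope per item is thus replaced by one constant per testlet, which is the source of the TRT model's greater parsimony, and $\sigma_k^2$ plays the role of the testlet variance that the TRT model can estimate.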

Consequences of ignoring local item dependence

Despite these slight differences among the various multidimensional IRT models, they can all be applied to handle the violation of the local independence assumption in testlet-based assessment and to minimize or eliminate the problems that might be incurred when utilizing standard IRT models, including inaccurate item parameter and person ability estimates (Bradlow et al., 1999; Chen & Thissen, 1997), overstatement of test information and measurement precision (Keller, Swaminathan, & Sireci, 2003; Sireci et al., 1991; Thissen et al., 1989; Wainer, 1995; Yen, 1993), errors in equating/scaling (Lee et al., 2001), as well as item misfit (Marais & Andrich, 2008).

Applications in language tests

The limitation of the standard IRT models in handling the violation of the local independence assumption has also been discussed by a plethora of researchers in the field of language assessment (e.g., Choi, Kim, & Boo, 2003; Eckes, 2013; Lee, 1998, 2004; Lee-Ellis, 2009; Ockey, 2012; So, 2010; Schmitt, Schmitt, & Clapham, 2001; Zhang, 2010). Applications of the three different approaches to bypass the problem of local item dependence can be found in large-scale language tests. For example, Lee (1998) utilized a series of polytomous IRT models to circumvent the local dependence between items within a testlet in an EFL reading comprehension test in Korea. After recognizing the potential problem of information loss of the polytomous IRT approach, Wainer and Wang (2000) attempted to use the 3-parameter logistic Bayesian random-effects testlet model to characterize the local dependence among items within the TOEFL listening and reading comprehension testlets, finding that estimates of item difficulty parameters were not affected by testlet-associated local item dependence, while estimates of discrimination and guessing parameters were inaccurate if local item dependence was ignored. Similarly, Eckes (2013) applied the 2-parameter logistic Bayesian random-effects testlet model to analyze three testlets in the listening section of the Test of German as a Foreign Language (TestDaF). Eckes demonstrated that local item dependence did not affect estimates of item difficulty and discrimination parameters but led to overestimated test reliability and underestimated standard error of ability estimates. Studying the classification accuracy of language proficiency under different measurement models, Zhang (2010) showed evidence that compared to the standard IRT model, the three-parameter logistic Bayesian random-effects testlet model yielded substantially larger standard errors of ability estimates, suggesting the inflation of classification accuracy of the standard IRT model.

Further, Li, Li, and Wang (2010) applied the bi-factor model to accommodate local dependence in a testlet-based English reading test that contained both dichotomously and polytomously scored items, demonstrating that local item dependence exerted a small influence on the item parameter estimates but a comparatively larger influence on test information and reliability. They also cautioned that the added item parameters estimated in the bi-factor model might have compromised the fit of the model; therefore, they called for the use of a multidimensional IRT model with a simpler structure. On the basis of substantiating the presence of both a primary trait dimension and secondary passage dimension through a series of confirmatory factor analyses, So (2010) applied the bi-factor model to analyze the testlet-based reading paper of the Certificate in Advanced English (CAE), finding that though the presence of a secondary passage dimension did not significantly affect estimation of item difficulty parameters, it had significant effects on the estimation of the discrimination parameter, which consequently led to an underestimation of lower-ability test takers and overestimation of higher-ability test takers.

The majority of the studies summarized above have compared either the multidimensional TRT model with the unidimensional IRT model (Eckes, 2013; Wainer & Wang, 2000; Zhang, 2010) or the multidimensional bi-factor model with the unidimensional IRT model (So, 2010), but little research (DeMars, 2006; Rijmen, 2010) has been conducted to compare systematically the relative effectiveness of these two different multidimensional IRT models in capturing local dependence in language tests. Using both simulated data and real data of math and reading testlets, DeMars (2006) compared the bi-factor model, TRT model, testlets-as-polytomous-items model, and independent-items model, reporting that in both simulated data and real data the bi-factor model generally provided better model–data fit than the more specialized TRT model and the independent-items model. However, based on the results of root mean square error and bias in different manipulated circumstances, the more parsimonious TRT model was found to be favored over the bi-factor model. Yet, DeMars (2006) concluded with a promising outlook for the use of the bi-factor model among applied practitioners, because for one thing, the bi-factor model can be run easily and efficiently in commercial software; for another, the accuracy of the slope estimates and ability estimates was found to be maintained using the bi-factor model even when the data were generated from the more constrained TRT model. On the other hand, Rijmen’s (2010) study, which fitted the bi-factor model, TRT model and unidimensional IRT model to the data from an international English test, provided evidence that the proportionality constraints that were placed on the relationship between the primary slope and secondary slopes in the TRT model were too stringent and therefore advocated for the use of the bi-factor model in testlet-based assessment.

Research findings on the effectiveness of the multidimensional bi-factor model and the multidimensional TRT model in accommodating local item dependence in testlet-based language assessments are mixed. Moreover, with the exception of Li et al. (2010), these comparative studies on multidimensional IRT models have been mainly confined to the analysis of dichotomously scored items, probably because multidimensional IRT analysis of polytomous data is a more recent development. In language assessment, however, a mixed format with both dichotomously and polytomously scored items might represent a more realistic scenario.

The present study

The purpose of this study, therefore, is to investigate whether the claim that the bi-factor model serves as a practical alternative to the TRT model in testlet-based assessment holds for mixed-format language tests with both dichotomously and polytomously scored items. If so, what is the consequence of selecting a worse-fitting model to address the local dependence between items within a testlet or ignoring the local dependence by applying the unidimensional IRT model? In order to address these issues, this study aims to answer the following four research questions:

To what extent do the unidimensional IRT model, the multidimensional TRT model, and the multidimensional bi-factor model fit the testlet-based reading comprehension section of the GSEEE?

To what extent are item parameter estimates influenced by different IRT models?

To what extent are person ability estimates influenced by different IRT models?

To what extent are standard errors of item parameter and person ability estimates influenced by different IRT models?

Method

Data

The data analyzed in the present study were 14,089 test-takers’ item-level responses to the reading comprehension section of the Graduate School Entrance English Exam (GSEEE) in China administered in 2011. The 14,089 test takers were candidates applying to programs of various disciplines at one major university on the eastern coast of China in that year. The GSEEE is designed and administered by the National Education Examinations Authority (NEEA) under the Ministry of Education in China to provide information for educational or research institutions in selecting candidates for their Master’s programs (He, 2010). With an annual testing population of over 1 million, the GSEEE is one of the two highest-stakes tests in China, the other being the National College Entrance Exam. For most test takers, the results of the test will determine their career path and for some, change their life. Different cut-off scores are set for different majors, and the cut-off scores of the GSEEE in 2011 ranged from 45 to 60 for candidates applying to the university, resulting in an overall selection ratio of 41.8%. The GSEEE assesses test-takers’ English language proficiency in three areas: (1) use of English (a cloze test) accounting for 10% of the total score; (2) reading comprehension, 60%; and (3) writing, 30%. The reading comprehension section, which is the main focus of this study, consists of three parts:

Part I: multiple-choice items. Test takers are required to select the best answer to each of the 20 four-option multiple-choice items based on four passages, which are dichotomously scored.

Part II: paragraph-reorganizing items. Test takers are required to put five jumbled sentences or paragraphs in the correct order to form a coherent passage, which are scored on a 2-point scale.

Part III: translation items. Test takers are required to translate five underlined sentences in a passage into Chinese, which are scored on a 3-point scale.

Data analyses

Unidimensionality and local independence check

First of all, in order to check whether local dependence exists between items within a testlet, the authors examined the standardized local dependence (LD) χ² statistic, obtained from unidimensional IRT modeling via IRTPRO 2.1 (Cai, Thissen, & du Toit, 2011). The standardized LD χ² statistic is based on the local dependence statistic proposed by Chen and Thissen (1997) but extended to accommodate polytomous responses. It is computed by comparing the observed and expected frequencies in the two-way marginal tables for each item pair and then standardized to make values comparable among items with different number of response categories (Cai, Thissen, & du Toit, 2011). Because these are approximately standardized statistics, values exceeding 4 suggest clear local dependence between items, and values exceeding 10 suggest extreme local dependence between items (Cai, personal communication, July 22, 2013).
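The logic of the statistic can be sketched as follows. This is an illustrative approximation, not the IRTPRO implementation: it assumes that the standardization subtracts the degrees of freedom from the Pearson χ² and divides by the square root of twice the degrees of freedom, and it takes the model-implied expected frequencies as given inputs. All function names and the example table are hypothetical; only the thresholds of 4 and 10 come from the text above.

```python
import math


def pearson_x2(observed, expected):
    """Pearson chi-square over a two-way table of observed and expected counts."""
    return sum(
        (o - e) ** 2 / e
        for row_o, row_e in zip(observed, expected)
        for o, e in zip(row_o, row_e)
    )


def standardized_ld_x2(observed, expected):
    """Approximately standardized LD statistic for one item pair.

    Assumption: z = (X2 - df) / sqrt(2 * df), with
    df = (rows - 1) * (columns - 1) of the two-way marginal table.
    """
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return (pearson_x2(observed, expected) - df) / math.sqrt(2 * df)


def classify_ld(value):
    """Apply the thresholds quoted in the text (over 4: clear, over 10: extreme)."""
    if value > 10:
        return "extreme"
    if value > 4:
        return "clear"
    return "none flagged"


# Hypothetical 2 x 2 table for a pair of dichotomous items.
observed = [[30, 20], [20, 30]]
expected = [[25, 25], [25, 25]]
z = standardized_ld_x2(observed, expected)
label = classify_ld(z)
```

Dividing by the degrees-of-freedom-based scale is what makes values comparable across item pairs whose two-way tables differ in size, for example a pair of dichotomous items versus a pair of translation items.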

IRT modeling

Secondly, the authors applied three unidimensional and two multidimensional IRT models to the testlet-based reading comprehension section of the GSEEE using IRTPRO 2.1 (Cai, Thissen, & du Toit, 2011), a program released by Scientific Software International (SSI) to encompass the functions of BILOG-MG, MULTILOG, PARSCALE and TESTFACT with newly advanced features. One major advantage of IRTPRO 2.1 relevant to this study is that the bi-factor model can be estimated in IRTPRO 2.1 with both dichotomously and polytomously scored items, while the original software programs, such as NOHARM and TESTFACT, can be used for the bi-factor model only when the data are dichotomously scored (Edwards & Edelen, 2009).

The three unidimensional IRT models applied were a hybrid of standard 1-parameter logistic/graded response (1PL/GR) model, 2-parameter logistic/graded response (2PL/GR) model, and 3-parameter logistic/graded response (3PL/GR) model. After the best unidimensional IRT model for the data was decided, two multidimensional IRT models, namely, the TRT and bi-factor models, were estimated, both of which utilized the best unidimensional IRT model at the item level.

To identify the TRT model (Wainer et al., 2007), the locations of all the dimensions, including the primary dimension and the secondary testlet dimensions, and the scale of the primary dimension need to be fixed, but the scale of the specific dimensions can be freely estimated (Rijmen, 2010). Specifically, in this study, the mean and variance of the primary dimension were set to 0 and 1, respectively; the means of the specific dimensions were set to 0, with testlet-specific variances freely estimated. The slope estimates on the secondary dimensions were constrained to be proportional to the slope estimate on the primary dimension. The proportionality constants were free to vary, allowing us to gauge the relative strength of the local dependence of different testlets. To identify the bi-factor model (Cai, Yang, & Hansen, 2011), the mean and variance were set to 0 and 1, respectively, for all the traits, including the primary dimension and the secondary specific dimensions. It is assumed in both the TRT model and the bi-factor model that the primary dimension and the secondary testlet dimensions are jointly normally distributed and mutually orthogonal.

For all the five models, the item parameters were estimated with the Bock–Aitkin marginal maximum likelihood estimation (Bock & Aitkin, 1981), which has been proven to be an effective estimation method for both unidimensional and two-dimensional IRT models (Cai, Thissen, & du Toit, 2011). In addition, during all calibration runs, a standard normal distribution N (0, 1) constraint was imposed for θ_g (a given ability on the general trait dimension) and θ_s (a given ability on the secondary passage dimension) for the bi-factor model, and for θ_g for the TRT, standard 1PL/GR, 2PL/GR, and 3PL/GR models. Thus, the parameter estimates obtained from different IRT models were on the same scale (Li et al., 2010). Besides, the computation of IRT scale scores was done using the expected a posteriori (EAP) method (Bock & Mislevy, 1982), which in general requires less computation and produces a smaller standard error of ability estimates (Wang & Vispoel, 1998).
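The EAP scoring step can be illustrated with a small sketch. The code below is a simplified, unidimensional illustration for dichotomous 2PL items in slope-intercept form with a standard normal prior; it is not the IRTPRO routine, and the function name, the equally spaced quadrature grid, and the item parameters in the example are assumptions made for illustration.

```python
import math


def eap_2pl(responses, slopes, intercepts, n_points=61, lo=-6.0, hi=6.0):
    """Return the EAP ability estimate and its posterior standard deviation.

    responses  : list of 0/1 item scores
    slopes     : item slopes (a)
    intercepts : item intercepts (c), so that logit P = a * theta + c
    The N(0, 1) prior is evaluated on an equally spaced grid (an assumption).
    """
    step = (hi - lo) / (n_points - 1)
    nodes = [lo + q * step for q in range(n_points)]
    posterior = []
    for theta in nodes:
        prior = math.exp(-0.5 * theta * theta)  # unnormalized N(0, 1) density
        likelihood = 1.0
        for u, a, c in zip(responses, slopes, intercepts):
            p = 1.0 / (1.0 + math.exp(-(a * theta + c)))
            likelihood *= p if u == 1 else 1.0 - p
        posterior.append(prior * likelihood)
    total = sum(posterior)
    mean = sum(t * w for t, w in zip(nodes, posterior)) / total
    variance = sum((t - mean) ** 2 * w for t, w in zip(nodes, posterior)) / total
    return mean, math.sqrt(variance)


# Hypothetical three-item example.
theta_hat, se = eap_2pl([1, 1, 0], [1.5, 1.2, 1.0], [0.0, 0.3, -0.2])
```

The posterior standard deviation returned here is the quantity that serves as the standard error of the ability estimate under EAP scoring. Because the posterior combines the likelihood with the prior, it cannot exceed the prior standard deviation of 1 in this setup.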

Results

The unidimensionality and local independence assumption

The reading comprehension section analyzed in the present study comprised 6 five-item testlets, so the authors examined the standardized LD χ² statistic for 60 item pairs ($6 \times C_{5}^{2}$) to see whether local dependence existed between items within a testlet. Note that the authors did not report the statistic for the other 375 item pairs ($C_{30}^{2} - 6 \times C_{5}^{2}$) because the main focus of this study was items within a testlet rather than items across testlets. Therefore, the potential local dependence between items across testlets owing to other factors such as subskills was beyond the range of the current investigation.

The results showed that, when the unidimensional IRT (i.e., standard 2PL/GR) model was applied, 49 out of the 60 item pairs yielded a value of over 4 for the standardized LD χ² statistic, indicating that test-takers’ responses to items in most of the item pairs were locally dependent and these item pairs may measure an un-modeled passage dimension. More specifically, the proportion of item pairs exhibiting local dependence reached 100%, 80%, 90%, 70%, 50% and 100% for the six testlets respectively, suggesting that all of the six testlets suffered from violation of local independence. It is also noteworthy that out of the 60 item pairs, 35 pairs produced a value of over 10 on the standardized LD χ² statistic, suggesting an extreme level of local dependence.
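The pair counts reported in this section can be checked with a few lines of arithmetic. The sketch below uses only figures stated in the text (six testlets of five items each and the per-testlet proportions of flagged pairs); the variable names are illustrative.

```python
from math import comb

n_testlets = 6
items_per_testlet = 5
n_items = n_testlets * items_per_testlet

# Item pairs within one testlet, within all testlets, and across testlets.
pairs_per_testlet = comb(items_per_testlet, 2)
within_pairs = n_testlets * pairs_per_testlet
across_pairs = comb(n_items, 2) - within_pairs

# Proportion of flagged pairs (standardized LD chi-square over 4) per testlet.
flagged_share = [1.0, 0.8, 0.9, 0.7, 0.5, 1.0]
flagged_pairs = sum(round(share * pairs_per_testlet) for share in flagged_share)
```

With ten pairs per testlet, the six proportions correspond to 10, 8, 9, 7, 5 and 10 flagged pairs, which sum to the 49 of 60 within-testlet pairs reported above.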

Overall model fit

Preliminary analysis

A preliminary analysis was conducted to decide the best unidimensional IRT model for the data. Table 1 summarizes the fit indices of the three unidimensional IRT models, including df (the degrees of freedom), −2log-likelihood (minus twice the log-likelihood evaluated at the maximum likelihood estimates), AIC (Akaike information criterion; −2log-likelihood plus twice the number of parameters) (Akaike, 1974), BIC (Bayesian information criterion; −2log-likelihood plus the logarithm of the sample size times the number of parameters) (Schwarz, 1978), and RMSEA (root mean square error of approximation; a value of 0.05 or below suggests adequate fit) (Browne & Cudeck, 1993). According to these fit indices, the 3PL/GR model provides the best model–data fit among the three unidimensional IRT models, as indicated by the lowest −2log-likelihood value, and the best balance between model misfit and model parsimony, as indicated by the lowest AIC and BIC values (Rijmen, 2010). However, a closer examination of the item parameters estimated from the 3PL/GR model indicated that out of the 30 items, 6 items had poorly estimated difficulty/discrimination/guessing parameters, with standard errors reaching over 1000. One possible reason might be that the test analyzed in the present study is targeted at proficient learners of English who have generally completed two years of EFL education at college or university. A lack of information at the lower end of the ability scale may therefore have led to unstable estimation of the guessing parameter in the 3PL model (Lord, 1980), which consequently led to poor estimation of other item parameters (Baker, 1987). The second best-fitting unidimensional IRT model, the 2PL/GR model, was therefore selected as the unidimensional IRT model for the data. In addition, it was decided to use a combination of the standard 2PL model and the GR model at the item level for the multidimensional IRT analyses.

Table 1.

Summary of fit indices of three unidimensional IRT models.

IRT models	NP	df	−2log-likelihood	AIC	BIC	RMSEA	χ²-difference test vs. 1PL/GR	χ²-difference test vs. 2PL/GR	χ²-difference test vs. 3PL/GR
1PL/GR	36	589	480602	480674	480946	0.06	–		
2PL/GR	65	560	472720	472850	473341	0.05	χ² (29) = 7882*	–	
3PL/GR	90	535	467601	467781	468461	0.03	χ² (54) = 13001*	χ² (25) = 5119*	–

Notes: NP = number of parameters; * p < .05.
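
The information criteria in Table 1 follow directly from the definitions given in the text. The short sketch below recomputes AIC and BIC for two rows of the table; the function names are illustrative only, and the sample size of 14,089 is taken from the abstract on the assumption that all test takers entered the calibration.

```python
import math


def aic(neg2ll: float, n_params: int) -> float:
    """AIC = -2 log-likelihood + 2 * number of parameters."""
    return neg2ll + 2 * n_params


def bic(neg2ll: float, n_params: int, n_obs: int) -> float:
    """BIC = -2 log-likelihood + ln(sample size) * number of parameters."""
    return neg2ll + math.log(n_obs) * n_params


N_OBS = 14089  # number of test takers (assumed to be the calibration sample)

# (-2 log-likelihood, number of parameters) for two rows of Table 1
models = {"1PL/GR": (480602, 36), "3PL/GR": (467601, 90)}

for name, (neg2ll, n_params) in models.items():
    print(name, aic(neg2ll, n_params), round(bic(neg2ll, n_params, N_OBS)))
```

Because BIC multiplies the number of parameters by ln(N) rather than by 2, it penalizes the additional parameters of the more complex models more heavily than AIC does at this sample size.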

Main analysis

When the selected unidimensional IRT model is compared with the multidimensional IRT models, the bi-factor model emerges as the best-fitting of the three models, as it has the lowest −2log-likelihood, AIC and BIC values (see Table 2). Moreover, the bi-factor model has the lowest RMSEA value (0.01), suggesting that this model fits the data extremely well.

Table 2.

Summary of fit indices of unidimensional and multidimensional IRT models.

IRT models	NP	df	−2log-likelihood	AIC	BIC	RMSEA	χ²-difference test vs. 2PL/GR	χ²-difference test vs. TRT	χ²-difference test vs. bi-factor
2PL/GR	65	560	472720	472850	473341	0.05	–		
TRT	71	554	463006	463148	463685	0.02	χ² (6) = 9714*	–	
Bi-factor	95	530	462275	462465	463183	0.01	χ² (30) = 10445*	χ² (24) = 731*	–

Notes: NP = number of parameters; * p < .05.

Because the standard 2PL/GR model is nested in the TRT model, and the TRT model is nested in the bi-factor model, the significance of the difference in −2log-likelihood can be tested with a series of χ²-difference tests (du Toit, 2003) to decide which one is the best-fitting model, with degrees of freedom equal to the difference in the number of parameters. The results of the χ²-difference tests showed that the bi-factor model fit the data significantly better than the TRT model (χ² (24) = 731, p < .05), which in turn fit the data significantly better than the standard 2PL/GR model (χ² (6) = 9714, p < .05) (see the last three columns of Table 2). The bi-factor model, therefore, was chosen as the final model for the present study. This suggests, on the one hand, that the secondary dimension was strong enough to cause local item dependence between items within a testlet and, on the other hand, that the constraints imposed by the TRT model on the slope parameters were too stringent.
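
The χ²-difference tests can be reproduced from the −2log-likelihood values and parameter counts alone. The following is a minimal sketch rather than the procedure of any particular software package: the helper `lr_test` is a hypothetical name, and the 5% critical values are standard chi-square table values entered by hand.

```python
# Upper 5% critical values of the chi-square distribution for the
# degrees of freedom needed here (standard table values).
CHI2_CRIT_05 = {6: 12.592, 24: 36.415, 30: 43.773}


def lr_test(neg2ll_restricted, np_restricted, neg2ll_general, np_general):
    """Return (statistic, df) for a chi-square difference test of nested models."""
    statistic = neg2ll_restricted - neg2ll_general
    df = np_general - np_restricted
    return statistic, df


# (-2 log-likelihood, number of parameters) taken from Table 2
fits = {"2PL/GR": (472720, 65), "TRT": (463006, 71), "Bi-factor": (462275, 95)}

pairs = [("2PL/GR", "TRT"), ("TRT", "Bi-factor"), ("2PL/GR", "Bi-factor")]
for restricted, general in pairs:
    stat, df = lr_test(*fits[restricted], *fits[general])
    significant = stat > CHI2_CRIT_05[df]
    print(f"{restricted} vs {general}: chi2({df}) = {stat}, p < .05: {significant}")
```

With a sample of this size, even small improvements in fit reach significance, which is one reason the practical comparisons of parameter estimates that follow are informative alongside the tests themselves.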

Nonetheless, in an operational context, it seems to be common either to apply the unidimensional IRT model, ignoring the testlet structure, or to apply the more specialized TRT model for item analysis, so it is worth examining the consequences of applying a worse-fitting model in testlet-based assessment. In what follows, the item parameter estimates, person ability estimates and standard errors of these estimates obtained from the standard 2PL/GR and TRT models are each compared with those obtained from the best-fitting model, the bi-factor model.

Item parameter estimates

Intercept.²

The item intercept parameter is negatively associated with the difficulty parameter of the item (Reckase, 2009). Generally, the higher the item intercept estimate is, the easier the item is. The left panel of Figure 1 depicts the relationship between the intercept parameter estimates obtained from the standard 2PL/GR model (UNI-c) and the bi-factor model (BIF-c). As can be seen from the figure, the item intercept parameters estimated from these two models are highly correlated (r = .98). This suggests that item intercept estimates are largely unaffected by local dependence between items within testlets and that the unidimensional IRT model is sufficient for estimating them. The right panel of Figure 1 shows a similar trend, indicating that applying the TRT model does not result in any obvious inaccuracy in the intercept parameter estimates.
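
The negative association between intercept and difficulty can be made explicit for a dichotomous item in the unidimensional case. The notation below (slope a_i, intercept c_i, difficulty b_i) is assumed here for illustration and is not taken from the study's own equations.

```latex
% Slope-intercept form of the 2PL model for dichotomous item i
P(x_i = 1 \mid \theta) = \frac{1}{1 + \exp\left[-(a_i\theta + c_i)\right]}

% Equating with the difficulty form a_i(\theta - b_i) gives
a_i\theta + c_i = a_i(\theta - b_i)
\quad\Longrightarrow\quad
b_i = -\frac{c_i}{a_i}
```

For a fixed positive slope, a larger intercept therefore corresponds to a lower difficulty, that is, to an easier item.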

Figure 1.

Comparison of the intercept parameter estimates from three models.

Slope

The item slope parameter is interpreted in the same way as the discrimination parameter of the item: the higher the item slope parameter is, the more discriminating the item is. Figure 2 depicts the relationship between the slope/discrimination parameter estimates obtained from the three models. Unlike the intercept parameter estimates, the slope/discrimination parameter estimates (with respect to the general trait) are correlated at a medium level (r = .55) for the standard 2PL/GR and bi-factor models, and at a higher level (r = .78) for the TRT and bi-factor models, suggesting that applying the worse-fitting models may result in a large inaccuracy in estimates of item slope/discrimination parameters. Note that there is an extreme outlier above the reference line in both panel (a) and panel (b). Below the reference line, five outliers are evident in panel (a) but not in panel (b); these correspond to the five items in the paragraph-reorganizing task.

Figure 2.

Comparison of the slope parameter estimates from three models.

Table 3 provides a comparison of the slope/discrimination parameters on the general trait dimension and on the passage dimension estimated from the bi-factor model. The results show that for most items (i.e., 18/30), the slope/discrimination parameter related to the trait dimension is larger than that on the passage dimension, indicating that most items assess test-takers’ reading proficiency more than their background knowledge of the passage. In other words, most items in the reading comprehension section of the GSEEE measure reading proficiency rather than the construct-irrelevant passage content, since the slope/discrimination parameter can be directly transformed to a factor loading (Kamata & Bauer, 2008; Takane & de Leeuw, 1987). However, it is noteworthy that for all the items in the paragraph-reorganizing task, the slope/discrimination parameter concerning the trait dimension is smaller than that on the secondary dimension, suggesting that test-takers’ responses to the five items are more affected by the secondary dimension. In addition, content analyses reveal that all four items in the multiple-choice task with a much larger slope/discrimination parameter on the secondary dimension than on the trait dimension (i.e., Items 3, 8, 13, and 19, which show a large negative difference value) measure test-takers’ inferential skills. However, there are also other inference items (i.e., Items 4, 9, 18, and 20) that do not exhibit such a trend. It is therefore difficult to identify the causes of the larger influence of the secondary passage dimension on these items or to specify a common deficiency among these items by simply conducting content analyses from the perspective of reading subskills.

Table 3.

Item slope parameter estimates from the bi-factor model.

Task	Item	BIF-Trait-a	BIF-Passage-a	Difference
Multiple-choice	1	0.46	0.10	0.36
	2	0.72	0.20	0.52
	3	1.11	2.14	−1.03
	4	1.55	0.04	1.51
	5	1.26	0.26	1.00
	6	1.08	0.10	0.98
	7	0.15	0.29	−0.14
	8	0.05	2.65	−2.60
	9	0.70	0.07	0.63
	10	0.54	−0.19	0.73
	11	1.35	0.08	1.27
	12	0.71	0.18	0.53
	13	1.20	2.69	−1.49
	14	0.86	0.21	0.65
	15	0.52	0.12	0.40
	16	0.56	−0.04	0.60
	17	0.48	−0.24	0.72
	18	0.17	0.28	−0.11
	19	2.84	4.84	−2.00
	20	0.53	0.19	0.34
Paragraph-reorganizing	21	0.89	1.77	−0.88
	22	0.91	1.63	−0.72
	23	1.48	3.76	−2.28
	24	2.18	4.97	−2.79
	25	1.05	1.69	−0.64
Translation	26	1.17	0.88	0.29
	27	1.40	2.74	−1.34
	28	1.69	0.32	1.37
	29	0.68	0.26	0.42
	30	0.77	0.46	0.31

Notes: BIF-Trait-a = the slope parameter estimates concerning the trait obtained from the bi-factor model;

BIF-Passage-a = the slope parameter estimates concerning the passage obtained from the bi-factor model;

Difference = BIF-Trait-a – BIF-Passage-a.
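
The transformation from slopes to factor loadings mentioned above can be sketched as follows. This is an illustration under stated assumptions rather than the computation used in the study: the slopes are assumed to be on the logistic metric (hence the scaling constant 1.702), the general and passage factors are assumed to be orthogonal, and each item is assumed to load on one passage factor only. The function name `loadings` is hypothetical.

```python
import math

D = 1.702  # logistic-to-normal-ogive scaling constant (assumed metric)


def loadings(a_general: float, a_specific: float, scale: float = D):
    """Convert bi-factor slopes to standardized loadings (general, specific)."""
    g = a_general / scale
    s = a_specific / scale
    denom = math.sqrt(1.0 + g * g + s * s)
    return g / denom, s / denom


# Slopes for Items 4 and 24 as reported in Table 3
for item, (a_g, a_s) in {4: (1.55, 0.04), 24: (2.18, 4.97)}.items():
    lam_g, lam_s = loadings(a_g, a_s)
    print(f"Item {item}: general = {lam_g:.2f}, passage = {lam_s:.2f}")
```

Because both loadings share the same denominator, the ordering of the two slopes for an item is preserved in the loadings, so the sign of the Difference column carries over to the loading metric.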

Person ability estimates

Since the application of the worse-fitting model leads to inaccurate estimates of the slope parameter, a relevant question is whether it makes any difference to person ability estimates when the test is scored under the worse-fitting model. As indicated in the fourth column of Table 4, the distribution of the scores from the unidimensional IRT (i.e., standard 2PL/GR) model is more dispersed than those from the other two models. The skewness and kurtosis statistics suggest that the three types of scores are all approximately normally distributed. The correlations among ability estimates and the root mean square difference (RMSD) between these ability estimates are also calculated to better understand the degree of correspondence among scores estimated from different IRT models. The RMSD is computed as the square root of the average squared difference between the estimated ability from a hypothesized model and that from the best-fitting model, and is in general regarded as a good measure of accuracy. As can be seen from the last two columns of Table 4 and Figure 3, while the correlation between ability estimates from the TRT and bi-factor models is extremely high with a very small RMSD, the correlation between the 2PL/GR and bi-factor models is comparatively low, accompanied by a large RMSD. This indicates that although the person ability estimates seem to be little affected by using the TRT model, they might be influenced to a very large extent by using the 2PL/GR model, which ignores local item dependence.

Table 4.

Person ability estimates from different IRT models.

	Min	Max	SD	Skewness	Kurtosis	Correlation (with BIF scores)	RMSD (from BIF scores)
UNI scores	−2.87	2.99	0.89	0.11	−0.67	0.772	0.590
TRT scores	−2.72	2.61	0.84	−0.10	−0.39	0.996	0.071
BIF scores	−2.65	2.93	0.85	−0.10	−0.37	1	0

Notes: The mean of the ability distribution was fixed at 0 for estimation purposes for the three models;

UNI scores = scores obtained from the unidimensional IRT (i.e., standard 2PL/GR) model;

TRT scores = scores concerning the general trait obtained from the multidimensional TRT model;

BIF scores = scores concerning the general trait obtained from the multidimensional bi-factor model.
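
The two agreement indices reported in the last columns of Table 4 can be computed as in the sketch below. The ability estimates in the example are hypothetical values chosen for illustration and are not data from the study; the function names are likewise illustrative.

```python
import math


def rmsd(estimates, reference):
    """Root mean square difference between two equal-length score lists."""
    if len(estimates) != len(reference):
        raise ValueError("inputs must have the same length")
    squared = [(e - r) ** 2 for e, r in zip(estimates, reference)]
    return math.sqrt(sum(squared) / len(squared))


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical ability estimates for five test takers (not from the study)
theta_bif = [-1.2, -0.4, 0.0, 0.5, 1.3]  # reference (best-fitting) model
theta_trt = [-1.1, -0.5, 0.1, 0.5, 1.2]  # comparison model

print(round(rmsd(theta_trt, theta_bif), 3), round(pearson_r(theta_trt, theta_bif), 3))
```

The two indices are complementary: the correlation reflects agreement in rank order, whereas the RMSD is sensitive to differences in the actual scale values of the estimates.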

Figure 3.

Comparison of the ability estimates from three models.

Standard errors of estimates

Figure 4 displays the scatter plots of the standard errors associated with the item intercept, item slope and person ability estimates from the three models. Overall, the standard errors of item intercept estimates tend to be lower in the 2PL/GR and TRT models than those in the bi-factor model. They could be spuriously low, giving the false impression that the instrument has a higher precision of the parameter estimates. This effect is less pronounced for the slope parameters as most of the standard errors associated with the slope parameters are on the reference line in the range from 0 to 0.2, except that there is an outlier (a standard error of 0.8 estimated from the bi-factor model), which is far away from the bulk of the data.

Figure 4.

Comparison of the standard errors from three models.

In terms of the standard errors associated with ability estimates, the standard 2PL/GR model displays much more variability than the bi-factor model, as can be seen from the much larger span of dots along the vertical axis, which depicts the standard errors of ability estimates from the standard 2PL/GR model, than along the horizontal axis, which depicts those from the bi-factor model. This indicates that the ability estimates obtained from the standard 2PL/GR model are less consistently precise across test takers. However, there is no consistent trend for the standard 2PL/GR model to yield spuriously high or low measurement error, as the dots are scattered more or less evenly on both sides of the reference line. In addition, compared to the bi-factor model, the estimated standard errors of ability estimates tend to be slightly higher in the TRT model, as can be seen from the larger number of dots below the reference line than above it in the last panel. They could be spuriously high, giving the false impression that the instrument has a low precision of ability estimates.

Discussion

Research question 1: Using real data sets of the reading comprehension section of the GSEEE, the present study found that the bi-factor model was the best-fitting model for the testlet-based reading assessment with both dichotomously and polytomously scored items, followed in order by the TRT model and the standard 2PL/GR model. This corroborates the findings from previous research (DeMars, 2006; Jang & Roussos, 2007; Lee, 1998, 2004; Rijmen, 2010; Thissen et al., 1989; Zhang, 2010) that in testlet-based reading assessment, apart from test-takers’ reading proficiency, test-takers’ interaction with the passage also contributes to the variation in their performance. Therefore, it could be argued that the claim that reading proficiency is a unitary trait (Lunzer & Gardner, 1979; Rost, 1993) may not be substantiated in testlet-based reading assessment.

Moreover, from the perspective of IRT modeling, the findings are compatible with those from previous research (DeMars, 2006; Li et al., 2006; Rijmen, 2010) that the bi-factor model exhibited better model–data fit than the TRT model in testlet-based reading assessment. Given the consistent evidence of the bi-factor model as a better-fitting model as well as the ease of running the bi-factor model in commercial software like TESTFACT and IRTPRO, it is surprising that it has not gained great popularity in testlet-based assessment, especially in language assessment. One plausible explanation is that the TRT model is typically introduced to handle the problem of local dependence of items within a testlet where the testlet effect is in general considered a nuisance factor, while the bi-factor model tends to be preferred when the secondary dimensions embody substantive interpretations (Jeon, Rijmen, & Rabe-Hesketh, 2012). The findings of the present study as well as other relevant studies (DeMars, 2006; Li et al., 2006), however, indicate that since the bi-factor model is applicable to the case of testlets from a technical perspective, its use as a viable solution to local dependence in testlet-based reading assessment should draw due attention from researchers in the field.

Research question 2: Comparing the item parameter estimates from the three different IRT models, the results indicated that for the intercept parameter, the choice of models made almost no difference, though there was a large inaccuracy in item slope estimates when the standard 2PL/GR model was used. These findings, in general, echo much of the literature in the area (DeMars, 2006; Li et al., 2010; So, 2010; Wainer & Wang, 2000), which consistently suggests that the item discrimination parameter is more sensitive to the IRT modeling technique than the item difficulty parameter. However, when it comes to a comparison between the slope parameter estimates from the TRT and bi-factor models, the findings of the present study seem to diverge from those of DeMars’s (2006) study, which reported almost no difference between slope parameter estimates from the TRT model and those from the bi-factor model. One possible explanation for the discrepancy in results will be discussed later as it is also related to the fourth research question.

Another issue that needs to be discussed is the smaller magnitude of the slope parameter on the general trait dimension than on the specific dimension for all five items in the paragraph-reorganizing task. This suggests that, contrary to the test developers’ expectation, the factor on the secondary dimension, rather than the reading proficiency as measured by the primary dimension, is the major factor affecting test-takers’ responses to this task. Therefore, caution should be taken by test users when interpretations and decisions are to be made based on test-takers’ scores on the reading comprehension section of the GSEEE. The larger slope parameter estimates on the secondary dimension for the five items could also help to explain why the slope parameter estimates on the general trait dimension for the five items are outliers in panel (a) but not in panel (b) in Figure 2. Both the TRT and bi-factor models compared in panel (b) take the secondary dimension into consideration, whereas the 2PL/GR model fails to capture the secondary dimension, leading to inflated item slope parameter estimates for items with a strong secondary dimension.

Research question 3: The finding that the scores obtained from the standard 2PL/GR model have the most dispersed distribution is in line with previous empirical research (So, 2010), reconfirming the claim that ignoring local item dependence overstates the test’s power in discriminating test takers on their reading proficiency. In addition, the low correlation and large RMSD between person ability scores from the standard 2PL/GR and bi-factor models suggest that applying a unidimensional IRT model to testlet-based reading assessment results in inaccurate estimates of test-takers’ reading proficiency. This poses a great problem, in particular, when test users make relative, score-based decisions about individual test takers.

The good news is that the person ability estimates from the TRT and bi-factor models were found to be extremely highly correlated, indicating that our interpretation of test-takers’ reading proficiency will be essentially the same, no matter which multidimensional model is used. It can therefore be claimed that the specialized TRT model, though not as well-fitting as the bi-factor model, might still serve as a viable approach to accommodate local item dependence in testlet-based reading assessment when the ultimate goal is to provide scores to individual test takers. However, the fact that the TRT model might give the false impression of a slightly lower precision of ability estimates of the instrument also warrants attention when interpretations are being made.

Another issue that warrants further discussion is the scoring method that is typically used in real practice. The reading comprehension section of the GSEEE as well as most large-scale standardized reading assessment in China, in general, is scored by simply summing up the correctly answered items or at most by using unidimensional IRT models. The results of this study show that such simple scoring methods might not be appropriate because the confounding effect of passages, as evidenced in the present study, cannot be properly teased out. It can therefore be argued that in testlet-based reading assessment, the more appropriate scoring method might be utilizing the bi-factor or TRT model based on which test-takers’ reading proficiency could be estimated with the effect of passages on their performance partialed out.

Research question 4: As expected, the standard 2PL/GR model erroneously suggested a higher level of precision in item parameter estimates, converging with most previous research (DeMars, 2006; Li et al., 2010; Wainer & Wang, 2000) indicating that the unidimensional IRT model tended to overestimate test reliability and test information when local item dependence was present. Similarly, the TRT model was found to yield slightly lower standard errors of item parameter estimates than the bi-factor model, but the difference would be trivial if the influence of the outlier were teased out.

What is particularly noteworthy is that there exists one outlier among the standard errors associated with the slope parameters obtained from the bi-factor model (see Figure 4). Given the large sample size of the current study, it is surprising that there is still such a large standard error. A closer examination reveals that this outlier is probably the cause of the low correlation between the slope parameter estimates from the bi-factor and TRT models discussed under Research Question 2. Specifically, the outlier among standard errors of slope parameter estimates in Figure 4 corresponds to the outlier among slope parameter estimates in Figure 2, both of which are estimates related to Item 19 in the multiple-choice task. If this outlier were removed from the data, the correlation of the slope parameter estimates from the bi-factor and TRT models would reach .95.

Lastly, the finding that the standard 2PL/GR model did not consistently yield spuriously high or low measurement errors is in conflict with Eckes (2013), who reported that, compared to the TRT model, the 2PL model produced spuriously low measurement errors of the ability estimates of test takers in a testlet-based listening test. The divergent findings might be related to the different item types used in the two studies. For one thing, while the TestDaF listening paper analyzed in Eckes’s (2013) study consists of short-answer items and true/false items, the reading comprehension section of the GSEEE analyzed in the present study comprises multiple-choice items, paragraph-reorganizing items and translation items. In the true/false item type in Eckes’s (2013) study, test takers are likely to benefit greatly from their guessing behavior as the probability of getting an item correct without knowing anything is 50%; however, in the paragraph-reorganizing items in the present study, test takers might be penalized if they guess an item wrong as a result of the “item-chaining effect” (Wang et al., 2005). The various effects of guessing behavior on test-takers’ responses to different item types are not taken into consideration in the two studies, which may have led to the divergent findings concerning the measurement errors associated with the ability estimates. For another, as mentioned earlier, different item types might measure different constructs of reading proficiency (Kobayashi, 2002; Shohamy, 1984) and thus lead to local dependence among items of the same item type. The different degrees of local item dependence caused by the item type effect in the two studies are not taken into account, which might have contributed to the contradictory findings. However, very little relevant research concerning the local item dependence caused by item types can be located in the literature, so more studies along this line are needed to explore the reasons for the unexpected difference.

Conclusions

The current study has investigated the effectiveness of two multidimensional IRT models in accommodating local item dependence in testlet-based reading assessment, as well as the impact of applying the worse-fitting model on item parameter estimates, person ability estimates and standard errors associated with these estimates. The findings show that the more complex bi-factor model serves as the best-fitting model to accommodate local dependence within testlets; however, it yields practically the same results as the TRT model in terms of item parameter estimates, person ability estimates and estimates of standard errors, so test developers and users should choose between them on the basis of their interpretability. In addition, it is found that the item slope parameter estimates, person ability estimates and standard errors of estimates are more susceptible to the choice of IRT model than the intercept parameter estimates. This study is of considerable importance both theoretically and practically. Theoretically, it provides additional evidence about the viability of utilizing the bi-factor model in testlet-based reading assessment where the unidimensionality and local independence assumptions are untenable. Practically, it is hoped that this study, by presenting the results and consequences of using different IRT models on item parameter estimates, person ability estimates, and standard errors of these estimates, may help to guide test developers and users to choose the model that not only minimizes estimation inaccuracies but also best satisfies their needs based on available resources.

Despite the contribution of this research, there are three limitations that warrant further investigation. First of all, the item type and passage effects might be confounded to some extent. For one thing, the reading comprehension section of the GSEEE analyzed in the present study includes only one passage for the paragraph-reorganizing task and one passage for the translation task, making it difficult to judge whether it is the item type or the passage content that causes variability in testlet effects; for another, the use of paragraph-reorganizing items might have to some extent guaranteed the large overall testlet effect found in the present study. Therefore, further research is warranted either to separate these effects by using tests with a better balance of item types across passages or to concentrate on more typical passage-based reading items. Future research on the effect of the subskill factor is also needed, as the presence of different subskills measured by different items constitutes a potential source of local item dependence in language assessment. Secondly, the present study is limited in that it did not take advantage of the Bayesian framework, which is one major contribution of the TRT model. The Bayesian framework was not used because the Bock-Aitkin marginal maximum likelihood estimation method³ was adopted for the estimation of the three unidimensional IRT models and the multidimensional bi-factor model; to ensure that differences in item parameter and person ability estimates across models could be interpreted meaningfully, the same estimation method had to be used for the TRT model. However, follow-up studies that utilize the Bayesian framework when comparing different multidimensional IRT models would help to present a more comprehensive picture of the pros and cons of different models. Finally, in this study only the testlet-based items of the reading comprehension section of the GSEEE were analyzed. No attempt was made to analyze other testlet-based components of the GSEEE, such as the cloze test in Part I, together with the testlet-based reading comprehension section. The main reason is that the two multidimensional models used in the present study, the TRT model and the bi-factor model, stipulate that there can be only one general trait factor, making it impossible to analyze more components of the GSEEE simultaneously. A generalization of the bi-factor model, the two-tier full-information item factor analysis model (Cai, 2010), however, relaxes the sole general trait factor restriction so that multiple general trait factors can not only exist but also be allowed to correlate with each other. Therefore, a promising avenue for future research might be the application of the more flexible two-tier model to testlet-based language assessment to provide a more accurate global statement of test-takers’ language proficiency.

Footnotes

Funding

The study reported here is sponsored by two National Social Science Foundation projects [10BYY092 and 11&ZD188]. The preparation of this manuscript is supported by a grant [2013Z77] from Zhejiang Provincial Federation of Social Sciences. An earlier version of this paper was presented at the 2012 Language Testing Research Colloquium, Educational Testing Service, Princeton, USA. We are grateful to the conference attendees who gave very helpful comments that helped to form the final version of this paper and we would also like to express our gratitude to the anonymous reviewers for their constructive and insightful comments and suggestions.

Notes

References

Akaike (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716–723.

Alderson, J. C. (1990a). Testing reading comprehension skills (part one). Reading in a Foreign Language, 6(2), 425–437.

Alderson, J. C. (1990b). Testing reading comprehension skills (part two). Reading in a Foreign Language, 7(1), 463–503.

Alderson, J. C., & Lukmani (1989). Cognition and levels of comprehension as embodied in test questions. Reading in a Foreign Language, 5(2), 253–270.

Alderson, J. C., & Lumley (1995). Responses and replies. Language Testing, 12(1), 121–130.

Andrich, Humphry, S. M., & Marais (2012). Quantifying local, response dependence between two polytomous items using the Rasch model. Applied Psychological Measurement, 36(4), 309–324.

Baker, F. B. (1987). Methodology review: Item parameter estimation under the one-, two-, and three-parameter logistic models. Applied Psychological Measurement, 11(2), 111–141.

Bock, R. D., & Aitkin (1981). Marginal maximum likelihood estimation of item parameters: An application of the EM algorithm. Psychometrika, 46(4), 443–459.

Bock, R. D., & Mislevy, R. J. (1982). Adaptive EAP estimation of ability in a microcomputer environment. Applied Psychological Measurement, 6(4), 431–444.

Bradlow, E. T., Wainer, & Wang (1999). A Bayesian random effects model for testlets. Psychometrika, 64(2), 153–168.

Browne, M. W., & Cudeck (1993). Alternative ways of assessing model fit. In Bollen, K. A., & Long, J. S. (Eds.), Testing structural equation models (pp. 136–162). Newbury Park, CA: SAGE Publications.

Cai (2010). A two-tier full-information item factor analysis model with applications. Psychometrika, 75(4), 581–612.

Cai, Thissen, & du Toit (2011). IRTPRO user’s guide. Lincolnwood, IL: Scientific Software International, Inc.

Cai, Yang, J. S., & Hansen (2011). Generalized full-information item bifactor analysis. Psychological Methods, 16(3), 221–248.

Chen & Thissen (1997). Local dependence indexes for item pairs using item response theory. Journal of Educational and Behavioral Statistics, 22(3), 265–289.

Choi, I.-C., Kim, K. S., & Boo (2003). Comparability of a paper-based language test and a computer-based language test. Language Testing, 20(3), 295–320.

Davis, F. B. (1968). Research in comprehension in reading. Reading Research Quarterly, 3, 499–545.

DeMars, C. E. (2006). Application of the bi-factor multidimensional item response theory model to testlet-based tests. Journal of Educational Measurement, 43(2), 145–168.

DeMars, C. E. (2012). Confirming testlet effects. Applied Psychological Measurement, 36(2), 104–121.

du Toit (2003). IRT from SSI: BILOG-MG, MULTILOG, PARSCALE, TESTFACT. Lincolnwood, IL: Scientific Software International, Inc.

Eckes (2013). Examining testlet effects in the TestDaF listening section: A testlet response theory modeling approach. Language Testing, OnlineFirst, published on July 11, 2013 as doi:10.1177/0265532213492969

Edwards & Edelen (2009). Special topics in item response theory. In Millsap & Maydeu-Olivares (Eds.), The SAGE handbook of quantitative methods in psychology (pp. 178–198). London: SAGE Publications.

Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. NJ: Lawrence Erlbaum.

Gibbons, R. D., & Hedeker, D. R. (1992). Full-information bi-factor analysis. Psychometrika, 57(3), 423–436.

Gibbons, R. D., Bock, Hedeker, Weiss, D. J., Segawa, Bhaumik, D. K., Kupfer, D. J., Frank, Grochocinski, V. J., & Stover (2007). Full-information item bifactor analysis of graded response data. Applied Psychological Measurement, 31(1), 4–19.

(2010). The graduate school entrance English examination. In Cheng & Curtis (Eds.), English language assessment and the Chinese learner (pp. 145–157). New York and London: Routledge.

Holzinger, K. J., & Swineford (1937). The bi-factor method. Psychometrika, 2(1), 41–54.

Jang, E. E., & Roussos (2007). An investigation into the dimensionality of TOEFL using conditional covariance-based nonparametric approach. Journal of Educational Measurement, 44(1), 1–21.

Jeon, Rijmen, & Rabe-Hesketh (2012). Modeling differential item functioning using a generalization of the multiple-group bifactor model. Journal of Educational and Behavioral Statistics, OnlineFirst, published on March 30, 2012 as doi:10.3102/1076998611432173

30.

Jiao

Wang

(2013). Estimation methods for one-parameter testlet models. Journal of Educational Measurement, 50(2), 186–203.

31.

Jiao

Wang

Kamata

(2005). Modeling local item dependence with the hierarchical generalized linear model. Journal of Applied Measurement, 6(3), 311–321.

32. Johnston, P. H. (1983). Reading comprehension assessment: A cognitive basis. Newark, DE: International Reading Association.

33. Kamata & Bauer, D. J. (2008). A note on the relation between factor analytic and item response theory models. Structural Equation Modeling, 15(1), 136–153.

34. Keller, Swaminathan, & Sireci (2003). Evaluating scoring procedures for context-dependent item sets. Applied Measurement in Education, 16(3), 207–222.

35. Kobayashi (2002). Method effects on reading comprehension test performance: Text organization and response format. Language Testing, 19(2), 193–220.

36. Lee, Y.-W. (1998). Examining the suitability of an IRT-based testlet approach to the construction and analysis of passage-based items in an EFL reading comprehension test in the Korean high school context. Doctoral dissertation, The Pennsylvania State University, University Park, PA.

37. Lee, Y.-W. (2004). Examining passage-related local item dependence (LID) and measurement construct using Q3 statistics in an EFL reading comprehension test. Language Testing, 21(1), 74–100.

38. Lee, Kolen, M. J., Frisbie, D. A., & Ankenmann, R. D. (2001). Comparison of dichotomous and polytomous item response models in equating scores from tests composed of testlets. Applied Psychological Measurement, 25(4), 357–372.

39. Lee-Ellis (2009). The development and validation of a Korean C-Test using Rasch Analysis. Language Testing, 26(2), 245–274.

40. Bolt, D. M. (2006). A comparison of alternative models for testlets. Applied Psychological Measurement, 30(1), 3–21.

41. Wang (2010). Application of a general polytomous testlet model to the reading section of a large-scale English language assessment (ETS RR-10-21). Princeton, NJ: Educational Testing Service.

42. Linacre, J. M. (1998). Detecting multidimensionality: Which residual data-type works best? Journal of Outcome Measurement, 2(3), 266–283.

43. Liu & Thissen (2012). Identifying local dependence with a score test statistic based on the bifactor logistic model. Applied Psychological Measurement, 36(8), 670–688.

44. Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum.

45. Lumley (1993). The notion of subskills in reading comprehension tests: An EAP example. Language Testing, 10(3), 211–234.

46. Lunzer & Gardner (Eds.). (1979). The effective use of reading. London: Heinemann Educational Books.

47. Marais, I. D., & Andrich (2008). Effects of varying magnitude and patterns of local dependence in the unidimensional Rasch model. Journal of Applied Measurement, 9(2), 105–124.

48. Munby, J. L. (1978). Communicative syllabus design. Cambridge, UK: Cambridge University Press.

49. Ockey, G. J. (2012). Item response theory. In Fulcher & Davidson (Eds.), Routledge handbook of language testing in a nutshell (pp. 336–349). Florence, KY: Routledge, Taylor & Francis Group.

50. Reckase, M. D. (2009). Multidimensional item response theory. New York: Springer.

51. Rijmen (2010). Formal relations and an empirical comparison among the bi-factor, the testlet, and a second-order multidimensional IRT model. Journal of Educational Measurement, 47(3), 361–372.

52. Rost (1993). Assessing the different components of reading comprehension: Fact or fiction? Language Testing, 10(1), 79–92.

53. Schedl, Gordon, Carey, & Tang, K. L. (1996). An analysis of the dimensionality of TOEFL reading comprehension items (TOEFL research report 53). Princeton, NJ: Educational Testing Service.

54. Schmitt & Clapham (2001). Developing and exploring the behaviour of two new versions of the Vocabulary Levels Test. Language Testing, 18(1), 55–88.

55. Schwarz, G. E. (1978). Estimating the dimension of a model. Annals of Statistics, 6(2), 461–464.

56. Shohamy (1984). Does the testing method make a difference? The case of reading comprehension. Language Testing, 1(2), 147–170.

57. Sireci, S. G., Thissen, & Wainer (1991). On the reliability of testlet-based tests. Journal of Educational Measurement, 28(3), 237–247.

58. (2010). Dimensionality of responses to a reading comprehension assessment and its implications to scoring test takers on their reading proficiency. Doctoral dissertation, University of California, Los Angeles.

59. Song, M.-Y. (2008). Do divisible subskills exist in second language (L2) comprehension? A structural equation modeling approach. Language Testing, 25(4), 435–464.

60. Takane & de Leeuw (1987). On the relationship between item response theory and factor analysis of discretized variables. Psychometrika, 52(3), 393–408.

61. Thissen, Steinberg, & Mooney, J. A. (1989). Trace lines for testlets: A use of multiple-categorical response models. Journal of Educational Measurement, 26(3), 247–260.

62. Wainer (1995). Precision and differential item functioning on a testlet-based test: The 1991 Law School Admissions Test as an example. Applied Measurement in Education, 8(2), 157–187.

63. Wainer & Kiely, G. L. (1987). Item clusters and computerized-adaptive testing: A case for testlets. Journal of Educational Measurement, 24(3), 185–201.

64. Wainer & Wang (2000). Using a new statistical model for testlets to score TOEFL. Journal of Educational Measurement, 37(3), 203–220.

65. Wainer & Bradlow, E. T. (2000). Testlet response theory: An analog for the 3PL model useful in testlet-based adaptive testing. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 245–269). Dordrecht/Boston/London: Kluwer Academic.

66. Wainer, Bradlow, E. T., & Wang (2007). Testlet response theory and its applications. New York: Cambridge University Press.

67. Wang & Vispoel, W. P. (1998). Properties of ability estimation methods in computerized adaptive testing. Journal of Educational Measurement, 35(2), 109–135.

68. Wang, W.-C., Cheng, Y.-Y., & Wilson (2005). Local item dependence for items across tests connected by common stimuli. Educational and Psychological Measurement, 65(1), 5–27.

69. Wang, W.-C., & Wilson (2005a). Exploring local item dependence using a random-effects facet model. Applied Psychological Measurement, 29(4), 296–318.

70. Wang, W.-C., & Wilson (2005b). The Rasch testlet model. Applied Psychological Measurement, 29(2), 126–149.

71. Wang, Bradlow, E. T., & Wainer (2002). A general Bayesian model for testlets: Theory and applications. Applied Psychological Measurement, 26(1), 109–128.

72. Yen, W. M. (1993). Scaling performance assessments: Strategies for managing local item dependence. Journal of Educational Measurement, 30(3), 187–213.

73. Zhang (2010). Assessing the accuracy and consistency of language proficiency classification under competing measurement models. Language Testing, 27(1), 119–140.
