Abstract
This study examined the relative effectiveness of the multidimensional bi-factor model and multidimensional testlet response theory (TRT) model in accommodating local dependence in testlet-based reading assessment with both dichotomously and polytomously scored items. The data used were 14,089 test-takers’ item-level responses to the testlet-based reading comprehension section of the Graduate School Entrance English Exam (GSEEE) in China administered in 2011. The results showed that although the bi-factor model was the best-fitting model, followed by the TRT model, and the unidimensional 2-parameter logistic/graded response (2PL/GR) model, the bi-factor model produced essentially the same results as the TRT model in terms of item parameter, person ability and standard error estimates. It was also found that the application of the unidimensional 2PL/GR model had a bigger impact on the item slope parameter estimates, person ability estimates, and standard errors of estimates than on the intercept parameter estimates. It is hoped that this study might help to guide test developers and users to choose the measurement model that best satisfies their needs based on available resources.
Introduction
The majority of reading assessments, whether in a first language or second/foreign language, comprise sets of passages with a group of items pertaining to each passage. Such passages, usually called testlets¹ (Wainer & Kiely, 1987), are used for many reasons, among which the principal one is higher testing efficiency for test takers (Thissen, Steinberg, & Mooney, 1989). With several items embedded in a testlet, test takers need not spend a considerable amount of time and energy processing a long passage just to answer a single item. Despite this strength, the testing format poses a threat to item analysis because items within a testlet often violate the local independence assumption of item response theory (IRT) (Sireci, Thissen, & Wainer, 1991; Wang & Wilson, 2005a), which stipulates that the probability of a given response to an item is statistically independent of the probability of a given response to any other item in the same test, conditional on the test-taker’s ability (Embretson & Reise, 2000). In other words, a violation of the local independence assumption means that after the impact of the measured construct is partialed out from the item scores, non-negligible correlation still exists between the items (Yen, 1993).
Sources of local item dependence in reading assessments
In reading assessment, the most heatedly discussed potential sources of local item dependence are as follows: (a) common item types; (b) the same subskill measured by different items; and (c) a single passage followed by a set of items.
The item type effect has been discussed in the literature more often in the context of multidimensionality than of local item dependence. However, as pointed out by Andrich, Humphry, and Marais (2012), multidimensionality constitutes one generic source of local item dependence; therefore, the use of different item types within a test (e.g., multiple-choice and constructed-response items) might not only introduce unmodeled variation that can be attributed to a secondary dimension of item type (Linacre, 1998), but might also lead to local dependence among items of the same item type. Previous empirical research (Kobayashi, 2002; Shohamy, 1984) has suggested that various item types measure different aspects of reading comprehension and somewhat different constructs. In addition, some item types inherently carry a higher probability of violating the local independence assumption. A typical example in language assessment is gap filling, in which test takers are required to select words from a list of options to fill in the blanks in a text. A wrong response to one item might result in further wrong responses to subsequent items. The “item-chaining effect” (Wang, Cheng, & Wilson, 2005) of this item type poses a greater threat to the local independence assumption than many other item types do.
The same subskill measured by different items constitutes another major source of local item dependence and multidimensionality in reading assessment. Although it is important to operationalize the reading construct through attempting taxonomies of reading subskills in reading syllabus design (Munby, 1978) and test specifications development (Alderson, 1990a; Lumley, 1993), it has triggered heated debate and speculation among language testers about the existence of reading subskills (Alderson, 1990a, 1990b, 1995; Alderson & Lumley, 1995; Lumley, 1993), the divisibility of reading subskills (Davis, 1968; Johnston, 1983; Song, 2008), as well as the possibility of identifying a single dominant subskill measured by an item (Alderson, 1990a, 1995; Alderson & Lukmani, 1989). Empirical studies investigating subskill-related local dependence and multidimensionality of reading assessment have also provided mixed or even conflicting results. For instance, using the DIMTEST procedure and NOHARM analysis, Schedl, Gordon, Carey, and Tang (1996) reported that different subskills of reading items did not bring multidimensionality to the construct of the reading comprehension section of TOEFL. In line with this, Lee (1998, 2004) employed IRT-based Q3 statistics to an EFL (English as a foreign language) reading comprehension test in Korea and concluded that no significant level of local dependence related to subskills was detected. On the other hand, a study conducted by Jang and Roussos (2007) on the reading section of TOEFL revealed the multidimensionality effect related to reading subskills, by utilizing confirmatory analyses coupled with exploratory cluster analyses and content analyses.
The passage effect has been the major focus for most research on local dependence and multidimensionality in reading assessment because local dependence is almost innate in passage-based reading assessment. When several items are based on a common passage, test takers with diverse background knowledge of the passage, differential skills specific to the passage, or various other motivational factors concerning the passage (DeMars, 2006) might have differential understanding of the passage and hence differential item response behavior. In this way, item responses within a testlet are not only contingent upon test-takers’ reading proficiency measured by the test, but also related to a second trait relevant to the passage. Thus test-takers’ responses to items within the same testlet are more highly correlated than their responses to items across different testlets. Findings of empirical research in this area have consistently shown that items sharing a common passage indeed exhibit local dependence and multidimensionality (e.g., DeMars, 2006; Jang & Roussos, 2007; Lee, 1998, 2004; Rijmen, 2010; Thissen et al., 1989; Zhang, 2010), regardless of the specific test analyzed and/or the specific data analysis technique used.
Approaches to cope with local item dependence
To address violations of the local independence assumption, researchers in the field of educational measurement have suggested a number of approaches. One approach is to group together the items within a testlet, treat the group as a single polytomous item, and apply a unidimensional polytomous IRT model (Lee, 1998; Lee, Kolen, Frisbie, & Ankenmann, 2001; Thissen et al., 1989; Wainer, 1995). One potential drawback of this approach, however, is the loss of item-level information (Wainer, Bradlow, & Du, 2000; Yen, 1993), especially when a testlet is composed of a large number of dependent items (So, 2010; Zhang, 2010). Because the items within a testlet are collapsed into one polytomous item, the approach discards whatever information is contained in the test-takers’ precise pattern of item responses.
An alternative is to model testlet items with multidimensional IRT models, either the bi-factor model (Gibbons & Hedeker, 1992) or the testlet response theory model (Wainer, Bradlow, & Wang, 2007). The bi-factor model has been preferred more by researchers in the area of factor analysis than by those in IRT modeling. It originates from confirmatory factor analysis for continuous responses (Holzinger & Swineford, 1937) and has been extended to IRT analysis of dichotomously scored items by Gibbons and Hedeker (1992) and of polytomously scored items by Gibbons et al. (2007). It incorporates both a primary dimension related to all items and secondary dimensions pertaining to subsets of items only; thus each item response in the bi-factor model depends on both the primary dimension and one of the secondary dimensions. Specifically, each item has a non-zero discrimination parameter on the primary dimension and at most one non-zero discrimination parameter on the testlet dimensions, namely the one for the testlet to which the item belongs. The discrimination parameters for all other testlets are fixed to zero.
A variety of testlet response theory (TRT) models have been proposed in the IRT literature to capture the local dependence of items in testlets from different perspectives, including the Bayesian random-effects testlet models (Bradlow, Wainer, & Wang, 1999; Wainer & Wang, 2000; Wainer et al., 2000; Wainer et al., 2007) from the perspective of “an interaction between a testlet and persons”, the Rasch testlet models (Wang & Wilson, 2005b) from the perspective of “multidimensionality”, and the three-level, one-parameter testlet model (Jiao, Wang, & Kamata, 2005) from the perspective of “contextual effects on items nested within a testlet” (Jiao, Wang, & He, 2013). Among them, the most-cited one in the literature is the Bayesian random-effects testlet model (Bradlow et al., 1999; Wainer & Wang, 2000; Wainer et al., 2000; Wainer et al., 2007), which has also been most widely used in large-scale language assessments (e.g., Eckes, 2013; Rijmen, 2010; Zhang, 2010). As opposed to the bi-factor model where separate discrimination parameters are assigned to the primary and secondary dimensions, the Bayesian random-effects testlet model usually imposes constraints on the relationship between the primary slope and the secondary slopes. Based on different theoretical rationales, three types of constraints might be placed. The first type of constraint, typical in a plethora of Bayesian random-effects testlet models (Bradlow et al., 1999; Wainer & Wang, 2000; Wainer et al., 2000; Wainer et al., 2007), stipulates that there is a proportional relationship between the primary slope and the secondary slopes, implying that items discriminating well on the primary dimension should also discriminate well on the secondary dimensions. 
This supposition, however, does not seem to comply with the reality as it indicates that, for instance, in the case of testlet-based reading comprehension tests, items that discriminate well on test-takers’ reading ability are also more heavily influenced by the secondary dimensions related to the passage. To address this problem, Li, Bolt, and Fu (2006) proposed an alternative TRT model in which an inverse relationship – the second type of constraint – was imposed on the primary slope and the secondary slopes, in accordance with the rationale that an item discriminating well on the primary dimension should have less discrimination power on the secondary dimensions. Apart from this, Li et al. (2006) also introduced a third type of TRT model that constrains all slope estimates to be the same within a testlet and compared its fit to that of the first two types of TRT models as well as the bi-factor model (Gibbons & Hedeker, 1992), finding that the bi-factor model yielded the best model–data fit.
In addition to constraints on the relationship between the primary slope and the secondary slopes, the Bayesian random-effects testlet models also vary in terms of the constraints placed on the variance estimates of the secondary dimensions. Bradlow et al.’s (1999) TRT model, the first attempt to model test-takers’ responses to testlet-based items, assumes that the variances of all testlets are equal by constraining the proportionality constants to be the same across all testlets, while others (Wainer & Wang, 2000; Wainer et al., 2000; Wainer et al., 2007; Wang, Bradlow, & Wainer, 2002) allow the proportionality constants to vary across testlets, rendering the model less parsimonious but enabling researchers to gauge the relative contribution of each testlet factor.
The main difference between the bi-factor model and Bayesian random-effects testlet model resides in parameter constraints/model parsimony and model identification. Owing to the additional constraints placed, the Bayesian random-effects testlet model is generally more parsimonious than the bi-factor model. This is one of the major advantages of using the Bayesian random-effects testlet model over the bi-factor model because with fewer item parameters estimated in the Bayesian random-effects testlet model, estimation efficiency can be improved. Moreover, practitioners always strive to arrive at simpler models if they are available and the model fit does not deteriorate too much. One thing that should be noted is that, although generally no constraint is placed on the relationship between the primary slope and the secondary slopes in the bi-factor model, when a testlet only consists of two items, some constraint needs to be placed in the bi-factor model on the two secondary dimension slope parameters. A typical way is to constrain the absolute value of bi-factor slopes of the two items to be equal (Liu & Thissen, 2012). In terms of model identification, the additional constraints on the slope estimates in the Bayesian random-effects testlet model permit the estimations of the variances of the secondary dimensions, whereas in the bi-factor model the variances of the secondary dimensions are typically set to 1 for identification purposes. The mathematical definitions of the Bayesian random-effects testlet model and bi-factor model for both dichotomously and polytomously scored items are provided by Wainer et al. (2007) and Cai, Yang, and Hansen (2011) respectively.
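For orientation, the dichotomous (2PL) versions of the two models can be written side by side in slope-intercept form. The block below is an illustrative sketch, not a reproduction of the formulations in Wainer et al. (2007) or Cai, Yang, and Hansen (2011): the symbols a_{jg}, a_{js}, c_j, a_j, b_j, \gamma_{i d(j)} and \sigma_{d(j)} are notation assumed here, with d(j) denoting the testlet that contains item j.

```latex
% Bi-factor 2PL: item j loads on the primary dimension g and on its own testlet dimension s = d(j)
P(y_{ij} = 1 \mid \theta_{ig}, \theta_{is})
  = \frac{1}{1 + \exp\{-(a_{jg}\theta_{ig} + a_{js}\theta_{is} + c_j)\}},
\qquad \theta_{ig}, \theta_{is} \sim N(0, 1)

% Random-effects testlet 2PL: one slope per item, testlet effect with a freely estimated variance
P(y_{ij} = 1 \mid \theta_i, \gamma_{i d(j)})
  = \frac{1}{1 + \exp\{-a_j(\theta_i - b_j - \gamma_{i d(j)})\}},
\qquad \gamma_{i d(j)} \sim N(0, \sigma^2_{d(j)})

% Rescaling \gamma_{i d(j)} = -\sigma_{d(j)}\theta_{is} recovers the bi-factor form with
a_{jg} = a_j, \qquad a_{js} = a_j\,\sigma_{d(j)}, \qquad c_j = -a_j b_j
```

Read this way, the testlet model is a bi-factor model in which the secondary slopes within a testlet are tied to the primary slopes through a single constant, which is why it estimates testlet variances where the bi-factor model fixes them to 1.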
Consequences of ignoring local item dependence
Despite these slight differences among the various multidimensional IRT models, they can all be applied to handle the violation of the local independence assumption in testlet-based assessment and to minimize or eliminate the problems that might be incurred when utilizing standard IRT models, including inaccurate item parameter and person ability estimates (Bradlow et al., 1999; Chen & Thissen, 1997), overstatement of test information and measurement precision (Keller, Swaminathan, & Sireci, 2003; Sireci et al., 1991; Thissen et al., 1989; Wainer, 1995; Yen, 1993), errors in equating/scaling (Lee et al., 2001), as well as item misfit (Marais & Andrich, 2008).
Applications in language tests
The limitation of the standard IRT models in handling the violation of the local independence assumption has also been discussed by a plethora of researchers in the field of language assessment (e.g., Choi, Kim, & Boo, 2003; Eckes, 2013; Lee, 1998, 2004; Lee-Ellis, 2009; Ockey, 2012; So, 2010; Schmitt, Schmitt, & Clapham, 2001; Zhang, 2010). Applications of the three different approaches to bypass the problem of local item dependence can be found in large-scale language tests. For example, Lee (1998) utilized a series of polytomous IRT models to circumvent the local dependence between items within a testlet in an EFL reading comprehension test in Korea. After recognizing the potential problem of information loss of the polytomous IRT approach, Wainer and Wang (2000) attempted to use the 3-parameter logistic Bayesian random-effects testlet model to characterize the local dependence among items within the TOEFL listening and reading comprehension testlets, finding that estimates of item difficulty parameters were not affected by testlet-associated local item dependence, while estimates of discrimination and guessing parameters were inaccurate if local item dependence was ignored. Similarly, Eckes (2013) applied the 2-parameter logistic Bayesian random-effects testlet model to analyze three testlets in the listening section of the Test of German as a Foreign Language (TestDaF). Eckes demonstrated that local item dependence did not affect estimates of item difficulty and discrimination parameters but led to overestimated test reliability and underestimated standard error of ability estimates. Studying the classification accuracy of language proficiency under different measurement models, Zhang (2010) showed evidence that compared to the standard IRT model, the three-parameter logistic Bayesian random-effects testlet model yielded substantially larger standard errors of ability estimates, suggesting the inflation of classification accuracy of the standard IRT model.
Further, Li, Li, and Wang (2010) applied the bi-factor model to accommodate local dependence in a testlet-based English reading test that contained both dichotomously and polytomously scored items, demonstrating that local item dependence exerted a small influence on the item parameter estimates but a comparatively larger influence on test information and reliability. They also cautioned that the added item parameters estimated in the bi-factor model might have compromised the fit of the model; therefore, they called for the use of a multidimensional IRT model with a simpler structure. On the basis of substantiating the presence of both a primary trait dimension and secondary passage dimension through a series of confirmatory factor analyses, So (2010) applied the bi-factor model to analyze the testlet-based reading paper of the Certificate in Advanced English (CAE), finding that though the presence of a secondary passage dimension did not significantly affect estimation of item difficulty parameters, it had significant effects on the estimation of the discrimination parameter, which consequently led to an underestimation of lower-ability test takers and overestimation of higher-ability test takers.
The majority of the studies summarized above have compared either the multidimensional TRT model with the unidimensional IRT model (Eckes, 2013; Wainer & Wang, 2000; Zhang, 2010) or the multidimensional bi-factor model with the unidimensional IRT model (So, 2010), but little research (DeMars, 2006; Rijmen, 2010) has been conducted to compare systematically the relative effectiveness of these two different multidimensional IRT models in capturing local dependence in language tests. Using both simulated data and real data of math and reading testlets, DeMars (2006) compared the bi-factor model, TRT model, testlets-as-polytomous-items model, and independent-items model, reporting that in both simulated data and real data the bi-factor model generally provided better model–data fit than the more specialized TRT model and the independent-items model. However, based on the results of root mean square error and bias in different manipulated circumstances, the more parsimonious TRT model was found to be favored over the bi-factor model. Yet, DeMars (2006) concluded with a promising outlook for the use of the bi-factor model among applied practitioners, because for one thing, the bi-factor model can be run easily and efficiently in commercial software; for another, the accuracy of the slope estimates and ability estimates was found to be maintained using the bi-factor model even when the data were generated from the more constrained TRT model. On the other hand, Rijmen’s (2010) study, which fitted the bi-factor model, TRT model and unidimensional IRT model to the data from an international English test, provided evidence that the proportionality constraints that were placed on the relationship between the primary slope and secondary slopes in the TRT model were too stringent and therefore advocated for the use of the bi-factor model in testlet-based assessment.
Research findings on the effectiveness of the multidimensional bi-factor model and the multidimensional TRT model in accommodating local item dependence in testlet-based language assessments are mixed. Moreover, the scope of these comparative studies on multidimensional IRT models has been mainly confined to the analysis of dichotomously scored items, with the exception of Li et al. (2010), probably because multidimensional IRT analysis of polytomous data is a more recent development. In language assessment, however, a mixed format with both dichotomously and polytomously scored items might represent a more realistic scenario.
The present study
The purpose of this study, therefore, is to investigate whether the claim that the bi-factor model serves as a practical alternative to the TRT model in testlet-based assessment holds for mixed-format language tests with both dichotomously and polytomously scored items. If so, what is the consequence of selecting a worse-fitting model to address the local dependence between items within a testlet, or of ignoring the local dependence by applying the unidimensional IRT model? In order to address these issues, this study aims to answer the following four research questions:
To what extent do the unidimensional IRT model, the multidimensional TRT model, and the multidimensional bi-factor model fit the testlet-based reading comprehension section of the GSEEE?
To what extent are item parameter estimates influenced by different IRT models?
To what extent are person ability estimates influenced by different IRT models?
To what extent are standard errors of item parameter and person ability estimates influenced by different IRT models?
Method
Data
The data analyzed in the present study were 14,089 test-takers’ item-level responses to the reading comprehension section of the Graduate School Entrance English Exam (GSEEE) in China administered in 2011. The 14,089 test takers were candidates applying to programs of various disciplines at one major university on the eastern coast of China in that year. The GSEEE is designed and administered by the National Education Examinations Authority (NEEA) under the Ministry of Education in China to provide information for educational or research institutions in selecting candidates for their Master’s programs (He, 2010). With an annual testing population of over 1 million, the GSEEE is one of the two highest-stakes tests in China, the other being the National College Entrance Exam. For most test takers, the results of the test will determine their career path and for some, change their life. Different cut-off scores are set for different majors, and the GSEEE cut-off scores in 2011 ranged from 45 to 60 for candidates applying to the university, resulting in an overall selection ratio of 41.8%. The GSEEE assesses test-takers’ English language proficiency in three areas: (1) use of English (a cloze test) accounting for 10% of the total score; (2) reading comprehension, 60%; and (3) writing, 30%. The reading comprehension section, which is the main focus of this study, consists of three parts:
Part I: multiple-choice items. Test takers are required to select the best answer to each of the 20 four-option multiple-choice items based on four passages. The items are dichotomously scored.
Part II: paragraph-reorganizing items. Test takers are required to put five jumbled sentences or paragraphs in the correct order to form a coherent passage. The items are scored on a 2-point scale.
Part III: translation items. Test takers are required to translate five underlined sentences in a passage into Chinese. The items are scored on a 3-point scale.
Data analyses
Unidimensionality and local independence check
First, in order to check whether local dependence exists between items within a testlet, the authors examined the standardized local dependence (LD) χ2 statistic, obtained from unidimensional IRT modeling via IRTPRO 2.1 (Cai, Thissen, & du Toit, 2011). The standardized LD χ2 statistic is based on the local dependence statistic proposed by Chen and Thissen (1997) but extended to accommodate polytomous responses. It is computed by comparing the observed and expected frequencies in the two-way marginal table for each item pair and is then standardized to make values comparable among items with different numbers of response categories (Cai, Thissen, & du Toit, 2011). Because these are approximately standardized statistics, values exceeding 4 suggest clear local dependence between items, and values exceeding 10 suggest extreme local dependence between items (Cai, personal communication, July 22, 2013).
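The comparison of observed and expected frequencies described above can be sketched as follows. This is a minimal illustration and not IRTPRO's implementation: the function name, the example tables, and in particular the standardization (X2 − df)/√(2·df) are assumptions made here, and the model-expected frequencies are taken as given rather than derived from a fitted IRT model.

```python
import math

def standardized_ld_x2(observed, expected):
    """Pearson X2 over a two-way table, followed by an approximate standardization.

    observed, expected: lists of rows holding frequencies for one item pair.
    The (X2 - df) / sqrt(2 * df) standardization is an assumption of this sketch.
    """
    x2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return (x2 - df) / math.sqrt(2 * df)

# Hypothetical 2 x 2 tables for a pair of dichotomous items (not GSEEE data).
observed = [[270, 230], [230, 270]]
expected = [[250, 250], [250, 250]]
value = standardized_ld_x2(observed, expected)
```

Because the degrees of freedom depend on the number of response categories of the two items, subtracting df and dividing by √(2·df) is one way to put dichotomous and polytomous item pairs on a roughly comparable scale.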
IRT modeling
Secondly, the authors applied three unidimensional and two multidimensional IRT models to the testlet-based reading comprehension section of the GSEEE using IRTPRO 2.1 (Cai, Thissen, & du Toit, 2011), a program released by Scientific Software International (SSI) to encompass the functions of BILOG-MG, MULTILOG, PARSCALE and TESTFACT with newly advanced features. One major advantage of IRTPRO 2.1 relevant to this study is that the bi-factor model can be estimated in IRTPRO 2.1 with both dichotomously and polytomously scored items, while the original software programs, such as NOHARM and TESTFACT, can be used for the bi-factor model only when the data are dichotomously scored (Edwards & Edelen, 2009).
The three unidimensional IRT models applied were a hybrid of standard 1-parameter logistic/graded response (1PL/GR) model, 2-parameter logistic/graded response (2PL/GR) model, and 3-parameter logistic/graded response (3PL/GR) model. After the best unidimensional IRT model for the data was decided, two multidimensional IRT models, namely, the TRT and bi-factor models, were estimated, both of which utilized the best unidimensional IRT model at the item level.
To identify the TRT model (Wainer et al., 2007), the locations of all the dimensions, including the primary dimension and the secondary testlet dimensions, and the scale of the primary dimension need to be fixed, but the scale of the specific dimensions can be freely estimated (Rijmen, 2010). Specifically, in this study, the mean and variance of the primary dimension were set to 0 and 1, respectively; the mean of each specific dimension was set to 0, with testlet-specific variances freely estimated. The slope estimates on the secondary dimensions were constrained to be proportional to the slope estimate on the primary dimension. The proportionality constants were free to vary, allowing us to gauge the relative strength of the local dependence of different testlets. To identify the bi-factor model (Cai, Yang, & Hansen, 2011), the mean and variance were set to 0 and 1, respectively, for all the traits, including the primary dimension and the secondary specific dimensions. It is assumed in both the TRT model and the bi-factor model that the primary dimension and the secondary testlet dimensions are jointly normally distributed and mutually orthogonal.
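The proportionality constraint can be made concrete with a short sketch. Assuming the standard rescaling in which a testlet dimension with variance σ² is replaced by a unit-variance dimension, an item's implied secondary slope equals its primary slope times σ. The function name and all numerical values below are hypothetical and are not estimates from the GSEEE data.

```python
import math

N_TESTLETS = 6
ITEMS_PER_TESTLET = 5

def trt_to_bifactor_slopes(primary_slopes, testlet_variances):
    """Return (primary, secondary) slope pairs on standardized dimensions.

    Under the proportionality constraint, the secondary slope of an item
    equals its primary slope times the SD of its testlet dimension.
    """
    rows = []
    for j, a in enumerate(primary_slopes):
        d = j // ITEMS_PER_TESTLET  # index of the testlet containing item j
        rows.append((a, a * math.sqrt(testlet_variances[d])))
    return rows

# Hypothetical values, not estimates from the GSEEE data.
primary = [1.0 + 0.01 * j for j in range(N_TESTLETS * ITEMS_PER_TESTLET)]
variances = [0.25, 0.49, 0.16, 0.36, 0.09, 0.64]
slopes = trt_to_bifactor_slopes(primary, variances)

# Within a testlet the secondary/primary ratio is constant under the TRT model,
# whereas the bi-factor model leaves every secondary slope free.
ratios = [s / p for p, s in slopes]
```

The constant ratio within each testlet is the proportionality constant referred to above; letting it differ across testlets corresponds to estimating one variance per testlet.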
For all five models, the item parameters were estimated with Bock–Aitkin marginal maximum likelihood estimation (Bock & Aitkin, 1981), which has been shown to be an effective estimation method for both unidimensional and two-dimensional IRT models (Cai, Thissen, & du Toit, 2011). In addition, during all calibration runs, a standard normal distribution N (0, 1) constraint was imposed for θg (a given ability on the general trait dimension) and θs (a given ability on the secondary passage dimension) for the bi-factor model, and for θg for the TRT, standard 1PL/GR, 2PL/GR, and 3PL/GR models. Thus, the parameter estimates obtained from different IRT models were on the same scale (Li et al., 2010). IRT scale scores were computed using the expected a posteriori (EAP) method (Bock & Mislevy, 1982), which in general requires less computation and produces a smaller standard error of ability estimates (Wang & Vispoel, 1998).
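The EAP scoring step can be illustrated for the simplest case, a unidimensional 2PL model with a standard normal prior. This is a sketch under stated assumptions rather than the routine used by IRTPRO: the function name, the evenly spaced quadrature grid, and the item parameters are all hypothetical, and polytomous (GR) items and testlet dimensions are left out.

```python
import math

def eap_2pl(responses, slopes, intercepts, n_points=121, bound=6.0):
    """EAP ability estimate and posterior SD for a unidimensional 2PL model.

    Uses P(correct) = 1 / (1 + exp(-(a * theta + c))) and a standard normal
    prior evaluated on an evenly spaced grid (a simple quadrature choice).
    """
    step = 2 * bound / (n_points - 1)
    grid = [-bound + k * step for k in range(n_points)]
    posterior = []
    for theta in grid:
        weight = math.exp(-0.5 * theta * theta)  # unnormalized N(0, 1) prior
        for y, a, c in zip(responses, slopes, intercepts):
            p = 1.0 / (1.0 + math.exp(-(a * theta + c)))
            weight *= p if y == 1 else 1.0 - p
        posterior.append(weight)
    total = sum(posterior)
    mean = sum(t * w for t, w in zip(grid, posterior)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, posterior)) / total
    return mean, math.sqrt(var)

# Hypothetical item parameters and response pattern.
theta_hat, se = eap_2pl([1, 1, 0, 1], [1.2, 0.8, 1.5, 1.0], [0.0, 0.5, -0.3, 0.2])
```

The posterior standard deviation returned alongside the posterior mean plays the role of the standard error of the ability estimate in the comparisons that follow.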
Results
The unidimensionality and local independence assumption
The reading comprehension section analyzed in the present study comprised 6 five-item testlets, so the authors examined the standardized LD χ2 statistic for the 60 within-testlet item pairs (6 × 10, with 10 possible pairs among the five items of each testlet).
The results showed that, when the unidimensional IRT (i.e., standard 2PL/GR) model was applied, 49 out of the 60 item pairs yielded a value of over 4 for the standardized LD χ2 statistic, indicating that test-takers’ responses to items in most of the item pairs were locally dependent and these item pairs may measure an un-modeled passage dimension. More specifically, the proportion of item pairs exhibiting local dependence reached 100%, 80%, 90%, 70%, 50% and 100% for the six testlets respectively, suggesting that all of the six testlets suffered from violation of local independence. It is also noteworthy that out of the 60 item pairs, 35 pairs produced a value of over 10 on the standardized LD χ2 statistic, suggesting an extreme level of local dependence.
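The pair counts reported above follow from simple combinatorics, which the short check below reproduces from the figures given in the text. The variable names are illustrative.

```python
from math import comb

N_TESTLETS = 6
ITEMS_PER_TESTLET = 5

pairs_per_testlet = comb(ITEMS_PER_TESTLET, 2)  # 10 item pairs within each testlet
total_pairs = N_TESTLETS * pairs_per_testlet    # 60 within-testlet pairs overall

# Reported share of locally dependent pairs in each testlet, in percent.
flagged_percent = [100, 80, 90, 70, 50, 100]
flagged_pairs = [pct * pairs_per_testlet // 100 for pct in flagged_percent]
total_flagged = sum(flagged_pairs)              # 49 pairs with a value over 4
```

The per-testlet percentages therefore add up to the 49 flagged pairs out of 60 reported for the standard 2PL/GR model.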
Overall model fit
Preliminary analysis
A preliminary analysis was conducted to decide the best unidimensional IRT model for the data. Table 1 summarizes the fit indices of the three unidimensional IRT models, including df (the degrees of freedom), −2log-likelihood (minus twice the log-likelihood evaluated at the maximum likelihood estimates), AIC (Akaike information criterion; −2log-likelihood plus twice the number of parameters) (Akaike, 1974), BIC (Bayesian information criterion; −2log-likelihood plus the logarithm of the sample size times the number of parameters) (Schwarz, 1978), and RMSEA (root mean square error of approximation; a value of 0.05 or below suggests adequate fit) (Browne & Cudeck, 1993). According to these fit indices, the 3PL/GR model provides the best model–data fit among the three unidimensional IRT models, as indicated by the lowest −2log-likelihood value, and the best balance between model misfit and model parsimony, as indicated by the lowest AIC and BIC values (Rijmen, 2010). However, a closer examination of the item parameters estimated from the 3PL/GR model indicated that out of the 30 items, 6 items had poorly estimated difficulty/discrimination/guessing parameters, with standard errors reaching over 1000. One possible reason is that the test analyzed in the present study is targeted at proficient learners of English who have generally completed two years of EFL education at college or university. A lack of information at the lower end of the ability scale may therefore have led to unstable estimation of the guessing parameter in the 3PL model (Lord, 1980), which consequently led to poor estimation of other item parameters (Baker, 1987). The second best-fitting unidimensional IRT model, the 2PL/GR model, was therefore chosen as the unidimensional IRT model for the data. Accordingly, a combination of the standard 2PL model and the GR model was used at the item level for the multidimensional IRT analyses.
Table 1. Summary of fit indices of three unidimensional IRT models.
Notes: NP = number of parameters; * p < .05.
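The AIC and BIC definitions used in the preliminary analysis can be written out directly. In the sketch below, only the sample size of 14,089 comes from the study; the −2log-likelihood values and parameter counts are hypothetical placeholders and are not the figures in Table 1.

```python
import math

def aic(neg2_loglik, n_params):
    """AIC = -2 log-likelihood + 2 * number of parameters."""
    return neg2_loglik + 2 * n_params

def bic(neg2_loglik, n_params, n_persons):
    """BIC = -2 log-likelihood + ln(sample size) * number of parameters."""
    return neg2_loglik + math.log(n_persons) * n_params

N_PERSONS = 14089  # sample size reported for this study

# Hypothetical fit values for two competing models (not the Table 1 figures).
simple = {"neg2ll": 500000.0, "np": 40}
richer = {"neg2ll": 499900.0, "np": 70}

aic_prefers_richer = (
    aic(richer["neg2ll"], richer["np"]) < aic(simple["neg2ll"], simple["np"])
)
bic_prefers_richer = (
    bic(richer["neg2ll"], richer["np"], N_PERSONS)
    < bic(simple["neg2ll"], simple["np"], N_PERSONS)
)
```

With a sample this large, ln(N) is well above 2, so BIC penalizes each added parameter more heavily than AIC; in the hypothetical case above the two criteria disagree, which is why agreement between them is informative.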
Main analysis
When comparing the selected unidimensional IRT model with the multidimensional IRT models, it is found that the bi-factor model is the best one among all the three models as it has the lowest −2log-likelihood, AIC and BIC values, as can be seen from Table 2. Moreover, the bi-factor model has the lowest value of 0.01 for RMSEA, suggesting that this model fits the data extremely well.
Table 2. Summary of fit indices of unidimensional and multidimensional IRT models.
Notes: NP = number of parameters; * p < .05.
Because the standard 2PL/GR model is nested in the TRT model, and the TRT model is nested in the bi-factor model, the significance of the difference in −2log-likelihood can be tested with a series of χ2-difference tests (du Toit, 2003) to decide which one is the best-fitting model, with degrees of freedom equal to the difference in the number of parameters. The results of the χ2-difference tests showed that the bi-factor model fit the data significantly better than the TRT model (χ2 (24) = 731, p < .05), which in turn fit the data significantly better than the standard 2PL/GR model (χ2 (6) = 9714, p < .05) (see the last three columns of Table 2). The bi-factor model, therefore, was chosen as the final model for the present study. This suggested, on the one hand, that the secondary dimensions were strong enough to cause local dependence between items within a testlet and, on the other hand, that the constraints imposed by the TRT model on the slope parameters were too stringent.
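The χ2-difference tests can be retraced with a few lines of code. The sketch below is illustrative only: the helper function is hypothetical and relies on a closed-form tail probability that holds for even degrees of freedom, which happens to cover both tests here. The degrees-of-freedom accounting (30 secondary slopes in the bi-factor model against 6 testlet variances in the TRT model) is our reading of the reported values, not a statement taken from the software output.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variate; valid for even df only.

    For df = 2k: P(X > x) = exp(-x / 2) * sum_{i=0}^{k-1} (x / 2)^i / i!
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

N_ITEMS, N_TESTLETS = 30, 6
df_bifactor_vs_trt = N_ITEMS - N_TESTLETS  # 30 free secondary slopes vs 6 variances
df_trt_vs_unidim = N_TESTLETS              # 6 testlet variances added to the 2PL/GR model

# Differences in -2log-likelihood as reported in the text.
p_bifactor_vs_trt = chi2_sf_even_df(731.0, df_bifactor_vs_trt)
p_trt_vs_unidim = chi2_sf_even_df(9714.0, df_trt_vs_unidim)
```

Both tail probabilities are far below .05 (the second underflows to zero in double precision), in line with the reported significance of the two comparisons.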
Nonetheless, in operational contexts it seems common either to apply the unidimensional IRT model, ignoring the testlet structure, or to apply the more specialized TRT model for item analysis, so it is worth examining the consequences of applying a worse-fitting model in testlet-based assessment. In what follows, the item parameter estimates, person ability estimates and standard errors of these estimates obtained from the standard 2PL/GR and TRT models are each compared with those obtained from the best-fitting model, the bi-factor model.
Item parameter estimates
Intercept. 2
The item intercept parameter is negatively associated with the difficulty parameter of the item (Reckase, 2009): in general, the higher the item intercept estimate, the easier the item. The left panel of Figure 1 depicts the relationship between the intercept parameter estimates obtained from the standard 2PL/GR model (UNI-c) and the bi-factor model (BIF-c). As can be seen from the figure, the item intercept parameters estimated from these two models are highly correlated (r = .98). This suggests that item intercept estimates are little affected by local dependence between items within testlets and that the unidimensional IRT model is sufficient for this purpose. The right panel of Figure 1 shows a similar trend, indicating that applying the TRT model does not result in any obvious inaccuracy in the intercept parameter estimates.

Figure 1. Comparison of the intercept parameter estimates from three models.
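The negative association between intercept and difficulty can be illustrated for a single dichotomous item. The sketch below assumes the common slope-intercept form of the unidimensional 2PL model, P(correct) = 1 / (1 + exp(−(aθ + c))), in which the difficulty is b = −c / a; the symbol c follows the UNI-c and BIF-c labels used above, and the function names and item values are hypothetical.

```python
import math


def prob_correct(theta: float, a: float, c: float) -> float:
    """2PL response probability in slope-intercept form (assumed here)."""
    return 1.0 / (1.0 + math.exp(-(a * theta + c)))


def difficulty_from_intercept(a: float, c: float) -> float:
    """Difficulty b implied by slope a and intercept c: b = -c / a."""
    return -c / a


# Two hypothetical items with the same slope but different intercepts.
slope = 1.2
for intercept in (-0.6, 0.6):
    b = difficulty_from_intercept(slope, intercept)
    p_average = prob_correct(0.0, slope, intercept)
    print(f"c = {intercept:+.1f} -> b = {b:+.2f}, P(correct | theta = 0) = {p_average:.3f}")
```

The item with the higher intercept has the lower difficulty and the higher probability of a correct response for a test taker of average ability, which is the relationship described above.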
Slope
The item slope parameter has the same interpretation as the discrimination parameter of the item: the higher the item slope parameter, the more discriminating the item. Figure 2 depicts the relationship between the slope/discrimination parameter estimates obtained from the three models. Unlike the intercept parameter estimates, the slope/discrimination parameter estimates (with respect to the general trait) are correlated only at a medium level (r = .55) for the standard 2PL/GR and bi-factor models, and at a somewhat higher level (r = .78) for the TRT and bi-factor models, suggesting that applying the worse-fitting models may result in considerable inaccuracy in estimates of item slope/discrimination parameters. Note that there is an extreme outlier above the reference line in both panel (a) and panel (b). Below the reference line, five outliers are evident in panel (a) but not in panel (b); these correspond to the five items in the paragraph-reorganizing task.

Figure 2. Comparison of the slope parameter estimates from three models.
Table 3 provides a comparison of the slope/discrimination parameters on the general trait dimension and on the passage dimension estimated from the bi-factor model. The results show that for most items (i.e., 18/30), the slope/discrimination parameter on the trait dimension is larger than that on the passage dimension, indicating that most items assess test-takers’ reading proficiency more than their background knowledge of the passage. In other words, most items in the reading comprehension section of the GSEEE measure reading proficiency rather than construct-irrelevant passage content, since the slope/discrimination parameter can be directly transformed to a factor loading (Kamata & Bauer, 2008; Takane & de Leeuw, 1987). However, it is noteworthy that for all the items in the paragraph-reorganizing task, the slope/discrimination parameter on the trait dimension is smaller than that on the secondary dimension, suggesting that test-takers’ responses to these five items are more affected by the secondary dimension. In addition, content analyses reveal that all four multiple-choice items with a much larger slope/discrimination parameter on the secondary dimension than on the trait dimension (i.e., a large negative difference value), namely Items 3, 8, 13, and 19, measure test-takers’ inferential skills. However, other inference items (i.e., Items 4, 9, 18, and 20) do not exhibit such a trend. It is therefore difficult to identify the causes of the larger influence of the secondary passage dimension on these items, or to specify a common deficiency among them, simply by conducting content analyses from the perspective of reading subskills.
Table 3. Item slope parameter estimates from the bi-factor model.
Notes: BIF-Trait-a = the slope parameter estimates concerning the trait obtained from the bi-factor model;
BIF-Passage-a = the slope parameter estimates concerning the passage obtained from the bi-factor model;
Difference = BIF-Trait-a – BIF-Passage-a.
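The transformation from slopes to factor loadings mentioned above can be sketched as follows. The sketch uses the standard relation λ_k = a_k / sqrt(1 + Σ_j a_j²) for normal-ogive slopes on orthogonal dimensions, and it assumes that logistic slopes are first divided by the scaling constant D = 1.702. Whether the slopes reported in this study are on the logistic or the normal-ogive metric is not stated, so that rescaling step, the function name, and the item values are all assumptions made for illustration.

```python
import math

D = 1.702  # usual logistic-to-normal-ogive scaling constant (assumed applicable)


def slopes_to_loadings(slopes, logistic_metric=True):
    """Convert one item's slopes on orthogonal dimensions to factor loadings.

    loading_k = a_k / sqrt(1 + sum_j a_j**2), with a_k on the normal-ogive metric.
    """
    scaled = [a / D if logistic_metric else a for a in slopes]
    denominator = math.sqrt(1.0 + sum(s * s for s in scaled))
    return [s / denominator for s in scaled]


# Hypothetical bi-factor item: (slope on general trait, slope on its passage).
trait_slope, passage_slope = 1.5, 0.8
trait_loading, passage_loading = slopes_to_loadings([trait_slope, passage_slope])
print(f"trait loading = {trait_loading:.3f}, passage loading = {passage_loading:.3f}")
```

Because both slopes of an item are divided by the same denominator, the ordering of the two slopes carries over to the loadings, which is why a comparison of slopes within an item can be read as a comparison of how strongly the item reflects each dimension.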
Person ability estimates
Since the application of the worse-fitting model leads to inaccurate estimates of the slope parameter, a relevant question is whether scoring the test under a worse-fitting model makes any difference to person ability estimates. As indicated in the fourth column of Table 4, the scores from the unidimensional IRT (i.e., standard 2PL/GR) model are more dispersed than those from the other two models. The skewness and kurtosis statistics suggest that all three types of scores are approximately normally distributed. The average correlations among ability estimates and the average root mean square difference (RMSD) between these ability estimates are also calculated to better understand the degree of correspondence among scores estimated from different IRT models. The RMSD is computed as the square root of the average squared difference between the estimated ability from a hypothesized model and that from the best-fitting model, and is in general regarded as a good measure of accuracy. As can be seen from the last two columns of Table 4 and from Figure 3, while the correlation between ability estimates from the TRT and bi-factor models is extremely high with a very small RMSD, the correlation between the 2PL/GR and bi-factor models is comparatively low, accompanied by a large RMSD. This indicates that although the person ability estimates seem to be little affected by using the TRT model, they might be influenced to a very large extent by using the 2PL/GR model, which ignores local item dependence.
Table 4. Person ability estimates from different IRT models.
Notes: The mean of the ability distribution was fixed at 0 for estimation purposes for the three models;
UNI scores = scores obtained from the unidimensional IRT (i.e., standard 2PL/GR) model;
TRT scores = scores concerning the general trait obtained from the multidimensional TRT model;
BIF scores = scores concerning the general trait obtained from the multidimensional bi-factor model.

Figure 3. Comparison of the ability estimates from three models.
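The two agreement measures used in this section, the RMSD and the Pearson correlation, can be computed as in the following minimal Python sketch. The function names and the short lists of ability estimates are hypothetical and serve only to show the calculation; they are not values from the present data.

```python
import math


def rmsd(estimates, reference):
    """Square root of the mean squared difference between two sets of estimates."""
    if len(estimates) != len(reference) or not estimates:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(e - r) ** 2 for e, r in zip(estimates, reference)]
    return math.sqrt(sum(squared) / len(squared))


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical ability estimates for five test takers.
reference_model = [-1.2, -0.4, 0.0, 0.5, 1.3]   # e.g., the best-fitting model
comparison_model = [-1.0, -0.5, 0.1, 0.4, 1.5]  # e.g., a worse-fitting model

print(f"RMSD = {rmsd(comparison_model, reference_model):.3f}")
print(f"r    = {pearson_r(comparison_model, reference_model):.3f}")
```

The two measures are complementary: the correlation reflects how well the rank order and relative spacing of test takers are preserved, whereas the RMSD also registers systematic shifts in the size of the estimates.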
Standard errors of estimates
Figure 4 displays the scatter plots of the standard errors associated with the item intercept, item slope and person ability estimates from the three models. Overall, the standard errors of item intercept estimates tend to be lower in the 2PL/GR and TRT models than those in the bi-factor model. They could be spuriously low, giving the false impression that the instrument has a higher precision of the parameter estimates. This effect is less pronounced for the slope parameters as most of the standard errors associated with the slope parameters are on the reference line in the range from 0 to 0.2, except that there is an outlier (a standard error of 0.8 estimated from the bi-factor model), which is far away from the bulk of the data.

Figure 4. Comparison of the standard errors from three models.
In terms of the standard errors associated with ability estimates, the standard 2PL/GR model displays much more variability than the bi-factor model, as can be seen from the much larger span of dots along the vertical axis, which depicts the standard errors of ability estimates from the standard 2PL/GR model, than along the horizontal axis, which depicts those from the bi-factor model. This indicates that the ability estimates obtained from the standard 2PL/GR model are less consistently precise across test takers. However, there is no consistent trend for the standard 2PL/GR model to yield spuriously high or low measurement error, as the dots are scattered more or less evenly on both sides of the reference line. In addition, compared to the bi-factor model, the estimated standard errors of ability estimates tend to be slightly higher in the TRT model, as can be seen from the larger number of dots below the reference line than above it in the last panel. These standard errors could be spuriously high, giving the false impression that the instrument has a low precision of ability estimates.
Discussion
Moreover, from the perspective of IRT modeling, the findings are compatible with those from previous research (DeMars, 2006; Li et al., 2006; Rijmen, 2010) that the bi-factor model exhibited better model–data fit than the TRT model in testlet-based reading assessment. Given the consistent evidence of the bi-factor model as a better-fitting model as well as the ease of running the bi-factor model in commercial software like TESTFACT and IRTPRO, it is surprising that it has not gained great popularity in testlet-based assessment, especially in language assessment. One plausible explanation is that the TRT model is typically introduced to handle the problem of local dependence of items within a testlet where the testlet effect is in general considered a nuisance factor, while the bi-factor model tends to be preferred when the secondary dimensions embody substantive interpretations (Jeon, Rijmen, & Rabe-Hesketh, 2012). The findings of the present study as well as other relevant studies (DeMars, 2006; Li et al., 2006), however, indicate that since the bi-factor model is applicable to the case of testlets from a technical perspective, its use as a viable solution to local dependence in testlet-based reading assessment should draw due attention from researchers in the field.
Another issue that needs to be discussed is the smaller magnitude of the slope parameter on the general trait dimension than on the specific dimension for all five items in the paragraph-reorganizing task. This suggests that, contrary to the test developers’ expectation, the factor on the secondary dimension, rather than the reading proficiency measured by the primary dimension, is the major factor affecting test-takers’ responses to this task. Therefore, caution should be exercised by test users when interpretations and decisions are to be made based on test-takers’ scores on the reading comprehension section of the GSEEE. The larger slope parameter estimates on the secondary dimension for the five items could also help to explain why the slope parameter estimates on the general trait dimension for these items are outliers in panel (a) but not in panel (b) of Figure 2. Both the TRT and bi-factor models compared in panel (b) take the secondary dimension into consideration, whereas the 2PL/GR model fails to capture it, leading to inflated item slope parameter estimates for items with a strong secondary dimension.
The good news is that the person ability estimates from the TRT and bi-factor models were found to be extremely highly correlated, indicating that our interpretation of test-takers’ reading proficiency will be essentially the same, no matter which multidimensional model is used. It can therefore be claimed that the specialized TRT model, though not as well-fitting as the bi-factor model, might still serve as a viable approach to accommodate local item dependence in testlet-based reading assessment when the ultimate goal is to provide scores to individual test takers. However, the fact that the TRT model might give the false impression of a slightly lower precision of ability estimates of the instrument also warrants attention when interpretations are being made.
Another issue that warrants further discussion is the scoring method typically used in practice. The reading comprehension section of the GSEEE, like most large-scale standardized reading assessments in China, is generally scored by simply summing the correctly answered items or, at most, by using unidimensional IRT models. The results of this study show that such simple scoring methods might not be appropriate because the confounding effect of passages, as evidenced in the present study, cannot be properly teased out. It can therefore be argued that in testlet-based reading assessment, the more appropriate scoring method might be to use the bi-factor or TRT model, under which test-takers’ reading proficiency can be estimated with the effect of passages on their performance partialed out.
What is particularly noteworthy is that there is one outlier among the standard errors associated with the slope parameters obtained from the bi-factor model (see Figure 4). Given the large sample size of the current study, such a large standard error is surprising. A closer examination reveals that this outlier is probably the cause of the relatively low correlation between the slope parameter estimates from the bi-factor and TRT models reported under Research Question 2. Specifically, the outlier among the standard errors of slope parameter estimates in Figure 4 corresponds to the outlier among the slope parameter estimates in Figure 2; both are estimates for Item 19, a multiple-choice item. If this outlier were removed from the data, the correlation between the slope parameter estimates from the bi-factor and TRT models would reach .95.
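How strongly a single discrepant item can depress a correlation computed over a small number of items can be illustrated with toy numbers. The slope values below are invented for illustration and are not the estimates from this study; the function name is likewise hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Invented slope estimates for six items under two models.
# The last item is deliberately discrepant between the two models.
slopes_model_1 = [0.60, 0.90, 1.20, 1.50, 1.80, 2.10]
slopes_model_2 = [0.65, 0.85, 1.25, 1.45, 1.85, 0.30]

r_all = pearson_r(slopes_model_1, slopes_model_2)
r_trimmed = pearson_r(slopes_model_1[:-1], slopes_model_2[:-1])

print(f"r with the discrepant item:    {r_all:.2f}")
print(f"r without the discrepant item: {r_trimmed:.2f}")
```

With only a few dozen items, as in the test analyzed here, one item that behaves very differently under the two models can account for much of the apparent disagreement between them.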
Lastly, the finding that the standard 2PL/GR model did not consistently yield spuriously high or low measurement errors conflicts with Eckes (2013), who reported that, compared to the TRT model, the 2PL model produced spuriously low measurement errors for the ability estimates of test takers in a testlet-based listening test. The divergent findings might be related to the different item types used in the two studies. For one thing, while the TestDaF listening paper analyzed in Eckes’s (2013) study consists of short-answer items and true/false items, the reading comprehension section of the GSEEE analyzed in the present study comprises multiple-choice items, paragraph-reorganizing items and translation items. With the true/false items in Eckes’s (2013) study, test takers are likely to benefit greatly from guessing, as the probability of getting an item correct without knowing anything is 50%; with the paragraph-reorganizing items in the present study, however, test takers might be penalized for a wrong guess as a result of the “item-chaining effect” (Wang et al., 2005). The differing effects of guessing on test-takers’ responses to different item types are not taken into consideration in either study, which may have led to the divergent findings concerning the measurement errors associated with the ability estimates. For another, as mentioned earlier, different item types might measure different constructs of reading proficiency (Kobayashi, 2002; Shohamy, 1984) and thus lead to local dependence among items of the same item type. The different degrees of local item dependence caused by the item type effect in the two studies are not taken into account, which might have contributed to the contradictory findings. Because very little research on local item dependence caused by item types can be located in the literature, more studies along this line are needed to explore the reasons for the unexpected difference.
Conclusions
The current study has investigated the effectiveness of two multidimensional IRT models in accommodating local item dependence in testlet-based reading assessment, as well as the impact of applying the worse-fitting model on item parameter estimates, person ability estimates and standard errors associated with these estimates. The findings show that the more complex bi-factor model serves as the best-fitting model to accommodate local dependence within testlets; however, it yields practically the same results as the TRT model in terms of item parameter estimates, person ability estimates and estimates of standard errors, so test developers and users should choose between them on the basis of their interpretability. In addition, it is found that the item slope parameter estimates, person ability estimates and standard errors of estimates are more susceptible to the choice of IRT model than the intercept parameter estimates. This study is of considerable importance both theoretically and practically. Theoretically, it provides additional evidence about the viability of utilizing the bi-factor model in testlet-based reading assessment where the unidimensionality and local independence assumptions are untenable. Practically, it is hoped that this study, by presenting the results and consequences of using different IRT models on item parameter estimates, person ability estimates, and standard errors of these estimates, may help to guide test developers and users to choose the model that not only minimizes estimation inaccuracies but also best satisfies their needs based on available resources.
Despite the contribution of this research, there are three limitations that warrant further investigation. First, the item type and passage effects might be confounded to some extent. For one thing, the reading comprehension section of the GSEEE analyzed in the present study includes only one passage for the paragraph-reorganizing task and one passage for the translation task, making it difficult to judge whether it is the item type or the passage content that causes variability in testlet effects; for another, the use of paragraph-reorganizing items might to some extent have guaranteed the large overall testlet effect found in the present study. Further research is therefore warranted either to separate these effects by using tests with a better balance of item types across passages or to concentrate on more typical passage-based reading items. Future research on the effect of the subskill factor is also needed, as the presence of different subskills measured by different items constitutes a potential source of local item dependence in language assessment. Second, the present study did not take advantage of the Bayesian framework, which is one major contribution of the TRT model. The Bayesian framework was not used because the Bock-Aitkin marginal maximum likelihood estimation method 3 was adopted for the estimation of the three unidimensional IRT models and the multidimensional bi-factor model; to ensure that differences in item parameter and person ability estimates across models could be interpreted meaningfully, the same estimation method had to be used for the TRT model. However, follow-up studies that adopt the Bayesian framework when comparing different multidimensional IRT models would help to present a more comprehensive picture of the pros and cons of different models.
Finally, in this study only the testlet-based items of the reading comprehension section of the GSEEE were analyzed. No endeavors have been made to analyze other testlet-based components of the GSEEE, such as the cloze test in Part I, together with the testlet-based reading comprehension section. The main reason is that the two multidimensional models used in the present study, both the TRT model and the bi-factor model, stipulate that there could be only one general trait factor, making it impossible to simultaneously analyze more components of the GSEEE. A generalization of the bi-factor model, the two-tier, full-information item factor analysis model (Cai, 2010), however, relaxes the sole general trait factor restriction to the extent that multiple general trait factors can not only exist but also be allowed to correlate with each other. Therefore, a promising avenue for future research might be the application of the more flexible two-tier model to testlet-based language assessment to provide a more accurate global statement of test-takers’ language proficiency.
Footnotes
Funding
The study reported here is sponsored by two National Social Science Foundation projects [10BYY092 and 11&ZD188]. The preparation of this manuscript is supported by a grant [2013Z77] from Zhejiang Provincial Federation of Social Sciences. An earlier version of this paper was presented at the 2012 Language Testing Research Colloquium, Educational Testing Service, Princeton, USA. We are grateful to the conference attendees who gave very helpful comments that helped to form the final version of this paper and we would also like to express our gratitude to the anonymous reviewers for their constructive and insightful comments and suggestions.
