Abstract
Testlets are subsets of test items that are based on the same stimulus and are administered together. Tests that contain testlets are in widespread use in language testing, but they also share a fundamental problem: Items within a testlet are locally dependent, with possibly adverse consequences for test score interpretation and use. Building on testlet response theory (Wainer, Bradlow, & Wang, 2007), the listening section of the Test of German as a Foreign Language (TestDaF) was analyzed to determine whether, and to what extent, testlet effects were present. Three listening passages (i.e., three testlets) with 8, 10, and 7 items, respectively, were analyzed using a two-parameter logistic testlet response model. The data came from two live exams administered in April 2010 (N = 2859) and November 2010 (N = 2214). Results indicated moderate effects for one testlet, and small effects for the other two testlets. Neglecting these testlet effects, as in a standard IRT analysis, led to an overestimation of test reliability and an underestimation of the standard error of ability estimates. Item difficulty and item discrimination estimates remained largely unaffected. Implications for the analysis and evaluation of testlet-based tests are discussed.
Language tests often contain sets of items that are based on the same input or stimulus and are administered together. A typical example is a reading or listening passage that is followed by a small number of items designed to assess an examinee’s ability to comprehend central messages of that passage. Another example is gap-filling tests such as C-tests, where the items (gaps) are embedded in the passage itself (Eckes & Grotjahn, 2006; Klein-Braley, 1997). Such sets of items have been variously called item bundles (Rosenbaum, 1988; Wilson & Adams, 1995), context-dependent item sets (Haladyna, 1992, 2004; Keller, Swaminathan, & Sireci, 2003), subtests (Andrich, 1985), superitems (Cureton, 1965), or testlets (Wainer & Kiely, 1987). Introduced within the context of computer-adaptive testing, the term testlet has gained wide acceptance in the field of language testing and beyond, and will be used for the rest of this article.
Testlets have a number of appealing features regarding test development and administration, but they also pose potentially severe psychometric problems that can have adverse consequences for test score interpretation and use (e.g., biased parameter estimation, faulty examinee classification, test-equating errors). In the present research, the focus was on an approach that has been specifically developed for the analysis of testlet-based tests—testlet response theory (TRT; Wainer, Bradlow, & Wang, 2007). The TRT measurement approach was applied to examinee responses in the listening section of the Test of German as a Foreign Language (Test Deutsch als Fremdsprache; TestDaF). This section consisted of 25 items distributed over three testlets. 1
The major research question asked whether, and to what extent, responses to listening items would be subject to testlet effects, that is, subject to the influence of unintended, secondary factors that are not part of the construct being measured by the test as a whole. An additional purpose of the study was to examine what happened to parameter estimation and overall test evaluation if testlet effects were not taken into account. To this end, the TRT modeling approach was compared to the use of two more traditional approaches. The first approach simply ignored the testlet structure, viewing all listening items as independent items; the second, more elaborate approach construed each testlet as a polytomous item. Throughout the analyses, listening data from two live examinations provided the input for estimating model parameters.
Testlets and the assumption of local item independence
According to Wainer, Bradlow, and Du (2000), one of the main reasons for the development of testlet-based tests has been “to reduce concerns about the atomistic nature of single independent small items” (p. 246). These concerns have been particularly raised against multiple-choice tests where lists of discrete items are presented in a largely decontextualized manner. Testlets are well-suited for providing more contextual information, or more complex input, designed to tap into various facets of higher-level knowledge, skills, or abilities. Thus, when testing communicative language ability, which is commonly conceived of as being composed of a number of interrelated lower-level skills, competencies, or functions (e.g., Bachman, 1990; Bachman & Palmer, 2010; Hulstijn, 2011), testlets may serve to yield a much better representation of the construct being measured, thereby increasing the validity of the inferences drawn from test results.
On a more technical note, testlets are advantageous in terms of efficiency of item writing and test administration. With more contextual information carefully embedded within a stimulus or input, it seems obvious to develop not just one item but a set of items referring to different aspects of that input. Moreover, as Wainer et al. (2000) pointed out, “a substantial part of an examinee’s time is spent processing the information contained in the stimulus material” (p. 246). As a consequence, when examinees can use one and the same amount of input information for responding to a set of related items, testlets save testing time. Finally, referring to the original motivation of introducing the testlet concept, testlets lead to a reduction of context effects in adaptive testing, where each item’s context typically differs from examinee to examinee; that is, forming fixed context–item units, testlets diminish the likelihood of differential context effects arising from factors such as item location, cross-information, or unbalanced content (Wainer et al., 2007).
Attractive as testlets may be, items nested within testlets violate a central assumption of standard item response theory (IRT) models: the assumption of local, or conditional, item independence. This assumption implies that a person’s response to an item does not affect the probability of the person’s response to another item. Put differently, the joint probability of two item responses is equal to the product of the individual probabilities of the two item responses, conditional on the latent trait being measured (Hambleton & Swaminathan, 1985; Henning, 1989; Yen & Fitzpatrick, 2006). For example, consider two test items i and j, where $x_i = 1$ and $x_j = 1$ denote a correct answer to item i and item j, respectively. Then, the probability of answering both items correctly, given the ability level $\theta$, can be computed by multiplying the individual response probabilities:
$$P(x_i = 1, x_j = 1 \mid \theta) = P(x_i = 1 \mid \theta)\, P(x_j = 1 \mid \theta) \tag{1}$$
Because items within a testlet are associated with the same stimulus, examinee responses to these items are likely to be locally dependent. That is, responses to items within a testlet tend to be interrelated even when the latent trait is taken into account, and Equation 1 no longer holds.
Marais and Andrich (2008) distinguished two generic ways in which local item dependence (LID) may come about. First, there may be secondary traits or dimensions that have an impact on responses in addition to the trait being measured. This source of LID, which Marais and Andrich called trait dependence, is particularly likely for items that are nested within testlets. Second, the response to a given item may affect the response to a subsequent item. This so-called response dependence can be present irrespective of the use of testlets. 2
Among the factors that may contribute to trait dependence are examinee differences in specific background knowledge or skills, differences in motivation or attention, ambiguous or misleading contextual information provided in the input, and many others (e.g., Yen, 1993; Yen & Fitzpatrick, 2006). For example, examinees who experience particular difficulty in understanding critical expressions of a listening passage because they lack stimulus-related background knowledge will tend to answer many, if not all, items relating to the listening stimulus incorrectly; conversely, examinees with much the same level of listening ability who are lucky to have the knowledge that the former do not have, will tend to get many items within the testlet correct.
Given that testlets provide a natural format for many language tests, it is generally not an option for test developers to eliminate or replace items that manifest strong LID, because each item within a testlet is typically designed to contribute a unique piece of construct-relevant evidence. Rather, a reasonable approach would aim to make sure that testlet-based LID does not have undue influence on the meaning and interpretation of test scores.
Psychometric approaches
The traditional approach
With very few exceptions, some of which will be discussed later, the field of language testing has been dominated by the traditional approach to addressing testlet effects. Under this approach, each item that belongs to a testlet is considered as a single, independent item and is scored and analyzed as such. In a typical analysis, standard IRT models are used that ignore the testlet structure of the test under study. Of course, this implies that the assumption of local item independence holds for the complete set of items. In some instances, a statistical indicator of the extent to which LID is present in a testlet-based test is computed to check this assumption. A frequently used statistic is Yen’s (1984) Q3 index. This index is defined as the correlation between the residuals of two items, where each residual is the difference between an examinee’s observed item response and the response expected based on the assumptions of the IRT model (e.g., Chen & Wang, 2007; Chen & Thissen, 1997; Kim, de Ayala, Ferdous, & Nering, 2007; Lee, 2004).
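The definition of Q3 can be made concrete with a short Python sketch that computes the statistic for one item pair under a 2-PL model. The function names, item parameters, and toy data are illustrative assumptions, not values from the TestDaF analysis; in addition, abilities are treated as known here, whereas in practice they would be estimated.

```python
import math

def p_2pl(theta, a, b):
    """Probability of a correct response under the 2-PL IRT model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
    sxx = sum((u - mx) ** 2 for u in x)
    syy = sum((v - my) ** 2 for v in y)
    return sxy / math.sqrt(sxx * syy)

def q3(responses_i, responses_j, thetas, item_i, item_j):
    """Yen's Q3: correlation between the residuals of two items.

    Each residual is the observed response (0/1) minus the model-expected
    probability of a correct response at the examinee's ability.
    """
    res_i = [x - p_2pl(t, *item_i) for x, t in zip(responses_i, thetas)]
    res_j = [x - p_2pl(t, *item_j) for x, t in zip(responses_j, thetas)]
    return pearson(res_i, res_j)

# Hypothetical toy data: six examinees, two items given as (a, b) pairs.
thetas = [-1.5, -0.5, 0.0, 0.4, 1.0, 1.8]
x_i = [0, 0, 1, 0, 1, 1]
x_j = [0, 1, 1, 0, 1, 1]
value = q3(x_i, x_j, thetas, item_i=(1.2, 0.0), item_j=(0.9, -0.3))
```

Applied to every pair of items, a function of this kind would yield the full set of residual correlations that is typically inspected for signs of LID.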
Standard psychometric models used to analyze testlet-based tests most often are unidimensional Rasch models or two- or three-parameter IRT models, which estimate not only person ability and item difficulty, but also item discrimination and an additional guessing parameter (e.g., de Ayala, 2009; Yen & Fitzpatrick, 2006). As a case in point, consider international comparative large-scale assessments like PISA (Programme for International Student Assessment) or PIRLS (Progress in International Reading Literacy Study). These assessments basically rely on testlet-based tests analyzed in a traditional way using Rasch or IRT models (e.g., Monseur, Baye, Lafontaine, & Quittre, 2011; Chang & Wang, 2010; for an overview, see Wendt, Bos, & Goy, 2011).
However, it has been repeatedly shown that ignoring the dependence structure caused by administering testlet-based tests can lead to the following: (a) overestimation of the precision of person ability estimates; (b) overestimation of test reliability and test information; (c) underestimation of the standard errors of parameter estimates; (d) biased item parameter estimates, in particular biased item difficulties and item discriminations; and (e) inappropriate test equating (e.g., Chen & Thissen, 1997; Chen & Wang, 2007; Sireci, Thissen, & Wainer, 1991; Thissen, Steinberg, & Mooney, 1989; Tuerlinckx & De Boeck, 2001; Wainer & Wang, 2000; Wang & Wilson, 2005b; Yen, 1984, 1993; Zhang, Shen, & Cannady, 2010).
Taken together, these results suggest that testlet effects, if not treated in an adequate way, can lead to severe errors in judging the psychometric quality of tests with possibly adverse consequences for test score interpretation and use. Specifically, LID that is due to testlets may entail inaccurate inferences about the ability level of examinees, which in turn raises the likelihood of misclassifying examinees (e.g., Sireci et al., 1991; Yen, 1993). A recent study by Zhang (2010) provided clear evidence on this issue. Investigating the classification accuracy based on scores from a large-scale EFL certification test that employed testlets in its listening, reading, and cloze sections, Zhang found that all testlets were associated with strong effects. More importantly, these testlet effects had a pronounced impact on the classification of examinee ability. Zhang concluded that “using the IRT model would give test users a wrong impression of how many classification errors may have been committed” (p. 136), and recommended instead using the TRT model for purposes of language ability classification.
Quite a different approach to the analysis of testlet-based tests starts with defining each testlet as a superitem or polytomous item. Yet, the polytomous-items approach still rests on standard IRT models, as discussed next.
Score- and item-based approaches
Following Wilson and Adams (1995), methods for analyzing testlets can be classified into score-based approaches and item-based approaches. Adopting a score-based approach implies treating each testlet as a single superitem, scoring it polytomously, and applying a polytomous IRT or Rasch model. That is, item scores are summed across items nested within a testlet such that identical total testlet scores are assigned to the same category of the resulting superitem. Note that the scores of this superitem can also be considered as categories of a rating scale, with as many categories as there are potentially different sum scores (Cook, Dodd, & Fitzpatrick, 1999; Sireci et al., 1991; Thissen et al., 1989). Thus, if a testlet contains five dichotomously scored items, the polytomous scoring of the testlet yields a superitem (or rating scale) with six categories ranging from 0 to 5. C-tests form a class of language tests where this modeling approach has been increasingly used (e.g., Eckes, 2007, 2011; Eckes & Grotjahn, 2006; Lee-Ellis, 2009).
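The polytomous scoring of testlets can be sketched in a few lines of Python. The function name and the two response patterns are hypothetical and serve only to illustrate how item scores are summed within each testlet.

```python
def superitem_scores(item_scores, testlet_sizes):
    """Sum dichotomous item scores within each testlet.

    item_scores: list of 0/1 scores for one examinee, ordered by testlet.
    testlet_sizes: number of items in each testlet, in the same order.
    Returns one sum score per testlet (the polytomous superitem score).
    """
    if len(item_scores) != sum(testlet_sizes):
        raise ValueError("number of item scores must match testlet sizes")
    scores, start = [], 0
    for size in testlet_sizes:
        scores.append(sum(item_scores[start:start + size]))
        start += size
    return scores

# Two hypothetical examinees on a test with testlets of five and three items.
pattern_a = [1, 1, 0, 0, 1, 0, 1, 1]
pattern_b = [0, 1, 1, 1, 0, 1, 1, 0]
print(superitem_scores(pattern_a, [5, 3]))  # [3, 2]
print(superitem_scores(pattern_b, [5, 3]))  # [3, 2]
```

Both hypothetical response patterns receive the same superitem scores; the sum score does not retain which items within a testlet were answered correctly.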
A score-based modeling approach maintains local independence across testlets while eliminating item dependencies within testlets (Rosenbaum, 1988). However, one shortcoming of this approach is a loss of information by not taking into account the precise pattern of examinee responses to individual items within a testlet; that is, information about differences in response patterns that result in the same sum score is not preserved. 3 Moreover, when items within a testlet are highly interrelated, with a fairly low overall proportion of independent items within the entire test, polytomous Rasch or IRT models tend to produce biased parameter values and inflated reliability estimates (Wainer, 1995; Wainer & Wang, 2000; Wang & Wilson, 2005b, 2005c; Yen, 1993). Finally, score-based approaches are inappropriate if the test is administered using a traditional adaptive testing format.
Item-based approaches avoid these shortcomings. They build on examinee responses to individual items within testlets as the unit of analysis rather than on total testlet scores. Information on individual response patterns is thus preserved. Importantly, item-based approaches explicitly account for local dependency between items by adding specific model parameters. Most of these models can be considered extensions of standard Rasch or IRT models. However, item-based approaches differ strongly in the precise way in which they accomplish this extension. In recent years, increasingly sophisticated models have been developed, including the bi-factor model (Gibbons & Hedeker, 1992; Reise, 2012), the Rasch testlet model (Wang & Wilson, 2005c), the multilevel testlet model (Jiao, Kamata, Wang, & Jin, 2012; Jiao, Wang, & Kamata, 2005), marginal IRT models (Braeken, 2011; Braeken, Tuerlinckx & De Boeck, 2007), and the family of TRT models (Wainer et al., 2007). Formal relations between some of these models have also been investigated (DeMars, 2006, 2012; Li, Bolt, & Fu, 2006; Rijmen, 2010; Wang & Wilson, 2005c).
TRT modeling approaches to the analysis of testlet-based tests have become more widely used recently, including applications to large-scale language tests (e.g., Chang & Wang, 2010; DeMars, 2006; Wainer & Wang, 2000; Wainer et al., 2000; Zhang, 2010). The present study also draws on this kind of approach, employing the specific TRT model presented below.
Testlet response theory (Wainer et al., 2007)
To account for LID associated with items nested within a testlet, Bradlow, Wainer, and Wang (1999; Wainer et al., 2007) advanced an extension of the standard two-parameter (2-PL) IRT model (Birnbaum, 1968). The 2-PL TRT model includes a random-effects parameter representing the interaction of person n with testlet d(i), the testlet that contains item i. In this model, the probability of a correct response to an item i nested in testlet d(i) for a person n with ability $\theta_n$ is given by:
$$P(y_{ni} = 1) = \frac{\exp\left[a_i\left(\theta_n - b_i - \gamma_{nd(i)}\right)\right]}{1 + \exp\left[a_i\left(\theta_n - b_i - \gamma_{nd(i)}\right)\right]} \tag{2}$$
where $y_{ni}$ denotes the scored response of person n to item i, $a_i$ and $b_i$ are the item discrimination and difficulty parameters, respectively, and $\gamma_{nd(i)}$ is the testlet effect parameter for person n on testlet d(i).
The key feature of the 2-PL TRT model is the introduction of a random-effects parameter $\gamma_{nd(i)}$, which can be interpreted as a person-specific testlet effect. That is, this parameter is the same for all items nested within a testlet for a particular person n, but it may differ across testlets, with as many testlet effects $\gamma_{nd(i)}$ as there are testlets within the test (Wainer et al., 2007; Wang & Wilson, 2005b).
When there is no testlet effect (i.e., $\gamma_{nd(i)} = 0$), the model reduces to the standard 2-PL IRT model where local item independence is assumed to hold. Testlet-based LID manifests itself through the testlet effect variance $\sigma^2_{\gamma_{d(i)}}$: the larger this variance, the higher the degree of local dependence among the items nested within testlet d(i).
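The role of the testlet effect parameter can be illustrated with a minimal Python sketch. It assumes the common parameterization of the 2-PL TRT model in which the logit equals a_i(θ_n − b_i − γ_nd(i)); the function name and all numerical values are hypothetical.

```python
import math

def p_trt(theta, a, b, gamma=0.0):
    """Correct-response probability under an assumed 2-PL TRT form.

    The logit is a * (theta - b - gamma), where gamma is the person-specific
    effect of the testlet containing the item. With gamma = 0 this is the
    standard 2-PL IRT model.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b - gamma)))

# Hypothetical item and examinee.
a, b, theta = 1.2, 0.3, 0.5
p_no_effect = p_trt(theta, a, b)            # standard 2-PL probability
p_hindered = p_trt(theta, a, b, gamma=0.8)  # positive gamma lowers the logit
p_helped = p_trt(theta, a, b, gamma=-0.8)   # negative gamma raises the logit
```

Because the same value of gamma enters the logit of every item in a testlet, examinees with equal ability but different testlet effects will differ systematically across all items of that testlet, which is how the model represents local dependence.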
The 2-PL TRT model has been extended to accommodate polytomous items as well as tests composed of a mixture of dichotomous or polytomous testlet items and independent items (Wainer et al., 2007; Wang, Bradlow, & Wainer, 2002). Moreover, Wainer et al. (2000) suggested a 3-PL TRT model, based on Birnbaum’s (1968) 3-PL IRT model, where a guessing parameter is added.
There have been a number of applications of the TRT modeling approach to large-scale language tests. For example, Wainer and Wang (2000) analyzed the reading and listening sections of the Test of English as a Foreign Language (TOEFL) using the 3-PL TRT model. They found that testlet-associated LID did not affect the estimation of difficulty parameters but resulted in biased estimation of guessing and discrimination parameters. Wainer and Wang also demonstrated that when local dependence was ignored, test information was overestimated by as much as 15% for some ability levels. Chang and Wang (2010) examined testlet effects in the PIRLS 2006 assessment using the 3-PL TRT model. Findings showed that LID had a negligible effect on item difficulty estimates. However, item discriminations and the precision of examinee proficiency measures were overestimated. The authors also reported difficulties in estimating guessing parameters. As noted before, Zhang (2010) studied the accuracy of EFL ability classification under different measurement models, including a standard 3-PL IRT model and its TRT counterpart. These models yielded highly similar ability estimates, but the standard errors of the TRT-based ability estimates were substantially larger than those provided by the standard IRT model, erroneously suggesting higher measurement precision in the case of IRT-based estimates.
Research questions
The present study aimed to analyze the testlet-based dependency structure of the TestDaF listening section. This section measured the examinee’s ability to understand, and respond adequately to, spoken texts relevant to academic life. There were three listening passages, containing 8, 10, and 7 items, respectively.
The listening passages, along with the associated items, defined the testlets to be analyzed. The central question studied here was whether, and to what extent, each of the three testlets violated the assumption of local item independence, and what influence this violation had on parameter estimation and test reliability. More precisely, the research questions were as follows:
1. To what extent do the listening passages show testlet effects? Specifically, what is the size of the variance of the testlet effects estimated for each passage? Since the size of the testlet effect variance is an indicator of the degree of LID, answering this question should inform a decision on the appropriateness of alternative modeling approaches, such as polytomous modeling or ignoring the LID issue altogether by treating each of the 25 items as an independent item.
2. How do measurement results obtained using the TRT model compare to the results obtained using standard IRT modeling approaches? Two alternatives were considered: (a) an independent-items model that ignored the testlet structure of the listening test; and (b) a polytomous-items model with item scores summed into a single score for each listening passage. Model comparisons specifically referred to estimates of person ability, item difficulty, and item discrimination, as well as to the magnitude of the corresponding standard errors and reliability estimates.
Method
Participants
Participants were from two independent samples of TestDaF examinees. The first sample comprised 2859 examinees taking the TestDaF in April 2010 (1855 females, 1004 males); the second sample comprised 2214 examinees taking the TestDaF in November 2010 (1429 females, 785 males). All examinees were foreign students applying for entry to an institution for higher education in Germany.
In the April 2010 exam, there were 262 test centres involved (130 centres in Germany, 132 centres in 61 foreign countries). Considering the country of origin, the largest number of participants in the April exam was from Russia, the People’s Republic of China, Ukraine, Bulgaria, and the Republic of Korea. In the November 2010 exam, there were 279 test centres involved (133 centres in Germany, 146 centres in 60 foreign countries). In terms of number of participants, the same set of countries of origin as in the April exam ranked highest in this exam (with Ukraine and Bulgaria changing places).
Instruments and procedure
The TestDaF measures the four language skills in separate sections (reading, listening, writing, and speaking). Examinee performance in each section is related to one of three levels of language proficiency, TDN 3, TDN 4, and TDN 5 (TestDaF-Niveaustufen, TestDaF levels). The TDNs cover the Council of Europe’s (2001) Lower Vantage Level (B2.1) to Higher Effective Operational Proficiency (C1.2); that is, the test measures German language proficiency at an intermediate to high level (for a detailed definition of each of these levels, see www.testdaf.de; see also Gesellschaft für Akademische Studienvorbereitung und Testentwicklung, 2012; Kecker, 2011; Kecker & Eckes, 2010). The TestDaF is officially recognized as a language entry exam for students from abroad. Examinees who have achieved at least TDN 4 in each section are eligible for admission to a German institution of higher education (Eckes et al., 2005).
As already mentioned, the listening section (duration: 40 mins.) measures the examinee’s ability to understand spoken texts related thematically and linguistically to the field of higher education. The three listening texts refer to the following communicative situations: (a) a dialogue typical of everyday life at university (Testlet 1; 8 items); (b) a radio interview with three or four speakers (Testlet 2; 10 items); and (c) a short lecture or an interview with an expert (Testlet 3; 7 items). Testlets are presented to examinees in the order of increasing levels of difficulty, as judged, for example, by the degree of abstraction or informational density, the complexity of sentence structures, and the number of words within each listening text (350 to 400 words for Text 1, 550 to 580 words for Text 2, 580 to 620 words for Text 3).
Across testlets, examinees are required to demonstrate comprehension of context and detail as well as implicit information. For this purpose, two types of items are used, that is, short-answer questions (Testlet 1, Testlet 3) and true/false questions (Testlet 2). Item responses are scored either correct or incorrect. Scoring of short-answer questions is provided by expert raters using a predefined list of minimally required keywords.
Different forms of the listening section were administered in the April and November exams. The total number of items in each section was 25; that is, there were no independent items. At the test’s website (www.testdaf.de), two full sample test versions, including the type of listening section considered here, are available for close inspection (the sample tests are called “Modellsatz 02” and “Modellsatz 03”, respectively).
Data analysis
In the present study, the data represented a mixture of so-called 2-PL dichotomous items, where it could be assumed that no guessing factor was present (i.e., short-answer questions belonging to Testlets 1 and 3), and 3-PL dichotomous items, where guessing may have occurred due to the true/false format of Testlet 2 items. When this mixture of item types was built into the specifications of the computer program SCORIGHT (Version 3.0; Wang, Bradlow, & Wainer, 2005) used to estimate TRT model parameters, the estimation process failed to reach convergence even after 10,000 iterations. Therefore, it was decided to treat all listening items as 2-PL items and to use the 2-PL TRT model shown in Equation 2 instead of the 3-PL TRT model. 5
Three different 2-PL measurement models were fitted to the listening data: (a) the TRT model (Wainer et al., 2007); (b) an independent-items IRT model (Birnbaum, 1968); and (c) the graded response model (GRM; Samejima, 1969), representing the score-based, polytomous-items IRT approach.
For purposes of direct comparisons between measurement results, SCORIGHT was used to estimate parameters for all three models. Note that the current version of this program can only handle items with a number of response categories equal to or less than nine, which was exceeded by the 11 categories resulting for Testlet 2 (containing 10 true/false items). Moreover, for Testlet 1 the first two categories (i.e., sum scores equal to 0 and 1, respectively) were much less frequently observed than the others, as were the first four categories of Testlet 2. To make the data suitable for analysis, and to facilitate comparisons between testlets, the number of categories was reduced to eight by collapsing across categories 0 and 1 for Testlet 1 and across categories 0 to 3 for Testlet 2; Testlet 3 was left at the original set of eight categories.
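The collapsing of sum-score categories described above can be checked with a short Python sketch. The function name is hypothetical, and the recoding of the collapsed scores to consecutive integers starting at 0 is an assumption made here for illustration.

```python
def collapse(sum_score, lowest_kept):
    """Collapse all sum scores at or below `lowest_kept` into category 0.

    Higher sum scores are shifted down so that categories stay consecutive.
    """
    return max(sum_score - lowest_kept, 0)

# Highest sum score merged into the bottom category, per testlet:
# 0-1 merged for Testlet 1, 0-3 merged for Testlet 2, none for Testlet 3.
lowest_kept = {1: 1, 2: 3, 3: 0}
n_items = {1: 8, 2: 10, 3: 7}

n_categories = {
    t: len({collapse(s, lowest_kept[t]) for s in range(n_items[t] + 1)})
    for t in n_items
}
print(n_categories)  # {1: 8, 2: 8, 3: 8}
```

After collapsing, each of the three superitems has eight categories, which is within the nine-category limit mentioned above.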
SCORIGHT employs Bayesian estimation techniques to estimate model parameters. To facilitate parameter estimation, Bayesian methods involve modifying the likelihood function to incorporate any prior information that is known about model parameters (for an introduction to Bayesian data analysis, see Fox, 2010; Jackman, 2009; Kruschke, 2011). Thus, the distributions of all parameters are assumed to be normal with a mean of zero (for details of the priors, see Bradlow et al., 1999). For model identification purposes, the mean of the ability distribution is set to 0 with variance equal to 1. Inferences for unknown parameters under the Bayesian testlet model are obtained by drawing samples from the joint posterior distribution using Markov chain Monte Carlo (MCMC) techniques (Kim & Bolt, 2007; Wainer et al., 2007; Wang et al., 2005). Thus, the posterior mean of each examinee’s ability distribution can be used as a point estimate of θ, also called an expected a-posteriori (EAP) ability estimate; similarly, the posterior standard deviation can be used as an estimate of the standard (or model) error associated with each EAP ability estimate.
In the present analysis, five chains were run to assess convergence of the posterior distribution of each model parameter. For each chain, the number of iterations was set at 4000, with the first 3000 iterations as burn-in; that is, the draws after 3000 iterations were used for inference purposes. To reduce the autocorrelation effect, the gap between posterior draws was set at 10; that is, every 11th posterior draw was recorded. These specifications were the same for all three models. Examination of chains indicated that each of the analyses converged without problems as the potential scale reduction factors were close to 1.0 (Wang et al., 2005).
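The potential scale reduction factor compares the variability between chains with the variability within chains. The Python sketch below implements the basic Gelman-Rubin formula on simulated draws; whether SCORIGHT computes its diagnostic in exactly this way is an assumption, and the function name and simulated chains are purely illustrative.

```python
import random
import statistics

def psrf(chains):
    """Basic Gelman-Rubin potential scale reduction factor.

    chains: list of equally long lists of retained draws for one parameter.
    """
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean([statistics.variance(c) for c in chains])
    between = n * statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5

# Simulated stand-in for five chains sampling the same posterior.
rng = random.Random(1)
chains = [[rng.gauss(0.3, 0.1) for _ in range(100)] for _ in range(5)]
r_hat = psrf(chains)

# Chains centered at clearly different values, mimicking non-convergence.
shifted = [[x + k for x in c] for k, c in enumerate(chains)]
r_hat_shifted = psrf(shifted)
```

Values close to 1.0 indicate that the chains are sampling from essentially the same distribution, whereas chains that have settled in different regions yield values well above 1.0.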
Results
The results of the analysis based on the TRT model are presented first, focusing on the variance of testlet effects as an indicator of the extent to which each of the three listening passages was subject to LID. Then, results for the person ability estimates are shown, comparing the TRT-based estimates with those of the other two IRT modeling approaches, the independent-items and the polytomous-items models, respectively; in addition, various statistics are given that indicate the precision with which ability parameters were estimated by each model. Finally, estimates of item discrimination and item difficulty are presented, comparing results for the TRT and independent-items modeling approaches.
Testlet effects
First, some more general comments on the size of testlet effects are in order. Recall that the variance of testlet effects indicates the degree of local dependence among items of a given testlet. If the testlet effect variance is zero, there is no local dependence. The more this variance exceeds zero, the higher the degree of local dependence. Note also that, as discussed previously, due to the normalization of the ability distribution the variance of testlet effects is on the same scale as the variance of examinee ability estimates, which equals 1. Thus, a testlet effect variance of 0.50 is half the variance of the ability estimates (Wang et al., 2002).
Beyond these basic statistical considerations, commonly accepted criteria for judging the size of an estimated testlet variance are lacking. In this situation, a tentative interpretation may be achieved by referring to two kinds of research evidence: (a) simulation studies, showing that variances below 0.25 can generally be considered negligibly small (Glas, Wainer, & Bradlow, 2000; Wang et al., 2002; Wang & Wilson, 2005c; Zhang et al., 2010); and (b) empirical case studies of language and related tests, which found substantial testlet effects that ranged from 0.50 to 2.00 and higher (e.g., Wainer et al., 2000; Wang et al., 2002; Zhang, 2010). For example, Zhang (2010) reported strong effects in a study involving a Cloze test, where the estimated testlet effect variance was as high as 1.43.
For each of the two TestDaF exams considered here, Table 1 shows the estimated values of the testlet effect variance for the three listening passages.
Table 1. Testlet statistics for the TestDaF listening section with three passages.
Note: SE = standard error.
Judged by any of the available criteria or guidelines, the testlet effects for Testlet 1 and Testlet 3 can be considered small. As can be seen, the testlet effect variance for Testlet 2 was somewhat larger, but still below 0.50. The associated standard errors show that each testlet effect variance was estimated with a high degree of precision. This pattern of findings held true for both exams.
The heightened testlet effect for Testlet 2 was also reflected in the size of the residual correlations between listening items. As indicated by the Q3 statistic (Yen, 1984), out of a total of 300 residual correlations between the 25 items on the listening section, the four highest correlations with values ranging from .14 to .18 (April exam) and from .12 to .29 (November exam) involved items nested within Testlet 2 only. These values were larger than the expected value of Q3 assuming local item independence, which would be approximately −.04 (Yen, 1993). Analysis of item content revealed that most of these locally dependent items referred to statements made by the same speaker addressing closely related aspects (presented in an interview situation comprising three or four speakers).
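The two reference figures in this paragraph follow from the number of items alone, as the brief Python check below shows. It assumes the usual approximation that the expected value of Q3 under local independence is −1/(n − 1) for a test of n items.

```python
from math import comb

n_items = 25
n_pairs = comb(n_items, 2)        # number of distinct item pairs
expected_q3 = -1 / (n_items - 1)  # approximate expectation under local independence

print(n_pairs)                # 300
print(round(expected_q3, 3))  # -0.042
```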
Person ability estimates
The precision with which each of the three models estimated the person ability parameters was studied first. Based on the EAP ability estimates, the root mean-square measurement error (RMSE) and the person separation reliability (R) were computed, according to the formulas suggested by Wright and Masters (1982). That is, the RMSE was computed by taking the square root of the average of the posterior error variances (the mean-square error, MSE); the person separation reliability R was obtained by subtracting the MSE from the observed variance of EAP ability estimates, and then dividing the result by the observed variance of EAP ability estimates. Table 2 presents the results along with some other summary statistics.
Table 2. Summary statistics for person ability estimates in the TestDaF listening section using the testlet response model, the independent-items model, and the polytomous-items model.
Note: All models were 2-PL models estimated using a fully Bayesian MCMC approach. RMSE = root mean-square estimation error. R = person separation reliability.
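The RMSE and person separation reliability described above can be sketched as follows. This is a minimal illustration of the verbal definitions, not the code used in the study; the use of the population variance for the observed variance of the EAP estimates is an assumption, and the input values are made up purely for demonstration.

```python
from math import sqrt
from statistics import fmean, pvariance


def mean_square_error(posterior_sds):
    """MSE: average of the posterior error variances (squared posterior SDs)."""
    return fmean([sd ** 2 for sd in posterior_sds])


def rmse(posterior_sds):
    """Root mean-square error: square root of the MSE."""
    return sqrt(mean_square_error(posterior_sds))


def separation_reliability(eap_estimates, posterior_sds):
    """R = (observed variance of EAP estimates - MSE) / observed variance.

    Assumption: the observed variance is taken as the population variance.
    """
    observed_var = pvariance(eap_estimates)
    return (observed_var - mean_square_error(posterior_sds)) / observed_var


# Hypothetical EAP estimates and posterior SDs for three examinees.
eap = [-1.0, 0.0, 1.0]
sds = [0.4, 0.4, 0.4]
print(round(rmse(sds), 2))                         # 0.4
print(round(separation_reliability(eap, sds), 2))  # 0.76
```

In this toy case the observed variance is 2/3 and the MSE is 0.16, so R = (0.667 − 0.16)/0.667 ≈ .76.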
For both exams, the analysis based on the polytomous-items model yielded the lowest estimate of overall measurement error and the highest estimate of person separation reliability. The corresponding estimates for the testlet response model suggested a considerably lower measurement precision. As indicated by the SD values, the estimates from the polytomous-items model showed a somewhat larger variation than the estimates from the other two models (regarding the zero means, remember that the mean of the ability distribution was fixed at that value for estimation purposes).
Next, correlations between each of the models’ estimates of person ability were computed. In addition, three other statistical indicators describing the correspondence between ability estimations were used: (a) the mean difference in ability estimates (MD); (b) the mean absolute difference (MAD); and (c) the root mean-square difference (RMSD). The RMSD was computed by taking the square root of the average of the squared differences between ability estimates. Table 3 gives the results.
Table 3. Correlations and mean differences for person ability estimates in the TestDaF listening section using the testlet response model, the independent-items model, and the polytomous-items model.
Note: All models were 2-PL models estimated using a fully Bayesian MCMC approach. MD = mean difference. MAD = mean absolute difference. RMSD = root mean-square difference.
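The three difference-based indicators defined above (MD, MAD, RMSD) can be computed as in the following sketch. The function names are illustrative, and the two short lists of ability estimates are hypothetical values chosen only to make the arithmetic easy to follow.

```python
from math import sqrt
from statistics import fmean


def mean_difference(a, b):
    """MD: average of the signed differences a - b."""
    return fmean([x - y for x, y in zip(a, b)])


def mean_absolute_difference(a, b):
    """MAD: average of the absolute differences |a - b|."""
    return fmean([abs(x - y) for x, y in zip(a, b)])


def root_mean_square_difference(a, b):
    """RMSD: square root of the average squared difference."""
    return sqrt(fmean([(x - y) ** 2 for x, y in zip(a, b)]))


# Hypothetical ability estimates for three examinees under two models.
theta_model_a = [0.5, -0.2, 1.0]
theta_model_b = [0.4, 0.0, 0.7]

print(round(mean_difference(theta_model_a, theta_model_b), 3))
print(round(mean_absolute_difference(theta_model_a, theta_model_b), 3))
print(round(root_mean_square_difference(theta_model_a, theta_model_b), 3))
```

Because positive and negative differences cancel in the MD but not in the MAD or RMSD, an MD close to zero combined with a larger MAD indicates unsystematic rather than directional disagreement between two models.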
As evidenced by the correlations and by the three difference-based statistics, the correspondence between the ability estimates from the TRT and the independent-items models was almost perfect. Smaller, but still very high correspondence was observed for the other two model comparisons. Figures 1 and 2 (left panel) display the scatter plots for each of these comparisons. The plots illustrate that the polytomous-items model tended to underestimate the ability at lower levels of the ability scale, as compared to the estimates based on the TRT and the independent-items models.

Figure 1. Person ability estimates (left panel) and associated standard errors (right panel) under the testlet response model, the independent-items model, and the polytomous-items model (April exam).

Figure 2. Person ability estimates (left panel) and associated standard errors (right panel) under the testlet response model, the independent-items model, and the polytomous-items model (November exam).
Figures 1 and 2 (right panel) present the scatter plots for the standard errors associated with the ability estimates. The scatter plot in the top-right corner of Figures 1 and 2, respectively, demonstrates that the independent-items model underestimated the measurement error (or overestimated the precision). This effect was particularly pronounced in the lower SE range from 0.35 to 0.50. Considerably less consistent relations between standard errors resulted for the other two model comparisons.
Item parameter estimates
The summary statistics shown in Table 4 attest to a very high correspondence between the item parameters estimated by the TRT and the independent-items models. In particular, item difficulty and item discrimination parameters were estimated with extremely high precision, irrespective of the model used.
Table 4. Summary statistics for item parameter estimates in the TestDaF listening section using the testlet response model and the independent-items model.
Note: Both models were 2-PL models estimated using a fully Bayesian MCMC approach. RMSE = root mean-square estimation error. R = item separation reliability.
Table 5 presents the correlations and difference-based statistics for item difficulty and discrimination estimates, respectively. Overall, the correspondence between the TRT and the independent-items models was very high. Figure 3 displays the scatter plots.
Table 5. Correlations and mean differences for item parameter estimates in the TestDaF listening section using the testlet response model and the independent-items model.
Note: Both models were 2-PL models estimated using a fully Bayesian MCMC approach. MD = mean difference. MAD = mean absolute difference. RMSD = root mean-square difference.

Figure 3. Item difficulty and item discrimination estimates under the testlet response model and the independent-items model in the April exam (upper panel) and November exam (lower panel).
If anything, there was a slight tendency for the independent-items model to overestimate the difficulty at the lower end of the difficulty scale; this effect was somewhat more pronounced for the November exam. The items involved were all nested within Testlet 2. Thus, for the November exam, the two largest absolute differences concerned Item 12 (0.40 logits easier, based on the TRT model) and Item 9 (0.30 logits easier). Regarding the item discrimination estimates, no evidence of a consistent estimation bias in one or the other direction was obtained.
Summary and discussion
Many, if not most, language tests use testlets in some form or other. However, an appropriate analysis of testlet effects is still the exception. Given the detrimental consequences these effects may have for test score meaning and interpretation, the widespread neglect of approaches to account for testlet effects can hardly be justified. The present study aimed to fill this gap. Testlet effects were investigated within the context of the Test of German as a Foreign Language (TestDaF). Building on testlet response theory (TRT; Wainer et al., 2007), the TestDaF listening section was closely examined to find out whether, and to what extent, each of the three testlets making up this section was subject to testlet-based LID. Two research questions were addressed.
The first research question concerned the size of the testlet effects present in the listening section. Effects were estimated using a 2-PL TRT model as implemented in the computer program SCORIGHT (Wang et al., 2005). Results indicated that there were moderate effects for one testlet (Testlet 2), and small effects for the other two testlets (Testlet 1, Testlet 3). The increase in effect size for Testlet 2 was confirmed by an analysis of residual correlations between listening items: Heightened values of the Q3 statistic (Yen, 1984) were obtained for items belonging to Testlet 2 only.
Though there appeared to be no strong testlet effects, the second research question concerned the appropriateness of a traditional IRT modeling approach, an approach that ignored the dependency structure caused by the testlet format, treating all 25 items on the listening section as independent items. Specifically, it was asked how the TRT-based ability and item parameter estimates compared to the estimates that resulted from two competing modeling approaches, that is, from an independent-items model and from a polytomous-items model, respectively. The polytomous-items approach was one in which testlets were considered polytomous items and analyzed using a polytomous IRT model. Part of this question also addressed the precision with which these parameters were estimated, as well as the size of the estimated test reliability. Results showed that the correspondence between the person ability parameters under the TRT and the independent-items models was extremely high, approaching an almost perfect correlation. The other two model comparisons (i.e., TRT model vs. polytomous-items model, independent-items model vs. polytomous-items model) yielded slightly lower levels of correspondence.
Regarding overall test reliability, higher values were observed under the polytomous-items and the independent-items models than under the TRT model. Had all listening items functioned as independent items, one would have expected similarly high reliabilities under the TRT and the independent-items models. Yet, this was not the case, indicating that the independent-items model overestimated the reliability of the listening section. Considering the standard errors that were associated with the person ability estimates, it was confirmed that the independent-items model suggested a higher level of measurement precision than was actually warranted. In terms of item difficulty and discrimination estimates, as well as their precision, the three competing models yielded highly similar results. Thus, the estimation of item parameters seemed to remain largely unaffected by testlet-based LID.
The present findings are in line with results from recent simulation studies that examined the relative performance of each of a number of psychometric approaches to model testlet-based tests (DeMars, 2006, 2012; Zhang et al., 2010). For example, studying the TRT model, the bi-factor model (of which the TRT model is a special case; e.g., Rijmen, 2010), the polytomous-items model, and the independent-items model, DeMars (2006) concluded that differences between the models concerning ability parameter estimates were negligibly small, “though the estimated reliability will be inflated for the independent-items model if items within testlets are not independent” (p. 165).
In the April exam, the TRT model yielded a reliability estimate of .71; in the November exam this estimate was .74. The reliability estimates obtained under the independent-items model were substantially higher, that is, .76 and .82, respectively. Using the Spearman-Brown prophecy formula, this overestimation can also be expressed in terms of the number of items that would have to be added to the test in order to achieve the higher reliability value in each comparison. Thus, when testlet-based LID was taken into account by employing the TRT model, 7 items would have to be added to raise the reliability from .71 to .76 (April exam), corresponding to an increase in test length of 28%; in the November exam, raising the reliability from .74 to .82 would require as many as 15 additional items, corresponding to an increase in test length of 60%. Clearly, then, the overestimation of measurement precision that results from ignoring testlet effects under a traditional IRT modeling approach is a matter of practical concern, even when these effects are small to moderate as in the present study. Moreover, adding items may run the risk of introducing even higher redundancy into the test, aggravating the problem of “double counting” of information contained in testlet items (Braeken et al., 2007). A more reasonable option may therefore be to increase the number of testlets and to let each item within a testlet address a distinct portion of passage-related input (Bradlow et al., 1999; Glas et al., 2000; Zhang et al., 2010).
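The Spearman-Brown arithmetic behind these figures can be retraced with a brief sketch. It solves the prophecy formula, r_new = k·r / (1 + (k − 1)·r), for the lengthening factor k and converts k into a number of additional items for the 25-item listening section. Rounding to the nearest whole item is an assumption made here; it reproduces the values reported in the text.

```python
def spearman_brown(r_current, k):
    """Predicted reliability when test length is multiplied by k."""
    return k * r_current / (1 + (k - 1) * r_current)


def lengthening_factor(r_current, r_target):
    """Factor k by which the test must be lengthened to reach r_target."""
    return (r_target * (1 - r_current)) / (r_current * (1 - r_target))


def items_to_add(r_current, r_target, n_items):
    """Unrounded number of comparable items to add to an n_items test."""
    return n_items * (lengthening_factor(r_current, r_target) - 1)


N_ITEMS = 25  # items on the listening section
for exam, r_trt, r_independent in [("April", 0.71, 0.76),
                                   ("November", 0.74, 0.82)]:
    extra = round(items_to_add(r_trt, r_independent, N_ITEMS))
    print(exam, extra, f"{extra / N_ITEMS:.0%}")
# April 7 28%
# November 15 60%
```

The unrounded values are about 7.3 items (April) and 15.0 items (November), so the lengthening factors are roughly 1.29 and 1.60, respectively.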
When the variance of testlet effects differs in size between the testlets under consideration, the question arises as to what may have caused the differences. In this analysis, Testlet 1 and Testlet 3 showed similarly small effect variances, whereas Testlet 2 proved to be somewhat more subject to LID, a pattern that held across both TestDaF exam dates. The three testlets differed in overall difficulty, which was designed to increase from Testlet 1 to Testlet 3, in the number of items per testlet, in the item format (short-answer questions in Testlet 1 and Testlet 3, true/false items in Testlet 2), and in many other characteristics. The difference in item format could have been particularly relevant here. Whereas short-answer questions require examinees to construct their own answer, which then needs to be scored as either correct or incorrect, true/false items simply ask examinees to decide whether a given statement is true or false (Alderson, Clapham, & Wall, 1995; Buck, 2001). Hence, as mentioned earlier, true/false items are subject to guessing, a factor that may have contributed to the somewhat heightened variance of the testlet effects for Testlet 2.
One principled way to address this question would be by means of regression techniques, in particular tree-based regression (TBR; Breiman, Friedman, Olshen, & Stone, 1984). Adopting a TBR approach first requires identifying testlet characteristics that can be hypothesized to function as predictors in the regression model, such as formal testlet characteristics (e.g., the number of items contained in each testlet), characteristics of the input (e.g., linguistic features, thematic category), characteristics of the items (e.g., abstractness, degree of relatedness to the input), or characteristics of the implied cognitive processes (e.g., inferential processing, categorization). In the regression model, the criterion variable would be the variance (or standard deviation) of the testlet effects. Applying the TBR approach to testlet data from an analytic reasoning test, Paap and Veldkamp (2012) were able to identify a number of input-related variables that were associated with the size of the testlet effects (e.g., percentage of “if” clauses, theme/topic).
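The core step of tree-based regression, choosing the predictor and cut point that best separate testlets with small and large effect variances, can be illustrated with a minimal sketch. It implements only a single binary split that minimizes the within-node sum of squared errors, not a full regression tree with pruning, and it is not the procedure used by Paap and Veldkamp (2012). All data values and variable names (`n_items`, `pct_if`, `effect_var`) are hypothetical and serve illustration only.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty node)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(rows, predictors, target):
    """Find the single binary split with the smallest total within-node SSE.

    Candidate thresholds are midpoints between adjacent observed predictor
    values. Returns a tuple (predictor, threshold, total_sse).
    """
    best = None
    for name in predictors:
        levels = sorted({row[name] for row in rows})
        for low, high in zip(levels, levels[1:]):
            threshold = (low + high) / 2
            left = [r[target] for r in rows if r[name] <= threshold]
            right = [r[target] for r in rows if r[name] > threshold]
            total = sse(left) + sse(right)
            if best is None or total < best[2]:
                best = (name, threshold, total)
    return best


# Hypothetical testlets: number of items, percentage of "if" clauses in the
# input, and estimated testlet effect variance (all values invented).
testlets = [
    {"n_items": 8, "pct_if": 5, "effect_var": 0.10},
    {"n_items": 10, "pct_if": 30, "effect_var": 0.45},
    {"n_items": 7, "pct_if": 8, "effect_var": 0.12},
    {"n_items": 6, "pct_if": 35, "effect_var": 0.50},
    {"n_items": 9, "pct_if": 10, "effect_var": 0.15},
    {"n_items": 10, "pct_if": 28, "effect_var": 0.40},
]

predictor, threshold, total_sse = best_split(
    testlets, ["n_items", "pct_if"], "effect_var"
)
print(predictor, threshold)  # pct_if 19.0
```

In this invented data set the split on the percentage of “if” clauses separates the testlets with small effect variances from those with large ones, whereas no split on the number of items does so equally well. A full TBR analysis would apply such splits recursively and guard against overfitting.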
Item writing and test development can benefit in various ways from using a TRT-based modeling approach. For example, if a testing program called for testlets having only small testlet effects, testlets that have been shown to be associated with strong effects could be replaced by testlets with lower effects. Estimates of testlet effect variance would thus serve as a kind of testlet or task selection criterion (Wainer et al., 2007; Zhang et al., 2010). Moreover, when previous analyses identified some factors contributing to the occurrence of testlet effects, such as item format, item content, or linguistic features of the input, these factors could be taken into account in the process of testlet construction.
Conclusion
The testlet format has gained great popularity in language testing. Yet, testlet-based tests are most often developed, analyzed, and evaluated based on traditional psychometric approaches that ignore the dependency between items caused by their association with the same input or stimulus. The striking imbalance between carefully devised practices of test development and use on the one side, and the unreflecting application of possibly inappropriate techniques of item and test analysis on the other, clearly requires corrective measures. A growing body of research has attested to the adverse consequences of ignoring testlet-based LID for parameter estimation, for the precision of ability estimates and test reliability, as well as for the meaning and interpretation of test scores and the use that is made of scores for certification, classification, and other decision-making purposes. Recent advances in the development of suitable psychometric models (e.g., DeMars, 2012; Rijmen, 2010; Wainer et al., 2007), and in the development of easily accessible software to implement these models (e.g., Adams, Wu, & Wilson, 2012; Chalmers, 2012; Curtis, 2010; Kruschke, 2011; Wang et al., 2005), give reason to hope that significant progress in the analysis of testlet-based tests will be made in the near future. Ultimately, this will serve to ensure and possibly increase the validity of inferences drawn from test scores.
Footnotes
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
