A Two-Level Alternating Direction Model for Polytomous Items With Local Dependence

Abstract

The chiropractic clinical competency examination uses groups of items that are integrated by a common case vignette. The nature of the vignette items violates the assumption of local independence for items nested within a vignette. This study examines via simulation a new algorithmic approach for addressing the local independence violation problem using a two-level alternating directions testlet model. Parameter values for item difficulty, discrimination, test-taker ability, and test-taker secondary abilities associated with a particular testlet are generated, and parameter recovery through Markov Chain Monte Carlo Bayesian methods and generalized maximum likelihood estimation methods is compared. To aid with the complex computational efforts, the novel TensorFlow platform is used. Both estimation methods provided satisfactory parameter recovery, although the Bayesian methods were found to be somewhat superior in recovering item discrimination parameters. The practical significance of the results is discussed in relation to obtaining accurate estimates of item, test, and ability parameters and of measurement reliability.

Keywords

testlet response theory (TRT), violation of local independence, Bayesian methods, Markov Chain Monte Carlo (MCMC), generalized maximum likelihood estimation (GMLE)

Introduction

The National Board of Chiropractic Examiners (NBCE) is an independent third-party testing agency for the chiropractic profession. It was incorporated in 1963 and began administering its first examinations in 1965. Prior to the formation of the NBCE, each state chiropractic licensing board created and administered its own battery of licensure examinations. As a consequence, licensure testing for the chiropractic profession and standards for entry-level chiropractic licensure varied considerably from state to state. Presently, successful completion of the NBCE examinations, which comprise four separate parts, is required for licensure across all of the United States, thereby providing the chiropractic profession with a single pathway to licensure (for additional details, see online Supplementary Appendix A).

One key component of the chiropractic clinical competency examination is the use of groups of items that are integrated by a common case vignette. The nature of the vignette items, however, violates the defining item response theory (IRT) assumption of conditional independence or local independence for items. This assumption states that the conditional probability of observing a particular response pattern given a latent trait value equals the product of the conditional probabilities of the individual item responses. Stated another way, local independence implies that the only thing causing items to covary is the modeled latent trait(s); a violation of this assumption is referred to as local dependence. Because items in a test should not be related to each other beyond their common dependence on the latent trait, having items nested within a vignette is a potential source of local dependence. Not accounting for these dependencies can lead to biased estimates of item, test, and ability parameters and even overestimation of measurement reliability (Sireci, Thissen, & Wainer, 1991; Wainer & Wang, 2000).
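
In symbols, for a response pattern $(y_{1}, \dots, y_{n})$ on $n$ items and a latent trait value $θ$ (the symbols $y_{i}$ and $n$ are introduced here only for illustration), local independence can be written as

P (Y_{1} = y_{1}, \dots, Y_{n} = y_{n} | θ) = Π_{i = 1}^{n} P (Y_{i} = y_{i} | θ)

For two items that share a vignette, local dependence then means that $P (Y_{1} = y_{1}, Y_{2} = y_{2} | θ) \neq P (Y_{1} = y_{1} | θ) P (Y_{2} = y_{2} | θ)$ for at least one pair of responses.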

The purpose of this article is to introduce a new computational estimation algorithm based on the widely used two-level unidimensional testlet response theory (TRT) model (Wainer, Bradlow, & Wang, 2007). Although past research has suggested that the TRT model is ideally suited for obtaining parameter estimates in settings with local dependence, the estimation process can be computationally challenging due to the strong nonlinear coupling of examinee and item parameters. The newly proposed estimation approach seeks to address this issue and contribute to the literature by introducing a computationally efficient algorithm that can be used in practical measurement settings. To accomplish the purposes of this article, a number of simulations based on realistic data scenarios were conducted and scrutinized.

The remainder of the article is organized in the following way. The next section presents an overview of the literature on the estimation of tests containing testlet structure. This is followed by a description of the proposed computational approach. Next, complete details on the simulation design, the analyses, and the procedures used to examine the proposed computational approach are described. This is followed by a description of results obtained from analyses of the simulated data. Finally, a discussion of the implications of the findings is provided.

Overview of the Literature

Recent developments in educational and professional testing have provided support for the use of complex test items that often include groups of items united by a common stimulus (Attali, 2011). Examples of such complex test items include ones in which a larger task is broken down into a subset of so-called scaffolded tasks that provide an opportunity to elicit more responses, as well as situations in which test-takers are provided with more information to complete an item (Bergner, Choi, & Castellano, 2019; Wolf et al., 2016). Wainer and Kiely (1987) proposed a name for such complex items, calling them “testlets.” Testlets are commonly used to boost testing efficiency in situations that examine an individual’s ability to understand some sort of stimulus (e.g., a reading passage, an information graph, a musical passage, or a table of numbers; Wainer et al., 2007). Other complicated situations include examinations where local independence between the items of different testlets holds, but the assumption of within-testlet independence is violated due to the presence of within-testlet residual dependency in responses. Although attempts to develop a comprehensive, inclusive model for various types of dependent items date back some 30 years (Fray, 1989), the debate on how best to account for dependency in the test in a holistic final score continues to date. A main reason for this debate is that when items are united by a single prompt, they inevitably violate the assumption of local independence, which is a fundamental assumption for IRT models (De Boeck & Wilson, 2004).

To account for such potentially complex tests with dependencies in responses, De Boeck and Wilson (2004) suggested three different approaches to modeling the vector of responses $Y$. The first approach is called the conditional modeling approach, because the conditional probability of a certain response to an item is modeled conditionally on some or all other responses to that item. The second approach is called marginal modeling and uses the marginal distributions that are constructed for each component of $Y$ based on the set of predictor variables, with the correlation among the components of $Y$ added on top of the marginal distributions to explain residual dependency. The third approach is called the random-effects modeling approach and involves the use of a vector of random effects introduced to the model to account for the dependency among the components of $Y$. Although these approaches can in general prove beneficial, for a test with a large number of items they can become prohibitively computationally demanding (Tuerlinckx & De Boeck, 2004).

A number of other approaches to account for tests with dependencies in responses have also been proposed in the extant literature. For example, Fischer (1989) proposed a generalization of the logistic linear model with relaxed assumptions in the context of violation of stochastic independence for tests when testlets are formed. In the model, called a hybrid model, it is possible to estimate changes even if the responses do not have a common latent trait. Verhelst and Glas (1995) also proposed two dynamic generalizations of models that relax the assumption of local stochastic independence. In the first approach, they proposed a special case of a log-linear model with added parameters based on the Rasch model, while for the second approach they applied a framework from mathematical learning theory (Sternberg, 1963).

Other models proposed to handle settings with possible local dependency include the rating scale model (Andrich, 1978), the partial credit model (Masters, 1982), the generalized partial credit model (Muraki, 1992), and the graded response model (Samejima, 1969). Following this strand of research, Culpepper (2014) presented an overview of different sequential item response models applicable to tests constructed using items with violation of local independence, specifically those allowing multiple attempts of an item. He demonstrated how these models for repeated attempts could be applied within the Rasch modeling framework, introducing attempt-specific parameters as a strategy to account for the differences in the probability of providing a correct response during the repeated attempts. Although advantageous, this modeling framework was never extended beyond the Rasch model to other models such as the 2-parameter and 3-parameter logistic models, which are often used by operational testing programs to explain the responses collected from test-takers. Another limitation of these approaches is that the computations needed to solve the problem can be quite demanding. This is the reason Li, Li, and Wang (2010) indicated that “some relatively simple methods to detect local dependency and measure the magnitude of the testlet effect” (p. 22) need to be urgently developed. Given that to date no broader computationally efficient approach for addressing the problem has been suggested, this article proposes a new computational algorithm that is based on the estimation of parameters via the two-level unidimensional TRT model (Wainer et al., 2007).

Method

Model Notation and Specification

To illustrate the approach, let us consider $Y_{ji}$ to represent the score category of examinee $j$ on item $i$ . Assuming next that a test item $i$ belongs to a testlet $d (i)$ with $m_{i} + 1$ scored categories starting from $0$ , the probability that examinee $j$ provides an answer in the category $k$ can be given by the following equation:

P_{ijk} = \frac{\exp (\sum_{v = 0}^{k} (a_{i 1} θ_{j} - t_{iv} + a_{i 2} γ_{d (i) j}))}{\sum_{c = 0}^{m_{i}} \exp (\sum_{v = 0}^{c} (a_{i 1} θ_{j} - t_{iv} + a_{i 2} γ_{d (i) j}))}

(1)

where $P_{ijk}$ is the probability of scoring in category $k$ of item $i$ by examinee $j$ , $t_{iv}$ is the difficulty parameter for score category $v$ of item $i$ , $θ_{j}$ is the ability of examinee $j$ , $γ_{d (i) j}$ represents the secondary ability of examinee $j$ associated with testlet $d (i)$ (which essentially implies that $γ_{d (i) j}$ accounts for local item dependence within testlet $d (i)$ ), and $a_{i 1}$ and $a_{i 2}$ indicate the item’s discriminating power with respect to $θ$ and $γ$ , respectively. For identification purposes, the following assumptions are made:

θ ~ N (0, 1); γ ~ N (0, 1) and Cov (θ, γ) = 0

(1)
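
As a minimal sketch of how the category probabilities of the model above can be evaluated numerically, the following Python function computes $P_{ijk}$ for all categories of one examinee-item pair. The function name `category_probs` and the numerical values in the example are illustrative assumptions and are not taken from the study.

```python
import math

def category_probs(theta, gamma, a1, a2, t):
    """Return [P(Y = 0), ..., P(Y = m)] for one examinee-item pair.

    t holds the category parameters t_0..t_m. The v = 0 term appears in
    every cumulative sum, so the value of t[0] cancels and is a placeholder.
    """
    # Cumulative kernels: z_k = sum_{v=0}^{k} (a1*theta - t_v + a2*gamma)
    kernels = []
    running = 0.0
    for t_v in t:
        running += a1 * theta - t_v + a2 * gamma
        kernels.append(running)
    # Subtract the maximum before exponentiating for numerical stability.
    z_max = max(kernels)
    weights = [math.exp(z - z_max) for z in kernels]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical examinee and item with four score categories (0 to 3).
probs = category_probs(theta=0.5, gamma=-0.2, a1=1.2, a2=0.8,
                       t=[0.0, -0.5, 0.3, 1.1])
print([round(p, 3) for p in probs])
```

Subtracting the largest kernel before exponentiating leaves the probabilities unchanged but avoids overflow for examinees with extreme ability values.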

An alternating direction approach is adopted to decouple the examinee ability–related parameter $θ_{j}$ and the item-related parameters $a_{i 1}$ , $a_{i 2}$ , $t_{iv}$ , and $γ_{d (i)}$ . The initial estimation value for $θ_{j}$ is obtained at the testlet level by grouping related items using the following generalized partial credit model (GPCM):

P_{i' jk'} = \frac{\exp \sum_{v' = 0}^{k'} {a'}_{i 1} (θ_{j} - t_{i' v'})}{\sum_{c = 0}^{{m'}_{i}} \exp \sum_{v' = 0}^{c} {a'}_{i 1} (θ_{j} - t_{i' v'})}

(2)

where the primed parameters are the testlet-level counterparts. The testlet item scores are obtained by summing up the scores in each constituting item. The testlet item categories are obtained in the same fashion.
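
The construction of testlet-level scores can be sketched in a few lines of Python. The helper `sum_by_testlet` and the example data are hypothetical, and reading "in the same fashion" as summing the highest categories $m_{i}$ of the constituent items is an assumption of this sketch.

```python
def sum_by_testlet(values, testlet_of_item, n_testlets):
    """Add item-level values within each testlet (testlets are numbered from 0)."""
    totals = [0] * n_testlets
    for value, d in zip(values, testlet_of_item):
        totals[d] += value
    return totals

# Hypothetical examinee: six items, two testlets of three items each.
testlet_of_item = [0, 0, 0, 1, 1, 1]
item_scores = [2, 0, 3, 1, 1, 2]   # observed score categories of the items
item_max = [3, 3, 3, 3, 3, 3]      # highest category m_i of each item

testlet_scores = sum_by_testlet(item_scores, testlet_of_item, 2)
testlet_max = sum_by_testlet(item_max, testlet_of_item, 2)
print(testlet_scores, testlet_max)  # [5, 4] [9, 9]
```

Under this reading, each testlet in the example is treated as a single polytomous item with scores from 0 to 9.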

More generally, during the alternating direction testlet TRT fitting, after obtaining the examinee ability parameter $θ_{j}$ , the estimation for item parameters is obtained via maximizing the standard likelihood function:

L (a_{i 1}, a_{i 2}, t_{iv}, γ_{d (i)}) = \sum_{j = 1}^{N} \log \frac{\exp \sum_{v = 0}^{k} a_{i 1} (θ_{j} - t_{iv} + a_{i 2} γ_{d (i)})}{\sum_{c = 0}^{m_{i}} \exp \sum_{v = 0}^{c} a_{i 1} (θ_{j} - t_{iv} + a_{i 2} γ_{d (i)})}

(3)

with $N$ being the total number of examinees considered.

With successful decoupling, the above equation becomes a relatively straightforward mathematical system that can be solved using standard numerical approaches. Similarly, after the parameters $a_{i 1}, a_{i 2}, t_{iv}, γ_{d (i)}$ have been updated, the values of $θ_{j}$ are obtained by maximizing the following equation:

L (θ_{j}) = \sum_{i = 1}^{M} \log \frac{\exp \sum_{v = 0}^{k} a_{i 1} (θ_{j} - t_{iv} + a_{i 2} γ_{d (i)})}{\sum_{c = 0}^{m_{i}} \exp \sum_{v = 0}^{c} a_{i 1} (θ_{j} - t_{iv} + a_{i 2} γ_{d (i)})}

(4)

with $M$ being the number of items.

Monte Carlo Data Simulation and Analytic Strategy

To systematically evaluate the performance of the proposed approach, simulated data using Monte Carlo techniques were analyzed under a variety of design conditions. The goal is to develop a computational approach using the likelihood function that will allow for an estimation of the parameters in the testlet model without loss of precision or reliability. Two simulated datasets were constructed and examined. The first contained synthetic data for 800 test-takers examined on 6 testlets with 5 items within each testlet, and the second contained synthetic data for 800 test-takers examined on 5 testlets and 20 items within each testlet. A complete itemized structure of the two generated data sets is presented in Tables 1 to 4.

Table 1.

The Structure of Dataset 1, Test-Taker-Related Parameters.

Test-taker	Gammas	Thetas
1	$γ_{1,1}, γ_{1,2}, \dots, γ_{1,6}$	$θ_{1}$
2	$γ_{2,1}, γ_{2,2}, \dots, γ_{2,6}$	$θ_{2}$
⋮	⋮	⋮
800	$γ_{800,1}, γ_{800,2}, \dots, γ_{800,6}$	$θ_{800}$

Table 2.

The Structure of Dataset 1, Item-Related Parameters.

Item	Discrimination	Difficulty
1	$a_{1,1}, a_{1,2}$	$t_{1,1}, t_{1,2}, t_{1,3}$
2	$a_{2,1}, a_{2,2}$	$t_{2,1}, t_{2,2}, t_{2,3}$
⋮	⋮	⋮
30	$a_{30,1}, a_{30,2}$	$t_{30,1}, t_{30,2}, t_{30,3}$

Table 3.

The Structure of Dataset 2, Test-Taker-Related Parameters.

Test-taker	Gammas	Thetas
1	$γ_{1,1}, γ_{1,2}, \dots, γ_{1,5}$	$θ_{1}$
2	$γ_{2,1}, γ_{2,2}, \dots, γ_{2,5}$	$θ_{2}$
⋮	⋮	⋮
800	$γ_{800,1}, γ_{800,2}, \dots, γ_{800,5}$	$θ_{800}$

Table 4.

The Structure of Dataset 2, Item-Related Parameters.

Item	Discrimination	Difficulty
1	$a_{1,1}, a_{1,2}$	$t_{1,1}, t_{1,2}, t_{1,3}$
2	$a_{2,1}, a_{2,2}$	$t_{2,1}, t_{2,2}, t_{2,3}$
⋮	⋮	⋮
100	$a_{100,1}, a_{100,2}$	$t_{100,1}, t_{100,2}, t_{100,3}$
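
The two data structures summarized in Tables 1 to 4 could be generated along the following lines. This is a sketch only: the function names, the seeds, and the generating distributions for the discrimination and difficulty parameters (uniform on 0.5 to 2.0 and standard normal, respectively) are assumptions made for illustration, since the text specifies only the $N (0, 1)$ distributions of $θ$ and $γ$ .

```python
import math
import random

def category_probs(theta, gamma, a1, a2, t):
    """Category probabilities of the testlet model; t[0] is a placeholder."""
    z, kernels = 0.0, []
    for t_v in t:
        z += a1 * theta - t_v + a2 * gamma
        kernels.append(z)
    z_max = max(kernels)
    weights = [math.exp(k - z_max) for k in kernels]
    total = sum(weights)
    return [w / total for w in weights]

def simulate(n_persons, n_testlets, items_per_testlet, n_thresholds=3, seed=0):
    rng = random.Random(seed)
    n_items = n_testlets * items_per_testlet
    testlet_of_item = [i // items_per_testlet for i in range(n_items)]
    # Examinee side: one theta per examinee and one gamma per testlet, N(0, 1).
    theta = [rng.gauss(0.0, 1.0) for _ in range(n_persons)]
    gamma = [[rng.gauss(0.0, 1.0) for _ in range(n_testlets)]
             for _ in range(n_persons)]
    # Item side: the distributions below are assumptions for illustration.
    a1 = [rng.uniform(0.5, 2.0) for _ in range(n_items)]
    a2 = [rng.uniform(0.5, 2.0) for _ in range(n_items)]
    t = [[0.0] + [rng.gauss(0.0, 1.0) for _ in range(n_thresholds)]
         for _ in range(n_items)]
    # Draw each response from the model's category probabilities.
    responses = []
    for j in range(n_persons):
        row = []
        for i in range(n_items):
            p = category_probs(theta[j], gamma[j][testlet_of_item[i]],
                               a1[i], a2[i], t[i])
            row.append(rng.choices(range(len(p)), weights=p)[0])
        responses.append(row)
    return {"Y": responses, "theta": theta, "gamma": gamma,
            "a1": a1, "a2": a2, "t": t, "testlet_of_item": testlet_of_item}

dataset1 = simulate(n_persons=800, n_testlets=6, items_per_testlet=5)
dataset2 = simulate(n_persons=800, n_testlets=5, items_per_testlet=20, seed=1)
print(len(dataset1["Y"][0]), len(dataset2["Y"][0]))
```

Keeping the generating values alongside the responses makes it possible to correlate them with the recovered estimates afterwards.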

After the synthetic datasets were generated, the next step was to recover the parameters using Bayesian methods via Markov Chain Monte Carlo (MCMC) estimation and using the generalized maximum likelihood estimation (GMLE) method.

Bayesian Estimation

Considering the vector of $Y$ responses, Bayesian analysis relies on samples drawn from the following posterior probability distribution:

P (a, t, θ, γ | Y, \dots)

(5)

where “…” corresponds to the hyper-parameters used in the prior distributions. The prior distributions of $a$ and $t$ are often chosen from conjugate priors to yield a closed-form posterior distribution for algebraic convenience. The MCMC method is the most frequently used algorithm to generate samples according to a posterior distribution. This estimation algorithm is summarized below.

Given an old sample $(a_{old}, t_{old}, θ_{old}, γ_{old})$ —where the term “old” refers to the parameter estimate in the previous step:

$a_{new}$ is sampled from distribution $P (a | t_{old}, θ_{old}, γ_{old}, Y, \dots)$

$t_{new}$ is sampled from distribution $P (t | a_{new}, θ_{old}, γ_{old}, Y, \dots)$

$θ_{new}$ is sampled from distribution $P (θ | a_{new}, t_{new}, γ_{old}, Y, \dots)$

$γ_{new}$ is sampled from distribution $P (γ | a_{new}, t_{new}, θ_{new}, Y, \dots)$

The fully updated $(a_{new}, t_{new}, θ_{new}, γ_{new})$ is then a new sample following the posterior distribution.
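
The block-wise update scheme above can be sketched in plain Python. Because the full conditional distributions of this model are not available in closed form in general, the sketch swaps in a random-walk Metropolis step for each parameter (Metropolis-within-Gibbs) instead of direct sampling; this substitution, the $N (1, 1)$ and $N (0, 2^{2})$ priors on $a$ and $t$ , the step size, and the toy data are all assumptions made for illustration and are not the settings used in the study.

```python
import math
import random

def log_prob(y, theta, gamma, a1, a2, t):
    """Log model probability of observed category y (t[0] is a placeholder)."""
    z, kernels = 0.0, []
    for t_v in t:
        z += a1 * theta - t_v + a2 * gamma
        kernels.append(z)
    z_max = max(kernels)
    log_norm = z_max + math.log(sum(math.exp(k - z_max) for k in kernels))
    return kernels[y] - log_norm

def log_posterior(s, Y, testlet_of_item):
    """Log-likelihood plus log-priors, up to an additive constant."""
    n_testlets = max(testlet_of_item) + 1
    n_free = len(s["t"]) // len(testlet_of_item)  # free thresholds per item
    total = 0.0
    for j, row in enumerate(Y):
        for i, y in enumerate(row):
            t_i = [0.0] + s["t"][i * n_free:(i + 1) * n_free]
            g = s["gamma"][j * n_testlets + testlet_of_item[i]]
            total += log_prob(y, s["theta"][j], g,
                              s["a"][2 * i], s["a"][2 * i + 1], t_i)
    # N(0, 1) priors on theta and gamma follow the identification constraints;
    # the priors on a and t are illustrative assumptions.
    total -= 0.5 * sum(x * x for x in s["theta"] + s["gamma"])
    total -= 0.5 * sum((x - 1.0) ** 2 for x in s["a"])
    total -= 0.5 * sum((x / 2.0) ** 2 for x in s["t"])
    return total

def sweep(s, Y, testlet_of_item, rng, step=0.3):
    """Update the blocks in the order a, t, theta, gamma; return the log-posterior."""
    current = log_posterior(s, Y, testlet_of_item)
    for block in ("a", "t", "theta", "gamma"):
        for idx in range(len(s[block])):
            old = s[block][idx]
            s[block][idx] = old + rng.gauss(0.0, step)
            proposed = log_posterior(s, Y, testlet_of_item)
            # Metropolis acceptance; 1 - random() lies in (0, 1], so log is safe.
            if math.log(1.0 - rng.random()) < proposed - current:
                current = proposed
            else:
                s[block][idx] = old
    return current

# Toy data: 20 examinees, 4 items in 2 testlets, 3 score categories.
rng = random.Random(1)
testlet_of_item = [0, 0, 1, 1]
Y = [[rng.randrange(3) for _ in testlet_of_item] for _ in range(20)]
state = {"a": [1.0] * 8, "t": [0.0] * 8, "theta": [0.0] * 20, "gamma": [0.0] * 40}
draws = []
for _ in range(20):
    last = sweep(state, Y, testlet_of_item, rng)
    draws.append(list(state["a"]))
print(len(draws), round(last, 2))
```

Recomputing the full log-posterior for every proposal keeps the sketch short but is wasteful; a practical sampler would evaluate only the likelihood terms that involve the parameter being updated.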

Generalized Maximum Likelihood Estimation

Maximum likelihood approaches estimate the model parameters that maximize the likelihood function, given the examinees’ overall matrix of responses $Y$. This can be specified as follows:

Π_{i = 1}^{I} Π_{j = 1}^{J} P (Y_{ij} | a_{i 1}, a_{i 2}, t_{iv}, θ_{j}, γ_{d (i) j})

(6)

or equivalently as,

\begin{matrix} E (a, t, θ, γ | Y) = \sum_{i = 1}^{I} \sum_{j = 1}^{J} \log (P (Y_{ij} | a_{i 1}, a_{i 2}, t_{iv}, θ_{j}, γ_{d (i) j})) \\ = \sum_{i = 1}^{I} \sum_{j = 1}^{J} \log (\frac{\exp (\sum_{v = 0}^{k} (a_{i 1} θ_{j} - t_{iv} + a_{i 2} γ_{d (i) j}))}{\sum_{c = 0}^{m_{i}} \exp (\sum_{v = 0}^{c} (a_{i 1} θ_{j} - t_{iv} + a_{i 2} γ_{d (i) j}))}) \end{matrix}

(7)

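Numerically, an iterative gradient method is usually adopted to find the optimal model parameters. Because $E$ is maximized, each update moves the parameters in the direction of the gradient, which is equivalent to gradient descent on $-E$ :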

(a_{new}, t_{new}, θ_{new}, γ_{new}) = (a_{old}, t_{old}, θ_{old}, γ_{old}) + r_{u} (\frac{\partial E}{\partial a}, \frac{\partial E}{\partial t}, \frac{\partial E}{\partial θ}, \frac{\partial E}{\partial γ})

(8)

where $r_{u}$ is the updating parameter (the step size).

Because of the complexity of $E$ , deriving explicit formulas for the derivatives of $E$ with respect to $a, t, θ, γ$ is impractical and prone to numerical instability. To ease the computational complexity, the TensorFlow platform (described further below) is used to aid in the complex calculation of the derivatives.
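
The update rule can be illustrated without TensorFlow by a small Python sketch in which a central finite-difference gradient stands in for the automatic differentiation that TensorFlow provides. The helper names (`numeric_grad`, `ascent_step`), the three hypothetical items, the responses, and the step size $r_{u} = 0.1$ are assumptions chosen for illustration; only $θ$ and $γ$ of a single examinee are updated here.

```python
import math

def log_prob(y, theta, gamma, a1, a2, t):
    """Log model probability of observed category y (t[0] is a placeholder)."""
    z, kernels = 0.0, []
    for t_v in t:
        z += a1 * theta - t_v + a2 * gamma
        kernels.append(z)
    z_max = max(kernels)
    log_norm = z_max + math.log(sum(math.exp(k - z_max) for k in kernels))
    return kernels[y] - log_norm

def numeric_grad(f, x, h=1e-5):
    """Central finite-difference gradient; a stand-in for automatic differentiation."""
    grad = []
    for d in range(len(x)):
        up = x[:d] + [x[d] + h] + x[d + 1:]
        down = x[:d] + [x[d] - h] + x[d + 1:]
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad

def ascent_step(f, x, r_u):
    """One update: new = old + r_u * gradient, which increases f for small r_u."""
    return [x_d + r_u * g_d for x_d, g_d in zip(x, numeric_grad(f, x))]

# Hypothetical examinee answering three items of one testlet: (a1, a2, t) per item.
items = [(1.0, 0.5, [0.0, -0.4, 0.6]),
         (1.4, 0.7, [0.0, 0.1, 0.9]),
         (0.8, 0.6, [0.0, -1.0, 0.2])]
responses = [2, 1, 2]

def E(x):
    """Log-likelihood of this examinee as a function of x = [theta, gamma]."""
    theta, gamma = x
    return sum(log_prob(y, theta, gamma, a1, a2, t)
               for y, (a1, a2, t) in zip(responses, items))

x = [0.0, 0.0]
history = [E(x)]
for _ in range(50):
    x = ascent_step(E, x, r_u=0.1)
    history.append(E(x))
print(round(history[0], 4), round(history[-1], 4))
```

Finite differences need two evaluations of $E$ per parameter and therefore scale poorly; this is the cost that automatic differentiation avoids when thousands of parameters are updated at once.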

The following steps summarize the algorithm implemented:

1. Generate $θ_{old}, γ_{old}$ from $N (0, 1)$ .

2. Find $a_{old}, t_{old}$ maximizing

E (a, t | θ_{old}, γ_{old}) = \sum_{i = 1}^{I} \sum_{j = 1}^{J} \log (\frac{\exp (\sum_{v = 0}^{k} (a_{i 1} θ_{[old] j} - t_{iv} + a_{i 2} γ_{[old] d (i) j}))}{\sum_{c = 0}^{m_{i}} \exp (\sum_{v = 0}^{c} (a_{i 1} θ_{[old] j} - t_{iv} + a_{i 2} γ_{[old] d (i) j}))})

(8)

using the gradient descent method.

3. Given the $a_{old}, t_{old}$ obtained in step 2, find $θ_{new}, γ_{new}$ maximizing

E (θ, γ | a_{old}, t_{old}) = \sum_{i = 1}^{I} \sum_{j = 1}^{J} \log (\frac{\exp (\sum_{v = 0}^{k} (a_{[old] i 1} θ_{j} - t_{[old] iv} + a_{[old] i 2} γ_{d (i) j}))}{\sum_{c = 0}^{m_{i}} \exp (\sum_{v = 0}^{c} (a_{[old] i 1} θ_{j} - t_{[old] iv} + a_{[old] i 2} γ_{d (i) j}))})

(8)

4. Set $θ_{old} = θ_{new}, γ_{old} = γ_{new}$

5. Repeat steps 2 to 4 until convergence is attained.
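
The five steps above can be sketched in plain Python as follows. The sketch uses analytic gradients of the log-likelihood in place of TensorFlow, replaces the full maximization in steps 2 and 3 with a fixed number of gradient updates, and averages each block's gradient over its observations; these choices, together with all names, starting values, and the toy data, are assumptions made for illustration and do not reproduce the code in Supplementary Appendix B.

```python
import math
import random

def probs(theta, gamma, a1, a2, t):
    """Category probabilities of the testlet model; t[0] is a placeholder."""
    z, kernels = 0.0, []
    for t_v in t:
        z += a1 * theta - t_v + a2 * gamma
        kernels.append(z)
    z_max = max(kernels)
    w = [math.exp(k - z_max) for k in kernels]
    total = sum(w)
    return [x / total for x in w]

def log_lik(Y, P, d):
    """E: sum over examinees j and items i of the log probability of Y[j][i]."""
    return sum(math.log(probs(P["theta"][j], P["gamma"][j][d[i]],
                              P["a1"][i], P["a2"][i], P["t"][i])[y])
               for j, row in enumerate(Y) for i, y in enumerate(row))

def ascent_pass(Y, P, d, block, r_u):
    """One gradient step on the item block or on the examinee block."""
    n_items = len(d)
    g_a1, g_a2 = [0.0] * n_items, [0.0] * n_items
    g_t = [[0.0] * len(t_i) for t_i in P["t"]]
    g_theta = [0.0] * len(Y)
    g_gamma = [[0.0] * len(row) for row in P["gamma"]]
    for j, row in enumerate(Y):
        for i, y in enumerate(row):
            th, ga = P["theta"][j], P["gamma"][j][d[i]]
            p = probs(th, ga, P["a1"][i], P["a2"][i], P["t"][i])
            resid = y - sum(c * p_c for c, p_c in enumerate(p))  # observed - expected
            g_a1[i] += th * resid
            g_a2[i] += ga * resid
            g_theta[j] += P["a1"][i] * resid
            g_gamma[j][d[i]] += P["a2"][i] * resid
            for v in range(1, len(p)):
                g_t[i][v] += sum(p[v:]) - (1.0 if y >= v else 0.0)
    if block == "items":
        scale = r_u / len(Y)        # averaged gradient (an assumption of this sketch)
        for i in range(n_items):
            P["a1"][i] += scale * g_a1[i]
            P["a2"][i] += scale * g_a2[i]
            for v in range(1, len(g_t[i])):
                P["t"][i][v] += scale * g_t[i][v]
    else:
        scale = r_u / n_items
        for j in range(len(Y)):
            P["theta"][j] += scale * g_theta[j]
            for k in range(len(g_gamma[j])):
                P["gamma"][j][k] += scale * g_gamma[j][k]

def alternate(Y, d, n_cats, outer=20, inner=5, r_u=0.1, tol=1e-4, seed=0):
    rng = random.Random(seed)
    n_testlets, n_items = max(d) + 1, len(d)
    P = {"theta": [rng.gauss(0, 1) for _ in Y],                      # step 1
         "gamma": [[rng.gauss(0, 1) for _ in range(n_testlets)] for _ in Y],
         "a1": [1.0] * n_items, "a2": [1.0] * n_items,
         "t": [[0.0] * n_cats for _ in range(n_items)]}
    trace = [log_lik(Y, P, d)]
    for _ in range(outer):
        for _ in range(inner):
            ascent_pass(Y, P, d, "items", r_u)                       # step 2
        for _ in range(inner):
            ascent_pass(Y, P, d, "persons", r_u)                     # steps 3 and 4
        trace.append(log_lik(Y, P, d))
        if trace[-1] - trace[-2] < tol:                              # step 5
            break
    return P, trace

# Toy data drawn from the model itself (sizes and values are illustrative).
rng = random.Random(42)
d = [0, 0, 0, 1, 1, 1]
Y = []
for _ in range(60):
    th, g = rng.gauss(0, 1), [rng.gauss(0, 1), rng.gauss(0, 1)]
    Y.append([rng.choices(range(3), weights=probs(th, g[d[i]], 1.2, 0.6,
                                                  [0.0, -0.5, 0.5]))[0]
              for i in range(6)])
P, trace = alternate(Y, d, n_cats=3)
print(round(trace[0], 2), round(trace[-1], 2))
```

The trace of log-likelihood values gives a simple convergence check: the alternation stops once a full cycle over both blocks improves $E$ by less than the tolerance.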

The TensorFlow Platform

In the fields of mathematics and physics, tensors are multidimensional geometrical objects that describe linear relations between vectors, scalars, and other tensors. Accordingly, tensors can be considered generalizations of scalars and vectors: a scalar is a zero-rank tensor, and a vector is a first-rank tensor. The need for higher rank tensors arises when more than one direction is required to describe physical or other properties. In statistics, tensors can be quite helpful when the examined data are multidimensional and require tedious calculations.

TensorFlow was originally developed by researchers and engineers working on the Google Brain Team within Google’s Machine Intelligence organizational group for the purpose of conducting machine learning and deep neural network research. TensorFlow is an open-source software platform for dataflow programming. The platform provides an interface that can be used to express complex machine learning algorithms and enables the implementation and execution of such algorithms. The system is quite flexible and can be effectively used to express a wide variety of algorithms, particularly those required for training machine learning and neural network models (Abadi et al., 2015).

TensorFlow represents a computation as a dataflow graph, and any computation that can be expressed as such a graph can in principle be computed on the platform. For the purpose of completing the analyses necessary for this study, TensorFlow was used to calculate the complex derivatives in the likelihood functions. All calculations were conducted in the cloud, which also afforded any additional computational power needed. The Python source code needed to estimate the testlet models examined in this study using TensorFlow is included in the online Supplementary Appendix B.

Results

The overall parameter recovery solutions produced by the MCMC and the GMLE estimation methods were in general successful. In terms of estimated item parameters, overall item difficulties were recovered much better than item discriminations. In terms of test-taker-related parameters, the recovery was outstanding using both methodologies. Detailed results of the parameter recoveries are presented in Tables 5, 6, and 7 for Dataset 1 and in Tables 8, 9, and 10 for Dataset 2. Figures 1 to 4 display graphical visualizations of parameter recovery results for Dataset 1, while Figures 5 to 8 display similar graphical visualizations of parameter recovery for Dataset 2.

Table 5.

Correlation Between Real and Recovered Discrimination Parameters, Dataset 1.