IRT Models for Learning With Item-Specific Learning Parameters

Abstract

We propose a new item response theory growth model with item-specific learning parameters, or ISLP, and two variations of this model. In the ISLP model, either items or blocks of items have their own learning parameters. This model may be used to improve the efficiency of learning in a formative assessment. We show ways that the ISLP model’s learning parameters can be estimated in simulation using Markov chain Monte Carlo (MCMC), demonstrate a way that the model could be used in the context of adaptive item selection to increase the rate of learning, and estimate the learning parameters in an empirical data analysis using the ISLP. In the simulation studies, the one-parameter logistic model was used as the measurement model to generate random response data with various test lengths and sample sizes. Ability growth was modeled with a few variations of the ISLP model, and it was verified that the parameters were accurately recovered. Secondly, we generated data using the linear logistic test model with known Q-matrix structure for the item difficulties. Using a two-step procedure gave very comparable results for the estimation of the learning parameters even when item difficulties were unknown. The potential benefit of using an adaptive selection method in conjunction with the ISLP model was shown by comparing total improvement in the examinees’ ability parameter to two other methods of item selection that do not utilize this growth model. If the ISLP holds, adaptive item selection consistently led to larger improvements over the other methods. A real data application of the ISLP was given to illustrate its use in a spatial reasoning study designed to promote learning. In this study, interventions were given after each block of ten items to increase ability. Learning parameters were estimated using MCMC.

Keywords

item response theory, adaptive learning, MCMC, LLTM

Introduction

Differentiated Instruction

Vygotsky (1980) theorized a zone of proximal development (ZPD), defined as the range between what a child can do alone without guidance and what the child can do with guidance. In differentiated learning, if the teacher can push the child into their ZPD and coach with a task slightly more complex than the child can manage alone, then the child can master new skills and learn to become an independent thinker and problem solver through repetition (Joseph et al., 2013).

Differentiated instruction is common in the modern classroom and has been shown to be beneficial for both remedial and advanced students (Manning et al., 2010). Over the past decade, online learning sites, such as IXL, Khan Academy, and Dream Box, have become quite popular, perhaps due to the demand for extra help and practice and their ability to give immediate feedback. These online learning platforms provide differentiated instruction to students by means of computerized adaptive testing and interventions. Adaptive testing can improve the quality of measurement by presenting individuals with items targeted to their current ability level, which can be assessed dynamically.

Several methods of item selection have been developed for adaptive learning, such as item-pool partitioning (Kingsbury & Zara, 1989), the weighted-deviation method (Swanson & Stocking, 1993), the maximum priority index method (Cheng & Chang, 2009), testlet-based adaptive testing (Wainer et al., 2007), and constrained adaptive testing with shadow tests (van der Linden & Glas, 2009). In this work, we propose a model for the ability growth of individuals throughout the course of a formative assessment and show an example of how it could be used to select items adaptively.

Latent Variable Models for Learning

Item response theory (IRT) is a paradigm that relates a subject’s performance on an assessment to a measure of their overall ability level, $θ$ . In IRT, item response functions (IRFs) describe the relation between a latent variable and the probability of a correct response. The IRF may be modeled using parameters for item difficulty, item discrimination, and a lower asymptote (Ong, 2017) serving as the guessing parameter. The one-parameter logistic (1PL) model is the simplest logistic ogive model and is popularly used. In the 1PL, the discrimination parameter is fixed for all items; the probability of a correct response is determined by the item difficulty and ability level. A 1PL model with item discrimination fixed at 1 and unspecified ability distribution is also known as the Rasch (1960) model. The IRF for the Rasch Model is given by

P [X_{i j} = 1] = \frac{exp (θ_{i} - b_{j})}{1 + exp (θ_{i} - b_{j})},

where $P [X_{i j} = 1]$ is the probability of a correct response by subject i on item j, $b_{j}$ is the item difficulty parameter for item j, and $θ_{i}$ is the ability level for subject i.
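As a minimal sketch of the item response function above, the probability can be computed in Python as follows. The function name `rasch_prob` is ours and is not taken from the original work; the two branches are only a numerical safeguard against overflow for extreme values of $θ_{i} - b_{j}$.

```python
import math

def rasch_prob(theta, b):
    """Rasch model probability of a correct response, P[X = 1]."""
    z = theta - b
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Equivalent form that avoids overflow when z is very negative.
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Ability equal to difficulty gives a success probability of one half.
print(rasch_prob(0.0, 0.0))
# A subject one logit above the item difficulty succeeds more often.
print(rasch_prob(1.0, 0.0))
```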

Of the many methods for parameter estimation in the Rasch model, two important approaches are widely used because the asymptotic properties of the estimates are known and desirable: conditional maximum likelihood (CML) and unconditional maximum likelihood (Fischer, 1981). CML (Andersen, 1972) is a parameter estimation method that is applicable to the one-parameter model. In CML estimation of item parameters, subjects’ ability levels are treated as arbitrarily given; item and ability parameters are estimated simultaneously (Bock & Lieberman, 1970). The subject’s ability is treated like a nuisance parameter, which is eliminated from the likelihood by conditioning on the number of items answered correctly, which is a sufficient statistic for the ability (Bock & Aitkin, 1981). In the Rasch model, an advantage of CML is that the distribution of $θ$ does not need to be modeled. This makes the Rasch model a semi-parametric model when used with CML. However, this method does not work for the two-parameter or three-parameter logistic or normal ogive models (Bock & Aitkin, 1981) because total score is no longer a sufficient statistic for $θ$ .

Alternatively, if the ability distribution is assumed to have a particular form, marginal maximum likelihood (MML) can be used to estimate item parameters. Here, the data are regarded as arising from a sample of subjects from a specified population. Item parameters can be estimated by integrating over the distribution of $θ$ (Bock & Lieberman, 1970) and maximizing the marginal likelihood. Bock and Lieberman (1970) have used this approach for estimation in the two-parameter normal ogive model by using Gauss–Hermite quadrature for numerical integration (Bock & Aitkin, 1981). The MML method may be applied to a variety of item response models, including the logistic, multiple nominal categories model (Bock, 1972).

Linear Logistic Test Model (LLTM)

Scheiblechner (1972) proposed a more structured version of the Rasch model called the LLTM, in which the item difficulties are subject to the linear restrictions:

b_{j} = \sum_{k = 1}^{m} q_{j k} η_{k} + c,

where $q_{j k}$ can be expressed as an entry of a binary matrix Q with elements that indicate whether operation k is required in item j. Parameter $η_{k}$ represents the additional difficulty resulting from requiring the kth operation. The applications of the LLTM attempt to explain item difficulty based on this Q matrix. Scheiblechner (1972) was able to explain item difficulty for propositions of formal logic by a linear combination of only three operations: “negation,” “disjunction,” and “considering asymmetry.” Other early applications of the LLTM showed that although it is a useful tool, good model fit is only attained if items are constructed deliberately (Fischer & Formann, 1982). Some of the advantages of the LLTM are that when it is valid, it allows for the prediction of item difficulty for new items that have never been presented, as well as creation of items with a prespecified difficulty (Fischer & Formann, 1982). Like the Rasch model, the estimation of the LLTM can be accomplished with either CML or MML estimation.
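The linear restriction $b_{j} = \sum_{k} q_{j k} η_{k} + c$ can be illustrated with a short sketch. The Q-matrix, the $η$ values, and the constant $c$ below are hypothetical numbers chosen only for illustration, and the function name is ours.

```python
def lltm_difficulties(Q, eta, c):
    """Item difficulties implied by the LLTM: b_j = sum_k q_jk * eta_k + c."""
    return [sum(q * e for q, e in zip(row, eta)) + c for row in Q]

# Hypothetical example: three items, three operations.
Q = [
    [1, 0, 0],   # item 1 requires operation 1 only
    [1, 1, 0],   # item 2 requires operations 1 and 2
    [0, 1, 1],   # item 3 requires operations 2 and 3
]
eta = [0.5, 1.0, 1.5]   # assumed difficulty contribution of each operation
c = -1.0                # assumed normalizing constant

print(lltm_difficulties(Q, eta, c))
```

Under this restriction, a new item is assigned a predicted difficulty as soon as its row of the Q-matrix is specified, which is the property the text describes for items that have never been presented.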

IRT Growth Models

Item response models that allow for growth in the ability parameter can be applied when learning is the goal of the assessment. By continually measuring the latent ability and adapting the assessment and intervention, one can increase efficiency by avoiding overtesting and undertesting at a particular level. In addition to evaluation, items or bundles of items may be seen as educational interventions. In this work, we consider methods for modeling growth of the latent trait and also study the benefits of adaptive item selection using these models. There is vast literature on learning and growth models in IRT, though it is not focused on item-level or testlet-level learning parameters.

Longitudinal IRT (LIRT) is a branch of IRT that allows researchers to measure individual growth and change using longitudinal data (Ong, 2017). An early example of LIRT is Fischer’s extension of the LLTM to multiple time points in the linear logistic model with relaxed assumptions (LLRA; Fischer & Formann, 1982). This model drops the assumption of unidimensionality, which may be difficult to attain in educational or clinical settings (Fischer & Formann, 1982). The LLRA supposes that the reactions of n individuals to J items have been observed at several time points $t_{1}, t_{2},..., t_{T}$ and models the probability of a correct response for person i to item j at time $t_{x}$ as

P [X_{i j} = 1 | θ_{i j}^{*}, t_{x}] = \frac{exp [θ_{i j}^{*} + (t_{x} - t_{1}) δ_{i}]}{1 + exp [θ_{i j}^{*} + (t_{x} - t_{1}) δ_{i}]},

δ_{i} = \sum_{k} q_{i k} η_{k} + τ,

for $i = 1, ..., n,$ $j = 1, ..., J,$ $x = 1, ..., T$ , and $k = 1, ..., m$ , where $θ_{i j}^{*}$ is the latent ability of person i to give reaction “+” to item j at time $t_{1} .$ Parameter $q_{i k}$ is the dose of treatment k applied to person i, $η_{k}$ is the effect of treatment k, $τ$ is the sum of effects that are independent of the treatments, or “trend,” and $δ_{i}$ is the total amount of change in person i caused by treatments and trend.

The additive structure of $δ_{i}$ in the LLRA is a strong assumption, but it may serve as a null hypothesis, $H_{0}$ , against which to test more complex assumptions, such as the inclusion of interaction terms or nonlinear dose-response curves (Fischer & Formann, 1982). Although the LLRA permits the estimation of treatment parameters, it is not applicable to measuring individual differences in learning, since treatment effects are equal for all subjects who receive the same treatment (Embretson, 1991).

Andersen (1985) analyzed scenarios where the same set of items was used to measure a latent variable at two different points in time, with values $θ_{1}$ and $θ_{2}$ . Andersen studied the structure of the joint population density of $θ_{1}$ and $θ_{2}$ and was able to estimate the correlation $ρ_{(θ_{1}, θ_{2})}$ . The purpose of this work was to measure the degree to which the corresponding latent traits have changed over time (Andersen, 1985).

Embretson (1991) points out that although Andersen’s model is appropriate for understanding the impact of time or treatment on the ability distribution, it does not contain change parameters for individuals. To that effect, Embretson (1991) developed a multidimensional Rasch model for learning and change (MRMLC), which is appropriate for ability measurements, where items are not repeated to avoid bias from repeated testing.

Embretson’s model extends the Rasch model to include M abilities measured on K occasions. It assumes that on the first occasion, $k = 1$ , performance depends only on initial ability, and that on further occasions, k, performance depends on $k - 1$ additional abilities. The MRMLC can be given as follows:

P [x_{i j k} = 1 | θ_{i}, b_{j}] = \frac{exp (\sum_{m = 1}^{k} θ_{i m} - b_{j})}{1 + exp (\sum_{m = 1}^{k} θ_{i m} - b_{j})},

where $θ_{i}$ is the vector of abilities, such that $θ_{i 1}$ is the initial ability at $k = 1$ , $θ_{i k}$ is the change in ability at occasion k, and $b_{j}$ is the difficulty of item $j$ .

Across conditions, the MRMLC is multidimensional, but within any condition, item response probabilities can be given by a unidimensional model (Embretson, 1991). For all items within condition k, and for all k, the probability of a correct response depends on the composite ability $θ_{i k}^{†}$ , which is the unweighted sum of initial ability and $k - 1$ additional abilities:

P [x_{i j k} = 1 | θ_{i k}^{†}, b_{j}] = \frac{exp (θ_{i k}^{†} - b_{j})}{1 + exp (θ_{i k}^{†} - b_{j})} .
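The composite ability in the MRMLC is a running sum, which the following sketch makes explicit. The function names and the example ability vector are hypothetical and serve only to illustrate the two equations above.

```python
import math
from itertools import accumulate

def composite_abilities(theta_i):
    """Running sums: initial ability plus all changes up to each occasion."""
    return list(accumulate(theta_i))

def mrmlc_prob(theta_i, k, b_j):
    """P[x_ijk = 1] at occasion k (1-indexed) for an item of difficulty b_j."""
    theta_dagger = composite_abilities(theta_i)[k - 1]
    return 1.0 / (1.0 + math.exp(-(theta_dagger - b_j)))

# Hypothetical subject: initial ability 0.5, then two changes of 0.25 each.
theta_i = [0.5, 0.25, 0.25]
print(composite_abilities(theta_i))
# At occasion 3 the composite ability equals the item difficulty of 1.0.
print(mrmlc_prob(theta_i, 3, 1.0))
```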

In this article, we propose an IRT growth model for learning, in which items or testlets have their own learning parameters that may be estimated from data. We demonstrate an estimation procedure for its learning parameters. Then, we assess the finite sample properties in simulation under a variety of conditions. A separate simulation study is then conducted to examine the gains in learning efficiency that a learning model may afford while paired with an adaptive item selection algorithm. Finally, we fit our learning model to a real dataset that involves an intervention to help students learn spatial rotation skills.

IRT Models for Learning

The models we consider can be broken into two distinct parts, the measurement model and the latent growth model. The measurement model we use is the one-parameter logistic (1PL) model, $\operatorname{logit} (\Pr [y_{i j} = 1 | θ_{i}]) = α (θ_{i} - b_{j})$ , where $α$ is a discrimination parameter. In cases where the item difficulties can be modeled by characteristics indexed in a Q-matrix, one appealing option is to constrain the item difficulties by the LLTM.

Proposed Models

The other component of an IRT growth model describes how learning or growth takes place. We propose a new model called the IRT growth model with item-specific learning parameters (ISLPs). The ISLP model can be used to explain the ability growth of an examinee throughout the course of an assessment. These assessments are broken into groups of items called blocks. For the sake of simplicity, we suppose that blocks have the same number of items and that items are presented in the same order within each block. Blocks can be presented in any order to the examinees. After an examinee completes each block of items, they are given an intervention that corresponds to that block. An intervention could be something as simple as providing feedback explaining the solutions to incorrect items from the previous block or something more complex such as an interactive applet the examinee can use to learn. Larger block sizes may correspond to an assessment that is more focused on measurement. Smaller block sizes indicate more frequent interventions and may be the focus of a formative assessment. For instance, a block size of one would correspond to providing an intervention after each item.

We assume that ability growth only takes place during the intervention phase and model the learning benefit provided by each intervention, $Δ_{k}$ , $k \in \{1, ..., T\}$ , where T represents the total number of blocks. We will also use the following notation: item $j \in \{1, ..., J\}$ , examinee $i \in \{1, ..., n\}$ , and time (measured in blocks), $t \in \{1, ..., T\}$ . $J_{t}$ represents the set of items contained in the block presented at time t. Since these blocks can be given in different orders, we will use parentheses in the indices to denote individual-specific permuted orders of the parameters. For example, for any given individual, $Δ_{(t)}$ represents the effect of the intervention presented at time t. The difficulty of the jth item an individual encounters on the exam is represented by $b_{(j)}$ .

For example, consider an assessment of length $J = 50$ , which is divided into five blocks of ten questions each, where block 1 = Items 1 through $10$ , block 2 = Items $11$ through $20$ , and so on. Suppose that an examinee is presented with item blocks in the following order: ( $3, 4, 5, 1, 2$ ). The first block the examinee encounters at $t = 1$ is block 3, comprised of items $j = 21, 22, ..., 30$ , which can be denoted collectively as $J_{1}$ . The items in $J_{1}$ (block 3) correspond to an intervention with effect $Δ_{(1)} = Δ_{3}$ . The first item encountered is Item $21$ , with difficulty parameter $b_{(1)} = b_{21}$ . The proposed growth models are as follows:

ISLP Model: $θ_{i, t + 1} = θ_{i, t} + Δ_{(t)} \cdot e^{- (\underset{j \in J_{t}}{mean} (| θ_{i, t} - b_{j} |))} .$

The ISLP asserts that more learning takes place if the item difficulties are similar to the ability level of the subjects. If an item is much too difficult for an examinee, little can be gained from providing a teaching intervention afterwards. Similarly, if an item is so easy that it requires little thought, we can also expect little growth. This model scales the intervention effects by a dampening factor of $e^{- (\underset{j \in J_{t}}{mean} (| θ_{i, t} - b_{j} |))}$ , where $J_{t}$ is the set of items given at time t. In this model, the greatest benefit from any particular question is realized when $θ_{i, t} = b_{j}$ . The ISLP can be extended to reflect a fatigue factor or to take into account the diminishing marginal returns of practice over time.

Extended ISLP Model: $θ_{i, t + 1} = θ_{i, t} + Δ_{(t)} \cdot e^{- (\underset{j \in J_{t}}{mean} (| θ_{i, t} - b_{j} |))} \cdot e^{- λ (t - 1)}$ .

This model includes an extra parameter, $λ$ , which multiplicatively affects $Δ_{(t)}$ through an additional dampening term $e^{- λ (t - 1)}$ . In this model, there are no diminishing returns on the first intervention presented, $Δ_{(1)}$ , but further interventions have reduced benefit. When $λ = 0$ , the extended model is equivalent to the original ISLP model.

On the other hand, the ISLP can be reduced into a model that has no dampening terms.

Reduced ISLP Model: $θ_{i, t + 1} = θ_{i, t} + Δ_{(t)}$ .

The reduced model is the simplest and assumes that the ability growth caused by the $t$th intervention is a parameter $Δ_{(t)}$ , which is unrelated to item difficulty or a subject’s ability level.
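The three growth equations differ only in which dampening terms are switched on, so they can be sketched with a single update function. This is an illustrative sketch and not the authors' code; the function name, the `damp_by_distance` flag, and the numerical values are assumptions.

```python
import math

def islp_update(theta, delta, block_difficulties, t=1, lam=0.0,
                damp_by_distance=True):
    """One growth step theta_{i,t} -> theta_{i,t+1}.

    Original ISLP: damp_by_distance=True,  lam=0.
    Extended ISLP: damp_by_distance=True,  lam>0.
    Reduced ISLP:  damp_by_distance=False, lam=0.
    """
    gain = delta
    if damp_by_distance:
        # Mean absolute distance between ability and the block's difficulties.
        mean_dist = sum(abs(theta - b) for b in block_difficulties)
        mean_dist /= len(block_difficulties)
        gain *= math.exp(-mean_dist)
    # Time dampening; equals 1 at t = 1 or when lam = 0.
    gain *= math.exp(-lam * (t - 1))
    return theta + gain

# Hypothetical block whose difficulties are one logit away from the ability.
block = [1.0, -1.0]
print(islp_update(0.0, 1.0, block))                          # original
print(islp_update(0.0, 1.0, block, t=3, lam=0.1))            # extended
print(islp_update(0.0, 1.0, block, damp_by_distance=False))  # reduced
```

In the first call the full benefit $Δ_{(t)} = 1$ is scaled by $e^{-1}$ because every item in the block is one logit away from the current ability.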

Estimation of Learning Parameters

Markov chain Monte Carlo (MCMC) is an attractive choice for the estimation of learning parameters due to the wide variety of possible models that can be considered and the varying difficulty of numerical estimation problems that arise. In this section, we give an example where the 1PL model is used as the measurement model and the extended ISLP model is used as the learning model. However, the following procedure can be modified to fit any model.

Learning model parameters may be estimated by using the Metropolis-Hastings algorithm to simulate draws from their conditional posterior distributions. We first assume that the item difficulty parameters and discrimination parameter are known. This could correspond to a scenario where one has access to a calibrated item bank from a testing company and wishes to evaluate the efficacy of newly created learning tools pertaining to those items. The structure for estimating the extended ISLP model learning parameters, { $Δ_{1},..., Δ_{T}$ }, under the 1PL is as follows:

MCMC Parameters:

Ω = \{Δ_{1},..., Δ_{T}, θ_{1, 1},..., θ_{i,1},..., θ_{n,1}, λ\} .

Joint Posterior Distribution:

f (Ω | X) \propto L (X | Ω) f (Ω) = [\prod_{i = 1}^{n} \prod_{j = 1}^{J} f_{X_{i j}} (x_{i j} | Ω)] [\prod_{k = 1}^{T} f_{Δ_{k}} (Δ_{k})] [\prod_{i = 1}^{n} f_{θ} (θ_{i,1})] f_{λ} (λ),

X_{i j} \overset{indep.}{\sim} Bernoulli (p_{i j}), \quad p_{i j} = \frac{e^{α (θ_{i, t} - b_{(j)})}}{1 + e^{α (θ_{i, t} - b_{(j)})}},

where $θ_{i, t}$ follows the extended ISLP model.

Parameter starting values may be initialized arbitrarily. For example, they may be randomly drawn from their respective prior distributions. Or, baseline data may be used to estimate initial values for ability parameters using standard Gaussian quantiles based on proportion correct to speed up convergence. Candidate parameters may be proposed using a Gaussian distribution with mean at the current value, and standard deviations tuned to yield the desired acceptance rates, adjusting for different models or prior distributions.

For every parameter, the Metropolis-Hastings acceptance ratio, a, can be calculated and compared to a randomly generated Uniform(0,1) random variable, U. If $U < a$ , the proposal value is accepted for the next iteration. Parameters are estimated by taking their posterior means after removing burn-in.

If the item difficulty parameters are unknown, they can either be estimated first with a method such as MML, or jointly with MCMC. If the difficulty parameters are estimated first, they can then be fixed for the MCMC estimation of learning model parameters.
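The parameter-by-parameter updating described above can be sketched as a generic random-walk Metropolis sampler. This is a sketch under stated assumptions and not the authors' implementation: the function names are ours, and `toy_log_post` is a hypothetical stand-in for the log of the joint posterior $f (Ω | X)$, used only so that the example runs on its own. Because the Gaussian proposal is symmetric, the acceptance ratio reduces to a ratio of posterior densities.

```python
import math
import random

def random_walk_metropolis(log_post, init, step_sizes, n_iter, burn_in, rng):
    """Update one parameter at a time; return posterior means after burn-in."""
    current = list(init)
    current_lp = log_post(current)
    kept = []
    for it in range(n_iter):
        for p in range(len(current)):
            proposal = list(current)
            # Gaussian proposal centered at the current value.
            proposal[p] = rng.gauss(current[p], step_sizes[p])
            proposal_lp = log_post(proposal)
            # Acceptance ratio a = min(1, posterior ratio), on the log scale.
            a = math.exp(min(0.0, proposal_lp - current_lp))
            if rng.random() < a:
                current, current_lp = proposal, proposal_lp
        if it >= burn_in:
            kept.append(list(current))
    return [sum(d[p] for d in kept) / len(kept) for p in range(len(init))]

# Hypothetical target: two parameters with Uniform(0, 3) priors and a
# Gaussian-shaped likelihood centered at 0.4 and 2.0 (standard deviation 0.1).
TARGET_MEANS = [0.4, 2.0]

def toy_log_post(params):
    if any(p < 0.0 or p > 3.0 for p in params):
        return -math.inf          # outside the prior support
    return sum(-0.5 * ((p - m) / 0.1) ** 2
               for p, m in zip(params, TARGET_MEANS))

rng = random.Random(0)
print(random_walk_metropolis(toy_log_post, [1.5, 1.5], [0.3, 0.3],
                             n_iter=6000, burn_in=1000, rng=rng))
```

To use the sampler with a learning model, `log_post` would be replaced by a function that computes each $θ_{i, t}$ from the growth equation and sums the Bernoulli log-likelihood and the log-priors; the step sizes would then be tuned to reach the desired acceptance rates.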

Simulation Studies

The first study aims to see how well the learning model parameters for the ISLP can be recovered using MCMC when the item difficulties are known. The second study examines how well ISLP learning parameters can be recovered when item difficulties are unknown but are assumed to follow the LLTM. We expand this second study to look at the effect on learning parameter estimation when random effects (RE) are incorporated into the ISLP model to generate the data. We also look at parameter estimation under the reduced and extended ISLP models.

The third simulation study gives an example of an application of pre-estimated ISLP models in the context of an assessment utilizing adaptive item selection. Assuming that the ISLP is the true growth model, the growth of the ability level, $θ$ , is estimated in real time throughout the assessment, and the difference in ability growth using adaptive item selection is compared against two other methods of item selection.

Learning Parameter Estimation With Known Item Difficulties

Using the ISLP as the growth model, random response data were generated with different sample sizes ( $n = 500$ and $n = 2{,}000$ ) and test lengths ( $J = 30$ and $J = 50$ ). Initial ability parameters were randomly generated: $θ_{i,1} \sim N (0, 1)$ . Item difficulty parameters were generated using the structure of the LLTM and the same Q-matrix as the data set from Wang et al. (2018) and were assumed to be known. When $J = 30$ , we used the Q-matrix entries corresponding to the first 30 items. We set $η = (1, 2, 3, 4)$ and $c = - 3$ to generate the item difficulties. Responses to each item were generated from a Bernoulli distribution with probability $p_{i j}$ given by the current ability and item difficulty under the 1PL. The discrimination parameter, $α$ , was set equal to 1 and also treated as known.

Questions were divided into five blocks of either ten questions ( $J = 50$ ) or six questions ( $J = 30$ ). Subjects were randomly assigned into five test groups. Each group was presented with the blocks in a different order. Each block corresponded to an intervention with learning parameter $Δ_{k}$ . For all simulation conditions, the true values of $Δ_{k}$ were fixed at the following values: $Δ_{1} = 0.4$ , $Δ_{2} = 0.8$ , $Δ_{3} = 1.2$ , $Δ_{4} = 1.6$ , and $Δ_{5} = 2.0$ . These values were chosen to cover a wide range to see how well MCMC was able to distinguish the more helpful interventions from the less helpful ones. The MCMC algorithm described in the previous section was used to estimate $\{Δ_{1},..., Δ_{5}, θ_{1, 1},..., θ_{i,1},..., θ_{n,1}\}$ , omitting $λ$ from the estimation procedure. The following priors were used:

Δ_{1},..., Δ_{5} \overset{i.i.d.}{\sim} Uniform (0, 3),

θ_{1, 1},..., θ_{n,1} \overset{i.i.d.}{\sim} N (0, 1) .

The following proposal step sizes were tuned to give acceptance rates around 20% for $Δ_{1},..., Δ_{5}$ , and acceptance rates ranging between 20% and 40% for each $θ_{i,1}$ .

Δ_{1},..., Δ_{5} : σ_{Δ} = 0.7,

θ_{1, 1},..., θ_{n,1} : σ_{θ} = 1.5.

For each simulation condition, 25 chains of length 10,000 iterations were run. Starting values for item learning parameters were drawn randomly from their prior distributions, and starting values for $θ_{i,1}$ were estimated using standard Gaussian quantiles based on proportion correct in the first block of items to help the chains converge. For all parameters, chains appeared to converge within the first 1,000 iterations with Gelman–Rubin statistics all less than 1.05 by then. The first 2,000 iterations were removed as burn-in. The absolute deviation between the posterior mean of each of $Δ_{1},..., Δ_{5}$ and its true value was averaged over all repetitions. Increasing the test length from 30 items to 50 items had a far greater effect on reducing the mean absolute deviation (MAD) than increasing the sample size from 500 to 2,000. The best performance was achieved at the largest test condition ( $J = 50$ and $n = 2{,}000$ ), with overall MAD = 0.101 over 25 repetitions. For reference, this corresponded to an average correlation of $ρ = 0.98$ between the true and predicted $Δ$ s over 25 repetitions. MAD results for estimating the ISLP learning parameters under all four conditions are shown in Table 1.

Table 1.

Mean Absolute Deviation in the Estimation of $Δ_{k}$ s for ISLP Model Using 1PL Measurement Model With Item Difficulties Known

Conditions		Mean Absolute Deviation
n = 500	J = 30	.225
n = 500	J = 50	.135
n = 2,000	J = 30	.220
n = 2,000	J = 50	.101

Note. Results are averaged across 25 repetitions. ISLP = item-specific learning parameter; 1PL = one-parameter logistic.

Estimation of Learning Parameters Under the LLTM

Next, we analyzed the performance of the model under the assumption that the LLTM holds but item difficulty parameters are unknown. To estimate the item difficulty parameters, we assumed that all items can be studied at baseline prior to learning interventions. The underlying distribution of each initial ability was assumed to be $N (0, 1)$ , which enabled the identification of the discrimination parameter. Using responses from the first block of items, MML estimation was then used to obtain estimates of all item difficulty parameters under the 1PL model, $b_{j}^{*}$ . When the question difficulties truly follow the LLTM, we can get better accuracy by applying the constraints of the LLTM to the item difficulty estimates. This can be done by projecting the vector of $b_{j}^{*}$ values onto the column space of the Q-matrix, augmented with a leading column of ones for the constant $c$ :

{\hat{b}}_{J \times 1} = Q {(Q^{t} Q)}^{- 1} Q^{t} b_{J \times 1}^{*}, Q = [\begin{matrix} 1 & q_{11} & q_{12} & \dots & q_{1 m} \\ 1 & q_{21} & q_{22} & \dots & q_{2 m} \\ ⋮ & ⋮ & ⋮ & ⋱ & ⋮ \\ 1 & q_{J 1} & q_{J 2} & \dots & q_{J m} \end{matrix}] .

The vector ${\hat{b}}_{J \times 1}$ contains the resulting item difficulties under the LLTM, such that any two items with the same Q-matrix entries will be estimated to have the same difficulty. In our simulations, when $n = 2{,}000$ and $J = 50$ , the average root mean square error for estimating the true values of item difficulty over all items and repetitions was 0.1356 using MML alone and 0.046 using this additional projection method.
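The projection step can be sketched with NumPy's least-squares solver, which yields the same fitted values as $Q {(Q^{t} Q)}^{- 1} Q^{t} b^{*}$ without forming the inverse explicitly. The function name, the small Q-matrix, and the unconstrained estimates below are hypothetical and chosen only for illustration.

```python
import numpy as np

def project_onto_q(b_star, q_entries):
    """Project unconstrained difficulty estimates onto the column space of Q.

    q_entries holds the J x m Q-matrix entries; a leading column of ones
    is added for the constant term, as in the matrix Q defined above.
    """
    q_entries = np.asarray(q_entries, dtype=float)
    Q = np.column_stack([np.ones(len(q_entries)), q_entries])
    coef, *_ = np.linalg.lstsq(Q, np.asarray(b_star, dtype=float), rcond=None)
    return Q @ coef

# Hypothetical example: six items, two operations, three distinct Q rows.
q_entries = [[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [1, 1]]
b_star = [0.9, 1.1, 2.1, 1.9, 3.2, 2.8]   # assumed noisy MML estimates

print(project_onto_q(b_star, q_entries))
```

In this small example the items that share a Q-matrix row receive a common projected difficulty, which illustrates how the constraint pools information across items.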

Once the item difficulty parameters have been estimated in this way, they are fixed and used as the true values of $b_{j}$ in the MCMC step. In this simulation, the same process is repeated as in the first simulation study. The original, reduced, and extended ISLP models were all used as the true model to generate response data, while varying test length (J) and the number of examinees (n). For each model, simulations were performed with $J = 30$ and $J = 50$ , and $n = 500$ and $n = 2{,}000$ , for a total of 12 simulated conditions. The true discrimination parameter, $α$ , was set to 1, then estimated using MML along with the item difficulty parameters. For the simulations using the extended ISLP model, the true value of $λ$ was set equal to 2.

For each condition, item and learning parameters were fixed and 25 replications were performed. Proposal steps were the same as before. For each replication, a chain of length 10,000 was run with a burn-in segment of 2,000. The posterior means of $Δ_{1},..., Δ_{5}$ were compared to the true values using the MAD over 25 replications, with results reported in Table 2. For the original ISLP model, the results were very close to those in the previous study. This is due to the fact that we were able to estimate the item difficulties with very high accuracy. When $n = 2{,}000$ and $J = 50$ , the MAD over 25 replications in the original ISLP model was 0.1199; for the reduced and extended ISLP models, the MADs were 0.0673 and 0.2433, respectively. Density plots of the posterior distributions of $Δ_{1}$ through $Δ_{5}$ for a single trial under the reduced model, with burn-in removed, are shown in Figure 1. Here, the vertical dotted lines represent the target (true) values of $Δ_{1}$ through $Δ_{5}$ , which were well-approximated by their posterior means in this simulation.

Table 2.

Mean Absolute Deviation in Estimation of $Δ_{k}$ s for Different Models Averaged Across 25 Repetitions

Conditions		ISLP Model
n	J	Reduced	Original	Extended
500	30	.1310	.2736	.3296
500	50	.0876	.2005	.2413
2,000	30	.0994	.2227	.2506
2,000	50	.0673	.1199	.2433

Note. ISLP = item-specific learning parameter.

Figure 1.

Posterior distributions of $Δ_{1}$ through $Δ_{5}$ (left to right) in the reduced item-specific learning parameter model.

Sensitivity Analysis

The proposed ISLP models all make the assumption that subjects with the same $θ$ will benefit equally when presented with the same items and interventions. However, it is reasonable to assume that individuals may have different rates of learning. In this sensitivity analysis, we investigated whether different learning trajectories would have an effect on the estimation of the values of the learning parameters. These different learning trajectories can be modeled by incorporating an RE for each individual, in the form of a multiplicative error $ε_{i}$ , into the ISLP. Within each analysis, the $ε_{i}$ were assumed to be independent and identically distributed.

ISLP-RE Model:

θ_{i, t + 1} = θ_{i, t} + (Δ_{(t)} \cdot ε_{i}) \cdot e^{- (\underset{j \in J_{t}}{mean} (| θ_{i, t} - b_{j} |))}, t = 1, 2, ..., T .

The sensitivity analysis was repeated for various distributions of $ε_{i}$ : the data were generated from the ISLP-RE model, while the values of $Δ_{1}$ through $Δ_{5}$ were estimated under the ISLP model. The $ε_{i}$ can be modeled by any distribution; here they were used only to generate the data and were not estimated. Each $ε_{i}$ was generated by sampling from a given beta distribution with shape parameters $α$ and $β$ (here $α$ denotes a shape parameter, not the discrimination parameter) and then multiplying by $\frac{α + β}{α}$ , the reciprocal of the beta mean, to give a mean of 1. Because every RE distribution then has the same mean, the MAD results for the estimation of the $Δ$ s can be compared across conditions.

These underlying beta distributions for the $ε_{i}$ s were selected to represent varying distributional shapes for the RE. For example, Beta(1,1) makes a uniform distribution and Beta(5,1) makes a skewed distribution of the RE (Figure 2).
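The rescaling of the random effects can be sketched with Python's standard library. The function name and the sample size are assumptions made for illustration; `random.Random.betavariate` draws from a beta distribution with the given shape parameters.

```python
import random

def scaled_beta_effects(n, a, b, rng):
    """Draw n multiplicative random effects with mean 1.

    A Beta(a, b) variable has mean a / (a + b), so multiplying by
    (a + b) / a rescales the draws to have mean 1.
    """
    scale = (a + b) / a
    return [scale * rng.betavariate(a, b) for _ in range(n)]

rng = random.Random(2024)
for a, b in [(10, 10), (5, 5), (2, 2), (1, 1), (5, 1)]:
    eps = scaled_beta_effects(20000, a, b, rng)
    # The sample mean should be close to 1 for every shape.
    print((a, b), round(sum(eps) / len(eps), 3))
```

Note that the rescaled effects range from 0 to $\frac{α + β}{α}$ , so the symmetric shapes allow a subject's learning gain to be at most doubled, whereas Beta(5,1) caps the multiplier at 1.2.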

Figure 2.

Beta distributions used to generate $ε_{i}$ s in sensitivity analysis.

Table 3 shows a comparison of the results from the ISLP model and ISLP-RE model. The results showed that estimates for $Δ_{k}$ were quite accurate, mostly unbiased, and not sensitive to the presence of individual RE in the learning trajectories. The simulation results suggest that the estimated values of $Δ_{k}$ would still remain proportional relative to each other and could still be used to identify the best interventions.

Table 3.

Sensitivity Analysis: Mean Absolute Deviation in the Estimation of $Δ_{k}$ s in ISLP-RE Model

Condition		Model
n	J	ISLP	Beta(10,10)	Beta(5,5)	Beta(2,2)	Beta(1,1)	Beta(5,1)
500	30	.274	.347	.261	.265	.369	.313
500	50	.201	.197	.202	.200	.189	.249
2,000	30	.223	.233	.243	.199	.184	.220
2,000	50	.120	.122	.120	.140	.102	.119

Note. ISLP = item-specific learning parameter; RE = random effects.

Adaptive Item Selection

Next, we demonstrate how the ISLP models can be implemented in an assessment to choose the best items for an individual using adaptive item selection. The total increase in ability parameters for all individuals from the beginning to the end of an assessment is measured and compared against two other methods of item selection, which do not depend on the ISLP model: (1) random item selection and (2) selecting items with the highest $Δ$ without regard to item difficulty or ability level. For the true model, we used four different ISLP models: the original model and three variants of the extended ISLP with $λ$ = 0.1, 0.05, and 0.02. The reduced ISLP model was not used in this study because it would be unrealistic for ability levels to increase without some kind of diminishing returns due to the large number of interventions.

Item difficulty and learning parameters were assumed to be known. True initial ability parameters, $θ_{i,1}$ , were assumed to be unknown and were randomly drawn from a standard Gaussian distribution for a sample of size $n = 100$ individuals. For each variation of the ISLP model, the following procedure was repeated $25$ times, generating new values of $θ_{i,1}$ at each repetition to simulate $25$ batches of subjects. Results for all repetitions were aggregated for analysis.

A test of length $J = 50$ was administered from a test bank of $100$ items, with item difficulties also generated from a $N (0, 1)$ distribution. Each item could only be administered one time. The test bank used the exact same items for all repetitions. To reflect a formative assessment, each single item was considered as a block and had its own associated learning intervention.

For initial estimation of ability levels, a calibration set of 10 predetermined questions of varying difficulty levels was administered one question at a time. Based on the difficulty of these questions and the responses to them, maximum likelihood estimation was used to obtain the estimates of the initial abilities, ${\hat{θ}}_{i,1}$ . In the adaptive selection method, we used the estimated values ${\hat{θ}}_{i, t}$ to make item selections starting with each subject’s 11th item. At the end of the assessment, the true final ability values, $θ_{i,51}$ , under adaptive item selection were compared with the true final ability values under the other two methods.

To illustrate the process for updating $θ_{i, t}$ and ${\hat{θ}}_{i, t}$ , we will examine the first few steps of the procedure for subject i, with initial ability $θ_{i,1}$ . Subject i’s response to their first item is randomly generated from a Bernoulli( $p_{i 1}$ ) distribution, where $p_{i 1}$ is the 1PL model probability that a subject with ability $θ_{i,1}$ successfully answers a question of difficulty $b_{(1)}$ . To estimate $θ_{i,1}$ , we calculate its MLE, ${\hat{θ}}_{i,1}$ , using the pmf of the Bernoulli( $p_{i 1}$ ) distribution, where $p_{i 1}$ is specified by the 1PL. To avoid unreasonably small or large estimates, we restrict the parameter space of $θ$ to be between $- 4$ and 4. Next, the second item, with item difficulty $b_{(2)}$ , is chosen. Based on the ISLP, the ability level of subject i going into their second item is $θ_{i,2} = θ_{i,1} + Δ_{(1)} e^{- (| θ_{i,1} - b_{(1)} |)}$ .

$θ_{i,2}$ is then used to generate another random response to Item 2 from the Bernoulli( $p_{i 2}$ ) distribution based on the 1PL. From here, the following cycle is repeated until the end of the assessment: estimate the current ability parameter, choose an item, update the true theta based on the item’s learning intervention, generate a random response, and repeat. To estimate ${\hat{θ}}_{i,2}$ , note that we can write the estimate of all remaining ${\hat{θ}}_{i, t}$ s recursively in terms of ${\hat{θ}}_{i,1}$ . For example,

${\hat{θ}}_{i,2} = {\hat{θ}}_{i,1} + Δ_{(1)} e^{- (| {\hat{θ}}_{i,1} - b_{(1)} |)},$

${\hat{θ}}_{i,3} = {\hat{θ}}_{i,1} + Δ_{(1)} e^{- (| {\hat{θ}}_{i,1} - b_{(1)} |)} + Δ_{(2)} e^{- (| {\hat{θ}}_{i,2} - b_{(2)} |)} .$

Then, the current ability estimates can be found in real time through recursive application of the ISLP model after computing the MLE, ${\hat{θ}}_{i,1}$ , with the data.
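The recursion above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, item difficulties and learning parameters are treated as known, and the bounded MLE of the initial ability is found by a simple grid search over $[-4, 4]$ on the likelihood implied by the ISLP trajectory.

```python
import math


def p_correct(theta, b):
    """1PL probability that ability theta answers an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def islp_trajectory(theta1, difficulties, deltas):
    """Abilities theta_1, ..., theta_{J+1} implied by the ISLP recursion
    theta_{t+1} = theta_t + delta_t * exp(-|theta_t - b_t|)."""
    thetas = [theta1]
    for b, delta in zip(difficulties, deltas):
        current = thetas[-1]
        thetas.append(current + delta * math.exp(-abs(current - b)))
    return thetas


def log_likelihood(theta1, responses, difficulties, deltas):
    """Bernoulli log-likelihood of the responses as a function of theta_1."""
    thetas = islp_trajectory(theta1, difficulties, deltas)
    total = 0.0
    for x, theta, b in zip(responses, thetas, difficulties):
        p = p_correct(theta, b)
        total += math.log(p) if x == 1 else math.log(1.0 - p)
    return total


def grid_mle_theta1(responses, difficulties, deltas, lo=-4.0, hi=4.0, step=0.01):
    """Grid-search MLE of theta_1, restricted to [lo, hi] (assumed grid step)."""
    n_points = int(round((hi - lo) / step)) + 1
    grid = [lo + k * step for k in range(n_points)]
    return max(grid, key=lambda t: log_likelihood(t, responses, difficulties, deltas))


if __name__ == "__main__":
    # Hypothetical calibration items and responses, for illustration only.
    b_calib = [-1.5, -0.5, 0.0, 0.5, 1.5]
    d_calib = [0.1, 0.1, 0.1, 0.1, 0.1]
    x_calib = [1, 1, 1, 0, 0]
    theta1_hat = grid_mle_theta1(x_calib, b_calib, d_calib)
    print(islp_trajectory(theta1_hat, b_calib, d_calib))
```

Once the estimate of the initial ability is available, every later estimate follows from `islp_trajectory` without further optimization, which is what makes real-time updating inexpensive.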

Since the same calibration set was used in all three methods of item selection, the estimates of ability levels at the end of the calibration set, ${\hat{θ}}_{i,11}$ , were the same for all three methods. In adaptive item selection, starting at the examinee’s $t = 11$ th item, the item with difficulty closest to ${\hat{θ}}_{i, t}$ was chosen, as it would yield the greatest learning based on the model. Once an item was selected, the true value of $θ_{i, t}$ was updated based on the ISLP model and the item’s corresponding learning parameter, $Δ_{j}$ . The true $θ_{i, t}$ values were treated as unknown and not used for item selection.

Starting from Item $11$ , the other two methods of item selection were also used to select the remaining items. In random item selection, items were randomly chosen from the test bank. For the maximum $Δ$ method, the item from the test bank with the largest $Δ$ was selected from the remaining items. For all of these methods, the true values of the $θ_{i, t}$ s were updated according to the ISLP model after each item was selected.
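The three selection rules can be compared with a small simulation sketch. Several pieces here are assumptions made for illustration: the function names, the Uniform(0, 0.3) distribution for the learning parameters, and the noisy starting estimate that stands in for the calibration stage. Response generation is omitted because, in this sketch, responses after calibration do not affect the ability trajectory.

```python
import math
import random


def islp_update(theta, b, delta):
    """Original ISLP update (lambda = 0) after an item's intervention."""
    return theta + delta * math.exp(-abs(theta - b))


def pick_adaptive(theta_hat, available, bank):
    """Item whose difficulty is closest to the current ability estimate."""
    return min(available, key=lambda j: abs(bank[j][0] - theta_hat))


def pick_max_delta(available, bank):
    """Item with the largest learning parameter among the remaining items."""
    return max(available, key=lambda j: bank[j][1])


def pick_random(available, rng):
    """Item drawn uniformly at random from the remaining items."""
    return rng.choice(sorted(available))


def run_method(method, theta_true, theta_hat, bank, n_items, rng):
    """Administer n_items without replacement; return the final true ability.
    bank is a list of (difficulty, delta) pairs."""
    available = set(range(len(bank)))
    for _ in range(n_items):
        if method == "adaptive":
            j = pick_adaptive(theta_hat, available, bank)
        elif method == "max_delta":
            j = pick_max_delta(available, bank)
        else:
            j = pick_random(available, rng)
        available.remove(j)
        b, delta = bank[j]
        # The true ability is never used for selection, only updated.
        theta_true = islp_update(theta_true, b, delta)
        theta_hat = islp_update(theta_hat, b, delta)
    return theta_true


if __name__ == "__main__":
    rng = random.Random(2024)
    # Hypothetical bank: N(0, 1) difficulties, Uniform(0, 0.3) deltas (assumed).
    bank = [(rng.gauss(0.0, 1.0), rng.uniform(0.0, 0.3)) for _ in range(100)]
    theta1 = rng.gauss(0.0, 1.0)
    theta1_hat = theta1 + rng.gauss(0.0, 0.5)  # stand-in for calibration error
    for method in ("random", "max_delta", "adaptive"):
        final = run_method(method, theta1, theta1_hat, bank, 40, rng)
        print(method, round(final - theta1, 3))
```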

Adaptive item selection results

Figures 3 and 4 show that the vast majority of subjects using adaptive item selection ended the assessment with a higher final ability level versus the other two methods regardless of $λ$ . For example, in the original ISLP model ( $λ = 0$ ), roughly 94% of all subjects ended the assessment with a higher final $θ$ when using adaptive selection versus maximum $Δ$ selection. Compared to random item selection, adaptive testing performed even more favorably. Table 4 shows the proportion of subjects for which adaptive item selection outperformed the other two methods for all models. Table 5 shows the mean improvement in $θ$ over the course of the assessment for all methods and models. The results show that the smaller values of $λ$ yielded more of a benefit for the adaptive selection method. As $λ$ increased, the improvement in $θ$ shrank, leading to less differentiation between the methods of item selection.

Figure 3.

Initial ability ( $θ_{i, 1}$ ) versus final ability ( $θ_{i, 51}$ ) for n = 2,500 examinees: item-specific learning parameter model.

Figure 4.

Initial ability ( $θ_{i, 1}$ ) versus final ability ( $θ_{i, 51}$ ) for n = 2,500 examinees: extended item-specific learning parameter model, $λ = 0.05$ .

Table 4.

Proportion of Examinees With Higher Final $θ$ Using Adaptive Item Selection Versus Random and Maximum $Δ$ Item Selection

Conditions	Adaptive Versus Random	Adaptive Versus Maximum $Δ$
$λ = 0$	.9960	.9416
$λ = 0.02$	.9988	.9628
$λ = 0.05$	.9952	.9220
$λ = 0.1$	.9458	.7752

Note. Adaptive selection resulted in higher performance for a vast majority of subjects.

Table 5.

Total Mean Improvement $(θ_{i, 51} - θ_{i, 1})$ by Item Selection Method for Different ISLP Models

ISLP Model	Random	Maximum $Δ$	Adaptive
$λ = 0$ (Original)	3.0144	3.2870	3.4956
$λ = 0.02$	2.4272	2.7054	2.9547
$λ = 0.05$	1.7299	1.9696	2.1964
$λ = 0.1$	1.0324	1.1708	1.2558

Note. ISLP = item-specific learning parameter.
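
As a worked reading of Table 5 (simple differences of the reported means, not an additional analysis), the advantage of adaptive selection over the two baselines at the two extremes of $λ$ is

$λ = 0 : \quad 3.4956 - 3.0144 = 0.4812 \ \text{(versus random)}, \quad 3.4956 - 3.2870 = 0.2086 \ \text{(versus maximum } Δ \text{)},$

$λ = 0.1 : \quad 1.2558 - 1.0324 = 0.2234 \ \text{(versus random)}, \quad 1.2558 - 1.1708 = 0.0850 \ \text{(versus maximum } Δ \text{)},$

so the absolute gap between methods at $λ = 0.1$ is less than half of the gap at $λ = 0$ .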

Real Data Analysis

We analyzed data from a computer-based assessment of spatial reasoning ability. A total of 350 subjects, selected from a University of Illinois Department of Psychology paid subject pool, were given a series of 50 items. Items were presented in blocks of 10 questions with a learning intervention after each block. Subjects were randomly assigned to one of the five test versions, which presented the blocks in different orders. This test was developed based on the Purdue Spatial Visualization Test (PSVT; Yoon, 2011). Thirty of the questions were taken from the PSVT, while 20 of the questions were created by Wang et al. (2018). The dataset was obtained from the hmcdm (hidden Markov cognitive diagnosis models for learning) package in R (Zhang et al., 2018).

The items in this assessment consist of operations requiring either 90° rotations, 180° rotations, or both. There are seven distinct combinations in the Q-matrix structure. Because a Q-matrix underlies item construction, the LLTM initially appeared to be an appropriate choice for the measurement model. The plot in Figure 5 shows the proportion of correct answers for each item grouped by their Q-matrix entries. Questions involving two operations were more difficult overall than items involving a single operation. However, based on the proportion correct, items with the same Q-matrix entries have a wide range of difficulties. We concluded that the LLTM was not an appropriate measurement model for this dataset because, when it holds, items with the same Q-matrix entries should be answered correctly in very similar proportions. The 1PL model without LLTM constraints was selected as the measurement model.

Figure 5.

Scatterplot and violin plot of questions (grouped by Q matrix entries) versus proportion correct.

A further exploratory analysis revealed that the only significant improvement in ability level was realized when an examinee completed their first block and intervention. Table 6 shows that on average, virtually no improvement occurs starting from the second intervention onward for all test versions. This phenomenon may be due to a lack of incentive for the subjects to perform well and declining motivation. If the decline of improvement had been less abrupt, the extended ISLP model could have been a good choice for the learning model. For our analysis, the ISLP was chosen as the learning model, and only data from the first 20 items (two blocks) answered by each examinee were used to estimate the learning parameters.

Table 6.

Performance of Examinees Did Not Increase Significantly After Time 2

Proportion of Correct Responses by Time
Test Version	Time 1	Time 2	Time 3	Time 4	Time 5
1	.776	.823	.826	.787	.809
2	.710	.782	.734	.755	.782
3	.753	.741	.739	.769	.810
4	.710	.761	.795	.784	.811
5	.705	.784	.761	.763	.714

Real data learning parameter estimation

Learning parameters $Δ_{1}$ through $Δ_{5}$ are estimated using the same MCMC algorithm described previously, using the ISLP as the learning model. Since the sample size is smaller ( $n = 350$ ), MML estimation may not be as reliable in estimating the item difficulties. The item difficulties were estimated directly with the learning parameters and initial abilities in MCMC. In this analysis, the specification for the joint posterior distribution is similar to the example provided in the simulation section, except with extra terms corresponding to the item difficulty parameters, $b_{j}$ , and terms corresponding to $λ$ removed. The joint posterior distribution is

$f (Ω | X) \propto L (X | Ω) f (Ω) = [\prod_{i = 1}^{n} \prod_{j = 1}^{J} f_{X_{i j}} (x_{i j} | Ω)] [\prod_{k = 1}^{T} f_{Δ_{k}} (Δ_{k})] [\prod_{i = 1}^{n} f_{θ} (θ_{i,1})] [\prod_{j = 1}^{J} f_{b} (b_{j})],$

$X_{i j} \overset{indep .}{\sim} Bernoulli (p_{i j}), p_{i j} = \frac{e^{α (θ_{i, t} - b_{(j)})}}{1 + e^{α (θ_{i, t} - b_{(j)})}},$

where $θ_{i, t}$ follows the ISLP model.

Prior distributions:

$Δ_{1},..., Δ_{5} \overset{i . i . d}{\sim} Uniform (0, 3),$

$θ_{1, 1},..., θ_{350, 1} \overset{i . i . d}{\sim} N (0, 1),$

$b_{1},..., b_{50} \overset{i . i . d}{\sim} N (0, 4) .$

When specifying the prior distributions for each $Δ_{k}$ , we assumed that any given intervention would yield a positive benefit that does not exceed 3. Initial ability levels, $θ_{i,1}$ , were assumed to come from a standard Gaussian distribution. As shown in Table 6, the proportion of questions answered correctly for all test versions on the first block of items was significantly higher than 0.5. Based on this observation, we assigned a larger variance to the item difficulty priors to make them less informative rather than changing the mean of the priors.

To test convergence, five separate chains of length 10,000 were run. Starting values for each $Δ$ , $θ$ , and $b$ were randomly drawn from their respective prior distributions. Candidate values were proposed from Gaussian distributions centered at the current value, with the following standard deviations tuned to give acceptance rates around 30%:

$Δ_{1},..., Δ_{5}$ : $σ_{Δ} = 1.5$ ; mean acceptance rate = 29.4%,

$b_{1},..., b_{50}$ : $σ_{b} = 1.5$ ; mean acceptance rate = 28.2%,

$θ_{1, 1},..., θ_{350, 1}$ : $σ_{θ} = 1.7$ ; mean acceptance rate = 30.9%.
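
A generic random-walk Metropolis update with a Gaussian proposal can be sketched as follows. This is not the authors' sampler: it updates a single scalar parameter, and `toy_log_post` is a hypothetical stand-in target (a Uniform(0, 3) prior combined with a made-up Gaussian-shaped likelihood) used only to show how the proposal standard deviation drives the acceptance rate.

```python
import math
import random


def toy_log_post(delta):
    """Hypothetical log posterior: Uniform(0, 3) prior times a Gaussian-shaped
    likelihood centered at 1.5 (illustrative only, not the ISLP posterior)."""
    if not 0.0 < delta < 3.0:
        return float("-inf")
    return -0.5 * ((delta - 1.5) / 0.3) ** 2


def metropolis_chain(log_post, start, proposal_sd, n_iter, rng):
    """Random-walk Metropolis for one scalar parameter.
    Returns (samples, acceptance_rate)."""
    current = start
    current_lp = log_post(current)
    samples = []
    accepted = 0
    for _ in range(n_iter):
        # Gaussian proposal centered at the current value.
        candidate = rng.gauss(current, proposal_sd)
        cand_lp = log_post(candidate)
        # Accept with probability min(1, posterior ratio).
        if cand_lp >= current_lp or rng.random() < math.exp(cand_lp - current_lp):
            current, current_lp = candidate, cand_lp
            accepted += 1
        samples.append(current)
    return samples, accepted / n_iter


if __name__ == "__main__":
    rng = random.Random(0)
    # Larger proposal SDs generally lower the acceptance rate, so the SD can
    # be tuned toward a target rate such as the roughly 30% reported above.
    for sd in (0.2, 0.7, 1.5):
        _, rate = metropolis_chain(toy_log_post, 1.0, sd, 5000, rng)
        print(sd, round(rate, 3))
```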

The first 2,000 samples were discarded as burn-in. The posterior means of the remaining 8,000 samples were used to estimate the learning parameters. According to the Gelman–Rubin statistic, all of the parameter estimates appeared to converge well before the first 2,000 samples. An example of a Gelman–Rubin diagnostic plot for $Δ_{1}$ is shown in Figure 6.
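
For reference, the basic Gelman–Rubin potential scale reduction factor for one parameter can be computed from several equal-length chains as sketched below. This is the textbook form of the statistic; the function name is hypothetical, and the software used for Figure 6 may apply additional corrections that are not reproduced here.

```python
import math


def gelman_rubin(chains):
    """Basic potential scale reduction factor for one scalar parameter.
    chains is a list of m equal-length lists of post-burn-in draws."""
    m = len(chains)
    n = len(chains[0])
    means = [sum(chain) / n for chain in chains]
    grand_mean = sum(means) / m
    # Between-chain variance B and mean within-chain variance W.
    between = n / (m - 1) * sum((mu - grand_mean) ** 2 for mu in means)
    within = sum(
        sum((x - mu) ** 2 for x in chain) / (n - 1)
        for chain, mu in zip(chains, means)
    ) / m
    # Pooled estimate of the posterior variance.
    var_plus = (n - 1) / n * within + between / n
    return math.sqrt(var_plus / within)
```

Values close to 1 indicate that the chains agree with one another, whereas values well above 1 suggest that they have not yet mixed.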

Figure 6.
Gelman–Rubin plot showing convergence of $Δ_{1}$ .

The average posterior mean of all item difficulties was $- 1.362$ , and the average posterior mean of all initial $θ$ values was $0.109$ . Averaging the posterior means for learning parameters across all chains gave the following estimates: $Δ_{1} = 0.889$ , $Δ_{2} = 2.169$ , $Δ_{3} = 1.700$ , $Δ_{4} = 1.018$ , and $Δ_{5} = 1.446$ . This indicates that block 2 (Items 11–20) contributed most to learning while block 1 (Items 1–10) contributed the least. Although some of these $Δ$ values appear large at first glance, the actual benefit they impart is greatly dampened by an exponential term in the ISLP.
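
To illustrate the dampening, consider the item-level form of the update shown in the adaptive item selection section, in which an intervention adds $Δ e^{- | θ - b |}$ to the current ability. The ability-difficulty gaps of 0, 1, and 2 used below are hypothetical values chosen for illustration, not estimates from the data. With $Δ_{2} = 2.169$ ,

$2.169 e^{- 0} = 2.169, \quad 2.169 e^{- 1} \approx 0.798, \quad 2.169 e^{- 2} \approx 0.294 .$

Under this form, the full value of $Δ$ is realized only when ability matches difficulty exactly.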

Analyzing this dataset with a subset of the first 20 items is similar to using the extended ISLP with a large value of $λ$ , such that no effective improvement takes place after the second block of items. It should be noted that both the smaller sample size of 350 subjects and shorter effective test length of 20 items may compromise the reliability of our parameter estimates. The magnitude of estimation error can be roughly approximated by extrapolating the results of the simulation studies. Another potential source of error for parameter estimation is that the item bank was not precalibrated, and item difficulties must be jointly estimated. Despite these limitations, the results of this empirical study suggest that even if subjects are presented with the same questions, the order in which they are presented can have a substantial impact on growth.

Summary

The ISLP model was presented, along with a couple of variations, as a model capable of assessing learning at the item level or block level. In this study, it was used with the 1PL and LLTM, but it can be used in conjunction with any IRT model. The ISLP model can be applied to precalibrated items after learning interventions are developed. The item difficulty parameters may be treated as known if they were obtained in a setting where ability is expected to be constant prior to any interventions associated with the items. Learning parameters may also be estimated simultaneously with the item parameters in other designs. Simulation results showed that parameters can be accurately recovered under the study conditions. An additional simulation studied the benefit of utilizing learning parameters in an adaptive exam when the aim is to promote learning.

A real data application involving spatial reasoning was given to illustrate the ISLP. It was seen that while all blocks were capable of helping the examinees learn the spatial rotation tasks, only those that were administered at the very beginning helped, and learning appeared to end after the second block. Because the exam in this dataset contained 50 questions, it is possible that fatigue and lack of motivation were responsible for the observed plateau in examinees’ performance after the second time point.

Due to the need for shorter exams, adaptive item selection would be a good choice. We can see that when these models hold, selecting items adaptively gives us a large advantage, even over choosing items with the maximum $Δ$ .

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

Andersen, E. B. (1972). The numerical solution of a set of conditional estimation equations. Journal of the Royal Statistical Society: Series B (Methodological), 34(1), 42–54.

Andersen, E. B. (1985). Estimating latent correlations between repeated testings. Psychometrika, 50(1), 3–16.

Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37(1), 29–51.

Bock, R. D., & Aitkin (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46(4), 443–459.

Bock, R. D., & Lieberman (1970). Fitting a response model for n dichotomously scored items. Psychometrika, 35(2), 179–197.

Cheng & Chang, H.-H. (2009). The maximum priority index method for severely constrained item selection in computerized adaptive testing. British Journal of Mathematical and Statistical Psychology, 62(2), 369–383.

Embretson (1991). A multidimensional item response model for learning processes. Psychometrika, 56, 495–515.

Fischer, G. H. (1981). On the existence and uniqueness of maximum-likelihood estimates in the Rasch model. Psychometrika, 46(1), 59–77.

Fischer, G. H., & Formann, A. K. (1982). Some applications of logistic latent trait models with linear constraints on the parameters. Applied Psychological Measurement, 6(4), 397–416.

Joseph, Thomas, Simonette, & Ramsook (2013). The impact of differentiated instruction in a teacher education setting: Successes and challenges. International Journal of Higher Education, 2(3), 28–40.

Kingsbury, G. G., & Zara, A. R. (1989). Procedures for selecting items for computerized adaptive tests. Applied Measurement in Education, 2(4), 359–375. https://doi.org/10.1207/s15324818ame0204_6

Manning, Stanford, & Reeves (2010). Valuing the advanced learner: Differentiating up. The Clearing House: A Journal of Educational Strategies, Issues and Ideas, 83(4), 145–149. https://doi.org/10.1080/00098651003774851

Ong, M. L. (2017). A longitudinal item response theory-latent growth modeling for measuring change [Doctoral dissertation]. University of Georgia.

Rasch (1960). Studies in mathematical psychology: I. Probabilistic models for some intelligence and attainment tests. Nielsen & Lydiche.

Scheiblechner (1972). Das Lernen und Lösen komplexer Denkaufgaben. Zeitschrift für experimentelle und angewandte Psychologie, 19, 476–506.

Swanson & Stocking, M. L. (1993). A model and heuristic for solving very large item selection problems. Applied Psychological Measurement, 17(2), 151–166. https://doi.org/10.1177/014662169301700205

van der Linden, W. J., & Glas, C. A. (2009). Constrained adaptive testing with shadow tests. In Elements of adaptive testing (pp. 31–55). Springer.

Vygotsky, L. S. (1980). Mind in society: The development of higher psychological processes. Harvard University Press.

Wainer, Bradlow, E. T., & Wang (2007). Testlet response theory and its applications. Cambridge University Press.

Wang, Zhang, Douglas, & Culpepper (2018). Using response times to assess learning progress: A joint model for responses and response times. Measurement: Interdisciplinary Research and Perspectives, 16, 45–58. https://doi.org/10.1080/15366367.2018.1435105

Yoon, S. Y. (2011). Psychometric properties of the revised Purdue spatial visualization tests: Visualization of rotations (the Revised PSVT:R). Purdue University.

Zhang, Wang, & Chen (2018). hmcdm: Hidden Markov cognitive diagnosis models for learning.