Abstract
The present paper introduces a random weights linear logistic test model for the measurement of individual differences in operation-specific practice effects within a single administration of a test. The proposed model is an extension of the linear logistic test model of learning developed by Spada (1977) in which the practice effects are considered random effects varying across examinees. A Bayesian framework was used for model estimation and evaluation. A simulation study was conducted to examine the behavior of the model in combination with the Bayesian procedures. The results demonstrated the good performance of the estimation and evaluation methods. Additionally, an empirical study was conducted to illustrate the applicability of the model to real data. The model was applied to a sample of responses from a logical ability test providing evidence of individual differences in operation-specific practice effects.
Introduction
Research in item response theory has led to considerable advances in measuring individual differences in learning between different administrations of a test (e.g., Andersen, 1985; Bock, 1976; Embretson, 1991). However, the measurement of individual differences in learning within a test has been surprisingly understudied (Lozano & Revuelta, 2023). Within-test learning effects are likely to occur when the items share a set of solution principles that must be repeatedly applied throughout the test, so that respondents can improve their performance during the test as a result of practice. These effects may be facilitated by explicit feedback. For instance, after the respondent answers an item (or a set of items), the examiner (or the test in computerized form) may provide information about the correctness of the response(s) or about the cognitive operations that should have been applied to solve the item(s). Note, however, that feedback can also occur without explicit information. For example, in a multiple-choice test, the fact that the respondent finds (or does not find) a response alternative that matches the solution he or she has arrived at provides information that such a solution is probably correct (or certainly incorrect).
Most of the existing models aimed at measuring within-test learning effects treat learning as a fixed effect constant across examinees (e.g., Spada, 1977; Verguts & De Boeck, 2000; Verhelst & Glas, 1993). Of particular interest for the present paper is the model developed by Spada (1977; see also Fischer & Formann, 1982; Lozano & Revuelta, 2021a; Spada & McGaw, 1985) on the basis of the linear logistic test model (LLTM; Fischer, 1973; Scheiblechner, 1972). This model was conceived to account for within-test learning effects specifically associated with the cognitive operations required to solve the test items. The model assumes that the operation-specific learning effects are the same for all examinees, which may be too restrictive an assumption in some instances, particularly when the items involve relatively complex operations with which examinees have no previous experience. In such circumstances, examinees may show quite different learning curves throughout the test, which would make the model produce incorrect estimates of person and item parameters. In the present paper, a generalization of this model is proposed in which the practice effects are treated as random effects varying across examinees. This provides the model with greater flexibility, allowing it to account for a wider variety of learning patterns as well as to yield a more precise analysis of person and item properties. The proposed model is therefore a multidimensional model able to measure individual differences in operation-specific learning within a single administration of a test.
A Random Weights Linear Logistic Test Model for Learning
This section describes the model specification and identification.
Model Specification
According to the proposed model, the logit of a correct response for person i (i = 1, 2, . . ., n) on item j (j = 1, 2, . . ., J) is given by:

$$\operatorname{logit}\left[P(X_{ij} = 1)\right] = \theta_i - \sum_{m=1}^{M} w_{jm}\left(\alpha_m - \delta_{im}\sum_{k=1}^{j-1} w_{km}\right),$$

where θ_i is the ability of person i; w_jm is the weight of item j on operation m (m = 1, 2, . . ., M); α_m is the initial difficulty of operation m; w_km is the weight of previous item k (k = 1, 2, . . ., j ‒ 1) on operation m; and δ_im is the effect of previous practice on the difficulty of operation m for person i.
Hypothetical Example of Matrices
The model is based on a J × M structure matrix
The ability parameter θ_i represents the ability of person i to perform the operations involved in the test without practice. Since most psychometric tests are not meant to produce practice effects, θ_i may be conceived as the ability the test is intended to measure. The difficulty parameter α_m represents the initial difficulty of operation m; that is, the difficulty of operation m without practice. The practice parameter δ_im represents the effect of practicing operation m for person i. A positive sign for δ_im indicates a decrease in the difficulty of operation m for person i as a function of practice, which may be interpreted as learning. On the other hand, a negative sign for δ_im indicates an increase in the difficulty of operation m for person i as a function of practice, which may be interpreted as fatigue or loss of interest and/or attention. A value of zero for δ_im entails the absence of a practice effect, so that the difficulty of operation m for person i remains equal to α_m throughout the test. Note that the operation difficulty parameters (α_m) are fixed effects constant across examinees, whereas the practice parameters (δ_im) are random effects allowed to vary over individuals. The model is therefore a random weights linear logistic test model (RWLLTM; Rijmen & De Boeck, 2002). Accordingly, we will refer to the model as the random weights operation-specific learning model (RWOSLM).
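The parameter interpretation above can be illustrated with a minimal Python sketch. It assumes that the logit has the form θ_i − Σ_m w_jm (α_m − δ_im Σ_{k<j} w_km), which is the form implied by the parameter definitions. The function name `rwoslm_prob`, the toy structure matrix, and all numeric values are hypothetical and serve only as illustration.

```python
import math

def rwoslm_prob(theta_i, W, alpha, delta_i, j):
    """Probability of a correct response of one person on item j (0-based).

    W is a J x M structure matrix (list of rows), alpha holds the initial
    operation difficulties, and delta_i holds this person's practice effects.
    """
    logit = theta_i
    for m in range(len(alpha)):
        # Amount of previous practice of operation m before item j.
        practice = sum(W[k][m] for k in range(j))
        # Current difficulty: positive delta lowers it, negative delta raises it.
        difficulty_m = alpha[m] - delta_i[m] * practice
        logit -= W[j][m] * difficulty_m
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical toy example: 3 items, 2 operations.
W = [[1, 0], [1, 1], [0, 1]]
alpha = [0.5, 1.0]
delta_i = [0.2, -0.1]   # learning on operation 1, fatigue on operation 2
probs = [rwoslm_prob(0.0, W, alpha, delta_i, j) for j in range(3)]
```

Setting every entry of `delta_i` to zero leaves the operation difficulties at their initial values for the whole test, so the same function then returns LLTM probabilities.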
Model Identification
For the RWOSLM to be identified, the matrices
Relation to Other Models
This section examines the relationship of the RWOSLM to other item response models.
Random Weights Linear Logistic Test Model
As mentioned in the previous section, the proposed model can be considered an RWLLTM (Rijmen & De Boeck, 2002). According to the RWLLTM, the logit of a correct response is given by:
Operation-Specific Learning Model
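A special case of the RWOSLM takes place when δ_im is constant over i (δ_im = δ_m for all i). In that case the practice effects are fixed effects, and the model reduces to the operation-specific learning model (OSLM) of Spada (1977):

$$\operatorname{logit}\left[P(X_{ij} = 1)\right] = \theta_i - \sum_{m=1}^{M} w_{jm}\left(\alpha_m - \delta_{m}\sum_{k=1}^{j-1} w_{km}\right).$$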
Linear Logistic Test Model
When δ_im is zero for all i and m, the RWOSLM equals the LLTM (Fischer, 1973; Scheiblechner, 1972):

$$\operatorname{logit}\left[P(X_{ij} = 1)\right] = \theta_i - \sum_{m=1}^{M} w_{jm}\,\alpha_m.$$
As can be appreciated, both the OSLM and the LLTM are Rasch models (Rasch, 1960) with linear constraints on the item parameters, whereas the RWOSLM and the RWLLTM would be constrained multidimensional Rasch models.
Bayesian Framework
This section describes a Bayesian framework for model estimation and evaluation.
Model Estimation
Markov chain Monte Carlo (MCMC) simulation was used to derive an empirical approximation to the posterior distribution of the parameters (Brooks et al., 2011). Specifically, the No-U-Turn (NUTS) algorithm (Hoffman & Gelman, 2014) was used to iteratively draw samples from the posterior distribution. The mean of the posterior draws was used to summarize the estimation, whereas the variance and the quantiles of the posterior distribution were used as a measure of the precision of the estimation. The un-normalized posterior distribution is given by the product of the prior distribution of the parameters and the likelihood of the data given the parameters.
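As a sketch of this factorization, let ξ denote the full parameter vector, x_ij the dichotomous response of person i to item j, and P_ij the model-implied probability of a correct response. These symbols are notational assumptions introduced here for illustration. Under local independence, the un-normalized posterior may then be written as:

```latex
p(\boldsymbol{\xi} \mid \mathbf{X}) \;\propto\; p(\boldsymbol{\xi})
\prod_{i=1}^{n} \prod_{j=1}^{J} P_{ij}^{\,x_{ij}} \left(1 - P_{ij}\right)^{1 - x_{ij}}
```

The first factor is the prior distribution and the double product is the likelihood of the observed response matrix.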
Model Evaluation
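Model evaluation was conducted using posterior predictive model checking (PPMC; Gelman et al., 1996). PPMC is based on test statistics that capture relevant features of the data. In PPMC, the realized value of the statistic based on the observed data is compared with the posterior predictive distribution of the statistic, which is obtained from data sets replicated under the model. The posterior predictive p-value (PPP) summarizes this comparison as the proportion of replicated data sets in which the statistic is at least as large as the realized value.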
In this study, PPMC was conducted based on the odds ratio statistic (OR; Sinharay, 2005). The OR is a measure of association between pairs of items that has proven useful for detecting learning effects within a test (Lozano & Revuelta, 2021b, 2023). Measures of inter-item associations at the item and test level are obtained by summing the OR values over the pairs of items. PPP values close to .5 indicate that the realized value of the statistic is in the middle of the posterior predictive distribution, evidencing adequate model-data fit; whereas extreme PPP values close to zero (one) indicate that the observed data exhibit more (less) local dependence than expected based on the model.
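The following Python sketch illustrates the two ingredients of this check. It assumes the usual sample odds ratio for a pair of dichotomous items, (n11 · n00) / (n10 · n01), and takes the PPP value to be the proportion of replicated statistics at least as large as the realized one. The function names and the toy data are hypothetical.

```python
def odds_ratio(responses, j, k):
    """Sample odds ratio between items j and k in a 0/1 response matrix."""
    n11 = n00 = n10 = n01 = 0
    for row in responses:
        a, b = row[j], row[k]
        if a == 1 and b == 1:
            n11 += 1
        elif a == 0 and b == 0:
            n00 += 1
        elif a == 1:
            n10 += 1
        else:
            n01 += 1
    denom = n10 * n01
    # Sketch-level handling of empty discordant cells (an assumption).
    return float("inf") if denom == 0 else (n11 * n00) / denom

def ppp_value(realized, replicated):
    """Proportion of replicated statistics that are >= the realized value."""
    return sum(t >= realized for t in replicated) / len(replicated)

# Hypothetical observed data: 6 persons, 2 items.
observed = [[1, 1], [1, 1], [0, 0], [0, 0], [1, 0], [0, 1]]
realized_or = odds_ratio(observed, 0, 1)
# Hypothetical OR values computed from four replicated data sets.
ppp = ppp_value(realized_or, [1.0, 2.0, 4.0, 5.0])
```

In an actual analysis, each replicated value would come from a data set simulated from one posterior draw of the model parameters.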
Model Comparison and Selection
Model comparison and selection were based on two information criteria: the widely applicable information criterion (WAIC; Watanabe, 2013) and the leave-one-out information criterion (LOOIC; Vehtari et al., 2017). WAIC and LOOIC estimate out-of-sample predictive performance by adjusting the log predictive density of the observed data with a penalty for model complexity. This penalty compensates for the over-fitting exhibited by more complex models by virtue of their higher flexibility. WAIC and LOOIC are used to compare competing models in order to select the one with the best expected predictive performance. Lower values indicate better predictive accuracy. WAIC and LOOIC have shown good performance in detecting learning effects within a test (Lozano & Revuelta, 2021b, 2023).
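To make the penalized log predictive density concrete, the sketch below computes WAIC from a matrix of pointwise log-likelihoods, using the standard definition in which the complexity penalty is the sum of the posterior variances of the pointwise log-likelihoods. It is an illustrative stand-alone function with a hypothetical name and toy input, not the routine used in the study.

```python
import math
from statistics import variance

def waic(log_lik):
    """WAIC on the deviance scale.

    log_lik[s][i] is the log-likelihood of observation i at posterior draw s.
    """
    n_draws = len(log_lik)
    n_obs = len(log_lik[0])
    elpd = 0.0
    for i in range(n_obs):
        column = [log_lik[s][i] for s in range(n_draws)]
        # Log of the posterior mean likelihood, via a stable log-sum-exp.
        peak = max(column)
        lppd_i = peak + math.log(sum(math.exp(v - peak) for v in column) / n_draws)
        # Complexity penalty: sample variance of the log-likelihood over draws.
        p_waic_i = variance(column)
        elpd += lppd_i - p_waic_i
    return -2.0 * elpd

# Hypothetical toy input: 2 posterior draws, 2 observations.
toy = [[-1.0, -2.0], [-1.0, -2.0]]
toy_waic = waic(toy)
```

With identical draws, as in the toy input, the penalty is zero and WAIC reduces to minus twice the log predictive density.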
Simulation Study
A simulation study was conducted to examine the performance of the model in combination with Bayesian estimation and evaluation methods. Specifically, the study examined the performance of the NUTS algorithm in parameter recovery and the performance of PPMC and information criteria in model evaluation and selection.
Method
Matrix
The simulation was conducted with R version 3.6.1 (R Development Core Team, 2019) and the RStan R package version 2.19.2 (Stan Development Team, 2019). One hundred data sets of dichotomous responses were simulated from each generating model. The models were estimated from each simulated data set using four Markov chains of 2,000 iterations each. The draws from the first half of the iterations were discarded as burn-in. The potential scale reduction statistic (
Based on previous research (Lozano & Revuelta, 2021b, 2023), a normal(0, 100) was specified as the prior distribution for the fixed-effects parameters (α_m, β_j, and δ_m), whereas a normal(0, 1) was specified for the parameter θ_i. Additionally, a normal(0, 1) and a Cauchy(0, 5) were specified for the hyper-parameters
PPMC based on the OR statistic was used to assess the fit of the models to the data. The hypothesis that the model fits the data was rejected when the PPP value was less than .05 or greater than .95. The performance of the OR statistic was assessed by the average PPP value over the 100 simulated samples and the empirical proportion of rejections (EPR), that is, the proportion of simulated samples in which the fitted model is rejected. When the fitted model coincides with the model used to generate the data, the EPR is an estimate of the false-positive error rate of the test, whereas when the fitted model and the generating model do not coincide, the EPR is an estimate of the sensitivity of the test.
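The rejection rule and the EPR can be sketched in a few lines of Python. The function name and the example PPP values are hypothetical. The thresholds are the ones stated above.

```python
def empirical_proportion_of_rejections(ppp_values, lower=0.05, upper=0.95):
    """Share of simulated samples whose PPP value falls outside [lower, upper]."""
    rejected = [p < lower or p > upper for p in ppp_values]
    return sum(rejected) / len(ppp_values)

# Hypothetical PPP values from four simulated samples.
example_ppp = [0.50, 0.02, 0.97, 0.30]
epr = empirical_proportion_of_rejections(example_ppp)
mean_ppp = sum(example_ppp) / len(example_ppp)
```

Whether the resulting proportion estimates a false-positive rate or a sensitivity depends only on whether the fitted model matches the generating model, not on the computation itself.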
Additionally, WAIC and LOOIC were used for model comparison and selection. These measures were obtained using the loo R package version 2.3.1 (Vehtari et al., 2020). For each condition of the study, the performance of WAIC and LOOIC was assessed by their average value over the 100 simulated samples and the empirical proportion of selections (EPS), that is, the proportion of simulated samples in which the fitted model is selected.
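A minimal sketch of the EPS computation follows, assuming that within each simulated sample the model with the lowest criterion value is the one selected. The function name, the model labels used as dictionary keys, and the criterion values are hypothetical.

```python
def empirical_proportion_of_selections(criteria_by_model):
    """Share of replications in which each model has the lowest criterion value.

    criteria_by_model maps a model label to a list of criterion values,
    one per simulated sample. Ties go to the model listed first.
    """
    names = list(criteria_by_model)
    n_rep = len(criteria_by_model[names[0]])
    wins = {name: 0 for name in names}
    for r in range(n_rep):
        best = min(names, key=lambda name: criteria_by_model[name][r])
        wins[best] += 1
    return {name: wins[name] / n_rep for name in names}

# Hypothetical WAIC values for two fitted models over four simulated samples.
eps = empirical_proportion_of_selections({
    "OSLM": [100.0, 90.0, 95.0, 80.0],
    "RWOSLM": [98.0, 92.0, 90.0, 70.0],
})
```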
The accuracy of the parameter estimates was assessed with the root mean square error (RMSE). Lower values of RMSE indicate higher estimation accuracy. The RMSE of the parameter δ_im is defined as follows:
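As a generic illustration of the RMSE, the Python sketch below takes the square root of the mean squared difference between estimates and generating values. How the differences are pooled over persons, operations, and replications in the study is not reproduced here, so the flat lists in the example are an assumption.

```python
import math

def rmse(estimates, true_values):
    """Root mean square error between estimates and generating values."""
    squared_errors = [(e - t) ** 2 for e, t in zip(estimates, true_values)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical estimates and generating values of two practice parameters.
example_rmse = rmse([1.0, 2.0], [0.0, 4.0])
```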
Results
Average Posterior Predictive p-value (
Average WAIC and LOOIC (Mean) and Empirical Proportion of Selections (EPS) for each Combination of Generating and Fitted Model.
Root Mean Square Error (RMSE) of Parameter Estimates for each Combination of Estimated Parameter, Generating Model, and Fitted Model.
Conclusions
The simulation study demonstrated the good performance of PPMC for model evaluation. Specifically, the OR statistic showed high specificity and sensitivity in identifying the presence or absence of individual differences in learning in the data. Additionally, the information criteria demonstrated good performance in model comparison and selection. Nevertheless, a certain tendency to overestimate the presence of individual differences in practice effects suggests the need to complement the use of information criteria with PPMC. The results also indicated that the MCMC algorithm provided accurate estimates of the model parameters.
Empirical Study
An empirical study was conducted to illustrate the applicability of the model to real data. The model was applied to a deductive reasoning test based on several logical operations that previous research has demonstrated to be prone to within-test practice effects (Lozano & Revuelta, 2021a, 2023).
Method
The data consist of responses from 501 examinees to 50 items of the DA5 logical ability test (SHL, 1995). Each item includes a sequence of two to four figures. Symbols representing the logical operations that must be applied to transform the figures are presented along with the figures. The test is based on 10 logical operations: (1) rotate the figure from top to bottom; (2) rotate the figure from left to right; (3) erase the previous figure; (4) erase the next figure; (5) interchange the figure with the previous one; (6) ignore the previous operator; (7) ignore the next operator; (8) reverse the order of the figures; (9) reorder the figures such that 1234 → 3412; and (10) reorder the figures such that 1234 → 2143. The respondent must choose, among five response alternatives, the one that represents the result of applying the logical operations to the figures. Figure 1 depicts a hypothetical example of a DA5 item. The item involves operations 1, 6, 3, 5, and 10, in that order, and the correct response is D. Table 6 shows the hypothesized structure matrix.
Hypothetical example of DA5 item.
Structure Matrix for the Empirical Study.
Results
Goodness-of-fit Estimates of the Fitted Models.
Expected a Posteriori Estimates (EAP), Posterior Standard Deviations (SD), and Posterior Probability Intervals (2.5%–97.5%) of the RWOSLM Parameters.

Difficulty of the DA5 logical operations as a function of practice for all examinees.
Expected a Posteriori Estimates (EAP), Posterior Standard Deviations (SD), and Posterior Probability Intervals (2.5%–97.5%) of the Practice Parameters of the RWOSLM for a Particular Examinee.

Difficulty of the cognitive operations as a function of practice for a particular examinee.
Conclusions
This study illustrates the applicability of the RWOSLM for the measurement of individual differences in operation-specific practice effects. The model was fitted to data from the DA5 logical ability test. The results suggest that the practice effects in the DA5 logical operations are a source of individual variability. In this regard, the RWOSLM outperformed the OSLM in terms of fit, demonstrating higher flexibility to account for diverse patterns of practice effects. Interestingly, the operations that showed greater variability in practice effects were those that impose a higher load on working memory.
Discussion
The present paper introduces a random weights linear logistic test model for the measurement of individual differences in operation-specific practice effects within a test. A simulation study was conducted to examine the performance of the Bayesian estimation and evaluation procedures. PPMC along with the OR statistic demonstrated good performance in model evaluation, whereas the information criteria performed well in model comparison and selection. Additionally, the MCMC algorithm demonstrated accuracy in recovering the model parameters. The proposed model in combination with the Bayesian procedures thus proved useful in identifying individual differences in learning. Additionally, an empirical study was conducted to illustrate the applicability of the model. The RWOSLM proved superior to the OSLM in accounting for the variability of the practice effects associated with the operations involved in the DA5.
The requirement of relatively complex operations with which respondents have no previous experience makes the manifestation of individual differences in learning more likely. In such circumstances, the RWOSLM may provide superior performance compared to one-dimensional operation-specific learning models. Despite the applicability of the model, there are certain considerations that must be taken into account. In the present paper, the structure matrix used with the RWOSLM represented the frequency with which the cognitive operations were required to solve the items. However, the model allows for other uses of the matrix, such as differentiating the order in which the operations must be applied to solve the items. In this regard, different columns of the structure matrix may reflect different orders of application of the same set of operations. Additionally, the present paper is focused on the measurement of non-contingent learning. However, other variants of the RWLLTM can be specified to measure individual differences in within-test contingent learning effects (e.g., Lozano & Revuelta, 2023). Future studies may examine the specification and identification conditions of these models along with their performance. Finally, like other operation-specific learning models, the proposed model is based on the assumption that item difficulty can be explained in terms of a well-defined set of operations. Such an assumption will constitute an important limitation when the hypothesized operations do not truly reflect the way in which examinees actually solve the items. In this regard, other item properties may constitute a source of item difficulty or even a source of practice effects throughout the test (e.g., Lozano & Revuelta, 2020). For a detailed discussion of this issue, see Lozano and Revuelta (2021b). Another important limitation of the proposed model is the assumption of linearity. In this regard, the difficulty of the cognitive operations is assumed to be a linear function of previous practice during the test. Despite its parsimony, the linearity of the model may be too restrictive an assumption when dealing with more complex patterns of practice effects, for instance a learning curve with successively smaller reductions in operation difficulty (e.g., Spada, 1977; Spada & McGaw, 1985). Further research will aim to overcome these limitations.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Ministerio de Ciencia, Innovación y Universidades (PID2021-124885NB-I00). The computations were run with the support of the Scientific Computing Centre at Universidad Autónoma de Madrid (CCC-UAM).
