Abstract
Given the frequent presence of slipping and guessing in item responses, models for the inclusion of their effects are highly important. Unfortunately, the most common model for their inclusion, the four-parameter item response theory model, potentially has severe deficiencies related to its possible unidentifiability. With this issue in mind, the dyad four-parameter normal ogive (Dyad-4PNO) model was developed. This model allows for slipping and guessing effects by including binary augmented variables—each indicated by two items whose probabilities are determined by slipping and guessing parameters—which are subsequently related to a continuous latent trait through a two-parameter model. Furthermore, the Dyad-4PNO assumes uncertainty as to which items are paired on each augmented variable. In this way, the model is inherently exploratory. In the current article, the new model, called the Set-4PNO model, is an extension of the Dyad-4PNO in two ways. First, the new model allows for more than two items per augmented variable. Second, these item sets are assumed to be fixed, that is, the model is confirmatory. This article discusses this extension and introduces a Gibbs sampling algorithm to estimate the model. A Monte Carlo simulation study shows the efficacy of the algorithm at estimating the model parameters. A real data example shows that this extension may be viable in practice, with the data fitting a more general Set-4PNO model (i.e., more than two items per augmented variable) better than the Dyad-4PNO, 2PNO, 3PNO, and 4PNO models.
1. Introduction
Despite having been first introduced four decades ago, the four-parameter model (4PM) has seen a relative resurgence in interest within item response theory. The 4PM is a generalization of the three-parameter model (3PM) that allows for both lower and upper asymptotes on the probability of a positive item response on a dichotomous item. Its item response function (IRF) is as follows:
where
The renewed interest has largely been fueled by emerging evidence that upper asymptotes may exist in a variety of contexts, such as low-stakes assessments (Culpepper, 2017), high-stakes assessments (Rulison & Loken, 2009), psychological and personality assessments (Waller & Reise, 2010), and even in the assessment and monitoring of food insecurity (Gregory, 2020). This research shows that the reasons for the existence of upper asymptotes can range from simple mistakes during test-taking to careless responding to intentional misresponding.
Further interest in the 4PM centers on the problems underlying its estimation. Estimation approaches introduced in recent years include estimation using WinBUGS (Loken & Rulison, 2010), marginal maximum likelihood estimation (Feuerstahler & Waller, 2014), a Gibbs sampling method for the 4PNO (Culpepper, 2016), Bayesian modal estimation (Waller & Feuerstahler, 2017), marginal maximum a posteriori estimation using a mixture model formulation (Meng et al., 2020), a Gibbs-slice sampler (Zhang et al., 2020), and data augmentation in a Gibbs sampler for the multidimensional 4PL (Fu et al., 2021).
Much of the interest in estimation of the 4PM stems from serious concerns about its identification, which it inherits as a generalization of the 3PM (Maris & Bechger, 2009; San Martín et al., 2015; Thissen, 2009). When a model has identification issues, several severe consequences arise. The most salient comes directly from the definition of an identified model: having an identified model means that optimal parameters are unique, so that when a model is unidentified, several parameter sets may be equally optimal. In this situation, how does one choose between them? Other problems include the consistency of parameter estimates (Gabrielsen, 1978) and the interpretability of the parameters (San Martín et al., 2009). These issues should give pause to any applied researcher relying on a potentially unidentified model. On the other hand, evidence suggests that not taking into account guessing effects (Aitkin & Aitkin, 2006) and slipping effects (Loken & Rulison, 2010) may lead to biased trait estimates in the extreme ends of the trait distribution.
Given these considerations, including these effects in a way that keeps the model identified is a highly valuable prospect for the psychometric community. To that end, one recently proposed solution showed conditions under which a 4PM can be identified (Kern & Culpepper, 2020). Using results from the cognitive diagnostic modeling literature (Gu & Xu, 2021), it was shown that by grouping items into pairs that measure a common attribute, which they termed dyads, and restricting the slope and threshold parameters within each pair to be equal, the model can be generically identified. This is as follows:
where d denotes the dth dyad, and
In their paper, they proposed a Markov chain Monte Carlo (MCMC) algorithm to explore the space of all possible dyads to determine a likely set of dyads and estimate person, dyad, and item parameters when the dyads are not known a priori. In their approach, they used a prior distribution of potential groupings that assumed that each potential dyad was equally likely. Hence, their approach yielded an exploratory version of the proposed restricted 4PM. In the current project, we will modify the procedure to instead use a prior distribution of potential groupings that assumes one specific grouping, yielding a confirmatory version of the proposed restricted 4PM. Furthermore, the new model considered here will extend the Dyad-4PNO model further by allowing for groups of items, which we term item sets, with two or more items, rather than just the two-item dyads previously considered. These extensions will make the approach more robust, allowing for testable hypotheses with respect to item sets and improving the estimation accuracy of model parameters. Importantly, the resulting model can be thought of as a special case of the higher-order DINA (HO-DINA) model (de la Torre & Douglas, 2004) with simple structure. By exploiting this simple structure, the developed algorithm is able to accommodate many more attributes than other existing HO-DINA estimation algorithms.
The rest of the current article is structured into five sections. In the first section, the Dyad-4PNO model is introduced. Following this brief introduction is a discussion about the extension of the Dyad-4PNO model, which we term the set four-parameter normal ogive (Set-4PNO) model, to cases where a test may consist of item sets with two or more items. The second section describes a Bayesian formulation of the new model, as well as an outlined Gibbs sampling algorithm for approximating the posterior distribution. Also given in this section is a brief discussion of how the current formulation connects with the exploratory Dyad-4PNO. The third section describes and reports the results from a Monte Carlo simulation study designed to evaluate parameter recovery of the proposed Gibbs sampler under a variety of conditions. In the fourth section, we give an example of fitting several competing models, including the Set-4PNO, to an experimental IQ data set. Finally, we provide a discussion on the contributions of this article, potential future directions of this work, and limitations of the current study.
2. A Restricted 4PM
2.1. The Dyad-4PNO Model
In this section, we describe the development of the Dyad-4PNO model. As originally described, this model can be thought of as a HO-DINA model (de la Torre & Douglas, 2004) with uncertainty in which items measure a binary latent attribute, and a restriction that attributes are measured by only two items, the minimum number of items necessary for generic identifiability to hold (Kern & Culpepper, 2020). These sets of paired items are called dyads. To denote a dyad, we let
Suppose that person i does not possess attribute d. Then, under the DINA model
In the current model, we assume that if items j and
where
Treating
where $a_d$ is a slope parameter describing the strength of the relationship between
Here, we can define a so-called dyad response function (DRF). The DRF is the joint distribution of the item responses to items in dyad d. The DRF is
where
and
are the asymptotes as
The marginal IRF for a single item can be found by summing over the response categories of the other item in a dyad. Doing this shows that the resulting marginal item characteristic curve (ICC) is
Notice that this is similar to the ICC of the usual 4PNO model in Equation 1 with the difference that the item slope and intercept parameters are restricted to be equal to each other within each dyad pair d. The resulting model is generically identified.
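As an illustration of the marginal ICC discussed above, the following Python sketch evaluates the curve for two hypothetical items in one dyad. It is not taken from the original article: it assumes the usual four-parameter form with lower asymptote g_j and upper asymptote 1 − s_j, and it assumes the threshold form a_d θ − b_d (writing the linear predictor with an intercept instead would only flip the sign of b_d). The function name `marginal_icc` and all parameter values are invented for illustration.

```python
from statistics import NormalDist

_PHI = NormalDist().cdf  # standard normal CDF


def marginal_icc(theta, a_d, b_d, g_j, s_j):
    """P(Y_ij = 1 | theta) for item j in dyad d (assumed restricted 4PNO form).

    Assumption: the linear predictor is a_d * theta - b_d, with the slope a_d
    and threshold b_d shared by every item in the dyad.
    """
    return g_j + (1.0 - s_j - g_j) * _PHI(a_d * theta - b_d)


# Hypothetical dyad: shared (a_d, b_d), item-specific (g_j, s_j).
a_d, b_d = 1.2, 0.3
item_1 = dict(g_j=0.20, s_j=0.10)
item_2 = dict(g_j=0.05, s_j=0.15)

for theta in (-8.0, 0.0, 8.0):
    p1 = marginal_icc(theta, a_d, b_d, **item_1)
    p2 = marginal_icc(theta, a_d, b_d, **item_2)
    print(f"theta={theta:+.1f}  item 1: {p1:.3f}  item 2: {p2:.3f}")
```

Because both items share the slope and threshold, their curves differ only in their asymptotes: each approaches g_j for very low trait values and 1 − s_j for very high ones.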
Assuming local independence of the responses between dyads, the likelihood over all J item responses for the entire set of n examinees is as follows:
Note that in general, this likelihood is not equivalent to the one often given in the context of the 4PM. This is because item responses within an item dyad are not assumed independent; that is, the dyad makes an explicit dependence assumption between items within the dyad. However, this dyadic relationship is necessary for the identifiability conditions to minimally hold.
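The within-dyad dependence noted above can be made concrete with a small numerical sketch. The code below is an assumption-laden illustration rather than the article's implementation: it treats the binary attribute as present with probability Φ(a_d θ − b_d) (assumed sign convention), and, given the attribute, treats items as independent with success probability 1 − s_j when the attribute is present and g_j otherwise. The function name `dyad_response_probs` and the parameter values are hypothetical.

```python
from itertools import product
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal CDF


def dyad_response_probs(theta, a_d, b_d, g, s):
    """Joint distribution of the item responses in one dyad, given theta.

    Assumptions (labeled, not from the source): the attribute is present with
    probability pi = Phi(a_d * theta - b_d); given the attribute, items are
    conditionally independent with P(correct) = 1 - s_j (attribute present)
    or g_j (attribute absent). Returns {response pattern: probability}.
    """
    pi = PHI(a_d * theta - b_d)
    probs = {}
    for pattern in product((0, 1), repeat=len(g)):
        with_attr = 1.0
        without_attr = 1.0
        for y, g_j, s_j in zip(pattern, g, s):
            with_attr *= (1.0 - s_j) if y == 1 else s_j
            without_attr *= g_j if y == 1 else (1.0 - g_j)
        probs[pattern] = pi * with_attr + (1.0 - pi) * without_attr
    return probs


probs = dyad_response_probs(theta=0.0, a_d=1.2, b_d=0.3,
                            g=(0.20, 0.05), s=(0.10, 0.15))

# Marginal probabilities of a correct response on each item.
p1 = probs[(1, 0)] + probs[(1, 1)]
p2 = probs[(0, 1)] + probs[(1, 1)]

print("joint P(1, 1):        ", round(probs[(1, 1)], 4))
print("product of marginals: ", round(p1 * p2, 4))
```

Under these assumptions the joint probability of two correct responses exceeds the product of the marginals whenever 1 − s_j − g_j > 0 for both items, which is why the likelihood cannot be written as a product over items within a dyad.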
2.2. The Confirmatory Set-4PNO Model
Treating the dyad index vector
However, here we observe two things. First, while a minimum of two items forming a dyad loading onto a common attribute
Given these observations, the Dyad-4PNO model can be generalized. First, instead of only allowing for item sets of two items, we can instead allow for sets of size two or larger. To do this, the definition of an item set in the vector
where
where
and
with interpretation as given previously. With this, we can find the expected sum-score $S_d$ for an item set d to be
where
Approaching the second observation that test structure may be known or theorized, we can simply treat the dyad index vector

Path diagram of the confirmatory set four-parameter normal ogive (Set-4PNO) model implied by the item set index vector
2.3. Singlet Items
An important point missing from the above discussion is that ungrouped, or singlet, items may be included in the test as well. However, to guarantee test model identifiability, these items must have their guessing and slipping parameters set to zero, that is,
where
3. Bayesian Estimation of the Confirmatory Set-4PNO Model
In this section, a Bayesian formulation for the confirmatory Set-4PNO model is presented. First, a presentation of the model and a description of Bayesian priors are given. Second, the full conditional distributions for implementing an MCMC Gibbs sampler are provided.
3.1. Bayesian Formulation
A directed acyclic graph representing the confirmatory Set-4PNO model is given in Figure 2. The following Bayesian formulation is proposed for the confirmatory Set-4PNO model described above:

Directed acyclic graph of the set four-parameter normal ogive (Set-4PNO) model. Note. Squares and circles distinguish observed random variables and random parameters, respectively, rectangles are plates that denote variables that share common indices, and solid and dashed lines distinguish between stochastic and deterministic relationships.
The choice of priors is both a reflection of uncertainty in the parameters and of a desired simplicity in sampling. The chosen priors lend themselves well to a Gibbs sampling approach.
Equation 6 is the Bernoulli IRF for the observed response
The rest of the above equations complete the Bayesian formulation of the Set-4PNO model by giving prior distributions for person, item, and item set parameters. Equation 9 shows that the latent trait
For the item parameters $s_j$ and $g_j$, we give a joint beta prior distribution, as shown in Equation 10. If we have a simple monotonicity assumption for the model, that is, that
Finally, for the item set parameters $a_d$ and $b_d$, a truncated bivariate normal distribution prior is given. As shown in Equation 11, the distribution has mean
3.2. Connection With Exploratory Dyad-4PNO
The difference between the current estimation approach and the one given by Kern and Culpepper (2020) is that they proposed using a uniform prior over the space of potential item dyads. That is, they proposed using
3.3. Posterior Approximation
Using the formulation given, we can construct a simple Gibbs sampling algorithm to approximate the full posterior distribution of the model parameters given the observed responses. The posterior distribution, in turn, can be used to determine point estimates, variances, and other summary statistics for each of the parameters in the model.
Gibbs sampling is an approach to finding the full posterior distribution of a set of parameters by iteratively sampling from each parameter’s conditional distribution, given the remaining parameters in the set and the observed data. Following is the series of sampling steps implied by the Bayesian formulation given above. Importantly, the choices for prior distributions generally allow for sampling from well-known distributions, which can greatly increase the efficiency of the MCMC algorithm. For each iteration of the Gibbs sampling algorithm, each step is taken in order. This is done sequentially a number of times until the Markov chain converges in distribution to the joint posterior distribution; this is the so-called burn-in sample. Once the chain has converged, each additional sample is said to be representative of the posterior distribution. This post-burn-in sample can then be used for describing the posterior distributions for each parameter.
1. Update for
where
2. Update for augmented data
3. Update for
4. Update for
where
Here, note that the added constraint that
and
respectively. Using this approach yields a sample value of
5. Update
where
In this instance, as in Step 4, the constraint that
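The general pattern described in this section (cycle through the full conditional distributions in order, discard a burn-in sample, and summarize the retained draws) can be illustrated on a deliberately simple target. The sketch below is not the Set-4PNO sampler: it swaps in a toy bivariate standard normal target with correlation rho, whose full conditionals are univariate normals, purely to show the mechanics. The function name and settings are hypothetical.

```python
import random


def gibbs_bivariate_normal(rho, n_iter, burn_in, seed=0):
    """Toy Gibbs sampler for a bivariate standard normal with correlation rho.

    Full conditionals: x | y ~ N(rho * y, 1 - rho^2), and symmetrically for y.
    Draws from the first `burn_in` iterations are discarded.
    """
    rng = random.Random(seed)
    cond_sd = (1.0 - rho ** 2) ** 0.5
    x, y = 5.0, -5.0  # deliberately poor starting values
    kept = []
    for iteration in range(n_iter):
        x = rng.gauss(rho * y, cond_sd)  # step 1: update x given y
        y = rng.gauss(rho * x, cond_sd)  # step 2: update y given the new x
        if iteration >= burn_in:
            kept.append((x, y))
    return kept


draws = gibbs_bivariate_normal(rho=0.6, n_iter=6000, burn_in=1000)
xs = [d[0] for d in draws]
ys = [d[1] for d in draws]

# Posterior summaries computed from the post-burn-in sample only.
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys)) / len(xs)
var_x = sum((a - mean_x) ** 2 for a in xs) / len(xs)
var_y = sum((b - mean_y) ** 2 for b in ys) / len(ys)
corr_xy = cov_xy / (var_x * var_y) ** 0.5

print(f"retained draws: {len(draws)}, mean of x: {mean_x:.3f}, corr: {corr_xy:.3f}")
```

In the model-specific sampler, each step would instead draw from the corresponding full conditional of the Set-4PNO parameters, but the ordering of steps and the treatment of burn-in follow the same logic.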
4. Monte Carlo Simulation Studies
In this section, the proposed model is investigated using two Monte Carlo simulation studies. First, the efficacy of the above MCMC algorithm is examined. Specifically, this is an examination of how well the algorithm recovers model parameters when the structure of the estimated model is correctly specified, that is, when it matches the data-generating model. In the second simulation study, the ability of the deviance information criterion (DIC), in both its standard and marginal versions, to choose the correct model is investigated.
4.1. Simulation 1
4.1.1. Method
In this simulation study, we seek to determine how well the algorithm recovers model parameters in a variety of contexts when the model structure is properly specified in estimation. Five factors were manipulated, largely following the first study in Kern and Culpepper (2020). First, guessing parameters were either generated from
Latent trait values
4.1.2. Parameter recovery
The efficacy of model parameter recovery was determined using root mean squared error (RMSE), bias, and correlation with the true parameter. Each of these measures is averaged over all the
where M is either the number of items, item sets, or examinees (dependent upon the given parameter
and
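The three recovery measures named in this subsection can be sketched as follows. This is a minimal illustration under the standard definitions (bias as the mean signed error, RMSE as the square root of the mean squared error, and the Pearson correlation), each computed within a replication over the M parameters and then averaged over replications. The function names and the small data set are hypothetical.

```python
from math import sqrt


def bias(estimates, truth):
    """Mean signed error over the M parameters of one replication."""
    return sum(e - t for e, t in zip(estimates, truth)) / len(truth)


def rmse(estimates, truth):
    """Root mean squared error over the M parameters of one replication."""
    return sqrt(sum((e - t) ** 2 for e, t in zip(estimates, truth)) / len(truth))


def correlation(estimates, truth):
    """Pearson correlation between estimates and true values."""
    m = len(truth)
    mean_e = sum(estimates) / m
    mean_t = sum(truth) / m
    cov = sum((e - mean_e) * (t - mean_t) for e, t in zip(estimates, truth))
    var_e = sum((e - mean_e) ** 2 for e in estimates)
    var_t = sum((t - mean_t) ** 2 for t in truth)
    return cov / sqrt(var_e * var_t)


def average_over_replications(metric, replications, truth):
    """Average a recovery metric over all replications of a condition."""
    return sum(metric(est, truth) for est in replications) / len(replications)


# Hypothetical true slopes for three item sets and two replications of estimates.
truth = [0.5, 1.0, 1.5]
replications = [[0.6, 1.1, 1.6], [0.4, 0.9, 1.4]]

avg_bias = average_over_replications(bias, replications, truth)
avg_rmse = average_over_replications(rmse, replications, truth)
avg_corr = average_over_replications(correlation, replications, truth)
print(avg_bias, avg_rmse, avg_corr)
```

The toy numbers are chosen so that the two replications err in opposite directions: the average bias cancels while the average RMSE does not, which is why the two measures are reported together.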
4.1.3. Results
As overall result patterns are as expected for sample size, to not clutter this space, only results for the sample size
Simulation results for the item set parameters a and b are given in Table 1. First, while not shown, it was found that with increasing sample size, item set parameters show increasingly better recovery. This is especially the case with the b-parameters, which show near perfect recovery for
Simulation 1 Recovery of Item Set Parameters a and b for
Results for the item parameters s and g are given in Table 2. First, we note that bias and RMSE are especially small for s and g, which is notable in that these parameters are notoriously difficult to estimate in IRT models. This result shows some of the usefulness of this model in accounting for guessing and slipping effects. Furthermore, as with the item set parameters, it was found that increasing sample size improves item parameter recovery. Other factors that affect item parameter recovery include the number of items per item set J and the levels of slipping and guessing. Specifically, as J increases or as slipping and guessing levels decrease, item parameters show better recovery. Interestingly, it seems that increasing the number of item sets D does not affect item parameter recovery very much.
Simulation 1 Recovery of Item Parameters g and s for
Lastly, results for the latent trait
Simulation 1 Recovery of Person Parameters
4.2. Simulation 2
4.2.1. Method
In this simulation study, the ability of the DIC to select the correct model in several contexts is examined. In this simulation, DIC and its marginal counterpart (i.e., marginal DIC) are studied. The DIC was chosen for this simulation study because of its ubiquity in Bayesian model selection. The marginal DIC was chosen because it has been argued that it is the more appropriate index in latent variable modeling contexts, that is, that an index selecting between latent variable models should have the latent variable integrated out (Merkle et al., 2019).
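The distinction between the two indices can be sketched in code. The block below is a hedged illustration, not the article's implementation: `dic` uses the common definition of the DIC as the posterior mean deviance plus an effective-number-of-parameters penalty (the mean deviance minus the deviance at the posterior mean), and `marginal_deviance_2pno` shows what integrating out the latent trait means for the simplest competing model, a 2PNO with θ ~ N(0, 1), using a plain grid approximation. The threshold form a_j θ − b_j, the function names, and the grid settings are assumptions.

```python
from math import log
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()


def dic(deviance_draws, deviance_at_posterior_mean):
    """Return (DIC, p_D) from posterior deviance draws.

    p_D = mean deviance - deviance at the posterior mean; DIC = mean deviance + p_D.
    """
    d_bar = mean(deviance_draws)
    p_d = d_bar - deviance_at_posterior_mean
    return d_bar + p_d, p_d


def marginal_deviance_2pno(responses, a, b, n_nodes=81, bound=6.0):
    """-2 * log-likelihood of a 2PNO with theta ~ N(0, 1) integrated out.

    The integral over theta is approximated on an equally spaced grid
    (an illustrative choice; other quadrature rules would also work).
    """
    step = 2.0 * bound / (n_nodes - 1)
    nodes = [-bound + k * step for k in range(n_nodes)]
    weights = [STD_NORMAL.pdf(t) * step for t in nodes]
    total_loglik = 0.0
    for person in responses:
        likelihood = 0.0
        for t, w in zip(nodes, weights):
            p_pattern = 1.0
            for y, a_j, b_j in zip(person, a, b):
                p = STD_NORMAL.cdf(a_j * t - b_j)
                p_pattern *= p if y == 1 else (1.0 - p)
            likelihood += w * p_pattern
        total_loglik += log(likelihood)
    return -2.0 * total_loglik


# Hypothetical deviance draws from three retained MCMC iterations.
value, p_d = dic([10.0, 12.0, 14.0], 11.0)
print(value, p_d)

# One examinee, one item with a = 1 and b = 0, answered correctly.
print(marginal_deviance_2pno([[1]], [1.0], [0.0]))
```

A marginal DIC would then be obtained by passing marginal deviances, evaluated at each retained draw of the item parameters and at their posterior mean, to `dic` in place of the conditional deviances.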
This study was conducted in two parts, with each part consisting of a different generating model. In the first part, the generating model was the Set-4PNO model, whereas in the second part, the generating model was the 2PNO model. This approach was used as an approximation to model complexity, because the Set-4PNO model has more parameters than the 2PNO model. Ideally, the criterion indices would do as they are designed to do and choose the generating model more often, regardless of model complexity. In both generating model situations, five competing models are fit: the Set-4PNO, exploratory Dyad-4PNO, 4PNO, 3PNO, and 2PNO models.
Using the Set-4PNO model as the generating model, four factors were manipulated. The first factor is the number of item sets
Using the 2PNO model as the generating model, two factors were manipulated. As in the previous paragraph, these were the number of item sets
In all cases, latent trait values were sampled as
4.2.2. Results
The main results in this section concern the proportion of times the correct model was chosen under the various conditions. These results are given in Tables 4 and 5. Some important trends can be found. First, when the 2PNO model was the generating model, both DIC and marginal DIC correctly selected the 2PNO model in every instance of the simulation. Thus, when no slipping or guessing effects are present, this seems to be easily detected via both indices.
Proportion of Models Chosen by Deviance Information Criterion by Condition
Note. 4PNO = four-parameter normal ogive; 3PNO = three-parameter normal ogive; 2PNO = two-parameter normal ogive.
Proportion of Models Chosen by Marginal Deviance Information Criterion by Condition
Note. 4PNO = four-parameter normal ogive; 3PNO = three-parameter normal ogive; 2PNO = two-parameter normal ogive.
The same results do not hold in general when the Set-4PNO model is the generating model. Consider the case using the DIC for model selection. In this case, we find two things. It seems that increasing the effect sizes for guessing and slipping, that is, increasing response uncertainty, increases the chance that the index chooses the wrong model. This is especially true with decreasing numbers of items and smaller item set sizes. In particular, with high guessing and slipping, the exploratory Dyad-4PNO is incorrectly chosen much of the time. For lower values of D and G, the 4PNO model is incorrectly chosen often as well, though less often than the Dyad-4PNO. In the simulation, neither the 3PNO nor the 2PNO was incorrectly chosen.
Using the marginal DIC for model selection when the Set-4PNO is the generating model looks to be a generally better approach than using the DIC. First, increasing guessing and slipping effects does not affect model selection uncertainty with the marginal DIC nearly as much as with the DIC. In all conditions, the marginal DIC never chose the 2PNO, 3PNO, or 4PNO models. In particular, when D is low, and guessing and slipping are high, the Dyad-4PNO was incorrectly chosen only a very small proportion of the time. This is clearly much less often than with the standard DIC.
Taken together, these results show that the Set-4PNO model is distinguishable from other potential competing models. In particular, the marginal DIC is a good metric for making a model selection among the competing models. Furthermore, this shows that if a nondyadic structure for the items is present, then the Set-4PNO model will generally fit better than the exploratory Dyad-4PNO model does, as one would expect.
5. Application
In this section, we provide a real data example to demonstrate the use of this model. The data consist of an experimental IQ data set found on the Open Psychometrics website. The data include the responses of 400 examinees to 25 different items. For each item, examinees are shown a patterned
To demonstrate that allowing for item sets with more than two items can be beneficial, two confirmatory models were fit. For the first model, the dyads chosen by the exploratory Dyad-4PNO in Kern and Culpepper (2020) were fit to the data. This model was chosen because it is presumably the best possible set of two-item sets for these data. However, as shown in Table 6, while several dyad groupings are quite stable in terms of the percentage of time items are matched in the exploratory algorithm, four dyads (namely, Dyads 3, 6, 8, and 10) exhibit much less stability. One hypothesis is that this instability reflects residual dependencies between dyads. Under this hypothesis, two item sets of size four, each consisting of the items from two dyads in the exploratory solution, can be constructed. Implicitly, choosing to merge a pair of dyads imposes an equality constraint on the item set parameters, so to choose which pairs of dyads to merge, we chose the pairs with the smallest differences between the dyad parameters. This leads to an eight-set Set-4PNO model with six item sets of two items and two item sets of four items. Additionally, the more standard 2PNO, 3PNO, and 4PNO models were also fit to the data. This is done as a check to determine whether the Set-4PNO is a better fit to the data than other standard IRT models. All five models were fit using Gibbs sampling.
Exploratory Dyad-4PNO Solution
Note. 4PNO = four-parameter normal ogive.
To implement the MCMC algorithms, it is necessary to set the hyperparameters for the prior distributions. For the person parameters
For all models, parameter convergence was assessed using the potential scale reduction factor
Item Set and Item Parameter Estimates for the Dyad-4PNO and the Set-4PNO Models
Note. 4PNO = four-parameter normal ogive.
Item Parameter Estimates for the 4PNO, 3PNO, and 2PNO Models
Note. 4PNO = four-parameter normal ogive; 3PNO = three-parameter normal ogive; 2PNO = two-parameter normal ogive.
As a first diagnostic, we investigated the latent trait estimates given across the alternative models. Specifically, a scatter plot matrix containing plots of the

Scatter plot matrix of latent trait estimates for each of the five competing models.
Models were also compared using a marginal DIC that was computed for each. DIC is the average deviance of the posterior distribution with a penalty for the effective number of parameters. As previously described, the marginal DIC is like the standard DIC, but the deviance is computed with
Model Fit for the Dyad-4PNO, Set-4PNO, 4PNO, 3PNO, and 2PNO Models
Note. 4PNO = four-parameter normal ogive; 3PNO = three-parameter normal ogive; 2PNO = two-parameter normal ogive; DIC = deviance information criterion.
To show that it was the particular proposed Set-4PNO model that fit the data well and not just any eight sets of items, two cases of competing models were compared with the proposed model. In the first case of models, an item set index vector
A competing model is said to fit better than the proposed Set-4PNO if it has a smaller marginal DIC. Among 1,000 randomly chosen models, none fit better than the proposed model. Similarly, among 172 unique close models, none fit better than the proposed model, though in general, the difference in marginal DIC was smaller than for the randomly chosen models. Even so, the smallest difference between a close model and the proposed model was 9.98. These results are shown in Figure 4. This shows (1) that there is a great deal of heterogeneity in fit among the possible models with a similar structure, (2) that the chosen model is most likely better than one that we could have arrived at by chance alone, and (3) that there are no extremely close models that are better. Note that this does not rule out the possibility that some model is better. More generally, this approach shows how one could assess the goodness of a confirmatory model against those of a similar structure: the smaller the proportion of random models with a smaller DIC than the proposed model, the better.
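The comparison against models of similar structure can be sketched as follows. This is an illustration under stated assumptions, not the article's procedure: it assumes a competing model is built by randomly assigning the 20 grouped items to six sets of two and two sets of four, and it uses made-up marginal DIC values. The function names `random_item_set_index` and `proportion_better` are hypothetical.

```python
import random


def random_item_set_index(set_sizes, rng):
    """Randomly assign items to item sets with the given sizes.

    Returns a list whose j-th entry is the (0-based) set label of item j.
    Different label vectors can describe the same partition, so duplicates
    would need to be screened out if unique structures are required.
    """
    labels = [d for d, size in enumerate(set_sizes) for _ in range(size)]
    rng.shuffle(labels)
    return labels


def proportion_better(competing_dics, proposed_dic):
    """Share of competing models with a smaller marginal DIC than the proposed one."""
    return sum(d < proposed_dic for d in competing_dics) / len(competing_dics)


rng = random.Random(1)
set_sizes = [2] * 6 + [4] * 2  # same structure as the proposed eight-set model
index_vector = random_item_set_index(set_sizes, rng)
print(index_vector)

# Made-up marginal DIC values, for illustration only.
example_share = proportion_better([105.0, 99.0, 120.0, 101.5], proposed_dic=100.0)
print(example_share)
```

Each random index vector would be used to fit a confirmatory model, and the resulting marginal DIC values would be passed to `proportion_better`; a share near zero supports the proposed grouping.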

Histograms of marginal deviance information criteria (DICs) for models with similar item set structure to the proposed set four-parameter normal ogive (Set-4PNO) model. Plot A is the histogram for the randomly chosen item set. Plot B is the histogram for the close item sets. The black vertical line shows the marginal DIC of the proposed Set-4PNO model.
One may question the reason for increased a-parameters in the 4PMs. Experience shows that a-parameters tend to be most affected by the inclusion of guessing and slipping effects in the response model. Note that model fitting in the IRT context is equivalent to finding the set of ICCs, under the constraints of a given model, that best describe the patterns in the data. Then, when fitting two separate IRT models to the same data, the models’ implied ICCs would typically be as close as the model constraints allow. When minimizing the differences between the 2PNO model ICC and the 4PM ICC, the 2PNO slope must be smaller than the 4PM slope. In this way, if guessing and slipping effects are truly present, then the slopes of four-parameter-type models will be larger than those of a 2PNO model fit to the same data.
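The slope argument can be checked numerically with a small sketch. The code below is an illustration built on assumptions, not an analysis from the article: it takes a hypothetical four-parameter ICC (threshold form a θ − b assumed), then finds by grid search the 2PNO slope and threshold that minimize the squared distance between the two curves, weighted by a standard normal trait density. All names and parameter values are invented.

```python
from statistics import NormalDist

N01 = NormalDist()


def icc_4pno(theta, a, b, g, s):
    """Four-parameter normal ogive ICC (assumed threshold form a*theta - b)."""
    return g + (1.0 - s - g) * N01.cdf(a * theta - b)


def best_2pno_fit(a, b, g, s, n_nodes=121, bound=4.0):
    """Grid search for the 2PNO (slope, threshold) closest to a given 4PNO ICC.

    Closeness is the squared ICC difference weighted by the N(0, 1) density.
    """
    step = 2.0 * bound / (n_nodes - 1)
    thetas = [-bound + k * step for k in range(n_nodes)]
    weights = [N01.pdf(t) for t in thetas]
    target = [icc_4pno(t, a, b, g, s) for t in thetas]
    best = None
    for a_k in range(1, 61):          # candidate slopes 0.05, 0.10, ..., 3.00
        a2 = 0.05 * a_k
        for b_k in range(-20, 21):    # candidate thresholds -1.00, ..., 1.00
            b2 = 0.05 * b_k
            loss = sum(w * (N01.cdf(a2 * t - b2) - p) ** 2
                       for t, w, p in zip(thetas, weights, target))
            if best is None or loss < best[0]:
                best = (loss, a2, b2)
    return best[1], best[2]


# Hypothetical item with slope 1.5 and equal guessing and slipping of 0.2.
a2, b2 = best_2pno_fit(a=1.5, b=0.0, g=0.2, s=0.2)
print(f"4PNO slope: 1.5, closest 2PNO slope: {a2:.2f}, threshold: {b2:.2f}")
```

With nonzero guessing and slipping, the four-parameter curve is compressed between its asymptotes, so the closest 2PNO curve has to be flatter; this is the attenuation of the 2PNO slope described above.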
5.1. Alternative Prior Investigation
It is possible that parameter estimates and outcomes of model selection could be sensitive to the choice of a prior distribution. This has been cited before as a reason to be cautious in using the 4PM (Asparouhov & Muthén, 2020). To investigate this, the five models (Dyad-4PNO, Set-4PNO, 4PNO, 3PNO, and 2PNO) were also estimated using several less informative priors than in the previous section. The priors were constructed by crossing two factors: (1) a-parameter prior (
Two measures are used to investigate sensitivity. The first is to compute the marginal DIC for each of the models being compared and see which model corresponds to the smallest marginal DIC in each case. Model selection is then not sensitive to the choice of prior to the extent that the same model is chosen in each case. The second measure of interest is the root mean squared difference between the expected item set proportion correct $S_d$ given the prior in the previous section (i.e.,
where
First, as shown in Table 10, the choice of prior does not seem to change the decision of which of the five models to select. However, decreasing the prior information generally seems to decrease the value of the marginal DIC for each model. This might be expected as the search space for the parameter values would be widened with a larger prior variance. However, the decrease in each case is generally small, though the decrease for the 4PNO is a bit larger (i.e., a decrease of less than 4 for all models except 4PNO).
Marginal Deviance Information Criterion for the Dyad-4PNO, Set-4PNO, 4PNO, 3PNO, and 2PNO Models Under Several Alternative Priors
Note. Bold-face indicates the model chosen by the given prior. 4PNO = four-parameter normal ogive; 3PNO = three-parameter normal ogive; 2PNO = two-parameter normal ogive.
Next, Table 11 shows the RMSDPC of each item set for each of the alternative priors compared to the prior given in the previous section. In every case, this is nearly zero, indicating a close fit in the expected scores within each item set under the Set-4PNO model regardless of prior. This can further be seen in Figure 5, which shows the expected item set scores for each prior. These results, along with the results of the marginal DIC values, show that the choice of prior does not have much effect on the outcome of results, given that the prior is suitably wide.
Squared Deviation of the Expected Proportion Correct for Each Item Set for Alternative Priors From the Prior
Note. 4PNO = four-parameter normal ogive.

Plots of expected proportion correct for each of the eight item sets and each of the priors.
5.2. Investigating Missingness
One thing noted earlier is that there was an overall missingness rate of 3.25% in the sample. To handle missingness above, it was decided to code missing responses as incorrect responses. However, upon further investigation, a couple of observations were made. First, there was a fair amount of heterogeneity in the level of missingness between items, ranging from 0.5% to 13.75% missing, with a median of 2.25% missing. Second, the level of missingness was correlated with the item slipping parameter estimates
To investigate this further, it was decided to consider the case when the missing data values were imputed. To do this, the mice package in R was used, using the predictive mean matching imputation method. After imputation, the data were again used to fit the five competing models. As shown in Table 12, the Set-4PNO model is again chosen as the best model. Further analysis of the Set-4PNO model solution shows that estimated slipping parameters decreased by 11.67% on average and estimated guessing parameters grew by 9.46% on average. This suggests that missingness played a small role in potential over- and underestimation of slipping and guessing parameters, respectively. This seems to be due to general missingness effects seen in IRT modeling, in this case induced by the data being collected in a voluntary low-stakes testing context. However, future investigations should look at this issue further.
Model Fit for the Dyad-4PNO, Set-4PNO, 4PNO, 3PNO, and 2PNO Models Fit With Imputed Data
Note. DIC = deviance information criterion; 4PNO = four-parameter normal ogive; 3PNO = three-parameter normal ogive; 2PNO = two-parameter normal ogive.
6. Discussion
6.1. Contributions
In this article, we offer several contributions to the literature. First, we show that the algorithm presented is able to handle large numbers of items and item sets. Notably, each item set is equivalent to an attribute in the HO-DINA model. Initially, the intention was to compare the newly presented Gibbs sampling algorithm to existing HO-DINA estimation software. Unfortunately, it was quickly found that the existing software was unable to handle a large number of attributes. While theoretically this number of attributes could be found in real assessments, practical constraints presented by the software can hinder investigation of them. The current algorithm, however, can easily estimate a model with 20 or more item sets or attributes. The only constraint is that the test should have simple structure. While this is a seemingly restrictive constraint, we consider it to be a positive in that only knowledge of similarity amongst sets of items is necessary for it to hold approximately.
Second, the algorithm presented is relatively fast. Specifically, for 20 item sets of five items (100 items overall) and 2,000 persons, estimation took approximately 30 minutes on a PC with an Intel i7-6700HQ processor with 16 GB of RAM. In comparison, an attempt to use the
Third, the current model extends the Dyad-4PNO by allowing for confirmatory settings. In a confirmatory setting, a researcher can specify an item structure a priori as a collection of item sets and fit the implied model to the data. This grouping could be obtained through prior exploration or expert knowledge. Furthermore, if several competing structures have been obtained a priori, then each of the implied models can be fit to the data, and the fitted models can then be compared via fit indices such as the marginal DIC. This is particularly powerful when moving from an exploratory stage to a theory testing stage.
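To make the comparison of competing structures concrete, the sketch below computes the standard DIC from posterior deviance draws, using DIC = D̄ + p_D with p_D = D̄ − D(θ̄). This is only the generic definition; the marginal DIC referred to above would additionally require that the deviance be evaluated from a likelihood in which latent variables are integrated out, which is not shown here. The function name and the numerical values are hypothetical.

```python
def dic(deviance_draws, deviance_at_posterior_mean):
    """Standard DIC from MCMC output.

    deviance_draws: deviance evaluated at each retained posterior draw.
    deviance_at_posterior_mean: deviance evaluated at the posterior mean
        of the parameters.
    Returns (DIC, effective number of parameters p_D).
    """
    d_bar = sum(deviance_draws) / len(deviance_draws)  # posterior mean deviance
    p_d = d_bar - deviance_at_posterior_mean           # complexity penalty
    return d_bar + p_d, p_d

# Hypothetical deviance values for two competing item set structures.
dic_a, pd_a = dic([1010.0, 1000.0, 990.0], 980.0)
dic_b, pd_b = dic([1030.0, 1020.0, 1010.0], 1005.0)
preferred = "A" if dic_a < dic_b else "B"
```

Under this criterion, the structure with the smaller DIC is preferred.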
Fourth, the current model further extends the Dyad-4PNO by allowing item sets that are larger than two. The Dyad-4PNO model was developed with minimally sufficient conditions for identification in mind. While this is very useful theoretically, it may be quite limiting in many situations. Identification conditions, however, allow for the possibility that items can be grouped into sets of more than two items. For instance, while two items per item set (i.e., a dyad) are sufficient for generic identifiability to hold, three items per item set are sufficient for strict identifiability to hold (Kern & Culpepper, 2020), and strict identifiability continues to hold with more than three items per item set. Noting that an item set is equivalent to items matching on a binary attribute, it seems reasonable that item sets of more than two items may occur in practice. The Set-4PNO model can accommodate these situations, and as shown in the simulation study, increasing the number of items in an item set can improve parameter estimation.
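The item set structure described above can be sketched as a small data-generating routine. The following Python example is a minimal sketch under stated assumptions: each item set shares one binary augmented variable that follows a two-parameter normal ogive model in the latent trait, and each item in the set responds correctly with probability 1 − s_j when the augmented variable equals 1 and g_j otherwise. The parameterization a_k θ − b_k, the function name, and all parameter values are hypothetical and may differ from the exact specification used in this article.

```python
import random

def simulate_set_4pno(theta, item_sets, a, b, slip, guess, rng):
    """Simulate binary responses under an assumed Set-4PNO-style structure.

    theta: list of latent trait values, one per person.
    item_sets: list of lists of item indices (simple structure: each item
        belongs to exactly one set).
    a, b: per-set discrimination and threshold (assumed parameterization).
    slip, guess: per-item slipping and guessing probabilities.
    """
    n_items = sum(len(items) for items in item_sets)
    data = []
    for th in theta:
        row = [0] * n_items
        for k, items in enumerate(item_sets):
            # Normal ogive link via a latent normal variable:
            # alpha = 1 with probability Phi(a_k * theta - b_k).
            alpha = 1 if a[k] * th - b[k] + rng.gauss(0, 1) > 0 else 0
            for j in items:
                p_correct = 1 - slip[j] if alpha == 1 else guess[j]
                row[j] = 1 if rng.random() < p_correct else 0
        data.append(row)
    return data

# Hypothetical design: one set of three items and one dyad.
rng = random.Random(1)
theta = [rng.gauss(0, 1) for _ in range(500)]
item_sets = [[0, 1, 2], [3, 4]]
responses = simulate_set_4pno(
    theta, item_sets,
    a=[1.0, 1.5], b=[0.0, 0.5],
    slip=[0.10, 0.10, 0.15, 0.05, 0.10],
    guess=[0.20, 0.25, 0.20, 0.15, 0.20],
    rng=rng,
)
```

Setting every entry of `item_sets` to length two gives a dyad structure with fixed pairings, so the same routine covers both cases.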
Fifth, it was shown that the model can be applied to real data settings. To do so, a subset of an experimental IQ data set was used. A comparison of five competing models (i.e., 2PNO, 3PNO, 4PNO, Dyad-4PNO, and Set-4PNO) showed that the models with a grouped item structure (Dyad-4PNO and Set-4PNO) fit better than the standard IRT models (2PNO, 3PNO, and 4PNO). This is unsurprising given previous findings using these data (Kern & Culpepper, 2020). However, it was also shown that allowing an item set structure in which some sets contain more than two items, rather than forcing all sets to have two items as in the Dyad-4PNO model, can increase model fit. Moreover, this increase in model fit is not simply a byproduct of the kind of grouping allowed but rather of the specific sets of items chosen to be grouped together. In other words, knowledge of the sets of items to be grouped together can go a long way toward maximizing model fit. Further investigations with these data showed that the conclusion that the Set-4PNO model fit best among the competing models was not due to the choice of prior, nor was it due to the decision to rescore missing data as incorrect. One thing that was shown, however, was that the estimation of guessing and slipping parameters may be affected by missingness in the data, though the effect of missingness was ultimately small.
6.2. Future Directions and Limitations
While the Set-4PNO model is an important extension of the Dyad-4PNO, several future directions can be identified. First, something not explored here is the estimation of attribute mastery patterns as in a traditional CDM. This has not been pursued due to the focus on the continuous latent trait, but given the growing interest in the psychometric literature in formative assessments and attribute mastery, future studies should investigate the potential of estimating attribute patterns. Importantly, given the capability of the model to efficiently accommodate large numbers of attributes, this may open up possibilities for more fine-grained assessments.
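As a hedged illustration of what estimating attribute mastery could involve, the sketch below applies Bayes' rule to obtain the posterior probability that the augmented variable for a single item set equals 1, given the responses to the items in that set. It assumes known slipping and guessing parameters and a prior mastery probability from a normal ogive model with the hypothetical parameterization Φ(aθ − b); none of these names or values come from the article, and a full treatment would average over posterior draws of all parameters.

```python
from statistics import NormalDist

def mastery_posterior(responses, slip, guess, prior_mastery):
    """Posterior probability that the augmented variable equals 1.

    responses: 0/1 responses to the items in one item set.
    slip, guess: slipping and guessing probabilities for those items.
    prior_mastery: P(alpha = 1 | theta) under the assumed structural model.
    """
    weight_1 = prior_mastery        # accumulates P(alpha = 1) * likelihood
    weight_0 = 1.0 - prior_mastery  # accumulates P(alpha = 0) * likelihood
    for y, s, g in zip(responses, slip, guess):
        weight_1 *= (1 - s) if y == 1 else s
        weight_0 *= g if y == 1 else (1 - g)
    return weight_1 / (weight_1 + weight_0)

# Hypothetical values: a three-item set answered correctly by a person at theta = 0.
a_k, b_k, theta_i = 1.0, 0.0, 0.0
prior = NormalDist().cdf(a_k * theta_i - b_k)
post_all_correct = mastery_posterior([1, 1, 1], [0.1] * 3, [0.2] * 3, prior)
post_all_wrong = mastery_posterior([0, 0, 0], [0.1] * 3, [0.2] * 3, prior)
```

With more items in a set, the responses carry more information about the augmented variable, so this posterior tends to move further from the prior.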
Second, the model and its accompanying algorithm should be expanded in several ways. One important inclusion would be to consider the possibility of a multidimensional structure of the continuous latent trait. With the model as currently posed, with simple structure on the submodel for the attributes, this should be fairly straightforward, though the efficacy of latent trait and parameter estimation would have to be examined, and conditions for identification would need to be established. Given possible criticism of the simple structure requirement of this model, another consideration is determining the extent to which attribute simple structure can be relaxed while still maintaining estimation accuracy and algorithmic speed.
One possible way to expand as briefly discussed is to allow partial information on the item-attribute structure. This would result in a partially exploratory Set-4PNO model. Conceptually, this partially exploratory approach could be used as a way of implementing a Bayesian approach to comparing models. For example, suppose there are two competing models
Another possible way to expand is to allow for larger item sets in an exploratory approach. Currently, the original exploratory Dyad-4PNO allows for only two-item item sets. As shown, this is the minimum necessary for guaranteed identification of slipping and guessing parameters. However, as the example in the current article shows, it is possible to find better fitting solutions with item sets larger than two. The approach taken to finding this solution was to use previous findings from the exploratory Dyad-4PNO together with an observation of less certainty regarding two of the dyads. While this approach, based on data-based observation, worked well, it seems possible that other groupings among the items could fit better yet, which a more general exploratory Set-4PNO approach may be able to discover.
Finally, as suggested by the example, it would be fruitful to further investigate the effects of missingness on the estimation of guessing and slipping parameters. It does seem that the level of missingness may have a small overall effect on these parameter estimates. Further work could be done to modify the estimation routine to allow for missingness, similarly to how it is currently handled in standard IRT model estimation. In this instance, special attention must be given to estimating the item set parameters.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
