Abstract
Conditional independence (CI) between response time and response accuracy is a fundamental assumption of many joint models for time and accuracy used in educational measurement. In this study, posterior predictive checks (PPCs) are proposed for testing this assumption. These PPCs are based on three discrepancy measures reflecting different observable consequences of different types of violations of CI. Simulation studies are performed to evaluate the specificity of the procedure, its robustness, and its sensitivity to different types of conditional dependence, and to compare it to existing methods. The new procedure outperforms the existing methods in most of the simulation conditions. The use of the PPCs is illustrated using arithmetic test data.
1. Introduction
When modeling response accuracy (RA; i.e., a response being correct or incorrect) and response time (RT) in educational and cognitive tests, conditional independence (CI) between RA and RT to the same item is often assumed, given the speed and the ability parameters (van der Linden, 2007, 2009). The relationship between the RAs and the RTs is assumed to be fully explained by the higher-level covariance between speed and ability, such that there is no residual dependence left. The CI assumption can be represented in the following way:
where $x_{pi}$ is the coded binary response of person p to item i, having a value of 1 if the response is correct and a value of 0 otherwise; $t_{pi}$ is the RT of person p on item i; $\theta_p$ and $\tau_p$ are the ability and the speed of person p, respectively; and
In the hierarchical framework for modeling RT and RA (van der Linden, 2007), it is also assumed that the RA depends only on the ability of the person and the item parameters related to RA, denoted by
where
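A hedged sketch of the two assumptions described above. The symbols $\boldsymbol{\psi}_i$ (all item parameters of item i) and $\boldsymbol{\psi}_{1i}$ (the item parameters related to RA) are assumed here for illustration and need not match the article's own notation:

```latex
% CI: given ability, speed, and item parameters, the joint density factorizes
f(x_{pi}, t_{pi} \mid \theta_p, \tau_p, \boldsymbol{\psi}_i)
  = f(x_{pi} \mid \theta_p, \tau_p, \boldsymbol{\psi}_i)\,
    f(t_{pi} \mid \theta_p, \tau_p, \boldsymbol{\psi}_i)

% Hierarchical framework: RA depends only on ability and the RA-related item parameters
f(x_{pi} \mid \theta_p, \tau_p, \boldsymbol{\psi}_i)
  = f(x_{pi} \mid \theta_p, \boldsymbol{\psi}_{1i})
```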
CI can be violated in different ways. For example, a residual correlation between the vector of RAs of all persons to item i and the vector of RTs of all persons answering item i may remain after taking the relationship between speed and ability into account. Such residual correlations can, for example, be modeled using the joint model for RTs and RAs of Ranger and Ortner (2012). These residual correlations may be more than just measurement artefacts, since, for example, the sign of the residual correlation may depend on the item difficulty (Bolsinova, de Boeck, & Tijmstra, 2015).
Violations of CI might not always show up in the form of residual correlation across all persons, since the residual relationship between RT and RA may differ across persons. For example, if there is a negative residual correlation between RT and RA for persons with low ability and a positive residual correlation for persons with high ability, then these correlations may cancel out at the population level. However, this type of violation of CI might show up as heterogeneity of variances (i.e., as differences between the variances of RTs given a correct and given an incorrect response) or as an interaction effect between RT and ability on RA.
An interaction effect between RT and ability may not only be a result of different residual correlations at different ability levels but may also arise due to time heterogeneity of response processes, that is, fast and slow responses being qualitatively different, as suggested by the results of Partchev and De Boeck (2012) and Goldhammer, Naumann, Stelter, Toth, and Rölke (2014). These examples are not meant to be exhaustive, but rather to illustrate that CI can be violated in many different ways, which may threaten the validity of the model. This means that it is important for tests of CI to be able to pick up on these different types of violations.
There are two procedures available in the literature for testing CI. One of them has been proposed from the perspective of the hierarchical model (van der Linden & Glas, 2010). Here, the RTs under CI are modeled using a lognormal distribution, with the mean parameter depending on the item time intensity $\xi_i$ and the person speed, and the item variance parameter denoted by
This model is tested against a parametric alternative:
where the added parameter λ i captures the difference in the location of the distribution of RTs for correct and for incorrect responses. The hypothesis H0: λ i = 0 is tested against Ha: λ i ≠ 0 using a Lagrange multiplier (LM) test. While this test is able to detect differences in the location of the two RT distributions, it is not designed for assessing other types of violations of CI. Moreover, the LM test requires the parametric shape of the distribution of RTs to be correctly specified.
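A sketch of the two RT models being compared, assuming the usual lognormal parameterization in which the location of the log RT is the time intensity minus the speed. The unscaled form of the shift for correct responses is an assumption made here for illustration:

```latex
% H_0 (CI holds): the same lognormal distribution for correct and incorrect responses
\ln t_{pi} \sim N\left(\xi_i - \tau_p,\ \sigma_i^2\right)

% H_a: the location is shifted by \lambda_i for correct responses (x_{pi} = 1)
\ln t_{pi} \mid x_{pi} \sim N\left(\xi_i - \tau_p + \lambda_i x_{pi},\ \sigma_i^2\right)
```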
A different approach has been proposed by Bolsinova and Maris (2016). Their test of CI requires an exponential family model, for example, the Rasch (1960) model, to hold for the RA, which makes it possible to test the following hypothesis:
meaning that for each item within each group of persons with the same value of the sufficient statistic S(
While the number of available tests for CI is limited, a wide range of methodologies have been developed for evaluating the assumption of local independence (i.e., CI between item scores) in the context of item response theory models that do not take RT into account. These methodologies propose measures that capture residual dependencies within item pairs that remain after conditioning on the latent variable(s) explaining the item scores. Some of these measures are based on the observed and expected frequencies in contingency tables, such as χ², G², and the standardized log odds ratio residual (W. Chen & Thissen, 1997). The Mantel–Haenszel statistic is also based on the observed contingency table but considers this table separately for each value of the restscore (Ip, 2001; Sinharay, Johnson, & Stern, 2006). Other measures are based on associations between item residuals in some form or other, such as the Q3 (Yen, 1984), the model-based covariance (MBC; Reckase, 1997), and the standardized MBC (SMBC; Levy, Xu, Yel, & Svetina, 2015).
There are a number of relevant differences between the evaluation of local independence and that of CI between RA and RT. For example, while the assessment of local independence focuses on item pairs, the evaluation of CI between RA and RT is done for individual items. Furthermore, RT is a continuous variable, which prevents a direct application of many of the existing measures to the context of assessing CI between RA and RT, especially those based on contingency tables. However, some of the methods of detecting local dependence may provide valuable starting points for the development of new ways of evaluating CI between RA and RT. Research on detecting violations of local independence (Levy, Mislevy, & Sinharay, 2009; Levy et al., 2015; see also Sinharay et al., 2006) suggests that the Q3 and the SMBC, among others, have relatively high power to detect such violations. This finding provides motivation for considering a similar discrepancy measure in the context of evaluating CI between RA and RT, as will be proposed in the subsequent section.
However, because these methods for assessing local independence solely consider the item scores, they are not tailored toward detecting the different types of violations of CI that may be relevant and realistic in the context of jointly modeling RA and RT that were discussed before. For example, they are not designed to be able to detect differences in the variance of one variable (in our case, RT) for different values of the other variable (RA). Neither are they aimed at detecting differences in discrimination conditional on RT. Therefore, we will consider not only measures of residual correlation (similar to detecting local dependence) but also measures for other consequences of violations of CI.
In the present article, we present a new procedure for detecting violations of CI between RT and RA, which aims to overcome the limitations of the existing methods (LM and KS tests). Based on the general framework of posterior predictive checks (PPCs; Gelman, Meng, & Stern, 1996; Meng, 1994; Rubin, 1984), we developed a PPC procedure using three discrepancy measures targeted at different ways in which CI can be violated.
This article is organized as follows: In Section 2, the hierarchical model for RTs and RAs is elaborated and PPCs are introduced. In Section 3, three item-level discrepancy measures of conditional dependence are introduced and a test-level decision criterion for either rejecting or retaining the CI assumption is described. Section 4 presents simulation studies focused on the specificity, robustness, and sensitivity of the PPCs for CI. The performance of the new procedure is compared to that of the existing methods for testing CI. In Section 5, the use of PPCs is illustrated for an empirical example. The article is concluded with a discussion.
2. Model Specification, Estimation, and PPCs
In this article, we consider a version of the hierarchical model with a log-normal model for the RTs (van der Linden, 2006) as presented in Equation 3 and a two-parameter logistic (2PL) model for the RA (Birnbaum, 1968):
where α i and β i are the discrimination and the difficulty parameters of item i. We consider these particular models in this article, but in general the PPC method can be used for a different specification of the model for the RAs and for the RTs, provided that one can sample from the posterior distribution of the model parameters.
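For reference, a sketch of the 2PL response function with the symbols defined above. The parameterization with α i multiplying the difference between ability and difficulty is the common one and is assumed here:

```latex
P(x_{pi} = 1 \mid \theta_p)
  = \frac{\exp\{\alpha_i(\theta_p - \beta_i)\}}{1 + \exp\{\alpha_i(\theta_p - \beta_i)\}}
```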
At the item level, the dependence between the item parameters is modeled by a multivariate normal distribution:
where
At the person level, a bivariate normal distribution is used to model the dependence between speed and ability:
where
The model can be estimated using a Gibbs Sampler (see Online Appendix A for the details). At each iteration g of the Gibbs Sampler after the burn-in, a sample from the posterior distribution of the model parameters given the data is obtained. Using these values, a new replicated data set with RAs and RTs is generated under the model:
The discrepancy measures of interest are computed at each iteration g for both the observed data, denoted by $D^{(g)}$, and the replicated data, denoted by
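The replication step can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `replicate_data`, the argument layout, and the 2PL parameterization α(θ − β) are assumptions, and the parameter values passed in would be the draws from one Gibbs iteration.

```python
import math
import random


def replicate_data(theta, tau, alpha, beta, xi, sigma, rng):
    """Draw one replicated data set (x_rep, t_rep) under CI.

    theta, tau : person ability and speed (length N)
    alpha, beta: assumed 2PL discrimination and difficulty (length n)
    xi, sigma  : lognormal time intensity and SD of the log RT (length n)
    rng        : a random.Random instance
    """
    x_rep, t_rep = [], []
    for th, ta in zip(theta, tau):
        x_row, t_row = [], []
        for a, b, xi_i, s in zip(alpha, beta, xi, sigma):
            # RA: Bernoulli draw from the 2PL response probability
            p_correct = 1.0 / (1.0 + math.exp(-a * (th - b)))
            x_row.append(1 if rng.random() < p_correct else 0)
            # RT: lognormal draw, independent of the RA given the parameters
            t_row.append(math.exp(rng.gauss(xi_i - ta, s)))
        x_rep.append(x_row)
        t_rep.append(t_row)
    return x_rep, t_rep


# Usage with made-up parameter values for 3 persons and 2 items
rng = random.Random(1)
x_rep, t_rep = replicate_data(
    theta=[0.0, 1.0, -1.0], tau=[0.1, -0.2, 0.0],
    alpha=[1.0, 1.5], beta=[0.0, 0.5],
    xi=[4.0, 3.5], sigma=[0.5, 0.4], rng=rng,
)
```

Because the RA and the RT are drawn separately given the parameters, the replicated data satisfy CI by construction, which is what makes them a reference for the observed discrepancies.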
3. Discrepancy Measures
In the Introduction, we discussed three possible consequences of violations of CI: a residual correlation between RT and RA, heterogeneity of variances of RT between the correct and the incorrect responses, and an interaction effect of RT and ability on RA. Here we describe three discrepancy measures that address these consequences of violations of CI.
The first discrepancy measure considers the partial correlation between the observed RAs and the observed RTs to item i, given the persons’ ability and speed parameters.
This correlation can be computed using
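One standard route, sketched here as an assumption rather than as the article's own formula, is the recursive partial correlation formula: first partial out ability, then partial out speed. The function names are hypothetical, and whether `t` holds raw or log-transformed RTs is left to the caller.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def partial_given_one(r_ab, r_ac, r_bc):
    """Correlation of a and b after partialling out c."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))


def partial_corr_ra_rt(x, t, theta, tau):
    """Partial correlation of x and t for one item, given theta and tau."""
    # First-order partials given theta
    r_xt = partial_given_one(pearson(x, t), pearson(x, theta), pearson(t, theta))
    r_x_tau = partial_given_one(pearson(x, tau), pearson(x, theta), pearson(tau, theta))
    r_t_tau = partial_given_one(pearson(t, tau), pearson(t, theta), pearson(tau, theta))
    # Second-order partial: additionally partial out tau
    return partial_given_one(r_xt, r_x_tau, r_t_tau)
```

Within the PPC, `theta` and `tau` would be the person parameter draws of the current iteration, so this measure has to be recomputed for the observed data at every iteration.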
The second discrepancy measure considers the difference between the log of the variance of the observed RTs of the correct responses and the log of the variance of the observed RTs of the incorrect responses:
This discrepancy measure is aimed at the kind of violation of CI in which there is not necessarily a residual correlation between the RAs and the RTs, but where the two distributions of RTs differ in their variances. For example, for some items, correct responses might generally tend to be less variable in terms of RT, because the underlying response processes may be more similar to each other than those leading to incorrect responses.
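A minimal sketch of this measure for a single item. The function name is hypothetical, the unbiased sample variance is an assumed choice, and whether `t` holds raw or log-transformed RTs is left to the caller.

```python
import math
from statistics import variance


def log_variance_difference(x, t):
    """Log variance of RTs of correct responses minus that of incorrect ones.

    x: 0/1 responses to one item; t: the corresponding RTs.
    Each group needs at least two responses with non-identical RTs.
    """
    correct = [float(ti) for xi, ti in zip(x, t) if xi == 1]
    incorrect = [float(ti) for xi, ti in zip(x, t) if xi == 0]
    return math.log(variance(correct)) - math.log(variance(incorrect))
```

A value of zero corresponds to equal variances, negative values to less variable RTs for correct responses, and positive values to the reverse.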
The final discrepancy measure considers the difference between the item-rest correlation of the item for fast responses and the item-rest correlation of the item for slow responses:
where the slow and the fast responses are defined as the observed responses with an RT longer or shorter than the sample median RT to the item ($T_{i,\mathrm{med}}$), respectively. This measure is aimed at a type of violation of CI where slow and fast responses do not necessarily differ in the probability of a correct response but where they do differ in the strength of the relationship between the item and the measured ability. As has been found in empirical data and as predicted by measurement models (Bolsinova et al., 2015; Coyle, 2003; Maris & van der Maas, 2012), slow responses may sometimes be more or less informative about a person’s ability than the fast responses. The item-rest correlation is used since it is a simple classical test theory statistic that roughly captures the discriminatory power of the item.
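A minimal sketch of this measure, taking the difference as fast minus slow in line with the order used in the definition above. The function names and the person-by-item data layout are assumptions made for illustration.

```python
import math
from statistics import median


def pearson(a, b):
    """Pearson correlation of two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def item_rest_difference(x, t, i):
    """Item-rest correlation of item i for fast minus for slow responses.

    x, t: lists of per-person rows with 0/1 responses and RTs.
    """
    t_med = median(row[i] for row in t)
    item = [row[i] for row in x]
    rest = [sum(row) - row[i] for row in x]  # rest score excludes item i
    # Responses with an RT exactly equal to the median fall in neither group
    fast = [p for p, row in enumerate(t) if row[i] < t_med]
    slow = [p for p, row in enumerate(t) if row[i] > t_med]
    r_fast = pearson([item[p] for p in fast], [rest[p] for p in fast])
    r_slow = pearson([item[p] for p in slow], [rest[p] for p in slow])
    return r_fast - r_slow
```

The correlations are undefined when the item scores or rest scores are constant within a group, so a production version would need to guard against that case.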
The last two measures are test statistics, meaning that they do not depend on the values of the model parameters. For the observed data, they have to be computed only once. This is not the case for the first measure, since it conditions on the values of
At each iteration, three replicated discrepancy measures are computed per item in the same way as the observed measures in Equations 11, 12, and 13, but using the replicated data
In the case of these three discrepancy measures, both PPP values close to zero and those close to one indicate that the model does not adequately capture the aspects of the data summarized by these discrepancy measures, since both highly positive and highly negative residual correlations, differences between log variances, and differences between item-rest correlations are indicative of conditional dependence.
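The computation of a PPP value and the two-sided notion of extremeness can be sketched as below. The direction of the comparison (replicated at least as large as observed) is an assumption, chosen so that small PPP values correspond to observed discrepancies that are larger than expected under the model; the function names and the default cutoff are hypothetical.

```python
def ppp_value(d_obs, d_rep):
    """Proportion of iterations in which the replicated discrepancy is at
    least as large as the observed one.

    d_obs, d_rep: per-iteration discrepancies for one item. For a measure
    that does not depend on the parameters, d_obs repeats a single value.
    """
    if len(d_obs) != len(d_rep):
        raise ValueError("d_obs and d_rep must have the same length")
    return sum(r >= o for o, r in zip(d_obs, d_rep)) / len(d_obs)


def is_extreme(p, pi=0.05):
    """Two-sided check: p below pi/2 or above 1 - pi/2 counts as extreme."""
    return p < pi / 2 or p > 1 - pi / 2


# Usage: an observed discrepancy of 0 compared with four replicated values
p = ppp_value([0.0, 0.0, 0.0, 0.0], [1.0, -1.0, 2.0, -2.0])
```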
3.1 Test-Level Decision Criterion
Based on the observed distribution for each of the three PPP values of the items in a test, a researcher has to decide whether there are too many extreme values to retain the assumption of CI. This means that some criterion has to be chosen that determines whether a PPP value should be considered extreme. We suggest symmetrically considering PPP values below
where p crit is a chosen value on the interval from 0 to 1 which is supposed to be acceptably low, for example, .05. If the distribution of the PPP values is not uniform but more concentrated around .5, then this makes the criterion more conservative, because the probability of an extreme value will be smaller than under uniformity. Because uniformity and independence of the PPP values are not guaranteed, the proposed criterion should not be taken to imply that the false positive rate is fixed at a specific known value p crit. Rather, the theoretical binomial distribution provides a mathematically convenient starting point from which to derive a criterion that may be useful but whose performance needs to be assessed, as will be done in the simulation study. When using multiple PPCs, one might choose to use a lower value for p crit to prevent the inflation of the overall chance of a misclassification of the data set as having a violation of CI due to multiple testing. Here we use .05/m, where m is the number of PPCs that are used to assess CI.
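A hedged reconstruction of how such a binomial criterion can be computed: treat the number of extreme PPP values among n items as Binomial(n, π) and find the smallest count k whose upper tail probability P(X > k) falls below p crit. The rule and the function names are assumptions made for illustration; with π = .05 the rule reproduces the cutoffs reported in Section 4.1.1.

```python
from math import comb


def binomial_upper_tail(k, n, pi):
    """P(X > k) for X ~ Binomial(n, pi)."""
    return 1.0 - sum(comb(n, j) * pi ** j * (1 - pi) ** (n - j)
                     for j in range(k + 1))


def max_tolerated_extremes(n, pi=0.05, p_crit=0.05):
    """Smallest k with P(X > k) < p_crit; CI is rejected if n_extreme > k."""
    for k in range(n + 1):
        if binomial_upper_tail(k, n, pi) < p_crit:
            return k
    return n


# Single check (p_crit = .05) and three combined checks (p_crit = .05/3)
single_20 = max_tolerated_extremes(20)
single_40 = max_tolerated_extremes(40)
combined_20 = max_tolerated_extremes(20, p_crit=0.05 / 3)
combined_40 = max_tolerated_extremes(40, p_crit=0.05 / 3)
```

Under this rule a stricter p crit raises the cutoff only when the tail probability at the previous cutoff is not already below the new value, which is why the cutoff for 20 items stays the same when three checks are combined.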
4. Simulation Studies
4.1. Specificity of the PPCs
4.1.1. Specificity when the lower level models are correctly specified
Methods
The data were generated under CI using the hierarchical model for the RT and RA, using a 2PL model for the RA (see Equation 6) and a lognormal model for the RTs (see Equation 3). The item and the person parameters were simulated in the following way:
For each combination of sample size (500, 1,000, or 2,000), test length (20 or 40), and correlation between ability and speed (0 or .5), 500 replicated data sets were simulated. The hierarchical model was fitted to each of the replicated data sets using the Gibbs Sampler (see Online Appendix A) with 3,000 iterations (including 1,000 of burn-in), and every second iteration after the burn-in was used for the PPCs, with the three discrepancy measures described in the previous section.
First, the performance of each of the discrepancy measures was evaluated separately, as if only one measure were used for testing CI, with p crit = .05. Following that criterion and the guidelines in Equation 14, CI was concluded to be violated for the test if n extreme > 3 for n = 20 and if n extreme > 4 for n = 40. Second, the performance of the combination of the three discrepancy measures was evaluated, where p crit = .05/3 was used for each of the PPCs. CI was considered to be violated if for at least one of the three measures n extreme > 3 for n = 20 or n extreme > 5 for n = 40.
The Type I error rates of the existing procedures (the LM test and the KS tests) were also evaluated in this simulation study. For the LM test, Bonferroni correction was used to control for the effect of multiple comparisons. For the KS tests, the equality of the RT distributions was tested after conditioning on the number of items correct and CI was rejected if either the minimum of the item-level p-values was smaller than .05/n or their maximum was smaller than .05 (Bolsinova & Maris, 2016).
Results
Figure 1 shows the histograms of the PPP values for the three discrepancy measures for all the simulation conditions combined. The three observed distributions closely resemble a uniform distribution. However, for p 1i , the extreme values are slightly underrepresented compared to a uniform distribution, meaning that the proportion of values of p 1i that are considered to be extreme is lower than π.

Figure 1. Distributions of the posterior predictive p-values (PPP values) for the three discrepancy measures under conditional independence for all 12 conditions combined. Each histogram is based on 180,000 PPP values.
Table 1 shows for each condition the proportion of data sets where a violation of CI was falsely detected, based on each of the PPCs individually and based on the three checks combined. For p 1i , the false positive rate was generally lower than for the other two checks, which is in line with the observation made in the previous paragraph that the distribution of p 1i (Figure 1) shows some deviation from uniformity. For p 2i and p 3i , the false positive rate was close to the proportion of rejections that would be expected for PPP values that are i.i.d. uniform (.016 for n = 20 and .048 for n = 40). For the combination of the three checks, the proportion of false positives is somewhat lower than expected for i.i.d. uniform PPP values (.048 for n = 20 and .042 for n = 40).
Table 1. Proportion of Data Sets Generated Under CI Where a Violation of CI Was Detected Based on Each of the Three PPCs Individually and Combined, the LM Test and the KS Tests (500 Replications)
Note. KS = Kolmogorov–Smirnov; LM = Lagrange multiplier; PPCs = posterior predictive checks; CI = conditional independence.
Table 1 also includes the observed Type I error rates of the LM test and the KS tests. The Type I error rate of the LM test is in most of the conditions slightly lower than .05. The KS tests are even more conservative, since correction for multiple testing is performed both within an item and between the items (Bolsinova & Maris, 2016).
4.1.2. Specificity when the lower level models are misspecified
The first simulation results showed that the PPCs rarely classified data sets generated under CI as having violations of the assumption. Next, we evaluate the robustness of the PPCs to the misspecification of the RT model and the RA model. These lower level misspecifications do not affect the relationship between the RAs and the RTs and do not influence CI. However, it could be that the performance of the procedure is affected because the posterior predictive distribution is obtained using the wrong model. We investigated whether the specificity of the PPCs suffers from these misspecifications.
Methods
Two misspecifications of the lower-level models were considered. First, the model for the RTs may be misspecified. Here, we consider a situation where the data generating model for the RTs includes an extra item parameter:
where ai is an extra item parameter similar to the item discrimination in the model for the RA (Fox & van der Linden, 2007; Klein Entink, Fox, & van der Linden, 2009), which reflects that items might differ with respect to the decrease in RT as speed increases. Second, the model for RA can be misspecified. Here, we consider the case where the data generating model is not a 2PL but a three-parameter logistic model (Birnbaum, 1968):
where $c_i$ is the guessing parameter of item i. The extra item parameters were simulated as follows:
First, the robustness of the PPCs under a baseline condition (N = 1,000, n = 20, ρθτ = .5) was analyzed. Then, the effect of changing one of these design parameters was investigated, resulting in five simulation conditions for each type of misspecification (note that to investigate the effect of sample size both a smaller and a larger sample size were considered).
For each condition, 500 data sets were simulated under CI. The hierarchical model with lower-level models defined in Equations 3 and 6 was fitted to the replicated data sets and the PPCs were performed. Robustness of the PPCs was compared to that of the LM test (see Online Appendix B for details) and the KS tests.
Results
Table 2 shows that when the RT model was misspecified, the specificity of the PPCs did not appear to be affected: The proportions of the data sets in which a violation of CI was falsely detected were similar to those when the lower-level models were correctly specified (see Table 1). The Type I error rate of the KS tests was not inflated. The Type I error rate of the LM test was strongly inflated (up to 1) in the conditions with correlated speed and ability. This means that, unlike the other two tests, the LM test is very much dependent on the correct specification of the RT distribution.
Table 2. Robustness of the Tests of CI Against the Misspecifications of the Lower-Level Models (500 Replications)
Note. The data were generated under CI, but during estimation either the RT or the RA model was misspecified. KS = Kolmogorov–Smirnov; LM = Lagrange multiplier; PPCs = posterior predictive checks; CI = conditional independence. Design factors that deviate from the baseline condition are indicated in boldface.
When the RA model was misspecified, the specificity of p 1i and p 2i was hardly affected. However, the specificity of p 3i suffered from this misspecification when ρθτ = .5, in the sense that the false positive rate was considerably larger than when the RA model was correctly specified (see Table 1). This happens because when θ and τ are correlated, the ability of persons with slow responses is on average lower than the ability of persons with fast responses, and therefore the item-rest correlation for the slow responses (corresponding to on average lower ability) decreases due to guessing, which makes the distribution of p 3i not uniform but skewed with a large number of values close to 0. The performance of the combination of the three PPCs was affected due to the problem with p 3i . If guessing is an important factor in a test, then it may be advisable either not to use p 3i or to use a model that accounts for the guessing behavior. The Type I error rates of the LM test and of the KS tests were not inflated when the RA model was misspecified.
4.2. Sensitivity of the PPCs
Methods
To evaluate how well the PPCs detect violations of CI, we simulated data under different models with five different types of violations of CI. The exact specification of each violation can be found in Tables 3 and 4, and the choice of these conditions is motivated below.
Table 3. Specification of the Models for RT and RA for the Different Types of Violations of CI
Note. CI = conditional independence; RA = response accuracy; RT = response time.
Table 4. Specification of Medium and Small Violations of CI of Each Type
Note. CI = conditional independence; RT = response time.
Types 1 and 2 both specify that the distribution of tpi depends on whether the response xpi is correct or incorrect. The violation of Type 1 is the kind of violation for which the LM test for CI (van der Linden & Glas, 2010) was designed: The location parameters of the lognormal distribution of RTs differ for the correct responses (ξ i + λ i σ i ) compared to the incorrect responses (ξ i ). Type 2 captures the idea that the dependence between RT and RA is not necessarily constant across persons and could depend on the person’s ability, as is predicted by the Signed Residual Time model (Maris & van der Maas, 2012) and as has been addressed by the KS tests of Bolsinova and Maris (2016). Here, ηθ p is the standardized difference between the expectation of the log RTs for the correct responses and for the incorrect responses for person p. A negative value of η has been used, meaning that for persons with a high ability, correct responses are faster than incorrect responses, while for persons with a low ability, incorrect responses are faster than correct responses.
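As an illustration of Type 1, the locations given above (ξ i + λ i σ i for correct and ξ i for incorrect responses) can be written as a single model for the log RT. Subtracting the person speed from the location is an assumption carried over from the lognormal RT model, not something stated in this paragraph:

```latex
\ln t_{pi} \mid x_{pi} \sim N\left(\xi_i + \lambda_i \sigma_i x_{pi} - \tau_p,\ \sigma_i^2\right)
```

Under this form, λ i expresses the shift for correct responses in units of the residual standard deviation of the log RT of item i.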
Type 3 focuses on the variances of log RTs. While CI predicts equal residual variances of the log RTs for correct and incorrect responses, this does not have to hold in practice. It could be the case that the RTs of low-ability test takers are more varied than those of high-ability test takers, for example, because the former may sometimes skip or guess on a question. Because high-ability test takers will have a higher proportion of correct responses, this will result in a lower residual variance for correct responses than for incorrect responses, meaning that CI (given θ and τ) is violated. We specified this condition by setting the variance of log RT to be person as well as item dependent, where the person component of the variance is negatively correlated with ability.
Types 4 and 5 capture violations of CI due to differences between fast and slow response processes, a possibility that has been discussed in the literature (Bolsinova et al., 2015; H. Chen & De Boeck, 2014; Goldhammer, Naumann, Stelter, Toth, & Rölke, 2014; Partchev & De Boeck, 2012). Different item response functions were specified depending on a response being relatively fast or slow, compared to what would be expected under the lognormal RT model. In Table 3, we use a dummy variable
In Type 4, the difficulties differ for slow responses (β i + δ i ) compared to fast responses (β i ). In Type 5, the discriminations are different: α i γ i and α i for slow and fast responses, respectively, capturing the idea that the amount of information that a response provides is different for slow and fast responses (Coyle, 2003; Maris & van der Maas, 2012).
First, for each of the types of violations, the assumption of CI was tested in a baseline condition: N = 1,000, n = 20, ρθτ = .5 and a medium violation of CI. Second, the effect of changing one of the following parameters compared to the baseline condition was evaluated: sample size (500 or 2,000 instead of 1,000), test length (40 items instead of 20), correlation between ability and speed (0 instead of .5), and size of violation (small instead of medium), resulting in five extra conditions per type of violation. Finally, one extra condition with N = 2,000 and a small violation was used, because we expected that a sample size of 1,000 might not be enough for detecting small violations. In each of the conditions, PPCs with the combination of {p 1i , p 2i , p 3i } (see subsection 4.1.1 for details), the LM test and the KS tests were performed in each of 100 simulated data sets.
Results
Figure 2 shows the distribution of the PPP values for the baseline condition for each of the types of violations of CI. Violations of Types 1 and 4 resulted in a large number of extreme p 1i s, meaning that the observed residual correlations between RAs and RTs are generally stronger than expected under CI. Violations of Type 2 resulted in a large number of small p 2i s, meaning that differences between the observed variances of the RTs of the correct responses and those of the incorrect responses are generally higher (i.e., more positive) than expected under CI. Additionally, violations of Type 2 resulted in an even larger number of large p 3i s, meaning that the differences between the observed item-rest correlations for the slow responses and for the fast responses are generally lower (i.e., more negative) than expected under CI. Violations of Type 3 resulted in a large number of large p 2i s, meaning that differences between the observed variances of the RTs of the correct responses and those of the incorrect responses are generally lower (i.e., more negative) than expected under CI. Violations of Type 5 resulted in a large number of extreme p 1i s and also a rather large number of extreme p 3i s.

Figure 2. Distributions of the posterior predictive p-values (PPP values) for the three discrepancy measures for the five types of violations of conditional independence in the baseline condition (based on 2,000 PPP values).
Table 5 shows the proportion of data sets in which violations of CI were detected in each of the conditions by each of the three procedures (PPCs, LM test, and KS tests). Because in practice we recommend using the combination of discrepancy measures in order to be able to detect a variety of ways in which CI might be violated, we present only the results based on the combination of the three measures in Table 5 (see subsection 3.1 for details). The results for the individual discrepancy measures can be found in Online Appendix C.
Table 5. Proportion of Correctly Detected Violations of CI (100 Replications)
Note. CI = conditional independence; PPCs = posterior predictive checks using the combination {p 1i , p 2i , p 3i }, LM = Lagrange multiplier test (van der Linden & Glas, 2010), KS = Kolmogorov–Smirnov tests (Bolsinova & Maris, 2016). For the specifications of medium (m) and small (s) sizes of violation, see Table 4. Design factors that deviate from the baseline condition are indicated in boldface.
The PPCs and the LM test detected violations of Type 1 in all simulated data sets in all conditions. The KS tests did not have adequate power (>.8) in the condition with N = 1,000 and a small violation but did have adequate power in all other conditions. Only the PPCs had adequate power in all the conditions of Type 2. The LM test lacked power to detect violations of Type 2 when the sample size was small (N = 500) and when the size of the violation was small. The KS tests had lower power than the LM test and the PPCs in all conditions of Type 2. For violations of Type 3, the PPCs had adequate power in every condition except for the combination of a small violation and N = 1,000. The KS tests, however, were unable to detect this type of violation adequately. In most conditions, the LM test had lower power to detect violations of Type 3 than the PPCs, but it performed better than the PPCs when the violations were small and N = 1,000. However, using the LM test results in a misrepresentation of the kind of violation present in the data: The locations of the two distributions are actually the same, while the variances are different. The PPCs outperformed the other two procedures in all conditions of Type 4 but had relatively low sensitivity when the violation was small (.48 and .74 when N = 1,000 and N = 2,000, respectively). The most difficult type of violation to detect was Type 5. Only the PPCs achieved power above .8, and only in two conditions: when either the sample size or the number of items was large. In all other conditions, the PPCs performed similarly to the LM test, while both outperformed the KS tests.
5. Empirical Example
To illustrate the performance of the PPCs for CI, the procedure was applied to data from an arithmetic test that is part of the exit examination in Dutch secondary education. The responses of students from a common educational background (preparatory higher vocational education) to one of the test versions (consisting of 52 items) were used. Only the items with proportions of correct responses between .2 and .8 were used, resulting in a final test length of 38 items. Data from one person were deleted because the total time on the test was 197 seconds, while all other students spent more than 1,000 seconds on the test. The final sample size was 610.
A hierarchical model with a 2PL model for RA (see Equation 6) and a log-normal model for RT (see Equation 3) was fitted to the data using a Gibbs Sampler (see Online Appendix A). Two chains with 11,000 iterations each (including 1,000 iterations of burn-in) were used. Convergence was evaluated using
The PPCs resulted in 22 extreme $p_{1i}$ values (either above .975 or below .025), 32 extreme $p_{2i}$ values, and 8 extreme $p_{3i}$ values. Based on these results, we conclude that CI is violated for the data set at hand, since for at least one of the discrepancy measures (and in this case for all) the number of extreme PPP values exceeded 5 (see Equation 14). Therefore, the hierarchical model does not seem to hold for these data. Among the items with an extreme $p_{1i}$, RA had a positive residual correlation with RT for 14 items and a negative residual correlation with RT for 8 items. Among the items with an extreme $p_{2i}$, the RT distribution of correct responses had a higher variance than the RT distribution of incorrect responses for 2 items and a lower variance for 30 items. Among the items with an extreme $p_{3i}$, the observed item-rest correlation was higher for the fast responses for 6 items and higher for the slow responses for 2 items. Figure 3 shows the histograms of the discrepancy measures of the observed data. Since the first discrepancy measure is computed at each iteration, its expected a posteriori (EAP) estimate is calculated for each item, denoted by EAP($D_{1i}$). The values of the observed discrepancy measures give some indication of the size of the violations of CI. For example, for 6 items, the EAP of the residual correlations exceeded the benchmark for a small effect size, and for 7 items, the variance of the log RTs of incorrect responses was at least 3 times as large as the variance of the log RTs of correct responses.

Figure 3. Histograms of the discrepancy measures between the observed data and the hierarchical model assuming conditional independence for the arithmetics test data.
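The decision rule applied in this example (count the PPP values above .975 or below .025 per discrepancy measure, then compare each count with a critical number) can be sketched as follows. The function names are hypothetical, and the threshold of 5 is taken from the text for this 38-item data set; since it follows from Equation 14, it is left as a parameter here rather than hard-coded.

```python
def count_extreme(ppp_values, lower=0.025, upper=0.975):
    """Count PPP values strictly below `lower` or strictly above `upper`."""
    return sum(1 for p in ppp_values if p < lower or p > upper)


def reject_ci(ppp_by_measure, max_extreme=5):
    """Flag a violation of CI if any measure has more than `max_extreme`
    extreme PPP values.

    ppp_by_measure maps a measure label to its list of per-item PPP values.
    max_extreme=5 is the value reported for this example, not a general constant.
    Returns (flag, counts) so the per-measure counts can be inspected.
    """
    counts = {name: count_extreme(vals) for name, vals in ppp_by_measure.items()}
    return any(c > max_extreme for c in counts.values()), counts
```

With the counts reported above (22, 32, and 8), each measure on its own would exceed the critical number, which matches the conclusion that CI is violated.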
6. Discussion
The PPCs presented in this article offer a powerful, robust, and flexible approach to testing CI. In most conditions of the simulation study, the PPCs detected violations of CI more often or at least as often as the existing tests of CI. Results strongly indicate that the proposed PPC method can be useful in detecting different types of violations of CI, but further research may be needed to determine the performance of the procedure in a wider range of realistic scenarios.
The three proposed discrepancy measures capture different ways in which CI may be violated. $D_{1i}$ measures the residual correlation that remains between RA and RT after taking speed and ability into account.
Positive residual correlations, which were the most common in the empirical example, could be explained by persons varying their effective speed as a consequence of changing their position on the speed-accuracy trade-off (van der Linden, 2009). However, negative residual correlations cannot be explained in this way. A possible explanation for the negative correlations, which were also found in the empirical example, is that respondents finish working on an item if they have found a correct response but may continue working on the item if they have not yet found the correct answer. In that case, it could be that slow responses are less often correct, resulting in a negative residual correlation. Further research is needed to reveal the different possible causes of residual correlations for different types of items.
$D_{2i}$ captures differences in the variance of RTs depending on whether a response is correct or incorrect. High values of $p_{2i}$, as were observed for many of the items in the empirical example, indicate that RT is more variable for incorrect responses and may reflect that the response processes underlying incorrect responses are more heterogeneous than those resulting in correct responses. This could also be of substantive interest, because it may be relevant to distinguish between different processes that lead to an incorrect response (e.g., guessing, anxiety, or misapplying a response strategy).
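As a rough illustration of the kind of quantity involved, the sketch below computes the descriptive ratio used in the empirical example: the variance of the log RTs of incorrect responses divided by the variance of the log RTs of correct responses for one item. It is not claimed to reproduce the exact definition of the second discrepancy measure; the function name and the input layout are assumptions.

```python
import math
from statistics import variance


def log_rt_variance_ratio(responses, times):
    """Ratio var(log RT | incorrect) / var(log RT | correct) for one item.

    responses: 0/1 scores of all persons on the item (hypothetical layout).
    times:     matching response times, all strictly positive.
    """
    log_correct = [math.log(t) for x, t in zip(responses, times) if x == 1]
    log_incorrect = [math.log(t) for x, t in zip(responses, times) if x == 0]
    if len(log_correct) < 2 or len(log_incorrect) < 2:
        raise ValueError("need at least two correct and two incorrect responses")
    # Sample variances; a ratio above 1 means incorrect responses are more variable.
    return variance(log_incorrect) / variance(log_correct)
```

A ratio of 3 or more would correspond to the situation reported for 7 items in the empirical example.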
$D_{3i}$ was designed to measure the difference between fast and slow responses in terms of item discrimination, as has been suggested in the literature (Goldhammer et al., 2014; Partchev & De Boeck, 2012). A large number of extreme $p_{3i}$ values might indicate that fast and slow responses have underlying processes that are qualitatively different and that, as a result, certain types of responses based on one process (e.g., careful deliberation) might be more informative about ability than responses based on other processes (e.g., guessing). It may be relevant to take this into consideration when modeling the data, as has been suggested by Bolsinova, de Boeck, and Tijmstra (2015).
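The idea of comparing item-rest correlations between fast and slow responses can be illustrated with a small sketch. Several choices here are assumptions made only for illustration and are not taken from the article: responses are split at the median RT of the item, the rest score is the sum of the scores on all other items, and the difference is reported as fast minus slow.

```python
import math
from statistics import median


def pearson(a, b):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def item_rest_difference(X, T, item):
    """Item-rest correlation for fast responses minus that for slow responses.

    X: person-by-item 0/1 scores, T: person-by-item RTs (hypothetical layout).
    Fast/slow is defined by a median split on the item's RTs (an assumption).
    """
    scores = [row[item] for row in X]
    rest = [sum(row) - row[item] for row in X]  # score on all other items
    times = [row[item] for row in T]
    cut = median(times)
    fast = [k for k, t in enumerate(times) if t <= cut]
    slow = [k for k, t in enumerate(times) if t > cut]
    r_fast = pearson([scores[k] for k in fast], [rest[k] for k in fast])
    r_slow = pearson([scores[k] for k in slow], [rest[k] for k in slow])
    return r_fast - r_slow
```

Under these assumptions, a positive value means the item discriminates better among fast responses, which is the pattern reported for 6 of the 8 flagged items in the empirical example.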
In our treatment of the PPC method, we proposed to use a combination of discrepancy measures rather than base the decision to reject CI on a single discrepancy measure. The motivation for this choice is that in practice, it may be difficult to anticipate which types of violations are likely to occur in the data. Furthermore, it could very well be that different types of violations are present for different items, making it useful to look at a set of discrepancy measures that covers a range of potential consequences of these violations. Additionally, having information about these different types of discrepancies provides a more informative picture of the likely sources of the violation and may provide a user with suggestions of where it makes sense to extend the model to incorporate those sources. However, as mentioned before, care should be taken when inferring the source of conditional dependence, because the observed discrepancies may be due to a variety of different sources. Also, there may be other types of violations that are not addressed by the discrepancy measures presented in this article. However, we offer a flexible framework that can be extended with new discrepancy measures if particular violations of CI are suspected.
As we have shown in the simulation study, the PPC procedure, unlike the LM test, is rather robust to violations in the lower-level models for RT and RA. This is important because a model used for analysis is always a simplification of reality. Moreover, the procedure can be easily extended to deal with more complex lower-level models, for example, by including a time discrimination parameter, a guessing parameter, or a different RT distribution, as long as a Bayesian estimation algorithm for these models is available.
The method requires a choice of a higher-level model but is flexible with respect to which model is chosen. In this article, the hierarchical model for RT and RA (van der Linden, 2007) that is prevalent in educational measurement was used, but the general method can readily be adapted to be applicable for other models that assume CI, for example, the diffusion model (Ratcliff, 1978) or the independent race model (Tuerlinckx & De Boeck, 2005), provided methods are available for sampling from the posterior distribution of the model parameters and for simulating data under the model.
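The two requirements named above (posterior draws and a way to simulate data under the model) are all that a generic PPP computation needs. The following sketch shows the general shape of such a computation. The function and argument names are hypothetical, and the direction of the comparison (replicated discrepancy at least as large as the observed one) is an assumption rather than the article's exact definition.

```python
def ppp_value(posterior_draws, simulate, discrepancy, observed):
    """Posterior predictive p value for one discrepancy measure.

    posterior_draws: iterable of parameter draws from any CI model.
    simulate(params): returns one replicated data set under the model.
    discrepancy(data, params): returns a number; may depend on the parameters.
    observed: the observed data set.
    """
    draws = list(posterior_draws)
    exceed = 0
    for params in draws:
        replicated = simulate(params)
        # Compare replicated and observed discrepancies at the same draw.
        if discrepancy(replicated, params) >= discrepancy(observed, params):
            exceed += 1
    return exceed / len(draws)
```

Because the model enters only through `simulate` and the posterior draws, swapping the hierarchical model for another model that assumes CI would leave this function unchanged.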
Whereas the PPP values may help users determine whether conditional dependence is present, the proposed discrepancy measures may provide insight into the severity of those violations. These measures can function as indicators of effect sizes of the different ways in which CI can be violated (partial correlation, ratio of variances, and difference between item-rest correlations) as illustrated in the empirical example. As such, they may be useful in determining the likely impact of observed violations of CI on model-based inferences. However, further research into the robustness of models assuming CI is needed for such an assessment to be realistic.
Using PPCs, CI is tested for a particular set of latent variables (in the case of the hierarchical model, these are θ and τ). If CI is violated for this particular set, it does not mean that CI cannot hold for a different set of latent variables. If an alternative CI model with a new set of latent variables is formulated, then the CI assumption can be again tested with PPCs using samples from the posterior distribution of the parameters of the new model. This also means that the PPCs can be used in the context of modeling conditional dependence to check whether an attempt to model conditional dependence has been successful.
The aim of this article has been to provide new powerful methods of detecting a variety of violations of CI. Evaluating the practical impact of particular types of violations of CI on inferences that are made based on a model that makes this assumption is beyond the scope of this article. However, violations of CI may not only be relevant with regard to their consequences for the accuracy of the model inferences but may also reveal substantively relevant aspects of the response process that are not accounted for in the model that is used. Investigating the ways in which CI is violated may therefore be of substantive interest. The proposed PPC framework and possible extensions of it may prove useful in addressing these substantive questions in future research.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
References
