Detecting Noneffortful Responses Based on a Residual Method Using an Iterative Purification Process

Abstract

The prevalence and serious consequences of noneffortful responses from unmotivated examinees are well-known in educational measurement. In this study, we propose to apply an iterative purification process based on a response time residual method with fixed item parameter estimates to detect noneffortful responses. The proposed method is compared with the traditional residual method and noniterative method with fixed item parameters in two simulation studies in terms of noneffort detection accuracy and parameter recovery. The results show that when severity of noneffort is high, the proposed method leads to a much higher true positive rate with a small increase of false discovery rate. In addition, parameter estimation is significantly improved by the strategies of fixing item parameters and iteratively cleansing. These results suggest that the proposed method is a potential solution to reduce the impact of data contamination due to severe low test-taking effort and to obtain more accurate parameter estimates. An empirical study is also conducted to show the differences in the detection rate and parameter estimates among different approaches.

Keywords

noneffortful response, response time, residual method, mixture model, iterative purification

In educational and psychological measurement, the primary goal is to obtain valid scores for students, which means that the information based on a test can purely reflect the latent traits of those students. However, in practice, the prevalence of noneffortful responses from unmotivated participants has been repeatedly reported, which can be observed in either low-stakes or high-stakes testing situations (Bridgeman & Cline, 2004; Wise & Kong, 2005).

As traditional measurement models or methods (e.g., item response theory, IRT) do not take test-taking effort into consideration, data sets contaminated by noneffortful responses may lead to serious consequences (Weirich et al., 2016; Wise & DeMars, 2006, 2010). First, it is well-known that noneffortful responses will distort the estimation of item parameters in IRT models (Bolt et al., 2002; Wise & DeMars, 2006, 2010). Consequently, the ability estimates, especially those of the noneffortful test takers, will be biased (Bolt et al., 2002; Wise & DeMars, 2006, 2010). Second, if noneffortful test takers are included in an IRT calibration, test information and standard errors of measurement will be biased as well (Wise & DeMars, 2006). Finally, the measured constructs may be different from the theoretically tested constructs, and the convergent validity may be compromised (Weirich et al., 2016; Wise & DeMars, 2006). Therefore, it is necessary to identify noneffortful responses and to reduce their detrimental effects.

There is a robust literature on how to detect noneffortful responses on educational tests (Wise, 2015). In particular, a growing number of studies show that response time (RT) in achievement tests is effective for identifying extremely fast responses whose accuracy is no better than chance (DeMars, 2007; Guo et al., 2016; Kong et al., 2007; Lee & Jia, 2014). These fast responses, which are much faster than the time required by a test taker to read, understand, and select a response (Wise, 2017), are defined as noneffortful responses. The logic of these RT-based methods is that although the RT that test takers spend on a given item varies (either due to individual differences on a variety of factors, such as ability level, reading speed, cognitive speed, motivation, and fatigue, or due to some additional episodic factors such as the testing environment or distracting noises), the potential range of RT is far less ambiguous. The normal RT of an item can be determined as the RT required by a test taker’s speed and the characteristics of the item. If an RT is remarkably less than the normal RT, this response should be regarded as a noneffortful response and consequently does not provide information about the test taker’s latent trait (Wise, 2017). In addition, as modern technology has greatly popularized computer-based testing, the RT of each response can be easily recorded and used for further analysis. It has therefore become easy for practitioners to apply these RT-based methods for detecting noneffortful responses.

The RT-based methods can be roughly divided into three categories. The first category determines a time threshold for classifying noneffortful and effortful responses. For example, some methods involve visual inspection of the RT distribution for each item and choosing the threshold at the end of the short time spikes (Wise, 2006). Although these methods are intuitive, easy to interpret, and evidence-based, they are also often subjective, time-consuming, and inconclusive (e.g., when the RT distribution is not bimodal, or when the cumulative curve of accuracy given item RT does not intersect the chance level; see Guo et al., 2016; Lee & Jia, 2014; Rios et al., 2017). The second category comprises the mixture modeling methods. For example, C. Wang and Xu (2015) have proposed a mixture hierarchical model to account for the differences of item responses and RT patterns between effortful and noneffortful responses. However, these mixture models usually make strong assumptions about the models or parameter distributions for different classes. They may not perform well if these assumptions are violated (Molenaar et al., 2018; Ranger & Kuhn, 2017). The third category comprises the RT residual-based methods, which usually make fewer assumptions than the mixture modeling methods and are easier to apply.

The traditional residual method for detecting noneffortful responses is based on van der Linden’s (2006) model for RT. The residual of each RT can be calculated after all the parameters in the model are estimated. These residuals should follow a standard normal distribution if the data fit the model well (Qian et al., 2016). In Qian et al.’s (2016) study, an RT is flagged as aberrant when its residual has a larger negative value than the threshold defined by the density of the standard normal distribution (e.g., −1.96). However, this method was only applied to a real data example in their study. It has neither been investigated in a simulation study nor been widely used for detecting noneffortful responses in practice. Another popular approach under the framework of the residual method is the Bayesian residual method, which was first proposed by van der Linden and Guo (2008). In their method, whether an RT from respondent i to item j is flagged as aberrant is judged by comparing the observed RT to the posterior predictive density computed from all the other responses and RTs except the one from respondent i to item j. C. Wang, Xu, and Shang (2018) used the Bayesian residual method for detecting aberrant behavior and item preknowledge. The results showed that the performance of the Bayesian residual method depended heavily on the severity of aberrance (i.e., the proportion of aberrance exhibited on the problematic items). It had low power and sometimes a high false detection rate when the proportion of aberrance was high. This may be caused by the well-known “masking effect” in outlier detection, which means that the aberrant responses may not seem as extreme as they should be because a large proportion of outliers biases the model parameter estimates (Yuan et al., 2004; Yuan & Zhong, 2008). Therefore, C. Wang, Xu, Shang, and Kuncel (2018) suggested that an iterative purification process might be employed to improve the performance of the residual-based methods.

Recently, Patton et al. (2019) applied an iterative cleansing procedure based on person-fit statistics to detect careless respondents. This method showed much higher power than the noniterative procedure when the percentage of careless respondents was higher than 30%. It could also keep the false positive error rate close to the nominal level. However, this method is not free of limitations. For one thing, it can only detect noneffortful persons, not individual responses. For another, the iterative procedure suffers from convergence issues, especially when an item contains some categories that are sparsely endorsed (i.e., an item has almost all correct or almost all incorrect answers).

In this article, we will propose an iterative cleansing method based on a residual method with fixed item parameters. The rest of this article is organized as follows. The Method section briefly reviews the existing residual method for detecting noneffortful responses and introduces the new method with an iterative purification process. The Simulation Studies section presents two simulation studies and their results to demonstrate better detection accuracy and parameter estimation using the proposed method compared with the other two methods (i.e., the original residual method and the noniterative method) when severity of noneffort is high. Furthermore, a real data example is presented to show the differences among different approaches in practical applications. The last section concludes with a general discussion.

Method

Standard Residual Method

van der Linden’s (2007) hierarchical model can be used to fit the data containing both response accuracy (RA) and RT. The model can be seen as below:

\begin{cases} P (Y_{ij} = 1 | θ_{i}) = \dfrac{\exp (a_{j} (θ_{i} - b_{j}))}{1 + \exp (a_{j} (θ_{i} - b_{j}))} & \text{(2PL model)}, \\ \ln (t_{ij}) | τ_{i} \sim N (β_{j} - τ_{i}, α_{j}^{-2}) & \text{(RT model)}, \end{cases} \tag{1}

where $P (Y_{ij} = 1 | θ_{i})$ is the probability of a correct response ($Y_{ij} = 1$) to item j (j = 1,…, J) by person i (i = 1,…, I), $t_{ij}$ is the RT on item j by person i, $a_{j}$ and $b_{j}$ are the discrimination parameter and the difficulty parameter of item j, respectively, $α_{j}$ and $β_{j}$ are the discrimination power and the time-intensity parameter of item j with respect to RT, and N denotes the normal distribution. $ξ_{i} = (θ_{i}, τ_{i})'$ are the person parameters, where $θ_{i}$ and $τ_{i}$ are the ability parameter and the speed parameter for person i, respectively. They are assumed to be randomly drawn from a bivariate normal distribution, where the density function is

f (ξ_{i}; μ, Σ) = \frac{{| Σ^{-1} |}^{1/2}}{2π} \exp \left[- \frac{1}{2} (ξ_{i} - μ)' Σ^{-1} (ξ_{i} - μ)\right], \tag{2}

with the mean vector

μ = (μ_{θ}, μ_{τ})', \tag{3}

and the covariance matrix

Σ = \begin{bmatrix} σ_{θ}^{2} & σ_{θτ} \\ σ_{τθ} & σ_{τ}^{2} \end{bmatrix}. \tag{4}

For model identification, $μ$ is fixed at $(0, 0)'$, and $σ_{θ}^{2}$ is fixed at 1 (see van der Linden, 2007). We conduct our study under this hierarchical modeling framework for two reasons. One is that, as van der Linden (2007) has stated, the information in RT may help improve the calibration of IRT model parameters. The other is that, because the parameters in the RT model can be used for further analysis (e.g., RT residual analysis, see Qian et al., 2016; item selection in computerized adaptive testing, see Choe et al., 2018; Fan et al., 2012), the estimation of these parameters deserves as much attention as the estimation of the parameters in the IRT model.

The computation of residuals can be expressed as follows:

\hat{e}_{ij} = \hat{α}_{j} (\ln (t_{ij}) - (\hat{β}_{j} - \hat{τ}_{i})), \quad \hat{e}_{ij} \sim N (0, 1), \tag{5}

where the parameters in the equation have the same meaning as in Equation 1. These residuals approximately follow a standard normal distribution when the model fits the data, and a large negative value indicates that the answer was given much more quickly than expected. Thus, a response to an item is flagged as noneffortful when its residual is less than −1.645 (the 5th percentile of the standard normal distribution, i.e., the lowest 5% of the left tail). In our study, we refer to this method as the original standard residual (OSR) method.
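The residual computation and the flagging rule can be illustrated with a minimal Python sketch. It assumes that the item estimates (`alpha`, `beta`) and the speed estimates (`tau`) are already available as NumPy arrays; the function names and the toy values are hypothetical and are not taken from the original study.

```python
import numpy as np

def rt_residuals(log_rt, alpha, beta, tau):
    """Standardized RT residuals: alpha_j * (ln t_ij - (beta_j - tau_i)).

    log_rt : (I, J) array of log response times
    alpha, beta : length-J arrays of RT item parameter estimates
    tau : length-I array of speed estimates
    """
    log_rt = np.asarray(log_rt, dtype=float)
    expected = np.asarray(beta)[None, :] - np.asarray(tau)[:, None]
    return np.asarray(alpha)[None, :] * (log_rt - expected)

def flag_noneffortful(residuals, cutoff=-1.645):
    """Flag responses whose residual falls below the cutoff."""
    return residuals < cutoff

# Toy example (made-up values): 2 persons, 2 items.
alpha = np.array([2.0, 1.5])
beta = np.array([0.1, -0.1])
tau = np.array([0.0, 0.3])
log_rt = np.array([[0.1, -0.1],
                   [-2.0, -0.4]])

res = rt_residuals(log_rt, alpha, beta, tau)
flags = flag_noneffortful(res)
```

In this toy example only the second person's response to the first item is much faster than expected, so it is the only response that falls below the cutoff.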

Biased parameter estimates lead to biased RT residuals. We propose the following procedures to improve this method. First, accurate item parameter estimates should be obtained. Then, after the noneffortful responses identified in each iteration are removed, the person parameter estimates are improved. These parameter estimates in turn produce better residual estimates, which can be used to pinpoint the noneffortful responses more accurately. These procedures are explained in more detail below.

Obtaining Accurate Item Parameter Estimates

Two different situations are considered here. First, if item parameters are unknown, a mixture model method can be applied to improve item parameter estimation (Liu et al., 2020). Based on both RA and RT, a mixture model is used to differentiate noneffortful and effortful individuals (Liu et al., 2020). Once the effortful sample is identified, item parameter estimates can be obtained by fitting van der Linden’s (2007) hierarchical model to the data from this sample. Liu et al.’s (2020) study showed that, compared with the item parameter estimates based on the whole sample, more accurate item parameter estimates can be obtained from data consisting of effortful respondents only. For details of this method, please refer to Liu et al. (2020). The Mplus code for the mixture model method is given in Appendix A in the online version of the journal. Second, if item parameters are known, they can be used directly in the following step.

Iteratively Purifying the Sample

When data sets are contaminated by noneffortful responses, the parameter estimates as well as the RT residuals may be biased. Conversely, when evaluated with parameter estimates that are closer to their true values, the RT residuals should be better able to identify outlying RTs. When aberrant responses are removed, the speed parameter (as well as the ability parameter) estimates are expected to recover their true values better. The improved speed parameter estimates can in turn produce better RT residuals, which can be used to detect the noneffortful responses more accurately. Therefore, we propose the following iterative cleansing procedure (iterative conditional estimate standard residual method, ICSR):

1. When item parameters are unknown, use the pure sample $X_{e}$ identified by the mixture model method to obtain item parameter estimates $\hat{γ}_{e}$ based on van der Linden’s (2007) hierarchical model. When item parameters are known, use the true item parameters $γ_{e}$ directly in the following steps.

2. Fix the item parameters at $\hat{γ}_{e}$ or $γ_{e}$ to obtain latent trait estimates $\hat{θ}^{(0)}$ (ability parameter) and $\hat{τ}^{(0)}$ (speed parameter) for data set $X^{(0)}$ ($X^{(0)}$ is the original full data).

3. Use $\hat{γ}_{e}$ or $γ_{e}$, $\hat{τ}^{(q)}$ ($q = 0, 1, 2, \ldots, Q$ indexes the iteration), and the RTs from $X^{(0)}$ to compute RT residuals $\hat{e}^{(q)}$ for every response in $X^{(0)}$. Create a cleansed calibration sample $X^{(q + 1)}$ by removing noneffortful responses whose $\hat{e}^{(q)}$ are below −1.645.

4. Fix the item parameters at $\hat{γ}_{e}$ or $γ_{e}$ and obtain person parameter estimates $\hat{θ}^{(q + 1)}$ and $\hat{τ}^{(q + 1)}$ for all respondents by fitting van der Linden’s (2007) hierarchical model to the cleansed sample $X^{(q + 1)}$. Set $q = q + 1$.

5. Repeat Steps 3 and 4 until the proportion of responses that change classification (i.e., from effortful response to noneffortful response or vice versa) between two successive iterations does not exceed 0.001.

6. Upon convergence, the noneffortful responses identified in the last iteration are taken as the final detected noneffortful responses.

In each iteration, we compute an RT residual (based on Equation 5) for every response in the original full sample (given $\hat{γ}_{e}$ or $γ_{e}$, $\hat{τ}^{(q)}$, and RT), instead of just the responses identified as effortful in the most recent cleansed sample. In this way, responses that are removed during early iterations can still be added back in later iterations. This prevents the number of noneffortful responses from continuously increasing.

If the above procedure simply stops at q = 0, RT residuals are calculated based on the speed parameter estimates obtained by conditional estimation with the fixed item parameters. Then, responses with residuals smaller than −1.645 are removed. We refer to this procedure as the noniterative conditional estimate standard residual (CSR) method.
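The iterative procedure and its noniterative special case can be sketched in Python as follows. This is only an illustration under a simplifying assumption: instead of refitting the hierarchical model by Bayesian estimation in each iteration, the speed parameter is re-estimated with the closed-form weighted maximum likelihood estimate implied by the lognormal RT model with fixed item parameters, $\hat{τ}_{i} = \sum_{j} α_{j}^{2} (β_{j} - \ln t_{ij}) / \sum_{j} α_{j}^{2}$, taken over the retained responses. The function names `estimate_tau` and `icsr` are hypothetical.

```python
import numpy as np

def estimate_tau(log_rt, alpha, beta, keep):
    """Weighted ML speed estimate per person from retained responses only.

    Assumption of this sketch: replaces the Bayesian calibration of the
    hierarchical model with a closed-form estimate from the RT model alone.
    """
    w = (alpha ** 2)[None, :] * keep           # zero weight for removed responses
    num = (w * (beta[None, :] - log_rt)).sum(axis=1)
    den = w.sum(axis=1)
    # Persons with no retained responses fall back to tau = 0 (population mean).
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

def icsr(log_rt, alpha, beta, cutoff=-1.645, tol=0.001, max_iter=50):
    """Iterative cleansing with fixed item parameters (alpha, beta).

    Returns the final flags, the last speed estimates, and the number of
    iterations. With max_iter=1 the flags are those of the noniterative
    (q = 0) procedure.
    """
    log_rt = np.asarray(log_rt, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)

    keep = np.ones(log_rt.shape, dtype=bool)    # Step 2: start from the full data
    tau = estimate_tau(log_rt, alpha, beta, keep)
    flags = np.zeros(log_rt.shape, dtype=bool)
    n_iter = 0
    for q in range(max_iter):
        n_iter = q + 1
        # Step 3: residuals for every response in the full data.
        resid = alpha[None, :] * (log_rt - (beta[None, :] - tau[:, None]))
        new_flags = resid < cutoff
        changed = np.mean(new_flags != flags)
        flags = new_flags
        # Step 5: stop once classifications are stable between iterations.
        if q > 0 and changed <= tol:
            break
        # Step 4: re-estimate speed on the cleansed sample.
        tau = estimate_tau(log_rt, alpha, beta, ~flags)
    return flags, tau, n_iter
```

Because the residuals are recomputed for the full data in every pass, a response flagged early can be reinstated later, which mirrors the behavior described above. A call such as `icsr(log_rt, alpha, beta, max_iter=1)` returns the flags of the noniterative variant.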

Once the noneffortful responses are detected, they can be treated as missing values (i.e., the item is not answered by the examinee) to obtain a cleansed sample. For all the methods (i.e., OSR, CSR, ICSR), the final parameter estimates are estimated based on the resulting cleansed sample by fitting van der Linden’s (2007) hierarchical model. These estimates are hopefully less biased than the original parameter estimates based on the full, contaminated sample.

Simulation Studies

In this section, two simulation studies are conducted to evaluate whether the proposed method can successfully detect noneffortful responses and recover model parameters. In Simulation Study 1, item parameters are assumed to be unknown. Therefore, a mixture model method (Liu et al., 2020) is used to obtain item parameter estimates, which are then fixed when applying both the iterative and the noniterative RT residual-based procedures. In Simulation Study 2, item parameters are assumed to be known.

Design

In order to investigate whether it was necessary to detect noneffortful responses, parameters were estimated based on the original full data set as a baseline to compare different methods in terms of parameter recovery. Specifically, in Simulation Study 1, the precision of the item parameter estimates based on the cleansed sample identified by the mixture model method is evaluated as well.

In Simulation Study 1, there were five manipulated factors:

1. Noneffort prevalence ($π$), which was the proportion of individuals with noneffortful responses. It had three levels: 0%, 20%, and 40%.

2. Noneffort severity ($π_{i}^{non}$), which was the proportion of noneffortful responses for each noneffortful individual. It varied between two levels: low and high. When $π_{i}^{non}$ was low, we simulated $π_{i}^{non}$ from U (0, 0.25); when $π_{i}^{non}$ was high, we simulated $π_{i}^{non}$ from U (0.5, 0.75), where “U” denotes the uniform distribution.

3. The difference between RTs of noneffortful and effortful responses ($d_{RT}$), which had two levels: small and large. Logarithmized RTs of noneffortful responses were generated from the normal distribution $N (μ, 0.5^{2})$, where $μ = -1$ when $d_{RT}$ was small and $μ = -2$ when $d_{RT}$ was large.

4. Sample size (I), which had two levels: 1,000 and 2,000.

5. Test length (J), which had two levels: 30 and 50.

Note that when $π = 0\%$ (i.e., all the responses were effortful), there was no need to detect noneffortful responses. This condition made it possible to evaluate the cost of applying the detection methods unnecessarily. These simulation conditions were set up following C. Wang, Xu, Shang, and Kuncel’s (2018) study. Consequently, by combining $π$ and $π_{i}^{non}$, we produced different levels of overall percentages of noneffortful responses, approximately 2.5%, 5.0%, 12.5%, and 25% of the total sample.
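The overall percentages follow from multiplying the prevalence by the expected severity, that is, the mean of the corresponding uniform distribution. The short Python sketch below reproduces this arithmetic; the variable and function names are illustrative only.

```python
# Expected overall proportion of noneffortful responses:
# prevalence * mean of the uniform severity distribution.
prevalence_levels = [0.20, 0.40]
severity_ranges = {"low": (0.0, 0.25), "high": (0.5, 0.75)}

def expected_overall_rate(prevalence, lower, upper):
    """Prevalence times the mean of U(lower, upper)."""
    return prevalence * (lower + upper) / 2.0

overall = {
    (prevalence, level): expected_overall_rate(prevalence, lower, upper)
    for prevalence in prevalence_levels
    for level, (lower, upper) in severity_ranges.items()
}

for (prevalence, level), rate in sorted(overall.items(), key=lambda kv: kv[1]):
    print(f"prevalence={prevalence:.0%}, severity={level}: {rate:.1%}")
```

For instance, a prevalence of 40% combined with high severity (mean 0.625) gives 0.40 × 0.625 = 0.25 of all responses.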

In Simulation Study 2, the manipulated factors are the first three factors in Simulation Study 1 (i.e., $π$ , $π_{i}^{n o n}$ , and $d_{R T}$ ). As the three methods show similar patterns across different sample size and test length in Simulation Study 1, we fixed I = 2,000 and J = 30 in Simulation Study 2 to keep the scope of the study manageable.

Data Generation

For effortful responses, RA and RT were simulated from van der Linden’s (2007) hierarchical model (see Equations 1–4). Item parameters were generated with $a_{j} \sim U (1, 2.5)$, $b_{j} \sim N (0, 1)$, $α_{j} \sim U (1.5, 2.5)$, and $β_{j} \sim U (-0.2, 0.2)$. These distributions were selected to ensure that the resulting RA and RT mimic the real data closely (van der Linden, 2007; C. Wang & Xu, 2015; C. Wang, Xu, & Shang, 2018). For person parameters, $(θ_{i}, τ_{i})$ were generated from a bivariate normal distribution with the mean vector $μ = (0, 0)'$ and the covariance matrix $Σ = \begin{bmatrix} 1 & 0.25 \\ 0.25 & 0.25 \end{bmatrix}$. In doing so, the correlation between ability and speed was fixed at a moderate level (i.e., 0.5), indicating that high-ability examinees tended to answer items faster (C. Wang, Xu, & Shang, 2018).

To generate the noneffortful responses, we first selected noneffortful test takers based on $π$. We followed C. Wang, Xu, and Shang (2018) in determining noneffortful simulees, assuming that slow examinees were more likely to guess. We drew 60% of the noneffortful group from people whose true speed $τ_{i}$ fell within the lowest third of the distribution, 30% from people whose $τ_{i}$ fell within the middle third, and 10% from people whose $τ_{i}$ fell within the upper third. Then, as noneffortful responding could happen randomly on any item, for each noneffortful simulee, we randomly selected the noneffortful responses according to $π_{i}^{non}$. Finally, the probability of a correct response $g_{j}$ was set at 0.25 for all noneffortful responses (C. Wang & Xu, 2015), reflecting the chance level of multiple-choice items with four response options.
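The data-generating scheme described in this section can be sketched in Python with NumPy. This is a sketch under stated assumptions rather than the code used in the study: the function name `simulate` is hypothetical, the number of noneffortful items per person is obtained by rounding $π_{i}^{non} J$ to an integer, and the speed groups are formed by splitting the persons sorted by $τ_{i}$ into three nearly equal parts.

```python
import numpy as np

def simulate(I=2000, J=30, prevalence=0.2, severity=(0.5, 0.75),
             mu_non=-2.0, seed=0):
    rng = np.random.default_rng(seed)

    # Item parameters.
    a = rng.uniform(1.0, 2.5, J)
    b = rng.normal(0.0, 1.0, J)
    alpha = rng.uniform(1.5, 2.5, J)
    beta = rng.uniform(-0.2, 0.2, J)

    # Person parameters: bivariate normal with correlation 0.5.
    cov = np.array([[1.0, 0.25], [0.25, 0.25]])
    theta, tau = rng.multivariate_normal([0.0, 0.0], cov, size=I).T

    # Effortful responses: 2PL for accuracy, lognormal model for RT.
    p = 1.0 / (1.0 + np.exp(-a[None, :] * (theta[:, None] - b[None, :])))
    y = (rng.random((I, J)) < p).astype(int)
    log_rt = rng.normal(beta[None, :] - tau[:, None], 1.0 / alpha[None, :])

    # Noneffortful persons: 60% / 30% / 10% from the slow / middle / fast thirds.
    n_non = int(round(prevalence * I))
    thirds = np.array_split(np.argsort(tau), 3)   # slowest persons first
    counts = [int(round(0.6 * n_non)), int(round(0.3 * n_non))]
    counts.append(n_non - sum(counts))
    persons = np.concatenate(
        [rng.choice(g, size=c, replace=False) for g, c in zip(thirds, counts)]
    )

    # Noneffortful responses: random items, chance-level accuracy, fast RTs.
    non = np.zeros((I, J), dtype=bool)
    for i in persons:
        k = int(round(rng.uniform(*severity) * J))   # rounding is an assumption
        non[i, rng.choice(J, size=k, replace=False)] = True
    n_cells = int(non.sum())
    y[non] = (rng.random(n_cells) < 0.25).astype(int)
    log_rt[non] = rng.normal(mu_non, 0.5, n_cells)

    return {"y": y, "log_rt": log_rt, "non": non, "theta": theta, "tau": tau,
            "a": a, "b": b, "alpha": alpha, "beta": beta}

data = simulate(I=300, J=30, prevalence=0.2, seed=1)
```

The returned indicator matrix `non` records which responses were generated as noneffortful, so it can serve as the ground truth when detection accuracy is evaluated.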

Analysis

In Simulation Study 1, the mixture model method with the number of classes fixed at two for each latent variable (i.e., M22 in Liu et al.’s, 2020, study) was used to identify noneffortful individuals. van der Linden’s (2007) hierarchical model was calibrated using a Bayesian Gibbs sampling method, both during each iteration and to obtain the final parameter estimates. Data augmentation involves implementing a posterior sampler on the enlarged probability space for $f (ξ, a, b, α, β)$. The joint posterior can be obtained from Bayes’ theorem:

f (ξ, a, b, α, β | Y, T) \propto \prod_{j = 1}^{J} \prod_{i = 1}^{I} f (Y_{ij} | θ_{i}, a_{j}, b_{j}) f (t_{ij} | τ_{i}, α_{j}, β_{j}) \times f (ξ) f (a) f (b) f (α) f (β), \tag{6}

where $Y$ and $T$ are the observed RA matrix and RT matrix, respectively. The conditional distributions of these parameters can be derived from Equation 6. Suppose that $Y$ and $T$ can be augmented with unobserved latent data $Z$ such that the augmented data $X = (Y, T, Z)$ are straightforward to analyze (i.e., the augmented-data posterior density, $f (ξ, a, b, α, β | X)$, is of known form). The iterative scheme includes three steps: (a) Set the initial values of the parameters for k = 0 (k indexes the iteration), denoted as $ξ^{(0)}, a^{(0)}, b^{(0)}, α^{(0)}, β^{(0)}$. (b) For the kth iteration (k ≥ 1), generate a sample of $M$ ($M > 0$, which is suggested to be small initially and then to increase with successive iterations) latent data patterns $Z_{1}^{(k - 1)}, \ldots, Z_{M}^{(k - 1)}$ from the current approximation to the predictive density given $Y, T$ and the parameter estimates (i.e., $p (Z^{(k - 1)} | ξ^{(k - 1)}, a^{(k - 1)}, b^{(k - 1)}, α^{(k - 1)}, β^{(k - 1)}, Y, T)$, where $ξ^{(k - 1)}, a^{(k - 1)}, b^{(k - 1)}, α^{(k - 1)}, β^{(k - 1)}$ are the values from the (k − 1)th iteration). (c) Update the posterior of $ξ, a, b, α, β$, given $Y, T$, to be the mixture of conditional densities of $ξ, a, b, α, β$, given the augmented data patterns generated in (b), that is,

p (ξ^{(k)}, a^{(k)}, b^{(k)}, α^{(k)}, β^{(k)}) = M^{-1} \sum_{m = 1}^{M} p (ξ, a, b, α, β | Z_{m}^{(k - 1)}, Y, T). \tag{7}

Then, set k = k + 1 and repeat steps (b) and (c) until convergence is achieved.

In accordance with C. Wang, Xu, and Shang’s (2018) study, the priors we chose for item parameters were $a_{j} \sim \text{lognormal} (0, 1)$, $b_{j} \sim N (0, 1)$, $α_{j} \sim \text{lognormal} (0, 1)$, and $β_{j} \sim N (0, 1)$. The priors we chose for person parameters were the same as the distributions used for generating these parameters. The initial values of the parameters were randomly sampled from the prior distribution of each parameter. The number of chains was fixed at 2. Convergence was evaluated by Gelman and Rubin’s (1992) statistic for each parameter separately. If the Gelman–Rubin potential scale reduction statistic was less than 1.05, convergence was considered achieved. A preliminary study was conducted to determine the number of iterations and the burn-in needed to achieve convergence. Finally, the number of iterations of each chain was fixed at 5,000, with the first 2,500 as burn-in. The thinning rate was set at 5. The posterior mean was used as the point estimate of each unknown parameter. Overall, for each condition, the above procedure was repeated L = 30 times. The mixture model used to select effortful respondents was implemented in Mplus Version 7.11 (Muthén & Muthén, 2012), while the detection and calibration procedures were implemented in R Version 3.5.3 (R Development Core Team, 2019) and JAGS Version 4.3 (Plummer, 2003).
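For readers unfamiliar with the convergence criterion, the potential scale reduction statistic for a single parameter can be computed from the retained draws of the chains as sketched below in Python. This is the basic textbook form of the statistic and is not necessarily identical to the variant reported by the software used in the study; the function name `gelman_rubin` is hypothetical. With 5,000 iterations, a burn-in of 2,500, and a thinning rate of 5, each chain contributes 500 retained draws.

```python
import numpy as np

def gelman_rubin(chains):
    """Basic potential scale reduction factor for one parameter.

    chains : (m, n) array with m chains of n retained draws each.
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    within = chains.var(axis=1, ddof=1).mean()         # W: mean within-chain variance
    between = n * chains.mean(axis=1).var(ddof=1)      # B: between-chain variance
    pooled = (n - 1) / n * within + between / n        # pooled variance estimate
    return np.sqrt(pooled / within)

# Illustration with artificial draws: two chains of 500 retained draws each.
rng = np.random.default_rng(0)
mixed = rng.normal(size=(2, 500))                      # chains sampling the same target
stuck = mixed + np.array([[0.0], [3.0]])               # second chain shifted away
r_mixed = gelman_rubin(mixed)
r_stuck = gelman_rubin(stuck)
```

Chains that explore the same distribution give a value close to 1, whereas chains centered at different locations give a value well above the 1.05 criterion.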

Criteria of Evaluation

Accuracy of identification of noneffortful responses

The true positive rate (TPR) and false discovery rate (FDR) of the identified noneffortful responses were computed and summarized over 30 replications. TPR was defined as the proportion of truly noneffortful responses that were correctly detected. It can be regarded as a measure of the power of the detection methods. FDR was defined as the proportion of detected responses that were in fact effortful (i.e., incorrectly detected).
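Both rates can be computed from two Boolean matrices of the same shape, one marking the truly noneffortful responses and one marking the flagged responses. The Python sketch below is illustrative; the function name `tpr_fdr` and the toy vectors are hypothetical.

```python
import numpy as np

def tpr_fdr(true_non, flagged):
    """True positive rate and false discovery rate of the flagged responses."""
    true_non = np.asarray(true_non, dtype=bool)
    flagged = np.asarray(flagged, dtype=bool)
    true_pos = np.sum(true_non & flagged)      # noneffortful and flagged
    false_pos = np.sum(~true_non & flagged)    # effortful but flagged
    n_true = true_non.sum()
    n_flagged = flagged.sum()
    tpr = true_pos / n_true if n_true > 0 else float("nan")
    fdr = false_pos / n_flagged if n_flagged > 0 else float("nan")
    return tpr, fdr

# Toy example: 3 truly noneffortful responses, 3 flagged, 2 of them correctly.
true_non = [1, 1, 1, 0, 0, 0, 0, 0]
flagged = [1, 1, 0, 1, 0, 0, 0, 0]
tpr, fdr = tpr_fdr(true_non, flagged)
```

In the toy example two of the three noneffortful responses are detected (TPR = 2/3), and one of the three flagged responses is effortful (FDR = 1/3). TPR is undefined when no response is truly noneffortful, which the sketch returns as `nan`.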

Parameter recovery

As the whole sample with detected noneffortful responses being recoded as missing is used to fit van der Linden’s (2007) hierarchical model, there is no sample selection bias during calibration. The parameter recovery can be evaluated directly. Bias and root mean square error (RMSE) were computed for each parameter, according to Equations 8 and 9:

\text{bias} = \frac{1}{L} \sum_{l = 1}^{L} \frac{1}{H} \sum_{h = 1}^{H} (o_{h} - \hat{o}_{h}^{(l)}), \tag{8}

\text{RMSE} = \sqrt{\frac{1}{L} \sum_{l = 1}^{L} \frac{1}{H} \sum_{h = 1}^{H} {(o_{h} - \hat{o}_{h}^{(l)})}^{2}}, \tag{9}

where $o_{h}$ denotes the true value of a parameter, $\hat{o}_{h}^{(l)}$ denotes the corresponding estimate of the parameter in the lth replication (for item parameters, h = j; for person parameters, h = i), H denotes the number of items (H = J) or persons (H = I), and L denotes the number of replications under each condition.

Then, the relative efficiency (RE) was computed as the ratio of the RMSE of the parameter estimates achieved by one of the methods to the RMSE obtained by fitting van der Linden’s (2007) hierarchical model to the whole data set (RE₁) or to the RMSE based on the pure sample identified by the mixture model method (RE₂). RE₂ was only computed for item parameters in the IRT model. A value smaller than 1 implied an error reduction from using the detection method, while a value larger than or equal to 1 suggested that using the more sophisticated method did not pay off or even led to worse estimation.
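The three recovery criteria can be written compactly in Python, as in the sketch below. It assumes that the estimates are stored as an L × H array (replications by parameters) and that the true values are the same across replications; the function names and the toy numbers are hypothetical. Note that bias is computed as the true value minus the estimate, so a positive value indicates underestimation.

```python
import numpy as np

def bias(true_values, estimates):
    """Mean of (true - estimate) over replications and parameters."""
    true_values = np.asarray(true_values, dtype=float)
    estimates = np.asarray(estimates, dtype=float)      # shape (L, H)
    return np.mean(true_values[None, :] - estimates)

def rmse(true_values, estimates):
    """Root mean square error over replications and parameters."""
    true_values = np.asarray(true_values, dtype=float)
    estimates = np.asarray(estimates, dtype=float)      # shape (L, H)
    return np.sqrt(np.mean((true_values[None, :] - estimates) ** 2))

def relative_efficiency(rmse_method, rmse_baseline):
    """RMSE of a detection method relative to a baseline calibration."""
    return rmse_method / rmse_baseline

# Toy example: H = 2 parameters, L = 2 replications.
true_values = [1.0, 2.0]
estimates = [[1.5, 2.5],
             [0.5, 2.5]]
toy_bias = bias(true_values, estimates)
toy_rmse = rmse(true_values, estimates)
toy_re = relative_efficiency(toy_rmse, 1.0)
```

In the toy example the errors are −0.5, −0.5, 0.5, and −0.5, so the bias is −0.25 and the RMSE is 0.5; relative to a baseline RMSE of 1.0, the relative efficiency is 0.5, which would indicate an error reduction.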

Moreover, in Simulation Study 1, as the accuracy of the fixed item parameters was crucial for ICSR and CSR, we evaluated the accuracy of the item parameter estimates calibrated on the pure sample identified by the mixture model method (the estimates based on the original sample were used as the baseline). For the two-parameter logistic (2PL) model, the person parameters of the identified effortful group were scaled to a bivariate normal distribution during calibration, whereas for the generated data, the whole population followed a bivariate normal distribution. Therefore, a linking procedure was needed to adjust for sample selection bias before direct comparisons could be made about item parameters. Accordingly, for the 2PL model, the estimates of item parameters were equated to the scale of the true values using the mean/sigma method (Kolen & Brennan, 2004) before bias and RMSE were computed. For the RT model, as data were generated under the assumption that low-speed examinees were more likely to guess, the distribution of speed parameters in the effortful group might differ from the assumed normal distribution as well. We are not aware of any linking procedure for adjusting the difference in scales of RT model parameter estimates. Therefore, the correlation between the estimated item parameters and their true values in the RT model was computed as an evaluation criterion. In addition, the consistency between residuals based on the estimated parameters in the RT model (estimated residuals) and those based on the true values (true residuals) was evaluated. It was assumed that, if the parameter estimates could not recover the generated values well, the correlation between the estimated residuals and the true residuals would be low. This index ($r_{e}$) was calculated across items and replications by using a Fisher z-transformation, as seen in Equation 10.

r_{e} = z^{-1} \left(\frac{1}{L} \sum_{l = 1}^{L} \frac{1}{J} \sum_{j = 1}^{J} z \left(\operatorname{cor} (e_{j}, \hat{e}_{j}^{(l)})\right)\right), \tag{10}

where $e_{j}$ is the true residual vector for item j, $\hat{e}_{j}^{(l)}$ is the estimated residual vector for item j in the lth replication, and cor denotes correlation. Note that the correlation was calculated for I* individuals, where I* was the number of effortful individuals identified by the mixture model method. Besides, $z (\cdot)$ is Fisher’s z-transformation function, used to transform the correlation coefficients (Fisher, 1915).
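The averaging in Equation 10 can be illustrated with a short Python sketch: item-level correlations are transformed with Fisher’s z (the inverse hyperbolic tangent), averaged over items and replications, and transformed back. The function name `average_residual_correlation` is hypothetical, and the clipping of correlations just inside ±1 is an assumption added here only to keep the transformation finite.

```python
import numpy as np

def average_residual_correlation(true_resid, est_resid, eps=1e-6):
    """Fisher-z averaged correlation between true and estimated residuals.

    true_resid : (I, J) array of residuals based on the true parameters
    est_resid  : (L, I, J) array of estimated residuals, one slice per replication
    """
    true_resid = np.asarray(true_resid, dtype=float)
    est_resid = np.asarray(est_resid, dtype=float)
    z_values = []
    for replication in est_resid:
        for j in range(true_resid.shape[1]):
            r = np.corrcoef(true_resid[:, j], replication[:, j])[0, 1]
            r = np.clip(r, -1 + eps, 1 - eps)   # assumption: avoid infinite z
            z_values.append(np.arctanh(r))      # Fisher z-transformation
    return np.tanh(np.mean(z_values))           # back-transform the mean z

# Toy example: one replication, two items, each with correlation 0.8.
true_resid = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]])
est_resid = np.array([[[1.0, 1.0], [3.0, 3.0], [2.0, 2.0], [4.0, 4.0]]])
r_e = average_residual_correlation(true_resid, est_resid)
```

Averaging on the z scale rather than averaging raw correlations reduces the distortion caused by the skewed sampling distribution of correlations near ±1.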

Person fit

The percentage of misfit respondents, which is computed based on a person-fit index, is used to evaluate whether the proposed method improves person fit. The specific steps are as follows: First, noneffortful responses detected by OSR, CSR, and ICSR are recoded as missing. Second, van der Linden’s (2007) hierarchical model is fitted to the three cleansed data sets and to the original full data. Third, based on the parameter estimates in the 2PL model, the corrected standardized log-likelihood person-fit index $l_{z}^{*}$ (Snijders, 2001) is computed, and respondents with $l_{z}^{*} < -1.96$ or $l_{z}^{*} > 1.96$ ($α = 0.05$) are flagged as having an aberrant response pattern (i.e., misfit respondents). Fourth, the percentages of misfit respondents under each method and under the original data are summarized and compared.

Results

Simulation Study 1

According to Gelman and Rubin’s (1992) statistic, when van der Linden’s (2007) hierarchical model is calibrated, the Gibbs sampler successfully converged within 5,000 iterations (with the first 2,500 as burn-in). The results of convergence diagnostic statistics lead to the same conclusion for all our simulation studies and empirical example.

As the difference among all the methods is similar across different sample size and test length, we have focused on the results when I = 2,000 and J = 30 in this section. The results under different sample size and test length can be found in Online Appendix B.

As the purpose of applying the mixture model method is to improve item parameter estimation, we first examine the accuracy of the item parameters that need to be fixed. As can be seen in Table 1, when all the responses are effortful, although the RMSE under the mixture model method is higher than that based on the original data, the difference is small. For the RT model, $r_{e}$ is equal for both methods. The results suggest that when there are no noneffortful responses, the mixture model method can provide item parameter estimates almost as accurate as those based on the full data. When data are contaminated by noneffortful responses, the item parameters in the IRT model are recovered precisely by both methods when $π_{i}^{non}$ is low. However, when $π_{i}^{non}$ is high, the sample identified by the mixture model method yields more accurate parameter recovery. The improvement in estimation precision is more pronounced when $π$ = 40%. It can also be seen that the mixture model method shows more advantages in terms of Cor and $r_{e}$ when $π_{i}^{non}$ is high. We can infer from the results that the mixture model method works well in reducing the error of the item parameter estimates when noneffort severity is high, which is consistent with Liu et al.’s (2020) study. In contrast, fitting van der Linden’s (2007) hierarchical model directly to the original full data tends to underestimate the discrimination parameter and overestimate the difficulty parameter, which is consistent with previous studies (Patton et al., 2019; C. Wang, Xu, & Shang, 2018).

Table 1.

Parameter Recovery for the Fixed Item Parameters in Simulation Study 1

$π_{i}^{n o n}$	$d_{R T}$	Criteria	Parameter	$π$
				0%		20%		40%
				Pure Sample	Original	Pure Sample	Original	Pure Sample	Original
High	Large	Bias	a	−.019	−.002	.013	.204	.032	.341
		Bias	b	.000	.000	.000	−.126	.000	−.236
		RMSE	a	.135	.103	.131	.384	.147	.558
		RMSE	b	.049	.045	.052	.176	.061	.303
		Cor	α	.992	.994	.990	.916	.987	.776
		Cor	β	.993	.994	.992	.984	.990	.971
		r_e	α, β, τ	.984	.984	.961	.796	.961	.802
	Small	Bias	a	—	—	.009	.245	.043	.413
		Bias	b	—	—	.000	−.140	.000	−.267
		RMSE	a	—	—	.157	.420	.175	.611
		RMSE	b	—	—	.057	.194	.068	.339
		Cor	α	—	—	.990	.965	.985	.897
		Cor	β	—	—	.990	.989	.988	.983
		r_e	α, β, τ	—	—	.972	.860	.967	.835
Low	Large	Bias	a	—	—	.062	.090	.154	.176
		Bias	b	—	—	.000	−.025	.000	−.046
		RMSE	a	—	—	.161	.158	.246	.242
		RMSE	b	—	—	.058	.059	.076	.084
		Cor	α	—	—	.954	.959	.899	.915
		Cor	β	—	—	.990	.992	.984	.987
		r_e	α, β, τ	—	—	.966	.961	.955	.954
	Small	Bias	a	—	—	.076	.100	.151	.191
		Bias	b	—	—	.000	−.023	.000	−.049
		RMSE	a	—	—	.178	.167	.245	.255
		RMSE	b	—	—	.057	.056	.076	.088
		Cor	α	—	—	.981	.985	.964	.969
		Cor	β	—	—	.991	.993	.989	.991
		r_e	α, β, τ	—	—	.974	.973	.967	.967

Note. Pure Sample means the estimation based on the effortful sample identified by the mixture model method. Original means the estimation based on the original full data. Cor means the correlation between the estimates of the item parameters and their true values.

The number of iterations under ICSR is summarized in Table 2. A higher level of noneffort severity requires more iterations before convergence is achieved.

Table 2.

Number of Iterations Under ICSR in Simulation Study 1

$π_{i}^{n o n}$	$d_{R T}$	$π$
		0%		20%		40%
		M	SD	M	SD	M	SD
High	Large	4.00	0.00	6.67	.48	7.77	.43
	Small	—	—	6.47	.51	7.20	.41
Low	Large	—	—	3.17	.38	3.00	.00
Low	Small	—	—	4.00	.00	4.00	.00

Note. ICSR = iterative conditional estimate standard residual method.

Table 3 summarizes the average TPR and FDR for noneffortful responses for all three methods. In the condition where all the responses are effortful, TPR cannot be calculated and FDR is always 1. We therefore computed the false positive rate (FPR, analogous to the Type I error rate in this situation) instead, defined as the percentage of all responses that are incorrectly flagged as noneffortful. The FPRs for OSR and CSR are 0.047 and 0.048, respectively, both close to the nominal level (0.05). The FPR for ICSR is slightly inflated (0.060). The proportions of “pure sample” are about 95% for these methods. For each condition, the proportions of detected noneffortful responses are presented as well. First, when $π_{i}^{n o n}$ is low, the proportions of noneffortful responses detected by all the methods are close to the true proportions. Second, when $π_{i}^{n o n}$ is high and $d_{R T}$ is large, only ICSR detects a proportion of noneffortful responses close to the true proportion; the proportions detected by OSR and CSR are much smaller than the true proportions. Third, when $π_{i}^{n o n}$ is high and $d_{R T}$ is small, the proportions detected by all the methods are smaller than the true proportions, with OSR and CSR detecting even smaller proportions than ICSR.

Note that as noneffortful responses may have serious consequences for parameter estimation, in our study a higher TPR is more strongly related to better estimation than a lower FDR is. As clearly shown in this table, compared with OSR, ICSR (as well as CSR) increases TPR markedly when $π_{i}^{n o n}$ is high and $d_{R T}$ is large. In this condition, ICSR also presents higher TPR and lower FDR than CSR. When $π_{i}^{n o n}$ is high and $d_{R T}$ is small, although ICSR shows visibly better TPR, the TPR of every method is at most about 0.5. When $π_{i}^{n o n}$ is low, ICSR (as well as CSR) brings only a small improvement in TPR and increases FDR slightly. In accordance with C. Wang, Xu, and Shang's (2018) finding, noneffort severity ( $π_{i}^{n o n}$ ) is more damaging than noneffort prevalence ( $π$ ) to the TPR of OSR. One possible explanation is that in these residual methods, each individual’s RT residual on a given item is compared against a standard normal distribution, and this residual (through the speed estimate) is more seriously distorted by a high level of noneffort severity than by noneffort prevalence (see Equation 5), which adversely affects the TPR. In summary, the proposed method shows substantial advantages in terms of detection accuracy, especially when the contamination caused by noneffortful responses is severe.

Table 3.

Summary of the TPR and FDR of Three Methods in Simulation Study 1

$π$	$π_{i}^{n o n}$	$d_{R T}$	Criteria	OSR		CSR		ICSR
$π$	$π_{i}^{n o n}$	$d_{R T}$	Criteria	M	SD	M	SD	M	SD
0%	—	—	FPR	.047	.001	.048	.001	.060	.001
20%	High	Large (.125)	TPR	.307	.009	.500	.011	.930	.008
		Large (.125)	FDR	.161	.008	.357	.008	.283	.008
			Proportion	.046	.001	.097	.002	.162	.002
		Small (.125)	TPR	.186	.005	.252	.007	.503	.014
		Small (.125)	FDR	.477	.009	.537	.007	.427	.006
			Proportion	.044	.001	.068	.001	.109	.002
	Low	Large (.025)	TPR	.908	.009	.912	.009	.967	.005
		Large (.025)	FDR	.468	.017	.489	.021	.532	.022
			Proportion	.042	.001	.044	.002	.052	.002
		Small (.025)	TPR	.587	.013	.587	.014	.693	.014
		Small (.025)	FDR	.692	.012	.691	.011	.707	.011
			Proportion	.048	.001	.047	.001	.059	.001
40%	High	Large (.250)	TPR	.167	.004	.494	.008	.931	.008
		Large (.250)	FDR	.029	.003	.172	.006	.136	.006
			Proportion	.043	.001	.149	.003	.269	.004
		Small (.250)	TPR	.131	.003	.243	.007	.488	.015
		Small (.250)	FDR	.225	.009	.306	.007	.225	.005
			Proportion	.042	.001	.088	.002	.157	.005
	Low	Large (.050)	TPR	.870	.008	.865	.008	.939	.005
		Large (.050)	FDR	.166	.010	.157	.011	.181	.012
			Proportion	.052	.001	.051	.001	.057	.001
		Small (.050)	TPR	.552	.011	.547	.011	.651	.011
			FDR	.455	.014	.449	.014	.468	.014
			Proportion	.051	.001	.050	.001	.061	.001

Note. The numbers in the parentheses are the averaged true proportions of noneffortful responses under each condition. TPR = true positive rate; FDR = false discovery rate; OSR = original standard residual; CSR = conditional estimate standard residual; ICSR = iterative conditional estimate standard residual method.
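The criteria in Table 3 can be computed from two Boolean matrices, as in the minimal sketch below. The function names `detection_rates` and `implied_proportion` and the toy arrays are assumptions made for illustration; the FPR follows the definition given in the text (false flags over all responses). The second helper uses the identity that the flagged proportion equals TPR times the true proportion divided by (1 − FDR), which offers a quick consistency check on a row of the table.

```python
import numpy as np

def detection_rates(flagged, truth):
    """TPR, FDR, FPR and flagged proportion for response-level flags."""
    flagged = np.asarray(flagged, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.sum(flagged & truth)          # noneffortful responses that were flagged
    fp = np.sum(flagged & ~truth)         # effortful responses that were flagged
    n_true, n_flag = truth.sum(), flagged.sum()
    return {
        "TPR": tp / n_true if n_true else float("nan"),
        "FDR": fp / n_flag if n_flag else float("nan"),
        "FPR": fp / flagged.size,         # share of all responses, as defined in the text
        "proportion": n_flag / flagged.size,
    }

def implied_proportion(tpr, fdr, true_prop):
    """Flagged proportion implied by TPR, FDR and the true noneffort proportion."""
    return tpr * true_prop / (1.0 - fdr)

# Toy example: 8 responses, 2 truly noneffortful, 2 flagged (1 hit, 1 false alarm).
truth = np.array([1, 1, 0, 0, 0, 0, 0, 0], dtype=bool)
flagged = np.array([1, 0, 1, 0, 0, 0, 0, 0], dtype=bool)
rates = detection_rates(flagged, truth)

# ICSR row for pi = 20%, high severity, large d_RT (TPR .930, FDR .283, true .125).
check = implied_proportion(0.930, 0.283, 0.125)
```

For that ICSR row, the implied proportion agrees with the reported value of .162 to three decimals.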

The intercoder reliability between these methods can be found in Table 4. As can be seen in this table, the detections by OSR and CSR, and CSR and ICSR are more consistent than those by OSR and ICSR. All the methods tend to show less consistency when $π_{i}^{n o n}$ is high, especially when $d_{R T}$ is large.

Table 4.

Summary of the Classification Consistency of Three Methods in Simulation Study 1

$π$	$π_{i}^{n o n}$	$d_{R T}$	OSR-CSR		OSR-ICSR		CSR-ICSR
$π$	$π_{i}^{n o n}$	$d_{R T}$	M	SD	M	SD	M	SD
0%	—	—	.998	.000	.987	.001	.988	.000
20%	High	Large	.949	.002	.884	.003	.935	.002
		Small	.976	.001	.935	.003	.958	.002
	Low	Large	.998	.001	.991	.002	.993	.001
		Small	.999	.000	.988	.001	.988	.001
40%	High	Large	.894	.003	.774	.004	.880	.003
		Small	.954	.002	.885	.005	.930	.003
	Low	Large	.999	.000	.994	.000	.994	.000
	Low	Small	.999	.000	.989	.001	.988	.000

Note. OSR = original standard residual; CSR = conditional estimate standard residual; ICSR = iterative conditional estimate standard residual method.

A natural question that follows is how the performance of the proposed method in detecting noneffortful responses translates to parameter recovery. The bias, RMSE, and RE when $π_{i}^{n o n}$ is high are reported in Table 5; these are the conditions in which the methods differ most. When $π_{i}^{n o n}$ is high and $d_{R T}$ is large, a substantial reduction in RMSE can be noted for ICSR, as well as for CSR, with ICSR showing more advantages. When $π$ increases, the improvement in estimation precision is more pronounced for parameters a, b, α, β, and τ. In contrast, when $π_{i}^{n o n}$ is high and $d_{R T}$ is small, the proposed method makes a smaller improvement in parameter estimates. It can be noted that when $π_{i}^{n o n}$ is high, OSR, and sometimes CSR, tend to yield positive bias for a, α, and β, and negative bias for b. This effect is more obvious when $π = 40 %$ than when $π = 20 %$ . The parameter recovery results when $π = 0 %$ and when $π_{i}^{n o n}$ is low are shown in Online Appendix C. When $π_{i}^{n o n}$ is low, the three methods show more consistent parameter estimates. In addition, for a and b, when $π_{i}^{n o n}$ is high and $d_{R T}$ is large, ICSR provides estimates as accurate as those based on the pure sample. When $π_{i}^{n o n}$ is low, in contrast, ICSR shows a significant improvement in estimation precision over the pure sample.

Table 5.

Parameter Recovery for All the Parameters When $π_{i}^{n o n}$ = High in Simulation Study 1

$d_{R T}$	Criteria	$π$	20%			40%
$d_{R T}$	Criteria	Parameter	OSR	CSR	ICSR	OSR	CSR	ICSR
Large	Bias	a	0.200	0.181	0.039	0.351	0.333	0.098
		b	−0.098	−0.077	−0.020	−0.215	−0.153	−0.032
		α	0.376	0.202	−0.213	0.719	0.473	−0.176
		β	0.182	0.104	−0.052	0.468	0.278	−0.029
		θ	−0.004	−0.004	−0.005	−0.004	−0.004	−0.004
		τ	−0.015	−0.015	−0.014	−0.016	−0.015	−0.014
	RMSE	a	0.326	0.282	0.129	0.531	0.454	0.174
		b	0.137	0.113	0.058	0.272	0.194	0.067
		α	0.399	0.217	0.220	0.747	0.488	0.186
		β	0.183	0.106	0.054	0.469	0.280	0.033
		θ	0.412	0.397	0.344	0.521	0.477	0.387
		τ	0.453	0.388	0.170	0.614	0.489	0.210
	RE₁	a	0.848	0.734	0.336	0.952	0.814	0.313
		b	0.781	0.642	0.332	0.899	0.640	0.223
		α	0.654	0.356	0.360	0.866	0.566	0.215
		β	0.674	0.390	0.199	0.841	0.502	0.059
		θ	0.947	0.913	0.791	0.965	0.883	0.716
		τ	0.803	0.689	0.302	0.900	0.718	0.308
	RE₂	a	2.489	2.153	0.985	3.612	3.088	1.184
		b	2.635	2.173	1.115	4.459	3.180	1.098
Small	Bias	a	0.244	0.242	0.204	0.418	0.417	0.375
		b	−0.124	−0.118	−0.093	−0.250	−0.232	−0.187
		α	0.081	0.007	−0.161	0.301	0.183	−0.058
		β	0.085	0.059	−0.003	0.244	0.188	0.079
		θ	−0.003	−0.003	−0.004	−0.002	−0.002	−0.003
		τ	−0.015	−0.014	−0.014	−0.015	−0.015	−0.014
	RMSE	a	0.388	0.377	0.305	0.593	0.574	0.496
		b	0.172	0.164	0.127	0.316	0.293	0.229
		α	0.116	0.057	0.168	0.333	0.206	0.079
		β	0.087	0.061	0.015	0.246	0.190	0.082
		θ	0.428	0.424	0.400	0.534	0.521	0.484
		τ	0.300	0.288	0.221	0.389	0.361	0.282
	RE₁	a	0.924	0.898	0.725	0.970	0.939	0.812
		b	0.886	0.845	0.657	0.931	0.864	0.676
		α	0.383	0.189	0.556	0.683	0.422	0.162
		β	0.588	0.412	0.098	0.795	0.614	0.264
		θ	0.975	0.966	0.912	0.977	0.954	0.886
		τ	0.857	0.822	0.629	0.894	0.830	0.647
	RE₂	a	2.471	2.401	1.943	3.389	3.280	2.834
	RE₂	b	3.018	2.877	2.228	4.647	4.309	3.368

Note. RE₁ is the ratio of the RMSE of parameter estimates achieved by one of the methods to that based on the original data, and RE₂ is the ratio of the RMSE of parameter estimates achieved by one of the methods to that based on the pure sample identified by the mixture model method. The results where ICSR shows more advantages are boldfaced. OSR = original standard residual; CSR = conditional estimate standard residual; ICSR = iterative conditional estimate standard residual method; RMSE = root mean square error.
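The two relative efficiency indices in the note can be checked directly against the reported RMSEs. The sketch below does this for the discrimination parameter a when $π$ = 20%, $π_{i}^{n o n}$ is high, and $d_{R T}$ is large, taking the method RMSEs from Table 5 and the reference RMSEs from Table 1; the variable names are illustrative only.

```python
# RMSE of a under each method (Table 5: pi = 20%, high severity, large d_RT).
rmse_a = {"OSR": 0.326, "CSR": 0.282, "ICSR": 0.129}

# Reference RMSEs of a for the same condition (Table 1).
rmse_a_original = 0.384   # original full data
rmse_a_pure = 0.131       # pure sample from the mixture model method

# RE1: method RMSE relative to the original data.
re1 = {m: v / rmse_a_original for m, v in rmse_a.items()}

# RE2: method RMSE relative to the pure sample.
re2 = {m: v / rmse_a_pure for m, v in rmse_a.items()}

for m in rmse_a:
    print(f"{m}: RE1 = {re1[m]:.3f}, RE2 = {re2[m]:.3f}")
```

The ratios reproduce the tabled RE₁ and RE₂ values for a up to rounding of the three-decimal inputs. An RE₂ below 1 for ICSR means its RMSE is slightly smaller than that obtained from the pure sample.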

To better compare the bias under different methods, Figures 1 through 6 show the bias of each item or person parameter when $π = 40 %$ . As the patterns of bias are similar across replications, parameter estimates are averaged across replications to give a more stable picture for each parameter. For the discrimination parameter, estimation based on the full data set yields larger positive bias for items with larger true discrimination parameters. This effect becomes quite extreme when $π_{i}^{n o n}$ is high. When $π_{i}^{n o n}$ is high and $d_{R T}$ is large, only ICSR markedly reduces the positive bias. When $π_{i}^{n o n}$ is low, OSR, CSR, and ICSR make similar improvements in terms of bias, with ICSR exhibiting slightly more advantage. The bias of the difficulty parameter estimates based on the contaminated data shows an interesting pattern: a slight “inward” bias for moderate difficulty parameters and a larger negative “outward” bias for both easy items and items with more extreme difficulty parameters (with even larger bias for more difficult items). This pattern is more obvious when $π_{i}^{n o n}$ is high. The bias can be virtually eliminated by using ICSR in all conditions except when $π_{i}^{n o n}$ is high and $d_{R T}$ is small.

For the time discrimination power parameter, items with larger true α values tend to present larger positive bias, which is more extreme when $π_{i}^{n o n}$ is high. When $π_{i}^{n o n}$ is low, all the methods yield much less bias than the full sample does. When $π_{i}^{n o n}$ is high, however, only ICSR reduces the positive bias noticeably. In particular, when $π_{i}^{n o n}$ is high and $d_{R T}$ is large, ICSR leads to a small negative bias, which is possibly associated with overflagging, that is, more responses than necessary being screened from the calibration sample. For the time-intensity parameter, a slightly larger positive bias can be observed for items with higher β values, especially when $π_{i}^{n o n}$ is high. In this condition, using ICSR largely eliminates the bias. When $π_{i}^{n o n}$ is low, all the methods yield estimates only slightly less biased than those based on the full data. This may be due to the small magnitude of bias for the original data set.

For the ability parameter, low-ability examinees tend to be overestimated, while high-ability examinees tend to be underestimated. This effect is more obvious when $π_{i}^{n o n}$ is high. ICSR can eliminate the bias effectively, especially when $d_{R T}$ is large. For the speed parameter, when $π_{i}^{n o n}$ is high, slow examinees tend to present larger negative bias. Less biased estimates can be obtained by using ICSR. When $π_{i}^{n o n}$ is low, all the methods, as well as the full sample, obtain almost unbiased person parameter estimates for most examinees.

Figure 1.

Bias of discrimination parameter estimates (when π = 40%).

Figure 2.

Bias of difficulty parameter estimates (when π = 40%).

Figure 3.

Bias of time discrimination power parameter estimates (when π = 40%).

Figure 4.

Bias of time-intensity parameter estimates (when π = 40%).

Figure 5.

Bias of ability parameter estimates (when π = 40%).

Figure 6.

Bias of speed parameter estimates (when π = 40%).

Finally, the percentages of misfit respondents under each method and for the original full data are summarized in Table 6. As can be seen in this table, when $π_{i}^{n o n}$ is high and $d_{R T}$ is large, the percentage of misfit respondents under ICSR is around 4%, which is much smaller than those under CSR and OSR, as well as that for the original data. In all the other conditions, the differences among the methods are less pronounced. As expected, when noneffort severity is high and the difference between RTs is large, the 2PL model fits the data cleansed by ICSR better than the data cleansed by CSR or OSR.

Table 6.

The Percentage of Misfit Persons for Each Method and the Original Data (%)

$π$	$π_{i}^{n o n}$	$d_{R T}$	Original	OSR	CSR	ICSR
0%			5.220	4.482	4.467	4.335
20%	High	Large	11.717	9.183	7.115	4.037
		Small	11.565	9.775	8.995	7.048
	Low	Large	6.167	4.612	4.570	4.370
		Small	6.255	4.862	4.862	4.555
40%	High	Large	14.103	12.813	8.662	3.910
		Small	14.007	12.577	10.932	8.957
	Low	Large	7.195	4.740	4.763	4.505
	Low	Small	7.235	5.465	5.498	5.218

Note. OSR = original standard residual; CSR = conditional estimate standard residual; ICSR = iterative conditional estimate standard residual method.

Simulation Study 2

Given that in Simulation Study 1, the results under the conditions of $π = 20 %$ and 40% show highly consistent patterns, $π$ is fixed at 0% and 40% in Simulation Study 2. In general, Table 7 indicates that the patterns of TPR and FDR in Simulation Study 2 are similar to those in Simulation Study 1, although the FDR of CSR and ICSR is higher than in Simulation Study 1, especially when $π_{i}^{n o n}$ is low and $d_{R T}$ is large (Simulation Study 2: FDR = 0.437 and 0.534 for CSR and ICSR; Simulation Study 1: FDR = 0.157 and 0.181 for CSR and ICSR). This may be explained by underestimation of the speed parameter (see Table 8; bias for CSR and ICSR is −0.037 and −0.059, respectively), which translates to underestimation of the RT residual. Consequently, CSR and ICSR may flag more responses as noneffortful (i.e., larger FDR).

The classification consistency of these methods in Simulation Study 2 can be found in Online Appendix D. Similar to the results in Table 4, when $π_{i}^{n o n}$ is high (especially when $d_{R T}$ is large), the detections of CSR and ICSR show less consistency.

Table 7.

Summary of the TPR and FDR of Three Methods in Simulation Study 2

$π$	$π_{i}^{n o n}$	$d_{R T}$	Criteria	CSR		ICSR
$π$	$π_{i}^{n o n}$	$d_{R T}$	Criteria	M	SD	M	SD
0%	—	—	FPR	.047	.001	.059	.001
40%	High	Large	TPR	.511	.015	.942	.009
			FDR	.185	.005	.155	.002
		Small	TPR	.259	.013	.522	.029
			FDR	.314	.011	.235	.010
	Low	Large	TPR	.941	.006	.979	.003
			FDR	.437	.010	.534	.008
		Small	TPR	.621	.020	.728	.018
			FDR	.555	.010	.600	.008

Note. In Simulation Study 2, due to known item parameters, OSR and CSR are the same method. TPR = true positive rate; FPR = false positive rate; FDR = false discovery rate; CSR = conditional estimate standard residual; ICSR = iterative conditional estimate standard residual method.

Table 8 shows the results of person parameter recovery across different simulation conditions in Simulation Study 2. Similar to the results in Simulation Study 1, when $π_{i}^{n o n}$ is high and $d_{R T}$ is large, ICSR shows more advantage than CSR. However, compared with the results in Simulation Study 1, the bias and RMSE of person parameters are generally larger. A possible explanation is as follows. When item parameters are unknown, noneffortful responses are treated as missing to obtain both item and person parameters; in this case, CSR overestimates the difficulty parameter (e.g., when $π_{i}^{n o n}$ is high and $d_{R T}$ is large, bias of b is 0.153 in Simulation Study 1) and shows almost no bias for the ability parameter (e.g., when $π_{i}^{n o n}$ is high and $d_{R T}$ is large, bias of $θ$ is 0.004 in Simulation Study 1). In contrast, when item parameters are known, they are fixed at the generating values (i.e., no bias for item parameters), so all the discrepancies between the estimates and the true values can only be reflected in the person parameters. It can be seen from Table 8 that CSR underestimates the ability parameter (e.g., when $π_{i}^{n o n}$ is high and $d_{R T}$ is large, bias of $θ$ is −0.123 in Simulation Study 2), which is probably due to the low accuracy caused by undetected noneffortful responses.

Table 8.

Parameter Recovery for All the Parameters in Simulation Study 2

$d_{R T}$	Criteria	$π$	0%		40%
		$π_{i}^{n o n}$			High		Low
		Parameter	CSR	ICSR	CSR	ICSR	CSR	ICSR
Large	Bias	θ	0.010	−0.013	.123	.036	.013	.013
		τ	0.048	−0.059	−.254	.025	.037	.059
	RMSE	θ	0.306	0.308	.526	.411	.320	.323
		τ	0.111	0.121	.551	.201	.114	.124
	RE₁	θ	1.016	1.021	.891	.697	.909	.918
		τ	1.191	1.302	.610	.223	.465	.507
Small	Bias	θ	—	—	.184	.159	.026	.026
		τ	—	—	−.181	−.071	.031	.053
	RMSE	θ	—	—	.589	.551	.330	.328
		τ	—	—	.398	.280	.116	.124
	RE₁	θ	—	—	.965	.903	.969	.963
		τ	—	—	.739	.520	.704	.758

Note. In Simulation Study 2, due to known item parameters, OSR and CSR are the same method. CSR = conditional estimate standard residual; ICSR = iterative conditional estimate standard residual method; RMSE = root mean square error.

Empirical Example

In addition to the simulation studies, OSR, CSR, and the proposed method were used to analyze a real data set from the Program for International Student Assessment (PISA) 2015. A sample of students taking two mathematics clusters from booklet 45 was selected as an illustrative example. As RT might not be comparable across languages or countries, we selected a subsample from Spain for analysis, which contained 901 participants in total. The selected data set included RA and RT information for 23 mathematics items, 22 of which had two score categories and one of which had three score categories. For simplicity, Categories 2 and 3 were combined for the only item with three response categories so that all the items had two categories and van der Linden’s (2007) hierarchical model could be fitted. The percentage of missing values for RA is 7.42%, while that for RT is 3.67%. The percentage of responses missing in RA and RT simultaneously is 3.62% (one response had RA information but no RT information). More details about the test design and sampling procedure can be found in the PISA 2015 technical report (Organization for Economic Cooperation and Development, 2017).

The original sample was used to estimate item and person parameters using van der Linden’s (2007) hierarchical model. Then, item parameters were obtained based on the effortful sample identified by the mixture model method. With these parameter estimates in hand, CSR, ICSR, and OSR were applied. The prior distributions, the initial values of each parameter, the number of chains, the number of iterations, the length of burn-in, and the thinning rate were set the same as in the simulation studies. The iterative cleansing procedure converged in five cycles. For OSR, CSR, and ICSR, the percentages of flagged responses are 5.39%, 6.73%, and 8.48%, respectively, while the difference between RTs of noneffortful and effortful responses is 2.24, 2.04, and 1.88 for OSR, CSR, and ICSR, respectively. This situation is therefore similar to the condition in Simulation Study 1 in which $π$ is 40%, $π_{i}^{n o n}$ is low, and $d_{R T}$ is large. It can be inferred that in this situation, ICSR may show some advantages over the other two methods in terms of classification accuracy and parameter estimation, but the improvement may not be as large as when $π_{i}^{n o n}$ is high and $d_{R T}$ is large. Cleansing the data by CSR, and especially by ICSR, results in more responses being flagged. The intercoder reliability among the different methods is rather high (i.e., above 0.95).

Item parameter estimates based on the full sample and the samples cleansed by OSR, CSR, and ICSR are presented in Online Appendix E. As can be seen there, the results of the empirical data analysis are consistent with the findings of the simulation studies. First, cleansing the sample by these methods, especially by CSR or ICSR, results in noticeable increases in the time discrimination power estimates (see Figure 3 for a similar pattern in the simulation study). Second, the estimates of the time-intensity parameter based on the sample cleansed by any method are higher than those based on the original sample. ICSR results in the highest estimates, which is consistent with the pattern in Figure 4 as well. Third, these methods have barely any effect on the estimation of the discrimination parameter or the difficulty parameter. This may be because the proportion of noneffortful responses in these data is rather low.

By regarding the estimates from the original full data as the baseline, we computed the relative difference (RD) and mean absolute difference (MAD) of the estimates based on different methods. RD is computed as the averaged difference between the estimates based on one of the detection methods and those based on the original data. MAD is computed as the averaged absolute difference between the estimates based on one of the detection methods and those based on the original data. The results are shown in Table 9. The RD and MAD of discrimination parameter and difficulty parameter are small for all the methods, which means that the item parameter estimates in the IRT model barely change after data are cleansed. All the methods tend to produce positive RD of both time discrimination power parameter and time-intensity parameter, with ICSR showing the largest difference from the original data. Although there is almost no RD for person parameters, the magnitude of MAD shows that there are some nonignorable differences between the person parameter estimates based on the original data and those based on the data cleansed by ICSR.

Table 9.

The RD and MAD of Parameter Estimates Compared to Those Based on the Original Data

Statistics	Method	a	b	α	β	θ	τ
RD	OSR	−.003	−.013	.508	.113	−.001	−.001
	CSR	−.006	−.021	.579	.130	−.001	−.001
	ICSR	−.036	−.035	.675	.155	−.001	−.003
MAD	OSR	.047	.031	.508	.113	.048	.058
	CSR	.060	.040	.579	.130	.055	.062
	ICSR	.075	.057	.675	.155	.158	.146

Note. RD = relative difference; MAD = mean absolute difference; OSR = original standard residual; CSR = conditional estimate standard residual; ICSR = iterative conditional estimate standard residual method.
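The two summary statistics in Table 9 can be sketched as follows. The function name `rd_and_mad`, the toy estimates, and the sign convention (cleansed minus original, chosen so that an increase after cleansing gives a positive RD, as reported for α and β) are assumptions made for illustration.

```python
import numpy as np

def rd_and_mad(cleansed, original):
    """Averaged difference (RD) and averaged absolute difference (MAD)."""
    diff = np.asarray(cleansed, dtype=float) - np.asarray(original, dtype=float)
    return float(diff.mean()), float(np.abs(diff).mean())

# Hypothetical estimates for three items.
original = [1.0, 2.0, 3.0]

# Every estimate moves in the same direction, so RD equals MAD.
rd_same, mad_same = rd_and_mad([1.5, 2.5, 3.5], original)

# Changes cancel out, so RD is near zero while MAD is not.
rd_mixed, mad_mixed = rd_and_mad([1.2, 1.8, 3.0], original)
```

The first case matches the α and β columns of Table 9, where RD and MAD coincide, which would occur if cleansing raised every item's estimate. The second case matches the person parameters, where an RD near zero coexists with a clearly nonzero MAD.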

Finally, in the real data analysis, the percentages of misfit respondents based on the original data, OSR, CSR, and ICSR are 3.885%, 2.886%, 2.664%, and 2.442%, respectively. ICSR improves the fit of the 2PL model slightly.

Discussion and Conclusion

To detect noneffortful responses, we propose a method based on RT residual analysis: the ICSR. This method is implemented with fixed item parameters and iteratively updates the person parameter estimates before calculating RT residuals. The proposed method is compared with a noniterative method with fixed item parameters (CSR) and with OSR in two simulation studies. The simulation studies focus on three factors related to noneffortful responses: noneffort prevalence, noneffort severity, and the difference between RTs of noneffortful and effortful responses. We find that the proposed method is much more effective in detecting noneffortful responses and reducing the bias of parameter estimates when noneffort severity is high and the difference between RTs of noneffortful and effortful responses is large.
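The logic of purification with fixed item parameters can be illustrated with a deliberately simplified sketch. It is not the implementation used in this study: it uses only the RT part of the model, replaces Bayesian estimation with a closed-form maximum likelihood speed estimate under a lognormal RT model, and assumes a one-sided critical value of −1.645 and a stopping rule on the change in speed estimates. All function names and the simulated data are hypothetical.

```python
import numpy as np

def speed_mle(log_t, alpha, beta, keep):
    """ML speed estimates under a lognormal RT model, using only kept responses."""
    w = keep * alpha**2                      # flagged responses get zero weight
    empty = w.sum(axis=1) == 0
    w[empty] = alpha**2                      # safeguard: fall back to all responses
    return (w * (beta - log_t)).sum(axis=1) / w.sum(axis=1)

def flag(log_t, alpha, beta, tau, crit):
    z = alpha * (log_t - (beta - tau[:, None]))   # standardized log-RT residual
    return z < crit                               # unusually fast responses

def iterative_flagging(log_t, alpha, beta, crit=-1.645, tol=1e-3, max_iter=50):
    tau = speed_mle(log_t, alpha, beta, np.ones(log_t.shape, dtype=bool))
    first = flag(log_t, alpha, beta, tau, crit)   # single pass, no purification
    flags = first
    for n_iter in range(1, max_iter + 1):
        tau_new = speed_mle(log_t, alpha, beta, ~flags)   # re-estimate on cleansed data
        flags = flag(log_t, alpha, beta, tau_new, crit)   # item parameters stay fixed
        done = np.max(np.abs(tau_new - tau)) < tol
        tau = tau_new
        if done:
            break
    return first, flags, tau, n_iter

# Hypothetical data: 40% of examinees answer half of the items very quickly.
rng = np.random.default_rng(1)
I, J = 500, 30
alpha = np.full(J, 2.0)
beta = rng.normal(4.0, 0.3, size=J)
tau_true = rng.normal(0.0, 0.3, size=I)
log_t = beta - tau_true[:, None] + rng.normal(size=(I, J)) / alpha
truth = np.zeros((I, J), dtype=bool)
truth[: I * 2 // 5, : J // 2] = True
log_t[truth] -= 2.0                          # noneffortful responses are much faster

once, final, tau_hat, n_iter = iterative_flagging(log_t, alpha, beta)
tpr_once = (once & truth).sum() / truth.sum()
tpr_final = (final & truth).sum() / truth.sum()
```

In this toy setting the single pass misses many noneffortful responses because the contaminated speed estimates make fast responses look ordinary. Re-estimating speed from the unflagged responses removes that distortion over a few cycles, which is the mechanism the proposed method relies on.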

The proposed method belongs to the family of RT residual-based methods. These methods may show some advantages in the following aspects (Qian et al., 2016; van der Linden & Guo, 2008; C. Wang, Xu, & Shang, 2018). First, they have a theoretical reference distribution for the RT residual (i.e., the standard normal distribution). Second, they make no assumptions concerning the form of irregular behavior, which means that neither the RT nor the RA of the two types of responses needs to be fitted by a specific model. Third, the residual-based methods can be applied easily to large-scale tests, as they do not rely on visual inspection of the RT distribution for each item. Fourth, these methods can be implemented when the RT of all the responses does not follow a bimodal distribution. Finally, they can be used for items of different types, as they do not need to define a chance-level accuracy (as Guo et al., 2016, did in their study for multiple-choice items).

When data are severely contaminated by noneffortful responses, improving on the performance of OSR requires more accurate parameter estimates for calculating the residuals. To achieve this goal, we have applied two strategies in our proposed method. For item parameter estimation, when the item parameters are unknown, we first select a pure effortful sample by the mixture model method and find that estimation based on this effortful sample is improved. When the item parameters are known before detection (e.g., de la Torre & Deng, 2008), estimation errors of item parameters are not considered. In either case, the item parameters are then fixed in the following iteration steps. For person parameter estimation, we observe a gain in parameter recovery after iteratively removing noneffortful responses that may lead to biased estimates.

Obtaining accurate item parameter estimates is a crucial strategy for the proposed method. To examine this for the case of unknown item parameters, we also fixed the item parameter estimates obtained by fitting the hierarchical model to the full data and then applied the purification process (this method is called ICSR with original item parameter estimates, ICSRO). ICSRO was applied to the generated data of Simulation Study 1. The results show that when $π$ is 40%, $π_{i}^{n o n}$ is high, and $d_{R T}$ is large, parameter estimates of ICSRO are more accurate than those of OSR, but not as accurate as those of CSR, let alone ICSR (e.g., RMSE of a for OSR, CSR, ICSR, and ICSRO are 0.531, 0.454, 0.174, and 0.475, respectively). Therefore, we conclude that when item parameters are unknown, more accurate item parameter estimates should be fixed in ICSR. However, for ICSR, person parameter estimates when item parameters are known are not more accurate than those when item parameters are unknown. This situation is complicated: as item parameters do not need to be estimated, all the errors caused by noneffortful responses or by the estimation process are reflected in the person parameters (see the results of Simulation Study 2).

Moreover, item parameters should be fixed in the purification process. As noted earlier, Patton et al.’s (2019) iterative method suffers from convergence issues. This may be because in their study all the parameters need to be reestimated during each iteration, which leads to large changes in classification based on the renewed parameters. We address the convergence problem by fixing the item parameter estimates (i.e., not updating them in each iteration). Consequently, even with a more stringent convergence criterion than Patton et al.’s (0.001 vs. 0.01), all the replications of the proposed ICSR method converged successfully in our study.

The strategy of iterative purification has long been applied in educational and psychological measurement. For example, scale purification procedures have been strongly advocated and widely implemented in IRT-based differential item functioning (DIF) detection methods (Candell & Drasgow, 1988; Lord, 1980; Park & Lautenschlager, 1990; W. C. Wang et al., 2009) and non-IRT-based methods (French & Maller, 2007; Holland & Thayer, 1988). Many researchers have found that, when tests contain less than 20% DIF items, DIF detection methods with scale purification outperform those without scale purification in reducing inflated FPRs and increasing deflated TPRs (French & Maller, 2007; Hidalgo-Montesinos & Gómez-Benito, 2003; C. W. Wang & Su, 2004). As data contaminated by aberrant responses (e.g., noneffortful responses) are similar to tests contaminated by DIF items, it is natural to apply the iterative purification process in detecting aberrant responses. de la Torre and Deng (2008) proposed a method for improving the performance of the standardized log-likelihood person-fit statistic (l_z ) by iteratively constructing the distribution of each l_z through resampling. Their study shows that their method has Type I error rates close to the nominal level for most ability levels and reasonably good power. Recently, Patton et al. (2019) proposed to iteratively detect careless respondents and cleanse the data by removing their responses, which shows high power and a nominal-level FPR. However, these methods are based on person-fit statistics, and therefore the classifications are only at the examinee level. As our method is based on a residual analysis, the classification results are for each item-by-person encounter, which can provide more detailed information than person-level detection. Moreover, whereas Patton et al. (2019) take the most recent set of parameter estimates as the final values after convergence, we reestimate all the parameters based on the cleansed sample after convergence, which we expect to yield more accurate estimates.

As noneffortful responses are characterized by lower RA and RT, when noneffort severity is high, estimation based on the full sample or on data cleansed by OSR yields biased item parameters, while ICSR, and sometimes CSR in this article, can eliminate the bias to some extent. First, estimation based on the original data and OSR tends to underestimate discrimination parameters, especially for items with higher true values of discrimination parameters. In these situations, including noneffortful responses when fitting the model may reduce the information provided by the data, as the noneffortful responses carry much less or even misleading psychometric information (Wise, 2017). Second, fitting van der Linden’s (2007) hierarchical model to the original data or to data cleansed by OSR overestimates the difficulty parameters for both easy and hard items, which means that these items seem more difficult than they really are. This may be because RA is much lower under noneffortful responding. Third, estimation based on the original data and OSR produces large positive bias of time discrimination power parameters for items with larger true α values. As can be seen from Equation 1, items with lower time discrimination power distinguish fast and slow respondents less well. As deleting noneffortful responses by the proposed method means deleting responses with extremely low RTs, which introduce noise unrelated to the respondent’s true speed, these methods lead to an increase in α. Fourth, the time-intensity parameters are generally underestimated based on the original data or OSR. As can be seen from Equation 1, items with lower time intensity require less time to complete. Therefore, removing noneffortful responses with lower RT on a given item can markedly increase the estimated time the item requires in general.
Finally, the ability parameter estimates based on the original data or OSR are overestimated for low-ability examinees and underestimated for high-ability examinees. This may be because the accuracy of noneffortful responses is 0.25 regardless of examinees’ ability level. Therefore, low-ability examinees appear to perform better, and high-ability examinees worse, than they really do. The speed parameter estimates based on the original data or OSR are overestimated for slow examinees. This is related to the assumption in our simulation studies that slow examinees are more likely to respond noneffortfully. The noneffortful responses of slow examinees have extremely small RTs, which leads to relatively larger estimates of their speed parameters.
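The direction of the bias in the RT parameters can be illustrated with a small simulation. The sketch below is not the simulation design of this article: it assumes the lognormal form of the RT model, ln T ~ N(β − τ, 1/α²) (van der Linden, 2006), a single item, known speed parameters, noneffort assigned at random rather than preferentially to slow examinees, and hypothetical values for every constant.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true values for a single item (not taken from the article)
n_persons, beta_true, alpha_true = 5000, 4.0, 2.0
tau = rng.normal(0.0, 0.3, n_persons)          # person speed parameters

# Effortful log RTs: ln T ~ N(beta - tau, 1 / alpha^2)
log_rt = rng.normal(beta_true - tau, 1.0 / alpha_true)

# Assumed contamination: 20% of examinees rapid-guess, ln T ~ N(1.0, 0.5^2)
noneffort = rng.random(n_persons) < 0.20
log_rt_obs = np.where(noneffort, rng.normal(1.0, 0.5, n_persons), log_rt)

def estimate(log_t, tau):
    """Moment estimates of (beta, alpha) for one item, treating tau as known."""
    centered = log_t + tau                      # beta + noise for effortful responses
    return centered.mean(), 1.0 / centered.std(ddof=1)

beta_full, alpha_full = estimate(log_rt_obs, tau)
beta_clean, alpha_clean = estimate(log_rt_obs[~noneffort], tau[~noneffort])

print(f"full sample : beta = {beta_full:.2f}, alpha = {alpha_full:.2f}")
print(f"effortful   : beta = {beta_clean:.2f}, alpha = {alpha_clean:.2f}")
```

Under these assumptions, the rapid guesses pull the mean of the log RTs down and inflate their spread, so both β and α are estimated lower on the contaminated sample than on the effortful responses alone.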

There are many important applications of the parameters in van der Linden’s (2007) hierarchical model, and all of them can be adversely affected by parameters that are miscalibrated because of noneffortful responses. Our proposed method provides a way to efficiently detect noneffortful responses and improve parameter estimation. For example, as highly discriminating items are desirable in item selection, the proposed method can prevent these items from being erroneously neglected because it avoids underestimating their discrimination parameters. Moreover, Choe et al. (2018) proposed a novel item selection criterion that maximizes Fisher information per unit of expected RT. One of their methods favors items with high information and low expected RT. As items with low β tend to have low expected RT, and the proposed method can obtain less biased β estimates in some situations, it will yield more accurate expected RTs. In summary, we strongly recommend that practitioners detect noneffortful responses before estimating item parameters.
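To make the dependence on β concrete, the following sketch computes a simple ratio of Fisher information to expected RT. It is an illustration under stated assumptions, not the exact criterion of Choe et al. (2018): it assumes a two-parameter logistic model for RA, a lognormal RT model whose mean is exp(β − τ + 1/(2α²)), and hypothetical function names and parameter values.

```python
import numpy as np

def fisher_info_2pl(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1.0 - p)

def expected_rt(tau, alpha, beta):
    """Mean of a lognormal RT with log-mean beta - tau and log-SD 1 / alpha."""
    return np.exp(beta - tau + 0.5 / alpha**2)

def info_per_time(theta, tau, a, b, alpha, beta):
    """Information per unit of expected RT (simple ratio form)."""
    return fisher_info_2pl(theta, a, b) / expected_rt(tau, alpha, beta)

# Same item evaluated with its true beta and with an underestimated beta
crit_true = info_per_time(theta=0.0, tau=0.0, a=1.2, b=0.0, alpha=2.0, beta=4.0)
crit_biased = info_per_time(theta=0.0, tau=0.0, a=1.2, b=0.0, alpha=2.0, beta=3.5)

print(f"criterion with true beta          : {crit_true:.4f}")
print(f"criterion with underestimated beta: {crit_biased:.4f}")
```

Because the expected RT is exponential in β, an underestimate of 0.5 on the log scale inflates the ratio by a factor of exp(0.5), so a contaminated calibration could make an item look more time-efficient than it is.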

Our study has some limitations and can be extended in the following ways. First, in this study, data were generated assuming that slow examinees are more likely to guess. In practice, noneffortful responding may be unrelated to examinees’ ability or speed (e.g., Sundre & Wise, 2003; Wise & DeMars, 2005; Wise & Kong, 2005). Accordingly, future studies can draw noneffortful samples randomly from the whole sample. Second, the iterative purification process with fixed item parameters can be applied to detecting other forms of aberrant responses (e.g., cheating, item preknowledge) based on other statistics (e.g., person-fit statistics, nonparametric statistics). Third, the current study adopts a mixture model method with a fixed number of latent classes to classify the noneffortful and effortful groups. Other methods can be developed to select a pure group to obtain more accurate item parameter estimates. Fourth, the iterative purification process is quite complex and time-consuming. The results of our simulation studies show that when noneffort severity is low, the iterative process does not bring much improvement in identification accuracy or parameter recovery. In real data analysis, practitioners should decide whether it is necessary to use the proposed iterative method. To gain a deeper understanding of this method, future studies should compare the proposed method with other within-subject mixture approaches (e.g., Kuijpers et al., 2020; Molenaar et al., 2018) under various conditions. Fifth, as can be seen from the results, the proposed method may increase estimation error in the RT model, especially for the time discrimination parameter. This may be because deleting extremely fast responses changes the distribution of RT. 
Robust estimation methods (e.g., Hong & Cheng, 2019) that down-weight flagged response patterns may provide an alternative to directly removing noneffortful responses, as we did in the current study. Finally, because of the “masking effect,” we caution readers against overgeneralizing the results of the current study to situations where noneffort severity is extremely high (e.g., almost all the responses of a respondent are noneffortful). A large-scale simulation study is needed in the future to explore the performance of the proposed method in such extreme situations.
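The general logic of iterative purification, and why a single pass can miss responses that are masked by other noneffortful responses, can be sketched in a few lines. The code below is a deliberately simplified stand-in and not the ICSR procedure itself: it uses only log RTs, moment estimates instead of the hierarchical model and the mixture-model step, a hypothetical one-sided cutoff `z_crit`, and hypothetical function names (`fit_lognormal`, `purify`) and simulation settings.

```python
import numpy as np

def fit_lognormal(log_t, keep):
    """Moment estimates of beta (items), tau (persons), alpha (items)
    computed from the currently unflagged responses only."""
    x = np.where(keep, log_t, np.nan)
    beta = np.nanmean(x, axis=0)                    # item time intensity
    tau = -np.nanmean(x - beta, axis=1)             # person speed
    resid = x - (beta - tau[:, None])
    alpha = 1.0 / np.nanstd(resid, axis=0, ddof=1)  # item time discrimination
    return beta, tau, alpha

def purify(log_t, z_crit=-1.96, max_iter=20):
    """Refit on unflagged responses and re-flag all responses until stable."""
    keep = np.ones_like(log_t, dtype=bool)
    for _ in range(max_iter):
        beta, tau, alpha = fit_lognormal(log_t, keep)
        z = alpha * (log_t - (beta - tau[:, None]))  # standardized residuals
        new_keep = z > z_crit                        # flag unusually fast responses
        if np.array_equal(new_keep, keep):
            break
        keep = new_keep
    return ~keep, (beta, tau, alpha)

# Simulated data with assumed values: 10% of responses are rapid guesses
rng = np.random.default_rng(1)
n, J = 300, 20
beta = rng.uniform(3.5, 4.5, J)
tau = rng.normal(0.0, 0.3, n)
log_t = rng.normal(beta - tau[:, None], 0.5)
noneffort = rng.random((n, J)) < 0.10
log_t = np.where(noneffort, rng.normal(1.0, 0.3, (n, J)), log_t)

flag_once, _ = purify(log_t, max_iter=1)            # single pass, no purification
flag_iter, (beta_hat, _, _) = purify(log_t)         # iterate until flags stabilize
tpr_once = flag_once[noneffort].mean()
tpr_iter = flag_iter[noneffort].mean()
print(f"true positive rate: one pass = {tpr_once:.3f}, iterative = {tpr_iter:.3f}")
```

In this sketch, a respondent whose responses are all flagged would have no data left to estimate speed, which mirrors the extreme situation that the caution above refers to.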

Supplemental Material

Supplemental Material, sj-docx-1-jeb-10.3102_1076998621994366 for Detecting Noneffortful Responses Based on a Residual Method Using an Iterative Purification Process by Yue Liu and Hongyun Liu in Journal of Educational and Behavioral Statistics

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research is supported by the Grant from National Education Examinations Authority of P. R. China (GJK2017015).

ORCID iD

Hongyun Liu

References

Bolt, Cohen, & Wollack (2002). Item parameter estimation under conditions of test speededness: Application of a mixture Rasch model with ordinal constraints. Journal of Educational Measurement, 39, 331–348.

Bridgeman & Cline (2004). Effects of differentially time-consuming tests on computer-adaptive test scores. Journal of Educational Measurement, 41(2), 137–148.

Candell, G. L., & Drasgow (1988). An iterative procedure for linking metrics and assessing item bias in item response theory. Applied Psychological Measurement, 12, 253–260.

Choe, E. M., Kern, J. L., & Chang, H. H. (2018). Optimizing the use of response times for item selection in computerized adaptive testing. Journal of Educational and Behavioral Statistics, 43(2), 135–158.

de la Torre & Deng (2008). Improving person-fit assessment by correcting the ability estimate and its reference distribution. Journal of Educational Measurement, 45(2), 159–177.

DeMars, C. E. (2007). Changes in rapid-guessing behavior over a series of assessments. Educational Assessment, 12, 23–45.

Fan, Wang, Chang, H. H., & Douglas (2012). Utilizing response time distributions for item selection in CAT. Journal of Educational and Behavioral Statistics, 37(5), 655–670.

Fisher, R. A. (1915). Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika, 10(4), 507–521.

French, B. F., & Maller, S. J. (2007). Iterative purification and effect size use with logistic regression for differential item functioning detection. Educational and Psychological Measurement, 67, 373–393.

Gelman & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7(4), 457–472.

Guo, Rios, J. A., Haberman, Liu, O. L., Wang, & Paek (2016). A new procedure for detection of students’ rapid guessing responses using response time. Applied Measurement in Education, 29, 173–183.

Hidalgo-Montesinos, M. D., & Gómez-Benito (2003). Test purification and the evaluation of differential item functioning with multinomial logistic regression. European Journal of Psychological Assessment, 19, 1–11.

Holland, W. P., & Thayer, D. T. (1988). Differential item performance and the Mantel–Haenszel procedure. In Wainer & Braun, H. I. (Eds.), Test validity (pp. 129–145). Lawrence Erlbaum.

Hong, M. R., & Cheng (2019). Robust maximum marginal likelihood (RMML) estimation for item response theory models. Behavior Research Methods, 51(2), 573–588.

Kolen, M. J., & Brennan, R. L. (2004). Test equating, scaling, and linking: Methods and practices. Springer Verlag.

Kong, X. J., Wise, S. L., & Bhola, D. S. (2007). Setting the response time threshold parameter to differentiate solution behavior from rapid-guessing behavior. Educational and Psychological Measurement, 67(4), 606–619.

Kuijpers, R. E., Visser, & Molenaar (2020). Testing the within-state distribution in mixture models for responses and response times. Journal of Educational and Behavioral Statistics. Advance online publication. https://doi.org/10.3102/1076998620957240

Lee, Y. H., & Jia (2014). Using response time to investigate students’ test-taking behaviors in a NAEP computer-based study. Large-Scale Assessments in Education, 2(8), 1–24.

Liu, Cheng, & Liu (2020). Identifying effortful individuals with mixture modeling response accuracy and response time simultaneously to improve item parameter estimation. Educational and Psychological Measurement, 80(4), 775–807.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Lawrence Erlbaum.

Molenaar, Bolsinova, & Vermunt, J. K. (2018). A semi-parametric within-subject mixture approach to the analyses of responses and response times. British Journal of Mathematical and Statistical Psychology, 71(2), 205–228.

Muthén, L. K., & Muthén, B. O. (2012). Mplus user’s guide. Muthén & Muthén.

Organization for Economic Cooperation and Development. (2017). PISA 2015 technical report. Paris, France: OECD Publishing.

Park, D. G., & Lautenschlager, G. J. (1990). Improving IRT item bias detection with iterative linking and ability scale purification. Applied Psychological Measurement, 14, 163–173.

Patton, J. M., Cheng, Hong, & Diao (2019). Detection and treatment of careless responses to improve item parameter estimation. Journal of Educational and Behavioral Statistics, 44, 309–341.

Plummer (2003, March). JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In Proceedings of the 3rd international workshop on distributed statistical computing (Vol. 124, No. 125.10). https://www.r-project.org/nosvn/conferences/DSC-2003/Drafts/Plummer.pdf

Qian, Staniewska, Reckase, & Woo (2016). Using response time to detect item preknowledge in computer-based licensure examinations. Educational Measurement: Issues and Practice, 35(1), 38–47.

R Development Core Team. (2019). R: A language and environment for statistical computing [Computer software Manual]. Vienna, Austria. http://www.Rproject.org (ISBN 3-900051-07-0).

Ranger & Kuhn, J. T. (2017). Detecting unmotivated individuals with a new model-selection approach for Rasch models. Psychological Test and Assessment Modeling, 59(3), 269.

Rios, J. A., Guo, Mao, & Liu, O. L. (2017). Evaluating the impact of careless responding on aggregated-scores: To filter unmotivated examinees or not? International Journal of Testing, 17(1), 74–104.

Snijders, T. A. (2001). Asymptotic null distribution of person fit statistics with estimated person parameter. Psychometrika, 66, 331–342.

Sundre, D. L., & Wise, S. L. (2003, April). Motivation filtering: An exploration of the impact of low examinee motivation on the psychometric quality of tests. Annual Meeting of the National Council on Measurement in Education, Chicago, IL, United States.

van der Linden, W. J. (2006). A lognormal model for response times on test items. Journal of Educational and Behavioral Statistics, 31(2), 181–204.

van der Linden, W. J. (2007). A hierarchical framework for modeling speed and accuracy on test items. Psychometrika, 72, 287–308.

van der Linden, W. J., & Guo (2008). Bayesian procedures for identifying aberrant response-time patterns in adaptive testing. Psychometrika, 73, 365–384.

Wang (2015). A mixture hierarchical model for response times and response accuracy. British Journal of Mathematical and Statistical Psychology, 68(3), 456–477.

Wang & Shang (2018). A two-stage approach to differentiating normal and aberrant behavior in computer based testing. Psychometrika, 83(1), 223–254.

Wang, Shang, & Kuncel (2018). Detecting aberrant behavior and item preknowledge: A comparison of mixture modeling method and residual method. Journal of Educational and Behavioral Statistics, 43(4), 469–501.

Wang, W. C., Shih, C. L., & Yang, C. C. (2009). The MIMIC method with scale purification for detecting differential item functioning. Educational and Psychological Measurement, 69(5), 713–731.

Wang, W. C., & Y.-H. (2004). Effects of average signed area between two item characteristic curves and test purification procedures on the DIF detection via the Mantel-Haenszel method. Applied Measurement in Education, 17, 113–144.

Wise, S. L. (2006). An investigation of the differential effort received by items on a low-stakes computer-based test. Applied Measurement in Education, 19(2), 95–114.

Wise, S. L. (2015). Effort analysis: Individual score validation of achievement test data. Applied Measurement in Education, 28(3), 237–252.

Wise, S. L. (2017). Rapid-guessing behavior: Its identification, interpretation, and implications. Educational Measurement: Issues and Practice, 36(4), 52–61.

Wise, S. L., & DeMars, C. E. (2005). Low examinee effort in low-stakes assessment: Problems and potential solutions. Educational Assessment, 10(1), 1–17.

Wise, S. L., & DeMars, C. E. (2006). An application of item response time: The effort-moderated IRT model. Journal of Educational Measurement, 43(1), 19–38.

Wise, S. L., & DeMars, C. E. (2010). Examinee noneffort and the validity of program assessment results. Educational Assessment, 15(1), 27–41.

Wise, S. L., & Kong, X. J. (2005). Response time effort: A new measure of examinee motivation in computer-based tests. Applied Measurement in Education, 18(2), 163–183.

Yuan, K.-H., Fung, W. K., & Reise (2004). Three Mahalanobis-distances and their role in assessing unidimensionality. British Journal of Mathematical and Statistical Psychology, 57, 151–165.

Yuan, K.-H., & Zhong (2008). Outliers, leverage observations and influential cases in factor analysis: Minimizing their effect using robust procedures. Sociological Methodology, 38, 329–368.

