Using Item Scores and Distractors to Detect Item Compromise and Preknowledge

Abstract

Any time examinees have had access to items and/or answers prior to taking a test, the fairness of the test and validity of test score interpretations are threatened. Therefore, there is a high demand for procedures to detect both compromised items (CI) and examinees with preknowledge (EWP). In this article, we develop a procedure that uses item scores and distractors to simultaneously detect CI and EWP. The false positive rate and true positive rate are evaluated for both items and examinees using detailed simulations. A real data example is also provided using data from an information technology certification exam.

Keywords

item compromise, item preknowledge, test security, distractor analysis, nested logit model

1. Introduction

Any time examinees have had access to items and/or answers prior to taking a test, the fairness of the test and validity of test score interpretations are threatened. Therefore, there is a high demand for procedures to detect both compromised items (CI) and examinees with preknowledge (EWP). Although many researchers have considered each of these problems individually (e.g., Sinharay, 2017; Wang & Liu, 2020), a far more common yet challenging problem is one where the CI and EWP must be detected simultaneously. This type of situation could occur, for example, if a testing program has no information regarding the CI or EWP prior to starting the analysis.

In recent years, several methods have been proposed for simultaneously detecting CI and EWP. Many of these methods rely on information from item scores (e.g., Belov, 2017; O’Leary & Smith, 2017), item response times (e.g., Boughton et al., 2017), or both item scores and response times (e.g., Chen et al., 2022). However, because item scores are dichotomous, they are limited in the information they are able to provide. Item response times are continuous and therefore provide a much richer source of information, but unfortunately, response times are not always available.

An additional source of information that is often overlooked is distractor selection. Item distractors are polytomous, freely available in all multiple-choice data, and can be used to detect preknowledge of an incorrect answer key. Until recently, many researchers have focused only on detecting preknowledge of a correct answer key and/or no answer key. However, several examples of preknowledge of an incorrect answer key appear in operational settings (e.g., Eckerly, 2021; Liu & Becker, 2022), suggesting a need for detection methods that are also sensitive to this type of preknowledge.

The purpose of this article is to develop a method that uses item scores and distractors to simultaneously detect CI and EWP. Unlike many existing methods, the proposed method is sensitive to preknowledge of no answer key, preknowledge of a correct answer key, and preknowledge of an incorrect answer key. The remainder of this article is organized as follows. In Section 2, a description is provided of the model that is used for the item scores and distractors. In Section 3, we introduce the framework for simultaneously detecting CI and EWP. In Section 4, detailed simulations are conducted to investigate the performance of the proposed approach. The false positive rate and true positive rate are evaluated for both items and examinees. In Section 5, a real data example is provided using data from an information technology certification exam. Finally, in Section 6, we conclude with a brief discussion along with potential directions for future research.

2. Modeling Item Scores and Distractors

To model the item scores and distractors, we apply the two-dimensional nested logit model (henceforth referred to as the NLM). Under the NLM, the probability of a correct response is modeled using a dichotomous model, while distractor selection is modeled using the nominal response model (Bock, 1972) conditional on an incorrect response. The NLM has been shown to provide a reasonable fit to several operational data sets (Bolt et al., 2012) and has successfully been applied in related contexts to detect additional types of aberrant behavior (Gorney & Wollack, 2023).

Consider a multiple-choice test comprised of n items, where item $i = 1, 2, \dots, n$ has one correct answer and $m_{i} - 1$ distractors for a total of $m_{i}$ response categories. Let $θ$ and $η$ denote the latent traits that govern response correctness and distractor selection, respectively. Under the NLM (Bolt et al., 2012), the probability of answering item i correctly is modeled using a dichotomous model. For example, we can apply the two-parameter logistic model (2PLM) such that the probability of a correct response is given by

P_{i} (θ) = \frac{\exp (α_{i} θ + β_{i})}{1 + \exp (α_{i} θ + β_{i})},

where $α_{i}$ and $β_{i}$ are the discrimination and easiness parameters of item i, respectively. The probability of selecting distractor j is modeled as the product of the probability of an incorrect response and the probability of selecting distractor j conditional on an incorrect response:

P_{i j} (θ, η) = [1 - P_{i} (θ)] P_{i j} (η) .

The probability of selecting distractor j conditional on an incorrect response is given by

P_{i j} (η) = \frac{\exp (λ_{i j} η + ζ_{i j})}{\sum_{k = 1}^{m_{i} - 1} \exp (λ_{i k} η + ζ_{i k})},

where $λ_{i j}$ and $ζ_{i j}$ are the slope and intercept parameters of item i, distractor j, respectively. Note that Equation 3 has the same form as the nominal response model, though here the denominator is computed by summing across only the distractor categories, and not all of the response categories.
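As a minimal sketch of how the three NLM equations combine, the following Python code computes the full vector of response-category probabilities for a single item. The function names and the parameter values in the usage line are illustrative assumptions, not part of the original analysis (which used the mirt package in R).

```python
import math

def p_correct(theta, alpha, beta):
    """2PLM probability of a correct response, P_i(theta)."""
    return 1.0 / (1.0 + math.exp(-(alpha * theta + beta)))

def p_distractor_given_incorrect(eta, lambdas, zetas):
    """Nominal-response probabilities over the m_i - 1 distractors only."""
    logits = [lam * eta + zeta for lam, zeta in zip(lambdas, zetas)]
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]

def nlm_probabilities(theta, eta, alpha, beta, lambdas, zetas):
    """Return [P(correct), P(distractor 1), ..., P(distractor m_i - 1)]."""
    pc = p_correct(theta, alpha, beta)
    cond = p_distractor_given_incorrect(eta, lambdas, zetas)
    return [pc] + [(1.0 - pc) * c for c in cond]

# Hypothetical four-option item (one correct answer, three distractors).
probs = nlm_probabilities(theta=0.5, eta=-0.2, alpha=1.2, beta=0.8,
                          lambdas=[0.5, -0.3, 0.0], zetas=[0.2, 0.0, -0.4])
```

Because the distractor probabilities are conditional on an incorrect response, the returned probabilities sum to 1 across all response categories.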

To estimate the NLM, we used the mirt package in R (Chalmers, 2012; R Core Team, 2022). The 2PLM (Equation 1) was fitted to the item scores, and the nominal response model (Equation 3) was fitted to the item distractors. When necessary, item parameters were estimated using marginal maximum likelihood estimation. Person parameters were estimated using maximum likelihood estimation, where estimates were bounded between $- 3$ and 3.

3. Method

In order to simultaneously detect CI and EWP, we propose a flexible framework that consists of two phases. Figure 1 provides a graphical representation of the framework. Within each phase, steps are repeated until some convergence criterion is met. The purpose of Phase 1 (light gray) is to identify an initial set of items that are suspected of being compromised. This phase may be skipped if practitioners already know or have a reasonable idea as to which items have been compromised. Phase 2 (dark gray) then identifies the final set of items that are suspected of being compromised, as well as the set of examinees who are suspected of having preknowledge. At any time, the procedure may be terminated if too many or too few items are flagged. This is because the sets of flagged and non-flagged items must be of reasonable sizes (e.g., at least 4 items) to compare examinee performance across item sets. If fewer than 4 items are flagged, then no solution is returned, meaning that no items or examinees are flagged. If more than $n - 4$ items are flagged, then the procedure terminates and the flagging results are reported as is.

Figure 1.

Framework to detect compromised items and examinees with preknowledge. Note. Same items = same set of items flagged within a phase.

The framework has a flexible “plug-and-play” approach, where different statistical methods can be plugged into each of the steps depending on the needs of the testing program. In what follows, we describe two similar, yet distinct, collections of methods that can be used with the suggested framework. Our newly proposed approach is presented next, followed by the closely related existing approach of O’Leary and Smith (2017). Both approaches involve the assumption that the item parameters are known. Future studies might try using different collections of methods and compare them to the approaches that are presented here.
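The iteration within Phase 2 can be sketched as a loop with pluggable steps. The Python code below is only an illustration under stated assumptions: the callables `flag_examinees` and `flag_items` are hypothetical placeholders for whichever Step 2a and Step 2b methods are plugged in, convergence is taken to mean that the flagged item set is unchanged from the previous iteration, and the `max_iter` guard is our own addition rather than part of the framework.

```python
def run_phase2(start_items, n_items, flag_examinees, flag_items,
               min_size=4, max_iter=100):
    """Alternate examinee flagging (Step 2a) and item flagging (Step 2b).

    Returns (flagged_items, flagged_examinees) as frozensets.
    """
    items = frozenset(start_items)
    people = frozenset()
    for _ in range(max_iter):
        if len(items) < min_size:
            # Too few flagged items: no solution is returned.
            return frozenset(), frozenset()
        if len(items) > n_items - min_size:
            # Too many flagged items: report the results as is.
            return items, people
        people = frozenset(flag_examinees(items))   # Step 2a
        new_items = frozenset(flag_items(people))   # Step 2b
        if new_items == items:
            # Same items flagged as in the previous iteration: converged.
            return items, people
        items = new_items
    # Iteration cap reached (assumed safeguard, not part of the framework).
    if len(items) < min_size:
        return frozenset(), frozenset()
    return items, people
```

Phase 1 could be written the same way, with a single item-flagging step that is repeated after re-estimating the person parameters on the non-flagged items.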

3.1. Proposed Approach (ISLAND)

Our proposed approach uses the following statistics: Item-fit statistics, Signed Likelihood ratio test statistics, AND Differential item functioning (DIF) and differential distractor functioning statistics. Therefore, we refer to our approach as ISLAND from this point forward.

3.1.1. Step 1: Initial item flagging

To identify an initial set of items that are suspected of being compromised, we apply two item-fit statistics: a residual statistic for the dichotomous item scores and a $χ^{2}$ statistic for the polytomous item distractors. Both statistics depend on estimates of the person parameters. Unfortunately, such estimates may be biased if the data are heavily contaminated by preknowledge or by any other type of aberrance. Therefore, to minimize this bias, we re-estimate the person parameters at each iteration using the non-flagged items only. The new person parameter estimates are then used to update the item flagging results, and the cycle continues until the convergence criterion is met (see Figure 1).

Let ${\hat{θ}}_{S}$ and ${\hat{η}}_{S}$ denote the person parameter estimates that are computed using the non-flagged items only. For item i, the item-fit statistics are computed by summing across all examinees within the sample $ℰ$ :

Z_{i} = \frac{\sum_{ℰ} [X_{i} - P_{i} ({\hat{θ}}_{S})]}{\sqrt{\sum_{ℰ} P_{i} ({\hat{θ}}_{S}) [1 - P_{i} ({\hat{θ}}_{S})]}},

and

Q_{i} = \sum_{j = 1}^{m_{i} - 1} \frac{{[\sum_{ℰ} (1_{j} (D_{i}) - P_{i j} ({\hat{η}}_{S}))]}^{2}}{\sum_{ℰ} P_{i j} ({\hat{η}}_{S})},

where $P_{i} ({\hat{θ}}_{S})$ is obtained by inserting ${\hat{θ}}_{S}$ into Equation 1, $P_{i j} ({\hat{η}}_{S})$ is obtained by inserting ${\hat{η}}_{S}$ into Equation 3, and $1_{j} (D_{i}) = 1$ if distractor j was selected, or 0 if any other distractor was selected. Under the null hypothesis that the model fits the item scores and distractors, Z approximately follows a $N (0, 1)$ distribution (Hambleton et al., 1991), Q approximately follows a $χ_{m - 2}^{2}$ distribution (Pearson, 1900), and Z and Q are conditionally independent given response correctness. These distributions are approximate (rather than exact) because estimates of the person parameters are used, and not the true values.
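The two item-fit statistics can be sketched in Python as follows. This is an illustration rather than the original implementation: the function names are hypothetical, the model probabilities are assumed to have been computed beforehand from ${\hat{θ}}_{S}$ and ${\hat{η}}_{S}$, and we assume that only examinees who answered the item incorrectly are passed to the distractor statistic, since a distractor choice is defined only for those examinees.

```python
import math

def z_statistic(scores, p_correct):
    """Residual statistic Z_i from 0/1 item scores and P_i(theta_hat_S)."""
    numerator = sum(x - p for x, p in zip(scores, p_correct))
    variance = sum(p * (1.0 - p) for p in p_correct)
    return numerator / math.sqrt(variance)

def q_statistic(choices, p_distractors):
    """Chi-square-type statistic Q_i for the distractors of one item.

    choices[e] is the index (0-based) of the distractor chosen by examinee e;
    p_distractors[e][j] is P_ij(eta_hat_S) for examinee e and distractor j.
    """
    n_distractors = len(p_distractors[0])
    q = 0.0
    for j in range(n_distractors):
        observed = sum(1 for d in choices if d == j)
        expected = sum(p[j] for p in p_distractors)
        q += (observed - expected) ** 2 / expected
    return q

# Toy data: four examinees, three equally attractive distractors.
z = z_statistic([1, 1, 1, 1], [0.5, 0.5, 0.5, 0.5])
q = q_statistic([0, 0, 1, 2], [[1 / 3, 1 / 3, 1 / 3]] * 4)
```

In the toy example, all four examinees answer correctly although the model predicts a 50% success rate, so Z is large and positive, which is the pattern associated with preknowledge of a correct answer key.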

In this article, we use a significance level of .05 for the initial item flagging. After splitting the significance level for a two-sided hypothesis test, items with $1 - Φ (Z) \leq .025$ are flagged as potentially compromised with a correct answer key and/or no answer key, and items with $Φ (Z) \leq .158$ and $1 - F (Q; m - 2) \leq .158$ are flagged as potentially compromised with an incorrect answer key, where $Φ (\cdot)$ is the cumulative distribution function (c.d.f.) of the $N (0, 1)$ distribution and $F (\cdot)$ is the c.d.f. of the $χ_{m - 2}^{2}$ distribution. These cutoffs ensure that the Type I error rate of the suggested item flagging procedure is equal to

\underbrace{.025}_{\text{correct/no answer key}} + \underbrace{(.158 \times .158)}_{\text{incorrect answer key}} \approx .025 + .025 = .05.

For convenience, these flagging rules are also provided in the left-hand column of Table 1.
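The Step 1 flagging rule can be illustrated with a short Python sketch. To keep the example free of external libraries, it assumes four response categories ($m = 4$), so that Q is referred to a $χ^{2}$ distribution with 2 degrees of freedom, whose upper-tail probability has the closed form $\exp (- Q / 2)$. The function names are hypothetical.

```python
import math

def norm_cdf(z):
    """c.d.f. of the N(0, 1) distribution."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def chi2_sf_df2(q):
    """Upper-tail probability 1 - F(q; 2); valid only for m = 4 options."""
    return math.exp(-q / 2.0)

def flag_item_step1(z, q):
    """Return 'C', 'I', or 'S' following the Step 1 rules for ISLAND."""
    if 1.0 - norm_cdf(z) <= 0.025:
        return "C"  # correct answer key and/or no answer key
    if norm_cdf(z) <= 0.158 and chi2_sf_df2(q) <= 0.158:
        return "I"  # incorrect answer key
    return "S"      # presumed secure

# Nominal Type I error rate implied by the cutoffs.
type1_rate = 0.025 + 0.158 * 0.158
```

For other numbers of response categories, the $χ^{2}$ tail probability would need to come from a statistics library.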

Table 1.

Flagging Rules

Step	Proposed Approach (ISLAND)	O’Leary and Smith (2017) Approach
1	$\left\{ \begin{array}{ll} i \in C & \text{if } 1 - Φ (Z) \leq .025, \\ i \in ℐ & \text{if } Φ (Z) \leq .158 \text{ and } 1 - F (Q; m - 2) \leq .158, \\ i \in S & \text{otherwise.} \end{array} \right.$	$\left\{ \begin{array}{ll} i \in C & \text{if } 1 - Φ (Z) \leq .025, \\ i \in S & \text{otherwise.} \end{array} \right.$
2a	$\left\{ \begin{array}{ll} e \in P & \text{if } C \neq \emptyset \text{ and } ℐ \neq \emptyset \text{ and } 1 - G (L) \leq α, \\ & \text{or if } C \neq \emptyset \text{ and } ℐ = \emptyset \text{ and } 1 - Φ (L_{C}) \leq α, \\ & \text{or if } C = \emptyset \text{ and } ℐ \neq \emptyset \text{ and } 1 - Φ (L_{ℐ}) \leq α, \\ e \in N & \text{otherwise.} \end{array} \right.$	$\left\{ \begin{array}{ll} e \in P & \text{if } 1 - H (T; ν) \leq .05 \text{ and } \hat{Δ} \geq 1.5, \\ e \in N & \text{otherwise.} \end{array} \right.$
2b	$\left\{ \begin{array}{ll} i \in C & \text{if } p \leq .05 \text{ and } \hat{Δ} \geq 1.5, \\ i \in ℐ & \text{if } p \leq .224 \text{ and } \hat{Δ} \leq - 1.5 \text{ and } p_{j} \leq \frac{.224}{m - 1} \text{ and } {\hat{Δ}}_{j} \leq - 1.5, \\ i \in S & \text{otherwise.} \end{array} \right.$	$\left\{ \begin{array}{ll} i \in C & \text{if } p \leq .05 \text{ and } \hat{Δ} \geq 1.5, \\ i \in S & \text{otherwise.} \end{array} \right.$

Note. $Φ (\cdot)$ = c.d.f. of the $N (0, 1)$ distribution; $F (\cdot)$ = c.d.f. of the $χ_{m - 2}^{2}$ distribution; $G (\cdot)$ = c.d.f. of the ${\bar{χ}}^{2}$ distribution; $H (\cdot)$ = c.d.f. of the $t_{ν}$ distribution.

3.1.2. Step 2a: Examinee flagging

To identify the set of examinees who are suspected of having preknowledge, we apply two signed likelihood ratio test statistics. The purpose of applying these statistics is to compare examinee performance across three subsets of items: compromised with a correct answer key and/or no answer key ( $C$ ), compromised with an incorrect answer key ( $ℐ$ ), and secure ( $S$ ). Let ${\hat{θ}}_{C}$ , ${\hat{θ}}_{ℐ}$ , and ${\hat{θ}}_{S}$ denote the ability estimates on the individual item sets, and let ${\hat{θ}}_{C S}$ and ${\hat{θ}}_{ℐ S}$ denote the ability estimates on the combined item sets $\{C, S\}$ and $\{ℐ, S\}$ , respectively. For each examinee, we test the hypothesis $H_{0} : θ_{C} = θ_{S} = θ_{ℐ}$ against $H_{1} : θ_{C} > θ_{S}$ or $θ_{S} > θ_{ℐ}$ using the statistics

L_{C} = \operatorname{sgn} ({\hat{θ}}_{C} - {\hat{θ}}_{S}) \sqrt{2 [ℓ ({\hat{θ}}_{C}, {\hat{θ}}_{S}) - ℓ ({\hat{θ}}_{C S})]},

and

L_{ℐ} = \operatorname{sgn} ({\hat{θ}}_{S} - {\hat{θ}}_{ℐ}) \sqrt{2 [ℓ ({\hat{θ}}_{ℐ}, {\hat{θ}}_{S}) - ℓ ({\hat{θ}}_{ℐ S})]},

where, for example, $ℓ ({\hat{θ}}_{C}, {\hat{θ}}_{S})$ denotes the log-likelihood of the item scores at the separate ability estimates ${\hat{θ}}_{C}$ and ${\hat{θ}}_{S}$ , and $ℓ ({\hat{θ}}_{C S})$ denotes the log-likelihood of the item scores at the combined ability estimate ${\hat{θ}}_{C S}$ . Under the null hypothesis of no preknowledge, $L_{C}$ and $L_{ℐ}$ have asymptotic $N (0, 1)$ distributions, where large positive values of the statistics indicate a greater likelihood of preknowledge (Sinharay, 2017).
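To make the signed likelihood ratio statistic concrete, the following Python sketch computes $L_{C}$ for one examinee under the 2PLM. It is an illustration under stated assumptions: a simple grid search over $[- 3, 3]$ stands in for maximum likelihood estimation (the bounds match those used for the person parameters), and the function names and toy item parameters are hypothetical.

```python
import math

def p_correct(theta, alpha, beta):
    """2PLM probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(alpha * theta + beta)))

def loglik(theta, scores, items):
    """Log-likelihood of 0/1 scores; items is a list of (alpha, beta)."""
    total = 0.0
    for x, (alpha, beta) in zip(scores, items):
        p = p_correct(theta, alpha, beta)
        total += math.log(p) if x == 1 else math.log(1.0 - p)
    return total

def mle(scores, items, lo=-3.0, hi=3.0, steps=601):
    """Grid-search stand-in for bounded maximum likelihood estimation."""
    grid = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    return max(grid, key=lambda t: loglik(t, scores, items))

def signed_lr(scores_a, items_a, scores_b, items_b):
    """Signed LR statistic comparing item set A with item set B.

    Passing (C, S) gives L_C; passing (S, I) gives L_I.
    """
    th_a = mle(scores_a, items_a)
    th_b = mle(scores_b, items_b)
    th_ab = mle(scores_a + scores_b, items_a + items_b)
    ll_separate = loglik(th_a, scores_a, items_a) + loglik(th_b, scores_b, items_b)
    ll_combined = loglik(th_ab, scores_a + scores_b, items_a + items_b)
    magnitude = math.sqrt(max(0.0, 2.0 * (ll_separate - ll_combined)))
    if th_a == th_b:
        return 0.0
    return math.copysign(magnitude, th_a - th_b)

# Toy examinee: perfect on 4 suspected items, 2 of 8 correct on secure items.
items_c = [(1.0, 0.0)] * 4
items_s = [(1.0, 0.0)] * 8
l_c = signed_lr([1, 1, 1, 1], items_c, [1, 0, 0, 0, 1, 0, 0, 0], items_s)
```

The toy examinee performs much better on the suspected items than on the secure items, so $L_{C}$ is positive; swapping the two item sets flips the sign.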

If some items are suspected of being compromised with a correct answer key and/or no answer key ( $C \neq \emptyset$ ) but no items are suspected of being compromised with an incorrect answer key ( $ℐ = \emptyset$ ), only $L_{C}$ needs to be computed. Conversely, if $C = \emptyset$ and $ℐ \neq \emptyset$ , only $L_{ℐ}$ needs to be computed. Examinees with $1 - Φ (L_{C}) \leq α$ or $1 - Φ (L_{ℐ}) \leq α$ are flagged as EWP, where $Φ (\cdot)$ is the c.d.f. of the $N (0, 1)$ distribution and $α$ is the significance level.

If $C \neq \emptyset$ and $ℐ \neq \emptyset$ , both $L_{C}$ and $L_{ℐ}$ must be computed. They can then be combined to form a single test statistic:

L = L_{C +}^{2} + L_{ℐ +}^{2},

where $L_{C +} = \max \{L_{C}, 0\}$ and $L_{ℐ +} = \max \{L_{ℐ}, 0\}$ . Under the null hypothesis of no preknowledge, L has an asymptotic ${\bar{χ}}^{2}$ distribution (Sinharay & Johnson, 2020), which is a mixture of the following $χ^{2}$ distributions:

{\bar{χ}}^{2} = \frac{1}{4} χ_{2}^{2} + \frac{1}{2} χ_{1}^{2} + \frac{1}{4} χ_{0}^{2} .

Large positive values of L indicate a greater likelihood of preknowledge, and examinees with $1 - G (L) \leq α$ are flagged as EWP, where $G (\cdot)$ is the c.d.f. of the ${\bar{χ}}^{2}$ distribution given by Equation 9.
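The tail probability of the ${\bar{χ}}^{2}$ mixture has a simple closed form, which the following Python sketch uses to flag an examinee. The function names are hypothetical. The $χ_{0}^{2}$ component is a point mass at zero, so it contributes nothing to the upper tail for positive values of L.

```python
import math

def chi_bar_sq_sf(x):
    """Upper-tail probability 1 - G(x) of the chi-bar-square mixture."""
    if x <= 0.0:
        return 1.0
    sf_df2 = math.exp(-x / 2.0)              # chi-square, 2 df
    sf_df1 = math.erfc(math.sqrt(x / 2.0))   # chi-square, 1 df
    return 0.25 * sf_df2 + 0.5 * sf_df1      # the 0-df term adds 0 for x > 0

def combined_statistic(l_c, l_i):
    """L = max(L_C, 0)^2 + max(L_I, 0)^2."""
    return max(l_c, 0.0) ** 2 + max(l_i, 0.0) ** 2

def flag_examinee(l_c, l_i, alpha=0.025):
    """True if the examinee is flagged as having preknowledge."""
    return chi_bar_sq_sf(combined_statistic(l_c, l_i)) <= alpha
```

For example, `flag_examinee(2.5, 1.0)` returns `True` at $α = .025$, whereas an examinee with two negative statistics has $L = 0$ and is never flagged.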

3.1.3. Step 2b: Item flagging

To identify the final set of items that are suspected of being compromised, differential item functioning (DIF) and differential distractor functioning (DDF) analyses are conducted using the Mantel–Haenszel procedure (Holland & Thayer, 1988; Mantel & Haenszel, 1959) and the Educational Testing Service (ETS) C rule (Paek & Holland, 2015)—both of which will be described in detail shortly. DIF/DDF statistics are used rather than the item-fit statistics of Step 1 because only the former are able to compare item performance across subgroups of examinees (EWP vs. non-EWP) and therefore test against the alternative hypothesis of interest. The Mantel–Haenszel procedure is selected because it can work with smaller sample sizes than alternative DIF/DDF procedures, such as the likelihood ratio test. Small sample sizes may be encountered if very few examinees are suspected of having preknowledge. Meanwhile, the ETS C rule is the most conservative of the three ETS flagging rules. Thus, items flagged under the ETS C rule are those for which we are most confident that differences exist across examinee subgroups.

The Mantel–Haenszel procedure is conducted after separating examinees into G strata, where strata are formed by conditioning on some matching variable. In this study, we set $G = 2$ and use the total score on the non-flagged items and the studied item as the matching variable. Because the set of non-flagged items is updated with each iteration, the matching variable is “purified” or “refined” and is therefore expected to produce more accurate results over time (e.g., Zwick et al., 2013). The cutoff that separates the two strata is recomputed at each iteration for each item using the 50th percentile of the focal group. A similar technique has been employed by previous researchers to ensure that there are enough examinees in each stratum to complete the analysis (Socha et al., 2015).

Let $P$ denote the set of EWP (i.e., the focal group), $N$ denote the set of non-EWP (i.e., the reference group), and $ℰ$ denote the set of all examinees. Furthermore, let $N_{g 1}$ denote the number of examinees in stratum g in $N$ who answered the studied item correctly, and let $N_{g 0}$ denote the number of examinees in stratum g in $N$ who answered the studied item incorrectly; $P_{g 1}$ and $P_{g 0}$ are defined analogously for examinees in $P$ . The DIF analysis (Holland & Thayer, 1988; Mantel & Haenszel, 1959) begins by computing the conditional odds ratio across all strata:

\hat{α} = \frac{\sum_{g = 1}^{G} N_{g 1} P_{g 0} / ℰ_{g}}{\sum_{g = 1}^{G} N_{g 0} P_{g 1} / ℰ_{g}},

where $ℰ_{g} = N_{g 1} + N_{g 0} + P_{g 1} + P_{g 0}$ . The corresponding effect size is

\hat{Δ} = - 2.35 \ln (\hat{α}),

where large positive values of $\hat{Δ}$ indicate that examinees in $P$ are more likely to answer the item correctly than examinees in $N$ , and large negative values of $\hat{Δ}$ indicate the reverse. To test $H_{0} : | Δ | \leq 1$ against $H_{1} : | Δ | > 1$ (Paek & Holland, 2015), we compute the following p-value:

p = Φ (\frac{- 1 - | \hat{Δ} |}{s}) + Φ (\frac{1 - | \hat{Δ} |}{s}),

where $Φ (\cdot)$ is the c.d.f. of the $N (0, 1)$ distribution. The standard error is given by

s = 2.35 \sqrt{\frac{\sum_{g = 1}^{G} ℰ_{g}^{- 2} (N_{g 1} P_{g 0} + \hat{α} N_{g 0} P_{g 1}) (N_{g 1} + P_{g 0} + \hat{α} N_{g 0} + \hat{α} P_{g 1})}{2 {(\sum_{g = 1}^{G} \frac{N_{g 1} P_{g 0}}{ℰ_{g}})}^{2}}} .

Under the ETS C rule, items with $p \leq .05$ and $\hat{Δ} \geq 1.5$ are flagged as compromised with a correct answer key and/or no answer key, and items with $p \leq .224$ and $\hat{Δ} \leq - 1.5$ are further investigated for DDF. If the items under investigation also display DDF, then they are flagged as compromised with an incorrect answer key.
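The DIF computations can be sketched in Python as follows. The function name and the stratum counts in the usage example are hypothetical, and the sketch assumes that no stratum produces a zero numerator or denominator.

```python
import math

def norm_cdf(z):
    """c.d.f. of the N(0, 1) distribution."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def mh_dif(strata):
    """Mantel-Haenszel DIF statistics.

    strata is a list of (N1, N0, P1, P0) counts per stratum: reference-group
    correct/incorrect, then focal-group (EWP) correct/incorrect.
    Returns (alpha_hat, delta_hat, s, p).
    """
    num = sum(n1 * p0 / (n1 + n0 + p1 + p0) for n1, n0, p1, p0 in strata)
    den = sum(n0 * p1 / (n1 + n0 + p1 + p0) for n1, n0, p1, p0 in strata)
    alpha = num / den
    delta = -2.35 * math.log(alpha)
    var_top = sum(
        (n1 * p0 + alpha * n0 * p1) * (n1 + p0 + alpha * n0 + alpha * p1)
        / (n1 + n0 + p1 + p0) ** 2
        for n1, n0, p1, p0 in strata
    )
    s = 2.35 * math.sqrt(var_top / (2.0 * num ** 2))
    # p-value for H0: |Delta| <= 1 against H1: |Delta| > 1
    p = norm_cdf((-1.0 - abs(delta)) / s) + norm_cdf((1.0 - abs(delta)) / s)
    return alpha, delta, s, p

# Two strata (G = 2) with made-up counts in which EWP succeed far more often.
alpha_hat, delta_hat, s, p = mh_dif([(300, 200, 45, 5), (400, 100, 48, 2)])
flag_c = p <= 0.05 and delta_hat >= 1.5
```

With these counts, $\hat{α} = 1 / 6$ and $\hat{Δ} \approx 4.21$, so the item would be flagged as compromised with a correct answer key and/or no answer key.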

For the DDF analysis, let $N_{g j}$ denote the number of examinees in stratum g in $N$ who selected distractor j, and let $N_{g \setminus j}$ denote the number of examinees in stratum g in $N$ who selected any other distractor; $P_{g j}$ and $P_{g \setminus j}$ are defined analogously for examinees in $P$ . The DDF analysis (Terzi & Suh, 2015) begins by computing the conditional odds ratio across all strata:

{\hat{α}}_{j} = \frac{\sum_{g = 1}^{G} N_{g \setminus j} P_{g j} / ℰ_{g}}{\sum_{g = 1}^{G} N_{g j} P_{g \setminus j} / ℰ_{g}},

where $ℰ_{g} = N_{g \setminus j} + N_{g j} + P_{g \setminus j} + P_{g j}$ . The corresponding effect size is

{\hat{Δ}}_{j} = - 2.35 \ln ({\hat{α}}_{j}),

where large positive values of ${\hat{Δ}}_{j}$ indicate that examinees in $N$ are more attracted to distractor j than examinees in $P$ , and large negative values of ${\hat{Δ}}_{j}$ indicate the reverse. To test $H_{0} : | Δ_{j} | \leq 1$ against $H_{1} : | Δ_{j} | > 1$ (Paek & Holland, 2015), we compute the following p-value:

p_{j} = Φ (\frac{- 1 - | {\hat{Δ}}_{j} |}{s_{j}}) + Φ (\frac{1 - | {\hat{Δ}}_{j} |}{s_{j}}),

where the standard error is given by

s_{j} = 2.35 \sqrt{\frac{\sum_{g = 1}^{G} ℰ_{g}^{- 2} (N_{g \setminus j} P_{g j} + {\hat{α}}_{j} N_{g j} P_{g \setminus j}) (N_{g \setminus j} + P_{g j} + {\hat{α}}_{j} N_{g j} + {\hat{α}}_{j} P_{g \setminus j})}{2 {(\sum_{g = 1}^{G} \frac{N_{g \setminus j} P_{g j}}{ℰ_{g}})}^{2}}} .

Using a strategy similar to the ETS C rule (i.e., where both statistical significance and effect size are considered), items with $p \leq .224$ and $\hat{Δ} \leq - 1.5$ (from the DIF analysis) that also have a distractor with $p_{j} \leq \frac{.224}{m - 1}$ and ${\hat{Δ}}_{j} \leq - 1.5$ (from the DDF analysis) are flagged as compromised with an incorrect answer key. Because the DDF analyses are run simultaneously for each of the $m - 1$ distractors, the Bonferroni correction is applied when testing for DDF (Terzi & Suh, 2015). Further note that $.224 \times .224 \approx .05$ is the desired Type I error rate.
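The complete Step 2b decision rule for ISLAND can be summarized in a few lines of Python. The sketch below assumes that the DIF results ($p$, $\hat{Δ}$) and the DDF results ($p_{j}$, ${\hat{Δ}}_{j}$ for each distractor) have already been computed; the function name and the numeric inputs in the example are hypothetical.

```python
def flag_item_step2b(p, delta, ddf_results, m):
    """Return 'C', 'I', or 'S' for one item.

    p, delta    : DIF p-value and effect size for the item
    ddf_results : list of (p_j, delta_j) pairs, one per distractor
    m           : number of response categories
    """
    if p <= 0.05 and delta >= 1.5:
        return "C"  # correct answer key and/or no answer key
    if p <= 0.224 and delta <= -1.5:
        cutoff = 0.224 / (m - 1)  # Bonferroni correction over m - 1 distractors
        if any(pj <= cutoff and dj <= -1.5 for pj, dj in ddf_results):
            return "I"  # incorrect answer key
    return "S"          # presumed secure

# Flagged EWP do worse on the item and are drawn to the first distractor.
label = flag_item_step2b(0.10, -2.0, [(0.05, -2.5), (0.50, 0.3), (0.90, 0.1)], m=4)
```

In the example, the DIF result alone would not satisfy the ETS C rule at the .05 level, but the item is flagged as compromised with an incorrect answer key because one distractor also shows DDF in the expected direction.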

3.2. O’Leary and Smith (2017) Approach

O’Leary and Smith (2017) developed a method to simultaneously detect CI and EWP that is similar to ISLAND. However, the following differences between the two methods make it difficult to obtain a fair comparison: (a) the approach of O’Leary and Smith is not iterative, (b) it requires having some idea of which items are secure and which items are compromised before starting the analysis, and (c) it analyzes only item scores (and not item distractors) and is therefore designed to only detect preknowledge of a correct answer key and/or no answer key. In what follows, we modify the first two features of O’Leary and Smith to develop an approach that is compatible with the framework shown in Figure 1. The modified approach is also more similar to ISLAND, allowing for a fairer comparison between the two methods.

3.2.1. Step 1: Initial item flagging

The original approach of O’Leary and Smith (2017) does not include a step for initial item flagging. Therefore, we use the Z statistic of Equation 4 for this purpose. After splitting the significance level of .05 for a two-sided hypothesis test, items with $1 - Φ (Z) \leq .025$ are flagged as potentially compromised with a correct answer key and/or no answer key, and all other items are presumed secure. This is in agreement with the original study of O’Leary and Smith, in which a two-sided test for item flagging was used, but only the results for items displaying patterns associated with preknowledge of a correct answer key and/or no answer key were reported. For convenience, this flagging rule is also provided in the right-hand column of Table 1.

3.2.2. Step 2a: Examinee flagging

To identify the set of examinees who are suspected of having preknowledge, O’Leary and Smith (2017) suggested using differential person functioning (DPF) as implemented in the computer program Winsteps (Linacre, 2022). The DPF test statistic is given by

T = \frac{{\hat{θ}}_{C} - {\hat{θ}}_{S}}{\sqrt{\mathrm{SE} ({\hat{θ}}_{C})^{2} + \mathrm{SE} ({\hat{θ}}_{S})^{2}}},

with degrees of freedom

ν = \frac{{[\mathrm{SE} ({\hat{θ}}_{C})^{2} + \mathrm{SE} ({\hat{θ}}_{S})^{2}]}^{2}}{\frac{\mathrm{SE} ({\hat{θ}}_{C})^{4}}{n_{C} - 1} + \frac{\mathrm{SE} ({\hat{θ}}_{S})^{4}}{n_{S} - 1}},

where $n_{C}$ and $n_{S}$ are the numbers of items in $C$ and $S$ , respectively. Under the null hypothesis of no preknowledge, T approximately follows a $t_{ν}$ distribution (Linacre, 2022). The difference between the ability estimates ( ${\hat{θ}}_{C} - {\hat{θ}}_{S} = \hat{Δ}$ ) is referred to as the DPF contrast. Using a strategy similar to the ETS C rule, examinees with $1 - H (T; ν) \leq .05$ and $\hat{Δ} \geq 1.5$ are flagged as EWP, where $H (\cdot)$ is the c.d.f. of the $t_{ν}$ distribution.
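As a small worked example, the DPF statistic and its degrees of freedom can be computed as follows. The function name and input values are hypothetical. The sketch stops short of the p-value, because the c.d.f. of the $t_{ν}$ distribution is not available in the Python standard library and would have to come from a statistics package.

```python
import math

def dpf_statistic(theta_c, se_c, n_c, theta_s, se_s, n_s):
    """Return (T, nu, contrast) for one examinee.

    theta_c, se_c, n_c : ability estimate, standard error, and item count on C
    theta_s, se_s, n_s : the same quantities on S
    """
    contrast = theta_c - theta_s
    pooled = se_c ** 2 + se_s ** 2
    t_stat = contrast / math.sqrt(pooled)
    nu = pooled ** 2 / (se_c ** 4 / (n_c - 1) + se_s ** 4 / (n_s - 1))
    return t_stat, nu, contrast

# Hypothetical examinee on a 40-item test with 8 suspected items.
t_stat, nu, contrast = dpf_statistic(2.0, 0.8, 8, 0.2, 0.4, 32)
```

Here the DPF contrast is 1.8, which satisfies the effect-size condition $\hat{Δ} \geq 1.5$ ; whether the examinee is flagged then depends on the upper-tail probability of T under the $t_{ν}$ distribution.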

3.2.3. Step 2b: Item flagging

To identify the final set of items that are suspected of being compromised, O’Leary and Smith (2017) suggested using DIF as implemented in Winsteps (Linacre, 2022). However, this method for assessing DIF is only valid if the Rasch model is used for calibration. Therefore, to generalize their approach and make it more comparable to ISLAND, we use the Mantel–Haenszel method that is given by Equations 10–13. Under the ETS C rule, items with $p \leq .05$ and $\hat{Δ} \geq 1.5$ are flagged as compromised with a correct answer key and/or no answer key, and all other items are presumed secure.

4. Simulation Studies

We conducted three simulation studies to investigate the performance of the proposed approach (ISLAND). To mimic our real data example, we simulated 2,000 examinees taking a test, where each item had one correct answer and three distractors for a total of four response categories. In all simulations, the item parameters were assumed known.

The false positive rate and true positive rate were computed for both items and examinees. The false positive rate is the proportion of secure items (or non-EWP) that were incorrectly flagged as CI (or EWP), while the true positive rate is the proportion of CI (or EWP) that were correctly flagged as such. It is important to note that because the method is iterative, the false positive rate and true positive rate do not reflect Type I error rate and power in the statistical sense—that is, based on hypothesis testing. However, although the false positive rates for examinees and items cannot strictly be “controlled” at particular significance levels, we do expect them to be close to the significance levels that are used in Steps 2a and 2b, respectively. Therefore, false positive rates that are closer to the intended significance levels and true positive rates that are larger should indicate better performance.

4.1. Study 1: Comparing Methods

4.1.1. Design

The purpose of Study 1 was to compare the performance of ISLAND to the modified approach of O’Leary and Smith (2017). We simulated a 40-item test, where 5% of the examinees had preknowledge of 20% of the items, and the disclosed key was 100% accurate. The extent to which these factors (test length, % EWP, % CI, key accuracy) affect the results is investigated in Studies 2 and 3. For this study, we specifically considered preknowledge of a correct answer key, because the approach of O’Leary and Smith was not designed to detect preknowledge of an incorrect answer key.

Conditions were created by manipulating three factors. The first factor was the detection method that was used (ISLAND or the approach of O’Leary and Smith, 2017). The second factor was the proportion of the disclosed key that was known before starting the analysis (none, quarter, half, or whole). For example, given that 8 ( $40 \times 0.2$ ) items had been compromised in total, the “half” condition represents the case where the practitioner knows the disclosed key for 4 ( $8 \times 0.5$ ) of the items before starting the analysis. This type of situation could occur, for example, if some items along with an answer key had been found on a braindump site. The third factor was the examinee significance level $α$ . When evaluating ISLAND, two $α$ levels were considered: .025 and .05. When evaluating the O’Leary and Smith approach, only $α = .05$ was considered, since this level most closely corresponds with the ETS C rule that was used for the examinee flagging. For each condition, 100 replications were conducted.

4.1.2. Data generation

Item responses were generated using the NLM. The person parameters for examinee e were sampled such that

[\begin{matrix} θ_{e} \\ η_{e} \end{matrix}] \sim N ([\begin{matrix} 0 \\ 0 \end{matrix}], [\begin{matrix} 1 & 0.8 \\ 0.8 & 1 \end{matrix}]),

and the item parameters were sampled such that $α_{i} \sim \text{Lognormal} (0, 0.25^{2})$ , $β_{i} \sim N (1, 1)$ , $λ_{i j} \sim N (0, 1)$ , and $ζ_{i j} \sim N (0, 1)$ . The test was simulated to be relatively easy, similar to the certification exam in the real data example.

After simulating the uncontaminated data, the contaminated data were obtained as follows. First, the CI and EWP were randomly selected so that the CI were similar in difficulty to the secure items, and the EWP were similar in ability to the non-EWP. Then, the responses of the EWP to the CI were changed to match the keyed response (i.e., the correct answer) with a probability of 0.9. Otherwise, the original response was retained.
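The data generation for Study 1 can be sketched in Python as follows. This is a simplified illustration, not the original simulation code: the function name and seed are hypothetical, and the CI and EWP are drawn uniformly at random rather than being matched on difficulty and ability as described above. Responses are coded 0 for the correct answer and 1 to 3 for the distractors.

```python
import math
import random

def simulate(n_examinees=2000, n_items=40, pct_ewp=0.05, pct_ci=0.20, seed=1):
    """Generate NLM responses and contaminate them with preknowledge."""
    rng = random.Random(seed)
    items = []
    for _ in range(n_items):
        alpha = rng.lognormvariate(0.0, 0.25)             # discrimination
        beta = rng.gauss(1.0, 1.0)                        # easiness
        lambdas = [rng.gauss(0.0, 1.0) for _ in range(3)]
        zetas = [rng.gauss(0.0, 1.0) for _ in range(3)]
        items.append((alpha, beta, lambdas, zetas))
    # Simplification: uniform random selection of CI and EWP (no matching).
    ci = set(rng.sample(range(n_items), round(pct_ci * n_items)))
    ewp = set(rng.sample(range(n_examinees), round(pct_ewp * n_examinees)))
    data = []
    for e in range(n_examinees):
        theta = rng.gauss(0.0, 1.0)
        eta = 0.8 * theta + 0.6 * rng.gauss(0.0, 1.0)     # corr(theta, eta) = 0.8
        row = []
        for i, (alpha, beta, lambdas, zetas) in enumerate(items):
            pc = 1.0 / (1.0 + math.exp(-(alpha * theta + beta)))
            if rng.random() < pc:
                resp = 0
            else:
                weights = [math.exp(l * eta + z) for l, z in zip(lambdas, zetas)]
                resp = rng.choices([1, 2, 3], weights=weights)[0]
            # EWP switch to the (correct) keyed response with probability 0.9.
            if e in ewp and i in ci and rng.random() < 0.9:
                resp = 0
            row.append(resp)
        data.append(row)
    return data, ci, ewp

data, ci, ewp = simulate()
```

With the default settings, 100 examinees have preknowledge of 8 items, matching the 5% EWP and 20% CI conditions of Study 1.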

4.1.3. Results

Flagging results for the items and examinees can be viewed in Tables 2 and 3, respectively. In both tables, rows correspond to the different detection methods and examinee significance levels, and columns correspond to the proportion of the disclosed key that was known before starting the analysis. Non-shaded cells indicate that Phase 1 was skipped, because the disclosed key was known for at least 4 items before starting the analysis (see Figure 1). Shaded cells indicate that Phase 1 was executed, because the disclosed key was known for fewer than 4 items before starting the analysis.

Table 2.

Item Flagging Rates for Study 1

Method	Examinee Sig. Level	None FPR	None TPR	Quarter FPR	Quarter TPR	Half FPR	Half TPR	Whole FPR	Whole TPR
ISLAND	$α = .025$	.035	.290	.038	.608	.019	.741	.001	1.000
ISLAND	$α = .05$	.017	.262	.030	.547	.000	.610	.001	1.000
O’Leary and Smith (2017)	$α = .05$	.025	.212	.043	.584	.041	.711	.007	1.000

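Note. Shaded cells indicate that Phase 1 was executed; with 8 compromised items, these are the None and Quarter columns, where the disclosed key was known for fewer than 4 items. FPR = false positive rate; TPR = true positive rate.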

Table 3.

Examinee Flagging Rates for Study 1

Method	Examinee Sig. Level	None FPR	None TPR	Quarter FPR	Quarter TPR	Half FPR	Half TPR	Whole FPR	Whole TPR
ISLAND	$α = .025$	.013	.145	.021	.268	.025	.382	.027	.498
ISLAND	$α = .05$	.028	.178	.047	.326	.053	.446	.057	.621
O’Leary and Smith (2017)	$α = .05$	.041	.176	.087	.423	.150	.649	.086	.732

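Note. Shaded cells indicate that Phase 1 was executed; with 8 compromised items, these are the None and Quarter columns, where the disclosed key was known for fewer than 4 items. FPR = false positive rate; TPR = true positive rate.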

For the item flagging results, Table 2 reveals that in all conditions, false positive rates are consistently below the item significance level of .05. This result is expected because both ISLAND and the O’Leary and Smith (2017) approach use the conservative ETS C rule for item flagging. When the same examinee significance level is used for both methods ( $α = .05$ ), the O’Leary and Smith approach generally produces larger true positive rates than ISLAND, presumably because O’Leary and Smith use one-sided tests to specifically detect preknowledge of a correct answer key, whereas ISLAND uses more general two-sided tests. However, when a smaller examinee significance level is used with ISLAND, true positive rates exceed those of O’Leary and Smith. We suspect this is because the smaller examinee significance level produced a sample of flagged examinees with a higher EWP-to-non-EWP ratio. The preknowledge signal appears stronger in such a sample, making it easier to distinguish CI from secure items.

For the examinee flagging results, Table 3 reveals that ISLAND typically produces false positive rates that are near or below the examinee significance level, whereas the O’Leary and Smith (2017) approach tends to produce false positive rates that are quite a bit larger. When the same examinee significance level is used for both methods ( $α = .05$ ), the O’Leary and Smith approach generally produces larger true positive rates than ISLAND. For both methods, true positive rates increase when more of the disclosed key is known before starting the analysis. Observe that true positive rates roughly double when just a quarter of the disclosed key (i.e., 2 items) is known before starting the analysis.

In summary, ISLAND appears to offer some advantages over the approach of O’Leary and Smith (2017). Although both methods consistently produce false positive rates for the items that are below the item significance level, only ISLAND is able to produce false positive rates for the examinees that are near or below the examinee significance level. ISLAND also maintains these reasonable false positive rates for different examinee significance levels. Test security analyses are often conducted using extremely small examinee significance levels, so this may be viewed as an important benefit. In general, smaller false positive rates and larger true positive rates are found when a larger proportion of the disclosed key is known before starting the analysis and a smaller examinee significance level is used. Based on these results, all subsequent analyses are conducted using ISLAND at the smaller examinee significance level of $α = .025$ .

4.2. Study 2: Preknowledge Characteristics

4.2.1. Design

The purpose of Study 2 was to examine the performance of ISLAND in the presence of different types of simulated preknowledge. We again simulated a 40-item test, and conditions were created by manipulating four factors:

Key accuracy: 50% or 100%

% CI: 10, 20, or 40

% EWP: 5 or 20

Proportion of disclosed key known: none, quarter, half, or whole

The four factors were fully crossed, resulting in a total of 48 ($2 \times 3 \times 2 \times 4$) conditions. In addition, two null conditions were added in which no preknowledge was simulated (i.e., % CI and % EWP = 0). Under the first null condition, entirely null data were simulated such that no aberrance was present. Under the second null condition, a different type of aberrance (test speededness) was simulated for some examinees. For each condition, 100 replications were conducted, and an examinee significance level of $α = .025$ was used.
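As a check on the condition count, the fully crossed design can be enumerated directly. The following Python sketch is illustrative only; the variable names, dictionary layout, and null-condition labels are assumptions and are not taken from the original simulation code.

```python
from itertools import product

# Factor levels as listed in the Study 2 design.
key_accuracy = [0.5, 1.0]
pct_ci = [10, 20, 40]
pct_ewp = [5, 20]
key_known = ["none", "quarter", "half", "whole"]

# Fully crossed preknowledge conditions.
conditions = [
    {"key_accuracy": a, "pct_ci": c, "pct_ewp": e, "key_known": k}
    for a, c, e, k in product(key_accuracy, pct_ci, pct_ewp, key_known)
]

# Two null conditions with no simulated preknowledge (labels are hypothetical).
null_conditions = [
    {"label": "null_no_aberrance", "pct_ci": 0, "pct_ewp": 0},
    {"label": "null_test_speededness", "pct_ci": 0, "pct_ewp": 0},
]

print(len(conditions), len(null_conditions))
```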

4.2.2. Data generation

The data generation procedure was identical to that of Study 1 except for two differences. First, under the null condition with test speededness, 5% of the examinees were simulated to be speeded on the last 20% of the items. Speededness was simulated as random guessing such that each response had an equal probability of being selected. The second difference occurred when the key was simulated to be 50% accurate. This level of accuracy was chosen based on our real data example. Under this condition, exactly half of the CI were disclosed with a correct answer key, while the other half were disclosed with an incorrect answer key. As in Study 1, when the disclosed key was correct, the responses of the EWP were changed to match the keyed response with a probability of 0.9. However, when the disclosed key was incorrect, the keyed response was randomly chosen among the distractors such that each had an equal probability of being selected, and the EWP responses were changed to match the keyed response with a probability equal to $0.5 + \frac{0.4}{1 + \exp(β)}$. Thus, EWP responding to extremely difficult items were expected to select the (incorrect) keyed response with a probability of 0.9, whereas EWP responding to extremely easy items were expected to select the (incorrect) keyed response with a probability of 0.5. This probability was conditioned on item easiness rather than examinee ability because previous research has shown that when the answer key is disclosed, items tend to represent the dominant source of variability (Gorney & Wollack, 2022b).
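To make the behavior of the matching probability concrete, the expression for an incorrect disclosed key can be evaluated at a few easiness values. The Python sketch below is a minimal illustration; the function name is hypothetical, and the argument `beta` is assumed to be the item easiness parameter referred to in the text. For an item of average easiness ($β = 0$), the expression reduces to $0.5 + 0.4/2 = 0.7$.

```python
import math

# Probability of matching a correct disclosed key, as stated for Study 1.
P_MATCH_CORRECT_KEY = 0.9


def p_match_incorrect_key(beta: float) -> float:
    """Probability that an EWP response is changed to an incorrect keyed
    response, 0.5 + 0.4 / (1 + exp(beta)), with beta the item easiness."""
    # Evaluate 1 / (1 + exp(beta)) in a way that avoids overflow for large beta.
    if beta >= 0:
        z = math.exp(-beta)
        logistic = z / (1.0 + z)
    else:
        logistic = 1.0 / (1.0 + math.exp(beta))
    return 0.5 + 0.4 * logistic


for beta in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(beta, round(p_match_incorrect_key(beta), 3))
```

The probability decreases monotonically in easiness, approaching 0.9 for very difficult items and 0.5 for very easy items.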

4.2.3. Results

Flagging results for the items and examinees can be viewed in Tables 4 and 5, respectively. For the items, it can be seen that across all conditions, false positive rates are near or below the item significance level of .05. Therefore, ISLAND successfully limits the number of secure items that are incorrectly flagged as compromised. For the examinees, false positive rates are also near or below the examinee significance level of .025. In some instances, false positive rates are much smaller than .025, presumably because examinees were flagged by a likelihood ratio test statistic that uses critical values from an asymptotic null distribution. Therefore, conservative results are expected when any of the item sets are small, including, for example, cases where very few items have been compromised (10% CI = 4 compromised items).
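For reference, the false positive rate and true positive rate for a single replication can be computed from the flags and the simulated truth as follows. This Python sketch is illustrative; the function name is hypothetical, and the way rates were aggregated across the 100 replications is not specified in this section. When no units are truly compromised (as in the null conditions), the true positive rate is undefined and is returned as `None`.

```python
def flagging_rates(flagged, truth):
    """Return (FPR, TPR) for one replication.

    flagged: booleans, True if the item (or examinee) was flagged.
    truth:   booleans, True if the item is a CI (or the examinee is an EWP).
    """
    if len(flagged) != len(truth):
        raise ValueError("flagged and truth must have the same length")
    false_pos = sum(bool(f) and not bool(t) for f, t in zip(flagged, truth))
    true_pos = sum(bool(f) and bool(t) for f, t in zip(flagged, truth))
    n_pos = sum(bool(t) for t in truth)
    n_neg = len(truth) - n_pos
    fpr = false_pos / n_neg if n_neg else None
    tpr = true_pos / n_pos if n_pos else None
    return fpr, tpr


# Toy example: 40 items, 4 compromised; 3 CI and 1 secure item are flagged.
truth = [True] * 4 + [False] * 36
flagged = [True, True, True, False, True] + [False] * 35
print(flagging_rates(flagged, truth))
```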

Table 4.

Item Flagging Rates for Study 2

			None		Quarter		Half		Whole
Condition	% CI	% EWP	FPR	TPR	FPR	TPR	FPR	TPR	FPR	TPR
Null (with no aberrance)	0	0	.049	—	—	—	—	—	—	—
Null (with test speededness)	0	0	.040	—	—	—	—	—	—	—
50% key accuracy	10	5	.034	.117	.035	.350	.043	.590	.032	1.000
		20	.026	.517	.035	.680	.039	.835	.035	1.000
	20	5	.021	.456	.028	.745	.027	.819	.002	1.000
		20	.022	.969	.022	.980	.014	.979	.000	1.000
	40	5	.002	.899	.016	.868	.000	.954	.000	1.000
		20	.002	.994	.012	.997	.000	.998	.000	1.000
100% key accuracy	10	5	.028	.115	.051	.432	.054	.675	.009	1.000
		20	.040	.938	.039	.963	.049	.973	.004	1.000
	20	5	.035	.290	.038	.608	.019	.741	.001	1.000
		20	.025	.998	.025	.998	.002	.981	.000	1.000
	40	5	.033	.220	.006	.550	.000	.799	.000	1.000
		20	.011	.998	.002	.931	.000	.997	.000	1.000

Note. Shaded cells indicate that Phase 1 was executed. FPR = false positive rate; TPR = true positive rate; CI = compromised items; EWP = examinees with preknowledge.

Table 5.

Examinee Flagging Rates for Study 2

			None		Quarter		Half		Whole
Condition	% CI	% EWP	FPR	TPR	FPR	TPR	FPR	TPR	FPR	TPR
Null (with no aberrance)	0	0	.005	—	—	—	—	—	—	—
Null (with test speededness)	0	0	.009	—	—	—	—	—	—	—
50% key accuracy	10	5	.005	.035	.008	.065	.009	.091	.017	.322
		20	.010	.161	.013	.182	.014	.214	.018	.317
	20	5	.013	.322	.018	.484	.020	.514	.020	.634
		20	.021	.589	.021	.592	.020	.595	.020	.631
	40	5	.023	.858	.023	.797	.023	.885	.023	.894
		20	.023	.886	.023	.877	.023	.890	.023	.890
100% key accuracy	10	5	.007	.027	.011	.065	.017	.105	.022	.268
		20	.024	.221	.024	.225	.024	.221	.021	.268
	20	5	.013	.145	.021	.268	.025	.382	.027	.498
		20	.028	.464	.028	.463	.027	.496	.027	.502
	40	5	.011	.148	.026	.405	.028	.592	.027	.665
		20	.028	.654	.026	.612	.027	.667	.027	.668

Note. Shaded cells indicate that Phase 1 was executed. FPR = false positive rate; TPR = true positive rate; CI = compromised items; EWP = examinees with preknowledge.

True positive rates for the items and examinees are generally larger when the key is 50% accurate than when it is 100% accurate. This result agrees with previous research on different types of preknowledge (e.g., Gorney & Wollack, 2023) and is not at all surprising when we consider that the test was simulated to be easy. Thus, it would be more unusual to see many incorrect responses than it would be to see many correct responses. True positive rates for the items and examinees also increase as % CI and % EWP increase. Previous research has shown similar results for preknowledge of a correct answer key (e.g., Sinharay, 2017; Wang & Liu, 2020), but it is interesting to see that this finding extends to situations where the disclosed key is partially incorrect (50% accurate). Some limited simulations confirm that this pattern also holds when the disclosed key is entirely incorrect (0% accurate).

In summary, this study shows that ISLAND produces small and reasonable false positive rates in the presence of different types of simulated preknowledge. True positive rates are noticeably larger when key accuracy is lower and when % CI and % EWP are higher. Relative detection patterns are similar regardless of whether the disclosed key is 50% accurate or 100% accurate.

4.3. Study 3: Test and Examinee Characteristics

4.3.1. Design

The purpose of Study 3 was to examine the performance of ISLAND for different tests and populations. We again considered the case where 5% of the examinees had preknowledge of 20% of the items, and the disclosed key was 100% accurate. Conditions were created by manipulating four factors:

Test easiness: easy or medium

Test length: 40 or 80 items

EWP ability: same as non-EWP or lower than non-EWP

Proportion of disclosed key known: none, quarter, half, or whole

The four factors were fully crossed, resulting in a total of 32 ($2 \times 2 \times 2 \times 4$) conditions. For each condition, 100 replications were conducted, and an examinee significance level of $α = .025$ was used.

4.3.2. Data generation

The data generation procedure was identical to that of Study 1 except for two differences. First, easy and medium tests were simulated such that $β_{i} \sim N(1, 1)$ and $β_{i} \sim N(0, 1)$, respectively. Second, when EWP ability was simulated to be lower than that of the non-EWP, the EWP were selected as follows. All examinees were rank ordered based on $θ$. EWP were then selected with probabilities proportional to their rank. Specifically, the examinee with the lowest $θ$ was most likely to be selected, and the examinee with the highest $θ$ was least likely to be selected. In this way, EWP were simulated to be of lower ability while leaving the ability distribution of the entire population intact.
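One way to implement a rank-proportional selection of this kind is sketched below in Python. The sketch rests on several assumptions that are not stated in the text: ranks run from 1 (highest $θ$) to $N$ (lowest $θ$), the selection weight equals the rank, and sampling without replacement is carried out with the weighted-key method of Efraimidis and Spirakis rather than whatever routine was used in the original simulations. The function name is hypothetical.

```python
import random


def select_ewp(theta, n_ewp, rng):
    """Select n_ewp distinct examinee indices, favoring low-ability examinees.

    Weights are assumed to equal the rank of theta, where rank 1 is the
    highest theta and rank N is the lowest theta.
    """
    n = len(theta)
    # Indices ordered from highest to lowest ability.
    order = sorted(range(n), key=lambda j: theta[j], reverse=True)
    weight = [0.0] * n
    for rank, j in enumerate(order, start=1):
        weight[j] = float(rank)
    # Weighted sampling without replacement: draw u ~ U(0, 1) per examinee,
    # form the key u ** (1 / weight), and keep the n_ewp largest keys.
    keys = [(rng.random() ** (1.0 / weight[j]), j) for j in range(n)]
    keys.sort(reverse=True)
    return sorted(j for _, j in keys[:n_ewp])


rng = random.Random(2023)
theta = [rng.gauss(0.0, 1.0) for _ in range(1000)]
ewp = select_ewp(theta, 50, rng)  # 5% of the examinees
print(sum(theta[j] for j in ewp) / len(ewp))  # mean ability of the selected EWP
```

Because only the membership of the EWP group is changed, the ability values themselves, and hence the population ability distribution, are left untouched.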

4.3.3. Results

Flagging rates for the items and examinees can be viewed in Tables 6 and 7, respectively. Overall, false positive rates are very similar to those of Studies 1 and 2. Specifically, false positive rates for the items are below the item significance level of .05, and false positive rates for the examinees are near or below the examinee significance level of .025. True positive rates for both items and examinees are larger for medium tests and when EWP ability is lower than that of the non-EWP. These results are not surprising given that the presence of many correct responses would seem especially unusual under such conditions. True positive rates are also larger for the 80-item tests than they are for the 40-item tests, presumably because longer tests produce more accurate ability parameter estimates, which in turn improves detection of EWP.

Table 6.

Item Flagging Rates for Study 3

			None		Quarter		Half		Whole
Test Easiness	Test Length	EWP Ability	FPR	TPR	FPR	TPR	FPR	TPR	FPR	TPR
Easy	40	Same	.035	.290	.038	.608	.019	.741	.001	1.000
		Lower	.026	.520	.030	.776	.003	.885	.000	1.000
	80	Same	.010	.580	.014	.769	.001	.887	.000	1.000
		Lower	.003	.907	.001	.927	.000	.957	.000	1.000
Medium	40	Same	.022	.814	.027	.943	.001	.959	.000	1.000
		Lower	.011	.969	.011	.989	.000	.994	.000	1.000
	80	Same	.003	.975	.000	.966	.000	.985	.000	1.000
		Lower	.000	.992	.000	.989	.000	.988	.000	1.000

Note. Shaded cells indicate that Phase 1 was executed. FPR = false positive rate; TPR = true positive rate; EWP = examinees with preknowledge.

Table 7.

Examinee Flagging Rates for Study 3

			None		Quarter		Half		Whole
Test Easiness	Test Length	EWP Ability	FPR	TPR	FPR	TPR	FPR	TPR	FPR	TPR
Easy	40	Same	.013	.145	.021	.268	.025	.382	.027	.498
		Lower	.019	.361	.025	.506	.027	.644	.027	.692
	80	Same	.026	.510	.028	.632	.029	.720	.027	.736
		Lower	.029	.861	.028	.875	.029	.884	.028	.888
Medium	40	Same	.025	.595	.028	.670	.027	.731	.027	.743
		Lower	.027	.842	.028	.858	.027	.880	.027	.881
	80	Same	.027	.898	.027	.893	.027	.902	.027	.903
		Lower	.026	.981	.026	.982	.026	.981	.026	.982

Note. Shaded cells indicate that Phase 1 was executed. FPR = false positive rate; TPR = true positive rate; EWP = examinees with preknowledge.

5. Real Data Example

The data in this example originated from a single form of an information technology certification exam and were also studied by Eckerly (2021) and Sinharay (2021). The sample comprises 1,992 examinees who took a 60-item test; however, only 39 of these items were administered in the multiple-choice format and are used in this analysis. All 39 items were published on a braindump site with answer keys. Specifically, 15 of these items (38%) were disclosed with a correct answer key, while the remaining 24 items (62%) were disclosed with an incorrect answer key.

The first 600 examinees who took the test were used for item calibration, and the resulting item parameter estimates were treated as known item parameters for all subsequent analyses. This subsample was selected for item calibration due to the sudden change in average test scores that occurred shortly after this point (Figure 2a), and because the testing program had reason to believe that many of the response patterns after this point had been contaminated by preknowledge. As in the simulation studies, the 2PLM was used to model item scores since it was found to provide a significantly better fit than the Rasch model. The nominal response model was used to model distractor selection. Furthermore, to make use of the full data set, missing responses were scored as incorrect and were treated as an additional response category. This treatment has been used in previous research on detecting aberrant behavior (e.g., Drasgow et al., 1985).

Figure 2.

Moving average (size = 400) of (a) average test scores and (b) proportion of flagged examinees over time.

We ran ISLAND under the assumption that no information was known regarding the compromise status of the items. As in the simulation studies, we used an examinee significance level of .025. Results showed that the procedure terminated because so many items (36 of 39) were flagged by the DIF and DDF statistics. This is the result we would expect to see if there was, in fact, widespread compromise.

To investigate this claim more thoroughly, we reviewed which items and examinees had been flagged. We found that the procedure had flagged 86.7% of the items that had been disclosed with a correct answer key, and 95.8% of the items that had been disclosed with an incorrect answer key. On the examinee side, the procedure flagged 2.5% of Examinees 1–600 and 27.7% of Examinees 601–1992 (Figure 2b). Given that the testing program had been fairly certain the first 600 examinees were clean and that we had used an examinee significance level of .025, these results strongly suggest that the responses of those who took the test earlier differed significantly from the responses of those who took the test later.
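The moving averages displayed in Figure 2 can be computed with a simple sliding window. The Python sketch below is illustrative; it assumes that examinees are ordered by administration time, that each window contains 400 consecutive examinees, and that one average is reported per full window. The exact window alignment used for the figure is not stated, and the function name is hypothetical.

```python
def moving_average(values, size):
    """Simple moving average over consecutive windows of length `size`.

    Returns one mean per full window, so the output has
    len(values) - size + 1 entries.
    """
    if size <= 0 or size > len(values):
        raise ValueError("size must be between 1 and len(values)")
    window_sum = sum(values[:size])
    averages = [window_sum / size]
    for i in range(size, len(values)):
        # Slide the window by one examinee: add the newest, drop the oldest.
        window_sum += values[i] - values[i - size]
        averages.append(window_sum / size)
    return averages


# Toy example with 0/1 flags in administration order and a window of 2.
print(moving_average([0, 0, 1, 1], 2))  # [0.0, 0.5, 1.0]
```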

Finally, in order to study these differences more closely, we examined the results at both the examinee and item levels in the form of the proportion correct scores. Figure 3(a) reveals that nearly all flagged examinees performed better on the items that were known to be disclosed with a correct answer key than on the items that were known to be disclosed with an incorrect answer key. Similarly, Figure 3(b) shows that performance improved over time on the items that had been flagged as compromised with a correct answer key, whereas performance declined over time on the items that had been flagged as compromised with an incorrect answer key. Taken together, these findings suggest that the signal ISLAND had detected was indeed preknowledge of the published answer key.

Figure 3.

(a) Proportion correct scores on items disclosed with a correct answer key versus items disclosed with an incorrect answer key and (b) item p values for examinees 1–600 versus examinees 601–1992.

6. Discussion

Although many methods have been developed to simultaneously detect CI and EWP, the majority of these methods are limited in that they only utilize item scores and/or item response times. By doing so, such methods overlook an additional source of information that is freely available in all multiple-choice data: distractor selection. Among other benefits, item distractors offer the unique advantage of being able to detect preknowledge of an incorrect answer key. There is a surprising lack of research in this area, despite the fact that many examples of this type of preknowledge have been found in operational settings.

In this article, we developed a method that uses item scores and distractors to simultaneously detect CI and EWP. Unlike many existing methods, ISLAND is sensitive to preknowledge of no answer key, preknowledge of a correct answer key, and preknowledge of an incorrect answer key. Results showed that in contrast to an existing approach (O’Leary & Smith, 2017), ISLAND was able to produce small and reasonable false positive rates for both items and examinees across all conditions. Subsequent analyses revealed that true positive rates for the items and examinees were largest when key accuracy was lower, % CI and % EWP were higher, the test was longer and more difficult, EWP ability was lower than that of the non-EWP, and a greater proportion of the disclosed key was known before starting the analysis.

Following our simulations, we conducted an analysis of preknowledge behavior on a real data set with known compromise. We ran ISLAND under the assumption that no information was known regarding the compromise status of the items and found results that closely aligned with the information provided by the testing program. We believe that these results, combined with those of the simulation studies, suggest that ISLAND is an effective method for simultaneously detecting CI and EWP.

There are several limitations to this article, providing many opportunities for future research. First, due to the iterative nature of the method, false positive rates for the items and examinees could not strictly be “controlled” at particular significance levels. However, in our simulations, we found the false positive rates to be consistently near or below the significance levels that were used in Steps 2a and 2b, indicating that ISLAND effectively limited the number of non-EWP who were incorrectly flagged as EWP as well as the number of secure items that were incorrectly flagged as compromised. Second, we only compared ISLAND to the existing approach of O’Leary and Smith (2017). It would be useful to compare ISLAND to other score-based methods, such as the information theory and combinatorial optimization approach of Belov (2017), or response-based methods, such as the exact-matching near-matching approach of Haberman and Lee (2017). Third, it would be helpful to analyze additional real data sets, including those for which the disclosed key is 100% accurate. Fourth, in its current form, ISLAND is only designed to detect a single group of EWP who have preknowledge of the same set of CI. It would be useful to extend the approach to detect multiple groups of EWP who have preknowledge of different sets of CI. A simple way to do this would be to run ISLAND once, remove the examinees who are flagged, and then run ISLAND again to see whether any additional examinees and items are flagged. A more complex extension of the method could also be developed.
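The simple sequential extension described above can be sketched as a short loop. In the Python sketch below, `detect` is a hypothetical stand-in for a single run of ISLAND (no such function is defined in this article); it is assumed to take the remaining examinee identifiers and return the sets of flagged examinees and flagged items. The `max_rounds` argument is likewise an assumption; setting it to 2 corresponds to the two runs described in the text.

```python
from typing import Callable, List, Sequence, Set, Tuple

# Hypothetical interface for one run of the detection procedure.
Detector = Callable[[Sequence[int]], Tuple[Set[int], Set[int]]]


def detect_groups(
    examinee_ids: Sequence[int], detect: Detector, max_rounds: int = 2
) -> List[Tuple[Set[int], Set[int]]]:
    """Run the detector repeatedly, removing flagged examinees after each run."""
    remaining = list(examinee_ids)
    groups = []
    for _ in range(max_rounds):
        flagged_examinees, flagged_items = detect(remaining)
        if not flagged_examinees:
            break  # nothing new was flagged, so stop early
        groups.append((set(flagged_examinees), set(flagged_items)))
        remaining = [e for e in remaining if e not in flagged_examinees]
    return groups


def toy_detect(remaining):
    """Toy detector that reveals one planted group per run (illustration only)."""
    if 1 in remaining:
        return {1, 2}, {10}
    if 3 in remaining:
        return {3, 4}, {20}
    return set(), set()


print(detect_groups(range(1, 7), toy_detect))
```

Whether false positive rates remain near the nominal levels under repeated runs is an open question that this sketch does not address.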

Finally, it is important to remember that ISLAND is but one collection of methods that can be plugged into the much larger framework of Figure 1. The framework is flexible in that it allows different statistical methods to be plugged into each of the steps; therefore, it would be interesting and useful for future studies to try using different collections of methods. For example, a valuable extension would be to include methods that incorporate response times. Previous research has found that EWP tend to respond more quickly to items for which they have preknowledge (e.g., Gorney & Wollack, 2022a), and presumably such information could be leveraged to further improve detection results (see, e.g., Sinharay & Johnson, 2020).

Footnotes

Authors’ Note

This work was completed while the first author was an Educational Testing Service Harold Gulliksen Psychometric Research Fellow.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research and/or authorship of this article.

ORCID iDs

Kylie Gorney

Sandip Sinharay

References

Belov (2017). Identification of item preknowledge by the methods of information theory and combinatorial optimization. In G. J. Cizek & J. A. Wollack (Eds.), Handbook of quantitative methods for detecting cheating on tests (pp. 164–176). Routledge.

Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37(1), 29–51. https://doi.org/10.1007/BF02291411

Bolt, D. M., Wollack, J. A., & Suh (2012). Application of a multidimensional nested logit model to multiple-choice test items. Psychometrika, 77(2), 339–357. https://doi.org/10.1007/s11336-012-9257-5

Boughton, K. A., Smith, & Ren (2017). Using response time data to detect compromised items and/or people. In G. J. Cizek & J. A. Wollack (Eds.), Handbook of quantitative methods for detecting cheating on tests (pp. 177–190). Routledge.

Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1–29. https://www.doi.org/10.18637/jss.v048.i06

Chen & Moustaki (2022). Detection of two-way outliers in multivariate data and application to cheating detection in educational tests. Annals of Applied Statistics, 16(3), 1718–1746. https://doi.org/10.1214/21-AOAS1564

Drasgow, Levine, M. V., & Williams, E. A. (1985). Appropriateness measurement with polychotomous item response models and standardized indices. British Journal of Mathematical and Statistical Psychology, 38(1), 67–86. https://doi.org/10.1111/j.2044-8317.1985.tb00817.x

Eckerly (2021). Answer similarity analysis at the group level. Applied Psychological Measurement, 45(5), 299–314. https://doi.org/10.1177/01466216211013109

Gorney, K., & Wollack, J. A. (2022a). Generating models for item preknowledge. Journal of Educational Measurement, 59(1), 22–42. https://doi.org/10.1111/jedm.12309

Gorney, K., & Wollack, J. A. (2022b). Two new models for item preknowledge. Applied Psychological Measurement, 46(6), 447–461. https://doi.org/10.1177/01466216221108130

Gorney, K., & Wollack, J. A. (2023). Using item scores and distractors in person-fit assessment. Journal of Educational Measurement, 60(1), 3–27. https://doi.org/10.1111/jedm.12345

Haberman, S. J., & Lee, Y.-H. (2017). A statistical procedure for testing unusually frequent exactly matching responses and nearly matching responses (Research Report No. RR-17-23). Educational Testing Service. https://doi.org/10.1002/ets2.12150

Hambleton, R. K., Swaminathan, & Rogers, H. J. (1991). Fundamentals of item response theory. SAGE Publications.

Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel–Haenszel procedure. In Wainer & H. I. Braun (Eds.), Test validity (pp. 129–145). Lawrence Erlbaum Associates.

Linacre, J. M. (2022). Winsteps (Version 5.2.4) [Computer software]. https://winsteps.com/

Liu & Becker (2022). The impact of cheating on score comparability via pool-based IRT pre-equating. Journal of Educational Measurement, 59(2), 208–230. https://doi.org/10.1111/jedm.12321

Mantel & Haenszel (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22(4), 719–748. https://doi.org/10.1093/jnci/22.4.719

O’Leary, L. S., & Smith, R. W. (2017). Detecting candidate preknowledge and compromised content using differential person and item functioning. In G. J. Cizek & J. A. Wollack (Eds.), Handbook of quantitative methods for detecting cheating on tests (pp. 151–163). Routledge.

Paek & Holland (2015). A note on statistical hypothesis testing based on log transformation of the Mantel–Haenszel common odds ratio for differential item functioning classification. Psychometrika, 80(2), 406–411. https://doi.org/10.1007/s11336-013-9394-5

Pearson (1900). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine: Series 5, 50(302), 157–175. https://doi.org/10.1080/14786440009463897

R Core Team. (2022). R: A language and environment for statistical computing (Version 4.2.0) [Computer software]. https://www.R-project.org/

Sinharay, S. (2017). Detection of item preknowledge using likelihood ratio test and score test. Journal of Educational and Behavioral Statistics, 42(1), 46–68. https://doi.org/10.3102/1076998616673872

Sinharay, S. (2021). Latent-variable approaches utilizing both item scores and response times to detect test fraud. Open Education Studies, 3(1), 1–16. https://doi.org/10.1515/edu-2020-0137

Sinharay, S., & Johnson, M. S. (2020). The use of item scores and response times to detect examinees who may have benefited from item preknowledge. British Journal of Mathematical and Statistical Psychology, 73(3), 397–419. https://doi.org/10.1111/bmsp.12187

Socha, DeMars, C. E., Zilberberg, & Phan (2015). Differential item functioning detection with the Mantel-Haenszel procedure: The effects of matching types and other factors. International Journal of Testing, 15(3), 193–215. https://doi.org/10.1080/15305058.2014.984066

Terzi & Suh (2015). An odds ratio approach for detecting DDF under the nested logit modeling framework. Journal of Educational Measurement, 52(4), 376–398. https://doi.org/10.1111/jedm.12091

Wang & Liu (2020). Detecting compromised items using information from secure items. Journal of Educational and Behavioral Statistics, 45(6), 667–689. https://doi.org/10.3102/1076998620912549

Zwick & Isham (2013). An investigation of the efficacy of criterion refinement procedures in Mantel-Haenszel DIF analysis (Research Report No. RR-13-16). Educational Testing Service. https://doi.org/10.1002/j.2333-8504.2013.tb02323.x