Hybrid Threshold-Based Sequential Procedures for Detecting Compromised Items in a Computerized Adaptive Testing Licensure Exam

Abstract

Using classical test theory and item response theory, this study applied sequential procedures to a real operational item pool in a variable-length computerized adaptive test (CAT) to detect items whose security may be compromised. Moreover, this study proposed a hybrid threshold approach to improve the detection power of the sequential procedure while controlling the Type I error rate. The hybrid threshold approach uses a local threshold for each item in an early stage of the CAT administration, and then it uses the global threshold in the decision-making stage. Applying various simulation factors, a series of simulation studies examined which factors contribute significantly to the power rate and lag time of the procedure. In addition to the simulation studies, a case study investigated whether the procedures are applicable to the real item pool administered in CAT and can identify potentially compromised items in the pool. This research found that the increment of probability of a correct answer (p-increment) was the simulation factor most important to the sequential procedures’ ability to detect compromised items. This study also found that the local threshold approach improved power rates and shortened lag times when the p-increment was small. The findings of this study could help practitioners implement the sequential procedures using the hybrid threshold approach in real-time CAT administration.

Keywords

test security, compromised item detection, change-point problem, computerized adaptive testing

Computerized adaptive testing (CAT) reuses test items over time. Therefore, test-takers can obtain prior access to test items, which allows them to achieve illegitimate gains in their test scores. This is especially true in high-stakes licensure exams administered through CAT. If the stem and answer choices of a compromised or leaked item have been shared with test-takers before they take an exam, they have a better chance of responding to the item correctly during the exam. Compromised items can threaten the validity of inferences from examination scores as well as the fairness of an exam. To maintain the validity and fairness of a high-stakes testing program, test items must be monitored continuously to ensure the soundness of their psychometric properties.

Various methods detect compromised items in CAT. One approach monitors drift in the item difficulty parameter. Veerkamp and Glas (2000) used a statistical quality control method to detect such a drift. This method reestimates item difficulty parameters during the CAT administration and compares them with the initial difficulty parameters. If an item becomes significantly easier according to the cumulative sum (CUSUM) chart, a sequential analysis used for monitoring change-points (Grigg et al., 2003; Page, 1954), the item may be compromised. Kang and Chang (2016) extended the procedure by using log-likelihood CUSUM statistics. These methods show a satisfactory detection rate, but they cannot be implemented during real-time CAT administration because of the repeated calibrations they require.

Another way to detect compromised items in CAT is to identify aberrant examinees. To detect these respondents, Belov (2014) used a 3D algorithm—merging information theory and combinatorial optimization. Qian et al. (2016) used a lognormal response time model (van der Linden, 2006) to detect aberrant examinees and compromised items in computer-based examinations and a case study based on CAT. Sinharay (2017) proposed using a likelihood ratio test and score test to detect aberrant examinees. O’Leary and Smith (2017) used differential person and item functioning to detect candidate preknowledge and compromised items. Wang et al.’s (2018) mixture hierarchical item response theory model used both response accuracy and response time information to detect aberrant behavior and compromised items. These methods show great potential to detect compromised items, but first they must identify the aberrant examinees. This requirement is very challenging because psychometric evidence may not be sufficient to prove the aberrant behavior of an examinee.

Zhang (2014) and Zhang and Li (2016) proposed sequential procedures based on classical test theory (CTT) and item response theory (IRT), respectively. The procedures use a series of statistical hypothesis tests to monitor whether the item response function (IRF) of an individual item has changed significantly across test administrations over time. These procedures could be applied to CAT in real time. Choe et al. (2018) incorporated response times, in addition to actual responses, into the IRT-based sequential procedure. Thus, the procedure can provide even greater statistical power for detecting compromised items as well as stronger substantive evidence that an item is compromised. Liu et al.’s (2019) model with a leakage parameter considers various scenarios in which items become compromised. The more generalized detection method that they developed achieves a high level of detection accuracy while maintaining Type I error rates at the nominal level.

The sequential procedures look promising based on their simulation studies (Zhang, 2014; Zhang & Li, 2016). Although the data simulations in the studies were based on real CAT data sets, the sequential procedures have been applied to few real CAT programs. In addition, the test forms in the studies were fixed-length (≤40 items) from a small item pool (about 400 items). Moreover, the studies used a global threshold, a single cutoff value for all the items in the pool with a high probability (.8 or .9) of being answered correctly. At these high probabilities, the global threshold could detect compromised items well (Zhang, 2014; Zhang & Li, 2016), but it was not applied to items at lower probabilities. In real CAT tests, this high probability may be uncommon, and the probability may increase gradually after items are compromised. Therefore, these procedures may not detect compromised items immediately, and they may have long lag times.

The current study applied the sequential procedures from Zhang (2014) and Zhang and Li (2016) to a real operational item pool in a variable-length CAT. The item pool had 1,472 items administered to 65,753 candidates taking a licensure exam within a 3-month period. The number of items administered to test-takers ranged between 75 and 265. This study used an approach that also works when the probability of test-takers answering a compromised item correctly is lower. To accomplish this goal, this study examined probabilities as they gradually increased to 1.0 in small increments (.1).

The current study proposed a hybrid threshold approach to improve the detection power rate of the sequential procedure while controlling the Type I error rate. The hybrid threshold approach uses a local threshold for each item at the beginning of the CAT administration. The local threshold is useful for detecting potentially compromised items whose probabilities of a correct answer have increased but are still relatively low overall. The local threshold, a separate threshold for each item, is set at the 100(1 − $α$)th percentile of the CTT or IRT statistics from the corresponding item. After the local threshold flags an item as potentially compromised, the hybrid approach uses the global threshold at subsequent monitoring points to indicate a large change in the item response function.

This study addressed the following research questions:

What factors affect the detection power rate and lag time of the sequential procedures based on CTT and IRT models?

Does the local threshold approach improve power rates and shorten lag times when the increase in the probability of a correct answer to a compromised item is small?

How does the hybrid approach detect possible compromised items in a real test administration based on CAT?

The article is organized as follows. The next section provides the theoretical framework of the sequential procedure based on the CTT and IRT statistics. Then it outlines a hybrid threshold approach. The “Simulation Studies” section describes three simulation studies, including a Type I error study, a power study using the global threshold approach, and a power study using the local threshold approach. The “Case Study” section applies the hybrid approach to the real CAT data to identify any potentially compromised items. Last, after summarizing the simulation and case studies, the article discusses the implications and limitations of the proposed approach.

Theoretical Framework

This research employed a theoretical framework proposed by Zhang (2014). Figure 1 illustrates a series of examinees to whom an item $i$ is administered in a CAT. Let $n$ index the examinee sequence for item $i$; that is, $n$ denotes the $n$th examinee to whom item $i$ is administered. Because items might not be compromised as soon as they are first administered, a starting point of monitoring ($n_{0}$), for example the 60th examinee, can be set. If the content of item $i$ is leaked after a given examinee takes the test, then the item becomes easier for the examinees who follow. This is a so-called change-point problem in sequential analysis (Zhang, 2014). Let the change-point of item $i$ be $n_{c}$, and let $n_{d}$ be the point at which the procedure flags the item. If the procedure fails to flag a compromised item by the end of the examinee sequence, this is called a Type II error. If the procedure flags the item before the true change-point ($n_{d} < n_{c}$), the detection is a false positive, or Type I error. If the procedure flags the item after the true change-point ($n_{d} > n_{c}$), the detection is a true positive and counts toward power. However, there is a lag between the true change-point and its identification: examinees with $n_{c} < n < n_{d}$ receive the item while it is compromised but not yet flagged. The lag time in the sequential procedure refers to the number of examinees who received the compromised item between the change-point ($n_{c}$) and the identification of the compromised item ($n_{d}$), such that the lag time equals $n_{d} - n_{c}$. The ideal sequential procedure would have high power rates and short lag times while controlling Type I error rates.
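The outcome definitions above can be illustrated with a minimal Python sketch. The function name `classify_detection` and its return labels are hypothetical and are not part of the original procedure; the sketch also assumes that a flag raised exactly at the change-point counts as a true positive with zero lag, a boundary case the text does not specify.

```python
from typing import Optional, Tuple


def classify_detection(n_d: Optional[int], n_c: int) -> Tuple[str, Optional[int]]:
    """Label one item's monitoring outcome.

    n_d: examinee index at which the item was flagged (None if never flagged).
    n_c: true change-point of the item.
    Returns (label, lag_time); lag_time is only defined for true positives.
    """
    if n_d is None:
        # A compromised item was never flagged by the end of the sequence.
        return "type_ii_error", None
    if n_d < n_c:
        # The flag was raised before the item was actually compromised.
        return "type_i_error", None
    # Assumption: n_d == n_c is treated as a true positive with zero lag.
    return "true_positive", n_d - n_c
```

For example, an item compromised at the 450th response and flagged at the 520th response is a true positive with a lag time of 70 examinees.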

Figure 1.

The sequential procedure in an examinee sequence for an item.

The sequential procedure contains three elements that users should define in advance. The first element, a moving sample size ($m$), is the number of most recent responses to an item. These responses are used to calculate the statistics for detecting compromised items. Zhang (2014) and Zhang and Li (2016) showed that, as $m$ increased, both the power rates and lag times increased with fixed-length tests. To obtain higher power rates but shorter lag times, the selection of an appropriate $m$ is critical. The second element is a starting point for items ($n_{0}$), where the sequential procedure starts monitoring items administered to examinees. When choosing the starting point, users assume that the items have not been leaked before the starting point. In addition, selection of the starting point should depend on obtaining a reliable estimate of the item response function for the statistics in the sequential procedure. The last element is a cutoff point or threshold. The threshold is used to determine whether items are compromised at a given level of Type I error control, such as .01 or .05. To find an appropriate threshold for items administered by a CAT system, a simulation study that mimics the CAT system for the real exam can be used. The rate of family-wise Type I errors is controlled at significance level ($α$) by selecting an appropriate cutoff point, called a global threshold in this article. The next two subsections describe two statistical methods used in the sequential procedure to obtain statistical characteristics of each item: classical test theory (CTT) and item response theory (IRT).

Sequential Procedure Based on Classical Test Theory

Based on CTT, the item response function can be defined as a $p$ value, which is the proportion of the examinees who answered an item correctly. The sequential procedure considers a series of moving samples of size $m$ to calculate a Z statistic. For example, when the procedure examines the $n$th test-taker, the Z statistic for that examinee is obtained by using the $p$ values for both the reference and the target samples. In the CTT approach, the reference sample has the first $n - m$ responses, and the target sample has the $m$ most recent responses, as shown in Figure 2.

Figure 2.

Reference and target samples and item response functions in the sequential procedure.

The estimated $p$ for the reference sample of item $i$ is defined as

$$
{\hat{p}}_{n - m}^{i} = \frac{1}{n - m} \sum_{j = 1}^{n - m} U_{j}^{i}, \tag{1}
$$

where $U_{j}^{i}$ is the scored response of examinee $j$ to item $i$, equal to 1 if examinee $j$ answers item $i$ correctly and 0 otherwise. The estimated $p$ for the target sample of item $i$ is

$$
{\hat{p}}_{nm}^{i} = \frac{1}{m} \sum_{j = n - m + 1}^{n} U_{j}^{i} . \tag{2}
$$

The ${\hat{p}}_{n - m}^{i}$ and ${\hat{p}}_{nm}^{i}$ are the proportions of the correct responses to the item for the first $n - m$ responses and the most recent $m$ responses, respectively. By using the $p$ values from the two samples for item $i$, the Z statistic at $n$ can be computed:

$$
{\hat{Z}}_{nm}^{i} = \frac{{\hat{p}}_{nm}^{i} - {\hat{p}}_{n - m}^{i}}{\sqrt{{\hat{p}}_{n - m}^{i} (1 - {\hat{p}}_{n - m}^{i})}} \sqrt{\frac{m (n - m)}{n}}, \tag{3}
$$

where the denominator of the Z statistic is the standard deviation of the reference sample. If the item has not been compromised at $n$, then the difference between the two $p$ values, and hence the Z statistic, should be small. However, if the item was compromised at any point between $n - m$ and $n$, then the target-sample $p$ value will exceed the reference-sample $p$ value, and the Z statistic will become relatively large. Therefore, when the value of the Z statistic is large, the item may have been compromised. Because the Z statistic is defined by the item response functions from two samples, as shown in Equation (3), appropriate $m$ and $n_{0}$ should be selected based on each CAT administration system.
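As an illustration of Equations (1) through (3), the following Python sketch computes the Z statistic at a single monitoring point. The function name `ctt_z_statistic` is hypothetical, and returning NaN when the reference sample has zero variance is an assumption made here for the sketch; the original does not describe how that case is handled.

```python
import math
from typing import Sequence


def ctt_z_statistic(responses: Sequence[int], n: int, m: int) -> float:
    """Z statistic of Equation (3) at the n-th response to one item.

    responses: scored responses (1 = correct, 0 = incorrect) in administration order.
    n: current monitoring point (1-based count of responses so far).
    m: moving sample size (number of most recent responses in the target sample).
    """
    if not (0 < m < n <= len(responses)):
        raise ValueError("require 0 < m < n <= len(responses)")
    reference = responses[: n - m]      # first n - m responses
    target = responses[n - m : n]       # m most recent responses
    p_ref = sum(reference) / (n - m)    # Equation (1)
    p_tgt = sum(target) / m             # Equation (2)
    sd_ref = math.sqrt(p_ref * (1.0 - p_ref))
    if sd_ref == 0.0:
        # Assumption: the statistic is undefined when the reference sample has no variance.
        return math.nan
    return (p_tgt - p_ref) / sd_ref * math.sqrt(m * (n - m) / n)
```

With a reference proportion of .5 over 60 responses and a target sample of 50 correct responses, the statistic equals $\sqrt{3000/110} \approx 5.22$; if the target proportion is also .5, the statistic is 0.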

Sequential Procedure Based on Item Response Theory

The item response function based on item response theory is the probability of answering item $i$ correctly ($p^{i}$), given examinee $j$’s ability ($B_{j}$). Using the Rasch model, the item response function can be defined by

$$
p^{i} (B_{j}) = \frac{\exp (B_{j} - D_{i})}{1 + \exp (B_{j} - D_{i})}, \tag{4}
$$

where $D_{i}$ is item $i$’s difficulty. The statistic based on the Rasch model is

$$
{\hat{Y}}_{nm}^{i} = \frac{X_{nm}^{i} - {\hat{SP}}_{nm}^{i}}{\sqrt{\sum_{j = n - m + 1}^{n} p^{i} (B_{j}) (1 - p^{i} (B_{j}))}}, \tag{5}
$$

where $X_{nm}^{i} = \sum_{j = n - m + 1}^{n} U_{j}^{i}$ is the number of correct responses among the $m$ most recent responses to item $i$, and ${\hat{SP}}_{nm}^{i} = \sum_{j = n - m + 1}^{n} p^{i} (B_{j})$ is the expected number of correct responses in the target sample for item $i$. As in the CTT method, the denominator of the Y statistic is a standard deviation, here the model-implied standard deviation of the number of correct responses in the target sample. However, unlike the CTT method, the IRT method uses only the target sample to calculate the Y statistic, as shown in Equation (5) and Figure 2. When the value of the Y statistic is large, item $i$ may be considered compromised.
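Equations (4) and (5) can be sketched in Python as follows. The function names `rasch_probability` and `irt_y_statistic` are hypothetical, and the sketch assumes that an ability estimate is available for every examinee in the target sample; how those estimates are obtained is outside the scope of the example.

```python
import math
from typing import Sequence


def rasch_probability(ability: float, difficulty: float) -> float:
    """Equation (4): probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def irt_y_statistic(
    responses: Sequence[int],
    abilities: Sequence[float],
    difficulty: float,
    n: int,
    m: int,
) -> float:
    """Equation (5): Y statistic from the m most recent responses at point n."""
    if len(responses) != len(abilities):
        raise ValueError("responses and abilities must have the same length")
    if not (0 < m <= n <= len(responses)):
        raise ValueError("require 0 < m <= n <= len(responses)")
    target_u = responses[n - m : n]
    probs = [rasch_probability(b, difficulty) for b in abilities[n - m : n]]
    observed = sum(target_u)                        # X: observed number correct
    expected = sum(probs)                           # SP: expected number correct
    variance = sum(p * (1.0 - p) for p in probs)    # model-implied variance
    return (observed - expected) / math.sqrt(variance)
```

If four examinees whose abilities equal the item difficulty all answer correctly, the observed count is 4, the expected count is 2, the variance is 1, and the Y statistic is 2.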

Hybrid Threshold Approach

As described in the previous section, a cutoff point or threshold for the sequential procedure must be selected to detect compromised items. If a Z statistic from the CTT method or a Y statistic from the IRT method in the sequence of responses to an item is larger than the predefined threshold, then the item is flagged as compromised. An appropriate threshold for items administered by a CAT system can be found through a simulation study that mimics the CAT system for the real exam. All the items in the simulation study are initially assumed to be uncompromised. The significance level ($α$) for the threshold can be selected based on a desired Type I error rate, such as .01 or .05.
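The flagging rule described above amounts to scanning the monitoring points in order and stopping at the first statistic that exceeds the threshold. The sketch below shows that loop; `first_flag` and its `statistic_at` callback are hypothetical names, and the callback stands in for either the Z or the Y statistic.

```python
from typing import Callable, Optional


def first_flag(
    statistic_at: Callable[[int], float],
    n_start: int,
    n_end: int,
    threshold: float,
) -> Optional[int]:
    """Return the first monitoring point whose statistic exceeds the threshold.

    statistic_at: function returning the Z or Y statistic at response n.
    n_start: monitor starting point (n_0).
    n_end: last response in the examinee sequence.
    Returns None if the item is never flagged.
    """
    for n in range(n_start, n_end + 1):
        if statistic_at(n) > threshold:
            return n
    return None
```

Keeping the statistic behind a callback lets the same loop serve both the CTT-based and the IRT-based procedures.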

Zhang (2014) and Zhang and Li (2016) used maximum statistics to select a global threshold, a single threshold for all the items in the pool. However, the global threshold could perform well only when the proportion of correct answers to total answers for the compromised items is greater than or equal to 0.8. In a real CAT, though, the proportion of correct responses may increase gradually after items are compromised. In this situation, the global threshold-based procedures cannot detect the compromised items immediately, so those procedures would have longer lag times.

Therefore, the current study proposed a hybrid threshold approach to improve the sequential procedure’s ability to detect when item compromise has begun. The hybrid threshold approach suggests a local threshold for each item in an early stage of the CAT administration, during which the item response function may not have dramatically changed. The local threshold is determined by using the 95th percentile ( $α$ = .05) or the 99th percentile ( $α$ = .01) of the statistics within each item. Therefore, most of the local threshold values can be smaller than the global threshold value. Using smaller values makes the local threshold approach more sensitive to item compromise than the global threshold approach can be. This increased sensitivity allows the sequential procedure to detect compromised items more effectively, even when the change in the item response functions is still small. The local threshold can expose potentially compromised items and monitor trends in item response functions. If an item’s trend shows a continuous increase in the item response functions, the item can be blocked immediately and examined for item compromise.

This increased detection power rate, however, may accompany an increased Type I error rate. To control the Type I error and indicate a large change in the item response function, the hybrid approach suggests using a global threshold in addition to a local threshold. This approach is especially useful for the decision-making stage. Not only does it help to control the Type I error rate, but it also allows the sequential procedure to flag an item for blocking in the CAT administration until evidence of item compromise is found. The next section describes a series of simulation studies that compared the two threshold methods in terms of power rates and lag times under different simulation conditions.
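One way to read the two-stage logic is sketched below in Python. This is an interpretation for illustration, not the authors' implementation: the function name `hybrid_monitor` is hypothetical, and the sketch assumes that the global threshold is consulted only once the local threshold has been exceeded, following the description of the decision-making stage.

```python
from typing import Iterable, Optional, Tuple


def hybrid_monitor(
    stats: Iterable[Tuple[int, float]],
    local_threshold: float,
    global_threshold: float,
) -> Tuple[Optional[int], Optional[int]]:
    """Scan (n, statistic) pairs for one item and return (n_local, n_global).

    n_local: first point exceeding the item's local threshold (early warning).
    n_global: first point, at or after n_local, exceeding the global threshold
              (decision stage). Either value is None if never reached.
    """
    n_local: Optional[int] = None
    for n, value in stats:
        if n_local is None and value > local_threshold:
            n_local = n  # item becomes a candidate; keep watching its trend
        # Assumption: the global check applies only after a local flag.
        if n_local is not None and value > global_threshold:
            return n_local, n
    return n_local, None
```

Under this reading, an item can sit in a "watched" state between the two flags, which is where a practitioner would inspect the trend in its statistics before deciding to block it.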

Simulation Studies

This research conducted three simulation studies (a Type I error study and two power studies) on data simulated from real variable-length CAT data. The real variable-length CAT program uses the Rasch model with a minimum length of 60 items and a maximum length of 250 items. The wide range of test lengths ensures the classification accuracy of candidates whose ability is near the passing standard, because this is a high-stakes licensure exam. The average test length is around 125 items. The starting point is one logit below the passing standard. A Bayesian ability estimation is used until a candidate gets at least one item correct and one item wrong, at which point the theta estimation method changes to maximum likelihood estimation.

The Type I error study attempted to determine both a global threshold for all the items and a local threshold for each item in the pool. These thresholds were used to flag potentially compromised items in the power studies and the case study. If the Z or Y statistic of a monitoring point for an item was larger than the threshold, the item was flagged as compromised.

Two power studies were performed. Power Study 1 used the global threshold to investigate which factors influence power rates and lag times in the sequential procedures. Power Study 2 used the local thresholds and compared their performance with that of the global threshold, especially under small increments, such as .1 or .2, in the probability of test-takers answering items correctly.

Methods

Both CTT- and IRT-based sequential procedures were employed for the simulation studies.

Data

Data were generated based on the real CAT data described in the “Case Study” section. Specifically, candidate abilities were simulated from a normal distribution with a mean of .30 and a standard deviation of .44. Also, candidate responses were generated by using the same 1,472 items and variable-length CAT algorithm as the items and algorithm used in the real exam. Then, simulation studies selected 190 items that had between 1,000 and 2,200 responses to examine the performance of the procedures under the different simulation conditions, described in the “Simulation Factors” section. Table 1 presents the number of responses and the descriptive statistics of the item difficulty levels for all the items in the item pool and for the 190 selected items. The power studies randomly selected 20 items among the 190 items as compromised items in each replication. Responses for the 20 items after the change-point were manipulated according to the increment of probability of correct answers (p-increment).

Table 1.

Descriptive Statistics for Difficulty Levels of Selected Items.

Item	Difficulty level	N	M	SD	Min.	Max.	Min. responses	Max. responses
Item pool	Total	1,472	−0.158	1.003	−2.200	2.186	0	18,548
	Very easy	332	−1.464	0.375	−2.200	−0.900	0	1,699
	Easy	333	−0.570	0.168	−0.899	−0.302	0	6,837
	Moderate	398	−0.029	0.173	−0.299	0.299	0	18,548
	Difficult	158	0.551	0.183	0.300	0.898	0	16,979
	Very difficult	251	1.467	0.379	0.904	2.186	165	6,824
Selected items	Total	190	0.385	0.597	−0.360	1.195	1,056	2,177
	Moderate	96	−0.168	0.254	−0.360	0.648	1,058	2,177
	Difficult	94	0.949	0.144	0.650	1.195	1,056	1,788

Note. The real variable-length computerized adaptive testing program uses the Rasch model where the item discrimination parameter is .588 (scaled with 1.7) and the item guessing parameter is 0.

Simulation Factors

The simulation factors for this study are summarized as follows:

• Type I error rate ( $α$ ): .05, .01

• monitor starting point ( $n_{0}$ ): 60th, 300th, 500th response

• moving sample size ( $m$ ): 50, 150, 250

• item difficulty level ( $b$ ): moderate ($-0.36 \leq b \leq 0.64$) and difficult ($0.65 \leq b \leq 1.20$)

• response change-point ( $n_{c}$ ): 60th, 450th, 1,000th response

• increment of probability of correct answer (p-increment): .1, .2, ..., increasing until the target probability reaches 1.0

The Type I error study used the first three simulation factors: Type I error rates, monitor starting points, and moving sample sizes. The two Type I error rates, .05 and .01, were selected because they are commonly used nominal significance levels in most studies regardless of research area. The three monitor starting points were chosen based on the actual exam administration. For example, about 50% of items in the item pool were administered to about 60, 300, and 500 examinees in 1 day, 1 week, and 2 weeks, respectively. Considering the monitor starting points ($n_{0} > m$) and the statistical power, the three different moving sample sizes were selected as small (50), medium (150), and large (250). Zhang (2014) and Zhang and Li (2016) found that a larger moving sample size led to higher power rates and longer lag times for a fixed-length CAT. Using the three moving sample sizes, this study investigated whether that is still true for a variable-length CAT.

The power studies used all six simulation factors. The fourth factor was item difficulty level ($b$): moderate or difficult. The difficulty levels for the 190 selected items ranged between −0.36 and 1.20. According to the criteria used in the real exam practice, each item was classified as moderate or difficult. A similar number of items fell into each of the two levels.

The fifth factor was the response change-point or the item compromise point ( $n_{c}$ ). The 60th, 450th, and 1,000th responses served as change-points because they corresponded to the early, middle, and late stages of the actual item administration practice. Also, the selection of three change-points resulted from the selected monitoring starting points and moving sample sizes ( $n_{c} \geq n_{0} > m$ ). For example, using $n_{c}$ = 60th response, only $n_{0}$ of 60 and $m$ of 50 would satisfy the condition ( $n_{c} \geq n_{0} > m$ ), so only those values could serve as one of the combinations of factors for the simulation studies.
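The constraint $n_{c} \geq n_{0} > m$ determines which combinations of the three timing factors enter the power studies. A short Python sketch that enumerates them is given below; the function name `valid_conditions` is hypothetical.

```python
from itertools import product
from typing import List, Sequence, Tuple


def valid_conditions(
    change_points: Sequence[int],
    start_points: Sequence[int],
    window_sizes: Sequence[int],
) -> List[Tuple[int, int, int]]:
    """List every (n_c, n_0, m) combination that satisfies n_c >= n_0 > m."""
    return [
        (n_c, n_0, m)
        for n_c, n_0, m in product(change_points, start_points, window_sizes)
        if n_c >= n_0 > m
    ]


# Factor levels taken from the list of simulation factors.
conditions = valid_conditions([60, 450, 1000], [60, 300, 500], [50, 150, 250])
```

With the factor levels used in this study, the filter leaves 12 combinations: one for $n_{c}$ = 60, four for $n_{c}$ = 450, and seven for $n_{c}$ = 1,000, which matches the 12 combinations of $n_{c}$, $n_{0}$, and $m$ reported for each method in Tables 3 and 4.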

The last factor was the increment of probability of correct responses (p-increment) after items were compromised. Studies conducted by Zhang (2014) and Zhang and Li (2016) used a fixed probability of respondents answering correctly after items are compromised, such as .8 or .9. Instead, the current study increased the probability by .1 in order to examine the power rates and lag times under different p-increments. This study assumed that, when the CAT is administered with a large item pool, items could be compromised gradually, and compromised items would influence the item response pattern gradually. The p-increment started from .1 and increased until the target proportion equaled 1.0, which means that 100% of respondents answer the compromised item correctly. The p-increment was added to a reference probability of the correct answer. The reference probability was the proportion of correct to total responses before the compromise point. Using the increased probability, responses for the compromised items were generated. For the IRT-based sequential approach, the examinee’s interim ability was calculated by using maximum likelihood estimation as well as the Bayesian approach estimation for extreme response patterns, such as all incorrect answers or all correct answers. For all combined conditions, 20 compromised items were used, which represented about 10% of the 190 items selected for this study. The number of replications for each combination was 100, and the results were averaged over the 100 replications.
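The response manipulation can be sketched in Python as follows. This is a simplified illustration under stated assumptions rather than the study's actual generator: the function name `apply_p_increment` is hypothetical, the first $n_{c}$ responses are assumed to form the uncompromised reference, and every later response is redrawn as an independent Bernoulli trial with the increased probability.

```python
import random
from typing import List, Sequence


def apply_p_increment(
    responses: Sequence[int],
    n_c: int,
    p_increment: float,
    rng: random.Random,
) -> List[int]:
    """Regenerate the responses after the change-point with a raised success probability.

    responses: original scored responses (1/0) in administration order.
    n_c: change-point; the first n_c responses are left untouched (assumption).
    p_increment: amount added to the reference probability of a correct answer.
    """
    if not (0 < n_c <= len(responses)):
        raise ValueError("require 0 < n_c <= len(responses)")
    before = list(responses[:n_c])
    p_reference = sum(before) / len(before)          # proportion correct before compromise
    p_target = min(1.0, p_reference + p_increment)   # capped at 1.0
    after = [1 if rng.random() < p_target else 0 for _ in responses[n_c:]]
    return before + after
```

Passing a seeded `random.Random` instance keeps each replication reproducible.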

When summarizing the results from the power study, this research applied analysis of variance (ANOVA) and stepwise regression to reduce the number of factors. First, it fit a saturated variance model. The response variable was the power rate or lag time, and the covariates were the six factors and the sequential procedure method, as well as their first-order interactions. A full model for power rate or lag time did not converge, but the model with main effects and first-order interactions still explained most of the variance in the power rate (>93%) and lag time (>83%). Next, the study simplified the model by dropping terms that were nonsignificant at the 5% significance level, using the Akaike information criterion (AIC) to guide the selection. Finally, it removed simulation factors whose main effects and interaction effects explained less than 1% of the total variance ($R^{2}$) of the power rate or lag time. The “Results” section describes power rates and lag times by using the simulation factors that remained in the final variance model. The ANOVA and stepwise regression were carried out in R (R Core Team, 2020).

Results

Global and Local Thresholds

Table 2 shows the global thresholds and ranges of the local thresholds from the CTT- and IRT-based methods for two Type I error rates ( $α$ ), .05 and .01, over three different monitor starting points ( $n_{0}$ ) and moving sample sizes ( $m$ ). When $n_{0}$ = 60 and $m$ = 50, the CTT method resulted in higher global thresholds than did the IRT method regardless of the $α$ levels. For example, when $n_{0}$ = 60 and $m$ = 50, the global thresholds in CTT and IRT at $α$ = .05 were 3.9 and 3.5, respectively. Under the same $n_{0}$ and $m$ , the global thresholds in CTT and IRT at $α$ = .01 were 5.9 and 4.3, respectively. As $m$ increased with $n_{0}$ = 300 or 500, the global thresholds in CTT became slightly lower, whereas the global thresholds in IRT became slightly higher.

Table 2.

Global and Local Thresholds for Type I Error Rates ( $α$ ).

$α$	$n_{0}$	Global threshold (range of local thresholds)
		CTT			IRT
		$m$ = 50	$m$ = 150	$m$ = 250	$m$ = 50	$m$ = 150	$m$ = 250
.05	60	3.9 (0.9, 2.8)	–	–	3.5 (−0.2, 3.1)	–	–
	300	3.5 (0.6, 2.9)	3.3 (−0.3, 3.4)	3.3 (−0.5, 3.6)	3.5 (−0.2, 3.1)	3.6 (−2.0, 3.8)	3.7 (−3.1, 4.2)
	500	3.5 (0.6, 3.1)	3.3 (−0.2, 3.5)	3.2 (−0.6, 3.9)	3.5 (−0.4, 3.1)	3.6 (−1.9, 3.8)	3.7 (−3.1, 4.2)
.01	60	5.9 (1.3, 3.8)	–	–	4.3 (0.2, 3.8)	–	–
	300	4.2 (0.9, 3.7)	4.2 (0.1, 4.2)	4.1 (−0.4, 4.2)	4.3 (0.1, 3.8)	4.3 (−1.7, 4.1)	4.4 (−2.8, 4.6)
	500	4.2 (0.9, 3.8)	4.2 (0.1, 4.2)	4.1 (−0.4, 4.2)	4.3 (−0.2, 4.0)	4.3 (−1.6, 4.1)	4.4 (−2.8, 4.6)

Note. – = irrelevant simulation conditions; CTT = classical test theory; IRT = item response theory; $n_{0}$ = monitor starting point; $m$ = moving sample size.

Overall, the IRT method yielded broader ranges for the local thresholds than did the CTT method regardless of the $α$ , $n_{0}$ , and $m$ . As $n_{0}$ or $m$ increased, the two methods tended to have broader local threshold ranges. While the global threshold value depended on the distribution of all the items’ maximum Z or Y statistics, the local threshold value was relatively small because the local threshold is selected from the distribution of all the Z or Y statistics within each item. Both power studies and the case study used the determined global and local thresholds.
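The difference between the two kinds of threshold can be made concrete with a small Python sketch. It assumes that statistics simulated under the no-compromise condition are available for each item, uses a nearest-rank percentile for simplicity, and takes the global threshold from the distribution of per-item maxima; the function names and the percentile convention are assumptions of this sketch, not details reported by the study.

```python
import math
from typing import Dict, Sequence


def percentile_nearest_rank(values: Sequence[float], q: float) -> float:
    """Nearest-rank percentile: smallest value with at least a fraction q at or below it."""
    ordered = sorted(values)
    # The small tolerance guards against floating-point noise in q * len(ordered).
    rank = max(1, math.ceil(q * len(ordered) - 1e-9))
    return ordered[rank - 1]


def local_thresholds(null_stats: Dict[str, Sequence[float]], alpha: float) -> Dict[str, float]:
    """One threshold per item, from the statistics observed within that item."""
    return {item: percentile_nearest_rank(s, 1.0 - alpha) for item, s in null_stats.items()}


def global_threshold(null_stats: Dict[str, Sequence[float]], alpha: float) -> float:
    """A single threshold for the pool, from the items' maximum statistics."""
    maxima = [max(s) for s in null_stats.values()]
    return percentile_nearest_rank(maxima, 1.0 - alpha)
```

Because the global threshold is taken over maxima, it tends to sit above most local thresholds, which is the pattern visible in Table 2.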

Power Study 1: Using the Global Threshold

Power rates

The ANOVA and stepwise regression showed that all the simulation factors were important in terms of their significance level and the share of the total variance of the power rate that each explained. Appendix A summarizes the statistically important interaction- and main-effect terms of the factors. The interaction effects between the p-increment and the other simulation factors were significant for the power rate. Power rates from the two sequential procedures at $α$ = .05 and $α$ = .01 appear in Tables 3 and 4, respectively. As expected, the power rates from the sequential methods increased as the p-increment increased over all simulation conditions. In other words, when items were more severely compromised, it was easier for the sequential methods to detect the compromised items correctly. When the p-increment was at least .4, where the probability of correctly answering items was greater than or equal to .9 for the moderately difficult items and .8 for the difficult items, the power rates at $α$ = .05 were close to 1.00 except when $n_{c}$ = 60. In addition, the power rates at $α$ = .05 were higher than at $α$ = .01. When the p-increment was .1, for example, the power rates ranged from .04 to .87 at $α$ = .05 and from .00 to .68 at $α$ = .01. For brevity’s sake, only some of the graphs from the complete simulation study appear in this article. The rest of the article focuses on illustrating the results where the p-increment was less than or equal to .4 at $α$ = .05 with four combinations of three factors, including $n_{c}$ = {60, 1,000}, $n_{0}$ = {60, 500}, and $m$ = {50, 250}.

Table 3.

Power Rates Using Global Thresholds at $α$ = .05.

$n_{c}$	$n_{0}$	$m$	Method	Moderate items (p-increment)				Difficult items (p-increment)
				.1	.2	.3	.4	.1	.2	.3	.4
60	60	50	CTT	.08	.18	.36	.58	.11	.24	.48	.76
60	60	50	IRT	.66	.83	.91	.97	.31	.61	.84	.93
450	60	50	CTT	.07	.47	.98	1.00	.16	.70	.99	1.00
450	60	50	IRT	.68	.99	1.00	1.00	.14	.69	.99	1.00
450	300	50	CTT	.25	.79	1.00	1.00	.39	.90	1.00	1.00
450	300	50	IRT	.65	.99	1.00	1.00	.14	.72	.99	1.00
450	300	150	CTT	.51	.99	1.00	1.00	.62	.99	1.00	1.00
450	300	150	IRT	.84	.99	1.00	1.00	.17	.89	1.00	1.00
450	300	250	CTT	.54	1.00	1.00	1.00	.58	.99	1.00	1.00
450	300	250	IRT	.87	.99	1.00	1.00	.15	.87	.99	1.00
1,000	60	50	CTT	.12	.64	.99	1.00	.14	.72	.99	1.00
1,000	60	50	IRT	.52	.96	1.00	1.00	.05	.38	.91	1.00
1,000	300	50	CTT	.29	.86	1.00	1.00	.29	.88	1.00	1.00
1,000	300	50	IRT	.52	.96	1.00	1.00	.05	.38	.90	1.00
1,000	300	150	CTT	.50	.93	.96	1.00	.42	.92	.96	.98
1,000	300	150	IRT	.61	.95	.99	1.00	.07	.49	.89	.95
1,000	300	250	CTT	.51	.91	.95	.96	.39	.88	.95	.98
1,000	300	250	IRT	.65	.95	.97	1.00	.05	.47	.77	.87
1,000	500	50	CTT	.28	.86	1.00	1.00	.29	.89	.99	1.00
1,000	500	50	IRT	.51	.96	1.00	1.00	.04	.39	.91	.99
1,000	500	150	CTT	.50	.93	.96	1.00	.43	.91	.96	.98
1,000	500	150	IRT	.62	.95	.99	1.00	.06	.49	.89	.95
1,000	500	250	CTT	.56	.91	.95	.96	.41	.88	.96	.98
1,000	500	250	IRT	.66	.94	.97	1.00	.06	.47	.78	.87

Note. $n_{c}$ = response change-point; $n_{0}$ = monitor starting point; $m$ = moving sample size; CTT = classical test theory; IRT = item response theory.

Table 4.

Power Rates Using Global Thresholds at $α$ = .01.

$n_{c}$	$n_{0}$	$m$	Method	Moderate items				Difficult items
				.1	.2	.3	.4	.1	.2	.3	.4
60	60	50	CTT	.00	.00	.00	.00	.02	.02	.05	.14
60	60	50	IRT	.51	.72	.85	.93	.19	.38	.70	.89
450	60	50	CTT	.00	.00	.01	.30	.00	.01	.12	.67
450	60	50	IRT	.29	.89	1.00	1.00	.01	.26	.90	1.00
450	300	50	CTT	.03	.28	.86	1.00	.07	.52	.97	1.00
450	300	50	IRT	.28	.90	1.00	1.00	.02	.26	.91	1.00
450	300	150	CTT	.17	.80	1.00	1.00	.20	.87	1.00	1.00
450	300	150	IRT	.58	.98	1.00	1.00	.07	.60	.99	1.00
450	300	250	CTT	.30	.92	1.00	1.00	.28	.92	1.00	1.00
450	300	250	IRT	.68	.97	1.00	1.00	.09	.66	.98	1.00
1,000	60	50	CTT	.00	.00	.05	.58	.00	.01	.20	.84
1,000	60	50	IRT	.21	.74	.99	1.00	.01	.11	.62	.97
1,000	300	50	CTT	.05	.48	.97	1.00	.06	.51	.97	1.00
1,000	300	50	IRT	.21	.73	.99	1.00	.00	.11	.58	.97
1,000	300	150	CTT	.21	.83	.94	.96	.15	.80	.95	.97
1,000	300	150	IRT	.40	.92	.97	1.00	.02	.26	.83	.93
1,000	300	250	CTT	.33	.87	.93	.96	.17	.79	.91	.95
1,000	300	250	IRT	.45	.92	.96	.97	.02	.29	.75	.83
1,000	500	50	CTT	.05	.48	.97	1.00	.05	.53	.96	1.00
1,000	500	50	IRT	.20	.74	.99	1.00	.00	.12	.62	.96
1,000	500	150	CTT	.22	.83	.94	.96	.14	.80	.94	.97
1,000	500	150	IRT	.40	.92	.97	1.00	.02	.28	.83	.93
1,000	500	250	CTT	.31	.87	.93	.96	.18	.78	.91	.95
1,000	500	250	IRT	.47	.92	.96	.97	.03	.29	.75	.83

Note. $n_{c}$ = response change-point; $n_{0}$ = monitor starting point; $m$ = moving sample size; CTT = classical test theory; IRT = item response theory.

Figure 3a shows power rates over the p-increment for each of the sequential procedures. The first row of panels shows the results for the moderately difficult items, while the second row of panels shows the results for the difficult items. For the moderately difficult items, the IRT-based method had higher power rates than the CTT-based method, but the differences between the two methods’ power rates decreased as the p-increment, $n_{c}$, $n_{0}$, and $m$ increased. In contrast, for the difficult items, the CTT method had higher power rates than the IRT method over all the combinations of $n_{c}$, $n_{0}$, and $m$, except at $n_{c}$ = 60. For the difficult items, the differences between the two methods’ power rates increased when the p-increment increased from .1 to .2. This indicated that the CTT method improved the detection of compromised items faster than the IRT method when the p-increment = .2. These differences between the two methods’ power rates became relatively smaller when the p-increment = .3.

Figure 3.

Power and lag over p-increments for each of the sequential procedures at $α$ = .05.

Figure 4a illustrates power rates over the p-increment for each of the combinations of $n_{c}$, $n_{0}$, and $m$. The $n_{c}$, $n_{0}$, and $m$ affected the methods differently. For example, as $n_{c}$ increased from 60 to 1,000, the power rates in CTT increased for all p-increments regardless of item difficulty levels. However, an increasing $n_{c}$ did not always lead to an increasing power rate in IRT. In addition, when $n_{0}$ increased from 60 to 500, the CTT method resulted in higher power rates, while the IRT method’s power rates remained constant. In both the CTT and IRT methods, an increasing $m$ did not always lead to an increasing power rate for all p-increments. When the p-increment = .1, the power rates increased as $m$ increased from 50 to 150 (see Table 3). The positive relationship between $m$ and power rates was already presented by Zhang (2014) and Zhang and Li (2016). However, as $m$ increased from 150 to 250, power rates remained relatively constant or even decreased for difficult items. Overall, the effect of the three simulation factors on power rates was relatively larger in the CTT method than in the IRT method regardless of item difficulty levels.

Figure 4.

Power and lag over p-increments for each combination of $n_{c}$, $n_{0}$, and $m$ at $α$ = .05.

Lag times

ANOVA analysis and stepwise regression showed that all the simulation factors were important in terms of their significance level and total variance of the lag time explained by each simulation factor. Appendix B summarizes the statistically important interaction- and main-effect terms of the factors. The interaction effects between the p-increment and the other simulation factors were significant for the lag time, just as they were for the power rates. Lag times from the two sequential procedures at $α$ = .05 and $α$ = .01 appear in Tables 5 and 6, respectively. The lag times from the sequential methods decreased as the p-increment increased over all simulation conditions. In other words, when items were more severely compromised, the sequential methods detected compromised items more quickly.
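The two outcome measures reported in the tables can be illustrated with a minimal sketch. It assumes that power is the proportion of replications in which an item is flagged at or after the change-point and that lag is the mean number of responses between the change-point and the first flag; these working definitions, the function name `power_and_lag`, and the detection indices below are illustrative assumptions, not the authors' code or data.

```python
from statistics import mean

def power_and_lag(detections, change_point):
    """Summarize simulated detections for one condition.

    detections: first-flag response index per replication (None = never flagged).
    Assumption: a flag before the change-point is a false alarm, not a hit.
    """
    hits = [d for d in detections if d is not None and d >= change_point]
    power = len(hits) / len(detections)
    # Lag is undefined when nothing is detected (shown as a dash in Table 6).
    lag = mean(d - change_point for d in hits) if hits else None
    return power, lag

# Hypothetical first-flag indices from five replications with n_c = 450.
power, lag = power_and_lag([470, None, 510, 455, 430], change_point=450)
print(power, lag)
```

Returning `None` for the lag when no replication detects the item mirrors the dashes in Table 6, where a power rate of .00 leaves the lag time undefined.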

Table 5.

Lag Times Using Global Thresholds at $α$ = .05.

$n_{c}$	$n_{0}$	$m$	Method	Moderate items				Difficult items
				.1	.2	.3	.4	.1	.2	.3	.4
60	60	50	CTT	111	50	50	44	122	57	50	42
60	60	50	IRT	227	137	87	77	299	273	152	73
450	60	50	CTT	250	150	70	36	238	133	62	35
450	60	50	IRT	340	108	42	29	460	297	98	44
450	300	50	CTT	299	148	51	32	244	116	45	31
450	300	50	IRT	322	104	43	29	424	320	98	44
450	300	150	CTT	261	129	78	59	250	120	76	57
450	300	150	IRT	317	105	66	49	377	298	118	87
450	300	250	CTT	289	174	115	88	289	167	110	85
450	300	250	IRT	311	133	85	63	426	322	178	127
1,000	60	50	CTT	224	140	60	36	170	117	54	35
1,000	60	50	IRT	201	95	41	29	239	182	88	44
1,000	300	50	CTT	205	118	47	31	159	95	43	30
1,000	300	50	IRT	202	96	41	29	249	185	91	43
1,000	300	150	CTT	215	112	72	54	191	112	71	53
1,000	300	150	IRT	216	100	65	49	267	215	118	87
1,000	300	250	CTT	242	147	97	72	226	140	92	69
1,000	300	250	IRT	244	123	80	59	299	261	167	124
1,000	500	50	CTT	202	117	47	32	159	94	44	31
1,000	500	50	IRT	203	96	42	29	242	179	86	44
1,000	500	150	CTT	208	113	73	54	191	111	71	53
1,000	500	150	IRT	210	99	65	48	266	210	118	87
1,000	500	250	CTT	244	141	94	70	225	135	89	67
1,000	500	250	IRT	246	122	79	59	297	262	166	124

Note. $n_{c}$ = response change-point; $n_{0}$ = monitor starting point; $m$ = moving sample size; CTT = classical test theory; IRT = item response theory.

Table 6.

Lag Times Using Global Thresholds at $α$ = .01.

$n_{c}$	$n_{0}$	$m$	Method	Moderate items				Difficult items
				.1	.2	.3	.4	.1	.2	.3	.4
60	60	50	CTT	−	−	−	3	0	3	23	39
60	60	50	IRT	258	179	124	89	311	258	216	122
450	60	50	CTT	−	−	62	52	−	96	80	62
450	60	50	IRT	410	281	69	37	469	356	242	69
450	300	50	CTT	237	130	79	40	186	142	71	38
450	300	50	IRT	426	285	72	37	568	360	237	67
450	300	150	CTT	249	167	100	75	255	156	97	73
450	300	150	IRT	308	134	80	60	489	354	143	98
450	300	250	CTT	295	220	141	108	304	215	138	104
450	300	250	IRT	352	156	103	76	520	387	199	143
1,000	60	50	CTT	−	58	78	57	−	94	103	74
1,000	60	50	IRT	270	179	64	37	250	219	157	64
1,000	300	50	CTT	211	144	77	38	152	124	64	37
1,000	300	50	IRT	275	171	63	37	131	211	152	62
1,000	300	150	CTT	239	157	93	70	206	150	91	68
1,000	300	150	IRT	211	129	80	59	295	224	139	98
1,000	300	250	CTT	268	188	122	90	249	180	116	87
1,000	300	250	IRT	246	152	100	73	336	284	188	139
1,000	500	50	CTT	197	143	78	38	184	125	64	37
1,000	500	50	IRT	270	175	64	37	276	226	153	65
1,000	500	150	CTT	249	158	93	69	208	147	91	68
1,000	500	150	IRT	206	127	79	58	271	223	138	98
1,000	500	250	CTT	268	187	122	90	244	178	116	87
1,000	500	250	IRT	248	151	99	73	324	283	187	139

Note. – = power rates of .00. CTT = classical test theory; IRT = item response theory.

Figure 3b presents lag times over the p-increment for each of the sequential procedures. To depict the lag time in plots, lag times were converted to log values. As shown in Figure 3b, the item difficulty levels differently affected the lag times of the sequential methods. For moderately difficult items (see the first row of panels), the CTT method had slightly longer lag times than the IRT method except at $n_{c}$ = 60. In contrast, for the difficult items (see the second row of panels), the IRT method showed much longer lag times than the CTT method over all the conditions. However, the differences between the two methods tended to be small when $n_{c}$ = 1,000.

Figure 4b shows the log lag over the p-increment for each of the combinations of $n_{c}$, $n_{0}$, and $m$. The $n_{c}$, $n_{0}$, and $m$ affected the methods differently. For example, as $n_{c}$ increased from 60 to 1,000, lag times increased in CTT except when the p-increment = .4, whereas lag times decreased in IRT regardless of item difficulty levels. When $n_{0}$ increased from 60 to 500, the CTT method resulted in shorter lag times while the IRT method showed similar lag times. Moreover, in both CTT and IRT methods, lag times tended to grow longer as $m$ increased.

Power Study 2: Using the Local Threshold

Power rates

ANOVA analysis and stepwise regression with two selection criteria—significance levels and the amount of variance explained by each simulation factor—resulted in a simpler model when the local threshold approach was used than when the global threshold approach was used. This occurred because the p-increment became a more dominant factor in the local threshold approach. The interaction effects between the p-increment and $n_{c}$ and between the p-increment and $α$ were significant for the power rate in the local threshold approach, as shown in Appendix C.

Figure 5a compares power rates between the global and local threshold methods under combinations of three simulation factors ($n_{c}$, $n_{0}$, $m$) and item difficulty levels at $α$ = .05. Both the CTT and IRT methods showed that the local threshold methods (Δ and *) had higher power rates than did the global threshold methods (□ and ○), especially for the lower p-increments, over all the simulation conditions. For example, when the p-increment = .1 for moderately difficult items at $α$ = .05, the lowest power rate for the local threshold approach was .86, whereas the lowest power rate for the global threshold approach was .07 (see Appendix E). However, power rates from the two threshold approaches were close to each other (1.00) when the p-increment was at least .4 except at $n_{c}$ = 60. The sequential procedures using local thresholds tended to detect compromised items earlier than the change-point because of lower cutoff values. Power rates from the two sequential procedures at $α$ = .05 and $α$ = .01 for the local threshold approach are presented in Appendices E and F, respectively.

Figure 5.

Power and lag over p-increments for thresholds and sequential procedures at $α$ = .05.

Lag times

ANOVA analysis and stepwise regression showed that the interaction effects between the p-increment and the other simulation factors were significant for lag times, just as they were for power rates. In addition to the interaction effects between the p-increment and the sequential procedures (Method) and between $n_{c}$ and Method, $m$ and the item difficulty level (b) were important factors for lag times (see Appendix D). Overall, lag times decreased as the p-increment increased and as $n_{c}$ and $m$ decreased. As Figure 5b illustrates, the local threshold (Δ and *) showed similar lag times over the two sequential methods regardless of $n_{c}$ and $m$ except at $n_{c}$ = 60. Figure 5b also shows much shorter lag times for the local thresholds than for the global thresholds. Appendices G and H summarize lag times from the two sequential procedures at $α$ = .05 and $α$ = .01, respectively. The local threshold approach showed that, for moderately difficult items at $α$ = .05, lag times for the CTT method ranged from 14 to 157, and lag times for the IRT method ranged from 14 to 163. Under the same condition, the global threshold approach showed that lag times for the CTT method ranged from 30 to 299, and lag times for the IRT method ranged from 29 to 460.

Case Study

Methods

In addition to the simulation studies, a case study attempted to identify potentially compromised items in the real item pool administered in CAT. It used a real CAT data set from a licensure testing organization. The item pool included 1,472 items administered by using CAT to 69,562 candidates during 3 months in 2015. The number of responses to the items ranged from 0 to 18,548. The mean and the standard deviation (SD) of candidates’ abilities were .30 and .44, respectively. Among 1,472 items, 190 items were selected based on the number of responses generated for the simulation study. The number of responses to the selected items in the real data ranged between 3,109 and 9,505. All the selected items were multiple-choice items, and the item difficulty values ranged between −0.36 and 1.20.

The case study implemented the hybrid threshold method with cutoff points from the Type I error study at $α$ = .05. The global cutoff point at $α$ = .05 for both the CTT and IRT methods was 3.5. The local cutoff points ranged from 0.6 to 2.9 for the CTT method and from −0.2 to 3.1 for the IRT method. Values of $n_{0}$ and $m$ for the sequential procedures were selected based on the results from the simulation study. The sequential procedures started monitoring from the 300th response, which occurred about 1 week after the start of the real exam administration. The simulation studies showed that a larger $n_{0}$ resulted in higher power rates and shorter lag times in the CTT method, while it did not seem to affect the IRT method. The chosen moving sample size for the procedures was 50; with $m$ = 50, the simulation studies showed relatively short lag times for both sequential procedures.

Results

The CTT- and IRT-based procedures flagged 41 and 55 items as compromised, respectively, when the global threshold was used under the conditions of $n_{0}$ = 300, $m$ = 50, and $α$ = .05. Twenty of the 41 items flagged by the CTT method were moderately difficult items, while all 55 items flagged by the IRT method were moderately difficult items. The number of flagged items in each of the two methods was larger than the number expected from the nominal Type I error rate alone (190 × .05 = 9.5). Twelve items were flagged by both procedures.
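To give a rough sense of how far 41 and 55 flags exceed the chance expectation of 9.5, the following sketch computes a binomial upper-tail probability. It rests on a simplifying assumption that is not made in the study: each of the 190 items is falsely flagged independently with probability exactly .05. The helper name `binom_tail` is illustrative.

```python
from math import comb

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p), summed exactly term by term."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))

n_items, alpha = 190, 0.05
expected_false_flags = n_items * alpha  # the 9.5 quoted in the text

# Chance of at least 41 (CTT) or 55 (IRT) flags if every flag were a false alarm,
# ASSUMING independent items with a per-item false-flag rate of alpha.
p_ctt = binom_tail(n_items, 41, alpha)
p_irt = binom_tail(n_items, 55, alpha)
print(expected_false_flags, p_ctt, p_irt)
```

Under that independence assumption both tail probabilities are vanishingly small, so the flag counts are not explained by the nominal error rate alone; they do not by themselves show that the flagged items were compromised.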

One of the 12 items flagged by both methods is illustrated in Figure 6. Figure 6a shows the first 2,150 Y and Z statistics from the IRT and CTT procedures, respectively. The item was administered to 7,389 candidates, corresponding to an exposure rate of 10.6% of all examinees in the exam period. The item’s difficulty parameter was −0.21, a moderate level of difficulty in the item pool. The two sequential procedures were applied from the 300th response, generating 7,090 (7,389 − 300 + 1) Y statistics and Z statistics in the IRT and CTT procedures, respectively. When $n_{0}$ = 300, $m$ = 50, and $α$ = .05, the global threshold for the two procedures was 3.5, and the local thresholds were 1.7 and 1.4 for IRT and CTT, respectively. If the statistics were greater than the threshold, then the item was flagged by the corresponding sequential procedure.
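The two counts in this paragraph follow directly from the figures reported for the case study, as the short check below shows; the variable names are illustrative.

```python
n_candidates = 69_562      # candidates in the 3-month case-study data
n_item_responses = 7_389   # responses to the example item
n0 = 300                   # monitor starting point

# Exposure rate: share of all candidates who saw the item.
exposure_rate = n_item_responses / n_candidates

# One statistic is produced per response from the n0-th response onward.
n_statistics = n_item_responses - n0 + 1

print(f"{exposure_rate:.1%}", n_statistics)
```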

Figure 6.

An example item flagged as compromised by CTT and IRT methods.

As shown in Figure 6a, the IRT method resulted in larger statistics than did the CTT method over the monitoring process. At the 556th response, the two statistics rose above the local threshold for the first time in the monitoring period. Then, between the 561st and 591st responses, the two statistics remained above the local threshold. At this point, the item may be flagged as a potentially compromised item and be monitored for trends in its statistics. In the monitoring process, the two statistics continuously increased and rose above the global threshold between the 1,960th and 1,984th responses. Figure 6b presents those statistics between the 1,970th and the 2,028th responses (59 data points). At the 1,970th response, the item was first flagged by the two local thresholds. The statistics of the item continuously increased, rose above the global threshold at the 1,984th response, and remained higher than the global threshold until the 2,004th response. At this point, this item could be masked immediately in the CAT. Suspending the item from delivery through masking would ensure that it would not be administered to any candidates during the following days until a conclusive determination of whether the item was compromised.

However, as shown in Figure 6a, the statistics became smaller than the global threshold after the 2,004th response and even fell below the local thresholds after the 2,008th response. This pattern, characterized by increasing statistics above the local threshold and then decreasing statistics below the local threshold, seemed to recur many times in the examinee sequence until the end of administration, regardless of the sequential method. However, the Z statistics were not higher than the global threshold after the 2,004th response. The other 11 items flagged by the two methods in the case study showed similar patterns.

For further investigation, the 12 flagged items were sent to the test development content review team, which found no evidence of item compromise. In addition, examination of the candidates’ possible association factors at the change-points found, concerning the flagged items, no common factor among the candidates.

Discussion

The goal of this research was threefold: (1) applying sequential procedures to a variable-length CAT to find significant factors that contribute to their power rates and lag times, (2) proposing a local threshold approach for sequential procedures, and (3) suggesting a hybrid threshold approach to apply the sequential procedures to a real item pool.

The first goal of this research was to apply the CTT- and IRT-based sequential procedures to a real licensure exam in a variable-length CAT under various simulation factors, and it accomplished this goal by using the global threshold approach. This study showed that the two sequential procedures performed differently with some simulation factors when the p-increment was small. The CTT-based procedure had higher power rates and shorter lag times as the monitor starting point and change-point increased. In contrast, the IRT-based procedure had higher power rates and shorter lag times as the change-point increased, but its power rates and lag times remained relatively constant as the monitor starting point increased. Both procedures had shorter lag times and higher power rates as the moving sample size increased from 50 to 150. However, as the moving sample size increased from 150 to 250, power rates remained similar or decreased. Overall, the IRT procedure seemed to be more sensitive than the CTT procedure, and it flagged more items as compromised. The case study also indicated that the IRT method seemed more sensitive for moderately difficult items than for difficult items. As this study illustrated, one sequential procedure could perform better than the other under some conditions, but not under all. Hence, identifying items by using both sequential procedures flags a reasonable number of items.

Of the factors used in this study, the p-increment was the most important for the sequential procedure to detect compromised items in terms of the power rate and lag time. For example, if the p-increment was at least .3 at $α$ = .05, then the sequential procedures detected more than 90% of the compromised items correctly with relatively short lag times while controlling the Type I error rate for most simulation conditions. However, when the p-increment was small (i.e., .1 or .2), the power rate dramatically dropped, and the lag time lengthened.

To resolve these problems, another goal of this study was to propose a local threshold approach. This approach aims to increase the detection power rate when the p-increment is small, a scenario that may be more realistic in operational CAT. The simulation results showed that the local threshold approach was promising for detecting compromised items when the p-increment was small. When the p-increment was .1 or .2, the power rates using the local threshold were improved significantly, and the lag time was shortened. In practice, however, it would not be easy to determine whether a small increase in the statistics of the sequential procedure was from true item compromise or from false detection. In addition, the case study demonstrated that the statistics became higher than, and then lower than, the local threshold many times in the response sequence of each item.

Therefore, the current study suggested a hybrid threshold approach. The hybrid threshold approach can use the local threshold to expose a potentially compromised item in an early stage of the CAT administration and to monitor trends in item response functions until the item is flagged by the global threshold. If a reasonable number of statistics lie above the defined global threshold for an item, immediately masking the item ensures that the item will not be seen by any candidates until an item content review panel determines whether the item is compromised or not. In practice, it would be much easier to mask potentially compromised items than to detect and punish possible cheaters because psychometric evidence may not be sufficient to prove that examinees’ behavior is aberrant. Based on the global threshold, it can be beneficial to investigate candidates and see whether they had any factors in common, such as school, testing center, or city, although these factors may have decreased in significance due to the development of social media. If evidence suggests associations between candidates or supports the possibility of collusion, the item can be indicated as compromised and its administration can stop.
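The two-stage logic described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `hybrid_monitor` is hypothetical, and the `min_run` parameter is an assumed way of operationalizing "a reasonable number of statistics" above the global threshold, which the study does not quantify. The thresholds in the example (1.7 local, 3.5 global) are those reported for the IRT method in the case study; the statistics themselves are made up.

```python
def hybrid_monitor(stats, local_thr, global_thr, min_run=3, start=1):
    """Scan a sequence of monitoring statistics with two thresholds.

    Returns (watch_index, mask_index):
      watch_index: first response index whose statistic exceeds the local
                   threshold (early warning; the item stays in the pool).
      mask_index:  response index at which `min_run` consecutive statistics
                   have exceeded the global threshold (assumed masking rule).
    Either value is None if the corresponding event never occurs.
    """
    watch_index = None
    run = 0
    for i, s in enumerate(stats, start=start):
        if watch_index is None and s > local_thr:
            watch_index = i
        if s > global_thr:
            run += 1
            if run >= min_run:
                return watch_index, i
        else:
            run = 0  # the run must be consecutive
    return watch_index, None

# Hypothetical statistics for responses 300..309 of one item.
stats = [0.5, 1.8, 1.2, 2.0, 3.6, 3.7, 3.4, 3.6, 3.8, 3.9]
print(hybrid_monitor(stats, local_thr=1.7, global_thr=3.5, start=300))
```

Requiring a run of consecutive exceedances is one possible guard against the pattern seen in the case study, where statistics repeatedly rose above and fell below a threshold; other rules, such as a count within a window, would fit the description equally well.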

This study found that the sequential procedures using the hybrid approach can be implemented in real time. Before implementing these procedures, users should select a monitor starting point and a moving sample size (e.g., the 300th response and 50 responses, respectively). In addition, users should determine the global and local thresholds for the administered items under their own CAT algorithm. Then the procedure using the hybrid approach can be applied to items after they have been administered to 300 test-takers and can monitor items on each day’s responses.
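A moving-sample monitor of this kind can be sketched as below. The statistic used here is a generic pooled two-proportion z-test that compares the proportion correct in the most recent $m$ responses with the proportion correct in all earlier responses; it is a stand-in chosen for illustration and is an assumption, not necessarily the Z statistic defined for the CTT procedure in this study. The function name `moving_z` and the response vector are hypothetical.

```python
from math import sqrt

def moving_z(responses, n0, m):
    """Return [(t, z)] for t = n0..len(responses), one statistic per response.

    responses: 0/1 scores on one item in administration order.
    At each t the last m responses form the moving sample and the first
    t - m responses form the reference sample (assumed design).
    """
    if n0 <= m:
        raise ValueError("n0 must exceed m so the reference sample is non-empty")
    cum = [0]
    for r in responses:              # cumulative number correct
        cum.append(cum[-1] + r)
    out = []
    for t in range(n0, len(responses) + 1):
        n_ref = t - m
        p_ref = cum[n_ref] / n_ref
        p_mov = (cum[t] - cum[n_ref]) / m
        p_pool = cum[t] / t
        se = sqrt(p_pool * (1 - p_pool) * (1 / m + 1 / n_ref))
        out.append((t, (p_mov - p_ref) / se if se > 0 else 0.0))
    return out

# Hypothetical item: p = .5 for 60 responses, then every response correct.
responses = [0, 1] * 30 + [1] * 20
series = moving_z(responses, n0=60, m=20)
print(len(series), series[0], series[-1])
```

The series contains one statistic per response from the $n_{0}$-th response onward, which matches the count used in the case study (7,389 − 300 + 1 = 7,090 statistics for the example item).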

This study had some limitations that future research can investigate. First, using different item characteristics and CAT algorithms may result in different outcomes from those found in this study because the global and local thresholds depend heavily on items’ characteristics and the CAT algorithm. Studies of the sequential procedures under other CAT algorithms would therefore be valuable. Second, the current study utilized the Rasch model and showed that the sequential procedure performed differently at varying difficulty levels. Examining the global threshold approach for subsets of items according to those items’ characteristics in different IRT models (e.g., 2PL [two-parameter logistic] or 3PL IRT model) would be another potential topic for future research.

This research demonstrated that the CTT- and IRT-based sequential procedures using the hybrid threshold approach can be applied to a variable-length CAT in real time. Because the licensure exam used in this study follows the standard item development process (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 2014) and is administered to more than 60,000 candidates in every quarter, the study presented a potential application of the sequential procedures by using the hybrid threshold approach to any CAT-based exam. In addition, the findings of this study can provide test developers and stakeholders with helpful guidelines for using the sequential procedures to detect compromised items administered in CAT. Moreover, the hybrid threshold approach can resolve the problems from the local and global thresholds. The approach improved power rates and lag times in an early stage of item administration and controlled Type I error rates in the decision-making stage.

Appendix

Appendix H

Lag Times Based on the Local Threshold at $α$ = .01.

$n_{c}$	$n_{0}$	$m$	Method	Moderate items				Difficult items
				.1	.2	.3	.4	.1	.2	.3	.4
60	60	50	CTT	195	101	52	32	207	94	43	25
60	60	50	IRT	129	83	50	42	174	64	47	31
450	60	50	CTT	146	47	28	21	150	55	29	21
450	60	50	IRT	151	46	26	19	157	53	29	21
450	300	50	CTT	142	46	27	20	138	50	27	21
450	300	50	IRT	143	46	26	19	140	48	27	21
450	300	150	CTT	163	72	49	38	145	68	48	37
450	300	150	IRT	176	68	45	33	144	64	44	34
450	300	250	CTT	196	98	66	51	172	87	62	50
450	300	250	IRT	185	85	54	41	173	79	53	40
1,000	60	50	CTT	113	43	22	17	120	59	28	18
1,000	60	50	IRT	122	50	26	20	148	64	31	21
1,000	300	50	CTT	97	41	23	18	91	45	25	17
1,000	300	50	IRT	119	50	27	20	106	43	24	17
1,000	300	150	CTT	142	60	41	31	127	57	35	24
1,000	300	150	IRT	169	66	44	33	124	52	31	23
1,000	300	250	CTT	177	85	57	41	135	64	41	30
1,000	300	250	IRT	191	84	56	41	130	61	38	27
1,000	500	50	CTT	123	46	26	18	98	41	25	18
1,000	500	50	IRT	124	46	26	18	104	42	25	18
1,000	500	150	CTT	139	62	40	30	116	57	36	25
1,000	500	150	IRT	159	67	43	31	111	53	33	23
1,000	500	250	CTT	168	82	55	41	127	60	39	29
1,000	500	250	IRT	187	83	54	40	126	58	36	26

Note. $n_{c}$ = response change-point; $n_{0}$ = monitor starting point; $m$ = moving sample size; CTT = classical test theory; IRT = item response theory.

Acknowledgements

We thank Dr. Marcoulides and two anonymous reviewers for their helpful comments and suggestions. Also, we thank Jennifer Pretzer whose support helped improve and clarify this manuscript.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Chansoon Lee

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association.

Belov, D. I. (2014). Detecting item preknowledge in computerized adaptive testing using information theory and combinatorial optimization. Journal of Computerized Adaptive Testing, 2(3), 37-58. https://doi.org/10.7333/1410-0203037

Choe, E. M., Zhang, & Chang, H. H. (2018). Sequential detection of compromised items using response times in computerized adaptive testing. Psychometrika, 83(3), 650-673. https://doi.org/10.1007/s11336-017-9596-3

Grigg, O. A., Farewell, V. T., & Spiegelhalter, D. J. (2003). Use of risk-adjusted CUSUM and RSPRT charts for monitoring in medical contexts. Statistical Methods in Medical Research, 12(2), 147-170. https://doi.org/10.1177/096228020301200205

Kang, H.-A., & Chang, H.-H. (2016). Online detection of item compromise in CAT using responses and response times [Paper presentation]. Annual Meeting of the National Council on Measurement in Education, Washington, DC, United States.

Liu, & Han, K. T. (2019). Compromised item detection for computerized adaptive testing. Frontiers in Psychology, 10, Article 829. https://doi.org/10.3389/fpsyg.2019.00829

O’Leary, L. S., & Smith, R. W. (2017). Detecting candidate preknowledge and compromised content using differential person and item functioning. In G. J. Cizek & J. A. Wollack (Eds.), Handbook of quantitative methods for detecting cheating on tests (pp. 151-163). Routledge.

Page, E. S. (1954). Continuous inspection schemes. Biometrika, 41(1/2), 100-115. https://doi.org/10.1093/biomet/41.1-2.100

Qian, Staniewska, Reckase, & Woo (2016). Using response time to detect item preknowledge in computer-based licensure examinations. Educational Measurement: Issues and Practice, 35(1), 38-47.

R Core Team. (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/

Sinharay (2017). Detection of item preknowledge using likelihood ratio test and score test. Journal of Educational and Behavioral Statistics, 42(1), 46-68. https://doi.org/10.3102/1076998616673872

van der Linden, W. J. (2006). A lognormal model for response times on test items. Journal of Educational and Behavioral Statistics, 31(2), 181-204. https://doi.org/10.3102/10769986031002181

Veerkamp, W. J., & Glas, C. A. (2000). Detection of known items in adaptive testing with a statistical quality control method. Journal of Educational and Behavioral Statistics, 25(4), 373-389. https://doi.org/10.3102/10769986025004373

Wang, Shang, & Kuncel (2018). Detecting aberrant behavior and item preknowledge: A comparison of mixture modeling method and residual method. Journal of Educational and Behavioral Statistics, 43(4), 469-501. https://doi.org/10.3102/1076998618767123

Zhang (2014). A sequential procedure for detecting compromised items in the item pool of a CAT system. Applied Psychological Measurement, 38(2), 87-104. https://doi.org/10.1177/0146621613510062

Zhang & Li (2016). Monitoring items in real time to enhance CAT security. Journal of Educational Measurement, 53(2), 131-151. https://doi.org/10.1111/jedm.12104