Abstract
An increasing concern of producers of educational assessments is fraudulent behavior during the assessment (van der Linden, 2009). Benefiting from item preknowledge (e.g., Eckerly, 2017; McLeod, Lewis, & Thissen, 2003) is one type of fraudulent behavior. This article suggests two new test statistics for detecting individuals who may have benefited from item preknowledge; the statistics can be used for both nonadaptive and adaptive assessments that may include either or both of dichotomous and polytomous items. Each new statistic has an asymptotic standard normal distribution. It is demonstrated in detailed simulation studies that the Type I error rates of the new statistics are close to the nominal level and the values of power of the new statistics are larger than those of an existing statistic for addressing the same problem.
Keywords
Fraudulent behavior during educational assessments is an increasingly common problem. Naturally, there is a growing interest in methods for detection of fraudulent behavior, which is evident from the publication of three recent edited volumes on the topic (Cizek & Wollack, 2017; Kingston & Clark, 2014; Wollack & Fremer, 2013) in addition to numerous journal articles (e.g., Belov, 2013, 2016; McClintock, 2015; McLeod et al., 2003; Romero, Riascos, & Jara, 2015; Segall, 2002; Shu, Henson, & Luecht, 2013; van der Linden, 2009; van der Linden & Jeon, 2012; van der Linden & Lewis, 2015) and conference presentations (e.g., Eckerly, Babcock, & Wollack, 2015) on the topic.
While there are several types of fraudulent behavior (Kingston & Clark, 2014), the interest in this article will be on item preknowledge in which a “source” shares assessment questions and/or their answers (where the source could be a teacher, a test preparation company, a website, or individual examinees) and then several beneficiaries memorize the assessment questions and/or answers. For example, Educational Testing Service discovered in 2002 that many students in several countries were benefiting from websites showing live items used in the Graduate Record Examination (GRE); the phenomenon was so widespread that average scores on GRE verbal increased by 100 points (out of a possible 800 points) in one country and 50 points in another (Kyle, 2002). The shared/memorized items are usually referred to as “compromised” items. Item preknowledge results in inflated scores for some examinees and should be identified (Belov, 2013, p. 142). The focus of this article will be on detecting examinees who may have benefited from item preknowledge, both for nonadaptive assessments and for computerized adaptive tests (CATs) that are more susceptible to item preknowledge due to more frequent testing.
This article considers only the case when the investigator knows which items are compromised; Belov (2013), Eckerly (2017), and Shu, Henson, and Luecht (2013) considered this case. Typically, such a case arises when the assessment administrators become aware after an assessment about some items possibly being compromised. Such a case may also arise when the assessment administrators have applied a method to detect compromised items (e.g., that suggested by Veerkamp & Glas, 2000) to find that some items may have been compromised. The real data examples later in this article demonstrate that such a situation can arise in practice.
This article suggests two new statistics for detection of examinees who may have benefited from item preknowledge. Both statistics are based on item response theory (IRT) and quantify the difference between the ability estimates computed from the compromised items and those computed from the noncompromised items. They are derived from two classical statistical tests, namely, the likelihood ratio test (LRT) and the score test (e.g., Cox & Hinkley, 1974; Rao, 1973); the asymptotic null distributions of these statistics are known. The new statistics are computed for two real data sets from a nonadaptive assessment. The Type I error rates and power of the new statistics are examined and are compared to those of the statistic of Belov (2013) for both nonadaptive and adaptive assessments.
As in Belov (2013), this article involves the assumption that the item parameters are known. While Belov seems to have considered only dichotomous items, the two statistics suggested in this article apply to an assessment consisting of a mix of dichotomous and polytomous items; thus, the approaches suggested in this article are the first that can be used with an assessment that includes some polytomous items.
Background
Let us consider an assessment that consists of a mix of dichotomous and polytomous items whose parameters are assumed known (and are equal to the estimates computed from a previous calibration using IRT). Let yi denote the score of an examinee (with true ability of θ) on item i and let
where ai, bi, and ci, respectively, are the slope, difficulty, and guessing parameters of item i (e.g., Birnbaum, 1968), whereas if the generalized partial credit model is used for item i that has the score categories 0, 1,…, mi, then:
where ai is the slope parameter and bih’s are the location parameters of the item (e.g., Muraki, 1992). Let us assume that a known subset S of items in the item pool is suspected to have been compromised. Let s denote the subset of the compromised items that was administered to the examinee. Depending on the extent of item preknowledge and the design of the assessment, an examinee could receive between 0 and all of the compromised items (i.e., s could be between a null set and S). For example, it is possible that 50 items from a CAT pool are compromised (so that S includes 50 items) and an examinee knows about all of them but receives only 10 of them (so that s includes 10 items) when he or she is tested. Let
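For reference, the two models named above are usually written in the following standard forms. The probability notation used here is an assumption made for illustration and may differ from the article's own, and the scaling constant that some authors include in the exponent is omitted.

```latex
% Three-parameter logistic model for a dichotomous item i
P(y_i = 1 \mid \theta) = c_i + (1 - c_i)\,
  \frac{\exp[a_i(\theta - b_i)]}{1 + \exp[a_i(\theta - b_i)]}

% Generalized partial credit model for an item i with score categories 0, 1, \ldots, m_i
% (an empty sum, for j = 0 or k = 0, is taken to be zero)
P(y_i = j \mid \theta) =
  \frac{\exp\left[\sum_{h=1}^{j} a_i(\theta - b_{ih})\right]}
       {\sum_{k=0}^{m_i} \exp\left[\sum_{h=1}^{k} a_i(\theta - b_{ih})\right]},
  \qquad j = 0, 1, \ldots, m_i
```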
The Research Problem
Given the estimated parameters of the items, the scores of several examinees on several items, and a known set S of items that is suspected to have been compromised, the goal here is to detect examinees who may have benefited from item preknowledge on either a nonadaptive test or a CAT that may include both dichotomous and polytomous items. Further, it is assumed that if an examinee benefited from item preknowledge, then he or she has preknowledge of all items in S and (unfairly) obtains a score larger than what is expected from the IRT model and the examinee’s true ability on any item in S. This research problem is essentially the same as Case 1 considered in Belov (2013).
Review of a Statistic Based on the Kullback–Leibler Divergence
Belov (2013) suggested using the approximate Kullback–Leibler divergence (Kullback & Leibler, 1951) between the posterior distribution of the ability given the item scores on s and the posterior distribution of the ability given the item scores on
The above algorithm has at least three practical limitations. First, if examinees in more than 20% of the assessment centers benefited from item preknowledge, then the above strategy would fail to flag the examinees in some of them. Belov (2013) did not discover this problem because he simulated 20% or fewer aberrant assessment centers (an aberrant examinee is one who had item preknowledge and an aberrant assessment center is one that includes one or more aberrant examinees). Second, if the actual percentage of assessment centers whose examinees benefited from item preknowledge is smaller than 20%, then flagging 20% of the assessment centers and computing w from the remaining 80% could lead to the loss of quite a bit of information (especially if there are only a few assessment centers) and could cause some nonaberrant assessment centers to appear aberrant. Third, for assessments that involve continuous administration in which not every center is involved in every administration, it is not clear which assessment centers should be included in the computations.
It is clear from the above limitations that there is scope for further research in this area. In particular, it would be beneficial if the investigator could use a test statistic with a known null distribution to detect item preknowledge; that would obviate the need to determine w empirically and would overcome the above-mentioned limitations of the algorithm of Belov (2013).
Review of Other Methods of Detecting Item Compromise
McLeod, Lewis, and Thissen (2003) suggested a Bayesian method to detect item preknowledge in CATs; they did not assume that the set of aberrant items is known—so their method is expected to lead to smaller power than the statistics considered in this article (some limited simulations confirmed this). Researchers such as Segall (2002), Shu et al. (2013), and Eckerly, Babcock, and Wollack (2015) suggested IRT models for detection of item preknowledge for nonadaptive assessments. van der Linden and van Krimpen-Stoop (2003) suggested a Bayesian method that uses item scores and response times to detect item preknowledge for nonadaptive assessments. Qian, Staniewska, Reckase, and Woo (2016) used a similar approach in which the residuals from the joint application of the Rasch or the two-parameter logistic model to item scores and the response-time model of van der Linden (2006) to item response times were used to flag examinees for possible item preknowledge in both adaptive and nonadaptive assessments. Researchers such as Veerkamp and Glas (2000) discussed how one can detect items that have been compromised. These methods are not considered henceforth in this article. Note that all of these methods apply to tests with only dichotomous items.
Two New Statistics Based on the LRT and the Score Test
The rationale behind the new statistics
The rationale of the h statistic is that the posterior distributions of a nonaberrant examinee (one who did not benefit from item preknowledge) based on s and
If, for an examinee, the posterior distributions based on s and
Both the LRT and the score test statistic have a known asymptotic null distribution, the χ² distribution with one degree of freedom (e.g., Finkelman, Weiss, & Kim-Kang, 2010; Klauer & Rettig, 1990). Further, hypothesis tests based on likelihood ratios are asymptotically the most powerful in general (e.g., Cox & Hinkley, 1974, pp. 312, 320; Drasgow, Levine, & McLaughlin, 1987) because of the Neyman–Pearson lemma (e.g., Cox & Hinkley, 1974; Lehmann & Romano, 2005; Romero et al., 2015), and the score test is asymptotically equivalent to the LRT (e.g., Cox & Hinkley, 1974; Lehmann & Romano, 2005); therefore, these tests may perform better than the h statistic.
A Statistic Based on the LRT
For an examinee, let us define the maximum likelihood estimate (MLE) or the weighted maximum likelihood estimate (WLE; Warm, 1989) of the examinee ability from the scores on s as
The LRT statistic (e.g., Cox & Hinkley, 1974; Lehmann & Romano, 2005; Rao, 1973) for testing the null hypothesis of equality of the examinee ability over s and
where
and
Because, for example,
the LRT statistic given in Equation 1 can be expressed as
The above result holds for nonadaptive assessments due to local independence given the examinee ability and for CATs due to local independence given the examinee ability and item strings (e.g., Mislevy & Chang, 2000). Finkelman et al. (2010) showed in the context of measurement of change that a statistic similar to the above statistic follows an asymptotic
The statistic L is appropriate for two-sided alternative hypotheses. However, the alternative hypothesis in our case is one sided—it is desired here to detect those who performed better on the compromised items and not to detect those who performed better on the noncompromised items. For one-sided alternatives, researchers such as Cox (2006, p. 104), Cox and Hinkley (1974, p. 315), and Biehler, Holling, and Doebler (2015) suggested the use of the signed likelihood ratio statistic, which, in our context, is given by
Thus, the absolute value of Ls is equal to the square root of L, and Ls is positive if the examinee’s estimated ability based on s is larger than or equal to that based on
The statistic Ls has an asymptotic standard normal distribution (e.g., Cox, 2006, p. 104) under the null hypothesis of no item preknowledge. It is well known that the square of a standard normal random variable has a
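The construction of the signed likelihood ratio statistic can be illustrated with a minimal sketch for dichotomous items. It assumes the usual form L = 2[ℓ_s(θ̂_s) + ℓ_n(θ̂_n) − ℓ(θ̂)], where ℓ_s and ℓ_n are the log-likelihoods on the compromised and noncompromised items and θ̂ is the estimate from all items; whether this matches the article's equation term by term is an assumption. The function names are hypothetical, and a maximum likelihood estimate truncated to [−6, 6] is swapped in for the WLE to keep the example short.

```python
import math


def p_correct(theta, a, b):
    """2PL probability of a correct response (the Rasch model when a == 1)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def loglik(theta, items, scores):
    """Log-likelihood of dichotomous scores; items is a list of (a, b) pairs."""
    total = 0.0
    for (a, b), y in zip(items, scores):
        p = p_correct(theta, a, b)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total


def bounded_mle(items, scores, lo=-6.0, hi=6.0, tol=1e-10):
    """MLE of ability by bisection on the (decreasing) score function.

    Assumption: the estimate is truncated to [lo, hi] so that all-correct
    and all-incorrect patterns still give a finite value.
    """
    def score(theta):
        return sum(a * (y - p_correct(theta, a, b))
                   for (a, b), y in zip(items, scores))

    if score(lo) <= 0.0:
        return lo
    if score(hi) >= 0.0:
        return hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def signed_lrt(items, scores, compromised):
    """Signed likelihood ratio statistic; compromised is a set of item indices."""
    in_s = [k in compromised for k in range(len(items))]
    items_s = [it for it, flag in zip(items, in_s) if flag]
    y_s = [y for y, flag in zip(scores, in_s) if flag]
    items_n = [it for it, flag in zip(items, in_s) if not flag]
    y_n = [y for y, flag in zip(scores, in_s) if not flag]

    th_s = bounded_mle(items_s, y_s)      # ability estimate from compromised items
    th_n = bounded_mle(items_n, y_n)      # ability estimate from the other items
    th_all = bounded_mle(items, scores)   # ability estimate under the null

    lrt = 2.0 * (loglik(th_s, items_s, y_s) + loglik(th_n, items_n, y_n)
                 - loglik(th_all, items, scores))
    lrt = max(lrt, 0.0)                   # guard against tiny negative values
    sign = 1.0 if th_s >= th_n else -1.0
    return sign * math.sqrt(lrt)
```

An examinee would be flagged at level .05 when the returned value exceeds 1.64, the upper-tail critical value of the standard normal distribution.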
A Statistic Based on the Score Test
Rao’s (1973) score test statistic or Lagrange multiplier test statistic for testing the null hypothesis of equality of the ability parameter over s and
where, for example,
Because the alternative hypothesis in our case is one sided as discussed earlier, it is more appropriate to use the signed score statistic (Cox, 2006, p. 104; Cox & Hinkley, 1974), which, in our context, is given by:
A large value of Rs (that means a large value of R accompanied with
Like Ls, the statistic Rs has an asymptotic standard normal distribution under the null hypothesis of no item preknowledge (e.g., Cox, 2006, p. 104).
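A signed score statistic of this kind can be sketched as follows for dichotomous items. The sketch assumes the common form R = U_s²/I_s + U_n²/I_n, where U and I are the score function and Fisher information of each item subset evaluated at the estimate from all items; this form, the function names, and the use of a truncated maximum likelihood estimate in place of the WLE are assumptions for illustration. The sign is taken from the score on the compromised items, which for a concave log-likelihood agrees with the sign of the difference between the two subset estimates.

```python
import math


def p_correct(theta, a, b):
    """2PL probability of a correct response (the Rasch model when a == 1)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def score_and_info(theta, items, scores):
    """First derivative of the log-likelihood and Fisher information at theta."""
    u, info = 0.0, 0.0
    for (a, b), y in zip(items, scores):
        p = p_correct(theta, a, b)
        u += a * (y - p)
        info += a * a * p * (1.0 - p)
    return u, info


def pooled_mle(items, scores, lo=-6.0, hi=6.0, tol=1e-10):
    """MLE from all items by bisection, truncated to [lo, hi] (an assumption)."""
    if score_and_info(lo, items, scores)[0] <= 0.0:
        return lo
    if score_and_info(hi, items, scores)[0] >= 0.0:
        return hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score_and_info(mid, items, scores)[0] > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def signed_score(items, scores, compromised):
    """Signed score statistic; compromised is a set of item indices."""
    in_s = [k in compromised for k in range(len(items))]
    items_s = [it for it, flag in zip(items, in_s) if flag]
    y_s = [y for y, flag in zip(scores, in_s) if flag]
    items_n = [it for it, flag in zip(items, in_s) if not flag]
    y_n = [y for y, flag in zip(scores, in_s) if not flag]
    if not items_s or not items_n:
        raise ValueError("both item subsets must be nonempty")

    theta = pooled_mle(items, scores)     # estimate under the null hypothesis
    u_s, i_s = score_and_info(theta, items_s, y_s)
    u_n, i_n = score_and_info(theta, items_n, y_n)

    r = u_s ** 2 / i_s + u_n ** 2 / i_n   # two-sided score statistic
    sign = 1.0 if u_s >= 0.0 else -1.0    # positive: better on compromised items
    return sign * math.sqrt(r)
```

Only one ability estimate (the one from all items) is needed here, which is the usual practical attraction of a score test over a likelihood ratio test.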
The Advantage of Ls and Rs Over h
Both the Ls and Rs statistics have a known asymptotic null distribution, the standard normal distribution; that is a clear advantage over the h statistic when the numbers of compromised and noncompromised items are large. Further, the LRT and the score test, which are the basis of Ls and Rs, are quite popular in statistical applications (e.g., Cox & Hinkley, 1974) and in educational measurement (Finkelman, Weiss, & Kim-Kang, 2010; Guo & Drasgow, 2010; Klauer & Rettig, 1990; Romero et al., 2015), so these statistics would be intuitively appealing to measurement practitioners. Further, h is appropriate for two-sided alternative hypotheses and would be large for those who performed considerably better on
A Real Data Application
Let us consider item response data from two forms of a nonadaptive licensure assessment. The data sets were analyzed in several chapters (such as Eckerly, 2017) of Cizek and Wollack (2017) and in Sinharay (2016). Both forms include 170 operational items that are dichotomously scored; 87 items are common between the two forms. Item scores were available for 1,636 examinees for Form 1 and 1,644 examinees for Form 2. The licensure organization (which provided the data) identified 63 and 61 items as compromised on Forms 1 and 2, respectively (though several of these items were common between the forms, so that the combined number of compromised items was 64). The organization also flagged 46 and 48 individuals on Forms 1 and 2, respectively, as possible cheaters on the basis of a variety of statistical analyses and an investigative process that brought in other information. The examinees for each form were distributed across several hundred assessment centers.
The values of h, Ls, and Rs were computed from the two data sets. The Rasch model is operationally used in the assessment; the operational difficulty parameter estimates were used in the calculations. The WLEs of abilities were used to compute the Ls and Rs statistics primarily because the WLEs were always finite (which agrees with similar findings in Magis & Verhelst, 2014), whereas MLEs are infinite for all-correct or all-incorrect response patterns. The WLEs were computed using the Newton–Raphson algorithm.
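A minimal sketch of a Newton–Raphson iteration for Warm's WLE under the Rasch model is given below. It solves Σ(y_i − P_i) + J/(2I) = 0, with I = ΣP_iQ_i and J = ΣP_iQ_i(1 − 2P_i), which is the standard Rasch form of Warm's estimating equation. The function name, starting value, tolerance, and step cap are assumptions for illustration and are not taken from the authors' program.

```python
import math


def rasch_wle(difficulties, scores, start=0.0, tol=1e-8, max_iter=100,
              max_step=1.0):
    """Warm's weighted likelihood estimate of ability under the Rasch model."""
    theta = start
    for _ in range(max_iter):
        p = [1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties]
        w = [pi * (1.0 - pi) for pi in p]                  # P * Q per item
        info = sum(w)                                      # test information I
        j = sum(wi * (1.0 - 2.0 * pi) for wi, pi in zip(w, p))
        # Derivative of J with respect to theta.
        j_prime = sum(wi * (1.0 - 2.0 * pi) ** 2 - 2.0 * wi * wi
                      for wi, pi in zip(w, p))
        # Estimating function: likelihood score plus Warm's correction term.
        g = sum(y - pi for y, pi in zip(scores, p)) + j / (2.0 * info)
        # Exact derivative of g; for the Rasch model dI/dtheta equals J.
        g_prime = -info + (j_prime * info - j * j) / (2.0 * info * info)
        step = -g / g_prime
        step = max(-max_step, min(max_step, step))         # cap the step size
        theta += step
        if abs(step) < tol:
            break
    return theta
```

For example, with hypothetical difficulties of −1.5, −0.5, 0.5, and 1.5, an all-correct pattern yields a finite positive estimate, whereas the MLE for that pattern would be infinite.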
The proportions of examinees flagged by the three statistics at levels of .001, .01, and .05 are provided in Table 1. The first three lines provide the proportions flagged among all examinees. The last three lines provide the proportions flagged only among the (46 or 48) examinees who were flagged by the licensure organization; thus, for example, the proportion .15 for Ls at level .001 for Form 1 in the fifth row of numbers denotes that among the 46 examinees flagged by the licensure organization, 7 examinees were flagged (note that 7/46 ≈ 0.15) at level .001 by Ls. The proportions flagged among those who were not flagged by the licensure organization are very close to those shown in the first three rows (which is expected because the organization flagged only about 2.8% to 2.9% of the examinees, that is, 46 of 1,636 and 48 of 1,644).
The Proportion of Examinees Flagged for the Real Data.
Table 1 shows that Ls and Rs flag more examinees than h in almost all cases. The smaller flagging rates of h, especially in the last three rows of the table, can most likely be attributed to the fact that the examinees who had large values of Ls or Rs were distributed among many more than 20% of the assessment centers and the examinees flagged by the licensure organization were also distributed among several assessment centers rather than being concentrated in a few of them; h is expected to perform relatively well when the examinees with item preknowledge are included in 20% or fewer of the assessment centers. Table 1 also shows that the proportion flagged for all the statistics is much larger among the examinees flagged by the licensure organization (bottom three rows of the table) than among all examinees (top three rows of the table); this result provides some evidence that the suggested statistics are somewhat successful. Note that item compromise was not the only reason for flagging by the licensure organization, so a proportion smaller than 1.0 in the bottom three rows is not a severe limitation of the suggested statistics.
The first five rows of Table 2 provide, for five examinees flagged by the licensure organization on Form 2, the proportion correct scores (the number correct divided by the total number of items) on the compromised items (Ps) and noncompromised items (
Some Details for Six Examinees for the Real Data.
The performance on the compromised items was much better than that on the other items (with respect to both proportions correct and ability estimates) for the first three examinees; among them, the first was flagged by all three statistics at level .001, the second was flagged by Ls and Rs at level .001 but flagged by h at level .01 (and not at level .001), and the third was flagged by Ls and Rs at level .01 but not flagged by h at level .05. Thus, Table 2 shows that h may not flag examinees who are flagged by Ls and Rs. The examinees corresponding to the fourth and fifth rows of the table performed worse on compromised items than on noncompromised items (and must have been flagged by the licensure organization on other grounds) and were not flagged by any of the three statistics. The examinee corresponding to the sixth row was not flagged by Ls and Rs but was flagged by h; however, this examinee performed much worse on the compromised items than on the other items, so there is not enough evidence to flag the examinee for item preknowledge. The use of h leads to a false alarm for this examinee because h is not suited to testing one-sided hypotheses.
A Simulation Study Involving a Nonadaptive Assessment
A simulation study was performed to compare the Type I error rates and power of the h statistic to those of Ls and Rs for nonadaptive assessments.
Design of the Simulation
The simulation study involved a 170-item assessment calibrated using the Rasch model like the above-mentioned operational assessment. The operational item parameter estimates of Form 1 of the assessment were used as the true item parameters to generate data in the simulations and also in the computation of the statistics for detecting item preknowledge. The true abilities of all the examinees were simulated from the
A total of 1,000 assessment centers were used, with 100 examinees in each of them. The percentage of aberrant test centers was 10 or 20. Note that any value larger than 20 would clearly lead to a smaller power of the h statistic (as described earlier); however, no such value was used. Thus, the simulation cases considered here are the best-case scenarios for the h statistic. The percentage of aberrant examinees in the aberrant assessment centers was 10 or 20; the percentage of aberrant examinees in the nonaberrant assessment centers was 0. The size of S was assumed to be 10, 20, or 30. It was assumed that each aberrant examinee had preknowledge of the whole of S.
For any simulation condition (where an example of a simulation condition is “10 aberrant items, 10% aberrant assessment centers, and 10% aberrant examinees”), the set S was chosen as a random subset of 10, 20, or 30 items from all items of Form 1. Then, 10% or 20% of the 1,000 centers were randomly chosen as aberrant. Then, 10% or 20% of the 100 examinees in each aberrant center were randomly chosen as aberrant. Then, the following steps were performed for each examinee:
1. Simulate the true ability parameter (θ) from the assumed ability distribution.
2. If the examinee is not an aberrant examinee, then simulate the item scores from the Rasch model using the above-mentioned true item parameters and θ simulated in the above step.
3. If the examinee is an aberrant examinee, then simulate the item scores on the items that are not in S from the Rasch model using the above-mentioned true item parameters and θ simulated above, but simulate the item scores on the items that are in S as draws of a Bernoulli random variable with a success probability of .9 (rather than being simulated from the Rasch model).
4. Compute three WLEs of the ability of the examinee, one each from the simulated item scores on s = S, the simulated item scores on the noncompromised items, and the simulated item scores on all items.
5. Compute each test statistic for the examinee. Use the WLEs computed in the above step to compute Ls and Rs.
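The score-generation part of these steps can be sketched as follows. The function name and arguments are hypothetical, and the item difficulties in any call are made-up values rather than the operational estimates; only the Rasch model and the success probability of .9 on compromised items come from the description above.

```python
import math
import random


def simulate_scores(theta, difficulties, compromised, aberrant, rng,
                    p_known=0.9):
    """Simulate dichotomous item scores for one examinee.

    Scores follow the Rasch model, except that an aberrant examinee answers
    each item in the compromised set correctly with probability p_known.
    """
    scores = []
    for k, b in enumerate(difficulties):
        if aberrant and k in compromised:
            p = p_known                               # preknowledge of the item
        else:
            p = 1.0 / (1.0 + math.exp(-(theta - b)))  # Rasch probability
        scores.append(1 if rng.random() < p else 0)
    return scores
```

A call such as `simulate_scores(-1.0, [-0.5, 0.0, 0.5], {1}, True, random.Random(1))` would generate three scores for a low-ability examinee who has preknowledge of the second item; passing a seeded `random.Random` instance keeps a simulation reproducible.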
For Ls and Rs, the Type I error rates were computed at levels of .001, .01, and .05 as the proportion of the values of the corresponding statistic among the nonaberrant examinees that were larger than the corresponding critical values (3.09, 2.33, and 1.64, respectively) from the standard normal distribution. The Type I error rate of the h statistic was computed as the proportion of the nonaberrant examinees that were flagged by algorithm 2 of Belov (2013). The standard error (SE) corresponding to the Type I error rate is about .0003 at level .01 and about .0007 at level .05.
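The critical values and standard errors quoted above can be checked with a few lines of code. The helper names are hypothetical, and the sample size of roughly 100,000 nonaberrant examinees (1,000 centers of 100 examinees, less the small share of aberrant ones) is an assumption that happens to reproduce the quoted SEs.

```python
import math
from statistics import NormalDist


def critical_value(alpha):
    """Upper-tail critical value of the standard normal distribution."""
    return NormalDist().inv_cdf(1.0 - alpha)


def flag_rate(values, alpha):
    """Proportion of statistic values that exceed the critical value."""
    crit = critical_value(alpha)
    return sum(v > crit for v in values) / len(values)


def binomial_se(p, n):
    """Standard error of a proportion p estimated from n examinees."""
    return math.sqrt(p * (1.0 - p) / n)
```

Applied to the nonaberrant examinees, `flag_rate` gives a Type I error rate; applied to the aberrant examinees, it gives power. Under the assumed sample size, `binomial_se(0.01, 100_000)` is about .0003 and `binomial_se(0.05, 100_000)` is about .0007.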
For Ls and Rs, the power was computed as the proportion of the values of the corresponding statistic among the aberrant examinees that were larger than the corresponding critical values from the standard normal distribution. The power of the h statistic was computed as the proportion of the aberrant examinees that were flagged by algorithm 2 of Belov (2013). The SE corresponding to the power is smaller than .015.
The value of L was negative and close to zero for a very small number of examinees in the simulations. While L should be nonnegative in theory, negative values may occur when the number of items is small or because of sampling fluctuations; the values of the other statistics indicate that the corresponding examinees performed very similarly on compromised and noncompromised items. For such examinees, Ls was set equal to 0.
Results on Type I Error Rates
The Type I error rates hardly varied over the simulation conditions (governed by the percentage of aberrant assessment centers and examinees) when the size of S is fixed, so they were pooled together for each size of S. Table 3 displays the pooled Type I error rates of the three statistics for levels of .001, .01, and .05 for each size of S. While the Type I error rates of h are considerably smaller than the nominal level in all conditions, those of Ls and Rs are close to the nominal level. The Type I error rates of Ls are occasionally slightly larger than the nominal level, for example, for 10 compromised items, but are satisfactory in all cases according to Cochran’s (1952) criterion for robustness (p. 394) that deems Type I error rates smaller than .06, .015, and .0015 to be satisfactory at levels .05, .01, and .001, respectively; also, the Type I error rates of Ls and Rs become closer to the nominal level as the size of S increases, which provides evidence that their null distribution converges to the standard normal distribution as the number of items in S increases.
Type I Error Rates for Nonadaptive Assessments.
Results on Power
Table 4 presents the values of power for different simulation conditions for levels of .001, .01, and .05. The first column (with title “Size of S”) denotes the number of items in S and the second column (with title “Statistic”) denotes the statistic. Columns 3 through 5, with title “10%, 10%” with “.001, .01, .05” below it, show the power for 10% aberrant assessment centers and 10% aberrant examinees at levels of .001, .01, and .05, respectively; columns 6 through 8, with title “10%, 20%” with “.001, .01, .05” below it, show the power for 10% aberrant assessment centers and 20% aberrant examinees for the three levels; columns 9 through 11, with title “20%, 10%” with “.001, .01, .05” below it, show the power for 20% aberrant assessment centers and 10% aberrant examinees for the three levels; and columns 12–14, with title “20%, 20%” with “.001, .01, .05” below it, show the power for 20% aberrant assessment centers and 20% aberrant examinees for the three levels.
Power for Nonadaptive Assessments.
Note. Columns 3–5 show the power for 10% aberrant assessment centers and 10% aberrant examinees at levels of .001, .01, and .05, respectively; columns 6–8 show the power for 10% aberrant assessment centers and 20% aberrant examinees for the three levels; columns 9–11 show the power for 20% aberrant assessment centers and 10% aberrant examinees for the three levels; columns 12–14 show the power for 20% aberrant assessment centers and 20% aberrant examinees for the three levels.
The table shows that h is the least powerful among the three statistics and Ls is the most powerful. The power of Ls is often substantially larger than that of h (e.g., .73 vs. .42 for 10 compromised items, 20% aberrant centers, and 10% aberrant examinees). The power of Rs is quite close to that of Ls in all cases. The power of each statistic increases as S becomes larger.
The Type I error rate and power of the
A Simulation Study Involving a CAT
A detailed simulation study, somewhat similar to that in Belov (2013), was performed to compare the Type I error rates and power of the h statistic to those of the new statistics for CATs.
Design of the Simulation
The simulation study involved a 50-item CAT. The 3PLM was used as the IRT model. As in Belov (2013), an item pool of 500 items was used in all simulations. The true item parameters of the items of the pool were simulated. The true ai or slope parameters of the items in the pool were simulated from an
Computation
The steps in the simulation were to:
1. Simulate a CAT without item preknowledge for the 1,000 centers with 100 examinees each; all the scores are generated from the 3PLM.
2. Compute for each item the exposure rate, which is the percentage of examinees who received the item.
3. Pick the 4, 12, 20, or 30 items from the item pool that have the largest exposure rates, and let these items constitute the set S.
4. Randomly select 10% or 20% aberrant assessment centers and randomly select 10% or 20% aberrant examinees in each aberrant assessment center.
5. Simulate a CAT only for the aberrant assessment centers and aberrant examinees under the assumption that the set S determined in Step 3 is compromised. To do this:
(a) For each aberrant examinee, simulate a CAT as above (using the 3PLM), except that when the examinee receives any item in S, simulate the score on the item as a draw of a Bernoulli random variable with a success probability of .9 (rather than simulating the score from the 3PLM).
(b) Let the scores of these aberrant examinees replace the scores of the same number of nonaberrant examinees generated in the first step (to ensure that each assessment center ends up having the same number of examinees).
6. Compute three WLEs of the ability of the examinee, one each from s, the noncompromised items administered to the examinee, and all items administered to the examinee.
7. Compute each test statistic for each examinee. Use the WLEs computed in the above step to compute Ls and Rs.
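The exposure-rate computation and the choice of S in the steps above can be sketched as follows. The function name and the representation of the CAT output (one list of administered item indices per examinee) are assumptions for illustration; ties in exposure rate are broken by item index, which the description above does not specify.

```python
from collections import Counter


def most_exposed_items(administered, pool_size, k):
    """Return the k items with the largest exposure rates and all the rates.

    administered: one list of administered item indices per examinee.
    The exposure rate of an item is the percentage of examinees who
    received it.
    """
    counts = Counter()
    for items in administered:
        counts.update(set(items))          # count each examinee at most once
    n = len(administered)
    rates = {i: 100.0 * counts[i] / n for i in range(pool_size)}
    ranked = sorted(range(pool_size), key=lambda i: (-rates[i], i))
    return set(ranked[:k]), rates
```

With a 500-item pool and 100,000 simulated examinees, `most_exposed_items(administered, 500, 12)` would return the 12-item compromised set for the corresponding condition.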
The average size of s over the examinees was 3.3, 8.8, 12.2, and 14.3 when S included 4, 12, 20, and 30 items, respectively. Note that the above design was used to replicate the simulations of Belov (2013). In limited simulations under a few other designs of assigning compromised items to aberrant examinees, the Type I error rates of the statistics were very similar and the power values differed, but the relative performance of the statistics was very similar.
A Fortran 90 computer program written by the authors was used for all the computations including the computation of the ability estimates. The WLEs were computed using the Newton–Raphson algorithm.
For Ls and Rs, the Type I error rate was computed as the proportion of the values of the corresponding statistic among the nonaberrant examinees that were larger than the corresponding critical values from the standard normal distribution. The Type I error rate of the h statistic was computed as the proportion of the nonaberrant examinees that were flagged by algorithm 2 of Belov (2013).
For Ls and Rs, the power was computed as the proportion of the values of the corresponding statistic among the aberrant examinees that were larger than the critical values from the standard normal distribution. The power of the h statistic was computed as the proportion of the aberrant examinees that were flagged by algorithm 2 of Belov (2013).
Results on Type I Error Rates and Power
The Type I error rates hardly varied over the simulation conditions (governed by the percentage of aberrant assessment centers and examinees) when the number of aberrant items is fixed—so they were pooled together. Table 5 displays the pooled Type I error rates of the three statistics for levels of .001, .01, and .05 for 4, 12, and 20 aberrant items—the rates for 30 aberrant items were very close to those for 20 aberrant items and are not shown. While the Type I error rates of h are considerably smaller than the nominal level in all conditions, those of Ls and Rs are very close to the nominal level. The Type I error rates of Rs are slightly smaller than those of Ls.
Type I Error Rates for CAT.
Note. CAT = computerized adaptive test.
Table 6 displays the Type I error rates at three levels of significance when the size of s is 1–4 in the case of 4 aberrant items. The percentage of examinees for each size of s is also provided in the second column of the table. The table shows that Ls and Rs become more conservative as the size of s decreases. For example, at level .05, the Type I error rate of Rs was .050 when the size of s is 4 but only .001 when the size of s is 1; one source of this conservativeness is that the average SE of the WLE is 1.37, 0.99, 0.97, and 0.91 when the size of s is 1, 2, 3, and 4, respectively. While these numbers indicate that Ls and Rs will suffer from small power for very small sizes of s, they also indicate that Ls and Rs will not incorrectly flag too many nonaberrant examinees in these cases.
Type I Error Rates of Ls and Rs for CATs With 4 Aberrant Items.
Note. Columns 3–5 provide the Type I error rates of Rs and columns 6–8 provide the Type I error rates of Ls, at significance levels (α) of .001, .01, and .05. CAT = computerized adaptive test.
Table 7, which looks very similar to Table 4, presents the power values for different simulation conditions for levels of .001, .01, and .05. The table shows that the power of Ls is the largest followed by that of Rs and then that of the h statistic. The power of Ls is often substantially larger than that of h (e.g., .76 vs. .55 for 12 compromised items, 20% aberrant centers, and 10% aberrant examinees at level .05). The power of each statistic increases as the number of compromised items increases because it is easier to detect item preknowledge when the number of compromised items is larger.
Power for CAT.
Note. Columns 3–5 show the power for 10% aberrant assessment centers and 10% aberrant examinees at levels of .001, .01, and .05, respectively; columns 6–8 show the power for 10% aberrant assessment centers and 20% aberrant examinees for the three levels; columns 9–11 show the power for 20% aberrant assessment centers and 10% aberrant examinees for the three levels; columns 12–14 show the power for 20% aberrant assessment centers and 20% aberrant examinees for the three levels. CAT = computerized adaptive test.
The power of h in Table 7 is often smaller than that in Belov (2013). That is primarily because the true abilities of the aberrant examinees were simulated in Belov (2013) from a uniform distribution between −3 and 0 but from a standard normal distribution here.
Discussion on the Type I Error Rates and Power
The above simulations show that the h statistic has the smallest Type I error rates among the three statistics considered here, but it also has the smallest power, often by a large margin. Combined with the practical limitations of the h statistic mentioned earlier, these results provide strong evidence that the Ls and Rs statistics should be preferred over the h statistic.
Further, note that the above simulations, with 20% or fewer aberrant assessment centers, represented best-case scenarios for the h statistic. When the number of aberrant items is large, the number of aberrant assessment centers is likely to be large as well; in that case, the choice of not flagging any examinee in 80% of the assessment centers in Algorithm 2 of Belov (2013) would lead to low power for h. To examine this issue further, some limited simulations were performed with 12 aberrant items, 20% aberrant assessment centers with 10% aberrant examinees each, and an additional 10% aberrant assessment centers with 5% aberrant examinees each. The power of the h statistic was much smaller than that of Ls and Rs because the h statistic failed to detect any of the aberrant examinees in the assessment centers with 5% aberrant examinees each.
Conclusions
Standard 6.6 of the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 2014, p. 137) includes the recommendation that when the assessment results have important consequences, score integrity should be supported through active efforts to prevent, detect, and correct scores obtained by fraudulent or deceptive means. As a consequence, there is a growing interest in detection of test fraud. This article suggests two new statistics for detecting individuals who may have benefited from item preknowledge on nonadaptive and adaptive tests that may include a mix of dichotomous and polytomous items. The suggested statistics are the first, to the author's knowledge, that apply to assessments that include some polytomous items. In simulation studies, for both nonadaptive and adaptive assessments, the Type I error rates of the suggested statistics were found to be close to the nominal level, and their power was found to be considerably larger on average than that of a statistic based on the Kullback–Leibler divergence (Belov, 2013).
Statistical indices for detecting item preknowledge are useful for providing confirming evidence of inappropriate behavior when evidence from other sources also exists, but the evidence provided by statistical indices is insufficient by itself. For example, Hanson, Harris, and Brennan (1994) commented that no statistical method on its own can provide conclusive proof that copying occurred (p. 25); the comment is true about item preknowledge as well. Researchers such as Tendeiro and Meijer (2014, p. 257) recommended complementing statistical indices for detecting irregularities with other sources of information such as seating charts, video surveillance, or follow-up interviews.
There are several limitations of this article and, consequently, several related topics can be further investigated. First, while the simulation study was detailed, it is possible to perform more simulations, possibly with polytomous items, different true item parameters, different test lengths, different IRT models, and different types of preknowledge. Second, the behavior of the suggested statistics can be further examined for a small number of compromised or noncompromised items. The asymptotic null distribution of these statistics is expected to hold only when the numbers of compromised items and noncompromised items administered to an examinee are both large; if one of these numbers is small, a simulation-based null distribution should be used, or one should not use any of these statistics. Third, the item parameters were treated as known in this article. This assumption is reasonable, especially in the context of CATs, where item parameters are assumed known and the ability parameters are estimated. However, it is possible to estimate item parameters and consider the effect of the estimation on the properties of the indices considered here. Fourth, this article considers the case when the investigator precisely knows which items are compromised. In reality, there may be uncertainty about the identity of these items; in the presence of such uncertainty, the suggested statistics can still be used, but they would lead to smaller power than that found in the simulations here. More research is possible on this issue. Fifth, the Type I error rate and power of the statistics suggested here may be compared to those of the Bayesian index suggested by McLeod et al. (2003) in future research. The latter does not involve the assumption that the set of aberrant items is known, so it is expected to lead to smaller power than the statistics in this article (some limited simulations confirmed this).
Sixth, the article by Belov (2016) was published at a late stage of the preparation of this article; a comparison of Ls and Rs with the best performing statistics in Belov (2016) is a topic of future research. Finally, it would be interesting to combine the statistics here with the methods of van der Linden and van Krimpen-Stoop (2003) and Qian et al. (2016) that use response times to detect item preknowledge.
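The simulation-based null distribution mentioned in the second limitation can be illustrated with a minimal Monte Carlo sketch. It is only an outline under stated assumptions and not the procedure used in this article: responses are generated from a two-parameter logistic model with known item parameters at a fixed ability estimate, and `standardized_score` is a hypothetical stand-in statistic rather than Ls or Rs. All function names and numerical values are illustrative.

```python
import math
import random


def prob_correct(theta, a, b):
    """2PL probability of a correct response (model assumed for this sketch)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def standardized_score(responses, probs):
    """Hypothetical stand-in statistic: standardized observed-minus-expected score."""
    expected = sum(probs)
    variance = sum(p * (1.0 - p) for p in probs)
    return (sum(responses) - expected) / math.sqrt(variance)


def simulate_null_values(probs, n_replications=5000, seed=1):
    """Simulate the statistic under the null hypothesis of no preknowledge."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_replications):
        responses = [1 if rng.random() < p else 0 for p in probs]
        values.append(standardized_score(responses, probs))
    return values


def monte_carlo_p_value(observed, null_values):
    """Upper-tail Monte Carlo p-value with the usual add-one correction."""
    exceed = sum(1 for value in null_values if value >= observed)
    return (exceed + 1) / (len(null_values) + 1)


# Illustrative setup: three compromised items given as (a, b) pairs and an
# ability estimate of 0; none of these values come from the article.
theta_hat = 0.0
compromised_items = [(1.0, 0.5), (1.2, 1.0), (0.8, 1.5)]
probs = [prob_correct(theta_hat, a, b) for a, b in compromised_items]
null_values = simulate_null_values(probs)
observed = standardized_score([1, 1, 1], probs)  # all three items correct
p_value = monte_carlo_p_value(observed, null_values)
print(round(p_value, 3))
```

With only three items, the stand-in statistic takes just a handful of distinct values, which is why the simulated distribution is used here in place of a normal approximation. A fuller version would re-estimate the ability in each replication instead of holding it fixed.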
Acknowledgments
The author would like to thank the editor Daniel McCaffrey and the three anonymous reviewers for several helpful comments that led to a significant improvement of the article. The author would also like to thank James A. Wollack for sharing a data set that was used in this article.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
