Abstract
The prevalence of high-stakes test scores as a basis for significant decisions necessitates the dissemination of accurate and fair scores. However, the magnitude of these decisions has created an environment in which examinees may be prone to resort to cheating. To reduce the risk of cheating, multiple test forms are commonly administered. When multiple forms are employed, the forms must be equated to account for potential differences in form difficulty. If cheating occurs on one of the forms, the equating procedure may produce inaccurate results. A simulation study was conducted to examine the impact of cheating on item response theory (IRT) true score equating. Recovery of equated scores and scaling constants was assessed for the Stocking–Lord IRT scaling method under various conditions. Results indicated that cheating artificially increased the equated scores of the entire examinee group that was administered the compromised form. Future research should focus on the identification and removal of compromised items.
A single score on a high-stakes test is often used to make potentially life-altering decisions about an individual. For example, test scores may determine whether an individual is certified to practice in a chosen profession or a student is accepted into an educational institution. The magnitude of the consequences associated with high-stakes tests has created an environment where some examinees resort to cheating (Cizek, 1999). With the increased prevalence of high-stakes testing, specifically in education through the No Child Left Behind (NCLB; 2001) legislation, the threat of examinee impropriety has increased dramatically (Cohen & Wollack, 2006). When stakes are high, the testing organization must be certain that test scores accurately reflect each individual’s true ability. If any form of cheating, such as prior knowledge of the item, influences the responses on a test, scores will inaccurately represent the individual’s ability, and decisions based on the scores will be dubious at best (Haladyna & Downing, 2004).
The susceptibility of high-stakes tests to cheating behaviors is increased through the reuse of test items. To prevent the spread of test items, testing companies frequently use multiple forms when tests are administered on different occasions. Despite careful attempts to create parallel test forms, the multiple forms that testing programs develop often vary in terms of difficulty. As a result, slight differences in form difficulty may unfairly disadvantage some test takers. To assess test takers accurately, equating designs are utilized in an effort to produce comparable scores across forms of varying difficulty. One of the most common equating designs employed in large-scale testing is the nonequivalent anchor test design (NEAT; Holland, 2007). Under the NEAT design, common items, also called anchor items, are placed on the multiple forms of the test and used to estimate the relationship between form difficulties. Consequently, as the number of forms increases, the exposure rate of anchor items also increases, rendering them prone to contamination. Because the anchor items are integral to the equating process, the contamination of anchor items will not only distort the meaning of the test scores of individuals with preknowledge of the items but also inaccurately represent the differences between the test forms. Thus, compromised anchor items may potentially alter the scores for all the examinees who were administered the form.
Given the important decisions made from and the expanding use of high-stakes testing, it is necessary to investigate the potential effects of anchor item compromise on the equating process. Although much attention has been devoted to item exposure in computer adaptive testing (CAT; e.g., Finkelman, Nering, & Roussos, 2009) and cheating detection methods (e.g., Wollack, 2006), little emphasis has been placed on understanding the interaction between cheating and the equating process.
Item Response Theory (IRT) in Equating
A common approach to equating test scores is IRT. The advantages of IRT, specifically the invariance of item and ability parameters, make its use in test equating particularly appealing (Hambleton, Swaminathan, & Rogers, 1991). Under IRT, separate calibrations yield item parameters on different metrics. If the assumptions of IRT hold, the resulting item parameter metrics differ by a linear transformation. A critical step in the equating process is to obtain the appropriate slope (A) and intercept (B) constants of this transformation to place the ability and item parameters from different tests on the same scale. This process is termed scaling. The ability value θ for examinee i on a calibration of a new form (NF) can be converted to the scale of the base form (BF) through the following:

$$\theta_{BF_i} = A\,\theta_{NF_i} + B \tag{1}$$
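The linear transformations for individual item parameters are calculated as follows:

$$a_{BF_j} = \frac{a_{NF_j}}{A} \tag{2}$$

$$b_{BF_j} = A\,b_{NF_j} + B \tag{3}$$

$$c_{BF_j} = c_{NF_j} \tag{4}$$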
where j refers to item j on the specified test. The c parameter is independent of the scale transformation.
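The transformation described above can be checked numerically. The following is a minimal sketch, assuming the standard linear forms given by Kolen and Brennan (2004), a 3PL response function, and a scaling factor of D = 1.7 for normal-metric discrimination parameters. The function names and the example values are hypothetical and are not taken from the study.

```python
import math

D = 1.7  # assumed scaling factor for normal-metric a parameters


def p3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def rescale_theta(theta_nf, A, B):
    """Place an NF ability on the BF scale."""
    return A * theta_nf + B


def rescale_item(a_nf, b_nf, c_nf, A, B):
    """Place NF item parameters on the BF scale; c is unchanged."""
    return a_nf / A, A * b_nf + B, c_nf


# Hypothetical scaling constants, examinee, and item.
A, B = 1.25, -0.5
theta_nf = 0.3
a_nf, b_nf, c_nf = 0.95, 0.13, 0.19

p_nf = p3pl(theta_nf, a_nf, b_nf, c_nf)
p_bf = p3pl(rescale_theta(theta_nf, A, B), *rescale_item(a_nf, b_nf, c_nf, A, B))
print(p_nf, p_bf)  # the two probabilities agree up to rounding error
```

The probability of a correct response is the same on either scale because the product a(θ − b) is unchanged by the transformation, which is why a single pair of constants suffices when the model holds.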
Accurate scaling is essential for the equating process to reflect the true difference between forms. Under the NEAT design, scaling constants are estimated entirely through the anchor items. Thus, any source of error in the estimation of the anchor item parameters could result in inaccurate scaling constants.
The A and B constants required for the transformation can be derived using several methods (Kolen & Brennan, 2004). This study focuses on characteristic curve methods for equating. Characteristic curve methods consider the item parameters holistically, transforming the probability of correct response for an item on the NF scale to the scale of the BF through the following:

$$p_{ij}\left(\theta_{BF_i};\, a_{BF_j},\, b_{BF_j},\, c_{BF_j}\right) = p_{ij}\left(A\,\theta_{NF_i} + B;\, \frac{a_{NF_j}}{A},\, A\,b_{NF_j} + B,\, c_{NF_j}\right) \tag{5}$$
Equation 5 allows for all parameters for each individual anchor item to be considered simultaneously in producing a common metric. However, as item parameter values are estimates, there is no guarantee that one set of scaling constants will produce perfect concordance for the probability of correct response across all examinees and items (Kolen & Brennan, 2004).
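Stocking and Lord (1983) proposed that a function minimizing the difference between the test characteristic curves (TCCs) across anchor items on the BF and the NF would produce the correct scaling constants to transform the metric. As shown in Equation 6, the Stocking–Lord method locates the A and B constants that minimize the squared difference between the sums of the anchor item characteristic curves (ICCs) on the two forms, accumulated across all examinees, i, and anchor items, j:

$$F(A, B) = \sum_{i}\left[\sum_{j} p_{ij}\left(\theta_{BF_i};\, a_{BF_j},\, b_{BF_j},\, c_{BF_j}\right) - \sum_{j} p_{ij}\left(\theta_{BF_i};\, \frac{a_{NF_j}}{A},\, A\,b_{NF_j} + B,\, c_{NF_j}\right)\right]^{2} \tag{6}$$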
A multivariate search technique is used to solve for the scaling constants.
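The Stocking–Lord criterion and its minimization can be illustrated with a short sketch. This is not the software used in the study: a crude compass (pattern) search stands in for the multivariate search, a fixed grid of ability values stands in for the examinees, D = 1.7 is assumed, and all names and item parameters are hypothetical.

```python
import math

D = 1.7  # assumed scaling factor


def p3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def tcc(theta, items):
    """Test characteristic curve: sum of the ICCs at theta."""
    return sum(p3pl(theta, a, b, c) for a, b, c in items)


def sl_criterion(A, B, bf_items, nf_items, thetas):
    """Sum over thetas of the squared TCC difference on the BF scale."""
    rescaled = [(a / A, A * b + B, c) for a, b, c in nf_items]
    return sum((tcc(t, bf_items) - tcc(t, rescaled)) ** 2 for t in thetas)


def stocking_lord(bf_items, nf_items, thetas, A=1.0, B=0.0, step=0.5, tol=1e-6):
    """Compass search: try +/- step in A and B, halve step when stuck."""
    best = sl_criterion(A, B, bf_items, nf_items, thetas)
    while step > tol:
        improved = False
        for dA, dB in ((step, 0.0), (-step, 0.0), (0.0, step), (0.0, -step)):
            cand_A, cand_B = A + dA, B + dB
            if cand_A <= 0.0:
                continue  # the slope must stay positive
            value = sl_criterion(cand_A, cand_B, bf_items, nf_items, thetas)
            if value < best:
                A, B, best, improved = cand_A, cand_B, value, True
        if not improved:
            step /= 2.0
    return A, B


# Hypothetical anchor items (a, b, c) on the BF scale.
bf_items = [(0.8, -1.0, 0.20), (1.0, -0.3, 0.15), (1.2, 0.2, 0.20),
            (0.9, 0.8, 0.25), (1.1, 1.5, 0.20)]
# The same items expressed on an NF scale that differs by known constants.
TRUE_A, TRUE_B = 1.2, -0.4
nf_items = [(a * TRUE_A, (b - TRUE_B) / TRUE_A, c) for a, b, c in bf_items]
thetas = [-3.0 + 0.5 * k for k in range(13)]

A_hat, B_hat = stocking_lord(bf_items, nf_items, thetas)
print(A_hat, B_hat)
```

Because the NF parameters here are an exact linear transformation of the BF parameters, the criterion is zero at the true constants. With estimated parameters, as in the study, the minimum is generally above zero.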
Effect of Compromised Items in IRT
Inherently, IRT equating is dependent on accurate parameter estimation. Compromised items introduce construct-irrelevant variance that assuredly distorts the estimation of parameters. For example, the ability parameter will reflect ability and prior knowledge of the item when compromised items are present. Few studies have examined the effect of compromised items on IRT ability or item parameters. Of the studies investigating this form of cheating in IRT, Yi, Zhang, and Chang (2008) compared the error in ability estimates resulting from compromised items under various CAT selection criteria. Ability estimates displayed severe positive bias. The mean difference between estimated and true abilities for low-ability examinees (−3.880 ≤ θ < −0.890) increased by an average of more than 1 SD in the majority of the conditions investigated. The influence of compromised items on examinee ability estimates decreased as the initial true ability increased. Similarly, Guo, Tay, and Drasgow (2009) found that compromised items led to drastic overestimation of the ability estimates for low-ability examinees, resulting in tests that failed to discriminate among examinees.
Jurich, Goodman, and Becker (2010) investigated the effects of compromised items on the passing status of examinees. Using IRT observed-score equating, the study compared how mean–sigma, Stocking–Lord, and fixed anchor methods recovered the correct status of examinees. Under all scaling methods, examinees passed at drastically higher rates than expected. Unexpectedly, cheaters and honest test takers completing the NF benefited from the compromised items. Jurich, Goodman, and Becker hypothesized that the scaling methods incorrectly adjust for differences in ability when anchor items become compromised and that this incorrect adjustment benefits honest test takers as well.
Purpose
The purpose of the current study is to examine the impact of compromised anchor items on the equating process. Specifically, the study investigates how alterations in the equating process caused by cheating on anchor items translate into errors in recovering scaling constants and equated scores. Scaling constants provide a measure for assessing how cheating specifically affects IRT equating. The evaluation of equated scores serves as a global measure to analyze the impact that the interaction between cheating and equating has on the scores reported to examinees.
Method
A simulation was conducted to investigate the impact of anchor compromise on equated scores. Data were generated to simulate a NEAT design in which anchor items have been exposed to the population through repeated test administrations. Specifically, the simulation addresses the common situation in which a test is administered on two successive occasions, requiring the use of two forms. The first administration of the test was created as if the items were unique; thus, none of the items on the original form were compromised. The second form required the use of common items for the NEAT equating, exposing these items to potential cheating. Hence, only anchor items on the NF were subject to possible cheating.
Test Generation
Two forms of a test, both containing 100 items, were generated for each replication of this study. One test represented a BF, the form that sets the scale of the scores. The other generated test form represented a NF of the test, which was equated to the scale of the BF. As this study used a NEAT design to equate scores from the NF to the BF scale, a certain number of items must be common across the two forms. The current study followed Angoff’s (1984) guidelines in using 20 items as the anchor set. Previous research suggests that 20 anchor items should ensure reliable estimation of ability differences (McKinley & Reckase, 1981; Vale, Maurelli, Gialluca, Weiss, & Ree, 1981; Wingersky & Lord, 1984). The remaining 80 items were unique to each form. In one condition, the anchor set was internal to the test; examinee responses to anchor items counted directly to their total score. In another condition, the anchor set was treated as external; the anchor items were used for the scaling process only and did not contribute to examinees’ total scores. When the anchor set is internal, cheaters benefit directly from higher scores on the compromised items as well as indirectly through any effects on the equating function. Comparing the results from internal and external anchors separates these effects because in the external condition any cheating effect operates through the equating function.
Item parameters from the 1998 administration of the National Assessment of Educational Progress (NAEP) mathematics test (Allen, Donoghue, & Schoeps, 2001) were used as the basis for creating the BF and the NF. Items with extreme location (b) or discrimination (a) parameters were excluded from selection to minimize the potential for estimation issues confounding the results. Specifically, items with b parameters with absolute values greater than 2.8 or a parameters on the normal metric above 1.7 or below 0.4 were removed from the item set. This process left a total of 236 items available for selection. Means and standard deviations for the item parameters retained in the bank were as follows: a (M = 0.95, SD = 0.30), b (M = 0.13, SD = 1.07), and c (M = 0.19, SD = 0.06).
For each replication, item parameters for both forms were randomly selected from the pool of 236. Anchor items were created by holding 20 randomly selected items constant across the two forms. That is, the item parameters were equivalent for these items across both forms. The unique items were then sampled without replacement from the remaining 216 item parameters.
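The form assembly for one replication can be sketched as follows. The sketch works with item indices rather than the NAEP parameters themselves, and it assumes that the two sets of 80 unique items do not overlap across forms, consistent with the description of these items as unique to each form. The function name and the seed argument are hypothetical.

```python
import random


def assemble_forms(pool_size=236, n_anchor=20, n_unique=80, seed=None):
    """Return (anchors, bf, nf) as lists of item indices into the pool."""
    rng = random.Random(seed)
    pool = list(range(pool_size))

    # Anchor items are held constant across both forms.
    anchors = rng.sample(pool, n_anchor)
    anchor_set = set(anchors)

    # Unique items are drawn without replacement from the remaining items.
    remaining = [i for i in pool if i not in anchor_set]
    unique = rng.sample(remaining, 2 * n_unique)

    bf = anchors + unique[:n_unique]   # base form: 20 anchor + 80 unique
    nf = anchors + unique[n_unique:]   # new form: same anchors, other unique
    return anchors, bf, nf


anchors, bf, nf = assemble_forms(seed=1)
print(len(bf), len(nf), len(set(bf) & set(nf)))
```

Drawing all 160 unique items in a single call guarantees that the only items shared by the two forms are the anchors.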
Examinee Population
The probability of correct response on each item was generated for 3,000 simulated test takers using a three-parameter logistic (3PL) IRT model. This sample size was chosen to reflect a typical sample for a large-scale test. In addition, a large sample size reduces the risk of estimation problems affecting the results of the study. The probability of correct response was calculated using a randomly generated latent ability for each examinee in conjunction with the item parameters of the form administered to that examinee.
The IRT latent ability of examinees responding to the BF was generated from a standard normal distribution denoted as N(0,1). The latent ability distribution of examinees administered the NF was systematically varied to compare situations in which the examinees responding to the NF had an ability distribution that differed from the BF group in mean and/or variance. Four levels of this condition were investigated: N(0,1), N(−0.5,1), N(0,1.25), or N(−0.5,1.25). The mean NF ability was selected to be lower than the mean BF ability to prevent a situation in which the majority of examinees score near the maximum possible value. A potential ceiling effect in the data would make capturing the benefits of cheating problematic.
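A minimal sketch of the examinee generation step is given below. It assumes a 3PL response function with D = 1.7 and treats the second value in each N(·,·) pair as a standard deviation. The text does not say whether 1.25 is a variance or a standard deviation, so this is an assumption, as are the function names and the single example item.

```python
import math
import random

D = 1.7  # assumed scaling factor for normal-metric a parameters

# NF ability conditions as (mean, spread); spread is ASSUMED to be an SD.
NF_CONDITIONS = [(0.0, 1.0), (-0.5, 1.0), (0.0, 1.25), (-0.5, 1.25)]


def p3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def simulate_probabilities(items, mean, spread, n_examinees=3000, seed=None):
    """Draw latent abilities and return them with the response probabilities."""
    rng = random.Random(seed)
    thetas = [rng.gauss(mean, spread) for _ in range(n_examinees)]
    probs = [[p3pl(t, a, b, c) for a, b, c in items] for t in thetas]
    return thetas, probs


# One hypothetical item with parameters near the item-bank means.
items = [(0.95, 0.13, 0.19)]
thetas, probs = simulate_probabilities(items, mean=-0.5, spread=1.25, seed=7)
print(sum(thetas) / len(thetas))
```

If 1.25 is instead a variance, the call to `rng.gauss` would take `math.sqrt(1.25)` as its second argument.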
Cheating
Cheating conditions
Two factors were manipulated to simulate the degree of cheating: (a) the proportion of compromised anchor items and (b) the proportion of examinees with access to these items (referred to as cheaters). Four proportions of cheaters were examined: 5%, 10%, 25%, and 50%. The proportion of anchor items compromised was set at either 25% or 100%. These levels were chosen to explore the impact of small, moderate, and large amounts of cheating.
As cheating often goes undetected, determining what constitutes a low or high degree of cheating is difficult. Although the condition including 50% cheaters and 100% compromised anchor items may seem extreme, situations have occurred in which examinees had access to a large number of test questions, if not the entire form, immediately following the initial administration (Baron & Wirzbicki, 2008). Furthermore, this condition offers a picture of how equating is influenced by an extreme case of cheating. Therefore, results from the conditions with the greater degrees of cheating will still provide practical information.
In addition, the simulation incorporated a condition that included no compromised items. This condition allowed the quality of the equating process to be evaluated without the influence of cheating. Thus, the no-cheating condition served as a baseline for relative comparisons.
Cheating implementation
Cheating was implemented by adding .5 to the probability of answering a compromised item correctly for any examinee designated as a cheater. For example, if a cheating examinee’s original probability of correct response on a compromised item was .3, the adjusted probability would increase to .8. If the cheating examinee’s original probability of responding correctly was .5 or above, the examinee would necessarily get the item correct because the adjusted probability would equal or exceed 1.
The cheating adjustment value of .5 was selected because it greatly improves the probability of correct response for a cheating examinee, yet still allows for incorrect responses to compromised items by cheaters with low abilities. This would reflect, for example, a hypothetical situation in which the examinee memorized the wrong answer or forgot the correct answer. Admittedly, the designated increase is arbitrary. However, there is a lack of research detailing benefits to having prior knowledge of an item. Further research should consider different methods to conceptualize and implement cheating behavior.
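The adjustment can be written as a one-line rule. In the sketch below, the function name and arguments are hypothetical, and the cap at 1 is an implementation choice that has the same effect as letting the adjusted value exceed 1, since the item is then answered correctly with certainty.

```python
def apply_cheating(p_correct, is_cheater, is_compromised, boost=0.5):
    """Return the probability of a correct response after any cheating boost."""
    if is_cheater and is_compromised:
        # Cap at 1: an adjusted value of 1 or more means a certain correct answer.
        return min(1.0, p_correct + boost)
    return p_correct  # honest examinees and uncompromised items are untouched


print(apply_cheating(0.3, True, True))   # cheater, compromised item
print(apply_cheating(0.7, True, True))   # capped at 1.0
print(apply_cheating(0.3, False, True))  # honest examinee, unchanged
```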
Calibration
After response probabilities were adjusted for cheating, dichotomous responses were created by comparing each adjusted probability of correct response to a random draw from a uniform (0,1) distribution. Item parameters for each of the simulated tests were then estimated separately using BILOG-MG (Zimowski, Muraki, Mislevy, & Bock, 2003). The FLOAT command was applied to remove the influence of incorrectly specified prior distributions on the item parameter estimates (Hendrickson & Kolen, 1999). The maximum number of Gauss–Newton iterations for the expectation-maximization algorithm was increased to 100. Ability distributions were estimated using an empirical distribution with 40 quadrature points. Aside from the modifications described previously, default BILOG-MG options were used for estimation.
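The conversion from probabilities to scored responses can be sketched as follows. The function names are hypothetical. A response is scored correct when the uniform draw falls below the probability of a correct response, which is the usual convention and is assumed here.

```python
import random


def draw_response(p_correct, rng):
    """Score 1 if a uniform [0, 1) draw falls below p_correct, else 0."""
    return 1 if rng.random() < p_correct else 0


def draw_response_vector(probabilities, rng):
    """Generate one examinee's dichotomous responses to a set of items."""
    return [draw_response(p, rng) for p in probabilities]


rng = random.Random(0)
print(draw_response_vector([1.0, 0.0, 0.8, 0.3], rng))
```

A probability of 1, such as a capped cheating-adjusted value, always produces a correct response because the draw is strictly below 1.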
Scaling
Once item parameters for both forms had been estimated, the NF item parameters were placed on the BF scale using the Stocking–Lord method. Other scaling methods (Haebara, mean/sigma, mean/mean, and fixed anchor) were explored as part of this study. However, as the scaling method had little impact on the findings, only the values from the Stocking–Lord scaling are reported in the tables and figures.
Equating
In the final step of the simulation process, true score IRT equating was applied to establish equivalence between NF and BF scores. True score equating converts an examinee’s number-correct score on the NF to an equivalent score on the BF. This conversion is accomplished by calculating the ability value corresponding to a number-correct total score on the NF. The resulting ability is then used to derive the expected true score on the BF. Because the minimum true score for the three-parameter IRT model is the sum of the c parameters, no true score exists for observed scores below this sum. Kolen’s (1981) linear interpolation procedure was applied when an observed score fell below the lowest possible true score.
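The two steps of true score equating can be sketched with a bisection search on the NF test characteristic curve. The sketch assumes a 3PL model with D = 1.7 and NF item parameters that are already on the BF scale. It raises an error for scores at or below the sum of the c parameters rather than implementing Kolen’s (1981) interpolation. All names and item values are hypothetical.

```python
import math

D = 1.7  # assumed scaling factor


def p3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def tcc(theta, items):
    """Expected number-correct (true) score at theta."""
    return sum(p3pl(theta, a, b, c) for a, b, c in items)


def theta_for_true_score(score, items, lo=-10.0, hi=10.0, tol=1e-8):
    """Find theta whose true score equals `score` (the TCC is increasing)."""
    floor = sum(c for _, _, c in items)
    if not floor < score < len(items):
        raise ValueError("no theta corresponds to this true score")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if tcc(mid, items) < score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def equate_true_score(nf_score, nf_items, bf_items):
    """Map an NF number-correct score to its BF true-score equivalent."""
    theta = theta_for_true_score(nf_score, nf_items)
    return tcc(theta, bf_items)


# Hypothetical forms: the NF items are 0.5 units harder than the BF items.
bf_items = [(0.9, -1.0, 0.2), (1.0, -0.2, 0.2), (1.1, 0.4, 0.2), (0.8, 1.2, 0.2)]
nf_items = [(a, b + 0.5, c) for a, b, c in bf_items]
print(equate_true_score(2.5, nf_items, bf_items))
```

Because the hypothetical NF is harder, a score of 2.5 on it maps to a higher score on the BF. Scores only marginally above the sum of the c parameters may fall outside the search bracket and return a theta near its lower bound.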
Summary of Conditions
The current study varied four factors to investigate the effects of cheating on the equating process. Four proportions of cheaters (5%, 10%, 25%, and 50%) and two proportions of compromised items (25% and 100%) were included to explore how the extent of cheating affects the equating process. True ability for examinees administered the NF was generated from four normal distributions: N(0,1), N(−0.5,1), N(0,1.25), and N(−0.5,1.25). The anchor items were manipulated to be either internal or external to the scoring of the test. Crossing these factors resulted in a 4 × 2 × 4 × 2 design for a total of 64 unique conditions. The simulation process was replicated 500 times for each combination of conditions. New tests and examinees were generated for each replication.
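The fully crossed design can be enumerated directly. In this sketch the variable names are hypothetical, and the no-cheating baseline conditions described earlier are not included in the count.

```python
from itertools import product

CHEATER_PROPORTIONS = (0.05, 0.10, 0.25, 0.50)
COMPROMISED_PROPORTIONS = (0.25, 1.00)
NF_ABILITY = ((0.0, 1.0), (-0.5, 1.0), (0.0, 1.25), (-0.5, 1.25))
ANCHOR_TYPES = ("internal", "external")
N_REPLICATIONS = 500
N_ANCHOR_ITEMS = 20

# Every combination of the four factors: 4 x 2 x 4 x 2.
conditions = list(product(CHEATER_PROPORTIONS, COMPROMISED_PROPORTIONS,
                          NF_ABILITY, ANCHOR_TYPES))

# Number of compromised anchor items implied by each proportion.
n_compromised = {p: round(p * N_ANCHOR_ITEMS) for p in COMPROMISED_PROPORTIONS}

print(len(conditions), len(conditions) * N_REPLICATIONS, n_compromised)
```

With 20 anchor items, the two compromise levels correspond to 5 and 20 compromised items.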
Comparison Criteria
The accuracy of recovered scaling constants and equated scores was used to evaluate the effects of cheating on the equating process.
Scaling constants
True scaling constants were derived by calculating the A and B constants necessary to set the estimated NF examinee ability distribution, which was constrained to be N(0,1) in the calibration, equal to the true NF distribution.
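To quantify errors in recovery, bias and root mean squared error (RMSE) were calculated for A and B scaling constants. Bias identifies systematic deviation of the estimated scaling constant from the true parameter. Bias is mathematically defined as the average difference of the estimated parameter from its true value across all replications:

$$\text{Bias}(\hat{\lambda}) = \frac{1}{R}\sum_{r=1}^{R}\left(\hat{\lambda}_r - \lambda\right)$$

where $\hat{\lambda}_r$ is the scaling constant (A or B) estimated in replication r, $\lambda$ is its true value, and R is the number of replications.
RMSE provides a measure of absolute accuracy in parameter recovery. The RMSE statistic incorporates the bias and the variability of the sampled parameter. RMSE is computed by taking the square root of the average squared deviation between the estimated parameter and the true value. Mathematically, RMSE can be expressed by

$$\text{RMSE}(\hat{\lambda}) = \sqrt{\frac{1}{R}\sum_{r=1}^{R}\left(\hat{\lambda}_r - \lambda\right)^{2}}$$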
The mathematical terms in RMSE are equivalent to those in the measure of bias.
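Both recovery statistics follow directly from their verbal definitions. The sketch below uses hypothetical function names and three made-up estimates of a scaling constant whose true value is 1.0; the numbers are for illustration only and are not results from the study.

```python
import math


def bias(estimates, true_value):
    """Average signed difference between the estimates and the true value."""
    return sum(e - true_value for e in estimates) / len(estimates)


def rmse(estimates, true_value):
    """Square root of the average squared difference from the true value."""
    squared = [(e - true_value) ** 2 for e in estimates]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical estimates across three replications.
estimates = [1.1, 0.9, 1.3]
print(bias(estimates, 1.0), rmse(estimates, 1.0))
```

In the study, each list would hold the 500 estimates obtained for one condition.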
Equated scores
Quantifying the impact of cheating on equated scores involved comparing an examinee’s equated score based on the observed responses (derived from
Results
Table 1 presents the mean bias for equated scores under the different levels of cheating in the four ability conditions. In addition, Table 1 displays the results for the internal and external anchor conditions. As mentioned above, in an external anchor set, the anchor items are used to compute the scaling constants only. Anchor items are not included in calculations of the examinees’ abilities or number-correct scores. Thus, the maximum possible score on the simulated test with an external anchor set was 80. The difference between total possible scores of the internal and external tests precludes direct comparisons between these conditions. To address this issue, bias and RMSE for the external anchor conditions were converted to a proportion correct by dividing the statistics by 80. The proportion correct scores for the external anchor were then multiplied by 100 to make the values directly comparable with the internal anchor test.
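The conversion amounts to expressing each statistic as a percentage of the maximum possible score. A minimal sketch follows, with a hypothetical function name and a made-up bias value that is not taken from the tables.

```python
def to_percent_metric(value, max_score):
    """Express a score-point statistic as a percentage of the maximum score."""
    return value / max_score * 100.0


# A hypothetical bias of 4 score points on the 80-item external anchor test.
print(to_percent_metric(4.0, 80))
```

On the 100-item internal anchor test, the same conversion leaves values unchanged, which is what makes the two conditions comparable.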
Table 1. Equated-Score Bias for the Stocking–Lord Scaling Method
Note: Values were converted to reflect percentage correct scores to increase comparability between internal and external anchor tests.
As expected, bias for the condition including no cheating approached zero. For the cheating condition including 25% compromised anchor items, equated-score bias increased with the proportion of cheaters. When the proportion of cheaters was increased to 25%, bias in the equated scores increased considerably. Another substantial increase occurred when the proportion of cheaters increased to 50%. The condition including 100% compromised items and 10% cheaters resulted in only slightly smaller bias in comparison with the bias for the condition with 25% compromised items and 50% cheating. The most extreme cheating condition, including 100% compromised items and 50% cheating, resulted in extremely large positively biased equated scores.
Results indicated that cheating on the anchor items had less of an effect on the accuracy of equated scores under the external anchor condition compared with the internal anchor condition, when the amount of cheating was substantial. Across the majority of conditions, including 100% compromised items, an internal anchor resulted in more positively biased scores than an external anchor set. The difference in bias between the two anchor sets was exacerbated at larger proportions of cheating.
Changes to the mean and variance in the ability distribution of examinees responding to the NF appeared to affect equated-score bias differentially depending on the extent of cheating. For instance, in the baseline and lowest cheating condition, a larger variance in ability led to more biased estimates of the equated scores. For the internal anchor, this effect was reversed in the higher cheating conditions, where a larger NF variance resulted in slightly less-biased estimates on average. In general, however, the differences among NF ability conditions were small.
Figure 1 depicts the equated-score bias as a function of the NF number-correct score for selected cheating conditions. The figure presents the internal anchor condition with an N(0,1) NF ability distribution for the 10% and 50% cheater conditions. The baseline condition trend noticeably differed from the cheating conditions. In the baseline condition, scores were well recovered near the middle of the NF score distribution. Examinees obtaining low number-correct scores received underestimated equated scores on average, whereas equated scores were overestimated for examinees scoring high. This is because maximum likelihood (ML) estimates of θ are negatively biased for low-ability examinees and positively biased for high-ability examinees (Warm, 1989). This property is passed along to estimates of τ. When large amounts of cheating occurred, equated scores were consistently overestimated. Specifically, equated scores for examinees with low number-correct scores were drastically overestimated. Overestimation decreased at higher number-correct score values. The disparity among cheating conditions is readily apparent in the figure. In the extreme cheating condition, the positive bias greatly exceeds that in even the moderate cheating conditions explored in this study. At the peak bias for the extreme cheating condition, equated scores were inflated by approximately 24 points. This value is considerably larger than in either of the two moderate cheating conditions. Trends seen in Figure 1 were replicated in the various NF ability distribution conditions.

Figure 1. Equated-score bias for varying degrees of cheating as a function of the NF number-correct score
Table 2 displays the RMSE of equated scores. RMSE is a function of the bias and sampling variability of the parameter of interest. Thus, for conditions in which the bias is large, the RMSE must also be large. An RMSE much larger than the corresponding bias suggests that the estimation of the parameter is highly variable. For the current conditions, the increase in RMSE followed the increase in cheating proportions closely. Examining the bias and RMSE concurrently reveals that as cheating increases, the sampling variability of the equated score estimates does not increase. The rise in RMSE as cheating increases seems to be entirely a function of the increased bias in equated scores.
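The relationship among RMSE, bias, and sampling variability can be stated explicitly. The identity below holds for any estimator of a fixed true value; the symbols are introduced here for illustration, with $\bar{\hat{\lambda}}$ denoting the mean of the estimates across the R replications.

```latex
\text{RMSE}^2
  = \underbrace{\left(\bar{\hat{\lambda}} - \lambda\right)^2}_{\text{Bias}^2}
  + \underbrace{\frac{1}{R}\sum_{r=1}^{R}\left(\hat{\lambda}_r - \bar{\hat{\lambda}}\right)^2}_{\text{sampling variance}}
```

When the RMSE is close to the absolute bias, the variance term must be small, which is the pattern described for the equated scores.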
Table 2. Equated-Score RMSE by Ability Distribution
Note: RMSE = root mean squared error. Values were converted to reflect percentage correct scores to increase comparability between internal and external anchor tests.
Table 3 displays the bias and RMSE for the A scaling constant for each of the cheating conditions and ability distributions. Note that the anchor type does not affect the calculation of scaling constants. Thus, no comparisons were made across internal and external conditions. The A scaling constant is recovered well in the baseline condition, with almost no bias in estimation. For the moderate cheating conditions, the A constant was consistently underestimated. This held for the extreme cheating condition, except for the mean–sigma scaling method (not shown), which overestimated the A constant. RMSE indicated that the recovery of the estimated A constant became less accurate as the proportion of cheating increased.
Table 3. Bias and RMSE for Scaling Constant A
Note: RMSE = root mean squared error.
In reference to the ability distribution, the results suggest that the A scaling constant was affected by the mean and the variance of the NF ability distribution. On average, the bias in estimating the A constant was larger when the mean of the ability distribution was −0.5. The conditions with a larger variance consistently produced more negative biases.
Bias and RMSE for the B scaling constant are presented in Table 4. When no cheating was present, the B constant was recovered with virtually no bias. Bias in the B scaling constant systematically increased as cheating increased. Bias severely increased from the moderate cheating conditions to the extreme condition. The considerable positive bias in the B scaling constant indicates that the estimated mean NF ability distribution was substantially positively biased when scaling the NF parameters to the BF scale. As with equated scores, the RMSE for the B constant indicated that the parameter was less accurately recovered as cheating increased. However, the small discrepancy between the RMSE and bias again suggests that the inaccuracy in recovering the B constant was largely attributable to the bias.
Table 4. Bias and RMSE for Scaling Constant B
Note: RMSE = root mean squared error.
Results suggest that the mean of the ability distribution was the main distributional factor influencing estimation of the B scaling constant. The conditions including a mean ability distribution of −0.5 consistently produced a more positively biased estimate of the B constant. The additional overestimation of the B scaling constant was more severe in conditions including more cheating. Results showed that the variance of the NF distribution had little effect on bias of the B constant.
Table 5 contains the bias in equated scores for honest and cheating test takers. An examination of the bias for honest test takers indicates that equated scores were positively biased in all conditions, although these examinees’ probability of correct response was not influenced by cheating. In the internal anchor condition, bias for cheating examinees is consistently more positive across all conditions. Specifically, increasing the proportion of compromised items greatly benefited the cheating test takers in the internal anchor condition. In the most extreme cheating condition, dishonest test takers obtained equated scores 21 to 26 number-correct score points above their true scores on average. In comparison, honest examinees benefited by 13 to 15 score points. In stark contrast, differences between dishonest and honest test takers do not arise in the external anchor condition. Across all cheating conditions and ability distributions, bias for equated scores was nearly identical between honest and dishonest test takers. In addition, honest test takers in the external condition benefited more from the compromised anchor items than did honest examinees within the internal anchor condition.
Table 5. Bias for Honest and Cheating Test Takers by Anchor Type for the Stocking–Lord Method
Note: Values were converted to reflect percentage correct scores to increase comparability between internal and external anchor tests.
Figure 2 graphically depicts the equated-score bias for the Stocking–Lord scaling method for internal and external anchor conditions. The figure illustrates that cheaters in the internal anchor condition benefited the most from cheating across the scale of NF scores. In contrast, the magnitude of bias for honest examinees in the internal anchor condition was less than the external anchor conditions across the entire distribution. The figure also shows that cheaters and honest examinees in the external anchor condition benefited equivalently from cheating across the range of number-correct scores.

Figure 2. Equated-score bias for internal and external anchor tests
Discussion
The prominent question of interest in this study concerned the impact of compromised items and proportion of cheaters on equated scores and scaling constants obtained from IRT true score equating under the NEAT design. Results indicated that an increase in either compromised anchor items or proportion of cheaters led to positively biased equated scores. Although overestimated equated scores were predictable, the extent of bias at even moderate degrees of cheating was disconcerting. Thus, the results suggest that equated scores obtained from even slightly compromised test forms overestimate examinees’ true abilities.
Compromised items introduce positive bias in the equating process in part because cheaters respond correctly to the compromised items above the level implied by their true ability. Therefore, if the anchor is internal, cheaters obtain inflated number-correct scores that, in turn, inflate estimates of their true ability. Although this process helps explain why cheaters benefit directly from compromised items, it does not account for the extensive degree of bias in equated scores, nor explain why honest takers benefited from cheating as well.
Investigating the effects of cheating on the scaling constants provides further insight into the biased equated scores. Specifically, the severe overestimation of the B scaling constants reveals how all the examinees in the NF group can benefit from cheating. When scaling the NF’s parameters to the metric of the BF, an artificially large B constant will cause the difficulty (b) item parameters for the unique items to be artificially increased (see Equation 3). Thus, the NF’s unique items appear more difficult than they truly are. Because the inflated NF’s b parameters cause the NF’s unique items to appear more difficult, responding correctly to these items will considerably increase an examinee’s ability estimate, regardless of whether the examinee cheated. As a result, cheaters and honest test takers benefit from these inaccurately scaled unique items.
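The mechanism can be made concrete with a short worked equation. Suppose, purely for illustration, that the estimated intercept exceeds the true B by some amount δ > 0 and that the slope A is recovered exactly. The symbol δ is introduced here and is not a quantity reported in the study. Applying the linear transformation for item difficulty to a unique NF item j then gives:

```latex
\hat{b}_{BF_j} = A\,b_{NF_j} + (B + \delta) = \left(A\,b_{NF_j} + B\right) + \delta
```

Every unique NF item therefore appears δ units more difficult on the BF scale than it truly is, whichever examinee answers it.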
The large positive bias in the B scaling constant arises because cheating occurs on the items specifically used to scale the test form. As cheaters artificially respond correctly to several compromised items, the difficulty of the anchor items will be underestimated for the group taking the NF. The scaling process must overestimate the B constant to account for this decreased difficulty. In addition, results indicated that the A constant was slightly underestimated. This occurs because cheating introduces a factor irrelevant to the construct that influences responses, thus decreasing the discriminatory power (the a parameter) of anchor items on the NF. Therefore, the A constant is underestimated to account for the fact that the items seem less discriminatory.
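The direction of the bias in B can be sketched with the simpler mean–sigma method, used here in place of Stocking–Lord only because it has a closed form. The anchor difficulties below are hypothetical, and the uniform 0.4 drop in the NF anchor difficulties is an assumption chosen to isolate the effect on B; it leaves A unchanged and therefore does not reproduce the slight underestimation of A reported above.

```python
from statistics import mean, pstdev

def mean_sigma(b_new, b_base):
    """Mean-sigma constants that carry the new-form scale onto the base scale."""
    A = pstdev(b_base) / pstdev(b_new)
    B = mean(b_base) - A * mean(b_new)
    return A, B

# Hypothetical anchor difficulties as estimated on the base form (BF).
b_base = [-1.0, -0.5, 0.0, 0.5, 1.0]

# Without cheating, and with equivalent groups, the new-form (NF)
# estimates of the same anchor items would match the BF estimates.
b_new_honest = list(b_base)

# With cheating, the anchor items look easier on the NF. A uniform
# drop of 0.4 is an illustrative assumption, not a value from the study.
b_new_cheat = [b - 0.4 for b in b_base]

A_honest, B_honest = mean_sigma(b_new_honest, b_base)
A_cheat, B_cheat = mean_sigma(b_new_cheat, b_base)
print(A_honest, B_honest)  # A = 1, B = 0: no adjustment needed
print(A_cheat, B_cheat)    # A is about 1, B is about 0.4
```

The scaling step reads the apparent ease of the anchor items as a more able NF group, so B is pushed upward by roughly the amount the anchor difficulties were underestimated.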
The overestimation in equated scores is greater for lower- and middle-ability examinees, as examinees with higher ability have less to gain from cheating. Figures 1 and 2 help illustrate this effect; the largest bias in equated scores occurs for examinees having number-correct scores at or slightly below the middle of the score scale. Equated-score bias decreases sharply at higher number-correct scores on the NF, especially in the internal anchor condition. When the anchor set is internal, a smaller proportion of examinees taking the NF fall within the area of the ability distribution where cheating is most beneficial. Therefore, the overall bias resulting from cheating increases when examinees taking the NF are less able in comparison with the examinees setting the metric. However, when the anchor was external, or for honest test takers in either anchor condition, the score bias decreased less steeply as scores changed from low-middle to high, and the bias decreased more steeply as scores changed from low-middle to low (see Figure 2). Thus, the mean bias did not depend on the mean ability. For internal and external anchor sets and for honest and cheating examinees, the bias was greatest for examinees with low-middle ability levels. These results have negative implications for testing programs. Because lower ability examinees benefit the most from cheating, unqualified examinees may appear qualified as a consequence of cheating.
To explore the differential effects of anchor types, bias in equated scores was compared for honest and dishonest test takers within the two anchor conditions. As expected, in the internal condition, cheaters benefited from the compromised items to a greater degree than did honest test takers. This difference was exacerbated when the proportion of compromised anchor items was increased. In contrast, in the external anchor condition, cheaters and honest test takers benefited equally from cheating. This result supports the hypothesis that the direct benefit from the compromised items in the internal anchor introduces additional bias for cheating examinees. Bias in equated scores arises in the external anchor condition solely as a result of the impact of cheating on IRT scaling. The considerable bias obtained in the external condition displays the severe impact that cheating has on the scaling process. Employing an external anchor will prevent bias that is directly attributable to scores on compromised items; however, the substantial amount of bias related to inaccurate scaling will continue to plague the scores obtained from equating. Furthermore, employing an external anchor will increase the bias in scores for honest examinees. This occurs because, in the internal anchor condition, the anchor items are included in the equating process.
The overall trends found in this study demonstrate the detrimental effects of cheating on equated scores estimated under the NEAT design. When examinees have access to the items used to scale an NF of a test to a common metric, equated scores for the entire group of examinees will be overestimated. This trend was also found when the authors examined the mean–sigma, mean–mean, Haebara, and fixed common item methods of developing a common metric. Although relatively few examinees may be cheating on the test, the entire group of examinees administered the form will appear more proficient. Perhaps most important, the degree of bias, even at moderate cheating conditions, would be unacceptable for a high-stakes test. Clearly, decisions made based on the test scores when any amount of cheating on anchor items has occurred will be dubious at best. Specifically, these results suggest that if cheating occurs on their form, underqualified examinees, whether they engage in cheating behaviors or not, may be given an unfair advantage over qualified examinees who completed an uncompromised form.
Given the negative implications of compromised anchor items, focused attention should be given to identifying these items and removing them from the equating process and scoring. As such, future studies must address the detection of cheating and compromised items. Several results of this study may be useful in developing models that identify compromised items under the NEAT design. For example, if it is understood that the a and b parameters for the compromised anchor items will be underestimated for cheating examinees, a mixture IRT model (von Davier & Carstensen, 2007) can be specified that contains a two-class mixture, with a “cheater” class containing lower a and b parameters in reference to the other, “honest examinee,” class. Items that show large parameter differences across the two classes may be removed from scaling, equating, and scoring to protect against the negative consequences of cheating.
Alternatively, potential cheaters, instead of compromised items, could be identified and removed from the scaling and equating process, again, using mixture models. Person-fit statistics may provide another method of identifying cheaters. Although power is often somewhat low for these indices (McLeod & Lewis, 1999; St-Onge, Valois, Abdous, & Germain, 2011), it might be increased in this context by taking into account that only the anchor items were potential candidates for cheating. The person-fit index could be calculated for the anchor item responses, conditional on θ estimated based on the nonanchor item responses. This would prevent the dilution of power due to using items subject to cheating in the estimation of θ, as well as the loss of power due to model aberrance on only a small proportion of the items. Person-fit indices developed to operate on runs of sequential items (Armstrong & Shi, 2009; van Krimpen-Stoop & Meijer, 2001) could be adjusted to treat the set of anchor items as a run.
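The conditional person-fit idea above can be sketched as follows. The sketch uses the standardized log-likelihood index (lz) as one common choice; the index, the three-parameter logistic model with D = 1.7, the grid-search estimate of θ, and all item parameters and responses are assumptions made for illustration rather than details of this study.

```python
import math

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response (D = 1.7 is an assumption)."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

def log_lik(theta, items, resp):
    """Log-likelihood of a 0/1 response pattern at theta."""
    total = 0.0
    for (a, b, c), u in zip(items, resp):
        p = p_3pl(theta, a, b, c)
        total += math.log(p) if u else math.log(1.0 - p)
    return total

def mle_theta(items, resp):
    """Crude grid-search maximum likelihood estimate on [-4, 4]."""
    grid = [i / 10.0 for i in range(-40, 41)]
    return max(grid, key=lambda t: log_lik(t, items, resp))

def lz(theta, items, resp):
    """Standardized log-likelihood person-fit index at a fixed theta."""
    expected = variance = 0.0
    for a, b, c in items:
        p = p_3pl(theta, a, b, c)
        expected += p * math.log(p) + (1.0 - p) * math.log(1.0 - p)
        variance += p * (1.0 - p) * math.log(p / (1.0 - p)) ** 2
    return (log_lik(theta, items, resp) - expected) / math.sqrt(variance)

# Hypothetical nonanchor items; the pattern suggests a low-ability examinee.
nonanchor = [(1.0, b / 2.0, 0.2) for b in range(-4, 5)]  # b = -2.0, ..., 2.0
nonanchor_resp = [1, 1, 1, 0, 0, 0, 0, 0, 0]

# Hypothetical anchor items, scored at theta from the nonanchor items only.
anchor = [(1.0, -1.5, 0.2), (1.0, -0.5, 0.2), (1.0, 0.5, 0.2), (1.0, 1.5, 0.2)]
theta_hat = mle_theta(nonanchor, nonanchor_resp)

lz_suspicious = lz(theta_hat, anchor, [1, 1, 1, 1])  # all anchor items correct
lz_typical = lz(theta_hat, anchor, [1, 0, 0, 0])     # consistent with low theta
print(theta_hat, lz_suspicious, lz_typical)
```

Large negative values flag response patterns that are unlikely at the estimated ability. Because θ is estimated rather than known, the standard normal reference distribution for lz holds only approximately, so any cutoff would need to be checked by simulation.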
Admittedly, simulation studies cannot capture the complexity of applied situations completely. Although this simulation was designed to address this issue by exploring realistic conditions and using real-item parameters, several limitations remain. Perhaps most important, the cheating adjustment value selected in this study may not reflect the actual benefit of knowing the item prior to administration. Although prior item knowledge assuredly benefits examinees, future research should examine the functional relationship between prior knowledge and probability of correct response, to gauge the strength of this effect. In addition, because details of cheating often go unreported, the cheating conditions selected in this study may not reflect the degree of cheating occurring in actual high-stakes testing. However, the results of this study should generalize to other degrees of cheating and compromised items. That is, the positive bias in equated scores should be a function of the degree of cheating, such that any increase in cheating will only further overestimate the equated scores. In addition, it bears noting that the effects of cheating on equating will differ for other forms of cheating. For instance, the impact of answer copying, a common form of cheating (Belov, 2011), would depend on the items copied. The results in this study pertain to the situation when anchor items become compromised.
In conclusion, this study addressed a gap in the literature by exploring the effects of anchor item compromise on equated scores obtained from IRT equating under the NEAT design. If some examinees have prior access to the items used to scale a test form, equated scores obtained for all examinees administered the form may be overestimated. Even small amounts of cheating call into question the results obtained from the test, while extreme levels of cheating will completely distort the scores. Given the types of decisions made based on high-stakes tests, it is imperative that scores reflect examinees’ true abilities on the attributes measured. The impact that cheating has on the entire distribution of scores is a severe threat to the validity of inferences made from test scores. Scores for all examinees, honest and dishonest, may indicate that the examinee is more proficient than is the case in reality. A corollary of this effect is that examinees administered forms where no cheating occurred may be unfairly disadvantaged. Research investigating methods to detect compromised items must take priority to ensure that scores, and thus decisions made, from high-stakes tests accurately reflect the ability of the examinees.
Footnotes
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
The author(s) received no financial support for the research, authorship, and/or publication of this article.
