Abstract
Preknowledge cheating jeopardizes the validity of inferences based on test results. Many methods have been developed to detect preknowledge cheating by jointly analyzing item responses and response times. Gaze fixations, an essential eye-tracker measure, can be utilized to help detect aberrant testing behavior with improved accuracy beyond using product and process data types in isolation. As such, this study proposes a mixture hierarchical model that integrates item responses, response times, and visual fixation counts collected from an eye-tracker (a) to detect aberrant test takers who have different levels of preknowledge and (b) to account for nuances in behavioral patterns between normally-behaved and aberrant examinees. A Bayesian approach to estimating model parameters is carried out via an MCMC algorithm. Finally, the proposed model is applied to experimental data to illustrate how the model can be used to identify test takers having preknowledge on the test items.
Keywords
The growing popularity of visual-based remote assessment and testing necessitates constant upgrading of the tools used to assess test takers’ behaviors. For instance, many at-home testing programs were launched during the COVID-19 pandemic, offering test takers a safe and convenient option to take their exams at home rather than at a test center. In such remote and difficult-to-supervise settings, how to ensure that test takers actively and appropriately answer questions without cheating is an essential and timely question. In response, several testing companies are currently using online human proctors to remotely monitor test-taking activity and performance. An obvious, primary goal of such a proctoring system was to reduce the prevalence of cheating behavior while test centers were closed due to travel restrictions and lockdown conditions.
However, shortcomings and limitations of online human proctoring systems have been recognized by educators and practitioners, and exploited by students. Many are related to test security. For example, the actual test taker could ask or hire someone to take the exam in his or her place. This could occur at different stages of the examination delivery process, especially during a “water- or bathroom-break,” and could be considered a new form of copy-cheating. Using advanced technology such as micro-inductive earpieces, micro-projectors, and cameras is another way “tech-savvy” students can cheat during online proctored exams. Students, for instance, may hide micro-cameras facing the laptop screen to record questions. Then, “expert tutors” can provide answers via micro-inductive earpieces, which are hard to detect even when the test taker is monitored by an online proctor. Finally, cheating can occur by writing short, small notes and concealing them from the online-proctor camera. As mice adapt their behaviors to avoid more clever mousetraps, some test takers will continue to find ways to exploit testing environments that are administered online. Although these aberrant testing behaviors seem far-fetched, they have nonetheless been flagged in online proctored exams, undermining the inferences drawn from these possibly contaminated test scores.
As a potential solution, eye-tracking technology could be used to monitor examinees’ online test-taking behaviors in a noninvasive manner. Many open-source pupil-detection algorithms have been developed and deployed, allowing a low-cost web-based camera to capture eye movements in conjunction with gaze estimation. This technology is an attractive alternative to online human proctoring because of its accessibility and affordability, especially for large-scale educational assessments taken at home (Fuhl, Santini et al., 2016; Fuhl, Tonsen et al., 2016). Eye-tracking involves measuring either where the eye is focused or the motion of the eye as individual examinees view test items delivered online. Gaze patterns indicate (a) where on the screen examinees are gazing; (b) how long they look at an item; (c) shifts in focus as they move from item to item; (d) facets of the online interface that they miss; and (e) whether the eye-related measures come from the same or different persons (Man & Harring, 2021). Therefore, potential cheating behaviors could be detected by analyzing the intensively collected eye-movement indicators from an eye-tracker along with other multimodal data such as item responses and response times. All this disparate, yet complementary, information could be used to detect aberrant test-taking behaviors like preknowledge cheating.
Among many types of cheating, preknowledge cheating is considered a prominent ongoing issue that testing companies are eager to resolve, especially in online testing settings. Item preknowledge refers to some test takers having prior access to test questions and/or answers before taking the assessment (McLeod et al., 2003; Sinharay, 2020). In remote test-taking settings, the concept of preknowledge cheating is extended to include cases in which tech-savvy test takers use high-tech devices to record questions and then receive help from “expert tutors” in unexpected ways. Detection of this type of fraudulent testing behavior relies heavily on statistical analysis of multimodal data, in which patterns of cheating are camouflaged and concealed, showing few external signs that a proctor might be able to discern (Bliss, 2012; Toton & Maynes, 2019). This is because the retrieval of answers, whether received from “tutors” via high-tech devices or drawn from items memorized during “breaks,” occurs internally rather than externally (Toton & Maynes, 2019).
Many methods have been proposed to detect preknowledge cheating, including person-fit statistics, experiment-based evaluation, and data-mining-based methods, which analyze either item response patterns or response time patterns. Of the many indices, a few popular, representative person-fit statistics based on item responses are: Guttman error (Guttman, 1944)
However, few attempts have been made to jointly model eye-tracking indicators with item responses and response times (RTs) in a unified framework to detect preknowledge cheating. In this study, a multilevel mixture three-way factor (ML-mixture) model is proposed to detect test takers having preknowledge of test items by jointly modeling item responses, RTs, and visual fixation counts (VFCs). The proposed model, an extension of the Bayesian multilevel modeling framework proposed by van der Linden (2007) and Wang and Xu (2015), allows investigation of the associations among the latent factors (ability, working speed, and test engagement) underlying item responses, RTs, and VFCs, respectively, across latent classes.
In this three-way ML-mixture modeling approach, the Rasch model, an RT model, and a visual fixation counts model are specified at the measurement level. The variance-covariance structures of the person-side and item-side parameters are specified at level two. Bayesian estimation is used to estimate the proposed three-way ML-mixture joint model. Experimental data collected in an eye-tracking lab will be analyzed. The data come from a study in which participants were randomly assigned to one of three treatment conditions.
Multilevel Mixture Model Construction
The ML-mixture model (visualized in Figure 1) assumes that two latent classes exist among test takers, and each latent class captures a specific type of testing behavior, such as normal responding and preknowledge cheating. To identify the two latent groups via item responses, response times, and visual fixation counts, three conditional mixture probabilities are specified for item responses, response times, and visual fixations, respectively.
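As a generic illustration of how a two-class mixture turns class-conditional likelihoods into class membership probabilities, the following minimal Python sketch applies Bayes' rule. The function name, the inputs, and the example values are assumptions introduced here for illustration; this is not necessarily how the class indicators are updated in the authors' sampler.

```python
import math


def posterior_class_probs(log_liks, class_weights):
    """Posterior class membership probabilities for one test taker.

    log_liks      : class-conditional log-likelihoods of the observed data
    class_weights : prior mixing proportions, one per class (sum to 1)
    """
    log_terms = [math.log(w) + ll for w, ll in zip(class_weights, log_liks)]
    top = max(log_terms)  # subtract the maximum for numerical stability
    unnormalized = [math.exp(term - top) for term in log_terms]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical values: equal class-conditional likelihoods, so the
# posterior simply reproduces the mixing proportions.
probs = posterior_class_probs([0.0, 0.0], [0.3, 0.7])
```

When the data are equally likely under both classes, the posterior equals the mixing proportions; as one class explains the responses, response times, and fixation counts better, its posterior probability moves toward 1.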

ML-Mixture Three-Way Joint Model of Item Response, Response Time, and Visual Fixation Counts.
For item responses, the conditional probability for a correct response can be specified as
where
Similarly, for response times, the conditional probability for response time required for a test taker
where
In parallel, for visual fixation counts, the conditional probability for visual fixation required for a test taker
where
To estimate the conditional probabilities for item responses, response times, and visual fixation counts, three respective measurement models (a one-parameter logistic model, a lognormal response time model, and a visual fixation counts model) are specified at the measurement level of the multilevel structure to facilitate identifying different latent classes.
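The three measurement models can be sketched in Python for a single person-item observation. This is a minimal sketch under stated assumptions, not the authors' specification: the Rasch and lognormal forms follow their standard textbook parameterizations, the fixation count model is assumed to be negative binomial with log-mean `m + omega`, the three outcomes are assumed conditionally independent given the latent variables, and all names and numeric values are hypothetical.

```python
import math


def rasch_prob(theta, b):
    """P(correct) under a Rasch model: ability theta, item difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def lognormal_rt_logpdf(t, tau, beta, alpha):
    """Log-density of response time t, with log t ~ Normal(beta - tau, 1/alpha^2).

    tau = working speed, beta = time intensity, alpha = time discrimination.
    """
    z = alpha * (math.log(t) - (beta - tau))
    return math.log(alpha) - math.log(t) - 0.5 * math.log(2.0 * math.pi) - 0.5 * z * z


def negbin_logpmf(v, omega, m, r):
    """Log-pmf of fixation count v with mean exp(m + omega) and dispersion r.

    ASSUMPTION: negative binomial form; omega = visual engagement,
    m = visual intensity.
    """
    mu = math.exp(m + omega)
    return (math.lgamma(v + r) - math.lgamma(r) - math.lgamma(v + 1)
            + r * math.log(r / (r + mu)) + v * math.log(mu / (r + mu)))


def joint_loglik(y, t, v, person, item):
    """Joint log-likelihood of (response, time, fixations) for one observation."""
    theta, tau, omega = person
    b, beta, alpha, m, r = item
    p = rasch_prob(theta, b)
    ll_response = math.log(p) if y == 1 else math.log(1.0 - p)
    return (ll_response
            + lognormal_rt_logpdf(t, tau, beta, alpha)
            + negbin_logpmf(v, omega, m, r))


# Hypothetical person (theta, tau, omega) and item (b, beta, alpha, m, r).
person = (0.5, 0.2, 0.1)
item = (0.0, 4.0, 2.0, 3.2, 5.0)
example = joint_loglik(1, 40.0, 25, person, item)
```

In a class-specific version, each latent class would carry its own set of item parameters, and the joint log-likelihood would be evaluated once per class.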
Measurement Models
Item Response Model
A class-specific 1-PL (or Rasch) model (Lord, 1952; Rasch, 1960) was chosen to model the relation between latent ability, which reflects response accuracy, and item responses within each latent class. The model is specified as
where
Response Time Model
In addition to the 1-PL model for item responses, a log-normal RT model (van der Linden, 2006) is selected to depict a test taker’s responding speed. Specification of the log-normal RT model for a specific latent class is defined as
where
Visual Fixation Counts Model
Following Man and Harring (2019), a visual fixation counts model is used to model the association between observed visual fixation counts and latent test visual engagement, which is specified as
where
Person-Side Structural Model at Level Two
The structural model incorporates one person-domain variance-covariance matrix describing the dependencies among three person-side parameters, which are (a) latent ability
with mean vector,
The variances of the latent constructs are represented by the diagonal components of the
By estimating class-specific parameters in the person-domain structural variance-covariance matrices, the relations among person parameters can be manifested across latent classes. Structural differences across latent classes represent distinct test-taking behavioral patterns regarding the normally behaved test takers and the ones who have preknowledge of test items.
Figure 1 displays the graphical representation of the ML-mixture model jointly modeling item responses, response times, and visual fixation counts across latent classes.
Testing Differential Item Functioning Across Latent Classes
Differential item functioning (DIF) occurs when different groups of test takers respond differently to the same item (Hambleton et al., 1991; Smith & Prometric, 2004). Typically, DIF is assessed based on observed grouping variables like gender or SES. However, in a mixture modeling framework, DIF is evaluated based on the latent classes, which relies on the assumption that the latent classes have been correctly identified.
To examine whether item DIF exists across latent classes, standardized Wald tests of the item DIF in the item parameters (i.e.,
where
Bayesian Estimation Using MCMC Sampling
Just Another Gibbs Sampler, a Bayesian estimation tool, (JAGS; Plummer, 2015), which is in the
Model Identification and Scalability
According to Paek and Cho (2015), interpreting and further utilizing model parameters within a mixture IRT framework requires that the model is identified within each class and that parameter estimates across latent classes are on a common scale. We discuss each of these in turn explicating how we intend to satisfy both identifiability and scalability conditions.
To properly identify the scales of the latent variables, model constraints are needed either on the item side (fixing the summation of item thresholds to zero) or the person-side (fixing the expectation of the latent ability parameter to zero). In this study, to ensure the identifiability of the model, constraints were placed on the person-side parameters. Within each latent class
Paek and Cho (2015) note that when the estimation method (i.e., in this study, we use an MCMC algorithm within a Bayesian approach) accommodates the latent class population distributions as part of the modeling, constraints to the population ability distributions of the latent classes [i.e.,
Label Switching Identification for Mixture Distributions
Identification is a potentially critical challenge in Bayesian estimation of mixture models. Identification of a mixture model requires that distinct parameter values result in distinct probability distributions. Identification for a mixture distribution is defined as follows (McLachlan & Peel, 2000):
If and only if,
The invariance of the likelihood under relabeling of the mixture components (Diebolt & Robert, 1994; Redner & Walker, 1984) is a serious problem that must be handled. Several methods have been proposed to tackle the issue. For example, Vermunt and Magidson (2005) imposed strongly informative priors to uniquely determine parameter estimates. Another commonly used method is to impose a set of identifiability constraints on the parameter space that help identify the latent components and prevent latent class labels from switching between iterations of an MCMC chain (Cho et al., 2010; Stephens, 2000).
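The relabeling invariance described above can be illustrated with a toy example. The following minimal Python sketch uses a two-component normal mixture, not the ML-mixture model itself, and the data, weights, and means are hypothetical values chosen for illustration. It shows that swapping the component labels leaves the log-likelihood unchanged and that an ordering constraint on the component means admits only one of the two labelings.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a Normal(mean, sd^2) distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def mixture_loglik(data, weight, mean1, mean2, sd=1.0):
    """Log-likelihood of a two-component normal mixture.

    weight is the mixing proportion of component 1.
    """
    return sum(
        math.log(weight * normal_pdf(x, mean1, sd)
                 + (1.0 - weight) * normal_pdf(x, mean2, sd))
        for x in data
    )


def satisfies_ordering(mean1, mean2):
    """Ordering constraint: component 1 must have the larger mean."""
    return mean1 > mean2


# Hypothetical data and parameter values, for illustration only.
data = [-1.2, -0.8, 0.1, 2.9, 3.4]
original = mixture_loglik(data, 0.6, 3.0, -1.0)
swapped = mixture_loglik(data, 0.4, -1.0, 3.0)  # same model, labels exchanged
```

Because `original` and `swapped` are equal, the data alone cannot tell the two labelings apart. Restricting the parameter space with a constraint such as `satisfies_ordering` removes the duplicate mode.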
In this study, several constraints were imposed on the item-side parameters to differentiate the two latent groups: normally behaved and aberrantly behaved test takers. First, item difficulties (
Prior Distributions
Weakly informative priors are used in this study to increase the generalizability of our code by imposing vague prior beliefs on the parameters being estimated. Priors were also set in this way in Man et al. (2022) and Man and Harring (2019). The prior specification for the person parameters across the two latent classes,
where
The prior distribution of item parameters is specified for each of the latent classes, respectively. For the normally behaved latent class,
For the latent class of test takers having preknowledge of the test items, label-switching constraints are imposed on the means of the item parameters by setting each mean to the corresponding item parameter from the first latent class minus an item DIF that is always larger than 0. By applying these constraints to the means of item difficulties, time intensities, and visual intensities of the second latent class, the two latent classes can be correctly identified. The prior of each item parameter for the preknowledge-cheating latent class was specified as
where
To identify the two latent classes,
The full joint likelihood function of person and item parameters for the ML-mixture model is as follows:
where
The joint posterior probability for the proposed model can be represented as
Outcome Measures for Model Selection and Classification Accuracy Evaluation
To evaluate the performance of the proposed method in classifying aberrant and non-aberrant test takers, the numbers of test takers in each of these categories were cross-tabulated. Schematically, these are true positives, false positives, false negatives, and true negatives, labeled as TP, FP, FN, and TN. To summarize the results, outcome measures consisting of sensitivity, specificity, and overall accuracy were calculated as follows:
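The standard definitions of these three measures can be sketched in Python as follows. The function name and the example counts are hypothetical and are not the study's results; "positive" is taken here to mean a test taker flagged as aberrant.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (sensitivity, specificity, accuracy) from a 2x2 cross-tabulation.

    tp: aberrant test takers flagged as aberrant
    fp: normal test takers flagged as aberrant
    fn: aberrant test takers not flagged
    tn: normal test takers not flagged
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("Both true classes must contain at least one test taker.")
    sensitivity = tp / (tp + fn)            # share of aberrant cases detected
    specificity = tn / (tn + fp)            # share of normal cases left unflagged
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return sensitivity, specificity, accuracy


# Hypothetical counts, for illustration only.
sens, spec, acc = classification_metrics(tp=8, fp=1, fn=2, tn=9)
```

With these hypothetical counts, 8 of 10 aberrant and 9 of 10 normal test takers are classified correctly, so 17 of 20 classifications are correct overall.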
Real Data Analysis
The data were fitted with the proposed ML-mixture three-way joint model of item responses, response times, and visual fixation counts. Parameter estimates of the level-1 measurement models are presented across the latent classes. Moreover, the trade-offs among the person-side parameters at level 2 are explored by reporting the estimated class-specific variance-covariance matrices.
Data Description
The eye-tracking study was conducted at a university with IRB approval. The dataset used for this study includes
Number of Subjects in Each Condition.
Note. Normally behaved condition: participants in the control condition who did not receive any test preparation materials. Preknowledge-cheated condition: participants in this condition would receive similar exam questions and the answer key.
Item-Specific PPP-Values Across Items.
Note. IRT = item response model; PPP = Posterior predictive p-values; RTM = response time model; VFM = visual fixations model.
Students were seated around 80 cm away from a
Data Visualization
To better understand the data and to model them appropriately for accurate inferences, the collected data were explored by showing the three-dimensional structure of the data space and the bivariate scatterplots of the three variables with the estimated mixture densities using the

Figure 2(A): 3D Visualization of the Collected Data. Figure 2(B): Scatterplots of Essential Variables Across Three Conditions. The Variable Names Shown in the Matrix From the Top Left to the Bottom Right Are Total.Score, Total.Gaze, and Total.Time. The Distribution of Each Variable Is Shown on the Diagonal of the Plot Matrix. The Bivariate Scatterplots Are Shown on the Off-Diagonal.
The proposed ML-mixture model of item responses, RTs, and visual fixation counts was fitted to the data to understand and evaluate the differences in test-taking behavior patterns across latent classes. Parameter estimates of the level-1 measurement models across the two latent classes were summarized. Furthermore, differences in test-taking behavior patterns were reported by showing the corresponding covariance estimates across the identified latent groups. In addition, posterior predictive model checking (PPMC; Gelman et al., 2014) was used to evaluate model–data fit. A posterior predictive p-value (PPP-value) within the range of 0.05 to 0.95 indicates adequate model–data fit (Sinharay et al., 2006). Table 2 displays the item-specific PPP-values for evaluating model–data fit. In general, the majority of the PPP-values were close to 0.5 for the IRT, lognormal RT, and NBFM models, showing satisfactory model–data fit across the three measurement models.
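As a minimal sketch of how a PPP-value is typically computed from MCMC output, the Python code below compares, draw by draw, a discrepancy measure evaluated on the observed data with the same measure evaluated on data replicated from the model. The function names, the choice of discrepancy, and the example values are assumptions for illustration; the source does not state which discrepancy measure was used.

```python
def ppp_value(realized, replicated):
    """Posterior predictive p-value.

    realized[s]   : discrepancy of the observed data under posterior draw s
    replicated[s] : discrepancy of data replicated from posterior draw s
    Returns the share of draws in which the replicated discrepancy is at
    least as large as the realized one.
    """
    if len(realized) != len(replicated) or not realized:
        raise ValueError("Need equally long, non-empty lists of discrepancies.")
    hits = sum(1 for obs, rep in zip(realized, replicated) if rep >= obs)
    return hits / len(realized)


def fit_is_adequate(ppp, lower=0.05, upper=0.95):
    """Apply the 0.05 to 0.95 rule for adequate model-data fit."""
    return lower <= ppp <= upper


# Hypothetical discrepancies from four posterior draws.
ppp = ppp_value([1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0])
```

A value near 0.5 means the replicated data exceed the observed discrepancy about as often as they fall below it, which is the pattern described as satisfactory fit.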
Item Characteristics Nuances Across Latent Groups
Figure 3 shows item parameter estimates across the two latent classes. The dashed lines in Figure 3 correspond to the estimated finite mixture components of item difficulties, time intensities, and visual intensities, reflecting the impact of having preknowledge of the items. The solid line shows the estimated item difficulties, time intensities, and visual intensities for those classified as normally behaved test takers. Due to the constraints imposed for identifying latent classes, item difficulties (

Item Parameter Estimates Across Identified Latent Classes
In terms of item difficulties, examinees belonging to the preknowledge-cheating latent group appeared, on average, to have a lower level of item difficulty than those belonging to the normally behaved latent group. For the latent class with normal test-taking behavior,
Impact of Having Preknowledge of Test Items on Item DIFs
Note.
Similarly, test takers in the latent class for preknowledge cheating behavior tend to spend less time finishing their exams.
The impact of having preknowledge on visual intensities is similar to that observed for response accuracy and time intensity: test takers classified in the preknowledge latent class tend to put less visual effort into tracking information to decode test items. Generally, the visual intensities for test takers in the normally behaved latent class ranged from 3.20 to 5.41, equivalent to 25 to 224 fixation counts after exponentiating the estimated visual intensity values. In contrast, visual intensities for the preknowledge-cheating latent class varied from 2.77 to 3.80, which is about 16 to 44 fixation counts averaged across all the test takers. Similarly, the Wald statistics and credible intervals supported the conclusion that the mean differences
Behavioral Pattern Differences Across Latent Groups
Figure 4 depicts behavioral pattern variations between the normally and aberrantly behaved test takers. The behavioral pattern differences were illustrated via the correlation matrices listed in Table 4, and the pair-wise scatterplots of the estimated latent constructs (ability, visual engagement, and processing speed) across the two identified latent classes. Regarding the latent preknowledge-cheating group, a high positive association between latent ability and working speed (

Scatterplots for Person-Side Parameter Estimates. A Loess Nonparametric Smoothed Curve Is Plotted for Each Scatterplot
Person-Side Correlation Matrix Estimates
Note. CI = confidence interval.
This result implies that when test takers answered normally, their latent ability was not related to their responding speed. For example, high-ability test takers could answer either quickly or slowly. Moreover, the visual engagement of these test takers was also independent of their ability levels. As for test takers with preknowledge, when they know the answer keys to the test items, they tend to answer quickly without paying careful attention to the content of the questions.
Detection Accuracy
As a means of evaluating the detection of preknowledge cheating based on the proposed mixture method, Table 5 shows the cross-tabulation of the numbers of test takers in each of the crossed categories, labeled TP, FP, FN, and TN as explained previously. The overall sensitivity of the proposed mixture method is perfect (
Classification.
Discussion
The use of technology-enhanced assessment systems has permitted practitioners and education specialists to gain a better understanding of the behavioral characteristics associated with the different groups exhibiting different responding styles through the use of enriched information compiled from the log-files of biometric and computational devices, such as extracted RTs and VFCs. As previously stated, home examinations are gaining in popularity. The opportunity for examinees to cheat during home examinations utilizing a range of advanced technologies is growing. Consequently, it is vital to preserve the integrity of an exam by securing it using multiple data sources beyond item responses. Through evaluating or modeling only response correctness on items, it is difficult for practitioners and researchers to discriminate between those with high ability and those with a priori test content knowledge. In this instance, if RT and visual attention are evaluated with the accuracy of responses, cheating cases, particularly those with prior knowledge of test questions, can be separated from high-ability candidates more precisely.
The suggested ML-mixture three-way joint model can aid in (a) accurately distinguishing abnormal test takers with prior knowledge of test items from those with normal behavior, (b) automatically estimating the person-side and item-side parameters for the latent groups, and (c) investigating pattern differences in the trade-offs among visual attention, working speed, and accuracy across the manifested latent classes. It does so by accounting for differences in test-taking behavior through jointly modeling visual fixations collected from an eye-tracker with conventional psychometric information such as item responses and response times. These relationships may aid practitioners in comprehending and explaining the distinctions between the recognized types of responding behavior. Potentially, the proposed model could serve as a useful tool for detecting aberrant test takers, thereby helping to keep home-based online assessments as secure as feasible.
The results from the real data example suggested that the proposed ML-mixture model yields at least two desirable outcomes. First, both item- and person-side parameters can be accurately estimated within specific latent classes. Accurately estimated item parameters can be used for future applications, such as assisting practitioners and substantive researchers in gaining a deeper understanding of the behavioral nuances and cognitive processes displayed by test takers from different groups, such as normally and aberrantly responding groups, in technology-enhanced environments. Second, aberrant test takers can be accurately classified concurrently. For example, class-specific labels,
Although the proposed model showed promise in the current study, several limitations need to be acknowledged. First, the current Bayesian implementation enables accurate characterization of parameter uncertainty. However, because of the computational intensity associated with Bayesian estimation, the suggested approach is best suited for post hoc analysis rather than for detecting cheating cases in real time. Furthermore, the current model includes two latent classes for distinguishing individuals who have prior knowledge of test items from those who behave normally. To identify more cheating subcategories, the model constraints need to be extended to accommodate more latent classes. In addition, though credible intervals were reported for the DIFs, scale comparability was not tested throughout. Moreover, due to the limited sample size, the current model used only the Rasch model for the item responses; alternative IRT models, such as partial credit and testlet models, might be used for more complex answer structures with more data.
In conclusion, the proposed model could be expanded. For instance, to study how preknowledge cheating could affect the mastery of latent skill attributes, it may be worthwhile to replace the Rasch model with a cognitive diagnostic model as a prospective next step in this line of research. In addition, other biometric variables, such as heart rate and blinking rate, could be added to the proposed modeling framework either as covariates or as independent latent constructs, which could provide more refined formative feedback to practitioners for a better understanding of cheating mechanisms and ultimately boost the cheating detection rate.
Footnotes
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
