Abstract
Change-point analysis (CPA) is a method for detecting abrupt changes in parameter(s) underlying a sequence of random variables. It has been applied to detect examinees’ aberrant test-taking behavior by identifying abrupt test performance change. Previous studies utilized maximum likelihood estimations of ability parameters, focusing on detecting one change point for each examinee. This article proposes a Bayesian CPA procedure using response times (RTs) to detect abrupt changes in examinee speed, which may be related to aberrant responding behaviors. The lognormal RT model is used to derive a procedure for detecting aberrant RT patterns. The method takes the numbers and locations of the change points as parameters in the model to detect multiple change points or multiple aberrant behaviors. Given the change points, the corresponding speed of each segment in the test can be estimated, which enables more accurate inferences about aberrant behaviors. Simulation study results indicate that the proposed procedure can effectively detect simulated aberrant behaviors and estimate change points accurately. The method is applied to data from a high-stakes computerized adaptive test, where its applicability is demonstrated.
When test results are used for high-stakes decisions, such as tracking students’ growth trajectory, career planning, and other professional decisions, test security is an important validity issue. With the rapid development of computer technology, many standardized tests are now delivered through computer-based testing (CBT) because of its operational advantages. Most importantly, CBT enhances test security because data from multiple sources, such as item responses and response times (RTs), can be collected and used to detect aberrant responding behavior.
Item responses in a test provide information for the estimation of latent ability underlying the responses, while RTs provide information about the working speed of the respondent. In the lognormal RT model proposed by van der Linden (2006), it is assumed that the working speed of an examinee is constant throughout the test. However, in many cases, the working speed may vary during a test. One scenario is related to different response strategies that an examinee might use (Fox & Marianti, 2016). Another scenario is when examinees show aberrant test-taking behaviors due to, for example, the warm-up effect, test speededness, a decline in motivation, or preknowledge of some items (Bradlow et al., 1998; Cheng & Shao, 2022; Marianti et al., 2014). Aberrant responding behaviors not only lead to errors in item parameter estimation but also to inaccurate estimations of ability, which ultimately lead to invalid inferences about students’ abilities (Douglas et al., 1998).
The working speed of an examinee is auxiliary information for a better understanding of examinees’ response behavior. In other words, RTs may provide important information for the detection of aberrant behavior as unexpected patterns in RTs may indicate certain types of aberrant responding behaviors. Specifically, if examinees have speededness or preknowledge of some items, they are likely to respond more quickly on these items than on others (van der Linden & van Krimpen-Stoop, 2003). On the other hand, examinees affected by the warm-up effect are likely to respond more slowly at the beginning of a test (Shao, 2017). In addition, RTs could be used to improve test design, item selection in computerized adaptive tests, item calibration, and cognitive diagnosis (Zhan et al., 2018).
There are two common approaches to dealing with aberrant responding behaviors. One approach is to model the aberrant responses using mixture models. For example, Bolt et al. (2002) used a two-class mixture Rasch model to classify examinees into speeded and nonspeeded classes. Other researchers (e.g., Meyer, 2010; Schnipke & Scrams, 1997; Wang & Xu, 2015; Wise & DeMars, 2006) proposed mixture models to differentiate solution behaviors from rapid guessing behaviors. The other approach is to detect aberrant behaviors. Previous research has mainly developed person-fit statistics (PFSs) and change-point analysis (CPA) methods. PFSs provide a numerical measure of how well a person’s response pattern fits an item response theory (IRT) model, which allows the detection of aberrant response patterns (Embretson & Reise, 2000; Meijer, 2002; Meijer & Sijtsma, 2001). However, PFSs only indicate model-data misfit; they do not provide the more detailed information needed to distinguish normal from aberrant responses. Moreover, the calculation of PFSs requires that examinees’ ability parameters be known, yet the ability estimates of examinees with aberrant behaviors can be seriously biased (Shao et al., 2016; Sinharay, 2016).
CPA is a method from the field of statistical quality control (e.g., Allalouf, 2007; Montgomery, 2013; von Davier, 2012) for detecting any change in the parameter(s) underlying a sequence of random variables. It has been used in many studies to detect aberrant responding behaviors by identifying an abrupt change in an examinee’s performance (Cheng & Shao, 2022; Choe et al., 2018; Shao et al., 2016; Sinharay, 2016; Zhang, 2014; Zhang & Li, 2016). A unique advantage of CPA is that it provides an estimate of the exact location (change point) at which an aberrant behavior occurs. This information is useful in two ways. First, it helps test designers to develop tests in a more rational way. Second, the change points indicate which responses are aberrant, so that parameters can be estimated more accurately from the data that exclude the aberrant item responses.
Some areas remain unexplored in previous studies on CPA. First, these studies considered and detected one change point only. However, an examinee may show both a warm-up effect and speededness, which are associated with two change points. Sinharay (2016) found that in some simulation conditions with more than one true change point, only one of the true change points could be detected. Second, the maximum likelihood estimation (MLE) of the examinee’s ability is often used in CPA. However, with fewer items, the MLE of the ability parameter is less accurate and thus influences the change point detection results. To increase the power of the test, Sinharay (2016) restricted the change point to the middle range of the items. Shao et al. (2016) also found that as the starting point of speeded responses gets closer to the end of the test, both the classification accuracy rate and power drop. Third, the detection results may be affected by some item characteristics. Shao et al. (2016) found that when items on a test are ordered from easy to difficult, the increasing difficulty parameters of items may be confounded with test speededness.
This article proposes a Bayesian CPA method using RTs to detect aberrant test-taking behaviors when there is more than one change point. The following sections first introduce current CPA methods for detecting aberrant behaviors. Next, the proposed Bayesian change-point detection procedure is presented, and simulation studies are conducted to investigate the effectiveness of the proposed approach. Real data analysis is then used to demonstrate its applicability. Finally, the limitations of the current study are addressed and future lines of research are suggested.
CPA for Aberrant Responding Behavior Detection
Various aberrant responding behaviors are often associated with an abrupt change in an examinee’s performance in taking a test. Zhang (2014) proposed a real-time sequential item monitoring procedure based on CPA to detect compromised items in computerized adaptive testing (CAT). Having compromised items in an item bank may cause substantial bias and root mean square error in ability estimates, jeopardizing the validity and fairness of the test (Liu et al., 2019; Zhang et al., 2012). Further, Zhang and Li (2016) proposed a sequential test method based on an IRT model. In this method, the MLE of the ability parameter is used. If there is a large difference between the expected and observed probability of a correct response, the item is likely to have been leaked. Choe et al. (2018) also proposed a change-point detection method in combination with RT data for detecting compromised items. Shao et al. (2016) proposed a likelihood ratio test statistic based on CPA to detect test speededness. This procedure can both identify speeded examinees and locate the point at which an examinee started to speed. The ability estimation was improved by removing suspected speeded responses rather than using the entire sets of responses consisting of those from examinees suspected of test speededness. Sinharay (2016) suggested three new PFSs based on tests for a change point to detect an abrupt change in an examinee’s ability related to aberrant behaviors in CAT. Among the three new PFSs, the power of the one based on the Wald test (Rao, 1973) was the largest. More recently, Cheng and Shao (2022) proposed a CPA procedure based on RTs to detect test speededness by detecting an increase in the working speed.
Unlike previous studies, we introduce a Bayesian approach to CPA (e.g., Chen & Gupta, 2012; Inclán, 1993; Smith, 1975) for two reasons. First, it does not rely on the asymptotic assumptions about test statistics that frequentist methods require. These assumptions can be problematic in situations where the parametric models are restricted to a finite, possibly small interval of time (a small number of items). Second, a Bayesian approach allows the quantification of uncertainty in both the number and the location of the change points, which makes it possible to detect multiple change points and multiple aberrant behaviors simultaneously (Chopin, 2007; Ruggieri & Antonellis, 2016).
Further, we propose to analyze item RTs in the Bayesian CPA method. In general, RTs are continuous rather than binary and can therefore offer more information on aberrances (van der Linden & Guo, 2008). Van der Linden and van Krimpen-Stoop (2003) proposed using a lognormal RT model to detect preknowledge and speededness in CAT. They found that the methods using response data had nearly no power, but that a Bayesian approach with RTs almost doubled detection rates relative to the classical methods based on item responses. Qian et al. (2016) applied the residual of the lognormal RT model to detect possible item preknowledge in computer-based licensure exams. Choe et al. (2018) found that, compared to the analysis of responses alone, preknowledge detection based on RTs can provide evident improvements in detection power with fewer false positives. Cheng and Shao (2022) also found that the power of the CPA procedure to detect test speededness is higher with RTs than with dichotomous item response data.
In this article, the proposed Bayesian CPA method detects aberrant test-taking behaviors that are associated with abrupt changes in the examinee’s speed. The number and location of the change points are model parameters that are inferred based on their posterior distributions. The speed of the examinee in each segment of the test partitioned by the change point is also estimated. Using these detection results, the types of responding behaviors can be inferred more comprehensively with greater accuracy.
Method
The Model
Suppose the lognormal RT model (van der Linden, 2006) fits the regular RTs from examinees. Consider examinees
where parameter
In Equation 1, the log RT follows a normal distribution:
The RT variables are independent between persons as well as between items given a person’s latent speed parameter under the local independence assumption (van der Linden, 2006). From Equation 2, we have
That is, for each examinee p,
where
In theory, there are at most n − 1 change points for n observations. However, the change points to be detected here are not scattered in each item but appear as the last item of one stage, which means that the next stage starts with the next item. This is mainly caused by the warm-up effect, speededness, fatigue, decline in motivation, and so on—hence, the number of change points is small. In this study, at most two or three change points in dozens of observations are assumed. To illustrate how the procedure works, let the maximum number of change points for each examinee be m. Thus,
Using the convention that
Given local independence, the joint density for
Assume that r is the probability of observing a change point for an examinee in a test. This implies that the number of change points q can be assumed to follow a binomial (m, r) distribution for each examinee. That is,
By selecting appropriate values of r, the number of change points of the examinees can meet certain prior information. Using Equation 7, the posterior probability of q can be computed to determine which value is better supported by the data.
Assume that
The three quantities r,
Given the likelihood and the priors, the joint probability density function of
Integrating over each
where
The posterior probability function
where
The sums on the right side cover all the possible values of
For assessing the evidence on the data about the number of change points q, we use the posterior distribution of q:
The value of q is empirically determined by comparing the posterior probability of each value of q:
The evidence that supports the number of change points detected is the one with the largest posterior probability, that is, the Bayesian estimate of q obtained by the posterior mode:
Given the posterior estimate of q, we obtain the posterior estimate of
Specification of the Hyperparameters
The hyperparameters need to be specified for r,

Probability of different values of q as r changes for binomial (2, r) and binomial (3, r).
From one point of view, the prior of q is equivalent to adding a penalty term to the model. As the number of change points increases, the model becomes more complex, and the likelihood function usually increases. We can reduce this effect by adding prior knowledge (a penalty term), so that stronger evidence is required before a more complicated model (a larger number of change points) is chosen. The posterior mode estimate of q is then obtained from the product of the likelihood function and the penalty term.
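The penalty interpretation can be illustrated with a minimal sketch of the binomial (m, r) prior on q. The function name `change_point_prior` and the example values of r are illustrative choices, not taken from the article. The sketch shows that zero change points is the a priori most probable value exactly when r < 1/(m + 1), which matches the ranges (0, 0.25) for m = 3 and (0, 1/3) for m = 2 used later in the parameter selection.

```python
from math import comb

def change_point_prior(m, r):
    """Binomial(m, r) prior probabilities for q = 0, 1, ..., m change points."""
    return [comb(m, q) * r**q * (1 - r)**(m - q) for q in range(m + 1)]

# Illustrative values only: r = 0.2 lies inside (0, 0.25), r = 0.3 lies outside.
for r in (0.2, 0.3):
    prior = change_point_prior(3, r)
    mode = max(range(len(prior)), key=prior.__getitem__)
    print(f"r = {r}: prior = {[round(p, 3) for p in prior]}, prior mode q = {mode}")
```

With r inside the stated range, each additional change point lowers the prior probability, so a larger q must be supported by a correspondingly larger gain in likelihood.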
The choice of
Inference About
The speed of each examinee at each stage of the test can also be inferred. Given the data and the modal value of the points of change
with
In this case, the posterior mean and SD of
If we want to calculate the marginal posterior of
At this point,
with
The moments are easy to evaluate but require many calculations, which can become infeasible for larger values of q (Inclán, 1993). In this case, it would be necessary to consider the modal approximation given in Equation 16. In fact, the results given in Equations 16 and 18 are very close to each other, and this is further illustrated in the simulation study. Using the location of the change points and the estimates of the speed in the different segments partitioned by the change points, we can infer the aberrant behaviors flagged by the method more accurately.
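The overall procedure can be sketched for a single examinee as follows. This is a sketch under stated assumptions and not a reproduction of the article's equations: it assumes known item parameters, the lognormal RT model in the form ln T ~ N(β − τ, 1/α²), an independent N(μ, σ²) prior on the speed τ of each segment, a binomial (m, r) prior on q, and a uniform prior over change-point locations given q. All function and variable names are hypothetical. Change points are enumerated exhaustively, which is feasible only for small m.

```python
import math
from itertools import combinations

def _sums(z, a, mu, sigma):
    # Posterior precision A and weighted sum B for one segment's speed.
    A = sum(a) + 1.0 / sigma**2
    B = sum(ai * zi for ai, zi in zip(a, z)) + mu / sigma**2
    return A, B

def log_segment_marginal(z, a, mu, sigma):
    """Log marginal density of one segment with its speed integrated out."""
    A, B = _sums(z, a, mu, sigma)
    C = sum(ai * zi * zi for ai, zi in zip(a, z)) + mu**2 / sigma**2
    const = sum(0.5 * math.log(ai / (2 * math.pi)) for ai in a)
    return const - math.log(sigma) - 0.5 * math.log(A) - 0.5 * (C - B * B / A)

def detect_change_points(log_rt, beta, alpha, m=3, r=0.2, mu=0.0, sigma=2.0):
    n = len(log_rt)
    z = [b - t for b, t in zip(beta, log_rt)]   # z_i ~ N(tau, 1 / alpha_i^2)
    a = [al**2 for al in alpha]                 # precisions
    log_post = {}                               # change points -> unnormalised log posterior
    for q in range(m + 1):
        # Binomial prior on q, uniform prior on the q locations (assumption).
        log_prior = (math.log(math.comb(m, q)) + q * math.log(r)
                     + (m - q) * math.log(1 - r) - math.log(math.comb(n - 1, q)))
        for cps in combinations(range(1, n), q):  # k = last item of a stage
            b = (0,) + cps + (n,)
            log_post[cps] = log_prior + sum(
                log_segment_marginal(z[s:e], a[s:e], mu, sigma)
                for s, e in zip(b, b[1:]))
    top = max(log_post.values())
    w = {c: math.exp(v - top) for c, v in log_post.items()}
    total = sum(w.values())
    post_q = [sum(x for c, x in w.items() if len(c) == q) / total
              for q in range(m + 1)]
    q_hat = max(range(m + 1), key=post_q.__getitem__)          # posterior mode of q
    cps_hat = max((c for c in log_post if len(c) == q_hat), key=log_post.get)
    b = (0,) + cps_hat + (n,)
    speeds = []                                 # (posterior mean, posterior SD) per segment
    for s, e in zip(b, b[1:]):
        A, B = _sums(z[s:e], a[s:e], mu, sigma)
        speeds.append((B / A, 1.0 / math.sqrt(A)))
    return post_q, q_hat, cps_hat, speeds

# Hypothetical examinee: 20 items, log RTs drop by 1.5 after item 12.
post_q, q_hat, cps_hat, speeds = detect_change_points(
    [4.0] * 12 + [2.5] * 8, beta=[4.0] * 20, alpha=[2.0] * 20)
print(post_q, q_hat, cps_hat, speeds)
```

The segment speeds returned here are conditional on the modal change points, in the spirit of the modal approximation discussed above; averaging over all change-point configurations would require the full posterior weights stored in `w`.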
Simulation Study
Simulation Design
The performance of the proposed method based on RTs is investigated for the following aberrant test-taking behaviors in a computerized test. These aberrant behaviors represent specific situations with abrupt changes in speed. Note that this study focuses on the RT-based CPA method for detecting changes in the responding behavior of examinees. The simulated aberrant responding behavior is not reflected in the true ability but is manifested in the working speed, which results in aberrant RT patterns.
Case 1: Speededness
When there is not sufficient time to work carefully on all the items, examinees may rapidly respond to remaining items before time is up. In this type of aberrant pattern, the RTs to a set of consecutive items at the end of the test are noticeably shorter compared to items in the early parts of a test.
Case 2: Warm-up effect and speededness
As examinees begin to take a test, they may have trouble settling in or warming up to the normal test-taking process. As there is no time stress, the examinees may spend more time on fully reviewing the early items. This can be found in conjunction with speededness behavior when too much time is spent on early items, leading to the presence of two aberrant behaviors.
We simulate

Two cases of aberrant test-taking behaviors. Note. In all scenarios, the number line represents the order of items and the arrow indicates the location of the change point; that is, there is an abrupt change in the examinee’s speed at the next item. The items below the bracket indicate where the aberrant behavior occurs. The label above the bracket gives the name of the aberrant behavior.
One way to generate the RT data affected by behavior in Case 1 is to subtract a constant from
where L is a speed shift caused by the aberrant behavior and represents the degree of an abrupt change in an examinee’s speed. RTs with the warm-up effect can be generated analogously from
when
which represents the normal situation.
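The data-generating idea can be sketched as follows, assuming the lognormal RT model in the form ln T ~ N(β − τ, 1/α²) and treating the speed shift L as an additive change in τ on the affected items (positive for speededness, negative for the warm-up effect). The function name, the item parameters, and the shift values are hypothetical illustrations rather than the settings of the study.

```python
import random

def simulate_log_rts(beta, alpha, tau, shifts=(), seed=None):
    """Draw log RTs for one examinee.

    shifts is a sequence of (start, stop, delta) tuples: the speed is
    tau + delta on items with 0-based index start <= i < stop.
    """
    rng = random.Random(seed)
    log_rts = []
    for i, (b, a) in enumerate(zip(beta, alpha)):
        # Speed in effect on item i: baseline plus any shift covering the item.
        speed = tau + sum(d for start, stop, d in shifts if start <= i < stop)
        # A higher speed lowers the mean log RT; the SD is 1 / alpha.
        log_rts.append(rng.gauss(b - speed, 1.0 / a))
    return log_rts

n = 20
beta, alpha = [4.0] * n, [2.0] * n   # hypothetical item parameters

# Normal responding: no shift.
normal = simulate_log_rts(beta, alpha, tau=0.0, seed=1)
# Case 1: speededness on the last five items (speed shift L = 1).
case1 = simulate_log_rts(beta, alpha, tau=0.0, shifts=[(15, n, 1.0)], seed=1)
# Case 2: warm-up on the first three items plus speededness at the end.
case2 = simulate_log_rts(beta, alpha, tau=0.0,
                         shifts=[(0, 3, -1.0), (15, n, 1.0)], seed=1)
```

Expressing both behaviors as shifts in speed keeps the normal situation as the special case with no shifts.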
The simulation study is based on a nonadaptive computerized test of three different test lengths: 20 items (short test), 50 items (moderate test), and 80 items (long test). The discrimination parameters are generated from
In the simulation, we use
In the case of
The total sample size is set at

Difference between the estimated change point and the true change point.
Parameter Selection
First, let m = 3 for each examinee; that is, the possible number of change points for each examinee is 0, 1, 2, or 3. Therefore, as discussed before, the hyperparameter r should be chosen from (0, 0.25). To obtain an appropriate value of σ, we calculate the hit rate and the false alarm rate for our approach in each condition for 100 different values of σ (0.05, 0.10, …, 4.95, 5.00), with μ = 0 and r chosen from (0, 0.25) for each examinee. The sample size is 500. We also compute the corresponding results when m = 2 and r is chosen from (0, 1/3). Figure 4 summarizes the false alarm rate as a function of σ when all 500 examinees respond normally, for m = 3 and m = 2. Figures 5 and 6 summarize the hit rate as a function of σ in each condition when all 500 examinees have aberrant behaviors, for m = 3 and m = 2, respectively.

False alarm rate about σ when m = 3 (left) and m = 2 (right).

Hit rate about σ when m = 3.

Hit rate about σ when m = 2.
The results show that, in all conditions, as the value of σ increases from 0.05 to 5, the false alarm rate and hit rate first increase and then level off after about σ = 2 for all three test lengths. This trend is expected; a smaller σ indicates a narrower range for the change of speed, and hence, it is more likely that no change point will be detected. By contrast, a larger value of σ indicates a larger range for the change of speed, which makes it easier to detect the change points. However, beyond about σ = 2, there is no obvious influence on the detection results. Note that the results for m = 2 and m = 3 are very similar, which indicates that the results are not very sensitive to m. (Because space is limited, only the hit rates for Case 1 are shown here; the results for Case 2 follow a similar pattern.) Based on the above analysis, m = 3, r ∈ (0, 0.25), σ = 2, and μ = 0 are used in the simulation study to evaluate the performance of the method.
Results of the Simulation
Tables 1 and 2 summarize the detection results for examinees who have the simulated aberrant behaviors with one and two change points. The hit rate, false detection rate, false alarm rate, and mean and SD of the absolute lag, calculated on the basis of 50 replications, are shown in each table. As shown in Table 1 for q = 1, in almost all conditions, the hit rate is high and the mean and SD of the absolute lag are small. The hit rate increases as the test length increases, because the longer the test, the more items are affected by speededness. For a test of 20 items, the hit rate drops slightly as ω increases (the starting point of the speeded responses gets closer to the end of the test), but this is not found for tests of 50 or 80 items. This drop is expected because fewer items are affected by speededness, especially in a short test, and it is consistent with the findings of Shao et al. (2016). Note that the power range reported in Shao et al. (2016) was 0.49–0.88 for a test of 50 items, where the medians of ω were 0.5, 0.6, and 0.7. The Bayesian CPA method is therefore more efficient. An important reason for this is the use of RTs and the greater information that can be obtained from the data. In addition, the hit rate is influenced by the degree of speed shift, mainly when the test length is 20 items: the bigger the shift, the higher the hit rate.
Detection for q = 1
When comparing the results of the lag, we find that conditions with a bigger speed shift (
Table 2 shows the detection results for aberrant behaviors when
Detection for
The false alarm rate is around 0.05 when the test length is 20 items, 0.03 for 50 items, and 0.02 for 80 items. Therefore, the false alarm rate can be well controlled within a reasonable range for tests of different lengths. For the false detection rate, the corresponding three columns in Tables 1 and 2 show how often the true number of change points is misidentified as each of the other values. These columns show that there is a relatively small possibility of “magnifying” the number of change points.
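The evaluation criteria can be made concrete with a short sketch. The exact definitions used in the study are not reproduced in the text above, so the following are assumptions: the false alarm rate is the share of normal examinees flagged with at least one change point, the hit rate is the share of aberrant examinees whose number of change points is recovered, and the absolute lag is the distance between estimated and true change points among the hits. The function name and the example data are hypothetical.

```python
def detection_metrics(true_cps, est_cps):
    """true_cps, est_cps: one tuple of change points per examinee."""
    nan = float("nan")
    pairs = list(zip(true_cps, est_cps))
    normal = [e for t, e in pairs if len(t) == 0]
    aberrant = [(t, e) for t, e in pairs if len(t) > 0]
    # Normal examinees flagged with any change point count as false alarms.
    false_alarm = sum(len(e) > 0 for e in normal) / len(normal) if normal else nan
    # A hit requires the correct number of change points (assumed definition).
    hits = [(t, e) for t, e in aberrant if len(e) == len(t)]
    hit_rate = len(hits) / len(aberrant) if aberrant else nan
    # Absolute lag between matched estimated and true change points.
    lags = [abs(ek - tk) for t, e in hits for tk, ek in zip(t, e)]
    mean_lag = sum(lags) / len(lags) if lags else nan
    return {"false_alarm": false_alarm, "hit_rate": hit_rate, "mean_abs_lag": mean_lag}

# Hypothetical results for five examinees (two normal, three aberrant).
true_cps = [(), (), (15,), (3, 15), (15,)]
est_cps = [(), (14,), (16,), (3, 17), ()]
print(detection_metrics(true_cps, est_cps))
```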
Inference About the Types of Aberrant Behaviors
In addition to the detection of the change points, the value of
Posterior Expected Value and Standard Deviation of
Table 3 shows that the estimations of
Robustness of the Proposed Method to Assumption Violations
Violation of Lognormal RT Models
As the Bayesian CPA method is established for the lognormal RT model, some RT data that are not from the lognormal model are simulated. The distribution of RT data is typically positively skewed (Fazio, 1990). Assuming that the transformed RTs are normal, we simulated the χ2 distribution of RTs with degrees of freedom
Detection When There Is Noise in the Lognormal Response Time Model
Note. The proportion of examinees with aberrant behaviors is assumed to be 30%. When
Table 4 indicates that as the noise ratio increases, the hit rate of the method decreases, the false alarm rate increases, and the mean and SD of the lag also increase. These results suggest that the square root performs worse than the log and the reciprocal in terms of reducing positive skewness (Stocké, 2004; van der Linden et al., 1999), and so this RT model deviates further from the assumption. Overall, however, the performance of the method is only slightly worse in the presence of such disturbances. It can therefore be concluded that the method is not very sensitive to the assumption of the lognormal RT model.
Violation of the Prior With the Binomial Distribution
Assuming that each examinee may have 0, 1, 2, or 3 change points, with numbers that do not follow a binomial (3, r) distribution, the proposed detection model (with the binomial prior) was used to identify the change points. In addition, we consider another pattern of speed change, the gradual speed change, to examine the performance of the method.
The simulation settings are specified as follows.
Two test scenarios: In Case 1, the largest proportion of examinees comprises those who have aberrant test-taking behaviors. The truncated Poisson distribution is used to generate the number of change points for each examinee, which is less than or equal to 3. The parameter
Two speed change patterns: When where
Four test lengths: 20, 40, 60, and 80 are considered in each condition to get a more detailed picture of how detection results change with test length.
The false alarm rate, the hit rate of aberrant behaviors (examinees who have 1, 2, or 3 change points), and the overall correct classification rate for the number of change points are calculated to evaluate the performance of the method. The results are shown in Table 5.
Detection When the Underlying Prior Distribution is Different From the Binomial (3, r)
Note. “Abrupt” indicates that the speed change of the warm-up effect is abrupt; “gradual” indicates that the speed change of the warm-up effect is gradual.
First, the false alarm rates under various conditions are basically consistent with those shown in Tables 1 and 2. The longer the test, the lower the false alarm rate. When the test length is 40, the false alarm rate is around 0.04. Second, in Cases 1 and 2, when the speed change is abrupt, the hit rates for one and two change points are roughly the same as those shown in Tables 1 and 2 and are largely unaffected by the underlying prior distribution. When the warm-up is subject to a gradual change in speed, the hit rates for detecting the change points are lower than under the abrupt change (see the hit rates for two and three change points). In this situation, the speed of the examinee is constantly changing across a series of items, and the degree of change at the change point is small, so the detection efficiency decreases significantly. In addition, when there are three change points, the hit rate is lower than for other numbers of change points, and it increases with test length. Regarding the correct classification rate of examinees’ test-taking behaviors (according to the number of change points), the rates in Case 2 are higher than those in Case 1. As the number of examinees with zero change points is the largest in Case 2, which is consistent with our assumption (the probability of having zero change points is the largest for each examinee), the overall classification rate is higher.
Summary of the Simulation Results
In summary, the Bayesian CPA method is effective in detecting aberrant test-taking behaviors when there is more than one change point in the test performance of an examinee, and the false alarm rate can be controlled in a reasonable range. The location of change points for each examinee can also be estimated accurately. When this is combined with the estimation of the speed in each segment partitioned by the change points, it is possible to determine the number and types of aberrant behaviors flagged by the method. The results indicate that the test length, the location of change points, and the degree of speed changes influence the efficiency of the detection results. The longer the test and the greater the degree of speed change at the change points, the higher the efficiency of the method. In addition, the proposed method is robust both to violations of the lognormal model assumptions and to an underlying distribution of the number of change points that is not binomial.
Real Data Analysis
To illustrate the application of the proposed method in real data, we used data from a high-stakes standardized CAT program that was analyzed by Meng et al. (2015). In high-stakes tests, when the examinee may feel nervous or under time pressure, the warm-up effect and speededness behaviors are likely to occur. We compared the Bayesian CPA method with RTs and the Wald test statistic based on item responses (Sinharay, 2016) in the real data analysis.
The data set consists of 2,061 examinees and the test length is 37 items; that is, each examinee responds to 37 items delivered adaptively from a pool of 620 items. We used the lognormal RT model to fit the data and estimate the parameters in the model. According to Meng et al. (2015), the fit of the lognormal RT model for this empirical data set is satisfactory. The estimated intensity parameter has a mean of 4.45 and variance of 0.23, and the discrimination parameter has a mean of 1.66 and variance of 0.21.
First, we used the Bayesian CPA method to calculate the false alarm rate through a simulation with real data conditions (n = 37 and N = 2,061), assuming a maximum of three change points, using hyperparameters

False alarm rate about r (left) and its relationship with the hit rate (right).
For each examinee, the posterior probability of the number of change points of the RT pattern was computed. We found that 1,389 examinees had no change point, 616 examinees had one change point, 53 examinees had two change points, and three examinees had three change points. The examinees who had no change point did not show any abrupt change in speed in their test performance, and hence, these examinees could be deemed as normal responding examinees. The examinees who had change points showed abrupt changes in speed in their test performance, indicating suspicious response behavior. We estimated the locations of the change points for each examinee and the speed in each segment partitioned by the change points. These results allowed us to infer the types of the suspected behaviors, as discussed in the following.
Among examinees who had one change point, those whose estimated speed in the second segment was faster may have had speededness or the warm-up effect. Specifically, if the speed was relatively normal (moderate) in the first segment but abnormally fast in the second segment, the behavior could be related to speededness. If the speed in the first segment was abnormally slow but relatively normal in the second segment, and the change point was in the first few items, the warm-up effect is possible. For examinees whose speed in the second segment was slow (perhaps because of their response strategy), it is possible that later items were more difficult, and therefore, they spent more time on those items.
For examinees with two change points, if the locations of the change points were at the beginning and the end of the test, respectively, and the speed was abnormally slow at the beginning, moderate in the middle, and abnormally fast at the end, then it is possible that speededness and the warm-up effect occurred simultaneously. In addition, if the two change points were relatively close to each other, with the speed abnormally high in the middle but relatively normal on either side, then rapid guessing on the middle items is likely. Similarly, for examinees with three change points, one can make inferences about their test-taking behaviors on the basis of the change points and the speed. It should be noted that the presence and the types of aberrant behaviors need to be determined based on these empirical results holistically.
Figure 8 shows the RT patterns and the detection results of five examinees who had change points. The item number is shown along the x-axis and the RTs (in seconds) along the y-axis. The dashed vertical line represents the location of the change point for each examinee. We can see that there is a clear abrupt increase or drop-off in RT after the detected change point. The change point for the first examinee is at Item 27, with an estimated speed of −0.30 before the change point and 1.33 after the change point. The speed is very fast in the last segment, so it is likely that speededness occurred. The change point for the second examinee is at Item 3, with an estimated speed of −0.74 before the change point and 0.23 after the change point; a warm-up effect is likely to be present for this examinee. The change point for the third examinee is at Item 27, with an estimated speed of −0.03 before the change point and −0.94 after the change point, indicating that the examinee most likely changed their response strategy before the end of the test. The change points for the fourth examinee are at Items 7 and 24, with an estimated speed of −0.86 before the first change point, 0.05 between the two change points, and 1.03 after the second change point. The examinee most likely showed both a warm-up effect and speededness. The change points for the fifth examinee are at Items 23 and 35, with estimated speeds of −0.55, 2.70, and −0.72 in the three test segments. The speed in the middle segment is abnormally fast, and it is therefore likely that the examinee adopted rapid guessing behavior for those items.

Figure 8. Response times of five detected examinees.
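The segment-wise speeds reported above can be illustrated with a minimal sketch. It assumes the lognormal RT model of van der Linden (2006), in which the log RT on item j has mean beta_j minus the examinee's speed and standard deviation 1/alpha_j, and it treats the item parameters and change points as known. The function name `segment_speeds`, the argument names, and the convention that a change point at item c closes the segment containing items 1 to c are assumptions made for illustration. The sketch uses a simple maximum likelihood estimate per segment, not the Bayesian estimation used in this article.

```python
import math

def segment_speeds(rts, alpha, beta, change_points):
    """ML speed estimate for each segment under the lognormal RT model.

    rts           -- response times in seconds, in administration order
    alpha, beta   -- item discrimination and time-intensity parameters (assumed known)
    change_points -- sorted 1-based item positions; items 1..c fall before
                     change point c (an assumed convention)
    """
    n = len(rts)
    bounds = [0, *change_points, n]
    if any(a >= b for a, b in zip(bounds, bounds[1:])):
        raise ValueError("change points must be strictly increasing and inside the test")
    speeds = []
    for start, end in zip(bounds, bounds[1:]):
        # Precision-weighted mean of (beta_j - ln t_j) over the items in the segment
        weights = [alpha[j] ** 2 for j in range(start, end)]
        resid = [beta[j] - math.log(rts[j]) for j in range(start, end)]
        speeds.append(sum(w * r for w, r in zip(weights, resid)) / sum(weights))
    return speeds
```

Called with one change point, the function returns two speeds (before and after); with two change points it returns three, which matches the way the speeds of the fourth and fifth examinees are reported.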
The Wald test using response data was also applied to analyze the test-taking behavior of this group of examinees. Following Sinharay (2016), the unidimensional Rasch model was used for item calibration. A Bayesian ability estimate was used initially, the MLE of ability was used for the Wald test, and a 5% significance level was chosen (which was nearest to the false alarm rate of the Bayesian CPA method above). For the test with 37 items, 37 × 0.15 = 5.55 ≈ 6 items located at the beginning and the end of the test were removed, giving a corresponding critical value of 8.85. A total of 382 examinees were found to have change points, which means that the correct response probability of these examinees changed significantly at a certain item. Among these examinees, according to the Bayesian CPA results, 154 had one change in speed, 20 changed speed twice, one changed speed three times, and 207 did not change speed.
These results can be combined with those obtained using the Bayesian CPA method to make further inferences about the aberrant behavior. For example, if the correct response probability of an examinee decreases after a certain item and the speed after that item is faster than before, this provides further evidence of speededness. In contrast, if the speed decreases, it is very likely that the decline in the probability of a correct response was caused by item difficulty, not by aberrant behavior. For the 207 examinees who had no change in speed, there is insufficient evidence to determine the presence of aberrant behavior, despite the changes in their correct response probabilities. In addition, 497 examinees had one or more changes in speed, but their correct response probabilities did not change.
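The way the two sources of evidence are combined in this paragraph can be written as a small decision rule. The sketch below is illustrative only: the function name `combine_evidence` and the string labels are hypothetical, and cases that the text does not discuss (for example, an increase in accuracy together with a speed change) are returned as not covered rather than given an interpretation.

```python
def combine_evidence(accuracy_change, speed_change):
    """Combine a Wald-test result (accuracy) with a Bayesian CPA result (speed).

    accuracy_change -- "decrease", "increase", or "none" (hypothetical labels)
    speed_change    -- "faster", "slower", or "none" (hypothetical labels)
    """
    if accuracy_change == "decrease" and speed_change == "faster":
        return "speededness likely"
    if accuracy_change == "decrease" and speed_change == "slower":
        return "accuracy drop likely due to item difficulty"
    if accuracy_change != "none" and speed_change == "none":
        return "insufficient evidence of aberrant behavior"
    if accuracy_change == "none" and speed_change != "none":
        return "speed change only: interpret from segment speeds"
    if accuracy_change == "none" and speed_change == "none":
        return "no change detected"
    # Combinations the text does not address are left for holistic review
    return "not covered by the stated rules: review holistically"
```

Such a rule is only a screening aid; as noted elsewhere in the article, the presence and type of aberrant behavior should be judged holistically.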
The results of the two methods show that, when false alarm rates are similar, the Bayesian CPA method using RTs is efficient at detecting change points in the performance of examinees and at providing information about their speed change at each stage. By combining the detection results of the two methods, we are able to better understand the behavior change process of each examinee in the test. Further comparison and generalization of these results merit attention in future studies.
Summary and Discussion
Various aberrant behaviors are often observed in real test data. It is reasonable to expect that in many cases, there will be more than one change point or more than one aberrant behavior in an examinee’s test performance. RTs in CBT reveal information about the working speed of the examinee. Further, RTs provide information useful for detecting changes in response behaviors. This article proposes a Bayesian CPA approach using RTs to identify aberrant test-taking behaviors on the basis of abrupt changes in the speed of examinees.
The proposed method has the following characteristics and advantages: (1) it does not rely on asymptotic assumptions about test statistics, which can be problematic in situations where parametric models are restricted to a small number of items; (2) the number and location of change points are taken as parameters in the model, and the prior information is combined to infer these parameters from the posterior distribution, such that an examinee can have multiple change points or multiple aberrant behaviors; (3) RTs are used to provide more information about changes in an examinee’s test performance and to improve the efficiency of detection of aberrant behavior; and (4) the speed of examinees at each stage can be determined, which helps to make more accurate inferences about the types of behaviors flagged by the method. Overall, the detection results provide a more comprehensive understanding of examinee behaviors.
In this study, the number of change points of each examinee is assumed to follow a binomial distribution. The hyperparameter r is associated with the probability of the number of change points; it is selected from a certain range and contributes to making the inference more efficient. This can be seen as a reasonable penalty imposed on the selection of change points, which counteracts the tendency of the model itself to choose more change points. The hyperparameter
Several limitations of this study should be noted. First, the proposed method assumes that aberrant behavior manifests as abrupt changes in speed. This may not be the only underlying factor in the occurrence of aberrant behavior for examinees. Accordingly, an important distinction between the detected change points is whether they are caused by aberrant behavior, item characteristics, or different strategies that an examinee might legitimately use. Our results indicate that it is possible to combine an examinee’s change points with the speed of each segment to distinguish different causes. In addition, the procedure assumes that aberrant behavior will manifest in different responding speeds irrespective of performance. These results can be combined with information from the CPA results based on the response data. Testing the differences in response accuracy partitioned by the change points might be another way to validate the results from the proposed method. It is worth noting that the evidence provided by statistical results should not be the sole source used to flag possible aberrant behaviors, and it is highly recommended that such evidence be combined with evidence from other sources for cross-validation (Cheng & Shao, 2022; Marianti et al., 2014; Sinharay, 2016; van der Linden & van Krimpen-Stoop, 2003).
Second, the method is more powerful for longer tests and larger speed shifts. If the test is short (e.g., 20 items), the efficiency of the method is largely determined by the degree of the speed change. Further, if the speed shift is small, or the speed change is gradual, the performance of the method could be significantly worse, including a significant decrease in hit rate and change point estimation accuracy. Moreover, the greater the number of change points for the examinee, the stronger the dependence on these two conditions. Thus, how small a speed shift can be detected is worthy of further exploration.
Third, the method is appropriate when an investigator wants to detect aberrant behavior occurring on a set of consecutive items. However, for the detection of behavior on items scattered throughout a test, the method proposed here may not be appropriate. Fourth, the proposed method is established only for the lognormal RT model, but there may be data sets that the lognormal model does not fit.
Future research may address the following areas. First, the Bayesian CPA method could be applied to detect item preknowledge. When the investigator knows which items are compromised, the items answered by each examinee can be reordered, with the compromised items in one segment and the rest in another segment (Sinharay, 2017a, 2017b). The Bayesian CPA method can detect item preknowledge by examining whether the speed of an examinee is significantly faster on the compromised items than on the noncompromised items. Second, different prior distributions can be explored. Based on this study, it is critical to ensure the reasonable selection of parameters, which should take into account not only reality but also theoretical beliefs, to make the detection as efficient as possible. Third, different RT models can be applied in future work, such as the Box–Cox normal RT model (Klein Entink et al., 2009), which allows for more flexibility in the transformation of the RT data to obtain a normal distribution. Fourth, the proposed RT method can be combined with IRT models to utilize responses and RTs simultaneously through joint modeling. For example, in the hierarchical model of van der Linden (2007), it models the examinee ability
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research and/or authorship of this article: This work was supported by National Natural Science Foundation of China (11871141).
