Person Fit Analysis in Computerized Adaptive Testing Using Tests for a Change Point

Abstract

Meijer and van Krimpen-Stoop noted that the number of person-fit statistics (PFSs) that have been designed for computerized adaptive tests (CATs) is relatively modest. This article partially addresses that concern by suggesting three new PFSs for CATs. The statistics are based on tests for a change point and can be used to detect an abrupt change in test performance of examinees during a CAT. The Type I error rate and power of the statistics are computed from a detailed simulation study. The performances of the new statistics are compared with those of four existing PFSs using receiver operating characteristics curves. The new statistics are then computed using data from an operational and high-stakes CAT. The new PFSs appear promising for assessment of person fit for CATs.

Keywords

CUSUM, likelihood ratio test, receiver operating characteristics curve, statistical process control, score test, Wald test

Computerized adaptive tests (CATs) rely heavily on item response theory (IRT) models. According to Standard 4.10 of the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council for Measurement in Education, 2014), evidence of model fit should be documented when IRT models are used in test development. Person-fit statistics (PFSs), which quantify the fit of an examinee’s score pattern¹ to the IRT model (Bradlow & Weiss, 2001, p. 86), may constitute a part of that evidence (see, e.g., Ferrando, 2015; Meijer & Sijtsma, 2001, for detailed reviews of the existing PFSs). Further, in a report for the Council of Chief State School Officers, Olson and Fremer (2013) recommended the use of PFSs, in addition to other methods, to detect irregularities in answering behavior.

Glas, Meijer, and van Krimpen-Stoop (1998); Nering (1997); and van Krimpen-Stoop and Meijer (1999) showed that PFSs that are appropriate for nonadaptive tests usually have low power for CATs. The low power has been attributed to two features of CATs: short test length and modest spread of the item difficulties (Meijer & van Krimpen-Stoop, 2010, p. 320). In addition, Bradlow and Weiss (2001) noted that it is difficult to find appropriate null distributions (i.e., the distribution under no misfit) of PFSs in CATs because different examinees receive different items and, often, tests of different lengths. Therefore, it is no surprise that Meijer and van Krimpen-Stoop (2010, p. 320) noted the number of PFSs with satisfactory performances for CATs to be relatively modest.

Researchers such as Bradlow and Weiss (2001); Bradlow, Weiss, and Cho (1998); and van Krimpen-Stoop and Meijer (2000, 2001, 2002) suggested several PFSs that are based on the cumulative sum (CUSUM) procedure, which is a methodology from statistical process control (e.g., Montgomery, 2013). Each CUSUM-based PFS involves a CUSUM of positive and negative residuals after each item—a CUSUM that is too large in absolute value indicates a person misfit. The CUSUM-based PFSs have been successful in detecting a string of consecutive correct or incorrect answers (e.g., Meijer, 2002, p. 223), which is mostly associated with an abrupt change in the test performance (in the form of tiredness, speededness, loss of concentration, item preknowledge, etc.) of an examinee. Such an abrupt change is referred to as a theta shift in the literature on identification of faking (e.g., Ferrando & Anguiano-Carrasco, 2013).

Several researchers such as Hawkins, Qiu, and Kang (2003, p. 357) and Montgomery (2013, p. 491) noted that if an investigator is interested in detecting an abrupt change in the context of statistical process control, the CUSUM procedures are the most appropriate (in the sense of being the most powerful) when the parameters of the underlying statistical model before and after the change are known; however, if one or more of the parameters are unknown, the application of tests for a change point (TFCP; Andrews, 1993; Chen & Gupta, 2012; Csorgo & Horvath, 1997) may be more appropriate than that of the CUSUM procedures. Given that the examinee ability parameter is unknown² in CATs, TFCP may be successful in detecting an abrupt change in test performance. However, examples of person-fit assessment (PFA) using TFCP in the context of CATs are severely lacking.

This article suggests three new PFSs based on TFCP for use with CATs. The TFCP focus on finding the point in time where the underlying statistical model or the model parameters underlying a sequence of observations have changed in some fashion (Montgomery, 2013, p. 490); the null hypothesis of no change is tested versus the alternative hypothesis that a change has occurred after some observation. Accordingly, the three new PFSs provide three slightly different approaches to test the null hypothesis that the examinee ability was unchanged throughout the CAT (which is equivalent to no theta shift and indicates no misfit) versus the alternative that it changed after some item (which indicates misfit).

The next section includes some background material including reviews of existing PFSs in CATs, TFCP, and two existing applications of TFCP to educational testing. Three PFSs based on TFCP are suggested in the Method section. In the Simulation Study section, the Type I error rate and power of the new PFSs are examined and compared to those of the CUSUM-based PFSs of Bradlow et al. (1998), Armstrong and Shi (2009a), van Krimpen-Stoop and Meijer (2000), and van Krimpen-Stoop and Meijer (2001) in a simulation study. The new PFSs are computed using data from a high-stakes CAT in the Application section. Conclusions and recommendations are provided in the last section.

This article focuses only on dichotomous items whose parameters are assumed to be known. The assumption of known item parameters is common in scoring on CATs. Further, van Krimpen-Stoop and Meijer (2000), van Krimpen-Stoop and Meijer (2001), and Bradlow and Weiss (2001) all assumed item parameters to be known in PFA for CATs.

Background

Notations

Consider a CAT in which n dichotomous items are administered to an examinee whose true ability is θ. Note that the test length n could vary over the examinees. Let Y_i represent the examinee’s score (i.e., 0 or 1) on the ith item, i = 1, 2, …, n. Let us denote the probability of a correct answer on item i for the examinee as P_i(θ). For example, for the three-parameter logistic model (3PLM),

P_{i} (θ) = c_{i} + (1 - c_{i}) \frac{\exp [a_{i} (θ - b_{i})]}{1 + \exp [a_{i} (θ - b_{i})]}, \tag{1}

where a_i, b_i, and c_i, respectively, are the slope, difficulty, and guessing parameters of item i.
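As a minimal illustration of Equation 1, the response probability can be computed as sketched below. The function name `p_3pl` and the example parameter values are illustrative assumptions and do not come from the original study.

```python
import math

def p_3pl(theta, a, b, c):
    """Probability of a correct answer under the 3PLM.

    theta: examinee ability; a: slope; b: difficulty; c: guessing parameter.
    Uses exp(z) / (1 + exp(z)) = 1 / (1 + exp(-z)).
    """
    z = a * (theta - b)
    return c + (1.0 - c) / (1.0 + math.exp(-z))

# The Rasch model is the special case a = 1 and c = 0.
print(p_3pl(0.0, 1.0, 0.0, 0.0))  # 0.5
```

Setting `a = 1.0` and `c = 0.0` gives the Rasch probabilities used in the simulation study.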

Review of PFSs Based on the CUSUM Procedure

All the PFSs that perform satisfactorily for CATs (e.g., Bradlow et al., 1998; Bradlow & Weiss, 2001; van Krimpen-Stoop & Meijer, 2000, 2001, 2002) are based on the CUSUM procedure. The score pattern of an examinee on a CAT is expected to roughly be an alternation of 0s and 1s, especially toward the end of the test when the estimated ability is close to the true ability; therefore, a string of consecutive 1s (correct answers) or consecutive 0s (incorrect answers) may be the result of an aberrant response behavior (e.g., Meijer & van Krimpen-Stoop, 2010, p. 322). The CUSUM-based PFSs offer different approaches to detect such strings.

Bradlow et al. (1998) suggested a CUSUM-based statistic at a given ability as

\max_{1 \leq i \leq n} \frac{\left| \sum_{j \leq i} Y_{j} - \sum_{j \leq i} P_{j} (θ) \right|}{\sqrt{\sum_{j \leq i} P_{j} (θ) \{1 - P_{j} (θ)\}}}.

They suggested using the posterior mean of the above statistic (where the mean is computed over the posterior distribution of the examinee ability) as a PFS for CATs and referred to the PFS as the largest absolute realized deviation (LARD). The LARD should be computed after ordering the test items in different ways to detect different types of person misfit; for example, while it should be computed with the original item order to detect those who warm up to the test, it should be computed after reversing the item order to detect those who suffer from speededness. A score pattern is considered aberrant when the LARD is larger than a critical value, which is chosen using a permutation distribution. A permutation distribution of a PFS is somewhat similar to the bootstrap distribution of the PFS and is the distribution of the values of the statistic computed from several random permutations of the test items. Bradlow and Weiss (2001) suggested several other similar PFSs. While the PFSs suggested by Bradlow et al. (1998) and Bradlow and Weiss (2001) were computed for data from operational CATs, their Type I error rates and power have not been studied using simulated data.³

van Krimpen-Stoop and Meijer (2000) defined the iterative “upper” and “lower” cumulative statistics based on residual item scores on a CAT as

C_{i}^{+} = \max \{0, T_{i} + C_{i - 1}^{+}\}, \quad i = 1, 2, \dots, n, \tag{2}

and

C_{i}^{-} = \min \{0, T_{i} + C_{i - 1}^{-}\}, \quad i = 1, 2, \dots, n, \tag{3}

where

T_{i} = \frac{1}{n} \frac{Y_{i} - P_{i} (θ)}{λ_{i}} \tag{4}

is a weighted residual item score, and $λ_{i}$ is a weight function. The starting values are $C_{0}^{+} = C_{0}^{-} = 0$. The $C_{i}^{+}$ and $C_{i}^{-}$ statistics accumulate the weighted residuals $[Y_{i} - P_{i} (θ)] / λ_{i}$. The contribution of an item to the accumulation is positive or negative depending on whether the answer to item i is correct or incorrect. Person misfit is concluded when, for an appropriate critical value h, $C_{i}^{+}$ is larger than h or $C_{i}^{-}$ is smaller than −h for some i. Meijer (2002) used the above approach with $λ_{i} = 1$ to perform PFA using data from a high-stakes certification test that is computer adaptive. Tendeiro and Meijer (2012) introduced the two-sided statistic⁴

C^{T} = \max_{1 \leq i \leq n} C_{i}^{+} - \min_{1 \leq i \leq n} C_{i}^{-}, \tag{5}

that combines information from $C_{i}^{+}$ and $C_{i}^{-}$. van Krimpen-Stoop and Meijer (2000), Meijer (2002), and Tendeiro and Meijer (2012) chose the critical values for $C_{i}^{+}$, $C_{i}^{-}$, and $C^{T}$ using a parametric bootstrap simulation⁵ because of the difficulty of finding a critical value based on theoretical derivations. van Krimpen-Stoop and Meijer (2002) extended the above approach to polytomous items. The estimate of the examinee ability from the whole CAT is used for the computation of $C_{i}^{+}$, $C_{i}^{-}$, and $C^{T}$.
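The recursions for the upper and lower cumulative statistics and the two-sided statistic C^T can be sketched as follows. This is a minimal sketch, assuming that the probabilities P_i(θ) have already been evaluated at the ability estimate and that λ_i = 1 by default (the choice attributed to Meijer, 2002, above). The function name `cusum_statistics` is hypothetical, and the critical value h is not computed here because it would have to come from a bootstrap simulation.

```python
def cusum_statistics(scores, probs, weights=None):
    """Upper and lower CUSUM paths and the two-sided statistic C^T.

    scores: item scores Y_i (0 or 1); probs: P_i(theta) at the ability estimate;
    weights: lambda_i (assumed equal to 1 when omitted).
    """
    n = len(scores)
    if weights is None:
        weights = [1.0] * n
    c_plus, c_minus = 0.0, 0.0            # starting values C_0^+ = C_0^- = 0
    upper, lower = [], []
    for y, p, lam in zip(scores, probs, weights):
        t = (y - p) / (n * lam)           # weighted residual T_i
        c_plus = max(0.0, t + c_plus)     # upper recursion
        c_minus = min(0.0, t + c_minus)   # lower recursion
        upper.append(c_plus)
        lower.append(c_minus)
    c_t = max(upper) - min(lower)         # two-sided statistic C^T
    return upper, lower, c_t

upper, lower, c_t = cusum_statistics([1, 1, 0, 0], [0.5, 0.5, 0.5, 0.5])
print(c_t)  # 0.5
```

In this toy pattern the two correct answers push the upper path up and the two incorrect answers pull the lower path down, so both contribute to C^T.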

van Krimpen-Stoop and Meijer (2001) suggested a CUSUM-based procedure that employs upper and lower statistics that look like those in Equations 2 and 3, respectively, but are computed from S subsets of items. The upper and lower cumulative statistics, denoted as $C_{s}^{+}$ and $C_{s}^{-}$ , respectively, are defined as:

C_{s}^{+} = \max \{0, Z_{s} - 0.5 + C_{s - 1}^{+}\}, \quad s = 1, 2, \dots, S, \tag{6}

and

C_{s}^{-} = \min \{0, Z_{s} + 0.5 + C_{s - 1}^{-}\}, \quad s = 1, 2, \dots, S, \tag{7}

where Z_s is the value of the $l_{z}^{*}$ statistic (Snijders, 2001; van Krimpen-Stoop & Meijer, 1999) computed from item subset s using the ability estimate from the whole CAT. The $l_{z}^{*}$ statistic has an asymptotic standard normal distribution (Snijders, 2001). Using results for CUSUM with normally distributed random variables (e.g., Siegmund, 1985), van Krimpen-Stoop and Meijer (2001) found theoretical critical values for $C_{s}^{+}$ and $C_{s}^{-}$. For example, the critical values are 3.49 and 2.02 at 1% and 5% levels, respectively. van Krimpen-Stoop and Meijer found that for subsets with 10 items, the theoretical critical values led to a satisfactory Type I error rate of the $C_{s}^{+}$ and $C_{s}^{-}$ statistics. However, the Type I error rates were too small and too large, respectively, for larger and smaller subsets of items. As in Tendeiro and Meijer (2012), it is possible to consider the two-sided statistic

C^{l_{z}^{*}} = \max_{1 \leq s \leq S} C_{s}^{+} - \min_{1 \leq s \leq S} C_{s}^{-},

that combines information from $C_{s}^{+}$ and $C_{s}^{-}$.

Armstrong and Shi (2009a) suggested a PFS based on the CUSUM procedure and a likelihood ratio statistic. In this approach, the “upper” and “lower” cumulative statistics are defined as

C_{i}^{U} = \max \{0, γ_{i}^{U} + C_{i - 1}^{U}\}, \quad i = 1, 2, \dots, n, \tag{8}

and

C_{i}^{L} = \min \{0, γ_{i}^{L} + C_{i - 1}^{L}\}, \quad i = 1, 2, \dots, n, \tag{9}

where $γ_{i}^{U}$ denotes a likelihood ratio statistic for testing whether there is an “aberrant upward shift” of the probability of a correct answer (due to an aberrant behavior such as item preknowledge) and $γ_{i}^{L}$ denotes a likelihood ratio statistic for testing whether there is an “aberrant downward shift” of the probability of a correct answer (due to an aberrant behavior such as fatigue). The starting values are $C_{0}^{U} = C_{0}^{L} = 0$. Armstrong and Shi (2009a) and Tendeiro and Meijer (2012) described the computations of $γ_{i}^{U}$ and $γ_{i}^{L}$ and suggested the two-sided statistic

C^{L R} = \max_{1 \leq i \leq n} C_{i}^{U} - \min_{1 \leq i \leq n} C_{i}^{L}, \tag{10}

that combines information from $C_{i}^{U}$ and $C_{i}^{L}$. The critical values of $C_{i}^{U}$, $C_{i}^{L}$, and $C^{L R}$ can be obtained using Monte Carlo simulations (e.g., Armstrong & Shi, 2009a, p. 400).

Armstrong and Shi (2009a, p. 409) considered only nonadaptive tests and noted that their PFSs need to be adjusted for use with CATs, but such an adjustment is not available yet. However, their PFSs defined above can be computed for CATs without any adjustment.⁶ Tendeiro and Meijer (2012) suggested several PFSs based on the CUSUM procedure of Armstrong and Shi (2009a). Both Armstrong and Shi (2009a) and Tendeiro and Meijer (2012) found for simulated nonadaptive tests that the power of the PFSs given by Equations 8 through 10 is considerably larger than that of the PFSs given by Equations 2 through 5. Armstrong and Shi (2009b) suggested a model-free CUSUM procedure that is not considered here.

Other Person-Level Statistics for CATs

Meijer (2004) suggested the use of a statistic based on summed scores on subtests to detect unexpected combinations of subscores on paper-and-pencil (P&P) tests and CATs; however, the power of the statistic was somewhat smaller on CATs than on P&P tests, which Meijer (2004, p. 132) attributed to the smaller spread of item difficulties in CATs. McLeod and Lewis (1999) and McLeod, Lewis, and Thissen (2003) suggested new methods for detecting item memorization and item preknowledge, respectively, for CATs. However, these methods are not considered here because the interest here is in detecting an abrupt change in test performance.

TFCP in Statistical Process Control

There are several formulations of TFCP, but the formulation that is most relevant to this article is discussed in, for example, Andrews (1993), Chen and Gupta (2012, p. 2), Csorgo and Horvath (1997, p. 1), and Gombay and Horvath (1996). It involves the assumption that X₁, X₂, …, X_n are independent random variables, and X_i has a probability function (i.e., a mass function if X_i is discrete and a density function if X_i is continuous) f_i(X_i; ψ), where ψ is a unidimensional model parameter. The TFCP involve testing of the null hypothesis that no change in the value of ψ has occurred in the sequence X₁, X₂, …, X_n against the alternative hypothesis that the value of ψ has changed at a change point τ, so that the probability function of X_i, i = 1, 2, …, τ − 1 is f_i(X_i; ψ₁), but that of X_i, i = τ, τ + 1, …, n is f_i(X_i; ψ₂). Only the case when ψ₁, ψ₂, and τ are unknown is relevant here. Researchers such as Hawkins et al. (2003, p. 357) and Montgomery (2013, p. 491) noted that if one wants to test this hypothesis, the CUSUM procedures are the most appropriate (in the sense of being the most powerful) when the parameters of the underlying statistical model are known before and after the change; however, if the parameters are unknown, which is almost always the case in PFA for CATs (where the examinee ability is unknown), the TFCP may be more powerful than the CUSUM procedures.

Existing Applications of TFCP in Educational Testing

Lee and von Davier (2013) suggested an approach based on TFCP to detect unusual changes in the mean score of an international language assessment that is administered several times in a year. Shao, Li, and Cheng (2015) suggested an approach based on TFCP to detect speededness in nonadaptive tests; they used a likelihood ratio test (LRT) statistic and computed the critical values of the statistic using a permutation distribution. In this article, one of the new PFSs for CATs is very similar to the LRT statistic of Shao et al.; however, the critical values of the statistic are obtained here using theoretical large-sample derivations in Andrews (1993)—so the computational burden here is less than that in Shao et al. (2015). Also, given the discussion in Bradlow and Weiss (2001, pp. 94–95) on the problems with using simulation-based critical values for CATs that may lead to conservative tests, the use of a permutation distribution as in Shao et al. (2015) is not straightforward for CATs and may lead to a conservative test for CATs.

Other Research Related to Change in Examinee Ability or Test Performance

Researchers such as Finkelman, Weiss, and Kim-Kang (2010); Glas and Dagohoy (2007); and Klauer and Rettig (1990) considered hypothesis tests for a change or difference in the ability of an examinee between two testing occasions or two subsets of items, but the change point is assumed known in their cases, so the hypothesis testing problem becomes simpler than that considered here. Researchers such as Bolt, Cohen, and Wollack (2002) suggested IRT models for modeling speededness in nonadaptive tests, which is a type of change in test performance that is of major interest to test administrators; however, no IRT model known to the author exists for modeling speededness for CATs where different examinees answer different sets of items.

Method

The Rationale Behind Applying TFCP for Person-Fit Assessment in CAT

A review of research on PFA for CAT suggests that the CUSUM-based PFSs successfully detect person misfit when the examinee ability has abruptly changed during the test. For example:

The CUSUM-based PFSs are intended to detect score patterns that include a string of consecutive 1s or consecutive 0s (e.g., Meijer, 2002, p. 223), which is mostly associated with an abrupt change in the examinee ability.

In the simulation studies in van Krimpen-Stoop and Meijer (2001, p. 212), person misfit was introduced by setting the true examinee ability as θ₁ in the first half of a CAT but as θ₁ + δ in the second half of the CAT. This addition of δ is effectively an abrupt change in the true ability, or a theta shift, in the middle of the test.

Almost all the examples of person misfit in an operational CAT (such as possible distraction or loss of concentration, unfamiliarity in a section, fatigue or speededness) that are described in Meijer (2002) lead to an abrupt change in the examinee ability during the test.

Further, the change point and the examinee abilities before and after the change are unknown in PFA for CATs. Thus, given the widely accepted belief that TFCP are expected to be more appropriate than the CUSUM procedures when the parameters are unknown (e.g., Hawkins et al., 2003; Montgomery, 2013), the application of TFCP for PFA in CATs seems appropriate.

The examinee ability θ is the only model parameter (ψ) of interest in PFA for CAT because of the assumption of known item parameters. Because only dichotomous items are considered here, the f_i(X_i; ψ) in this context is the probability mass function for the Bernoulli distribution with probability of success equal to P_i(θ). The null hypothesis is that θ has not changed during the CAT and the alternative hypothesis is that it has changed and the probability of success is equal to P_i(θ₁) for the first (τ − 1) items but is equal to P_i(θ₂) for the last (n − τ + 1) items.
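Written compactly, the hypotheses described in the paragraph above take the following form. This is only a restatement under the same assumptions (known item parameters, dichotomous items), with θ, θ₁, θ₂, and τ all unknown.

```latex
H_{0}: Y_{i} \sim \mathrm{Bernoulli}\{P_{i}(\theta)\}, \quad i = 1, 2, \dots, n,
\qquad \text{versus} \qquad
H_{1}: Y_{i} \sim
\begin{cases}
\mathrm{Bernoulli}\{P_{i}(\theta_{1})\}, & i = 1, \dots, \tau - 1, \\
\mathrm{Bernoulli}\{P_{i}(\theta_{2})\}, & i = \tau, \dots, n,
\end{cases}
\qquad \theta_{1} \neq \theta_{2}.
```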

Three PFSs that are based on TFCP and can be used to test the abovementioned hypothesis for CATs are suggested below.

A PFS Based on the Wald Test

Let us denote the true ability of an examinee underlying the scores on items 1 to j as θ_1j and that underlying the scores on items (j + 1) to n as θ_2j. Let ${\hat{θ}}_{1 j}$ and ${\hat{θ}}_{2 j}$ denote the corresponding maximum likelihood estimates (MLEs) and ${\hat{θ}}_{0}$ denote the MLE of the ability computed from scores on items 1 to n. Finkelman et al. (2010) and Klauer and Rettig (1990) showed that the Wald test statistic (e.g., Rao, 1973, pp. 417–419) for testing the null hypothesis that θ_1j = θ_2j, when j is known, is given by

W_{j n} = \frac{{({\hat{θ}}_{1 j} - {\hat{θ}}_{2 j})}^{2}}{\frac{1}{I_{1 j} ({\hat{θ}}_{0})} + \frac{1}{I_{2 j} ({\hat{θ}}_{0})}}, \tag{11}

where $I_{1 j} ({\hat{θ}}_{0})$ is the estimated Fisher information based on items 1 to j and $I_{2 j} ({\hat{θ}}_{0})$ is that based on items (j + 1) to n, both computed at $θ = {\hat{θ}}_{0}$. The statistic W_jn applies both to nonadaptive tests and to CATs for which the item scores Y_i are locally independent given the model parameters and item strings (e.g., Bradlow & Weiss, 2001; Mislevy & Chang, 2000). Finkelman et al. (2010) and Klauer and Rettig (1990) showed that for known j, the asymptotic distribution of the above statistic is a χ² distribution with one degree of freedom.⁷
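For a known split point j, the Wald statistic can be sketched as below for the Rasch model. This is a minimal sketch under stated assumptions: the ability estimates are plain MLEs obtained by Newton–Raphson (the article's own computations use WLEs), the estimates are clipped to [−6, 6] so that all-correct or all-incorrect segments stay finite, and all function names are hypothetical.

```python
import math

def rasch_p(theta, b):
    """Rasch probability of a correct answer (3PLM with a = 1, c = 0)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def information(theta, bs):
    """Fisher information of a set of Rasch items at theta."""
    return sum(rasch_p(theta, b) * (1.0 - rasch_p(theta, b)) for b in bs)

def mle(ys, bs, bound=6.0, max_iter=100):
    """Newton-Raphson MLE of ability. Clipping to [-bound, bound] is an
    assumption made here to handle all-0 or all-1 score patterns."""
    theta = 0.0
    for _ in range(max_iter):
        gradient = sum(y - rasch_p(theta, b) for y, b in zip(ys, bs))
        new_theta = theta + gradient / information(theta, bs)
        new_theta = max(-bound, min(bound, new_theta))
        if abs(new_theta - theta) < 1e-10:
            return new_theta
        theta = new_theta
    return theta

def wald_statistic(ys, bs, j):
    """W_jn: compares the ability estimates from items 1..j and j+1..n,
    with both information terms evaluated at the whole-test estimate."""
    theta0 = mle(ys, bs)
    theta1 = mle(ys[:j], bs[:j])
    theta2 = mle(ys[j:], bs[j:])
    variance = (1.0 / information(theta0, bs[:j])
                + 1.0 / information(theta0, bs[j:]))
    return (theta1 - theta2) ** 2 / variance

ys = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]   # toy score pattern
bs = [0.0] * 10                        # toy item difficulties
print(wald_statistic(ys, bs, 5))
```

In the toy pattern, the first five items suggest a much higher ability than the last five, so the statistic is well above zero.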

Let us define the statistic

W_{\max, n} = \max_{n_{1} \leq j \leq n - n_{1}} W_{j n}, \tag{12}

for testing the null hypothesis of no change in the ability ($H_{0} : θ_{1 j} = θ_{2 j}$ for all j) versus the alternative of a change between items n₁ and (n − n₁).

Andrews (1993) and Csorgo and Horvath (1997) noted that the power of W_max,n is small (because the corresponding critical values diverge to infinity) if one wants to detect a change near the first or last observations. Therefore, Andrews recommended that in computing W_max,n, the maximum should be taken over n₁ ≤ j ≤ n − n₁ where n₁ > 2, to increase the power of the test. Andrews recommended setting n₁ to the integer nearest to 0.15n, which restricts the estimated change point to roughly the middlemost 70% of the observations.
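The trimming rule and the maximization over candidate split points can be sketched as follows. The helper names `trimmed_range` and `max_statistic` are hypothetical, `stat_at` stands for any function returning the statistic for a given j, and rounding ties upward is an assumption (the text only says "nearest integer").

```python
import math

def trimmed_range(n, fraction=0.15):
    """Return (n1, n - n1), with n1 the integer nearest to fraction * n."""
    n1 = math.floor(fraction * n + 0.5)   # ties rounded upward (assumption)
    return n1, n - n1

def max_statistic(stat_at, n, fraction=0.15):
    """Maximize a per-split statistic over n1 <= j <= n - n1.

    Returns the maximum value and the split point j at which it occurs.
    """
    n1, upper = trimmed_range(n, fraction)
    values = {j: stat_at(j) for j in range(n1, upper + 1)}
    j_hat = max(values, key=values.get)
    return values[j_hat], j_hat

print(trimmed_range(40))  # (6, 34)
```

For a 40-item test the candidate split points therefore run from 6 to 34.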

A PFS Based on the LRT

Researchers such as Finkelman et al. (2010) and Klauer and Rettig (1990) showed that the LRT statistic (e.g., Rao, 1973) for testing the null hypothesis θ_1j = θ_2j, when j is known, is given by

L_{j n} = - 2 \left[ L ({\hat{θ}}_{0}; Y_{1}, Y_{2}, \dots, Y_{n}) - L ({\hat{θ}}_{1 j}; Y_{1}, Y_{2}, \dots, Y_{j}) - L ({\hat{θ}}_{2 j}; Y_{j + 1}, Y_{j + 2}, \dots, Y_{n}) \right], \tag{13}

where, for example,

L (θ_{1 j}; Y_{1}, Y_{2}, \dots, Y_{j}) = \sum_{i = 1}^{j} \left[ Y_{i} \log P_{i} (θ_{1 j}) + (1 - Y_{i}) \log \{1 - P_{i} (θ_{1 j})\} \right],

denotes the log likelihood of Y₁, Y₂, …, Y_j at θ_1j. The statistic L_jn applies both to nonadaptive tests and to CATs and, for known j, follows a χ² distribution with one degree of freedom asymptotically (e.g., Finkelman, Weiss, & Kim-Kang, 2010; Klauer & Rettig, 1990).

Let us now define the PFS

L_{\max, n} = \max_{n_{1} \leq j \leq n - n_{1}} L_{j n}, \tag{14}

for testing the null hypothesis H₀: θ_1j = θ_2j for all j versus the alternative of a change between items n₁ and (n − n₁).

Shao et al. (2015) used a statistic very similar to the L_max,n statistic for detecting speededness in nonadaptive tests and obtained the critical values of the statistic using a permutation distribution.
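The LRT statistic for a known split point j can be sketched as below for the Rasch model. As before, this is only a sketch under assumptions: MLEs from Newton–Raphson stand in for the WLEs used in the article, the estimates are clipped to [−6, 6] to keep all-0 and all-1 segments finite, item difficulties are assumed moderate, and the function names are hypothetical.

```python
import math

def rasch_p(theta, b):
    """Rasch probability of a correct answer."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def mle(ys, bs, bound=6.0, max_iter=100):
    """Newton-Raphson MLE of ability, clipped to [-bound, bound] (assumption)."""
    theta = 0.0
    for _ in range(max_iter):
        ps = [rasch_p(theta, b) for b in bs]
        gradient = sum(y - p for y, p in zip(ys, ps))
        info = sum(p * (1.0 - p) for p in ps)
        new_theta = max(-bound, min(bound, theta + gradient / info))
        if abs(new_theta - theta) < 1e-10:
            return new_theta
        theta = new_theta
    return theta

def log_likelihood(theta, ys, bs):
    """Log likelihood of the item scores at ability theta."""
    total = 0.0
    for y, b in zip(ys, bs):
        p = rasch_p(theta, b)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

def lrt_statistic(ys, bs, j):
    """L_jn: -2 times (whole-test log likelihood minus the two segment
    log likelihoods), each evaluated at its own ability estimate."""
    l0 = log_likelihood(mle(ys, bs), ys, bs)
    l1 = log_likelihood(mle(ys[:j], bs[:j]), ys[:j], bs[:j])
    l2 = log_likelihood(mle(ys[j:], bs[j:]), ys[j:], bs[j:])
    return -2.0 * (l0 - l1 - l2)

ys = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]   # toy score pattern
bs = [0.0] * 10                        # toy item difficulties
print(max(lrt_statistic(ys, bs, j) for j in range(2, 9)))
```

The last line takes the maximum over a trimmed range of split points, which mirrors the definition of L_max,n for this toy example.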

A PFS Based on the Score (or Lagrange Multiplier) Test

The score test statistic or Lagrange multiplier test statistic (e.g., Rao, 1973) for testing the null hypothesis that θ_1j = θ_2j, when j is known, is given by

S_{j n} = \frac{{(\nabla ({\hat{θ}}_{0}; Y_{1}, Y_{2}, \dots, Y_{j}))}^{2}}{I_{1 j} ({\hat{θ}}_{0})} + \frac{{(\nabla ({\hat{θ}}_{0}; Y_{j + 1}, Y_{j + 2}, \dots, Y_{n}))}^{2}}{I_{2 j} ({\hat{θ}}_{0})}, \tag{15}

where, for example, $\nabla ({\hat{θ}}_{0}; Y_{1}, Y_{2}, \dots, Y_{j})$ is the first derivative of the log likelihood of Y₁, Y₂, …, Y_j at $θ = {\hat{θ}}_{0}$. Expressions of $\nabla ({\hat{θ}}_{0}; Y_{1}, Y_{2}, \dots, Y_{j})$ and $I_{k j} ({\hat{θ}}_{0})$ for common IRT models can be found in, for example, Baker and Kim (2004, pp. 64–71).

Glas and Dagohoy (2007) and Klauer and Rettig (1990) used a PFS similar to S_jn for testing the null hypothesis that the examinee ability is the same over two subsets of items on a nonadaptive test.

Let us define the PFS

S_{\max, n} = \max_{n_{1} \leq j \leq n - n_{1}} S_{j n}, \tag{16}

for testing the null hypothesis H₀: θ_1j = θ_2j for all j versus the alternative of a change between items n₁ and (n − n₁).
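For the Rasch model, the first derivative of the log likelihood is the sum of the residuals Y_i − P_i(θ) and the information is the sum of P_i(θ){1 − P_i(θ)}, so the score statistic needs only the whole-test ability estimate. The sketch below assumes a Newton–Raphson MLE for that estimate (the article's computations use WLEs), and the function names are hypothetical.

```python
import math

def rasch_p(theta, b):
    """Rasch probability of a correct answer."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def mle(ys, bs, bound=6.0, max_iter=100):
    """Newton-Raphson MLE of ability, clipped to [-bound, bound] (assumption)."""
    theta = 0.0
    for _ in range(max_iter):
        ps = [rasch_p(theta, b) for b in bs]
        gradient = sum(y - p for y, p in zip(ys, ps))
        info = sum(p * (1.0 - p) for p in ps)
        new_theta = max(-bound, min(bound, theta + gradient / info))
        if abs(new_theta - theta) < 1e-10:
            return new_theta
        theta = new_theta
    return theta

def score_statistic(ys, bs, j):
    """S_jn: squared gradient over information for each segment,
    all evaluated at the whole-test ability estimate."""
    theta0 = mle(ys, bs)
    total = 0.0
    for seg_y, seg_b in ((ys[:j], bs[:j]), (ys[j:], bs[j:])):
        ps = [rasch_p(theta0, b) for b in seg_b]
        gradient = sum(y - p for y, p in zip(seg_y, ps))
        info = sum(p * (1.0 - p) for p in ps)
        total += gradient ** 2 / info
    return total

ys = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]   # toy score pattern
bs = [0.0] * 10                        # toy item difficulties
print(score_statistic(ys, bs, 5))
```

Because only one ability estimate is required per examinee, this statistic avoids estimating ability from short item segments.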

Asymptotic Null Distribution of W_max,n, L_max,n, and S_max,n

Andrews (1993) considered the case of a sequence of parametric models f_i(X_i; ψ), where ψ is a vector of model parameters, fitted to a sequence of random variables X₁, X₂, …, X_n using the generalized method of moments estimators (of which MLE is a special case). For such a sequence of parametric models, if W_jn is a Wald-type test statistic, L_jn is an LRT statistic, and S_jn is a score/Lagrange multiplier test statistic, each for testing the null hypothesis that the parameter vector underlying observations 1 to j is equal to that underlying observations j + 1 to n and each computed using the generalized method of moments estimator, then, by theorem 3 (p. 838) of Andrews, the asymptotic null distributions of W_max,n, L_max,n, and S_max,n defined in Equations 12, 14, and 16 are the same and are given by the supremum of the square of a standardized tied-down Bessel process (e.g., Sen, 1981). Csorgo and Horvath (1997) and Gombay and Horvath (1996) provided a similar result for L_max,n for independent random variables where the parameter vector is estimated by its MLE.

IRT models in the context of PFA for CATs are included in the family of distributions considered by Andrews (1993), with the ability parameter θ playing the role of the parameter vector ψ of Andrews and the probability mass function for the Bernoulli distribution with probability of success equal to P_i(θ) playing the role of f_i(X_i; ψ) of Andrews. Further, the MLE of examinee ability is a special case of the generalized method of moments estimators (Andrews, 1993, p. 828), and the weighted likelihood estimate (WLE; Warm, 1989) of examinee ability is asymptotically equivalent to the MLE (Warm, 1989, p. 448). Thus, all the conditions of theorem 3 of Andrews (1993) are satisfied in the context of IRT models (both for nonadaptive tests and for CATs) when the MLE or WLE of ability is used to compute W_max,n, L_max,n, and S_max,n. Therefore, by theorem 3 of Andrews, the asymptotic null distributions of W_max,n, L_max,n, and S_max,n defined in Equations 12, 14, and 16 above are the same and are given by the supremum of the square of a standardized tied-down Bessel process (e.g., Sen, 1981).

Further Information on the New PFSs

Test statistics based on likelihood ratios are asymptotically the most powerful in general (e.g., Cox & Hinkley, 1974, pp. 312, 320) because of the Neyman–Pearson Lemma (e.g., Cox & Hinkley, 1974; Lehmann & Romano, 2005; Romero, Riascos, & Jara, 2015); from derivations in Andrews (1993) and Csorgo and Horvath (1997), the generalized LRT for testing the null hypothesis H₀:θ_1j = θ_2j for all j versus the alternative of a change between items n₁ and (n − n₁) leads to the test statistic L_max,n of Equation 14. Further, W_max,n and S_max,n are asymptotically equivalent to L_max,n (e.g., Andrews, 1993). Therefore, L_max,n, W_max,n, and S_max,n are expected to have satisfactory power to detect abrupt change.

Computation of the critical values of the aforementioned null distribution is not straightforward. However, the critical values are provided in, for example, Sen (1981, p. 397), Andrews (1993, p. 840), and Estrella (2003). Table 1 provides the critical values for several common values of 100n₁/n and significance level. For example, for 100n₁/n = 15, which is recommended by Andrews (1993, p. 826), the critical values at 1% and 5% levels of significance are 12.35 and 8.85, respectively. In comparison, the critical values of a χ² distribution with one degree of freedom at 1% and 5% levels are 6.63 and 3.84, respectively. The values in Table 1 are larger to adjust for the multiple hypothesis tests implicit in the application of any of W_max,n, L_max,n, and S_max,n.

Table 1.

Asymptotic Critical Values for the Distribution of the Supremum of the Square of a Standardized Tied-Down Bessel Process

	Significance Level
100 × n₁/n	1%	5%	10%
5	13.01	9.84	8.19
10	12.69	9.31	7.63
15	12.35	8.85	7.17
20	11.69	8.45	6.80

Under the alternative hypothesis, each of W_max,n, L_max,n, and S_max,n would be large. For example, W_max,n would be large under the alternative hypothesis because ${\hat{θ}}_{1 j} - {\hat{θ}}_{2 j}$ is expected to be large in absolute value for some j. Therefore, for each of W_max,n, L_max,n, and S_max,n, if the statistic is larger than the corresponding critical value, then the null hypothesis is rejected, a person misfit is concluded, and the true change point is estimated by that value of j for which W_max,n = W_jn or L_max,n = L_jn or S_max,n = S_jn.
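The decision rule can be sketched as a lookup in Table 1 followed by a comparison. The critical values below are copied from Table 1; the dictionary layout and the function name `flag_misfit` are illustrative assumptions, and `stats_by_j` stands for the values of W_jn, L_jn, or S_jn over the trimmed range of j.

```python
# Critical values from Table 1, keyed by 100 * n1 / n and significance level.
CRITICAL_VALUES = {
    5:  {0.01: 13.01, 0.05: 9.84, 0.10: 8.19},
    10: {0.01: 12.69, 0.05: 9.31, 0.10: 7.63},
    15: {0.01: 12.35, 0.05: 8.85, 0.10: 7.17},
    20: {0.01: 11.69, 0.05: 8.45, 0.10: 6.80},
}

def flag_misfit(stats_by_j, trim_pct=15, alpha=0.05):
    """Compare the maximum statistic with its asymptotic critical value.

    stats_by_j: dict mapping each candidate j to the statistic at j.
    Returns (misfit flag, estimated change point or None, maximum statistic).
    """
    j_hat = max(stats_by_j, key=stats_by_j.get)
    stat_max = stats_by_j[j_hat]
    misfit = stat_max > CRITICAL_VALUES[trim_pct][alpha]
    return misfit, (j_hat if misfit else None), stat_max

print(flag_misfit({6: 1.0, 20: 9.2, 34: 3.0}))  # (True, 20, 9.2)
```

With the same toy values, the pattern would not be flagged at the 1% level because 9.2 is below 12.35.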

The computation of W_max,n and L_max,n involves some ability estimates computed from only a few items. For example, for a 40-item test (i.e., n = 40), ${\hat{θ}}_{1 j}$ is computed from only 6 items when j = n₁. However, because of the adaptive nature of CATs, ability estimates based on a few items in CATs are more stable than those in P&P tests;⁸ as an outcome, W_max,n and L_max,n are also expected to be stable.

Note that though they were discussed in the context of CATs, W_max,n, L_max,n, and S_max,n can also be applied to nonadaptive tests without any further adjustment and using the critical values provided in Table 1 because given a set of item scores, the likelihood of the scores for an examinee is computed in the same way for a CAT and a nonadaptive test. However, for nonadaptive tests, the traditional PFSs (e.g., those discussed in Meijer & Sijtsma, 2001) would usually have decent power, so that the TFCP-based statistics may not provide much benefit.

Simulations

A detailed simulation study, somewhat similar to that in van Krimpen-Stoop and Meijer (2001), was performed to examine the Type I error rate and power of W_max,n, L_max,n, and S_max,n using several simulated CATs. The simulation study also involves a comparison of the new PFSs with four existing CUSUM-based PFSs: LARD (Bradlow et al., 1998), C^T (Meijer, 2002), $C^{l_{z}^{*}}$ (van Krimpen-Stoop & Meijer, 2001), and C^LR (Armstrong & Shi, 2009a).

Design of the Simulation Study

The simulation study involved four levels of test length (20 items, 40 items, 60 items, and 100 items) that represent short, moderate, long, and very long tests. Note that even though CATs usually lead to reduction in test length compared to P&P tests, CATs with 100 items are not rare; both Bradlow and Weiss (2001) and Meijer (2002) analyzed data from operational variable-length CATs in which some examinees receive more than 100 items, and the real data example later in this article involves a variable-length CAT in which some examinees receive more than 100 items and each examinee receives at least 60 operational items.

An item pool of 800 items was used in all simulations. The Rasch model, for which P_i(θ) is given by Equation 1 with a_i = 1 and c_i = 0, was used in the simulation. The true difficulty parameters of the items of the pool used in the simulation were a random sample of the estimated difficulty parameters of the items in the item pool in the real data example discussed in the next section.

For each examinee, the first 2 items of the CAT were selected as the items that have the maximum information at θ = 0.⁹ Each subsequent item of the CAT was selected as the one that has maximum information at the ability estimated from the scores on the previous items. The WLEs of ability were used as ability estimates.¹⁰
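Under the Rasch model, item information P_i(θ){1 − P_i(θ)} is largest when the item difficulty equals θ and decreases as the difficulty moves away from θ, so maximum-information selection reduces to picking the unused item whose difficulty is closest to the current ability estimate. A minimal sketch follows; the function name and the toy pool are hypothetical, and any exposure control or content constraints of an operational CAT are ignored.

```python
def select_next_item(theta_hat, difficulties, administered):
    """Index of the unused pool item with maximum Rasch information
    at theta_hat, i.e., with difficulty closest to theta_hat."""
    candidates = [i for i in range(len(difficulties)) if i not in administered]
    return min(candidates, key=lambda i: abs(difficulties[i] - theta_hat))

pool = [-1.0, 0.1, 0.4, 2.0]            # toy item difficulties
print(select_next_item(0.0, pool, set()))  # 1
```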

To compute the Type I error rates of the PFSs, examinee score patterns that fitted the Rasch model were generated using true examinee abilities that were simulated from the standard normal distribution. For any test length, the Type I error rate was computed from 100,000 model-fitting score patterns.

To compute the power of the PFSs, for each examinee, the true ability for the first half of the CAT (denoted as θ₁) was simulated from the standard normal distribution and the true ability for the second half of the CAT was set as θ₁ + δ, and then these true abilities were used to generate a score pattern from the Rasch model. Thus, the true change points were 11, 21, 31, and 51, respectively, for the four test lengths. The following values of δ were considered: −2, −1, 1, and 2. Positive values of δ indicate better performance in the second half while negative values of δ indicate worse performance in the second half. This strategy of introducing person misfit and these values of δ are exactly the same as those in van Krimpen-Stoop and Meijer (2001, p. 212). Note that ability estimates from an operational data set in a figure later in this article show that the values of 2 and −2 of δ are not unreasonable in practice. Under any simulation condition (characterized by a test length and a value of δ), the power was computed from 100,000 aberrant score patterns.
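The way misfit is introduced can be illustrated with a short sketch. It is a simplification under stated assumptions: the item difficulties are fixed in advance (the actual study selected items adaptively), the Rasch model is taken in its plain logistic form, and all names are hypothetical.

```python
import math
import random


def simulate_change_point_pattern(difficulties, delta, rng):
    """Generate one aberrant score pattern whose true ability shifts by delta at mid-test.

    Simplifying assumption: the items are given in a fixed order, whereas the
    simulation study described in the text chose each item adaptively.
    """
    n = len(difficulties)
    half = n // 2
    theta1 = rng.gauss(0.0, 1.0)  # true ability for the first half of the test
    scores = []
    for i, b in enumerate(difficulties):
        theta = theta1 if i < half else theta1 + delta
        p = 1.0 / (1.0 + math.exp(-(theta - b)))  # Rasch probability of a correct score
        scores.append(1 if rng.random() < p else 0)
    return theta1, scores


rng = random.Random(2016)
theta1, scores = simulate_change_point_pattern([0.0] * 20, 2.0, rng)
```

Setting `delta` to 0 in the same function yields a model-fitting pattern of the kind used for the Type I error rates.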

Computation

A Fortran program written by the author was used for all the computations including the computation of the ability estimates and the PFSs. The WLEs were computed using the Newton–Raphson algorithm. For each simulated score pattern, the WLE was computed and six PFSs (W_max,n, L_max,n, S_max,n, C^T, $C^{l_{z}^{*}}$ , and C^LR) were computed using the WLE. The LARD was computed using rectangular quadrature involving 46 equally spaced points between −4.5 and 4.5, as in, for example, Thissen and Orlando (2001). The prior distribution for the examinee ability was the standard normal distribution. The TFCP-based PFSs (W_max,n, L_max,n, and S_max,n) were computed for 100n₁/n = 15—this choice restricts the estimated change point to be between, for example, Items 6 and 34 for 40-item tests. The quantities $C^{l_{z}^{*}}$ were computed from 10-item subsets. As recommended by Meijer (2002) and Armstrong and Shi (2009a), the WLE from the whole CAT was used in computing C^T, $C^{l_{z}^{*}}$ , and C^LR.
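For readers who want to experiment with the ability estimates, a WLE for the Rasch model might be computed as sketched below. This is not the author's Fortran code: it swaps bisection in for Newton–Raphson for simplicity, it assumes the standard weighted-likelihood estimating equation Σ(x_i − P_i) + J/(2I) = 0 with I = ΣP_i(1 − P_i) and J = ΣP_i(1 − P_i)(1 − 2P_i), and the function names and the search interval [−10, 10] are hypothetical choices.

```python
import math


def rasch_prob(theta, b):
    """Rasch probability of a correct response (assumed plain logistic form)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def wle_estimating_function(theta, scores, difficulties):
    """Score function plus the weighted-likelihood correction term J / (2 I)."""
    probs = [rasch_prob(theta, b) for b in difficulties]
    residual = sum(x - p for x, p in zip(scores, probs))
    info = sum(p * (1.0 - p) for p in probs)
    j_term = sum(p * (1.0 - p) * (1.0 - 2.0 * p) for p in probs)
    return residual + j_term / (2.0 * info)


def wle_bisection(scores, difficulties, lo=-10.0, hi=10.0, tol=1e-8):
    """Find the root of the estimating function by bisection (not Newton-Raphson)."""
    if wle_estimating_function(lo, scores, difficulties) <= 0.0:
        raise ValueError("no sign change at the lower end of the interval")
    if wle_estimating_function(hi, scores, difficulties) >= 0.0:
        raise ValueError("no sign change at the upper end of the interval")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if wle_estimating_function(mid, scores, difficulties) > 0.0:
            lo = mid  # root lies above mid
        else:
            hi = mid  # root lies at or below mid
    return (lo + hi) / 2.0


# Symmetric difficulties with one correct and one incorrect score give an estimate of 0.
theta_mixed = wle_bisection([1, 0], [-1.0, 1.0])
# Unlike the MLE, the WLE is finite for an all-correct pattern.
theta_perfect = wle_bisection([1, 1], [-1.0, 1.0])
```

The finite estimate for an all-correct pattern is one practical reason a WLE is convenient early in a CAT, when extreme score patterns are common.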

Type I Error Rates of W_max,n, L_max,n, and S_max,n

Table 2 shows the overall Type I error rates, expressed as percentages, for the TFCP-based PFSs at 1% and 5% significance levels. For a given significance level and test length, the overall Type I error rate of a TFCP-based PFS is the percentage of model-fitting score patterns for which the PFS is larger than the corresponding theoretical critical value from Table 1. The standard error corresponding to any reported value of Type I error rate is approximately 0.03% when the Type I error rate is near 1% and 0.07% when the Type I error rate is near 5%.¹¹
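The quoted standard errors follow from the binomial formula sqrt(p(1 − p)/N) with N = 100,000 score patterns. The short check below is an illustrative sketch; the function name is hypothetical.

```python
import math


def binomial_se_percent(rate_percent, n_patterns=100_000):
    """Standard error (in percentage points) of an estimated proportion."""
    p = rate_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_patterns)


se_at_1 = binomial_se_percent(1.0)   # about 0.03
se_at_5 = binomial_se_percent(5.0)   # about 0.07
se_worst = binomial_se_percent(50.0)  # largest possible value, about 0.16
```

The worst case, at a proportion of 50%, also bounds the standard error of any power value estimated from 100,000 patterns.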

Table 2.

Overall Type I Error Rates (Expressed as a Percentage) From the Simulation Study

Level	Length	$W_{max, n}$	$L_{max, n}$	$S_{max, n}$
1%	20	0.7	0.3	0.2
	40	0.9	0.6	0.5
	60	1.0	0.7	0.6
	100	1.0	0.9	1.0
5%	20	2.6	1.9	1.3
	40	2.8	2.8	2.5
	60	3.5	3.4	3.4
	100	4.4	4.3	4.2

Table 2 shows that the Type I error rates of W_max,n, L_max,n, and S_max,n are always smaller than or equal to the nominal level. As test length increases, the Type I error rates of all the statistics become closer to the nominal level. For short or moderately long tests (e.g., those with 40 or fewer items), the Type I error rates are considerably smaller than the nominal level, especially at the 5% level, which would most likely result in a loss of power.

Power of W_max,n, L_max,n, and S_max,n

Table 3 shows the overall power at 5% level, expressed as a percentage, of W_max,n, L_max,n, and S_max,n. For a simulation condition, the overall power of a TFCP-based PFS is the percentage of aberrant score patterns for which the PFS is larger than the corresponding theoretical critical value from Table 1. The table also shows the mean and standard deviation of the estimated change points from the score patterns for which the corresponding PFS was significant at 5% level. It was found that the power for δ = d was almost identical to that for δ = −d. Therefore, the values of power for δ = −1 and δ = −2 are not reported and can be inferred from those for δ = 1 and δ = 2, respectively. The standard error associated with any reported value of power is smaller than 0.2%.

Table 3.

Overall Power (Expressed as a Percentage) at 5% Level and Mean and SD of the Estimated Change Points From the Simulation Study

δ	Length	Power (W)	Power (L)	Power (S)	Mean (W)	Mean (L)	Mean (S)	SD (W)	SD (L)	SD (S)
1.0	20	11	10	7	12	10	8	2	2	2
	40	21	21	20	23	20	17	6	5	5
	60	31	31	29	33	31	27	9	8	8
	100	53	46	44	53	50	47	11	12	12
2.0	20	40	39	30	11	10	9	1	2	2
	40	73	72	69	21	20	19	4	4	3
	60	87	86	83	31	30	29	5	6	5
	100	96	94	93	51	50	49	9	7	7

Note. W, L, and S, respectively, denote $W_{max, n}$, $L_{max, n}$, and $S_{max, n}$. “Mean” refers to the mean of the estimated change points. SD refers to the standard deviation of the estimated change points.

Table 3 shows that the TFCP-based PFSs have modest power for short tests and δ = 1, but decent power for long tests and δ = 2. The W_max,n statistic seems to be the most powerful, followed by L_max,n, among the three TFCP-based PFSs in all cases. This result agrees with the result for linear models that the Wald statistic is larger than the LR statistic, which is larger than the score statistic (e.g., Breusch, 1979). Also, the estimated change points seem to be accurate on average, especially for δ = 2. For example, for 40-item tests, the true change point is 21 and the average of the estimated change points for the three PFSs is 19, 20, or 21 for δ = 2.

Comparison of the TFCP-Based and CUSUM-Based PFSs

The Type I error rates of the new TFCP-based PFSs (Table 2 of this article) differ from those of the $C^{l_{z}^{*}}$ statistic (table 4 of van Krimpen-Stoop & Meijer, 2001, p. 211); also, no theoretical critical values are available for the LARD (Bradlow et al., 1998), C^T, and C^LR. Therefore, analyses using the receiver operating characteristics (ROCs) were used to perform a fair comparison of the performance of the TFCP-based PFSs to that of the CUSUM-based PFSs. For a given set of simulated data, analysis using ROC involves the computation of the following two quantities for several values of c for each PFS:¹²

the false alarm rate, F(c), which is the proportion of times when the PFS for a model-fitting score pattern is larger than c and

the hit rate, H(c), which is the proportion of times when the PFS for a misfitting score pattern is larger than c.

Then, a graphical plot is created in which F(c) is represented on the x-axis, H(c) is represented on the y-axis, and a line joins {F(c), H(c)} for several values of c. The line is referred to as the ROC curve. The closer the ROC curve is to the upper left corner (that means a larger area under the ROC curve), the more powerful is the PFS for a given Type I error rate.
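The construction of an ROC curve from simulated PFS values can be sketched as follows. The code is an illustration only: the function names are hypothetical, the toy numbers are not from the study, and the area is approximated with the trapezoidal rule after adding the end points (0, 0) and (1, 1).

```python
def roc_points(null_stats, alt_stats, thresholds):
    """Return (F(c), H(c)) pairs: false alarm rate and hit rate at each threshold c."""
    points = []
    for c in thresholds:
        false_alarm = sum(s > c for s in null_stats) / len(null_stats)
        hit = sum(s > c for s in alt_stats) / len(alt_stats)
        points.append((false_alarm, hit))
    return points


def area_under_roc(points):
    """Trapezoidal area under the ROC curve, with (0, 0) and (1, 1) added as end points."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy PFS values for model-fitting and misfitting score patterns (made up for illustration).
fitting = [1, 2, 3, 4]
misfitting = [3, 4, 5, 6]
cutoffs = sorted(set(fitting + misfitting))
auc = area_under_roc(roc_points(fitting, misfitting, cutoffs))
```

Because the comparison uses the whole curve rather than a single critical value, it does not require the PFSs to have the same Type I error rate at their nominal levels.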

Figure 1 shows a comparison of the ROC curves for the TFCP-based PFSs to those for the CUSUM-based PFSs. The ROC curves for δ = 1 (left panels) and δ = 2 (right panels) are shown for 20-item tests (top row), 40-item tests (middle row), and 60-item tests (bottom row). The ROC curves were very close for the three TFCP-based PFSs (with that for W_max,n having the largest area by a small margin), so only the curve for W_max,n (denoted as “Change Point”) among them is shown (using a solid line). The ROC curve for C^LR (Armstrong & Shi, 2009a) is denoted as “CUSUM-LR” and shown using a short dashed line. The ROC curve for C^T with λ_i = 1 (e.g., Meijer, 2002; Tendeiro & Meijer, 2012) is denoted as “CUSUM-Residual” and shown using a dotted line. The ROC curve for $C^{l_{z}^{*}}$ (van Krimpen-Stoop & Meijer, 2001) is denoted as “CUSUM-lzstar” and shown using a dotted and dashed line. The ROC curve for LARD (Bradlow et al., 1998) is denoted as “CUSUM-LARD” and is shown using a long dashed line. Table 4 shows the areas under the ROC curves (multiplied by 100) for W_max,n and the four CUSUM-based PFSs corresponding to each panel of Figure 1 and also for a test length of 100. The areas for L_max,n and S_max,n are almost always the same as that for W_max,n up to two decimal places and are not shown here.

Figure 1.

Receiver operating characteristic curves for three test lengths.

Table 4.

100 × Area Under the ROC Curves

δ	Length	$W_{max, n}$	LARD	C^LR	C^T	$C^{l_{z}^{*}}$
1.0	20	60	59	57	58	56
	40	69	66	65	66	65
	60	77	70	71	74	73
	100	89	77	78	83	80
2.0	20	79	76	72	74	73
	40	92	88	87	88	90
	60	97	93	94	95	94
	100	100	98	98	99	91

Note. LARD = largest absolute realized deviation; ROC = receiver operating characteristic.

Figure 1 and Table 4 show that the area for each of the TFCP-based PFSs is slightly larger than or equal to that for each of the CUSUM-based PFSs. Thus, overall, the TFCP-based PFSs are slightly more powerful compared to the CUSUM-based PFSs. Among the CUSUM-based PFSs, LARD (Bradlow et al., 1998) is the most powerful for 20-item tests, but C^T and $C^{l_{z}^{*}}$ are the most powerful for 40-item and 60-item tests. This result contradicts the finding in Armstrong and Shi (2009a) and Tendeiro and Meijer (2012) that the power of C^LR is larger than that of C^T; this difference could be due to the fact that both Armstrong and Shi (2009a) and Tendeiro and Meijer (2012) used the 3PLM to simulate their data and used nonadaptive tests while the Rasch model was used to simulate the data for CATs here.

Thus, each TFCP-based PFS is at least as powerful as every CUSUM-based PFS in the simulations here. This conclusion agrees with the statement in Montgomery (2013) and Hawkins et al. (2003) that TFCP are expected to be more appropriate than CUSUM-based procedures in detecting a change when the model parameters are unknown.

To understand the practical implication of the larger power of the TFCP-based PFSs compared to the CUSUM-based PFSs, let us consider a data set that includes 70,000 examinees, just like our real data example discussed in the next section. Let us assume that all those examinees took a 60-item CAT. Let us also assume that in the data set, no actual misfit is present (i.e., δ = 0) for 90% of the examinees, a misfit with δ = 1 is present for 5% of the examinees, and a misfit with δ = 2 is present for 5% of the examinees. Table 5 shows the number of examinees with each value of δ and the expected number of flags for W_max,n and C^T at the 5% significance level for this data set. The numbers were computed from the information provided by Figure 1; for example, the expected number of flags for W_max,n for δ = 2 is 3,010 because Figure 1 indicates that the power of W_max,n for 60-item tests at the 5% level is about 0.86 (i.e., 3,500 × 0.86 = 3,010). Table 5 indicates that while both W_max,n and C^T incorrectly flag 3,150 examinees (those with δ = 0) because the Type I error rates of both of them are close to the nominal level, W_max,n correctly flags 455 more examinees (those with δ = 1 or δ = 2) than C^T.

Table 5.

The Number of Flags for $W_{max, n}$ and $C^{T}$

Numbers	δ = 0	δ = 1	δ = 2
Total number in sample	63,000	3,500	3,500
Expected Number Flagged by $W_{max, n}$	3,150	1,050	3,010
Expected Number Flagged by C^T	3,150	875	2,730
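The arithmetic behind Table 5 can be reproduced with a few lines. In this sketch, the flagging rate of 0.86 is the value quoted in the text; the other rates (0.30, 0.25, and 0.78) are not stated in the article and are simply the ratios implied by the table entries (e.g., 1,050/3,500), and the dictionary keys are hypothetical labels.

```python
group_sizes = {"delta=0": 63_000, "delta=1": 3_500, "delta=2": 3_500}

# Flagging rates at a 5% false alarm rate; values other than 0.05 and 0.86
# are back-computed from Table 5 rather than quoted from the article.
rates_w = {"delta=0": 0.05, "delta=1": 0.30, "delta=2": 0.86}
rates_ct = {"delta=0": 0.05, "delta=1": 0.25, "delta=2": 0.78}


def expected_flags(sizes, rates):
    """Expected number of flagged examinees in each group."""
    return {group: round(sizes[group] * rates[group]) for group in sizes}


flags_w = expected_flags(group_sizes, rates_w)
flags_ct = expected_flags(group_sizes, rates_ct)

# Additional correctly flagged examinees (those with delta = 1 or 2) for W_max,n.
extra_correct = (flags_w["delta=1"] + flags_w["delta=2"]) - (
    flags_ct["delta=1"] + flags_ct["delta=2"]
)
```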

Further Simulations With Other Item Pools

The above simulations were repeated with two other item pools. One of these was simulated exactly as in van Krimpen-Stoop and Meijer (2001); it is based on the two-parameter logistic model and has a larger spread of item difficulties. The other item pool was simulated from the 3PLM. The TFCP-based PFSs were more powerful than the CUSUM-based PFSs in simulations with these item pools as well, although the Type I error rates, power, and areas under the ROC curve were slightly different from those reported in Tables 2 through 4 and the comparative performance of the CUSUM-based PFSs was different from that in the above simulations (e.g., C^LR was the most powerful among them). The TFCP-based PFSs were also computed with 100n₁/n = 20 and 25; their Type I error rates were close to those in Table 2 and the values of power were slightly larger than those in Table 3.

Further Simulations With Gradual Changes

The above simulations involve only an abrupt change in the examinee ability, where the true ability changes in the middle of the test by 1 or 2. Therefore, some additional simulations were performed where the true ability changes gradually. The following three types of gradual change were considered:

The true ability (θ₁) for the first item is a number randomly drawn from a standard normal distribution, but then the true ability gradually increases till it becomes θ₁ + δ for item number $\frac{n}{4}$ and stays equal to that till the end of the test, where δ is 1.0 or 2.0; for example, if θ₁ = 0 and δ = 1, then, on a 60-item test, the true ability increases by about 0.07 on each item between Item 2 and Item 15 and the true ability is equal to 1.0 for Items 15–60. This simulation is intended to replicate a gradual warm-up effect.

The true ability (θ₁) for the first 75% items is a number randomly drawn from a standard normal distribution, but then the true ability gradually decreases till it becomes θ₁ − δ for the last item of the test, where δ is 1.0 or 2.0; for example, if θ₁ = 0 and δ = 1, then, on a 60-item test, the true ability is 0 for Items 1–45, but it decreases by about 0.07 on each item between Item 46 and Item 60 till the true ability is equal to −1.0 for Item 60. This simulation is intended to replicate a gradual tiredness effect.

The true ability (θ₁) for the first 25% items is a number randomly drawn from a standard normal distribution, but then the true ability gradually increases or decreases till it becomes θ₁ + δ on item number $\frac{3 n}{4}$ and stays equal to that value for the last quarter of the test, where δ is 1, 2, −1, or −2; for example, if θ₁ = 0 and δ = 1, then, on a 60-item test, the true ability is 0 for Items 1–15, but it increases or decreases by about 0.03 on each item between Item 16 and Item 45 till the true ability is equal to 1.0 for Item 45. This simulation is intended to replicate a gradual change of ability in the middle of the test.
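The three gradual-change conditions can be written as ability paths over the items. The sketch below assumes that "gradually" means a linear change between the stated end points, which matches the per-item steps quoted in the examples (about 0.07, 0.07, and 0.03 on a 60-item test); the function name and the labels of the three conditions are hypothetical.

```python
def gradual_ability_path(theta1, delta, n, kind):
    """Return the true ability for Items 1..n under one of three gradual-change types.

    Assumption: the change is linear between the item where it starts and the
    item where it ends.
    """
    quarter, three_quarters = n // 4, 3 * n // 4
    if kind == "warm_up":        # rises from Item 1 to Item n/4
        start, end, target = 1, quarter, theta1 + delta
    elif kind == "tiredness":    # falls from Item 3n/4 to Item n
        start, end, target = three_quarters, n, theta1 - delta
    elif kind == "middle":       # changes from Item n/4 to Item 3n/4
        start, end, target = quarter, three_quarters, theta1 + delta
    else:
        raise ValueError("unknown kind: " + kind)
    path = []
    for item in range(1, n + 1):
        if item <= start:
            path.append(theta1)
        elif item >= end:
            path.append(target)
        else:
            fraction = (item - start) / (end - start)
            path.append(theta1 + fraction * (target - theta1))
    return path


warm_up = gradual_ability_path(0.0, 1.0, 60, "warm_up")
tiredness = gradual_ability_path(0.0, 1.0, 60, "tiredness")
middle = gradual_ability_path(0.0, 1.0, 60, "middle")
```

Each path can then be fed, item by item, into the Rasch model to generate a score pattern, in the same way as for the abrupt-change conditions.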

An analysis using ROC curves shows that the TFCP-based PFSs were slightly less powerful than the CUSUM-based PFSs in the first of the above three cases but were more powerful than the CUSUM-based PFSs in the other two cases.¹³

Overall Conclusions From the Simulations

The simulation studies show that the new PFSs have satisfactory Type I error rates and are more powerful overall compared to existing PFSs (that are based on CUSUMs) in a variety of situations. These findings, together with the fact that their critical values can be found from a table, make the new PFSs quite attractive.

Application to Data From an Operational High-Stakes CAT

The Data Set and the Analysis

The TFCP-based PFSs were applied to a real data set that includes information on about 70,000 examinees who took a large-scale high-stakes health-care licensure examination over a few months in 2015. The examination has been computer adaptive in the last several years and currently is a variable-length CAT. Each examinee is administered a minimum of 60 operational items and a maximum of 250 operational items; each examinee is also administered 15 pretest items. The unidimensional Rasch model is used for item calibration and scoring. The model has been found to fit the data adequately in a variety of analyses performed by psychometricians who work on the examination and by external researchers. A Bayesian ability estimate is used initially and the MLE of ability is used in the later part of the examination. The item selection mechanism is based on the constrained CAT procedure of Kingsbury and Zara (1989); first, the content area of the item is chosen; then, to control item exposure, an optimal item is randomly selected from 15 items that provide the most information at the current ability estimate. The current cut score used for passing is 0.00 in the logit scale. The pass–fail decision is based on a 95% confidence interval around the examinee’s current ability estimate. A pass–fail decision is made for an examinee as soon as the examinee has answered at least 60 operational items and the 95% confidence interval does not include the cut score; the examinee continues the test when the confidence interval includes the cut score. Special pass–fail decisions are used if the examinee runs out of time or has taken the maximum number of items.
The item pool that was used with the examinees in this data set included a large number of dichotomously scored items¹⁴ each of which belongs to one of the eight content areas; most of the items are four-option multiple-choice items, but some items are of other types such as items that require an examinee to select all the options that apply, items that require filling in a blank, and items that require an examinee to specify an area on a figure. On average, the examinees in the data set took 126 items. Among the examinees in the data set, about 75% passed the examination.
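The confidence-interval stopping rule described above can be sketched as follows. This is only an illustration and not the operational algorithm: it assumes a symmetric normal-theory interval (estimate ± 1.96 standard errors), it leaves the special decision rules unspecified because the article does not describe them, and the function and argument names are hypothetical.

```python
def pass_fail_decision(theta_hat, standard_error, n_operational,
                       cut=0.0, min_items=60, max_items=250, z=1.96):
    """Return 'pass', 'fail', 'continue', or 'special rule' for the current test state.

    Assumption: the 95% confidence interval is theta_hat +/- z * standard_error.
    """
    if n_operational < min_items:
        return "continue"  # no decision before the minimum test length
    lower = theta_hat - z * standard_error
    upper = theta_hat + z * standard_error
    if lower > cut:
        return "pass"      # whole interval lies above the cut score
    if upper < cut:
        return "fail"      # whole interval lies below the cut score
    if n_operational >= max_items:
        return "special rule"  # details are not given in the article
    return "continue"      # interval still includes the cut score


decision_high = pass_fail_decision(0.8, 0.3, 60)
decision_low = pass_fail_decision(-0.8, 0.3, 75)
decision_unclear = pass_fail_decision(0.1, 0.3, 100)
```

Under such a rule, examinees whose ability is close to the cut score tend to receive the longest tests, which is consistent with the wide range of test lengths in the data set.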

The operational item parameters were used for assessing person fit. The pretest items were excluded from the computations. The MLE of ability, restricted between −4.5 and 4.5, was used to compute the PFSs.

Results

Table 6 provides the percentage of statistically significant p values for W_max,n, L_max,n, S_max,n, and $l_{z}^{*}$ (Snijders, 2001) using theoretical critical values at 1% and 5% levels. The table also shows the percentage of statistically significant values of the CUSUM-based PFSs $C_{i}^{+}$ or $C_{i}^{-}$ (e.g., Meijer, 2002) using 0.10 and −0.10 as the bounds at 5% overall level and using 0.12 and −0.12 as the bounds at 1% overall level, where the bounds were found using a bootstrap simulation as recommended by Meijer (2002).¹⁵

Table 6.

Percentage of Statistically Significant PFSs for the Real Data

Level	$W_{max, n}$	$L_{max, n}$	$S_{max, n}$	$C_{i}^{+}$ & $C_{i}^{-}$	$l_{z}^{*}$
1%	2.2	1.2	1.1	0.8	0.2
5%	6.5	5.6	5.5	3.7	0.9

Note. PFS = person-fit statistic.

The extent of person misfit seems to be minimal for the data set, as the percentage of significant PFSs is slightly larger than the nominal level for the TFCP-based PFSs and smaller than the nominal level for the CUSUM-based PFSs. While this could be due to the low power of the PFSs, it could also be due to the good fit of the Rasch model to the data set.

Figure 2 shows the score patterns of four examinees whose W_max,n, L_max,n, and S_max,n were found significant at 1% level. The item number is shown along the x-axis, and the item score (0 or 1) is shown as a hollow circle along the y-axis. Thus, for example, a hollow circle near the top of a panel represents a score of 1. The estimate of the change point is shown as a vertical dashed line. For example, the change point on the top left panel is 27. The title of each panel shows two ability estimates (MLE up to the first decimal place): one from the items up to the estimated change point and the other from the items after the estimated change point. The CUSUMs based on $C_{i}^{+}$ and $C_{i}^{-}$ (e.g., Meijer, 2002) are also shown for each examinee. The two bounds for the CUSUMs at 5% overall level (0.10 and −0.10) are shown using horizontal solid lines: $C_{i}^{+}$ ’s are shown using triangles and $C_{i}^{-}$ ’s are shown using inverted triangles.¹⁶

Figure 2.

Score patterns of four examinees.

Figure 2 shows that the estimated change points seem to represent the change in performance of the examinees quite accurately. The top two panels represent examinees whose performance dropped substantially during the test and the MLE after the estimated change point is smaller by 2.2 or more compared to that up to the estimated change point. The bottom two panels represent examinees whose performance improved substantially during the test and the MLE after the estimated change point is larger by 2.4 or more compared to that up to the estimated change point. The examinee represented in the bottom right panel most likely had trouble settling in or warming up (a phenomenon mentioned by, e.g., Meijer, 2002, p. 227). Thus, the new PFSs seem to perform reasonably well for the real data set.

The CUSUMs show misfit for only two of the four examinees in Figure 2: those corresponding to the top left and bottom right panels. All the responses after the estimated change points for the other two examinees were incorrect (top right panel) or correct (bottom left panel), and the TFCP-based PFSs correctly identified them as having misfit; the number of responses after the estimated change point was not large enough to allow the CUSUMs to go outside the bounds even though they changed directions at the estimated change points and were close to the bounds for the last item.

Conclusions

This article suggests three PFSs based on TFCP (e.g., Andrews, 1993; Chen & Gupta, 2012; Csorgo & Horvath, 1997; Hawkins et al., 2003; Montgomery, 2013) for use with CATs. The Type I error rates of the PFSs, when used with their theoretical critical values obtained from Andrews (1993), were found to be smaller than or equal to the nominal level in a simulation study. The values of power of the PFSs were found to be satisfactory for long tests and were larger than those of several CUSUM-based PFSs in a comparison using ROC curves. Because there exist several long operational CATs (such as those whose data were analyzed in this article or by Bradlow et al., 1998; Meijer, 2002), the PFSs, when used with their theoretical/asymptotic critical values, promise to be useful in practice. The power may be slightly larger if one uses the PFSs along with null distributions obtained from bootstrap simulations (e.g., Sinharay, 2016) or permutation distributions (e.g., Bradlow et al., 1998), especially for short or moderately long tests. However, the use of a bootstrap simulation or permutation distribution would lead to more computations and may be problematic as noted by Bradlow and Weiss (2001, pp. 94–95).

The new PFSs were discussed in the context of CATs, but they can be applied to nonadaptive tests as well. For example, they can be applied to detect speededness in nonadaptive tests as in Shao et al. (2015) or to detect other types of abrupt changes in nonadaptive tests.

A major benefit of the new PFSs is that they provide an estimate of the change point, that is, of the item after which the examinee shows aberrant behavior. These estimates may provide useful information regarding speededness, fatigue, warm-up effect, and so on, depending on the nature of the aberrant behavior. For example, for a CAT that is speeded to a majority of examinees, the estimated change points of the examinees may suggest an appropriate reduction in length that would make the CAT speeded to a smaller number of examinees. Meijer (2002, p. 231) mentioned that a benefit of the CUSUM procedure is that the CUSUM plots make it immediately clear where the aberrant behavior is situated. The new TFCP-based PFSs go one step further by providing a statistical estimate of the change point without the need to manually examine a plot and thus could lead to a substantial saving of resources.

Among the three new PFSs, which are equivalent asymptotically, the power of the PFS based on the Wald test was largest and that based on the score test was smallest, both by a small margin. However, this relationship could be an outcome of the item parameters. It may be wise to apply all of these tests in a real application and combine information from them.

The new PFSs are appropriate when an investigator wants to detect abrupt changes such as warming up effects, speededness/fatigue toward the end, and item preknowledge on a set of consecutive items in the behavior of the examinees. However, these statistics may not always be the most appropriate for detecting aberrant behavior in a CAT. For example, if one wants to detect misfit due to test fraud such as item memorization, item preknowledge, or item collusion in a CAT and the items involving fraud are scattered throughout the CAT, the PFSs suggested in this article may lack power and statistics such as those suggested by McLeod and Lewis (1999) and McLeod, Lewis, and Thissen (2003) may be more appropriate.

In an application of the new PFSs to a data set, the PFSs and the estimated change points associated with the new PFSs may be examined further. For example, it is possible to try to determine the cause of the misfit of those who were flagged by the PFSs. An examinee like the one corresponding to the bottom right panel of Figure 2 may have been flagged because of a warm-up effect and an examinee like the one corresponding to the top right panel of Figure 2 may have been flagged because of speededness. The pattern for the examinee represented in the bottom left panel of Figure 2 is hard to explain—it is possible to investigate this pattern further. An investigator should also examine if the extent of misfit and the pattern of the estimated change points are related to the characteristics of the examinees (such as race and ethnicity) and items (such as item content).

Statistical indices for determination of person misfit may be useful for providing confirming evidence of aberrant behavior when evidence from other sources (such as reports from the teachers/proctors in testing centers on examinee behavior) also exists, but the evidence provided by statistical indices is insufficient by itself. For example, Hanson, Harris, and Brennan (1994) commented, in the context of detection of answer copying, that no statistical method on its own can provide conclusive proof that copying occurred (p. 25); the comment is true about other aberrant examinee behavior as well. Researchers such as Tendeiro and Meijer (2014, p. 257) and Meijer, Egberink, Emons, and Sijtsma (2008) recommended complementing PFSs with other sources of information such as seating charts, video surveillance, or follow-up interviews.

There are several further limitations of this article and, consequently, several related topics that may be investigated further. First, only one change point was considered and detected in this article; it is possible to consider and detect multiple change points (an example would be a warm-up effect early in the CAT and fatigue at the end) in further research. Note that the power to detect multiple change points is expected to be smaller than that reported, for example, in Table 3. It was found in some limited simulations that if there is more than one true change point, then the new PFSs would usually estimate the change point to be one of the true change points depending on the nature of the true changes. For example, if the true ability is 0 between Items 1 and 15, 1 between Items 16 and 45, and −1 between Items 46 and 60, then the statistics would estimate the change point to be Item 46 as that allows the maximum difference between the ability estimated from the items before the change (about 0.5) and that from the items after the change (about −1). Second, it is possible to perform a change-point analysis using response times, or using both item scores and response times. Third, in the simulations in this article, the maximum information method was used for item selection and no exposure-control method was used; other item selection methods and some exposure-control methods may be used in further simulation studies. Fourth, this article considered tests with only dichotomous items and it is possible to consider tests that include all or some polytomous items in future research; the PFSs suggested here can be easily extended to such tests. Fifth, PFA for a data set usually involves hypothesis testing for several examinees. Therefore, in practical applications of PFA, an investigator may want to adjust for multiple comparisons by, for example, controlling the false discovery rate as in Shao et al. (2015). Finally, further research such as Conrad et al. (2010); Conijn, Emons, and Sijtsma (2014); Ferrando (2012); Meijer et al. (2008); and Tendeiro and Meijer (2014) should explore how PFA and the new PFSs can be applied and used in practice, especially in high-stakes CATs.
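As an illustration of the multiple-comparison adjustment mentioned above, the sketch below applies the Benjamini–Hochberg step-up procedure to a list of p values. The choice of this particular procedure is an assumption, since the article does not say which false-discovery-rate method is intended; the function name and the example p values are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p value
    largest_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank * q / m:
            largest_rank = rank  # keep the largest rank that satisfies the bound
    flagged = [False] * m
    for index in order[:largest_rank]:
        flagged[index] = True
    return flagged


# Made-up p values for eight examinees.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
flags = benjamini_hochberg(p_values, q=0.05)
```

In this toy example, five examinees have p values below .05, but only the two smallest survive the adjustment.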

Footnotes

Author’s Note

Any opinions expressed in this publication are those of the author and not necessarily of Pacific Metrics Corporation.

Acknowledgments

The author would like to thank Li Cai, the editor, and the two anonymous reviewers for several helpful comments that led to a significant improvement of the article. The author would also like to thank Ada Woo and Doyoung Kim for their helpful comments.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Notes

References

American Educational Research Association, American Psychological Association, & National Council for Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Andrews

(1993). Tests for parameter instability and structural change with unknown change point. Econometrica, 61, 821–856.

Armstrong

Shi

(2009a). A parametric cumulative sum statistic for person fit. Applied Psychological Measurement, 33, 391–410.

Armstrong

Shi

(2009b). Model-free CUSUM methods for person fit. Journal of Educational Measurement, 46, 408–428.

Baker

F. B.

Kim

H. S.

(2004). Item response theory: Parameter estimation techniques (2nd ed.). New York, NY: Marcel Dekker.

Bolt

D. M.

Cohen

A. S.

Wollack

J. A.

(2002). Item parameter estimation under conditions of test speededness: Application of a mixture Rasch model with ordinal constraints. Journal of Educational Measurement, 39, 331–348.

Bradlow

Weiss

R. E.

(2001). Outlier measures and norming methods for computerized adaptive tests. Journal of Educational and Behavioral Statistics, 26, 85–104.

Bradlow

Weiss

R. E.

Cho

(1998). Bayesian detection of outliers in computerized adaptive tests. Journal of the American Statistical Association, 93, 910–919.

Breusch

T. S.

(1979). Conflict among criteria for testing hypotheses: Extensions and comments. Econometrica, 47, 203–207.

10.

Chen

Gupta

A. K.

(2012). Parametric statistical change point analysis (2nd ed.). Boston, MA: Birkhauser.

11.

Conijn

J. M.

Emons

W. H. M.

Sijtsma

(2014). Statistic

l_{z}

-based person-fit methods for noncognitive multiscale measures. Applied Psychological Measurement, 38, 122–136.

12.

Conrad

K. J.

Bezruczko

Chan

Riley

Diamond

Dennis

M. L.

(2010). Screening for atypical suicide risk with person fit statistics among people presenting to alcohol and other drug treatment. Drug and Alcohol Dependence, 106, 92–100.

13.

Cox

D. R.

Hinkley

D. V.

(1974). Theoretical statistics. London, England: Chapman and Hall.

14.

Csorgo

Horvath

(1997). Limit theorems in change-point analysis. New York, NY: Wiley.

15.

Estrella

(2003). Critical values and p values of Bessel process distributions: Computation and application to structural break tests. Econometric Theory, 19, 1128–1143.

Ferrando, P. J. (2012). Assessing inconsistent responding in E and N measures: An application of person-fit analysis in personality. Personality and Individual Differences, 52, 718–722.

Ferrando, P. J. (2015). Assessing person-fit in typical-response measures. In Reise, S. P., & Revicki (Eds.), Handbook of item response theory modeling: Applications to typical performance assessment (pp. 128–156). New York, NY: Routledge.

Ferrando, P. J., & Anguiano-Carrasco (2013). A structural model-based optimal person-fit procedure for identifying faking. Educational and Psychological Measurement, 73, 173–190.

Finkelman, Weiss, D. J., & Kim-Kang (2010). Item selection and hypothesis testing for the adaptive measurement of change. Applied Psychological Measurement, 34, 238–254.

Glas, C. A. W., & Dagohoy, A. V. T. (2007). A person fit test for IRT models for polytomous items. Psychometrika, 72, 159–180.

Glas, C. A. W., Meijer, R. R., & van Krimpen-Stoop, E. M. L. A. (1998). Statistical tests for person misfit in computerized adaptive testing (Research Rep. No. 98-01). Enschede, the Netherlands: University of Twente.

Gombay, & Horvath (1996). On the rate of approximations for maximum likelihood tests in change-point models. Journal of Multivariate Analysis, 56, 120–152.

Hanson, B. A., Harris, D. J., & Brennan, R. L. (1994). A comparison of several statistical methods for examining allegations of copying (ACT Research Report Series No. 87-15). Iowa City, IA: American College Testing.

Hawkins, Qiu, & Kang (2003). The changepoint model for statistical process control. Journal of Quality Technology, 35, 355–366.

Kingsbury, G. G., & Zara, A. R. (1989). Procedures for selecting items for computerized adaptive tests. Applied Measurement in Education, 2, 359–375.

Klauer, K. C., & Rettig (1990). An approximately standardized person test for assessing consistency with a latent trait model. British Journal of Mathematical and Statistical Psychology, 43, 193–206.

Lee, Y.-H., & von Davier, A. A. (2013). Monitoring scale scores over time via quality control charts, model-based approaches, and time series techniques. Psychometrika, 78, 557–575.

Lehmann, E. L., & Romano, J. P. (2005). Testing statistical hypotheses (3rd ed.). New York, NY: Springer-Verlag.

McLeod, L. D., & Lewis (1999). Detecting item memorization in the CAT environment. Applied Psychological Measurement, 23, 147–160.

McLeod, L. D., Lewis, & Thissen (2003). A Bayesian method for the detection of item preknowledge in computerized adaptive testing. Applied Psychological Measurement, 27, 121–137.

Meijer, R. R. (2002). Outlier detection in high-stakes certification testing. Journal of Educational Measurement, 39, 219–233.

Meijer, R. R. (2004). Using patterns of summed scores in paper-and-pencil tests and computer-adaptive tests to detect misfitting item score patterns. Journal of Educational Measurement, 41, 119–136.

Meijer, R. R., Egberink, I. J., Emons, W. H., & Sijtsma (2008). Detection and validation of unscalable item score patterns using item response theory: An illustration with Harter's Self-Perception Profile for Children. Journal of Personality Assessment, 90, 227–238.

Meijer, R. R., & Sijtsma (2001). Methodology review: Evaluating person fit. Applied Psychological Measurement, 25, 107–135.

Meijer, R. R., & Tendeiro, J. N. (2014). The use of person-fit scores in high-stakes educational testing: How to use them and what they tell us (LSAC Research Report Series). Newtown, PA: Law School Admission Council.

Meijer, R. R., & van Krimpen-Stoop, E. M. L. A. (2010). Detecting person misfit in adaptive testing. In van der Linden, W. J., & Glas, C. A. (Eds.), Elements of adaptive testing (pp. 315–329). Dordrecht, the Netherlands: Springer.

Mislevy, R. J., & Chang, H. H. (2000). Does adaptive testing violate local independence? Psychometrika, 65, 149–156.

Montgomery, D. C. (2013). Introduction to statistical quality control. New York, NY: Wiley.

Nering, M. L. (1997). The distribution of indexes of person fit within the computerized adaptive testing environment. Applied Psychological Measurement, 21, 115–127.

Olson, J. F., & Fremer (2013). TILSA test security guidebook: Preventing, detecting, and investigating test security irregularities. Washington, DC: Council of Chief State School Officers.

Rao, C. R. (1973). Linear statistical inference and its applications (2nd ed.). New York, NY: John Wiley.

Romero, Riascos, & Jara (2015). On the optimality of answer-copying indices: Theory and practice. Journal of Educational and Behavioral Statistics, 40, 435–453.

Sen, P. K. (1981). Sequential nonparametrics: Invariance principles and statistical inference. New York, NY: Wiley.

Shao, & Cheng (2015). A change point based method for test speededness detection. Psychometrika. doi:10.1007/s11336-015-9476-7

Siegmund (1985). Sequential analysis: Tests and confidence intervals. New York, NY: Springer.

Sinharay (2016). Assessment of person fit using resampling-based approaches. Journal of Educational Measurement, 53, 63–85.

Snijders (2001). Asymptotic distribution of person-fit statistics with estimated person parameter. Psychometrika, 66, 331–342.

Tendeiro, J. N., & Meijer, R. R. (2012). A CUSUM to detect person misfit: A discussion and some alternatives for existing procedures. Applied Psychological Measurement, 36, 420–442.

Tendeiro, J. N., & Meijer, R. R. (2014). Detection of invalid test scores: The usefulness of simple nonparametric statistics. Journal of Educational Measurement, 51, 239–259.

Thissen, & Orlando (2001). Item response theory for items scored in two categories. In Thissen, & Wainer (Eds.), Test scoring (pp. 73–140). Hillsdale, NJ: Lawrence Erlbaum.

van der Linden, W. J. (1998). Bayesian item selection criteria for adaptive testing. Psychometrika, 63, 201–216.

van Krimpen-Stoop, E. M. L. A., & Meijer, R. R. (1999). The null distribution of person-fit statistics for conventional and adaptive tests. Applied Psychological Measurement, 23, 327–345.

van Krimpen-Stoop, E. M. L. A., & Meijer, R. R. (2000). Detecting person misfit in adaptive testing using statistical process control techniques. In van der Linden, W. J., & Glas, C. A. (Eds.), Computerized adaptive testing: Theory and practice (pp. 201–219). Dordrecht, the Netherlands: Springer.

van Krimpen-Stoop, E. M. L. A., & Meijer, R. R. (2001). CUSUM-based person-fit statistics for adaptive testing. Journal of Educational and Behavioral Statistics, 26, 199–217.

van Krimpen-Stoop, E. M. L. A., & Meijer, R. R. (2002). Detection of person misfit in computerized adaptive tests with polytomous items. Applied Psychological Measurement, 26, 164–180.

Warm, T. A. (1989). Weighted likelihood estimation of ability in item response theory. Psychometrika, 54, 427–450.

Level	Length	$W_{max, n}$	$L_{max, n}$	$S_{max, n}$
%	20	0.7	0.3	0.2
	40	0.9	0.6	0.5
	60	1.0	0.7	0.6
	100	1.0	0.9	1.0
%	20	2.6	1.9	1.3
	40	2.8	2.8	2.5
	60	3.5	3.4	3.4
	100	4.4	4.3	4.2

δ	Length	$W_{max, n}$	LARD	$C^{LR}$	$C^{T}$	$C^{l_{z}^{*}}$
1.0	20	60	59	57	58	56
	40	69	66	65	66	65
	60	77	70	71	74	73
	100	89	77	78	83	80
2.0	20	79	76	72	74	73
	40	92	88	87	88	90
	60	97	93	94	95	94
	100	100	98	98	99	91
