Parameter Drift Detection in Multidimensional Computerized Adaptive Testing Based on Informational Distance/Divergence Measures

Abstract

An informational distance/divergence-based approach is proposed to detect the presence of parameter drift in multidimensional computerized adaptive testing (MCAT). The study presents significance testing procedures for identifying changes in multidimensional item response functions (MIRFs) over time based on informational distance/divergence measures that capture the discrepancy between two probability functions. To approximate the MIRFs from the observed response data, the k-nearest neighbors algorithm is used with the random search method. A simulation study suggests that the distance/divergence-based drift measures perform effectively in identifying the instances of parameter drift in MCAT. They showed moderate power with small samples of 500 examinees and excellent power when the sample size was as large as 1,000. The proposed drift measures also adequately controlled for Type I error at the nominal level under the null hypothesis.

Keywords

item parameter drift, multidimensional item response function, multidimensional computerized adaptive testing

Multidimensional computerized adaptive testing (MCAT; Bloxom & Vale, 1987; Segall, 1996; Tam, 1992) has recently gained much popularity as a diagnostic measurement tool in educational, psychological, and clinical settings. It allows a more comprehensive description of examinees’ test performance than traditional unidimensional item response theory (IRT) and achieves considerable efficiency in test administration by taking advantage of adaptive item selection techniques (Chang, 2015). One of the critical issues in administering MCAT is ensuring that item parameters remain stable over time. Because scoring of examinees’ latent traits is predicated on item parameter invariance, any systematic instability in item parameters can adversely affect the scoring process. If test items are used to certify professional competence, as in certification or licensure assessments, shifts in item parameters can lead to false decisions when classifying examinees. Hence, it is important to identify instances of parameter drift as soon as possible to minimize any repercussions on the validity and comparability of the scaled scores.

There are a number of factors causing the item parameter drift in practice. The occurrence of parameter drift (Goldstein, 1983) can be attributed to changes in instruction or curriculum (Bock, Muraki, & Pfeiffenberger, 1988; Cook, Eignor, & Taft, 1988), changes in constructs (Chan, Drasgow, & Sawin, 1999), or changes made to items between test administrations (Sykes & Ito, 1993). The parameter drift can also appear as a consequence of item disclosure by previous test takers, test fraud, item overexposure, or inappropriate testwise training (Li, 2008).

A common practice for identifying parameter drift in unidimensional items is to recalibrate items and evaluate changes in the estimated parameter values over testing occasions. Parameter drift can be identified by comparing the item parameter estimates directly (Bock et al., 1988) or by comparing model fit statistics obtained from the different testing events (Glas, 2000). Alternatively, one can compare test characteristic curves across time points (Guo, Zheng, & Chang, 2015; Wollack, Cohen, & Wells, 2003). Statistical measures that were originally developed for the analysis of differential item functioning (DIF) also serve as drift detection measures (e.g., Kim & Cohen, 1991; Lord, 1980; Raju, 1988). In the context of computerized adaptive testing (CAT), the cumulative sum chart based on standardized differences between estimated item parameter values can be used to evaluate the drift (Veerkamp & Glas, 2000). For a discussion of causes and methods for detecting parameter drift in unidimensional tests, see Clark (2014).

Although drift detection methods for unidimensional tests have been studied extensively, relatively little work has developed or compared drift procedures for MCAT. A complication of applying the traditional methods to MCAT data is that outcomes of the drift analysis are likely to be confounded with high calibration error or linking error. Because the traditional approaches use item parameter estimates from separate calibrations as a basis for drift analysis, the efficacy of the procedures is critically dependent upon the quality of item calibration (Donoghue & Isham, 1998). If the parameter estimates contain high uncertainty due to large numbers of parameters or missing responses, inferences about the existence of parameter drift can hardly be validated. Besides, repeatedly calibrating items in MCAT is an onerous task, especially when items differ widely in exposure frequency. In instances where operational items need to be routinely monitored during the MCAT administrations, the computational burden of separate calibrations becomes even more serious.

The focus of this article is on developing viable options for detecting parameter drift without the need to calibrate items in the MCAT data. The new procedures do not involve item parameter estimation, and hence, they can be freed from calibration error or linking error. The procedures are based on significance testing of the difference between an observed multidimensional item response function (MIRF) and an anticipated MIRF. In a typical MCAT program, a precalibrated item pool is available such that items can be selected according to their known statistical properties. Therefore, the presence of parameter drift can be determined by comparing the probability function observed from the operational testing against the probability function derived from the initial item parameter values. The present article introduces four informational distance/divergence measures that serve this role.

The present article is organized as follows. First, the theoretical framework for the analysis of parameter drift in multidimensional items is briefly outlined. The article proceeds to introduce the informational distance/divergence measures that can be used for identifying the existence of the drift. The proposed methods are then evaluated in simulations. Finally, some discussion on the use of informational distance/divergence measures is provided. Throughout the article, it is assumed that items used for MCAT have adequate fit for the psychometric model of concern, and they are precalibrated with enough precision before the operational use. The initial parameter values of the items can thus be treated as known or reference values in the subsequent drift analysis.

Analysis of Parameter Drift in Multidimensional Items

Suppose a test purports to measure a set of latent proficiencies, $θ = (θ_{1}, \dots, θ_{p}) \in Θ$, and denote a vector of item parameters for a studied item as $β$. The function relating $θ$ to the probability of a correct item response is called the MIRF and denoted as $P (θ; β)$. Let $β_{0}$ and $β_{1}$ represent the item parameter vectors corresponding to two testing occasions. Because the item parameters uniquely dictate the shapes of item response functions (Kim, Cohen, & Park, 1995), item parameter invariance can be characterized by the equivalence of the two MIRFs:

P (θ; β_{0}) = P (θ; β_{1}),

for all levels of $θ \in Θ$. A response score for a dichotomous item, $X$, has the Bernoulli distribution, $f (X = x | θ; β) = P (θ; β)^{x} (1 - P (θ; β))^{1 - x}$, so Equation 1 can be equivalently expressed as

f_{0} = f (X = x | θ; β_{0}) = f (X = x | θ; β_{1}) = f_{1},

for all levels of $θ \in Θ$ . The probability functions, $f_{0}$ and $f_{1}$ , represent two hypotheses about the functional form of the studied item. When applied to MCAT, $f_{0}$ denotes the probability function computed from the preestimated item parameters, whereas $f_{1}$ represents the probability function observed from the response data.

Consider a dissimilarity measure, $D$, that captures the discrepancy between the probability functions. Equation 2 indicates that the degree of heterogeneity between $f_{0}$ and $f_{1}$ is a function of $θ$. To obtain an overall amount of parameter drift for the studied item, an average is taken over all possible levels of $θ$:

\mathcal{D} = \int_{Θ} D (f_{0} ∥ f_{1}; θ) ω (θ) d θ,

where $D (f_{0} ∥ f_{1}; θ)$ evaluates the distance or divergence of $f_{1}$ from $f_{0}$ at $θ$, and $ω (θ)$ is the density function of $θ$. The quantity $\mathcal{D}$ assesses the dissimilarity between $f_{0}$ and $f_{1}$ without regard to whether or not the magnitude of drift is constant across proficiency levels. The hypotheses for testing the presence of parameter drift are stated as $H_{0}: \mathcal{D} = 0$ and $H_{1}: \mathcal{D} > 0$. The null hypothesis of no drift is rejected at significance level α if the observed value of $\mathcal{D}$ exceeds the 100(1 − α)th percentile of its sampling distribution under the null hypothesis.

Computation of the drift statistic in Equation 3 requires a matching criterion so that the two probability functions can be evaluated at the same proficiency level. In the unidimensional tests, examinees’ number-correct scores or unidimensional proficiency estimates are commonly used as a matching variable. When a test is intended to measure multiple traits, matching on the unidimensional criterion can result in Type I error inflation (Ackerman, 1992). Therefore, to meet the validity standards and to avoid high Type I error rate, the present study employs latent proficiency vectors estimated from the operational MCAT as a matching criterion. The proficiency estimates take all pertinent latent dimensions into account and are readily obtainable from the administrations of the MCAT. Note that in multidimensional tests, it is almost impossible to match examinees on the exact values of the proficiency estimates (i.e., thin matching). A certain strategy must be employed to pool examinees based on their estimated proficiency vectors (i.e., thick matching). This goal is achieved by employing the k-nearest neighbors (k-NNs) algorithm.

A basic procedure follows from this. Given a collection of query vectors, $q_{l}$ ( $l = 1, \dots, Q$ ), the algorithm searches for k examinees whose estimated proficiency vectors are close to $q_{l}$ according to a certain distance metric. Item responses for the k-nearest examinees are then collected to approximate the MIRF. The probability of a correct response for $q_{l}$ is calculated by the sample proportion of the correct responses among the nearest neighbors as follows:

\tilde{P} (q_{l}) = \frac{1}{k} \sum_{i \in N_{k} (q_{l})}^{k} x_{ij},

where $N_{k} (q_{l})$ represents the k-NNs to a query vector $q_{l}$, and $x_{ij}$ is the observed response score of examinee $i$ on the studied item $j$. It is often necessary to use a weighted average of the k-NNs so that nearer neighbors contribute more to the average outcome than more distant ones. Let $d ({\hat{θ}}_{i}, q_{l})$ denote the distance between the query vector $q_{l}$ and the ith proficiency estimate ${\hat{θ}}_{i}$ in the neighborhood of $q_{l}$. Then, a weighted approximation of the MIRF is obtained as

\tilde{P} (q_{l}) = \sum_{i \in N_{k} (q_{l})}^{k} x_{ij} ω ({\hat{θ}}_{i}, q_{l}),

where

ω ({\hat{θ}}_{i}, q_{l}) = \frac{\exp [- d ({\hat{θ}}_{i}, q_{l})]}{\sum_{i \in N_{k} (q_{l})} \exp [- d ({\hat{θ}}_{i}, q_{l})]} .

The weights defined in this manner satisfy $\sum_{i = 1}^{k} ω ({\hat{θ}}_{i}, q_{l}) = 1 .$ The type of the distance metric $d (\cdot, \cdot)$ can be determined based on the properties of the latent parameters. A common practice for a vector of a continuous variable is to use the Euclidean distance.
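The unweighted and weighted approximations above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the study: the function name `knn_mirf` and its arguments are hypothetical, and the Euclidean metric is assumed for $d (\cdot, \cdot)$.

```python
import numpy as np

def knn_mirf(theta_hat, x, queries, k, weighted=True):
    """Approximate the MIRF at each query vector from the k nearest examinees.

    theta_hat : (n, p) estimated proficiency vectors
    x         : (n,) 0/1 responses to the studied item
    queries   : (Q, p) query vectors q_l
    """
    theta_hat = np.asarray(theta_hat, dtype=float)
    x = np.asarray(x, dtype=float)
    queries = np.asarray(queries, dtype=float)
    p_tilde = np.empty(len(queries))
    for l, q in enumerate(queries):
        d = np.linalg.norm(theta_hat - q, axis=1)  # Euclidean distance to q_l
        nn = np.argsort(d)[:k]                     # indices of the k nearest examinees
        if weighted:
            w = np.exp(-d[nn])
            w /= w.sum()                           # weights sum to 1 within the neighborhood
            p_tilde[l] = np.sum(w * x[nn])
        else:
            p_tilde[l] = x[nn].mean()              # simple proportion correct
    return p_tilde
```

Sorting all n distances for every query is the simplest approach; for large samples a spatial index could replace the full sort, but that is outside what the text describes.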

The main advantage of using the k-NNs technique in the MCAT data is that the algorithm always finds a proper set of matching variables for comparing the MIRFs. Despite the fact that the size or the distribution of examinee samples can change over different testing events, the k-NNs algorithm can approximate the MIRFs with comparable precision over time by taking the k-NNs close to the query vectors into consideration.

Informational Distance/Divergence Measures

To quantify the heterogeneity between the MIRFs, a statistical dissimilarity index that separates pairs of probability functions is needed. This article proposes four informational distance/divergence measures that serve this purpose. Informational distance/divergence measures have been widely employed in many areas of statistics, such as binary hypothesis testing, classification, and anomaly detection in high-dimensional data. The use of the distance/divergence measures is especially advantageous in multidimensional settings in that they can summarize the degree of heterogeneity between probability distributions in a single number, regardless of the number of dimensions or parameters.

The following introduces four informational distance/divergence measures for identifying the presence of parameter drift in MCAT and presents corresponding estimation methods of the drift statistics. The drift statistics below are defined for the dichotomous items for convenience; however, their definitions stated in terms of the MIRFs are not necessarily limited to the dichotomous cases. When multidimensional polytomous items are of concern, the distance/divergence measures can be adjusted to assess the discrepancy between two multidimensional item response category functions.

In line with the notation above, the vector of reference item parameters (i.e., initial item parameter estimates) is denoted as $β_{0}$, and the parameter vector that characterizes the item in the operational assessment is denoted as $β_{1}$. Analogously, the MIRFs corresponding to each item parameter vector are denoted as $P_{0} (θ) = P (θ; β_{0})$ and $P_{1} (θ) = P (θ; β_{1})$, respectively.

Euclidean Distance

One of the simplest ways to quantify the degree of dissimilarity between two probability functions is to use the Euclidean distance. For given $θ$, the Euclidean distance between $f_{0} = f (x | θ; β_{0})$ and $f_{1} = f (x | θ; β_{1})$ is calculated as

D_{E} (f_{0} ∥ f_{1}; θ) = \sqrt{\sum_{x = 0}^{1} {[f (x | θ; β_{0}) - f (x | θ; β_{1})]}^{2}} .

The quantity defined in Equation 7 is nonnegative and equals 0 if and only if $f_{0} = f_{1}$ for all $θ \in Θ$ . As a special case of the Minkowski distance (also known as $L_{p}$ norm) of order $p = 2$ , the Euclidean distance is always finite between the probability distributions.

The drift statistic based on the Euclidean distance is obtained by plugging Equation 7 into Equation 3:

D_{E} = \int_{Θ} D_{E} (f_{0} | | f_{1}; θ) ω (θ) d θ = \sqrt{2} \int_{Θ} | P_{0} (θ) - P_{1} (θ) | ω (θ) d θ .
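The second equality in Equation 8 follows because a dichotomous response has only two outcomes, with $f (1 | θ; β_{t}) = P_{t} (θ)$ and $f (0 | θ; β_{t}) = 1 - P_{t} (θ)$ for $t = 0, 1$. Substituting these into Equation 7 gives

D_{E} (f_{0} ∥ f_{1}; θ) = \sqrt{{[P_{0} (θ) - P_{1} (θ)]}^{2} + {[(1 - P_{0} (θ)) - (1 - P_{1} (θ))]}^{2}} = \sqrt{2 {[P_{0} (θ) - P_{1} (θ)]}^{2}} = \sqrt{2} | P_{0} (θ) - P_{1} (θ) | .

As an illustrative value (not taken from the study), $P_{0} (θ) = .60$ and $P_{1} (θ) = .50$ at a given $θ$ yield a pointwise distance of $\sqrt{2} \times .10 \approx .141$.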

In Equation 8, $P_{0} (θ)$ is computed from the reference item parameters; hence, only $P_{1} (θ)$ needs to be estimated from the response data. Applying the k-NNs technique described above, the drift statistic for the Euclidean distance is estimated as

{\hat{D}}_{E} = \sqrt{2} \sum_{l = 1}^{Q} | P_{0} (q_{l}) - {\tilde{P}}_{1} (q_{l}) | ω (q_{l}) Δ,

where $P_{0} (q_{l})$ is the MIRF evaluated at $q_{l}$ ; ${\tilde{P}}_{1} (q_{l})$ is the MIRF corresponding to $N_{k} (q_{l})$ ; $ω (q_{l})$ is the distributional weight of $q_{l}$ ; and $Δ$ is an increment between $q_{l - 1}$ and $q_{l}$ . A common choice of $Δ$ in the continuous space is to use the Euclidean distance.
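As a minimal sketch, the estimator in Equation 9 reduces to a weighted sum over the query vectors. The function name `euclidean_drift` is hypothetical, and the values of $ω (q_{l})$ and $Δ$ are taken as given inputs because how they are computed is an implementation choice.

```python
import numpy as np

def euclidean_drift(p0, p1_tilde, omega, delta):
    """Estimate the Euclidean drift statistic over Q query vectors.

    p0       : (Q,) reference MIRF values P_0(q_l)
    p1_tilde : (Q,) k-NN approximations of the observed MIRF
    omega    : (Q,) distributional weights of the query vectors
    delta    : scalar or (Q,) increment between successive query vectors
    """
    p0 = np.asarray(p0, dtype=float)
    p1_tilde = np.asarray(p1_tilde, dtype=float)
    omega = np.asarray(omega, dtype=float)
    return float(np.sqrt(2.0) * np.sum(np.abs(p0 - p1_tilde) * omega * delta))
```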

Hellinger Distance

The Hellinger distance (Bhattacharyya, 1943) is, up to a normalizing constant, the Euclidean distance between the square roots of two probability functions. In the present context, the Hellinger distance between $f_{0}$ and $f_{1}$ conditioned on $θ$ is defined as

D_{H} (f_{0} ∥ f_{1}; θ) = \sqrt{\frac{1}{2} \sum_{x = 0}^{1} {\left[\sqrt{f (x | θ; β_{0})} - \sqrt{f (x | θ; β_{1})}\right]}^{2}} .

The Hellinger distance, $D_{H}$, satisfies the metric properties (nonnegativity, symmetry, and the triangle inequality) and is bounded between 0 and 1 as a result of the Cauchy–Schwarz inequality. Deriving the drift statistic for the Hellinger distance parallels that for the Euclidean distance. Substituting the Hellinger distance for the dissimilarity measure in Equation 3 gives the following:

D_{H} = \int_{Θ} D_{H} (f_{0} | | f_{1}; θ) ω (θ) d θ .

Based on the observed responses, this quantity can be estimated as follows:

{\hat{D}}_{H} = \frac{1}{\sqrt{2}} \sum_{l = 1}^{Q} {\left\{{(\sqrt{P_{0} (q_{l})} - \sqrt{{\tilde{P}}_{1} (q_{l})})}^{2} + {(\sqrt{Q_{0} (q_{l})} - \sqrt{{\tilde{Q}}_{1} (q_{l})})}^{2}\right\}}^{1 / 2} ω (q_{l}) Δ,

where $Q_{0} (q_{l}) = 1 - P_{0} (q_{l})$ and ${\tilde{Q}}_{1} (q_{l}) = 1 - {\tilde{P}}_{1} (q_{l})$ .
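The Hellinger estimator can be sketched in the same way. The function name `hellinger_drift` is hypothetical, and $ω (q_{l})$ and $Δ$ are again treated as inputs.

```python
import numpy as np

def hellinger_drift(p0, p1_tilde, omega, delta):
    """Estimate the Hellinger drift statistic over Q query vectors."""
    p0 = np.asarray(p0, dtype=float)
    p1 = np.asarray(p1_tilde, dtype=float)
    omega = np.asarray(omega, dtype=float)
    q0, q1 = 1.0 - p0, 1.0 - p1
    # Pointwise distance between the square-root-transformed Bernoulli functions.
    pointwise = np.sqrt((np.sqrt(p0) - np.sqrt(p1)) ** 2
                        + (np.sqrt(q0) - np.sqrt(q1)) ** 2)
    return float(np.sum(pointwise * omega * delta) / np.sqrt(2.0))
```

With a single query vector of unit weight and increment, the two extreme functions $P_{0} = 1$ and ${\tilde{P}}_{1} = 0$ give a value of 1, which matches the upper bound stated above.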

Kullback–Leibler (KL) Divergence

The KL divergence (Kullback & Leibler, 1951), also known as relative entropy, has been frequently used in information theory as a measure for discriminating between two probability functions. Chang and Ying (1996) introduced the KL divergence to the IRT framework as a global item information measure that discriminates between examinees’ true proficiencies and provisional estimates in CAT. In the context of multidimensional IRT, the KL divergence of $f_{1}$ from $f_{0}$ at $θ$ is defined as follows:

D_{K L} (f_{0} ∥ f_{1}; θ) = \sum_{x = 0}^{1} f (x | θ; β_{0}) \log \frac{f (x | θ; β_{0})}{f (x | θ; β_{1})} .

The KL divergence is nonnegative because of Gibbs' inequality and equals 0 when the two distributions coincide almost everywhere. The drift statistic for the KL divergence is obtained as

D_{K L} = \int_{Θ} D_{K L} (f_{0} ∥ f_{1}; θ) ω (θ) d θ,

and estimated as

{\hat{D}}_{K L} = \sum_{l = 1}^{Q} \left[P_{0} (q_{l}) \log \frac{P_{0} (q_{l})}{{\tilde{P}}_{1} (q_{l})} + Q_{0} (q_{l}) \log \frac{Q_{0} (q_{l})}{{\tilde{Q}}_{1} (q_{l})}\right] ω (q_{l}) Δ .

While the KL divergence has been considered one of the most powerful measures for discriminating between probability functions in information theory, it has a number of drawbacks that may detract from its usefulness as a pairwise drift measure in applied settings. By definition, the KL divergence is not symmetric in $f_{0}$ and $f_{1}$, and hence, its value can differ depending on the order of the arguments. In addition, although the KL divergence is bounded below by the Hellinger distance such that $2 D_{H}^{2} \le D_{K L}$, the KL divergence can be infinite. Because of this unboundedness, the distribution of test statistics from the KL divergence can be severely right-skewed, resulting in unstable critical values when significance testing is implemented using the empirical sampling distributions. On this account, this study proposes an alternative dissimilarity measure that is bounded and invariant with respect to permutations of the arguments but still preserves most of the desirable properties of the KL divergence.
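The asymmetry and unboundedness noted above are easy to see numerically. The sketch below is illustrative only: the names `kl_drift` and `_plogp_over_q` are hypothetical, and the optional `eps` clipping of the observed proportions is an assumption added here to keep the statistic finite. It is not a step described in the study.

```python
import numpy as np

def _plogp_over_q(p, q):
    """Elementwise p * log(p / q), using the convention 0 * log(0 / q) = 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        t = p * np.log(p / q)
    return np.where(p > 0, t, 0.0)

def kl_drift(p0, p1_tilde, omega, delta, eps=None):
    """Estimate the KL drift statistic over Q query vectors.

    eps : optional clipping bound for the observed proportions
          (an assumption of this sketch, not part of the source method).
    """
    p0 = np.asarray(p0, dtype=float)
    p1 = np.asarray(p1_tilde, dtype=float)
    if eps is not None:
        p1 = np.clip(p1, eps, 1.0 - eps)
    pointwise = _plogp_over_q(p0, p1) + _plogp_over_q(1.0 - p0, 1.0 - p1)
    return float(np.sum(pointwise * np.asarray(omega, dtype=float) * delta))
```

Swapping the first two arguments generally changes the result, and an observed proportion of exactly 0 or 1 at any query vector makes the unclipped statistic infinite, which is more likely when k is small.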

Jensen–Shannon (JS) Divergence

Let $λ$ be a probability for $X$ being drawn from a probability distribution $f_{0}$ , and $1 - λ$ be a probability for $X$ having a counterpart probability distribution $f_{1}$ . A symmetric version of the KL divergence can be obtained by a weighted average as follows:

D_{λ} (f_{0} | | f_{1}; θ) = λ D_{K L} (f_{0} | | λ f_{0} + (1 - λ) f_{1}; θ) + (1 - λ) D_{K L} (f_{1} | | λ f_{0} + (1 - λ) f_{1}; θ) .

The quantity $D_{λ}$ evaluates the expected information gain about $X$ from discovering which probability function $X$ is sampled from. By imposing the same degree of uncertainty on the two probability functions, the JS (Rao, 1982) divergence is obtained as follows:

D_{J S} (f_{0} | | f_{1}; θ) = \frac{1}{2} [D_{K L} (f_{0} | | f_{m}; θ) + D_{K L} (f_{1} | | f_{m}; θ)],

where $f_{m} = (f_{0} + f_{1}) / 2$. The JS divergence is nonnegative and symmetric, its square root satisfies the metric properties, and it is bounded between 0 and $\log 2$. For evaluating the parameter drift in the multidimensional IRT framework, the JS divergence-based drift measure is defined as follows:

D_{JS} = \frac{1}{2} \int_{Θ} \sum_{x = 0}^{1} [f (x | θ; β_{0}) \log \frac{f (x | θ; β_{0})}{f_{m} (x | θ)} + f (x | θ; β_{1}) \log \frac{f (x | θ; β_{1})}{f_{m} (x | θ)}] ω (θ) d θ,

where $f_{m} (x | θ) = [f (x | θ; β_{0}) + f (x | θ; β_{1})] / 2 .$ Based on the k-NNs, the quantity is approximated as follows:

{\hat{D}}_{JS} = \sum_{l = 1}^{Q} \frac{1}{2} \left[P_{0} (q_{l}) \log \frac{P_{0} (q_{l})}{P_{m} (q_{l})} + Q_{0} (q_{l}) \log \frac{Q_{0} (q_{l})}{Q_{m} (q_{l})} + {\tilde{P}}_{1} (q_{l}) \log \frac{{\tilde{P}}_{1} (q_{l})}{P_{m} (q_{l})} + {\tilde{Q}}_{1} (q_{l}) \log \frac{{\tilde{Q}}_{1} (q_{l})}{Q_{m} (q_{l})}\right] ω (q_{l}) Δ,

where $P_{m} (q_{l}) = [P_{0} (q_{l}) + {\tilde{P}}_{1} (q_{l})] / 2$ and $Q_{m} (q_{l}) = [Q_{0} (q_{l}) + {\tilde{Q}}_{1} (q_{l})] / 2 .$
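A minimal sketch of the JS estimator follows, with the hypothetical names `js_drift` and `_plogp_over_q`, and with $ω (q_{l})$ and $Δ$ supplied as inputs. Because the mixture $P_{m}$ is zero only where both of its components are zero, no clipping is needed to keep the result finite.

```python
import numpy as np

def _plogp_over_q(p, q):
    """Elementwise p * log(p / q), using the convention 0 * log(0 / q) = 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        t = p * np.log(p / q)
    return np.where(p > 0, t, 0.0)

def js_drift(p0, p1_tilde, omega, delta):
    """Estimate the JS drift statistic over Q query vectors."""
    p0 = np.asarray(p0, dtype=float)
    p1 = np.asarray(p1_tilde, dtype=float)
    q0, q1 = 1.0 - p0, 1.0 - p1
    pm, qm = (p0 + p1) / 2.0, (q0 + q1) / 2.0   # mixture of the two functions
    pointwise = 0.5 * (_plogp_over_q(p0, pm) + _plogp_over_q(q0, qm)
                       + _plogp_over_q(p1, pm) + _plogp_over_q(q1, qm))
    return float(np.sum(pointwise * np.asarray(omega, dtype=float) * delta))
```

For a single query vector with unit weight and increment, $P_{0} = 1$ and ${\tilde{P}}_{1} = 0$ give $\log 2$, the upper bound of the divergence.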

Simulation Study

A simulation study was used to examine the performance of the distance/divergence-based drift measures. For the drift analysis to be fully sequential, significance testing must be carried out after each included observation; however, such a procedure is inefficient in terms of central processing unit (CPU) time. The present study instead conducted the drift analysis at three predefined time points. The drift statistics were computed each time an item had been administered to n = 500, 1,000, and 1,500 examinees, and significance testing was carried out on the difference between the original and observed probability functions.

Determination of k

Because the performance of the drift measures depends on the k-NNs algorithm, it is important to choose the size of k carefully. An overly large value of k would fail to evaluate the heterogeneity in the item response functions properly. If k is too small, the approximation of item response functions could be unstable because only a few responses are utilized for a given query vector. (In the extreme case of k = 1, $P_{0}$ and ${\tilde{P}}_{1}$ are computed based on only one response.) In both cases, the test statistics would lose power. To ensure stable estimation of the drift statistics and adequate statistical power, the present study examined the performance of the drift measures using sets of possible k values. The values of k were determined as $n / Q$ such that no information was lost from the response data. When $Q$ was not a factor of n, k was determined as the nearest integer greater than $n / Q$ so that reuse of the response data could be minimized in the computation of the drift statistics. For controlling the quality of the drift statistics, the k-NNs algorithm was required to have a minimum of six and a maximum of 15 query vectors.
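The rule for k amounts to a ceiling division. A small sketch, with the hypothetical helper name `neighbors_per_query`:

```python
import math

def neighbors_per_query(n, Q):
    """Return k = n / Q, rounded up when Q does not divide n.

    Q is restricted to the range of 6 to 15 query vectors stated in the text.
    """
    if not 6 <= Q <= 15:
        raise ValueError("Q must be between 6 and 15")
    return math.ceil(n / Q)

# Illustrative values for the three sample sizes with Q = 8 query vectors.
k_values = {n: neighbors_per_query(n, 8) for n in (500, 1000, 1500)}
```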

Determination of Query Vectors

Among a number of heuristic techniques for determining the query vectors, grid search and random search are the most widely used methods for hyperparameter optimization of the k-NNs algorithm. The grid search method uses grid points manually drawn from a subset of a hyperparameter space (i.e., $Θ$) as query vectors, whereas the random search method uses query vectors that are randomly sampled from $Θ$. Because the grid search operates on a grid that increases by a specific value on a finite interval, it can suffer from the curse of dimensionality. Compared with the grid search, the random search is more efficient and computationally less intensive (Bergstra & Bengio, 2012). For this reason, the current study employed the random search method for determining the query vectors in the computation of the drift statistics. To prevent attenuation of the drift impact, the upper and lower limits of $Θ$ were set at 2 and −2, respectively, and query vectors were randomly sampled from this subset.
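A minimal sketch of the random search step is given below. The text does not state the sampling distribution, so independent uniform draws on each dimension are assumed here; the function names are hypothetical. The second helper only illustrates why a grid grows quickly with the number of dimensions.

```python
import numpy as np

def sample_query_vectors(Q, p, lower=-2.0, upper=2.0, seed=None):
    """Draw Q query vectors from [lower, upper]^p.

    Assumption: uniform sampling on each dimension (not specified in the text).
    """
    rng = np.random.default_rng(seed)
    return rng.uniform(lower, upper, size=(Q, p))

def grid_size(points_per_axis, p):
    """Number of query vectors a full grid would need in p dimensions."""
    return points_per_axis ** p
```

With 10 grid points per axis, a full grid needs 100 query vectors for p = 2 and 1,000 for p = 3, whereas the random search keeps Q fixed regardless of p.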

Determination of Critical Values

The asymptotic distributions of the drift statistics proposed are unknown. Thus, critical values were obtained empirically through bootstrap resampling under the null hypothesis. A brief description of finding the critical values at significance level α is given as follows.

Step 1: Compute the drift statistics each time an item has been administered to n examinees in the null case of MCAT administration.

Step 2: For each n condition, treat the original set of the drift statistics as a population of scores and perform resampling with replacement.

Step 3: Repeat Step 2 to generate m resamples of test statistics.

Step 4: Calculate the 100(1 − α)th percentile within each of the m resamples to obtain critical values.

Step 5: Determine an empirical critical value as the average of the critical values over the m resamples.

The number of resamples, m, should be large enough to estimate the quantile well. The present study used m = 1,000 resamples and determined the critical values at significance level α = 0.05 in the upper tail of the sampling distributions. The critical values resulting from the above steps are size-corrected in that they do not depend on unknown population parameters, but they are likely to yield a test with little power. Therefore, to secure adequate power of the drift statistics, n = 500 was considered the minimum sample size for drift analysis.
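Steps 2 through 5 can be sketched as follows for one n condition. The function name `bootstrap_critical_value` is hypothetical, each resample is assumed to have the same size as the original set of null statistics, and NumPy's default linear interpolation is used for the quantile; none of these details is specified in the text.

```python
import numpy as np

def bootstrap_critical_value(null_stats, alpha=0.05, m=1000, seed=None):
    """Average upper-tail bootstrap quantile of drift statistics under the null.

    null_stats : drift statistics computed in the no-drift condition (Step 1)
    """
    rng = np.random.default_rng(seed)
    null_stats = np.asarray(null_stats, dtype=float)
    # Steps 2-3: m resamples with replacement, each the size of the original set.
    resamples = rng.choice(null_stats, size=(m, null_stats.size), replace=True)
    # Step 4: upper-tail quantile within each resample.
    crit = np.quantile(resamples, 1.0 - alpha, axis=1)
    # Step 5: average over the m resamples.
    return float(crit.mean())
```

An item would then be flagged when its observed drift statistic exceeds the returned value.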

Data Generation

MCAT was administered based on the multidimensional three-parameter logistic model (Reckase, 1997):

P (θ; a, b, c) = c + \frac{1 - c}{1 + \exp [- 1.7 a^{T} (θ - b)]},

where $a = (a_{1}, \dots, a_{p})$ is the item discrimination vector, $b$ is the intercept parameter, and $c$ is the lower asymptote. The dimension of $a$ depends on the number of latent traits being measured by the test. The present study considered two scenarios, two-dimensional and three-dimensional CAT, to examine the performance of the drift measures under different levels of dimensionality. Examinees’ latent proficiency levels were simulated from a multivariate normal distribution with zero means, unit variances, and correlations of 0.3. MCAT was administered to 5,000 examinees based on the Bayesian D-optimality item selection criterion (Segall, 1996). For proficiency estimation, maximum a posteriori (MAP) estimation was used. Test length was fixed at 35. Item exposure was controlled by setting the maximum item exposure rate at 0.3 such that no more than 30% of examinees in the sample received the same item. Each item pool had 300 test items with approximate simple structure (Zhang & Stout, 1999). Item parameters were randomly sampled from the following distributions: $a \sim U (0.5, 1.2)$ for the primary dimension(s), $a \sim U (0, 0.4)$ for the secondary dimension(s), $b \sim N (0, 1)$, and $c \sim \mathrm{Beta} (100, 400)$.
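The data-generating model can be sketched as below. This is an illustration under stated assumptions, not the simulation code of the study: all function names are hypothetical, the scalar $b$ is assumed to be subtracted from every coordinate of $θ$ in $a^{T} (θ - b)$, and each item is assumed to have exactly one randomly chosen primary dimension.

```python
import numpy as np

def m3pl_prob(theta, a, b, c):
    """Probability of a correct response under the model above.

    Assumption: the scalar b is subtracted from every coordinate of theta.
    """
    theta = np.atleast_2d(np.asarray(theta, dtype=float))  # (n, p)
    a = np.asarray(a, dtype=float)                         # (p,)
    z = 1.7 * (theta - b) @ a
    return c + (1.0 - c) / (1.0 + np.exp(-z))

def sample_items(n_items, p, rng):
    """Draw item parameters from the distributions stated in the text.

    Assumption: one primary dimension per item, chosen at random.
    """
    a = rng.uniform(0.0, 0.4, size=(n_items, p))           # secondary loadings
    primary = rng.integers(0, p, size=n_items)
    a[np.arange(n_items), primary] = rng.uniform(0.5, 1.2, size=n_items)
    b = rng.normal(0.0, 1.0, size=n_items)
    c = rng.beta(100, 400, size=n_items)
    return a, b, c

def sample_proficiencies(n, p, rng, rho=0.3):
    """Multivariate normal proficiencies: zero means, unit variances, correlation rho."""
    cov = np.full((p, p), rho)
    np.fill_diagonal(cov, 1.0)
    return rng.multivariate_normal(np.zeros(p), cov, size=n)
```

Adaptive item selection, exposure control, and MAP scoring are omitted; the sketch covers only the response model and the parameter distributions.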

Parameter Drift Simulation

Within each MCAT scenario, performances of the drift measures were evaluated with or without parameter drift occurring. The study design included three proportions of item pool-wide parameter drift: 0%, 5%, and 10%. Motivated by the tendency in practice in which overexposed items tend to have high likelihood of being flagged for parameter drift (Zhang, 2014), drift items were randomly selected from a set of frequently used items in the null case of MCAT. All items chosen for parameter drift were made to experience the same type of parameter drift to examine the impact of drift type systematically. Two types of parameter drift were created. The parallel shifts in the MIRFs were made by changing the b-parameters of the drift items by 0.3 or 0.5 units from the original values. The nonparallel transformations of the MIRFs were simulated by changing the a-parameters by 0.5 units and the b-parameters by 0.3 or 0.5 units. The direction of the parameter drift was determined such that the drift items became less discriminating or less difficult over the testing events (e.g., DeMars, 2004; Veerkamp & Glas, 2000). Because the proposed drift measures do not place any constraints on the direction, similar inferences can be made about the performance of the measures in the opposite direction.

Crossing the design factors resulted in 18 different MCAT scenarios. Each scenario was executed with 100 replications to reduce sampling error, and the results were averaged over the replications.

Results

Determination of Q

The preparatory study for determining the number of query vectors suggested that $Q$ values between eight and 10 have minor differences in terms of power and Type I error rates. (The performances of the drift measures in relation to varying $Q$ values are presented in the appendix.) For the sake of simplicity in presentation, results provided below are summarized using the fixed values of $Q$ . Specifically, the Euclidean distance and the KL divergence used eight query vectors. Results for the Hellinger distance and the JS divergence were reported based on 10 and nine query vectors, respectively. These choices were motivated by the consideration that the resulting tests had the highest power against the alternative hypothesis across all simulation conditions. On the whole, as the $Q$ values were further from these optimal values, both the power and Type I error tended to decrease.

Empirical Sampling Distributions

To serve as test statistics for significance testing, drift measures must have distinct limiting distributions under the null hypothesis. As their asymptotic distributions are unknown, the current study obtained empirical sampling distributions via simulation and evaluated their characteristics in the null case of no drift. Figure 1 presents empirical sampling distributions of the drift statistics obtained from the two-dimensional CAT. All drift measures demonstrated explicit forms of sampling distributions. The distance-based statistics had symmetric sampling distributions, whereas the divergence-based measures had positively skewed distributions. The impact of increasing n on the drift statistics was manifested by shifts of the sampling distributions toward the left. In general, the empirical sampling distributions became less dispersed, less skewed, and less peaked as n increased.

Figure 1.

Empirical sampling distributions in the two-dimensional computerized adaptive testing.

Similar patterns were observed for the three-dimensional case. Figure 2 provides empirical sampling distributions obtained from the three-dimensional CAT. All drift test statistics showed distinct sampling distributions under the null. Compared with Figure 1, the sampling distributions in Figure 2 had smaller averages and smaller standard deviations (SDs), suggesting that drift statistics from the higher dimension were less distinguishable. The trends associated with n were consistent with those for the two-dimensional CAT. When the $Q$ was held constant for each drift measure, the sampling distributions tended to shift toward the left and became less dispersed, less skewed, and less peaked as n increased.

Figure 2.

Empirical sampling distributions in the three-dimensional computerized adaptive testing.

The figures presented above suggest that the distance/divergence-based drift statistics have explicit sampling distributions under the null hypothesis. Based on these distributions, empirical critical values were obtained through bootstrap resampling at the significance level of 0.05. Table 1 reports the average and SD values of the critical values used in the study. In line with the prior results, the empirical critical values tended to decrease as n or p increased. Along with these changes, the critical values became less dispersed, suggesting that the critical values became more consistent.

Table 1.

Averages and SDs of Empirical Critical Values.

$n$	$p = 2$				$p = 3$
	EU	HE	KL	JS	EU	HE	KL	JS
500 (SD)	.107 (.009)	.268 (.019)	.030 (.004)	.009 (.001)	.036 (.003)	.154 (.011)	.010 (.001)	.003 (.000)
1,000 (SD)	.080 (.008)	.205 (.016)	.018 (.003)	.005 (.001)	.027 (.003)	.119 (.009)	.006 (.001)	.002 (.000)
1,500 (SD)	.066 (.007)	.170 (.015)	.012 (.002)	.004 (.001)	.022 (.002)	.095 (.007)	.004 (.001)	.001 (.000)

Note. $n$ = sample size; $p$ = number of latent dimensions; EU = Euclidean distance; HE = Hellinger distance; KL = Kullback–Leibler divergence; JS = Jensen–Shannon divergence.

Type I Error Study

In Table 2, Type I error rates of the drift measures are summarized for the null and drift cases. The Type I error rate was defined as the proportion of nondrift items that were erroneously identified as drift items in each MCAT administration. The values reported for the drift cases were obtained by averaging over the drift levels and drift types because no systematic pattern was found across these factors. Overall, the drift measures maintained good adherence to the nominal level of significance under the null hypothesis. They consistently kept Type I error rates below the nominal level regardless of n or p. SDs of the Type I error rates remained small and increased slightly as n increased. The average SDs were .006, .008, and .009 for the three n conditions, respectively.

Table 2.

Type I Error Rates in the Null and Drift Cases.

Drift	$n$	$p = 2$				$p = 3$
		EU	HE	KL	JS	EU	HE	KL	JS
0%	500	.043	.042	.042	.043	.046	.046	.046	.047
	1,000	.037	.039	.037	.037	.044	.044	.044	.043
	1,500	.037	.036	.037	.038	.042	.045	.045	.044
5%	500	.054	.067	.068	.064	.057	.061	.065	.060
	1,000	.068	.076	.072	.076	.075	.072	.076	.072
	1,500	.080	.090	.088	.091	.111	.112	.114	.112
10%	500	.076	.107	.112	.100	.074	.083	.093	.081
	1,000	.113	.146	.146	.149	.114	.123	.133	.123
	1,500	.153	.191	.192	.197	.188	.209	.213	.206

Note. $n$ = sample size; $p$ = number of latent dimensions; Drift = proportion of drift items in the item pools; EU = Euclidean distance; HE = Hellinger distance; KL = Kullback–Leibler divergence; JS = Jensen–Shannon divergence.
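The tallies behind these rates can be sketched as below, assuming an item is flagged when its drift statistic exceeds the empirical critical value. The function names and the boolean-list representation are hypothetical. The power rate defined in the Power Study section is the analogous proportion computed among the drift items.

```python
def flag_items(stats, critical_value):
    """Flag an item when its drift statistic exceeds the critical value."""
    return [s > critical_value for s in stats]


def type1_and_power(flagged, drifted):
    """Return (Type I error rate, power rate) for one administration.

    flagged: booleans, True if the item was flagged by the drift measure.
    drifted: booleans, True if the item truly drifted (known in simulation).
    """
    n_drift = sum(drifted)
    n_clean = len(drifted) - n_drift
    false_alarms = sum(f and not d for f, d in zip(flagged, drifted))
    hits = sum(f and d for f, d in zip(flagged, drifted))
    # Rates are undefined when a pool has no items of the relevant kind.
    type1 = false_alarms / n_clean if n_clean else float("nan")
    power = hits / n_drift if n_drift else float("nan")
    return type1, power
```

In the null case (0% drift) only the first value is defined, which matches the layout of Table 2.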

The drift measures began to display Type I error inflation once the item pools were contaminated with drift items. The inflation was minor to moderate when drift items made up 5% of the pool and became substantial when they made up 10%. Among the distance/divergence measures, the Euclidean distance was the most conservative in terms of Type I error inflation. Increasing n in the drift analyses raised the Type I error rates for all drift measures, possibly because of the greater sensitivity that comes with larger n. Consistent with the null case, SDs of the Type I error rates were influenced by n: the larger the n, the greater the variability in the Type I error rates. In the 5% drift case, SDs of the Type I error rates were .031, .044, and .054 for increasing n. In the 10% drift case, they increased to .046, .073, and .094. Overall, dimensionality had no clear effect on the averages or SDs of the Type I error rates.

Power Study

Table 3 reports power rates of the distance/divergence measures when 5% of the items in the pool were flagged for parameter drift. The power rate was defined as the proportion of correctly identified drift items in each MCAT administration. For evaluation, power rates above .80 were considered excellent (Cohen, 1992) and rates between .70 and .80 were considered moderate. In Table 3, the distance/divergence measures identified changes in the item parameters quite effectively despite the presence of multidimensionality. The measures showed moderate power when n was as small as 500 and excellent power when n was equal to or greater than 1,000. Overall, increasing n led to substantial improvement in detection power for all drift measures. The increase in n also reduced the SDs of the power rates, which averaged .108, .094, and .082 in the two-dimensional CAT and .110, .096, and .084 in the three-dimensional CAT for n = 500, 1,000, and 1,500, respectively. These results suggest that the drift measures performed more consistently with larger n.

Table 3.

Power Rates When 5% of Item Pools Were Flagged for Parameter Drift.

$Δ (a, b)$	$n$	$p = 2$				$p = 3$
		EU	HE	KL	JS	EU	HE	KL	JS
(0, .3)	500	.754	.772	.769	.758	.753	.763	.784	.775
	1,000	.836	.839	.861	.850	.818	.847	.853	.836
	1,500	.886	.896	.901	.895	.868	.889	.886	.883
(0, .5)	500	.751	.769	.777	.781	.751	.777	.771	.787
	1,000	.838	.843	.853	.851	.827	.838	.851	.845
	1,500	.882	.895	.904	.908	.876	.887	.893	.893
(.5, .3)	500	.755	.756	.767	.764	.746	.759	.770	.771
	1,000	.835	.833	.852	.861	.821	.831	.847	.843
	1,500	.885	.880	.906	.898	.874	.875	.895	.889
(.5, .5)	500	.738	.760	.771	.765	.758	.775	.775	.770
	1,000	.839	.848	.845	.851	.825	.841	.846	.837
	1,500	.889	.896	.904	.907	.872	.891	.897	.884

Note. $Δ (a, b)$ = changes in a- and b-parameters; $n$ = sample size; $p$ = number of latent dimensions; EU = Euclidean distance; HE = Hellinger distance; KL = Kullback–Leibler divergence; JS = Jensen–Shannon divergence.

In Table 3, the divergence measures generally outperformed the distance measures. The differences between the lowest and highest power rates, however, were less than .05 in all cases, indicating that the drift measures under evaluation performed comparably in detecting drift items. Overall, the impact of the drift level and the drift type on the power rates seemed minor. One plausible explanation for this pattern is the use of random query vectors in the k-NNs algorithm. Because the query vectors were randomly selected, the absolute values of the drift statistics could vary with the selected query vectors, which makes a direct comparison of power across drift levels and drift types less meaningful despite the systematic differences in these factors. The grid search method, by contrast, may show distinct patterns related to these factors. In preliminary studies, when a set of fixed query vectors (e.g., a grid over [−2, 2] at increments of 1) was used in the k-NNs algorithm, the drift measures displayed clear patterns in the power rates as the drift level and the drift type changed: the larger the drift level, the higher the power, and items with b-parameter drift were consistently detected better than those with ab-parameter drift.
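The contrast between random and fixed query vectors can be illustrated with a short sketch. It assumes that the response function at a query vector is approximated by the proportion of correct responses among the k examinees whose ability estimates lie closest to that vector. This assumption, the sampling range for random queries, and all function names are illustrative and are not taken from the study.

```python
import itertools
import math
import random


def knn_response_prob(query, thetas, responses, k):
    """Approximate P(correct) at `query` from its k nearest examinees.

    thetas: ability-estimate vectors (tuples) of examinees who saw the item.
    responses: matching 0/1 item responses.
    """
    order = sorted(range(len(thetas)),
                   key=lambda i: math.dist(query, thetas[i]))
    nearest = order[:k]
    return sum(responses[i] for i in nearest) / k


def random_queries(n_queries, p, low=-2.0, high=2.0, seed=0):
    """Random search: draw query vectors uniformly (range is an assumption)."""
    rng = random.Random(seed)
    return [tuple(rng.uniform(low, high) for _ in range(p))
            for _ in range(n_queries)]


def grid_queries(p, low=-2, high=2, step=1):
    """Grid search: fixed query vectors, e.g. [-2, 2] at increments of 1."""
    axis = list(range(low, high + 1, step))
    return list(itertools.product(axis, repeat=p))
```

With the grid, every item is evaluated at the same points, so statistics are comparable across conditions. With random queries, the evaluation points change from run to run, which is one way the absolute size of a drift statistic could vary independently of the drift level.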

Table 4 provides power rates of the drift measures when 10% of the items in the item pools were flagged for parameter drift. Comparison between Tables 3 and 4 reveals that the proportion of drift items in the pools had a clear impact on the power of the drift measures. As the item pools included more drift items, the detection power of the drift measures decreased across all conditions. The reduction in power was, however, rather modest; the differences in the power rates were less than .05 under all conditions examined in this study.

Table 4.

Power Rates When 10% of Item Pools Were Flagged for Parameter Drift.

$Δ (a, b)$	$n$	$p = 2$				$p = 3$
		EU	HE	KL	JS	EU	HE	KL	JS
(0, .3)	500	.713	.734	.743	.740	.717	.746	.753	.753
	1,000	.805	.812	.826	.826	.802	.809	.827	.816
	1,500	.857	.867	.880	.874	.849	.870	.878	.868
(0, .5)	500	.713	.741	.746	.741	.724	.736	.750	.739
	1,000	.807	.810	.824	.827	.795	.820	.824	.818
	1,500	.858	.871	.881	.871	.851	.868	.881	.881
(.5, .3)	500	.707	.729	.736	.725	.712	.741	.734	.736
	1,000	.804	.799	.808	.816	.794	.808	.817	.811
	1,500	.847	.856	.862	.868	.845	.870	.868	.874
(.5, .5)	500	.707	.720	.732	.724	.718	.746	.741	.743
	1,000	.805	.809	.817	.806	.791	.798	.813	.803
	1,500	.853	.849	.879	.860	.845	.860	.873	.863

Conclusion

Preserving the quality of an item pool is essential for any continuous testing program to ensure that test scores have the same meaning over time. The traditional approach to quality control is to routinely recalibrate items and compare parameter estimates with initial values. When data from MCAT are analyzed for parameter drift, difficulties arise due to the multidimensionality and sparseness of the test data. In this article, significance testing procedures were proposed to identify the presence of parameter drift in MCAT without the need for calibrating items and, thereby, without invoking problems related to calibration or linking error. The drift indices were based on four widely used informational distance/divergence measures so that the heterogeneity in the MIRFs could be summarized in a single numeric value. The significance testing was conducted for the difference between two MIRFs, one from the initial item parameter estimates and the other from operational testing. To approximate the MIRFs from the observed data, the k-NNs algorithm was employed with the random query search method.
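As an illustration of how a discrepancy between two response functions can be reduced to a single number, the sketch below gives textbook forms of the four measures for a dichotomous item, where p is the model-implied probability of a correct response and q is the probability approximated from operational data. The clipping constant, the averaging over query vectors, and the function names are assumptions. The study's exact definitions and scaling may differ, so the values need not match the magnitudes in Table 1.

```python
import math

EPS = 1e-12  # assumed clipping constant that keeps the logarithms finite


def _clip(p):
    return min(max(p, EPS), 1.0 - EPS)


def euclidean(p, q):
    """Euclidean distance between the Bernoulli vectors (p, 1-p), (q, 1-q)."""
    return math.sqrt((p - q) ** 2 + ((1 - p) - (1 - q)) ** 2)


def hellinger(p, q):
    """Hellinger distance, bounded between 0 and 1."""
    s = ((math.sqrt(p) - math.sqrt(q)) ** 2
         + (math.sqrt(1 - p) - math.sqrt(1 - q)) ** 2)
    return math.sqrt(s / 2)


def kullback_leibler(p, q):
    """KL divergence of Bernoulli(q) from Bernoulli(p); not symmetric."""
    p, q = _clip(p), _clip(q)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def jensen_shannon(p, q):
    """JS divergence: average KL of each distribution from their mixture."""
    m = (p + q) / 2
    return 0.5 * kullback_leibler(p, m) + 0.5 * kullback_leibler(q, m)


def drift_statistic(measure, probs_initial, probs_observed):
    """Average a measure over query vectors (the averaging is an assumption)."""
    pairs = list(zip(probs_initial, probs_observed))
    return sum(measure(p, q) for p, q in pairs) / len(pairs)
```

All four functions return zero when the two probabilities coincide and grow as they separate, which is the property a drift index needs.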

The simulation study demonstrated the potential of the distance/divergence measures as drift measures in MCAT. The drift statistics had explicit sampling distributions and adequate control over Type I error in the null MCAT administrations. The drift measures produced moderate power with small samples (n = 500) and excellent power once the sample size increased to 1,000. Contamination of the item pools by drift items degraded the power and Type I error performances of the drift measures, yet in a predictable manner.

The procedures developed in this study neither assume a prior model nor require item calibration, and hence, they are inexpensive to use and much less cumbersome than the traditional calibration-based methods. Furthermore, because the procedures need only knowledge of the response function to compute the drift statistics, they can be easily generalized to other parametric models. The k-NNs technique introduced in this study also shows promise for approximating the MIRFs in operational testing. Despite changes in samples (and therefore in data sparseness) over time, it always finds proper sets of matching variables for comparing the MIRFs with comparable precision.

The performance of the distance/divergence measures can be further examined in future studies by systematically varying the item parameter values, the level of parameter drift, the degree of item pool contamination, and so on. In addition, since both the drift analysis and DIF analysis concern whether an item functions the same in different sets of data within the IRT framework, performances of the distance/divergence measures may well be investigated in detecting the existence of DIF. DIF studies will call for additional deliberation on factors such as sample sizes or proficiency differences between examinee groups.

Another issue that merits a systematic study is whether the distance/divergence measures would remain effective in the presence of MIRF misspecification. As one anonymous reviewer pointed out, items may be falsely flagged for parameter drift because of misfit of the response model even when no drift is present. Previous research has shown that even small amounts of model misfit can result in serious Type I error inflation for parametric DIF detection procedures (Bolt, 2002). A future investigation could examine the extent to which model misfit affects the performance of the drift measures and whether the distance/divergence measures offer advantages over fully parametric calibration-based procedures. The present study relied on the assumptions that the item response model used for MCAT is correctly specified and that the item pools from which items are selected are precalibrated with sufficient accuracy. Hence, the robustness of the drift procedures to violations of these assumptions warrants future discussion. A systematic study of this issue would entail an evaluation of Type I error performance under minor violations of the response model or with an empirical data set in which no parameter drift is expected.

Acknowledgements

The authors would like to thank Associate Editor, Dr. Daniel Bolt, and anonymous reviewers for their constructive comments and suggestions to improve the quality of the paper.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

Ackerman, T. A. (1992). A didactic explanation of item bias, item impact, and item validity from a multidimensional perspective. Journal of Educational Measurement, 29, 67-91.

Bergstra & Bengio (2012). Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13, 281-305.

Bhattacharyya (1943). On a measure of divergence between two statistical populations defined by their probability distributions. Bulletin of Calcutta Mathematical Society, 35, 99-109.

Bloxom, B. M., & Vale, C. D. (1987, June). Multidimensional adaptive testing: A procedure for sequential estimation of the posterior centroid and dispersion of theta. Paper presented at the annual meeting of the Psychometric Society, Montreal, Quebec, Canada.

Bock, R. D., Muraki, & Pfeiffenberger (1988). Item pool maintenance in the presence of item parameter drift. Journal of Educational Measurement, 25, 275-285.

Bolt, D. M. (2002). A Monte Carlo comparison of parametric and nonparametric polytomous DIF detection methods. Applied Measurement in Education, 15, 113-141.

Chan, K.-Y., Drasgow, & Sawin, L. L. (1999). What is the shelf life of a test? The effect of time on the psychometrics of a cognitive ability test battery. Journal of Applied Psychology, 84, 610-619.

Chang, H.-H. (2015). Psychometrics behind computerized adaptive testing. Psychometrika, 80, 1-20.

Chang, H.-H., & Ying (1996). A global information approach to computerized adaptive testing. Applied Psychological Measurement, 20, 213-229.

Clark, A. K. (2014, April). Parameter drift methodology and operational testing application. Poster presented at the annual meeting of the National Council on Measurement in Education, Philadelphia, PA.

Cohen (1992). A power primer. Psychological Bulletin, 112, 155-159.

Cook, L. L., Eignor, D. R., & Taft, H. L. (1988). A comparative study of the effects of recency of instruction on the stability of IRT and conventional item parameter estimates. Journal of Educational Measurement, 25, 31-45.

DeMars, C. E. (2004). Detection of item parameter drift over multiple test administrations. Applied Measurement in Education, 17, 265-300.

Donoghue, J. R., & Isham, S. P. (1998). A comparison of procedures to detect item parameter drift. Applied Psychological Measurement, 22, 33-51.

Glas, C. A. W. (2000). Item calibration and parameter drift. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 183-199). Norwell, MA: Kluwer Academic.

Goldstein (1983). Measuring changes in educational attainment over time: Problems and possibilities. Journal of Educational Measurement, 20, 369-377.

Guo, Zheng, & Chang, H.-H. (2015). A stepwise test characteristic curve method to detect item parameter drift. Journal of Educational Measurement, 52, 280-300.

Kim, S.-H., & Cohen, A. S. (1991). A comparison of two area measures for detecting differential item functioning. Applied Psychological Measurement, 15, 269-278.

Kim, S.-H., Cohen, A. S., & Park, T. H. (1995). Detection of differential item functioning in multiple groups. Journal of Educational Measurement, 32, 261-276.

Kullback & Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22, 79-86.

(2008). Multidimensionality and item parameter drift: An investigation of linking items in a large-scale certification test (Unpublished doctoral dissertation). Michigan State University, East Lansing.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum.

Raju, N. S. (1988). The area between two item characteristic curves. Psychometrika, 53, 495-502.

Rao, C. R. (1982). Diversity and dissimilarity coefficients: A unified approach. Theoretical Population Biology, 21, 24-43.

Reckase, M. D. (1997). The past and future of multidimensional item response theory. Applied Psychological Measurement, 21, 25-36.

Segall, D. O. (1996). Multidimensional adaptive testing. Psychometrika, 61, 331-354.

Sykes, R. C., & Ito (1993, April). Item parameter drift in IRT-based licensure examinations. Paper presented at the annual meeting of the National Council on Measurement in Education, Atlanta, GA.

Tam, S. S. (1992). A comparison of methods for adaptive estimation of a multidimensional trait (Unpublished doctoral dissertation). Columbia University, New York, NY.

Veerkamp, W. J. J., & Glas, C. A. W. (2000). Detection of known items in adaptive testing with a statistical quality control method. Journal of Educational and Behavioral Statistics, 25, 373-389.

Wollack, J. A., Cohen, A. S., & Wells, C. S. (2003). A method for maintaining scale stability in the presence of test speededness. Journal of Educational Measurement, 40, 307-330.

Zhang (2014). A sequential procedure for detecting compromised items in the item pool of a CAT system. Applied Psychological Measurement, 38, 87-104.

Zhang & Stout (1999). The theoretical DETECT index of dimensionality and its application to approximate simple structure. Psychometrika, 64, 213-249.