A Mixture IRTree Model for Performance Decline and Nonignorable Missing Data

Abstract

In educational assessments and achievement tests, test developers and administrators commonly assume that test-takers attempt all test items with full effort and leave no blank responses with unplanned missing values. However, aberrant response behavior—such as performance decline, dropping out beyond a certain point, and skipping certain items over the course of the test—is inevitable, especially for low-stakes assessments and speeded tests due to low motivation and time limits, respectively. In this study, test-takers are classified as normal or aberrant using a mixture item response theory (IRT) modeling approach, and aberrant response behavior is described and modeled using item response trees (IRTrees). Simulations are conducted to evaluate the efficiency and quality of the new class of mixture IRTree model using WinBUGS with Bayesian estimation. The results show that the parameter recovery is satisfactory for the proposed mixture IRTree model and that treating missing values as ignorable or incorrect and ignoring possible performance decline results in biased estimation. Finally, the applicability of the new model is illustrated by means of an empirical example based on the Program for International Student Assessment.

Keywords

item response theory (IRT), performance decline, missing not at random, IRTree, mixture models

When cognitive ability or achievement tests are administered, test-takers respond to administered items and are expected to answer all items with maximum effort. The resulting measures are expected to provide inferences about their performance; these inferences in turn inform classroom instruction or enable international or interstate comparisons on large-scale assessments. Within the framework of classical scoring rules, number-correct scoring assumes that a test-taker responds to all items according to his or her ability, and no credit is given when items are omitted (Lord, 1975). However, this may not be the case in practice because test-takers are more likely to adopt different response strategies when they do not know the correct answers or when they are not motivated to answer. The formula scoring that has been proposed for correction of test raw scores defines three different operational processes in item responses: test-takers know the correct answer and endorse it, test-takers are not certain about the answer and omit it, and test-takers randomly guess among the given options (Rowley & Traub, 1977). In several empirical studies, researchers found that employing partial knowledge in item responses or answering items with partial effort plays a crucial role in the formula-scoring model (Bliss, 1980; Crocker & Algina, 1986; Cross & Frary, 1977). Thus, aberrant responses, which are defined as responses to test items that are inconsistent and are expected to threaten test validity (Meijer, 1996), can be introduced by the differing operational processes when administering test items. This study focuses on the aberrant responses of item omission and effort decline.

An information-processing approach that is well documented in cognitive component theory serves as an alternative perspective to identify and measure the distinct steps in problem solving for a cognitive ability test (Sternberg, 1977; Sternberg et al., 2003). Test items are viewed as cognitive tasks in which success on the items can be explicitly related to distinct and requisite mental processes in sequence. Encountering questions in an ability test comprises at least four stages (Leighton & Gierl, 2007; Messick, 1989; Newell & Simon, 1972; Snow & Lohman, 1989). First, test-takers must pay attention to the question and comprehend it (i.e., the stage of perception and attention). Second, external stimuli are converted into internal representation (i.e., the representation stage). Third, a problem space is created for test-takers to retrieve their substantive knowledge and compare that knowledge and internal stimuli to search for the answer (i.e., the working-memory-process stage). Finally, test-takers have to evaluate the result and generate the answer (i.e., the generation stage). Sometimes, information processing may not be as ideal as the cognitive component theory assumes because noncognitive characteristics of test-takers, such as test anxiety, test-taking strategies, motivation, and perseverance, can interfere with their mental processes and influence their test performance (Khine & Areepattamannil, 2016). When a highly anxious test-taker responds to items in a timed high-stakes test, for example, test anxiety and time pressure may disrupt the attention stage and result in not-reached items, interfere with the representation stage and lead to omission of that item, restrict working memory capacity and prevent the test-taker from using cognitive resources as well as he/she can (i.e., performance decline), or harm the strategic use of metacognitive skills (Ashcraft & Krause, 2007; Cassady & Johnson, 2002; Dutke & Stöber, 2001; Eysenck & Calvo, 1992; Peng et al., 2014; Sarason, 1988; Tobias, 1992; Tobias & Everson, 1997).

Item omission and performance decline can also be observed in low-stakes assessments. According to expectancy-value theory (Wigfield & Eccles, 2000), when tests have no consequences, test-takers who place a high value on tests persevere and expend more effort in test-taking than their counterparts who place a lower value on tests (Cole et al., 2008; Peng et al., 2014; Wise & DeMars, 2005). Therefore, a test-taker with low motivation may not persist throughout a test and may drop out after responding to certain items (i.e., not-reached items), may perceive an item and decide to omit it, or may attempt to give an answer but use less effort and seldom use metacognitive strategies (Boekaerts, 1997; Hong et al., 2009; Wise, 2009). Although dropping out, skipping, and performance decline processes can be attributed to different reasons and explained based on diverse theories, it is evident that these aberrant response behaviors are explicitly distinct and should be conceptualized in different fashions.

The assumption that all test-takers attempt all items to the best of their ability and leave no items blank is not realistic in practical testing situations because test-taker motivation may differ from person to person, and time constraints are frequently placed on the item response process. These nuisance factors contaminate the intended-measure ability and threaten test validity and reliability when fitting standard item response theory (IRT) models (Lord, 1980) to data. For example, the National Report Card showed that on average, 12th graders who did not have outstanding performance on the 2005 National Assessment of Educational Progress (NAEP) earned good grades in advanced courses (Grigg et al., 2007). This result can be explained by the low-stakes nature of the NAEP, which did not motivate test-takers to apply their best efforts (Cao & Stokes, 2008). Another example suggests that test-takers are more likely to receive lower scores on end-of-test items (Bolt et al., 2002; Goegebeur et al., 2008; Jin & Wang, 2014; Yamamoto & Everson, 1997) or to leave end-of-test items blank (Lu & Sireci, 2007; Suh et al., 2012) on a low-stakes large-scale educational assessment (e.g., the NAEP) or a timed power test with high-stakes purposes (e.g., a college entrance examination).

Few studies have simultaneously investigated performance decline in answering test items and missing item responses. Unplanned missing data are essentially unavoidable and the percentage of nonresponse data is not trivial in real testing situations. According to the 2006 Program for International Student Assessment (PISA) study, an average of 10% of the items were skipped and 4% were not reached (OECD, 2009). The missingness mechanism during testing should be considered and distinguished from performance decline on answering test items due to both speededness and low motivation. The tendency to reduce effort in responding to items and the tendency to omit responses are quite different processes constituting the patterns of examinees’ aberrant response behavior. In addition, omitted and not-reached items are mutually exclusive categories, and different processing models should be considered (Lord, 1980). This study integrates different item response models for aberrant response behavior. Additionally, it interprets different cognitively operational processes in test-taking using an IRTree-based approach and extends the tree-based model to classify different aberrant response classes using mixture distributions, which potentially sheds new light on data analysis methods and serves as the major contribution of this study. Using the developed model to fit the data collected from a large-scale assessment or high-stakes test, rather than employing a restricted model that is limited to some specific response behaviors, can facilitate the test validity and scoring inferences of test-takers.

The purpose of this study is to provide a new IRT model for modeling responses with gradual effort decline, skipped items, and not-reached items. This article is organized as follows. First, existing approaches to addressing nonignorable missing responses and performance decline are briefly overviewed. Second, the IRTree-based approach (De Boeck & Partchev, 2012) is extended to simultaneously account for response data with performance decline and nonignorable omissions by developing a mixture IRTree model that combines a mixture IRT model for different response behavior classes and an IRTree model to describe aberrant response behavior. Then, a series of simulations is conducted to assess the efficiency of the proposed model with respect to the parameter recovery using Bayesian estimation. These simulations also demonstrate how treating missing data inappropriately affects item and person parameter estimation. Following the simulations, empirical data from the PISA 2015 reading assessment are presented to illustrate the applications and implications of the mixture IRTree model. Finally, this article closes by drawing conclusions for the new model and making suggestions for future research.

Existing Approaches to Addressing Omitted Responses

According to Rubin’s (1976) taxonomy, missing-data mechanisms derived from missing completely at random (MCAR) and missing at random (MAR) do not pose a problem if they are ignored. However, if the missing pattern is missing not at random (MNAR), for example, when systematic patterns of missingness are due to nuisance factors (e.g., time limits or test-takers’ motivation), ignoring nonignorable missing responses leads to biased parameter estimates (Little & Rubin, 1987). Therefore, approaches must be considered with respect to their appropriateness for nonignorable missing patterns. The literature has proposed at least three approaches for handling nonignorable missing data: substituting incorrect answers, ignoring nonresponses, and using model-based approaches. The first two approaches depend on specific assumptions that may not be met in real testing situations and are more likely to result in biased estimation and incorrect inferences (Holman & Glas, 2005; Köhler et al., 2017; Pohl et al., 2014). Furthermore, the processes underlying skipped items and not-reached items should be differentiated; therefore, the model-based approaches are thought of as a better way to deal with nonresponses than previous methods.

A variety of model-based approaches for nonignorable missing data have been proposed by relating test-takers’ proficiency—determined from observed data responses—with their missing-data patterns (e.g., Glas & Pimentel, 2008; Glas et al., 2015; Holman & Glas, 2005; Okumura, 2014; Pohl et al., 2014; Rose et al., 2010; Rose et al., 2017). Among these models, IRTree-based approaches are used and extended in this study because they not only provide separate latent trait models for different cognitively operational processes (i.e., the proficiency and omission processes) but also supply a framework for interpreting different omission processes (i.e., the skipping and drop-out processes).

The use of a tree structure to represent sequential response processes controlled by different latent traits and measurement models (e.g., unidimensional IRT models) for different subprocesses that lead to the observed responses is advantageous because it allows clear-cut interpretation of sequential cognitive operations as test-takers respond to test items (De Boeck & Partchev, 2012). This structure has been widely applied in the field of cognitive psychometrics. To account for nonignorable missing responses due to MNAR, Debeer et al. (2017) propose several plausible IRTree models to decompose the cognitive processes test-takers may exhibit when taking a low-stakes or speeded test. Among these hypothetical models, the item-selection model performs well in parameter estimation using simulated data, and its ability to handle the nonignorable MNAR effect is supported by empirical data analysis. Therefore, this study adopts the item-selection model to model omitted responses and extends it to account for test-takers’ performance decline as a result of effort reduction. Readers who are interested in the original tree-based and item-selection models for not-reached and skipped items as well as the structural representation for the proposed models can refer to the study of Debeer et al. (see Figure 1 in Debeer et al., 2017, p. 337).

Existing Approaches to Addressing Performance Decline

The other aberrant response behavior that occurs when test-takers perceive an item and attempt to answer it is performance decline, which is most salient near the end of a test (van Barneveld, 2007; Wise, 1996). For test-takers with low motivation, it is reasonable to assume that the effort put into answering items is not as high as that of their motivated counterparts. In the most extreme cases, examinees with low motivation are assumed to switch from thoughtfully seeking answers based on their ability to randomly guessing answers after responding to some items. Cao and Stokes (2008) propose the IRT threshold guessing model, which incorporates an item location parameter to specify a threshold individually for each examinee in a two-parameter logistic model (2PLM). In this model, classes with low motivation answer questions up to a certain item (i.e., the examinee-specific item location threshold) and guess the remainder of test items due to loss of motivation. However, the assumption that test-takers suddenly switch from the attentive stage to the random guessing stage may be too stringent to satisfy the practical testing demand because examinees with little or no motivation are more likely to expend decreasing effort as the test progresses and to have a decreased probability of correct responses over the course of the test. A gradual decrease in the probability of correct responses to test items is thus considered to provide a more realistic view of unmotivated-examinee response behavior, which is characterized by the IRT continuous guessing model (Cao & Stokes, 2008) and the mixture IRT models for performance decline (Jin & Wang, 2014) in the literature.

The aberrant response behavior found in low-stakes tests for unmotivated test-takers can also be observed on items toward the end of a speeded test due to limited test administration time. Ignoring the possible local item dependency due to this limited time can cause biased item parameter estimation and incorrect inferences about examinees’ abilities (Bolt et al., 2002; Douglas et al., 1998; Goegebeur et al., 2008; Oshima, 1994). Notably, in speeded tests, test-takers under time constraints are affected by the “speeded effect.” They accelerate in response to the time constraint, changing their performance on items (Evans & Reilly, 1972), and such speeded behavior is often observed in timed power or high-stakes tests (Jin & Wang, 2014). As in the IRT threshold guessing model for unmotivated-examinee behavior, the HYBRID model (Yamamoto & Everson, 1997) assumes that speeded test-takers switch from a problem-solving process (i.e., a 2PLM) to a random guessing process after the examinees pass their speededness points. This sudden switch to random guessing behavior expressed in the HYBRID model remains controversial. An alternative model using the mixture Rasch model to identify nonspeeded and speeded classes was developed to constrain the item difficulty parameters of end-of-test items for a speeded class to be higher than those for a nonspeeded class (Bolt et al., 2002). However, a prespecified item location threshold where all speeded examinees switch to the speeded process makes this approach impractical because different examinees in a speeded class may feel speeded pressure at different item locations.

Instead of specifying a fixed location threshold and assuming random guessing after responding to a certain number of items, Goegebeur et al. (2008) proposed a speeded IRT model with a gradual process change that incorporates an examinee-specific threshold parameter and an examinee-specific change rate parameter into the three-parameter logistic model (3PLM). In this modeling approach, examinees are assumed to answer items with full effort from the beginning (with the probability of correct response following the 3PLM), and once they feel that there is insufficient time to answer the remaining items (i.e., passing through the item location threshold), they reduce their response efforts according to different change rates and may become completely random guessers near the end of the test. Although the model defined by Goegebeur et al. serves as the most general model to control for speededness effects due to test time limits (Suh et al., 2012), its three types of random-effect parameters (i.e., ability, threshold, and change rate) are not linked linearly, making parameter estimation and model extension difficult in practical applications (Jin & Wang, 2014).

The literature has documented that aberrant response behavior occurs in examinees taking both low-stakes and speeded tests and has acknowledged a gradual change in the item-solution process, rather than a sudden switch from concentrated effort to guessing. However, in addition to reducing effort and randomly guessing when answering items, examinees may choose to omit some item responses entirely, and missing responses can be considered an indicator of speededness (Mroch & Bolt, 2006) or motivation loss (Cao & Stokes, 2008) in a test.

Suh et al. (2012) conducted a series of simulations to generate missing responses under the speeded IRT model of Goegebeur et al. (2008) and evaluated the effects on parameter estimation of different methods of scoring missing responses. Several limitations in that study deserve further attention. First, the speeded items (i.e., the items beyond a fixed threshold location) were assumed to be likely skipped; however, test-takers may experience different levels of speededness, and the chance of omitting items does not necessarily depend on the occurrence of performance decline. That is, test-takers may first determine whether to answer an item and then determine how much effort they want to expend if they decide to answer. Second, Suh et al. did not differentiate between skipped and not-reached responses. The literature has indicated that the two types of omitted responses involve different processes and are associated with various sources. Finally, the generated and fitting models in the simulations Suh et al. created were not consistent, so the relationship between the missing responses and the speeded behavior was not completely clear.

The New Model

In this section, the IRTree-based framework is employed to simultaneously include nonignorable missing data and describe test-takers’ gradual reduction in effort when responding to test items. Because omitting items and carelessly responding may coexist in test-taking (van Barneveld, 2007; Wise, 1996), the two types of behavior can be used as indicators of aberrant latent classes. It is hypothesized that test-takers work on items with full effort and do not provide any blank responses at the beginning of a test. If test-takers give their best performance throughout the test, they are considered to be part of the normal class. However, test-takers may not give their best effort to answer all items and are likely to exhibit some aberrant response behavior as the test progresses. Once test-takers lack motivation or feel time pressure after completing a certain test item, they arguably switch from normal response behavior to aberrant response behavior for the remaining test items. Therefore, a sequential process of aberrant response behavior can be assumed to govern test-takers’ responses to items beyond a certain item location, and three subprocesses (or internal nodes) can be used to represent the subsequences.

Figure 1 visualizes this study’s hypothesized mixture IRTree model that is used to represent the sequential choice process for modeling dropping out, skipping, and effort decline. First, a test-taker may either evaluate the stakes of failing the examination or consider whether he/she has sufficient motivation to take the test. If he/she decides to give his/her best performance throughout the test, the responses to all items are assumed to follow the 3PLM and not to exhibit any aberrant behavior. Otherwise, the test-taker begins to respond to items in the sequence of item order and may work on items early in the test with full effort, switching to aberrant responses for items near the end of the test. Suppose test-taker i decides to switch from normal response behavior to aberrant response behavior beyond item k. The responses to items $j = 1, \dots, k$ can be assumed to follow the regular 3PLM, and the remainder ( $j = k + 1, \dots, J$ ) can be represented by the three sequentially interconnected subprocesses within the IRTree framework. When test-taker i decides not to give his/her best effort to answering item $k + 1$ , the first choice (i.e., the first subprocess) involves the dropping-out process, in which the probability that test-taker i decides to give up answering any item after $k + 1$ can be modeled as follows:

P (x_{ij} = d | θ_{i}^{(D)}, β_{j}^{(D)}) = \frac{\exp (θ_{i}^{(D)} - β_{j}^{(D)})}{1 + \exp (θ_{i}^{(D)} - β_{j}^{(D)})},

(1)

where $x_{ij} = d$ indicates that item j is the first item test-taker i does not reach; $θ_{i}^{(D)}$ is the level of the latent trait representing test-taker i’s propensity to drop out; and $β_{j}^{(D)}$ is the dropping-out threshold parameter of item j. Because test-takers usually attempt the items early in the test, it is reasonable to assume that the probability of dropping out monotonically increases with item position. Hence, the dropping-out threshold parameter can be constrained to be a linear function of item position (Glas & Pimentel, 2008), and the function is given by

β_{j}^{(D)} \equiv β_{jk}^{(D)} = η_{0} + (k - K) η_{1},

(2)

where $β_{jk}^{(D)}$ is the dropping-out threshold parameter for item j in position k $(k = 1, 2, \dots, K)$ .
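As a minimal sketch of Equations 1 and 2, the dropping-out probability can be computed as below. The function names are hypothetical and chosen only for illustration; the values of $η_{0}$ and $η_{1}$ in the demonstration are the ones reported later in the simulation design.

```python
import math


def logistic(z: float) -> float:
    """Standard logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def dropout_threshold(k: int, K: int, eta0: float, eta1: float) -> float:
    """Equation 2: dropping-out threshold for an item in position k of K."""
    return eta0 + (k - K) * eta1


def p_dropout(theta_d: float, k: int, K: int, eta0: float, eta1: float) -> float:
    """Equation 1: probability that the item in position k is the first one not reached."""
    return logistic(theta_d - dropout_threshold(k, K, eta0, eta1))


if __name__ == "__main__":
    K = 20
    # A negative eta1 lowers the threshold toward the end of the test,
    # so the dropping-out probability grows with item position.
    for k in (1, 10, 20):
        print(k, p_dropout(0.0, k, K, eta0=2.0, eta1=-0.3))
```

With a negative $η_{1}$, the threshold is highest for the first position and equals $η_{0}$ at the last position, which gives the monotonic increase in the dropping-out probability that the text assumes.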

Figure 1.

Graphical representation of the mixture IRTree-based model for performance decline and nonignorable missing data.

If the test-taker does not completely forsake the test, the second process determines whether item $k + 1$ is skipped by test-taker i: the probability of skipping item $k + 1$ is given by

P (x_{ij} = s | y^{(D)}, θ_{i}^{(S)}, β_{j}^{(S)}) = [1 - P (x_{ij} = d | θ_{i}^{(D)}, β_{j}^{(D)})] \times \frac{\exp (θ_{i}^{(S)} - β_{j}^{(S)})}{1 + \exp (θ_{i}^{(S)} - β_{j}^{(S)})},

(3)

where $x_{ij} = s$ indicates that test-taker i has skipped item j; $θ_{i}^{(S)}$ is the level of the latent trait representing test-taker i’s propensity to skip; and $β_{j}^{(S)}$ is the skipping threshold parameter of item j.
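The two-step structure of Equation 3 (stay in the test, then skip) can be sketched as follows. The function name and argument order are assumptions made for illustration only.

```python
import math


def logistic(z: float) -> float:
    """Standard logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def p_skip(theta_d: float, beta_d: float, theta_s: float, beta_s: float) -> float:
    """Equation 3: P(no dropout at item j) times P(skip item j | no dropout)."""
    p_stay = 1.0 - logistic(theta_d - beta_d)  # complement of Equation 1
    p_skip_node = logistic(theta_s - beta_s)   # Rasch-type skipping node
    return p_stay * p_skip_node
```

When the dropping-out threshold is very high, the first factor approaches one and the skipping probability reduces to the logistic term in $θ_{i}^{(S)} - β_{j}^{(S)}$ alone.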

When test-taker i decides to respond, leaving no blank responses but exerting only partial effort on items due to personal or environmental factors (e.g., motivation and testing time), the third process of performance decline comes into effect. Consistent with previous findings (e.g., Cao & Stokes, 2008; Goegebeur et al., 2008), we assume that the probability of correct answers gradually decreases with regard to item location and that a linear decrement function of effort with respect to item location can be embedded in the 3PLM.

Because examinees can be classified into a normal response class or multiple aberrant response classes (depending on the switching points), J latent classes should be identified in the mixture IRTree model. Test-takers who always maintain full effort until the last item, J, are classified into the normal class, and others who switch from normal to aberrant behavior at different item locations are classified into one of $J - 1$ aberrant classes. Let $ξ_{i}$ be the effort-switching threshold for test-taker i ( $ξ_{i} = 1, \dots, J - 1$ ) and $δ$ be the positive decrement parameter. When $j \leq ξ_{i}$ , the probability of success on item j for test-taker i follows the 3PLM, and when $j > ξ_{i}$ —given the missingness mechanism—the probability of a correct response to item j for the test-taker can be defined as

\begin{matrix} P (x_{ij} = 1 | y^{(D)}, y^{(S)}, θ_{i}^{(P)}, β_{j}^{(P)}, α_{j}, π_{j}, ξ_{i}) = \\ [1 - P (x_{ij} = d | θ_{i}^{(D)}, β_{j}^{(D)})] \times [1 - P (x_{ij} = s | θ_{i}^{(S)}, β_{j}^{(S)})] \times \\ \left\{ π_{j} + (1 - π_{j}) \times \frac{\exp [α_{j} (θ_{i}^{(P)} - β_{j}^{(P)} - δ (j - ξ_{i}))]}{1 + \exp [α_{j} (θ_{i}^{(P)} - β_{j}^{(P)} - δ (j - ξ_{i}))]} \right\}, \end{matrix}

(4)

and the probability of an incorrect response to the same item is

P (x_{ij} = 0 | y^{(D)}, y^{(S)}, θ_{i}^{(P)}, β_{j}^{(P)}, α_{j}, π_{j}, ξ_{i}) = 1 - P (x_{ij} = 1 | y^{(D)}, y^{(S)}, θ_{i}^{(P)}, β_{j}^{(P)}, α_{j}, π_{j}, ξ_{i}),

(5)

where $θ_{i}^{(P)}$ is the proficiency of test-taker i’s substantive knowledge that a test intends to measure; $β_{j}^{(P)}$ is the item threshold for item j with regard to the problem-solving process (i.e., item difficulty); $α_{j}$ is the discrimination parameter of item j with regard to $θ^{(P)}$ ; $π_{j}$ is the pseudo-guessing parameter for item j; and the other variables are defined the same as above. The later the item location is, the greater the decline in the test-takers’ efforts. Let g denote the latent classes to which test-takers belong and $g \in (ξ_{i}, J)$ . The joint probability of the mixture IRTree model for response vector x can be given by

P (x) = \sum_{g = 1}^{J} c_{g} P (x | y^{(D)}, y^{(S)}, y^{(P)}, g),

(6)

where $c_{g}$ is the mixture proportion for latent class g and $P (x | y^{(D)}, y^{(S)}, y^{(P)}, g)$ is the conditional probability defined above, depending on normal and aberrant response behaviors. Whether test-takers are normal or aberrant, a common set of 3PLM item parameters is assumed across all latent classes so that the intended-to-be-measured latent trait $θ^{(P)}$ is directly comparable among examinees.
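To make the sequential structure of Equations 1 through 6 concrete, the sketch below evaluates the likelihood of one response vector given a test-taker's three latent traits. It rests on several assumptions that are not stated in the text: responses are coded 1, 0, "s" (skipped), and "d" (first not-reached item); the dictionary keys are hypothetical; the bracketed dropping-out and skipping terms are read as node-level logistic probabilities applied in sequence; and a response vector with missing values has zero likelihood wherever the test-taker is still in normal mode.

```python
import math
from typing import Dict, Sequence, Tuple, Union

Response = Union[int, str]  # 1, 0, "s" (skipped) or "d" (dropped out)


def logistic(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def p_correct(theta_p: float, item: Dict[str, float], decline: float = 0.0) -> float:
    """3PLM success probability; decline = delta * (j - xi), zero before the switch."""
    z = item["alpha"] * (theta_p - item["beta_p"] - decline)
    return item["pi"] + (1.0 - item["pi"]) * logistic(z)


def class_likelihood(x: Sequence[Response], xi: int,
                     theta: Tuple[float, float, float],
                     items: Sequence[Dict[str, float]], delta: float) -> float:
    """P(x | traits, switching point xi); xi = J corresponds to the normal class."""
    theta_d, theta_s, theta_p = theta
    like = 1.0
    for j, (resp, item) in enumerate(zip(x, items), start=1):
        if j <= xi:
            if resp not in (0, 1):  # assumption: no missing data in normal mode
                return 0.0
            p1 = p_correct(theta_p, item)
            like *= p1 if resp == 1 else 1.0 - p1
            continue
        p_d = logistic(theta_d - item["beta_d"])  # node 1: drop out
        if resp == "d":
            return like * p_d  # later items are not reached
        like *= 1.0 - p_d
        p_s = logistic(theta_s - item["beta_s"])  # node 2: skip
        if resp == "s":
            like *= p_s
            continue
        like *= 1.0 - p_s
        p1 = p_correct(theta_p, item, delta * (j - xi))  # node 3: declining effort
        like *= p1 if resp == 1 else 1.0 - p1
    return like


def mixture_likelihood(x: Sequence[Response], theta: Tuple[float, float, float],
                       items: Sequence[Dict[str, float]], delta: float,
                       c: Sequence[float]) -> float:
    """Equation 6: sum over classes g = 1, ..., J of c_g * P(x | g)."""
    J = len(items)
    return sum(c[g - 1] * class_likelihood(x, g, theta, items, delta)
               for g in range(1, J + 1))
```

Under this reading, the probabilities of all admissible response sequences within a class sum to one, which is a useful sanity check on any implementation of the tree.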

The mixture proportion vector $c = (c_{1}, c_{2}, \dots, c_{J})$ should be specified for the distribution of group membership, and a Dirichlet distribution with J hyperparameters can be used as the prior of the discrete distribution c when Bayesian estimation is implemented. However, the computational burden increases as the test lengthens because as many as J groups must be identified. Alternatively, a more efficient approach of using a smoothly increasing curve rather than a discrete distribution was proposed by Cao and Stokes (2008) for the prior of group membership. For the normal response class (i.e., $g = J$ ), the probability that a test-taker always gives his/her best performance throughout the test is $c_{J}$ , which can be assumed to follow a beta distribution as follows:

c_{J} \sim \mathrm{Beta} (b_{1}, b_{2}).

(7)

The other probabilities used to represent a test-taker switching to aberrant response behavior at different item locations are denoted as $c_{1}$ to $c_{J - 1}$ , whose probability functions are given by

c_{j} = \frac{j^{ω} - {(j - 1)}^{ω}}{{(J - 1)}^{ω}} (1 - c_{J}), j = 1, \dots, J - 1,

(8)

where $ω$ is a positive parameter that determines the shape of the smoothly increasing curve and is assumed to follow a gamma distribution. The probability function is concavely increasing when $ω < 1$ , linearly increasing when $ω = 1$ , and convexly increasing when $ω > 1$ .
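Equation 8 can be checked numerically with a short sketch such as the following. The function name is hypothetical, and $c_{J}$ is passed in as a fixed value rather than drawn from its beta prior.

```python
from typing import List


def class_proportions(J: int, omega: float, c_J: float) -> List[float]:
    """Return [c_1, ..., c_{J-1}, c_J], with c_1..c_{J-1} from Equation 8."""
    if J < 2:
        raise ValueError("J must be at least 2")
    if omega <= 0:
        raise ValueError("omega must be positive")
    if not 0.0 <= c_J <= 1.0:
        raise ValueError("c_J must lie in [0, 1]")
    denom = (J - 1) ** omega
    aberrant = [
        (j ** omega - (j - 1) ** omega) / denom * (1.0 - c_J)
        for j in range(1, J)
    ]
    return aberrant + [c_J]


if __name__ == "__main__":
    c = class_proportions(J=20, omega=2.0, c_J=0.40)
    print(sum(c))   # the terms telescope, so the proportions sum to 1
    print(c[:3])    # mass placed on the earliest switching points
```

Because the numerators telescope to ${(J - 1)}^{ω}$, the aberrant proportions always sum to $1 - c_{J}$. The increasing curve referred to in the text is the cumulative one, ${(j / (J - 1))}^{ω}$; the per-item proportions $c_{j}$ themselves grow with j when $ω > 1$ and are constant when $ω = 1$.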

As noted by Debeer et al. (2017), an IRTree-based model not only is a purely mathematical formulation for response probabilities of outcome variables but also is capable of representing a “belief” that researchers have about the cognitive processes that underlie an item response. Therefore, the proposed mixture IRTree model in this study may not be universally appropriate for all situations, but we believe it is theoretically more appealing than other possible sequential operational processes. Even so, the IRTree approach is flexible in the sense that where appropriate, researchers can develop customized IRTrees based on substantive knowledge, and the proposed mixture IRTree model for missing data and performance decline can be easily revised to satisfy practical testing demands (Jeon & De Boeck, 2016).

A consistent item order, as is common in linear tests, is assumed for all test-takers in this study. In practice, test-takers may not respond to items in the administered order and, unless the test is computerized, may go back to review and change their answers at any time. We nevertheless adopt this assumption for several reasons. First, although test-takers can decide the order in which they respond and the item order may therefore differ between test-takers, item-position effects can still be induced, and item characteristics can shift depending on the placement of items and the item order administered to test-takers (Debeer & Janssen, 2013). Moreover, item-position effects can be moderated by test-takers’ effort, motivation, and value attributed to the test (Qian, 2014; Weirich et al., 2017). The following simulations include rotated block and random ordering designs to demonstrate the impacts of position effects on parameter estimation when test-takers are allowed to respond to items in a different order. Second, although item review and answer changes are commonly observed in linear testing designs, test-takers with low motivation may not use the item-review strategy to improve their scores due to the nature of low-stakes tests, and a highly anxious test-taker may not have enough time to change his/her answers due to time pressure in a speeded test. To maintain the scope of this study, the phenomenon of item review is not considered for modeling aberrant response behavior. In addition, the 2015 PISA reading assessment used for empirical demonstration in the following section was administered in computer-based testing situations. Fitting the proposed mixture IRTree model to these data is therefore justified because both item ordering and item review are restricted in computerized testing.

Method

Simulation Design

A series of simulations with several manipulated factors was conducted to assess the efficiency of the mixture IRTree model for nonignorable missing responses and performance decline. The data were generated according to the proposed mixture IRTree model, with an MNAR missing mechanism and the assumption that some test-takers exert less effort on items as the test proceeds. For the first simulation study, three major independent variables were manipulated: (a) sample size (1,000 and 2,000 examinees), (b) test length (20 and 40 items), and (c) item ordering (one item order and multiple item orders; see below for more detail). The mixing proportion was set to 40% ($c_{J} = 0.40$) for the normal response class and 60% for the aberrant response classes. The effort-switching threshold parameters ($ξ_{i}$) were simulated with probability $c_{j}$ from Equation 8, where $ω$ was fixed at 2 to produce a convexly increasing probability function (Jin & Wang, 2014). The three latent traits $θ_{i} = (θ_{i}^{(D)}, θ_{i}^{(S)}, θ_{i}^{(P)})$ for test-takers were sampled from a multivariate normal distribution with a zero mean vector and variance–covariance matrix $\Sigma_{θ}$, where the variances of $\Sigma_{θ}$ were set to one and the intercorrelations were all set to −.50, consistent with the design of Debeer et al. (2017). The dropping-out threshold parameter was represented by a linear function, as shown in Equation 2, and $η_{0}$ and $η_{1}$ were set to 2.00 and −.30, respectively, which produced a moderate proportion of dropping out among examinees and was consistent with practical situations (Debeer et al., 2017). Because whether an item is skipped is closely related to the difficulty of that item, the skipping threshold parameter $β_{j}^{(S)}$ and the item difficulty parameter $β_{j}^{(P)}$ were assumed to follow a bivariate normal distribution with means equal to 0, variances equal to 1, and a covariance equal to 0.50, in accordance with findings from large-scale assessments (Pohl et al., 2012; Rose et al., 2010) and with a previous simulation design (Köhler et al., 2017). As a result, approximately 10% of the items were skipped and 3% were not reached in each set of simulated data. These proportions are reasonably close to the missing percentages in large-scale assessments (e.g., OECD, 2009).
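The joint generation of skipping thresholds and item difficulties described above can be sketched as follows. This is only an illustrative sketch in Python with NumPy, not the generating code used in the study; the function name `sample_item_thresholds` and the seed handling are assumptions introduced here.

```python
import numpy as np

def sample_item_thresholds(n_items, cov=0.50, seed=0):
    """Draw (beta_S, beta_P) pairs from a bivariate normal distribution
    with zero means, unit variances, and the given covariance."""
    rng = np.random.default_rng(seed)
    mean = np.zeros(2)
    sigma = np.array([[1.0, cov],
                      [cov, 1.0]])
    draws = rng.multivariate_normal(mean, sigma, size=n_items)
    beta_s = draws[:, 0]  # skipping thresholds
    beta_p = draws[:, 1]  # item difficulties
    return beta_s, beta_p

# Example: thresholds for a 40-item test
beta_s, beta_p = sample_item_thresholds(n_items=40)
```

Because the variances equal 1, the covariance of 0.50 is also the correlation between the two sets of item parameters.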

When examinees entered the problem-solving process, the 3PLM was used to generate the responses to test items. For the normal response class, the item discrimination parameters were randomly sampled from a uniform distribution between 0.50 and 1.50; the pseudo-guessing parameters were set to 0.20 for all items; and the item difficulty parameters were generated from a joint distribution with the skipping threshold parameters described above. Note that a common pseudo-guessing parameter was estimated across all items because this parameter is too uncertain to estimate precisely, and such a constraint is not uncommon in real testing situations (van der Linden et al., 2010). For the aberrant response classes, the generated parameters relative to the 3PLM were set to the same values as in the normal response class, and decrement parameter $δ$ was fixed to 0.10. The specifications of the parameters for the response functions were consistent with those commonly found in practice and used in previous studies (e.g., Huang, 2017; Jin & Wang, 2014).
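As a rough illustration of the response generation for the normal response class, the sketch below uses the standard 3PLM form $P = g + (1 - g) / (1 + \exp[-α(θ - β)])$. The exact parameterization used in the study (for example, whether a scaling constant is included) is an assumption here, and the effort decrement $δ$ for the aberrant classes is deliberately left out; the function names are hypothetical.

```python
import numpy as np

def p_correct_3pl(theta, alpha, beta, g=0.20):
    """Standard 3PLM probability of a correct response (assumed form)."""
    return g + (1.0 - g) / (1.0 + np.exp(-alpha * (theta - beta)))

def simulate_responses(theta, alpha, beta, g=0.20, seed=0):
    """Simulate a persons-by-items matrix of 0/1 responses under the 3PLM."""
    rng = np.random.default_rng(seed)
    # Broadcast persons (rows) against items (columns)
    p = p_correct_3pl(theta[:, None], alpha[None, :], beta[None, :], g)
    return (rng.random(p.shape) < p).astype(int)

# Example with the generating values described in the text
rng = np.random.default_rng(1)
alpha = rng.uniform(0.50, 1.50, size=20)   # discriminations
beta = rng.normal(0.0, 1.0, size=20)       # difficulties
theta = rng.normal(0.0, 1.0, size=1000)    # proficiencies
responses = simulate_responses(theta, alpha, beta, g=0.20)
```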

To meet the demands of practical testing situations, such as the large-scale assessments mentioned above, two types of item orders were considered in the simulation design. When one item order was used, all test-takers’ items were ordered in the same way. On the other hand, when multiple item orders were used, four equally sized blocks were rotated to produce four item orders (i.e., four booklets), and each booklet was randomly administered to 250 and 500 examinees in the scenarios with sample sizes of 1,000 and 2,000, respectively. We referred to the rotated block design adopted by Debeer et al. (2017), and the resulting four item orders are listed in Table 1. The item order within each block was fixed: only the order of the item blocks themselves was changed.

Table 1.

The Order of Item Blocks for the Four Booklets in the Simulation Study.

Booklet	Item Block
A	Block 1	Block 2	Block 3	Block 4
B	Block 2	Block 4	Block 1	Block 3
C	Block 3	Block 1	Block 4	Block 2
D	Block 4	Block 3	Block 2	Block 1

Note. The generated item parameters were randomly assigned to the four item blocks.
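The rotation in Table 1 can be expressed compactly in code. The sketch below assumes, purely for illustration, that block k holds a contiguous run of item indices; the helper name `booklet_item_order` is hypothetical.

```python
# Block orders for the four booklets, transcribed from Table 1
BOOKLET_BLOCKS = {
    "A": [1, 2, 3, 4],
    "B": [2, 4, 1, 3],
    "C": [3, 1, 4, 2],
    "D": [4, 3, 2, 1],
}

def booklet_item_order(block_order, n_items):
    """Return item indices (0-based) in administration order.

    Assumes four equally sized blocks, with block k holding items
    (k - 1) * size ... k * size - 1 in a fixed within-block order.
    """
    block_size = n_items // 4
    order = []
    for block in block_order:
        start = (block - 1) * block_size
        order.extend(range(start, start + block_size))
    return order

# Item orders for a 20-item test
orders = {name: booklet_item_order(blocks, 20)
          for name, blocks in BOOKLET_BLOCKS.items()}
```

In Table 1, each block appears exactly once in each of the four positions across booklets, so every block is administered early, in the middle, and late for some group of examinees.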

Although it is evident that the proportion of aberrant response behavior is not trivial in low-stakes or timed power tests (e.g., Cao & Stokes, 2008), a large proportion of aberrant response behavior among test-takers may not be realistic in practical settings. Therefore, a second simulation study was conducted in which the mixing proportion of the normal response class was increased to 80% ($c_{J} = 0.80$) and 90% ($c_{J} = 0.90$) in the fixed item order design. Because test-takers can decide the order of their answers and may not follow the item order in a linear test, a third simulation study was designed in which a random item order was generated for each examinee, such that all test-takers responded to test items in different orders. The mixing proportion of the normal response class in the third simulation study was set to the same value as in the first simulation study. The sample size and test length were fixed at 2,000 examinees and 40 items, respectively, for the second and third simulation studies, and all other generating parameters were set to the same values as those used in the first simulation study. Each condition was replicated 30 times, which appeared to be sufficient because smaller sampling variation across replications was observed when the number of replications exceeded 30.
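The random item order design of the third simulation study amounts to drawing an independent permutation of the items for each examinee. A minimal sketch, with a hypothetical function name and seed, is:

```python
import numpy as np

def random_item_orders(n_persons, n_items, seed=0):
    """Return an (n_persons, n_items) array in which row i is the
    order in which examinee i responds to the items (0-based)."""
    rng = np.random.default_rng(seed)
    return np.array([rng.permutation(n_items) for _ in range(n_persons)])

# Third simulation study: 2,000 examinees and 40 items
item_orders = random_item_orders(n_persons=2000, n_items=40)
```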

Analysis

Bayesian estimation with the Markov chain Monte Carlo (MCMC) method was used to calibrate the model parameters in the freeware WinBUGS (Spiegelhalter et al., 2003). Priors for the model parameters had to be specified to produce the joint posterior distribution of the parameters. A normal prior distribution, with a mean of 0 and a variance of 4, was used for the item difficulty, skipping threshold, and $η_{0}$ and $η_{1}$ parameters. A lognormal prior distribution, with a mean of 0 and a variance of 1, was used for the item discrimination parameters. For model identification purposes, the discrimination parameter of one item (e.g., the first item) was fixed to the generated value: this study arbitrarily set the value to one for the first item’s discriminating power. A beta prior with both hyperparameters equal to 1 was set for the common pseudo-guessing parameter. For the person parameters, the hyperparameters of the beta prior (see Equation 7) representing the proportion in the normal response class were both set to 1 to provide vague information. The $ω$ parameter had a gamma prior distribution with hyperparameters equal to 2 and 1, respectively. Decrement parameter $δ$ followed a gamma prior with hyperparameters equal to 1 and 5. Finally, the prior for the inverse of the variance–covariance matrix $\Sigma_{θ}$ was set to follow a Wishart distribution with a diagonal matrix equal to 0.1 and three degrees of freedom (i.e., the number of latent traits). The prior distributions for the model parameters set above were consistent with, or similar to, those in previous studies using Bayesian estimation to calibrate the parameters in mixture IRT models (e.g., Cao & Stokes, 2008; Cho & Cohen, 2010; Cohen & Bolt, 2005; Huang, 2016, 2017; Jin & Wang, 2014).
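To make the prior specification concrete, the sketch below draws one set of values from the scalar and item-level priors using NumPy. It is not the WinBUGS code used in the study. Two readings are assumed here: the lognormal mean and variance refer to the log scale, and the two gamma hyperparameters are shape and rate (the BUGS convention). The Wishart prior is omitted, and the function name `draw_priors` is hypothetical.

```python
import numpy as np

def draw_priors(n_items, seed=0):
    """Draw one value per parameter from the priors described in the text."""
    rng = np.random.default_rng(seed)
    prior_sd = np.sqrt(4.0)  # normal prior variance of 4 -> SD of 2
    return {
        "beta_p": rng.normal(0.0, prior_sd, size=n_items),   # difficulties
        "beta_s": rng.normal(0.0, prior_sd, size=n_items),   # skipping thresholds
        "eta": rng.normal(0.0, prior_sd, size=2),            # eta_0, eta_1
        "alpha": rng.lognormal(mean=0.0, sigma=1.0, size=n_items),
        "g": rng.beta(1.0, 1.0),                             # common pseudo-guessing
        "c_normal": rng.beta(1.0, 1.0),                      # normal-class proportion
        # NumPy's gamma takes shape and scale, so scale = 1 / rate (assumed reading)
        "omega": rng.gamma(shape=2.0, scale=1.0 / 1.0),
        "delta": rng.gamma(shape=1.0, scale=1.0 / 5.0),
    }

# BUGS parameterizes the normal distribution by precision (1 / variance),
# so a variance of 4 corresponds to a precision of 0.25.
bugs_precision = 1.0 / 4.0

priors = draw_priors(n_items=40)
```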

After convergence was screened using the multivariate potential scale reduction factor (Brooks & Gelman, 1998) with three parallel chains for several simulated data sets across simulation conditions, it was determined that 15,000 iterations, with the first 5,000 iterations treated as the burn-in period, were sufficient to provide stable parameter estimates. After the burn-in period, no label switching was observed within a single MCMC chain or between multiple MCMC chains. The same prior settings and iteration numbers used in the simulation study were applied to the following empirical data analysis. Although not presented, the WinBUGS commands for the proposed mixture IRTree model are available on request. The bias and root mean square error (RMSE) were computed to assess the recovery of the structural parameters, and the RMSE of the person parameter estimates was used to evaluate the person parameter recovery. It was expected that the parameters in the mixture IRTree model could be recovered satisfactorily, that large samples and long tests would increase the estimation precision, and that mistakenly treating MNAR data as ignorable or as incorrect responses would result in biased estimation.
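The two recovery criteria can be written out explicitly. For a parameter with true value $ζ$ and estimates $\hat{ζ}_{r}$ over $R$ replications, bias is the mean of $\hat{ζ}_{r} - ζ$ and RMSE is the square root of the mean of $(\hat{ζ}_{r} - ζ)^{2}$. The sketch below assumes this standard definition; the function name and the example estimates are made up for illustration and are not results from the study.

```python
import numpy as np

def bias_and_rmse(estimates, true_value):
    """Bias and RMSE of estimates across replications.

    `estimates` has replications along the first axis; `true_value`
    is a scalar or an array matching the remaining axes.
    """
    errors = np.asarray(estimates, dtype=float) - true_value
    bias = errors.mean(axis=0)
    rmse = np.sqrt((errors ** 2).mean(axis=0))
    return bias, rmse

# Illustrative only: made-up estimates of delta (true value 0.10)
fake_delta_estimates = [0.09, 0.12, 0.11, 0.10, 0.13]
bias, rmse = bias_and_rmse(fake_delta_estimates, 0.10)
```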

Results

Tables 2 and 3 summarize the bias and RMSE used to assess the quality of model parameter estimation when the item order was either fixed or rotated. Because numerous item parameters were estimated, only the mean and standard deviation of the bias and RMSE across items are reported for the item parameters (i.e., $α$, $β^{(P)}$, and $β^{(S)}$) to save space. Because the two item order designs had similar result patterns, the results are discussed jointly rather than separately. With regard to the bias, except for the variances of $θ^{(D)}$ and $θ^{(P)}$, the bias values were small. The variances of $θ^{(D)}$ and $θ^{(P)}$ were overestimated for most replications in each condition, but the bias was considerably mitigated as the test length increased. No systematic patterns were observed for the RMSE as the test length increased from 20 to 40 items for most estimators, with the exception of the estimates of the variance–covariance matrix, where a notable decrease in RMSE was observed for the variance and covariance parameter estimates. A larger sample size resulted in a smaller RMSE and provided more acceptable parameter recovery; a few exceptions were observed, but their magnitudes were small. In summary, the structural parameters in the proposed mixture IRTree model could be recovered satisfactorily with a larger sample and a longer test using Bayesian estimation, regardless of the item order.

Table 2.

Statistical Summary of Parameter Recovery with Fixed Item Order Design.

Sample size	1,000				2,000
Test length	20		40		20		40
Criterion	Bias	RMSE	Bias	RMSE	Bias	RMSE	Bias	RMSE
Parameter
$α$
M	−0.115	0.225	−0.007	0.251	−0.048	0.153	−0.019	0.169
SD	0.072	0.077	0.258	0.271	0.035	0.043	0.143	0.172
$β^{(P)}$
M	0.092	0.327	0.068	0.305	0.035	0.198	−0.004	0.202
SD	0.229	0.123	0.207	0.138	0.116	0.074	0.122	0.124
$β^{(S)}$
M	0.016	0.347	−0.017	0.352	0.021	0.263	−0.017	0.275
SD	0.132	0.262	0.232	0.311	0.088	0.245	0.189	0.269
g	0.023	0.031	0.012	0.016	0.012	0.024	0.001	0.008
$η_{0}$	−0.024	0.103	0.019	0.088	−0.002	0.091	−0.006	0.081
$η_{1}$	−0.018	0.043	−0.004	0.020	−0.011	0.033	−0.004	0.012
$c_{J}$	0.024	0.032	−0.004	0.017	0.005	0.014	0.002	0.007
$ω$	0.002	0.099	0.035	0.086	0.002	0.064	0.053	0.094
$δ$	0.005	0.018	0.014	0.017	0.011	0.019	0.009	0.017
Variance
$θ^{(D)}$	0.284	0.674	0.024	0.302	0.276	0.619	−0.022	0.263
$θ^{(S)}$	0.024	0.208	0.014	0.138	0.004	0.117	0.024	0.141
$θ^{(P)}$	0.560	0.643	0.315	0.359	0.231	0.320	0.154	0.253
Covariance
( $θ^{(D)}$ , $θ^{(S)}$ )	−0.031	0.218	0.058	0.128	−0.022	0.152	0.034	0.121
( $θ^{(D)}$ , $θ^{(P)}$ )	−0.123	0.195	−0.096	0.157	−0.057	0.135	0.004	0.089
( $θ^{(S)}$ , $θ^{(P)}$ )	−0.099	0.175	−0.056	0.112	−0.062	0.109	−0.035	0.084

Note. $α$ = discrimination; $β^{(P)}$ = difficulty; $β^{(S)}$ = skipping threshold; g = pseudo-guessing; $η_{0}$ and $η_{1}$ = dropping-out threshold parameters; $c_{J}$ = mixing proportion for the normal response class; $ω$ = increasing curve; $δ$ = effort decrement; $θ^{(D)}$ = dropping-out propensity; $θ^{(S)}$ = skipping propensity; $θ^{(P)}$ = proficiency; RMSE = root mean square error.

Table 3.

Statistical Summary of Parameter Recovery with Rotated Block Design.

Sample size	1,000				2,000
Test length	20		40		20		40
Criterion	Bias	RMSE	Bias	RMSE	Bias	RMSE	Bias	RMSE
Parameter
$α$
M	−0.127	0.227	0.000	0.246	−0.089	0.167	−0.017	0.189
SD	0.065	0.069	0.269	0.260	0.036	0.047	0.159	0.221
$β^{(P)}$
M	0.091	0.335	0.064	0.309	0.037	0.223	0.064	0.218
SD	0.258	0.144	0.203	0.154	0.157	0.092	0.148	0.120
$β^{(S)}$
M	−0.024	0.203	0.003	0.189	−0.003	0.154	−0.017	0.127
SD	0.042	0.028	0.047	0.037	0.019	0.028	0.045	0.031
g	0.024	0.034	0.011	0.015	0.013	0.023	0.009	0.015
$η_{0}$	−0.025	0.123	0.024	0.083	0.009	0.076	0.056	0.092
$η_{1}$	−0.016	0.043	−0.009	0.029	−0.014	0.031	−0.012	0.015
$c_{J}$	0.003	0.018	0.000	0.018	0.001	0.010	0.002	0.013
$ω$	0.010	0.088	0.010	0.089	0.021	0.069	0.019	0.051
$δ$	0.025	0.033	0.016	0.020	0.018	0.025	0.011	0.016
Variance
$θ^{(D)}$	0.098	0.665	0.071	0.366	0.284	0.560	0.368	0.365
$θ^{(S)}$	0.038	0.160	−0.012	0.131	0.033	0.148	0.007	0.079
$θ^{(P)}$	0.588	0.698	0.287	0.338	0.345	0.413	0.188	0.308
Covariance
( $θ^{(D)}$ , $θ^{(S)}$ )	0.059	0.211	0.059	0.172	−0.037	0.132	−0.036	0.123
( $θ^{(D)}$ , $θ^{(P)}$ )	−0.100	0.186	−0.077	0.133	−0.083	0.138	−0.090	0.091
( $θ^{(S)}$ , $θ^{(P)}$ )	−0.131	0.193	−0.051	0.105	−0.080	0.118	−0.063	0.101

Note. $α$ = discrimination; $β^{(P)}$ = difficulty; $β^{(S)}$ = skipping threshold; g = pseudo-guessing; $η_{0}$ and $η_{1}$ = dropping-out threshold parameters; $c_{J}$ = mixing proportion for the normal response class; $ω$ = increasing curve; $δ$ = effort decrement; $θ^{(D)}$ = dropping-out propensity; $θ^{(S)}$ = skipping propensity; $θ^{(P)}$ = proficiency; RMSE = root mean square error.

The three person parameters representing the dropping-out propensity, skipping propensity, and substantive proficiency were also estimated, and the mean RMSE values for the three estimates across simulation replications are presented in Table 4. A longer test was associated with more precise estimation of the person parameters, whereas sample size had a trivial impact on the person parameter estimation. In addition, because the percentage of missing responses was not substantial in the simulation design, the dropping-out and skipping propensities were, as expected, not estimated as precisely as the target latent trait (i.e., proficiency). Finally, no difference was observed in the person parameter recovery between the fixed and rotated item order designs.

Table 4.

Mean RMSE of the Person Parameter Estimates Across Replications.

Item order	Fixed				Rotated
Sample size	1,000		2,000		1,000		2,000
Test length	20	40	20	40	20	40	20	40
Parameter
$θ^{(D)}$	0.863	0.797	0.854	0.797	0.852	0.794	0.849	0.800
$θ^{(S)}$	0.825	0.756	0.821	0.758	0.819	0.756	0.820	0.763
$θ^{(P)}$	0.623	0.513	0.596	0.510	0.637	0.513	0.609	0.505

Note. $θ^{(D)}$ = dropping-out propensity; $θ^{(S)}$ = skipping propensity; $θ^{(P)}$ = proficiency; RMSE = root mean square error.

Treating missing data as ignorable or as incorrect responses is common in practical testing analyses and large-scale assessments. In addition to item omission, test-takers who exhibit low motivation or fatigue are often assumed to exert as much effort on responding to items as their full-effort counterparts in real-life data analysis. To investigate the consequences of ignoring MNAR patterns, mistakenly imputing incorrect answers for MNAR data, and treating all test-takers as full-effort ones, the traditional 3PLM was fit to the simulated data, with missing responses either regarded as MAR or replaced by incorrect responses. The condition with a long test (i.e., 40 items) and a large sample (i.e., 2,000 persons) was used to compare the different approaches to addressing missing data. As shown in Table 5, for most estimators, both the structural parameters and the person proficiency parameters were recovered more poorly than under the data-generating model (see Tables 2 and 3). Furthermore, the parameter estimation was worse when the missing responses were replaced by incorrect answers than when the missingness was treated as MAR. In addition, the fixed item order design resulted in more biased parameter estimation than did the rotated block design, especially for the incorrect answer substitution approach. Although not shown, the same conclusions could be drawn when the comparison was conducted in the small-sample and short-test conditions.

Table 5.

Statistical Summary of Parameter Recovery When Two Alternative Models Were Fit to the Simulated Data Under the 3PLM.

Approach	Ignoring missing responses				Substituting incorrect answer
Item order	Fixed		Rotated		Fixed		Rotated
Criterion	Bias	RMSE	Bias	RMSE	Bias	RMSE	Bias	RMSE
Parameter
$α$
Mean	0.038	0.174	0.046	0.171	−0.166	0.372	−0.091	0.185
SD	0.147	0.233	0.206	0.235	0.401	0.232	0.196	0.142
$β^{(P)}$
Mean	0.120	0.221	0.129	0.208	−0.063	0.646	0.120	0.269
SD	0.154	0.129	0.067	0.084	0.769	0.459	0.276	0.233
g	0.005	0.013	0.010	0.012	−0.197	0.197	−0.123	0.124
Person
$θ^{(P)}$	N/A	0.547	N/A	0.544	N/A	0.803	N/A	0.702

Note. $α$ = discrimination; $β^{(P)}$ = difficulty; g = pseudo-guessing; $θ^{(P)}$ = proficiency; RMSE = root mean square error. The proficiency parameter recovery was evaluated by RMSE; N/A = not applicable.

Note that the difference in the RMSE values of the proficiency estimates between the two approaches treating missingness as MNAR and as MAR was small, such that the missing data appeared to be ignorable. This finding was consistent with previous studies because our simulation produced only a mild amount of nonignorable missing responses (Köhler et al., 2017; Pohl et al., 2012; Rose et al., 2010). As the number of nonignorable missing values becomes large, traditional IRT models can reasonably be expected to fit the data poorly because of biased parameter estimation (Glas & Pimentel, 2008; Holman & Glas, 2005). A small difference in ability estimates between different approaches may have a significant impact on scoring inferences about test-takers (e.g., Huang, 2014), and the assumption of ignorability should be evaluated regardless of the amount of item nonresponse (Rose et al., 2017); thus, the proposed mixture IRTree model is recommended for obtaining more precise proficiency estimates even when the proportion of missing responses is not substantial.

Table 6 summarizes the parameter recovery for the second and third simulation studies with a sample size of 2,000 and a test length of 40 items. When the proportion of the normal response class increased to 80% and 90%, the parameters relative to the 3PLM were estimated more precisely than under the smaller proportion (i.e., 40%) used in the first simulation study. On the other hand, because the number of test-takers with aberrant response behavior decreased and the information associated with aberrant responses was not sufficient to provide precise estimation, the structural parameter estimates in the dropping-out and skipping subprocesses deteriorated, as expected, compared with the first simulation study. The same findings applied to the person parameter recovery, as evidenced by the mean RMSE values of 0.896, 0.868, and 0.488 for the dropping-out propensity, skipping propensity, and substantive proficiency, respectively, in the 80% normal response class condition and 0.925, 0.891, and 0.486 for the three respective person parameters in the 90% normal response class condition.

Table 6.

Statistical Summary of Parameter Recovery for the Second and Third Simulation Studies.

Simulation	Second				Third
	Normal class proportion				Random item order
	80%		90%
Criterion	Bias	RMSE	Bias	RMSE	Bias	RMSE
Parameter
$α$
Mean	−0.016	0.166	−0.028	0.152	−0.046	0.183
SD	0.127	0.136	0.107	0.119	0.158	0.162
$β^{(P)}$
Mean	0.022	0.207	0.025	0.203	0.052	0.242
SD	0.131	0.128	0.129	0.112	0.176	0.130
$β^{(S)}$
Mean	−0.017	0.405	−0.019	0.505	−0.009	0.138
SD	0.254	0.317	0.263	0.333	0.025	0.029
g	0.006	0.011	0.003	0.009	0.007	0.011
$η_{0}$	−0.038	0.144	0.040	0.191	0.022	0.091
$η_{1}$	−0.021	0.055	0.000	0.046	−0.005	0.020
$c_{J}$	−0.002	0.008	0.000	0.008	0.000	0.011
$ω$	0.008	0.110	0.011	0.134	−0.002	0.059
$δ$	0.008	0.014	0.012	0.018	0.013	0.017
Variance
$θ^{(D)}$	0.224	0.898	0.226	0.891	0.141	0.493
$θ^{(S)}$	−0.024	0.164	0.054	0.253	0.003	0.099
$θ^{(P)}$	0.166	0.231	0.182	0.243	0.266	0.318
Covariance
( $θ^{(D)}$ , $θ^{(S)}$ )	0.040	0.237	0.119	0.385	0.015	0.143
( $θ^{(D)}$ , $θ^{(P)}$ )	−0.033	0.214	−0.111	0.288	−0.059	0.103
( $θ^{(S)}$ , $θ^{(P)}$ )	−0.045	0.104	−0.039	0.125	−0.063	0.084

Regarding the random item order design, as shown on the right-hand side of Table 6, the patterns of the structural parameter recovery were similar to those in the rotated block design. In addition, the person parameter recovery in the third simulation study was comparable to that in the first simulation study, as indicated by the mean RMSE values of 0.794, 0.753, and 0.510 for the dropping-out propensity, skipping propensity, and substantive proficiency, respectively. In summary, increasing the proportion of test-takers in the normal response class yielded better parameter estimation for the problem-solving process and poorer parameter estimation for the aberrant response processes, and different item orders across test-takers had a trivial effect on the parameter recovery when the item-position effects were taken into account.

Empirical Demonstration

Data from PISA, a low-stakes assessment, were chosen as an empirical example to demonstrate how to apply the proposed mixture IRTree model to real data analysis. In 2015, the main survey in PISA included 66 forms (booklets) to measure reading, mathematics, science, and collaborative problem-solving literacy competencies. Six reading assessment forms were used in our analysis, in which six testing clusters were administered in different sequences; the numbers of items in the six clusters were 18, 14, 15, 14, 15, and 16. Consequently, the test lengths were 32 for Form 1 (Clusters 1 and 2), 29 for Form 2 (Clusters 2 and 3), 29 for Form 3 (Clusters 3 and 4), 29 for Form 4 (Clusters 4 and 5), 31 for Form 5 (Clusters 5 and 6), and 34 for Form 6 (Clusters 6 and 1). The sample recruited from Taiwan in 2015 to take the six forms consisted of 1,276 students: 7% of the respondents failed to attempt the last item, and 27% omitted at least one response. Detailed information about the assessment design of the 2015 PISA survey is available in the PISA 2015 technical report (OECD, 2017).
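The form lengths follow directly from the cluster sizes and the cluster pairs listed above, as this short check shows (the variable names are introduced here for illustration only):

```python
# Number of items in each reading cluster, as stated in the text
CLUSTER_SIZES = {1: 18, 2: 14, 3: 15, 4: 14, 5: 15, 6: 16}

# Pair of clusters administered in each form, as stated in the text
FORM_CLUSTERS = {1: (1, 2), 2: (2, 3), 3: (3, 4),
                 4: (4, 5), 5: (5, 6), 6: (6, 1)}

# Test length of a form = sum of the sizes of its two clusters
form_lengths = {
    form: sum(CLUSTER_SIZES[c] for c in clusters)
    for form, clusters in FORM_CLUSTERS.items()
}
# form_lengths == {1: 32, 2: 29, 3: 29, 4: 29, 5: 31, 6: 34}
```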

Because both selected- and constructed-response items were administered to test-takers in the reading assessment, to fit the proposed model, the polytomous items were dichotomized to convert a full credit response into a correct response and other responses into incorrect answers. Additionally, the one- and two-parameter logistic models were considered as the item response function in the framework of the mixture IRTree model. We were interested in the following questions: (a) Was it necessary to estimate the discrimination parameters? (b) Was the missingness pattern MAR and ignorable? (c) Was it necessary to include an effort decrement parameter to capture the phenomenon of test-taker performance decline as the test proceeded? Therefore, six fitting models were proposed to address these concerns: the mixture two-parameter IRTree model (Model 1), the mixture one-parameter IRTree model (Model 2), the mixture two-parameter IRTree model with zero covariance (MAR assumed; Model 3), the mixture one-parameter IRTree model with zero covariance (MAR assumed; Model 4), the mixture two-parameter IRTree model without performance decline (Model 5), and the mixture one-parameter IRTree model without performance decline (Model 6). The Akaike information criterion (AIC) and the Bayesian information criterion (BIC) were computed to assess the model fit, and smaller values indicated a better fit of the model to the data.

The AIC values for the six fitted models were 38,690, 39,260, 39,340, 39,880, 38,840, and 39,260, respectively, and the corresponding BIC values were 40,190, 40,300, 40,820, 40,900, 40,210, and 40,290. The mixture two-parameter IRTree model (Model 1) had the smallest AIC and BIC values and was therefore selected as the best-fitting model. The descriptive statistics for the best-fitting model were as follows: the estimates were between 0.13 and 2.13 (M = 0.87) for the discrimination parameters, −4.37 and 4.48 (M = −0.89) for the difficulty parameters, −0.57 and 6.39 (M = 3.28) for the skipping thresholds, $η_{0} = 6.24$ and $η_{1} = -0.08$ for the dropping-out threshold parameters, and 0.02 for the effort decrement parameter. The variance was estimated as 1.99, 2.40, and 2.38, respectively, for the $θ^{(D)}$, $θ^{(S)}$, and $θ^{(P)}$ parameters; the covariance was 1.68 for the $θ^{(D)}$ and $θ^{(S)}$ parameters, −1.01 for the $θ^{(D)}$ and $θ^{(P)}$ parameters, and −1.52 for the $θ^{(S)}$ and $θ^{(P)}$ parameters, indicating that the variability in the propensity to exhibit missing responses among test-takers was not trivial and that less proficient test-takers were more likely to fail to attempt the last item and to skip items in the reading assessment. For the mixing proportions, the percentages of test-takers who responded normally to test items were approximately 63%, 60%, 68%, 73%, 65%, and 55% for the six respective booklets, similar to the findings in previous studies analyzing low-stakes assessments (e.g., Cao & Stokes, 2008; Jin & Wang, 2014).
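The model comparison and the sign pattern of the latent covariances can be checked from the reported values. The sketch below selects the model with the smallest AIC and BIC and converts the estimated variance–covariance matrix into correlations; all numbers are transcribed from the paragraph above, and the variable names are illustrative.

```python
import numpy as np

# Reported information criteria for Models 1 to 6
AIC = {1: 38690, 2: 39260, 3: 39340, 4: 39880, 5: 38840, 6: 39260}
BIC = {1: 40190, 2: 40300, 3: 40820, 4: 40900, 5: 40210, 6: 40290}

best_by_aic = min(AIC, key=AIC.get)  # smaller is better
best_by_bic = min(BIC, key=BIC.get)

# Estimated variance-covariance matrix, ordered as (D, S, P)
cov = np.array([
    [1.99, 1.68, -1.01],
    [1.68, 2.40, -1.52],
    [-1.01, -1.52, 2.38],
])

# Correlation = covariance / (SD_i * SD_j)
sd = np.sqrt(np.diag(cov))
corr = cov / np.outer(sd, sd)
```

Both criteria point to Model 1. The implied correlations are roughly 0.77 between the dropping-out and skipping propensities, −0.46 between dropping-out propensity and proficiency, and −0.64 between skipping propensity and proficiency.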

Table 7 shows the missing response patterns of selected test-takers and the estimates of proficiency and omission propensity. The selected examples represent some typical response patterns when test-takers did their best or exhibited aberrant response behavior over the course of the six testing forms. Note that test-takers were classified into the normal response class when their switching points were equal to the test length of their testing form and that the information obtained from the variance–covariance matrix contributed to the estimation of omission propensity even when no omissions were observed. The same data set was also fit with the 2PLM, treating the missingness mechanism as MAR and assuming that test-takers exert continuous effort. The focus was the comparison of the proficiency estimates between the two models. The estimates calibrated by the mixture two-parameter IRTree model were treated as the reference because they take the aberrant response behavior of test-takers into account. As shown in Table 7, the differences in the proficiency estimates between the two models were most substantial when test-takers had both skipped and not-reached items, followed by when test-takers had only skipped items, and were smallest when test-takers left no blank responses. Although the comparisons were illustrated with the low-stakes PISA assessment, where the consequences of using the misspecified model appeared to be trivial, similar results can be expected for high-stakes assessments, where the impacts are unlikely to be trivial when time constraints are imposed and the effect of speededness should be taken into account.

Table 7.

Response Pattern Summary and Person Parameter Estimates for Selected Samples.

Fitting model					2PLM	Mixture 2P IRTree
Test form	ID	Switching point	Dropping-out point	No. of skipped items	$θ^{(P)}$	$θ^{(P)}$	$θ^{(D)}$	$θ^{(S)}$
1	194	32	N/A	N/A	−1.54	−1.53	0.59	0.92
	18	22	27	3	−2.25	−2.40	2.81	3.12
	207	17	N/A	5	−1.14	−1.45	1.14	1.17
	65	28	N/A	N/A	2.55	2.57	−1.17	−1.76
2	209	29	N/A	N/A	−0.16	−0.08	−0.12	−0.13
	9	3	14	6	−1.12	−2.88	2.54	2.81
	183	24	N/A	N/A	1.20	1.29	−0.81	−1.41
3	192	29	N/A	N/A	−1.63	−1.60	0.68	1.01
	52	5	N/A	4	0.77	0.67	0.39	0.56
	174	25	N/A	N/A	1.84	1.91	−0.86	−1.32
4	105	29	N/A	N/A	−2.10	−2.06	0.82	1.25
	132	5	24	1	−2.60	−2.30	0.96	0.52
	35	6	N/A	4	−1.93	−1.72	0.50	0.83
	62	26	N/A	N/A	2.35	2.40	−1.15	−1.67
5	137	31	N/A	N/A	−2.63	−2.57	1.02	1.57
	187	4	N/A	8	−1.91	−1.69	0.96	1.52
	23	26	N/A	N/A	3.19	3.28	−1.44	−2.17
6	65	34	N/A	N/A	1.50	1.58	−0.76	−1.10
	212	2	21	14	−0.38	−1.48	3.37	4.04
	9	11	N/A	6	0.39	0.61	0.34	0.49
	19	25	N/A	N/A	0.02	0.09	−0.14	−0.20

Note. 2PLM = two-parameter logistic model; IRTree = item response tree; N/A = not applicable. There were 32, 29, 29, 29, 31, and 34 items in test forms 1 to 6, respectively.
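The ordering of the discrepancies described in the text can be checked against the rows of Table 7. The sketch below groups the selected test-takers by missing-data pattern only (not by switching point) and averages the absolute difference between the two proficiency estimates; the values are transcribed from Table 7, with `None` standing for N/A, and the helper names are hypothetical.

```python
# (form, id, dropping-out point, no. skipped, theta_P 2PLM, theta_P IRTree)
ROWS = [
    (1, 194, None, None, -1.54, -1.53), (1, 18, 27, 3, -2.25, -2.40),
    (1, 207, None, 5, -1.14, -1.45),    (1, 65, None, None, 2.55, 2.57),
    (2, 209, None, None, -0.16, -0.08), (2, 9, 14, 6, -1.12, -2.88),
    (2, 183, None, None, 1.20, 1.29),   (3, 192, None, None, -1.63, -1.60),
    (3, 52, None, 4, 0.77, 0.67),       (3, 174, None, None, 1.84, 1.91),
    (4, 105, None, None, -2.10, -2.06), (4, 132, 24, 1, -2.60, -2.30),
    (4, 35, None, 4, -1.93, -1.72),     (4, 62, None, None, 2.35, 2.40),
    (5, 137, None, None, -2.63, -2.57), (5, 187, None, 8, -1.91, -1.69),
    (5, 23, None, None, 3.19, 3.28),    (6, 65, None, None, 1.50, 1.58),
    (6, 212, 21, 14, -0.38, -1.48),     (6, 9, None, 6, 0.39, 0.61),
    (6, 19, None, None, 0.02, 0.09),
]

def pattern(dropout, skipped):
    """Label a row by which kinds of missing responses it contains."""
    if dropout is not None and skipped is not None:
        return "skipped and not reached"
    if skipped is not None:
        return "skipped only"
    if dropout is not None:
        return "not reached only"
    return "no missing"

def mean_abs_diff_by_pattern(rows):
    """Mean |theta_2PLM - theta_IRTree| within each missing-data pattern."""
    groups = {}
    for _, _, dropout, skipped, t_2pl, t_tree in rows:
        groups.setdefault(pattern(dropout, skipped), []).append(abs(t_2pl - t_tree))
    return {label: sum(d) / len(d) for label, d in groups.items()}

mean_diffs = mean_abs_diff_by_pattern(ROWS)
```

For these selected cases, the mean absolute difference is about 0.83 for test-takers with both skipped and not-reached items, about 0.21 for those with skipped items only, and about 0.06 for those with no missing responses.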

Conclusion

Test developers and administrators commonly assume that test-takers respond normally to test items measuring their performance in a specific domain; however, a variety of nuisance factors in real-life testing situations often violate this assumption. Aberrant response behavior as test-takers encounter items interferes with scoring and causes biased parameter estimation. Among the diverse types of aberrant response behavior, effort reduction and item omission may be the most salient factors for test score interpretation and have been investigated and modeled by various methodological approaches for low-stakes or timed power tests (e.g., Cao & Stokes, 2008; Debeer et al., 2017; Jin & Wang, 2014; Pohl et al., 2014; Rose et al., 2017). This study integrates mixture modeling and IRTree-based approaches to simultaneously classify test-takers as showing normal or aberrant response behavior and to model the psychological process through which aberrant response behavior arises. Few studies have discussed such an integration. In the proposed mixture IRTree model, a mixture sequential choice process is assumed for test-takers’ responses to test items, where normal respondents give their best performance throughout the test and do not leave any blank responses. On the other hand, beyond a certain item location, other respondents may switch from normal to aberrant responses because of motivation loss or time pressure and decide whether to drop out, skip, or exert partial effort on the remaining test items through three sequentially interconnected subprocesses. The mixture IRTree model is sufficiently flexible that any type of IRT model can be used as the response function of a subprocess. Following previous approaches for nonignorable missing data and to maximize generality, this study uses the 3PLM as the item response function for the substantive process and the 1PLM as the item response function for both the dropping-out and skipping processes.

The simulation results showed that the structural and person parameters of the model can be recovered satisfactorily and that, as in most simulation studies, increasing the sample size and test length results in more precise estimation of the model and person parameters. Mistakenly treating MNAR missing responses as incorrect or as MAR by fitting a standard 3PLM resulted in biased estimation, and the bias was especially substantial under incorrect answer substitution. To simulate practical testing situations, a reasonable proportion of missing responses was generated, in contrast to previous studies using relatively large numbers of missing responses (e.g., Glas & Pimentel, 2008). This may be the main reason that the difference in parameter recovery between the 3PLM and the true data-generating model was not as substantial as the literature has reported. The resulting difference in the precision of test-takers’ proficiency estimation may not have significant consequences in low-stakes tests but should not be neglected in high-stakes assessments. Furthermore, the precision of parameter estimation relative to the dropping-out and skipping processes deteriorated as the proportion of the normal response class increased, and the random item order design had little impact on parameter estimation.

The applicability of the mixture IRTree model was demonstrated using the Taiwan data from the 2015 PISA reading assessment. The results indicate that the missing-data pattern was MNAR and could not be ignored. Moreover, as in the simulation study, reading proficiency was negatively related to the propensities to drop out and to skip. When the data were fit with the 2PLM and test-takers’ normal responses to items were assumed, the bias in the estimates of test-takers’ reading proficiency was more severe for test-takers in the aberrant response classes with omitted responses than for those who attempted all test items. Although the developed mixture IRTree model offers an interpretable and flexible account of the cognitive process behind test-takers’ response behavior, we do not exclude other possibilities for alternative cognitive processes, and the conclusions derived from the empirical example may not apply to other countries and assessments. If a more convincing theory or appealing hypothesis arises, researchers can easily adjust and customize the model to suit their conditions.

When not-reached and skipped items are derived from the MNAR mechanism, test-takers’ propensities to drop out and to skip are treated as threats to test validity and should be considered and included in the data analysis. However, the dropping-out and skipping propensities do not serve merely as nuisance factors. The relationship between missing data and other variables, such as test-takers’ background variables or other measured outcomes, can help us understand how test-takers’ learning environment influences their test-taking behavior. An explanatory IRTree model can be constructed that includes these manifest or latent variables as predictors in the item response function to provide deeper and more significant insight about the latent response processes (De Boeck & Wilson, 2004). For example, two measures obtained from a self-regulation scale or a self-control scale can be used to predict the levels of dropping-out propensity and can be evaluated and compared with respect to the proportion of variability that the external measures can explain. Similarly, item properties (e.g., abstract or concrete) can be introduced into the skipping-process function to predict the item’s skipping threshold parameter. The aberrant response behavior of test-takers may thus be mitigated by providing instructional interventions (e.g., self-regulation training) or by designing items in an appropriate manner.

Several directions for extending the model are worth noting. First, as noted earlier in this study, IRTree-based models are merely mathematical models that reflect the beliefs or assumptions researchers hold about the processes underlying an item response. Drawing on the rationale and findings of previous studies, we adopted the item-selection model proposed by Debeer et al. (2017) for not-reached and skipped items and extended it with a discrete latent class structure for different response behaviors. Other plausible IRTrees may be more suitable in some situations; for example, although the scenario is not theoretically appealing, test-takers may first decide whether to skip an item and then decide whether to drop out after that item. Diverse IRTrees for interpreting the latent processes that govern aberrant response behaviors should be explored. Second, this study focuses on several types of aberrant response behavior and disregards other possibilities. If possible, the mixture IRTree model should be extended to include additional subprocesses that represent other aberrant test-taking behaviors (e.g., cheating or collusion), which would broaden the generalizability of mixture IRTree models at the price of an increased computational burden. Finally, cognitive diagnostic assessments and their corresponding cognitive diagnosis models have recently become prominent in educational and psychological testing (Rupp et al., 2010). Applying IRTrees to cognitive diagnosis models to investigate test-takers’ aberrant response behavior in cognitive diagnostic assessments is an interesting topic for future study.
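The difference between the two tree orderings mentioned above can be illustrated with a small numerical sketch. This is not the estimation model used in this study: the function names, the node probabilities, and the assumption that the dropping-out decision in the second tree does not depend on the item outcome are all hypothetical choices made for illustration.

```python
import math


def logistic(x: float) -> float:
    """Inverse logit, mapping a real-valued linear predictor to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dropout_first(p_drop: float, p_skip: float, p_correct: float) -> dict:
    """Leaf probabilities for one item when the tree starts with dropping out.

    Node 1: drop out (item not reached) vs. continue.
    Node 2: skip vs. attempt, given the item is reached.
    Node 3: correct vs. incorrect, given the item is attempted.
    """
    reach = 1.0 - p_drop
    attempt = reach * (1.0 - p_skip)
    return {
        "not_reached": p_drop,
        "skipped": reach * p_skip,
        "correct": attempt * p_correct,
        "incorrect": attempt * (1.0 - p_correct),
    }


def skip_first(p_drop: float, p_skip: float, p_correct: float) -> dict:
    """Leaf probabilities when skipping is decided first.

    The item is always reached; the decision to drop out is made after the
    item and is assumed (hypothetically) to be independent of its outcome.
    """
    outcomes = {
        "skipped": p_skip,
        "correct": (1.0 - p_skip) * p_correct,
        "incorrect": (1.0 - p_skip) * (1.0 - p_correct),
    }
    leaves = {}
    for outcome, prob in outcomes.items():
        leaves[(outcome, "continue")] = prob * (1.0 - p_drop)
        leaves[(outcome, "drop_out")] = prob * p_drop
    return leaves


# Hypothetical node probabilities from illustrative linear predictors.
p_drop = logistic(-2.0)
p_skip = logistic(-1.0)
p_correct = logistic(0.5)

tree_a = dropout_first(p_drop, p_skip, p_correct)
tree_b = skip_first(p_drop, p_skip, p_correct)
```

In both trees the leaf probabilities sum to one. The orderings differ in which observed patterns they permit: under the first tree an item can be not reached, whereas under the second tree the current item always receives an outcome and dropping out affects only the subsequent items.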

Footnotes

Acknowledgements

The author thanks the editor and two anonymous reviewers for their constructive comments on earlier drafts of this article.

Declaration of Conflicting Interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was supported by the Ministry of Science and Technology, Taiwan (No. 108-2410-H-845-011).

ORCID iD

Hung-Yu Huang

References

Ashcraft, M. H., & Krause, J. A. (2007). Working memory, math performance, and math anxiety. Psychonomic Bulletin & Review, 14(2), 243-248. https://doi.org/10.3758/BF03194059

Bliss, L. B. (1980). A test of Lord’s assumption regarding examinee guessing behavior on multiple choice tests using elementary school children. Journal of Educational Measurement, 17(2), 147-153. https://doi.org/10.1111/j.1745-3984.1980.tb00823.x

Boekaerts (1997). Self-regulated learning: A new concept embraced by researchers, policy makers, educators, teachers, and students. Learning and Instruction, 7(2), 161-186. https://doi.org/10.1016/S0959-4752(96)00015-1

Bolt, D. M., Cohen, A. S., & Wollack, J. A. (2002). Item parameter estimation under conditions of test speededness: Applications of a mixture Rasch model with ordinal constraints. Journal of Educational Measurement, 39(4), 331-348. https://doi.org/10.1111/j.1745-3984.2002.tb01146.x

Brooks, S. P., & Gelman (1998). General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7(4), 434-455. https://doi.org/10.1080/10618600.1998.10474787

Cao & Stokes, S. L. (2008). Bayesian IRT guessing models for partial guessing behaviors. Psychometrika, 73, 209-230. https://doi.org/10.1007/s11336-007-9045-9

Cassady, J. C., & Johnson, R. E. (2002). Cognitive test anxiety and academic performance. Contemporary Educational Psychology, 27(2), 270-295. https://doi.org/10.1006/ceps.2001.1094

Cho, S.-J., & Cohen, A. S. (2010). Multilevel mixture IRT model with an application to DIF. Journal of Educational and Behavioral Statistics, 35(3), 336-370. https://doi.org/10.3102/1076998609353111

Cohen, A. S., & Bolt, D. M. (2005). A mixture model analysis of differential item functioning. Journal of Educational Measurement, 42(2), 133-148. https://doi.org/10.1111/j.1745-3984.2005.00007

Cole, J. S., Bergin, D. A., & Whittaker, T. A. (2008). Predicting student achievement for low stakes tests with effort and task value. Contemporary Educational Psychology, 33(4), 609-624. https://doi.org/10.1016/j.cedpsych.2007.10.002

Crocker & Algina (1986). Introduction to classical and modern test theory. Holt, Rinehart & Winston.

Cross & Frary (1977). An empirical test of Lord’s theoretical results regarding formula scoring of multiple choice tests. Journal of Educational Measurement, 14(4), 313-322. https://doi.org/10.1111/j.1745-3984.1977.tb00047.x

Debeer & Janssen (2013). Modeling item-position effects within an IRT framework. Journal of Educational Measurement, 50(2), 164-185. https://doi.org/10.1111/jedm.12009

Debeer, Janssen, & De Boeck (2017). Modeling skipped and not-reached items using IRTrees. Journal of Educational Measurement, 54(3), 333-363. https://doi.org/10.1111/jedm.12147

De Boeck & Partchev (2012). IRTrees: Tree-based item response models of the GLMM family. Journal of Statistical Software, 48(1), 1-28. https://doi.org/10.18637/jss.v048.c01

De Boeck & Wilson (Eds.). (2004). Explanatory item response models: A generalized linear and nonlinear approach. Springer.

Douglas, Kim, H. R., Habing, & Gao (1998). Investigating local dependence with conditional covariance functions. Journal of Educational & Behavioral Statistics, 23(2), 129-151. https://doi.org/10.3102/10769986023002129

Dutke & Stöber (2001). Test anxiety, working memory, and cognitive performance: Supportive effects of sequential demands. Cognition and Emotion, 15(3), 381-389. https://doi.org/10.1080/02699930125922

Evans, F. R., & Reilly, R. R. (1972). A study of speededness as a source of test bias. Journal of Educational Measurement, 9(2), 123-131. https://doi.org/10.1002/j.2333-8504.1972.tb00196.x

Eysenck, M. W., & Calvo, M. G. (1992). Anxiety and performance: The processing efficiency theory. Cognition and Emotion, 6(6), 409-434. https://doi.org/10.1080/02699939208409696

Glas, C. A. W., & Pimentel (2008). Modeling nonignorable missing data in speeded tests. Educational and Psychological Measurement, 68(6), 907-922. https://doi.org/10.1177/0013164408315262

Glas, C. A. W., Pimentel, & Lamers, S. M. A. (2015). Nonignorable data in IRT models: Polytomous responses and response propensity models with covariates. Psychological Test and Assessment Modeling, 57(4), 523-541.

Goegebeur, De Boeck, Wollack, J. A., & Cohen, A. S. (2008). A speeded item response model with gradual process change. Psychometrika, 73, 65-87. https://doi.org/10.1007/s11336-007-9031-2

Grigg, Donahue, & Dion (2007). The nation’s report card: 12th-grade reading and mathematics 2005. National Center for Education Statistics.

Holman & Glas, C. A. W. (2005). Modelling nonignorable missing data mechanism with item response theory models. British Journal of Mathematical and Statistical Psychology, 58(1), 1-18. https://doi.org/10.1111/j.2044-8317.2005.tb00312.x

Hong, Peng, & Rowell, L. L. (2009). Homework self-regulation: Grade, gender, and achievement level differences. Learning and Individual Differences, 19(2), 269-276. https://doi.org/10.1016/j.lindif.2008.11.009

Huang, H.-Y. (2014). Effects of the common scale setting in the assessment of differential item functioning. Psychological Reports, 114(1), 104-125. https://doi.org/10.2466/03.PR0.114k11w0

Huang, H.-Y. (2016). Mixture random-effect IRT models for controlling extreme response style on rating scales. Frontiers in Psychology, 7, 1706. https://doi.org/10.3389/fpsyg.2016.01706

Huang, H.-Y. (2017). Mixture IRT model with a higher-order structure for latent traits. Educational and Psychological Measurement, 77(2), 275-304. https://doi.org/10.1177/0013164416640327

Jeon & De Boeck (2016). A generalized item response tree model for psychological assessments. Behavior Research Methods, 48(3), 1070-1085. https://doi.org/10.3758/s13428-015-0631-y

Jin, K.-Y., & Wang, W.-C. (2014). Item response theory models for performance decline during testing. Journal of Educational Measurement, 51(2), 178-200. https://doi.org/10.1111/jedm.12041

Khine, M. S., & Areepattamannil (Eds.). (2016). Non-cognitive skills and factors in educational attainment. Sense.

Köhler, Pohl, & Carstensen, C. H. (2017). Dealing with item nonresponse in large-scale cognitive assessments: The impact of missing data methods on estimated explanatory relationships. Journal of Educational Measurement, 54(4), 397-419. https://doi.org/10.1111/jedm.12154

Leighton, J. P., & Gierl, M. J. (2007). Why cognitive diagnostic assessment? In J. P. Leighton & M. J. Gierl (Eds.), Cognitive diagnostic assessment for education: Theory and applications (pp. 1-18). Cambridge University Press.

Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. Wiley.

Lord, F. M. (1975). Formula scoring and number-right scoring. Journal of Educational Measurement, 12(1), 7-11. https://doi.org/10.1111/j.1745-3984.1975.tb01003.x

Lord, F. M. (1980). Application of item response theory to practical testing problems. Erlbaum.

Sireci, S. G. (2007). Validity issues in test speededness. Educational Measurement: Issues and Practice, 26(4), 29-37. https://doi.org/10.1111/j.1745-3992.2007.00106.x

Meijer, R. R. (1996). Person-fit research: An introduction. Applied Measurement in Education, 9(1), 3-8. https://doi.org/10.1207/s15324818ame0901_2

Messick (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 1-103). American Council on Education/Macmillan.

Mroch, A. A., & Bolt, D. M. (2006, April). An IRT-based response likelihood approach for addressing test speededness [Paper presentation]. Meeting of the National Council on Measurement in Education, San Francisco, CA, United States.

Newell & Simon, H. A. (1972). Human problem solving. Prentice Hall.

OECD. (2009). PISA 2006 technical report. https://www.oecd.org/pisa/data/42025182.pdf

OECD. (2017). PISA 2015 technical report. https://www.oecd.org/pisa/sitedocument/PISA-2015-technical-report-final.pdf

Okumura (2014). Empirical differences in omission tendency and reading ability in PISA: An application of tree-based item response models. Educational and Psychological Measurement, 74(4), 611-626. https://doi.org/10.1177/0013164413516976

Oshima, T. C. (1994). The effect of speededness on parameter estimation in item response theory. Journal of Educational Measurement, 31(3), 200-219. https://doi.org/10.1111/j.1745-3984.1994.tb00443.x

Peng, Hong, & Mason (2014). Motivational and cognitive test-taking strategies and their influence on test performance in mathematics. Educational Research and Evaluation: An International Journal on Theory and Practice, 20(5), 366-385. https://doi.org/10.1080/13803611.2014.966115

Pohl, Gräfe, & Rose (2014). Dealing with omitted and not reached items in competence tests: Evaluating approaches accounting for missing responses in IRT models. Educational and Psychological Measurement, 74(3), 423-452. https://doi.org/10.1177/0013164413504926

Pohl, Haberkorn, Hardt, & Wiegand (2012). NEPS Technical report for reading: Scaling results of starting cohort 3 in fifth grade (NEPS Working Paper No. 15). Otto-Friedrich-Universität, Nationales Bildungspanel.

Qian (2014). An investigation of position effects in large-scale writing assessments. Applied Psychological Measurement, 38(7), 518-534. https://doi.org/10.1177/0146621614534312

Rose, von Davier, & Nagengast (2017). Modeling omitted and not-reached items in IRT models. Psychometrika, 82, 795-819. https://doi.org/10.1007/s11336-016-9544-7

Rose & von Davier (2010). Modeling nonignorable missing data with item response theory (IRT) (Research Report ETS RR-10-11). Educational Testing Service. https://www.ets.org/Media/Research/pdf/RR-10-11.pdf

Rowley, G. L., & Traub, R. E. (1977). Formula scoring, number-right scoring, and test-taking strategy. Journal of Educational Measurement, 14(1), 15-22. https://doi.org/10.1111/j.1745-3984.1977.tb00024.x

Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581-592. https://doi.org/10.1093/biomet/63.3.581

Rupp, A. A., Templin, J. L., & Henson, R. A. (2010). Diagnostic measurement: Theory, methods, and applications. Guilford Press.

Sarason, I. G. (1988). Anxiety, self-preoccupation and attention. Anxiety Research, 1(1), 3-8. https://doi.org/10.1080/10615808808248215

Snow, R. E., & Lohman, D. F. (1989). Implications of cognitive psychology for educational measurement. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 263-331). American Council on Education/Macmillan.

Spiegelhalter, D. J., Thomas, Best, N. G., & Lunn (2003). WinBUGS (Version 1.4) [Computer program]. MRC Biostatistics Unit, Institute of Public Health. https://www.mrc-bsu.cam.ac.uk/wp-content/uploads/manual14.pdf

Sternberg, R. J. (1977). Component processes in analogical reasoning. Psychological Review, 84(4), 353-378. https://doi.org/10.1037/0033-295X.84.4.353

Sternberg, R. J., Lautrey, & Lubart, T. I. (Eds.). (2003). Models of intelligence: International perspectives. APA Books.

Suh, Cho, S.-J., & Wollack, J. A. (2012). A comparison of item calibration procedures in the presence of test speededness. Journal of Educational Measurement, 49(3), 285-311. https://doi.org/10.1111/j.1745-3984.2012.00176.x

Tobias (1992). The impact of test anxiety on cognition in school learning. In Hagtvet (Ed.), Advances in test anxiety research (Vol. 7, pp. 18-31). Swets & Zeitlinger.

Tobias & Everson, H. T. (1997). Studying the relationship between affective and metacognitive variables. Anxiety, Stress, and Coping, 10(1), 59-81. https://doi.org/10.1080/10615809708249295

van Barneveld (2007). The effect of examinee motivation on test construction within an IRT framework. Applied Psychological Measurement, 31(1), 31-46. https://doi.org/10.1177/0146621606286206

van der Linden, W. J., Klein Entink, R. H., & Fox, J.-P. (2010). IRT parameter estimation with response times as collateral information. Applied Psychological Measurement, 34(5), 327-347. https://doi.org/10.1177/0146621609349800

Weirich, Hecht, Penk, Roppelt, & Böhme (2017). Item position effects are moderated by changes in test-taking effort. Applied Psychological Measurement, 41(5), 115-129. https://doi.org/10.1177/0146621616676791

Wigfield & Eccles, J. S. (2000). Expectancy–value theory of achievement motivation. Contemporary Educational Psychology, 25(1), 68-81. https://doi.org/10.1006/ceps.1999.1015

Wise (1996, April). A persistent model of motivation and test performance [Paper presentation]. The annual meeting of the American Educational Research Association, New York, NY.

Wise, S. L. (2009). Strategies for managing the problem of unmotivated examinees in low-stakes testing programs. Journal of General Education, 58(3), 152-166. https://doi.org/10.1353/jge.0.0042

Wise, S. L., & DeMars, C. E. (2005). Low examinee effort in low-stakes assessment: Problems and potential solutions. Educational Assessment, 10(3), 1-17. https://doi.org/10.1207/s15326977ea1001_1

Yamamoto & Everson (1997). Modeling the effects of test length and test time on parameter estimation using the HYBRID model. In Rost & Langeheine (Eds.), Applications of latent trait and latent class models in the social sciences (pp. 89-98). Waxmann.