The Prevalence and Implications of Slipping on Low-Stakes, Large-Scale Assessments

Abstract

In the absence of clear incentives, achievement tests may be subject to the effect of slipping where item response functions have upper asymptotes below one. Slipping reduces score precision for higher latent scores and distorts test developers’ understandings of item and test information. A multidimensional four-parameter normal ogive model was developed for large-scale assessments and applied to dichotomous items of the 2011 National Assessment of Educational Progress eighth-grade mathematics and reading tests. The results suggest that the probability of slipping exceeded 5% for 47.2% and 51.1% of the dichotomous mathematics and reading items, respectively. Furthermore, allowing for slipping resulted in larger item discrimination parameters, increased information in the lower-to-middle range of the latent trait, and decreased precision for scores one standard deviation above the mean. The results provide evidence that slipping is a factor that should be considered during test development and construction to ensure adequate measurement across the latent continuum.

Keywords

achievement, item response theory, NAEP, psychometrics, testing

Introduction

Sometimes students with the necessary knowledge and skills to succeed on an easier task make a mistake. There are several reasons why high-performing test takers might “slip” and mistakenly record an incorrect answer. First, slipping could be attributed to test takers’ careless errors. In the words of Barton and Lord (1981), “Even a high-ability student may make a clerical error in answering an easy item” (p. 2). Second, slipping could be a psychometric characteristic of items (i.e., the item response function [IRF] has an upper asymptote below one). Third, slipping could reflect test takers’ motivation: in the absence of clear incentives, high-performing students may not respond to all items with maximal effort. This article takes a psychometric perspective and uses the term slipping to refer to situations where the upper asymptote of the IRF is less than one.

The primary cause of slipping is not always known, but what is clear is that slipping is an important methodological issue because it reduces precision for scores in the upper range of the latent continuum (Culpepper, 2016; Ogasawara, 2012; Reise & Waller, 2003; Waller & Reise, 2010). In fact, the reduction in score precision due to slipping affects decisions of both test developers and test users. For instance, in some cases, test developers assemble items to accurately measure performance across the latent continuum. In these cases, neglecting the effects of slipping could distort test developers’ understandings of measurement precision. That is, not accounting for slipping may yield an overly optimistic understanding of how well a collection of items measures higher achievement scores. Additionally, slipping can impact policymakers’ decisions and inferences from standardized test scores. In fact, prior empirical research suggests students may not be motivated to maximize their performance on the National Assessment of Educational Progress (NAEP) reading items (Braun, Kirsch, & Yamamoto, 2011; Brophy & Ames, 2005; Debeer, Buchholz, Hartig, & Janssen, 2014; O’Neil, Abedi, Miyoshi, & Mastergeorge, 2005; O’Neil, Sugrue, & Baker, 1995; Wise & DeMars, 2005). The lack of motivation could translate into slipping, which would affect policymakers’ ability to make inferences regarding higher reading proficiency levels.

Slipping is a critical issue to account for when constructing tests and formulating policy. The purpose of this article is twofold: (1) to develop new methodology for large-scale testing programs that accounts for slipping and (2) to provide new evidence regarding the prevalence of slipping on low-stakes, large-scale achievement tests by reanalyzing data from the 2011 NAEP mathematics and reading assessments. The remainder of this article has four sections. The first section reviews Barton and Lord’s (1981) four-parameter model (4PM) and provides examples of how slipping affects IRFs and item information functions. The second section describes a multidimensional version of the four-parameter normal ogive model (4PNO; Culpepper, 2016) for large-scale testing programs. Specifically, the developed 4PNO model is tailored for large-scale assessments where some item responses are missing by design (i.e., missing completely at random; see Patz & Junker, 1999; Rubin, 1976) to minimize the burden on students. The model also reflects the fact that large-scale assessments are typically constructed under the assumption of simple structure, where each item loads on a single latent variable. The third section reports an application of the multidimensional 4PNO model to the 2011 eighth-grade NAEP mathematics and reading assessments. Slipping may be expected on the NAEP exams, given that test takers have minimal incentives to maximize performance. NAEP has been historically relevant for education researchers and policymakers, and uncovering evidence of slipping is important for understanding score precision. The third section also reports item parameters and discusses how estimating the 4PNO model results in differences in expected score precision across the latent achievement continuum. The last section discusses the implications of the results for methodologists and practitioners and provides concluding remarks.

Slipping and the 4PM

This section provides a brief overview of the 4PM. The first subsection discusses the antecedents and dissemination of the 4PM in applied research, and the second subsection describes the model.

Antecedents and Dissemination of the 4PM

In a classic study, Barton and Lord (1981) examined slipping on the high-incentive SAT mathematics (SAT-M), SAT verbal (SAT-V), and graduate record examination (GRE) standardized admissions tests, in addition to the Advanced Placement mathematics exam. However, the results of Barton and Lord’s research did not support the presence of systematic slipping, and one consequence of their study could be that the 4PM subsequently received less attention in the educational measurement literature.

More recently, the 4PM has received renewed interest in several domains. First, the 4PNO has been considered in research on computerized adaptive testing (CAT) (Rulison & Loken, 2009). For instance, Chang and Ying (2008) showed that ability estimates from CATs are subject to underestimation for examinees who miss a few items at the beginning of the test. Subsequent CAT research argued for the 4PM to improve ability estimation in the presence of early careless errors (Liao, Ho, Yen, & Cheng, 2012; Rulison & Loken, 2009). Second, research used the 4PM in the measurement of psychopathology (Reise & Waller, 2003; Waller & Reise, 2010). In fact, Reise and Waller (2003) were the first to demonstrate the utility of the 4PM for psychopathology research, and recent studies successfully applied the 4PM to the measurement of adolescent delinquency (Loken & Rulison, 2010), bullying (Culpepper, 2016), low self-esteem (Waller & Reise, 2010), and subscales of the adolescent version of the Minnesota Multiphasic Personality Inventory (MMPI; Reise & Waller, 2003). Third, renewed interest in the 4PM encouraged the development of new estimation strategies (Culpepper, 2016; Feuerstahler & Waller, 2014; Loken & Rulison, 2010).

The 4PM

This subsection reviews the 4PNO and comments on the effect slipping has on response probabilities and measurement precision. Suppose y is a dichotomous variable and θ is the underlying latent trait. The 4PNO IRF for item j is

$$P_{j} = P (y = 1 \mid θ, α_{j}, β_{j}, γ_{j}, ς_{j}) = γ_{j} + (1 - ς_{j} - γ_{j}) Φ (η_{j}), \tag{1}$$

where $Φ (\cdot)$ is the standard normal cumulative distribution function, γ_j is the guessing or lower asymptote parameter, and ς_j is the slipping or upper asymptote parameter. Note that an alternative parameterization of the 4PM in Equation 1 is to define $c = γ_{j}$ and $d = 1 - ς_{j}$ . Furthermore, $η_{j} = α_{j} θ - β_{j}$ where α_j is the item discrimination index (or item slope) and β_j is the item threshold (i.e., $β_{j} / α_{j}$ is the item difficulty).
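As a minimal sketch of Equation 1, the following Python function evaluates the 4PNO IRF and checks its two asymptotes numerically. The function name `irf_4pno`, the argument name `slip` (standing in for ς), and the item parameter values are illustrative assumptions rather than estimates from this article.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal; .cdf plays the role of Phi


def irf_4pno(theta, alpha, beta, gamma, slip):
    """Equation 1: gamma + (1 - slip - gamma) * Phi(alpha * theta - beta)."""
    eta = alpha * theta - beta
    return gamma + (1.0 - slip - gamma) * _STD_NORMAL.cdf(eta)


# Hypothetical item: unit slope, guessing of .20, slipping of .10.
item = dict(alpha=1.0, beta=0.0, gamma=0.2, slip=0.1)

low = irf_4pno(-10.0, **item)   # approaches the lower asymptote, gamma
mid = irf_4pno(0.0, **item)     # eta = 0, halfway between the asymptotes
high = irf_4pno(10.0, **item)   # approaches the upper asymptote, 1 - slip
```

Under these assumed values the curve is bounded between γ = .20 and 1 − ς = .90, so even test takers with very large θ retain a .10 probability of an incorrect response.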

Consider, for example, IRFs for two items from the 2011 eighth-grade NAEP mathematics and reading assessments. Figure 1 plots IRFs for mathematics item “M151901,” and Figure 2 includes an example of reading item “R058509.” The NAEP documentation indicates a three-parameter model is used for item M151901, and Figure 1 compares the status quo model with a 4PNO fit where the probability of slipping equals .14. Figure 1 shows how fixing the upper asymptote to one with the 3PNO reduces the item slope to 0.66, whereas the item slope for the 4PNO is 1.02. Similarly, the slipping probability for reading item R058509 equaled .24, which suggests that students with higher levels of reading achievement had at most a .76 probability of a correct response (i.e., the upper asymptote was reduced by 24%). Furthermore, the estimated item slope for the status quo 2PNO model was .47 in comparison to an item slope of .96 for the 4PNO. The examples in Figures 1 and 2 show how estimating slipping alters the IRF and, as shown next, reduces score precision for larger θ values.

Figure 1.

Estimated item response function and information function for 2011 eighth-grade National Assessment of Educational Progress (NAEP) mathematics item M151901 for the four-parameter normal ogive (4PNO) and non-4PNO models. The non-4PNO model uses a three-parameter model as specified in the NAEP data documentation. The estimated three-parameter normal ogive item parameters are ${\hat{α}}_{j}$ = 0.66, ${\hat{β}}_{j}$ = 0.10, and ${\hat{γ}}_{j}$ = 0.02. The estimated 4PNO item parameters are ${\hat{α}}_{j}$ = 1.02, ${\hat{β}}_{j}$ = 0.04, ${\hat{γ}}_{j}$ = 0.11, and ${\hat{ς}}_{j}$ = 0.14.

Figure 2.

Estimated item response function and information function for 2011 eighth-grade National Assessment of Educational Progress (NAEP) reading item R058509 for the four-parameter normal ogive (4PNO) and non-4PNO models. The non-4PNO model uses a two-parameter model as specified in the NAEP data documentation. The estimated two-parameter normal ogive item parameters are ${\hat{α}}_{j}$ = 0.47 and ${\hat{β}}_{j}$ = −0.24. The estimated 4PNO item parameters are ${\hat{α}}_{j}$ = 0.96, ${\hat{β}}_{j}$ = −1.05, ${\hat{γ}}_{j}$ = 0.01, and ${\hat{ς}}_{j}$ = 0.24.

A benefit of the item response theory (IRT) framework is that the Fisher information function for item j, I_j, provides insight regarding score precision for a given θ as a function of item parameters. The presence of a nonzero slipping probability reduces information for larger θ values. In fact, Magis (2013) derived the item information for the 4PM by noting that the general equation for item information for a dichotomous IRT model is

$$I_{j} = \frac{{({P^{'}}_{j})}^{2}}{P_{j} (1 - P_{j})} . \tag{2}$$

The first derivative of the 4PNO IRF with respect to θ is ${P^{'}}_{j} = α_{j} (1 - ς_{j} - γ_{j}) φ (η_{j})$ where $φ (\cdot)$ is the standard normal density function. The 4PNO information function for item j is

$$I_{j} = \frac{α_{j}^{2} {(1 - ς_{j} - γ_{j})}^{2} {[φ (η_{j})]}^{2}}{P_{j} (1 - P_{j})} . \tag{3}$$

Furthermore, denote the test information by $I_{θ} = \sum_{j} I_{j}$ .

Figures 1 and 2 provide examples of how I_j differs among the 2PNO, 3PNO, and 4PNO models. For instance, Figure 1 shows that information for item M151901 was smaller for the 4PNO versus the 3PNO for values of $θ > 1$ . Furthermore, the item slope for the 4PNO was larger than for the 3PNO, and Figure 1 shows that information for the 4PNO model was nearly twice as large as the 3PNO in the region of the item location. Figure 2 presents estimated information functions for item R058509. Similar to Figure 1, Figure 2 shows that the 4PNO information function was below the 2PNO for larger θ values (i.e., $θ > 0$ ). The larger item slope for the 4PNO contributed to the peak of the 4PNO information function being 3 times larger than the maximum of the 2PNO information function. The examples in Figures 1 and 2 demonstrate how nonzero slipping probabilities reduce information for larger θ values and increase information for θ values near the item location. Employing a two- or three-parameter model for these items rather than the 4PNO yields a different interpretation of item characteristics and measurement precision for θ. These examples illustrate how the 4PNO can identify items that poorly discriminate higher achieving students.
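The comparison described for item M151901 can be reproduced approximately with a short Python sketch of Equation 3. The parameter values are the posterior means reported in the Figure 1 caption; the function names (`irf`, `item_information`, `total_information`) and the grid of θ values are assumptions made for illustration, and the printed values are point evaluations rather than the plotted curves.

```python
from statistics import NormalDist

_N01 = NormalDist()


def irf(theta, alpha, beta, gamma=0.0, slip=0.0):
    """Equation 1; slip = 0 gives a 3PNO, and gamma = slip = 0 a 2PNO."""
    return gamma + (1.0 - slip - gamma) * _N01.cdf(alpha * theta - beta)


def item_information(theta, alpha, beta, gamma=0.0, slip=0.0):
    """Equation 3: squared slope of the IRF over P(1 - P)."""
    p = irf(theta, alpha, beta, gamma, slip)
    dp = alpha * (1.0 - slip - gamma) * _N01.pdf(alpha * theta - beta)
    return dp * dp / (p * (1.0 - p))


def total_information(theta, items):
    """Unweighted sum of item information functions at theta."""
    return sum(item_information(theta, **it) for it in items)


# Posterior means from the Figure 1 caption for item M151901.
m3 = dict(alpha=0.66, beta=0.10, gamma=0.02)             # 3PNO, slipping fixed at 0
m4 = dict(alpha=1.02, beta=0.04, gamma=0.11, slip=0.14)  # 4PNO

for theta in (-1.0, 0.0, 1.0, 2.0):
    print(theta,
          round(item_information(theta, **m3), 3),
          round(item_information(theta, **m4), 3))
```

Evaluating both parameter sets on the same grid makes the trade-off visible: the 4PNO values are larger near the item location and smaller in the upper range of θ.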

4PNO Bayesian Formulation for Multidimensional, Large-Scale Assessments

This section presents the Bayesian formulation for the multidimensional 4PNO model for dichotomous items. Let $i = 1, \dots, N$ , $j = 1, \dots, J$ , and $k = 1, \dots, K$ index individuals, items, and content areas or dimensions, respectively. The item response by individual i on item j in content area k is $y_{i j k}$ . Large-scale assessments include a collection of items across multiple content areas, so let J_k denote the number of items assessing content area k, and $J = \sum_{k = 1}^{K} J_{k}$ is the total number of items. Furthermore, let $y_{i k} = (y_{i 1 k}, \dots, y_{i J_{k} k})^{'}$ be a J_k-dimensional vector of responses by test taker i to items in content area k and $y_{i} = (y_{i 1}, \dots, y_{i K})^{'}$ be a J-dimensional vector of item responses across content areas. Let $Y = {(y_{1}, \dots, y_{N})}^{'}$ be an $N \times J$ matrix of item responses and $Y_{j k} = {(y_{1 j k}, \dots, y_{N j k})}^{'}$ be an N-dimensional vector of responses to item j in content area k.

Large-scale assessments score $y_{i j k}$ depending upon the item format. Some items are dichotomously scored (e.g., multiple choice), and others are graded by raters and provided partial credit on an ordinal scale. The application in this study analyzes both dichotomous and ordinal items. The goal of this study is to investigate slipping on dichotomously scored items, so the 4PNO is employed for dichotomous items, and the previously developed ordinal normal ogive model (ONO; Albert, 1992; Cowles, 1996) is implemented as the measurement model for polytomous items as demonstrated in prior research (Johnson, 2002; Johnson & Jenkins, 2004; Johnson & Sinharay, 2015).

Recall that the application of a balanced incomplete block (BIB) design (Frey, Hartig, & Rupp, 2009; Gonzalez & Rutkowski, 2010; Mislevy, Johnson, & Muraki, 1992) in large-scale assessments implies that individual i does not respond to all items, so that elements of y_i are missing completely at random. Accordingly, define a missing data indicator $u_{i j k} = 1$ if item j in content area k is included in the booklet assigned to individual i and 0 otherwise. Similarly, let $u_{i k} = (u_{i 1 k}, \dots, u_{i J_{k} k})^{'}$ be a vector of missing indicators for individual i and content area k and $u_{i} = {(u_{i 1}, \dots, u_{i K})}^{'}$ be a J-dimensional vector across content areas. Furthermore, let $U = {(u_{1}, \dots, u_{N})}^{T}$ be an $N \times J$ matrix of missing data indicators with the column corresponding to item j in content area k indicated by $U_{j k}$ . Note that this article does not discuss a model for $u_{i}$ , given that items are assumed missing completely at random.

In large-scale assessments, such as NAEP and TIMSS, items are developed based upon frameworks and simple structure is assumed, which implies that each item loads on a single dimension. Let $θ_{i} = {(θ_{i 1}, \dots, θ_{i K})}^{'}$ denote the multidimensional vector of latent attributes for individual i, and let $Θ = {(θ_{1}, \dots, θ_{N})}^{'}$ be an $N \times K$ matrix of latent traits. Furthermore, the N-dimensional vector of latent traits for content area k is $Θ_{k} = {(θ_{1 k}, \dots, θ_{N k})}^{'}$ .

Bayesian Model

The Bayesian formulation of the 4PNO for large-scale assessments follows,

$$y_{i j k} \mid W_{i j k}, γ_{j k}, ς_{j k} \sim \operatorname{Bernoulli} [{(1 - ς_{j k})}^{W_{i j k}} γ_{j k}^{1 - W_{i j k}}] . \tag{4}$$

$$W_{i j k} = 1_{Z_{i j k} > 0} . \tag{5}$$

$$Z_{i j k} \mid u_{i j k}, θ_{i k}, α_{j k}, β_{j k} \sim \begin{cases} N (η_{i j k}, 1), & u_{i j k} = 1 \\ 0, & u_{i j k} = 0 \end{cases}, \qquad η_{i j k} = α_{j k} θ_{i k} - β_{j k} . \tag{6}$$

$$θ_{i} \sim N_{K} (0_{K}, I_{K}) . \tag{7}$$

$$ξ_{j k} = {(α_{j k}, β_{j k})}^{'} \sim N_{2} (μ_{ξ}, Σ_{ξ}) \, 1_{α_{j k} > 0} . \tag{8}$$

$$p (γ_{j k}, ς_{j k}) \propto 1_{(γ_{j k}, ς_{j k}) \in Ω}, \qquad Ω = \{ (γ, ς) : 0 \leq γ < 1 - ς \leq 1 \} . \tag{9}$$

Equations 4, 5, and 6 imply the 4PNO IRF in Equation 1, where Equation 5 includes the indicator function, $1_{A}$ (i.e., $1_{A} = 1$ if A is true and 0 if A is false), so that $W_{i j k} = 1$ for $Z_{i j k} > 0$ and 0 otherwise. Note that in Equation 6 $Z_{i j k}$ is fixed to 0 for missing $y_{i j k}$ for computational convenience.¹
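The claim that Equations 4, 5, and 6 imply Equation 1 can be checked by simulation. The sketch below draws responses through the augmented variables for one observed item and compares the simulated proportion correct with the closed-form IRF; the parameter values, seed, and number of draws are hypothetical choices made only for illustration.

```python
import random
from statistics import NormalDist


def simulate_response(theta, alpha, beta, gamma, slip, rng):
    """Draw one y through Z and W for an observed item (u = 1)."""
    z = rng.gauss(alpha * theta - beta, 1.0)       # Equation 6
    w = 1 if z > 0 else 0                          # Equation 5
    p_correct = (1.0 - slip) if w == 1 else gamma  # Equation 4
    return 1 if rng.random() < p_correct else 0


rng = random.Random(2011)
theta, alpha, beta, gamma, slip = 0.5, 1.0, 0.2, 0.15, 0.10  # hypothetical values
n_draws = 200_000

simulated = sum(
    simulate_response(theta, alpha, beta, gamma, slip, rng) for _ in range(n_draws)
) / n_draws

# Closed form from Equation 1 for the same parameter values.
implied = gamma + (1.0 - slip - gamma) * NormalDist().cdf(alpha * theta - beta)
print(round(simulated, 3), round(implied, 3))
```

The agreement follows from marginalizing over W: the probability of a correct response is (1 − ς)Φ(η) + γ[1 − Φ(η)], which rearranges to γ + (1 − ς − γ)Φ(η).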

Equation 7 is a prior for the K-dimensional latent variables. Note that the model is identified by setting the prior mean of $θ_{i}$ to 0 (i.e., $0_{K}$ ) and the variance–covariance matrix to a K-dimensional identity matrix, $I_{K}$ . It is important to note that the assumption that the latent variables are a priori independent is consistent with the procedure for scaling NAEP item parameters. For example, the MGROUP software developed by the Educational Testing Service uses “…a discrete distribution over 41 quadrature points for each component of $θ$ so that the probabilities at the 41 points are estimated from the data; also the subscales are assumed to be independent a priori” (Sinharay & von Davier, 2005, p. 3). One natural extension of the prior in Equation 7 would be to instead model the residual variance–covariance matrix from a multivariate latent regression model (Johnson, 2002; Johnson & Jenkins, 2004; Johnson & Sinharay, 2015) with an inverse-Wishart prior.

Equation 8 is a commonly employed truncated bivariate normal prior for item slope and threshold parameters (Albert, 1992) with a prior mean vector of $μ_{ξ}$ and variance–covariance matrix of $Σ_{ξ}$ . Lastly, the slipping and guessing parameters have a truncated uniform prior to enforce the restriction that the upper asymptote is greater than the lower asymptote (i.e., $γ_{j k} < 1 - ς_{j k}$ ). Prior simulation research (Culpepper, 2016) supports the use of uniform priors for $γ_{j k}$ and $ς_{j k}$ for sample sizes of at least 5,000. Accordingly, uniform priors are appropriate for large-scale assessments such as NAEP where approximately 20,000 test takers answer each NAEP item.

Full Conditional Distributions

Let the vectors of dichotomous and continuous augmented data for individual i and content area k be $w_{i k} = {(W_{i 1 k}, \dots, W_{i J_{k} k})}^{'}$ and $z_{i k} = {(Z_{i 1 k}, \dots, Z_{i J_{k} k})}^{'}$ , respectively, and let the J augmented values for individual i be $w_{i} = {(w_{i 1}^{T}, \dots, w_{i K}^{T})}^{T}$ and $z_{i} = {(z_{i 1}^{T}, \dots, z_{i K}^{T})}^{T}$ . Furthermore, let $W = {(w_{1}, \dots, w_{N})}^{'}$ and $Z = {(z_{1}, \dots, z_{N})}^{'}$ be $N \times J$ matrices of augmented data and $W_{j k}$ and $Z_{j k}$ indicate the columns corresponding to item j in content area k. Let A be a $J \times K$ matrix of item slope parameters. Recall that the content frameworks of large-scale assessments assume simple structure for A where each row has one nonzero item slope. Individuals receive a random subset of items, so let $A_{i} = diag (u_{i}) A$ be a matrix that includes nonzero item slope parameters for items received by individual i and zeros for items that i was not assigned. Similarly, let $Θ_{j k} = diag (U_{j k}) Θ_{k}$ be the latent traits of individuals who responded to item j in content area k.

Finally, let the item slopes and thresholds for content area k be $α_{k} = {(α_{1 k}, \dots, α_{J_{k} k})}^{'}$ and $β_{k} = {(β_{1 k}, \dots, β_{J_{k} k})}^{'}$ , respectively. The J-dimensional vectors of item slopes and thresholds are $a = {({α^{'}}_{1}, \dots, {α^{'}}_{K})}^{'}$ and $b = {({β^{'}}_{1}, \dots, {β^{'}}_{K})}^{'}$ .

The aforementioned Bayesian formulation implies the following full conditional distributions (see Online Appendix A for additional derivations):

$$\begin{array}{l} W_{i j k} \mid y_{i j k}, θ_{i k}, ξ_{j k}, γ_{j k}, ς_{j k} \sim \operatorname{Bernoulli} \left[ Φ (η_{i j k}) {\left( \frac{1 - ς_{j k}}{P_{i j k}} \right)}^{y_{i j k}} {\left( \frac{ς_{j k}}{1 - P_{i j k}} \right)}^{1 - y_{i j k}} \right], \\ P_{i j k} = γ_{j k} + (1 - ς_{j k} - γ_{j k}) Φ (α_{j k} θ_{i k} - β_{j k}) . \end{array} \tag{10}$$

$$Z_{i j k} \mid W_{i j k}, u_{i j k}, θ_{i k}, α_{j k}, β_{j k} \sim \begin{cases} N (η_{i j k}, 1) {(1_{Z_{i j k} \leq 0})}^{1 - W_{i j k}} {(1_{Z_{i j k} > 0})}^{W_{i j k}}, & u_{i j k} = 1 \\ 0, & u_{i j k} = 0 \end{cases} . \tag{11}$$

$$\begin{array}{l} θ_{i} \mid u_{i}, z_{i}, a, b \sim N_{K} (μ_{i θ}, Σ_{i θ}), \\ μ_{i θ} = {(A^{T} \operatorname{diag} (u_{i}) A + I_{K})}^{- 1} A^{T} \operatorname{diag} (u_{i}) (z_{i} + b), \\ Σ_{i θ} = {(A^{T} \operatorname{diag} (u_{i}) A + I_{K})}^{- 1} . \end{array} \tag{12}$$

$$\begin{array}{l} ξ_{j k} \mid Z_{j k}, Θ_{k} \sim N_{2} (μ_{j k ξ}, Σ_{j k ξ}) \, 1_{α_{j k} > 0}, \\ μ_{j k ξ} = {(X_{j k}^{T} \operatorname{diag} (U_{j k}) X_{j k} + I_{2})}^{- 1} X_{j k}^{T} \operatorname{diag} (U_{j k}) Z_{j k}, \\ Σ_{j k ξ} = {(X_{j k}^{T} \operatorname{diag} (U_{j k}) X_{j k} + I_{2})}^{- 1}, \\ X_{j k}^{T} = (Θ_{k}^{T}, - 1_{N}^{T}) . \end{array} \tag{13}$$

$$\begin{array}{l} p (γ_{j k}, ς_{j k} \mid Y_{j k}, U_{j k}, W_{j k}) \propto γ_{j k}^{a_{γ} - 1} {(1 - γ_{j k})}^{b_{γ} - 1} ς_{j k}^{a_{ς} - 1} {(1 - ς_{j k})}^{b_{ς} - 1} 1_{(γ_{j k}, ς_{j k}) \in Ω}, \\ a_{γ} = 1 + \sum_{i : u_{i j k} = 1} y_{i j k} (1 - W_{i j k}), \quad b_{γ} = 1 + \sum_{i : u_{i j k} = 1} (1 - y_{i j k}) (1 - W_{i j k}), \\ a_{ς} = 1 + \sum_{i : u_{i j k} = 1} (1 - y_{i j k}) W_{i j k}, \quad b_{ς} = 1 + \sum_{i : u_{i j k} = 1} y_{i j k} W_{i j k} . \end{array} \tag{14}$$

Equations 10 and 11 are the full conditional distributions for the augmented data, $W_{i j k}$ and $Z_{i j k}$ , as derived in Culpepper (2016). In particular, the $W_{i j k}$ are conditionally independent Bernoulli random variables and the $Z_{i j k}$ are sampled as truncated normal random variables conditional on $W_{i j k}$ and individual and item parameters. Also, the $W_{i j k}$ are only sampled for nonmissing $y_{i j k}$ and $Z_{i j k} = 0$ for all missing values in the BIB design. Equation 12 shows that the full conditional distribution of the latent traits for individual i, $θ_{i}$ , is multivariate normal in K dimensions. The conditional mean and variance–covariance matrix for $θ_{i}$ are functions of the item slopes (i.e., A), the missing data indicators for test taker i (i.e., $diag (u_{i})$ ), the item thresholds, b, and a K-dimensional identity matrix (i.e., $I_{K}$ ).
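The updates in Equations 10 through 12 can be sketched for a single observed response and a single dimension. The sketch assumes simple structure, so that $A^{T} diag (u_{i}) A$ is diagonal and each $θ_{i k}$ can be updated separately; the truncated normal draw uses inverse-CDF sampling, which is an assumption made for brevity and is not tail-stable for extreme values of η. All function names and numeric inputs are hypothetical.

```python
import random
from statistics import NormalDist

N01 = NormalDist()


def prob_w_equals_1(y, eta, gamma, slip):
    """Success probability of W in Equation 10 for an observed response y."""
    phi_eta = N01.cdf(eta)
    p = gamma + (1.0 - slip - gamma) * phi_eta
    if y == 1:
        return phi_eta * (1.0 - slip) / p
    return phi_eta * slip / (1.0 - p)


def draw_z(w, eta, rng):
    """Equation 11: N(eta, 1) truncated to (0, inf) if w == 1, else (-inf, 0]."""
    cut = N01.cdf(-eta)                  # P(Z <= 0) before truncation
    lo, hi = (cut, 1.0) if w == 1 else (0.0, cut)
    u = lo + (hi - lo) * rng.random()
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep inv_cdf away from 0 and 1
    return eta + N01.inv_cdf(u)


def theta_k_posterior(alphas, betas, zs, us):
    """Mean and variance of theta_ik from Equation 12 under simple structure."""
    precision = 1.0 + sum(u * a * a for a, u in zip(alphas, us))
    mean = sum(u * a * (z + b) for a, b, z, u in zip(alphas, betas, zs, us)) / precision
    return mean, 1.0 / precision


rng = random.Random(1)
eta = 0.3                                          # hypothetical alpha * theta - beta
w = 1 if rng.random() < prob_w_equals_1(1, eta, 0.2, 0.1) else 0
z = draw_z(w, eta, rng)
mean, var = theta_k_posterior([1.0, 2.0], [0.5, 0.0], [1.5, 0.0], [1, 0])
theta_draw = rng.gauss(mean, var ** 0.5)
```

Items that were not assigned (u = 0) contribute nothing to the precision or the mean, which is how the BIB design enters the update for θ.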

Equation 13 shows that the full conditional distribution for the item slope and threshold parameters is a truncated bivariate normal distribution. In particular, the full conditional distribution for $ξ_{j k}$ is a function of a latent design matrix $X_{j k}$ that includes a column of the dimension k latent traits and an N-dimensional column of minus ones (i.e., $- 1_{N}$ ). Recall that only a subset of subjects responds to each item, so Equation 13 is also a function of the missing data indicators for item j in content area k, $U_{j k}$ .
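A minimal sketch of the update in Equation 13 follows, written with explicit 2 × 2 algebra so that it needs only the Python standard library. Because the truncation involves only the slope, the sketch draws α from its truncated normal marginal and then β from the ordinary conditional normal given α; this factorization, the function names, and the synthetic data are assumptions for illustration rather than the implementation used in this article.

```python
import random
from statistics import NormalDist

N01 = NormalDist()


def xi_posterior(thetas, zs, us):
    """Mean and covariance in Equation 13 with rows of X equal to (theta_i, -1)."""
    # Accumulate X' diag(U) X and X' diag(U) Z over students who received the item.
    s_tt = sum(u * t * t for t, u in zip(thetas, us))
    s_t1 = -sum(u * t for t, u in zip(thetas, us))
    s_11 = sum(us)
    r_t = sum(u * t * z for t, z, u in zip(thetas, zs, us))
    r_1 = -sum(u * z for z, u in zip(zs, us))
    a, b, d = s_tt + 1.0, s_t1, s_11 + 1.0   # precision [[a, b], [b, d]] after adding I_2
    det = a * d - b * b
    cov = ((d / det, -b / det), (-b / det, a / det))
    mean = (cov[0][0] * r_t + cov[0][1] * r_1,
            cov[1][0] * r_t + cov[1][1] * r_1)
    return mean, cov


def draw_xi(mean, cov, rng):
    """Draw (alpha, beta) from N_2(mean, cov) restricted to alpha > 0."""
    sd_a = cov[0][0] ** 0.5
    lo = N01.cdf(-mean[0] / sd_a)            # mass below zero before truncation
    u = lo + (1.0 - lo) * rng.random()
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    alpha = mean[0] + sd_a * N01.inv_cdf(u)
    # beta | alpha is an ordinary conditional normal; truncation only involves alpha.
    cond_mean = mean[1] + cov[1][0] / cov[0][0] * (alpha - mean[0])
    cond_var = cov[1][1] - cov[1][0] ** 2 / cov[0][0]
    return alpha, rng.gauss(cond_mean, cond_var ** 0.5)


rng = random.Random(8)
true_alpha, true_beta = 1.0, -0.3            # hypothetical item
thetas = [rng.gauss(0.0, 1.0) for _ in range(2000)]
us = [1 if rng.random() < 0.5 else 0 for _ in thetas]   # about half receive the item
zs = [rng.gauss(true_alpha * t - true_beta, 1.0) if u else 0.0
      for t, u in zip(thetas, us)]
mean, cov = xi_posterior(thetas, zs, us)
alpha_draw, beta_draw = draw_xi(mean, cov, rng)
```

With roughly 1,000 observed responses the identity prior precision has little influence, so the conditional mean lies close to the generating slope and threshold.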

The full conditional distributions for the item guessing parameters given the slipping parameters (and vice versa) in Equation 14 are independent truncated beta distributions with parameters $a_{γ}$ and $b_{γ}$ (and $a_{ς}$ and $b_{ς}$ for slipping). For instance, $a_{γ}$ is a function of the number of test takers who guessed (i.e., $y_{i j k} = 1$ and $W_{i j k} = 0$ ), and $a_{ς}$ is determined by the number of test takers who slipped (i.e., $y_{i j k} = 0$ and $W_{i j k} = 1$ ). Note that Culpepper (2016) describes a Gibbs-within-Gibbs algorithm for sampling $γ_{j k}$ and $ς_{j k}$ from truncated beta distributions.
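The alternating update implied by Equation 14 can be sketched as follows. The shape parameters are computed from the counts in Equation 14, and each parameter is then drawn from a beta distribution truncated so that $γ_{j k} < 1 - ς_{j k}$ . The truncated draws use naive rejection sampling, which is an assumption made for simplicity and not necessarily the scheme in Culpepper (2016); the function names and the toy data are hypothetical.

```python
import random


def beta_counts(ys, ws, us):
    """Shape parameters in Equation 14 from responses y, augmented W, and indicators u."""
    rows = list(zip(ys, ws, us))
    a_g = 1 + sum(u * y * (1 - w) for y, w, u in rows)        # correct without knowing
    b_g = 1 + sum(u * (1 - y) * (1 - w) for y, w, u in rows)
    a_s = 1 + sum(u * (1 - y) * w for y, w, u in rows)        # incorrect despite knowing
    b_s = 1 + sum(u * y * w for y, w, u in rows)
    return a_g, b_g, a_s, b_s


def truncated_beta(a, b, upper, rng, max_tries=10_000):
    """Beta(a, b) restricted to [0, upper), by simple rejection."""
    for _ in range(max_tries):
        x = rng.betavariate(a, b)
        if x < upper:
            return x
    raise RuntimeError("rejection sampler failed: truncation region has tiny mass")


def draw_gamma_slip(gamma, slip, counts, rng, cycles=10):
    """Alternate gamma | slip and slip | gamma for a fixed number of cycles."""
    a_g, b_g, a_s, b_s = counts
    for _ in range(cycles):
        gamma = truncated_beta(a_g, b_g, 1.0 - slip, rng)
        slip = truncated_beta(a_s, b_s, 1.0 - gamma, rng)
    return gamma, slip


rng = random.Random(14)
ys = [1, 0, 0, 1]   # toy responses
ws = [0, 0, 1, 1]   # toy augmented indicators
us = [1, 1, 1, 0]   # the fourth student did not receive the item
counts = beta_counts(ys, ws, us)
gamma, slip = draw_gamma_slip(0.1, 0.1, counts, rng)
```

The default of ten cycles mirrors the number of Gibbs-within-Gibbs cycles reported in the MCMC implementation.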

Application: 2011 Eighth-Grade NAEP Mathematics and Reading

This section estimates the prevalence of slipping on dichotomous items in the 2011 eighth-grade NAEP mathematics and reading data sets. NAEP is a low-stakes, low-incentives standardized test and was administered to a nationally representative random sample of eighth graders. This section discusses the sample and variables, the Markov chain Monte Carlo (MCMC) implementation, and results.

Sample and Variables

The 2011 NAEP mathematics and reading data sets include item responses from 189,396 and 174,654 eighth-grade students, respectively. The mathematics assessment included $J = 155$ items, of which 123 were dichotomously scored. The reading assessment included $J = 130$ items, of which 89 were dichotomously scored. The mathematics items assess the following five latent dimensions: (1) ALG = algebra ( $J_{1} = 49$ ); (2) DAS = data analysis, statistics, and probability ( $J_{2} = 23$ ); (3) GEO = geometry ( $J_{3} = 30$ ); (4) MEA = measurement ( $J_{4} = 26$ ); and (5) NPO = number properties and operations ( $J_{5} = 27$ ). The reading items assess the following two latent dimensions: (1) INF = reading to gain information ( $J_{1} = 70$ ) and (2) LIT = reading for literary experience ( $J_{2} = 60$ ).

MCMC Implementation

As noted above, the ordinal items were included in the analyses using the ONO model. The two measurement models for the dichotomous items were the status quo models specified in the NAEP technical documentation (i.e., two- or three-parameter models) and the 4PNO. It is important to note that item R058805 in the reading assessment had low discriminating power (i.e., $\hat{α} = 0.231$ ). The small α for item R058805 possibly presented an empirical identifiability issue for the 4PNO, and the item was consequently analyzed with the two-parameter model as specified in the NAEP documentation. There were no model fit issues detected for any of the remaining 211 dichotomous items on the eighth-grade NAEP mathematics and reading tests.

MCMC with Gibbs sampling was used to estimate model parameters. In particular, the 4PNO models for mathematics and reading were compared with results from status quo models based upon the NAEP documentation. Chain lengths of 250,000 were executed, and the first 125,000 iterations were discarded as burn-in. Furthermore, the prior parameters for the item slope and thresholds were set as $μ_{ξ} = 0_{2}$ and $Σ_{ξ} = I_{2}$ . Ten cycles were also used to sample slipping and guessing with the Gibbs-within-Gibbs sampler. Convergence of the 4PNO models was assessed by running four chains with lengths of 250,000 iterations. The $\hat{R}$ index (Brooks & Gelman, 1998) was computed after discarding the first 125,000 iterations as burn-in. The computed $\hat{R}$ values were less than the recommended threshold of 1.2, which provided evidence of convergence. For example, for the reading data, the five-number summary for $\hat{R}$ included a minimum value of 1.000, first quartile of 1.001, median of 1.003, third quartile of 1.005, and a maximum of 1.099.
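For readers unfamiliar with the $\hat{R}$ index, the following sketch computes a basic potential scale reduction factor from several chains of equal length after burn-in. It compares between-chain and within-chain variance and omits any small-sample correction factor, so it should be read as an assumed simplification rather than the exact statistic computed for this study. The simulated chains are hypothetical.

```python
import random
from statistics import fmean, variance


def r_hat(chains):
    """Basic potential scale reduction factor for m chains of equal length n."""
    n = len(chains[0])
    chain_means = [fmean(c) for c in chains]
    within = fmean([variance(c) for c in chains])   # mean within-chain variance
    between = n * variance(chain_means)             # between-chain variance
    pooled = (n - 1) / n * within + between / n     # pooled variance estimate
    return (pooled / within) ** 0.5


rng = random.Random(1998)

# Four hypothetical chains sampling the same distribution.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(4)]

# Four hypothetical chains where one is stuck in a different region.
stuck = [[rng.gauss(shift, 1.0) for _ in range(2000)] for shift in (0.0, 0.0, 0.0, 3.0)]

print(round(r_hat(mixed), 3), round(r_hat(stuck), 3))
```

Values near 1 indicate that the chains agree, whereas a chain centered elsewhere inflates the between-chain variance and pushes the index well above the 1.2 threshold.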

Results

Relative model fit of the status quo and 4PNO measurement models was assessed using the deviance information criterion (DIC; Spiegelhalter, Best, Carlin, & Van Der Linde, 2002), given that prior Monte Carlo research supports the use of the DIC to correctly select the 4PNO over the 2PNO and 3PNO when the true data generating model is the 4PNO (Culpepper, 2016). Smaller DIC values indicate better fit, and the DIC indices provided evidence that the 4PNO improved fit for both the mathematics and reading assessments. The DIC for the mathematics assessment was 6,680,337 and 6,659,295 for the status quo and 4PNO measurement models, respectively. The DIC indices for the reading assessment were 4,013,794 for the status quo model and 4,011,974 for the 4PNO model.

Figure 3 plots $ς_{j k}$ across items for the mathematics and reading tests and provides evidence of nonzero slipping probabilities. Specifically, 47.2% and 51.1% of the mathematics and reading items, respectively, had slipping probabilities exceeding .05 and 20.3% and 23.9% of the items had $ς_{j k} > 0.10$ . The empirical evidence suggests that slipping is prevalent for the 2011 eighth-grade NAEP mathematics and reading assessments.

Figure 3.

Estimated $ς_{jk}$ for the National Assessment of Educational Progress mathematics and reading items. ${\hat{ς}}_{jk}$ is the posterior average. Boxplots are included to summarize the marginal distribution of ${\hat{ς}}_{jk}$ .

Readers are directed to Online Appendix B for tables of dichotomous item parameters for the status quo and 4PNO models. Figure 3 provides evidence regarding the prevalence of slipping in the low-incentive NAEP mathematics and reading tests. As noted above, allowing for slipping may impact the estimated values of the item slope (i.e., $α_{j k}$ ), difficulty (i.e., $β_{j k} / α_{j k}$ ), and guessing (i.e., $γ_{j k}$ ) parameters. Figure 4 includes plots of ${\hat{α}}_{j k}$ , ${\hat{β}}_{j k} / {\hat{α}}_{j k}$ , and ${\hat{γ}}_{j k}$ for the 4PNO and non-4PNO, status quo measurement models for dichotomous items of the mathematics and reading content areas. Figure 4 provides evidence that estimating slipping tends to result in larger item slopes for the mathematics and reading items. There was also evidence that some guessing parameters were larger when estimating slipping probabilities. In contrast, the item difficulties were largely unchanged between the 4PNO and non-4PNO models. The results in Figure 4 are consistent with Monte Carlo evidence that item slopes are larger for the 4PNO than for misspecified two- and three-parameter models (Culpepper, 2016; Loken & Rulison, 2010).

Figure 4.

Estimated item slope (i.e., ${\hat{α}}_{jk}$ ), difficulty (i.e., ${\hat{β}}_{jk} / {\hat{α}}_{jk}$ ), and guessing (i.e., ${\hat{γ}}_{jk}$ ) parameters for the dichotomously scored National Assessment of Educational Progress (NAEP) mathematics and reading items using the four-parameter normal ogive (4PNO) and non-4PNO measurement models. The non-4PNO model uses two- and three-parameter models as specified in the NAEP data documentation. ${\hat{α}}_{jk}$ , ${\hat{β}}_{jk}$ , and ${\hat{γ}}_{jk}$ are posterior averages. A 45-degree reference line is included to compare differences between parameters estimated from the 4PNO and non-4PNO models.

The 4PNO estimated guessing parameters even for those items that were originally modeled with the 2PNO. Interestingly, there was evidence of nonzero guessing parameters from the 4PNO fit for several of the 2PNO dichotomously scored mathematics constructed response questions. For instance, M221701 had a guessing parameter of .37. As with “slipping,” “guessing” may be attributed to actual guessing behavior or an item characteristic and additional investigation of the items would be warranted. In contrast, the guessing parameters for the reading 2PNO items were all below 0.1.

Figure 4 demonstrates how estimating slipping probabilities impacts the item slope, location, and guessing parameters. Changes in item parameters imply changes in item information functions. Figures 5 and 6 plot the estimated test information functions of the dichotomous items for each mathematics and reading dimension. Note that the test information functions were computed as unweighted sums of item information functions, given that NAEP divides the item pool into blocks in a BIB design that “…ensures that approximately equal numbers of students receive each booklet” (Beaton & Zwick, 1992, p. 101). Figure 5 provides evidence that slipping on the dichotomous mathematics items reduced score precision for higher scores across the five latent dimensions. In contrast, the mathematics items provided greater precision in the middle of the latent continuum. The difference between the 4PNO and non-4PNO test information functions was greatest for the ALG, DAS, and MEA content areas, where ${\hat{I}}_{θ}$ was more peaked near the average and smaller in the right tail of the latent dimensions.

Figure 5.

Estimated unweighted test information function (i.e., ${\hat{I}}_{θ}$ ) of dichotomous items for eighth-grade latent mathematics achievement by content area. The content areas are (1) ALG = algebra; (2) DAS = data analysis, statistics, and probability; (3) GEO = geometry; (4) MEA = measurement; and (5) NPO = number properties and operations. The non-4PNO model uses two- and three-parameter models as specified in the National Assessment of Educational Progress data documentation.

Figure 6. Estimated unweighted test information function (i.e., ${\hat{I}}_{θ}$ ) of dichotomous items for eighth-grade latent reading achievement by content area. The content areas are (1) INF = reading to gain information and (2) LIT = reading for literary experience. The non-4PNO model uses two- and three-parameter models as specified in the National Assessment of Educational Progress data documentation.

Figure 6 provides similar evidence for the two reading dimensions. There are several important differences in ${\hat{I}}_{θ}$ between the reading and mathematics content areas. First, unlike mathematics, the results from the 4PNO and non-4PNO models suggest that the dichotomous reading items provided the greatest precision for below-average test takers (i.e., $θ_{1} < 0$ and $θ_{2} < 0$ ). Second, the test information functions for the dichotomous reading items exhibited larger differences between the 4PNO and non-4PNO models. For instance, ${\hat{I}}_{θ}$ for the 4PNO models reached a maximum of approximately 20 and 15 for the INF and LIT dimensions, respectively, in contrast to approximately 14 and 10 for the non-4PNO models.
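One way to read these peak values is through the standard asymptotic relation between information and the standard error of a latent score, $SE(θ) ≈ 1/\sqrt{I(θ)}$. The short sketch below applies that relation to the approximate maxima reported above. The relation is a general large-sample approximation rather than a quantity reported in the study, and the function and variable names are illustrative.

```python
import math


def standard_error(information: float) -> float:
    """Asymptotic standard error implied by a given amount of test information."""
    return 1.0 / math.sqrt(information)


# Approximate peak information values quoted in the text for the reading dimensions.
peaks = {
    "INF": {"4PNO": 20.0, "non-4PNO": 14.0},
    "LIT": {"4PNO": 15.0, "non-4PNO": 10.0},
}

for dimension, by_model in peaks.items():
    for model, info in by_model.items():
        print(f"{dimension} {model}: I = {info:.0f}, SE ~ {standard_error(info):.3f}")
```

Under this approximation, the INF peak corresponds to a standard error of roughly 0.22 for the 4PNO versus roughly 0.27 for the non-4PNO model, and the LIT peak to roughly 0.26 versus 0.32.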

Discussion

This study serves as an initial inquiry into the prevalence and implications of slipping for low-incentive standardized testing. This section summarizes the contributions of the study and provides recommendations for future research.

First, this study offers new evidence regarding the prevalence of slipping on the low-incentive NAEP mathematics and reading assessments. The probability of slipping exceeded 5% for nearly half of the dichotomous mathematics and reading items. One consequence is that neglecting to estimate slipping leads to a different understanding of test information and score precision. In fact, the results in Figures 5 and 6 provide evidence that the status quo non-4PNO models implied that higher scores were measured with greater precision. However, the 4PNO models that allowed for slipping support a different conclusion regarding measurement precision. Specifically, mathematics scores were most precisely measured in the middle of the distribution, whereas reading scores were most precise for latent scores roughly a standard deviation below the average.

This study found that slipping was differentially represented among the mathematics and reading dichotomous items. That is, the reading test information functions in Figure 6 differed more between the 4PNO and non-4PNO models than did the mathematics test information functions in Figure 5. One explanation is that the two reading content areas consisted of more dichotomous items with nonzero slipping probabilities than the five mathematics content areas. A potential implication for test developers is that adding more dichotomous items to a low-incentive assessment may contribute to larger disparities in test information between 4PNO and non-4PNO models.

The application demonstrates that slipping is a factor that should be considered during test development and construction. Specifically, test information functions under the status quo model suggested that larger θ scores were measured with greater precision. Consequently, ignoring slipping during test construction may result in test developers being overly optimistic about the precision of higher latent scores. In contrast, a review of the test information functions for mathematics and reading achievement in Figures 5 and 6 suggests that additional difficult items are needed to improve score precision in the upper portion of the latent continuum.

The differences in interpretation of measurement precision between the status quo and 4PNO models are also important for practitioners. Policymakers are often interested in understanding characteristics of proficient and high-performing test takers to improve the educational experience for all students. The results in this study suggest that scores for high-performing students are measured less precisely. A reduction in measurement precision accordingly would complicate analyses that examine covariates of proficient students, given that increased measurement error attenuates measures of association (May & Nicewander, 1994). Consequently, the presence of slipping in low-incentive assessments would have the effect of making it more difficult to accurately identify correlates of success.
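The attenuation argument can be illustrated with Spearman's classical correction formula, under which an observed correlation equals the true correlation multiplied by the square root of the product of the two measures' reliabilities. The sketch below uses that classical test theory formula with hypothetical reliabilities. It is an illustration of the general principle, not the method of May and Nicewander (1994) or an analysis performed in this study.

```python
import math


def attenuated_correlation(true_r: float,
                           reliability_x: float,
                           reliability_y: float = 1.0) -> float:
    """Observed correlation implied by Spearman's attenuation formula."""
    return true_r * math.sqrt(reliability_x * reliability_y)


# Hypothetical values: a true correlation of 0.50 between achievement and a
# covariate, with the achievement score measured at decreasing reliability.
true_r = 0.50
for reliability in (0.95, 0.85, 0.70):
    observed = attenuated_correlation(true_r, reliability)
    print(f"reliability = {reliability:.2f} -> observed r ~ {observed:.3f}")
```

As score reliability falls, the observed association shrinks toward zero even though the underlying relationship is unchanged, which is the sense in which less precise scores for high-performing students would obscure correlates of success.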

Second, this study presented a new formulation for the 4PNO that is applicable for large-scale assessments with missing data designs. The methodology developed and employed in this study could be applied to assess the prevalence of slipping on other large-scale assessments, such as the Programme for the International Assessment of Adult Competencies, the Programme for International Student Assessment, the Progress in International Reading Literacy Study, and the Trends in International Mathematics and Science Study.

There are several directions for future research. First, standardized tests are increasingly the focus of scrutiny and high-stakes decisions for teachers and schools (Baker, Oluwole, & Green, 2013; Koretz, 2015; Loeb, Soland, & Fox, 2014). High-stakes decisions are often made with achievement tests that bear minimal implications for test takers. That is, unlike college admissions and employment examinations, elementary and secondary achievement tests are low-incentive assessments that bear fewer consequences for test takers. Slipping may be an issue in the absence of clear incentives for test takers. Student performance on NAEP tests has shown convergent validity with student achievement on state exams (Braun & Qian, 2007; McLaughlin, 1998). The presence of slipping on NAEP reading tests may offer an impetus for examining slipping on state achievement assessments. Namely, some states and districts routinely use standardized tests to make high-stakes decisions about the performance of teachers and schools. The results in this study demonstrate that the presence of slipping reduces precision for high-performing students. The resulting reduction in information for higher θ values introduces measurement error and would accordingly distort value-added estimates (Doran, 2014; Lockwood & McCaffrey, 2014; McCaffrey, Castellano, & Lockwood, 2015). Additional research is needed to understand the impact of slipping on measures of teacher and school effectiveness.

Second, prior research examined the role of incentives to motivate and encourage test takers to maximize their performance (Braun et al., 2011; O’Neil et al., 1995; O’Neil et al., 2005). For example, Braun et al. (2011) provided test takers a maximum of US$35 based upon performance. Future research should consider the impact of monetary incentives on slipping probabilities. Such research could provide insight regarding the connection between motivation and slipping as measured by IRFs with upper asymptotes below one. One concern, though, is that offering monetary incentives is likely too costly for large-scale assessments, so additional research is needed on how to structure nonmonetary incentives to motivate test takers to maximize performance.

Third, previous studies explored several multidimensional factor structures for NAEP content areas (see, e.g., Harrell & Cai, 2016). The presence of slipping necessarily alters item slopes, and researchers may uncover a different structure after accounting for slipping. As noted by an anonymous reviewer, future research should consider the influence of slipping on investigations of factor structure in large-scale assessments. The developed model could be extended to more general multidimensional structures, such as confirmatory models with complex structure and exploratory approaches similar to those of Béguin and Glas (2001).

In short, this article provides evidence that low-incentive examinations are influenced by slipping. The application to the 2011 eighth-grade NAEP mathematics and reading assessments demonstrated how slipping impacts measurement precision. The results suggest that slipping is a factor that should be considered during test development and construction to improve score precision for higher performing test takers.

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

Albert (1992). Bayesian estimation of normal ogive item response curves using Gibbs sampling. Journal of Educational and Behavioral Statistics, 17, 251–269.

Baker, B. D., Oluwole, & Green, P. C. (2013). The legal consequences of mandating high stakes decisions based on low quality information: Teacher evaluation in the race-to-the-top era. Education Evaluation and Policy Analysis Archives, 21, 1–71.

Barton, M. A., & Lord, F. M. (1981). An upper asymptote for the three-parameter logistic item-response model (Tech. Rep. No. 80-20). Princeton, New Jersey: Educational Testing Service.

Beaton, A. E., & Zwick (1992). Overview of the National Assessment of Educational Progress. Journal of Educational and Behavioral Statistics, 17, 95–109.

Béguin, A. A., & Glas, C. A. (2001). MCMC estimation and some model-fit analysis of multidimensional IRT models. Psychometrika, 66, 541–561.

Braun, H. I., Kirsch, & Yamamoto (2011). An experimental study of the effects of monetary incentives on performance on the 12th-grade NAEP reading assessment. Teachers College Record, 113, 2309–2344.

Braun, H. I., & Qian (2007). An enhanced method for mapping state standards onto the NAEP scale. In Linking and aligning scores and scales (pp. 313–338). New York: Springer.

Brooks, S. P., & Gelman (1998). General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7, 434–455.

Brophy & Ames (2005). NAEP testing for twelfth graders: Motivational issues. Washington, DC: National Assessment Governing Board.

Chang, H. H., & Ying (2008). To weight or not to weight? Balancing influence of initial items in adaptive testing. Psychometrika, 73, 441–450.

Cowles, M. K. (1996). Accelerating Monte Carlo Markov chain convergence for cumulative-link generalized linear models. Statistics and Computing, 6, 101–111.

Culpepper, S. A. (2016). Revisiting the 4-parameter item response model: Bayesian estimation and application. Psychometrika, 81, 1142–1163. doi:10.1007/s11336-015-9477-6

Debeer, Buchholz, Hartig, & Janssen (2014). Student, school, and country differences in sustained test-taking effort in the 2009 PISA reading assessment. Journal of Educational and Behavioral Statistics, 39, 502–523.

Doran, H. C. (2014). Methods for incorporating measurement error in value-added models and teacher classifications. Statistics and Public Policy, 1, 114–119.

Feuerstahler, L. M., & Waller, N. G. (2014). Estimation of the 4-parameter model with marginal maximum likelihood. Multivariate Behavioral Research, 49, 285–285.

Frey, Hartig, & Rupp, A. A. (2009). An NCME instructional module on booklet designs in large-scale assessments of student achievement: Theory and practice. Educational Measurement: Issues and Practice, 28, 39–53.

Gonzalez & Rutkowski (2010). Practical approaches for choosing multiple-matrix sample designs. IEA-ETS Research Institute Monograph, 3, 125–156.

Harrell & Cai (2016). Multidimensional IRT calibration with simultaneous latent regression in large-scale survey assessments. Paper presentation at the annual meeting of the National Council on Measurement in Education. Washington, DC.

Hoff, P. D. (2009). A first course in Bayesian statistical methods. New York: Springer Science & Business Media.

Johnson, M. S. (2002). A Bayesian hierarchical model for multidimensional performance assessments. Paper presented at the annual meeting of the National Council on Measurement in Education. New Orleans, LA.

Johnson, M. S., & Jenkins (2004). A Bayesian hierarchical model for large-scale educational surveys: An application to the National Assessment of Educational Progress. Princeton, NJ: Educational Testing Service.

Johnson, M. S., & Sinharay (2015). Does the NAEP model adequately predict the achievement gap? Paper presented at the annual meeting of the National Council on Measurement in Education. Chicago, IL.

Koretz (2015). Adapting educational measurement to the demands of test-based accountability. Measurement: Interdisciplinary Research & Perspectives, 13, 1–25.

Liao, Yen, & Cheng (2012). The four-parameter logistic item response theory model as a robust method of estimating ability despite aberrant responses. Social Behavior and Personality: An International Journal, 40, 1679–1694.

Lockwood & McCaffrey, D. F. (2014). Correcting for test score measurement error in ANCOVA models for estimating treatment effects. Journal of Educational and Behavioral Statistics, 39, 22–52.

Loeb, Soland, & Fox (2014). Is a good teacher a good teacher for all? Comparing value-added of teachers with their English learners and non-English learners. Educational Evaluation and Policy Analysis, 36, 457–475.

Loken & Rulison, K. L. (2010). Estimation of a four-parameter item response theory model. British Journal of Mathematical and Statistical Psychology, 63, 509–525.

Magis (2013). A note on the item information function of the four-parameter logistic model. Applied Psychological Measurement, 37, 304–315.

May & Nicewander, W. A. (1994). Reliability and information functions for percentile ranks. Journal of Educational Measurement, 31, 313–325.

McCaffrey, D. F., Castellano, K. E., & Lockwood (2015). The impact of measurement error on the accuracy of individual and aggregate SGP. Educational Measurement: Issues and Practice, 34, 15–21.

McLaughlin (1998). Study of the linkages of 1996 NAEP and state mathematics assessments in four states. Washington, DC: National Center for Education Statistics.

Mislevy, Johnson, & Muraki (1992). Scaling procedures in NAEP. Journal of Educational Statistics, 17, 131–154.

Ogasawara (2012). Asymptotic expansions for the ability estimator in item response theory. Computational Statistics, 27, 661–683.

O’Neil, H. F., Jr., Abedi, Miyoshi, & Mastergeorge (2005). Monetary incentives for low-stakes tests. Educational Assessment, 10, 185–208.

O’Neil, H. F., Jr., Sugrue, & Baker, E. L. (1995). Effects of motivational interventions on the National Assessment of Educational Progress mathematics performance. Educational Assessment, 3, 135–157.

Patz, R. J., & Junker, B. W. (1999). Applications and extensions of MCMC in IRT: Multiple item types, missing data, and rated responses. Journal of Educational and Behavioral Statistics, 24, 342–366.

Reise, S. P., & Waller, N. G. (2003). How many IRT parameters does it take to model psychopathology items? Psychological Methods, 8, 164–184.

Rubin, D. B. (1976). Inference and missing data. Biometrika, 63, 581–592.

Rulison, K. L., & Loken (2009). I’ve fallen and I can’t get up: Can high-ability students recover from early mistakes in CAT? Applied Psychological Measurement, 33, 83–101.

Sinharay & von Davier (2005). Extension of the NAEP BGROUP program to higher dimensions (Tech. Rep. No. RR-05-27). Princeton, New Jersey: Educational Testing Service.

Spiegelhalter, D. J., Best, N. G., Carlin, B. P., & Van Der Linde (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64, 583–639.

Waller, N. G., & Reise, S. P. (2010). Measuring psychopathology with nonstandard item response theory models: Fitting the four-parameter model to the Minnesota Multiphasic Personality Inventory. In Embretson (Ed.), Measuring psychological constructs: Advances in model based approaches. Washington, DC: American Psychological Association.

Wise, S. L., & DeMars, C. E. (2005). Low examinee effort in low-stakes assessment: Problems and potential solutions. Educational Assessment, 10, 1–17.