The Influence of Rapidly Guessed Item Responses on Teacher Value-Added Estimates: Implications for Policy and Practice

Abstract

While most educators assume that not all students try their best on achievement tests, no current research examines if behaviors associated with low test effort, like rapidly guessing on test items, affect teacher value-added estimates. In this article, we examined the prevalence of rapid guessing to determine if this behavior varied by grade, subject, and teacher, and evaluated if rapid guessing influenced teacher value-added estimates. We observed differences in rapid guessing across grades, subjects, and teachers; however, this behavior did not appear to have a substantive effect on teacher value-added estimates. These findings suggest that rapid guessing occurs frequently enough that educators should be mindful of its effect on the interpretations of student test results. However, based on the value-added specifications used in this research, this type of behavior did not appear to affect estimates of teacher performance.

Keywords

test-taking effort, rapid guessing, value-added estimates, teacher evaluation

The use of value-added models (VAMs) in teacher evaluation systems has increased in prevalence in recent years, with states such as New York, Ohio, Tennessee, Minnesota, and others basing at least some portion of a teacher’s evaluation rating on value-added results (Dee & Wyckoff, 2015; Doherty & Jacobs, 2013; Herrmann, Walsh, Isenberg, & Resch, 2013). These models implicitly assume that a student tries his or her best on the achievement tests used as the basis for these models. However, a growing body of research demonstrates that this assumption is not always justified. Low student test-taking effort has been shown to introduce downward bias into student achievement estimates (Rios, Guo, Mao, & Liu, 2017; Setzer, Wise, van de Heuvel, & Ling, 2013; Wise, 2015; Wise & DeMars, 2005), which challenges the validity of test results for those students who do not put forth some minimum level of test-taking effort. Despite these findings, no studies to date consider if teacher value-added estimates are influenced by low student effort on the test used in the estimation of teacher effectiveness. We help close this gap by exploring if rapidly guessed item responses, one specific indicator of low test effort, influence these estimates. Briefly, a rapid guess is defined as any item response for which a student responded so rapidly he or she could not have reasonably provided an accurate response, given how long other students of similar ability levels took to respond to the same item.

We sought to investigate two broad issues in this research. First, we examined whether rapid guessing varied across teachers, test administration periods, grades, or subject areas. Second, we generated teacher value-added estimates using several different model specifications, including models that accounted for student rapid-guessing behaviors in differing ways. Results from these effort-adjusted VAMs were compared with standard value-added results (not adjusted for effort) to determine how rapid guessing may contribute to differences in the rank orderings of teachers, including a review of how the composition of teachers identified at different points in the performance distribution changed conditional on student rapid-guessing behavior. If states base teacher pay or remediation decisions on VAM-related effectiveness classifications, and teachers are reclassified when test-taking effort is considered, then states may be punishing or rewarding teachers based on test results that do not accurately reflect what students know or have learned.

A pivotal issue related to this research is the role teachers play in improving or maintaining student test-taking behavior. That is, do differences in student rapid-guessing rates reflect differences in the students assigned to a teacher, or do those differences provide some indication of the effectiveness of the teacher? If more effective teachers are able to elicit better test effort from their students, then controlling for rapid guessing in a VAM could remove a clear signal of teacher effectiveness from the VAM estimates. While we do not directly answer the question of how teachers affect student test-taking behavior in this research, we do consider various ways to account for student rapid-guessing behavior when estimating teacher value added, which we describe in greater detail later in this article.

This article is organized as follows: In the next section, we summarize current research on how test-taking effort can be measured and how rapid guessing can introduce bias into estimates of student achievement. We also identify value-added modeling issues that are particularly relevant to this line of research. We then provide a summary of the data used for this study, an overview of how we measured rapid guessing, and a description of the VAMs employed for this research. Finally, we summarize our methods and results for each of our research questions, including a summary of how we accounted for student rapid guessing in our VAMs, and conclude with a discussion of the implications of this research.

Background

Measuring Student Test-Taking Effort

When presented with a selected response test item, a student can respond to that item in different ways. The student may carefully read the item and each of the potential answers, giving thoughtful consideration to selecting the correct answer for that item. This approach can be viewed as “effortful,” as the student is attempting to provide a correct response to the test item. Conversely, a student’s approach to responding to test items may be categorized as “non-effortful” if he or she makes no attempt to respond correctly, such as by randomly guessing when selecting item responses.

There are myriad reasons why a student may rapidly guess during a test event. There may be some idiosyncratic reason, such as a student feeling ill during testing or a testing environment that is overly loud and distracting. This behavior could also be a sign of test fatigue, with effort waning as the test progresses (Cao & Stokes, 2008). Rapid guessing may also be a signal of more persistent student characteristics, perhaps providing some indication about students’ perceptions of their own abilities or capacity to do well on a test, especially one that might be perceived as challenging (Soland, Jensen, Keys, Bi, & Wolk, 2017). If students do not believe they can do well on a test, then they may be more likely to rapidly guess on a single test or across multiple test events over time.

Teachers may also have some effect on student test effort. For example, some teachers may be better able to motivate their students to put forth their best effort on classroom assessments or have greater control over student behavior during a testing session compared with other teachers. These examples may be a signal of general teacher quality, especially if some teachers are better able to elicit and maintain high levels of test effort from their students. The extent to which teachers influence student test-taking effort has not been widely examined in the existing literature on this topic.

Whatever the reason for why rapid guessing occurs, measuring this behavior accurately and efficiently is not a straightforward task. Previous research on test-taking effort relied on student self-evaluations of their effort on more traditional pencil-and-paper tests, where students rated their overall level of test-taking effort upon completion of the test (Wise, 2015; Wise & DeMars, 2006). While this approach may capture some of the variability of student effort, it also requires students to truthfully and accurately summarize their effort across the entirety of the testing process. Beyond issues related to self-report bias, these self-evaluations may be less credible if students who did not try on the test also did not respond to the survey effortfully (Wise & DeMars, 2005).

Person-fit statistics can also be used to evaluate student test-taking effort by identifying item-score patterns that are atypical or improbable from what would be expected given a specific test design. Meijer and Sijtsma (2001) provided a comprehensive overview of various statistics that can be used for this purpose, including an elaboration on the practical uses of these various approaches for the purposes of detecting behaviors like cheating or guessing. An example of one such approach, intended specifically to identify these types of behaviors on computerized adaptive tests (CATs), was described by Bradlow, Weiss, and Cho (1998) and van Krimpen-Stoop and Meijer (2000). The authors suggested that patterns of consecutive correct or incorrect item responses at the end of a CAT—when item responses tend to alternate between correct and incorrect responses—may be an indication of an aberrant testing pattern. However, the utility of these person-fit approaches for the specific purpose of identifying low student test-taking effort is limited. Meijer and Sijtsma (2001) noted that the ability to detect these aberrant patterns is influenced by the degree and magnitude of student behavior, which may make it difficult to identify low test-taking effort in all but the most extreme cases. Wise and Kong (2005) concurred with this perspective and also described other factors that may influence a student’s item-response pattern, including cheating, lucky guessing, cognitive issues, and curricular differences across classroom and/or school settings, which further limits the ability to distinguish between these behaviors and clear instances of low test effort.

A method specifically developed for the purposes of measuring student rapid-guessing behavior on computer-administered tests, known as response time effort (RTE), was developed by Wise and Kong (2005) and refined by Wise and Ma (2012). RTE is a summary of student test-taking effort in which a student’s response to a test item can be classified as either solution behavior (i.e., effortful) or rapid-guessing behavior (i.e., non-effortful). This is done by comparing the duration of a student’s individual item response with a predetermined time threshold, with item responses flagged as rapid guesses if the duration of the student’s item response is less than the item-specific threshold.

Wise and Kong (2005) proposed thresholds that varied by the amount of reading required of a student. A review of item-response time distributions for a set of items showed that the distributions for many of these items were bimodal, with a spike in the frequency of test responses that were completed in 10 seconds or less. Based on this information, the authors established item thresholds of 3, 5, or 10 seconds, depending on the number of characters in an item, with responses that fell below these item-specific thresholds categorized as rapid guesses. The authors found that test-taking effort measured in this way was highly correlated with student self-reports of effort, and that item responses identified as rapid guesses were correct at a rate consistent with chance (approximately 25% accuracy for items with four response options). Using the same threshold-setting approach, Kong, Wise, and Bhola (2007) found that the accuracy rate of item responses classified as rapid guesses was slightly higher than random chance (approximately 29%).

The threshold-setting process was later reevaluated by Wise and Ma (2012) to consider the mean duration of responses to individual items. This normative threshold approach was adapted to maximize the identification of rapidly guessed items, while minimizing the misclassification of an effortful item response as a rapid guess. The authors evaluated item accuracy rates based on three different minimum thresholds: 10%, 15%, and 20% of the average duration of item responses. To illustrate, if an item took students 60 seconds on average to complete, then item responses of 6 seconds or less would be flagged as a rapid guess using the 10% threshold. A maximum of 10 seconds was established for all items, given that responses longer than 10 seconds would no longer be reasonably considered a rapid response.
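The normative threshold rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure used by Wise and Ma (2012) or by NWEA: the function names and default arguments are hypothetical, and the comparison uses "less than or equal to" to match the worked example of a 60-second item, whereas other descriptions of RTE use a strict "less than."

```python
def normative_threshold(mean_response_seconds, fraction=0.10, cap_seconds=10.0):
    """Item-specific threshold: a fraction of the item's mean response time,
    capped at a maximum number of seconds (names and defaults are illustrative)."""
    if mean_response_seconds < 0:
        raise ValueError("mean response time must be non-negative")
    return min(fraction * mean_response_seconds, cap_seconds)


def is_rapid_guess(response_seconds, mean_response_seconds, fraction=0.10):
    """Flag a response whose duration does not exceed the item's threshold."""
    return response_seconds <= normative_threshold(mean_response_seconds, fraction)


# An item averaging 60 seconds has a threshold of about 6 seconds at the 10% level;
# an item averaging 150 seconds would have a 15-second threshold without the cap,
# so the 10-second maximum applies instead.
short_item_threshold = normative_threshold(60.0)
long_item_threshold = normative_threshold(150.0)
```

Raising `fraction` to 0.15 or 0.20 widens the window and, per the findings summarized next, would be expected to sweep in some effortful responses.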

Wise and Ma (2012) found that the accuracy rates of rapidly guessed items identified using the 10% threshold closely approximated random guessing, while item responses classified as solution behaviors were consistent with expected accuracy rates. The expanded thresholds of 15% and 20% flagged item responses with accuracy rates greater than those of random guessing, suggesting that these increased thresholds misclassified some effortful student item responses as rapid guesses. These findings indicate that the 10% threshold allowed for effective differentiation between solution behaviors and rapid guesses, and that item responses flagged for rapid-guessing behavior reflect actual low-effort student responses (Wise & Ma, 2012). This specific threshold approach is supported by a considerable amount of validity evidence, which is described in detail by Wise (2015).

Relevant to this discussion, rapid guessing detected using the RTE approach has been shown to be uncorrelated with measures of student academic ability, indicating that this metric is not simply identifying low-achieving students who may have difficulty responding to items on a particular test (Kong et al., 2007; Rios, Liu, & Bridgeman, 2014; Wise & Kong, 2005). For example, Kong et al. (2007) examined the relationship between rapid guessing on a mandated college assessment and student academic ability measured on the SAT assessment. The authors found correlations that were near zero, indicating that rapid-guessing behavior was not simply a function of student academic ability.

Impact of Low Test-Taking Effort on Student Test Results

The effect of low test-taking effort on student test scores has been well documented in research and shows that low effort typically biases student test scores downward, underestimating what students know or have learned (Rios et al., 2017; Wise, 2015; Wise & DeMars, 2005; Wise & Kingsbury, 2016; Wise & Ma, 2012). Wise and DeMars (2005) reviewed a number of studies on the relationship between test-taking motivation and student performance and found that motivated students performed more than half a standard deviation higher, on average, than unmotivated students. In a simulation study, Rios et al. (2017) also showed that students who rapidly guess on 6% or more of their test items tend to have their observed scores biased downward relative to their true scores by 0.2 standard deviations on average.

Because low test-taking effort can significantly distort inferences about student performance, Wise (2015) provided evidence that test events on which students rapidly guessed on 10% or more of the items could suffer from an understatement of the student’s true score sufficient to call the validity of that score into question. This 10% threshold was chosen empirically, in part by showing that scores from test events that did not meet this effort criterion showed lower levels of internal reliability, as well as other signals that these scores might be less valid (Wise & DeMars, 2010; Wise, Kingsbury, & Hauser, 2009; Wise & Ma, 2012). Rios et al. (2017) not only found that the removal of non-effortful responses resulted in increased mean test scores but also noted that these means could be upwardly biased if effort is highly correlated with academic ability.

Research focusing on the implications of low test-taking effort on student outcomes highlights the importance of identifying low test effort and how it may skew overall interpretations of student performance. This implication is particularly relevant when test results are used to evaluate the performance of teachers, schools, or educational programs. For example, the link between low test-taking effort and skewed test results was identified on the Trends in International Mathematics and Science Study (TIMSS), an assessment used to compare the quality of educational programs in different countries (Eklöf, Pavešič, & Grønmo, 2014). The authors found that low test effort contributed to differences between countries in assessed student performance.

Variations in student test-taking effort have also been shown to influence summaries of comparative school performance. Despite observing only a small amount of low student motivation on an assessment measuring knowledge in business-related subjects (1.3% of items were rapidly guessed), Setzer et al. (2013) found notable changes in the rank ordering of 84 higher education institutions after removing the scores for students who were clearly unmotivated while taking the test, and concluded that the variation of these rankings could have been much greater had the prevalence of low test-taking effort been more widespread.

Student Test-Taking Effort and VAM

No research to date has examined how student test-taking effort affects teacher VAM estimates or teacher rank orderings based on those estimates. Because low test effort can introduce negative bias into student test results, evidence that this test-score bias influences VAM estimates could have significant policy implications around the use of test scores for high-stakes evaluation purposes. If, for example, the test scores for a teacher’s students are substantially biased at posttest as a result of a high amount of rapid guessing, but no rapid guessing occurred at the pretest event, then the trajectory of growth in that classroom from pre- to posttest could be artificially deflated. The opposite is true if more rapid guessing occurs at the pretest event. Consistent with the prior literature that describes the potential impact of omitted variables on teacher VAM estimates (Lockwood & McCaffrey, 2014; McCaffrey, Lockwood, Koretz, Louis, & Hamilton, 2004), failing to account for rapid guessing in VAMs could potentially affect the validity of teacher value-added results, given the impact this behavior could have on the trajectory of student growth.

The way students are sorted to teachers and vice versa could also influence teacher VAM estimates. This sensitivity of VAM estimates is relevant to our study if patterns of student test-taking effort are nonrandom and teachers are sorted to students with differential test-taking effort in ways not accounted for in a VAM. Current research demonstrates that students are not randomly sorted to teachers (Kalogrides, Loeb, & Béteille, 2013; Loeb, Kalogrides, & Béteille, 2012; Paufler & Amrein-Beardsley, 2014) and that this nonrandom sorting can influence VAM estimates; however, issues related to nonrandom sorting may be sufficiently mitigated in models that account for lagged student achievement (Goldhaber & Chaplin, 2015; Kane & Staiger, 2008; Koedel & Betts, 2011; Rothstein, 2009). Nonetheless, current research on this topic has not evaluated the extent to which students with differential levels of test effort are nonrandomly assigned to teachers, and how that test effort, if left unaccounted for, may influence teacher value-added results. We fill a gap in the existing VAM literature by examining how rapid-guessing behavior varies across classrooms and by exploring the extent to which teacher VAM estimates are affected by student rapid-guessing behavior.

Data

The data used in this study came from 134 schools in 44 districts in an eastern U.S. state from the 2012–2013 and 2013–2014 school years. These districts were a subset of all districts in the state and were part of a consortium of districts that collaborated on best practices and uses of student test results for the purposes of educator evaluations. Three types of data were collected: student demographic data, student–teacher linkage information, and student testing data. The student demographic data files included information on student gender, race/ethnicity, English language learner (ELL) status, student eligibility for free or reduced lunch (FRL), and special education (SPED) status. The student–teacher linkage files included information about the number of instructional minutes a teacher provided a student, as well as the course(s) in which the teacher delivered instruction to students. In total, the datasets included information for 42,267 students in Grades 3 through 8 across 1,121 mathematics teachers and 1,109 reading teachers. Teachers with fewer than 10 students were excluded from this sample, an approach consistent with current research (e.g., see Lockwood & McCaffrey, 2014, and Loeb, Soland, & Fox, 2014).

Student test results from the NWEA MAP Growth reading and mathematics assessments were used in this study. The MAP Growth assessments are CATs that are typically administered 3 times a year, in the fall, winter, and spring. Each test takes approximately 40 to 60 minutes depending on the grade and subject area, though the tests themselves are untimed. The reading assessment generally comprises 40 test items, and the mathematics assessment generally includes 50 test items. Students respond to assessment items in order (they cannot come back to a test question later), and a test event is finished when a student completes all of the test items. In this study, student achievement growth on the MAP Growth assessments was measured from spring of 2013 to spring of 2014.

There are several characteristics of the MAP Growth assessments themselves that may reduce, but not remove altogether, the likelihood of student rapid-guessing behavior. For example, the assessments and items are untimed, so the assessment format should provide minimal incentive for students to rapidly guess in order to complete all of the items on the assessments. Furthermore, because the assessments are adaptive, students only receive items at a difficulty level commensurate with their estimated achievement level, which reduces the likelihood that students rapidly guess because the item content is too easy or too difficult for them. Finally, student rapid guessing on the MAP Growth assessments was collected unobtrusively, with item-response duration information not readily available to teachers or schools at the time of this research. This makes it unlikely that students or teachers adjusted their behavior during the testing process in any way that may affect the interpretation of our results.

Method

Measuring Rapid Guessing

This study utilized the previously referenced RTE metric with normative thresholds (Wise & Ma, 2012) to identify instances in which a student clearly engaged in non-effortful (i.e., rapid-guessing) test-taking behavior on the MAP Growth assessments. The normative thresholds used for this research were established by NWEA, the publisher of the MAP Growth assessments, and were based on item-level responses from a national sample of students. Using item-specific RTE data aggregated at the test event level, we summarized rapid guessing across a student’s test event as the proportion of the total number of test items on which a student’s item-response time was consistent with solution behavior. This proportion ranges from 0.0 to 1.0, with higher values indicating fewer rapidly guessed test items and vice versa. For example, a test event in which a student did not rapidly guess on any test items would have an RTE of 1.0, and a test event in which a student rapidly guessed on half of his or her items would have an RTE of 0.5. Based on the previously described approach established by Wise (2015), test events with an RTE of 0.90 or less (10% or more rapidly guessed item responses) were flagged as potentially invalid and referred to as low-effort test events in this study.
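As a minimal sketch of this aggregation, the following Python functions compute RTE for one test event from per-item rapid-guess flags and apply the 0.90 cutoff. The function names and the list-of-booleans input format are assumptions made for illustration; they do not describe NWEA's data structures.

```python
def response_time_effort(rapid_flags):
    """RTE for one test event: the proportion of items answered with solution
    behavior, given one boolean per item (True means flagged as a rapid guess)."""
    if not rapid_flags:
        raise ValueError("a test event must contain at least one item")
    solution_count = sum(1 for flagged in rapid_flags if not flagged)
    return solution_count / len(rapid_flags)


def is_low_effort(rte, cutoff=0.90):
    """Low-effort test event: RTE at or below the cutoff (10% or more rapid guesses)."""
    return rte <= cutoff


# A hypothetical 50-item mathematics test event with 5 rapidly guessed items.
flags = [True] * 5 + [False] * 45
rte = response_time_effort(flags)   # 45 of 50 items show solution behavior
low_effort = is_low_effort(rte)     # at the cutoff, so the event is flagged
```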

Student responses flagged as rapid guesses using the RTE with normative thresholds approach (Wise & Ma, 2012) provide a validated way of accurately identifying when students have not invested enough time to appropriately consider the test question and attempt to provide an accurate response. However, there are situations, such as when a student takes an extremely long period of time on a test item but still provides a random guess, where rapid guessing does not fully capture the overall test effort of a student. Furthermore, as we have previously noted, there are any number of reasons for why a student may have rapidly guessed. This behavior may reflect a student’s general approach to test taking, or it may be an indicator of the ability of a teacher to keep his or her students motivated prior to and throughout the test-taking process. Rapid guessing may also simply be a signal of test fatigue, with students’ effort waning as the test progresses (Cao & Stokes, 2008). As such, the results of this research are limited to understanding how the specific behavior of rapid guessing on test items may influence teacher value-added estimates, irrespective of the specific reasons for why that rapid guessing occurred.

Estimating Teacher Value Added

We generated teacher performance estimates three ways using two different spring-to-spring VAM specifications. The first model, which we refer to as Model 1 throughout this article, is a teacher fixed-effects model that applies empirical Bayes shrinkage as described in Guarino, Maxfield, Reckase, Thompson, and Wooldridge (2015). Estimates from this VAM, which did not directly account for student rapid guessing, served as our baseline set of teacher performance estimates.
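For readers unfamiliar with empirical Bayes shrinkage, the sketch below shows one common textbook form: each teacher's raw estimate is pulled toward the grand mean in proportion to how noisy it is. This is an assumption-laden illustration, not necessarily the exact procedure in Guarino et al. (2015) or in this study; the method-of-moments variance estimate and the function name `eb_shrink` are ours.

```python
from statistics import mean, pvariance


def eb_shrink(estimates, std_errors):
    """Shrink raw teacher effects toward their mean (illustrative form only).

    The signal variance is estimated by method of moments: the variance of the
    raw estimates minus the average sampling variance, floored at zero.
    """
    grand_mean = mean(estimates)
    avg_sampling_var = mean(se ** 2 for se in std_errors)
    signal_var = max(pvariance(estimates) - avg_sampling_var, 0.0)
    shrunk = []
    for est, se in zip(estimates, std_errors):
        total_var = signal_var + se ** 2
        reliability = signal_var / total_var if total_var > 0 else 0.0
        shrunk.append(grand_mean + reliability * (est - grand_mean))
    return shrunk


# Three hypothetical teachers; the third has a much larger standard error
# (for example, fewer students) and is therefore shrunk more strongly.
shrunk_effects = eb_shrink([-2.0, 0.0, 2.0], [0.5, 0.5, 1.5])
```

The practical consequence is that imprecisely estimated teachers move toward the middle of the distribution, which matters when comparing rank orderings across model specifications.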

Model 1 can be expressed in three equations: one that models student achievement over time (1.1), and two that model test measurement error (1.2) and (1.3). We modeled teacher value added for student i at time t such that

$$ y_{i 1} = β_{0} + λ y_{i 0} + X_{i 1} β^{'} + α^{'} T_{i 1} + e_{i}, \tag{1.1} $$

where $y_{i 1}$ is “true” achievement by student i in a given grade and subject in the spring of the current year; $y_{i 0}$ is a vector of “true” student achievement in the spring of the prior year; $X_{i 1}$ is a matrix of student i’s characteristics (in the posttest year) demeaned in the sample, including FRL status, SPED status, ELL status, gender, and race; and $T_{i 1}$ is a vector of teacher indicator variables (in the posttest year). We did not include multiple lag test scores because they are not available in the dataset, though this issue is mitigated by our approach to correcting for measurement error in the test.

In practice, we do not observe “true” current achievement $y_{i 1}$ or previous achievement $y_{i 0}$. Instead, we observe measured scores $Y_{i 1}$ and $Y_{i 0}$; that is,

$$ Y_{i 1} = y_{i 1} + ν_{i 1}, \tag{1.2} $$

$$ Y_{i 0} = y_{i 0} + ν_{i 0}, \tag{1.3} $$

where $ν_{i 1}$ is the measurement error in posttest achievement scores, and $ν_{i 0}$ is a vector of measurement error terms in pretest achievement. Combining the student achievement Equation 1.1 with the measurement error Equations 1.2 and 1.3 yields the following:

$$ Y_{i 1} = β_{0} + λ Y_{i 0} + X_{i 1} β^{'} + α^{'} T_{i 1} + ε_{i}, \tag{1.4} $$

where

$$ ε_{i} = e_{i} - λ ν_{i 0} + ν_{i 1} . $$

Equation 1.4 cannot be consistently estimated using ordinary least squares because the pretest measurement error $ν_{i 0}$, scaled by $λ$, ends up in the reduced-form error term, where it is correlated with the observed pretest score $Y_{i 0}$, and thus causes attenuation bias in the estimate of $λ$. This attenuation will bias the coefficients of interest $α$ by inducing a correlation between them and the average pretest value of the students with a particular classroom indicator. Instead, to consistently estimate the parameters in Equation 1.4, an approach that accounts for measurement error in the pretest measure $Y_{i 0}$ must be employed. MAP Growth assessments provide an estimated standard error of measurement for each observed score, and therefore more advanced errors-in-variables models may be employed rather than a standard errors-in-variables adjustment that accounts only for the overall reliability of the test. We employed an approach to adjust for measurement error that accounts for these heterogeneous standard errors of measurement. This approach, which is described in Fuller (2009), disattenuates the coefficient $λ$ and helps unbias the coefficient vector. We elaborate on our approach for correcting for measurement error and provide additional details on the value-added methodology used in this research in a technical appendix.
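The attenuation problem can be illustrated with a small simulation. The sketch below is a deliberately simplified version of the setting: it has no teacher indicators or student covariates, all parameter values are invented, and the correction shown is the simple reliability-based adjustment, not the heterogeneous-error approach from Fuller (2009) used in this study.

```python
import random


def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate_attenuation(n=20000, lam=0.8, sd_true=1.0, sd_pre_error=1.0,
                         sd_post_error=0.5, sd_e=0.5, seed=0):
    """Return (naive OLS slope, reliability-corrected slope) on simulated scores."""
    rng = random.Random(seed)
    observed_pre, observed_post = [], []
    for _ in range(n):
        true_pre = rng.gauss(0.0, sd_true)
        true_post = lam * true_pre + rng.gauss(0.0, sd_e)
        # Observed scores are true scores plus independent measurement error.
        observed_pre.append(true_pre + rng.gauss(0.0, sd_pre_error))
        observed_post.append(true_post + rng.gauss(0.0, sd_post_error))
    naive = ols_slope(observed_pre, observed_post)
    # Reliability of the pretest: share of observed variance that is true variance.
    reliability = sd_true ** 2 / (sd_true ** 2 + sd_pre_error ** 2)
    return naive, naive / reliability


naive_slope, corrected_slope = simulate_attenuation()
```

With these invented settings the pretest reliability is 0.5, so the naive slope should land near half of the true value of 0.8, and dividing by the reliability should recover something close to 0.8. Posttest measurement error adds noise but does not attenuate the slope.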

Teacher value added was also estimated using a model that introduced student test-taking effort at pretest (prior spring) as a model covariate. In practice, teacher value-added estimates should only be affected if there are substantive differences in the prevalence of rapid guesses between test administrations. Given that rapid guessing introduces negative bias into estimates of student achievement (on average), we would expect that a student with a high number of rapid guesses at pretest (and therefore an artificially lower pretest score) and no rapid guessing at posttest would demonstrate a growth pattern that was upwardly biased. Conversely, a student with no rapid guessing at pretest and a high number of rapid guesses at posttest (resulting in an artificially lower posttest score) would demonstrate growth that was downwardly biased. Because rapid guessing can affect student growth trajectories in different ways depending on when the rapid guessing occurs and to what extent, it may seem reasonable to control for this student behavior at all testing terms.
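The direction of these biases can be made explicit with a stylized decomposition. The bias terms $b_{i 0}$ and $b_{i 1}$ below are hypothetical symbols introduced only for this illustration (they do not appear in the models in this article), and ordinary measurement error is ignored for simplicity.

```latex
% b_{i0}, b_{i1} >= 0: assumed downward score bias from rapid guessing
% at pretest and posttest, respectively (illustrative symbols only).
\begin{aligned}
Y_{i 0} &= y_{i 0} - b_{i 0}, \qquad Y_{i 1} = y_{i 1} - b_{i 1}, \\
Y_{i 1} - Y_{i 0} &= (y_{i 1} - y_{i 0}) + (b_{i 0} - b_{i 1}).
\end{aligned}
```

Observed growth therefore overstates true growth when $b_{i 0} > b_{i 1}$ (more rapid guessing at pretest), understates it when $b_{i 1} > b_{i 0}$, and is unaffected when the two biases are equal.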

We did not, however, control for rapid guessing at both terms, given that contemporaneous effort is endogenous to a model of achievement. That is, it may be reasonable to hypothesize that a more effective teacher is able to elicit greater focus and effort from his or her students compared with a less effective teacher. If student test-taking effort is an indicator of teacher quality, and if teachers are responsible for improving or maintaining student test effort, then the presence of rapid guessing at posttest could be directly related to the effectiveness of the current teacher: the more effective the teacher, the less rapid guessing we might observe. Controlling for rapid guessing at posttest could therefore remove a clear signal of teacher effectiveness from the VAM estimates. For this reason, we controlled for student rapid guessing at pretest from the prior spring, when the current teacher had no direct control over a student’s test-taking behavior. Results from this VAM specification are referred to as Model 2 in this article, with differences between Model 1 and Model 2 estimates serving as one way of interpreting how rapid guessing may affect estimates of teacher effectiveness. We intend to explore whether more effective teachers have a differential impact on student test-taking effort in future research.

Building upon Equation 1.4, Model 2 can be expressed as follows:

$$ Y_{i 1} = β_{0} + λ Y_{i 0} + X_{i 1} β^{'} + β_{e 0} e_{i 0} + β_{e m 0} \mathit{emax}_{i 0} + α^{'} T_{i 1} + ε_{i} . $$

In this model specification, $e_{i 0}$ is the RTE of a student assessment at pretest, and $\mathit{emax}_{i 0}$ is an indicator variable for RTE at pretest being 1.0 (otherwise 0). We included $\mathit{emax}_{i 0}$ in this model specification because the data were censored at this level. This implies that the coefficient on effort is only estimated from variance coming from the portion of the sample that is not already at maximum effort. Including multiple RTE variables as we do would be a problem if the outcome we were interested in for this article was the coefficient on effort. However, as we are primarily concerned with how accounting for effort in models affects teacher rank orderings (rather than trying to quantify the impact of RTE on test-score gains), the coefficient on RTE is not of interpretive importance. Results are also not sensitive to the inclusion of this dummy variable.
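A minimal sketch of how the two pretest effort covariates could be built for one student is shown below. The function name and tuple return format are hypothetical; only the definitions (pretest RTE and an indicator for RTE equal to 1.0) come from the text.

```python
def pretest_effort_covariates(rte_pretest):
    """Return (pretest RTE, maximum-effort indicator) for one student.

    The indicator is 1 when the pretest RTE equals 1.0 (no rapid guesses)
    and 0 otherwise, so the slope on RTE is informed by students below 1.0.
    """
    if not 0.0 <= rte_pretest <= 1.0:
        raise ValueError("RTE must lie between 0.0 and 1.0")
    max_effort = 1 if rte_pretest == 1.0 else 0
    return rte_pretest, max_effort


# Hypothetical students: one with no rapid guesses, one with 10% rapid guesses.
full_effort_row = pretest_effort_covariates(1.0)
partial_effort_row = pretest_effort_covariates(0.9)
```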

As previously noted, one potential concern with adjusting for effort by including only lagged RTE is that it does not directly account for effort at posttest. This decision was made to avoid endogeneity concerns, given that teachers are likely able to influence student test-taking effort, at least to some extent, and more effective teachers may be better at maintaining student effort throughout the test than less effective teachers. However, this does limit our ability to directly examine the effects of rapid guessing at posttest, which is relevant to this research given the potential influence this behavior at posttest can have on the trajectory of student growth (i.e., depressing student growth, given the negative bias introduced into the posttest score). While we cannot circumvent endogeneity issues in models that account for posttest rapid guessing, we did opt to estimate teacher value added in an alternative manner to examine the robustness of our Model 2 results when accounting for posttest effort.

For this alternative method, we reestimated teacher value added using Model 1 specifications but used scores from the MAP Growth assessments at pre- and posttest that were reestimated after removing rapidly guessed item responses. These “effort-moderated” scores essentially treat rapid guesses as missing, reproducing item-response theory (IRT) estimates using maximum-likelihood estimation based only on nonrapidly guessed item responses. With this effort-moderated approach, if a student rapidly guessed on 10 of 50 items, then the student’s IRT estimate would be based only on information from the 40 items that were not rapidly guessed.
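The idea of scoring only effortful responses can be sketched as follows. This is an illustration under stated assumptions, not the operational MAP Growth scoring routine: it assumes a Rasch model with known item difficulties, a per-item rapid-guess flag supplied by the caller, and a plain Newton-Raphson search; all names are hypothetical.

```python
import math


def rasch_prob(theta: float, b: float) -> float:
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def information(theta: float, difficulties: list[float]) -> float:
    """Test information: sum of p * (1 - p) over the scored items."""
    return sum(rasch_prob(theta, b) * (1.0 - rasch_prob(theta, b)) for b in difficulties)


def effort_moderated_theta(responses, difficulties, rapid_guess, max_iter=50, tol=1e-8):
    """ML ability estimate that treats rapidly guessed items as missing.

    Returns (theta, standard_error, number_of_items_scored).
    """
    kept = [(x, b) for x, b, rg in zip(responses, difficulties, rapid_guess) if not rg]
    if not kept:
        raise ValueError("no effortful responses left to score")
    raw_score = sum(x for x, _ in kept)
    if raw_score in (0, len(kept)):
        raise ValueError("ML estimate is not finite for all-correct or all-incorrect patterns")
    kept_b = [b for _, b in kept]
    theta = 0.0
    for _ in range(max_iter):
        gradient = raw_score - sum(rasch_prob(theta, b) for b in kept_b)
        step = gradient / information(theta, kept_b)
        theta += step
        if abs(step) < tol:
            break
    return theta, 1.0 / math.sqrt(information(theta, kept_b)), len(kept)


# Eight items; the last two were rapid guesses that happened to be wrong.
responses = [1, 1, 0, 1, 0, 0, 0, 0]
difficulties = [-1.0, 0.0, 1.0, -1.0, 0.0, 1.0, 0.0, 0.0]
flags = [False] * 6 + [True, True]

moderated = effort_moderated_theta(responses, difficulties, flags)
unadjusted = effort_moderated_theta(responses, difficulties, [False] * 8)
print(moderated, unadjusted)
```

In this toy case the effort-moderated estimate is higher than the unadjusted one, since the two wrong rapid guesses no longer pull the score down, and its standard error is larger because only six items are scored.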

These effort-moderated scores involve a classic bias-precision trade-off. Wise and DeMars (2006) and Wise and Kingsbury (2016) showed that effort-moderated scores remove most of the bias introduced into observed test scores by rapid guessing but yield achievement estimates with lower precision (i.e., greater standard error of measurement) because the estimates are based on fewer items. The main benefit of effort-moderated scores in the VAM context is that we can account for rapid guessing at both pre- and posttest, which Model 2 does not, and generate VAM estimates from test scores with less of the bias that is a direct function of student rapid-guessing behavior.
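The precision cost can be written out under one common assumption, a Rasch-type model in which item $j$ contributes information $P_j(θ)[1 - P_j(θ)]$. The source does not state the information function of the scoring model, so this is an illustrative sketch rather than the authors' derivation. With $S$ denoting the set of items that were not rapidly guessed:

$$
SE(\hat{θ}) \approx \frac{1}{\sqrt{\sum_{j \in S} P_j(\hat{θ})\,[1 - P_j(\hat{θ})]}}
$$

Removing items from $S$ can only shrink the sum, so the standard error grows. If, purely for illustration, all items were equally informative, scoring 40 of 50 items would inflate the standard error by a factor of $\sqrt{50/40} \approx 1.12$.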

However, this approach does not assuage endogeneity concerns, given the potential influence teachers have on student behavior on a test. While this approach removes bias that occurs as a function of a student, it can still bias VAM estimates if rapid guessing is correlated with teacher quality. That is, if less effective teachers have student test scores that are noisier estimates of student achievement as a function of the increased presence of rapid guessing, then this effort-moderated approach could positively bias teacher VAM estimates, given that the effort-moderated scoring approach removes student item responses that may provide information about teacher quality. Specifically, less effective teachers whose students demonstrate more instances of rapid guessing would have VAM estimates that are positively biased relative to what their “true” VAM estimates should be (estimates based on all item responses, not just those item responses that were not rapidly guessed). Thus, we acknowledge that endogeneity issues remain a concern with this effort-moderated approach, but results from this model do allow us to examine how this behavior affects VAM estimates when we account for rapid guessing at both test terms, and chiefly serve as a robustness check for our primary Model 2 results.

We refer to the VAM that uses student effort-moderated scores as Model 3 and compare these model results with results from Model 1. This approach, in a sense, represents an upper bound of the effects of rapid guessing on teacher VAM estimates, when attributing none of the variation in rapid guessing at posttest to the effects of the teachers themselves. Because the model specification used in this approach is identical to that of Model 1, any differences observed between results are not a function of our modeling choices but instead reflect differences in achievement estimates based on all item responses or only nonrapidly guessed responses. As such, using effort-moderated scores helps ensure that our main results are not a function of omitting posttest RTE, nor are they a function of our modeling choices more generally.

Research Question 1: How Prevalent Was Rapid Guessing?

The first research question focused on the prevalence of rapid guessing within the research sample, including the extent to which rapid-guessing behavior varied across teachers, test administration periods, grades, and subject areas. To address this question, we relied primarily on descriptive statistics to identify the proportion of all tests within the study sample that met the low-effort test event criterion, and present this information at pretest and posttest by grade and subject. We also summarized differences in rapid guessing across teachers, showing the proportion of teachers within the sample with varying counts and proportions of students who met the low-effort test event criterion. We hypothesized that VAM estimates would only be affected in a significant way if rapid guessing was clustered within classrooms.
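The descriptive step can be sketched in a few lines of Python. The sketch assumes that RTE is the share of items answered with solution behavior, that an item counts as a rapid guess when its response time falls below an item-specific threshold, and that a test event is low effort when RTE ≤ .90 (the criterion reported with Table 1). The threshold values and all function names are hypothetical.

```python
from collections import defaultdict

LOW_EFFORT_CUTOFF = 0.90  # low-effort criterion reported with Table 1 (RTE <= .90)


def response_time_effort(response_times, thresholds):
    """Share of items whose response time is at or above the item's threshold."""
    if not response_times or len(response_times) != len(thresholds):
        raise ValueError("need one threshold per item and at least one item")
    effortful = sum(t >= thr for t, thr in zip(response_times, thresholds))
    return effortful / len(response_times)


def low_effort_share_by_teacher(test_events):
    """test_events: iterable of (teacher_id, rte). Returns {teacher_id: share low effort}."""
    counts = defaultdict(lambda: [0, 0])  # teacher_id -> [low-effort events, all events]
    for teacher_id, rte in test_events:
        counts[teacher_id][0] += int(rte <= LOW_EFFORT_CUTOFF)
        counts[teacher_id][1] += 1
    return {teacher: low / total for teacher, (low, total) in counts.items()}


# Ten items with a made-up 3-second threshold each; two responses are too fast.
times = [12.0, 1.0, 8.5, 20.0, 2.0, 9.0, 15.0, 7.0, 30.0, 11.0]
rte = response_time_effort(times, [3.0] * 10)
events = [("teacher_a", rte), ("teacher_a", 1.0), ("teacher_b", 1.0), ("teacher_b", 0.95)]
print(rte, low_effort_share_by_teacher(events))
```

Aggregating the flag by teacher in this way is what allows clustering of rapid guessing within classrooms to be examined.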

We also identified the proportion of students in our sample who did or did not meet the low-effort test event criterion at both pre- and posttest, to give some indication as to how common it was for students to repeat this behavior over time. Finally, we also present information on the relationship between mean student achievement and RTE aggregated at the teacher level, as it is relevant to understanding how rapid guessing may influence estimates of teacher performance.

Research Question 2: What Was the Effect of Rapid Guessing on VAM Estimates?

Our second research question addressed the extent to which rapid guessing influenced teacher VAM estimates. We examined this issue in several ways using teacher-level VAM estimates based on the previously described model specifications. We first examined the correlations in VAM estimates between Model 1—our baseline estimates of teacher performance—and Model 2 (rapid guessing at pretest introduced as a model covariate). This comparison included a summary of the average and maximum difference in VAM estimates across models, as well as the proportion of teachers with VAM estimates that differed by .25 and .50 standard deviations. We present this information by the number of low-effort test events at pretest to demonstrate if increased instances of rapid guessing contributed to differences in VAM estimates across model conditions.
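A minimal sketch of this comparison is shown below. It assumes the two sets of teacher estimates are already expressed in standard deviation units and are listed in the same teacher order; the function name and the example values are made up for illustration.

```python
import math


def compare_vam(model_a, model_b):
    """Summarize agreement between two sets of teacher value-added estimates."""
    if len(model_a) != len(model_b) or len(model_a) < 2:
        raise ValueError("need two equally long lists with at least two teachers")
    n = len(model_a)
    mean_a, mean_b = sum(model_a) / n, sum(model_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(model_a, model_b))
    var_a = sum((a - mean_a) ** 2 for a in model_a)
    var_b = sum((b - mean_b) ** 2 for b in model_b)
    diffs = [abs(a - b) for a, b in zip(model_a, model_b)]
    return {
        "correlation": cov / math.sqrt(var_a * var_b),   # Pearson correlation
        "mean_abs_diff": sum(diffs) / n,
        "max_abs_diff": max(diffs),
        "share_ge_25": sum(d >= 0.25 for d in diffs) / n,  # differ by .25 SD or more
        "share_ge_50": sum(d >= 0.50 for d in diffs) / n,  # differ by .50 SD or more
    }


baseline = [-0.5, 0.0, 0.5, 1.0]   # hypothetical Model 1 estimates
adjusted = [-0.5, 0.0, 0.5, 0.7]   # hypothetical Model 2 estimates
print(compare_vam(baseline, adjusted))
```

Running the summary separately for groups of teachers defined by their number of low-effort test events would give a breakdown like the one the text describes.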

We also examined how test effort influenced the relative ranking of teachers, focusing primarily on those teachers in the top and bottom 5% of the performance distribution. The identification of low- and high-performing teachers is consistent with current policy and research (Hanushek, 2011; Hanushek, Kain, O’Brien, & Rivkin, 2005). To do this, we ranked teachers within grade and subject using estimates from all three VAM conditions. We used these rankings to determine if there was consistency in the subsets of teachers identified in the top and bottom 5% of all teachers across model conditions, and if the rank ordering of teachers shifted at any other point in the distributions.
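The rank-order comparison can be sketched as a transition table between performance bands. The band edges follow the categories used in the article's ranking tables; the percentile rule (100 × rank / n, with rank 1 the lowest estimate) and all names are assumptions made for illustration, since the source does not state how percentiles were computed.

```python
BANDS = ["<=5", "6-25", "26-74", "75-94", ">=95"]


def band(percentile: float) -> str:
    """Map a percentile rank to one of the five performance bands."""
    if percentile <= 5:
        return "<=5"
    if percentile <= 25:
        return "6-25"
    if percentile < 75:
        return "26-74"
    if percentile < 95:
        return "75-94"
    return ">=95"


def percentile_ranks(estimates: dict) -> dict:
    """Percentile rank per teacher: 100 * rank / n, rank 1 = lowest estimate."""
    ordered = sorted(estimates, key=estimates.get)
    n = len(ordered)
    return {teacher: 100.0 * (i + 1) / n for i, teacher in enumerate(ordered)}


def transition_table(baseline: dict, alternative: dict) -> dict:
    """Row-normalized shares: baseline band (rows) by alternative band (columns)."""
    base_pct, alt_pct = percentile_ranks(baseline), percentile_ranks(alternative)
    counts = {b: {c: 0 for c in BANDS} for b in BANDS}
    for teacher in baseline:
        counts[band(base_pct[teacher])][band(alt_pct[teacher])] += 1
    table = {}
    for b, row in counts.items():
        total = sum(row.values())
        table[b] = {c: (k / total if total else 0.0) for c, k in row.items()}
    return table


# Twenty hypothetical teachers; the two lowest swap places under the alternative model.
model_1 = {f"t{i}": float(i) for i in range(20)}
model_alt = dict(model_1)
model_alt["t0"], model_alt["t1"] = model_1["t1"], model_1["t0"]
table = transition_table(model_1, model_alt)
print(table["<=5"], table["6-25"])
```

In practice the ranking would be done separately within each grade and subject, as the text describes.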

Limitations

The sample used in this study was a consortium of districts in one state that used student MAP Growth results as a component in their state-mandated educator evaluation system. Because these districts self-selected into this consortium, the overall generalizability of these findings may be limited. In addition, MAP Growth results were used for high-stakes purposes for teachers but were low stakes for students (meaning performance on the assessments did not affect students in any direct manner), which may have contributed to differences in how students, teachers, or schools approached the test-taking process. Further research of this type can help demonstrate if the trends and results we see with this research sample are consistent with what occurs in other school settings. Finally, the manner in which rapid guessing is measured in this article is limited in application to computer-administered assessments. Alternative approaches to measuring rapid guessing would need to be used in replications of this study that utilize test results from traditional pencil-and-paper assessments.

Results

Research Question 1: How Prevalent Was Rapid Guessing?

Results in Table 1 show the proportion of students within the study sample who met the low-effort test event criterion (RTE ≤ .90), with this information presented by test event, subject, and grade. We observed differences in test-taking effort by grade and subject area, with rapid guessing more prevalent in the middle school grades (sixth–eighth) compared with elementary grades, and more common in reading than in mathematics. The percentage of students who met the low-effort test event criterion ranged from 1% of third-grade students in mathematics to 14% of seventh-grade students in reading. The greater prevalence in reading was likely a result of students having to engage with longer text passages on the reading assessment compared with the mathematics assessment, and is consistent with previous research on this topic (Setzer et al., 2013; Wise, 2006; Wise et al., 2009).

Table 1

Proportion of Students With Low-Effort Test Events (RTE ≤ .90)

Grade	Mathematics pretest	Mathematics posttest	Mathematics pretest and posttest	Reading pretest	Reading posttest	Reading pretest and posttest
Overall	.04	.04	.01	.07	.09	.02
3	.01	.02	.00	.02	.06	.00
4	.03	.02	.00	.07	.07	.02
5	.03	.02	.00	.06	.06	.02
6	.02	.05	.00	.05	.09	.02
7	.07	.08	.02	.11	.14	.04
8	.08	.07	.02	.12	.12	.05

Note. RTE = response time effort.

We also observed a small proportion of the overall student sample with low-effort test events at both pre- and posttest, including up to 5% of all eighth-grade reading students. Correlations of student-level RTE at pre- and posttest ranged from .13 in third-grade mathematics to .37 in eighth-grade reading, which indicates a moderate correlation in student-level test behavior across test events. Because the RTE variable is censored, we also cross-tabulated students by whether they met the low-effort test event criterion at pretest and whether they met it at posttest. These results for reading are presented in Table 2 and show that 93% of all students who did not meet the low-effort criterion at pretest also did not meet the criterion at posttest. Conversely, 34% of students who met the low-effort test event criterion at pretest also met that criterion at posttest. These results further suggest that there is some consistency in student test-taking behavior over time, giving increased importance to our effort-moderated estimates, which, unlike Model 2, account for effort at pre- and posttest. Results in mathematics show similar trends.

Table 2

Proportion of Students Who Met or Did Not Meet the Low-Effort Test Event Criterion at Pre- and Posttest in Reading

	No. of test events	Posttest: Non-low-effort test event	Posttest: Low-effort test event
Pretest: Non-low-effort test event
Third	6,705	.94	.06
Fourth	6,291	.95	.05
Fifth	6,838	.95	.05
Sixth	6,283	.92	.08
Seventh	5,831	.90	.10
Eighth	6,420	.91	.09
All grades	38,368	.93	.07
Pretest: Low-effort test event
Third	118	.76	.24
Fourth	454	.74	.26
Fifth	453	.75	.25
Sixth	350	.65	.35
Seventh	740	.60	.40
Eighth	904	.62	.38
All grades	3,019	.66	.34
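
As a rough consistency check, the all-grades rows of Table 2 can be combined to approximately recover the overall reading row of Table 1. The calculation below uses the rounded proportions printed in Table 2, so the results are approximate:

$$
\begin{aligned}
N &= 38{,}368 + 3{,}019 = 41{,}387 \\
\text{Low effort at pretest} &= 3{,}019 / 41{,}387 \approx .07 \\
\text{Low effort at both} &\approx (3{,}019 \times .34) / 41{,}387 \approx 1{,}026 / 41{,}387 \approx .02 \\
\text{Low effort at posttest} &\approx (38{,}368 \times .07 + 3{,}019 \times .34) / 41{,}387 \approx 3{,}712 / 41{,}387 \approx .09
\end{aligned}
$$

These values agree with the overall reading proportions of .07, .02, and .09 reported in Table 1.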

There were also differences in the distribution of low-effort test events in both subject areas across teachers (see Table 3). We present this information by school level based on the differences we observed in the prevalence of low-effort test events in middle school grades (sixth–eighth) compared with the elementary school grades (third–fifth), and by the count and proportion of low-effort test events. In general, we see greater variability across teachers in the upper grades, and in reading compared with the lower grades and mathematics. For example, 70% of elementary teachers had no low-effort test events in mathematics at pretest. By contrast, only 14% of middle school teachers had no low-effort test events in reading at pretest, whereas 33% of teachers had 20% or more of students who met this criterion.

Table 3

Proportion of Teachers With Increasing Counts and Proportions of Low-Effort Test Events

Count	Grade levels	Mathematics teacher n	Mathematics pretest proportion	Mathematics pretest cum.	Mathematics posttest proportion	Mathematics posttest cum.	Reading teacher n	Reading pretest proportion	Reading pretest cum.	Reading posttest proportion	Reading posttest cum.
0	Third–fifth	799	.70	0.70	.69	0.69	781	.50	0.50	.41	0.41
1	Third–fifth		.21	0.91	.21	0.90		.26	0.76	.30	0.71
2	Third–fifth		.06	0.97	.06	0.96		.14	0.90	.13	0.84
3	Third–fifth		.02	0.99	.02	0.98		.05	0.95	.08	0.92
4+	Third–fifth		.01	1.00	.02	1.00		.06	1.00	.07	1.00
Proportion
.00	Third–fifth		.70	0.70	.69	0.69		.50	0.50	.41	0.41
.05	Third–fifth		.13	0.83	.13	0.82		.16	0.66	.16	0.57
.10	Third–fifth		.13	0.96	.14	0.96		.18	0.84	.21	0.78
.20	Third–fifth		.03	0.99	.04	1.00		.12	0.96	.17	0.95
>.20	Third–fifth		.00	1.00	.01	1.00		.04	1.00	.05	1.00
0	Sixth–eighth	372	.29	0.29	.20	0.20	431	.14	0.14	.09	0.09
1	Sixth–eighth		.23	0.52	.25	0.45		.22	0.36	.20	0.29
2	Sixth–eighth		.16	0.68	.16	0.61		.19	0.55	.15	0.44
3	Sixth–eighth		.12	0.80	.12	0.73		.11	0.66	.13	0.57
4+	Sixth–eighth		.21	1.00	.27	1.00		.34	1.00	.43	1.00
Proportion
.00	Sixth–eighth		.29	0.29	.20	0.20		.14	0.14	.09	0.09
.05	Sixth–eighth		.34	0.63	.34	0.54		.26	0.40	.25	0.34
.10	Sixth–eighth		.21	0.84	.27	0.81		.25	0.65	.23	0.57
.20	Sixth–eighth		.11	0.95	.14	0.95		.23	0.88	.27	0.84
>.20	Sixth–eighth		.05	1.00	.05	1.00		.11	1.00	.17	1.00

Note. Cum. = cumulative.

Finally, we also examined the relationship between rapid guessing and student achievement, as it is relevant to understanding how this behavior might influence VAM estimates. In Figure 1, we show this relationship by summarizing mean achievement across subjects at pretest by the mean RTE decile for a teacher’s group of students. These results show a clear relationship between achievement and RTE—the greater the amount of rapid guessing within a classroom, the lower the overall mean achievement. The direction of this relationship makes intuitive sense, given the negative bias that rapid guessing introduces into estimates of student achievement.

Figure 1.

Relationship by RTE deciles between mean achievement and mean RTE at pretest.

Research Question 2: What Was the Effect of Rapid Guessing on VAM Estimates?

Given the differences we observed in the prevalence of rapid guessing across grades, subjects, and teachers, we next evaluated the extent to which rapid guessing influenced teacher value-added estimates. A comparison of VAM estimates between Model 1, our baseline VAM specification, and Model 2, where rapid guessing is controlled for at pretest, is shown in Table 4, with these results presented by subject, school level, and the number of pretest low-effort test events. These results show that even in the most extreme settings—reading results in middle school grades for teachers with four or more low-effort test events—there is minimal divergence in teacher VAM estimates. The average absolute difference between model estimates is .04 standard deviations, and only 2% of teachers had VAM estimates that differed by .25 standard deviations or more (with no teacher estimates differing by .50 standard deviations or more). A comparison of VAM estimates between Model 1 and Model 3 returned similar results (results not shown).

Table 4

Comparison of Teacher-Level Value-Added Estimates by Subject, School Level, and Number of Low-Effort Tests at Pretest, Models 1–2

Subject	Low-effort test events	Grade levels	Teacher n	Correlation	Proportion .25 SD absolute difference	Proportion .50 SD absolute difference	Average absolute SD difference	Maximum absolute SD difference
Mathematics	0	Third–fifth	563	1	.00	.00	.02	.10
	1	Third–fifth	168	0.99	.00	.00	.03	.14
	2	Third–fifth	44	0.99	.00	.00	.06	.14
	3	Third–fifth	15	0.99	.00	.00	.07	.20
	4+	Third–fifth	9	0.99	.00	.00	.11	.24
Mathematics	0	Sixth–eighth	107	0.99	.01	.00	.03	.26
	1	Sixth–eighth	87	0.99	.00	.00	.04	.17
	2	Sixth–eighth	58	0.99	.00	.00	.04	.20
	3	Sixth–eighth	43	0.99	.02	.00	.04	.25
	4+	Sixth–eighth	77	0.99	.06	.00	.09	.48
Reading	0	Third–fifth	389	1	.00	.00	.02	.09
	1	Third–fifth	200	0.99	.00	.00	.02	.10
	2	Third–fifth	106	0.99	.00	.00	.03	.14
	3	Third–fifth	41	0.99	.00	.00	.04	.18
	4+	Third–fifth	45	0.99	.02	.00	.08	.29
Reading	0	Sixth–eighth	62	0.99	.00	.00	.02	.15
	1	Sixth–eighth	94	0.99	.00	.00	.02	.11
	2	Sixth–eighth	83	0.99	.00	.00	.02	.12
	3	Sixth–eighth	47	0.99	.00	.00	.03	.11
	4+	Sixth–eighth	145	0.99	.02	.00	.04	.37

Note. Teachers with fewer than 10 total students were excluded from these analyses. Correlations in this table show the relationship between VAM estimates from Models 1 and 2. VAM = value-added models.

Not surprisingly, the minimal differences in VAM estimates resulted in little re-sorting of teachers across the performance distribution based on teacher rankings generated from Model 1 and Model 2 estimates. These results are presented in Table 5 and show that across grades and subjects, teacher rankings generated from Model 1 estimates are highly consistent with rankings from Model 2, including at the extremes of the performance distribution (i.e., top and bottom 5% of teachers). For nearly all teachers, the performance category in which they were ranked under the Model 1 condition was the same as their ranking under the Model 2 condition.

Table 5

Changes to Rank Ordering of Teachers Based on Model 1 and Model 2 VAM Estimates

	Subject	Grade levels	Teacher group (%)	≤5%	6%–25%	26%–74%	75%–94%	≥95%	Total teachers
				Model 2
Model 1	Mathematics	Third–fifth	≤5	0.98	0.03	0.00	0.00	0.00	40
	Mathematics	Third–fifth	6–25	0.01	0.98	0.02	0.00	0.00	160
	Mathematics	Third–fifth	26–74	0.00	0.01	0.97	0.02	0.00	392
	Mathematics	Third–fifth	75–94	0.00	0.00	0.04	0.95	0.01	160
	Mathematics	Third–fifth	≥95	0.00	0.00	0.00	0.02	0.98	47
				Model 2
Model 1	Mathematics	Sixth–eighth	≤5	0.94	0.06	0.00	0.00	0.00	18
	Mathematics	Sixth–eighth	6–25	0.01	0.95	0.04	0.00	0.00	75
	Mathematics	Sixth–eighth	26–74	0.00	0.02	0.96	0.02	0.00	183
	Mathematics	Sixth–eighth	75–94	0.00	0.00	0.05	0.95	0.00	74
	Mathematics	Sixth–eighth	≥95	0.00	0.00	0.00	0.00	1.00	22
				Model 2
Model 1	Reading	Third–fifth	≤5	1.00	0.00	0.00	0.00	0.00	39
	Reading	Third–fifth	6–25	0.00	0.97	0.03	0.00	0.00	156
	Reading	Third–fifth	26–74	0.00	0.01	0.98	0.01	0.00	383
	Reading	Third–fifth	75–94	0.00	0.00	0.03	0.96	0.01	157
	Reading	Third–fifth	≥95	0.00	0.00	0.00	0.04	0.96	46
				Model 2
Model 1	Reading	Sixth–eighth	≤5	0.95	0.05	0.00	0.00	0.00	21
	Reading	Sixth–eighth	6–25	0.01	0.98	0.01	0.00	0.00	87
	Reading	Sixth–eighth	26–74	0.00	0.01	0.98	0.01	0.00	211
	Reading	Sixth–eighth	75–94	0.00	0.00	0.03	0.94	0.02	87
	Reading	Sixth–eighth	≥95	0.00	0.00	0.00	0.08	0.92	25

Note. VAM = value-added models.

We also examined the stability of rankings based on teacher VAM estimates by comparing Model 1 estimates with Model 3 estimates. Results from this comparison are shown in Table 6 and show trends similar to those observed in the previous comparison: The relative ranking of teachers is not significantly affected for the majority of teachers across the performance distribution based on the use of these two different VAM estimates. There was a greater amount of re-sorting of middle school reading teachers at the high end of the performance distribution, with 20% of those teachers identified in the top 5% of teachers under Model 1 conditions shifting into the next lowest performance category under Model 3 conditions. These teachers had the highest amount of rapid guessing for their students at pretest, and therefore the highest percentage of pretest items dropped from student achievement estimates. The positive change in mean pretest achievement, as well as the noise introduced due to item-level sampling variability, likely both contributed to this downward shift in teacher rankings for teachers at the high extreme of the performance distribution. However, across both sets of comparisons, there are no instances of teacher rankings shifting more than one performance category in either direction.

Table 6

Changes to Rank Ordering of Teachers Based on Model 1 and Model 3 VAM Estimates

	Subject	Grade levels	Teacher group (%)	≤5%	6%–25%	26%–74%	75%–94%	≥95%	Total teachers
				Model 3
Model 1	Mathematics	Third–fifth	≤5	0.98	0.03	0.00	0.00	0.00	40
	Mathematics	Third–fifth	6–25	0.01	0.98	0.01	0.00	0.00	160
	Mathematics	Third–fifth	26–74	0.00	0.01	0.97	0.02	0.00	392
	Mathematics	Third–fifth	75–94	0.00	0.00	0.05	0.95	0.00	160
	Mathematics	Third–fifth	≥95	0.00	0.00	0.00	0.00	1.00	47
				Model 3
Model 1	Mathematics	Sixth–eighth	≤5	0.94	0.06	0.00	0.00	0.00	18
	Mathematics	Sixth–eighth	6–25	0.01	0.89	0.09	0.00	0.00	75
	Mathematics	Sixth–eighth	26–74	0.00	0.04	0.93	0.03	0.00	183
	Mathematics	Sixth–eighth	75–94	0.00	0.00	0.07	0.93	0.00	74
	Mathematics	Sixth–eighth	≥95	0.00	0.00	0.00	0.00	1.00	22
				Model 3
Model 1	Reading	Third–fifth	≤5	0.95	0.05	0.00	0.00	0.00	39
	Reading	Third–fifth	6–25	0.01	0.93	0.06	0.00	0.00	156
	Reading	Third–fifth	26–74	0.00	0.02	0.96	0.02	0.00	383
	Reading	Third–fifth	75–94	0.00	0.00	0.04	0.93	0.03	157
	Reading	Third–fifth	≥95	0.00	0.00	0.00	0.11	0.89	46
				Model 3
Model 1	Reading	Sixth–eighth	≤5	0.95	0.05	0.00	0.00	0.00	21
	Reading	Sixth–eighth	6–25	0.01	0.86	0.13	0.00	0.00	87
	Reading	Sixth–eighth	26–74	0.00	0.05	0.92	0.03	0.00	211
	Reading	Sixth–eighth	75–94	0.00	0.00	0.08	0.86	0.06	87
	Reading	Sixth–eighth	≥95	0.00	0.00	0.00	0.20	0.80	25

Note. VAM = value-added models.

Discussion

The results from this study highlight two broad findings related to the prevalence of student rapid-guessing behavior on assessments, and how that behavior may affect estimates of teacher quality: First, using data that identified when students rapidly guessed on test items, we found that the prevalence of rapid guessing varied across grades and subject areas. Students in upper grades were more likely to rapidly guess compared with students in lower grades, and this behavior was more common in reading than in mathematics. At the extreme, 14% of seventh-grade students provided rapid guesses to 10% or more of the items on their posttest reading assessment. We also found that rapid guessing was not uniform across teachers, as there were some teachers with greater rapid-guessing rates compared with others.

However, our second finding is that the rapid guessing found in our sample did not appear to have a substantive influence on teacher value-added estimates. VAMs that accounted for rapid guessing in differing ways returned estimates highly similar to those generated from models that did not directly account for student rapid guessing. Estimates from effort-adjusted models also returned rank orderings of teachers consistent with rankings from baseline VAM estimates. While not shown in this article, we also generated teacher VAM estimates using controls for rapid-guessing behavior at both pre- and posttest (not using effort-moderated scores), and those results were also highly similar to the results presented in this article. Thus, these results suggest that under these VAM specifications, rapid guessing had minimal influence on teacher value-added estimates.

There are several plausible reasons for this that could be investigated in future research: First, VAMs, such as those used in this research, may already indirectly account for student rapid-guessing behavior. When students rapidly guess, there is a subsequent negative biasing effect on student achievement estimates. As most VAMs control for lagged student test scores, these models may already capture whatever component of rapid guessing is persistent.

The minimal impact on teacher VAM estimates may also be a function of rapid guessing not being extreme and clustered enough to have a meaningful effect. Even if students are nonrandomly assigned to teachers in ways correlated with test-taking effort, there would need to be significant differences in rapid-guessing rates across teachers for this behavior to differentially influence VAM estimates. While we do observe variability in rapid guessing across teachers, our results show that the average overall rapid-guessing rates are still quite low even for those teachers with the most extreme occurrences of this behavior (see Figure 1). Thus, this behavior may not be great enough in the aggregate to substantively influence the stability of teacher VAM estimates, especially with VAMs that control for lagged student achievement.

One remaining and important question we did not examine in this study, but intend to address in future work, is whether maintaining or improving student test-taking effort is within a teacher’s direct control. That is, do differences in rapid-guessing rates reflect differences in the students assigned to these teachers, or are these differences an explicit signal of teacher effectiveness? Regardless of why rapid guessing occurs, and where the responsibility lies for improving and maintaining student test-taking effort, it would be prudent for educators to actively monitor this student behavior. Because rapid-guessing behavior can negatively bias student achievement and growth estimates, and countless educational decisions are made for students based on test results, educators need to attend to the test-taking effort of their students. School systems may also identify incentives that can be tied to these assessments, especially for those assessments with little or no stakes for students, so that students are further encouraged to try their best on achievement tests.

In summary, the results of this research should help mitigate potential concerns from educators and policymakers about how student rapid guessing may influence the validity of teacher value-added estimates. Our findings suggest that rapid guessing has minimal influence on VAM estimates, especially when models used account for lagged achievement. Further research can show if these findings are unique to the specific modeling approaches or measures of student test-taking effort used in this research and if rapid guessing differentially affects interpretations of teacher performance in school systems that do not use value-added modeling in their accountability framework.

Supplementary Material

Supplemental material, EEPA759600_Appendix for The Influence of Rapidly Guessed Item Responses on Teacher Value-Added Estimates: Implications for Policy and Practice by Nate Jensen, Andrew Rice and James Soland in Educational Evaluation and Policy Analysis

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Authors

NATE JENSEN, PhD, is a senior research scientist at NWEA. His primary research interests include teacher and school accountability, test motivation, best practices in the application and use of student testing data, and the relationship between attendance and student achievement growth.

ANDREW RICE, PhD, is the vice president of research and operations at Education Analytics. His primary research interests include school accountability models, assessment data, educator impact measurement, student growth measurement, human capital management metrics, and social-emotional learning metrics.

JAMES SOLAND, PhD, is a research scientist at NWEA. His research focuses on educational assessment, evaluation, and accountability. In particular, he explores how to measure social-emotional learning and test motivation, including how both influence estimates of teacher and school effectiveness.

References

Bradlow

E. T.

Weiss

R. E.

Cho

(1998). Bayesian identification of outliers in computerized adaptive testing. Journal of the American Statistical Association, 93, 910–919.

Cao

Stokes

S. L.

(2008). Bayesian IRT guessing models for partial guessing behaviors. Psychometrika, 73, 209–230.

Dee

T. S.

Wyckoff

(2015). Incentives, selection, and teacher performance: Evidence from IMPACT. Journal of Policy Analysis and Management, 34, 267–297.

Doherty

K. M.

Jacobs

(2013). Connect the dots: Using evaluations of teacher effectiveness to inform policy and practice. State of the States 2013. Washington, DC: National Council on Teacher Quality.

Eklöf

Pavešič

B. J.

Grønmo

L. S.

(2014). A cross-national comparison of reported effort and mathematics performance in TIMSS advanced. Applied Measurement in Education, 27, 31–45.

Fuller

W. A.

(2009). Measurement error models (Vol. 305). New York, NY: John Wiley.

Goldhaber

Chaplin

D. D.

(2015). Assessing the “Rothstein Falsification Test”: Does it really show teacher value-added models are biased? Journal of Research on Educational Effectiveness, 8, 8–34.

Guarino

C. M.

Maxfield

Reckase

M. D.

Thompson

P. N.

Woolridge

J. M.

(2015). An evaluation of empirical Bayes’s estimation of value-added teacher performance measures. Journal of Educational and Behavioral Statistics, 40, 190–222.

Hanushek

E. A.

(2011). The economic value of higher teacher quality. Economics of Education Review, 30, 466–479.

10.

Hanushek

E. A.

Kain

J. F.

O’Brien

D. M.

Rivkin

S. G.

(2005). The market for teacher quality (No. w11154). Cambridge, MA: National Bureau of Economic Research.

11.

Herrmann

Walsh

Isenberg

Resch

(2013). Shrinkage of value-added estimates and characteristics of students with hard-to-predict achievement levels. Washington, DC: Mathematica Policy Research.

12.

Kalogrides

Loeb

Béteille

(2013). Systematic sorting teacher characteristics and class assignments. Sociology of Education, 86, 103–123.

13.

Kane

T. J.

Staiger

D. O.

(2008). Estimating teacher impacts on student achievement: An experimental evaluation (No. w14607). Cambridge, MA: National Bureau of Economic Research.

14.

Koedel

Betts

J. R.

(2011). Does student sorting invalidate value-added models of teacher effectiveness? An extended analysis of the Rothstein critique. Education, 6(1), 18–42.

15.

Kong

X. J.

Wise

S. L.

Bhola

D. S.

(2007). Setting the response time threshold parameter to differentiate solution behavior from rapid-guessing behavior. Educational and Psychological Measurement, 67, 606–619.

16.

Lockwood

J. R.

McCaffrey

D. F.

(2014). Correcting for test score measurement error in ANCOVA models for estimating treatment effects. Journal of Educational and Behavioral Statistics, 39, 22–52.

17.

Loeb

Kalogrides

Béteille

(2012). Effective schools: Teacher hiring, assignment, development, and retention. Education, 7, 269–304.

18.

Loeb

Soland

Fox

(2014). Is a good teacher a good teacher for all? Comparing value-added of teachers with their English learners and non-English learners. Educational Evaluation and Policy Analysis, 36, 457–475.

19.

McCaffrey

D. F.

Lockwood

J. R.

Koretz

Louis

T. A.

Hamilton

(2004). Models for value-added modeling of teacher effects. Journal of Educational and Behavioral Statistics, 29, 67–101.

20.

Meijer

R. R.

Sijtsma

(2001). Methodology review: Evaluating person fit. Applied Psychological Measurement, 25, 107–135.

21.

Paufler

N. A.

Amrein-Beardsley

(2014). The random assignment of students into elementary classrooms: Implications for value-added analyses and interpretations. American Educational Research Journal, 51, 328–362.

22.

Rao

J. N.

Molina

(2015). Small area estimation. New York, NY: John Wiley.

23.

Rios

J. A.

Guo

Mao

Liu

O. L.

(2017). Evaluating the impact of careless responding on aggregated-scores: To filter unmotivated examinees or not? International Journal of Testing, 17, 74–104.

24.

Rios

J. A.

Liu

O. L.

Bridgeman

(2014). Identifying low-effort examinees on student learning outcomes assessment: A comparison of two approaches. New Directions for Institutional Research, 2014(161), 69–82.

25.

Rothstein

(2009). Student sorting and bias in value-added estimation: Selection on observables and unobservables. Education, 4, 537–571.

26. Setzer, J. C., Wise, S. L., van den Heuvel, J. R., & Ling (2013). An investigation of examinee test-taking effort on a large-scale assessment. Applied Measurement in Education, 26, 34–49.

27. Soland, Jensen, Keys, & Wolk (2017). Are test and academic engagement related? A case study using rapid guessing. Manuscript submitted for publication.

28. van Krimpen-Stoop, E. M. L. A., & Meijer, R. R. (2000). Detecting person-misfit in adaptive testing using statistical process control techniques. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 201–219). Boston, MA: Kluwer.

29. Wise, S. L. (2006). An investigation of the differential effort received by items on a low-stakes computer-based test. Applied Measurement in Education, 19, 95–114.

30. Wise, S. L. (2015). Effort analysis: Individual score validation of achievement test data. Applied Measurement in Education, 28, 237–252.

31. Wise, S. L., & DeMars, C. E. (2005). Low examinee effort in low-stakes assessment: Problems and potential solutions. Educational Assessment, 10, 1–17.

32. Wise, S. L., & DeMars, C. E. (2006). An application of item response time: The effort-moderated IRT model. Journal of Educational Measurement, 43, 19–38.

33. Wise, S. L., & DeMars, C. E. (2010). Examinee noneffort and the validity of program assessment results. Educational Assessment, 15, 27–41.

34. Wise, S. L., & Kingsbury, G. G. (2016). Modeling student test-taking motivation in the context of an adaptive achievement test. Journal of Educational Measurement, 53, 86–105.

35. Wise, S. L., Kingsbury, G. G., & Hauser (2009, April). How do I know that this score is valid? The case for assessing individual score validity. Paper presented at the meeting of the National Council on Measurement in Education, San Diego, CA.

36. Wise, S. L., & Kong (2005). Response time effort: A new measure of examinee motivation in computer-based tests. Applied Measurement in Education, 18, 163–183.

37. Wise, S. L. (2012, April). Setting response time thresholds for a CAT item pool: The normative threshold method. Paper presented at the meeting of the National Council on Measurement in Education, Vancouver, British Columbia, Canada.
