Abstract
In this article, the change in examinee effort during an assessment, which we will refer to as persistence, is modeled as an effect of item position. A multilevel extension is proposed to analyze hierarchically structured data and decompose the individual differences in persistence. Data from the 2009 Program for International Student Assessment (PISA) reading assessment from N = 467,819 students from 65 countries are analyzed with the proposed model, and the results are compared across countries. A decrease in examinee effort during the PISA reading assessment was found consistently across countries, with individual differences within and between schools. Both the decrease and the individual differences are more pronounced in lower performing countries. Within schools, persistence is slightly negatively correlated with reading ability; but at the school level, this correlation is positive in most countries. The results of our analyses indicate that it is important to model and control examinee effort in low-stakes assessments.
Introduction
Educational policymakers attach great importance to the outcomes of large-scale international assessments such as the Program for International Student Assessment (PISA) and the Trends in International Mathematics and Science Study (TIMSS). Performance changes in these studies are used to evaluate and develop educational programs and policies. In contrast with the high-stakes implications attached to the results, test takers commonly perceive these assessments as low stakes, as there are no personal consequences related to their performance on the test. This might cause some test takers to expend low effort during the assessment, which can result in biased and invalid measurements. Therefore, a high-stakes question arises for low-stakes assessments (Barry, Horst, Finney, Brown, & Kopp, 2010): Does examinee effort, and the differences therein, pose a threat to the validity of international assessments?
Most research regarding the issue of low examinee effort in low-stakes assessments addresses the issues of manipulating examinee effort, accounting for low examinee effort, and measuring examinee effort (e.g., Steedle, 2014; Swerdzewski, Harmes, & Finney, 2011; Waskiwicz, 2011; Wise & DeMars, 2005; Wise & Kong, 2005). In these studies, examinee effort is commonly seen as constant throughout the assessment. However, research has shown that performance in large-scale assessments can decrease (e.g., Hohensinn et al., 2008; Meyers, Miller, & Way, 2009), possibly due to fatigue or a decline in motivation. Hence, it seems likely that, during testing, a change in examinee effort can take place.
This article addresses the change in examinee effort in large-scale, low-stakes assessments. An item response theory (IRT) model for effects of item position (e.g., Debeer & Janssen, 2013; Hartig & Buchholz, 2012) is proposed to investigate changes in examinee effort. The model is extended to fit the hierarchical data structure that is commonly present in international assessments. The extended model will be applied to data from the 2009 PISA reading assessment to examine the change in examinee effort during testing. Differences in this change within and between countries will be assessed, and their relation with the PISA country score will be investigated.
In the following, examinee effort and its relation to performance will be discussed first. Then, it will be explained how this relation can cause validity problems, especially in low-stakes assessments. A brief overview of the current methods and techniques for dealing with this issue will be given. Finally, an IRT model for a change in examinee effort and its multilevel extension will be proposed.
Examinee Effort in Low-Stakes Assessments
Examinee effort or test motivation refers to “a student’s engagement and expenditure of energy toward the goal of attaining the highest possible score on the test” (Wise & DeMars, 2005, p. 2). A high expenditure of energy is needed for demanding tasks, such as responding to test items in an achievement test. When examinee effort is low, a test taker will not fully engage his or her ability, which will lead to a worse performance than what could be expected, given the test taker’s ability. This relation between examinee effort and performance has been repeatedly found (e.g., Abdelfattah, 2010; Liu, Bridgeman, & Adler, 2012; Steedle, 2014; Swerdzewski et al., 2011; Waskiwicz, 2011; Wise & DeMars, 2005).
The expectancy-value model proposed by Eccles and Wigfield (e.g., Eccles, 1983; Wigfield & Eccles, 2000) provides a useful perspective for understanding examinee effort in testing situations. According to this model, many test takers will hold weak value beliefs on the tests in the context of a low-stakes assessment because there are no consequences or personal benefits associated with student performance. A weak value belief combined with the awareness of the costs associated with the assessment will—according to the expectancy-value model—lead to low examinee effort.
This theoretical prediction has been empirically confirmed in several studies. When test takers do not perceive the importance or usefulness of an exam, their test-taking effort will be lower (Cole, Bergin, & Whittaker, 2008). Similarly, when students are asked to evaluate their testing motivation after completing a low-stakes assessment, they indicate that the effort they exerted was lower than the effort they would exert when the assessment was high stakes (Butler & Adams, 2007). Eklöf, Pavešič, and Grønmo (2014) found that the reported examinee effort on the low-stakes 2008 TIMSS test was on average low and that there was a relationship between reported effort and test performance.
Examinee effort is on average not only lower in low-stakes assessments compared to high-stakes assessments, but it is also likely to be more variable (Barry et al., 2010). Because examinee effort is related to test performance, and low examinee effort tends to result in a distorted ability estimate, the exerted effort can be a source of construct-irrelevant variance (Haladyna & Downing, 2004). Therefore, the relation between examinee effort and performance together with the variability of examinee effort can threaten the validity of test scores in low-stakes assessments (Wise & DeMars, 2010).
Measuring Examinee Effort
Different methods and techniques have been proposed for measuring examinee effort. A first strategy is to use self-report questionnaires after the assessment, such as the Student Opinion Survey (SOS; Sundre & Moore, 2002; Wolf & Smith, 1995), which was found to yield high values (mid- to upper .80s) for coefficient α in college samples (Sundre & Moore, 2002), or the Effort Thermometer (Kunter et al., 2002) used in PISA studies. Self-report measures, however, may have accuracy and validity problems (Wise & DeMars, 2005). Less motivated students may respond more carelessly or untruthfully. Moreover, low-performing students may attribute their performance to low effort instead of to their ability level (Wise & Kong, 2005).
Wise and Kong (2005) proposed an alternative strategy to measure examinee effort, namely response time effort (RTE), which is a reaction time–based measure used in computer-based testing. RTE supposes that there are two distinct response behaviors: solution behavior and rapid-guessing behavior, which are assumed to correspond to high and low effort, respectively. By setting a response time threshold for every item, the response behavior is classified as follows: slower than (or equal to) the threshold is regarded as solution behavior and faster than the threshold as rapid-guessing behavior. The proportion of items for which a test taker is classified into solution behavior gives a test taker’s RTE. The applicability of RTE as a measurement of examinee effort has been repeatedly demonstrated (Silm, Must, & Taeht, 2013; Steedle, 2014; Swerdzewski et al., 2011; Wise, Pastor, & Kong, 2009). The issue of setting the response time threshold has been addressed by Kong, Wise, and Bhola (2007). However, RTE is not without problems. It requires response time information, which is not available in many low-stakes assessments, and uses a deterministic classification of response behavior. Moreover, because it equates low examinee effort to rapid guessing, it assumes that solution behavior is not affected by low examinee effort.
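The classification rule behind RTE can be illustrated with a short sketch. It only mirrors the verbal description above (a response at or above the item threshold counts as solution behavior); the function name, the response times, and the thresholds are hypothetical and not taken from Wise and Kong (2005).

```python
def response_time_effort(response_times, thresholds):
    """Proportion of items classified as solution behavior for one test taker.

    response_times and thresholds are per-item values in the same unit
    (e.g., seconds). Both lists are illustrative inputs, not real data.
    """
    if len(response_times) != len(thresholds):
        raise ValueError("need one threshold per item")
    if not response_times:
        raise ValueError("need at least one item")
    # Slower than or equal to the threshold -> solution behavior;
    # faster than the threshold -> rapid-guessing behavior.
    solution_flags = [t >= thr for t, thr in zip(response_times, thresholds)]
    return sum(solution_flags) / len(solution_flags)


# Hypothetical test taker: one of four responses is faster than its threshold.
rte = response_time_effort([12.0, 2.0, 30.0, 5.0], [5.0, 5.0, 5.0, 5.0])
```

Because each response is assigned to exactly one of two classes, the sketch also makes the deterministic nature of the classification visible.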
Dealing With Low Examinee Effort
Several procedures have been suggested to deal with the issue of low examinee effort. One approach is to manipulate the students’ test-taking motivation, for instance, by increasing the stakes of the assessment by making the test performance part of the grading system or by explaining the importance of the low-stakes assessments. Different manipulating strategies have been shown to improve test-taking motivation and increase test performance (Liu et al., 2012; Wise & DeMars, 2005).
A second approach is motivation filtering. Unmotivated test takers or test takers exerting low effort are deleted from the sample (Sundre & Wise, 2003; Wise & DeMars, 2005). Two important assumptions are made: first, that it is possible to detect the low-effort test takers and validly measure examinee effort and, second, that there is no relation between test-taking effort and the actual level of proficiency. Results show that motivation filtering increases the average test performance (Steedle, 2014; Swerdzewski et al., 2011; Wise & DeMars, 2010; Wise, Wise, & Bhola, 2006), both when a self-report questionnaire and when RTE is used to measure examinee effort. Rios, Liu, and Bridgeman (in press) showed, however, that RTE filtering had a slightly stronger relationship with test performance.
A third way to address the low-effort issue is to include test-taking effort in the measurement model. Both Wise and DeMars (2006) and Meyer (2010) proposed an IRT model that incorporates the response time to classify item responses into rapid-guessing behavior and solution behavior. Within these models, it is assumed that low examinee effort is related to rapid guessing, and therefore, a very quick response time can be seen as a proxy for low examinee effort. Although both models can increase the validity of the proficiency measurement, the problems mentioned with regard to RTE also exist here.
Another model worth mentioning is the model of Goegebeur, De Boeck, Wollack, and Cohen (2008), as it also jointly models guessing behavior and problem-solving behavior. The model assumes that during an assessment, there may be a gradual shift from problem-solving behavior to guessing behavior starting at a person-specific speededness point in the test. An advantage of the model is that response accuracy is modeled and no response time information is needed. However, this model was explicitly proposed for speeded tests to model the increase in rapid-guessing behavior. Because low-stakes tests are commonly designed to be nonspeeded, this model is less apt for modeling low examinee effort in low-stakes testing.
Change in Examinee Effort During Testing
Most studies on test-taking motivation and examinee effort implicitly assume that motivation or effort does not change during testing. The definition for examinee effort (Wise & DeMars, 2005) and the expectancy-value model (e.g., Eccles, 1983; Wigfield & Eccles, 2000), however, do not restrict examinee effort to be constant during an assessment. Moreover, it seems rather likely that the effort a test taker expends to solve individual items is not the same for every item. Given that in longer assessments test takers can become fatigued or less motivated, a downward trend in examinee effort can be expected, rather than random changes in examinee effort during testing. Indeed, studies using response time as an indicator of rapid-guessing behavior and low examinee effort indicate that one of the best predictors of rapid guessing is the position of the item in a test (Wise, 2006; Wise et al., 2009).
Modeling change in examinee effort
Debeer and Janssen (2013) proposed an IRT-based framework to model proficiency and change in performance related to item position during testing. A possible interpretation of this change dimension is a change in examinee effort that can vary over persons and that can affect performance during the assessment. In order to apply their framework, items have to appear in different positions to disentangle the effects of item difficulty and item position. Hence, the model is only applicable when the test consists of (partly overlapping) test forms, and item orders are different across test forms, or when item parameters are known.
A one-parameter logistic version of the model of Debeer and Janssen (2013) with a linear item position effect for an assessment with P test takers and I binary test items that can be administered in K positions, reads as:
\[
\operatorname{logit}\left[P(Y_{pik} = 1)\right] = \theta_p - \beta_i + (\gamma + \delta_p)(k - 1) \qquad (1)
\]
$Y_{pik}$ is the response of person $p$ to item $i$, which was administered at position $k$. $\theta_p$ is the proficiency of person $p$, and $\beta_i$ is the difficulty of item $i$ when administered at the first position of the test. The linear effect of item position $(\gamma + \delta_p)$ is the change in performance that takes place during testing, where $\gamma$ is the average change and $\delta_p$ is the deviation from this average for person $p$. The individual change in performance will be referred to as persistence (cf. Hartig & Buchholz, 2012). A positive value for $(\gamma + \delta_p)$ indicates an increase in performance, a negative value a decrease. $\theta_p$ and $\delta_p$ follow a bivariate normal distribution with variances $\sigma^2_{\theta}$ and $\sigma^2_{\delta}$ and correlation $\rho_{\theta\delta}$.
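To make the role of each parameter concrete, the response probability under the one-parameter model with a linear position effect can be sketched as follows. The functional form is read off the parameter definitions given here (item difficulty defined at the first position, a positive slope meaning an increase in performance); the function name and the numeric values are illustrative assumptions only.

```python
import math


def p_correct(theta, beta, gamma, delta, k):
    """Probability of a correct response at position k (k = 1, 2, ...).

    theta: proficiency of the person
    beta:  item difficulty at the first position
    gamma: average linear change per position
    delta: person-specific deviation from gamma
    """
    # The position term vanishes at k = 1, so beta is the first-position difficulty.
    logit = theta - beta + (gamma + delta) * (k - 1)
    return 1.0 / (1.0 + math.exp(-logit))


# Illustrative values: an average person facing an average item, with a
# hypothetical negative persistence of -0.2 logits per position.
first = p_correct(theta=0.0, beta=0.0, gamma=-0.2, delta=0.0, k=1)
later = p_correct(theta=0.0, beta=0.0, gamma=-0.2, delta=0.0, k=4)
```

With a negative value of gamma + delta, the same item becomes harder to answer correctly the later it appears; a person whose delta offsets gamma shows no position effect at all.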
Multilevel extension
Large-scale international assessments, such as PISA, often use a systematic stratified sampling procedure that results in a hierarchical data structure. In the case of PISA, students are nested within schools. The model in Equation 1 can be hierarchically extended, resulting in a multilevel decomposition of the random effects. More specifically, the variance and covariance of ability and persistence are decomposed into a between-school part and a within-school part:
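\[
\operatorname{logit}\left[P(Y_{psik} = 1)\right] = (\theta_s + \theta_{ps}) - \beta_i + (\gamma + \delta_s + \delta_{ps})(k - 1) \qquad (2)
\]
For person $p$ in school $s$, $\theta_s$ and $\theta_{ps}$ represent the between-school part and the within-school part for ability, respectively. The same holds for the persistence parameters $\delta_s$ and $\delta_{ps}$. It is assumed that $\theta_s$ and $\delta_s$ follow a bivariate normal distribution over schools with variances $\sigma^2_{\theta_s}$ and $\sigma^2_{\delta_s}$ and correlation $\rho_{\theta\delta_s}$, and that $\theta_{ps}$ and $\delta_{ps}$ follow a bivariate normal distribution over students within schools with variances $\sigma^2_{\theta_{ps}}$ and $\sigma^2_{\delta_{ps}}$ and correlation $\rho_{\theta\delta_{ps}}$.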
The multilevel version of the model may help in providing insights into the nature of the change in examinee effort. Using the multilevel decomposition, it is possible to investigate whether the variance in persistence is located at the school level or at the individual level. For example, schools may differ in stressing the high-stakes implications, resulting in different “testing climates” between schools. The correlation between persistence and ability can also be investigated within and between schools.
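The decomposition can be illustrated by drawing school-level and student-level effects separately and adding them up. This is a sketch under stated assumptions: the standard deviations and correlations below are hypothetical placeholders, not estimates from the PISA data, and the helper names are invented for this illustration.

```python
import math
import random


def draw_bivariate(rng, sd_ability, sd_persistence, rho):
    """Draw one (ability, persistence) pair from a zero-mean bivariate normal."""
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    ability = sd_ability * z1
    # Mix z1 and z2 so that the two effects correlate at rho.
    persistence = sd_persistence * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
    return ability, persistence


def three_level_logit(theta_s, theta_ps, delta_s, delta_ps, beta, gamma, k):
    """Logit of a correct response: school and student parts are summed."""
    return (theta_s + theta_ps) - beta + (gamma + delta_s + delta_ps) * (k - 1)


rng = random.Random(2009)
# Hypothetical values: positive correlation between schools, negative within schools.
theta_s, delta_s = draw_bivariate(rng, sd_ability=0.6, sd_persistence=0.05, rho=0.4)
theta_ps, delta_ps = draw_bivariate(rng, sd_ability=0.8, sd_persistence=0.15, rho=-0.15)
logit_last_cluster = three_level_logit(theta_s, theta_ps, delta_s, delta_ps,
                                       beta=0.0, gamma=-0.17, k=4)
```

Separating the two draws is what allows the within-school and between-school correlations to differ in sign, as in the hypothetical values used here.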
The Present Study
Hartig and Buchholz (2012) investigated the decrease in performance in the PISA 2006 science assessment in 10 of the 57 participating countries using the model in Equation 1. They found a significant negative effect of item position, consistently across the 10 countries, but with more prominent effects in countries with lower national performance levels. Although science ability and persistence were practically uncorrelated in high-performing countries, a negative correlation was found in lower performing countries. This study intends to generalize their findings for the 2009 PISA reading assessment data for all participating countries with the multilevel extended model (Equation 2).
PISA 2009 reading assessment
The Program for International Student Assessment (PISA) is a triennial system of international assessments that focus on the competencies of 15-year-olds in reading, mathematics, and science literacy. In 2009, reading literacy was the major domain. Because PISA uses a rotated block design, students were administered only a part of all the reading items. Clusters of items were presented at different cluster positions across students. This is a prerequisite for investigating effects of item position and the change in examinee effort during testing.
Research questions
It can be assumed that the effects of item position observed in the science assessment (Hartig & Buchholz, 2012) are of a general nature and that there are no reasons to believe that they differ from the effects found within other domains. Hence, the following hypothesis can be formulated. We expect a general negative effect of cluster position on reading performance (Hypothesis 1) that indicates a decrease in examinee effort during the assessment.
Second, given previous results (Debeer & Janssen, 2013; Hartig & Buchholz, 2012), we expect that there are individual differences in the decrease in examinee effort (Hypothesis 2). Further, we will examine the variability in persistence within and between schools. We hypothesize that most of the variance is found within schools (Hypothesis 2a). Moreover, because we expect that school regime and (implicit) test expectations differ between schools, at least a part of the variance in persistence should be related to the school level (Hypothesis 2b).
Third, the correlation between persistence and reading ability is estimated. Given the findings of Hartig and Buchholz (2012), we expect a small or no correlation between ability and persistence in the reading assessment (Hypothesis 3), both within (Hypothesis 3a) and between (Hypothesis 3b) schools.
Finally, the results will be compared across all countries participating in PISA 2009. By relating the national reading score for a country to the results of the analyses, more insights into the nature and relevance of the effects might be obtained, and differences between high- and low-performing countries can be observed. Hartig and Buchholz (2012) found that the individual differences in persistence are more pronounced in lower performing countries, and that the negative correlation between persistence and science ability is stronger in low-performing countries, while there was no correlation in high-performing countries. We expect to find similar results in PISA 2009, across all countries (Hypothesis 4).
Method
Participants
In total, 467,819 students from 65 countries participated in the PISA 2009 assessment. Within each country, students were drawn through a two-tiered stratified sampling process consisting of a systematic sampling of individual schools with a probability proportional to the school size, from which 35 students were randomly selected. More details about the sampling procedure can be found in the PISA 2009 technical report (Organization for Economic Cooperation and Development [OECD], 2012).
Procedure
In the assessment, there were 218 test items (131 reading, 34 math, and 53 science). The items were partitioned into 13 item clusters: 7 for reading (R1–R7), 3 for math (M1–M3), and 3 for science (S1–S3). Each cluster represented 30 minutes of test time. Countries that were expected to have a lower reading score were offered the option of administering an easier set of items. For those countries, two of the standard reading clusters (R3A and R4A) were substituted with two easier reading clusters (R3B and R4B). The sets of items in the standard and easier clusters were matched in terms of the distribution of text format, aspect, and item format. The other 11 clusters were administered in all countries. In total, 20 countries opted to administer the easier clusters.
The items were presented to students in 13 standard test booklets (Booklets 1–13) and 7 easier booklets (Booklets 21–27), with each booklet being composed of four clusters (Table 1). Using a balanced incomplete block design, each item cluster appeared in each of the four possible cluster positions within a test booklet once. This way, each pair of item clusters appears in only one booklet. Within the item clusters, the position of the items was fixed. Therefore, the effects of cluster position will be modeled instead of the effects of item position. Applied to Equations 1 and 2, k is replaced by c, which is the position of the item cluster, ranging from 1 to 4. Each sampled student was randomly assigned to 1 of the 13 test booklets available in a country.
Table 1. Visual Representation of the PISA 2009 Rotated Block Design
Note. PISA = Program for International Student Assessment. Each booklet consists of 4 of the 13 item clusters (7 reading clusters [R1–R7], 3 math clusters [M1–M3], and 3 science clusters [S1–S3]). There are four cluster positions, and each item cluster is presented at every cluster position once. The easier booklets (21–27) and clusters (R3B and R4B) are represented in italic.
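The two balance properties of the rotated design (every cluster occupies every position once, and every pair of clusters shares exactly one booklet) can be checked mechanically. The sketch below builds a generic cyclic design for 13 clusters from the base block (0, 1, 3, 9); this is a standard construction used here purely for illustration and is not claimed to reproduce the actual PISA 2009 booklet allocation. Cluster labels are replaced by the integers 0 to 12.

```python
from itertools import combinations

N_CLUSTERS = 13
BASE_BLOCK = (0, 1, 3, 9)  # assumed base block; its differences cover 1..12 mod 13


def build_booklets():
    """Booklet b holds the base block shifted by b (mod 13), one cluster per position."""
    return [tuple((b + offset) % N_CLUSTERS for offset in BASE_BLOCK)
            for b in range(N_CLUSTERS)]


def is_balanced(booklets):
    """Check the position balance and the pairwise balance of the design."""
    n_positions = len(booklets[0])
    # Every cluster must occur exactly once at every cluster position.
    for pos in range(n_positions):
        if sorted(booklet[pos] for booklet in booklets) != list(range(N_CLUSTERS)):
            return False
    # Every pair of clusters must occur together in exactly one booklet.
    pair_counts = {}
    for booklet in booklets:
        for pair in combinations(sorted(booklet), 2):
            pair_counts[pair] = pair_counts.get(pair, 0) + 1
    n_pairs = N_CLUSTERS * (N_CLUSTERS - 1) // 2
    return len(pair_counts) == n_pairs and all(c == 1 for c in pair_counts.values())


booklets = build_booklets()
design_ok = is_balanced(booklets)
```

The position balance is what separates cluster difficulty from cluster position: because every cluster is observed at all four positions, a position effect cannot be absorbed into the item parameters.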
Data
Only data from the PISA 2009 paper-and-pencil reading literacy assessment will be analyzed. The item formats employed for reading items were either selected response multiple choice or constructed response. Both dichotomous and partial credit scoring are used in PISA. In total, 125 reading items were analyzed. To fit the binary item response model of Equations 1 and 2, 7 items were dichotomized by only considering a full credit response as correct. Further, not-reached responses were dropped, and missing responses were treated as incorrect. More information on the items, the response formats, and the scoring rules can be found in the PISA 2009 technical report (OECD, 2012).
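The three recoding rules described above (full credit only counts as correct, not-reached responses are dropped, missing responses are scored as incorrect) can be sketched as a small preprocessing step. The codes "NR" and "M", the item labels, and the record layout are hypothetical; the actual PISA data files use their own coding scheme, documented in the technical report.

```python
NOT_REACHED = "NR"  # hypothetical code for a not-reached response
MISSING = "M"       # hypothetical code for an omitted response


def recode_responses(records):
    """Turn (item_id, score, max_score) records into binary (item_id, 0/1) pairs."""
    recoded = []
    for item_id, score, max_score in records:
        if score == NOT_REACHED:
            continue  # not-reached responses are dropped from the analysis
        if score == MISSING:
            recoded.append((item_id, 0))  # missing responses count as incorrect
        else:
            # Only a full-credit response counts as correct (partial credit -> 0).
            recoded.append((item_id, 1 if score == max_score else 0))
    return recoded


# Hypothetical student record with four items.
example = recode_responses([
    ("item_a", 2, 2),            # full credit on a partial-credit item
    ("item_b", 1, 2),            # partial credit
    ("item_c", MISSING, 1),      # omitted
    ("item_d", NOT_REACHED, 1),  # not reached
])
```

Dropping not-reached responses while scoring omitted ones as incorrect treats the two kinds of nonresponse differently, which matters for a position effect because not-reached items cluster at the end of the booklet.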
Analysis
The models in Equations 1 and 2 were used to analyze the data within each country separately. Both models can be seen as generalizations of the Rasch model or the logistic multilevel model with item responses as Level-1 variable nested within students (e.g., Kamata, Bauer, & Miyazaki, 2008). As item responses are nested in students, we will refer to the model in Equation 1 as the two-level model. The model in Equation 2 will be referred to as the three-level model, with responses nested in students and students nested in schools. All analyses were conducted with the multilevel software HLM (Raudenbush, Bryk, & Congdon, 2004, 2013) using penalized quasi-likelihood estimation.
Because the analyses are conducted separately for each country, there is no common scale, and the estimated effects are not directly comparable across countries. Therefore, for the two-level results, the estimated effect of cluster position
For the three-level model, the total reading ability standard deviation
Results
An overview of the parameters of interest for every country can be found in Online Appendices A and B (available at http://jeb.sagepub.com/supplemental). Online Appendix A lists the estimates of
Average Persistence
As expected in Hypothesis 1, a negative effect of cluster position is consistently found across all participating countries. On average, there is a decline in examinee effort during testing, which results in a decreasing probability of a correct response when the item is placed further in the test. Figure 1 gives the distribution of the estimated standardized average persistence γ* across countries.

Figure 1. Histogram of the average estimated persistence γ* across all countries (N = 65) according to the three-level analyses (Mean = −0.166; SD = 0.034).
Table 2. Effect of the Decrease in Examinee Effort γ* During the PISA 2009 Reading Assessment on the Change in Probability of a Correct Response of Students of Average Ability (θ = 0), When an Item of Average Difficulty (β_i = 0) Is Placed One Cluster Position or Three Cluster Positions Further in the Assessment
Note. PISA = Program for International Student Assessment. The changes in the probability of a correct response are given for three effect sizes: the highest (Greece), the average, and the lowest (Finland) decrease in examinee effort. Estimates from the three-level analyses are used.
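The size of such probability changes can be approximated with a few lines of code. The sketch uses the mean standardized persistence reported for the three-level analyses (−0.166) and makes two simplifying assumptions that are not taken from the article: the standardized effect is applied directly on the logit scale (as if the ability standard deviation were 1), and the item starts at the first cluster position.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def probability_change(gamma_star, positions_later, theta=0.0, beta=0.0):
    """Change in P(correct) when an item moves positions_later clusters further."""
    start = logistic(theta - beta)
    later = logistic(theta - beta + gamma_star * positions_later)
    return later - start


MEAN_GAMMA_STAR = -0.166  # mean across countries, three-level analyses
change_one = probability_change(MEAN_GAMMA_STAR, 1)
change_three = probability_change(MEAN_GAMMA_STAR, 3)
```

Under these assumptions, an average student's probability of solving an average item drops by roughly 4 percentage points after one cluster position and by roughly 12 percentage points after three.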
There are no substantial discrepancies between the two-level and the three-level estimates for γ* (root mean square difference [RMSD] = 0.007). The country with the largest difference was Slovenia, with a difference of 0.022. Hence, both the two-level and the three-level results are in line with Hypothesis 1: In all countries, there is a decrease in average persistence during testing.
Individual Differences in Persistence
Figure 2 gives the distribution of the estimated total individual differences in persistence relative to the individual differences in reading ability

Figure 2. Histogram of the estimated total individual differences in persistence
Given the size of the individual differences in persistence, in all countries, at least a proportion of students demonstrate an increase in examinee effort, and hence, an increase in the probability of a correct response for an item when it is administered at a later cluster position in the assessment. On average, about 20% of the students have a zero or a positive change in examinee effort during the assessment. Although an increase in examinee effort seems counterintuitive, a possible explanation is that some test takers might exert very low effort in the beginning of the test, which makes an increase in examinee effort more likely than a decrease.
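The share of students with a zero or positive change follows from the normal distribution assumed for the person-specific deviations: with slope γ + δ_p and δ_p normally distributed around zero, the share equals Φ(γ/σ_δ). A minimal sketch, in which the persistence standard deviation of 0.2 is a hypothetical value chosen only for illustration and not an estimate from the PISA data:

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def share_nonnegative_change(gamma, sd_delta):
    """P(gamma + delta_p >= 0) when delta_p ~ N(0, sd_delta**2)."""
    return normal_cdf(gamma / sd_delta)


# Mean standardized persistence from the three-level analyses combined with a
# hypothetical standard deviation of the individual differences.
share = share_nonnegative_change(gamma=-0.166, sd_delta=0.2)
```

The share grows as the individual differences become large relative to the average decline, which is why sizeable variance in persistence implies that some students improve over the course of the test.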
The three-level model decomposes the individual differences in persistence and reading ability into a within-school and a between-school part. Figure 3 gives the distribution of the intraclass correlation (ICC), that is, the proportion of the total variance located between schools, across countries for (a) reading ability and (b) persistence. In all countries, only a small proportion of the differences in persistence is related to between-school differences. On average, the proportion is about 10% (SD = 0.048). This proportion is considerably smaller than the proportion for the individual differences in reading ability, where the ICC is on average about 36% (SD = 0.146). The findings are in line with the second hypothesis (Hypotheses 2a and 2b): Most of the individual differences in persistence are found within schools, but at least a part can be explained by the school level.

Figure 3. Histogram of the ICC of reading ability (a) and the ICC of persistence (b) across the participating countries (N = 65). The mean ICC for reading ability is .360 (SD = 0.146), and the mean ICC for persistence is .100 (SD = 0.048). ICC = intraclass correlation.
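The ICC reported here is the between-school share of the total variance. A minimal sketch of that computation; the variance components passed in are hypothetical and were chosen only so that the result matches the reported mean ICC values of .36 and .10.

```python
def intraclass_correlation(var_between, var_within):
    """Share of the total variance that is located between schools."""
    total = var_between + var_within
    if total <= 0:
        raise ValueError("total variance must be positive")
    return var_between / total


# Hypothetical variance components (not estimates from the PISA data).
icc_ability = intraclass_correlation(var_between=0.36, var_within=0.64)
icc_persistence = intraclass_correlation(var_between=0.004, var_within=0.036)
```

An ICC of .10 for persistence means that nine tenths of the variation in persistence lies between students of the same school.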
Correlation Between Reading Ability and Persistence
Before examining the decomposition of the correlation between ability and persistence in the three-level model, first the two-level correlations ρθδ are discussed. In line with Hypothesis 3, in most countries, the estimated two-level correlation between students’ reading ability and their persistence

Figure 4. Histogram of the correlation between a student’s ability and persistence ρθδ across all countries (N = 65) according to the two-level analyses (Mean = −0.028, SD = 0.146).
The three-level model decomposes the correlation between reading ability and persistence into a within-school ρθδps
and a between-school ρθδs
part. Figure 5 gives the distribution of both estimated correlations across countries. Within schools, there seems to be a zero or a small negative correlation between ability and persistence

Figure 5. Histogram of the within-school correlation ρθδps (a) and the between-school correlation ρθδs (b) across the participating countries (N = 65). The mean correlation within schools is −.155 (SD = 0.154), and the mean correlation between schools is .432 (SD = 0.334).
Relation to PISA National Scores in Reading Ability
Table 3 gives the correlations (N = 65) of the PISA national reading score with the estimates of (a) the standardized average persistence
Table 3. Correlations of the Parameter Estimates of the Two-Level and Three-Level Analysis With the PISA National Reading Score for N = 65 Countries
Note. PISA = Program for International Student Assessment. ᵃThere is no decomposition of the ability–persistence correlation in the two-level model.
The results in Table 3 show that there is a positive correlation of medium size between a country’s PISA reading ability score and the average persistence in that country. This correlation indicates that the decrease in examinee effort is larger in countries with lower PISA reading ability, despite the fact that (some) lower performing countries were administered easier booklets. Further, the national reading score is negatively correlated with the amount of individual differences in persistence within a country. In lower performing countries, there are more individual differences in this decline. As the individual differences in persistence are expressed relative to the individual differences in ability, this result indicates that persistence plays a relatively bigger role in students’ PISA reading scores in lower ability countries.
Finally, although the correlations between ability and persistence are all close to zero, there is a clear positive correlation between these numerically small estimated correlations in the two-level analyses and the PISA reading score. This result shows that in countries with a higher national reading score, the correlation between students’ ability and persistence is more likely to be positive, while it is more likely to be negative in countries with lower national scores. The three-level results show that this effect is found both at the between-school level and at the within-school level. In higher performing countries, the positive relation between a school’s average ability and a school’s average persistence is stronger than in lower performing countries. Within schools, the ability and persistence are more negatively correlated in countries with a lower PISA reading score.
It is not clear what the substantive processes behind the findings are, but the notable correlations between national reading score and the different model parameters indicate that, rather than random differences between the countries, there are consistent differences between low-performing and high-performing countries with regard to persistence.
Discussion
This article used a model-based measure to investigate the change in examinee effort during testing. A multilevel extension of the model was applied to the PISA 2009 reading assessment data, allowing a decomposition of the individual differences within and between schools. This is the first study to examine and decompose the individual differences in persistence during a low-stakes international assessment. Although the study was exploratory and the results were extensive and complex, a number of interesting and potentially important conclusions can be drawn.
Key Findings
First, a decrease in examinee effort during the assessment was found consistently across countries. On average, the effort students expend during the PISA reading assessment decreases, which results in a lower performance toward the end of the assessment. This is in line with previous studies on the effect of item position (e.g., Debeer & Janssen, 2013; Hartig & Buchholz, 2012; Meyers et al., 2009). Because of the generality of this effect, it can be expected that in most large-scale low-stakes assessments, a decrease in examinee effort takes place.
Second, individual differences in persistence were found in all countries. Part of the variance in persistence was related to the school level, but most of the variance was found within schools. Therefore, student characteristics rather than school characteristics are the more promising candidates for explaining the individual differences in persistence. For instance, as there are gender differences in reported effort and in the relation between reported effort and performance (Eklöf, Pavešič, & Grønmo, 2014), there may also be gender differences in the change in effort.
Third, at the student level, persistence and ability are not or are only slightly negatively correlated. This implies that persistence—and the individual differences therein—can be seen as a source of construct-irrelevant variance and can form a threat to the validity of the PISA measurement. As there are high-stakes implications attached to the PISA results, an important task for future research is to investigate to what extent the validity in large-scale international assessments is influenced by persistence.
Interestingly, at the school level, the correlation between persistence and ability was positive in most countries. Although the differences in average persistence between schools were rather small, they seem closely related to the differences in ability between schools. These high correlations might be caused by differences between schools in the extent to which they (unwittingly) motivate their students to do their best during the PISA assessment. Schools that attach high importance to PISA performance may motivate their students more or have more disciplined students, resulting in an overall higher performance and a weaker decrease in examinee effort during the assessment. Alternatively, schools that attract higher ability students might also have a stronger “testing climate,” resulting in more sustained examinee effort.
Fourth, in line with Hypothesis 4, the differences across countries in the average decrease in examinee effort and in the size of the variance in persistence are related to the national reading score. Although more research is needed to interpret and explain these results, it is clear that the differences in persistence across countries can have an impact on the PISA performance. Eklöf et al. (2014) found the following differences between three countries: (a) differences in the reported test-taking effort during the TIMSS 2008 assessment and (b) differences in the correlation between the reported effort and test performance. Our findings are in line with these results and confirm the need for monitoring, controlling, and modeling of examinee effort in low-stakes assessments such as PISA and TIMSS.
Persistence Versus Examinee Effort
The proposed model does not yield an estimate of the average effort expended by a student throughout the assessment. It yields a measure of persistence, which can be seen as the change in examinee effort that causes a change in a test taker’s performance during the test. Although the two constructs are related, average examinee effort and persistence do not have to be correlated. Whether they are correlated in practice is an interesting question for further investigation.
Unlike self-report questionnaires and RTE, the proposed measure for persistence is solely based upon response accuracy information. Neither additional self-report information nor response time information is required. However, without items being administered at different positions, the proposed models are not applicable, and bias on the ability measurement due to changes in examinee effort cannot be avoided.
The change in examinee effort and the individual differences in persistence can have various causes such as a change in motivation or a change in the energy level of the test taker (i.e., increasing fatigue). However, investigating the nature of examinee effort and the mechanisms underlying the change in effort during testing is not straightforward.
Limitations
In this section, technical limitations and potential further research are discussed. The model with multilevel extensions (cf. Equation 2) was formulated for hierarchical data with students nested in schools. In the case of the PISA data, a country level could be added, nesting the schools within countries. Using such a four-level model, all data could be analyzed simultaneously. However, because the resulting total sample size would demand far more computing power than most personal computers offer, we opted to run the analyses country by country.
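To illustrate what a decomposition of individual differences within and between schools amounts to, the following minimal Python sketch splits the variance of a set of per-student values into a between-school and a within-school part. It is a descriptive illustration only: the function name `decompose_variance` and the example values are hypothetical, and the article estimates these components as latent variances inside the multilevel model rather than from observed per-student values.

```python
from statistics import fmean


def decompose_variance(groups):
    """Split the population variance of all values into two parts.

    groups: dict mapping a school id to a list of student values
            (hypothetical persistence values; every school is assumed
            to contain at least one student).
    Returns (between, within), where between + within equals the
    population variance of all values pooled together.
    """
    all_values = [v for vals in groups.values() for v in vals]
    n = len(all_values)
    grand_mean = fmean(all_values)

    # Between-school part: spread of the school means around the grand mean,
    # weighted by school size.
    between = sum(
        len(vals) * (fmean(vals) - grand_mean) ** 2 for vals in groups.values()
    ) / n

    # Within-school part: spread of students around their own school mean.
    within = 0.0
    for vals in groups.values():
        school_mean = fmean(vals)
        within += sum((v - school_mean) ** 2 for v in vals)
    within /= n

    return between, within


# Hypothetical example: two schools with two students each.
schools = {"A": [-0.4, 0.0], "B": [-0.2, 0.2]}
between, within = decompose_variance(schools)
share_between = between / (between + within)
```

In this made-up example, most of the variance lies within schools, which mirrors the qualitative pattern reported above but not any actual estimate.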
To investigate and model the change in examinee effort in the PISA reading data, a linear effect of cluster position was used. However, in the framework proposed by Debeer and Janssen (2013), other functions of item position (quadratic, exponential, etc.) are also put forward. In the case of PISA, with only four cluster positions, descriptive analyses (Hartig & Buchholz, 2012) and comparison of different models (Debeer & Janssen, 2013) supported a linear effect. In other applications, with more positions, other functions of item position might be better suited to model the change in examinee effort.
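The difference between a linear effect and other functions of position can be sketched as follows. This is a minimal illustration, assuming a Rasch-type logistic model in which a person-specific persistence parameter multiplies a function of the cluster position. The symbols `theta` (ability), `beta` (item difficulty), and `delta` (persistence), the coding of positions as 0 to 3, and the particular quadratic and exponential forms are assumptions made for this example; they are not taken from Equation 2.

```python
import math


def position_term(position, shape="linear"):
    """Return f(position) for a few candidate shapes (all equal 0 at position 0)."""
    if shape == "linear":
        return float(position)
    if shape == "quadratic":
        return float(position) ** 2
    if shape == "exponential":
        return math.exp(position) - 1.0
    raise ValueError(f"unknown shape: {shape}")


def p_correct(theta, beta, delta, position, shape="linear"):
    """Probability of a correct response under the assumed logistic model.

    logit P = theta - beta + delta * f(position)
    A negative delta means that success becomes less likely later in the test.
    """
    logit = theta - beta + delta * position_term(position, shape)
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical student with average ability and a negative persistence value,
# answering an item of average difficulty at each of four cluster positions.
linear_curve = [p_correct(0.0, 0.0, -0.2, k, "linear") for k in range(4)]
quadratic_curve = [p_correct(0.0, 0.0, -0.2, k, "quadratic") for k in range(4)]
```

With only four positions, the candidate shapes differ at no more than three points, which may make them hard to tell apart empirically; with more positions, the curves diverge more clearly.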
In the analyses of the PISA reading data, not-reached responses were considered as missing at random, and missing responses before the last item with a response were treated as incorrect. Implicitly, it is assumed that there is no relation between the probability of an omission and the exerted examinee effort. It is, however, likely that a change in examinee effort can also have an effect on the tendency to omit items or not reach items. Several methods have been proposed to model omissions together with the item responses (e.g., Debeer, Janssen, & De Boeck, 2013; Glas & Pimentel, 2008; Holman & Glas, 2005; Pohl, Gräfe, & Rose, 2014). Effects of item position can be included in these models to account for the change in examinee effort.
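The scoring rule described above can be made concrete with a short Python sketch. It assumes that one student's responses are stored in administration order, with `None` marking a missing response; the function name `recode_responses` and this data layout are hypothetical and serve only to illustrate the rule.

```python
from typing import List, Optional


def recode_responses(responses: List[Optional[int]]) -> List[Optional[int]]:
    """Recode one student's responses, given in administration order.

    None marks a missing response. Missing responses before the last
    observed response (omissions) are scored 0 (incorrect). Missing
    responses after it (not reached) stay None and are treated as missing.
    """
    # Index of the last item with an observed response, or -1 if there is none.
    last_observed = max(
        (i for i, r in enumerate(responses) if r is not None), default=-1
    )

    recoded: List[Optional[int]] = []
    for i, r in enumerate(responses):
        if r is None and i < last_observed:
            recoded.append(0)  # omitted item: scored as incorrect
        else:
            recoded.append(r)  # observed response, or not-reached item
    return recoded


# Hypothetical response string: one omission, then two not-reached items.
example = recode_responses([1, None, 0, 1, None, None])
# example == [1, 0, 0, 1, None, None]
```

Under this rule, the same missing value is handled differently depending only on where it falls relative to the last observed response, which is why the treatment implicitly assumes that omissions are unrelated to the exerted effort.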
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) declared receipt of the following financial support for the research, authorship, and/or publication of this article: The research leading to the results reported in this paper was supported in part by the Research Fund of KU Leuven (GOA/15/003) and by the Interuniversity Attraction Poles program financed by the Belgian government (IAP/P7/06).
