Abstract
Standardized measures are often used as an index of students’ reading comprehension, and scores have important implications, particularly for students who perform below expectations. This study examined secondary-level students’ patterns of responding and the prevalence and impact of non-attempted items on a timed, group-administered, multiple-choice test of reading comprehension. The Reading Comprehension subtest from the Gates-MacGinitie Reading Test was administered to 694 students in Grades 7 to 9. Students were categorized according to their test performance (low-, middle-, and high-achieving). Scores of the lowest achieving subgroup were affected significantly by high rates of non-attempted items, particularly on the last third of the test. Furthermore, the percentage of students who completed the assessment was far below that reported by the test authors. The results caution researchers and educators, when text comprehension is the primary assessment target, to consider rates of non-attempted items and their impact on the interpretation of students’ text processing skills. Practical considerations are presented.
Group-administered, multiple-choice tests are some of the most commonly used methods of assessing reading comprehension skills, particularly with students in middle elementary school and beyond. Typically, these assessments consist of a series of reading passages, each followed by a set of multiple-choice questions. The efficiency of group-administered tests makes them an attractive option for evaluating comprehension with large groups of students. While these tests may be technically adequate, several factors associated with group-administered tests of reading comprehension may affect students’ performance, and thus may potentially mislead or obscure interpretation of students’ text processing and reading comprehension skills.
Multiple-choice tests of comprehension can be of considerable length. For example, on the reading comprehension portions of tests such as the Stanford Achievement Test (SAT-10; Harcourt Brace, 2003) and the Gates-MacGinitie Reading Tests–fourth edition (GMRT-4; MacGinitie, MacGinitie, Maria, & Dreyer, 2002), students read up to 11 passages and answer up to 54 questions within time periods that range up to 50 min. Although a sufficient amount of reading content is needed to support the number of test items required to ensure test reliability and adequate assessment of the construct, test length has implications for attention resources across periods of sustained effort. Comprehending text is a resource-intensive process when reading difficult or unfamiliar content. Good readers actively monitor their comprehension and may selectively deploy attentional resources based on the demands of the reading task, difficulty of the text, or other factors (Kinnunen, Vauras, & Niemi, 1998; Paris & Myers, 1981; Reynolds, 2000; van den Broek, 2010). Feng, D’Mello, and Graesser (2013) found that difficult reading tasks were more likely to give rise to mind wandering (i.e., reading text without actively processing), which the authors suggested may be due to the challenge for the reader to create a situation model in difficult text and thus be prone to off-task thoughts that are competing for attention. DiCerbo, Oliver, Albers, and Blanchard (2004) observed improved performance when students took a reading comprehension test (SAT-9) that was divided into halves or thirds across separate days, compared with taking the entire test in 1 day. Effects were most pronounced for middle- and low-achieving students, who demonstrated significantly higher scores on the divided-time administrations. Thus, depleted attentional resources from sustained effort may adversely affect reading comprehension processes over longer periods of time.
Failure to attempt to answer test items (i.e., non-responses) is another factor that can impede the accurate interpretation of comprehension test scores. Non-attempted items are often scored the same way as incorrect responses when raw and standard scores are calculated. Thus, a high frequency of non-attempted items may look like lower comprehension skills when, in fact, the student may not have read large portions of the test. Non-attempted items can negatively affect investigations of students’ text processing; for example, to reliably assess the effects of reader–text interactions, Eason, Goldberg, Young, Geist, and Cutting (2012) had to exclude the 12% of their sample who had responded to fewer than 90% of the items of the Stanford Diagnostic Reading Test. Research has not routinely investigated or reported the prevalence of non-attempted items, patterns in which they occur, or frequency across student achievement levels on group-administered tests of reading comprehension.
There may be various reasons for non-attempted items. For example, fluency deficits that are the result of poor decoding and inefficient word recognition skills tax cognitive resources and can significantly impede comprehension (Nathan & Stanovich, 1991; Perfetti, 1985), regardless of the type of comprehension test or testing situation. However, on tests such as the GMRT-4 that use fixed time limits, reading fluency may affect test completion as faster readers may answer more questions within an allotted time. The GMRT-4 is considered a “power” test, meaning that students have sufficient time to answer all the items (MacGinitie et al., 2002), although some would argue that any test with a time limit that some students do not finish is a “speeded” test and that scores could be negatively affected by the speeded nature of the task (Kerstiens, 1990). Consistent with this view, Eason, Sabatini, Goldberg, Bruce, and Cutting (2013) observed that rates of reading connected text consistently explained substantial unique variance in the prediction of performance on timed, multiple-choice comprehension tests that contained longer passages.
Response mortality (i.e., non-response to items due to voluntarily ceasing participation) can also pose a problem for interpretation of test scores and may occur more frequently in group-assessment situations. As adult supervision is diffused in group-assessment situations, students who are bored or frustrated with a long and difficult test may disengage from the assessment. Research studies represent a special case with implications for response mortality that may differ from other testing situations. That is, given voluntary participation, and if students perceive that assessments are not required or their performance has no meaningful consequences, response mortality may be higher on long and difficult assessments. Students with low comprehension skills are precisely the students whom research interventions may target, but low achievers may be more easily frustrated with difficult tasks, and may also demonstrate lower motivation (Lepper, Corpus, & Iyengar, 2005). Thus, lower initial motivation coupled with low frustration tolerance may result in very little persistence with difficult reading tasks. Consequently, these factors can obscure the assessment of students’ text processing skills specific to reading comprehension.
GMRT-4 test authors report that in the fall samples for students in Grades 7 to 9, between 85% and 90% of students completed the entire test, whereas 94% to 96% of students completed at least 75% of the items. However, independent studies have not routinely reported the frequency of test completion or the prevalence of non-attempted items. More research is needed to understand students’ patterns of accuracy and level of non-responses on tests of reading comprehension, and their implications for assessing comprehension skills. For example, low scores earned by two different students may look the same, yet those scores may have resulted from several scenarios including (a) few items attempted but high accuracy, (b) many items attempted but low accuracy, or (c) test cessation. We suggest that understanding the anatomy of a test score is important for understanding the potential implications for interpreting test results and subsequent educational decision-making.
Study Purpose
This study investigated the response patterns of students in seventh to ninth grade on a timed, group-administered, multiple-choice test of reading comprehension. We were particularly interested in examining the percentage of correct responses and non-attempted items across the test with subgroups of students categorized according to their comprehension scores. Conducted within the context of a research study, the results of our investigation have implications for the ways in which group-administered tests of comprehension are interpreted and the factors that may complicate conclusions about students’ reading comprehension skills.
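The study addressed the following research questions: (RQ1) Does response accuracy differ as a function of student achievement level? (RQ2) Does the percentage of non-attempted items differ as a function of student achievement level?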
Method
Participants and Setting
Participants included 694 students in seventh through ninth grades from five schools (three school districts) in the southwestern United States. Students were participating in a study of interventions designed to improve secondary students’ comprehension skills (data used in the present analyses were collected at pretest, prior to intervention), and were selected from 48 unique class periods among 12 English language arts teachers. From those classrooms, all students receiving reading/language arts instruction in the general education setting were eligible to participate, and students who subsequently consented formed the study sample.
The gender and ethnicity breakdown of the sample was as follows: 46.5% male, 31.7% White, 31.2% African American, and 35.1% Hispanic/Latino. Students identifying as Asian, American Indian or Alaska Native, or mixed/biracial each represented less than .5%. All participating schools received Title I funding, and across the districts the percentage of economically disadvantaged families ranged from 71.8% to 76.6%.
Measure
GMRT-4, Reading Comprehension subtest
The GMRT-4 Reading Comprehension subtest (MacGinitie et al., 2002) is a group-administered assessment of reading comprehension. The measure contains narrative and expository passages ranging from 3 to 15 sentences long, followed by 3 to 6 multiple-choice questions per passage. Students read and answer the questions silently during a 35-min time-limited administration session. Internal consistency reliability is reported as ranging from .90 to .95.
Procedures
Assessment administration
Each student received the GMRT-4, Form S, during the fall of the school year in a group administration format in accordance with standardized procedures (MacGinitie et al., 2002). Trained examiners included research staff and graduate students in special education. Training for examiners consisted of a presentation on the general procedures, practice time, a practice session observed by project coordinators, and an observed administration with students. Examiners circulated around the classroom during test administration, and students were encouraged to answer as many questions as possible.
Scoring
Each of the 48 items on the GMRT-4 was analyzed to determine (a) whether the student attempted the item and (b) correct or incorrect responding. Each student response was coded using a series of “Countif” statements to consolidate student response patterns (e.g., correct, incorrect, not attempted).
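The item-level coding described above can be sketched in a few lines of Python. This is only an illustration of the three-way coding (correct, incorrect, not attempted), not the spreadsheet procedure the authors used; the answer key, the response format (with `None` marking a non-attempted item), and the function name `code_responses` are assumptions made for the example.

```python
from collections import Counter

# Assumption: a non-attempted item is stored as None in the response list.
NOT_ATTEMPTED = None


def code_responses(responses, answer_key):
    """Code each item as 'correct', 'incorrect', or 'not_attempted'."""
    if len(responses) != len(answer_key):
        raise ValueError("responses and answer_key must have the same length")
    codes = []
    for given, keyed in zip(responses, answer_key):
        if given is NOT_ATTEMPTED:
            codes.append("not_attempted")
        elif given == keyed:
            codes.append("correct")
        else:
            codes.append("incorrect")
    return codes


# Toy 6-item example with a made-up key (the actual subtest has 48 items).
answer_key = ["A", "C", "B", "D", "A", "B"]
student = ["A", "B", "B", None, None, None]

codes = code_responses(student, answer_key)
tally = Counter(codes)  # plays the role of the spreadsheet COUNTIF tallies
print(codes)
print(dict(tally))
```

Keeping the non-attempted code separate from the incorrect code is what allows the later analyses to distinguish low accuracy from low exposure to test content.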
Student subgroup categories
For analysis purposes, three subgroups of students were formed based on the normative data from the GMRT-4. Students scoring at or above the 50th percentile (n = 195) formed a “high” group, a “low” group consisted of students scoring at or below the 30th percentile (n = 362), and the remaining students constituted a “mid” group (30th-50th percentile, n = 137).
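As a minimal sketch, the subgroup assignment can be written as a single function of a student's normative percentile rank. The function name and the handling of the exact boundary values are assumptions: the code follows the "at or above the 50th" and "at or below the 30th" wording, so the mid group contains percentiles strictly between 30 and 50.

```python
def achievement_group(percentile):
    """Assign a subgroup label from a GMRT-4 percentile rank (0-100)."""
    if not 0 <= percentile <= 100:
        raise ValueError("percentile must be between 0 and 100")
    if percentile >= 50:
        return "high"  # at or above the 50th percentile
    if percentile <= 30:
        return "low"   # at or below the 30th percentile
    return "mid"       # strictly between the 30th and 50th percentiles


# Hypothetical percentile ranks for five students.
for p in (12, 30, 31, 49, 50):
    print(p, achievement_group(p))
```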
Results
RQ1: Differences in Response Accuracy as a Function of Student Achievement Level
Data were first analyzed descriptively according to response accuracy across the items of the GMRT-4. Figure 1 displays the average number of correct responses on each item for each of the reader subgroups. As illustrated, rates of correct responding varied considerably across the test, a pattern of variability that is consistent with the test design. That is, the GMRT-4 authors intended to have a range of item difficulty within each passage and constructed the test to present increasingly difficult items and reading passages as the test progresses (MacGinitie et al., 2002). With respect to correct responding, a clear decreasing trend is evident in the data across the later portions of the test, particularly for the low-achieving subgroup.

Figure 1. Rates of correct responding by item and achievement subgroup.
Next we examined the effect of non-attempted items on estimates of student performance. In contrast to the previous analysis, this analysis examined only correct responses to attempted items. As displayed in Figure 2, considering only attempted items resulted in different levels of accuracy. This effect is particularly noteworthy for the lowest achieving subgroup on the later portion of the test.

Figure 2. Rates of correct responding to only attempted items by achievement subgroup.
RQ2: Differences in the Percentage of Non-Attempted Items as a Function of Student Achievement Level
Figure 3 displays the proportion of students, as categorized by achievement subgroup, who attempted each item across the GMRT-4. These data ignore whether answers were correct, and instead only reflect the percentage of students who attempted each item. As illustrated, the percentage of attempted items decreased across the test, most notably for students in the low-achievement group. Item non-response by this subgroup is evident by Item 15, and clear separation has occurred by Item 24. More than 42.8% of the low group (n = 155) did not respond to Item 32 and beyond, which represents non-responses to more than a third of the test.

Figure 3. Proportion of students attempting item by position of item in the test.
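The per-item attempt rates plotted by subgroup could be computed along the following lines. This is a sketch under assumptions: each record pairs a subgroup label with a response list in which `None` marks a non-attempted item, and the function name and toy data are invented for illustration.

```python
from collections import defaultdict


def attempt_rates_by_item(records, n_items):
    """Return {group: [proportion of students attempting item 1..n_items]}.

    records: iterable of (group_label, responses) pairs, where responses
    has length n_items and None marks a non-attempted item.
    """
    counts = defaultdict(lambda: [0] * n_items)
    totals = defaultdict(int)
    for group, responses in records:
        if len(responses) != n_items:
            raise ValueError("each response list must have n_items entries")
        totals[group] += 1
        for i, response in enumerate(responses):
            if response is not None:  # correctness is ignored here
                counts[group][i] += 1
    return {g: [c / totals[g] for c in counts[g]] for g in totals}


# Toy data: three students, four items.
records = [
    ("low", ["A", "B", None, None]),
    ("low", ["A", None, None, None]),
    ("high", ["A", "B", "C", "D"]),
]
rates = attempt_rates_by_item(records, n_items=4)
print(rates)
```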
A noteworthy pattern emerged for participants in the low-achievement subgroup between Items 23 and 36. Specifically, declines in the percentage of item attempts are observed following Items 23, 28, 31, and 35. Each of these items corresponds with the last item for Passages 5, 6, 7, and 8, respectively. For example, after Passage 5 (Item 23), 7.8% of the sample of the lowest achieving readers did not attempt the following item. Similar declines in participation were also noted after Passages 6, 7, and 8, in which case approximately 8% of the students remaining in that group did not attempt an item on the subsequent passage. This response pattern suggests that finishing the last item for a passage served as a stopping point for students, which may have been due to the student running out of time or an active decision to not continue.
In total, only 66% of our full sample attempted at least 75% of the test items. In contrast to the 85% to 90% of students in the GMRT-4 normative sample who completed the test in the allotted time, only 44% of students in our sample attempted every item. For students in our high-achievement subgroup, 76.9% attempted every item on the test. Test completion was much lower among students in the two lower-achieving groups, as 46% of the mid group and only 26.2% of the low group completed every item on the GMRT-4.
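The two completion indices reported above (the share of students attempting every item, and the share attempting at least 75% of items) can be sketched as follows. The function name, the `None` convention for non-attempted items, and the toy data are assumptions for illustration only.

```python
def completion_summary(all_responses, threshold=0.75):
    """Proportion of students attempting every item, and at least `threshold` of items."""
    n_students = len(all_responses)
    attempted_all = 0
    attempted_threshold = 0
    for responses in all_responses:
        n_attempted = sum(r is not None for r in responses)
        if n_attempted == len(responses):
            attempted_all += 1
        if n_attempted / len(responses) >= threshold:
            attempted_threshold += 1
    return {
        "attempted_every_item": attempted_all / n_students,
        "attempted_at_least_threshold": attempted_threshold / n_students,
    }


# Toy data: four students on a four-item test.
sample = [
    ["A", "B", "C", "D"],        # attempted 100%
    ["A", "B", "C", None],       # attempted 75%
    ["A", "B", None, None],      # attempted 50%
    [None, None, None, None],    # attempted 0%
]
summary = completion_summary(sample)
print(summary)
```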
Effects of considering attempted versus non-attempted items on overall test scores
To more closely examine the effects that non-responding had on students’ scores, we used t tests to compare two methods of scoring items and determining students’ scores. The first method examined the overall accuracy and considered every item. Unanswered and incorrect items were both treated as incorrect items, which is in accordance with how the GMRT-4 is scored. This approach was contrasted with scoring that only considered the items that students attempted.
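The difference between the two scoring methods can be made concrete with a short sketch. The function names and the string codes are assumptions carried over for illustration; the first function treats non-attempted items as incorrect, and the second restricts the denominator to attempted items.

```python
def accuracy_all_items(codes):
    """Proportion correct over every item; non-attempted items count as incorrect."""
    return codes.count("correct") / len(codes)


def accuracy_attempted_only(codes):
    """Proportion correct over attempted items only; None if nothing was attempted."""
    attempted = [c for c in codes if c != "not_attempted"]
    if not attempted:
        return None
    return attempted.count("correct") / len(attempted)


# Toy student: 2 correct, 1 incorrect, 3 not attempted.
codes = ["correct", "incorrect", "correct",
         "not_attempted", "not_attempted", "not_attempted"]
score_all = accuracy_all_items(codes)            # 2 of 6 items
score_attempted = accuracy_attempted_only(codes) # 2 of 3 attempted items
print(score_all, score_attempted)
```

For a student who stops early, the two values can diverge sharply, which is the contrast the t tests evaluate at the subgroup level.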
As reported in Table 1, statistically significant differences between the two scoring methods were observed only for the low-achieving subgroup, t(94) = −2.365, p = .02. When only attempted items were considered, students in the lowest achievement subgroup evidenced a difference of 9% in their overall accuracy in responding to comprehension questions. In contrast, students in both the mid, t(94) = −1.238, p = .22, and high, t(94) = −0.597, p = .55, groups showed no significant differences between the total number of items correct and number of attempted items correct.
Table 1. Differences in Correct Responses by Group.
Note. Percentile groups based on GMRT-4 comprehension scores, Low = at or below the 30th percentile, Mid = 30th-50th percentile, High = at or above the 50th percentile. GMRT-4 = Gates-MacGinitie Reading Tests–fourth edition.
*p < .05.
Are non-attempted items the result of strategic test taking?
Skipping difficult items may reflect students’ use of test-taking strategies, whereas the lack of any subsequently attempted items may be more indicative of students who either gave up or ran out of time. Therefore, we investigated patterns of unanswered items to determine whether they were the result of strategic skipping, or if unanswered items marked the point in which students stopped working.
As reported in Table 2, in the vast majority of cases, a non-attempted item was followed by no responses to any of the remaining items. For the lowest achieving subgroup, only 5% of students responded to at least one additional item after initially skipping one item. Similar patterns were observed across the middle- and high-achieving subgroups (11% and 3%, respectively). Thus, for roughly 90% or more of the students in each subgroup, a non-response to one item was associated with non-responses on all the remaining items on the GMRT-4, suggesting that students rarely skipped items and then continued working. Combined with earlier analyses of the percentage of unanswered items across the thirds of the GMRT-4, the data indicate that large portions of the sample, particularly students with lower scores, may not have been exposed to significant portions of test content.
Table 2. Rates of Non-Attempted Items Due to Selective Skipping of Questions.
Note. Percentile groups based on GMRT-4 comprehension scores, Low = at or below the 30th percentile, Mid = 30th-50th percentile, High = at or above the 50th percentile. GMRT-4 = Gates-MacGinitie Reading Tests–fourth edition.
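The distinction between strategic skipping and test cessation can be expressed as a simple rule on each student's response sequence: find the first non-attempted item and check whether any later item was attempted. The sketch below is illustrative only; the function name, category labels, and `None` convention are assumptions.

```python
def classify_non_response(responses):
    """Classify a response sequence by what follows the first non-attempted item.

    Returns 'complete' if every item was attempted, 'skipped_and_continued'
    if at least one item was attempted after the first non-attempt, and
    'stopped' if nothing was attempted after the first non-attempt.
    """
    attempted = [r is not None for r in responses]
    if all(attempted):
        return "complete"
    first_gap = attempted.index(False)
    if any(attempted[first_gap + 1:]):
        return "skipped_and_continued"
    return "stopped"


print(classify_non_response(["A", "B", "C", "D"]))      # every item attempted
print(classify_non_response(["A", None, "C", "D"]))     # skipped, then resumed
print(classify_non_response(["A", "B", None, None]))    # ceased responding
```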
Discussion
Timed, group-administered, multiple-choice assessments are commonly used by researchers and educators to evaluate students’ text processing skills and the effects of intervention on reading comprehension. Current national standards (e.g., Common Core State Standards) underscore the importance of reading comprehension assessment as a way to evaluate a student’s ability to understand and evaluate complex text (National Governors Association, 2010). Comprehension assessments are used to identify struggling students, and to evaluate knowledge acquisition or responsiveness to instruction. Given the significance of assessment data to inform our understanding of text processing, as well as in evaluating students’ attainment of standards, it is important to understand the factors that may influence scores on tests of reading comprehension.
In the current study, we investigated the response patterns of 694 students in Grades 7 to 9 on the Reading Comprehension subtest of the GMRT-4, a timed, group-administered multiple-choice test. We specifically focused on the frequency of non-attempted items, which in typical practice are often treated as incorrect responses, thus potentially confounding conclusions and inferences made regarding students’ comprehension scores.
Our results indicated a declining trend in accuracy across the items of the GMRT-4, most notably for the lowest and middle-achieving subgroups. This was not unexpected given the increasing difficulty of the items on GMRT-4. However, our analyses of the percentage of non-attempted items indicated that high proportions of struggling readers did not attempt significant portions of the test, with steep declines in responding over the course of the assessment. Although this might be expected for a test with a time limit, the number of test completers in our sample was far lower than the 85% to 90% reported by publishers for seventh to ninth graders (MacGinitie et al., 2002). Within our full sample, only 44% attempted every item. Only higher achievers (i.e., students with GMRT-4 scores at the 50th percentile or higher) demonstrated test completion percentages (77%) that approached those reported by the test authors.
Of students with GMRT-4 scores between the 30th and 50th percentiles, only 46% attempted every item, and of students with scores below the 30th percentile on the GMRT-4—the group of students that educators or researchers studying the effects of comprehension interventions would want to reach the most—only 26% attempted every item. Considering only attempted items for the lowest scoring subgroup resulted in statistically significant differences in overall response accuracy. In addition, non-responding did not appear to be the result of students strategically skipping difficult items, as very few students skipped an item and then continued answering subsequent items.
Implications for Test Score Interpretation
The present results suggest the need for considerable caution when interpreting scores on timed, group-administered tests of reading comprehension. “Reading comprehension” scores may be influenced significantly by variables other than students’ text processing skills, and this issue is magnified when tests (such as the GMRT-4) treat non-attempted items as incorrect responses. These factors may complicate conclusions about students’ text comprehension skills, and potentially obscure the source of students’ reading comprehension difficulties.
Reasons for non-attempted items in our sample may have been due to differences in reading fluency, test-taking speed, reading motivation, or reading stamina. Regardless, questions pertaining to what tests of reading comprehension truly measure are relevant here (e.g., Keenan, Betjemann, & Olson, 2008). Using a score from a timed, multiple-choice reading comprehension test such as the GMRT-4 may not take into account that many students, especially low achievers, have not read significant portions of the test. In practice, a low score on a reading comprehension test suggests the need to address the student’s text comprehension skills and strategies, such as comprehension monitoring, main idea generation, or inference-making skills. However, text reading fluency, motivation, task persistence, and fatigue (factors more indirectly related to text processing) may play a greater role in influencing comprehension test scores than factors more causally related to reading comprehension. A high percentage of non-attempted items can potentially obscure interpretation of students’ skills in text processing, utilization of comprehension strategies learned in an intervention, skills at processing different types of texts (narrative, expository), response accuracy to different types of comprehension questions (e.g., literal, inferential), utilization of vocabulary or background knowledge, or employment of comprehension monitoring and inference-making skills.
The present results also have statistical implications for analyses that may treat non-attempted items as “missing.” Analytic procedures that estimate missing data can carry assumptions that data are either “missing completely at random” (MCAR; i.e., the reason the data are missing is independent from any observed or unobserved variables) or “missing at random” (MAR; i.e., missing data are related to an observed variable, but not related to the scores that would have been present). As the results of this study demonstrated, non-attempted items were significantly more prevalent among low-achieving students, thus challenging notions of MCAR or MAR. Any analysis or approach that attempts to assess achievement at the individual item level (e.g., Alonzo, Basaraba, Tindal, & Carriveau, 2009; Eason et al., 2012; Ozuru, Rowe, O’Reilly, & McNamara, 2008) must account for the fact that missing data on items may be non-ignorable and MCAR or MAR assumptions may not be met.
Although we cannot assume our findings would be observed across other studies, we suspect that research studies may be more susceptible to higher rates of non-attempted items on group-administered tests. Unlike in educational testing situations, where students may be required to take exams and test performance may have significant consequences (e.g., eligibility to graduate), research studies are unique in that informed and voluntary consent procedures, and no clear benefit (or consequence) for participation, may be associated with less effort and persistence with difficult tasks. This is particularly relevant for low-achieving students, who may be easily frustrated by difficult content and may already lack motivation to persist with difficult or mundane tasks. In short, without demands for completion, and with the explicit language in informed consent/assent procedures that is characteristic of research studies, higher frequencies of unanswered items and lack of task engagement might be expected for the students most at risk for comprehension difficulties.
Practical Solutions
Greater awareness of the occurrence of non-attempted items on group-administered tests of comprehension can lead to practical considerations for test administration as well as awareness of the implications for test score interpretation. From an educational decision-making standpoint, educators might consider an additional step in examining the rates of attempted versus non-attempted items, particularly among low-achieving students. “Looking beyond the score” in this manner will help distinguish students who scored low but answered very few items from students who attempted many items but demonstrated poor accuracy overall. This information can then inform follow-up assessment with these students on whether fluency and/or motivation was a cause for few attempted items, and whether reading comprehension strategies should be addressed for students with high rates of attempted items but low accuracy.
From a research methodology perspective, some researchers have addressed the issue by eliminating students who did not complete the test from their analyses. For example, Eason et al. (2012) eliminated 12% of students from analyses who did not complete at least 90% of the test items. Unfortunately, this approach would have been untenable in our situation, as it would have resulted in the elimination of over half of our sample.
Alternatively, researchers may consider breaking the assessment into shorter sections to prevent effects of fatigue. DiCerbo et al. (2004) found improved performance across divided-time administration with third-grade students, and there may be similar possibilities for secondary-aged students, particularly low achievers. Moreover, tests such as the GMRT-4 could be administered on a full-test basis, but scores might be analyzed in segments (i.e., thirds) to evaluate accuracy on portions of the test where effort may have been stronger, or at the very least, portions in which students were more likely to have read the test passages and test items.
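A segment-level analysis of this kind might be sketched as follows. The sketch assumes item codes of the form "correct", "incorrect", and "not_attempted" and splits the test into equal segments; the function name and output fields are hypothetical, and on the 48-item subtest each third would contain 16 items.

```python
def accuracy_by_segment(codes, n_segments=3):
    """Summarize attempts and accuracy within equal-sized segments of a test."""
    size, remainder = divmod(len(codes), n_segments)
    if remainder:
        raise ValueError("number of items must divide evenly into segments")
    results = []
    for s in range(n_segments):
        segment = codes[s * size:(s + 1) * size]
        n_attempted = sum(c != "not_attempted" for c in segment)
        n_correct = segment.count("correct")
        results.append({
            "attempted": n_attempted,
            "correct": n_correct,
            # Accuracy among attempted items; None if the segment was never reached.
            "accuracy_attempted": n_correct / n_attempted if n_attempted else None,
        })
    return results


# Toy 6-item test split into thirds of 2 items each.
codes = ["correct", "correct", "correct", "incorrect",
         "not_attempted", "not_attempted"]
segments = accuracy_by_segment(codes)
for i, seg in enumerate(segments, start=1):
    print(i, seg)
```

Reporting the attempted count alongside accuracy makes it visible when a segment's score rests on very few responses.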
Second, testing students in smaller groups (e.g., 7-10 students), although not as efficient as large-group testing, may encourage greater task persistence and engagement through lower adult:student ratios and increased level of supervision. In addition, this type of situation may allow researchers to better keep track of the students who gave up or stopped working.
Third, researchers might encourage effort and task persistence as much as possible. Emphasizing the importance of valid assessments and the need for students to make their best effort may prompt some students to comply. Incentives approved by teachers and administrators that are appropriate within human subjects guidelines might also be considered to encourage effort and task persistence.
Finally, our results underscore the importance of considering multiple measures when assessing reading comprehension. As Fletcher (2006) noted, the use of a single assessment is not likely to adequately measure a construct as broad and complex as comprehension. A multi-method assessment approach is needed, one in which different types of texts and response formats are utilized.
Limitations and Future Directions
Several limitations of the current study should be considered. First, as we did not have access to the relevant information, we were unable to determine the specific reasons for declines in accuracy or increases in the percentage of non-attempted items. Future research might investigate the frequency with which non-attempted items are due to failure to complete the assessment within time limits or to poor motivation. Second, although our results are similar to those obtained with other group-administered assessments of comprehension (DiCerbo et al., 2004), our analyses only considered student performance on the GMRT-4. Thus, subsequent research might examine similar response patterns on different tests. Third, additional studies should be conducted to determine if the response patterns observed in this study can be generalized to other populations and are stable within the identified groups. Finally, although beyond the scope of this study, further research is needed to more fully address the source of comprehension scores. That is, would performance patterns differ if, consistent with many state accountability assessments, students were given an unlimited amount of time to complete the test? Would students who did not complete large portions of the test persevere and demonstrate different outcomes if given extended time? These are empirical questions that have significant implications for how tests are administered.
Conclusion
Timed, group-administered, multiple-choice tests are a widely used method for evaluating students’ reading comprehension, particularly in research studies. However, group-administered test situations and test characteristics can introduce factors that affect scores and potentially obscure evaluations of students’ comprehension skills. Rates of item completion in practice may fall short of those reported by test publishers. As demonstrated in this study, the frequency of non-attempted items on longer tests may pose a distinct problem for test score interpretation, especially for lower achieving students for whom accurate assessment data are often needed the most. When text comprehension skills are the primary assessment objective, results of this study should caution researchers and educators to consider rates of non-attempted items and their implications for interpreting reading comprehension test scores and understanding the source of comprehension difficulties. Similar to the skills required for comprehending text, the inferences we make when interpreting student performance encourage us to read between the lines to more fully understand what the scores mean for individual students.
Footnotes
Authors’ Note
The opinions expressed are those of the authors and do not represent the views of the Institute of Education Sciences or the U.S. Department of Education. Eric L. Oslund is now at the Department of Elementary and Special Education, Middle Tennessee State University.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported in part by the Institute of Education Sciences, U.S. Department of Education, through Grant R305F100013 to Texas A&M University as part of the Reading for Understanding Research Initiative.
