Cloze Tests May be Quick, But Are They Dirty? Development and Preliminary Validation of a Cloze Test of Reading Comprehension

Abstract

A commonly held view is that cloze tests may well provide a quick measure of something reading related, but that they are not suitable for assessments of understanding of ideas beyond the sentence boundary. The present article presents challenges to this view. It is argued that word gaps can be carefully selected so that filling them in requires proper understanding of the ideas of the text. The reliability and validity of such a comprehension-focused cloze test was demonstrated in a study of 204 Danish adults attending reading courses or general education. The quick (10 min) cloze comprehension test correlated strongly (r = .84) with a standard (30 min) question-answering comprehension test. Only a small part of this correlation was accounted for by decoding ability or vocabulary. The cloze test was somewhat more sensitive to decoding ability than the question-answering comprehension test was, and it provided a better fit to the participants’ self-reported reading difficulties.

Keywords

reading comprehension, cloze test, decoding, vocabulary, validity

Cloze tests of reading have a relatively high number of items per 100 words as compared with other test formats with intact texts. This means that cloze tests may be time efficient. However, the validity of cloze tests as measures of reading comprehension has been questioned by several researchers (e.g., Carlisle & Rice, 2004; Farr & Carey, 1986; Pearson & Hamm, 2005). For example, in a review of procedures for the assessment of reading comprehension, Carlisle and Rice (2004) warn, “Cloze tests tend to assess what we call ‘local’ comprehension, which refers to sensitivity to the grammatical and semantic constraints on meaning. ( . . . ) Cloze tests are not ideal if one wants to assess understanding and recall of ideas and information in natural passages” (p. 535). In the present article, we challenge this view. We concede that many existing cloze tests may not tap understanding of ideas, but we argue that cloze tests are not inherently limited to “local” comprehension. Rather, cloze tests are natural tests of inferences, and the inferences are not necessarily limited in scope. We also present evidence that a cloze test can be constructed to match a much more time-consuming question-answering test in terms of reliability and external criterion-based validity.

Cloze can be defined as “any procedure whereby bits of some discourse are omitted and the task set the examinee is to restore the missing pieces” (Oller & Jonz, 1994, p. 3). Hence, cloze tests may have a gap for every fifth, seventh, or tenth word, or gaps may be made in a more considered manner. Here is an example:

I do think Mrs. Long is ____ good a creature as ever lived – ____ her nieces are very pretty behaved ____, and not at all handsome: I ____ them prodigiously. (Austen, 1813)

Every seventh word has been replaced by a gap. The original (i.e., correct) words are as, and, girls, like. These words may all be inferred from the local context. The as . . . as construction provides an example of a gap that can be filled based on the sentence. The and gap requires a little more context: an understanding that the contents of the neighboring sentences are parallel and both positive. Nieces are likely to be relatively young females, so girls is an obvious guess based on the same sentence. The final gap-word, like, is not a surprise given the positive attitude of the speaker toward the girls.
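The fixed-ratio deletion used in this example can be sketched in a few lines of Python. This is only an illustration: the function name `fixed_ratio_cloze`, the gap marker, and the rule that pure punctuation tokens (such as a dash) are not counted as words are assumptions made for the sketch, not part of any published cloze procedure.

```python
import re

def fixed_ratio_cloze(text, n=7, gap="____"):
    """Replace every n-th word with a gap; return (cloze_text, removed_words)."""
    out, removed, count = [], [], 0
    for token in text.split():
        match = re.search(r"\w+", token)
        if match is None:
            # Pure punctuation (e.g., a dash) is kept but not counted as a word.
            out.append(token)
            continue
        count += 1
        if count % n == 0:
            removed.append(match.group())
            # Keep any punctuation attached to the word, e.g. "girls," -> "____,".
            token = token[:match.start()] + gap + token[match.end():]
        out.append(token)
    return " ".join(out), removed

sentence = ("I do think Mrs. Long is as good a creature as ever lived - and her "
            "nieces are very pretty behaved girls, and not at all handsome: "
            "I like them prodigiously.")
cloze_text, removed_words = fixed_ratio_cloze(sentence, n=7)
print(cloze_text)
print(removed_words)  # ['as', 'and', 'girls', 'like']
```

Applied to the Austen sentence with n = 7, the sketch removes as, and, girls, and like, the four words discussed above. Note that such a procedure pays no attention to what a reader must understand in order to restore each word.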

By contrast, consider this gap in the same text:

I do think Mrs. Long is as good a creature as ever lived – and her nieces are very pretty behaved girls, ____ not at all handsome: I like them prodigiously.

Local context would suggest but. The speaker, Mrs. Bennet, expresses a positive view of the nieces (“pretty behaved”) followed by what would seem a negative view (“not at all handsome”). A contrasting conjunction, such as but, yet, or however, appears to be missing. But the original conjunction is and, indicating that the two compliments are both positive. “Not at all handsome” is positive here because the girls are thus not effective competitors to the speaker’s daughters in the marriage market. This evaluation requires a proper understanding of the novel, that is, an understanding of the subtext at the unspoken socio-emotional level that drives the events. A reader who relies on local context only cannot fill this gap correctly.

Our point is that there are gaps and there are gaps. Some gaps require proper comprehension to be filled correctly. Those are the interesting ones for comprehension testing.

Fixed-Ratio Gaps

Originally, Taylor (1953) developed the cloze test with a fixed-ratio format as an alternative to conventional standardized tests. During the following decades, published studies reported mostly moderate correlations between cloze test scores and scores on tests that were traditionally considered measures of reading comprehension. For example, based on data from Bormuth (1967, Table 1), the average correlation was .58 when different text passages were used for cloze and traditional comprehension tests, whereas the average correlation was .64 when the two test formats were compared with the same text passages (Bormuth, 1967, pp. 5-7). Furthermore, research from the 1970s and later has indicated that cloze tests with a fixed gap ratio, for example, a gap for every fifth or seventh word, are not valid measures of text comprehension (e.g., Alderson, 1979; Kintsch & Yarbrough, 1982). Indeed, Shanahan, Kamil, and Tobin (1982) demonstrated that native speakers of English can perform about as well on cloze passages in which the sentences have been scrambled as on well-organized passages. This indicates that cloze tests with a fixed gap ratio are insensitive to integration of information across sentence boundaries. Shanahan et al. (1982) concluded that “it seems to be unreasonable to use and interpret cloze in classroom practice as a global measure of reading comprehension” (p. 250). In a subsequent article, Shanahan and Kamil (1984) stated that “the use of cloze to measure comprehension should be viewed with suspicion” (p. 252). During the cognitive revolution in the 1980s, when the integrative aspects of comprehension were in focus, measures that did not require text integration went out of favor (Pearson & Hamm, 2005).

Table 1.

Descriptive Statistics for the Measures of Reading and Vocabulary. Reliability Estimates (Cronbach’s alpha) in the Present Sample are Added in the Last Column.

Measure	Mean	SD	Range	α
Cloze comprehension (max 41)	19.6	7.9	2-38	.92
Q-A comprehension (max 40)	25.5	7.6	7-39	.89
Receptive vocabulary (max 37)	29.0	4.8	15-37	.86
Phonological coding (max 38)	13.3	8.0	0-37	.88
Orthographic coding (max 52)	35.3	10.0	5-51	.95

The survival of fixed-ratio cloze tests in educational testing may be explained by the fact that the cloze procedure appears to be a reliable measure of readability. Cloze provides a very good estimate of the experienced difficulty of a text, with correlations in the order of .80-.90 (e.g., Bormuth, 1967). When the cloze procedure is used as a readability measure, fixed conventions for the creation of gaps are mandatory.

Variably-Spaced Gaps

Variably-spaced gaps are used in some standardized tests of reading comprehension, for example, Suffolk Reading Scale (Hagley, 1987), The Degrees of Reading Power (Touchstone, 2001), Stanford Diagnostic Reading Test (Karlsen & Gardner, 1995) and in the Woodcock-Johnson Passage Comprehension subtest (Woodcock, McGrew, & Mather, 2001). Nonetheless, research still indicates that these tests may not tap comprehension but only assess the surface level of word and sentence processing. Several studies have thus shown that cloze tests are considerably more sensitive to individual differences in decoding than traditional question-answering tests are (e.g., Francis, Fletcher, Catts, & Tomblin, 2005; Keenan, Betjemann, & Olson, 2008; Nation & Snowling, 1997; Spear-Swerling, 2004). For example, in a sample of 7- to 10-year-old children Nation and Snowling (1997) found that skills in single word reading accounted for more variance in a cloze test (79%) than in a question-answering test (53%). Furthermore, Nation and Snowling found that listening comprehension accounted for additional variance on the question-answering test, but not on the cloze test, once single word reading was statistically controlled for. These and similar findings have led researchers to suggest that cloze tests mainly tap decoding skills (see also Cutting & Scarborough, 2006; Keenan et al., 2008).

However, the cloze test in Nation and Snowling’s (1997) study (Suffolk Reading Scale) uses single, unconnected sentences, which precludes any requirement for higher-order integration abilities. The limited demands on comprehension may be a simple consequence of the particular selection of texts (sentences) and gaps rather than of the cloze format itself.

In contrast, a study by Spear-Swerling (2004) provided results that suggested that the cloze format is indeed capable of assessing comprehension. She used the Degrees of Reading Power test which requires both processing within sentences and integration of information across sentences. In a study of grade 4 children, she found that a composite measure of language comprehension accounted for significant independent variation in cloze performance even after decoding skills were controlled for.

Comprehension-Demanding Gaps

We suggest that the cloze format is not entirely to blame for the limitations of the published tests. A cloze test of comprehension needs comprehension-demanding gaps. If, for example, the target ability is information integration across sentences, then gaps should be selected to require readers to attend to information from more than one sentence (see also Greene, 2001). Perhaps the most obvious way of doing so is to select gaps that require inferences across sentences to be filled correctly, as in the example below. This and the following examples illustrate the cloze test developed for the present study.

Your skin may become drier during long flights. So you might wish to bring moisturizing cream. Flights may cause other kinds of _______ [dryness—inconveniences—hazards?]. For example, many passengers get blocked ears or even a ruptured eardrum.

Although dryness would fit with the first mentioned inconvenience, it is hardly a hazard like a ruptured eardrum. So the more abstract term, inconveniences, provides the better fit to both examples given in the text. The type of inference required in the example is an induction, that is, the activation of a superordinate concept from more specific examples. But, of course, it is possible to create gaps that require other types of inferences across sentences. Thus, in the following example, a causal inference must be drawn to reinsert the missing word:

A morning in December, the dust man was standing on the platform at the back of the dust cart. It was very icy. He [ran—slipped—tripped] and fell to the ground.

The concept of cohesive ties may be useful for studies of comprehension across sentence boundaries. A cohesive tie can be defined as a text device that makes texts cohere across sentences (Halliday & Hasan, 1976). For example, pronominal references may constitute semantic links between sentences: “At 23:33 Helen and Peter arrived at the station. Christine was there to welcome [her—them—him].” Them refers to Helen and Peter across the sentence boundary. Hence, if pronouns are turned into gaps, the reader may have to refer to the contents of the previous sentence to infer the missing words (see also Bridge & Winograd, 1982). Resolving a pronominal reference can be categorized as an anaphoric bridging inference (Singer, 2007).

Lexical cohesion is another type of cohesive tie, in which lexical items refer to the same entity or to parts or aspects of the same entity, for example, “My surgeon got angry when he found out that after my first operation I had driven home on my own. So after my second operation, my [consultant—patient—soldiers] insisted that I should be accompanied on my way back.” It is likely that surgeon and consultant (or doctor in American English) refer to the same person, in which case the lexical reference creates a cohesive tie between the sentences.

Conjunction is a third type of cohesive tie which may be used as a probe in a cloze test. For example, “Helen and Peter preferred to travel in the evening. [Therefore—Anyway—Furthermore], they decided to leave the town at 20:45.” To choose the correct word, the logical relationship between the sentences must be inferred, that is, whether it is causal, adversative or additive.

Several studies using experimental cloze tests have found indications that less skilled comprehenders are poorer than good comprehenders at reinserting words belonging to cohesive ties in texts (e.g., Bridge & Winograd, 1982; Cain, Patson, & Andrews, 2005, study 2; Geva & Ryan, 1985).

In addition, these studies have verified in various ways that the problems of the less skilled comprehenders reside at levels of comprehension beyond single words and sentences. For example, Bridge and Winograd (1982) asked good and poor readers in grade 9 to think aloud while they reinserted cohesive items in cloze tasks. The results indicated that the good readers were more likely than poor readers to take information from more than one sentence into account when needed for reinserting the correct tie. Similarly, Cain et al. (2005, study 2) found that 8- and 9-year-old poor comprehenders were less likely than good comprehenders to reinsert the correct conjunctions between sentences. More specifically, Cain et al. found that the poor comprehenders’ low performance on the cloze task with conjunctions was not explained by poor decoding or sentence-level comprehension.

In sum, these studies indicate that carefully constructed cloze tests can be sensitive to differences in children’s comprehension skills. These results are in line with Spear-Swerling’s (2004) finding that language comprehension accounted for significant independent variation in children’s performance on the Degrees of Reading Power test even after decoding skills were controlled. Furthermore, in Spear-Swerling’s study, children’s performance on this cloze test correlated at .76 with their performance on a reading comprehension test that used a question-answering format. However, the total time required for administration of the Degrees of Reading Power cloze test is 75 min. On the basis of the existing research, it is uncertain whether a similar but quicker cloze test with comprehension-demanding gaps can also provide an adequate measure of reading comprehension.

The aim of the present study was to extend earlier research on cloze tests as a measure of reading comprehension into adult reading and to investigate whether a quick cloze test can be as reliable and valid as a more time-consuming question-answering test. More specifically, the study was concerned with the following research questions: Is it possible to construct a quick cloze comprehension test that can match a more time-consuming standard question-answering comprehension test for adults in terms of (a) reliability, (b) sensitivity to text comprehension beyond decoding ability and simple vocabulary size, and (c) external validity? We hypothesized that this was possible.

To answer these questions, we constructed a very quick cloze test with comprehension-demanding gaps and administered this test along with a standard question-answering test as well as measures of decoding and vocabulary to a relatively large sample of adults with low educational levels.

Method

Participants

The participants in the study were 204 adults and young adults with Danish as their first language. All were enrolled in courses for adults with low educational levels. The gender ratio was 45/55 female-to-male. The average age was 34.1 years (standard deviation = 12.8; range 17-70). One hundred and thirty participants were enrolled in one of two voluntary reading courses, either:

- remedial teaching of reading and writing for dyslexic adults (n = 81), or

- literacy classes for adults with insufficient reading and writing skills (n = 49).

The remaining 74 participants were enrolled in types of education that traditionally attract a significant number of adults with low educational levels and poorly developed reading skills:

- vocational training, for example, courses for doormen and roofers (n = 52),

- courses in Danish for adults preparing for a standard school-leaving exam (n = 22).

Materials

All materials were in Danish which, like English, is considered a deep orthography with a complex syllable structure (Seymour, Aro, & Erskine, 2003).

Experimental cloze test

A cloze test was developed for this study. It comprised 10 texts of a total of 1,230 words (range 42-335). Five of the texts were narratives, such as a story from a trade union magazine about a work accident. The other five texts were expository, for example, instructions in case of fire. A total of 41 gaps were created by removing one word for each gap. At each gap, participants were asked to select the best-fitting word from a selection of four. Twenty-six of the gaps were created by removing a part of a cohesive tie (see the introduction). These ties were 13 pronominal references, four lexical ties, and nine conjunctions. The remaining 15 items were gaps that could be refilled by means of other inferences using information beyond the present sentence, such as inductive and causal inferences (see the examples in the introduction). Participants were given 10 min to read the texts and refill as many gaps as possible. The score was the number of correctly answered items.
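The scoring rule described above can be sketched as follows. The function name, the use of `None` for gaps a participant did not reach, and the example items are illustrative assumptions. The accuracy measure (percent correct among attempted items) is also an assumption: the article later refers to percent correct without stating the denominator.

```python
def score_cloze(responses, answer_key):
    """Return (number correct, percent correct among attempted items).

    `responses` holds one chosen option per gap, or None for a gap the
    participant did not reach within the time limit.
    """
    if len(responses) != len(answer_key):
        raise ValueError("one response slot per gap is required")
    attempted = [(r, k) for r, k in zip(responses, answer_key) if r is not None]
    correct = sum(1 for r, k in attempted if r == k)
    # Accuracy ignores unreached gaps, so it is less dependent on reading speed.
    accuracy = 100.0 * correct / len(attempted) if attempted else 0.0
    return correct, accuracy

# Hypothetical four-gap example built from the sample items in the introduction.
answer_key = ["them", "slipped", "inconveniences", "Therefore"]
responses = ["them", "tripped", "inconveniences", None]
n_correct, pct_correct = score_cloze(responses, answer_key)
print(n_correct, round(pct_correct, 1))
```

The number-correct score rewards both speed and accuracy under the 10-min limit, whereas the accuracy score in this sketch reflects only the gaps that were actually attempted.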

Standard question-answering test of reading comprehension

The question-answering measure of reading comprehension selected for the study was a traditional test with relatively long texts and questions with multiple-choice answers (Arnbak, 2001). An early version of the test was developed for a national study of adult literacy (Elbro, Møller, & Nielsen, 1995), and the test is similar to those used in international studies of adult literacy, such as the Second International Adult Literacy Survey (SIALS) and the Programme for the International Assessment of Adult Competencies (PIAAC). The test comprises three subtests: schematized texts (documents), expository texts, and narrative texts (including newspaper stories). For the present study, only the two last-mentioned subtests with coherent prose texts were selected, as they are the more demanding with respect to reading comprehension. In contrast, the subtest with schematized texts mainly requires that readers locate specific pieces of information in lists and tables.

The two subtests with coherent prose comprise 11 texts of a total of approximately 2,750 words (range 76-543) with 40 questions. The narrative texts are mainly newspaper stories reporting events, for example, a story about the failed escape of two bank robbers in a stolen car. The expository texts describe various topics and/or present guidelines, for example, the recommended weekly maximum amount of alcohol for males and females. Some of the questions require retrieval of information explicitly stated in the text whereas other questions require inferences based on one or more pieces of information from the text.

The test has a long track record. Reliability for the complete test has been reported at .92 with each of the three subtests at or above .82 (Cronbach’s alpha, Arnbak & Elbro, 1999). The external concurrent and predictive validity is well documented: test performance has been found to correlate well with teacher ratings (r = .53-.62; Arnbak & Elbro, 1999) and with exam marks (r = .60; Elbro & Arnbak, 2002).

The participants were given 30 min to read the texts and to answer as many questions as possible. The score was the number of correctly answered items.

Vocabulary test

With this measure of receptive vocabulary, the participant listens to spoken target words and is asked to select the corresponding picture from a group of four alternatives (similar to the Peabody Picture Vocabulary Test). The present measure comprised 37 items and was a revised version of a standard vocabulary test (Danish Ministry of Education, 2006), which has previously yielded a reliability estimate of .84 in native Danish-speaking adults (Elbro, 2010).

Tests of word decoding

Decoding ability was measured in two tasks, one of phonological coding and one of orthographical coding. In the phonological coding task, participants are asked to select the nonword that sounds like a real word from four alternatives (e.g., sharf, slout, skore, sof, where skore is the expected answer because it sounds like the real word score). The present measure was a paper-and-pencil version of a computer-administered test (Danish Ministry of Education, 2006), which has previously yielded a reliability estimate of .90 in native Danish-speaking adults (Elbro, 2010). The participants were allowed 5 min to attempt as many of the 38 items as possible. In the orthographical coding task, participants are asked to select the one correctly spelled word among four similar-sounding alternatives (e.g., reine, rain, rane, raign). The participants were given 3 min to attempt as many of the 52 items as possible. This test was developed for the present study.

Raw scores with each of the two measures were normalized and averaged into one decoding score.
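A composite of this kind can be sketched as below. The article does not state how the scores were normalized; the sketch assumes conversion to z scores (sample mean 0, sample standard deviation 1) before averaging, and the function names and example scores are hypothetical.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize raw scores to mean 0 and (sample) standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def decoding_composite(phonological, orthographic):
    """Average the two standardized decoding scores for each participant."""
    return [(p + o) / 2
            for p, o in zip(z_scores(phonological), z_scores(orthographic))]

# Hypothetical raw scores for three participants.
phon = [5, 10, 15]    # phonological coding (max 38)
orth = [20, 35, 50]   # orthographic coding (max 52)
composite = decoding_composite(phon, orth)
print(composite)
```

Standardizing first gives the two tasks equal weight in the composite even though they differ in number of items and in spread.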

Questionnaire

A questionnaire was administered to the participants with a view to an assessment of the external validity of the reading measures. The participants were asked about their age, primary language, and educational level. Two further questions were concerned with experiences of reading or writing problems in education or at work. One read, “Has reading or spelling caused problems for you in relation to education?” The other was, “Has reading or spelling caused problems for you in relation to your job?”

Procedure

All tests were administered in group settings by trained experimenters who read the questionnaire and the instructions aloud to the participants. The participants worked individually under the supervision of the experimenter. The materials were presented in the same order to all participants: questionnaire, cloze comprehension, orthographical coding, vocabulary, phonological coding, standard question-answering reading comprehension.

Results

All score distributions were approximately normal, with values of skewness within an acceptable range (±2). Descriptive statistics of the measures are presented in Table 1.

The cloze comprehension measure had an internal homogeneity (Cronbach’s alpha reliability) of .92, which was on a par with the standard question-answering test of reading comprehension (.89) and the two word decoding tests. All of the reading measures were timed measures, so scores combined accuracy and speed. This combination may have provided inflated estimates of reliability because all the unseen (and thus not correctly solved) items occur together toward the end of the test. Consequently, additional analyses of reliability were run exclusively with the first 25 items of the cloze test, which were attempted by all participants. Cronbach’s alpha was .80 for these items. Similarly, reliability for the first 37 items of the standard reading comprehension test attempted by all participants was .84.
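For readers who want to reproduce this kind of reliability estimate, Cronbach’s alpha can be computed from an item-by-participant score matrix as sketched below. The data are a made-up toy example, not the study data, and the function name is an assumption.

```python
from statistics import variance

def cronbach_alpha(item_scores):
    """Cronbach's alpha; one row per participant, one column per item."""
    k = len(item_scores[0])
    # Variance of each item across participants.
    item_vars = [variance([row[i] for row in item_scores]) for i in range(k)]
    # Variance of the participants' total scores.
    total_var = variance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Toy data: four participants, three items scored 1 (correct) or 0.
toy = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
alpha = cronbach_alpha(toy)
print(round(alpha, 2))  # 0.75
```

Restricting the estimate to items attempted by everyone amounts to slicing the matrix first, for example `[row[:25] for row in item_scores]`, so that unreached items at the end of a timed test do not inflate the coefficient.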

The correlation between the two decoding measures (phonological coding and orthographical coding) was .55 (p < .001). These two measures were normalized and averaged into one decoding score which was used in subsequent analyses. The simple correlations between each of the measures and the age of the participants in the study are presented below the diagonal in Table 2. Age correlated positively and significantly with vocabulary (.27; p < .001), but negatively with decoding (–.20; p < .001), and no significant correlations were found between age and the measures of reading comprehension. The correlation between the cloze comprehension test and the standard question-answering test of reading comprehension was high (.84), especially when considering the reliability of the measures. This strong correlation suggests that the cloze comprehension test has very good concurrent validity. It is also noteworthy that the two measures of reading comprehension correlate at about the same level with receptive vocabulary. This is an indication that the cloze comprehension test did indeed pick up variation in vocabulary, and not only in decoding. Finally, the cloze comprehension test was significantly more sensitive to decoding (r = .73) than the question-answering comprehension test (r = .63) was (the difference between the two correlations was significant, t(201) = –3.67; p < .001). This difference was explained by a higher sensitivity to reading speed of the cloze test than of the question-answering test. Thus, the correlations between word decoding and comprehension accuracy (percent correct) were quite similar, .45 and .43 for the cloze test and the question-answering test, respectively. Partial correlations controlling for age are presented above the diagonal in Table 2. None of the partial correlations differed significantly from the simple correlations.
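The comparison of the two decoding correlations involves dependent correlations, because both are computed in the same sample and share the decoding variable. The article does not name the test that was used. The sketch below shows one standard option, Hotelling’s t test with N − 3 degrees of freedom, applied to the rounded coefficients from Table 2; the function name is an assumption.

```python
from math import sqrt

def hotelling_t(r_xy, r_xz, r_yz, n):
    """t statistic (df = n - 3) for H0: corr(x, y) == corr(x, z) in one sample."""
    # Determinant of the 3 x 3 correlation matrix of x, y, and z.
    det = 1 - r_xy**2 - r_xz**2 - r_yz**2 + 2 * r_xy * r_xz * r_yz
    return (r_xy - r_xz) * sqrt((n - 3) * (1 + r_yz) / (2 * det))

# x = word decoding, y = Q-A comprehension, z = cloze comprehension.
t = hotelling_t(r_xy=0.63, r_xz=0.73, r_yz=0.84, n=204)
print(round(t, 2))  # about -3.67 with df = 201
```

With these rounded inputs the statistic comes out close to the reported value. Williams’s modification of the same test is often recommended as well and gives a very similar result here.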

Table 2.

Correlation Coefficients Between the Variables.

Measure	1	2	3	4
1. Cloze comprehension	—	.84**	.61**	.73**
2. Q-A comprehension	.84**	—	.66**	.62**
3. Vocabulary	.57**	.60**	—	.39**
4. Word decoding	.73**	.63**	.32**	—
5. Age	−.09	−.11	.27**	−.20**

Note. Simple correlations are presented below the diagonal, partial correlations controlling for age above the diagonal.

**p < .001.
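The partial correlations above the diagonal can be approximated from the simple correlations below it with the standard first-order partial correlation formula. The sketch uses the rounded values from Table 2, so results may differ from the tabled values in the second decimal; the function name is an assumption.

```python
from math import sqrt

def partial_corr(r_xy, r_xz, r_yz):
    """Correlation between x and y with z partialled out of both."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz**2) * (1 - r_yz**2))

# x = cloze comprehension, y = Q-A comprehension, z = age.
r_cloze_qa_given_age = partial_corr(r_xy=0.84, r_xz=-0.09, r_yz=-0.11)
print(round(r_cloze_qa_given_age, 2))  # 0.84
```

Because age is only weakly related to both comprehension measures, partialling it out leaves their correlation essentially unchanged.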

Given the high correlations between decoding and both of the efficiency measures of comprehension, it was important to assess the possible contribution of decoding to the correlation between the two reading comprehension tests. This was done in a hierarchical regression analysis with the standard question-answering comprehension measure as the dependent variable. The results are displayed in Table 3.

Table 3.

Summary of Results From Hierarchical Regression Analyses on a Standard Question-Answering Measure of Reading Comprehension (Standard Q-A Comprehension). Standardized Regression Coefficients (β’s) Are Provided for the Final Models.

Step #	Measure	R²	ΔR²	F change	β
1	Word decoding	.39	.39	129.2**	.07
2	Vocabulary	.57	.18	82.2**	.20**
3	Cloze comprehension	.72	.16	112.4**	.67**
1	Age	.01	.01	2.3	−.11*
2	Word decoding	.39	.38	120.9**	.04
3	Vocabulary	.59	.20	94.6**	.25**
4	Cloze comprehension	.74	.15	108.9**	.66**

After controlling for decoding and vocabulary, cloze comprehension still accounted for a sizeable 16% of the variance in question-answering comprehension (Table 3). It is also noteworthy that the best fitting final model in the first analysis (at step 3) did not contain a significant, unique contribution from decoding, whereas the standardized coefficient for cloze comprehension was a highly significant .67. Hence, the strong simple bivariate correlation between the two measures of comprehension (Table 2) was not explained away by decoding or vocabulary.
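The F change values in Table 3 follow from the R² increments in the usual way for hierarchical regression. The sketch below shows the computation; it assumes that all 204 participants entered the analysis, which the article does not state explicitly, and it works from the rounded R² values, so results are approximate. The function name is an assumption.

```python
def f_change(r2_reduced, r2_full, n, k_full, k_added=1):
    """F statistic for the R-squared increase when k_added predictors are added.

    n is the sample size and k_full the number of predictors in the larger model.
    """
    df_residual = n - k_full - 1
    return ((r2_full - r2_reduced) / k_added) / ((1 - r2_full) / df_residual)

# Step 1 of the first analysis: word decoding alone, R-squared = .39.
f_step1 = f_change(r2_reduced=0.0, r2_full=0.39, n=204, k_full=1)
print(round(f_step1, 1))
```

For the first step this gives a value of about 129, close to the tabled 129.2. Later steps can deviate by a few units because the increments in R² are rounded to two decimals.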

To investigate whether the results were confounded by variance in age, an additional hierarchical regression analysis was conducted with age entered in the first step. However, as shown in Table 3, the overall pattern did not change. This was to be expected because, although age was correlated with vocabulary, it was not significantly correlated with reading comprehension (Table 2). Age did not affect any of the subsequent analyses either.

The regression analysis on question-answering comprehension showed that vocabulary contributed independent variance to comprehension when decoding was controlled for. To see whether the same was true for cloze comprehension, another multiple regression analysis was conducted. This time cloze comprehension was the dependent variable. The results are displayed in Table 4.

Table 4.

Summary of Results From a Hierarchical Regression Analysis on Cloze Comprehension. Standardized Regression Coefficients (β’s) Are Provided for the Final Model (Step 3).

Step #	Measure	R²	ΔR²	F change	β
1	Word decoding	.54	.54	231.1**	.36**
2	Vocabulary	.66	.12	68.7**	.13*
3	Q-A comprehension	.78	.12	111.5**	.54**

*p < .01. **p < .001.

Two results are noteworthy from the regression analysis on cloze comprehension. First, vocabulary did contribute to the variance in cloze comprehension after controlling for decoding. Second, word decoding maintained a significant regression coefficient in the final model (step 3) in which question-answering comprehension was also entered. This result corroborates the pattern seen in the simple correlations (Table 2)—that word decoding may influence the cloze comprehension measure slightly more than it influences the standard question-answering measure of comprehension.

The final research question concerned the external validity of the cloze comprehension measure. About half (105/204) of the adults reported being affected by poor reading and/or writing in their education and/or at work. The individual responses (difficulties versus no difficulties) were subjected to a series of hierarchical logistic regression analyses with the individual reading measures as independent variables. The aim was to explore how well reported difficulties were accounted for by the reading measures. The results are displayed in Table 5.

Table 5.

Summary of Results From Hierarchical Logistic Regression Analyses on Experienced Reading Handicaps in Education or at Work. Nagelkerke Estimates of the Total Amount of Variance (R²) Accounted for Are Provided in the Last Column.

Step #	Measure	χ²	Δχ²	R²
1	Word decoding	66.8	66.8**	.38
2	Vocabulary	66.9	0.0	.38
3	Q-A comprehension	68.4	1.5	.39
4	Cloze comprehension	68.6	0.2	.39
1	Q-A comprehension	33.1	33.1**	.20
2	Cloze comprehension	39.8	6.6*	.24
1	Cloze comprehension	38.8	38.8**	.23
2	Q-A comprehension	39.8	1.0	.24

*p < .01. **p < .001.

The results gave a strong indication that word decoding is the reading skill that influenced the participants’ experience of having reading and writing problems. When decoding was entered first, none of the other measures predicted further variation in experienced reading problems. The second and third analyses indicated that when question-answering comprehension was entered first, cloze comprehension continued to predict further variance, whereas question-answering comprehension did not predict further variance once cloze comprehension was controlled for. This pattern is in accordance with the above finding that decoding is slightly more important to cloze comprehension than to question-answering comprehension.
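Each step in Table 5 adds a single predictor, so each Δχ² can be referred to a chi-square distribution with 1 degree of freedom. The sketch below computes the corresponding p value using only the Python standard library; the function name is an assumption, and the inputs are the rounded increments from the table.

```python
from math import erfc, sqrt

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    # If Z is standard normal, Z**2 is chi-square(1), so P(Z**2 > x) = P(|Z| > sqrt(x)).
    return erfc(sqrt(x / 2))

# Increment when cloze is added after Q-A comprehension, and vice versa.
p_cloze_after_qa = chi2_sf_1df(6.6)
p_qa_after_cloze = chi2_sf_1df(1.0)
print(round(p_cloze_after_qa, 3), round(p_qa_after_cloze, 3))
```

The increment of 1.0 corresponds to a p value of about .32, clearly nonsignificant, whereas the increment of 6.6 corresponds to a p value of about .01. The rounding of the tabled increment matters for exactly which side of .01 the latter falls on.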

Discussion

The results from the study suggest that a quick (10 min) cloze comprehension test can be designed to match a more time-consuming (30 min) question-answering comprehension test in several important areas. A cloze test can be developed to correlate closely (.84) with a standard question-answering measure of reading comprehension—and to provide results on a par with a standard comprehension measure in terms of reliability and validity. The present experimental cloze test was more closely correlated with a standard question-answering comprehension test than previous cloze tests with fixed-ratio gaps have been reported to be (correlations of about .6, see the introduction). Indeed, the correlation between the experimental cloze test and the question-answering test was at least as strong as the correlation (.76) reported from an earlier study using a more time-consuming cloze test with comprehension-demanding gaps (Spear-Swerling, 2004). The experimental cloze comprehension measure was found to be as reliable as the standard question-answering measure of reading comprehension, and as reliable as standard measures of word decoding. In particular, the cloze test was sensitive to text integration and inferences beyond word decoding, vocabulary and single sentence comprehension. This conclusion rests on two arguments: First, the gaps in the cloze test were selected to require that information is brought together from more than one sentence (see the introduction). Second, given that the traditional question-answering comprehension test measures comprehension beyond single word knowledge and decoding, the shared variance with the cloze test is likely to reflect such comprehension.

Previous research indicates that many published cloze tests are highly sensitive to decoding and word-level processes, and not very sensitive to higher-order comprehension processes (e.g., Keenan et al., 2008; Nation & Snowling, 1997). The present study is in line with previous studies in the sense that it demonstrated a clear correlation between the cloze comprehension test and word decoding. Indeed, the cloze test was somewhat more sensitive than the question-answering comprehension test to individual differences in decoding skill. However, the reason for this difference may be that the cloze test was more sensitive to reading speed than the question-answering test was. There was no difference between the two tests when comprehension accuracy was measured.

The experimental cloze comprehension test was sensitive to individual differences in comprehension beyond the word level. In this respect, the present study replicates previous results from studies with comprehension-demanding cloze tests with children (e.g., Cain et al., 2005; Spear-Swerling, 2004). The present study extended previous results to adults with limited post-compulsory education and demonstrated how a cloze comprehension test can be quick and easy to use, yet at the same time provide a measure that is rather similar to that of a standard question-answering comprehension test with very good reliability and external concurrent and predictive validity.

The study also extended previous studies by including a separate estimate of the external criterion-related validity of the measures of reading. These results indicated that the cloze measure had stronger external validity than the standard question-answering comprehension measure. The cloze measure gave the better fit to participants’ self-reported experiences of reading and writing handicaps in education and/or at work. So in cases where this kind of external validity plays a role, cloze comprehension may be preferred over question-answering comprehension. However, this interpretation should be tempered by the finding that a simple measure of decoding was by far the best correlate of the adults’ experienced reading handicaps. Decoding may be the limiting reading ability in this group, whereas higher-order comprehension abilities may pose limitations in other groups of adults. Unfortunately, tests of reading comprehension are rarely subjected to the relevant tests of external criterion-based validity, so comparisons are difficult to make.

It is a possible weakness of the present study that it employed only one “gold standard” test of reading comprehension to which the cloze comprehension test was compared. It would have been a strength if a separate measure of listening comprehension and/or other measures of reading comprehension had been included. However, it is worth noting that the standard question-answering test employed in the study has a long track record as a measure of comprehension beyond the word level (see the methods section). It employs long, authentic texts and questions that reflect the standard uses of the texts, and it includes inference-demanding questions based on information from different parts of the texts.

Similarly, more and even stronger control measures could have shed further light on the exact cognitive components that are activated by the two comprehension measures. It is probably fair to assume, though, that decoding was properly controlled for with two sensitive measures comprising a total of 90 items. However, inclusion of a measure of listening comprehension would have been preferable. It might have provided evidence that the unique variance shared between the two measures of comprehension was also shared with oral comprehension ability.

A limitation of the present study is inherent in the low level of education of the participants. The results may not generalize to other, better educated segments of the adult population. Therefore, the results need cross-validation in other samples that are more representative of the general population.

In conclusion, a quick cloze test of reading comprehension is feasible. It need not be entirely dependent on word decoding and other low-level processes. The cloze test constructed for this study serves as an illustrative example. However, the present study is preliminary and further research is needed. We suggest that future studies should systematically vary cloze tasks to provide more specific directives for future test construction. New cloze tasks should be assessed with individuals with a wider range of educational backgrounds and reading abilities.

Footnotes

Author’s Note

Portions of these data were presented at the sixteenth annual meeting of The Society for the Scientific Study of Reading, 2009.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was supported by a grant from The Danish Ministry of Education to The Centre for Reading Research.

Bios

Anna S. Gellert is a postdoctoral researcher at University of Copenhagen, Denmark. Her research interests include the relationship between skills in reading and language as well as the development of measures for assessing these skills.

Carsten Elbro is a professor of applied linguistics at University of Copenhagen, Denmark. His current research interests include reading, reading development and reading difficulties.

References

Alderson, J. C. (1979). The Cloze procedure and proficiency in English as a foreign language. TESOL Quarterly, 13, 219-227.

Arnbak (2001). Læsetekster for unge og voksne [Texts for young adults and adults]. Copenhagen, Denmark: Dansk Psykologisk Forlag.

Arnbak & Elbro (1999). Læsning, læsekurser og uddannelse. Om unge og voksnes funktionelle læsefærdighed i uddannelse og på læsekurser vurderet med et nyt materiale [Reading, reading courses, and education. On functional literacy in young adults and adults in education and in reading courses measured by a new test]. Copenhagen, Denmark: Centre for Reading Research and Danish Ministry of Education.

Austen (1813). Pride and prejudice. London, UK: T. Egerton, Whitehall.

Bormuth, J. R. (1967). Cloze readability procedure (CSEIP Occasional Report 1). Los Angeles: University of California.

Bridge, C. A., & Winograd, P. N. (1982). Readers’ awareness of cohesive relationships during Cloze comprehension. Journal of Reading Behavior, 14, 299-312.

Cain, Patson, & Andrews (2005). Age- and ability-related differences in young readers’ use of conjunctions. Journal of Child Language, 32, 877-892.

Carlisle & Rice (2004). Assessment of reading comprehension. In Stone, Silliman, Ehren, & Apel (Eds.), Handbook of language and literacy (pp. 521-555). New York, NY: Guilford.

Cutting, L. E., & Scarborough (2006). Prediction of reading comprehension: Relative contributions of word recognition, language proficiency, and other cognitive skills can depend on how comprehension is measured. Scientific Studies of Reading, 10, 277-299.

Danish Ministry of Education (2006). Visitationstest til ordblindeundervisning for voksne [Screening of dyslexia among adults]. Copenhagen, Denmark: Author.

Elbro (2010). Dyslexia as disability or handicap: When does vocabulary matter? Journal of Learning Disabilities, 43, 469-478.

Elbro & Arnbak (2002). Components of reading comprehension as predictors of educational achievement. In Hjelmquist & von Euler (Eds.), Dyslexia and literacy (pp. 69-83). London, UK: Whurr.

Elbro, Møller, & Nielsen, E. M. (1995). Functional reading difficulties in Denmark. A study of adult reading of common texts. Reading and Writing, 7, 257-276.

Farr & Carey, R. F. (1986). Reading: What can be measured? Newark, DE: International Reading Association.

Francis, D. J., Fletcher, J. M., Catts, H. W., & Tomblin, J. B. (2005). Dimensions affecting the assessment of reading comprehension. In S. G. Paris & S. A. Stahl (Eds.), Children’s reading comprehension and assessment (pp. 369-394). Mahwah, NJ: Erlbaum.

Geva & Ryan, E. B. (1985). Use of conjunctions in expository texts by skilled and less skilled readers. Journal of Reading Behavior, 17, 331-346.

Greene, B. B. (2001). Testing reading comprehension of theoretical discourse with Cloze. Journal of Research in Reading, 24, 82-98.

Hagley (1987). Suffolk reading scale. Windsor, UK: NFER-Nelson.

Halliday, M. A. K., & Hasan (1976). Cohesion in English. Hong Kong: Longman.

Karlsen & Gardner, E. F. (1995). Stanford diagnostic reading test (4th ed.). San Antonio, TX: Psychological Corporation.

Keenan, J. M., Betjemann, R. S., & Olson, R. K. (2008). Reading comprehension tests vary in the skills they assess: Differential dependence on decoding and oral comprehension. Scientific Studies of Reading, 12, 281-300.

Kintsch & Yarbrough, J. C. (1982). Role of rhetorical structure in text comprehension. Journal of Educational Psychology, 74, 828-834.

Nation & Snowling (1997). Assessing reading difficulties: The validity and utility of current measures of reading skill. British Journal of Educational Psychology, 67, 359-370.

Oller, J. W., & Jonz (1994). Why Cloze procedure? In J. W. Oller & Jonz (Eds.), Cloze and coherence (pp. 1-20). Lewisburg, PA: Associated University Press.

Pearson, P. D., & Hamm (2005). The assessment of reading comprehension: A review of practices, past, present, and future. In S. G. Paris & S. A. Stahl (Eds.), Children’s reading comprehension and assessment (pp. 13-69). Mahwah, NJ: Erlbaum.

Seymour, P. H. K., Aro, & Erskine, J. M. (2003). Foundation literacy acquisition in European orthographies. British Journal of Psychology, 94, 143-174.

Shanahan & Kamil (1984). The relationship of concurrent and construct validities of Cloze. In J. A. Niles & L. A. Harris (Eds.), Changing perspectives on research in reading/language processing and instruction (pp. 252-256). Rochester, NY: National Reading Conference.

Shanahan, Kamil, & Tobin (1982). Cloze as a measure of intersentential comprehension. Reading Research Quarterly, 17, 229-255.

Singer (2007). Inference processing in discourse comprehension. In Gaskell (Ed.), Oxford handbook of psycholinguistics (pp. 343-359). Oxford, UK: Oxford University Press.

Spear-Swerling (2004). Fourth graders’ performance on a state-mandated assessment involving two different measures of reading comprehension. Reading Psychology, 25, 121-148.

Taylor, W. L. (1953). Cloze procedure: A new tool for measuring readability. Journalism Quarterly, 30, 415-433.

Touchstone (2001). The degrees of reading power program: DRP in brief. Brewster, NY: Touchstone Applied Science Associates.

Woodcock, R. W., McGrew, K. S., & Mather (2001). Woodcock–Johnson III tests of achievement. Itasca, IL: Riverside.